When developing a software package, an excellent practice is to give a comprehensive illustration of the methods in the package using an existing experiment data package, annotation data or data in the ExperimentHub or AnnotationHub, or submitting new data to those resources yourself.
If existing data is not available or applicable, or a new smaller dataset is needed for examples in the package, data can be included either as a separate data package (for larger amounts of data) or within the package (for smaller datasets).
The Bioconductor Build system does not support git-lfs. This is not a current option for storing large data. Large data sets must be included through the ExperimentHub. See additional information at Create A Hub Package.
Experimental data packages contain data specific to a particular analysis or experiment. They often accompany a software package for use in the examples and vignettes and in general are not updated regularly. If you need a general subset of data for workflows or examples first check the AnnotationHub resource for available files (e.g., BAM, FASTA, BigWig, etc.) or ExperimentHub for more processed and predefined Bioconductor class structures.
Bioconductor strongly encourages creating an experiment data package that utilizes ExperimentHub or AnnotationHub (See Create A Hub Package) but a traditional package that encapsulates the data is also acceptable with pre-approval from the bioc-devel mailing list.
Bioconductor strongly encourages the use of existing datasets, but if not available data can be included directly in the package for use in the examples found in man pages, vignettes, and tests of your package. This is a good reference by Hadley Wickham about package data.
However, as mentioned in The DESCRIPTION file chapter,
Bioconductor does not encourage using
LazyData: True despite its
recommendation in this article.
Note. You might have to modify the usage section with
Some key points are summarized in the following sections.
data/ is exported to the user and readily available.
It is made available in an session through the use of
It will require documentation concerning its creation and source information.
It is most often a
.RData file created with
save() but other types are acceptable as well, see
Please remember to compress the data.
For packages that need to use documented data within a function, we recommend
creating an environment for holding the data to avoid polluting the
global environment when using
Here we create a new and empty environment. Extract the data within the environment and finally extract the data object from the environment using double brackets.
It is often desirable to show a workflow which involves the parsing or loading of raw files. Bioconductor recommends finding existing raw data already provided in another package or the hubs as previously stated.
However, if this is not applicable, raw data files should be included in the
inst/extdata directory. Files of these type are often accessed utilizing
system.file(). Bioconductor requires documentation on these files in an
inst/script/ directory. See data documentation.
Rarely, a package may require parsed data that is used internal but should not
be exported to the user. An
R/sysdata.rda is often the best place to include
this type of data. Proper documentation of the data will be required.
Downloads of files and external data from the web should be avoided.
If it is necessary, at minimum the files should be cached.
See BiocFileCache for Bioconductor recommended package
for caching of files. If a maintainer creates their own caching directory, it
should utilize standard caching directories
tools::R_user_dir(package, which="cache"). It is not allowed to download or write any files to a users
home directory, working directory, or installed package directory. Files should
be cached as stated above with
BiocFileCache (preferred) or
tempdir()/tempfile() if files should not be persistent. Please also see
sections on querying web resources and file