When developing a software package, an excellent practice is to give a comprehensive illustration of its methods using an existing experiment data package, annotation data, or data in ExperimentHub or AnnotationHub, or by submitting new data to those resources yourself.
If existing data is not available or applicable, or a new smaller dataset is needed for examples in the package, data can be included either as a separate data package (for larger amounts of data) or within the package (for smaller datasets).
The Bioconductor build system does not support git-lfs, so it is not currently an option for storing large data. Large datasets must be included through ExperimentHub.
Experiment data packages contain data specific to a particular analysis or experiment. They often accompany a software package for use in its examples and vignettes and, in general, are not updated regularly. If you need a general subset of data for workflows or examples, first check the AnnotationHub resource for available files (e.g., BAM, FASTA, BigWig, etc.).
Bioconductor strongly encourages creating an experiment data package that utilizes ExperimentHub or AnnotationHub (see Creating an Experiment Hub Package or Creating an Annotation Hub Package), but a traditional package that encapsulates the data is also acceptable.
See the Package Submission guidelines for submitting related packages.
Bioconductor strongly encourages the use of existing datasets, but if none are available, data can be included directly in the package for use in the examples found in man pages, vignettes, and tests of your package. This is a good reference by Hadley Wickham about data.
Some key points are summarized in the following sections.
data/ is exported to the user and readily available.
It is made available in an R session through the use of
It will require documentation concerning its creation and source information.
It is most often a .RData file created with save(), but other types are acceptable as well, see
Please remember to compress the data.
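As a minimal sketch of the points above, the following creates a small dataset, saves it as a compressed .RData file, and loads it back. The object name example_counts, the file name, and the temporary directory standing in for a package's data/ directory are all hypothetical; "xz" is one of the compression types save() supports and is chosen here as an assumption, not as a stated Bioconductor requirement.

```r
# Hypothetical small dataset; in a real package this would come from a
# documented source.
set.seed(1)
example_counts <- data.frame(
  gene  = paste0("gene", 1:100),
  count = rpois(100, lambda = 10)
)

# A temporary directory stands in for the package's data/ directory.
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
data_file <- file.path(data_dir, "example_counts.RData")

# Save with explicit compression.
save(example_counts, file = data_file, compress = "xz")

# Load into a fresh environment to confirm the round trip.
env <- new.env()
load(data_file, envir = env)
stopifnot(identical(env$example_counts, example_counts))
```

For files that already exist, tools::checkRdaFiles() reports the compression in use and tools::resaveRdaFiles() can re-compress them.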
It is often desirable to show a workflow which involves the parsing or loading of raw files. Bioconductor recommends finding existing raw data already provided in another package or the hubs.
However, if this is not applicable, raw data files should be included in the
Files of this type are often accessed utilizing
Bioconductor requires documentation on these files in an
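Files shipped inside an installed package are usually located at run time rather than through hard-coded paths. The sketch below assumes the common R convention of placing raw files under inst/extdata and locating them with system.file(); the package name myPackage and the file name sample.fasta are hypothetical, and the utils package is used only because it is always installed.

```r
# Locate a file that ships with an installed package.
# "utils" is used here only because it is always available.
desc_path <- system.file("DESCRIPTION", package = "utils")
stopifnot(nzchar(desc_path), file.exists(desc_path))

# In your own package the call might look like this (hypothetical names):
# raw_file <- system.file("extdata", "sample.fasta", package = "myPackage")

# system.file() returns an empty string when nothing matches, so check the
# result (or pass mustWork = TRUE to get an error instead).
not_found <- system.file("extdata", "no_such_file.txt", package = "utils")
stopifnot(identical(not_found, ""))
```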
Rarely, a package may require parsed data that is used internally but should not be exported to the user.
R/sysdata.rda is often the best place to include this type of data.
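A minimal sketch of creating such a file follows. The object internal_lookup and the temporary directory standing in for the package source tree are hypothetical. Objects stored in R/sysdata.rda are loaded into the package namespace and are not exported.

```r
# A temporary directory stands in for the package source tree.
pkg_r_dir <- file.path(tempdir(), "myPackage", "R")
dir.create(pkg_r_dir, recursive = TRUE, showWarnings = FALSE)

# Hypothetical internal lookup table used only by package functions.
internal_lookup <- c(A = "T", C = "G", G = "C", T = "A")

# Write it to R/sysdata.rda with compression.
sysdata_file <- file.path(pkg_r_dir, "sysdata.rda")
save(internal_lookup, file = sysdata_file, compress = "xz")

# Confirm the object can be read back unchanged.
env <- new.env()
load(sysdata_file, envir = env)
stopifnot(identical(env$internal_lookup, internal_lookup))
```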
Downloads of files and external data from the web should be avoided.
If it is necessary, at minimum the files should be cached.
See BiocFileCache, the Bioconductor-recommended package for caching files.
If a maintainer creates their own caching directory, it should use the standard location given by tools::R_user_dir(package, which="cache").
Downloading or writing any files to a user's home directory or working directory is not allowed.
Files should be cached as stated above, or written with tempdir()/tempfile() if they should not be persistent.
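The caching pattern can be sketched in base R as follows. This illustrates the idea only and is not the BiocFileCache API; the helper get_cached(), the package name myPackage, and the file names are hypothetical. The demonstration uses a temporary directory as the cache so that running it leaves nothing behind. tools::R_user_dir() requires R 4.0.0 or later.

```r
# Return the path to a cached file, creating it only when it is missing.
# `create` is a function that writes the file to the path it is given,
# for example a wrapper around download.file().
get_cached <- function(name, cache_dir, create) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, name)
  if (!file.exists(dest)) {
    create(dest)
  }
  dest
}

# In a package, the persistent cache location would be computed like this
# (hypothetical package name; this call only builds the path).
pkg_cache <- tools::R_user_dir("myPackage", which = "cache")

# Demonstration with a throwaway cache directory and a local "fetch" step.
demo_cache <- tempfile("demo-cache")
n_created <- 0
make_file <- function(dest) {
  n_created <<- n_created + 1
  write.csv(head(mtcars), dest)
}

first  <- get_cached("cars.csv", demo_cache, make_file)
second <- get_cached("cars.csv", demo_cache, make_file)  # served from cache
stopifnot(identical(first, second), n_created == 1)

# Non-persistent output goes to a temporary file, never to the home or
# working directory.
scratch <- tempfile(fileext = ".csv")
write.csv(head(mtcars), scratch)
```

Passing the fetch step as a function keeps the cache logic independent of how the file is obtained, which also makes it testable without network access.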