10 Package data

When developing a software package, an excellent practice is to give a comprehensive illustration of the methods in the package using existing data: an experiment data package, annotation data, or data in the ExperimentHub or AnnotationHub. Alternatively, you can submit new data to those resources yourself.

If existing data is not available or applicable, or a new smaller dataset is needed for examples in the package, data can be included either as a separate data package (for larger amounts of data) or within the package (for smaller datasets).

The Bioconductor build system does not support git-lfs, so it is not currently an option for storing large data. Large datasets must be included through the ExperimentHub.

10.1 Experiment Data Package

Experiment data packages contain data specific to a particular analysis or experiment. They often accompany a software package for use in its examples and vignettes and, in general, are not updated regularly. If you need a general subset of data for workflows or examples, first check the AnnotationHub resource for available files (e.g., BAM, FASTA, BigWig, etc.).

Bioconductor strongly encourages creating an experiment data package that utilizes ExperimentHub or AnnotationHub (see Creating an Experiment Hub Package or Creating an Annotation Hub Package), but a traditional package that encapsulates the data is also acceptable.

See the Package Submission guidelines for submitting related packages.

10.2 Adding Data to Existing Package

Bioconductor strongly encourages the use of existing datasets, but if none are available, data can be included directly in the package for use in the examples found in the man pages, vignettes, and tests of your package. This is a good reference by Hadley Wickham about data.

However, as mentioned in The DESCRIPTION file chapter, Bioconductor does not encourage using LazyData: True despite its recommendation in this article.

Some key points are summarized in the following sections.

10.2.1 Exported Data and the data/ directory

Data in data/ is exported to the user and readily available. It is made available in a session through the use of data(). It requires documentation covering its creation and source. It is most often a .RData file created with save(), but other types are acceptable as well; see ?data().

Please remember to compress the data.
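As a minimal sketch of the points above, the following creates a compressed .rda file in a data/ directory. The package name myPackage, the object example_counts, and its values are made up for illustration, and a temporary directory stands in for the package source tree. The choice of compress = "xz" is an assumption; another method may give a smaller file for your data.

```r
# Temporary directory standing in for the source tree of a hypothetical package.
pkg_root <- file.path(tempdir(), "myPackage")
dir.create(file.path(pkg_root, "data"), recursive = TRUE, showWarnings = FALSE)

# Small made-up dataset, for illustration only.
example_counts <- data.frame(
  gene    = paste0("gene", 1:5),
  sampleA = c(10L, 0L, 3L, 25L, 7L),
  sampleB = c(12L, 1L, 0L, 30L, 5L)
)

# Save the object with explicit compression.
rda_path <- file.path(pkg_root, "data", "example_counts.rda")
save(example_counts, file = rda_path, compress = "xz")

# Report the size and compression type of the saved file.
rda_info <- tools::checkRdaFiles(rda_path)
print(rda_info)
```

Once such a package is installed, a user would load the object with data(example_counts, package = "myPackage"). tools::resaveRdaFiles() can be run on the data/ directory to re-save existing files with a different compression method.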

10.2.2 Raw Data and the inst/extdata/ directory

It is often desirable to show a workflow which involves the parsing or loading of raw files. Bioconductor recommends finding existing raw data already provided in another package or the hubs.

However, if this is not applicable, raw data files should be included in the inst/extdata directory. Files of this type are often accessed using system.file(). Bioconductor requires documentation on these files in an inst/script/ directory.
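The lookup with system.file() might be sketched as follows. The package name myPackage and the file example.csv are hypothetical. Note that the contents of inst/ are copied to the top level of the installed package, so the path passed to system.file() starts at extdata rather than inst/extdata.

```r
# Hypothetical package and file names, for illustration only.
path <- system.file("extdata", "example.csv", package = "myPackage")

# system.file() returns an empty string when the package or file is not found,
# so check the result before reading (or pass mustWork = TRUE to get an error).
if (nzchar(path)) {
  raw <- read.csv(path)
} else {
  message("example.csv not found; is myPackage installed?")
}

# The same mechanism applied to a file that ships with every R installation.
desc_path <- system.file("DESCRIPTION", package = "stats")
```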

10.2.3 Internal data

Rarely, a package may require parsed data that is used internally but should not be exported to the user. An R/sysdata.rda file is often the best place to include this type of data.
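A minimal sketch of creating such a file is shown below. The package name myPackage and the lookup table codon_lookup are made up for illustration, and a temporary directory stands in for the package source tree.

```r
# Temporary directory standing in for the source tree of a hypothetical package.
pkg_root <- file.path(tempdir(), "myPackage")
dir.create(file.path(pkg_root, "R"), recursive = TRUE, showWarnings = FALSE)

# Small illustrative lookup table that package functions would use internally.
codon_lookup <- c(ATG = "M", TGG = "W", TAA = "*")

# All internal objects go into a single file named sysdata.rda under R/.
sysdata_path <- file.path(pkg_root, "R", "sysdata.rda")
save(codon_lookup, file = sysdata_path, compress = "xz")
```

Objects stored this way are available to the package's own functions by name (here, codon_lookup) but are not exported to the user.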

10.2.4 Other data

Downloads of files and external data from the web should be avoided.

If it is necessary, at minimum the files should be cached. See BiocFileCache, the Bioconductor-recommended package for caching files. If a maintainer creates their own caching directory, it should use the standard caching location given by tools::R_user_dir(package, which="cache"). Packages must not download or write any files to a user's home directory or working directory. Files should be cached as stated above with BiocFileCache (preferred) or R_user_dir, or written with tempdir()/tempfile() if they should not be persistent.
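The caching options above could be sketched roughly as follows. The package name myPackage and both helper functions are hypothetical and are not part of any Bioconductor API. The first helper uses only base R and tools::R_user_dir() (available in R 4.0.0 and later). The second assumes the BiocFileCache package is installed and relies on bfcrpath() adding a resource that is not yet in the cache.

```r
# Hypothetical package name, for illustration only.
pkg <- "myPackage"

# Standard per-user cache location; this only computes the path.
cache_dir <- tools::R_user_dir(pkg, which = "cache")

# Hypothetical helper (base R only): return the cached copy of `url`,
# downloading it into `cache` only if it is not already there.
get_cached_file <- function(url, cache = cache_dir) {
  if (!dir.exists(cache)) {
    dir.create(cache, recursive = TRUE)
  }
  dest <- file.path(cache, basename(url))
  if (!file.exists(dest)) {
    utils::download.file(url, dest, mode = "wb")
  }
  dest
}

# Hypothetical helper (preferred route): let BiocFileCache manage the cache.
# Assumes BiocFileCache is installed; defined here but not called.
get_cached_file_bfc <- function(url, cache = cache_dir) {
  bfc <- BiocFileCache::BiocFileCache(cache, ask = FALSE)
  BiocFileCache::bfcrpath(bfc, url)
}

# For files that should not persist beyond the session, use a temporary path.
scratch_path <- tempfile(fileext = ".txt")
```

Neither helper writes to the home or working directory: persistent files go under the directory reported by R_user_dir(), and non-persistent files go under tempdir().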