13 Package data
When developing a software package, an excellent practice is to give a comprehensive illustration of the methods in the package using an existing experiment data package, annotation data or data in the ExperimentHub or AnnotationHub, or submitting new data to those resources yourself.
If existing data is not available or applicable, or a new smaller dataset is needed for examples in the package, data can be included either as a separate data package (for larger amounts of data) or within the package (for smaller datasets).
The Bioconductor Build system does not support git-lfs. This is not a current option for storing large data. Large data sets must be included through the ExperimentHub. See additional information at Create A Hub Package.
13.1 Experiment Data Package
Experimental data packages contain data specific to a particular analysis or experiment. They often accompany a software package for use in the examples and vignettes and in general are not updated regularly. If you need a general subset of data for workflows or examples first check the AnnotationHub resource for available files (e.g., BAM, FASTA, BigWig, etc.) or ExperimentHub for more processed and predefined Bioconductor class structures.
Bioconductor strongly encourages creating an experiment data package that utilizes ExperimentHub or AnnotationHub (See Create A Hub Package) but a traditional package that encapsulates the data is also acceptable with pre-approval from the bioc-devel mailing list.
See the Package Submission guidelines for further submission procedures and if submitting related or circular dependent packages please read related package submission procedure
13.2 Adding Data to Existing Package
Bioconductor strongly encourages the use of existing datasets, but if not available data can be included directly in the package for use in the examples found in man pages, vignettes, and tests of your package. This is a good reference by Hadley Wickham about package data.
However, as mentioned in The DESCRIPTION file chapter,
Bioconductor does not encourage using LazyData: True
despite its
recommendation in this article.
Note. You might have to modify the usage section with @usage data("mydata")
Some key points are summarized in the following sections.
13.2.1 Exported Data and the data/
directory
Data in data/
is exported to the user and readily available.
It is made available in an session through the use of data()
.
It will require documentation concerning its creation and source information.
It is most often a .RData
file created with save()
but other types are acceptable as well, see ?data()
.
Please remember to compress the data.
For packages that need to use documented data within a function, we recommend
creating an environment for holding the data to avoid polluting the
global environment when using data()
.
data_env <- new.env(parent = emptyenv())
data("mydata", envir = data_env, package = "mypackage")
mydata <- data_env[["mydata"]]
Here we create a new and empty environment. Extract the data within the environment and finally extract the data object from the environment using double brackets.
13.2.2 Raw Data and the inst/extdata/
directory
It is often desirable to show a workflow which involves the parsing or loading of raw files. Bioconductor recommends finding existing raw data already provided in another package or the hubs as previously stated.
However, if this is not applicable, raw data files should be included in the
inst/extdata
directory. Files of these type are often accessed utilizing
system.file()
. Bioconductor requires documentation on these files in an
inst/scripts/
directory. See data documentation.
13.2.3 Internal data
Rarely, a package may require parsed data that is used internal but should not
be exported to the user. An R/sysdata.rda
is often the best place to include
this type of data. Proper documentation of the data will be required.
13.2.4 Other data
Downloads of files and external data from the web should be avoided.
If it is necessary, at minimum the files should be cached.
See BiocFileCache for Bioconductor recommended package
for caching of files. If a maintainer creates their own caching directory, it
should utilize standard caching directories tools::R_user_dir(package, which="cache")
. It is not allowed to download or write any files to a users
home directory, working directory, or installed package directory. Files should
be cached as stated above with BiocFileCache
(preferred) or R_user_dir
or
tempdir()/tempfile()
if files should not be persistent. Please also see
sections on querying web resources and file
downloads.