5 Common Bioconductor Methods and Classes
5.1 Motivation
Bioconductor is a large and diverse project whose packages provide functionality for a wide range of biological data types and statistical methods. Many of these packages share a rich set of common classes and methods. It is therefore important to reuse existing data classes and methods so that packages remain interoperable with the rest of the Bioconductor software ecosystem. Central data representations allow users to readily integrate analysis workflows across multiple Bioconductor packages, providing a more seamless user experience.
Many classes in Bioconductor are implemented using the S4 object-oriented system in R. The S4 system is particularly well-suited for the representation of complex genomic data structures. The initial motivations to use S4 in Bioconductor were centered around its benefits over other systems such as S3. These benefits include, but are not limited to, formal class definitions, multiple inheritance, and validity checking.
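The three S4 benefits named above can be illustrated with base R alone. The following is a minimal sketch, not taken from any Bioconductor package: the class names `Interval`, `Labelled`, and `LabelledInterval` and the generic `intervalWidth()` are hypothetical and exist only for this example.

```r
library(methods)

# Formal class definition: slots are typed, and a validity function
# enforces constraints whenever an object is created with new().
# NOTE: "Interval" is a hypothetical class used only for illustration.
setClass(
  "Interval",
  slots = c(start = "integer", end = "integer"),
  validity = function(object) {
    if (length(object@start) != length(object@end))
      return("'start' and 'end' must have the same length")
    if (any(object@end < object@start))
      return("'end' must be >= 'start'")
    TRUE
  }
)

# A second, unrelated hypothetical class.
setClass("Labelled", slots = c(label = "character"))

# Multiple inheritance: the new class gets the slots of both parents.
setClass("LabelledInterval", contains = c("Interval", "Labelled"))

# A generic with a method defined on the parent class; it is
# inherited by LabelledInterval objects.
setGeneric("intervalWidth", function(x) standardGeneric("intervalWidth"))
setMethod("intervalWidth", "Interval", function(x) x@end - x@start + 1L)

iv <- new("LabelledInterval",
          start = c(1L, 10L), end = c(5L, 12L), label = "demo")
intervalWidth(iv)

# Validity checking: constructing an invalid object raises an error,
# which is captured here as a message string.
bad <- tryCatch(
  new("Interval", start = 5L, end = 1L),
  error = function(e) conditionMessage(e)
)
bad
```

An S3 class, by comparison, is only an attribute on an object, so nothing prevents a malformed object from being created and passed on to downstream code.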
Although Bioconductor promotes the reuse of existing S4 classes to represent genomic data, there are cases where new classes are needed for cutting-edge technologies. In such cases, new classes should ideally be developed through open discussion with the Bioconductor community.
5.1.1 Use Case: Importing data
For developers who import data into their package, it is important to know which packages and methods are available for reuse. The following list provides commonly used packages and their methods to import various data types:
- GTF, GFF, BED, BigWig, etc. – rtracklayer::import()
- VCF – VariantAnnotation::readVcf()
- SAM / BAM – Rsamtools::scanBam(), GenomicAlignments::readGAlignment*()
- FASTA – Biostrings::readDNAStringSet()
- FASTQ – ShortRead::readFastq()
- MS data (XML-based and mgf formats) – Spectra::Spectra(), Spectra::Spectra(source = MsBackendMgf::MsBackendMgf())
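As an illustration of the first entry in the list, the sketch below writes a small BED file to a temporary location and reads it back with `rtracklayer::import()`. It assumes the rtracklayer and GenomicRanges packages are installed; the file contents and feature names are made up for the example.

```r
library(rtracklayer)
library(GenomicRanges)

# Write a two-record BED file to a temporary path.
# BED columns: chrom, start (0-based), end, name, score, strand.
# The records below are invented for illustration.
bed_path <- tempfile(fileext = ".bed")
writeLines(
  c("chr1\t0\t100\tfeatureA\t0\t+",
    "chr2\t49\t200\tfeatureB\t0\t-"),
  bed_path
)

# import() picks the parser from the file extension and returns a
# GRanges object, so the result works directly with other
# Bioconductor packages.
gr <- import(bed_path)
gr

# BED starts are 0-based; GRanges uses 1-based, closed intervals,
# so the imported start positions are shifted by one.
BiocGenerics::start(gr)
BiocGenerics::end(gr)
```

Returning a standard `GRanges` rather than a custom data frame means a package does not need its own parsing or coordinate-conversion code.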
This list is not exhaustive, and developers are encouraged to initiate dialogue with other community members to identify additional packages and methods that may be useful for their specific use case. We acknowledge that class and method discoverability can be a challenge and we are working to improve this aspect of the Bioconductor project.
5.1.2 Common Classes
The following table, though certainly not exhaustive, provides select classes and constructor functions to represent genomic data:
| Data Type | Package and Function | Description |
|---|---|---|
| Rectangular feature by sample | SummarizedExperiment::SummarizedExperiment() | RNAseq count matrix, microarray, etc. |
| Genomic coordinates | GenomicRanges::GRanges() | 1-based, closed interval genomic coordinates |
| Genomic coordinates (multiple) | GenomicRanges::GRangesList() | Genomic coordinates from multiple samples |
| Ragged genomic coordinates | RaggedExperiment::RaggedExperiment() | Ragged (variable length) genomic coordinates |
| DNA/RNA/AA sequences | Biostrings::*StringSet() | DNA, RNA, or amino acid sequences |
| Gene sets | BiocSet::BiocSet(), GSEABase::GeneSet(), GSEABase::GeneSetCollection() | Collections of gene sets |
| Multi-omics data | MultiAssayExperiment::MultiAssayExperiment() | Data integrating multiple omics assays |
| Single cell data | SingleCellExperiment::SingleCellExperiment() | Single-cell expression and related data |
| Mass spec data | Spectra::Spectra() | Mass spectrometry data |
| File formats | BiocIO::BiocFile-class | Classes for interacting with various biological data file formats |
Search biocViews for other classes and methods that may be useful for your package.
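To make two of the table entries concrete, the following sketch builds a `GRanges` object and uses it as the row annotation of a `SummarizedExperiment`. It assumes the GenomicRanges and SummarizedExperiment packages are installed; the gene identifiers, coordinates, and counts are invented for illustration.

```r
library(GenomicRanges)
library(SummarizedExperiment)

# Genomic coordinates: 1-based, closed intervals with strand and an
# extra metadata column (gene_id). All values are made up.
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(101, 501, 1001), end = c(200, 650, 1200)),
  strand = c("+", "-", "+"),
  gene_id = c("geneA", "geneB", "geneC")
)
names(gr) <- gr$gene_id

# Rectangular feature-by-sample data: rows are features, columns
# are samples. The matrix is filled column by column.
counts <- matrix(
  c(10L, 0L, 25L,
    12L, 3L, 30L),
  nrow = 3,
  dimnames = list(names(gr), c("sample1", "sample2"))
)

# Sample-level annotation; row names must match the assay columns.
sample_info <- DataFrame(
  condition = c("control", "treated"),
  row.names = colnames(counts)
)

# The container keeps assay data, row ranges, and sample annotation
# synchronised when the object is subset.
se <- SummarizedExperiment(
  assays = list(counts = counts),
  rowRanges = gr,
  colData = sample_info
)

dim(se)
assay(se, "counts")
width(rowRanges(se))

# Subsetting by sample annotation subsets the assay as well.
treated <- se[, se$condition == "treated"]
```

Because the ranges travel with the assay, a downstream function can, for example, select features by chromosome without a separate lookup table.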
5.2 Package Submission Considerations
Bioconductor strives for interoperability across packages, and package submissions are generally not accepted unless they demonstrate such interoperability, typically by reusing existing Bioconductor classes and methods where appropriate. Submissions that introduce new classes or data structures must provide strong justification and clearly describe how they interoperate with existing Bioconductor infrastructure.
When the data do not conform to an existing data class, we recommend discussing the design of a new class with the Bioconductor community. This open discussion can take place on the main Bioconductor communication channels, such as the bioc-devel mailing list or the Bioconductor community Slack.
5.3 Package Implementations
The following packages are examples that reuse Bioconductor classes and methods:
| Package | Inherits classes and methods from |
|---|---|
| DESeq2 | SummarizedExperiment, GenomicRanges |
| GenomicAlignments | GenomicRanges, Rsamtools |
| VariantAnnotation | GenomicRanges, SummarizedExperiment, Rsamtools |
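The kind of reuse listed in the table can be sketched as a package-specific class that extends `SummarizedExperiment`. This is a minimal illustration under stated assumptions, not the implementation used by any of the packages above: the class `ExampleExperiment`, its `protocol` slot, and the constructor are hypothetical, and the SummarizedExperiment package is assumed to be installed.

```r
library(SummarizedExperiment)

# Hypothetical class: extends SummarizedExperiment with one extra
# slot. setClass() returns a generator function for the new class.
.ExampleExperiment <- setClass(
  "ExampleExperiment",
  contains = "SummarizedExperiment",
  slots = c(protocol = "character")
)

# Hypothetical constructor: builds the parent object first, then
# adds the extra slot.
ExampleExperiment <- function(..., protocol = NA_character_) {
  se <- SummarizedExperiment(...)
  .ExampleExperiment(se, protocol = protocol)
}

# Invented data for illustration.
counts <- matrix(
  1:6, nrow = 3,
  dimnames = list(c("f1", "f2", "f3"), c("s1", "s2"))
)
ee <- ExampleExperiment(
  assays = list(counts = counts),
  protocol = "demo-protocol"
)

# Inherited accessors and subsetting work without extra code.
is(ee, "SummarizedExperiment")
assay(ee, "counts")
sub <- ee[1:2, ]
class(sub)
```

Extending an existing class in this way means users can pass the new object to any function written for `SummarizedExperiment`, which is the interoperability the submission guidelines ask for.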