# EpiAnnotator Databanks

#### 2017-11-04

This document introduces the concepts of annotation, databank, meta table and platform. It also provides an overview of how annotations are stored in EpiAnnotator.

## Annotations

Enrichment analysis of selected (compared to background) regions is performed with respect to a genomic annotation. In general, an annotation is one of the following:

1. A non-overlapping set of genomic regions. Examples of such annotations are Ensembl genes and CpG islands. In this case, we can also think of the annotation as a classification of every base pair in the genome into yes (the base lies within a region of the annotation) or no (the base lies outside the annotation’s regions).

2. A partition of the genome into multiple different states. An example of such an annotation is the set of chromatin states for a particular cell line or cell type, as identified in the ENCODE project. In this case, we can also think of the annotation as a classification of every base pair in the genome into one of the available states.
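The two views above can be sketched as a single lookup that assigns a label to any base position. This is only an illustration, not EpiAnnotator's implementation: the function name `classify_position`, the 0-based half-open coordinates, and the toy regions are all assumptions.

```python
from bisect import bisect_right

def classify_position(regions, pos, default="no"):
    """Return the label of the region containing `pos`, or `default`.

    `regions` is a list of (start, end, label) tuples on one chromosome,
    sorted by start and non-overlapping. Coordinates are assumed to be
    0-based and half-open, i.e. a region covers start <= pos < end.
    """
    starts = [start for start, _, _ in regions]
    # Index of the last region whose start is <= pos (or -1 if none).
    i = bisect_right(starts, pos) - 1
    if i >= 0:
        start, end, label = regions[i]
        if start <= pos < end:
            return label
    return default

# Case 1: a set of regions; every base is classified as "yes" or "no".
cpg_islands = [(100, 200, "yes"), (500, 650, "yes")]

# Case 2: a partition; every base is assigned one of several states.
chromatin_states = [
    (0, 300, "promoter"),
    (300, 700, "enhancer"),
    (700, 1000, "quiescent"),
]

inside = classify_position(cpg_islands, 150)
outside = classify_position(cpg_islands, 300)
state = classify_position(chromatin_states, 300)
```

In the first case the default label covers every base outside the listed regions. In the second case the regions tile the chromosome, so the default is never needed.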

In EpiAnnotator, annotations are never strand-specific. Strand information is also ignored in the sets of interest uploaded for running an enrichment analysis.

## Databanks

In EpiAnnotator, a databank contains a group of annotations on the same genome assembly. A databank’s name can contain only Latin letters, digits, and the underscore symbol (_). The suffix of the name indicates which assembly the databank targets. For example, EpiAnnotator provides the following databanks:

• For human (hg38)
  • EpiAnnotator_hg38
  • LOLA_Core_hg38
• For human (hg19)
  • EpiAnnotator_hg19
  • LOLA_Core_hg19
• For mouse (mm10)
  • EpiAnnotator_mm10
• For mouse (mm9)
  • EpiAnnotator_mm9
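The naming rule can be checked with a short script. The sketch below is not part of EpiAnnotator. The helper names are hypothetical, and it assumes that the assembly suffix is separated from the rest of the name by an underscore, as in the databanks listed above.

```python
import re

# Only Latin letters, digits, and underscores are allowed in a name.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")

# Assemblies mentioned in this document.
KNOWN_ASSEMBLIES = ("hg38", "hg19", "mm10", "mm9")

def is_valid_databank_name(name):
    """Check that the whole name uses only the permitted characters."""
    return NAME_PATTERN.fullmatch(name) is not None

def assembly_of(name, known_assemblies=KNOWN_ASSEMBLIES):
    """Return the assembly named by the suffix, or None if unrecognized.

    Assumption: the suffix follows an underscore, e.g. "..._hg38".
    """
    for assembly in known_assemblies:
        if name.endswith("_" + assembly):
            return assembly
    return None

valid = is_valid_databank_name("LOLA_Core_hg38")
invalid = is_valid_databank_name("LOLA-Core_hg38")  # hyphen is not allowed
assembly = assembly_of("EpiAnnotator_mm10")
```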

Every databank is saved in a dedicated directory and contains the following components:

• List of supported chromosomes and their lengths, in base pairs, saved in a TAB-separated text file named chromosomes.txt.
• A meta table listing all annotations stored in the databank. The meta table is saved in the comma-separated value file meta.csv and it contains the columns “ID”, “Repository”, “Annotation”, “Class”, “Subclass”, “Tissue”, “Cell line”, “Disease”, “Version”, “Sex”, “Additional”.
• One subdirectory per supported chromosome, in which every stored annotation is represented by a single comma-separated value file compressed as a gz archive.
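The files described above can be read with standard tools. The following sketch parses in-memory stand-ins for chromosomes.txt and meta.csv. It assumes that chromosomes.txt has no header line and holds the chromosome name and length in its first two columns, which this document does not specify. The function names and all example values are made up for illustration.

```python
import csv
import io

# Column names of the meta table, as listed above.
META_COLUMNS = [
    "ID", "Repository", "Annotation", "Class", "Subclass", "Tissue",
    "Cell line", "Disease", "Version", "Sex", "Additional",
]

def read_chromosomes(handle):
    """Parse TAB-separated (name, length) rows into a dict.

    Assumption: no header line; name and length are the first two columns.
    """
    lengths = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            lengths[row[0]] = int(row[1])
    return lengths

def read_meta(handle):
    """Parse the meta table and verify that all expected columns exist."""
    reader = csv.DictReader(handle)
    present = reader.fieldnames or []
    missing = [column for column in META_COLUMNS if column not in present]
    if missing:
        raise ValueError("meta table lacks columns: " + ", ".join(missing))
    return list(reader)

# Toy file contents; the lengths and the annotation record are invented.
chromosomes_text = "chr1\t1000\nchr2\t800\n"
example_row = ["A0001", "ExampleRepo", "CpG islands", "", "", "", "", "", "1", "", ""]
meta_text = ",".join(META_COLUMNS) + "\n" + ",".join(example_row) + "\n"

chromosomes = read_chromosomes(io.StringIO(chromosomes_text))
meta = read_meta(io.StringIO(meta_text))
```

With real files, the handles would come from `open(path, newline="")`. The compressed per-chromosome files could be opened in text mode with `gzip.open(path, "rt")`, although their naming scheme is not described here.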

## Platforms

The genome-wide methylation array MethylationEPIC by Illumina interrogates over 850,000 CpGs in the human genome. Studies based on this assay typically produce sets of selected and background probes. Examples of selected probes include the hypermethylated probes in a certain disease subtype [1], or the CpGs that change their methylation state with age [2]. The background set consists of those probes that first passed the filtering criteria and then appeared non-significant after being tested for differential methylation. An enrichment analysis with respect to a known annotation, say, gene promoters, would test whether the rate of overlap of gene promoters with CpGs in the selected set is significantly higher or lower than the corresponding rate in the background set.
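One common way to compare two overlap rates is a two-sided Fisher's exact test on a 2×2 contingency table. This document does not state which test EpiAnnotator applies, so the choice of test is an assumption. The sketch uses only the standard library, and the function name and probe counts are invented.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    a: selected probes overlapping the annotation
    b: selected probes not overlapping
    c: background probes overlapping the annotation
    d: background probes not overlapping
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denominator = comb(total, col1)

    def probability(x):
        # Hypergeometric probability of x overlapping probes in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = probability(a)
    # Sum the probabilities of all tables at most as likely as the observed
    # one; the small tolerance guards against floating-point ties.
    p_value = sum(
        p for p in (probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-7)
    )
    return min(1.0, p_value)

# Toy counts: 30 of 100 selected probes overlap promoters,
# compared with 100 of 1000 background probes.
selected_rate = 30 / 100
background_rate = 100 / 1000
p_value = fisher_exact_two_sided(30, 70, 100, 900)
```

Exact binomial coefficients become slow for array-sized probe sets, so a real analysis would more likely call a statistics library.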

The MethylationEPIC assay mentioned in the paragraph above is an example of a platform. Platforms in EpiAnnotator are genome-wide assays that target a limited, predefined set of genomic locations. Other examples of such assays could be the Affymetrix gene and exon expression arrays. Every platform defines a universe of possible regions that can appear in the selected and background sets. Compared with working on arbitrary genomic regions, supporting a platform for a given databank provides two advantages:

1. The selected and background sets are defined and uploaded as lists of (probe) identifiers instead of genomic regions.
2. EpiAnnotator can precompute the distances of every region in the platform’s universe to its closest region of every annotation in the databank. This enables rapid enrichment analysis once selected and background sets are uploaded.
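The precomputation in the second point can be illustrated with a sorted-interval lookup. This is a minimal sketch, not EpiAnnotator's actual procedure. It assumes single-base probes and 0-based half-open regions on one chromosome, and the function names, probe identifiers, and coordinates are hypothetical.

```python
from bisect import bisect_left

def distance_to_closest(regions, starts, pos):
    """Distance in base pairs from `pos` to the nearest region (0 if inside).

    `regions` is a sorted, non-overlapping list of half-open (start, end)
    tuples and `starts` is the list of their start coordinates.
    Returns None when there are no regions.
    """
    candidates = []
    i = bisect_left(starts, pos)  # first region starting at or after pos
    if i < len(regions):
        candidates.append(regions[i][0] - pos)
    if i > 0:
        _, end = regions[i - 1]
        # The last base covered by the previous region is end - 1.
        candidates.append(max(0, pos - (end - 1)))
    return min(candidates) if candidates else None

def precompute_distances(probe_positions, regions):
    """Map every probe identifier to its distance to the closest region."""
    starts = [start for start, _ in regions]
    return {
        probe_id: distance_to_closest(regions, starts, pos)
        for probe_id, pos in probe_positions.items()
    }

# Hypothetical platform universe and annotation on one chromosome.
probe_positions = {"probe_a": 150, "probe_b": 210, "probe_c": 700}
annotation_regions = [(100, 200), (500, 650)]

distances = precompute_distances(probe_positions, annotation_regions)

# After upload, a set of identifiers only needs a dictionary lookup.
selected_ids = ["probe_a", "probe_b"]
selected_overlaps = sum(distances[probe_id] == 0 for probe_id in selected_ids)
```

Because the table is computed once per platform and annotation, counting overlaps for an uploaded identifier list needs no interval search at analysis time.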

## References

1. Pajtler KW et al. Molecular classification of ependymal tumors across all CNS compartments, histopathological grades, and age groups. Cancer Cell 27(5):728-743, 2015
2. Horvath S. DNA methylation age of human tissues and cell types. Genome Biology 14(10):R115, 2013