Karimzadeh M, Ernst C, Kundaje A, Hoffman MM. 2018. Umap and Bismap: quantifying genome and methylome mappability. doi: https://doi.org/10.1093/nar/gky677 In press at Nucleic Acids Research. (BibTeX)
The free Umap software package efficiently identifies uniquely mappable regions of any genome. Its Bismap extension identifies mappability of the bisulfite-converted genome (methylome).
To identify the single-read mappability of a bisulfite-converted genome, we create two altered genome sequences. In the first sequence, we convert all cytosines to thymine (C → T). In the other sequence we convert all guanines to adenine (G → A). Our approach follows those of Bismark and BWA-meth. We convert the genome sequence this way because bisulfite treatment converts un-methylated cytosine to uracil which is read as thymine. Similarly the guanine that base-pairs with the un-methylated cytosine in the - strand converts to adenine. These two conversions, however, never occur at the same time on the same read. We identify the uniquely mappable regions of these two genomes separately, and then combine the data to represent the single-read mappability of the forward and reverse strands in the bisulfite-converted genome.
Bismap requires special handling of reverse complementation of C → T or G → A converted genomes. Conversion of C → T on the sequence AATTCCGG produces AATTTTGG. In the Bowtie index, the reverse complement would be CCAAAATT. However for the purpose of identifying the mappability of the bisulfite-converted genome, we expect the reverse complement to be TTGGAATT. The reason is that both forward and reverse strands undergo bisulfite treatment simultaneously. There is no DNA replication after bisulfite treatment. To handle this issue, Bismap creates its own reverse complemented chromosomes and suppresses Bowtie's usual reverse complement mapping.
Umap and Bismap each take approximately 200 CPU hours to run on the human genome for a given read length. This can be parallelized in a computing cluster over 400 cores to take only 30 min.
Umap efficiently identifies the single-read mappability of any genome for a range of sequencing read lengths. The single-read mappability of a genomic region is a fraction of that region which overlaps with at least one uniquely mappable k-mer. The Bismap extension of Umap produces the single-read mappability of a bisulfite-converted genome. Both Umap and Bismap produce an integer vector for each chromosome that efficiently defines the mappability for any region and can be converted to a browser extensible data (BED) file. In addition to single-read mappability, we can measure the mappability of a genomic region by another approach. To quantify the single-read mappability of a given genomic region, we measure the fraction of potential uniquely mappable reads in that region. A region, however, can have 100% single-read mappability, but in practice require a high coverage sequencing to properly identify that region. For example, a 1 kbp region with 100% single-read mappability can be mappable due to a minimum of 10 unique 100-mers that none of them overlap or a maximum of 1100 unique 100-mers that highly overlap. Therefore, we define the multi-read mappability, the probability that a randomly selected read of length k in a given region is uniquely mappable. For the genomic region Gi:j starting at i and ending at j, there are j - i + k + 1 k-mers that overlap with Gi:j. The multi-read mappability of Gi:j is the fraction of those k-mers that are uniquely mappable.
View the Umap and Bismap track hub in the UCSC genome browser for GRCh37/hg19, GRCh38/hg38, GRCm37/mm9, or GRCm38/mm10.
The track hub is composed of two different supertracks; Umap and Bismap. The Umap supertrack contains the tracks for the unconverted genome and the Bismap supertrack contains the tracks for the bisulfite-converted genome.
Each supertrack contains the single-read mappability and multi-read mappability tracks for four different read lengths: 24 bp, 36 bp, 50 bp, and 100 bp.
By default, the Umap and Bismap single-read mappability and multi-read mappability tracks for 24 bp are shown, while other tracks are hidden. In order to change the tracks you want to view in the genome browser, click on the supertrack name and change the default settings.
The single-read mappability tracks are in BED6 format, showing regions of the genome that are uniquely mappable by at least one k-mer. In Umap, unlike Bismap, the mappability of the reverse strand is the same as the forward strand. Thus in Bismap, the mappability of both strands is shown. In Bismap, only regions that are uniquely mappable in both strands must be considered uniquely mappable. For this reason, we calculated the multi-read mappability of bisulfite-converted genome only for the regions that are uniquely mappable in both strands.
View the Umap and Bismap track hub in UCSC genome browser.
The current track hub data is based on Umap V1.1.0. If you need to access data produced using Umap V1.0.0 (before May 2nd 2017), they are deposited in Zenodo.
For the data in the track hub, we did not use any of the haplotypes, unlocalized contigs, or unplaced contigs. Upon frequent requests, we generated mappability of the hg38 genome without haplotypes but including unlocalized and unplaced contigs using the following kmers: 24, 36, 50, 101, 150, 200, 250, and 500. You can download the Umap uint8 files of this assembly from here.
Download the BED files as a tarball for further analysis:
Download the uint8 files below:
Download individual k-mer files of Umap or Bismap:
Read the documentation for Umap and Bismap software, which begins with a quick start.
Please ask questions about Umap on our mailing list. If you want to report a bug or request a feature, use Umap issue tracker. We are interested in all comments on the package, and the ease of use of installation and documentation.
Umap was originally develped by Anshul Kundaje in MATLAB. The original Umap repository is available here. Here we describe a reimplementation of Umap in Python, where we added new features including Bismap.