Data simple - RNA-seq units

June 2016

On RNA-seq expression units

Or why you should use TPM

RNA-seq is a widely used technique allowing sensitive differential gene expression analysis. The typical RNA-seq experiment involves the preparation of mRNA samples, fragmentation of the mRNA molecules, reverse transcription to cDNA, and the conversion of the sample into a molecular library for sequencing. The sequencing output consists of millions of reads generated from the cDNA fragments. The reads are then aligned to a reference genome or transcriptome, in order to determine the qualitative and quantitative composition of the cDNA library. The ultimate goal is the estimation of the relative abundance of the underlying genomic features, i.e. the genomic regions that can appear represented in the sample, such as exons or genes.

Along with the relative expression of the genomic feature, which is exactly the target measurement in RNA-seq experiments, read counts depend on some confounding factors. Namely, read counts for each genomic feature depend heavily on the total number of obtained reads (sequencing depth) and the feature's length, as well as transcriptome and nucleotide composition. The dependence on feature length may be surprising to you, as you might expect the method to produce a number of reads proportional to the number of cDNA molecules in the sample. The key is the fragmentation step referred above. Sequencing requires that the molecules are broken down into smaller pieces. Longer transcripts will naturally yield a higher number of fragments and, thus, a higher read count.

In this post I will go over different RNA-seq expression units and the basic math behind them, as well as their implementation in the popular gene expression analysis Bioconductor package edgeR.

Sample normalization factors

RNA-seq experiments generate relative, rather than absolute, measurements. In order for any units to be comparable between samples or between experiments, we need to perform inter-sample normalization. While feature expression comparisons within samples need to account for all factors referred in the second paragraph above, comparisons between samples need to be corrected for differences in sequencing depth and transcriptome composition only. Attributes like feature length and GC content (nucleotide composition) do not vary and thus cancel out.

How differences in sequencing depth affect read count is self explanatory. The impact of transcriptome composition, on the other hand, can be understood using the following example. Given some feature present in two RNA-seq samples with the same number of total reads (sequencing depth), if one sample has a larger transcriptome, a smaller proportion of the reads will pertain to that feature. Similarly, for samples with the same qualitative transcriptome and sequenced at the same depth, if a subset of other shared features is more highly expressed in one sample it will make up a larger proportion of the total read pool, leaving a lower proportion of the reads mapping to the test feature.

For this reason, final counts of aligned reads (known as feature summarization data) need to be corrected for such biases. The typical analysis of feature summarization data will drop reads for features with counts below a certain threshold, calculate sample normalization factors, and scale library sizes (sum of reads in each sample) accordingly. This will address the biases brought about by differences in sequencing depth and transcriptome composition. In the specific case of edgeR, an empirical approach based on the trimmed mean of M values (TMM) method is used, implemented in the function calcNormFactors.

After sample normalization, expression units are chosen for inter-sample and within-sample differential feature expression analysis. I will go over the most popular units next.

CPM

The simplest RNA-seq feature expression unit reports normalized counts, or the number of reads that align to a particular feature after correcting for sequencing depth and transcriptome composition bias. Normalized counts are the most popular unit among differential expression analysis methods (including edgeR). However, feature length normalization is skipped, with the important consequence that within-sample differential feature expression analysis is not possible.

This unit is known as counts per million reads mapped (CPM). In its basic form, for each feature $i$, $CPM$ is the count of sequenced fragments mapping to the feature (the random variable I am calling $r_i$ here) scaled by the total number of reads ($R$) times one million (to bring it up to a more convenient number).

$$ \text{CPM}_i = \dfrac{r_i}{\dfrac{R}{10^6}} = \dfrac{r_i}{R}\cdot 10^6 $$ with

$ R = \sum_{j=1}^n r_j $

and

$ n = $ number of represented features

If we drop the multiplication by one million, $ CPM_i $ becomes simply the number of reads mapping to feature $ i $ relative to the total number of fragments in the sample and thus has the properties of relative counts:

$ 0 \leq \dfrac{r_i}{R} \leq 1 $ and $ \sum_{i=1}^n \dfrac{r_i}{R} = 1 $

edgeR includes function cpm.default to compute this unit. It divides every count in a sample (generated by feature summarization from SAM/BAM alignment maps using HTSeq or featureCounts, for example) by the total number of counts for that sample (library size or, more correctly, total number of mapped reads) multiplied by one million. Here's edgeR's cpm.default.

        
> edgeR::cpm.default

function (x, lib.size = NULL, log = FALSE, prior.count = 0.25, ...)
{
    x <- as.matrix(x)
    if (is.null(lib.size))
        lib.size <- colSums(x)
    if (log) {
        prior.count.scaled <- lib.size/mean(lib.size) * prior.count
        lib.size <- lib.size + 2 * prior.count.scaled
    }
    lib.size <- 1e-06 * lib.size
    if (log)
        log2(t((t(x) + prior.count.scaled)/lib.size))
    else t(t(x)/lib.size)
}

The key part of their implementation can be written as t(t(x)/(1e-06 * lib.size)). Feature counts are in matrix x, in which every column is a sample and every row is a feature. Total read counts for each sample are the column sums of the matrix, resulting in vector lib.size of the same dimension as the number of columns in the matrix (corresponding to the number of samples in the experiment). CPM is computed using a matrix-vector division.

Of course this basic implementation ignores sample normalization factors and thus fails to account for differences in transcriptome composition. As mentioned in the previous section, that is accomplished by scaling library sizes according to sample normalization factors (calculated in edgeR using calcNormFactors). In edgeR's code (cpm.DGEList) the library size is scaled by the sample normalization factor before calling the cpm.default function from above.

			
> edgeR::cpm.DGEList

function (x, normalized.lib.sizes = TRUE, log = FALSE, prior.count = 0.25, ...)
{
    lib.size <- x$samples$lib.size
    if (normalized.lib.sizes)
        lib.size <- lib.size * x$samples$norm.factors
    cpm.default(x$counts, lib.size = lib.size, log = log, prior.count = prior.count)
}

RPKM/FPKM

If (normalization factor-scaled) CPM is appropriate for inter-sample comparison, within-sample comparisons must account for differences in feature length. The most commonly used unit in this case is reads per kilobase of transcript per million reads mapped (RPKM), introduced with one of the original RNA-seq papers (Mortazavi et al. Nature Methods, 2008). The alternative unit FPKM is nothing more than a generalization of RPKM to deal with the fact that a single sequenced cDNA fragment can actually produce more than one read, depending on the sequencing technology. This currently refers to paired-end sequencing experiments, in which a read pair, rather than a single read, corresponds to a cDNA fragment. The word 'reads' in RPKM is replaced by 'fragments' in FPKM. RPKM (or FPKM) is defined formally for each feature $i$ as the count (the random variable $ r_i $) scaled by the feature's length ($ l_i $) times one thousand (to kilobase) and further scaled by the total number of reads ($ R $) times one million (again, in order to bring it up to a more convenient number). $$ \text{RPKM}_i = \dfrac{r_i}{ \left(\dfrac{l_i}{10^3}\right) \left( \dfrac{R}{10^6} \right)} = \dfrac{r_i}{l_i R} \cdot 10^9 $$

edgeR includes function rpkm.default to compute this unit. Since RPKM actually builds on CPM by adding feature length normalization, edgeR's implementation calculates RPKM by simply dividing each feature's CPM (variable y in the code) by that feature's length multiplied by one thousand. This adds feature length normalization to sequencing depth-normalized counts.

				
> edgeR::rpkm.default

function (x, lib.size = NULL, log = FALSE, prior.count = 0.25, ...)
{
	y <- cpm.default(x = x, lib.size = lib.size, log = log, prior.count = prior.count, ...)
	gene.length.kb <- gene.length/1000
	if (log)
		y - log2(gene.length.kb)
	else y/gene.length.kb
}

The inconsistency with the RPKM/FPKM unit

However, the RPKM is inconsistent across samples (Wagner et al. Theory Biosci., 2012). The reason is explained by the following insight: RPKM normalizes by the total number of reads and this is not a measure of total transcript number. This relationship depends on the length distribution of transcripts, which is variable between samples. In other words, the relationship between the total number of reads ($ R $ in the equations) and the actual total number of cDNA fragments sampled depends on the feature length distribution. For this reason, this normalization factor is variable across samples, hence the statement that opens this paragraph: the resulting unit is inconsistent across samples. A clear manifestation of the problem is the fact that the sum of all RPKM values will vary from sample to sample.

Inter-feature differential expression analysis within samples will still work. Running the risk of weakening the underlying message of this article, I will even go as far as stating that even inter-sample RPKM comparisons will, in most cases, be ok. In practical terms, since we typically look at thousands of features and it is reasonable to assume that a majority of them will not be differentially expressed between samples, we are probably not that far off when using RPKM. We can probably trust most conclusions drawn in such terms in the literature. The inconsistency is still real and is there, though, so it needs fixing.

TPM is the solution (kind of...)

A possible fix emerges from the explanation of RPKM's inconsistency above. Transcript length distributions can vary between samples and that introduces an inconsistency in RPKM calculation. So, instead of dividing by the total number of sequenced reads ($ R $) for a given sample, we can instead divide by the total number of length-normalized reads. This unit, transcripts per million fragments sequenced (TPM), was introduced originally in Li et al. Bioinformatics, 2010 and further supported in Wagner et al. Theory Biosci., 2012. Compare the TPM equation with CPM and RPKM's, re-written here for direct comparison. For simplicity, sample normalization factor scaling of library size is excluded. $$ \text{CPM}_i = \dfrac{r_i}{\sum_{j=1}^n r_j} \cdot 10^6 $$ $$ \text{RPKM}_i = \dfrac{r_i}{\sum_{j=1}^n r_j} \cdot \dfrac{1}{\dfrac{l_i}{10^3}} \cdot 10^6 $$ $$ \text{TPM}_i = \dfrac{r_i}{\sum_{j=1}^n \dfrac{r_j}{l_j}} \cdot \dfrac{1}{l_i} \cdot 10^6 $$

The interpretation is that $ TPM_i $ represents the number of transcripts of feature $i $ found in a total number of one million transcripts. Interestingly, we can easily convert RPKM values to TPM by simply dividing each feature's RPKM by the sum of the RPKM values of all features and multiplying by one million. In fact, TPM is really just RPKM scaled by a constant to correct the sum of all values to 1 million. $$ \begin{aligned} \text{TPM}_i &= \dfrac{\text{RPKM}_i}{\sum_{j=1}^n \text{RPKM}_j } \cdot 10^6 \end{aligned} $$

Calculating TPM

Armed with this information, we can convert RPKM to TPM in two different ways: from pre-calculated RPKM, by diving by the sum of RPKM values, or directly from the normalized counts. Below I have written some example R code to calculate TPM starting from RPKM values computed using edgeR's rpkm function.

				
# Start from feature summarization data (as an edgeR DGEList object)
# Filter-out low expression features, perform sample normalization and calculate RPKM
y <- edgeR::calcNormFactors(y)
y <- edgeR::rpkm(y)

# Calculate TPM from RPKM
tpm_from_rpkm <- function(x) {
  rpkm.sum <- colSums(x)
  return(t(t(x) / (1e-06 * rpkm.sum)))
}

tpm <- tpm_from_rpkm(y)

And the following is a simple implementation of TPM computation starting from edgeR-normalized counts.

				
# Start from feature summarization data (as an edgeR DGEList object)
# Filter-out low expression features and perform sample normalization
y <- edgeR::calcNormFactors(y)

# Calculate TPM from normalized counts
calc_tpm <- function(x, gene.length) {
    x <- as.matrix(x)
    len.norm.lib.size <- colSums(x / gene.length)
    return((t(t(x) / len.norm.lib.size) * 1e06) / gene.length)
}

tpm_man <- calc_tpm(y, gene.length = y$genes$Length)

TPM accounts for the lengths of all transcripts found in the sample and thus brings us one step closer to a good solution. This unit is more stable across samples than RPKM. However, the abundances of all transcripts will still change between samples, meaning that the denominator of the TPM equation (sum of length-normalized read counts) is still variable. This way, TPM is not directly comparable between experiments either.

Conclusion

So if no unit is perfect, how are we supposed to analyse differential feature expression? Maybe a better unit will be developed at some point, but in the mean time CPM is an appropriate unit for inter-sample comparison. When comparing feature expression within samples, TPM should be used instead of RPKM/FPKM.

If you want to dive deeper in the subject of why TPM is a better unit than RPKM/FPKM, besides the papers I cited above I would recommend two very informative blog posts published in Bits of DNA (by Prof. Lior Pachter) and The Farago (by Prof. Pachter's student Harold Pimentel).