June 2016
On RNA-seq expression units
Or why you should use TPM
RNA-seq is a widely used technique allowing sensitive differential gene expression analysis. The typical RNA-seq experiment involves the preparation of mRNA samples, fragmentation of the mRNA molecules, reverse transcription to cDNA, and the conversion of the sample into a molecular library for sequencing. The sequencing output consists of millions of reads generated from the cDNA fragments. The reads are then aligned to a reference genome or transcriptome, in order to determine the qualitative and quantitative composition of the cDNA library. The ultimate goal is the estimation of the relative abundance of the underlying genomic features, i.e. the genomic regions that can appear represented in the sample, such as exons or genes.
Along with the relative expression of the genomic feature, which is exactly the target measurement in RNA-seq experiments, read counts depend on some confounding factors. Namely, read counts for each genomic feature depend heavily on the total number of obtained reads (sequencing depth) and the feature's length, as well as transcriptome and nucleotide composition. The dependence on feature length may be surprising to you, as you might expect the method to produce a number of reads proportional to the number of cDNA molecules in the sample. The key is the fragmentation step referred above. Sequencing requires that the molecules are broken down into smaller pieces. Longer transcripts will naturally yield a higher number of fragments and, thus, a higher read count.
In this post I will go over different RNA-seq expression units and the basic math behind them, as well as their implementation in the popular gene expression analysis Bioconductor package edgeR.
Sample normalization factors
RNA-seq experiments generate relative, rather than absolute, measurements. In order for any units to be comparable between samples or between experiments, we need to perform inter-sample normalization. While feature expression comparisons within samples need to account for all factors referred in the second paragraph above, comparisons between samples need to be corrected for differences in sequencing depth and transcriptome composition only. Attributes like feature length and GC content (nucleotide composition) do not vary and thus cancel out.
How differences in sequencing depth affect read count is self explanatory. The impact of transcriptome composition, on the other hand, can be understood using the following example. Given some feature present in two RNA-seq samples with the same number of total reads (sequencing depth), if one sample has a larger transcriptome, a smaller proportion of the reads will pertain to that feature. Similarly, for samples with the same qualitative transcriptome and sequenced at the same depth, if a subset of other shared features is more highly expressed in one sample it will make up a larger proportion of the total read pool, leaving a lower proportion of the reads mapping to the test feature.
For this reason, final counts of aligned reads (known as feature summarization
data) need to be corrected for such biases. The typical analysis of feature
summarization data will drop reads for features with counts below a certain
threshold, calculate sample normalization factors, and scale library sizes
(sum of reads in each sample) accordingly. This will address the biases brought
about by differences in sequencing depth and transcriptome composition. In the
specific case of edgeR, an empirical approach based on the trimmed mean of M
values (TMM) method is used, implemented in the function
After sample normalization, expression units are chosen for inter-sample and within-sample differential feature expression analysis. I will go over the most popular units next.
CPM
The simplest RNA-seq feature expression unit reports normalized counts, or the number of reads that align to a particular feature after correcting for sequencing depth and transcriptome composition bias. Normalized counts are the most popular unit among differential expression analysis methods (including edgeR). However, feature length normalization is skipped, with the important consequence that within-sample differential feature expression analysis is not possible.
This unit is known as counts per million reads mapped (CPM). In its basic form, for each feature \(i\), \(CPM\) is the count of sequenced fragments mapping to the feature (the random variable I am calling \(r_i\) here) scaled by the total number of reads (\(R\)) times one million (to bring it up to a more convenient number).
$$ \text{CPM}_i = \dfrac{r_i}{\dfrac{R}{10^6}} = \dfrac{r_i}{R}\cdot 10^6 $$ with
\( R = \sum_{j=1}^n r_j \)
and
\( n = \) number of represented features
If we drop the multiplication by one million, \( CPM_i \) becomes simply the number of reads mapping to feature \( i \) relative to the total number of fragments in the sample and thus has the properties of relative counts:
\( 0 \leq \dfrac{r_i}{R} \leq 1 \) and \( \sum_{i=1}^n \dfrac{r_i}{R} = 1 \)
edgeR includes function
> edgeR::cpm.default
function (x, lib.size = NULL, log = FALSE, prior.count = 0.25, ...)
{
x <- as.matrix(x)
if (is.null(lib.size))
lib.size <- colSums(x)
if (log) {
prior.count.scaled <- lib.size/mean(lib.size) * prior.count
lib.size <- lib.size + 2 * prior.count.scaled
}
lib.size <- 1e-06 * lib.size
if (log)
log2(t((t(x) + prior.count.scaled)/lib.size))
else t(t(x)/lib.size)
}
The key part of their implementation can be written as
Of course this basic implementation ignores sample normalization factors and thus
fails to account for differences in transcriptome composition. As mentioned in the
previous section, that is accomplished by scaling library sizes according to
sample normalization factors (calculated in edgeR using
> edgeR::cpm.DGEList
function (x, normalized.lib.sizes = TRUE, log = FALSE, prior.count = 0.25, ...)
{
lib.size <- x$samples$lib.size
if (normalized.lib.sizes)
lib.size <- lib.size * x$samples$norm.factors
cpm.default(x$counts, lib.size = lib.size, log = log, prior.count = prior.count)
}
RPKM/FPKM
If (normalization factor-scaled) CPM is appropriate for inter-sample comparison, within-sample comparisons must account for differences in feature length. The most commonly used unit in this case is reads per kilobase of transcript per million reads mapped (RPKM), introduced with one of the original RNA-seq papers (Mortazavi et al. Nature Methods, 2008). The alternative unit FPKM is nothing more than a generalization of RPKM to deal with the fact that a single sequenced cDNA fragment can actually produce more than one read, depending on the sequencing technology. This currently refers to paired-end sequencing experiments, in which a read pair, rather than a single read, corresponds to a cDNA fragment. The word 'reads' in RPKM is replaced by 'fragments' in FPKM. RPKM (or FPKM) is defined formally for each feature \(i\) as the count (the random variable \( r_i \)) scaled by the feature's length (\( l_i \)) times one thousand (to kilobase) and further scaled by the total number of reads (\( R \)) times one million (again, in order to bring it up to a more convenient number). $$ \text{RPKM}_i = \dfrac{r_i}{ \left(\dfrac{l_i}{10^3}\right) \left( \dfrac{R}{10^6} \right)} = \dfrac{r_i}{l_i R} \cdot 10^9 $$
edgeR includes function
> edgeR::rpkm.default
function (x, lib.size = NULL, log = FALSE, prior.count = 0.25, ...)
{
y <- cpm.default(x = x, lib.size = lib.size, log = log, prior.count = prior.count, ...)
gene.length.kb <- gene.length/1000
if (log)
y - log2(gene.length.kb)
else y/gene.length.kb
}
The inconsistency with the RPKM/FPKM unit
However, the RPKM is inconsistent across samples (Wagner et al. Theory Biosci., 2012). The reason is explained by the following insight: RPKM normalizes by the total number of reads and this is not a measure of total transcript number. This relationship depends on the length distribution of transcripts, which is variable between samples. In other words, the relationship between the total number of reads (\( R \) in the equations) and the actual total number of cDNA fragments sampled depends on the feature length distribution. For this reason, this normalization factor is variable across samples, hence the statement that opens this paragraph: the resulting unit is inconsistent across samples. A clear manifestation of the problem is the fact that the sum of all RPKM values will vary from sample to sample.
Inter-feature differential expression analysis within samples will still work. Running the risk of weakening the underlying message of this article, I will even go as far as stating that even inter-sample RPKM comparisons will, in most cases, be ok. In practical terms, since we typically look at thousands of features and it is reasonable to assume that a majority of them will not be differentially expressed between samples, we are probably not that far off when using RPKM. We can probably trust most conclusions drawn in such terms in the literature. The inconsistency is still real and is there, though, so it needs fixing.
TPM is the solution (kind of...)
A possible fix emerges from the explanation of RPKM's inconsistency above. Transcript length distributions can vary between samples and that introduces an inconsistency in RPKM calculation. So, instead of dividing by the total number of sequenced reads (\( R \)) for a given sample, we can instead divide by the total number of length-normalized reads. This unit, transcripts per million fragments sequenced (TPM), was introduced originally in Li et al. Bioinformatics, 2010 and further supported in Wagner et al. Theory Biosci., 2012. Compare the TPM equation with CPM and RPKM's, re-written here for direct comparison. For simplicity, sample normalization factor scaling of library size is excluded. $$ \text{CPM}_i = \dfrac{r_i}{\sum_{j=1}^n r_j} \cdot 10^6 $$ $$ \text{RPKM}_i = \dfrac{r_i}{\sum_{j=1}^n r_j} \cdot \dfrac{1}{\dfrac{l_i}{10^3}} \cdot 10^6 $$ $$ \text{TPM}_i = \dfrac{r_i}{\sum_{j=1}^n \dfrac{r_j}{l_j}} \cdot \dfrac{1}{l_i} \cdot 10^6 $$
The interpretation is that \( TPM_i \) represents the number of transcripts of feature \(i \) found in a total number of one million transcripts. Interestingly, we can easily convert RPKM values to TPM by simply dividing each feature's RPKM by the sum of the RPKM values of all features and multiplying by one million. In fact, TPM is really just RPKM scaled by a constant to correct the sum of all values to 1 million. $$ \begin{aligned} \text{TPM}_i &= \dfrac{\text{RPKM}_i}{\sum_{j=1}^n \text{RPKM}_j } \cdot 10^6 \end{aligned} $$
Calculating TPM
Armed with this information, we can convert RPKM to TPM in two different
ways: from pre-calculated RPKM, by diving by the sum of RPKM values, or directly
from the normalized counts. Below I have written some example R code to
calculate TPM starting from RPKM values computed using edgeR's
# Start from feature summarization data (as an edgeR DGEList object)
# Filter-out low expression features, perform sample normalization and calculate RPKM
y <- edgeR::calcNormFactors(y)
y <- edgeR::rpkm(y)
# Calculate TPM from RPKM
tpm_from_rpkm <- function(x) {
rpkm.sum <- colSums(x)
return(t(t(x) / (1e-06 * rpkm.sum)))
}
tpm <- tpm_from_rpkm(y)
And the following is a simple implementation of TPM computation starting from edgeR-normalized counts.
# Start from feature summarization data (as an edgeR DGEList object)
# Filter-out low expression features and perform sample normalization
y <- edgeR::calcNormFactors(y)
# Calculate TPM from normalized counts
calc_tpm <- function(x, gene.length) {
x <- as.matrix(x)
len.norm.lib.size <- colSums(x / gene.length)
return((t(t(x) / len.norm.lib.size) * 1e06) / gene.length)
}
tpm_man <- calc_tpm(y, gene.length = y$genes$Length)
TPM accounts for the lengths of all transcripts found in the sample and thus brings us one step closer to a good solution. This unit is more stable across samples than RPKM. However, the abundances of all transcripts will still change between samples, meaning that the denominator of the TPM equation (sum of length-normalized read counts) is still variable. This way, TPM is not directly comparable between experiments either.
Conclusion
So if no unit is perfect, how are we supposed to analyse differential feature expression? Maybe a better unit will be developed at some point, but in the mean time CPM is an appropriate unit for inter-sample comparison. When comparing feature expression within samples, TPM should be used instead of RPKM/FPKM.
If you want to dive deeper in the subject of why TPM is a better unit than RPKM/FPKM, besides the papers I cited above I would recommend two very informative blog posts published in Bits of DNA (by Prof. Lior Pachter) and The Farago (by Prof. Pachter's student Harold Pimentel).