**June 2016**

## On RNA-seq expression units

### Or why you should use TPM

RNA-seq is a widely used technique allowing sensitive differential gene expression analysis. The typical RNA-seq experiment involves the preparation of mRNA samples, fragmentation of the mRNA molecules, reverse transcription to cDNA, and the conversion of the sample into a molecular library for sequencing. The sequencing output consists of millions of reads generated from the cDNA fragments. The reads are then aligned to a reference genome or transcriptome, in order to determine the qualitative and quantitative composition of the cDNA library. The ultimate goal is the estimation of the relative abundance of the underlying genomic features, i.e. the genomic regions that can appear represented in the sample, such as exons or genes.

Along with the relative expression of the genomic feature, which is exactly the target measurement in RNA-seq experiments, read counts depend on some confounding factors. Namely, read counts for each genomic feature depend heavily on the total number of obtained reads (sequencing depth) and the feature's length, as well as transcriptome and nucleotide composition. The dependence on feature length may be surprising to you, as you might expect the method to produce a number of reads proportional to the number of cDNA molecules in the sample. The key is the fragmentation step referred above. Sequencing requires that the molecules are broken down into smaller pieces. Longer transcripts will naturally yield a higher number of fragments and, thus, a higher read count.

In this post I will go over different RNA-seq expression units and the basic math behind them, as well as their implementation in the popular gene expression analysis Bioconductor package edgeR.

### Sample normalization factors

RNA-seq experiments generate relative, rather than absolute, measurements. In order for any units to be comparable between samples or between experiments, we need to perform inter-sample normalization. While feature expression comparisons within samples need to account for all factors referred in the second paragraph above, comparisons between samples need to be corrected for differences in sequencing depth and transcriptome composition only. Attributes like feature length and GC content (nucleotide composition) do not vary and thus cancel out.

How differences in sequencing depth affect read count is self explanatory. The impact of transcriptome composition, on the other hand, can be understood using the following example. Given some feature present in two RNA-seq samples with the same number of total reads (sequencing depth), if one sample has a larger transcriptome, a smaller proportion of the reads will pertain to that feature. Similarly, for samples with the same qualitative transcriptome and sequenced at the same depth, if a subset of other shared features is more highly expressed in one sample it will make up a larger proportion of the total read pool, leaving a lower proportion of the reads mapping to the test feature.

For this reason, final counts of aligned reads (known as feature summarization
data) need to be corrected for such biases. The typical analysis of feature
summarization data will drop reads for features with counts below a certain
threshold, calculate sample normalization factors, and scale library sizes
(sum of reads in each sample) accordingly. This will address the biases brought
about by differences in sequencing depth and transcriptome composition. In the
specific case of edgeR, an empirical approach based on the trimmed mean of M
values (TMM) method is used, implemented in the function

After sample normalization, expression units are chosen for inter-sample and within-sample differential feature expression analysis. I will go over the most popular units next.

### CPM

The simplest RNA-seq feature expression unit reports normalized counts, or the number of reads that align to a particular feature after correcting for sequencing depth and transcriptome composition bias. Normalized counts are the most popular unit among differential expression analysis methods (including edgeR). However, feature length normalization is skipped, with the important consequence that within-sample differential feature expression analysis is not possible.

This unit is known as __c__ounts __p__er __m__illion reads
mapped (CPM). In its basic form, for each feature \(i\), \(CPM\) is the
count of sequenced fragments mapping to the feature (the random variable
I am calling \(r_i\) here) scaled by the total number of reads (\(R\))
times one million (to bring it up to a more convenient number).

$$ \text{CPM}_i = \dfrac{r_i}{\dfrac{R}{10^6}} = \dfrac{r_i}{R}\cdot 10^6 $$ with

\( R = \sum_{j=1}^n r_j \)

and

\( n = \) number of represented features

If we drop the multiplication by one million, \( CPM_i \) becomes simply the number of reads mapping to feature \( i \) relative to the total number of fragments in the sample and thus has the properties of relative counts:

\( 0 \leq \dfrac{r_i}{R} \leq 1 \) and \( \sum_{i=1}^n \dfrac{r_i}{R} = 1 \)

edgeR includes function

` ````
> edgeR::cpm.default
function (x, lib.size = NULL, log = FALSE, prior.count = 0.25, ...)
{
x <- as.matrix(x)
if (is.null(lib.size))
lib.size <- colSums(x)
if (log) {
prior.count.scaled <- lib.size/mean(lib.size) * prior.count
lib.size <- lib.size + 2 * prior.count.scaled
}
lib.size <- 1e-06 * lib.size
if (log)
log2(t((t(x) + prior.count.scaled)/lib.size))
else t(t(x)/lib.size)
}
```

The key part of their implementation can be written as

Of course this basic implementation ignores sample normalization factors and thus
fails to account for differences in transcriptome composition. As mentioned in the
previous section, that is accomplished by scaling library sizes according to
sample normalization factors (calculated in edgeR using

` ````
> edgeR::cpm.DGEList
function (x, normalized.lib.sizes = TRUE, log = FALSE, prior.count = 0.25, ...)
{
lib.size <- x$samples$lib.size
if (normalized.lib.sizes)
lib.size <- lib.size * x$samples$norm.factors
cpm.default(x$counts, lib.size = lib.size, log = log, prior.count = prior.count)
}
```

### RPKM/FPKM

If (normalization factor-scaled) CPM is appropriate for inter-sample comparison,
within-sample comparisons must account for differences in feature length. The
most commonly used unit in this case is __r__eads __p__er __k__ilobase
of transcript per __m__illion reads mapped (RPKM), introduced with one of
the original RNA-seq papers
(Mortazavi *et al.* Nature Methods, 2008).
The alternative unit FPKM is nothing more than a generalization of RPKM to deal with
the fact that a single sequenced cDNA fragment can actually produce more than
one read, depending on the sequencing technology. This currently refers
to paired-end sequencing experiments, in which a read pair, rather than a single read,
corresponds to a cDNA fragment. The word 'reads' in RPKM is replaced by 'fragments' in
FPKM. RPKM (or FPKM) is defined formally for each feature \(i\) as the count
(the random variable \( r_i \)) scaled by the feature's length (\( l_i \))
times one thousand (to kilobase) and further scaled by the total number of
reads (\( R \)) times one million (again, in order to bring
it up to a more convenient number).
$$ \text{RPKM}_i = \dfrac{r_i}{ \left(\dfrac{l_i}{10^3}\right) \left( \dfrac{R}{10^6} \right)} = \dfrac{r_i}{l_i R} \cdot 10^9 $$

edgeR includes function

` ````
> edgeR::rpkm.default
function (x, lib.size = NULL, log = FALSE, prior.count = 0.25, ...)
{
y <- cpm.default(x = x, lib.size = lib.size, log = log, prior.count = prior.count, ...)
gene.length.kb <- gene.length/1000
if (log)
y - log2(gene.length.kb)
else y/gene.length.kb
}
```

### The inconsistency with the RPKM/FPKM unit

However, the RPKM is inconsistent across samples
(Wagner *et al.* Theory Biosci., 2012).
The reason is explained by the following insight: RPKM normalizes by the
total number of reads and this is not a measure of total transcript number.
This relationship depends on the length distribution of transcripts, which
is variable between samples. In other words, the relationship between the
total number of reads (\( R \) in the equations) and the actual total number
of cDNA fragments sampled depends on the feature length distribution. For
this reason, this normalization factor is variable across samples, hence
the statement that opens this paragraph: the resulting unit is inconsistent
across samples. A clear manifestation of the problem is the fact that the sum
of all RPKM values will vary from sample to sample.

Inter-feature differential expression analysis within samples will still work. Running the risk of weakening the underlying message of this article, I will even go as far as stating that even inter-sample RPKM comparisons will, in most cases, be ok. In practical terms, since we typically look at thousands of features and it is reasonable to assume that a majority of them will not be differentially expressed between samples, we are probably not that far off when using RPKM. We can probably trust most conclusions drawn in such terms in the literature. The inconsistency is still real and is there, though, so it needs fixing.

### TPM is the solution (kind of...)

A possible fix emerges from the explanation of RPKM's inconsistency above.
Transcript length distributions can vary between samples and that introduces
an inconsistency in RPKM calculation. So, instead of dividing by the total
number of sequenced reads (\( R \)) for a given sample, we can instead divide
by the total number of length-normalized reads. This unit, __t__ranscripts
__p__er __m__illion fragments sequenced (TPM), was introduced originally in
Li *et al.* Bioinformatics, 2010
and further supported in
Wagner *et al.* Theory Biosci., 2012.
Compare the TPM equation with CPM and RPKM's, re-written here for direct comparison.
For simplicity, sample normalization factor scaling of library size is excluded.
$$ \text{CPM}_i = \dfrac{r_i}{\sum_{j=1}^n r_j} \cdot 10^6 $$
$$ \text{RPKM}_i = \dfrac{r_i}{\sum_{j=1}^n r_j} \cdot \dfrac{1}{\dfrac{l_i}{10^3}} \cdot 10^6 $$
$$ \text{TPM}_i = \dfrac{r_i}{\sum_{j=1}^n \dfrac{r_j}{l_j}} \cdot \dfrac{1}{l_i} \cdot 10^6 $$

The interpretation is that \( TPM_i \) represents the number of transcripts of feature \(i \) found in a total number of one million transcripts. Interestingly, we can easily convert RPKM values to TPM by simply dividing each feature's RPKM by the sum of the RPKM values of all features and multiplying by one million. In fact, TPM is really just RPKM scaled by a constant to correct the sum of all values to 1 million. $$ \begin{aligned} \text{TPM}_i &= \dfrac{\text{RPKM}_i}{\sum_{j=1}^n \text{RPKM}_j } \cdot 10^6 \end{aligned} $$

### Calculating TPM

Armed with this information, we can convert RPKM to TPM in two different
ways: from pre-calculated RPKM, by diving by the sum of RPKM values, or directly
from the normalized counts. Below I have written some example R code to
calculate TPM starting from RPKM values computed using edgeR's

` ````
# Start from feature summarization data (as an edgeR DGEList object)
# Filter-out low expression features, perform sample normalization and calculate RPKM
y <- edgeR::calcNormFactors(y)
y <- edgeR::rpkm(y)
# Calculate TPM from RPKM
tpm_from_rpkm <- function(x) {
rpkm.sum <- colSums(x)
return(t(t(x) / (1e-06 * rpkm.sum)))
}
tpm <- tpm_from_rpkm(y)
```

And the following is a simple implementation of TPM computation starting from edgeR-normalized counts.

` ````
# Start from feature summarization data (as an edgeR DGEList object)
# Filter-out low expression features and perform sample normalization
y <- edgeR::calcNormFactors(y)
# Calculate TPM from normalized counts
calc_tpm <- function(x, gene.length) {
x <- as.matrix(x)
len.norm.lib.size <- colSums(x / gene.length)
return((t(t(x) / len.norm.lib.size) * 1e06) / gene.length)
}
tpm_man <- calc_tpm(y, gene.length = y$genes$Length)
```

TPM accounts for the lengths of all transcripts found in the sample and
thus brings us one step closer to a good solution. This unit is more stable
across samples than RPKM. However, the abundances of all transcripts will
still change between samples, meaning that the denominator of the TPM equation
(sum of length-normalized read counts) is still variable. This way, TPM is
**not directly comparable between experiments** either.

### Conclusion

So if no unit is perfect, how are we supposed to analyse differential feature expression? Maybe a better unit will be developed at some point, but in the mean time CPM is an appropriate unit for inter-sample comparison. When comparing feature expression within samples, TPM should be used instead of RPKM/FPKM.

If you want to dive deeper in the subject of why TPM is a better unit than RPKM/FPKM, besides the papers I cited above I would recommend two very informative blog posts published in Bits of DNA (by Prof. Lior Pachter) and The Farago (by Prof. Pachter's student Harold Pimentel).