Question

AveLogCPM of a library subset

1

Entering edit mode

martin.weihrauch ▴ 20

@martinweihrauch-13664

Last seen 6.0 years ago

I've mapped mouse-RNA-Seq data to a genome using the Gencode M17 comprehensive annotation .gtf file. Then I did read-counting using featureCounts, but I supplied a subset of the annotation, containing only lncRNA annotations. This leads to only 1-2% of all reads to align successfully (usually less than 1 Million total reads of a library). I think this is to be expected given the relatively low expression levels of lncRNA genes.

Now this really affects the AveLogCPM values and the low-count filtering using the cpm() function of edgeR.

Is this way of read counting using the lncRNA subset a valid approach (statistically speaking)? This really greatly affects p-values and logCPM.

How could I adjust low-expression filtering to accomodate this?

edger rnaseq • 954 views

ADD COMMENT • link updated 6.0 years ago by Aaron Lun ★ 28k • written 6.0 years ago by martin.weihrauch ▴ 20

score 3 · Accepted Answer · 2018-04-26

3

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 1 hour ago

The city by the bay

It seems like you're making life difficult for yourself.

Firstly, you should use the entire annotation during counting - this ensures that featureCounts will be able to detect and remove ambiguous read assignments. For example, if a lncRNA overlaps with a protein-coding gene on the same strand, you cannot distinguish reads from the two transcripts - featureCounts will recognise this and avoid assigning any reads in the overlapping (exonic) regions. However, if you only provide the function with a subset of the annotation, all reads in the overlapping regions will be incorrectly assigned to the lncRNA even if they originated from the protein-coding gene.

Once you do that, most of your problems with the interpretation of the CPM will go away. You can filter on the CPM as described in the user's guide, and if you so desire, further filter on genes annotated as lncRNAs. The resulting count matrix can then be used in the rest of the edgeR pipeline. (I should add that, if you were using filterByExpr, the high-abundance filter should not be greatly affected by prior subsetting for lncRNAs. This is because the filter threshold is chosen at a CPM corresponding to a pre-specified absolute count; if you subset by feature, the CPM threshold will naturally adjust itself to preserve the threshold on the absolute count. And obviously, for retained genes, the actual counts in the matrix do not change upon subsetting.)

The real issue is how you decide to normalize. If you use only the lncRNA counts in calcNormFactors, you are assuming that most lncRNAs are not DE across your conditions. If you use all (high-abundance) counts in calcNormFactors, you avoid this assumption, but you then make a different assumption that scaling biases for lncRNAs are the same as those in the rest of the transcriptome. The choice between these two strategies can potentially have quite an effect on your DE analysis, e.g., if lncRNAs are globally upregulated in one condition. Personally, I would go for the second strategy; if I recall correctly, lncRNAs are not too different from protein-coding mRNAs in their biophysical properties, so it would be surprising to see different sequencing biases between the two classes.

If the sample-to-sample fold changes in the effective library sizes (product of the normalization factor and library size) are similar between the two strategies above, then there should not be a major difference in the p-values. The absolute scale of the normalization should not have a major effect on the calculation of the p-values, as that information is irrelevant when comparing across samples.

ADD COMMENT • link 6.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you very much Aaron, I will follow your advice. This helps me greatly, as I was struggling between different approaches, each giving different results keeping me awake at night.

I think the lncRNAs should behave the same as protein-coding mRNAs, being poly-adenylated and all. Thus I will go with your second suggestion.

The only headache will be the filtering for lncRNAs, as some lncRNAs are annotated as "protein-coding" and thus won't allow simple matching for that biotype.

I used this command in linux to get all the lncRNA ensembl IDs from the gencode lncRNA .gtf file:

awk -F "\t" '$3 == "gene" { print $9 } ' gencode.vM17.long_noncoding_RNAs.gtf | awk -F ";" '{ print $1 }' > lncRNAs.txt

The resulting lncRNAs.txt file I loaded into Excel and removed the remaining stuff like gene id and the "-s and transcript version. Now I got a .txt file containing only the ensemblIDs. Maybe I can use this in match() to filter.

I would then run everything as per usual, count with the comprehensive .gtf file containing everything and keep filtering and normalization the same:

# Prefiltering low read count genes
y <- DGEList(counttable) # counttable is simply fc$counts from fc <- featureCounts()
keep <- rowSums(cpm(y) > 1) > 3 # Outfilter low-count/unexpressed genes
summary(keep)
y <- y[keep, keep.lib.sizes=FALSE] # Re-calculate library sizes after count reductions
y <- calcNormFactors(y, method = "TMM") # Calculate the library normalization factors*

After these steps I could use match() to get only the lncRNAs for differential expression calling (remove all but the lncRNAs from object y).

### Subset lncRNAs here ###
lncRNA_subset <- read.csv("~/Annotation_files/GENCODE/lncRNA.csv", header = FALSE)
head(lncRNA_subset)

index <- rownames(y$counts) %in% lncRNA_subset$V1
summary(index)
remain <- rownames(y$counts)[index]

y$counts <- y$counts[remain, ] # Here I replace with only the lncRNA matches

Then:

y <- estimateDisp(y, design, robust = TRUE)
fit <- glmQLFit(y, design, robust = TRUE)

x <- glmTreat(fit, contrast = c(something), lfc = 0.2)

de_lncRNAs <- topTags(x, n = 9001, p.value = 0.05)

Would this give me all differentially expressed lncRNAs with a false-discovery rate of 5% and removing any that don't make the lfc-threshold of at least 0.2 logFC? As I understood it, glmTreat will test whether the entire confidence interval for the fold changes pass that threshold, making it a better choice than simply removing by logFC manually afterwards.

ADD REPLY • link 6.0 years ago martin.weihrauch ▴ 20

1

Entering edit mode

I would use all high-abundance genes throughout edgeR, and only select for lncRNAs by subsetting fit prior to calling topTags. This allows more information to be shared between genes for empirical Bayes shrinkage in estimateDisp and glmQLFit. Subsetting will then ensure that the BH correction and output only involves lnRNAs.

ADD REPLY • link 6.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Perfect, thanks, I will do that. Greatly appreciate your response.

When modifying fit like this:

index <- rownames(fit$counts) %in% lncRNA_subset$V1
summary(index)
remain <- rownames(fit$counts)[index]

fit$counts <- fit$counts[remain, ]

I get the following error when calling glmTreat(fit, contrast(0, 1)):

Error in Ops.CompressedMatrix(offset.old, offset.adj) : 
  CompressedMatrix dimensions should be equal for binary operations

Maybe I have to change fit in a different way.

ADD REPLY • link 6.0 years ago martin.weihrauch ▴ 20

0

Entering edit mode

Just subset it directly, fit[remain,]. Don't modify the internal fields as things get out of sync.