Question: Analyse only protein_coding genes
gravatar for martin.weihrauch
11 months ago by
martin.weihrauch10 wrote:


I'm using featureCounts() to generate the count matrix of my RNAseq experiment and edgeR to do the DGE analysis.

Now, during the filtering step I am removing low-count genes like this:

fc <- featureCounts(..)

counttable <- fc$counts

y <- DGEList(counts=counttable)

keep <- rowSums(cpm(y) > 1) >= 4 # Keep genes only if they have more than x count per million (calculated to require at least 6 reads on average) AND which are found in at least x replicates per group (which is the minimal amount of samples in a condition)

y <- y[keep, keep.lib.sizes=FALSE] # Re-calculate library sizes after count reductions

Out of 53k+ entries, only around 14k remain after that. But in those 14k there are still genes I want to remove from the analysis, such as mtRNAs, noncoding RNAs etc..

I've tried using biomaRt or editing my annotation GTF file to narrow it down to protein_coding genes, but it didn't work so far.

Is there a way to remove genes that aren't protein_coding? A nifty way to subset my GTF file (Mus_musculus.GRCm38.91.gtf).

I had no luck in bash using grep "protein_coding" ... 


EDIT: I have tried to limit my annotation .GTF file to only protein_coding lines using: $ awk '/protein_coding/ { print }' Mus_musculus.GRCm38.91.gtf > test.gtf

The test.gtf generated is ~60MB smaller and a grep mtRNA returns no hits, but a grep lincRNA returns some, but those lincRNAs have protein_coding in their biotype. I will try using this filtered GTF file in featureCounts.


Load annotation file //Annotation_files/test.gtf ||
||    Features : 719464                                                       ||
||    Meta-features : 22073                                                   ||
||    Chromosomes/contigs : 41 

So the number of meta-features was reduced from ~50k to 22k (hopefully all protein_coding annotations).

ADD COMMENTlink modified 11 months ago • written 11 months ago by martin.weihrauch10

It does depend of course on what gene annotation system you are using. It would help potential responders to explain up-front that you are using Ensembl rather than Entrez Gene Ids.

ADD REPLYlink modified 11 months ago • written 11 months ago by Gordon Smyth35k
gravatar for Aaron Lun
11 months ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

This question doesn't really have much to do with edgeR. You're really asking about the availability of annotation indicating whether a gene is protein-coding or not. This is probably most easily achieved via biomaRt, see A: BiomaRt: protein coding transcript identification. You could also use the presence/absence of a protein_id tag in EnsDb packages, see the "Querying protein features" vignette for ensembldb.

ADD COMMENTlink modified 11 months ago • written 11 months ago by Aaron Lun21k

Thanks for your reply. I've tried to use biomaRt using the approach you linked before as well. I was able to get annotation of all protein_coding genes with Enembl_gene_ids as well, but I was having trouble because my countmatrix somehow contains ~500 annotations more than what biomaRt returns, so the files don't match in length and I couldn't get it to work (my GTF file is the ensembl release 91). I tried playing with match() or checking for identical ensembl_gene_ids and removing the resulting "NA" values etc., but to no avail so far. Additionally, when the biomaRt-query apparently completely jumbled the order of my input gene_ids, so I end up with a gene_id and biotype list of completely different order than what I put in.

I think I could be more successful with the EnsDb packages as you wrote, even though they are "older" they should contain all protein_coding annotations as I don't care for up-to-date non-coding annotation in this case.

One of my experimental conditions introduces large read counts for mtRNA genes which I want to exclude from my DGE analysis with edgeR. So I have to remove those and other non-coding genes before I estimate dispersions etc...

If I could modify this "keep" file : keep <- rowSums(cpm(y) > 1) >= 4   ... to additionally set to FALSE all non-coding genes that would work as well. This filtering works great for excluding low-count genes (many of which are non-coding), but the extremely high-count mtRNAs in some of my samples remain.

ADD REPLYlink modified 11 months ago • written 11 months ago by martin.weihrauch10

Changes in the ordering and length of the biomaRt output are completely solvable with match:

ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
my.genes <- c("ENSG00000284190", "ENSG00000171658", "Something else")
out <- getBM(attributes=c("ensembl_gene_id", "gene_biotype"), 
    filters="ensembl_gene_id", values=my.genes, mart=ensembl)
out <- out[match(my.genes, out$ensembl_gene_id),] # now in correct order/length.

Then keep is similarly easy to modify, as shown below:

keep <- keep & (out$gene_biotype=="protein_coding" & !$gene_biotype))
ADD REPLYlink modified 11 months ago • written 11 months ago by Aaron Lun21k

Thank you Aaron, this worked flawlessly.

E: Just in case, how could I exclude specific genes? e.g. all genes in this vector c(ENSMUSG00000064370, ENSMUSG00000064341, ENSMUSG00000064351, ENSMUSG00000064363, ENSMUSG00000064345, ENSMUSG00000064367, ENSMUSG00000064368).

ADD REPLYlink modified 11 months ago • written 11 months ago by martin.weihrauch10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 305 users visited in the last hour