Question: Analyse only protein_coding genes
gravatar for martin.weihrauch
8 months ago by
martin.weihrauch10 wrote:


I'm using featureCounts() to generate the count matrix of my RNAseq experiment and edgeR to do the DGE analysis.

Now, during the filtering step I am removing low-count genes like this:

fc <- featureCounts(..)

counttable <- fc$counts

y <- DGEList(counts=counttable)

keep <- rowSums(cpm(y) > 1) >= 4 # Keep genes only if they have more than x count per million (calculated to require at least 6 reads on average) AND which are found in at least x replicates per group (which is the minimal amount of samples in a condition)

y <- y[keep, keep.lib.sizes=FALSE] # Re-calculate library sizes after count reductions

Out of 53k+ entries, only around 14k remain after that. But in those 14k there are still genes I want to remove from the analysis, such as mtRNAs, noncoding RNAs etc..

I've tried using biomaRt or editing my annotation GTF file to narrow it down to protein_coding genes, but it didn't work so far.

Is there a way to remove genes that aren't protein_coding? A nifty way to subset my GTF file (Mus_musculus.GRCm38.91.gtf).

I had no luck in bash using grep "protein_coding" ... 


EDIT: I have tried to limit my annotation .GTF file to only protein_coding lines using: $ awk '/protein_coding/ { print }' Mus_musculus.GRCm38.91.gtf > test.gtf

The test.gtf generated is ~60MB smaller and a grep mtRNA returns no hits, but a grep lincRNA returns some, but those lincRNAs have protein_coding in their biotype. I will try using this filtered GTF file in featureCounts.


Load annotation file //Annotation_files/test.gtf ||
||    Features : 719464                                                       ||
||    Meta-features : 22073                                                   ||
||    Chromosomes/contigs : 41 

So the number of meta-features was reduced from ~50k to 22k (hopefully all protein_coding annotations).

ADD COMMENTlink modified 8 months ago • written 8 months ago by martin.weihrauch10

It does depend of course on what gene annotation system you are using. It would help potential responders to explain up-front that you are using Ensembl rather than Entrez Gene Ids.

ADD REPLYlink modified 8 months ago • written 8 months ago by Gordon Smyth35k
gravatar for Aaron Lun
8 months ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

This question doesn't really have much to do with edgeR. You're really asking about the availability of annotation indicating whether a gene is protein-coding or not. This is probably most easily achieved via biomaRt, see A: BiomaRt: protein coding transcript identification. You could also use the presence/absence of a protein_id tag in EnsDb packages, see the "Querying protein features" vignette for ensembldb.

ADD COMMENTlink modified 8 months ago • written 8 months ago by Aaron Lun21k

Thanks for your reply. I've tried to use biomaRt using the approach you linked before as well. I was able to get annotation of all protein_coding genes with Enembl_gene_ids as well, but I was having trouble because my countmatrix somehow contains ~500 annotations more than what biomaRt returns, so the files don't match in length and I couldn't get it to work (my GTF file is the ensembl release 91). I tried playing with match() or checking for identical ensembl_gene_ids and removing the resulting "NA" values etc., but to no avail so far. Additionally, when the biomaRt-query apparently completely jumbled the order of my input gene_ids, so I end up with a gene_id and biotype list of completely different order than what I put in.

I think I could be more successful with the EnsDb packages as you wrote, even though they are "older" they should contain all protein_coding annotations as I don't care for up-to-date non-coding annotation in this case.

One of my experimental conditions introduces large read counts for mtRNA genes which I want to exclude from my DGE analysis with edgeR. So I have to remove those and other non-coding genes before I estimate dispersions etc...

If I could modify this "keep" file : keep <- rowSums(cpm(y) > 1) >= 4   ... to additionally set to FALSE all non-coding genes that would work as well. This filtering works great for excluding low-count genes (many of which are non-coding), but the extremely high-count mtRNAs in some of my samples remain.

ADD REPLYlink modified 8 months ago • written 8 months ago by martin.weihrauch10

Changes in the ordering and length of the biomaRt output are completely solvable with match:

ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
my.genes <- c("ENSG00000284190", "ENSG00000171658", "Something else")
out <- getBM(attributes=c("ensembl_gene_id", "gene_biotype"), 
    filters="ensembl_gene_id", values=my.genes, mart=ensembl)
out <- out[match(my.genes, out$ensembl_gene_id),] # now in correct order/length.

Then keep is similarly easy to modify, as shown below:

keep <- keep & (out$gene_biotype=="protein_coding" & !$gene_biotype))
ADD REPLYlink modified 8 months ago • written 8 months ago by Aaron Lun21k

Thank you Aaron, this worked flawlessly.

E: Just in case, how could I exclude specific genes? e.g. all genes in this vector c(ENSMUSG00000064370, ENSMUSG00000064341, ENSMUSG00000064351, ENSMUSG00000064363, ENSMUSG00000064345, ENSMUSG00000064367, ENSMUSG00000064368).

ADD REPLYlink modified 8 months ago • written 8 months ago by martin.weihrauch10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 297 users visited in the last hour