Removing the duplicated genes before limma::voom by using limma::avereps
1
1
Entering edit mode
Yang Shi ▴ 10
@ea61ff7a
Last seen 10 months ago
Zheng Zhou

Dear Communities

As the author of limma suggested, the DEG analysis of RSEM expected count should be done by limma::voom from a computational point of view (proper transformation for differential expression in normalized log expression data). But there are nearly 2000 genes when the Ensembl ID was transformed to the official gene symbol. So I wonder whether limma::avereps could/should be conducted to remove duplicated genes before feeding to voom. Any suggestions would be great appreciation!

# limma::avreps
rt = as.matrix(Expected_count)
rownames(rt) = rt[,1]
exp = rt[,2:ncol(rt)]
dimnames = list(rownames(exp),colnames(exp))
dataMatrix = matrix(as.numeric(as.matrix(exp)),
                    nrow = nrow(exp), 
                    dimnames = dimnames)
dataMatrix %<>%  
  limma::avereps() %>% 
  as.data.frame() %>% 
  rownames_to_column('symbol')
RSEM avereps voom limma RNASeq • 1.7k views
ADD COMMENT
2
Entering edit mode
@gordon-smyth
Last seen 14 minutes ago
WEHI, Melbourne, Australia

avereps() is not applicable to RNA-seq data because read counts can't be averaged. Since RSEM quantified the data by Ensembl ID, it would be best to keep the Ensembl IDs through the analysis.

ADD COMMENT
0
Entering edit mode

Thanks to your reply and your package sir! Regarding "avereps() is not applicable to RNA-seq data because read counts can't be averaged". Is that reasonable for RNA-Seq(FPKM/TPM) were averaged by limma::avereps? If the Ensembl ID was kept though the analysis, is that reasonable to remove the duplicated genes after DEG analysis? Thanks so much!

ADD REPLY
1
Entering edit mode

You are assuming that the official gene symbol uniquely identifies a gene. This is not a valid assumption! It is also not necessarily a valid assumption about Ensembl Gene IDs (or NCBI Gene IDs), but both of those annotation services make a concerted effort to collapse duplicate identifiers, and old versions get deprecated, so it is much easier to know if a given ID has been deprecated or not, based on the build. Gene symbols do get deprecated as well, but they have meaning to biologists and tend to persist as an alias. As an example, there is a 1:1 mapping between NCBI Gene IDs and HGNC Symbols:

> z <- mapIds(org.Hs.eg.db, keys(org.Hs.eg.db), "SYMBOL", "ENTREZID", multiVals = "list")
'select()' returned 1:1 mapping between keys and columns

Even though there is a 1:1 mapping between the Gene IDs and symbols, the symbols aren't unique

> table(table(unlist(z)))

    1     2     3 
66081     9     1

Those may arise from duplicate NCBI Gene IDs, or they may be actual duplicates.

But this isn't true of Aliases!

> zz <- mapIds(org.Hs.eg.db, keys(org.Hs.eg.db), "ALIAS", "ENTREZID", multiVals = "list")
'select()' returned 1:many mapping between keys and columns
> table(sapply(zz, length))

    1     2     3     4     5     6     7     8     9    10    11    12    13 
38855 10699  5930  3932  2610  1613   916   604   340   202   144    91    57 
   14    15    16    17    18    19    20    21    22    24    28    32 
   38    20    14    13     6     6     4     2     2     2     1     1

An alias is just a deprecated gene symbol and depending on when and how the IDs were mapped to symbols, you may have a mixture or symbols and aliases. I personally never rely on gene symbols unless I have no other choice.

ADD REPLY
0
Entering edit mode

Thanks for this so detailed answer! Regarding "never rely on gene symbols unless I have no other choice", it means that you prefer to use ensemble ID or entrezID? The reason why I always want to transform those kinds od ID to official gene symbol is that people are usually familiar with this other than ensemble ID or others. And if I keep those duplicated names of gene, those would not be applicable for R in most cases, like set as row.names. So limma::avereps is used to removing those duplicated names for RNA-Seq(TPM/FPKM) and microarray. I'm not sure RNA-Seq(TPM/FPKM) would be okay by this function. Could you please give me more suggestions? Thanks so much!

ADD REPLY
1
Entering edit mode

I prefer either Ensembl or NCBI Gene IDs. I am agnostic to the choice, but they are superior to gene symbols. I always also provide the gene symbols, because as you note, they have meaning to many biologists. But I don't rely on the gene symbols other than to provide them to end users. In other words, the row.names for my counts matrix will be either Ensembl or NCBI Gene IDs, and the 'genes' list item of my DGEList will be a data.frame that includes both the Gene ID and symbol.

When I then output the data, there will be columns containing the Gene ID, the symbol, the gene description, and then the statistics. Biologists can then peruse the list and find genes of interest. If there are duplicated gene symbols, then so be it. Given the unique Gene ID, the biologist can go to the relevant annotation source and figure out why two different Gene IDs have the same symbol.

As an example, I just analyzed some RNA-Seq data from Daphnia magna. Two of the top genes had different NCBI Gene IDs and different directionality, but the same exact gene name (mucin-2). NCBI says those two genes are on completely different chromosomes. Apparently there are lots of genes called mucin-2 in Daphnia! All those mucin-2 genes may all be the same exact gene, and maybe that's common for Daphnia, but I don't know that to be the case and am unwilling to make that assumption. Gene symbols and gene names are meant to be descriptive, whereas Gene IDs are meant to provide a unique identifier, and it's safer IMO to stick with the IDs.

ADD REPLY
0
Entering edit mode

Got that, very thankful sir!

ADD REPLY

Login before adding your answer.

Traffic: 554 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6