Question

How to treat genes in RNAseq analyses

mlosada323 • 0

@mlosada323-23593

Hi everyone,

I'm new to RNAseq analysis and I have a general question about how to treat genes in DESeq2. I first estimated transcript counts with salmon and then imported them into DESeq2 with tximport. Transcripts were mapped to genes using the EnsDb.Hsapiens.v86 database, gene-level counts were estimated and normalized, and those genes became the units (rows) of my DESeq2 analyses. But many of those genes belong to subfamilies of the same gene family, or are motifs of the same gene, etc. Hence my question: when, if ever, would it be better to group genes by family, motif, or any other higher genomic hierarchy? I can see how grouping genes and increasing counts per gene may be statistically beneficial (e.g., less variance, fewer low-count genes), but is it biologically correct? Any thoughts, guidance, links to previous comments, etc., would be highly appreciated.

Then assuming you want to combine genes from the same family or motif and analyze them in DESeq2, how do you do that?

Best regards

Marcos

deseq2 • 454 views

ADD COMMENT • link updated 4.1 years ago by James W. MacDonald 66k • written 4.1 years ago by mlosada323 • 0

If you followed a proper pipeline, the software already knows that genes and transcripts contain repetitive elements, and it handles them so that everything is counted properly.

ADD REPLY • link 4.1 years ago swbarnes2 ★ 1.4k

score 0 · Answer 1 · 2020-05-26

James W. MacDonald 66k

@james-w-macdonald-5106

It's not clear what you mean by 'grouping genes'. If by that you mean something similar to what you did when summarizing the transcripts to the gene level, that doesn't seem reasonable at all. When summarizing the transcripts you do obscure some of the possibly different functions or other biological effects that a different transcript may have, but that is entirely different from assuming that all the genes in a gene family are somehow interchangeable.

A different way of 'grouping genes' is to do some sort of gene set test, of which there are many. The main goal is to provide evidence that a set of genes that are part of a functional group (gene pathway, etc.) is being perturbed. I don't know if there are gene set tests available in DESeq2, although my recollection is that there aren't. You could use things like the goseq package for GO testing, or any of the gene set tests in edgeR/limma (camera, romer, roast, etc.), depending on how you want to combine the genes and what null hypothesis you might want to test against.
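To illustrate the general idea of a competitive gene set test, here is a minimal sketch in plain Python. It is not the algorithm used by camera, romer, or roast, and it is not limma or edgeR code: it is a simple rank-sum comparison of per-gene statistics inside versus outside a set. The gene names, the statistics, and the `rank_sum_z` helper are all made up for the example, and the sketch assumes there are no tied statistics.

```python
import math

def rank_sum_z(stats, gene_set):
    """Compare ranks of per-gene statistics inside vs. outside a gene set.

    Returns a z-score from the normal approximation to the Wilcoxon
    rank-sum statistic. Assumes no tied statistics.
    """
    # Rank genes from the smallest statistic (rank 1) to the largest.
    ordered = sorted(stats, key=stats.get)
    ranks = {gene: i + 1 for i, gene in enumerate(ordered)}

    in_set = [g for g in stats if g in gene_set]
    n1 = len(in_set)
    n2 = len(stats) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("need genes both inside and outside the set")

    # Rank sum of the set, with its mean and SD under the null hypothesis
    # that set genes are ranked no differently from the other genes.
    w = sum(ranks[g] for g in in_set)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - mean_w) / sd_w

# Hypothetical per-gene statistics (e.g. a test statistic per gene).
stats = {"g1": -1.5, "g2": -0.7, "g3": -0.2, "g4": 0.1,
         "g5": 0.4, "g6": 1.8, "g7": 2.3, "g8": 3.0}
pathway = {"g6", "g7", "g8"}  # hypothetical gene set

z = rank_sum_z(stats, pathway)
print(round(z, 3))
```

Note that each gene keeps its own statistic here. The set is tested as a group, but the genes are never merged into a single count.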

ADD COMMENT • link 4.1 years ago James W. MacDonald 66k

Correct. There are some posts with code for using goseq with DESeq2. Alternatively, use limma or edgeR with roast, camera, etc.

ADD REPLY • link 4.1 years ago Michael Love 42k

Thanks for your comment, James. I wasn't referring to combining different genes in a pathway; I was referring to agglomerating counts of motifs or members of the same gene subfamily. For example, in my dataset I detected 20 motifs of the gene ADAMTS (ADAM Metallopeptidase With Thrombospondin Type 1), ADAMTS1 to ADAMTS20. Similarly, I detected 10 members of the ABCA gene subfamily (ATP Binding Cassette Subfamily A), ABCA1 to ABCA10. I could analyze these 30 genes separately, as DESeq2 outputs them, or somehow agglomerate all the transcript counts of the twenty ADAMTS and ten ABCA genes into only two genes, ADAMTS and ABCA, respectively. I hope this clarifies the issue.

ADD REPLY • link 4.1 years ago mlosada323 • 0

Right. But that doesn't make any sense to me. All the genes in a family are similar but, critically, not the same. For example, the ADAMTS family all share structural similarities, but they do different things. Cribbing directly from Wikipedia, we have

ADAMTS (short for a disintegrin and metalloproteinase with thrombospondin motifs) is a family of multidomain extracellular protease enzymes.[1] 19 members of this family have been identified in humans, the first of which, ADAMTS1, was described in 1997.[2] Known functions of the ADAMTS proteases include processing of procollagens and von Willebrand factor as well as cleavage of aggrecan, versican, brevican and neurocan, making them key remodeling enzymes of the extracellular matrix. They have been demonstrated to have important roles in connective tissue organization, coagulation, inflammation, arthritis, angiogenesis and cell migration.[3][4] Homologous subfamily of ADAMTSL (ADAMTS-like) proteins, which lack enzymatic activity, has also been described.[5] Most cases of thrombotic thrombocytopenic purpura arise from autoantibody-mediated inhibition of ADAMTS13.

The people who are responsible for saying what genes are, and what they do, seem to think all the ADAMTS genes are different things, mainly because they all have different targets for their proteolytic action. So why would you want to aggregate to a single signal? And at what point do you stop aggregating? Do you lump in the ADAMTSL proteins as well? And are you just going to aggregate based on the HUGO symbols? What if you have two genes that look similar but are completely unrelated?

And how would you explain that to anybody else? What's the rationale for doing the combining, and how do you then interpret any differences?
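To make the concern concrete, here is a minimal sketch in plain Python (not DESeq2 code) using made-up counts. The gene symbols are real, but the numbers, the `family_of` helper, and its rule of stripping trailing digits from the symbol are hypothetical, chosen only to show how summing family members can hide changes that go in opposite directions.

```python
import re
from collections import defaultdict

# Made-up counts per gene as (control, treated). Not real data.
counts = {
    "ADAMTS1": (100, 400),   # up 4-fold in this toy example
    "ADAMTS13": (400, 100),  # down 4-fold in this toy example
    "ABCA1": (50, 55),
    "ABCA7": (60, 58),
}

def family_of(symbol):
    """Hypothetical grouping rule: strip trailing digits from the symbol."""
    return re.sub(r"\d+$", "", symbol)

def aggregate_by_family(gene_counts):
    """Sum the counts of all genes that share a family label."""
    totals = defaultdict(lambda: [0, 0])
    for symbol, (control, treated) in gene_counts.items():
        fam = family_of(symbol)
        totals[fam][0] += control
        totals[fam][1] += treated
    return {fam: tuple(pair) for fam, pair in totals.items()}

def fold_change(pair):
    """Treated over control, with no normalization or shrinkage."""
    control, treated = pair
    return treated / control

family_counts = aggregate_by_family(counts)

for symbol, pair in counts.items():
    print(symbol, pair, fold_change(pair))
for fam, pair in family_counts.items():
    print(fam, pair, fold_change(pair))
```

In this toy example the two ADAMTS genes change 4-fold in opposite directions, yet the aggregated ADAMTS row has identical totals in both conditions, so the family-level signal shows no change at all. The symbol-based rule also has to decide arbitrarily where a family ends: under it, ADAMTSL1 would land in a separate "ADAMTSL" group.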