Differential gene expression using R
lakshmi9c • 0
@lakshmi9c-23931

I am working on RNA-Seq data analysis to find differential gene expression between 2 conditions. I am using the ballgown package in R and have successfully loaded the data. However, I have the following questions:

1. Is it necessary to remove low variance transcripts while doing differential gene expression? And why?
2. Why do we need to remove low-abundance genes and low-variance transcripts?
3. How do I get the gene name and gene ID in ballgown without using the stattest() function?

deseq2 edger normalization limma • 476 views
@gordon-smyth
WEHI, Melbourne, Australia

I'm a bit puzzled about what analysis you're planning to do. Is this question a followup to your previous question about tuxedo and StringTie?

If you're using ballgown, why not follow the ballgown documentation and ballgown functions?

You've tagged your question to get help from the authors of several DE packages (limma, edgeR and DESeq2), but it isn't clear what relevance these packages have to your question. Ballgown is specifically designed for isoform-level DE whereas limma, edgeR and DESeq2 are specifically not designed for isoform-level DE, so it isn't clear what you're planning to do.

None of the four packages (ballgown, limma, edgeR or DESeq2) makes any use of variance filtering, so why the questions about it?

If you do want to do a DE analysis using a particular one of the packages mentioned, just follow the DE workflows that are provided for that package. Many workflows explain the filtering recommended in detail, for example:

The edgeR function filterByExpr implements the recommended filtering approach for limma and edgeR.
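For anyone wanting to see what that looks like in practice, here is a minimal sketch of expression-based filtering with filterByExpr. The count matrix, gene names, sample names and group labels are all simulated placeholders, not data from this thread, and it assumes gene-level counts rather than ballgown output.

```r
library(edgeR)

set.seed(1)
# Hypothetical toy data: 1000 genes x 6 samples, two groups of three
counts <- matrix(rnbinom(6000, mu = 20, size = 2), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("s", 1:6)))
# Make the first 200 genes very lowly expressed so the filter has something to remove
counts[1:200, ] <- rpois(1200, lambda = 0.5)
group <- factor(rep(c("A", "B"), each = 3))

y <- DGEList(counts = counts, group = group)

# Logical vector, one entry per gene: TRUE = enough counts to be worth testing
keep <- filterByExpr(y, group = group)

# Drop the filtered genes and recompute library sizes from the remaining counts
y <- y[keep, , keep.lib.sizes = FALSE]

table(keep)
```

Note that the filter looks only at count size relative to library size and group size, not at variance across samples.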


Indeed, if you want help with the Ballgown-specific question, then please show a reproducible example. First, be sure that you have followed the Ballgown vignette(s) and that all of your commands have run as expected.


Thank you for the fast reply! I will keep proper tagging in mind for future questions. My concern is with using the ballgown package for differential expression analysis of my RNA-Seq data: I need to find the differentially expressed genes between two sample conditions. As you suggested, I looked for manuals and protocols for running Ballgown in R, but I couldn't find a standard protocol. I have run all the commands as described in the paper (https://www.nature.com/articles/nprot.2016.095), but I don't understand the need to remove low-variance transcripts. I'm also unable to proceed after getting the list of gene names and am unsure what the next step is. How should I do normalisation for Ballgown? I'm stuck at this step and would much appreciate any help on how to proceed in order to get DE genes. Thanks in advance!


You don't have to do any normalization per se. From the help for stattest:

Library size adjustment is performed by default by using the sum of the log nonzero expression measurements for each sample, up to the 75th percentile of those measurements. This adjustment can be disabled by setting libadjust=FALSE. You can use mod and mod0 to specify alternative library size adjustments.
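To make that description concrete, here is a rough base-R sketch of the per-sample quantity the help text describes. It is an illustration of the wording only, not ballgown's actual code: the expression matrix is simulated, and the log base and the +1 offset are assumptions on my part.

```r
set.seed(42)
# Hypothetical FPKM-like matrix: 500 transcripts x 4 samples, with some zeros
expr <- matrix(rexp(2000, rate = 0.1), nrow = 500,
               dimnames = list(NULL, paste0("sample", 1:4)))
expr[sample(length(expr), 300)] <- 0

# One adjustment value per sample, following the help text:
# sum of the log nonzero measurements, up to their 75th percentile
lib_adj <- apply(expr, 2, function(x) {
  lognz <- log2(x[x != 0] + 1)   # log of the nonzero measurements (base 2 and +1 are assumed)
  q3 <- quantile(lognz, 0.75)    # 75th percentile of those log values
  sum(lognz[lognz <= q3])        # sum of the values at or below that percentile
})

lib_adj
```

A per-sample value like this would enter the model as a covariate, which is why no separate normalization step is needed before calling stattest.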


Filter to remove low-abundance genes. One common issue with RNA-seq data is that genes often have very few or zero counts. A common step is to filter out some of these. Another approach that has been used for gene expression analysis is to apply a variance filter. Here we remove all transcripts with a variance across samples less than one:

bg_chrX_filt = subset(bg_chrX, "rowVars(texpr(bg_chrX)) > 1", genomesubset=TRUE)


And to be fair there is a function in genefilter for removing low variance genes, so at one point that was a thing that people talked about doing. Although I'm not sure it's much of a thing these days.


Here we remove all transcripts with a variance across samples less than one:

I guess that, by doing this, they are essentially removing genes whose values are virtually constant across all samples, which I am not sure is ideal, unless these are values of 0 or other low count values, in which case a filter for low counts [not variance] would suffice.

I do recall years ago filtering microarray data based on variance, but that was at the level of the probe-set (where relevant to the array design) and used to remove failed probes.
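A toy example of the difference between the two filters, in base R. The three rows and the abundance cutoff of 1 are made up purely for illustration; only the variance cutoff of 1 comes from the quoted protocol.

```r
# Hypothetical expression values for three transcripts across six samples
expr <- rbind(
  low_flat  = rep(0.2, 6),                  # barely expressed, constant
  high_flat = rep(50, 6),                   # well expressed, constant
  high_de   = c(20, 22, 21, 60, 63, 58)     # well expressed, differs between groups
)

# Variance filter as in the quoted protocol: keep rows with variance > 1
row_var  <- apply(expr, 1, var)
keep_var <- row_var > 1

# Simple abundance filter: keep rows with mean expression > 1 (cutoff is an assumption)
keep_abund <- rowMeans(expr) > 1

data.frame(variance = row_var, keep_var, keep_abund)
```

The variance filter discards high_flat along with low_flat, whereas the abundance filter removes only the row that is essentially unexpressed.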


Yes, I think you're right, but the paper was referenced only in a comment added 10 days after my answer. Variance filtering is incompatible with limma, edgeR or DESeq2, as I've pointed out many times over the years.

Anyway, I can't answer questions about ballgown or the associated protocol paper. I only responded to this question originally because it was tagged with limma and edgeR.