Question

Bias correction in single end experiment for DEG and DET

yohann.nedelec • 0

Hello,

I'd like to get some advice about analyses I'd like to improve.

I'm concerned about bias (GC content in particular) when comparing transcript and gene expression between groups of samples.

My objectives are:

Identify DE genes and DE transcripts
Eliminate some bias before doing eQTL and sQTL mapping

About my data: 80 libraries in each of the two groups, ~30M reads in single end

Currently, I take the output from RSEM and pass it directly to voom to correct for known batch effects between samples (mainly flowcell effects).

Could you please point me in a better direction than this?
Should I apply tximport first?
Would your method, alpine, work in my case (can it work with single-end data)?

Thank you for your help,
Regards,

score 1 · Answer 1 · 2016-06-20

hi Yohann,

For GC content bias at the gene level, you can use the Bioconductor packages cqn or EDASeq, followed by any of the downstream statistical packages (DESeq2, edgeR, limma, etc.). I believe both packages can give you either an offset matrix for statistical analysis or a normalized, bias-corrected matrix for EDA. I don't know whether your eQTL pipelines can accept offsets, but an offset is a simple thing for a linear model to accommodate.

I believe you could also use cqn and EDASeq with estimated transcript counts.

Now, your RSEM to limma-voom pipeline may be perfectly fine as is, and you don't have to use the above tools if the GC dependence is explained mostly by batch terms. You can figure this out by running cqn or EDASeq, making the GC dependence plot, and coloring lines by batch. If nearly all the variation is across batch and not within batch, then I wouldn't change your current pipeline.
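To make the "across batch vs. within batch" idea concrete, here is a rough numerical sketch in plain Python. It is not what cqn or EDASeq compute, and it does not replace looking at their plots: it simply fits a straight-line slope of log counts against GC content for each sample, then asks what share of the variation in those slopes lies between batches. The function names, sample names, and numbers are all made up for illustration.

```python
from statistics import mean

def gc_slope(gc, log_counts):
    """Least-squares slope of log counts on GC content for one sample."""
    gc_bar, y_bar = mean(gc), mean(log_counts)
    sxy = sum((g - gc_bar) * (y - y_bar) for g, y in zip(gc, log_counts))
    sxx = sum((g - gc_bar) ** 2 for g in gc)
    return sxy / sxx

def between_batch_fraction(slopes, batches):
    """Share of the variance in per-sample slopes that lies between batches."""
    grand = mean(slopes)
    total = sum((s - grand) ** 2 for s in slopes)
    if total == 0:
        return 0.0  # all samples share the same slope; nothing to partition
    by_batch = {}
    for s, b in zip(slopes, batches):
        by_batch.setdefault(b, []).append(s)
    between = sum(len(v) * (mean(v) - grand) ** 2 for v in by_batch.values())
    return between / total

# Toy data (hypothetical): GC content of 4 genes, log counts for 4 samples.
gc = [0.35, 0.45, 0.55, 0.65]
log_counts = {
    "s1": [5.0, 5.2, 5.4, 5.6],
    "s2": [6.0, 6.25, 6.45, 6.7],
    "s3": [5.5, 5.4, 5.3, 5.2],
    "s4": [4.8, 4.7, 4.55, 4.45],
}
batch = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}

samples = sorted(log_counts)
slopes = [gc_slope(gc, log_counts[s]) for s in samples]
fraction = between_batch_fraction(slopes, [batch[s] for s in samples])
print(f"between-batch share of slope variance: {fraction:.3f}")
```

A share close to 1 would correspond to the case where the GC dependence is mostly a batch effect. A single slope per sample is a crude summary, so treat this as a sanity check alongside the plot, not a substitute for it.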

You can use tximport, but this is really a convenience function for reading in transcript quantifications and summarizing to the gene level. RSEM does this itself already.

alpine doesn't support single-end data yet. I hope to spend more time expanding the features and adding more documentation later this year (and adding it to Bioconductor).

score 0 · Answer 2 · 2016-06-21

yohann.nedelec • 0

Thanks a lot for your answer Michael,

About correcting for length and GC content biases at the transcript level, my understanding is that I first have to calculate the GC content and length of each transcript and feed that information to EDASeq.
Is this approach correct, or are there caveats that I'm missing?

hi Yohann,

(quick note about the site, you can add Comments/Replies to thread a conversation instead of Answers which are for answering the original posted question)

Yes, you would calculate GC content and length and feed these to EDASeq or cqn. Pointers for doing this are: extractTranscriptSeqs in the GenomicFeatures package, and sum(width(grl)) if you have a GRangesList of the exons per transcript. But if you have further package-specific questions, you can make a new post and get the advice of the package authors by tagging the post.
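In case the arithmetic itself is unclear, here is a minimal sketch in plain Python, not the Bioconductor route described above. It assumes you already have the exon sequences of each transcript in transcript order. The function name and the toy sequences are hypothetical.

```python
def transcript_length_and_gc(exon_seqs):
    """Return (length, GC fraction) for one transcript.

    Length is the summed exon width, the same quantity as sum(width(grl))
    for one transcript. GC fraction is (G + C) / length, so ambiguous
    bases such as N count toward the length but not toward GC.
    """
    seq = "".join(exon_seqs).upper()
    length = len(seq)
    if length == 0:
        raise ValueError("transcript has no sequence")
    gc_fraction = (seq.count("G") + seq.count("C")) / length
    return length, gc_fraction

# Toy input (hypothetical): exon sequences per transcript.
exons_by_tx = {
    "tx1": ["ATGC", "GGCC"],
    "tx2": ["ATAT", "ATGC", "AA"],
}

features = {tx: transcript_length_and_gc(exons)
            for tx, exons in exons_by_tx.items()}
for tx, (length, gc_fraction) in features.items():
    print(tx, length, gc_fraction)
```

The resulting per-transcript length and GC values are the kind of covariates you would then hand to EDASeq or cqn. Check each package's documentation for the exact input format it expects.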

ADD REPLY • link 7.8 years ago Michael Love 41k