Dear all,
I am dealing with a mixture of high- and low-complexity RNA-seq libraries, all from human brain tissue: some RNA samples had low RIN values (~4) and others high (>8). For the most affected samples I found a high proportion of multimapping reads (~40%) and a different distribution of gene biotypes (90% of reads mapped to protein-coding genes in unaffected samples vs 60% in highly degraded ones). The rest mapped to ribosomal RNA and other biotypes (e.g. miscRNA). I am wondering how confident I can be in a differential expression analysis (control vs patient) on such a variable dataset. Is there a way to control for the level of degradation, for example with a design like ~degradation+condition? Or should I use surrogate variable analysis to remove unwanted variation? When I plot a PCA, the most degraded samples cluster far apart from the rest.
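To make the ~degradation+condition idea concrete, here is a minimal sketch of what an additive design estimates for a single gene. It is plain Python with invented numbers, not the output of any RNA-seq package: the sample labels, the effect sizes and the helper names `solve` and `fit_ols` are all assumptions made for illustration.

```python
# Toy illustration of an additive design (intercept + degradation + condition)
# fitted to one gene by ordinary least squares. All numbers are invented.

def solve(a, b):
    """Solve the square system a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is not full rank (covariates confounded)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(design, y):
    """Least-squares coefficients via the normal equations (X'X) beta = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical labels: degraded 1 = low RIN, condition 1 = patient.
# Both conditions contain both quality levels, so the effects are separable.
degraded = [0, 0, 1, 1, 0, 0, 1, 1]
condition = [0, 0, 0, 0, 1, 1, 1, 1]

# Hypothetical log-expression: baseline 5, degradation effect -2, condition effect +1.
log_expr = [5.0 - 2.0 * d + 1.0 * c for d, c in zip(degraded, condition)]

# One row per sample: intercept, degradation indicator, condition indicator.
design = [[1.0, d, c] for d, c in zip(degraded, condition)]
intercept, degradation_effect, condition_effect = fit_ols(design, log_expr)
print(intercept, degradation_effect, condition_effect)
```

In this toy setup the condition effect is estimated separately from the degradation effect because degraded samples occur in both groups. If every degraded sample were also a patient, the two columns would be identical and `solve` would raise, which is the situation where adding a degradation term cannot help.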
Thanks,
Thanks, Aaron.
In my case PC1 separates the outliers from the rest (50% of the variance). Which is better: correcting for PC1 in my model design, or just adding a high/low quality covariate as you suggested?
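One check that may help choose between the two is to look at what PC1 actually tracks. The sketch below uses made-up PC1 scores and labels (not my real data) and a hand-written `pearson` helper to compare how strongly PC1 correlates with the quality flag and with the condition. If PC1 is essentially the quality flag, both adjustments should behave similarly; if PC1 also correlates with condition, putting it in the design risks removing some of the signal of interest.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical PC1 scores for 8 samples (invented for illustration).
pc1 = [-3.1, -2.8, 9.5, 10.2, -3.4, -2.9, 8.8, -2.6]
low_quality = [0, 0, 1, 1, 0, 0, 1, 0]  # 1 = low RIN
patient = [0, 0, 0, 0, 1, 1, 1, 1]      # 1 = patient, 0 = control

r_quality = pearson(pc1, low_quality)
r_condition = pearson(pc1, patient)
print(f"PC1 vs quality:   r = {r_quality:.2f}")
print(f"PC1 vs condition: r = {r_condition:.2f}")
```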
Another option would be to try this methodology:
http://biorxiv.org/content/early/2016/09/09/074245
They present 'quality surrogate variable analysis' (qSVA) for this purpose.
I think I will try the different options (including removing the outliers) and see what I get.