RNA-seq: compare two technical replicates
1
0
Entering edit mode
Marianna • 0
@7cc5052f
Last seen 8 weeks ago
Italy

Dear all,

I'd like your opinion on the best way to compare two technical replicates.

First, I give you some preliminary information on the experiment. I isolated total RNA from the same sample (mammalian cell line) by usining the same kit (Qiagen) but two different protocols for DNA elimination (using the Qiagen gDNA eliminator column or the Qiagen DNase). Looking at the RNA at the TapeStation, the first protocol led to the complete loss of the RNA 28s subunit. This was weird...so I decided to go deeper.

As the final goal is the gene expression analysis (DE), RNA-seq libraries for these two samples were constructed and sequenced. I obtained a different coverage for the two libraries: 28 M reads and 44 M reads. Most probably due to the flowcell balancing that was not optimal (as declared by the service provider).

Now I'd like to know if the two libraries are similar (theorically they should be identical, but we know that this is impossible!), and which is the best one. I checked the number of expressed genes counting the genes with a minimum of x raw reads, but this is cearly impacted by the coverage. So the 44M llibrary showed a higher number of genes.

Do you have any suggestion to evaluate the similarity between the two samples and to determine which is the best one?

Thank you all Best Marianna

RNASeq • 655 views
0
Entering edit mode
@gordon-smyth
Last seen 3 minutes ago
WEHI, Melbourne, Australia

You can measure the agreement between the two by fitting an edgeR model with just an intercept and estimating the dispersion (biological coefficient of variation). If the BCV is close to zero then the two replicates are completely equivalent apart from RNA-seq sequencing variability.

This analysis doesn't show which is best though. An MD plot between them might give some more insight.

0
Entering edit mode

Hi Gordon Smyth ,

I used the following code:

x <- read.delim("input_2rep.txt",row.names="Gene", stringsAsFactors=FALSE, header = TRUE)
group <- factor(c(1,2))
dge <- DGEList(counts=x, group=group)
keep <- filterByExpr(dge)
table(keep)
dge <- dge[keep, , keep.lib.sizes=FALSE]
dge_norm<- calcNormFactors(dge)
plotMD(cpm(dge_norm, log=TRUE), column=1)
design <- model.matrix(~ group)
dge_norm <- estimateDisp(dge_norm, design)


But I got the following error:

 In estimateDisp.default(y = y$counts, design = design, group = group, : No residual df: setting dispersion to NA  It seems impossible with two samples. Isnt'it? The plotMD is weird. Points should be "packed" around 1, if samples are very similar. isnt'it? Thank you for your suggestions Marianna ADD REPLY 0 Entering edit mode ADD REPLY 0 Entering edit mode My suggestion was to fit an edgeR model with just an intercept, not a two-group model as you have tried to do. You need design <- model.matrix(~1)  The MD plot shows that the first sample has fairly high counts for lots of genes that are completely absent in the second sample. More generally, the first sample has similar representation to the second for most genes but reduced representation for sizeable minority of genes compared to the second. Correction I should have said simply design <- matrix(1,2,1). ADD REPLY 0 Entering edit mode Thnk you for your reply, Gordon Smyth I tried with design <- model.matrix (~1)  but it gives the following error. I tried in several ways but I didn't work. Why?? This is the error: Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05, : nrow(design) disagrees with ncol(y)  As far as the plotMD, don't you think that sample 2 has fairly high counts for lots of genes that are completely absent in the sample 1? This should be explained by the different coverage of the two RNAseq libraries (test 1: about 28M reads; test 2: more than 40M reads). Isn't it? I attach the two plots below. Thank you for your time! Marianna ADD REPLY 0 Entering edit mode As to the design, now I think the code is ok x <- read.delim("input_2rep.txt",row.names="Gene", stringsAsFactors=FALSE, header = TRUE) group <- factor(c(1,2)) dge <- DGEList(counts=x, group=group) keep <- filterByExpr(dge) table(keep) dge <- dge[keep, , keep.lib.sizes=FALSE] dge_norm<- calcNormFactors(dge) plotMD(cpm(dge_norm, log=TRUE), column=1) design <- model.matrix (~1, data=dge_norm$samples)
dge_norm <- estimateDisp(dge_norm, design) dge_norm <-
estimateDisp(dge_norm,design,robust=TRUE) dge_norm\$common.dispersion


I got 0.550701, thus BCV is 0.74. Really high....

1
Entering edit mode

The result is clear. Inter-replicate dispersion is huge. Replicates are not equivalent, not even close. Second replicate is better in that it gives more complete coverage. The difference is not explainable merely by coverage.

If this has helped you and answered your question, you could consider clicking on the "thumbs up" icon next to my answer above.

0
Entering edit mode

Thank you Gordon Smyth

I got the point. Just the last question: how can you say that the difference between the two samples is not explanable merely by coverge? The fact the some genes (most probably those with low expression level), are represented only in sample 2, cannot be attributed to coverage? I mean...it's just a matter of probability to catch a gene. And the higher is the coverage the higer is the probabability. Isn't it?

I have to choose one of the two replicas, to establish the best protocol for RNA isolation. And the doubt is the following: sample 2 is better than sample 1 just because of the coverage, or there are some other characteristics that concurr to attribute to sample 2 the best score?

1
Entering edit mode

If the difference in coverage was merely due to chance and sequencing depth then the BCV would be equal to 0. The BCV already adjusts for chance variation associated with sequencing. It represents the difference in true expression levels after sequencing variability has been removed, and it is independent of sequencing depth.

It is practically impossible that you can get so many genes highly expressed in one sample and completely absent in the other merely by chance.

0
Entering edit mode

Gordon Smyth sorry for annoying you...this is really the last one!

When you say "so many genes", what do you mean?

I counted the number of genes expressed (reads > 10 or >20) after TMM normalization. These are the results:

Do you think these differences (about 1500 genes) reflect more than a different coverage?

Thank you again

Marianna

1
Entering edit mode

When you say "so many genes", what do you mean?

Look at the MD plot. Why do you think there are so many points below the 0-line in first MD plot and above the 0-line in the second plot? Especially look at the diagonal line below the first plot, which is made up of genes with count=0 in the first sample. Many of these have very high counts in the second sample, as can be seen from their position in the plot.

And as I have said before, the high BCV tells you rigorously that this imbalance is not due to chance or sampling depth.

I counted the number of genes expressed (reads > 10 or >20)

I don't think that gives you any useful information. When I said "expressed" I meant literally count > 0. When I said "completely absent" I meant count=0.

I believe that I have given you advice that clearly elucidates the difference between your samples. If you prefer to do your own analysis that is fine. But I will not respond any more in this thread.