Dear All,
I have processed and analyzed RNA-seq data before. It was pretty straightforward with STAR, featureCounts and edgeR.
I received a new dataset of 36 samples with single-end (SE) sequence files. I expected 36 files with ~30 million reads each. Instead, I got 108 files (18 samples per lane, with ~10 million reads each). There are 12 groups with 3 biological replicates each.
It turned out that each sample was sequenced 3 times (technical replicates). My understanding is that RNA-seq does not need technical replicates, only biological replicates.
When I asked why, the answer was: "This was safer way to do the analysis and reduce bias than analyzing each sample just 1 time 30 M reads. Raw count should be a sum of the lanes with the replicated samples: Lane 1, 2,3 and respectively Lane 4,5,6. The FastQ data should be looked at similarly as sum of the replicate lane values and this way the data still considered as 30 million reads per sample."
My question is whether this is the correct way to analyze it: process all 108 files, sum the raw counts of the technical replicates, and feed the result into edgeR as 36 samples in 4 experimental groups?
Your advice and help are highly appreciated.
Thanks a lot,
A
Hi Ryan,
Thanks for your reply. I would like to clarify it further.
Are you suggesting that having many files with lower counts as technical replicates and then combining them (adding up the raw counts of 3 files) is a better way to go than having one file with ~30 million reads? Along those lines, if I want to detect e.g. isoforms and I need 100 million reads, is it then better to split the 100 million reads into technical replicates (for instance 5 x 20 million) and combine the raw counts? I guess my concern is the statement "This was safer way to do the analysis and reduce bias than analyzing each sample just 1 time 30". I have not seen a published paper stating that a sample should be split into technical replicates because it reduces bias. Would you please comment on this and point me to a reference? Thanks a lot.
In and of itself, whether you have all your reads in one file or in multiple files makes very little difference. If each read (or read pair, for paired-end data) is processed independently, the other reads in the same file should have no effect. The summed counts after processing each file separately should be very similar to the counts from processing a combined file.
As an aside, I say "very similar" rather than "identical", which is what they would be if each read's fate was truly independent. This is because some aligners share information across reads to improve detection of indels or structural variations (not that it matters for assigning reads into genes). Duplicate removal will also depend on other reads, though it's not recommended for routine RNA-seq analyses anyway. Obviously, if you're doing something like de novo transcript assembly, the other reads will have a major effect on the identity and counts of the assemblies, in which case it might be better to pool first to obtain more stable transcripts. Similar considerations apply for UMI data, where pooling before processing is more appropriate.
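Going back to the summing step itself, here is a minimal sketch of adding up per-file counts into per-sample counts. It assumes the gene counts for each file have already been read into plain dictionaries; the file names, sample labels and count values are made up for illustration. If you stay in R, I believe edgeR's sumTechReps() does the equivalent on a DGEList.

```python
from collections import defaultdict


def sum_technical_replicates(counts_by_file, sample_of_file):
    """Sum raw gene counts over all files that belong to the same sample."""
    summed = defaultdict(lambda: defaultdict(int))
    for file_name, gene_counts in counts_by_file.items():
        sample = sample_of_file[file_name]
        for gene, count in gene_counts.items():
            summed[sample][gene] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {sample: dict(genes) for sample, genes in summed.items()}


# Hypothetical toy input: two samples, three lanes each, two genes.
counts_by_file = {
    "S1_L1": {"geneA": 10, "geneB": 0},
    "S1_L2": {"geneA": 12, "geneB": 1},
    "S1_L3": {"geneA": 9, "geneB": 0},
    "S2_L1": {"geneA": 3, "geneB": 20},
    "S2_L2": {"geneA": 4, "geneB": 22},
    "S2_L3": {"geneA": 2, "geneB": 25},
}
# Assumption: the sample label is the part of the file name before "_".
sample_of_file = {name: name.split("_")[0] for name in counts_by_file}

per_sample = sum_technical_replicates(counts_by_file, sample_of_file)
print(per_sample)
```

The summed table then has one column per biological sample and can be treated like any ordinary count matrix downstream.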
The more pertinent choice regarding the number of technical replicates occurs when planning the experimental design for sequencing. In my institute, we barcode and multiplex sequencing of multiple samples across several lanes of the flow cell. This means that each sample has multiple technical replicates, one per lane. The alternative would be that you sequence each sample on a single lane, such that each sample corresponds to a single technical replicate. We tend to dislike the second option as this increases our susceptibility to differences in sequencing efficiency between lanes. Such effects would cancel out with the multiplexing option; I don't know of any citation, but it seems to be standard practice at most sequencing facilities.
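To make the cancelling-out argument concrete, here is a toy calculation. It only models differences in read yield between lanes, which is one simple reading of "sequencing efficiency"; the lane yields and sample names are invented, and real lane effects can be more subtle than this.

```python
# Hypothetical read yields for three lanes of differing efficiency.
lane_yield = [24_000_000, 30_000_000, 36_000_000]
samples = ["S1", "S2", "S3"]

# Design A: each sample is sequenced on its own lane,
# so its depth is tied to that lane's yield.
one_lane_per_sample = dict(zip(samples, lane_yield))

# Design B: every sample is multiplexed across all lanes
# and receives an equal share of each lane.
multiplexed = {
    sample: sum(y // len(samples) for y in lane_yield)
    for sample in samples
}

print("one lane per sample:", one_lane_per_sample)
print("multiplexed:        ", multiplexed)
```

In the first design the lane differences are confounded with the samples, while in the second every sample sees the same mixture of lanes.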
So I suppose the statement about reducing bias is correct, though I suspect that it mostly refers to how the technical replicates were generated experimentally (assuming multiplexing was done) rather than how they were analyzed computationally. From a bioinformatics point of view, I prefer having multiple technical replicates, simply because it's easier to parallelize operations.
Before adding the raw counts across lanes, I would also make an MDS plot of all 108 files, labeled by an index of the biological sample identity and colored by lane, to assess the similarity between technical replicates from the same biological sample. They should overlap on the same spot.
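If you want a quick numerical check alongside the plot, something like the sketch below could work. Note that this is not an MDS plot: it is a simpler stand-in that computes log-CPM profiles and checks that each file's nearest neighbour comes from the same biological sample. The counts, file names and the naming convention (sample label before "_") are assumptions for illustration.

```python
import math


def log_cpm(counts, prior=0.5):
    """log2 counts-per-million, with a small prior count to avoid log(0)."""
    lib_size = sum(counts)
    return [math.log2((c + prior) / (lib_size + 1.0) * 1e6) for c in counts]


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbours(profiles):
    """Map each file to the other file with the closest log-CPM profile."""
    nearest = {}
    for name, profile in profiles.items():
        distances = [
            (euclidean(profile, other_profile), other)
            for other, other_profile in profiles.items()
            if other != name
        ]
        nearest[name] = min(distances)[1]
    return nearest


# Hypothetical toy counts; genes are in the same order in every file.
raw = {
    "S1_L1": [100, 50, 10, 5],
    "S1_L2": [110, 45, 12, 4],
    "S2_L1": [10, 80, 90, 40],
    "S2_L2": [12, 85, 95, 38],
}
profiles = {name: log_cpm(counts) for name, counts in raw.items()}
nearest = nearest_neighbours(profiles)

# Files whose closest match belongs to a different biological sample.
mismatched = [
    f for f, n in nearest.items() if f.split("_")[0] != n.split("_")[0]
]
print("files not closest to their own technical replicate:", mismatched)
```

Any file that shows up in the mismatched list would be worth a closer look before its counts are summed with the others.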