Question: Remove reads from raw fastq
8 months ago, ferbecneu wrote:

Hi, I'm analysing sequencing data and comparing distinct samples. However, two of my conditions have very different read numbers, and that is causing me trouble during the analysis. I would like to remove reads from some of my samples, but the removed reads should be chosen at random so that I don't skew my data. Does anybody have an idea how I can do that?

Thank you very much!


biostrings shortread • 230 views
Answer: Remove reads from raw fastq
8 months ago, Gordon Smyth (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia) wrote:

I don't know what analysis you are conducting or what sort of sequencing you are doing, but I would be horrified to see anyone doing what you propose to do. It would be far better to improve your analysis methods so that the analysis can handle unequal sequencing depths without skewing the results. Generally speaking, such analysis methods do exist.

Having said that, if you have a matrix of read counts, and want to reduce the library size for one or more of the samples, it is easy and quick to do that using the thinCounts() function of the edgeR package. That is equivalent to randomly selecting rows of the raw FastQ file but very, very much more efficient.

For example, if `counts' is a matrix of read counts, then

counts2 <- thinCounts(counts, target.size=min(colSums(counts)))

will create a new matrix for which all the columns have the same total count. The thinning is done in such a way as to simulate random selection of reads.
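To make the equivalence with random read selection concrete, here is a minimal base-R sketch of what thinning a single column amounts to: expand the counts into one label per read, keep a random subset without replacement, and count again. The helper `thin_column` and the toy matrix are hypothetical and only for illustration; they are not part of edgeR, and in practice thinCounts() is the function to use.

```r
# Hypothetical helper (NOT part of edgeR): thin one column of counts
# by keeping `target` reads chosen at random without replacement.
thin_column <- function(x, target) {
  stopifnot(is.numeric(x), target <= sum(x))
  # One label per read: gene index i is repeated x[i] times
  reads <- rep.int(seq_along(x), times = x)
  # Randomly keep `target` reads (sampling positions avoids sample()'s
  # special handling of length-one vectors)
  kept <- reads[sample.int(length(reads), size = target)]
  # Count the surviving reads per gene
  tabulate(kept, nbins = length(x))
}

set.seed(1)  # only to make the illustration reproducible

# Toy count matrix (made-up numbers): rows are genes, columns are samples
counts <- matrix(c(100, 50, 50,
                    10,  5,  5),
                 nrow = 3,
                 dimnames = list(paste0("gene", 1:3), c("deep", "shallow")))

# Thin every column down to the smallest library size
target <- min(colSums(counts))
counts2 <- apply(counts, 2, thin_column, target = target)
rownames(counts2) <- rownames(counts)

colSums(counts2)  # every column now totals `target`
```

Expanding counts into individual reads is only practical for toy data like this; for real libraries with millions of reads, thinCounts() is the better choice.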


Thank you, I think what you answered is just what I need to do, but I have a problem. I'm working with a FASTQ file whose reads I read with readDNAStringSet() and then pass to thinCounts(); however, I get this: Error in colSums(x) : 'x' must be numeric. I know that maybe this is too basic, but I'm just starting with bioinformatics. Can you help me figure out how to solve it? Thanks!

Reply written 8 months ago by ferbecneu

No, I can't help because I have no idea what sort of analysis you are trying to do. Why would you run readDNAStringSet()? I don't know.

Regarding thinCounts(), the error message seems pretty self-explanatory. thinCounts() operates on a numeric matrix of counts, but that's not what readDNAStringSet() produces.
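As a rough illustration of the type mismatch, the base-R sketch below contrasts non-numeric sequence-like data with a numeric count matrix. The character matrix `seqs` is only a stand-in for sequence data (an assumption made so that the example runs without Biostrings), and the counts are made-up numbers.

```r
# Stand-in for sequence data: a character matrix (illustrative assumption,
# not an actual DNAStringSet)
seqs <- matrix(c("ACGT", "GGCA"), ncol = 1)

# colSums() rejects non-numeric input; capture the message instead of stopping
msg <- tryCatch(colSums(seqs), error = function(e) conditionMessage(e))
print(msg)

# A count matrix, by contrast, is numeric: rows are genes, columns are samples
counts <- matrix(c(12, 0, 7, 30, 2, 15), nrow = 3)

is.numeric(seqs)    # FALSE
is.numeric(counts)  # TRUE
colSums(counts)     # library size of each sample
```

Checking is.numeric() and dim() on an object before passing it to thinCounts() is a quick way to confirm that it is a count matrix and not sequence data.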

As you have suggested, this is indeed pretty basic. One always needs to pay a bit of attention to what sort of arguments functions accept and what output they produce.

Reply written 8 months ago by Gordon Smyth