Question

readDGE function error in R


shraddha.adamane • 0

@shraddhaadamane-11285

Hello there,

A very good morning.

I am using edgeR for a comparative gene expression analysis of raw RNA-seq counts. I am using the readDGE function to read in the count files and compute the library sizes.

The resulting table shows that this worked for most of the samples, but lib.size is NA for two of them.

             Files            group  lib.size  norm.factors
12345_exn    12345_exn.txt    1      NA        1
23456_Inv    23456_Inv.txt    1      164265.2  1
34567_DCIS   34567_DCIS.txt   1      172467.7  1
45678_exn    45678_exn.txt    1      NA        1
56789_exn    56789_exn.txt    1      168533.8  1

I have checked the following:

a. All the counts are numbers, not text (and not NA); b. the files have the correct headings; c. the samples are named correctly.

I am at a loss as to what the problem could be. Any suggestions or advice would be greatly appreciated. I am eagerly awaiting an answer.

Shradha.

edger • 1.0k views

updated 9.4 years ago by Aaron Lun ★ 29k • written 9.4 years ago by shraddha.adamane • 0

score 0 · Answer 1 · 2016-08-12

I would guess that the 12345_exn.txt and 45678_exn.txt files are missing some genes that are present in the other files. This causes an NA value to be generated when the counts are collated, which makes sense: if the gene is missing, the function can't know its count. Make sure every file contains the same set of genes, with the same names. If not, you can either remove the offending rows and recalculate the library sizes:

dge <- dge[rowSums(is.na(dge$counts))==0, , keep.lib.size=FALSE]

... or you can set the counts to zero, but only if you know that the missingness represents a count of zero:

dge$counts[is.na(dge$counts)] <- 0
dge$samples$lib.size <- colSums(dge$counts)

Whether or not that is the case depends on the process you used to generate the counts.

P.S. I notice that your library sizes aren't integer. While this is not a problem in and of itself, edgeR is intended to work with counts - either integer read counts, or something like the expected counts from RSEM. You had better not be using CPMs or RPKMs as inputs.
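A quick way to see whether a count matrix contains fractional values is sketched below. The toy matrix and its values are made up and simply stand in for dge$counts. Fractional values alone do not prove that CPMs or RPKMs were used, since expected counts from RSEM are fractional too; the check only flags samples worth a closer look.

```r
# Toy count matrix standing in for dge$counts (values are invented).
counts <- matrix(c(10, 0, 3.5, 7, 2, 1),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("sample1", "sample2")))

# Number of entries per sample that are not whole numbers; NAs are skipped.
non_integer <- colSums(counts != round(counts), na.rm = TRUE)
print(non_integer)
```

Samples with a non-zero tally contain at least one fractional value.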

Edit: Actually, ignore what I said above. readDGE will automatically assign a count of zero to any gene that is not present in a file. So, the only possible reason for getting NA values would be to have them in the file itself.
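To track down such entries, one option is to read each file as text and list the rows whose count does not parse as a number. The following is a minimal sketch in base R, not part of edgeR: find_bad_counts is a hypothetical helper, and it assumes tab-delimited files with a header row, gene IDs in the first column and counts in the second. The demo file is a made-up stand-in for one of the real count files.

```r
# Hypothetical helper: return the rows whose count is missing or non-numeric.
# Assumes a tab-delimited file with a header row and counts in column 2.
find_bad_counts <- function(path, count_col = 2) {
  # Read every column as text so that odd entries are kept for inspection.
  tab <- read.delim(path, header = TRUE, colClasses = "character")
  # Entries that cannot be parsed as numbers become NA here.
  num <- suppressWarnings(as.numeric(tab[[count_col]]))
  tab[is.na(num), , drop = FALSE]
}

# Made-up demo file: geneB has a literal NA, geneC has an empty count field.
demo_file <- tempfile(fileext = ".txt")
writeLines(c("gene\tcount",
             "geneA\t10",
             "geneB\tNA",
             "geneC\t",
             "geneD\t7"),
           demo_file)

bad <- find_bad_counts(demo_file)
print(bad)
```

Running the helper on 12345_exn.txt and 45678_exn.txt (with the path and column adjusted as needed) should show which genes carry the problematic entries.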