Question

DESeq2 Error NA values

0

Entering edit mode

bharata1803 ▴ 60

@bharata1803-7698

Last seen 5.1 years ago

Japan

Hello all,

I tried to load my count matrix with this command:

DESeqObj <- DESeqDataSetFromMatrix(countData = RoundedReadCount,colData = metadata,design = ~ Condition)

After that, I got an error like this:

converting counts to integer mode
Error in validObject(.Object) : 
  invalid class “DESeqDataSet” object: NA values are not allowed in the count matrix
In addition: Warning message:
In `mode<-`(`*tmp*`, value = c(395L, 231L, 243L, 866L, 3L, 45L,  :
  NAs introduced by coercion

I have check in my count matrix there is no NA values. Probably you can give me some suggestion about this problem. Thank you.

deseq2 • 7.8k views

ADD COMMENT • link updated 8.9 years ago by Michael Love 41k • written 8.9 years ago by bharata1803 ▴ 60

0

Entering edit mode

show us the output of

head(RoundedReadCount)

ADD REPLY • link 8.9 years ago andrew.j.skelton73 ▴ 370

0

Entering edit mode

I take the screenshot of the result for that command : https://www.dropbox.com/s/3gv92qbud15314f/capture1.png?dl=0

ADD REPLY • link 8.9 years ago bharata1803 ▴ 60

0

Entering edit mode

I tried to find error by subset the RoundedReadCount for the first 3 data, it succes. After I move to the fourth data, it is error. So, my guess is the data like the 4th row is causing the problem.

This is the data from 4th row:

ENST00000001008

866

29

14976

31998649144

3842

1090941

1720

568

1751

4694

2177

5825

8672

15447

It seems there is a really huge number there. Do you think that's the problem? The number is illogical. I tried to visualize the bam files in that gene, the result shouldn't that big. Probably have some problem in express.

ADD REPLY • link 8.9 years ago bharata1803 ▴ 60

0

Entering edit mode

How did you do your counting?

ADD REPLY • link 8.9 years ago andrew.j.skelton73 ▴ 370

0

Entering edit mode

31 Billion reads mapping to a single transcript is not realistic.... something has gone wrong somewhere.

ADD REPLY • link 8.9 years ago andrew.j.skelton73 ▴ 370

0

Entering edit mode

I use eXpress software. I choose the effective count for the read count matrix. I don't know why its result is strange. After that, I tried to use estimate count fro eXpress and the result is normal. I'm still trying to figure it out why effective count from eXpress is weird.

ADD REPLY • link 8.9 years ago bharata1803 ▴ 60

0

Entering edit mode

One thing that I should note is that you have transcript IDs in there. DESeq2 is not designed for transcript level counts, only gene level counts. You can do HTSeq_Count -> DESeq2 on your data to do gene level differential expression. Transcript level is a whole other ball game, as you'd be dealing with fractional counts. Tools such as Salmon or Kallisto can return transcript level quantification, or the Tuxedo pipeline. I'd definitely go back through your eXpress command (if you choose to go down that route), and check everything. Otherwise, check out some of the other software I suggested.

ADD REPLY • link 8.9 years ago andrew.j.skelton73 ▴ 370

0

Entering edit mode

Actually I had tried Tuxedo workflow. The problem was Cufflinks result gave me the gene expression for entire locus which is in my gene of interest, the locus has 9 different genes.So, the gene expression level is actually the accummulation of 9 different genes. That's why I changed to count based analysis. At first I use HT-seq to generate the readcount, but the result was weird because it gave too many ambigous result. Many of the rea count from HT-seq was 0 because this ambiguity.

ADD REPLY • link 8.9 years ago bharata1803 ▴ 60

0

Entering edit mode

Well, there's not a great deal you can do about the counting methods if your alignment has ambiguously mapped reads. Look at Salmon (https://github.com/COMBINE-lab/salmon) It'll take your fastq files and quantify them to the transcript level, you can also get it to give you gene level quantification. it's a completely different method to the ones you've used thus far, it uses k-mer counting to map reads.

ADD REPLY • link 8.9 years ago andrew.j.skelton73 ▴ 370

0

Entering edit mode

Well, actually, I hope I can use a tool to get the gene level expression directly. I realized that what I want to do is find the correlation between gene expression and its promoter expression. Probably if I use transcript level, after I get the expression level, I will find another problem when I need to accummulate the expression level of different transcript for the same gene. Do you have any suggestion? Actually I think cufflinks is really suitable for this. It's just that the result report for the whole locus it's really unfortunate. My gene of interest is quite unique. All of the 9 gene family, each of it has 5 exon which only the first exon is different (all other 4 exon is used together by 9 different genes). No wonder HT-seq get ambigous mapping because of this. HT-seq will fail to decide which gene is the owner of the read in 4 exon which is used together by 9 different genes.

ADD REPLY • link 8.9 years ago bharata1803 ▴ 60

0

Entering edit mode

Salmon uses the accumulated transcript expression to give the gene level expression. I'd say Salmon is your best bet in this case. The alternative is Kallisto, but they haven't implemented gene expression level yet. I've come into so many problems with what Cufflinks actually does, that I'd much rather trust Salmon's approach. My final point would be that if you've got ambiguous reads, then k-mer counting is going to be a much better approach for you.

ADD REPLY • link 8.9 years ago andrew.j.skelton73 ▴ 370

0

Entering edit mode

Thank you. I will try Salmon then. I also read in other forum about StringTie. Probably I will try both to make sure but maybe if you have any opinion or experience about that tool.

ADD REPLY • link 8.9 years ago bharata1803 ▴ 60

0

Entering edit mode

Again, stringtie works on the same principle as cufflinks. I don't know how it'll deal with ambiguously mapped reads.

ADD REPLY • link 8.9 years ago andrew.j.skelton73 ▴ 370

0

Entering edit mode

I have finished generating Salmon output. It was quick. So, for the gene level analysis, do I need to generate raw count and use DESeq2 to analyze the expression level? At what step di I need to sum the transcript level so that I can get the gene level expression? Is it after I load DESeq2 or before that? (I think if I need to sum before I load to DESeq it would be hard because I need to know the transcript id for every gene). Thank you for your suggestion.

ADD REPLY • link 8.9 years ago bharata1803 ▴ 60

0

Entering edit mode

Yep, it's pretty impressive. To get salmon to produce gene level summarised counts, you need to add the -g parameter into your salmon command, and provide a gene map. A gene map is basically a tab delimited text file with two columns, one contains the transcript ID, the other contains the Gene ID. Run salmon quant --help-reads and look at the entry for -g. Salmon should then produce 2 count files, one for transcripts and one for genes, the gene one you can put into DESeq2

ADD REPLY • link 8.9 years ago andrew.j.skelton73 ▴ 370

score 0 · Answer 1 · 2015-06-09

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 day ago

United States

Just to restate, DESeq2 is designed for gene-level analysis, and we recommend using software such as htseq-count, summarizeOverlaps or featureCounts to obtain for each gene a count of fragments which can be uniquely aligned to that gene.

"Many of the read count from htseq was 0 because this ambiguity."

I'm not sure exactly what problem you are describing, but if you want to sum all the reads/fragments for a gene family, you could edit the annotation file so that these genes all make up one feature. However if you want to test each gene in the family separately, it is not recommended as input to DESeq(2) to use non-uniquely assigning reads. You can search the archive to find many posts on this topic.

ADD COMMENT • link 8.9 years ago Michael Love 41k

0

Entering edit mode

The HTseq-count read is ambigous for my gene family. After I tried to find the problem, I think it is caused by the ambiguity in the HTSeq process. My guess is in my gene of interest, the gene family contains of 5 exons. All of it use the same 4 last exon and only the first exon is different. I'm thinking that the usage by 9 gene together can make HTseq confuse which one should the reads of the last 4 exon count assigned to. So, Htseq-count, summarizeOverlaps, and featureCounts is already out of option. I already tried to use them and most of the read is 0.

What I want to do is not get the gene family expression, but the single gene in a gene family (consist of 9 genes). For info, the cufflinks workflow produce result but for whole gene family, not for a single gene. My problem right now is to get gene expression from trasncript expression. eXpress result is in the form of transcript expression count reads. My gene of interest actaully has 2 transcript variant and another gene that I want to check has more variant. I'm kinda confused to get the total gene expression level. I imagine the gene expression level is the accummulation of transcript level but I'm not really sure whether I can directly add count from eXpress or not and in which step I should do the addition.

ADD REPLY • link 8.9 years ago bharata1803 ▴ 60