Search
Question: DESeq2 Error NA values
0
gravatar for bharata1803
2.4 years ago by
bharata180320
Japan
bharata180320 wrote:

Hello all,

I tried to load my count matrix with this command:

DESeqObj <- DESeqDataSetFromMatrix(countData = RoundedReadCount,colData = metadata,design = ~ Condition)

After that, I got an error like this:

converting counts to integer mode
Error in validObject(.Object) : 
  invalid class “DESeqDataSet” object: NA values are not allowed in the count matrix
In addition: Warning message:
In `mode<-`(`*tmp*`, value = c(395L, 231L, 243L, 866L, 3L, 45L,  :
  NAs introduced by coercion

 

I have check in my count matrix there is no NA values. Probably you can give me some suggestion about this problem. Thank you.

ADD COMMENTlink modified 2.4 years ago by Michael Love14k • written 2.4 years ago by bharata180320

show us the output of 

head(RoundedReadCount)
ADD REPLYlink written 2.4 years ago by andrew.j.skelton73290
I take the screenshot of the result for that command : https://www.dropbox.com/s/3gv92qbud15314f/capture1.png?dl=0

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by bharata180320

I tried to find error by subset the RoundedReadCount for the first 3 data, it succes. After I move to the fourth data, it is error. So, my guess is the data like the 4th row is causing the problem.

This is the data from 4th row:

ENST00000001008 866 29 14976 31998649144 3842 1090941 1720 568 1751 4694 2177 5825 8672 15447

 

It seems there is a really huge number there. Do you think that's the problem? The number is illogical. I tried to visualize the bam files in that gene, the result shouldn't that big. Probably have some problem in express.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by bharata180320

How did you do your counting?

ADD REPLYlink written 2.4 years ago by andrew.j.skelton73290

31 Billion reads mapping to a single transcript is not realistic.... something has gone wrong somewhere. 

ADD REPLYlink written 2.4 years ago by andrew.j.skelton73290

I use eXpress software. I choose the effective count for the read count matrix. I don't know why its result is strange. After that, I tried to use estimate count fro eXpress and the result is normal. I'm still trying to figure it out why effective count from eXpress is weird.

ADD REPLYlink written 2.4 years ago by bharata180320

One thing that I should note is that you have transcript IDs in there. DESeq2 is not designed for transcript level counts, only gene level counts. You can do HTSeq_Count -> DESeq2 on your data to do gene level differential expression. Transcript level is a whole other ball game, as you'd be dealing with fractional counts. Tools such as Salmon or Kallisto can return transcript level quantification, or the Tuxedo pipeline. I'd definitely go back through your eXpress command (if you choose to go down that route), and check everything. Otherwise, check out some of the other software I suggested. 

 

ADD REPLYlink written 2.4 years ago by andrew.j.skelton73290

Actually I had tried Tuxedo workflow. The problem was Cufflinks result gave me the gene expression for entire locus which is in my gene of interest, the locus has 9 different genes.So, the gene expression level is actually the accummulation of 9 different genes. That's why I changed to count based analysis. At first I use HT-seq to generate the readcount, but the result was weird because it gave too many ambigous result. Many of the rea count from HT-seq was 0 because this ambiguity.

ADD REPLYlink written 2.4 years ago by bharata180320

Well, there's not a great deal you can do about the counting methods if your alignment has ambiguously mapped reads. Look at Salmon (https://github.com/COMBINE-lab/salmon) It'll take your fastq files and quantify them to the transcript level, you can also get it to give you gene level quantification. it's a completely different method to the ones you've used thus far, it uses k-mer counting to map reads. 

ADD REPLYlink written 2.4 years ago by andrew.j.skelton73290

Well, actually, I hope I can use a tool to get the gene level expression directly. I realized that what I want to do is find the correlation between gene expression and its promoter expression. Probably if I use transcript level, after I get the expression level, I will find another problem when I need to accummulate the expression level of different transcript for the same gene. Do you have any suggestion? Actually I think cufflinks is really suitable for this. It's just that the result report for the whole locus it's really unfortunate. My gene of interest is quite unique. All of the 9 gene family, each of it has 5 exon which only the first exon is different (all other 4 exon is used together by 9 different genes). No wonder HT-seq get ambigous mapping because of this. HT-seq will fail to decide which gene is the owner of the read in 4 exon which is used together by 9 different genes.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by bharata180320

Salmon uses the accumulated transcript expression to give the gene level expression. I'd say Salmon is your best bet in this case. The alternative is Kallisto, but they haven't implemented gene expression level yet. I've come into so many problems with what Cufflinks actually does, that I'd much rather trust Salmon's approach. My final point would be that if you've got ambiguous reads, then k-mer counting is going to be a much better approach for you. 

ADD REPLYlink written 2.4 years ago by andrew.j.skelton73290

Thank you. I will try Salmon then. I also read in other forum about StringTie. Probably I will try both to make sure but maybe if you have any opinion or experience about that tool. 

ADD REPLYlink written 2.4 years ago by bharata180320

Again, stringtie works on the same principle as cufflinks. I don't know how it'll deal with ambiguously mapped reads. 

ADD REPLYlink written 2.4 years ago by andrew.j.skelton73290

I have finished generating Salmon output. It was quick. So, for the gene level analysis, do I need to generate raw count and use DESeq2 to analyze the expression level? At what step di I need to sum the transcript level so that I can get the gene level expression? Is it after I load DESeq2 or before that? (I think if I need to sum before I load to DESeq it would be hard because I need to know the transcript id for every gene). Thank you for your suggestion.

ADD REPLYlink written 2.4 years ago by bharata180320

Yep, it's pretty impressive. To get salmon to produce gene level summarised counts, you need to add the -g parameter into your salmon command, and provide a gene map. A gene map is basically a tab delimited text file with two columns, one contains the transcript ID, the other contains the Gene ID. Run salmon quant --help-reads and look at the entry for -g. Salmon should then produce 2 count files, one for transcripts and one for genes, the gene one you can put into DESeq2

ADD REPLYlink written 2.4 years ago by andrew.j.skelton73290
0
gravatar for Michael Love
2.4 years ago by
Michael Love14k
United States
Michael Love14k wrote:

Just to restate, DESeq2 is designed for gene-level analysis, and we recommend using software such as htseq-count, summarizeOverlaps or featureCounts to obtain for each gene a count of fragments which can be uniquely aligned to that gene.

"Many of the read count from htseq was 0 because this ambiguity."

I'm not sure exactly what problem you are describing, but if you want to sum all the reads/fragments for a gene family, you could edit the annotation file so that these genes all make up one feature. However if you want to test each gene in the family separately, it is not recommended as input to DESeq(2) to use non-uniquely assigning reads. You can search the archive to find many posts on this topic.

ADD COMMENTlink written 2.4 years ago by Michael Love14k

The HTseq-count read is ambigous for my gene family. After I tried to find the problem, I think it is caused by the ambiguity in the HTSeq process. My guess is in my gene of interest, the gene family contains of 5 exons. All of it use the same 4 last exon and only the first exon is different. I'm thinking that the usage by 9 gene together can make HTseq confuse which one should the reads of the last 4 exon count assigned to. So, Htseq-count, summarizeOverlaps, and featureCounts is already out of option. I already tried to use them and most of the read is 0.

 

What I want to do is not get the gene family expression, but the single gene in a gene family (consist of 9 genes). For info, the cufflinks workflow produce result but for whole gene family, not for a single gene. My problem right now is to get gene expression from trasncript expression. eXpress result is in the form of transcript expression count reads. My gene of interest actaully has 2 transcript variant and another gene that I want to check has more variant. I'm kinda confused to get the total gene expression level. I imagine the gene expression level is the accummulation of transcript level but I'm not really sure whether I can directly add count from eXpress or not and in which step I should do the addition.

ADD REPLYlink written 2.4 years ago by bharata180320
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 343 users visited in the last hour