converting counts to integer mode
Error in validObject(.Object) :
invalid class “DESeqDataSet” object: NA values are not allowed in the count matrix
In addition: Warning message:
In `mode<-`(`*tmp*`, value = c(395L, 231L, 243L, 866L, 3L, 45L, :
NAs introduced by coercion
I have check in my count matrix there is no NA values. Probably you can give me some suggestion about this problem. Thank you.
I tried to find error by subset the RoundedReadCount for the first 3 data, it succes. After I move to the fourth data, it is error. So, my guess is the data like the 4th row is causing the problem.
This is the data from 4th row:
ENST00000001008
866
29
14976
31998649144
3842
1090941
1720
568
1751
4694
2177
5825
8672
15447
It seems there is a really huge number there. Do you think that's the problem? The number is illogical. I tried to visualize the bam files in that gene, the result shouldn't that big. Probably have some problem in express.
I use eXpress software. I choose the effective count for the read count matrix. I don't know why its result is strange. After that, I tried to use estimate count fro eXpress and the result is normal. I'm still trying to figure it out why effective count from eXpress is weird.
One thing that I should note is that you have transcript IDs in there. DESeq2 is not designed for transcript level counts, only gene level counts. You can do HTSeq_Count -> DESeq2 on your data to do gene level differential expression. Transcript level is a whole other ball game, as you'd be dealing with fractional counts. Tools such as Salmon or Kallisto can return transcript level quantification, or the Tuxedo pipeline. I'd definitely go back through your eXpress command (if you choose to go down that route), and check everything. Otherwise, check out some of the other software I suggested.
Actually I had tried Tuxedo workflow. The problem was Cufflinks result gave me the gene expression for entire locus which is in my gene of interest, the locus has 9 different genes.So, the gene expression level is actually the accummulation of 9 different genes. That's why I changed to count based analysis. At first I use HT-seq to generate the readcount, but the result was weird because it gave too many ambigous result. Many of the rea count from HT-seq was 0 because this ambiguity.
Well, there's not a great deal you can do about the counting methods if your alignment has ambiguously mapped reads. Look at Salmon (https://github.com/COMBINE-lab/salmon) It'll take your fastq files and quantify them to the transcript level, you can also get it to give you gene level quantification. it's a completely different method to the ones you've used thus far, it uses k-mer counting to map reads.
Well, actually, I hope I can use a tool to get the gene level expression directly. I realized that what I want to do is find the correlation between gene expression and its promoter expression. Probably if I use transcript level, after I get the expression level, I will find another problem when I need to accummulate the expression level of different transcript for the same gene. Do you have any suggestion? Actually I think cufflinks is really suitable for this. It's just that the result report for the whole locus it's really unfortunate. My gene of interest is quite unique. All of the 9 gene family, each of it has 5 exon which only the first exon is different (all other 4 exon is used together by 9 different genes). No wonder HT-seq get ambigous mapping because of this. HT-seq will fail to decide which gene is the owner of the read in 4 exon which is used together by 9 different genes.
Salmon uses the accumulated transcript expression to give the gene level expression. I'd say Salmon is your best bet in this case. The alternative is Kallisto, but they haven't implemented gene expression level yet. I've come into so many problems with what Cufflinks actually does, that I'd much rather trust Salmon's approach. My final point would be that if you've got ambiguous reads, then k-mer counting is going to be a much better approach for you.
Thank you. I will try Salmon then. I also read in other forum about StringTie. Probably I will try both to make sure but maybe if you have any opinion or experience about that tool.
I have finished generating Salmon output. It was quick. So, for the gene level analysis, do I need to generate raw count and use DESeq2 to analyze the expression level? At what step di I need to sum the transcript level so that I can get the gene level expression? Is it after I load DESeq2 or before that? (I think if I need to sum before I load to DESeq it would be hard because I need to know the transcript id for every gene). Thank you for your suggestion.
Yep, it's pretty impressive. To get salmon to produce gene level summarised counts, you need to add the -g parameter into your salmon command, and provide a gene map. A gene map is basically a tab delimited text file with two columns, one contains the transcript ID, the other contains the Gene ID. Run salmon quant --help-reads and look at the entry for -g. Salmon should then produce 2 count files, one for transcripts and one for genes, the gene one you can put into DESeq2
Just to restate, DESeq2 is designed for gene-level analysis, and we recommend using software such as htseq-count, summarizeOverlaps or featureCounts to obtain for each gene a count of fragments which can be uniquely aligned to that gene.
"Many of the read count from htseq was 0 because this ambiguity."
I'm not sure exactly what problem you are describing, but if you want to sum all the reads/fragments for a gene family, you could edit the annotation file so that these genes all make up one feature. However if you want to test each gene in the family separately, it is not recommended as input to DESeq(2) to use non-uniquely assigning reads. You can search the archive to find many posts on this topic.
The HTseq-count read is ambigous for my gene family. After I tried to find the problem, I think it is caused by the ambiguity in the HTSeq process. My guess is in my gene of interest, the gene family contains of 5 exons. All of it use the same 4 last exon and only the first exon is different. I'm thinking that the usage by 9 gene together can make HTseq confuse which one should the reads of the last 4 exon count assigned to. So, Htseq-count, summarizeOverlaps, and featureCounts is already out of option. I already tried to use them and most of the read is 0.
What I want to do is not get the gene family expression, but the single gene in a gene family (consist of 9 genes). For info, the cufflinks workflow produce result but for whole gene family, not for a single gene. My problem right now is to get gene expression from trasncript expression. eXpress result is in the form of transcript expression count reads. My gene of interest actaully has 2 transcript variant and another gene that I want to check has more variant. I'm kinda confused to get the total gene expression level. I imagine the gene expression level is the accummulation of transcript level but I'm not really sure whether I can directly add count from eXpress or not and in which step I should do the addition.
show us the output of
head(RoundedReadCount)
I tried to find error by subset the RoundedReadCount for the first 3 data, it succes. After I move to the fourth data, it is error. So, my guess is the data like the 4th row is causing the problem.
This is the data from 4th row:
It seems there is a really huge number there. Do you think that's the problem? The number is illogical. I tried to visualize the bam files in that gene, the result shouldn't that big. Probably have some problem in express.
How did you do your counting?
31 Billion reads mapping to a single transcript is not realistic.... something has gone wrong somewhere.
I use eXpress software. I choose the effective count for the read count matrix. I don't know why its result is strange. After that, I tried to use estimate count fro eXpress and the result is normal. I'm still trying to figure it out why effective count from eXpress is weird.
One thing that I should note is that you have transcript IDs in there. DESeq2 is not designed for transcript level counts, only gene level counts. You can do HTSeq_Count -> DESeq2 on your data to do gene level differential expression. Transcript level is a whole other ball game, as you'd be dealing with fractional counts. Tools such as Salmon or Kallisto can return transcript level quantification, or the Tuxedo pipeline. I'd definitely go back through your eXpress command (if you choose to go down that route), and check everything. Otherwise, check out some of the other software I suggested.
Actually I had tried Tuxedo workflow. The problem was Cufflinks result gave me the gene expression for entire locus which is in my gene of interest, the locus has 9 different genes.So, the gene expression level is actually the accummulation of 9 different genes. That's why I changed to count based analysis. At first I use HT-seq to generate the readcount, but the result was weird because it gave too many ambigous result. Many of the rea count from HT-seq was 0 because this ambiguity.
Well, there's not a great deal you can do about the counting methods if your alignment has ambiguously mapped reads. Look at Salmon (https://github.com/COMBINE-lab/salmon) It'll take your fastq files and quantify them to the transcript level, you can also get it to give you gene level quantification. it's a completely different method to the ones you've used thus far, it uses k-mer counting to map reads.
Well, actually, I hope I can use a tool to get the gene level expression directly. I realized that what I want to do is find the correlation between gene expression and its promoter expression. Probably if I use transcript level, after I get the expression level, I will find another problem when I need to accummulate the expression level of different transcript for the same gene. Do you have any suggestion? Actually I think cufflinks is really suitable for this. It's just that the result report for the whole locus it's really unfortunate. My gene of interest is quite unique. All of the 9 gene family, each of it has 5 exon which only the first exon is different (all other 4 exon is used together by 9 different genes). No wonder HT-seq get ambigous mapping because of this. HT-seq will fail to decide which gene is the owner of the read in 4 exon which is used together by 9 different genes.
Salmon uses the accumulated transcript expression to give the gene level expression. I'd say Salmon is your best bet in this case. The alternative is Kallisto, but they haven't implemented gene expression level yet. I've come into so many problems with what Cufflinks actually does, that I'd much rather trust Salmon's approach. My final point would be that if you've got ambiguous reads, then k-mer counting is going to be a much better approach for you.
Thank you. I will try Salmon then. I also read in other forum about StringTie. Probably I will try both to make sure but maybe if you have any opinion or experience about that tool.
Again, stringtie works on the same principle as cufflinks. I don't know how it'll deal with ambiguously mapped reads.
I have finished generating Salmon output. It was quick. So, for the gene level analysis, do I need to generate raw count and use DESeq2 to analyze the expression level? At what step di I need to sum the transcript level so that I can get the gene level expression? Is it after I load DESeq2 or before that? (I think if I need to sum before I load to DESeq it would be hard because I need to know the transcript id for every gene). Thank you for your suggestion.
Yep, it's pretty impressive. To get salmon to produce gene level summarised counts, you need to add the -g parameter into your salmon command, and provide a gene map. A gene map is basically a tab delimited text file with two columns, one contains the transcript ID, the other contains the Gene ID. Run salmon quant --help-reads and look at the entry for -g. Salmon should then produce 2 count files, one for transcripts and one for genes, the gene one you can put into DESeq2