Entering edit mode
Hi everyone,
I have exonic counts for 344 samples for 15 genes that include some house-keeping genes like GAPDH and Actin genes using dexseq_count. I am trying to normalize them like this:
dxd = DEXSeqDataSetFromHTSeq( countFiles, sampleData = sampleTable, design= ~ sample + exon + disease_type:exon, flattenedfile = flattenedFile) head(sampleTable) sample_id disease_type 7316-1057-T 7316-1057-T High-grade glioma/astrocytoma (WHO grade III/IV) 7316-1480-T 7316-1480-T Ependymoma 7316-775-T 7316-775-T Ependymoma 7316-2009-T 7316-2009-T Medulloblastoma 7316-1473-T 7316-1473-T Supratentorial or Spinal Cord PNET 7316-302-T 7316-302-T Medulloblastoma > dxd class: DEXSeqDataSet dim: 441 688 metadata(1): version assays(1): counts rownames(441): ENSG00000075624.13:E001 ENSG00000075624.13:E002 ... ENSG00000188676.13:E017 ENSG00000188676.13:E018 rowData names(5): featureID groupID exonBaseMean exonBaseVar transcripts colnames: NULL colData names(4): sample sample_id disease_type exon
Why is it showing 688 samples instead of 344?
Also, I am getting the following error when I try to normalize:
dxd = estimateSizeFactors(dxd) Error in estimateSizeFactorsForMatrix(featureCounts(object), locfunc, : every gene contains at least one zero, cannot compute log geometric means
I checked my counts file but I can't seem to find any sample that has all 0 values:
> dim(cts) [1] 441 688 > summary(colSums(cts)) Min. 1st Qu. Median Mean 3rd Qu. Max. 485 15420 42410 2255000 599800 67470000 > summary(rowSums(cts)) Min. 1st Qu. Median Mean 3rd Qu. Max. 989 2973 187300 3518000 8387000 11690000
Please let me know what am I missing here. Thanks
Hi Komal
Can you please post
countFiles
,flattenedFile
,sampleTable
devtools::session_info()
This should make it easier to troubleshoot what is going on. (Check out the advice in the Bioconductor posting guide, which is intended to make it easier to pin down the issue more quickly.)
It's always a good idea to get to the bottom of troubling observations such as "Why is it showing 688 samples instead of 344?" before continuing with the workflow. "Garbage in, garbage out."
One more thing:
DEXSeq
was designed to work with transcriptome-wide (i.e. RNA-Seq) data. Applying it to a dataset with 15 genes will be a bit of a stretch. It's quite possible that it works (especially if your treatment effects are mild), but I'd use extra caution and visualization in assessing the results. You might have to modify the size factor estimation to rely more on the housekeeping genes rather than all genes; or, if you do useestimateSizeFactors
, look at the housekeeping genes whether the result is satisfactory.Hope this helps
Wolfgang