DEXSeq output - count file

0

Entering edit mode

Rao,Xiayu ▴ 550

@raoxiayu-6003

Last seen 8.9 years ago

United States

Hello, DEXSeq is a great tool. Thank you for that. I recently generate my own gtf file with the same format as the exon.gff generated by dexseq_prepare_annotation.py. However, the output is strange. I tried to find the reason with no success. Could you please provide any idea about that problem? Thank you very much in advance! Note: I used the latest dexseq, and the sam files had been sorted. 1 genes.gtf exonic_part 12228 12594 . + . exonic_part_number "001"; gene_id "ENSG00000223972" 1 genes.gtf exonic_part 12722 12974 . + . exonic_part_number "002"; gene_id "ENSG00000223972" 1 genes.gtf exonic_part 13053 13220 . + . exonic_part_number "003"; gene_id "ENSG00000223972" 1 genes.gtf exonic_part 14830 14969 . - . exonic_part_number "001"; gene_id "ENSG00000223972+ENSG00000227232" ............. ==> SRR791043_counts.txt <== :001G00027000003" :002G00021000003" :003G00070000003" :004G00040000003" :005G00060000003" :006G00030000003" :007G00019000003" :008G00045600003" :009G00020400003" :001G00000000005" Thanks, Xiayu

DEXSeq DEXSeq • 1.7k views

ADD COMMENT • link 10.1 years ago Rao,Xiayu ▴ 550

0

Entering edit mode

Rao,Xiayu ▴ 550

@raoxiayu-6003

Last seen 8.9 years ago

United States

Hello, DEXSeq is a great tool. Thank you for that. I recently generate my own gtf file with the same format as the exon.gff generated by dexseq_prepare_annotation.py. However, the output is strange. I tried to find the reason with no success. Could you please provide any idea about that problem? Thank you very much in advance! Note: I used the latest dexseq, and the sam files had been sorted. 1 genes.gtf exonic_part 12228 12594 . + . exonic_part_number "001"; gene_id "ENSG00000223972" 1 genes.gtf exonic_part 12722 12974 . + . exonic_part_number "002"; gene_id "ENSG00000223972" 1 genes.gtf exonic_part 13053 13220 . + . exonic_part_number "003"; gene_id "ENSG00000223972" 1 genes.gtf exonic_part 14830 14969 . - . exonic_part_number "001"; gene_id "ENSG00000223972+ENSG00000227232" ............. ==> SRR791043_counts.txt <== :001G00027000003" :002G00021000003" :003G00070000003" :004G00040000003" :005G00060000003" :006G00030000003" :007G00019000003" :008G00045600003" :009G00020400003" :001G00000000005" Thanks, Xiayu

ADD COMMENT • link 10.1 years ago Rao,Xiayu ▴ 550

0

Entering edit mode

Dear Xiayu Rao, Thanks for your interest in DEXSeq. That looks strange, does it happen when you use files different from the one you generated by your own? Could you maybe send me (offline) your gtf file and the first 1000 lines of one of your sam files? Bests, Alejandro > Hello, > > DEXSeq is a great tool. Thank you for that. I recently generate my own gtf file with the same format as the exon.gff generated by dexseq_prepare_annotation.py. However, the output is strange. I tried to find the reason with no success. Could you please provide any idea about that problem? Thank you very much in advance! > > Note: I used the latest dexseq, and the sam files had been sorted. > > 1 genes.gtf exonic_part 12228 12594 . + . exonic_part_number "001"; gene_id "ENSG00000223972" > 1 genes.gtf exonic_part 12722 12974 . + . exonic_part_number "002"; gene_id "ENSG00000223972" > 1 genes.gtf exonic_part 13053 13220 . + . exonic_part_number "003"; gene_id "ENSG00000223972" > 1 genes.gtf exonic_part 14830 14969 . - . exonic_part_number "001"; gene_id "ENSG00000223972+ENSG00000227232" > ............. > > > ==> SRR791043_counts.txt <== > :001G00027000003" > :002G00021000003" > :003G00070000003" > :004G00040000003" > :005G00060000003" > :006G00030000003" > :007G00019000003" > :008G00045600003" > :009G00020400003" > :001G00000000005" > > > Thanks, > Xiayu

ADD REPLY • link 10.1 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

Hi Alejandro, I am analyzing with DEXSeq my data. 4 CTRLs and 2 Treated samples. My design is the following: design(ecs.24) sampleName fileName condition experiment H1 H1 Exon_Martelli_Sample_Martelli_H_1.bam HYPOXIA RUN_2 H2 H2 Exon_Martelli_Sample_Martelli_H_2.bam HYPOXIA RUN_2 N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam CTRL RUN_2 N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam CTRL RUN_2 CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam CTRL RUN_1 CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam CTRL RUN_1 When I follow the vignette: ecs.24 <- estimateDispersions(ecs.24,nCores=8) ....Done ecs.24 <- fitDispersionFunction(ecs.24) ....Done ecs.24 <- testForDEU(ecs.24, nCores=8) ..... ecs.24 <- estimatelog2FoldChanges(ecs.24, nCores=8) *Error in `row.names<-.data.frame`(`*tmp*`, value = c("geneID", "exonID", : * * duplicate 'row.names' are not allowed* *Calls: estimatelog2FoldChanges ... pData<- -> pData<- -> row.names<- -> row.names<-.data.frame* *In addition: Warning message:* *non-unique value when setting 'row.names': 'log2fold(CTRL/HYPOXIA)' * I checked for duplication as you had suggested elsewhere any(duplicated(featureNames(ecs.24))) [1] FALSE any(duplicated(paste(geneIDs(ecs.24),exonIDs(ecs.24),sep=":"))) [1] FALSE I also checked that the condition in design where factors: is.factor(pData(ecs.24)$condition) [1] TRUE The only explanation I could come to is the fact that I have an even number of samples for control and treated? or that I have the 'experiment' column in the design, but it would be irrelevant since the default formula is only taking condition into consideration, isn't it? What could be the origin of the error? Thanks again! Jose > sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 2014-03-13 16:32 GMT+01:00 Alejandro Reyes <alejandro.reyes@embl.de>: > Dear Xiayu Rao, > > Thanks for your interest in DEXSeq. > That looks strange, does it happen when you use files different from the > one you generated by your own? Could you maybe send me (offline) your > gtf file and the first 1000 lines of one of your sam files? > > Bests, > Alejandro > > > Hello, > > > > DEXSeq is a great tool. Thank you for that. I recently generate my own > gtf file with the same format as the exon.gff generated by > dexseq_prepare_annotation.py. However, the output is strange. I tried to > find the reason with no success. Could you please provide any idea about > that problem? Thank you very much in advance! > > > > Note: I used the latest dexseq, and the sam files had been sorted. > > > > 1 genes.gtf exonic_part 12228 12594 . + > . exonic_part_number "001"; gene_id "ENSG00000223972" > > 1 genes.gtf exonic_part 12722 12974 . + > . exonic_part_number "002"; gene_id "ENSG00000223972" > > 1 genes.gtf exonic_part 13053 13220 . + > . exonic_part_number "003"; gene_id "ENSG00000223972" > > 1 genes.gtf exonic_part 14830 14969 . - > . exonic_part_number "001"; gene_id "ENSG00000223972+ENSG00000227232" > > ............. > > > > > > ==> SRR791043_counts.txt <== > > :001G00027000003" > > :002G00021000003" > > :003G00070000003" > > :004G00040000003" > > :005G00060000003" > > :006G00030000003" > > :007G00019000003" > > :008G00045600003" > > :009G00020400003" > > :001G00000000005" > > > > > > Thanks, > > Xiayu > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Jose M. Garcia Manteiga PhD Research Assistant - Data Analysis in Functional Genomics Center for Translational Genomics and BioInformatics Dibit2-Basilica, 3A3 San Raffaele Scientific Institute Via Olgettina 58, 20132 Milano (MI), Italy Tel: +39-02-2643-4751 e-mail: garciamanteiga.josemanuel@hsr.it [[alternative HTML version deleted]]

ADD REPLY • link 10.1 years ago Jose M Garcia Manteiga ▴ 310

0

Entering edit mode

Hi Alejandro, I solved the problem by re-creating the object ecs.24. I had made one DEXSeq analysis up to the end by first creating an ecs object. Then I just split the ecs object, which already had p.value and other info, and re-run the analysis from sizeFactors on onto the new split ecs.24 object. Now it has worked. However, I have obtained a much harder to interpret result which points to something wrong I do not know why. And it is present in both the split and the original ecs.24 and ecs objects. >From scratch: I made dexseq_prepare_annotation.py with the script from DEXSeq 1.6 which contained the '-r' parameter in order to avoid counting exons overlapping different genes. Then I tried to count using the new dexseq_count.py in the same package but it gave me an error because it had been introduced a check for NH tag in the bam that I do not have because I use SOAPSplice. You suggested to use the old dexseq_count.py whithout the check (from DEXSeq 1.4). It worked and then I used the following script: sampleFiles.R_ExonOUT<-Files sampleName.R_ExonOUT<-Names sampleCondition.R_ExonOUT<-c(rep("HYPOXIA",2),rep("CTRL",4),rep("HYPOX IA",2)) sampleExperiment.R_ExonOUT<-c(rep("RUN_2",4),rep("RUN_1",4)) sampleTable.R_ExonOUT <- data.frame(sampleName = sampleName.R_ExonOUT, fileName = sampleFiles.R_ExonOUT, condition = sampleCondition.R_ExonOUT, experiment = sampleExperiment.R_ExonOUT) inDir = getwd() annotationfile = file.path ("/lustre1/genomes/hg19/annotation","Homo_sapiens.ensembl72.DEXSeq.gff ") ecs = read.HTSeqCounts(countfiles = file.path(inDir, sampleTable.R_ExonOUT$fileName),design = sampleTable.R_ExonOUT, flattenedfile = annotationfile) sampleNames(ecs) = sampleTable.R_ExonOUT$sampleName ecs <- estimateSizeFactors(ecs) library(parallel) ecs <- estimateDispersions(ecs,nCores=8) ecs <- fitDispersionFunction(ecs) ecs <- testForDEU(ecs, nCores=8) ecs <- estimatelog2FoldChanges(ecs, nCores=8) res<- DEUresultTable(ecs) The problem is that I have some exons with a ridiculous log2FC but with a very good p.adjust. Same thing with the ecs.24 or ecs.48 split objects. Here an example: head(res.48[which(res.48$geneID=="ENSG00000148516"),]) geneID exonID dispersion pvalue padjust meanBase log2fold(HYPOXIA/CTRL) ENSG00000148516:E036 ENSG00000148516 E036 0.014798679 2.873434e-16 5.646223e-14 171.5313 -6.075811e-01 ENSG00000148516:E049 ENSG00000148516 E049 0.011425856 2.461690e-14 2.846653e-12 414.4351 -1.907197e-01 ENSG00000148516:E039 ENSG00000148516 E039 0.014486497 2.332678e-13 2.043916e-11 181.3705 -4.226252e-01 *ENSG00000148516:E050 ENSG00000148516 E050 0.009733072 1.131825e-12 8.326638e-11 1432.6492 -1.278668e-05* ENSG00000148516:E033 ENSG00000148516 E033 0.037143254 3.483915e-12 2.236853e-10 514.5010 -5.273017e-01 ENSG00000148516:E034 ENSG00000148516 E034 0.019826955 4.660942e-12 2.896722e-10 113.6851 -6.541261e-01 If you look at the plot (just a few exons around E50) plotDEXSeq(ecs.48,"ENSG00000148516",splicing=T) [image: Immagine in linea 3] It seems clear that all those p-values cannot come from those log2FC that are adjusted for expression of all exons coming from the same gene. I have checked the design and formula: design(ecs.48) sampleName fileName condition experiment N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam CTRL RUN_2 N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam CTRL RUN_2 CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam CTRL RUN_1 CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam CTRL RUN_1 HYPOXIA2 HYPOXIA2 Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam HYPOXIA RUN_1 HYPOXIA3 HYPOXIA3 Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam HYPOXIA RUN_1 formula(ecs.48) $formulaDispersion [1] "~sample + exon + condition:exon" $formula0 [1] "~sample + exon" $formula1 [1] "~sample + exon + condition:exon" So, I am a bit stuck with it. I guess everything comes from having used different versions but I cannot come across it. Summarizing: SOASPSplice dexseq_prepare_annotation.py (From DEXSeq 1.6) with Ensembl72 (hg19) -r no dexseq_count.py (From DEXSeq 1.4) Analysis (DEXSeq 1.8) Thanks for the help, Jose sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 2014-03-19 13:18 GMT+01:00 Jose Garcia <garciamanteiga.josemanuel@hsr.it>: > Hi Alejandro, > I am analyzing with DEXSeq my data. 4 CTRLs and 2 Treated samples. My > design is the following: > > design(ecs.24) > > sampleName fileName condition > experiment > > H1 H1 Exon_Martelli_Sample_Martelli_H_1.bam HYPOXIA > RUN_2 > > H2 H2 Exon_Martelli_Sample_Martelli_H_2.bam HYPOXIA > RUN_2 > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam CTRL > RUN_2 > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam CTRL > RUN_2 > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam CTRL > RUN_1 > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam CTRL > RUN_1 > > When I follow the vignette: > > ecs.24 <- estimateDispersions(ecs.24,nCores=8) > > ....Done > > ecs.24 <- fitDispersionFunction(ecs.24) > > ....Done > > ecs.24 <- testForDEU(ecs.24, nCores=8) > > ..... > > ecs.24 <- estimatelog2FoldChanges(ecs.24, nCores=8) > > *Error in `row.names<-.data.frame`(`*tmp*`, value = c("geneID", "exonID", > : * > > * duplicate 'row.names' are not allowed* > > *Calls: estimatelog2FoldChanges ... pData<- -> pData<- -> row.names<- -> > row.names<-.data.frame* > > *In addition: Warning message:* > > *non-unique value when setting 'row.names': 'log2fold(CTRL/HYPOXIA)' * > > > I checked for duplication as you had suggested elsewhere > > any(duplicated(featureNames(ecs.24))) > > [1] FALSE > > any(duplicated(paste(geneIDs(ecs.24),exonIDs(ecs.24),sep=":"))) > > [1] FALSE > > > I also checked that the condition in design where factors: > > > is.factor(pData(ecs.24)$condition) > > [1] TRUE > > > The only explanation I could come to is the fact that I have an even > number of samples for control and treated? or that I have the 'experiment' > column in the design, but it would be irrelevant since the default formula > is only taking condition into consideration, isn't it? > > What could be the origin of the error? > > Thanks again! > > Jose > > > > > sessionInfo() > > R version 3.0.1 (2013-05-16) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > locale: > > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US > > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C > > > attached base packages: > > [1] parallel stats graphics grDevices utils datasets methods > > [8] base > > > other attached packages: > > [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 > > > loaded via a namespace (and not attached): > > [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 > > [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 > > [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 > > [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 > > [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 > > > 2014-03-13 16:32 GMT+01:00 Alejandro Reyes <alejandro.reyes@embl.de>: > > Dear Xiayu Rao, >> >> Thanks for your interest in DEXSeq. >> That looks strange, does it happen when you use files different from the >> one you generated by your own? Could you maybe send me (offline) your >> gtf file and the first 1000 lines of one of your sam files? >> >> Bests, >> Alejandro >> >> > Hello, >> > >> > DEXSeq is a great tool. Thank you for that. I recently generate my own >> gtf file with the same format as the exon.gff generated by >> dexseq_prepare_annotation.py. However, the output is strange. I tried to >> find the reason with no success. Could you please provide any idea about >> that problem? Thank you very much in advance! >> > >> > Note: I used the latest dexseq, and the sam files had been sorted. >> > >> > 1 genes.gtf exonic_part 12228 12594 . + >> . exonic_part_number "001"; gene_id "ENSG00000223972" >> > 1 genes.gtf exonic_part 12722 12974 . + >> . exonic_part_number "002"; gene_id "ENSG00000223972" >> > 1 genes.gtf exonic_part 13053 13220 . + >> . exonic_part_number "003"; gene_id "ENSG00000223972" >> > 1 genes.gtf exonic_part 14830 14969 . - >> . exonic_part_number "001"; gene_id "ENSG00000223972+ENSG00000227232" >> > ............. >> > >> > >> > ==> SRR791043_counts.txt <== >> > :001G00027000003" >> > :002G00021000003" >> > :003G00070000003" >> > :004G00040000003" >> > :005G00060000003" >> > :006G00030000003" >> > :007G00019000003" >> > :008G00045600003" >> > :009G00020400003" >> > :001G00000000005" >> > >> > >> > Thanks, >> > Xiayu >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > > -- > Jose M. Garcia Manteiga PhD > Research Assistant - Data Analysis in Functional Genomics > Center for Translational Genomics and BioInformatics > Dibit2-Basilica, 3A3 > San Raffaele Scientific Institute > Via Olgettina 58, 20132 Milano (MI), Italy > > Tel: +39-02-2643-4751 > e-mail: garciamanteiga.josemanuel@hsr.it > -- Jose M. Garcia Manteiga PhD Research Assistant - Data Analysis in Functional Genomics Center for Translational Genomics and BioInformatics Dibit2-Basilica, 3A3 San Raffaele Scientific Institute Via Olgettina 58, 20132 Milano (MI), Italy Tel: +39-02-2643-4751 e-mail: garciamanteiga.josemanuel@hsr.it [[alternative HTML version deleted]]

ADD REPLY • link 10.1 years ago Jose M Garcia Manteiga ▴ 310

0

Entering edit mode

Hi Jose, Thanks for your e-mail! I see two possibilities, 1. My first suggestion would be to recreate the ExonCountSet object from scratch at the beginning of each analysis (calling read.HTSeqCounts with the files of each subset of the data independently). This would make sure that the fold changes from the subset of the data correspond the p-values in each object. 2. If this is not the problem, I would check if these are not cases where lots of exons of a single gene are affected by DEU. For example, if you have a gene with 20 exon bins and, say 9, of them are differentially used (could be the case when you have a alternative transcription start site or termination site in the middle of the gene), the GLM model could "fail" to detect the exact exons that are being differentially used and could be detecting the other 11 exons within the same gene. For this cases the DEXSeq results are more useful at the "gene level" rather than at the "exon level"! Do you think this could be the case in your data? Best regards, Alejandro > Hi Alejandro, > I solved the problem by re-creating the object ecs.24. I had made one > DEXSeq analysis up to the end by first creating an ecs object. Then I > just split the ecs object, which already had p.value and other info, > and re-run the analysis from sizeFactors on onto the new split ecs.24 > object. > Now it has worked. > However, I have obtained a much harder to interpret result which > points to something wrong I do not know why. And it is present in both > the split and the original ecs.24 and ecs objects. > From scratch: > I made dexseq_prepare_annotation.py with the script from DEXSeq 1.6 > which contained the '-r' parameter in order to avoid counting exons > overlapping different genes. Then I tried to count using the new > dexseq_count.py in the same package but it gave me an error because it > had been introduced a check for NH tag in the bam that I do not have > because I use SOAPSplice. You suggested to use the old dexseq_count.py > whithout the check (from DEXSeq 1.4). > It worked and then I used the following script: > > sampleFiles.R_ExonOUT<-Files > sampleName.R_ExonOUT<-Names > sampleCondition.R_ExonOUT<-c(rep("HYPOXIA",2),rep("CTRL",4),rep("HYP OXIA",2)) > sampleExperiment.R_ExonOUT<-c(rep("RUN_2",4),rep("RUN_1",4)) > sampleTable.R_ExonOUT <- data.frame(sampleName = sampleName.R_ExonOUT, > fileName = sampleFiles.R_ExonOUT, > condition = sampleCondition.R_ExonOUT, > experiment = sampleExperiment.R_ExonOUT) > inDir = getwd() > annotationfile = file.path > ("/lustre1/genomes/hg19/annotation","Homo_sapiens.ensembl72.DEXSeq.g ff") > > ecs = read.HTSeqCounts(countfiles = file.path(inDir, > sampleTable.R_ExonOUT$fileName),design = sampleTable.R_ExonOUT, > flattenedfile = annotationfile) > > sampleNames(ecs) = sampleTable.R_ExonOUT$sampleName > ecs <- estimateSizeFactors(ecs) > library(parallel) > ecs <- estimateDispersions(ecs,nCores=8) > ecs <- fitDispersionFunction(ecs) > ecs <- testForDEU(ecs, nCores=8) > ecs <- estimatelog2FoldChanges(ecs, nCores=8) > res<- DEUresultTable(ecs) > > The problem is that I have some exons with a ridiculous log2FC but > with a very good p.adjust. > Same thing with the ecs.24 or ecs.48 split objects. Here an example: > > head(res.48[which(res.48$geneID=="ENSG00000148516"),]) > > geneID exonID dispersion pvalue padjust meanBase > log2fold(HYPOXIA/CTRL) > > ENSG00000148516:E036 ENSG00000148516 E036 0.014798679 2.873434e-16 > 5.646223e-14 171.5313 -6.075811e-01 > > ENSG00000148516:E049 ENSG00000148516 E049 0.011425856 2.461690e-14 > 2.846653e-12 414.4351 -1.907197e-01 > > ENSG00000148516:E039 ENSG00000148516 E039 0.014486497 2.332678e-13 > 2.043916e-11 181.3705 -4.226252e-01 > > *ENSG00000148516:E050 ENSG00000148516 E050 0.009733072 1.131825e-12 > 8.326638e-11 1432.6492 -1.278668e-05* > > ENSG00000148516:E033 ENSG00000148516 E033 0.037143254 3.483915e-12 > 2.236853e-10 514.5010 -5.273017e-01 > > ENSG00000148516:E034 ENSG00000148516 E034 0.019826955 4.660942e-12 > 2.896722e-10 113.6851 -6.541261e-01 > > > If you look at the plot (just a few exons around E50) > > plotDEXSeq(ecs.48,"ENSG00000148516",splicing=T) > > > Immagine in linea 3 > > It seems clear that all those p-values cannot come from those log2FC > that are adjusted for expression of all exons coming from the same gene. > > I have checked the design and formula: > > design(ecs.48) > > sampleName fileName condition > experiment > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam CTRL RUN_2 > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam CTRL RUN_2 > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam CTRL RUN_1 > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam CTRL RUN_1 > > HYPOXIA2 HYPOXIA2 Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam HYPOXIA > RUN_1 > > HYPOXIA3 HYPOXIA3 Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam HYPOXIA > RUN_1 > > > formula(ecs.48) > > $formulaDispersion > > [1] "~sample + exon + condition:exon" > > > $formula0 > > [1] "~sample + exon" > > > $formula1 > > [1] "~sample + exon + condition:exon" > > > So, I am a bit stuck with it. I guess everything comes from having > used different versions but I cannot come across it. Summarizing: > > SOASPSplice > > dexseq_prepare_annotation.py (From DEXSeq 1.6) with Ensembl72 (hg19) -r no > > dexseq_count.py (From DEXSeq 1.4) > > Analysis (DEXSeq 1.8) > > Thanks for the help, > > > Jose > > > > sessionInfo() > > R version 3.0.1 (2013-05-16) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > locale: > > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US > > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C > > > attached base packages: > > [1] parallel stats graphics grDevices utils datasets methods > > [8] base > > > other attached packages: > > [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 > > > loaded via a namespace (and not attached): > > [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 > > [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 > > [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 > > [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 > > [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 > > > > 2014-03-19 13:18 GMT+01:00 Jose Garcia > <garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">>: > > Hi Alejandro, > I am analyzing with DEXSeq my data. 4 CTRLs and 2 Treated samples. > My design is the following: > > design(ecs.24) > > sampleName fileName condition > experiment > > H1 H1 Exon_Martelli_Sample_Martelli_H_1.bam HYPOXIA > RUN_2 > > H2 H2 Exon_Martelli_Sample_Martelli_H_2.bam HYPOXIA > RUN_2 > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam CTRL RUN_2 > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam CTRL RUN_2 > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam CTRL > RUN_1 > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam CTRL > RUN_1 > > When I follow the vignette: > > ecs.24 <- estimateDispersions(ecs.24,nCores=8) > > ....Done > > ecs.24 <- fitDispersionFunction(ecs.24) > > ....Done > > ecs.24 <- testForDEU(ecs.24, nCores=8) > > ..... > > ecs.24 <- estimatelog2FoldChanges(ecs.24, nCores=8) > > *Error in `row.names<-.data.frame`(`*tmp*`, value = c("geneID", > "exonID", : * > > *duplicate 'row.names' are not allowed* > > *Calls: estimatelog2FoldChanges ... pData<- -> pData<- -> > row.names<- -> row.names<-.data.frame* > > *In addition: Warning message:* > > *non-unique value when setting 'row.names': 'log2fold(CTRL/HYPOXIA)' * > > > I checked for duplication as you had suggested elsewhere > > any(duplicated(featureNames(ecs.24))) > > [1] FALSE > > any(duplicated(paste(geneIDs(ecs.24),exonIDs(ecs.24),sep=":"))) > > [1] FALSE > > > I also checked that the condition in design where factors: > > > is.factor(pData(ecs.24)$condition) > > [1] TRUE > > > The only explanation I could come to is the fact that I have an > even number of samples for control and treated? or that I have the > 'experiment' column in the design, but it would be irrelevant > since the default formula is only taking condition into > consideration, isn't it? > > What could be the origin of the error? > > Thanks again! > > Jose > > > > > sessionInfo() > > R version 3.0.1 (2013-05-16) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > locale: > > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US > > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C > > > attached base packages: > > [1] parallel stats graphics grDevices utils datasets methods > > [8] base > > > other attached packages: > > [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 > > > loaded via a namespace (and not attached): > > [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 > > [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 > > [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 > > [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 > > [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 > > > > 2014-03-13 16:32 GMT+01:00 Alejandro Reyes > <alejandro.reyes at="" embl.de="" <mailto:alejandro.reyes="" at="" embl.de="">>: > > Dear Xiayu Rao, > > Thanks for your interest in DEXSeq. > That looks strange, does it happen when you use files > different from the > one you generated by your own? Could you maybe send me > (offline) your > gtf file and the first 1000 lines of one of your sam files? > > Bests, > Alejandro > > > Hello, > > > > DEXSeq is a great tool. Thank you for that. I recently > generate my own gtf file with the same format as the exon.gff > generated by dexseq_prepare_annotation.py. However, the output > is strange. I tried to find the reason with no success. Could > you please provide any idea about that problem? Thank you very > much in advance! > > > > Note: I used the latest dexseq, and the sam files had been > sorted. > > > > 1 genes.gtf exonic_part 12228 12594 . > + . exonic_part_number "001"; gene_id "ENSG00000223972" > > 1 genes.gtf exonic_part 12722 12974 . > + . exonic_part_number "002"; gene_id "ENSG00000223972" > > 1 genes.gtf exonic_part 13053 13220 . > + . exonic_part_number "003"; gene_id "ENSG00000223972" > > 1 genes.gtf exonic_part 14830 14969 . > - . exonic_part_number "001"; gene_id > "ENSG00000223972+ENSG00000227232" > > ............. > > > > > > ==> SRR791043_counts.txt <== > > :001G00027000003" > > :002G00021000003" > > :003G00070000003" > > :004G00040000003" > > :005G00060000003" > > :006G00030000003" > > :007G00019000003" > > :008G00045600003" > > :009G00020400003" > > :001G00000000005" > > > > > > Thanks, > > Xiayu > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > -- > Jose M. Garcia Manteiga PhD > Research Assistant - Data Analysis in Functional Genomics > Center for Translational Genomics and BioInformatics > Dibit2-Basilica, 3A3 > San Raffaele Scientific Institute > Via Olgettina 58, 20132 Milano (MI), Italy > > Tel: +39-02-2643-4751 > e-mail: garciamanteiga.josemanuel at hsr.it > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > > > > -- > Jose M. Garcia Manteiga PhD > Research Assistant - Data Analysis in Functional Genomics > Center for Translational Genomics and BioInformatics > Dibit2-Basilica, 3A3 > San Raffaele Scientific Institute > Via Olgettina 58, 20132 Milano (MI), Italy > > Tel: +39-02-2643-4751 > e-mail: garciamanteiga.josemanuel at hsr.it > <mailto:garciamanteiga.josemanuel at="" hsr.it="">

ADD REPLY • link 10.1 years ago Alejandro Reyes ★ 1.9k

Login before adding your answer.