Question

Error message with transcriptCounts(easyRNASeq) performed with gtf file generated with cuffmerge

0

Entering edit mode

Richard Friedman ★ 2.0k

@richard-friedman-513

Last seen 11.3 years ago

Dear Bioconductor List, I am working through the easyRNASeq human sequence use case with a difference. I am using Bowtie/Tophat/Cufflinks aligned reads and transcripts. It works fine for getting counts of exons, transcripts, and genes for the genes.gtf file from the UCSD assembly. The difficulty comes in when I try to get counts for novel exons, genes and transcripts. To get this, I am uses the merged.gtf file which is generated by cuffmerge rather than the genes.gtf file from UCSD. I get Exon counts just fine (although I lose the old Ensembl Exon symbols). What I can't get is transcript counts. Here are excerpts from my session: > library(easyRNASeq) Loading required package: parallel Loading required package: genomeIntervals Loading required package: intervals Loading required package: BiocGenerics Attaching package: BiocGenerics The following object(s) are masked from package:stats: xtabs The following object(s) are masked from package:base: anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, get, intersect, lapply, Map, mapply, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rbind, Reduce, rep.int, rownames, sapply, setdiff, table, tapply, union, unique Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'. Loading required package: biomaRt Loading required package: edgeR Loading required package: limma Loading required package: Biostrings Loading required package: IRanges Attaching package: IRanges The following object(s) are masked from package:intervals: expand, reduce Attaching package: Biostrings The following object(s) are masked from package:intervals: type Loading required package: BSgenome Loading required package: GenomicRanges Loading required package: DESeq Loading required package: locfit locfit 1.5-8 2012-04-25 Attaching package: locfit The following object(s) are masked from package:GenomicRanges: left, right Loading required package: lattice Attaching package: DESeq The following object(s) are masked from package:limma: plotMA Loading required package: Rsamtools Loading required package: ShortRead Loading required package: latticeExtra Loading required package: RColorBrewer Warning messages: 1: replacing previous import coerce when loading intervals 2: replacing previous import initialize when loading intervals > library(BSgenome.Hsapiens.UCSC.hg19) > library(Rsamtools) > > chr.sizes=seqlengths(Hsapiens) > chr.sizes chr1 chr2 chr3 chr4 249250621 243199373 198022430 191154276 chr5 chr6 chr7 chr8 180915260 171115067 159138663 146364022 chr9 chr10 chr11 chr12 141213431 135534747 135006516 133851895 chr13 chr14 chr15 chr16 115169878 107349540 102531392 90354753 chr17 chr18 chr19 chr20 81195210 78077248 59128983 63025520 chr21 chr22 chrX chrY 48129895 51304566 155270560 59373566 chrM chr1_gl000191_random chr1_gl000192_random chr4_ctg9_hap1 16571 106433 547496 590426 chr4_gl000193_random chr4_gl000194_random chr6_apd_hap1 chr6_cox_hap2 189789 191469 4622290 4795371 chr6_dbb_hap3 chr6_mann_hap4 chr6_mcf_hap5 chr6_qbl_hap6 4610396 4683263 4833398 4611984 chr6_ssto_hap7 chr7_gl000195_random chr8_gl000196_random chr8_gl000197_random 4928567 182896 38914 37175 chr9_gl000198_random chr9_gl000199_random chr9_gl000200_random chr9_gl000201_random 90085 169874 187035 36148 chr11_gl000202_random chr17_ctg5_hap1 chr17_gl000203_random chr17_gl000204_random 40103 1680828 37498 81310 chr17_gl000205_random chr17_gl000206_random chr18_gl000207_random chr19_gl000208_random 174588 41001 4262 92689 chr19_gl000209_random chr21_gl000210_random chrUn_gl000211 chrUn_gl000212 159169 27682 166566 186858 chrUn_gl000213 chrUn_gl000214 chrUn_gl000215 chrUn_gl000216 164239 137718 172545 172294 chrUn_gl000217 chrUn_gl000218 chrUn_gl000219 chrUn_gl000220 172149 161147 179198 161802 chrUn_gl000221 chrUn_gl000222 chrUn_gl000223 chrUn_gl000224 155397 186861 180455 179693 chrUn_gl000225 chrUn_gl000226 chrUn_gl000227 chrUn_gl000228 211173 15008 128374 129120 chrUn_gl000229 chrUn_gl000230 chrUn_gl000231 chrUn_gl000232 19913 43691 27386 40652 chrUn_gl000233 chrUn_gl000234 chrUn_gl000235 chrUn_gl000236 45941 40531 34474 41934 chrUn_gl000237 chrUn_gl000238 chrUn_gl000239 chrUn_gl000240 45867 39939 33824 41933 chrUn_gl000241 chrUn_gl000242 chrUn_gl000243 chrUn_gl000244 42152 43523 43341 39929 chrUn_gl000245 chrUn_gl000246 chrUn_gl000247 chrUn_gl000248 36651 38154 36422 39786 chrUn_gl000249 38502 countrabtex <- easyRNASeq(filesDirectory=getwd(), + organism="Hsapiens", chr.sizes=chr.sizes, + readLength=58L, + annotationMethod="gtf", + annotationFile="merged.gtf", + count="exons", + gapped=TRUE, + filenames=bamfiles[1]) Checking arguments... Fetching annotations... Read 454815 records Summarizing counts... Processing SRR490224.bam Preparing output Warning messages: 1: In .Method(..., deparse.level = deparse.level) : number of columns of result is not a multiple of vector length (arg 1) 2: In easyRNASeq(filesDirectory = getwd(), organism = "Hsapiens", chr.sizes = chr.sizes, : There are 229182 features/exons defined in your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? 3: In easyRNASeq(filesDirectory = getwd(), organism = "Hsapiens", chr.sizes = chr.sizes, : You enforce UCSC chromosome conventions, however the provided annotation is not compliant. Correcting it. 4: In fetchCoverage(rnaSeq, format = format, filename = filename, filter = filter, : You enforce UCSC chromosome conventions, however the provided alignments are not compliant. Correcting it. 5: In fetchCoverage(rnaSeq, format = format, filename = filename, filter = filter, : The read length stored in the object (probably provided as argument): 58 is not the same as the ones: 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 determined from the file: /Documents/clients/Phyllis/learning_easyRNAseq/SRR490224.bam Updating the readLength slot. 6: In fetchCoverage(rnaSeq, format = format, filename = filename, filter = filter, : Not all the chromosome names in your chromosome size list 'chr.sizes' are present in your read file(s) (aln or bam). 7: In fetchCoverage(rnaSeq, format = format, filename = filename, filter = filter, : The available chromosomes in both your read file(s) (aln or bam) and 'chr.sizes' list were restricted to their common term. These are: chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY. > write.csv(countrabtex,"countrabtexmerged.csv" ) THE OUTPUT FILE CONTAINED AN EXON COUNT LIST. EXONS WERE LABELLED "XLOC". QUESTION: IS THERE A WAY TO GET THE ENSEMBL NUMBER FOR KNOWN EXONS AND RESERVE "XLOC" FOR NEW ONES? NOW FOR TRANSCRIPTS: > RnaSeq <- easyRNASeq(filesDirectory=getwd(), + organism="Hsapiens", chr.sizes=chr.sizes, + readLength=58L, + annotationMethod="gtf", + annotationFile="merged.gtf", + count="exons", + gapped=TRUE, + filenames=bamfiles[1], + outputFormat="RNAseq") Checking arguments... Fetching annotations... Read 454815 records Summarizing counts... Processing SRR490224.bam Preparing output Warning messages: 1: In .Method(..., deparse.level = deparse.level) : number of columns of result is not a multiple of vector length (arg 1) 2: In easyRNASeq(filesDirectory = getwd(), organism = "Hsapiens", chr.sizes = chr.sizes, : There are 229182 features/exons defined in your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? 3: In easyRNASeq(filesDirectory = getwd(), organism = "Hsapiens", chr.sizes = chr.sizes, : You enforce UCSC chromosome conventions, however the provided annotation is not compliant. Correcting it. 4: In fetchCoverage(rnaSeq, format = format, filename = filename, filter = filter, : You enforce UCSC chromosome conventions, however the provided alignments are not compliant. Correcting it. 5: In fetchCoverage(rnaSeq, format = format, filename = filename, filter = filter, : The read length stored in the object (probably provided as argument): 58 is not the same as the ones: 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 determined from the file: /Documents/clients/Phyllis/learning_easyRNAseq/SRR490224.bam Updating the readLength slot. 6: In fetchCoverage(rnaSeq, format = format, filename = filename, filter = filter, : Not all the chromosome names in your chromosome size list 'chr.sizes' are present in your read file(s) (aln or bam). 7: In fetchCoverage(rnaSeq, format = format, filename = filename, filter = filter, : The available chromosomes in both your read file(s) (aln or bam) and 'chr.sizes' list were restricted to their common term. These are: chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY. > class(RnaSeq) [1] "RNAseq" attr(,"package") [1] "easyRNASeq" > > RnaSeqTranscripts<-transcriptCounts(RnaSeq) Error in aggregate.data.frame(as.data.frame(x), ...) : arguments must have same length > sessionInfo() R version 2.15.2 (2012-10-26) Platform: i386-apple-darwin9.8.0/i386 (32-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Hsapiens.UCSC.hg19_1.3.19 easyRNASeq_1.4.2 [3] ShortRead_1.16.3 latticeExtra_0.6-24 [5] RColorBrewer_1.0-5 Rsamtools_1.10.2 [7] DESeq_1.10.1 lattice_0.20-10 [9] locfit_1.5-8 BSgenome_1.26.1 [11] GenomicRanges_1.10.5 Biostrings_2.26.2 [13] IRanges_1.16.4 edgeR_3.0.7 [15] limma_3.14.3 biomaRt_2.14.0 [17] Biobase_2.18.0 genomeIntervals_1.14.0 [19] BiocGenerics_0.4.0 intervals_0.13.3 loaded via a namespace (and not attached): [1] annotate_1.36.0 AnnotationDbi_1.20.3 bitops_1.0-4.2 DBI_0.2-5 [5] genefilter_1.40.0 geneplotter_1.36.0 grid_2.15.2 hwriter_1.3 [9] RCurl_1.95-3 RSQLite_0.11.2 splines_2.15.2 stats4_2.15.2 [13] survival_2.37-2 XML_3.95-0.1 xtable_1.7-0 zlibbioc_1.4.0 > QUESTION: IS THERE ANY WAY TO GET TRANSCRIPT COUNTS INCLUDING NEW TRANSCRIPTS ASSEMBLED BY CUFFMERGE? QUESTION: DO YOU HAVE ANY SUGGESTIONS FOR ANOTHER WAY OF GETTING COUNTS OF NOVEL TRANSCRIPTS ASSEMBLED BY CUFFMERGE? CUFFDIFF GIVES NON-INTEGRAL TRANSCRIPT NUMBERS SO I THINK THAT IT DOES SOME SORT OF NORMALIZATION WHICH I DON'T WANT. Thanks and best wishes, Rich Richard A. Friedman, PhD Associate Research Scientist, Biomedical Informatics Shared Resource Herbert Irving Comprehensive Cancer Center (HICCC) Lecturer, Department of Biomedical Informatics (DBMI) Educational Coordinator, Center for Computational Biology and Bioinformatics (C2B2)/ National Center for Multiscale Analysis of Genomic Networks (MAGNet)/ Columbia Initiative in Systems Biology Room 824 Irving Cancer Research Center Columbia University 1130 St. Nicholas Ave New York, NY 10032 (212)851-4765 (voice) friedman@cancercenter.columbia.edu http://cancercenter.columbia.edu/~friedman/ "Complex numbers! Ha! Ha! There is nothing weirder than imaginary numbers. Architects don't need to know complex numbers. Whenever I get a negative root for an area, I throw it out. And don't talk to me about quaternions. I am not going into computer animation." -Rose Friedman, age 16 [[alternative HTML version deleted]]

RNASeq Annotation Normalization Cancer Organism BSgenome BSgenome easyRNASeq RNASeq • 1.9k views

ADD COMMENT • link 12.9 years ago Richard Friedman ★ 2.0k