countOverlaps Warnings

0

Entering edit mode

Pankaj Agarwal ▴ 100

@pankaj-agarwal-6492

Last seen 7.7 years ago

United States

Hi, I am analyzing a matched pair tumor/normal rna-seq data set. After aligning with bowtie2 against UCSC hg19 index, I am trying to get the counts for each of the samples using the iGenomes UCSC GTF file. I came across the following tutorial which shows different ways to do it in Bioconductor: http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec _12_16_2013/Rrnaseq/Rrnaseq.pdf Following along slide 28 "Read Counting with countOverlaps", I am able to generate the counts but get the following Warnings. I just want to be sure that there is nothing to be worried about because I don't understand the meaning of these warnings. My code is as follows: > library(GenomicFeatures) Loading required package: AnnotationDbi > library(Rsamtools) > library(rtracklayer) > txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite") > eByg <- exonsBy(txdb, by="gene") > head(eByg) GRangesList of length 6: $A1BG GRanges with 8 ranges and 2 metadata columns: seqnames ranges strand | exon_id exon_name <rle> <iranges> <rle> | <integer> <character> [1] chr19 [58858172, 58858395] - | 163177 <na> [2] chr19 [58858719, 58859006] - | 163178 <na> [3] chr19 [58861736, 58862017] - | 163179 <na> [4] chr19 [58862757, 58863053] - | 163180 <na> [5] chr19 [58863649, 58863921] - | 163181 <na> [6] chr19 [58864294, 58864563] - | 163182 <na> [7] chr19 [58864658, 58864693] - | 163183 <na> [8] chr19 [58864770, 58864865] - | 163184 <na> $A1BG-AS1 GRanges with 4 ranges and 2 metadata columns: seqnames ranges strand | exon_id exon_name [1] chr19 [58863336, 58864410] + | 156640 <na> [2] chr19 [58864745, 58864840] + | 156641 <na> [3] chr19 [58865080, 58865223] + | 156642 <na> [4] chr19 [58865735, 58866549] + | 156643 <na> ... <4 more elements> --- seqlengths: chr12 chr8 ... chr1_gl000192_random NA NA ... NA > samples = c("BRPC13-1118_L1.D710_501.sorted", "BRPC_13-764_L2.sorted", "DU04_PBMC_RNA_L1.sorted", "DU06_PBMC_RNA_L1.sorted") > > samples [1] "BRPC13-1118_L1.D710_501.sorted" "BRPC_13-764_L2.sorted" [3] "DU04_PBMC_RNA_L1.sorted" "DU06_PBMC_RNA_L1.sorted" > > samplespath <- paste("./", samples, ".bam", sep="") > > samplespath [1] "./BRPC13-1118_L1.D710_501.sorted.bam" [2] "./BRPC_13-764_L2.sorted.bam" [3] "./DU04_PBMC_RNA_L1.sorted.bam" [4] "./DU06_PBMC_RNA_L1.sorted.bam" > > names(samplespath) <- samples > > samplespath BRPC13-1118_L1.D710_501.sorted BRPC_13-764_L2.sorted "./BRPC13-1118_L1.D710_501.sorted.bam" "./BRPC_13-764_L2.sorted.bam" DU04_PBMC_RNA_L1.sorted DU06_PBMC_RNA_L1.sorted "./DU04_PBMC_RNA_L1.sorted.bam" "./DU06_PBMC_RNA_L1.sorted.bam" > > countDF <- data.frame(row.names=names(eByg)) > > countDF data frame with 0 columns and 23710 rows > > dim(countDF) [1] 23710 0 > > for(i in samplespath) { + aligns <- readGAlignmentsFromBam(i) + counts <- countOverlaps(eByg, aligns, ignore.strand=TRUE) + countDF <- cbind(countDF, counts) + } Warning messages: 1: In .Seqinfo.mergexy(x, y) : Each of the 2 combined objects has sequence levels not in the other: - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, chr1_gl000192_random - in 'y': chrM Make sure to always combine/compare objects based on the same reference genome (use suppressWarnings() to suppress this warning). 2: In .Seqinfo.mergexy(x, y) : Each of the 2 combined objects has sequence levels not in the other: - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, chr1_gl000192_random - in 'y': chrM Make sure to always combine/compare objects based on the same reference genome (use suppressWarnings() to suppress this warning). 3: In .Seqinfo.mergexy(x, y) : Each of the 2 combined objects has sequence levels not in the other: - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, chr1_gl000192_random - in 'y': chrM Make sure to always combine/compare objects based on the same reference genome (use suppressWarnings() to suppress this warning). 4: In .Seqinfo.mergexy(x, y) : Each of the 2 combined objects has sequence levels not in the other: - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, chr1_gl000192_random - in 'y': chrM Make sure to always combine/compare objects based on the same reference genome (use suppressWarnings() to suppress this warning). > I also wanted to verify if what I am doing is correct for paired end reads. Thanks, - Pankaj ======================================= sessionInfo() R version 3.0.3 (2014-03-06) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base [[alternative HTML version deleted]]

• 3.4k views

ADD COMMENT • link updated 11.7 years ago by Valerie Obenchain ★ 6.8k • written 11.7 years ago by Pankaj Agarwal ▴ 100

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 4.0 years ago

United States

Hi Pankaj, There are several options for counting paired-end reads. Both readGAlignmentPairs() and readGAlignmentsList() are appropriate for paired-end data and are in the GenomicAlignments package. They use the same counting algorithm but return the data in different containers. readGAlignmentPairs() returns only pairs while readGAlignmentsList() returns pairs as well as singletons, reads with unmapped pairs etc. See the ?readGAlignments man page for details. Another counting function is summarizeOverlaps() which uses the above functions 'under the hood'. It returns the counts in the 'assays' slot of a SummarizedExperiment object. Since you have multiple bam files to count you may want to use the BamFileList method. The man page has examples of counting paired-end bam files. library(Rsamtools) ?summarizeOverlaps Other packages that offer count functions are Rsubread and easyRNASeq. Valerie On 04/21/2014 09:53 AM, Pankaj Agarwal wrote: > Hi, > > I am analyzing a matched pair tumor/normal rna-seq data set. After aligning with bowtie2 against UCSC hg19 index, I am trying to get the counts for each of the samples using the iGenomes UCSC GTF file. I came across the following tutorial which shows different ways to do it in Bioconductor: > http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_D ec_12_16_2013/Rrnaseq/Rrnaseq.pdf > > Following along slide 28 "Read Counting with countOverlaps", I am able to generate the counts but get the following Warnings. I just want to be sure that there is nothing to be worried about because I don't understand the meaning of these warnings. My code is as follows: > >> library(GenomicFeatures) > Loading required package: AnnotationDbi >> library(Rsamtools) >> library(rtracklayer) >> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite") >> eByg <- exonsBy(txdb, by="gene") >> head(eByg) > GRangesList of length 6: > $A1BG > GRanges with 8 ranges and 2 metadata columns: > seqnames ranges strand | exon_id exon_name > <rle> <iranges> <rle> | <integer> <character> > [1] chr19 [58858172, 58858395] - | 163177 <na> > [2] chr19 [58858719, 58859006] - | 163178 <na> > [3] chr19 [58861736, 58862017] - | 163179 <na> > [4] chr19 [58862757, 58863053] - | 163180 <na> > [5] chr19 [58863649, 58863921] - | 163181 <na> > [6] chr19 [58864294, 58864563] - | 163182 <na> > [7] chr19 [58864658, 58864693] - | 163183 <na> > [8] chr19 [58864770, 58864865] - | 163184 <na> > > $A1BG-AS1 > GRanges with 4 ranges and 2 metadata columns: > seqnames ranges strand | exon_id exon_name > [1] chr19 [58863336, 58864410] + | 156640 <na> > [2] chr19 [58864745, 58864840] + | 156641 <na> > [3] chr19 [58865080, 58865223] + | 156642 <na> > [4] chr19 [58865735, 58866549] + | 156643 <na> > > ... > <4 more elements> > --- > seqlengths: > chr12 chr8 ... chr1_gl000192_random > NA NA ... NA >> samples = c("BRPC13-1118_L1.D710_501.sorted", "BRPC_13-764_L2.sorted", "DU04_PBMC_RNA_L1.sorted", "DU06_PBMC_RNA_L1.sorted") >> >> samples > [1] "BRPC13-1118_L1.D710_501.sorted" "BRPC_13-764_L2.sorted" > [3] "DU04_PBMC_RNA_L1.sorted" "DU06_PBMC_RNA_L1.sorted" >> >> samplespath <- paste("./", samples, ".bam", sep="") >> >> samplespath > [1] "./BRPC13-1118_L1.D710_501.sorted.bam" > [2] "./BRPC_13-764_L2.sorted.bam" > [3] "./DU04_PBMC_RNA_L1.sorted.bam" > [4] "./DU06_PBMC_RNA_L1.sorted.bam" >> >> names(samplespath) <- samples >> >> samplespath > BRPC13-1118_L1.D710_501.sorted BRPC_13-764_L2.sorted > "./BRPC13-1118_L1.D710_501.sorted.bam" "./BRPC_13-764_L2.sorted.bam" > DU04_PBMC_RNA_L1.sorted DU06_PBMC_RNA_L1.sorted > "./DU04_PBMC_RNA_L1.sorted.bam" "./DU06_PBMC_RNA_L1.sorted.bam" >> >> countDF <- data.frame(row.names=names(eByg)) >> >> countDF > data frame with 0 columns and 23710 rows >> >> dim(countDF) > [1] 23710 0 >> >> for(i in samplespath) { > + aligns <- readGAlignmentsFromBam(i) > + counts <- countOverlaps(eByg, aligns, ignore.strand=TRUE) > + countDF <- cbind(countDF, counts) > + } > Warning messages: > 1: In .Seqinfo.mergexy(x, y) : > Each of the 2 combined objects has sequence levels not in the other: > - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, chr1_gl000192_random > - in 'y': chrM > Make sure to always combine/compare objects based on the same reference > genome (use suppressWarnings() to suppress this warning). > 2: In .Seqinfo.mergexy(x, y) : > Each of the 2 combined objects has sequence levels not in the other: > - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, chr1_gl000192_random > - in 'y': chrM > Make sure to always combine/compare objects based on the same reference > genome (use suppressWarnings() to suppress this warning). > 3: In .Seqinfo.mergexy(x, y) : > Each of the 2 combined objects has sequence levels not in the other: > - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, chr1_gl000192_random > - in 'y': chrM > Make sure to always combine/compare objects based on the same reference > genome (use suppressWarnings() to suppress this warning). > 4: In .Seqinfo.mergexy(x, y) : > Each of the 2 combined objects has sequence levels not in the other: > - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, chr1_gl000192_random > - in 'y': chrM > Make sure to always combine/compare objects based on the same reference > genome (use suppressWarnings() to suppress this warning). >> > > I also wanted to verify if what I am doing is correct for paired end reads. > > Thanks, > > - Pankaj > > > ======================================= > sessionInfo() > R version 3.0.3 (2014-03-06) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: vobencha at fhcrc.org Phone: (206) 667-3158

ADD COMMENT • link 11.7 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Pankaj, How did it go? Were you able to get readGAlignmentPairs() or readGAlignmentsList() to work for your case? Valerie On 04/21/2014 01:32 PM, Valerie Obenchain wrote: > Hi Pankaj, > > There are several options for counting paired-end reads. > > Both readGAlignmentPairs() and readGAlignmentsList() are appropriate for > paired-end data and are in the GenomicAlignments package. They use the > same counting algorithm but return the data in different containers. > readGAlignmentPairs() returns only pairs while readGAlignmentsList() > returns pairs as well as singletons, reads with unmapped pairs etc. See > the ?readGAlignments man page for details. > > Another counting function is summarizeOverlaps() which uses the above > functions 'under the hood'. It returns the counts in the 'assays' slot > of a SummarizedExperiment object. Since you have multiple bam files to > count you may want to use the BamFileList method. > > The man page has examples of counting paired-end bam files. > library(Rsamtools) > ?summarizeOverlaps > > Other packages that offer count functions are Rsubread and easyRNASeq. > > Valerie > > > > > On 04/21/2014 09:53 AM, Pankaj Agarwal wrote: >> Hi, >> >> I am analyzing a matched pair tumor/normal rna-seq data set. After >> aligning with bowtie2 against UCSC hg19 index, I am trying to get the >> counts for each of the samples using the iGenomes UCSC GTF file. I >> came across the following tutorial which shows different ways to do it >> in Bioconductor: >> http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_ Dec_12_16_2013/Rrnaseq/Rrnaseq.pdf >> >> >> Following along slide 28 "Read Counting with countOverlaps", I am able >> to generate the counts but get the following Warnings. I just want to >> be sure that there is nothing to be worried about because I don't >> understand the meaning of these warnings. My code is as follows: >> >>> library(GenomicFeatures) >> Loading required package: AnnotationDbi >>> library(Rsamtools) >>> library(rtracklayer) >>> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite") >>> eByg <- exonsBy(txdb, by="gene") >>> head(eByg) >> GRangesList of length 6: >> $A1BG >> GRanges with 8 ranges and 2 metadata columns: >> seqnames ranges strand | exon_id exon_name >> <rle> <iranges> <rle> | <integer> <character> >> [1] chr19 [58858172, 58858395] - | 163177 <na> >> [2] chr19 [58858719, 58859006] - | 163178 <na> >> [3] chr19 [58861736, 58862017] - | 163179 <na> >> [4] chr19 [58862757, 58863053] - | 163180 <na> >> [5] chr19 [58863649, 58863921] - | 163181 <na> >> [6] chr19 [58864294, 58864563] - | 163182 <na> >> [7] chr19 [58864658, 58864693] - | 163183 <na> >> [8] chr19 [58864770, 58864865] - | 163184 <na> >> >> $A1BG-AS1 >> GRanges with 4 ranges and 2 metadata columns: >> seqnames ranges strand | exon_id exon_name >> [1] chr19 [58863336, 58864410] + | 156640 <na> >> [2] chr19 [58864745, 58864840] + | 156641 <na> >> [3] chr19 [58865080, 58865223] + | 156642 <na> >> [4] chr19 [58865735, 58866549] + | 156643 <na> >> >> ... >> <4 more elements> >> --- >> seqlengths: >> chr12 chr8 ... chr1_gl000192_random >> NA NA ... NA >>> samples = c("BRPC13-1118_L1.D710_501.sorted", >>> "BRPC_13-764_L2.sorted", "DU04_PBMC_RNA_L1.sorted", >>> "DU06_PBMC_RNA_L1.sorted") >>> >>> samples >> [1] "BRPC13-1118_L1.D710_501.sorted" "BRPC_13-764_L2.sorted" >> [3] "DU04_PBMC_RNA_L1.sorted" "DU06_PBMC_RNA_L1.sorted" >>> >>> samplespath <- paste("./", samples, ".bam", sep="") >>> >>> samplespath >> [1] "./BRPC13-1118_L1.D710_501.sorted.bam" >> [2] "./BRPC_13-764_L2.sorted.bam" >> [3] "./DU04_PBMC_RNA_L1.sorted.bam" >> [4] "./DU06_PBMC_RNA_L1.sorted.bam" >>> >>> names(samplespath) <- samples >>> >>> samplespath >> BRPC13-1118_L1.D710_501.sorted >> BRPC_13-764_L2.sorted >> "./BRPC13-1118_L1.D710_501.sorted.bam" >> "./BRPC_13-764_L2.sorted.bam" >> DU04_PBMC_RNA_L1.sorted >> DU06_PBMC_RNA_L1.sorted >> "./DU04_PBMC_RNA_L1.sorted.bam" >> "./DU06_PBMC_RNA_L1.sorted.bam" >>> >>> countDF <- data.frame(row.names=names(eByg)) >>> >>> countDF >> data frame with 0 columns and 23710 rows >>> >>> dim(countDF) >> [1] 23710 0 >>> >>> for(i in samplespath) { >> + aligns <- readGAlignmentsFromBam(i) >> + counts <- countOverlaps(eByg, aligns, ignore.strand=TRUE) >> + countDF <- cbind(countDF, counts) >> + } >> Warning messages: >> 1: In .Seqinfo.mergexy(x, y) : >> Each of the 2 combined objects has sequence levels not in the other: >> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, >> chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, >> chr1_gl000192_random >> - in 'y': chrM >> Make sure to always combine/compare objects based on the same >> reference >> genome (use suppressWarnings() to suppress this warning). >> 2: In .Seqinfo.mergexy(x, y) : >> Each of the 2 combined objects has sequence levels not in the other: >> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, >> chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, >> chr1_gl000192_random >> - in 'y': chrM >> Make sure to always combine/compare objects based on the same >> reference >> genome (use suppressWarnings() to suppress this warning). >> 3: In .Seqinfo.mergexy(x, y) : >> Each of the 2 combined objects has sequence levels not in the other: >> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, >> chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, >> chr1_gl000192_random >> - in 'y': chrM >> Make sure to always combine/compare objects based on the same >> reference >> genome (use suppressWarnings() to suppress this warning). >> 4: In .Seqinfo.mergexy(x, y) : >> Each of the 2 combined objects has sequence levels not in the other: >> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, >> chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, >> chr1_gl000192_random >> - in 'y': chrM >> Make sure to always combine/compare objects based on the same >> reference >> genome (use suppressWarnings() to suppress this warning). >>> >> >> I also wanted to verify if what I am doing is correct for paired end >> reads. >> >> Thanks, >> >> - Pankaj >> >> >> ======================================= >> sessionInfo() >> R version 3.0.3 (2014-03-06) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: vobencha at fhcrc.org Phone: (206) 667-3158

ADD REPLY • link 11.7 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Valerie, I used summarizeOverlaps() instead, since it seemed simpler to use. I also used htseq-count to get the counts using the same parameters, but I get very different results. >From what I have read and understood, both should give similar results since the counts should be reported at the "gene" level and not the "isoform" level. I would feel much more comfortable with the results if they were both at least close. For both I used the UCSC GTF file in the iGenomes distribution. For htseq-count I used BAM files sorted by name since it required it, but for summarizeOverlaps I used BAM files sorted by coordinates (the default sorting done by samtools). Example of counts generated by the two methods are given after the code: The code for htseq-count that I used is as follows (once for each sample): samtools view -h A.sorted_byname.bam | htseq-count -t exon -i gene_id -m intersection-nonempty -s no - iGenomes/UCSC/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf > A.htseq.count The code for summarizeOverlaps() is as follows: > library(GenomicFeatures) Loading required package: BiocGenerics Loading required package: parallel ... > library(Rsamtools) Loading required package: Biostrings > library(rtracklayer) > library(GenomicRanges) > > txdb <- makeTranscriptDbFromGFF(file=iGenomes/UCSC/Homo_sapiens/UCSC /hg19/Annotation/Genes/genes.gtf", format="gtf") extracting transcript information Estimating transcript ranges. Extracting gene IDs Processing splicing information for gtf file. Deducing exon rank from relative coordinates provided Prepare the 'metadata' data frame ... metadata: OK Now generating chrominfo from available sequence names. No chromosome length information is available. Warning messages: 1: In .deduceExonRankings(exs, format = "gtf") : Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName 2: In matchCircularity(chroms, circ_seqs) : None of the strings in your circ_seqs argument match your seqnames. > - are these warnings message problems? > saveDb(txdb, file="./data/ucsc_igenomes_hg19_gtf.sqlite") > > txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite") > > eByg <- exonsBy(txdb, by="gene") > > head(eByg) GRangesList of length 6: $A1BG GRanges with 8 ranges and 2 metadata columns: seqnames ranges strand | exon_id exon_name <rle> <iranges> <rle> | <integer> <character> [1] chr19 [58858172, 58858395] - | 163177 <na> [2] chr19 [58858719, 58859006] - | 163178 <na> > > samples = c("A.sorted", "B.sorted", "C.sorted", "D.sorted") > > samples [1] "A.sorted" "B.sorted" [3] "C.sorted" "D.sorted" > > samplespath [1] "../A.sorted.bam" [2] "../B.bam" [3] "../C.sorted.bam" [4] "../D.sorted.bam" > > names(samplespath) <- samples > bfl <- BamFileList(samplespath, yieldSize=50000, index=character()) I am not sure what yieldSize means, just used it from the example I followed. > bfl BamFileList of length 4 names(4): A.sorted ... B.sorted > > countDF_intsectionNotEmpty <- summarizeOverlaps(eByg, bfl, mode="IntersectionNotEmpty", ignore.strand=TRUE) > > head(countDF_intsectionNotEmpty) class: SummarizedExperiment dim: 1 4 exptData(0): assays(1): counts rownames(1): A1BG rowData metadata column names(0): colnames(4): A1.sorted B.sorted C.sorted D.sorted colData names(0): > > class(countDF_intsectionNotEmpty) [1] "SummarizedExperiment" attr(,"package") [1] "GenomicRanges" > > countDF_intsectionNotEmpty <- assays(countDF_intsectionNotEmpty)$counts > colnames(countDF_intsectionNotEmpty) <- samples > countDF_intsectionNotEmpty[1:4,] A.sorted B.sorted C.sorted D.sorted A1BG 13 119 162 136 A1BG-AS1 2 112 52 85 A1CF 3 28 0 5 A2M 2543 17593 47 233 ... The counts that I get are very different from what I get using htseq- count, which are following for the same genes listed above: A.bam B. bam C. bam D. bam A1BG 7 67 90 72 A1BG-AS1 1 69 41 51 A1CF 2 16 0 3 A2M 1536 10866 34 173 ... ... I would appreciate your feedback. Thanks, - Pankaj -----Original Message----- From: Valerie Obenchain [mailto:vobencha@fhcrc.org] Sent: Wednesday, April 23, 2014 12:50 PM To: Pankaj Agarwal; bioconductor at r-project.org Subject: Re: [BioC] countOverlaps Warnings Pankaj, How did it go? Were you able to get readGAlignmentPairs() or readGAlignmentsList() to work for your case? Valerie On 04/21/2014 01:32 PM, Valerie Obenchain wrote: > Hi Pankaj, > > There are several options for counting paired-end reads. > > Both readGAlignmentPairs() and readGAlignmentsList() are appropriate > for paired-end data and are in the GenomicAlignments package. They use > the same counting algorithm but return the data in different containers. > readGAlignmentPairs() returns only pairs while readGAlignmentsList() > returns pairs as well as singletons, reads with unmapped pairs etc. > See the ?readGAlignments man page for details. > > Another counting function is summarizeOverlaps() which uses the above > functions 'under the hood'. It returns the counts in the 'assays' slot > of a SummarizedExperiment object. Since you have multiple bam files to > count you may want to use the BamFileList method. > > The man page has examples of counting paired-end bam files. > library(Rsamtools) > ?summarizeOverlaps > > Other packages that offer count functions are Rsubread and easyRNASeq. > > Valerie > > > > > On 04/21/2014 09:53 AM, Pankaj Agarwal wrote: >> Hi, >> >> I am analyzing a matched pair tumor/normal rna-seq data set. After >> aligning with bowtie2 against UCSC hg19 index, I am trying to get the >> counts for each of the samples using the iGenomes UCSC GTF file. I >> came across the following tutorial which shows different ways to do >> it in Bioconductor: >> http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_De >> c_12_16_2013/Rrnaseq/Rrnaseq.pdf >> >> >> Following along slide 28 "Read Counting with countOverlaps", I am >> able to generate the counts but get the following Warnings. I just >> want to be sure that there is nothing to be worried about because I >> don't understand the meaning of these warnings. My code is as follows: >> >>> library(GenomicFeatures) >> Loading required package: AnnotationDbi >>> library(Rsamtools) >>> library(rtracklayer) >>> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite") >>> eByg <- exonsBy(txdb, by="gene") >>> head(eByg) >> GRangesList of length 6: >> $A1BG >> GRanges with 8 ranges and 2 metadata columns: >> seqnames ranges strand | exon_id exon_name >> <rle> <iranges> <rle> | <integer> <character> >> [1] chr19 [58858172, 58858395] - | 163177 <na> >> [2] chr19 [58858719, 58859006] - | 163178 <na> >> [3] chr19 [58861736, 58862017] - | 163179 <na> >> [4] chr19 [58862757, 58863053] - | 163180 <na> >> [5] chr19 [58863649, 58863921] - | 163181 <na> >> [6] chr19 [58864294, 58864563] - | 163182 <na> >> [7] chr19 [58864658, 58864693] - | 163183 <na> >> [8] chr19 [58864770, 58864865] - | 163184 <na> >> >> $A1BG-AS1 >> GRanges with 4 ranges and 2 metadata columns: >> seqnames ranges strand | exon_id exon_name >> [1] chr19 [58863336, 58864410] + | 156640 <na> >> [2] chr19 [58864745, 58864840] + | 156641 <na> >> [3] chr19 [58865080, 58865223] + | 156642 <na> >> [4] chr19 [58865735, 58866549] + | 156643 <na> >> >> ... >> <4 more elements> >> --- >> seqlengths: >> chr12 chr8 ... chr1_gl000192_random >> NA NA ... NA >>> samples = c("BRPC13-1118_L1.D710_501.sorted", >>> "BRPC_13-764_L2.sorted", "DU04_PBMC_RNA_L1.sorted", >>> "DU06_PBMC_RNA_L1.sorted") >>> >>> samples >> [1] "BRPC13-1118_L1.D710_501.sorted" "BRPC_13-764_L2.sorted" >> [3] "DU04_PBMC_RNA_L1.sorted" "DU06_PBMC_RNA_L1.sorted" >>> >>> samplespath <- paste("./", samples, ".bam", sep="") >>> >>> samplespath >> [1] "./BRPC13-1118_L1.D710_501.sorted.bam" >> [2] "./BRPC_13-764_L2.sorted.bam" >> [3] "./DU04_PBMC_RNA_L1.sorted.bam" >> [4] "./DU06_PBMC_RNA_L1.sorted.bam" >>> >>> names(samplespath) <- samples >>> >>> samplespath >> BRPC13-1118_L1.D710_501.sorted BRPC_13-764_L2.sorted >> "./BRPC13-1118_L1.D710_501.sorted.bam" >> "./BRPC_13-764_L2.sorted.bam" >> DU04_PBMC_RNA_L1.sorted DU06_PBMC_RNA_L1.sorted >> "./DU04_PBMC_RNA_L1.sorted.bam" >> "./DU06_PBMC_RNA_L1.sorted.bam" >>> >>> countDF <- data.frame(row.names=names(eByg)) >>> >>> countDF >> data frame with 0 columns and 23710 rows >>> >>> dim(countDF) >> [1] 23710 0 >>> >>> for(i in samplespath) { >> + aligns <- readGAlignmentsFromBam(i) counts <- countOverlaps(eByg, >> + aligns, ignore.strand=TRUE) countDF <- cbind(countDF, counts) } >> Warning messages: >> 1: In .Seqinfo.mergexy(x, y) : >> Each of the 2 combined objects has sequence levels not in the other: >> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >> chr4_gl000194_random, chr1_gl000192_random >> - in 'y': chrM >> Make sure to always combine/compare objects based on the same >> reference >> genome (use suppressWarnings() to suppress this warning). >> 2: In .Seqinfo.mergexy(x, y) : >> Each of the 2 combined objects has sequence levels not in the other: >> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >> chr4_gl000194_random, chr1_gl000192_random >> - in 'y': chrM >> Make sure to always combine/compare objects based on the same >> reference >> genome (use suppressWarnings() to suppress this warning). >> 3: In .Seqinfo.mergexy(x, y) : >> Each of the 2 combined objects has sequence levels not in the other: >> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >> chr4_gl000194_random, chr1_gl000192_random >> - in 'y': chrM >> Make sure to always combine/compare objects based on the same >> reference >> genome (use suppressWarnings() to suppress this warning). >> 4: In .Seqinfo.mergexy(x, y) : >> Each of the 2 combined objects has sequence levels not in the other: >> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >> chr4_gl000194_random, chr1_gl000192_random >> - in 'y': chrM >> Make sure to always combine/compare objects based on the same >> reference >> genome (use suppressWarnings() to suppress this warning). >>> >> >> I also wanted to verify if what I am doing is correct for paired end >> reads. >> >> Thanks, >> >> - Pankaj >> >> >> ======================================= >> sessionInfo() >> R version 3.0.3 (2014-03-06) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: vobencha at fhcrc.org Phone: (206) 667-3158

ADD REPLY • link 11.7 years ago Pankaj Agarwal ▴ 100

0

Entering edit mode

Hi guys, I have this weird error when I tried to install googleVis package. I installed it in my other PC with no problem and both runs Win7 Enterprise 64x. I tired to install from biocLite("googleVis"), install.packages("googleVis"), install.packages() from downloaded source, and had the same error. > biocLite("googleVis") BioC_mirror: http://bioconductor.org Using Bioconductor version 2.13 (BiocInstaller 1.12.1), R version 3.1.0. Installing package(s) 'googleVis' trying URL 'http://cran.fhcrc.org/bin/windows/contrib/3.1/googleVis_0.5.1.zip' Content type 'application/zip' length 962989 bytes (940 Kb) opened URL downloaded 940 Kb Error in if (any(diff)) { : missing value where TRUE/FALSE needed > utils:::menuInstallLocal() Error in if (any(diff)) { : missing value where TRUE/FALSE needed > traceback() 5: tools::checkMD5sums(pkgname, file.path(tmpDir, pkgname)) 4: unpackPkgZip(pkgs[i], pkgnames[i], lib, libs_only, lock, quiet) 3: .install.winbinary(pkgs = pkgs, lib = lib, contriburl = contriburl, method = method, available = available, destdir = destdir, dependencies = dependencies, libs_only = libs_only, quiet = quiet, ...) 2: install.packages(choose.files("", filters = Filters[c("zip", "All"), ]), .libPaths()[1L], repos = NULL) 1: utils:::menuInstallLocal() > sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BiocInstaller_1.12.1 loaded via a namespace (and not attached): [1] tools_3.1.0 > Thanks a lot for the help! Ying [[alternative HTML version deleted]]

ADD REPLY • link 11.7 years ago ying chen ▴ 340

0

Entering edit mode

On 04/23/2014 08:43 PM, ying chen wrote: > Hi guys, > > I have this weird error when I tried to install googleVis package. I installed it in my other PC with no problem and both runs Win7 Enterprise 64x. I tired to install from biocLite("googleVis"), install.packages("googleVis"), install.packages() from downloaded source, and had the same error. > >> biocLite("googleVis") > BioC_mirror: http://bioconductor.org > Using Bioconductor version 2.13 (BiocInstaller 1.12.1), R version 3.1.0. > Installing package(s) 'googleVis' > trying URL 'http://cran.fhcrc.org/bin/windows/contrib/3.1/googleVis_0.5.1.zip' > Content type 'application/zip' length 962989 bytes (940 Kb) > opened URL > downloaded 940 Kb > Error in if (any(diff)) { : missing value where TRUE/FALSE needed >> utils:::menuInstallLocal() > Error in if (any(diff)) { : missing value where TRUE/FALSE needed This is unusual. The code being evaluated is tools::checkMD5sums, and the idea is that the MD5 sums included in the zip archive are compared with the MD5 sums calculated from the installed (in a temporary location) package. I think what happens is that some of the installed files are not readable by the process that wrote them and md5sums() returns 'NA'; later, this is compared with the true md5sum from the zip archive and the test -- 'diff' -- contains an 'NA'. any(diff) then reports NA, and the 'if' statement fails if (NA) "foo" > any(c(FALSE, NA)) [1] NA > if (NA) "foo" Error in if (NA) "foo" : missing value where TRUE/FALSE needed I don't really know why your files might not be readable; maybe some other process (a virus checker?) has marked one or more files as not readable, or the your temporary directory itself is write-only (likely other packages would fail to install, too), or .... You could discover the problematic file by options(error=recover) biocLite("googleVis") and when the error occurs choosing Selection: 5 (so that you end up inside tools::checkMD5sums) and look at xx[is.na(nmxx)] (the md5 sums from the package) or x[is.na(nmxx)] (the md5 sums from the installed directory). The variable 'dir' is the directory where all the action is taking place. It would be good to hear back about what your investigation indicates. Martin >> traceback() > 5: tools::checkMD5sums(pkgname, file.path(tmpDir, pkgname)) > 4: unpackPkgZip(pkgs[i], pkgnames[i], lib, libs_only, lock, quiet) > 3: .install.winbinary(pkgs = pkgs, lib = lib, contriburl = contriburl, > method = method, available = available, destdir = destdir, > dependencies = dependencies, libs_only = libs_only, quiet = quiet, > ...) > 2: install.packages(choose.files("", filters = Filters[c("zip", > "All"), ]), .libPaths()[1L], repos = NULL) > 1: utils:::menuInstallLocal() >> sessionInfo() > R version 3.1.0 (2014-04-10) > Platform: x86_64-w64-mingw32/x64 (64-bit) > locale: > [1] LC_COLLATE=English_United States.1252 > [2] LC_CTYPE=English_United States.1252 > [3] LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > attached base packages: > [1] stats graphics grDevices utils datasets methods base > other attached packages: > [1] BiocInstaller_1.12.1 > loaded via a namespace (and not attached): > [1] tools_3.1.0 >> > > Thanks a lot for the help! > > Ying > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 11.7 years ago Martin Morgan 25k

0

Entering edit mode

Hi Martin, Thanks a lot for the help! I believe it's due to some virus on my computer as I just found out that my computer's antivirus programs all got turned off and could not be turned back on. I am re-installing Windows now and will try to re-install R and googleVis later. Thanks, Ying > Date: Thu, 24 Apr 2014 18:19:40 -0700 > From: mtmorgan@fhcrc.org > To: ying_chen@live.com; bioconductor@r-project.org > Subject: Re: [BioC] googleVis installation error > > On 04/23/2014 08:43 PM, ying chen wrote: > > Hi guys, > > > > I have this weird error when I tried to install googleVis package. I installed it in my other PC with no problem and both runs Win7 Enterprise 64x. I tired to install from biocLite("googleVis"), install.packages("googleVis"), install.packages() from downloaded source, and had the same error. > > > >> biocLite("googleVis") > > BioC_mirror: http://bioconductor.org > > Using Bioconductor version 2.13 (BiocInstaller 1.12.1), R version 3.1.0. > > Installing package(s) 'googleVis' > > trying URL 'http://cran.fhcrc.org/bin/windows/contrib/3.1/googleVis_0.5.1.zip' > > Content type 'application/zip' length 962989 bytes (940 Kb) > > opened URL > > downloaded 940 Kb > > Error in if (any(diff)) { : missing value where TRUE/FALSE needed > >> utils:::menuInstallLocal() > > Error in if (any(diff)) { : missing value where TRUE/FALSE needed > > This is unusual. The code being evaluated is tools::checkMD5sums, and the idea > is that the MD5 sums included in the zip archive are compared with the MD5 sums > calculated from the installed (in a temporary location) package. > > I think what happens is that some of the installed files are not readable by the > process that wrote them and md5sums() returns 'NA'; later, this is compared with > the true md5sum from the zip archive and the test -- 'diff' -- contains an 'NA'. > any(diff) then reports NA, and the 'if' statement fails > > if (NA) "foo" > > > any(c(FALSE, NA)) > [1] NA > > if (NA) "foo" > Error in if (NA) "foo" : missing value where TRUE/FALSE needed > > I don't really know why your files might not be readable; maybe some other > process (a virus checker?) has marked one or more files as not readable, or the > your temporary directory itself is write-only (likely other packages would fail > to install, too), or .... You could discover the problematic file by > > options(error=recover) > biocLite("googleVis") > > and when the error occurs choosing Selection: 5 (so that you end up inside > tools::checkMD5sums) and look at > > xx[is.na(nmxx)] > > (the md5 sums from the package) or x[is.na(nmxx)] (the md5 sums from the > installed directory). The variable 'dir' is the directory where all the action > is taking place. > > It would be good to hear back about what your investigation indicates. > > Martin > > >> traceback() > > 5: tools::checkMD5sums(pkgname, file.path(tmpDir, pkgname)) > > 4: unpackPkgZip(pkgs[i], pkgnames[i], lib, libs_only, lock, quiet) > > 3: .install.winbinary(pkgs = pkgs, lib = lib, contriburl = contriburl, > > method = method, available = available, destdir = destdir, > > dependencies = dependencies, libs_only = libs_only, quiet = quiet, > > ...) > > 2: install.packages(choose.files("", filters = Filters[c("zip", > > "All"), ]), .libPaths()[1L], repos = NULL) > > 1: utils:::menuInstallLocal() > >> sessionInfo() > > R version 3.1.0 (2014-04-10) > > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > > [1] LC_COLLATE=English_United States.1252 > > [2] LC_CTYPE=English_United States.1252 > > [3] LC_MONETARY=English_United States.1252 > > [4] LC_NUMERIC=C > > [5] LC_TIME=English_United States.1252 > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > > [1] BiocInstaller_1.12.1 > > loaded via a namespace (and not attached): > > [1] tools_3.1.0 > >> > > > > Thanks a lot for the help! > > > > Ying > > > > > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 [[alternative HTML version deleted]]

ADD REPLY • link 11.7 years ago ying chen ▴ 340

0

Entering edit mode

Are these paired-end reads? If yes set 'singleEnd=FALSE' in the summarizeOverlaps() call. summarizeOverlaps(eByg, bfl, "IntersectionNotEmpty", ignore.strand=TRUE, singleEnd=FALSE) To 'stamp' the BamFileList as containing paired-end reads set the 'asMates' arg to TRUE. bfl <- BamFileList(samplespath, asMates=TRUE) 'yieldSize' reads data in by chunks and is useful when the bam files are too large to fit in memory. Can you point me to where you download 'genes.gtf'? Valerie On 04/23/14 19:40, Pankaj Agarwal wrote: > Valerie, > > I used summarizeOverlaps() instead, since it seemed simpler to use. I also used htseq-count to get the counts using the same parameters, but I get very different results. > From what I have read and understood, both should give similar results since the counts should be reported at the "gene" level and not the "isoform" level. > I would feel much more comfortable with the results if they were both at least close. > > For both I used the UCSC GTF file in the iGenomes distribution. > For htseq-count I used BAM files sorted by name since it required it, but for summarizeOverlaps I used BAM files sorted by coordinates (the default sorting done by samtools). > Example of counts generated by the two methods are given after the code: > > The code for htseq-count that I used is as follows (once for each sample): > > samtools view -h A.sorted_byname.bam | htseq-count -t exon -i gene_id -m intersection-nonempty -s no - iGenomes/UCSC/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf > A.htseq.count > > The code for summarizeOverlaps() is as follows: > >> library(GenomicFeatures) > Loading required package: BiocGenerics > Loading required package: parallel > ... >> library(Rsamtools) > Loading required package: Biostrings >> library(rtracklayer) >> library(GenomicRanges) >> >> txdb <- makeTranscriptDbFromGFF(file=iGenomes/UCSC/Homo_sapiens/UCS C/hg19/Annotation/Genes/genes.gtf", format="gtf") > extracting transcript information > Estimating transcript ranges. > Extracting gene IDs > Processing splicing information for gtf file. > Deducing exon rank from relative coordinates provided > Prepare the 'metadata' data frame ... metadata: OK > Now generating chrominfo from available sequence names. No chromosome length information is available. > Warning messages: > 1: In .deduceExonRankings(exs, format = "gtf") : > Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName > 2: In matchCircularity(chroms, circ_seqs) : > None of the strings in your circ_seqs argument match your seqnames. >> > > - are these warnings message problems? > >> saveDb(txdb, file="./data/ucsc_igenomes_hg19_gtf.sqlite") >> >> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite") >> >> eByg <- exonsBy(txdb, by="gene") >> >> head(eByg) > GRangesList of length 6: > $A1BG > GRanges with 8 ranges and 2 metadata columns: > seqnames ranges strand | exon_id exon_name > <rle> <iranges> <rle> | <integer> <character> > [1] chr19 [58858172, 58858395] - | 163177 <na> > [2] chr19 [58858719, 58859006] - | 163178 <na> >> >> samples = c("A.sorted", "B.sorted", "C.sorted", "D.sorted") >> >> samples > [1] "A.sorted" "B.sorted" > [3] "C.sorted" "D.sorted" >> >> samplespath > [1] "../A.sorted.bam" > [2] "../B.bam" > [3] "../C.sorted.bam" > [4] "../D.sorted.bam" >> >> names(samplespath) <- samples >> bfl <- BamFileList(samplespath, yieldSize=50000, index=character()) > > I am not sure what yieldSize means, just used it from the example I followed. > >> bfl > BamFileList of length 4 > names(4): A.sorted ... B.sorted >> >> countDF_intsectionNotEmpty <- summarizeOverlaps(eByg, bfl, mode="IntersectionNotEmpty", ignore.strand=TRUE) >> >> head(countDF_intsectionNotEmpty) > class: SummarizedExperiment > dim: 1 4 > exptData(0): > assays(1): counts > rownames(1): A1BG > rowData metadata column names(0): > colnames(4): A1.sorted B.sorted C.sorted D.sorted > colData names(0): >> >> class(countDF_intsectionNotEmpty) > [1] "SummarizedExperiment" > attr(,"package") > [1] "GenomicRanges" >> >> countDF_intsectionNotEmpty <- assays(countDF_intsectionNotEmpty)$counts >> colnames(countDF_intsectionNotEmpty) <- samples >> countDF_intsectionNotEmpty[1:4,] > A.sorted B.sorted C.sorted D.sorted > A1BG 13 119 162 136 > A1BG-AS1 2 112 52 85 > A1CF 3 28 0 5 > A2M 2543 17593 47 233 > ... > > The counts that I get are very different from what I get using htseq-count, which are following for the same genes listed above: > > A.bam B. bam C. bam D. bam > A1BG 7 67 90 72 > A1BG-AS1 1 69 41 51 > A1CF 2 16 0 3 > A2M 1536 10866 34 173 > ... > ... > > I would appreciate your feedback. > > Thanks, > > - Pankaj > > -----Original Message----- > From: Valerie Obenchain [mailto:vobencha at fhcrc.org] > Sent: Wednesday, April 23, 2014 12:50 PM > To: Pankaj Agarwal; bioconductor at r-project.org > Subject: Re: [BioC] countOverlaps Warnings > > Pankaj, > > How did it go? Were you able to get readGAlignmentPairs() or > readGAlignmentsList() to work for your case? > > Valerie > > > On 04/21/2014 01:32 PM, Valerie Obenchain wrote: >> Hi Pankaj, >> >> There are several options for counting paired-end reads. >> >> Both readGAlignmentPairs() and readGAlignmentsList() are appropriate >> for paired-end data and are in the GenomicAlignments package. They use >> the same counting algorithm but return the data in different containers. >> readGAlignmentPairs() returns only pairs while readGAlignmentsList() >> returns pairs as well as singletons, reads with unmapped pairs etc. >> See the ?readGAlignments man page for details. >> >> Another counting function is summarizeOverlaps() which uses the above >> functions 'under the hood'. It returns the counts in the 'assays' slot >> of a SummarizedExperiment object. Since you have multiple bam files to >> count you may want to use the BamFileList method. >> >> The man page has examples of counting paired-end bam files. >> library(Rsamtools) >> ?summarizeOverlaps >> >> Other packages that offer count functions are Rsubread and easyRNASeq. >> >> Valerie >> >> >> >> >> On 04/21/2014 09:53 AM, Pankaj Agarwal wrote: >>> Hi, >>> >>> I am analyzing a matched pair tumor/normal rna-seq data set. After >>> aligning with bowtie2 against UCSC hg19 index, I am trying to get the >>> counts for each of the samples using the iGenomes UCSC GTF file. I >>> came across the following tutorial which shows different ways to do >>> it in Bioconductor: >>> http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_De >>> c_12_16_2013/Rrnaseq/Rrnaseq.pdf >>> >>> >>> Following along slide 28 "Read Counting with countOverlaps", I am >>> able to generate the counts but get the following Warnings. I just >>> want to be sure that there is nothing to be worried about because I >>> don't understand the meaning of these warnings. My code is as follows: >>> >>>> library(GenomicFeatures) >>> Loading required package: AnnotationDbi >>>> library(Rsamtools) >>>> library(rtracklayer) >>>> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite") >>>> eByg <- exonsBy(txdb, by="gene") >>>> head(eByg) >>> GRangesList of length 6: >>> $A1BG >>> GRanges with 8 ranges and 2 metadata columns: >>> seqnames ranges strand | exon_id exon_name >>> <rle> <iranges> <rle> | <integer> <character> >>> [1] chr19 [58858172, 58858395] - | 163177 <na> >>> [2] chr19 [58858719, 58859006] - | 163178 <na> >>> [3] chr19 [58861736, 58862017] - | 163179 <na> >>> [4] chr19 [58862757, 58863053] - | 163180 <na> >>> [5] chr19 [58863649, 58863921] - | 163181 <na> >>> [6] chr19 [58864294, 58864563] - | 163182 <na> >>> [7] chr19 [58864658, 58864693] - | 163183 <na> >>> [8] chr19 [58864770, 58864865] - | 163184 <na> >>> >>> $A1BG-AS1 >>> GRanges with 4 ranges and 2 metadata columns: >>> seqnames ranges strand | exon_id exon_name >>> [1] chr19 [58863336, 58864410] + | 156640 <na> >>> [2] chr19 [58864745, 58864840] + | 156641 <na> >>> [3] chr19 [58865080, 58865223] + | 156642 <na> >>> [4] chr19 [58865735, 58866549] + | 156643 <na> >>> >>> ... >>> <4 more elements> >>> --- >>> seqlengths: >>> chr12 chr8 ... chr1_gl000192_random >>> NA NA ... NA >>>> samples = c("BRPC13-1118_L1.D710_501.sorted", >>>> "BRPC_13-764_L2.sorted", "DU04_PBMC_RNA_L1.sorted", >>>> "DU06_PBMC_RNA_L1.sorted") >>>> >>>> samples >>> [1] "BRPC13-1118_L1.D710_501.sorted" "BRPC_13-764_L2.sorted" >>> [3] "DU04_PBMC_RNA_L1.sorted" "DU06_PBMC_RNA_L1.sorted" >>>> >>>> samplespath <- paste("./", samples, ".bam", sep="") >>>> >>>> samplespath >>> [1] "./BRPC13-1118_L1.D710_501.sorted.bam" >>> [2] "./BRPC_13-764_L2.sorted.bam" >>> [3] "./DU04_PBMC_RNA_L1.sorted.bam" >>> [4] "./DU06_PBMC_RNA_L1.sorted.bam" >>>> >>>> names(samplespath) <- samples >>>> >>>> samplespath >>> BRPC13-1118_L1.D710_501.sorted BRPC_13-764_L2.sorted >>> "./BRPC13-1118_L1.D710_501.sorted.bam" >>> "./BRPC_13-764_L2.sorted.bam" >>> DU04_PBMC_RNA_L1.sorted DU06_PBMC_RNA_L1.sorted >>> "./DU04_PBMC_RNA_L1.sorted.bam" >>> "./DU06_PBMC_RNA_L1.sorted.bam" >>>> >>>> countDF <- data.frame(row.names=names(eByg)) >>>> >>>> countDF >>> data frame with 0 columns and 23710 rows >>>> >>>> dim(countDF) >>> [1] 23710 0 >>>> >>>> for(i in samplespath) { >>> + aligns <- readGAlignmentsFromBam(i) counts <- countOverlaps(eByg, >>> + aligns, ignore.strand=TRUE) countDF <- cbind(countDF, counts) } >>> Warning messages: >>> 1: In .Seqinfo.mergexy(x, y) : >>> Each of the 2 combined objects has sequence levels not in the other: >>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>> chr4_gl000194_random, chr1_gl000192_random >>> - in 'y': chrM >>> Make sure to always combine/compare objects based on the same >>> reference >>> genome (use suppressWarnings() to suppress this warning). >>> 2: In .Seqinfo.mergexy(x, y) : >>> Each of the 2 combined objects has sequence levels not in the other: >>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>> chr4_gl000194_random, chr1_gl000192_random >>> - in 'y': chrM >>> Make sure to always combine/compare objects based on the same >>> reference >>> genome (use suppressWarnings() to suppress this warning). >>> 3: In .Seqinfo.mergexy(x, y) : >>> Each of the 2 combined objects has sequence levels not in the other: >>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>> chr4_gl000194_random, chr1_gl000192_random >>> - in 'y': chrM >>> Make sure to always combine/compare objects based on the same >>> reference >>> genome (use suppressWarnings() to suppress this warning). >>> 4: In .Seqinfo.mergexy(x, y) : >>> Each of the 2 combined objects has sequence levels not in the other: >>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>> chr4_gl000194_random, chr1_gl000192_random >>> - in 'y': chrM >>> Make sure to always combine/compare objects based on the same >>> reference >>> genome (use suppressWarnings() to suppress this warning). >>>> >>> >>> I also wanted to verify if what I am doing is correct for paired end >>> reads. >>> >>> Thanks, >>> >>> - Pankaj >>> >>> >>> ======================================= >>> sessionInfo() >>> R version 3.0.3 (2014-03-06) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > >

ADD REPLY • link 11.7 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Yes, these are p.e reads. I made the changes you suggested as follows, everything else being the same as before: bfl <- BamFileList(samplespath, yieldSize=50000, index=character(), asMates=TRUE) ... countDF_intsectionNotEmpty_042414A <- summarizeOverlaps(eByg, bfl, mode="IntersectionNotEmpty", ignore.strand=TRUE, singleEnd=FALSE) Warning in FUN(bf, param = param) : 'yieldSize' set to 'NA' Warning in FUN(bf, param = param) : 'yieldSize' set to 'NA' ... Not sure why I got these warning this time (did not see them last time). The results are much closer to what I got with htseq-count, though still not very close: htseq-count: > A.bam B. bam C. bam D. bam > A1BG 7 67 90 72 > A1BG-AS1 1 69 41 51 > A1CF 2 16 0 3 > A2M 1536 10866 34 173 summarizeOverlaps: > A.bam B. bam C. bam D. bam > A1BG 7 65 78 66 > A1BG-AS1 1 66 37 47 > A1CF 2 16 0 3 > A2M 1001 6897 28 160 I got the GTF files from the following website: http://support.illumina.com/sequencing/sequencing_software/igenome.ilm n Homo sapiens Ensembl GRCh37 NCBI build37.2 UCSC hg19 I used the UCSC hg19 - which has the bowtie2 index, which I used for alignment with bowtie2, and the GTF file, which I used as the annotation source. This same GTF was used for htseq-count also. Thanks, - Pankaj ________________________________________ From: Valerie Obenchain [vobencha@fhcrc.org] Sent: Thursday, April 24, 2014 9:39 AM To: Pankaj Agarwal; bioconductor at r-project.org Subject: Re: [BioC] countOverlaps Warnings Are these paired-end reads? If yes set 'singleEnd=FALSE' in the summarizeOverlaps() call. summarizeOverlaps(eByg, bfl, "IntersectionNotEmpty", ignore.strand=TRUE, singleEnd=FALSE) To 'stamp' the BamFileList as containing paired-end reads set the 'asMates' arg to TRUE. bfl <- BamFileList(samplespath, asMates=TRUE) 'yieldSize' reads data in by chunks and is useful when the bam files are too large to fit in memory. Can you point me to where you download 'genes.gtf'? Valerie On 04/23/14 19:40, Pankaj Agarwal wrote: > Valerie, > > I used summarizeOverlaps() instead, since it seemed simpler to use. I also used htseq-count to get the counts using the same parameters, but I get very different results. > From what I have read and understood, both should give similar results since the counts should be reported at the "gene" level and not the "isoform" level. > I would feel much more comfortable with the results if they were both at least close. > > For both I used the UCSC GTF file in the iGenomes distribution. > For htseq-count I used BAM files sorted by name since it required it, but for summarizeOverlaps I used BAM files sorted by coordinates (the default sorting done by samtools). > Example of counts generated by the two methods are given after the code: > > The code for htseq-count that I used is as follows (once for each sample): > > samtools view -h A.sorted_byname.bam | htseq-count -t exon -i gene_id -m intersection-nonempty -s no - iGenomes/UCSC/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf > A.htseq.count > > The code for summarizeOverlaps() is as follows: > >> library(GenomicFeatures) > Loading required package: BiocGenerics > Loading required package: parallel > ... >> library(Rsamtools) > Loading required package: Biostrings >> library(rtracklayer) >> library(GenomicRanges) >> >> txdb <- makeTranscriptDbFromGFF(file=iGenomes/UCSC/Homo_sapiens/UCS C/hg19/Annotation/Genes/genes.gtf", format="gtf") > extracting transcript information > Estimating transcript ranges. > Extracting gene IDs > Processing splicing information for gtf file. > Deducing exon rank from relative coordinates provided > Prepare the 'metadata' data frame ... metadata: OK > Now generating chrominfo from available sequence names. No chromosome length information is available. > Warning messages: > 1: In .deduceExonRankings(exs, format = "gtf") : > Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName > 2: In matchCircularity(chroms, circ_seqs) : > None of the strings in your circ_seqs argument match your seqnames. >> > > - are these warnings message problems? > >> saveDb(txdb, file="./data/ucsc_igenomes_hg19_gtf.sqlite") >> >> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite") >> >> eByg <- exonsBy(txdb, by="gene") >> >> head(eByg) > GRangesList of length 6: > $A1BG > GRanges with 8 ranges and 2 metadata columns: > seqnames ranges strand | exon_id exon_name > <rle> <iranges> <rle> | <integer> <character> > [1] chr19 [58858172, 58858395] - | 163177 <na> > [2] chr19 [58858719, 58859006] - | 163178 <na> >> >> samples = c("A.sorted", "B.sorted", "C.sorted", "D.sorted") >> >> samples > [1] "A.sorted" "B.sorted" > [3] "C.sorted" "D.sorted" >> >> samplespath > [1] "../A.sorted.bam" > [2] "../B.bam" > [3] "../C.sorted.bam" > [4] "../D.sorted.bam" >> >> names(samplespath) <- samples >> bfl <- BamFileList(samplespath, yieldSize=50000, index=character()) > > I am not sure what yieldSize means, just used it from the example I followed. > >> bfl > BamFileList of length 4 > names(4): A.sorted ... B.sorted >> >> countDF_intsectionNotEmpty <- summarizeOverlaps(eByg, bfl, mode="IntersectionNotEmpty", ignore.strand=TRUE) >> >> head(countDF_intsectionNotEmpty) > class: SummarizedExperiment > dim: 1 4 > exptData(0): > assays(1): counts > rownames(1): A1BG > rowData metadata column names(0): > colnames(4): A1.sorted B.sorted C.sorted D.sorted > colData names(0): >> >> class(countDF_intsectionNotEmpty) > [1] "SummarizedExperiment" > attr(,"package") > [1] "GenomicRanges" >> >> countDF_intsectionNotEmpty <- assays(countDF_intsectionNotEmpty)$counts >> colnames(countDF_intsectionNotEmpty) <- samples >> countDF_intsectionNotEmpty[1:4,] > A.sorted B.sorted C.sorted D.sorted > A1BG 13 119 162 136 > A1BG-AS1 2 112 52 85 > A1CF 3 28 0 5 > A2M 2543 17593 47 233 > ... > > The counts that I get are very different from what I get using htseq-count, which are following for the same genes listed above: > > A.bam B. bam C. bam D. bam > A1BG 7 67 90 72 > A1BG-AS1 1 69 41 51 > A1CF 2 16 0 3 > A2M 1536 10866 34 173 > ... > ... > > I would appreciate your feedback. > > Thanks, > > - Pankaj > > -----Original Message----- > From: Valerie Obenchain [mailto:vobencha at fhcrc.org] > Sent: Wednesday, April 23, 2014 12:50 PM > To: Pankaj Agarwal; bioconductor at r-project.org > Subject: Re: [BioC] countOverlaps Warnings > > Pankaj, > > How did it go? Were you able to get readGAlignmentPairs() or > readGAlignmentsList() to work for your case? > > Valerie > > > On 04/21/2014 01:32 PM, Valerie Obenchain wrote: >> Hi Pankaj, >> >> There are several options for counting paired-end reads. >> >> Both readGAlignmentPairs() and readGAlignmentsList() are appropriate >> for paired-end data and are in the GenomicAlignments package. They use >> the same counting algorithm but return the data in different containers. >> readGAlignmentPairs() returns only pairs while readGAlignmentsList() >> returns pairs as well as singletons, reads with unmapped pairs etc. >> See the ?readGAlignments man page for details. >> >> Another counting function is summarizeOverlaps() which uses the above >> functions 'under the hood'. It returns the counts in the 'assays' slot >> of a SummarizedExperiment object. Since you have multiple bam files to >> count you may want to use the BamFileList method. >> >> The man page has examples of counting paired-end bam files. >> library(Rsamtools) >> ?summarizeOverlaps >> >> Other packages that offer count functions are Rsubread and easyRNASeq. >> >> Valerie >> >> >> >> >> On 04/21/2014 09:53 AM, Pankaj Agarwal wrote: >>> Hi, >>> >>> I am analyzing a matched pair tumor/normal rna-seq data set. After >>> aligning with bowtie2 against UCSC hg19 index, I am trying to get the >>> counts for each of the samples using the iGenomes UCSC GTF file. I >>> came across the following tutorial which shows different ways to do >>> it in Bioconductor: >>> http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_De >>> c_12_16_2013/Rrnaseq/Rrnaseq.pdf >>> >>> >>> Following along slide 28 "Read Counting with countOverlaps", I am >>> able to generate the counts but get the following Warnings. I just >>> want to be sure that there is nothing to be worried about because I >>> don't understand the meaning of these warnings. My code is as follows: >>> >>>> library(GenomicFeatures) >>> Loading required package: AnnotationDbi >>>> library(Rsamtools) >>>> library(rtracklayer) >>>> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite") >>>> eByg <- exonsBy(txdb, by="gene") >>>> head(eByg) >>> GRangesList of length 6: >>> $A1BG >>> GRanges with 8 ranges and 2 metadata columns: >>> seqnames ranges strand | exon_id exon_name >>> <rle> <iranges> <rle> | <integer> <character> >>> [1] chr19 [58858172, 58858395] - | 163177 <na> >>> [2] chr19 [58858719, 58859006] - | 163178 <na> >>> [3] chr19 [58861736, 58862017] - | 163179 <na> >>> [4] chr19 [58862757, 58863053] - | 163180 <na> >>> [5] chr19 [58863649, 58863921] - | 163181 <na> >>> [6] chr19 [58864294, 58864563] - | 163182 <na> >>> [7] chr19 [58864658, 58864693] - | 163183 <na> >>> [8] chr19 [58864770, 58864865] - | 163184 <na> >>> >>> $A1BG-AS1 >>> GRanges with 4 ranges and 2 metadata columns: >>> seqnames ranges strand | exon_id exon_name >>> [1] chr19 [58863336, 58864410] + | 156640 <na> >>> [2] chr19 [58864745, 58864840] + | 156641 <na> >>> [3] chr19 [58865080, 58865223] + | 156642 <na> >>> [4] chr19 [58865735, 58866549] + | 156643 <na> >>> >>> ... >>> <4 more elements> >>> --- >>> seqlengths: >>> chr12 chr8 ... chr1_gl000192_random >>> NA NA ... NA >>>> samples = c("BRPC13-1118_L1.D710_501.sorted", >>>> "BRPC_13-764_L2.sorted", "DU04_PBMC_RNA_L1.sorted", >>>> "DU06_PBMC_RNA_L1.sorted") >>>> >>>> samples >>> [1] "BRPC13-1118_L1.D710_501.sorted" "BRPC_13-764_L2.sorted" >>> [3] "DU04_PBMC_RNA_L1.sorted" "DU06_PBMC_RNA_L1.sorted" >>>> >>>> samplespath <- paste("./", samples, ".bam", sep="") >>>> >>>> samplespath >>> [1] "./BRPC13-1118_L1.D710_501.sorted.bam" >>> [2] "./BRPC_13-764_L2.sorted.bam" >>> [3] "./DU04_PBMC_RNA_L1.sorted.bam" >>> [4] "./DU06_PBMC_RNA_L1.sorted.bam" >>>> >>>> names(samplespath) <- samples >>>> >>>> samplespath >>> BRPC13-1118_L1.D710_501.sorted BRPC_13-764_L2.sorted >>> "./BRPC13-1118_L1.D710_501.sorted.bam" >>> "./BRPC_13-764_L2.sorted.bam" >>> DU04_PBMC_RNA_L1.sorted DU06_PBMC_RNA_L1.sorted >>> "./DU04_PBMC_RNA_L1.sorted.bam" >>> "./DU06_PBMC_RNA_L1.sorted.bam" >>>> >>>> countDF <- data.frame(row.names=names(eByg)) >>>> >>>> countDF >>> data frame with 0 columns and 23710 rows >>>> >>>> dim(countDF) >>> [1] 23710 0 >>>> >>>> for(i in samplespath) { >>> + aligns <- readGAlignmentsFromBam(i) counts <- countOverlaps(eByg, >>> + aligns, ignore.strand=TRUE) countDF <- cbind(countDF, counts) } >>> Warning messages: >>> 1: In .Seqinfo.mergexy(x, y) : >>> Each of the 2 combined objects has sequence levels not in the other: >>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>> chr4_gl000194_random, chr1_gl000192_random >>> - in 'y': chrM >>> Make sure to always combine/compare objects based on the same >>> reference >>> genome (use suppressWarnings() to suppress this warning). >>> 2: In .Seqinfo.mergexy(x, y) : >>> Each of the 2 combined objects has sequence levels not in the other: >>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>> chr4_gl000194_random, chr1_gl000192_random >>> - in 'y': chrM >>> Make sure to always combine/compare objects based on the same >>> reference >>> genome (use suppressWarnings() to suppress this warning). >>> 3: In .Seqinfo.mergexy(x, y) : >>> Each of the 2 combined objects has sequence levels not in the other: >>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>> chr4_gl000194_random, chr1_gl000192_random >>> - in 'y': chrM >>> Make sure to always combine/compare objects based on the same >>> reference >>> genome (use suppressWarnings() to suppress this warning). >>> 4: In .Seqinfo.mergexy(x, y) : >>> Each of the 2 combined objects has sequence levels not in the other: >>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>> chr4_gl000194_random, chr1_gl000192_random >>> - in 'y': chrM >>> Make sure to always combine/compare objects based on the same >>> reference >>> genome (use suppressWarnings() to suppress this warning). >>>> >>> >>> I also wanted to verify if what I am doing is correct for paired end >>> reads. >>> >>> Thanks, >>> >>> - Pankaj >>> >>> >>> ======================================= >>> sessionInfo() >>> R version 3.0.3 (2014-03-06) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > >

ADD REPLY • link 11.7 years ago Pankaj Agarwal ▴ 100

0

Entering edit mode

I think you have an older version of summarizeOverlaps() which had a restriction on using 'yieldSize' with paired-end and emits the warning you see below. The reason you didn't see the warning the first time is because the reads were being treated as single-end (hence almost double the count difference). To update to the current version you'll need R 3.1 and Bioconductor 2.14 (your sessionInfo() below shows R 3.0.3). Even with the update there is no guarantee the counts you see with htseq-count and summarizeOverlaps will be 100% the same. The pairing algorithm (i.e., how we define which reads are true pairs) is documented on the ?readGAlignments man page in the 'Pairing criteria' section. The HTSeq docs probably define how they determine pairs. Once you update you may also want to check out the 'fragments' argument to summarizeOverlaps(). This allows you to read in not only the records paired according to the algorithm but records with unmapped pairs, singletons, and other fragments. This may not be appropriate for your current task but does provide and interesting separation of the paired reads and 'other' fragments. Valerie On 04/24/14 12:44, Pankaj Agarwal wrote: > Yes, these are p.e reads. I made the changes you suggested as follows, everything else being the same as before: > > bfl <- BamFileList(samplespath, yieldSize=50000, index=character(), asMates=TRUE) > ... > countDF_intsectionNotEmpty_042414A <- summarizeOverlaps(eByg, bfl, mode="IntersectionNotEmpty", ignore.strand=TRUE, singleEnd=FALSE) > Warning in FUN(bf, param = param) : 'yieldSize' set to 'NA' > Warning in FUN(bf, param = param) : 'yieldSize' set to 'NA' > ... > > Not sure why I got these warning this time (did not see them last time). > > The results are much closer to what I got with htseq-count, though still not very close: > > htseq-count: >> A.bam B. bam C. bam D. bam >> A1BG 7 67 90 72 >> A1BG-AS1 1 69 41 51 >> A1CF 2 16 0 3 >> A2M 1536 10866 34 173 > > summarizeOverlaps: >> A.bam B. bam C. bam D. bam >> A1BG 7 65 78 66 >> A1BG-AS1 1 66 37 47 >> A1CF 2 16 0 3 >> A2M 1001 6897 28 160 > > I got the GTF files from the following website: > > http://support.illumina.com/sequencing/sequencing_software/igenome.i lmn > > Homo sapiens > Ensembl GRCh37 > NCBI build37.2 > UCSC hg19 > > I used the UCSC hg19 - which has the bowtie2 index, which I used for alignment with bowtie2, and the GTF file, which I used as the annotation source. This same GTF was used for htseq-count also. > > Thanks, > > - Pankaj > > ________________________________________ > From: Valerie Obenchain [vobencha at fhcrc.org] > Sent: Thursday, April 24, 2014 9:39 AM > To: Pankaj Agarwal; bioconductor at r-project.org > Subject: Re: [BioC] countOverlaps Warnings > > Are these paired-end reads? If yes set 'singleEnd=FALSE' in the > summarizeOverlaps() call. > > summarizeOverlaps(eByg, bfl, "IntersectionNotEmpty", > ignore.strand=TRUE, singleEnd=FALSE) > > To 'stamp' the BamFileList as containing paired-end reads set the > 'asMates' arg to TRUE. > > bfl <- BamFileList(samplespath, asMates=TRUE) > > 'yieldSize' reads data in by chunks and is useful when the bam files are > too large to fit in memory. > > Can you point me to where you download 'genes.gtf'? > > Valerie > > On 04/23/14 19:40, Pankaj Agarwal wrote: >> Valerie, >> >> I used summarizeOverlaps() instead, since it seemed simpler to use. I also used htseq-count to get the counts using the same parameters, but I get very different results. >> From what I have read and understood, both should give similar results since the counts should be reported at the "gene" level and not the "isoform" level. >> I would feel much more comfortable with the results if they were both at least close. >> >> For both I used the UCSC GTF file in the iGenomes distribution. >> For htseq-count I used BAM files sorted by name since it required it, but for summarizeOverlaps I used BAM files sorted by coordinates (the default sorting done by samtools). >> Example of counts generated by the two methods are given after the code: >> >> The code for htseq-count that I used is as follows (once for each sample): >> >> samtools view -h A.sorted_byname.bam | htseq-count -t exon -i gene_id -m intersection-nonempty -s no - iGenomes/UCSC/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf > A.htseq.count >> >> The code for summarizeOverlaps() is as follows: >> >>> library(GenomicFeatures) >> Loading required package: BiocGenerics >> Loading required package: parallel >> ... >>> library(Rsamtools) >> Loading required package: Biostrings >>> library(rtracklayer) >>> library(GenomicRanges) >>> >>> txdb <- makeTranscriptDbFromGFF(file=iGenomes/UCSC/Homo_sapiens/UC SC/hg19/Annotation/Genes/genes.gtf", format="gtf") >> extracting transcript information >> Estimating transcript ranges. >> Extracting gene IDs >> Processing splicing information for gtf file. >> Deducing exon rank from relative coordinates provided >> Prepare the 'metadata' data frame ... metadata: OK >> Now generating chrominfo from available sequence names. No chromosome length information is available. >> Warning messages: >> 1: In .deduceExonRankings(exs, format = "gtf") : >> Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName >> 2: In matchCircularity(chroms, circ_seqs) : >> None of the strings in your circ_seqs argument match your seqnames. >>> >> >> - are these warnings message problems? >> >>> saveDb(txdb, file="./data/ucsc_igenomes_hg19_gtf.sqlite") >>> >>> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite") >>> >>> eByg <- exonsBy(txdb, by="gene") >>> >>> head(eByg) >> GRangesList of length 6: >> $A1BG >> GRanges with 8 ranges and 2 metadata columns: >> seqnames ranges strand | exon_id exon_name >> <rle> <iranges> <rle> | <integer> <character> >> [1] chr19 [58858172, 58858395] - | 163177 <na> >> [2] chr19 [58858719, 58859006] - | 163178 <na> >>> >>> samples = c("A.sorted", "B.sorted", "C.sorted", "D.sorted") >>> >>> samples >> [1] "A.sorted" "B.sorted" >> [3] "C.sorted" "D.sorted" >>> >>> samplespath >> [1] "../A.sorted.bam" >> [2] "../B.bam" >> [3] "../C.sorted.bam" >> [4] "../D.sorted.bam" >>> >>> names(samplespath) <- samples >>> bfl <- BamFileList(samplespath, yieldSize=50000, index=character()) >> >> I am not sure what yieldSize means, just used it from the example I followed. >> >>> bfl >> BamFileList of length 4 >> names(4): A.sorted ... B.sorted >>> >>> countDF_intsectionNotEmpty <- summarizeOverlaps(eByg, bfl, mode="IntersectionNotEmpty", ignore.strand=TRUE) >>> >>> head(countDF_intsectionNotEmpty) >> class: SummarizedExperiment >> dim: 1 4 >> exptData(0): >> assays(1): counts >> rownames(1): A1BG >> rowData metadata column names(0): >> colnames(4): A1.sorted B.sorted C.sorted D.sorted >> colData names(0): >>> >>> class(countDF_intsectionNotEmpty) >> [1] "SummarizedExperiment" >> attr(,"package") >> [1] "GenomicRanges" >>> >>> countDF_intsectionNotEmpty <- assays(countDF_intsectionNotEmpty)$counts >>> colnames(countDF_intsectionNotEmpty) <- samples >>> countDF_intsectionNotEmpty[1:4,] >> A.sorted B.sorted C.sorted D.sorted >> A1BG 13 119 162 136 >> A1BG-AS1 2 112 52 85 >> A1CF 3 28 0 5 >> A2M 2543 17593 47 233 >> ... >> >> The counts that I get are very different from what I get using htseq-count, which are following for the same genes listed above: >> >> A.bam B. bam C. bam D. bam >> A1BG 7 67 90 72 >> A1BG-AS1 1 69 41 51 >> A1CF 2 16 0 3 >> A2M 1536 10866 34 173 >> ... >> ... >> >> I would appreciate your feedback. >> >> Thanks, >> >> - Pankaj >> >> -----Original Message----- >> From: Valerie Obenchain [mailto:vobencha at fhcrc.org] >> Sent: Wednesday, April 23, 2014 12:50 PM >> To: Pankaj Agarwal; bioconductor at r-project.org >> Subject: Re: [BioC] countOverlaps Warnings >> >> Pankaj, >> >> How did it go? Were you able to get readGAlignmentPairs() or >> readGAlignmentsList() to work for your case? >> >> Valerie >> >> >> On 04/21/2014 01:32 PM, Valerie Obenchain wrote: >>> Hi Pankaj, >>> >>> There are several options for counting paired-end reads. >>> >>> Both readGAlignmentPairs() and readGAlignmentsList() are appropriate >>> for paired-end data and are in the GenomicAlignments package. They use >>> the same counting algorithm but return the data in different containers. >>> readGAlignmentPairs() returns only pairs while readGAlignmentsList() >>> returns pairs as well as singletons, reads with unmapped pairs etc. >>> See the ?readGAlignments man page for details. >>> >>> Another counting function is summarizeOverlaps() which uses the above >>> functions 'under the hood'. It returns the counts in the 'assays' slot >>> of a SummarizedExperiment object. Since you have multiple bam files to >>> count you may want to use the BamFileList method. >>> >>> The man page has examples of counting paired-end bam files. >>> library(Rsamtools) >>> ?summarizeOverlaps >>> >>> Other packages that offer count functions are Rsubread and easyRNASeq. >>> >>> Valerie >>> >>> >>> >>> >>> On 04/21/2014 09:53 AM, Pankaj Agarwal wrote: >>>> Hi, >>>> >>>> I am analyzing a matched pair tumor/normal rna-seq data set. After >>>> aligning with bowtie2 against UCSC hg19 index, I am trying to get the >>>> counts for each of the samples using the iGenomes UCSC GTF file. I >>>> came across the following tutorial which shows different ways to do >>>> it in Bioconductor: >>>> http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_De >>>> c_12_16_2013/Rrnaseq/Rrnaseq.pdf >>>> >>>> >>>> Following along slide 28 "Read Counting with countOverlaps", I am >>>> able to generate the counts but get the following Warnings. I just >>>> want to be sure that there is nothing to be worried about because I >>>> don't understand the meaning of these warnings. My code is as follows: >>>> >>>>> library(GenomicFeatures) >>>> Loading required package: AnnotationDbi >>>>> library(Rsamtools) >>>>> library(rtracklayer) >>>>> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite") >>>>> eByg <- exonsBy(txdb, by="gene") >>>>> head(eByg) >>>> GRangesList of length 6: >>>> $A1BG >>>> GRanges with 8 ranges and 2 metadata columns: >>>> seqnames ranges strand | exon_id exon_name >>>> <rle> <iranges> <rle> | <integer> <character> >>>> [1] chr19 [58858172, 58858395] - | 163177 <na> >>>> [2] chr19 [58858719, 58859006] - | 163178 <na> >>>> [3] chr19 [58861736, 58862017] - | 163179 <na> >>>> [4] chr19 [58862757, 58863053] - | 163180 <na> >>>> [5] chr19 [58863649, 58863921] - | 163181 <na> >>>> [6] chr19 [58864294, 58864563] - | 163182 <na> >>>> [7] chr19 [58864658, 58864693] - | 163183 <na> >>>> [8] chr19 [58864770, 58864865] - | 163184 <na> >>>> >>>> $A1BG-AS1 >>>> GRanges with 4 ranges and 2 metadata columns: >>>> seqnames ranges strand | exon_id exon_name >>>> [1] chr19 [58863336, 58864410] + | 156640 <na> >>>> [2] chr19 [58864745, 58864840] + | 156641 <na> >>>> [3] chr19 [58865080, 58865223] + | 156642 <na> >>>> [4] chr19 [58865735, 58866549] + | 156643 <na> >>>> >>>> ... >>>> <4 more elements> >>>> --- >>>> seqlengths: >>>> chr12 chr8 ... chr1_gl000192_random >>>> NA NA ... NA >>>>> samples = c("BRPC13-1118_L1.D710_501.sorted", >>>>> "BRPC_13-764_L2.sorted", "DU04_PBMC_RNA_L1.sorted", >>>>> "DU06_PBMC_RNA_L1.sorted") >>>>> >>>>> samples >>>> [1] "BRPC13-1118_L1.D710_501.sorted" "BRPC_13-764_L2.sorted" >>>> [3] "DU04_PBMC_RNA_L1.sorted" "DU06_PBMC_RNA_L1.sorted" >>>>> >>>>> samplespath <- paste("./", samples, ".bam", sep="") >>>>> >>>>> samplespath >>>> [1] "./BRPC13-1118_L1.D710_501.sorted.bam" >>>> [2] "./BRPC_13-764_L2.sorted.bam" >>>> [3] "./DU04_PBMC_RNA_L1.sorted.bam" >>>> [4] "./DU06_PBMC_RNA_L1.sorted.bam" >>>>> >>>>> names(samplespath) <- samples >>>>> >>>>> samplespath >>>> BRPC13-1118_L1.D710_501.sorted BRPC_13-764_L2.sorted >>>> "./BRPC13-1118_L1.D710_501.sorted.bam" >>>> "./BRPC_13-764_L2.sorted.bam" >>>> DU04_PBMC_RNA_L1.sorted DU06_PBMC_RNA_L1.sorted >>>> "./DU04_PBMC_RNA_L1.sorted.bam" >>>> "./DU06_PBMC_RNA_L1.sorted.bam" >>>>> >>>>> countDF <- data.frame(row.names=names(eByg)) >>>>> >>>>> countDF >>>> data frame with 0 columns and 23710 rows >>>>> >>>>> dim(countDF) >>>> [1] 23710 0 >>>>> >>>>> for(i in samplespath) { >>>> + aligns <- readGAlignmentsFromBam(i) counts <- countOverlaps(eByg, >>>> + aligns, ignore.strand=TRUE) countDF <- cbind(countDF, counts) } >>>> Warning messages: >>>> 1: In .Seqinfo.mergexy(x, y) : >>>> Each of the 2 combined objects has sequence levels not in the other: >>>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>>> chr4_gl000194_random, chr1_gl000192_random >>>> - in 'y': chrM >>>> Make sure to always combine/compare objects based on the same >>>> reference >>>> genome (use suppressWarnings() to suppress this warning). >>>> 2: In .Seqinfo.mergexy(x, y) : >>>> Each of the 2 combined objects has sequence levels not in the other: >>>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>>> chr4_gl000194_random, chr1_gl000192_random >>>> - in 'y': chrM >>>> Make sure to always combine/compare objects based on the same >>>> reference >>>> genome (use suppressWarnings() to suppress this warning). >>>> 3: In .Seqinfo.mergexy(x, y) : >>>> Each of the 2 combined objects has sequence levels not in the other: >>>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>>> chr4_gl000194_random, chr1_gl000192_random >>>> - in 'y': chrM >>>> Make sure to always combine/compare objects based on the same >>>> reference >>>> genome (use suppressWarnings() to suppress this warning). >>>> 4: In .Seqinfo.mergexy(x, y) : >>>> Each of the 2 combined objects has sequence levels not in the other: >>>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>>> chr4_gl000194_random, chr1_gl000192_random >>>> - in 'y': chrM >>>> Make sure to always combine/compare objects based on the same >>>> reference >>>> genome (use suppressWarnings() to suppress this warning). >>>>> >>>> >>>> I also wanted to verify if what I am doing is correct for paired end >>>> reads. >>>> >>>> Thanks, >>>> >>>> - Pankaj >>>> >>>> >>>> ======================================= >>>> sessionInfo() >>>> R version 3.0.3 (2014-03-06) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices utils datasets methods base >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> >> >> >

ADD REPLY • link 11.7 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

I installed R 3.1.0 and install all the bioconductor packages to try out with the latest version, but for some reason summarizeOverlaps() function is not showing as being available. Below is the message that I get when it comes to running the function followed by the sessionInfo(). Not sure if I am missing any package. > library(GenomicFeatures) > library(Rsamtools) > library(rtracklayer) ... countDF <- summarizeOverlaps(eByg, bfl, mode="IntersectionNotEmpty", ignore.strand=TRUE, singleEnd=FALSE) Error: could not find function "summarizeOverlaps" > sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] rtracklayer_1.24.0 Rsamtools_1.16.0 Biostrings_2.32.0 [4] XVector_0.4.0 GenomicFeatures_1.16.0 AnnotationDbi_1.26.0 [7] Biobase_2.24.0 GenomicRanges_1.16.2 GenomeInfoDb_1.0.2 [10] IRanges_1.22.3 BiocGenerics_0.10.0 loaded via a namespace (and not attached): [1] BatchJobs_1.2 BBmisc_1.6 BiocParallel_0.6.0 [4] biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 [7] BSgenome_1.32.0 codetools_0.2-8 DBI_0.2-7 [10] digest_0.6.4 fail_1.2 foreach_1.4.2 [13] GenomicAlignments_1.0.0 iterators_1.0.7 plyr_1.8.1 [16] Rcpp_0.11.1 RCurl_1.95-4.1 RSQLite_0.11.4 [19] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 [22] tools_3.1.0 XML_3.98-1.1 zlibbioc_1.10.0 > - Pankaj ________________________________________ From: Valerie Obenchain [vobencha@fhcrc.org] Sent: Thursday, April 24, 2014 5:07 PM To: Pankaj Agarwal; bioconductor at r-project.org Subject: Re: [BioC] countOverlaps Warnings I think you have an older version of summarizeOverlaps() which had a restriction on using 'yieldSize' with paired-end and emits the warning you see below. The reason you didn't see the warning the first time is because the reads were being treated as single-end (hence almost double the count difference). To update to the current version you'll need R 3.1 and Bioconductor 2.14 (your sessionInfo() below shows R 3.0.3). Even with the update there is no guarantee the counts you see with htseq-count and summarizeOverlaps will be 100% the same. The pairing algorithm (i.e., how we define which reads are true pairs) is documented on the ?readGAlignments man page in the 'Pairing criteria' section. The HTSeq docs probably define how they determine pairs. Once you update you may also want to check out the 'fragments' argument to summarizeOverlaps(). This allows you to read in not only the records paired according to the algorithm but records with unmapped pairs, singletons, and other fragments. This may not be appropriate for your current task but does provide and interesting separation of the paired reads and 'other' fragments. Valerie On 04/24/14 12:44, Pankaj Agarwal wrote: > Yes, these are p.e reads. I made the changes you suggested as follows, everything else being the same as before: > > bfl <- BamFileList(samplespath, yieldSize=50000, index=character(), asMates=TRUE) > ... > countDF_intsectionNotEmpty_042414A <- summarizeOverlaps(eByg, bfl, mode="IntersectionNotEmpty", ignore.strand=TRUE, singleEnd=FALSE) > Warning in FUN(bf, param = param) : 'yieldSize' set to 'NA' > Warning in FUN(bf, param = param) : 'yieldSize' set to 'NA' > ... > > Not sure why I got these warning this time (did not see them last time). > > The results are much closer to what I got with htseq-count, though still not very close: > > htseq-count: >> A.bam B. bam C. bam D. bam >> A1BG 7 67 90 72 >> A1BG-AS1 1 69 41 51 >> A1CF 2 16 0 3 >> A2M 1536 10866 34 173 > > summarizeOverlaps: >> A.bam B. bam C. bam D. bam >> A1BG 7 65 78 66 >> A1BG-AS1 1 66 37 47 >> A1CF 2 16 0 3 >> A2M 1001 6897 28 160 > > I got the GTF files from the following website: > > http://support.illumina.com/sequencing/sequencing_software/igenome.i lmn > > Homo sapiens > Ensembl GRCh37 > NCBI build37.2 > UCSC hg19 > > I used the UCSC hg19 - which has the bowtie2 index, which I used for alignment with bowtie2, and the GTF file, which I used as the annotation source. This same GTF was used for htseq-count also. > > Thanks, > > - Pankaj > > ________________________________________ > From: Valerie Obenchain [vobencha at fhcrc.org] > Sent: Thursday, April 24, 2014 9:39 AM > To: Pankaj Agarwal; bioconductor at r-project.org > Subject: Re: [BioC] countOverlaps Warnings > > Are these paired-end reads? If yes set 'singleEnd=FALSE' in the > summarizeOverlaps() call. > > summarizeOverlaps(eByg, bfl, "IntersectionNotEmpty", > ignore.strand=TRUE, singleEnd=FALSE) > > To 'stamp' the BamFileList as containing paired-end reads set the > 'asMates' arg to TRUE. > > bfl <- BamFileList(samplespath, asMates=TRUE) > > 'yieldSize' reads data in by chunks and is useful when the bam files are > too large to fit in memory. > > Can you point me to where you download 'genes.gtf'? > > Valerie > > On 04/23/14 19:40, Pankaj Agarwal wrote: >> Valerie, >> >> I used summarizeOverlaps() instead, since it seemed simpler to use. I also used htseq-count to get the counts using the same parameters, but I get very different results. >> From what I have read and understood, both should give similar results since the counts should be reported at the "gene" level and not the "isoform" level. >> I would feel much more comfortable with the results if they were both at least close. >> >> For both I used the UCSC GTF file in the iGenomes distribution. >> For htseq-count I used BAM files sorted by name since it required it, but for summarizeOverlaps I used BAM files sorted by coordinates (the default sorting done by samtools). >> Example of counts generated by the two methods are given after the code: >> >> The code for htseq-count that I used is as follows (once for each sample): >> >> samtools view -h A.sorted_byname.bam | htseq-count -t exon -i gene_id -m intersection-nonempty -s no - iGenomes/UCSC/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf > A.htseq.count >> >> The code for summarizeOverlaps() is as follows: >> >>> library(GenomicFeatures) >> Loading required package: BiocGenerics >> Loading required package: parallel >> ... >>> library(Rsamtools) >> Loading required package: Biostrings >>> library(rtracklayer) >>> library(GenomicRanges) >>> >>> txdb <- makeTranscriptDbFromGFF(file=iGenomes/UCSC/Homo_sapiens/UC SC/hg19/Annotation/Genes/genes.gtf", format="gtf") >> extracting transcript information >> Estimating transcript ranges. >> Extracting gene IDs >> Processing splicing information for gtf file. >> Deducing exon rank from relative coordinates provided >> Prepare the 'metadata' data frame ... metadata: OK >> Now generating chrominfo from available sequence names. No chromosome length information is available. >> Warning messages: >> 1: In .deduceExonRankings(exs, format = "gtf") : >> Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName >> 2: In matchCircularity(chroms, circ_seqs) : >> None of the strings in your circ_seqs argument match your seqnames. >>> >> >> - are these warnings message problems? >> >>> saveDb(txdb, file="./data/ucsc_igenomes_hg19_gtf.sqlite") >>> >>> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite") >>> >>> eByg <- exonsBy(txdb, by="gene") >>> >>> head(eByg) >> GRangesList of length 6: >> $A1BG >> GRanges with 8 ranges and 2 metadata columns: >> seqnames ranges strand | exon_id exon_name >> <rle> <iranges> <rle> | <integer> <character> >> [1] chr19 [58858172, 58858395] - | 163177 <na> >> [2] chr19 [58858719, 58859006] - | 163178 <na> >>> >>> samples = c("A.sorted", "B.sorted", "C.sorted", "D.sorted") >>> >>> samples >> [1] "A.sorted" "B.sorted" >> [3] "C.sorted" "D.sorted" >>> >>> samplespath >> [1] "../A.sorted.bam" >> [2] "../B.bam" >> [3] "../C.sorted.bam" >> [4] "../D.sorted.bam" >>> >>> names(samplespath) <- samples >>> bfl <- BamFileList(samplespath, yieldSize=50000, index=character()) >> >> I am not sure what yieldSize means, just used it from the example I followed. >> >>> bfl >> BamFileList of length 4 >> names(4): A.sorted ... B.sorted >>> >>> countDF_intsectionNotEmpty <- summarizeOverlaps(eByg, bfl, mode="IntersectionNotEmpty", ignore.strand=TRUE) >>> >>> head(countDF_intsectionNotEmpty) >> class: SummarizedExperiment >> dim: 1 4 >> exptData(0): >> assays(1): counts >> rownames(1): A1BG >> rowData metadata column names(0): >> colnames(4): A1.sorted B.sorted C.sorted D.sorted >> colData names(0): >>> >>> class(countDF_intsectionNotEmpty) >> [1] "SummarizedExperiment" >> attr(,"package") >> [1] "GenomicRanges" >>> >>> countDF_intsectionNotEmpty <- assays(countDF_intsectionNotEmpty)$counts >>> colnames(countDF_intsectionNotEmpty) <- samples >>> countDF_intsectionNotEmpty[1:4,] >> A.sorted B.sorted C.sorted D.sorted >> A1BG 13 119 162 136 >> A1BG-AS1 2 112 52 85 >> A1CF 3 28 0 5 >> A2M 2543 17593 47 233 >> ... >> >> The counts that I get are very different from what I get using htseq-count, which are following for the same genes listed above: >> >> A.bam B. bam C. bam D. bam >> A1BG 7 67 90 72 >> A1BG-AS1 1 69 41 51 >> A1CF 2 16 0 3 >> A2M 1536 10866 34 173 >> ... >> ... >> >> I would appreciate your feedback. >> >> Thanks, >> >> - Pankaj >> >> -----Original Message----- >> From: Valerie Obenchain [mailto:vobencha at fhcrc.org] >> Sent: Wednesday, April 23, 2014 12:50 PM >> To: Pankaj Agarwal; bioconductor at r-project.org >> Subject: Re: [BioC] countOverlaps Warnings >> >> Pankaj, >> >> How did it go? Were you able to get readGAlignmentPairs() or >> readGAlignmentsList() to work for your case? >> >> Valerie >> >> >> On 04/21/2014 01:32 PM, Valerie Obenchain wrote: >>> Hi Pankaj, >>> >>> There are several options for counting paired-end reads. >>> >>> Both readGAlignmentPairs() and readGAlignmentsList() are appropriate >>> for paired-end data and are in the GenomicAlignments package. They use >>> the same counting algorithm but return the data in different containers. >>> readGAlignmentPairs() returns only pairs while readGAlignmentsList() >>> returns pairs as well as singletons, reads with unmapped pairs etc. >>> See the ?readGAlignments man page for details. >>> >>> Another counting function is summarizeOverlaps() which uses the above >>> functions 'under the hood'. It returns the counts in the 'assays' slot >>> of a SummarizedExperiment object. Since you have multiple bam files to >>> count you may want to use the BamFileList method. >>> >>> The man page has examples of counting paired-end bam files. >>> library(Rsamtools) >>> ?summarizeOverlaps >>> >>> Other packages that offer count functions are Rsubread and easyRNASeq. >>> >>> Valerie >>> >>> >>> >>> >>> On 04/21/2014 09:53 AM, Pankaj Agarwal wrote: >>>> Hi, >>>> >>>> I am analyzing a matched pair tumor/normal rna-seq data set. After >>>> aligning with bowtie2 against UCSC hg19 index, I am trying to get the >>>> counts for each of the samples using the iGenomes UCSC GTF file. I >>>> came across the following tutorial which shows different ways to do >>>> it in Bioconductor: >>>> http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_De >>>> c_12_16_2013/Rrnaseq/Rrnaseq.pdf >>>> >>>> >>>> Following along slide 28 "Read Counting with countOverlaps", I am >>>> able to generate the counts but get the following Warnings. I just >>>> want to be sure that there is nothing to be worried about because I >>>> don't understand the meaning of these warnings. My code is as follows: >>>> >>>>> library(GenomicFeatures) >>>> Loading required package: AnnotationDbi >>>>> library(Rsamtools) >>>>> library(rtracklayer) >>>>> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite") >>>>> eByg <- exonsBy(txdb, by="gene") >>>>> head(eByg) >>>> GRangesList of length 6: >>>> $A1BG >>>> GRanges with 8 ranges and 2 metadata columns: >>>> seqnames ranges strand | exon_id exon_name >>>> <rle> <iranges> <rle> | <integer> <character> >>>> [1] chr19 [58858172, 58858395] - | 163177 <na> >>>> [2] chr19 [58858719, 58859006] - | 163178 <na> >>>> [3] chr19 [58861736, 58862017] - | 163179 <na> >>>> [4] chr19 [58862757, 58863053] - | 163180 <na> >>>> [5] chr19 [58863649, 58863921] - | 163181 <na> >>>> [6] chr19 [58864294, 58864563] - | 163182 <na> >>>> [7] chr19 [58864658, 58864693] - | 163183 <na> >>>> [8] chr19 [58864770, 58864865] - | 163184 <na> >>>> >>>> $A1BG-AS1 >>>> GRanges with 4 ranges and 2 metadata columns: >>>> seqnames ranges strand | exon_id exon_name >>>> [1] chr19 [58863336, 58864410] + | 156640 <na> >>>> [2] chr19 [58864745, 58864840] + | 156641 <na> >>>> [3] chr19 [58865080, 58865223] + | 156642 <na> >>>> [4] chr19 [58865735, 58866549] + | 156643 <na> >>>> >>>> ... >>>> <4 more elements> >>>> --- >>>> seqlengths: >>>> chr12 chr8 ... chr1_gl000192_random >>>> NA NA ... NA >>>>> samples = c("BRPC13-1118_L1.D710_501.sorted", >>>>> "BRPC_13-764_L2.sorted", "DU04_PBMC_RNA_L1.sorted", >>>>> "DU06_PBMC_RNA_L1.sorted") >>>>> >>>>> samples >>>> [1] "BRPC13-1118_L1.D710_501.sorted" "BRPC_13-764_L2.sorted" >>>> [3] "DU04_PBMC_RNA_L1.sorted" "DU06_PBMC_RNA_L1.sorted" >>>>> >>>>> samplespath <- paste("./", samples, ".bam", sep="") >>>>> >>>>> samplespath >>>> [1] "./BRPC13-1118_L1.D710_501.sorted.bam" >>>> [2] "./BRPC_13-764_L2.sorted.bam" >>>> [3] "./DU04_PBMC_RNA_L1.sorted.bam" >>>> [4] "./DU06_PBMC_RNA_L1.sorted.bam" >>>>> >>>>> names(samplespath) <- samples >>>>> >>>>> samplespath >>>> BRPC13-1118_L1.D710_501.sorted BRPC_13-764_L2.sorted >>>> "./BRPC13-1118_L1.D710_501.sorted.bam" >>>> "./BRPC_13-764_L2.sorted.bam" >>>> DU04_PBMC_RNA_L1.sorted DU06_PBMC_RNA_L1.sorted >>>> "./DU04_PBMC_RNA_L1.sorted.bam" >>>> "./DU06_PBMC_RNA_L1.sorted.bam" >>>>> >>>>> countDF <- data.frame(row.names=names(eByg)) >>>>> >>>>> countDF >>>> data frame with 0 columns and 23710 rows >>>>> >>>>> dim(countDF) >>>> [1] 23710 0 >>>>> >>>>> for(i in samplespath) { >>>> + aligns <- readGAlignmentsFromBam(i) counts <- countOverlaps(eByg, >>>> + aligns, ignore.strand=TRUE) countDF <- cbind(countDF, counts) } >>>> Warning messages: >>>> 1: In .Seqinfo.mergexy(x, y) : >>>> Each of the 2 combined objects has sequence levels not in the other: >>>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>>> chr4_gl000194_random, chr1_gl000192_random >>>> - in 'y': chrM >>>> Make sure to always combine/compare objects based on the same >>>> reference >>>> genome (use suppressWarnings() to suppress this warning). >>>> 2: In .Seqinfo.mergexy(x, y) : >>>> Each of the 2 combined objects has sequence levels not in the other: >>>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>>> chr4_gl000194_random, chr1_gl000192_random >>>> - in 'y': chrM >>>> Make sure to always combine/compare objects based on the same >>>> reference >>>> genome (use suppressWarnings() to suppress this warning). >>>> 3: In .Seqinfo.mergexy(x, y) : >>>> Each of the 2 combined objects has sequence levels not in the other: >>>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>>> chr4_gl000194_random, chr1_gl000192_random >>>> - in 'y': chrM >>>> Make sure to always combine/compare objects based on the same >>>> reference >>>> genome (use suppressWarnings() to suppress this warning). >>>> 4: In .Seqinfo.mergexy(x, y) : >>>> Each of the 2 combined objects has sequence levels not in the other: >>>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>>> chr4_gl000194_random, chr1_gl000192_random >>>> - in 'y': chrM >>>> Make sure to always combine/compare objects based on the same >>>> reference >>>> genome (use suppressWarnings() to suppress this warning). >>>>> >>>> >>>> I also wanted to verify if what I am doing is correct for paired end >>>> reads. >>>> >>>> Thanks, >>>> >>>> - Pankaj >>>> >>>> >>>> ======================================= >>>> sessionInfo() >>>> R version 3.0.3 (2014-03-06) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices utils datasets methods base >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> >> >> >

ADD REPLY • link 11.7 years ago Pankaj Agarwal ▴ 100

0

Entering edit mode

Sorry, I should have mentioned the function moved to the new GenomicAlignments package. library(GenomicAlignments) Valerie On 04/25/2014 10:30 AM, Pankaj Agarwal wrote: > I installed R 3.1.0 and install all the bioconductor packages to try out with the latest version, but for some reason summarizeOverlaps() function is not showing as being available. Below is the message that I get when it comes to running the function followed by the sessionInfo(). Not sure if I am missing any package. > >> library(GenomicFeatures) >> library(Rsamtools) >> library(rtracklayer) > ... > countDF <- summarizeOverlaps(eByg, bfl, mode="IntersectionNotEmpty", ignore.strand=TRUE, singleEnd=FALSE) > Error: could not find function "summarizeOverlaps" > >> sessionInfo() > R version 3.1.0 (2014-04-10) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] rtracklayer_1.24.0 Rsamtools_1.16.0 Biostrings_2.32.0 > [4] XVector_0.4.0 GenomicFeatures_1.16.0 AnnotationDbi_1.26.0 > [7] Biobase_2.24.0 GenomicRanges_1.16.2 GenomeInfoDb_1.0.2 > [10] IRanges_1.22.3 BiocGenerics_0.10.0 > > loaded via a namespace (and not attached): > [1] BatchJobs_1.2 BBmisc_1.6 BiocParallel_0.6.0 > [4] biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 > [7] BSgenome_1.32.0 codetools_0.2-8 DBI_0.2-7 > [10] digest_0.6.4 fail_1.2 foreach_1.4.2 > [13] GenomicAlignments_1.0.0 iterators_1.0.7 plyr_1.8.1 > [16] Rcpp_0.11.1 RCurl_1.95-4.1 RSQLite_0.11.4 > [19] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 > [22] tools_3.1.0 XML_3.98-1.1 zlibbioc_1.10.0 >> > > - Pankaj > ________________________________________ > From: Valerie Obenchain [vobencha at fhcrc.org] > Sent: Thursday, April 24, 2014 5:07 PM > To: Pankaj Agarwal; bioconductor at r-project.org > Subject: Re: [BioC] countOverlaps Warnings > > I think you have an older version of summarizeOverlaps() which had a > restriction on using 'yieldSize' with paired-end and emits the warning > you see below. The reason you didn't see the warning the first time is > because the reads were being treated as single-end (hence almost double > the count difference). > > To update to the current version you'll need R 3.1 and Bioconductor 2.14 > (your sessionInfo() below shows R 3.0.3). Even with the update there is > no guarantee the counts you see with htseq-count and summarizeOverlaps > will be 100% the same. The pairing algorithm (i.e., how we define which > reads are true pairs) is documented on the ?readGAlignments man page in > the 'Pairing criteria' section. The HTSeq docs probably define how they > determine pairs. > > Once you update you may also want to check out the 'fragments' argument > to summarizeOverlaps(). This allows you to read in not only the records > paired according to the algorithm but records with unmapped pairs, > singletons, and other fragments. This may not be appropriate for your > current task but does provide and interesting separation of the paired > reads and 'other' fragments. > > Valerie > > > On 04/24/14 12:44, Pankaj Agarwal wrote: >> Yes, these are p.e reads. I made the changes you suggested as follows, everything else being the same as before: >> >> bfl <- BamFileList(samplespath, yieldSize=50000, index=character(), asMates=TRUE) >> ... >> countDF_intsectionNotEmpty_042414A <- summarizeOverlaps(eByg, bfl, mode="IntersectionNotEmpty", ignore.strand=TRUE, singleEnd=FALSE) >> Warning in FUN(bf, param = param) : 'yieldSize' set to 'NA' >> Warning in FUN(bf, param = param) : 'yieldSize' set to 'NA' >> ... >> >> Not sure why I got these warning this time (did not see them last time). >> >> The results are much closer to what I got with htseq-count, though still not very close: >> >> htseq-count: >>> A.bam B. bam C. bam D. bam >>> A1BG 7 67 90 72 >>> A1BG-AS1 1 69 41 51 >>> A1CF 2 16 0 3 >>> A2M 1536 10866 34 173 >> >> summarizeOverlaps: >>> A.bam B. bam C. bam D. bam >>> A1BG 7 65 78 66 >>> A1BG-AS1 1 66 37 47 >>> A1CF 2 16 0 3 >>> A2M 1001 6897 28 160 >> >> I got the GTF files from the following website: >> >> http://support.illumina.com/sequencing/sequencing_software/igenome. ilmn >> >> Homo sapiens >> Ensembl GRCh37 >> NCBI build37.2 >> UCSC hg19 >> >> I used the UCSC hg19 - which has the bowtie2 index, which I used for alignment with bowtie2, and the GTF file, which I used as the annotation source. This same GTF was used for htseq-count also. >> >> Thanks, >> >> - Pankaj >> >> ________________________________________ >> From: Valerie Obenchain [vobencha at fhcrc.org] >> Sent: Thursday, April 24, 2014 9:39 AM >> To: Pankaj Agarwal; bioconductor at r-project.org >> Subject: Re: [BioC] countOverlaps Warnings >> >> Are these paired-end reads? If yes set 'singleEnd=FALSE' in the >> summarizeOverlaps() call. >> >> summarizeOverlaps(eByg, bfl, "IntersectionNotEmpty", >> ignore.strand=TRUE, singleEnd=FALSE) >> >> To 'stamp' the BamFileList as containing paired-end reads set the >> 'asMates' arg to TRUE. >> >> bfl <- BamFileList(samplespath, asMates=TRUE) >> >> 'yieldSize' reads data in by chunks and is useful when the bam files are >> too large to fit in memory. >> >> Can you point me to where you download 'genes.gtf'? >> >> Valerie >> >> On 04/23/14 19:40, Pankaj Agarwal wrote: >>> Valerie, >>> >>> I used summarizeOverlaps() instead, since it seemed simpler to use. I also used htseq-count to get the counts using the same parameters, but I get very different results. >>> From what I have read and understood, both should give similar results since the counts should be reported at the "gene" level and not the "isoform" level. >>> I would feel much more comfortable with the results if they were both at least close. >>> >>> For both I used the UCSC GTF file in the iGenomes distribution. >>> For htseq-count I used BAM files sorted by name since it required it, but for summarizeOverlaps I used BAM files sorted by coordinates (the default sorting done by samtools). >>> Example of counts generated by the two methods are given after the code: >>> >>> The code for htseq-count that I used is as follows (once for each sample): >>> >>> samtools view -h A.sorted_byname.bam | htseq-count -t exon -i gene_id -m intersection-nonempty -s no - iGenomes/UCSC/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf > A.htseq.count >>> >>> The code for summarizeOverlaps() is as follows: >>> >>>> library(GenomicFeatures) >>> Loading required package: BiocGenerics >>> Loading required package: parallel >>> ... >>>> library(Rsamtools) >>> Loading required package: Biostrings >>>> library(rtracklayer) >>>> library(GenomicRanges) >>>> >>>> txdb <- makeTranscriptDbFromGFF(file=iGenomes/UCSC/Homo_sapiens/U CSC/hg19/Annotation/Genes/genes.gtf", format="gtf") >>> extracting transcript information >>> Estimating transcript ranges. >>> Extracting gene IDs >>> Processing splicing information for gtf file. >>> Deducing exon rank from relative coordinates provided >>> Prepare the 'metadata' data frame ... metadata: OK >>> Now generating chrominfo from available sequence names. No chromosome length information is available. >>> Warning messages: >>> 1: In .deduceExonRankings(exs, format = "gtf") : >>> Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName >>> 2: In matchCircularity(chroms, circ_seqs) : >>> None of the strings in your circ_seqs argument match your seqnames. >>>> >>> >>> - are these warnings message problems? >>> >>>> saveDb(txdb, file="./data/ucsc_igenomes_hg19_gtf.sqlite") >>>> >>>> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite") >>>> >>>> eByg <- exonsBy(txdb, by="gene") >>>> >>>> head(eByg) >>> GRangesList of length 6: >>> $A1BG >>> GRanges with 8 ranges and 2 metadata columns: >>> seqnames ranges strand | exon_id exon_name >>> <rle> <iranges> <rle> | <integer> <character> >>> [1] chr19 [58858172, 58858395] - | 163177 <na> >>> [2] chr19 [58858719, 58859006] - | 163178 <na> >>>> >>>> samples = c("A.sorted", "B.sorted", "C.sorted", "D.sorted") >>>> >>>> samples >>> [1] "A.sorted" "B.sorted" >>> [3] "C.sorted" "D.sorted" >>>> >>>> samplespath >>> [1] "../A.sorted.bam" >>> [2] "../B.bam" >>> [3] "../C.sorted.bam" >>> [4] "../D.sorted.bam" >>>> >>>> names(samplespath) <- samples >>>> bfl <- BamFileList(samplespath, yieldSize=50000, index=character()) >>> >>> I am not sure what yieldSize means, just used it from the example I followed. >>> >>>> bfl >>> BamFileList of length 4 >>> names(4): A.sorted ... B.sorted >>>> >>>> countDF_intsectionNotEmpty <- summarizeOverlaps(eByg, bfl, mode="IntersectionNotEmpty", ignore.strand=TRUE) >>>> >>>> head(countDF_intsectionNotEmpty) >>> class: SummarizedExperiment >>> dim: 1 4 >>> exptData(0): >>> assays(1): counts >>> rownames(1): A1BG >>> rowData metadata column names(0): >>> colnames(4): A1.sorted B.sorted C.sorted D.sorted >>> colData names(0): >>>> >>>> class(countDF_intsectionNotEmpty) >>> [1] "SummarizedExperiment" >>> attr(,"package") >>> [1] "GenomicRanges" >>>> >>>> countDF_intsectionNotEmpty <- assays(countDF_intsectionNotEmpty)$counts >>>> colnames(countDF_intsectionNotEmpty) <- samples >>>> countDF_intsectionNotEmpty[1:4,] >>> A.sorted B.sorted C.sorted D.sorted >>> A1BG 13 119 162 136 >>> A1BG-AS1 2 112 52 85 >>> A1CF 3 28 0 5 >>> A2M 2543 17593 47 233 >>> ... >>> >>> The counts that I get are very different from what I get using htseq-count, which are following for the same genes listed above: >>> >>> A.bam B. bam C. bam D. bam >>> A1BG 7 67 90 72 >>> A1BG-AS1 1 69 41 51 >>> A1CF 2 16 0 3 >>> A2M 1536 10866 34 173 >>> ... >>> ... >>> >>> I would appreciate your feedback. >>> >>> Thanks, >>> >>> - Pankaj >>> >>> -----Original Message----- >>> From: Valerie Obenchain [mailto:vobencha at fhcrc.org] >>> Sent: Wednesday, April 23, 2014 12:50 PM >>> To: Pankaj Agarwal; bioconductor at r-project.org >>> Subject: Re: [BioC] countOverlaps Warnings >>> >>> Pankaj, >>> >>> How did it go? Were you able to get readGAlignmentPairs() or >>> readGAlignmentsList() to work for your case? >>> >>> Valerie >>> >>> >>> On 04/21/2014 01:32 PM, Valerie Obenchain wrote: >>>> Hi Pankaj, >>>> >>>> There are several options for counting paired-end reads. >>>> >>>> Both readGAlignmentPairs() and readGAlignmentsList() are appropriate >>>> for paired-end data and are in the GenomicAlignments package. They use >>>> the same counting algorithm but return the data in different containers. >>>> readGAlignmentPairs() returns only pairs while readGAlignmentsList() >>>> returns pairs as well as singletons, reads with unmapped pairs etc. >>>> See the ?readGAlignments man page for details. >>>> >>>> Another counting function is summarizeOverlaps() which uses the above >>>> functions 'under the hood'. It returns the counts in the 'assays' slot >>>> of a SummarizedExperiment object. Since you have multiple bam files to >>>> count you may want to use the BamFileList method. >>>> >>>> The man page has examples of counting paired-end bam files. >>>> library(Rsamtools) >>>> ?summarizeOverlaps >>>> >>>> Other packages that offer count functions are Rsubread and easyRNASeq. >>>> >>>> Valerie >>>> >>>> >>>> >>>> >>>> On 04/21/2014 09:53 AM, Pankaj Agarwal wrote: >>>>> Hi, >>>>> >>>>> I am analyzing a matched pair tumor/normal rna-seq data set. After >>>>> aligning with bowtie2 against UCSC hg19 index, I am trying to get the >>>>> counts for each of the samples using the iGenomes UCSC GTF file. I >>>>> came across the following tutorial which shows different ways to do >>>>> it in Bioconductor: >>>>> http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_De >>>>> c_12_16_2013/Rrnaseq/Rrnaseq.pdf >>>>> >>>>> >>>>> Following along slide 28 "Read Counting with countOverlaps", I am >>>>> able to generate the counts but get the following Warnings. I just >>>>> want to be sure that there is nothing to be worried about because I >>>>> don't understand the meaning of these warnings. My code is as follows: >>>>> >>>>>> library(GenomicFeatures) >>>>> Loading required package: AnnotationDbi >>>>>> library(Rsamtools) >>>>>> library(rtracklayer) >>>>>> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite") >>>>>> eByg <- exonsBy(txdb, by="gene") >>>>>> head(eByg) >>>>> GRangesList of length 6: >>>>> $A1BG >>>>> GRanges with 8 ranges and 2 metadata columns: >>>>> seqnames ranges strand | exon_id exon_name >>>>> <rle> <iranges> <rle> | <integer> <character> >>>>> [1] chr19 [58858172, 58858395] - | 163177 <na> >>>>> [2] chr19 [58858719, 58859006] - | 163178 <na> >>>>> [3] chr19 [58861736, 58862017] - | 163179 <na> >>>>> [4] chr19 [58862757, 58863053] - | 163180 <na> >>>>> [5] chr19 [58863649, 58863921] - | 163181 <na> >>>>> [6] chr19 [58864294, 58864563] - | 163182 <na> >>>>> [7] chr19 [58864658, 58864693] - | 163183 <na> >>>>> [8] chr19 [58864770, 58864865] - | 163184 <na> >>>>> >>>>> $A1BG-AS1 >>>>> GRanges with 4 ranges and 2 metadata columns: >>>>> seqnames ranges strand | exon_id exon_name >>>>> [1] chr19 [58863336, 58864410] + | 156640 <na> >>>>> [2] chr19 [58864745, 58864840] + | 156641 <na> >>>>> [3] chr19 [58865080, 58865223] + | 156642 <na> >>>>> [4] chr19 [58865735, 58866549] + | 156643 <na> >>>>> >>>>> ... >>>>> <4 more elements> >>>>> --- >>>>> seqlengths: >>>>> chr12 chr8 ... chr1_gl000192_random >>>>> NA NA ... NA >>>>>> samples = c("BRPC13-1118_L1.D710_501.sorted", >>>>>> "BRPC_13-764_L2.sorted", "DU04_PBMC_RNA_L1.sorted", >>>>>> "DU06_PBMC_RNA_L1.sorted") >>>>>> >>>>>> samples >>>>> [1] "BRPC13-1118_L1.D710_501.sorted" "BRPC_13-764_L2.sorted" >>>>> [3] "DU04_PBMC_RNA_L1.sorted" "DU06_PBMC_RNA_L1.sorted" >>>>>> >>>>>> samplespath <- paste("./", samples, ".bam", sep="") >>>>>> >>>>>> samplespath >>>>> [1] "./BRPC13-1118_L1.D710_501.sorted.bam" >>>>> [2] "./BRPC_13-764_L2.sorted.bam" >>>>> [3] "./DU04_PBMC_RNA_L1.sorted.bam" >>>>> [4] "./DU06_PBMC_RNA_L1.sorted.bam" >>>>>> >>>>>> names(samplespath) <- samples >>>>>> >>>>>> samplespath >>>>> BRPC13-1118_L1.D710_501.sorted BRPC_13-764_L2.sorted >>>>> "./BRPC13-1118_L1.D710_501.sorted.bam" >>>>> "./BRPC_13-764_L2.sorted.bam" >>>>> DU04_PBMC_RNA_L1.sorted DU06_PBMC_RNA_L1.sorted >>>>> "./DU04_PBMC_RNA_L1.sorted.bam" >>>>> "./DU06_PBMC_RNA_L1.sorted.bam" >>>>>> >>>>>> countDF <- data.frame(row.names=names(eByg)) >>>>>> >>>>>> countDF >>>>> data frame with 0 columns and 23710 rows >>>>>> >>>>>> dim(countDF) >>>>> [1] 23710 0 >>>>>> >>>>>> for(i in samplespath) { >>>>> + aligns <- readGAlignmentsFromBam(i) counts <- countOverlaps(eByg, >>>>> + aligns, ignore.strand=TRUE) countDF <- cbind(countDF, counts) } >>>>> Warning messages: >>>>> 1: In .Seqinfo.mergexy(x, y) : >>>>> Each of the 2 combined objects has sequence levels not in the other: >>>>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>>>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>>>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>>>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>>>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>>>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>>>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>>>> chr4_gl000194_random, chr1_gl000192_random >>>>> - in 'y': chrM >>>>> Make sure to always combine/compare objects based on the same >>>>> reference >>>>> genome (use suppressWarnings() to suppress this warning). >>>>> 2: In .Seqinfo.mergexy(x, y) : >>>>> Each of the 2 combined objects has sequence levels not in the other: >>>>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>>>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>>>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>>>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>>>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>>>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>>>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>>>> chr4_gl000194_random, chr1_gl000192_random >>>>> - in 'y': chrM >>>>> Make sure to always combine/compare objects based on the same >>>>> reference >>>>> genome (use suppressWarnings() to suppress this warning). >>>>> 3: In .Seqinfo.mergexy(x, y) : >>>>> Each of the 2 combined objects has sequence levels not in the other: >>>>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>>>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>>>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>>>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>>>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>>>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>>>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>>>> chr4_gl000194_random, chr1_gl000192_random >>>>> - in 'y': chrM >>>>> Make sure to always combine/compare objects based on the same >>>>> reference >>>>> genome (use suppressWarnings() to suppress this warning). >>>>> 4: In .Seqinfo.mergexy(x, y) : >>>>> Each of the 2 combined objects has sequence levels not in the other: >>>>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, >>>>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, >>>>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, >>>>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, >>>>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, >>>>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, >>>>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, >>>>> chr4_gl000194_random, chr1_gl000192_random >>>>> - in 'y': chrM >>>>> Make sure to always combine/compare objects based on the same >>>>> reference >>>>> genome (use suppressWarnings() to suppress this warning). >>>>>> >>>>> >>>>> I also wanted to verify if what I am doing is correct for paired end >>>>> reads. >>>>> >>>>> Thanks, >>>>> >>>>> - Pankaj >>>>> >>>>> >>>>> ======================================= >>>>> sessionInfo() >>>>> R version 3.0.3 (2014-03-06) >>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>> >>>>> locale: >>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>> >>>>> attached base packages: >>>>> [1] stats graphics grDevices utils datasets methods base >>>>> >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>> >>>> >>> >>> >> > -- Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: vobencha at fhcrc.org Phone: (206) 667-3158

ADD REPLY • link 11.7 years ago Valerie Obenchain ★ 6.8k

Login before adding your answer.