Re-send directly to BioC on miRNA annoation

0

Entering edit mode

Wu, Huiyun ▴ 70

@wu-huiyun-5271

Last seen 9.6 years ago

Dear Steve, thank you for your feedback. Below is what it returned. > st <- sum(counts==0) > st [1] 12296 Let me give more details of my situation. Say, I have read in 6 BAM files (s1 to s6). and then got a count table (filename = countTable) using countOverlaps function with resultant file head as below. Question here, is the rowname (first col) miRNA ID? > colnames(countTable) <- samples > head(countTable) s1 s2 s3 s4 s5 s6 1 1 1 3 0 2 2 10 0 0 1 0 1 0 100 0 1 0 1 3 3 1000 0 7 1 1 3 3 10000 0 6 2 8 16 18 100008586 0 0 0 0 0 0 > dim(countTable) [1] 22932 6 I then filtered out those miRNAs whose expressions were 0 across all samples by the following code (it might be what you asked for). > newCountTable <- countTable[rowSums(countTable) != 0, ] > dim(newCountTable) [1] 22105 6 > head(newCountTable) s1 s2 s3 s4 s5 s6 1 1 1 3 0 2 2 10 0 0 1 0 1 0 100 0 1 0 1 3 3 1000 0 7 1 1 3 3 10000 0 6 2 8 16 18 100009676 0 0 0 1 3 4 After annotating using the following library, I got another count table. entrezGenes <- rownames(newCountTable) mapped_genes <- mappedkeys(org.Hs.egSYMBOL) overlapgenes <- intersect(entrezGenes, mapped_genes) overlapsymbols <- as.list(org.Hs.egSYMBOL[overlapgenes]) getGeneSymbol <- function(x) { if (x %in% overlapgenes) { return(overlapsymbols[[which(x == names(overlapsymbols))]]) } else { return(NA) } } geneSymbols <- sapply(entrezGenes, getGeneSymbol) countTable <- cbind(geneSymbols, numberOfexons[names(numberOfexons) %in% entrezGenes], geneLength[names(geneLength) %in% entrezGenes], newCountTable) > dim(countTable) [1] 22105 9 I further removed redundant IDs and NAs to generate an updated count table (dataset named 'd') for DE analysis > dim(d) [1] 22100 7 > d[20:30,] X s1 s2 s3 s4 s5 s6 20 7-Mar 1 2 2 11 12 5 21 7-Sep 2 6 4 6 19 9 22 8-Mar 1 2 0 1 6 4 23 8-Sep 0 4 4 4 3 10 24 9-Mar 1 3 5 1 6 9 25 9-Sep 2 5 8 10 35 20 26 A1BG 1 1 3 0 2 2 27 A1BG-AS1 0 1 1 0 0 4 28 A1CF 0 2 1 3 9 1 29 A2LD1 1 1 2 0 1 6 30 A2M 5 53 48 70 58 114 It looks like there are 25 technical controls, but majority were annotated. thanks, --- William Wu Vanderbilt University/Department of Biostatistics. -----Original Message----- From: Steve Lianoglou [mailto:mailinglist.honeypot@gmail.com] Sent: Monday, May 07, 2012 1:59 PM To: Wu, Huiyun Cc: bioconductor@r-project.org<mailto:bioconductor@r-project.org> Subject: Re: [BioC] miRNA annotation Hi, On Mon, May 7, 2012 at 2:15 PM, Wu, Huiyun <william.wu@vanderbilt.edu<mailto:william.wu@vanderbilt.edu>> wrote: > Hello there, > > I am new in seq data analysis, and have difficulty in miRNA annotation. To clarify my question, I have the following codes. > > > bamfile <- "GSE33858_1.bowtie_alignment.bam" > aln <- readBamGappedAlignments(bamfile) > txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene") > exonRangesList <- exonsBy(txdb, by = "gene") > exonRanges <- unlist(exonRangesList) > strand(exonRanges) <- "*" > geneNames <- sub("\\..*$<file: \\..*$="">", "", names(exonRanges)) > exonRangesListNoStrand <- split(exonRanges, geneNames) > numberOfexons <- sapply(width(exonRangesListNoStrand), length) > geneLength <- sum(width(reduce(exonRangesListNoStrand))) > # Counting reads for the BAM file > counts <- countOverlaps(exonRangesListNoStrand, aln) > names(counts) <- names(exonRangesListNoStrand) > > > length(counts) > > [1] 22932 > > So my question is from aligned file to annotated file. Specifically, why I got 22932 features after mapping to hg19? I learned there are only less than 2000 miRNAs matched genes. I also know there are several RNA databases including miRBase for annotation. Did I do something logically wrong? It would be greatly appreciated if someone could show me how to implement this annotation in R. What does `sum(counts == 0)` give you? -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact [[alternative HTML version deleted]]

miRNA Annotation Cancer miRNA Annotation Cancer • 968 views

ADD COMMENT • link updated 12.0 years ago by Valerie Obenchain ★ 6.8k • written 12.0 years ago by Wu, Huiyun ▴ 70

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.3 years ago

United States

Hi William, countOverlaps() was deprecated in R-2.14 and has been replaced by summarizeOverlaps(). You'll want to update R and use summarizeOverlaps(). When posting a question please remember to provide the output of sessionInfo(). Valerie On 05/07/2012 02:52 PM, Wu, Huiyun wrote: > Dear Steve, > > > > thank you for your feedback. Below is what it returned. > > > >> st<- sum(counts==0) >> st > [1] 12296 > > > > Let me give more details of my situation. Say, I have read in 6 BAM files (s1 to s6). and then got a count table (filename = countTable) using countOverlaps function with resultant file head as below. Question here, is the rowname (first col) miRNA ID? > > > >> colnames(countTable)<- samples >> head(countTable) > s1 s2 s3 s4 s5 s6 > > 1 1 1 3 0 2 2 > > 10 0 0 1 0 1 0 > > 100 0 1 0 1 3 3 > > 1000 0 7 1 1 3 3 > > 10000 0 6 2 8 16 18 > > 100008586 0 0 0 0 0 0 > >> dim(countTable) > [1] 22932 6 > > > > > > I then filtered out those miRNAs whose expressions were 0 across all samples by the following code (it might be what you asked for). > > > >> newCountTable<- countTable[rowSums(countTable) != 0, ] >> dim(newCountTable) > [1] 22105 6 > >> head(newCountTable) > s1 s2 s3 s4 s5 s6 > > 1 1 1 3 0 2 2 > > 10 0 0 1 0 1 0 > > 100 0 1 0 1 3 3 > > 1000 0 7 1 1 3 3 > > 10000 0 6 2 8 16 18 > > 100009676 0 0 0 1 3 4 > > > > > > After annotating using the following library, I got another count table. > > > > entrezGenes<- rownames(newCountTable) > > mapped_genes<- mappedkeys(org.Hs.egSYMBOL) > > overlapgenes<- intersect(entrezGenes, mapped_genes) > > overlapsymbols<- as.list(org.Hs.egSYMBOL[overlapgenes]) > > getGeneSymbol<- function(x) { > > if (x %in% overlapgenes) { > > return(overlapsymbols[[which(x == names(overlapsymbols))]]) > > } > > else { > > return(NA) > > } > > } > > > > geneSymbols<- sapply(entrezGenes, getGeneSymbol) > > countTable<- cbind(geneSymbols, numberOfexons[names(numberOfexons) %in% entrezGenes], geneLength[names(geneLength) %in% entrezGenes], > > newCountTable) > > > >> dim(countTable) > [1] 22105 9 > > > > I further removed redundant IDs and NAs to generate an updated count table (dataset named 'd') for DE analysis > > > >> dim(d) > [1] 22100 7 > > > >> d[20:30,] > X s1 s2 s3 s4 s5 s6 > > 20 7-Mar 1 2 2 11 12 5 > > 21 7-Sep 2 6 4 6 19 9 > > 22 8-Mar 1 2 0 1 6 4 > > 23 8-Sep 0 4 4 4 3 10 > > 24 9-Mar 1 3 5 1 6 9 > > 25 9-Sep 2 5 8 10 35 20 > > 26 A1BG 1 1 3 0 2 2 > > 27 A1BG-AS1 0 1 1 0 0 4 > > 28 A1CF 0 2 1 3 9 1 > > 29 A2LD1 1 1 2 0 1 6 > > 30 A2M 5 53 48 70 58 114 > > > > It looks like there are 25 technical controls, but majority were annotated. > > > > thanks, > > > > --- > > William Wu > > Vanderbilt University/Department of Biostatistics. > > > > > > > > > > > > > > > > -----Original Message----- > From: Steve Lianoglou [mailto:mailinglist.honeypot at gmail.com] > Sent: Monday, May 07, 2012 1:59 PM > To: Wu, Huiyun > Cc: bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> > Subject: Re: [BioC] miRNA annotation > > > > Hi, > > > > On Mon, May 7, 2012 at 2:15 PM, Wu, Huiyun<william.wu at="" vanderbilt.edu<mailto:william.wu="" at="" vanderbilt.edu="">> wrote: > >> Hello there, >> I am new in seq data analysis, and have difficulty in miRNA annotation. To clarify my question, I have the following codes. >> bamfile<- "GSE33858_1.bowtie_alignment.bam" >> aln<- readBamGappedAlignments(bamfile) >> txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene") >> exonRangesList<- exonsBy(txdb, by = "gene") >> exonRanges<- unlist(exonRangesList) >> strand(exonRanges)<- "*" >> geneNames<- sub("\\..*$<file: \\..*$="">", "", names(exonRanges)) >> exonRangesListNoStrand<- split(exonRanges, geneNames) >> numberOfexons<- sapply(width(exonRangesListNoStrand), length) >> geneLength<- sum(width(reduce(exonRangesListNoStrand))) >> # Counting reads for the BAM file >> counts<- countOverlaps(exonRangesListNoStrand, aln) >> names(counts)<- names(exonRangesListNoStrand) >> > length(counts) >> [1] 22932 >> So my question is from aligned file to annotated file. Specifically, why I got 22932 features after mapping to hg19? I learned there are only less than 2000 miRNAs matched genes. I also know there are several RNA databases including miRBase for annotation. Did I do something logically wrong? It would be greatly appreciated if someone could show me how to implement this annotation in R. > > > What does `sum(counts == 0)` give you? > > > > -steve > > > > -- > > Steve Lianoglou > > Graduate Student: Computational Systems Biology > > | Memorial Sloan-Kettering Cancer Center > > | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.0 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie, thank you for your helpful suggestions. I will try it. best, William -----Original Message----- From: Valerie Obenchain [mailto:vobencha@fhcrc.org] Sent: Tuesday, May 08, 2012 11:35 AM To: Wu, Huiyun Cc: bioconductor at r-project.org; Steve Lianoglou Subject: Re: [BioC] Re-send directly to BioC on miRNA annoation Hi William, countOverlaps() was deprecated in R-2.14 and has been replaced by summarizeOverlaps(). You'll want to update R and use summarizeOverlaps(). When posting a question please remember to provide the output of sessionInfo(). Valerie On 05/07/2012 02:52 PM, Wu, Huiyun wrote: > Dear Steve, > > > > thank you for your feedback. Below is what it returned. > > > >> st<- sum(counts==0) >> st > [1] 12296 > > > > Let me give more details of my situation. Say, I have read in 6 BAM files (s1 to s6). and then got a count table (filename = countTable) using countOverlaps function with resultant file head as below. Question here, is the rowname (first col) miRNA ID? > > > >> colnames(countTable)<- samples >> head(countTable) > s1 s2 s3 s4 s5 s6 > > 1 1 1 3 0 2 2 > > 10 0 0 1 0 1 0 > > 100 0 1 0 1 3 3 > > 1000 0 7 1 1 3 3 > > 10000 0 6 2 8 16 18 > > 100008586 0 0 0 0 0 0 > >> dim(countTable) > [1] 22932 6 > > > > > > I then filtered out those miRNAs whose expressions were 0 across all samples by the following code (it might be what you asked for). > > > >> newCountTable<- countTable[rowSums(countTable) != 0, ] >> dim(newCountTable) > [1] 22105 6 > >> head(newCountTable) > s1 s2 s3 s4 s5 s6 > > 1 1 1 3 0 2 2 > > 10 0 0 1 0 1 0 > > 100 0 1 0 1 3 3 > > 1000 0 7 1 1 3 3 > > 10000 0 6 2 8 16 18 > > 100009676 0 0 0 1 3 4 > > > > > > After annotating using the following library, I got another count table. > > > > entrezGenes<- rownames(newCountTable) > > mapped_genes<- mappedkeys(org.Hs.egSYMBOL) > > overlapgenes<- intersect(entrezGenes, mapped_genes) > > overlapsymbols<- as.list(org.Hs.egSYMBOL[overlapgenes]) > > getGeneSymbol<- function(x) { > > if (x %in% overlapgenes) { > > return(overlapsymbols[[which(x == > names(overlapsymbols))]]) > > } > > else { > > return(NA) > > } > > } > > > > geneSymbols<- sapply(entrezGenes, getGeneSymbol) > > countTable<- cbind(geneSymbols, > numberOfexons[names(numberOfexons) %in% entrezGenes], > geneLength[names(geneLength) %in% entrezGenes], > > newCountTable) > > > >> dim(countTable) > [1] 22105 9 > > > > I further removed redundant IDs and NAs to generate an updated count > table (dataset named 'd') for DE analysis > > > >> dim(d) > [1] 22100 7 > > > >> d[20:30,] > X s1 s2 s3 s4 s5 s6 > > 20 7-Mar 1 2 2 11 12 5 > > 21 7-Sep 2 6 4 6 19 9 > > 22 8-Mar 1 2 0 1 6 4 > > 23 8-Sep 0 4 4 4 3 10 > > 24 9-Mar 1 3 5 1 6 9 > > 25 9-Sep 2 5 8 10 35 20 > > 26 A1BG 1 1 3 0 2 2 > > 27 A1BG-AS1 0 1 1 0 0 4 > > 28 A1CF 0 2 1 3 9 1 > > 29 A2LD1 1 1 2 0 1 6 > > 30 A2M 5 53 48 70 58 114 > > > > It looks like there are 25 technical controls, but majority were annotated. > > > > thanks, > > > > --- > > William Wu > > Vanderbilt University/Department of Biostatistics. > > > > > > > > > > > > > > > > -----Original Message----- > From: Steve Lianoglou [mailto:mailinglist.honeypot at gmail.com] > Sent: Monday, May 07, 2012 1:59 PM > To: Wu, Huiyun > Cc: bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> > Subject: Re: [BioC] miRNA annotation > > > > Hi, > > > > On Mon, May 7, 2012 at 2:15 PM, Wu, Huiyun<william.wu at="" vanderbilt.edu<mailto:william.wu="" at="" vanderbilt.edu="">> wrote: > >> Hello there, >> I am new in seq data analysis, and have difficulty in miRNA annotation. To clarify my question, I have the following codes. >> bamfile<- "GSE33858_1.bowtie_alignment.bam" >> aln<- readBamGappedAlignments(bamfile) >> txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene") >> exonRangesList<- exonsBy(txdb, by = "gene") >> exonRanges<- unlist(exonRangesList) >> strand(exonRanges)<- "*" >> geneNames<- sub("\\..*$<file: \\..*$="">", "", names(exonRanges)) >> exonRangesListNoStrand<- split(exonRanges, geneNames) >> numberOfexons<- sapply(width(exonRangesListNoStrand), length) >> geneLength<- sum(width(reduce(exonRangesListNoStrand))) >> # Counting reads for the BAM file >> counts<- countOverlaps(exonRangesListNoStrand, aln) >> names(counts)<- names(exonRangesListNoStrand) >> > length(counts) >> [1] 22932 >> So my question is from aligned file to annotated file. Specifically, why I got 22932 features after mapping to hg19? I learned there are only less than 2000 miRNAs matched genes. I also know there are several RNA databases including miRBase for annotation. Did I do something logically wrong? It would be greatly appreciated if someone could show me how to implement this annotation in R. > > > What does `sum(counts == 0)` give you? > > > > -steve > > > > -- > > Steve Lianoglou > > Graduate Student: Computational Systems Biology > > | Memorial Sloan-Kettering Cancer Center > > | Weill Medical College of Cornell University Contact Info: > http://cbio.mskcc.org/~lianos/contact > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.0 years ago Wu, Huiyun ▴ 70

Login before adding your answer.