Re-send directly to BioC on miRNA annoation
1
0
Entering edit mode
Wu, Huiyun ▴ 70
@wu-huiyun-5271
Last seen 9.6 years ago
Dear Steve, thank you for your feedback. Below is what it returned. > st <- sum(counts==0) > st [1] 12296 Let me give more details of my situation. Say, I have read in 6 BAM files (s1 to s6). and then got a count table (filename = countTable) using countOverlaps function with resultant file head as below. Question here, is the rowname (first col) miRNA ID? > colnames(countTable) <- samples > head(countTable) s1 s2 s3 s4 s5 s6 1 1 1 3 0 2 2 10 0 0 1 0 1 0 100 0 1 0 1 3 3 1000 0 7 1 1 3 3 10000 0 6 2 8 16 18 100008586 0 0 0 0 0 0 > dim(countTable) [1] 22932 6 I then filtered out those miRNAs whose expressions were 0 across all samples by the following code (it might be what you asked for). > newCountTable <- countTable[rowSums(countTable) != 0, ] > dim(newCountTable) [1] 22105 6 > head(newCountTable) s1 s2 s3 s4 s5 s6 1 1 1 3 0 2 2 10 0 0 1 0 1 0 100 0 1 0 1 3 3 1000 0 7 1 1 3 3 10000 0 6 2 8 16 18 100009676 0 0 0 1 3 4 After annotating using the following library, I got another count table. entrezGenes <- rownames(newCountTable) mapped_genes <- mappedkeys(org.Hs.egSYMBOL) overlapgenes <- intersect(entrezGenes, mapped_genes) overlapsymbols <- as.list(org.Hs.egSYMBOL[overlapgenes]) getGeneSymbol <- function(x) { if (x %in% overlapgenes) { return(overlapsymbols[[which(x == names(overlapsymbols))]]) } else { return(NA) } } geneSymbols <- sapply(entrezGenes, getGeneSymbol) countTable <- cbind(geneSymbols, numberOfexons[names(numberOfexons) %in% entrezGenes], geneLength[names(geneLength) %in% entrezGenes], newCountTable) > dim(countTable) [1] 22105 9 I further removed redundant IDs and NAs to generate an updated count table (dataset named 'd') for DE analysis > dim(d) [1] 22100 7 > d[20:30,] X s1 s2 s3 s4 s5 s6 20 7-Mar 1 2 2 11 12 5 21 7-Sep 2 6 4 6 19 9 22 8-Mar 1 2 0 1 6 4 23 8-Sep 0 4 4 4 3 10 24 9-Mar 1 3 5 1 6 9 25 9-Sep 2 5 8 10 35 20 26 A1BG 1 1 3 0 2 2 27 A1BG-AS1 0 1 1 0 0 4 28 A1CF 0 2 1 3 9 1 29 A2LD1 1 1 2 0 1 6 30 A2M 5 53 48 70 58 114 It looks like there are 25 technical controls, but majority were annotated. thanks, --- William Wu Vanderbilt University/Department of Biostatistics. -----Original Message----- From: Steve Lianoglou [mailto:mailinglist.honeypot@gmail.com] Sent: Monday, May 07, 2012 1:59 PM To: Wu, Huiyun Cc: bioconductor@r-project.org<mailto:bioconductor@r-project.org> Subject: Re: [BioC] miRNA annotation Hi, On Mon, May 7, 2012 at 2:15 PM, Wu, Huiyun <william.wu@vanderbilt.edu<mailto:william.wu@vanderbilt.edu>> wrote: > Hello there, > > I am new in seq data analysis, and have difficulty in miRNA annotation. To clarify my question, I have the following codes. > > > bamfile <- "GSE33858_1.bowtie_alignment.bam" > aln <- readBamGappedAlignments(bamfile) > txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene") > exonRangesList <- exonsBy(txdb, by = "gene") > exonRanges <- unlist(exonRangesList) > strand(exonRanges) <- "*" > geneNames <- sub("\\..*$<file: \\..*$="">", "", names(exonRanges)) > exonRangesListNoStrand <- split(exonRanges, geneNames) > numberOfexons <- sapply(width(exonRangesListNoStrand), length) > geneLength <- sum(width(reduce(exonRangesListNoStrand))) > # Counting reads for the BAM file > counts <- countOverlaps(exonRangesListNoStrand, aln) > names(counts) <- names(exonRangesListNoStrand) > > > length(counts) > > [1] 22932 > > So my question is from aligned file to annotated file. Specifically, why I got 22932 features after mapping to hg19? I learned there are only less than 2000 miRNAs matched genes. I also know there are several RNA databases including miRBase for annotation. Did I do something logically wrong? It would be greatly appreciated if someone could show me how to implement this annotation in R. What does `sum(counts == 0)` give you? -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact [[alternative HTML version deleted]]
miRNA Annotation Cancer miRNA Annotation Cancer • 968 views
ADD COMMENT
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.3 years ago
United States
Hi William, countOverlaps() was deprecated in R-2.14 and has been replaced by summarizeOverlaps(). You'll want to update R and use summarizeOverlaps(). When posting a question please remember to provide the output of sessionInfo(). Valerie On 05/07/2012 02:52 PM, Wu, Huiyun wrote: > Dear Steve, > > > > thank you for your feedback. Below is what it returned. > > > >> st<- sum(counts==0) >> st > [1] 12296 > > > > Let me give more details of my situation. Say, I have read in 6 BAM files (s1 to s6). and then got a count table (filename = countTable) using countOverlaps function with resultant file head as below. Question here, is the rowname (first col) miRNA ID? > > > >> colnames(countTable)<- samples >> head(countTable) > s1 s2 s3 s4 s5 s6 > > 1 1 1 3 0 2 2 > > 10 0 0 1 0 1 0 > > 100 0 1 0 1 3 3 > > 1000 0 7 1 1 3 3 > > 10000 0 6 2 8 16 18 > > 100008586 0 0 0 0 0 0 > >> dim(countTable) > [1] 22932 6 > > > > > > I then filtered out those miRNAs whose expressions were 0 across all samples by the following code (it might be what you asked for). > > > >> newCountTable<- countTable[rowSums(countTable) != 0, ] >> dim(newCountTable) > [1] 22105 6 > >> head(newCountTable) > s1 s2 s3 s4 s5 s6 > > 1 1 1 3 0 2 2 > > 10 0 0 1 0 1 0 > > 100 0 1 0 1 3 3 > > 1000 0 7 1 1 3 3 > > 10000 0 6 2 8 16 18 > > 100009676 0 0 0 1 3 4 > > > > > > After annotating using the following library, I got another count table. > > > > entrezGenes<- rownames(newCountTable) > > mapped_genes<- mappedkeys(org.Hs.egSYMBOL) > > overlapgenes<- intersect(entrezGenes, mapped_genes) > > overlapsymbols<- as.list(org.Hs.egSYMBOL[overlapgenes]) > > getGeneSymbol<- function(x) { > > if (x %in% overlapgenes) { > > return(overlapsymbols[[which(x == names(overlapsymbols))]]) > > } > > else { > > return(NA) > > } > > } > > > > geneSymbols<- sapply(entrezGenes, getGeneSymbol) > > countTable<- cbind(geneSymbols, numberOfexons[names(numberOfexons) %in% entrezGenes], geneLength[names(geneLength) %in% entrezGenes], > > newCountTable) > > > >> dim(countTable) > [1] 22105 9 > > > > I further removed redundant IDs and NAs to generate an updated count table (dataset named 'd') for DE analysis > > > >> dim(d) > [1] 22100 7 > > > >> d[20:30,] > X s1 s2 s3 s4 s5 s6 > > 20 7-Mar 1 2 2 11 12 5 > > 21 7-Sep 2 6 4 6 19 9 > > 22 8-Mar 1 2 0 1 6 4 > > 23 8-Sep 0 4 4 4 3 10 > > 24 9-Mar 1 3 5 1 6 9 > > 25 9-Sep 2 5 8 10 35 20 > > 26 A1BG 1 1 3 0 2 2 > > 27 A1BG-AS1 0 1 1 0 0 4 > > 28 A1CF 0 2 1 3 9 1 > > 29 A2LD1 1 1 2 0 1 6 > > 30 A2M 5 53 48 70 58 114 > > > > It looks like there are 25 technical controls, but majority were annotated. > > > > thanks, > > > > --- > > William Wu > > Vanderbilt University/Department of Biostatistics. > > > > > > > > > > > > > > > > -----Original Message----- > From: Steve Lianoglou [mailto:mailinglist.honeypot at gmail.com] > Sent: Monday, May 07, 2012 1:59 PM > To: Wu, Huiyun > Cc: bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> > Subject: Re: [BioC] miRNA annotation > > > > Hi, > > > > On Mon, May 7, 2012 at 2:15 PM, Wu, Huiyun<william.wu at="" vanderbilt.edu<mailto:william.wu="" at="" vanderbilt.edu="">> wrote: > >> Hello there, >> I am new in seq data analysis, and have difficulty in miRNA annotation. To clarify my question, I have the following codes. >> bamfile<- "GSE33858_1.bowtie_alignment.bam" >> aln<- readBamGappedAlignments(bamfile) >> txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene") >> exonRangesList<- exonsBy(txdb, by = "gene") >> exonRanges<- unlist(exonRangesList) >> strand(exonRanges)<- "*" >> geneNames<- sub("\\..*$<file: \\..*$="">", "", names(exonRanges)) >> exonRangesListNoStrand<- split(exonRanges, geneNames) >> numberOfexons<- sapply(width(exonRangesListNoStrand), length) >> geneLength<- sum(width(reduce(exonRangesListNoStrand))) >> # Counting reads for the BAM file >> counts<- countOverlaps(exonRangesListNoStrand, aln) >> names(counts)<- names(exonRangesListNoStrand) >> > length(counts) >> [1] 22932 >> So my question is from aligned file to annotated file. Specifically, why I got 22932 features after mapping to hg19? I learned there are only less than 2000 miRNAs matched genes. I also know there are several RNA databases including miRBase for annotation. Did I do something logically wrong? It would be greatly appreciated if someone could show me how to implement this annotation in R. > > > What does `sum(counts == 0)` give you? > > > > -steve > > > > -- > > Steve Lianoglou > > Graduate Student: Computational Systems Biology > > | Memorial Sloan-Kettering Cancer Center > > | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
Hi Valerie, thank you for your helpful suggestions. I will try it. best, William -----Original Message----- From: Valerie Obenchain [mailto:vobencha@fhcrc.org] Sent: Tuesday, May 08, 2012 11:35 AM To: Wu, Huiyun Cc: bioconductor at r-project.org; Steve Lianoglou Subject: Re: [BioC] Re-send directly to BioC on miRNA annoation Hi William, countOverlaps() was deprecated in R-2.14 and has been replaced by summarizeOverlaps(). You'll want to update R and use summarizeOverlaps(). When posting a question please remember to provide the output of sessionInfo(). Valerie On 05/07/2012 02:52 PM, Wu, Huiyun wrote: > Dear Steve, > > > > thank you for your feedback. Below is what it returned. > > > >> st<- sum(counts==0) >> st > [1] 12296 > > > > Let me give more details of my situation. Say, I have read in 6 BAM files (s1 to s6). and then got a count table (filename = countTable) using countOverlaps function with resultant file head as below. Question here, is the rowname (first col) miRNA ID? > > > >> colnames(countTable)<- samples >> head(countTable) > s1 s2 s3 s4 s5 s6 > > 1 1 1 3 0 2 2 > > 10 0 0 1 0 1 0 > > 100 0 1 0 1 3 3 > > 1000 0 7 1 1 3 3 > > 10000 0 6 2 8 16 18 > > 100008586 0 0 0 0 0 0 > >> dim(countTable) > [1] 22932 6 > > > > > > I then filtered out those miRNAs whose expressions were 0 across all samples by the following code (it might be what you asked for). > > > >> newCountTable<- countTable[rowSums(countTable) != 0, ] >> dim(newCountTable) > [1] 22105 6 > >> head(newCountTable) > s1 s2 s3 s4 s5 s6 > > 1 1 1 3 0 2 2 > > 10 0 0 1 0 1 0 > > 100 0 1 0 1 3 3 > > 1000 0 7 1 1 3 3 > > 10000 0 6 2 8 16 18 > > 100009676 0 0 0 1 3 4 > > > > > > After annotating using the following library, I got another count table. > > > > entrezGenes<- rownames(newCountTable) > > mapped_genes<- mappedkeys(org.Hs.egSYMBOL) > > overlapgenes<- intersect(entrezGenes, mapped_genes) > > overlapsymbols<- as.list(org.Hs.egSYMBOL[overlapgenes]) > > getGeneSymbol<- function(x) { > > if (x %in% overlapgenes) { > > return(overlapsymbols[[which(x == > names(overlapsymbols))]]) > > } > > else { > > return(NA) > > } > > } > > > > geneSymbols<- sapply(entrezGenes, getGeneSymbol) > > countTable<- cbind(geneSymbols, > numberOfexons[names(numberOfexons) %in% entrezGenes], > geneLength[names(geneLength) %in% entrezGenes], > > newCountTable) > > > >> dim(countTable) > [1] 22105 9 > > > > I further removed redundant IDs and NAs to generate an updated count > table (dataset named 'd') for DE analysis > > > >> dim(d) > [1] 22100 7 > > > >> d[20:30,] > X s1 s2 s3 s4 s5 s6 > > 20 7-Mar 1 2 2 11 12 5 > > 21 7-Sep 2 6 4 6 19 9 > > 22 8-Mar 1 2 0 1 6 4 > > 23 8-Sep 0 4 4 4 3 10 > > 24 9-Mar 1 3 5 1 6 9 > > 25 9-Sep 2 5 8 10 35 20 > > 26 A1BG 1 1 3 0 2 2 > > 27 A1BG-AS1 0 1 1 0 0 4 > > 28 A1CF 0 2 1 3 9 1 > > 29 A2LD1 1 1 2 0 1 6 > > 30 A2M 5 53 48 70 58 114 > > > > It looks like there are 25 technical controls, but majority were annotated. > > > > thanks, > > > > --- > > William Wu > > Vanderbilt University/Department of Biostatistics. > > > > > > > > > > > > > > > > -----Original Message----- > From: Steve Lianoglou [mailto:mailinglist.honeypot at gmail.com] > Sent: Monday, May 07, 2012 1:59 PM > To: Wu, Huiyun > Cc: bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> > Subject: Re: [BioC] miRNA annotation > > > > Hi, > > > > On Mon, May 7, 2012 at 2:15 PM, Wu, Huiyun<william.wu at="" vanderbilt.edu<mailto:william.wu="" at="" vanderbilt.edu="">> wrote: > >> Hello there, >> I am new in seq data analysis, and have difficulty in miRNA annotation. To clarify my question, I have the following codes. >> bamfile<- "GSE33858_1.bowtie_alignment.bam" >> aln<- readBamGappedAlignments(bamfile) >> txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene") >> exonRangesList<- exonsBy(txdb, by = "gene") >> exonRanges<- unlist(exonRangesList) >> strand(exonRanges)<- "*" >> geneNames<- sub("\\..*$<file: \\..*$="">", "", names(exonRanges)) >> exonRangesListNoStrand<- split(exonRanges, geneNames) >> numberOfexons<- sapply(width(exonRangesListNoStrand), length) >> geneLength<- sum(width(reduce(exonRangesListNoStrand))) >> # Counting reads for the BAM file >> counts<- countOverlaps(exonRangesListNoStrand, aln) >> names(counts)<- names(exonRangesListNoStrand) >> > length(counts) >> [1] 22932 >> So my question is from aligned file to annotated file. Specifically, why I got 22932 features after mapping to hg19? I learned there are only less than 2000 miRNAs matched genes. I also know there are several RNA databases including miRBase for annotation. Did I do something logically wrong? It would be greatly appreciated if someone could show me how to implement this annotation in R. > > > What does `sum(counts == 0)` give you? > > > > -steve > > > > -- > > Steve Lianoglou > > Graduate Student: Computational Systems Biology > > | Memorial Sloan-Kettering Cancer Center > > | Weill Medical College of Cornell University Contact Info: > http://cbio.mskcc.org/~lianos/contact > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY

Login before adding your answer.

Traffic: 819 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6