miRNA annotation
1
0
Entering edit mode
Wu, Huiyun ▴ 70
@wu-huiyun-5271
Last seen 9.7 years ago
Hello there, I am new in seq data analysis, and have difficulty in miRNA annotation. To clarify my question, I have the following codes. bamfile <- "GSE33858_1.bowtie_alignment.bam" aln <- readBamGappedAlignments(bamfile) txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene") exonRangesList <- exonsBy(txdb, by = "gene") exonRanges <- unlist(exonRangesList) strand(exonRanges) <- "*" geneNames <- sub("\\..*$", "", names(exonRanges)) exonRangesListNoStrand <- split(exonRanges, geneNames) numberOfexons <- sapply(width(exonRangesListNoStrand), length) geneLength <- sum(width(reduce(exonRangesListNoStrand))) # Counting reads for the BAM file counts <- countOverlaps(exonRangesListNoStrand, aln) names(counts) <- names(exonRangesListNoStrand) > length(counts) [1] 22932 So my question is from aligned file to annotated file. Specifically, why I got 22932 features after mapping to hg19? I learned there are only less than 2000 miRNAs matched genes. I also know there are several RNA databases including miRBase for annotation. Did I do something logically wrong? It would be greatly appreciated if someone could show me how to implement this annotation in R. Many thanks, William Wu [[alternative HTML version deleted]]
Annotation Annotation • 978 views
ADD COMMENT
0
Entering edit mode
@steve-lianoglou-2771
Last seen 14 months ago
United States
Hi, On Mon, May 7, 2012 at 2:15 PM, Wu, Huiyun <william.wu at="" vanderbilt.edu=""> wrote: > Hello there, > > I am new in seq data analysis, and have difficulty in miRNA annotation. To clarify my question, I have the following codes. > > > ? bamfile <- "GSE33858_1.bowtie_alignment.bam" > ? aln <- readBamGappedAlignments(bamfile) > ? txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene") > ? exonRangesList <- exonsBy(txdb, by = "gene") > ? exonRanges <- unlist(exonRangesList) > ? strand(exonRanges) <- "*" > ? geneNames <- sub("\\..*$", "", names(exonRanges)) > ? exonRangesListNoStrand <- split(exonRanges, geneNames) > ? numberOfexons <- sapply(width(exonRangesListNoStrand), length) > ? geneLength <- sum(width(reduce(exonRangesListNoStrand))) > ? # Counting reads for the BAM file > ? counts <- countOverlaps(exonRangesListNoStrand, aln) > ? names(counts) <- names(exonRangesListNoStrand) > > ? > length(counts) > > ? ? ? [1] 22932 > > So my question is from aligned file to annotated file. Specifically, why I got 22932 features after mapping to hg19? I learned there are only less than 2000 miRNAs matched genes. I also know there are several RNA databases including miRBase for annotation. Did I do something logically wrong? It would be greatly appreciated if someone could show me how to implement this annotation in R. What does `sum(counts == 0)` give you? -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD COMMENT
0
Entering edit mode
Dear Steve, thank you for your feedback. Below is what it returned. > st <- sum(counts==0) > st [1] 12296 Let me give more details of my situation. Say, I have read in 6 BAM files (s1 to s6). and then got a count table (filename = countTable) using countOverlaps function with resultant file head as below. Question here, is the rowname (first col) miRNA ID? > colnames(countTable) <- samples > head(countTable) s1 s2 s3 s4 s5 s6 1 1 1 3 0 2 2 10 0 0 1 0 1 0 100 0 1 0 1 3 3 1000 0 7 1 1 3 3 10000 0 6 2 8 16 18 100008586 0 0 0 0 0 0 > dim(countTable) [1] 22932 6 I then filtered out those miRNAs whose expressions were 0 across all samples by the following code (it might be what you asked for). > newCountTable <- countTable[rowSums(countTable) != 0, ] > dim(newCountTable) [1] 22105 6 > head(newCountTable) s1 s2 s3 s4 s5 s6 1 1 1 3 0 2 2 10 0 0 1 0 1 0 100 0 1 0 1 3 3 1000 0 7 1 1 3 3 10000 0 6 2 8 16 18 100009676 0 0 0 1 3 4 After annotating using the following library, I got another count table. entrezGenes <- rownames(newCountTable) mapped_genes <- mappedkeys(org.Hs.egSYMBOL) overlapgenes <- intersect(entrezGenes, mapped_genes) overlapsymbols <- as.list(org.Hs.egSYMBOL[overlapgenes]) getGeneSymbol <- function(x) { if (x %in% overlapgenes) { return(overlapsymbols[[which(x == names(overlapsymbols))]]) } else { return(NA) } } geneSymbols <- sapply(entrezGenes, getGeneSymbol) countTable <- cbind(geneSymbols, numberOfexons[names(numberOfexons) %in% entrezGenes], geneLength[names(geneLength) %in% entrezGenes], newCountTable) > dim(countTable) [1] 22105 9 I further removed redundant IDs and NAs to generate an updated count table (dataset named 'd') for DE analysis > dim(d) [1] 22100 7 > d[20:30,] X s1 s2 s3 s4 s5 s6 20 7-Mar 1 2 2 11 12 5 21 7-Sep 2 6 4 6 19 9 22 8-Mar 1 2 0 1 6 4 23 8-Sep 0 4 4 4 3 10 24 9-Mar 1 3 5 1 6 9 25 9-Sep 2 5 8 10 35 20 26 A1BG 1 1 3 0 2 2 27 A1BG-AS1 0 1 1 0 0 4 28 A1CF 0 2 1 3 9 1 29 A2LD1 1 1 2 0 1 6 30 A2M 5 53 48 70 58 114 It looks like there are 25 technical controls, but majority were annotated. thanks, --- William Wu Vanderbilt University/Department of Biostatistics. -----Original Message----- From: Steve Lianoglou [mailto:mailinglist.honeypot@gmail.com] Sent: Monday, May 07, 2012 1:59 PM To: Wu, Huiyun Cc: bioconductor@r-project.org Subject: Re: [BioC] miRNA annotation Hi, On Mon, May 7, 2012 at 2:15 PM, Wu, Huiyun <william.wu@vanderbilt.edu<mailto:william.wu@vanderbilt.edu>> wrote: > Hello there, > > I am new in seq data analysis, and have difficulty in miRNA annotation. To clarify my question, I have the following codes. > > > bamfile <- "GSE33858_1.bowtie_alignment.bam" > aln <- readBamGappedAlignments(bamfile) > txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene") > exonRangesList <- exonsBy(txdb, by = "gene") > exonRanges <- unlist(exonRangesList) > strand(exonRanges) <- "*" > geneNames <- sub("\\..*$", "", names(exonRanges)) > exonRangesListNoStrand <- split(exonRanges, geneNames) > numberOfexons <- sapply(width(exonRangesListNoStrand), length) > geneLength <- sum(width(reduce(exonRangesListNoStrand))) > # Counting reads for the BAM file > counts <- countOverlaps(exonRangesListNoStrand, aln) > names(counts) <- names(exonRangesListNoStrand) > > > length(counts) > > [1] 22932 > > So my question is from aligned file to annotated file. Specifically, why I got 22932 features after mapping to hg19? I learned there are only less than 2000 miRNAs matched genes. I also know there are several RNA databases including miRBase for annotation. Did I do something logically wrong? It would be greatly appreciated if someone could show me how to implement this annotation in R. What does `sum(counts == 0)` give you? -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Hi William, I'm actually now at a bit of a loss as to what you are asking. (1) Do you want someone to check your code to see if you did the annotations correctly? Or (2) are you curious why you are getting reads from (what I guess is) a small RNA library that seem to be coming from full length mRNAs? If it's the first, why not try loading your BAM file up in something like IGV and hopping over to some of the non-miRNA genes that your count table says you have reads for and see if you see in the browser what your table says. If it's the second question, I guess you'd have to go through the protocol and see how they prepped the library. You might start by looking to see when (if) they did the size selection of the RNA/cDNA to enrich for small RNA's. HTH, -steve On Mon, May 7, 2012 at 4:00 PM, Wu, Huiyun <william.wu at="" vanderbilt.edu=""> wrote: > Dear Steve, > > thank you for your feedback. Below is what it returned. > >> st <- sum(counts==0) >> st > [1] 12296 > > Let me give more details of my situation. Say, I have read in 6 BAM files > (s1 to s6). and then got a count table (filename = countTable) using > countOverlaps function with resultant file head as below. Question here, is > the rowname (first col) miRNA ID? > >>??? colnames(countTable) <- samples >>?? head(countTable) > ????????? s1 s2 s3 s4 s5 s6 > 1????????? 1? 1? 3? 0? 2? 2 > 10???????? 0? 0? 1? 0? 1? 0 > 100??????? 0? 1? 0? 1? 3? 3 > 1000?????? 0? 7? 1? 1? 3? 3 > 10000????? 0? 6? 2? 8 16 18 > 100008586? 0? 0? 0? 0? 0? 0 >>??? dim(countTable) > [1] 22932???? 6 > I then filtered out those miRNAs whose expressions were 0 across all samples > by the following code (it might be what you asked for). > >>?? newCountTable <- countTable[rowSums(countTable) != 0, ] >>??? dim(newCountTable) > [1] 22105???? 6 >>??? head(newCountTable) > ????????? s1 s2 s3 s4 s5 s6 > 1????????? 1? 1? 3? 0? 2? 2 > 10???????? 0? 0? 1? 0? 1? 0 > 100??????? 0? 1? 0? 1? 3? 3 > 1000?????? 0? 7? 1? 1? 3? 3 > 10000????? 0? 6? 2? 8 16 18 > 100009676? 0? 0? 0? 1? 3? 4 > > > After annotating using the following library, I got another count table. > > entrezGenes <- rownames(newCountTable) > ?? mapped_genes <- mappedkeys(org.Hs.egSYMBOL) > ?? overlapgenes <- intersect(entrezGenes, mapped_genes) > ?? overlapsymbols <- as.list(org.Hs.egSYMBOL[overlapgenes]) > ?? getGeneSymbol <- function(x) { > ?? if (x %in% overlapgenes) { > ???????????? return(overlapsymbols[[which(x == names(overlapsymbols))]]) > ????? } > ????? else { > ???? return(NA) > ?????? } > ??? } > > ??? geneSymbols <- sapply(entrezGenes, getGeneSymbol) > ??? countTable <- cbind(geneSymbols, numberOfexons[names(numberOfexons) %in% > entrezGenes], geneLength[names(geneLength) %in%? entrezGenes], > ????????????????? newCountTable) > >> dim(countTable) > [1] 22105???? 9 > > I further removed redundant IDs and NAs to generate an updated count table > (dataset named 'd') for DE analysis > >> dim(d) > [1] 22100???? 7 > >> d[20:30,] > ????????? X s1 s2 s3 s4 s5? s6 > 20??? 7-Mar? 1? 2? 2 11 12?? 5 > 21??? 7-Sep? 2? 6? 4? 6 19?? 9 > 22??? 8-Mar? 1? 2? 0? 1? 6?? 4 > 23??? 8-Sep? 0? 4? 4? 4? 3? 10 > 24??? 9-Mar? 1? 3? 5? 1? 6?? 9 > 25??? 9-Sep? 2? 5? 8 10 35? 20 > 26???? A1BG? 1? 1? 3? 0? 2?? 2 > 27 A1BG-AS1? 0? 1? 1? 0? 0?? 4 > 28???? A1CF? 0? 2? 1? 3? 9?? 1 > 29??? A2LD1? 1? 1? 2? 0? 1?? 6 > 30????? A2M? 5 53 48 70 58 114 > > It looks like there are 25 technical controls, but majority were annotated. > > thanks, > > --- > William Wu > Vanderbilt University/Department of Biostatistics. > > > > > > > > -----Original Message----- > From: Steve Lianoglou [mailto:mailinglist.honeypot at gmail.com] > Sent: Monday, May 07, 2012 1:59 PM > To: Wu, Huiyun > Cc: bioconductor at r-project.org > Subject: Re: [BioC] miRNA annotation > > Hi, > > On Mon, May 7, 2012 at 2:15 PM, Wu, Huiyun <william.wu at="" vanderbilt.edu=""> > wrote: >> Hello there, >> >> I am new in seq data analysis, and have difficulty in miRNA annotation. To >> clarify my question, I have the following codes. >> >> >> ? bamfile <- "GSE33858_1.bowtie_alignment.bam" >> ? aln <- readBamGappedAlignments(bamfile) >> ? txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene") >> ? exonRangesList <- exonsBy(txdb, by = "gene") >> ? exonRanges <- unlist(exonRangesList) >> ? strand(exonRanges) <- "*" >> ? geneNames <- sub("\\..*$", "", names(exonRanges)) >> ? exonRangesListNoStrand <- split(exonRanges, geneNames) >> ? numberOfexons <- sapply(width(exonRangesListNoStrand), length) >> ? geneLength <- sum(width(reduce(exonRangesListNoStrand))) >> ? # Counting reads for the BAM file >> ? counts <- countOverlaps(exonRangesListNoStrand, aln) >> ? names(counts) <- names(exonRangesListNoStrand) >> >> ? > length(counts) >> >> ? ? ? [1] 22932 >> >> So my question is from aligned file to annotated file. Specifically, why I >> got 22932 features after mapping to hg19? I learned there are only less than >> 2000 miRNAs matched genes. I also know there are several RNA databases >> including miRBase for annotation. Did I do something logically wrong? It >> would be greatly appreciated if someone could show me how to implement this >> annotation in R. > > What does `sum(counts == 0)` give you? > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University Contact Info: > http://cbio.mskcc.org/~lianos/contact > > -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLY

Login before adding your answer.

Traffic: 806 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6