Question

miRNA annotation

0

Entering edit mode

Wu, Huiyun ▴ 70

@wu-huiyun-5271

Last seen 9.7 years ago

Hello there, I am new in seq data analysis, and have difficulty in miRNA annotation. To clarify my question, I have the following codes. bamfile <- "GSE33858_1.bowtie_alignment.bam" aln <- readBamGappedAlignments(bamfile) txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene") exonRangesList <- exonsBy(txdb, by = "gene") exonRanges <- unlist(exonRangesList) strand(exonRanges) <- "*" geneNames <- sub("\\..*$", "", names(exonRanges)) exonRangesListNoStrand <- split(exonRanges, geneNames) numberOfexons <- sapply(width(exonRangesListNoStrand), length) geneLength <- sum(width(reduce(exonRangesListNoStrand))) # Counting reads for the BAM file counts <- countOverlaps(exonRangesListNoStrand, aln) names(counts) <- names(exonRangesListNoStrand) > length(counts) [1] 22932 So my question is from aligned file to annotated file. Specifically, why I got 22932 features after mapping to hg19? I learned there are only less than 2000 miRNAs matched genes. I also know there are several RNA databases including miRBase for annotation. Did I do something logically wrong? It would be greatly appreciated if someone could show me how to implement this annotation in R. Many thanks, William Wu [[alternative HTML version deleted]]

Annotation Annotation • 978 views

ADD COMMENT • link updated 12.0 years ago by Steve Lianoglou ★ 13k • written 12.0 years ago by Wu, Huiyun ▴ 70

score 0 · Answer 1 · 2012-05-07

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

Hi, On Mon, May 7, 2012 at 2:15 PM, Wu, Huiyun <william.wu at="" vanderbilt.edu=""> wrote: > Hello there, > > I am new in seq data analysis, and have difficulty in miRNA annotation. To clarify my question, I have the following codes. > > > ? bamfile <- "GSE33858_1.bowtie_alignment.bam" > ? aln <- readBamGappedAlignments(bamfile) > ? txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene") > ? exonRangesList <- exonsBy(txdb, by = "gene") > ? exonRanges <- unlist(exonRangesList) > ? strand(exonRanges) <- "*" > ? geneNames <- sub("\\..*$", "", names(exonRanges)) > ? exonRangesListNoStrand <- split(exonRanges, geneNames) > ? numberOfexons <- sapply(width(exonRangesListNoStrand), length) > ? geneLength <- sum(width(reduce(exonRangesListNoStrand))) > ? # Counting reads for the BAM file > ? counts <- countOverlaps(exonRangesListNoStrand, aln) > ? names(counts) <- names(exonRangesListNoStrand) > > ? > length(counts) > > ? ? ? [1] 22932 > > So my question is from aligned file to annotated file. Specifically, why I got 22932 features after mapping to hg19? I learned there are only less than 2000 miRNAs matched genes. I also know there are several RNA databases including miRBase for annotation. Did I do something logically wrong? It would be greatly appreciated if someone could show me how to implement this annotation in R. What does `sum(counts == 0)` give you? -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD COMMENT • link 12.0 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Dear Steve, thank you for your feedback. Below is what it returned. > st <- sum(counts==0) > st [1] 12296 Let me give more details of my situation. Say, I have read in 6 BAM files (s1 to s6). and then got a count table (filename = countTable) using countOverlaps function with resultant file head as below. Question here, is the rowname (first col) miRNA ID? > colnames(countTable) <- samples > head(countTable) s1 s2 s3 s4 s5 s6 1 1 1 3 0 2 2 10 0 0 1 0 1 0 100 0 1 0 1 3 3 1000 0 7 1 1 3 3 10000 0 6 2 8 16 18 100008586 0 0 0 0 0 0 > dim(countTable) [1] 22932 6 I then filtered out those miRNAs whose expressions were 0 across all samples by the following code (it might be what you asked for). > newCountTable <- countTable[rowSums(countTable) != 0, ] > dim(newCountTable) [1] 22105 6 > head(newCountTable) s1 s2 s3 s4 s5 s6 1 1 1 3 0 2 2 10 0 0 1 0 1 0 100 0 1 0 1 3 3 1000 0 7 1 1 3 3 10000 0 6 2 8 16 18 100009676 0 0 0 1 3 4 After annotating using the following library, I got another count table. entrezGenes <- rownames(newCountTable) mapped_genes <- mappedkeys(org.Hs.egSYMBOL) overlapgenes <- intersect(entrezGenes, mapped_genes) overlapsymbols <- as.list(org.Hs.egSYMBOL[overlapgenes]) getGeneSymbol <- function(x) { if (x %in% overlapgenes) { return(overlapsymbols[[which(x == names(overlapsymbols))]]) } else { return(NA) } } geneSymbols <- sapply(entrezGenes, getGeneSymbol) countTable <- cbind(geneSymbols, numberOfexons[names(numberOfexons) %in% entrezGenes], geneLength[names(geneLength) %in% entrezGenes], newCountTable) > dim(countTable) [1] 22105 9 I further removed redundant IDs and NAs to generate an updated count table (dataset named 'd') for DE analysis > dim(d) [1] 22100 7 > d[20:30,] X s1 s2 s3 s4 s5 s6 20 7-Mar 1 2 2 11 12 5 21 7-Sep 2 6 4 6 19 9 22 8-Mar 1 2 0 1 6 4 23 8-Sep 0 4 4 4 3 10 24 9-Mar 1 3 5 1 6 9 25 9-Sep 2 5 8 10 35 20 26 A1BG 1 1 3 0 2 2 27 A1BG-AS1 0 1 1 0 0 4 28 A1CF 0 2 1 3 9 1 29 A2LD1 1 1 2 0 1 6 30 A2M 5 53 48 70 58 114 It looks like there are 25 technical controls, but majority were annotated. thanks, --- William Wu Vanderbilt University/Department of Biostatistics. -----Original Message----- From: Steve Lianoglou [mailto:mailinglist.honeypot@gmail.com] Sent: Monday, May 07, 2012 1:59 PM To: Wu, Huiyun Cc: bioconductor@r-project.org Subject: Re: [BioC] miRNA annotation Hi, On Mon, May 7, 2012 at 2:15 PM, Wu, Huiyun <william.wu@vanderbilt.edu<mailto:william.wu@vanderbilt.edu>> wrote: > Hello there, > > I am new in seq data analysis, and have difficulty in miRNA annotation. To clarify my question, I have the following codes. > > > bamfile <- "GSE33858_1.bowtie_alignment.bam" > aln <- readBamGappedAlignments(bamfile) > txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene") > exonRangesList <- exonsBy(txdb, by = "gene") > exonRanges <- unlist(exonRangesList) > strand(exonRanges) <- "*" > geneNames <- sub("\\..*$", "", names(exonRanges)) > exonRangesListNoStrand <- split(exonRanges, geneNames) > numberOfexons <- sapply(width(exonRangesListNoStrand), length) > geneLength <- sum(width(reduce(exonRangesListNoStrand))) > # Counting reads for the BAM file > counts <- countOverlaps(exonRangesListNoStrand, aln) > names(counts) <- names(exonRangesListNoStrand) > > > length(counts) > > [1] 22932 > > So my question is from aligned file to annotated file. Specifically, why I got 22932 features after mapping to hg19? I learned there are only less than 2000 miRNAs matched genes. I also know there are several RNA databases including miRBase for annotation. Did I do something logically wrong? It would be greatly appreciated if someone could show me how to implement this annotation in R. What does `sum(counts == 0)` give you? -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact [[alternative HTML version deleted]]

ADD REPLY • link 12.0 years ago Wu, Huiyun ▴ 70

0

Entering edit mode

Hi William, I'm actually now at a bit of a loss as to what you are asking. (1) Do you want someone to check your code to see if you did the annotations correctly? Or (2) are you curious why you are getting reads from (what I guess is) a small RNA library that seem to be coming from full length mRNAs? If it's the first, why not try loading your BAM file up in something like IGV and hopping over to some of the non-miRNA genes that your count table says you have reads for and see if you see in the browser what your table says. If it's the second question, I guess you'd have to go through the protocol and see how they prepped the library. You might start by looking to see when (if) they did the size selection of the RNA/cDNA to enrich for small RNA's. HTH, -steve On Mon, May 7, 2012 at 4:00 PM, Wu, Huiyun <william.wu at="" vanderbilt.edu=""> wrote: > Dear Steve, > > thank you for your feedback. Below is what it returned. > >> st <- sum(counts==0) >> st > [1] 12296 > > Let me give more details of my situation. Say, I have read in 6 BAM files > (s1 to s6). and then got a count table (filename = countTable) using > countOverlaps function with resultant file head as below. Question here, is > the rowname (first col) miRNA ID? > >>??? colnames(countTable) <- samples >>?? head(countTable) > ????????? s1 s2 s3 s4 s5 s6 > 1????????? 1? 1? 3? 0? 2? 2 > 10???????? 0? 0? 1? 0? 1? 0 > 100??????? 0? 1? 0? 1? 3? 3 > 1000?????? 0? 7? 1? 1? 3? 3 > 10000????? 0? 6? 2? 8 16 18 > 100008586? 0? 0? 0? 0? 0? 0 >>??? dim(countTable) > [1] 22932???? 6 > I then filtered out those miRNAs whose expressions were 0 across all samples > by the following code (it might be what you asked for). > >>?? newCountTable <- countTable[rowSums(countTable) != 0, ] >>??? dim(newCountTable) > [1] 22105???? 6 >>??? head(newCountTable) > ????????? s1 s2 s3 s4 s5 s6 > 1????????? 1? 1? 3? 0? 2? 2 > 10???????? 0? 0? 1? 0? 1? 0 > 100??????? 0? 1? 0? 1? 3? 3 > 1000?????? 0? 7? 1? 1? 3? 3 > 10000????? 0? 6? 2? 8 16 18 > 100009676? 0? 0? 0? 1? 3? 4 > > > After annotating using the following library, I got another count table. > > entrezGenes <- rownames(newCountTable) > ?? mapped_genes <- mappedkeys(org.Hs.egSYMBOL) > ?? overlapgenes <- intersect(entrezGenes, mapped_genes) > ?? overlapsymbols <- as.list(org.Hs.egSYMBOL[overlapgenes]) > ?? getGeneSymbol <- function(x) { > ?? if (x %in% overlapgenes) { > ???????????? return(overlapsymbols[[which(x == names(overlapsymbols))]]) > ????? } > ????? else { > ???? return(NA) > ?????? } > ??? } > > ??? geneSymbols <- sapply(entrezGenes, getGeneSymbol) > ??? countTable <- cbind(geneSymbols, numberOfexons[names(numberOfexons) %in% > entrezGenes], geneLength[names(geneLength) %in%? entrezGenes], > ????????????????? newCountTable) > >> dim(countTable) > [1] 22105???? 9 > > I further removed redundant IDs and NAs to generate an updated count table > (dataset named 'd') for DE analysis > >> dim(d) > [1] 22100???? 7 > >> d[20:30,] > ????????? X s1 s2 s3 s4 s5? s6 > 20??? 7-Mar? 1? 2? 2 11 12?? 5 > 21??? 7-Sep? 2? 6? 4? 6 19?? 9 > 22??? 8-Mar? 1? 2? 0? 1? 6?? 4 > 23??? 8-Sep? 0? 4? 4? 4? 3? 10 > 24??? 9-Mar? 1? 3? 5? 1? 6?? 9 > 25??? 9-Sep? 2? 5? 8 10 35? 20 > 26???? A1BG? 1? 1? 3? 0? 2?? 2 > 27 A1BG-AS1? 0? 1? 1? 0? 0?? 4 > 28???? A1CF? 0? 2? 1? 3? 9?? 1 > 29??? A2LD1? 1? 1? 2? 0? 1?? 6 > 30????? A2M? 5 53 48 70 58 114 > > It looks like there are 25 technical controls, but majority were annotated. > > thanks, > > --- > William Wu > Vanderbilt University/Department of Biostatistics. > > > > > > > > -----Original Message----- > From: Steve Lianoglou [mailto:mailinglist.honeypot at gmail.com] > Sent: Monday, May 07, 2012 1:59 PM > To: Wu, Huiyun > Cc: bioconductor at r-project.org > Subject: Re: [BioC] miRNA annotation > > Hi, > > On Mon, May 7, 2012 at 2:15 PM, Wu, Huiyun <william.wu at="" vanderbilt.edu=""> > wrote: >> Hello there, >> >> I am new in seq data analysis, and have difficulty in miRNA annotation. To >> clarify my question, I have the following codes. >> >> >> ? bamfile <- "GSE33858_1.bowtie_alignment.bam" >> ? aln <- readBamGappedAlignments(bamfile) >> ? txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene") >> ? exonRangesList <- exonsBy(txdb, by = "gene") >> ? exonRanges <- unlist(exonRangesList) >> ? strand(exonRanges) <- "*" >> ? geneNames <- sub("\\..*$", "", names(exonRanges)) >> ? exonRangesListNoStrand <- split(exonRanges, geneNames) >> ? numberOfexons <- sapply(width(exonRangesListNoStrand), length) >> ? geneLength <- sum(width(reduce(exonRangesListNoStrand))) >> ? # Counting reads for the BAM file >> ? counts <- countOverlaps(exonRangesListNoStrand, aln) >> ? names(counts) <- names(exonRangesListNoStrand) >> >> ? > length(counts) >> >> ? ? ? [1] 22932 >> >> So my question is from aligned file to annotated file. Specifically, why I >> got 22932 features after mapping to hg19? I learned there are only less than >> 2000 miRNAs matched genes. I also know there are several RNA databases >> including miRBase for annotation. Did I do something logically wrong? It >> would be greatly appreciated if someone could show me how to implement this >> annotation in R. > > What does `sum(counts == 0)` give you? > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University Contact Info: > http://cbio.mskcc.org/~lianos/contact > > -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 12.0 years ago Steve Lianoglou ★ 13k