Search
Question: using a gtf file to map reads
0
gravatar for Sam McInturf
4.5 years ago by
Sam McInturf300
United States
Sam McInturf300 wrote:
Hello everyone, I am having a problem using GenomicRanges to count my mapped reads, although this is a user lack of understanding, not a bug. After mapping the reads, there is no entries for the rownames of the differentially expressed genes. I am doing RNA seq in Arabidopsis (Illumina, 100bp, single end). I mapped using Tophat2, using the -G option and passed my gtf file. This file is the TAIR10 release gtf, downloaded from the tophat annotation download page. I loaded it into R using rtracklayer's import function. By my way of understanding things (which is may be wrong), if you count reads so each row of the count matrix is an exon, then you can use DexSeq to calculate differential exon usage. If you want to calculate gene level differential expression, then you count per transcript, and the use DESeq/2 to calculate differential expression. Because I am interested in gene level differential expression (at this point), I subsetted my gtf file for the CDS, where CDS is found in the 'type' column (gtf reproduced below). This becomes the first question, do I subset for CDS to get gene level differential expression? I then loaded all of my tophat bam files in and mapped them using the summarizeOverlaps() function, changed the colData, and add my gtf file as rowData. So now, i have a summarized experiment with plenty of information on each row the of the count matrix, but no rownames. using DESeq2, no identifiers are displayed with the deferentially expressed genes. but what do I put in as the rownames? the gene_id, transcript_id, gene_name, and transcript_name are all non-unique (i presume the rownames must be unique). What to do? I don't `need` to use a gtf file, just the only place I know to get features to count on. Code: gff0 <- import(".../Arabidopsis_thaliana.TAIR10.17.gtf", asRangedData = FALSE) idx <- mcols(gff0)$type == "CDS" gffCds <- gff0[idx ] fls <- c("sortedBamFiles/sortSUC.bam", "sortedBamFiles/sortTOTAL.bam") bfl <- BamFileList(fls, index = character()) olap <- summarizeOverlaps(gffCds, bfl) colData(olap)$RNA <- factor(c("Suc", "Total"), levels = c("Total", "Suc")) rowData(olap) <- gffCds Thanks for any help! Sam McInturf A few objects: > gff0 GRanges with 485101 ranges and 12 metadata columns: seqnames ranges strand | source type <rle> <iranges> <rle> | <factor> <factor> [1] Mt [ 273, 734] - | protein_coding exon [2] Mt [ 276, 734] - | protein_coding CDS [3] Mt [ 732, 734] - | protein_coding start_codon [4] Mt [ 273, 275] - | protein_coding stop_codon [5] Mt [8848, 11415] - | rRNA exon ... ... ... ... ... ... ... [485097] 1 [30426535, 30426557] - | pseudogene exon [485098] 1 [30426341, 30426403] - | pseudogene exon [485099] 1 [30426217, 30426268] - | pseudogene exon [485100] 1 [30425840, 30425977] - | pseudogene exon [485101] 1 [30426839, 30427577] + | pseudogene exon score phase gene_id transcript_id exon_number <numeric> <integer> <character> <character> <numeric> [1] <na> <na> ATMG00010 ATMG00010.1 1 [2] <na> 0 ATMG00010 ATMG00010.1 1 [3] <na> 0 ATMG00010 ATMG00010.1 1 [4] <na> 0 ATMG00010 ATMG00010.1 1 [5] <na> <na> ATMG00020 ATMG00020.1 1 ... ... ... ... ... ... [485097] <na> <na> AT1G81001 AT1G81001.1 1 [485098] <na> <na> AT1G81001 AT1G81001.1 2 [485099] <na> <na> AT1G81001 AT1G81001.1 3 [485100] <na> <na> AT1G81001 AT1G81001.1 4 [485101] <na> <na> AT1G81020 AT1G81020.1 1 gene_name transcript_name seqedit protein_id peptide <character> <character> <character> <character> <character> [1] ORF153A ATMG00010.1 false <na> <na> [2] ORF153A ATMG00010.1 <na> ATMG00010.1 <na> [3] ORF153A ATMG00010.1 <na> <na> <na> [4] ORF153A ATMG00010.1 <na> <na> <na> [5] RRN26 ATMG00020.1 false <na> <na> ... ... ... ... ... ... [485097] AT1G81001 AT1G81001.1 false <na> <na> [485098] AT1G81001 AT1G81001.1 false <na> <na> [485099] AT1G81001 AT1G81001.1 false <na> <na> [485100] AT1G81001 AT1G81001.1 false <na> <na> [485101] AT1G81020 AT1G81020.1 false <na> <na> --- seqlengths: Mt Pt 5 4 3 2 1 NA NA NA NA NA NA NA overlap file > olap class: SummarizedExperiment dim: 197105 2 exptData(0): assays(1): counts rownames: NULL rowData metadata column names(12): source type ... protein_id peptide colnames(2): sortedBamFiles/sortSUC.bam sortedBamFiles/sortTOTAL.bam colData names(2): fileName RNA > sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] DESeq2_1.0.11 RcppArmadillo_0.3.820 Rcpp_0.10.3 [4] lattice_0.20-15 Biobase_2.20.0 rtracklayer_1.20.2 [7] GenomicRanges_1.12.4 IRanges_1.18.1 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] annotate_1.38.0 AnnotationDbi_1.22.5 Biostrings_2.28.0 [4] bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-7 [7] genefilter_1.42.0 grid_3.0.0 locfit_1.5-9.1 [10] RColorBrewer_1.0-5 RCurl_1.95-4.1 Rsamtools_1.12.3 [13] RSQLite_0.11.4 splines_3.0.0 stats4_3.0.0 [16] survival_2.37-4 tools_3.0.0 XML_3.96-1.1 [19] xtable_1.7-1 zlibbioc_1.6.0 -- Sam McInturf [[alternative HTML version deleted]]
ADD COMMENTlink modified 4.5 years ago by Michael Lawrence9.8k • written 4.5 years ago by Sam McInturf300
0
gravatar for Michael Lawrence
4.5 years ago by
United States
Michael Lawrence9.8k wrote:
On Tue, Jun 4, 2013 at 1:07 PM, Sam McInturf <smcinturf@gmail.com> wrote: > Hello everyone, > I am having a problem using GenomicRanges to count my mapped reads, > although this is a user lack of understanding, not a bug. After mapping > the reads, there is no entries for the rownames of the differentially > expressed genes. I am doing RNA seq in Arabidopsis (Illumina, 100bp, > single end). I mapped using Tophat2, using the -G option and passed my gtf > file. This file is the TAIR10 release gtf, downloaded from the tophat > annotation download page. I loaded it into R using rtracklayer's import > function. > > By my way of understanding things (which is may be wrong), if you count > reads so each row of the count matrix is an exon, then you can use DexSeq > to calculate differential exon usage. I think the typical/recommended set of features for DEXseq is the disjoint exon bins, i.e., the result of calling disjoin(exons), although you'll want to filter out bins shared between multiple genes. > If you want to calculate gene level > differential expression, then you count per transcript, and the use DESeq/2 > to calculate differential expression. > > Because I am interested in gene level differential expression (at this > point), I subsetted my gtf file for the CDS, where CDS is found in the > 'type' column (gtf reproduced below). This becomes the first question, do > I subset for CDS to get gene level differential expression? > > The CDS features will be the individual coding regions, split up by the introns. They're essentially the exons, with the UTRs removed. If you define gene-level expression as the count of reads overlapping any of the CDS regions, you'll want to first split your gff0 by the gene_id metadata column, and then call reduce() on the result. > I then loaded all of my tophat bam files in and mapped them using the > summarizeOverlaps() function, changed the colData, and add my gtf file as > rowData. The replacement of the rowData should not be necessary. It should already be done for you. > So now, i have a summarized experiment with plenty of information > on each row the of the count matrix, but no rownames. using DESeq2, no > identifiers are displayed with the deferentially expressed genes. but what > do I put in as the rownames? the gene_id, transcript_id, gene_name, and > transcript_name are all non-unique (i presume the rownames must be unique). > What to do? I don't `need` to use a gtf file, just the only place I know > to get features to count on. > > The new ranges (the reduced CDSs) should be named (by the gene) and thus should become the rownames on the SummarizedExperiment. I edited your code with the suggestions: > Code: > gff0 <- import(".../Arabidopsis_thaliana.TAIR10.17.gtf", asRangedData = > FALSE) > cds <- subset(gff0, type == "CDS") cds_gene <- reduce(split(cds, cds$gene_id)) > fls <- c(Suc = "sortedBamFiles/sortSUC.bam", Total = > "sortedBamFiles/sortTOTAL.bam") > bfl <- BamFileList(fls, index = character()) > olap <- summarizeOverlaps(gffCds, bfl) > colData(olap)$RNA <- factor(c("Suc", "Total"), levels = c("Total", "Suc")) > > The above colData<- call is not really necessary, because I named the fls vector accordingly. Those names will percolate to the rownames on colData (and perhaps more importantly the column names in your assay matrix). Michael > > Thanks for any help! > > Sam McInturf > > A few objects: > > gff0 > GRanges with 485101 ranges and 12 metadata columns: > seqnames ranges strand | source > type > <rle> <iranges> <rle> | <factor> > <factor> > [1] Mt [ 273, 734] - | protein_coding > exon > [2] Mt [ 276, 734] - | protein_coding > CDS > [3] Mt [ 732, 734] - | protein_coding > start_codon > [4] Mt [ 273, 275] - | protein_coding > stop_codon > [5] Mt [8848, 11415] - | rRNA > exon > ... ... ... ... ... ... > ... > [485097] 1 [30426535, 30426557] - | pseudogene > exon > [485098] 1 [30426341, 30426403] - | pseudogene > exon > [485099] 1 [30426217, 30426268] - | pseudogene > exon > [485100] 1 [30425840, 30425977] - | pseudogene > exon > [485101] 1 [30426839, 30427577] + | pseudogene > exon > score phase gene_id transcript_id exon_number > <numeric> <integer> <character> <character> <numeric> > [1] <na> <na> ATMG00010 ATMG00010.1 1 > [2] <na> 0 ATMG00010 ATMG00010.1 1 > [3] <na> 0 ATMG00010 ATMG00010.1 1 > [4] <na> 0 ATMG00010 ATMG00010.1 1 > [5] <na> <na> ATMG00020 ATMG00020.1 1 > ... ... ... ... ... ... > [485097] <na> <na> AT1G81001 AT1G81001.1 1 > [485098] <na> <na> AT1G81001 AT1G81001.1 2 > [485099] <na> <na> AT1G81001 AT1G81001.1 3 > [485100] <na> <na> AT1G81001 AT1G81001.1 4 > [485101] <na> <na> AT1G81020 AT1G81020.1 1 > gene_name transcript_name seqedit protein_id peptide > <character> <character> <character> <character> <character> > [1] ORF153A ATMG00010.1 false <na> <na> > [2] ORF153A ATMG00010.1 <na> ATMG00010.1 <na> > [3] ORF153A ATMG00010.1 <na> <na> <na> > [4] ORF153A ATMG00010.1 <na> <na> <na> > [5] RRN26 ATMG00020.1 false <na> <na> > ... ... ... ... ... ... > [485097] AT1G81001 AT1G81001.1 false <na> <na> > [485098] AT1G81001 AT1G81001.1 false <na> <na> > [485099] AT1G81001 AT1G81001.1 false <na> <na> > [485100] AT1G81001 AT1G81001.1 false <na> <na> > [485101] AT1G81020 AT1G81020.1 false <na> <na> > --- > seqlengths: > Mt Pt 5 4 3 2 1 > NA NA NA NA NA NA NA > > overlap file > > olap > class: SummarizedExperiment > dim: 197105 2 > exptData(0): > assays(1): counts > rownames: NULL > rowData metadata column names(12): source type ... protein_id peptide > colnames(2): sortedBamFiles/sortSUC.bam sortedBamFiles/sortTOTAL.bam > colData names(2): fileName RNA > > > sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] DESeq2_1.0.11 RcppArmadillo_0.3.820 Rcpp_0.10.3 > [4] lattice_0.20-15 Biobase_2.20.0 rtracklayer_1.20.2 > [7] GenomicRanges_1.12.4 IRanges_1.18.1 BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] annotate_1.38.0 AnnotationDbi_1.22.5 Biostrings_2.28.0 > [4] bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-7 > [7] genefilter_1.42.0 grid_3.0.0 locfit_1.5-9.1 > [10] RColorBrewer_1.0-5 RCurl_1.95-4.1 Rsamtools_1.12.3 > [13] RSQLite_0.11.4 splines_3.0.0 stats4_3.0.0 > [16] survival_2.37-4 tools_3.0.0 XML_3.96-1.1 > [19] xtable_1.7-1 zlibbioc_1.6.0 > > > -- > Sam McInturf > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD COMMENTlink written 4.5 years ago by Michael Lawrence9.8k
Michael, That worked like a charm! Although, I do have one question about what you did there. I am used to using split(), but what is reduce() doing? Best On Wed, Jun 5, 2013 at 7:04 AM, Michael Lawrence <lawrence.michael@gene.com>wrote: > > > > On Tue, Jun 4, 2013 at 1:07 PM, Sam McInturf <smcinturf@gmail.com> wrote: > >> Hello everyone, >> I am having a problem using GenomicRanges to count my mapped reads, >> although this is a user lack of understanding, not a bug. After mapping >> the reads, there is no entries for the rownames of the differentially >> expressed genes. I am doing RNA seq in Arabidopsis (Illumina, 100bp, >> single end). I mapped using Tophat2, using the -G option and passed my >> gtf >> file. This file is the TAIR10 release gtf, downloaded from the tophat >> annotation download page. I loaded it into R using rtracklayer's import >> function. >> >> By my way of understanding things (which is may be wrong), if you count >> reads so each row of the count matrix is an exon, then you can use DexSeq >> to calculate differential exon usage. > > > I think the typical/recommended set of features for DEXseq is the disjoint > exon bins, i.e., the result of calling disjoin(exons), although you'll want > to filter out bins shared between multiple genes. > > >> If you want to calculate gene level >> differential expression, then you count per transcript, and the use >> DESeq/2 >> to calculate differential expression. >> >> Because I am interested in gene level differential expression (at this >> point), I subsetted my gtf file for the CDS, where CDS is found in the >> 'type' column (gtf reproduced below). This becomes the first question, >> do >> I subset for CDS to get gene level differential expression? >> >> > The CDS features will be the individual coding regions, split up by the > introns. They're essentially the exons, with the UTRs removed. If you > define gene-level expression as the count of reads overlapping any of the > CDS regions, you'll want to first split your gff0 by the gene_id metadata > column, and then call reduce() on the result. > > >> I then loaded all of my tophat bam files in and mapped them using the >> summarizeOverlaps() function, changed the colData, and add my gtf file as >> rowData. > > > The replacement of the rowData should not be necessary. It should already > be done for you. > > >> So now, i have a summarized experiment with plenty of information >> on each row the of the count matrix, but no rownames. using DESeq2, no >> identifiers are displayed with the deferentially expressed genes. but >> what >> do I put in as the rownames? the gene_id, transcript_id, gene_name, and >> transcript_name are all non-unique (i presume the rownames must be >> unique). >> What to do? I don't `need` to use a gtf file, just the only place I know >> to get features to count on. >> >> > The new ranges (the reduced CDSs) should be named (by the gene) and thus > should become the rownames on the SummarizedExperiment. > > I edited your code with the suggestions: > > >> Code: >> gff0 <- import(".../Arabidopsis_thaliana.TAIR10.17.gtf", asRangedData = >> FALSE) >> > > cds <- subset(gff0, type == "CDS") > cds_gene <- reduce(split(cds, cds$gene_id)) > > >> fls <- c(Suc = "sortedBamFiles/sortSUC.bam", Total = >> "sortedBamFiles/sortTOTAL.bam") >> >> bfl <- BamFileList(fls, index = character()) >> olap <- summarizeOverlaps(gffCds, bfl) >> colData(olap)$RNA <- factor(c("Suc", "Total"), levels = c("Total", "Suc")) >> >> > The above colData<- call is not really necessary, because I named the fls > vector accordingly. Those names will percolate to the rownames on colData > (and perhaps more importantly the column names in your assay matrix). > > Michael > > >> >> Thanks for any help! >> >> Sam McInturf >> >> A few objects: >> > gff0 >> GRanges with 485101 ranges and 12 metadata columns: >> seqnames ranges strand | source >> type >> <rle> <iranges> <rle> | <factor> >> <factor> >> [1] Mt [ 273, 734] - | protein_coding >> exon >> [2] Mt [ 276, 734] - | protein_coding >> CDS >> [3] Mt [ 732, 734] - | protein_coding >> start_codon >> [4] Mt [ 273, 275] - | protein_coding >> stop_codon >> [5] Mt [8848, 11415] - | rRNA >> exon >> ... ... ... ... ... ... >> ... >> [485097] 1 [30426535, 30426557] - | pseudogene >> exon >> [485098] 1 [30426341, 30426403] - | pseudogene >> exon >> [485099] 1 [30426217, 30426268] - | pseudogene >> exon >> [485100] 1 [30425840, 30425977] - | pseudogene >> exon >> [485101] 1 [30426839, 30427577] + | pseudogene >> exon >> score phase gene_id transcript_id exon_number >> <numeric> <integer> <character> <character> <numeric> >> [1] <na> <na> ATMG00010 ATMG00010.1 1 >> [2] <na> 0 ATMG00010 ATMG00010.1 1 >> [3] <na> 0 ATMG00010 ATMG00010.1 1 >> [4] <na> 0 ATMG00010 ATMG00010.1 1 >> [5] <na> <na> ATMG00020 ATMG00020.1 1 >> ... ... ... ... ... ... >> [485097] <na> <na> AT1G81001 AT1G81001.1 1 >> [485098] <na> <na> AT1G81001 AT1G81001.1 2 >> [485099] <na> <na> AT1G81001 AT1G81001.1 3 >> [485100] <na> <na> AT1G81001 AT1G81001.1 4 >> [485101] <na> <na> AT1G81020 AT1G81020.1 1 >> gene_name transcript_name seqedit protein_id peptide >> <character> <character> <character> <character> <character> >> [1] ORF153A ATMG00010.1 false <na> <na> >> [2] ORF153A ATMG00010.1 <na> ATMG00010.1 <na> >> [3] ORF153A ATMG00010.1 <na> <na> <na> >> [4] ORF153A ATMG00010.1 <na> <na> <na> >> [5] RRN26 ATMG00020.1 false <na> <na> >> ... ... ... ... ... ... >> [485097] AT1G81001 AT1G81001.1 false <na> <na> >> [485098] AT1G81001 AT1G81001.1 false <na> <na> >> [485099] AT1G81001 AT1G81001.1 false <na> <na> >> [485100] AT1G81001 AT1G81001.1 false <na> <na> >> [485101] AT1G81020 AT1G81020.1 false <na> <na> >> --- >> seqlengths: >> Mt Pt 5 4 3 2 1 >> NA NA NA NA NA NA NA >> >> overlap file >> > olap >> class: SummarizedExperiment >> dim: 197105 2 >> exptData(0): >> assays(1): counts >> rownames: NULL >> rowData metadata column names(12): source type ... protein_id peptide >> colnames(2): sortedBamFiles/sortSUC.bam sortedBamFiles/sortTOTAL.bam >> colData names(2): fileName RNA >> >> > sessionInfo() >> R version 3.0.0 (2013-04-03) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] DESeq2_1.0.11 RcppArmadillo_0.3.820 Rcpp_0.10.3 >> [4] lattice_0.20-15 Biobase_2.20.0 rtracklayer_1.20.2 >> [7] GenomicRanges_1.12.4 IRanges_1.18.1 BiocGenerics_0.6.0 >> >> loaded via a namespace (and not attached): >> [1] annotate_1.38.0 AnnotationDbi_1.22.5 Biostrings_2.28.0 >> [4] bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-7 >> [7] genefilter_1.42.0 grid_3.0.0 locfit_1.5-9.1 >> [10] RColorBrewer_1.0-5 RCurl_1.95-4.1 Rsamtools_1.12.3 >> [13] RSQLite_0.11.4 splines_3.0.0 stats4_3.0.0 >> [16] survival_2.37-4 tools_3.0.0 XML_3.96-1.1 >> [19] xtable_1.7-1 zlibbioc_1.6.0 >> >> >> -- >> Sam McInturf >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- Sam McInturf [[alternative HTML version deleted]]
ADD REPLYlink written 4.5 years ago by Sam McInturf300
You can see help(reduce). It basically collapses/merges the ranges so that none are overlapping nor adjacent. On Wed, Jun 5, 2013 at 7:01 AM, Sam McInturf <smcinturf@gmail.com> wrote: > Michael, > That worked like a charm! Although, I do have one question about what > you did there. I am used to using split(), but what is reduce() doing? > > Best > > > On Wed, Jun 5, 2013 at 7:04 AM, Michael Lawrence < > lawrence.michael@gene.com> wrote: > >> >> >> >> On Tue, Jun 4, 2013 at 1:07 PM, Sam McInturf <smcinturf@gmail.com> wrote: >> >>> Hello everyone, >>> I am having a problem using GenomicRanges to count my mapped reads, >>> although this is a user lack of understanding, not a bug. After mapping >>> the reads, there is no entries for the rownames of the differentially >>> expressed genes. I am doing RNA seq in Arabidopsis (Illumina, 100bp, >>> single end). I mapped using Tophat2, using the -G option and passed my >>> gtf >>> file. This file is the TAIR10 release gtf, downloaded from the tophat >>> annotation download page. I loaded it into R using rtracklayer's import >>> function. >>> >>> By my way of understanding things (which is may be wrong), if you count >>> reads so each row of the count matrix is an exon, then you can use DexSeq >>> to calculate differential exon usage. >> >> >> I think the typical/recommended set of features for DEXseq is the >> disjoint exon bins, i.e., the result of calling disjoin(exons), although >> you'll want to filter out bins shared between multiple genes. >> >> >>> If you want to calculate gene level >>> differential expression, then you count per transcript, and the use >>> DESeq/2 >>> to calculate differential expression. >>> >>> Because I am interested in gene level differential expression (at this >>> point), I subsetted my gtf file for the CDS, where CDS is found in the >>> 'type' column (gtf reproduced below). This becomes the first question, >>> do >>> I subset for CDS to get gene level differential expression? >>> >>> >> The CDS features will be the individual coding regions, split up by the >> introns. They're essentially the exons, with the UTRs removed. If you >> define gene-level expression as the count of reads overlapping any of the >> CDS regions, you'll want to first split your gff0 by the gene_id metadata >> column, and then call reduce() on the result. >> >> >>> I then loaded all of my tophat bam files in and mapped them using the >>> summarizeOverlaps() function, changed the colData, and add my gtf file as >>> rowData. >> >> >> The replacement of the rowData should not be necessary. It should already >> be done for you. >> >> >>> So now, i have a summarized experiment with plenty of information >>> on each row the of the count matrix, but no rownames. using DESeq2, no >>> identifiers are displayed with the deferentially expressed genes. but >>> what >>> do I put in as the rownames? the gene_id, transcript_id, gene_name, and >>> transcript_name are all non-unique (i presume the rownames must be >>> unique). >>> What to do? I don't `need` to use a gtf file, just the only place I >>> know >>> to get features to count on. >>> >>> >> The new ranges (the reduced CDSs) should be named (by the gene) and thus >> should become the rownames on the SummarizedExperiment. >> >> I edited your code with the suggestions: >> >> >>> Code: >>> gff0 <- import(".../Arabidopsis_thaliana.TAIR10.17.gtf", asRangedData = >>> FALSE) >>> >> >> cds <- subset(gff0, type == "CDS") >> cds_gene <- reduce(split(cds, cds$gene_id)) >> >> >>> fls <- c(Suc = "sortedBamFiles/sortSUC.bam", Total = >>> "sortedBamFiles/sortTOTAL.bam") >>> >>> bfl <- BamFileList(fls, index = character()) >>> olap <- summarizeOverlaps(gffCds, bfl) >>> colData(olap)$RNA <- factor(c("Suc", "Total"), levels = c("Total", >>> "Suc")) >>> >>> >> The above colData<- call is not really necessary, because I named the fls >> vector accordingly. Those names will percolate to the rownames on colData >> (and perhaps more importantly the column names in your assay matrix). >> >> Michael >> >> >>> >>> Thanks for any help! >>> >>> Sam McInturf >>> >>> A few objects: >>> > gff0 >>> GRanges with 485101 ranges and 12 metadata columns: >>> seqnames ranges strand | source >>> type >>> <rle> <iranges> <rle> | <factor> >>> <factor> >>> [1] Mt [ 273, 734] - | protein_coding >>> exon >>> [2] Mt [ 276, 734] - | protein_coding >>> CDS >>> [3] Mt [ 732, 734] - | protein_coding >>> start_codon >>> [4] Mt [ 273, 275] - | protein_coding >>> stop_codon >>> [5] Mt [8848, 11415] - | rRNA >>> exon >>> ... ... ... ... ... ... >>> ... >>> [485097] 1 [30426535, 30426557] - | pseudogene >>> exon >>> [485098] 1 [30426341, 30426403] - | pseudogene >>> exon >>> [485099] 1 [30426217, 30426268] - | pseudogene >>> exon >>> [485100] 1 [30425840, 30425977] - | pseudogene >>> exon >>> [485101] 1 [30426839, 30427577] + | pseudogene >>> exon >>> score phase gene_id transcript_id exon_number >>> <numeric> <integer> <character> <character> <numeric> >>> [1] <na> <na> ATMG00010 ATMG00010.1 1 >>> [2] <na> 0 ATMG00010 ATMG00010.1 1 >>> [3] <na> 0 ATMG00010 ATMG00010.1 1 >>> [4] <na> 0 ATMG00010 ATMG00010.1 1 >>> [5] <na> <na> ATMG00020 ATMG00020.1 1 >>> ... ... ... ... ... ... >>> [485097] <na> <na> AT1G81001 AT1G81001.1 1 >>> [485098] <na> <na> AT1G81001 AT1G81001.1 2 >>> [485099] <na> <na> AT1G81001 AT1G81001.1 3 >>> [485100] <na> <na> AT1G81001 AT1G81001.1 4 >>> [485101] <na> <na> AT1G81020 AT1G81020.1 1 >>> gene_name transcript_name seqedit protein_id >>> peptide >>> <character> <character> <character> <character> >>> <character> >>> [1] ORF153A ATMG00010.1 false <na> >>> <na> >>> [2] ORF153A ATMG00010.1 <na> ATMG00010.1 >>> <na> >>> [3] ORF153A ATMG00010.1 <na> <na> >>> <na> >>> [4] ORF153A ATMG00010.1 <na> <na> >>> <na> >>> [5] RRN26 ATMG00020.1 false <na> >>> <na> >>> ... ... ... ... ... >>> ... >>> [485097] AT1G81001 AT1G81001.1 false <na> >>> <na> >>> [485098] AT1G81001 AT1G81001.1 false <na> >>> <na> >>> [485099] AT1G81001 AT1G81001.1 false <na> >>> <na> >>> [485100] AT1G81001 AT1G81001.1 false <na> >>> <na> >>> [485101] AT1G81020 AT1G81020.1 false <na> >>> <na> >>> --- >>> seqlengths: >>> Mt Pt 5 4 3 2 1 >>> NA NA NA NA NA NA NA >>> >>> overlap file >>> > olap >>> class: SummarizedExperiment >>> dim: 197105 2 >>> exptData(0): >>> assays(1): counts >>> rownames: NULL >>> rowData metadata column names(12): source type ... protein_id peptide >>> colnames(2): sortedBamFiles/sortSUC.bam sortedBamFiles/sortTOTAL.bam >>> colData names(2): fileName RNA >>> >>> > sessionInfo() >>> R version 3.0.0 (2013-04-03) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> [8] base >>> >>> other attached packages: >>> [1] DESeq2_1.0.11 RcppArmadillo_0.3.820 Rcpp_0.10.3 >>> [4] lattice_0.20-15 Biobase_2.20.0 rtracklayer_1.20.2 >>> [7] GenomicRanges_1.12.4 IRanges_1.18.1 BiocGenerics_0.6.0 >>> >>> loaded via a namespace (and not attached): >>> [1] annotate_1.38.0 AnnotationDbi_1.22.5 Biostrings_2.28.0 >>> [4] bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-7 >>> [7] genefilter_1.42.0 grid_3.0.0 locfit_1.5-9.1 >>> [10] RColorBrewer_1.0-5 RCurl_1.95-4.1 Rsamtools_1.12.3 >>> [13] RSQLite_0.11.4 splines_3.0.0 stats4_3.0.0 >>> [16] survival_2.37-4 tools_3.0.0 XML_3.96-1.1 >>> [19] xtable_1.7-1 zlibbioc_1.6.0 >>> >>> >>> -- >>> Sam McInturf >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > > > -- > Sam McInturf > [[alternative HTML version deleted]]
ADD REPLYlink written 4.5 years ago by Michael Lawrence9.8k
Just to advertise a new piece of documentation: The vignette of the "parathyroidSE" data package, written by Mike Love, now describes how to get such count tables within R (i.e., without using our Python scripts), using a method similar to he one discussed in this thread. Simon
ADD REPLYlink written 4.5 years ago by Simon Anders3.4k
Thanks for pointing that out, Simon. It would be great to get such documentation in a more conspicuous place. Michael On Wed, Jun 5, 2013 at 12:00 PM, Simon Anders <anders@embl.de> wrote: > Just to advertise a new piece of documentation: The vignette of the > "parathyroidSE" data package, written by Mike Love, now describes how to > get such count tables within R (i.e., without using our Python scripts), > using a method similar to he one discussed in this thread. > > Simon > > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]
ADD REPLYlink written 4.5 years ago by Michael Lawrence9.8k
Hi Mike (Love), Would you be interested in contributing a disjointExons() to GenomicFeatures? I think this extraction would be useful to many. Maybe we also want a disjointExonsBy() but the only 'by' would be genes ...? Valerie On 06/05/2013 01:37 PM, Michael Lawrence wrote: > Thanks for pointing that out, Simon. It would be great to get such > documentation in a more conspicuous place. > > Michael > > > On Wed, Jun 5, 2013 at 12:00 PM, Simon Anders <anders at="" embl.de=""> wrote: > >> Just to advertise a new piece of documentation: The vignette of the >> "parathyroidSE" data package, written by Mike Love, now describes how to >> get such count tables within R (i.e., without using our Python scripts), >> using a method similar to he one discussed in this thread. >> >> Simon >> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLYlink written 4.5 years ago by Valerie Obenchain ♦♦ 6.4k
hi Valerie, On Tue, Jun 11, 2013 at 6:13 PM, Valerie Obenchain <vobencha at="" fhcrc.org=""> wrote: > Hi Mike (Love), > > Would you be interested in contributing a disjointExons() to > GenomicFeatures? I think this extraction would be useful to many. Maybe we > also want a disjointExonsBy() but the only 'by' would be genes ...? > > Valerie > I'd be happy to contribute any of this code in parathyroidSE. Alejandro Reyes has formalized this part of the code into the following function in DEXSeq >= 1.6.0: > prepareAnnotationForDEXSeq function (transcriptDb, aggregateGenes = FALSE, includeTranscripts = TRUE) { stopifnot(is(transcriptDb, "TranscriptDb")) exonsByGene <- exonsBy(transcriptDb, by = "gene") exonicParts <- disjoin(unlist(exonsByGene)) if (!aggregateGenes) { overlaps <- findOverlaps(exonicParts, exonsByGene) geneNames <- names(exonsByGene)[subjectHits(overlaps)] ..... best, Mike
ADD REPLYlink written 4.5 years ago by Michael Love15k
Great. I'll go ahead and put this in GenomicFeatures with credit to you and Alejandro. Thanks. Valerie On 06/11/2013 09:39 AM, Michael Love wrote: > hi Valerie, > > On Tue, Jun 11, 2013 at 6:13 PM, Valerie Obenchain <vobencha at="" fhcrc.org=""> wrote: >> Hi Mike (Love), >> >> Would you be interested in contributing a disjointExons() to >> GenomicFeatures? I think this extraction would be useful to many. Maybe we >> also want a disjointExonsBy() but the only 'by' would be genes ...? >> >> Valerie >> > > I'd be happy to contribute any of this code in parathyroidSE. > Alejandro Reyes has formalized this part of the code into the > following function in DEXSeq >= 1.6.0: > >> prepareAnnotationForDEXSeq > function (transcriptDb, aggregateGenes = FALSE, includeTranscripts = TRUE) > { > stopifnot(is(transcriptDb, "TranscriptDb")) > exonsByGene <- exonsBy(transcriptDb, by = "gene") > exonicParts <- disjoin(unlist(exonsByGene)) > if (!aggregateGenes) { > overlaps <- findOverlaps(exonicParts, exonsByGene) > geneNames <- names(exonsByGene)[subjectHits(overlaps)] > ..... > > best, > > Mike >
ADD REPLYlink written 4.5 years ago by Valerie Obenchain ♦♦ 6.4k
0
gravatar for Hotz, Hans-Rudolf
4.5 years ago by
Switzerland
Hotz, Hans-Rudolf390 wrote:
Hi Sam I am not really answering your e-mail, but instead just 'abuse' it for a bold commercial: Have you looked at the QuasR package? It might simplify your analysis. Regards, Hans-Rudolf On 06/04/2013 10:07 PM, Sam McInturf wrote: > Hello everyone, > I am having a problem using GenomicRanges to count my mapped reads, > although this is a user lack of understanding, not a bug. After mapping > the reads, there is no entries for the rownames of the differentially > expressed genes. I am doing RNA seq in Arabidopsis (Illumina, 100bp, > single end). I mapped using Tophat2, using the -G option and passed my gtf > file. This file is the TAIR10 release gtf, downloaded from the tophat > annotation download page. I loaded it into R using rtracklayer's import > function. > > By my way of understanding things (which is may be wrong), if you count > reads so each row of the count matrix is an exon, then you can use DexSeq > to calculate differential exon usage. If you want to calculate gene level > differential expression, then you count per transcript, and the use DESeq/2 > to calculate differential expression. > > Because I am interested in gene level differential expression (at this > point), I subsetted my gtf file for the CDS, where CDS is found in the > 'type' column (gtf reproduced below). This becomes the first question, do > I subset for CDS to get gene level differential expression? > > I then loaded all of my tophat bam files in and mapped them using the > summarizeOverlaps() function, changed the colData, and add my gtf file as > rowData. So now, i have a summarized experiment with plenty of information > on each row the of the count matrix, but no rownames. using DESeq2, no > identifiers are displayed with the deferentially expressed genes. but what > do I put in as the rownames? the gene_id, transcript_id, gene_name, and > transcript_name are all non-unique (i presume the rownames must be unique). > What to do? I don't `need` to use a gtf file, just the only place I know > to get features to count on. > > Code: > gff0 <- import(".../Arabidopsis_thaliana.TAIR10.17.gtf", asRangedData = > FALSE) > idx <- mcols(gff0)$type == "CDS" > gffCds <- gff0[idx > ] > fls <- c("sortedBamFiles/sortSUC.bam", "sortedBamFiles/sortTOTAL.bam") > bfl <- BamFileList(fls, index = character()) > olap <- summarizeOverlaps(gffCds, bfl) > colData(olap)$RNA <- factor(c("Suc", "Total"), levels = c("Total", "Suc")) > rowData(olap) <- gffCds > > > Thanks for any help! > > Sam McInturf > > A few objects: >> gff0 > GRanges with 485101 ranges and 12 metadata columns: > seqnames ranges strand | source > type > <rle> <iranges> <rle> | <factor> > <factor> > [1] Mt [ 273, 734] - | protein_coding > exon > [2] Mt [ 276, 734] - | protein_coding > CDS > [3] Mt [ 732, 734] - | protein_coding > start_codon > [4] Mt [ 273, 275] - | protein_coding > stop_codon > [5] Mt [8848, 11415] - | rRNA > exon > ... ... ... ... ... ... > ... > [485097] 1 [30426535, 30426557] - | pseudogene > exon > [485098] 1 [30426341, 30426403] - | pseudogene > exon > [485099] 1 [30426217, 30426268] - | pseudogene > exon > [485100] 1 [30425840, 30425977] - | pseudogene > exon > [485101] 1 [30426839, 30427577] + | pseudogene > exon > score phase gene_id transcript_id exon_number > <numeric> <integer> <character> <character> <numeric> > [1] <na> <na> ATMG00010 ATMG00010.1 1 > [2] <na> 0 ATMG00010 ATMG00010.1 1 > [3] <na> 0 ATMG00010 ATMG00010.1 1 > [4] <na> 0 ATMG00010 ATMG00010.1 1 > [5] <na> <na> ATMG00020 ATMG00020.1 1 > ... ... ... ... ... ... > [485097] <na> <na> AT1G81001 AT1G81001.1 1 > [485098] <na> <na> AT1G81001 AT1G81001.1 2 > [485099] <na> <na> AT1G81001 AT1G81001.1 3 > [485100] <na> <na> AT1G81001 AT1G81001.1 4 > [485101] <na> <na> AT1G81020 AT1G81020.1 1 > gene_name transcript_name seqedit protein_id peptide > <character> <character> <character> <character> <character> > [1] ORF153A ATMG00010.1 false <na> <na> > [2] ORF153A ATMG00010.1 <na> ATMG00010.1 <na> > [3] ORF153A ATMG00010.1 <na> <na> <na> > [4] ORF153A ATMG00010.1 <na> <na> <na> > [5] RRN26 ATMG00020.1 false <na> <na> > ... ... ... ... ... ... > [485097] AT1G81001 AT1G81001.1 false <na> <na> > [485098] AT1G81001 AT1G81001.1 false <na> <na> > [485099] AT1G81001 AT1G81001.1 false <na> <na> > [485100] AT1G81001 AT1G81001.1 false <na> <na> > [485101] AT1G81020 AT1G81020.1 false <na> <na> > --- > seqlengths: > Mt Pt 5 4 3 2 1 > NA NA NA NA NA NA NA > > overlap file >> olap > class: SummarizedExperiment > dim: 197105 2 > exptData(0): > assays(1): counts > rownames: NULL > rowData metadata column names(12): source type ... protein_id peptide > colnames(2): sortedBamFiles/sortSUC.bam sortedBamFiles/sortTOTAL.bam > colData names(2): fileName RNA > >> sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] DESeq2_1.0.11 RcppArmadillo_0.3.820 Rcpp_0.10.3 > [4] lattice_0.20-15 Biobase_2.20.0 rtracklayer_1.20.2 > [7] GenomicRanges_1.12.4 IRanges_1.18.1 BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] annotate_1.38.0 AnnotationDbi_1.22.5 Biostrings_2.28.0 > [4] bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-7 > [7] genefilter_1.42.0 grid_3.0.0 locfit_1.5-9.1 > [10] RColorBrewer_1.0-5 RCurl_1.95-4.1 Rsamtools_1.12.3 > [13] RSQLite_0.11.4 splines_3.0.0 stats4_3.0.0 > [16] survival_2.37-4 tools_3.0.0 XML_3.96-1.1 > [19] xtable_1.7-1 zlibbioc_1.6.0 > >
ADD COMMENTlink written 4.5 years ago by Hotz, Hans-Rudolf390
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 310 users visited in the last hour