Entering edit mode
Hello everyone,
I am having a problem using GenomicRanges to count my mapped reads,
although this is a user lack of understanding, not a bug. After
mapping
the reads, there is no entries for the rownames of the differentially
expressed genes. I am doing RNA seq in Arabidopsis (Illumina, 100bp,
single end). I mapped using Tophat2, using the -G option and passed
my gtf
file. This file is the TAIR10 release gtf, downloaded from the tophat
annotation download page. I loaded it into R using rtracklayer's
import
function.
By my way of understanding things (which is may be wrong), if you
count
reads so each row of the count matrix is an exon, then you can use
DexSeq
to calculate differential exon usage. If you want to calculate gene
level
differential expression, then you count per transcript, and the use
DESeq/2
to calculate differential expression.
Because I am interested in gene level differential expression (at this
point), I subsetted my gtf file for the CDS, where CDS is found in the
'type' column (gtf reproduced below). This becomes the first
question, do
I subset for CDS to get gene level differential expression?
I then loaded all of my tophat bam files in and mapped them using the
summarizeOverlaps() function, changed the colData, and add my gtf file
as
rowData. So now, i have a summarized experiment with plenty of
information
on each row the of the count matrix, but no rownames. using DESeq2,
no
identifiers are displayed with the deferentially expressed genes. but
what
do I put in as the rownames? the gene_id, transcript_id, gene_name,
and
transcript_name are all non-unique (i presume the rownames must be
unique).
What to do? I don't `need` to use a gtf file, just the only place I
know
to get features to count on.
Code:
gff0 <- import(".../Arabidopsis_thaliana.TAIR10.17.gtf", asRangedData
=
FALSE)
idx <- mcols(gff0)$type == "CDS"
gffCds <- gff0[idx
]
fls <- c("sortedBamFiles/sortSUC.bam", "sortedBamFiles/sortTOTAL.bam")
bfl <- BamFileList(fls, index = character())
olap <- summarizeOverlaps(gffCds, bfl)
colData(olap)$RNA <- factor(c("Suc", "Total"), levels = c("Total",
"Suc"))
rowData(olap) <- gffCds
Thanks for any help!
Sam McInturf
A few objects:
> gff0
GRanges with 485101 ranges and 12 metadata columns:
seqnames ranges strand | source
type
<rle> <iranges> <rle> | <factor>
<factor>
[1] Mt [ 273, 734] - | protein_coding
exon
[2] Mt [ 276, 734] - | protein_coding
CDS
[3] Mt [ 732, 734] - | protein_coding
start_codon
[4] Mt [ 273, 275] - | protein_coding
stop_codon
[5] Mt [8848, 11415] - | rRNA
exon
... ... ... ... ... ...
...
[485097] 1 [30426535, 30426557] - | pseudogene
exon
[485098] 1 [30426341, 30426403] - | pseudogene
exon
[485099] 1 [30426217, 30426268] - | pseudogene
exon
[485100] 1 [30425840, 30425977] - | pseudogene
exon
[485101] 1 [30426839, 30427577] + | pseudogene
exon
score phase gene_id transcript_id exon_number
<numeric> <integer> <character> <character> <numeric>
[1] <na> <na> ATMG00010 ATMG00010.1 1
[2] <na> 0 ATMG00010 ATMG00010.1 1
[3] <na> 0 ATMG00010 ATMG00010.1 1
[4] <na> 0 ATMG00010 ATMG00010.1 1
[5] <na> <na> ATMG00020 ATMG00020.1 1
... ... ... ... ... ...
[485097] <na> <na> AT1G81001 AT1G81001.1 1
[485098] <na> <na> AT1G81001 AT1G81001.1 2
[485099] <na> <na> AT1G81001 AT1G81001.1 3
[485100] <na> <na> AT1G81001 AT1G81001.1 4
[485101] <na> <na> AT1G81020 AT1G81020.1 1
gene_name transcript_name seqedit protein_id
peptide
<character> <character> <character> <character>
<character>
[1] ORF153A ATMG00010.1 false <na>
<na>
[2] ORF153A ATMG00010.1 <na> ATMG00010.1
<na>
[3] ORF153A ATMG00010.1 <na> <na>
<na>
[4] ORF153A ATMG00010.1 <na> <na>
<na>
[5] RRN26 ATMG00020.1 false <na>
<na>
... ... ... ... ...
...
[485097] AT1G81001 AT1G81001.1 false <na>
<na>
[485098] AT1G81001 AT1G81001.1 false <na>
<na>
[485099] AT1G81001 AT1G81001.1 false <na>
<na>
[485100] AT1G81001 AT1G81001.1 false <na>
<na>
[485101] AT1G81020 AT1G81020.1 false <na>
<na>
---
seqlengths:
Mt Pt 5 4 3 2 1
NA NA NA NA NA NA NA
overlap file
> olap
class: SummarizedExperiment
dim: 197105 2
exptData(0):
assays(1): counts
rownames: NULL
rowData metadata column names(12): source type ... protein_id peptide
colnames(2): sortedBamFiles/sortSUC.bam sortedBamFiles/sortTOTAL.bam
colData names(2): fileName RNA
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets
methods
[8] base
other attached packages:
[1] DESeq2_1.0.11 RcppArmadillo_0.3.820 Rcpp_0.10.3
[4] lattice_0.20-15 Biobase_2.20.0 rtracklayer_1.20.2
[7] GenomicRanges_1.12.4 IRanges_1.18.1 BiocGenerics_0.6.0
loaded via a namespace (and not attached):
[1] annotate_1.38.0 AnnotationDbi_1.22.5 Biostrings_2.28.0
[4] bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-7
[7] genefilter_1.42.0 grid_3.0.0 locfit_1.5-9.1
[10] RColorBrewer_1.0-5 RCurl_1.95-4.1 Rsamtools_1.12.3
[13] RSQLite_0.11.4 splines_3.0.0 stats4_3.0.0
[16] survival_2.37-4 tools_3.0.0 XML_3.96-1.1
[19] xtable_1.7-1 zlibbioc_1.6.0
--
Sam McInturf
[[alternative HTML version deleted]]