How to annotate DESeq2 object with gene size to get FPKMs
1
0
Entering edit mode
ashley.doane ▴ 20
@ashleydoane-8524
Last seen 5.5 years ago
United States

I'd like to annotate a DESeqDataSet with information on gene size, specifically populating mcols(dds)$basepairs, where dds is my DESeqDataSet.  The DESeqDataSet was created from HTSeq output, using the DESeqDataSetFromHTSeqCount function.  

 

 

 

 

The reason for this is to pull out FPKMs, and to have genomic intervals for other downstream analysis.  The HTSeq counts are by gene symbol.

Genome in this case is hg19, and I have TxDb.Hsapiens.UCSC.hg19.knownGene and Org.Hs.eg.db data, but not sure which slots to match.

Any suggestions would be really appreciated.  thanks! 

Ashley

 

 

deseq2 fpkm • 4.2k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 8 hours ago
United States

You can use makeTxDbFromGFF from the GenomicFeatures package to build a TxDb object, and then exonsBy with by="gene" to obtain a GRangesList of the exons for each gene.

sum(width(reduce(x)))

for x the GRangesList will give you the exonic basepairs for each gene. You'll have to just arrange this vector in the same order as the dds.

See an example of makeTxDbFromGFF and exonsBy here:

http://master.bioconductor.org/help/workflows/rnaseqGene/#count

ADD COMMENT
0
Entering edit mode

I'm not sure if this is more correct, but what I do (in well-annotated genomes) is compute the length of every transcript isoform and then take the length of the longest isoform for each gene. This would be less than the total width of all exons in the gene in cases where some exons are mutually exclusive.

ADD REPLY
0
Entering edit mode

Why not just use the exact "effective length" that was used during the counting step? And by "effective length" I mean to just sum up all of the basepairs that are considered to "hit" the "parent gene" when creating your count matrix.

ADD REPLY
0
Entering edit mode

Because there may not be any transcript that includes all exons. Imagine there are two isoforms of a gene, each including one of two mutually exclusive exons, and both expressed equally. (And assume the the two isoforms are roughly equal in length.)  Then the coverage depth for the two mutually exclusive exons will be half of the others, and you would underestimate the overall FPKM for the gene if you divided by the total length of all exons. So dividing by the length of a known isoform seems better to me, as long as the genome is well-annotated.

ADD REPLY
0
Entering edit mode

It's up to the user. The code above is just a fast way to get an estimate of the size of the feature used for counting (also it's not taking into consideration the subtraction of basepairs which overlap another feature, and therefore do not contribute to the uniquely assignable count).

It's good to remember that most genes with multiple transcripts share the large majority of their basepairs but I see your point about the longest isoform.

For more accurate FPKM I would use an estimator like RSEM.

ADD REPLY
0
Entering edit mode

Hi Michael,

How do I arrange my list of basepairs in the same order as the dds? mcols(dds) does not have a column for gene IDs so I don't know where to start.

Thank you!

Josh

EDIT: looks like it's just in alphabetical order?

ADD REPLY
1
Entering edit mode

Try mcols(dds, use.names=TRUE)

ADD REPLY
1
Entering edit mode

EDIT: I reran the DESeq and the error message below disappeared. I don't know how. I am now able to get the fpkm values. Thank you for the help!!

 

Thanks for the reply!

I ordered my gene length data frame like this:

gene_lengths<-gene_lengths[match(rownames(mcols(dds, use.names=TRUE)), gene_lengths$Geneid),]

It ordered the rows of my gene length data frame as the mcols(dds).

and then when I tried to add a basepairs column to dds like this:

mcols(dds)$basepairs <- gene_lengths[, "Length"]

It gave an error message:

Error in V_recycle(value, x, x_what = "value", skeleton_what = "x") : 
  'NROW(value)' is greater than 'length(x)'

What caused this error? My gene length data frame and the mcols(dds) have the same numbers of rows.

ADD REPLY

Login before adding your answer.

Traffic: 560 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6