Question: How to annotate DESeq2 object with gene size to get FPKMs
0
gravatar for ashley.doane
4.3 years ago by
ashley.doane20
United States
ashley.doane20 wrote:

I'd like to annotate a DESeqDataSet with information on gene size, specifically populating mcols(dds)$basepairs, where dds is my DESeqDataSet.  The DESeqDataSet was created from HTSeq output, using the DESeqDataSetFromHTSeqCount function.  

 

 

 

 

The reason for this is to pull out FPKMs, and to have genomic intervals for other downstream analysis.  The HTSeq counts are by gene symbol.

Genome in this case is hg19, and I have TxDb.Hsapiens.UCSC.hg19.knownGene and Org.Hs.eg.db data, but not sure which slots to match.

Any suggestions would be really appreciated.  thanks! 

Ashley

 

 

deseq2 fpkm • 2.3k views
ADD COMMENTlink modified 4.3 years ago by Michael Love26k • written 4.3 years ago by ashley.doane20
Answer: How to annotate DESeq2 object with gene size to get FPKMs
0
gravatar for Michael Love
4.3 years ago by
Michael Love26k
United States
Michael Love26k wrote:

You can use makeTxDbFromGFF from the GenomicFeatures package to build a TxDb object, and then exonsBy with by="gene" to obtain a GRangesList of the exons for each gene.

sum(width(reduce(x)))

for x the GRangesList will give you the exonic basepairs for each gene. You'll have to just arrange this vector in the same order as the dds.

See an example of makeTxDbFromGFF and exonsBy here:

http://master.bioconductor.org/help/workflows/rnaseqGene/#count

ADD COMMENTlink modified 4.3 years ago • written 4.3 years ago by Michael Love26k

I'm not sure if this is more correct, but what I do (in well-annotated genomes) is compute the length of every transcript isoform and then take the length of the longest isoform for each gene. This would be less than the total width of all exons in the gene in cases where some exons are mutually exclusive.

ADD REPLYlink written 4.3 years ago by Ryan C. Thompson7.4k

Why not just use the exact "effective length" that was used during the counting step? And by "effective length" I mean to just sum up all of the basepairs that are considered to "hit" the "parent gene" when creating your count matrix.

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by Steve Lianoglou12k

Because there may not be any transcript that includes all exons. Imagine there are two isoforms of a gene, each including one of two mutually exclusive exons, and both expressed equally. (And assume the the two isoforms are roughly equal in length.)  Then the coverage depth for the two mutually exclusive exons will be half of the others, and you would underestimate the overall FPKM for the gene if you divided by the total length of all exons. So dividing by the length of a known isoform seems better to me, as long as the genome is well-annotated.

ADD REPLYlink written 4.3 years ago by Ryan C. Thompson7.4k

It's up to the user. The code above is just a fast way to get an estimate of the size of the feature used for counting (also it's not taking into consideration the subtraction of basepairs which overlap another feature, and therefore do not contribute to the uniquely assignable count).

It's good to remember that most genes with multiple transcripts share the large majority of their basepairs but I see your point about the longest isoform.

For more accurate FPKM I would use an estimator like RSEM.

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by Michael Love26k

Hi Michael,

How do I arrange my list of basepairs in the same order as the dds? mcols(dds) does not have a column for gene IDs so I don't know where to start.

Thank you!

Josh

EDIT: looks like it's just in alphabetical order?

ADD REPLYlink modified 16 months ago • written 16 months ago by Josh Wang10
1

Try mcols(dds, use.names=TRUE)

ADD REPLYlink written 16 months ago by Michael Love26k
1

EDIT: I reran the DESeq and the error message below disappeared. I don't know how. I am now able to get the fpkm values. Thank you for the help!!

 

Thanks for the reply!

I ordered my gene length data frame like this:

gene_lengths<-gene_lengths[match(rownames(mcols(dds, use.names=TRUE)), gene_lengths$Geneid),]

It ordered the rows of my gene length data frame as the mcols(dds).

and then when I tried to add a basepairs column to dds like this:

mcols(dds)$basepairs <- gene_lengths[, "Length"]

It gave an error message:

Error in V_recycle(value, x, x_what = "value", skeleton_what = "x") : 
  'NROW(value)' is greater than 'length(x)'

What caused this error? My gene length data frame and the mcols(dds) have the same numbers of rows.

ADD REPLYlink modified 16 months ago • written 16 months ago by Josh Wang10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 345 users visited in the last hour