Search
Question: Can Tximport average transcript length be used for GOSeq?
0
gravatar for casey.rimland
4 months ago by
University of Cambridge, National Institutes of Health, Chapel Hill School of Medicine
casey.rimland80 wrote:

I’m trying to run GOSeq with human RNA Seq data that has been processed with Salmon, summarized to gene level with tximport, and DGE run using DeSeq2. 

I was reading the vignette for GOSeq and got stuck/confused on the part where you need to know what human genome build you used... It says you need to know this to access the correct transcript length in their database.

This may be a stupid question/lack of understanding on my part...But with Salmon/tximport I do not have this information because I did not map to a specific genome, correct? So then could I just use the avgTX length from the tximport function to input to GOSeq to correct for length biases?

Thanks!

 

 

ADD COMMENTlink modified 4 months ago • written 4 months ago by casey.rimland80
1

Thanks for the reply Steve!

So I spent literally all day trying to figure out how to go about doing that, now that I know its okay...and I think maybe I finally figured it out? Getting the data into the right format for GOseq is a bit of a beast for a novice R user!!

Any chance you're feeling super friendly and could take a look at my code?

Also, just in case someone in the future needs to do something like this, maybe the code will help them avoid a day of banging their head on the wall (assuming its correct code :D).

When I run GOseq the pwf plots are looking a bit odd which is the only thing that has me worried... This is link to the pwf plot https://ibb.co/h69iLH

 

#Get transcript lengths from Tximport data
txi.length.df <- as.data.frame(txi[["length"]])

#Subset for only genes input to the DeSeq DGE model
txi.length.df <- subset(txi.length.df, rownames(txi.length.df) %in% rownames(results.df))

#Remove samples not being included in analyses at this time
txi.length.df_sub <- within(txi.length.df, rm("SRR000020643", "SRR000020650", "SRR000020653", "SRR000020660", "SRR000020663", "SRR000020670", "SRR000020680", "SRR000020690", "SRR000026811"))

#Calculate mean tx length
txi.length.df_sub$avgTx <- rowMeans(txi.length.df_sub)

#Load file containing the list of genes DE from DeSeq2 output
DE.genes <- read.delim(file = "/users/rimlandca/desktop/RNA Seq Results/GOseq files/DEGs/Upreg overlap.txt")

#Load file containing GO terms (downloaded GO terms from Ensembl Biomart online for protein coding genes)

go.terms <- read.csv("/users/rimlandca/desktop/RNA Seq Results/GOseq files/Category Mappings/GO term.txt", as.is=TRUE)
head(go.terms)
names(go.terms) <- c("Gene.stable.ID","GO.term.accession")
summary(go.terms)

#Background Genes
expressed.genes <- row.names(txi.length.df_sub)

#Gene Length
gene.lengths <- txi.length.df_sub$avgTx

#Make vector where DEGs = TRUE and non-DEGs = FALSE

DEgenes <- expressed.genes %in% DE.genes$ID
names(DEgenes) <- expressed.genes
head(DEgenes)
sum(DEgenes)

#Run GoSeq!

nullp.result <- nullp(DEgenes = DEgenes, bias.data = gene.lengths)

GO <- goseq(pwf = nullp.result, gene2cat = go.terms)

write.table(GO[GO$over_represented_pvalue < 0.01],row.names=FALSE,file="GO_terms.txt", quote = FALSE,col.names = FALSE)
ADD REPLYlink modified 4 months ago • written 4 months ago by casey.rimland80

There's a lot of code here, but I'm confused. If you need the gene or transcript lengths, it's simply:

lengthData <- rowMeans(txi$length)
lengthData <- lengthData[ x ]

Where 'x' is a vector of the row names in the order you want them (lined up with DEgenes)

ADD REPLYlink written 4 months ago by Michael Love18k

Yea, I am still at the learning R stage where I tend to be very very inefficient with my code :D!

I have edited the above post with updated code and made it much more efficient. In my "banging my head" state yesterday I did a ton of rather useless steps I now realize. 

The pwf plots are still looking a bit off I think, especially the second one? They are for two different DEG lists for contrasts between different biological groups. 

 

https://imgur.com/a/1kyV8

https://ibb.co/iw8Bcx

 

Also, one more question: is it acceptable to use the built in goseq database's GO classification for hg19, even though I technically did not map to hg19 since I used Salmon? For some reason the data I downloaded from Ensembl Biomart for GO classifications seems to be different to what GOseq has and is maybe not as complete... 

GO <- goseq(pwf = nullp.result, "hg19", "ensGene") gives me 22,205 GO classifications.

GO <- goseq(pwf = nullp.result, gene2cat = go.terms) gives me 17,705 GO classifications

p.s. Go Tar Heels! 

ADD REPLYlink modified 4 months ago • written 4 months ago by casey.rimland80

The second plot is definitely not correct. I don't know what these are supposed to look like other than the power should increase with length. I'd make sure that you are providing the correct DEgenes, because it looks like you may have swapped DE and non-DE status.

For the second part, what matters is that the gene IDs that you used after quantifying and summarizing to gene-level match up with the gene IDs in GO. I'm not sure exactly about the internal mechanics of goseq().

ADD REPLYlink modified 3 months ago • written 3 months ago by Michael Love18k

I am pretty confident I included correctly the DE genes n the second plot as the number of "True"/"1" is equal to the number of differentially expressed genes in the list I imported... To get this list, I have been taking an overlap of genes significantly up-regulated in Tissue C when compared to either Tissue A or B.

The second plot was definitely odd though and I can't figure out why... It seems like a similar problem to what was posted here: Question: GoSeq: opposite pwf plot for up- and downregulated genes. But there wasn't really a clear resolution or reason for why the plots were coming out reversed when looking at subsets of genes that either up or down-regulated... 

My first pwf plot which looks normal-ish is for the overlap of genes upregulated in Tissue A when compared to Tissue B and Tissue C. Similar thing holds true for the Tissue B upregulated genes. Tissue C is where it gets odd...

Tissue A upregulated (https://ibb.co/fD4wUc)
Tissue B upregulated (https://ibb.co/dKGh2x)
Tissue C upregulated (https://ibb.co/j7FLpc)

So something weird is happening with the list for Tissue C it seems?... Is it biologically possible that perhaps the genes DE in Tissue C are just shorter?

ADD REPLYlink modified 3 months ago • written 3 months ago by casey.rimland80
0
gravatar for Steve Lianoglou
4 months ago by
Denali
Steve Lianoglou12k wrote:

Yes it can, and it's arguably a better value to use than what is in the database anyway.

ADD COMMENTlink written 4 months ago by Steve Lianoglou12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 257 users visited in the last hour