Question

Problem with Tximport to DESeq2

zli1 • 0

@cd41d80f

United States

Hi all,

I have raw RNA-seq data from mouse that I would like to analyze with DESeq2. After quantifying with Salmon, I used tximport and DESeqDataSetFromTximport to create the data object, but the count matrix I obtained has only a single row (one value per sample), and that row corresponds to protein-encoding.

I am not sure whether this is because of the reference transcriptome I used for Salmon indexing, which I downloaded from http://ftp.ensembl.org/pub/release-99/fasta/mus_musculus/cdna/ and which should be GRCm38. However, the ID names do not match those in the R package TxDb.Mmusculus.UCSC. The ID names I obtained start with "GENSCAN0000000000", and when I tried to convert them into a tx2gene object, the result had only two columns: one is the ID name and the other is the string 'protein encoding'. Below is the code I used to create the tx2gene object.

gunzip -c Mus.GRCm38.cdna.all.fa.gz | grep '>' | cut -d ' ' -f1,4,7 > temp

paste <(cut -d '>' -f2 temp | cut -d ' ' -f1) <(cut -d ' ' -f2 temp | cut -d ':' -f2) <(cut -d ' ' -f3 temp | cut -d ':' -f2) >> tx2gene.txt

I am wondering whether there is any way to fix the issue or whether there is another fasta file I should use for Salmon index. Any help would be greatly appreciated!
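
For reference, position-based cut commands like the ones above only work when every FASTA header carries the same fields in the same order. As far as I can tell, GENSCAN-prefixed IDs belong to Ensembl's ab initio predictions, whose headers have no gene: field, so field 4 there would be the transcript_biotype tag instead. A minimal sketch that picks out the gene: and gene_symbol: tags by name, assuming Ensembl-style cDNA headers; the function name make_tx2gene is made up for illustration:

```shell
# Build a transcript-to-gene table from Ensembl-style cDNA FASTA headers.
# Reads FASTA text on stdin, writes "transcript<TAB>gene<TAB>symbol" on stdout.
# Tags are matched by name ("gene:", "gene_symbol:") rather than by position,
# so headers without a gene: tag are skipped and counted instead of misparsed.
# NOTE: make_tx2gene is a hypothetical helper, not part of Salmon or tximport.
make_tx2gene() {
  awk '
    /^>/ {
      tx = substr($1, 2)            # drop the leading ">"
      gene = ""; symbol = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^description:/) break          # free text follows; stop here
        if ($i ~ /^gene:/)        gene   = substr($i, 6)
        if ($i ~ /^gene_symbol:/) symbol = substr($i, 13)
      }
      if (gene == "") { skipped++; next }
      print tx "\t" gene "\t" symbol
    }
    END {
      if (skipped) print skipped " header(s) had no gene: tag" > "/dev/stderr"
    }
  '
}
```

It could be run as gunzip -c on the cDNA FASTA piped into make_tx2gene, with the output redirected using > rather than >>, so that a table left over from an earlier run is not appended to. A large skipped count on stderr would suggest that the FASTA is not the annotated cdna.all file.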

TxDb.Mmusculus.UCSC.mm10.ensGene DESeq2 tximport • 1.4k views

zli1, 3.7 years ago

score 0 · Answer 1 · 2021-04-29

Michael Love 43k

@mikelove

United States

It was common for tximport users to have difficulty putting together the correct table for gene summarization, which is why we created tximeta, which does this for you. Magic!

It works for GENCODE, Ensembl and RefSeq for human, mouse and fly. I'd recommend starting there first.

Michael Love, 3.7 years ago

Thank you for the reply! I tried to run se <- tximeta(coldata) but got the following error: Error in nchar(x) : invalid multibyte string, element 4. Is there a way to fix the problem? In addition, I'm new to biostatistics, and I'm wondering whether there is a recommended reference transcriptome for mouse RNA-seq data. Thank you again for your help!

zli1, 3.7 years ago

I prefer GENCODE for mouse and human. They do a good job providing relevant files, documenting versions, and keeping permalinks.

Re: your error, I can't help you much without you showing me what coldata looks like.
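
For context, tximeta expects coldata to be a data frame with at least a files column (paths to the quant.sf files) and a names column (sample names). A small sketch that writes such a table as CSV from the shell, assuming one Salmon output directory per sample under a common parent; the function name make_coldata and the directory layout are assumptions, not details from this thread:

```shell
# Write a two-column CSV (names,files) listing every <dir>/<sample>/quant.sf.
# The sample name is taken from the name of the directory holding quant.sf.
# NOTE: make_coldata is a hypothetical helper; the layout is an assumption.
make_coldata() {
  quant_dir=$1
  echo "names,files"
  for f in "$quant_dir"/*/quant.sf; do
    [ -f "$f" ] || continue               # skip if the glob matched nothing
    sample=$(basename "$(dirname "$f")")
    echo "$sample,$f"
  done
}
```

The output could be redirected to a file such as coldata.csv and read in R with read.csv before being passed to tximeta. Keeping sample names to plain ASCII in that file avoids one possible source of encoding trouble.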

Michael Love, 3.7 years ago

Below is what my coldata looks like. I also attach my session info below.

[Image of coldata not preserved]

R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

zli1, 3.7 years ago

Hmm, my first question is whether those files are readable by tximport. Are they the quant.sf files?
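
Both points can be checked from the shell before going back to R. A rough sketch, assuming the usual tab-separated Salmon quant.sf header (Name, Length, EffectiveLength, TPM, NumReads) and a tab-separated tx2gene table with transcript IDs in the first column; the function name check_quant is hypothetical:

```shell
# Verify that a file looks like a Salmon quant.sf and print how many of its
# transcript names are absent from a tx2gene table (first column = transcript).
# NOTE: check_quant is a hypothetical helper, not a Salmon or tximport command.
check_quant() {
  quant=$1
  tx2gene=$2
  header=$(head -n 1 "$quant")
  expected=$(printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads')
  if [ "$header" != "$expected" ]; then
    echo "unexpected header in $quant" >&2
    return 1
  fi
  awk -F '\t' '
    FILENAME == ARGV[1] { known[$1] = 1; next }   # first file: tx2gene table
    FNR > 1 && !($1 in known) { missing++ }       # second file: quant.sf rows
    END { print missing + 0 }
  ' "$tx2gene" "$quant"
}
```

A result of 0 means every quantified transcript has a gene assignment. A count close to the total number of transcripts points to a mismatch between the index FASTA and the table, for example IDs from a different annotation source.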

Michael Love, 3.7 years ago

They are quant.sf files. In the end, I solved the problem by using transcripts downloaded from GENCODE. Thank you for your help!!

zli1, 3.7 years ago