Hi all,
I have some raw RNA-seq data from mouse that I would like to analyze with DESeq2. After quantifying with Salmon, I used tximport and DESeqDataSetFromTximport to create the data object, but the count matrix I obtained has only one row, with a value per sample, and that row corresponds to protein-encoding.
I am not sure whether this is because of the reference transcriptome I used for Salmon indexing, which I downloaded from http://ftp.ensembl.org/pub/release-99/fasta/mus_musculus/cdna/ and which should be GRCm38. However, the ID names do not match those in the R package TxDb.Mmusculus.UCSC. The ID names I obtained start with "GENSCAN0000000000", and when I tried to convert them to a tx2gene object, the result had only two columns: one is the ID name and the other is the string 'protein encoding'. Below is the code I used to create the tx2gene object.
gunzip -c Mus.GRCm38.cdna.all.fa.gz | grep '>' | cut -d ' ' -f1,4,7 > temp
paste <(cut -d '>' -f2 temp | cut -d ' ' -f1) <(cut -d ' ' -f2 temp | cut -d ':' -f2) <(cut -d ' ' -f3 temp | cut -d ':' -f2) >> tx2gene.txt
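For comparison, here is a minimal sketch of an alternative that looks up the gene ID by its "gene:" key instead of by field position. It assumes the headers follow the usual Ensembl cDNA layout (">TRANSCRIPT_ID cdna chromosome:... gene:GENE_ID gene_biotype:..."); the function name make_tx2gene is made up for illustration. Headers with no "gene:" field are skipped rather than shifted into the wrong column, which may be what happens with positional cut when a header has fewer fields than expected.

```shell
# Sketch: build a two-column (transcript, gene) table from FASTA headers.
# Assumption: each header line looks roughly like
#   >TRANSCRIPT_ID cdna chromosome:... gene:GENE_ID gene_biotype:... ...
# make_tx2gene is a hypothetical helper name, not part of any package.
make_tx2gene() {
  awk '
    /^>/ {
      tx = substr($1, 2)                 # transcript ID without the leading ">"
      gene = ""
      for (i = 2; i <= NF; i++) {
        # match the "gene:" key exactly; "gene_biotype:" does not match
        if ($i ~ /^gene:/) gene = substr($i, 6)
      }
      if (gene != "") print tx "\t" gene # skip headers that carry no gene ID
    }
  '
}

# Usage (file name taken from the command above):
# gunzip -c Mus.GRCm38.cdna.all.fa.gz | make_tx2gene > tx2gene.txt
```

If the resulting file is empty, that would suggest the FASTA headers carry no gene IDs at all, in which case a different transcript file is needed rather than a different parsing command.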
I am wondering whether there is any way to fix the issue, or whether there is another FASTA file I should use for the Salmon index. Any help would be greatly appreciated!
Thank you for the reply! I tried to run
se <- tximeta(coldata)
but encountered the following error:
Error in nchar(x) : invalid multibyte string, element 4
Is there a way to fix the problem? In addition, I'm new to biostatistics, and I'm wondering whether there is a recommended reference transcriptome for mouse RNA-seq data. Thank you again for your help!
I prefer GENCODE for mouse and human. They do a good job providing relevant files, documenting versions, and keeping permalinks.
Re: your error, I can't help you much without you showing me what coldata looks like.
Below is what my coldata looks like. I also attach my session info below.
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
Hmm, my first question is whether those files are readable by tximport. Are those the quant.sf files?
They are quant.sf files. In the end, I solved the problem by using transcripts downloaded from GENCODE. Thank you for your help!!