I have some raw RNA-seq data from mouse that I would like to analyze with DESeq2. After quantifying with Salmon, I used
DESeqDataSetFromTximport to create the data object, but the resulting count matrix has only a single row (one value per sample), corresponding to protein-encoding.
I am not sure whether this is caused by the reference transcriptome I used for Salmon indexing, which I downloaded from http://ftp.ensembl.org/pub/release-99/fasta/mus_musculus/cdna/ and which should be GRCm38. However, the IDs do not match those in the R package
TxDb.Mmusculus.UCSC. The IDs I obtained start with "GENSCAN0000000000", and when I tried to build a tx2gene object from them, the result had only two columns: the ID and the string 'protein encoding'. Below is the code I used to create the tx2gene object:
gunzip -c Mus.GRCm38.cdna.all.fa.gz | grep '>' | cut -d ' ' -f1,4,7 > temp
paste <(cut -d '>' -f2 temp | cut -d ' ' -f1) <(cut -d ' ' -f2 temp | cut -d ':' -f2) <(cut -d ' ' -f3 temp | cut -d ':' -f2) >> tx2gene.txt
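For comparison, a variant of the extraction above that looks fields up by key instead of by position might look like the sketch below. It is only an illustration under stated assumptions: make_tx2gene is a hypothetical helper name, the two sample headers (including their coordinates) are made up to mimic the header layout, and it assumes that the gene ID appears as a "gene:" token somewhere after the transcript ID. Headers without such a token are reported on stderr rather than silently filled with whatever sits in field 4.

```shell
# Sketch: build a two-column (transcript, gene) table from FASTA headers.
# Assumption: the gene ID is given as a "gene:<id>" token in the header.
make_tx2gene() {
  awk '
    /^>/ {
      tx = substr($1, 2)            # transcript ID without the leading ">"
      gene = ""
      for (i = 2; i <= NF; i++)     # find the gene by key, not by position
        if ($i ~ /^gene:/) { gene = substr($i, 6); break }
      if (gene == "") {             # no gene token: report and skip the entry
        print "no gene: token for " tx > "/dev/stderr"
        next
      }
      print tx "\t" gene
    }'
}

# Demo on two made-up headers (illustrative only, not taken from a real file).
printf '%s\n' \
  '>ENSMUST00000000001.4 cdna chromosome:GRCm38:3:1:100:-1 gene:ENSMUSG00000000001.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:Gnai3' \
  'ACGT' \
  '>GENSCAN00000000001 cdna chromosome:GRCm38:5:1:100:1 transcript_biotype:protein_coding' \
  | make_tx2gene

# On the real file, the call would be along the lines of:
# gunzip -c Mus.GRCm38.cdna.all.fa.gz | make_tx2gene > tx2gene.txt
```

Writing with > instead of >> also avoids accumulating duplicate rows in tx2gene.txt when the command is rerun.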
I am wondering whether there is a way to fix this issue, or whether there is another FASTA file I should use for the Salmon index. Any help would be greatly appreciated!