I'm working through my first batch of RNA-Seq analysis and unfortunately I don't have an experienced bioinformatician to work with. My question is regarding tximport of my quant.sf files into R. I have been working with the EquCab3.0 reference transcriptome from NCBI to generate these quant.sf files, but I have encountered a problem when importing into R.
Although the NCBI .fasta file has the NM_transcript identifiers, they don't appear to have unique delimiters for the corresponding gene ID's that I need to create the 2 column data.frame. Therefore, I ran the following from the EquCab3.0 gtf file:
gtf <- rtracklayer::import("GCF_002863925.1_EquCab3.0_genomic.gtf")
Then to generate the mapping table I ran
tx2gene = unique(data.frame(gtf[gtf$type=="exon"])[,c("transcript_id", "gene_id")])
transcript_id gene_id 1 XM_005602135.3 SYCE1 14 XM_005602137.3 SYCE1 26 XM_023636208.1 LOC111772506 33 XM_023636216.1 LOC111772506 39 XM_023636230.1 LOC111772506 45 XM_023636202.1 LOC111772506
For reference, each of my 10 quant.sf files look like this:
Name Length EffectiveLength TPM NumReads NM_001081757.1 5648 5649.138 5.559883 1566.000 NM_001081758.1 5984 5841.368 0.028292 8.240 NM_001081759.1 5982 5654.000 0.000000 0.000 NM_001081760.1 5980 5652.000 0.000000 0.000 NM_001081761.1 5505 5177.000 0.000000 0.000 NM_001081762.1 4872 4553.211 0.010932 2.482 NM_001081763.1 4629 4471.577 13.722008 3059.301 NM_001081764.1 4553 4225.000 0.000000 0.000 NM_001081765.1 4125 3676.439 29.286268 5368.281
For the final code I ran the 'ignoreTxVersion = T' which gives an error message
count_data = tximport(files = sample_files, type = "salmon", tx2gene = tx2gene, ignoreTxVersion = T) reading in files with read_tsv 1 2 3 4 5 6 7 8 9 10 Error in .local(object, ...) : None of the transcripts in the quantification files are present in the first column of tx2gene. Check to see that you are using the same annotation for both. Example IDs (file): [NM_001081757, NM_001081758, NM_001081759, ...] Example IDs (tx2gene): [XM_005602135.3, XM_005602137.3, XM_023636208.1, ...] This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'.
However, when I set 'ignoreTxVersion = F', it seems to work
reading in files with read_tsv 1 2 3 4 5 6 7 8 9 10 summarizing abundance summarizing counts summarizing length
Is there something obvious I have missed here, or am I ok to move forward with the analysis? From what I've read, it's better to set this option to 'TRUE' for downstream analysis with DeSeq. Any advice is greatly appreciated.