hi Hadi,
The ignoreTxVersion is designed for let's call case #1: the transcripts used in quantification have names like ENST00000335137.3, but the tx2gene data.frame has names like ENST00000335137. In your case, the transcript names are more complicated, but at this point I don't want to add any code to tximport beyond what we have supporting case #1.
What does your tx2gene look like? If the first column contained names like ENST00000335137 I think you would be fine, because the standard ignoreTxVersion will split at the first "."
Regarding the readr and first transcript issue, I cannot replicate, if I use the vignette code, and then instead ask txOut=TRUE I get:
> txi.salmon <- tximport(files, type = "salmon", txOut=TRUE, reader = read_tsv)
reading in files
1 2 3 4 5 6
> head(txi.salmon$abundance,5)
sample1 sample2 sample3 sample4 sample5 sample6
NR_001526 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
NR_001526_1 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
NR_001526_2 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
NM_130786 3.99278 9.70125 2.83211 3.80392 2.78513 3.07926
NR_015380 2.80094 3.87487 2.79935 4.48677 3.74006 3.84378
And if you look in the quant.sf file for the first sample (ERR188297):
ERR188297> head -6 quant.sf
Name Length EffectiveLength TPM NumReads
NR_001526 164 164 0 0
NR_001526_1 164 164 0 0
NR_001526_2 164 164 0 0
NM_130786 1764 1953.76 3.99278 109.232
NR_015380 2129 2141 2.80094 83.9697
Can you inspect what your quant.sf files look like? Did you manipulate them after running Salmon?
Dear Micheal,
previously transcript names had the dots but I removed it now, and still it gives empty vector.(I use it without txOut=TRUE for gene level).
mcols(tx)[,"transcript_id"] <- sapply(strsplit(mcols(tx)[,"transcript_id"],"\."),.subset,1)
txi.salmon <- tximport(sample_files, type = "salmon", tx2gene = as.data.frame(mcols(tx)), reader = read_tsv)
head(txi.salmon)
I can't help much because I don't know what your tx2gene data.frame looks like.
It also looks like you didn't use ignoreTxVersion=TRUE in the code above.
Also, regarding your comment about the first transcript not appearing: Can you inspect what your quant.sf files look like? Did you manipulate them after running Salmon?
Thank you so much, it works now with that option. For the second problem:
sf1<- read.delim(sample_files[1], comment ="#")
dim(sf1)
[1] 119206 4
head(rownames(sf1))
colnames(sf1)
[1] "ENST00000335137.3.ENSG00000186092.4.OTTHUMG00000001094.1.OTTHUMT00000003223.1.OR4F5.001.OR4F5.918.CDS.1.918."
[2] "X918"
[3] "X0"
[4] "X0.1"
sf2<- read.table(sample_files[1], row.names = 1)
dim(sf2)
[1] 119207 4
head(rownames(sf2))
[1] "ENST00000335137.3|ENSG00000186092.4|OTTHUMG00000001094.1|OTTHUMT00000003223.1|OR4F5-001|OR4F5|918|CDS:1-918|"
[2] "ENST00000423372.3|ENSG00000237683.5|-|-|AL627309.1-201|AL627309.1|2661|UTR5:1-70|CDS:71-850|UTR3:851-2661|"
[3] "ENST00000426406.1|ENSG00000235249.1|OTTHUMG00000002860.1|OTTHUMT00000007999.1|OR4F29-001|OR4F29|995|UTR5:1-19|CDS:20-958|UTR3:959-995|"
[4] "ENST00000332831.2|ENSG00000185097.2|OTTHUMG00000002581.1|OTTHUMT00000007334.1|OR4F16-001|OR4F16|995|UTR5:1-19|CDS:20-958|UTR3:959-995|"
[5] "ENST00000599533.1|ENSG00000269831.1|-|-|AL669831.1-201|AL669831.1|129|CDS:1-129|"
[6] "ENST00000594233.1|ENSG00000269308.1|-|-|AL645608.2-201|AL645608.2|57|CDS:1-57|"
sf3<- read_tsv(sample_files[1], comment ="#")
dim(sf3)
[1] 119206 4
colnames(sf3)
[1] "ENST00000335137.3|ENSG00000186092.4|OTTHUMG00000001094.1|OTTHUMT00000003223.1|OR4F5-001|OR4F5|918|CDS:1-918|"
[2] "918"
[3] "0"
[4] "0"
sf4 <- read_tsv(sample_files[1], col_names = FALSE, comment='#')
dim(sf4)
[1] 119207 4
dim(txi.salmon$counts)
[1] 119206 13
I think read_tsv and read.delim consider the first transcript as colnames.
the quant.sf file starts with some information as comments and then the transcripts.(name, length, tpm, number of reads)
# salmon (mapping-based) v0.5.1
# [ program ] => salmon
# [ command ] => quant
# [ index ] => { /tcp_idx }
# [ libType ] => { SR }
# [ unmatedReads ] => { /1.fastq }
# [ threads ] => { 5 }
# [ output ] => { ZR1 }
# [ mapping rate ] => { 49.3879% }
# Name Length TPM NumReads
ENST00000335137.3|ENSG00000186092.4|OTTHUMG00000001094.1|OTTHUMT00000003223.1|OR4F5-001|OR4F5|918|CDS:1-918| 918 0 0
ENST00000423372.3|ENSG00000237683.5|-|-|AL627309.1-201|AL627309.1|2661|UTR5:1-70|CDS:71-850|UTR3:851-2661| 2661 1.18649 142.613
ENST00000426406.1|ENSG00000235249.1|OTTHUMG00000002860.1|OTTHUMT00000007999.1|OR4F29-001|OR4F29|995|UTR5:1-19|CDS:20-958|UTR3:959-995| 995 0.00644092 0.25
ENST00000332831.2|ENSG00000185097.2|OTTHUMG00000002581.1|OTTHUMT00000007334.1|OR4F16-001|OR4F16|995|UTR5:1-19|CDS:20-958|UTR3:959-995| 995 0.00644092 0.25
...
Note that the first tx starts with "ENST00000335137.3..."
Isn't this one in your txi matrices?
sorry, I don't follow. Can you confirm that there is a missing transcript that is not in txi.salmon$counts?
What do you get on the command line with:
or within R:
No, the first transcript does not appear
head -3 quant.sf
# salmon (mapping-based) v0.5.1
# [ program ] => salmon
# [ command ] => quant
codes: (in the name of transcripts I omit the rest for readability but it starts from the second transcript)
txi.salmon <- tximport(sample_files, type = "salmon", txOut=TRUE, reader = read_tsv)
head(rownames(txi.salmon$counts))
[1] "ENST00000423372.3|ENSG00000237683.5|
[2] "ENST00000426406.1|ENSG00000
[3] "ENST00000332831.2|ENSG0
Ok. good catch. and thanks for bearing with me while I checked if it was missing!
The issue is that you've got an older version of Salmon, which I had written some code to try to deal with, but there is indeed a bug in reading the first transcript. I've attempted to fix this in the devel branch. Could you test the new tximport package for me? There are no dependencies of tximport, so I can give you some lines of code to test the new package, and if this works, I'll port the change over to release branch. (Note to any other readers, this is not a safe way to test out other Bioconductor packages in the devel branch, you will enter a bad, bad place with regard to dependencies.)
Download:
https://github.com/Bioconductor-mirror/tximport/archive/master.zip
Then unzip and rename the dir from 'tximport-master' to 'tximport'
Then on the command line:
yes it works fine, it gives an error for read_tsv ( Error in reader(x, comment = "#", header = FALSE) : unused argument (header = FALSE) )but it reads all the transcripts. I learned a lot form our conversation:))
hi Hadi,
thanks! if you don't mind could you just run it again, I've updated to try to silence that error
https://github.com/Bioconductor-mirror/tximport/archive/master.zip
Hey Michael,
it is a pleasure,
Yes, perfect, it does not give error anymore;)
I pushed this fix to release (1.0.3) and devel (1.1.4) now. You should be able to get these via biocLite() in 1-2 days.
Thank you so much Michael for the nice conversation :))