running tximport for salmon quants
1
0
Entering edit mode
@hadi-gharibi-10795
Last seen 8.6 years ago
Zurich, Switzerland

Dear All,

Using tximport for salmon, I noticed that when the name of transcripts are like this:
ENST00000335137.3|ENSG00000186092.4|OTTHUMG00000001094.1|OTTHUMT00000003223.1|OR4F5-001|OR4F5|918|CDS:1-918|
It cannot read and returns an empty vector but does not give an error!!. I replaced "." with "|" and it worked fine for me on line 296:

if (ignoreTxVersion) {
      txId <- sapply(strsplit(as.character(txId), "\."), "[[", 1)
    }


 txId <- sapply(strsplit(as.character(txId), "\|"), "[[", 1)


Using "readr" package, I saw it skips the first transcript ( I guess it assumes the first line as column names) maybe this solve it:
read_tsv(x, col_names = FALSE, comment='#')


Best Regards,
Hadi Gharibi

tximport readr • 3.2k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 4 hours ago
United States

hi Hadi,

The ignoreTxVersion is designed for let's call case #1: the transcripts used in quantification have names like ENST00000335137.3, but the tx2gene data.frame has names like ENST00000335137. In your case, the transcript names are more complicated, but at this point I don't want to add any code to tximport beyond what we have supporting case #1. 

What does your tx2gene look like? If the first column contained names like ENST00000335137 I think you would be fine, because the standard ignoreTxVersion will split at the first "."

Regarding the readr and first transcript issue, I cannot replicate, if I use the vignette code, and then instead ask txOut=TRUE I get:

> txi.salmon <- tximport(files, type = "salmon", txOut=TRUE, reader = read_tsv)
reading in files
1 2 3 4 5 6
> head(txi.salmon$abundance,5)
            sample1 sample2 sample3 sample4 sample5 sample6
NR_001526   0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
NR_001526_1 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
NR_001526_2 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
NM_130786   3.99278 9.70125 2.83211 3.80392 2.78513 3.07926
NR_015380   2.80094 3.87487 2.79935 4.48677 3.74006 3.84378

And if you look in the quant.sf file for the first sample (ERR188297):

ERR188297> head -6 quant.sf
Name           Length  EffectiveLength    TPM        NumReads
NR_001526      164     164                0          0
NR_001526_1    164     164                0          0
NR_001526_2    164     164                0          0
NM_130786      1764    1953.76            3.99278    109.232
NR_015380      2129    2141               2.80094    83.9697

Can you inspect what your quant.sf files look like? Did you manipulate them after running Salmon?

ADD COMMENT
0
Entering edit mode

Dear Micheal,

previously transcript names had the dots but I removed it now, and still it gives empty vector.(I use it without txOut=TRUE for gene level).

mcols(tx)[,"transcript_id"] <- sapply(strsplit(mcols(tx)[,"transcript_id"],"\."),.subset,1) 
txi.salmon <- tximport(sample_files, type = "salmon", tx2gene = as.data.frame(mcols(tx)), reader = read_tsv)
head(txi.salmon) 

 

 

 

 

ADD REPLY
1
Entering edit mode

I can't help much because I don't know what your tx2gene data.frame looks like. 

It also looks like you didn't use ignoreTxVersion=TRUE in the code above.

Also, regarding your comment about the first transcript not appearing: Can you inspect what your quant.sf files look like? Did you manipulate them after running Salmon?

ADD REPLY
0
Entering edit mode

 

Thank you so much, it works now with that option. For the second problem:

sf1<- read.delim(sample_files[1], comment ="#")
dim(sf1)
[1] 119206      4
head(rownames(sf1))
colnames(sf1)
[1] "ENST00000335137.3.ENSG00000186092.4.OTTHUMG00000001094.1.OTTHUMT00000003223.1.OR4F5.001.OR4F5.918.CDS.1.918."
[2] "X918"                                                                                                        
[3] "X0"                                                                                                          
[4] "X0.1"   

sf2<- read.table(sample_files[1], row.names = 1)
dim(sf2)
[1] 119207      4
head(rownames(sf2))
[1] "ENST00000335137.3|ENSG00000186092.4|OTTHUMG00000001094.1|OTTHUMT00000003223.1|OR4F5-001|OR4F5|918|CDS:1-918|"                          
[2] "ENST00000423372.3|ENSG00000237683.5|-|-|AL627309.1-201|AL627309.1|2661|UTR5:1-70|CDS:71-850|UTR3:851-2661|"                            
[3] "ENST00000426406.1|ENSG00000235249.1|OTTHUMG00000002860.1|OTTHUMT00000007999.1|OR4F29-001|OR4F29|995|UTR5:1-19|CDS:20-958|UTR3:959-995|"
[4] "ENST00000332831.2|ENSG00000185097.2|OTTHUMG00000002581.1|OTTHUMT00000007334.1|OR4F16-001|OR4F16|995|UTR5:1-19|CDS:20-958|UTR3:959-995|"
[5] "ENST00000599533.1|ENSG00000269831.1|-|-|AL669831.1-201|AL669831.1|129|CDS:1-129|"                                                      
[6] "ENST00000594233.1|ENSG00000269308.1|-|-|AL645608.2-201|AL645608.2|57|CDS:1-57|" 

sf3<- read_tsv(sample_files[1], comment ="#")
dim(sf3)
[1] 119206      4
colnames(sf3)
[1] "ENST00000335137.3|ENSG00000186092.4|OTTHUMG00000001094.1|OTTHUMT00000003223.1|OR4F5-001|OR4F5|918|CDS:1-918|"
[2] "918"                                                                                                         
[3] "0"                                                                                                           
[4] "0"     

sf4 <- read_tsv(sample_files[1], col_names = FALSE, comment='#')
dim(sf4)
[1] 119207      4

 

dim(txi.salmon$counts)

[1] 119206     13

I think read_tsv and read.delim consider the first transcript as colnames.

 

 

ADD REPLY
0
Entering edit mode

the quant.sf file starts with some information as comments and then the transcripts.(name, length, tpm, number of reads)

# salmon (mapping-based) v0.5.1
# [ program ] => salmon 
# [ command ] => quant 
# [ index ] => { /tcp_idx }
# [ libType ] => { SR }
# [ unmatedReads ] => { /1.fastq }
# [ threads ] => { 5 }
# [ output ] => { ZR1 }
# [ mapping rate ] => { 49.3879% }

# Name    Length    TPM    NumReads
ENST00000335137.3|ENSG00000186092.4|OTTHUMG00000001094.1|OTTHUMT00000003223.1|OR4F5-001|OR4F5|918|CDS:1-918|    918    0    0
ENST00000423372.3|ENSG00000237683.5|-|-|AL627309.1-201|AL627309.1|2661|UTR5:1-70|CDS:71-850|UTR3:851-2661|    2661    1.18649    142.613
ENST00000426406.1|ENSG00000235249.1|OTTHUMG00000002860.1|OTTHUMT00000007999.1|OR4F29-001|OR4F29|995|UTR5:1-19|CDS:20-958|UTR3:959-995|    995    0.00644092    0.25
ENST00000332831.2|ENSG00000185097.2|OTTHUMG00000002581.1|OTTHUMT00000007334.1|OR4F16-001|OR4F16|995|UTR5:1-19|CDS:20-958|UTR3:959-995|    995    0.00644092    0.25

...

ADD REPLY
0
Entering edit mode

Note that the first tx starts with "ENST00000335137.3..."

Isn't this one in your txi matrices?

ADD REPLY
0
Entering edit mode

sorry, I don't follow. Can you confirm that there is a missing transcript that is not in txi.salmon$counts?

What do you get on the command line with:

> head -3 quant.sf 

or within R:

system("head -3 quant.sf")
ADD REPLY
0
Entering edit mode

No, the first transcript does not appear

head -3 quant.sf 

# salmon (mapping-based) v0.5.1

# [ program ] => salmon 

# [ command ] => quant

 

codes: (in the name of transcripts I omit the rest for readability but it starts from the second transcript)

txi.salmon <- tximport(sample_files, type = "salmon",  txOut=TRUE, reader = read_tsv)

head(rownames(txi.salmon$counts))

[1] "ENST00000423372.3|ENSG00000237683.5|                     

[2] "ENST00000426406.1|ENSG00000

[3] "ENST00000332831.2|ENSG0

ADD REPLY
1
Entering edit mode

Ok. good catch. and thanks for bearing with me while I checked if it was missing!

The issue is that you've got an older version of Salmon, which I had written some code to try to deal with, but there is indeed a bug in reading the first transcript. I've attempted to fix this in the devel branch. Could you test the new tximport package for me? There are no dependencies of tximport, so I can give you some lines of code to test the new package, and if this works, I'll port the change over to release branch. (Note to any other readers, this is not a safe way to test out other Bioconductor packages in the devel branch, you will enter a bad, bad place with regard to dependencies.)

Download:

https://github.com/Bioconductor-mirror/tximport/archive/master.zip

Then unzip and rename the dir from 'tximport-master' to 'tximport'

Then on the command line:

R CMD build --no-build-vignettes tximport
R CMD INSTALL tximport_1.1.3.tar.gz

 

ADD REPLY
0
Entering edit mode

yes it works fine, it gives an error for read_tsv ( Error in reader(x, comment = "#", header = FALSE) : unused argument (header = FALSE) )but it reads all the transcripts. I learned a lot form our conversation:))

ADD REPLY
0
Entering edit mode

hi Hadi,

thanks! if you don't mind could you just run it again, I've updated to try to silence that error

https://github.com/Bioconductor-mirror/tximport/archive/master.zip

R CMD build --no-build-vignettes tximport
R CMD INSTALL tximport_1.1.4.tar.gz

 

ADD REPLY
0
Entering edit mode

Hey Michael,

it is a pleasure,

Yes, perfect, it does not give error anymore;) 

R CMD build --no-build-vignettes tximport
R CMD INSTALL tximport_1.1.3.tar.gz

 

ADD REPLY
1
Entering edit mode

I pushed this fix to release (1.0.3) and devel (1.1.4) now. You should be able to get these via biocLite() in 1-2 days.

ADD REPLY
0
Entering edit mode

Thank you so much Michael for the nice conversation :))

ADD REPLY

Login before adding your answer.

Traffic: 772 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6