How to run tximport ?
Entering edit mode
Last seen 9 months ago


I'm working with non-model organism, doesn't have any information in the public domain. After denovo assembly using trinity for all 32 samples, using salmon transcript abundance level calculated and further using trinity abundance matrix scripts genecount2.matrix were generated for all the datasets. My question is how do i combine all the 32 matrix into one matrix fed into DEseq2/EdgeR packages as an input. From the web I found tximport does the job. To avoid confusion, I put all my quant.sf files in salmon directory (I refered, tximportdata)

dir <- system.file("extdata",package="tximportData")
 [1] "cufflinks"            "ERR188021"            "ERR188088"           
 [4] "ERR188288"            "ERR188297"            "ERR188329"           
 [7] "ERR188356"            "kallisto"             "rsem"                
[10] "sailfish"             "salmon"               "samples_extended.txt"
[13] "samples.txt"          "tx2gene.csv"       

samples <- read.table(file.path(dir,"samples.txt"),header=TRUE)
files <- file.path(dir,"salmon",samples$run,"quant.sf.gz")
names(files) <- paste0("sample",1:32)
samples <- read.table(file.path(dir,"samples.txt"),header = TRUE)

      run treatment replicate
1    9_S5     Model         1
2   10_S9     Model         2
3  11_S13     Model         3
4  12_S17     Model         4
5  13_S21     Model         5
6  14_S26     Model         6
7  15_S30     Model         7
8   16_S2     Model         8
9   17_S6    Salmon         1
10 18_S10    Salmon         2
11 19_S14    Salmon         3
12 20_S18    Salmon         4
13 21_S22    Salmon         5
14 22_S27    Salmon         6
15 23_S31    Salmon         7
16 24_S23    Salmon         8
17 25_S28   Control         1
18 26_S32   Control         2
19  27_S3   Control         3
20  28_S7   Control         4
21 29_S11   Control         5
22 30_S15   Control         6
23 31_S19   Control         7
24 32_S24   Control         8
25   1_S4      Odor         1
26   2_S8      Odor         2
27  3_S12      Odor         3
28  4_S16      Odor         4
29  5_S20      Odor         5
30  6_S25      Odor         6
31  7_S29      Odor         7
32   8_S1      Odor         8

head quant.sf
Name    Length  EffectiveLength TPM NumReads
TRINITY_DN30552_c0_g1_i1    772 523.000 2.871728    133.000
TRINITY_DN30585_c0_g1_i1    572 323.000 0.769154    22.000
TRINITY_DN30563_c0_g1_i1    516 267.000 444.724836  10515.000
TRINITY_DN30577_c0_g1_i1    1130    881.000 1.358699    106.000
TRINITY_DN30527_c1_g1_i1    366 117.000 3.084007    31.953
TRINITY_DN30527_c0_g1_i1    446 197.000 4.413853    77.000
TRINITY_DN30562_c0_g1_i1    236 16.266  4.165573    6.000
TRINITY_DN30526_c0_g1_i1    384 135.000 1.422029    17.000
TRINITY_DN30543_c0_g1_i1    1384    1135.000    1.094436    110.000

Now, how do i create, tx2gene dataframe, the genome,GTF information is not available for non-model organism I'm currently working with then, how do i generate tx2gene using GenomicFeatures package. Please need suggestions. Here is the gene.counts.matrix I generated using trinity abundance transcript scripts.

8_S1.gene_trans_map_salmon.gene.counts.matrix <==
    GeneID                      8_S1
TRINITY_DN0_c0_g1           9630.41
TRINITY_DN0_c1_g1   1      73.96
TRINITY_DN100000_c0_g1  76.79
TRINITY_DN100001_c0_g1  71.34
TRINITY_DN100002_c0_g1  73.23
TRINITY_DN100003_c0_g1  87.40
TRINITY_DN100004_c0_g1  31.01
TRINITY_DN100005_c0_g1  70.68
TRINITY_DN100006_c0_g1  13.12

==> 9_S5.gene_trans_map_salmon.gene.counts.matrix <==
       GeneID                   9_S5
TRINITY_DN0_c0_g1          537.85
TRINITY_DN100000_c0_g1  802.50
TRINITY_DN100001_c0_g1  41.39
TRINITY_DN100002_c0_g1  64.61
TRINITY_DN100003_c0_g1  83.42
TRINITY_DN100004_c0_g1  22.55
TRINITY_DN100005_c0_g1  30.28
TRINITY_DN100006_c0_g1  39.07
TRINITY_DN100007_c0_g1  186.55
tximport • 585 views
Entering edit mode
Last seen 8 hours ago
United States

tx2gene, which indicates which transcripts belongs to which genes is a required input to perform gene-level inference. tximport here is aggregating counts from transcript to gene level while computing an offset that accounts for average transcript length when comparing gene-level counts across samples.

I'm not sure all of the methods that cluster transcripts into genes, but I know Corset can do this.

Entering edit mode

How do I generate tx2gene information ? If I had a GTF file, is it possible to generate tx2gene information using GenomicFeatures package ? Some easy way.

Entering edit mode

Have you read any of the documentation? This is discussed in the function man page and the vignette. Before posting to the support site, it’s assumed that you’ve spent time looking up the documentation (or else why should we bother writing help pages and vignettes)?

Entering edit mode
ATpoint ▴ 860
Last seen 11 days ago

You already posted a lot about this particular problem at biostars and I think you are missing the concept here, so I will try to explain, more clearly than I already did over at Biostars:

1) salmon itself creates transcript level abundance estimates, so for simplicity lets call this transcript counts, a measurement for each transcript in the reference you used.

2) tximport is a method that takes these transcript counts and summarizes them to the gene level (plus a couple of other things such as correcting for relative length, but lets ignore it for now), so you get one single count per gene which consists of all the counts of the gene's transcripts.

3) You already seem to have gene level counts according to what you write here, based on this trinity application you used, so you do not need tximport here. The tool already did that for you. I suggested tximport over at biostars because you wrote that you have transcript level counts and want gene level differential analysis so typically tximport is the logical choice. But if you already have gene level counts, and it seems so, then you do not need it. You simply have to load the gene level counts into R and merge them into a single matrix.

Given you have loaded the files into R already do:

file1 <- data.frame(Gene=c("GeneA","GeneB","GeneC"),
file2 <- data.frame(Gene=c("GeneA","GeneB","GeneC"),
file3 <- data.frame(Gene=c("GeneA","GeneB","GeneC"),

#/ a merged data frame using dplyr
cmatrix <- plyr::join_all(lapply(ls(pattern = "^file"), function(x) get(x)), 
                          by='Gene', type='left')
genes <- cmatrix$Gene

#/ as matrix, "Gene" as rownames
cmatrix <- as.matrix(cmatrix[,2:ncol(cmatrix)])
rownames(cmatrix) <- genes

> cmatrix
      Sample1 Sample2 Sample3
GeneA      10      11       0
GeneB       0      22       2
GeneC       3       3      33
Entering edit mode

Thanks for the clear explanation. In every gene counts matrix on average I had 11074 gene ID's for each file, how can I do it as you showed in the above example. I had all the gene counts matrix in .txt and .csv format. file1 <- data.frame(Gene=c("GeneA","GeneB","GeneC"),

$ grep "TRINITY_" 9_S5.gene_trans_map_salmon.gene.counts.matrix | wc
 110794  221588 3135097
$ grep "TRINITY_" 10_S9.gene_trans_map_salmon.gene.counts.matrix  | wc
 124110  248220 3536139
$ grep "TRINITY_" 11_S13.gene_trans_map_salmon.gene.counts.matrix | wc
 123823  247646 3527948
$ grep "TRINITY_" 12_S17.gene_trans_map_salmon.gene.counts.matrix | wc
 137392  274784 3931360
Entering edit mode

I do not know what "it" in "how can I do it" is.


Login before adding your answer.

Traffic: 299 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6