Question

tximport question on the values contained in "counts" and "abundance" matrices

HKS • 0

@hks-19681

Last seen 6.4 years ago

I want to ask about the matrices that are generated when using txOut = FALSE (the default), because I want the values from the StringTie quantification summarized to the gene level (as they are more robust). I am using the StringTie output files "t_data.ctab" as input for tximport. tximport returns a list with the matrices "abundance", "counts", and "length", in which the transcript-level information is summarized to the gene level. I want to ask:

1. Does txi$counts give raw counts? If yes, where do they come from? I cannot find raw counts in the t_data.ctab file.
2. Does the matrix txi$abundance contain TPM or FPKM values? (I used the default, countsFromAbundance = "no".) I cannot find TPM values in the t_data.ctab files, but there are FPKM values in them. If they are TPM values, where do they come from? And if they are FPKM values, how do I get TPM values (not the scaled or length-scaled ones)?

Please let me know, as it is not clear to me from the research paper. Looking forward to your reply. I also tried with gene.tsv files, which contain FPKM and TPM values, but I got this error:

> tx2gene <- tmp[, c("Gene ID", "Gene Name")]
> txi <- tximport(files1, type = "stringtie", tx2gene = tx2gene)
reading in files with read_tsv
1 Warning: 59043 parsing failures.
row      col               expected  actual                                                                        file
  1 Strand   an integer             +       '/scratch/neocircle-samples-20190118/S006493/l.r.m.c.lib.g/k2.a/t/gene.tsv'
  1 Coverage no trailing characters .714223 '/scratch/neocircle-samples-20190118/S006493/l.r.m.c.lib.g/k2.a/t/gene.tsv'
  1 FPKM     no trailing characters .219789 '/scratch/neocircle-samples-20190118/S006493/l.r.m.c.lib.g/k2.a/t/gene.tsv'
  2 Strand   an integer             +       '/scratch/neocircle-samples-20190118/S006493/l.r.m.c.lib.g/k2.a/t/gene.tsv'
  2 Coverage no trailing characters .669177 '/scratch/neocircle-samples-20190118/S006493/l.r.m.c.lib.g/k2.a/t/gene.tsv'
... ........ ...................... ....... ...........................................................................
See problems(...) for more details.

Error in tximport(files1, type = "stringtie", tx2gene = tx2gene) : 
  all(c(lengthCol, abundanceCol) %in% names(raw)) is not TRUE
In addition: Warning message:
Unnamed `col_types` should have the same length as `col_names`. Using smaller of the two.

It seems to me that tximport only works with "t_data.ctab" files.

tximport • 2.1k views

ADD COMMENT • link updated 6.9 years ago by James W. MacDonald 68k • written 7.0 years ago by HKS • 0

Thanks for your answer:

I have a question:

t_id    chr     strand  start    end      t_name              num_exons  length  gene_id             gene_name  cov         FPKM
15869   chr12   -       9067712  9116157  ENST00000318602.11  36         4844    ENSG00000175899.14  A2M        336.726562  134.45993
15870   chr12   -       9110314  9116229  ENST00000404455.2   6          623     ENSG00000175899.14  A2M        0.041283    0.016485

So how would you calculate counts at the gene and transcript level for this gene, A2M? These rows are from a t_data.ctab file. Please let me know. Looking forward to hearing from you. Thanks

ADD REPLY • link 6.9 years ago HKS • 0

t_id    chr strand  start   end t_name  num_exons   length  gene_id gene_name   cov FPKM
15869   chr12   -   9067712 9116157 ENST00000318602.11  36  4844    ENSG00000175899.14  A2M 336.726562  134.45993
15870   chr12   -   9110314 9116229 ENST00000404455.2   6   623 ENSG00000175899.14  A2M 0.041283    0.016485

ADD REPLY • link updated 6.9 years ago by Michael Love 43k • written 6.9 years ago by HKS • 0

Will it be 336.726562 * (9116157 - 9067712) / 4844? This result does not match the tximport output tx$counts.

ADD REPLY • link 6.9 years ago HKS • 0

The transcript length is 4844. The start and end are genomic coordinates, so that includes introns.

The read length is a parameter in Stringtie, and we have a special tximport argument for Stringtie so you can set it.

ADD REPLY • link 6.9 years ago Michael Love 43k

Yes, I have used the Stringtie argument in tximport.

So, as you mentioned, counts will be calculated as cov * average transcript length / read length.

So for this gene, will it be 336.726562 * 4844 / ?

The answer in the output file is 21748.3891418267.

I just want to understand how it is calculated at a fundamental level.

ADD REPLY • link 6.9 years ago HKS • 0

What do you get for 336 * 4844 / read length, where you fill in read length with the value that you provided (or the default value that you can look up in ?tximport if you did not provide tximport with the read length)?

ADD REPLY • link 6.9 years ago Michael Love 43k

Yes, got it. The default value for readLength in tximport is 75. Thank you.

ADD REPLY • link 6.9 years ago HKS • 0

score 0 · Answer 1 · 2019-01-31

txi$counts gives our best estimate of the original counts for Stringtie, which is cov * average transcript length / read length (as suggested by the Stringtie authors).
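As a rough illustration of that formula, the sketch below redoes the arithmetic for the two A2M transcripts quoted in the comments, assuming the default readLength of 75 mentioned there. The variable names are made up for this example and are not tximport internals; the gene-level value is taken here to be the sum of the transcript-level counts.

```r
# cov and length copied from the two A2M rows of t_data.ctab quoted in the thread
cov <- c("ENST00000318602.11" = 336.726562,
         "ENST00000404455.2"  = 0.041283)
tx_length <- c("ENST00000318602.11" = 4844,
               "ENST00000404455.2"  = 623)

# Assumption: the default read length (75) was used, as confirmed in the comments
read_length <- 75

# Estimated count per transcript: cov * transcript length / read length
tx_counts <- cov * tx_length / read_length

# Gene-level estimate for A2M: sum over its transcripts
gene_count <- sum(tx_counts)

print(tx_counts)
print(gene_count)
```

With these inputs the sum agrees with the value 21748.3891418267 that the poster reported for A2M in the tximport output.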

Abundance gives back the FPKM column from Stringtie. Abundance with all methods gives back what the software estimates. You can generate TPM easily from FPKM: divide each column by its sum, and then multiply by 1e6.
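A minimal sketch of that conversion, using a small made-up FPKM matrix (the gene and sample names are hypothetical). With real data, txi$abundance from a StringTie import would take the place of fpkm, and the column sums should be taken over the full matrix rather than a subset of genes.

```r
# Toy FPKM matrix: rows are genes, columns are samples (values are invented)
fpkm <- matrix(c(10, 30, 60,
                 5,  5, 90),
               nrow = 3,
               dimnames = list(c("geneA", "geneB", "geneC"),
                               c("sample1", "sample2")))

# Divide each column by its sum, then multiply by 1e6
tpm <- sweep(fpkm, 2, colSums(fpkm), "/") * 1e6

print(tpm)
print(colSums(tpm))  # each column of a TPM matrix sums to one million
```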