I want to ask about the matrices that tximport generates with txOut = FALSE (the default). I want the values from StringTie quantification, but summarized to the gene level, since those are more robust. I am using the StringTie output files "t_data.ctab" as input for tximport. tximport returns a list with the matrices "abundance", "counts", and "length", in which the transcript-level information is summarized to the gene level. My questions:
1. Does txi$counts give raw counts? If so, where do they come from? I cannot find raw counts in the t_data.ctab file.
2. Does the matrix txi$abundance contain TPM or FPKM values? (I used the default, countsFromAbundance = "no".) I cannot find TPM values in the t_data.ctab files, although they do contain FPKM values. If they are TPM values, where do they come from? And if they are FPKM values, how do I get TPM values (not the scaled or length-scaled ones)?
Please let me know, as this is not clear to me from the research paper. Looking forward to your reply. I also tried the gene.tsv files, which contain FPKM and TPM values, but I got this error:
> tx2gene <- tmp[, c("Gene ID", "Gene Name")]
> txi <- tximport(files1, type = "stringtie", tx2gene = tx2gene)
reading in files with read_tsv
1 Warning: 59043 parsing failures.
row col expected actual file
1 Strand an integer + '/scratch/neocircle-samples-20190118/S006493/l.r.m.c.lib.g/k2.a/t/gene.tsv'
1 Coverage no trailing characters .714223 '/scratch/neocircle-samples-20190118/S006493/l.r.m.c.lib.g/k2.a/t/gene.tsv'
1 FPKM no trailing characters .219789 '/scratch/neocircle-samples-20190118/S006493/l.r.m.c.lib.g/k2.a/t/gene.tsv'
2 Strand an integer + '/scratch/neocircle-samples-20190118/S006493/l.r.m.c.lib.g/k2.a/t/gene.tsv'
2 Coverage no trailing characters .669177 '/scratch/neocircle-samples-20190118/S006493/l.r.m.c.lib.g/k2.a/t/gene.tsv'
... ........ ...................... ....... ...........................................................................
See problems(...) for more details.
Error in tximport(files1, type = "stringtie", tx2gene = tx2gene) :
all(c(lengthCol, abundanceCol) %in% names(raw)) is not TRUE
In addition: Warning message:
Unnamed `col_types` should have the same length as `col_names`. Using smaller of the two.
It seems to me that tximport will only work with "t_data.ctab" files.
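For comparison, here is a minimal sketch of building tx2gene from transcript-level t_data.ctab columns instead of the gene.tsv columns "Gene ID" and "Gene Name". The failed check in the error above refers to length and abundance columns, which suggests the gene-level file lacks the columns tximport looks for. The column names t_name and gene_id are assumed from the usual t_data.ctab header, the toy data frame stands in for one real file, and import_stringtie is a hypothetical helper that is defined but not run here because it needs real files.

```r
# Toy stand-in for one t_data.ctab file (only the two ID columns are needed
# for the transcript-to-gene map). Column names are an assumption.
tdata <- data.frame(
  t_name  = c("ENST00000318602.11", "ENST00000404455.2"),
  gene_id = c("ENSG00000175899.14", "ENSG00000175899.14"),
  stringsAsFactors = FALSE
)

# tx2gene: first column transcript ID, second column gene ID.
tx2gene <- tdata[, c("t_name", "gene_id")]

# Hypothetical wrapper around tximport. `files` would be a named character
# vector of paths to t_data.ctab files, one per sample.
import_stringtie <- function(files, tx2gene, read_length = 75) {
  tximport::tximport(files, type = "stringtie", tx2gene = tx2gene,
                     readLength = read_length)
}
```

With real files, the call would look like import_stringtie(files1, tx2gene), with read_length set to the read length used in the StringTie run.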
Thanks for your answer:
I have a question:
t_id   chr    strand  start    end      t_name              num_exons  length  gene_id             gene_name  cov         FPKM
15869  chr12  -       9067712  9116157  ENST00000318602.11  36         4844    ENSG00000175899.14  A2M        336.726562  134.45993
15870  chr12  -       9110314  9116229  ENST00000404455.2   6          623     ENSG00000175899.14  A2M        0.041283    0.016485
So how are the counts calculated at the gene and transcript level for this gene, A2M? The rows above are from a t_data.ctab file. Please let me know. Looking forward to hearing from you. Thanks.
Will it be 336.726562 * (9116157 - 9067712) / 4844? That result does not match the tximport output txi$counts.
The transcript length is 4844. The start and end are genomic coordinates, so that includes introns.
The read length is a parameter in StringTie, and we have a special tximport argument for StringTie so you can set it.
Yes, I have used the StringTie argument in tximport.
So, as you mentioned, counts are calculated as cov * average transcript length / read length.
For this gene, will it be 336.726562 * 4844 / ?
The answer in the output file is 21748.3891418267.
I just want to understand how it is calculated at a fundamental level.
What do you get for 336 * 4844 / read length, where you fill in read length with the value that you provided (or the default value that you can look up in ?tximport if you did not provide tximport with the read length)?
Yes, got it. The default value for readLength in tximport is 75. Thank you.
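To make the arithmetic concrete, here is a small sketch that applies cov * length / read length to the two A2M transcripts quoted above, using the default readLength of 75. The variable names are illustrative only. Summing the two transcript-level values gives the gene-level number reported in the thread.

```r
# Values for the two A2M transcripts from the t_data.ctab rows in this thread.
cov         <- c(336.726562, 0.041283)   # per-base coverage
tx_length   <- c(4844, 623)              # transcript length (exons only)
read_length <- 75                        # tximport default for readLength

# Transcript-level counts: coverage * transcript length / read length.
tx_counts <- cov * tx_length / read_length

# Gene-level count: sum over the transcripts that belong to the gene.
gene_count <- sum(tx_counts)

print(tx_counts)
print(gene_count)
```

The first transcript alone contributes about 21748.05. The remaining 0.34 comes from the second, much shorter transcript, which is why the single-transcript calculation falls slightly short of the 21748.3891418267 in the output.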