J. Smith • 0
@j-smith-20436

Hi.

I have obtained TCGA RSEM data for STAD using FireBrowse. I found that there are two types of files for RSEM RNASeqV2: 1) illuminahiseq_rnaseqv2-RSEM_genes (MD5) and 2) illuminahiseq_rnaseqv2-RSEM_genes_normalized (MD5).

I think the 1) illuminahiseq_rnaseqv2-RSEM_genes file is the most suitable for subsequent analysis with DESeq2, imported through tximport.

There are 3 columns associated with each sample in the 1) illuminahiseq_rnaseqv2-RSEM_genes file: "raw_count", "scaled_estimate" and "transcript_id". I feel the "raw_count" column (in reality probably estimated counts, because the values are not integers) will be the most appropriate for DESeq2 analysis, imported via tximport.

Could anyone please confirm whether I am correct?

Thank you.

deseq2 tximport tcga rsem firebrowse • 1.4k views
1

The user has already been receiving help (from me) on Biostars: https://www.biostars.org/p/437617/#438523

0

Thanks, Kevin, for all the help I got on Biostars. But actually, I have not got an answer to this specific question about the FireBrowse "raw_count" data. I am new to this field and am trying to learn about as many alternatives as I can. Sorry to bother you so much... and thanks once again.

1

Sure thing - no problem.

1
@kevin
V&A Waterfront, Cape Town, South Africa

J. Smith, when you post across two web-sites, it's good to link up the threads so that we do not end up duplicating efforts.

There is a specific use case for RSEM in the vignette for tximport:

However, the data made available by the Broad Institute is not quite suitable for tximport because they have merged all of the original RSEM files into a single file.

The 'scaled_estimate' is TPM on a fractional scale (multiply by 10^6 to obtain TPM), while the 'raw_count' is, as you implied, the expected / estimated count. You could round the 'raw_count' values to integers and use those on their own, if you wish. Unfortunately, the length of each gene is not included in this data provided by the Broad Institute. This would normally be used by tximport to correct for gene length.

Alternatives:

1. the original RSEM files per sample are available at the GDC; so, you could obtain those and use them via tximport.
2. HTSeq raw counts are available at the UCSC Xena Browser. You could use those and then follow my advice here to prepare them for DESeq2 / edgeR: Question: Normalisation of RNAseq data from UCSC Xena Browser
3. use TCGAbiolinks, which is a Bioconductor package that is tailored for users from diverse backgrounds. It takes much of the initial 'heavy lifting' away from you.
4. explore data via cBioPortal, hosted by MSKCC

Kevin

0

Many, many thanks once again, Kevin. Sorry for my mistake in not mentioning the thread on Biostars. I will not make that mistake again. As I am new to these kinds of communities, I made it unknowingly.

Actually, there are many files when I try to download RSEM data for a specific cancer through FireBrowse. I really don't know which file contains the original RSEM values per sample. Could you kindly tell me which file you are referring to with "the original RSEM files per sample are available at the GDC"? If the original RSEM files per sample can't be downloaded via FireBrowse, then from where and how can I get those data? A link to any specific cancer dataset with "original RSEM files per sample" would help me a lot...

Thank you.

1

Actually, there are many files when I try to download RSEM data for a specific cancer when I use FireBrowse

Are you sure? - I downloaded the file to which you originally linked (illuminahiseq_rnaseqv2-RSEM_genes), and it contains a single tab-delimited file (and another manifest) called STAD.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.data.txt. Inside this file is data across many samples.

If you can kindly tell me which file corresponds to "the original RSEM files per sample are available at the GDC", that you are referring to? If the original RSEM files per sample can't be downloaded via FireBrowse, then from where can I get those data?

The RSEM files per sample are actually available on the GDC Legacy Archive, mixed among the HTSeq count files, it seems - here is a configured search. Because obtaining data in bulk from the GDC has a learning curve of its own, I would encourage you to use the data that you already have but just do the following:

1. filter the data for just the raw_count columns
2. convert these to integer values
3. import to DESeq2 via DESeqDataSetFromMatrix()

Keep in mind that this is not ideal, but there are constraints here given that you describe yourself as a beginner in this field.

Or, do the above, but using the HTSeq raw counts from Xena, which I mentioned in my original answer.
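
For illustration, the three steps above (keep the raw_count columns, round to integers, import via DESeqDataSetFromMatrix()) could look roughly like the R sketch below. It is only a sketch under assumptions: the merged FireBrowse table is assumed to have two header rows (sample barcodes, then column types), and a tiny made-up table with hypothetical sample and gene IDs stands in for the real file. For real data, you would read the downloaded .txt file with read.delim() instead of the `toy` text. The DESeq2 call is only attempted if DESeq2 is installed, and `design = ~ 1` is a placeholder to be replaced by your own design.

```r
# Toy stand-in for the merged FireBrowse table (assumed layout):
# row 1 = sample barcodes, row 2 = column types, then one row per gene.
# Sample names (S1, S2) and gene IDs are made up for illustration.
toy <- paste(
  "Hybridization REF\tS1\tS1\tS1\tS2\tS2\tS2",
  "gene_id\traw_count\tscaled_estimate\ttranscript_id\traw_count\tscaled_estimate\ttranscript_id",
  "GENE1|1\t10.49\t1.2e-06\ttx1\t0.00\t0\ttx1",
  "GENE2|2\t250.51\t3.4e-05\ttx2,tx3\t99.40\t1.1e-05\ttx2,tx3",
  sep = "\n"
)

# Read everything as text so that the two header rows survive intact
raw <- read.delim(text = toy, header = FALSE, colClasses = "character")

sample_ids  <- as.character(unlist(raw[1, -1]))
column_type <- as.character(unlist(raw[2, -1]))

# Step 1: keep only the raw_count columns
keep   <- column_type == "raw_count"
counts <- as.matrix(raw[-(1:2), -1][, keep, drop = FALSE])
rownames(counts) <- raw[-(1:2), 1]
colnames(counts) <- sample_ids[keep]

# Step 2: convert the estimated counts to integers
mode(counts) <- "numeric"
counts <- round(counts)
storage.mode(counts) <- "integer"

# Step 3: import to DESeq2 (skipped if DESeq2 is not installed)
if (requireNamespace("DESeq2", quietly = TRUE)) {
  col_data <- data.frame(sample = colnames(counts), row.names = colnames(counts))
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts,
                                        colData   = col_data,
                                        design    = ~ 1)
}
```

Because the gene lengths are not available in this file, this route skips the length correction that tximport would otherwise apply, which is why it is not ideal.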

0

Thanks a lot, Kevin. By

"Actually, there are many files when I try to download RSEM data for a specific cancer when I use FireBrowse"

I mean that when I select a particular cohort in FireBrowse and click the mRNASeq bar in the right-hand window, there are options for downloading many RSEM files.

I have also tried to use TCGAbiolinks to download RSEM data for a TCGA cancer cohort, but I don't know whether those are "RSEM files per sample". Thanks for providing me with the configured search. Additionally, could you kindly tell me whether there is any R package for getting TCGA data as "original RSEM files per sample"?

Thanks a lot...