Question

Differences between RTCGA and TCGAbiolinks data

2

Entering edit mode

atakanekiz ▴ 30

@atakanekiz-15874

Last seen 10 months ago

Turkey

Hello Bioconductor community,

I'm trying to analyze miRNASeq data of TCGA melanoma (SKCM) samples. This is a prototype analysis in which I picked hsa-mir-155 and looked at how its expression is correlated with survival. My end goal is expanding this analysis to other cancer types also focusing on other miRNAs.

I performed my analyses with two of the popular packages in R (RTCGA and TCGAbiolinks) and obtained quite different results. RTCGA package features data from 2015-11-01. I used harmonized data when I was trying TCGAbiolinks. I'm trying to make sense of what might be causing the different results. Any help is appreciated. For comparison purposes, I'm attaching some figures and my codes as well.

![Kaplan-Meier curves][1] Overall Kaplan-Meier curves look quite different. These plots were generated on the whole cohort (unsegregated for any parameter). I'm including the plot from oncoLNC.org database as well for comparison purposes. RTCGA looks more like the oncoLNC curve.

You can see similar differences when the data is segregated based on patient.gender:

When I segregate patients based on the expression of hsa-mir-155 (top and bottom thirds), the differences become more obvious:

To understand what might be different in the datasets I exported the data from both packages and performed a comparison. Linked excel file shows the comparisons including clinical details (days_to_death and days_to_last_follow_up) and gene expression values of hsa-mir-155 (both read_count and reads_per_million_miRNA. I noticed that there are considerable differences between two datasets. I'm pretty sure, the way I organized the data is ok and I don't think the differences are due a mistake in data manipulation in R.

The code I used for analyses can be found at

RTCGA_v1

TCGABiolinks_v1

Please let me know why you think there is a discrepancy here. I'm pretty new to this type of analyses and hopefully, I didn't miss something silly.

Thank you very much in advance for your insights,

Atakan

r survival analysis rtcga tcgabiolinks mirna-seq • 3.1k views

ADD COMMENT • link 6.8 years ago atakanekiz ▴ 30

2

Entering edit mode

Levi Waldron ★ 1.1k

@levi-waldron-3429

Last seen 11 weeks ago

CUNY Graduate School of Public Health a…

Dear Atakan,

I'm not sure why, but you're not getting the last follow-up from RTCGAToolbox, which I could see by comparing the follow-up times of patients who were alive at last follow-up. May I suggest making your life easier and using curatedTCGAData, which provides these data as MultiAssayExperiment objects so that assays are pre-linked to patient data. It uses RTCGAToolbox to access Firehose data, but it has the same follow-up time in SKCM as you reported here from TCGABiolinks.

-Levi

ADD COMMENT • link 6.8 years ago Levi Waldron ★ 1.1k

0

Entering edit mode

Hi Levi,

Thank you very much for your response! This package looks pretty exciting and I see that you are one of the authors. I'm looking forward to trying this. Can you please let me know a bit about the data hosted within the package? Is it pulling out the most recent harmonized data from Firehose? I'm not that familiar with Firehose and I'm not sure if it is constantly maintained/updated to incorporate changes in TCGA data.

Best,

Atakan

ADD REPLY • link 6.8 years ago atakanekiz ▴ 30

0

Entering edit mode

Hi Atakan,

The last TCGA Firehose run date was 20160128, and there are no new cases or survival follow-up after that date. It’s that snapshot that RTCGAToolbox, TCGABiolinks, and curatedTCGAData can provide you with, and it’s the most recent data available from Firehose. curatedTCGAData does something extra in merging levels 1-3 clinical pathological and specimen data, plus subtypes extracted from supplemental tables of primary publications, within the colData of the MultiAssayExperiment.

There is ongoing QC work and processing through a hg38 pipeline at the Genomic Data Commons, which you would have to access through the GenomicDataCommons package. You can see the kind and frequency of updates they are doing at https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/.

Note there are differences in the sequence analysis pipeline used by GDC than was used by Firehose, in particular, default GDC data is based on hg38 coordinates whereas Firehose (and “legacy” data from GDC) are based on hg19, see:

https://confluence.broadinstitute.org/m/view-rendered-page.action?pageId=31232809

Sincerely,

Levi

ADD REPLY • link 6.8 years ago Levi Waldron ★ 1.1k

0

Entering edit mode

Dear @Levi Waldron. Please don't missname RTCGA with RTCGAToolbox. Those are two various packages. Best.

ADD REPLY • link 6.8 years ago MarcinKosiński • 0

0

Entering edit mode

Indeed Marcin, my apologies! That explains how the asker was getting 2015-11-01 release data; they should be using your "20160128" set of data packages.

ADD REPLY • link 6.8 years ago Levi Waldron ★ 1.1k

0

Entering edit mode

20160128 are only on GitHub. I had no power and energy to transfer them to the Bioconductor ready format https://github.com/RTCGA

ADD REPLY • link 6.8 years ago MarcinKosiński • 0

score 2 · Accepted Answer · 2018-06-12

Alright, I think I figured out what's going on.

The biggest difference in the results was caused by a missed point during pre-processing. Please see below:

RTCGA features an already preprocessed data frame with clinical parameters. In this data frame there is a times variable that combines both days_to_death and days_to_last_follow_up. That way you can provide times variable into the survfit() function and get the data in one step. In TCGAbiolinks data I needed to create this variable myself. Previously I overlooked this point, and I fitted the survival model only by using days_to_death variable. This effectively discards data from censored and alive patients skewing the results.

I also compared legacy and harmonized miRNA datasets by using TCGAbiolinks package. Results are shown below all side by side. As an example, two different parameters are plotted: patient.gender and hsa-mir-155 expression. harmonized and legacy data uses the same clinical annotation file, that's why gender plots look the same. With this realization, RTCGA and TCGAbiolinks are giving comparable results. Whew!

Thank you all for your insights!

Best,

Atakan