Question: Differences between RTCGA and TCGAbiolinks data
2
17 months ago by
atakanekiz30
atakanekiz30 wrote:

Hello Bioconductor community,

I'm trying to analyze miRNASeq data of TCGA melanoma (SKCM) samples. This is a prototype analysis in which I picked hsa-mir-155 and looked at how its expression is correlated with survival. My end goal is expanding this analysis to other cancer types also focusing on other miRNAs.

I performed my analyses with two of the popular packages in R (RTCGA and TCGAbiolinks) and obtained quite different results. RTCGA package features data from 2015-11-01. I used harmonized data when I was trying TCGAbiolinks. I'm trying to make sense of what might be causing the different results. Any help is appreciated. For comparison purposes, I'm attaching some figures and my codes as well.

![Kaplan-Meier curves][1]  Overall Kaplan-Meier curves look quite different. These plots were generated on the whole cohort (unsegregated for any parameter). I'm including the plot from oncoLNC.org database as well for comparison purposes. RTCGA looks more like the oncoLNC curve.

You can see similar differences when the data is segregated based on patient.gender:

When I segregate patients based on the expression of hsa-mir-155 (top and bottom thirds), the differences become more obvious:

To understand what might be different in the datasets I exported the data from both packages and performed a comparison. Linked excel file shows the comparisons including clinical details (days_to_death and days_to_last_follow_up) and gene expression values of hsa-mir-155 (both read_count and reads_per_million_miRNA.  I noticed that there are considerable differences between two datasets. I'm pretty sure, the way I organized the data is ok and I don't think the differences are due a mistake in data manipulation in R.

The code I used for analyses can be found at

RTCGA_v1

Please let me know why you think there is a discrepancy here. I'm pretty new to this type of analyses and hopefully, I didn't miss something silly.

Atakan

modified 17 months ago • written 17 months ago by atakanekiz30
2
17 months ago by
atakanekiz30
atakanekiz30 wrote:

Alright, I think I figured out what's going on.

The biggest difference in the results was caused by a missed point during pre-processing. Please see below:

RTCGA features an already preprocessed data frame with clinical parameters. In this data frame there is a times variable that combines both days_to_death and days_to_last_follow_up. That way you can provide times variable into the survfit() function and get the data in one step. In TCGAbiolinks data I needed to create this variable myself. Previously I overlooked this point, and I fitted the survival model only by using days_to_death variable. This effectively discards data from censored and alive patients skewing the results.

I also compared legacy and  harmonized miRNA datasets by using TCGAbiolinks package. Results are shown below all side by side. As an example, two different parameters are plotted: patient.gender and hsa-mir-155 expression. harmonized and legacy data uses the same clinical annotation file, that's why gender plots look the same. With this realization, RTCGA and TCGAbiolinks are giving comparable results. Whew!

Thank you all for your insights!

Best,

Atakan

2
17 months ago by
Levi Waldron950
CUNY Graduate School of Public Health and Health Policy, New York, NY
Levi Waldron950 wrote:

Dear Atakan,

I'm not sure why, but you're not getting the last follow-up from RTCGAToolbox, which I could see by comparing the follow-up times of patients who were alive at last follow-up. May I suggest making your life easier and using curatedTCGAData, which provides these data as MultiAssayExperiment objects so that assays are pre-linked to patient data. It uses RTCGAToolbox to access Firehose data, but it has the same follow-up time in SKCM as you reported here from TCGABiolinks.

-Levi

Hi Levi,

Thank you very much for your response! This package looks pretty exciting and I see that you are one of the authors. I'm looking forward to trying this. Can you please let me know a bit about the data hosted within the package? Is it pulling out the most recent harmonized data from Firehose? I'm not that familiar with Firehose and I'm not sure if it is constantly maintained/updated to incorporate changes in TCGA data.

Best,

Atakan

Hi Atakan,

The last TCGA Firehose run date was 20160128, and there are no new cases or survival follow-up after that date. It’s that snapshot that RTCGAToolbox, TCGABiolinks, and curatedTCGAData can provide you with, and it’s the most recent data available from Firehose. curatedTCGAData does something extra in merging levels 1-3 clinical pathological and specimen data, plus subtypes extracted from supplemental tables of primary publications, within the colData of the MultiAssayExperiment.

There is ongoing QC work and processing through a hg38 pipeline at the Genomic Data Commons, which you would have to access through the GenomicDataCommons package. You can see the kind and frequency of updates they are doing at https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/

Note there are differences in the sequence analysis pipeline used by GDC than was used by Firehose, in particular, default GDC data is based on hg38 coordinates whereas Firehose (and “legacy” data from GDC) are based on hg19, see:

Sincerely,

Levi

Dear @Levi Waldron. Please don't missname RTCGA with RTCGAToolbox. Those are two various packages. Best.

Indeed Marcin, my apologies! That explains how the asker was getting 2015-11-01 release data; they should be using your "20160128" set of data packages.