Hi BioC community,
I'm investigating genomic relationships in TCGA-BRCA data from the GDC using the TCGAbiolinks package. Because the PAM50 molecular subtypes are commonly used to stratify breast cancer data, I've been trying to fill in the missing subtype designations in the dataset: of the 1222 patients with RNA-Seq expression data, only 524 are labeled with a molecular subtype in column subtype_PAM50.mRNA.
I've contacted GDC User Support Services to determine where the values in column subtype_PAM50.mRNA originate and how to reproduce them. They believe the column does not come from the GDC and have asked me to check with TCGAbiolinks support (which I think is this community).
The data were downloaded with this TCGAbiolinks query:
rnaseq_query <- GDCquery(project = "TCGA-BRCA",
                         data.category = "Transcriptome Profiling",
                         data.type = "Gene Expression Quantification",
                         workflow.type = "HTSeq - Counts",
                         legacy = FALSE)
The columns (including subtype_PAM50.mRNA) are extracted from the harmonized RNA-Seq expression data phenodata table:
sample_data <- colData(data)  # complete phenodata table: merged clinical, sample and subtype info
rnaseq_patient <- data.frame(sample_data[, c(1, 3, 11, 60, 61, 62, 63, 77)])
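To make the count of missing labels easy to reproduce, here is a minimal sketch in base R. The toy data frame stands in for the real phenodata table: its values, the made-up barcodes and the `patient` column are assumptions for illustration, and only the column name subtype_PAM50.mRNA comes from the actual table.

```r
# Toy stand-in for the phenodata table (all values invented for illustration).
pheno <- data.frame(
  patient = c("TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003", "TCGA-AA-0004"),
  subtype_PAM50.mRNA = c("Luminal A", NA, "Basal-like", NA),
  stringsAsFactors = FALSE
)

# Select columns by name rather than by position, so the selection does not
# silently change if the column order differs between package versions.
keep <- c("patient", "subtype_PAM50.mRNA")
pheno_small <- pheno[, keep]

# Count labelled versus unlabelled rows.
n_labelled <- sum(!is.na(pheno_small$subtype_PAM50.mRNA))
n_missing  <- sum(is.na(pheno_small$subtype_PAM50.mRNA))

# Tabulate the subtype calls, keeping NA as its own category.
subtype_counts <- table(pheno_small$subtype_PAM50.mRNA, useNA = "ifany")
print(subtype_counts)
```

On the real table, the same two `sum()` calls applied to `rnaseq_patient` should give the labelled and unlabelled counts I quoted above.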
Does anyone know the origin of column subtype_PAM50.mRNA so I may fill in missing data by inferring labels according to the methodology used for the 524 existing labels? Or, if you have the complete set of labels, that would be even better.
There is a list of 524 molecular subtype labels at this link:
It's found as an answer to a Biostars question here:
These subtype labels are for TCGA-BRCA and purportedly originate from "Comprehensive molecular portraits of human breast tumors" (Nature, 2012), found here:
Here's a guide from Prat, Parker and Perou on clustering RNA expression data into molecular subtype:
Guide to Intrinsic Subtyping 9-6-10.pdf
However, this guide seems to be designed for microarray data rather than RNA-Seq, whereas the 524 labels appear in the RNA-Seq dataset obtained by running the download query above.
There's a spreadsheet for the TCGA-BRCA dataset containing 1148 PAM50 subtypes as assigned by UNC and provided to the Shamir lab. It's included as:
Additional file 2: Detailed cohort description - TCGA sample IDs for the 6 sample groups analyzed in the study. (XLSX 179 kb)
as part of this article:
"Expression and methylation patterns partition luminal-A breast tumors into distinct prognostic subgroups"
I'm inclined to use these UNC labels provided by the Ron Shamir lab as subtype labels unless I find something more authoritative. Even if I use these labels, both GDC User Support Services and I would like to know where the values in column subtype_PAM50.mRNA originate. Please advise on what is best; I'm still researching this and know neither the origin of column subtype_PAM50.mRNA nor what the best practice would be.
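In case it helps frame the comparison, this is roughly how I would check an external label set against the existing subtype_PAM50.mRNA values before adopting it. It is only a sketch under stated assumptions: both data frames are toy data, and the column names `barcode`, `patient` and `pam50_unc` are hypothetical placeholders, not the names used in the spreadsheet.

```r
# Toy phenodata: sample barcodes with partially missing subtype calls.
pheno <- data.frame(
  barcode = c("TCGA-AA-0001-01A", "TCGA-AA-0002-01A", "TCGA-AA-0003-01A"),
  subtype_PAM50.mRNA = c("Luminal A", NA, "Basal-like"),
  stringsAsFactors = FALSE
)

# Toy external label set keyed by 12-character patient barcode
# (column names here are hypothetical).
external <- data.frame(
  patient = c("TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003"),
  pam50_unc = c("Luminal A", "Luminal B", "HER2-enriched"),
  stringsAsFactors = FALSE
)

# The first 12 characters of a TCGA barcode identify the patient.
pheno$patient <- substr(pheno$barcode, 1, 12)

# Left join so that every phenodata row is kept.
merged <- merge(pheno, external, by = "patient", all.x = TRUE)

# Fraction of agreeing calls among rows labelled by both sources.
both <- !is.na(merged$subtype_PAM50.mRNA) & !is.na(merged$pam50_unc)
agreement <- mean(merged$subtype_PAM50.mRNA[both] == merged$pam50_unc[both])

# Fill only the gaps, leaving existing labels untouched.
merged$subtype_filled <- ifelse(is.na(merged$subtype_PAM50.mRNA),
                                merged$pam50_unc,
                                merged$subtype_PAM50.mRNA)
```

With real data the two sources would probably need their label spellings harmonized first (for example "LumA" versus "Luminal A"), and a patient-level join would also attach tumor subtypes to any normal-tissue samples from the same patient, so those rows would have to be filtered out separately.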
Thanks so much for your help and expertise,