Up until now I have used the UCSC Xenabrowser to access TCGA survival data. I decided to learn TCGAbiolinks because I thought it would be more fluid and automatable. I know from having done many survival analyses that from Xenabrowser I can access 371 survival data points (censoring and time of event). This is shown in this image https://photos.app.goo.gl/itfh5a1ppdYakWrz7 where OS.time is the time of the event, and is equal to either daystodeath or daystolastfollowup. OS is the event (1 for a death, zero for censoring). I downloaded the data and confirmed in excel that there were no duplicate patient IDs. Still a total of 371.
Here is a script that I wrote to reproduce these data using TCGAbiolinks: Forgive my hacky and verbose R coding, but I think that my comments should be informative. In the end there are only 343 survival data points! Is TCGAbiolinks not retrieving the data correctly? Perhaps there's a bug I'm not seeing and you have some advice for me to streamline my R code. Perhaps I'm not using TCGAbiolinks correctly. Any insight into this discrepancy would be greatly appreciated!
Code updated to be simpler, more legible
library("TCGAbiolinks") library("tidyverse") query <- GDCquery( project = "TCGA-LIHC", data.category = "Clinical", file.type = "xml", legacy = FALSE) GDCdownload(query,directory = ".") clinical <- GDCprepare_clinic(query, clinical.info = "patient",directory = ".") #getting the survival time of event data survival_data <- as_tibble(clinical[,c("days_to_last_followup","days_to_death","vital_status","bcr_patient_barcode","patient_id")]) survival_data <- filter(survival_data,!is.na(days_to_last_followup)|!is.na(days_to_death)) #not both NA survival_data <- filter(survival_data,is.na(days_to_last_followup)|days_to_last_followup>0)&is.na(days_to_death)|days_to_death > 0 )) #ensuring positive values survival_data <- survival_data[!duplicated(survival_data$patient_id),] #ensuring no duplicates dim(survival_data) #should be 371
Update: There are missing data when I take this approach as well
query <- GDCquery(project = "TCGA-LIHC", data.category = "Clinical", data.type = "Clinical Supplement", data.format = "BCR Biotab") GDCdownload(query) clinical.BCRtab.all <- GDCprepare(query) followup <- clinical.BCRtab.all$clinical_follow_up_v4.0_lihc