I face a problem with the output of the package TGCAbiolinks. My ultimate goal is to reanalyze the miRNA isoform data from the TCGA-LIHC project (data.type = "Isoform Expression Quantification"). I used the library TGCAbiolinks to download this dataset from the TCGA repository. The problem is that I ended up with a large table of dimensions 2082955 x 7, in which all data is 'mixed-up'. So the data is there, but I don't know how to reformat this table to obtain a suitable count table. To make things even more complex, the number of IDs is not the same for each sample....
I tried several things, but didn't get it to work. I would appreciate any suggestions!
As a note, it works fine when analyzing the precursor miRNA dataset (data.type = "miRNA Expression Quantification").
Main challenge: how to generate from the list 'splitted' below a count table that is:
- comprised of the miRNA IDs that are present in any of the samples (=union),
- while for multiple entries for the same ID the read counts are added per sample,
- and that for missing IDs in samples (but that are otherwise present in the union) these are set at 0.
??
# data as obtained by TCGA library: dim(data) [1] 2082955 7 > head(data) # A tibble: 6 x 7 miRNA_ID isoform_coords read_count reads_pe~ `cros~ miRNA~ barcode <chr> <chr> <int> <dbl> <chr> <chr> <chr> 1 hsa-let-7a-1 hg38:chr9:94175942-94175961:+ 1 0.271 N precu~ TCGA-D~ 2 hsa-let-7a-1 hg38:chr9:94175942-94175962:+ 3 0.814 N precu~ TCGA-D~ 3 hsa-let-7a-1 hg38:chr9:94175961-94175982:+ 1 0.271 N matur~ TCGA-D~ 4 hsa-let-7a-1 hg38:chr9:94175961-94175983:+ 1 0.271 N matur~ TCGA-D~ 5 hsa-let-7a-1 hg38:chr9:94175961-94175984:+ 18 4.89 N matur~ TCGA-D~ 6 hsa-let-7a-1 hg38:chr9:94175962-94175981:+ 120 32.6 N matur~ TCGA-D~ > # now extract/split the data for each barcode (subject). This will generate a list. barcode <- data$barcode splitted <- split.data.frame(data, barcode) #check str(splitted) #observe that for each sample number of isoforms is different! List of 425 $ TCGA-2V-A95S-01A-11R-A37G-13:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4814 obs. of 7 variables: ..$ miRNA_ID : chr [1:4814] "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" ... ..$ isoform_coords : chr [1:4814] "hg38:chr9:94175942-94175962:+" "hg38:chr9:94175943-94175962:+" ... ..$ read_count : int [1:4814] 1 1 3 3 4 194 7413 6192 9364 144 ... ..$ reads_per_million_miRNA_mapped: num [1:4814] 0.233 0.233 0.698 0.698 0.931 ... ..$ cross-mapped : chr [1:4814] "N" "N" "N" "N" ... ..$ miRNA_region : chr [1:4814] "precursor" "precursor" "mature,MIMAT0000062" "mature,MIMAT0000062" ... ..$ barcode : chr [1:4814] "TCGA-2V-A95S-01A-11R-A37G-13" "TCGA-2V-A95S-01A-11R-A37G-13" ... $ TCGA-2Y-A9GS-01A-12R-A38M-13:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 5881 obs. of 7 variables: ..$ miRNA_ID : chr [1:5881] "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" ... ..$ isoform_coords : chr [1:5881] "hg38:chr9:94175961-94175982:+" "hg38:chr9:94175961-94175983:+" ... ..$ read_count : int [1:5881] 6 11 18 180 7038 18595 36149 664 16 1 ... ..$ reads_per_million_miRNA_mapped: num [1:5881] 0.899 1.648 2.696 26.963 1054.259 ... ..$ cross-mapped : chr [1:5881] "N" "N" "N" "N" ... ..$ miRNA_region : chr [1:5881] "mature,MIMAT0000062" "mature,MIMAT0000062" "mature,MIMAT0000062" "mature,MIMAT0000062" ... ..$ barcode : chr [1:5881] "TCGA-2Y-A9GS-01A-12R-A38M-13" "TCGA-2Y-A9GS-01A-12R-A38M-13" ... $ TCGA-2Y-A9GT-01A-11R-A38M-13:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4006 obs. of 7 variables: ..$ miRNA_ID : chr [1:4006] "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" ... ..$ isoform_coords : chr [1:4006] "hg38:chr9:94175942-94175962:+" "hg38:chr9:94175961-94175980:+" ... ..$ read_count : int [1:4006] 1 1 1 7 5 175 8574 13607 19945 268 ... ..$ reads_per_million_miRNA_mapped: num [1:4006] 0.282 0.282 0.282 1.971 1.408 ... ..$ cross-mapped : chr [1:4006] "N" "N" "N" "N" ... ..$ miRNA_region : chr [1:4006] "precursor" "mature,MIMAT0000062" "mature,MIMAT0000062" "mature,MIMAT0000062" ... ..$ barcode : chr [1:4006] "TCGA-2Y-A9GT-01A-11R-A38M-13" "TCGA-2Y-A9GT-01A-11R-A38M-13" "TCGA-2Y-A9GT-01A-11R-A38M-13" "TCGA-2Y-A9GT-01A-11R-A38M-13" ... <<snip>> length(splitted) #[1] 425 = indeed correct number of samples!

For completeness the code that was used to generate the object
dataabove:library(TCGAbiolinks) library(dplyr) library(DT) library(SummarizedExperiment) # project: TCGA-LIHC # isoforms (data.type = "Isoform Expression Quantification") query <- GDCquery(project = c("TCGA-LIHC"), data.category = "Transcriptome Profiling", legacy = FALSE, data.type = "Isoform Expression Quantification") # view output datatable(getResults(query,), filter = 'top', options = list(scrollX = TRUE, keys = TRUE, pageLength = 100), rownames = FALSE) # extract 'subject' for additional query subject.id <- getResults(query, cols = "cases") # query again query <- GDCquery(project = c("TCGA-LIHC"), data.category = "Transcriptome Profiling", legacy = FALSE, data.type = "Isoform Expression Quantification", barcode = subject.id) #download data (note: this takes some time [~15m]) GDCdownload(query, method = "client", directory = "miRNA_dir") # read data and prepare R object data <- GDCprepare(query, directory="D:\\work\\miRNA_dir", summarizedExperiment = TRUE, save = TRUE, save.filename="TCGA-LIHC_miRNAdata.RData") dim(data) [1] 2082955 7 > sessionInfo() R version 3.4.2 Patched (2017-10-09 r73515) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 Matrix products: default locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] SummarizedExperiment_1.8.1 DelayedArray_0.4.1 [3] matrixStats_0.52.2 Biobase_2.38.0 [5] GenomicRanges_1.30.0 GenomeInfoDb_1.14.0 [7] IRanges_2.12.0 S4Vectors_0.16.0 [9] BiocGenerics_0.24.0 DT_0.2 [11] dplyr_0.7.4 TCGAbiolinks_2.7.16 >Hello,
what is the output you expect ?
There is no summarized experiment for this data, only a data frame. Also, as the rows in each file differ we cant do a table line counts for RNA-seq.
The output is a rbind of all tables, adding the sample it belongs.
Best regards,
Tiago
Hi. I am not so familiar with all the data formats present at TCGA, so initially I expected the output for the 'isoform' data also to be a count-based table (like for the precursor data that I first generated). However, due to the differences in rows I now understand all data is just appended to form a single table.
Using a combination of basic
Randdplyrcommands, I was able to generate a count table from the objectdata. For the archive I post my code below; likely not the most straightforward code but it worked for me. :)Guido
# object "data" contains 2082955 rows... # to generate count table I need to collapse on "miRNA_region" # 1) sum all counts per barcode for same MIMAT ID. data %>% group_by(barcode,miRNA_region) %>% summarise_at(vars(read_count),sum, na.rm = TRUE) -> data dim(data) #[1] 353665 3 # 2) delete all MIMAT IDs NOT being 'mature' data <- data[data$"miRNA_region" != "precursor" & data$"miRNA_region" != "stemloop" & data$"miRNA_region" != "unannotated", ] dim(data) #[1] 352391 3 # 3) now split per barcode ID; a list is generated splitted <- split.data.frame(data, data$"barcode") # define columns to keep in list for all samples. keep <- c("miRNA_region", "read_count") splitted <- lapply(splitted, function(x) x[keep]) # to make column names unique, add/paste barcode name to coulumn "read_count" for (i in 1:length(splitted)) { names(splitted[[i]])[2] <- paste( names(splitted[[i]])[2], names(splitted[i]), sep=".") } # Make the actual count table, a 'full join' is performed since the union of counts is required, # and if there are no matching values in any of the tables full_join returns <N A> for the one missing. splitted %>% Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2,by="miRNA_region"), .) -> count.table # set all NA's to zero. It is assumed the isoforms that # are not listed/present in a sample have zero counts. count.table[is.na(count.table)] <- 0 # to increase readability; rename/remove "read_count." from column names, # and 'mature,' from row names count.table <- as.data.frame(count.table) colnames(count.table) <- sub("read_count.", "", colnames(count.table)) count.table[,1] <- sub("mature,", "",count.table[,1]) # Save/export count table write.table(count.table, "TCGA-LIHC_miRNAdata_isoforms_countdata.txt", quote=FALSE, col.names=NA, sep="\t")Hi tia,
Could you please answer this post [How to download TCGA vcf file from GDC data portal?]
Thankyou