Question

how to reformat complex list object (derived with TCGAbiolinks)?

0

Entering edit mode

Guido Hooiveld ★ 4.1k

@guido-hooiveld-2020

Last seen 18 days ago

Wageningen University, Wageningen, the …

I face a problem with the output of the package TGCAbiolinks. My ultimate goal is to reanalyze the miRNA isoform data from the TCGA-LIHC project (data.type = "Isoform Expression Quantification"). I used the library TGCAbiolinks to download this dataset from the TCGA repository. The problem is that I ended up with a large table of dimensions 2082955 x 7, in which all data is 'mixed-up'. So the data is there, but I don't know how to reformat this table to obtain a suitable count table. To make things even more complex, the number of IDs is not the same for each sample....

I tried several things, but didn't get it to work. I would appreciate any suggestions!

As a note, it works fine when analyzing the precursor miRNA dataset (data.type = "miRNA Expression Quantification").

Main challenge: how to generate from the list 'splitted' below a count table that is:

- comprised of the miRNA IDs that are present in any of the samples (=union),
- while for multiple entries for the same ID the read counts are added per sample,
- and that for missing IDs in samples (but that are otherwise present in the union) these are set at 0.

??

# data as obtained by TCGA library:
dim(data)
[1] 2082955       7

> head(data)
# A tibble: 6 x 7
  miRNA_ID     isoform_coords                read_count reads_pe~ `cros~ miRNA~ barcode
  <chr>        <chr>                              <int>     <dbl> <chr>  <chr>  <chr>  
1 hsa-let-7a-1 hg38:chr9:94175942-94175961:+          1     0.271 N      precu~ TCGA-D~
2 hsa-let-7a-1 hg38:chr9:94175942-94175962:+          3     0.814 N      precu~ TCGA-D~
3 hsa-let-7a-1 hg38:chr9:94175961-94175982:+          1     0.271 N      matur~ TCGA-D~
4 hsa-let-7a-1 hg38:chr9:94175961-94175983:+          1     0.271 N      matur~ TCGA-D~
5 hsa-let-7a-1 hg38:chr9:94175961-94175984:+         18     4.89  N      matur~ TCGA-D~
6 hsa-let-7a-1 hg38:chr9:94175962-94175981:+        120    32.6   N      matur~ TCGA-D~
>

# now extract/split the data for each barcode (subject). This will generate a list.
barcode <- data$barcode

splitted <- split.data.frame(data, barcode)
#check
str(splitted)        #observe that for each sample number of isoforms is different!
List of 425
 $ TCGA-2V-A95S-01A-11R-A37G-13:Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       4814 obs. of  7 variables:
  ..$ miRNA_ID                      : chr [1:4814] "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" ...
  ..$ isoform_coords                : chr [1:4814] "hg38:chr9:94175942-94175962:+" "hg38:chr9:94175943-94175962:+" ...
  ..$ read_count                    : int [1:4814] 1 1 3 3 4 194 7413 6192 9364 144 ...
  ..$ reads_per_million_miRNA_mapped: num [1:4814] 0.233 0.233 0.698 0.698 0.931 ...
  ..$ cross-mapped                  : chr [1:4814] "N" "N" "N" "N" ...
  ..$ miRNA_region                  : chr [1:4814] "precursor" "precursor" "mature,MIMAT0000062" "mature,MIMAT0000062" ...
  ..$ barcode                       : chr [1:4814] "TCGA-2V-A95S-01A-11R-A37G-13" "TCGA-2V-A95S-01A-11R-A37G-13" ...
 $ TCGA-2Y-A9GS-01A-12R-A38M-13:Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       5881 obs. of  7 variables:
  ..$ miRNA_ID                      : chr [1:5881] "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" ...
  ..$ isoform_coords                : chr [1:5881] "hg38:chr9:94175961-94175982:+" "hg38:chr9:94175961-94175983:+" ...
  ..$ read_count                    : int [1:5881] 6 11 18 180 7038 18595 36149 664 16 1 ...
  ..$ reads_per_million_miRNA_mapped: num [1:5881] 0.899 1.648 2.696 26.963 1054.259 ...
  ..$ cross-mapped                  : chr [1:5881] "N" "N" "N" "N" ...
  ..$ miRNA_region                  : chr [1:5881] "mature,MIMAT0000062" "mature,MIMAT0000062" "mature,MIMAT0000062" "mature,MIMAT0000062" ...
  ..$ barcode                       : chr [1:5881] "TCGA-2Y-A9GS-01A-12R-A38M-13" "TCGA-2Y-A9GS-01A-12R-A38M-13"  ...
 $ TCGA-2Y-A9GT-01A-11R-A38M-13:Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       4006 obs. of  7 variables:
  ..$ miRNA_ID                      : chr [1:4006] "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" ...
  ..$ isoform_coords                : chr [1:4006] "hg38:chr9:94175942-94175962:+" "hg38:chr9:94175961-94175980:+" ...
  ..$ read_count                    : int [1:4006] 1 1 1 7 5 175 8574 13607 19945 268 ...
  ..$ reads_per_million_miRNA_mapped: num [1:4006] 0.282 0.282 0.282 1.971 1.408 ...
  ..$ cross-mapped                  : chr [1:4006] "N" "N" "N" "N" ...
  ..$ miRNA_region                  : chr [1:4006] "precursor" "mature,MIMAT0000062" "mature,MIMAT0000062" "mature,MIMAT0000062" ...
  ..$ barcode                       : chr [1:4006] "TCGA-2Y-A9GT-01A-11R-A38M-13" "TCGA-2Y-A9GT-01A-11R-A38M-13" "TCGA-2Y-A9GT-01A-11R-A38M-13" "TCGA-2Y-A9GT-01A-11R-A38M-13" ...
<<snip>>

length(splitted)    #[1] 425 = indeed correct number of samples!

TCGA tcgabiolinks list reformat • 1.8k views

ADD COMMENT • link 7.0 years ago Guido Hooiveld ★ 4.1k

0

Entering edit mode

For completeness the code that was used to generate the object data above:

library(TCGAbiolinks)
library(dplyr)
library(DT)
library(SummarizedExperiment)

# project: TCGA-LIHC

# isoforms (data.type = "Isoform Expression Quantification")
query <- GDCquery(project = c("TCGA-LIHC"),
                  data.category = "Transcriptome Profiling",
                  legacy = FALSE,
          data.type = "Isoform Expression Quantification")

# view output
datatable(getResults(query,),
              filter = 'top',
              options = list(scrollX = TRUE, keys = TRUE, pageLength = 100),
              rownames = FALSE)

# extract 'subject' for additional query
subject.id <- getResults(query, cols = "cases")

# query again
query <- GDCquery(project = c("TCGA-LIHC"),
                  data.category = "Transcriptome Profiling",
                  legacy = FALSE,
          data.type = "Isoform Expression Quantification",
          barcode = subject.id)

#download data (note: this takes some time [~15m])
GDCdownload(query, method = "client", directory = "miRNA_dir")

# read data and prepare R object
data <- GDCprepare(query, directory="D:\\work\\miRNA_dir", summarizedExperiment = TRUE, save = TRUE, save.filename="TCGA-LIHC_miRNAdata.RData")

dim(data)
[1] 2082955       7

> sessionInfo()
R version 3.4.2 Patched (2017-10-09 r73515)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] SummarizedExperiment_1.8.1 DelayedArray_0.4.1        
 [3] matrixStats_0.52.2         Biobase_2.38.0            
 [5] GenomicRanges_1.30.0       GenomeInfoDb_1.14.0       
 [7] IRanges_2.12.0             S4Vectors_0.16.0          
 [9] BiocGenerics_0.24.0        DT_0.2                    
[11] dplyr_0.7.4                TCGAbiolinks_2.7.16       

>

ADD REPLY • link 7.0 years ago Guido Hooiveld ★ 4.1k

0

Entering edit mode

Hello,

what is the output you expect ?

There is no summarized experiment for this data, only a data frame. Also, as the rows in each file differ we cant do a table line counts for RNA-seq.

The output is a rbind of all tables, adding the sample it belongs.

> head(data)
# A tibble: 6 x 7
miRNA_ID isoform_coords read_count reads_per_million_miRNA_mapped `cross-mapped` miRNA_region barcode               
<chr>        <chr>       <int>             <dbl>              <chr>        <chr>   <chr>                      
hsa-let-7a-1 hg38:chr9:94175942-94175961:+ 1 0.271 N  precursor  TCGA-DD-A1EJ-01A-11R-A154-13
hsa-let-7a-1 hg38:chr9:94175942-94175962:+ 3 0.814 N  precursor  TCGA-DD-A1EJ-01A-11R-A154-13
hsa-let-7a-1 hg38:chr9:94175961-94175982:+ 1 0.271 N  mature,MIMAT0000062 TCGA-DD-A1EJ-01A-11R-A154-13
hsa-let-7a-1 hg38:chr9:94175961-94175983:+ 1 0.271 N  mature,MIMAT0000062 TCGA-DD-A1EJ-01A-11R-A154-13
hsa-let-7a-1 hg38:chr9:94175961-94175984:+ 18 4.89 N mature,MIMAT0000062 TCGA-DD-A1EJ-01A-11R-A154-13
hsa-let-7a-1 hg38:chr9:94175962-94175981:+ 120 32.6 N  mature,MIMAT0000062 TCGA-DD-A1EJ-01A-11R-A154-13

Best regards,

Tiago

ADD REPLY • link 7.0 years ago Tiago Chedraoui Silva ▴ 260

0

Entering edit mode

Hi. I am not so familiar with all the data formats present at TCGA, so initially I expected the output for the 'isoform' data also to be a count-based table (like for the precursor data that I first generated). However, due to the differences in rows I now understand all data is just appended to form a single table.

Using a combination of basic R and dplyr commands, I was able to generate a count table from the object data. For the archive I post my code below; likely not the most straightforward code but it worked for me. :)

Guido

# object "data" contains 2082955 rows...
# to generate count table I need to collapse on "miRNA_region"

# 1) sum all counts per barcode for same MIMAT ID.
data %>%
  group_by(barcode,miRNA_region) %>%
  summarise_at(vars(read_count),sum, na.rm = TRUE) -> data

dim(data)
#[1] 353665      3

# 2) delete all MIMAT IDs NOT being 'mature'
data <- data[data$"miRNA_region" != "precursor" & data$"miRNA_region" != "stemloop" & data$"miRNA_region" != "unannotated", ]

dim(data)
#[1] 352391      3

# 3) now split per barcode ID; a list is generated
splitted <- split.data.frame(data, data$"barcode")

# define columns to keep in list for all samples.
keep <- c("miRNA_region", "read_count")
splitted <- lapply(splitted, function(x) x[keep])

# to make column names unique, add/paste barcode name to coulumn "read_count"
for (i in 1:length(splitted)) {
    names(splitted[[i]])[2] <- paste( names(splitted[[i]])[2], names(splitted[i]), sep=".")
    }

# Make the actual count table, a 'full join' is performed since the union of counts is required,
# and if there are no matching values in any of the tables full_join returns <N A> for the one missing.

splitted %>%
    Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2,by="miRNA_region"), .) -> count.table

# set all NA's to zero. It is assumed the isoforms that
# are not listed/present in a sample have zero counts.
count.table[is.na(count.table)] <- 0

# to increase readability; rename/remove "read_count." from column names,
# and 'mature,' from row names
count.table <- as.data.frame(count.table)
colnames(count.table) <- sub("read_count.", "", colnames(count.table))
count.table[,1] <-  sub("mature,", "",count.table[,1])

# Save/export count table
write.table(count.table, "TCGA-LIHC_miRNAdata_isoforms_countdata.txt", quote=FALSE, col.names=NA, sep="\t")

ADD REPLY • link 7.0 years ago Guido Hooiveld ★ 4.1k

0

Entering edit mode

Hi tia,

Could you please answer this post [How to download TCGA vcf file from GDC data portal?]

Thankyou

ADD REPLY • link 7.0 years ago Biologist ▴ 120