how to reformat complex list object (derived with TCGAbiolinks)?
0
0
Entering edit mode
Guido Hooiveld ★ 4.1k
@guido-hooiveld-2020
Last seen 18 days ago
Wageningen University, Wageningen, the …

I face a problem with the output of the package TGCAbiolinks. My ultimate goal is to reanalyze the miRNA isoform data from the TCGA-LIHC project (data.type = "Isoform Expression Quantification"). I used the library TGCAbiolinks to download this dataset from the TCGA repository. The problem is that I ended up with a large table of dimensions 2082955 x 7, in which all data is 'mixed-up'. So the data is there, but I don't know how to reformat this table to obtain a suitable count table. To make things even more complex, the number of IDs is not the same for each sample....

I tried several things, but didn't get it to work. I would appreciate any suggestions!

As a note, it works fine when analyzing the precursor miRNA dataset (data.type = "miRNA Expression Quantification").

 

Main challenge: how to generate from the list 'splitted' below  a count table that is:

- comprised of the miRNA IDs that are present in any of the samples (=union),
- while for multiple entries for the same ID the read counts are added per sample,
- and that for missing IDs in samples (but that are otherwise present in the union) these are set at 0.

??

 

 

# data as obtained by TCGA library:
dim(data)
[1] 2082955       7

> head(data)
# A tibble: 6 x 7
  miRNA_ID     isoform_coords                read_count reads_pe~ `cros~ miRNA~ barcode
  <chr>        <chr>                              <int>     <dbl> <chr>  <chr>  <chr>  
1 hsa-let-7a-1 hg38:chr9:94175942-94175961:+          1     0.271 N      precu~ TCGA-D~
2 hsa-let-7a-1 hg38:chr9:94175942-94175962:+          3     0.814 N      precu~ TCGA-D~
3 hsa-let-7a-1 hg38:chr9:94175961-94175982:+          1     0.271 N      matur~ TCGA-D~
4 hsa-let-7a-1 hg38:chr9:94175961-94175983:+          1     0.271 N      matur~ TCGA-D~
5 hsa-let-7a-1 hg38:chr9:94175961-94175984:+         18     4.89  N      matur~ TCGA-D~
6 hsa-let-7a-1 hg38:chr9:94175962-94175981:+        120    32.6   N      matur~ TCGA-D~
>

# now extract/split the data for each barcode (subject). This will generate a list.
barcode <- data$barcode

splitted <- split.data.frame(data, barcode)
#check
str(splitted)        #observe that for each sample number of isoforms is different!
List of 425
 $ TCGA-2V-A95S-01A-11R-A37G-13:Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       4814 obs. of  7 variables:
  ..$ miRNA_ID                      : chr [1:4814] "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" ...
  ..$ isoform_coords                : chr [1:4814] "hg38:chr9:94175942-94175962:+" "hg38:chr9:94175943-94175962:+" ...
  ..$ read_count                    : int [1:4814] 1 1 3 3 4 194 7413 6192 9364 144 ...
  ..$ reads_per_million_miRNA_mapped: num [1:4814] 0.233 0.233 0.698 0.698 0.931 ...
  ..$ cross-mapped                  : chr [1:4814] "N" "N" "N" "N" ...
  ..$ miRNA_region                  : chr [1:4814] "precursor" "precursor" "mature,MIMAT0000062" "mature,MIMAT0000062" ...
  ..$ barcode                       : chr [1:4814] "TCGA-2V-A95S-01A-11R-A37G-13" "TCGA-2V-A95S-01A-11R-A37G-13" ...
 $ TCGA-2Y-A9GS-01A-12R-A38M-13:Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       5881 obs. of  7 variables:
  ..$ miRNA_ID                      : chr [1:5881] "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" ...
  ..$ isoform_coords                : chr [1:5881] "hg38:chr9:94175961-94175982:+" "hg38:chr9:94175961-94175983:+" ...
  ..$ read_count                    : int [1:5881] 6 11 18 180 7038 18595 36149 664 16 1 ...
  ..$ reads_per_million_miRNA_mapped: num [1:5881] 0.899 1.648 2.696 26.963 1054.259 ...
  ..$ cross-mapped                  : chr [1:5881] "N" "N" "N" "N" ...
  ..$ miRNA_region                  : chr [1:5881] "mature,MIMAT0000062" "mature,MIMAT0000062" "mature,MIMAT0000062" "mature,MIMAT0000062" ...
  ..$ barcode                       : chr [1:5881] "TCGA-2Y-A9GS-01A-12R-A38M-13" "TCGA-2Y-A9GS-01A-12R-A38M-13"  ...
 $ TCGA-2Y-A9GT-01A-11R-A38M-13:Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       4006 obs. of  7 variables:
  ..$ miRNA_ID                      : chr [1:4006] "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" ...
  ..$ isoform_coords                : chr [1:4006] "hg38:chr9:94175942-94175962:+" "hg38:chr9:94175961-94175980:+" ...
  ..$ read_count                    : int [1:4006] 1 1 1 7 5 175 8574 13607 19945 268 ...
  ..$ reads_per_million_miRNA_mapped: num [1:4006] 0.282 0.282 0.282 1.971 1.408 ...
  ..$ cross-mapped                  : chr [1:4006] "N" "N" "N" "N" ...
  ..$ miRNA_region                  : chr [1:4006] "precursor" "mature,MIMAT0000062" "mature,MIMAT0000062" "mature,MIMAT0000062" ...
  ..$ barcode                       : chr [1:4006] "TCGA-2Y-A9GT-01A-11R-A38M-13" "TCGA-2Y-A9GT-01A-11R-A38M-13" "TCGA-2Y-A9GT-01A-11R-A38M-13" "TCGA-2Y-A9GT-01A-11R-A38M-13" ...
<<snip>>

length(splitted)    #[1] 425 = indeed correct number of samples!
TCGA tcgabiolinks list reformat • 1.8k views
ADD COMMENT
0
Entering edit mode

For completeness the code that was used to generate the object data above:

library(TCGAbiolinks)
library(dplyr)
library(DT)
library(SummarizedExperiment)

# project: TCGA-LIHC

# isoforms (data.type = "Isoform Expression Quantification")
query <- GDCquery(project = c("TCGA-LIHC"),
                  data.category = "Transcriptome Profiling",
                  legacy = FALSE,
          data.type = "Isoform Expression Quantification")

# view output
datatable(getResults(query,),
              filter = 'top',
              options = list(scrollX = TRUE, keys = TRUE, pageLength = 100),
              rownames = FALSE)

# extract 'subject' for additional query
subject.id <- getResults(query, cols = "cases")

# query again
query <- GDCquery(project = c("TCGA-LIHC"),
                  data.category = "Transcriptome Profiling",
                  legacy = FALSE,
          data.type = "Isoform Expression Quantification",
          barcode = subject.id)

#download data (note: this takes some time [~15m])
GDCdownload(query, method = "client", directory = "miRNA_dir")

# read data and prepare R object
data <- GDCprepare(query, directory="D:\\work\\miRNA_dir", summarizedExperiment = TRUE, save = TRUE, save.filename="TCGA-LIHC_miRNAdata.RData")

dim(data)
[1] 2082955       7

> sessionInfo()
R version 3.4.2 Patched (2017-10-09 r73515)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] SummarizedExperiment_1.8.1 DelayedArray_0.4.1        
 [3] matrixStats_0.52.2         Biobase_2.38.0            
 [5] GenomicRanges_1.30.0       GenomeInfoDb_1.14.0       
 [7] IRanges_2.12.0             S4Vectors_0.16.0          
 [9] BiocGenerics_0.24.0        DT_0.2                    
[11] dplyr_0.7.4                TCGAbiolinks_2.7.16       

>
ADD REPLY
0
Entering edit mode

Hello,

what is the output you expect ?

There is no summarized experiment for this data, only a data frame. Also, as the rows in each file differ we cant do a table line counts for RNA-seq.

The output is a rbind of all tables, adding the sample it belongs.

 

> head(data)
# A tibble: 6 x 7
miRNA_ID isoform_coords read_count reads_per_million_miRNA_mapped `cross-mapped` miRNA_region barcode               
<chr>        <chr>       <int>             <dbl>              <chr>        <chr>   <chr>                      
hsa-let-7a-1 hg38:chr9:94175942-94175961:+ 1 0.271 N  precursor  TCGA-DD-A1EJ-01A-11R-A154-13
hsa-let-7a-1 hg38:chr9:94175942-94175962:+ 3 0.814 N  precursor  TCGA-DD-A1EJ-01A-11R-A154-13
hsa-let-7a-1 hg38:chr9:94175961-94175982:+ 1 0.271 N  mature,MIMAT0000062 TCGA-DD-A1EJ-01A-11R-A154-13
hsa-let-7a-1 hg38:chr9:94175961-94175983:+ 1 0.271 N  mature,MIMAT0000062 TCGA-DD-A1EJ-01A-11R-A154-13
hsa-let-7a-1 hg38:chr9:94175961-94175984:+ 18 4.89 N mature,MIMAT0000062 TCGA-DD-A1EJ-01A-11R-A154-13
hsa-let-7a-1 hg38:chr9:94175962-94175981:+ 120 32.6 N  mature,MIMAT0000062 TCGA-DD-A1EJ-01A-11R-A154-13

Best regards,

Tiago

 

ADD REPLY
0
Entering edit mode

Hi. I am not so familiar with all the data formats present at TCGA, so initially I expected the output for the 'isoform' data also to be a count-based table (like for the precursor data that I first generated). However, due to the differences in rows I now understand all data is just appended to form a single table.

Using a combination of basic R and dplyr commands, I was able to generate a count table from the object data. For the archive I post my code below; likely not the most straightforward code but it worked for me. :)

Guido

 

# object "data" contains 2082955 rows...
# to generate count table I need to collapse on "miRNA_region"

# 1) sum all counts per barcode for same MIMAT ID.
data %>%
  group_by(barcode,miRNA_region) %>%
  summarise_at(vars(read_count),sum, na.rm = TRUE) -> data

dim(data)
#[1] 353665      3

# 2) delete all MIMAT IDs NOT being 'mature'
data <- data[data$"miRNA_region" != "precursor" & data$"miRNA_region" != "stemloop" & data$"miRNA_region" != "unannotated", ]

dim(data)
#[1] 352391      3

# 3) now split per barcode ID; a list is generated
splitted <- split.data.frame(data, data$"barcode")

# define columns to keep in list for all samples.
keep <- c("miRNA_region", "read_count")
splitted <- lapply(splitted, function(x) x[keep])

# to make column names unique, add/paste barcode name to coulumn "read_count"
for (i in 1:length(splitted)) {
    names(splitted[[i]])[2] <- paste( names(splitted[[i]])[2], names(splitted[i]), sep=".")
    }

# Make the actual count table, a 'full join' is performed since the union of counts is required,
# and if there are no matching values in any of the tables full_join returns <N A> for the one missing.

splitted %>%
    Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2,by="miRNA_region"), .) -> count.table

# set all NA's to zero. It is assumed the isoforms that
# are not listed/present in a sample have zero counts.
count.table[is.na(count.table)] <- 0

# to increase readability; rename/remove "read_count." from column names,
# and 'mature,' from row names
count.table <- as.data.frame(count.table)
colnames(count.table) <- sub("read_count.", "", colnames(count.table))
count.table[,1] <-  sub("mature,", "",count.table[,1])

# Save/export count table
write.table(count.table, "TCGA-LIHC_miRNAdata_isoforms_countdata.txt", quote=FALSE, col.names=NA, sep="\t")

 

ADD REPLY
0
Entering edit mode

Hi tia,

Could you please answer this post [How to download TCGA vcf file from GDC data portal?]

Thankyou

ADD REPLY

Login before adding your answer.

Traffic: 748 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6