I have a problem with the output of the package TCGAbiolinks. My ultimate goal is to reanalyze the miRNA isoform data from the TCGA-LIHC project (data.type = "Isoform Expression Quantification"). I used TCGAbiolinks to download this dataset from the TCGA repository. The problem is that I ended up with a large table of dimensions 2082955 x 7, in which all data is 'mixed-up'. So the data is there, but I don't know how to reformat this table to obtain a suitable count table. To make things even more complex, the number of IDs is not the same for each sample.
I tried several things, but didn't get it to work. I would appreciate any suggestions!
As a note, it works fine when analyzing the precursor miRNA dataset (data.type = "miRNA Expression Quantification").
Main challenge: how do I generate, from the list 'splitted' below, a count table that:
- comprises the miRNA IDs that are present in any of the samples (= union),
- sums the read counts per sample when there are multiple entries for the same ID,
- and has a count of 0 for IDs that are missing in a sample (but otherwise present in the union)?
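The three requirements above (union of IDs, summed counts per sample, zeros for missing IDs) can be sketched with dplyr and tidyr. This is only a minimal illustration on a toy table: the values and the short barcodes `S1`/`S2` are made up, and only the column names `miRNA_ID`, `read_count` and `barcode` are taken from the real object. It assumes dplyr >= 1.0.0 and tidyr >= 1.1.0, and it works on the long table directly, so the split into a list is not needed.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the long table (hypothetical values, same column names).
data <- tibble(
  miRNA_ID   = c("hsa-let-7a-1", "hsa-let-7a-1", "hsa-mir-21",
                 "hsa-let-7a-1", "hsa-mir-122"),
  read_count = c(1L, 3L, 10L, 5L, 7L),
  barcode    = c("S1", "S1", "S1", "S2", "S2")
)

count_table <- data %>%
  # one row per ID/sample pair, adding up the reads of all isoforms
  group_by(miRNA_ID, barcode) %>%
  summarise(read_count = sum(read_count), .groups = "drop") %>%
  # one column per sample; ID/sample pairs that never occur become 0
  pivot_wider(names_from = barcode, values_from = read_count,
              values_fill = 0L)

# Optional: plain matrix with the miRNA IDs as row names.
count_matrix <- as.matrix(count_table[, -1])
rownames(count_matrix) <- count_table$miRNA_ID
```

Base R should give the same table in one call with `xtabs(read_count ~ miRNA_ID + barcode, data = data)`. To count per isoform or per mature miRNA instead of per precursor ID, group on `isoform_coords` or `miRNA_region` instead of `miRNA_ID`.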
# data as obtained by TCGA library:
dim(data)
[1] 2082955       7
> head(data)
# A tibble: 6 x 7
  miRNA_ID     isoform_coords                read_count reads_pe~ `cros~ miRNA~ barcode
  <chr>        <chr>                              <int>     <dbl> <chr>  <chr>  <chr>
1 hsa-let-7a-1 hg38:chr9:94175942-94175961:+          1     0.271 N      precu~ TCGA-D~
2 hsa-let-7a-1 hg38:chr9:94175942-94175962:+          3     0.814 N      precu~ TCGA-D~
3 hsa-let-7a-1 hg38:chr9:94175961-94175982:+          1     0.271 N      matur~ TCGA-D~
4 hsa-let-7a-1 hg38:chr9:94175961-94175983:+          1     0.271 N      matur~ TCGA-D~
5 hsa-let-7a-1 hg38:chr9:94175961-94175984:+         18     4.89  N      matur~ TCGA-D~
6 hsa-let-7a-1 hg38:chr9:94175962-94175981:+        120    32.6   N      matur~ TCGA-D~
>

# now extract/split the data for each barcode (subject). This will generate a list.
barcode <- data$barcode
splitted <- split.data.frame(data, barcode)

#check
str(splitted)
#observe that for each sample number of isoforms is different!
List of 425
 $ TCGA-2V-A95S-01A-11R-A37G-13:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4814 obs. of 7 variables:
  ..$ miRNA_ID                      : chr [1:4814] "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" ...
  ..$ isoform_coords                : chr [1:4814] "hg38:chr9:94175942-94175962:+" "hg38:chr9:94175943-94175962:+" ...
  ..$ read_count                    : int [1:4814] 1 1 3 3 4 194 7413 6192 9364 144 ...
  ..$ reads_per_million_miRNA_mapped: num [1:4814] 0.233 0.233 0.698 0.698 0.931 ...
  ..$ cross-mapped                  : chr [1:4814] "N" "N" "N" "N" ...
  ..$ miRNA_region                  : chr [1:4814] "precursor" "precursor" "mature,MIMAT0000062" "mature,MIMAT0000062" ...
  ..$ barcode                       : chr [1:4814] "TCGA-2V-A95S-01A-11R-A37G-13" "TCGA-2V-A95S-01A-11R-A37G-13" ...
 $ TCGA-2Y-A9GS-01A-12R-A38M-13:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 5881 obs. of 7 variables:
  ..$ miRNA_ID                      : chr [1:5881] "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" ...
  ..$ isoform_coords                : chr [1:5881] "hg38:chr9:94175961-94175982:+" "hg38:chr9:94175961-94175983:+" ...
  ..$ read_count                    : int [1:5881] 6 11 18 180 7038 18595 36149 664 16 1 ...
  ..$ reads_per_million_miRNA_mapped: num [1:5881] 0.899 1.648 2.696 26.963 1054.259 ...
  ..$ cross-mapped                  : chr [1:5881] "N" "N" "N" "N" ...
  ..$ miRNA_region                  : chr [1:5881] "mature,MIMAT0000062" "mature,MIMAT0000062" "mature,MIMAT0000062" "mature,MIMAT0000062" ...
  ..$ barcode                       : chr [1:5881] "TCGA-2Y-A9GS-01A-12R-A38M-13" "TCGA-2Y-A9GS-01A-12R-A38M-13" ...
 $ TCGA-2Y-A9GT-01A-11R-A38M-13:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4006 obs. of 7 variables:
  ..$ miRNA_ID                      : chr [1:4006] "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" "hsa-let-7a-1" ...
  ..$ isoform_coords                : chr [1:4006] "hg38:chr9:94175942-94175962:+" "hg38:chr9:94175961-94175980:+" ...
  ..$ read_count                    : int [1:4006] 1 1 1 7 5 175 8574 13607 19945 268 ...
  ..$ reads_per_million_miRNA_mapped: num [1:4006] 0.282 0.282 0.282 1.971 1.408 ...
  ..$ cross-mapped                  : chr [1:4006] "N" "N" "N" "N" ...
  ..$ miRNA_region                  : chr [1:4006] "precursor" "mature,MIMAT0000062" "mature,MIMAT0000062" "mature,MIMAT0000062" ...
  ..$ barcode                       : chr [1:4006] "TCGA-2Y-A9GT-01A-11R-A38M-13" "TCGA-2Y-A9GT-01A-11R-A38M-13" "TCGA-2Y-A9GT-01A-11R-A38M-13" "TCGA-2Y-A9GT-01A-11R-A38M-13" ...
<<snip>>

length(splitted)
#[1] 425 = indeed correct number of samples!
For completeness, the code that was used to generate the object data above:

Hello,
What output do you expect?
There is no SummarizedExperiment for this data, only a data frame. Also, because the rows in each file differ, we can't build a count table the way we do for RNA-seq.
The output is an rbind of all tables, with a column added for the sample each row belongs to.
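To make the shape of that output concrete, here is a small sketch of row-binding per-sample tables while recording the sample in an extra column. It is an illustration of the idea only, not the package's internal code; the list `per_sample`, its values and the barcodes `S1`/`S2` are hypothetical.

```r
library(dplyr)

# Two per-sample tables with different numbers of rows (made-up values).
per_sample <- list(
  S1 = tibble(miRNA_ID   = c("hsa-let-7a-1", "hsa-mir-21"),
              read_count = c(4L, 10L)),
  S2 = tibble(miRNA_ID   = c("hsa-let-7a-1", "hsa-let-7a-1", "hsa-mir-122"),
              read_count = c(2L, 3L, 7L))
)

# Stack the tables; the list names end up in a new "barcode" column.
stacked <- bind_rows(per_sample, .id = "barcode")

# Number of rows contributed by each sample.
rows_per_sample <- table(stacked$barcode)
```

Because each sample contributes a different set of rows, the stacked long table is the natural common format, and any wide count table has to be derived from it afterwards.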
Best regards,
Tiago
Hi. I am not so familiar with all the data formats present at TCGA, so initially I expected the output for the 'isoform' data also to be a count-based table (like for the precursor data that I first generated). However, due to the differences in rows I now understand all data is just appended to form a single table.
Using a combination of basic R and dplyr commands, I was able to generate a count table from the object data. For the archive I post my code below; likely not the most straightforward code, but it worked for me. :)
Guido
Hi tia,
Could you please answer this post [How to download TCGA vcf file from GDC data portal?]
Thank you