Question

Problem downloading TCGA data with the TCGAbiolinks package

modarzi ▴ 10

Hi, to download TCGA-SARC transcriptome data from GDC, I used the TCGAbiolinks package with the code below:

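library(TCGAbiolinks)
library(SummarizedExperiment)

# Query FPKM gene expression for primary tumors in TCGA-SARC
query <- GDCquery(project = "TCGA-SARC",
                  sample.type = "Primary solid Tumor",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - FPKM")
GDCdownload(query)           # download the queried files from GDC
data <- GDCprepare(query)    # read the files into a SummarizedExperiment
data1 <- assay(data)         # expression matrix (genes x samples)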

My SARC gene expression data has only 56512 genes, each with a clean Ensembl ID. I checked gencode.v22 and found 60483 genes there. I then extracted the protein-coding genes from the gencode.v22 annotation file, which gave 19814. When I intersect the 56512 Ensembl IDs from my SARC expression data with the 19814 protein-coding genes, I get only 19509 protein-coding genes in my expression data. I am really concerned about the 305 protein-coding genes (19814 - 19509) that are lost in my study. I would appreciate any comments on whether my code is valid and on the 305 lost protein-coding genes. Best

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
 [1] TCGAbiolinks_2.14.0 sva_3.34.0 genefilter_1.68.0
 [4] mgcv_1.8-30 nlme_3.1-141 SummarizedExperiment_1.16.0
 [7] DelayedArray_0.12.0 BiocParallel_1.20.0 matrixStats_0.55.0
[10] Biobase_2.46.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.0
[13] IRanges_2.20.0 S4Vectors_0.24.0 BiocGenerics_0.32.0
[16] stringr_1.4.0 dplyr_0.8.3 Hmisc_4.2-0
[19] ggplot2_3.2.1 Formula_1.2-3 survival_2.44-1.1
[22] lattice_0.20-38 impute_1.60.0 cluster_2.1.0
[25] class_7.3-15 MASS_7.3-51.4 sqldf_0.4-11
[28] RSQLite_2.1.2 gsubfn_0.7 proto_1.0.0
[31] WGCNA_1.68 fastcluster_1.1.25 dynamicTreeCut_1.63-1

loaded via a namespace (and not attached):
  [1] backports_1.1.5 circlize_0.4.8 aroma.light_3.16.0
  [4] BiocFileCache_1.10.0 plyr_1.8.4 selectr_0.4-1
  [7] ConsensusClusterPlus_1.50.0 lazyeval_0.2.2 splines_3.6.1
 [10] robust_0.4-18.1 digest_0.6.23 foreach_1.4.7
 [13] htmltools_0.4.0 GO.db_3.10.0 magrittr_1.5
 [16] checkmate_1.9.4 memoise_1.1.0 fit.models_0.5-14
 [19] doParallel_1.0.15 limma_3.42.0 ComplexHeatmap_2.2.0
 [22] Biostrings_2.54.0 readr_1.3.1 annotate_1.64.0
 [25] R.utils_2.9.0 askpass_1.1 prettyunits_1.0.2
 [28] colorspace_1.4-1 rvest_0.3.4 ggrepel_0.8.1
 [31] blob_1.2.0 rappdirs_0.3.1 rrcov_1.4-7
 [34] xfun_0.10 jsonlite_1.6 tcltk_3.6.1
 [37] crayon_1.3.4 RCurl_1.95-4.12 zeallot_0.1.0
 [40] zoo_1.8-6 iterators_1.0.12 glue_1.3.1
 [43] survminer_0.4.6 gtable_0.3.0 zlibbioc_1.32.0
 [46] XVector_0.26.0 GetoptLong_0.1.7 shape_1.4.4
 [49] DEoptimR_1.0-8 scales_1.0.0 DESeq_1.38.0
 [52] mvtnorm_1.0-11 edgeR_3.28.0 DBI_1.0.0
 [55] ggthemes_4.2.0 Rcpp_1.0.3 xtable_1.8-4
 [58] progress_1.2.2 htmlTable_1.13.2 clue_0.3-57
 [61] matlab_1.0.2 foreign_0.8-72 bit_1.1-14
 [64] km.ci_0.5-2 preprocessCore_1.48.0 htmlwidgets_1.5.1
 [67] httr_1.4.1 RColorBrewer_1.1-2 acepack_1.4.1
 [70] pkgconfig_2.0.3 XML_3.98-1.20 R.methodsS3_1.7.1
 [73] nnet_7.3-12 dbplyr_1.4.2 locfit_1.5-9.1
 [76] labeling_0.3 tidyselect_0.2.5 rlang_0.4.2
 [79] AnnotationDbi_1.48.0 munsell_0.5.0 tools_3.6.1
 [82] downloader_0.4 generics_0.0.2 broom_0.5.2
 [85] knitr_1.25 bit64_0.9-7 robustbase_0.93-5
 [88] survMisc_0.5.5 purrr_0.3.3 EDASeq_2.20.0
 [91] R.oo_1.23.0 xml2_1.2.2 biomaRt_2.42.0
 [94] compiler_3.6.1 rstudioapi_0.10 curl_4.2
 [97] png_0.1-7 ggsignif_0.6.0 tibble_2.1.3
[100] geneplotter_1.64.0 pcaPP_1.9-73 stringi_1.4.3
[103] GenomicFeatures_1.38.0 Matrix_1.2-17 KMsurv_0.1-5
[106] vctrs_0.2.0 lifecycle_0.1.0 pillar_1.4.2
[109] BiocManager_1.30.9 GlobalOptions_0.1.1 data.table_1.12.6
[112] bitops_1.0-6 rtracklayer_1.46.0 R6_2.4.0
[115] latticeExtra_0.6-28 hwriter_1.3.2 ShortRead_1.44.0
[118] gridExtra_2.3 codetools_0.2-16 assertthat_0.2.1
[121] chron_2.3-54 openssl_1.4.1 rjson_0.2.20
[124] withr_2.1.2 GenomicAlignments_1.22.0 Rsamtools_2.2.0
[127] GenomeInfoDbData_1.2.2 hms_0.5.2 grid_3.6.1
[130] rpart_4.1-15 tidyr_1.0.0 ggpubr_0.2.3
[133] base64enc_0.1-3

TCGAbiolinks TCGA GDC • 2.0k views
ADD COMMENT • link 6.2 years ago modarzi ▴ 10

Hi,

In TCGAbiolinks we map genes with the R package biomaRt, using the latest patched version of the genome from www.ensembl.org (GRCh38.p13 for the harmonized database and GRCh37.p13 for the legacy one).

We have an option that reads the data without mapping it, so you can do your own mapping if you prefer (e.g. using gencode.v22).

I wrote an RPubs document listing the genes not found (http://rpubs.com/tiagochst/TCGAbiolinksmappinggenes), such as https://useast.ensembl.org/Homo_sapiens/Gene/Idhistory?g=ENSG00000266862 and https://useast.ensembl.org/Homo_sapiens/Gene/Idhistory?g=ENSG00000279937, which are retired genes.
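
If you want to list the missing IDs yourself, the comparison described in the question can be sketched in base R roughly as follows. This is only a minimal sketch with toy vectors: `expr_ids` stands in for `rownames(data1)`, `gencode_pc_ids` stands in for the protein-coding IDs you extracted from gencode.v22, the version suffixes are made up, and `strip_version` is a hypothetical helper, not a TCGAbiolinks function. It assumes the expression matrix uses unversioned Ensembl IDs while the GENCODE IDs carry a ".version" suffix.

```r
# Toy stand-ins for the real inputs (illustrative values only).
# expr_ids: row names of the expression matrix, e.g. rownames(data1).
expr_ids <- c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419")

# gencode_pc_ids: protein-coding gene IDs taken from the annotation file.
# GENCODE gene IDs carry a ".version" suffix; the suffixes here are made up.
gencode_pc_ids <- c("ENSG00000000003.1", "ENSG00000000419.1",
                    "ENSG00000266862.1", "ENSG00000279937.1")

# Hypothetical helper: drop the trailing ".<digits>" version suffix.
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

pc_ids <- unique(strip_version(gencode_pc_ids))

# IDs present in both sets, and annotated IDs absent from the expression data.
found_ids   <- intersect(pc_ids, expr_ids)
missing_ids <- setdiff(pc_ids, expr_ids)

# Every annotated ID is either found or missing.
stopifnot(length(found_ids) + length(missing_ids) == length(pc_ids))

print(missing_ids)
```

With the real vectors, `missing_ids` would hold the 305 IDs in question, which you can then look up in the Ensembl ID history to see whether they were retired.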

ADD REPLY • link 6.2 years ago Tiago Chedraoui Silva ▴ 260

Thanks. Based on my code, can I say that my download process is perfect and that TCGA-SARC has just 56512 gene types?

ADD REPLY • link 6.2 years ago modarzi ▴ 10

Yes, the download process is perfect. It is just a question about using the most updated version or gencode.v22 to annotate the results. We chose the most updated version since it would not make sense to analyze genes that were retracted. I don't think there is right or wrong when choosing between both annotations. If you consider using the most updated version to annotate, yes, you can say you have 56512 gene types.

ADD REPLY • link 6.2 years ago Tiago Chedraoui Silva ▴ 260

Dear Tiago, I need to download SARC HTSeq - Counts with TCGAbiolinks, but I have a problem. I posted my problem at this link. I would appreciate your guidance.

ADD REPLY • link 6.1 years ago modarzi ▴ 10