Question: Problem downloading TCGA data with the TCGAbiolinks package
modarzi wrote, 11 days ago:

Hi, to download TCGA-SARC transcriptome data from GDC, I used the TCGAbiolinks package with the code below:

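library(TCGAbiolinks)
library(SummarizedExperiment)

# Query harmonized HTSeq FPKM expression for primary tumors in TCGA-SARC
query <- GDCquery(project = "TCGA-SARC",
                  sample.type = "Primary solid Tumor",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - FPKM")
GDCdownload(query)          # download the files selected by the query
data <- GDCprepare(query)   # assemble the downloaded files into one object
data1 <- assay(data)        # genes x samples expression matrix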

Now my SARC gene expression data has only 56512 genes, identified by plain Ensembl IDs. I checked gencode.v22 and found 60483 genes there. I then extracted the protein-coding genes from the gencode.v22 annotation file, which gave 19814. When I intersect the 56512 Ensembl IDs from my SARC expression data with those 19814 protein-coding genes, I get only 19509 protein-coding genes in my expression data. I am really concerned about the 305 protein-coding genes lost in my study. I would appreciate it if anybody could comment on whether my code is valid and on the 305 missing protein-coding genes. Best
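One way to see exactly which IDs drop out is to compare the two ID sets directly. The following is only a minimal sketch on toy vectors: `expr_ids` stands in for `rownames(data1)` and `pc_ids` for the protein-coding IDs taken from the gencode.v22 file. Both names, the set membership, and the version suffixes are assumptions made for illustration. GENCODE IDs carry a version suffix, so the sketch strips it before comparing.

```r
# Toy stand-ins; in the real analysis these would come from
# rownames(data1) and from the gencode.v22 annotation file.
expr_ids <- c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419")
pc_ids   <- c("ENSG00000000003.13", "ENSG00000000005.5",
              "ENSG00000266862.1", "ENSG00000279937.1")  # suffixes are illustrative

# Drop the ".<version>" suffix so both sets use the same ID form.
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

pc_clean   <- strip_version(pc_ids)
expr_clean <- strip_version(expr_ids)

found   <- intersect(pc_clean, expr_clean)  # annotated genes present in the matrix
missing <- setdiff(pc_clean, expr_clean)    # annotated genes absent from the matrix

length(found)
length(missing)
missing
```

With the real vectors, `length(missing)` should equal 305 if the counts in the question hold (19814 - 19509), and `missing` gives the IDs to look up individually.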

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] TCGAbiolinks_2.14.0         sva_3.34.0                  genefilter_1.68.0
 [4] mgcv_1.8-30                 nlme_3.1-141                SummarizedExperiment_1.16.0
 [7] DelayedArray_0.12.0         BiocParallel_1.20.0         matrixStats_0.55.0
[10] Biobase_2.46.0              GenomicRanges_1.38.0        GenomeInfoDb_1.22.0
[13] IRanges_2.20.0              S4Vectors_0.24.0            BiocGenerics_0.32.0
[16] stringr_1.4.0               dplyr_0.8.3                 Hmisc_4.2-0
[19] ggplot2_3.2.1               Formula_1.2-3               survival_2.44-1.1
[22] lattice_0.20-38             impute_1.60.0               cluster_2.1.0
[25] class_7.3-15                MASS_7.3-51.4               sqldf_0.4-11
[28] RSQLite_2.1.2               gsubfn_0.7                  proto_1.0.0
[31] WGCNA_1.68                  fastcluster_1.1.25          dynamicTreeCut_1.63-1

loaded via a namespace (and not attached):
  [1] backports_1.1.5             circlize_0.4.8              aroma.light_3.16.0
  [4] BiocFileCache_1.10.0        plyr_1.8.4                  selectr_0.4-1
  [7] ConsensusClusterPlus_1.50.0 lazyeval_0.2.2              splines_3.6.1
 [10] robust_0.4-18.1             digest_0.6.23               foreach_1.4.7
 [13] htmltools_0.4.0             GO.db_3.10.0                magrittr_1.5
 [16] checkmate_1.9.4             memoise_1.1.0               fit.models_0.5-14
 [19] doParallel_1.0.15           limma_3.42.0                ComplexHeatmap_2.2.0
 [22] Biostrings_2.54.0           readr_1.3.1                 annotate_1.64.0
 [25] R.utils_2.9.0               askpass_1.1                 prettyunits_1.0.2
 [28] colorspace_1.4-1            rvest_0.3.4                 ggrepel_0.8.1
 [31] blob_1.2.0                  rappdirs_0.3.1              rrcov_1.4-7
 [34] xfun_0.10                   jsonlite_1.6                tcltk_3.6.1
 [37] crayon_1.3.4                RCurl_1.95-4.12             zeallot_0.1.0
 [40] zoo_1.8-6                   iterators_1.0.12            glue_1.3.1
 [43] survminer_0.4.6             gtable_0.3.0                zlibbioc_1.32.0
 [46] XVector_0.26.0              GetoptLong_0.1.7            shape_1.4.4
 [49] DEoptimR_1.0-8              scales_1.0.0                DESeq_1.38.0
 [52] mvtnorm_1.0-11              edgeR_3.28.0                DBI_1.0.0
 [55] ggthemes_4.2.0              Rcpp_1.0.3                  xtable_1.8-4
 [58] progress_1.2.2              htmlTable_1.13.2            clue_0.3-57
 [61] matlab_1.0.2                foreign_0.8-72              bit_1.1-14
 [64] km.ci_0.5-2                 preprocessCore_1.48.0       htmlwidgets_1.5.1
 [67] httr_1.4.1                  RColorBrewer_1.1-2          acepack_1.4.1
 [70] pkgconfig_2.0.3             XML_3.98-1.20               R.methodsS3_1.7.1
 [73] nnet_7.3-12                 dbplyr_1.4.2                locfit_1.5-9.1
 [76] labeling_0.3                tidyselect_0.2.5            rlang_0.4.2
 [79] AnnotationDbi_1.48.0        munsell_0.5.0               tools_3.6.1
 [82] downloader_0.4              generics_0.0.2              broom_0.5.2
 [85] knitr_1.25                  bit64_0.9-7                 robustbase_0.93-5
 [88] survMisc_0.5.5              purrr_0.3.3                 EDASeq_2.20.0
 [91] R.oo_1.23.0                 xml2_1.2.2                  biomaRt_2.42.0
 [94] compiler_3.6.1              rstudioapi_0.10             curl_4.2
 [97] png_0.1-7                   ggsignif_0.6.0              tibble_2.1.3
[100] geneplotter_1.64.0          pcaPP_1.9-73                stringi_1.4.3
[103] GenomicFeatures_1.38.0      Matrix_1.2-17               KMsurv_0.1-5
[106] vctrs_0.2.0                 lifecycle_0.1.0             pillar_1.4.2
[109] BiocManager_1.30.9          GlobalOptions_0.1.1         data.table_1.12.6
[112] bitops_1.0-6                rtracklayer_1.46.0          R6_2.4.0
[115] latticeExtra_0.6-28         hwriter_1.3.2               ShortRead_1.44.0
[118] gridExtra_2.3               codetools_0.2-16            assertthat_0.2.1
[121] chron_2.3-54                openssl_1.4.1               rjson_0.2.20
[124] withr_2.1.2                 GenomicAlignments_1.22.0    Rsamtools_2.2.0
[127] GenomeInfoDbData_1.2.2      hms_0.5.2                   grid_3.6.1
[130] rpart_4.1-15                tidyr_1.0.0                 ggpubr_0.2.3
[133] base64enc_0.1-3

tcga tcgabiolinks gdc • 51 views
modified 11 days ago • written 11 days ago by modarzi

Hi,

In TCGAbiolinks, we map genes with the R package biomaRt against the latest patched version of the genome on www.ensembl.org (GRCh38.p13 for the harmonized database and GRCh37.p13 for the legacy one).

We also have an option that reads the data without mapping it, so you can do your own mapping if you prefer (e.g., using gencode.v22).

I wrote an RPubs document listing the genes that are not found (http://rpubs.com/tiagochst/TCGAbiolinksmappinggenes). Examples are https://useast.ensembl.org/Homo_sapiens/Gene/Idhistory?g=ENSG00000266862 and https://useast.ensembl.org/Homo_sapiens/Gene/Idhistory?g=ENSG00000279937, both of which are retired genes.
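For anyone who prefers to do the mapping against gencode.v22 themselves, the gene type can be pulled out of the annotation's attribute column with base R. This is a minimal sketch, assuming GTF-style attribute strings; the three records are illustrative examples rather than lines copied from the real file, and `get_attr` is a hypothetical helper, not part of TCGAbiolinks.

```r
# Illustrative GTF-style attribute strings (column 9 of a GENCODE GTF file).
attrs <- c(
  'gene_id "ENSG00000000003.13"; gene_type "protein_coding"; gene_name "TSPAN6";',
  'gene_id "ENSG00000000005.5"; gene_type "protein_coding"; gene_name "TNMD";',
  'gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1";'
)

# Hypothetical helper: extract the quoted value of one attribute key.
get_attr <- function(x, key) {
  sub(paste0('.*', key, ' "([^"]+)".*'), "\\1", x)
}

annot <- data.frame(
  gene_id   = sub("\\.[0-9]+$", "", get_attr(attrs, "gene_id")),  # drop version suffix
  gene_type = get_attr(attrs, "gene_type"),
  gene_name = get_attr(attrs, "gene_name"),
  stringsAsFactors = FALSE
)

# Keep only protein-coding genes for comparison with the expression matrix.
pc_genes <- annot[annot$gene_type == "protein_coding", ]
pc_genes$gene_id
```

The resulting `pc_genes$gene_id` vector can then be intersected with the row names of an unmapped expression matrix.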

written 11 days ago by Tiago Chedraoui Silva

Thanks. Based on my code, can I say that my download process is perfect and that TCGA-SARC has just 56512 gene types?

written 11 days ago by modarzi

Yes, the download process is perfect. It is just a question of whether to annotate the results with the most up-to-date version or with gencode.v22. We chose the most up-to-date version since it would not make sense to analyze genes that were retired. I don't think there is a right or wrong choice between the two annotations. If you annotate with the most up-to-date version, then yes, you can say you have 56512 gene types.

written 11 days ago by Tiago Chedraoui Silva

Dear Tiago, I need to download SARC HTSeq - Counts data with TCGAbiolinks, but I have a problem. I posted my problem in this link. I would appreciate it if you could guide me.

written 1 day ago by modarzi