Question

Recount TCGA data


rajesha1986 • 0

@rajesha1986-13918


Hello, thank you for providing a great tool for accessing all these datasets. I downloaded the gene counts file (counts_gene.tsv) for TCGA through https://jhubiostatistics.shinyapps.io/recount/ --> TCGA --> gene counts (duffel.rail.bio/recount/TCGA/counts_gene.tsv.gz). I am interested in the lung data, so I extracted the counts for the lung cancer patients. However, the count matrix contains no gene IDs. I am new to this, so please let me know whether I am missing something.

Thanks

Rajesha

recount • 1.9k views

updated 6.6 years ago by Leonardo Collado Torres ★ 1.0k • written 6.6 years ago by rajesha1986 • 0

score 0 · Answer 1 · 2017-09-08

Hi,

The text files do not include the gene IDs. I realize this is inconvenient if you don't want to use R.

This information is better organized in the RangedSummarizedExperiment (RSE) objects, which you can download from https://jhubiostatistics.shinyapps.io/recount/ or via the recount Bioconductor package. See Figure 2 of https://f1000research.com/articles/6-1558/v1. That workflow and the recount vignette (http://bioconductor.org/packages/release/bioc/vignettes/recount/inst/doc/recount-quickstart.html) are the best places to get started and become familiar with recount.

Since the genes are the same regardless of the study, you can take the gene information from any RSE object, for example:

> library(recount)

> rowRanges(rse_gene_SRP009615)
GRanges object with 58037 ranges and 3 metadata columns:
                     seqnames                 ranges strand |            gene_id bp_length          symbol
                        <Rle>              <IRanges>  <Rle> |        <character> <integer> <CharacterList>
  ENSG00000000003.14     chrX [100627109, 100639991]      - | ENSG00000000003.14      4535          TSPAN6
   ENSG00000000005.5     chrX [100584802, 100599885]      + |  ENSG00000000005.5      1610            TNMD
  ENSG00000000419.12    chr20 [ 50934867,  50958555]      - | ENSG00000000419.12      1207            DPM1
  ENSG00000000457.13     chr1 [169849631, 169894267]      - | ENSG00000000457.13      6883           SCYL3
  ENSG00000000460.16     chr1 [169662007, 169854080]      + | ENSG00000000460.16      5967        C1orf112
                 ...      ...                    ...    ... .                ...       ...             ...
   ENSG00000283695.1    chr19 [ 52865369,  52865429]      - |  ENSG00000283695.1        61              NA
   ENSG00000283696.1     chr1 [161399409, 161422424]      + |  ENSG00000283696.1       997              NA
   ENSG00000283697.1     chrX [149548210, 149549852]      - |  ENSG00000283697.1      1184    LOC101928917
   ENSG00000283698.1     chr2 [112439312, 112469687]      - |  ENSG00000283698.1       940              NA
   ENSG00000283699.1    chr10 [ 12653138,  12653197]      - |  ENSG00000283699.1        60         MIR4481
  -------
  seqinfo: 25 sequences (1 circular) from an unspecified genome; no seqlengths

Then save that information to a text table.
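As a rough sketch of that last step, something like the following might work. It assumes the `rse_gene_SRP009615` object used above is available after loading recount. The output file name `recount_gene_info.tsv` and the column layout are only illustrative choices, not something recount prescribes. The `symbol` column is a CharacterList, so it is collapsed to one string per gene before writing.

```r
library(recount)

# Gene annotation from the RSE object shown above.
gene_info <- rowRanges(rse_gene_SRP009615)

# `symbol` is a CharacterList (one list element per gene), which
# write.table() cannot write directly, so collapse each element into a
# single string. Genes without a symbol end up as the text "NA".
symbol_chr <- vapply(
    as.list(gene_info$symbol),
    function(x) paste(x, collapse = ";"),
    character(1)
)

# Flatten the ranges and metadata into a plain data.frame.
gene_table <- data.frame(
    gene_id = gene_info$gene_id,
    chr = as.character(seqnames(gene_info)),
    start = start(gene_info),
    end = end(gene_info),
    strand = as.character(strand(gene_info)),
    bp_length = gene_info$bp_length,
    symbol = symbol_chr,
    stringsAsFactors = FALSE
)

# Hypothetical output file name; change as needed.
write.table(
    gene_table,
    file = "recount_gene_info.tsv",
    sep = "\t",
    quote = FALSE,
    row.names = FALSE
)
```

Whether the rows of counts_gene.tsv are in the same order as this table is an assumption worth checking before attaching `gene_id` to the counts. A first sanity check is that both have the same number of rows (58037 in the output above).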

Best,

Leonardo

> packageVersion('recount')
[1] ‘1.2.3’