Question: Matching columns to sample names
b.curran (New Zealand) wrote, 23 months ago:

I'm examining the TCGA dataset from the recount2 website. I've downloaded one of the SummarizedExperiment objects for prostate cancer and have managed to extract the counts using assay(). The column names of the extracted counts look like identifiers, but they bear no resemblance to any identifiers in the TCGA database.

I have a separate download of clinical data, with TCGA identifiers and sample names for the same data set. How does one link up columns in the assay with known sample names? 






Answer: Matching columns to sample names
Leonardo Collado Torres (United States) wrote, 23 months ago:


There are many id columns in the TCGA recount2 metadata. You can access the metadata using colData(rse) or get the metadata for all TCGA samples with all_metadata() as shown below. We used gdc_file_id to match across different tables.

library('recount')
m <- all_metadata('tcga')
## List the metadata columns whose names contain 'id'
colnames(m)[grep('id', colnames(m))]
> library('recount')
> m <-  all_metadata('tcga')
2017-08-28 09:18:14 downloading the metadata to /var/folders/cx/n9s558kx6fb7jf5z_pgszgb80000gn/T//Rtmpev7kCK/metadata_clean_tcga.Rdata
trying URL ''
Content type 'application/octet-stream' length 16334695 bytes (15.6 MB)
downloaded 15.6 MB

> head(m$gdc_file_id)
[1] "3dff72d2-f292-497e-ace3-6faa9c884205" "b1e54366-42b9-463c-8615-b34d52bd14dc" "473713f7-eb41-4f20-a37f-acd209e3cb75"
[4] "11f18f54-9b33-4c33-bdf9-0f093f4f3336" "136b7576-1108-4fa3-8254-6069f0ca879a" "e81fa8b7-3ffe-4f73-94af-0b5257d7f81a"
## Due to space limitations I'm just showing the length()
> length(colnames(m)[grep('id', colnames(m))])
[1] 82                                     
> packageVersion('recount')
[1] ‘1.3.3’
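
As a rough sketch of the matching step itself, the toy example below uses a shared file ID to rename count columns with sample names taken from a clinical table. Everything in it is made up for illustration: `counts`, `meta` and `clinical` stand in for assay(rse), colData(rse) and the separate clinical download, and the clinical column names `file_id` and `sample_name` are hypothetical. It also assumes the two sources might differ in letter case, which is only a guess.

```r
## Toy stand-ins; all IDs and names below are invented.
counts <- matrix(1:6, nrow = 2,
    dimnames = list(c("gene1", "gene2"), c("colA", "colB", "colC")))
meta <- data.frame(
    gdc_file_id = c("id-1", "id-2", "id-3"),
    row.names = c("colA", "colB", "colC"),
    stringsAsFactors = FALSE
)
## Hypothetical clinical table: different row order, one extra row
clinical <- data.frame(
    file_id = c("ID-3", "ID-1", "ID-9", "ID-2"),
    sample_name = c("sample-3", "sample-1", "sample-9", "sample-2"),
    stringsAsFactors = FALSE
)

## Look up the file ID for each count column, then locate it in the
## clinical table. tolower() guards against a case mismatch.
ids <- meta[colnames(counts), "gdc_file_id"]
idx <- match(tolower(ids), tolower(clinical$file_id))
sample_names <- clinical$sample_name[idx]

## Stop if any column could not be matched
stopifnot(!anyNA(sample_names))
colnames(counts) <- sample_names
print(colnames(counts))
```

match() returns NA for any ID that is missing from the clinical table, so the stopifnot() check fails loudly instead of silently mislabelling a column.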

Best, Leonardo



I have a very large dataset that I downloaded from TCGA for pancreatic cancer. It has one normal sample and 184 patient samples, so it has 186 columns (no technical replicates) and 20500 rows covering all the genes. I am trying to get log2 fold change values from it using DESeq2. To begin my analysis, I extracted four patient samples (raw reads) and the normal control, and now have 5 separate files. I tried both .txt and .csv file formats with the following code:

sampleFiles <- grep('.txt', list.files('/Users/dorothy/Desktop/PAAD'), value = TRUE)
sampleTable <- data.frame(sampleName = sampleFiles, fileName = sampleFiles, condition = sampleCondition)
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = '/Users/dorothy/Desktop/PAAD/', design = ~condition)
colData(ddsHTSeq)$condition <- factor(colData(ddsHTSeq)$condition, levels = c('control', 'patient'))

But when I run DESeqDataSetFromHTSeqCount() to create ddsHTSeq, I get the following error:

Error in Ops.factor(a$V1, l[[1]]$V1) : 
  level sets of factors are different.

Is there a way I can solve this? 
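
The `a$V1` and `l[[1]]$V1` in the error message suggest that the first column (V1) of one file is being compared with the first column of the first file, and that the two do not hold the same set of values. One way to check this, sketched below in base R with made-up files, is to read the first column of every count file as character and compare it with the first file's. The file names, gene IDs and the temporary directory are all hypothetical stand-ins for the PAAD folder.

```r
## Write three toy count files; the third has a different gene ID on row 2.
count_dir <- tempfile("counts_")
dir.create(count_dir)
write_counts <- function(name, genes, counts) {
    write.table(data.frame(genes, counts), file.path(count_dir, name),
        sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
}
write_counts("control.txt", c("geneA", "geneB"), c(10, 20))
write_counts("patient1.txt", c("geneA", "geneB"), c(12, 18))
write_counts("patient2.txt", c("geneA", "geneX"), c(11, 25))

files <- list.files(count_dir, pattern = "\\.txt$")
## Read the first column of each file as character rather than factor
ids <- lapply(files, function(f) {
    read.table(file.path(count_dir, f), stringsAsFactors = FALSE)$V1
})
names(ids) <- files

## TRUE where a file has the same IDs, in the same order, as the first file
same_as_first <- vapply(ids, identical, logical(1), ids[[1]])
print(same_as_first)
```

Any file flagged FALSE has gene IDs that differ from the first file's, either in content or in order, and would need to be fixed before the files are combined.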

Also, is it possible to build the DESeq2 dataset so that R treats the columns of a single count table as separate samples? In short, I do not want a separate metadata file that describes the columns.

Thanks in advance!


Reply written 22 months ago by dorothy.jrobbert20

Hi Dorothy,

Please check the posting guidelines and create a new post with a reproducible example & session information. Also, use the DESeq2 tag because this has nothing to do with recount from what I can see.



Reply written 22 months ago by Leonardo Collado Torres

Hi Leonardo - sorry for the delay in reply. 

I've done as you suggested, but there's something hinky somewhere. I've been pulling the gdc_file_id values out of the metadata, and those file IDs aren't found when I search the GDC data portal. The case IDs that I pull from the metadata are found. The gdc_submitter_id values aren't accessible through the portal, and neither are the file names. Long story short, I can match some of the metadata to the GDC portal, but not most of it, and I don't know why.

As an aside (I know I should probably start a separate question): is there a way to pull just the TCGA prostate data or just the GTEx prostate data via the R package? Or do I have to go through the website (which isn't working for me at the moment) to find the pre-compiled data sets?




Reply written 22 months ago by b.curran

Hi Ben,

I think it would be best to contact the TCGA team about searching for the gdc_file_id values through their web interface (the GDC data portal). I don't know whether it's an uppercase vs lowercase issue or how their search portal works.

Regarding the TCGA prostate data in recount2: right now you can only download it from the recount2 website, not via the recount package's download_study() function. You can, however, search the recount_url table to find the appropriate links and download the files using downloader::download(). By the way, the website is up right now. I don't know why it wasn't working for you.

> subset(recount::recount_url, project == 'TCGA' & grepl('prostate', file_name))$url
[1] "" "" 
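
To illustrate the filter-then-download idea without touching the network, here is a small sketch on a made-up table. The `url_tab` data frame only imitates the `project`, `file_name` and `url` columns used above; its rows, file names and example.org URLs are placeholders, not real recount2 entries. The download step uses base R's download.file() in place of downloader::download() and is left commented out because the URLs are fake.

```r
## Made-up stand-in for the URL table; none of these rows are real.
url_tab <- data.frame(
    project = c("TCGA", "TCGA", "TCGA", "OTHER"),
    file_name = c("rse_gene_prostate.Rdata", "rse_exon_prostate.Rdata",
        "rse_gene_breast.Rdata", "rse_gene.Rdata"),
    url = paste0("https://example.org/file", 1:4),
    stringsAsFactors = FALSE
)

## Keep the TCGA rows whose file name mentions the tissue of interest
hits <- subset(url_tab, project == "TCGA" & grepl("prostate", file_name))

## Build one destination path per file, keeping the original file names
dest <- file.path(tempdir(), hits$file_name)
print(data.frame(url = hits$url, dest = basename(dest)))

## Not run, because the URLs above are placeholders:
## for (i in seq_along(dest)) {
##     download.file(hits$url[i], destfile = dest[i], mode = "wb")
## }
```

Using mode = "wb" matters for binary files such as .Rdata, which can be corrupted on some platforms if downloaded in text mode.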



Reply written 22 months ago (modified 22 months ago) by Leonardo Collado Torres