Question

Matching columns to sample names

b.curran • 0

@bcurran-6988

I'm examining the TCGA dataset from the recount2 website (https://jhubiostatistics.shinyapps.io/recount/). I've downloaded one of the SummarizedExperiment objects for prostate cancer and have managed to extract the counts using assay(). However, the column names of the extracted matrix, whilst they look like identifiers, bear no resemblance to any identifiers in the TCGA database.

I have a separate download of clinical data, with TCGA identifiers and sample names for the same data set. How does one link up columns in the assay with known sample names?

Thanks

Ben.

Tag: recount

Written 8.5 years ago by b.curran; updated 8.5 years ago by Leonardo Collado Torres

score 0 · Answer 1 · 2017-08-28

Leonardo Collado Torres ★ 1.1k

@lcolladotor

Hi,

There are many ID columns in the TCGA recount2 metadata. You can access the metadata for the samples in your object using colData(rse), or get the metadata for all TCGA samples with all_metadata('tcga') as shown below. We used gdc_file_id to match across different tables.

library('recount')
m <-  all_metadata('tcga')
head(m$gdc_file_id)
colnames(m)[grep('id', colnames(m))]
packageVersion('recount')

> library('recount')
> m <-  all_metadata('tcga')
2017-08-28 09:18:14 downloading the metadata to /var/folders/cx/n9s558kx6fb7jf5z_pgszgb80000gn/T//Rtmpev7kCK/metadata_clean_tcga.Rdata
trying URL 'https://github.com/leekgroup/recount-website/blob/master/metadata/metadata_clean_tcga.Rdata?raw=true'
Content type 'application/octet-stream' length 16334695 bytes (15.6 MB)
==================================================
downloaded 15.6 MB

> head(m$gdc_file_id)
[1] "3dff72d2-f292-497e-ace3-6faa9c884205" "b1e54366-42b9-463c-8615-b34d52bd14dc" "473713f7-eb41-4f20-a37f-acd209e3cb75"
[4] "11f18f54-9b33-4c33-bdf9-0f093f4f3336" "136b7576-1108-4fa3-8254-6069f0ca879a" "e81fa8b7-3ffe-4f73-94af-0b5257d7f81a"
## Due to space limitations I'm just showing the length()
> length(colnames(m)[grep('id', colnames(m))])
[1] 82                                     
> packageVersion('recount')
[1] ‘1.3.3’

Best, Leonardo
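
Applied to the original question, the matching step could look roughly like the sketch below. It runs on toy data rather than real recount output: `meta` stands in for `as.data.frame(colData(rse))`, and the clinical table with its `file_id` and `sample_name` columns is hypothetical, so substitute whichever ID column your clinical download actually shares with the metadata. IDs are lower-cased before matching as a precaution, in case the two sources differ only in case.

```r
# Toy stand-in for as.data.frame(colData(rse)); the row names play the role
# of the assay column names. All values here are made up for illustration.
meta <- data.frame(
  gdc_file_id = c("aaa-1", "bbb-2", "ccc-3"),
  row.names = c("COL_A", "COL_B", "COL_C"),
  stringsAsFactors = FALSE
)

# Hypothetical clinical table; the column names are assumptions.
clinical <- data.frame(
  file_id = c("CCC-3", "aaa-1"),
  sample_name = c("TCGA-XX-0003", "TCGA-XX-0001"),
  stringsAsFactors = FALSE
)

# For each assay column, find the clinical row with the same file ID,
# ignoring case. Unmatched columns get NA.
idx <- match(tolower(meta$gdc_file_id), tolower(clinical$file_id))

lookup <- data.frame(
  assay_column = rownames(meta),
  gdc_file_id = meta$gdc_file_id,
  sample_name = clinical$sample_name[idx],
  stringsAsFactors = FALSE
)
print(lookup)
```

Columns that come back as NA have no counterpart in the clinical table and are worth inspecting by hand before any renaming.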

Answered 8.5 years ago by Leonardo Collado Torres

Hello,

I have a very large dataset which I downloaded from TCGA for pancreatic cancer. It has one normal sample but 184 patient samples, so it has 186 columns (no technical replicates) and 20500 rows for all the genes. I am trying to get log2 fold change data from this using DESeq2. To begin my analysis I have extracted four patient samples (raw reads) and the normal control, and now have 5 separate files. I tried both .txt and .csv file formats with the following code:

library('DESeq2')
setwd("/Users/dorothy/Desktop/PAAD")
getwd()
sampleFiles<-grep ('.txt',list.files('/Users/dorothy/Desktop/PAAD'),value=TRUE)
sampleCondition<-c('control','patient','patient','patient','patient')
sampleTable<-data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition)
sampleFiles
sampleCondition
sampleTable
ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory='/Users/dorothy/Desktop/PAAD/', design=~condition)
colData(ddsHTSeq)$condition<-factor(colData(ddsHTSeq)$condition, levels=c('control','patient'))
dds<-DESeq(ddsHTSeq)
res<-results(dds)
res<-res[order(res$padj),]

But once I get to the DESeqDataSetFromHTSeqCount() call, it shows the following error:
Error in Ops.factor(a$V1, l[[1]]$V1) : level sets of factors are different

Is there a way I can solve this?

Also, is it possible to run DESeqDataSetFromHTSeqCount() in a way that tells R to treat the columns as separate samples? In short, I do not want to have a separate metadata file that contains information about the columns.

Thanks in advance!

Dorothy
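
The error above is raised while comparing the first column (V1) of each file as factors, which suggests that the files do not all contain the same set of gene IDs in that column. A minimal base R sketch of how one might check this is shown below. The three toy files and their contents are invented for illustration; in practice `count_dir` would point at the real directory of count files.

```r
# Write three toy tab-separated count files; the third has a different gene ID.
count_dir <- tempfile("counts_")
dir.create(count_dir)
writeLines(c("geneA\t10", "geneB\t5", "geneC\t0"), file.path(count_dir, "control.txt"))
writeLines(c("geneA\t12", "geneB\t7", "geneC\t3"), file.path(count_dir, "patient1.txt"))
writeLines(c("geneA\t8", "geneB\t2", "geneD\t1"), file.path(count_dir, "patient2.txt"))

# Anchor the pattern so only names ending in .txt are picked up.
files <- list.files(count_dir, pattern = "\\.txt$", full.names = TRUE)

# Read the first column of every file as plain character strings.
ids <- lapply(files, function(f) {
  read.table(f, sep = "\t", stringsAsFactors = FALSE)$V1
})
names(ids) <- basename(files)

# Compare every file against the first one (same IDs, same order).
reference <- ids[[1]]
same_ids <- vapply(ids, function(x) identical(x, reference), logical(1))
mismatched <- names(same_ids)[!same_ids]
print(mismatched)

# Show which IDs the mismatched files have that the reference lacks.
extra_ids <- lapply(ids[mismatched], setdiff, y = reference)
print(extra_ids)
```

Header lines, trailing summary rows, or a different annotation version in one file would all show up here as mismatched IDs.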

Reply 8.4 years ago by dorothy.jrobbert

Hi Dorothy,

Please check the posting guidelines and create a new post with a reproducible example & session information. Also, use the DESeq2 tag because this has nothing to do with recount from what I can see.

Best,

Leonardo

Reply 8.4 years ago by Leonardo Collado Torres

Hi Leonardo - sorry for the delay in reply.

I've done as you suggested. There's something hinky somewhere though. I've been pulling out the gdc_file_id values from the metadata, and those file IDs aren't present when I search through the GDC data portal. The case IDs that I pull from the metadata are. If I pull out the gdc_submitter_id values, they're not accessible through the portal, and neither are the file names. Long story short, I can match some of the metadata to the GDC portal but not most of it, and I don't know why.

As an aside (I should probably start a separate question I know) is there a way to pull just tcga prostate data or just gtex prostate data via the R package? Or do I have to go through the website (which isn't working for me atm) to find the pre-compiled data sets?

Cheers

Ben.

Reply 8.4 years ago by b.curran

Hi Ben,

I think it would be best to contact the TCGA team about searching for the gdc_file_id values via their web interface (the GDC data portal). I don't know whether it's an uppercase vs lowercase issue or how their search portal works.

Regarding the TCGA prostate data in recount2, right now you can only download it from the recount2 website and not via the recount package's download_study() function. You can, however, search the recount_url table to find the appropriate links and download the files using downloader::download(). By the way, https://jhubiostatistics.shinyapps.io/recount/ is up right now. I don't know why it wasn't working for you.

> subset(recount::recount_url, project == 'TCGA' & grepl('prostate', file_name))$url
[1] "http://duffel.rail.bio/recount/TCGA/rse_exon_prostate.Rdata" "http://duffel.rail.bio/recount/TCGA/rse_gene_prostate.Rdata"

Best,

Leonardo
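
Putting the search and download steps together, a rough sketch might look like this. The small `url_table` below is a toy stand-in for `recount::recount_url` (the two prostate URLs are the ones shown above, the third row is a placeholder), and the download is switched off by default through the hypothetical `run_download` flag because it needs network access. The name of the object stored in the .Rdata file is not assumed here; `load()` returns it.

```r
# Toy stand-in for recount::recount_url with the same three columns used above.
url_table <- data.frame(
  project = c("TCGA", "TCGA", "OTHER"),
  file_name = c("rse_exon_prostate.Rdata", "rse_gene_prostate.Rdata", "other.Rdata"),
  url = c(
    "http://duffel.rail.bio/recount/TCGA/rse_exon_prostate.Rdata",
    "http://duffel.rail.bio/recount/TCGA/rse_gene_prostate.Rdata",
    "http://example.org/other.Rdata"  # placeholder, not a real recount URL
  ),
  stringsAsFactors = FALSE
)

# Keep the TCGA prostate rows, then pick the gene-level file.
hits <- subset(url_table, project == "TCGA" & grepl("prostate", file_name))
gene_url <- hits$url[grepl("^rse_gene", hits$file_name)]
print(gene_url)

run_download <- FALSE  # set to TRUE to fetch the file (requires the downloader package)
if (run_download) {
  dest <- file.path(tempdir(), basename(gene_url))
  downloader::download(gene_url, destfile = dest, mode = "wb")
  loaded <- load(dest)  # character vector naming the objects restored from the file
  print(loaded)
}
```

With the real table, replacing `url_table` by `recount::recount_url` should be the only change needed in the search step.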

Reply 8.4 years ago by Leonardo Collado Torres