I'm examining the TCGA dataset from the recount2 website (https://jhubiostatistics.shinyapps.io/recount/). I've downloaded one of the SummarizedExperiment objects for prostate cancer and have managed to extract the counts using assay(). However, the column names of the extracted count matrix, whilst they look like identifiers, bear no resemblance to any identifiers in the TCGA database.
I have a separate download of clinical data, with TCGA identifiers and sample names for the same data set. How does one link up columns in the assay with known sample names?
Thanks
Ben.
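One way to approach the linking question above is to treat the sample table stored alongside the counts (colData() of the SummarizedExperiment) as a lookup table keyed by the assay column names. The sketch below uses only base R and toy data: the IDs, the barcodes, and the column name tcga_barcode are all hypothetical stand-ins. In the real object, inspect colnames(colData(rse)) to find which column actually holds TCGA barcodes; which column that is remains an assumption here.

```r
# Toy stand-in for assay(rse): columns are named by internal IDs.
counts <- matrix(
  c(10L, 0L, 5L, 3L, 8L, 1L),
  nrow = 2,
  dimnames = list(c("geneA", "geneB"), c("uuid-1", "uuid-2", "uuid-3"))
)

# Toy stand-in for as.data.frame(colData(rse)); row names match the assay
# columns. The barcode column name and its values are hypothetical.
sample_info <- data.frame(
  tcga_barcode = c("TCGA-XX-0001-01A", "TCGA-XX-0002-01A", "TCGA-XX-0003-11A"),
  row.names = c("uuid-1", "uuid-2", "uuid-3"),
  stringsAsFactors = FALSE
)

# Look up each assay column in the sample table, keeping assay order.
idx <- match(colnames(counts), rownames(sample_info))
stopifnot(!anyNA(idx))

relabelled <- counts
colnames(relabelled) <- sample_info$tcga_barcode[idx]

# The patient-level part of a TCGA barcode is its first 12 characters,
# which is usually what a clinical table is keyed on.
patient_id <- substr(colnames(relabelled), 1, 12)
```

Using match() rather than assuming the two tables share an order means a missing or reordered sample fails loudly instead of silently mislabelling a column.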
Hello,
I have a very large dataset that I downloaded from TCGA for pancreatic cancer. It has one normal sample and 184 patient samples, so it has 186 columns (no technical replicates) and 20500 rows, one per gene. I am trying to get log2 fold change data from this using DESeq2. To begin my analysis, I extracted four patient samples (raw reads) and the normal control, and now have 5 separate files. I tried both .txt and .csv file formats with the following code:
library('DESeq2')
setwd("/Users/dorothy/Desktop/PAAD")
getwd()
sampleFiles<-grep ('.txt',list.files('/Users/dorothy/Desktop/PAAD'),value=TRUE)
sampleCondition<-c('control','patient','patient','patient','patient')
sampleTable<-data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition)
sampleFiles
sampleCondition
sampleTable
ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory='/Users/dorothy/Desktop/PAAD/', design=~condition)
colData(ddsHTSeq)$condition<-factor(colData(ddsHTSeq)$condition, levels=c('control','patient'))
dds<-DESeq(ddsHTSeq)
res<-results(dds)
res<-res[order(res$padj),]
But once I get to the DESeqDataSetFromHTSeqCount() call, it shows the following error: Error in Ops.factor(a$V1, l[[1]]$V1) : level sets of factors are different.
Is there a way I can solve this?
Also, is it possible to build the ddsHTSeq object in a way that lets me tell R to identify the columns as separate samples? In short, I do not want to have a separate metadata file that contains information about the columns.
Thanks in advance!
Dorothy
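For the second question above, a possible route is DESeqDataSetFromMatrix(), which takes the whole count matrix plus a small sample table that can be built in R from the column names, so no separate metadata file is needed. The sketch below is a minimal illustration under stated assumptions: the counts are random toy values, and the "normal_" column prefix is a hypothetical naming scheme. On the error itself, the message suggests that the first column (the gene IDs) differs between the input files, which would be worth checking first.

```r
# Toy count matrix: genes in rows, one column per sample (values made up).
set.seed(1)
counts <- matrix(
  rpois(60, lambda = 20),
  nrow = 12,
  dimnames = list(
    paste0("gene", 1:12),
    c("normal_1", "tumor_1", "tumor_2", "tumor_3", "tumor_4")
  )
)

# Derive the condition from the column names instead of a separate file.
# The "normal_" prefix is a hypothetical naming convention.
condition <- factor(
  ifelse(grepl("^normal_", colnames(counts)), "control", "patient"),
  levels = c("control", "patient")
)
col_data <- data.frame(condition = condition, row.names = colnames(counts))

# Build the DESeq2 object only if the package is installed.
if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = counts,
    colData = col_data,
    design = ~ condition
  )
}
```

From there, DESeq(dds) and results(dds) would proceed as in the code above. Setting the factor levels up front makes 'control' the reference level without a later relevelling step.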
Hi Dorothy,
Please check the posting guidelines and create a new post with a reproducible example & session information. Also, use the DESeq2 tag because this has nothing to do with recount from what I can see.
Best,
Leonardo
Hi Leonardo - sorry for the delay in reply.
I've done as you suggested, but there's something hinky somewhere. The gdc_file_id values that I pull out of the metadata aren't found when I search the GDC data portal, whereas the case IDs from the metadata are. The gdc_submitter_id values aren't accessible through the portal either, nor are the file names. Long story short, I can match some of the metadata to the GDC portal, but not most of it, and I don't know why.
As an aside (I should probably start a separate question, I know), is there a way to pull just the TCGA prostate data or just the GTEx prostate data via the R package? Or do I have to go through the website (which isn't working for me at the moment) to find the pre-compiled data sets?
Cheers
Ben.
Hi Ben,
I think it would be best to contact the TCGA team about searching for the gdc_file_id values via their web interface (the GDC data portal). I don't know whether it's an uppercase vs. lowercase issue or how their search portal works.
Regarding the TCGA prostate data in recount2, right now you can only download it from the recount2 website and not via the recount package's download_study() function. You can, however, search the recount_url table to find the appropriate links and download the files using downloader::download(). By the way, https://jhubiostatistics.shinyapps.io/recount/ is up right now. I don't know why it wasn't working for you.
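As a rough sketch of that lookup: the helper below filters a recount_url-style table by project and file-name pattern. The helper name find_links is hypothetical, and the column names project, file_name, and url are an assumption about the table's layout, so check names(recount::recount_url) before relying on them. The toy table and its URLs are placeholders, not real links.

```r
# Filter a recount_url-style table for one project and a file-name pattern.
# Assumes the table has 'project', 'file_name' and 'url' columns.
find_links <- function(url_table, project, pattern) {
  keep <- url_table$project == project &
    grepl(pattern, url_table$file_name, ignore.case = TRUE)
  url_table[keep, c("file_name", "url"), drop = FALSE]
}

# Toy table with placeholder URLs, standing in for recount::recount_url.
toy <- data.frame(
  project = c("TCGA", "TCGA", "SRP000001"),
  file_name = c("rse_gene_prostate.Rdata", "rse_gene_breast.Rdata",
                "rse_gene.Rdata"),
  url = c("https://example.org/a", "https://example.org/b",
          "https://example.org/c"),
  stringsAsFactors = FALSE
)
hits <- find_links(toy, "TCGA", "prostate")

# With recount and downloader installed, something like this should work:
# hits <- find_links(recount::recount_url, "TCGA", "prostate")
# downloader::download(hits$url[1], destfile = hits$file_name[1], mode = "wb")
```

The mode = "wb" argument is passed through to download.file() and matters on Windows, where binary .Rdata files can otherwise be corrupted.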
Best,
Leonardo