Search
Question: Can I feed TCGA normalized count data to EdgeR for differential gene expression analysis
1
gravatar for ycding
2.7 years ago by
ycding10
United States
ycding10 wrote:

Dear R people,

I downloaded TCGA level3 RNAseq data for 63 prostate tumor samples with conditions (age difference) I am interested in. I want to identify differential expressed genes between two age groups. I am reading the user guide for the EdgeR program and found that it takes integer count data as input. however, the level 3 TCGA data are in the following format:

Hybridization REF TCGA-HC-8260-01A-11R-2263-07 TCGA-HC-8260-11A-01R-2263-07
gene_id normalized_count normalized_count
EML3|256364 973.2368 839.2435
EML4|27436 1135.3904 727.7384
EML5|161436 518.8917 543.7352
EML6|400954 102.6448 24.4287
EMP1|2012 2250.9446 1138.2979
EMP2|2013 2926.6373 4150.5122
so,  it is normalized count.

 

 

 

 

Can I use EdgeR?

thank you !!!

Yuanchun Ding

ADD COMMENTlink modified 2.7 years ago by alakatos80 • written 2.7 years ago by ycding10
4
gravatar for Stephen Piccolo
2.7 years ago by
United States
Stephen Piccolo500 wrote:

One option is to use this data set that we put together: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944

It has raw read counts (obtained using the Rsubread package), which you can use with edgeR.

We are putting together a publication based on it, so if you end up using it, please let me know, and I'll send you the citation info.

ADD COMMENTlink written 2.7 years ago by Stephen Piccolo500
1

An ExpressionSet of this data is available in AnnotationHub

library(AnnotationHub)
hub = AnnotationHub()
gse = query(hub, "GSE62944")[[1]]

(a little expensive to download the first time, but then cached locally for subsequent use). It is based on the following script, which retrieves and processes the data to a 264.5Mb ExpressionSet, ready for use (expect that the pData are character vectors instead of types)...

## Imports: GEOquery, Biobase

suppl <- GEOquery::getGEOSuppFiles("GSE62944")
setwd("GSE62944")

clinvar <- local({
    clinvarFile <- "GSE62944_TCGA_20_420_Clinical_Variables_7706_Samples.txt.gz"
    data <- scan(clinvarFile, what=character(), sep="\t", quote="")
    m <- matrix(data, 7707)
    dimnames(m) <- list(m[,1], m[1,])
    as.data.frame(m[-1, -1])
})

CancerType <-
    read.delim("GSE62944_TCGA_20_CancerType_Samples.txt.gz",
               header=FALSE, colClasses=c("character", "factor"),
               col.names=c("sample", "type"))
idx <- match(rownames(clinvar), CancerType$sample)
stopifnot(!anyNA(idx))
clinvar$CancerType <- CancerType$type[idx]

countFile <- 
    "GSM1536837_TCGA_20.Illumina.tumor_Rsubread_FeatureCounts.txt.gz"
untar("GSE62944_RAW.tar", countFile)
counts <- local({
    data <- scan(countFile, what=character(), sep="\t", quote="")
    m <- matrix(data, 7707)
    dimnames(m) <- list(m[,1], m[1,])
    m <- t(m[-1, -1])
    mode(m) <- "integer"
    m
})

eset <- Biobase::ExpressionSet(counts, AnnotatedDataFrame(clinvar))
ADD REPLYlink modified 2.6 years ago • written 2.7 years ago by Martin Morgan ♦♦ 20k

Without actually having a close look, that seems like a very useful resource!

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Martin Morgan ♦♦ 20k

FYI: We released an updated version of the data, which now includes 9,264 tumor samples and 741 normal samples from TCGA. The previous files are still available, so the above code should still work. The new file names can be found here: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944. We plan to do at least 1-2 more releases as additional, publishable data become available via TCGA.

ADD REPLYlink written 2.5 years ago by Stephen Piccolo500

That's a great resource, Stephen. Thanks! I do have a suggestion though. I wonder if it would be possible to use Entrez Gene IDs to annotate the count tables in future releases, rather than using Gene Symbols. After all, that's what Subread internal annotation is using. I suspect you might be using some custom annotation. Still, I feel using Entrez Gene IDs might be more cross-compatible.

ADD REPLYlink written 2.3 years ago by Aliaksei Holik350

Hi Stephen!

Would you please send me the citation info?

Thank you.

Anita

ADD REPLYlink written 2.3 years ago by alakatos80
1

Our paper describing this data set is now published in Bioinformatics (http://www.ncbi.nlm.nih.gov/pubmed/26209429). It would be fantastic if you (and others) could cite this article. Thanks!

ADD REPLYlink modified 2.3 years ago by Steve Lianoglou12k • written 2.3 years ago by Stephen Piccolo500

Thanks Stephen. This seems like a very promising resource. Do you have any help material as to how to load data into R as expressionSet , combine cancer and normal patients in one table and do the anaysis of cancer vs. non cancer data?

ADD REPLYlink written 20 months ago by hamda.binte.ajmal0
2
gravatar for Gordon Smyth
2.7 years ago by
Gordon Smyth32k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth32k wrote:

Aaron is right that edgeR needs counts rather than normalized values. And the subread/featureCounts count summaries from Stephen Piccolo sound extraordinarily useful, an excellent way to go.

However let me just point that edgeR will work with non-integer counts, provided that they add up to the correct library sizes for each sample. If the TCGA "normalized counts" are in fact expected counts from RSEM, then they will work quite well with either edgeR or voom.

ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by Gordon Smyth32k
2

I feel like this comment should be pinned somehow, or made into a separate post/news item. Since analysis of count/RNAseq data has crawled itself out of the water an onto land, everyone has said that non-count-based input was a non-starter for edgeR (and other methods of similar ilk), and I believe this is the first place (to my knowledge) that it has been authoritatively stated that RSEM output actually works well (as opposed to it being "passable") with these tools.

ADD REPLYlink written 2.7 years ago by Steve Lianoglou12k
1
gravatar for Aaron Lun
2.7 years ago by
Aaron Lun17k
Cambridge, United Kingdom
Aaron Lun17k wrote:

Normalized counts are not appropriate for edgeR. The variance in the negative binomial model depends on the mean, so the significance of any differences will depend on the absolute size of the count. This information about the absolute size is lost upon normalization. In short, if you want to use edgeR, try to get the original counts rather than transformed values that are obtained after high-level processing.

ADD COMMENTlink written 2.7 years ago by Aaron Lun17k
1
gravatar for alakatos
2.7 years ago by
alakatos80
United States
alakatos80 wrote:

Yes, it does. Thank you.

I am wondering if you have a tentative schedule for the new release including the normal samples.

Thank you.

Anita

ADD COMMENTlink written 2.7 years ago by alakatos80

Sorry for the delayed response. I didn't see your question until now. We have processed the normal samples (and about 1500 additional tumor samples). We're in the process of uploading these data to GEO and will post a comment as soon as they are available. In the meantime, feel free to contact me directly, and I can send the data to you another way.

ADD REPLYlink written 2.5 years ago by Stephen Piccolo500

We now include normal data from TCGA in this data set. Please see comment above.

ADD REPLYlink written 2.5 years ago by Stephen Piccolo500

Great!

Thank you very much.

Anita

ADD REPLYlink written 2.4 years ago by alakatos80
0
gravatar for ycding
2.7 years ago by
ycding10
United States
ycding10 wrote:

Hi Stephen and Aaron, thank you so much for your quick help.  I briefly checked into the link provided by Stephen, it is a very good data source,  I will definitely look into it. 

ADD COMMENTlink written 2.7 years ago by ycding10
0
gravatar for alakatos
2.7 years ago by
alakatos80
United States
alakatos80 wrote:

Hello Stephen and Martin, 

Thank you for making the dataset and code available.

I would like to use a subset of GSM1536837_TCGA_20.Illumina.tumor_Rsubread_TPM.txt.gz file  for my analysis.

Question 1.

Would you please confirm that this modified code below is correct?

countFile <- "GSM1536837_TCGA_20.Illumina.tumor_Rsubread_TPM.txt.gz"
untar("GSE62944_RAW.tar", countFile)
counts <- local({
    data <- scan(countFile, what=character(), sep="\t", quote="")
    m <- matrix(data, 7707)
    dimnames(m) <- list(m[,1], m[1,])
    m <- t(m[-1, -1])
    mode(m) <- "float"
    m
})

Question 2.

I am puzzled by the coding system and I would like to make sure that I am using the correct labeling “normal” and “tumor” for the subset of the dataset. I am interested in the 373 SKCM subjects available in GSE62944 clinvar. I am wondering if your dataset includes only the subjects with TN (Tumor, Matched Normal)  or both TN and NT (Normal, Matched Tumor).

In case both classes are included, shall I  try to match the bcr_patient_uuids in order to identify them?

TGCA website barcode:

bcr_patient_uuid bcr_patient_barcode bcr_sample_uuid
3DD5A206-D7F3-42F1-B9CC-4B31C76D495D TCGA-BF-A1PU A4D739D2-63A1-41FC-82C1-F1D027DB756D

clinvar barcode:

bcr_patient_barcode patient_id bcr_patient_uuid
TCGA-BF-A1PU-01A-11R-A18S-07 TCGA-BF-A1PU 3DD5A206-D7F3-42F1-B9CC-4B31C76D495D

Or what would be the most straightforward way for a novice with less opportunity for mistakes?

Thank you in advance,

Anita  

 

ADD COMMENTlink modified 2.7 years ago by Martin Morgan ♦♦ 20k • written 2.7 years ago by alakatos80

I'm not sure if this will answer your question, but this data set does not yet have data for normal samples. When we created the data set, we didn't have a need for the normal samples. But various people have expressed interest in this, so we will work to add it.

ADD REPLYlink written 2.7 years ago by Stephen Piccolo500

The code should read mode(m) = "numeric". The data is large, causing my 16 Gb machine to start swappinng, so you'll want to parse the counts on a computer with sufficient memory (it would be a little more memory efficient to use 'm' instead of 'data' as a variable name).

ADD REPLYlink written 2.7 years ago by Martin Morgan ♦♦ 20k

Thank you for the code clarification.

Anita

ADD REPLYlink written 2.7 years ago by alakatos80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 136 users visited in the last hour