Can I feed TCGA normalized count data to EdgeR for differential gene expression analysis
6
1
Entering edit mode
ycding ▴ 10
@ycding-7496
Last seen 3.4 years ago
United States

Dear R people,

I downloaded TCGA level3 RNAseq data for 63 prostate tumor samples with conditions (age difference) I am interested in. I want to identify differential expressed genes between two age groups. I am reading the user guide for the EdgeR program and found that it takes integer count data as input. however, the level 3 TCGA data are in the following format:

Hybridization REF TCGA-HC-8260-01A-11R-2263-07 TCGA-HC-8260-11A-01R-2263-07
gene_id normalized_count normalized_count
EML3|256364 973.2368 839.2435
EML4|27436 1135.3904 727.7384
EML5|161436 518.8917 543.7352
EML6|400954 102.6448 24.4287
EMP1|2012 2250.9446 1138.2979
EMP2|2013 2926.6373 4150.5122
so,  it is normalized count.

 

 

 

 

Can I use EdgeR?

thank you !!!

Yuanchun Ding

EdgeR TCGA normalized count • 13k views
ADD COMMENT
4
Entering edit mode
@stephen-piccolo-6761
Last seen 3.5 years ago
United States

One option is to use this data set that we put together: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944

It has raw read counts (obtained using the Rsubread package), which you can use with edgeR.

We are putting together a publication based on it, so if you end up using it, please let me know, and I'll send you the citation info.

ADD COMMENT
1
Entering edit mode

An ExpressionSet of this data is available in AnnotationHub

library(AnnotationHub)
hub = AnnotationHub()
gse = query(hub, "GSE62944")[[1]]

(a little expensive to download the first time, but then cached locally for subsequent use). It is based on the following script, which retrieves and processes the data to a 264.5Mb ExpressionSet, ready for use (expect that the pData are character vectors instead of types)...

## Imports: GEOquery, Biobase

suppl <- GEOquery::getGEOSuppFiles("GSE62944")
setwd("GSE62944")

clinvar <- local({
    clinvarFile <- "GSE62944_TCGA_20_420_Clinical_Variables_7706_Samples.txt.gz"
    data <- scan(clinvarFile, what=character(), sep="\t", quote="")
    m <- matrix(data, 7707)
    dimnames(m) <- list(m[,1], m[1,])
    as.data.frame(m[-1, -1])
})

CancerType <-
    read.delim("GSE62944_TCGA_20_CancerType_Samples.txt.gz",
               header=FALSE, colClasses=c("character", "factor"),
               col.names=c("sample", "type"))
idx <- match(rownames(clinvar), CancerType$sample)
stopifnot(!anyNA(idx))
clinvar$CancerType <- CancerType$type[idx]

countFile <- 
    "GSM1536837_TCGA_20.Illumina.tumor_Rsubread_FeatureCounts.txt.gz"
untar("GSE62944_RAW.tar", countFile)
counts <- local({
    data <- scan(countFile, what=character(), sep="\t", quote="")
    m <- matrix(data, 7707)
    dimnames(m) <- list(m[,1], m[1,])
    m <- t(m[-1, -1])
    mode(m) <- "integer"
    m
})

eset <- Biobase::ExpressionSet(counts, AnnotatedDataFrame(clinvar))
ADD REPLY
0
Entering edit mode

Without actually having a close look, that seems like a very useful resource!

ADD REPLY
0
Entering edit mode

FYI: We released an updated version of the data, which now includes 9,264 tumor samples and 741 normal samples from TCGA. The previous files are still available, so the above code should still work. The new file names can be found here: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944. We plan to do at least 1-2 more releases as additional, publishable data become available via TCGA.

ADD REPLY
0
Entering edit mode

That's a great resource, Stephen. Thanks! I do have a suggestion though. I wonder if it would be possible to use Entrez Gene IDs to annotate the count tables in future releases, rather than using Gene Symbols. After all, that's what Subread internal annotation is using. I suspect you might be using some custom annotation. Still, I feel using Entrez Gene IDs might be more cross-compatible.

ADD REPLY
0
Entering edit mode

Hi Stephen!

Would you please send me the citation info?

Thank you.

Anita

ADD REPLY
1
Entering edit mode

Our paper describing this data set is now published in Bioinformatics (http://www.ncbi.nlm.nih.gov/pubmed/26209429). It would be fantastic if you (and others) could cite this article. Thanks!

ADD REPLY
0
Entering edit mode

Thanks Stephen. This seems like a very promising resource. Do you have any help material as to how to load data into R as expressionSet , combine cancer and normal patients in one table and do the anaysis of cancer vs. non cancer data?

ADD REPLY
2
Entering edit mode
@gordon-smyth
Last seen 20 minutes ago
WEHI, Melbourne, Australia

Aaron is right that edgeR needs counts rather than normalized values. And the subread/featureCounts count summaries from Stephen Piccolo sound extraordinarily useful, an excellent way to go.

However let me just point that edgeR will work with non-integer counts, provided that they add up to the correct library sizes for each sample. If the TCGA "normalized counts" are in fact expected counts from RSEM, then they will work quite well with either edgeR or voom.

ADD COMMENT
2
Entering edit mode

I feel like this comment should be pinned somehow, or made into a separate post/news item. Since analysis of count/RNAseq data has crawled itself out of the water an onto land, everyone has said that non-count-based input was a non-starter for edgeR (and other methods of similar ilk), and I believe this is the first place (to my knowledge) that it has been authoritatively stated that RSEM output actually works well (as opposed to it being "passable") with these tools.

ADD REPLY
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 3 hours ago
The city by the bay

Normalized counts are not appropriate for edgeR. The variance in the negative binomial model depends on the mean, so the significance of any differences will depend on the absolute size of the count. This information about the absolute size is lost upon normalization. In short, if you want to use edgeR, try to get the original counts rather than transformed values that are obtained after high-level processing.

ADD COMMENT
1
Entering edit mode
alakatos ▴ 130
@alakatos-6983
Last seen 4.5 years ago
United States

Yes, it does. Thank you.

I am wondering if you have a tentative schedule for the new release including the normal samples.

Thank you.

Anita

ADD COMMENT
0
Entering edit mode

Sorry for the delayed response. I didn't see your question until now. We have processed the normal samples (and about 1500 additional tumor samples). We're in the process of uploading these data to GEO and will post a comment as soon as they are available. In the meantime, feel free to contact me directly, and I can send the data to you another way.

ADD REPLY
0
Entering edit mode

We now include normal data from TCGA in this data set. Please see comment above.

ADD REPLY
0
Entering edit mode

Great!

Thank you very much.

Anita

ADD REPLY
0
Entering edit mode
ycding ▴ 10
@ycding-7496
Last seen 3.4 years ago
United States

Hi Stephen and Aaron, thank you so much for your quick help.  I briefly checked into the link provided by Stephen, it is a very good data source,  I will definitely look into it. 

ADD COMMENT
0
Entering edit mode
alakatos ▴ 130
@alakatos-6983
Last seen 4.5 years ago
United States

Hello Stephen and Martin, 

Thank you for making the dataset and code available.

I would like to use a subset of GSM1536837_TCGA_20.Illumina.tumor_Rsubread_TPM.txt.gz file  for my analysis.

Question 1.

Would you please confirm that this modified code below is correct?

countFile <- "GSM1536837_TCGA_20.Illumina.tumor_Rsubread_TPM.txt.gz"
untar("GSE62944_RAW.tar", countFile)
counts <- local({
    data <- scan(countFile, what=character(), sep="\t", quote="")
    m <- matrix(data, 7707)
    dimnames(m) <- list(m[,1], m[1,])
    m <- t(m[-1, -1])
    mode(m) <- "float"
    m
})

Question 2.

I am puzzled by the coding system and I would like to make sure that I am using the correct labeling “normal” and “tumor” for the subset of the dataset. I am interested in the 373 SKCM subjects available in GSE62944 clinvar. I am wondering if your dataset includes only the subjects with TN (Tumor, Matched Normal)  or both TN and NT (Normal, Matched Tumor).

In case both classes are included, shall I  try to match the bcr_patient_uuids in order to identify them?

TGCA website barcode:

bcr_patient_uuid bcr_patient_barcode bcr_sample_uuid
3DD5A206-D7F3-42F1-B9CC-4B31C76D495D TCGA-BF-A1PU A4D739D2-63A1-41FC-82C1-F1D027DB756D

clinvar barcode:

bcr_patient_barcode patient_id bcr_patient_uuid
TCGA-BF-A1PU-01A-11R-A18S-07 TCGA-BF-A1PU 3DD5A206-D7F3-42F1-B9CC-4B31C76D495D

Or what would be the most straightforward way for a novice with less opportunity for mistakes?

Thank you in advance,

Anita  

 

ADD COMMENT
0
Entering edit mode

I'm not sure if this will answer your question, but this data set does not yet have data for normal samples. When we created the data set, we didn't have a need for the normal samples. But various people have expressed interest in this, so we will work to add it.

ADD REPLY
0
Entering edit mode

The code should read mode(m) = "numeric". The data is large, causing my 16 Gb machine to start swappinng, so you'll want to parse the counts on a computer with sufficient memory (it would be a little more memory efficient to use 'm' instead of 'data' as a variable name).

ADD REPLY
0
Entering edit mode

Thank you for the code clarification.

Anita

ADD REPLY

Login before adding your answer.

Traffic: 732 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6