Normal Patient Samples from GSE62944


I am trying to run a differential expression analysis of genes in normal vs. breast cancer patients. For that purpose, I chose the GEO dataset GSE62944, as it contains 9264 tumour samples and 741 normal samples.

Question 1: 

I load the expression set using this code:


    library(AnnotationHub)
    ah <- AnnotationHub()
    query(ah, "GSE62944")

What I see is: 

AnnotationHub with 1 record
# snapshotDate(): 2016-03-09 
# names(): AH28855
# $dataprovider: GEO
# $species: Homo sapiens
# $rdataclass: ExpressionSet
# $title: RNA-Sequencing and clinical data for 7706 tumor samples from The Cancer Genome Atlas
# $description: TCGA RNA-seq Rsubread-summarized raw count data for 7706 tumor samples, represented as an R / Bioconductor...
# $taxonomyid: 9606
# $genome: hg19
# $sourcetype: tar.gz
# $sourceurl:
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: TCGA, RNA-seq, Expression, Count 
# retrieve record with 'object[["AH28855"]]' 


Why does the title say 7706 tumor samples? Does this expression set contain normal samples at all? If so, how would I access them?


Question 2: 

I subset the breast cancer patient samples using code:

    tcga_data <- ah[["AH28855"]]

    brca_data <- tcga_data[, which(phenoData(tcga_data)$CancerType=="BRCA")]


How can I subset both breast cancer and normal samples from the entire dataset?


Question 3: 

Is there a way to subset specific genes (i.e. rows) from the data set?


Help would be appreciated

Sonali Arora ▴ 380

Hi Hamda,

Solution #1

Please see the supplementary files section of the GEO page. We have added only the
7706 tumor samples from TCGA; the normal samples have not been added.

> library(AnnotationHub)
> ah = AnnotationHub()
> ah <- query(ah , "GSE62944")
> data <- ah[["AH28855"]]

This is also stated in the description of the record:

> ah['AH28855']$description
[1] "TCGA RNA-seq Rsubread-summarized raw count data for 7706 tumor samples, represented as an R / Bioconductor Biobase ExpressionSet. R data representation derived from GEO accession GSE62944."
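
As a quick way to check whether any TCGA-derived object holds normal samples, the sample type can be read from the barcode itself: characters 14-15 of a TCGA barcode are the sample type code (01-09 tumor, 10-19 normal, 20-29 control). Here is a minimal sketch in base R, using three barcodes that appear in this thread plus one made-up barcode (the `TCGA-XX-...` entry is hypothetical and is there only to show a normal code). With the real object you would pass `sampleNames(data)` instead of the hard-coded vector.

```r
# Three barcodes copied from this thread, plus one HYPOTHETICAL barcode
# (TCGA-XX-...) invented purely to illustrate a normal sample type code.
barcodes <- c("TCGA-BL-A0C8-01A-11R-A10U-07",
              "TCGA-BL-A0C8-01A-11R-A277-07",
              "TCGA-YC-A89H-01A-11R-A36F-07",
              "TCGA-XX-0000-11A-00R-0000-07")

# Characters 14-15 hold the two-digit sample type code
type_code <- as.integer(substr(barcodes, 14, 15))

# 01-09 = tumor, 10-19 = normal, 20-29 = control
sample_class <- cut(type_code,
                    breaks = c(0, 9, 19, 29),
                    labels = c("tumor", "normal", "control"))

# Count samples per class
table(sample_class)
```

If the table shows no "normal" entries for your object, it contains tumor samples only.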

Solution #2

I'll show an example with bladder cancer data. The data is stored in an ExpressionSet.

> bladder_data <- data[, which(phenoData(data)$CancerType=="BLCA")]
> class(bladder_data)
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"

I highly recommend exploring and understanding Bioconductor objects, especially by reading the ExpressionSet vignette.

Check out the show method. It gives you an overview of the data: number of genes, number of samples, and so on.

> bladder_data
ExpressionSet (storageMode: lockedEnvironment)
assayData: 23368 features, 273 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: TCGA-BL-A0C8-01A-11R-A10U-07
    TCGA-BL-A0C8-01A-11R-A277-07 ... TCGA-YC-A89H-01A-11R-A36F-07 (273
    total)
  varLabels: bcr_patient_barcode bcr_patient_uuid ... CancerType (421
    total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'

To get the expression matrix for the tumor samples, use exprs(). Rows are gene names and columns are sample names.

> raw_bladder_data = exprs(bladder_data)
> class(raw_bladder_data)
[1] "matrix"
> raw_bladder_data[1:5, 1:5]
            TCGA-BL-A0C8-01A-11R-A10U-07 TCGA-BL-A0C8-01A-11R-A277-07
1/2-SBSRNA4                           25                           20
A1BG                                  19                           22
A1BG-AS1                              11                           12
A1CF                                 100                          124
A2LD1                                146                          141
            TCGA-BL-A0C8-01B-04R-A277-07 TCGA-BL-A13I-01A-11R-A13Y-07
1/2-SBSRNA4                           53                           31
A1BG                                  10                          210
A1BG-AS1                              10                           95
A1CF                                 553                            2
A2LD1                                153                          142
1/2-SBSRNA4                            1
A1BG                                  10
A1BG-AS1                               8
A1CF                                   2
A2LD1                                 10
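
To keep more than one cancer type in a single subset, the same column-subsetting idea works with `%in%`. The sketch below builds a small toy ExpressionSet so that it runs without downloading anything; the counts, the sample names `S1`..`S4`, and the object name `toy` are all made up for illustration. On the real object you would apply the same two lines to `data` and its `CancerType` column.

```r
library(Biobase)

# Toy counts: 3 genes x 4 samples (made-up values and sample names)
counts <- matrix(1:12, nrow = 3,
                 dimnames = list(c("PTEN", "MYC", "BRCA1"),
                                 paste0("S", 1:4)))

# Toy phenotype table; row names must match the column names of counts
pheno <- data.frame(CancerType = c("BRCA", "BLCA", "BRCA", NA),
                    row.names = colnames(counts))

toy <- ExpressionSet(assayData = counts,
                     phenoData = AnnotatedDataFrame(pheno))

# %in% returns FALSE for NA, so no which() wrapper is needed
keep <- toy$CancerType %in% c("BRCA", "BLCA")
subset_eset <- toy[, keep]
dim(subset_eset)

# Rows can also be selected by gene name directly on the ExpressionSet,
# which keeps the phenotype data attached to the expression values
small_eset <- toy[c("PTEN", "MYC"), keep]
dim(small_eset)
```

Subsetting the ExpressionSet itself, rather than the matrix returned by `exprs()`, keeps the sample annotation in sync with the counts.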

Solution #3 

To subset this matrix to your genes of interest, index its rows as you would for any matrix:

> my_genes = c("PTEN", "MYC", "BRCA1")
> idx = match(tolower(my_genes), tolower(rownames(raw_bladder_data)))
> my_genes_raw_data = raw_bladder_data[idx, ]
> dim(my_genes_raw_data)
[1]   3 273

> my_genes_raw_data[,1:5]
      TCGA-BL-A0C8-01A-11R-A10U-07 TCGA-BL-A0C8-01A-11R-A277-07
PTEN                          2535                         5054
MYC                            336                          320
BRCA1                          939                         1676
      TCGA-BL-A0C8-01B-04R-A277-07 TCGA-BL-A13I-01A-11R-A13Y-07
PTEN                          4147                         3775
MYC                             75                         7239
BRCA1                         1384                         1987
PTEN                           217
MYC                            358
BRCA1                          207
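
One caveat with the `match()` approach: a gene that is not present in the row names yields `NA`, and indexing a matrix with `NA` returns a row of `NA` values. Here is a minimal base R sketch of one way to handle that, using a made-up toy matrix (the values, the sample names `S1` and `S2`, and the placeholder `NOT_A_GENE` are illustrative only):

```r
# Toy count matrix: 4 genes x 2 samples (made-up values)
counts <- matrix(1:8, nrow = 4,
                 dimnames = list(c("PTEN", "MYC", "BRCA1", "A1BG"),
                                 c("S1", "S2")))

# "myc" tests the case-insensitive match; "NOT_A_GENE" is absent on purpose
my_genes <- c("PTEN", "myc", "NOT_A_GENE")

# Case-insensitive lookup; NA marks genes that were not found
idx <- match(tolower(my_genes), tolower(rownames(counts)))

# Report the genes that could not be matched
missing_genes <- my_genes[is.na(idx)]
missing_genes

# Keep only matched rows; drop = FALSE preserves the matrix shape
my_counts <- counts[idx[!is.na(idx)], , drop = FALSE]
my_counts
```

Checking `missing_genes` before continuing helps catch gene symbols that are misspelled or named differently in this annotation.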

We could add the normal samples as another ExpressionSet, and you could then use them in a similar way.
More on that soon. Watch this space.

Thanks and Regards,


Thank you for your reply Sonali.

I was confused because I read in this post, "C: Can I feed TCGA normalized count data to EdgeR for differential gene expression", that the normal samples had been added, so I wondered whether there is any way to get them.

Your replies are a great help to me. Thanks a lot.


