Question: Normal Patient Samples from GSE62944
23 months ago, hamda.binte.ajmal wrote:


I am trying to do a differential expression analysis of genes in normal vs. breast cancer patients. For that purpose, I chose the GEO dataset GSE62944, as it contains 9264 tumour samples and 741 normal samples.

Question 1: 

I load the expression set using this code:


    library(AnnotationHub)
    ah <- AnnotationHub()
    query(ah, "GSE62944")

What I see is: 

AnnotationHub with 1 record
# snapshotDate(): 2016-03-09 
# names(): AH28855
# $dataprovider: GEO
# $species: Homo sapiens
# $rdataclass: ExpressionSet
# $title: RNA-Sequencing and clinical data for 7706 tumor samples from The Cancer Genome Atlas
# $description: TCGA RNA-seq Rsubread-summarized raw count data for 7706 tumor samples, represented as an R / Bioconductor...
# $taxonomyid: 9606
# $genome: hg19
# $sourcetype: tar.gz
# $sourceurl:
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: TCGA, RNA-seq, Expression, Count 
# retrieve record with 'object[["AH28855"]]' 


Why does the title say 7706 tumor samples? Does this expression set contain normal samples at all? If so, how would I access them?


Question 2: 

I subset the breast cancer patient samples using code:

    tcga_data <- ah[["AH28855"]]

    brca_data <- tcga_data[, which(phenoData(tcga_data)$CancerType=="BRCA")]


How can I subset both breast cancer and normal samples from the entire dataset?


Question 3: 

Is there a way to subset specific genes (i.e. rows) from the data set?


Help would be appreciated

23 months ago, Sonali Arora (United States) wrote:

Hi Hamda,

Solution #1

Please see the supplementary file section of the GEO page. We have added only the
7706 tumor samples from TCGA to AnnotationHub; the normal samples have not been added.

> library(AnnotationHub)
> ah = AnnotationHub()
> ah <- query(ah , "GSE62944")
> data <- ah[["AH28855"]]

This is also stated in the record's description:

> ah['AH28855']$description
[1] "TCGA RNA-seq Rsubread-summarized raw count data for 7706 tumor samples, represented as an R / Bioconductor Biobase ExpressionSet. R data representation derived from GEO accession GSE62944."

Solution #2

I'll show an example with bladder cancer data. The data is stored in an ExpressionSet.

> bladder_data <- data[, which(phenoData(data)$CancerType=="BLCA")]
> class(bladder_data)
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"

I highly recommend exploring Bioconductor objects and learning how they work, especially through the ExpressionSet vignette.

Check out the show method. It gives you an overview of the data: the number of genes, the number of samples, and so on.

> bladder_data
ExpressionSet (storageMode: lockedEnvironment)
assayData: 23368 features, 273 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: TCGA-BL-A0C8-01A-11R-A10U-07
    TCGA-BL-A0C8-01A-11R-A277-07 ... TCGA-YC-A89H-01A-11R-A36F-07 (273
    total)
  varLabels: bcr_patient_barcode bcr_patient_uuid ... CancerType (421
    total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'

To get the expression values for the tumor samples, use exprs(). In the resulting matrix, rows are gene names and columns are sample names.

> raw_bladder_data = exprs(bladder_data)
> class(raw_bladder_data)
[1] "matrix"
> raw_bladder_data[1:5, 1:5]
            TCGA-BL-A0C8-01A-11R-A10U-07 TCGA-BL-A0C8-01A-11R-A277-07
1/2-SBSRNA4                           25                           20
A1BG                                  19                           22
A1BG-AS1                              11                           12
A1CF                                 100                          124
A2LD1                                146                          141
            TCGA-BL-A0C8-01B-04R-A277-07 TCGA-BL-A13I-01A-11R-A13Y-07
1/2-SBSRNA4                           53                           31
A1BG                                  10                          210
A1BG-AS1                              10                           95
A1CF                                 553                            2
A2LD1                                153                          142
1/2-SBSRNA4                            1
A1BG                                  10
A1BG-AS1                               8
A1CF                                   2
A2LD1                                 10
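
The same column-subsetting pattern extends to several groups at once. Here is a minimal sketch, assuming a toy ExpressionSet built by hand (the sample names, counts, and the `toy` object are made up for illustration and are not part of AH28855). It uses `%in%` instead of `==`, which also avoids the need for `which()` because `%in%` never returns NA:

```r
library(Biobase)

# Toy count matrix: 3 genes (rows) x 4 samples (columns); values are made up
counts <- matrix(1:12, nrow = 3,
                 dimnames = list(c("PTEN", "MYC", "BRCA1"),
                                 c("s1", "s2", "s3", "s4")))

# Sample annotation; row names must match the column names of `counts`
pheno <- data.frame(CancerType = c("BRCA", "BLCA", "BRCA", "LUAD"),
                    row.names = colnames(counts))

toy <- ExpressionSet(assayData = counts,
                     phenoData = AnnotatedDataFrame(pheno))

# Logical index: TRUE for samples whose CancerType is in the wanted set
keep <- pData(toy)$CancerType %in% c("BRCA", "BLCA")
subset_eset <- toy[, keep]

dim(exprs(subset_eset))
table(subset_eset$CancerType)
```

With the real data, the same line would read something like `data[, pData(data)$CancerType %in% c("BRCA", "BLCA")]`. Since this record holds tumor samples only, normal samples cannot be selected this way.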

Solution #3 

To subset this matrix for your genes of interest, simply index it as you would any other matrix:

> my_genes = c("PTEN", "MYC", "BRCA1")
> idx = match(tolower(my_genes), tolower(rownames(raw_bladder_data)))
> my_genes_raw_data = raw_bladder_data[idx, ]
> dim(my_genes_raw_data)
[1]   3 273

> my_genes_raw_data[,1:5]
      TCGA-BL-A0C8-01A-11R-A10U-07 TCGA-BL-A0C8-01A-11R-A277-07
PTEN                          2535                         5054
MYC                            336                          320
BRCA1                          939                         1676
      TCGA-BL-A0C8-01B-04R-A277-07 TCGA-BL-A13I-01A-11R-A13Y-07
PTEN                          4147                         3775
MYC                             75                         7239
BRCA1                         1384                         1987
PTEN                           217
MYC                            358
BRCA1                          207
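
One caveat with `match()`: a gene name that is absent from the row names yields NA, and indexing with NA produces a row of NAs. A minimal sketch of one way to guard against that, assuming a small made-up matrix in place of `raw_bladder_data` (the counts and the name "NOTAGENE" are hypothetical):

```r
# Toy count matrix standing in for exprs(bladder_data); values are made up
counts <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3,
                 dimnames = list(c("PTEN", "MYC", "BRCA1"), c("s1", "s2")))

my_genes <- c("pten", "NOTAGENE", "BRCA1")

# match() returns NA for names that are absent from the row names
idx <- match(tolower(my_genes), tolower(rownames(counts)))

# Record the missing genes, then drop their NA positions before indexing
missing_genes <- my_genes[is.na(idx)]
found <- counts[idx[!is.na(idx)], , drop = FALSE]

missing_genes
rownames(found)
```

`drop = FALSE` keeps the result a matrix even when only one gene is found.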

We could add the normal samples as another ExpressionSet, and you could use them in a similar way.
More on that soon, so watch this space.
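
If the normal samples become available as a second count matrix, combining them with the tumor counts could look roughly like this. This is only a sketch under that assumption: the two matrices, their gene names, and their counts are made up, and nothing here reflects how the normal samples will actually be distributed.

```r
# Hypothetical toy matrices; gene names and counts are made up for illustration
tumor  <- matrix(c(5, 8, 2, 9, 4, 7), nrow = 3,
                 dimnames = list(c("PTEN", "MYC", "BRCA1"), c("t1", "t2")))
normal <- matrix(c(6, 1, 3, 2), nrow = 2,
                 dimnames = list(c("MYC", "PTEN"), c("n1", "n2")))

# Keep only genes present in both matrices, in one common order
shared <- intersect(rownames(tumor), rownames(normal))
combined <- cbind(tumor[shared, , drop = FALSE],
                  normal[shared, , drop = FALSE])

# Group label per column, in the same order as the columns of `combined`
group <- factor(rep(c("tumor", "normal"), c(ncol(tumor), ncol(normal))),
                levels = c("normal", "tumor"))

combined
group
```

Indexing both matrices by the same `shared` vector matters because the two sources may list genes in a different order or cover different gene sets.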

Thanks and Regards,


Thank you for your reply Sonali.

I was actually confused because I read in this post, "C: Can I feed TCGA normalized count data to EdgeR for differential gene expression", that the normal samples had been added, so I wondered if there was any way to get them.

Your replies are a great help to me. Thanks a lot.


Reply written 23 months ago by hamda.binte.ajmal