Question

Normal Patient Samples from GSE62944

hamda.binte.ajmal • 0

Hello,

I am trying to run a differential expression analysis of genes between normal and breast cancer patient samples. For that purpose, I chose the GEO dataset GSE62944, as it contains 9264 tumour samples and 741 normal samples.

Question 1:

I load the expression set using this code:

library(AnnotationHub)

ah = AnnotationHub()
query(ah , "GSE62944")

What I see is:

AnnotationHub with 1 record
# snapshotDate(): 2016-03-09
# names(): AH28855
# $dataprovider: GEO
# $species: Homo sapiens
# $rdataclass: ExpressionSet
# $title: RNA-Sequencing and clinical data for 7706 tumor samples from The Cancer Genome Atlas
# $description: TCGA RNA-seq Rsubread-summarized raw count data for 7706 tumor samples, represented as an R / Bioconductor...
# $taxonomyid: 9606
# $genome: hg19
# $sourcetype: tar.gz
# $sourceurl: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: TCGA, RNA-seq, Expression, Count
# retrieve record with 'object[["AH28855"]]'

Why does the title say 7706 tumor samples? Does this expression set contain normal samples at all? If so, how would I access them?

Question 2:

I subset the breast cancer patient samples using code:

tcga_data <- ah[["AH28855"]]

brca_data <- tcga_data[, which(phenoData(tcga_data)$CancerType=="BRCA")]

How can I subset both breast cancer and normal samples from the entire dataset?

Question 3:

Is there a way to subset specific genes (i.e., rows) from the data set?

Any help would be appreciated.

updated 8.7 years ago by Sonali Arora • written 8.7 years ago by hamda.binte.ajmal

score 2 · Accepted Answer · 2016-04-01

Hi Hamda,

Solution #1

Please see the supplementary file section of the GEO page. We have added only the
7706 tumor samples from TCGA to AnnotationHub; the normal samples have not been added.

> library(AnnotationHub)
> ah = AnnotationHub()
> ah <- query(ah , "GSE62944")
> data <- ah[["AH28855"]]

This is also stated in the record's description:

> ah['AH28855']$description
[1] "TCGA RNA-seq Rsubread-summarized raw count data for 7706 tumor samples, represented as an R / Bioconductor Biobase ExpressionSet. R data representation derived from GEO accession GSE62944."

Solution #2

I'll show an example with bladder cancer data. The data is stored in an ExpressionSet:

> bladder_data <- data[, which(phenoData(data)$CancerType=="BLCA")]
> class(bladder_data)  
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
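
To keep more than one cancer type at once, a logical vector built with %in% can replace the which(... ==) test. The following is a minimal base-R sketch on toy data, not the real object: `pheno` and `counts` are hypothetical stand-ins for the phenotype table and the count matrix, and the sample names are made up. On the real ExpressionSet the same logical vector should work as a column index, e.g. data[, keep].

```r
# Toy stand-in for the phenotype table (hypothetical values).
pheno <- data.frame(
  CancerType = c("BRCA", "BLCA", NA, "BRCA", "LUAD"),
  row.names = paste0("sample", 1:5),
  stringsAsFactors = FALSE
)

# Toy count matrix: 3 genes (rows) x 5 samples (columns).
counts <- matrix(
  1:15,
  nrow = 3,
  dimnames = list(c("PTEN", "MYC", "BRCA1"), rownames(pheno))
)

# Cancer types to keep; %in% returns FALSE (not NA) for missing values.
wanted <- c("BRCA", "BLCA")
keep <- pheno$CancerType %in% wanted

# Subset the columns of the counts and the matching rows of the phenotypes.
subset_counts <- counts[, keep, drop = FALSE]
subset_pheno <- pheno[keep, , drop = FALSE]

print(table(subset_pheno$CancerType))
```

Subsetting the counts and the phenotype table with the same `keep` vector keeps the two in the same sample order, which matters when building a design for differential expression.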

I highly recommend getting familiar with Bioconductor objects, especially through the ExpressionSet vignette.

Check out the show method: it summarizes the data, including the number of genes and the number of samples.

> bladder_data
ExpressionSet (storageMode: lockedEnvironment)
assayData: 23368 features, 273 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: TCGA-BL-A0C8-01A-11R-A10U-07
    TCGA-BL-A0C8-01A-11R-A277-07 ... TCGA-YC-A89H-01A-11R-A36F-07 (273
    total)
  varLabels: bcr_patient_barcode bcr_patient_uuid ... CancerType (421
    total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation:

To get the expression matrix for the tumor samples, use exprs(). Rows are gene names and columns are sample names.

> raw_bladder_data = exprs(bladder_data)
> class(raw_bladder_data)
[1] "matrix"
> raw_bladder_data[1:5, 1:5]
            TCGA-BL-A0C8-01A-11R-A10U-07 TCGA-BL-A0C8-01A-11R-A277-07
1/2-SBSRNA4                           25                           20
A1BG                                  19                           22
A1BG-AS1                              11                           12
A1CF                                 100                          124
A2LD1                                146                          141
            TCGA-BL-A0C8-01B-04R-A277-07 TCGA-BL-A13I-01A-11R-A13Y-07
1/2-SBSRNA4                           53                           31
A1BG                                  10                          210
A1BG-AS1                              10                           95
A1CF                                 553                            2
A2LD1                                153                          142
            TCGA-BL-A13I-01A-11R-A277-07
1/2-SBSRNA4                            1
A1BG                                  10
A1BG-AS1                               8
A1CF                                   2
A2LD1                                 10

Solution #3

To subset this matrix for your genes of interest, simply index it by row as you would any other matrix:

> my_genes = c("PTEN", "MYC", "BRCA1")
> idx = match(tolower(my_genes), tolower(rownames(raw_bladder_data)))
> my_genes_raw_data = raw_bladder_data[idx, ]
> dim(my_genes_raw_data)
[1]   3 273

> my_genes_raw_data[,1:5]
      TCGA-BL-A0C8-01A-11R-A10U-07 TCGA-BL-A0C8-01A-11R-A277-07
PTEN                          2535                         5054
MYC                            336                          320
BRCA1                          939                         1676
      TCGA-BL-A0C8-01B-04R-A277-07 TCGA-BL-A13I-01A-11R-A13Y-07
PTEN                          4147                         3775
MYC                             75                         7239
BRCA1                         1384                         1987
      TCGA-BL-A13I-01A-11R-A277-07
PTEN                           217
MYC                            358
BRCA1                          207
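
One caveat with match(): a gene symbol that is not in the row names yields NA, and indexing a matrix with NA produces a row of NA values. Below is a small base-R sketch of one way to guard against that. The matrix `counts`, its sample names, and the symbol "NOTAGENE" are hypothetical illustrations; only the count values are borrowed from the output above.

```r
# Toy count matrix: 3 genes x 2 samples (sample names are made up).
counts <- matrix(
  c(2535, 336, 939, 5054, 320, 1676),
  nrow = 3,
  dimnames = list(c("PTEN", "MYC", "BRCA1"), c("sampleA", "sampleB"))
)

# Requested genes, including a different case and a symbol that is absent.
my_genes <- c("PTEN", "brca1", "NOTAGENE")

# Case-insensitive lookup; unmatched symbols give NA.
idx <- match(tolower(my_genes), tolower(rownames(counts)))

# Report the symbols that were not found, then drop them before indexing.
missing_genes <- my_genes[is.na(idx)]
found <- counts[idx[!is.na(idx)], , drop = FALSE]

print(missing_genes)
print(found)
```

Using drop = FALSE keeps the result a matrix even when only one gene matches.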

We could add the normal samples as another ExpressionSet, and you could use them in a similar way.
More on that soon. Watch this space.
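
In the meantime, if you end up with tumor and normal samples in one table, the TCGA barcode itself tells them apart: the first two digits of the fourth field are the sample type code (01 to 09 for tumor, 10 to 19 for normal, 20 to 29 for control). Here is a rough base-R sketch, assuming the sample names are full TCGA barcodes like the ones shown above. The third barcode is a made-up example of a normal sample, not taken from the dataset.

```r
# Two barcodes from the output above plus one hypothetical normal barcode.
barcodes <- c(
  "TCGA-BL-A0C8-01A-11R-A10U-07",
  "TCGA-BL-A13I-01A-11R-A13Y-07",
  "TCGA-XX-0000-11A-11R-0000-07"  # hypothetical, for illustration only
)

# Split on "-" and take the fourth field, e.g. "01A".
fields <- strsplit(barcodes, "-", fixed = TRUE)
sample_field <- vapply(fields, `[`, character(1), 4)

# The first two characters are the numeric sample type code.
sample_code <- as.integer(substr(sample_field, 1, 2))

# Map the code ranges to groups.
sample_group <- ifelse(
  sample_code <= 9, "tumor",
  ifelse(sample_code <= 19, "normal", "control")
)

print(data.frame(barcode = barcodes, code = sample_code, group = sample_group))
```

The resulting `sample_group` vector could then serve as the grouping factor in a tumor versus normal comparison.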

Thanks and Regards,
Sonali.