Normal Patient Samples from GSE62944
1
0
Entering edit mode
@hamdabinteajmal-10011

Hello,

I am trying to run a differential expression analysis of genes in normal vs. breast cancer patients. For that purpose, I chose the GEO dataset GSE62944, as it contains 9264 tumor samples and 741 normal samples.

Question 1: 

I load the expression set with this code:

    library(AnnotationHub)

    ah = AnnotationHub()
    query(ah , "GSE62944")

What I see is: 

AnnotationHub with 1 record
# snapshotDate(): 2016-03-09 
# names(): AH28855
# $dataprovider: GEO
# $species: Homo sapiens
# $rdataclass: ExpressionSet
# $title: RNA-Sequencing and clinical data for 7706 tumor samples from The Cancer Genome Atlas
# $description: TCGA RNA-seq Rsubread-summarized raw count data for 7706 tumor samples, represented as an R / Bioconductor...
# $taxonomyid: 9606
# $genome: hg19
# $sourcetype: tar.gz
# $sourceurl: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: TCGA, RNA-seq, Expression, Count 
# retrieve record with 'object[["AH28855"]]' 

 

Why does the title say 7706 tumor samples? Does this expression set contain normal samples at all? How can I access the normal samples?

 

Question 2: 

I subset the breast cancer patient samples using code:

    tcga_data <- ah[["AH28855"]]

    brca_data <- tcga_data[, which(phenoData(tcga_data)$CancerType=="BRCA")]

 

How can I subset both breast cancer and normal samples from the entire dataset?

 

Question 3: 

Is there a way to subset specific genes (i.e. rows) from the dataset?

 

Help would be appreciated

GSE62944 annotationhub differentialexpression • 1.9k views
Sonali Arora ▴ 390
@sonali-arora-6563
United States

Hi Hamda,

Solution #1

Please see the supplementary file section of the GEO page. We have added only the 7706 tumor samples from TCGA; the normal samples have not been added.

> library(AnnotationHub)
> ah = AnnotationHub()
> ah <- query(ah , "GSE62944")
> data <- ah[["AH28855"]]

This is also stated in the record's description:

> ah['AH28855']$description
[1] "TCGA RNA-seq Rsubread-summarized raw count data for 7706 tumor samples, represented as an R / Bioconductor Biobase ExpressionSet. R data representation derived from GEO accession GSE62944."


Solution #2

I'll show an example with bladder cancer data. The data is stored in an ExpressionSet:

> bladder_data <- data[, which(phenoData(data)$CancerType=="BLCA")]
> class(bladder_data)  
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
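
The same column indexing extends to several cancer types at once. Here is a minimal base R sketch, assuming a toy count matrix and a toy phenotype table (both hypothetical stand-ins for exprs(data) and pData(data)); an ExpressionSet accepts the same [ , columns] indexing:

```r
# Toy stand-ins (hypothetical values) for pData(data) and exprs(data)
pheno <- data.frame(
  CancerType = c("BLCA", "BRCA", NA, "BRCA", "LUAD"),
  row.names = paste0("S", 1:5)
)
counts <- matrix(
  1:15, nrow = 3,
  dimnames = list(c("PTEN", "MYC", "BRCA1"), rownames(pheno))
)

# %in% returns FALSE (not NA) for missing labels, so which() is not needed
keep <- pheno$CancerType %in% c("BLCA", "BRCA")

# Subset counts and phenotype rows together so they stay in sync
sub_counts <- counts[, keep, drop = FALSE]
sub_pheno <- pheno[keep, , drop = FALSE]
```

Using %in% rather than == avoids NA entries in the index when CancerType has missing values.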

I highly recommend getting familiar with Bioconductor objects, especially through the ExpressionSet vignette in the Biobase package.

Check out the show method: it summarizes the data, including the number of genes and the number of samples.

> bladder_data
ExpressionSet (storageMode: lockedEnvironment)
assayData: 23368 features, 273 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: TCGA-BL-A0C8-01A-11R-A10U-07
    TCGA-BL-A0C8-01A-11R-A277-07 ... TCGA-YC-A89H-01A-11R-A36F-07 (273
    total)
  varLabels: bcr_patient_barcode bcr_patient_uuid ... CancerType (421
    total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation:

To get the expression values for the tumor samples, use exprs(). Rows are gene names and columns are sample names.

> raw_bladder_data = exprs(bladder_data)
> class(raw_bladder_data)
[1] "matrix"
> raw_bladder_data[1:5, 1:5]
            TCGA-BL-A0C8-01A-11R-A10U-07 TCGA-BL-A0C8-01A-11R-A277-07
1/2-SBSRNA4                           25                           20
A1BG                                  19                           22
A1BG-AS1                              11                           12
A1CF                                 100                          124
A2LD1                                146                          141
            TCGA-BL-A0C8-01B-04R-A277-07 TCGA-BL-A13I-01A-11R-A13Y-07
1/2-SBSRNA4                           53                           31
A1BG                                  10                          210
A1BG-AS1                              10                           95
A1CF                                 553                            2
A2LD1                                153                          142
            TCGA-BL-A13I-01A-11R-A277-07
1/2-SBSRNA4                            1
A1BG                                  10
A1BG-AS1                               8
A1CF                                   2
A2LD1                                 10
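
The column names above are TCGA barcodes, and the first two digits of the fourth field encode the sample type (01 to 09 for tumor, 10 to 19 for normal, 20 to 29 for control). The sketch below shows one way to read that code in base R. The third barcode is a hypothetical normal-sample barcode added for illustration only; this ExpressionSet contains tumor samples only.

```r
barcodes <- c(
  "TCGA-BL-A0C8-01A-11R-A10U-07",
  "TCGA-BL-A13I-01A-11R-A13Y-07",
  "TCGA-XX-0000-11A-11R-0000-07"  # hypothetical normal-sample barcode
)

# The fourth dash-separated field is the sample field, e.g. "01A"
sample_field <- vapply(
  strsplit(barcodes, "-", fixed = TRUE),
  function(parts) parts[4],
  character(1)
)

# Its first two digits are the sample type code
sample_code <- as.integer(substr(sample_field, 1, 2))

# Map the code ranges to a coarse class label
sample_class <- ifelse(
  sample_code < 10, "tumor",
  ifelse(sample_code < 20, "normal", "control")
)
```

A vector like sample_class could then be used to index columns, for example raw_bladder_data[, sample_class == "tumor"].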

Solution #3 

To subset this matrix to your genes of interest, index it as you would any matrix:

> my_genes = c("PTEN", "MYC", "BRCA1")
> idx = match(tolower(my_genes), tolower(rownames(raw_bladder_data)))
> my_genes_raw_data = raw_bladder_data[idx, ]
> dim(my_genes_raw_data)
[1]   3 273

> my_genes_raw_data[,1:5]
      TCGA-BL-A0C8-01A-11R-A10U-07 TCGA-BL-A0C8-01A-11R-A277-07
PTEN                          2535                         5054
MYC                            336                          320
BRCA1                          939                         1676
      TCGA-BL-A0C8-01B-04R-A277-07 TCGA-BL-A13I-01A-11R-A13Y-07
PTEN                          4147                         3775
MYC                             75                         7239
BRCA1                         1384                         1987
      TCGA-BL-A13I-01A-11R-A277-07
PTEN                           217
MYC                            358
BRCA1                          207
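
One caveat with match(): a gene that is not in the row names yields NA, and indexing with NA produces a row of NA values. A small sketch of guarding against that, using a toy matrix (hypothetical values) in place of raw_bladder_data:

```r
# Toy count matrix (hypothetical values) standing in for raw_bladder_data
counts <- matrix(
  1:12, nrow = 4,
  dimnames = list(c("PTEN", "MYC", "BRCA1", "TP53"), paste0("S", 1:3))
)

my_genes <- c("pten", "BRCA1", "NOTAGENE")

# Case-insensitive lookup; genes that are absent come back as NA
idx <- match(tolower(my_genes), tolower(rownames(counts)))

# Report genes that were not found instead of silently keeping NA rows
missing <- my_genes[is.na(idx)]
if (length(missing) > 0) {
  message("Not found: ", paste(missing, collapse = ", "))
}

# Keep only the genes that matched; drop = FALSE preserves matrix shape
found <- counts[idx[!is.na(idx)], , drop = FALSE]
```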


We could add the normal samples as another ExpressionSet, and you could use them in a similar way.
More on that soon! Watch this space.
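
As a rough sketch of what "in a similar way" could look like once a normal-sample matrix is available: the two toy matrices below are hypothetical stand-ins for the exprs() output of a tumor and a normal ExpressionSet. They are aligned on shared genes, combined, and paired with a group factor for a later normal vs. tumor comparison.

```r
# Toy matrices (hypothetical values): genes in rows, samples in columns
tumor <- matrix(
  c(10, 20, 30, 40, 50, 60), nrow = 3,
  dimnames = list(c("A1BG", "MYC", "PTEN"), c("T1", "T2"))
)
normal <- matrix(
  c(5, 15, 25, 35), nrow = 2,
  dimnames = list(c("PTEN", "MYC"), c("N1", "N2"))
)

# Keep only genes present in both matrices, in one common order
shared <- intersect(rownames(tumor), rownames(normal))
combined <- cbind(
  tumor[shared, , drop = FALSE],
  normal[shared, , drop = FALSE]
)

# One group label per column, with "normal" as the reference level
group <- factor(
  c(rep("tumor", ncol(tumor)), rep("normal", ncol(normal))),
  levels = c("normal", "tumor")
)
```

Indexing both matrices by the same gene vector matters because cbind() does not align rows by name.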


Thanks and Regards,
Sonali. 


Thank you for your reply Sonali.

I was confused because I read in this post, "C: Can I feed TCGA normalized count data to EdgeR for differential gene expression", that the normal samples had been added, so I wondered whether there is any way to get them.

Your replies are a great help to me. Thanks a lot.

 
