I am trying to do a differential expression analysis of genes in normal vs. breast cancer patients. For that purpose, I chose the GEO dataset GSE62944, as it contains 9264 tumour samples and 741 normal samples.
I load the expression set using this code:
library(AnnotationHub)
ah <- AnnotationHub()
query(ah, "GSE62944")
What I see is:
AnnotationHub with 1 record
# snapshotDate(): 2016-03-09
# names(): AH28855
# $dataprovider: GEO
# $species: Homo sapiens
# $rdataclass: ExpressionSet
# $title: RNA-Sequencing and clinical data for 7706 tumor samples from The Cancer Genome Atlas
# $description: TCGA RNA-seq Rsubread-summarized raw count data for 7706 tumor samples, represented as an R / Bioconductor...
# $taxonomyid: 9606
# $genome: hg19
# $sourcetype: tar.gz
# $sourceurl: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: TCGA, RNA-seq, Expression, Count
# retrieve record with 'object[["AH28855"]]'
Why does the title say 7706 tumor samples? Does this expression set contain normal samples at all? How would I access the normal samples?
I subset the breast cancer patient samples using code:
tcga_data <- ah[["AH28855"]]
brca_data <- tcga_data[, which(phenoData(tcga_data)$CancerType=="BRCA")]
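To show what I mean by column subsetting, here is a minimal sketch on a toy ExpressionSet rather than the real TCGA object. The object `toy`, its sample names, and all counts are made up for illustration; only the `CancerType` column name is taken from my code above. Tabulating that column is also how I would check which sample groups an object actually contains.

```r
library(Biobase)

# Toy stand-in for the TCGA object: 4 genes x 6 samples (all values made up)
counts <- matrix(1:24, nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:6)))
pheno <- data.frame(
  CancerType = c("BRCA", "BRCA", "LUAD", "BRCA", "LUAD", "GBM"),
  row.names = colnames(counts)
)
toy <- ExpressionSet(assayData = counts,
                     phenoData = AnnotatedDataFrame(pheno))

# Count samples per cancer type to see which groups are present
table(pData(toy)$CancerType)

# Logical column subsetting; same result as which(...) when there are no NAs
brca_toy <- toy[, toy$CancerType == "BRCA"]
dim(brca_toy)
```

On the toy object this keeps only the columns whose `CancerType` is "BRCA". I am assuming the real object behaves the same way.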
How can I subset both breast cancer and normal samples from the entire dataset?
Is there a way to subset specific genes (i.e., rows) from the dataset?
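As far as I understand, an ExpressionSet can be indexed by row like a matrix, so something along these lines might be what I need. This is only a sketch on a toy object: `toy`, the counts, and the gene list `genes_of_interest` are hypothetical, and I am assuming the feature names of the real object are identifiers I can match against (I have not confirmed what they are).

```r
library(Biobase)

# Toy ExpressionSet: 4 genes x 3 samples (all values made up)
counts <- matrix(1:12, nrow = 4,
                 dimnames = list(c("BRCA1", "TP53", "ESR1", "GATA3"),
                                 paste0("s", 1:3)))
toy <- ExpressionSet(assayData = counts)

# Hypothetical gene list; keep only the names actually present in the object
genes_of_interest <- c("BRCA1", "TP53", "NOT_A_GENE")
keep <- intersect(genes_of_interest, featureNames(toy))

# Row subsetting by feature name; all samples (columns) are kept
sub <- toy[keep, ]
exprs(sub)
```

Using `intersect()` first avoids an error if one of the requested names is missing from the object.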
Any help would be appreciated.