I'm using a GEO dataset (GSE140830) — HumanHT-12 v4 Expression BeadChip (nuID) platform — in a study, but I am having problems regarding the metadata and so how to remove outliers verified in the study. I know that it is possible to relate the info observed in the study with the data present in the metadata using affy (for example: from the series matrix file):
Data <- ReadAffy(filenames=row.names(targets), celfile.path=datadir, phenoData = targets)
In this case, the row.names carries the sample ID, but can I do something like this with Illumina data since the expression info is in table format and the table has different codes representing the samples (for example, GSM4187774 in phenoData and 5872617031_A in raw_data file)?
I believe it would be easier to rename based on what's in the expression data. But would it be possible to load this data and establish a relationship between the files in such a way as to make it easier to create groups, remove samples, and design the statistical test?
It's not exactly clear to me what you are after, but I'll take a stab at it. First off you probably want to read in using the GEOquery package. You could alternately download the raw data and process yourself, but it's probably not worth it, buy ymmv.
> library(GEOquery)
## need to set this first because the data are large
> Sys.setenv("VROOM_CONNECTION_SIZE" = 1000000L)
> z <- getGEO("GSE140830")[[1]]
> head(colnames(z))
[1] "GSM4187774" "GSM4187775" "GSM4187776" "GSM4187777" "GSM4187778"
[6] "GSM4187779"
> names(pData(z))
[1] "title" "geo_accession"
[3] "status" "submission_date"
[5] "last_update_date" "type"
[7] "channel_count" "source_name_ch1"
[9] "organism_ch1" "characteristics_ch1"
[11] "characteristics_ch1.1" "characteristics_ch1.2"
[13] "characteristics_ch1.3" "characteristics_ch1.4"
[15] "molecule_ch1" "extract_protocol_ch1"
[17] "label_ch1" "label_protocol_ch1"
[19] "taxid_ch1" "hyb_protocol"
[21] "scan_protocol" "description"
[23] "data_processing" "data_processing.1"
[25] "platform_id" "contact_name"
[27] "contact_laboratory" "contact_institute"
[29] "contact_address" "contact_city"
[31] "contact_state" "contact_zip/postal_code"
[33] "contact_country" "supplementary_file"
[35] "data_row_count" "age_at_draw:ch1"
[37] "batch:ch1" "connectivity_zscore:ch1"
[39] "diagnosis:ch1" "Sex:ch1"
!> pData(z)[1:5,1:10]
title geo_accession
GSM4187774 Whole blood, Control, 5872617031_A [ftd] GSM4187774
GSM4187775 Whole blood, Control, 5872617031_C [ftd] GSM4187775
GSM4187776 Whole blood, Control, 5872617031_E [ftd] GSM4187776
GSM4187777 Whole blood, bvFTD, 5872617031_F [ftd] GSM4187777
GSM4187778 Whole blood, Control, 5872641011_B [ftd] GSM4187778
status submission_date last_update_date type
GSM4187774 Public on Nov 13 2020 Nov 22 2019 Nov 13 2020 RNA
GSM4187775 Public on Nov 13 2020 Nov 22 2019 Nov 13 2020 RNA
GSM4187776 Public on Nov 13 2020 Nov 22 2019 Nov 13 2020 RNA
GSM4187777 Public on Nov 13 2020 Nov 22 2019 Nov 13 2020 RNA
GSM4187778 Public on Nov 13 2020 Nov 22 2019 Nov 13 2020 RNA
channel_count source_name_ch1 organism_ch1 characteristics_ch1
GSM4187774 1 Whole blood Homo sapiens diagnosis: Control
GSM4187775 1 Whole blood Homo sapiens diagnosis: Control
GSM4187776 1 Whole blood Homo sapiens diagnosis: Control
GSM4187777 1 Whole blood Homo sapiens diagnosis: bvFTD
GSM4187778 1 Whole blood Homo sapiens diagnosis: Control
> all.equal(colnames(z), row.names(pData(z)))
[1] TRUE
The samples are all aligned with the phenodata, automatically!
Thank you very much for your solution. I wasn't using GEOquery, because I need to remove probes using a p-value threshold < 0.05. Also, another point is to run an analysis (PCAtools / arrayQualityMetrics) to recognize outliers before the differential gene expression analysis. After recognition and excluding these samples, the normalization and DEG will be conducted. Considering the file from GEOquery in your example, how could I discard these low-quality samples (previously identified) and normalize my dataset only with the 'good-quality samples?
The data you get from getGEO are the processed values, which are described on GEO:
Data was imported from BeadStudio/GenomeStudio output (raw_data.txt) using lumi. Illumina Ids were converted to nuIDs and expression values were normalized using variance-stabilizing transform. Interarray normalization was done using robust spline normalization (normalized_data.txt). Probes were dropped if their detection score p-value was greater than 0.01 or if they were not annotated with a gene symbol. Outlier samples were identified by estimating connectivity Z-scores and dropping any sample with a Z-score less than -2. Batches with only a single sample were also dropped. Batch correction was performed using ComBat. Duplicate probes for the same gene symbol were dropped from the batch corrected data by choosing the probe with the highest mean expression across all samples, and nuIDs were converted to gene symbols. Expression values were also adjusted for age because of correlation between age and diagnosis (final_normalized_data.txt).
If you don't want to use the data as is, you can download the raw data and then process using limma. At that point matching samples up with treatment and whatnot is up to you, and is off topic for this support site. That's a data wrangling question rather than a Bioconductor software question.
Thank you very much for your solution. I wasn't using GEOquery, because I need to remove probes using a p-value threshold < 0.05. Also, another point is to run an analysis (PCAtools / arrayQualityMetrics) to recognize outliers before the differential gene expression analysis. After recognition and excluding these samples, the normalization and DEG will be conducted. Considering the file from GEOquery in your example, how could I discard these low-quality samples (previously identified) and normalize my dataset only with the 'good-quality samples?
The data you get from
getGEO
are the processed values, which are described on GEO:If you don't want to use the data as is, you can download the raw data and then process using
limma
. At that point matching samples up with treatment and whatnot is up to you, and is off topic for this support site. That's a data wrangling question rather than a Bioconductor software question.