1
0
Entering edit mode
RMP87 • 0
@rafael-24221
Last seen 3 days ago
Brazil

Hello!

I'm using a GEO dataset (GSE140830) — HumanHT-12 v4 Expression BeadChip (nuID) platform — in a study, but I am having problems regarding the metadata and so how to remove outliers verified in the study. I know that it is possible to relate the info observed in the study with the data present in the metadata using affy (for example: from the series matrix file):




In this case, the row.names carries the sample ID, but can I do something like this with Illumina data since the expression info is in table format and the table has different codes representing the samples (for example, GSM4187774 in phenoData and 5872617031_A in raw_data file)?

I believe it would be easier to rename based on what's in the expression data. But would it be possible to load this data and establish a relationship between the files in such a way as to make it easier to create groups, remove samples, and design the statistical test?

Any suggestion?

Thank you very much!

limma GEO MicroarrayData affy • 199 views
0
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

It's not exactly clear to me what you are after, but I'll take a stab at it. First off you probably want to read in using the GEOquery package. You could alternately download the raw data and process yourself, but it's probably not worth it, buy ymmv.

> library(GEOquery)
## need to set this first because the data are large
> Sys.setenv("VROOM_CONNECTION_SIZE" = 1000000L)
> z <- getGEO("GSE140830")[[1]]
[1] "GSM4187774" "GSM4187775" "GSM4187776" "GSM4187777" "GSM4187778"
[6] "GSM4187779"
> names(pData(z))
[1] "title"                   "geo_accession"
[3] "status"                  "submission_date"
[5] "last_update_date"        "type"
[7] "channel_count"           "source_name_ch1"
[9] "organism_ch1"            "characteristics_ch1"
[11] "characteristics_ch1.1"   "characteristics_ch1.2"
[13] "characteristics_ch1.3"   "characteristics_ch1.4"
[15] "molecule_ch1"            "extract_protocol_ch1"
[17] "label_ch1"               "label_protocol_ch1"
[19] "taxid_ch1"               "hyb_protocol"
[21] "scan_protocol"           "description"
[23] "data_processing"         "data_processing.1"
[25] "platform_id"             "contact_name"
[27] "contact_laboratory"      "contact_institute"
[31] "contact_state"           "contact_zip/postal_code"
[33] "contact_country"         "supplementary_file"
[35] "data_row_count"          "age_at_draw:ch1"
[37] "batch:ch1"               "connectivity_zscore:ch1"
[39] "diagnosis:ch1"           "Sex:ch1"
!> pData(z)[1:5,1:10]
title geo_accession
GSM4187774 Whole blood, Control, 5872617031_A  [ftd]    GSM4187774
GSM4187775  Whole blood, Control, 5872617031_C [ftd]    GSM4187775
GSM4187776  Whole blood, Control, 5872617031_E [ftd]    GSM4187776
GSM4187777    Whole blood, bvFTD, 5872617031_F [ftd]    GSM4187777
GSM4187778  Whole blood, Control, 5872641011_B [ftd]    GSM4187778
status submission_date last_update_date type
GSM4187774 Public on Nov 13 2020     Nov 22 2019      Nov 13 2020  RNA
GSM4187775 Public on Nov 13 2020     Nov 22 2019      Nov 13 2020  RNA
GSM4187776 Public on Nov 13 2020     Nov 22 2019      Nov 13 2020  RNA
GSM4187777 Public on Nov 13 2020     Nov 22 2019      Nov 13 2020  RNA
GSM4187778 Public on Nov 13 2020     Nov 22 2019      Nov 13 2020  RNA
channel_count source_name_ch1 organism_ch1 characteristics_ch1
GSM4187774             1     Whole blood Homo sapiens  diagnosis: Control
GSM4187775             1     Whole blood Homo sapiens  diagnosis: Control
GSM4187776             1     Whole blood Homo sapiens  diagnosis: Control
GSM4187777             1     Whole blood Homo sapiens    diagnosis: bvFTD
GSM4187778             1     Whole blood Homo sapiens  diagnosis: Control

> all.equal(colnames(z), row.names(pData(z)))
[1] TRUE


The samples are all aligned with the phenodata, automatically!

0
Entering edit mode

Thank you very much for your solution. I wasn't using GEOquery, because I need to remove probes using a p-value threshold < 0.05. Also, another point is to run an analysis (PCAtools / arrayQualityMetrics) to recognize outliers before the differential gene expression analysis. After recognition and excluding these samples, the normalization and DEG will be conducted. Considering the file from GEOquery in your example, how could I discard these low-quality samples (previously identified) and normalize my dataset only with the 'good-quality samples?

1
Entering edit mode

The data you get from getGEO are the processed values, which are described on GEO:

Data was imported from BeadStudio/GenomeStudio output (raw_data.txt) using lumi. Illumina Ids were converted to nuIDs and expression values were normalized using variance-stabilizing transform. Interarray normalization was done using robust spline normalization (normalized_data.txt). Probes were dropped if their detection score p-value was greater than 0.01 or if they were not annotated with a gene symbol. Outlier samples were identified by estimating connectivity Z-scores and dropping any sample with a Z-score less than -2. Batches with only a single sample were also dropped. Batch correction was performed using ComBat. Duplicate probes for the same gene symbol were dropped from the batch corrected data by choosing the probe with the highest mean expression across all samples, and nuIDs were converted to gene symbols. Expression values were also adjusted for age because of correlation between age and diagnosis (final_normalized_data.txt).

If you don't want to use the data as is, you can download the raw data and then process using limma. At that point matching samples up with treatment and whatnot is up to you, and is off topic for this support site. That's a data wrangling question rather than a Bioconductor software question.