Aggregating genes from multiple GSE's, ranking them, then building a PPI network.
Entering edit mode
Caitlin ▴ 50
Last seen 2.4 years ago
United States

Hi all.

After a lengthy struggle (I am a beginning student), I was able extract a list of genes from one GSE which contained 25 GSMs. From this GSE, I extracted a list of gene symbols after which I removed all of the duplicates. So, I am now assuming that the gene symbols I obtained are from all of the GSMs within that GSE? If so, I am attempting to perform the same operation on six additional GSEs before removing duplicates and using the RobustRankAggreg package, aggregating all of the genes into one data object (vector?) before performing a differential expression analysis with the limma package to isolate a smaller set of DEGs that I could use with Cytoscape via RCy3. Unfortunately, I don't know how to retrieve genes from multiple GSE objects or how one would combine all of the gene symbols. Can limma accept one monolithic data structure of gene symbols and could a boxplot be constructed from it even though there would presumably need to be two categories? Please accept my apologies if any of the above text is incorrect. I am working on my own without any form of guidance from my PI, TA, or classmates. I am, however, finding the limma course from DataCamp to be quite useful.

Code should be placed in three backticks as shown below


gse <- getGEO("GSE12657", AnnotGPL = TRUE)
features <- fData(gse[[1]])

length(features$`Gene symbol`)

# Remove duplicates leaving 9156 unique genes
features <- features[!duplicated(features$`Gene symbol`),]
# include your problematic code here with any corresponding output 
# please also include the results of running the following in an R session 

sessionInfo( )
GEOquery CancerData limma RCy3 CancerInSilico • 1.2k views
Entering edit mode
Last seen 2 hours ago
United States

There is one problem with what you are doing. Note how duplicated works

> d.f <- data.frame(count = sample(1:4, 10, TRUE), letters = letters[1:10])
> d.f
   count letters
1      4       a
2      4       b
3      2       c
4      4       d
5      2       e
6      3       f
7      1       g
8      1       h
9      3       i
10     1       j

> d.f[!duplicated(d.f[,1]),]
  count letters
1     4       a
3     2       c
6     3       f
7     1       g

See how it just takes the first of the duplicated rows? That may or may not be what you want. The duplicates on an Affymetrix array can be there for many different reasons and you may be inadvertently choosing the 'wrong' one. An alternative would be to average the replicates.

> library(GEOquery)
> library(affycoretools)
> library(hgu95av2.db)
> eset <- getGEO("GSE12657", AnnotGPL = TRUE)[[1]]
Found 1 file(s)
Using locally cached version: C:\Users\Public\Documents\Wondershare\CreatorTemp\Rtmpu2M04l/GSE12657_series_matrix.txt.gz
Using locally cached version of GPL8300 found here:
> eset <- annotateEset(eset, hgu95av2.db)
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
## Any NA NCBI Gene IDs?
> sum($ENTREZID))
[1] 494
## remove the NA values
> eset <- eset[!$ENTREZID),]
> dim(eset)
Features  Samples 
   12131       25 
## make a new ExpressionSet with just the unique
> eset2 <- eset[!duplicated(fData(eset)$ENTREZID),]
> dim(eset2)
Features  Samples 
    9055       25 
> z <- avereps(exprs(eset), fData(eset)$ENTREZID)
## now put that into our new ExpressionSet
> row.names(z) <- row.names(eset2)
> exprs(eset2) <- z

And then the eset2 object has no duplicate NCBI Gene IDs (which are more reliable than HGNC symbols, which you shouldn't use for unique identifiability).

As for your questions about 'monolithic data structure' and a boxplot, I have no idea what you mean by the former, and the latter is simple


Although do note

> table(pData(eset2)$characteristics_ch1)

              Control          Glioblastoma     Oligodendroglioma 
                    5                     7                     7 
Pilocytic astrocytoma 
## compare each cancer to controls
> fit <- eBayes(lmFit(eset2, model.matrix(~characteristics_ch1, pData(eset2))))
> sapply(1:3, function(x) nrow(topTable(fit, x, Inf, p.value = 0.05)))
[1] 7372 1637  903

So there are 7372, 1637, and 903 genes differentially expressed in glioblastoma, oligodendroglioma and pilocytic astrocytoma tissue, respectively, as compared to control, at an FDR < 0.05

Entering edit mode

Hi James.

Very nice. Thank you. I was just looking at an email from that postdoc I consulted several weeks ago and this is what she said. Could you please explain as I am quite confused, working under an encroaching deadline, and without benefit the classmate or PI:

pData(gse) returns to the data frame of sample information only. A bit more idea of how gse ExpressionSet object looks could imagine it is a multi-tab excel table. For example, when we extract the "sampledata <- pData(gse[[1]])", it is one of the tabs containing the pData data frame. For the downstream, you should look for "exprs(gse)" for the expression data, in which the rows are the genes and the columns are the sample IDs. Just do the differential expression analysis using the limma package instead of DESeq2. Now, you have the gse object which contains the assayData (expression data), phenoData (sample information), and featureData (gene annotation data).

  1. Proceed to check if you need to normalize the expression data
  2. Check the sample information, which column should be used as the "group"?
  3. Proceed to the differential expression analysis using the limma package, you should get the "TopTable" results which include the gene IDs as the rows and the DE result as the column, we are interested in the columns of "log2FoldChange" and "adjusted p value". And then, you have to follow the instructions from clusterprofiler book. By using "log2FoldChange" and "adjusted p value", we can form a genelist object that is used in clusterprofiler package for the GO and KEGG enrichment analysis, as well as visualizing the results in the network figures.

I think the first step involves calling log2 on the ExpressionSet object? The second step is puzzling, since upon using the following code: sampleData <- pData(gse[[1]]), I don't know what column I should use as the "group" nor do I know what that means (GSE12657) although limma apparently depends upon the specification of one column name? Could you please explain 3? The postdoc appears to believe this step is obvious but it most certainly is not.

From the paper I am seeking to reproduce, the authors mentioned "hub genes" and the cytohubba package from Cytoscape. Would I need to transform the data at some point with the WGCNA package?

I spent the past several hours reading the following document and my code, thus far, is below.



gse <- getGEO("GSE12657", AnnotGPL = TRUE)

# print the sample information

# print the gene annotation

# print the expression data

# Check the normalisation and scales used

# Look at the expression value

# Perform a log2 transformation since the range (min and max) is not between 0 and 16
exprs(gse[[1]]) <- log2(exprs(gse[[1]]))

# Reassess the summary

# Plot the summary values

# Retrieve sampleInfo
sampleInfo <- pData(gse[[1]])

# Obtain table counts for Control, Glioblastoma, Oligodendroglioma, and 
#                         Pilocytic astrocytoma 

# Select factors we need for the analysis
sampleInfo <- select(sampleInfo, characteristics_ch1)

# Rename the column to one that is more convenient
sampleInfo <- rename(sampleInfo, sample = characteristics_ch1)

# The argument use="c" stops an error if there are any missing data points
# We can visualize the correlations between the samples by hierarchical clustering.
# The function ‘cor’ can calculate the correlation on the scale of 0 to 1, in a pairwise fashion between all samples, then 
# visualise on a heatmap. There are many ways to create heatmaps in R, here I use ‘pheatmap’, the only argument it requires # is a matrix of numeric values.

# We can add more sample info onto the plot to get a better pic of the group and clustering. Here, we make use of the # ‘sampleInfo’ file that was created earlier, to match with the columns of the correlation matrix.
corMatrix <- cor(exprs(gse[[1]]),use="c")

# This this case, rownames(sampleInfo) == colnames(sampleInfo), so the correlation is 1.0

# Construct a design matrix will be created, this is a matrix of 0 and 1, one row for each 
# sample and one column for each sample group.
design <- model.matrix(~0 + sampleInfo$sample)

# Calculate the median expression level
threshold <- median(exprs(gse[[1]]))

#TRUE or FALSE for whether each gene is "expressed" in each sample
gene_is_expressed <- exprs(gse[[1]]) > threshold

# Identify genes that are expressed in more than 2 samples
retain <- rowSums(gene_is_expressed) > 3

# Check how many genes are removed / retained.

# Subset the data to keep only the expressed genes
gse <- gse[retain,]
Error in orig[[nm]][i, , ..., drop = drop] : 
  (subscript) logical subscript too long
gse <- gse[retain,]

Thank you James for your help. I greatly appreciate it.

Entering edit mode

We have ventured far past the main purpose of this support site, which is to help people with specific questions about the software. While I sympathize with the fact that you are confused and have little time, you already have examples (the limma User's Guide, the GitHub repo you link to) that show you pretty much exactly what you should be doing. You just need to apply that information to your situation.


Login before adding your answer.

Traffic: 790 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6