Question

Questions Trying to Re-create MicroArray Data (GSE97562) (Fig. 2)

0

Entering edit mode

rodj5201 • 0

@1efd0df9

Last seen 9 months ago

United States

Hi, I want to pre-face that this might be a long read.

I would like to re-create data from a dataset to learn how to perform differential analysis with limma. My main goal is to re-create Fig. 2 from the cited study (https://onlinelibrary.wiley.com/doi/10.1002/cam4.1187). I'm extremely new to this but have compiled code from tutorials, etc. that will read the files and perform rma normalization, then use limma to look for differential expression. At-least that's what I understand to be happening.

(btw I used GSM2572161 through GSM2572170 for .cel files (HSC vs CML))

I will put the code below:


library(oligo)
library(dplyr)
setwd("C:fake-example/GSE/GSE97562")
celFiles <- list.celfiles(full.names = TRUE)
affyExpressionFS <- read.celfiles(celFiles)



#RMA
eset <- oligo::rma(affyExpressionFS)
data.matrix = exprs(eset)



library(limma)

#Make design
design = model.matrix(~0+factor(c(1,1,1,1,1,2,2,2,2,2)))
design[,2] = c("cmlS","cmlS","cmlS","cmlS","cmlS","cmlP","cmlP","cmlP","cmlP","cmlP")
colnames(design)[2]="source"
design = as.data.frame(design)


groups = design$source

#Factor
f = factor(groups,levels=c("cmlS","cmlP"))


design2 = model.matrix(~ 0 + f)
colnames(design2) = c("cmlS","cmlP")

#Fit to model
data.fit = lmFit(data.matrix,design2)


contrast.matrix = makeContrasts(cmlP- cmlS,levels=design2)
data.fit.con = contrasts.fit(data.fit,contrast.matrix)
data.fit.eb = eBayes(data.fit.con)


#Make table with expression data
options(digits=2)
tab = topTable(data.fit.eb,coef=1,number=Inf,adjust.method="BH")

First, want to apologize for the ugly code. From what I understand this is a very basic run-through of differential analysis. The output of tab is 33,927 genes/probes (since I never annotated). The data I get back after keeping "genes" that follow the study's filters (fold change greater than 2 and p-value less than .05) is 50 genes (they get 584). What I believe has not been done is removal of mappings to the same gene ID. I also realize that I have not cutoff "low counts". I'm not sure what else could account for such differences. I have attempted annotating these probes but have had very little luck with that. I realize that it's much to ask for a tutorial and I feel that I might not be completely understanding what I'm doing, but was hoping someone could answer these questions:

Is it possible to convert .cel files immediately to .csv or .txt? (Something I'm able to look at and visualize would be extremely helpful) (Such as in RNA-seq data)

How would you go about removing "low counts"? Meaning, how would you find the sample cutoff and then set that as the threshold for the "counts".

When should annotation be done? (Many questions I found seem to do it after rma normalization)

Could you point me to a tutorial for annotating the probeID's? ( The affyExpressionFS outputs this : Annotation: pd.hugene.1.0.st.v1 )

-I tried using annotateEset which seemed to work, genuinely don't know how to check the mappings after this function though.

How do I remove the NA values after annotating the probes?

I appreciate any help/advice! Thank you!

MicroRNAArrayData GSE97562 pd.hugene.1.0.st.v1 Annotation • 662 views

ADD COMMENT • link updated 9 months ago by James W. MacDonald 65k • written 9 months ago by rodj5201 • 0

score 2 · Accepted Answer · 2023-07-20

The paper you reference used Affy's TAC software, so you won't be able to recreate 100%. But you can probably get close.

> library(GEOquery)
> library(oligo)
> library(affycoretools)
> library(limma)
> library(pd.hugene.1.0.st.v1)
> library(hugene10sttranscriptcluster.db)
> getGEOSuppFiles("GSE97562")
> setwd("GSE97562")
> untar("GSE97562_RAW.tar")
> dat <- read.celfiles(list.celfiles(listGzipped = TRUE))
Loading required package: RSQLite
Loading required package: DBI
Platform design info loaded.
Reading in : GSM2572161_LMTR1414T0.CEL.gz
Reading in : GSM2572162_LMTR222T0.CEL.gz
Reading in : GSM2572163_LMTR35T0.CEL.gz
Reading in : GSM2572164_LMTR47T0.CEL.gz
Reading in : GSM2572165_LMTR914T0.CEL.gz
Reading in : GSM2572166_MOTR36T0.CEL.gz
Reading in : GSM2572167_MO914TRT0.CEL.gz
Reading in : GSM2572168_MO1514TRT0.CEL.gz
Reading in : GSM2572169_MO114TRT0.CEL.gz
Reading in : GSM2572170_MO07TRT0.CEL.gz
Reading in : GSM2572171_LMPR1414T0.CEL.gz
Reading in : GSM2572172_LMPR222T0.CEL.gz
Reading in : GSM2572173_LMPR35T0.CEL.gz
Reading in : GSM2572174_LMPR47T0.CEL.gz
Reading in : GSM2572175_LMPR914T0.CEL.gz
Reading in : GSM2572176_MO07PRT0.CEL.gz
Reading in : GSM2572177_MO114PRT0.CEL.gz
Reading in : GSM2572178_MO1514PRT0.CEL.gz
Reading in : GSM2572179_MO914PRT0.CEL.gz
Reading in : GSM2572180_MOPR36T0.CEL.gz
Reading in : GSM2572181_LMPR1414CT.CEL.gz
Reading in : GSM2572182_LMPR222CT.CEL.gz
Reading in : GSM2572183_LMPR35CT.CEL.gz
Reading in : GSM2572184_LMPR47CT.CEL.gz
Reading in : GSM2572185_LMPR914CT.CEL.gz
Reading in : GSM2572186_MO07PRCT.CEL.gz
Reading in : GSM2572187_MO114RCT.CEL.gz
Reading in : GSM2572188_MO1514RCT.CEL.gz
Reading in : GSM2572189_MO36PRCT.CEL.gz
Reading in : GSM2572190_MO914RCT.CEL.gz
Reading in : GSM2572191_LMPR1414IM.CEL.gz
Reading in : GSM2572192_LMPR222IM.CEL.gz
Reading in : GSM2572193_LMPR35IM.CEL.gz
Reading in : GSM2572194_LMPR47IM.CEL.gz
Reading in : GSM2572195_LMPR914IM.CEL.gz
Reading in : GSM2572196_MO07PRIM.CEL.gz
Reading in : GSM2572197_MO114PRIM.CEL.gz
Reading in : GSM2572198_MO1514PRIM.CEL.gz
Reading in : GSM2572199_MO36PRIM.CEL.gz
Reading in : GSM2572200_MO914PRIM.CEL.gz
> eset <- rma(dat)
Background correcting
Normalizing
Calculating Expression
> eset <- annotateEset(eset, hugene10sttranscriptcluster.db)
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
> design <- model.matrix(~0+fact)
> colnames(design) <- gsub("fact","", colnames(design))
> contrast <- makeContrasts(CML_stem - NBM_stem, levels = design)
> fit <- eBayes(contrasts.fit(lmFit(eset, design), contrast))
> topTable(fit,1)
        PROBEID ENTREZID  SYMBOL                                     GENENAME
7938989 7938989     2620    GAS2                     growth arrest specific 2
8098060 8098060    59350   RXFP1            relaxin family peptide receptor 1
8155754 8155754   256691  MAMDC2                      MAM domain containing 2
8150881 8150881     5324   PLAG1                            PLAG1 zinc finger
8089145 8089145    25890  ABI3BP          ABI family member 3 binding protein
7916432 7916432     1718  DHCR24              24-dehydrocholesterol reductase
7906757 7906757     2212  FCGR2A                        Fc gamma receptor IIa
7971922 7971922     5101   PCDH9                              protocadherin 9
7945058 7945058    79607 FAM118B family with sequence similarity 118 member B
8160040 8160040     5789   PTPRD protein tyrosine phosphatase receptor type D
            logFC   AveExpr         t      P.Value    adj.P.Val        B
7938989  4.521992  6.020772  21.87851 2.738989e-22 9.120011e-18 38.58232
8098060  4.157618  7.306033  21.12634 8.765271e-22 1.459286e-17 37.62152
8155754  3.460398  5.787928  20.35201 3.014163e-21 3.345420e-17 36.58845
8150881 -3.144974  5.658739 -17.01883 1.004835e-18 8.364499e-15 31.56899
8089145 -3.730470  6.054714 -16.01465 6.912714e-18 4.603453e-14 29.85084
7916432  2.182689 10.122409  13.09589 3.301244e-15 1.832025e-11 24.21457
7906757  2.912017  8.074198  12.87955 5.410723e-15 2.377795e-11 23.75511
7971922 -2.461452  5.156261 -12.80534 6.418011e-15 2.377795e-11 23.59610
7945058  1.672019  8.639475  12.80473 6.427051e-15 2.377795e-11 23.59479
8160040  2.452418  6.457668  12.54838 1.165042e-14 3.879241e-11 23.03984
> nrow(topTable(fit, 1, Inf, p.value = 0.05))
[1] 4626

## they used TAC with a > 2-fold difference (which is not how to do things, btw), so let's try to get close

> nrow(subset(topTable(fit, 1, Inf, p.value = 0.05), abs(logFC) > 1))
[1] 671

## post hoc filtering like that invalidates the FDR as well as the p-values, so let's do it correctly.

> fit2 <- treat(contrasts.fit(lmFit(eset, design), contrast), 1.3)
> nrow(topTreat(fit2,1, n = Inf, p.value = 0.05))
[1] 570

## note that it's all annotated and stuff
> topTreat(fit2,1)
        PROBEID ENTREZID SYMBOL                                     GENENAME
7938989 7938989     2620   GAS2                     growth arrest specific 2
8098060 8098060    59350  RXFP1            relaxin family peptide receptor 1
8155754 8155754   256691 MAMDC2                      MAM domain containing 2
8150881 8150881     5324  PLAG1                            PLAG1 zinc finger
8089145 8089145    25890 ABI3BP          ABI family member 3 binding protein
7906757 7906757     2212 FCGR2A                        Fc gamma receptor IIa
7971922 7971922     5101  PCDH9                              protocadherin 9
7916432 7916432     1718 DHCR24              24-dehydrocholesterol reductase
8098195 8098195     6307  MSMO1                 methylsterol monooxygenase 1
8160040 8160040     5789  PTPRD protein tyrosine phosphatase receptor type D
            logFC   AveExpr         t      P.Value    adj.P.Val
7938989  4.521992  6.020772  20.04717 2.487016e-21 8.281016e-17
8098060  4.157618  7.306033  19.20299 1.018004e-20 1.694824e-16
8155754  3.460398  5.787928  18.12582 6.637187e-20 7.366614e-16
8150881 -3.144974  5.658739 -14.97054 2.839369e-17 2.363561e-13
8089145 -3.730470  6.054714 -14.38973 9.614538e-17 6.402705e-13
7906757  2.912017  8.074198  11.20543 1.498387e-13 8.315298e-10
7971922 -2.461452  5.156261 -10.83619 3.805798e-13 1.630317e-09
7916432  2.182689 10.122409  10.82486 3.917030e-13 1.630317e-09
8098195  2.840028  9.503743  10.65642 6.032698e-13 2.231897e-09
8160040  2.452418  6.457668  10.61163 6.768107e-13 2.253577e-09

You can now use volcanoplot from limma, or the EnhancedVolcano package. And you can use ComplexHeatmap to make a heatmap. I leave it up to you to read and learn how those things work.