Question

Intronic sequences are the majority of the top differentially expressed genes


ben_cossins • 0

@ben_cossins-8532

Last seen 5.0 years ago

United Kingdom

I have seen the thread "Affymetrix Intronic Normalization Control Probes Differentially Expressed?". Like the OP, I'm seeing a large number of intronic sequences among the most differentially expressed genes.

Using toptable to get the top 100 results (not correcting for multiple testing), I find that between 65 and 81 of the 100 are intronic sequences, depending on which pairwise comparison I'm doing.

As an explanation of the experiment, I'm looking at the effects of an imprinted gene in the offspring and how that regulates gene expression in the mother. The arrays are from WT mothers implanted with either WT, KO or Transgenic offspring, and the samples come from a number of tissues (though my project only concerns the maternal pancreas).

I've included my R code. I apologise if it's not the prettiest, or if there are some ugly hacks to make it output results.

From the previously linked thread I see that I could exclude the intronic sequences (which would reduce the multiple-testing burden, so some of the results might become significant that way), but obviously I'd like to work out what I'm doing wrong first. (There is a maximum of 5 tags, so I'll just specify here that I'm using the pd.mogene.2.0.st array.)

Please note that I am aware I am not selecting significant genes. Where I have filtered by P-value (mostly for the GO analysis), it is because I wanted to make sure the script worked properly and that I got some kind of pathway out.

limma microarray normalization oligo differential gene expression • 1.3k views

ADD COMMENT • link updated 8.6 years ago by James W. MacDonald 65k • written 8.6 years ago by ben_cossins • 0


I removed the unnecessary code and tried to follow the Bioconductor style guide.

## loading libraries ---- run these line by line
source("http://bioconductor.org/biocLite.R")
library(Biobase)
library(oligo) # array data handling
library(limma) # linear models of array data
library(mogene20sttranscriptcluster.db) # array data annotation

## data input
setwd('/Pancreas_CEL_Files/') # change to directory containing array data
pd <- read.AnnotatedDataFrame("cel.pData.txt", header = TRUE, row.names = 1) # read in phenotype data
celfiles <- read.celfiles(filenames = rownames(pData(pd)), phenoData = pd) # read in data from microarray, incorporating phenotype data
norm <- rma(celfiles) # normalise expression data
groups <- factor(c(rep("WT",4),rep("KO",4),rep("2x",4))) # create groups to compare - this needs to be changed to reflect sample order

## array annotation
gns <- select(mogene20sttranscriptcluster.db, featureNames(norm), c("ENSEMBL","SYMBOL","GENENAME","REFSEQ")) # get annotation data
gns <- gns[!duplicated(gns[,1]),] # remove duplicate entries

## pairwise comparisons & linear modelling
design <- model.matrix(~0+groups)
names <- matrix(c("WTvsKO","WTvs2X","KOvs2X","WTvsKO+2X"), nrow=4, ncol=1) # create list of comparisons for naming files in loop
acontrast <- makeContrasts(
    WT_vs_KO = (groupsWT - groupsKO),
    WT_vs_2X = (groupsWT - groups2x),
    KO_vs_2X = (groupsKO - groups2x),
    WT_vs_others = (groupsWT - ((groupsKO + groups2x)/2)),
    levels=design) # compare different conditions
lm1 <- lmFit(exprs(norm), design) # create linear model of expression data given design matrix
lm1 <- contrasts.fit(lm1, acontrast) # create linear model factoring the contrasts made
lm1 <- eBayes(lm1) # compute test statistics on differential expression

## loop for pairwise comparisons, heatmaps and GO analysis
for (a in 1:4) {
    dat <- toptable(lm1, coef=a, num=100) # Create dataframe of top results
    dat["Symbol"] <- gns$SYMBOL[match(rownames(dat), gns$PROBEID)]
    dat["GeneName"] <- gns$GENENAME[match(rownames(dat), gns$PROBEID)]
    if (a == 1) WTvKO <- dat
    if (a == 2) WTv2X <- dat
    if (a == 3) KOv2X <- dat
    if (a == 4) WTvKO2X <- dat
}

ADD REPLY • link 8.6 years ago ben_cossins • 0


For future reference, you should post a minimal working example that recapitulates the relevant behaviour. There's a lot of code here that's irrelevant to the issue at hand.

ADD REPLY • link 8.6 years ago Aaron Lun ★ 28k


Hopefully fixed that now

ADD REPLY • link 8.6 years ago ben_cossins • 0

score 1 · Answer 1 · 2015-09-17

Aaron's right - that is a metric ton of irrelevant code, plus when you use = for assignment rather than <-, and 'bad' indenting, it's pretty difficult to scan through to see what you are doing. But do note that this part

### linear modeling
design = model.matrix(~0+groups)
lm1 = lmFit(exprs(norm), design) 
lm1 = eBayes(lm1) 

### pairwise comparisons
names = matrix(c("WTvsKO","WTvs2X","KOvs2X","WTvsKO+2X"), nrow=4, ncol=1)                        # create list of comparisons for naming files in loop
acontrast = makeContrasts(
     WT_vs_KO = (groupsWT - groupsKO),
    WT_vs_2X = (groupsWT - groups2x),
    KO_vs_2X = (groupsKO - groups2x),
    WT_vs_others = (groupsWT - ((groupsKO + groups2x)/2)),
     levels=design)
lm3 = contrasts.fit(lm1, acontrast)

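is not correct. You call contrasts.fit() before eBayes(), not after, so that the moderated statistics are computed for the contrasts rather than for the individual group means.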
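
As a minimal sketch of that ordering, the following uses simulated data in place of exprs(norm). The matrix `expr`, the probe names and the object names `fit` and `contrast_matrix` are made up for illustration; only the sequence lmFit(), then contrasts.fit(), then eBayes() is the point.

```r
library(limma)

set.seed(1)
# Simulated stand-in for exprs(norm): 50 probesets x 12 arrays (hypothetical data)
expr <- matrix(rnorm(50 * 12, mean = 8), nrow = 50,
               dimnames = list(paste0("probe", 1:50), paste0("array", 1:12)))
groups <- factor(c(rep("WT", 4), rep("KO", 4), rep("2x", 4)))

design <- model.matrix(~0 + groups)
contrast_matrix <- makeContrasts(
    WT_vs_KO = groupsWT - groupsKO,
    levels = design)

fit <- lmFit(expr, design)                  # 1. fit the group-means model
fit <- contrasts.fit(fit, contrast_matrix)  # 2. re-express the fit as contrasts
fit <- eBayes(fit)                          # 3. moderate the contrast statistics

top <- topTable(fit, coef = "WT_vs_KO", number = 10)
```

With this order, the t-statistics and p-values in `top` refer to the WT versus KO contrast rather than to the per-group coefficients.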

In addition, this part

for (a in 1:4)                                                        # hard coded for 4 comparisons, needs to be changed if the number of comparisons changed
{
    dat=toptable(lm3, coef=a, num=100)                                                # Create dataframe of top results
    dat[,c("Symbol","GeneName")] = NA                                                # create columns for gene symbol and name
    b = length(row.names(dat))                                            # calculate length of table for loop count
    for (n in 1:b)                                                    # loop to add human readable names and gene symbols
    {
        dat[n,6] = gns$SYMBOL[gns$PROBEID == rownames(dat[n,])]                            # add gene symbols    
        dat[n,7] = gns$GENENAME[gns$PROBEID == rownames(dat[n,])]                        # add gene names
    }

Isn't needed. There is a 'genes' list item in the MArrayLM object you are calling lm1, and you can just do

lm1$genes <- gns

and the topTable() results will be correctly annotated. Plus you will make your life easier if you vectorize things rather than using for() loops. Instead of thinking that you are going to add a column of NA values and then iterate through each one, replacing with the correct value, it's better to do something like use match() to create a correctly ordered vector and then add that to the data.frame.
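
A small sketch of the match() approach, using toy tables rather than the real annotation (the probe IDs, symbols and logFC values below are invented for illustration):

```r
# Hypothetical annotation table, in a different order from the results
gns <- data.frame(
    PROBEID  = c("p1", "p2", "p3", "p4"),
    SYMBOL   = c("GeneA", "GeneB", NA, "GeneD"),
    GENENAME = c("gene A", "gene B", NA, "gene D"),
    stringsAsFactors = FALSE)

# Hypothetical results table, keyed by probe ID in its row names
dat <- data.frame(logFC = c(1.2, -0.8, 0.5),
                  row.names = c("p3", "p1", "p4"))

# One lookup gives the row of gns that corresponds to each row of dat
idx <- match(rownames(dat), gns$PROBEID)

# Whole columns are then added at once, with no inner loop
dat$Symbol   <- gns$SYMBOL[idx]
dat$GeneName <- gns$GENENAME[idx]
```

Probes with no annotation simply end up as NA, so there is no need to pre-fill the columns.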