How to remove dublicate probesets matching to the same gene, after toptable in order to have a genelist without dublicate gene Symbols for further GO analysis ?
2
0
Entering edit mode
svlachavas ▴ 830
@svlachavas-7225
Last seen 6 months ago
Germany/Heidelberg/German Cancer Resear…

Dear All,

i have used the limma software package to implement an  paired statistical analysis for a dataset regarding DE expression between control & cancer samples. Heres my code:

library(limma)

library(hgu133a.db)


conditions <- data.trusted.eset$condition
condition <- factor(conditions, levels(condition)[c(2,1)])
pairs <- factor(rep(1:13, each = 2))

design <- model.matrix(~condition+pairs)
fit <- lmFit(data.trusted.eset, design)

fit2 <- eBayes(fit)

library(hgu133a.db)

symbols <- unlist(mget(featureNames(data.trusted.eset), env=hgu133aSYMBOL))

top <- topTable(fit2, coef="conditionCancer", number=nrow(fit2), adjust.method="fdr", genelist=symbols)

select <- top[which(abs(top$logFC) >1 & top$ adj.P.Val < 0.05),]

My main question(as a beginner in R/bioconductor), is because my platform is Affymetrix and some probesets match to the same gene, how i can remove these dublicates(select$ID column) which have the same gene 2 or more times and remove them, so that i can extract my DE list without gene dublicate symbols for further analysis ?

Thank you in advance

 

 

bioconductor limma differential gene expression dublicate • 1.6k views
ADD COMMENT
2
Entering edit mode
@ryan-c-thompson-5618
Last seen 8 months ago
Scripps Research, La Jolla, CA

First of all, you probably shouldn't use "select" as a variable name, because it is already a function, and also the word is a verb, not a noun. I would suggest calling it "selected" instead.

Second, the function you're looking for is duplicated. You would do something like:

# Make sure the table is ordered with most the most 
# significant results at the top
selected <- selected[order(select$adj.P.Val),]
# Keep only the first occurrence of each ID
selected <- selected[!duplicated(select$ID),]

Thirdly, using separate cutoffs for fold change and significance (p-value or FDR) is not recommended. Consider using the treat function instead to test for significance relative to a nonzero threshold.

ADD COMMENT
0
Entering edit mode
svlachavas ▴ 830
@svlachavas-7225
Last seen 6 months ago
Germany/Heidelberg/German Cancer Resear…

Thank you for your valuable answer and corrections !! . On the other hand, for the separate cutoffs you mean that i should not use simultaneously logFC and p-value cutoffs for sorting the results from topTable ? I thought that using logFC along with p-value cutoff gives more biologically meaningful results regarding the DE genes, instead of using just the adjusted p-value below a threshold. Moreover, one similar suggestion is mentioned in the paper of  JJ Chen et al., 2007{....."The fold-change criterion can always be used as a secondary criterion to facilitate the interpretation of biological significance. Some researchers may impose that the P-value and the foldchange are equally important; in such cases all genes satisfying both criteria are selected...."}.

ADD COMMENT
3
Entering edit mode

Go read the documentation and accompanying paper for the treat function. It provides a statistically principled way to combine the p-value and fold change cutoff into a single test.

ADD REPLY
0
Entering edit mode

thank you again for your recommendation-i'm definately going to read the paper to get a validated approach on my data set analysis !! i also used another approach,to get unique PROBEID, SYMBOL & ENTREZID but i'm not sure that is correct as the above :

after selected <- top[which(abs(top$logFC) >1 & top$ adj.P.Val < 0.05),]

ls("package:hgu133a.db")

columns(hgu133a.db)

keytypes(hgu133a.db)

res <- select(hgu133a.db, keys=rownames(selected),columns=c("ENTREZID", "SYMBOL"), 
keytype="PROBEID")

idx <- match(rownames(selected), res$PROBEID)

res2 <- res[idx,]

ADD REPLY

Login before adding your answer.

Traffic: 1087 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6