Question

GGtools: transScores function returning cis eQTL

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 11.3 years ago

Hi there, I'm using the excellent GGtools to package to look for eQTLs. When I use the transScore function to look for trans eQTL I find that some of the eQTL-gene pairs I discover are closer than the distance specified by the argument 'radius' (used to define what is cis; hence eSNPs closer to the target gene than radius should not be returned by transScores). Any suggestions on what I'm doing wrong would be greatly appreciated. Best wishes James An example using GGdata looking only at genes and SNPs on chromosome 1. Here I define trans as > 100,000 bps. library(GGtools) library(GGdata) # get radius used to define cis rad <- 1e+05 trS <-transScores("GGdata", snpchr = as.character(1), rhs = ~1, K = 20, targdirpref = "tsco", geneApply = lapply, chrnames = paste("chr", as.character(1), sep = ""), radius = rad, renameChrs = NULL, probesToKeep = NULL, batchsize = 200, genegran = 50, shortfac = 10, wrapperEndo= function(x) MAFfilter(x, low=0.1), geneannopk = "illuminaHumanv1.db", snpannopk = "SNPlocs.Hsapiens.dbSNP.20111119", gchrpref = "", schrpref = "ch") # get gene indexes TOPind <- geneIndcol(trS, 1) # construct results table using probe ids, chi squared #association scores and snp loci TOP <- data.frame(cbind(probeid= featureNames(getSS("GGdata",1))[TOPind], score= topScores(trS), snpids = locusNames(trS))) TOP$score <- as.numeric(as.character(TOP$score)) # set chi squared score threshold for significance for trans eQTL (equivalent to p val < 1e-11) threshold <- 46.328476 tophits <- TOP[which(TOP$score > threshold),] # map probe to gene symbol, entrez id and chromosome PROBE2SYMS <- select(illuminaHumanv1.db, keys = as.character(tophits$probeid), cols = c("SYMBOL","ENTREZID","CHR")) tophits <- cbind(PROBE2SYMS, tophits) # get snp details library(NCBI2R) mysnps <- as.character(tophits$snpids) annotSNPs <- AnnotateSNPList(mysnps, file="") selectSNPinfo <- annotSNPs[,c("marker","chr","chrpos")] # get annotation on the genes using NCBI geneinfo <- GetGeneInfo(tophits$ENTREZID) # ori gives info on whether the gene is on sense or #anti-sense strand geneinfo <- geneinfo[,c("locusID","GeneLowPoint","GeneHighPoint","ori")] y <- cbind(tophits, selectSNPinfo) #check if(identical(y$marker, as.character(y$snpids))==FALSE){stop("The SNP ids don't match up")} # merge gene info with snp info and trans eQTL results final <- cbind(geneinfo, y) # check if(identical(final$locusID, final$ENTREZID)==FALSE){stop("The Entrez ids don't match up")} # clean up the dataframe (remove duplicate cols, make col names clearer) colnames(final)[1] <- 'EntrezID' colnames(final)[8] <- 'geneCHR' colnames(final)[13] <- 'snpCHR' colnames(final)[14] <- 'snpPos' final <- final[,-c(7,9,12)] # calculate distance from snp to gene start position diff <- rep(NA, nrow(final)) for (i in 1:nrow(final)){ if(final$ori[i] == '+'){ diff[i] <- abs(final$GeneLowPoint[i] - final$snpPos[i]) } else if (final$ori[i] == '-'){ diff[i] <- abs(final$GeneHighPoint[i] - final$snpPos[i]) } } # eQTL which are in fact cis final[which(diff < rad),] -- output of sessionInfo(): > sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] splines stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] SNPlocs.Hsapiens.dbSNP.20111119_0.99.6 SNPlocs.Hsapiens.dbSNP.20110815_0.99.6 GGdata_1.0.19 illuminaHumanv1.db_1.12.2 org.Hs.eg.db_2.7.1 [6] RSQLite_0.11.2 DBI_0.2-5 AnnotationDbi_1.18.4 Biobase_2.16.0 GGtools_4.4.0 [11] Rsamtools_1.8.6 Biostrings_2.24.1 GenomicRanges_1.8.13 IRanges_1.14.4 BiocGenerics_0.2.0 [16] GGBase_3.18.0 snpStats_1.6.0 Matrix_1.0-9 lattice_0.20-10 survival_2.36-14 [21] NCBI2R_1.4.3 loaded via a namespace (and not attached): [1] annotate_1.34.1 biomaRt_2.12.0 bit_1.1-8 bitops_1.0-4.1 BSgenome_1.24.0 ff_2.2-7 genefilter_1.38.0 [8] GenomicFeatures_1.8.3 grid_2.15.1 RCurl_1.95-0.1.2 rtracklayer_1.16.3 tools_2.15.1 VariantAnnotation_1.2.11 XML_3.95-0 [15] xtable_1.7-0 -- Sent via the guest posting facility at bioconductor.org.

SNP Annotation probe SNPlocs GGtools SNP Annotation probe SNPlocs GGtools • 2.1k views

ADD COMMENT • link updated 13.2 years ago by Vincent J. Carey, Jr. 6.7k • written 13.2 years ago by Guest User ★ 13k

score 0 · Answer 1 · 2012-10-04

0

Entering edit mode

Vincent J. Carey, Jr. 6.7k

@vincent-j-carey-jr-4

Last seen 3 days ago

United States

Thank you for this report. I will get back to you ASAP. This is a tough time with versions changing, but I wonder if you could try with the new release. The robustification of location handling has been a key concern of development in the past couple of months. On Thu, Oct 4, 2012 at 11:56 AM, James Peters [guest] < guest@bioconductor.org> wrote: > > Hi there, > > I'm using the excellent GGtools to package to look for eQTLs. When I use > the transScore function to look for trans eQTL I find that some of the > eQTL-gene pairs I discover are closer than the distance specified by the > argument 'radius' (used to define what is cis; hence eSNPs closer to the > target gene than radius should not be returned by transScores). > > Any suggestions on what I'm doing wrong would be greatly appreciated. > > Best wishes > > James > > An example using GGdata looking only at genes and SNPs on chromosome 1. > Here I define trans as > 100,000 bps. > > library(GGtools) > library(GGdata) > > # get radius used to define cis > rad <- 1e+05 > > trS <-transScores("GGdata", snpchr = as.character(1), rhs = ~1, K = 20, > targdirpref = "tsco", geneApply = lapply, chrnames = paste("chr", > as.character(1), sep = ""), radius = rad, renameChrs = NULL, probesToKeep = > NULL, > batchsize = 200, genegran = 50, shortfac = 10, wrapperEndo= function(x) > MAFfilter(x, low=0.1), > geneannopk = "illuminaHumanv1.db", snpannopk = > "SNPlocs.Hsapiens.dbSNP.20111119", gchrpref = "", > schrpref = "ch") > > # get gene indexes > TOPind <- geneIndcol(trS, 1) > > # construct results table using probe ids, chi squared #association scores > and snp loci > TOP <- data.frame(cbind(probeid= featureNames(getSS("GGdata",1))[TOPind], > score= topScores(trS), snpids = locusNames(trS))) > > TOP$score <- as.numeric(as.character(TOP$score)) > > # set chi squared score threshold for significance for trans eQTL > (equivalent to p val < 1e-11) > threshold <- 46.328476 > > tophits <- TOP[which(TOP$score > threshold),] > > # map probe to gene symbol, entrez id and chromosome > PROBE2SYMS <- select(illuminaHumanv1.db, keys = > as.character(tophits$probeid), cols = c("SYMBOL","ENTREZID","CHR")) > > tophits <- cbind(PROBE2SYMS, tophits) > > # get snp details > library(NCBI2R) > mysnps <- as.character(tophits$snpids) > annotSNPs <- AnnotateSNPList(mysnps, file="") > > selectSNPinfo <- annotSNPs[,c("marker","chr","chrpos")] > > # get annotation on the genes using NCBI > geneinfo <- GetGeneInfo(tophits$ENTREZID) > # ori gives info on whether the gene is on sense or #anti-sense strand > geneinfo <- geneinfo[,c("locusID","GeneLowPoint","GeneHighPoint","ori")] > > y <- cbind(tophits, selectSNPinfo) > #check > if(identical(y$marker, as.character(y$snpids))==FALSE){stop("The SNP ids > don't match up")} > > # merge gene info with snp info and trans eQTL results > final <- cbind(geneinfo, y) > # check > if(identical(final$locusID, final$ENTREZID)==FALSE){stop("The Entrez ids > don't match up")} > > # clean up the dataframe (remove duplicate cols, make col names clearer) > colnames(final)[1] <- 'EntrezID' > colnames(final)[8] <- 'geneCHR' > colnames(final)[13] <- 'snpCHR' > colnames(final)[14] <- 'snpPos' > final <- final[,-c(7,9,12)] > > # calculate distance from snp to gene start position > diff <- rep(NA, nrow(final)) > for (i in 1:nrow(final)){ > if(final$ori[i] == '+'){ > diff[i] <- abs(final$GeneLowPoint[i] - final$snpPos[i]) > } else if > (final$ori[i] == '-'){ > diff[i] <- abs(final$GeneHighPoint[i] - final$snpPos[i]) > } > } > > # eQTL which are in fact cis > final[which(diff < rad),] > > > > -- output of sessionInfo(): > > > sessionInfo() > R version 2.15.1 (2012-06-22) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 > > attached base packages: > [1] splines stats4 stats graphics grDevices utils datasets > methods base > > other attached packages: > [1] SNPlocs.Hsapiens.dbSNP.20111119_0.99.6 > SNPlocs.Hsapiens.dbSNP.20110815_0.99.6 GGdata_1.0.19 > illuminaHumanv1.db_1.12.2 org.Hs.eg.db_2.7.1 > [6] RSQLite_0.11.2 DBI_0.2-5 > AnnotationDbi_1.18.4 Biobase_2.16.0 > GGtools_4.4.0 > [11] Rsamtools_1.8.6 Biostrings_2.24.1 > GenomicRanges_1.8.13 IRanges_1.14.4 > BiocGenerics_0.2.0 > [16] GGBase_3.18.0 snpStats_1.6.0 > Matrix_1.0-9 lattice_0.20-10 > survival_2.36-14 > [21] NCBI2R_1.4.3 > > loaded via a namespace (and not attached): > [1] annotate_1.34.1 biomaRt_2.12.0 bit_1.1-8 > bitops_1.0-4.1 BSgenome_1.24.0 ff_2.2-7 > genefilter_1.38.0 > [8] GenomicFeatures_1.8.3 grid_2.15.1 RCurl_1.95-0.1.2 > rtracklayer_1.16.3 tools_2.15.1 > VariantAnnotation_1.2.11 XML_3.95-0 > [15] xtable_1.7-0 > > -- > Sent via the guest posting facility at bioconductor.org. > [[alternative HTML version deleted]]

ADD COMMENT • link 13.2 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

First, I want to thank you for making a detailed and reproducible report. I have only done enough to exonerate the current version. Second, there was a problem in the script you gave -- I see no way for chrnames="chr1" to work with current annotation specifications as the chrnames is used to query the geneannopk package. for illuminaHumanv1.db you have to use "1". Third, the following additional code shows that the trans scores retrieved involve SNP-gene distances > 100000 bp, USING THE BIOCONDUCTOR ANNOTATION PACKAGES SPECIFIED IN THE CALL. > tt = transTab(trS) # get a data frame of the scores and some location information > dim(tt) [1] 2853880 8 # that gives 20 per-SNP association scores, for the 20 largest trans associations on chr1 to SNP on chr1, that is, the best among SNP-gene pairs satisfying # your distance restriction > library(SNPlocs.Hsapiens.dbSNP.20111119) > sl1 = getSNPlocs("ch1", as.GRanges=TRUE) > rsn1 = paste("rs", values(sl1)$RefSNP_id, sep="") > > sst = start(sl1) > names(sst) = rsn1 > > MM <- match(as.character(tt[,1]), names(sst), nomatch=0) > sst2 =sst[MM] > > tt2 = tt[which(MM>0),] > > summary(abs(abs(tt2$geneloc)-sst2)) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 100100 35100000 76200000 88130000 137300000 248300000 53284 this shows that the SNP-gene pairs on which scores were returned are separated by at least 100100 b it also shows that some gene/snp pairs couldn't be related by location, and that needs to be rationalized. it may be something simple. I think the problem that you report comes from using NCBI2R SNP annotation to characterize SNP locations ... I'm not familiar with that package but I assume that it yields SNP locations not completely consistent with the SNPlocs packaged used in the filltering for association testing. It would be possible to use the NCBI2R locations for filtering if you made GRanges with the names and locations corresponding to the genotyped variants and put them in a package with a getSNPlocs function that would emulate the SNPlocs package behaviors. Another approach would work from passing the "cis map" in as a list, indicating which SNPs are excluded from testing for each gene. If these become compelling use cases I can introduce this capability. Let me know if there are further issues. On Thu, Oct 4, 2012 at 12:09 PM, Vincent Carey <stvjc@channing.harvard.edu>wrote: > Thank you for this report. I will get back to you ASAP. This is a tough > time with versions changing, but I wonder if you could try with the new > release. The robustification of location handling has been a key concern > of development in the past couple of months. > > On Thu, Oct 4, 2012 at 11:56 AM, James Peters [guest] < > guest@bioconductor.org> wrote: > >> >> Hi there, >> >> I'm using the excellent GGtools to package to look for eQTLs. When I use >> the transScore function to look for trans eQTL I find that some of the >> eQTL-gene pairs I discover are closer than the distance specified by the >> argument 'radius' (used to define what is cis; hence eSNPs closer to the >> target gene than radius should not be returned by transScores). >> >> Any suggestions on what I'm doing wrong would be greatly appreciated. >> >> Best wishes >> >> James >> >> An example using GGdata looking only at genes and SNPs on chromosome 1. >> Here I define trans as > 100,000 bps. >> >> library(GGtools) >> library(GGdata) >> >> # get radius used to define cis >> rad <- 1e+05 >> >> trS <-transScores("GGdata", snpchr = as.character(1), rhs = ~1, K = 20, >> targdirpref = "tsco", geneApply = lapply, chrnames = paste("chr", >> as.character(1), sep = ""), radius = rad, renameChrs = NULL, probesToKeep = >> NULL, >> batchsize = 200, genegran = 50, shortfac = 10, wrapperEndo= function(x) >> MAFfilter(x, low=0.1), >> geneannopk = "illuminaHumanv1.db", snpannopk = >> "SNPlocs.Hsapiens.dbSNP.20111119", gchrpref = "", >> schrpref = "ch") >> >> # get gene indexes >> TOPind <- geneIndcol(trS, 1) >> >> # construct results table using probe ids, chi squared #association >> scores and snp loci >> TOP <- data.frame(cbind(probeid= featureNames(getSS("GGdata",1))[TOPind], >> score= topScores(trS), snpids = locusNames(trS))) >> >> TOP$score <- as.numeric(as.character(TOP$score)) >> >> # set chi squared score threshold for significance for trans eQTL >> (equivalent to p val < 1e-11) >> threshold <- 46.328476 >> >> tophits <- TOP[which(TOP$score > threshold),] >> >> # map probe to gene symbol, entrez id and chromosome >> PROBE2SYMS <- select(illuminaHumanv1.db, keys = >> as.character(tophits$probeid), cols = c("SYMBOL","ENTREZID","CHR")) >> >> tophits <- cbind(PROBE2SYMS, tophits) >> >> # get snp details >> library(NCBI2R) >> mysnps <- as.character(tophits$snpids) >> annotSNPs <- AnnotateSNPList(mysnps, file="") >> >> selectSNPinfo <- annotSNPs[,c("marker","chr","chrpos")] >> >> # get annotation on the genes using NCBI >> geneinfo <- GetGeneInfo(tophits$ENTREZID) >> # ori gives info on whether the gene is on sense or #anti-sense strand >> geneinfo <- geneinfo[,c("locusID","GeneLowPoint","GeneHighPoint","ori")] >> >> y <- cbind(tophits, selectSNPinfo) >> #check >> if(identical(y$marker, as.character(y$snpids))==FALSE){stop("The SNP ids >> don't match up")} >> >> # merge gene info with snp info and trans eQTL results >> final <- cbind(geneinfo, y) >> # check >> if(identical(final$locusID, final$ENTREZID)==FALSE){stop("The Entrez ids >> don't match up")} >> >> # clean up the dataframe (remove duplicate cols, make col names clearer) >> colnames(final)[1] <- 'EntrezID' >> colnames(final)[8] <- 'geneCHR' >> colnames(final)[13] <- 'snpCHR' >> colnames(final)[14] <- 'snpPos' >> final <- final[,-c(7,9,12)] >> >> # calculate distance from snp to gene start position >> diff <- rep(NA, nrow(final)) >> for (i in 1:nrow(final)){ >> if(final$ori[i] == '+'){ >> diff[i] <- abs(final$GeneLowPoint[i] - final$snpPos[i]) >> } else if >> (final$ori[i] == '-'){ >> diff[i] <- abs(final$GeneHighPoint[i] - final$snpPos[i]) >> } >> } >> >> # eQTL which are in fact cis >> final[which(diff < rad),] >> >> >> >> -- output of sessionInfo(): >> >> > sessionInfo() >> R version 2.15.1 (2012-06-22) >> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >> >> locale: >> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 >> >> attached base packages: >> [1] splines stats4 stats graphics grDevices utils datasets >> methods base >> >> other attached packages: >> [1] SNPlocs.Hsapiens.dbSNP.20111119_0.99.6 >> SNPlocs.Hsapiens.dbSNP.20110815_0.99.6 GGdata_1.0.19 >> illuminaHumanv1.db_1.12.2 org.Hs.eg.db_2.7.1 >> [6] RSQLite_0.11.2 DBI_0.2-5 >> AnnotationDbi_1.18.4 Biobase_2.16.0 >> GGtools_4.4.0 >> [11] Rsamtools_1.8.6 Biostrings_2.24.1 >> GenomicRanges_1.8.13 IRanges_1.14.4 >> BiocGenerics_0.2.0 >> [16] GGBase_3.18.0 snpStats_1.6.0 >> Matrix_1.0-9 lattice_0.20-10 >> survival_2.36-14 >> [21] NCBI2R_1.4.3 >> >> loaded via a namespace (and not attached): >> [1] annotate_1.34.1 biomaRt_2.12.0 bit_1.1-8 >> bitops_1.0-4.1 BSgenome_1.24.0 ff_2.2-7 >> genefilter_1.38.0 >> [8] GenomicFeatures_1.8.3 grid_2.15.1 RCurl_1.95-0.1.2 >> rtracklayer_1.16.3 tools_2.15.1 >> VariantAnnotation_1.2.11 XML_3.95-0 >> [15] xtable_1.7-0 >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> > > [[alternative HTML version deleted]]

ADD REPLY • link 13.2 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

I forgot my sessionInfo: > sessionInfo() R version 2.15.1 Patched (2012-08-20 r60331) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.iso88591 LC_NUMERIC=C [3] LC_TIME=en_US.iso88591 LC_COLLATE=en_US.iso88591 [5] LC_MONETARY=en_US.iso88591 LC_MESSAGES=en_US.iso88591 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.iso88591 LC_IDENTIFICATION=C attached base packages: [1] parallel splines stats4 stats graphics grDevices datasets [8] tools utils methods base other attached packages: [1] SNPlocs.Hsapiens.dbSNP.20111119_0.99.6 [2] BiocInstaller_1.8.1 [3] GGdata_1.0.19 [4] illuminaHumanv1.db_1.16.0 [5] org.Hs.eg.db_2.8.0 [6] RSQLite_0.11.2 [7] DBI_0.2-5 [8] AnnotationDbi_1.19.38 [9] Biobase_2.17.8 [10] GGtools_4.5.41 [11] Rsamtools_1.9.30 [12] Biostrings_2.25.12 [13] GenomicRanges_1.9.65 [14] IRanges_1.15.44 [15] BiocGenerics_0.3.2 [16] GGBase_3.19.7 [17] snpStats_1.7.3 [18] Matrix_1.0-9 [19] lattice_0.20-10 [20] survival_2.36-14 [21] weaver_1.23.0 [22] codetools_0.2-8 [23] digest_0.5.2 loaded via a namespace (and not attached): [1] annotate_1.35.4 AnnotationForge_0.99.11 biomaRt_2.13.2 [4] bit_1.1-8 bitops_1.0-4.1 BSgenome_1.25.8 [7] ff_2.2-7 genefilter_1.39.0 GenomicFeatures_1.9.41 [10] grid_2.15.1 RCurl_1.91-1 rtracklayer_1.17.19 [13] VariantAnnotation_1.3.28 XML_3.9-4 xtable_1.7-0 [16] zlibbioc_1.3.0 On Thu, Oct 4, 2012 at 3:52 PM, Vincent Carey <stvjc@channing.harvard.edu>wrote: > First, I want to thank you for making a detailed and reproducible report. > I have only done enough to exonerate the current version. > > Second, there was a problem in the script you gave -- I see no way for > chrnames="chr1" to work with current annotation specifications as the > chrnames is used to query the geneannopk package. for illuminaHumanv1.db > you have to use "1". > > Third, the following additional code shows that the trans scores retrieved > involve SNP-gene distances > 100000 bp, USING THE BIOCONDUCTOR ANNOTATION > PACKAGES SPECIFIED IN THE CALL. > > > tt = transTab(trS) # get a data frame of the scores and some location > information > > dim(tt) > [1] 2853880 8 > > # that gives 20 per-SNP association scores, for the 20 largest trans > associations on chr1 to SNP on chr1, that is, the best among SNP- gene pairs > satisfying > # your distance restriction > > > library(SNPlocs.Hsapiens.dbSNP.20111119) > > sl1 = getSNPlocs("ch1", as.GRanges=TRUE) > > rsn1 = paste("rs", values(sl1)$RefSNP_id, sep="") > > > > sst = start(sl1) > > names(sst) = rsn1 > > > > MM <- match(as.character(tt[,1]), names(sst), nomatch=0) > > sst2 =sst[MM] > > > > tt2 = tt[which(MM>0),] > > > > summary(abs(abs(tt2$geneloc)-sst2)) > Min. 1st Qu. Median Mean 3rd Qu. Max. NA's > 100100 35100000 76200000 88130000 137300000 248300000 53284 > > this shows that the SNP-gene pairs on which scores were returned are > separated by at least 100100 b > > it also shows that some gene/snp pairs couldn't be related by location, > and that needs to be rationalized. it may be something > simple. > > I think the problem that you report comes from using NCBI2R SNP annotation > to characterize SNP > locations ... I'm not familiar with that package but I assume that it > yields SNP locations not completely consistent with the > SNPlocs packaged used in the filltering for association testing. > > It would be possible to use the NCBI2R locations for filtering if you made > GRanges with the names and locations corresponding > to the genotyped variants and put them in a package with a getSNPlocs > function that would emulate the SNPlocs > package behaviors. > > Another approach would work from passing the "cis map" in as a list, > indicating which SNPs are excluded from > testing for each gene. If these become compelling use cases I can > introduce this capability. > > Let me know if there are further issues. > > On Thu, Oct 4, 2012 at 12:09 PM, Vincent Carey <stvjc@channing.harvard.edu> > wrote: > >> Thank you for this report. I will get back to you ASAP. This is a tough >> time with versions changing, but I wonder if you could try with the new >> release. The robustification of location handling has been a key concern >> of development in the past couple of months. >> >> On Thu, Oct 4, 2012 at 11:56 AM, James Peters [guest] < >> guest@bioconductor.org> wrote: >> >>> >>> Hi there, >>> >>> I'm using the excellent GGtools to package to look for eQTLs. When I use >>> the transScore function to look for trans eQTL I find that some of the >>> eQTL-gene pairs I discover are closer than the distance specified by the >>> argument 'radius' (used to define what is cis; hence eSNPs closer to the >>> target gene than radius should not be returned by transScores). >>> >>> Any suggestions on what I'm doing wrong would be greatly appreciated. >>> >>> Best wishes >>> >>> James >>> >>> An example using GGdata looking only at genes and SNPs on chromosome 1. >>> Here I define trans as > 100,000 bps. >>> >>> library(GGtools) >>> library(GGdata) >>> >>> # get radius used to define cis >>> rad <- 1e+05 >>> >>> trS <-transScores("GGdata", snpchr = as.character(1), rhs = ~1, K = 20, >>> targdirpref = "tsco", geneApply = lapply, chrnames = paste("chr", >>> as.character(1), sep = ""), radius = rad, renameChrs = NULL, probesToKeep = >>> NULL, >>> batchsize = 200, genegran = 50, shortfac = 10, wrapperEndo= function(x) >>> MAFfilter(x, low=0.1), >>> geneannopk = "illuminaHumanv1.db", snpannopk = >>> "SNPlocs.Hsapiens.dbSNP.20111119", gchrpref = "", >>> schrpref = "ch") >>> >>> # get gene indexes >>> TOPind <- geneIndcol(trS, 1) >>> >>> # construct results table using probe ids, chi squared #association >>> scores and snp loci >>> TOP <- data.frame(cbind(probeid= >>> featureNames(getSS("GGdata",1))[TOPind], score= topScores(trS), snpids = >>> locusNames(trS))) >>> >>> TOP$score <- as.numeric(as.character(TOP$score)) >>> >>> # set chi squared score threshold for significance for trans eQTL >>> (equivalent to p val < 1e-11) >>> threshold <- 46.328476 >>> >>> tophits <- TOP[which(TOP$score > threshold),] >>> >>> # map probe to gene symbol, entrez id and chromosome >>> PROBE2SYMS <- select(illuminaHumanv1.db, keys = >>> as.character(tophits$probeid), cols = c("SYMBOL","ENTREZID","CHR")) >>> >>> tophits <- cbind(PROBE2SYMS, tophits) >>> >>> # get snp details >>> library(NCBI2R) >>> mysnps <- as.character(tophits$snpids) >>> annotSNPs <- AnnotateSNPList(mysnps, file="") >>> >>> selectSNPinfo <- annotSNPs[,c("marker","chr","chrpos")] >>> >>> # get annotation on the genes using NCBI >>> geneinfo <- GetGeneInfo(tophits$ENTREZID) >>> # ori gives info on whether the gene is on sense or #anti-sense strand >>> geneinfo <- geneinfo[,c("locusID","GeneLowPoint","GeneHighPoint","ori")] >>> >>> y <- cbind(tophits, selectSNPinfo) >>> #check >>> if(identical(y$marker, as.character(y$snpids))==FALSE){stop("The SNP ids >>> don't match up")} >>> >>> # merge gene info with snp info and trans eQTL results >>> final <- cbind(geneinfo, y) >>> # check >>> if(identical(final$locusID, final$ENTREZID)==FALSE){stop("The Entrez ids >>> don't match up")} >>> >>> # clean up the dataframe (remove duplicate cols, make col names clearer) >>> colnames(final)[1] <- 'EntrezID' >>> colnames(final)[8] <- 'geneCHR' >>> colnames(final)[13] <- 'snpCHR' >>> colnames(final)[14] <- 'snpPos' >>> final <- final[,-c(7,9,12)] >>> >>> # calculate distance from snp to gene start position >>> diff <- rep(NA, nrow(final)) >>> for (i in 1:nrow(final)){ >>> if(final$ori[i] == '+'){ >>> diff[i] <- abs(final$GeneLowPoint[i] - final$snpPos[i]) >>> } else if >>> (final$ori[i] == '-'){ >>> diff[i] <- abs(final$GeneHighPoint[i] - final$snpPos[i]) >>> } >>> } >>> >>> # eQTL which are in fact cis >>> final[which(diff < rad),] >>> >>> >>> >>> -- output of sessionInfo(): >>> >>> > sessionInfo() >>> R version 2.15.1 (2012-06-22) >>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>> >>> locale: >>> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 >>> >>> attached base packages: >>> [1] splines stats4 stats graphics grDevices utils datasets >>> methods base >>> >>> other attached packages: >>> [1] SNPlocs.Hsapiens.dbSNP.20111119_0.99.6 >>> SNPlocs.Hsapiens.dbSNP.20110815_0.99.6 GGdata_1.0.19 >>> illuminaHumanv1.db_1.12.2 org.Hs.eg.db_2.7.1 >>> [6] RSQLite_0.11.2 DBI_0.2-5 >>> AnnotationDbi_1.18.4 Biobase_2.16.0 >>> GGtools_4.4.0 >>> [11] Rsamtools_1.8.6 Biostrings_2.24.1 >>> GenomicRanges_1.8.13 IRanges_1.14.4 >>> BiocGenerics_0.2.0 >>> [16] GGBase_3.18.0 snpStats_1.6.0 >>> Matrix_1.0-9 lattice_0.20-10 >>> survival_2.36-14 >>> [21] NCBI2R_1.4.3 >>> >>> loaded via a namespace (and not attached): >>> [1] annotate_1.34.1 biomaRt_2.12.0 bit_1.1-8 >>> bitops_1.0-4.1 BSgenome_1.24.0 ff_2.2-7 >>> genefilter_1.38.0 >>> [8] GenomicFeatures_1.8.3 grid_2.15.1 RCurl_1.95-0.1.2 >>> rtracklayer_1.16.3 tools_2.15.1 >>> VariantAnnotation_1.2.11 XML_3.95-0 >>> [15] xtable_1.7-0 >>> >>> -- >>> Sent via the guest posting facility at bioconductor.org. >>> >> >> > [[alternative HTML version deleted]]

ADD REPLY • link 13.2 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

Dear Prof. Carey, Thanks for such a swift response. Yes you are quite right re: the error in the script I posted. Apologies- I somehow managed to copy and paste part of another bit of code from the analysis of the dataset I'm using- although in fact since upgrading to the latest version of GGtools the 'chr' prefix now creates an error with that too (rectified by specifying the argument chrnames as a numeric vector). Thanks also for highlighting the transTab function- this saves some of the manual construction of the results data frame I had been doing. I ended up using the NCBI2R package as I was having problems with the rsid2loc function in the SNPlocs.Hsapiens.dbSNP.20111119 package (I realise there is now a newer version of the SNPlocs package); specifically the genotyping of the dataset I'm using was done on the Affy SNP 6.0 array, and some of the rs ids of my SNPs are not recognised by the function rsid2loc resulting in an error message: # mysnps is a character vector of eSNP ids: > head(mysnps) [1] "rs17117873" "rs17117877" "rs11590757" "rs2372309" "rs12118488" [6] "rs2274226" > rsid2loc(mysnps) Error in .rsid2rowidx(rsids) : rs id(s) not found: 387642, 6581161, 3129934, 3129900, 17651213, 3129868, 17563800, 12150111, 17688944, 453997, 241033, 241031, 17688056, 17688434, 17563501, 2902662, 17571718, 17572169, 17572851, 41437445, 7221390, 17687667, 17564780, 8070723, 17574228, 413778, 418891, 17653162, 17688391, 17649553, 17689882, 17650063, 916793, 17649641, 242559, 4074462, 41374248, 17575556, 2532292, 12150672, 17763086, 17688922, 17652502, 17575850, 11079729, 17650872, 7350980, 17563787, 41382552, 8072451, 17573175, 17688002, 17572893, 17760733, 17762769, 17652449, 4597358, 11079724, 17563599, 17689824, 41399444, 17761207, 3129768, 2732706, 17762165, 17660907, 41384744, 2732675, 1406068, 17662235, 17563827, 17659881, 17426106, 17688773, 2532297, 2532316, 1467969, 17660132, 17660065, 17688068, 17660595, 17769490, 17763533, 34097347, 17649954, 17689471, 12150090, 2696526, 2732589, 17691328, 3129063, 4277389, 2732647, 2532259, 2532275, 760293, 2532257, 12150320, 1611350, 8067056 I think these are SNPs where the rs id has changed over time. The NCBI2R package seems to be able to handle these older rs ids. Perhaps these and similar SNPs are reflected in the NAs you demonstrated below? I will try the approaches you suggested. Kind regards, and once again many thanks James PS sessionInfo for the above code snippet: > sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] splines stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] NCBI2R_1.4.3 [2] SNPlocs.Hsapiens.dbSNP.20111119_0.99.6 [3] hugene10stensembl.db_2.1.0 [4] hugene10stensembl68.db_1.18.1 [5] org.Hs.eg.db_2.8.0 [6] RSQLite_0.11.2 [7] DBI_0.2-5 [8] AnnotationDbi_1.20.0 [9] oligo_1.22.0 [10] oligoClasses_1.20.0 [11] Biobase_2.18.0 [12] GGtools_4.6.0 [13] Rsamtools_1.10.1 [14] Biostrings_2.26.0 [15] GenomicRanges_1.10.0 [16] IRanges_1.16.2 [17] BiocGenerics_0.4.0 [18] GGBase_3.20.0 [19] snpStats_1.8.0 [20] Matrix_1.0-9 [21] lattice_0.20-10 [22] survival_2.36-14 loaded via a namespace (and not attached): [1] affxparser_1.30.0 affyio_1.26.0 annotate_1.36.0 [4] BiocInstaller_1.8.1 biomaRt_2.14.0 bit_1.1-8 [7] bitops_1.0-4.1 BSgenome_1.26.0 codetools_0.2-8 [10] ff_2.2-7 foreach_1.4.0 genefilter_1.40.0 [13] GenomicFeatures_1.10.0 grid_2.15.1 iterators_1.0.6 [16] preprocessCore_1.20.0 RCurl_1.95-0.1.2 rtracklayer_1.18.0 [19] tools_2.15.1 VariantAnnotation_1.4.0 XML_3.95-0.1 [22] xtable_1.7-0 zlibbioc_1.4.0 On Oct 4 2012, Vincent Carey wrote: >First, I want to thank you for making a detailed and reproducible report. > I have only done enough to exonerate the current version. > >Second, there was a problem in the script you gave -- I see no way for >chrnames="chr1" to work with current annotation specifications as the >chrnames is used to query the geneannopk package. for illuminaHumanv1.db >you have to use "1". > >Third, the following additional code shows that the trans scores retrieved >involve SNP-gene distances > 100000 bp, USING THE BIOCONDUCTOR ANNOTATION >PACKAGES SPECIFIED IN THE CALL. > >> tt = transTab(trS) # get a data frame of the scores and some location >information >> dim(tt) >[1] 2853880 8 > ># that gives 20 per-SNP association scores, for the 20 largest trans >associations on chr1 to SNP on chr1, that is, the best among SNP-gene pairs >satisfying ># your distance restriction > >> library(SNPlocs.Hsapiens.dbSNP.20111119) >> sl1 = getSNPlocs("ch1", as.GRanges=TRUE) >> rsn1 = paste("rs", values(sl1)$RefSNP_id, sep="") >> >> sst = start(sl1) >> names(sst) = rsn1 >> >> MM <- match(as.character(tt[,1]), names(sst), nomatch=0) >> sst2 =sst[MM] >> >> tt2 = tt[which(MM>0),] >> >> summary(abs(abs(tt2$geneloc)-sst2)) > Min. 1st Qu. Median Mean 3rd Qu. Max. NA's > 100100 35100000 76200000 88130000 137300000 248300000 53284 > >this shows that the SNP-gene pairs on which scores were returned are >separated by at least 100100 b > >it also shows that some gene/snp pairs couldn't be related by location, and >that needs to be rationalized. it may be something >simple. > >I think the problem that you report comes from using NCBI2R SNP annotation >to characterize SNP >locations ... I'm not familiar with that package but I assume that it >yields SNP locations not completely consistent with the >SNPlocs packaged used in the filltering for association testing. > >It would be possible to use the NCBI2R locations for filtering if you made >GRanges with the names and locations corresponding >to the genotyped variants and put them in a package with a getSNPlocs >function that would emulate the SNPlocs >package behaviors. > >Another approach would work from passing the "cis map" in as a list, >indicating which SNPs are excluded from >testing for each gene. If these become compelling use cases I can >introduce this capability. > >Let me know if there are further issues. > >On Thu, Oct 4, 2012 at 12:09 PM, Vincent Carey ><stvjc at="" channing.harvard.edu="">wrote: > >> Thank you for this report. I will get back to you ASAP. This is a tough >> time with versions changing, but I wonder if you could try with the new >> release. The robustification of location handling has been a key concern >> of development in the past couple of months. >> >> On Thu, Oct 4, 2012 at 11:56 AM, James Peters [guest] < >> guest at bioconductor.org> wrote: >> >>> >>> Hi there, >>> >>> I'm using the excellent GGtools to package to look for eQTLs. When I use >>> the transScore function to look for trans eQTL I find that some of the >>> eQTL-gene pairs I discover are closer than the distance specified by the >>> argument 'radius' (used to define what is cis; hence eSNPs closer to the >>> target gene than radius should not be returned by transScores). >>> >>> Any suggestions on what I'm doing wrong would be greatly appreciated. >>> >>> Best wishes >>> >>> James >>> >>> An example using GGdata looking only at genes and SNPs on chromosome 1. >>> Here I define trans as > 100,000 bps. >>> >>> library(GGtools) >>> library(GGdata) >>> >>> # get radius used to define cis >>> rad <- 1e+05 >>> >>> trS <-transScores("GGdata", snpchr = as.character(1), rhs = ~1, K = >>> 20, targdirpref = "tsco", geneApply = lapply, chrnames = paste("chr", >>> as.character(1), sep = ""), radius = rad, renameChrs = NULL, >>> probesToKeep = NULL, batchsize = 200, genegran = 50, shortfac = 10, >>> wrapperEndo= function(x) MAFfilter(x, low=0.1), geneannopk = >>> "illuminaHumanv1.db", snpannopk = "SNPlocs.Hsapiens.dbSNP.20111119", >>> gchrpref = "", schrpref = "ch") >>> >>> # get gene indexes >>> TOPind <- geneIndcol(trS, 1) >>> >>> # construct results table using probe ids, chi squared #association >>> scores and snp loci TOP <- data.frame(cbind(probeid= >>> featureNames(getSS("GGdata",1))[TOPind], score= topScores(trS), snpids >>> = locusNames(trS))) >>> >>> TOP$score <- as.numeric(as.character(TOP$score)) >>> >>> # set chi squared score threshold for significance for trans eQTL >>> (equivalent to p val < 1e-11) >>> threshold <- 46.328476 >>> >>> tophits <- TOP[which(TOP$score > threshold),] >>> >>> # map probe to gene symbol, entrez id and chromosome >>> PROBE2SYMS <- select(illuminaHumanv1.db, keys = >>> as.character(tophits$probeid), cols = c("SYMBOL","ENTREZID","CHR")) >>> >>> tophits <- cbind(PROBE2SYMS, tophits) >>> >>> # get snp details >>> library(NCBI2R) >>> mysnps <- as.character(tophits$snpids) >>> annotSNPs <- AnnotateSNPList(mysnps, file="") >>> >>> selectSNPinfo <- annotSNPs[,c("marker","chr","chrpos")] >>> >>> # get annotation on the genes using NCBI >>> geneinfo <- GetGeneInfo(tophits$ENTREZID) >>> # ori gives info on whether the gene is on sense or #anti-sense strand >>> geneinfo <- geneinfo[,c("locusID","GeneLowPoint","GeneHighPoint","ori")] >>> >>> y <- cbind(tophits, selectSNPinfo) >>> #check >>> if(identical(y$marker, as.character(y$snpids))==FALSE){stop("The SNP ids >>> don't match up")} >>> >>> # merge gene info with snp info and trans eQTL results >>> final <- cbind(geneinfo, y) >>> # check >>> if(identical(final$locusID, final$ENTREZID)==FALSE){stop("The Entrez ids >>> don't match up")} >>> >>> # clean up the dataframe (remove duplicate cols, make col names clearer) >>> colnames(final)[1] <- 'EntrezID' >>> colnames(final)[8] <- 'geneCHR' >>> colnames(final)[13] <- 'snpCHR' >>> colnames(final)[14] <- 'snpPos' >>> final <- final[,-c(7,9,12)] >>> >>> # calculate distance from snp to gene start position >>> diff <- rep(NA, nrow(final)) >>> for (i in 1:nrow(final)){ >>> if(final$ori[i] == '+'){ >>> diff[i] <- abs(final$GeneLowPoint[i] - final$snpPos[i]) >>> } else if >>> (final$ori[i] == '-'){ >>> diff[i] <- abs(final$GeneHighPoint[i] - final$snpPos[i]) >>> } >>> } >>> >>> # eQTL which are in fact cis >>> final[which(diff < rad),] >>> >>> >>> >>> -- output of sessionInfo(): >>> >>> > sessionInfo() >>> R version 2.15.1 (2012-06-22) >>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>> >>> locale: >>> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 >>> >>> attached base packages: >>> [1] splines stats4 stats graphics grDevices utils datasets >>> methods base >>> >>> other attached packages: >>> [1] SNPlocs.Hsapiens.dbSNP.20111119_0.99.6 >>> SNPlocs.Hsapiens.dbSNP.20110815_0.99.6 GGdata_1.0.19 >>> illuminaHumanv1.db_1.12.2 org.Hs.eg.db_2.7.1 >>> [6] RSQLite_0.11.2 DBI_0.2-5 >>> AnnotationDbi_1.18.4 Biobase_2.16.0 >>> GGtools_4.4.0 >>> [11] Rsamtools_1.8.6 Biostrings_2.24.1 >>> GenomicRanges_1.8.13 IRanges_1.14.4 >>> BiocGenerics_0.2.0 >>> [16] GGBase_3.18.0 snpStats_1.6.0 >>> Matrix_1.0-9 lattice_0.20-10 >>> survival_2.36-14 >>> [21] NCBI2R_1.4.3 >>> >>> loaded via a namespace (and not attached): >>> [1] annotate_1.34.1 biomaRt_2.12.0 bit_1.1-8 >>> bitops_1.0-4.1 BSgenome_1.24.0 ff_2.2-7 >>> genefilter_1.38.0 >>> [8] GenomicFeatures_1.8.3 grid_2.15.1 RCurl_1.95-0.1.2 >>> rtracklayer_1.16.3 tools_2.15.1 >>> VariantAnnotation_1.2.11 XML_3.95-0 >>> [15] xtable_1.7-0 >>> >>> -- >>> Sent via the guest posting facility at bioconductor.org. >>> >> >> >

ADD REPLY • link 13.2 years ago Dr J. Peters ▴ 30