Application of quantro to RNASeq: I do not get the same results as are in the quantro paper supplement
1
0
Entering edit mode
raf4 ▴ 20
@raf4-8249
Last seen 14 months ago
United States

Dear List:

I want to apply quantro to RNASeq data. When I ran the quantro vignette with methylation data I got the same results as the vignette. I want to duplicate and RNASeq values and plots of an RNAseq exmaple in the quantro paper, before I apply quantro to my own data.

I applied quantro to Bottomly et al.'s dataset (PMID:21455293) downloaded from recount (PMID:22087737),

Here is the count file: Count file

Here is the phenodata file: phenodata

I followed the methyl protocol, with 2 extra steps (eliminating all zero genes and running the rlogTransformation) which were mentioned in the paper. I got different quantro statistics and plots than in the appendix of the paper supplement (F quantro: paper: 1.215; me: 1.78239 and P-value: paper=0.245; me=0.177 ),

Here is the density plot:

density plot

Here is the boxplot:

boxplot

Here is the P-Value histogram:

P-Value histogram

Excerpts from my session:

>library(quantro)
>library(DESeq2)
>library(doParallel)
> Counts<-read.table("bottomly_count_table.txt",header=T,sep="\t",row.names=1)
> dim(Counts)
[1] 36536    21
> Counts[1:2,1:2]
                   SRX033480 SRX033488
ENSMUSG00000000001       369       744
ENSMUSG00000000003         0         0

> pd<-read.table("bottomly_phenodata_ed.txt",header=T,sep="\t",row.names=1)
> class(pd)
[1] "data.frame"
> dim(pd)
[1] 21  1
> pd
            strain
SRX033480 C57BL_6J
SRX033488 C57BL_6J
,
.
.
SRX033494   DBA_2J

> # Make factor variable
> strain<-factor(pd$strain)
> strain[1]
[1] C57BL_6J
Levels: C57BL_6J DBA_2J
> class(strain)
[1] "factor"
> > # Make DeSeq2DataSet to input into rlogTransformation
> 
> dds <- DESeqDataSetFromMatrix(countData = Counts.m,
+                               colData = DataFrame(strain),
+                               design = ~ strain)
> > # perform rlogTransformation, to create a DESeqTransform object
> log2transCounts<-rlogTransformation(dds,blind=FALSE)
> class(log2transCounts)
[1] "DESeqTransform"
attr(,"package")
[1] "DESeq2"
> dim(log2transCounts)
[1] 13932    21# extract log2counts matrix from DESeqTransform object
log2transCounts.m<-as.matrix(log2transCounts@assays@data@listData[[1]])
log2transCounts.m[1:2,1:2]
rownames(log2transCounts.m)<-rownames(Counts.m)
log2transCounts.m[1:2,1:2]

## ----plot-distributions-density, fig.height=5, fig.width=6-----------------
matdensity(log2transCounts.m, groupFactor = pd$strain, xlab = " ", ylab = "density", main = "log2transformed counts")
legend('top', c("C57BL_6J", "DBA_2J"), col = c(1, 2), lty = 1, lwd = 3)

> ## ----calculate-quantro----------------------------------------------------
> qtest <- quantro(object = log2transCounts.m, groupFactor = pd$strain)
[quantro] Average medians of the distributions are 
                        equal across groups.
[quantro] Calculating the quantro test statistic.
[quantro] No permutation testing performed. 
                         Use B > 0 for permutation testing.
> qtest
quantro: Test for global differences in distributions
   nGroups:  2 
   nTotSamples:  21 
   nSamplesinGroups:  10 11 
   anovaPval:  0.14366 
   quantroStat:  1.78239 
   quantroPvalPerm:  Use B > 0 for permutation testing. 
> ## ----quantro-summary-------------------------------------------------------
> summary(qtest)
$nGroups
[1] 2

$nTotSamples
[1] 21

$nSamplesinGroups
C57BL_6J   DBA_2J 
      10       11 

> 
> ## ----quantro-medians-------------------------------------------------------
> anova(qtest)
Analysis of Variance Table

Response: objectMedians
            Df    Sum Sq    Mean Sq F value Pr(>F)
groupFactor  1 0.0005763 0.00057629  2.3265 0.1437
Residuals   19 0.0047064 0.00024771               
> 
> ## ----quantro-quantroStat---------------------------------------------------
> quantroStat(qtest)
[1] 1.782389
> 
> 
> ## ----quantro-parallel------------------------------------------------------
> registerDoParallel(cores=1)
> qtestPerm <- quantro( log2transCounts.m, groupFactor = pd$strain, B = 1000)
[quantro] Average medians of the distributions are 
                        equal across groups.
[quantro] Calculating the quantro test statistic.
[quantro] Starting permutation testing.
[quantro] Using a single core (backend: doParallelMC, 
                        version: 1.0.15).
> qtestPerm
quantro: Test for global differences in distributions
   nGroups:  2 
   nTotSamples:  21 
   nSamplesinGroups:  10 11 
   anovaPval:  0.14366 
   quantroStat:  1.78239 
   quantroPvalPerm:  0.177 
> ## ----quantro-plot, fig.height=5, fig.width=6-------------------------------
> quantroPlot(qtestPerm)

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] doParallel_1.0.15           iterators_1.0.12            foreach_1.4.7               DESeq2_1.26.0               SummarizedExperiment_1.16.1
 [6] DelayedArray_0.12.3         BiocParallel_1.20.1         matrixStats_0.55.0          Biobase_2.46.0              GenomicRanges_1.38.0       
[11] GenomeInfoDb_1.22.1         IRanges_2.20.2              S4Vectors_0.24.4            BiocGenerics_0.32.0         quantro_1.20.0             

loaded via a namespace (and not attached):
  [1] colorspace_1.4-1         siggenes_1.60.0          mclust_5.4.6             htmlTable_1.13.2         XVector_0.26.0          
  [6] base64enc_0.1-3          base64_2.0               rstudioapi_0.11          bit64_0.9-7              AnnotationDbi_1.48.0    
 [11] xml2_1.3.2               codetools_0.2-16         splines_3.6.3            scrime_1.3.5             geneplotter_1.64.0      
 [16] knitr_1.25               Formula_1.2-3            Rsamtools_2.2.3          annotate_1.64.0          cluster_2.1.0           
 [21] dbplyr_1.4.3             HDF5Array_1.14.4         readr_1.3.1              compiler_3.6.3           httr_1.4.1              
 [26] backports_1.1.5          assertthat_0.2.1         Matrix_1.2-18            lazyeval_0.2.2           limma_3.42.2            
 [31] htmltools_0.4.0          acepack_1.4.1            prettyunits_1.0.2        tools_3.6.3              gtable_0.3.0            
 [36] glue_1.4.0               GenomeInfoDbData_1.2.2   dplyr_0.8.5              rappdirs_0.3.1           doRNG_1.8.2             
 [41] Rcpp_1.0.2               bumphunter_1.28.0        vctrs_0.2.4              Biostrings_2.54.0        multtest_2.42.0         
 [46] preprocessCore_1.48.0    nlme_3.1-144             rtracklayer_1.46.0       DelayedMatrixStats_1.8.0 xfun_0.10               
 [51] stringr_1.4.0            lifecycle_0.2.0          rngtools_1.5             XML_3.98-1.20            beanplot_1.2            
 [56] zlibbioc_1.32.0          MASS_7.3-51.5            scales_1.0.0             hms_0.5.1                minfi_1.32.0            
 [61] rhdf5_2.30.1             GEOquery_2.54.1          RColorBrewer_1.1-2       curl_4.2                 gridExtra_2.3           
 [66] memoise_1.1.0            ggplot2_3.2.1            rpart_4.1-15             biomaRt_2.42.1           reshape_0.8.8           
 [71] latticeExtra_0.6-28      stringi_1.4.3            RSQLite_2.1.2            genefilter_1.68.0        checkmate_1.9.4         
 [76] GenomicFeatures_1.38.2   rlang_0.4.5              pkgconfig_2.0.3          bitops_1.0-6             nor1mix_1.3-0           
 [81] lattice_0.20-38          purrr_0.3.4              Rhdf5lib_1.8.0           labeling_0.3             htmlwidgets_1.5.1       
 [86] GenomicAlignments_1.22.1 bit_1.1-14               tidyselect_1.0.0         plyr_1.8.4               magrittr_1.5            
 [91] R6_2.4.0                 Hmisc_4.2-0              DBI_1.1.0                pillar_1.4.2             foreign_0.8-75          
 [96] survival_3.1-8           RCurl_1.95-4.12          nnet_7.3-12              tibble_2.1.3             crayon_1.3.4            
[101] BiocFileCache_1.10.2     progress_1.2.2           locfit_1.5-9.1           grid_3.6.3               data.table_1.12.6       
[106] blob_1.2.0               digest_0.6.25            xtable_1.8-4             tidyr_1.0.2              illuminaio_0.28.0       
[111] openssl_1.4.1            munsell_0.5.0            askpass_1.1              quadprog_1.5-7          
> 



I would appreciate any suggestions to resolve this discrepancy.

Thanks and best wishes. Rich Richard A Friedman, PhD Columbia University

quantro RNASeq • 1.5k views
ADD COMMENT
0
Entering edit mode
@stephanie-hicks-7843
Last seen 3.7 years ago
USA/Baltimore/Johns Hopkins

Hi Richard,

Thanks for your interest in quantro. This is the script that I used to apply quantro to the Bottomly dataset:

https://github.com/stephaniehicks/quantroPaper/blob/master/scripts/geneExpression/mouseStrainsRNASeq.Rmd

All the best, Stephanie Hicks

ADD COMMENT
0
Entering edit mode

Stephanie,

I ran your script. The graphs were the same as when I did them originally. This includes the density plot, which looks very different than the one in your paper. The statistics are also different from the paper and what I did before:

F quantro: paper: 1.215; me using my script: 1.78239 me using your script: 2.0402

P-Value: paper: 0.245; me using my script: 0.177 me using your script: 0.109

The p-value discrepancies are perhaps attributable to differences in the random number seed, but the I don't think that the density plot or F quantro should be that different.

I think that something may have changed in R or the other packages you used to get this effect. If you get the time, can you please rerun the script and see what you get?

(I am entering this note under my google account name, because I cannot log in using my raf4@cumc.columbia.edu e-mail address. The login menu keeps disappearing.)

Thanks and best wishes, Rich Richard Friedman, PhD Columbia University Cancer Center

ADD REPLY
1
Entering edit mode

I see that you are using Bioconductor 3.10, but the paper used version 3.1 as reported in the Software section at the end of the article.

If you want to reproduce exact results, it's up to you to start with the same version numbers of software. Stephanie has provided the R script used in the paper but all the package dependencies have been updated since Bioc 3.1.

You can install Bioconductor 3.1 by downloading R 3.2 and following these legacy instructions:

http://bioconductor.org/install/#Legacy

ADD REPLY
0
Entering edit mode

Mike, Thank you for your note. I have been unable to get quantro working in R3.2 after considerable effort. As best I can see the stumbling block is the dependency of DESeq2 on RcppArmadillo, I could not install RcppArmadillo with BiocInstaller. I get

> library(DESeq2)
Error: package ‘RcppArmadillo’ required by ‘DESeq2’ could not be found
In addition: Warning message:
package ‘DESeq2’ was built under R version 3.2.3 
> 

(I omitted the sessionInfo above to keep the post length manageable) When I try to install it from source (using the command line in MAC OS X terminal) I get

Richards-Mac-mini:Applications epiktistes$ R CMD INSTALL -l /Library/Frameworks/R.framework/Resources/library/ RcppArmadillo 
* installing *source* package ‘RcppArmadillo’ ...
** package ‘RcppArmadillo’ successfully unpacked and MD5 sums checked
* checking LAPACK_LIBS: divide-and-conquer complex SVD available via R-supplied LAPACK
Warning in as.POSIXlt.POSIXct(x, tz) :
  unknown timezone 'zone/tz/2020a.1.0/zoneinfo/America/New_York'
** libs
clang++ -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG  -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include"  -I../inst/include -fPIC  -Wall -mtune=core2 -g -O2  -c RcppArmadillo.cpp -o RcppArmadillo.o
clang++ -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG  -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include"  -I../inst/include -fPIC  -Wall -mtune=core2 -g -O2  -c RcppExports.cpp -o RcppExports.o
clang++ -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG  -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include"  -I../inst/include -fPIC  -Wall -mtune=core2 -g -O2  -c fastLm.cpp -o fastLm.o
clang++ -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -o RcppArmadillo.so RcppArmadillo.o RcppExports.o fastLm.o -L/Library/Frameworks/R.framework/Resources/lib -lRlapack -L/Library/Frameworks/R.framework/Resources/lib -lRblas -L/usr/local/lib/gcc/x86_64-apple-darwin13.0.0/4.8.2 -lgfortran -lquadmath -lm -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
ld: warning: directory not found for option '-L/usr/local/lib/gcc/x86_64-apple-darwin13.0.0/4.8.2'
ld: library not found for -lgfortran
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [RcppArmadillo.so] Error 1
ERROR: compilation failed for package ‘RcppArmadillo’
* removing ‘/Library/Frameworks/R.framework/Versions/3.2/Resources/library/RcppArmadillo’
Richards-Mac-mini:Applications epiktistes$ 

I am not clear on how to debug this error.I would appreaciate any suggestions.

That being said, I suspect that the discrepancy between my results and Stephanie's published results may lie in the DESeq2 rlogTransformation command. As applied in my initial post above, I get enter image des cription here

Whereas the density plot in the paper is: enter image description here

I wonder if the difference between my analysis and the published analysis lies in changes in the rlogTransformation function over time. If so, which do I believe?

Again, I regret not being able to reproduce the result in the original version of R and Bioconductor, but if you, or anyone else can consider the possibile explanation I just raised it would be helpful.

Thanks and best wishes,

Rich

Richard Friedman, PhD Columbia University

ADD REPLY
0
Entering edit mode

re: "I could not install RcppArmadillo with BiocInstaller"

What do you get with install.packages("RcppArmadillo")?

rlog didn't change since its introduction really. I'd be curious to see rlog of the same data across versions, so maybe figuring out installation of RcppArmadillo is the way to go.

ADD REPLY
0
Entering edit mode

Mike,

Thank you for your quick reply. Here is what I get:

> install.packages("RcppArmadillo")
Warning message:
package ‘RcppArmadillo’ is not available (for R version 3.2.2) 
> 

Best wishes, Rich

ADD REPLY
0
Entering edit mode

The vignette for DESeq2 1.8 says it is using RcppArmadillo 0.5.600.2.0. Try grabbing that here:

https://cran.r-project.org/src/contrib/Archive/RcppArmadillo/

ADD REPLY
0
Entering edit mode

Mike, I did, and I got the exact same error message. I am going to reinstall the latest versions of R and Bioconductor (hopefully without overwriting 3.2.2 and try what Stephanie suggests I am willing to continue to try to get it working with 3.2.2 if you or Stephanie see any use in it. But in the meantime, I will try to reproduce the calculation that Stephanie just posted,

Thanks and best wishes, Rich

ADD REPLY
0
Entering edit mode

Ok, for posterity, to install a downloaded version of a package in R requires:

install.packages("package_0.1.2.tar.gz", repos=NULL)

As this package did work with R 3.2, I don't expect you would get any error trying to install it.

ADD REPLY
1
Entering edit mode

Hi Rich,

I have not tried installing Bioconductor 3.1, but I did re-run the code in my GitHub repo tonight. I can confirm that I see the plots you have above (which are different than what's in the supplement) and I get the same quantro test statistic as you did above (2.0402) which is again different than what's in the paper (1.215). Unfortunately, it is unclear to me what the discrepancy is, but I agree with Michael that installing Bioconductor 3.1 would be the good thing to try.

However, just looking at the differences in the density plots, I doubt this is only related to the rLogTransformation function because it looks to me like the lowly expressed genes were filtered out (as the first mode is missing). This is a complete guess, but maybe I downloaded the file locally while working with it e.g.

bottomly.local <- load("<local_path>/bottomly_eset.RData")

and somehow saved a version of the dataset that had lowly expressed genes removed (?), but then changed the location back to the URL location to make the code more reproducible when I was finished with the project e.g.

bottomly.local <- load(url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bottomly_eset.RData"))

To test this idea out, I tried to remove some lowly expressed genes and re-ran the code in GitHub:

keepMeID <- sapply(1:nrow(eset), function(x){ any(eset[x,] >= 3) })
esetNoZeros <- eset[keepMeID,]

dds <- DESeqDataSetFromMatrix(countData = esetNoZeros, colData = pd, design = ~ 1)
rld <- rlogTransformation(dds)
rld.mat <- as.matrix(assay(rld))

matdensity(rld.mat)
quantro(rld.mat, groupFactor = pd$strain)

The figure looks closer to what's in the supplement and the test statistic too, but not the same

> quantro(rld.mat, groupFactor = pd$strain)#, B=1000)
[quantro] Average medians of the distributions are
                        equal across groups.
[quantro] Calculating the quantro test statistic.
[quantro] No permutation testing performed.
                         Use B > 0 for permutation testing.
quantro: Test for global differences in distributions
   nGroups:  2
   nTotSamples:  21
   nSamplesinGroups:  10 11
   anovaPval:  0.18929
   quantroStat:  1.54718
   quantroPvalPerm:  Use B > 0 for permutation testing.

Retrospectively, I recognize that I could have definitely made this code more reproducible e.g. providing a knitted HTML with sessionInfo() describing all the details of my R session. I have learned over time to incorporate better habits for reproducibility.

ADD REPLY
0
Entering edit mode

Stephanie,

Thank you for the clarification. I ran the script with the additional filter step in R4.0.2, Bioconductor 3.11 and got everything you got, up until the qtestPerm step. For the qtestPermstep, I got 0.176, close to what I got before. I suspect that any differences are because of the random number seed as you suggested previously. One more question: Do you recommend the kind of filtering that you just did to make the density uinimodal? It did not make a difference in the permutation p-value in this case. I ask so that I may follow the best practice.

Best wishes, Rich

ADD REPLY
1
Entering edit mode

Hi Rich,

If you are asking what to do with RNA-seq data as input to quantro, yes I think it makes sense to filter out the lowly expressed genes because these genes are typically very noisy. Generally speaking, I remove lowly expressed genes for many downstream analyses, but not all. It depends on what you are wanting to do with the data. I hope this helps.

All the best, Stephanie

ADD REPLY
1
Entering edit mode

Stephanie,

It helps. Quantro has come in handy with my dataset in showing differences in the densities across conditions. I see smooth quantile normalization in my future.

Thanks and best wishes,

Rich

ADD REPLY
0
Entering edit mode

Hi Rich,

I'm happy to hear this thread has been helpful, but most importantly the methods and software themselves. Wishing you all the best in your data analysis.

All the best, Stephanie

ADD REPLY

Login before adding your answer.

Traffic: 734 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6