Question

PCA percent variance DESeq2

0

Entering edit mode

IUH • 0

@6d1ed6fa

Last seen 8 months ago

United States

I saved 'pcaData' as a data frame for future use. I ran the following to get vector 'percentVar', however, it is empty. Do I need the data in a different format to extract percent variance?

> pcaData <- plotPCA(rld, intgroup="Groups", returnData=TRUE)
> percentVar <- round(100 * attr(pcaData, "percentVar"))
> percentVar
numeric(0)

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.11.0 
LAPACK: /usr/lib/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Chicago
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] lubridate_1.9.2             forcats_1.0.0              
 [3] stringr_1.5.0               dplyr_1.1.2                
 [5] purrr_1.0.2                 readr_2.1.4                
 [7] tidyr_1.3.0                 tibble_3.2.1               
 [9] ggplot2_3.4.3               tidyverse_2.0.0            
[11] DESeq2_1.40.2               SummarizedExperiment_1.30.2
[13] Biobase_2.60.0              MatrixGenerics_1.12.3      
[15] matrixStats_1.0.0           GenomicRanges_1.52.0       
[17] GenomeInfoDb_1.36.1         IRanges_2.34.1             
[19] S4Vectors_0.38.1            BiocGenerics_0.46.0        

loaded via a namespace (and not attached):
 [1] gtable_0.3.3            lattice_0.21-8          tzdb_0.4.0             
 [4] vctrs_0.6.3             tools_4.3.1             bitops_1.0-7           
 [7] generics_0.1.3          parallel_4.3.1          fansi_1.0.4            
[10] pkgconfig_2.0.3         Matrix_1.5-4.1          lifecycle_1.0.3        
[13] GenomeInfoDbData_1.2.10 farver_2.1.1            compiler_4.3.1         
[16] munsell_0.5.0           codetools_0.2-19        RCurl_1.98-1.12        
[19] pillar_1.9.0            crayon_1.5.2            BiocParallel_1.34.2    
[22] DelayedArray_0.26.7     abind_1.4-5             tidyselect_1.2.0       
[25] locfit_1.5-9.8          stringi_1.7.12          labeling_0.4.2         
[28] grid_4.3.1              colorspace_2.1-0        cli_3.6.1              
[31] magrittr_2.0.3          S4Arrays_1.0.5          utf8_1.2.3             
[34] withr_2.5.0             scales_1.2.1            bit64_4.0.5            
[37] timechange_0.2.0        XVector_0.40.0          bit_4.0.5              
[40] hms_1.1.3               rlang_1.1.1             Rcpp_1.0.11            
[43] glue_1.6.2              vroom_1.6.3             R6_2.5.1               
[46] zlibbioc_1.46.0

PCA DESeq2 • 1.4k views

ADD COMMENT • link 8 months ago IUH • 0

score 1 · Answer 1 · 2023-08-17

1

Entering edit mode

ATpoint ★ 4.0k

@atpoint-13662

Last seen 1 day ago

Germany

Edit: The answer was that OP took the output, wrote as a CSV, then read back into R, so (expected) attributes were lost.

Strange, should work.

Can you reproduce with this minimal example?

dds <- makeExampleDESeqDataSet()
rld <- rlog(dds)
pca <-plotPCA(rld, "condition", returnData=TRUE)

attr(pca, "percentVar")

Alternatively, the PCA code itself is just four lines https://github.com/thelovelab/DESeq2/blob/devel/R/plots.R#L243-L252 so you could run this manually. Note that percentVar is already the percentage, so your multiplication by 100 is unnecessary.

ADD COMMENT • link 8 months ago ATpoint ★ 4.0k

0

Entering edit mode

This should have been a comment. I have moved it.

ADD REPLY • link 8 months ago Kevin Blighe ★ 3.9k

0

Entering edit mode

Please accept my apologies if I am a little naive. From your explanation, I assume that the 'returnData=TRUE' shall give me a data frame that should have 'PC1' 'PC2' and 'percentVar' as numeric and 'condition' and 'sample names' as characters. However, the 'returnData=TRUE' provided everything except 'percentVar'. While I can go back to perform the analyses using DESeq2, I wonder if I can use some code that can extract 'percentVar' from the 'PC1' and 'PC2' data that I already have in the data frame. After all I am able to plot the PC1 and PC2 data using ggplot2 but without the percent variation.

ADD REPLY • link 8 months ago IUH • 0

1

Entering edit mode

That assumption is wrong. The data frame returned by returnData = TRUE does not contain percentVar as a column. It is there in the object, but it is hidden, and you get it with the attr function.

ADD REPLY • link 8 months ago swbarnes2 ★ 1.3k

0

Entering edit mode

This is good to know that percentVar is there but hidden and the attr() should extract it. In my case the attr() is not extracting percentVar.

ADD REPLY • link 8 months ago IUH • 0

0

Entering edit mode

No need for any apologies. When you use my example code, do you get the percentVar via the attr funcrion? I am just wondering whether this 'glitch' you encounter is reproducible on your machine because on mine it is not.

ADD REPLY • link 8 months ago ATpoint ★ 4.0k

0

Entering edit mode

The example code you provided is returning 'percentVar'. So, what do you think is wrong with my data frame that I saved while I was analyzing my own dataset? I had wrote it with 'write.csv'. I will appreciate your help.

> dds <- makeExampleDESeqDataSet()
> rld <- rlog(dds)
> pca <- plotPCA(rld, "condition", returnData=TRUE)
> attr(pca, "percentVar")
[1] 0.1132248 0.1065856


> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.11.0 
LAPACK: /usr/lib/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Chicago
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] DESeq2_1.40.2               SummarizedExperiment_1.30.2
 [3] Biobase_2.60.0              MatrixGenerics_1.12.3      
 [5] matrixStats_1.0.0           GenomicRanges_1.52.0       
 [7] GenomeInfoDb_1.36.1         IRanges_2.34.1             
 [9] S4Vectors_0.38.1            BiocGenerics_0.46.0        
[11] scales_1.2.1                lubridate_1.9.2            
[13] forcats_1.0.0               stringr_1.5.0              
[15] dplyr_1.1.2                 purrr_1.0.2                
[17] readr_2.1.4                 tidyr_1.3.0                
[19] tibble_3.2.1                tidyverse_2.0.0            
[21] ggplot2_3.4.3              

loaded via a namespace (and not attached):
 [1] gtable_0.3.4            lattice_0.21-8          tzdb_0.4.0             
 [4] vctrs_0.6.3             tools_4.3.1             bitops_1.0-7           
 [7] generics_0.1.3          parallel_4.3.1          fansi_1.0.4            
[10] pkgconfig_2.0.3         Matrix_1.5-4.1          lifecycle_1.0.3        
[13] GenomeInfoDbData_1.2.10 compiler_4.3.1          farver_2.1.1           
[16] textshaping_0.3.6       munsell_0.5.0           codetools_0.2-19       
[19] RCurl_1.98-1.12         pillar_1.9.0            crayon_1.5.2           
[22] BiocParallel_1.34.2     DelayedArray_0.26.7     abind_1.4-5            
[25] locfit_1.5-9.8          tidyselect_1.2.0        stringi_1.7.12         
[28] labeling_0.4.2          grid_4.3.1              colorspace_2.1-0       
[31] cli_3.6.1               magrittr_2.0.3          S4Arrays_1.0.5         
[34] utf8_1.2.3              withr_2.5.0             bit64_4.0.5            
[37] timechange_0.2.0        XVector_0.40.0          bit_4.0.5              
[40] ragg_1.2.5              hms_1.1.3               rlang_1.1.1            
[43] Rcpp_1.0.11             glue_1.6.2              svglite_2.1.1          
[46] vroom_1.6.3             R6_2.5.1                systemfonts_1.0.4      
[49] zlibbioc_1.46.0

ADD REPLY • link 8 months ago IUH • 0

1

Entering edit mode

If the made-on-the-spot data works, and yours does not, then something is wrong with your input data.

ADD REPLY • link 8 months ago swbarnes2 ★ 1.3k

0

Entering edit mode

Yes, I agree that there is something wrong with my input data but I need a little help to find out what is causing this problem. I ran the following code to make-on-the-spot data. Then I wrote a '.csv' file with write_csv(). When I imported the input data back to R, the 'attr' function returned "NULL". I don't know what causes this behavior and I would greatly appreciate any help on a better way to save the data frame and import it back to R without losing the attribute "percentVar".

> dds <- makeExampleDESeqDataSet()
> rld <- rlog(dds)
> pca <- plotPCA(rld, 'condition', returnData=TRUE)
> attr(pca, "percentVar")
[1] 0.1173749 0.1041526

> str(pca)
'data.frame':   12 obs. of  5 variables:
 $ PC1      : num  -1.632 2.656 0.421 -4.766 -0.368 ...
 $ PC2      : num  -1.95 -1.16 -4.8 -3.85 2.52 ...
 $ group    : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2 2 ...
 $ condition: Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2 2 ...
 $ name     : chr  "sample1" "sample2" "sample3" "sample4" ...
 - attr(*, "percentVar")= num [1:2] 0.117 0.104

> write_csv(pca, file="test_pca.csv")
> df <- read_csv("test_pca.csv")

> str(df)
spc_tbl_ [12 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ PC1      : num [1:12] -1.632 2.656 0.421 -4.766 -0.368 ...
 $ PC2      : num [1:12] -1.95 -1.16 -4.8 -3.85 2.52 ...
 $ group    : chr [1:12] "A" "A" "A" "A" ...
 $ condition: chr [1:12] "A" "A" "A" "A" ...
 $ name     : chr [1:12] "sample1" "sample2" "sample3" "sample4" ...
 - attr(*, "spec")=
  .. cols(
  ..   PC1 = col_double(),
  ..   PC2 = col_double(),
  ..   group = col_character(),
  ..   condition = col_character(),
  ..   name = col_character()
  .. )
 - attr(*, "problems")=<externalptr>

> attr(df, "percentVar")
NULL

> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.11.0 
LAPACK: /usr/lib/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Chicago
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] ggrepel_0.9.3               patchwork_1.1.3            
 [3] scales_1.2.1                DESeq2_1.40.2              
 [5] SummarizedExperiment_1.30.2 Biobase_2.60.0             
 [7] MatrixGenerics_1.12.3       matrixStats_1.0.0          
 [9] GenomicRanges_1.52.0        GenomeInfoDb_1.36.2        
[11] IRanges_2.34.1              S4Vectors_0.38.1           
[13] BiocGenerics_0.46.0         lubridate_1.9.2            
[15] forcats_1.0.0               stringr_1.5.0              
[17] dplyr_1.1.2                 purrr_1.0.2                
[19] readr_2.1.4                 tidyr_1.3.0                
[21] tibble_3.2.1                ggplot2_3.4.3              
[23] tidyverse_2.0.0            

loaded via a namespace (and not attached):
 [1] utf8_1.2.3              generics_0.1.3          bitops_1.0-7           
 [4] lattice_0.21-8          stringi_1.7.12          hms_1.1.3              
 [7] magrittr_2.0.3          grid_4.3.1              timechange_0.2.0       
[10] Matrix_1.5-4.1          fansi_1.0.4             codetools_0.2-19       
[13] abind_1.4-5             cli_3.6.1               rlang_1.1.1            
[16] crayon_1.5.2            XVector_0.40.0          bit64_4.0.5            
[19] munsell_0.5.0           DelayedArray_0.26.7     withr_2.5.0            
[22] S4Arrays_1.0.5          parallel_4.3.1          tools_4.3.1            
[25] BiocParallel_1.34.2     tzdb_0.4.0              colorspace_2.1-0       
[28] locfit_1.5-9.8          GenomeInfoDbData_1.2.10 vctrs_0.6.3            
[31] R6_2.5.1                lifecycle_1.0.3         zlibbioc_1.46.0        
[34] bit_4.0.5               vroom_1.6.3             pkgconfig_2.0.3        
[37] pillar_1.9.0            gtable_0.3.4            Rcpp_1.0.11            
[40] glue_1.6.2              tidyselect_1.2.0        compiler_4.3.1         
[43] RCurl_1.98-1.12

ADD REPLY • link 8 months ago IUH • 0

2

Entering edit mode

If you write the data out and then read it back in, the attributes of the original object are lost.

ADD REPLY • link 8 months ago James W. MacDonald 65k

0

Entering edit mode

For example

> df <- data.frame(A = letters, B = LETTERS)
> attr(df, "lolwhut") <- 1:5
> str(df)
'data.frame':   26 obs. of  2 variables:
 $ A: chr  "a" "b" "c" "d" ...
 $ B: chr  "A" "B" "C" "D" ...
 - attr(*, "lolwhut")= int [1:5] 1 2 3 4 5
> write.csv(df, "tmp.csv", row.names = FALSE)
> str(read.csv("tmp.csv"))
'data.frame':   26 obs. of  2 variables:
 $ A: chr  "a" "b" "c" "d" ...
 $ B: chr  "A" "B" "C" "D" ...

ADD REPLY • link 8 months ago James W. MacDonald 65k

2

Entering edit mode

Which brings to the 'original' ethos of R, which is that the code is real and the output is not. In other words, there is little benefit to saving things as CSV files or even .RDS or .RData files unless you have long running code that takes forever, in which case you should cache that as part of your .Rmd file (you are using .Rmd files, no?), like this.

if(!file.exists("bigstuffomg.Rdata")) {
    <do all the stuff>
    save(<all the stuff>, file =  "bigstuffomg.Rdata")
} else {
    load("bigstuffomg.Rdata")
}

And then you can just render your .Rmd file as needed, and you will get all the stuff you need without having to drop all this excess cruft in your working directory.

ADD REPLY • link 8 months ago James W. MacDonald 65k

1

Entering edit mode

If you do caching then I encourage approaches that actually invalidate the cache when something changes upstream, like xfun::cache_rds or at least put an option to rerun cached chunks if needed. With above code you could be using legacy objects even if upstream changes completely. +1 for Rmd

ADD REPLY • link 8 months ago ATpoint ★ 4.0k

0

Entering edit mode

This is great stuff and thank you for sharing your knowledge.

ADD REPLY • link 8 months ago IUH • 0

0

Entering edit mode

This clears everything. Thank you for your time and help.

ADD REPLY • link 8 months ago IUH • 0