Question

Proper normalization of batch effected corrected raw rsem rnaseq values for downstream unsupervised analyses

0

Entering edit mode

svlachavas ▴ 830

@svlachavas-7225

Last seen 6 months ago

Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor community,

based on a project for multi-omics data integration in various cancer types concerning distinct molecular layers, I was trying to process and analyze separately the omics layers before integration; For simplicity, I post a small chunk code of the fetched pancancer tcga coad data, which are raw RSEM values which were also batch effect corrected to account for different sequencing platform and other confounders:

....# initial processing of fetching the data 

head(rna.seq.form)
         TCGA-AA-3489 TCGA-AA-3492 TCGA-AA-3496 TCGA-AA-3502 TCGA-AA-3506 TCGA-AA-3510
UBE2Q2P2       7.9135       6.3524       7.7206      10.6232       6.9761    7.2472064
HMGB1P1      249.5640     286.6830     216.2460     210.5220     203.0250  247.7090443
RNU12-2P       0.3794       0.0000       1.6559       0.4831       0.4495           NA
SSX9P          0.0000       0.0000       0.0000       0.0000       0.0000           NA
EZHIP          0.0000       0.0000       0.0000       0.0000       0.0000           NA
EFCAB8         2.2762       0.9773       1.6559       0.4831       0.8990   -0.3627441
         TCGA-AA-3511 TCGA-AA-3514 TCGA-AA-3517 TCGA-AA-3519 TCGA-AA-3520 TCGA-AA-3521
UBE2Q2P2       4.2429   -0.5162796     29.35766   13.7552756   14.9389784     9.982441
HMGB1P1      267.6850   75.2821045    466.05384  312.1466578  475.1038153   585.216398
RNU12-2P       0.4268           NA           NA           NA           NA           NA
SSX9P          0.0000           NA           NA           NA           NA           NA
EZHIP          0.0000           NA           NA           NA           NA           NA
EFCAB8         1.2805   -0.3627441      5.50310   -0.3627441   -0.3627441     1.015633

range(rna.seq.form,na.rm = T)
[1] -9.601263e-01  8.944090e+05

# small subsetting of keeping common patients profiled with other omic layers

grps.rna = gsub("-[0-1]{2}", "", colnames(rna.seq.form))

colnames(rna.seq.form) <- grps.rna
rna.seq.form <- rna.seq.form %>% dplyr::select(all_of(all_final_samples)) 
rna.seq.mat <- as.matrix(rna.seq.form)

# initial log2 conditional transformation

 if(max(rna.seq.mat,na.rm=TRUE) > 50){
+     ##Do log-transformation
+     if(min(rna.seq.mat,na.rm=TRUE) <= 0){
+         rna.seq.mat <- rna.seq.mat - min(rna.seq.mat,na.rm=TRUE) + 1
+     }
+     rna.seq.mat <- log2(rna.seq.mat+1)
+ }

range(rna.seq.mat,na.rm = T)
[1]  1.00000 19.77058

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] reticulate_1.22             limma_3.49.5                maftools_2.9.03            
 [4] forcats_0.5.1               stringr_1.4.0               dplyr_1.0.7                
 [7] purrr_0.3.4                 readr_2.0.2                 tidyr_1.1.4                
[10] tibble_3.1.5                ggplot2_3.3.5               tidyverse_1.3.1            
[13] data.table_1.14.2           MOFA2_1.3.4                 M3C_1.15.0                 
[16] DESeq2_1.33.5               SummarizedExperiment_1.23.5 Biobase_2.53.0             
[19] MatrixGenerics_1.5.4        matrixStats_0.61.0          GenomicRanges_1.45.0       
[22] GenomeInfoDb_1.29.10        IRanges_2.27.2              S4Vectors_0.31.5           
[25] BiocGenerics_0.39.2

1) My main question is, based on the nature of the "raw" data, which normalization strategy would be appropriate as these are not raw counts, but also are batch corrected? For example, a simple initial log2 transformation or something like the above transformation posted, then followed by normalizeQuantiles function from limma, would suffice for a proper transformation and normalization for downstream processes such multi-omics integration, clustering etc.? In order to ensure variance stabilization and account for library bias and relative effects?

2) As also from above you could see that there is a pattern in a small number of genes with NA values in most of the samples; in addition, I should remove any genes-except also non-expressed genes-that have NA values in the vast majority of the samples?

Thank you in advance,

Efstathios

limma RNASeq rsem Normalization • 676 views

ADD COMMENT • link 2.5 years ago svlachavas ▴ 830