Proper normalization of batch effected corrected raw rsem rnaseq values for downstream unsupervised analyses
0
0
Entering edit mode
svlachavas ▴ 840
@svlachavas-7225
Last seen 7 days ago
Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor community,

based on a project for multi-omics data integration in various cancer types concerning distinct molecular layers, I was trying to process and analyze separately the omics layers before integration; For simplicity, I post a small chunk code of the fetched pancancer tcga coad data, which are raw RSEM values which were also batch effect corrected to account for different sequencing platform and other confounders:

....# initial processing of fetching the data 

head(rna.seq.form)
         TCGA-AA-3489 TCGA-AA-3492 TCGA-AA-3496 TCGA-AA-3502 TCGA-AA-3506 TCGA-AA-3510
UBE2Q2P2       7.9135       6.3524       7.7206      10.6232       6.9761    7.2472064
HMGB1P1      249.5640     286.6830     216.2460     210.5220     203.0250  247.7090443
RNU12-2P       0.3794       0.0000       1.6559       0.4831       0.4495           NA
SSX9P          0.0000       0.0000       0.0000       0.0000       0.0000           NA
EZHIP          0.0000       0.0000       0.0000       0.0000       0.0000           NA
EFCAB8         2.2762       0.9773       1.6559       0.4831       0.8990   -0.3627441
         TCGA-AA-3511 TCGA-AA-3514 TCGA-AA-3517 TCGA-AA-3519 TCGA-AA-3520 TCGA-AA-3521
UBE2Q2P2       4.2429   -0.5162796     29.35766   13.7552756   14.9389784     9.982441
HMGB1P1      267.6850   75.2821045    466.05384  312.1466578  475.1038153   585.216398
RNU12-2P       0.4268           NA           NA           NA           NA           NA
SSX9P          0.0000           NA           NA           NA           NA           NA
EZHIP          0.0000           NA           NA           NA           NA           NA
EFCAB8         1.2805   -0.3627441      5.50310   -0.3627441   -0.3627441     1.015633

range(rna.seq.form,na.rm = T)
[1] -9.601263e-01  8.944090e+05

# small subsetting of keeping common patients profiled with other omic layers

grps.rna = gsub("-[0-1]{2}", "", colnames(rna.seq.form))

colnames(rna.seq.form) <- grps.rna
rna.seq.form <- rna.seq.form %>% dplyr::select(all_of(all_final_samples)) 
rna.seq.mat <- as.matrix(rna.seq.form)

# initial log2 conditional transformation

 if(max(rna.seq.mat,na.rm=TRUE) > 50){
+     ##Do log-transformation
+     if(min(rna.seq.mat,na.rm=TRUE) <= 0){
+         rna.seq.mat <- rna.seq.mat - min(rna.seq.mat,na.rm=TRUE) + 1
+     }
+     rna.seq.mat <- log2(rna.seq.mat+1)
+ }

range(rna.seq.mat,na.rm = T)
[1]  1.00000 19.77058

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] reticulate_1.22             limma_3.49.5                maftools_2.9.03            
 [4] forcats_0.5.1               stringr_1.4.0               dplyr_1.0.7                
 [7] purrr_0.3.4                 readr_2.0.2                 tidyr_1.1.4                
[10] tibble_3.1.5                ggplot2_3.3.5               tidyverse_1.3.1            
[13] data.table_1.14.2           MOFA2_1.3.4                 M3C_1.15.0                 
[16] DESeq2_1.33.5               SummarizedExperiment_1.23.5 Biobase_2.53.0             
[19] MatrixGenerics_1.5.4        matrixStats_0.61.0          GenomicRanges_1.45.0       
[22] GenomeInfoDb_1.29.10        IRanges_2.27.2              S4Vectors_0.31.5           
[25] BiocGenerics_0.39.2

1) My main question is, based on the nature of the "raw" data, which normalization strategy would be appropriate as these are not raw counts, but also are batch corrected? For example, a simple initial log2 transformation or something like the above transformation posted, then followed by normalizeQuantiles function from limma, would suffice for a proper transformation and normalization for downstream processes such multi-omics integration, clustering etc.? In order to ensure variance stabilization and account for library bias and relative effects?

2) As also from above you could see that there is a pattern in a small number of genes with NA values in most of the samples; in addition, I should remove any genes-except also non-expressed genes-that have NA values in the vast majority of the samples?

Thank you in advance,

Efstathios

limma RNASeq rsem Normalization • 872 views
ADD COMMENT

Login before adding your answer.

Traffic: 683 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6