Dear Bioconductor community,
based on a project for multi-omics data integration in various cancer types concerning distinct molecular layers, I was trying to process and analyze separately the omics layers before integration; For simplicity, I post a small chunk code of the fetched pancancer tcga coad data, which are raw RSEM values which were also batch effect corrected to account for different sequencing platform and other confounders:
....# initial processing of fetching the data
head(rna.seq.form)
TCGA-AA-3489 TCGA-AA-3492 TCGA-AA-3496 TCGA-AA-3502 TCGA-AA-3506 TCGA-AA-3510
UBE2Q2P2 7.9135 6.3524 7.7206 10.6232 6.9761 7.2472064
HMGB1P1 249.5640 286.6830 216.2460 210.5220 203.0250 247.7090443
RNU12-2P 0.3794 0.0000 1.6559 0.4831 0.4495 NA
SSX9P 0.0000 0.0000 0.0000 0.0000 0.0000 NA
EZHIP 0.0000 0.0000 0.0000 0.0000 0.0000 NA
EFCAB8 2.2762 0.9773 1.6559 0.4831 0.8990 -0.3627441
TCGA-AA-3511 TCGA-AA-3514 TCGA-AA-3517 TCGA-AA-3519 TCGA-AA-3520 TCGA-AA-3521
UBE2Q2P2 4.2429 -0.5162796 29.35766 13.7552756 14.9389784 9.982441
HMGB1P1 267.6850 75.2821045 466.05384 312.1466578 475.1038153 585.216398
RNU12-2P 0.4268 NA NA NA NA NA
SSX9P 0.0000 NA NA NA NA NA
EZHIP 0.0000 NA NA NA NA NA
EFCAB8 1.2805 -0.3627441 5.50310 -0.3627441 -0.3627441 1.015633
range(rna.seq.form,na.rm = T)
[1] -9.601263e-01 8.944090e+05
# small subsetting of keeping common patients profiled with other omic layers
grps.rna = gsub("-[0-1]{2}", "", colnames(rna.seq.form))
colnames(rna.seq.form) <- grps.rna
rna.seq.form <- rna.seq.form %>% dplyr::select(all_of(all_final_samples))
rna.seq.mat <- as.matrix(rna.seq.form)
# initial log2 conditional transformation
if(max(rna.seq.mat,na.rm=TRUE) > 50){
+ ##Do log-transformation
+ if(min(rna.seq.mat,na.rm=TRUE) <= 0){
+ rna.seq.mat <- rna.seq.mat - min(rna.seq.mat,na.rm=TRUE) + 1
+ }
+ rna.seq.mat <- log2(rna.seq.mat+1)
+ }
range(rna.seq.mat,na.rm = T)
[1] 1.00000 19.77058
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] reticulate_1.22 limma_3.49.5 maftools_2.9.03
[4] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7
[7] purrr_0.3.4 readr_2.0.2 tidyr_1.1.4
[10] tibble_3.1.5 ggplot2_3.3.5 tidyverse_1.3.1
[13] data.table_1.14.2 MOFA2_1.3.4 M3C_1.15.0
[16] DESeq2_1.33.5 SummarizedExperiment_1.23.5 Biobase_2.53.0
[19] MatrixGenerics_1.5.4 matrixStats_0.61.0 GenomicRanges_1.45.0
[22] GenomeInfoDb_1.29.10 IRanges_2.27.2 S4Vectors_0.31.5
[25] BiocGenerics_0.39.2
1) My main question is, based on the nature of the "raw" data, which normalization strategy would be appropriate as these are not raw counts, but also are batch corrected? For example, a simple initial log2 transformation or something like the above transformation posted, then followed by normalizeQuantiles
function from limma, would suffice for a proper transformation and normalization for downstream processes such multi-omics integration, clustering etc.? In order to ensure variance stabilization and account for library bias and relative effects?
2) As also from above you could see that there is a pattern in a small number of genes with NA values in most of the samples; in addition, I should remove any genes-except also non-expressed genes-that have NA values in the vast majority of the samples?
Thank you in advance,
Efstathios