Question

Model matrix not full rank, then

0

Entering edit mode

Sonia • 0

@2f3f6904

Last seen 4.2 years ago

United Kingdom

I have an experiment in which there is a group (based on a set, celltype + time) and a number of individuals.

The comparisons I am interested in are 'CloneT1' vs 'VbOtherT1' 'VbOtherT1' vs 'OtherCD4T1' and 'LowOCI_CCR4T1' vs 'LowOCI_OtherCD4T1'

(and the same at T2).

I wanted to know the effect of the individuals on the group as each individual is in more than one group.

I tried to run the model as DESeq2Table2 <- DESeqDataSetFromHTSeqCount(sampleTable = mysampleTable2, directory = "H:/R/htseq_counts_experiment_9/", design = ~patient + group, ignoreRank = FALSE) but got the 'Model matrix not full rank' error.

I then read the vignette and tried to assign an ind.n to each subject (I cheated and did manually as the group numbers were different in each group)

The data looks like this:

patient group   ind.n
LHD CloneT1 1
S4  CloneT1 2
N7  CloneT1 3
HHL CloneT1 4
3N  CloneT1 5
LID CloneT1 6
S7  CloneT1 7
LFC CloneT1 8
3AK CloneT1 9
LHD CloneT2 1
HHL CloneT2 2
3N  CloneT2 3
S7  CloneT2 4
LFC CloneT2 5
S1  LowOCI_CCR4T1   1
S9  LowOCI_CCR4T1   2
TAL LowOCI_CCR4T1   3
A1  LowOCI_CCR4T1   4
TFH LowOCI_CCR4T1   5
TDT LowOCI_CCR4T1   6
TDX LowOCI_CCR4T1   7
HES LowOCI_CCR4T1   8
TDV LowOCI_CCR4T1   9
3V  LowOCI_CCR4T1   10
S9  LowOCI_CCR4T2   1
A1  LowOCI_CCR4T2   2
TFH LowOCI_CCR4T2   3
TDT LowOCI_CCR4T2   4
S1  LowOCI_OtherCD4T1   1
TAL LowOCI_OtherCD4T1   2
TDX LowOCI_OtherCD4T1   3
HES LowOCI_OtherCD4T1   4
TDV LowOCI_OtherCD4T1   5
3V  LowOCI_OtherCD4T1   6
S9  LowOCI_OtherCD4T2   1
A1  LowOCI_OtherCD4T2   2
TFH LowOCI_OtherCD4T2   3
TDT LowOCI_OtherCD4T2   4
S4  Other_CD4T1 1
N7  Other_CD4T1 2
HHL Other_CD4T1 3
LID Other_CD4T1 4
3AK Other_CD4T1 5
LHD Other_CD4T2 1
HHL Other_CD4T2 2
3N  Other_CD4T2 3
S7  Other_CD4T2 4
LFC Other_CD4T2 5
S4  VbOtherT1   1
N7  VbOtherT1   2
HHL VbOtherT1   3
3N  VbOtherT1   4
LID VbOtherT1   5
S7  VbOtherT1   6
LFC VbOtherT1   7
3AK VbOtherT1   8
LHD VbOtherT1   9
LHD VbOtherT2   1
HHL VbOtherT2   2
3N  VbOtherT2   3
S7  VbOtherT2   4
LFC VbOtherT2   5

I then ran the code as per the Vignette but got the message: The design matrix has the same number of samples and coefficients to fit, so estimation of dispersion is not possible.

It seems to be saying I have no replicates- but I do- the patients are the replicates!

Or is it because there is only 1 ind.n in each condition that it cannot distinguish the effects of the individual from the condition, and therefore it is best to just run it as a simple model with design ~condition?

Or should ind.n match individuals eg all LHDs have the ind.n 1, S4 is ind.n 2 etc?

write.csv(mysampleTable, "updated_metadata_9.csv", row.names = TRUE)

#I have added a column manually for individuals within each group
metadata_9b <- read.csv("H:/R/htseq_counts_experiment_9/updated_metadata_9.csv")
metadata_9b <-as.data.frame(metadata_9b)

mysampleTable2 <- metadata_9b
#note that the individual number needs converting to a factor
mysampleTable2[,'ind.n']<-factor(mysampleTable2[,'ind.n'])

class(mysampleTable2$ind.n)
head(mysampleTable2[,c(4,10:11)])

patient   group ind.n
1     LHD CloneT1     1
2      S4 CloneT1     2
3      N7 CloneT1     3
4     HHL CloneT1     4
5      3N CloneT1     5
6     LID CloneT1     6

#this builds the DESeqtable.
DESeq2Table2 <- DESeqDataSetFromHTSeqCount(sampleTable = mysampleTable2, directory = "H:/R/htseq_counts_experiment_9/", design = ~group, ignoreRank = FALSE)

#this shows the size of the table
dim(DESeq2Table2)

#this line gets rid of anything with 5 reads or less 
keep <- rowSums(counts(DESeq2Table2)) > 5
DESeq2Table2 <- DESeq2Table2[keep,]

#this shows the size of the table after removal of anything with less than 5 reads
dim(DESeq2Table2)

#this makes a new model matrix and removes zeros
mm <- model.matrix(~group + group:ind.n, mysampleTable2)
colnames(mm)
unname(mm)
all.zero <- apply(mm, 2, function(x) all(x==0))
all.zero

idx <- which(all.zero)
mm <- mm[,-idx]
unname(mm)


#this is the DESeq
dds_complex <- DESeq(DESeq2Table2, full = mm)

# include your problematic code here with any corresponding output 

using supplied model matrix
estimating size factors
estimating dispersions
Error in checkForExperimentalReplicates(object, modelMatrix) : 

  The design matrix has the same number of samples and coefficients to fit,
  so estimation of dispersion is not possible. Treating samples
  as replicates was deprecated in v1.20 and no longer supported since v1.22.

# please also include the results of running the following in an R session 

sessionInfo( )

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[10] base     

other attached packages:
 [1] pheatmap_1.0.12             ggvenn_0.1.9                PCAtools_2.4.0             
 [4] dplyr_1.0.7                 ggrepel_0.9.1               ggplot2_3.3.5              
 [7] gplots_3.1.1                RColorBrewer_1.1-2          DESeq2_1.32.0              
[10] SummarizedExperiment_1.22.0 Biobase_2.52.0              MatrixGenerics_1.4.3       
[13] matrixStats_0.61.0          GenomicRanges_1.44.0        GenomeInfoDb_1.28.4        
[16] IRanges_2.26.0              S4Vectors_0.30.0            BiocGenerics_0.38.0        

loaded via a namespace (and not attached):
 [1] bitops_1.0-7              fs_1.5.0                  lubridate_1.7.10         
 [4] bit64_4.0.5               httr_1.4.2                tools_4.1.1              
 [7] backports_1.2.1           irlba_2.3.3               utf8_1.2.2               
[10] R6_2.5.1                  KernSmooth_2.23-20        DBI_1.1.1                
[13] colorspace_2.0-2          withr_2.4.2               tidyselect_1.1.1         
[16] bit_4.0.4                 compiler_4.1.1            cli_3.0.1                
[19] rvest_1.0.1               xml2_1.3.2                DelayedArray_0.18.0      
[22] labeling_0.4.2            caTools_1.18.2            scales_1.1.1             
[25] readr_2.0.1               genefilter_1.74.0         digest_0.6.28            
[28] stringr_1.4.0             XVector_0.32.0            pkgconfig_2.0.3          
[31] sparseMatrixStats_1.4.2   dbplyr_2.1.1              fastmap_1.1.0            
[34] rlang_0.4.11              readxl_1.3.1              rstudioapi_0.13          
[37] RSQLite_2.2.8             DelayedMatrixStats_1.14.3 farver_2.1.0             
[40] generics_0.1.0            jsonlite_1.7.2            BiocParallel_1.26.2      
[43] gtools_3.9.2              RCurl_1.98-1.5            magrittr_2.0.1           
[46] BiocSingular_1.8.1        GenomeInfoDbData_1.2.6    Matrix_1.3-4             
[49] Rcpp_1.0.7                munsell_0.5.0             fansi_0.5.0              
[52] lifecycle_1.0.1           stringi_1.7.5             zlibbioc_1.38.0          
[55] plyr_1.8.6                blob_1.2.2                dqrng_0.3.0              
[58] forcats_0.5.1             crayon_1.4.1              lattice_0.20-45          
[61] beachmat_2.8.1            Biostrings_2.60.2         haven_2.4.3              
[64] cowplot_1.1.1             splines_4.1.1             annotate_1.70.0          
[67] hms_1.1.1                 KEGGREST_1.32.0           locfit_1.5-9.4           
[70] pillar_1.6.3              reshape2_1.4.4            ScaledMatrix_1.0.0       
[73] geneplotter_1.70.0        reprex_2.0.1              XML_3.99-0.8             
[76] glue_1.4.2                BiocManager_1.30.16       modelr_0.1.8             
[79] png_0.1-7                 vctrs_0.3.8               tzdb_0.1.2               
[82] cellranger_1.1.0          gtable_0.3.0              purrr_0.3.4              
[85] tidyr_1.1.3               assertthat_0.2.1          cachem_1.0.6             
[88] rsvd_1.0.5                xtable_1.8-4              broom_0.7.9              
[91] survival_3.2-13           tibble_3.1.4              AnnotationDbi_1.54.1     
[94] memoise_2.0.0             ellipsis_0.3.2

DESeq2 • 2.6k views

ADD COMMENT • link 4.3 years ago Sonia • 0

score 2 · Answer 1 · 2021-10-13

2

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 1 day ago

United States

Unless you have complete replication for the subjects it's not possible to block on subject. In that case you should use edgeR, in particular the voomLmFit function, which can fit a linear mixed model, with arbitrary replication of the subjects.

ADD COMMENT • link 4.3 years ago James W. MacDonald 68k

0

Entering edit mode

I see- so I can only do a model with ~condition in this case?

I think what you are saying is that I cannot examine the effects of individual subjects as there are no replicates for each subject, is that right? ie for each patient and each condition there is only 1 sample?

ADD REPLY • link 4.3 years ago Sonia • 0

2

Entering edit mode

No, that's not what I am saying at all. What I am saying is that having different numbers of replicates for various subjects is problematic if you want to fit the subjects as a fixed effect. Heuristically what a fixed effect on subject does (if you use the default parameterization in R) is to define a baseline subject, and then make comparisons between the different treatments using that subject. The subject-level coefficients for all the other subjects are the difference in mean expression between the given subject and the baseline. As a simple example, let's say you just have two treatments and two subjects. The design matrix would look like this:

  (Intercept) Treat2 Patient2
1           1      0        0
2           1      1        0
3           1      0        1
4           1      1        1

Here the intercept estimates the mean expression for patient 1 at treatment 1. The Treat2 coefficient estimates the difference between treatment 2 and treatment 1 for patient 1, and the Patient2 coefficient estimates the difference between the mean expression of patient2 and the mean expression of patient1. But note that there is a Treat2 coefficient for patient2 (that's row 4)! We use information from patient2 to estimate the treatment difference for patient1, by first subtracting out the mean difference between the two patients. Heuristically you can think about this as 'setting the patient2 expression data to patient1-equivalents'.

This works well if you have data for all the non-baseline subjects at all the treatments, because then the average expression for each patient is orthogonal to the treatment, and you can then just do the subtraction like we do in this fake example. But if we didn't have the treat1 data for patient2, the model is still full rank, but the patient2 coefficient would only be estimated from the treated patient2! That is a problem, because the patient2 coefficient is now confounded with treatment (it's not orthogonal to the treatment any more).

Instead what you can do is use a linear mixed model, which incorporates information about the patients in the model through the within-patient correlations, which is what you do if you use edgeR and voomLmFit. You still block on patient, but in that case the patient blocks are used to estimate the within-patient correlations that are then used in a linear mixed model.