Hi all,
Hi, I am using DESeq2 to get differential expression between two genotypes A (sensitive) and B (resistant) and two conditions (virus-infected ; versus control). I’m using the formula given by Mike in DESeq2 vignette "Example 2: two conditions, two genotypes, with an interaction term” which fit adequately to my design. However, I have the same problem as others i.e. that my tops DE genes are genes having few reads in one condition and 0 in the other one for one genotype (see below genes G93001 or G13001 ).
I tried several options (see below ) but was not able to discard these genes from my top DE. Any help will be greatly appreciate.
Thanks in advance,
Carine
Options tried without success
#betaPrior= FALSE with or without prefiltrering
dds <- DESeq(dds, betaPrior = FALSE)
#minimal prefiltrering and independent filtering = TRUE
dds1 <-dds[rowSums(counts(dds)>1) >=6]
finB57<-results(dds, contrast=c("condition","VI","CTRL"), independentFiltering = TRUE, pAdjustMethod = "BH")
#filtering on log2FC
Below is my code and my tops DE genes
######### DESeq2 analysis ###### antifin<-read.csv(file="Antivicfin.csv", sep = ";",header=TRUE,row.names=1 ) dim(antifin) [1] 46585 12 library(DESeq2) coldatafile<-read.table(file="colDataFin.txt",sep="\t",header=TRUE, row.names=1) colnames(antifin) <- NULL dds<-DESeqDataSetFromMatrix(countData = antifin, + colData = coldatafile, + design = ~ genotype + condition + genotype:condition) dds$condition<- relevel(dds$condition, ref ="CTRL") dds$genotype<- relevel(dds$genotype, ref ="B") dds$genotype [1] B B B B B B A A A A A A Levels: B A dds1 <-dds[rowSums(counts(dds)>1) >=6] ## minimal pre-filtering dim(dds1) [1] 39148 12 ########################################################### #Differential analysis ########################################################## dds3 <- DESeq(dds1) estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing resultsNames(dds3) [1] "Intercept" genotype_A_vs_B" "condition_VI_vs_CTRL" "genotypeA.conditionVI" #################################################### # EFFET DU VIRUS SUR B ##################################################### # the condition effect for genotype B (the main effect) finB<-results(dds3, contrast=c("condition","VI","CTRL"), independentFiltering = TRUE, pAdjustMethod = "BH") finB<-finB[order(finB$padj),] #tri par pvalue finB_tri<-subset(finB,padj<0.01) # genes avec padj <0.01 dim(finB_tri) [1] 1434 6 FC_finB<-subset(finB_tri, abs(log2FoldChange)>2) comptb<-as.data.frame(counts(dds3)) colnames(comptb)<-paste("brut.",rownames(coldatafile),sep="") comptn<-as.data.frame(counts(dds3,norm=TRUE)) colnames(comptn)<-paste("norm.",rownames(coldatafile),sep="") compt<-merge(comptb,comptn,by.x=0,by.y=0) rownames(compt)<-compt$Row.names compt<-compt[,-1]# .zz <-merge(compt,as.data.frame(FC_finB),by.x=0,by.y=0) .zz Row.names brut.BC1 brut.BC2 brut.BC3 brut.BVa brut.BVb brut.BVc brut.ACa brut.A22Cb brut.A22Cc brut.A22Va 1 G93001 173 0 0 0 0 0 579 692 466 2279 2 G03001 246 91 317 116 52 16 376 363 118 3947 3 G45001 61 49 71 23 30 22 40 19 41 104 4 G49001 14 6 2 1 1 0 41 35 29 201 5 G53001 10 1 1 0 0 0 45 63 76 146 6 G51001 278 0 2 0 0 0 1000 1021 1144 2845 7 G57001 52 4 12 3 6 3 110 187 206 280 8 G10001 50 29 89 31 29 14 32 18 9 121 9 G13001 175 0 8 0 0 0 882 800 879 2838 10 G63001 53 1 0 0 0 0 221 195 195 485 11 G65001 74 59 65 45 14 10 23 17 28 198 12 G07001 155 0 0 0 0 0 549 588 475 1950 13 G66001 244 3 0 1 3 3 1203 1005 836 2466 14 G48001 18 2 1 0 3 1 58 62 58 241 15 G54001 209 56 138 69 55 34 359 627 145 2391 16 G37001 2563 1706 3620 1461 1460 502 632 385 413 2398 17 G53001 1656 1162 1973 959 765 300 689 640 342 3152 18 G17001 67 29 66 24 12 8 15 17 29 95 brut.AVb brut.AVc norm.BC1 norm.BC2 norm.BC3 norm.BVa norm.BVb norm.BVc norm.A22Ca norm.A22Cb norm.A22Cc 1 2119 1570 179.79746 0.000000 0.000000 0.0000000 0.0000000 0.0000000 999.50900 1344.73069 802.46711 2 3329 2428 255.66575 150.974214 363.383752 84.0976294 26.9956007 13.1194766 649.07666 705.40064 203.19983 3 141 155 63.39679 81.293807 81.388790 16.6745300 15.5743850 18.0392803 69.05071 36.92180 70.60333 4 147 162 14.55008 9.954344 2.292642 0.7249796 0.5191462 0.0000000 70.77698 68.01384 49.93894 5 200 136 10.39292 1.659057 1.146321 0.0000000 0.0000000 0.0000000 77.68205 122.42490 130.87446 6 3375 2650 288.92309 0.000000 2.292642 0.0000000 0.0000000 0.0000000 1726.26770 1984.06074 1970.00509 7 138 251 54.04317 6.636229 13.755852 2.1749387 3.1148770 2.4599019 189.88945 363.38821 354.73868 8 91 75 51.96458 48.112662 102.022568 22.4743665 15.0552389 11.4795420 55.24057 34.97854 15.49829 9 4168 2412 181.87604 0.000000 9.170568 0.0000000 0.0000000 0.0000000 1522.56811 1554.60195 1513.66650 10 683 425 55.08246 1.659057 0.000000 0.0000000 0.0000000 0.0000000 381.50516 378.93423 335.79632 11 151 192 76.90758 97.884380 74.510864 32.6240804 7.2680463 8.1996729 39.70416 33.03529 48.21691 12 2631 1652 161.09021 0.000000 0.000000 0.0000000 0.0000000 0.0000000 947.72097 1142.63244 817.96540 13 3321 3283 253.58717 4.977172 0.000000 0.7249796 1.5574385 2.4599019 2076.70004 1952.96870 1439.61911 14 224 200 18.70725 3.318115 1.146321 0.0000000 1.5574385 0.8199673 100.12353 120.48165 99.87788 15 3040 1146 217.21196 92.907208 158.192296 50.0235899 28.5530392 27.8788878 619.73010 1218.41928 249.69470 16 1478 1935 2663.70458 2830.351744 4149.681960 1059.1951431 757.9534044 411.6235780 1091.00119 748.15219 711.19939 17 2493 2043 1721.06702 1927.824576 2261.691300 695.2554019 397.1468181 245.9901861 1189.39845 1243.68156 588.93509 18 103 49 69.63254 48.112662 75.657185 17.3995095 6.2297540 6.5597383 25.89402 33.03529 49.93894 norm.AVa norm.AVb norm.AVc baseMean log2FoldChange lfcSE stat pvalue padj 1 1338.91643 1195.31581 1255.19868 592.99460 -22.014388 2.4510531 -8.981604 2.668480e-19 1.301818e-15 2 2318.86931 1877.86990 1941.16077 715.81779 -2.633620 0.6685746 -3.939157 8.176858e-05 2.732247e-03 3 61.10018 79.53730 123.92089 59.79182 -2.170184 0.3402018 -6.379108 1.781223e-10 8.989342e-08 4 118.08785 82.92186 129.51732 45.60816 -4.348125 1.1083976 -3.922893 8.749194e-05 2.874272e-03 5 85.77525 112.81886 108.73059 54.29203 -5.279819 1.2830666 -4.115000 3.871797e-05 1.559613e-03 6 1671.44241 1903.81824 2118.64746 972.12145 -9.610427 2.1835291 -4.401327 1.075906e-05 5.942047e-04 7 164.50048 77.84501 200.67189 119.43489 -3.273525 0.8071359 -4.055730 4.997806e-05 1.878019e-03 8 71.08771 51.33258 59.96172 44.93403 -2.049240 0.4168799 -4.915661 8.848356e-07 8.633340e-05 9 1667.32990 2351.14502 1928.36893 894.06059 -9.003871 1.9513453 -4.614187 3.946379e-06 2.790204e-04 10 284.93834 385.27640 339.78308 180.24792 -7.259619 1.9058171 -3.809190 1.394229e-04 4.101557e-03 11 116.32534 85.17824 153.50200 64.44638 -2.384568 0.4386113 -5.436632 5.429705e-08 1.025374e-05 12 1145.62836 1484.13209 1320.75683 584.99386 -21.869489 2.4257179 -9.015677 1.956579e-19 1.145421e-15 13 1448.77925 1873.35715 2624.72438 973.28794 -5.779478 1.5647116 -3.693638 2.210688e-04 5.681217e-03 14 141.58792 126.35712 159.89792 64.48959 -3.253415 0.8474613 -3.839014 1.235296e-04 3.718087e-03 15 1404.71662 1714.84665 916.21509 558.19912 -2.141628 0.5767180 -3.713475 2.044328e-04 5.386095e-03 16 1408.82914 833.73137 1547.01239 1517.70301 -2.113006 0.3112430 -6.788928 1.129698e-11 1.066690e-08 17 1851.80544 1406.28707 1633.35727 1263.53668 -2.142501 0.3183755 -6.729478 1.702734e-11 1.311598e-08 18 55.81266 58.10171 39.17499 40.46242 -2.709772 0.4019223 -6.742029 1.561898e-11 1.269953e-08 |
> sessionInfo()
R version 3.3.1 RC (2016-06-17 r70798)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=French_France.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] ade4_1.7-4 DESeq2_1.12.3 SummarizedExperiment_1.2.3 Biobase_2.32.0
[5] GenomicRanges_1.24.2 GenomeInfoDb_1.8.2 IRanges_2.6.1 S4Vectors_0.10.2
[9] BiocGenerics_0.18.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.5 RColorBrewer_1.1-2 plyr_1.8.4 XVector_0.12.0 tools_3.3.1 zlibbioc_1.18.0
[7] rpart_4.1-10 RSQLite_1.0.0 annotate_1.50.0 gtable_0.2.0 lattice_0.20-33 Matrix_1.2-6
[13] DBI_0.4-1 gridExtra_2.2.1 genefilter_1.54.2 cluster_2.0.4 locfit_1.5-9.1 grid_3.3.1
[19] nnet_7.3-12 data.table_1.9.6 AnnotationDbi_1.34.4 XML_3.98-1.4 survival_2.39-5 BiocParallel_1.6.2
[25] foreign_0.8-66 latticeExtra_0.6-28 Formula_1.2-1 geneplotter_1.50.0 ggplot2_2.1.0 Hmisc_3.17-4
[31] scales_0.4.0 splines_3.3.1 rsconnect_0.4.3 colorspace_1.2-6 xtable_1.8-2 acepack_1.3-3.3
[37] munsell_0.4.3 chron_2.3-47
ok thanks for your answer. So basically, from your experience, do you will take them into account for pathway analysis or should I have to discard them ?
By the way, have you done any QC on these samples? For example, what does a PCA plot look like? (see vignette for examples)
Yes I did. Here the results for me it's fine ....: I use ade4 package
dds2<-as.data.frame((counts(dds1,norm=FALSE)))
lmat<-log(1+dds2)
library(ade4)
lmatc<-bicenter.wt(lmat)
lmatc.pca<-dudi.pca(lmatc,center=F,scale=F,nf=3,scannf=F)
s.class(lmatc.pca$co,fac=dds1$genotype:dds1$condition, col=c(1,2,3,4, addaxes=FALSE, cgrid=5))
####The eigenvalues inform us on the inertia kept by each axis.
(kip<-100*lmatc.pca$eig/sum(lmatc.pca$eig))
## [1] 52.864812 9.090166 7.021512 6.097858 4.647150 4.463586 4.038370
## [8] 3.596396 3.501872 2.399417 2.278860
####The first axis extract 52.86% of the total inertia. The second axis account for 9.09 % of the variance.
I also done PCA with DESEQ2 and found nearly 92% for PC1 which is very strange for me...my first axis correspond to genotype and the second one to condition.
In fact, I've done all the exploraty analysis and think it was ok but maybe it's not ...