Hello Bioconductor and DESeq community. I hope you can help me with this. First of all, I am very new to RNAseq analysis, but I have read a lot and I am very warned that data should not be normalized prior DESeq2 analysis. However, I am having some concerns with my data.
From the beggning. I am analyzing dual RNA-Seq data to study the interaction between plant and pathogen (for which there is a reference genome) during infection. Moreover, I have many samples, since I have to analyze and compare local and distal responses, different plant genotypes (tolerant versus susceptible), additionally to differential responses along time. And I also have a control plant for each treated one. However, I have just three control samples of the pathogen, grown independently.
The workflow I have followed includes mapping against the reference genome to extract the pathogen sequences, de novo assembly of plant sequences and mapping samples against this de novo transcriptome, and finally counting with htseq for mapped sequences to reference genome and kallisto for those mapped to the de novo transcriptome.
The analysis of plant responses with DESeq2 have been done well, and I have several data sets of differentially expressed sequences to analyze and compare. However, I am not sure about the analysis with DESeq2 of the pathogen sequences. The sequencing protocol included 50 M reads for each sample (pair-end), and differences between the pathogen' mapped sequences of inoculated samples (about 0.5-2%, very low) and control samples (over 90%, as expected) were high.
This is the coldata I have created:
coldata.mdv1.l.6 <- c("Z.BU1_1","Z.BU1_2","Z.BU1_3","MDV1.50","MDV1.13","MDV1.15","MDV1.23"," control "," control ","control", "infective_6h"," infective_6h "," infective_6h "," infective_6h ")
coldata.mdv1.l.6<-matrix(coldata.mdv1.l.6,nrow=7,ncol=2,byrow = FALSE)
rownames(coldata.mdv1.l.6)<-c("Z.BU1_1","Z.BU1_2","Z.BU1_3","MDV1.50_O_L_6","MDV1.13_O_L_6","MDV1.15_O_L_6","MDV1.23_O_L_6")
colnames(coldata.mdv1.l.6)<-c("sample","condition")
print(coldata.mdv1.l.6)
## sample condition
## Z.BU1_1 "Z.BU1_1" "control"
## Z.BU1_2 "Z.BU1_2" "control"
## Z.BU1_3 "Z.BU1_3" "control"
## MDV1.50_O_L_6 "MDV1.50" "infective_6h"
## MDV1.13_O_L_6 "MDV1.13" "infective_6h"
## MDV1.15_O_L_6 "MDV1.15" "infective_6h"
## MDV1.23_O_L_6 "MDV1.23" "infective_6h"
The point is that the differential analysis yielded a biased result, since many of the genes were differentially expressed (over 80% in some cases), and most of them were repressed, which has not biological sense but has sense if the differences on millions of aligned reads in both conditions are considered (control >>>> condition).
I let you here an example result from one of the sample:
summary(res05.mdv1.l.6)
##
## out of 8605 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 639, 7.4%
## LFC < 0 (down) : 2099, 24%
## outliers [1] : 106, 1.2%
## low counts [2] : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
sum(res05.mdv1.l.6$padj < 0.05, na.rm=TRUE)
## [1] 2738
And this is how a MA plot looks like:
plotMA(res.mdv1.l.6,ylim=c(-5,5))
Is there any normalization step that should I have done? Do you think that DESeq2 is the right tool for this analysis? Or should I not include this pathogen control in the analysis due to these high differences in the number of reads? Thank you all!
Hello Michael. Thank you very much for your response.
I didn't try any other normalization method beyond the RLE method. The counts were obtained from htseq_count, and then I built the count matrix. I didn't use the DESeqDataSetFromHTSeq() function, but I assumed that the DESeqDataSetFromMatrix() was right once the datamatrix is created...
As far as I know, FPKM, RPKM or TPM values are not suitable for DESeq2, since they aren't integer. And the other normalization method I have been thinking is TMM, from edgeR, but I have been using DESeq2 for now.
I’m confused because it sounded like you had tried something other than size factor normalization and it had worked better.
Can you give the range of the number of mapped reads for each group?
I'm sorry if I haven't been clear enough. Maybe, the confussing part is that this is dual RNA-seq, so I have followed two workflows, one for the pathogen and the other for the plant sequences, which worked better.
After mapping with HISAT2 to the pathogen reference genome, I obtained the aligned sequences (considered as pathogen sequences), and the not aligned sequences (considered as plant sequences, plus possible contaminations). As well, I mapped the pathogen control samples, which came from in vitro cultures.
As I mentioned above, sequencing was 2 x 100 bp at 50 M depth, for all samples (including pathogen controls). The alignment rates ranged 0.4%-1.61% for infective samples, while controls ranged 97-98%. So, differences are huge (aprox. 50 M reads for control samples and 500 K reads for experimental conditions).
I missed that. I don’t know of methods that can deal with such extreme differences in sequencing depth confounded with the condition. We talk about this as a pathological case in the DESeq2 paper. You could downsample the larger group perhaps.
Thank you so much, Michael. I have been reading about downsampling just today. I think that this could work.
Best regards!