I'm a graduate student working on analyzing RNA-Seq data from a microbiome - many different species all found in the same environment. I used some annotation software and Python to get a list of counts for hits to each organism, within each sample. Imported these counts into R, created a combined data table, and then used DESeq2 for comparing the 12 wild-type (WT) samples with my 12 samples from diseased individuals.
When comparing between DESeq's normalized counts, AKA counts(dds, normalized=TRUE), versus the raw counts per million transcripts (CPM) for these organisms, I noticed something interesting. For some of the higher-abundance organisms, the CPM is higher in my WT samples, if only slightly. However, in DESeq2's normalized data, counts are higher in my diseased samples.
In other words, for some of the organism counts, they are being shifted down in my WT samples and up in my experimental (diseased) samples, to the point where the final calculated log2FoldChange shows these organisms as more active in my experimental samples - even though the CPM is higher in my WT samples.
I know that DESeq fits to a negative binomial distribution, but is that strong enough to lead to a reversed direction of the log2FoldChange for these entries?
Code I'm running:
> completeCondition <- data.frame(condition=factor(c(rep("control", length(control_files)), rep("experimental", length(exp_files)))))
> colnames(complete_table) <- completeCondition
> dds <- DESeqDataSetFromMatrix(complete_table, completeCondition, ~ condition)
> dds <- DESeq(dds)
> normalized_data <- counts(dds, normalized=TRUE)
Any help is greatly appreciated!