I am using Deseq2 for differential expression analysis of my small RNA-seq data, specifically I want to know what small RNAs are bound by a protein, so I pulled down the protein and sequenced both IP and input. I got the reads assigned to each gene using featurecount and use that as input to Deseq2, and I have two replicates for IP and input.
After differential expression analysis (IP vs input), I got Log2FC from Deseq2, however, the log2FC is not consistent with the RPM I calculated (RPM calculation: featurecount assigned reads/total reads *1000000). Here is an example gene output from Deseq2:
Row.names baseMean log2FoldChange lfcSE pvalue padj input_rep1 input_rep2 IP_rep1 IP_rep2 WBGene00006575 70.2292642324974 **1.00070601036351** 0.353916350885099 0.00242976844942981 0.00746499829990863 **9.65912370567743 10.8332280102832 3.3485982057602 2.62551118855487**
The third column is calculated log2FC, it's higher than 1 so the small RNA are enriched in IP over input. However, the last four columns are the RPMs I calculated, they are input1, input2, IP1, IP2, so this small RNA is not enriched in IP. I am not sure if the RPM or log2FC from Deseq2 give the correct result.
Here is code I used for Deseq2:
# Create a coldata frame and instantiate the DESeqDataSet. coldata <- data.frame(row.names=colnames(featurecountsmatrix), condition) dds <- DESeqDataSetFromMatrix(countData=featurecountsmatrix, colData=coldata, design=~condition) dds <- DESeq(dds) res <- results(dds) # Use lfcShrink to generate more accurate log fold change towards zero by shrinking lfc resultsNames(dds) resLFC <- lfcShrink(dds, coef=2) #print all data res_alldata <- as.data.frame(resLFC) res_wRPM <- merge(as.data.frame(resLFC), as.data.frame(RPMmatrix), by="row.names") sessionInfo( ) R version 4.2.2 (2022-10-31) Platform: aarch64-apple-darwin20 (64-bit) Running under: macOS Ventura 13.4.1
I have used the same code for differential expression on other dataset before and they work well. One possible reason I am thinking is that specifically for this dataset, there are some small RNAs that are much higher expressed then the others and are strongly bound by the protein. So for the IPvsinput, only small amount of small RNAs have extremely high RPM and are enriched in IP, which might leads to the wrong calculation of log2FC.
I appreciate it if anyone could help to check where my problem comes from. Whether my dataset is not suitable for Deseq2, or if I used Deseq2 incorrectly. Any suggestion about the differential expression analysis on this type of dataset (either program or calculating log2FC manually) will be very helpful!
Thank you all!