Question

Deseq2 calculated log2foldchange not consistent with RPM in IP-small RNA-seq

0

Entering edit mode

Shihui • 0

@df3a4c47

Last seen 6 months ago

United States

Hello,

I am using Deseq2 for differential expression analysis of my small RNA-seq data, specifically I want to know what small RNAs are bound by a protein, so I pulled down the protein and sequenced both IP and input. I got the reads assigned to each gene using featurecount and use that as input to Deseq2, and I have two replicates for IP and input.

After differential expression analysis (IP vs input), I got Log2FC from Deseq2, however, the log2FC is not consistent with the RPM I calculated (RPM calculation: featurecount assigned reads/total reads *1000000). Here is an example gene output from Deseq2:

Row.names   baseMean    log2FoldChange  lfcSE   pvalue  padj    input_rep1  input_rep2  IP_rep1 IP_rep2
WBGene00006575  70.2292642324974    **1.00070601036351**    0.353916350885099   0.00242976844942981 0.00746499829990863 **9.65912370567743  10.8332280102832    3.3485982057602 2.62551118855487**

The third column is calculated log2FC, it's higher than 1 so the small RNA are enriched in IP over input. However, the last four columns are the RPMs I calculated, they are input1, input2, IP1, IP2, so this small RNA is not enriched in IP. I am not sure if the RPM or log2FC from Deseq2 give the correct result.

Here is code I used for Deseq2:

# Create a coldata frame and instantiate the DESeqDataSet.
coldata <- data.frame(row.names=colnames(featurecountsmatrix), condition)
dds <- DESeqDataSetFromMatrix(countData=featurecountsmatrix, colData=coldata, design=~condition)
dds <- DESeq(dds)
res <- results(dds)

# Use lfcShrink to generate more accurate log fold change towards zero by shrinking lfc
resultsNames(dds)
resLFC <- lfcShrink(dds, coef=2)

#print all data
res_alldata <- as.data.frame(resLFC)
res_wRPM <- merge(as.data.frame(resLFC), as.data.frame(RPMmatrix), by="row.names")

sessionInfo( )
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4.1

I have used the same code for differential expression on other dataset before and they work well. One possible reason I am thinking is that specifically for this dataset, there are some small RNAs that are much higher expressed then the others and are strongly bound by the protein. So for the IPvsinput, only small amount of small RNAs have extremely high RPM and are enriched in IP, which might leads to the wrong calculation of log2FC.

I appreciate it if anyone could help to check where my problem comes from. Whether my dataset is not suitable for Deseq2, or if I used Deseq2 incorrectly. Any suggestion about the differential expression analysis on this type of dataset (either program or calculating log2FC manually) will be very helpful!

Thank you all!

RNASeqData DifferentialExpression SmallRNAData DESeq2 • 663 views

ADD COMMENT • link 7 months ago Shihui • 0

score 1 · Answer 1 · 2023-09-27

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 13 hours ago

United States

Did you manually calculate RPMs? Try fpm(dds) for comparison, as this also uses DESeq2 median ratio scaling.

ADD COMMENT • link 7 months ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Yes I manually calculated the RPMs by dividing each genes RPM to the library read depth. I didn't do any other normalization other than that.

I tried the fpm(dds) you mentioned. It's true that now the reads is consistent with what the log2FC shows.

But I am still confused if the RPM manually calculated or the FPM DEseq2 calculated are correct. Would you mind point out where I did wrong about! the RPM calculation?

Below are the two plots made using RPM or FPM, I am wondering if the blue group genes are enriched in my IP or not. Thank you so much!

enter image description here

ADD REPLY • link 7 months ago Shihui • 0

1

Entering edit mode

I can rephrase your question as:

Why is number of reads not the superior estimate for the scaling factor (the "m" in rpm).

There are some papers from early RNA-seq methods development that address this, e.g.:

Bullard, J.H., Purdom, E., Hansen, K.D. et al. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11, 94 (2010). https://doi.org/10.1186/1471-2105-11-94

ADD REPLY • link 7 months ago Michael Love 41k

0

Entering edit mode

It makes sense that DeSeq2 normalized FPMs are more accurate. Thank you for the information!

ADD REPLY • link 7 months ago Shihui • 0