Can NanoString data be analyzed using DESeq2?
1
3
Entering edit mode
lim6432 ▴ 50
@lim6432-21478
Last seen 3.0 years ago

Hello, I am trying to analyze differential expressed genes using NanoString data. There are a number of tools that deal with RNA-seq data, but Nanostring data is not well documented. Since Nanostring data is also a gene count data, I think it is possible to analyze using RNA-seq tool.

Is it possible to analyze differential expressed genes with the DESeq2 package using NanoString data?

Many posts say that if you perform data normalization on your own, you can use DESeq to analyze differential gene analysis. Is that clear?

If you have good tools for dealing NanoString data, please let me know.

Thanks.

0
Entering edit mode

Hi Michael,

I inherited a NanoString dataset and would like to use NanoNormIter and DESeq2 for its analysis. Before I begin, I wanted to clarify a few points which I summarize below to be more comprehensive:

1) You start by reading the raw files and do the pre-processing until the raw count matrixes. You next define housekeeping genes, test for differential expression in the group of interest, would you remove such genes and perhaps replace them with those with the lowest variance in the dataset?

2) You detect the LOD for all samples. This is for flagging up problematic samples, right? Same for the HK_Gene_Miss genes as I see no other use of them later in the code.

3) I noticed that you run RUV_total twice to generate two objects, 'vsd' and 'set', respectively. Am I right that you use the former for visualization and decision on the k parameter and the latter for the design of the actual dds object?

4) One thing that was not clear was that you used the W_1 matrix for downstream analysis. Would you then imply to re-run betweenLaneNormalization and RUVg from the RUVSeq package outside the RUV_total function? I could not find where else W_1 was generated in the script.

5) Would you recommend additional filtering like one does in RNA-seq (e.g. filtering out genes which are <10 counts in all samples) before normalization?

0
Entering edit mode

Let me ping the author of NanoNormIter for his thoughts.

For 1) there are numerous ways to define control/housekeeping genes which have been discussed in the RUV-Seq paper. 5) Yes, if you have original counts, filtering out the very low count features may be a good approach.

0
Entering edit mode

Thank you Michael. Have you heard back from the authors of NanoNormIter regarding their code on github?

0
Entering edit mode

Hello,

I had one more question about the NanoString analysis pipeline that I needed some clarification. As I understand it, it is expected that one would perform UQ normalization and RUVg() using housekeeping genes to estimate size factors. Then variance stabilizing transformation takes place and the batch is removed to generate the normalized count table without unwanted variation for visualization. But this is independent of any downstream steps (e.g. DE analysis) where the standard DESeq2 pipeline can be used using the unwanted variation vector in the design matrix, right? So the point of the RUV.total() function is only to remove the unwanted variation, but not for later analysis.

0
Entering edit mode

Yes, that sounds correct to me.

4
Entering edit mode
@mikelove
Last seen 1 day ago
United States

We use DESeq2 on Nanostring datasets in our lab. Our approach is to estimate RUV factors using the endogenous housekeeping genes and use these in the design formula. You could also specify the endogenous housekeeping as controlGenes in DESeq2.

Update: see our manuscript on bioRxiv:

https://doi.org/10.1101/2020.04.08.032490

0
Entering edit mode

I have a question for your saying.

Is it a process for data normalization to calculate using controlGenes in DESeq2?(You could also specify the endogenous housekeeping as controlGenes in DESeq2.)

I have heard that the process of normalization of the NanoString is fixed, and I want to normalize them using other tools.(for normalization tool for NanoString nCounter data / e.g. NanoStringNorm etc.) But I can not find a function in DESeq2 that can receive data that has already been normalized.

Can I use the processed data(normalized count data) instead of the raw data(raw count data)?

1
Entering edit mode

Their normalized values are not really the optimal input. We found in our internal testing it was much better it is to use the original counts and RUV or control genes plus various EDA and diagnostic checks, like MA plots with labeled control genes.

0
Entering edit mode

DESeq assumes most genes are not DE. In Nanostring, as far as I understand, only genes that are expected to change are inspected (besides the housekeeping genes).

Is this DESeq assumption necessary only for the normalization step? If it is necessary for other steps in the process, it's hard to understand how can Nanostring data be analyzed with it.

1
Entering edit mode

This is why we use the endogenous housekeeping genes passed to RUV or to DESeq2 as controlGenes for normalization, followed by MA plots for quality control. The endogenous housekeeping should fall on the x-axis.

DESeq2 assumes that the median ratio captures the sequencing depth (not exactly the same as saying that most genes are not DE). But still, you do need to modify only the normalization step, to either inform DESeq2 about which are the control genes, or to calculate normalization with RUV. It’s just the normalization that needs a modification not other steps.

0
Entering edit mode

Dear Michael - very insightful thank you.

What should be done with the NanoString Positive and Negative spike-in control genes in the DESeq2 analysis approach you described above?

Should they be filtered out from the original counts before starting?

Also, why are the endogenous Housekeeping genes best for RUV or controlGenes in estimateSizeFactors and not one of the other spike-in groups?

1
Entering edit mode

We’ve now written up our NanoString normalization recommendations in a manuscript:

https://doi.org/10.1101/2020.04.08.032490

Let me know if you have any questions.

0
Entering edit mode

Very nice paper, just have a couple questions regarding the details:

You only do UQ normalization on the original counts for the RUVg analysis is this correct? And the UQ normalized counts are not used for below.

When you create the dds on the original counts for varianceStabilizingTransformation, does the design formula have the RUV factors? VST is also normalizing the data with RLE via estimateSizeFactors as well as the design matrix you provide, so somewhat confusing given you are also doing removeBatchEffect next to normalize with RUV factors.

When you do subsequent limma removeBatchEffect on the VST data, are you passing the RUV factors into the covariates option and leaving batch=NULL and batch2=NULL?

1
Entering edit mode

1) Yes, only for RUVg

2) No, design does not have RUV factors for VST, it's just using design=~1 (nor does VST do any removing of variance or shifting of values associated with variables the design, by the way). Then VST corrects for size factors, then removeBatchEffect corrects for RUV factors. I don't see the mixup.

3) Only using covariates, take a look at the exact code here:

https://github.com/bhattacharya-a-bt/CBCS_normalization/blob/master/nanostring_RUV_functions.R#L28-L40

0
Entering edit mode

Thank you, and sorry you are right it does when VST is only correcting for size factors.

On the related topic of performing general differential expression testing with DESeq2 and edgeR with RUVg factors, why in Section 2.3 of the RUVSeq vignette:

https://bioconductor.org/packages/release/bioc/vignettes/RUVSeq/inst/doc/RUVSeq.pdf#page5

do they show doing UQ calcNormFactors of raw counts before testing with edgeR, though in the following Section 2.5 they do not do that with DESeq2?

1
Entering edit mode

They are just using one norm method within edgeR and a different in DESeq2.

0
Entering edit mode

Also, I see that in lines 31-35 it’s deviating from the standard dds <- estimateDispersions(dds)

Could you elaborate on what exactly is being done here and why?

1
Entering edit mode

The parametric trend (which was derived specifically for RNA-seq in Anders and Huber) I think was not doing well for our NanoString datasets so we just use the same prior across probes (fitType="mean").

0
Entering edit mode

Thanks very much for spending the time to answer these questions, will definitely try this on a couple NanoString datasets and compare to NanoStringNorm.

0
Entering edit mode

I apologize, one more question, would you also recommend applying lines 31-35 prior to differential expression testing with DESeq2 (when using the RUVg factors in the design formula)?

1
Entering edit mode

Those lines are using a methods of moments estimator for dispersion. We did this to reduce computing time because we had many samples. With fewer samples I’d recommend just using the standard DESeq2 steps.

0
Entering edit mode

For clarification, for smaller datasets that only have ~100s samples or less:

You recommend applying all of lines 31-35 for normalization with VST and removeBatchEffects with RUVg factors.

Though for DE testing you recommend standard DESeq2 with default DESeq and RUVg factors in design formula.

1
Entering edit mode

You could delete the MoM dispersion lines for smaller datasets.

Yes.

1
Entering edit mode

Just to be explicit, for smaller datasets you could replace lines 31-35 (MoM estimate of gene-wise dispersion) with standard dispersion estimation:

dds <- estimateDispersions(dds, fitType="mean")

0
Entering edit mode

Our comments crossed paths.. thank you yes that makes sense

0
Entering edit mode

So for both normalization and DE testing though still a fitType=“mean”

0
Entering edit mode

Hi Michael,

Interesting post. One question...

Is this method valid also for nanostring miRNAs nCounter data? Or it should be normalised before with nanostring software and then perform the differential analysis with DESeq2?

Kind regards,

0
Entering edit mode

We've used DESeq2 with NanoString without any issue (but with careful selection of genes for column scaling). For large cohorts we use RUV upstream and would then supply factors in the design matrix for DE analysis. Here is an example of a published framework for these tools applied to NanoString:

https://pubmed.ncbi.nlm.nih.gov/32789507/

0
Entering edit mode

Thank you Michael, as always.

I will try to follow the paper.... in this case is a very small cohort (6 exp vs 5 controls). The first approach I used was to use NACHO to perform the reading of the .RCC files and perform a ·GEO" normalization:

data.norm <- normalise(
nacho_object = data,
housekeeping_genes = my_housekeeping,
housekeeping_predict = FALSE,
housekeeping_norm = TRUE,
normalisation_method = "GEO", # to choose between "GLM", o "GEO"
remove_outliers = TRUE)


then I extracted these normalised counts:

expr_counts <- data.norm[["nacho"]] %>%
filter(grepl("Endogenous", CodeClass)) %>%
select(NcounterFile, Name, Count_Norm) %>%
pivot_wider(names_from = "Name", values_from = "Count_Norm") %>%
column_to_rownames("NcounterFile") %>%
t()


and finally aplied DESeq2:

dds <- DESeqDataSetFromMatrix(countData = expr_counts,
colData = selected_pheno,
design= ~ Phenotype)

dds <- DESeq(dds)
res <- results(dds)


but this, as mentioned before, is introducing two normalizations, the one form NACHO, and then deseq2. (is this that bad???)

Lets see if I can manage to make it work following your paper pipeline.

Thank you again!

0
Entering edit mode

Our approach is to use original counts with factors of unwanted variation in the design. In general DESeq2 should not be used with pre-normalized counts, so I wouldn't recommend the above.

0
Entering edit mode

Hi Michael Love , I'd like to double check the input of DESeq2 if I want to use nSolver normalization. I see from "sabry_analysis.R" https://github.com/bhattacharya-a-bt/CBCS_normalization/blob/master/sabry_analysis.R as below. There is no double the input for "DDS.nsovler" is the original counts, because the nSolver normalization was provided by R itself using pre-existing size factors: sizeFactors(dds.nsolver).

My question is could I provide the pre-normalized counts by nSolver to DESeq2 for DEA? I don't think so, right? Because the function of DESeq will do the estimating of size factors, dispersions, and ect, which expected the raw counts, am I right?

How about edgeR? I see some paper used it to do DEA for NanoString data. What it is the expected input for edgeR on NanoString data, do you have any comments?

Thanks a lot! This thread, the GitHub, and the paper for RUV really helps for me.

dds.nsolver <- DESeqDataSetFromMatrix(countData = counts(set)[1:652,],
colData = pData(set),
design = ~Group)
pos = raw[grepl('POS',rownames(raw)),]
hk = raw[grepl('Housekeeping',raw_expression$Class),] pos.factors = mean(as.numeric(apply(pos,2,geoMean)))/as.numeric(apply(pos,2,geoMean)) hk.factors = mean(as.numeric(apply(hk,2,geoMean)))/as.numeric(apply(hk,2,geoMean)) sizeFactors(dds.nsolver) <- pos.factors * hk.factors dds.nsolver <- DESeq(dds.nsolver)  ADD REPLY 0 Entering edit mode As above, pre-normalized data is not recommended for DESeq2 (or edgeR). Part of the point of the software is to take into account the precision of count data. This actually helps quite a bit in analyses where some samples have higher depth that others. I've seen groups input pre-normalized data into methods and then -- predictably -- the methods perform worse because they do not get to use the information about differential precision. ADD REPLY 0 Entering edit mode Thanks a lot for convincing me! Could you help me understand about "k=1, i=1", "W_1", and also "makeRLEplot" . Meanwhile, I will read the paper again to get better understanding. ADD REPLY 0 Entering edit mode Read over the paper and the RUV paper we cite first, then post any questions you have. ADD REPLY 0 Entering edit mode to estimate RUV factors using the endogenous housekeeping genes and use these in the design formula. By these codes as below, right? In fact, I did't follow what do they mean by "k=1", "i=1", and what does it mean "W_1", and which functions to get it in "RUV_total"? I read the paper, went over the GitHub, and also the function of RUV_total. I still feel confused. Could you help me to understand them? Also, for function of makeRLEplot <- function(data,metadata,id), https://github.com/bhattacharya-a-bt/NanoNormIter/blob/master/R/makeRLEplot.R , what does it by "id"? I see it is "id - colname of sample ids". But, I do't understand what it is? Please let me know if it is not a right place to ask this question. Thank you so much! k = 1 vsd = RUV_total(raw,pData,fData,k = k)$vsd
set = RUV_total(raw,pData,fData,k = k)\$set

i = 1

dds <- DESeqDataSetFromMatrix(countData = counts(set)[1:652,],
colData = pData(set),
design = ~ W_1 + Group)

0
Entering edit mode

Michael Love I mean the questions here.

0
Entering edit mode

For W and k, please read the RUV paper that we cite and use heavily in the article:

W is an n × k matrix corresponding to hidden factors of "unwanted variation"

From our paper:

In general, the number of dimensions of unwanted variation to remove was chosen by iteratively normalizing the data for a given number of dimensions and checking for the removal of known technical factors already identified in the raw expression data (e.g. study phase), and presence of key biological variation (e.g. bimodality of ESR1 expression in the CBCS breast cancer data where estrogen receptor status is a known predominant feature). Further details about choosing this dimension are given by Gagnon-Bartsch et al and Risso et al [6,8].

Also see the tutorial linked from our GitHub which talks about the range of k that was used:

https://github.com/bhattacharya-a-bt/CBCS_normalization/blob/master/CBCS_normalization_tutorial.pdf

I'm pretty sure id in the RLE plot is the sample ids to plot, at least that's what it says in the comment at the top.

This specific mention of i is just a variable. Actually in this case it is indicating k=1 (only one factor was used).

https://github.com/bhattacharya-a-bt/CBCS_normalization/blob/master/sabry_analysis.R#L79

0
Entering edit mode

Hi, I am now looking at this thread and the Briefings paper. I do understand housekeeping might not be such and the need to QC, but I was wondering how the (RUV factor estimation + design including RUV factors) approach compares with simply estimating size factors using genes labelled as housekeeping?

Thanks!

Clara

0
Entering edit mode

RUV helps to remove systematic technical variation beyond size factors. Suggest reading the RUV papers we cite in the Briefings paper.

0
Entering edit mode

ok! I'd have a few more questions

1) if I decide to first try with size factor normalization only, with >500 genes in the Nanostring panel, could it be better to just normalize on all genes and not housekeeping only?

2) do you have any recommendations on how to "verify" size factor normalization in general when using targeted RNA-seq that might not include any housekeeping (e.g. some Illumina TruSight)?

thank you!!

1- add) I have tried both approaches, sizeFactors on all genes and on (12) housekkeping- in both cases housekeeping genes lie close to x-axis (max lFC = 0.2) but there are no DE genes at adjp = 0.1 and pvalue hist has an increasing slope towards 1