Question

About the results and contrast (DESeq2)

0

Entering edit mode

David Requena ▴ 20

@david-requena-13990

Last seen 8.0 years ago

The Rockefeller University. New York, U…

Hello! I want to make a differential expression analysis of some RNAseq data using DESeq2.

Essentially, I have 3 types of tissue: normal, primary tumor and metastasis. These samples were sequenced in 4 groups and I found an evident batch effect (clusters by group processed, indicated in the variable "lib_prep") and also clusters by patient.

This is my data: (sample id, patient, library preparation and sample type)

Patient	ID	sample_type	lib_prep	nested_patient
1	P1N1	normal	A	A
1	P1P1	primary	A	A
1	P1M1	metastasis	A	A
1	P1M2	metastasis	A	A
1	P1M3	metastasis	A	A
1	P1M4	metastasis	A	A
2	P2N1	normal	A	B
2	P2P1	primary	A	B
2	P2M1	metastasis	A	B
2	P2M2	metastasis	A	B
3	P3N1	normal	A	C
3	P3P1	primary	A	C
3	P3M1	metastasis	A	C
3	P3M2	metastasis	A	C
4	P4N1	normal	B	A
4	P4P1	primary	B	A
4	P4M1	metastasis	B	A
4	P4M2	metastasis	B	A
4	P4M3	metastasis	B	A
5	P5N1	normal	B	B
5	P5P1	primary	B	B
6	P6N1	normal	C	A
6	P6P1	primary	C	A
6	P6M1	metastasis	C	A
6	P6M2	metastasis	C	A
7	P7N1	normal	C	B
7	P7P1	primary	C	B
7	P7M1	metastasis	C	B
8	P8N1	normal	C	C
8	P8N2	normal	C	C
8	P8P1	primary	C	C
8	P8P2	primary	C	C
8	P8M1	metastasis	C	C
9	P9N1	normal	C	D
9	P9P1	primary	C	D
9	P9M1	metastasis	C	D
9	P9M2	metastasis	C	D
10	P10N1	normal	D	A
10	P10M1	metastasis	D	A
11	P11N1	normal	D	B
11	P11M1	metastasis	D	B
11	P11M2	metastasis	D	B
12	P12N1	normal	D	C
12	P12M1	metastasis	D	C
12	P12M2	metastasis	D	C

I'm interested in the genes differentially expressed between the metastasis and primary tumor samples, using as base level the normal samples. As patient is collinear with lib_prep, I created a new variable "nested_patient".

So, I made the DESeq object:

ds.deseq <- DESeqDataSetFromMatrix(
    countData = countstable.num,
    colData = sampledata,
    design = ~ sample_type)

colData(ds.deseq)$sample_type  <- relevel(colData(ds.deseq)$sample_type, "normal")

I also provided the corresponding model matrix:

mm <- model.matrix(~ lib_prep + lib_prep:nested_patient + sample_type, colData(ds.deseq))
# To remove the columns full of zeros:
mm <- mm[ , colSums(mm) > 0]

And I ran the program as follows:

ds.deseq <- DESeq(ds.deseq, full=mm,
                  modelMatrixType="standard", betaPrior=FALSE,
                  parallel = TRUE, BPPARAM=BatchJobsParam(workers = 64))

As result, I obtained:

> resultsNames(ds.deseq)
 [1] "Intercept"                 "lib_prepB"                 "lib_prepC"                 "lib_prepD"                
 [5] "sample_typemetastasis"     "sample_typeprimary"        "lib_prepA.nested_patientB" "lib_prepB.nested_patientB"
 [9] "lib_prepC.nested_patientB" "lib_prepD.nested_patientB" "lib_prepA.nested_patientC" "lib_prepC.nested_patientC"
[13] "lib_prepC.nested_patientD" "lib_prepC.nested_patientE"

But I don't know which of these is the comparison that I want. How to select the contrast appropriately to have only the genes differentially expressed in metastasis vs primary tumor, but different than the normal samples, and controlling by library preparation and patient. Could you help me, please?

David

Samples summary:

ID	Library	Normal	Primary	Metastasis
1	A	1	1	4
2	A	1	1	2
3	A	1	1	2
4	B	1	1	3
5	B	1	1	-
6	C	1	1	2
7	C	1	1	1
8	C	2	2	1
9	C	1	1	2
10	D	1	-	1
11	D	1	-	2
12	D	1	-	2

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] checkmate_1.8.3            pheatmap_1.0.8             tsne_0.1-3                 ggrepel_0.6.5
[5] ggplot2_2.2.1              gplots_3.0.1               RColorBrewer_1.1-2         BiocParallel_1.8.2
[9] BatchJobs_1.6              BBmisc_1.11                matrixStats_0.52.2         biomaRt_2.30.0
[13] knitr_1.17                 DESeq2_1.14.1              SummarizedExperiment_1.4.0 Biobase_2.34.0
[17] GenomicRanges_1.26.4       GenomeInfoDb_1.10.3        IRanges_2.8.2              S4Vectors_0.12.2
[21] BiocGenerics_0.20.0

deseq2 differential expression rna-seq • 6.0k views

ADD COMMENT • link 8.4 years ago David Requena ▴ 20

0

Entering edit mode

David Requena ▴ 20

@david-requena-13990

Last seen 8.0 years ago

The Rockefeller University. New York, U…

Thanks a lot for your answer and recommendations!

I get an error with the approach B: "Error in results(ds.deseq, contrast = c("sample_type", "metastasis", "primary"), : only list- and numeric-type contrasts are supported for user-supplied model matrices". So, I used this:

res <- results(ds.deseq, contrast=list("sample_typemetastasis", "sample_typeprimary"),
               altHypothesis="greaterAbs", alpha = 0.01, pAdjustMethod = "BH")

Which is equivalent to the approach A.

And, about the comparison: could the normal samples be introducing noise? should be better to exclude them? Because, my main goal is to find differences between metastasis and primary tumors. Or is more convenient to include them? Would this result in a better fit of the coefficients in the model?

David

ADD COMMENT • link 8.4 years ago David Requena ▴ 20

1

Entering edit mode

Having the normal samples in the deseq object, even if you're not using them explicitly in the contrast, adds a few extra degrees of freedom better to estimate the variance of the residuals. So it then comes down to how much you believe the within-normals variability is consistent with the within-tumour variability. My experience is that they are commensurate generally, sometimes they are bit less variable (if the Tolstoy maxim holds, that all normal samples are the same, but all tumour samples are tumorous in their own way), and so I generally keep them (look at a few positive controls' behaviour to get a feel) but occasionally if there's varying contamination of normals with tumour I'll then definitely subset the deseq object to remove them from the analysis entirely.

ADD REPLY • link 8.4 years ago Gavin Kelly ▴ 690

0

Entering edit mode

Thanks a lot!

David

ADD REPLY • link 8.4 years ago David Requena ▴ 20

score 2 · Accepted Answer · 2017-09-19

For the "differentially expressed in metastasis vs primary tumor" you can get that directly by

results(ds.deseq, contrast=list(c("sample_typemetastasis"),  c("sample_typeprimary")))   (approach A)

because the sample_typemetastasis, and sample_typeprimary are relative to normal (that being the baseline level, as it is first alphabetically), so constructing the contrast this way will compare (meta/normal) / (primary/normal), giving meta/primary fold-change (the normals cancelling out). You can also specify it using the alternative but equivalent

results(ds.deseq, contrast=c("sample_type", "metastasis", "primary"))      (approach B)

Because you've included terms for the factors you want to control for, the coefficients we're using above already adjust for these.

I'm a bit confused about the "but different than the normal samples" requirement. The differential test above doesn't need to take into account any 'normal controls'. But if you're really insistent on using the normals in some way, you could look at results(ds.deseq, name="sample_typemetastasis") as those are the met_vs_normal differential genes; similarly for the primaries - and then you could decide to include the meta_vs_primary genes based on there differential pattern in these two auxilliary lists. But you're doubling up on your statistical errors, so I really wouldn't recommend it. Instead, I'd recommend labeling your differential genes (from approach A above) by the fold-changes from the 'results(name=...)' approach, so you'd have a list of met_vs_primary significant genes, annotated by the magnitude and direction of log-fold-changes of the cancerous samples when compared to normal.