Question

Feedback on edgeR model setup: identifying sex-by-tissue effects

0

Entering edit mode

spflanagan.phd • 0

@spflanaganphd-21396

Last seen 5.7 years ago

I have RNA-seq data from 5 males and 5 females for 4 tissue types (skin, liver, brain, and gonads), and I would like to: 1. Identify genes/transcripts with skin-biased expression 2. Identify genes/transcripts with sex-biased expression 3. Identify genes/transcripts with sex-biased expression in the skin only

I would like to use the GLM approach to analyze my data as described in the edgeR User's Guide, but I am not confident that I am correctly specifying the appropriate contrasts or using the best glm model, so I'm requesting feedback on my model setup here.

I have a DGEList object where I have assigned individuals to a group that is tissuetype_sex. Here is how I've specified the design matrix:

design<-model.matrix(~0+group, data=y$samples)
head(design)
             groupbrains_female groupbrains_male groupgonads_female groupgonads_male groupliver_female groupliver_male groupskin_female groupskin_male
brains_SSF1                   1                0                  0                0                 0               0                0              0
brains_SSF2                   1                0                  0                0                 0               0                0              0
brains_SSF3                   1                0                  0                0                 0               0                0              0
brains_SSF4                   1                0                  0                0                 0               0                0              0
brains_SSF6                   1                0                  0                0                 0               0                0              0
brains_SSNP1                  0                1                  0                0                 0               0                0              0

(sorry for the messy formatting of the output, I couldn't figure out how to make it look better)

I then fit the model using glmQLFit:

fit <- glmQLFit(y, design)

To identify skin-specific genes, I then do:

skinVbrains <- glmTreat(fit, contrast=makeContrasts((groupskin_female+groupskin_male)-(groupbrains_female+groupbrains_male), levels=design), lfc=2)
skinVliver <- glmTreat(fit, contrast=makeContrasts((groupskin_female+groupskin_male)-(groupliver_female+groupliver_male), levels=design), lfc=2)
skinVgonads <- glmTreat(fit, contrast=makeContrasts((groupskin_female+groupskin_male)-(groupgonads_female+groupgonads_male), levels=design), lfc=2)

and the genes that are significantly different in each of these three tests would be genes with skin-biased expression. To identify sex-biased genes and skin-specific sex-biased genes, I follow a similar approach:

Sexfit <- glmTreat(fit, 
                contrast=makeContrasts((groupskin_female+groupbrains_female+groupliver_female+groupgonads_female)-(groupskin_male+groupbrains_male+groupliver_male+groupgonads_male),
                                       levels=design), lfc=2)
skinSex <- glmTreat(fit, contrast=makeContrasts(groupskin_female-groupskin_male, levels=design), lfc=2)

where those genes with significantly different expression in skinSex that are also in the previously-identified skin-biased expression gene set would be genes that are sex-biased in the skin only.

Is this the best way to set up these models? Would I be better off trying to use an interaction formula instead? Any advice on how I can improve this analysis would be much appreciated!

Thank you in advance, Sarah

P.S. In case it helps, here is my sessionInfo():

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_NZ.UTF-8       LC_NUMERIC=C               LC_TIME=en_NZ.UTF-8        LC_COLLATE=en_NZ.UTF-8     LC_MONETARY=en_NZ.UTF-8    LC_MESSAGES=en_NZ.UTF-8   
 [7] LC_PAPER=en_NZ.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_NZ.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] knitr_1.23   edgeR_3.26.4 limma_3.40.2

loaded via a namespace (and not attached):
 [1] compiler_3.6.1  tools_3.6.1     yaml_2.2.0      Rcpp_1.0.1      splines_3.6.1   highr_0.8       grid_3.6.1      locfit_1.5-9.1  xfun_0.7        lattice_0.20-38

edger • 1.1k views

ADD COMMENT • link updated 5.7 years ago by Aaron Lun ★ 28k • written 5.7 years ago by spflanagan.phd • 0

0

Entering edit mode

I fixed the formating of the output for you by replacing the markdown fencing (three back-ticks) with pre-formatted html tags instead. The fenced block will wrap long lines but the pre-formatted block doesn't. It's a trick that is seldom mentioned in documentation but can be useful.

ADD REPLY • link 5.7 years ago Gordon Smyth 52k

score 1 · Accepted Answer · 2019-07-21

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 10 hours ago

The city by the bay

You don't mention whether the samples from different tissues are collected from the same set of 5 males and 5 females. I'm going to assume that each set of samples for a particular tissue was collected from a different set of donors, otherwise that would represent a surprisingly well-coordinated harvest.

If that is the case (i.e., you actually have data from (5+5)*4 different donors), then your design is fine. There is no need to use an interaction model, that is what is implicitly done when you combine sex and tissue into a single factor. Using an interaction term would just make life more complicated.

Your contrasts should use averages of coefficients rather than just the sums, otherwise the magnitude of the log-fold changes don't make sense. For example:

makeContrasts((groupskin_female+groupskin_male)/2
    -(groupbrains_female+groupbrains_male)/2, levels=design)

And the log-fold change threshold in glmTreat is very high. Too high, I would say. If you don't want to to miss out on a lot of interesting DE genes with lower log-fold changes, I would use a value like lfc=0.5 instead.

ADD COMMENT • link 5.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you for noting that I need to use the averages!

To make things more difficult, it turns out that most of the samples are from individual donors, but two of the same donors were used to generate two separate tissue samples (so two donors gave both liver and skin tissue). That means that I have 40 tissue samples and 38 unique donors. Given that this is the case, would you recommend including a random factor with individual IDs? If so, what is the best way to specify that?

ADD REPLY • link 5.7 years ago spflanagan.phd • 0

0

Entering edit mode

You have three options:

Switch to voom and use duplicateCorrelation.
Drop two of the paired samples randomly so that there is a 1:1 sample:patient mapping, and proceed with using edgeR as described above.
Ignore it completely and just proceed with edgeR.

1 probably won't work all that well because you can only estimate the correlation from the two patients that have two samples. 2 is probably the best approach as it gets rid of any sample-sample correlation without much loss of information. But frankly, 3 should be fine. I'd have to work through the math, but even if the two paired samples were perfectly correlated, it would represent a worst-case overstatement of the residual d.f. by... ~5%, which is not a big deal.

ADD REPLY • link 5.7 years ago Aaron Lun ★ 28k