Search
Question: Allele-specific expression with different gene lenghts
0
4 months ago by
gtechbio0
gtechbio0 wrote:

Hi,
I want to use DESeq2 for assessing allele-specific expression. I will compare gene-level allelic counts of a yeast hybrid (basically comparing counts of orthologous genes of parentals of this hybrid).
I'm aware of this tutorial http://rstudio-pubs-static.s3.amazonaws.com/275642_e9d578fe1f7a404aad0553f52236c0a4.html.

My question is that parentals might have orthologous genes with different lengths, so for example ortholog in parent A is 1000bp and in parent B is 1020 bp. Is it possible at some step of DESeq2 to account for this length difference?

Thanks

modified 4 months ago by Michael Love19k • written 4 months ago by gtechbio0
1
4 months ago by
Michael Love19k
United States
Michael Love19k wrote:

If you provide a matrix at the beginning of the analysis called “avgTxLength”, the analysis will control for gene length changes across sample. This should happen before running DESeq().

assays(dds)[["avgTxLength"]] <- lengths

Hi Michael,

Just to make sure I do everything in order:

My count_table looks like

                        SC1    SC2    SC3    SU1    SU2    SU3
YBR177C_Sbay_4.429      2711   3397   3765   1246   2156   3287
YIL140W_Sbay_9.48         178    111    204    109    271    250
YLR268W_Sbay_10.366    1208   1170   1704    631    825    865


The code is

groups<-factor(x=c(rep("SC",3), rep("SU",3)), levels=c("SC","SU"))

colData <- DataFrame(condition=groups)

dds <- DESeqDataSetFromMatrix(count_table, colData, formula(~condition))
assays(dds)[["avgTxLength"]] <- as.matrix(length_data)   ### correcting for gene length

length_data looks like

                       SC1   SC2   SC3   SU1   SU2   SU3
YBR177C_Sbay_4.429     1353  1353  1353  1359  1359  1359
YIL140W_Sbay_9.48      2469  2469  2469  2457  2457  2457
YLR268W_Sbay_10.366     642   642   642   642   642   642
sizeFactors(dds) <- c(as.numeric(rep("1",6)))   ### this is allelic expression, so everything comes from the same library
dds <- DESeq(dds)
res <- results(dds)

Is everything correct?
Cheers and sorry if formatting is a bit messed

EDIT: When I omit sizeFactors(dds) <- c(as.numeric(rep("1",6))), I get the message using 'avgTxLength' from assays(dds), correcting for library size, but if I do both preset library size to 1 and feed the code with gene length, I do not receive the message above. Does it mean that with predefined size factors its not possible to account for gene length?

There's two issues here, first is that you should correct for library size unless you are comparing within individual (e.g. alt to total or alt to ref, etc.). Above you don't have the same kind of count table as in my example, where there were two columns for every sample.

The other issue is that you're correct, if you want to specify the size factors as 1, the avgTxLength method isn't the way to go. Instead you should use normalizationFactors instead:

normalizationFactors(dds) <- lengths / exp(rowMeans(log(lengths)))

Dear Michael,

Regarding normalization, so thanks a lot, I will try running the code like that.

Regarding the count table, so for example SC1 and SU1 are from the same library (as other samples with the same numbers), and I thought my design will compare alt to ref, like you have mentioned, won't it? Is there a conceptual difference for Deseq2 calculations if I also add "sample" column?

Once again thank you for your time!

I'd say take a look at how I went about it in the workflow you posted. Yes, there's a critical difference in adding the sample information and identifying which column is which allele or not.

Hi Michael,
I am sorry for ignorance, but still I don't understand why my example wouldn't work properly: I specify in design formula that there are two conditions (basically parents), and I control for library size and gene length. So basically it will compare condition 1 vs condition 2, and will show in which condition there is higher or lower expression (which in theory corresponds to ASE). What does this logic miss?

One of the strengths of the ASE approach is that you compare counts within individual, controlling for many potential biases. In the workflow I created, this is the approach. Your approach doesn’t have individual in it.

Thanks for elaborating!Now it makes sense for me. And I guess for human data (i.e. patients or any "case-control" study) this "individualized" approach would matter compared to mine. In my case though I analyze yeasts (they are quite homogenous within the culture), so I can in principle assume that between-colony variations will not bias overall analysis.
Anyway, for my analysis I will do as you suggest with design like ~parent+sample.
Also, just to clarify, with normalizationFactors I don't to set sizefactors to 1, correct?

That design is confounded. Again, I’d recommend to take a look at that workflow.

The only difference is that you will provide norm factors before DESeq()

Dear Michael,

I was wondering if in DESeq2 it is possible to simultaneously set size factors and control for gene length (like you have suggested in your post)?