Question

edgeR or DESeq2 between duplicated chromosomes in the same individual

1

Entering edit mode

grant.dejong ▴ 10

@grantdejong-22202

Last seen 6.2 years ago

Hi there,

I have an RNA-seq dataset from a polyploid species (consisting of sets of similar chromosomes, which consequently, contain highly similar paralogs). I'd like to know whether I could compare expression between these gene pairs using the aforementioned software.

Is there a way to incorporate gene lengths in the detection of DEGs? I'm aware that you must supply raw data (although I've heard it is possible but highly discouraged to use RPK normalized genes, at least of edgeR, which seemed to lead to somewhat good results). I've heard of incorporating offsets - which might be might best bet - but I'm not sure how they work.

Additionally, since these are intra-sample comparisons, I've manually forced the library size calculated from the sum of each replicate because it would underestimate the size of count differences to normalize by chromosome. For example, rep_1 from the table below would be 100+130+450 (A) + 200+300+30 (B) = 1210. Both rep1_A and rep1_B would therefore be normalized by this value. Would this violate the assumptions of normalization (e.g. TMM) or the DEG tests?

Thanks!

The dataset is essentially analogous to this (however incorporating all ~30000 gene pairs):

| Gene100_Gene200 | 100 | 101 | 90 | 200 | 220 | 190 |

| Gene110_Gene210 | 130 | 140 | 120 | 300 | 350 | 370 |

| Gene130_Gene230 | 400 | 450 | 390 | 30 | 75 | 20 |

(Sorry about the formatting, markdown tables didn't work - not sure what's going on.)

deseq2 edgeR normalization DGE • 1.9k views

ADD COMMENT • link updated 6.2 years ago by Gordon Smyth 53k • written 6.2 years ago by grant.dejong ▴ 10

0

Entering edit mode

Grant, I am having trouble following what you are doing. After reading your question through a few times, I think that you are probably splitting each library of RNA-seq counts into two, with each row of B being a paralog of the same row of A. But that would only make sense if all the paralogs come in pairs. Perhaps A and B correspond to duplicated chromosomes containing essentially the same genes? If so, do all the chromosomes come in pairs? Or are you only analysing one pair of chromosomes?

What are replicates 1, 2 and 3? Different individuals?

If you have duplicated chromosomes, I am not clear how you can generate separate read counts for each duplicate. Won't the genes and sequences be mostly the same on both copies, so how you determine with confidence which copy of a chromosome a sequence read originates from?

ADD REPLY • link 6.2 years ago Gordon Smyth 53k

0

Entering edit mode

Hi Gordon, I'm sorry for any confusion! I'm looking at an allopolyploid (which has undergone a recent whole-genome duplication event via inter-specific hybridization) and am particularly interested in comparing gene expression between each gene pair. These genes are organized in large, highly similar syntenic regions throughout each ancestral genome (subgenomes) such that chromosomes copies have retained a fairly consistent gene order. For the sake of this analysis, I'm looking only at paralogous gene pairs which are single-copy genes in their respective subgenome (this constitutes the majority of genes).

Replicates 1, 2, and 3 refer to the biological replicates (each represents RNA extracted from a standardized tissue volume harvested from 10 individuals).

The duplicate genes (homeologs, in this case) aren't identical and have a number of polymorphisms that can aid in distinguishing subgenome-specific reads during read mapping (this is part of the motivation for gene-length normalization, although the vast majority are nearly identical in sum exon length). Reads which are best hits with a particular gene are kept, and the minority of identically (or somewhat ambiguously) mapped reads are subsequently discarded. There are other post-mapping tools which can do more comprehensive read-sorting but aligning alone is usually sufficient given the correlation between read-sorting outputs and aligners like STAR.

Given these counts I'd like to compare gene expression between subgenomes, but comparing subgenome expression using default edgeR/DESeq2 (which I believe include library corrections by default) would erroneously underestimate the difference in subgenome-mapped read counts since the reads originated from the same sample, and therefore, differences in mapped totals between subgenomes are biologically informative.

I hope that was less confusing but I'll admit this is a pretty non-standard use case for either of these tools.

RE my library size query: I'm just looking to see if manually including a library size (sum of A subgenome genes + sum of B subgenome genes) can be done in either program as the internal library correction would be problematic.

Edit: Here's an example of what I mean in edgeR:

library_size = function(x){
  # Library size for polyploid comparison:
  samples = gsub("\\..*$", "", colnames(x))  # remove subgenome IDs from colnames
  lapply(samples, function(sample){
    lib = x[,grep(sample, colnames(x))]
    sum(lib[,1],lib[,2])
  }) %>% unlist
}

lib_sizes = library_size(counts_merged)
DGE = DGEList(counts=counts_merged, group = group, remove.zeros = T, lib.size = lib_sizes)

ADD REPLY • link 6.2 years ago grant.dejong ▴ 10

0

Entering edit mode

I wonder what you mean why you mean by "RPK normalized genes" or by "possible but highly discouraged". Possible according to whom? Discouraged by whom?

ADD REPLY • link 6.2 years ago Gordon Smyth 53k

0

Entering edit mode

Sorry, by RPK/FPK I'm referring normalization of my raw counts by gene length in kilobases (raw counts/(gene length/1000)). Since edegR and DESeq2 internally correct for library size I figured representing my data by FPK/RPK would be acceptable but a colleague brought up the fact that I shouldn't use this method as it would decrease the power.

ADD REPLY • link 6.2 years ago grant.dejong ▴ 10

score 0 · Answer 1 · 2019-10-29

In principle, you can simply incorporate the gene lengths into the offset matrix when using edgeR.

First create a DGList object and equalize the library sizes from same individual:

DGE <- DGEList(counts=counts_merged)
lib_sizes <- colSums( DGE$counts_merged[,1:3] + DGE$counts[,4:6] )
lib_sizes <- rep(lib_sizes, 2)
DGE$samples$lib.sizes <- lib_sizees

Now, assume that gene_length is a matrix of gene lengths of the same dimensions as DGE. Then create the offset matrix from the library sizes and gene lengths:

lgl <- log(gene_length)
lgl <- lgl - mean(lgl)
offset <- t( t(lgl) + log(lib_sizes) )
DGE$offset <- offset

You can now proceed with the analysis. However I don't know how to make calcNormFactors work in this context, so you will need to skip TMM normalization.