Question

Limma-voom read count analysis

0

Entering edit mode

ioannis.konstantinidis • 0

@ioanniskonstantinidis-13769

Last seen 6.7 years ago

Hello community,

I have been reading some related posts in the forum for limma and voom (read count analysis and differential expression). However all the questions and answers seems to be quite specific.

So, I have a few samples to analyze and I am a beginner in R. Any help or suggestion would be much appreciated.

In R-studio, each one of my samples is a count matrix and contains 3 columns (Chromosome - Genomic Position - Read Counts). The rows are more than 400.000 and they represent hydroxymethylated CpG sites. However, the number of rows is not the same in every matrix. I am trying to follow the examples as described in limma userguide (RNA-seq data, p.69-73). That's probably what I should use (I am not sure). My data is not RNA-seq but aligned reads on the reference genome.

Two of the count matrices serve as controls. One control is a genomic DNA pool of the 6 first samples (Group 1) and the second control is a pool for the 6 last samples (Group 2).
These two controls have all the possible CpG sites (based on the MspI restriction efficiency). On the contrary, the other 12 samples have only CpG sites that are methylated.
My goal is to compare these 2 groups and find any possible differentiations.

So, my question is: How can I detect or extract the differentially methylated regions based on the number of counts between these matrices-samples? Is this package able to give this information?

Kind Regards,

Ioannis

Differentially methylated CpG's Limma Voom Read Count • 1.3k views

ADD COMMENT • link updated 6.7 years ago by Aaron Lun ★ 28k • written 6.7 years ago by ioannis.konstantinidis • 0

score 1 · Answer 1 · 2017-08-17

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 4 hours ago

The city by the bay

I would suggest looking at some packages dedicated to analyzing methylation data. You say you have reads, but you don't mention what type of data you have. Is it bisulfite sequencing? MeDIP-seq? bsseq and MEDIPs come to mind. A differential methylation analysis is not the same as a standard differential expression analysis with voom, because you need to account for the total coverage per site when testing for differences.

ADD COMMENT • link 6.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Apologies. I just don't know what is important to mention and what is not.
Well I am using the Reduced Representation Hydroxymethylation Profiling (RRHP) on genomic DNA.
It uses double MspI digestion. At first, the MspI restriction enzyme is used for fragmentation and then a specific adapter creates again fragments with a CCGG start (which is the MspI restriction site). The whole protocol is based on the glucosylation of hydroxymethylcytosines. This glucosylation step blocks the second MspI step from digesting again the CCGG fragments so at the end, the reads that do not contain hydroxymethylcytosines at the second position (C C G G) are being digested (adapters are gone, therefore they can not be amplified). Finally I sequence these samples on a HiSeq4000, align the fasta files with bowtie2 and then extract all the CCGG reads from the SAM files into a .txt which is my input in R.

I have no bisulfite treatment, thus most of the packages are not usefull to me. Bisulfite sequencing based packages are able to identify CpGs. In my case, only the second cytosine is hydroxymethylated but I have many more CpGs within the reads which are not usefull and should not be considered in the analysis.

For example
RRHP: CCGGATACGTCACTACGATCGTAGATGCTA RRBS: TCGGATATGTTATTATGATTGTAGATGTTA

Since the C's are not transformed into T's these packages identify all cytosine's as methylated which is not the case.

So I was thinking that the count of the reads is an indication of the hydroxymethylation abundance for each sample at each CpG site. That's why I have created a count matrix for each sample as mentioned above.
Regarding the total coverage, this is why I have the control matrices. I believe they serve as a normalization method because I haven't include any glucosylation step nor second digestion with MspI. Controls have possibly all or most of the CpGs of my reference genome and the samples have only hydroxymethylated CpGs. Every sample and control has the same DNA input (library prep), same amplification cycles, same concentration on the sequencing and same alignment parameters.

And this is actually where I am stuck. I can't compare, combine and normalize these matrices anyhow.
Limma user guide is not very helpful either.
I hope I am not very wrong about the method I am using.

Thanks for your reply

ADD REPLY • link 6.7 years ago ioannis.konstantinidis • 0

0

Entering edit mode

This does not sound like a straightforward application of existing methods. All I can suggest is:

Create a pool of all CpG sites, by taking the union across all samples.
Initialize a common matrix with number of rows equal to the size of the pool
Use findOverlaps to match the CpG sites in each individual matrix to the pool.
Fill in the common matrix based on the overlaps (lack of overlap is assumed to be a zero).

This should give you a single matrix with counts for all features in all samples. Now, as to what you can do with it... that's another question altogether. The most effective normalization strategy is not obvious, e.g., can you assume that most regions are not differentially methylated? Similarly, it is not clear how to best use the genomic controls, if at all (given the analogous difficulties with incorporating input controls in differential binding analyses of ChIP-seq data). You will have to have a serious think about how you will deal with this type of data; don't blindly apply voom and hope for the best.

ADD REPLY • link 6.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks for the suggestions. I agree, according to the literature there is nothing specific like a package in R to work with such data sets. Well even if I put my data in voom, I don't hope for anything because simply it doesn't give me anything that I can understand or work with. Probably I need some experienced bioinformatician for this.

Thanks again for your time Aaron.
Cheers!

ADD REPLY • link 6.7 years ago ioannis.konstantinidis • 0

0

Entering edit mode

Hi Ioannis!

I have the same problem... I think I have to count my read in each condition but with subread, I had low coverage. Did you succeed in your analysis?

Thanks!!

Christelle

ADD REPLY • link 5.6 years ago christelle.ledantec • 0