Coverage vs GC bias

Coverage vs GC bias - how to?

0

Entering edit mode

Marc Noguera ▴ 100

@marc-noguera-3883

Last seen 11.3 years ago

Hi all, I am trying to get some insight knowledge on some NGS data. I have an alignment file and an AlignedRead object from it, from which I can compute coverage. What I would like to know is if there is any bias on the coverage vs GC content. My idea is to run a sliding window of width N through the reference sequence I have the reads on and to compute, somehow, a kind of coverage coefficient from the coverage i get from IRanges and the AlignedRead object. As I know the coordinate of the sliding window the next step is to obtain, somehow, the sequence of the window and calculate the GC content. How could I extract the sequence from BSgenome within this sliding window and how can I create the sliding window. I see that I can use getSeq to obtain the sequence and the use alphabetByFrequency on it. However, I don't know how to create this sliding window. Which is the best way to proceed? am I reinventing the wheel? Thanks in advance Marc -- ----------------------------------------------------- Marc Noguera i Julian, PhD Genomics unit / Bioinformatics Institut de Medicina Predictiva i Personalitzada del C?ncer (IMPPC) B-10 Office Carretera de Can Ruti Cam? de les Escoles s/n 08916 Badalona, Barcelona

Coverage BSgenome BSgenome IRanges Coverage BSgenome BSgenome IRanges • 2.7k views

ADD COMMENT • link updated 15.3 years ago by Harris A. Jaffee ▴ 590 • written 15.3 years ago by Marc Noguera ▴ 100

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 6 weeks ago

United States

Hi, Just a few comments: On Tue, Sep 7, 2010 at 4:59 AM, Marc Noguera <mnoguera at="" imppc.org=""> wrote: > Hi all, > > I am trying to get some insight knowledge on some NGS data. I have an > alignment file and an AlignedRead object from it, from which I can > compute coverage. > What I would like to know is if there is any bias on the coverage vs GC > content. > My idea is to run a sliding window of width N through the reference > sequence I have the reads on and to compute, somehow, a kind of coverage > coefficient from the coverage i get from IRanges and the AlignedRead object. For some ideas on performing sliding-window stuff over Rle's, read through this thread: http://thread.gmane.org/gmane.science.biology.informatics.conductor/28 491 > As I know the coordinate of the sliding window the next step is to > obtain, somehow, the sequence of the window and calculate the GC content. > > How could I extract the sequence from BSgenome within this sliding > window and how can I create the sliding window. I see that I can use > getSeq to obtain the sequence and the use alphabetByFrequency on it. While getSeq might work (I really never use that function), I think I'd rather use "Views", like so: R> library(BSgenome.Hsapiens.UCSC.hg18) R> chr <- unmasked(Hsapiens$chr10) R> ranges <- IRanges(start=c(100000, 101000, 102000), width=100) R> seqs <- Views(chr, ranges) R> alphabetFrequency(seqs) A C G T M R W S Y K V H D B N - + [1,] 18 26 23 33 0 0 0 0 0 0 0 0 0 0 0 0 0 [2,] 30 26 24 20 0 0 0 0 0 0 0 0 0 0 0 0 0 [3,] 38 23 18 21 0 0 0 0 0 0 0 0 0 0 0 0 0 > However, I don't know how to create this sliding window. > Which is the best way to proceed? am I reinventing the wheel? Maybe, but it's good practice to learn how to do anyway ;-) There have been several papers published about biases found in sequencing data, which you should probably read through. I think it's good to know how to do these things so that you can actually see which of the reported phenomena are happening in the dataset you're working on .. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD COMMENT • link 15.3 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Harris A. Jaffee ▴ 590

@harris-a-jaffee-3972

Last seen 11.2 years ago

United States

I wonder if you are blurring a couple of concepts, and I'm not sure if bias is the right word, rather than an actual facet of the situation. You could get the GC content of your suite of reads fairly directly: a = alphabetFrequency(reads, baseOnly=TRUE) reads_GC_content = a[,"C"] + a[,"G"] But for the GC content of the alignments, you might do this (also from Biostrings): U: reference (must be unmasked, if masked) N: read/alignment width L: alignment starting locations lf = letterFrequencyInSlidingView(U, view.width=N, letters="CG") alignments_GC_content = lf[L,] Note that 'lf' holds the GC content of all subsequences of length N along your reference sequence. This may seem wasteful, but it can be faster to do all that, then subset it for the relevant subsequences, than to extract those subsequences, get their alphabetFrequency, etc. Let us know if we aren't getting to what you need. -Harris On Sep 7, 2010, at 4:59 AM, Marc Noguera wrote: > Hi all, > > I am trying to get some insight knowledge on some NGS data. I have an > alignment file and an AlignedRead object from it, from which I can > compute coverage. > What I would like to know is if there is any bias on the coverage > vs GC > content. > My idea is to run a sliding window of width N through the reference > sequence I have the reads on and to compute, somehow, a kind of > coverage > coefficient from the coverage i get from IRanges and the > AlignedRead object. > > As I know the coordinate of the sliding window the next step is to > obtain, somehow, the sequence of the window and calculate the GC > content. > > How could I extract the sequence from BSgenome within this sliding > window and how can I create the sliding window. I see that I can use > getSeq to obtain the sequence and the use alphabetByFrequency on it. > However, I don't know how to create this sliding window. > Which is the best way to proceed? am I reinventing the wheel? > > Thanks in advance > > Marc > > -- > ----------------------------------------------------- > Marc Noguera i Julian, PhD > Genomics unit / Bioinformatics > Institut de Medicina Predictiva i Personalitzada > del C?ncer (IMPPC) > B-10 Office > Carretera de Can Ruti > Cam? de les Escoles s/n > 08916 Badalona, Barcelona > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/ > gmane.science.biology.informatics.conductor

ADD COMMENT • link 15.3 years ago Harris A. Jaffee ▴ 590

Login before adding your answer.