Question: Coverage vs GC bias - how to?
0
gravatar for Marc Noguera
9.1 years ago by
Marc Noguera100
Marc Noguera100 wrote:
Hi all, I am trying to get some insight knowledge on some NGS data. I have an alignment file and an AlignedRead object from it, from which I can compute coverage. What I would like to know is if there is any bias on the coverage vs GC content. My idea is to run a sliding window of width N through the reference sequence I have the reads on and to compute, somehow, a kind of coverage coefficient from the coverage i get from IRanges and the AlignedRead object. As I know the coordinate of the sliding window the next step is to obtain, somehow, the sequence of the window and calculate the GC content. How could I extract the sequence from BSgenome within this sliding window and how can I create the sliding window. I see that I can use getSeq to obtain the sequence and the use alphabetByFrequency on it. However, I don't know how to create this sliding window. Which is the best way to proceed? am I reinventing the wheel? Thanks in advance Marc -- ----------------------------------------------------- Marc Noguera i Julian, PhD Genomics unit / Bioinformatics Institut de Medicina Predictiva i Personalitzada del C?ncer (IMPPC) B-10 Office Carretera de Can Ruti Cam? de les Escoles s/n 08916 Badalona, Barcelona
coverage bsgenome iranges • 1.5k views
ADD COMMENTlink modified 9.1 years ago by Harris A. Jaffee590 • written 9.1 years ago by Marc Noguera100
Answer: Coverage vs GC bias - how to?
0
gravatar for Steve Lianoglou
9.1 years ago by
Denali
Steve Lianoglou12k wrote:
Hi, Just a few comments: On Tue, Sep 7, 2010 at 4:59 AM, Marc Noguera <mnoguera at="" imppc.org=""> wrote: > Hi all, > > I am trying to get some insight knowledge on some NGS data. I have an > alignment file and an AlignedRead object from it, from which I can > compute coverage. > What I would like to know is if there is any bias on the coverage vs GC > content. > My idea is to run a sliding window of width N through the reference > sequence I have the reads on and to compute, somehow, a kind of coverage > coefficient from the coverage i get from IRanges and the AlignedRead object. For some ideas on performing sliding-window stuff over Rle's, read through this thread: http://thread.gmane.org/gmane.science.biology.informatics.conductor/28 491 > As I know the coordinate of the sliding window the next step is to > obtain, somehow, the sequence of the window and calculate the GC content. > > How could I extract the sequence from BSgenome within this sliding > window and how can I create the sliding window. I see that I can use > getSeq to obtain the sequence and the use alphabetByFrequency on it. While getSeq might work (I really never use that function), I think I'd rather use "Views", like so: R> library(BSgenome.Hsapiens.UCSC.hg18) R> chr <- unmasked(Hsapiens$chr10) R> ranges <- IRanges(start=c(100000, 101000, 102000), width=100) R> seqs <- Views(chr, ranges) R> alphabetFrequency(seqs) A C G T M R W S Y K V H D B N - + [1,] 18 26 23 33 0 0 0 0 0 0 0 0 0 0 0 0 0 [2,] 30 26 24 20 0 0 0 0 0 0 0 0 0 0 0 0 0 [3,] 38 23 18 21 0 0 0 0 0 0 0 0 0 0 0 0 0 > However, I don't know how to create this sliding window. > Which is the best way to proceed? am I reinventing the wheel? Maybe, but it's good practice to learn how to do anyway ;-) There have been several papers published about biases found in sequencing data, which you should probably read through. I think it's good to know how to do these things so that you can actually see which of the reported phenomena are happening in the dataset you're working on .. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD COMMENTlink written 9.1 years ago by Steve Lianoglou12k
Answer: Coverage vs GC bias - how to?
0
gravatar for Harris A. Jaffee
9.1 years ago by
United States
Harris A. Jaffee590 wrote:
I wonder if you are blurring a couple of concepts, and I'm not sure if bias is the right word, rather than an actual facet of the situation. You could get the GC content of your suite of reads fairly directly: a = alphabetFrequency(reads, baseOnly=TRUE) reads_GC_content = a[,"C"] + a[,"G"] But for the GC content of the alignments, you might do this (also from Biostrings): U: reference (must be unmasked, if masked) N: read/alignment width L: alignment starting locations lf = letterFrequencyInSlidingView(U, view.width=N, letters="CG") alignments_GC_content = lf[L,] Note that 'lf' holds the GC content of all subsequences of length N along your reference sequence. This may seem wasteful, but it can be faster to do all that, then subset it for the relevant subsequences, than to extract those subsequences, get their alphabetFrequency, etc. Let us know if we aren't getting to what you need. -Harris On Sep 7, 2010, at 4:59 AM, Marc Noguera wrote: > Hi all, > > I am trying to get some insight knowledge on some NGS data. I have an > alignment file and an AlignedRead object from it, from which I can > compute coverage. > What I would like to know is if there is any bias on the coverage > vs GC > content. > My idea is to run a sliding window of width N through the reference > sequence I have the reads on and to compute, somehow, a kind of > coverage > coefficient from the coverage i get from IRanges and the > AlignedRead object. > > As I know the coordinate of the sliding window the next step is to > obtain, somehow, the sequence of the window and calculate the GC > content. > > How could I extract the sequence from BSgenome within this sliding > window and how can I create the sliding window. I see that I can use > getSeq to obtain the sequence and the use alphabetByFrequency on it. > However, I don't know how to create this sliding window. > Which is the best way to proceed? am I reinventing the wheel? > > Thanks in advance > > Marc > > -- > ----------------------------------------------------- > Marc Noguera i Julian, PhD > Genomics unit / Bioinformatics > Institut de Medicina Predictiva i Personalitzada > del C?ncer (IMPPC) > B-10 Office > Carretera de Can Ruti > Cam? de les Escoles s/n > 08916 Badalona, Barcelona > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/ > gmane.science.biology.informatics.conductor
ADD COMMENTlink written 9.1 years ago by Harris A. Jaffee590
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 335 users visited in the last hour