Coverage vs GC bias - how to?
2
0
Entering edit mode
Marc Noguera ▴ 100
@marc-noguera-3883
Last seen 10.2 years ago
Hi all, I am trying to get some insight knowledge on some NGS data. I have an alignment file and an AlignedRead object from it, from which I can compute coverage. What I would like to know is if there is any bias on the coverage vs GC content. My idea is to run a sliding window of width N through the reference sequence I have the reads on and to compute, somehow, a kind of coverage coefficient from the coverage i get from IRanges and the AlignedRead object. As I know the coordinate of the sliding window the next step is to obtain, somehow, the sequence of the window and calculate the GC content. How could I extract the sequence from BSgenome within this sliding window and how can I create the sliding window. I see that I can use getSeq to obtain the sequence and the use alphabetByFrequency on it. However, I don't know how to create this sliding window. Which is the best way to proceed? am I reinventing the wheel? Thanks in advance Marc -- ----------------------------------------------------- Marc Noguera i Julian, PhD Genomics unit / Bioinformatics Institut de Medicina Predictiva i Personalitzada del C?ncer (IMPPC) B-10 Office Carretera de Can Ruti Cam? de les Escoles s/n 08916 Badalona, Barcelona
Coverage BSgenome BSgenome IRanges Coverage BSgenome BSgenome IRanges • 2.4k views
ADD COMMENT
0
Entering edit mode
@steve-lianoglou-2771
Last seen 21 months ago
United States
Hi, Just a few comments: On Tue, Sep 7, 2010 at 4:59 AM, Marc Noguera <mnoguera at="" imppc.org=""> wrote: > Hi all, > > I am trying to get some insight knowledge on some NGS data. I have an > alignment file and an AlignedRead object from it, from which I can > compute coverage. > What I would like to know is if there is any bias on the coverage vs GC > content. > My idea is to run a sliding window of width N through the reference > sequence I have the reads on and to compute, somehow, a kind of coverage > coefficient from the coverage i get from IRanges and the AlignedRead object. For some ideas on performing sliding-window stuff over Rle's, read through this thread: http://thread.gmane.org/gmane.science.biology.informatics.conductor/28 491 > As I know the coordinate of the sliding window the next step is to > obtain, somehow, the sequence of the window and calculate the GC content. > > How could I extract the sequence from BSgenome within this sliding > window and how can I create the sliding window. I see that I can use > getSeq to obtain the sequence and the use alphabetByFrequency on it. While getSeq might work (I really never use that function), I think I'd rather use "Views", like so: R> library(BSgenome.Hsapiens.UCSC.hg18) R> chr <- unmasked(Hsapiens$chr10) R> ranges <- IRanges(start=c(100000, 101000, 102000), width=100) R> seqs <- Views(chr, ranges) R> alphabetFrequency(seqs) A C G T M R W S Y K V H D B N - + [1,] 18 26 23 33 0 0 0 0 0 0 0 0 0 0 0 0 0 [2,] 30 26 24 20 0 0 0 0 0 0 0 0 0 0 0 0 0 [3,] 38 23 18 21 0 0 0 0 0 0 0 0 0 0 0 0 0 > However, I don't know how to create this sliding window. > Which is the best way to proceed? am I reinventing the wheel? Maybe, but it's good practice to learn how to do anyway ;-) There have been several papers published about biases found in sequencing data, which you should probably read through. I think it's good to know how to do these things so that you can actually see which of the reported phenomena are happening in the dataset you're working on .. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD COMMENT
0
Entering edit mode
@harris-a-jaffee-3972
Last seen 10.1 years ago
United States
I wonder if you are blurring a couple of concepts, and I'm not sure if bias is the right word, rather than an actual facet of the situation. You could get the GC content of your suite of reads fairly directly: a = alphabetFrequency(reads, baseOnly=TRUE) reads_GC_content = a[,"C"] + a[,"G"] But for the GC content of the alignments, you might do this (also from Biostrings): U: reference (must be unmasked, if masked) N: read/alignment width L: alignment starting locations lf = letterFrequencyInSlidingView(U, view.width=N, letters="CG") alignments_GC_content = lf[L,] Note that 'lf' holds the GC content of all subsequences of length N along your reference sequence. This may seem wasteful, but it can be faster to do all that, then subset it for the relevant subsequences, than to extract those subsequences, get their alphabetFrequency, etc. Let us know if we aren't getting to what you need. -Harris On Sep 7, 2010, at 4:59 AM, Marc Noguera wrote: > Hi all, > > I am trying to get some insight knowledge on some NGS data. I have an > alignment file and an AlignedRead object from it, from which I can > compute coverage. > What I would like to know is if there is any bias on the coverage > vs GC > content. > My idea is to run a sliding window of width N through the reference > sequence I have the reads on and to compute, somehow, a kind of > coverage > coefficient from the coverage i get from IRanges and the > AlignedRead object. > > As I know the coordinate of the sliding window the next step is to > obtain, somehow, the sequence of the window and calculate the GC > content. > > How could I extract the sequence from BSgenome within this sliding > window and how can I create the sliding window. I see that I can use > getSeq to obtain the sequence and the use alphabetByFrequency on it. > However, I don't know how to create this sliding window. > Which is the best way to proceed? am I reinventing the wheel? > > Thanks in advance > > Marc > > -- > ----------------------------------------------------- > Marc Noguera i Julian, PhD > Genomics unit / Bioinformatics > Institut de Medicina Predictiva i Personalitzada > del C?ncer (IMPPC) > B-10 Office > Carretera de Can Ruti > Cam? de les Escoles s/n > 08916 Badalona, Barcelona > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/ > gmane.science.biology.informatics.conductor
ADD COMMENT

Login before adding your answer.

Traffic: 968 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6