Reading Paired End Native Report Format in ShortRead

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.7 years ago

Hi, I am trying to read the alignments generated using NovoAlign. The format I have the data is Paired End Native Report Format(http://computing.bio.cam.ac.uk/local/doc/NovoCraftV2.06.pdf). What is the most efficient way to read this data into ShortRead? Since it is paired end data I have two files corresponding to the two sides. I tried without success using the different formats using readAligned(). I also read an earlier posting about it which suggests to convert it to SAM format. I would appreciate your suggestions. Cheers../Murli -- output of sessionInfo(): R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.14.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 [4] Rsamtools_1.8.5 lattice_0.20-6 Biostrings_2.24.1 [7] GenomicRanges_1.8.7 IRanges_1.14.4 BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 stats4_2.15.0 [6] tools_2.15.0 zlibbioc_1.2.0 -- Sent via the guest posting facility at bioconductor.org.

convert convert • 1.1k views

ADD COMMENT • link updated 11.9 years ago by Martin Morgan 25k • written 11.9 years ago by Guest User ★ 13k

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 28 days ago

United States

On 07/12/2012 11:42 AM, Murli Nair [guest] wrote: > > Hi, > > I am trying to read the alignments generated using NovoAlign. The format I have the data is Paired End Native Report Format(http://computing.bio.cam.ac.uk/local/doc/NovoCraftV2.06.pdf). > What is the most efficient way to read this data into ShortRead? Since it is paired end data I have two files corresponding to the two sides. > I tried without success using the different formats using readAligned(). I also read an earlier posting about it which suggests to convert it to SAM format. > I would appreciate your suggestions. From the document you reference Three output formats are provided. 1. Native 2. Extended Native 3. Pairwise 4. SAM If Paired End Native Report Format is 1 or 2 with a single record per line then I believe the only support for input would be as tab- delimited files (read.table and friends; these are flexible and could easily be used to iterate through a large file in a memory efficient way); you would then use an appropriate constructor, e.g., GenomicRanges::GappedAlignmentPairs, to create an object that you could manipulate. Format 3 looks challenging to parse. Generally, for aligned reads aim for BAM files, which is output format 4 followed by using Rsamtools or other with asBam, sortBam, indexBam to create a sorted bam file and index. use GenomicRanges::readGappedAlignmentPairs for many paired-end tasks. It might help to think a little further ahead about what you want to do, e.g., GenomicRanges::summarizeOverlaps would be useful in RNAseq differential expression to count reads in regions of interest, and would need bam files but would manage data input for you. Martin > Cheers../Murli > > > -- output of sessionInfo(): > > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] ShortRead_1.14.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 > [4] Rsamtools_1.8.5 lattice_0.20-6 Biostrings_2.24.1 > [7] GenomicRanges_1.8.7 IRanges_1.14.4 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 stats4_2.15.0 > [6] tools_2.15.0 zlibbioc_1.2.0 > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 11.9 years ago Martin Morgan 25k

0

Entering edit mode

Hi Martin, Thanks for your suggestions. I have tried for some time to parse the novoAlign output using read.table, with no success yet. I have just started using shortRead and have not worked with GenomicRanges yet. I have created two files of the paired end data containing ~500 lines of data that may be downloaded from http://bioinformatics.iusb.edu/seqSubset.tar . Would it be possible for your take a look at the format and guide me a little here? I would greatly appreciate it. Cheers../Murli On Thu, Jul 12, 2012 at 2:58 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: > On 07/12/2012 11:42 AM, Murli Nair [guest] wrote: >> >> >> Hi, >> >> I am trying to read the alignments generated using NovoAlign. The format I >> have the data is Paired End Native Report >> Format(http://computing.bio.cam.ac.uk/local/doc/NovoCraftV2.06.pdf). >> What is the most efficient way to read this data into ShortRead? Since it >> is paired end data I have two files corresponding to the two sides. >> I tried without success using the different formats using readAligned(). I >> also read an earlier posting about it which suggests to convert it to SAM >> format. >> I would appreciate your suggestions. > > > From the document you reference > > Three output formats are provided. > > 1. Native > > 2. Extended Native > > 3. Pairwise > > 4. SAM > > If Paired End Native Report Format is 1 or 2 with a single record per line > then I believe the only support for input would be as tab-delimited files > (read.table and friends; these are flexible and could easily be used to > iterate through a large file in a memory efficient way); you would then use > an appropriate constructor, e.g., GenomicRanges::GappedAlignmentPairs, to > create an object that you could manipulate. Format 3 looks challenging to > parse. > > Generally, for aligned reads aim for BAM files, which is output format 4 > followed by using Rsamtools or other with asBam, sortBam, indexBam to create > a sorted bam file and index. use GenomicRanges::readGappedAlignmentPairs for > many paired-end tasks. > > It might help to think a little further ahead about what you want to do, > e.g., GenomicRanges::summarizeOverlaps would be useful in RNAseq > differential expression to count reads in regions of interest, and would > need bam files but would manage data input for you. > > Martin > >> Cheers../Murli >> >> >> -- output of sessionInfo(): >> >> R version 2.15.0 (2012-03-30) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] ShortRead_1.14.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 >> [4] Rsamtools_1.8.5 lattice_0.20-6 Biostrings_2.24.1 >> [7] GenomicRanges_1.8.7 IRanges_1.14.4 BiocGenerics_0.2.0 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 >> stats4_2.15.0 >> [6] tools_2.15.0 zlibbioc_1.2.0 >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > >

ADD REPLY • link 11.9 years ago Murli ▴ 80

0

Entering edit mode

On 07/13/2012 11:17 AM, Murli wrote: > Hi Martin, > Thanks for your suggestions. I have tried for some time to parse the > novoAlign output using read.table, with no success yet. I have just > started using shortRead and have not worked with GenomicRanges yet. I > have created two files of the paired end data containing ~500 lines > of data that may be downloaded from > http://bioinformatics.iusb.edu/seqSubset.tar . Would it be possible > for your take a look at the format and guide me a little here? I > would greatly appreciate it. > Cheers../Murli Really I think the most straight-forward approach is to get novoAlign to generate SAM or BAM output; if SAM then use Rsamtools::asBam to convert to BAM. You can then read the data in with Rsamtools::scanBam, GenomicRanges::readGappedAlignments, and in devel GenomicRanges::readGappedAlignmentPairs. Otherwise you'll spend a lot of time working through the appropriate interpretation of the novoAlign fields. I'm attaching a script that might at least provide some inspiration if you really want to parse the current text format; there are some significant problems (see the FIXME's, for example) and some aspects of your files look weird, e.g., the second column contains 'S' indicating single-end alignments, but the intention seems to have been to align paired end data? Martin > > > > On Thu, Jul 12, 2012 at 2:58 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: >> On 07/12/2012 11:42 AM, Murli Nair [guest] wrote: >>> >>> >>> Hi, >>> >>> I am trying to read the alignments generated using NovoAlign. The format I >>> have the data is Paired End Native Report >>> Format(http://computing.bio.cam.ac.uk/local/doc/NovoCraftV2.06.pdf). >>> What is the most efficient way to read this data into ShortRead? Since it >>> is paired end data I have two files corresponding to the two sides. >>> I tried without success using the different formats using readAligned(). I >>> also read an earlier posting about it which suggests to convert it to SAM >>> format. >>> I would appreciate your suggestions. >> >> >> From the document you reference >> >> Three output formats are provided. >> >> 1. Native >> >> 2. Extended Native >> >> 3. Pairwise >> >> 4. SAM >> >> If Paired End Native Report Format is 1 or 2 with a single record per line >> then I believe the only support for input would be as tab-delimited files >> (read.table and friends; these are flexible and could easily be used to >> iterate through a large file in a memory efficient way); you would then use >> an appropriate constructor, e.g., GenomicRanges::GappedAlignmentPairs, to >> create an object that you could manipulate. Format 3 looks challenging to >> parse. >> >> Generally, for aligned reads aim for BAM files, which is output format 4 >> followed by using Rsamtools or other with asBam, sortBam, indexBam to create >> a sorted bam file and index. use GenomicRanges::readGappedAlignmentPairs for >> many paired-end tasks. >> >> It might help to think a little further ahead about what you want to do, >> e.g., GenomicRanges::summarizeOverlaps would be useful in RNAseq >> differential expression to count reads in regions of interest, and would >> need bam files but would manage data input for you. >> >> Martin >> >>> Cheers../Murli >>> >>> >>> -- output of sessionInfo(): >>> >>> R version 2.15.0 (2012-03-30) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] ShortRead_1.14.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 >>> [4] Rsamtools_1.8.5 lattice_0.20-6 Biostrings_2.24.1 >>> [7] GenomicRanges_1.8.7 IRanges_1.14.4 BiocGenerics_0.2.0 >>> >>> loaded via a namespace (and not attached): >>> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 >>> stats4_2.15.0 >>> [6] tools_2.15.0 zlibbioc_1.2.0 >>> >>> -- >>> Sent via the guest posting facility at bioconductor.org. >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> >> -- >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M1 B861 >> Phone: (206) 667-2793 >> >> -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 -------------- next part -------------- library(GenomicRanges) .novoAlign_input <- function(fl, filt, gal, ...) { ## input col.names <- c("qname", "type", "qseq", "qual", "status", "maps", "mapq", "rname", "pos", "strand", "mrnm", "mpos", "mstrand", "mismatches") df <- read.table(fl, sep="\t", fill=TRUE, col.names=col.names, stringsAsFactors=FALSE, ...) ## filter and coerce to GappedAlignments object gal(filt(df)) } .novoAlign_filter <- function(df, ...) { ## keep only uniquely aligning pairs on the forward or reverse strand keep <- (df$status == "U") & (df$strand %in% c("F", "R")) df[keep,] } .novoAlign_GappedAlignments <- function(df, ...) { ## coerce some information to GappedAlignments object with(df, { GappedAlignments(seqnames=rname, pos=pos, ## FIXME: not correct for indels cigar=paste0(nchar(qseq), "M"), strand=strand(c(F="+", R="-")[strand]), names=qname, qseq=DNAStringSet(qseq), qual=PhredQuality(qual)) }) } .novoAlign_GappedAlignmentPairs <- function(first, last, ...) { ## create GappedAlignmentPairs nms <- union(as.character(seqnames(first)), as.character(seqnames(last))) seqlevels(first) <- seqlevels(last) <- nms isProperPair <- rep(TRUE, length(first)) GappedAlignmentPairs(first, last, isProperPair) } fls <- dir(pattern="novoOutput$") pairs <- lapply(fls, .novoAlign_input, .novoAlign_filter, .novoAlign_GappedAlignments) ## FIXME: is this matching correct? o <- match(names(pairs[[1]]), names(pairs[[2]])) galp <- .novoAlign_GappedAlignmentPairs(pairs[[1]][!is.na(o)], pairs[[2]][o[!is.na(o)]])

ADD REPLY • link 11.8 years ago Martin Morgan 25k

0

Entering edit mode

Martin Thanks a lot for your time. I contacted the developers of Novoalign and they have a perl script to convert the novo output to sam called novo2sam.pl. I was able to convert the data I have to sam format using that. I shall try to use Rsamtools to convert it to Bam. I tried using samtools and the reference index file to convert it to Bam format which I did achieve. But for some reason it did not have the chromosome information when I tried to read it using readAligned(). Let me play with the suggestions you have just send me and see how it goes. Also, the my data while it is paired end data, my alignments are single read alignments. This is because I am looking at the structure of DNA in chromatin so the contacts are from different chromosomes. I presume it probably is inefficient to align paired-end data if they involve different chromosomes. I am not sure of this but that is the only reason I can think of. You may comment if you have any thoughts on this. Thanks again for script, I greatly appreciate it. Cheers../Murli On Sun, Jul 15, 2012 at 6:31 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: > On 07/13/2012 11:17 AM, Murli wrote: >> >> Hi Martin, >> Thanks for your suggestions. I have tried for some time to parse the >> novoAlign output using read.table, with no success yet. I have just >> started using shortRead and have not worked with GenomicRanges yet. I >> have created two files of the paired end data containing ~500 lines >> of data that may be downloaded from >> http://bioinformatics.iusb.edu/seqSubset.tar . Would it be possible >> for your take a look at the format and guide me a little here? I >> would greatly appreciate it. >> Cheers../Murli > > > Really I think the most straight-forward approach is to get novoAlign to > generate SAM or BAM output; if SAM then use Rsamtools::asBam to convert to > BAM. You can then read the data in with Rsamtools::scanBam, > GenomicRanges::readGappedAlignments, and in devel > GenomicRanges::readGappedAlignmentPairs. Otherwise you'll spend a lot of > time working through the appropriate interpretation of the novoAlign fields. > > I'm attaching a script that might at least provide some inspiration if you > really want to parse the current text format; there are some significant > problems (see the FIXME's, for example) and some aspects of your files look > weird, e.g., the second column contains 'S' indicating single-end > alignments, but the intention seems to have been to align paired end data? > > Martin > > > >> >> >> >> On Thu, Jul 12, 2012 at 2:58 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: >>> >>> On 07/12/2012 11:42 AM, Murli Nair [guest] wrote: >>>> >>>> >>>> >>>> Hi, >>>> >>>> I am trying to read the alignments generated using NovoAlign. The format >>>> I >>>> have the data is Paired End Native Report >>>> Format(http://computing.bio.cam.ac.uk/local/doc/NovoCraftV2.06.pdf). >>>> What is the most efficient way to read this data into ShortRead? Since >>>> it >>>> is paired end data I have two files corresponding to the two sides. >>>> I tried without success using the different formats using readAligned(). >>>> I >>>> also read an earlier posting about it which suggests to convert it to >>>> SAM >>>> format. >>>> I would appreciate your suggestions. >>> >>> >>> >>> From the document you reference >>> >>> Three output formats are provided. >>> >>> 1. Native >>> >>> 2. Extended Native >>> >>> 3. Pairwise >>> >>> 4. SAM >>> >>> If Paired End Native Report Format is 1 or 2 with a single record per >>> line >>> then I believe the only support for input would be as tab- delimited files >>> (read.table and friends; these are flexible and could easily be used to >>> iterate through a large file in a memory efficient way); you would then >>> use >>> an appropriate constructor, e.g., GenomicRanges::GappedAlignmentPairs, to >>> create an object that you could manipulate. Format 3 looks challenging to >>> parse. >>> >>> Generally, for aligned reads aim for BAM files, which is output format 4 >>> followed by using Rsamtools or other with asBam, sortBam, indexBam to >>> create >>> a sorted bam file and index. use GenomicRanges::readGappedAlignmentPairs >>> for >>> many paired-end tasks. >>> >>> It might help to think a little further ahead about what you want to do, >>> e.g., GenomicRanges::summarizeOverlaps would be useful in RNAseq >>> differential expression to count reads in regions of interest, and would >>> need bam files but would manage data input for you. >>> >>> Martin >>> >>>> Cheers../Murli >>>> >>>> >>>> -- output of sessionInfo(): >>>> >>>> R version 2.15.0 (2012-03-30) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>> [7] LC_PAPER=C LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices utils datasets methods base >>>> >>>> other attached packages: >>>> [1] ShortRead_1.14.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 >>>> [4] Rsamtools_1.8.5 lattice_0.20-6 Biostrings_2.24.1 >>>> [7] GenomicRanges_1.8.7 IRanges_1.14.4 BiocGenerics_0.2.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 >>>> stats4_2.15.0 >>>> [6] tools_2.15.0 zlibbioc_1.2.0 >>>> >>>> -- >>>> Sent via the guest posting facility at bioconductor.org. >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> >>> -- >>> Computational Biology / Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N. >>> PO Box 19024 Seattle, WA 98109 >>> >>> Location: Arnold Building M1 B861 >>> Phone: (206) 667-2793 >>> >>> > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > >

ADD REPLY • link 11.8 years ago Murli ▴ 80

Login before adding your answer.