Question

error in function reduce from girafe

0

Entering edit mode

Andreia Fonseca ▴ 810

@andreia-fonseca-3796

Last seen 7.2 years ago

Dear Joern, I decided to follow your suggestion, and I am trying to use girafe. As the alignment file that I have as one line for each hit, I have prepared another file which has an extra column which is the number of matches and only has one row for each sequence. V1 V2 V3 V4 V5 V6 V7 V8 V9 1 TTTCTAATGAGCCCAGGGAGGGCTAGA 27 + 5 43076512 43076539 100 27 1 2 ATAACTGTAGAGGCAAGC 18 - 8 141533705 141533723 100 18 1 3 CTGAAGGGTGGATAAATTGG 20 - 22 40929788 40929808 100 20 1 4 ACTGATTGGGCTAGG 15 - 16 28567043 28567058 100 15 1 5 GCGCGTCGCCATGGAGCCCGACG 23 - 19 36605745 36605768 100 23 1 6 TGTTCTGGAACGGGCCGAGC 20 + 15 83680515 83680535 100 20 1 I have created an object called data using read.table and then converted it into an AlignedGenomeIntervals object using the following command: A<-AlignedGenomeIntervals(start=data$V5, end =data$V6, chromosome=data$V4, strand=data$V3, sequence=as.character(data$V1), matches=data$V9) organism(A)<-"Hs" and then I called the reduce B<-reduce(A) and then I am getting the following error: Error in .local(x, ...) : invalid consensus matrix 'x' (some columns do not sum to 1). Please make sure 'x' was obtained by a call to consensusMatrix(..., as.prob=TRUE) can you explain me what this error means? with kind regards, Andreia On Fri, May 14, 2010 at 4:31 PM, Joern Toedling <joern.toedling@curie.fr>wrote: > Hello, > > I leave it to the IRanges developers to point out the quickest way how to > find > such overlaps using IRanges, but my guess is that you need to create > 'RangedData' objects and use the function findOverlaps then. > > However, sorry for the shameless plug, the package 'girafe' from the latest > Bioconductor release can also be used to answer such kinds of questions. > Have > a look at the vignette for some use cases. Basically you need to create two > objects: > 1. an object of class 'AlignedGenomeIntervals' from your aligned sequences. > the manual page of that class and the vignette show how to do this, but > it's > easy given the data.frame that you already have when you read your table > into > R using read.table. > 2. an object of class 'Genome_intervals_stranded' of your genomic > annotation. > For example, the function 'readGff3' from package 'genomeIntervals' can be > used to create such an object from a gff (version 3) file containing such > annotation. > When you have those two objects, the function 'interval_overlap' will give > you > overlaps of any kind (>= 1nt) between those two, and 'fracOverlap' can be > used > to get overlaps based on additional restrictions that you specify. > How to use 'girafe' for finding overlaps is also shown in the vignette. > And there is also a coercion method between AlignedGenomeIntervals objects > and > RangedData for using IRanges methods later on. > > Hope that helps, > Joern > > PS: There is an additional mailing list 'bioc-sig-sequencing' which may be > more appropriate for this kind of question. > > On Fri, 14 May 2010 13:43:15 +0100, Andreia Fonseca wrote > > Dear List, > > > > I have a file with the hits of my sequences of small RNA (18-30bp) > > in the human genome and I have downloaded the all the annotation of > > the human genome from UCSC. What I want is to annotate my sequences > > by finding ovelaping between the positions of my sequences the the > information > > available from the tables I have downloaded from UCSC. So in the > > file which maps my sequences (produced using microRazers) in the > > human genome I have the folowing structure: > > > > sequence sequence length strand chromosome start end score alignment > > length > > > > I don't want to do this with biomart, because it will be too slow > > making all the queries. However I have found the package IRanges, > > which has the overlap function, but I am not understanding how the > > two tables - the query and the target tables - should be stored and > > how to make the overlapping. Can someone give me a hint? With kind > > regards, Andreia > > > > -- > > -------------------------------------------- > > Andreia J. Amaral > > Unidade de Imunologia Clínica > > Instituto de Medicina Molecular > > Universidade de Lisboa > > email: andreiaamaral@fm.ul.pt > > andreia.fonseca@gmail.com > > > > [[alternative HTML version deleted]] > > > --- > Joern Toedling > Institut Curie -- U900 > 26 rue d'Ulm, 75005 Paris, FRANCE > Tel. +33 (0)156246927 > > -- -------------------------------------------- Andreia J. Amaral Unidade de Imunologia Clínica Instituto de Medicina Molecular Universidade de Lisboa email: andreiaamaral@fm.ul.pt andreia.fonseca@gmail.com [[alternative HTML version deleted]]

Annotation annotate biomaRt IRanges girafe Annotation annotate biomaRt IRanges girafe • 1.3k views

ADD COMMENT • link updated 13.8 years ago by Joern Toedling ▴ 450 • written 13.8 years ago by Andreia Fonseca ▴ 810

score 0 · Answer 1 · 2010-06-24

0

Entering edit mode

David Lyon ▴ 340

@david-lyon-4016

Last seen 2.4 years ago

United States

Hello can someone show me how to add metadata to a FC file eg: If I wanted to add $patient_id = 100 to an existing FC file whats the exact code to do this? also if I wanted to modify an existing value whats the exact code ? Thanks alot

ADD COMMENT • link 13.8 years ago David Lyon ▴ 340

score 0 · Answer 2 · 2010-06-24

Hi Andreia, when "reducing" overlapping aligned reads, also the consensus sequence of the merged reads is determined, using the function consensusString from package Biostrings. And for some reason this step does not work for you. I cannot be completely sure but I think there may be an issue with the way that you generate the AlignedGenomeIntervals object. The recommended way is to use package ShortRead first to generate an AlignedRead object and then coercing that object into an AlignedGenomeIntervals object. However, creating it manually is possible but more prone to such problems. However, thank you very much for trying it and reporting the results, maybe I need to implement an additional check of the object to provide users in this case with a more meaningful error message. I think the problem here is that your specified read sequence always seems to be one shorter than the specified interval lengths. For example, in the first row the sequence TTTCTAATGAGCCCAGGGAGGGCTAGA has 27 letters but the interval goes from position 43076512 to 43076539, which is 28 bases. So maybe your alignment coordinates are in half-open interval notation, in which case the interval actually ends at position 43076538. Then the argument 'end' should be: end=data$V6-1L Also note that the coordinates in girafe are 1-based, so you may need to adjust the alignment coordinates accordingly. It could also be an issue with factors in your data.frame 'data' (which btw. is not a very good name since there also is a basic R function of that name), especially the strand could be a candidate here. When you read in data using read.table it's generally a good idea to use the argument stringsAsFactors=FALSE to make sure there are no factor-related errors later on unless you specifically want to make use of factors in your data.frame. Please let me know if these suggested changes resolve the problem. Otherwise, I would be grateful if you can send me a small excerpt of the file that you read in with read.table off-list and I can have a closer look at it. Cheers, Joern On Thu, 24 Jun 2010 15:18:10 +0100, Andreia Fonseca wrote > Dear Joern, > > I decided to follow your suggestion, and I am trying to use girafe. As the alignment file that I have as one line for each hit, I have prepared another file which has an extra column which is the number of matches and only has one row for each sequence. > ?V1 V2 V3 V4 V5??????? V6??????? V7? V8 V9 > 1 TTTCTAATGAGCCCAGGGAGGGCTAGA?? 27? +? 5? 43076512? 43076539 100 27?? 1 > 2????????? ATAACTGTAGAGGCAAGC? 18? -? 8 141533705 141533723 100 18?? 1 > 3??????? CTGAAGGGTGGATAAATTGG?? 20? - 22? 40929788? 40929808 100 20?? 1 > 4???????????? ACTGATTGGGCTAGG?? 15? - 16? 28567043? 28567058 100 15?? 1 > 5???? GCGCGTCGCCATGGAGCCCGACG?? 23? - 19? 36605745? 36605768 100 23?? 1 > 6??????? TGTTCTGGAACGGGCCGAGC?? 20? + 15? 83680515? 83680535 100 20?? 1 > > I have created an object called data using read.table and then converted it into an AlignedGenomeIntervals object using the following command: > > A<-AlignedGenomeIntervals(start=data$V5, end =data$V6, chromosome=data$V4, strand=data$V3, sequence=as.character(data$V1), matches=data$V9) > > ?organism(A)<-"Hs" > > and then I called the reduce > > ?B<-reduce(A) > > and then I am getting the following error: > > Error in .local(x, ...) : > ? invalid consensus matrix 'x' (some columns do not sum to 1). > ? Please make sure 'x' was obtained by a call to consensusMatrix(...., as.prob=TRUE) > > can you explain me what this error means? > > with kind regards, > Andreia > --- Joern Toedling Institut Curie -- U900 26 rue d'Ulm, 75005 Paris, FRANCE Tel. +33 (0)156246927

score 0 · Answer 3 · 2010-06-25

Hi Andreia, please keep the discussion on the list, in order to have the answers stored in the mailing list archives. For reading in GFF (version 3) files, I am using the function "readGff3" from package genomeIntervals. However, if you look at the code of that function (simply type in 'readGff3' without brackets), you see that the process is straightforward. The file is first read into R (using funcion 'scan') and then a "Genome_intervals_stranded" object is constructed from the result. Since the format GTF is very similar to GFF, you can likely use the same source code line-by-line with only minor changes. For example, in the attributes field, I think that GFF requires a "=" between attribute and value and GTF a whitespace. So after reading in the file with "scan", you might want to replace those whitespaces at those positions by equal signs, using for example the function 'gsub'. HTH, Joern On Fri, 25 Jun 2010 13:16:55 +0100, Andreia Fonseca wrote > Dear Joern, > > Thank you for your quick reply. Now its working well!!!! I have one more question. Concerning the annotation file, I have a gtf file with all the annotation of the human genome. I saw in your annotation script that you show how to read gff files. Can you give me a hint on how to read the gtf file. > Thanks > with kind regards, > Andreia > > On Thu, Jun 24, 2010 at 4:26 PM, Joern Toedling <joern.toedling at="" curie.fr=""> wrote: > Hi Andreia, > > when "reducing" overlapping aligned reads, also the consensus sequence of the > merged reads is determined, using the function consensusString from package > Biostrings. And for some reason this step does not work for you. > > I cannot be completely sure but I think there may be an issue with the way > that you generate the AlignedGenomeIntervals object. The recommended way is > to use package ShortRead first to generate an AlignedRead object and then > coercing that object into an AlignedGenomeIntervals object. However, creating > it manually is possible but more prone to such problems. > However, thank you very much for trying it and reporting the results, maybe I > need to implement an additional check of the object to provide users in this > case with a more meaningful error message. > > I think the problem here is that your specified read sequence always seems to > be one shorter than the specified interval lengths. > For example, in the first row the sequence > TTTCTAATGAGCCCAGGGAGGGCTAGA > has 27 letters but the interval goes from position 43076512 to 43076539, which > is 28 bases. So maybe your alignment coordinates are in half-open interval > notation, in which case the interval actually ends at position 43076538. Then > the argument 'end' should be: > end=data$V6-1L > Also note that the coordinates in girafe are 1-based, so you may need to > adjust the alignment coordinates accordingly. > > It could also be an issue with factors in your data.frame 'data' (which btw. > is not a very good name since there also is a basic R function of that name), > especially the strand could be a candidate here. When you read in data using > read.table it's generally a good idea to use the argument > stringsAsFactors=FALSE to make sure there are no factor-related > errors later on unless you specifically want to make use of factors in your > data.frame. > > Please let me know if these suggested changes resolve the problem. Otherwise, > I would be grateful if you can send me a small excerpt of the file that you > read in with read.table off-list and I can have a closer look at it. > > Cheers, > Joern