breastCancerNKI patients question
2
0
Entering edit mode
Xavier Robin ▴ 30
@xavier-robin-4928
Last seen 9.6 years ago
Hello everyone! According to the documentation of package breastCancerNKI, nki is a merge of two datastets by van?t Veer et al. and van de Vijver et al. We should have 117 patients for the former, and 295 for the latter, summing up to 412 patients. However the nki dataset contains only 337 patients: > dim(pData(nki)) [1] 337 21 > dim(exprs(nki)) [1] 24481 337 Apparently the "series" annotation specifies which dataset the patients belong to: > table(pData(nki)$series) NKI NKI2 117 220 So we would have all 117 patients from van?t Veer et al, but only 220 of the 295 patients of van de Vijver et al. Is it correct? * If yes, where are the 75 remaining patients? * If no, what did I misunderstand? Thanks, Xavier
Annotation Annotation • 1.5k views
ADD COMMENT
0
Entering edit mode
@sean-davis-490
Last seen 3 months ago
United States
On Tue, Oct 25, 2011 at 7:38 AM, Xavier Robin <xavier.robin at="" unige.ch=""> wrote: > Hello everyone! > > > According to the documentation of package breastCancerNKI, nki is a > merge of two datastets by van?t Veer et al. and van de Vijver et al. We > should have 117 patients for the former, and 295 for the latter, summing > up to 412 patients. > > However the nki dataset contains only 337 patients: > >> dim(pData(nki)) > [1] 337 ?21 >> dim(exprs(nki)) > [1] 24481 ? 337 > > Apparently the "series" annotation specifies which dataset the patients > belong to: > >> table(pData(nki)$series) > > ?NKI NKI2 > ?117 ?220 > > So we would have all 117 patients from van?t Veer et al, but only 220 of > the 295 patients of van de Vijver et al. > > Is it correct? > * If yes, where are the 75 remaining patients? > * If no, what did I misunderstand? There was overlap between the van't Veer and van de Vijver datasets, so the numbers are for unique patients only, I believe. Sean
ADD COMMENT
0
Entering edit mode
On 10/25/2011 12:47 PM, Sean Davis wrote: > On Tue, Oct 25, 2011 at 7:38 AM, Xavier Robin<xavier.robin at="" unige.ch=""> wrote: >> Hello everyone! >> >> >> According to the documentation of package breastCancerNKI, nki is a >> merge of two datastets by van?t Veer et al. and van de Vijver et al. We >> should have 117 patients for the former, and 295 for the latter, summing >> up to 412 patients. >> >> However the nki dataset contains only 337 patients: >> >>> dim(pData(nki)) >> [1] 337 21 >>> dim(exprs(nki)) >> [1] 24481 337 >> >> Apparently the "series" annotation specifies which dataset the patients >> belong to: >> >>> table(pData(nki)$series) >> NKI NKI2 >> 117 220 >> >> So we would have all 117 patients from van?t Veer et al, but only 220 of >> the 295 patients of van de Vijver et al. >> >> Is it correct? >> * If yes, where are the 75 remaining patients? >> * If no, what did I misunderstand? > There was overlap between the van't Veer and van de Vijver datasets, > so the numbers are for unique patients only, I believe. Yes, the duplicate patients were removed which resulted in 337 unique patients for both studies. Regards, Markus
ADD REPLY
0
Entering edit mode
Le 25. 10. 11 13:59, Markus Schroeder a ?crit : > On 10/25/2011 12:47 PM, Sean Davis wrote: >> There was overlap between the van't Veer and van de Vijver datasets, >> so the numbers are for unique patients only, I believe. > > Yes, the duplicate patients were removed which resulted in 337 unique > patients for both studies. Thank you Sean and Markus for your answers. Do you know which are these duplicate patients? Thanks again, Xavier
ADD REPLY
0
Entering edit mode
It was a complicated process of curation and unfortunately we did not keep track of all the changes. However, you can easily identify those samples thanks to the "series" slot in the phenodata: NKI and NKI2 stand for the first and second papers respectively. We removed samples from the first series in priority. I hope this helps. --- Benjamin Haibe-Kains Computational Biology and Functional Genomics Laboratory Center for Cancer Computational Biology Harvard School of Public Health, Dana-Farber Cancer Institute (SM824) Phone: +1 (617) 582 7267 Fax: +1 (617) 582 7760 Email: bhaibeka@jimmy.harvard.edu Webpage: http://www.hsph.harvard.edu/research/benjamin-haibe-kains/ On Oct 25, 2011, at 11:51 AM, Xavier Robin wrote: > Le 25. 10. 11 13:59, Markus Schroeder a écrit : >> On 10/25/2011 12:47 PM, Sean Davis wrote: >>> There was overlap between the van't Veer and van de Vijver datasets, >>> so the numbers are for unique patients only, I believe. >> >> Yes, the duplicate patients were removed which resulted in 337 unique >> patients for both studies. > > Thank you Sean and Markus for your answers. > > Do you know which are these duplicate patients? > > Thanks again, > Xavier [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
dear list, the following three lines allow one to count overlaps of aligned short-reads with annotations: aln <- readGappedAlignments("somebamfile.bam") txdb <- makeTranscriptFromUCSC(genome="hg19", tablename="ensGene") ensGenes <- exonsBy(txdb, by="gene") ov <- countGenomicOverlaps(aln, ensGenes) then i want to get read-counts per gene and the first thing that comes to my head is doing: counts <- sapply(ov, function(x) sum(values(x)[["hits"]])) which goes through every gene and adds up the "hits" of its exons. however, this latter step of "just adding" takes longer than the actual calculation of the hits with countGenomicOverlaps() and i guess that there are more efficient ways to approach this, probably something around "reducing the hits value column". i've been looking at rdapply() and reduce() and googled too, but couldn't find anything, so i look forward to your suggestions. thanks!! robert.
ADD REPLY
0
Entering edit mode
On 10/25/2011 05:35 PM, Robert Castelo wrote: > dear list, > > the following three lines allow one to count overlaps of aligned > short-reads with annotations: > > aln <- readGappedAlignments("somebamfile.bam") > txdb <- makeTranscriptFromUCSC(genome="hg19", tablename="ensGene") > ensGenes <- exonsBy(txdb, by="gene") > ov <- countGenomicOverlaps(aln, ensGenes) > > then i want to get read-counts per gene and the first thing that comes > to my head is doing: > > counts <- sapply(ov, function(x) sum(values(x)[["hits"]])) > > which goes through every gene and adds up the "hits" of its exons. > however, this latter step of "just adding" takes longer than the actual > calculation of the hits with countGenomicOverlaps() and i guess that > there are more efficient ways to approach this, probably something > around "reducing the hits value column". i've been looking at rdapply() > and reduce() and googled too, but couldn't find anything, so i look > forward to your suggestions. Hi Robert -- A strategy here is along the lines (untested!) of hits <- values(unlist(ov)))[["hits"]] genes <- rep(names(ov), elementLengths(ov)) counts <- sapply(split(hits, genes), sum) but you'll want to make sure that this is a conceptually sensible way of counting hits per gene. The countGenomicOverlaps function has been heavily updated to summarizeOverlaps in 'devel'; you might want to check that out, or at least the vignette for summarizeOverlaps at http://bioconductor.org/packages/devel/bioc/html/GenomicRanges.html Martin > > thanks!! > robert. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793
ADD REPLY
0
Entering edit mode
hi, On 10/27/11 12:07 AM, Martin Morgan wrote: > On 10/25/2011 05:35 PM, Robert Castelo wrote: >> dear list, >> >> the following three lines allow one to count overlaps of aligned >> short-reads with annotations: >> >> aln <- readGappedAlignments("somebamfile.bam") >> txdb <- makeTranscriptFromUCSC(genome="hg19", tablename="ensGene") >> ensGenes <- exonsBy(txdb, by="gene") >> ov <- countGenomicOverlaps(aln, ensGenes) >> >> then i want to get read-counts per gene and the first thing that comes >> to my head is doing: >> >> counts <- sapply(ov, function(x) sum(values(x)[["hits"]])) >> >> which goes through every gene and adds up the "hits" of its exons. >> however, this latter step of "just adding" takes longer than the actual >> calculation of the hits with countGenomicOverlaps() and i guess that >> there are more efficient ways to approach this, probably something >> around "reducing the hits value column". i've been looking at rdapply() >> and reduce() and googled too, but couldn't find anything, so i look >> forward to your suggestions. > > Hi Robert -- A strategy here is along the lines (untested!) of > > hits <- values(unlist(ov)))[["hits"]] > genes <- rep(names(ov), elementLengths(ov)) > counts <- sapply(split(hits, genes), sum) beautiful and fast, i've just tested it with the ~ 50,000 Ensembl genes and the execution time is nearly negligible, less than half a second in my laptop. > but you'll want to make sure that this is a conceptually sensible way of > counting hits per gene. i'm aiming at counting exonic-reads and adding them up at gene level. the function countGenomicOverlaps() allows me to tune the part of the logic related to counting exonic reads, but is there any function that would allow me to tune the summing of exonic reads? > The countGenomicOverlaps function has been heavily updated to > summarizeOverlaps in 'devel'; you might want to check that out, or at > least the vignette for summarizeOverlaps at > > http://bioconductor.org/packages/devel/bioc/html/GenomicRanges.html that sounds great, i'll upgrade next week when it becomes release. thanks again! robert.
ADD REPLY
0
Entering edit mode
Hi Robert, On 10/26/2011 03:34 PM, Robert Castelo wrote: > hi, > > On 10/27/11 12:07 AM, Martin Morgan wrote: >> On 10/25/2011 05:35 PM, Robert Castelo wrote: >>> dear list, >>> >>> the following three lines allow one to count overlaps of aligned >>> short-reads with annotations: >>> >>> aln <- readGappedAlignments("somebamfile.bam") >>> txdb <- makeTranscriptFromUCSC(genome="hg19", tablename="ensGene") >>> ensGenes <- exonsBy(txdb, by="gene") >>> ov <- countGenomicOverlaps(aln, ensGenes) >>> >>> then i want to get read-counts per gene and the first thing that comes >>> to my head is doing: >>> >>> counts <- sapply(ov, function(x) sum(values(x)[["hits"]])) >>> >>> which goes through every gene and adds up the "hits" of its exons. >>> however, this latter step of "just adding" takes longer than the actual >>> calculation of the hits with countGenomicOverlaps() and i guess that >>> there are more efficient ways to approach this, probably something >>> around "reducing the hits value column". i've been looking at rdapply() >>> and reduce() and googled too, but couldn't find anything, so i look >>> forward to your suggestions. >> >> Hi Robert -- A strategy here is along the lines (untested!) of >> >> hits <- values(unlist(ov)))[["hits"]] >> genes <- rep(names(ov), elementLengths(ov)) >> counts <- sapply(split(hits, genes), sum) > > beautiful and fast, i've just tested it with the ~ 50,000 Ensembl > genes and the execution time is nearly negligible, less than half a > second in my laptop. > >> but you'll want to make sure that this is a conceptually sensible way of >> counting hits per gene. > > i'm aiming at counting exonic-reads and adding them up at gene level. > the function countGenomicOverlaps() allows me to tune the part of the > logic related to counting exonic reads, but is there any function that > would allow me to tune the summing of exonic reads? How do you want to tune the summing? If you want to sum all exons by gene the piece of code Martin provided should do it. Was there another goal you had in mind? Valerie > >> The countGenomicOverlaps function has been heavily updated to >> summarizeOverlaps in 'devel'; you might want to check that out, or at >> least the vignette for summarizeOverlaps at >> >> http://bioconductor.org/packages/devel/bioc/html/GenomicRanges.html > > that sounds great, i'll upgrade next week when it becomes release. > > thanks again! > robert. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Hi Valerie, On 10/27/11 7:53 PM, Valerie Obenchain wrote: > Hi Robert, > > > On 10/26/2011 03:34 PM, Robert Castelo wrote: >> hi, >> >> On 10/27/11 12:07 AM, Martin Morgan wrote: >>> On 10/25/2011 05:35 PM, Robert Castelo wrote: >>>> dear list, >>>> >>>> the following three lines allow one to count overlaps of aligned >>>> short-reads with annotations: >>>> >>>> aln <- readGappedAlignments("somebamfile.bam") >>>> txdb <- makeTranscriptFromUCSC(genome="hg19", tablename="ensGene") >>>> ensGenes <- exonsBy(txdb, by="gene") >>>> ov <- countGenomicOverlaps(aln, ensGenes) >>>> >>>> then i want to get read-counts per gene and the first thing that comes >>>> to my head is doing: >>>> >>>> counts <- sapply(ov, function(x) sum(values(x)[["hits"]])) >>>> >>>> which goes through every gene and adds up the "hits" of its exons. >>>> however, this latter step of "just adding" takes longer than the actual >>>> calculation of the hits with countGenomicOverlaps() and i guess that >>>> there are more efficient ways to approach this, probably something >>>> around "reducing the hits value column". i've been looking at rdapply() >>>> and reduce() and googled too, but couldn't find anything, so i look >>>> forward to your suggestions. >>> >>> Hi Robert -- A strategy here is along the lines (untested!) of >>> >>> hits <- values(unlist(ov)))[["hits"]] >>> genes <- rep(names(ov), elementLengths(ov)) >>> counts <- sapply(split(hits, genes), sum) >> >> beautiful and fast, i've just tested it with the ~ 50,000 Ensembl >> genes and the execution time is nearly negligible, less than half a >> second in my laptop. >> >>> but you'll want to make sure that this is a conceptually sensible way of >>> counting hits per gene. >> >> i'm aiming at counting exonic-reads and adding them up at gene level. >> the function countGenomicOverlaps() allows me to tune the part of the >> logic related to counting exonic reads, but is there any function that >> would allow me to tune the summing of exonic reads? > > How do you want to tune the summing? If you want to sum all exons by > gene the piece of code Martin provided should do it. Was there another > goal you had in mind? well, Martin was warning me about whether just adding up exon counts per gene makes complete sense and i guess the problem might arise with overlapping exons of different transcripts since then by just summing i'd be adding some counts twice. so i was wondering whether something existed to address this summing along the lines of what is available about the decision logic options for counting reads overlapping annotations with countGenomicOverlaps(). cheers, robert.
ADD REPLY
0
Entering edit mode
On 10/27/2011 12:44 PM, Robert Castelo wrote: > Hi Valerie, > > On 10/27/11 7:53 PM, Valerie Obenchain wrote: >> Hi Robert, >> >> >> On 10/26/2011 03:34 PM, Robert Castelo wrote: >>> hi, >>> >>> On 10/27/11 12:07 AM, Martin Morgan wrote: >>>> On 10/25/2011 05:35 PM, Robert Castelo wrote: >>>>> dear list, >>>>> >>>>> the following three lines allow one to count overlaps of aligned >>>>> short-reads with annotations: >>>>> >>>>> aln <- readGappedAlignments("somebamfile.bam") >>>>> txdb <- makeTranscriptFromUCSC(genome="hg19", tablename="ensGene") >>>>> ensGenes <- exonsBy(txdb, by="gene") >>>>> ov <- countGenomicOverlaps(aln, ensGenes) >>>>> >>>>> then i want to get read-counts per gene and the first thing that >>>>> comes >>>>> to my head is doing: >>>>> >>>>> counts <- sapply(ov, function(x) sum(values(x)[["hits"]])) >>>>> >>>>> which goes through every gene and adds up the "hits" of its exons. >>>>> however, this latter step of "just adding" takes longer than the >>>>> actual >>>>> calculation of the hits with countGenomicOverlaps() and i guess that >>>>> there are more efficient ways to approach this, probably something >>>>> around "reducing the hits value column". i've been looking at >>>>> rdapply() >>>>> and reduce() and googled too, but couldn't find anything, so i look >>>>> forward to your suggestions. >>>> >>>> Hi Robert -- A strategy here is along the lines (untested!) of >>>> >>>> hits <- values(unlist(ov)))[["hits"]] >>>> genes <- rep(names(ov), elementLengths(ov)) >>>> counts <- sapply(split(hits, genes), sum) >>> >>> beautiful and fast, i've just tested it with the ~ 50,000 Ensembl >>> genes and the execution time is nearly negligible, less than half a >>> second in my laptop. >>> >>>> but you'll want to make sure that this is a conceptually sensible >>>> way of >>>> counting hits per gene. >>> >>> i'm aiming at counting exonic-reads and adding them up at gene level. >>> the function countGenomicOverlaps() allows me to tune the part of the >>> logic related to counting exonic reads, but is there any function that >>> would allow me to tune the summing of exonic reads? >> >> How do you want to tune the summing? If you want to sum all exons by >> gene the piece of code Martin provided should do it. Was there another >> goal you had in mind? > > well, Martin was warning me about whether just adding up exon counts > per gene makes complete sense and i guess the problem might arise with > overlapping exons of different transcripts since then by just summing > i'd be adding some counts twice. so i was wondering whether something > existed to address this summing along the lines of what is available > about the decision logic options for counting reads overlapping > annotations with countGenomicOverlaps(). Both the counting and summarizing problems have been addressed in the summarizedOverlaps function Martin mentioned. The function has different "modes" that determine how to assign a read that overlaps multiple features. This function does not double count so a read is either assigned according to the rules of the "mode" or it is discarded. In the end you have counts per feature. A feature can be represented by a single range (ie, single exon, transcript or gene) or by multiple ranges (ie, exons in a gene, transcripts in a gene, etc.). I'd be interested in feedback once you've had a chance to use the function. Valerie > > cheers, > robert.
ADD REPLY
0
Entering edit mode
Xavier Robin ▴ 30
@xavier-robin-4928
Last seen 9.6 years ago
Le 25. 10. 11 18:54, Benjamin Haibe-Kains a ?crit : > It was a complicated process of curation and unfortunately we did not keep track of all the changes. However, you can easily identify those samples thanks to the "series" slot in the phenodata: NKI and NKI2 stand for the first and second papers respectively. We removed samples from the first series in priority. > > I hope this helps. Yes it does! Thank you Benjamin.
ADD COMMENT

Login before adding your answer.

Traffic: 1053 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6