breastCancerNKI patients question

0

Entering edit mode

Xavier Robin ▴ 30

@xavier-robin-4928

Last seen 9.6 years ago

Hello everyone! According to the documentation of package breastCancerNKI, nki is a merge of two datastets by van?t Veer et al. and van de Vijver et al. We should have 117 patients for the former, and 295 for the latter, summing up to 412 patients. However the nki dataset contains only 337 patients: > dim(pData(nki)) [1] 337 21 > dim(exprs(nki)) [1] 24481 337 Apparently the "series" annotation specifies which dataset the patients belong to: > table(pData(nki)$series) NKI NKI2 117 220 So we would have all 117 patients from van?t Veer et al, but only 220 of the 295 patients of van de Vijver et al. Is it correct? * If yes, where are the 75 remaining patients? * If no, what did I misunderstand? Thanks, Xavier

Annotation Annotation • 1.5k views

ADD COMMENT • link 12.5 years ago Xavier Robin ▴ 30

0

Entering edit mode

Sean Davis 21k

@sean-davis-490

Last seen 3 months ago

United States

On Tue, Oct 25, 2011 at 7:38 AM, Xavier Robin <xavier.robin at="" unige.ch=""> wrote: > Hello everyone! > > > According to the documentation of package breastCancerNKI, nki is a > merge of two datastets by van?t Veer et al. and van de Vijver et al. We > should have 117 patients for the former, and 295 for the latter, summing > up to 412 patients. > > However the nki dataset contains only 337 patients: > >> dim(pData(nki)) > [1] 337 ?21 >> dim(exprs(nki)) > [1] 24481 ? 337 > > Apparently the "series" annotation specifies which dataset the patients > belong to: > >> table(pData(nki)$series) > > ?NKI NKI2 > ?117 ?220 > > So we would have all 117 patients from van?t Veer et al, but only 220 of > the 295 patients of van de Vijver et al. > > Is it correct? > * If yes, where are the 75 remaining patients? > * If no, what did I misunderstand? There was overlap between the van't Veer and van de Vijver datasets, so the numbers are for unique patients only, I believe. Sean

ADD COMMENT • link 12.5 years ago Sean Davis 21k

0

Entering edit mode

On 10/25/2011 12:47 PM, Sean Davis wrote: > On Tue, Oct 25, 2011 at 7:38 AM, Xavier Robin<xavier.robin at="" unige.ch=""> wrote: >> Hello everyone! >> >> >> According to the documentation of package breastCancerNKI, nki is a >> merge of two datastets by van?t Veer et al. and van de Vijver et al. We >> should have 117 patients for the former, and 295 for the latter, summing >> up to 412 patients. >> >> However the nki dataset contains only 337 patients: >> >>> dim(pData(nki)) >> [1] 337 21 >>> dim(exprs(nki)) >> [1] 24481 337 >> >> Apparently the "series" annotation specifies which dataset the patients >> belong to: >> >>> table(pData(nki)$series) >> NKI NKI2 >> 117 220 >> >> So we would have all 117 patients from van?t Veer et al, but only 220 of >> the 295 patients of van de Vijver et al. >> >> Is it correct? >> * If yes, where are the 75 remaining patients? >> * If no, what did I misunderstand? > There was overlap between the van't Veer and van de Vijver datasets, > so the numbers are for unique patients only, I believe. Yes, the duplicate patients were removed which resulted in 337 unique patients for both studies. Regards, Markus

ADD REPLY • link 12.5 years ago Markus Schroeder ▴ 10

0

Entering edit mode

Le 25. 10. 11 13:59, Markus Schroeder a ?crit : > On 10/25/2011 12:47 PM, Sean Davis wrote: >> There was overlap between the van't Veer and van de Vijver datasets, >> so the numbers are for unique patients only, I believe. > > Yes, the duplicate patients were removed which resulted in 337 unique > patients for both studies. Thank you Sean and Markus for your answers. Do you know which are these duplicate patients? Thanks again, Xavier

ADD REPLY • link 12.5 years ago Xavier Robin ▴ 30

0

Entering edit mode

It was a complicated process of curation and unfortunately we did not keep track of all the changes. However, you can easily identify those samples thanks to the "series" slot in the phenodata: NKI and NKI2 stand for the first and second papers respectively. We removed samples from the first series in priority. I hope this helps. --- Benjamin Haibe-Kains Computational Biology and Functional Genomics Laboratory Center for Cancer Computational Biology Harvard School of Public Health, Dana-Farber Cancer Institute (SM824) Phone: +1 (617) 582 7267 Fax: +1 (617) 582 7760 Email: bhaibeka@jimmy.harvard.edu Webpage: http://www.hsph.harvard.edu/research/benjamin-haibe-kains/ On Oct 25, 2011, at 11:51 AM, Xavier Robin wrote: > Le 25. 10. 11 13:59, Markus Schroeder a écrit : >> On 10/25/2011 12:47 PM, Sean Davis wrote: >>> There was overlap between the van't Veer and van de Vijver datasets, >>> so the numbers are for unique patients only, I believe. >> >> Yes, the duplicate patients were removed which resulted in 337 unique >> patients for both studies. > > Thank you Sean and Markus for your answers. > > Do you know which are these duplicate patients? > > Thanks again, > Xavier [[alternative HTML version deleted]]

ADD REPLY • link 12.5 years ago Benjamin Haibe-Kains ▴ 20

0

Entering edit mode

dear list, the following three lines allow one to count overlaps of aligned short-reads with annotations: aln <- readGappedAlignments("somebamfile.bam") txdb <- makeTranscriptFromUCSC(genome="hg19", tablename="ensGene") ensGenes <- exonsBy(txdb, by="gene") ov <- countGenomicOverlaps(aln, ensGenes) then i want to get read-counts per gene and the first thing that comes to my head is doing: counts <- sapply(ov, function(x) sum(values(x)[["hits"]])) which goes through every gene and adds up the "hits" of its exons. however, this latter step of "just adding" takes longer than the actual calculation of the hits with countGenomicOverlaps() and i guess that there are more efficient ways to approach this, probably something around "reducing the hits value column". i've been looking at rdapply() and reduce() and googled too, but couldn't find anything, so i look forward to your suggestions. thanks!! robert.

ADD REPLY • link 12.5 years ago Robert Castelo ★ 3.3k

0

Entering edit mode

On 10/25/2011 05:35 PM, Robert Castelo wrote: > dear list, > > the following three lines allow one to count overlaps of aligned > short-reads with annotations: > > aln <- readGappedAlignments("somebamfile.bam") > txdb <- makeTranscriptFromUCSC(genome="hg19", tablename="ensGene") > ensGenes <- exonsBy(txdb, by="gene") > ov <- countGenomicOverlaps(aln, ensGenes) > > then i want to get read-counts per gene and the first thing that comes > to my head is doing: > > counts <- sapply(ov, function(x) sum(values(x)[["hits"]])) > > which goes through every gene and adds up the "hits" of its exons. > however, this latter step of "just adding" takes longer than the actual > calculation of the hits with countGenomicOverlaps() and i guess that > there are more efficient ways to approach this, probably something > around "reducing the hits value column". i've been looking at rdapply() > and reduce() and googled too, but couldn't find anything, so i look > forward to your suggestions. Hi Robert -- A strategy here is along the lines (untested!) of hits <- values(unlist(ov)))[["hits"]] genes <- rep(names(ov), elementLengths(ov)) counts <- sapply(split(hits, genes), sum) but you'll want to make sure that this is a conceptually sensible way of counting hits per gene. The countGenomicOverlaps function has been heavily updated to summarizeOverlaps in 'devel'; you might want to check that out, or at least the vignette for summarizeOverlaps at http://bioconductor.org/packages/devel/bioc/html/GenomicRanges.html Martin > > thanks!! > robert. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793

ADD REPLY • link 12.5 years ago Martin Morgan 25k

0

Entering edit mode

hi, On 10/27/11 12:07 AM, Martin Morgan wrote: > On 10/25/2011 05:35 PM, Robert Castelo wrote: >> dear list, >> >> the following three lines allow one to count overlaps of aligned >> short-reads with annotations: >> >> aln <- readGappedAlignments("somebamfile.bam") >> txdb <- makeTranscriptFromUCSC(genome="hg19", tablename="ensGene") >> ensGenes <- exonsBy(txdb, by="gene") >> ov <- countGenomicOverlaps(aln, ensGenes) >> >> then i want to get read-counts per gene and the first thing that comes >> to my head is doing: >> >> counts <- sapply(ov, function(x) sum(values(x)[["hits"]])) >> >> which goes through every gene and adds up the "hits" of its exons. >> however, this latter step of "just adding" takes longer than the actual >> calculation of the hits with countGenomicOverlaps() and i guess that >> there are more efficient ways to approach this, probably something >> around "reducing the hits value column". i've been looking at rdapply() >> and reduce() and googled too, but couldn't find anything, so i look >> forward to your suggestions. > > Hi Robert -- A strategy here is along the lines (untested!) of > > hits <- values(unlist(ov)))[["hits"]] > genes <- rep(names(ov), elementLengths(ov)) > counts <- sapply(split(hits, genes), sum) beautiful and fast, i've just tested it with the ~ 50,000 Ensembl genes and the execution time is nearly negligible, less than half a second in my laptop. > but you'll want to make sure that this is a conceptually sensible way of > counting hits per gene. i'm aiming at counting exonic-reads and adding them up at gene level. the function countGenomicOverlaps() allows me to tune the part of the logic related to counting exonic reads, but is there any function that would allow me to tune the summing of exonic reads? > The countGenomicOverlaps function has been heavily updated to > summarizeOverlaps in 'devel'; you might want to check that out, or at > least the vignette for summarizeOverlaps at > > http://bioconductor.org/packages/devel/bioc/html/GenomicRanges.html that sounds great, i'll upgrade next week when it becomes release. thanks again! robert.

ADD REPLY • link 12.5 years ago Robert Castelo ★ 3.3k

0

Entering edit mode

Hi Robert, On 10/26/2011 03:34 PM, Robert Castelo wrote: > hi, > > On 10/27/11 12:07 AM, Martin Morgan wrote: >> On 10/25/2011 05:35 PM, Robert Castelo wrote: >>> dear list, >>> >>> the following three lines allow one to count overlaps of aligned >>> short-reads with annotations: >>> >>> aln <- readGappedAlignments("somebamfile.bam") >>> txdb <- makeTranscriptFromUCSC(genome="hg19", tablename="ensGene") >>> ensGenes <- exonsBy(txdb, by="gene") >>> ov <- countGenomicOverlaps(aln, ensGenes) >>> >>> then i want to get read-counts per gene and the first thing that comes >>> to my head is doing: >>> >>> counts <- sapply(ov, function(x) sum(values(x)[["hits"]])) >>> >>> which goes through every gene and adds up the "hits" of its exons. >>> however, this latter step of "just adding" takes longer than the actual >>> calculation of the hits with countGenomicOverlaps() and i guess that >>> there are more efficient ways to approach this, probably something >>> around "reducing the hits value column". i've been looking at rdapply() >>> and reduce() and googled too, but couldn't find anything, so i look >>> forward to your suggestions. >> >> Hi Robert -- A strategy here is along the lines (untested!) of >> >> hits <- values(unlist(ov)))[["hits"]] >> genes <- rep(names(ov), elementLengths(ov)) >> counts <- sapply(split(hits, genes), sum) > > beautiful and fast, i've just tested it with the ~ 50,000 Ensembl > genes and the execution time is nearly negligible, less than half a > second in my laptop. > >> but you'll want to make sure that this is a conceptually sensible way of >> counting hits per gene. > > i'm aiming at counting exonic-reads and adding them up at gene level. > the function countGenomicOverlaps() allows me to tune the part of the > logic related to counting exonic reads, but is there any function that > would allow me to tune the summing of exonic reads? How do you want to tune the summing? If you want to sum all exons by gene the piece of code Martin provided should do it. Was there another goal you had in mind? Valerie > >> The countGenomicOverlaps function has been heavily updated to >> summarizeOverlaps in 'devel'; you might want to check that out, or at >> least the vignette for summarizeOverlaps at >> >> http://bioconductor.org/packages/devel/bioc/html/GenomicRanges.html > > that sounds great, i'll upgrade next week when it becomes release. > > thanks again! > robert. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.5 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie, On 10/27/11 7:53 PM, Valerie Obenchain wrote: > Hi Robert, > > > On 10/26/2011 03:34 PM, Robert Castelo wrote: >> hi, >> >> On 10/27/11 12:07 AM, Martin Morgan wrote: >>> On 10/25/2011 05:35 PM, Robert Castelo wrote: >>>> dear list, >>>> >>>> the following three lines allow one to count overlaps of aligned >>>> short-reads with annotations: >>>> >>>> aln <- readGappedAlignments("somebamfile.bam") >>>> txdb <- makeTranscriptFromUCSC(genome="hg19", tablename="ensGene") >>>> ensGenes <- exonsBy(txdb, by="gene") >>>> ov <- countGenomicOverlaps(aln, ensGenes) >>>> >>>> then i want to get read-counts per gene and the first thing that comes >>>> to my head is doing: >>>> >>>> counts <- sapply(ov, function(x) sum(values(x)[["hits"]])) >>>> >>>> which goes through every gene and adds up the "hits" of its exons. >>>> however, this latter step of "just adding" takes longer than the actual >>>> calculation of the hits with countGenomicOverlaps() and i guess that >>>> there are more efficient ways to approach this, probably something >>>> around "reducing the hits value column". i've been looking at rdapply() >>>> and reduce() and googled too, but couldn't find anything, so i look >>>> forward to your suggestions. >>> >>> Hi Robert -- A strategy here is along the lines (untested!) of >>> >>> hits <- values(unlist(ov)))[["hits"]] >>> genes <- rep(names(ov), elementLengths(ov)) >>> counts <- sapply(split(hits, genes), sum) >> >> beautiful and fast, i've just tested it with the ~ 50,000 Ensembl >> genes and the execution time is nearly negligible, less than half a >> second in my laptop. >> >>> but you'll want to make sure that this is a conceptually sensible way of >>> counting hits per gene. >> >> i'm aiming at counting exonic-reads and adding them up at gene level. >> the function countGenomicOverlaps() allows me to tune the part of the >> logic related to counting exonic reads, but is there any function that >> would allow me to tune the summing of exonic reads? > > How do you want to tune the summing? If you want to sum all exons by > gene the piece of code Martin provided should do it. Was there another > goal you had in mind? well, Martin was warning me about whether just adding up exon counts per gene makes complete sense and i guess the problem might arise with overlapping exons of different transcripts since then by just summing i'd be adding some counts twice. so i was wondering whether something existed to address this summing along the lines of what is available about the decision logic options for counting reads overlapping annotations with countGenomicOverlaps(). cheers, robert.

ADD REPLY • link 12.5 years ago Robert Castelo ★ 3.3k

0

Entering edit mode

On 10/27/2011 12:44 PM, Robert Castelo wrote: > Hi Valerie, > > On 10/27/11 7:53 PM, Valerie Obenchain wrote: >> Hi Robert, >> >> >> On 10/26/2011 03:34 PM, Robert Castelo wrote: >>> hi, >>> >>> On 10/27/11 12:07 AM, Martin Morgan wrote: >>>> On 10/25/2011 05:35 PM, Robert Castelo wrote: >>>>> dear list, >>>>> >>>>> the following three lines allow one to count overlaps of aligned >>>>> short-reads with annotations: >>>>> >>>>> aln <- readGappedAlignments("somebamfile.bam") >>>>> txdb <- makeTranscriptFromUCSC(genome="hg19", tablename="ensGene") >>>>> ensGenes <- exonsBy(txdb, by="gene") >>>>> ov <- countGenomicOverlaps(aln, ensGenes) >>>>> >>>>> then i want to get read-counts per gene and the first thing that >>>>> comes >>>>> to my head is doing: >>>>> >>>>> counts <- sapply(ov, function(x) sum(values(x)[["hits"]])) >>>>> >>>>> which goes through every gene and adds up the "hits" of its exons. >>>>> however, this latter step of "just adding" takes longer than the >>>>> actual >>>>> calculation of the hits with countGenomicOverlaps() and i guess that >>>>> there are more efficient ways to approach this, probably something >>>>> around "reducing the hits value column". i've been looking at >>>>> rdapply() >>>>> and reduce() and googled too, but couldn't find anything, so i look >>>>> forward to your suggestions. >>>> >>>> Hi Robert -- A strategy here is along the lines (untested!) of >>>> >>>> hits <- values(unlist(ov)))[["hits"]] >>>> genes <- rep(names(ov), elementLengths(ov)) >>>> counts <- sapply(split(hits, genes), sum) >>> >>> beautiful and fast, i've just tested it with the ~ 50,000 Ensembl >>> genes and the execution time is nearly negligible, less than half a >>> second in my laptop. >>> >>>> but you'll want to make sure that this is a conceptually sensible >>>> way of >>>> counting hits per gene. >>> >>> i'm aiming at counting exonic-reads and adding them up at gene level. >>> the function countGenomicOverlaps() allows me to tune the part of the >>> logic related to counting exonic reads, but is there any function that >>> would allow me to tune the summing of exonic reads? >> >> How do you want to tune the summing? If you want to sum all exons by >> gene the piece of code Martin provided should do it. Was there another >> goal you had in mind? > > well, Martin was warning me about whether just adding up exon counts > per gene makes complete sense and i guess the problem might arise with > overlapping exons of different transcripts since then by just summing > i'd be adding some counts twice. so i was wondering whether something > existed to address this summing along the lines of what is available > about the decision logic options for counting reads overlapping > annotations with countGenomicOverlaps(). Both the counting and summarizing problems have been addressed in the summarizedOverlaps function Martin mentioned. The function has different "modes" that determine how to assign a read that overlaps multiple features. This function does not double count so a read is either assigned according to the rules of the "mode" or it is discarded. In the end you have counts per feature. A feature can be represented by a single range (ie, single exon, transcript or gene) or by multiple ranges (ie, exons in a gene, transcripts in a gene, etc.). I'd be interested in feedback once you've had a chance to use the function. Valerie > > cheers, > robert.

ADD REPLY • link 12.5 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Xavier Robin ▴ 30

@xavier-robin-4928

Last seen 9.6 years ago

Le 25. 10. 11 18:54, Benjamin Haibe-Kains a ?crit : > It was a complicated process of curation and unfortunately we did not keep track of all the changes. However, you can easily identify those samples thanks to the "series" slot in the phenodata: NKI and NKI2 stand for the first and second papers respectively. We removed samples from the first series in priority. > > I hope this helps. Yes it does! Thank you Benjamin.

ADD COMMENT • link 12.5 years ago Xavier Robin ▴ 30

Login before adding your answer.