Question: DiffBind - sample sheet for multiple replicates and peak
0
6.9 years ago by
Paolo Kunderfranco350 wrote:
• 2.4k views
modified 6.9 years ago by Rory Stark100 • written 6.9 years ago by Paolo Kunderfranco350
Answer: DiffBind - sample sheet for multiple replicates and peak
0
6.9 years ago by
Rory Stark100
Rory Stark100 wrote:
Hello Ant?nio- Regarding your sample sheet, you should be able to use the same sample ID and the same read files for each sample, and specify the peak caller as MACS and QuEST as in your initial sample sheet. This should load two sets of peak for each sample, differing only peak caller name (as DiffBind doesn't recognize the MACS or QuEST strings, it will default to raw, which is the format you are using). Does this work for you? I've attached an example sample sheet using the sample tamoxifen data included with DiffBind. The next issue is deriving "a set of common peaks for the peak callers ". You can do this using all the peaks from all the callers for all the replicates but invoking dba.count directly (by default, any peak that is identified at least twice is included; these could come form two peak callers for a single replicate, or from single peak callers for different replicates). You may want to perform an intermediate step of deriving consensus peaksets for each replicate. In the development version of DiffBind, this is easy done with the call: data = dba.peakset(data, consensus = -DBA_CALLER) Which will add a consensus peakset for each sample (consisting of peak identified in both peak callers for that replicate). I have added a feature to dba.count to make it easy to use only these peaksets in deriving the master consensus peakset used for counting: data = dba.count(data, peaks=data$masks$Consensus) But this won't be in the build for a few more days. In the meantime you have to retrieve the peakset and pass it to dba.count: consensus = dba(data,data$masks$Consensus) # make new DBA object with only the consensus peaksets conspeaks = dba.peakset(consensus,bRetrieve=T) # retrieve master consensus peakset data = dba.count(data,peaks=conspeaks) # pass in master consensus peakset for counting As Gord said in another response, the reads must be in BED or BAM format (possible zipped); you can easily convert SAM files to BAM files. Hope this helps! Cheers- Rory On 14/09/2012 11:00, "bioconductor-request at r-project.org<mailto :bioconductor-request="" at="" r-project.org="">" <bioconductor-request at="" r-project.org<mailto:bioconductor-request="" at="" r-project.org="">> wrote: ------------------------------ Message: 10 Date: Thu, 13 Sep 2012 18:06:05 +0200 From: Ant?nio Miguel de Jesus Domingues <amjdomingues@gmail.com<mailto:amjdomingues@gmail.com>> To: bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> Subject: [BioC] DiffBind - sample sheet for multiple replicates and peak callers Message-ID: <capacvobu89nekmnu-2jojczybqwdhd=ut=b5o4hl10k+axxgnq at="" mail.gmail.com<mailto:capacvobu89nekmnu-="" 2jojczybqwdhd="ut=b5O4hL10k+axXgNQ" at="" mail.gmail.com="">> Content-Type: text/plain Hi all, I am trying to use DiffBind to compare peaks called in control vs condition. I have 2 replicates for each and I've also called peaks using 2 different peak callers (to wi, MACS and QuEST). I've also prepared a sample data sheet that looks like this: SampleID Tissue Factor Condition Replicate Peak.caller bamReads bamControl Peaks control Hela TF wt 1 MACS path path path control Hela TF wt 1 QuEST path path path control2 Hela TF wt 2 MACS path path path control 2 Hela TF wt 2 QuEST path path path (and the same for the conditions) My plan was to load all the data and then using diffbind selecte a set of common peaks for the peak callers before proceeding with the analysis. However, when I load the data (data = dba(sampleSheet="samplesheet.csv")) the peaks for each caller are not recognized as a different variable. How I can do that and is this silly? I could also derive a set of common peaks independently but it would be neat to do it all with the same package and that seems to be possible but I could not find how to do it in the documentation. Thanks, Ant?nio -- -- Ant?nio Miguel de Jesus Domingues, PhD Neugebauer group Max Planck Institute of Molecular Cell Biology and Genetics, Dresden Pfotenhauerstrasse 108 01307 Dresden Germany e-mail: domingue at mpi-cbg.de<mailto:domingue at="" mpi-cbg.de=""> tel. +49 351 210 2481 The Unbearable Lightness of Molecular Biology [[alternative HTML version deleted]] ------------------------------ ------------------------------ Message: 20 Date: Fri, 14 Sep 2012 11:23:26 +0200 From: Paolo Kunderfranco <paolo.kunderfranco@gmail.com<mailto:paolo.kunderfranco@gmail.com>> To: bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> Subject: Re: [BioC] DiffBind - sample sheet for multiple replicates and peak Message-ID: <cagxwfc-bywww9mjmr7nadcofnlj=zym3sssbc2mynydhe5+haq at="" mail.gmail.com<mailto:cagxwfc-="" bywww9mjmr7nadcofnlj="ZyM3sssbc2mYNyDhE5+haQ" at="" mail.gmail.com="">> Content-Type: text/plain; charset=ISO-8859-1 Dear Antonio I think that this problem was resolved in previous messagges, have a look to: https://stat.ethz.ch/pipermail/bioconductor/2012-August/047351.html? For the complete code you could browse August mailing list. SampleID should be unique for each sample, and moreover also bam file file should be unique, you should make a copy of all your bamReads and bamControl In my example, I compared 2 different peak callers in 3 cell lines: SampleID,Tissue,Factor,Condition,Replicate,bamReads,bamControl,Peaks mES_H3K27me3_m,ES,H3K27,mES_H3K27me3,1,reads/H3K27me3/ES_H3K27me3_m.be d,reads/H3K27me3/ES_input_m.bed,peaks/H3K27me3_ES_M.bed CMp_H3K27me3_m,CMN,H3K27,CMp_H3K27me3,1,reads/H3K27me3/CMN_H3K27me3_m. bed,reads/H3K27me3/CMN_input_m.bed,peaks/H3K27me3_CMN_M.bed CMa_H3K27me3_m,CMA,H3K27,CMa_H3K27me3,1,reads/H3K27me3/CMA_H3K27me3_m. bed,reads/H3K27me3/CMA_input_m.bed,peaks/H3K27me3_CMA_M.bed mES_H3K27me3_s,ES,H3K27,mES_H3K27me3,2,reads/H3K27me3/ES_H3K27me3_s.be d,reads/H3K27me3/ES_input_s.bed,peaks/H3K27me3_ES_S.bed CMp_H3K27me3_s,CMN,H3K27,CMp_H3K27me3,2,reads/H3K27me3/CMN_H3K27me3_s. bed,reads/H3K27me3/CMN_input_s.bed,peaks/H3K27me3_CMN_S.bed CMa_H3K27me3_s,CMA,H3K27,CMa_H3K27me3,2,reads/H3K27me3/CMA_H3K27me3_s. bed,reads/H3K27me3/CMA_input_s.bed,peaks/H3K27me3_CMA_S.bed Like this it should work, Cheers, Paolo ------------------------------ Message: 21 Date: Fri, 14 Sep 2012 11:58:09 +0200 From: Ant?nio Miguel de Jesus Domingues <amjdomingues@gmail.com<mailto:amjdomingues@gmail.com>> To: bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> Subject: [BioC] DiffBind - error in dba.count Message-ID: <capacvoc+9hgaxzatgj7j3guktemaummn6sjhoxm1b4ygwvr4rq at="" mail.g="" mail.com<mailto:capacvoc+9hgaxzatgj7j3guktemaummn6sjhoxm1b4ygwvr4rq="" at="" mail.gmail.com="">> Content-Type: text/plain Hi again, I am trying DiffBind and loaded my data that looks like this: H3K4m3 4 Samples, 13203 sites in matrix (13792 total): ID Tissue Factor Condition Peak.caller Replicate Intervals 1 wt1 Hela H3K4me3 control1 raw 1 14111 2 wt2 Hela H3K4me3 control2 raw 2 13771 3 treat1 Hela H3K4me3 condition1 raw 1 14865 4 treat2 Hela H3K4me3 condition2 raw 2 13393 But I ran into problems trying to calculate the affinity scores with dba.count: H3K4m3 = dba.count(H3K4m3) Error in cond$counts :$ operator is invalid for atomic vectors In addition: Warning message: In mclapply(arglist, fn, ..., mc.preschedule = FALSE) : 6 function calls resulted in an error The peaks are in bed files (chr, start, end, score) and the reads are in SAM format. Can anyone help me with this? Cheers. Ant?nio sessionInfo() R version 2.14.1 (2011-12-22) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] DiffBind_1.0.9 Biobase_2.14.0 loaded via a namespace (and not attached): [1] IRanges_1.12.6 RColorBrewer_1.0-5 amap_0.8-7 edgeR_2.4.6 [5] gdata_2.11.0 gplots_2.11.0 gtools_2.7.0 limma_3.10.3 [9] zlibbioc_1.0.1 On 13 September 2012 18:06, Ant?nio Miguel de Jesus Domingues < amjdomingues at gmail.com<mailto:amjdomingues at="" gmail.com="">> wrote: Hi all, I am trying to use DiffBind to compare peaks called in control vs condition. I have 2 replicates for each and I've also called peaks using 2 different peak callers (to wi, MACS and QuEST). I've also prepared a sample data sheet that looks like this: SampleID Tissue Factor Condition Replicate Peak.caller bamReads bamControl Peaks control Hela TF wt 1 MACS path path path control Hela TF wt 1 QuEST path path path control2 Hela TF wt 2 MACS path path path control 2 Hela TF wt 2 QuEST path path path (and the same for the conditions) My plan was to load all the data and then using diffbind selecte a set of common peaks for the peak callers before proceeding with the analysis. However, when I load the data (data = dba(sampleSheet="samplesheet.csv")) the peaks for each caller are not recognized as a different variable. How I can do that and is this silly? I could also derive a set of common peaks independently but it would be neat to do it all with the same package and that seems to be possible but I could not find how to do it in the documentation. Thanks, Ant?nio -- -- Ant?nio Miguel de Jesus Domingues, PhD Neugebauer group Max Planck Institute of Molecular Cell Biology and Genetics, Dresden Pfotenhauerstrasse 108 01307 Dresden Germany e-mail: domingue at mpi-cbg.de<mailto:domingue at="" mpi-cbg.de=""> tel. +49 351 210 2481 The Unbearable Lightness of Molecular Biology -- -- Ant?nio Miguel de Jesus Domingues, PhD Neugebauer group Max Planck Institute of Molecular Cell Biology and Genetics, Dresden Pfotenhauerstrasse 108 01307 Dresden Germany e-mail: domingue at mpi-cbg.de<mailto:domingue at="" mpi-cbg.de=""> tel. +49 351 210 2481 The Unbearable Lightness of Molecular Biology [[alternative HTML version deleted]] ------------------------------ _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> https://stat.ethz.ch/mailman/listinfo/bioconductor End of Bioconductor Digest, Vol 115, Issue 14 ********************************************* NOTICE AND DISCLAIMER This e-mail (including any attachments) is intended for the above- named person(s). If you are not the intended recipient, notify the sender immediately, delete this email from your system and do not disclose or use for any purpose. We may monitor all incoming and outgoing emails in line with current legislation. We have taken steps to ensure that this email and attachments are free from any virus, but it remains your responsibility to ensure that viruses do not adversely affect you. Cancer Research UK Registered charity in England and Wales (1089464), Scotland (SC041666) and the Isle of Man (1103) A company limited by guarantee. Registered company in England and Wales (4325234) and the Isle of Man (5713F). Registered Office Address: Angel Building, 407 St John Street, London EC1V 4AD.
Hello António- Looking closer, I now see the problem with your original sample sheet: you have a column named Peak.caller that should be named PeakCaller. Changing the column name should allow DiffBind to differentiate the peaksets based on the peak caller name. Cheers- Rory From: Rory Stark <rory.stark@cancer.org.uk<mailto:rory.stark@cancer.org.uk>> Date: Fri, 14 Sep 2012 12:06:09 +0100 To: "bioconductor@r-project.org<mailto:bioconductor@r-project.org>" <bioconductor@r-project.org<mailto:bioconductor@r-project.org>> Cc: "domingue@mpi-cbg.de<mailto:domingue@mpi-cbg.de>" <domingue@mpi- cbg.de<mailto:domingue@mpi-cbg.de="">>, Paolo Kunderfranco <paolo.kunderfranco@gmail.com<mailto:paolo.kunderfranco@gmail.com>> Subject: Re: DiffBind - sample sheet for multiple replicates and peak Hello António- Regarding your sample sheet, you should be able to use the same sample ID and the same read files for each sample, and specify the peak caller as MACS and QuEST as in your initial sample sheet. This should load two sets of peak for each sample, differing only peak caller name (as DiffBind doesn't recognize the MACS or QuEST strings, it will default to raw, which is the format you are using). Does this work for you? I've attached an example sample sheet using the sample tamoxifen data included with DiffBind. The next issue is deriving "a set of common peaks for the peak callers ". You can do this using all the peaks from all the callers for all the replicates but invoking dba.count directly (by default, any peak that is identified at least twice is included; these could come form two peak callers for a single replicate, or from single peak callers for different replicates). You may want to perform an intermediate step of deriving consensus peaksets for each replicate. In the development version of DiffBind, this is easy done with the call: data = dba.peakset(data, consensus = -DBA_CALLER) Which will add a consensus peakset for each sample (consisting of peak identified in both peak callers for that replicate). I have added a feature to dba.count to make it easy to use only these peaksets in deriving the master consensus peakset used for counting: data = dba.count(data, peaks=data$masks$Consensus) But this won't be in the build for a few more days. In the meantime you have to retrieve the peakset and pass it to dba.count: consensus = dba(data,data$masks$Consensus) # make new DBA object with only the consensus peaksets conspeaks = dba.peakset(consensus,bRetrieve=T) # retrieve master consensus peakset data = dba.count(data,peaks=conspeaks) # pass in master consensus peakset for counting As Gord said in another response, the reads must be in BED or BAM format (possible zipped); you can easily convert SAM files to BAM files. Hope this helps! Cheers- Rory On 14/09/2012 11:00, "bioconductor-request@r-project.org<mailto :bioconductor-request@r-project.org="">" <bioconductor- request@r-project.org<mailto:bioconductor-request@r-project.org="">> wrote: ------------------------------ Message: 10 Date: Thu, 13 Sep 2012 18:06:05 +0200 From: Ant?nio Miguel de Jesus Domingues <amjdomingues@gmail.com<mailto:amjdomingues@gmail.com>> To: bioconductor@r-project.org<mailto:bioconductor@r-project.org> Subject: [BioC] DiffBind - sample sheet for multiple replicates and peak callers Message-ID: <capacvobu89nekmnu- 2jojczybqwdhd="ut=b5O4hL10k+axXgNQ@mail.gmail.com&lt;mailto" :capacvobu89nekmnu-2jojczybqwdhd="ut=b5O4hL10k+axXgNQ@mail.gmail.com">> Content-Type: text/plain Hi all, I am trying to use DiffBind to compare peaks called in control vs condition. I have 2 replicates for each and I've also called peaks using 2 different peak callers (to wi, MACS and QuEST). I've also prepared a sample data sheet that looks like this: SampleID Tissue Factor Condition Replicate Peak.caller bamReads bamControl Peaks control Hela TF wt 1 MACS path path path control Hela TF wt 1 QuEST path path path control2 Hela TF wt 2 MACS path path path control 2 Hela TF wt 2 QuEST path path path (and the same for the conditions) My plan was to load all the data and then using diffbind selecte a set of common peaks for the peak callers before proceeding with the analysis. However, when I load the data (data = dba(sampleSheet="samplesheet.csv")) the peaks for each caller are not recognized as a different variable. How I can do that and is this silly? I could also derive a set of common peaks independently but it would be neat to do it all with the same package and that seems to be possible but I could not find how to do it in the documentation. Thanks, Ant?nio -- -- Ant?nio Miguel de Jesus Domingues, PhD Neugebauer group Max Planck Institute of Molecular Cell Biology and Genetics, Dresden Pfotenhauerstrasse 108 01307 Dresden Germany e-mail: domingue@mpi-cbg.de<mailto:domingue@mpi-cbg.de> tel. +49 351 210 2481 The Unbearable Lightness of Molecular Biology [[alternative HTML version deleted]] ------------------------------ ------------------------------ Message: 20 Date: Fri, 14 Sep 2012 11:23:26 +0200 From: Paolo Kunderfranco <paolo.kunderfranco@gmail.com<mailto:paolo.kunderfranco@gmail.com>> To: bioconductor@r-project.org<mailto:bioconductor@r-project.org> Subject: Re: [BioC] DiffBind - sample sheet for multiple replicates and peak Message-ID: <cagxwfc- bywww9mjmr7nadcofnlj="ZyM3sssbc2mYNyDhE5+haQ@mail.gmail.com&lt;mailto" :cagxwfc-bywww9mjmr7nadcofnlj="ZyM3sssbc2mYNyDhE5+haQ@mail.gmail.com">> Content-Type: text/plain; charset=ISO-8859-1 Dear Antonio I think that this problem was resolved in previous messagges, have a look to: https://stat.ethz.ch/pipermail/bioconductor/2012-August/047351.html? For the complete code you could browse August mailing list. SampleID should be unique for each sample, and moreover also bam file file should be unique, you should make a copy of all your bamReads and bamControl In my example, I compared 2 different peak callers in 3 cell lines: SampleID,Tissue,Factor,Condition,Replicate,bamReads,bamControl,Peaks mES_H3K27me3_m,ES,H3K27,mES_H3K27me3,1,reads/H3K27me3/ES_H3K27me3_m.be d,reads/H3K27me3/ES_input_m.bed,peaks/H3K27me3_ES_M.bed CMp_H3K27me3_m,CMN,H3K27,CMp_H3K27me3,1,reads/H3K27me3/CMN_H3K27me3_m. bed,reads/H3K27me3/CMN_input_m.bed,peaks/H3K27me3_CMN_M.bed CMa_H3K27me3_m,CMA,H3K27,CMa_H3K27me3,1,reads/H3K27me3/CMA_H3K27me3_m. bed,reads/H3K27me3/CMA_input_m.bed,peaks/H3K27me3_CMA_M.bed mES_H3K27me3_s,ES,H3K27,mES_H3K27me3,2,reads/H3K27me3/ES_H3K27me3_s.be d,reads/H3K27me3/ES_input_s.bed,peaks/H3K27me3_ES_S.bed CMp_H3K27me3_s,CMN,H3K27,CMp_H3K27me3,2,reads/H3K27me3/CMN_H3K27me3_s. bed,reads/H3K27me3/CMN_input_s.bed,peaks/H3K27me3_CMN_S.bed CMa_H3K27me3_s,CMA,H3K27,CMa_H3K27me3,2,reads/H3K27me3/CMA_H3K27me3_s. bed,reads/H3K27me3/CMA_input_s.bed,peaks/H3K27me3_CMA_S.bed Like this it should work, Cheers, Paolo ------------------------------ Message: 21 Date: Fri, 14 Sep 2012 11:58:09 +0200 From: Ant?nio Miguel de Jesus Domingues <amjdomingues@gmail.com<mailto:amjdomingues@gmail.com>> To: bioconductor@r-project.org<mailto:bioconductor@r-project.org> Subject: [BioC] DiffBind - error in dba.count Message-ID: <capacvoc+9hgaxzatgj7j3guktemaummn6sjhoxm1b4ygwvr4rq@mail.gmai l.com<mailto:capacvoc+9hgaxzatgj7j3guktemaummn6sjhoxm1b4ygwvr4rq@mail.="" gmail.com="">> Content-Type: text/plain Hi again, I am trying DiffBind and loaded my data that looks like this: H3K4m3 4 Samples, 13203 sites in matrix (13792 total): ID Tissue Factor Condition Peak.caller Replicate Intervals 1 wt1 Hela H3K4me3 control1 raw 1 14111 2 wt2 Hela H3K4me3 control2 raw 2 13771 3 treat1 Hela H3K4me3 condition1 raw 1 14865 4 treat2 Hela H3K4me3 condition2 raw 2 13393 But I ran into problems trying to calculate the affinity scores with dba.count: H3K4m3 = dba.count(H3K4m3) Error in cond$counts :$ operator is invalid for atomic vectors In addition: Warning message: In mclapply(arglist, fn, ..., mc.preschedule = FALSE) : 6 function calls resulted in an error The peaks are in bed files (chr, start, end, score) and the reads are in SAM format. Can anyone help me with this? Cheers. Ant?nio sessionInfo() R version 2.14.1 (2011-12-22) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] DiffBind_1.0.9 Biobase_2.14.0 loaded via a namespace (and not attached): [1] IRanges_1.12.6 RColorBrewer_1.0-5 amap_0.8-7 edgeR_2.4.6 [5] gdata_2.11.0 gplots_2.11.0 gtools_2.7.0 limma_3.10.3 [9] zlibbioc_1.0.1 On 13 September 2012 18:06, Ant?nio Miguel de Jesus Domingues < amjdomingues@gmail.com<mailto:amjdomingues@gmail.com>> wrote: Hi all, I am trying to use DiffBind to compare peaks called in control vs condition. I have 2 replicates for each and I've also called peaks using 2 different peak callers (to wi, MACS and QuEST). I've also prepared a sample data sheet that looks like this: SampleID Tissue Factor Condition Replicate Peak.caller bamReads bamControl Peaks control Hela TF wt 1 MACS path path path control Hela TF wt 1 QuEST path path path control2 Hela TF wt 2 MACS path path path control 2 Hela TF wt 2 QuEST path path path (and the same for the conditions) My plan was to load all the data and then using diffbind selecte a set of common peaks for the peak callers before proceeding with the analysis. However, when I load the data (data = dba(sampleSheet="samplesheet.csv")) the peaks for each caller are not recognized as a different variable. How I can do that and is this silly? I could also derive a set of common peaks independently but it would be neat to do it all with the same package and that seems to be possible but I could not find how to do it in the documentation. Thanks, Ant?nio -- -- Ant?nio Miguel de Jesus Domingues, PhD Neugebauer group Max Planck Institute of Molecular Cell Biology and Genetics, Dresden Pfotenhauerstrasse 108 01307 Dresden Germany e-mail: domingue@mpi-cbg.de<mailto:domingue@mpi-cbg.de> tel. +49 351 210 2481 The Unbearable Lightness of Molecular Biology -- -- Ant?nio Miguel de Jesus Domingues, PhD Neugebauer group Max Planck Institute of Molecular Cell Biology and Genetics, Dresden Pfotenhauerstrasse 108 01307 Dresden Germany e-mail: domingue@mpi-cbg.de<mailto:domingue@mpi-cbg.de> tel. +49 351 210 2481 The Unbearable Lightness of Molecular Biology [[alternative HTML version deleted]] ------------------------------ _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor End of Bioconductor Digest, Vol 115, Issue 14 ********************************************* NOTICE AND DISCLAIMER This e-mail (including any attachments) is intended for ...{{dropped:19}}