Question

flowClean

Kipper Fletez-Brant ▴ 150

@kipper-fletez-brant-6421

Last seen 7.9 years ago

United States

Hey Justin,

Sorry about taking so long to get back to you. I looked over the material you sent me and ran an analysis myself (I'll send you files privately). The gist is that you were right about a lot of the files where you observed that flowClean did not pick up errors it should have; that is, they were flawed files.

Before getting too much more into a direct response though, it's worth discussing a little bit how flowClean works (much of this is also in the vignette). We ID periods of bad collection by asking which terminal populations (+/- with respect to some threshold, the default being the median) are proportionally under- or overrepresented in a given time period relative to others. We do this by taking the cell x parameter matrix (or a subset of the parameters; this is what vectMarkers does) and calling each entry in each column as > or < threshold (1 or 0); in this way we find the populations in the dataset (obviously then, to find a population with parameter `A`, `A` would have to be part of the subset). We then work on a population x time matrix, and look for times that look different (as defined more clearly in the vignette). Note that we look at all populations with at least 500 cells collected over the duration of collection (this is a setting you can tweak: nCellCutoff).

This being said, it is unsurprising that flowClean did not find the problems you saw in the first 5 files in your list in the email below, as many of them only existed in populations defined by parameters not included in your list passed to vectMarkers. I reran your analysis using all markers (1:16) and found most of the problems you were asking about (see inline below for some specific responses), although the computer disagrees with you slightly still. Note that you can tweak another setting (fcMax) to be more stringent.

About the second class of FCS files, in which you expect there to be no problems, there was a bug that I thought I had fixed (this is a bit like whack-a-mole ...). It's on GitHub now, should be on Bioc tomorrow.

Ultimately, some of the concerns here really are about appropriate choice of settings. The ones we recommend as default have shown themselves to be frequently accurate, but certainly there will be cases where perhaps more stringency is desired. As far as parameter selection goes, well, we have been recommending people ignore the 'scatter' params largely because they usually look OK on the data we've examined. However, some of yours did look funny. We will have to tell people to use the scatter params I suppose.

I hope these comments are useful; certainly I appreciate the insight (and bug!) you found.

Kipper

On 06/25/2014 07:09 PM, Justin Meskas wrote:
> Hello Kipper and Pratip,
>
> Thank you for your explanations. After looking at my data more closely, I found that about half of the cases where flowClean was removing only the first compartment were consistent with the shape of the data. The other half of these files seemed to just remove the first compartment randomly. I have created an R source code file you can use to replicate this result. I have put it into a .tar.gz file and will transfer it to you from my Google Drive in a follow-up email. Please do not redistribute the data. Inside the .tar.gz file there is a folder called Figures that can be regenerated using the code. The figures in Figures/Clean show the output of flowClean, while Figures/CleanTest show plots of Marker vs Time that I created using plotDens from flowDensity. (I am using these Marker vs Time plots to judge if a certain section of the data should be removed or not.)
>
> Files "SPLN_L000030297_P3_090.fcs" and "SPLN_L000031107_P3_141.fcs" show cases where flowClean has removed the first compartment when I believe it should not have been. The other 5 FCS files show cases where flowClean seems to also give poor results (the other non-first-compartment-removed files all looked good). In my opinion, flowClean should be removing, from the following files, the following sections:
>
> SPLN_L000018651_Size_113 - 0-5% marks

We found a bit more; see parameter 15 for probable cause.

> SPLN_L000018653_Size_115 - 0-5% and 75-80% marks

Interestingly, looking at the CLR plot, it seems that the 75% - 80% regions are not too out of sync with the remainder of the file, at least for the populations we look at. That can change, however, as the population threshold is changed.

> SPLN_L000018656_Size_118 - 0-5% marks
> SPLN_L000019881_Size_148 - 0-5% and 20-25% marks
> SPLN_L000028450_P3_054 - 0-5%, 12-17% and 55-60% marks
> SPLN_L000030297_P3_090 - Nothing
> SPLN_L000031107_P3_141 - Nothing
>
> For SPLN_L000018653_Size_115, SPLN_L000018656_Size_118 and SPLN_L000028450_P3_054 there seem to be certain locations where only one marker is having a problem and it is not removed. Is it the case that flowClean does not consider 1-marker problems to be substantial enough to remove?
>
> Any insight you might have on any of these problems would be greatly appreciated. Thank you very much,
>
> Justin
>
> P.S. I have made the code, hopefully, easy enough to use so all you have to do is change the working directory to the folder that the files have been extracted to. Let me know if there are any problems with the code.
>
> ________________________________________
> From: Pratip K. Chattopadhyay [pchattop at mail.nih.gov]
> Sent: June 25, 2014 7:07 AM
> To: Kipper Fletez-Brant
> Cc: Justin Meskas; Ryan Brinkman; bioconductor at r-project.org
> Subject: Re: flowClean
>
> There are probably a couple of factors at work here...
>
> The HTS is more likely to exhibit anomalies early in collection for various reasons... The pressure in the system may still be building up, the cells are settled in the bottom of the well and so more events go through at once, and clogs/debris from previous wells/runs may dislodge. In principle, the system is engineered to avoid these issues, but in practice, I often (but not always) see anomalies at the beginning of the collection. Interestingly, on days/runs where there aren't many bad regions flagged, the early regions also look good. This inspires confidence that the algorithm is detecting true problems and doesn't have some systematic problem.
>
> The second factor - relevant to the case where you felt the first events weren't too bad - is guilt by association. Kipper has built in a little buffer to take out some bins that neighbor trouble spots, just to keep things as clean as possible.
>
> Best, Pratip
>
> Kipper Fletez-Brant
> June 25, 2014 8:56 AM
> Hi Justin,
>
> We (Pratip and I) think it may likely be your data - we have observed that the early time points of collection in a flow run tend to have the most errors. Pratip can speak a little more to the technical causes of this. We appreciate your comments and look forward to the results of your tests.
>
> Kipper
>
> Hi Kipper,
>
> On second thought, I think it is my data. I just checked a few files and they seem to be consistent with only removing the first compartment. I will run some tests tomorrow to validate this. Sorry for the emails.
>
> Thanks,
> Justin
>
> ________________________________________
> From: Justin Meskas
> Sent: June 24, 2014 4:31 PM
> To: Kipper Fletez-Brant
> Cc: Ryan Brinkman; bioconductor at r-project.org
> Subject: RE: flowClean
>
> Hi Kipper,
>
> Sorry to keep emailing you, but I had another question about flowClean. I have been noticing that the clean function seems to label the first compartment for removal every time. This seems odd to me. I attached two figures. The figure called "A..." looks like most other figures, where the first compartment is labelled for removal. And the other figure, called "B...", is my unique case where, I believe anyway, the first compartment should be removed, but not the second. Are all these files somehow accidentally removing the first compartment? Or do you think it is the case that all these files have bad data at the beginning?
>
> Thank you,
> Justin
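To make Kipper's description of the population step concrete, here is a rough sketch of the idea: binarize each selected parameter against its median, treat each 0/1 pattern as a population, then count cells per population per time bin. This is plain Python, not flowClean's R code, and it is only an illustration under assumptions: the function names, the equal-width time binning and the toy data are all hypothetical, `min_cells` merely plays the role described for nCellCutoff, and the statistical test for "times that look different" (see the vignette) is not reproduced.

```python
from collections import Counter
from statistics import median


def population_labels(cells, markers):
    """Label each cell with a 0/1 string: one digit per selected marker,
    1 if the value is above that marker's median, else 0."""
    thresholds = {m: median(row[m] for row in cells) for m in markers}
    return [
        "".join("1" if row[m] > thresholds[m] else "0" for m in markers)
        for row in cells
    ]


def population_by_time(labels, times, n_bins, min_cells):
    """Count cells per (population, time bin).

    Populations with fewer than `min_cells` cells in total are dropped
    (hypothetical stand-in for the nCellCutoff setting)."""
    totals = Counter(labels)
    kept = sorted(p for p, n in totals.items() if n >= min_cells)
    t_min, t_max = min(times), max(times)
    width = (t_max - t_min) / n_bins or 1.0  # avoid zero width if all times are equal
    table = {p: [0] * n_bins for p in kept}
    for label, t in zip(labels, times):
        if label in table:
            b = min(int((t - t_min) / width), n_bins - 1)
            table[label][b] += 1
    return table


# Toy data: parameter 0 is stable over time, parameter 2 shifts halfway through.
cells = [[1, 0, 1], [9, 0, 1], [1, 0, 1], [9, 0, 1],
         [1, 0, 9], [9, 0, 9], [1, 0, 9], [9, 0, 9]]
times = [0, 1, 2, 3, 4, 5, 6, 7]

# Using only parameter 0, both populations look evenly spread over time.
print(population_by_time(population_labels(cells, [0]), times, 2, 2))
# Including parameter 2 exposes populations that exist in only one time bin.
print(population_by_time(population_labels(cells, [0, 2]), times, 2, 2))
```

The toy data mirrors the point made about vectMarkers: a shift that only shows up in parameter 2 is invisible when parameter 2 is left out of the marker subset, because no population is defined by it.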

flowClean flowDensity • 1.7k views

updated 11.6 years ago by Justin Meskas ▴ 60 • written 11.6 years ago by Kipper Fletez-Brant ▴ 150

score 0 · Answer 1 · 2014-07-03

Hi Kipper,

Thank you for your reply. I definitely made a mistake by using only markers 7:16 instead of all 1:16. Thank you for catching that.

As for the last two files that had the first compartment removed, I noticed the figures you sent from Dropbox still had these compartments removed. Was the bug you mention supposed to stop these from being removed?

I waited a day to see if Bioconductor changed the version from 1.0.1 to 1.0.2, but it has not yet. So I checked flowClean on GitHub and it seems you made one small modification 2 days ago. I changed this on my local copy and it worked perfectly except for a warning:

    In if (bad != 0) { : the condition has length > 1 and only the first element will be used

It seems that the warning will never give incorrect results, because the first element will always contain something if "bad" is not zero. So, I am guessing that you did not change the version number, and Bioconductor just thought there were no changes and therefore did not update it on the website. Anyway, I just thought I would let you know.

Thank you Kipper and Pratip for your help and all your comments. I am very thankful for this resource.

Justin

________________________________________
From: Kipper Fletez-Brant [cafletezbrant@gmail.com]
Sent: July 1, 2014 8:11 AM
To: Justin Meskas; pchattop at mail.nih.gov
Cc: Ryan Brinkman; bioconductor at r-project.org
Subject: Re: flowClean
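On the `if (bad != 0)` warning reported in this answer: the reasoning that "the first element will always contain something" can be checked with a small sketch. This is Python rather than R, and it assumes (the thread does not confirm this) that `bad` is either the scalar 0 or a vector of flagged values; both function names are hypothetical. The first function mimics a condition that looks only at the first element, the second is a length-safe alternative. Note also that since R 4.2.0 a condition of length greater than one in `if` is an error rather than a warning, so the length-safe form is the one that keeps working.

```python
def first_element_flag(bad):
    """Mimic a condition that inspects only the first element of `bad`."""
    values = bad if isinstance(bad, (list, tuple)) else [bad]
    return values[0] != 0


def any_element_flag(bad):
    """Length-safe alternative: true if any element of `bad` is non-zero."""
    values = bad if isinstance(bad, (list, tuple)) else [bad]
    return any(v != 0 for v in values)


for bad in (0, [3, 4, 5], [0, 4]):
    print(bad, first_element_flag(bad), any_element_flag(bad))
```

The two checks agree as long as a non-empty vector never starts with 0, which is the assumption behind the "never gives incorrect results" argument; they would only disagree for a vector such as `[0, 4]`.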