Question

quality assessment and preprocessing for tiling array-based CGH data

Leon Yee ▴ 110

@leon-yee-3088

Dear all,

Is there any well-established routine for quality assessment and preprocessing of array CGH data, especially tiling array-based CGH data? Most of the quality assessment methods I have found are aimed at expression arrays, and few address array CGH data.

We are using the Agilent 244k CGH array for rat. We have the text files produced by Feature Extraction, but we don't know whether they are of good quality. Could anyone provide some clues? Thanks in advance!

After read.maimages(), we got an RGList object, which contains several components including R, G, Rb, Gb, and so on. The probes are of 3 types: -1, 1 and 0. 0 means a normal probe. -1 means a negative control, I guess, and the probe names are like (-)3xSLv1, NC1_00000002, etc. [no corresponding probe sequence]. 1 means a positive control, I guess, and the probe names are like DarkCorner, DCP_008001.0, RnCGHBrightCorner, SRN_800002, etc. [no corresponding probe sequence]. There are 1275 probes of type -1 and 4217 probes of type 1, each with its own R, Rb, G and Gb values. Do we need these values for quality assessment and normalization? If so, how?

In addition, among the normal probes, 1000 probes are each repeated 3 times on the array. How could we use these data for quality assessment and normalization?

Regards,
Leon

CGH probe • 1.2k views

ADD COMMENT • link updated 15.5 years ago by Sean Davis 21k • written 15.5 years ago by Leon Yee ▴ 110

score 0 · Answer 1 · 2008-10-22

Sean Davis 21k

@sean-davis-490

United States

On Wed, Oct 22, 2008 at 9:51 AM, Leon Yee <yee.leon at gmail.com> wrote:

> [...]
> Do we need these values for quality assessment and normalization? How?
> In addition, in the normal probes, we have 1000 probes repeating 3 times
> in the array. How could we use these data for quality assessment and
> normalization?

You generally will not want to do any normalization besides a possible shift of the center. Any linear normalization that affects the slope of the M vs. A plot, or any nonlinear normalization, will likely decrease signal.

As for quality control, a good, general measure to track is the dlrs, a robust measure of the standard deviation.

    dlrs <- function(x) {
      # x: log ratios for one array, sorted by chromosome and position
      nx <- length(x)
      if (nx < 3) {
        stop("Vector length>2 needed for computation")
      }
      # embed(x, 2) pairs each value with its predecessor
      tmp <- embed(x, 2)
      # differences between adjacent probes
      diffs <- tmp[, 2] - tmp[, 1]
      # IQR / 1.34 approximates the SD of the differences for normal data;
      # dividing by sqrt(2) converts that to a per-probe SD
      dlrs <- IQR(diffs) / (sqrt(2) * 1.34)
      return(dlrs)
    }

For Agilent arrays, most of the dlrs values should be around or under 0.2, generally. However, this might vary a bit from lab to lab. In any case, an array whose dlrs is a significant outlier is suspect. The input to the above function is the log ratios for a single array, arranged in chromosome and position order.

Sean
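To make the "robust" part concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the probe table, its column names (`chr`, `pos`, `M`), the noise level of 0.15 and the two simulated aberrations are invented, not taken from a real Agilent file. It applies median centering only, sorts into chromosome and position order, and compares the dlrs with the ordinary standard deviation.

```r
# dlrs() as posted in the answer, repeated so this sketch runs on its own.
dlrs <- function(x) {
  if (length(x) < 3) stop("Vector length>2 needed for computation")
  tmp <- embed(x, 2)
  diffs <- tmp[, 2] - tmp[, 1]
  IQR(diffs) / (sqrt(2) * 1.34)
}

set.seed(1)
n <- 5000

# Hypothetical probe table for one array (column names are assumptions).
probes <- data.frame(
  chr = rep(c(1, 2), each = n / 2),
  pos = c(sort(sample(1e7, n / 2)), sort(sample(1e7, n / 2)))
)

# Simulated copy number profile: a gain on chr 1 and a loss on chr 2.
truth <- rep(0, n)
truth[501:1500]  <- 0.58
truth[3001:3500] <- -1
probes$M <- truth + rnorm(n, sd = 0.15)

# Shuffle rows to mimic arbitrary file order, then restore genomic order.
probes <- probes[sample(n), ]
probes <- probes[order(probes$chr, probes$pos), ]

# Median centering only: shifts the center, leaves the slope untouched.
probes$M <- probes$M - median(probes$M)

noise_dlrs <- dlrs(probes$M)  # tracks the probe-to-probe noise
naive_sd   <- sd(probes$M)    # inflated by the gain and the loss
```

In this simulation the dlrs should land near the simulated noise level, while the plain standard deviation is much larger because it also counts the copy number changes as spread. With several arrays, the same call can be applied per array and the values compared to spot an outlier.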

ADD COMMENT • link 15.5 years ago Sean Davis 21k

Hi, Sean

Thanks for your advice. However, I still have several questions:

1. Is the input to dlrs the log ratio extracted from the text file produced by Feature Extraction, or the one calculated from RGList --> MAList? I searched the mailing list and saw a post of yours that mentioned the difference between the log ratio from Feature Extraction and the default M value from read.maimages.

2. I can get the log ratios of all features, including those with control type -1 and 1, but these features don't have chromosome positions. Does this mean I don't need any of them for quality assessment?

3. Some probes with names like "chr2_random:xxxxx-yyyyyy" will not get a proper mapping on the chromosome, so I should remove these values from the input of dlrs. Is that so?

4. How should I handle those 1000 probes repeated 3 times? Each group of three maps to the same chromosome position.

Regards,
Leon

ADD REPLY • link 15.5 years ago Leon Yee ▴ 110

On Wed, Oct 22, 2008 at 10:32 AM, Leon Yee <yee.leon at gmail.com> wrote:

> 1. Is the input to dlrs the log ratio extracted from the text file produced
> by Feature Extraction, or the one calculated from RGList --> MAList? I
> searched the mailing list and saw a post of yours that mentioned the
> difference between the log ratio from Feature Extraction and the default
> M value from read.maimages.

You can read the Agilent FE manual for more details, but you can probably use either and come to very similar conclusions. If you use the MAList version, make sure to use only median centering or none for normalization.

> 2. I can get the log ratios of all features, including those with control
> type -1 and 1, but these features don't have chromosome positions. Does
> this mean I don't need any of them for quality assessment?

We have not routinely used these probes, no. If an array fails miserably, then these control probes might be useful for determining the reason for the failure, though.

> 3. Some probes with names like "chr2_random:xxxxx-yyyyyy" will not get a
> proper mapping on the chromosome, so I should remove these values from
> the input of dlrs. Is that so?

You can either remove them or treat chr2_random as a separate chromosome.

> 4. How should I handle those 1000 probes repeated 3 times? Each group of
> three maps to the same chromosome position.

You could choose one at random or use a mean or median of them. My guess is that they agree very closely with one another, so the choice should not affect the results much.

Sean
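The four answers above can be sketched as one preparation step before calling dlrs. This is only an illustration in base R: the probe table, its column names (`ProbeName`, `ControlType`, `chr`, `pos`, `M`) and all the values are hypothetical, and a real Feature Extraction file will need its own column mapping.

```r
# Hypothetical per-array probe table; names and values are invented.
probes <- data.frame(
  ProbeName   = c("DarkCorner", "P1", "P1", "P1", "P2", "P3", "P4", "NC1"),
  ControlType = c(1, 0, 0, 0, 0, 0, 0, -1),
  chr         = c(NA, "chr1", "chr1", "chr1", "chr1", "chr2_random", "chr2", NA),
  pos         = c(NA, 100, 100, 100, 250, 40, 75, NA),
  M           = c(3.0, 0.10, 0.12, 0.08, -0.05, 0.20, 0.02, -2.5),
  stringsAsFactors = FALSE
)

# 1. Keep only normal probes (control type 0); controls have no position.
feat <- probes[probes$ControlType == 0, ]

# 2. Drop probes on unplaced "_random" sequences (one of the two options).
feat <- feat[!grepl("_random$", feat$chr), ]

# 3. Collapse replicated probes to the median of their log ratios.
collapsed <- aggregate(M ~ ProbeName + chr + pos, data = feat, FUN = median)

# 4. Group by chromosome and sort by position within each chromosome.
#    (Character sorting puts "chr10" before "chr2"; that is harmless here
#    because only adjacency within a chromosome matters.)
collapsed <- collapsed[order(collapsed$chr, collapsed$pos), ]
```

`collapsed$M` is then in the order the dlrs function expects. To treat chr2_random as a separate chromosome instead of removing it, step 2 can simply be skipped.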

ADD REPLY • link 15.5 years ago Sean Davis 21k

Hi, Sean

Thank you very much for your detailed reply and help.

Where can I find references or official documentation about the dlrs method?

In addition, we have designed our arrays with dye-swap [test-cy3 vs ref-cy5, and test-cy5 vs ref-cy3]. Is there any method for utilizing this information for quality assessment?

Best wishes!
Leon

ADD REPLY • link 15.5 years ago Leon Yee ▴ 110

On Wed, Oct 22, 2008 at 1:14 PM, Leon Yee <yee.leon at gmail.com> wrote:

> Where can I find references or official documentation about the dlrs
> method?

It is a standard robust estimator of the standard deviation and is not specific to microarrays. If you look at the code, it simply takes the differences between adjacent probes and then rescales their spread. If the array is "noisy", the dlrs will be high. This assumes that the contribution due to large copy number changes is negligible, which is likely true since even the most abnormal cancer samples have fewer than 1000 breaks, so only a tiny fraction of the adjacent-probe differences on a 244k array span a breakpoint.

> In addition, we have designed our arrays with dye-swap [test-cy3 vs
> ref-cy5, and test-cy5 vs ref-cy3]. Is there any method for utilizing this
> information for quality assessment?

Not that I know of, but you could certainly look at correlations between replicates, etc. Our experience with Agilent CGH arrays is that the contribution due to dye bias is small compared to changes due to copy number.

Sean
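As a rough illustration of the "correlations between replicates" suggestion for a dye-swap pair, the sketch below uses simulated log ratios. It assumes M is defined as log2(Cy5/Cy3) on both arrays, so the swapped array carries the same signal with the opposite sign; the variable names, noise level and simulated aberrations are all invented. The half-sum at the end is an extra, commonly used way to look at the dye-related component and is not something proposed in the thread.

```r
set.seed(2)
n <- 2000

# Simulated copy number signal in the test sample: one gain, one loss.
signal <- rep(0, n)
signal[401:800]   <- 0.6
signal[1501:1700] <- -0.8

# Forward array (test-Cy5 vs ref-Cy3): M follows the signal.
M_fwd  <- signal + rnorm(n, sd = 0.15)
# Swapped array (test-Cy3 vs ref-Cy5): same M definition, opposite sign.
M_swap <- -signal + rnorm(n, sd = 0.15)

# Flip the swapped array before comparing, so that agreement between
# the two hybridizations shows up as a positive correlation.
r <- cor(M_fwd, -M_swap)

# In the half-sum the copy number signal cancels, leaving noise plus any
# dye-specific component (none was simulated here).
dye_component <- (M_fwd + M_swap) / 2
```

A dye-swap pair that correlates poorly after flipping, or whose half-sum shows clear structure along the genome, would be worth a closer look.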

ADD REPLY • link 15.5 years ago Sean Davis 21k