Abstract: The techniques of exploratory data analysis include a resistant rule for identifying possible outliers in univariate data. Using the lower and upper fourths, F_L and F_U (approximate quartiles), it labels as “outside” any observations below F_L − 1.5(F_U − F_L) or above F_U + 1.5(F_U − F_L). For example, in the ordered sample −5, −2, 0, 1, 8, F_L = −2 and F_U = 1, so any observation below −6.5 or above 5.5 is outside. Thus the rule labels 8 as outside. Some related rules also use cutoffs of the form F_L − k(F_U − F_L) and F_U + k(F_U − F_L). This approach avoids the need to specify the number of possible outliers in advance; as long as they are not too numerous, any outliers do not affect the location of the cutoffs. To describe the performance of these rules, we define the some-outside rate per sample as the probability that a sample will contain one or more outside observations. Its complement is the all-inside rate per sample. We also define the outside rate per observation as the average fraction of outside observations. For Gaussian data the population all-inside rate per sample (0) and the population outside rate per observation (0.7%) substantially understate the corresponding small-sample values. Simulation studies using Gaussian samples with n between 5 and 300 yield detailed information on the resistant rules. The main resistant rule (k = 1.5) has an all-inside rate per sample between 67% and 86% for 5 ≤ n ≤ 20, and corresponding estimates of its outside rate per observation range from 8.6% to 1.7%. Both characteristics vary with n in ways that lead to good empirical approximations. Because of the way in which the fourths are defined, the sample sizes separate into four classes, according to whether dividing n by 4 leaves a remainder of 0, 1, 2, or 3. Within these four classes the all-inside rate per sample shows a roughly linear decrease with n over the range 9 ≤ n ≤ 50, and the outside rate per observation decreases linearly in 1/n for n ≥ 9.
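As an illustration only, the labeling rule and the two rates defined above can be sketched in Python. The function names (`fourths`, `outside`, `simulate_rates`), the replication count, and the seed are hypothetical choices for this sketch and do not come from the paper. The sketch takes the fourths at depth f = ½⌊(n + 3)/2⌋ and assumes the usual convention of averaging the two adjacent order statistics when f ends in ½, which the abstract does not spell out.

```python
import math
import random


def fourths(sorted_x):
    """Return (F_L, F_U) for a sorted, non-empty list.

    Depth f = floor((n + 3) / 2) / 2. When f is a half-integer, the two
    adjacent order statistics are averaged (an assumed convention).
    """
    n = len(sorted_x)
    depth = math.floor((n + 3) / 2) / 2
    lo, hi = math.floor(depth), math.ceil(depth)
    # Order statistics are 1-based in the text, 0-based in the list.
    f_l = (sorted_x[lo - 1] + sorted_x[hi - 1]) / 2
    f_u = (sorted_x[n - lo] + sorted_x[n - hi]) / 2
    return f_l, f_u


def outside(x, k=1.5):
    """Observations below F_L - k(F_U - F_L) or above F_U + k(F_U - F_L)."""
    s = sorted(x)
    f_l, f_u = fourths(s)
    spread = f_u - f_l
    lower, upper = f_l - k * spread, f_u + k * spread
    return [v for v in s if v < lower or v > upper]


def simulate_rates(n, k=1.5, reps=20000, seed=0):
    """Monte Carlo estimates for Gaussian samples of size n.

    Returns (all-inside rate per sample, outside rate per observation).
    """
    rng = random.Random(seed)
    samples_all_inside = 0
    total_outside = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        m = len(outside(sample, k))
        samples_all_inside += (m == 0)
        total_outside += m
    return samples_all_inside / reps, total_outside / (reps * n)


print(outside([-5, -2, 0, 1, 8]))  # the abstract's example: prints [8]
print(simulate_rates(10))
```

Running `simulate_rates` over a grid of n gives rough estimates that can be compared with the ranges quoted in the abstract; the estimates carry Monte Carlo error that depends on `reps`.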
A more theoretical approximation for the all-inside rate per sample works with the order statistics X(1) ≤ … ≤ X(n). In this notation the fourths are X(f) and X(n + 1 − f) with f = ½⌊(n + 3)/2⌋, where ⌊·⌋ is the greatest-integer function. A sample has no observations outside whenever [X(f) − X(1)] / [X(n + 1 − f) − X(f)] ≤ k and [X(n) − X(n + 1 − f)] / [X(n + 1 − f) − X(f)] ≤ k. We first approximate the numerators and denominator in these ratios by constant multiples of chi-squared random variables with the same mean and variance. We then approximate the logarithm of each ratio by a Gaussian random variable, and we calculate the correlation between these variables from the fact that the ratios have the same denominator. Finally, a bivariate Gaussian probability calculation yields the approximate all-inside rate per sample. The error of the result relative to the simulation estimate is typically from 1% to 2% for 5 ≤ n ≤ 50. To provide an indication of how the two rates behave in alternative “null” situations, the simulation studies included samples from five heavier-tailed members of the family of h distributions. For a given sample size, the all-inside rate per sample decreases as the tails become heavier, and the outside rate per observation increases.
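The ratio form of the all-inside condition, and the effect of heavier tails, can be explored with a small sketch like the one below. It is not the paper's approximation or its simulation design. The function names, replication count, and seed are hypothetical. The h-distribution sampler assumes the standard form Y = Z·exp(hZ²/2) with Z standard Gaussian, a formula the abstract does not give. Fourths at half-integer depths are again taken as averages of adjacent order statistics, which is also an assumption.

```python
import math
import random


def fourths(s):
    """(F_L, F_U) of a sorted list, at depth f = floor((n + 3) / 2) / 2."""
    n = len(s)
    f = math.floor((n + 3) / 2) / 2
    lo, hi = math.floor(f), math.ceil(f)
    return (s[lo - 1] + s[hi - 1]) / 2, (s[n - lo] + s[n - hi]) / 2


def all_inside(x, k=1.5):
    """True when both extreme-to-fourth ratios are at most k."""
    s = sorted(x)
    f_l, f_u = fourths(s)
    spread = f_u - f_l
    if spread == 0:
        # Degenerate fourth-spread: the cutoffs coincide with the fourths.
        return s[0] == f_l and s[-1] == f_u
    lower_ratio = (f_l - s[0]) / spread   # [X(f) - X(1)] / fourth-spread
    upper_ratio = (s[-1] - f_u) / spread  # [X(n) - X(n+1-f)] / fourth-spread
    return lower_ratio <= k and upper_ratio <= k


def h_variate(rng, h):
    """One draw from an h distribution (assumed form; h = 0 is Gaussian)."""
    z = rng.gauss(0.0, 1.0)
    return z * math.exp(h * z * z / 2)


def all_inside_rate(n, h=0.0, k=1.5, reps=20000, seed=0):
    """Monte Carlo estimate of the all-inside rate per sample."""
    rng = random.Random(seed)
    hits = sum(
        all_inside([h_variate(rng, h) for _ in range(n)], k)
        for _ in range(reps)
    )
    return hits / reps


for h in (0.0, 0.2, 0.5):
    print(h, all_inside_rate(10, h=h))
```

The values of h in the loop are arbitrary illustrations, not the five members of the family used in the study.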
Hoaglin et al. (Mon,) studied this question.