December 1, 1986

Performance of Some Resistant Rules for Outlier Labeling

Abstract

The techniques of exploratory data analysis include a resistant rule for identifying possible outliers in univariate data. Using the lower and upper fourths, FL and FU (approximate quartiles), it labels as “outside” any observations below FL − 1.5(FU − FL) or above FU + 1.5(FU − FL). For example, in the ordered sample −5, −2, 0, 1, 8, FL = −2 and FU = 1, so any observation below −6.5 or above 5.5 is outside. Thus the rule labels 8 as outside. Some related rules also use cutoffs of the form FL − k(FU − FL) and FU + k(FU − FL). This approach avoids the need to specify the number of possible outliers in advance; as long as they are not too numerous, any outliers do not affect the location of the cutoffs.
To describe the performance of these rules, we define the some-outside rate per sample as the probability that a sample will contain one or more outside observations. Its complement is the all-inside rate per sample. We also define the outside rate per observation as the average fraction of outside observations. For Gaussian data the population all-inside rate per sample (0) and the population outside rate per observation (0.7%) substantially understate the corresponding small-sample values.
Simulation studies using Gaussian samples with n between 5 and 300 yield detailed information on the resistant rules. The main resistant rule (k = 1.5) has an all-inside rate per sample between 67% and 86% for 5 ≤ n ≤ 20, and corresponding estimates of its outside rate per observation range from 8.6% to 1.7%. Both characteristics vary with n in ways that lead to good empirical approximations. Because of the way in which the fourths are defined, the sample sizes separate into four classes, according to whether dividing n by 4 leaves a remainder of 0, 1, 2, or 3. Within these four classes the all-inside rate per sample shows a roughly linear decrease with n over the range 9 ≤ n ≤ 50, and the outside rate per observation decreases linearly in 1/n for n ≥ 9.
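The cutoffs and the quoted population figure can be checked with a short calculation. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the fourths are supplied by the caller, and the population calculation assumes the fourths are replaced by the exact quartiles of a standard Gaussian distribution.

```python
from statistics import NormalDist


def cutoffs(fl, fu, k=1.5):
    """Return the cutoffs FL - k(FU - FL) and FU + k(FU - FL)."""
    spread = fu - fl
    return fl - k * spread, fu + k * spread


def label_outside(sample, fl, fu, k=1.5):
    """Return the observations that fall strictly beyond either cutoff."""
    lower, upper = cutoffs(fl, fu, k)
    return [x for x in sample if x < lower or x > upper]


def population_outside_rate(k=1.5):
    """Outside rate per observation for a standard Gaussian population.

    Assumption: the population fourths are taken to be the exact
    quartiles of the standard Gaussian distribution.
    """
    std = NormalDist()            # mean 0, standard deviation 1
    fu = std.inv_cdf(0.75)        # upper quartile, about 0.6745
    fl = -fu                      # lower quartile, by symmetry
    lower, upper = cutoffs(fl, fu, k)
    return std.cdf(lower) + (1.0 - std.cdf(upper))
```

With the fourths from the example, `cutoffs(-2, 1)` gives −6.5 and 5.5, and `label_outside([-5, -2, 0, 1, 8], -2, 1)` returns only the observation 8. Under the stated assumption, `population_outside_rate()` comes out at roughly 0.7%, in line with the population value quoted for k = 1.5.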
A more theoretical approximation for the all-inside rate per sample works with the order statistics X(1) ≤ … ≤ X(n). In this notation the fourths are X(f) and X(n + 1 − f) with f = ½⌊(n + 3)/2⌋, where ⌊·⌋ is the greatest-integer function. A sample has no observations outside whenever [X(f) − X(1)] / [X(n + 1 − f) − X(f)] ≤ k and [X(n) − X(n + 1 − f)] / [X(n + 1 − f) − X(f)] ≤ k. We first approximate the numerators and denominator in these ratios by constant multiples of chi-squared random variables with the same mean and variance. We then approximate the logarithm of each ratio by a Gaussian random variable, and we calculate the correlation between these variables from the fact that the ratios have the same denominator. Finally, a bivariate Gaussian probability calculation yields the approximate all-inside rate per sample. The error of the result relative to the simulation estimate is typically from 1% to 2% for 5 ≤ n ≤ 50.
To provide an indication of how the two rates behave in alternative “null” situations, the simulation studies included samples from five heavier-tailed members of the family of h distributions. For a given sample size, the all-inside rate per sample decreases as the tails become heavier, and the outside rate per observation increases.
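The depth formula for the fourths and the all-inside condition lend themselves to a small simulation. The sketch below is illustrative only and does not reproduce the authors' simulation design: the function names, trial count, and seed are hypothetical, and it assumes the usual convention that a fractional depth f (ending in ½) means averaging the two adjacent order statistics. The all-inside check compares the extremes with the cutoffs directly, which matches the two ratio conditions whenever the fourth-spread is positive.

```python
import random


def fourths(sample):
    """Return (FL, FU) using the depth f = (1/2) * floor((n + 3) / 2)."""
    x = sorted(sample)
    n = len(x)
    if n == 0:
        raise ValueError("sample must not be empty")
    two_f = (n + 3) // 2                     # 2f = floor((n + 3) / 2)
    lo, hi = two_f // 2, (two_f + 1) // 2    # floor(f) and ceil(f)
    # Assumption: a half-integer depth averages the adjacent order statistics.
    lower = (x[lo - 1] + x[hi - 1]) / 2      # depth f counted from the bottom
    upper = (x[n - lo] + x[n - hi]) / 2      # depth f counted from the top
    return lower, upper


def all_inside(sample, k=1.5):
    """True when no observation lies beyond FL - k(FU - FL) or FU + k(FU - FL)."""
    fl, fu = fourths(sample)
    spread = fu - fl
    return min(sample) >= fl - k * spread and max(sample) <= fu + k * spread


def simulate_all_inside_rate(n, trials=2000, k=1.5, seed=0):
    """Monte Carlo estimate of the all-inside rate per sample for Gaussian data."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        hits += all_inside(sample, k)
    return hits / trials
```

For the sample −5, −2, 0, 1, 8 the depth is f = 2, so `fourths` returns −2 and 1 and `all_inside` is false because 8 exceeds the upper cutoff. Calling `simulate_all_inside_rate(n)` for several n between 5 and 20 gives rough estimates that can be compared with the range reported above; the estimates carry Monte Carlo error that shrinks as `trials` grows.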
