# Detecting Outliers - Univariate


## Revision as of 05:33, 16 February 2008

• How do I detect outliers?
1. One way is to visually inspect your data with a FREQUENCY DISTRIBUTION. - Imagine a study that asks the American public how many sexual partners they have had over their lifetime. In the frequency distribution of the findings from this hypothetical study, the people who report 100+ sexual partners in their lifetime appear disconnected from the rest of the data.
2. One statistical benchmark is to use a BOXPLOT to determine "mild" and "extreme" outliers. Mild outliers are any scores more than 1.5*IQR below the first quartile or above the third quartile (the edges of the box), and are indicated by open dots. Extreme outliers are any scores more than 3*IQR beyond those same edges. IQR stands for the interquartile range, which is the distance between the first and third quartiles, i.e., the spread of the middle 50% of the scores. In other words, an outlier is determined by comparison to the bulk of the scores in the middle. - The following describes SPSS output for a variable called "system1". A boxplot is a graphical display of the data that shows: (1) the median, which is the middle black line, (2) the middle 50% of scores, which is the shaded region, (3) the top and bottom 25% of scores (excluding outliers), which are the lines extending out of the shaded region, (4) the smallest and largest (non-outlier) scores, which are the horizontal lines at the top/bottom of the boxplot, and (5) outliers. For this variable, there is 1 mild outlier (subject #52) and 1 extreme outlier (subject #18).
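
The boxplot rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_outliers` and the sample scores are made up, and the quartiles come from Python's `statistics.quantiles` with the "inclusive" method. SPSS may compute the box edges slightly differently, so borderline scores can be classified differently there.

```python
from statistics import quantiles


def classify_outliers(scores):
    """Label each score 'none', 'mild', or 'extreme' using the 1.5*IQR and 3*IQR rules."""
    # Quartiles: q1 and q3 are the bottom and top edges of the box.
    q1, _median, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1

    labels = []
    for x in scores:
        # How far the score lies beyond the nearest box edge (0 if inside the box).
        distance = max(q1 - x, x - q3, 0)
        if distance > 3 * iqr:
            labels.append("extreme")
        elif distance > 1.5 * iqr:
            labels.append("mild")
        else:
            labels.append("none")
    return labels


# Hypothetical scores, invented for this example.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 20, 40]
labels = classify_outliers(scores)

for score, label in zip(scores, labels):
    if label != "none":
        print(score, label)
```

With these made-up scores the quartiles are 4 and 10, so the IQR is 6. A score must be above 19 (or below -5) to count as a mild outlier and above 28 (or below -14) to count as an extreme one, which is why 20 is labeled mild and 40 extreme.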

• Some things to keep in mind when looking for outliers...
1. Outliers can be found in many (many!) of the variables in every study. If you are going to check for outliers, then you have to check for outliers in all your variables (e.g., could be 100+ in some surveys), and also check for outliers in the bivariate and multivariate relationships between your variables (e.g., 1000+ in some surveys). Given the large number of outlier analyses you have to conduct in every study, you will invariably find outliers.
2. You are less likely to find outliers after you create composites. It is common practice to use multiple questions to measure constructs because it increases the power of your statistical analysis. You typically create a “composite” score (average of all the questions) when analyzing your data. - In a study about happiness, you may use an established happiness scale, or create your own happiness questions that measure all the facets of the happiness construct. When analyzing your data, you average together all the happiness questions into a single happiness composite measure. While there may be some outliers in each individual question, averaging the items together reduces the probability of outliers, because an extreme answer on one question is diluted by the same person's answers on the other questions.
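
The effect of compositing can be illustrated with a small, invented data set. The sketch below assumes a hypothetical 4-question happiness scale answered by 8 people. The responses, the helper name `outlier_indices`, and the use of the 1.5*IQR rule with Python's "inclusive" quartiles are all assumptions made for the example, not part of any real study.

```python
from statistics import mean, quantiles


def outlier_indices(values):
    """Return the positions of values more than 1.5*IQR beyond the quartiles."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical responses: one row per person, one column per question.
responses = [
    [5, 5, 6, 5],
    [5, 6, 5, 6],
    [6, 5, 5, 4],
    [5, 4, 5, 5],
    [6, 6, 6, 5],
    [5, 5, 4, 5],
    [6, 5, 5, 6],
    [1, 5, 6, 5],  # the last person gives an unusually low answer to question 1
]

# Check the first question on its own.
question1 = [row[0] for row in responses]
question1_outliers = outlier_indices(question1)

# Check the composite (each person's average across all questions).
composites = [mean(row) for row in responses]
composite_outliers = outlier_indices(composites)

print("Outliers on question 1:", question1_outliers)
print("Outliers on the composite:", composite_outliers)
```

In this toy example the last person is flagged as an outlier on question 1, but their composite score of 4.25 falls inside the fences for the composite, so no one is flagged after averaging. This is a single constructed case, not a guarantee: a person who answers extremely on most questions will still be an outlier on the composite.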
