Dealing with Outliers
From PsychWiki - A Collaborative Psychology Wiki
Revision as of 20:54, 7 September 2009 by Doug
- How do I deal with outliers?
- First, the answer depends partly upon why the outlier exists.
- It is possible the outlier is due to a data entry mistake, so check the raw data for entry errors before doing anything else.
- It is possible the question is poorly worded or constructed, or some subjects did not adequately understand how to respond to the question. In this case you may want to remove that question from the analysis.
- It is possible the question is adequately constructed but the subjects who responded with the outlier values differ in some systematic way from the subjects who did not. You can create a new variable that categorizes all the subjects as either “outlier subjects” or “non-outlier subjects”, and then re-examine the data to see if there is a difference between these two types of subjects.
- It is possible that the same subjects are responsible for outliers in many questions in the survey. In this case you may want to remove those subjects from the analysis.
- It is possible that the subjects responded with the "outlier" for a reason. Just because a value is extreme compared to the rest of the data does not necessarily mean it is somehow an anomaly, or invalid, or should be removed.
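The two checks above (categorizing subjects as outlier or non-outlier, and seeing whether the same subjects produce outliers on many questions) can be sketched in Python. This is only an illustration under stated assumptions: the data are made up, the names `iqr_fences`, `flag_outliers`, `q1_outlier` and `outlier_counts` are hypothetical, and the 1.5 x IQR cutoff is one common rule of thumb standing in for whatever detection method you actually use.

```python
from statistics import mean, quantiles

def iqr_fences(values, k=1.5):
    """Return (low, high) cutoffs; values beyond them are treated as outliers.

    Assumption: a 1.5 x IQR rule. Swap in your own detection rule as needed.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

def flag_outliers(values, k=1.5):
    """Return one True/False flag per value, in the original order."""
    low, high = iqr_fences(values, k)
    return [v < low or v > high for v in values]

# Hypothetical survey data, one dict per subject (made up for illustration).
subjects = [
    {"id": 1, "age": 21, "q1": 50, "q2": 40},
    {"id": 2, "age": 22, "q1": 52, "q2": 42},
    {"id": 3, "age": 20, "q1": 55, "q2": 41},
    {"id": 4, "age": 23, "q1": 53, "q2": 43},
    {"id": 5, "age": 21, "q1": 51, "q2": 39},
    {"id": 6, "age": 22, "q1": 54, "q2": 44},
    {"id": 7, "age": 35, "q1": 96, "q2": 5},
    {"id": 8, "age": 25, "q1": 49, "q2": 42},
]

# New variable: is this subject an outlier on question q1?
q1_flags = flag_outliers([s["q1"] for s in subjects])
for s, flag in zip(subjects, q1_flags):
    s["q1_outlier"] = flag

# Compare the two groups on another variable (here, age).
ages_outlier = [s["age"] for s in subjects if s["q1_outlier"]]
ages_other = [s["age"] for s in subjects if not s["q1_outlier"]]
if ages_outlier and ages_other:
    print("mean age, outlier subjects:", mean(ages_outlier))
    print("mean age, non-outlier subjects:", mean(ages_other))

# Count how many questions each subject is an outlier on.
outlier_counts = {s["id"]: 0 for s in subjects}
for question in ("q1", "q2"):
    flags = flag_outliers([s[question] for s in subjects])
    for s, flag in zip(subjects, flags):
        outlier_counts[s["id"]] += int(flag)
print(outlier_counts)
```

In this made-up data, subject 7 is flagged on both questions, which is the pattern that might lead you to look more closely at that subject. With real data you would follow the group comparison with a proper significance test rather than comparing means by eye.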
- Second, if you want to reduce the influence of the outlier, you have four options:
- Option 1 is to delete the value. If you have only a few outliers, you may simply delete those values, so they become blank or missing values.
- Option 2 is to delete the variable. If you feel the question was poorly constructed, or if there are too many outliers in that variable, or if you do not need that variable, you can simply delete the variable. Also, if transforming the value or variable (e.g., Options #3 and #4) does not eliminate the problem, you may want to simply delete the variable.
- Option 3 is to transform the value. You have a few options for transforming the value.
- One option is to change the value to the next highest/lowest (non-outlier) number. If you have a 100-point scale, and you have two outliers (95 and 96), and the next highest (non-outlier) number is 89, then you could simply change the 95 and 96 to 89s. Alternatively, if the two outliers were 5 and 6, and the next lowest (non-outlier) number was 11, then the 5 and 6 would change to 11s.
- Another option is to change the value to the next highest/lowest (non-outlier) number PLUS one unit increment higher/lower. The 95 and 96 would change to 90s (i.e., 89 plus 1 unit). The 5 and 6 would change to 10s (i.e., 11 minus 1 unit).
- Option 4 is to transform the variable. Instead of changing the individual outliers (as in Option #3), we are now talking about transforming the entire variable. Transformations such as a logarithm or square root can pull in the tails of a skewed distribution and bring it closer to normal. Since outliers are one cause of non-normality, transforming the variable can reduce the influence of outliers. See How do I transform variables? for more information.
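Options 1, 3 and 4 above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function names (`delete_values`, `replace_with_nearest`, `log_transform`) are hypothetical, the outlier flags are assumed to come from a separate detection step, and the log transform is just one example of a whole-variable transformation.

```python
import math

def delete_values(values, is_outlier):
    """Option 1: turn each outlier into a missing value (None)."""
    return [None if flag else v for v, flag in zip(values, is_outlier)]

def replace_with_nearest(values, is_outlier, step=0):
    """Option 3: move each outlier to the nearest non-outlier extreme.

    step=0 uses the highest/lowest non-outlier value itself;
    step=1 moves one unit further out (e.g. 89 + 1 = 90).
    Assumes every outlier lies above or below all non-outlier values.
    """
    kept = [v for v, flag in zip(values, is_outlier) if not flag]
    lowest, highest = min(kept), max(kept)
    result = []
    for v, flag in zip(values, is_outlier):
        if not flag:
            result.append(v)
        elif v > highest:
            result.append(highest + step)
        else:
            result.append(lowest - step)
    return result

def log_transform(values):
    """Option 4: transform the whole variable (base-10 log; needs values > 0)."""
    return [math.log10(v) for v in values]

# The high-end example from the text: 95 and 96 are outliers, 89 is not.
high_scores = [70, 75, 80, 85, 89, 95, 96]
high_flags = [False, False, False, False, False, True, True]
print(delete_values(high_scores, high_flags))
print(replace_with_nearest(high_scores, high_flags))           # 95, 96 -> 89
print(replace_with_nearest(high_scores, high_flags, step=1))   # 95, 96 -> 90

# The low-end example from the text: 5 and 6 are outliers, 11 is not.
low_scores = [5, 6, 11, 15, 20]
low_flags = [True, True, False, False, False]
print(replace_with_nearest(low_scores, low_flags))             # 5, 6 -> 11
print(replace_with_nearest(low_scores, low_flags, step=1))     # 5, 6 -> 10

# A log transform compresses large values far more than small ones.
print(log_transform([10, 100, 1000]))
```

Option 2 (deleting the variable) needs no code beyond dropping the column. Whichever option you use, keep the original values somewhere so the change can be reported and reversed.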
- Third, after dealing with the outlier, you re-run the outlier analysis to determine if the data are outlier free. See Detecting Outliers - Univariate and Detecting Outliers - Multivariate.
- Sometimes new outliers emerge, either because they were masked by the old outliers or because the distribution has changed after removing the old outliers, so that existing extreme data points now qualify as outliers.
- If new outliers emerge, and you want to reduce the influence of the outliers, you choose one of the four options again. Then, re-run the outlier analysis to determine if any new outliers emerge or if the data are outlier free, and repeat again.
- Sometimes new outliers will keep emerging each time you re-run the outlier analysis. It can become a cumbersome and sometimes overwhelming process that has no end in sight. Plus, at what point, if any, should you draw the line and stop removing the newly emerging outliers?
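The re-run cycle described above can be sketched as a loop with an explicit cap on the number of passes, which is one possible way of drawing the line. The sketch assumes a 1.5 x IQR detection rule and simple deletion (Option 1); the function name `remove_until_stable`, the `max_passes` parameter and the data are all hypothetical.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (low, high) cutoffs under an assumed 1.5 x IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

def remove_until_stable(values, max_passes=3):
    """Re-run the outlier check after each removal.

    Stops when a pass flags nothing, or after max_passes passes.
    Returns the remaining values and a list of what each pass removed.
    """
    current = list(values)
    history = []
    for _ in range(max_passes):
        low, high = iqr_fences(current)
        flagged = [v for v in current if v < low or v > high]
        if not flagged:
            break  # data are outlier free under this rule
        history.append(flagged)
        current = [v for v in current if low <= v <= high]
    return current, history

# Made-up scores: 22 is not flagged at first, only after 40 is removed.
scores = [12, 40, 10, 15, 22, 11, 14, 16, 13]
remaining, removed_per_pass = remove_until_stable(scores)
print(remaining)
print(removed_per_pass)
```

With these made-up scores the first pass removes 40, the second pass then flags 22 because the cutoffs have tightened, and the third pass finds nothing. Recording what each pass removed makes it easier to report the procedure and to decide, before looking at the results, how many passes you are willing to run.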
- Fourth, some things to keep in mind about dealing with outliers:
- If you find and eliminate outliers in one of your published studies, then from an ethical and equity point of view, you should conduct the same outlier analysis in every study you analyze and deal with outliers consistently throughout your research.