Identifying Missing Data
From PsychWiki - A Collaborative Psychology Wiki
Revision as of 20:53, 7 September 2009 by Doug
How do I identify if missing values are random or non-random?
- First, if there are only a small number of missing values, they are very unlikely to be non-random. - Almost every dataset contains a few missing values. Imagine a dataset with many variables and/or a large sample size. The more variables in the study, the more likely it is that a subject will inadvertently skip one or two. The more subjects in the study, the more likely it is that some subject will make a mistake. What do we mean by a "small number" of missing values? The rule of thumb is that if fewer than 5% of the values in the dataset are missing, those missing values are unlikely to be non-random.
- Second, one way in which a small number of missing values could be non-random is if most or all of the missing values come from the same subject. - A subject who repeatedly skips questions is probably not paying close enough attention to the study and/or is racing through it too quickly. These are the kinds of situations in which you would want to delete the entire case (see How do I deal with missing data? for more information).
- Third, even if there are a larger number of missing values, that does not necessarily mean the missing values are non-random. - Some questions will always have a large number of missing values because of the way the question is designed. Imagine the following question in your study: "Which of the following candy bars do you prefer? Please mark all answers that apply." Given the request to "mark all answers that apply", there will be a lot of missing data, because every option a subject leaves unmarked shows up as a missing value and some options are chosen less frequently than others.
- Fourth, SPSS has an add-on module called "Missing Value Analysis" that will statistically test whether missing values are random or non-random. You can access it by clicking Analyze --> Missing Value Analysis, and checking EM estimation. The EM output includes Little's MCAR test, which checks whether the subjects with missing values differ from the subjects without missing values. If p < .05, the two groups are significantly different from each other, which indicates the missing values are non-random. In other words, you want the p-value to be greater than .05, which is consistent with the missing values being random.
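The checks described above can also be sketched outside of SPSS. The example below is only an illustration under stated assumptions: missing answers are stored as None, the data layout and the function names (missing_summary, permutation_p_value) are hypothetical, and the 5% threshold is the rule of thumb from the first point. The group comparison uses a simple permutation test on one variable; it is a stand-in for the idea of comparing subjects with and without missing values, not a reimplementation of the SPSS EM-based test.

```python
import random
from statistics import mean

MISSING = None  # assumption: a skipped answer is stored as None


def missing_summary(rows, threshold=0.05):
    """Summarize missingness overall and per subject.

    rows maps a subject ID to that subject's list of answers.
    threshold is the 5% rule of thumb, not a formal test.
    """
    total_cells = sum(len(answers) for answers in rows.values())
    per_subject = {
        subject: sum(1 for a in answers if a is MISSING)
        for subject, answers in rows.items()
    }
    total_missing = sum(per_subject.values())
    overall_rate = total_missing / total_cells if total_cells else 0.0
    # Share of all missing values that comes from each subject with any
    # missing answers; a share near 1.0 points at a single careless subject.
    share = {s: n / total_missing for s, n in per_subject.items() if n > 0}
    return {
        "overall_rate": overall_rate,
        "below_threshold": overall_rate < threshold,
        "per_subject": per_subject,
        "share_of_missing": share,
    }


def permutation_p_value(rows, missing_col, compare_col, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference in mean compare_col
    between subjects who skipped missing_col and subjects who answered it."""
    skipped, answered = [], []
    for answers in rows.values():
        value = answers[compare_col]
        if value is MISSING:
            continue  # cannot compare on a value that is itself missing
        group = skipped if answers[missing_col] is MISSING else answered
        group.append(value)
    if not skipped or not answered:
        raise ValueError("need at least one subject in each group")

    observed = abs(mean(skipped) - mean(answered))
    pooled = skipped + answered
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        first, rest = pooled[:len(skipped)], pooled[len(skipped):]
        if abs(mean(first) - mean(rest)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Made-up example data: four subjects, four questions each.
data = {
    "s1": [4, 5, 3, 2],
    "s2": [None, None, None, 1],
    "s3": [3, 4, 4, 5],
    "s4": [2, 3, None, 4],
}
summary = missing_summary(data)
print(summary["overall_rate"], summary["share_of_missing"])
print(permutation_p_value(data, missing_col=0, compare_col=3))
```

As with the SPSS test, a p-value below .05 from the comparison would suggest that subjects who skipped the question differ from those who answered it. With a sample as small as the made-up one above, the result says very little, so treat the output as a demonstration of the mechanics only.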