# How to dichotomize variables?

How do I decide where to split up the variable? This is a complex question with a complex answer:

• From a theoretical point of view, you can split at the midpoint of the scale, because that is conceptually the middle response. - Imagine a study about happiness where your happiness question (or composite) ranges from 1 to 7. The midpoint of a 7-point scale is 4, so you cut the continuous variable into a dichotomous categorical variable by categorizing the subjects as either high happiness (4 through 7 on the scale) or low happiness (1 through 4 on the scale).
• From a practical point of view, you cannot cut the variable exactly in half, because then the same subject might fall into more than one category. - Notice that in the above example, the option "4" is in both categories (i.e., the "high" happiness category AND the "low" happiness category). This is a problem because you don't want the same subject in more than one category. The solution is to cut the variable so that "low" happiness is 1-3.99 and "high" happiness is 4.01-7. In other words, you create a small degree of separation.
• However, what if more of the subjects in your dataset fall at the high or the low end of the scale? Splitting at the midpoint of the scale might then create vastly unequal groups. - Imagine that splitting at the midpoint of the scale puts 70-80% of the subjects at one end and 10-20% at the other. You are already losing valuable information by reducing a continuous variable to a categorical one, and if the categories are unbalanced, you lose even more. In this case, you could choose to split at the median, even if the median is not the midpoint of the scale. The median is a good choice for splitting the variable because it is the midpoint of that sample. Samples are not always normally distributed, so the only way to create groups of equal size is to cut at the median, even though the median may not be the true midpoint of the scale. - On a 1-7 scale, the midpoint is 4, but the median of the sample may be 3.8, 4.5, or some other value. In other words, from a statistical point of view, the median truly splits the sample into halves.
• However, what if the median in your dataset is very high or very low relative to the range of the scale? - What if, on a 1-7 scale, the median is a 2 or a 6? In this situation, half of the scores are bunched into a small range (e.g., 2 points in this example), whereas the other half are spread across a larger range (e.g., 5 points in this example). Once again, you are losing valuable information by dichotomizing in this way.
• In summary, there are both theoretical and statistical considerations when dichotomizing variables. One solution is to dichotomize both ways (at the scale midpoint and at the sample median) and analyze the data using both variables.
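
The two cut points discussed above can be compared on a small example. The following is only a minimal sketch: the `happiness` scores are made-up illustration data, `dichotomize` and `group_sizes` are hypothetical helper names, and leaving scores that fall exactly on the cut point unlabeled is one assumed way of keeping a subject out of both groups.

```python
from statistics import median


def dichotomize(scores, cut):
    """Label each score 'low' (below the cut) or 'high' (above the cut).

    Scores exactly equal to the cut get None, so that no subject ends up
    in both groups. How to treat those subjects is an analyst decision
    and is an assumption of this sketch.
    """
    labels = []
    for score in scores:
        if score < cut:
            labels.append("low")
        elif score > cut:
            labels.append("high")
        else:
            labels.append(None)
    return labels


def group_sizes(labels):
    """Count how many subjects fall into each group."""
    return {key: labels.count(key) for key in ("low", "high", None)}


# Made-up happiness composites on a 1-7 scale, skewed toward the high end.
happiness = [2.5, 3.0, 4.5, 5.0, 5.0, 5.5, 6.0, 6.0, 6.5, 7.0]

scale_midpoint = (1 + 7) / 2       # theoretical middle of the scale
sample_median = median(happiness)  # statistical middle of this sample

by_midpoint = dichotomize(happiness, scale_midpoint)
by_median = dichotomize(happiness, sample_median)

print("midpoint split:", group_sizes(by_midpoint))
print("median split:  ", group_sizes(by_median))
```

With this skewed sample, the midpoint split puts most subjects in the "high" group, while the median split produces two groups of equal size. Keeping both labelings makes it possible to run the analysis with each version of the variable.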
