Dummy Variables in Data Mining Projects
Dummy variables or indicator variables in regression analysis and data mining projects are derived variables that usually represent:
- A categorical variable converted to 0/1 (true/false) values that indicate whether a record belongs to a category.
- An ordinal variable, converted in the same way.
- A null value indicator for a numerical or categorical variable: 0/1 (true/false) values that indicate whether the value is null.
Dummy variables are used extensively in data mining projects that involve regression modeling, logistic regression modeling and cross-tab generation. During the preliminary variable inspection, dummy variables should be generated and inspected for properties such as their frequency and their effect on the target variable (explained variable).
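As a rough illustration of that inspection step, the sketch below counts how often a dummy variable is 0 or 1 and computes the share of positive target values in each group. The function name `inspect_dummy`, the sample columns and the 0/1 target are assumptions made for this example; they are not part of Analysis Studio.

```python
def inspect_dummy(dummy, target):
    """Frequency of each dummy value and the share of target == 1 within it."""
    summary = {}
    for value in (0, 1):
        # Target values of the records whose dummy equals `value`.
        rows = [t for d, t in zip(dummy, target) if d == value]
        summary[value] = {
            "count": len(rows),
            "target_rate": sum(rows) / len(rows) if rows else None,
        }
    return summary


# Hypothetical data: a language dummy and a 0/1 target variable.
dum_english = [1, 0, 1, 1, 0, 0]
target = [1, 0, 1, 0, 0, 1]

report = inspect_dummy(dum_english, target)
for value, stats in report.items():
    print(value, stats)
```

A large gap between the two target rates suggests the dummy variable is worth trying in a model, while a very small count for one of the groups suggests it may be too rare to be useful.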
Categorical variable
Dummy variables generated from a categorical variable represent the category membership of a single record. For example, a data set containing the user interface language (see the frequency table below) has 9 different languages, and each record has one language assigned.

INTERFACE_LANG - Data records in Analysis Studio®

INTERFACE_LANG - Frequency table in Analysis Studio®
Using a dummy variable generator to generate the dummy variables for the INTERFACE_LANG variable will result in 8 or 9 new data columns (explained later), each representing whether the record has a specific INTERFACE_LANG or not. The variable names will be:
DUM_INTERFACE_LANG_DESC_CHINESE
DUM_INTERFACE_LANG_DESC_DUTCH
DUM_INTERFACE_LANG_DESC_ENGLISH
DUM_INTERFACE_LANG_DESC_FRENCH
DUM_INTERFACE_LANG_DESC_GERMAN
DUM_INTERFACE_LANG_DESC_ITALIAN
DUM_INTERFACE_LANG_DESC_PORTUGUESE
DUM_INTERFACE_LANG_DESC_SPANISH
DUM_INTERFACE_LANG_DESC_SWEDISH
As shown below, each record has only one true (or 1) value among the dummy variables, which indicates the record's category. A single record with INTERFACE_LANG = ENGLISH:
DUM_INTERFACE_LANG_DESC_CHINESE = false
DUM_INTERFACE_LANG_DESC_DUTCH = false
DUM_INTERFACE_LANG_DESC_ENGLISH = true
DUM_INTERFACE_LANG_DESC_FRENCH = false
DUM_INTERFACE_LANG_DESC_GERMAN = false
DUM_INTERFACE_LANG_DESC_ITALIAN = false
DUM_INTERFACE_LANG_DESC_PORTUGUESE = false
DUM_INTERFACE_LANG_DESC_SPANISH = false
DUM_INTERFACE_LANG_DESC_SWEDISH = false
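INTERFACE_LANG | CHINESE | DUTCH | ENGLISH | FRENCH | GERMAN | ITALIAN | PORTUGUESE | SPANISH | SWEDISH
ENGLISH | False | False | True | False | False | False | False | False | False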
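The generation step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the generator used by Analysis Studio: the function name `make_dummies` and the sample values are assumptions, and only the column prefix follows the naming shown in the text.

```python
def make_dummies(values, prefix):
    """Return one 0/1 column per distinct category found in `values`."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }


# Hypothetical sample of INTERFACE_LANG values (not the real data set).
interface_lang = ["ENGLISH", "FRENCH", "ENGLISH", "GERMAN"]
dummies = make_dummies(interface_lang, "DUM_INTERFACE_LANG_DESC")

for name, column in dummies.items():
    print(name, column)
```

Each record has exactly one 1 across the generated columns, which mirrors the single true value in the record shown above.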
Why 8 or 9 groups? When generating dummy variables for a categorical variable with K different values, only K – 1 dummy variables are needed to represent all values, since one category is implied when all the others are false. For example, a gender variable with the values Male and Female needs only one variable, which answers the question: is male? (the actual output will be something like DUMMY_Male). Let's extend this example to Male, Female and Null: we get two variables, Is male? and Is female?, and records that have false values in both columns have a Null value in the Gender variable. Explicit dummy variable generation, which creates a dummy variable for every group, results in another variable that indicates null values (DUMMY_Gender_Null) and allows direct exploration of the K-th variable.
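The difference between the implicit (K – 1) and explicit (K) encodings can be sketched as follows. The helper `make_dummies`, its `drop` parameter and the sample data are assumptions for illustration; null values are represented here by Python's `None` and labeled "Null".

```python
def make_dummies(values, prefix, drop=None):
    """One 0/1 column per category; `drop` names the category left implicit."""
    # Treat missing values as their own category, labeled "Null".
    labels = ["Null" if v is None else v for v in values]
    categories = sorted(set(labels))
    return {
        f"{prefix}_{cat}": [1 if lab == cat else 0 for lab in labels]
        for cat in categories
        if cat != drop
    }


# Hypothetical Gender column with one null value.
gender = ["Male", "Female", None, "Female"]

# K - 1 encoding: the Null group is implied by zeros in both columns.
implicit = make_dummies(gender, "DUMMY_Gender", drop="Null")

# Explicit encoding: every group, including Null, gets its own column.
explicit = make_dummies(gender, "DUMMY_Gender")

print(sorted(implicit))  # ['DUMMY_Gender_Female', 'DUMMY_Gender_Male']
print(sorted(explicit))  # adds 'DUMMY_Gender_Null'
```

In the implicit version the third record is 0 in both columns, which is how the null value is recovered. The explicit version makes that group directly visible as DUMMY_Gender_Null.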
In some data mining and statistical analysis projects, dummy variables are generated in two steps:
1. Transform (recode) a variable into a categorical one.
2. Create a dummy variable from the newly created categories.
Example: An age variable is categorized (e.g. 10_20, 21_30, 31_40, 41_50, 51_60, 61_70, 70+) and the resulting age groups are then converted into dummy variables for further use (e.g. regression modeling or logistic regression modeling).
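The two steps can be sketched as below. The group labels come from the example above; the boundary handling is an assumption, since the text does not specify it: an age of exactly 70 goes to 61_70, "70+" means older than 70, and ages below 10 or missing ages get no group. The names `age_group` and `DUM_AGE_` are hypothetical.

```python
# Upper bound (inclusive) and label of each bounded age group.
BOUNDS = [(20, "10_20"), (30, "21_30"), (40, "31_40"),
          (50, "41_50"), (60, "51_60"), (70, "61_70")]
LABELS = [label for _, label in BOUNDS] + ["70+"]


def age_group(age):
    """Step 1: recode a numeric age into a categorical age group."""
    if age is None or age < 10:
        return None  # assumption: no group for missing or very low ages
    for upper, label in BOUNDS:
        if age <= upper:
            return label
    return "70+"


# Hypothetical ages.
ages = [18, 35, 70, 82]
groups = [age_group(a) for a in ages]

# Step 2: create one dummy variable per age group.
age_dummies = {
    f"DUM_AGE_{label}": [1 if g == label else 0 for g in groups]
    for label in LABELS
}

print(groups)  # ['10_20', '31_40', '61_70', '70+']
```

Listing the labels up front, rather than deriving them from the data, keeps a column for every age group even when a group has no records in the sample.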
Ordinal variable
Ordinal variables are categorical variables whose values have a meaningful order. A common example is a customer satisfaction survey in which the satisfaction level is represented by a number (1-5). Dummy variables are used with ordinal variables in the same way as with categorical variables.
Null value indicator
A null value indicator indicates whether a specific variable has a null value (is null?). Most (if not all) real-life data sets contain null values, and dummy variables are a good way to treat them. Null values bias the model when they are ignored in regression modeling and logistic regression modeling (in many cases a record containing null data will not be processed by the model).
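A null value indicator can be sketched as follows. The column name and values are hypothetical, and nulls are represented by Python's `None`. Replacing the nulls with 0 so that the record can still be processed is an assumption of this example, one of several possible choices, and is not prescribed by the text.

```python
def null_indicator(values):
    """Return 1 where the value is null (None) and 0 otherwise."""
    return [1 if v is None else 0 for v in values]


# Hypothetical numerical variable with two null values.
income = [52000, None, 61000, None]

dum_income_null = null_indicator(income)

# Assumption: fill nulls with a placeholder so the record stays usable;
# the indicator column preserves the fact that the value was missing.
income_filled = [0 if v is None else v for v in income]

print(dum_income_null)  # [0, 1, 0, 1]
```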
Another dummy variable strategy, which combines null and category usage, is to take a messy variable, or a variable that cannot be used as is (it doesn't make sense or has non-normal behavior), and to generate a zero and a non-zero category from it (which is a variant of a dummy variable).
Example: During a data mining project done in the service department of a large computer hardware reseller, a logistic regression model was used to determine customer attrition. One of the variables in the data set was the number of Sun™ servers each customer had. This variable was rejected every time it was put in a model "as is" and was almost excluded from the model. Eventually, the variable was converted into a 0/1 variable indicating whether a customer had a Sun™ server or not, and it became one of the core variables in the model. The business logic behind this variable was that it did not matter how many Sun™ servers a customer had, but it did matter whether the customer had one, since this computer hardware reseller had an exclusive Sun™ maintenance program in the region.
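The recode used in this example can be sketched as below. The counts are invented for illustration, and treating a null count as "no server" is an assumption of this sketch, not something stated about the original project.

```python
def to_zero_nonzero(counts):
    """Recode a count into 0 (zero or null) and 1 (any non-zero count)."""
    return [1 if c is not None and c > 0 else 0 for c in counts]


# Hypothetical number of servers per customer.
sun_servers = [0, 3, None, 1, 12]

dum_has_sun_server = to_zero_nonzero(sun_servers)
print(dum_has_sun_server)  # [0, 1, 0, 1, 1]
```

The customers with 1, 3 and 12 servers all map to the same value, which matches the business logic that only ownership matters, not the count.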
Since dummy variables are a recode of existing variables, they may bias the data when misused. A good example is a data set that contains null values, where a null group may be created as a variable. At first this might seem fine; however, when the variable is used for predictive modeling, it might be entered into the model and become an unwanted indication of the future.