There are loads of different ways to convert categorical variables into numeric features so they can be used within machine learning models. While you can perform this process manually on a per-feature basis, it’s often quicker and easier to make use of transformers.

These special classes are ideal for performing bulk operations on data. Some are built into Scikit-Learn, and you can create your own custom ones by inheriting from Scikit-Learn’s `TransformerMixin`. However, a much easier approach is to make use of pre-built transformers that plug into the Scikit-Learn architecture via a package such as Category Encoders.
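To give a rough idea of what the custom route involves, here is a minimal sketch of a frequency encoder built by inheriting from `BaseEstimator` and `TransformerMixin`. The class name `FrequencyEncoder`, its `cols` parameter and the toy dataframe are hypothetical and only for illustration; they aren't part of Scikit-Learn or Category Encoders.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: replace each category with its relative frequency."""

    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, X, y=None):
        # Learn the frequencies only from the data passed to fit()
        cols = self.cols if self.cols is not None else list(X.columns)
        self.freqs_ = {c: X[c].value_counts(normalize=True) for c in cols}
        return self

    def transform(self, X):
        X = X.copy()
        for col, freq in self.freqs_.items():
            # Categories that were not seen during fit() get a frequency of 0
            X[col] = X[col].map(freq).fillna(0.0)
        return X


toy = pd.DataFrame({'colour': ['red', 'red', 'blue', 'green']})
encoded = FrequencyEncoder(cols=['colour']).fit_transform(toy)
print(encoded['colour'].tolist())
```

Because `fit()` and `transform()` are separate, the frequencies learned on training data can be reused on test data, which is the main thing the pre-built transformers also give you.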

Category Encoders is a diverse set of Scikit-Learn style transformers designed for converting categorical data into numeric forms. It works with Pandas (as an input or an output), it’s configurable, compatible with Scikit-Learn, and includes a transformer for pretty much every common categorical data encoding problem you’re likely to encounter.

There are other benefits to using Scikit-Learn transformers for preprocessing instead of Pandas. You can cross-validate the whole workflow, run a grid search over both the model and the preprocessing hyperparameters, avoid adding new columns to the original dataframe, and avoid data leakage, because the transformers are fitted on the training data only.
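As a small illustration of searching over preprocessing and model settings together, the sketch below runs `GridSearchCV` over a pipeline. The synthetic data, the column names and the choice of `LogisticRegression` with Scikit-Learn's own `OneHotEncoder` are assumptions made to keep the example self-contained; they aren't the setup used later in this article.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Small synthetic data set (made up for illustration)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'age': rng.integers(18, 70, size=200).astype(float),
    'job': rng.choice(['admin', 'sales', 'tech'], size=200),
})
toy.loc[::10, 'age'] = np.nan  # introduce some missing values
target = (toy['job'] == 'tech').astype(int)

preprocessor = ColumnTransformer([
    ('numeric', SimpleImputer(), ['age']),
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['job']),
])
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Double underscores address the parameters of nested steps
param_grid = {
    'preprocessor__numeric__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(toy, target)
print(search.best_params_)
```

Since the preprocessing lives inside the pipeline, it is refitted on each training fold, so nothing from the validation fold leaks into the imputation or encoding.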

Encoder | Description |
---|---|
Backward Difference Encoder | Backward Difference Encoding compares the mean of the dependent variable for a level to the mean of the dependent variable for the prior level. It's a Contrast Encoder (along with Reverse Helmert and Polynomial). |
BaseN Encoder | BaseN Encoding converts the ordinal index of each category into digits of a chosen base. It can work with a range of different base values to produce encodings. For example, passing the argument `base=2` to the encoder creates binary values, while larger values can be used on higher cardinality data. |
Binary Encoder | Binary Encoding sits somewhere between One Hot Encoding and Hashing, as it converts categorical data into binary digits. It is a bit more concise than One Hot Encoding and adds fewer columns, so it is better suited to higher cardinality data than OHE. |
[CatBoost Encoder](https://contrib.scikit-learn.org/category_encoders/catboost.html) | CatBoost encoding, from the model of the same name, uses something called the ordering principle to try to reduce target leakage. It's a bit like Leave One Out encoding and works on continuous and binomial targets. |
Count Encoder | Count Encoding (which is like Count Vectorization used in NLP models) replaces each category with a numeric value representing its frequency within the dataset, so common categories have high values and rare categories have low values. |
Generalized Linear Mixed Model Encoder | The Generalized Linear Mixed Model or GLMM Encoder is a bit like Target or M-Estimate encoding and can be used on continuous or binomial targets. |
Hashing Encoder | The Hashing Encoder applies the popular "hashing trick" to map categorical variables into a fixed number of columns. It's a popular choice for use on high cardinality data, where one hot encoding would create too many columns. Works on nominal and ordinal data, but can cause data loss due to hash collisions. |
Helmert Encoder | Helmert Encoding is one of the Contrast Encoders, along with Backward Difference and Polynomial Encoding. The version implemented in this package is reverse Helmert Encoding, which compares the mean of the target for a level against the mean over all previous levels. |
James-Stein Encoder | The James-Stein estimator is another type of target encoder. It takes a weighted average of the mean target value for the observed category and the overall mean target value. It's designed for use with normal distributions. |
Leave One Out Encoder | Leave One Out or LOO encoding is another target encoding technique. However, it leaves out the target for the current row when calculating the mean, which can help with data containing outliers. |
M-estimate Encoder | M-estimate encoding is a bit like a simplified version of Target Encoding. Along with Target, WoE, James-Stein and LOO, this is one of the Bayesian encoders. All of them are generally good for high cardinality data. |
One Hot Encoder | One Hot Encoding or OHE is one of the most widely used techniques for encoding categorical variables. Best suited to low cardinality data, it can be used to binarise values, but needs to be used with caution. Works on nominal and ordinal data. |
Ordinal Encoder | The Ordinal Encoder (OE) is basically the same as the Label Encoder (LE), as I understand it. It takes each unique categorical value and maps it to a number. As the name suggests, it's best for ordinal data that have a rank order, as it inherently implies ordinality to models and can therefore mislead them if used on non-ordinal data. |
Polynomial Encoder | A Contrast Encoder (along with Backward Difference and Helmert). It's supposed to be used on ordered categorical variables that are spaced equally. |
Sum Encoder | Sum Encoding compares the mean of the target variable for a given level against the mean of the target over all the levels. |
Target Encoder | Target Encoding, or Mean Encoding, as it's also known, is a powerful Bayesian encoding technique. Data are grouped and then a mean of the target is calculated for the grouping. Mean encoded data are often very important features. |
Weight of Evidence Encoder | Weight of Evidence or WoE encoding came from the world of finance where it was used to assess credit risk. It's a Bayesian encoding technique (along with Target Encoding, James-Stein, M-Estimate and LOO) and can be effective on high cardinality data. |
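To make a few of these descriptions more concrete, here is a minimal sketch that reproduces the basic idea behind count, target (mean) and leave-one-out encoding by hand in Pandas. It's a deliberate simplification: the Category Encoders implementations add smoothing, handling of unseen categories and other options, and the toy data is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'city': ['London', 'London', 'Paris', 'Paris', 'Paris', 'Rome'],
    'bought': [1, 0, 1, 1, 0, 1],
})

# Count encoding: how often each category appears
toy['city_count'] = toy['city'].map(toy['city'].value_counts())

# Target (mean) encoding without smoothing: mean of the target per category
toy['city_target'] = toy.groupby('city')['bought'].transform('mean')

# Leave-one-out encoding: per-category mean that excludes the current row.
# A category seen only once has no other rows, so it ends up as NaN here.
group_sum = toy.groupby('city')['bought'].transform('sum')
group_n = toy.groupby('city')['bought'].transform('count')
toy['city_loo'] = (group_sum - toy['bought']) / (group_n - 1)

print(toy)
```

Note how the plain target encoding gives every Paris row the same value, while the leave-one-out version varies by row because each row's own target is removed before averaging.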

We need quite a few libraries for this project. We’ll be using Pandas to load and display our data, NumPy for selecting columns by data type, XGBoost for our classification model, and various modules from Scikit-Learn to run the transformers and pipelines and to assess the model’s accuracy. Finally, we’re using the Category Encoders package to perform our categorical variable transformations.

```
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
import category_encoders as ce
import warnings
warnings.filterwarnings('ignore')
```

The data set I’ve used is a Census Income data set from the UCI Machine Learning Repository. It contains a range of numeric and categorical features for us to encode, with the aim of predicting a person’s income from features such as their age, education, occupation, and ethnicity. The column names on this data set are not properly defined, so I’ve passed in the actual column names using the `names` argument of the Pandas `read_csv()` function.

```
# Load the data and supply the column names ourselves
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
    names=['age', 'employment_type', 'final_weight', 'education', 'education_score',
           'marital_status', 'occupation', 'relationship_status', 'ethnicity', 'gender',
           'capital_gain', 'capital_loss', 'weekly_hours', 'native_country', 'income']
)
df.head()
```

| | age | employment_type | final_weight | education | education_score | marital_status | income |
|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | <=50K |

```
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 employment_type 32561 non-null object
2 final_weight 32561 non-null int64
3 education 32561 non-null object
4 education_score 32561 non-null int64
5 marital_status 32561 non-null object
6 occupation 32561 non-null object
7 relationship_status 32561 non-null object
8 ethnicity 32561 non-null object
9 gender 32561 non-null object
10 capital_gain 32561 non-null int64
11 capital_loss 32561 non-null int64
12 weekly_hours 32561 non-null int64
13 native_country 32561 non-null object
14 income 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
```

Next we’ll separate the features in the dataframe by their datatype. There are a few different ways to achieve this. I’ve used the `select_dtypes()` function, passing in `[np.number]` to obtain the numeric data and `exclude=[np.number]` to return the categorical data. Appending `.columns` to the end returns an `Index` object containing the column names. For the categorical features, we don’t want to include the target `income` column, so I’ve dropped that.

```
numeric_features = df.select_dtypes([np.number]).columns
numeric_features
```

```
Index(['age', 'final_weight', 'education_score', 'capital_gain',
'capital_loss', 'weekly_hours'],
dtype='object')
```

```
categorical_features = df.select_dtypes(exclude=[np.number]).drop(['income'], axis=1).columns
categorical_features
```

```
Index(['employment_type', 'education', 'marital_status', 'occupation',
'relationship_status', 'ethnicity', 'gender', 'native_country'],
dtype='object')
```

As usual, we’ll define the `X` feature set to include our fields, minus the target `income` column, then we’ll set our `y` data to be the `income` column, which holds the target value stating each person’s income band.

```
X = df.drop('income', axis=1)
y = df['income']
```

```
y.head()
```

```
0 <=50K
1 <=50K
2 <=50K
3 <=50K
4 <=50K
Name: income, dtype: object
```

If you print the unique values of the target column by entering `y.unique()` you’ll see that it contains two strings stating whether the person’s income is above or below 50K. Since the model requires a numeric label, we need to convert this to an integer. Again, you can do that in a number of ways (such as using `np.where()`), but the quickest is to use the `preprocessing` module’s `LabelEncoder()` class. Running `fit_transform()` on the `y` data will fit the label encoder and then return the encoded labels (this avoids the need to run `fit()` and then `transform()`).

```
y = preprocessing.LabelEncoder().fit_transform(y)
```

```
y
```

```
array([0, 0, 0, ..., 0, 0, 1])
```
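For comparison, the sketch below runs both `LabelEncoder` and the `np.where()` alternative on a made-up series of labels. The `labels` series is hypothetical and stands in for the real `income` column, whose raw strings may differ slightly (for example in whitespace).

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

labels = pd.Series(['<=50K', '>50K', '<=50K', '>50K', '>50K'])

# LabelEncoder assigns integers following the sorted order of the class labels
encoder = preprocessing.LabelEncoder()
encoded = encoder.fit_transform(labels)
print(encoder.classes_)  # the position of each class is its integer code

# The same mapping written explicitly with np.where()
encoded_where = np.where(labels == '>50K', 1, 0)

# inverse_transform() recovers the original strings from the integers
restored = encoder.inverse_transform(encoded)
```

The explicit `np.where()` version is a little more verbose, but it makes it obvious which class ends up as the positive label.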

Now we’ve got our dataframe ready we can split it up into the train and test datasets for our model to use. We’ll use the Scikit-Learn `train_test_split()` function for this. By passing in the `X` dataframe of raw features, the `y` array containing the encoded target, and the size of the test group (i.e. 0.3 for 30%), we get back the `X_train`, `X_test`, `y_train` and `y_test` data to use in the model.

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
```
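The split above is random on every run, so your scores will differ slightly each time. If you want repeatable numbers, `train_test_split()` also accepts `random_state`, and `stratify` keeps the class balance the same in both groups. The sketch below shows both on made-up data; the values are illustrative and are not the settings that produced the results in this article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 80 negatives and 20 positives
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

# random_state makes the split repeatable; stratify keeps the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Both groups keep the original 20% share of positives
print(y_tr.mean(), y_te.mean())
```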

You can, of course, use any classification model for this. I’ve used XGBClassifier from XGBoost because it’s generally really effective and pretty quick to run. In normal circumstances, you’d obviously go through a careful model selection process, but we’ll skip that here for demonstration purposes.

```
selected_model = XGBClassifier(random_state=0)
```

As we want to assess a broad set of the encoders provided with the Category Encoders package, we’ll put them into a dictionary. The dictionary key is the name of the encoder (e.g. `HashingEncoder`) while the value is the corresponding encoder class from the `category_encoders` package, which we’ll instantiate inside the loop. To store the results of each test, I’ll create a dataframe to hold the name of the encoder and the scores obtained.

```
encoders = {
    'BackwardDifferenceEncoder': ce.backward_difference.BackwardDifferenceEncoder,
    'BaseNEncoder': ce.basen.BaseNEncoder,
    'BinaryEncoder': ce.binary.BinaryEncoder,
    'CatBoostEncoder': ce.cat_boost.CatBoostEncoder,
    'HashingEncoder': ce.hashing.HashingEncoder,
    'HelmertEncoder': ce.helmert.HelmertEncoder,
    'JamesSteinEncoder': ce.james_stein.JamesSteinEncoder,
    'OneHotEncoder': ce.one_hot.OneHotEncoder,
    'LeaveOneOutEncoder': ce.leave_one_out.LeaveOneOutEncoder,
    'MEstimateEncoder': ce.m_estimate.MEstimateEncoder,
    'OrdinalEncoder': ce.ordinal.OrdinalEncoder,
    'PolynomialEncoder': ce.polynomial.PolynomialEncoder,
    'SumEncoder': ce.sum_coding.SumEncoder,
    'TargetEncoder': ce.target_encoder.TargetEncoder,
    'WOEEncoder': ce.woe.WOEEncoder
}
```

```
df_results = pd.DataFrame(columns=['encoder', 'f1', 'accuracy', 'roc'])
```

Next we’re going to loop through all of the encoders in the dictionary above and process the data using a Pipeline. While you don’t need to use a pipeline, it’s a good idea, as it makes the code cleaner and easier to maintain, and reduces repetition.

For our categorical variables (which we stored in `categorical_features`) we’re going to create a Pipeline called `categorical_transformer` which uses `SimpleImputer()` to fill in the missing values and then uses the selected encoder from Category Encoders. For the numeric data, we’ll fill in any missing values with the mean using `SimpleImputer()`, then we’ll scale the data using `StandardScaler()`. Then we’ll use the `ColumnTransformer()` to run our numeric and categorical transformer pipelines to preprocess our data.

Finally, we can define another Pipeline that combines the preprocessor step above with the model we selected. We then `fit()` the pipeline on the `X_train` and `y_train` data and it runs everything for us. Once that’s done, we can use the fitted pipeline to predict against `X_test` and return the predictions in `y_pred`. Then, it’s simply a case of calculating some performance metrics and appending the output to the dataframe of results we created above.

```
for key in encoders:
    # Categorical columns: fill gaps with a placeholder, then apply the current encoder
    categorical_transformer = Pipeline(
        steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
            ('encoder', encoders[key]())
        ]
    )
    # Numeric columns: fill gaps with the mean, then standardise
    numeric_transformer = Pipeline(
        steps=[
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler())
        ]
    )
    preprocessor = ColumnTransformer(
        transformers=[
            ('numerical', numeric_transformer, numeric_features),
            ('categorical', categorical_transformer, categorical_features)
        ]
    )
    pipe = Pipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('classifier', selected_model)
        ]
    )
    model = pipe.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    row = {
        'encoder': key,
        'f1': f1_score(y_test, y_pred, average='macro'),
        'accuracy': accuracy_score(y_test, y_pred),
        'roc': roc_auc_score(y_test, y_pred)
    }
    # DataFrame.append() was removed in pandas 2.0, so concatenate a one-row dataframe instead
    df_results = pd.concat([df_results, pd.DataFrame([row])], ignore_index=True)
```
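One caveat on the `roc` column: the loop passes hard class predictions to `roc_auc_score()`, so the ROC curve is evaluated at a single threshold. The metric is more commonly computed from scores or probabilities, for example from `predict_proba()`. The sketch below shows the two calls side by side on synthetic data with a logistic regression; it's an optional variation, not what produced the results in this article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (made up for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC AUC from hard 0/1 predictions (the approach used in the loop)
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# ROC AUC from the predicted probability of the positive class
auc_from_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(auc_from_labels, auc_from_proba)
```

In the pipeline above, the equivalent change would be to call `model.predict_proba(X_test)[:, 1]` and pass that to `roc_auc_score()`.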

If you print out the results and rank them by the AUC ROC you’ll be able to see which approach worked best on the data set. As with hyperparameter tuning, you may find that you can improve the results by tweaking the parameters used.

```
df_results.head(20).sort_values(by='roc')
```

| | encoder | f1 | accuracy | roc |
|---|---|---|---|---|
| 8 | LeaveOneOutEncoder | 0.431043 | 0.757601 | 0.500000 |
| 4 | HashingEncoder | 0.792972 | 0.860784 | 0.770418 |
| 1 | BaseNEncoder | 0.812706 | 0.871021 | 0.794404 |
| 2 | BinaryEncoder | 0.812706 | 0.871021 | 0.794404 |
| 3 | CatBoostEncoder | 0.811457 | 0.869280 | 0.794979 |
| 0 | BackwardDifferenceEncoder | 0.814305 | 0.872249 | 0.795646 |
| 10 | OrdinalEncoder | 0.814305 | 0.872249 | 0.795646 |
| 9 | MEstimateEncoder | 0.813458 | 0.870816 | 0.796567 |
| 14 | WOEEncoder | 0.814858 | 0.872249 | 0.796938 |
| 11 | PolynomialEncoder | 0.814049 | 0.871225 | 0.797124 |
| 13 | TargetEncoder | 0.814175 | 0.871123 | 0.797631 |
| 5 | HelmertEncoder | 0.815163 | 0.872249 | 0.797656 |
| 6 | JamesSteinEncoder | 0.815577 | 0.872556 | 0.798002 |
| 7 | OneHotEncoder | 0.815463 | 0.872351 | 0.798154 |
| 12 | SumEncoder | 0.815463 | 0.872351 | 0.798154 |

Although I didn’t use this approach in the intentionally simple example above, you can (and should) use the transformers on specific columns, rather than applying them in a less targeted fashion.

By default, if you don’t pass in any arguments to an encoder it will run on every non-numeric column. However, if you pass a list of column names to the encoder’s `cols` argument, the encoding is applied only to those fields.


Matt Clarke, Saturday, March 06, 2021
