Categorical Embeddings: New Ways to Simplify Complex Data

January 21, 2021

When building a predictive model in R, many of the functions (such as lm(), glm(), randomForest, xgboost, or neural networks in keras) require that all input variables are numeric. If your data has categorical variables, you may have to choose between ignoring some of your data and too many new columns.

Categorical embeddings are a relative new method, utilizing methods popularized in Natural Language Processing that help models solve this problem and can help you understand more about the categories themselves.

While there are a number of online tutorials on how to use Keras (usually in Python) to create these embeddings, this talk will use embed::step_embed(), an extension of the recipes package, to create the embeddings.