R’s model formula infrastructure was discussed in my previous post. Despite the elegance and convenience of the formula method, there are some aspects that are limiting.

Limitations to Extensibility

The model formula interface does have some limitations:

  • It can be kludgy with many operations on many variables (e.g., log transforming 50 variables via a formula without using paste)
  • The predvars aspect (discussed in my previous post) limits the utility of the operations. Suppose a formula had: knn_impute(x1) + knn_impute(x2). Do we embed the training set twice in predvars?
  • Operations are constrained to single columns or features (excluding interaction specifications). For example, you cannot ask for a single PCA extraction across several predictors as one formula term.
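For the first bullet above, the usual workaround is to assemble the formula as a string. The sketch below does this for 50 simulated columns; the data and the names x1 to x50 are invented purely for illustration.

```r
# Simulated, strictly positive predictors (hypothetical names x1 ... x50)
set.seed(1)
dat <- as.data.frame(matrix(rexp(20 * 50), ncol = 50))
names(dat) <- paste0("x", 1:50)
dat$y <- rnorm(20)

# Build "log(x1) + log(x2) + ... + log(x50)" with paste, then parse it
rhs <- paste0("log(", names(dat)[1:50], ")", collapse = " + ")
f <- as.formula(paste("y ~", rhs))

# The model frame now holds y plus the 50 log-transformed columns
mf <- model.frame(f, data = dat)
dim(mf)
```

This works, but the transformation is hidden inside a string, which is the kind of kludge the bullet refers to.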

I’ll use PCA feature extraction a few times here since it is probably familiar to many readers.

Everything Happens at Once

Some of our data operations might be sequential. For example, it is not unreasonable to have predictors that require:

  1. imputation of a missing value
  2. centering and scaling
  3. conversion to PCA scores

Given that the formula method operations happen (in effect) at once, this workflow requires some sort of custom solution. While caret::preProcess was designed for this sequence of operations, it does so in a single call, as opposed to a progression of steps exemplified by ggplot2, dplyr, or magrittr.
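To illustrate what a step-by-step version of the three operations listed above could look like, here is a minimal base R sketch. It is not how caret::preProcess is implemented; mean imputation stands in for whatever imputation method is actually wanted, and the split of mtcars and the injected missing value are assumptions made for the example.

```r
# Illustrative split of a built-in data set; the NA is introduced on purpose
predictors <- c("disp", "hp", "wt")
train <- mtcars[1:24, predictors]
test  <- mtcars[25:32, predictors]
train$hp[3] <- NA

# Step 1: imputation (training-set means as a simple stand-in)
train_means <- colMeans(train, na.rm = TRUE)
impute <- function(d) {
  for (nm in names(d)) {
    d[[nm]][is.na(d[[nm]])] <- train_means[[nm]]
  }
  d
}
train_imp <- impute(train)

# Step 2: centering and scaling with statistics from the imputed training set
train_center <- colMeans(train_imp)
train_sd <- apply(train_imp, 2, sd)
normalize <- function(d) scale(d, center = train_center, scale = train_sd)

# Step 3: PCA on the normalized training data
pca <- prcomp(normalize(train_imp), center = FALSE, scale. = FALSE)

# New data pass through the same steps, in the same order
test_scores <- predict(pca, normalize(impute(test)))
dim(test_scores)
```

Each step depends on the output of the one before it, and each must remember what it learned from the training set. That bookkeeping is what a formula cannot express.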

Allowing a series of steps to be defined in order is more consistent with how data analysis is conducted. However, it does raise the complexity of the underlying implementation. For example, caret::preProcess dictates the possible sequence of tasks to be: filters, single-variable transformations, normalizations, imputation, signal extraction, and spatial sign. This avoids nonsensical sequences, such as centering the data before applying a Box-Cox transformation (which requires positive data).

No Recycling

As a corollary to the point above, there is no way to recycle the terms between models that share the same formula and data/environment. For example, if I fit a CART model to a data set with many predictors, the random forest model (theoretically) shouldn’t need to recreate the same terms information about the design matrix. If the model function has the non-formula interface (e.g., mod_func(x, y)), this can make it easier. However, many do not.
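As a rough illustration of the non-formula route, the base R fitting functions lm.fit and glm.fit both accept a design matrix and a response, so the formula processing can be done once and reused. The choice of mtcars and of these two fitters is an assumption for the example; they simply play the role of mod_func(x, y).

```r
# Process the formula once
mf <- model.frame(mpg ~ wt + hp + disp, data = mtcars)
x <- model.matrix(attr(mf, "terms"), mf)
y <- model.response(mf)

# Reuse the same x and y in two different fitting routines
fit_ols <- lm.fit(x, y)
fit_glm <- glm.fit(x, y, family = gaussian())

# Both fits share the design matrix and give matching coefficients
cbind(ols = fit_ols$coefficients, glm = fit_glm$coefficients)
```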

Also, suppose that one of the pre-processing steps is computationally expensive. We’d like to be able to store the state of the results and then add another layer of computations (perhaps as a separate object).

Formulas and Wide Datasets

The terms object saves a matrix with as many rows as formula variables and at least as many columns (depending on interactions, etc.). Most of its entries are zero, yet a non-sparse representation is used. The current framework was built at a time when there was more focus on interactions, nesting, and other operations on a small scale.

It is unlikely that models would have hundreds of interaction terms, but now it is not uncommon to have hundreds or thousands of main effects. As the number of predictors increases, this takes up an inordinate amount of execution time. For simple randomForest or rpart calls, the formula/terms work can account for most of the execution time. For example, we can calculate how much time functions spend generating the model matrix relative to the total execution time. For rpart and randomForest, we used the default arguments and did the calculations with a simulated data set of 200 data points and varying numbers of predictors:
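A comparison of this kind could be approximated along the following lines. This is only a sketch and not the benchmark behind the results described above: it times a bare model.frame call as a proxy for the formula/terms work, uses rpart alone, and the helper name time_fraction and the predictor counts are my own choices. Timings will vary by machine.

```r
library(rpart)

# Hypothetical helper: time the formula processing and the full fit
# for a simulated data set with n rows and p predictors
time_fraction <- function(p, n = 200) {
  dat <- as.data.frame(matrix(rnorm(n * p), ncol = p))
  dat$y <- rnorm(n)
  t_frame <- system.time(model.frame(y ~ ., data = dat))[["elapsed"]]
  t_total <- system.time(rpart(y ~ ., data = dat))[["elapsed"]]
  c(predictors = p, frame = t_frame, total = t_total)
}

set.seed(3)
timings <- as.data.frame(t(sapply(c(10, 100, 500), time_fraction)))

# Rough share of the run time attributable to building the model frame
timings$fraction <- timings$frame / timings$total
timings
```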

This is especially problematic for ensemble models. For example, ipred:::ipredbagg creates an ensemble of rpart trees. Since rpart only has a formula method, the footprint of the bagged model object can become very large if X trees are contained in the ensemble. Alternatively, randomForest.formula takes the approach of generating the terms once and feeding the model frame to randomForest.default. This does not work for rpart since there is no non-formula method exposed. Some functions (e.g., lm, survival::coxph) have arguments that can be used to prevent the terms and similar objects from being returned. This saves space but prevents new samples from being predicted. A little more detail can be found here.

One issue is the "factors" attribute of the terms object (discussed in the previous post). This is a non-sparse matrix that has a row for each predictor in the formula and a column for each model term (e.g. main effects, interactions, etc.). The purpose of this object is to know which predictors are involved in which terms.

The issue is that this matrix can get very large and usually has a high proportion of zeros. For example:
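One way to see this, as a sketch using the built-in mtcars data (my choice of example), is to pull the attribute out of a main-effects-only formula:

```r
# Expand mpg ~ . against mtcars and extract the "factors" matrix
trms <- terms(mpg ~ ., data = mtcars)
fac <- attr(trms, "factors")

# 11 rows (mpg plus 10 predictors) by 10 columns (one per main effect)
dim(fac)

# Only one entry per column is non-zero
sum(fac != 0)

# Proportion of zeros
mean(fac == 0)
```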

As the number of predictors increases, the rate of ones is likely to approach a value close to zero very quickly. For example:
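For a formula with only main effects, the matrix has p + 1 rows and p columns with a single non-zero entry per column, so the rate of ones is 1 / (p + 1). The sketch below checks this for a few values of p; the helper name ones_rate and the variable names are hypothetical.

```r
# Proportion of non-zero entries in the "factors" matrix
# for a main-effects formula with p predictors
ones_rate <- function(p) {
  f <- reformulate(paste0("x", seq_len(p)), response = "y")
  fac <- attr(terms(f), "factors")
  mean(fac != 0)
}

p <- c(2, 10, 50, 250)
rates <- vapply(p, ones_rate, numeric(1))
data.frame(predictors = p, rate_of_ones = rates, expected = 1 / (p + 1))
```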

Again, it is doubtful that a model with a large number of predictors will have a correspondingly large number of high-level interactions (see the Pareto principle applied to modeling).

Variable Roles

Some packages have implemented extensions of the basic formula. There are cases when formulas are needed for specific sub-models. For example, a random coefficient model can be fit with the lmer function. In this case, a model is specified for a particular clustering variable (e.g., a subject in a clinical trial). The following is an example of how lmer syntax works:
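A small sketch of this syntax, assuming the sleepstudy data that ships with lme4 (which may not be the example originally shown here). The formula can be inspected with base R alone; the fit itself only runs if lme4 is installed.

```r
# Random intercept and random slope for Days within each Subject
f <- Reaction ~ Days + (Days | Subject)

# Base R sees three variables, but has no notion of Subject's special role
all.vars(f)
attr(terms(f), "term.labels")

# Fit the random coefficient model if lme4 is available
if (requireNamespace("lme4", quietly = TRUE)) {
  data("sleepstudy", package = "lme4")
  fit <- lme4::lmer(f, data = sleepstudy)
  print(lme4::fixef(fit))
}
```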

Here Subject is important to the model-fitting routine, but not as a predictor. Similarly, the Bradley-Terry model can be used to model competitions and contests. A model on a set of boxers in a series of contests can include terms for their reach:

Another extension of basic formulas comes from the modeltools and party packages. The function party::mob fits a tree-based model with regression models in the terminal nodes. For this model, a separate list of predictors is used as splitting variables (to define the tree structure), and another set is used as regression variables that are modeled in the terminal nodes. An example of this call is:
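In mob's formula convention, the regression variables go to the left of a vertical bar and the splitting variables to the right. The sketch below uses hypothetical column names (y, x1, x2, z1, z2) and only takes the formula apart with base R to show the two roles; it does not fit a model.

```r
# Two-part right-hand side: regressors | splitting variables
f <- y ~ x1 + x2 | z1 + z2

# The right-hand side is a call to `|` with two arguments
rhs <- f[[3]]
identical(rhs[[1]], as.name("|"))

# Variables modeled in the terminal nodes
regressors <- all.vars(rhs[[2]])

# Variables used to define the tree structure
splitters <- all.vars(rhs[[3]])

list(regressors = regressors, splitters = splitters)
```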

The commonality between these three examples is that there are variables that are critical to the model but do not play the role of standard regression terms. For lmer, Subject is the independent experimental unit. For mob, we have variables to be used for splitting, etc.

There are similar issues on the left-hand side of the formula. When there are multivariate outcomes, different packages have different approaches:
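One such approach, available in base R, is to bind the outcomes into a matrix on the left-hand side with cbind(). A minimal sketch with mtcars, where the choice of outcomes and predictors is arbitrary:

```r
# Two outcomes modeled with the same right-hand side
fit <- lm(cbind(mpg, qsec) ~ wt + hp, data = mtcars)

# The result is a multivariate linear model object
class(fit)

# One column of coefficients per outcome
coef(fit)
```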

The overall point here is that, for the most part, the formula method assumes that there is one variable on the left-hand side of the tilde and that the variables on the right-hand side are predictors (exceptions are discussed below). One can envision other roles that columns could play in the analysis of data. Besides the examples given above, variables could be used for

  • outcomes
  • predictors
  • stratification
  • data for assessing model performance (e.g., loan amount to compute expected loss)
  • conditioning or faceting variables (e.g., lattice or ggplot2)
  • random effects or hierarchical model ID variables
  • case weights
  • offsets
  • error terms (limited to Error in the aov function)

The last three items on this list are currently handled in formulas as “specials” or have existing functions. For example, when the model function has a weights argument, the current formula/terms framework uses a function (model.weights) to extract the weights, and also makes sure that the weights are not included as covariates. The same is true for offsets.
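The weight and offset handling can be seen directly with base R. In this sketch the use of cyl as a case weight and log(disp) as an offset is arbitrary and chosen only so the example runs on mtcars.

```r
# Weights are passed alongside the formula; the offset is a formula special
mf <- model.frame(mpg ~ wt + offset(log(disp)), data = mtcars, weights = cyl)

# Both are stored in the model frame and can be extracted by role
w <- model.weights(mf)
o <- model.offset(mf)

# Neither one appears as a column of the design matrix
X <- model.matrix(attr(mf, "terms"), mf)
colnames(X)
```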

Summary

Some limitations of the current formula interface can be mitigated by writing your own formula-processing code or by using the Formula package.

However, there are a number of conceptual aspects (e.g., roles, sequential processing) that would require a completely different approach to defining a design matrix, and this will be the focus of an upcoming tidyverse package.