diff --git a/_freeze/learn/develop/models/index/execute-results/html.json b/_freeze/learn/develop/models/index/execute-results/html.json index 17801f90..5521f56e 100644 --- a/_freeze/learn/develop/models/index/execute-results/html.json +++ b/_freeze/learn/develop/models/index/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "03e1a731c4a6d0af073e4a9dc12bd62c", + "hash": "363cf45050c99106958bce4dad368153", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"How to build a parsnip model\"\ncategories:\n - developer tools\ntype: learn-subsection\nweight: 2\ndescription: | \n Create a parsnip model function from an existing model implementation.\ntoc: true\ntoc-depth: 2\ninclude-after-body: ../../../resources.html\n---\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: mda, modeldata, and tidymodels.\n\nThe parsnip package constructs models and predictions by representing those actions in expressions. There are a few reasons for this:\n\n * It eliminates a lot of duplicate code.\n * Since the expressions are not evaluated until fitting, it eliminates many package dependencies.\n\nA parsnip model function is itself very general. For example, the `logistic_reg()` function itself doesn't have any model code within it. Instead, each model function is associated with one or more computational _engines_. These might be different R packages or some function in another language (that can be evaluated by R). \n\nThis article describes the process of creating a new model function. Before proceeding, take a minute and read our [guidelines on creating modeling packages](https://tidymodels.github.io/model-implementation-principles/) to understand the general themes and conventions that we use. \n\n## An example model\n\nAs an example, we'll create a function for _mixture discriminant analysis_. There are [a few packages](http://search.r-project.org/cgi-bin/namazu.cgi?query=%22mixture+discriminant%22&max=100&result=normal&sort=score&idxname=functions) that implement this but we'll focus on `mda::mda()`:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nstr(mda::mda)\n#> function (formula = formula(data), data = sys.frame(sys.parent()), subclasses = 3, \n#> sub.df = NULL, tot.df = NULL, dimension = sum(subclasses) - 1, eps = 100 * \n#> .Machine$double.eps, iter = 5, weights = mda.start(x, g, subclasses, \n#> trace, ...), method = polyreg, keep.fitted = (n * dimension < 5000), \n#> trace = FALSE, ...)\n```\n:::\n\n\nThe main hyperparameter is the number of subclasses. We'll name our function `discrim_mixture()`. \n\n## Aspects of models\n\nBefore proceeding, it helps to to review how parsnip categorizes models:\n\n* The model _type_ is related to the structural aspect of the model. For example, the model type `linear_reg()` represents linear models (slopes and intercepts) that model a numeric outcome. Other model types in the package are `nearest_neighbor()`, `decision_tree()`, and so on. \n\n* Within a model type is the _mode_, related to the modeling goal. Currently the three modes in the package are regression, classification, and censored regression. Some models have methods for multiple modes (e.g. nearest neighbors) while others have only a single mode (e.g. logistic regression). \n\n* The computation _engine_ is a combination of the estimation method and the implementation. For example, for linear regression, one engine is `\"lm\"` which uses ordinary least squares analysis via the `lm()` function. 
Another engine is `\"stan\"` which uses the Stan infrastructure to estimate parameters using Bayes rule. \n\nWhen adding a model into parsnip, the user has to specify which modes and engines are used. The package also enables users to add a new mode or engine to an existing model. \n\n## The general process\n\nThe parsnip package stores information about the models in an internal environment object. The environment can be accessed via the function `get_model_env()`. The package includes a variety of functions that can get or set the different aspects of the models. \n\nIf you are adding a new model from your own package, you can use these functions to add new entries into the model environment. \n\n### Step 1. Register the model, modes, and arguments\n\nWe will add the MDA model using the model type `discrim_mixture()`. Since this is a classification method, we only have to register a single mode:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nset_new_model(\"discrim_mixture\")\nset_model_mode(model = \"discrim_mixture\", mode = \"classification\")\nset_model_engine(\n \"discrim_mixture\", \n mode = \"classification\", \n eng = \"mda\"\n)\nset_dependency(\"discrim_mixture\", eng = \"mda\", pkg = \"mda\")\n```\n:::\n\n\nThese functions should silently finish. There is also a function that can be used to show what aspects of the model have been added to parsnip: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nshow_model_info(\"discrim_mixture\")\n#> Information for `discrim_mixture`\n#> modes: unknown, classification \n#> \n#> engines: \n#> classification: mdaNA\n#> \n#> \n#> no registered arguments.\n#> \n#> no registered fit modules.\n#> \n#> no registered prediction modules.\n```\n:::\n\n\nThe next step is to declare the main arguments to the model. These are declared independent of the mode. To specify the argument, there are a few slots to fill in:\n\n * The name that parsnip uses for the argument. In general, we try to use non-jargony names for arguments (e.g. \"penalty\" instead of \"lambda\" for regularized regression). We recommend consulting [the model argument table available here](/find/parsnip/) to see if an existing argument name can be used before creating a new one. \n \n * The argument name that is used by the underlying modeling function. \n \n * A function reference for a _constructor_ that will be used to generate tuning parameter values. This should be a character vector with a named element called `fun` that is the constructor function. There is an optional element `pkg` that can be used to call the function using its namespace. If referencing functions from the dials package, quantitative parameters can have additional arguments in the list for `trans` and `range` while qualitative parameters can pass `values` via this list. \n \n * A logical value for whether the argument can be used to generate multiple predictions for a single R object. For example, for boosted trees, if a model is fit with 10 boosting iterations, many modeling packages allow the model object to make predictions for any iterations less than the one used to fit the model. In general this is not the case so one would use `has_submodels = FALSE`. \n \nFor `mda::mda()`, the main tuning parameter is `subclasses` which we will rewrite as `sub_classes`. 
\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset_model_arg(\n model = \"discrim_mixture\",\n eng = \"mda\",\n parsnip = \"sub_classes\",\n original = \"subclasses\",\n func = list(pkg = \"foo\", fun = \"bar\"),\n has_submodel = FALSE\n)\nshow_model_info(\"discrim_mixture\")\n#> Information for `discrim_mixture`\n#> modes: unknown, classification \n#> \n#> engines: \n#> classification: mdaNA\n#> \n#> \n#> arguments: \n#> mda: \n#> sub_classes --> subclasses\n#> \n#> no registered fit modules.\n#> \n#> no registered prediction modules.\n```\n:::\n\n\n### Step 2. Create the model function\n\nThis is a fairly simple function that can follow a basic template. The main arguments to our function will be:\n\n * The mode. If the model can do more than one mode, you might default this to `\"unknown\"`. In our case, since it is only a classification model, it makes sense to default it to that mode so that the users won't have to specify it. \n \n * The argument names (`sub_classes` here). These should be defaulted to `NULL`.\n\nA basic version of the function is:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndiscrim_mixture <-\n function(mode = \"classification\", sub_classes = NULL) {\n # Check for correct mode\n if (mode != \"classification\") {\n rlang::abort(\"`mode` should be 'classification'.\")\n }\n \n # Capture the arguments in quosures\n args <- list(sub_classes = rlang::enquo(sub_classes))\n \n # Save some empty slots for future parts of the specification\n new_model_spec(\n \"discrim_mixture\",\n args = args,\n eng_args = NULL,\n mode = mode,\n method = NULL,\n engine = NULL\n )\n }\n```\n:::\n\n\nThis is pretty simple since the data are not exposed to this function. \n\n::: {.callout-warning}\n We strongly suggest favoring `rlang::abort()` and `rlang::warn()` (or their cli counterparts `cli::cli_abort()` and `cli::cli_warn()`) over `stop()` and `warning()`. The former return better traceback results and have safer defaults for handling call objects. \n:::\n\n### Step 3. Add a fit module\n\nNow that parsnip knows about the model, mode, and engine, we can give it the information on fitting the model for our engine. The information needed to fit the model is contained in another list. The elements are:\n\n * `interface` is a single character value that could be `\"formula\"`, `\"data.frame\"`, or `\"matrix\"`. This defines the type of interface used by the underlying fit function (`mda::mda()`, in this case). This helps the translation of the data to be in an appropriate format for the that function. \n \n * `protect` is an optional list of function arguments that **should not be changeable** by the user. In this case, we probably don't want users to pass data values to these arguments (until the `fit()` function is called).\n \n * `func` is the package and name of the function that will be called. If you are using a locally defined function, only `fun` is required. \n \n * `defaults` is an optional list of arguments to the fit function that the user can change, but whose defaults can be set here. 
This isn't needed in this case, but is described later in this document.\n\nFor the first engine:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset_fit(\n model = \"discrim_mixture\",\n eng = \"mda\",\n mode = \"classification\",\n value = list(\n interface = \"formula\",\n protect = c(\"formula\", \"data\"),\n func = c(pkg = \"mda\", fun = \"mda\"),\n defaults = list()\n )\n)\n\nshow_model_info(\"discrim_mixture\")\n#> Information for `discrim_mixture`\n#> modes: unknown, classification \n#> \n#> engines: \n#> classification: mda\n#> \n#> \n#> arguments: \n#> mda: \n#> sub_classes --> subclasses\n#> \n#> fit modules:\n#> engine mode\n#> mda classification\n#> \n#> no registered prediction modules.\n```\n:::\n\n\nWe also set up the information on how the predictors should be handled. These options ensure that the data that parsnip gives to the underlying model allows for a model fit that is as similar as possible to what it would have produced directly.\n\n * `predictor_indicators` describes whether and how to create indicator/dummy variables from factor predictors. There are three options: `\"none\"` (do not expand factor predictors), `\"traditional\"` (apply the standard `model.matrix()` encodings), and `\"one_hot\"` (create the complete set including the baseline level for all factors). \n \n * `compute_intercept` controls whether `model.matrix()` should include the intercept in its formula. This affects more than the inclusion of an intercept column. With an intercept, `model.matrix()` computes dummy variables for all but one factor level. Without an intercept, `model.matrix()` computes a full set of indicators for the first factor variable, but an incomplete set for the remainder.\n \n * `remove_intercept` removes the intercept column *after* `model.matrix()` is finished. This can be useful if the model function (e.g. `lm()`) automatically generates an intercept.\n\n* `allow_sparse_x` specifies whether the model can accommodate a sparse representation for predictors during fitting and tuning.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset_encoding(\n model = \"discrim_mixture\",\n eng = \"mda\",\n mode = \"classification\",\n options = list(\n predictor_indicators = \"traditional\",\n compute_intercept = TRUE,\n remove_intercept = TRUE,\n allow_sparse_x = FALSE\n )\n)\n```\n:::\n\n\n\n### Step 4. Add modules for prediction\n\nSimilar to the fitting module, we specify the code for making different types of predictions. To make hard class predictions, the `class_info` object below contains the details. The elements of the list are:\n\n * `pre` and `post` are optional functions that can preprocess the data being fed to the prediction code and to postprocess the raw output of the predictions. These won't be needed for this example, but a section below has examples of how these can be used when the model code is not easy to use. If the data being predicted has a simple type requirement, you can avoid using a `pre` function with the `args` below. \n * `func` is the prediction function (in the same format as above). In many cases, packages have a predict method for their model's class but this is typically not exported. In this case (and the example below), it is simple enough to make a generic call to `predict()` with no associated package. \n * `args` is a list of arguments to pass to the prediction function. These will most likely be wrapped in `rlang::expr()` so that they are not evaluated when defining the method. 
For mda, the code would be `predict(object, newdata, type = \"class\")`. What is actually given to the function is the parsnip model fit object, which includes a sub-object called `fit` that houses the mda model object. If the data need to be a matrix or data frame, you could also use `newdata = rlang::expr(as.data.frame(newdata))` or similar. \n\nThe parsnip prediction code will expect the result to be an unnamed character string or factor. This will be coerced to a factor with the same levels as the original data. \n\nTo add this method to the model environment, a similar set function, `set_pred()`, is used:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nclass_info <- \n list(\n pre = NULL,\n post = NULL,\n func = c(fun = \"predict\"),\n args =\n # These lists should be of the form:\n # {predict.mda argument name} = {values provided from parsnip objects}\n list(\n # We don't want the first two arguments evaluated right now\n # since they don't exist yet. `type` is a simple object that\n # doesn't need to have its evaluation deferred. \n object = rlang::expr(object$fit),\n newdata = rlang::expr(new_data),\n type = \"class\"\n )\n )\n\nset_pred(\n model = \"discrim_mixture\",\n eng = \"mda\",\n mode = \"classification\",\n type = \"class\",\n value = class_info\n)\n```\n:::\n\n\nA similar call can be used to define the class probability module (if they can be computed). The format is identical to the `class` module but the output is expected to be a tibble with columns for each factor level. \n\nAs an example of the `post` function, the data frame created by `mda:::predict.mda()` will be converted to a tibble. The arguments are `x` (the raw results coming from the predict method) and `object` (the parsnip model fit object). The latter has a sub-object called `lvl` which is a character string of the outcome's factor levels (if any). \n\nWe register the probability module. There is a template function that makes it slightly easier to format the objects:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprob_info <-\n pred_value_template(\n post = function(x, object) {\n tibble::as_tibble(x)\n },\n func = c(fun = \"predict\"),\n # Now everything else is put into the `args` slot\n object = rlang::expr(object$fit),\n newdata = rlang::expr(new_data),\n type = \"posterior\"\n )\n\nset_pred(\n model = \"discrim_mixture\",\n eng = \"mda\",\n mode = \"classification\",\n type = \"prob\",\n value = prob_info\n)\n\nshow_model_info(\"discrim_mixture\")\n#> Information for `discrim_mixture`\n#> modes: unknown, classification \n#> \n#> engines: \n#> classification: mda\n#> \n#> \n#> arguments: \n#> mda: \n#> sub_classes --> subclasses\n#> \n#> fit modules:\n#> engine mode\n#> mda classification\n#> \n#> prediction modules:\n#> mode engine methods\n#> classification mda class, prob\n```\n:::\n\n\nIf this model could be used for regression situations, we could also add a `numeric` module. For these predictions, the model requires an unnamed numeric vector output. \n\nExamples are [here](https://github.com/tidymodels/parsnip/blob/master/R/linear_reg_data.R) and [here](https://github.com/tidymodels/parsnip/blob/master/R/rand_forest_data.R). \n\n\n### Does it work? \n\nAs a developer, one thing that may come in handy is the `translate()` function. This will tell you what the model's eventual syntax will be. 
\n\nFor example:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndiscrim_mixture(sub_classes = 2) %>%\n translate(engine = \"mda\")\n#> discrim mixture Model Specification (classification)\n#> \n#> Main Arguments:\n#> sub_classes = 2\n#> \n#> Computational engine: mda \n#> \n#> Model fit template:\n#> mda::mda(formula = missing_arg(), data = missing_arg(), subclasses = 2)\n```\n:::\n\n\nLet's try it on a data set from the modeldata package:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata(\"two_class_dat\", package = \"modeldata\")\nset.seed(4622)\nexample_split <- initial_split(two_class_dat, prop = 0.99)\nexample_train <- training(example_split)\nexample_test <- testing(example_split)\n\nmda_spec <- discrim_mixture(sub_classes = 2) %>% \n set_engine(\"mda\")\n\nmda_fit <- mda_spec %>%\n fit(Class ~ ., data = example_train)\nmda_fit\n#> parsnip model object\n#> \n#> Call:\n#> mda::mda(formula = Class ~ ., data = data, subclasses = ~2)\n#> \n#> Dimension: 2 \n#> \n#> Percent Between-Group Variance Explained:\n#> v1 v2 \n#> 82.63 100.00 \n#> \n#> Degrees of Freedom (per dimension): 3 \n#> \n#> Training Misclassification Error: 0.17241 ( N = 783 )\n#> \n#> Deviance: 671.391\n\npredict(mda_fit, new_data = example_test, type = \"prob\") %>%\n bind_cols(example_test %>% select(Class))\n#> # A tibble: 8 × 3\n#> .pred_Class1 .pred_Class2 Class \n#> \n#> 1 0.679 0.321 Class1\n#> 2 0.690 0.310 Class1\n#> 3 0.384 0.616 Class2\n#> 4 0.300 0.700 Class1\n#> 5 0.0262 0.974 Class2\n#> 6 0.405 0.595 Class2\n#> 7 0.793 0.207 Class1\n#> 8 0.0949 0.905 Class2\n\npredict(mda_fit, new_data = example_test) %>% \n bind_cols(example_test %>% select(Class))\n#> # A tibble: 8 × 2\n#> .pred_class Class \n#> \n#> 1 Class1 Class1\n#> 2 Class1 Class1\n#> 3 Class2 Class2\n#> 4 Class2 Class1\n#> 5 Class2 Class2\n#> 6 Class2 Class2\n#> 7 Class1 Class1\n#> 8 Class2 Class2\n```\n:::\n\n\n\n## Add an engine\n\nThe process for adding an engine to an existing model is _almost_ the same as building a new model but simpler with fewer steps. You only need to add the engine-specific aspects of the model. For example, if we wanted to fit a linear regression model using M-estimation, we could only add a new engine. 
The code for the `rlm()` function in MASS is pretty similar to `lm()`, so we can copy that code and change the package/function names:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset_model_engine(\"linear_reg\", \"regression\", eng = \"rlm\")\nset_dependency(\"linear_reg\", eng = \"rlm\", pkg = \"MASS\")\n\nset_fit(\n model = \"linear_reg\",\n eng = \"rlm\",\n mode = \"regression\",\n value = list(\n interface = \"formula\",\n protect = c(\"formula\", \"data\", \"weights\"),\n func = c(pkg = \"MASS\", fun = \"rlm\"),\n defaults = list()\n )\n)\n\nset_encoding(\n model = \"linear_reg\",\n eng = \"rlm\",\n mode = \"regression\",\n options = list(\n predictor_indicators = \"traditional\",\n compute_intercept = TRUE,\n remove_intercept = TRUE,\n allow_sparse_x = FALSE\n )\n)\n\nset_pred(\n model = \"linear_reg\",\n eng = \"rlm\",\n mode = \"regression\",\n type = \"numeric\",\n value = list(\n pre = NULL,\n post = NULL,\n func = c(fun = \"predict\"),\n args =\n list(\n object = expr(object$fit),\n newdata = expr(new_data),\n type = \"response\"\n )\n )\n)\n\n# testing:\nlinear_reg() %>% \n set_engine(\"rlm\") %>% \n fit(mpg ~ ., data = mtcars)\n#> parsnip model object\n#> \n#> Call:\n#> rlm(formula = mpg ~ ., data = data)\n#> Converged in 8 iterations\n#> \n#> Coefficients:\n#> (Intercept) cyl disp hp drat wt \n#> 17.82250038 -0.27878615 0.01593890 -0.02536343 0.46391132 -4.14355431 \n#> qsec vs am gear carb \n#> 0.65307203 0.24975463 1.43412689 0.85943158 -0.01078897 \n#> \n#> Degrees of freedom: 32 total; 21 residual\n#> Scale estimate: 2.15\n```\n:::\n\n\n## Add parsnip models to another package\n\nThe process here is almost the same. All of the previous functions are still required but their execution is a little different. \n\nFor parsnip to register them, that package must already be loaded. For this reason, it makes sense to have parsnip in the \"Depends\" category of the DESCRIPTION file of your package. \n\nThe first difference is that the functions that define the model must be inside of a wrapper function that is called when your package is loaded. For our example here, this might look like: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmake_discrim_mixture_mda <- function() {\n parsnip::set_new_model(\"discrim_mixture\")\n\n parsnip::set_model_mode(\"discrim_mixture\", \"classification\")\n\n # and so one...\n}\n```\n:::\n\n\nThis function is then executed when your package is loaded: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n.onLoad <- function(libname, pkgname) {\n # This defines discrim_mixture in the model database\n make_discrim_mixture_mda()\n}\n```\n:::\n\n\nFor an example package that uses parsnip definitions, take a look at the [discrim](https://github.com/tidymodels/discrim) package.\n\n::: {.callout-warning}\n To use a new model and/or engine in the broader tidymodels infrastructure, we recommend your model definition declarations (e.g. `set_new_model()` and similar) reside in a package. If these definitions are in a script only, the new model may not work with the tune package, for example for parallel processing. \n:::\n\nIt is also important for parallel processing support to **list the home package as a dependency**. If the `discrim_mixture()` function lived in a package called `mixedup`, include the line:\n\n```r\nset_dependency(\"discrim_mixture\", eng = \"mda\", pkg = \"mixedup\")\n```\n\nParallel processing requires this explicit dependency setting. 
When parallel worker processes are created, there is heterogeneity across technologies regarding which packages are loaded. Multicore methods on macOS and Linux will load all of the packages that were loaded in the main R process. However, parallel processing using psock clusters have no additional packages loaded. If the home package for a parsnip model is not loaded in the worker processes, the model will not have an entry in parsnip's internal database (and produce an error). \n\n\n## Your model, tuning parameters, and you\n\nThe tune package can be used to find reasonable values of model arguments via tuning. There are some S3 methods that are useful to define for your model. `discrim_mixture()` has one main tuning parameter: `sub_classes`. To work with tune it is _helpful_ (but not required) to use an S3 method called `tunable()` to define which arguments should be tuned and how values of those arguments should be generated. \n\n`tunable()` takes the model specification as its argument and returns a tibble with columns: \n\n* `name`: The name of the argument. \n\n* `call_info`: A list that describes how to call a function that returns a dials parameter object. \n\n* `source`: A character string that indicates where the tuning value comes from (i.e., a model, a recipe etc.). Here, it is just `\"model_spec\"`. \n\n* `component`: A character string with more information about the source. For models, this is just the name of the function (e.g. `\"discrim_mixture\"`). \n\n* `component_id`: A character string to indicate where a unique identifier is for the object. For a model, this is indicates the type of model argument (e.g. `\"main\"`). \n\nThe main piece of information that requires some detail is `call_info`. This is a list column in the tibble. Each element of the list is a list that describes the package and function that can be used to create a dials parameter object. \n\nFor example, for a nearest-neighbors `neighbors` parameter, this value is just: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ninfo <- list(pkg = \"dials\", fun = \"neighbors\")\n\n# FYI: how it is used under-the-hood: \nnew_param_call <- rlang::call2(.fn = info$fun, .ns = info$pkg)\nrlang::eval_tidy(new_param_call)\n#> # Nearest Neighbors (quantitative)\n#> Range: [1, 10]\n```\n:::\n\n\nFor `discrim_mixture()`, a dials object is needed that returns an integer that is the number of sub-classes that should be create. We can create a dials parameter function for this:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsub_classes <- function(range = c(1L, 10L), trans = NULL) {\n new_quant_param(\n type = \"integer\",\n range = range,\n inclusive = c(TRUE, TRUE),\n trans = trans,\n label = c(sub_classes = \"# Sub-Classes\"),\n finalize = NULL\n )\n}\n```\n:::\n\n\nIf this were in the same package as the other specifications for the parsnip engine, we could use: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntunable.discrim_mixture <- function(x, ...) 
{\n tibble::tibble(\n name = c(\"sub_classes\"),\n call_info = list(list(pkg = NULL, fun = \"sub_classes\")),\n source = \"model_spec\",\n component = \"discrim_mixture\",\n component_id = \"main\"\n )\n}\n```\n:::\n\n\nOnce this method is in place, the tuning functions can be used: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmda_spec <- \n discrim_mixture(sub_classes = tune()) %>% \n set_engine(\"mda\")\n\nset.seed(452)\ncv <- vfold_cv(example_train)\nmda_tune_res <- mda_spec %>%\n tune_grid(Class ~ ., cv, grid = 4)\nshow_best(mda_tune_res, metric = \"roc_auc\")\n```\n:::\n\n\n\n\n## Pro-tips, what-ifs, exceptions, FAQ, and minutiae\n\nThere are various things that came to mind while developing this resource.\n\n**Do I have to return a simple vector for `predict()`?**\n\nThere are some models (e.g. glmnet, plsr, Cubist, etc.) that can make predictions for different models from the same fitted model object. We facilitate this via `multi_predict()`, rather than `predict()`.\n\nFor example, if we fit a linear regression model via `glmnet` and predict for 10 different penalty values:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npreds <- linear_reg(penalty = 0.1) %>%\n set_engine(\"glmnet\") %>% \n fit(mpg ~ ., data = mtcars) %>%\n multi_predict(new_data = mtcars[1:3, -1], penalty = seq(0.1 / 1:10))\n\npreds\n#> # A tibble: 3 × 1\n#> .pred \n#> \n#> 1 \n#> 2 \n#> 3 \n\npreds$.pred[[1]]\n#> # A tibble: 10 × 2\n#> penalty .pred\n#> \n#> 1 1 22.2\n#> 2 2 21.5\n#> 3 3 21.1\n#> 4 4 20.6\n#> 5 5 20.2\n#> 6 6 20.1\n#> 7 7 20.1\n#> 8 8 20.1\n#> 9 9 20.1\n#> 10 10 20.1\n```\n:::\n\n\nThis gives a list column `.pred` which contains a tibble per row, each tibble corresponding to one row in `new_data`. Within each tibble are columns for the parameter we vary, here `penalty`, and the predictions themselves.\n\n**What do I do about how my model handles factors or categorical data?**\n\nSome modeling functions in R create indicator/dummy variables from categorical data when you use a model formula (typically using `model.matrix()`), and some do not. Some examples of models that do _not_ create indicator variables include tree-based models, naive Bayes models, and multilevel or hierarchical models. The tidymodels ecosystem assumes a `model.matrix()`-like default encoding for categorical data used in a model formula, but you can change this encoding using `set_encoding()`. For example, you can set predictor encodings that say, \"leave my data alone,\" and keep factors as is:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset_encoding(\n model = \"decision_tree\",\n eng = \"rpart\",\n mode = \"regression\",\n options = list(\n predictor_indicators = \"none\",\n compute_intercept = FALSE,\n remove_intercept = FALSE\n )\n)\n```\n:::\n\n\n::: {.callout-note}\nThere are three options for `predictor_indicators`: \n\n- `\"none\"`: do not expand factor predictors\n- `\"traditional\"`: apply the standard `model.matrix()` encoding\n- `\"one_hot\"`: create the complete set including the baseline level for all factors\n:::\n\nTo learn more about encoding categorical predictors, check out [this blog post](https://www.tidyverse.org/blog/2020/07/parsnip-0-1-2/#predictor-encoding-consistency).\n\n**What is the `defaults` slot and why do I need it?**\n\nYou might want to set defaults that can be overridden by the user. For example, for logistic regression with `glm()`, it make sense to default to `family = binomial`. 
However, if someone wants to use a different link function, they should be able to do that. For that model/engine definition, it has:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndefaults = list(family = expr(binomial))\n```\n:::\n\n\nSo that is the default:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlogistic_reg() %>% translate(engine = \"glm\")\n#> Logistic Regression Model Specification (classification)\n#> \n#> Computational engine: glm \n#> \n#> Model fit template:\n#> stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), \n#> family = stats::binomial)\n\n# but you can change it:\nlogistic_reg() %>%\n set_engine(\"glm\", family = binomial(link = \"probit\")) %>% \n translate()\n#> Logistic Regression Model Specification (classification)\n#> \n#> Engine-Specific Arguments:\n#> family = binomial(link = \"probit\")\n#> \n#> Computational engine: glm \n#> \n#> Model fit template:\n#> stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), \n#> family = binomial(link = \"probit\"))\n```\n:::\n\n\nThat's what `defaults` are for. \n\n**What if I want more complex defaults?**\n\nThe `translate()` function can be used to check values or set defaults once the model's mode is known. To do this, you can create a model-specific S3 method that first calls the general method (`translate.model_spec()`) and then makes modifications or conducts error traps. \n\nFor example, the ranger and randomForest package functions have arguments for calculating importance. One is a logical and the other is a string. Since this is likely to lead to a bunch of frustration and GitHub issues, we can put in a check:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Simplified version\ntranslate.rand_forest <- function (x, engine, ...){\n # Run the general method to get the real arguments in place\n x <- translate.default(x, engine, ...)\n \n # Check and see if they make sense for the engine and/or mode:\n if (x$engine == \"ranger\") {\n if (any(names(x$method$fit$args) == \"importance\")) \n if (is.logical(x$method$fit$args$importance)) \n rlang::abort(\"`importance` should be a character value. See ?ranger::ranger.\")\n }\n x\n}\n```\n:::\n\n\nAs another example, `nnet::nnet()` has an option for the final layer to be linear (called `linout`). If `mode = \"regression\"`, that should probably be set to `TRUE`. You couldn't do this with the `args` (described above) since you need the function translated first. \n\n\n**My model fit requires more than one function call. So....?**\n\nThe best course of action is to write wrapper so that it can be one call. This was the case with xgboost and keras. \n\n**Why would I preprocess my data?**\n\nThere might be non-trivial transformations that the model prediction code requires (such as converting to a sparse matrix representation, etc.)\n\nThis would **not** include making dummy variables and `model.matrix()` stuff. The parsnip infrastructure already does that for you. \n\n\n**Why would I post-process my predictions?**\n\nWhat comes back from some R functions may be somewhat... arcane or problematic. As an example, for xgboost, if you fit a multi-class boosted tree, you might expect the class probabilities to come back as a matrix (*narrator: they don't*). If you have four classes and make predictions on three samples, you get a vector of 12 probability values. You need to convert these to a rectangular data set. 
\n\nAnother example is the predict method for ranger, which encapsulates the actual predictions in a more complex object structure. \n\nThese are the types of problems that the post-processor will solve. \n\n**Are there other modes?**\n\nThere could be. If you have a suggestion, please add a [GitHub issue](https://github.com/tidymodels/parsnip/issues) to discuss it. \n\n \n## Session information {#session-info}\n\n\n::: {.cell layout-align=\"center\"}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> version R version 4.5.1 (2025-06-13)\n#> language (EN)\n#> date 2025-10-17\n#> pandoc 3.6.3\n#> quarto 1.8.25\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package version date (UTC) source\n#> broom 1.0.9 2025-07-28 CRAN (R 4.5.0)\n#> dials 1.4.2 2025-09-04 CRAN (R 4.5.0)\n#> dplyr 1.1.4 2023-11-17 CRAN (R 4.5.0)\n#> ggplot2 4.0.0 2025-09-11 CRAN (R 4.5.0)\n#> infer 1.0.9 2025-06-26 CRAN (R 4.5.0)\n#> mda 0.5-5 2024-11-07 CRAN (R 4.5.0)\n#> modeldata 1.5.1 2025-08-22 CRAN (R 4.5.0)\n#> parsnip 1.3.3 2025-08-31 CRAN (R 4.5.0)\n#> purrr 1.1.0 2025-07-10 CRAN (R 4.5.0)\n#> recipes 1.3.1 2025-05-21 CRAN (R 4.5.0)\n#> rlang 1.1.6 2025-04-11 CRAN (R 4.5.0)\n#> rsample 1.3.1 2025-07-29 CRAN (R 4.5.0)\n#> tibble 3.3.0 2025-06-08 CRAN (R 4.5.0)\n#> tidymodels 1.4.1 2025-09-08 CRAN (R 4.5.0)\n#> tune 2.0.0 2025-09-01 CRAN (R 4.5.0)\n#> workflows 1.3.0 2025-08-27 CRAN (R 4.5.0)\n#> yardstick 1.3.2 2025-01-22 CRAN (R 4.5.0)\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n\n\n\n \n", + "markdown": "---\ntitle: \"How to build a parsnip model\"\ncategories:\n - developer tools\ntype: learn-subsection\nweight: 2\ndescription: | \n Create a parsnip model function from an existing model implementation.\ntoc: true\ntoc-depth: 2\ninclude-after-body: ../../../resources.html\n---\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: mda, modeldata, and tidymodels.\n\nThe parsnip package constructs models and predictions by representing those actions in expressions. There are a few reasons for this:\n\n * It eliminates a lot of duplicate code.\n * Since the expressions are not evaluated until fitting, it eliminates many package dependencies.\n\nA parsnip model function is itself very general. For example, the `logistic_reg()` function itself doesn't have any model code within it. Instead, each model function is associated with one or more computational _engines_. These might be different R packages or some function in another language (that can be evaluated by R). \n\nThis article describes the process of creating a new model function. Before proceeding, take a minute and read our [guidelines on creating modeling packages](https://tidymodels.github.io/model-implementation-principles/) to understand the general themes and conventions that we use. \n\n## An example model\n\nAs an example, we'll create a function for _mixture discriminant analysis_. 
There are [a few packages](http://search.r-project.org/cgi-bin/namazu.cgi?query=%22mixture+discriminant%22&max=100&result=normal&sort=score&idxname=functions) that implement this, but we'll focus on `mda::mda()`:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nstr(mda::mda)\n#> function (formula = formula(data), data = sys.frame(sys.parent()), subclasses = 3, \n#> sub.df = NULL, tot.df = NULL, dimension = sum(subclasses) - 1, eps = 100 * \n#> .Machine$double.eps, iter = 5, weights = mda.start(x, g, subclasses, \n#> trace, ...), method = polyreg, keep.fitted = (n * dimension < 5000), \n#> trace = FALSE, ...)\n```\n:::\n\n\nThe main hyperparameter is the number of subclasses. We'll name our function `discrim_mixture()`. \n\n## Aspects of models\n\nBefore proceeding, it helps to review how parsnip categorizes models:\n\n* The model _type_ is related to the structural aspect of the model. For example, the model type `linear_reg()` represents linear models (slopes and intercepts) that model a numeric outcome. Other model types in the package are `nearest_neighbor()`, `decision_tree()`, and so on. \n\n* Within a model type is the _mode_, related to the modeling goal. Currently the three modes in the package are regression, classification, and censored regression. Some models have methods for multiple modes (e.g. nearest neighbors) while others have only a single mode (e.g. logistic regression). \n\n* The computation _engine_ is a combination of the estimation method and the implementation. For example, for linear regression, one engine is `\"lm\"`, which uses ordinary least squares analysis via the `lm()` function. Another engine is `\"stan\"`, which uses the Stan infrastructure to estimate parameters using Bayes rule. \n\nWhen adding a model into parsnip, the user has to specify which modes and engines are used. The package also enables users to add a new mode or engine to an existing model. \n\n## The general process\n\nThe parsnip package stores information about the models in an internal environment object. The environment can be accessed via the function `get_model_env()`. The package includes a variety of functions that can get or set the different aspects of the models. \n\nIf you are adding a new model from your own package, you can use these functions to add new entries into the model environment. \n\n### Step 1. Register the model, modes, and arguments\n\nWe will add the MDA model using the model type `discrim_mixture()`. Since this is a classification method, we only have to register a single mode:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nset_new_model(\"discrim_mixture\")\nset_model_mode(model = \"discrim_mixture\", mode = \"classification\")\nset_model_engine(\n \"discrim_mixture\", \n mode = \"classification\", \n eng = \"mda\"\n)\nset_dependency(\"discrim_mixture\", eng = \"mda\", pkg = \"mda\")\n```\n:::\n\n\nThese functions should silently finish. There is also a function that can be used to show what aspects of the model have been added to parsnip: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nshow_model_info(\"discrim_mixture\")\n#> Information for `discrim_mixture`\n#> modes: unknown, classification \n#> \n#> engines: \n#> classification: mdaNA\n#> \n#> \n#> no registered arguments.\n#> \n#> no registered fit modules.\n#> \n#> no registered prediction modules.\n```\n:::\n\n\nThe next step is to declare the main arguments to the model. These are declared independently of the mode. 
To specify the argument, there are a few slots to fill in:\n\n * The name that parsnip uses for the argument. In general, we try to use non-jargony names for arguments (e.g. \"penalty\" instead of \"lambda\" for regularized regression). We recommend consulting the model engine topics pages, linked from [the searchable table here](/find/parsnip/), to see if an existing argument name can be used before creating a new one. \n \n * The argument name that is used by the underlying modeling function. \n \n * A function reference for a _constructor_ that will be used to generate tuning parameter values. This should be a character vector with a named element called `fun` that is the constructor function. There is an optional element `pkg` that can be used to call the function using its namespace. If referencing functions from the dials package, quantitative parameters can have additional arguments in the list for `trans` and `range`, while qualitative parameters can pass `values` via this list. \n \n * A logical value for whether the argument can be used to generate multiple predictions for a single R object. For example, for boosted trees, if a model is fit with 10 boosting iterations, many modeling packages allow the model object to make predictions for any iterations less than the one used to fit the model. In general, this is not the case, so one would use `has_submodel = FALSE`. \n \nFor `mda::mda()`, the main tuning parameter is `subclasses`, which we will rewrite as `sub_classes`. \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset_model_arg(\n model = \"discrim_mixture\",\n eng = \"mda\",\n parsnip = \"sub_classes\",\n original = \"subclasses\",\n func = list(pkg = \"foo\", fun = \"bar\"),\n has_submodel = FALSE\n)\nshow_model_info(\"discrim_mixture\")\n#> Information for `discrim_mixture`\n#> modes: unknown, classification \n#> \n#> engines: \n#> classification: mdaNA\n#> \n#> \n#> arguments: \n#> mda: \n#> sub_classes --> subclasses\n#> \n#> no registered fit modules.\n#> \n#> no registered prediction modules.\n```\n:::\n\n\n### Step 2. Create the model function\n\nThis is a fairly simple function that can follow a basic template. The main arguments to our function will be:\n\n * The mode. If the model can do more than one mode, you might default this to `\"unknown\"`. In our case, since it is only a classification model, it makes sense to default it to that mode so that users won't have to specify it. \n \n * The argument names (`sub_classes` here). These should be defaulted to `NULL`.\n\nA basic version of the function is:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndiscrim_mixture <-\n function(mode = \"classification\", sub_classes = NULL) {\n # Check for correct mode\n if (mode != \"classification\") {\n rlang::abort(\"`mode` should be 'classification'.\")\n }\n \n # Capture the arguments in quosures\n args <- list(sub_classes = rlang::enquo(sub_classes))\n \n # Save some empty slots for future parts of the specification\n new_model_spec(\n \"discrim_mixture\",\n args = args,\n eng_args = NULL,\n mode = mode,\n method = NULL,\n engine = NULL\n )\n }\n```\n:::\n\n\nThis is pretty simple since the data are not exposed to this function. \n\n::: {.callout-warning}\n We strongly suggest favoring `rlang::abort()` and `rlang::warn()` (or their cli counterparts `cli::cli_abort()` and `cli::cli_warn()`) over `stop()` and `warning()`. The former return better traceback results and have safer defaults for handling call objects. \n:::\n\n### Step 3. 
Add a fit module\n\nNow that parsnip knows about the model, mode, and engine, we can give it the information on fitting the model for our engine. The information needed to fit the model is contained in another list. The elements are:\n\n * `interface` is a single character value that could be `\"formula\"`, `\"data.frame\"`, or `\"matrix\"`. This defines the type of interface used by the underlying fit function (`mda::mda()`, in this case). This helps parsnip translate the data into an appropriate format for that function. \n \n * `protect` is an optional list of function arguments that **should not be changeable** by the user. In this case, we probably don't want users to pass data values to these arguments (until the `fit()` function is called).\n \n * `func` is the package and name of the function that will be called. If you are using a locally defined function, only `fun` is required. \n \n * `defaults` is an optional list of arguments to the fit function that the user can change, but whose defaults can be set here. This isn't needed in this case, but is described later in this document.\n\nFor the first engine:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset_fit(\n model = \"discrim_mixture\",\n eng = \"mda\",\n mode = \"classification\",\n value = list(\n interface = \"formula\",\n protect = c(\"formula\", \"data\"),\n func = c(pkg = \"mda\", fun = \"mda\"),\n defaults = list()\n )\n)\n\nshow_model_info(\"discrim_mixture\")\n#> Information for `discrim_mixture`\n#> modes: unknown, classification \n#> \n#> engines: \n#> classification: mda\n#> \n#> \n#> arguments: \n#> mda: \n#> sub_classes --> subclasses\n#> \n#> fit modules:\n#> engine mode\n#> mda classification\n#> \n#> no registered prediction modules.\n```\n:::\n\n\nWe also set up the information on how the predictors should be handled. These options ensure that the data that parsnip gives to the underlying model allows for a model fit that is as similar as possible to what it would have produced directly.\n\n * `predictor_indicators` describes whether and how to create indicator/dummy variables from factor predictors. There are three options: `\"none\"` (do not expand factor predictors), `\"traditional\"` (apply the standard `model.matrix()` encodings), and `\"one_hot\"` (create the complete set including the baseline level for all factors). \n \n * `compute_intercept` controls whether `model.matrix()` should include the intercept in its formula. This affects more than the inclusion of an intercept column. With an intercept, `model.matrix()` computes dummy variables for all but one factor level. Without an intercept, `model.matrix()` computes a full set of indicators for the first factor variable, but an incomplete set for the remainder.\n \n * `remove_intercept` removes the intercept column *after* `model.matrix()` is finished. This can be useful if the model function (e.g. `lm()`) automatically generates an intercept.\n \n * `allow_sparse_x` specifies whether the model can accommodate a sparse representation for predictors during fitting and tuning.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset_encoding(\n model = \"discrim_mixture\",\n eng = \"mda\",\n mode = \"classification\",\n options = list(\n predictor_indicators = \"traditional\",\n compute_intercept = TRUE,\n remove_intercept = TRUE,\n allow_sparse_x = FALSE\n )\n)\n```\n:::\n\n\n\n### Step 4. Add modules for prediction\n\nSimilar to the fitting module, we specify the code for making different types of predictions. 
To make hard class predictions, the `class_info` object below contains the details. The elements of the list are:\n\n * `pre` and `post` are optional functions that can preprocess the data being fed to the prediction code and postprocess the raw output of the predictions. These won't be needed for this example, but a section below has examples of how these can be used when the model code is not easy to use. If the data being predicted has a simple type requirement, you can avoid using a `pre` function with the `args` below. \n * `func` is the prediction function (in the same format as above). In many cases, packages have a predict method for their model's class, but this is typically not exported. In this case (and the example below), it is simple enough to make a generic call to `predict()` with no associated package. \n * `args` is a list of arguments to pass to the prediction function. These will most likely be wrapped in `rlang::expr()` so that they are not evaluated when defining the method. For mda, the code would be `predict(object, newdata, type = \"class\")`. What is actually given to the function is the parsnip model fit object, which includes a sub-object called `fit` that houses the mda model object. If the data need to be a matrix or data frame, you could also use `newdata = rlang::expr(as.data.frame(newdata))` or similar. \n\nThe parsnip prediction code will expect the result to be an unnamed character vector or factor. This will be coerced to a factor with the same levels as the original data. \n\nTo add this method to the model environment, a similar set function, `set_pred()`, is used:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nclass_info <- \n list(\n pre = NULL,\n post = NULL,\n func = c(fun = \"predict\"),\n args =\n # These lists should be of the form:\n # {predict.mda argument name} = {values provided from parsnip objects}\n list(\n # We don't want the first two arguments evaluated right now\n # since they don't exist yet. `type` is a simple object that\n # doesn't need to have its evaluation deferred. \n object = rlang::expr(object$fit),\n newdata = rlang::expr(new_data),\n type = \"class\"\n )\n )\n\nset_pred(\n model = \"discrim_mixture\",\n eng = \"mda\",\n mode = \"classification\",\n type = \"class\",\n value = class_info\n)\n```\n:::\n\n\nA similar call can be used to define the class probability module (if class probabilities can be computed). The format is identical to the `class` module, but the output is expected to be a tibble with columns for each factor level. \n\nAs an example of the `post` function, the data frame created by `mda:::predict.mda()` will be converted to a tibble. The arguments are `x` (the raw results coming from the predict method) and `object` (the parsnip model fit object). The latter has a sub-object called `lvl`, which is a character vector of the outcome's factor levels (if any). \n\nWe register the probability module. 
There is a template function that makes it slightly easier to format the objects:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprob_info <-\n pred_value_template(\n post = function(x, object) {\n tibble::as_tibble(x)\n },\n func = c(fun = \"predict\"),\n # Now everything else is put into the `args` slot\n object = rlang::expr(object$fit),\n newdata = rlang::expr(new_data),\n type = \"posterior\"\n )\n\nset_pred(\n model = \"discrim_mixture\",\n eng = \"mda\",\n mode = \"classification\",\n type = \"prob\",\n value = prob_info\n)\n\nshow_model_info(\"discrim_mixture\")\n#> Information for `discrim_mixture`\n#> modes: unknown, classification \n#> \n#> engines: \n#> classification: mda\n#> \n#> \n#> arguments: \n#> mda: \n#> sub_classes --> subclasses\n#> \n#> fit modules:\n#> engine mode\n#> mda classification\n#> \n#> prediction modules:\n#> mode engine methods\n#> classification mda class, prob\n```\n:::\n\n\nIf this model could be used for regression situations, we could also add a `numeric` module. For these predictions, the model requires an unnamed numeric vector output. \n\nExamples are [here](https://github.com/tidymodels/parsnip/blob/master/R/linear_reg_data.R) and [here](https://github.com/tidymodels/parsnip/blob/master/R/rand_forest_data.R). \n\n\n### Does it work? \n\nAs a developer, one thing that may come in handy is the `translate()` function. This will tell you what the model's eventual syntax will be. \n\nFor example:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndiscrim_mixture(sub_classes = 2) %>%\n translate(engine = \"mda\")\n#> discrim mixture Model Specification (classification)\n#> \n#> Main Arguments:\n#> sub_classes = 2\n#> \n#> Computational engine: mda \n#> \n#> Model fit template:\n#> mda::mda(formula = missing_arg(), data = missing_arg(), subclasses = 2)\n```\n:::\n\n\nLet's try it on a data set from the modeldata package:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata(\"two_class_dat\", package = \"modeldata\")\nset.seed(4622)\nexample_split <- initial_split(two_class_dat, prop = 0.99)\nexample_train <- training(example_split)\nexample_test <- testing(example_split)\n\nmda_spec <- discrim_mixture(sub_classes = 2) %>% \n set_engine(\"mda\")\n\nmda_fit <- mda_spec %>%\n fit(Class ~ ., data = example_train)\nmda_fit\n#> parsnip model object\n#> \n#> Call:\n#> mda::mda(formula = Class ~ ., data = data, subclasses = ~2)\n#> \n#> Dimension: 2 \n#> \n#> Percent Between-Group Variance Explained:\n#> v1 v2 \n#> 82.63 100.00 \n#> \n#> Degrees of Freedom (per dimension): 3 \n#> \n#> Training Misclassification Error: 0.17241 ( N = 783 )\n#> \n#> Deviance: 671.391\n\npredict(mda_fit, new_data = example_test, type = \"prob\") %>%\n bind_cols(example_test %>% select(Class))\n#> # A tibble: 8 × 3\n#> .pred_Class1 .pred_Class2 Class \n#> <dbl> <dbl> <fct> \n#> 1 0.679 0.321 Class1\n#> 2 0.690 0.310 Class1\n#> 3 0.384 0.616 Class2\n#> 4 0.300 0.700 Class1\n#> 5 0.0262 0.974 Class2\n#> 6 0.405 0.595 Class2\n#> 7 0.793 0.207 Class1\n#> 8 0.0949 0.905 Class2\n\npredict(mda_fit, new_data = example_test) %>% \n bind_cols(example_test %>% select(Class))\n#> # A tibble: 8 × 2\n#> .pred_class Class \n#> <fct> <fct> \n#> 1 Class1 Class1\n#> 2 Class1 Class1\n#> 3 Class2 Class2\n#> 4 Class2 Class1\n#> 5 Class2 Class2\n#> 6 Class2 Class2\n#> 7 Class1 Class1\n#> 8 Class2 Class2\n```\n:::\n\n\n\n## Add an engine\n\nThe process for adding an engine to an existing model is _almost_ the same as building a new model, but simpler, with fewer steps. 
You only need to add the engine-specific aspects of the model. For example, if we wanted to fit a linear regression model using M-estimation, we would only need to add a new engine. The code for the `rlm()` function in MASS is pretty similar to `lm()`, so we can copy that code and change the package/function names:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset_model_engine(\"linear_reg\", \"regression\", eng = \"rlm\")\nset_dependency(\"linear_reg\", eng = \"rlm\", pkg = \"MASS\")\n\nset_fit(\n model = \"linear_reg\",\n eng = \"rlm\",\n mode = \"regression\",\n value = list(\n interface = \"formula\",\n protect = c(\"formula\", \"data\", \"weights\"),\n func = c(pkg = \"MASS\", fun = \"rlm\"),\n defaults = list()\n )\n)\n\nset_encoding(\n model = \"linear_reg\",\n eng = \"rlm\",\n mode = \"regression\",\n options = list(\n predictor_indicators = \"traditional\",\n compute_intercept = TRUE,\n remove_intercept = TRUE,\n allow_sparse_x = FALSE\n )\n)\n\nset_pred(\n model = \"linear_reg\",\n eng = \"rlm\",\n mode = \"regression\",\n type = \"numeric\",\n value = list(\n pre = NULL,\n post = NULL,\n func = c(fun = \"predict\"),\n args =\n list(\n object = expr(object$fit),\n newdata = expr(new_data),\n type = \"response\"\n )\n )\n)\n\n# testing:\nlinear_reg() %>% \n set_engine(\"rlm\") %>% \n fit(mpg ~ ., data = mtcars)\n#> parsnip model object\n#> \n#> Call:\n#> rlm(formula = mpg ~ ., data = data)\n#> Converged in 8 iterations\n#> \n#> Coefficients:\n#> (Intercept) cyl disp hp drat wt \n#> 17.82250038 -0.27878615 0.01593890 -0.02536343 0.46391132 -4.14355431 \n#> qsec vs am gear carb \n#> 0.65307203 0.24975463 1.43412689 0.85943158 -0.01078897 \n#> \n#> Degrees of freedom: 32 total; 21 residual\n#> Scale estimate: 2.15\n```\n:::\n\n\n## Add parsnip models to another package\n\nThe process here is almost the same. All of the previous functions are still required, but their execution is a little different. \n\nFor parsnip to register them, parsnip must already be loaded. For this reason, it makes sense to have parsnip in the \"Depends\" category of the DESCRIPTION file of your package. \n\nThe first difference is that the functions that define the model must be inside a wrapper function that is called when your package is loaded. For our example here, this might look like: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmake_discrim_mixture_mda <- function() {\n parsnip::set_new_model(\"discrim_mixture\")\n\n parsnip::set_model_mode(\"discrim_mixture\", \"classification\")\n\n # and so on...\n}\n```\n:::\n\n\nThis function is then executed when your package is loaded: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n.onLoad <- function(libname, pkgname) {\n # This defines discrim_mixture in the model database\n make_discrim_mixture_mda()\n}\n```\n:::\n\n\nFor an example package that uses parsnip definitions, take a look at the [discrim](https://github.com/tidymodels/discrim) package.\n\n::: {.callout-warning}\n To use a new model and/or engine in the broader tidymodels infrastructure, we recommend your model definition declarations (e.g. `set_new_model()` and similar) reside in a package. If these definitions are in a script only, the new model may not work with the tune package, for example, for parallel processing. \n:::\n\nIt is also important for parallel processing support to **list the home package as a dependency**. 
If the `discrim_mixture()` function lived in a package called `mixedup`, include the line:\n\n```r\nset_dependency(\"discrim_mixture\", eng = \"mda\", pkg = \"mixedup\")\n```\n\nParallel processing requires this explicit dependency setting. When parallel worker processes are created, there is heterogeneity across technologies regarding which packages are loaded. Multicore methods on macOS and Linux will load all of the packages that were loaded in the main R process. However, parallel processing using psock clusters loads no additional packages. If the home package for a parsnip model is not loaded in the worker processes, the model will not have an entry in parsnip's internal database (and will produce an error). \n\n\n## Your model, tuning parameters, and you\n\nThe tune package can be used to find reasonable values of model arguments via tuning. There are some S3 methods that are useful to define for your model. `discrim_mixture()` has one main tuning parameter: `sub_classes`. To work with tune, it is _helpful_ (but not required) to use an S3 method called `tunable()` to define which arguments should be tuned and how values of those arguments should be generated. \n\n`tunable()` takes the model specification as its argument and returns a tibble with columns: \n\n* `name`: The name of the argument. \n\n* `call_info`: A list that describes how to call a function that returns a dials parameter object. \n\n* `source`: A character string that indicates where the tuning value comes from (i.e., a model, a recipe, etc.). Here, it is just `\"model_spec\"`. \n\n* `component`: A character string with more information about the source. For models, this is just the name of the function (e.g. `\"discrim_mixture\"`). \n\n* `component_id`: A character string that serves as a unique identifier for the object. For a model, this indicates the type of model argument (e.g. `\"main\"`). \n\nThe main piece of information that requires some detail is `call_info`. This is a list column in the tibble. Each element of the list is a list that describes the package and function that can be used to create a dials parameter object. \n\nFor example, for a nearest-neighbors `neighbors` parameter, this value is just: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ninfo <- list(pkg = \"dials\", fun = \"neighbors\")\n\n# FYI: how it is used under-the-hood: \nnew_param_call <- rlang::call2(.fn = info$fun, .ns = info$pkg)\nrlang::eval_tidy(new_param_call)\n#> # Nearest Neighbors (quantitative)\n#> Range: [1, 10]\n```\n:::\n\n\nFor `discrim_mixture()`, a dials object is needed that returns an integer for the number of sub-classes that should be created. We can create a dials parameter function for this:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsub_classes <- function(range = c(1L, 10L), trans = NULL) {\n new_quant_param(\n type = \"integer\",\n range = range,\n inclusive = c(TRUE, TRUE),\n trans = trans,\n label = c(sub_classes = \"# Sub-Classes\"),\n finalize = NULL\n )\n}\n```\n:::\n\n\nIf this were in the same package as the other specifications for the parsnip engine, we could use: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntunable.discrim_mixture <- function(x, ...) 
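# with `pkg = NULL` in `call_info` below, `sub_classes()` must be findable without a namespace, e.g. defined in this same package\n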
\n\nIf this were in the same package as the other specifications for the parsnip engine, we could use: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntunable.discrim_mixture <- function(x, ...) {\n tibble::tibble(\n name = c(\"sub_classes\"),\n call_info = list(list(pkg = NULL, fun = \"sub_classes\")),\n source = \"model_spec\",\n component = \"discrim_mixture\",\n component_id = \"main\"\n )\n}\n```\n:::\n\n\nOnce this method is in place, the tuning functions can be used: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmda_spec <- \n discrim_mixture(sub_classes = tune()) %>% \n set_engine(\"mda\")\n\nset.seed(452)\ncv <- vfold_cv(example_train)\nmda_tune_res <- mda_spec %>%\n tune_grid(Class ~ ., cv, grid = 4)\nshow_best(mda_tune_res, metric = \"roc_auc\")\n```\n:::\n\n\n\n\n## Pro-tips, what-ifs, exceptions, FAQ, and minutiae\n\nThere are various things that came to mind while developing this resource.\n\n**Do I have to return a simple vector for `predict()`?**\n\nThere are some models (e.g. glmnet, plsr, Cubist, etc.) that can make predictions for different submodels from the same fitted model object. We facilitate this via `multi_predict()`, rather than `predict()`.\n\nFor example, if we fit a linear regression model via `glmnet` and predict for 10 different penalty values:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npreds <- linear_reg(penalty = 0.1) %>%\n set_engine(\"glmnet\") %>% \n fit(mpg ~ ., data = mtcars) %>%\n multi_predict(new_data = mtcars[1:3, -1], penalty = 1:10)\n\npreds\n#> # A tibble: 3 × 1\n#> .pred \n#> <list> \n#> 1 <tibble [10 × 2]>\n#> 2 <tibble [10 × 2]>\n#> 3 <tibble [10 × 2]>\n\npreds$.pred[[1]]\n#> # A tibble: 10 × 2\n#> penalty .pred\n#> <dbl> <dbl>\n#> 1 1 22.2\n#> 2 2 21.5\n#> 3 3 21.1\n#> 4 4 20.6\n#> 5 5 20.2\n#> 6 6 20.1\n#> 7 7 20.1\n#> 8 8 20.1\n#> 9 9 20.1\n#> 10 10 20.1\n```\n:::\n\n\nThis gives a list column `.pred`, which contains a tibble per row, each tibble corresponding to one row in `new_data`. Within each tibble are columns for the parameter we vary, here `penalty`, and the predictions themselves.
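\n\nIf a flat data frame is easier to work with downstream, one option is to add a row identifier and then unnest the list column (a sketch; dplyr and tidyr are attached with tidymodels):\n\n```r\npreds %>% \n dplyr::mutate(.row = dplyr::row_number()) %>% \n tidyr::unnest(cols = .pred)\n```\n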
\n\n**What do I do about how my model handles factors or categorical data?**\n\nSome modeling functions in R create indicator/dummy variables from categorical data when you use a model formula (typically using `model.matrix()`), and some do not. Some examples of models that do _not_ create indicator variables include tree-based models, naive Bayes models, and multilevel or hierarchical models. The tidymodels ecosystem assumes a `model.matrix()`-like default encoding for categorical data used in a model formula, but you can change this encoding using `set_encoding()`. For example, you can set predictor encodings that say, \"leave my data alone,\" and keep factors as is:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset_encoding(\n model = \"decision_tree\",\n eng = \"rpart\",\n mode = \"regression\",\n options = list(\n predictor_indicators = \"none\",\n compute_intercept = FALSE,\n remove_intercept = FALSE,\n allow_sparse_x = FALSE\n )\n)\n```\n:::\n\n\n::: {.callout-note}\nThere are three options for `predictor_indicators`: \n\n- `\"none\"`: do not expand factor predictors\n- `\"traditional\"`: apply the standard `model.matrix()` encoding\n- `\"one_hot\"`: create the complete set including the baseline level for all factors\n:::\n\nTo learn more about encoding categorical predictors, check out [this blog post](https://www.tidyverse.org/blog/2020/07/parsnip-0-1-2/#predictor-encoding-consistency).\n\n**What is the `defaults` slot and why do I need it?**\n\nYou might want to set defaults that can be overridden by the user. For example, for logistic regression with `glm()`, it makes sense to default to `family = binomial`. However, if someone wants to use a different link function, they should be able to do that. To allow this, that model/engine definition has:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndefaults = list(family = expr(binomial))\n```\n:::\n\n\nSo that is the default:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlogistic_reg() %>% translate(engine = \"glm\")\n#> Logistic Regression Model Specification (classification)\n#> \n#> Computational engine: glm \n#> \n#> Model fit template:\n#> stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), \n#> family = stats::binomial)\n\n# but you can change it:\nlogistic_reg() %>%\n set_engine(\"glm\", family = binomial(link = \"probit\")) %>% \n translate()\n#> Logistic Regression Model Specification (classification)\n#> \n#> Engine-Specific Arguments:\n#> family = binomial(link = \"probit\")\n#> \n#> Computational engine: glm \n#> \n#> Model fit template:\n#> stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), \n#> family = binomial(link = \"probit\"))\n```\n:::\n\n\nThat's what `defaults` are for. \n\n**What if I want more complex defaults?**\n\nThe `translate()` function can be used to check values or set defaults once the model's mode is known. To do this, you can create a model-specific S3 method that first calls the general method (`translate.model_spec()`) and then makes modifications or conducts error traps. \n\nFor example, the ranger and randomForest package functions have arguments for calculating importance. One is a logical and the other is a string. Since this is likely to lead to a bunch of frustration and GitHub issues, we can put in a check:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Simplified version\ntranslate.rand_forest <- function(x, engine, ...) {\n # Run the general method to get the real arguments in place\n x <- translate.default(x, engine, ...)\n \n # Check and see if they make sense for the engine and/or mode:\n if (x$engine == \"ranger\") {\n if (any(names(x$method$fit$args) == \"importance\")) \n if (is.logical(x$method$fit$args$importance)) \n rlang::abort(\"`importance` should be a character value. See ?ranger::ranger.\")\n }\n x\n}\n```\n:::\n\n\nAs another example, `nnet::nnet()` has an option for the final layer to be linear (called `linout`). If `mode = \"regression\"`, that should probably be set to `TRUE`. You couldn't do this with the `args` (described above) since you need the function translated first. \n\n\n**My model fit requires more than one function call. So....?**\n\nThe best course of action is to write a wrapper function so that the fit can happen in one call. This was the case with xgboost and keras. \n\n**Why would I preprocess my data?**\n\nThere might be non-trivial transformations that the model prediction code requires (such as converting to a sparse matrix representation, etc.).\n\nThis would **not** include making dummy variables and `model.matrix()` stuff. The parsnip infrastructure already does that for you. \n\n\n**Why would I post-process my predictions?**\n\nWhat comes back from some R functions may be somewhat... arcane or problematic. As an example, for xgboost, if you fit a multi-class boosted tree, you might expect the class probabilities to come back as a matrix (*narrator: they don't*). If you have four classes and make predictions on three samples, you get a vector of 12 probability values. You need to convert these to a rectangular data set. 
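\n\nA post-processor for this situation might look something like the following sketch (a hypothetical helper, not xgboost's actual interface; `pred_vec` stands in for the flat probability vector and `lvls` for the stored class levels):\n\n```r\nformat_flat_probs <- function(pred_vec, lvls) {\n # xgboost fills the vector sample-by-sample: each run of length(lvls)\n # values holds the class probabilities for one sample\n probs <- matrix(pred_vec, ncol = length(lvls), byrow = TRUE)\n colnames(probs) <- lvls\n tibble::as_tibble(probs)\n}\n\n# e.g., 12 probabilities -> a 3 x 4 tibble, one column per class\n# format_flat_probs(pred_vec, lvls = c(\"a\", \"b\", \"c\", \"d\"))\n```\n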
\n\nAnother example is the predict method for ranger, which encapsulates the actual predictions in a more complex object structure. \n\nThese are the types of problems that the post-processor will solve. \n\n**Are there other modes?**\n\nThere could be. If you have a suggestion, please add a [GitHub issue](https://github.com/tidymodels/parsnip/issues) to discuss it. \n\n \n## Session information {#session-info}\n\n\n::: {.cell layout-align=\"center\"}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> version R version 4.5.1 (2025-06-13)\n#> language (EN)\n#> date 2025-10-21\n#> pandoc 3.6.3\n#> quarto 1.8.25\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package version date (UTC) source\n#> broom 1.0.9 2025-07-28 CRAN (R 4.5.0)\n#> dials 1.4.2 2025-09-04 CRAN (R 4.5.0)\n#> dplyr 1.1.4 2023-11-17 CRAN (R 4.5.0)\n#> ggplot2 4.0.0 2025-09-11 CRAN (R 4.5.0)\n#> infer 1.0.9 2025-06-26 CRAN (R 4.5.0)\n#> mda 0.5-5 2024-11-07 CRAN (R 4.5.0)\n#> modeldata 1.5.1 2025-08-22 CRAN (R 4.5.0)\n#> parsnip 1.3.3 2025-08-31 CRAN (R 4.5.0)\n#> purrr 1.1.0 2025-07-10 CRAN (R 4.5.0)\n#> recipes 1.3.1 2025-05-21 CRAN (R 4.5.0)\n#> rlang 1.1.6 2025-04-11 CRAN (R 4.5.0)\n#> rsample 1.3.1 2025-07-29 CRAN (R 4.5.0)\n#> tibble 3.3.0 2025-06-08 CRAN (R 4.5.0)\n#> tidymodels 1.3.0 2025-02-21 CRAN (R 4.5.1)\n#> tune 2.0.1 2025-10-17 CRAN (R 4.5.0)\n#> workflows 1.3.0 2025-08-27 CRAN (R 4.5.0)\n#> yardstick 1.3.2 2025-01-22 CRAN (R 4.5.0)\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n\n\n\n \n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/learn/develop/models/index.html.md b/learn/develop/models/index.html.md index e46b3098..0a5e01a1 100644 --- a/learn/develop/models/index.html.md +++ b/learn/develop/models/index.html.md @@ -102,7 +102,7 @@ show_model_info("discrim_mixture") The next step is to declare the main arguments to the model. These are declared independent of the mode. To specify the argument, there are a few slots to fill in: - * The name that parsnip uses for the argument. In general, we try to use non-jargony names for arguments (e.g. "penalty" instead of "lambda" for regularized regression). We recommend consulting [the model argument table available here](/find/parsnip/) to see if an existing argument name can be used before creating a new one. + * The name that parsnip uses for the argument. In general, we try to use non-jargony names for arguments (e.g. "penalty" instead of "lambda" for regularized regression). We recommend consulting the model engine topics pages, linked from [the searchable table here](/find/parsnip/), to see if an existing argument name can be used before creating a new one. * The argument name that is used by the underlying modeling function. @@ -826,7 +826,7 @@ There could be. If you have a suggestion, please add a [GitHub issue](https://gi #> ─ Session info ───────────────────────────────────────────────────── #> version R version 4.5.1 (2025-06-13) #> language (EN) -#> date 2025-10-17 +#> date 2025-10-21 #> pandoc 3.6.3 #> quarto 1.8.25 #> @@ -845,8 +845,8 @@ There could be. 
If you have a suggestion, please add a [GitHub issue](https://gi #> rlang 1.1.6 2025-04-11 CRAN (R 4.5.0) #> rsample 1.3.1 2025-07-29 CRAN (R 4.5.0) #> tibble 3.3.0 2025-06-08 CRAN (R 4.5.0) -#> tidymodels 1.4.1 2025-09-08 CRAN (R 4.5.0) -#> tune 2.0.0 2025-09-01 CRAN (R 4.5.0) +#> tidymodels 1.3.0 2025-02-21 CRAN (R 4.5.1) +#> tune 2.0.1 2025-10-17 CRAN (R 4.5.0) #> workflows 1.3.0 2025-08-27 CRAN (R 4.5.0) #> yardstick 1.3.2 2025-01-22 CRAN (R 4.5.0) #> diff --git a/learn/develop/models/index.qmd b/learn/develop/models/index.qmd index c4e29aed..7ea3ad0e 100644 --- a/learn/develop/models/index.qmd +++ b/learn/develop/models/index.qmd @@ -100,7 +100,7 @@ show_model_info("discrim_mixture") The next step is to declare the main arguments to the model. These are declared independent of the mode. To specify the argument, there are a few slots to fill in: - * The name that parsnip uses for the argument. In general, we try to use non-jargony names for arguments (e.g. "penalty" instead of "lambda" for regularized regression). We recommend consulting [the model argument table available here](/find/parsnip/) to see if an existing argument name can be used before creating a new one. + * The name that parsnip uses for the argument. In general, we try to use non-jargony names for arguments (e.g. "penalty" instead of "lambda" for regularized regression). We recommend consulting the model engine topics pages, linked from [the searchable table here](/find/parsnip/), to see if an existing argument name can be used before creating a new one. * The argument name that is used by the underlying modeling function.