This notebook supplies and demonstrates some code I cobbled together to do “classical” stepwise linear regression. Both the code and the rationale for writing it are discussed in a blog post I wrote in 2011 (when I used the code in a course I taught): Stepwise Regression in R.
The function “stepwise” defined below performs stepwise regression based on a “nested model” F test for inclusion/exclusion of a predictor. In keeping with my intent to use it as an instructional tool, it spews progress reports to output (as a side effect) while returning the final model chosen (an lm
object).
To keep it simple, I made no provision for forcing certain variables to be included in all models. The current version has the following properties.
lm
).NA
.y ~ x
and "y ~ x"
should work equally well.One other note: since the code uses R’s drop1
and add1
functions, it respects hierarchy in models. That is, regardless of p values, it will not attempt to drop a term while retaining a higher order interaction involving that term, nor will it add an interaction term if the lower order components are not all present. (You can of course defeat this by putting interactions into new variables and feeding it what looks like a first-order model.)
Consider this to be “beta” code (and feel free to improve it). I’ve done somewhat limited testing on it, beyond what you see in this notebook.
Besides correcting a few spelling errors (oops), I have made a few minor modifications to the function, and added demonstration code below for forward and backward stepwise regression.
The changes to the function are as follows:
alpha.to.enter
parameter now has a default value of 0.0 (which will prevent any variables from entering the model if you fail to specify a value for the parameter).alpha.to.leave
parameter now has a default value of 1.0 (which will prevent any variables from leaving the model if you fail to specify a value for the parameter).data
argument to the function).The following defines the stepwise function.
#'
#' Perform a stepwise linear regression using F tests of significance.
#'
#' @param full.model the model containing all possible terms
#' @param initial.model the first model to consider
#' @param alpha.to.enter the significance level above which a variable may enter the model
#' @param alpha.to.leave the significance level below which a variable may be deleted from the model
#' @param data the data frame to use (optional, as with lm)
#'
#' @return the final model
#'
stepwise <-
function(full.model, initial.model, alpha.to.enter = 0.0, alpha.to.leave = 1.0, data = NULL) {
# Sanity check: alpha.to.enter should not be greater than alpha.to.leave.
if (alpha.to.enter > alpha.to.leave) {
warning("Your alpha-to-enter is greater than your alpha-to-leave, which could throw the function into an infinite loop.\n")
return(NA)
}
# Warning: horrible kludge coming!
# Acquire the full and initial models as formulas. If they are
# entered as formulas, convert them to get their environments
# squared away.
# Note: "showEnv = F" is necessary to avoid having an
# environment identifier break things if the model is
# defined inside a function.
if (is.character(full.model)) {
fm <- as.formula(full.model)
} else {
fm <- as.formula(capture.output(print(full.model, showEnv = F)))
}
if (is.character(initial.model)) {
im <- as.formula(initial.model)
} else {
im <- as.formula(capture.output(print(initial.model, showEnv = F)))
}
# Deal with a missing data argument.
if (is.null(data)) {
# Catch the use of "." in a formula when the data argument is null.
if ("." %in% all.vars(fm) | "." %in% all.vars(im)) {
warning("In order to use the shortcut '.' in a formula, you must explicitly specify the data source via the 'data' argument.\n")
return(NA)
} else {
# Use the parent environment.
data <- parent.frame()
}
}
# Fit the full model.
full <- lm(fm, data);
# Sanity check: do not allow an overspecified full model.
if (full$df.residual < 1) {
warning("Your full model does not have enough observations to properly estimate it.\n")
return(NA)
}
msef <- (summary(full)$sigma)^2; # MSE of full model
n <- length(full$residuals); # sample size
# Fit the initial model.
current <- lm(im, data);
# Process consecutive models until we break out of the loop.
while (TRUE) {
# Summarize the current model.
temp <- summary(current);
# Print the model description.
print(temp$coefficients);
# Get the size, MSE and Mallow's cp of the current model.
p <- dim(temp$coefficients)[1]; # size
mse <- (temp$sigma)^2; # MSE
cp <- (n - p)*mse/msef - (n - 2*p); # Mallow's cp
# Show the fit statistics.
fit <- sprintf("\nS = %f, R-sq = %f, R-sq(adj) = %f, C-p = %f",
temp$sigma, temp$r.squared, temp$adj.r.squared, cp);
# Show the fit itself.
write(fit, file = "");
write("=====", file = "");
# Try to drop a term (but only if more than one is left).
if (p > 1) {
# Look for terms that can be dropped based on F tests.
d <- drop1(current, test = "F");
# Find the term with largest p-value.
pmax <- suppressWarnings(max(d[, 6], na.rm = TRUE));
# If the term qualifies, drop the variable.
if (pmax > alpha.to.leave) {
# We have a candidate for deletion.
# Get the name of the variable to delete.
var <- rownames(d)[d[,6] == pmax];
# If an intercept is present, it will be the first name in the list.
# There also could be ties for worst p-value.
# Taking the second entry if there is more than one is a safe solution to both issues.
if (length(var) > 1) {
var <- var[2];
}
# Print out the variable to be dropped.
write(paste("--- Dropping", var, "\n"), file = "");
# Modify the formulat to drop the chosen variable (by subtracting it from the current formula).
f <- formula(current);
f <- as.formula(paste(f[2], "~", paste(f[3], var, sep = " - ")), env = environment(f));
# Fit the modified model and loop.
current <- lm(f, data);
next;
}
}
# If we get here, we failed to drop a term; try adding one.
# Note: add1 throws an error if nothing can be added (current == full), which we trap with tryCatch.
a <- tryCatch(
add1(current, full, test = "F"),
error = function(e) NULL
);
if (is.null(a)) {
# There are no unused variables (or something went splat), so we bail out.
break;
}
# Find the minimum p-value of any term (skipping the terms with no p-value). In case none of the remaining terms have a p-value (true of the intercept and any linearly dependent predictors), suppress warnings about an empty list. The test for a suitable candidate to drop will fail since pmin will be set to infinity.
pmin <- suppressWarnings(min(a[, 6], na.rm = TRUE));
if (pmin < alpha.to.enter) {
# We have a candidate for addition to the model. Get the variable's name.
var <- rownames(a)[a[,6] == pmin];
# We have the same issue with ties and the presence of an intercept term, and the same solution, as above.
if (length(var) > 1) {
var <- var[2];
}
# Print the variable being added.
write(paste("+++ Adding", var, "\n"), file = "");
# Add it to the current formula.
f <- formula(current);
f <- as.formula(paste(f[2], "~", paste(f[3], var, sep = " + ")), env = environment(f));
# Fit the modified model and loop.
current <- lm(f, data = data);
next;
}
# If we get here, we failed to make any changes to the model; time to declare victory and exit.
break;
}
current
}
The rest of the notebook demonstrates the function in operation.
The first tests of the function will be done using the swiss
data set (47 observations of 6 variables) from the datasets
package. We will (arbitrarily) use alpha = 0.05 to add a variable and alpha = 0.10 to remove one.
data(swiss)
attach(swiss) # to save typing
aToEnter <- 0.05
aToLeave <- 0.10
The first invocation will start with just a constant term. Everything except Examination
ends up used.
result <- stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1, aToEnter, aToLeave)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 70.14255 1.822101 38.49542 1.212895e-36
S = 12.491697, R-sq = 0.000000, R-sq(adj) = 0.000000, C-p = 94.805296
=====
+++ Adding Education
Estimate Std. Error t value Pr(>|t|)
(Intercept) 79.6100585 2.1040971 37.835734 9.302464e-36
Education -0.8623503 0.1448447 -5.953619 3.658617e-07
S = 9.446029, R-sq = 0.440616, R-sq(adj) = 0.428185, C-p = 35.204895
=====
+++ Adding Catholic
Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.2336892 2.35197061 31.562337 7.349828e-32
Education -0.7883293 0.12929324 -6.097219 2.428340e-07
Catholic 0.1109210 0.02980965 3.720974 5.598332e-04
S = 8.331442, R-sq = 0.574507, R-sq(adj) = 0.555167, C-p = 18.486158
=====
+++ Adding Infant.Mortality
Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.67707330 7.91908348 6.146806 2.235983e-07
Education -0.75924577 0.11679763 -6.500524 6.833658e-08
Catholic 0.09606607 0.02721795 3.529511 1.006201e-03
Infant.Mortality 1.29614813 0.38698777 3.349326 1.693753e-03
S = 7.505417, R-sq = 0.662544, R-sq(adj) = 0.639000, C-p = 8.178162
=====
+++ Adding Agriculture
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.1013116 9.60488611 6.465596 8.491981e-08
Education -0.9802638 0.14813668 -6.617293 5.139985e-08
Catholic 0.1246664 0.02889350 4.314686 9.503030e-05
Infant.Mortality 1.0784422 0.38186621 2.824136 7.220378e-03
Agriculture -0.1546175 0.06818992 -2.267454 2.856968e-02
S = 7.168166, R-sq = 0.699348, R-sq(adj) = 0.670714, C-p = 5.032800
=====
The return value is an instance of a linear model.
class(result)
[1] "lm"
It can be summarized, used for predictions etc. just like any other linear model.
summary(result)
Call:
lm(formula = f, data = data)
Residuals:
Min 1Q Median 3Q Max
-14.6765 -6.0522 0.7514 3.1664 16.1422
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
Education -0.98026 0.14814 -6.617 5.14e-08 ***
Catholic 0.12467 0.02889 4.315 9.50e-05 ***
Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
Agriculture -0.15462 0.06819 -2.267 0.02857 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.168 on 42 degrees of freedom
Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
The second invocation starts with the complete model and initially winnows it. We end up with the same model as the previous attempt (albeit with the variables listed in a different order).
stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, aToEnter, aToLeave)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.9151817 10.70603759 6.250229 1.906051e-07
Agriculture -0.1721140 0.07030392 -2.448142 1.872715e-02
Examination -0.2580082 0.25387820 -1.016268 3.154617e-01
Education -0.8709401 0.18302860 -4.758492 2.430605e-05
Catholic 0.1041153 0.03525785 2.952969 5.190079e-03
Infant.Mortality 1.0770481 0.38171965 2.821568 7.335715e-03
S = 7.165369, R-sq = 0.706735, R-sq(adj) = 0.670971, C-p = 6.000000
=====
--- Dropping Examination
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.1013116 9.60488611 6.465596 8.491981e-08
Agriculture -0.1546175 0.06818992 -2.267454 2.856968e-02
Education -0.9802638 0.14813668 -6.617293 5.139985e-08
Catholic 0.1246664 0.02889350 4.314686 9.503030e-05
Infant.Mortality 1.0784422 0.38186621 2.824136 7.220378e-03
S = 7.168166, R-sq = 0.699348, R-sq(adj) = 0.670714, C-p = 5.032800
=====
Call:
lm(formula = f, data = data)
Coefficients:
(Intercept) Agriculture Education
62.1013 -0.1546 -0.9803
Catholic Infant.Mortality
0.1247 1.0784
Finally, we start with Education and Examination as the two predictors. The same final model wins out.
stepwise(Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ Examination + Education, aToEnter, aToLeave)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.2532753 3.0854981 27.630312 1.945244e-29
Examination -0.5572183 0.2319374 -2.402451 2.057160e-02
Education -0.5394570 0.1924380 -2.803277 7.497224e-03
S = 8.981812, R-sq = 0.505485, R-sq(adj) = 0.483007, C-p = 28.135883
=====
+++ Adding Infant.Mortality
Estimate Std. Error t value Pr(>|t|)
(Intercept) 55.2746618 8.8077340 6.275696 1.451652e-07
Examination -0.5108888 0.2063175 -2.476226 1.729132e-02
Education -0.5225093 0.1709099 -3.057221 3.832793e-03
Infant.Mortality 1.4556114 0.4064507 3.581274 8.644778e-04
S = 7.973957, R-sq = 0.619096, R-sq(adj) = 0.592521, C-p = 14.252399
=====
+++ Adding Catholic
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.02820666 8.66076269 5.7764204 8.325568e-07
Examination -0.10580461 0.26036962 -0.4063631 6.865390e-01
Education -0.70415772 0.17969218 -3.9186887 3.221868e-04
Infant.Mortality 1.30567908 0.39150335 3.3350393 1.790664e-03
Catholic 0.08631125 0.03649293 2.3651501 2.271709e-02
S = 7.579356, R-sq = 0.663865, R-sq(adj) = 0.631853, C-p = 9.993398
=====
--- Dropping Examination
Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.67707330 7.91908348 6.146806 2.235983e-07
Education -0.75924577 0.11679763 -6.500524 6.833658e-08
Infant.Mortality 1.29614813 0.38698777 3.349326 1.693753e-03
Catholic 0.09606607 0.02721795 3.529511 1.006201e-03
S = 7.505417, R-sq = 0.662544, R-sq(adj) = 0.639000, C-p = 8.178162
=====
+++ Adding Agriculture
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.1013116 9.60488611 6.465596 8.491981e-08
Education -0.9802638 0.14813668 -6.617293 5.139985e-08
Infant.Mortality 1.0784422 0.38186621 2.824136 7.220378e-03
Catholic 0.1246664 0.02889350 4.314686 9.503030e-05
Agriculture -0.1546175 0.06818992 -2.267454 2.856968e-02
S = 7.168166, R-sq = 0.699348, R-sq(adj) = 0.670714, C-p = 5.032800
=====
Call:
lm(formula = f, data = data)
Coefficients:
(Intercept) Education Infant.Mortality
62.1013 -0.9803 1.0784
Catholic Agriculture
0.1247 -0.1546
Whether the final model contains a constant term or not depends on how the initial model is specified (with or without one), irrespective of whether the full model contains a constant term.
First, we include the constant in the full model but not in the initial model.
stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 0 + Examination + Education, aToEnter, aToLeave)
Estimate Std. Error t value Pr(>|t|)
Examination 4.321474 0.6370496 6.783575 2.135175e-08
Education -1.501667 0.8016925 -1.873121 6.755593e-02
S = 38.046199, R-sq = 0.726789, R-sq(adj) = 0.714646, C-p = 1225.697117
=====
+++ Adding Infant.Mortality
Estimate Std. Error t value Pr(>|t|)
Examination -0.1277718 0.2696722 -0.4738042 6.379822e-01
Education -0.5546268 0.2337591 -2.3726428 2.209730e-02
Infant.Mortality 3.8798737 0.1729717 22.4306888 1.030375e-25
S = 10.911136, R-sq = 0.978029, R-sq(adj) = 0.976531, C-p = 61.027090
=====
--- Dropping Examination
Estimate Std. Error t value Pr(>|t|)
Education -0.6342234 0.1611389 -3.93588 2.846018e-04
Infant.Mortality 3.8195926 0.1161705 32.87919 4.156264e-33
S = 10.816708, R-sq = 0.977917, R-sq(adj) = 0.976935, C-p = 59.547638
=====
+++ Adding Catholic
Estimate Std. Error t value Pr(>|t|)
Education -0.57683975 0.15306577 -3.768574 4.850099e-04
Infant.Mortality 3.58816498 0.14029530 25.575802 4.794287e-28
Catholic 0.09692921 0.03687941 2.628275 1.177271e-02
S = 10.169722, R-sq = 0.980913, R-sq(adj) = 0.979612, C-p = 47.632657
=====
Call:
lm(formula = f, data = data)
Coefficients:
Education Infant.Mortality Catholic
-0.57684 3.58816 0.09693
None of the models, including the last one, has an intercept. Now we reverse that, including it only in the initial model.
stepwise(Fertility ~ 0 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1 + Examination + Education, aToEnter, aToLeave)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.2532753 3.0854981 27.630312 1.945244e-29
Examination -0.5572183 0.2319374 -2.402451 2.057160e-02
Education -0.5394570 0.1924380 -2.803277 7.497224e-03
S = 8.981812, R-sq = 0.505485, R-sq(adj) = 0.483007, C-p = -4.733290
=====
+++ Adding Infant.Mortality
Estimate Std. Error t value Pr(>|t|)
(Intercept) 55.2746618 8.8077340 6.275696 1.451652e-07
Examination -0.5108888 0.2063175 -2.476226 1.729132e-02
Education -0.5225093 0.1709099 -3.057221 3.832793e-03
Infant.Mortality 1.4556114 0.4064507 3.581274 8.644778e-04
S = 7.973957, R-sq = 0.619096, R-sq(adj) = 0.592521, C-p = -11.065312
=====
+++ Adding Catholic
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.02820666 8.66076269 5.7764204 8.325568e-07
Examination -0.10580461 0.26036962 -0.4063631 6.865390e-01
Education -0.70415772 0.17969218 -3.9186887 3.221868e-04
Infant.Mortality 1.30567908 0.39150335 3.3350393 1.790664e-03
Catholic 0.08631125 0.03649293 2.3651501 2.271709e-02
S = 7.579356, R-sq = 0.663865, R-sq(adj) = 0.631853, C-p = -12.348605
=====
--- Dropping Examination
Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.67707330 7.91908348 6.146806 2.235983e-07
Education -0.75924577 0.11679763 -6.500524 6.833658e-08
Infant.Mortality 1.29614813 0.38698777 3.349326 1.693753e-03
Catholic 0.09606607 0.02721795 3.529511 1.006201e-03
S = 7.505417, R-sq = 0.662544, R-sq(adj) = 0.639000, C-p = -14.251684
=====
+++ Adding Agriculture
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.1013116 9.60488611 6.465596 8.491981e-08
Education -0.9802638 0.14813668 -6.617293 5.139985e-08
Infant.Mortality 1.0784422 0.38186621 2.824136 7.220378e-03
Catholic 0.1246664 0.02889350 4.314686 9.503030e-05
Agriculture -0.1546175 0.06818992 -2.267454 2.856968e-02
S = 7.168166, R-sq = 0.699348, R-sq(adj) = 0.670714, C-p = -14.950793
=====
Call:
lm(formula = f, data = data)
Coefficients:
(Intercept) Education Infant.Mortality
62.1013 -0.9803 1.0784
Catholic Agriculture
0.1247 -0.1546
Every model has an intercept.
Similarly, if the initial model contains variables not included in the full model, they will automatically be added to the full model. To demonstrate this, we run stepwise with a full model consisting of just a constant term and a bigger initial model.
stepwise(Fertility ~ 1, Fertility ~ 1 + Examination + Catholic + Infant.Mortality, aToEnter, aToLeave)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.63160781 9.91013518 5.5127006 1.861112e-06
Examination -0.87554621 0.19738235 -4.4357877 6.266888e-05
Catholic 0.02518988 0.03810315 0.6610972 5.120768e-01
Infant.Mortality 1.44975095 0.45016081 3.2205179 2.438693e-03
S = 8.753626, R-sq = 0.540967, R-sq(adj) = 0.508942, C-p = -17.884492
=====
--- Dropping Catholic
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.0809736 9.6025659 5.840207 5.792493e-07
Examination -0.9492895 0.1617956 -5.867213 5.287395e-07
Infant.Mortality 1.4900177 0.4431587 3.362267 1.608484e-03
S = 8.697447, R-sq = 0.536302, R-sq(adj) = 0.515224, C-p = -19.669875
=====
Call:
lm(formula = f, data = data)
Coefficients:
(Intercept) Examination Infant.Mortality
56.0810 -0.9493 1.4900
Notice that Examination and Infant.Mortality (not included in the “full” model) are retained but that Education (which is in neither the “full” nor the initial model) is never added.
Next, we run “forward” stepwise regression (in which variables may enter but may not leave the model under construction) and “backward” stepwise regression (in which variables may leave but may not enter).
To demonstrate forward regression, we begin with a model containing only a constant term and the Examination variable, and observe that Examination (which has been consistently dropped above) remains in the final model. The demonstration code omits the alpha.to.leave value, but setting it explicitly to something sufficiently large would produce the same result.
stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1 + Examination, alpha.to.enter = aToEnter)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 86.818529 3.2576034 26.651043 3.353924e-29
Examination -1.011317 0.1781971 -5.675275 9.450437e-07
S = 9.642000, R-sq = 0.417164, R-sq(adj) = 0.404213, C-p = 38.483494
=====
+++ Adding Infant.Mortality
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.0809736 9.6025659 5.840207 5.792493e-07
Examination -0.9492895 0.1617956 -5.867213 5.287395e-07
Infant.Mortality 1.4900177 0.4431587 3.362267 1.608484e-03
S = 8.697447, R-sq = 0.536302, R-sq(adj) = 0.515224, C-p = 23.827488
=====
+++ Adding Education
Estimate Std. Error t value Pr(>|t|)
(Intercept) 55.2746618 8.8077340 6.275696 1.451652e-07
Examination -0.5108888 0.2063175 -2.476226 1.729132e-02
Infant.Mortality 1.4556114 0.4064507 3.581274 8.644778e-04
Education -0.5225093 0.1709099 -3.057221 3.832793e-03
S = 7.973957, R-sq = 0.619096, R-sq(adj) = 0.592521, C-p = 14.252399
=====
+++ Adding Catholic
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.02820666 8.66076269 5.7764204 8.325568e-07
Examination -0.10580461 0.26036962 -0.4063631 6.865390e-01
Infant.Mortality 1.30567908 0.39150335 3.3350393 1.790664e-03
Education -0.70415772 0.17969218 -3.9186887 3.221868e-04
Catholic 0.08631125 0.03649293 2.3651501 2.271709e-02
S = 7.579356, R-sq = 0.663865, R-sq(adj) = 0.631853, C-p = 9.993398
=====
+++ Adding Agriculture
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.9151817 10.70603759 6.250229 1.906051e-07
Examination -0.2580082 0.25387820 -1.016268 3.154617e-01
Infant.Mortality 1.0770481 0.38171965 2.821568 7.335715e-03
Education -0.8709401 0.18302860 -4.758492 2.430605e-05
Catholic 0.1041153 0.03525785 2.952969 5.190079e-03
Agriculture -0.1721140 0.07030392 -2.448142 1.872715e-02
S = 7.165369, R-sq = 0.706735, R-sq(adj) = 0.670971, C-p = 6.000000
=====
Call:
lm(formula = f, data = data)
Coefficients:
(Intercept) Examination Infant.Mortality
66.9152 -0.2580 1.0770
Education Catholic Agriculture
-0.8709 0.1041 -0.1721
Notice that Examination ends up with a pretty high p-value (~0.3) but remains in the model.
To demonstrate backward regression, we start with a model containing all variables except Education (which in previous models tended to be the strongest predictor in terms of p-value), and use the default value for alpha.to.enter. The demonstration code omits the alpha.to.enter value, but setting it explicitly to something sufficiently small would produce the same result.
stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1 + Agriculture + Examination + Catholic + Infant.Mortality, alpha.to.leave = aToLeave)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.60267039 13.04246209 4.5698941 4.245698e-05
Agriculture -0.04759274 0.08032440 -0.5925066 5.566882e-01
Examination -0.96804720 0.25284306 -3.8286484 4.228040e-04
Catholic 0.02610993 0.03842535 0.6794976 5.005506e-01
Infant.Mortality 1.39596612 0.46259048 3.0177148 4.314855e-03
S = 8.820436, R-sq = 0.544772, R-sq(adj) = 0.501417, C-p = 26.643242
=====
--- Dropping Agriculture
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.63160781 9.91013518 5.5127006 1.861112e-06
Examination -0.87554621 0.19738235 -4.4357877 6.266888e-05
Catholic 0.02518988 0.03810315 0.6610972 5.120768e-01
Infant.Mortality 1.44975095 0.45016081 3.2205179 2.438693e-03
S = 8.753626, R-sq = 0.540967, R-sq(adj) = 0.508942, C-p = 25.175215
=====
--- Dropping Catholic
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.0809736 9.6025659 5.840207 5.792493e-07
Examination -0.9492895 0.1617956 -5.867213 5.287395e-07
Infant.Mortality 1.4900177 0.4431587 3.362267 1.608484e-03
S = 8.697447, R-sq = 0.536302, R-sq(adj) = 0.515224, C-p = 23.827488
=====
Call:
lm(formula = f, data = data)
Coefficients:
(Intercept) Examination Infant.Mortality
56.0810 -0.9493 1.4900
Notice this time that Education never made it into the model.
Before we abandon this data set, there is one other thing worth noting. Your alpha-to-enter must not be larger than your alpha-to-leave. Although you can get away with that sometimes, it carries the potential to put stepwise regression into an infinite loop (adding and dropping the same variable repeatedly), so the function disallows it.
result <- stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1, 0.2, 0.1)
Your alpha-to-enter is greater than your alpha-to-leave, which could throw the function into an infinite loop.
The result of this is a missing model.
result
[1] NA
The “.” shortcut can be used to specify that the right-hand side of a model contain all terms except the dependent variable. It requires that the data source be specified explicitly. So the following works.
stepwise(Fertility ~ ., Fertility ~ 1, aToEnter, aToLeave, data = swiss)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 70.14255 1.822101 38.49542 1.212895e-36
S = 12.491697, R-sq = 0.000000, R-sq(adj) = 0.000000, C-p = 94.805296
=====
+++ Adding Education
Estimate Std. Error t value Pr(>|t|)
(Intercept) 79.6100585 2.1040971 37.835734 9.302464e-36
Education -0.8623503 0.1448447 -5.953619 3.658617e-07
S = 9.446029, R-sq = 0.440616, R-sq(adj) = 0.428185, C-p = 35.204895
=====
+++ Adding Catholic
Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.2336892 2.35197061 31.562337 7.349828e-32
Education -0.7883293 0.12929324 -6.097219 2.428340e-07
Catholic 0.1109210 0.02980965 3.720974 5.598332e-04
S = 8.331442, R-sq = 0.574507, R-sq(adj) = 0.555167, C-p = 18.486158
=====
+++ Adding Infant.Mortality
Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.67707330 7.91908348 6.146806 2.235983e-07
Education -0.75924577 0.11679763 -6.500524 6.833658e-08
Catholic 0.09606607 0.02721795 3.529511 1.006201e-03
Infant.Mortality 1.29614813 0.38698777 3.349326 1.693753e-03
S = 7.505417, R-sq = 0.662544, R-sq(adj) = 0.639000, C-p = 8.178162
=====
+++ Adding Agriculture
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.1013116 9.60488611 6.465596 8.491981e-08
Education -0.9802638 0.14813668 -6.617293 5.139985e-08
Catholic 0.1246664 0.02889350 4.314686 9.503030e-05
Infant.Mortality 1.0784422 0.38186621 2.824136 7.220378e-03
Agriculture -0.1546175 0.06818992 -2.267454 2.856968e-02
S = 7.168166, R-sq = 0.699348, R-sq(adj) = 0.670714, C-p = 5.032800
=====
Call:
lm(formula = f, data = data)
Coefficients:
(Intercept) Education Catholic
62.1013 -0.9803 0.1247
Infant.Mortality Agriculture
1.0784 -0.1546
If we omit the data
argument and rely on the swiss
data set being attached, we get an error.
stepwise(Fertility ~ ., Fertility ~ 1, aToEnter, aToLeave)
In order to use the shortcut '.' in a formula, you must explicitly specify the data source via the 'data' argument.
[1] NA
We are done with the swiss
data set.
detach(swiss)
rm(swiss)