Fitting models with DataExplore

Introductions to linear regression are available from many sources on the web and a basic understanding is assumed here. See links below.

Statsmodels

This Python library is used for model fitting in DataExplore because it works well with pandas and has a simple interface. It provides descriptive statistics, statistical tests, plotting functions and implementations of the standard estimators used in model fitting. It also accepts string formulas via patsy, which DataExplore uses to let the user type in their formulas easily.
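
As a minimal sketch of what such a string formula does, the patsy dmatrices function turns a formula plus a DataFrame into the design matrices an estimator consumes (the column names here are made up for illustration):

import pandas as pd
from patsy import dmatrices

# toy data whose column names match the formula below
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'c': [2, 1, 4, 3],
                   'b': [3.1, 4.2, 9.8, 9.1]})

# patsy parses the string and builds the response and design matrices,
# adding an intercept column automatically
y, X = dmatrices('b ~ a + c', data=df, return_type='dataframe')
print(X)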

Linear regression

Simple linear regression is an approach for predicting a quantitative response using a single feature. In ordinary least squares, coefficients are estimated using the least squares criterion, which means we find the line (mathematically) that minimizes the sum of squared residuals (or “sum of squared errors”). Once the coefficients are known they can be used to predict new data points.
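
A minimal statsmodels sketch of this, fitting a line to toy data and then predicting new points (variable names are illustrative):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [2.0, 4.1, 6.2, 7.9, 10.1]})

# fit intercept and slope by minimizing the sum of squared residuals
model = smf.ols('y ~ x', data=df).fit()
print(model.params)

# use the estimated coefficients to predict new data points
print(model.predict(pd.DataFrame({'x': [6, 7]})))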

Logistic regression

Logistic (or logit) regression is a type of regression analysis used for predicting the outcome of a categorical dependent variable, often represented by integers (e.g. 0 or 1 might represent male/female), based on one or more predictor variables, which can be continuous quantities. Statsmodels provides the logit function for this. The IPython Notebook Tutorial for Titanic gives a very good illustration of this kind of fitting. The Titanic dataset is included by default with the application so that logistic regression can be demonstrated in the same manner as that tutorial.
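
A minimal sketch of the equivalent call in plain statsmodels, with made-up column names and toy data, assuming a 0/1 coded outcome:

import pandas as pd
import statsmodels.formula.api as smf

# toy data: a binary outcome against a continuous predictor
df = pd.DataFrame({'outcome': [0, 0, 0, 1, 0, 1, 1, 1],
                   'score': [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]})

model = smf.logit('outcome ~ score', data=df).fit()
print(model.summary())
print(model.predict(df))  # predicted probabilities of outcome == 1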

Usage

The model fitting dialog is kept simple and currently implements only a subset of statsmodels functionality, making it suitable for teaching beginners. Clicking the model fitting button on the toolbar creates the dialog below the table. The most important part is the entry bar for the formula. Estimators can be selected from the drop-down menu. Formulas are of the following form:

b ~ a + c

where b represents the dependent variable (the one you want to fit to and predict new values for) and a and c are the independent variables (or features). The ~ plays the role of an = sign, separating the two sides of the equation. This is an example of multiple regression, since two independent variables are used.

The above assumes you have three columns in the table with the corresponding names. The program uses the currently selected rows of the table for fitting, or ‘training’, the model. You simply select the appropriate rows in the table and press ‘fit’. This produces a detailed results summary showing the fitted coefficients, residual statistics and so on. Various plots of the fit results are also available, such as residual and Q-Q plots. The default plot for linear regressors shows the training/fitted data along with the remaining data in your table, including the predicted values for this ‘out of sample’ data. You can plot each independent variable against the predicted outcome (response). A rough sketch of this workflow in plain statsmodels is shown below.
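
Outside the GUI, the same select-fit-predict workflow looks roughly like this (a sketch with synthetic data; in DataExplore the row selection replaces the slicing):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(0)
df = pd.DataFrame({'a': np.random.normal(size=20),
                   'c': np.random.normal(size=20)})
df['b'] = 2 * df['a'] - df['c'] + np.random.normal(scale=0.1, size=20)

train = df.iloc[:15]   # rows 'selected' for fitting
test = df.iloc[15:]    # remaining 'out of sample' rows

model = smf.ols('b ~ a + c', data=train).fit()
print(model.summary())      # detailed results: coefficients, residuals etc.
print(model.predict(test))  # predicted values for the unselected rows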

Logistic regression is done by selecting the logit estimator. You should first ensure your dependent variable is categorical and encoded as numbers such as 0 and 1. The formulas take the same form as for OLS. Here is an example formula for fitting the Titanic dataset, the point of which is to predict survival outcome (0 or 1) from a subset of the features such as age, sex and class:

survived ~ C(pclass) + C(sex) + age + sibsp + C(embarked)

The C() function tells the estimator that those variables are categorical, so you don’t have to convert them to numbers first. However, it is also possible to prepare the columns manually before fitting.
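
As a rough sketch of the difference (toy rows loosely in the style of the Titanic columns), both approaches below give an equivalent fit; the first dummy-codes on the fly with C(), the second prepares a numeric column by hand (the coding differs, so coefficient signs may differ):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'survived': [0, 1, 1, 0, 1, 0, 0, 1],
                   'sex': ['male', 'male', 'female', 'female',
                           'male', 'female', 'male', 'female'],
                   'age': [30, 32, 25, 27, 40, 45, 41, 44]})

# C() dummy-codes the string column on the fly
m1 = smf.logit('survived ~ C(sex) + age', data=df).fit()

# manual preparation: encode the category as 0/1 first
df['sex_num'] = (df['sex'] == 'female').astype(int)
m2 = smf.logit('survived ~ sex_num + age', data=df).fit()

print(m1.params)
print(m2.params)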

For a more detailed explanation on patsy formulas see here.

A friendly introduction to linear regression

Published: September 02 2015
