How to use statsmodels for data analytics

Data analytics is a branch of data science that deals with finding meaningful relationships in raw data and providing insights into the data. People who don’t specialize in data science can find utility in data analytics and make use of data plots, algorithms, and statistical models for analyzing and interpreting data.

The statsmodels library

statsmodels is a Python library used for data analytics. It has different modules to support cross-sectional, time-series, and formula-based models, along with associated functions for statistical data analysis and data visualization.

Cross-sectional models

Cross-sectional models are statistical models that analyze data observed at a single point in time; that is, the data has no temporal component. Cross-sectional models and their associated tools are available in statsmodels through statsmodels.api, which provides functions in the following categories:

  • Regression

  • Imputation

  • Generalized estimating equations

  • Generalized linear models

  • Discrete and count models

  • Multivariate models

  • Graphics

  • Statistics

  • Miscellaneous tools

Time-series models

Time-series models are statistical models for data collected over time, where the order of the observations matters. They are available in statsmodels through statsmodels.tsa.api, which provides functions in the following categories:

  • Statistics and tests

  • Univariate time-series analysis

  • Exponential smoothing

  • Multivariate time-series models

  • Filters and decompositions

  • Markov regime switching models

  • Forecasting

  • Time-series tools

  • X12/X13 Interface

Formula models

Formula models are statistical models that are specified with a formula string, such as y ~ x, together with a DataFrame whose columns the formula refers to. They are available in statsmodels through statsmodels.formula.api.
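
A minimal sketch of the formula interface follows. The DataFrame df and its columns x and y are made up for this example; the formula "y ~ x" asks for a regression of y on x, with an intercept added automatically.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, made up for illustration: y is roughly 1 + 2 * x plus noise.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 1.0 + 2.0 * df["x"] + rng.normal(scale=0.5, size=200)

# The formula names DataFrame columns; no manual add_constant call is needed.
formula_results = smf.ols("y ~ x", data=df).fit()

# Parameters are labeled "Intercept" and "x".
print(formula_results.params)
```

Compared with statsmodels.api, the formula interface handles the intercept and column selection from the formula string, which can be more convenient when the data is already in a DataFrame.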

statsmodels in data analytics

The statsmodels library provides a wide array of functions for model fitting, hypothesis testing, and other statistical techniques for data analysis. Its dedicated collection of estimation, testing, and diagnostic tools makes statsmodels a useful resource for data analytics.

Furthermore, statsmodels can be used in conjunction with other Python libraries such as NumPy, pandas, and Matplotlib to extend its functionality.

Code example

Let’s look at the following example, which prints summary statistics for the Titanic dataset and then fits a linear regression to it:

import pandas as pd
import statsmodels.api as sm

data = pd.read_csv('titanic.csv')
print(data.describe())
X = data['Pclass']
y = data['Survived']
X = sm.add_constant(X)
model = sm.OLS(y, X, missing='drop')
results = model.fit()
print(results.params)

Code explanation

  • Lines 4–5: The Titanic dataset is loaded from the titanic.csv file into a pandas DataFrame, and its summary statistics are printed.

  • Lines 6–7: The predictor (Pclass) and response (Survived) variables are defined.

  • Line 8: A column of 1s is added to the predictor variable. This is done to accommodate the constant in the regression model, and the coefficient of this column will be the value of the bias $b$ in the model.

  • Line 9: An ordinary least squares (OLS) linear regression model is created and stored in model. This is a linear regression model of the form $y = \sum_i w_i X_i + b$. The missing='drop' argument excludes rows that contain missing values.

  • Line 10: The model is fitted to the Titanic data, and the fitted results are stored in results.

  • Line 11: The estimated parameters of the linear regression are printed. The fitted regression model is $\text{Survived} = 0.5044 \cdot \text{Pclass} - 0.0621$.

Note: The code example has been implemented with Python 3.9 and statsmodels==0.13.5.

Copyright ©2024 Educative, Inc. All rights reserved