How to use statsmodels for data analytics

Data analytics is a branch of data science that deals with finding meaningful relationships in raw data and turning them into insights. Even people who don’t specialize in data science can benefit from data analytics by using data plots, algorithms, and statistical models to analyze and interpret data.

The `statsmodels` library

statsmodels is a Python library used for data analytics. It has different modules to support cross-sectional, time-series, and formula-based models, along with associated functions for statistical data analysis and data visualization.

Cross-sectional models

Cross-sectional models are statistical models that analyze data observed at a single point in time, that is, the data has no temporal component. Cross-sectional models and their associated tools are available in statsmodels through statsmodels.api, which provides functions in the following categories:

Regression
Imputation
Generalized estimating equations
Generalized linear models
Discrete and count models
Multivariate models
Graphics
Statistics
Miscellaneous tools

Time-series models

Time-series models are statistical models for data collected over time, where the order of the observations matters. Time-series models are available in statsmodels through statsmodels.tsa.api, which provides functions in the following categories:

Statistics and tests
Univariate time-series analysis
Exponential smoothing
Multivariate time-series models
Filters and decompositions
Markov regime switching models
Forecasting
Time-series tools
X12/X13 Interface

Formula models

Formula models are statistical models that are specified with an R-style formula string, such as y ~ x1 + x2, and fitted to the columns of a DataFrame. Formula models are available in statsmodels through statsmodels.formula.api.
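
A short sketch of the formula interface follows. The DataFrame, its column names (x, group, y), and the coefficients used to generate the data are hypothetical and exist only to show how a formula string maps onto columns.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one numeric and one categorical predictor.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "x": rng.normal(size=200),
    "group": rng.choice(["a", "b"], size=200),
})
# Demo assumption: y = 1 + 2*x + 0.5*(group is "b") + noise.
df["y"] = (
    1.0
    + 2.0 * df["x"]
    + 0.5 * (df["group"] == "b")
    + rng.normal(scale=0.3, size=200)
)

# The formula names columns of df; C(...) marks a categorical variable.
# An intercept is included automatically.
formula_results = smf.ols("y ~ x + C(group)", data=df).fit()
print(formula_results.params)
```

Unlike the array-based interface, the formula interface adds the intercept for you, so no explicit column of 1s is needed.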

`statsmodels` in data analytics

The statsmodels library provides a wide array of functions for model fitting, hypothesis testing, and other statistical techniques. Its dedicated collection of estimation, testing, and diagnostic tools makes statsmodels a useful resource for data analytics.
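
As a small example of hypothesis testing, the sketch below compares the means of two samples with a two-sample t-test. The two groups are randomly generated, so their sizes, means, and names are assumptions for this illustration.

```python
import numpy as np
from statsmodels.stats.weightstats import ttest_ind

# Two synthetic samples whose true means differ by 1 (demo assumption).
rng = np.random.default_rng(3)
group_a = rng.normal(loc=0.0, scale=1.0, size=100)
group_b = rng.normal(loc=1.0, scale=1.0, size=100)

# Two-sided t-test with pooled variance (the default).
# Returns the t statistic, the p-value, and the degrees of freedom.
t_stat, p_value, dof = ttest_ind(group_a, group_b)
print("t =", t_stat, "p =", p_value, "df =", dof)
```

A small p-value here indicates that the difference in sample means is unlikely under the null hypothesis of equal means.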

Furthermore, statsmodels can be used in conjunction with other Python libraries like NumPy and Matplotlib to extend its functionality.

Code example

Let’s have a look at the following example, in which different data analytic techniques are applied to the Titanic dataset:

Code explanation

Lines 4–5: The Titanic dataset is loaded from the titanic.csv file into a pandas DataFrame.
Lines 6–7: The predictor and response variables are defined.
Line 8: A column of 1s is added to the predictor variable. This is done to accommodate the constant in the regression model, and the coefficient of this column will be the value of the bias $b$ in the model.
Line 9: The ordinary least squares (OLS) linear regression model, `model`, is created. This is a linear model of the form $y = \sum_i w_i X_i + b$; with the single predictor used here, it reduces to $y = wX + b$.
Line 10: The model `model` is fitted to the Titanic data `data` via linear regression.
Line 11: The result of the linear regression is printed. The fitted regression model is $\text{Survived} = 0.5044 \cdot \text{PClass} - 0.0621$.