Data analytics is a branch of data science that deals with finding meaningful relationships in raw data and providing insights into the data. People who don’t specialize in data science can find utility in data analytics and make use of data plots, algorithms, and statistical models for analyzing and interpreting data.
statsmodels library
statsmodels is a Python library used for data analytics. It has different modules to support cross-sectional, time-series, and formula-based models, along with associated functions for statistical data analysis and data visualization.
Cross-sectional models are statistical models that analyze data at a single point in time; that is, the data has no temporal component. Cross-sectional models and their associated tools are available in statsmodels as statsmodels.api, with functions in the following categories:
Regression
Imputation
Generalized estimating equations
Generalized linear models
Discrete and count models
Multivariate models
Graphics
Statistics
Miscellaneous tools
Time-series models are statistical models for data collected sequentially over time. Time-series models are available in statsmodels as statsmodels.tsa.api, with functions in the following categories:
Statistics and tests
Univariate time-series analysis
Exponential smoothing
Multivariate time-series models
Filters and decompositions
Markov regime switching models
Forecasting
Time-series tools
X12/X13 Interface
Formula models are statistical models that are specified by a formula and fitted to the given data. Formula models are available in statsmodels as statsmodels.formula.api.
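A formula model can be sketched as follows, using the ols function from statsmodels.formula.api. The DataFrame and its column names (hours, score) are hypothetical and exist only for this example.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: score is exactly 3 + 2 * hours, so the fit is easy to check.
df = pd.DataFrame({"hours": [0, 1, 2, 3, 4, 5]})
df["score"] = 3 + 2 * df["hours"]

# "score ~ hours" means: model score as a linear function of hours.
# The formula interface adds the intercept automatically.
formula_results = smf.ols("score ~ hours", data=df).fit()
print(formula_results.params)  # Series indexed by "Intercept" and "hours"
```

Compared with statsmodels.api, the formula interface refers to columns by name and does not need a separate call to add the constant column.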
statsmodels in data analytics
The statsmodels library provides a wide array of functions for model fitting, hypothesis testing, and other statistical techniques for data analysis. With its dedicated collection of data analytics tools, statsmodels is a useful resource for data analytics.
Furthermore, statsmodels can be used in conjunction with other Python libraries, such as NumPy and Matplotlib, to extend its functionality.
Let’s have a look at the following example, in which summary statistics and a linear regression are applied to the Titanic dataset:
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv('titanic.csv')
print(data.describe())
X = data['Pclass']
y = data['Survived']
X = sm.add_constant(X)
model = sm.OLS(y, X, missing='drop')
results = model.fit()
print(results.params)
Lines 4–5: The Titanic dataset is imported from the titanic.csv file to a pandas DataFrame, and its summary statistics are printed.
Lines 6–7: The predictor and response variables are defined.
Line 8: A column of 1s is added to the predictor variable. This is done to accommodate the constant in the regression model, and the coefficient of this column will be the value of the bias (intercept) term.
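Line 9: The ordinary least squares (OLS) linear regression model, model, is created. This is a simple linear regression model of the type y = b + w·x, where x is the predictor (Pclass), y is the response (Survived), b is the bias, and w is the coefficient of the predictor. The missing='drop' argument excludes rows with missing values.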
Line 10: The model, model, is fitted to the Titanic data, data, via linear regression.
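Line 11: The result of the linear regression is printed. The printed parameters are the bias (const) and the coefficient of Pclass, which together define the fitted regression line: Survived = const + (Pclass coefficient) × Pclass.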
Note: The code example has been implemented with Python 3.9 and statsmodels==0.13.5.