Learn predictive data analysis with Python using NumPy, Pandas, Matplotlib, and Seaborn. Apply skills to real-world finance and advertising projects to extract and visualize insights.

In this course, you will learn how to perform predictive data analysis using Python. It is aimed at people who want to start a career as a data analyst. The main goal of the course is to show you how to use statistics to draw useful insights from data, which can help in predicting future behavior or patterns.

Beyond that, you’ll learn the tools of the trade that data scientists use every day, including NumPy, Pandas, Matplotlib, and Seaborn. You’ll learn not only how to extract meaningful insights from data, but also how to create visualizations that you can use in reports.

Each lesson uses datasets from real-world scenarios to get you accustomed to handling different types of data. At the end of the course, you will work on two real-world projects that demonstrate how data analysis techniques are used in the finance and advertising sectors to generate revenue.

Predictive Data Analysis with Python

def clean_data(df):

    df = df.dropna() # dropping all rows with null values

    # A list of all columns on which outliers need to be removed
    out_list = ['median_house_value', 'median_income', 'housing_median_age']

    quantiles_df = df[out_list].quantile([0.25, 0.75]) # computing 1st & 3rd quartiles of the listed columns

    for out in out_list: # traversing through the list

        Q1 = quantiles_df[out][0.25] # Retrieving value of 1st quartile
        Q3 = quantiles_df[out][0.75] # Retrieving value of 3rd quartile

        iqr = Q3 - Q1 # computing the interquartile range

        lower_bound = (Q1 - (iqr * 1.5)) # computing lower bound 
        upper_bound = (Q3 + (iqr * 1.5)) # computing upper bound

        col = df[out] # Storing reference of required column

        df.loc[col < lower_bound, out] = lower_bound # Set values below the lower bound to the lower bound

        df.loc[col > upper_bound, out] = upper_bound # Set values above the upper bound to the upper bound

    return df

# Test Code

import pandas as pd

df = pd.read_csv('housing.csv')

df_res = clean_data(df.copy())

print(df_res)

## Explanation

A function `clean_data` is declared with `df` passed to it as a parameter.

On __line 3__, the `dropna()` method of the `DataFrame` is used. It finds and removes every row that contains a NaN value.
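To see what `dropna()` does in isolation, here is a minimal sketch. The toy `DataFrame` is made up for illustration; only the column names are borrowed from the solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from housing.csv
toy = pd.DataFrame({
    "median_income": [3.2, np.nan, 5.1],
    "housing_median_age": [20.0, 35.0, np.nan],
})

# dropna() returns a new DataFrame that keeps only the rows without NaN
cleaned = toy.dropna()

print(len(toy), len(cleaned))  # 3 1
```

Note that `dropna()` does not modify `toy`; the result has to be assigned to a variable, as the solution does with `df = df.dropna()`.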

On __line 6__, a `list` is declared that contains the names of all the columns of the dataset whose outliers need to be handled.

On __line 8__, the `quantile` method of the `DataFrame` is used to find the __first__ and __third__ quartiles, which are needed to compute the lower and upper bounds for outliers.
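The shape of the result returned by `quantile` is easier to see on a small example. The sketch below uses made-up values; the result is a `DataFrame` whose index holds the requested quantiles, so a single quartile can be looked up by quantile and column name.

```python
import pandas as pd

# Hypothetical values chosen so the quartiles are easy to check by hand
toy = pd.DataFrame({"median_income": [1.0, 2.0, 3.0, 4.0, 5.0]})

# One row per requested quantile, one column per DataFrame column
quartiles = toy.quantile([0.25, 0.75])

q1 = quartiles.loc[0.25, "median_income"]
q3 = quartiles.loc[0.75, "median_income"]

print(q1, q3)  # 2.0 4.0
```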

On __line 10__, a `for` loop is used to traverse the list. On each iteration, one column from the list is processed for outliers.

On __lines 12 & 13__, the __1st__ and __3rd__ quartile values are retrieved for the required column.

On __line 15__, the _interquartile range_ is calculated from the quartile values.

On __lines 17 & 18__, the lower and upper bound values are calculated using the ___IQR___ value calculated above.
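The bound calculation can be checked on a handful of numbers. This is a minimal sketch: the helper name `iqr_bounds`, its parameter `k`, and the sample values are assumptions made for illustration and do not appear in the solution.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) outlier bounds of a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample with one obvious outlier (100.0)
incomes = pd.Series([10.0, 12.0, 14.0, 16.0, 100.0])

lower, upper = iqr_bounds(incomes)
print(lower, upper)  # 6.0 22.0
```

Here the quartiles are 12 and 16, so the IQR is 4 and the bounds are 12 - 6 = 6 and 16 + 6 = 22. The value 100 lies above the upper bound and would be treated as an outlier.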

On __line 20__, a reference to the current column is stored in a variable so that its outliers can be identified and replaced.

On __line 22__, the values of the current column that are less than the lower bound are set to the lower bound, which brings them into the required range.

On __line 24__, the values of the current column that are greater than the upper bound are set to the upper bound, which brings them into the required range.
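The two capping steps can also be written with the `clip` method of a pandas `Series`. The following is an alternative sketch, not the lesson's solution: the function name `cap_outliers`, its parameters, and the toy data are assumptions for illustration.

```python
import pandas as pd

def cap_outliers(df, columns, k=1.5):
    """Return a copy of df with the given columns capped at their IQR bounds."""
    result = df.copy()  # leave the caller's DataFrame untouched
    for name in columns:
        q1 = result[name].quantile(0.25)
        q3 = result[name].quantile(0.75)
        iqr = q3 - q1
        # clip() replaces values outside [lower, upper] with the nearest bound
        result[name] = result[name].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return result

# Hypothetical toy data with one outlier
toy = pd.DataFrame({"median_income": [10.0, 12.0, 14.0, 16.0, 100.0]})
capped = cap_outliers(toy, ["median_income"])

print(capped["median_income"].tolist())  # [10.0, 12.0, 14.0, 16.0, 22.0]
```

Assigning the clipped column back by name avoids modifying a column through an intermediate variable, which pandas may apply to a copy rather than to the original `DataFrame`.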


___

A quiz awaits you in the next lesson.


This lesson gives a detailed review of the solution to the challenge from the previous lesson.


Solution: Clean the Data
