Python for Scalable Compute
Learn why Python is the leading language in data science.
Python for data science
Python is quickly becoming the de facto language for data science. Beyond its huge library of useful packages, one of the reasons Python is becoming so popular is that it can be used to build scalable data and predictive model pipelines.
Below is an example of modeling in Python. It trains a linear regression model on the Boston housing data set and reports how well the model fits a held-out test split.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# load Boston housing data set
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# each record spans two rows in the raw file, so stitch the halves back together
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
bostonDF = pd.DataFrame(data, columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                                       'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])
bostonDF['label'] = target

# create train and test splits of the housing data set
x_train, x_test, y_train, y_test = train_test_split(
    bostonDF.drop(['label'], axis=1), bostonDF['label'], test_size=0.33)

# train a linear regression model
model = LinearRegression()
model.fit(x_train, y_train)

# print results: R^2 and the mean absolute error on the test split
print("R^2: " + str(model.score(x_test, y_test)))
print("Mean Error: " + str(sum(abs(y_test - model.predict(x_test))) / y_test.count()))
You can use Python on your local machine and build predictive models with scikit-learn, or you can use environments such as Dataflow and PySpark to build distributed systems. While these environments use different libraries and programming paradigms, they all share the same language: Python.
It’s no longer necessary to translate an R script into a production language such as Java; you can use the same language for both development and production of predictive models. It took me a while to adopt Python as my data science language of choice.
Java had been my preferred language, regardless of the task, since early in my undergraduate career. For data science tasks, I used tools like Weka to train predictive models. I still find Java to be useful when building data pipelines, and it’s great to know for directly collaborating with engineering teams on projects.
I later switched to R while working at Electronic Arts, and found the transition to an interactive coding environment to be quite useful for data science. One of the features I really enjoyed in R is R Markdown, which you can use to write documents with inline code.
Reasons to learn Python
When I started working at Zynga in 2018, I adopted Python and haven’t looked back. It took a bit of time to get used to the new language, but several reasons convinced me to learn it:
- Momentum: Many teams are already using Python for production or for portions of their data pipelines. It makes sense to also use Python for analysis tasks.
- PySpark: R and Java don’t provide a good transition to authoring Spark tasks interactively. You can use Java for Spark, but it’s not a good fit for exploratory work. Additionally, the transition from Python to PySpark seems to be the most approachable way to learn Spark.
- Deep learning: I’m interested in deep learning, and while there are R bindings for libraries such as Keras, it’s better to code in the native language of these libraries. I previously used R to author custom loss functions, and I had trouble debugging errors.
- Libraries: In addition to the deep learning libraries offered for Python, there are a number of other useful tools, including Flask and Bokeh. There are also notebook environments that can scale, including Google’s Colaboratory and AWS SageMaker.
From R to Python
To ease the transition from R to Python, I’d recommend the following steps:
- Focus on outcomes, not semantics: Instead of learning all the fundamentals of the language, I first focused on doing in Python what I already knew how to do in other languages, such as training a logistic regression model.
- Learn the ecosystem, not the language: I didn’t limit myself to the base language when learning. Instead, I jumped right into using Pandas and scikit-learn.
- Use cross-language libraries: I already had experience with Keras and Plotly in R and used my knowledge of these libraries to bootstrap learning Python.
- Work with real-world data: I used the data sets provided by Google’s BigQuery to test out my scripts on large-scale data.
- Start locally, if possible: While one of my goals was to learn PySpark, I first focused on getting things up and running on my local machine before moving to cloud ecosystems.
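As a rough illustration of the first two steps, here is a minimal sketch of training a logistic regression model locally with Pandas and scikit-learn. The data is synthetic, generated with scikit-learn's make_classification, and the column names (f0 through f4), the sample size, and the random seed are assumptions chosen only for this example. It is a starting point for experimentation, not a recommended modeling workflow.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption made for this sketch).
features, labels = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=42,
)
df = pd.DataFrame(features, columns=[f"f{i}" for i in range(5)])
df["label"] = labels

# Create train and test splits, mirroring the split used in the earlier example.
x_train, x_test, y_train, y_test = train_test_split(
    df.drop(columns=["label"]), df["label"], test_size=0.33, random_state=42
)

# Train a logistic regression model.
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)

# Score the held-out split: accuracy and area under the ROC curve.
probabilities = model.predict_proba(x_test)[:, 1]
accuracy = model.score(x_test, y_test)
auc = roc_auc_score(y_test, probabilities)
print("Accuracy: " + str(accuracy))
print("ROC AUC: " + str(auc))
```

Everything here runs on a single machine, which matches the "start locally" advice: once a script like this works, the same steps can be carried over to larger data sets or distributed environments.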
There are many situations where Python is not the best choice for a specific task, but it does have broad applicability when prototyping models and building scalable model pipelines.
Because of Python’s rich ecosystem, we will be using it for all the examples in this course.