Avoid Global Scope

Learn about the good practices in pandas and PySpark.

DataFrames in global scope

The following code is an example of small DataFrames created in the global scope. It should be refactored into a series of functions so that we avoid polluting the global scope:

from pyspark.sql import functions as fn
from pyspark.sql.functions import col

total_review_by_mth_df = (
    main_df
    .groupBy("review_year", "review_month")
    .agg(fn.count(col("asin")).alias("total_review"))
    .orderBy("review_year", "review_month")
)
total_review_2016 = total_review_by_mth_df.filter(col("review_year") == 2016)
total_review_2017 = total_review_by_mth_df.filter(col("review_year") == 2017)
merged_2016_17 = (
    total_review_2016
    .select(
        "review_month",
        col("total_review").alias("total_review_2016"),
    )
    .join(
        total_review_2017
        .select(
            "review_month",
            col("total_review").alias("total_review_2017"),
        ),
        on="review_month",
    )
)
merged_2016_17.show()

Good practice in a production environment

The aggregation and subsetting of a DataFrame can be done through a chain of function calls, as shown below:
