...

Customizing the Tokenizer and Sentence Segmentation

Let's learn how to add special case rules to the existing Tokenizer class.

When we work with a specific domain, such as medicine, insurance, or finance, we often come across words, abbreviations, and entities that need special attention; most domains have characteristic words and phrases that require custom tokenization rules. Here's how to add a special case rule to an existing Tokenizer instance:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_md")

# By default, "lemme" comes out as a single token.
doc = nlp("lemme that")
print([w.text for w in doc])

# Each ORTH value gives the exact text of one piece; together they
# must concatenate back to exactly "lemme".
special_case = [{ORTH: "lem"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("lemme", special_case)

# After the rule is added, "lemme" is split into "lem" and "me".
print([w.text for w in nlp("lemme that")])
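If the model's default rules keep lemme as one token, the first print shows ['lemme', 'that'] and the second shows ['lem', 'me', 'that'].

As a minimal sketch of the domain scenario above (the insurance abbreviation w/o is our own illustrative example, not from the original text), a special case with a single ORTH entry can also work the other way around: it stops the default infix rules from splitting a domain term apart.

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_md")

# A slash between letters is an infix character, so "w/o" typically
# becomes the three tokens "w", "/", "o".
print([w.text for w in nlp("claim settled w/o deductible")])

# A special case with a single ORTH entry keeps the string whole;
# as before, the ORTH values must join back to exactly "w/o".
nlp.tokenizer.add_special_case("w/o", [{ORTH: "w/o"}])
print([w.text for w in nlp("claim settled w/o deductible")])

Note that special case rules match the registered string exactly and are case-sensitive, so W/O would need its own entry.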

Here is what we did:

  • We again started by importing spacy.

  • Then, we imported the ORTH symbol, which stands for orthography; that is, the exact text of the token.

  • We continued ...