Introduction

Common indexing techniques used in text preprocessing include term-based indexing, document-based indexing, inverted indexing, and positional indexing. Each technique indexes text data by keywords, phrases, or other relevant metadata, enabling efficient searching, classification, and analysis of large collections of text.

Term-based indexing

Term-based indexing indexes documents by the individual terms, or words, that appear in them. By associating each term with a list of the identifiers of the documents in which it occurs, we can quickly retrieve the documents that contain specific query terms. Fast retrieval is the main advantage of this approach. However, the index can be memory-intensive, because it stores an entry for every distinct term, and it also requires text preprocessing steps such as tokenization, normalization, and stemming to handle variations in spelling or word form.

In the example below, let's see how to apply term-based indexing using Python. We'll read reviews from a CSV file, tokenize them into words, remove common English stopwords, and then create an inverted index that maps each term to the IDs of the reviews in which it appears, along with the positions of the term within each review.

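import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict

# Fetch the NLTK data used below if it is not already installed
# (newer NLTK releases load the tokenizer from 'punkt_tab').
for resource in ('punkt', 'punkt_tab', 'stopwords'):
    nltk.download(resource, quiet=True)

# Load the reviews; the file is expected to have 'review_id' and 'review' columns.
df = pd.read_csv('reviews.csv')

# Lowercase and tokenize each review.
df['tokens'] = df['review'].apply(lambda text: word_tokenize(text.lower()))

# Drop common English stopwords.
stop_words = set(stopwords.words('english'))
df['tokens'] = df['tokens'].apply(lambda tokens: [token for token in tokens if token not in stop_words])

# Build the index: term -> review ID -> positions in the filtered token list.
term_index = defaultdict(lambda: defaultdict(list))
for review_id, tokens in df[['review_id', 'tokens']].itertuples(index=False):
    for position, term in enumerate(tokens):
        term_index[term][review_id].append(position)

# Print every term with the reviews and positions where it occurs.
for term, doc_positions in term_index.items():
    print(f"Term: {term}")
    for doc_id, positions in doc_positions.items():
        print(f"  Document ID: {doc_id}, Positions: {positions}")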
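
The code above builds and prints the index but doesn't query it. As a minimal sketch of the retrieval step, the example below rebuilds an index of the same shape from a few hypothetical in-memory reviews (standing in for reviews.csv) and looks up the reviews that contain every query term. The search function and the sample data are assumptions made for illustration; they aren't part of the lesson's code.

```python
from collections import defaultdict

# Hypothetical sample data standing in for reviews.csv (review_id -> tokens).
reviews = {
    1: ["great", "battery", "life"],
    2: ["battery", "died", "quickly"],
    3: ["great", "screen", "great", "price"],
}

# Same structure as term_index above: term -> review_id -> list of positions.
term_index = defaultdict(lambda: defaultdict(list))
for review_id, tokens in reviews.items():
    for position, term in enumerate(tokens):
        term_index[term][review_id].append(position)


def search(index, query_terms):
    """Return the sorted IDs of reviews that contain every query term."""
    matches = None
    for term in query_terms:
        # Use .get so that unknown terms do not create empty index entries.
        doc_ids = set(index.get(term, {}))
        matches = doc_ids if matches is None else matches & doc_ids
    return sorted(matches) if matches else []


print(search(term_index, ["great"]))             # [1, 3]
print(search(term_index, ["great", "battery"]))  # [1]
print(search(term_index, ["missing"]))           # []
```

Each lookup touches only the entries for the query terms rather than scanning every review, which is where the retrieval speed of term-based indexing comes from. In practice, the query terms would need the same lowercasing and stopword removal that was applied when the index was built.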
