Enhancing Data Granularity: Data Cleaning

Learn data-cleaning techniques to enhance the quality and relevance of your indexed data.

Building on chunking, this lesson covers data-cleaning techniques that further improve the quality and relevance of indexed data.

Data cleaning

Data cleaning involves removing irrelevant information and inconsistencies from your data. This improves the focus of each data chunk and ensures the LLM receives high-quality information for response generation. Here’s why data cleaning is crucial:

  • Improved retrieval accuracy: By removing irrelevant information, the retrieval process can focus on the most relevant data chunks that accurately match the user's query, leading to more precise responses.

  • Better context understanding: Cleaned data provides a clearer view of the context surrounding the information. This allows the LLM to understand the relationships between concepts and generate responses that are more coherent and relevant to the overall topic.

  • Reduced system bottlenecks: Removing unnecessary information leaves less text to index, search, and pass to the LLM, which can improve processing efficiency during the retrieval and generation stages.

Data cleaning techniques

The following techniques are commonly used to clean data:

  • Stop words removal

  • Special character removal

  • Text normalization

  • Fact-checking and updating information
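The first three techniques can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production pipeline: the function names, the `clean_chunk` helper, and the tiny `STOP_WORDS` set are all assumptions made for this example, and real projects typically use a fuller stop-word list from an NLP library. Fact-checking and updating information is a content-review task, so it is not shown in code.

```python
import re
import unicodedata

# Tiny illustrative stop-word list (an assumption for this sketch);
# real pipelines usually rely on a fuller list from an NLP library.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on", "for"}


def normalize_text(text: str) -> str:
    """Text normalization: unify Unicode forms, lowercase, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


def remove_special_characters(text: str) -> str:
    """Special character removal: replace anything that is not a letter,
    digit, or whitespace (underscores included) with a space."""
    return re.sub(r"[^\w\s]|_", " ", text)


def remove_stop_words(text: str) -> str:
    """Stop words removal: drop tokens found in STOP_WORDS."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


def clean_chunk(text: str) -> str:
    """Apply the three steps in order to a single chunk of text."""
    text = normalize_text(text)
    text = remove_special_characters(text)
    return remove_stop_words(text)


if __name__ == "__main__":
    raw = "  The QUICK brown fox -- jumps over the lazy dog!! "
    print(clean_chunk(raw))  # quick brown fox jumps over lazy dog
```

Normalization runs first so that case variants such as "The" and "the" match the same stop-word entry. Each step is kept as a separate function so that one can be skipped when it does not suit the data, for example keeping special characters when chunks contain code or formulas.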
