Combine Indexing Techniques
Learn how to combine multiple techniques into a single, more sophisticated index.
We can choose from different techniques and keys to compute an index. Equality-based and sort-based indexing techniques each have their strengths and weaknesses, and a key that works well for one subset of the data can perform poorly for another.
We are not limited to just one technique, key, or run. Let’s discuss how and when to combine multiple individual approaches to create a sophisticated index.
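As a minimal sketch of the idea, the snippet below builds one equality-based index and one sort-based index on a handful of made-up records and merges their candidate pairs with a set union. The records, the function names `block_pairs` and `sorted_neighbourhood_pairs`, and the `window` parameter are illustrative assumptions, not the course's own code, and taking the union is only one possible way to combine indexes.

```python
from itertools import combinations

# Hypothetical toy records: (id, surname). These are NOT from the voters dataset.
records = [
    (0, "smith"),
    (1, "smith"),
    (2, "smyth"),
    (3, "jones"),
    (4, "jonas"),
    (5, "smith"),
]


def block_pairs(records, key):
    """Equality-based indexing: pair records whose key values are identical."""
    blocks = {}
    for rec in records:
        blocks.setdefault(key(rec), []).append(rec[0])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def sorted_neighbourhood_pairs(records, key, window=2):
    """Sort-based indexing: pair each record with the next (window - 1)
    records in key order."""
    ordered = [rec[0] for rec in sorted(records, key=key)]
    pairs = set()
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + window]:
            pairs.add((min(left, right), max(left, right)))
    return pairs


def surname(rec):
    return rec[1]


by_block = block_pairs(records, key=surname)
by_sort = sorted_neighbourhood_pairs(records, key=surname, window=2)

# Combine both runs: a pair is a candidate if either technique proposes it.
combined = by_block | by_sort

print("blocking only:", sorted(by_block))
print("sorted neighbourhood only:", sorted(by_sort))
print("combined:", sorted(combined))
```

In this toy data, exact-match blocking never pairs "jones" with "jonas", while the narrow sliding window does not reach from the first "smith" to the last one. The union keeps the pairs found by either run, at the cost of a larger candidate set.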
Limits of simple indexing
Well-designed indexing matters more as the data grows. Below, we read a relatively large dataset of 0.5 million records (sampled from an original 5 million) with data quality issues in every column. The data represents North Carolina voters.
Note: The voters dataset we use below is open data. See the Glossary of this course for attribution and references. We reduce the dataset to 10% of its original five million records through random sampling. This is required to stay within the memory limits of the provided virtual machine. We also report outcomes for the full dataset, computed on our MacBook with 32 GB of RAM.
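To put the two dataset sizes in perspective, the sketch below counts how many record pairs a full (unrestricted) index would contain. It assumes we compare records within a single dataset, so n records give n(n-1)/2 pairs, and it uses round figures of 500,000 and 5,000,000 records; the function name `full_index_pairs` is our own.

```python
def full_index_pairs(n_records: int) -> int:
    """Number of unordered record pairs when every record is compared
    with every other record in the same dataset."""
    return n_records * (n_records - 1) // 2


# Round figures taken from the text: the 10% sample and the full dataset.
for label, n in [("10% sample", 500_000), ("full dataset", 5_000_000)]:
    print(f"{label}: {full_index_pairs(n):,} candidate pairs")
```

Because the pair count grows quadratically, ten times as many records mean roughly a hundred times as many pairs. That is why an index that is acceptable on the sample can become impractical on the full data.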