Why We Need Integrity Tests

Learn about why we need additional testing for complex data.

Handling constraints within the database has the advantage of obtaining universal compliance. With Postgres, it becomes impossible for data from any source to violate the constraints we’ve put in place. On the other hand, some constraints do not readily lend themselves to being handled by the database. Here, we discuss some types of difficult constraints to look out for. We then consider both preventative and diagnostic out-of-database approaches to data integrity.

Complex constraints

Some conditions, even on individual fields, can get complicated. A field storing an email address should be formatted as a valid email. Similarly, a phone number, an address, and an identification number could all have patterns that need to match. These can probably be done with regular expressions, which are implemented in Postgres, but not in SQLite, unless a special add-on is installed. In order to avoid back-end dependency, it may be better to check things in Python.

Sometimes the complication is not within a single field, but across fields, such as an address that must be self-consistent. The city must match the zip code, and both must agree with the state. Traditionally, these checks have been done by maintaining reference tables for city names, zip codes, and states. The modern trend is to outsource this to a web API, which allows us to check the address’s validity. For example, UPS and USPS provide address validation APIs, which can check whether or not the address is properly formatted, and that it refers to an actual place.

Out-of-data referencing

Having an address record refer to a real address is an example of an out-of-data reference. Our data is referring to or otherwise related to something outside of the data itself. We can’t check its accuracy without going outside of our database, probably through the internet, and checking the answer. Beyond addresses, out-of-data references can include historical names, GPS locations, product codes on a vendor’s site, technical specs from a manufacturer, and standard medical diagnostic codes.

Sometimes the issue with out-of-data references is that the thing they refer to is dynamic. A common example of this is if we have a blog with links, we want to make sure those links return valid pages. Or perhaps we have a student discount, and we need to make sure that the student is still enrolled from time to time. Simply testing at the time of insertion would be insufficient in these cases.

Get hands-on with 1300+ tech skills courses.