Memory Tips
Learn how to write memory-efficient code in Python.
Cache it
Data-intensive projects can rarely avoid downloading large amounts of data, like website and database dumps, tweets, other social media posts, and the like. These downloads take significant time and consume bandwidth and our credit of trust. If the data source isn’t particularly keen about our transactions, it may ban or throttle our IP address and make further downloads slow or impossible. So, citing one of the relatively user-friendly sites, polite data miners cache on their end; impolite ones get banned. As a general rule, we want to cache anything that we downloaded unless we have a good reason to believe that either we won’t need it again or it will expire before we need it again.
- A classical approach to caching is to organize a directory for storing the previously obtained objects by their identifiers. The identifiers may be, for example, objects’ URLs, tweet ids, or database row numbers; anything related to the objects’ sources.
- The next step is to convert an identifier to a uniform-looking unique file name. We can write the conversion function ourselves or use the standard library. Start by encoding the identifier, which is presumably a string. Apply one of the hashing functions, such as the
hashlib.md5()
or a fasterhashlib.sha256()
, to get a HASH object. The functions don’t produce totally unique file names, but the likelihood of getting two identical file names (called a hash collision) is so low that we can ignore it for all practical purposes. Finally, obtain a hexadecimal digest of the object. The digest is simply a 64-character ASCII string: a perfect file name that has no resemblance to the original object identifier.
Get hands-on with 1400+ tech skills courses.