Getting the Visits into Unique Counts

Compute a single HyperLogLog value per message per day using hll.

Context and constraints

In a production setup, we would have the following context and constraints:

  • Tweets are published, and users from the internet are visiting tweets.

  • Our application inserts a new row in tweet.visitor with the visitor’s IP address each time there is a new visit to one of our Tweets. It also registers the precise timestamp of the visit.

  • As we anticipate quite some success on our little application idea, we also anticipate not being able to keep all the visitor logs and not being able to respect our quality-of-service terms when computing the unique visitors on the fly each time someone needs them.

  • Finally, as numbers are being used in a marketing context rather than in an invoicing context, we’re in a position to lose some precision over the number, and we would actually like to implement a system that is lossy if it allows us to relax our storage and processing requirements.

Computing a single hyperloglog

From the tweet.visitor table, we’re now going to compute a single HyperLogLog value per message and per day:

Get hands-on with 1400+ tech skills courses.