How to partition data

The elephant in the room we have ignored so far is how we actually go about partitioning data. To make the example concrete, let’s say we are trying to partition a collection of millions of song files across a database with five nodes. Some of the ways we can partition are:

Partition Randomly

Randomly assign each song to a node while storing the same number of songs on each node (assuming, for simplicity, that the total number of songs is exactly divisible by 5). The astute reader can immediately see the problem with this approach. We can store the file on a node, but come retrieval time we have no way of knowing which node the file was stored on, short of asking every node. Partitioning is a two-way street: we should be able to locate the node for both storage and retrieval of a particular record deterministically and relatively quickly. Deterministically means the destination node for a record always works out to be the same. Quickly is subjective, but in general we can’t afford to run, say, a bunch of complex, time-consuming mathematical computations to determine the destination node for a record.
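To make the retrieval problem concrete, here is a minimal Python sketch of random partitioning. The names `partition_randomly` and `find_song`, the in-memory sets standing in for nodes, and the sample song names are all assumptions made for illustration, not part of any real database API.

```python
import random

NUM_NODES = 5  # the five-node database from the running example


def partition_randomly(songs, num_nodes=NUM_NODES, rng=None):
    """Shuffle the songs and deal them out so every node gets an equal share."""
    if rng is None:
        rng = random.Random()
    shuffled = list(songs)
    rng.shuffle(shuffled)
    nodes = [set() for _ in range(num_nodes)]
    for position, song in enumerate(shuffled):
        nodes[position % num_nodes].add(song)
    return nodes


def find_song(nodes, song):
    """No rule maps a song to its node, so probe the nodes one by one.

    Returns (node_id, nodes_probed); node_id is None if the song is absent.
    """
    for node_id, stored in enumerate(nodes):
        if song in stored:
            return node_id, node_id + 1
    return None, len(nodes)


# Hypothetical sample data: 20 songs, exactly divisible by 5.
songs = [f"song-{i}.mp3" for i in range(20)]
nodes = partition_randomly(songs)
node_id, probed = find_song(nodes, "song-7.mp3")
print(f"found on node {node_id} after probing {probed} node(s)")
```

Storage is balanced, but nothing about the song itself tells us where it went, so `find_song` has to fall back to probing nodes until it gets a hit. Running the placement again would, in general, put the same song on a different node, which is exactly the lack of determinism described above.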
