Data replication is the process of continuously copying data from its primary location (the original database) to one or more separate secondary locations. Keeping these copies helps the system tolerate errors or faults that make one of the servers/nodes temporarily unavailable.
In this Answer, we'll go over some advantages and disadvantages of data replication and explore the different replication schemes in detail.
For a distributed database to give all users access to the same information, every replica must be kept up to date and fully synchronized with the original database. To do this, there are two main data replication schemes:
Full replication
Partial replication
Now, we'll explore each of these individually and examine the conditions in which each is best suited.
In full replication, the entire original database is cloned at every replica. This makes the data highly available, since every replica holds a complete and constantly updated copy, and it reduces query execution time, since the data can be fetched from the closest replica.
However, updates are quite slow under this scheme, since every replica holds the entire database and each change has to be applied at every single replica's location.
Full replication is useful when users at different locations need to see the same view of the data. For example, users trading or buying stocks must see the same details about the stocks regardless of their whereabouts. A user in one region should never be unable to view details of a stock that users in other regions can see, nor should the details differ from one region to another. For this purpose, the stock exchange must maintain a full replica of its database in every region it serves.
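The trade-off described above can be sketched in a few lines of Python. This is a toy in-memory model rather than a real database: the `FullReplicaSet` class, the region names, and the stock symbol are all hypothetical and serve only as an illustration. It shows that a write has to touch every replica, while a read can be served by whichever replica is nearest.

```python
class FullReplicaSet:
    """Toy in-memory model: every replica holds the whole dataset."""

    def __init__(self, replica_names):
        # One complete key-value store per replica (names are hypothetical).
        self.replicas = {name: {} for name in replica_names}

    def write(self, key, value):
        # A write must reach every replica, so its cost grows
        # with the number of replicas.
        for store in self.replicas.values():
            store[key] = value
        return len(self.replicas)  # number of copies updated

    def read(self, key, nearest):
        # Any replica can answer, so the nearest one is used.
        return self.replicas[nearest][key]


stocks = FullReplicaSet(["us-east", "eu-west", "ap-south"])
copies_updated = stocks.write("ACME", 101.5)

# Every region sees the same price for the same stock.
prices = [stocks.read("ACME", nearest=region) for region in stocks.replicas]
```

Here `copies_updated` equals the number of replicas, which is the update cost that full replication pays in exchange for identical reads everywhere.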
In partial replication, each replica stores a copy of a selected portion of the data in the original database. Thus, depending on the replica, some fragments of the original database are cloned while others are not. The type and significance of the data dictate how many replicas are made of it. Updating each replica isn't as slow under this scheme, since each one only receives a portion of the entire database.
However, if some data isn't found in the local replica, then it needs to be fetched from the original database (or from another nearby replica that contains the requested data), and this can increase the time taken for queries to be executed.
Partial replication is useful when users only need an isolated view of the data that depends on their location. For example, an international car company might have offices in multiple countries or continents. The office in one country doesn't necessarily need the data on the company's cars sold in another country, since separate makes and models of cars are sold in each country. This way, the system in one location isn't slowed down by handling an unnecessary data load.
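A minimal sketch of partial replication, under the same toy assumptions, might look like the following. The `PartialReplicaSet` class, the `placement` mapping, and the country and model names are hypothetical. Each replica stores only the keys assigned to it, a write updates only the replicas that hold the key, and a read that misses locally falls back to the original database.

```python
class PartialReplicaSet:
    """Toy in-memory model: each replica holds only a fragment of the data."""

    def __init__(self, primary, placement):
        # primary: the full dataset.
        # placement: replica name -> set of keys that replica stores.
        self.primary = dict(primary)
        self.replicas = {
            name: {key: self.primary[key] for key in keys}
            for name, keys in placement.items()
        }

    def write(self, key, value):
        # Only the replicas that hold this key need to be updated.
        self.primary[key] = value
        updated = 0
        for store in self.replicas.values():
            if key in store:
                store[key] = value
                updated += 1
        return updated

    def read(self, key, local):
        # Serve locally when possible; otherwise fall back to the
        # original database, which costs an extra (slower) round trip.
        store = self.replicas[local]
        if key in store:
            return store[key], "local"
        return self.primary[key], "primary"


cars = PartialReplicaSet(
    primary={"model-a": "sedan", "model-b": "hatchback"},
    placement={"germany": {"model-a"}, "japan": {"model-b"}},
)
local_hit = cars.read("model-a", local="germany")
remote_hit = cars.read("model-b", local="germany")
replicas_updated = cars.write("model-a", "estate")
```

The second element of each read result records where the value came from, which makes the slower fallback path visible.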
Replicating data has multiple advantages. These are explained below:
All the nodes hold a consistent copy of the data, thus improving data availability.
When the replicas are merged (to provide a consistent view), incomplete and redundant data is removed.
Query execution is faster since a user doesn't necessarily have to fetch data from a primary server that may be far away.
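The speed advantage comes from routing each query to the closest copy. As a rough sketch, assuming the client already knows a round-trip latency for each node (the node names and millisecond values below are made up for illustration, not measurements), picking the nearest node could look like this:

```python
def pick_nearest(latency_ms):
    """Return the name of the node with the lowest known latency."""
    return min(latency_ms, key=latency_ms.get)


# Hypothetical round-trip times from one client to each node.
latency_ms = {"primary": 180, "replica-eu": 25, "replica-us": 95}

nearest = pick_nearest(latency_ms)
saved_ms = latency_ms["primary"] - latency_ms[nearest]
```

With these illustrative numbers, the client reads from `replica-eu` instead of the distant primary.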
Despite its positives, replicating data has its drawbacks too. These are explained below:
Complex algorithms are required to ensure that the data remains consistent.
Because data is now stored at multiple locations, a greater number of nodes/servers need to be maintained, which increases storage and maintenance costs.