Evaluation of Tectonic
Understand how our design decisions fulfill the requirements of Tectonic.
Let’s examine how well the Tectonic system meets its non-functional requirements, as discussed in the first lesson.
Scalability
We provide storage and IOPS scalability. Each storage device offers a bounded number of IOPS, and each storage node houses tens of such devices, so a single device or node has a fixed maximum of available IOPS and capacity. Because the system scales horizontally, we can add more storage devices to a node or add more storage nodes to the cluster, which lets us keep up with growing storage and IOPS needs.
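To make this scaling argument concrete, here is a back-of-the-envelope sketch of how aggregate IOPS and capacity grow linearly with the number of nodes and devices. The node counts, devices per node, and per-device figures are illustrative assumptions, not Tectonic's actual hardware numbers.

```python
# Back-of-the-envelope scaling: cluster IOPS and capacity grow linearly
# with the number of storage nodes and devices. All figures here are
# illustrative assumptions, not Tectonic's real hardware numbers.

def cluster_totals(nodes: int, devices_per_node: int,
                   iops_per_device: int, tb_per_device: int) -> tuple[int, int]:
    """Return (total IOPS, total capacity in TB) for the whole cluster."""
    devices = nodes * devices_per_node
    return devices * iops_per_device, devices * tb_per_device

# Scaling out: adding nodes (or devices per node) grows both totals.
for nodes in (100, 1_000, 10_000):
    iops, capacity_tb = cluster_totals(nodes, devices_per_node=36,
                                       iops_per_device=150, tb_per_device=16)
    print(f"{nodes:>6} nodes -> {iops:>12,} IOPS, {capacity_tb / 1000:>8,.0f} PB")
```

At the largest (assumed) node count above, the cluster crosses into the multi-exabyte range, which is the scale Tectonic targets.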
Before Tectonic, large-scale storage systems could not scale to exabytes within a single cluster. They could store multiple petabytes of data per cluster, which was enough in the early stages, but over time that capacity was no longer sufficient to meet growing requirements.
Performance isolation
We use several measures to ensure that each tenant (and its applications) gets the resources it requests without impacting others. Each tenant's storage and IOPS needs are initially captured as quotas. A tenant can go beyond its quota if spare IOPS are available, while request throttling and fair queuing keep each tenant within its allocated resources.
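A minimal sketch of this idea follows: admit a request if it fits within the tenant's quota, allow it opportunistically if spare IOPS exist, and throttle it otherwise. This is illustrative only, not Tectonic's actual admission-control logic; the tenant names, quotas, and request sizes are assumptions, and fair queuing and reclaiming borrowed IOPS are omitted.

```python
# Simplified sketch of quota-based admission with opportunistic use of spare
# IOPS. Illustrative only: the tenants, quotas, and request sizes are made up,
# and fair queuing / reclaiming borrowed IOPS are left out.
from dataclasses import dataclass

@dataclass
class Tenant:
    name: str
    iops_quota: int       # provisioned (guaranteed) IOPS for this tenant
    iops_in_use: int = 0  # IOPS the tenant is currently consuming

class Cluster:
    def __init__(self, total_iops: int, tenants: list[Tenant]):
        self.total_iops = total_iops
        self.tenants = {t.name: t for t in tenants}

    def spare_iops(self) -> int:
        used = sum(t.iops_in_use for t in self.tenants.values())
        return self.total_iops - used

    def admit(self, tenant_name: str, iops: int) -> bool:
        """Admit within quota, or opportunistically if spare IOPS exist;
        otherwise throttle (the caller would queue or back off)."""
        tenant = self.tenants[tenant_name]
        within_quota = tenant.iops_in_use + iops <= tenant.iops_quota
        if within_quota or iops <= self.spare_iops():
            tenant.iops_in_use += iops
            return True
        return False

cluster = Cluster(total_iops=1000,
                  tenants=[Tenant("warehouse", 600), Tenant("blob", 400)])
print(cluster.admit("warehouse", 500))  # True: within its 600-IOPS quota
print(cluster.admit("warehouse", 200))  # True: over quota, but spare IOPS exist
print(cluster.admit("warehouse", 400))  # False: over quota, not enough spare -> throttled
```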
Because of the exclusive storage allocation, tenants are physically isolated from one another in terms of storage. We prioritize storage traffic using TrafficGroups and TrafficClasses, which helps us provide low-latency service to the applications that need it.
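The sketch below shows one way such prioritization can work: each request carries a TrafficClass, and a storage node serves the most latency-sensitive class first. The class names, their ordering, and the queue itself are illustrative assumptions for this example rather than Tectonic's exact scheduler.

```python
# Illustrative sketch of prioritizing storage traffic by TrafficClass: more
# latency-sensitive requests are served before background ones. The class
# names and the queue are assumptions for this example.
import heapq
from enum import IntEnum
from itertools import count

class TrafficClass(IntEnum):
    GOLD = 0    # latency-sensitive traffic, served first
    SILVER = 1  # normal traffic
    BRONZE = 2  # background traffic (e.g., repairs), served last

class StorageNodeQueue:
    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker keeps FIFO order within a class

    def submit(self, traffic_class: TrafficClass, request: str) -> None:
        heapq.heappush(self._heap, (traffic_class, next(self._seq), request))

    def next_request(self) -> str:
        _, _, request = heapq.heappop(self._heap)
        return request

queue = StorageNodeQueue()
queue.submit(TrafficClass.BRONZE, "background rebalance read")
queue.submit(TrafficClass.GOLD, "blob-serving read")
queue.submit(TrafficClass.SILVER, "warehouse scan read")
print(queue.next_request())  # the latency-sensitive blob-serving read goes first
```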
Availability
Primarily, our system consists of metadata and data components. Metadata is managed by ZippyDB, a separate, highly available service. We have ensured availability in the following three components:
- In the Metadata Store, we use snapshot reads so that readers can still access data while a write operation is in progress, which increases availability. In addition, for listing-heavy workloads such as the data warehouse, we hash-partition the metadata to distribute the query workload across shards and layers and avoid hotspots in the Metadata Store (see the sketch after this list).
- In the Chunk Store, the data is spread over many storage nodes and is either encoded with error-correcting codes or fully replicated. Each of these schemes helps to recover lost or corrupted data.
- At the cluster level, we can share spare ephemeral resources (IOPS) with other tenants using different TrafficGroups and TrafficClasses. Because failures are common in a large system, our system can throttle requests or gracefully degrade service, if necessary, based on TrafficClasses. By doing so, we can manage the availability of each cluster.
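As referenced in the Metadata Store point above, here is a minimal sketch of hash-partitioning metadata keys across shards so that load spreads evenly instead of concentrating on one shard. The layer name, the shard count, and the directory paths are illustrative assumptions, not Tectonic's real layout.

```python
# Minimal sketch of hash-partitioning metadata across shards so that no single
# shard becomes a hotspot. The layer name, shard count, and directory paths
# are illustrative assumptions.
import hashlib

NUM_SHARDS = 8  # illustrative; a real deployment would use far more shards

def shard_for(layer: str, key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a metadata key (e.g., a directory ID in a naming layer) to a shard."""
    digest = hashlib.sha256(f"{layer}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Entries keyed by the same directory land on the same shard, so a directory
# listing touches one shard, while different directories spread evenly across
# all shards instead of piling onto one.
for directory in ("warehouse/tables/clicks", "warehouse/tables/views",
                  "blob/photos/2024", "blob/videos/2024"):
    print(f"{directory:26s} -> shard {shard_for('name', directory)}")
```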
Durability
Data durability is critical for a file system. Once data is accepted by our system, it should persist until a user explicitly deletes it. We have ensured durability in the following three components:
- In the background services, our system uses techniques such as repairing lost or damaged data, which increases durability.
- In the Chunk Store, we provide per-block durability by applying replication or RS-encoding to blocks (the sketch after this list compares the two).
- In the Metadata Store, the data is synchronously replicated between storage nodes within a shard. The write operations are also logged once the operation is ...
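As noted in the Chunk Store point above, here is a rough comparison of the two per-block durability schemes: full replication versus Reed-Solomon (RS) encoding. The parameters (3 replicas; 10 data plus 4 parity chunks) are illustrative choices, not necessarily Tectonic's production settings.

```python
# Rough comparison of the two per-block durability schemes in the Chunk Store:
# full replication versus Reed-Solomon (RS) encoding. The parameters below
# (3 replicas; 10 data + 4 parity chunks) are illustrative.

def replication(replicas: int) -> tuple[float, int]:
    """Return (storage overhead factor, simultaneous failures tolerated)."""
    return float(replicas), replicas - 1

def rs_encoding(data_chunks: int, parity_chunks: int) -> tuple[float, int]:
    """Return (storage overhead factor, simultaneous chunk failures tolerated)."""
    overhead = (data_chunks + parity_chunks) / data_chunks
    return overhead, parity_chunks

schemes = {
    "3-way replication": replication(3),
    "RS(10 data, 4 parity)": rs_encoding(10, 4),
}
for label, (overhead, tolerated) in schemes.items():
    print(f"{label:22s} overhead {overhead:.2f}x, tolerates {tolerated} failures")
```

RS-encoding gives a much lower storage overhead while tolerating more simultaneous failures, at the cost of extra reads during reconstruction, which is generally why replication suits small or hot blocks and RS-encoding suits large, colder ones.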