...

Design a Distributed Task Scheduler

Learn to design a distributed task scheduler.

What is a task scheduler?

A task scheduler is a critical component of a system for getting work done efficiently. For example, uploading a photo or a video on Facebook or Instagram consists of the following background tasks:

  1. Encode the photo or video in multiple resolutions.
  2. Validate the photo or video to check for content monetization (a way of leveraging content so that a service can profit from it as users consume it), copyright issues, and more.

The photo or video becomes visible only after the tasks above execute successfully. A task scheduler allows us to complete a large number of such tasks using limited resources.

Distributed task scheduling

The process of deciding and assigning resources to the tasks in a timely manner is called task scheduling. The visual difference between an OS-level task scheduler and a datacenter-level task scheduler is shown in the following illustration:

The OS task scheduler schedules a node’s local tasks or processes on that node’s computational resources. In contrast, the datacenter’s task scheduler schedules billions of tasks coming from multiple tenants that use the datacenter’s resources.

Our goal is to design a task scheduler similar to the datacenter-level task scheduler where the following is considered:

  • Tasks will come from many different sources, tenants, and subsystems.
  • Many resources will be dispersed in a datacenter (or maybe across many datacenters).

The two requirements above make the task scheduling problem challenging. We’ll design a distributed task scheduler that can handle all these tasks by making it scalable, reliable, and fault tolerant.

Requirements

Let’s start by going over the functional and non-functional requirements for designing a task scheduler.

Functional requirements

The functional requirements of the distributed task scheduler are as follows:

  • Submit tasks: The system should allow the users to submit their tasks for execution.
  • Allocate resources: The system should be able to allocate the required resources to each task.
  • Remove tasks: The system should allow the users to cancel the submitted tasks.
  • Monitor task execution: The system should adequately monitor task execution and reschedule a task if it fails to execute.
  • Efficient resource utilization: The resources (CPU and memory) must be used efficiently in terms of time, cost, and fairness. Efficiency means that we do not waste resources.
  • Release resources: After successfully executing a task, the system should take back the resources assigned to the task.
  • Show task status: The system should show the users the current status of the task.
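
To make the functional requirements above concrete, here is a minimal single-process sketch in Python. It is not the design developed in this lesson: the names `Scheduler`, `Task`, `Status`, `run_pending`, and `max_attempts` are hypothetical, resources are reduced to a single CPU count, and tasks run inline rather than on remote workers. It only illustrates how submitting, cancelling, allocating, monitoring, rescheduling, releasing, and status reporting fit together.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


@dataclass
class Task:
    task_id: str
    run: Callable[[], None]  # the work to execute
    cpus: int                # resources the task asks for (assumption: CPUs only)
    max_attempts: int = 3    # assumption: give up after this many tries
    attempts: int = 0
    status: Status = Status.PENDING


class Scheduler:
    def __init__(self, total_cpus: int) -> None:
        self.free_cpus = total_cpus
        self.tasks: Dict[str, Task] = {}

    def submit(self, task: Task) -> None:
        """Submit tasks: accept a task for later execution."""
        self.tasks[task.task_id] = task

    def cancel(self, task_id: str) -> bool:
        """Remove tasks: cancel a task that has not started yet."""
        task = self.tasks[task_id]
        if task.status is Status.PENDING:
            task.status = Status.CANCELLED
            return True
        return False

    def status(self, task_id: str) -> Status:
        """Show task status."""
        return self.tasks[task_id].status

    def run_pending(self) -> None:
        """One scheduling pass over all pending tasks that fit."""
        for task in self.tasks.values():
            if task.status is not Status.PENDING or task.cpus > self.free_cpus:
                continue
            self.free_cpus -= task.cpus          # allocate resources
            task.status = Status.RUNNING
            task.attempts += 1
            try:
                task.run()                       # monitor: an exception means failure
                task.status = Status.SUCCEEDED
            except Exception:
                # Reschedule on failure until the attempt budget is used up.
                retry = task.attempts < task.max_attempts
                task.status = Status.PENDING if retry else Status.FAILED
            finally:
                self.free_cpus += task.cpus      # release resources
```

In this sketch the resources are released after a failed attempt as well as after a successful one, so a rescheduled task does not hold capacity while it waits. A task that asks for more CPUs than are free simply stays pending until a later pass.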

Non-functional requirements

The non-functional requirements of the distributed task scheduler are as follows:

  • Availability: The system should be highly available to schedule and execute tasks.
  • Durability: The tasks received by the system should be durable and should not be lost.
  • Scalability: The system should be able to schedule and execute an ever-increasing number of tasks per day.
  • Fault tolerance: The system must be fault tolerant by
...