Cluster Management System
Distributed Systems Project

Abstract

The demand for computation has grown rapidly over the years. The deep learning era began only about 10 years ago, and we now have models with more than 150 billion parameters. Every year, supercomputers with greater computational power are developed. Yet Moore's Law is widely regarded as having slowed, which means that improvements in individual CPUs no longer keep pace. So how exactly is the increasing demand for computation met? The answer is to combine several CPUs and distribute the workload among them. A cluster is a network of computers that can be viewed as a single, very powerful computer. Each computer (called a node) is itself, ideally, a high-performance machine with a powerful CPU. Resources on these nodes are allocated to the jobs that users submit. Clusters have become an integral part of any high-performance computing facility, and several modern supercomputers are in fact huge clusters of compute nodes. A cluster management service, or cluster middleware, is the system that manages this network of computers. It performs several necessary functions, such as job scheduling, fault tolerance, and load balancing. We discuss several possible implementation paradigms for these services and provide a simple prototype of a cluster management system that performs most of these functions.

Report & Code

[Report] [Design of the cluster middleware] [Code]