Huawei HIRP: Heterogeneous Big Data Platform
Scheduling many jobs of diverse characteristics on a data center of heterogeneous machines is both important and challenging. This project aims at practical scheduling that is scalable, efficient, fair, and low-cost.
A cloud-hosted application is expected to support millions of end users with terabytes of data. To accommodate such large-scale workloads, it is common to deploy thousands of servers in one data center. Meanwhile, existing big data platforms (e.g., Hadoop or Spark) employ naive scheduling algorithms that consider neither the heterogeneity of resources nor the differences among jobs. This motivates a more advanced scheduling scheme for big data environments.
Challenges: Heterogeneity in Data Analytics Systems
To improve the performance and cost-effectiveness of a data analytics cluster in the cloud, the system should account for heterogeneity in both the environment and the workloads. Types of heterogeneity:
- Memory size
- Processor architectures, e.g., different brands and speed of CPU
- Graphics Processing Unit (GPU), e.g., nodes with or without GPU
- Machine types, e.g., physical or virtual
- Disk Speed, e.g., SSD or HDD
- Network resources, e.g., bandwidth
Limitations of existing platforms:
- Configuration covers only the number of CPUs and memory size, with little consideration of other resources such as GPUs.
- Naive scheduling that considers neither the heterogeneity of resources nor the differences among jobs.
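The heterogeneity dimensions listed above can be captured in a per-node resource profile that a scheduler can filter and rank on. A minimal sketch follows; the field names (`NodeProfile`, `has_gpu`, etc.) are illustrative assumptions, not the project's actual data model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeProfile:
    """One cluster node's resource profile (illustrative fields only)."""
    cpu_cores: int
    cpu_ghz: float        # processor speed (brands differ too)
    memory_gb: int
    has_gpu: bool
    is_virtual: bool      # physical vs. virtual machine
    disk: str             # "SSD" or "HDD"
    net_gbps: float       # network bandwidth

# A small heterogeneous cluster:
nodes = [
    NodeProfile(32, 2.4, 128, True,  False, "SSD", 10.0),
    NodeProfile(16, 3.2, 64,  False, True,  "HDD", 1.0),
]

# Filter nodes that can host a GPU job needing 128 GB of memory:
eligible = [n for n in nodes if n.has_gpu and n.memory_gb >= 128]
```

A default scheduler that tracks only CPU count and memory would treat both nodes as interchangeable for a GPU job; the profile makes the difference visible.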
A toolkit to find the effective YARN parameters for a single job before its execution. The toolkit includes scripts that automatically find the effective parameters and reconfigure the YARN framework.
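One simple way such a toolkit could work is a search over candidate parameter settings, benchmarking the job under each and keeping the best. The sketch below assumes a grid search and a stub cost model in place of a real job run; the parameter names follow yarn-site.xml/mapred-site.xml conventions, but the search space and cost model are invented for illustration.

```python
import itertools

# Hypothetical search space for a few YARN/MapReduce memory parameters.
SEARCH_SPACE = {
    "yarn.nodemanager.resource.memory-mb": [4096, 8192, 16384],
    "mapreduce.map.memory.mb": [1024, 2048],
    "mapreduce.reduce.memory.mb": [2048, 4096],
}

def run_benchmark(config):
    """Stand-in for running the job once under `config` and timing it.
    A real toolkit would rewrite the YARN config files, restart the
    framework, and submit the job; a toy cost model keeps this sketch
    self-contained (here, more container memory always helps)."""
    return sum(1.0 / v for v in config.values())

def find_effective_parameters(space):
    """Exhaustively try every combination and return the fastest one."""
    best_cfg, best_time = None, float("inf")
    for values in itertools.product(*space.values()):
        cfg = dict(zip(space.keys(), values))
        t = run_benchmark(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg

best = find_effective_parameters(SEARCH_SPACE)
```

In practice the search would be pruned (e.g., varying one parameter at a time), since each trial costs a full job execution.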
A prototype system with a new scheduler that is aware of the different resources and jobs is implemented to replace the default YARN scheduler, so that resources are scheduled and assigned more effectively.
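The core of a resource- and job-aware scheduler is a placement decision: score every node against a job's requirements and pick the best fit. A minimal sketch under assumed names (`score`, `assign`, a dict-based node/job model) is shown below; it is not the prototype's actual algorithm.

```python
def score(node, job):
    """Score how well `node` fits `job`; -inf means it cannot run it."""
    if job.get("needs_gpu") and not node.get("has_gpu"):
        return float("-inf")          # hard constraint: GPU required
    if job["mem_gb"] > node["free_mem_gb"]:
        return float("-inf")          # hard constraint: enough free memory
    # Soft preferences: best-fit on memory (waste as little as possible),
    # with a small bonus for SSD nodes when the job is I/O-heavy.
    waste = node["free_mem_gb"] - job["mem_gb"]
    io_bonus = 1 if (job.get("io_heavy") and node["disk"] == "SSD") else 0
    return -waste + io_bonus

def assign(job, nodes):
    """Return the best-scoring node for `job`, or None if none fits."""
    best = max(nodes, key=lambda n: score(n, job))
    return best if score(best, job) > float("-inf") else None

nodes = [
    {"name": "n1", "free_mem_gb": 64, "has_gpu": True,  "disk": "HDD"},
    {"name": "n2", "free_mem_gb": 16, "has_gpu": False, "disk": "SSD"},
]
chosen = assign({"mem_gb": 8, "io_heavy": True}, nodes)  # prefers the SSD node
```

A heterogeneity-blind scheduler could place the I/O-heavy job on the HDD node or a GPU job on a GPU-less one; scoring against full node profiles avoids both.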