Liquid
This is the code repository for the deep learning job scheduling paper titled 'Liquid: Intelligent Resource Requirement Estimation and Scheduling for Deep Learning Jobs on Distributed GPU Clusters'.
The project is based on Docker.
Prerequisites
- OS: CentOS Linux release 7.6.1810
- NVIDIA driver 410.129
- CUDA 10.0
- Docker 19.03
- nvidia-docker 2.2.2
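The versions listed above can be checked before going further. The following is a minimal sketch and not part of the repository: the report helper is a hypothetical name, and the probes assume the standard nvidia-smi, nvcc and docker command-line tools are on the PATH. nvidia-docker is not probed here.

```shell
# report NAME EXPECTED ACTUAL prints one status line per component.
# (report is a hypothetical helper, not a script shipped with Liquid.)
report() {
  if [ -z "$3" ]; then
    echo "$1: not found (expected $2)"
    return
  fi
  case "$3" in
    "$2"*) echo "$1: $3 (ok)" ;;          # detected version starts with the expected one
    *)     echo "$1: $3 (expected $2)" ;;
  esac
}

# Each probe yields an empty string when the tool or file is missing.
os_release=$(cat /etc/centos-release 2>/dev/null || true)
driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
cuda=$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9.]*\),.*/\1/p')
docker_ver=$(docker version --format '{{.Server.Version}}' 2>/dev/null | head -n 1)

report "OS" "CentOS Linux release 7.6.1810" "$os_release"
report "NVIDIA driver" "410.129" "$driver"
report "CUDA" "10.0" "$cuda"
report "Docker" "19.03" "$docker_ver"
```

A component reported as "not found" means only that the probe command was unavailable. The component may still be installed somewhere outside the PATH.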
Steps to bring up the Liquid components
Initialize a Docker swarm cluster
# on master node
docker swarm init
# on every other node, join the cluster
docker swarm join --token A-LONG-TOKEN-STRING-HERE 192.168.0.1:2377
# to remove a node from the cluster later (--force is required on a manager node)
docker swarm leave
docker swarm leave --force
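The token in the join command is generated by the master node. As a small sketch, the hypothetical helper below (swarm_summary is not part of Liquid) wraps two standard Docker commands: one prints the full join command for workers, the other lists the nodes that have joined so far.

```shell
# Run on the master node after docker swarm init.
# (swarm_summary is a hypothetical helper name.)
swarm_summary() {
  # Prints the complete "docker swarm join --token ..." command for worker nodes.
  docker swarm join-token worker || return 1
  # Lists every node in the swarm with its status and manager role.
  docker node ls
}
```

Calling swarm_summary on the master after each worker joins is a quick way to confirm that the node shows up as Ready.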
Create an overlay network named yao-net
docker network create --driver overlay --attachable yao-net
# docker network create --driver overlay --attachable --opt encrypted yao-net
Note: if you create the network with the encrypted option and containers cannot communicate across nodes, try removing --opt encrypted.
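Before starting any service, it can help to confirm that the network was created with the expected settings. This is a hedged sketch: check_overlay is a hypothetical helper, and it relies only on docker network inspect with a Go template over the Driver, Scope and Attachable fields.

```shell
# check_overlay NAME succeeds only if NAME is an attachable, swarm-scoped overlay network.
# (check_overlay is a hypothetical helper name.)
check_overlay() {
  info=$(docker network inspect --format '{{.Driver}} {{.Scope}} {{.Attachable}}' "$1") || return 1
  if [ "$info" = "overlay swarm true" ]; then
    echo "$1: attachable overlay network"
  else
    echo "$1: unexpected settings ($info)"
    return 1
  fi
}
```

Run check_overlay yao-net on the master node. A non-zero exit status means the network is missing or was created with different options.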
Start HDFS cluster (Optional)
Liquid-docs/sbin/run_hdfs.sh
Start GlusterFS cluster (Optional)
Liquid-docs/sbin/run_glusterfs.sh
Start the agents on each Liquid-Worker node
Liquid-docs/sbin/run_agent_helper.sh
Liquid-docs/sbin/run_agent.sh
Start the agent-master on Liquid-Master
Liquid-docs/sbin/start_agent_master.sh
Start MySQL
Liquid-docs/sbin/start_mysql.sh
Start Liquid-optimizer on Master Node
Liquid-docs/sbin/run_optimizer.sh
Start Liquid-scheduler
Liquid-docs/sbin/start_scheduler.sh
Start Redis
Liquid-docs/sbin/start_redis.sh
Start the web portal
Liquid-docs/sbin/start_portal.sh
Start Gitea
Liquid-docs/sbin/start_gitea.sh
Visit http://YOUR_IP/install.php
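The master-side steps above can be chained in a small wrapper. The sketch below is an assumption-laden convenience, not a script from the repository: start_master and DRY_RUN are hypothetical names, the order simply mirrors the order of the steps in this README, and it assumes each script returns once its service has been launched. The optional HDFS and GlusterFS scripts and the worker-side agent scripts are left out.

```shell
# Master-side scripts, in the order the steps above list them.
MASTER_STEPS="start_agent_master.sh start_mysql.sh run_optimizer.sh start_scheduler.sh start_redis.sh start_portal.sh start_gitea.sh"

# start_master DIR runs each script found in DIR (Liquid-docs/sbin in this README).
# Set DRY_RUN=1 to print the commands instead of running them.
start_master() {
  for step in $MASTER_STEPS; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "would run: $1/$step"
    else
      echo "running: $1/$step"
      # Stop at the first failing script so later services do not start half-configured.
      "$1/$step" || { echo "failed: $step" >&2; return 1; }
    fi
  done
}
```

Setting DRY_RUN=1 and calling start_master Liquid-docs/sbin prints the seven commands without executing anything, which is a safe way to review the order first.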