yao

module
v0.0.0-...-2e70203
Published: Nov 18, 2021 License: Apache-2.0

Liquid

This is the code repository for the deep learning job scheduling paper titled 'Liquid: Intelligent Resource Requirement Estimation and Scheduling for Deep Learning Jobs on Distributed GPU Clusters'.

The project is based on Docker.

Prerequisites

  • OS: CentOS Linux release 7.6.1810
  • NVIDIA driver 410.129
  • CUDA 10.0
  • Docker 19.03
  • nvidia-docker 2.2.2
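
Before bringing anything up, it can help to compare what is installed on a node with the list above. The following is a minimal sketch, not part of the Liquid repository: the function names are hypothetical, and treating the listed versions as minimums is an assumption (the README only states the versions that were used). It relies on `sort -V` from GNU coreutils.

```shell
# Sketch: compare installed versions with the prerequisite list.
# Assumption: the listed versions are treated as minimum versions.

# version_ge A B: succeeds when version A is greater than or equal to version B.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_min NAME ACTUAL REQUIRED: print one status line for a component.
check_min() {
    if version_ge "$2" "$3"; then
        echo "ok: $1 $2 (>= $3)"
    else
        echo "too old: $1 $2 (< $3)"
        return 1
    fi
}

# check_prerequisites: query the local tools and compare them with the list.
check_prerequisites() {
    status=0
    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    docker_version=$(docker version --format '{{.Server.Version}}')
    check_min "NVIDIA driver" "$driver" "410.129" || status=1
    check_min "Docker" "$docker_version" "19.03" || status=1
    return "$status"
}
```

Running `check_prerequisites` on each node prints one line per component and returns a non-zero status if any component is older than the listed version.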

Steps to bring up the Liquid components

Initialize a Docker Swarm cluster

# on master node
docker swarm init

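# on every other node: join the cluster, using the token printed by `docker swarm init`
docker swarm join --token A-LONG-TOKEN-STRING-HERE 192.168.0.1:2377

# to remove a node from the cluster later (a manager node needs --force)
docker swarm leave
docker swarm leave --force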
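
The swarm setup above can be checked with a few standard Docker commands. This is a small sketch with hypothetical helper names; it only wraps `docker info`, `docker swarm join-token`, and `docker node ls`, and assumes it is run on the master (manager) node.

```shell
# swarm_state: print this node's swarm state, e.g. "active" or "inactive".
swarm_state() {
    docker info --format '{{.Swarm.LocalNodeState}}'
}

# print_worker_join: on a manager, print the full `docker swarm join` command
# (including the token) that the other nodes should run.
print_worker_join() {
    if [ "$(swarm_state)" != "active" ]; then
        echo "this node is not part of a swarm; run 'docker swarm init' first" >&2
        return 1
    fi
    docker swarm join-token worker
}

# list_nodes: on a manager, list every node that has joined the cluster.
list_nodes() {
    docker node ls
}
```

`print_worker_join` is useful when the token from the original `docker swarm init` output is no longer at hand, and `list_nodes` confirms that every worker has joined.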

Create an overlay network named yao-net

docker network create --driver overlay --attachable yao-net

# Alternative: encrypt the overlay traffic between nodes
# docker network create --driver overlay --attachable --opt encrypted yao-net

Note: if containers on different nodes cannot communicate, create the network without the encrypted option.
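
Since the network is created once and then reused by every component, an idempotent wrapper can be convenient. The sketch below is an illustration with hypothetical function names; it uses only `docker network inspect` and the same `docker network create` flags shown above.

```shell
# ensure_overlay_network NAME: create an attachable overlay network
# unless a network with that name already exists.
ensure_overlay_network() {
    name=$1
    if docker network inspect "$name" >/dev/null 2>&1; then
        echo "network $name already exists"
    else
        docker network create --driver overlay --attachable "$name" >/dev/null &&
            echo "network $name created"
    fi
}

# network_driver NAME: print the driver of an existing network, e.g. "overlay".
network_driver() {
    docker network inspect --format '{{.Driver}}' "$1"
}
```

For example, `ensure_overlay_network yao-net` followed by `network_driver yao-net` should report `overlay` on a manager node.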

Start HDFS cluster (Optional)

Liquid-docs/sbin/run_hdfs.sh

Start GlusterFS cluster (Optional)

Liquid-docs/sbin/run_glusterfs.sh

Start the agents on each Liquid-Worker

Liquid-docs/sbin/run_agent_helper.sh

Liquid-docs/sbin/run_agent.sh

Start the agent-master on Liquid-Master

Liquid-docs/sbin/start_agent_master.sh

Start MySQL

Liquid-docs/sbin/start_mysql.sh

Start Liquid-optimizer on Liquid-Master

Liquid-docs/sbin/run_optimizer.sh

Start Liquid-scheduler

Liquid-docs/sbin/start_scheduler.sh

Start Redis

Liquid-docs/sbin/start_redis.sh

Start the web portal

Liquid-docs/sbin/start_portal.sh

Start Gitea

Liquid-docs/sbin/start_gitea.sh

Visit http://YOUR_IP/install.php
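
The non-optional steps above can be chained in a small wrapper. This is a sketch under stated assumptions, not a script shipped with Liquid: the function names and the `DRY_RUN` variable are hypothetical, it assumes all of these scripts are run from the directory that contains Liquid-docs on the master node, and it leaves out the optional HDFS and GlusterFS steps as well as the worker agents, which are started on each Liquid-Worker.

```shell
# master_steps: print the start scripts in the order the README lists them.
master_steps() {
    cat <<'EOF'
Liquid-docs/sbin/start_agent_master.sh
Liquid-docs/sbin/start_mysql.sh
Liquid-docs/sbin/run_optimizer.sh
Liquid-docs/sbin/start_scheduler.sh
Liquid-docs/sbin/start_redis.sh
Liquid-docs/sbin/start_portal.sh
Liquid-docs/sbin/start_gitea.sh
EOF
}

# run_steps: run each script in order and stop at the first failure.
# With DRY_RUN=1 the commands are only printed, not executed.
run_steps() {
    for script in $(master_steps); do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "would run: $script"
        else
            echo "running: $script"
            "$script" || { echo "failed: $script" >&2; return 1; }
        fi
    done
}
```

Calling `DRY_RUN=1 run_steps` first shows the order without starting anything; `run_steps` alone executes the scripts.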

Directories

Path Synopsis
Liquid-scheduler
src
