RDMA CNI plugin
CNI-compliant plugin for network namespace aware RDMA interfaces.
The RDMA CNI plugin enables network namespace isolation for RDMA workloads in a containerized environment.
Overview
The RDMA CNI plugin is intended to run as a chained CNI plugin (plugin chaining was introduced in CNI specification v0.3.0).
It ensures isolation of RDMA traffic from other workloads in the system by moving the RDMA interfaces associated with the
provided network interface to the container's network namespace.
The main use case (for now) is containerized SR-IOV workloads orchestrated by Kubernetes
that perform RDMA and wish to leverage the network namespace isolation of RDMA devices introduced in Linux kernel 5.3.0.
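As a quick illustration of that kernel feature (a sketch, assuming the rdma tool from iproute2 is installed and the RDMA subsystem is already in exclusive mode), RDMA devices are not visible from a newly created network namespace:
~$ rdma system set netns exclusive
~$ ip netns add demo
~$ ip netns exec demo rdma link show    # no RDMA links listed; devices stay in the default namespace
~$ ip netns del demo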
Requirements
Hardware
SR-IOV capable NIC which supports RDMA.
Supported Hardware
Mellanox Network adapters
ConnectX®-4 and above
Operating System
Linux distribution
Kernel
Kernel based on 5.3.0 or newer, with RDMA modules loaded in the system.
The rdma-core package provides the means to automatically load the relevant modules on system start.
Note: For deployments that use the Mellanox out-of-tree driver (Mellanox OFED), Mellanox OFED version 4.7 or newer is required.
In this case, a kernel based on 5.3.0 or newer is not required.
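A quick way to check these requirements (a sketch; the exact set of loaded modules may vary between setups) is to inspect the running kernel version and the core RDMA modules:
~$ uname -r
~$ lsmod | grep -E 'ib_core|ib_uverbs|rdma_cm'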
Packages
iproute2 package based on kernel 5.3.0 or newer installed on the system.
Note: It is recommended that the required packages are installed by your system's package manager.
Note: For deployments using Mellanox OFED, the iproute2 package is bundled with the driver under /opt/mellanox/iproute2/.
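To check which iproute2 build is in use and that it supports the RDMA subsystem commands used below (a sketch; on Mellanox OFED deployments invoke the binaries under /opt/mellanox/iproute2/ instead):
~$ ip -V
~$ rdma system show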
Deployment requirements (Kubernetes)
Please refer to the relevant link on how to deploy each component.
For a Kubernetes deployment, each SR-IOV capable worker node should have:
Note: Kubernetes version 1.16 or newer is required for deploying as a daemonset.
RDMA CNI configurations
{
  "cniVersion": "0.3.1",
  "type": "rdma",
  "args": {
    "cni": {
      "debug": true
    }
  }
}
Note: "args" keyword is optional.
Deployment
System configuration
It is recommended to set the RDMA subsystem namespace awareness mode to exclusive on OS boot.
Set the RDMA subsystem namespace awareness mode to exclusive via the ib_core module parameter:
~$ echo "options ib_core netns_mode=0" >> /etc/modprobe.d/ib_core.conf
Set the RDMA subsystem namespace awareness mode to exclusive via the rdma tool:
~$ rdma system set netns exclusive
Note: When changing the RDMA subsystem netns mode, the kernel requires that no network namespaces exist in the system.
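To confirm the mode took effect (a sketch using the rdma tool from iproute2):
~$ rdma system show    # should report "netns exclusive"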
Deploy RDMA CNI
~$ kubectl apply -f ./deployment/rdma-cni-daemonset.yaml
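To verify the daemonset pods are up (a sketch, assuming the provided manifest deploys into the kube-system namespace; adjust if you modified the template):
~$ kubectl -n kube-system get daemonset
~$ kubectl -n kube-system get pods -o wide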
Deploy workload
Pod definition can be found in the example below.
~$ kubectl apply -f ./examples/my_rdma_test_pod.yaml
Pod example:
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-rdma-net
spec:
  containers:
  - name: rdma-app
    image: centos/tools
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        mellanox.com/sriov_rdma: '1'
      limits:
        mellanox.com/sriov_rdma: '1'
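Once the Pod is running, RDMA isolation can be checked from inside the container (a sketch, assuming the rdma tool is available in the container image; only the RDMA device moved into the Pod's network namespace should be listed):
~$ kubectl exec -it rdma-test-pod -- rdma link show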
SR-IOV Network Device Plugin ConfigMap example
The following YAML defines an RDMA-enabled SR-IOV resource pool named mellanox.com/sriov_rdma:
apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
      "resourceList": [
        {
          "resourcePrefix": "mellanox.com",
          "resourceName": "sriov_rdma",
          "selectors": {
            "isRdma": true,
            "vendors": ["15b3"],
            "pfNames": ["enp4s0f0"]
          }
        }
      ]
    }
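After the SR-IOV network device plugin picks up this ConfigMap, the resource should appear in the node's allocatable resources (a sketch; replace <node-name> with an SR-IOV capable worker node):
~$ kubectl get node <node-name> -o jsonpath='{.status.allocatable}'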
Network CRD example
The following YAML defines a network, sriov-rdma-net, associated with an RDMA-enabled resource, mellanox.com/sriov_rdma.
The CNI plugins that will be executed in a chain for Pods that request this network are the sriov and rdma CNIs.
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
name: sriov-rdma-net
annotations:
k8s.v1.cni.cncf.io/resourceName: mellanox.com/sriov_rdma
spec:
config: '{
"cniVersion": "0.3.1",
"name": "sriov-rdma-net",
"plugins": [{
"type": "sriov",
"ipam": {
"type": "host-local",
"subnet": "10.56.217.0/24",
"routes": [{
"dst": "0.0.0.0/0"
}],
"gateway": "10.56.217.1"
}
},
{
"type": "rdma"
}]
}'
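The network attachment can be created and inspected as follows (a sketch, assuming the YAML above is saved as sriov-rdma-net.yaml and Multus is installed so the NetworkAttachmentDefinition CRD exists):
~$ kubectl apply -f sriov-rdma-net.yaml
~$ kubectl get network-attachment-definitions sriov-rdma-net -o yaml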
Development
It is recommended to use the same Go version as defined in .travis.yml
to avoid potential build-related issues during development (newer versions will most likely work as well).
Build from source
~$ git clone https://github.com/Mellanox/rdma-cni.git
~$ cd rdma-cni
~$ make
Upon a successful build, the rdma binary can be found under ./build.
For small deployments (e.g., a Kubernetes test cluster or an all-in-one K8s deployment) you can either:
- Copy the rdma binary to the CNI bin dir on each worker node (see the sketch after this list).
- Build a container image, push it to your own image repo, then modify the deployment template and deploy.
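For the first option, a minimal sketch (assuming the commonly used CNI bin dir /opt/cni/bin; your distribution or kubelet configuration may use a different path):
~$ cp ./build/rdma /opt/cni/bin/
~$ ls -l /opt/cni/bin/rdma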
Run tests:
~$ make tests
Build image:
~$ make image
Limitations
Ethernet
RDMA workloads utilizing RDMA Connection Manager (CM)
For Mellanox hardware, due to a kernel limitation, it is required to pre-allocate MAC addresses for all VFs in the deployment
if an RDMA workload wishes to utilize RDMA CM to establish connections.
This is done in the following manner:
Set VF administrative MAC address:
$ ip link set <pf-netdev> vf <vf-index> mac <mac-address>
Unbind/Bind VF driver:
$ echo <vf-pci-address> > /sys/bus/pci/drivers/mlx5_core/unbind
$ echo <vf-pci-address> > /sys/bus/pci/drivers/mlx5_core/bind
Example:
$ ip link set enp4s0f0 vf 3 mac 02:03:00:00:48:56
$ echo 0000:03:00.5 > /sys/bus/pci/drivers/mlx5_core/unbind
$ echo 0000:03:00.5 > /sys/bus/pci/drivers/mlx5_core/bind
Doing so will populate the VF's node and port GUIDs, which are required for RDMA CM to establish a connection.
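A minimal sketch that applies the above to every VF of a PF (assuming the mlx5_core driver and an arbitrary, locally administered MAC prefix; adjust both to your environment):

PF=enp4s0f0
for vf_path in /sys/class/net/${PF}/device/virtfn*; do
    idx=${vf_path##*virtfn}                                # VF index from the virtfnN symlink name
    pci=$(basename "$(readlink "${vf_path}")")             # VF PCI address, e.g. 0000:03:00.5
    ip link set "${PF}" vf "${idx}" mac "$(printf '02:03:00:00:00:%02x' "${idx}")"
    echo "${pci}" > /sys/bus/pci/drivers/mlx5_core/unbind  # rebind so the VF picks up the new MAC
    echo "${pci}" > /sys/bus/pci/drivers/mlx5_core/bind
done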