Introduction: Overcoming GPU Management Challenges
In Part 1 of this blog series, we explored the challenges of hosting large language models (LLMs) on CPU-based workloads within an EKS cluster. We discussed the inefficiencies associated with using CPUs for such tasks, primarily due to the large model sizes and slower inference speeds. The introduction of GPU resources offered a significant performance boost, but it also brought about the need for efficient management of these high-cost resources.
In this second part, we will delve deeper into how to optimize GPU usage for these workloads. We will cover the following key areas:
Challenges Addressed
Section 1: Introduction to NVIDIA Device Plugin
The NVIDIA device plugin for Kubernetes is a component that simplifies the management and usage of NVIDIA GPUs in Kubernetes clusters. It allows Kubernetes to recognize and allocate GPU resources to pods, enabling GPU-accelerated workloads.
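For illustration, one common way to deploy the device plugin is via its upstream Helm chart. This is a minimal sketch, assuming the chart repository and release name from the k8s-device-plugin project; the exact method used to provision your cluster may differ:

# Add the upstream device plugin Helm repository and install the DaemonSet
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
# In production you would typically pin a specific chart --version as well
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace kube-system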
Why We Need the NVIDIA Device Plugin
The NVIDIA device plugin simplifies GPU management in Kubernetes clusters. It automates the installation of the NVIDIA driver, container toolkit, and CUDA, ensuring that GPU resources are available for workloads without requiring manual setup.
# Installed version
rpm -qa | grep -i nvidia-container-toolkit
nvidia-container-toolkit-base-1.15.0-1.x86_64
nvidia-container-toolkit-1.15.0-1.x86_64
/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Setting Up the NVIDIA Device Plugin
To ensure the DaemonSet runs exclusively on GPU-based instances, we label the node with the key "nvidia.com/gpu" and the value "true", as shown below. This is achieved using node affinity, node selectors, and taints and tolerations.
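For example, the label can be applied with kubectl (the node name here is the GPU node used throughout this post):

kubectl label node ip-10-20-23-199.us-west-1.compute.internal nvidia.com/gpu=true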
Let us now delve into each of these components in detail.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-10de.present
              operator: In
              values:
                - "true"
        - matchExpressions:
            - key: feature.node.kubernetes.io/cpu-model.vendor_id
              operator: In
              values:
                - NVIDIA
        - matchExpressions:
            - key: nvidia.com/gpu
              operator: In
              values:
                - "true"
kubectl taint node ip-10-20-23-199.us-west-1.compute.internal nvidia.com/gpu=true:NoSchedule

kubectl describe node ip-10-20-23-199.us-west-1.compute.internal | grep -i taint
Taints: nvidia.com/gpu=true:NoSchedule

# Matching toleration on the device plugin DaemonSet so it can schedule onto the tainted GPU node:
tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
After implementing the node labeling, affinity, node selector, and taints/tolerations, we can ensure the DaemonSet runs exclusively on GPU-based instances. We can verify the deployment of the NVIDIA device plugin using the following command:
kubectl get ds -n kube-system
NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                     AGE
nvidia-device-plugin                      1         1         1       1            1           nvidia.com/gpu=true                               75d
nvidia-device-plugin-mps-control-daemon   0         0         0       0            0           nvidia.com/gpu=true,nvidia.com/mps.capable=true   75d
The challenge, however, is that GPUs are expensive, so we need to ensure they are utilized as fully as possible. Let us explore GPU concurrency in more detail.
GPU Concurrency:
GPU concurrency refers to the ability to execute multiple tasks or threads simultaneously on a GPU.
Section 2: Implementing Time Slicing for GPUs
Time-slicing in the context of NVIDIA GPUs and Kubernetes refers to sharing a physical GPU among multiple containers or pods in a Kubernetes cluster. The technology involves partitioning the GPU's processing time into smaller intervals and allocating those intervals to different containers or pods.
Why We Need Time Slicing
Configuration Example for Time Slicing
Let us apply the time-slicing configuration using a ConfigMap, as shown below. Here, replicas: 3 specifies the number of replicas for the GPU resource, which means the GPU can be sliced into 3 shared instances.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 3

# We can verify the GPU resources available on the nodes using the following command:
kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) | {name: .metadata.name, capacity: .status.capacity}'
{
  "name": "ip-10-20-23-199.us-west-1.compute.internal",
  "capacity": {
    "cpu": "4",
    "ephemeral-storage": "104845292Ki",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "16069060Ki",
    "nvidia.com/gpu": "3",
    "pods": "110"
  }
}

# The above output shows that the node ip-10-20-23-199.us-west-1.compute.internal has 3 virtual GPUs available.
# We can request GPU resources in pod specifications by setting resource limits:
resources:
  limits:
    cpu: "1"
    memory: 2G
    nvidia.com/gpu: "1"
  requests:
    cpu: "1"
    memory: 2G
    nvidia.com/gpu: "1"
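To illustrate how this plays out, here is a hypothetical Deployment that runs 3 replicas, each requesting one of the time-sliced GPUs. The name and image are placeholders for illustration and are not taken from the original setup:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference                  # hypothetical name for illustration
spec:
  replicas: 3                          # three pods, matching the three time-sliced GPU replicas
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      tolerations:
        - effect: NoSchedule           # tolerate the GPU node taint applied earlier
          key: nvidia.com/gpu
          operator: Exists
      containers:
        - name: inference
          image: your-inference-image:latest   # placeholder; replace with your model-serving image
          resources:
            limits:
              nvidia.com/gpu: "1"      # each replica is allocated one time-sliced GPU
            requests:
              nvidia.com/gpu: "1"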
In our case, we can host 3 pods on the single node ip-10-20-23-199.us-west-1.compute.internal, and because of time slicing each of these 3 pods can use one of the 3 virtual GPUs, as shown below.
GPUs have been shared virtually among the pods, and we can see the PIDs assigned to each of the processes below.
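If you have shell access to the GPU node (for example via SSM or SSH), running nvidia-smi there is one way to confirm this, since its output lists the processes currently using the GPU along with their PIDs:

# Run on the GPU node; the "Processes" section lists the PID of each workload sharing the GPU
nvidia-smi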
Now that we have optimized GPU usage at the pod level, let us focus on optimizing GPU resources at the node level. We can achieve this with a cluster autoscaling solution called Karpenter. This is particularly important because the learning labs may not always have a constant load or user activity, and GPUs are extremely expensive. By leveraging Karpenter, we can dynamically scale GPU nodes up or down based on demand, ensuring cost-efficiency and optimal resource utilization.
Section 3: Node Autoscaling with Karpenter
Karpenter is an open-source node lifecycle management solution for Kubernetes. It automates the provisioning and deprovisioning of nodes based on the scheduling needs of pods, allowing efficient scaling and cost optimization.
Why Use Karpenter for Dynamic Scaling
Installing Karpenter:
# Install Karpenter using Helm:
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace "${KARPENTER_NAMESPACE}" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi

# Verify the Karpenter installation:
kubectl get pod -n kube-system | grep -i karpenter
karpenter-7df6c54cc-rsv8s   1/1   Running   2 (10d ago)   53d
karpenter-7df6c54cc-zrl9n   1/1   Running   0             53d
Configuring Karpenter with NodePools and NodeClasses:
Karpenter can be configured with NodePools and NodeClasses to automate the provisioning and scaling of nodes based on the specific needs of your workloads.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: g4-nodepool
spec:
  template:
    metadata:
      labels:
        nvidia.com/gpu: "true"
    spec:
      taints:
        - effect: NoSchedule
          key: nvidia.com/gpu
          value: "true"
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
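The NodePool spec above is truncated; a NodePool also typically references an EC2NodeClass that tells Karpenter how to launch the underlying EC2 instances. Below is a minimal sketch, assuming the karpenter.k8s.aws/v1beta1 API; the name g4-nodeclass, the IAM role naming, and the discovery tags are illustrative assumptions to adapt to your cluster:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: g4-nodeclass                          # hypothetical name, referenced by the NodePool's nodeClassRef
spec:
  amiFamily: AL2                              # EKS-optimized AMI family
  role: "KarpenterNodeRole-${CLUSTER_NAME}"   # assumed IAM role naming convention
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"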