serveurs

"If you cannot measure it,you can not improve it"- Lord Kelvin

The field of machine learning is progressing at a break-neck speedandnew algorithms and techniques as well as performance improvements are being published at such a high frequency that it is impractical to keep pace.As expected,the research community, which in this field includes several corporate entities in addition to academic establishments, is open about its research findings, butthe real hurdle in democratizing machine learningare the engineering issues that one is likely to face in moving ML from the lab to production.

The only open-source project that is making a serious attempt at solving the engineering issues of reliably scaling and deploying ML workloads isKubeflow.

Kubeflow is dedicated to making deployments of machine learning workflows onKubernetessimple, portable and scalable.

However, till very recently, the Kubeflow project did not have any benchmarking components thus making it impossible to evaluate the performance of the system when deployed on any underlying Kubernetes cluster. We at Cisco took the lead in working with the open source community to come up with a solution to this. We realized that whenever there is talk of performance, any enterprise needs to get clarity on at least the following issues:

Which ML toolkit to use? There are quite a few of them available and there will probably be more in the future. Should it beTensorFlow, PyTorch, Caffe, or something else? The choice really depends on the performance of these toolkits in the specific problems that are relevant to the enterprise.
How does the underlying infrastructure impact the performance? What is the performance of say, a Cisco UCS 480 ML M5? Does adding more containers or VMs improve the performance? Does adding GPUs improve the performance enough to justify the cost of purchasing GPUs?

Requirements

Based on the above mentioned issues, we came up with some requirements that an ML benchmarking platform should have. These requirements are generic and are not dependent on Kubeflow or any other specific system for implementing ML workloads. The picture below shows a schematic of the proposed benchmarking platform.

A benchmarking platform for ML workloads should satisfy,at least,the following requirements:

It should support multiple ML frameworks. Since there are multiple frameworks such as TensorFlow and PyTorch available, the framework should support as many of them as possible and should be flexible enough to allow the addition of future toolkits.
It should be portable. A key requirement is for the workloads to be able to easily move from one infrastructure to another so that any combination of CPUs, GPUs, and other custom processors, such as TPUs, can be handled.
It should support extreme scalability. Realistic ML workloads are large and hence the benchmarking platform should also have the ability to scale to extreme sizes.
And,it should be based on Open Source. At Cisco we take open source very seriouslyand we believe that even for this benchmarking effort, the design and implementation should be done in the open source community.

Design

In order to satisfy the requirements, we came up with the design ofKubebench.

Kubebench is a harness for benchmarking and analyzing ML workloads on Kubernetes.

As shown in the figure, it runs on top of Kubeflow and is hence not directly dependent on the underlying infrastructure. Since it works on Kubeflow and Kubeflow works on Kubernetes (only), Kubebench also works on Kubernetes. This makes the benchmark platform immediately compatible with running benchmarking runs on any Kubernetes platform, allowing the benchmarking of workloads running at massive scale.

The picture below shows the overall architecture of Kubebench. The parts in yellow are what comes prepackaged in Kubebench and includes:

an Interfacethat has anAPIto configure a workflow and adashboardon which the user can view the results of running the benchmarks.
The underlyingstoragestores the config of the benchmark job and the data generated by running the benchmark job.
The monitoring and metric visualization service that monitors the workload and collects data that is used by the dashboard.
There are alsouser defined components, marked in blue in the middle. These include the actual TensorFlow or PyTorch code related to the ML job that is to be benchmarled (currently Kubeflow officially supports TensorFlow and PyTorch only), a pre-processing job, and a post-processsing job.

For more details about Kubebench, interested readers are referred to our KubeCon + CloudNativeConTalk and Slides.

Summary

In the current cacophony over Machine Learning (ML), one thing that is often forgotten, or at least not reported often enough, is the benchmarking of ML workloads. Kubebench has been a Cisco led open-source effort to develop an ML benchmarking platform for ML workloads on Kubernetes. We would like to express our deep gratitude to the numerous reviewers and contributors from the Kubeflow project.

References

Kubebench: A Benchmarking Platform for ML Workloads-IEEE ai4i 2018 -Xinyuan Huang (Cisco), Amit Kumar Saha (Cisco), Debojyoti Dutta (Cisco), and Ce Gao (Caicloud)https://alln-extcloud-storage.cisco.com/Cisco_Blogs:ciscoblogs/5c0fda3a560b9.pdf
Presentation at KubeCon + CloudNativeCon, Shanghai, Nov, 2018.https://www.youtube.com/watch?v=9sLRIBYYUlQ -Xinyuan Huang (Cisco) and Ce Gao (Caicloud)
The Kubebench repo on Github:https://github.com/kubeflow/kubebench
The Kubeflow project:https://www.kubeflow.org/

Cisco Price, Dell Price, Huawei Price, ZTE HPE Fortinet Switch Router Server At Low Price

serveurs

Nouvelles chaudes

S5735-L48LP4XE-A-V2: Scalable, Secure, and PoE-Ready for Demanding Enterprise Deployments

S5735-L48LP4S-A-V2 Powers Smarter Campus Networks with Advanced PoE and Cloud Management

S5735-L24T4X-A1 Empowers Installers with Scalable, Reliable, and Efficient Network Access

Best Ethernet Switches for Business (2025): Selection Guide and Top Picks

Huawei S5735-L24T4S-A1: A Compact, Stackable Access Switch Built for the Future

Huawei S5735-L24T4S-A: High-Performance Stacking Meets Zero-Noise Deployment

S5735-L24P4XE-A-V2: Huawei’s Smart Choice for High-Density Campus Deployments

S5735-L24P4X-A1: Huawei’s High-Performance Access Switch Redefining Campus Networking

Huawei S5735-L24P4S-A1 Review: Reliable Gigabit Access with Enterprise-Grade Features

What Is an Orthogonal Architecture?

Huawei s5735-l24p4s-a-v2 Delivers Scalable, Secure, and Smart PoE Access for Modern IT Infrastructures

Huawei S5735-L48T4XE-A-V2 Switch Delivers Enterprise-Grade Performance in a Compact Design

Huawei S5735-L48P4XE-A-V2 Review: Versatile Campus Switch with iStack and Full L3 Support

Differences Between Huawei CE Series and S Series Switches

Huawei CloudEngine S5735 Switches Set the Benchmark for High-Performance, Energy-Efficient Switching

Huawei CloudEngine S5731‑S48P4X Datasheet

Huawei CloudEngine S5731‑S24P4X Datasheet

Huawei S5731-S Empowers Next-Generation Campus Networks with Advanced Capabilities

Huawei S5731-H24P4XC Switch Review: Power-Packed Performance and Smart PoE

Huawei S5731-H Series Switches Redefine Campus Networking with Intelligent High-Performance Architecture

Top Features of the Huawei S5731-S24T4X: The Ultimate Gigabit Access Switch for Modern Networks

General Power Module Fault Location Procedure (CE8800 & 7800 & 6800 & 5800)

How Do I Split a Stack? How to clear the stacking configuration?

Huawei CloudEngine S5731 Datasheet

Huawei CloudEngine S5731-S24P4X: Powerful Enterprise-Grade Switch Explained

Huawei S5731-S48T4X Review: Powerful Enterprise Switch for High-Speed Networking

Why are network cables limited to 100 meters?

Huawei S5731-S32ST4X: Powerful, Enterprise-Ready Gigabit Switch with Advanced Capabilities

Huawei S5731-H48T4XC Review: High-Performance Switching for Modern IT Infrastructures

Huawei S5731-H48P4XC: Comprehensive Overview

Benchmarking ML Workloads

Requirements

Design

Summary

References

Tags chauds: En vedette #nuage #CiscoDNA Cisco DNA #MachineLearning #consistentAI KubeFlow benchmarking kubebench

Ordering Guide

Ressources ressources

À propos de nous

Huawei CloudEngine S5731‑S48P4X Datasheet