Cisco IT designed AI-ready infrastructure with Cisco compute, best-in-class NVIDIA GPUs, and Cisco networking that supports AI model training and inferencing across dozens of use cases for Cisco product and engineering teams.
It's no secret that the pressure to implement AI across the business presents challenges for IT teams. It challenges us to deploy new technology faster than ever before and rethink how data centers are built to meet increasing demands across compute, networking, and storage. While the pace of innovation and business advancement is exhilarating, it can also feel daunting.
How do you quickly build the data center infrastructure needed to power AI workloads and keep up with critical business needs? This is exactly what our team, Cisco IT, was facing.
We were approached by a product team that needed a way to run AI workloads, which would be used to develop and test new AI capabilities for Cisco products. The environment would eventually support model training and inferencing for multiple teams and dozens of use cases across the business. And they needed it done quickly. Because the product teams needed to get innovations to our customers as quickly as possible, we had to deliver the new environment in just three months.
We began by mapping out the requirements for the new AI infrastructure. A non-blocking, lossless network was essential for the AI compute fabric to ensure reliable, predictable, and high-performance data transmission within the AI cluster. Ethernet was the first-class choice. Other requirements included:
With the requirements in place, we began figuring out where the cluster could be built. The existing data center facilities were not designed to support AI workloads. We knew that building from scratch with a full data center refresh would take 18-24 months, which was not an option. We needed to deliver an operational AI infrastructure in a matter of weeks, so we leveraged an existing facility and made minor changes to cabling and device distribution to accommodate the cluster.
Our next concerns were around the data being used to train models. Since some of that data would not be stored locally in the same facility as our AI infrastructure, we decided to replicate data from other data centers into our AI infrastructure storage systems to avoid performance issues related to network latency. Our network team had to ensure sufficient network capacity to handle this data replication into the AI infrastructure.
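To illustrate the kind of capacity check this implies, here is a minimal sketch that estimates how long a bulk replication would take over a given link. The dataset size, link speed, and utilization figure are hypothetical placeholders, not numbers from our environment, and the helper name `replication_hours` is ours for illustration only.

```python
def replication_hours(dataset_tb: float, link_gbps: float, utilization: float = 0.7) -> float:
    """Estimate hours needed to copy `dataset_tb` terabytes over a link.

    Assumes decimal units (1 TB = 10**12 bytes, 1 Gbps = 10**9 bits/s) and
    that only `utilization` (0-1) of the link is available for replication.
    """
    if link_gbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("link_gbps must be > 0 and utilization in (0, 1]")
    bits_to_move = dataset_tb * 1e12 * 8
    usable_bits_per_second = link_gbps * 1e9 * utilization
    return bits_to_move / usable_bits_per_second / 3600


# Hypothetical example: 100 TB of training data over a 100 Gbps inter-site link.
hours = replication_hours(dataset_tb=100, link_gbps=100, utilization=0.7)
print(f"Estimated replication time: {hours:.1f} hours")
```

A rough estimate like this helps decide whether existing inter-site links are enough or whether replication needs to be scheduled or throttled around other traffic.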
Now, getting to the actual infrastructure. We designed the heart of the AI infrastructure with Cisco compute, best-in-class GPUs from NVIDIA, and Cisco networking. On the networking side, we built a front-end Ethernet network and a back-end lossless Ethernet network. With this model, we were confident that we could quickly deploy advanced AI capabilities in any environment and continue to add them as we brought more facilities online.
After making the initial infrastructure available, the business added more use cases each week and we added additional AI clusters to support them. We needed a way to make it all easier to manage, including managing the switch configurations and monitoring for packet loss. We used Cisco Nexus Dashboard, which dramatically streamlined operations and ensured we could grow and scale for the future. We were already using it in other parts of our data center operations, so it was easy to extend it to our AI infrastructure and didn't require the team to learn an additional tool.
Our team was able to move fast and overcome several hurdles in designing the solution. We were able to design and deploy the backend of the AI fabric in under three hours and deploy the entire AI cluster and fabrics in 3 months, which was 80% faster than the alternative rebuild.
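As a quick sanity check on that comparison, the sketch below computes the share of the schedule saved by a three-month delivery against the 18-24 month full-refresh estimate mentioned earlier. Treating "faster" as the fraction of baseline time saved is our assumption about how the figure is defined.

```python
def schedule_saved(actual_months: float, baseline_months: float) -> float:
    """Return the fraction of the baseline schedule saved by the actual delivery."""
    if baseline_months <= 0:
        raise ValueError("baseline_months must be positive")
    return 1.0 - actual_months / baseline_months


# Three-month delivery versus the 18-month and 24-month refresh estimates.
low = schedule_saved(3, 18)
high = schedule_saved(3, 24)
print(f"Schedule saved: {low:.1%} to {high:.1%}")  # prints "Schedule saved: 83.3% to 87.5%"
```

Under that definition, the three-month delivery saves somewhat more than 80% of the schedule at either end of the estimate, so the 80% figure is on the conservative side.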
Today, the environment supports more than 25 use cases across the business, with more added each week. This includes:
Not only were we able to support the needs of the business today, but we're also designing how our data centers need to evolve for the future. We are actively building out more clusters and will share additional details on our journey in future blogs. The modularity and flexibility of Cisco's networking, compute, and security give us confidence that we can keep scaling with the business.