Zongjie Diao, Director of Product Strategy and Management, Data Center Compute Group, CISCO
Artificial Intelligence (AI) and Machine Learning (ML) are no longer viewed as “hype.” More and more companies have started seeing the real value and impact of ML. AI/ML has become a board-level conversation and a top priority for CXOs, who have started seeing it as a game changer and a potentially crucial differentiator for their companies.
However, it is not easy to start an AI/ML initiative, and it is even harder to move it from a “moonshot” science experiment to a real deployment that drives tangible impact.
• Up to 80% of ML models developed are never used in production.
• The time to deploy an ML initiative is often 3-4X what was initially expected.
Operationalizing machine learning efficiently and effectively is critical. After helping many customers deploy machine learning in their production environments, we have learned about the significant challenges enterprises face in operationalizing machine learning, and we offer some recommendations on how to address them.
One of the biggest challenges in operationalizing machine learning is the lack of collaboration among key stakeholders, especially between data scientists/engineers and the IT team. Most organizations lack the proper infrastructure to support the unique needs of deploying machine learning, yet IT is unfortunately quite often an afterthought in ML deployment. Data scientists end up becoming shadow IT themselves, spending more than a third of their time managing training servers and workstations under their desks. Furthermore, while writing machine learning code to train a model might be considered ‘science’, scaling it up to a production-grade system is a lot more complicated. A successful machine learning deployment requires the proper IT infrastructure (with different I/O, compute power, or latency requirements) to support data preparation, processing, training, and deployment. Cisco believes that it is critical to involve IT from the beginning of an ML deployment to ensure that best practices in architecture support, security, and compliance management are included.
On the other hand, we often hear IT teams complain that they lack the ML expertise to build an ML IT environment quickly enough to support their internal customers: data scientists and data engineers. To close this gap quickly, the IT team can work with infrastructure OEMs and system integrators whose ML software, frameworks, and overall ML ecosystem have been tested and validated for fast deployment.
Companies should build out machine learning capabilities with the end in mind, understanding the use cases and strategic objectives of ML deployment, and build a robust data pipeline with collaboration between data scientists, data engineers, and IT.
Another challenge in operationalizing machine learning is creating a cloud-like machine learning development experience on-prem. Today, the public cloud is often the first place data scientists go to start a machine learning project. Setting up a GPU environment in the cloud is often faster. Cloud-based machine learning APIs with easy-to-use services, such as speech recognition and computer vision, can also be a good starting point for model development. In reality, however, about 50-60% of machine learning models are developed in on-prem IT environments, due to data gravity, companies’ governance and security concerns, and lower TCO.
While ML APIs in the cloud are easy and fast, they are typically trained on publicly available data and lack the model accuracy needed in actual production. Designing and training machine learning models, especially deep learning models, on enterprise-specific data is often the only way to create real competitive advantage, and that data is often collected and stored on-prem. The hidden cost of running a public-cloud-based machine learning environment can also be quite high and hard to manage when multiple data scientists are using the resources, or when models need to be retrained continuously to improve accuracy in a real production environment. When it comes to inferencing, on-prem deployment is even more common because of latency concerns.
When the rubber meets the road for a full-scale ML deployment, we see companies adopting a hybrid cloud approach. Hence, to speed up ML deployment and ensure success, IT needs to provide a smooth experience between public-cloud and on-prem environments for data scientists and engineers. One task is to make sure that the ML software and platforms running in the cloud can also run in the on-prem IT environment. The other is to create an ML-as-a-service experience on-prem.
The first step is to create a multi-tenant GPU-as-a-service environment through virtualization, so that data scientists and data engineers can request dedicated ML infrastructure resources while IT shares and manages the underlying hardware efficiently.
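To make the multi-tenancy idea concrete, here is a minimal Python sketch of a shared GPU pool with per-team quotas. It is an illustration only: the `GpuPool` class, the team names, and the quota numbers are all hypothetical, and a real environment would rely on a virtualization or orchestration layer rather than an in-memory counter.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GpuPool:
    """Hypothetical shared GPU pool that IT manages for several tenants."""
    total_gpus: int
    quotas: Dict[str, int]  # maximum GPUs each tenant may hold at once
    allocated: Dict[str, int] = field(default_factory=dict)

    def free_gpus(self) -> int:
        """GPUs not currently held by any tenant."""
        return self.total_gpus - sum(self.allocated.values())

    def request(self, tenant: str, count: int) -> bool:
        """Grant `count` GPUs if both the tenant quota and pool capacity allow."""
        if count <= 0:
            return False
        held = self.allocated.get(tenant, 0)
        within_quota = held + count <= self.quotas.get(tenant, 0)
        if within_quota and count <= self.free_gpus():
            self.allocated[tenant] = held + count
            return True
        return False

    def release(self, tenant: str, count: int) -> None:
        """Return up to `count` GPUs from `tenant` to the shared pool."""
        held = self.allocated.get(tenant, 0)
        self.allocated[tenant] = max(0, held - count)


# Illustrative numbers only: 8 GPUs shared by two hypothetical teams.
pool = GpuPool(total_gpus=8, quotas={"vision-team": 4, "nlp-team": 6})
granted_vision = pool.request("vision-team", 4)    # within quota and capacity
over_quota = pool.request("vision-team", 1)        # would exceed the team quota
over_capacity = pool.request("nlp-team", 6)        # only 4 GPUs are still free
granted_nlp = pool.request("nlp-team", 4)          # fits the remaining capacity
```

Each team sees what looks like dedicated capacity up to its quota, while IT retains a single view of total utilization.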
Another mistake people repeatedly make in ML deployments is failing to take an end-to-end view of the data pipeline. When talking about machine learning, especially deep learning, they often focus on only one part of the ML infrastructure: the training infrastructure. Machine learning requires an end-to-end data pipeline, from data collection and data processing to training, evaluation, deployment, and recollection. Operationalizing machine learning requires data engineers, data scientists, and IT to look at how the data will come in and be stored, where and how it will be cleaned, where and how models will be trained, and where and how they will be deployed. If data comes from existing infrastructure, e.g., big data clusters or mission-critical workloads, it is critical to ensure that specialized deep learning training infrastructure is well integrated with that existing infrastructure and can be managed as part of the standard data center infrastructure. In addition, machine learning deployment often requires compute to go where the data is. In many use cases, for example in retail and manufacturing, the data center is no longer centralized; it is distributed. The ability to manage edge servers is critical for operational efficiency and success.
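The end-to-end view described above can be sketched as a chain of stages in which training is only one step. The Python below is a toy illustration under stated assumptions: the stage functions, the sample data, the mean-value "model", and the deployment threshold are all hypothetical stand-ins, not a real ML workflow.

```python
from typing import Any, Callable, Dict, List, Optional

State = Dict[str, Any]
Stage = Callable[[State], State]


def collect(state: State) -> State:
    # Stand-in for data arriving from existing systems (illustrative values).
    state["raw"] = [1.0, 2.0, None, 4.0]
    return state


def process(state: State) -> State:
    # Cleaning step: drop missing readings.
    state["clean"] = [x for x in state["raw"] if x is not None]
    return state


def train(state: State) -> State:
    # Toy "model": the mean of the cleaned data.
    clean = state["clean"]
    state["model"] = sum(clean) / len(clean)
    return state


def evaluate(state: State) -> State:
    # Mean absolute error of the toy model on the cleaned data.
    clean, model = state["clean"], state["model"]
    state["mae"] = sum(abs(x - model) for x in clean) / len(clean)
    return state


def deploy(state: State) -> State:
    # Hypothetical gate: only deploy if the error is under a chosen threshold.
    state["deployed"] = state["mae"] < 2.0
    return state


PIPELINE: List[Stage] = [collect, process, train, evaluate, deploy]


def run_pipeline(stages: List[Stage], state: Optional[State] = None) -> State:
    """Run each stage in order, passing the shared state along."""
    state = {} if state is None else state
    for stage in stages:
        state = stage(state)
    return state


result = run_pipeline(PIPELINE)
```

Writing the stages out this way makes it easier for data engineers, data scientists, and IT to ask, for each one, where it runs and what infrastructure it needs.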
Finally, companies should build out machine learning capabilities with the end in mind, understanding the use cases.