AI - MLOps

MLOps

Definition

  • MLOps stands for machine learning operations.

  • It is the process of managing the machine learning life cycle, from development to deployment and monitoring.
  • It is a set of practices that helps data scientists and engineers manage the machine learning (ML) life cycle more efficiently.

  • It aims to bridge the gap between development and operations for machine learning. The goal of MLOps is to ensure that ML models are developed, tested, and deployed in a consistent and reliable way.

  • MLOps is becoming increasingly important as more and more organizations are using ML models to make critical business decisions.

  • It involves tasks such as:

    • Experiment tracking: Keeping track of experiments and results to identify the best models

    • Model deployment: Deploying models to production and making them accessible to applications

    • Model monitoring: Monitoring models to detect any issues or degradation in performance

    • Model retraining: Retraining models with new data to improve their performance

  • MLOps is essential for ensuring that machine learning models are reliable, scalable, and maintainable in production environments.
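
As an illustration of the experiment-tracking task listed above, here is a minimal sketch using MLflow as one possible tool. The experiment name, dataset, and hyperparameters are illustrative assumptions, not part of any particular setup.

```python
# Minimal experiment-tracking sketch with MLflow (names and values are illustrative).
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("demo-experiment")  # hypothetical experiment name

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 8}
    mlflow.log_params(params)                         # record hyperparameters

    model = RandomForestClassifier(**params).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_metric("accuracy", accuracy)           # record the evaluation result
    mlflow.sklearn.log_model(model, "model")          # store the trained model artifact
```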

The importance of MLOps

  • managing the ML life cycle
  • ensuring that ML models are effectively developed, deployed, and maintained
  • Without MLOps, organizations may face several challenges, including:

    • Increased risk of errors: Manual processes can lead to errors and inconsistencies in the ML life cycle, which can impact the accuracy and reliability of ML models.

    • Lack of scalability: Manual processes can become difficult to manage as ML models and datasets grow in size and complexity, making it difficult to scale ML operations effectively.

    • Reduced efficiency: Manual processes can be time-consuming and inefficient, slowing down the development and deployment of ML models.

    • Lack of collaboration: Manual processes can make it difficult for data scientists, engineers, and operations teams to collaborate effectively, leading to silos and communication breakdowns.

  • MLOps addresses these challenges by providing a framework and set of tools to automate and manage the ML life cycle. It enables organizations to develop, deploy, and maintain ML models more efficiently, reliably, and at scale.

Benefits of MLOps

  • Improved efficiency: MLOps automates and streamlines the ML life cycle, reducing the time and effort required to develop, deploy, and maintain ML models

  • Increased scalability: MLOps enables organizations to scale their ML operations more effectively, handling larger datasets and more complex models

  • Improved reliability: MLOps reduces the risk of errors and inconsistencies, ensuring that ML models are reliable and accurate in production

  • Enhanced collaboration: MLOps provides a common framework and set of tools for data scientists, engineers, and operations teams to collaborate effectively

  • Reduced costs: MLOps can help organizations reduce costs by automating and optimizing the ML life cycle, reducing the need for manual intervention

Difference between MLOps and DevOps

  • DevOps: a set of practices that helps organizations to bridge the gap between software development and operations teams.
  • MLOps: a similar set of practices that specifically addresses the needs of ML models.

  • key differences:

    • Scope:
      • DevOps: focuses on the software development life cycle
      • MLOps: focuses on the ML life cycle
    • Complexity:
      • ML models are often more complex than traditional software applications, requiring specialized tools and techniques for development and deployment
    • Data:
      • ML models rely on data for training and inference, which introduces additional challenges for managing and processing data
    • Regulation:
      • ML models may be subject to regulatory requirements, which can impact the development and deployment process
  • Despite these differences, MLOps and DevOps share some common principles, such as the importance of collaboration, automation, and continuous improvement. Organizations that have adopted DevOps practices can often leverage those practices when implementing MLOps.

Basic components of MLOps

  • MLOps consists of several components that work together to manage the ML life cycle, including:

  • Exploratory data analysis (EDA):

    • process of exploring and understanding the data that will be used to train the ML model. This involves tasks such as:

    • Data visualization: Visualizing the data to identify patterns, trends, and outliers

    • Data cleaning: Removing duplicate or erroneous data and dealing with missing values
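
A minimal EDA sketch with pandas; the file name and the cleaning choices are placeholders for whatever the actual dataset requires.

```python
# Minimal EDA sketch with pandas (file name and cleaning rules are placeholders).
import pandas as pd

df = pd.read_csv("data.csv")              # hypothetical raw dataset

print(df.shape)                           # number of rows and columns
print(df.describe(include="all"))         # summary statistics per column
print(df.isna().sum())                    # missing values per column
print(df.duplicated().sum())              # duplicate rows

# Basic cleaning: drop duplicates and fill numeric gaps with the column median
df = df.drop_duplicates()
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
```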

  • Data prep and feature engineering

    • Data preparation:
      • involves cleaning, transforming, and formatting the raw data to make it suitable for model training.
    • Feature engineering:

      • involves transforming the raw data into new features that are more relevant and useful for model training.

      • Together with data preparation, this step ensures that the ML model is trained on high-quality data and can make accurate predictions.
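
One way to keep data preparation and feature engineering reproducible is to express them as a single preprocessing pipeline. A sketch with scikit-learn follows; the column names are assumptions for illustration only.

```python
# Data prep / feature-engineering sketch with scikit-learn (column names are assumptions).
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "income"]            # hypothetical numeric columns
categorical_features = ["country", "plan"]      # hypothetical categorical columns

numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fill missing values
    ("scale", StandardScaler()),                    # standardize the scale
])

preprocess = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# The same fitted transformer is then reused for training and for prediction,
# which keeps feature computation consistent across both phases:
# X_train_prepared = preprocess.fit_transform(train_df)
# X_new_prepared = preprocess.transform(new_df)
```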

  • Model training and tuning

    • Model training and tuning involve training the ML model on the prepared data and optimizing its hyperparameters to achieve the best possible performance.

    • Common tasks for model training and tuning include:

      • Selecting the right ML algorithm: Choosing the right ML algorithm for the specific problem and dataset

      • Training the model: Training the ML model on the training data

      • Tuning the model: Adjusting the hyperparameters of the ML model to improve its performance

      • Evaluating the model: Evaluating the performance of the ML model on the test data
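
A minimal training-and-tuning sketch: a grid search over a couple of hyperparameters followed by evaluation on held-out test data. The algorithm and parameter grid are arbitrary choices for illustration.

```python
# Training-and-tuning sketch: grid search for hyperparameters, then test-set evaluation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_grid = {"n_estimators": [100, 300], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=5,                    # 5-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best hyperparameters:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```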

  • Model review and governance

    • Model review and governance ensure that ML models are developed and deployed responsibly and ethically.

    • Model validation: Validating the ML model to ensure it meets the desired performance and quality standards

    • Model fairness: Ensuring the ML model does not exhibit bias or discrimination

    • Model interpretability: Ensuring the ML model is understandable and explainable

    • Model security: Ensuring the ML model is secure and protected from attacks
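
As one concrete example of the fairness check mentioned above, the following toy sketch compares positive-prediction rates between two groups (a simple demographic-parity view); the predictions and group labels are made-up data.

```python
# Toy fairness check: demographic parity difference between two groups (made-up data).
import numpy as np

predictions = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])               # 1 = positive outcome
group = np.array(["A", "A", "A", "B", "B", "B", "A", "B", "B", "A"])

rate_a = predictions[group == "A"].mean()   # positive-prediction rate for group A
rate_b = predictions[group == "B"].mean()   # positive-prediction rate for group B

print("demographic parity difference:", abs(rate_a - rate_b))
# A large difference is a signal to investigate the model and its training data for bias.
```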

  • Model inference and serving

    • Model inference and serving involve deploying the trained ML model to production and making it available for use by applications and end users.

    • Model deployment: Deploying the ML model to a production environment

    • Model serving: Making the ML model available for inference by applications and end-users
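
A minimal online-serving sketch using FastAPI; the model file name and the flat numeric feature vector are assumptions, and any other web framework or managed serving platform would follow the same pattern.

```python
# Minimal online-serving sketch with FastAPI (model path and input schema are assumptions).
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")        # hypothetical serialized model


class PredictionRequest(BaseModel):
    features: list[float]                  # assumed flat numeric feature vector


@app.post("/predict")
def predict(request: PredictionRequest):
    prediction = model.predict([request.features])[0]
    return {"prediction": float(prediction)}

# Run with, e.g.: uvicorn serve:app --port 8000   (assuming this file is serve.py)
```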

  • Model monitoring

    • Model monitoring involves continuously monitoring the performance and behavior of the ML model in production. Tasks may include:

    • Tracking model performance: Tracking metrics such as accuracy, precision, and recall to assess the performance of the ML model

    • Detecting model drift: Detecting when the performance of the ML model degrades over time due to changes in the data or environment

    • Identifying model issues: Identifying issues such as bias, overfitting, or underfitting that may impact the performance of the ML model
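
A small drift-detection sketch: compare the live distribution of a feature against its training distribution with a two-sample Kolmogorov-Smirnov test. The data is synthetic and the significance threshold is an arbitrary choice.

```python
# Drift-detection sketch: compare a feature's production distribution to its
# training distribution with a two-sample Kolmogorov-Smirnov test (synthetic data).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)   # reference distribution
live_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)       # shifted "production" data

statistic, p_value = ks_2samp(training_feature, live_feature)
if p_value < 0.01:                          # illustrative alerting threshold
    print(f"possible data drift detected (KS statistic = {statistic:.3f})")
```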

  • Automated model retraining

    • Automated model retraining involves retraining the ML model when its performance degrades or when new data becomes available. This includes:

    • Triggering model retraining: Triggering the retraining process when specific conditions are met, such as a decline in model performance or the availability of new data

    • Retraining the model: Retraining the ML model using the latest data and updating the model in production

    • Evaluating the retrained model: Evaluating the performance of the retrained model and ensuring it meets the desired performance standards
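
A sketch of a retraining trigger that combines the conditions above; the metric names and thresholds are illustrative assumptions, and in practice the trigger would start an orchestrated training pipeline rather than print a message.

```python
# Retraining-trigger sketch: retrain when monitored accuracy drops below a threshold
# or enough new labeled data has accumulated (thresholds are illustrative).
RETRAIN_ACCURACY_THRESHOLD = 0.90
RETRAIN_NEW_ROWS_THRESHOLD = 10_000


def should_retrain(current_accuracy: float, new_labeled_rows: int) -> bool:
    """Decide whether to kick off a retraining run."""
    return (
        current_accuracy < RETRAIN_ACCURACY_THRESHOLD
        or new_labeled_rows >= RETRAIN_NEW_ROWS_THRESHOLD
    )


if should_retrain(current_accuracy=0.87, new_labeled_rows=2_500):
    print("triggering the retraining pipeline")   # in practice: start the training job
```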


AI Readiness

AI Readiness is an organization's maturity level in adopting AI technologies.

  • The extent to which MLOps practices are adopted determines AI Readiness.
  • AI Readiness ranges from manual, ad-hoc deployment in the tactical phase, to the use of pipelines in the strategic phase, to fully automated ML pipelines and advanced MLOps in the transformational phase.

MLOps Best Practices


DATA & CODE MANAGEMENT

DATA SOURCES & DATA VERSIONING

  • Is data versioning optional or mandatory?
    • E.g., is data versioning a requirement for a system like a regulatory requirement?
  • What data sources are available?
    • (e.g., owned, public, earned, paid data)
  • What is the storage for the above data?
    • (e.g., data lake, DWH)
  • Is manual labeling required? Do we have human resources for it?
  • How to version data for each trained model?
  • What tooling is available for data pipelines/workflows?
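
Dedicated tools such as DVC or lakeFS answer most of these questions, but even without them, one lightweight pattern is to record a content hash of the training data next to each trained model so the exact data snapshot can be traced later. A sketch (file names and the version string are placeholders):

```python
# Sketch: store a content hash of the training data alongside model metadata,
# so each trained model can be traced back to an exact data snapshot.
# File names and the version string are placeholders.
import hashlib
import json
from pathlib import Path


def file_sha256(path: str) -> str:
    """Return the SHA-256 hash of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


metadata = {
    "model_version": "1.4.0",
    "training_data": "data/train.parquet",
    "training_data_sha256": file_sha256("data/train.parquet"),
}
Path("model_metadata.json").write_text(json.dumps(metadata, indent=2))
```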

DATA ANALYSIS & EXPERIMENT MANAGEMENT

  • What programming language to use for analysis? (Is SQL sufficient for analysis?)
  • What ML algorithm will be used?
  • What evaluation metrics need to be computed?
  • Are there any infrastructure requirements for model training?
  • Reproducibility: What metadata about ML experiments is collected? (datasets, hyperparameters)
  • What IDE are we confident with?
  • What ML Framework know-how is there?

FEATURE STORE & WORKFLOWS

  • Is this optional or mandatory? Do we have a data governance process such that feature engineering has to be reproducible?
  • How are features computed (workflows) during the training and prediction phases?
  • Are there any infrastructure requirements for feature computation?
  • “Buy or make” for feature stores?
  • What databases are involved in feature storage?
  • Do we design APIs for feature engineering?
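
Whatever the "buy or make" decision, the core guarantee a feature store provides is that the same feature definition is used for both training and prediction. A minimal sketch of that pattern (the function and column names are assumptions):

```python
# Sketch: a single feature definition reused for training and serving, which is
# the consistency guarantee a feature store formalizes (column names are assumptions).
import numpy as np
import pandas as pd


def compute_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Single source of truth for feature computation."""
    features = pd.DataFrame(index=raw.index)
    features["amount_log"] = np.log1p(raw["amount"].clip(lower=0))
    features["is_weekend"] = pd.to_datetime(raw["timestamp"]).dt.dayofweek >= 5
    return features


# Training phase: compute features over the historical dataset
# train_features = compute_features(historical_df)
# Serving phase: compute the same features for incoming records
# online_features = compute_features(incoming_df)
```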

FOUNDATIONS

  • How do we maintain the code? What source version control system is used?
  • How do we monitor the system performance?
  • Do we need versioning for notebooks?
  • Is trunk-based development in place?
  • Deployment and testing automation: What is the CI/CD pipeline for the codebase? What tools are used for it?
  • Do we track deployment frequency, lead time for changes, mean time to restore, and change failure rate metrics?

METADATA MANAGEMENT

METADATA STORE

  • What kinds of metadata about code, data, ML pipeline execution, and model management need to be collected?
  • What is the documentation strategy: Do we treat documentation as code? (examples: Datasheets for Datasets, Model Cards for Model Reporting)
  • What operational metrics need to be collected?
    • E.g., time to restore, change fail percentage.
  • SRE for ML: How will outages/failures be logged and documented?

MODEL MANAGEMENT

CI/CT/CD: ML PIPELINE ORCHESTRATION

[Building, testing, packaging, and deploying the ML pipeline]

  • How often are models expected to be re-trained? Where does this happen (locally or in the cloud)?
  • What is the formalized workflow for an ML pipeline?
    • (e.g., Data prep -> model training -> model eval & validation)
    • What tech stack is used?
  • Is distributed model training required? Do we have an infrastructure for the distributed training?
  • What is the workflow for the CI pipeline? What tools are used?
  • What are the non-functional requirements for the ML model? How are they tested? Are these tests integrated into the CI/CT workflow?
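
A minimal sketch of the formalized workflow mentioned above (data prep -> model training -> model eval & validation), written as plain Python steps with a validation gate; orchestration tools such as Airflow, Kubeflow Pipelines, or Vertex AI Pipelines express the same shape in their own APIs. The dataset, model, and threshold are illustrative.

```python
# Sketch of a data prep -> training -> eval & validation workflow as plain steps,
# with a validation gate that blocks deployment of underperforming models.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def prepare_data():
    X, y = load_breast_cancer(return_X_y=True)
    return train_test_split(X, y, random_state=42)


def train_model(X_train, y_train):
    model = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])
    return model.fit(X_train, y_train)


def evaluate_and_validate(model, X_test, y_test, threshold=0.95):
    accuracy = model.score(X_test, y_test)
    if accuracy < threshold:                      # quality gate before deployment
        raise ValueError(f"model rejected: accuracy {accuracy:.3f} below {threshold}")
    return accuracy


X_train, X_test, y_train, y_test = prepare_data()
model = train_model(X_train, y_train)
print("validated accuracy:", evaluate_and_validate(model, X_test, y_test))
```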

MODEL REGISTRY & MODEL VERSIONING

  • Is this optional or mandatory?
  • Where should new ML models be stored and tracked?
  • What versioning standards are used? (e.g., semantic versioning)
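
A minimal sketch of registering a versioned model, using the MLflow model registry as one possible backend (other registries follow the same idea). The tracking URI, model name, and dataset are illustrative assumptions; each registration under the same name creates a new model version.

```python
# Sketch: log a trained model and register it under a versioned name in the
# MLflow model registry (tracking URI and model name are illustrative).
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("sqlite:///mlflow.db")   # registry needs a database-backed store

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, "model")

# Registering the same name again later creates version 2, 3, ...
result = mlflow.register_model(f"runs:/{run.info.run_id}/model", "demo-classifier")
print("registered model version:", result.version)
```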

MODEL DEPLOYMENT

  • What is the delivery format for the model?
  • What is the expected time for changes? (Time from commit to production)
  • What is the target environment to serve predictions?
  • What is your model release policy? Is A/B testing or multi-armed bandit testing required? (e.g., for deciding what model to deploy)
  • Is shadow/canary deployment required?

PREDICTION SERVING

  • What is the serving mode? (batch or online)
  • Is distributed serving required?
  • Do you need ML inference accelerators (TPU)?
  • What is the expected target volume of predictions per month or per hour?

MODEL & DATA & APPLICATION MONITORING

  • Is this optional or mandatory?
  • What ML metrics are collected?
  • What domain-specific metrics are collected?
  • How is model staleness detected? (Data Monitoring)
  • How is data skew detected? (Data Monitoring)
  • What operational aspects need to be monitored? (SRE)
  • What is the alerting strategy? (thresholds)
  • What triggers the model re-training?
    • e.g. Latency of model serving