175 Matching Annotations
  1. Nov 2024
    1. Data scientists, MLOps engineers, or AI developers, can mount large language model weights or machine learning model weights in a pod alongside a model-server, so that they can efficiently serve them without including them in the model-server container image. They can package these in an OCI object to take advantage of OCI distribution and ensure efficient model deployment. This allows them to separate the model specifications/content from the executables that process them.

      The introduction of the Image Volume Source feature in Kubernetes 1.31 allows MLOps practitioners to mount OCI-compatible artifacts, such as large language model weights or machine learning models, directly into pods without embedding them in container images. This streamlines model deployment, enhances efficiency, and leverages OCI distribution mechanisms for effective model management.

    1. Deploying Machine Learning Models with Flask and AWS Lambda: A Complete Guide

      In essence, this article is about:

      1) Training a sample model and uploading it to an S3 bucket:

      ```python from sklearn.datasets import load_iris from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression import joblib

      Load the Iris dataset

      iris = load_iris() X, y = iris.data, iris.target

      Split the data into training and testing sets

      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

      Train the logistic regression model

      model = LogisticRegression(max_iter=200) model.fit(X_train, y_train)

      Save the trained model to a file

      joblib.dump(model, 'model.pkl') ```

      1. Creating a sample Zappa config, because AWS Lambda doesn’t natively support Flask, we need to use Zappa, a tool that helps deploy WSGI applications (like Flask) to AWS Lambda:

      ```json { "dev": { "app_function": "app.app", "exclude": [ "boto3", "dateutil", "botocore", "s3transfer", "concurrent" ], "profile_name": null, "project_name": "flask-test-app", "runtime": "python3.10", "s3_bucket": "zappa-31096o41b" },

      "production": {
          "app_function": "app.app",
          "exclude": [
              "boto3",
              "dateutil",
              "botocore",
              "s3transfer",
              "concurrent"
          ],
          "profile_name": null,
          "project_name": "flask-test-app",
          "runtime": "python3.10",
          "s3_bucket": "zappa-31096o41b"
      }
      

      } ```

      1. Writing a sample Flask app:

      ```python import boto3 import joblib import os

      Initialize the Flask app

      app = Flask(name)

      S3 client to download the model

      s3 = boto3.client('s3')

      Download the model from S3 when the app starts

      s3.download_file('your-s3-bucket-name', 'model.pkl', '/tmp/model.pkl') model = joblib.load('/tmp/model.pkl')

      @app.route('/predict', methods=['POST']) def predict(): # Get the data from the POST request data = request.get_json(force=True)

      # Convert the data into a numpy array
      input_data = np.array(data['input']).reshape(1, -1)
      
      # Make a prediction using the model
      prediction = model.predict(input_data)
      
      # Return the prediction as a JSON response
      return jsonify({'prediction': int(prediction[0])})
      

      if name == 'main': app.run(debug=True) ```

      1. Deploying this app to production (to AWS):

      bash zappa deploy production

      and later eventually updating it:

      bash zappa update production

      1. We should get a URL like this:

      https://xyz123.execute-api.us-east-1.amazonaws.com/production

      which we can query:

      curl -X POST -H "Content-Type: application/json" -d '{"input": [5.1, 3.5, 1.4, 0.2]}' https://xyz123.execute-api.us-east-1.amazonaws.com/production/predict

    1. Optimizing Kubernetes Costs with Multi-Tenancy and Virtual Clusters

      The blog post by Cliff Malmborg from Loft Labs discusses optimizing Kubernetes costs using multi-tenancy and virtual clusters. With Kubernetes expenses rising rapidly at scale, traditional cost-saving methods like autoscaling, resource quotas, and monitoring tools help but are not enough for complex environments where underutilized clusters are common. Multi-tenancy enables resource sharing, reducing the number of clusters and, in turn, management and operational costs.

      A virtual cluster is a fully functional Kubernetes cluster running within a larger host cluster, providing better isolation and flexibility than namespaces. Unlike namespaces, each virtual cluster has its own Kubernetes control plane, so resources like statefulsets and webhooks are isolated within it, while only core resources (like pods and services) are shared with the host cluster. This setup addresses the "noisy neighbor" problem, where workloads in a shared environment interfere with each other due to resource contention.

      Virtual clusters offer the isolation benefits of individual physical clusters but are cheaper and easier to manage than deploying separate physical clusters for each tenant or application. They also support "sleep mode," automatically scaling down unused resources to save costs, and allow shared use of central tools (like ingress controllers) installed in the host cluster. By transitioning to virtual clusters, companies can balance security, isolation, and cost-effectiveness, reducing the need for multiple physical clusters and making Kubernetes infrastructure scalable for modern, resource-demanding applications.

  2. Feb 2024
    1. We’ve (painstakingly) manually reviewed 310 live MLOps positions, advertised across various platforms in Q4 this year

      They went through 310 role descriptions and, even though role descriptions may vary significantly, they found 3 core skills that a large percentage of MLOps roles required:

      📦 Docker and Kubernetes 🐍 Python 🌥 Cloud

  3. Mar 2023
    1. You can freely replace SageMaker services with other components as your project grows and potentially outgrows SageMaker.

    1. Ultimately, after researching how we can overcome some inconveniences in Kubeflow, we decided to continue using it. Even though the UI could use some improvements in terms of clarity, we didn’t want to give up the advantages of configured CI/CD and containerization, which allowed us to use different environments. Also, for our projects, it is convenient to develop each ML pipeline in separate Git repositories.

      Kubeflow sounds like the most feature rich solution, whose main con is its UI and the setup process

    2. So, let’s sum up the pros and cons of each tool:

      Summary of pros/cons of Airflow, Kubeflow and Prefect

    3. The airflow environment must have all the libraries that are being imported in all DAGs. Without using containerization all Airflow pipelines are launched within the same environment. This leads to limitations in using exotic libraries or conflicting module versions for different projects.

      Main con of Airflow

    4. Prefect is a comparatively new but promising orchestration tool that appeared in 2018. The tool positions itself as a replacement for Airflow, featuring greater flexibility and simplicity. It is an open-source project; however, there is a paid cloud version to track workflows.
    5. Airflow has been one of the most popular orchestrating tools for several years.

      (see the graph above)

    6. An orchestration tool usually doesn’t do the hard work of translating and processing data itself, but tells other systems and frameworks what to do and monitors the status of the execution.

      Responsibility of the orchestration tool

    7. To this day, the field of machine learning does not have a single generally accepted approach to solving problems in terms of practical use of models.

      Business ¯\_(ツ)_/¯

    1. ServingRuntime - Templates for Pods that can serve one or more particular model formats. There are three "built in" runtimes that cover the out-of-the-box model types, custom runtimes can be defined by creating additional ones.

      ServingRuntime

    1. cluster with 4096 IP addresses can deploy at most 1024 models assuming each InferenceService has 4 pods on average (two transformer replicas and two predictor replicas).

      Kubernetes clusters have a maximum IP address limitation

    2. According to Kubernetes best practice, a node shouldn't run more than 100 pods.
    3. Each model’s resource overhead is 1CPU and 1 GB memory. Deploying many models using the current approach will quickly use up a cluster's computing resource. With Multi-model serving, these models can be loaded in one InferenceService, then each model's average overhead is 0.1 CPU and 0.1GB memory.

      If I am not mistaken, the multi-model approach reduces the size by 90% in this case

    4. Multi-model serving is designed to address three types of limitations KServe will run into

      Benefits of multi-model serving

    5. While you get the benefit of better inference accuracy and data privacy by building models for each use case, it is more challenging to deploy thousands to hundreds of thousands of models on a Kubernetes cluster.

      With more separation, comes the problem of distribution

    1. response times, error rates, and request rates

      Sample metrics to monitor

    2. You can use authentication mechanisms such as OAuth2, JSON Web Tokens (JWT), or HTTP Basic Authentication to ensure that only authorized users or applications can access your API.
    3. In this example, we’ve defined an API endpoint called /predict_image that accepts a file upload using FastAPI's UploadFile type. When a client sends an image file to this endpoint, the file is read and its contents are passed to a preprocessing function that prepares the image for input into the model. Once the image has been preprocessed, the model can make a prediction on it, and the result can be returned to the client as a JSON response.

      Example above shows how to upload an image to an API endpoint with FastAPI.

      Example below is a bit more complex.

    4. For example, if you are using TensorFlow, you might save your model as a .h5 file using the Keras API. If you are using PyTorch, you might save your model as a .pt file using the torch.save() function. By saving your model as a file, you can easily load it into a deployment environment (such as FastAPI) and use it to make predictions on new images
  4. Jan 2023
    1. We’re also not going deep here on MLops or LLMops tooling, which is not yet highly standardized and will be addressed in a future post.

      first mention of LLMops I've seen in the wild

    1. tl;dw (best DevOps tools in 2023)

      1. Low-budget cloud computing : Civo (close to Scaleway)
      2. Infrastructure and Service Management: Crossplane
      3. App Management - manifests : cdk8s (yes, not Kustomize or Helm)
      4. App Management - k8s operators: tie between Knative and Crossplane
      5. App Management - managed services: Google Cloud Run
      6. Dev Envs: Okteto (yeap, not GitPod)
      7. CI/CD: GitHub Actions (as it's simplest to use)
      8. GitOps (CD): Argo CD (wins with Flux due to its adoption rate)
      9. Policy Management: Kyverno (simpler to use than industry's most powerful tool: OPA / Gatekeeper)
      10. Observability: OpenTelemetry (instrumentation of apps), VictoriaMetrics (metrics - yes not Prometheus), Grafana / Loki (logs), Grafana Tempo (tracing), Grafana (dashboards), Robusta (alerting), Komodor (troubleshooting)
    1. The {vetiver} package provides a set of tools for building, deploying, and managing machine learning models in production. It allows users to easily create, version, and deploy machine learning models to various hosting platforms, such as Posit Connect or a cloud hosting service like Azure.
    2. I hope to show how to demonstrate how easy model deployment can be using Posit’s open source tools for MLOps. This includes {pins}, {vetiver}, and the {tidymodels} bundle of packages along with the {tidyverse}.

      Consider the following packages while doing MLOps in R: - pins - vetiver - tidymodels - tidyverse

  5. Dec 2022
    1. Ultimately the data scientists need me more than I need them; I’m the reason their stuff is in production and runs smoothly.
  6. Nov 2022
    1. in MLflow 2.0, the mlflow.evaluate() API for model evaluation is now stable and production-ready. With just a single line of code, mlflow.evaluate() creates a comprehensive performance report for any ML model.
    2. MLflow 2.0 also adds AutoML to MLflow Recipes, dramatically reducing the amount of time required to produce a high-quality model.

      AutoML in MLflow 2.0

    3. In MLflow 2.0, MLflow Recipes is now a core platform component with several new features, including support for classification models, improved data profiling and hyperparameter tuning capabilities.

      MLflow Recipes in MLflow 2.0

    1. As I think today microservice can do much more than just gives predictions using a single model, like:

      List of differences between a microservice and inference service.

      (see bullet points below annotation)

    1. See an example for combining mlflow and DVC e.g. here: https://github.com/mbunse/mlcomops/tree/meetup_erlangen

      Real project combining MLflow & DVC.

    2. What do you mean, would ( ͡° ͜ʖ ͡°)

      Voting for ClearML with a video.

    3. Infos in the comments about DVC MLOps & one suggesting ClearML.

  7. Oct 2022
    1. As of today, a lot of the things in ML are not automated. They are manual or semi-manual.
    2. If we say that MLOps is just DevOps + “some things”, then CI/CD is a core principle of that.
    3. I believe that packaging/building/deploying the vanilla, run-of-the-mill ML model will become common knowledge for backend devs.
    4. MLOps engineer today is either an ML engineer (building ML-specific software) or a DevOps engineer. Nothing special here.Should we call a DevOps engineer who primarily operates ML-fueled software delivery an MLOps engineer?I mean, if you really want, we can, but I don’t think we need a new role here. It is just a DevOps eng.

      Who really is MLOps Engineer ;)

    5. The MLOps team should consist of a DevOps engineer, a backend software engineer, a data scientist, + regular software folks.

      Recommended MLOps team structure

    1. Python is known for using more memory than more optimized languages and, in this case, it uses 7 times more than PostgresML.
    2. PostgresML outperforms traditional Python microservices by a factor of 8 in local tests and by a factor of 40 on AWS EC2.
  8. Jun 2022
    1. In a Staging workflow, releases are slower because of more steps, and bigger because of batching.

    2. For Staging to be useful, it has to catch a special kind of issues that 1) would happen in production, but 2) wouldn’t happen on a developer's laptop.What are these? They might be problems with data migrations, database load and queries, and other infra-related problems.

      How "Staging" environment can be useful

    1. Another disadvantage of managed platforms is that they are inflexible and slow to change. They might provide 80% of the functionality we require, but it is often the case that the missing 20% provides functionality that is mission critical for machine learning projects. The closed design and architecture of managed platforms makes it difficult to make even the most trivial changes. To compensate for this lack of flexibility, we often have to design custom, inefficient and hard-to-maintain mechanisms that add technical debt to the project.

      Main disadvantage of managed MLOps platforms

  9. May 2022
    1. Overall, if speed is your primary concern and you’re on a budget, then Circle CI is the clear choice. If you’re not looking to run a ton of builds each month and your code is already in Github, then Github Actions can offer similar performance with the added convenience of having everything under one service. Even though we liked Travis better, our main criteria was value, and since you can’t use Travis for free after the first month, GitLab was able to grab the third slot, despite it being weaker in almost every other category.

      4 CI free tier comparison: * Quality of Documentation * Compute Power * Available Disk Space * Free Build Minutes * Speed and Performance

    1. As of today, the Docker Engine is to be intended as an open source software for Linux, while Docker Desktop is to be intended as the freemium product of the Docker, Inc. company for Mac and Windows platforms. From Docker's product page: "Docker Desktop includes Docker Engine, Docker CLI client, Docker Build/BuildKit, Docker Compose, Docker Content Trust, Kubernetes, Docker Scan, and Credential Helper".

      About Docker Engine and Docker Desktop

    2. The diagram below tries to summarise the situation as of today, and most importantly to clarify the relationships between the various moving parts.

      Containers (the backend):

    1. Without accounting for what we install or add inside, the base python:3.8.6-buster weighs 882MB vs 113MB for the slim version. Of course it's at the expense of many tools such as build toolchains3 but you probably don't need them in your production image.4 Your ops teams should be happier with these lighter images: less attack surface, less code that can break, less transfer time, less disk space used, ... And our Dockerfile is still readable so it should be easy to maintain.

      See sample Dockerfile above this annotation (below there is a version tweaked even further)

  10. Apr 2022
    1. Most companies are not prepared to pay for a staging environment identical to production

      Keeping staging environment has its cost

  11. Mar 2022
    1. Have you ever built an image only to realize that you actually need it on a user account other than root, requiring you to rebuild the image again in rootless mode? Or have you built an image on one machine but run containers on the image using multiple different machines? Now you need to set up an account on a registry, push the image to the registry, Secure Shell (SSH) to each device you want to run the image on, and then pull the image. The podman image scp command solves both of these annoying scenarios as quickly as they occur.

      Podman 4.0 can transfer container images without a registry.

      For example: * You can copy a root image to a non-root account:

      $ podman image scp root@localhost::IMAGE USER@localhost:: * Or copy an image from one machine to another with this command:

      $ podman image scp me@192.168.68.122::IMAGE you@192.168.68.128::

    1. As mentioned earlier, PATCH requests should apply partial updates to a resource, whereas PUT replaces an existing resource entirely. It's usually a good idea to design updates around PATCH requests

      Prefer PATCH over PUT

    2. Aside from using HTTP status codes that indicate the outcome of the request (success or error), when returning errors, always use a standardized error response that includes more detailed information on what went wrong.

      For example: ``` // Request => GET /users/4TL011ax

      // Response <= 404 Not Found { "code": "user/not_found", "message": "A user with the ID 4TL011ax could not be found." } ```

    3. https://api.averagecompany.com/v1/health https://api.averagecompany.com/health?api_version=1.0

      2 examples of versioning APIs

    4. When dealing with date and time, APIs should always return ISO 8601-formatted strings.
    1. But the problem with Poetry is arguably down to the way Docker’s build works: Dockerfiles are essentially glorified shell scripts, and the build system semantic units are files and complete command runs. There is no way in a normal Docker build to access the actually relevant semantic information: in a better build system, you’d only re-install the changed dependencies, not reinstall all dependencies anytime the list changed. Hopefully someday a better build system will eventually replace the Docker default. Until then, it’s square pegs into round holes.

      Problem with Poetry/Docker

    2. Third, you can use poetry-dynamic-versioning, a plug-in for Poetry that uses Git tags instead of pyproject.toml to set your application’s version. That way you won’t have to edit pyproject.toml to update the version. This seems appealing until you realize you now need to copy .git into your Docker build, which has its own downsides, like larger images unless you’re using multi-stage builds.

      Approach of using poetry-dynamic-versioning plugin

    3. But if you’re doing some sort of continuous deployment process where you’re continuously updating the version field, your Docker builds are going to be slow.

      Be careful when updating the version field of pyproject.toml around Docker

    1. VCR.py works primarily via the @vcr decorator. You can import this decorator by writing: import vcr.

      How VCR.py works

    2. The VCR.py library records the responses from HTTP requests made within your unit tests. The first time you run your tests using VCR.py is like any previous run. But the after VCR.py has had the chance to run once and record, all subsequent tests are:Fast! No more waiting for slow HTTP requests and responses in your tests.Deterministic. Every test is repeatable since they run off of previously recorded responses.Offline-capable! Every test can now run offline.

      VCR.py library to speed up Python HTTP tests

    1. DevOps is an interesting case study for understanding MLOps for a number of reasons: It underscores the long period of transformation required for enterprise adoption.It shows how the movement is comprised of both tooling advances as well as shifts in cultural mindset at organizations. Both must march forward hand-in-hand.It highlights the emerging need for practitioners with cross-functional skills and expertise. Silos be damned.

      3 things MLOps can learn from DevOps

    2. MLOps today is in a very messy state with regards to tooling, practices, and standards. However, this is to be expected given that we are still in the early phases of broader enterprise machine learning adoption. As this transformation continues over the coming years, expect the dust to settle while ML-driven value becomes more widespread.

      State of MLOps in March 2022

  12. Jan 2022
    1. Adopting Kubernetes-native environments ensures true portability for the hybrid cloud. However, we also need a Kubernetes-native framework to provide the "glue" for applications to seamlessly integrate with Kubernetes and its services. Without application portability, the hybrid cloud is relegated to an environment-only benefit. That framework is Quarkus.

      Quarkus framework

    2. Kubernetes-native is a specialization of cloud-native, and not divorced from what cloud native defines. Whereas a cloud-native application is intended for the cloud, a Kubernetes-native application is designed and built for Kubernetes.

      Kubernetes-native application

    3. According to Wilder, a cloud-native application is any application that was architected to take full advantage of cloud platforms. These applications: Use cloud platform services. Scale horizontally. Scale automatically, using proactive and reactive actions. Handle node and transient failures without degrading. Feature non-blocking asynchronous communication in a loosely coupled architecture.

      Cloud-native applications

    1. Salesforce has a unique use case where they need to serve 100K-500K models because the Salesforce Einstein product builds models for every customer. Their system serves multiple models in each ML serving framework container. To avoid the noisy neighbor problem and prevent some containers from taking significantly more load than others, they use shuffle sharding [8] to assign models to containers. I won’t go into the details and I recommend watching their excellent presentation in [3].

      Case of Salesforce serving 100K-500K ML models with the use of shuffle sharding

    2. Batching predictions can be especially beneficial when running neural networks on GPUs since batching takes better advantage of the hardware.

      Barching predictions

    3. Inference Service — provides the serving API. Clients can send requests to different routes to get predictions from different models. The Inference Service unifies serving logic across models and provides easier interaction with other internal services. As a result, data scientists don’t need to take on those concerns. Also, the Inference Service calls out to ML serving containers to obtain model predictions. That way, the Inference Service can focus on I/O-bound operations while the model serving frameworks focus on compute-bound operations. Each set of services can be scaled independently based on their unique performance characteristics.

      Responsibilities of Inference Service

    4. Provide a model config file with the model’s input features, the model location, what it needs to run (like a reference to a Docker image), CPU & memory requests, and other relevant information.

      Contents of a model config file

    5. what changes when you need to deploy hundreds to thousands of online models? The TLDR: much more automation and standardization.

      MLOps focuses deeply on automation and standardization

    1. “Shadow Mode” or “Dark Launch” as Google calls it is a technique where production traffic and data is run through a newly deployed version of a service or machine learning model, without that service or model actually returning the response or prediction to customers/other systems. Instead, the old version of the service or model continues to serve responses or predictions, and the new version’s results are merely captured and stored for analysis.

      Shadow mode

    1. you can also mount different FastAPI applications within the FastAPI application. This would mean that every sub-FastAPI application would have its docs, would run independent of other applications, and will handle its path-specific requests. To mount this, simply create a master application and sub-application file. Now, import the app object from the sub-application file to the master application file and pass this object directly to the mount function of the master application object.

      It's possible to mount FastAPI applications within a FastAPI application

    1. There are officially 5 types of UUID values, version 1 to 5, but the most common are: time-based (version 1 or version 2) and purely random (version 3). The time-based UUIDs encode the number of 10ns since January 1st, 1970 in 7.5 bytes (60 bits), which is split in a “time-low”-“time-mid”-“time-hi” fashion. The missing 4 bits is the version number used as a prefix to the time-hi field.  This yields the 64 bits of the first 3 groups. The last 2 groups are the clock sequence, a value incremented every time the clock is modified and a host unique identifier.

      There are 5 types of UUIDs (source):

      Type 1: stuffs MAC address+datetime into 128 bits

      Type 3: stuffs an MD5 hash into 128 bits

      Type 4: stuffs random data into 128 bits

      Type 5: stuffs an SHA1 hash into 128 bits

      Type 6: unofficial idea for sequential UUIDs

    2. Even though most posts are warning people against the use of UUIDs, they are still very popular. This popularity comes from the fact that these values can easily be generated by remote devices, with a very low probability of collision.
  13. Dec 2021
    1. Artifactory/Nexus/Docker repo was unavailable for a tiny fraction of a second when downloading/uploading packagesThe Jenkins builder randomly got stuck

      Typical random issues when deploying microservices

    2. Microservices can really bring value to the table, but the question is; at what cost? Even though the promises sound really good, you have more moving pieces within your architecture which naturally leads to more failure. What if your messaging system breaks? What if there’s an issue with your K8S cluster? What if Jaeger is down and you can’t trace errors? What if metrics are not coming into Prometheus?

      Microservices have quite many moving parts

    3. If you’re going with a microservice:

      9 things needed for deploying a microservice (listed below)

    4. Let’s take a simple online store app as an example.

      5 things needed for deploying a monolith (listed below)

    5. some of the pros for going microservices

      Pros of microservices (not always all are applicable):

      • Fault isolation
      • Eliminating the technology lock
      • Easier understanding
      • Faster deployment
      • Scalability
  14. Nov 2021
    1. I’d probably choose the official Docker Python image (python:3.9-slim-bullseye) just to ensure the latest bugfixes are always available.

      python:3.9-slim-bullseye may be the sweet spot for a Python Docker image

    2. So which should you use? If you’re a RedHat shop, you’ll want to use their image. If you want the absolute latest bugfix version of Python, or a wide variety of versions, the official Docker Python image is your best bet. If you care about performance, Debian 11 or Ubuntu 20.04 will give you one of the fastest builds of Python; Ubuntu does better on point releases, but will have slightly larger images (see above). The difference is at most 10% though, and many applications are not bottlenecked on Python performance.

      Choosing the best Python base Docker image depends on different factors.

    3. There are three major operating systems that roughly meet the above criteria: Debian “Bullseye” 11, Ubuntu 20.04 LTS, and RedHat Enterprise Linux 8.

      3 candidates for the best Python base Docker image

    1. If for some reason you don’t see a running pod from this command, then using kubectl describe po a is your next-best option. Look at the events to find errors for what might have gone wrong.

      kubectl run a –image alpine –command — /bin/sleep 1d

    2. As with listing nodes, you should first look at the status column and look for errors. The ready column will show how many pods are desired and how many are running.

      kubectl get pods -A -o wide

    3. -o wide option will tell us additional details like operating system (OS), IP address and container runtime. The first thing you should look for is the status. If the node doesn’t say “Ready” you might have a problem, but not always.

      kubectl get nodes -o wide

    4. This command will be the easiest way to discover if your scheduler, controller-manager and etcd node(s) are healthy.

      kubectl get componentstatus

    5. If something broke recently, you can look at the cluster events to see what was happening before and after things broke.

      kubectl get events -A

    6. this command will tell you what CRDs (custom resource definitions) have been installed in your cluster and what API version each resource is at. This could give you some insights into looking at logs on controllers or workload definitions.

      kubectl api-resources -o wide –sort-by name

    7. kubectl get --raw '/healthz?verbose'

      Alternative to kubectl get --raw '/healthz?verbose'. It does not show scheduler or controller-manager output, but it adds a lot of additional checks that might be valuable if things are broken

    8. Here are the eight commands to run

      8 commands to debug Kubernetes cluster:

      kubectl version --short
      kubectl cluster-info
      kubectl get componentstatus
      kubectl api-resources -o wide --sort-by name
      kubectl get events -A
      kubectl get nodes -o wide
      kubectl get pods -A -o wide
      kubectl run a --image alpine --command -- /bin/sleep 1d
      
  15. Oct 2021
    1. few battle-hardened options, for instance: Airflow, a popular open-source workflow orchestrator; Argo, a newer orchestrator that runs natively on Kubernetes, and managed solutions such as Google Cloud Composer and AWS Step Functions.

      Current top orchestrators:

      • Airflow
      • Argo
      • Google Cloud Composer
      • AWS Step Functions
    2. To make ML applications production-ready from the beginning, developers must adhere to the same set of standards as all other production-grade software. This introduces further requirements:

      Requirements specific to MLOps systems:

      1. Large scale of operations
      2. Orchestration
      3. Robust versioning (data, models, code)
      4. Apps integrated to surrounding busness systems
    3. In contrast, a defining feature of ML-powered applications is that they are directly exposed to a large amount of messy, real-world data which is too complex to be understood and modeled by hand.

      One of the best ways to picture a difference between DevOps and MLOps

    1. Argo Workflow is part of the Argo project, which offers a range of, as they like to call it, Kubernetes-native get-stuff-done tools (Workflow, CD, Events, Rollouts).

      High level definition of Argo Workflow

    2. Argo is designed to run on top of k8s. Not a VM, not AWS ECS, not Container Instances on Azure, not Google Cloud Run or App Engine. This means you get all the good of k8s, but also the bad.

      Pros of Argo Workflow:

      • Resilience
      • Autoscaling
      • Configurability
      • Support for RBAC

      Cons of Argo Workflow:

      • A lot of YAML files required
      • k8s knowledge required
    3. If you are already heavily invested in Kubernetes, then yes look into Argo Workflow (and its brothers and sisters from the parent project).The broader and harder question you should ask yourself is: to go full k8s-native or not? Look at your team’s cloud and k8s experience, size, growth targets. Most probably you will land somewhere in the middle first, as there is no free lunch.

      Should you go into Argo, or not?

    4. In order to reduce the number of lines of text in Workflow YAML files, use WorkflowTemplate . This allow for re-use of common components.

      kind: WorkflowTemplate

    1. You probably shouldn't use Alpine for Python projects, instead use the slim Docker image versions.

      (have a look below this highlight for a full reasoning)

  16. Sep 2021
    1. we will be releasing KServe 0.7 outside of the Kubeflow Project and will provide more details on how to migrate from KFServing to KServe with minimal disruptions

      KFServing is now KServe

    1. kind, microk8s, or k3s are replacements for Docker Desktop. False. Minikube is the only drop-in replacement. The other tools require a Linux distribution, which makes them a non-starter on macOS or Windows. Running any of these in a VM misses the point – you don't want to be managing the Kubernetes lifecycle and a virtual machine lifecycle. Minikube abstracts all of this.

      At the current moment the best approach is to use minikube with a preferred backend (Docker Engine and Podman are already there), and you can simply run one command to configure Docker CLI to use the engine from the cluster.

  17. Aug 2021
    1. k3d is basically running k3s inside of Docker. It provides an instant benefit over using k3s on a local machine, that is, multi-node clusters. Running inside Docker, we can easily spawn multiple instances of our k3s Nodes.

      k3d <--- k3s that allows to run mult-node clusters on a local machine

    2. Kubernetes in Docker (KinD) is similar to minikube but it does not spawn VM's to run clusters and works only with Docker. KinD for the most part has the least bells and whistles and offers an intuitive developer experience in getting started with Kubernetes in no time.

      KinD (Kubernetes in Docker) <--- sounds like the most recommended solution to learn k8s locally

    3. Contrary to the name, it comes in a larger binary of 150 MB+. It can be run as a binary or in DinD mode. k0s takes security seriously and out of the box, it meets the FIPS compliance.

      k0s <--- similar to k3s, but not as lightweight

    4. k3s is a lightweight Kubernetes distribution from Rancher Labs. It is specifically targeted for running on IoT and Edge devices, meaning it is a perfect candidate for your Raspberry Pi or a virtual machine.

      k3s <--- lightweight solution

    5. All of the tools listed here more or less offer the same feature, including but not limited to

      7 tools for learning k8s locally:

      1. k3s
      2. k0s
      3. Microk8s
      4. DinD
      5. minikube
      6. KinD
      7. k3d
    6. There are multiple tools for running Kubernetes on your local machine, but it basically boils down to two approaches on how it is done

      We can run Kubernetes locally as a:

      1. binary package
      2. container using dind
    7. Before we move on to talk about all the tools, it will be beneficial if you installed arkade on your machine.

      With arkade, we can quickly set up different k8s tools, while using a single command:

      e.g. arkade get k9s

  18. Jul 2021
    1. Why do 87% of data science projects never make it into production?

      It turns out that this phrase doesn't lead to an existing research. If one goes down the rabbit hole, it all ends up with dead links

    1. Furthermore, in order to build a comprehensive pipeline, the code quality, unit test, automated test, infrastructure provisioning, artifact building, dependency management and deployment tools involved have to connect using APIs and extend the required capabilities using IaC.

      Vital components of a pipeline

    1. The fact that FastAPI does not come with a development server is both a positive and a negative in my opinion. On the one hand, it does take a bit more to serve up the app in development mode. On the other, this helps to conceptually separate the web framework from the web server, which is often a source of confusion for beginners when one moves from development to production with a web framework that does have a built-in development server (like Django or Flask).

      FastAPI does not include a web server like Flask. Therefore, it requires Uvicorn.

      Not having a web server has pros and cons listed here

    1. Get the `curl-format.txt` from github and then run this curl command in order to get the output $ curl -L -w "@curl-format.txt" -o tmp -s $YOUR_URL

      Testing server latency with curl:

      1) Get this file from GitHub

      2) Run the curl: curl -L -w "@curl-format.txt" -o tmp -s $YOUR_URL

    1. We comment out the failed line, and the Dockerfile now looks like this:

      To test a failing Dockerfile step, it is best to comment it out, successfully build an image, and then run this command from inside of the Dockerfile

    1. To prevent this skew, companies like DoorDash and Etsy log a variety of data at online prediction time, like model input features, model outputs, and data points from relevant production systems.

      Log inputs and outputs of your online models to prevent training-serving skew

    2. idempotent jobs — you should be able to run the same job multiple times and get the same result.

      Encourage idempotency

    3. Uber and Booking.com’s ecosystem was originally JVM-based but they expanded to support Python models/scripts. Spotify made heavy use of Scala in the first iteration of their platform until they received feedback like:some ML engineers would never consider adding Scala to their Python-based workflow.

      Python might be even more popular due to MLOps

    4. Spotify has a CLI that helps users build Docker images for Kubeflow Pipelines components. Users rarely need to write Docker files.

      Spotify approach towards writing Dockerfiles for Kubeflow Pipelines

    5. Most serving systems are built in-house, I assume for similar reasons as a feature store — there weren’t many serving tools until recently and these companies have stringent production requirements.

      The reason of many feature stores and model serving tools built in house, might be, because there were not many open-source tools before

    6. Models require a dedicated system because their behavior is determined not only by code, but also by the training data, and hyper-parameters. These three aspects should be linked to the artifact, along with metrics about performance on hold-out data.

      Why model registry is a must in MLOps

    7. five ML platform components stand out which are indicated by the green boxes in the diagram below
      1. Feature store
      2. Workflow orchestration
      3. Model registry
      4. Model serving
      5. Model quality monitoring
    1. we employed a three-stage strategy for validating and deploying the latest binary of the Real-time Prediction Service: staging integration test, canary integration test, and production rollout. The staging integration test and canary integration tests are run against non-production environments. Staging integration tests are used to verify the basic functionalities. Once the staging integration tests have been passed, we run canary integration tests to ensure the serving performance across all production models. After ensuring that the behavior for production models will be unchanged, the release is deployed onto all Real-time Prediction Service production instances, in a rolling deployment fashion.

      3-stage strategy for validating and deploying the latest binary of the Real-time Prediction Service:

      1. Staging integration test <--- verify the basic functionalities
      2. Canary integration tests <--- ensure the serving performance across all production models
      3. Production rollout <--- deploy release onto all Real-time Prediction Service production instances, in a rolling deployment fashion
    2. We add auto-shadow configuration as part of the model deployment configurations. Real-time Prediction Service can check on the auto-shadow configurations, and distribute traffic accordingly. Users only need to configure shadow relations and shadow criteria (what to shadow and how long to shadow) through API endpoints, and make sure to add features that are needed for the shadow model but not for the primary model.

      auto-shadow configuration

    3. In a gradual rollout, clients fork traffic and gradually shift the traffic distribution among a group of models.  In shadowing, clients duplicate traffic on an initial (primary) model to apply on another (shadow) model).

      gradual rollout (model A,B,C) vs shadowing (model D,B):

    4. we built a model auto-retirement process, wherein owners can set an expiration period for the models. If a model has not been used beyond the expiration period, the Auto-Retirement workflow, in Figure 1 above, will trigger a warning notification to the relevant users and retire the model.

      Model Auto-Retirement - without it, we may observe unnecessary storage costs and an increased memory footprint

    5. For helping machine learning engineers manage their production models, we provide tracking for deployed models, as shown above in Figure 2. It involves two parts:

      Things to track in model deployment (listed below)

    6. Model deployment does not simply push the trained model into Model Artifact & Config store; it goes through the steps to create a self-contained and validated model package

      3 steps (listed below) are executed to validate the packaged model

    7. we implemented dynamic model loading. The Model Artifact & Config store holds the target state of which models should be served in production. Realtime Prediction Service periodically checks that store, compares it with the local state, and triggers loading of new models and removal of retired models accordingly. Dynamic model loading decouples the model and server development cycles, enabling faster production model iteration.

      Dynamic Model Loading technique

    8. The first challenge was to support a large volume of model deployments on a daily basis, while keeping the Real-time Prediction Service highly available.

      A typical MLOps use case

  19. Jun 2021
    1. It basically takes any command line arguments passed to entrypoint.sh and execs them as a command. The intention is basically "Do everything in this .sh script, then in the same shell run the command the user passes in on the command line".

      What is the use of this part in a Docker entry point:

      #!/bin/bash
      set -e
      
      ... code ...
      
      exec "$@"
      
  20. May 2021
    1. Kubeflow Pipelines comes to solve this problem. KFP, for short, is a toolkit dedicated to run ML Workflows (as experiments for model training) on Kubernetes, and it does it in a very clever way:Along with other ways, Kubeflow lets us define a workflow as a series of Python functions that pass results, and Artifacts for one another.For each Python function, we can define dependencies (for libs used) and Kubeflow will create a container to run each function in an isolated way, and passing any wanted object to a next step on the workflow. We can set needed resources, (as memory or GPUs) and it will provision them for our workflow step. It feels like magic.Once you’ve ran your pipeline, you will be able to see it in a nice UI, like this:

      Brief explanation of Kubeflow Pipelines

    2. Vertex AI came from the skies to solve our MLOps problem with a managed — and reasonably priced—alternative. Vertex AI comes with all the AI Platform classic resources plus a ML metadata store, a fully managed feature store, and a fully managed Kubeflow Pipelines runner.

      Vertex AI - Google Cloud’s new unified ML platform

    1. In short, MLflow makes it far easier to promote models to API endpoints on various cloud vendors compared to Kubeflow, which can do this but only with more development effort.

      MLflow seems to be much easier

    2. Bon Appétit?

      Quick comparison of MLflow and Kubeflow (check below the annotation)

    3. MLflow is a single python package that covers some key steps in model management. Kubeflow is a combination of open-source libraries that depends on a Kubernetes cluster to provide a computing environment for ML model development and production tools.

      Brief comparison of MLflow and Kubeflow

  21. Apr 2021
    1. To summarize, implementing ML in a production environment doesn't only mean deploying your model as an API for prediction. Rather, it means deploying an ML pipeline that can automate the retraining and deployment of new models. Setting up a CI/CD system enables you to automatically test and deploy new pipeline implementations. This system lets you cope with rapid changes in your data and business environment. You don't have to immediately move all of your processes from one level to another. You can gradually implement these practices to help improve the automation of your ML system development and production.

      The ideal state of MLOps in a project (2nd level)

    1. On the median case, Colab is going to assign users a K80, and the GTX 1080 is around double the speed, which does not stack up particularly well for Colab. However, on occasion, when a P100 is assigned, the P100 is an absolute killer GPU (again, for FREE).

      Some of the GPUs from Google Colab are outstanding.

    1. With Spark 3.1, the Spark-on-Kubernetes project is now considered Generally Available and Production-Ready.

      With Spark 3.1 k8s becomes the right option to replace YARN

  22. Mar 2021
    1. We use Prometheus to collect time-series metrics and Grafana for graphs, dashboards, and alerts.

      How Prometheus and Grafana can be used to collect information from running ML on K8s

    2. large machine learning job spans many nodes and runs most efficiently when it has access to all of the hardware resources on each node. This allows GPUs to cross-communicate directly using NVLink, or GPUs to directly communicate with the NIC using GPUDirect. So for many of our workloads, a single pod occupies the entire node.

      The way OpenAI runs large ML jobs on K8s

    1. We use Kubernetes mainly as a batch scheduling system and rely on our autoscaler to dynamically scale up and down our cluster — this lets us significantly reduce costs for idle nodes, while still providing low latency while iterating rapidly.
    2. For high availability, we always have at least 2 masters, and set the --apiserver-count flag to the number of apiservers we’re running (otherwise Prometheus monitoring can get confused between instances).

      Tip for high availability:

      • have at least 2 masters
      • set --apiserver-count flag to the number of running apiservers
    3. We’ve increased the max etcd size with the --quota-backend-bytes flag, and the autoscaler now has a sanity check not to take action if it would terminate more than 50% of the cluster.

      If we've more than 1k nodes, etcd's hard storage limit might stop accepting writes

    4. Another helpful tweak was storing Kubernetes Events in a separate etcd cluster, so that spikes in Event creation wouldn’t affect performance of the main etcd instances.

      Another trick apart from tweaking default settings of Fluentd & Datadog

    5. The root cause: the default setting for Fluentd’s and Datadog’s monitoring processes was to query the apiservers from every node in the cluster (for example, this issue which is now fixed). We simply changed these processes to be less aggressive with their polling, and load on the apiservers became stable again:

      Default settings of Fluentd and Datadog might not be suited for running many nodes

    6. We then moved the etcd directory for each node to the local temp disk, which is an SSD connected directly to the instance rather than a network-attached one. Switching to the local disk brought write latency to 200us, and etcd became healthy!

      One of the solutions for etcd using only about 10% of the available IOPS. It was working till about 1k nodes

  23. Feb 2021
    1. Consider the amount of data and the speed of the data, if low latency is your priority use Akka Streams, if you have huge amounts of data use Spark, Flink or GCP DataFlow.

      For low latency = Akka Streams

      For huge amounts of data = Spark, Flink or GCP DataFlow

    2. As we mentioned before, the majority of machine learning implementations are based on running model serving as a REST service, which might not be appropriate for the high volume data processing or usage of the streaming system, which requires re coding/starting systems for model update, for example, TensorFlow or Flink. Model as Data is a great fit for big data pipelines. For online inference, it is quite easy to implement, you can store the model anywhere (S3, HDFS…), read it into memory and call it.

      Model as Data <--- more appropriate approach than REST service for serving big data pipelines

    3. The most common way to deploy a trained model is to save into the binary format of the tool of your choice, wrap it in a microservice (for example a Python Flask application) and use it for inference.

      Model as Code <--- the most common way of deploying ML models

    1. When we are providing our API endpoint to frontend team we need to ensure that we don’t overwhelm them with preprocessing technicalities.We might not always have a Python backend server (eg. Node.js server) so using numpy and keras libraries, for preprocessing, might be a pain.If we are planning to serve multiple models then we will have to create multiple TensorFlow Serving servers and will have to add new URLs to our frontend code. But our Flask server would keep the domain URL same and we only need to add a new route (a function).Providing subscription-based access, exception handling and other tasks can be carried out in the Flask app.

      4 reasons why we might need Flask apart from TensorFlow serving

    1. Next, imagine you have more models to deploy. You have three optionsLoad the models into the existing cluster — having one cluster serve all models.Spin up a new cluster to serve each model — having multiple clusters, one cluster serves one model.Combination of 1 and 2 — having multiple clusters, one cluster serves a few models.The first option would not scale, because it’s just not possible to load all models into one cluster as the cluster has limited resources.The second option will definitely work but it doesn’t sound like an effective process, as you need to create a set of resources every time you have a new model to deploy. Additionally, how do you optimize the usage of resources, e.g., there might be unutilized resources in your clusters that could potentially be shared by the rest.The third option looks promising, you can manually choose the cluster to deploy each of your new models into so that all the clusters’ resource utilization is optimal. The problem is you have to manuallymanage it. Managing 100 models using 25 clusters can be a challenging task. Furthermore, running multiple models in a cluster can also cause a problem as different models usually have different resource utilization patterns and can interfere with each other. For example, one model might use up all the CPU and the other model won’t be able to serve anymore.Wouldn’t it be better if we had a system that automatically orchestrates model deployments based on resource utilization patterns and prevents them from interfering with each other? Fortunately, that is exactly what Kubernetes is meant to do!

      Solution for deploying lots of ML models

    1. If you’re running lots of deployments of models then it becomes important to record which versions were deployed and when. This is needed to be able to go back to specific versions. Model registries help with this problem by providing ways to store and version models.

      Model Registries <--- way to handle multiple ML models in production

    1. The benefits of applying GitOps best practices are far reaching and provide:

      The 6 provided benefits also explain GitOps in simple terms

    2. GitOps is a way to do Kubernetes cluster management and application delivery.  It works by using Git as a single source of truth for declarative infrastructure and applications. With GitOps, the use of software agents can alert on any divergence between Git with what's running in a cluster, and if there's a difference, Kubernetes reconcilers automatically update or rollback the cluster depending on the case. With Git at the center of your delivery pipelines, developers use familiar tools to make pull requests to accelerate and simplify both application deployments and operations tasks to Kubernetes.

      Other definition of GitOps (source):

      GitOps is a way of implementing Continuous Deployment for cloud native applications. It focuses on a developer-centric experience when operating infrastructure, by using tools developers are already familiar with, including Git and Continuous Deployment tools.

  24. Jan 2021
    1. Different data sources are better suited for different types of data transformations and provide access to different data quantities at different freshnesses

      Comparison of data sources

      • Data warehouses / lakes (such as Snowflake or Redshift) tend to hold a lot of information but with low data freshness (hours or days). They can be a gold mine, but are most useful for large-scale batch aggregations with low freshness requirements, such as “number of lifetime transactions per user.”
      • Transactional data sources (such as MongoDB or MySQL) usually store less data at a higher freshness and are not built to process large analytical transformations. They’re better suited for small-scale aggregations over limited time horizons, like the number of orders placed by a user in the past 24 hrs.
      • Data streams (such as Kafka) store high-velocity events and provide them in near real-time (within milliseconds). In common setups, they retain 1-7 days of historical data. They are well-suited for aggregations over short time-windows and simple transformations with high freshness requirements, like calculating that “trailing count over the last 30 minutes” feature described above.
      • Prediction request data is raw event data that originates in real-time right before an ML prediction is made, e.g. the query a user just entered into the search box. While the data is limited, it’s often as “fresh” as can be and contains a very predictive signal. This data is provided with the prediction request and can be used for real-time calculations like finding the similarity score between a user’s search query and documents in a search corpus.
    2. MLOps platforms like Sagemaker and Kubeflow are heading in the right direction of helping companies productionize ML. They require a fairly significant upfront investment to set up, but once properly integrated, can empower data scientists to train, manage, and deploy ML models. 

      Two popular MLOps platforms: Sagemaker and Kubeflow