Improved GPU utilization, better LLM storage solutions, and prompt caching features in deployment tools like KServe will continue to make it easier to deploy a variety of models.
MLOps prediction for 2025
A cluster with 4096 IP addresses can deploy at most 1024 models, assuming each InferenceService has 4 pods on average (two transformer replicas and two predictor replicas).
Kubernetes clusters have a maximum IP address limitation
According to Kubernetes best practice, a node shouldn't run more than 100 pods.
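To make the arithmetic behind these two limits concrete, here is a small Python sketch. The 4096-IP, 4-pods-per-InferenceService, and 100-pods-per-node figures come from the text above; the node count and the max_models helper are purely illustrative assumptions, not part of KServe.

```python
# Back-of-the-envelope sketch of the IP-address and pods-per-node limits described above.
# Figures (4096 IPs, 4 pods per InferenceService, 100 pods per node) come from the text;
# the 50-node cluster and the helper function are hypothetical.

def max_models(cluster_ips: int, pods_per_isvc: int, nodes: int, pods_per_node: int = 100) -> int:
    """Upper bound on 'one model, one InferenceService' deployments."""
    ip_bound = cluster_ips // pods_per_isvc               # every pod needs its own IP address
    pod_bound = (nodes * pods_per_node) // pods_per_isvc  # pods-per-node best practice
    return min(ip_bound, pod_bound)

# A cluster with 4096 IP addresses and, say, 50 nodes:
print(max_models(cluster_ips=4096, pods_per_isvc=4, nodes=50))  # -> 1024 (IP limit binds before the pod limit of 1250)
```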
Each model’s resource overhead is 1 CPU and 1 GB of memory. Deploying many models with one InferenceService per model will quickly use up a cluster's computing resources. With multi-model serving, these models can be loaded into one InferenceService, and each model's average overhead drops to 0.1 CPU and 0.1 GB of memory.
In this case, the multi-model approach reduces the per-model resource overhead by 90%.
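The sketch below works through that reduction for a fleet of models, using the per-model figures quoted above (1 CPU / 1 GB versus 0.1 CPU / 0.1 GB). The overhead helper and the 2000-model example are hypothetical, chosen only to show the aggregate effect.

```python
# Illustrative comparison of aggregate resource overhead for N models,
# using the per-model figures from the text. The function and the
# 2000-model example are assumptions for illustration only.

def overhead(num_models: int, cpu_per_model: float, mem_gb_per_model: float):
    """Return (total CPUs, total GB of memory) consumed by model overhead."""
    return num_models * cpu_per_model, num_models * mem_gb_per_model

one_per_isvc = overhead(2000, cpu_per_model=1.0, mem_gb_per_model=1.0)  # (2000 CPUs, 2000 GB)
multi_model  = overhead(2000, cpu_per_model=0.1, mem_gb_per_model=0.1)  # (200 CPUs, 200 GB)

saving = 1 - multi_model[0] / one_per_isvc[0]
print(f"CPU overhead reduced by {saving:.0%}")  # -> 90%
```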
Multi-model serving is designed to address the three types of limitations that KServe runs into: compute resource overhead, the maximum number of pods per node, and the maximum number of IP addresses in a cluster.
Benefits of multi-model serving
While you get the benefit of better inference accuracy and data privacy by building models for each use case, it is more challenging to deploy thousands to hundreds of thousands of models on a Kubernetes cluster.
With more separation, comes the problem of distribution
We will be releasing KServe 0.7 outside of the Kubeflow Project and will provide more details on how to migrate from KFServing to KServe with minimal disruptions.
KFServing is now KServe