Efficient AI Model Serving with KServe ModelMesh and Kubernetes Persistent Volumes

Sep 25, 2024

1. Introduction

As AI models grow more complex and require frequent updates, the infrastructure that serves them must evolve to keep pace. Storing model files in cloud object storage is common, but fetching large models over the network at deployment time can add noticeable startup latency. KServe's ModelMesh Serving addresses these challenges with a scalable model serving framework that dynamically loads and unloads models as needed. Backing it with Kubernetes Persistent Volumes keeps model files close to the serving pods, further reducing load latency and improving overall efficiency.

2. Step-by-Step Guide

This section provides a detailed guide on configuring KServe ModelMesh Serving with Kubernetes Persistent Volumes. Follow these steps to set up an efficient AI model serving environment.

2.1. Prerequisites

Before starting, ensure you have the following:

  • A Kubernetes cluster with admin privileges (or Minikube) with at least 4 CPUs and 8 GB of memory
  • kubectl and kustomize (v4.0.0+) installed
  • A "Quickstart" installation of ModelMesh Serving

2.2. Create a Persistent Volume Claim (PVC)

To begin, create a Persistent Volume Claim (PVC) to allocate storage from which ModelMesh can load model files. The PVC must live in the same namespace as the InferenceService that will reference it (the modelmesh-serving namespace for a Quickstart install). Apply the following manifest with kubectl:

kubectl apply -f - <<EOF
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "my-models-pvc"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
EOF

Verify that the PVC is created and bound to a persistent volume:

kubectl get pvc

# Output example:
# NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
# my-models-pvc   Bound    pvc-783726ab-9fd3-47f3-8c7d-bf7822d6d7f8   15Gi       RWX            retain-file-gold   2m
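
The reported capacity and storage class depend on how the volume was provisioned; a dynamically provisioned volume will usually match the 1Gi request, while a pre-provisioned volume (as in the example output above) can be larger. If the PVC stays in Pending, check that your cluster offers a storage class supporting the ReadWriteMany access mode requested above:

# List available storage classes and their provisioners
kubectl get storageclass

# Inspect the PVC's events for provisioning or binding errors
kubectl describe pvc my-models-pvc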

2.3. Create a Pod to Access the PVC

Next, create a helper pod that mounts the PVC as a volume so you can copy model files onto it:

kubectl apply -f - <<EOF
---
apiVersion: v1
kind: Pod
metadata:
  name: "pvc-access"
spec:
  containers:
    - name: main
      image: ubuntu
      command: ["/bin/sh", "-ec", "sleep 10000"]
      volumeMounts:
        - name: "my-pvc"
          mountPath: "/mnt/models"
  volumes:
    - name: "my-pvc"
      persistentVolumeClaim:
        claimName: "my-models-pvc"
EOF

Confirm that the pod is running:

kubectl get pods | grep -E "pvc|STATUS"

# Output example:
# NAME                 READY   STATUS    RESTARTS   AGE
# pvc-access           1/1     Running   0          2m12s

2.4. Store the Model on the Persistent Volume

Download the example MNIST model (a scikit-learn SVM classifier serialized with joblib):

curl -sOL https://github.com/kserve/modelmesh-minio-examples/raw/main/sklearn/mnist-svm.joblib

Copy the model to the pod:

kubectl cp mnist-svm.joblib pvc-access:/mnt/models/

Verify the model upload:

kubectl exec -it pvc-access -- ls -alr /mnt/models/

# Expected output:
# total 356
# -rw-r--r-- 1    501 staff      344917 Sep 17 09:20 mnist-svm.joblib
# drwxr-xr-x 3 nobody 4294967294   4096 Sep 17 09:20 ..
# drwxr-xr-x 2 nobody 4294967294   4096 Sep 17 09:20 .
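
The pvc-access pod is only a helper for getting files onto the volume; once the model is in place, you can keep it around for copying additional models or delete it:

# Optional: remove the helper pod once the model files have been copied
kubectl delete pod pvc-access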

2.5. Configure ModelMesh Serving

Create a ConfigMap named model-serving-config in the namespace where ModelMesh Serving is installed so that InferenceServices are allowed to load models from arbitrary PVCs:

kubectl apply -f - <<EOF
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
data:
  config.yaml: |
    allowAnyPVC: true
EOF
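
ModelMesh Serving should pick up changes to this user ConfigMap without a restart, though propagation can take a short while. To confirm the setting was stored as expected:

# Print the embedded config.yaml from the ConfigMap
kubectl get configmap model-serving-config -o jsonpath='{.data.config\.yaml}'

# Expected output:
# allowAnyPVC: true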

2.6. Deploy the Inference Service

Deploy the MNIST model as an InferenceService that references the model file on the PVC:

kubectl apply -f - <<EOF
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-mnist
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storage:
        parameters:
          type: pvc
          name: my-models-pvc
        path: mnist-svm.joblib
EOF

Check the service status:

kubectl get isvc

# Output example:
# NAME            URL                                               READY   PREV   LATEST   AGE
# sklearn-mnist   grpc://modelmesh-serving.modelmesh-serving:8033   True                    35s
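
It can take a little while for the service to become READY, since ModelMesh first has to start a serving runtime pod for the sklearn model format and then load the model from the PVC. If the service stays unready, the following checks usually point at the cause (pod names and events will differ in your cluster):

# A serving runtime pod (mlserver-based for sklearn) should appear alongside pvc-access
kubectl get pods

# Review the InferenceService's conditions and events
kubectl describe isvc sklearn-mnist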

2.7. Run an Inference Request

Forward the REST inference port (8008) of the modelmesh-serving service to your local machine:

kubectl port-forward --address 0.0.0.0 service/modelmesh-serving 8008 &

Send an inference request:

MODEL_NAME="sklearn-mnist"

curl -X POST -k "http://localhost:8008/v2/models/${MODEL_NAME}/infer" -d '{"inputs": [{ "name": "predict", "shape": [1, 64], "datatype": "FP32", "data": [0.0, 0.0, 3.0, 10.0, 15.0, 16.0, 2.0, 0.0, 0.0, 2.0, 14.0, 16.0, 11.0, 15.0, 7.0, 0.0, 0.0, 7.0, 16.0, 3.0, 5.0, 15.0, 4.0, 0.0, 0.0, 4.0, 14.0, 10.0, 12.0, 13.0, 0.0, 0.0, 0.0, 0.0, 3.0, 13.0, 15.0, 12.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.0, 15.0, 15.0, 5.0, 0.0, 0.0, 0.0, 0.0, 15.0, 15.0, 15.0, 6.0, 0.0, 0.0, 0.0, 0.0, 10.0, 12.0, 11.0, 0.0, 0.0]}]}'

The response should look similar to the following (the suffix on model_name is generated per deployment and will differ):

{
  "model_name": "sklearn-mnist__isvc-2d5cba6382",
  "outputs": [
    { "name": "predict", "datatype": "INT64", "shape": [1], "data": [7] }
  ]
}
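
When you are finished testing, stop the background port-forward started earlier (it was launched as a background shell job, so its job number may vary):

# Find the port-forward job and stop it
jobs
kill %1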

3. Conclusion

By configuring KServe ModelMesh Serving to load models from Kubernetes Persistent Volumes, you can noticeably improve the efficiency of your AI model deployments. This approach reduces the latency of fetching model files from remote object storage at load time and keeps the serving layer scalable as the number of models grows.

Abhishek Chauhan