Multi-Model Runners

⏱️ Estimated time: ~30 minutes

Note: this feature is still in beta.

Overview

ExaDeploy supports dynamically loading multiple models onto a single machine. In this guide, we demonstrate how to dynamically load 60 different BERT models onto a single GPU with 16GB of available memory. Multi-model runners let ExaDeploy pack models onto runners efficiently, without requiring you to specify which models must be run together.

This guide assumes that you have a working ExaDeploy system set up; for help setting up ExaDeploy, see our quickstart guides for AWS and GCP.

Loading Multiple Models

In this example, we will use the bert-base-cased-finetuned-mrpc model downloaded from Hugging Face. Once the TorchScript plugin has been uploaded, we upload all of the models to the module repository:

note

In the code snippets below, we register the same model 100 times for demonstration purposes and dynamically load 60 of those copies onto one runner. In a practical application, these could be 100 different fine-tuned BERT models.

register_model.py
import torch
import transformers
import exa

# Load the fine-tuned BERT model onto the GPU.
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased-finetuned-mrpc", return_dict=False
).to(torch.device("cuda"))
model = model.half()  # fp16 precision to pack more models onto the GPU

# Trace the model to TorchScript with representative example inputs
# (elided here; see the sketch after this snippet for one way to build them).
example_inputs_paraphrase = ...
torchscript_path = ...
traced_model_gpu = torch.jit.trace(model, example_inputs_paraphrase)
traced_model_gpu.save(torchscript_path)

# Upload TorchScript models to the Module Repository.
with exa.ModuleRepository(
    ...
) as repo:
    for i in range(100):
        repo.register_torchscript(
            f"BertGpu_{i}",
            torchscript_file=torchscript_path,
            input_names=["input_ids", "attention_mask", "token_type_ids"],
            output_names=["output_tensor"],
            plugin="LibTorchPlugin:v1.12",
        )
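
For reference, here is one way the traced example inputs could be constructed with the Hugging Face tokenizer. This is an illustrative sketch only; the sentence pair and exact preprocessing shown here are not part of the original example.

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")

# MRPC is a sentence-pair paraphrase task, so we tokenize two sentences together.
encoding = tokenizer(
    "The company said quarterly revenue rose 10 percent.",
    "Quarterly revenue increased by 10 percent, the company said.",
    return_tensors="pt",
)

# torch.jit.trace passes inputs positionally, so order them to match BERT's
# forward() signature: input_ids, attention_mask, token_type_ids.
example_inputs_paraphrase = (
    encoding["input_ids"].to("cuda"),
    encoding["attention_mask"].to("cuda"),
    encoding["token_type_ids"].to("cuda"),
)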

To enable multi-model runners, we need to add two fields, placement_group_affinity_key and max_placement_group_count, to the placement group spec when creating a new session. ExaDeploy will then run sessions with matching placement_group_affinity_key fields on the same runners.

  • placement_group_affinity_key can be an arbitrary string; ExaDeploy will try to place sessions with the same affinity key onto the same machine.
  • max_placement_group_count is the maximum number of placement groups that can be loaded onto a single runner. In the example below, we have a single model per placement group and 60 placement groups total. We arrived at this number experimentally: 60 placement groups with this BERT model fit on a single runner with 16GB of GPU memory.
tip

It may make sense to use a different placement_group_affinity_key value for each model framework (e.g. PyTorch, TensorFlow, TRT), because each framework carries some memory overhead.

Each class of models might also use a different value for max_placement_group_count; a larger model would likely need a value smaller than 60.

As an example, if we also had a PyTorch bert-large-cased model, we might use max_placement_group_count=30 with the same placement_group_affinity_key, as sketched below.
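
A minimal sketch of such a placement group spec, assuming a hypothetical module tag BertLargeGpu_0 registered the same way as the BERT models above; the omitted fields mirror the session-creation example below.

large_placement_group = exa.PlacementGroupSpec(
    module_contexts=[
        # "BertLargeGpu_0" is a hypothetical module tag for the larger model.
        exa.ModuleContextSpec(module_tag="BertLargeGpu_0"),
    ],
    # Same affinity key as the smaller BERT models, so sessions can share runners.
    placement_group_affinity_key="affinity-key-0",
    # Fewer copies of the larger model fit on one runner.
    max_placement_group_count=30,
    # ... other fields as in the session-creation example below
)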

An example of session creation for multi-model runners is shown below:

session_creation.py
import random

import exa

# Run the models on remote ExaDeploy runners.
# We choose a random subset of models for our sessions; this emphasizes that
# the models will be dynamically loaded.
module_idx_to_run = random.sample(range(100), 60)

# Store sessions in this list, forcing all 60 sessions to
# persist simultaneously.
sessions = []

for i in module_idx_to_run:
    module_tag = f"BertGpu_{i}"
    placement_group = exa.PlacementGroupSpec(
        module_contexts=[
            exa.ModuleContextSpec(module_tag=module_tag),
        ],
        placement_group_affinity_key="affinity-key-0",
        max_placement_group_count=60,
        ...
    )

    # Create a session, which will be loaded onto the runner.
    sessions.append(exa.Session(
        placement_groups={"default": placement_group},
        ...
    ))

    # Run an inference.
note

We have max_placement_group_count=60 and runner_fraction=0.001 in this example. max_placement_group_count caps how many placement groups (here, one model each) a single runner can host, while runner_fraction represents the fraction of a runner's resources each session consumes; with 60 sessions at 0.001 each, the sessions together request only a small fraction (0.06) of a runner, so both limits allow a single runner. A sketch of where runner_fraction might be set follows below.
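
Since runner_fraction does not appear in the snippet above, here is a minimal sketch assuming it is set on exa.PlacementGroupSpec alongside the other fields shown; this placement is an assumption, so consult your ExaDeploy SDK reference for the exact field location.

placement_group = exa.PlacementGroupSpec(
    module_contexts=[
        exa.ModuleContextSpec(module_tag=module_tag),
    ],
    placement_group_affinity_key="affinity-key-0",
    max_placement_group_count=60,  # up to 60 placement groups per runner
    runner_fraction=0.001,         # assumed field: each session uses 0.1% of a runner
    # ... other fields as in session_creation.py above
)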

Running kubectl should confirm that all of this was accomplished with a single runner:

> kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
runner-00747134261729369700   1/1     Running   0          8s

Summary

In this example, we created 60 sessions with one BERT model each and performed an inference on all of them while using only one runner. ExaDeploy accomplished this by dynamically loading all 60 models onto the same runner; at no point did the user need to specify which models could be loaded together.