Multi-Model Runners
⏱️ Estimated time: ~30 minutes
Note: this feature is still in beta.
Overview
ExaDeploy supports dynamically loading multiple models onto a single machine. In this guide, we will demonstrate how to dynamically load 60 different BERT models on a single GPU with 16GB of available memory. Multi-model runners let ExaDeploy load models onto runners efficiently, without the need to specify which models must be run together.
This guide assumes that you have a working ExaDeploy system set up; for help setting up ExaDeploy, see our quickstart guides for AWS and GCP.
Loading Multiple Models
In this example, we will be using the bert-base-cased-finetuned-mrpc model downloaded from HuggingFace. After uploading the TorchScript plugin, we need to register all of the models in the module repository.
In the code snippets below, we register the same model 100 times for demonstration purposes, and we will dynamically load 60 of these models onto one runner. In a practical application, these could be 100 different fine-tuned BERT models:
import torch
import transformers

import exa

model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased-finetuned-mrpc", return_dict=False
).to(torch.device('cuda'))
model = model.half()  # fp16 precision to pack more models

example_inputs_paraphrase = ...
torchscript_path = ...
traced_model_gpu = torch.jit.trace(model, example_inputs_paraphrase)
traced_model_gpu.save(torchscript_path)

# Upload TorchScript models to Module Repository.
with exa.ModuleRepository(
    ...
) as repo:
    for i in range(100):
        repo.register_torchscript(
            f"BertGpu_{i}",
            torchscript_file=torchscript_path,
            input_names=["input_ids", "attention_mask", "token_type_ids"],
            output_names=["output_tensor"],
            plugin="LibTorchPlugin:v1.12",
        )
To enable multi-model runners, we need to add two additional fields, placement_group_affinity_key and max_placement_group_count, to the placement group spec when creating a new session. ExaDeploy will then run sessions with matching placement_group_affinity_key fields on the same runners.
placement_group_affinity_key can be an arbitrary string; ExaDeploy will try to place sessions with the same affinity key onto the same machine. max_placement_group_count is the maximum number of placement groups that can be loaded on a single runner. In the example below, we have a single model per placement group and 60 placement groups total. To arrive at this number, we experimentally determined that 60 placement groups with this BERT model would fit on a single runner with 16GB of GPU memory.
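As a rough sanity check on that number (an estimate, not part of the original measurement), the weight memory alone can be approximated from the parameter count: bert-base has roughly 110 million parameters, so at fp16 each copy of the weights is about 0.22GB.
# Back-of-the-envelope estimate of weight memory only (an approximation;
# activations and framework overhead are not included).
num_params = 110e6      # approximate parameter count of bert-base
bytes_per_param = 2     # fp16
num_models = 60

weights_gb = num_params * bytes_per_param * num_models / 1e9
print(f"~{weights_gb:.1f} GB of weights")  # ~13.2 GB, leaving headroom on a 16GB GPU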
It may make sense to use a different value of placement_group_affinity_key for each additional model framework (e.g. PyTorch, TensorFlow, TRT), since each framework has some memory overhead of its own.
Each additional class of models might also use a different value for max_placement_group_count; a larger model would likely need a smaller value than 60. As an example, if we also had a PyTorch bert-large-cased model, we might use max_placement_group_count=30 with the same placement_group_affinity_key.
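To make the bert-large-cased scenario concrete, here is a minimal sketch of the two placement group specs side by side; the BertLargeGpu_0 module name is hypothetical, and the remaining fields are elided just as in the session-creation example below.
# Hypothetical sketch: bert-base and bert-large modules sharing one affinity
# key, with a smaller placement group budget for the larger model.
base_placement_group = exa.PlacementGroupSpec(
    module_contexts=[exa.ModuleContextSpec(module_tag="BertGpu_0")],
    placement_group_affinity_key="affinity-key-0",
    max_placement_group_count=60,
    # ... other fields as in the session-creation example below.
)
large_placement_group = exa.PlacementGroupSpec(
    module_contexts=[exa.ModuleContextSpec(module_tag="BertLargeGpu_0")],  # hypothetical module
    placement_group_affinity_key="affinity-key-0",  # same key: may share a runner
    max_placement_group_count=30,  # larger model, so fewer copies per runner
    # ... other fields as in the session-creation example below.
)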
An example of session creation for multi-model runners is shown below:
import random

# Run model on remote ExaDeploy runner.
# We choose a random subset of models for our sessions; this emphasizes that
# the models will be dynamically loaded.
module_idx_to_run = random.sample(range(100), 60)

# Store sessions in this list, forcing all 60 sessions to
# persist simultaneously.
sessions = []
for i in module_idx_to_run:
    module_tag = f"BertGpu_{i}"
    placement_group = exa.PlacementGroupSpec(
        module_contexts=[
            exa.ModuleContextSpec(module_tag=module_tag),
        ],
        placement_group_affinity_key="affinity-key-0",
        max_placement_group_count=60,
        ...
    )
    # Create a session, which will be loaded onto the runner.
    sessions.append(exa.Session(
        placement_groups={"default": placement_group},
        ...
    ))
    # Run an inference.
We have max_placement_group_count=60 and runner_fraction=0.001 in this example. max_placement_group_count is the number of models a single runner can support, while runner_fraction represents the runner's resource usage per session.
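As a rough illustration (assuming the per-session resource fractions on a runner must sum to at most 1.0, which is an assumption rather than something stated above), the placement group limit is the binding constraint here:
# Rough capacity check under the assumption that per-session runner_fraction
# values on one runner must sum to at most 1.0.
runner_fraction = 0.001
max_placement_group_count = 60

sessions_by_resources = int(1 / runner_fraction)            # 1000 sessions
sessions_by_placement_groups = max_placement_group_count    # 60 sessions

# The smaller limit determines how many sessions can share one runner.
print(min(sessions_by_resources, sessions_by_placement_groups))  # 60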
kubectl should confirm that all of this has been accomplished using 1 runner:
> kubectl get pods
NAME READY STATUS RESTARTS AGE
runner-00747134261729369700 1/1 Running 0 8s
Summary
In this example, we have created 60 sessions with one BERT model each and performed an inference on all of them, while using only 1 runner. ExaDeploy accomplished this by dynamically loading all 60 models onto the same runner; at no point did the user need to specify which models could be loaded together.