Multi-Model Runners
⏱️ Estimated time: ~30 minutes
Note: this feature is still in beta.
Overview
ExaDeploy supports dynamically loading multiple models onto a single machine. In this guide, we will demonstrate how to dynamically load 60 different BERT models onto a single GPU with 16GB of available memory. Multi-model runners let ExaDeploy pack models onto runners efficiently, without requiring you to specify which models must be run together.
This guide assumes that you have a working ExaDeploy system set up; for help setting up ExaDeploy, see our quickstart guides for AWS and GCP.
Loading Multiple Models
In this example, we will be using the bert-base-cased-finetuned-mrpc model downloaded from HuggingFace. After uploading the TorchScript plugin, we need to upload all of the models to the module repository.
In the code snippets below, we register the same model 100 times for demonstration purposes, and we dynamically load 60 of these models onto one runner. In a practical application, these could be 100 different fine-tuned BERT models.
import torch
import transformers
import exa
# Load the fine-tuned BERT model and move it to the GPU.
model = transformers.AutoModelForSequenceClassification.from_pretrained(
            "bert-base-cased-finetuned-mrpc", return_dict=False
        ).to(torch.device('cuda'))
model = model.half()  # fp16 precision so more models fit on one runner
example_inputs_paraphrase = ...
torchscript_path = ...
# Trace the model to TorchScript and save it to disk.
traced_model_gpu = torch.jit.trace(model, example_inputs_paraphrase)
traced_model_gpu.save(torchscript_path)
# Upload TorchScript models to Module Repository.
with exa.ModuleRepository(
    ...
) as repo:
    for i in range(100):
        repo.register_torchscript(
            f"BertGpu_{i}",
            torchscript_file = torchscript_path,
            input_names = ["input_ids", "attention_mask", "token_type_ids"],
            output_names = ["output_tensor"],
            plugin = "LibTorchPlugin:v1.12",
        )
To enable multi-model runners, we need to add the additional fields placement_group_affinity_key and max_placement_group_count to the placement group spec when creating a new session. ExaDeploy will then run sessions with matching placement_group_affinity_key fields on the same runners.
- placement_group_affinity_key can be an arbitrary string; ExaDeploy will try to place sessions with the same affinity key onto the same machine.
- max_placement_group_count is the maximum number of placement groups that can be loaded onto a single runner. In the example below, we have a single model per placement group and 60 placement groups in total. We arrived at this number experimentally: 60 placement groups with this BERT model fit on a single runner with 16GB of GPU memory. A rough memory estimate follows this list.
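To sanity-check that experimentally determined number, here is a rough back-of-envelope estimate (an illustration we added, not part of the original measurement): bert-base has roughly 110 million parameters, so one fp16 copy needs about 0.22GB of GPU memory for its weights alone.
# Rough, illustrative estimate only. Real usage also includes activations, the CUDA
# context, and framework overhead, which is why the value 60 was confirmed empirically.
approx_params = 110e6       # approximate parameter count for bert-base
bytes_per_param = 2         # fp16
per_model_gb = approx_params * bytes_per_param / 1e9   # ~0.22 GB per model
total_gb = 60 * per_model_gb                           # ~13.2 GB for 60 models
print(f"~{per_model_gb:.2f} GB/model, ~{total_gb:.1f} GB total")
Around 13GB of weights leaves some headroom within the 16GB budget for activations and per-framework overhead.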
It may make sense to use a different placement_group_affinity_key value for each additional model framework (e.g. PyTorch, TensorFlow, TRT), because each framework carries some memory overhead of its own.
Each additional class of models might also use a different value for max_placement_group_count; a larger model would likely need a smaller value than 60.
As an example, if we also had a PyTorch bert-large-cased model, we might use max_placement_group_count=30 with the same placement_group_affinity_key, as sketched below.
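For illustration, a hypothetical bert-large placement group spec could sit alongside the bert-base one as follows. The BertLargeGpu_0 module tag is a placeholder we introduce here, and the remaining PlacementGroupSpec fields are omitted for brevity (they appear as ... in the full example below).
# Hypothetical sketch: both specs share an affinity key, so their sessions can be
# co-located on the same runner, while the larger model allows fewer placement groups.
bert_base_pg = exa.PlacementGroupSpec(
    module_contexts = [exa.ModuleContextSpec(module_tag = "BertGpu_0")],
    placement_group_affinity_key = "affinity-key-0",
    max_placement_group_count = 60,
    # Other fields omitted here for brevity; see the full example below.
)
bert_large_pg = exa.PlacementGroupSpec(
    module_contexts = [exa.ModuleContextSpec(module_tag = "BertLargeGpu_0")],  # placeholder tag
    placement_group_affinity_key = "affinity-key-0",
    max_placement_group_count = 30,  # larger model, so fewer placement groups per runner
)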
An example of session creation for multi-model runners is shown below:
import random
# Run model on remote ExaDeploy runner.
# We choose a random subset of models for our sessions; this emphasizes that
# the models will be dynamically loaded.
module_idx_to_run = random.sample(range(100), 60)
# Store sessions in this list, forcing all 60 sessions to
# persist simultaneously.
sessions = []
for i in module_idx_to_run:
    module_tag = f"BertGpu_{i}"
    placement_group = exa.PlacementGroupSpec(
        module_contexts = [
            exa.ModuleContextSpec(module_tag = module_tag),
        ],
        placement_group_affinity_key = "affinity-key-0",
        max_placement_group_count = 60,
        ...
    )
    # Create a session, which will be loaded onto the runner.
    sessions.append(exa.Session(
            placement_groups = {"default": placement_group},
            ...
        ))
    # Run an inference with this session (inference call elided).
In this example, we have max_placement_group_count=60 and runner_fraction=0.001. max_placement_group_count caps how many placement groups (here, one model each) a single runner can hold, while runner_fraction represents each session's share of the runner's resources.
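As a rough illustration of how these two knobs interact (our reading of the configuration, not code from the guide), the two limits can be compared directly; with these values, the memory-driven placement group cap is the one that binds.
# Illustrative arithmetic only: compare the two co-location limits.
max_placement_group_count = 60    # memory-driven cap used in this guide
runner_fraction = 0.001           # per-session resource claim used in this guide
sessions_by_fraction = int(1 / runner_fraction)   # 1000 sessions by resource accounting
sessions_per_runner = min(max_placement_group_count, sessions_by_fraction)
print(sessions_per_runner)  # 60: the placement group cap is the binding limit here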
Running kubectl get pods should confirm that all of this was accomplished using a single runner:
> kubectl get pods
NAME                                   READY   STATUS              RESTARTS   AGE
runner-00747134261729369700            1/1     Running             0          8s
Summary
In this example, we created 60 sessions with one BERT model each and performed an inference on all of them, while using only 1 runner. ExaDeploy accomplished this by dynamically loading all 60 models onto the same runner; at no point did the user need to specify which models could be loaded together.