Inferencing with GPUs
By default, we support accelerating training using CUDA (on supported NVIDIA GPUs). However, you may also use CUDA to accelerate the inference workload and achieve much higher API throughput than what’s possible using a CPU, especially for DistilBERT-based models.
Prerequisites
Currently, GPU inference has only been tested on Linux hosts. Specifically, we require a Linux installation on an x86_64 architecture with a kernel version of at least 3.10. To check your current kernel version, run uname -a.
- NVIDIA drivers and CUDA >11.3. Please use the official (proprietary) drivers instead of the open-source nouveau one.
- If you want to use Docker containerisation, then Docker 19.03 or newer is required.
- If your BentoService has monitoring capabilities enabled (i.e. it runs as a docker-compose network) then you also need to install the NVIDIA Container Runtime. Arch Linux users can install nvidia-container-runtime from the AUR.
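The kernel and architecture requirements can also be checked from Python, which may be handy in a deployment script. The following is a minimal sketch using only the standard library; the helper names parse_kernel_release and host_meets_baseline are hypothetical and not part of KTT or BentoML, and the check does not cover the driver, CUDA or Docker requirements.

```python
import platform
import re
from typing import Tuple


def parse_kernel_release(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def host_meets_baseline(system: str, machine: str, release: str) -> bool:
    """Return True for Linux on x86_64 with kernel 3.10 or newer."""
    # Versions are compared as integer tuples, so 3.9 < 3.10 as intended.
    return (
        system == "Linux"
        and machine == "x86_64"
        and parse_kernel_release(release) >= (3, 10)
    )


if __name__ == "__main__":
    ok = host_meets_baseline(
        platform.system(), platform.machine(), platform.release()
    )
    print("baseline met" if ok else "baseline not met")
```

Comparing the version as a tuple of integers avoids the pitfall of string comparison, where "3.9" would sort after "3.10".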
GPU-based inference using Bentos
If you decide to stop at raw BentoServices for deployment to your inference server, you simply need to ensure that the server has direct access to a compatible NVIDIA GPU and install all drivers and dependencies accordingly. KTT-generated BentoServices are designed to automatically take advantage of the first-found NVIDIA GPU.
To verify that the GPU is being used, run nvidia-smi. If the Bento is using your GPU:
- Video memory utilisation will be higher than idle. For DistilBERT-based models, expect usage of around 4GB.
- There is a python process listed in the process list.
> nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31       Driver Version: 465.31       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   49C    P8     6W /  N/A |    753MiB /  6078MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    179346      C   /opt/conda/bin/python             745MiB |
+-----------------------------------------------------------------------------+
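The same check can be scripted instead of read by eye. The sketch below assumes nvidia-smi is on the PATH and uses its CSV query mode (--query-compute-apps with --format=csv,noheader,nounits) to list compute processes. The names GpuProcess, parse_compute_apps, python_gpu_processes and query_gpu_processes are hypothetical helpers, not part of KTT or BentoML.

```python
import subprocess
from typing import List, NamedTuple

# nvidia-smi invocation that prints one "pid, process_name, used_memory" row
# per compute process, without a header and without the "MiB" unit suffix.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


class GpuProcess(NamedTuple):
    pid: int
    name: str
    used_mib: int


def parse_compute_apps(csv_text: str) -> List[GpuProcess]:
    """Parse the CSV rows printed by the QUERY command."""
    processes = []
    for line in csv_text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) < 3:
            continue  # blank or malformed row
        pid, used = parts[0], parts[-1]
        if not (pid.isdigit() and used.isdigit()):
            continue  # value reported as unavailable
        name = ",".join(parts[1:-1])
        processes.append(GpuProcess(int(pid), name, int(used)))
    return processes


def python_gpu_processes(csv_text: str) -> List[GpuProcess]:
    """Keep only processes whose executable name contains 'python'."""
    return [
        proc
        for proc in parse_compute_apps(csv_text)
        if "python" in proc.name.rsplit("/", 1)[-1]
    ]


def query_gpu_processes() -> List[GpuProcess]:
    """Run nvidia-smi on this machine and return its Python GPU processes."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return python_gpu_processes(result.stdout)
```

Calling query_gpu_processes() on the inference host should return at least one entry while the Bento is serving on the GPU; an empty list suggests that inference is still running on the CPU.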
GPU-based inference for Dockerised services
Docker containers can simply be deployed to a server without needing to care about dependencies. One only needs to ensure that the host machine itself satisfies the hardware requirements and has the correct drivers installed.
Without monitoring capabilities
Services built without monitoring capabilities are Dockerised into single Docker images. They can be run directly from the terminal using your typical docker run command.
Due to a recent systemd architectural redesign, we need a workaround to grant hardware access to the container. Instead of starting the Docker container as usual, please add --gpus all and specific --device options as follows:
> docker run --gpus all --device /dev/nvidia0 --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools --device /dev/nvidia-modeset --device /dev/nvidiactl <your arguments>
In <your arguments>, fill in your usual Docker arguments such as the image ID and port forwarding. Again, you can verify that the GPU is being utilised by running nvidia-smi and looking out for the signs described above. Of course, you should run it on the host machine’s terminal instead of within the shell of the container.
Note
Due to how BentoML GPU-enabled base images are configured, you might encounter errors like the following:
RuntimeError: Click will abort further execution because Python was configured to use ASCII as encoding for the environment. Consult https://click.palletsprojects.com/unicode-support/ for mitigation steps.
This system supports the C.UTF-8 locale which is recommended. You might be able to resolve your issue by exporting the following environment variables:
export LC_ALL=C.UTF-8
export LANG=C.UTF-8
In other words, the base images have their system locale set to ASCII, which can cause problems for Python programs that also handle Unicode. The fix is to do as the message suggests: override the base image’s system locale to UTF-8. One way to do that is through the docker run command itself, by adding -e LC_ALL='C.UTF-8' -e LANG='C.UTF-8' to the arguments list.
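When the container is launched from a script, the GPU flags, device mappings and locale overrides described in this section can be assembled in one place. This is a minimal sketch only: build_docker_run is a hypothetical helper, and the port mapping and image name in the example are placeholders to be replaced with your own arguments.

```python
import shlex
from typing import List, Sequence

# Device nodes passed through to the container, as listed in the command above.
GPU_DEVICES = (
    "/dev/nvidia0",
    "/dev/nvidia-uvm",
    "/dev/nvidia-uvm-tools",
    "/dev/nvidia-modeset",
    "/dev/nvidiactl",
)

# Locale overrides that work around the ASCII locale of the base image.
LOCALE_ENV = {"LC_ALL": "C.UTF-8", "LANG": "C.UTF-8"}


def build_docker_run(user_args: Sequence[str]) -> List[str]:
    """Return a docker run argument list with GPU access and a UTF-8 locale."""
    command = ["docker", "run", "--gpus", "all"]
    for device in GPU_DEVICES:
        command += ["--device", device]
    for key, value in LOCALE_ENV.items():
        command += ["-e", f"{key}={value}"]
    # User arguments go last so that the image reference stays at the end.
    command += list(user_args)
    return command


# Placeholder port mapping and image name, for illustration only.
example = build_docker_run(["-p", "5000:5000", "my-image:latest"])
print(shlex.join(example))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues; shlex.join is used here only to print a copy-pasteable command line.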
With monitoring capabilities
Services built with monitoring capabilities contain not just the inference server itself but also several other services, namely the Prometheus time series database, Grafana dashboard and an Evidently-based model metrics app. All of these are supposed to be run together as a docker-compose network instead of being separately and manually started. Our docker-compose.yaml configuration already includes all the workarounds, so you only need to ensure the host system’s NVIDIA GPU is functional and accessible.