Quickstart
This guide walks you through training your first model on an example dataset and exporting it to an online inference system. The resulting model could be used by a hypothetical e-commerce platform to give on-the-fly hierarchical product categorisation hints as a seller types their product name. The guide assumes that you have successfully downloaded the latest version of the system and installed all dependencies without error. If you have not done so, please refer to the previous page for instructions.
Data preparation
There are currently two ways to get data into the system:
Connecting an SQL database
Reading CSV, JSON, Parquet, Arrow or Feather files
The SQL approach is more advanced, so for this quickstart guide we will instead use a flatfile. Bundled in the repository is an example dataset derived from this CC0 dataset on Kaggle, stored as datasets/Walmart_30k.parquet. Although it has been cleaned up to a degree, it retains its original schema and thus closely resembles the flatfiles you would encounter in the wild. A peek at the file (using any Parquet-compatible tool, such as Python’s pandas) shows how the data inside are organised:
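For example, with pandas (and a Parquet engine such as pyarrow) installed:

import pandas as pd

# Load the bundled example dataset and print a summary view.
df = pd.read_parquet("datasets/Walmart_30k.parquet")
print(df)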
idx title description List Price Sale Price Brand category
0 La Costena Chipotle Peppers, 7 OZ (Pack of 12) La Costena Chipotle Peppers, 7 OZ (Pack of 12)... 31.93 31.93 La Costeña [Food, Meal Solutions, Grains & Pasta, Canned ...
1 Equate Triamcinolone Acetonide Nasal Allergy S... Compare to Nasacort Allergy 24HR active ingred... 10.48 10.48 Equate [Health, Equate, Equate Allergy, Equate Sinus ...
2 AduroSmart ERIA Soft White Smart A19 Light Bul... The Soft White ERIA A19 bulb (2700K) can be co... 10.99 10.99 AduroSmart ERIA [Electronics, Smart Home, Smart Energy and Lig...
3 24" Classic Adjustable Balloon Fender Set Chro... Lowrider Fender Set 24" Classic Adjustable Chr... 38.59 38.59 lowrider [Sports & Outdoors, Bikes, Bike Accessories, B...
4 Elephant Shape Silicone Drinkware Portable Sil... This is a kind of fine quality silicone cup l... 5.81 5.81 Anself [Baby, Feeding, Sippy Cups: Alternatives to Pl...
... ... ... ... ... ... ...
29201 McCain Smiles Add a wholesome side to your dinnertime meal w... 0.00 0.00 McCain [Food, Frozen Foods, Frozen Vegetables]
29202 Shock Sox Fork Seal Guards 29-36mm Fork Tube 4... Shock Sox are a wraparound, neoprene sleeve th... 33.25 33.25 Shock Sox [Sports & Outdoors, Bikes, Bike Components, Bi...
29203 Princes Gooseberries 300g Gooseberries in syrup Princes Gooseberries 300g 8.88 8.88 Princes [Food, Meal Solutions, Grains & Pasta, Canned ...
29204 Create Ion Grace 3/4 Inches Straight Hair Iron... Create ion grace straight is a 3/4 inches wide... 50.00 24.50 Create Ion [Beauty, Hair Care, Hair Styling Tools, Flat I...
29205 Green Bell Takuminowaza Two Way Ear Pick Brass... Green bell ear wax remover 2-way screw & spoon... 6.00 4.20 Takuminowaza [Beauty, Here for Every Beauty, Featured Shops...
[29206 rows x 6 columns]
Since our model should be able to suggest which category to put a product into as the user types the product name, we will use the title column as our input to the model, and category as our training labels. Note how the category column is a list of strings, ordered by hierarchical depth (from the coarsest to the most detailed categorisation level). This is the common style that KTT uses. JSON files use their own list syntax, while CSV files can accept different delimiters to split a single string into separate categories. Formats that natively support list datatypes for cells, such as Apache Arrow, Feather and Parquet, are directly supported and are preferred.
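For example, reading one cell of the category column back with pandas illustrates the coarse-to-fine ordering (row 0 from the peek above):

import pandas as pd

df = pd.read_parquet("datasets/Walmart_30k.parquet")
cats = df.loc[0, "category"]
print(cats[0])   # coarsest level, e.g. 'Food'
print(cats[-1])  # most detailed level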
Note
In addition, these formats are much lighter in size, significantly faster to parse, and offer additional capabilities such as out-of-core reading. We therefore highly recommend them for your data storage outside of our system as well, both to lower storage costs (especially on services such as Amazon S3) and to ease the data preparation process.
To input this flatfile into the system, we invoke the flatfile adapter:
cd adapters
python adapter_flat.py -v -t title -c category -d 2 --dvc ../datasets/Walmart_30k.parquet Walmart
An explanation of what these options and arguments do:
-v is the verbose flag. With it, the adapter prints more information about its process, which helps with understanding what it does.
-t specifies the column to use as the model’s input. As previously discussed, we are using the title column. If left unspecified, title is also the default column name KTT uses. Of course, if you don’t specify it and the flatfile has no title column, or has one whose contents are something else entirely, expect errors or poor performance.
-c likewise specifies the column name to use as classes (labels). It defaults to classes if left unspecified. Here, we want to use the category column, so we must specify it.
-d is the hierarchical depth to search to. By default, the adapter runs for two levels. This must not exceed the minimum depth of the dataset, i.e. the length of the shortest list in the label column.
--dvc tells the adapter to also add the resulting intermediate dataset to the Data Version Control (DVC) system. With this, you can later add a remote of your choosing, push the dataset there and safely delete the local copy to save disk space (see the example after this list). Training scripts will automatically pull the dataset from that remote when needed.
The first argument is the path to the flatfile. Walmart_30k.parquet is bundled in datasets.
The second argument is the name of the resulting intermediate dataset.
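For example, registering a remote and pushing uses standard DVC commands (the S3 bucket name here is a placeholder):

dvc remote add -d storage s3://my-bucket/ktt-datasets
dvc push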
After running the above commands, a new folder (datasets/Walmart) will be created, containing three Parquet files and a JSON file named hierarchy.json.
Note
As this is a quickstart guide, we won’t delve too deeply into the specifics, but here are the main tasks an adapter performs:
Read the flatfile or SQL data into memory, then rename columns and reformat the classes column if necessary.
Build a list of distinct labels on each level.
Reconstruct the hierarchy from relationships found in the flatfile or corresponding SQL table.
Generate metadata for specific models.
CV-split the dataset into a training set, a validation set and a test set, using random sampling (sketched below).
Write the three sets into a new folder within datasets as Parquet files (irrespective of input format) and the hierarchical metadata as a JSON file.
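As a rough illustration of the splitting step only (the ratios here are placeholders, not KTT’s actual values), a random three-way split in pandas could look like this:

import pandas as pd

df = pd.read_parquet("datasets/Walmart_30k.parquet")
# Shuffle, then carve out 80/10/10 train/validation/test partitions.
shuffled = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
n = len(shuffled)
train = shuffled.iloc[: int(0.8 * n)]
val = shuffled.iloc[int(0.8 * n) : int(0.9 * n)]
test = shuffled.iloc[int(0.9 * n) :]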
Training a model
Having generated a compatible dataset and the necessary metadata, we can now train a model on it. To keep everything lightweight for a quickstart guide, we will use a small CPU-based model called Tfidf + Hierarchical SGD (internal identifier tfidf_hsgd). This model is capable of fully hierarchical classification and trains very quickly, albeit with limited accuracy on small datasets.
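Conceptually, and only as a sketch rather than KTT’s actual implementation, such a model pairs TF-IDF features with one linear SGD classifier per hierarchy level:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# One linear classifier per hierarchy level, sharing a TF-IDF feature recipe.
def make_level_classifier():
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("sgd", SGDClassifier(loss="modified_huber")),
    ])

# e.g. fit the level-0 classifier on titles and each product's coarsest label:
# make_level_classifier().fit(titles, [cats[0] for cats in labels])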
From ./, run the following command to train it:
python train.py -m tfidf_hsgd Walmart
An explanation, again:
-m specifies which model to train. Here we use the internal identifier of the model above.
The one and only argument specifies the (internal) dataset name to train on. We previously named our dataset Walmart (to differentiate it from the flatfile, which was named Walmart_30k), so we’ll use that name here.
Note
Both the dataset-name argument and the -m option accept comma-separated lists of datasets and models respectively. In other words, you can tell the training script to train on multiple datasets at once, train multiple models at once, or train multiple models, each on multiple datasets!
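For example, the following (with hypothetical second model and dataset names) would train two models, each on two datasets:

python train.py -m tfidf_hsgd,another_model Walmart,AnotherDataset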
Once the command finishes, a tfidf_hsgd instance will have been trained on Walmart and saved in the weights/tfidf_hsgd/Walmart directory.
Exporting the trained model
With a trained model under our belt, we can export it to an API server. Our goal is to have a minimal inference server set up by the end of this guide, so we will take advantage of the preset BentoML-powered inference system in KTT. Every bundled model in KTT has been prepared for use with BentoML through special build scripts and service files.
Run the following command from ./ to export the newly trained model:
python export.py -m 'tfidf_hsgd' -b Walmart
This command will look for the latest saved instance of tfidf_hsgd trained on the Walmart dataset. Note how the -m option is once again present in this command, while the argument is the dataset name. The additional -b flag tells the script to generate a BentoService and copy the necessary supporting data along. The resulting service will be located in the build folder.
Warning
Unlike the other six models, tfidf_hsgd does not support exporting to formats other than BentoML. As such, the -b flag MUST always be specified when exporting this model.
Serving up a Bento
It’s Bento(ML) time! KTT’s BentoML inference pipelines rely on packaged systems known as BentoServices. Each of these contains the model graph itself, plus the code needed to feed data into and extract data out of the model (i.e. a REST-to-model-to-REST routine), supporting files for peripheral subsystems such as an optional live performance-monitoring dashboard, and a version-frozen list of all dependencies.
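For orientation only, a minimal BentoService in the classic BentoML 0.13-style API might look like the sketch below; KTT generates its service files for you, so you never need to write this yourself:

from bentoml import api, artifacts, env, BentoService
from bentoml.adapters import JsonInput
from bentoml.frameworks.sklearn import SklearnModelArtifact

@env(infer_pip_packages=True)
@artifacts([SklearnModelArtifact("model")])
class ProductCategoriser(BentoService):
    # REST-to-model-to-REST: parse the JSON body, run the model,
    # and return the predicted categories.
    @api(input=JsonInput())
    def predict(self, parsed_json):
        return self.artifacts.model.predict([parsed_json["text"]])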
As mentioned above, every bundled model in the system has its own Bento pipeline implemented. To serve our trained model as a REST API, run the following command:
bentoml serve build/tfidf_hsgd_Walmart/inference
The model will be served as a REST API at localhost:5000 with the /predict endpoint. You can use the supplied test script, test.py in onnx/bentoml, to send a single request to it and check the results.
cd onnx/bentoml
python test.py
Or, if you prefer to send requests to it yourself, simply POST in the following format:
Content-Type: application/json
Data: a JSON string with a single field text containing the textual input to be classified. For example: "{ \"text\": \"Classify this string\" }".
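For instance, a minimal Python client using the requests library (assuming the server is running locally as above):

import requests

# POST one product title to the /predict endpoint and print the response.
response = requests.post(
    "http://localhost:5000/predict",
    json={"text": "Classify this string"},
)
print(response.json())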
Shipping Bentos in a container
Having a BentoService online should be enough if you only plan on running it directly on the computer that trained it, or on another machine that you are sure has all dependencies correctly installed. However, this is hard to maintain, especially as libraries may change over time, causing breakages. Downloading all dependencies over and over again is no small task either, and in extreme cases certain libraries may even become unavailable for download. Furthermore, KTT provides additional monitoring capabilities for its live inference systems that are much more easily run as a single docker-compose network instead of separate manually started processes (although monitoring is outside the scope of this quickstart guide).
As such, it is preferable to keep an offline backup of these dependencies, frozen to the exact versions used to train and export the model. This is where Dockerisation comes in.
First, ensure that you have Docker correctly installed (and its daemon running) on your local system and that your user account has the necessary privileges. Then, Dockerising a BentoService is a one-command affair:
docker image build ./build/tfidf_hsgd_Walmart_005
This might take a while as Docker builds a Debian-based system with our model. Once it’s done, check the list of images on your system and note down the Image ID of the newly-built Bento container.
❯ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
tfidf_hsgd_Walmart_005 ................ fbbf4a810b58 ... ...
Here, the Image ID is fbbf4a810b58. We can now fire the image up:
docker run -p 5000:5000 fbbf4a810b58
The -p 5000:5000 argument forwards the host’s port 5000 to the container’s corresponding port. This allows requests from the outside to reach the container, and the container’s responses to be directed back out. You can test this using the same test script mentioned above.
Note
The Dockerisation process is a bit different for BentoServices built with monitoring capabilities enabled. Since they use multiple Docker containers, KTT will also generate a docker-compose.yaml file for them. Simply navigate to the built service’s folder (where the docker-compose.yaml file is) and run docker-compose up.
If all goes well, congratulations! You now have a fully self-contained Docker image of your newly-trained hierarchical classification model.