Tf-idf + Leaf SGD
API
Configuration schema
The configuration for this model defines the following hyperparameters:
loss
: Type of loss function to use with the SGDClassifier. See the related documentation for more details.
max_iter
: The maximum number of iterations to train the SGDClassifier over.
min_df
: The minimum number of occurrences a token must have in the training dataset for it to be included in the tfidf vectoriser's vocabulary.
Default tuning configuration
"tfidf_hsgd": {
"display_name": "Tfidf + HierarchicalSGD",
"range": {
"loss": "modified_huber",
"max_iter": [500, 1000, 2000],
"min_df": [1, 2, 4, 8, 16, 32, 64, 128]
},
"mode": {
"loss": "fixed",
"max_iter": "choice",
"min_df": "choice"
}
}
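As a rough illustration of how the "range"/"mode" schema above might be consumed, the sketch below draws one hyperparameter sample from the configuration: "fixed" entries are taken verbatim, while "choice" entries are sampled from the listed candidates. The sample_hyperparameters helper is hypothetical and not part of the project's code.

import random

TUNING_CONFIG = {
    "tfidf_hsgd": {
        "display_name": "Tfidf + HierarchicalSGD",
        "range": {
            "loss": "modified_huber",
            "max_iter": [500, 1000, 2000],
            "min_df": [1, 2, 4, 8, 16, 32, 64, 128],
        },
        "mode": {
            "loss": "fixed",
            "max_iter": "choice",
            "min_df": "choice",
        },
    }
}

def sample_hyperparameters(config, model_name="tfidf_hsgd", rng=random):
    # Draw one hyperparameter assignment from the tuning configuration.
    # "fixed" values are used as-is; "choice" values are sampled uniformly.
    entry = config[model_name]
    sample = {}
    for name, mode in entry["mode"].items():
        value = entry["range"][name]
        sample[name] = value if mode == "fixed" else rng.choice(value)
    return sample

# e.g. {'loss': 'modified_huber', 'max_iter': 2000, 'min_df': 8}
print(sample_hyperparameters(TUNING_CONFIG))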
Theory
A standard combination of the TF-IDF vectorisation algorithm and a machine learning classifier based on stochastic gradient descent. This is a generic classifier: through different loss functions, it can represent different machine learning algorithms. For example, using a hinge loss turns it into a multiclass-capable linear SVM [EP99]. An example computation graph for the classifier is given below.
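To make the loss-function point concrete, the sketch below shows how the loss parameter of Scikit-learn's SGDClassifier selects the underlying linear model; the parameter values are illustrative and not the project's defaults.

from sklearn.linear_model import SGDClassifier

# The loss function determines which linear model SGD optimises.
svm_like = SGDClassifier(loss="hinge", max_iter=1000)            # linear SVM objective
logreg_like = SGDClassifier(loss="log_loss", max_iter=1000)      # logistic regression ("log" in older releases)
huber_clf = SGDClassifier(loss="modified_huber", max_iter=1000)  # smooth loss; supports predict_proba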
In our experiment, we used a modified Huber loss, a classification-centric variant of the original Huber loss [Zha04]. It is expressed as:
\[
L(y, f(x)) =
\begin{cases}
\max(0, 1 - y\,f(x))^2 & \text{if } y\,f(x) \geq -1,\\
-4\,y\,f(x) & \text{otherwise,}
\end{cases}
\]
where \(y \in \{-1, +1\}\) is the target and \(f(x)\) is the output of our model \(f\).

After preliminary testing, our implementation is now based on the Scikit-learn Python library [PVG+11] and NLTK [BKL09], a natural language processing toolkit from which we use the English SnowballStemmer with standard English stop-word removal. Input text first goes through stemming and stop-word removal, then through a TF-IDF vectoriser fitted against the general corpus with a cut-off frequency of 50, and finally to the classifier, which classifies at the leaf level of the hierarchy. When multi-string examples are used (such as categorising products based on both their title and description), each string can go through its own stemmer-vectoriser pipeline (possibly with differing hyperparameters); the outputs are then multiplied by weights that allow us to control the significance of each string and, lastly, concatenated. All major stages are packaged into Scikit-learn pipeline stages, which are then queued in a serialisable pipeline. This serialisable pipeline, together with its trained parameters, can be saved to disk for later inference.
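The stemming-vectorising-classifying chain described above can be sketched in a few lines of Scikit-learn and NLTK. This is an illustrative reconstruction rather than the project's actual code; the stem_tokenize helper and the tfidf_hsgd.joblib path are made up for the example, and the NLTK "punkt" and "stopwords" data packages are assumed to be downloaded.

import joblib
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))

def stem_tokenize(text):
    # Tokenise, drop English stop words and stem the remaining tokens.
    return [stemmer.stem(tok) for tok in word_tokenize(text.lower())
            if tok.isalpha() and tok not in stop_words]

pipeline = Pipeline([
    # min_df=50 mirrors the cut-off frequency mentioned in the text.
    ("tfidf", TfidfVectorizer(tokenizer=stem_tokenize, min_df=50)),
    # Leaf-level classifier trained with the modified Huber loss.
    ("clf", SGDClassifier(loss="modified_huber", max_iter=1000)),
])

# pipeline.fit(train_texts, train_leaf_labels)
# joblib.dump(pipeline, "tfidf_hsgd.joblib")  # serialise the fitted pipeline for later inference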
To further improve performance in the presence of class imbalance, we additionally weight the loss function so that classes with fewer examples are given higher priority than more frequently encountered classes, which should prevent the classifier from resorting to population-based guessing. The specific weight for each class is computed using the following formula:
\[
W_c = \frac{N}{C \cdot N_c},
\]
where \(W_c\) is the weight for examples of class \(c\), \(N\) is the total example count for the entire training set, \(C\) is the number of classes (in all levels) and \(N_c\) is the number of examples of class \(c\).
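Assuming the weights follow the formula above, they coincide with the "balanced" class-weight heuristic implemented in Scikit-learn; the toy sketch below computes them both ways to show the correspondence.

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy label set: class "a" is much rarer than class "b".
y = np.array(["a"] * 10 + ["b"] * 90)
classes = np.unique(y)

# Direct application of W_c = N / (C * N_c).
N, C = len(y), len(classes)
manual = {c: N / (C * np.sum(y == c)) for c in classes}

# Scikit-learn's "balanced" heuristic uses the same formula.
balanced = dict(zip(classes, compute_class_weight("balanced", classes=classes, y=y)))

print(manual)    # e.g. {'a': 5.0, 'b': 0.555...}
print(balanced)  # matches the manual computation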