.. DistilBERT + Linear documentation.

DistilBERT + Linear
===================

API
---

.. autoclass:: models.DB_Linear
   :members:
   :special-members:

Configuration schema
--------------------

The configuration for this model defines the following hyperparameters:

* ``encoder_lr``: Encoder (DistilBERT) learning rate.
* ``classifier_lr``: Classifier learning rate.
* ``dropout``: The model's dropout rate.

Checkpoint schema
-----------------

* ``config``: A copy of the configuration dictionary passed to this instance's constructor, either explicitly or by ``from_checkpoint`` (extracted from a prior checkpoint).
* ``hierarchy``: A serialised dictionary of hierarchical metadata created by ``PerLevelHierarchy.to_dict()``.
* ``encoder_state_dict``: Weights of the DistilBERT model.
* ``classifier_state_dict``: Weights of the classifier.
* ``optimizer_state_dict``: Saved state of the optimiser that was used to train the model up to that checkpoint.

.. _db-linear-theory:

Theory
------

This is our deep-learning equivalent of ``Tfidf_LSGD``, in which the vectorisation is done by DistilBERT and the leaf-level classification by a single linear layer. An example computation graph for the classifier is given below.

.. image:: linear.svg
   :width: 320
   :alt: Topology of the DistilBERT + Linear classifier.

As there is only one linear layer after DistilBERT, and as our parsing scheme is a simple :math:`\arg\max` over the scores to determine the highest-scoring leaf class, we do not apply any nonlinearity after the classifier in the forward pass and keep the raw scores as-is. When computing the training loss, however, we pass the raw output through a LogSoftmax function

.. math:: \mathrm{LogSoftmax}(x_i) = \log\left(\frac{\exp(x_i)}{\sum_j \exp(x_j)}\right)

before handing it over to our weight-averaged negative log-likelihood loss function

.. math:: L(y, f(x)) = \sum_{n=1}^N \frac{1}{\sum_{m=1}^N w_{y_m}} \times l_n(y, f(x))

where :math:`x` is the batch input of size :math:`N \times C` (i.e. :math:`x` may contain more than one example, in which case it is a minibatch); :math:`y`, of size :math:`N \times 1`, holds the corresponding target class indices for the examples in :math:`x`; :math:`w_{y_n}` is the weight of the correct class (the target :math:`y_n`) of example :math:`n` within :math:`x`; and :math:`l_n` is defined as

.. math:: l_n(y, f(x)) = -w_{y_n} \times f(x)_{n, y_n}

where :math:`f(x)_{n, y_n}` is the score for the correct class of example :math:`n` within the output of our model :math:`f` for input :math:`x`.
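
The sketch below reproduces this loss computation with stock PyTorch modules. It is an illustration only: the tensors are random placeholders standing in for real model output, and the class count and batch size are arbitrary.

.. code-block:: python

   import torch
   from torch import nn

   torch.manual_seed(0)
   N, C = 4, 10                          # batch size, number of leaf classes
   scores = torch.randn(N, C)            # raw classifier output f(x)
   targets = torch.randint(0, C, (N,))   # correct leaf indices y
   weights = torch.ones(C)               # per-class weights w

   # Inference-time parsing: a plain argmax over the raw scores.
   predictions = scores.argmax(dim=1)

   # Training-time loss: LogSoftmax followed by weight-averaged NLL,
   # exactly as in the formulas above.
   log_probs = nn.LogSoftmax(dim=1)(scores)
   loss = nn.NLLLoss(weight=weights)(log_probs, targets)

   # The same quantity in one step: LogSoftmax + NLLLoss is what
   # nn.CrossEntropyLoss computes internally.
   assert torch.isclose(loss, nn.CrossEntropyLoss(weight=weights)(scores, targets))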
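
More broadly, the topology and the per-module learning rates from the configuration schema might look like the following sketch. This is not the repository's actual implementation: the DistilBERT checkpoint name, the class count, and the use of the first-token embedding as the pooled representation are all assumptions made for illustration.

.. code-block:: python

   import torch
   from torch import nn
   from transformers import DistilBertModel


   class DistilBertLinear(nn.Module):
       """DistilBERT encoder followed by a single linear leaf classifier."""

       def __init__(self, n_leaf_classes: int, dropout: float = 0.1):
           super().__init__()
           self.encoder = DistilBertModel.from_pretrained('distilbert-base-uncased')
           self.dropout = nn.Dropout(dropout)
           # One linear layer maps the pooled embedding to raw leaf scores.
           self.classifier = nn.Linear(self.encoder.config.dim, n_leaf_classes)

       def forward(self, input_ids, attention_mask):
           hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
           cls_embedding = hidden[:, 0]   # first-token embedding as pooled representation
           # Raw scores; no nonlinearity in the forward pass, as described above.
           return self.classifier(self.dropout(cls_embedding))


   # Two parameter groups mirror the ``encoder_lr``/``classifier_lr`` hyperparameters.
   config = {'encoder_lr': 2e-5, 'classifier_lr': 1e-3, 'dropout': 0.1}
   model = DistilBertLinear(n_leaf_classes=42, dropout=config['dropout'])
   optimizer = torch.optim.Adam([
       {'params': model.encoder.parameters(), 'lr': config['encoder_lr']},
       {'params': model.classifier.parameters(), 'lr': config['classifier_lr']},
   ])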
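
Continuing that sketch, a checkpoint matching the schema listed earlier could be assembled as follows. The ``model``, ``optimizer``, and ``config`` objects are those from the sketch above; the hierarchy dictionary is a dummy placeholder standing in for real ``PerLevelHierarchy.to_dict()`` output, and the file name is arbitrary.

.. code-block:: python

   checkpoint = {
       'config': config,
       'hierarchy': {'levels': [], 'classes': []},   # placeholder metadata
       'encoder_state_dict': model.encoder.state_dict(),
       'classifier_state_dict': model.classifier.state_dict(),
       'optimizer_state_dict': optimizer.state_dict(),
   }
   torch.save(checkpoint, 'db_linear.pt')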