Multimodal Transformers Documentation

A toolkit for incorporating multimodal data alongside text data in classification and regression tasks. The toolkit is built on HuggingFace Transformers. It adds a combining module that takes the transformer outputs together with categorical and numerical features and produces rich multimodal features for the downstream classification or regression layers. Starting from a pretrained transformer, the parameters of both the combining module and the transformer are trained on the supervised task.
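To make the idea of a combining module concrete, here is a minimal PyTorch sketch of one possible strategy: plain concatenation followed by a linear layer. It is not the toolkit's own API. The class name `ConcatCombiner`, its arguments, and the feature dimensions are all hypothetical, and a random tensor stands in for the pooled transformer output so the example runs without downloading a model.

```python
import torch
from torch import nn


class ConcatCombiner(nn.Module):
    """Hypothetical combining module (illustration only, not the toolkit's class).

    Concatenates text, categorical, and numerical features and feeds the
    result to a linear classification layer.
    """

    def __init__(self, text_dim, cat_dim, num_dim, num_labels):
        super().__init__()
        combined_dim = text_dim + cat_dim + num_dim
        self.classifier = nn.Linear(combined_dim, num_labels)

    def forward(self, text_feats, cat_feats, num_feats):
        # Join the three modalities along the feature dimension.
        combined = torch.cat([text_feats, cat_feats, num_feats], dim=-1)
        return self.classifier(combined)


torch.manual_seed(0)
batch_size, text_dim, cat_dim, num_dim, num_labels = 4, 768, 5, 3, 2

# Stand-in for the pooled output of a pretrained transformer (assumed 768-d).
text_feats = torch.randn(batch_size, text_dim)
# One categorical column with 5 possible values, one-hot encoded.
cat_feats = nn.functional.one_hot(
    torch.tensor([0, 2, 4, 1]), num_classes=cat_dim
).float()
# Three numerical columns.
num_feats = torch.randn(batch_size, num_dim)

model = ConcatCombiner(text_dim, cat_dim, num_dim, num_labels)
logits = model(text_feats, cat_feats, num_feats)

# A supervised loss drives gradients into the combining module; in a full
# model the same backward pass would also update the transformer weights.
labels = torch.tensor([0, 1, 1, 0])
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
```

In a real setup, `text_feats` would come from the transformer's forward pass, so the gradient of the loss would flow through both the combining module and the transformer.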

See the HuggingFace Transformers documentation for details on its transformer models, configs, and tokenizers.

[Figure: model_image.png]
