Multimodal Transformers Documentation¶
A toolkit for incorporating multimodal data on top of text data for classification and regression tasks. This toolkit is heavily based on HuggingFace Transformers. It adds a combining module that takes the outputs of the transformer, together with categorical and numerical features, to produce rich multimodal features for downstream classification/regression layers. Given a pretrained transformer, the parameters of the combining module and the transformer are trained jointly on the supervised task.
See the HuggingFace Transformers documentation for specific details regarding its transformer models, configs, and tokenizers.

Installation¶
This package was written for Python 3.7. It depends on PyTorch and HuggingFace Transformers 3.0 and up.
Install¶
pip install multimodal-transformers
Introduction by Example¶
This guide covers how to use the transformer with tabular models in your own project. We use a BertWithTabular model as an example.
For a working script, see the GitHub repository.
How to Initialize Transformer With Tabular Models¶
The models that support tabular features are located in multimodal_transformers.model.tabular_transformers.
These adapted transformer modules expect the same transformer config instances as the ones from HuggingFace. In addition, they expect a multimodal_transformers.model.TabularConfig instance specifying the tabular configs.
Say, for example, we have categorical features of dimension 9 and numerical features of dimension 5.
from transformers import BertConfig
from multimodal_transformers.model import BertWithTabular
from multimodal_transformers.model import TabularConfig

bert_config = BertConfig.from_pretrained('bert-base-uncased')

tabular_config = TabularConfig(
    combine_feat_method='attention_on_cat_and_numerical_feats',  # change this to specify the method of combining tabular data
    cat_feat_dim=9,  # need to specify this
    numerical_feat_dim=5,  # need to specify this
    num_labels=2,  # need to specify this, assuming our task is binary classification
    numerical_bn=False,  # whether to use batch norm on the numerical features
)

bert_config.tabular_config = tabular_config

model = BertWithTabular.from_pretrained('bert-base-uncased', config=bert_config)
In fact, for any HuggingFace transformer model supported in multimodal_transformers.model.tabular_transformers, we can initialize it using multimodal_transformers.model.AutoModelWithTabular to leverage any community-trained transformer models:
from transformers import AutoConfig
from multimodal_transformers.model import AutoModelWithTabular
from multimodal_transformers.model import TabularConfig

hf_config = AutoConfig.from_pretrained('ipuneetrathore/bert-base-cased-finetuned-finBERT')

tabular_config = TabularConfig(
    combine_feat_method='attention_on_cat_and_numerical_feats',  # change this to specify the method of combining tabular data
    cat_feat_dim=9,  # need to specify this
    numerical_feat_dim=5,  # need to specify this
    num_labels=2,  # need to specify this, assuming our task is binary classification
)

hf_config.tabular_config = tabular_config

model = AutoModelWithTabular.from_pretrained('ipuneetrathore/bert-base-cased-finetuned-finBERT', config=hf_config)
Forward Pass of Transformer With Tabular Models¶
During the forward pass, we pass HuggingFace’s normal transformer inputs as well as our categorical and numerical features.
The forward pass returns:
- loss (torch.FloatTensor of shape (1,)): The classification (or regression if tabular_config.num_labels==1) loss
- logits (torch.FloatTensor of shape (batch_size, tabular_config.num_labels)): The classification (or regression if tabular_config.num_labels==1) scores (before SoftMax)
- classifier_layer_outputs (list of torch.FloatTensor): The outputs of each layer of the final classification layers. The 0th index of this list is the combining module’s output
The following example shows a forward pass on two data examples:
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

text_1 = "HuggingFace is based in NYC"
text_2 = "Where is HuggingFace based?"
model_inputs = tokenizer([text_1, text_2], padding=True, return_tensors='pt')

# 5 numerical features
numerical_feat = torch.rand(2, 5).float()
# 9 categorical features
categorical_feat = torch.tensor([[0, 0, 0, 1, 0, 1, 0, 1, 0],
                                 [1, 0, 0, 0, 1, 0, 1, 0, 0]]).float()
labels = torch.tensor([1, 0])

model_inputs['cat_feats'] = categorical_feat
model_inputs['numerical_feats'] = numerical_feat
model_inputs['labels'] = labels

loss, logits, layer_outs = model(**model_inputs)
We can also pass in the arguments explicitly:
loss, logits, layer_outs = model(
    model_inputs['input_ids'],
    token_type_ids=model_inputs['token_type_ids'],
    labels=labels,
    cat_feats=categorical_feat,
    numerical_feats=numerical_feat
)
Modifications: Only One Type of Tabular Feature or No Tabular Features¶
If there are no tabular features, the models essentially default to the ForSequenceClassification models from HuggingFace. We must specify combine_feat_method='text_only' in multimodal_transformers.model.TabularConfig. During the forward pass we can simply pass the text-related inputs:
loss, logits, layer_outs = model(
    model_inputs['input_ids'],
    token_type_ids=model_inputs['token_type_ids'],
    labels=labels,
)
If only one type of tabular feature is available, we first must specify a combine_feat_method that supports having only that type of feature. See the supported methods for more details.
When initializing our tabular config, we specify only the dimension of the feature we have. For example, if we only have categorical features:
tabular_config = TabularConfig(
    combine_feat_method='attention_on_cat_and_numerical_feats',  # change this to specify the method of combining tabular data
    cat_feat_dim=9,  # need to specify this
    num_labels=2,  # need to specify this, assuming our task is binary classification
)
During the forward pass, we also pass only the tabular data that we have:
loss, logits, layer_outs = model(
    model_inputs['input_ids'],
    token_type_ids=model_inputs['token_type_ids'],
    labels=labels,
    cat_feats=categorical_feat,
)
Inference¶
During inference we do not need to pass the labels; we take the logits from the second output of the model's forward pass:
with torch.no_grad():
    _, logits, classifier_outputs = model(
        model_inputs['input_ids'],
        token_type_ids=model_inputs['token_type_ids'],
        cat_feats=categorical_feat,
        numerical_feats=numerical_feat
    )
Combine Methods¶
This page explains the combine methods supported by multimodal_transformers.tabular_combiner.TabularFeatCombiner; see the table below for details.
If you have rich categorical and numerical features, any of the attention, gating, or weighted sum methods are worth trying.
The following table describes each supported method and whether or not it requires both categorical and numerical features.
Combine Feat Method | Description | Requires both cat and num features |
---|---|---|
text_only | Uses just the text columns as processed by the transformer before final classifier layer(s). Essentially equivalent to HuggingFace's ForSequenceClassification models | False |
concat | Concatenation of transformer output, numerical feats, and categorical feats all at once before final classifier layer(s) | False |
mlp_on_categorical_then_concat | MLP on categorical feats, then concatenation of transformer output, numerical feats, and processed categorical feats before final classifier layer(s) | False (requires cat feats) |
individual_mlps_on_cat_and_numerical_feats_then_concat | Separate MLPs on categorical feats and numerical feats, then concatenation of transformer output, processed numerical feats, and processed categorical feats before final classifier layer(s) | False |
mlp_on_concatenated_cat_and_numerical_feats_then_concat | MLP on concatenated categorical and numerical feats, then concatenation with transformer output before final classifier layer(s) | True |
attention_on_cat_and_numerical_feats | Attention-based summation of transformer outputs, numerical feats, and categorical feats, queried by the transformer outputs, before final classifier layer(s) | False |
gating_on_cat_and_num_feats_then_sum | Gated summation of transformer outputs, numerical feats, and categorical feats before final classifier layer(s). Inspired by Integrating Multimodal Information in Large Pretrained Transformers, which performs the mechanism for each token | False |
weighted_feature_sum_on_transformer_cat_and_numerical_feats | Learnable weighted feature-wise sum of transformer outputs, numerical feats, and categorical feats for each feature dimension before final classifier layer(s) | False |
The following table shows the equations involved with each method. First we define some notation:
- \(\mathbf{m}\) denotes the combined multimodal features
- \(\mathbf{x}\) denotes the output text features from the transformer
- \(\mathbf{c}\) denotes the categorical features
- \(\mathbf{n}\) denotes the numerical features
- \(h_{\mathbf{\Theta}}\) denotes an MLP parameterized by \(\Theta\)
- \(W\) denotes a weight matrix
- \(b\) denotes a scalar bias
Combine Feat Method | Equation |
---|---|
text_only | \(\mathbf{m} = \mathbf{x}\) |
concat | \(\mathbf{m} = \mathbf{x} \, \Vert \, \mathbf{c} \, \Vert \, \mathbf{n}\) |
mlp_on_categorical_then_concat | \(\mathbf{m} = \mathbf{x} \, \Vert \, h_{\mathbf{\Theta}}(\mathbf{c}) \, \Vert \, \mathbf{n}\) |
individual_mlps_on_cat_and_numerical_feats_then_concat | \(\mathbf{m} = \mathbf{x} \, \Vert \, h_{\mathbf{\Theta_c}}(\mathbf{c}) \, \Vert \, h_{\mathbf{\Theta_n}}(\mathbf{n})\) |
mlp_on_concatenated_cat_and_numerical_feats_then_concat | \(\mathbf{m} = \mathbf{x} \, \Vert \, h_{\mathbf{\Theta}}(\mathbf{c} \, \Vert \, \mathbf{n})\) |
attention_on_cat_and_numerical_feats | \(\mathbf{m} = \alpha_{x,x}\mathbf{W}_x\mathbf{x} + \alpha_{x,c}\mathbf{W}_c\mathbf{c} + \alpha_{x,n}\mathbf{W}_n\mathbf{n}\), where the attention coefficients \(\alpha_{i,j}\) are defined in the TabularFeatCombiner documentation below |
gating_on_cat_and_num_feats_then_sum | \(\mathbf{m} = \mathbf{x} + \alpha\mathbf{h}\), where \(\mathbf{h}\) and \(\alpha\) are defined in the TabularFeatCombiner documentation below |
weighted_feature_sum_on_transformer_cat_and_numerical_feats | \(\mathbf{m} = \mathbf{x} + \mathbf{W}_{c'} \odot \mathbf{W}_c \mathbf{c} + \mathbf{W}_{n'} \odot \mathbf{W}_n \mathbf{n}\) |
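To make the notation concrete, the following is a minimal PyTorch sketch of the weighted_feature_sum_on_transformer_cat_and_numerical_feats combination. The dimensions and parameter names are illustrative placeholders, not the library's internal implementation.
import torch
import torch.nn as nn

# Illustrative dimensions; the text feature dim comes from the transformer (e.g. 768 for BERT-base)
text_feat_dim, cat_feat_dim, num_feat_dim, batch_size = 768, 9, 5, 2

x = torch.rand(batch_size, text_feat_dim)  # transformer output features
c = torch.rand(batch_size, cat_feat_dim)   # categorical features
n = torch.rand(batch_size, num_feat_dim)   # numerical features

# W_c and W_n project the tabular features into the text feature space
W_c = nn.Linear(cat_feat_dim, text_feat_dim, bias=False)
W_n = nn.Linear(num_feat_dim, text_feat_dim, bias=False)
# W_c' and W_n' are learnable feature-wise weights
w_c_prime = nn.Parameter(torch.ones(text_feat_dim))
w_n_prime = nn.Parameter(torch.ones(text_feat_dim))

# m = x + W_c' * (W_c c) + W_n' * (W_n n), elementwise in the feature dimension
m = x + w_c_prime * W_c(c) + w_n_prime * W_n(n)
print(m.shape)  # torch.Size([2, 768])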
Colab Example¶
Here is an example Colab notebook for running the toolkit end to end, covering data preparation, training, and evaluation.
multimodal_transformers.model¶
Tabular Feature Combiner¶
-
class
TabularFeatCombiner
(tabular_config)[source]¶ Bases:
torch.nn.modules.module.Module
Combiner module for combining text features with categorical and numerical features. The combining methods, specified by tabular_config.combine_feat_method, are shown below. \(\mathbf{m}\) denotes the combined multimodal features, \(\mathbf{x}\) denotes the output text features from the transformer, \(\mathbf{c}\) denotes the categorical features, \(\mathbf{n}\) denotes the numerical features, \(h_{\mathbf{\Theta}}\) denotes an MLP parameterized by \(\Theta\), \(W\) denotes a weight matrix, and \(b\) denotes a scalar bias.
text_only
\[\mathbf{m} = \mathbf{x}\]
concat
\[\mathbf{m} = \mathbf{x} \, \Vert \, \mathbf{c} \, \Vert \, \mathbf{n}\]
mlp_on_categorical_then_concat
\[\mathbf{m} = \mathbf{x} \, \Vert \, h_{\mathbf{\Theta}}( \mathbf{c}) \, \Vert \, \mathbf{n}\]
individual_mlps_on_cat_and_numerical_feats_then_concat
\[\mathbf{m} = \mathbf{x} \, \Vert \, h_{\mathbf{\Theta_c}}( \mathbf{c}) \, \Vert \, h_{\mathbf{\Theta_n}}(\mathbf{n})\]
mlp_on_concatenated_cat_and_numerical_feats_then_concat
\[\mathbf{m} = \mathbf{x} \, \Vert \, h_{\mathbf{\Theta}}( \mathbf{c} \, \Vert \, \mathbf{n})\]
attention_on_cat_and_numerical_feats: self attention on the text features
\[\mathbf{m} = \alpha_{x,x}\mathbf{W}_x\mathbf{x} + \alpha_{x,c}\mathbf{W}_c\mathbf{c} + \alpha_{x,n}\mathbf{W}_n\mathbf{n}\]
where \(\mathbf{W}_x\) is of shape (out_dim, text_feat_dim), \(\mathbf{W}_c\) is of shape (out_dim, cat_feat_dim), \(\mathbf{W}_n\) is of shape (out_dim, num_feat_dim), and the attention coefficients \(\alpha_{i,j}\) are computed as
\[\alpha_{i,j} = \frac{ \exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{\top} [\mathbf{W}_i\mathbf{x}_i \, \Vert \, \mathbf{W}_j\mathbf{x}_j] \right)\right)} {\sum_{k \in \{ x, c, n \}} \exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{\top} [\mathbf{W}_i\mathbf{x}_i \, \Vert \, \mathbf{W}_k\mathbf{x}_k] \right)\right)}.\]
gating_on_cat_and_num_feats_then_sum: sum of features gated by the text features, inspired by the gating mechanism introduced in Integrating Multimodal Information in Large Pretrained Transformers
\[\mathbf{m} = \mathbf{x} + \alpha\mathbf{h}\]
\[\mathbf{h} = \mathbf{g_c} \odot (\mathbf{W}_c\mathbf{c}) + \mathbf{g_n} \odot (\mathbf{W}_n\mathbf{n}) + b_h\]
\[\alpha = \mathrm{min}\left( \frac{\| \mathbf{x} \|_2}{\| \mathbf{h} \|_2}\,\beta,\ 1\right)\]
where \(\beta\) is a hyperparameter, \(\mathbf{W}_c\) is of shape (out_dim, cat_feat_dim), \(\mathbf{W}_n\) is of shape (out_dim, num_feat_dim), and the gating vector \(\mathbf{g}_i\) with activation function \(R\) is defined as
\[\mathbf{g}_i = R(\mathbf{W}_{gi}[\mathbf{i} \, \Vert \, \mathbf{x}]+ b_i)\]
where \(\mathbf{W}_{gi}\) is of shape (out_dim, i_feat_dim + text_feat_dim).
weighted_feature_sum_on_transformer_cat_and_numerical_feats
\[\mathbf{m} = \mathbf{x} + \mathbf{W}_{c'} \odot \mathbf{W}_c \mathbf{c} + \mathbf{W}_{n'} \odot \mathbf{W}_n \mathbf{n}\]
- Parameters
tabular_config (
TabularConfig
) – Tabular model configuration class with all the parameters of the model.
-
forward
(text_feats, cat_feats=None, numerical_feats=None)[source]¶ - Parameters
text_feats (torch.FloatTensor of shape (batch_size, text_out_dim)) – The tensor of text features. This is assumed to be the output from a HuggingFace transformer model
cat_feats (torch.FloatTensor of shape (batch_size, cat_feat_dim), optional, defaults to None) – The tensor of categorical features
numerical_feats (torch.FloatTensor of shape (batch_size, numerical_feat_dim), optional, defaults to None) – The tensor of numerical features
- Returns
A tensor representing the combined features
- Return type
torch.FloatTensor of shape (batch_size, final_out_dim)
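For illustration, here is a sketch of calling the combiner directly on feature tensors. The import path for TabularFeatCombiner and the text_feat_dim attribute set below are assumptions (in practice the transformer-with-tabular models configure and call the combiner internally), so treat this as a sketch rather than the canonical API.
import torch
from multimodal_transformers.model import TabularConfig
# Assumed import path for the combiner; it may differ between versions.
from multimodal_transformers.model.tabular_combiner import TabularFeatCombiner

tabular_config = TabularConfig(
    combine_feat_method='concat',
    cat_feat_dim=9,
    numerical_feat_dim=5,
    num_labels=2,
)
# Assumption: the combiner needs the transformer's hidden size; the models
# normally set this on the config before building the combiner.
tabular_config.text_feat_dim = 768

combiner = TabularFeatCombiner(tabular_config)

text_feats = torch.rand(2, 768)  # e.g. a BERT pooled output
cat_feats = torch.rand(2, 9).float()
numerical_feats = torch.rand(2, 5).float()

combined_feats = combiner(text_feats, cat_feats, numerical_feats)
print(combined_feats.shape)  # (batch_size, final_out_dim)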
Tabular Config¶
-
class
TabularConfig
(num_labels, mlp_division=4, combine_feat_method='text_only', mlp_dropout=0.1, numerical_bn=True, use_simple_classifier=True, mlp_act='relu', gating_beta=0.2, numerical_feat_dim=0, cat_feat_dim=0, **kwargs)[source]¶ Bases:
object
Config used for tabular combiner
- Parameters
mlp_division (int) – how much to decrease each MLP dim for each additional layer
combine_feat_method (str) – The method to combine categorical and numerical features. See TabularFeatCombiner for details on the supported methods.
mlp_dropout (float) – dropout ratio used for MLP layers
numerical_bn (bool) – whether to use batchnorm on numerical features
use_simple_classifier (bool) – whether to use a single layer or an MLP as the final classifier
mlp_act (str) – the activation function to use for the finetuning layers
gating_beta (float) – the beta hyperparameter used for gating tabular data; see the paper Integrating Multimodal Information in Large Pretrained Transformers for details
numerical_feat_dim (int) – the number of numerical features
cat_feat_dim (int) – the number of categorical features
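Putting the documented options together, here is a sketch of a config for the gating combine method with an MLP final classifier; the particular values are arbitrary examples.
from multimodal_transformers.model import TabularConfig

tabular_config = TabularConfig(
    num_labels=2,
    combine_feat_method='gating_on_cat_and_num_feats_then_sum',
    cat_feat_dim=9,
    numerical_feat_dim=5,
    gating_beta=0.2,              # beta hyperparameter for the gating mechanism
    use_simple_classifier=False,  # use an MLP instead of a single linear layer as the final classifier
    mlp_division=4,               # how much to decrease each MLP dim for each additional layer
    mlp_dropout=0.1,
    mlp_act='relu',
    numerical_bn=True,            # batch norm on numerical features
)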
AutoModel with Tabular¶
-
class
AutoModelWithTabular
[source]¶ Bases:
object
-
classmethod
from_config
(config)[source]¶ Instantiates one of the base model classes of the library from a configuration.
Note
Only the models in multimodal_transformers.py are implemented
- Parameters
config (
PretrainedConfig
) –- The model class to instantiate is selected based on the configuration class:
see multimodal_transformers.py for supported transformer models
Examples:
config = BertConfig.from_pretrained('bert-base-uncased')  # Download configuration from S3 and cache.
model = AutoModelWithTabular.from_config(config)
-
classmethod
from_pretrained
(pretrained_model_name_or_path, *model_args, **kwargs)[source]¶ Instantiates one of the sequence classification model classes of the library from a pre-trained model configuration. See multimodal_transformers.py for supported transformer models
The from_pretrained() method takes care of returning the correct model class instance based on the model_type property of the config object, or, when it is missing, by falling back to pattern matching on the pretrained_model_name_or_path string.
The model is set in evaluation mode by default using model.eval() (Dropout modules are deactivated). To train the model, you should first set it back in training mode with model.train().
- Parameters
pretrained_model_name_or_path –
either:
- a string with the shortcut name of a pre-trained model to load from cache or download, e.g.: bert-base-uncased
- a string with the identifier name of a pre-trained model that was user-uploaded to our S3, e.g.: dbmdz/bert-base-german-cased
- a path to a directory containing model weights saved using save_pretrained(), e.g.: ./my_model_directory/
- a path or url to a tensorflow index checkpoint file (e.g. ./tf_model/model.ckpt.index). In this case, from_tf should be set to True and a configuration object should be provided as the config argument. This loading path is slower than converting the TensorFlow checkpoint to a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.
model_args – (optional) Sequence of positional arguments: All remaining positional arguments will be passed to the underlying model’s __init__ method
config – (optional) instance of a class derived from PretrainedConfig: Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:
- the model is a model provided by the library (loaded with the shortcut-name string of a pretrained model), or
- the model was saved using save_pretrained() and is reloaded by supplying the save directory, or
- the model is loaded by supplying a local directory as pretrained_model_name_or_path and a configuration JSON file named config.json is found in the directory.
state_dict – (optional) dict: an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file. This option can be used if you want to create a model from a pretrained configuration but load your own weights. In this case though, you should check if using save_pretrained() and from_pretrained() is not a simpler option.
cache_dir – (optional) string: Path to a directory in which a downloaded pre-trained model configuration should be cached if the standard cache should not be used.
force_download – (optional) boolean, default False: Force to (re-)download the model weights and configuration files and override the cached versions if they exist.
resume_download – (optional) boolean, default False: Do not delete an incompletely received file. Attempt to resume the download if such a file exists.
proxies – (optional) dict, default None: A dictionary of proxy servers to use by protocol or endpoint, e.g.: {‘http’: ‘foo.bar:3128’, ‘http://hostname’: ‘foo.bar:4012’}. The proxies are used on each request.
output_loading_info – (optional) boolean: Set to True to also return a dictionary containing missing keys, unexpected keys and error messages.
kwargs – (optional) Remaining dictionary of keyword arguments: These arguments will be passed to the configuration and the model.
Examples:
model = AutoModelWithTabular.from_pretrained('bert-base-uncased')  # Download model and configuration from S3 and cache.
model = AutoModelWithTabular.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
assert model.config.output_attention == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')
model = AutoModelWithTabular.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
Transformers with Tabular¶
-
class
AlbertWithTabular
(hf_model_config)[source]¶ Bases:
transformers.modeling_albert.AlbertForSequenceClassification
ALBERT Model transformer with a sequence classification/regression head as well as a TabularFeatCombiner module to combine categorical and numerical features with the ALBERT pooled output
- Parameters
hf_model_config (
AlbertConfig
) – Model configuration class with all the parameters of the model. This object must also have a tabular_config member variable that is aTabularConfig
instance specifying the configs forTabularFeatCombiner
-
forward
(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None, class_weights=None, cat_feats=None, numerical_feats=None)[source]¶ The
AlbertWithTabular
forward method, overrides the__call__()
special method.Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.- Parameters
input_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
) –Indices of input sequence tokens in the vocabulary.
Indices can be obtained using
transformers.AlbertTokenizer
. Seetransformers.PreTrainedTokenizer.encode()
andtransformers.PreTrainedTokenizer()
for details.attention_mask (
torch.FloatTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) –Mask to avoid performing attention on padding token indices. Mask values selected in
[0, 1]
:1
for tokens that are NOT MASKED,0
for MASKED tokens.token_type_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) –Segment token indices to indicate first and second portions of the inputs. Indices are selected in
[0, 1]
:0
corresponds to a sentence A token,1
corresponds to a sentence B tokenposition_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) –Indices of positions of each input sequence tokens in the position embeddings. Selected in the range
[0, config.max_position_embeddings - 1]
.head_mask (
torch.FloatTensor
of shape(num_heads,)
or(num_layers, num_heads)
, optional, defaults toNone
) – Mask to nullify selected heads of the self-attention modules. Mask values selected in[0, 1]
:1
indicates the head is not masked,0
indicates the head is masked.inputs_embeds (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional, defaults toNone
) – Optionally, instead of passinginput_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.output_attentions (
bool
, optional, defaults toNone
) – If set toTrue
, the attentions tensors of all attention layers are returned. Seeattentions
under returned tensors for more detail.labels (
torch.LongTensor
of shape(batch_size,)
, optional, defaults toNone
) – Labels for computing the sequence classification/regression loss. Indices should be in[0, ..., config.num_labels - 1]
. Ifconfig.num_labels == 1
a regression loss is computed (Mean-Square loss), Ifconfig.num_labels > 1
a classification loss is computed (Cross-Entropy).
-
class
BertWithTabular
(hf_model_config)[source]¶ Bases:
transformers.modeling_bert.BertForSequenceClassification
Bert Model transformer with a sequence classification/regression head as well as a TabularFeatCombiner module to combine categorical and numerical features with the Bert pooled output
- Parameters
hf_model_config (
BertConfig
) – Model configuration class with all the parameters of the model. This object must also have a tabular_config member variable that is aTabularConfig
instance specifying the configs forTabularFeatCombiner
-
forward
(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, labels=None, class_weights=None, output_attentions=None, output_hidden_states=None, cat_feats=None, numerical_feats=None)[source]¶ The
BertWithTabular
forward method, overrides the__call__()
special method.Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.- Parameters
input_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
) –Indices of input sequence tokens in the vocabulary.
Indices can be obtained using
transformers.BertTokenizer
. Seetransformers.PreTrainedTokenizer.encode()
andtransformers.PreTrainedTokenizer.__call__()
for details.attention_mask (
torch.FloatTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) –Mask to avoid performing attention on padding token indices. Mask values selected in
[0, 1]
:1
for tokens that are NOT MASKED,0
for MASKED tokens.token_type_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) –Segment token indices to indicate first and second portions of the inputs. Indices are selected in
[0, 1]
:0
corresponds to a sentence A token,1
corresponds to a sentence B tokenposition_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) –Indices of positions of each input sequence tokens in the position embeddings. Selected in the range
[0, config.max_position_embeddings - 1]
.head_mask (
torch.FloatTensor
of shape(num_heads,)
or(num_layers, num_heads)
, optional, defaults toNone
) – Mask to nullify selected heads of the self-attention modules. Mask values selected in[0, 1]
:1
indicates the head is not masked,0
indicates the head is masked.inputs_embeds (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional, defaults toNone
) – Optionally, instead of passinginput_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.encoder_hidden_states (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional, defaults toNone
) – Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.encoder_attention_mask (
torch.FloatTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) – Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in[0, 1]
:1
for tokens that are NOT MASKED,0
for MASKED tokens.output_attentions (
bool
, optional, defaults toNone
) – If set toTrue
, the attentions tensors of all attention layers are returned. Seeattentions
under returned tensors for more detail.class_weights (
torch.FloatTensor
of shape(tabular_config.num_labels,)
, optional, defaults toNone
) – Class weights to be used for cross entropy loss function for classification tasklabels (
torch.LongTensor
of shape(batch_size,)
, optional, defaults toNone
) – Labels for computing the sequence classification/regression loss. Indices should be in[0, ..., config.num_labels - 1]
. Iftabular_config.num_labels == 1
a regression loss is computed (Mean-Square loss), Iftabular_config.num_labels > 1
a classification loss is computed (Cross-Entropy).cat_feats (
torch.FloatTensor
of shape(batch_size, tabular_config.cat_feat_dim)
, optional, defaults toNone
) – Categorical features to be passed in to the TabularFeatCombinernumerical_feats (
torch.FloatTensor
of shape(batch_size, tabular_config.numerical_feat_dim)
, optional, defaults toNone
) – Numerical features to be passed in to the TabularFeatCombiner
- Returns
  - loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Classification (or regression if tabular_config.num_labels==1) loss.
  - logits (torch.FloatTensor of shape (batch_size, tabular_config.num_labels)): Classification (or regression if tabular_config.num_labels==1) scores (before SoftMax).
  - classifier_layer_outputs (list of torch.FloatTensor): The outputs of each layer of the final classification layers. The 0th index of this list is the combining module’s output.
- Return type
  tuple comprising various elements depending on configuration and inputs
-
class
DistilBertWithTabular
(hf_model_config)[source]¶ Bases:
transformers.modeling_distilbert.DistilBertForSequenceClassification
DistilBert Model transformer with a sequence classification/regression head as well as a TabularFeatCombiner module to combine categorical and numerical features with the DistilBert pooled output
- Parameters
hf_model_config (
DistilBertConfig
) – Model configuration class with all the parameters of the model. This object must also have a tabular_config member variable that is aTabularConfig
instance specifying the configs forTabularFeatCombiner
-
forward
(input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, labels=None, output_attentions=None, output_hidden_states=None, class_weights=None, cat_feats=None, numerical_feats=None)[source]¶ The
DistilBertWithTabular
forward method, overrides the__call__()
special method.Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.- Parameters
input_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
) –Indices of input sequence tokens in the vocabulary.
Indices can be obtained using
transformers.DistilBertTokenizer
. Seetransformers.PreTrainedTokenizer.encode()
andtransformers.PreTrainedTokenizer.__call__()
for details.attention_mask (
torch.FloatTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) –Mask to avoid performing attention on padding token indices. Mask values selected in
[0, 1]
:1
for tokens that are NOT MASKED,0
for MASKED tokens.head_mask (
torch.FloatTensor
of shape(num_heads,)
or(num_layers, num_heads)
, optional, defaults toNone
) – Mask to nullify selected heads of the self-attention modules. Mask values selected in[0, 1]
:1
indicates the head is not masked,0
indicates the head is masked.inputs_embeds (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional, defaults toNone
) – Optionally, instead of passinginput_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.output_attentions (
bool
, optional, defaults toNone
) – If set toTrue
, the attentions tensors of all attention layers are returned. Seeattentions
under returned tensors for more detail.class_weights (
torch.FloatTensor
of shape(tabular_config.num_labels,)
, optional, defaults to None
) – Class weights to be used for cross entropy loss function for classification tasklabels (
torch.LongTensor
of shape(batch_size,)
, optional, defaults toNone
) – Labels for computing the sequence classification/regression loss. Indices should be in[0, ..., config.num_labels - 1]
. Iftabular_config.num_labels == 1
a regression loss is computed (Mean-Square loss), Iftabular_config.num_labels > 1
a classification loss is computed (Cross-Entropy).cat_feats (
torch.FloatTensor
of shape(batch_size, tabular_config.cat_feat_dim)
, optional, defaults to None
) – Categorical features to be passed in to the TabularFeatCombinernumerical_feats (
torch.FloatTensor
of shape(batch_size, tabular_config.numerical_feat_dim)
, optional, defaults to None
) – Numerical features to be passed in to the TabularFeatCombiner
- Returns
  - loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Classification (or regression if tabular_config.num_labels==1) loss.
  - logits (torch.FloatTensor of shape (batch_size, tabular_config.num_labels)): Classification (or regression if tabular_config.num_labels==1) scores (before SoftMax).
  - classifier_layer_outputs (list of torch.FloatTensor): The outputs of each layer of the final classification layers. The 0th index of this list is the combining module’s output.
- Return type
  tuple comprising various elements depending on configuration and inputs
-
class
RobertaWithTabular
(hf_model_config)[source]¶ Bases:
transformers.modeling_roberta.RobertaForSequenceClassification
Roberta Model transformer with a sequence classification/regression head as well as a TabularFeatCombiner module to combine categorical and numerical features with the Roberta pooled output
- Parameters
hf_model_config (
RobertaConfig
) – Model configuration class with all the parameters of the model. This object must also have a tabular_config member variable that is aTabularConfig
instance specifying the configs forTabularFeatCombiner
-
forward
(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, labels=None, output_attentions=None, output_hidden_states=None, class_weights=None, cat_feats=None, numerical_feats=None)[source]¶ The
RobertaWithTabular
forward method, overrides the__call__()
special method.Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.- Parameters
input_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
) –Indices of input sequence tokens in the vocabulary.
Indices can be obtained using
transformers.RobertaTokenizer
. Seetransformers.PreTrainedTokenizer.encode()
andtransformers.PreTrainedTokenizer.__call__()
for details.attention_mask (
torch.FloatTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) –Mask to avoid performing attention on padding token indices. Mask values selected in
[0, 1]
:1
for tokens that are NOT MASKED,0
for MASKED tokens.token_type_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) –Segment token indices to indicate first and second portions of the inputs. Indices are selected in
[0, 1]
:0
corresponds to a sentence A token,1
corresponds to a sentence B tokenposition_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) –Indices of positions of each input sequence tokens in the position embeddings. Selected in the range
[0, config.max_position_embeddings - 1]
.head_mask (
torch.FloatTensor
of shape(num_heads,)
or(num_layers, num_heads)
, optional, defaults toNone
) – Mask to nullify selected heads of the self-attention modules. Mask values selected in[0, 1]
:1
indicates the head is not masked,0
indicates the head is masked.inputs_embeds (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional, defaults toNone
) – Optionally, instead of passinginput_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.output_attentions (
bool
, optional, defaults toNone
) – If set toTrue
, the attentions tensors of all attention layers are returned. Seeattentions
under returned tensors for more detail.class_weights (
torch.FloatTensor
of shape(tabular_config.num_labels,)
, optional, defaults toNone
) – Class weights to be used for cross entropy loss function for classification tasklabels (
torch.LongTensor
of shape(batch_size,)
, optional, defaults toNone
) – Labels for computing the sequence classification/regression loss. Indices should be in[0, ..., config.num_labels - 1]
. Iftabular_config.num_labels == 1
a regression loss is computed (Mean-Square loss), Iftabular_config.num_labels > 1
a classification loss is computed (Cross-Entropy).cat_feats (
torch.FloatTensor
of shape(batch_size, tabular_config.cat_feat_dim)
, optional, defaults toNone
) – Categorical features to be passed in to the TabularFeatCombinernumerical_feats (
torch.FloatTensor
of shape(batch_size, tabular_config.numerical_feat_dim)
, optional, defaults toNone
) – Numerical features to be passed in to the TabularFeatCombiner
- Returns
  - loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Classification (or regression if tabular_config.num_labels==1) loss.
  - logits (torch.FloatTensor of shape (batch_size, tabular_config.num_labels)): Classification (or regression if tabular_config.num_labels==1) scores (before SoftMax).
  - classifier_layer_outputs (list of torch.FloatTensor): The outputs of each layer of the final classification layers. The 0th index of this list is the combining module’s output.
- Return type
  tuple comprising various elements depending on configuration and inputs
-
class
XLMRobertaWithTabular
(hf_model_config)[source]¶ Bases:
multimodal_transformers.model.tabular_transformers.RobertaWithTabular
This class overrides
RobertaWithTabular
. Please check the superclass for the appropriate documentation alongside usage examples.-
config_class
¶ alias of
transformers.configuration_xlm_roberta.XLMRobertaConfig
-
-
class
XLMWithTabular
(hf_model_config)[source]¶ Bases:
transformers.modeling_xlm.XLMForSequenceClassification
XLM Model transformer with a sequence classification/regression head as well as a TabularFeatCombiner module to combine categorical and numerical features with the XLM pooled output
- Parameters
hf_model_config (
XLMConfig
) – Model configuration class with all the parameters of the model. This object must also have a tabular_config member variable that is aTabularConfig
instance specifying the configs forTabularFeatCombiner
-
forward
(input_ids=None, attention_mask=None, langs=None, token_type_ids=None, position_ids=None, lengths=None, cache=None, head_mask=None, inputs_embeds=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None, class_weights=None, cat_feats=None, numerical_feats=None)[source]¶ The
XLMWithTabular
forward method, overrides the__call__()
special method.Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.- Parameters
input_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
) –Indices of input sequence tokens in the vocabulary.
Indices can be obtained using
transformers.BertTokenizer
. Seetransformers.PreTrainedTokenizer.encode()
andtransformers.PreTrainedTokenizer.__call__()
for details.attention_mask (
torch.FloatTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) –Mask to avoid performing attention on padding token indices. Mask values selected in
[0, 1]
:1
for tokens that are NOT MASKED,0
for MASKED tokens.langs (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) –A parallel sequence of tokens to be used to indicate the language of each token in the input. Indices are languages ids which can be obtained from the language names by using two conversion mappings provided in the configuration of the model (only provided for multilingual models). More precisely, the language name -> language id mapping is in model.config.lang2id (dict str -> int) and the language id -> language name mapping is model.config.id2lang (dict int -> str).
See usage examples detailed in the multilingual documentation.
token_type_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) –Segment token indices to indicate first and second portions of the inputs. Indices are selected in
[0, 1]
:0
corresponds to a sentence A token,1
corresponds to a sentence B tokenposition_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) –Indices of positions of each input sequence tokens in the position embeddings. Selected in the range
[0, config.max_position_embeddings - 1]
.lengths (
torch.LongTensor
of shape(batch_size,)
, optional, defaults toNone
) – Length of each sentence that can be used to avoid performing attention on padding token indices. You can also use attention_mask for the same result (see above); kept here for compatibility. Indices selected in [0, ..., input_ids.size(-1)]:
:cache (
Dict[str, torch.FloatTensor]
, optional, defaults toNone
) – dictionary withtorch.FloatTensor
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model (see cache output below). Can be used to speed up sequential decoding. The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.head_mask (
torch.FloatTensor
of shape(num_heads,)
or(num_layers, num_heads)
, optional, defaults toNone
) – Mask to nullify selected heads of the self-attention modules. Mask values selected in[0, 1]
:1
indicates the head is not masked,0
indicates the head is masked.inputs_embeds (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional, defaults toNone
) – Optionally, instead of passinginput_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.output_attentions (
bool
, optional, defaults toNone
) – If set toTrue
, the attentions tensors of all attention layers are returned. Seeattentions
under returned tensors for more detail.labels (
torch.LongTensor
of shape(batch_size,)
, optional, defaults toNone
) – Labels for computing the sequence classification/regression loss. Indices should be in[0, ..., config.num_labels - 1]
. Ifconfig.num_labels == 1
a regression loss is computed (Mean-Square loss), Ifconfig.num_labels > 1
a classification loss is computed (Cross-Entropy).
-
class
XLNetWithTabular
(hf_model_config)[source]¶ Bases:
transformers.modeling_xlnet.XLNetForSequenceClassification
XLNet Model transformer with a sequence classification/regression head as well as a TabularFeatCombiner module to combine categorical and numerical features with the XLNet pooled output
- Parameters
hf_model_config (
XLNetConfig
) – Model configuration class with all the parameters of the model. This object must also have a tabular_config member variable that is aTabularConfig
instance specifying the configs forTabularFeatCombiner
-
forward
(input_ids=None, attention_mask=None, mems=None, perm_mask=None, target_mapping=None, token_type_ids=None, input_mask=None, head_mask=None, inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None, class_weights=None, cat_feats=None, numerical_feats=None)[source]¶ The
XLNetWithTabular
forward method, overrides the__call__()
special method.Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.- Parameters
input_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
) –Indices of input sequence tokens in the vocabulary.
Indices can be obtained using
transformers.BertTokenizer
. Seetransformers.PreTrainedTokenizer.encode()
andtransformers.PreTrainedTokenizer.__call__()
for details.attention_mask (
torch.FloatTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) –Mask to avoid performing attention on padding token indices. Mask values selected in
[0, 1]
:1
for tokens that are NOT MASKED,0
for MASKED tokens.mems (
List[torch.FloatTensor]
of lengthconfig.n_layers
) – Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model (see mems output below). Can be used to speed up sequential decoding. The token ids which have their mems given to this model should not be passed as input ids as they have already been computed. use_cache has to be set to True to make use of mems.perm_mask (
torch.FloatTensor
of shape(batch_size, sequence_length, sequence_length)
, optional, defaults toNone
) – Mask to indicate the attention pattern for each input token with values selected in[0, 1]
: Ifperm_mask[k, i, j] = 0
, i attend to j in batch k; ifperm_mask[k, i, j] = 1
, i does not attend to j in batch k. If None, each token attends to all the others (full bidirectional attention). Only used during pretraining (to define factorization order) or for sequential decoding (generation).target_mapping (
torch.FloatTensor
of shape(batch_size, num_predict, sequence_length)
, optional, defaults toNone
) – Mask to indicate the output tokens to use. Iftarget_mapping[k, i, j] = 1
, the i-th predict in batch k is on the j-th token. Only used during pretraining for partial prediction or for sequential decoding (generation).token_type_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) –Segment token indices to indicate first and second portions of the inputs. Indices are selected in
[0, 1]
:0
corresponds to a sentence A token,1
corresponds to a sentence B token. The classifier token should be represented by a2
.input_mask (
torch.FloatTensor
of shape(batch_size, sequence_length)
, optional, defaults toNone
) – Mask to avoid performing attention on padding token indices. Negative of attention_mask, i.e. with 0 for real tokens and 1 for padding. Kept for compatibility with the original code base. You can only use one of input_mask and attention_mask. Mask values selected in [0, 1]
:1
for tokens that are MASKED,0
for tokens that are NOT MASKED.head_mask (
torch.FloatTensor
of shape(num_heads,)
or(num_layers, num_heads)
, optional, defaults toNone
) – Mask to nullify selected heads of the self-attention modules. Mask values selected in[0, 1]
:1
indicates the head is not masked,0
indicates the head is masked.inputs_embeds (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional, defaults toNone
) – Optionally, instead of passinginput_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.use_cache (
bool
) – If use_cache is True, mems are returned and can be used to speed up decoding (see mems). Defaults to True.output_attentions (
bool
, optional, defaults toNone
) – If set toTrue
, the attentions tensors of all attention layers are returned. Seeattentions
under returned tensors for more detail.labels (
torch.LongTensor
of shape(batch_size,)
, optional, defaults toNone
) – Labels for computing the sequence classification/regression loss. Indices should be in[0, ..., config.num_labels - 1]
. Ifconfig.num_labels == 1
a regression loss is computed (Mean-Square loss), Ifconfig.num_labels > 1
a classification loss is computed (Cross-Entropy).
multimodal_transformers.data¶
The data module includes functions to help load your own datasets into a multimodal_transformers.data.tabular_torch_dataset.TorchTabularTextDataset, which can be fed into a torch.utils.data.DataLoader. The outputs of the TorchTabularTextDataset's __getitem__ method can be fed directly to the forward pass of a model in multimodal_transformers.model.tabular_transformers.
Note
You may still need to move the __getitem__ method outputs to the right GPU device.
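A minimal sketch of the intended usage, assuming train_dataset is a TorchTabularTextDataset produced by one of the loading functions below and model is one of the transformer-with-tabular models; the batch keys are assumed to mirror the model's forward arguments.
import torch
from torch.utils.data import DataLoader

# `train_dataset` is assumed to be a TorchTabularTextDataset from load_data /
# load_data_from_folder below, and `model` a transformer-with-tabular model.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

for batch in train_loader:
    # Move every tensor in the batch to the right device before the forward pass.
    batch = {k: v.to(device) for k, v in batch.items() if isinstance(v, torch.Tensor)}
    loss, logits, layer_outs = model(**batch)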
Module contents¶
-
class
TorchTabularTextDataset
(encodings, categorical_feats, numerical_feats, labels=None, df=None, label_list=None, class_weights=None)[source]¶ Bases:
torch.utils.data.dataset.Dataset
TorchDataset wrapper for a text dataset with categorical features and numerical features
- Parameters
encodings (transformers.BatchEncoding) – The output from the encode_plus() and batch_encode() methods (tokens, attention_masks, etc.) of a transformers.PreTrainedTokenizer
categorical_feats (numpy.ndarray, of shape (n_examples, categorical feat dim), optional, defaults to None) – An array containing the preprocessed categorical features
numerical_feats (numpy.ndarray, of shape (n_examples, numerical feat dim), optional, defaults to None) – An array containing the preprocessed numerical features
labels (list or numpy.ndarray, optional, defaults to None) – The labels of the training examples
class_weights (numpy.ndarray, of shape (n_classes,), optional, defaults to None) – Class weights used for cross entropy loss for classification
df (pandas.DataFrame, optional, defaults to None) – The original DataFrame that the features and labels were derived from, kept for reference
-
load_data
(data_df, text_cols, tokenizer, label_col, label_list=None, categorical_cols=None, numerical_cols=None, sep_text_token_str=' ', categorical_encode_type='ohe', numerical_transformer=None, empty_text_values=None, replace_empty_text=None, max_token_length=None, debug=False)[source]¶ Function to load a single dataset given a pandas DataFrame
Given a DataFrame, this function loads the data to a
torch_dataset.TorchTextDataset
object which can be used in atorch.utils.data.DataLoader
.- Parameters
data_df (
pd.DataFrame
) – The DataFrame to convert to a TorchTextDatasettext_cols (
list
ofstr
) – the column names in the dataset that contain text from which we want to loadtokenizer (
transformers.tokenization_utils.PreTrainedTokenizer
) – HuggingFace tokenizer used to tokenize the input texts as specified by text_cols
label_col (str) – The column name of the label. For classification the column should have int values from 0 to n_classes-1 as the label for each class. For regression the column can have any numerical value
label_list (
list
ofstr
, optional) – Used for classification; the names of the classes indexed by the values in label_col.categorical_cols (
list
ofstr
, optional) – The column names in the dataset that contain categorical features. The features can be already prepared numerically, or could be preprocessed by the method specified by categorical_encode_typenumerical_cols (
list
ofstr
, optional) – The column names in the dataset that contain numerical features. These columns should contain only numeric values.sep_text_token_str (str, optional) – The string token that is used to separate between the different text columns for a given data example. For Bert for example, this could be the [SEP] token.
categorical_encode_type (str, optional) – Given categorical_cols, this specifies what method we want to preprocess our categorical features. choices: [ ‘ohe’, ‘binary’, None] see encode_features.CategoricalFeatures for more details
numerical_transformer (
sklearn.base.TransformerMixin
) – The sklearn numeric transformer instance to transform our numerical featuresempty_text_values (
list
ofstr
, optional) – Specifies what texts should be considered as missing which would be replaced by replace_empty_textreplace_empty_text (str, optional) – The value of the string that will replace the texts that match with those in empty_text_values. If this argument is None then the text that match with empty_text_values will be skipped
max_token_length (int, optional) – The token length to pad or truncate to on the input text
debug (bool, optional) – Whether or not to load a smaller debug version of the dataset
- Returns
The converted dataset
- Return type
tabular_torch_dataset.TorchTextDataset
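Here is a hedged usage sketch of load_data with a small, hypothetical DataFrame; the column names are made up for illustration.
import pandas as pd
from transformers import AutoTokenizer
from multimodal_transformers.data import load_data

# Hypothetical data: one text column, two categorical columns, one numerical
# column, and an integer label column.
df = pd.DataFrame({
    'review':  ['great product', 'did not work'],
    'brand':   ['acme', 'globex'],
    'channel': ['web', 'store'],
    'price':   [19.99, 5.49],
    'label':   [1, 0],
})
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

dataset = load_data(
    df,
    text_cols=['review'],
    tokenizer=tokenizer,
    label_col='label',
    categorical_cols=['brand', 'channel'],
    numerical_cols=['price'],
    sep_text_token_str=tokenizer.sep_token,
    categorical_encode_type='ohe',
)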
-
load_data_from_folder
(folder_path, text_cols, tokenizer, label_col, label_list=None, categorical_cols=None, numerical_cols=None, sep_text_token_str=' ', categorical_encode_type='ohe', numerical_transformer_method='quantile_normal', empty_text_values=None, replace_empty_text=None, max_token_length=None, debug=False)[source]¶ Function to load tabular and text data from a specified folder
Loads train, test, and/or validation text and tabular data from the specified folder path into TorchTextDataset classes, and does categorical and numerical data preprocessing if specified. The folder is expected to contain a train.csv and a test.csv (and, if given, a val.csv) containing the training, testing, and validation sets respectively.
- Parameters
folder_path (str) – The path to the folder containing train.csv, and test.csv (and if given val.csv)
text_cols (
list
ofstr
) – The column names in the dataset that contain text from which we want to loadtokenizer (
transformers.tokenization_utils.PreTrainedTokenizer
) – HuggingFace tokenizer used to tokenize the input texts as specified by text_cols
label_col (str) – The column name of the label. For classification the column should have int values from 0 to n_classes-1 as the label for each class. For regression the column can have any numerical value
label_list (
list
ofstr
, optional) – Used for classification; the names of the classes indexed by the values in label_col.categorical_cols (
list
ofstr
, optional) – The column names in the dataset that contain categorical features. The features can be already prepared numerically, or could be preprocessed by the method specified by categorical_encode_typenumerical_cols (
list
ofstr
, optional) – The column names in the dataset that contain numerical features. These columns should contain only numeric values.sep_text_token_str (str, optional) – The string token that is used to separate between the different text columns for a given data example. For Bert for example, this could be the [SEP] token.
categorical_encode_type (str, optional) – Given categorical_cols, this specifies what method we want to preprocess our categorical features. choices: [ ‘ohe’, ‘binary’, None] see encode_features.CategoricalFeatures for more details
numerical_transformer_method (str, optional) – Given numerical_cols, this specifies what method we want to use for normalizing our numerical data. choices: [‘yeo_johnson’, ‘box_cox’, ‘quantile_normal’, None] see https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html for more details
empty_text_values (
list
ofstr
, optional) – specifies what texts should be considered as missing which would be replaced by replace_empty_textreplace_empty_text (str, optional) – The value of the string that will replace the texts that match with those in empty_text_values. If this argument is None then the text that match with empty_text_values will be skipped
max_token_length (int, optional) – The token length to pad or truncate to on the input text
debug (bool, optional) – Whether or not to load a smaller debug version of the dataset
- Returns
This tuple contains the training, validation and testing sets. The val dataset is
None
if there is no val.csv in folder_path- Return type
tuple
of tabular_torch_dataset.TorchTextDataset
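A corresponding sketch for load_data_from_folder; the folder layout follows the description above and the column names are hypothetical.
from transformers import AutoTokenizer
from multimodal_transformers.data import load_data_from_folder

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# The folder is assumed to contain train.csv and test.csv (and optionally val.csv);
# val_dataset is None if there is no val.csv.
train_dataset, val_dataset, test_dataset = load_data_from_folder(
    './datasets/my_dataset/',
    text_cols=['review'],
    tokenizer=tokenizer,
    label_col='label',
    categorical_cols=['brand', 'channel'],
    numerical_cols=['price'],
    categorical_encode_type='ohe',
    numerical_transformer_method='quantile_normal',
)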
-
load_data_into_folds
(data_csv_path, num_splits, validation_ratio, text_cols, tokenizer, label_col, label_list=None, categorical_cols=None, numerical_cols=None, sep_text_token_str=' ', categorical_encode_type='ohe', numerical_transformer_method='quantile_normal', empty_text_values=None, replace_empty_text=None, max_token_length=None, debug=False)[source]¶ Function to load tabular and text data from a specified folder into folds
Loads train, test, and/or validation text and tabular data from the specified csv path into num_splits folds of train, val, and test sets for k-fold cross validation. Performs categorical and numerical data preprocessing if specified. data_csv_path is the path to the csv containing the data.
- Parameters
data_csv_path (str) – The path to the csv containing the data
num_splits (int) – The number of cross validation folds to split the data into.
validation_ratio (float) – A float between 0 and 1 representing the percent of the data to hold as a consistent validation set.
text_cols (
list
ofstr
) – The column names in the dataset that contain text from which we want to loadtokenizer (
transformers.tokenization_utils.PreTrainedTokenizer
) – HuggingFace tokenizer used to tokenize the input texts as specified by text_cols
label_col (str) – The column name of the label. For classification the column should have int values from 0 to n_classes-1 as the label for each class. For regression the column can have any numerical value
label_list (
list
ofstr
, optional) – Used for classification; the names of the classes indexed by the values in label_col.categorical_cols (
list
ofstr
, optional) – The column names in the dataset that contain categorical features. The features can be already prepared numerically, or could be preprocessed by the method specified by categorical_encode_typenumerical_cols (
list
ofstr
, optional) – The column names in the dataset that contain numerical features. These columns should contain only numeric values.sep_text_token_str (str, optional) – The string token that is used to separate between the different text columns for a given data example. For Bert for example, this could be the [SEP] token.
categorical_encode_type (str, optional) – Given categorical_cols, this specifies what method we want to preprocess our categorical features. choices: [ ‘ohe’, ‘binary’, None] see encode_features.CategoricalFeatures for more details
numerical_transformer_method (str, optional) – Given numerical_cols, this specifies what method we want to use for normalizing our numerical data. choices: [‘yeo_johnson’, ‘box_cox’, ‘quantile_normal’, None] see https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html for more details
empty_text_values (
list
ofstr
, optional) – specifies what texts should be considered as missing which would be replaced by replace_empty_textreplace_empty_text (str, optional) – The value of the string that will replace the texts that match with those in empty_text_values. If this argument is None then the text that match with empty_text_values will be skipped
max_token_length (int, optional) – The token length to pad or truncate to on the input text
debug (bool, optional) – Whether or not to load a smaller debug version of the dataset
- Returns
This tuple contains three lists representing the splits of training, validation and testing sets. The length of the lists is equal to the number of folds specified by num_splits
- Return type
tuple
of list of tabular_torch_dataset.TorchTextDataset
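And a sketch for load_data_into_folds, which returns lists of datasets, one entry per fold; the csv path and column names are hypothetical.
from transformers import AutoTokenizer
from multimodal_transformers.data import load_data_into_folds

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Returns three lists of datasets whose length equals num_splits.
train_sets, val_sets, test_sets = load_data_into_folds(
    './datasets/my_dataset/data.csv',
    num_splits=5,
    validation_ratio=0.1,
    text_cols=['review'],
    tokenizer=tokenizer,
    label_col='label',
    categorical_cols=['brand', 'channel'],
    numerical_cols=['price'],
)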