multimodal_transformers.data

The data module includes functions to help load your own datasets into a multimodal_transformers.data.tabular_torch_dataset.TorchTabularTextDataset, which can be fed into a torch.utils.data.DataLoader. The outputs of the TorchTabularTextDataset __getitem__ method can be fed directly to the forward pass of a model in multimodal_transformers.model.tabular_transformers.

Note

You may still need to move the outputs of the __getitem__ method to the correct GPU device.
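
As a minimal sketch of that step, assuming each batch produced by the DataLoader is a dict whose values are tensors, a helper along these lines could be used. The name move_to_device is hypothetical and not part of the library.

```python
def move_to_device(batch, device):
    """Return a copy of `batch` with every tensor-like value moved to `device`.

    Assumption: `batch` is a dict. Values without a `.to` method
    (plain ints, strings, ...) are passed through unchanged.
    """
    return {
        key: value.to(device) if hasattr(value, "to") else value
        for key, value in batch.items()
    }
```

In a training loop this might be called as move_to_device(batch, "cuda:0") before the forward pass, assuming the model accepts the batch keys as keyword arguments.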

Module contents

class TorchTabularTextDataset(encodings, categorical_feats, numerical_feats, labels=None, df=None, label_list=None, class_weights=None)

Bases: torch.utils.data.dataset.Dataset

A torch Dataset wrapper for a text dataset with categorical and numerical features

Parameters

encodings (transformers.BatchEncoding) – The output of the encode_plus() or batch_encode_plus() method (tokens, attention masks, etc.) of a transformers.PreTrainedTokenizer
categorical_feats (numpy.ndarray, of shape (n_examples, categorical feat dim), optional, defaults to None) – An array containing the preprocessed categorical features
numerical_feats (numpy.ndarray, of shape (n_examples, numerical feat dim), optional, defaults to None) – An array containing the preprocessed numerical features
labels (list or numpy.ndarray, optional, defaults to None) – The labels of the training examples
class_weights (numpy.ndarray, of shape (n_classes), optional, defaults to None) – Class weights used for cross entropy loss for classification
df (pandas.DataFrame, optional, defaults to None) – The source DataFrame from which the features and labels were derived

get_labels()

Returns the label names for classification
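
The class_weights argument is documented only as an array of shape (n_classes). One common way to obtain such weights is inverse-frequency ("balanced") weighting. The sketch below is an assumption about how a caller might compute them, not something the library prescribes, and balanced_class_weights is a hypothetical helper.

```python
from collections import Counter


def balanced_class_weights(labels, n_classes):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    # Classes that never occur get weight 0.0 to avoid dividing by zero.
    return [
        n_samples / (n_classes * counts[c]) if counts[c] else 0.0
        for c in range(n_classes)
    ]


# Three examples of class 0 and one of class 1: the rare class is up-weighted.
weights = balanced_class_weights([0, 0, 0, 1], n_classes=2)
```

The resulting list would need to be converted to a numpy.ndarray (for example with numpy.asarray) before being passed as class_weights.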

load_data(data_df, text_cols, tokenizer, label_col, label_list=None, categorical_cols=None, numerical_cols=None, sep_text_token_str=' ', categorical_encode_type='ohe', numerical_transformer=None, empty_text_values=None, replace_empty_text=None, max_token_length=None, debug=False)

Function to load a single dataset given a pandas DataFrame

Given a DataFrame, this function loads the data into a tabular_torch_dataset.TorchTabularTextDataset object, which can be used in a torch.utils.data.DataLoader.

Parameters

data_df (pd.DataFrame) – The DataFrame to convert to a TorchTabularTextDataset
text_cols (list of str) – The column names in the dataset that contain the text to load
tokenizer (transformers.tokenization_utils.PreTrainedTokenizer) – HuggingFace tokenizer used to tokenize the input texts as specified by text_cols
label_col (str) – The column name of the label. For classification, the column should have int values from 0 to n_classes-1 as the label for each class. For regression, the column can have any numerical value
label_list (list of str, optional) – Used for classification; the names of the classes indexed by the values in label_col.
categorical_cols (list of str, optional) – The column names in the dataset that contain categorical features. The features can be already prepared numerically, or could be preprocessed by the method specified by categorical_encode_type
numerical_cols (list of str, optional) – The column names in the dataset that contain numerical features. These columns should contain only numeric values.
sep_text_token_str (str, optional) – The string token used to separate the different text columns of a given data example. For BERT, for example, this could be the [SEP] token.
categorical_encode_type (str, optional) – Given categorical_cols, this specifies the method used to preprocess the categorical features. Choices: [‘ohe’, ‘binary’, None]. See encode_features.CategoricalFeatures for more details
numerical_transformer (sklearn.base.TransformerMixin, optional) – The sklearn numeric transformer instance used to transform the numerical features
empty_text_values (list of str, optional) – Specifies which texts should be considered missing; these are replaced by replace_empty_text
replace_empty_text (str, optional) – The string that replaces texts matching those in empty_text_values. If this argument is None, texts that match empty_text_values are skipped
max_token_length (int, optional) – The token length to pad or truncate to on the input text
debug (bool, optional) – Whether or not to load a smaller debug version of the dataset
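
The interaction between sep_text_token_str, empty_text_values and replace_empty_text can be illustrated with a small pure-Python sketch. It mirrors the behaviour described in the parameter list rather than the library's actual code. The helper join_text_columns is hypothetical, and padding the separator with single spaces is an assumption.

```python
def join_text_columns(texts, sep_token=" ", empty_text_values=None,
                      replace_empty_text=None):
    """Combine the text fields of one example into a single string."""
    empty = set(empty_text_values or [])
    kept = []
    for text in texts:
        if text in empty:
            if replace_empty_text is None:
                continue  # no replacement given: skip the missing text
            text = replace_empty_text
        kept.append(text)
    # Assumption: the separator token is surrounded by single spaces.
    return f" {sep_token} ".join(kept)


# "nan" is treated as missing and, with no replacement given, is dropped.
joined = join_text_columns(
    ["Great fit", "nan", "Runs small"],
    sep_token="[SEP]",
    empty_text_values=["nan"],
)
```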

Returns

The converted dataset

Return type

tabular_torch_dataset.TorchTabularTextDataset
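
A hedged usage sketch follows. The keyword names come from the load_data signature on this page. The column names and label names are hypothetical placeholders, and the import path assumes load_data is exposed from multimodal_transformers.data, as the page title suggests. The import is deferred into the function so the snippet can be read and parsed without the library installed.

```python
# Hypothetical column names; replace them with those of your own DataFrame.
LOAD_KWARGS = dict(
    text_cols=["title", "review"],
    label_col="label",
    label_list=["negative", "positive"],
    categorical_cols=["category"],
    numerical_cols=["rating"],
    sep_text_token_str="[SEP]",
    categorical_encode_type="ohe",
    max_token_length=128,
)


def build_dataset(data_df, tokenizer):
    """Convert `data_df` into a dataset via load_data (import path assumed)."""
    from multimodal_transformers.data import load_data

    return load_data(data_df, tokenizer=tokenizer, **LOAD_KWARGS)
```

The returned dataset can then be wrapped in a torch.utils.data.DataLoader as described at the top of this page.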

load_data_from_folder(folder_path, text_cols, tokenizer, label_col, label_list=None, categorical_cols=None, numerical_cols=None, sep_text_token_str=' ', categorical_encode_type='ohe', numerical_transformer_method='quantile_normal', empty_text_values=None, replace_empty_text=None, max_token_length=None, debug=False)

Function to load tabular and text data from a specified folder

Loads the train, test and (optionally) validation text and tabular data from the specified folder path into TorchTabularTextDataset objects, applying categorical and numerical preprocessing if specified. The folder is expected to contain train.csv and test.csv (and, optionally, val.csv), holding the training, testing and validation sets respectively.

Parameters

folder_path (str) – The path to the folder containing train.csv and test.csv (and, if present, val.csv)
text_cols (list of str) – The column names in the dataset that contain the text to load
tokenizer (transformers.tokenization_utils.PreTrainedTokenizer) – HuggingFace tokenizer used to tokenize the input texts as specified by text_cols
label_col (str) – The column name of the label. For classification, the column should have int values from 0 to n_classes-1 as the label for each class. For regression, the column can have any numerical value
label_list (list of str, optional) – Used for classification; the names of the classes indexed by the values in label_col.
categorical_cols (list of str, optional) – The column names in the dataset that contain categorical features. The features can be already prepared numerically, or could be preprocessed by the method specified by categorical_encode_type
numerical_cols (list of str, optional) – The column names in the dataset that contain numerical features. These columns should contain only numeric values.
sep_text_token_str (str, optional) – The string token used to separate the different text columns of a given data example. For BERT, for example, this could be the [SEP] token.
categorical_encode_type (str, optional) – Given categorical_cols, this specifies the method used to preprocess the categorical features. Choices: [‘ohe’, ‘binary’, None]. See encode_features.CategoricalFeatures for more details
numerical_transformer_method (str, optional) – Given numerical_cols, this specifies the method used to normalize the numerical data. Choices: [‘yeo_johnson’, ‘box_cox’, ‘quantile_normal’, None]. See https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html for more details
empty_text_values (list of str, optional) – Specifies which texts should be considered missing; these are replaced by replace_empty_text
replace_empty_text (str, optional) – The string that replaces texts matching those in empty_text_values. If this argument is None, texts that match empty_text_values are skipped
max_token_length (int, optional) – The token length to pad or truncate to on the input text
debug (bool, optional) – Whether or not to load a smaller debug version of the dataset
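
To give a feel for what the default ‘quantile_normal’ option does, here is a standard-library sketch of a quantile-to-normal transform: each value is replaced by the standard-normal quantile of its rank. It illustrates the general technique only. The library's own transformer (presumably the scikit-learn one described at the link above) may differ in details such as tie handling and interpolation, and the function name here is hypothetical.

```python
from statistics import NormalDist


def quantile_normal(values):
    """Map values to standard-normal scores based on their ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    standard_normal = NormalDist()
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        # (rank + 0.5) / n keeps the quantile strictly inside (0, 1).
        scores[idx] = standard_normal.inv_cdf((rank + 0.5) / n)
    return scores


# The outlier 1000.0 ends up no further from 0 than the smallest value.
scores = quantile_normal([10.0, 1000.0, 12.0])
```

Because only ranks matter, extreme values are pulled in, which is the main reason to prefer this kind of transform for heavy-tailed numerical features.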

Returns

This tuple contains the training, validation and testing sets. The val dataset is None if there is no val.csv in folder_path

Return type

tuple of tabular_torch_dataset.TorchTabularTextDataset
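
Since the function expects a particular folder layout, it can be useful to validate the folder before loading. The sketch below checks for the files named above. check_data_folder is a hypothetical helper, not part of the library.

```python
from pathlib import Path


def check_data_folder(folder_path):
    """Verify the expected layout and report whether val.csv is present.

    Raises FileNotFoundError if train.csv or test.csv is missing.
    Returns True if val.csv exists, False otherwise.
    """
    folder = Path(folder_path)
    missing = [
        name for name in ("train.csv", "test.csv")
        if not (folder / name).is_file()
    ]
    if missing:
        raise FileNotFoundError(f"{folder} is missing: {', '.join(missing)}")
    return (folder / "val.csv").is_file()
```

When the check returns False, the validation element of the tuple returned by load_data_from_folder is expected to be None, as noted under Returns.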

load_data_into_folds(data_csv_path, num_splits, validation_ratio, text_cols, tokenizer, label_col, label_list=None, categorical_cols=None, numerical_cols=None, sep_text_token_str=' ', categorical_encode_type='ohe', numerical_transformer_method='quantile_normal', empty_text_values=None, replace_empty_text=None, max_token_length=None, debug=False)

Function to load tabular and text data from a specified csv file into folds

Loads text and tabular data from the csv at data_csv_path and splits it into num_splits sets of train, val and test data for K-fold cross validation. Performs categorical and numerical data preprocessing if specified.

Parameters

data_csv_path (str) – The path to the csv containing the data
num_splits (int) – The number of cross validation folds to split the data into.
validation_ratio (float) – A float between 0 and 1 representing the fraction of the data to hold out as a consistent validation set.
text_cols (list of str) – The column names in the dataset that contain the text to load
tokenizer (transformers.tokenization_utils.PreTrainedTokenizer) – HuggingFace tokenizer used to tokenize the input texts as specified by text_cols
label_col (str) – The column name of the label. For classification, the column should have int values from 0 to n_classes-1 as the label for each class. For regression, the column can have any numerical value
label_list (list of str, optional) – Used for classification; the names of the classes indexed by the values in label_col.
categorical_cols (list of str, optional) – The column names in the dataset that contain categorical features. The features can be already prepared numerically, or could be preprocessed by the method specified by categorical_encode_type
numerical_cols (list of str, optional) – The column names in the dataset that contain numerical features. These columns should contain only numeric values.
sep_text_token_str (str, optional) – The string token used to separate the different text columns of a given data example. For BERT, for example, this could be the [SEP] token.
categorical_encode_type (str, optional) – Given categorical_cols, this specifies the method used to preprocess the categorical features. Choices: [‘ohe’, ‘binary’, None]. See encode_features.CategoricalFeatures for more details
numerical_transformer_method (str, optional) – Given numerical_cols, this specifies the method used to normalize the numerical data. Choices: [‘yeo_johnson’, ‘box_cox’, ‘quantile_normal’, None]. See https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html for more details
empty_text_values (list of str, optional) – Specifies which texts should be considered missing; these are replaced by replace_empty_text
replace_empty_text (str, optional) – The string that replaces texts matching those in empty_text_values. If this argument is None, texts that match empty_text_values are skipped
max_token_length (int, optional) – The token length to pad or truncate to on the input text
debug (bool, optional) – Whether or not to load a smaller debug version of the dataset
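
For readers unfamiliar with the default categorical_encode_type of ‘ohe’ (one-hot encoding), the following pure-Python sketch shows the idea on a single categorical column. It is an illustration only. The library delegates this work to encode_features.CategoricalFeatures, whose column ordering and output type may differ, and the function name here is hypothetical.

```python
def one_hot(values):
    """One-hot encode a list of category values.

    Returns the sorted category names and one 0/1 row per input value,
    with columns in the same order as the category names.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = [
        [1 if index[value] == j else 0 for j in range(len(categories))]
        for value in values
    ]
    return categories, rows


categories, rows = one_hot(["red", "blue", "red"])
```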

Returns

This tuple contains three lists representing the splits of training, validation and testing sets. The length of the lists is equal to the number of folds specified by num_splits

Return type

tuple of list of tabular_torch_dataset.TorchTabularTextDataset
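
The splitting scheme described above (a fixed validation set held out first, then K-fold train/test splits over the remainder) can be sketched on row indices with the standard library alone. This illustrates the described behaviour, not the library's implementation. make_folds and its seed argument are hypothetical.

```python
import random


def make_folds(n_examples, num_splits, validation_ratio, seed=0):
    """Return a list of (train_idx, val_idx, test_idx), one tuple per fold."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)

    # Hold out one validation set that is shared by every fold.
    n_val = int(n_examples * validation_ratio)
    val_idx, rest = indices[:n_val], indices[n_val:]

    folds = []
    for k in range(num_splits):
        # Every num_splits-th remaining example, offset by k, is tested in fold k.
        test_idx = rest[k::num_splits]
        test_set = set(test_idx)
        train_idx = [i for i in rest if i not in test_set]
        folds.append((train_idx, val_idx, test_idx))
    return folds


folds = make_folds(n_examples=10, num_splits=4, validation_ratio=0.2)
```

Each remaining example appears in exactly one test split, while the validation indices are identical across folds, matching the "consistent validation set" wording of validation_ratio.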