multimodal_transformers.data

The data module provides functions for loading your own datasets into a multimodal_transformers.data.tabular_torch_dataset.TorchTabularTextDataset, which can be fed into a torch.utils.data.DataLoader. The outputs of the dataset's __getitem__ method can be passed directly to the forward pass of a model in multimodal_transformers.model.tabular_transformers.
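As a rough sketch of that flow, the snippet below wraps load_data (documented further down) and a DataLoader in one helper. The column names, label names, and batch size are hypothetical placeholders, and the import path assumes load_data is exposed from multimodal_transformers.data as this page suggests.

```python
# Hypothetical column and label names; replace them with those of your own DataFrame.
LOAD_DATA_KWARGS = dict(
    text_cols=["title", "review"],
    label_col="label",
    label_list=["negative", "positive"],
    categorical_cols=["category"],
    numerical_cols=["price"],
    sep_text_token_str="[SEP]",
    categorical_encode_type="ohe",
    max_token_length=128,
)


def build_loader(data_df, tokenizer, batch_size=16):
    """Build a DataLoader over the dataset returned by load_data.

    `data_df` is a pandas DataFrame and `tokenizer` a HuggingFace tokenizer.
    """
    # Imported lazily so the configuration above can be inspected
    # even when torch or multimodal_transformers is not installed.
    from torch.utils.data import DataLoader
    from multimodal_transformers.data import load_data  # assumed import path

    dataset = load_data(data_df, tokenizer=tokenizer, **LOAD_DATA_KWARGS)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)
```

Keeping the keyword arguments in one dictionary makes it easy to reuse the same preprocessing settings for several DataFrames.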

Note

You may still need to move the outputs of the __getitem__ method to the correct GPU device.
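One way to do this is a small helper like the one below. It is only a sketch: the helper name is hypothetical, and it assumes each batch is a dictionary whose tensor values expose a .to(device) method, as torch tensors do.

```python
def move_batch_to_device(batch, device):
    """Return a new dict with every tensor-like value moved to `device`.

    Values without a `.to` method (for example plain Python lists)
    are passed through unchanged. The input dict is not modified.
    """
    return {
        key: value.to(device) if hasattr(value, "to") else value
        for key, value in batch.items()
    }
```

With PyTorch, device would typically be torch.device("cuda") when a GPU is available and torch.device("cpu") otherwise, and the returned dictionary is what gets passed on to the model.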

Module contents

class TorchTabularTextDataset(encodings, categorical_feats, numerical_feats, labels=None, df=None, label_list=None, class_weights=None)[source]

Bases: torch.utils.data.dataset.Dataset

Torch Dataset wrapper for a text dataset with categorical and numerical features

Parameters
  • encodings (transformers.BatchEncoding) – The output of the encode_plus() and batch_encode_plus() methods (tokens, attention masks, etc.) of a transformers.PreTrainedTokenizer

  • categorical_feats (numpy.ndarray, of shape (n_examples, categorical feat dim), optional, defaults to None) – An array containing the preprocessed categorical features

  • numerical_feats (numpy.ndarray, of shape (n_examples, numerical feat dim), optional, defaults to None) – An array containing the preprocessed numerical features

  • labels (list or numpy.ndarray, optional, defaults to None) – The labels of the training examples

  • class_weights (numpy.ndarray, of shape (n_classes), optional, defaults to None) – Class weights used for cross entropy loss for classification

  • df (pandas.DataFrame, optional, defaults to None) – The DataFrame from which the dataset was built

get_labels()[source]

Returns the label names for classification.
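Label names are mainly useful for turning integer predictions back into readable classes. The sketch below assumes the names are ordered so that position i names class i, as described for label_list; the helper name and the class names are hypothetical.

```python
def ids_to_label_names(predicted_ids, label_names):
    """Map integer class ids (0 to n_classes-1) to their class names."""
    return [label_names[i] for i in predicted_ids]


# Hypothetical class names, in the order they would be passed as label_list.
label_names = ["negative", "neutral", "positive"]
print(ids_to_label_names([2, 0, 1], label_names))
# prints ['positive', 'negative', 'neutral']
```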

load_data(data_df, text_cols, tokenizer, label_col, label_list=None, categorical_cols=None, numerical_cols=None, sep_text_token_str=' ', categorical_encode_type='ohe', numerical_transformer=None, empty_text_values=None, replace_empty_text=None, max_token_length=None, debug=False)[source]

Function to load a single dataset given a pandas DataFrame

Given a DataFrame, this function loads the data into a tabular_torch_dataset.TorchTabularTextDataset object, which can be used with a torch.utils.data.DataLoader.

Parameters
  • data_df (pd.DataFrame) – The DataFrame to convert to a TorchTabularTextDataset

  • text_cols (list of str) – The column names in the dataset that contain the text to load

  • tokenizer (transformers.tokenization_utils.PreTrainedTokenizer) – HuggingFace tokenizer used to tokenize the input texts as specified by text_cols

  • label_col (str) – The column name of the label. For classification, the column should hold int values from 0 to n_classes-1 identifying the class of each example. For regression, the column can hold any numerical value

  • label_list (list of str, optional) – Used for classification; the names of the classes indexed by the values in label_col.

  • categorical_cols (list of str, optional) – The column names in the dataset that contain categorical features. The features can either be already numerically encoded or be preprocessed by the method specified by categorical_encode_type

  • numerical_cols (list of str, optional) – The column names in the dataset that contain numerical features. These columns should contain only numeric values.

  • sep_text_token_str (str, optional) – The string token used to separate the different text columns of a given data example. For BERT, for example, this could be the [SEP] token.

  • categorical_encode_type (str, optional) – Given categorical_cols, this specifies which method is used to preprocess the categorical features. Choices: ['ohe', 'binary', None]. See encode_features.CategoricalFeatures for more details

  • numerical_transformer (sklearn.base.TransformerMixin) – The sklearn numeric transformer instance to transform our numerical features

  • empty_text_values (list of str, optional) – Specifies which text values are treated as missing; these are replaced by replace_empty_text

  • replace_empty_text (str, optional) – The string that replaces texts matching any value in empty_text_values. If this argument is None, texts matching empty_text_values are skipped

  • max_token_length (int, optional) – The token length to pad or truncate to on the input text

  • debug (bool, optional) – Whether or not to load a smaller debug version of the dataset

Returns

The converted dataset

Return type

tabular_torch_dataset.TorchTabularTextDataset
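The interaction between sep_text_token_str, empty_text_values, and replace_empty_text can be illustrated with a small pure-Python sketch. This is not the library's implementation, only a reading of the parameter descriptions above; the helper name and example texts are hypothetical.

```python
def combine_text_columns(texts, sep_text_token_str=" ",
                         empty_text_values=None, replace_empty_text=None):
    """Join the text columns of one example into a single string."""
    empty = set(empty_text_values or [])
    kept = []
    for text in texts:
        if text in empty:
            # A missing text is replaced if a replacement is given, else skipped.
            if replace_empty_text is not None:
                kept.append(replace_empty_text)
        else:
            kept.append(text)
    return sep_text_token_str.join(kept)


row = ["Great phone", "nan", "Battery lasts two days"]

print(combine_text_columns(row, " [SEP] ", empty_text_values=["nan", ""]))
# prints Great phone [SEP] Battery lasts two days

print(combine_text_columns(row, " [SEP] ", empty_text_values=["nan", ""],
                           replace_empty_text="missing"))
# prints Great phone [SEP] missing [SEP] Battery lasts two days
```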

load_data_from_folder(folder_path, text_cols, tokenizer, label_col, label_list=None, categorical_cols=None, numerical_cols=None, sep_text_token_str=' ', categorical_encode_type='ohe', numerical_transformer_method='quantile_normal', empty_text_values=None, replace_empty_text=None, max_token_length=None, debug=False)[source]

Function to load tabular and text data from a specified folder

Loads train, test and/or validation text and tabular data from the specified folder path into TorchTabularTextDataset objects and performs categorical and numerical data preprocessing if specified. The folder is expected to contain train.csv and test.csv, and optionally val.csv, holding the training, testing, and validation sets respectively.

Parameters
  • folder_path (str) – The path to the folder containing train.csv and test.csv (and optionally val.csv)

  • text_cols (list of str) – The column names in the dataset that contain the text to load

  • tokenizer (transformers.tokenization_utils.PreTrainedTokenizer) – HuggingFace tokenizer used to tokenize the input texts as specified by text_cols

  • label_col (str) – The column name of the label. For classification, the column should hold int values from 0 to n_classes-1 identifying the class of each example. For regression, the column can hold any numerical value

  • label_list (list of str, optional) – Used for classification; the names of the classes indexed by the values in label_col.

  • categorical_cols (list of str, optional) – The column names in the dataset that contain categorical features. The features can either be already numerically encoded or be preprocessed by the method specified by categorical_encode_type

  • numerical_cols (list of str, optional) – The column names in the dataset that contain numerical features. These columns should contain only numeric values.

  • sep_text_token_str (str, optional) – The string token used to separate the different text columns of a given data example. For BERT, for example, this could be the [SEP] token.

  • categorical_encode_type (str, optional) – Given categorical_cols, this specifies which method is used to preprocess the categorical features. Choices: ['ohe', 'binary', None]. See encode_features.CategoricalFeatures for more details

  • numerical_transformer_method (str, optional) – Given numerical_cols, this specifies which method is used to normalize the numerical data. Choices: ['yeo_johnson', 'box_cox', 'quantile_normal', None]. See https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html for more details

  • empty_text_values (list of str, optional) – Specifies which text values are treated as missing; these are replaced by replace_empty_text

  • replace_empty_text (str, optional) – The string that replaces texts matching any value in empty_text_values. If this argument is None, texts matching empty_text_values are skipped

  • max_token_length (int, optional) – The token length to pad or truncate to on the input text

  • debug (bool, optional) – Whether or not to load a smaller debug version of the dataset

Returns

A tuple containing the training, validation and testing sets. The validation dataset is None if there is no val.csv in folder_path

Return type

tuple of tabular_torch_dataset.TorchTabularTextDataset
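Before calling load_data_from_folder it can help to verify the folder layout described above. The helper below is a hypothetical pre-flight check, not part of the library: it requires train.csv and test.csv and treats val.csv as optional.

```python
from pathlib import Path


def find_split_files(folder_path):
    """Locate train.csv, val.csv and test.csv inside `folder_path`.

    Returns a dict mapping "train", "val" and "test" to a Path.
    The "val" entry is None when val.csv does not exist.
    Raises FileNotFoundError if train.csv or test.csv is missing.
    """
    folder = Path(folder_path)
    paths = {name: folder / f"{name}.csv" for name in ("train", "val", "test")}

    missing = [
        path.name
        for name, path in paths.items()
        if name != "val" and not path.is_file()
    ]
    if missing:
        raise FileNotFoundError(
            f"{folder} is missing required file(s): {', '.join(missing)}"
        )

    if not paths["val"].is_file():
        paths["val"] = None
    return paths
```

A None entry for "val" mirrors the documented behaviour that the validation dataset is None when val.csv is absent.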

load_data_into_folds(data_csv_path, num_splits, validation_ratio, text_cols, tokenizer, label_col, label_list=None, categorical_cols=None, numerical_cols=None, sep_text_token_str=' ', categorical_encode_type='ohe', numerical_transformer_method='quantile_normal', empty_text_values=None, replace_empty_text=None, max_token_length=None, debug=False)[source]

Function to load tabular and text data from a specified CSV file into folds

Loads text and tabular data from the specified CSV path and splits it into num_splits sets of train, validation and test data for K-fold cross validation. Performs categorical and numerical data preprocessing if specified. data_csv_path is the path to the CSV file containing the data.

Parameters
  • data_csv_path (str) – The path to the csv containing the data

  • num_splits (int) – The number of cross validation folds to split the data into.

  • validation_ratio (float) – A float between 0 and 1 representing the fraction of the data to hold out as a consistent validation set.

  • text_cols (list of str) – The column names in the dataset that contain the text to load

  • tokenizer (transformers.tokenization_utils.PreTrainedTokenizer) – HuggingFace tokenizer used to tokenize the input texts as specified by text_cols

  • label_col (str) – The column name of the label. For classification, the column should hold int values from 0 to n_classes-1 identifying the class of each example. For regression, the column can hold any numerical value

  • label_list (list of str, optional) – Used for classification; the names of the classes indexed by the values in label_col.

  • categorical_cols (list of str, optional) – The column names in the dataset that contain categorical features. The features can either be already numerically encoded or be preprocessed by the method specified by categorical_encode_type

  • numerical_cols (list of str, optional) – The column names in the dataset that contain numerical features. These columns should contain only numeric values.

  • sep_text_token_str (str, optional) – The string token used to separate the different text columns of a given data example. For BERT, for example, this could be the [SEP] token.

  • categorical_encode_type (str, optional) – Given categorical_cols, this specifies which method is used to preprocess the categorical features. Choices: ['ohe', 'binary', None]. See encode_features.CategoricalFeatures for more details

  • numerical_transformer_method (str, optional) – Given numerical_cols, this specifies which method is used to normalize the numerical data. Choices: ['yeo_johnson', 'box_cox', 'quantile_normal', None]. See https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html for more details

  • empty_text_values (list of str, optional) – Specifies which text values are treated as missing; these are replaced by replace_empty_text

  • replace_empty_text (str, optional) – The string that replaces texts matching any value in empty_text_values. If this argument is None, texts matching empty_text_values are skipped

  • max_token_length (int, optional) – The token length to pad or truncate to on the input text

  • debug (bool, optional) – Whether or not to load a smaller debug version of the dataset

Returns

A tuple containing three lists, holding the training, validation and testing splits respectively. The length of each list equals the number of folds specified by num_splits

Return type

tuple of list of tabular_torch_dataset.TorchTabularTextDataset
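To make the shape of the return value concrete, the sketch below splits row indices in the way the description suggests: a fixed validation set is held out first, and the remaining rows are divided into num_splits folds, each serving once as the test set. It is an illustration under those assumptions, not the library's implementation, which may shuffle or stratify differently; the helper name and the seed parameter are hypothetical.

```python
import random


def split_into_folds(n_examples, num_splits, validation_ratio, seed=0):
    """Return (train_splits, val_splits, test_splits) as lists of index lists."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)

    # Hold out one validation set that is shared by every fold.
    n_val = int(n_examples * validation_ratio)
    val_indices = indices[:n_val]
    remaining = indices[n_val:]

    # Deal the remaining indices into num_splits roughly equal folds.
    folds = [remaining[i::num_splits] for i in range(num_splits)]

    train_splits, val_splits, test_splits = [], [], []
    for k in range(num_splits):
        test_splits.append(folds[k])
        train_splits.append(
            [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        )
        val_splits.append(val_indices)
    return train_splits, val_splits, test_splits


train, val, test = split_into_folds(n_examples=20, num_splits=4, validation_ratio=0.2)
print(len(train), len(val), len(test))  # each list has one entry per fold
```

Each of the three lists has num_splits entries, matching the documented return value, and the validation entry is the same for every fold.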