multimodal_transformers.data¶
The data module includes helper functions to load your own datasets into a multimodal_transformers.data.tabular_torch_dataset.TorchTabularTextDataset, which can be fed into a torch.utils.data.DataLoader. The outputs of TorchTabularTextDataset's __getitem__ method can be fed directly to the forward pass of a model in multimodal_transformers.model.tabular_transformers.

Note

You may still need to move the __getitem__ method's outputs to the right GPU device.
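As a sketch of what moving the outputs looks like in practice (the key names in the batch below are illustrative assumptions, not the dataset's exact output format), a batch coming out of a DataLoader can be moved tensor by tensor:

```python
import torch

# Illustrative batch shaped like a DataLoader batch over the dataset's
# __getitem__ outputs; the exact key names here are assumptions.
batch = {
    "input_ids": torch.tensor([[101, 7592, 102]]),
    "attention_mask": torch.tensor([[1, 1, 1]]),
    "cat_feats": torch.tensor([[1.0, 0.0]]),
    "numerical_feats": torch.tensor([[0.5, -1.2]]),
    "labels": torch.tensor([1]),
}

# Move every tensor in the batch to the model's device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = {k: v.to(device) for k, v in batch.items()}
```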
Module contents¶
- class TorchTabularTextDataset(encodings, categorical_feats, numerical_feats, labels=None, df=None, label_list=None, class_weights=None)[source]¶
  Bases: torch.utils.data.dataset.Dataset

  A torch.utils.data.Dataset wrapper for a text dataset with categorical and numerical features.
  - Parameters
    - encodings (transformers.BatchEncoding) – The output of the encode_plus() and batch_encode() methods (tokens, attention masks, etc.) of a transformers.PreTrainedTokenizer
    - categorical_feats (numpy.ndarray of shape (n_examples, categorical feat dim), optional, defaults to None) – An array containing the preprocessed categorical features
    - numerical_feats (numpy.ndarray of shape (n_examples, numerical feat dim), optional, defaults to None) – An array containing the preprocessed numerical features
    - labels (list or numpy.ndarray, optional, defaults to None) – The labels of the training examples
    - df (pandas.DataFrame, optional, defaults to None) – The original dataset DataFrame that the features were extracted from, kept for reference
    - label_list (list of str, optional, defaults to None) – Used for classification; the names of the classes indexed by the values in labels
    - class_weights (numpy.ndarray of shape (n_classes,), optional, defaults to None) – Class weights used for the cross entropy loss in classification
- load_data(data_df, text_cols, tokenizer, label_col, label_list=None, categorical_cols=None, numerical_cols=None, sep_text_token_str=' ', categorical_encode_type='ohe', numerical_transformer=None, empty_text_values=None, replace_empty_text=None, max_token_length=None, debug=False)[source]¶
  Function to load a single dataset given a pandas DataFrame.

  Given a DataFrame, this function loads the data into a torch_dataset.TorchTextDataset object which can be used in a torch.utils.data.DataLoader.
  - Parameters
    - data_df (pd.DataFrame) – The DataFrame to convert to a TorchTextDataset
    - text_cols (list of str) – The column names in the dataset that contain the text we want to load
    - tokenizer (transformers.tokenization_utils.PreTrainedTokenizer) – HuggingFace tokenizer used to tokenize the input texts specified by text_cols
    - label_col (str) – The column name of the label. For classification the column should have int values from 0 to n_classes-1 as the label for each class. For regression the column can have any numerical value
    - label_list (list of str, optional) – Used for classification; the names of the classes indexed by the values in label_col
    - categorical_cols (list of str, optional) – The column names in the dataset that contain categorical features. The features may already be prepared numerically, or they can be preprocessed by the method specified by categorical_encode_type
    - numerical_cols (list of str, optional) – The column names in the dataset that contain numerical features. These columns should contain only numeric values
    - sep_text_token_str (str, optional) – The string token used to separate the different text columns for a given data example. For BERT, for example, this could be the [SEP] token
    - categorical_encode_type (str, optional) – Given categorical_cols, this specifies how to preprocess the categorical features. Choices: ['ohe', 'binary', None]. See encode_features.CategoricalFeatures for more details
    - numerical_transformer (sklearn.base.TransformerMixin) – The sklearn numeric transformer instance used to transform the numerical features
    - empty_text_values (list of str, optional) – Specifies which texts should be considered missing and replaced by replace_empty_text
    - replace_empty_text (str, optional) – The string that will replace texts matching those in empty_text_values. If this argument is None then texts matching empty_text_values are skipped
    - max_token_length (int, optional) – The token length to pad or truncate the input text to
    - debug (bool, optional) – Whether or not to load a smaller debug version of the dataset
- Returns
The converted dataset
- Return type
tabular_torch_dataset.TorchTextDataset
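A minimal usage sketch. The DataFrame and its column names below are made up for illustration; the load_data call itself is shown commented out because it requires a downloaded HuggingFace tokenizer:

```python
import pandas as pd

# A toy DataFrame with one text column, one categorical column,
# one numerical column, and an integer label column (made-up data).
df = pd.DataFrame({
    "review": ["Great product, works well", "Broke after two days"],
    "brand": ["acme", "zenith"],
    "price": [19.99, 5.49],
    "label": [1, 0],
})

# With a tokenizer in hand, the call might look like:
# from transformers import AutoTokenizer
# from multimodal_transformers.data import load_data
# tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# dataset = load_data(
#     df,
#     text_cols=["review"],
#     tokenizer=tokenizer,
#     label_col="label",
#     label_list=["negative", "positive"],
#     categorical_cols=["brand"],
#     numerical_cols=["price"],
#     sep_text_token_str=tokenizer.sep_token,
# )
```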
- load_data_from_folder(folder_path, text_cols, tokenizer, label_col, label_list=None, categorical_cols=None, numerical_cols=None, sep_text_token_str=' ', categorical_encode_type='ohe', numerical_transformer_method='quantile_normal', empty_text_values=None, replace_empty_text=None, max_token_length=None, debug=False)[source]¶
  Function to load tabular and text data from a specified folder.

  Loads train, test, and/or validation text and tabular data from the specified folder path into TorchTextDataset classes and does categorical and numerical data preprocessing if specified. The folder is expected to contain a train.csv and a test.csv (and, if given, a val.csv) containing the training, testing, and validation sets respectively.
  - Parameters
    - folder_path (str) – The path to the folder containing train.csv and test.csv (and, if given, val.csv)
    - text_cols (list of str) – The column names in the dataset that contain the text we want to load
    - tokenizer (transformers.tokenization_utils.PreTrainedTokenizer) – HuggingFace tokenizer used to tokenize the input texts specified by text_cols
    - label_col (str) – The column name of the label. For classification the column should have int values from 0 to n_classes-1 as the label for each class. For regression the column can have any numerical value
    - label_list (list of str, optional) – Used for classification; the names of the classes indexed by the values in label_col
    - categorical_cols (list of str, optional) – The column names in the dataset that contain categorical features. The features may already be prepared numerically, or they can be preprocessed by the method specified by categorical_encode_type
    - numerical_cols (list of str, optional) – The column names in the dataset that contain numerical features. These columns should contain only numeric values
    - sep_text_token_str (str, optional) – The string token used to separate the different text columns for a given data example. For BERT, for example, this could be the [SEP] token
    - categorical_encode_type (str, optional) – Given categorical_cols, this specifies how to preprocess the categorical features. Choices: ['ohe', 'binary', None]. See encode_features.CategoricalFeatures for more details
    - numerical_transformer_method (str, optional) – Given numerical_cols, this specifies which method to use to normalize the numerical data. Choices: ['yeo_johnson', 'box_cox', 'quantile_normal', None]. See https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html for more details
    - empty_text_values (list of str, optional) – Specifies which texts should be considered missing and replaced by replace_empty_text
    - replace_empty_text (str, optional) – The string that will replace texts matching those in empty_text_values. If this argument is None then texts matching empty_text_values are skipped
    - max_token_length (int, optional) – The token length to pad or truncate the input text to
    - debug (bool, optional) – Whether or not to load a smaller debug version of the dataset
- Returns
    A tuple containing the training, validation, and testing sets. The validation dataset is None if there is no val.csv in folder_path
  - Return type
    tuple of tabular_torch_dataset.TorchTextDataset
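The expected folder layout can be sketched as follows. The CSV columns below are made up, and the load_data_from_folder call is commented out since it requires a tokenizer:

```python
import os
import tempfile

import pandas as pd

# Build a folder matching the expected layout: train.csv and test.csv
# are required, val.csv is optional (toy data, made-up column names).
folder = tempfile.mkdtemp()
toy = pd.DataFrame({
    "text": ["first example", "second example"],
    "cat": ["a", "b"],
    "num": [1.0, 2.0],
    "label": [0, 1],
})
toy.to_csv(os.path.join(folder, "train.csv"), index=False)
toy.to_csv(os.path.join(folder, "test.csv"), index=False)

# With no val.csv present, the returned validation dataset is None:
# train_ds, val_ds, test_ds = load_data_from_folder(
#     folder, text_cols=["text"], tokenizer=tokenizer, label_col="label",
#     categorical_cols=["cat"], numerical_cols=["num"],
# )
```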
- load_data_into_folds(data_csv_path, num_splits, validation_ratio, text_cols, tokenizer, label_col, label_list=None, categorical_cols=None, numerical_cols=None, sep_text_token_str=' ', categorical_encode_type='ohe', numerical_transformer_method='quantile_normal', empty_text_values=None, replace_empty_text=None, max_token_length=None, debug=False)[source]¶
  Function to load tabular and text data from a specified csv into folds.

  Loads train, val, and test text and tabular data from the csv at data_csv_path into num_splits folds of train, val, and test sets for K-fold cross validation, and performs categorical and numerical data preprocessing if specified.
  - Parameters
    - data_csv_path (str) – The path to the csv containing the data
    - num_splits (int) – The number of cross validation folds to split the data into
    - validation_ratio (float) – A float between 0 and 1 representing the fraction of the data to hold out as a consistent validation set
    - text_cols (list of str) – The column names in the dataset that contain the text we want to load
    - tokenizer (transformers.tokenization_utils.PreTrainedTokenizer) – HuggingFace tokenizer used to tokenize the input texts specified by text_cols
    - label_col (str) – The column name of the label. For classification the column should have int values from 0 to n_classes-1 as the label for each class. For regression the column can have any numerical value
    - label_list (list of str, optional) – Used for classification; the names of the classes indexed by the values in label_col
    - categorical_cols (list of str, optional) – The column names in the dataset that contain categorical features. The features may already be prepared numerically, or they can be preprocessed by the method specified by categorical_encode_type
    - numerical_cols (list of str, optional) – The column names in the dataset that contain numerical features. These columns should contain only numeric values
    - sep_text_token_str (str, optional) – The string token used to separate the different text columns for a given data example. For BERT, for example, this could be the [SEP] token
    - categorical_encode_type (str, optional) – Given categorical_cols, this specifies how to preprocess the categorical features. Choices: ['ohe', 'binary', None]. See encode_features.CategoricalFeatures for more details
    - numerical_transformer_method (str, optional) – Given numerical_cols, this specifies which method to use to normalize the numerical data. Choices: ['yeo_johnson', 'box_cox', 'quantile_normal', None]. See https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html for more details
    - empty_text_values (list of str, optional) – Specifies which texts should be considered missing and replaced by replace_empty_text
    - replace_empty_text (str, optional) – The string that will replace texts matching those in empty_text_values. If this argument is None then texts matching empty_text_values are skipped
    - max_token_length (int, optional) – The token length to pad or truncate the input text to
    - debug (bool, optional) – Whether or not to load a smaller debug version of the dataset
  - Returns
    A tuple containing three lists representing the splits of the training, validation, and testing sets. The length of each list equals the number of folds specified by num_splits
  - Return type
    tuple of list of tabular_torch_dataset.TorchTextDataset
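Since the three returned lists are index-aligned by fold, a typical cross-validation loop just zips them together. This is a sketch only; the load_data_into_folds call is commented out because it needs real data and a tokenizer, and placeholder strings stand in for the returned datasets:

```python
# train_sets, val_sets, test_sets = load_data_into_folds(
#     "data.csv", num_splits=5, validation_ratio=0.1,
#     text_cols=["text"], tokenizer=tokenizer, label_col="label",
# )

# Placeholder lists standing in for the three returned lists of datasets.
num_splits = 5
train_sets = [f"train_fold_{i}" for i in range(num_splits)]
val_sets = [f"val_fold_{i}" for i in range(num_splits)]
test_sets = [f"test_fold_{i}" for i in range(num_splits)]

fold_results = []
for fold, (train_ds, val_ds, test_ds) in enumerate(
    zip(train_sets, val_sets, test_sets)
):
    # Train on train_ds, tune on val_ds, evaluate on test_ds here.
    fold_results.append((fold, train_ds, test_ds))
```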