banhxeo.data package
- class banhxeo.data.BaseTextDataset(root_dir: str | None, split_name: str, config: DatasetConfig, seed: int, download: bool = True)[source]
Bases: object
Abstract base class for raw text datasets.
This class handles common dataset operations such as downloading and extracting archives, and provides an interface for accessing raw text samples. It’s designed to be subclassed for specific datasets. Raw datasets return text strings, which can then be converted to PyTorch datasets via to_torch_dataset.
- Variables:
root_path (Path) – The root directory for storing datasets.
dataset_base_path (Path) – The specific directory for this dataset (e.g., root_path/datasets/MyDatasetName).
config (DatasetConfig) – Configuration for the dataset, including name, URL, file info, etc.
split_name (str) – The name of the dataset split (e.g., “train”, “test”).
seed (int) – Random seed, primarily for reproducibility if subsampling or shuffling is involved at this stage.
max_workers (int) – Maximum number of workers for parallel processing tasks (e.g., file reading in subclasses).
_data (Any) – Internal storage for the loaded dataset samples. Subclasses are responsible for populating this.
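For illustration, a minimal sketch of a hypothetical subclass. The DatasetConfig fields used here (name, url) and the line-per-sample loading are assumptions for the sketch, not part of the documented API; only the __init__ signature, dataset_base_path, and the contract that subclasses populate self._data come from the documentation above:

```python
from banhxeo.data import BaseTextDataset
from banhxeo.data.config import DatasetConfig


class MyLinesDataset(BaseTextDataset):
    """Hypothetical subclass that reads one text sample per line."""

    def __init__(self, root_dir=None, split_name="train", seed=1234):
        # NOTE: the DatasetConfig fields (name, url) are assumptions for
        # illustration; see banhxeo.data.config.DatasetConfig for the schema.
        config = DatasetConfig(name="MyLines", url=None)
        super().__init__(root_dir, split_name, config, seed, download=False)
        # Subclasses are responsible for populating self._data (see Variables above).
        data_file = self.dataset_base_path / f"{split_name}.txt"
        self._data = data_file.read_text(encoding="utf-8").splitlines()
```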
- __init__(root_dir: str | None, split_name: str, config: DatasetConfig, seed: int, download: bool = True)[source]
Initializes the BaseTextDataset.
- Parameters:
root_dir – The root directory where datasets are stored. If None, defaults to the current working directory.
split_name – The name of the dataset split (e.g., “train”, “test”).
config – A DatasetConfig object containing metadata for the dataset.
seed – A random seed for reproducibility.
download – If True, attempts to download and extract the dataset if it’s not already present and config.url is provided.
- get_all_texts() List[str] [source]
Extracts all text content from the dataset.
Iterates through the dataset using __getitem__ and extracts the text portion of each sample. Assumes samples are strings, tuples whose first element is the text, or dictionaries containing a self.config.text_column key.
- Returns:
A list of all text strings in the dataset.
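For example, get_all_texts lends itself to corpus-level preprocessing such as frequency counting. A sketch, assuming dataset is an already-constructed BaseTextDataset subclass instance:

```python
from collections import Counter

texts = dataset.get_all_texts()  # List[str]
token_counts = Counter(tok for text in texts for tok in text.split())
print(f"{len(texts)} samples, {len(token_counts)} distinct whitespace tokens")
```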
- get_data() Any [source]
Returns the internal data structure holding all samples.
The type of this structure (self._data) depends on the subclass (e.g., list, Polars DataFrame, Hugging Face Dataset).
- Returns:
The raw, loaded dataset.
- to_torch_dataset(tokenizer: Tokenizer, vocab: Vocabulary, **kwargs)[source]
Converts this raw text dataset into a TorchTextDataset.
This method sets up the necessary configurations for tokenization, numericalization, and transformations to prepare the data for PyTorch models.
- Parameters:
tokenizer – The Tokenizer instance to use.
vocab – The Vocabulary instance for ID mapping.
**kwargs – Additional configuration options:
- add_special_tokens (bool): Passed to TokenizerConfig. Defaults to False.
- max_length (Optional[int]): Passed to TokenizerConfig. Defaults to None.
- truncation (bool): Passed to TokenizerConfig. Defaults to False.
- padding (Union[bool, str]): Passed to TokenizerConfig. Defaults to False.
- is_classification (bool): Whether this is for a classification task. Defaults to False.
- transforms (Union[List[“Transforms”], “ComposeTransforms”]): Preprocessing transforms to apply to text before tokenization. Defaults to [].
- label_map (Optional[Dict[str, int]]): Mapping for labels if is_classification is True. Defaults to self.config.label_map.
- text_column_name (str): Name of the text column. Defaults to self.config.text_column.
- label_column_name (Optional[str]): Name of the label column. Defaults to self.config.label_column.
- Returns:
A TorchTextDataset instance ready for use with PyTorch DataLoaders.
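A sketch of the typical conversion path, assuming tokenizer and vocab instances have been built elsewhere (their construction is outside this module). The keyword arguments are the documented options above, though the exact string accepted by padding is an assumption:

```python
from torch.utils.data import DataLoader

torch_ds = dataset.to_torch_dataset(
    tokenizer,              # a banhxeo Tokenizer instance
    vocab,                  # a banhxeo Vocabulary instance
    max_length=256,
    truncation=True,
    padding="max_length",   # Union[bool, str]; this string value is assumed
    is_classification=True,
    label_map={"neg": 0, "pos": 1},
)
loader = DataLoader(torch_ds, batch_size=32, shuffle=True)
```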
- classmethod load_from_huggingface(hf_path: str, hf_name: str | None = None, root_dir: str | None = None, split_name: str = 'train', text_column: str = 'text', label_column: str | None = 'label', seed: int = 1234, **hf_load_kwargs)[source]
Loads a dataset from Hugging Face Datasets Hub.
This classmethod creates an instance of the calling BaseTextDataset subclass (or BaseTextDataset itself if called directly, though subclasses are typical) and populates its _data attribute with the loaded Hugging Face dataset.
- Parameters:
hf_path – The path or name of the dataset on Hugging Face Hub (e.g., “imdb”, “glue”).
hf_name – The specific configuration or subset of the dataset (e.g., “cola” for “glue”).
root_dir – Root directory for dataset caching (can be managed by HF Datasets). If provided, used to construct a unique DatasetConfig.name.
split_name – The dataset split to load (e.g., “train”, “test”, “validation”, “train[:10%]”).
text_column – The name of the column containing text data in the HF dataset.
label_column – The name of the column containing label data. Can be None.
seed – Random seed, primarily for dataset config naming consistency.
**hf_load_kwargs – Additional keyword arguments to pass to datasets.load_dataset() (e.g., cache_dir, num_proc).
- Returns:
An instance of the class this method is called on, with _data populated by the Hugging Face dataset.
- Raises:
ImportError – If datasets library is not installed.
ValueError – If specified text_column or label_column are not found in the loaded dataset.
Exception – Propagates exceptions from datasets.load_dataset().
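For example, loading a slice of IMDB directly from the Hub (requires the datasets library; the text and label column names match the Hub’s imdb dataset):

```python
from banhxeo.data import BaseTextDataset

# Load the first 10% of the IMDB training split from the Hugging Face Hub.
imdb = BaseTextDataset.load_from_huggingface(
    "imdb",
    split_name="train[:10%]",
    text_column="text",
    label_column="label",
    seed=1234,
)
print(len(imdb.get_all_texts()))
```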
- static inspect_huggingface_dataset(hf_path: str, hf_name: str | None = None)[source]
Prints information about a Hugging Face dataset.
Displays the dataset description, features, and available splits. Useful for exploring a dataset before loading it.
- Parameters:
hf_path – The path or name of the dataset on Hugging Face Hub.
hf_name – The specific configuration or subset of the dataset.
- Raises:
ImportError – If datasets library is not installed.
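For example, to check the GLUE “cola” subset before loading it:

```python
from banhxeo.data import BaseTextDataset

# Prints the dataset description, features, and available splits.
BaseTextDataset.inspect_huggingface_dataset("glue", hf_name="cola")
```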
- class banhxeo.data.IMDBDataset(root_dir: str | None = None, split_name: str = 'train', seed: int = 1234)[source]
Bases: BaseTextDataset
- __init__(root_dir: str | None = None, split_name: str = 'train', seed: int = 1234)[source]
Initializes the IMDBDataset.
- Parameters:
root_dir – The root directory where datasets are stored. If None, defaults to the current working directory.
split_name – The name of the dataset split (e.g., “train”, “test”).
seed – A random seed for reproducibility.
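Typical usage, assuming the bundled DatasetConfig supplies a download URL so the archive is fetched and extracted on first use (per the base-class download logic):

```python
from banhxeo.data import IMDBDataset

train_ds = IMDBDataset(root_dir="./data", split_name="train", seed=1234)
texts = train_ds.get_all_texts()
print(f"{len(texts)} raw IMDB reviews loaded")
```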
- class banhxeo.data.AmazonReviewFullDataset(root_dir: str | None = None, split_name: str = 'train', seed: int = 1234)[source]
Bases: BaseTextDataset
- __init__(root_dir: str | None = None, split_name: str = 'train', seed: int = 1234)[source]
Initializes the AmazonReviewFullDataset.
- Parameters:
root_dir – The root directory where datasets are stored. If None, defaults to the current working directory.
split_name – The name of the dataset split (e.g., “train”, “test”).
seed – A random seed for reproducibility.
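AmazonReviewFullDataset exposes the same interface, so the path from raw text to PyTorch batches is identical. A sketch, again assuming tokenizer and vocab are constructed elsewhere:

```python
from torch.utils.data import DataLoader

from banhxeo.data import AmazonReviewFullDataset

reviews = AmazonReviewFullDataset(split_name="train", seed=1234)
torch_ds = reviews.to_torch_dataset(tokenizer, vocab, is_classification=True)
loader = DataLoader(torch_ds, batch_size=64, shuffle=True)
```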
Submodules
- banhxeo.data.base module
- banhxeo.data.config module
  - DatasetSplit
  - DownloadDatasetFile
  - DatasetConfig
  - TorchDatasetConfig
    - TorchDatasetConfig.tokenizer
    - TorchDatasetConfig.tokenizer_config
    - TorchDatasetConfig.vocab
    - TorchDatasetConfig.is_classification
    - TorchDatasetConfig.transforms
    - TorchDatasetConfig.text_column_name
    - TorchDatasetConfig.label_column_name
    - TorchDatasetConfig.label_map
    - TorchDatasetConfig.ensure_compose_transforms()
    - TorchDatasetConfig.Config
    - TorchDatasetConfig.model_config
- banhxeo.data.torch module
- banhxeo.data.transforms module