banhxeo.core.tokenizer module

class banhxeo.core.tokenizer.TokenizerConfig(*, add_special_tokens: bool = False, max_length: int | None = None, truncation: bool = False, padding: bool | Literal['do_not_pad', 'max_length'] = False)[source]

Bases: BaseModel

Configuration for tokenizers.

Variables:
  • add_special_tokens (bool) – Whether to add special tokens like BOS/EOS.

  • max_length (Optional[int]) – Maximum sequence length. If specified, truncation or padding might be applied.

  • truncation (bool) – Whether to truncate sequences longer than max_length.

  • padding (Union[bool, Literal['do_not_pad', 'max_length']]) – Padding strategy:

    – False or “do_not_pad”: no padding.

    – True or “max_length”: pad to max_length.

add_special_tokens: bool
max_length: int | None
truncation: bool
padding: bool | Literal['do_not_pad', 'max_length']
check_padding() Self[source]

Validates the padding configuration against max_length.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
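
A minimal construction sketch (field names are taken from the signature above; the exact error raised when check_padding rejects an inconsistent configuration is an assumption):

    from banhxeo.core.tokenizer import TokenizerConfig

    # Pad and truncate every sequence to 128 tokens and add BOS/EOS markers.
    config = TokenizerConfig(
        add_special_tokens=True,
        max_length=128,
        truncation=True,
        padding="max_length",
    )

    # check_padding validates padding against max_length, so e.g.
    # TokenizerConfig(padding="max_length") with no max_length set should
    # fail validation (assumption: surfaced as a pydantic validation error).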

class banhxeo.core.tokenizer.Tokenizer[source]

Bases: object

Abstract Base Class for all tokenizers.

Defines the core interface for tokenization, encoding, and managing tokenizer-specific data like pre-trained models or vocabularies.

tokenize(text: str, **kwargs) List[str][source]

Tokenizes a single string into a list of tokens.

Concrete tokenizer subclasses should override this method. The base implementation falls back to a simple regex-based tokenizer.

Parameters:
  • text – The input string to tokenize.

  • **kwargs – Subclass-specific tokenization arguments.

Returns:

A list of string tokens.
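
A sketch of a concrete subclass (WhitespaceTokenizer is a hypothetical name, not part of the library):

    from typing import List

    from banhxeo.core.tokenizer import Tokenizer

    class WhitespaceTokenizer(Tokenizer):
        """Hypothetical subclass: split on runs of whitespace."""

        def tokenize(self, text: str, **kwargs) -> List[str]:
            return text.split()

    tok = WhitespaceTokenizer()
    tok.tokenize("Hello tokenizer world")  # ['Hello', 'tokenizer', 'world']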

encode(text: str, vocab: Vocabulary, config: TokenizerConfig, **kwargs) Dict[str, List[int]][source]

Converts a text string into a dictionary of encoded features.

This base implementation handles tokenization, addition of special tokens, truncation, and padding based on the provided configuration. Subclasses can override this for more specialized encoding logic.

Parameters:
  • text – The input string to encode.

  • vocab – The vocabulary instance for mapping tokens to IDs.

  • config – TokenizerConfig object specifying encoding parameters.

  • **kwargs – Additional arguments, potentially passed to the tokenize method.

Returns:

A dictionary containing:

  • “input_ids”: List of token IDs.

  • “attention_mask”: List with 1 for real tokens and 0 for padding positions.

Return type:

Dict[str, List[int]]

Raises:

ValueError – If padding is ‘max_length’ but config.max_length is not set.
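
A usage sketch, continuing the hypothetical WhitespaceTokenizer above and assuming vocab is an already-built Vocabulary instance (how a Vocabulary is constructed is not documented in this module):

    from banhxeo.core.tokenizer import TokenizerConfig

    config = TokenizerConfig(max_length=6, truncation=True, padding="max_length")

    # vocab: a pre-built Vocabulary mapping tokens to IDs (assumed).
    encoded = tok.encode("a short sentence", vocab=vocab, config=config)

    encoded["input_ids"]       # e.g. [12, 47, 301, 0, 0, 0] (IDs illustrative)
    encoded["attention_mask"]  # [1, 1, 1, 0, 0, 0] -- zeros mark padding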

batch_encode(texts: List[str], vocab: Vocabulary, config: TokenizerConfig, **kwargs) List[Dict[str, List[int]]][source]

Encodes a batch of text strings.

Parameters:
  • texts – A list of strings to encode.

  • vocab – The vocabulary instance.

  • config – TokenizerConfig object.

  • **kwargs – Additional arguments for the encode method.

Returns:

A list of dictionaries, where each dictionary is the output of encode for the corresponding text.
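
Continuing the same sketch, batch_encode maps encode over the inputs:

    batch = tok.batch_encode(["first text", "second text"], vocab=vocab, config=config)

    len(batch)       # 2 -- one dict per input text
    batch[0].keys()  # dict_keys(['input_ids', 'attention_mask'])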

train_from_iterator(iterator: Iterable[str], vocab_size: int, min_frequency: int = 2, special_tokens: List[str] = ['<pad>', '<unk>', '<bos>', '<eos>'], **kwargs) None[source]

Trains the tokenizer from an iterator of texts.

This is primarily for tokenizers that learn a vocabulary or merges, such as BPE or WordPiece. Simpler tokenizers might implement this as a no-operation.

Parameters:
  • iterator – An iterable yielding text strings.

  • vocab_size – The desired vocabulary size.

  • min_frequency – The minimum frequency for a token to be included.

  • special_tokens – A list of special tokens to include.

  • **kwargs – Tokenizer-specific training arguments.

Raises:

NotImplementedError – If the tokenizer does not support training.
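
A training sketch for a trainable subclass (bpe_tok stands for a hypothetical BPE-style Tokenizer; calling this on a tokenizer without trainable state raises NotImplementedError):

    corpus = ["first document", "second document", "third document"]

    bpe_tok.train_from_iterator(
        iter(corpus),
        vocab_size=1_000,
        min_frequency=2,
        special_tokens=["<pad>", "<unk>", "<bos>", "<eos>"],
    )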

save_pretrained(save_directory: str | Path, **kwargs)[source]

Saves the tokenizer’s state to a directory.

This should save any learned vocabulary, merges, or configuration necessary to reload the tokenizer.

Parameters:
  • save_directory – Path to the directory where the tokenizer will be saved.

  • **kwargs – Additional saving arguments.

Raises:

NotImplementedError – If saving is not implemented.

classmethod from_pretrained(load_directory: str | Path, **kwargs) Tokenizer[source]

Loads a tokenizer from a previously saved directory.

Parameters:
  • load_directory – Path to the directory from which to load.

  • **kwargs – Additional loading arguments.

Returns:

An instance of the tokenizer.

Raises:

NotImplementedError – If loading is not implemented.
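
A persistence round-trip sketch (MyTokenizer is a hypothetical subclass that implements both methods; the base class raises NotImplementedError for each):

    my_tok = MyTokenizer()
    my_tok.save_pretrained("checkpoints/my_tokenizer")

    restored = MyTokenizer.from_pretrained("checkpoints/my_tokenizer")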

class banhxeo.core.tokenizer.NLTKTokenizer[source]

Bases: Tokenizer

A tokenizer that uses NLTK’s TreebankWordTokenizer.

Falls back to a regex-based tokenizer if NLTK is not installed.

tokenize(text: str, **kwargs) List[str][source]

Tokenizes text using NLTK’s TreebankWordTokenizer.

Parameters:
  • text – The input string.

  • **kwargs – Ignored (NLTK tokenizer doesn’t take extra args here).

Returns:

A list of tokens.

detokenize(tokens: List[str], **kwargs) str[source]

Detokenizes a list of tokens using NLTK’s TreebankWordDetokenizer.

Parameters:
  • tokens – A list of string tokens.

  • **kwargs – Ignored.

Returns:

The detokenized string.
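
A round-trip sketch (the Treebank token boundaries shown in the comment are illustrative):

    from banhxeo.core.tokenizer import NLTKTokenizer

    tok = NLTKTokenizer()
    tokens = tok.tokenize("Don't panic, it's fine.")
    # Treebank-style splits, e.g. ['Do', "n't", 'panic', ',', 'it', "'s", 'fine', '.']
    tok.detokenize(tokens)  # "Don't panic, it's fine."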