banhxeo.core.tokenizer module
- class banhxeo.core.tokenizer.TokenizerConfig(*, add_special_tokens: bool = False, max_length: int | None = None, truncation: bool = False, padding: bool | Literal['do_not_pad', 'max_length'] = False)[source]
Bases: BaseModel
Configuration for tokenizers.
- Variables:
add_special_tokens (bool) – Whether to add special tokens like BOS/EOS.
max_length (Optional[int]) – Maximum sequence length. If specified, truncation or padding might be applied.
truncation (bool) – Whether to truncate sequences longer than max_length.
padding (Union[bool, Literal['do_not_pad', 'max_length']]) – Padding strategy. Can be:
- False or “do_not_pad”: no padding is applied.
- True or “max_length”: pad sequences to max_length.
- add_special_tokens: bool
- max_length: int | None
- truncation: bool
- padding: bool | Literal['do_not_pad', 'max_length']
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to pydantic.ConfigDict.
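Example (a minimal sketch; the values below are illustrative, not library defaults):

    from banhxeo.core.tokenizer import TokenizerConfig

    # Truncate/pad every sequence to exactly 128 tokens and
    # add special tokens such as BOS/EOS.
    config = TokenizerConfig(
        add_special_tokens=True,
        max_length=128,
        truncation=True,
        padding="max_length",
    )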
- class banhxeo.core.tokenizer.Tokenizer[source]
Bases: object
Abstract Base Class for all tokenizers.
Defines the core interface for tokenization, encoding, and managing tokenizer-specific data like pre-trained models or vocabularies.
- tokenize(text: str, **kwargs) List[str] [source]
Tokenizes a single string into a list of tokens.
Concrete tokenizer subclasses should override this method; the base implementation falls back to a simple regex-based tokenizer.
- Parameters:
text – The input string to tokenize.
**kwargs – Subclass-specific tokenization arguments.
- Returns:
A list of string tokens.
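Usage sketch with a concrete subclass (here NLTKTokenizer, documented below; the exact output depends on the subclass):

    tokenizer = NLTKTokenizer()
    tokens = tokenizer.tokenize("Banh xeo is a savory pancake.")
    # e.g. ['Banh', 'xeo', 'is', 'a', 'savory', 'pancake', '.']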
- encode(text: str, vocab: Vocabulary, config: TokenizerConfig, **kwargs) Dict[str, List[int]] [source]
Converts a text string into a dictionary of encoded features.
This base implementation handles tokenization, addition of special tokens, truncation, and padding based on the provided configuration. Subclasses can override this for more specialized encoding logic.
- Parameters:
text – The input string to encode.
vocab – The vocabulary instance for mapping tokens to IDs.
config – TokenizerConfig object specifying encoding parameters.
**kwargs – Additional arguments, potentially passed to the tokenize method.
- Returns:
A dictionary containing:
“input_ids”: List of token IDs.
“attention_mask”: List of 0s and 1s indicating which positions are padding.
- Return type:
Dict[str, List[int]]
- Raises:
ValueError – If padding is ‘max_length’ but config.max_length is not set.
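A sketch of a typical call. It assumes an already-built Vocabulary instance named vocab; constructing one is not covered on this page:

    config = TokenizerConfig(
        add_special_tokens=True,
        max_length=8,
        truncation=True,
        padding="max_length",
    )
    encoded = tokenizer.encode("hello world", vocab=vocab, config=config)
    encoded["input_ids"]       # 8 token IDs, truncated/padded as configured
    encoded["attention_mask"]  # 0s and 1s marking padding positions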
- batch_encode(texts: List[str], vocab: Vocabulary, config: TokenizerConfig, **kwargs) List[Dict[str, List[int]]] [source]
Encodes a batch of text strings.
- Parameters:
texts – A list of strings to encode.
vocab – The vocabulary instance.
config – TokenizerConfig object.
**kwargs – Additional arguments for the encode method.
- Returns:
A list of dictionaries, where each dictionary is the output of encode for the corresponding text.
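Continuing the sketch above:

    batch = tokenizer.batch_encode(
        ["first document", "second document"],
        vocab=vocab,
        config=config,
    )
    len(batch)             # 2: one dict per input text
    batch[0]["input_ids"]  # same keys as a single encode() result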
- train_from_iterator(iterator: Iterable[str], vocab_size: int, min_frequency: int = 2, special_tokens: List[str] = ['<pad>', '<unk>', '<bos>', '<eos>'], **kwargs) None [source]
Trains the tokenizer from an iterator of texts.
This is primarily for tokenizers that learn a vocabulary or merges, such as BPE or WordPiece. Simpler tokenizers might implement this as a no-operation.
- Parameters:
iterator – An iterable yielding text strings.
vocab_size – The desired vocabulary size.
min_frequency – The minimum frequency for a token to be included.
special_tokens – A list of special tokens to include.
**kwargs – Tokenizer-specific training arguments.
- Raises:
NotImplementedError – If the tokenizer does not support training.
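A sketch for a hypothetical trainable subclass (the name MyBPETokenizer is illustrative; the base class itself raises NotImplementedError):

    tokenizer = MyBPETokenizer()  # hypothetical subclass that supports training
    with open("corpus.txt", encoding="utf-8") as f:
        tokenizer.train_from_iterator(
            (line.strip() for line in f),  # any iterable of strings works
            vocab_size=8000,
            min_frequency=2,
            special_tokens=["<pad>", "<unk>", "<bos>", "<eos>"],
        )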
- save_pretrained(save_directory: str | Path, **kwargs)[source]
Saves the tokenizer’s state to a directory.
This should save any learned vocabulary, merges, or configuration necessary to reload the tokenizer.
- Parameters:
save_directory – Path to the directory where the tokenizer will be saved.
**kwargs – Additional saving arguments.
- Raises:
NotImplementedError – If saving is not implemented.
- classmethod from_pretrained(load_directory: str | Path, **kwargs) Tokenizer [source]
Loads a tokenizer from a previously saved directory.
- Parameters:
load_directory – Path to the directory from which to load.
**kwargs – Additional loading arguments.
- Returns:
An instance of the tokenizer.
- Raises:
NotImplementedError – If loading is not implemented.
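Round-trip sketch for a subclass that implements persistence (otherwise both calls raise NotImplementedError):

    tokenizer.save_pretrained("checkpoints/tokenizer")
    restored = type(tokenizer).from_pretrained("checkpoints/tokenizer")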
- class banhxeo.core.tokenizer.NLTKTokenizer[source]
Bases: Tokenizer
A tokenizer that uses NLTK’s TreebankWordTokenizer.
Falls back to a regex-based tokenizer if NLTK is not installed.
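Usage sketch; the Treebank-style output shown assumes NLTK is installed:

    from banhxeo.core.tokenizer import NLTKTokenizer

    tok = NLTKTokenizer()
    tok.tokenize("Don't stop!")
    # With NLTK available, the Treebank tokenizer splits contractions:
    # ['Do', "n't", 'stop', '!']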