banhxeo.core package

class banhxeo.core.NLTKTokenizer[source]

Bases: Tokenizer

A tokenizer that uses NLTK’s TreebankWordTokenizer.

Falls back to a regex-based tokenizer if NLTK is not installed.

tokenize(text: str, **kwargs) List[str][source]

Tokenizes text using NLTK’s TreebankWordTokenizer.

Parameters:
  • text – The input string.

  • **kwargs – Ignored (NLTK tokenizer doesn’t take extra args here).

Returns:

A list of tokens.

detokenize(tokens: List[str], **kwargs) str[source]

Detokenizes a list of tokens using NLTK’s TreebankWordDetokenizer.

Parameters:
  • tokens – A list of string tokens.

  • **kwargs – Ignored.

Returns:

The detokenized string.
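
Example (a usage sketch; assumes NLTKTokenizer takes no constructor arguments and that NLTK is available — under the regex fallback the exact split may differ):

>>> from banhxeo.core import NLTKTokenizer
>>> tok = NLTKTokenizer()
>>> tokens = tok.tokenize("Don't panic!")   # Treebank-style split, e.g. ['Do', "n't", 'panic', '!']
>>> text = tok.detokenize(tokens)           # reassembles a string close to the original input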

class banhxeo.core.Tokenizer[source]

Bases: object

Abstract Base Class for all tokenizers.

Defines the core interface for tokenization, encoding, and managing tokenizer-specific data like pre-trained models or vocabularies.

tokenize(text: str, **kwargs) List[str][source]

Tokenizes a single string into a list of tokens.

This method should be implemented by all concrete tokenizer subclasses. The base implementation provides a simple regex-based tokenizer as a fallback.

Parameters:
  • text – The input string to tokenize.

  • **kwargs – Subclass-specific tokenization arguments.

Returns:

A list of string tokens.
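
A sketch of a concrete subclass (the WhitespaceTokenizer name and whitespace rule are illustrative, not part of the library; assumes the base class can be instantiated without arguments):

>>> from typing import List
>>> from banhxeo.core import Tokenizer
>>> class WhitespaceTokenizer(Tokenizer):
...     def tokenize(self, text: str, **kwargs) -> List[str]:
...         # naive whitespace split; real tokenizers apply their own rules
...         return text.split()
...
>>> WhitespaceTokenizer().tokenize("hello brave world")   # ['hello', 'brave', 'world']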

encode(text: str, vocab: Vocabulary, config: TokenizerConfig, **kwargs) Dict[str, List[int]][source]

Converts a text string into a dictionary of encoded features.

This base implementation handles tokenization, addition of special tokens, truncation, and padding based on the provided configuration. Subclasses can override this for more specialized encoding logic.

Parameters:
  • text – The input string to encode.

  • vocab – The vocabulary instance for mapping tokens to IDs.

  • config – TokenizerConfig object specifying encoding parameters.

  • **kwargs – Additional arguments, potentially passed to the tokenize method.

Returns:

A dictionary containing:

  • “input_ids”: List of token IDs.

  • “attention_mask”: List of 0s and 1s marking real tokens versus padding positions.

Return type:

Dict[str, List[int]]

Raises:

ValueError – If padding is ‘max_length’ but config.max_length is not set.
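
A sketch of the encode flow (the corpus and configuration values are illustrative):

>>> from banhxeo.core import NLTKTokenizer, TokenizerConfig, Vocabulary
>>> tok = NLTKTokenizer()
>>> vocab = Vocabulary.build(["a tiny corpus", "another tiny line"], tokenizer=tok)
>>> config = TokenizerConfig(add_special_tokens=True, max_length=8, truncation=True, padding="max_length")
>>> enc = tok.encode("a tiny line", vocab=vocab, config=config)
>>> # enc["input_ids"] and enc["attention_mask"] should both have length max_length (8)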

batch_encode(texts: List[str], vocab: Vocabulary, config: TokenizerConfig, **kwargs) List[Dict[str, List[int]]][source]

Encodes a batch of text strings.

Parameters:
  • texts – A list of strings to encode.

  • vocab – The vocabulary instance.

  • config – TokenizerConfig object.

  • **kwargs – Additional arguments for the encode method.

Returns:

A list of dictionaries, where each dictionary is the output of encode for the corresponding text.
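
batch_encode applies the same encoding to each string; a sketch:

>>> from banhxeo.core import NLTKTokenizer, TokenizerConfig, Vocabulary
>>> tok = NLTKTokenizer()
>>> texts = ["first example", "a slightly longer second example"]
>>> vocab = Vocabulary.build(texts, tokenizer=tok)
>>> config = TokenizerConfig(add_special_tokens=True, max_length=6, truncation=True, padding="max_length")
>>> batch = tok.batch_encode(texts, vocab=vocab, config=config)
>>> # batch is a list with one {"input_ids": ..., "attention_mask": ...} dict per input text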

train_from_iterator(iterator: Iterable[str], vocab_size: int, min_frequency: int = 2, special_tokens: List[str] = ['<pad>', '<unk>', '<bos>', '<eos>'], **kwargs) None[source]

Trains the tokenizer from an iterator of texts.

This is primarily for tokenizers that learn a vocabulary or merges, such as BPE or WordPiece. Simpler tokenizers might implement this as a no-operation.

Parameters:
  • iterator – An iterable yielding text strings.

  • vocab_size – The desired vocabulary size.

  • min_frequency – The minimum frequency for a token to be included.

  • special_tokens – A list of special tokens to include.

  • **kwargs – Tokenizer-specific training arguments.

Raises:

NotImplementedError – If the tokenizer does not support training.
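
For tokenizers that do not learn a vocabulary or merges, this may be a no-op or raise; a hedged sketch:

>>> from banhxeo.core import NLTKTokenizer
>>> tok = NLTKTokenizer()
>>> try:
...     tok.train_from_iterator(["some text", "more text"], vocab_size=1000)
... except NotImplementedError:
...     pass  # this tokenizer has nothing to train
...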

save_pretrained(save_directory: str | Path, **kwargs)[source]

Saves the tokenizer’s state to a directory.

This should save any learned vocabulary, merges, or configuration necessary to reload the tokenizer.

Parameters:
  • save_directory – Path to the directory where the tokenizer will be saved.

  • **kwargs – Additional saving arguments.

Raises:

NotImplementedError – If saving is not implemented.

classmethod from_pretrained(load_directory: str | Path, **kwargs) Tokenizer[source]

Loads a tokenizer from a previously saved directory.

Parameters:
  • load_directory – Path to the directory from which to load.

  • **kwargs – Additional loading arguments.

Returns:

An instance of the tokenizer.

Raises:

NotImplementedError – If loading is not implemented.
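
A save/load round-trip sketch (assumes the concrete tokenizer implements both methods; otherwise NotImplementedError is raised):

>>> from banhxeo.core import NLTKTokenizer
>>> tok = NLTKTokenizer()
>>> tok.save_pretrained("./my_tokenizer")                      # persists config/vocabulary files
>>> restored = NLTKTokenizer.from_pretrained("./my_tokenizer")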

class banhxeo.core.TokenizerConfig(*, add_special_tokens: bool = False, max_length: int | None = None, truncation: bool = False, padding: bool | Literal['do_not_pad', 'max_length'] = False)[source]

Bases: BaseModel

Configuration for tokenizers.

Variables:
  • add_special_tokens (bool) – Whether to add special tokens like BOS/EOS.

  • max_length (int | None) – Maximum sequence length. If specified, truncation or padding might be applied.

  • truncation (bool) – Whether to truncate sequences longer than max_length.

  • padding (bool | Literal['do_not_pad', 'max_length']) – Padding strategy: False or “do_not_pad” for no padding; True or “max_length” to pad sequences to max_length.

add_special_tokens: bool

max_length: int | None

truncation: bool

padding: bool | Literal['do_not_pad', 'max_length']

check_padding() Self[source]

Validates padding configuration against max_length.
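
A sketch of constructing a config, and of the combination check_padding guards against (the exact error type depends on the validator):

>>> from banhxeo.core import TokenizerConfig
>>> config = TokenizerConfig(add_special_tokens=True, max_length=128, truncation=True, padding="max_length")
>>> try:
...     TokenizerConfig(padding="max_length")   # "max_length" padding without max_length
... except ValueError:
...     pass  # check_padding is expected to reject this combination
...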

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic’s ConfigDict.

class banhxeo.core.VocabConfig(*, min_freq: int = 1, pad_tok: str = '<PAD>', unk_tok: str = '<UNK>', bos_tok: str = '<BOS>', sep_tok: str = '<SEP>', cls_tok: str | None = None, mask_tok: str | None = None, resv_tok: str | None = None)[source]

Bases: BaseModel

Configuration for vocabulary settings, especially special tokens.

This configuration defines the string representations for various special tokens and the minimum frequency for corpus tokens. The order in the special_tokens property attempts to follow Hugging Face conventions to facilitate consistent ID assignment (e.g., PAD=0, UNK=1).

Variables:
  • min_freq (int) – Minimum frequency for a token from the corpus to be included in the vocabulary. Defaults to 1.

  • pad_tok (str) – Padding token string. Crucial for sequence padding. Defaults to “<PAD>”.

  • unk_tok (str) – Unknown token string, for out-of-vocabulary words. Defaults to “<UNK>”.

  • bos_tok (str) – Beginning-of-sentence token string. Often used by generative models. Defaults to “<BOS>”.

  • sep_tok (str) – Separator token string. Used to separate sequences (e.g., in BERT for sentence pairs) or as an end-of-sentence token. Defaults to “<SEP>”.

  • mask_tok (str | None) – Mask token string (e.g., for Masked Language Modeling like BERT). Optional. Defaults to None.

  • cls_tok (str | None) – Classification token string (e.g., the first token in BERT sequences for classification tasks). Optional. Defaults to None.

  • resv_tok (str | None) – Reserved token string for future use or custom purposes. Optional. Defaults to None.

min_freq: int

pad_tok: str

unk_tok: str

bos_tok: str

sep_tok: str

cls_tok: str | None

mask_tok: str | None

resv_tok: str | None

classmethod check_min_freq_positive(value: int) int[source]

Validates that min_freq is at least 1.

property special_tokens: List[str]

Returns a list of all configured special tokens in a conventional order.

This order is designed to facilitate common ID assignments when building a vocabulary sequentially (e.g., PAD token getting ID 0, UNK token ID 1). The actual IDs depend on the Vocabulary.build() process.

Conventional order (if the token is defined):

  1. pad_tok (aims for ID 0)

  2. unk_tok (aims for ID 1)

  3. cls_tok (if defined, common for BERT-like models)

  4. sep_tok (common separator or EOS for many models)

  5. mask_tok (if defined, for MLM)

  6. bos_tok (if defined and distinct, for generative start)

  7. resv_tok (if defined)

Returns:

A list of special token strings, excluding any that are None.
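
A sketch with the default configuration (only tokens that are not None appear, in the order described above):

>>> from banhxeo.core import VocabConfig
>>> cfg = VocabConfig()
>>> cfg.special_tokens                 # e.g. ['<PAD>', '<UNK>', '<SEP>', '<BOS>'] with the defaults
>>> cfg.special_token_idx('<UNK>')     # index within that list, e.g. 1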

special_token_idx(token: str) int[source]

Gets the predefined index of a special token within the special_tokens list.

Note: This implies a fixed ordering of special tokens. The actual ID in a Vocabulary instance depends on how it was built; use Vocabulary.token_to_idx[token] for the actual ID. This method reflects the configuration’s ordering, not the built vocabulary.

Parameters:

token – The special token string.

Returns:

The index of the token in the special_tokens list.

Raises:

ValueError – If the token is not found in the special_tokens list.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic’s ConfigDict.

class banhxeo.core.Vocabulary(vocab_config: VocabConfig | None = None)[source]

Bases: object

Manages token-to-ID mapping, special tokens, and vocabulary building.

Provides functionalities to build a vocabulary from a corpus, load/save it, and convert between tokens and their numerical IDs. The assignment of IDs to special tokens during build() is guided by the order in VocabConfig.special_tokens.

Variables:
  • vocab_config (VocabConfig) – Configuration for special tokens and vocabulary building parameters like minimum frequency.

  • tokenizer (Optional[Tokenizer]) – The tokenizer associated with this vocabulary, used during the build process.

  • _idx_to_token (List[str]) – A list mapping token IDs to tokens.

  • _token_to_idx (Dict[str, int]) – A dictionary mapping tokens to token IDs.

  • _word_counts (Optional[Dict[str, int]]) – Raw token counts from the corpus used to build the vocabulary (populated after build() is called).

__init__(vocab_config: VocabConfig | None = None)[source]

Initializes the Vocabulary.

Parameters:

vocab_config – Configuration for the vocabulary. If None, uses DEFAULT_VOCAB_CONFIG.

classmethod load(path: Path | str, tokenizer: Tokenizer)[source]

Loads a vocabulary from a JSON file.

The JSON file should contain the vocabulary config, the tokenizer class name used for building, and the token-to-ID mappings.

Parameters:
  • path – Path to the vocabulary JSON file.

  • tokenizer – The tokenizer instance that was used or is compatible with this vocabulary. Its class name will be checked against the saved one.

Returns:

An instance of Vocabulary loaded with data from the file.

Raises:
  • ValueError – If the path is invalid, or if the provided tokenizer class does not match the one saved with the vocabulary.

  • FileNotFoundError – If the vocabulary file does not exist.

classmethod build(corpus: List[str], tokenizer: Tokenizer, **kwargs)[source]

Builds a vocabulary from a list of text sentences.

This method tokenizes the corpus, counts token frequencies, adds special tokens, and then adds corpus tokens that meet the minimum frequency requirement.

Parameters:
  • corpus – A list of strings, where each string is a sentence or document.

  • tokenizer – The tokenizer instance to use for tokenizing the corpus.

  • **kwargs – Additional arguments to override VocabConfig defaults. Supported keys: min_freq, pad_tok, unk_tok, sep_tok, bos_tok, mask_tok, cls_tok, resv_tok.

Returns:

A new Vocabulary instance built from the corpus.
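
A minimal build sketch (IDs and size depend on the corpus and configuration):

>>> from banhxeo.core import NLTKTokenizer, Vocabulary
>>> corpus = ["the cat sat on the mat", "the dog sat on the log"]
>>> vocab = Vocabulary.build(corpus, tokenizer=NLTKTokenizer(), min_freq=1)
>>> vocab.vocab_size            # special tokens + corpus tokens meeting min_freq
>>> vocab.pad_id, vocab.unk_id  # conventionally 0 and 1, per VocabConfig.special_tokens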

save(path: str | Path) None[source]

Saves the vocabulary to a JSON file.

The saved file includes the vocabulary configuration, the name of the tokenizer class used for building (if any), the token-to-ID mappings, and raw word counts.

Parameters:

path – The file path where the vocabulary will be saved.

Raises:
  • ValueError – If the vocabulary has not been built yet (is empty).

  • IOError – If there’s an issue writing the file.
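
Saving and reloading a built vocabulary (a sketch; load expects the same tokenizer class that was used at build time):

>>> from banhxeo.core import NLTKTokenizer, Vocabulary
>>> tok = NLTKTokenizer()
>>> vocab = Vocabulary.build(["a small corpus for saving"], tokenizer=tok)
>>> vocab.save("vocab.json")
>>> restored = Vocabulary.load("vocab.json", tokenizer=tok)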

property vocab_size: int

Returns the total number of unique tokens in the vocabulary (including special tokens).

get_vocab() List[str][source]

Returns the list of all tokens in the vocabulary, ordered by ID.

Returns:

A list of all token strings in the vocabulary.

Raises:

ValueError – If the vocabulary has not been built.

property unk_id: int

Returns the ID of the unknown token (<UNK>).

property pad_id: int

Returns the ID of the padding token (<PAD>).

property bos_id: int

Returns the ID of the beginning-of-sentence token (<BOS>).

property sep_id: int

Returns the ID of the separator/end-of-sentence token (<SEP>).

property sep: int

Alias for sep_id.

property unk_tok: str

Returns the unknown token string (e.g., “<UNK>”).

property bos_toks: List[str]

Returns a list containing the beginning-of-sentence token string.

Typically used when prepending tokens to a sequence.

property sep_toks: List[str]

Returns a list containing the separator/end-of-sentence token string.

Typically used when appending tokens to a sequence.

property token_to_idx: Dict[str, int]

Returns the dictionary mapping tokens to their IDs.

Raises:

ValueError – If the vocabulary has not been built.

property idx_to_token: List[str]

Returns the list mapping IDs to their token strings.

Raises:

ValueError – If the vocabulary has not been built.

tokens_to_ids(tokens: List[str]) List[int][source]

Converts a list of token strings to a list of their corresponding IDs.

Unknown tokens are mapped to the unk_id.

Parameters:

tokens – A list of token strings.

Returns:

A list of integer IDs.

ids_to_tokens(ids: List[int]) List[str][source]

Converts a list of token IDs to their corresponding token strings.

IDs outside the vocabulary range might be mapped to the unk_tok or raise an error, depending on the internal _convert_id_to_token implementation.

Parameters:

ids – A list of integer IDs.

Returns:

A list of token strings.
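
Converting between tokens and IDs (a sketch; the exact IDs depend on the built vocabulary):

>>> from banhxeo.core import NLTKTokenizer, Vocabulary
>>> vocab = Vocabulary.build(["the cat sat on the mat"], tokenizer=NLTKTokenizer())
>>> ids = vocab.tokens_to_ids(["the", "cat", "unicorn"])   # "unicorn" is out of vocabulary, so it maps to unk_id
>>> vocab.ids_to_tokens(ids)                               # round-trips to ['the', 'cat', '<UNK>'] with the default config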

get_word_counts() Dict[str, int][source]

Returns the raw counts of tokens observed in the corpus during build().

If the vocabulary was loaded or not built from a corpus, this might be empty or reflect counts from the original build.

Returns:

A dictionary mapping token strings to their raw frequencies.

Raises:

ValueError – If the vocabulary has not been built (and thus _word_counts is not populated).
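
A sketch of inspecting raw corpus counts after build():

>>> from banhxeo.core import NLTKTokenizer, Vocabulary
>>> vocab = Vocabulary.build(["to be or not to be"], tokenizer=NLTKTokenizer())
>>> vocab.get_word_counts()   # e.g. {'to': 2, 'be': 2, 'or': 1, 'not': 1}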

Submodules