banhxeo.core.vocabulary module

class banhxeo.core.vocabulary.VocabConfig(*, min_freq: int = 1, pad_tok: str = '<PAD>', unk_tok: str = '<UNK>', bos_tok: str = '<BOS>', sep_tok: str = '<SEP>', cls_tok: str | None = None, mask_tok: str | None = None, resv_tok: str | None = None)[source]

Bases: BaseModel

Configuration for vocabulary settings, especially special tokens.

This configuration defines the string representations of the special tokens and the minimum frequency a corpus token must reach to be included in the vocabulary. The order of the special_tokens property attempts to follow Hugging Face conventions so that ID assignment stays consistent (e.g., PAD=0, UNK=1).

Variables:
  • min_freq (int) – Minimum frequency for a token from the corpus to be included in the vocabulary. Defaults to 1.

  • pad_tok (str) – Padding token string. Crucial for sequence padding. Defaults to “<PAD>”.

  • unk_tok (str) – Unknown token string, for out-of-vocabulary words. Defaults to “<UNK>”.

  • bos_tok (str) – Beginning-of-sentence token string. Often used by generative models. Defaults to “<BOS>”.

  • sep_tok (str) – Separator token string. Used to separate sequences (e.g., in BERT for sentence pairs) or as an end-of-sentence token. Defaults to “<SEP>”.

  • mask_tok (str | None) – Mask token string (e.g., for Masked Language Modeling like BERT). Optional. Defaults to None.

  • cls_tok (str | None) – Classification token string (e.g., the first token in BERT sequences for classification tasks). Optional. Defaults to None.

  • resv_tok (str | None) – Reserved token string for future use or custom purposes. Optional. Defaults to None.
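For example, a configuration for a BERT-style setup might enable the optional classification and mask tokens and require corpus tokens to appear at least twice. A minimal sketch (the token strings are illustrative choices, not library requirements):

  from banhxeo.core.vocabulary import VocabConfig

  # Keep the default PAD/UNK/BOS/SEP strings, enable CLS and MASK,
  # and drop corpus tokens that occur fewer than 2 times.
  config = VocabConfig(min_freq=2, cls_tok="<CLS>", mask_tok="<MASK>")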

min_freq: int
pad_tok: str
unk_tok: str
bos_tok: str
sep_tok: str
cls_tok: str | None
mask_tok: str | None
resv_tok: str | None
classmethod check_min_freq_positive(value: int) → int[source]

Validates that min_freq is at least 1.

property special_tokens: List[str]

Returns a list of all configured special tokens in a conventional order.

This order is designed to facilitate common ID assignments when building a vocabulary sequentially (e.g., PAD token getting ID 0, UNK token ID 1). The actual IDs depend on the Vocabulary.build() process.

Conventional order (if the token is defined):

  1. pad_tok (aims for ID 0)

  2. unk_tok (aims for ID 1)

  3. cls_tok (if defined; common for BERT-like models)

  4. sep_tok (common separator or EOS for many models)

  5. mask_tok (if defined; for MLM)

  6. bos_tok (if defined and distinct; for generative start)

  7. resv_tok (if defined)

Returns:

A list of special token strings, excluding any that are None.
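For illustration (a sketch, not captured library output), a default VocabConfig leaves cls_tok, mask_tok, and resv_tok as None, so only the four mandatory tokens appear, in the order build() is expected to use when assigning IDs:

  from banhxeo.core.vocabulary import VocabConfig

  config = VocabConfig()          # all defaults; no CLS/MASK/reserved token
  print(config.special_tokens)    # expected: ['<PAD>', '<UNK>', '<SEP>', '<BOS>']
  # so a vocabulary built with this config should map <PAD> to ID 0 and <UNK> to ID 1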

special_token_idx(token: str) → int[source]

Gets the predefined index of a special token within the special_tokens list.

Note: This index reflects the fixed ordering defined by this configuration, not the ID in a built Vocabulary instance, which depends on how the vocabulary was built. Use Vocabulary.token_to_idx[token] for the actual ID; this method only exposes the config's view.

Parameters:

token – The special token string.

Returns:

The index of the token in the special_tokens list.

Raises:

ValueError – If the token is not found in the special_tokens list.
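A short sketch of the config-level index (the values follow the conventional order above; "<EOS>" stands in for a token that was never configured):

  config = VocabConfig(cls_tok="<CLS>")

  config.special_token_idx("<PAD>")   # expected: 0 (position in config.special_tokens)
  config.special_token_idx("<CLS>")   # expected: 2, per the conventional order
  config.special_token_idx("<EOS>")   # raises ValueError: not a configured special token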

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to pydantic.ConfigDict.

class banhxeo.core.vocabulary.Vocabulary(vocab_config: VocabConfig | None = None)[source]

Bases: object

Manages token-to-ID mapping, special tokens, and vocabulary building.

Provides functionalities to build a vocabulary from a corpus, load/save it, and convert between tokens and their numerical IDs. The assignment of IDs to special tokens during build() is guided by the order in VocabConfig.special_tokens.

Variables:
  • vocab_config (VocabConfig) – Configuration for special tokens and vocabulary building parameters like minimum frequency.

  • tokenizer (Optional[Tokenizer]) – The tokenizer associated with this vocabulary, used during the build process.

  • _idx_to_token (List[str]) – A list mapping token IDs to tokens.

  • _token_to_idx (Dict[str, int]) – A dictionary mapping tokens to token IDs.

  • _word_counts (Optional[Dict[str, int]]) – Raw token counts from the corpus used to build the vocabulary (populated after build() is called).

__init__(vocab_config: VocabConfig | None = None)[source]

Initializes the Vocabulary.

Parameters:

vocab_config – Configuration for the vocabulary. If None, uses DEFAULT_VOCAB_CONFIG.

classmethod load(path: Path | str, tokenizer: Tokenizer)[source]

Loads a vocabulary from a JSON file.

The JSON file should contain the vocabulary config, the tokenizer class name used for building, and the token-to-ID mappings.

Parameters:
  • path – Path to the vocabulary JSON file.

  • tokenizer – The tokenizer instance that was used or is compatible with this vocabulary. Its class name will be checked against the saved one.

Returns:

An instance of Vocabulary loaded with data from the file.

Raises:
  • ValueError – If the path is invalid, or if the provided tokenizer class does not match the one saved with the vocabulary.

  • FileNotFoundError – If the vocabulary file does not exist.
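A minimal usage sketch, assuming vocab.json was previously written by Vocabulary.save() and that tokenizer is an instance of the same Tokenizer class recorded in that file:

  from banhxeo.core.vocabulary import Vocabulary

  tokenizer = ...  # the same Tokenizer (sub)class used when the vocabulary was built
  vocab = Vocabulary.load("vocab.json", tokenizer)
  print(vocab.vocab_size)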

classmethod build(corpus: List[str], tokenizer: Tokenizer, **kwargs)[source]

Builds a vocabulary from a list of text sentences.

This method tokenizes the corpus, counts token frequencies, adds special tokens, and then adds corpus tokens that meet the minimum frequency requirement.

Parameters:
  • corpus – A list of strings, where each string is a sentence or document.

  • tokenizer – The tokenizer instance to use for tokenizing the corpus.

  • **kwargs – Additional arguments to override VocabConfig defaults. Supported keys: min_freq, pad_tok, unk_tok, sep_tok, bos_tok, mask_tok, cls_tok, resv_tok.

Returns:

A new Vocabulary instance built from the corpus.
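A sketch of building a vocabulary from a toy corpus. The tokenizer is left as a placeholder because its concrete class lives outside this module; substitute whichever Tokenizer implementation your project uses:

  from banhxeo.core.vocabulary import Vocabulary

  corpus = [
      "the cat sat on the mat",
      "the dog sat on the log",
  ]
  tokenizer = ...  # any banhxeo Tokenizer instance, e.g. a whitespace tokenizer

  # Keep only tokens occurring at least twice; other VocabConfig fields keep their defaults.
  vocab = Vocabulary.build(corpus, tokenizer, min_freq=2)
  print(vocab.vocab_size)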

save(path: str | Path) → None[source]

Saves the vocabulary to a JSON file.

The saved file includes the vocabulary configuration, the name of the tokenizer class used for building (if any), the token-to-ID mappings, and raw word counts.

Parameters:

path – The file path where the vocabulary will be saved.

Raises:
  • ValueError – If the vocabulary has not been built yet (is empty).

  • IOError – If there’s an issue writing the file.
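A save/load round trip might look like this (sketch; the corpus, tokenizer, and path are the placeholders from the build() example above):

  vocab = Vocabulary.build(corpus, tokenizer)
  vocab.save("vocab.json")

  # Later, restore it with the same tokenizer class:
  restored = Vocabulary.load("vocab.json", tokenizer)
  assert restored.vocab_size == vocab.vocab_size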

property vocab_size: int

Returns the total number of unique tokens in the vocabulary (including special tokens).

get_vocab() → List[str][source]

Returns the list of all tokens in the vocabulary, ordered by ID.

Returns:

A list of all token strings in the vocabulary.

Raises:

ValueError – If the vocabulary has not been built.

property unk_id: int

Returns the ID of the unknown token (<UNK>).

property pad_id: int

Returns the ID of the padding token (<PAD>).

property bos_id: int

Returns the ID of the beginning-of-sentence token (<BOS>).

property sep_id: int

Returns the ID of the separator/end-of-sentence token (<SEP>).

property sep: int

Alias for sep_id.
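These IDs are typically what you pad or mask with. A sketch (the concrete integer values depend on how the vocabulary was built):

  ids = vocab.tokens_to_ids(["the", "cat"])
  max_len = 6
  padded = ids + [vocab.pad_id] * (max_len - len(ids))   # pad on the right with <PAD>'s ID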

property unk_tok: str

Returns the unknown token string (e.g., “<UNK>”).

property bos_toks: List[str]

Returns a list containing the beginning-of-sentence token string.

Typically used when prepending tokens to a sequence.

property sep_toks: List[str]

Returns a list containing the separator/end-of-sentence token string.

Typically used when appending tokens to a sequence.
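For instance, wrapping a tokenized sentence with the beginning and end markers before converting to IDs (sketch):

  tokens = ["the", "cat", "sat"]
  wrapped = vocab.bos_toks + tokens + vocab.sep_toks
  # e.g. ['<BOS>', 'the', 'cat', 'sat', '<SEP>'] with the default config
  ids = vocab.tokens_to_ids(wrapped)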

property token_to_idx: Dict[str, int]

Returns the dictionary mapping tokens to their IDs.

Raises:

ValueError – If the vocabulary has not been built.

property idx_to_token: List[str]

Returns the list mapping IDs to their token strings.

Raises:

ValueError – If the vocabulary has not been built.

tokens_to_ids(tokens: List[str]) → List[int][source]

Converts a list of token strings to a list of their corresponding IDs.

Unknown tokens are mapped to the unk_id.

Parameters:

tokens – A list of token strings.

Returns:

A list of integer IDs.
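Sketch (the exact integers depend on the built vocabulary):

  ids = vocab.tokens_to_ids(["the", "cat", "zyzzyva"])
  # "zyzzyva" is out of vocabulary here, so its position holds vocab.unk_id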

ids_to_tokens(ids: List[int]) → List[str][source]

Converts a list of token IDs to their corresponding token strings.

IDs outside the vocabulary range might be mapped to the unk_tok or raise an error, depending on the internal _convert_id_to_token implementation.

Parameters:

ids – A list of integer IDs.

Returns:

A list of token strings.
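Round-tripping through IDs (sketch):

  ids = vocab.tokens_to_ids(["the", "cat"])
  vocab.ids_to_tokens(ids)   # expected: ['the', 'cat']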

get_word_counts() → Dict[str, int][source]

Returns the raw counts of tokens observed in the corpus during build().

If the vocabulary was loaded from a file rather than built directly from a corpus, these counts reflect the original build and may be empty if none were saved.

Returns:

A dictionary mapping token strings to their raw frequencies.

Raises:

ValueError – If the vocabulary has not been built (and thus _word_counts is not populated).
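A sketch of inspecting corpus frequencies after build(), e.g. to see which tokens fell below min_freq:

  counts = vocab.get_word_counts()
  rare = [tok for tok, n in counts.items() if n < vocab.vocab_config.min_freq]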