banhxeo.core.vocabulary module
- class banhxeo.core.vocabulary.VocabConfig(*, min_freq: int = 1, pad_tok: str = '<PAD>', unk_tok: str = '<UNK>', bos_tok: str = '<BOS>', sep_tok: str = '<SEP>', cls_tok: str | None = None, mask_tok: str | None = None, resv_tok: str | None = None)[source]
Bases:
BaseModel
Configuration for vocabulary settings, especially special tokens.
This configuration defines the string representations for various special tokens and the minimum frequency for corpus tokens. The order in the special_tokens property attempts to follow Hugging Face conventions to facilitate consistent ID assignment (e.g., PAD=0, UNK=1).
- Variables:
min_freq (int) – Minimum frequency for a token from the corpus to be included in the vocabulary. Defaults to 1.
pad_tok (str) – Padding token string. Crucial for sequence padding. Defaults to “<PAD>”.
unk_tok (str) – Unknown token string, for out-of-vocabulary words. Defaults to “<UNK>”.
bos_tok (str) – Beginning-of-sentence token string. Often used by generative models. Defaults to “<BOS>”.
sep_tok (str) – Separator token string. Used to separate sequences (e.g., in BERT for sentence pairs) or as an end-of-sentence token. Defaults to “<SEP>”.
cls_tok (str | None) – Classification token string (e.g., the first token in BERT sequences for classification tasks). Optional. Defaults to None.
mask_tok (str | None) – Mask token string (e.g., for Masked Language Modeling, as in BERT). Optional. Defaults to None.
resv_tok (str | None) – Reserved token string for future use or custom purposes. Optional. Defaults to None.
- min_freq: int
- pad_tok: str
- unk_tok: str
- bos_tok: str
- sep_tok: str
- cls_tok: str | None
- mask_tok: str | None
- resv_tok: str | None
- classmethod check_min_freq_positive(value: int) → int [source]
Validates that min_freq is at least 1.
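The check can be sketched as a plain function (a minimal illustration of the documented rule, not the library's actual pydantic validator):

```python
def check_min_freq_positive(value: int) -> int:
    """Reject min_freq values below 1, mirroring the documented validation."""
    if value < 1:
        raise ValueError("min_freq must be at least 1")
    return value
```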
- property special_tokens: List[str]
Returns a list of all configured special tokens in a conventional order.
This order is designed to facilitate common ID assignments when building a vocabulary sequentially (e.g., PAD token getting ID 0, UNK token ID 1). The actual IDs depend on the Vocabulary.build() process.
Conventional order (if the token is defined):
1. pad_tok (aims for ID 0)
2. unk_tok (aims for ID 1)
3. cls_tok (if defined; common for BERT-like models)
4. sep_tok (common separator or EOS for many models)
5. mask_tok (if defined; for MLM)
6. bos_tok (if defined and distinct; for generative start)
7. resv_tok (if defined)
- Returns:
A list of special token strings, excluding any that are None.
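The assembly of this list can be illustrated with a stand-alone function (a sketch; the real property reads the token strings from the VocabConfig instance rather than taking parameters):

```python
from typing import List, Optional

def special_tokens(
    pad_tok: str = "<PAD>",
    unk_tok: str = "<UNK>",
    cls_tok: Optional[str] = None,
    sep_tok: str = "<SEP>",
    mask_tok: Optional[str] = None,
    bos_tok: str = "<BOS>",
    resv_tok: Optional[str] = None,
) -> List[str]:
    """Assemble special tokens in the conventional order, dropping None entries."""
    ordered = [pad_tok, unk_tok, cls_tok, sep_tok, mask_tok, bos_tok, resv_tok]
    return [tok for tok in ordered if tok is not None]
```

With the defaults this yields `["<PAD>", "<UNK>", "<SEP>", "<BOS>"]`, so sequential ID assignment gives PAD=0 and UNK=1 as the docstring intends.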
- special_token_idx(token: str) → int [source]
Gets the predefined index of a special token within the special_tokens list.
Note: This implies a fixed ordering of special tokens. The actual ID in a Vocabulary instance depends on how it was built; use Vocabulary.token_to_idx[token] for the actual ID. This method reflects only the configuration's ordering.
- Parameters:
token – The special token string.
- Returns:
The index of the token in the special_tokens list.
- Raises:
ValueError – If the token is not found in the special_tokens list.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.
- class banhxeo.core.vocabulary.Vocabulary(vocab_config: VocabConfig | None = None)[source]
Bases:
object
Manages token-to-ID mapping, special tokens, and vocabulary building.
Provides functionalities to build a vocabulary from a corpus, load/save it, and convert between tokens and their numerical IDs. The assignment of IDs to special tokens during build() is guided by the order in VocabConfig.special_tokens.
- Variables:
vocab_config (VocabConfig) – Configuration for special tokens and vocabulary building parameters like minimum frequency.
tokenizer (Optional[Tokenizer]) – The tokenizer associated with this vocabulary, used during the build process.
_idx_to_token (List[str]) – A list mapping token IDs to tokens.
_token_to_idx (Dict[str, int]) – A dictionary mapping tokens to token IDs.
_word_counts (Optional[Dict[str, int]]) – Raw token counts from the corpus used to build the vocabulary (populated after build() is called).
- __init__(vocab_config: VocabConfig | None = None)[source]
Initializes the Vocabulary.
- Parameters:
vocab_config – Configuration for the vocabulary. If None, uses DEFAULT_VOCAB_CONFIG.
- classmethod load(path: Path | str, tokenizer: Tokenizer)[source]
Loads a vocabulary from a JSON file.
The JSON file should contain the vocabulary config, the tokenizer class name used for building, and the token-to-ID mappings.
- Parameters:
path – Path to the vocabulary JSON file.
tokenizer – The tokenizer instance that was used or is compatible with this vocabulary. Its class name will be checked against the saved one.
- Returns:
An instance of Vocabulary loaded with data from the file.
- Raises:
ValueError – If the path is invalid, or if the provided tokenizer class does not match the one saved with the vocabulary.
FileNotFoundError – If the vocabulary file does not exist.
- classmethod build(corpus: List[str], tokenizer: Tokenizer, **kwargs)[source]
Builds a vocabulary from a list of text sentences.
This method tokenizes the corpus, counts token frequencies, adds special tokens, and then adds corpus tokens that meet the minimum frequency requirement.
- Parameters:
corpus – A list of strings, where each string is a sentence or document.
tokenizer – The tokenizer instance to use for tokenizing the corpus.
**kwargs – Additional arguments to override VocabConfig defaults. Supported keys: min_freq, pad_tok, unk_tok, sep_tok, bos_tok, mask_tok, cls_tok, resv_tok.
- Returns:
A new Vocabulary instance built from the corpus.
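The procedure described above can be sketched in plain Python (an illustrative reimplementation, not the library's code: whitespace splitting stands in for the real tokenizer, and the real build() may order corpus tokens differently):

```python
from collections import Counter
from typing import Dict, List, Tuple

def build_vocab(
    corpus: List[str],
    special_tokens: List[str],
    min_freq: int = 1,
) -> Tuple[List[str], Dict[str, int]]:
    """Count tokens, seed special tokens first, then add tokens meeting min_freq."""
    counts = Counter(tok for sentence in corpus for tok in sentence.split())
    idx_to_token = list(special_tokens)  # special tokens claim the lowest IDs
    for tok, freq in counts.items():
        if freq >= min_freq and tok not in idx_to_token:
            idx_to_token.append(tok)
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx
```

Because the special tokens are inserted before any corpus tokens, the ordering from VocabConfig.special_tokens translates directly into the low IDs (PAD=0, UNK=1, and so on).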
- save(path: str | Path) → None [source]
Saves the vocabulary to a JSON file.
The saved file includes the vocabulary configuration, the name of the tokenizer class used for building (if any), the token-to-ID mappings, and raw word counts.
- Parameters:
path – The file path where the vocabulary will be saved.
- Raises:
ValueError – If the vocabulary has not been built yet (is empty).
IOError – If there’s an issue writing the file.
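The save/load round trip reduces to JSON serialization of the state described above (a sketch under assumptions: the key names "config", "tokenizer", "token_to_idx", and "word_counts" are illustrative, not the library's actual schema, and the real load() additionally validates the tokenizer class name):

```python
import json
from pathlib import Path

def save_vocab(path, token_to_idx, config, tokenizer_name, word_counts):
    """Write vocabulary state as JSON; raises if the vocabulary is empty."""
    if not token_to_idx:
        raise ValueError("Vocabulary has not been built yet")
    payload = {
        "config": config,
        "tokenizer": tokenizer_name,
        "token_to_idx": token_to_idx,
        "word_counts": word_counts,
    }
    Path(path).write_text(json.dumps(payload))

def load_vocab(path):
    """Read the JSON payload back into a plain dictionary."""
    return json.loads(Path(path).read_text())
```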
- property vocab_size: int
Returns the total number of unique tokens in the vocabulary (including special tokens).
- get_vocab() → List[str] [source]
Returns the list of all tokens in the vocabulary, ordered by ID.
- Returns:
A list of all token strings in the vocabulary.
- Raises:
ValueError – If the vocabulary has not been built.
- property unk_id: int
Returns the ID of the unknown token (<UNK>).
- property pad_id: int
Returns the ID of the padding token (<PAD>).
- property bos_id: int
Returns the ID of the beginning-of-sentence token (<BOS>).
- property sep_id: int
Returns the ID of the separator/end-of-sentence token (<SEP>).
- property sep: int
Alias for sep_id.
- property unk_tok: str
Returns the unknown token string (e.g., “<UNK>”).
- property bos_toks: List[str]
Returns a list containing the beginning-of-sentence token string.
Typically used when prepending tokens to a sequence.
- property sep_toks: List[str]
Returns a list containing the separator/end-of-sentence token string.
Typically used when appending tokens to a sequence.
- property token_to_idx: Dict[str, int]
Returns the dictionary mapping tokens to their IDs.
- Raises:
ValueError – If the vocabulary has not been built.
- property idx_to_token: List[str]
Returns the list mapping IDs to their token strings.
- Raises:
ValueError – If the vocabulary has not been built.
- tokens_to_ids(tokens: List[str]) → List[int] [source]
Converts a list of token strings to a list of their corresponding IDs.
Unknown tokens are mapped to the unk_id.
- Parameters:
tokens – A list of token strings.
- Returns:
A list of integer IDs.
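The conversion reduces to a dictionary lookup with an UNK fallback (a free-function sketch; the real method uses the instance's own mapping and unk_id):

```python
from typing import Dict, List

def tokens_to_ids(tokens: List[str], token_to_idx: Dict[str, int], unk_id: int) -> List[int]:
    """Map each token to its ID, falling back to unk_id for OOV tokens."""
    return [token_to_idx.get(tok, unk_id) for tok in tokens]
```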
- ids_to_tokens(ids: List[int]) → List[str] [source]
Converts a list of token IDs to their corresponding token strings.
IDs outside the vocabulary range might be mapped to the unk_tok or raise an error, depending on the internal _convert_id_to_token implementation.
- Parameters:
ids – A list of integer IDs.
- Returns:
A list of token strings.
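The reverse direction is a list lookup; the sketch below maps out-of-range IDs to the unknown token (one of the two behaviors the docstring allows, chosen here for illustration):

```python
from typing import List

def ids_to_tokens(ids: List[int], idx_to_token: List[str], unk_tok: str) -> List[str]:
    """Map each ID back to its token, substituting unk_tok for out-of-range IDs."""
    return [idx_to_token[i] if 0 <= i < len(idx_to_token) else unk_tok for i in ids]
```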
- get_word_counts() → Dict[str, int] [source]
Returns the raw counts of tokens observed in the corpus during build().
If the vocabulary was loaded from a file rather than built from a corpus, this may be empty or reflect the counts from the original build.
- Returns:
A dictionary mapping token strings to their raw frequencies.
- Raises:
ValueError – If the vocabulary has not been built (and thus _word_counts is not populated).