banhxeo.core package
- class banhxeo.core.NLTKTokenizer[source]
Bases:
Tokenizer
A tokenizer that uses NLTK’s TreebankWordTokenizer.
Falls back to a regex-based tokenizer if NLTK is not installed.
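A minimal usage sketch (a no-argument constructor is assumed, since none is documented; the exact tokens depend on whether NLTK is installed):
from banhxeo.core import NLTKTokenizer

tokenizer = NLTKTokenizer()  # uses the regex fallback if NLTK is unavailable
tokens = tokenizer.tokenize("Don't panic, it's just tokenization.")
print(tokens)  # roughly ['Do', "n't", 'panic', ',', 'it', "'s", 'just', 'tokenization', '.'] with NLTK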
- class banhxeo.core.Tokenizer[source]
Bases:
object
Abstract Base Class for all tokenizers.
Defines the core interface for tokenization, encoding, and managing tokenizer-specific data like pre-trained models or vocabularies.
- tokenize(text: str, **kwargs) List[str] [source]
Tokenizes a single string into a list of tokens.
This method should be implemented by all concrete tokenizer subclasses. The base implementation provides a simple regex-based tokenizer as a fallback.
- Parameters:
text – The input string to tokenize.
**kwargs – Subclass-specific tokenization arguments.
- Returns:
A list of string tokens.
- encode(text: str, vocab: Vocabulary, config: TokenizerConfig, **kwargs) Dict[str, List[int]] [source]
Converts a text string into a dictionary of encoded features.
This base implementation handles tokenization, addition of special tokens, truncation, and padding based on the provided configuration. Subclasses can override this for more specialized encoding logic.
- Parameters:
text – The input string to encode.
vocab – The vocabulary instance for mapping tokens to IDs.
config – TokenizerConfig object specifying encoding parameters.
**kwargs – Additional arguments, potentially passed to the tokenize method.
- Returns:
A dictionary containing:
“input_ids”: List of token IDs.
“attention_mask”: List of 0s and 1s indicating padding positions.
- Return type:
Dict[str, List[int]]
- Raises:
ValueError – If padding is ‘max_length’ but config.max_length is not set.
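A sketch of a typical call, with an illustrative corpus and configuration (these values are not taken from the library's examples):
from banhxeo.core import NLTKTokenizer, TokenizerConfig, Vocabulary

tokenizer = NLTKTokenizer()
vocab = Vocabulary.build(["the cat sat", "the dog ran"], tokenizer=tokenizer)
config = TokenizerConfig(add_special_tokens=True, max_length=8, truncation=True, padding="max_length")

encoded = tokenizer.encode("the cat ran", vocab=vocab, config=config)
print(encoded["input_ids"])       # 8 token IDs, padded with the PAD id
print(encoded["attention_mask"])  # conventionally 1 for real tokens, 0 for padding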
- batch_encode(texts: List[str], vocab: Vocabulary, config: TokenizerConfig, **kwargs) List[Dict[str, List[int]]] [source]
Encodes a batch of text strings.
- Parameters:
texts – A list of strings to encode.
vocab – The vocabulary instance.
config – TokenizerConfig object.
**kwargs – Additional arguments for the encode method.
- Returns:
A list of dictionaries, where each dictionary is the output of encode for the corresponding text.
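Continuing the encode() sketch above, the batched variant returns one dictionary per input text:
batch = tokenizer.batch_encode(["the cat sat", "a dog ran fast"], vocab=vocab, config=config)
print(len(batch))             # 2: one dict per input text
print(batch[0]["input_ids"])  # same keys as a single encode() result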
- train_from_iterator(iterator: Iterable[str], vocab_size: int, min_frequency: int = 2, special_tokens: List[str] = ['<pad>', '<unk>', '<bos>', '<eos>'], **kwargs) None [source]
Trains the tokenizer from an iterator of texts.
This is primarily for tokenizers that learn a vocabulary or merges, such as BPE or WordPiece. Simpler tokenizers might implement this as a no-operation.
- Parameters:
iterator – An iterable yielding text strings.
vocab_size – The desired vocabulary size.
min_frequency – The minimum frequency for a token to be included.
special_tokens – A list of special tokens to include.
**kwargs – Tokenizer-specific training arguments.
- Raises:
NotImplementedError – If the tokenizer does not support training.
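Because simple tokenizers may not support training, a defensive call looks like this (the corpus is illustrative; only the documented signature is assumed):
from banhxeo.core import NLTKTokenizer

tokenizer = NLTKTokenizer()
texts = ["first training sentence", "second training sentence"]  # illustrative corpus
try:
    tokenizer.train_from_iterator(iter(texts), vocab_size=8000, min_frequency=2)
except NotImplementedError:
    print("This tokenizer does not learn a vocabulary; nothing to train.")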
- save_pretrained(save_directory: str | Path, **kwargs)[source]
Saves the tokenizer’s state to a directory.
This should save any learned vocabulary, merges, or configuration necessary to reload the tokenizer.
- Parameters:
save_directory – Path to the directory where the tokenizer will be saved.
**kwargs – Additional saving arguments.
- Raises:
NotImplementedError – If saving is not implemented.
- classmethod from_pretrained(load_directory: str | Path, **kwargs) Tokenizer [source]
Loads a tokenizer from a previously saved directory.
- Parameters:
load_directory – Path to the directory from which to load.
**kwargs – Additional loading arguments.
- Returns:
An instance of the tokenizer.
- Raises:
NotImplementedError – If loading is not implemented.
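A save/load round trip, assuming the concrete tokenizer implements persistence (both methods raise NotImplementedError otherwise); the directory name is arbitrary:
from pathlib import Path
from banhxeo.core import NLTKTokenizer

save_dir = Path("artifacts/tokenizer")
save_dir.mkdir(parents=True, exist_ok=True)

tokenizer = NLTKTokenizer()
try:
    tokenizer.save_pretrained(save_dir)
    reloaded = NLTKTokenizer.from_pretrained(save_dir)
except NotImplementedError:
    print("This tokenizer does not implement persistence.")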
- class banhxeo.core.TokenizerConfig(*, add_special_tokens: bool = False, max_length: int | None = None, truncation: bool = False, padding: bool | Literal['do_not_pad', 'max_length'] = False)[source]
Bases:
BaseModel
Configuration for tokenizers.
- Variables:
add_special_tokens (bool) – Whether to add special tokens like BOS/EOS.
max_length (int | None) – Maximum sequence length. If specified, truncation or padding might be applied.
truncation (bool) – Whether to truncate sequences longer than max_length.
padding (bool | Literal['do_not_pad', 'max_length']) – Padding strategy. Can be:
- False or “do_not_pad”: No padding.
- True or “max_length”: Pad to max_length.
- add_special_tokens: bool
- max_length: int | None
- truncation: bool
- padding: bool | Literal['do_not_pad', 'max_length']
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.ConfigDict.
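For example, a configuration that truncates and pads every sequence to a fixed length (field values are illustrative):
from banhxeo.core import TokenizerConfig

config = TokenizerConfig(
    add_special_tokens=True,   # add BOS/EOS-style tokens around the sequence
    max_length=128,            # required when padding="max_length"
    truncation=True,           # cut sequences longer than max_length
    padding="max_length",      # pad shorter sequences up to max_length
)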
- class banhxeo.core.VocabConfig(*, min_freq: int = 1, pad_tok: str = '<PAD>', unk_tok: str = '<UNK>', bos_tok: str = '<BOS>', sep_tok: str = '<SEP>', cls_tok: str | None = None, mask_tok: str | None = None, resv_tok: str | None = None)[source]
Bases:
BaseModel
Configuration for vocabulary settings, especially special tokens.
This configuration defines the string representations for various special tokens and the minimum frequency for corpus tokens. The order in the special_tokens property attempts to follow Hugging Face conventions to facilitate consistent ID assignment (e.g., PAD=0, UNK=1).
- Variables:
min_freq (int) – Minimum frequency for a token from the corpus to be included in the vocabulary. Defaults to 1.
pad_tok (str) – Padding token string. Crucial for sequence padding. Defaults to “<PAD>”.
unk_tok (str) – Unknown token string, for out-of-vocabulary words. Defaults to “<UNK>”.
bos_tok (str) – Beginning-of-sentence token string. Often used by generative models. Defaults to “<BOS>”.
sep_tok (str) – Separator token string. Used to separate sequences (e.g., in BERT for sentence pairs) or as an end-of-sentence token. Defaults to “<SEP>”.
mask_tok (str | None) – Mask token string (e.g., for Masked Language Modeling like BERT). Optional. Defaults to None.
cls_tok (str | None) – Classification token string (e.g., the first token in BERT sequences for classification tasks). Optional. Defaults to None.
resv_tok (str | None) – Reserved token string for future use or custom purposes. Optional. Defaults to None.
- min_freq: int
- pad_tok: str
- unk_tok: str
- bos_tok: str
- sep_tok: str
- cls_tok: str | None
- mask_tok: str | None
- resv_tok: str | None
- classmethod check_min_freq_positive(value: int) int [source]
Validates that min_freq is at least 1.
- property special_tokens: List[str]
Returns a list of all configured special tokens in a conventional order.
This order is designed to facilitate common ID assignments when building a vocabulary sequentially (e.g., PAD token getting ID 0, UNK token ID 1). The actual IDs depend on the Vocabulary.build() process.
Conventional order (if the token is defined):
1. pad_tok (aims for ID 0)
2. unk_tok (aims for ID 1)
3. cls_tok (if defined; common for BERT-like models)
4. sep_tok (common separator or EOS for many models)
5. mask_tok (if defined, for MLM)
6. bos_tok (if defined and distinct, for generative start)
7. resv_tok (if defined)
- Returns:
A list of special token strings, excluding any that are None.
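A quick illustration of the ordering (the exact list depends on which optional tokens are set; the comment below follows the convention described above and is not a guaranteed output):
from banhxeo.core import VocabConfig

cfg = VocabConfig(cls_tok="<CLS>", mask_tok="<MASK>")
print(cfg.special_tokens)
# expected, per the conventional order: ['<PAD>', '<UNK>', '<CLS>', '<SEP>', '<MASK>', '<BOS>']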
- special_token_idx(token: str) int [source]
Gets the predefined index of a special token within the special_tokens list.
Note: this index reflects the fixed ordering defined by the configuration, not the ID assigned in a built Vocabulary instance. Use Vocabulary.token_to_idx[token] for the actual ID.
- Parameters:
token – The special token string.
- Returns:
The index of the token in the special_tokens list.
- Raises:
ValueError – If the token is not found in the special_tokens list.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.ConfigDict.
- class banhxeo.core.Vocabulary(vocab_config: VocabConfig | None = None)[source]
Bases:
object
Manages token-to-ID mapping, special tokens, and vocabulary building.
Provides functionalities to build a vocabulary from a corpus, load/save it, and convert between tokens and their numerical IDs. The assignment of IDs to special tokens during build() is guided by the order in VocabConfig.special_tokens.
- Variables:
vocab_config (VocabConfig) – Configuration for special tokens and vocabulary building parameters like minimum frequency.
tokenizer (Optional[Tokenizer]) – The tokenizer associated with this vocabulary, used during the build process.
_idx_to_token (List[str]) – A list mapping token IDs to tokens.
_token_to_idx (Dict[str, int]) – A dictionary mapping tokens to token IDs.
_word_counts (Optional[Dict[str, int]]) – Raw token counts from the corpus used to build the vocabulary (populated after build() is called).
- __init__(vocab_config: VocabConfig | None = None)[source]
Initializes the Vocabulary.
- Parameters:
vocab_config – Configuration for the vocabulary. If None, uses DEFAULT_VOCAB_CONFIG.
- classmethod load(path: Path | str, tokenizer: Tokenizer)[source]
Loads a vocabulary from a JSON file.
The JSON file should contain the vocabulary config, the tokenizer class name used for building, and the token-to-ID mappings.
- Parameters:
path – Path to the vocabulary JSON file.
tokenizer – The tokenizer instance that was used or is compatible with this vocabulary. Its class name will be checked against the saved one.
- Returns:
An instance of Vocabulary loaded with data from the file.
- Raises:
ValueError – If the path is invalid, or if the provided tokenizer class does not match the one saved with the vocabulary.
FileNotFoundError – If the vocabulary file does not exist.
- classmethod build(corpus: List[str], tokenizer: Tokenizer, **kwargs)[source]
Builds a vocabulary from a list of text sentences.
This method tokenizes the corpus, counts token frequencies, adds special tokens, and then adds corpus tokens that meet the minimum frequency requirement.
- Parameters:
corpus – A list of strings, where each string is a sentence or document.
tokenizer – The tokenizer instance to use for tokenizing the corpus.
**kwargs – Additional arguments to override VocabConfig defaults. Supported keys: min_freq, pad_tok, unk_tok, sep_tok, bos_tok, mask_tok, cls_tok, resv_tok.
- Returns:
A new Vocabulary instance built from the corpus.
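A minimal build, with an illustrative corpus and a min_freq override passed through **kwargs:
from banhxeo.core import NLTKTokenizer, Vocabulary

corpus = ["the quick brown fox", "the lazy dog", "the quick dog"]
vocab = Vocabulary.build(corpus, tokenizer=NLTKTokenizer(), min_freq=2)
print(vocab.vocab_size)                                  # special tokens + tokens seen >= 2 times
print(vocab.tokens_to_ids(["the", "quick", "unseen"]))   # unknown words map to unk_id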
- save(path: str | Path) None [source]
Saves the vocabulary to a JSON file.
The saved file includes the vocabulary configuration, the name of the tokenizer class used for building (if any), the token-to-ID mappings, and raw word counts.
- Parameters:
path – The file path where the vocabulary will be saved.
- Raises:
ValueError – If the vocabulary has not been built yet (is empty).
IOError – If there’s an issue writing the file.
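Persisting and reloading a built vocabulary; the same tokenizer class must be supplied to load() (paths are illustrative):
from pathlib import Path
from banhxeo.core import NLTKTokenizer, Vocabulary

tokenizer = NLTKTokenizer()
vocab = Vocabulary.build(["a tiny corpus", "another tiny corpus"], tokenizer=tokenizer)

path = Path("artifacts/vocab.json")
path.parent.mkdir(parents=True, exist_ok=True)
vocab.save(path)                                      # ValueError if the vocabulary is empty
reloaded = Vocabulary.load(path, tokenizer=tokenizer)
assert reloaded.vocab_size == vocab.vocab_size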
- property vocab_size: int
Returns the total number of unique tokens in the vocabulary (including special tokens).
- get_vocab() List[str] [source]
Returns the list of all tokens in the vocabulary, ordered by ID.
- Returns:
A list of all token strings in the vocabulary.
- Raises:
ValueError – If the vocabulary has not been built.
- property unk_id: int
Returns the ID of the unknown token (<UNK>).
- property pad_id: int
Returns the ID of the padding token (<PAD>).
- property bos_id: int
Returns the ID of the beginning-of-sentence token (<BOS>).
- property sep_id: int
Returns the ID of the separator/end-of-sentence token (<SEP>).
- property sep: int
Alias for sep_id.
- property unk_tok: str
Returns the unknown token string (e.g., “<UNK>”).
- property bos_toks: List[str]
Returns a list containing the beginning-of-sentence token string.
Typically used when prepending tokens to a sequence.
- property sep_toks: List[str]
Returns a list containing the separator/end-of-sentence token string.
Typically used when appending tokens to a sequence.
- property token_to_idx: Dict[str, int]
Returns the dictionary mapping tokens to their IDs.
- Raises:
ValueError – If the vocabulary has not been built.
- property idx_to_token: List[str]
Returns the list mapping IDs to their token strings.
- Raises:
ValueError – If the vocabulary has not been built.
- tokens_to_ids(tokens: List[str]) List[int] [source]
Converts a list of token strings to a list of their corresponding IDs.
Unknown tokens are mapped to the unk_id.
- Parameters:
tokens – A list of token strings.
- Returns:
A list of integer IDs.
- ids_to_tokens(ids: List[int]) List[str] [source]
Converts a list of token IDs to their corresponding token strings.
IDs outside the vocabulary range might be mapped to the unk_tok or raise an error, depending on the internal _convert_id_to_token implementation.
- Parameters:
ids – A list of integer IDs.
- Returns:
A list of token strings.
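Converting in both directions on a small, illustrative vocabulary (out-of-vocabulary tokens come back as the UNK token):
from banhxeo.core import NLTKTokenizer, Vocabulary

vocab = Vocabulary.build(["the quick brown fox", "the lazy dog"], tokenizer=NLTKTokenizer())
ids = vocab.tokens_to_ids(["the", "quick", "zebra"])   # "zebra" is out of vocabulary here
print(ids)                                             # the last ID equals vocab.unk_id
print(vocab.ids_to_tokens(ids))                        # e.g. ['the', 'quick', '<UNK>']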
- get_word_counts() Dict[str, int] [source]
Returns the raw counts of tokens observed in the corpus during build().
If the vocabulary was loaded or not built from a corpus, this might be empty or reflect counts from the original build.
- Returns:
A dictionary mapping token strings to their raw frequencies.
- Raises:
ValueError – If the vocabulary has not been built (and thus _word_counts is not populated).
Submodules
- banhxeo.core.tokenizer module
- banhxeo.core.vocabulary module