banhxeo.model.neural module
- class banhxeo.model.neural.NeuralModelConfig(*, vocab_size: int | None = None, embedding_dim: int)[source]
Bases: ModelConfig
- embedding_dim: int
The dimensionality of the token embedding vectors.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.
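For illustration, a config can be constructed directly with the signature shown above; the values below are arbitrary:

```python
from banhxeo.model.neural import NeuralModelConfig

# vocab_size is optional (None by default) and may be filled in later from the
# vocabulary; embedding_dim is required.
config = NeuralModelConfig(vocab_size=10_000, embedding_dim=128)
```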
- class banhxeo.model.neural.NeuralLanguageModel(model_config: NeuralModelConfig, vocab: Vocabulary)[source]
Bases: BaseLanguageModel, Module
Abstract base class for neural network-based language models.
Extends BaseLanguageModel and torch.nn.Module. It provides common functionality for neural models, such as device management (CPU/GPU), model saving/loading (weights and config), attaching downstream heads, and a more detailed summary.
Subclasses must implement the forward method and typically override generate_sequence where applicable; a minimal subclass sketch follows the variables list below.
- Variables:
config (NeuralModelConfig) – Configuration specific to neural models, inheriting from ModelConfig and often adding embedding_dim.
vocab (Vocabulary) – The vocabulary used by the model.
downstream_heads (nn.ModuleDict) – A dictionary to hold task-specific output layers (heads) that can be attached to the base model.
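As an illustration of the contract above, a minimal concrete subclass might look like the following sketch. `TinyLM` and its layers are hypothetical; only the `NeuralLanguageModel`/`NeuralModelConfig` classes and the `forward` return convention come from this page.

```python
from typing import Any, Dict

import torch
import torch.nn as nn

from banhxeo.model.neural import NeuralLanguageModel, NeuralModelConfig


class TinyLM(NeuralLanguageModel):
    """Hypothetical subclass: embed tokens, project back onto the vocabulary."""

    def __init__(self, model_config: NeuralModelConfig, vocab):
        super().__init__(model_config, vocab)
        self.embedding = nn.Embedding(model_config.vocab_size, model_config.embedding_dim)
        self.decoder = nn.Linear(model_config.embedding_dim, model_config.vocab_size)

    def forward(self, input_ids: torch.Tensor, **kwargs: Any) -> Dict[str, torch.Tensor]:
        hidden = self.embedding(input_ids)   # (batch, seq_len, embedding_dim)
        logits = self.decoder(hidden)        # (batch, seq_len, vocab_size)
        return {"logits": logits, "last_hidden_state": hidden}
```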
- __init__(model_config: NeuralModelConfig, vocab: Vocabulary)[source]
Initializes the NeuralLanguageModel.
- Parameters:
model_config – The configuration object, an instance of NeuralModelConfig or a subclass thereof.
vocab – The Vocabulary instance.
- freeze()[source]
Freezes all parameters of the model.
Sets requires_grad = False for all parameters, making them non-trainable. Useful for feature extraction or fine-tuning only a part of the model (e.g., a downstream head).
- unfreeze() → None[source]
Unfreezes all parameters of the model.
Sets requires_grad = True for all parameters, making them trainable.
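For example, assuming `model` is an instantiated NeuralLanguageModel subclass:

```python
model.freeze()  # every parameter now has requires_grad == False
assert all(not p.requires_grad for p in model.parameters())

model.unfreeze()  # make the whole model trainable again
assert all(p.requires_grad for p in model.parameters())
```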
- summary() → None[source]
Prints an enhanced summary of the neural model.
Includes the summary from BaseLanguageModel and also prints the PyTorch model structure (layers and parameters).
- abstract forward(*args: Any, **kwargs: Any) → Dict[str, Tensor][source]
Defines the computation performed at every call.
Subclasses must implement this method. It should take tensors as input (e.g., input_ids, attention_mask) and return a dictionary of output tensors (e.g., logits, hidden_states, loss if computed).
- Parameters:
*args – Variable length argument list for model inputs.
**kwargs – Arbitrary keyword arguments for model inputs.
- Returns:
A dictionary where keys are string names of outputs (e.g., “logits”, “last_hidden_state”) and values are the corresponding `torch.Tensor`s.
- Raises:
NotImplementedError – If the subclass does not implement this method.
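Using the hypothetical `TinyLM` sketch above (and a `Vocabulary` instance `vocab` built elsewhere), invoking the model dispatches to forward() via torch.nn.Module:

```python
import torch

model = TinyLM(config, vocab)  # config and vocab from the earlier sketches
input_ids = torch.randint(0, 10_000, (2, 16))  # dummy batch of token IDs

outputs = model(input_ids=input_ids)
print(outputs["logits"].shape)  # torch.Size([2, 16, 10000])
```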
- generate_sequence(prompt: str, generate_config: GenerateConfig | None = None, tokenizer_config: TokenizerConfig | None = None, **kwargs: Any) → str[source]
Generates a sequence of text starting from a given prompt.
This method is typically applicable to autoregressive models (e.g., GPT-like LMs, sequence-to-sequence models in generation mode). Non-autoregressive models (like MLP classifiers or Word2Vec) should raise NotImplementedError.
The implementation should handle:
1. Tokenizing the input prompt using self.vocab.tokenizer.
2. Iteratively predicting the next token.
3. Applying sampling strategies specified in generate_config (e.g., greedy, top-k).
4. Stopping generation based on max_length or an end-of-sequence token.
5. Detokenizing the generated token IDs back into a string.
- Parameters:
prompt – The initial text string to start generation from.
generate_config – A GenerateConfig object specifying generation parameters like max_length, sampling strategy, top_k, etc. If None, default generation parameters should be used.
tokenizer_config – An optional TokenizerConfig for encoding the prompt. If None, a sensible default configuration should be used (e.g., no padding, and no truncation unless the prompt exceeds the model's maximum length).
**kwargs – Additional keyword arguments that might be passed to the tokenizer’s encode method when processing the prompt.
- Returns:
The generated text string (with or without the initial prompt, depending on the implementation).
- Raises:
NotImplementedError – If the model does not support sequence generation.
ValueError – If prerequisites for generation (like a tokenizer in vocab) are missing.
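A call might look like the sketch below. The GenerateConfig import path and field names (max_length, top_k) are assumptions based only on the parameter description above, and the model must actually support autoregressive generation:

```python
from banhxeo.model.config import GenerateConfig  # import path is an assumption

gen_config = GenerateConfig(max_length=50, top_k=10)  # field names assumed
text = model.generate_sequence("Once upon a time", generate_config=gen_config)
print(text)
```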
- attach_downstream_head(head_name: str, head_module: Module) → None[source]
Attaches a task-specific head to the base model.
This allows reusing the base model’s learned representations for different downstream tasks (e.g., classification, token classification) by adding a new final layer or set of layers.
- Parameters:
head_name – A unique string name for the head. If a head with this name already exists, it will be replaced.
head_module – The torch.nn.Module instance representing the head. This head will be registered in self.downstream_heads.
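For example, attaching a hypothetical two-class sentiment head on top of the base model's hidden states (the expected_input_key attribute is an assumption derived from the Raises note under get_downstream_head_output below):

```python
import torch.nn as nn

# Hypothetical 2-class head over the base model's hidden states.
sentiment_head = nn.Linear(config.embedding_dim, 2)
# Assumed convention: name the output-dict key this head consumes.
sentiment_head.expected_input_key = "last_hidden_state"

model.attach_downstream_head("sentiment", sentiment_head)
```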
- get_downstream_head_output(head_name: str, base_model_output: Dict[str, Tensor], **head_kwargs) → Tensor[source]
Passes features from the base model’s output through a specified downstream head.
- Parameters:
head_name – The name of the downstream head to use (must have been previously attached via attach_downstream_head).
base_model_output – The dictionary output from the base model’s forward() method. The head might expect specific keys from this dictionary (e.g., “last_hidden_state”, “pooled_output”).
**head_kwargs – Additional keyword arguments to pass to the downstream head’s forward method.
- Returns:
The output tensor from the specified downstream head.
- Raises:
KeyError – If head_name is not found in self.downstream_heads.
ValueError – If the head_module (from self.downstream_heads[head_name]) is not a callable nn.Module or if its expected_input_key (if defined) is not in base_model_output.
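Continuing the sentiment-head sketch above:

```python
base_output = model(input_ids=input_ids)  # dict returned by forward()
head_logits = model.get_downstream_head_output("sentiment", base_output)
print(head_logits.shape)  # e.g. (batch, seq_len, 2) for the linear head above
```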
- save_model(save_directory: str | Path) → None[source]
Saves the neural model’s state_dict, configuration, and vocabulary.
The model’s state_dict is saved to pytorch_model.bin (or similar), the configuration (self.config) to config.json, and the vocabulary (self.vocab) to vocabulary.json within the specified save_directory.
- Parameters:
save_directory – The directory path where the model components will be saved. The directory will be created if it doesn’t exist.
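Saving is a single call; the directory is created if it does not exist:

```python
from pathlib import Path

model.save_model(Path("checkpoints/tiny_lm"))
# checkpoints/tiny_lm/ now holds the weights file, config.json, and vocabulary.json
```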
- classmethod load_model(load_directory: str | Path, vocab: Vocabulary | None = None, tokenizer_for_vocab_load: Tokenizer | None = None, **model_kwargs: Any) → NeuralLanguageModel[source]
Loads a neural model from a saved directory.
This method reconstructs the model by:
1. Loading the configuration from config.json.
2. Loading the vocabulary from vocabulary.json (if it exists and vocab is not provided).
3. Instantiating the model with the loaded config and vocab.
4. Loading the saved weights from pytorch_model.bin into the model.
- Parameters:
cls – The specific NeuralLanguageModel subclass to instantiate.
load_directory – The directory path from which to load the model components.
vocab – An optional pre-loaded Vocabulary instance. If provided, loading vocabulary.json from the directory is skipped.
tokenizer_for_vocab_load – A Tokenizer instance, required if vocab is None and vocabulary.json needs to be loaded (as Vocabulary.load requires a tokenizer).
**model_kwargs – Additional keyword arguments to pass to the model’s __init__ method, potentially overriding loaded configuration values.
- Returns:
An instance of the neural model class, loaded with configuration and weights.
- Raises:
FileNotFoundError – If config.json or pytorch_model.bin is missing.
ValueError – If vocabulary cannot be loaded/provided and is essential.
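Loading mirrors saving. Either pass a pre-built Vocabulary, or pass a tokenizer so vocabulary.json can be loaded; the sketch below assumes the hypothetical TinyLM subclass and a `tokenizer` instance built elsewhere:

```python
# Option 1: supply a pre-loaded Vocabulary.
model = TinyLM.load_model("checkpoints/tiny_lm", vocab=vocab)

# Option 2: let load_model read vocabulary.json, which requires a tokenizer.
model = TinyLM.load_model(
    "checkpoints/tiny_lm",
    tokenizer_for_vocab_load=tokenizer,
)
```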