banhxeo.model.classic.mlp module

class banhxeo.model.classic.mlp.MLPConfig(*, vocab_size: int | None = None, embedding_dim: int, output_size: int, hidden_sizes: List[int] = <factory>, activation_fn: str = 'relu', dropout_rate: float = 0.0, aggregate_strategy: str = 'average', window_size: int | None = None)[source]

Bases: NeuralModelConfig

Configuration for the Multi-Layer Perceptron (MLP) model.

Variables:
  • output_size (int) – The number of output units (e.g., number of classes for classification).

  • hidden_sizes (List[int]) – A list of integers, where each integer is the number of units in a hidden layer. E.g., [256, 128] for two hidden layers.

  • activation_fn (str) – The activation function to use in hidden layers. Supported: “relu”, “tanh”, “gelu”, “sigmoid”. Defaults to “relu”.

  • dropout_rate (float) – Dropout rate to apply after activation in hidden layers. Must be between 0.0 and 1.0. Defaults to 0.0.

  • aggregate_strategy (str) – Strategy to aggregate token embeddings from a sequence into a single vector before it is passed to the MLP. Supported: “average”, “max”, “sum” (sketched after this list); “concat_window” is planned. Defaults to “average”.

  • window_size (int | None) – Required if aggregate_strategy is “concat_window”. Specifies the window size for concatenating embeddings.

  • embedding_dim (from NeuralModelConfig) – Dimension of the input embeddings.
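
The aggregation step can be pictured as follows. This is an illustrative sketch only, not the library’s implementation: the aggregate helper, its masking behaviour, and the tensor shapes are assumptions written against plain PyTorch.

  import torch

  def aggregate(embeddings: torch.Tensor,
                attention_mask: torch.Tensor,
                strategy: str = "average") -> torch.Tensor:
      """Collapse (batch, seq, embedding_dim) token embeddings into (batch, embedding_dim)."""
      mask = attention_mask.unsqueeze(-1).to(embeddings.dtype)  # (batch, seq, 1)
      if strategy == "sum":
          return (embeddings * mask).sum(dim=1)
      if strategy == "average":
          lengths = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero on empty sequences
          return (embeddings * mask).sum(dim=1) / lengths
      if strategy == "max":
          # Exclude padding positions from the max by setting them to -inf.
          return embeddings.masked_fill(mask == 0, float("-inf")).max(dim=1).values
      raise NotImplementedError(f"Unsupported aggregate_strategy: {strategy}")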

output_size: int
hidden_sizes: List[int]
activation_fn: str
dropout_rate: float
aggregate_strategy: str
window_size: int | None
check_valid() → Self[source]

Validates the MLP configuration (e.g., that activation_fn and aggregate_strategy are among the supported values and that dropout_rate lies between 0.0 and 1.0).

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic’s ConfigDict.
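
A minimal construction sketch for MLPConfig, assuming it is imported from this module as documented above; the field values are illustrative, not recommendations.

  from banhxeo.model.classic.mlp import MLPConfig

  config = MLPConfig(
      embedding_dim=128,
      output_size=2,                 # e.g., binary classification
      hidden_sizes=[256, 128],       # two hidden layers
      activation_fn="gelu",
      dropout_rate=0.1,
      aggregate_strategy="average",
  ).check_valid()                    # returns the validated config (Self)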

class banhxeo.model.classic.mlp.MLP(vocab: Vocabulary, output_size: int, embedding_dim: int = 128, hidden_sizes: List[int] = [256], **kwargs)[source]

Bases: NeuralLanguageModel

Multi-Layer Perceptron model for sequence classification or regression.

This model takes token embeddings, aggregates them into a single vector using a specified strategy (e.g., averaging), and then passes this vector through one or more fully connected layers to produce an output.

ConfigClass

alias of MLPConfig

__init__(vocab: Vocabulary, output_size: int, embedding_dim: int = 128, hidden_sizes: List[int] = [256], **kwargs)[source]

Initializes the MLP model.

Parameters:
  • vocab – The vocabulary instance.

  • output_size – The dimensionality of the output layer.

  • embedding_dim – The dimensionality of the input token embeddings. Defaults to 128.

  • hidden_sizes – A list of hidden layer sizes. Defaults to [256].

  • **kwargs – Additional arguments for MLPConfig, such as activation_fn, dropout_rate, aggregate_strategy.
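
A minimal construction sketch, assuming vocab is an existing Vocabulary instance built elsewhere (its construction is outside the scope of this page); keyword arguments beyond the explicit parameters are forwarded to MLPConfig via **kwargs.

  from banhxeo.model.classic.mlp import MLP

  model = MLP(
      vocab=vocab,                   # an existing Vocabulary instance (assumed)
      output_size=2,
      embedding_dim=128,
      hidden_sizes=[256, 128],
      activation_fn="relu",          # forwarded to MLPConfig
      dropout_rate=0.1,
      aggregate_strategy="average",
  )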

forward(input_ids: Integer[Tensor, 'batch seq'], attention_mask: Integer[Tensor, 'batch seq'] | None = None, **kwargs) → Dict[str, Tensor][source]

Performs the forward pass of the MLP.

Parameters:
  • input_ids – Tensor of token IDs with shape (batch_size, seq_len).

  • attention_mask – Optional tensor indicating valid tokens (1) and padding (0) with shape (batch_size, seq_len). If None and padding_idx is set for embeddings, a warning is issued and all tokens are assumed valid.

  • **kwargs – Additional keyword arguments (ignored by this base forward pass).

Returns:

  A dictionary containing:

  • “logits”: Output logits from the MLP, shape (batch_size, output_size).

Return type:

  Dict[str, Tensor]

Raises:

NotImplementedError – If an unsupported aggregate_strategy is configured (though check_valid should catch this).
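
An illustrative call, reusing the model instance from the construction sketch above; the token IDs are dummy values rather than output from the vocabulary’s tokenizer.

  import torch

  input_ids = torch.randint(0, 100, (4, 16))            # (batch_size=4, seq_len=16), dummy IDs
  attention_mask = torch.ones(4, 16, dtype=torch.long)  # all positions treated as valid

  outputs = model.forward(input_ids=input_ids, attention_mask=attention_mask)
  logits = outputs["logits"]                            # shape (4, output_size)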

generate_sequence(*args, **kwargs) → str[source]

Generates a sequence of text starting from a given prompt.

This method is typically applicable to autoregressive models (e.g., GPT-like LMs, sequence-to-sequence models in generation mode). Non-autoregressive models (like MLP classifiers or Word2Vec) should raise NotImplementedError.

The implementation should handle:

  1. Tokenizing the input prompt using self.vocab.tokenizer.

  2. Iteratively predicting the next token.

  3. Applying sampling strategies specified in generate_config (e.g., greedy, top-k).

  4. Stopping generation based on max_length or an end-of-sequence token.

  5. Detokenizing the generated token IDs back into a string.

Parameters:
  • prompt – The initial text string to start generation from.

  • generate_config – A GenerateConfig object specifying generation parameters like max_length, sampling strategy, top_k, etc. If None, default generation parameters should be used.

  • tokenizer_config – An optional TokenizerConfig for encoding the prompt. If None, a sensible default configuration should be used (e.g., no padding or truncation unless the prompt is too long for the model).

  • **kwargs – Additional keyword arguments that might be passed to the tokenizer’s encode method when processing the prompt.

Returns:

The generated text string (excluding the initial prompt, or including it, based on implementation choice).

Raises:
  • NotImplementedError – If the model does not support sequence generation.

  • ValueError – If prerequisites for generation (like a tokenizer in vocab) are missing.
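
Since the MLP is not autoregressive, the note above implies that generation is unsupported here; a usage sketch (reusing the model from the construction sketch earlier, and assuming the documented NotImplementedError behaviour) would be:

  try:
      model.generate_sequence("an example prompt")
  except NotImplementedError:
      print("MLP does not support sequence generation")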