banhxeo.model.classic.mlp module

class banhxeo.model.classic.mlp.MLPConfig(*, vocab_size: int | None = None, embedding_dim: int, output_size: int, hidden_sizes: List[int] = <factory>, activation_fn: str = 'relu', dropout_rate: float = 0.0, aggregate_strategy: str = 'average', window_size: int | None = None)[source]

Bases: NeuralModelConfig

Configuration for the Multi-Layer Perceptron (MLP) model.

Variables:
  • output_size (int) – The number of output units (e.g., number of classes for classification).

  • hidden_sizes (List[int]) – A list of integers, where each integer is the number of units in a hidden layer. E.g., [256, 128] for two hidden layers.

  • activation_fn (str) – The activation function to use in hidden layers. Supported: “relu”, “tanh”, “gelu”, “sigmoid”. Defaults to “relu”.

  • dropout_rate (float) – Dropout rate to apply after activation in hidden layers. Must be between 0.0 and 1.0. Defaults to 0.0.

  • aggregate_strategy (str) – Strategy to aggregate token embeddings from a sequence into a single vector before it is passed to the MLP. Supported: “average”, “max”, “sum” (sketched after this list); “concat_window” is planned. Defaults to “average”.

  • window_size (int | None) – Required if aggregate_strategy is “concat_window”. Specifies the window size for concatenating embeddings.

  • embedding_dim (from NeuralModelConfig) – Dimension of the input embeddings.
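
The aggregation step can be pictured as follows. This is an illustrative sketch only, not the library’s implementation: the aggregate helper, its masking behaviour, and the tensor shapes are assumptions written against plain PyTorch.

  import torch

  def aggregate(embeddings: torch.Tensor,
                attention_mask: torch.Tensor,
                strategy: str = "average") -> torch.Tensor:
      """Collapse (batch, seq, embedding_dim) token embeddings into (batch, embedding_dim)."""
      mask = attention_mask.unsqueeze(-1).to(embeddings.dtype)  # (batch, seq, 1)
      if strategy == "sum":
          return (embeddings * mask).sum(dim=1)
      if strategy == "average":
          lengths = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero on empty sequences
          return (embeddings * mask).sum(dim=1) / lengths
      if strategy == "max":
          # Exclude padding positions from the max by setting them to -inf.
          return embeddings.masked_fill(mask == 0, float("-inf")).max(dim=1).values
      raise NotImplementedError(f"Unsupported aggregate_strategy: {strategy}")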

output_size: int
hidden_sizes: List[int]
activation_fn: str
dropout_rate: float
aggregate_strategy: str
window_size: int | None
check_valid() → Self[source]

Validates the MLP configuration (e.g., that activation_fn and aggregate_strategy are among the supported values and that dropout_rate lies between 0.0 and 1.0).

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic’s ConfigDict.
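
A minimal construction sketch for MLPConfig, assuming it is imported from this module as documented above; the field values are illustrative, not recommendations.

  from banhxeo.model.classic.mlp import MLPConfig

  config = MLPConfig(
      embedding_dim=128,
      output_size=2,                 # e.g., binary classification
      hidden_sizes=[256, 128],       # two hidden layers
      activation_fn="gelu",
      dropout_rate=0.1,
      aggregate_strategy="average",
  ).check_valid()                    # returns the validated config (Self)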

class banhxeo.model.classic.mlp.MLP(vocab: Vocabulary, output_size: int, embedding_dim: int = 128, hidden_sizes: List[int] = [256], **kwargs)[source]

Bases: NeuralLanguageModel

Multi-Layer Perceptron model for sequence classification or regression.

This model takes token embeddings, aggregates them into a single vector using a specified strategy (e.g., averaging), and then passes this vector through one or more fully connected layers to produce an output.

ConfigClass

alias of MLPConfig

__init__(vocab: Vocabulary, output_size: int, embedding_dim: int = 128, hidden_sizes: List[int] = [256], **kwargs)[source]

Initializes the MLP model.

Parameters:
  • vocab – The vocabulary instance.

  • output_size – The dimensionality of the output layer.

  • embedding_dim – The dimensionality of the input token embeddings. Defaults to 128.

  • hidden_sizes – A list of hidden layer sizes. Defaults to [256].

  • **kwargs – Additional arguments for MLPConfig, such as activation_fn, dropout_rate, aggregate_strategy.
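
A minimal construction sketch, assuming vocab is an existing Vocabulary instance built elsewhere (its construction is outside the scope of this page); keyword arguments beyond the explicit parameters are forwarded to MLPConfig via **kwargs.

  from banhxeo.model.classic.mlp import MLP

  model = MLP(
      vocab=vocab,                   # an existing Vocabulary instance (assumed)
      output_size=2,
      embedding_dim=128,
      hidden_sizes=[256, 128],
      activation_fn="relu",          # forwarded to MLPConfig
      dropout_rate=0.1,
      aggregate_strategy="average",
  )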

forward(input_ids: Integer[Tensor, 'batch seq'], attention_mask: Integer[Tensor, 'batch seq'] | None = None, **kwargs) → Dict[str, Tensor][source]

Performs the forward pass of the MLP.

Parameters:
  • input_ids – Tensor of token IDs with shape (batch_size, seq_len).

  • attention_mask – Optional tensor indicating valid tokens (1) and padding (0) with shape (batch_size, seq_len). If None and padding_idx is set for embeddings, a warning is issued and all tokens are assumed valid.

  • **kwargs – Additional keyword arguments (ignored by this base forward pass).

Returns:

  A dictionary containing:

  • “logits”: Output logits from the MLP, shape (batch_size, output_size).

Return type:

  Dict[str, Tensor]

Raises:

NotImplementedError – If an unsupported aggregate_strategy is configured (though check_valid should catch this).
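
An illustrative call, reusing the model instance from the construction sketch above; the token IDs are dummy values rather than output from the vocabulary’s tokenizer.

  import torch

  input_ids = torch.randint(0, 100, (4, 16))            # (batch_size=4, seq_len=16), dummy IDs
  attention_mask = torch.ones(4, 16, dtype=torch.long)  # all positions treated as valid

  outputs = model.forward(input_ids=input_ids, attention_mask=attention_mask)
  logits = outputs["logits"]                            # shape (4, output_size)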

generate_sequence(*args, **kwargs) → str[source]

Generates a sequence of text starting from a given prompt.

This method is typically applicable to autoregressive models (e.g., GPT-like LMs, sequence-to-sequence models in generation mode). Non-autoregressive models (like MLP classifiers or Word2Vec) should raise NotImplementedError.

The implementation should handle:

  1. Tokenizing the input prompt using self.vocab.tokenizer.

  2. Iteratively predicting the next token.

  3. Applying sampling strategies specified in generate_config (e.g., greedy, top-k).

  4. Stopping generation based on max_length or an end-of-sequence token.

  5. Detokenizing the generated token IDs back into a string.

Parameters:
  • prompt – The initial text string to start generation from.

  • generate_config – A GenerateConfig object specifying generation parameters like max_length, sampling strategy, top_k, etc. If None, default generation parameters should be used.

  • tokenizer_config – An optional TokenizerConfig for encoding the prompt. If None, a sensible default configuration should be used (e.g., no padding or truncation unless the prompt is too long for the model).

  • **kwargs – Additional keyword arguments that might be passed to the tokenizer’s encode method when processing the prompt.

Returns:

The generated text string (excluding the initial prompt, or including it, based on implementation choice).

Raises:
  • NotImplementedError – If the model does not support sequence generation.

  • ValueError – If prerequisites for generation (like a tokenizer in vocab) are missing.
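
Since the MLP is not autoregressive, the note above implies that generation is unsupported here; a usage sketch (reusing the model from the construction sketch earlier, and assuming the documented NotImplementedError behaviour) would be:

  try:
      model.generate_sequence("an example prompt")
  except NotImplementedError:
      print("MLP does not support sequence generation")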