banhxeo.model.classic.mlp module
- class banhxeo.model.classic.mlp.MLPConfig(*, vocab_size: int | None = None, embedding_dim: int, output_size: int, hidden_sizes: List[int] = <factory>, activation_fn: str = 'relu', dropout_rate: float = 0.0, aggregate_strategy: str = 'average', window_size: int | None = None)[source]
Bases:
NeuralModelConfig
Configuration for the Multi-Layer Perceptron (MLP) model.
- Variables:
output_size (int) – The number of output units (e.g., number of classes for classification).
hidden_sizes (List[int]) – A list of integers, where each integer is the number of units in a hidden layer. E.g., [256, 128] for two hidden layers.
activation_fn (str) – The activation function to use in hidden layers. Supported: “relu”, “tanh”, “gelu”, “sigmoid”. Defaults to “relu”.
dropout_rate (float) – Dropout rate to apply after activation in hidden layers. Must be between 0.0 and 1.0. Defaults to 0.0.
aggregate_strategy (str) – Strategy to aggregate token embeddings from a sequence into a single vector before passing to the MLP. Supported: “average”, “max”, “sum”. “concat_window” is planned. Defaults to “average”.
window_size (int | None) – Required if aggregate_strategy is “concat_window”. Specifies the window size for concatenating embeddings.
embedding_dim (from NeuralModelConfig) – Dimension of the input embeddings.
- output_size: int
- activation_fn: str
- dropout_rate: float
- aggregate_strategy: str
- window_size: int | None
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.
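A minimal usage sketch (assuming the import path shown in the class signature above; the field values are purely illustrative). The second half uses plain torch to show what the "average" aggregation strategy is documented to compute; it is not the library's internal implementation.

```python
import torch
from banhxeo.model.classic.mlp import MLPConfig

# Illustrative field values; embedding_dim and output_size are the required fields.
config = MLPConfig(
    embedding_dim=128,
    output_size=2,                # e.g., binary classification
    hidden_sizes=[256, 128],      # two hidden layers
    activation_fn="relu",
    dropout_rate=0.1,
    aggregate_strategy="average",
)

# What the "average" strategy is documented to compute, sketched with plain torch:
# a (batch, seq, dim) tensor of token embeddings is pooled into a single (batch, dim) vector.
token_embeddings = torch.randn(4, 10, config.embedding_dim)  # (batch_size, seq_len, embedding_dim)
pooled = token_embeddings.mean(dim=1)                        # (batch_size, embedding_dim)
assert pooled.shape == (4, config.embedding_dim)
```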
- class banhxeo.model.classic.mlp.MLP(vocab: Vocabulary, output_size: int, embedding_dim: int = 128, hidden_sizes: List[int] = [256], **kwargs)[source]
Bases:
NeuralLanguageModel
Multi-Layer Perceptron model for sequence classification or regression.
This model takes token embeddings, aggregates them into a single vector using a specified strategy (e.g., averaging), and then passes this vector through one or more fully connected layers to produce an output.
- __init__(vocab: Vocabulary, output_size: int, embedding_dim: int = 128, hidden_sizes: List[int] = [256], **kwargs)[source]
Initializes the MLP model.
- Parameters:
vocab – The vocabulary instance.
output_size – The dimensionality of the output layer.
embedding_dim – The dimensionality of the input token embeddings. Defaults to 128.
hidden_sizes – A list of hidden layer sizes. Defaults to [256].
**kwargs – Additional arguments for MLPConfig, such as activation_fn, dropout_rate, aggregate_strategy.
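A hedged construction sketch. It assumes a Vocabulary instance named `vocab` already exists (building one is not covered on this page); the keyword arguments mirror the documented constructor, with the extra MLPConfig fields forwarded through **kwargs.

```python
from banhxeo.model.classic.mlp import MLP

# `vocab` is assumed to be an existing banhxeo Vocabulary instance;
# constructing one is outside the scope of this page.
model = MLP(
    vocab=vocab,
    output_size=2,                 # e.g., two classes
    embedding_dim=128,
    hidden_sizes=[256, 128],
    activation_fn="tanh",          # forwarded to MLPConfig via **kwargs
    dropout_rate=0.1,              # forwarded via **kwargs
    aggregate_strategy="average",  # forwarded via **kwargs
)
```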
- forward(input_ids: Integer[Tensor, 'batch seq'], attention_mask: Integer[Tensor, 'batch seq'] | None = None, **kwargs) Dict[str, Tensor] [source]
Performs the forward pass of the MLP.
- Parameters:
input_ids – Tensor of token IDs with shape (batch_size, seq_len).
attention_mask – Optional tensor indicating valid tokens (1) and padding (0) with shape (batch_size, seq_len). If None and padding_idx is set for embeddings, a warning is issued and all tokens are assumed valid.
**kwargs – Additional keyword arguments (ignored by this base forward pass).
- Returns:
A dictionary containing “logits”: the output logits from the MLP, with shape (batch_size, output_size).
- Return type:
Dict[str, Tensor]
- Raises:
NotImplementedError – If an unsupported aggregate_strategy is configured (though check_valid should catch this).
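A forward-pass sketch, continuing from the construction example above and assuming the model behaves like a standard torch.nn.Module (so calling the instance dispatches to forward). The tensors are random placeholders for real tokenized input.

```python
import torch

batch_size, seq_len = 4, 16

# Random token IDs stand in for the output of a real tokenizer/vocabulary.
input_ids = torch.randint(low=0, high=1000, size=(batch_size, seq_len))
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)  # 1 = valid token

outputs = model(input_ids=input_ids, attention_mask=attention_mask)
logits = outputs["logits"]   # shape: (batch_size, output_size)
print(logits.shape)          # e.g., torch.Size([4, 2])
```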
- generate_sequence(*args, **kwargs) str [source]
Generates a sequence of text starting from a given prompt.
This method is typically applicable to autoregressive models (e.g., GPT-like LMs, sequence-to-sequence models in generation mode). Non-autoregressive models (like MLP classifiers or Word2Vec) should raise NotImplementedError.
The implementation should handle:
1. Tokenizing the input prompt using self.vocab.tokenizer.
2. Iteratively predicting the next token.
3. Applying sampling strategies specified in generate_config (e.g., greedy, top-k).
4. Stopping generation based on max_length or an end-of-sequence token.
5. Detokenizing the generated token IDs back into a string.
- Parameters:
prompt – The initial text string to start generation from.
generate_config – A GenerateConfig object specifying generation parameters like max_length, sampling strategy, top_k, etc. If None, default generation parameters should be used.
tokenizer_config – An optional TokenizerConfig for encoding the prompt. If None, a sensible default configuration should be used (e.g., no padding and no truncation, unless the prompt is too long for the model).
**kwargs – Additional keyword arguments that might be passed to the tokenizer’s encode method when processing the prompt.
- Returns:
The generated text string (whether the initial prompt is included depends on the implementation).
- Raises:
NotImplementedError – If the model does not support sequence generation.
ValueError – If prerequisites for generation (like a tokenizer in vocab) are missing.
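Since the MLP is not autoregressive, the contract above implies that generate_sequence should raise NotImplementedError rather than produce text. A small sketch, continuing from the construction example above:

```python
# `model` is the MLP instance built in the earlier example.
try:
    model.generate_sequence("the quick brown fox")
except NotImplementedError:
    # Expected: the MLP classifier does not support autoregressive generation.
    print("MLP does not support sequence generation.")
```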