banhxeo.model package

class banhxeo.model.LSTM(vocab: Vocabulary, embedding_dim: int, hidden_size: int, bias: bool = False)[source]

Bases: NeuralLanguageModel

LSTM model. Uses batch_first tensor layout by default.

__init__(vocab: Vocabulary, embedding_dim: int, hidden_size: int, bias: bool = False)[source]

Initializes the LSTM model.

Parameters:
  • vocab – The Vocabulary instance.

  • embedding_dim – The dimensionality of the token embeddings.

  • hidden_size – The dimensionality of the LSTM hidden state.

  • bias – Whether the recurrent layers use bias terms. Defaults to False.

forward(input_ids: Integer[Tensor, 'batch seq'], attention_mask: Integer[Tensor, 'batch seq'] | None = None, **kwargs)[source]

Performs the forward pass of the LSTM and returns a dictionary of output tensors.

Parameters:
  • input_ids – Tensor of token IDs with shape (batch_size, seq_len).

  • attention_mask – Optional tensor indicating valid tokens (1) and padding (0) with shape (batch_size, seq_len).

  • **kwargs – Additional keyword arguments for model inputs.

Returns:

A dictionary where keys are string names of outputs (e.g., “logits”, “last_hidden_state”) and values are the corresponding `torch.Tensor`s.
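
Example (a minimal sketch, assuming a Vocabulary instance vocab has already been built and that index 0 is the padding token; the exact keys in the output dictionary depend on the LSTM implementation):

    import torch

    from banhxeo.model import LSTM

    # `vocab` is assumed to be an existing banhxeo Vocabulary.
    model = LSTM(vocab=vocab, embedding_dim=128, hidden_size=256)

    input_ids = torch.tensor([[5, 17, 42, 0, 0]])      # (batch=1, seq=5); 0 assumed to be padding
    attention_mask = torch.tensor([[1, 1, 1, 0, 0]])   # 1 = real token, 0 = padding

    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    print(outputs.keys())                              # dictionary of output tensors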

class banhxeo.model.MLP(vocab: Vocabulary, output_size: int, embedding_dim: int = 128, hidden_sizes: List[int] = [256], **kwargs)[source]

Bases: NeuralLanguageModel

Multi-Layer Perceptron model for sequence classification or regression.

This model takes token embeddings, aggregates them into a single vector using a specified strategy (e.g., averaging), and then passes this vector through one or more fully connected layers to produce an output.
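
For intuition, an averaging aggregation of this kind can be written in a few lines of plain PyTorch. This is only an illustrative sketch of the strategy, not the library’s actual implementation:

    import torch

    def masked_mean(embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        """Average token embeddings while ignoring padding positions.

        embeddings:     (batch, seq, embedding_dim)
        attention_mask: (batch, seq), 1 for real tokens, 0 for padding
        """
        mask = attention_mask.unsqueeze(-1).float()   # (batch, seq, 1)
        summed = (embeddings * mask).sum(dim=1)       # (batch, embedding_dim)
        counts = mask.sum(dim=1).clamp(min=1.0)       # avoid division by zero
        return summed / counts

    emb = torch.randn(2, 5, 128)
    mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
    pooled = masked_mean(emb, mask)                   # (2, 128), fed to the fully connected layers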

ConfigClass

alias of MLPConfig

__init__(vocab: Vocabulary, output_size: int, embedding_dim: int = 128, hidden_sizes: List[int] = [256], **kwargs)[source]

Initializes the MLP model.

Parameters:
  • vocab – The vocabulary instance.

  • output_size – The dimensionality of the output layer.

  • embedding_dim – The dimensionality of the input token embeddings. Defaults to 128.

  • hidden_sizes – A list of hidden layer sizes. Defaults to [256].

  • **kwargs – Additional arguments for MLPConfig, such as activation_fn, dropout_rate, aggregate_strategy.

forward(input_ids: Integer[Tensor, 'batch seq'], attention_mask: Integer[Tensor, 'batch seq'] | None = None, **kwargs) Dict[str, Tensor][source]

Performs the forward pass of the MLP.

Parameters:
  • input_ids – Tensor of token IDs with shape (batch_size, seq_len).

  • attention_mask – Optional tensor indicating valid tokens (1) and padding (0) with shape (batch_size, seq_len). If None and padding_idx is set for embeddings, a warning is issued and all tokens are assumed valid.

  • **kwargs – Additional keyword arguments (ignored by this base forward pass).

Returns:

A dictionary containing:

  • “logits”: Output logits from the MLP, shape (batch_size, output_size).

Return type:

Dict[str, Tensor]

Raises:

NotImplementedError – If an unsupported aggregate_strategy is configured (though check_valid should catch this).
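
Example (a minimal sketch, assuming an existing Vocabulary instance vocab; extra MLPConfig options such as activation_fn, dropout_rate, and aggregate_strategy are omitted and left at their defaults):

    import torch

    from banhxeo.model import MLP

    # `vocab` is assumed to be an existing banhxeo Vocabulary.
    model = MLP(vocab=vocab, output_size=2, embedding_dim=128, hidden_sizes=[256, 64])

    input_ids = torch.tensor([[5, 17, 42, 0, 0]])      # (batch=1, seq=5)
    attention_mask = torch.tensor([[1, 1, 1, 0, 0]])   # mark the two padding positions

    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs["logits"]                         # shape (1, 2)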

generate_sequence(*args, **kwargs) str[source]

Generates a sequence of text starting from a given prompt.

This method is typically applicable to autoregressive models (e.g., GPT-like LMs, sequence-to-sequence models in generation mode). Non-autoregressive models (like MLP classifiers or Word2Vec) should raise NotImplementedError.

The implementation should handle the following steps (sketched after this entry):

  1. Tokenizing the input prompt using self.vocab.tokenizer.

  2. Iteratively predicting the next token.

  3. Applying sampling strategies specified in generate_config (e.g., greedy, top-k).

  4. Stopping generation based on max_length or an end-of-sequence token.

  5. Detokenizing the generated token IDs back into a string.

Parameters:
  • prompt – The initial text string to start generation from.

  • generate_config – A GenerateConfig object specifying generation parameters like max_length, sampling strategy, top_k, etc. If None, default generation parameters should be used.

  • tokenizer_config – An optional TokenizerConfig for encoding the prompt. If None, a default sensible configuration should be used (e.g., no padding, no truncation initially unless prompt is too long for model).

  • **kwargs – Additional keyword arguments that might be passed to the tokenizer’s encode method when processing the prompt.

Returns:

The generated text string (with or without the initial prompt, depending on the implementation).

Raises:
  • NotImplementedError – If the model does not support sequence generation.

  • ValueError – If prerequisites for generation (like a tokenizer in vocab) are missing.
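
For models that do support generation, the numbered steps above amount to a loop like the one below. This is a generic greedy-decoding sketch, not the library’s implementation; the encode/decode calls and the eos_id attribute are placeholders for whatever the actual Vocabulary and tokenizer APIs provide:

    import torch

    @torch.no_grad()
    def greedy_generate(model, vocab, prompt: str, max_length: int = 20) -> str:
        # 1. Tokenize the prompt (placeholder call; see the real tokenizer API).
        ids = list(vocab.tokenizer.encode(prompt))
        for _ in range(max_length):
            input_ids = torch.tensor([ids])
            # 2. Predict the next token; "logits" is assumed to be (1, seq, vocab_size).
            logits = model(input_ids=input_ids)["logits"]
            # 3. Greedy strategy: take the most likely token at the last position.
            next_id = int(logits[0, -1].argmax())
            ids.append(next_id)
            # 4. Stop at the end-of-sequence token, if the vocabulary defines one.
            if next_id == getattr(vocab, "eos_id", None):
                break
        # 5. Detokenize the generated IDs back into a string (placeholder call).
        return vocab.tokenizer.decode(ids)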

class banhxeo.model.RNN(vocab: Vocabulary, embedding_dim: int, hidden_size: int, bias: bool = False)[source]

Bases: NeuralLanguageModel

RNN model. Uses batch_first tensor layout by default.

__init__(vocab: Vocabulary, embedding_dim: int, hidden_size: int, bias: bool = False)[source]

Initializes the RNN model.

Parameters:
  • vocab – The Vocabulary instance.

  • embedding_dim – The dimensionality of the token embeddings.

  • hidden_size – The dimensionality of the RNN hidden state.

  • bias – Whether the recurrent layers use bias terms. Defaults to False.

forward(input_ids: Integer[Tensor, 'batch seq'], attention_mask: Integer[Tensor, 'batch seq'] | None = None, **kwargs)[source]

Performs the forward pass of the RNN and returns a dictionary of output tensors.

Parameters:
  • input_ids – Tensor of token IDs with shape (batch_size, seq_len).

  • attention_mask – Optional tensor indicating valid tokens (1) and padding (0) with shape (batch_size, seq_len).

  • **kwargs – Additional keyword arguments for model inputs.

Returns:

A dictionary where keys are string names of outputs (e.g., “logits”, “last_hidden_state”) and values are the corresponding `torch.Tensor`s.

class banhxeo.model.NGram(vocab: Vocabulary, n: int = 2, smoothing: bool | str = False, k: int | None = None)[source]

Bases: BaseLanguageModel

ConfigClass

alias of NGramConfig

__init__(vocab: Vocabulary, n: int = 2, smoothing: bool | str = False, k: int | None = None)[source]

Initializes the NGram model.

Parameters:
  • vocab – The Vocabulary instance to be used by the model.

  • n – The order of the n-gram model. Defaults to 2.

  • smoothing – Whether (and which kind of) smoothing to apply. Defaults to False.

  • k – Optional smoothing constant (e.g., for add-k smoothing). Defaults to None.

fit(corpus: list[str])[source]

Fits the n-gram counts of the model on the given corpus.

generate_sequence(prompt: str, sampling: str = 'greedy', max_length: int | None = 20, **kwargs) str[source]

Generates a sequence of text starting from a given prompt, using the fitted n-gram counts and the chosen sampling strategy.
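
Example (a minimal sketch of the fit-then-generate workflow, assuming an existing Vocabulary instance vocab; smoothing and k are left at their defaults):

    from banhxeo.model import NGram

    # `vocab` is assumed to be an existing banhxeo Vocabulary.
    model = NGram(vocab=vocab, n=2)

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the rug",
    ]
    model.fit(corpus)

    text = model.generate_sequence("the cat", sampling="greedy", max_length=10)
    print(text)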
