banhxeo.model package
- class banhxeo.model.LSTM(vocab: Vocabulary, embedding_dim: int, hidden_size: int, bias: bool = False)[source]
Bases: NeuralLanguageModel
Uses batch_first=True by default.
- __init__(vocab: Vocabulary, embedding_dim: int, hidden_size: int, bias: bool = False)[source]
Initializes the LSTM model.
- Parameters:
vocab – The Vocabulary instance.
embedding_dim – The dimensionality of the token embeddings.
hidden_size – The number of features in the hidden state.
bias – Whether the recurrent layers use bias weights. Defaults to False.
- forward(input_ids: Integer[Tensor, 'batch seq'], attention_mask: Integer[Tensor, 'batch seq'] | None = None, **kwargs)[source]
Performs the forward pass of the LSTM.
- Parameters:
input_ids – Tensor of token IDs with shape (batch_size, seq_len).
attention_mask – Optional tensor indicating valid tokens (1) and padding (0) with shape (batch_size, seq_len).
**kwargs – Additional keyword arguments for model inputs.
- Returns:
A dictionary where keys are string names of outputs (e.g., “logits”, “last_hidden_state”) and values are the corresponding `torch.Tensor`s.
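A minimal usage sketch, assuming a Vocabulary instance has already been built (its construction is documented elsewhere in banhxeo); the output key shown is only an example of the dictionary contract above:

```python
import torch

from banhxeo.model import LSTM

# `vocab` is assumed to be a pre-built Vocabulary instance.
lstm = LSTM(vocab=vocab, embedding_dim=128, hidden_size=256, bias=True)

input_ids = torch.randint(0, 100, (4, 16))   # (batch=4, seq=16); 100 is a stand-in vocab size
attention_mask = torch.ones_like(input_ids)  # 1 = valid token, 0 = padding

outputs = lstm(input_ids=input_ids, attention_mask=attention_mask)
# Per the base contract, `outputs` is a dict of tensors,
# e.g. outputs.get("last_hidden_state")
```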
- class banhxeo.model.MLP(vocab: Vocabulary, output_size: int, embedding_dim: int = 128, hidden_sizes: List[int] = [256], **kwargs)[source]
Bases: NeuralLanguageModel
Multi-Layer Perceptron model for sequence classification or regression.
This model takes token embeddings, aggregates them into a single vector using a specified strategy (e.g., averaging), and then passes this vector through one or more fully connected layers to produce an output.
- __init__(vocab: Vocabulary, output_size: int, embedding_dim: int = 128, hidden_sizes: List[int] = [256], **kwargs)[source]
Initializes the MLP model.
- Parameters:
vocab – The vocabulary instance.
output_size – The dimensionality of the output layer.
embedding_dim – The dimensionality of the input token embeddings. Defaults to 128.
hidden_sizes – A list of hidden layer sizes. Defaults to [256].
**kwargs – Additional arguments for MLPConfig, such as activation_fn, dropout_rate, aggregate_strategy.
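A construction sketch; the aggregate_strategy keyword is forwarded to MLPConfig per the docstring above, but the value "average" is an assumption:

```python
from banhxeo.model import MLP

# `vocab` is assumed to be a pre-built Vocabulary instance.
mlp = MLP(
    vocab=vocab,
    output_size=2,                 # e.g., binary sequence classification
    embedding_dim=128,
    hidden_sizes=[256, 128],       # two hidden layers
    aggregate_strategy="average",  # assumed value; accepted strategies live in MLPConfig
)
```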
- forward(input_ids: Integer[Tensor, 'batch seq'], attention_mask: Integer[Tensor, 'batch seq'] | None = None, **kwargs) Dict[str, Tensor] [source]
Performs the forward pass of the MLP.
- Parameters:
input_ids – Tensor of token IDs with shape (batch_size, seq_len).
attention_mask – Optional tensor indicating valid tokens (1) and padding (0) with shape (batch_size, seq_len). If None and padding_idx is set for embeddings, a warning is issued and all tokens are assumed valid.
**kwargs – Additional keyword arguments (ignored by this base forward pass).
- Returns:
A dictionary containing “logits”: the output logits from the MLP, with shape (batch_size, output_size).
- Return type:
Dict[str, Tensor]
- Raises:
NotImplementedError – If an unsupported aggregate_strategy is configured (though check_valid should catch this).
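A forward-pass sketch continuing the construction example above; the padding id 0 is an assumption:

```python
import torch

input_ids = torch.randint(1, 100, (8, 32))  # (batch=8, seq=32); 100 is a stand-in vocab size
attention_mask = (input_ids != 0).long()    # assumes 0 is the padding id

out = mlp(input_ids=input_ids, attention_mask=attention_mask)
logits = out["logits"]                      # shape (8, output_size)
```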
- generate_sequence(*args, **kwargs) str [source]
Generates a sequence of text starting from a given prompt.
This method is typically applicable to autoregressive models (e.g., GPT-like LMs, sequence-to-sequence models in generation mode). Non-autoregressive models (like MLP classifiers or Word2Vec) should raise NotImplementedError.
The implementation should handle:
1. Tokenizing the input prompt using self.vocab.tokenizer.
2. Iteratively predicting the next token.
3. Applying sampling strategies specified in generate_config (e.g., greedy, top-k).
4. Stopping generation based on max_length or an end-of-sequence token.
5. Detokenizing the generated token IDs back into a string.
- Parameters:
prompt – The initial text string to start generation from.
generate_config – A GenerateConfig object specifying generation parameters like max_length, sampling strategy, top_k, etc. If None, default generation parameters should be used.
tokenizer_config – An optional TokenizerConfig for encoding the prompt. If None, a default sensible configuration should be used (e.g., no padding, no truncation initially unless prompt is too long for model).
**kwargs – Additional keyword arguments that might be passed to the tokenizer’s encode method when processing the prompt.
- Returns:
The generated text string (with or without the initial prompt, depending on the implementation).
- Raises:
NotImplementedError – If the model does not support sequence generation.
ValueError – If prerequisites for generation (like a tokenizer in vocab) are missing.
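A call-pattern sketch based on the documented parameters; the GenerateConfig import path and constructor arguments are assumptions, and MLP itself raises NotImplementedError here:

```python
from banhxeo.model.config import GenerateConfig  # assumed import path

try:
    text = model.generate_sequence(
        prompt="Once upon a time",
        generate_config=GenerateConfig(max_length=50),  # assumed signature
    )
    print(text)
except NotImplementedError:
    print("This model does not support sequence generation.")
```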
- class banhxeo.model.RNN(vocab: Vocabulary, embedding_dim: int, hidden_size: int, bias: bool = False)[source]
Bases: NeuralLanguageModel
Uses batch_first=True by default.
- __init__(vocab: Vocabulary, embedding_dim: int, hidden_size: int, bias: bool = False)[source]
Initializes the RNN model.
- Parameters:
vocab – The Vocabulary instance.
embedding_dim – The dimensionality of the token embeddings.
hidden_size – The number of features in the hidden state.
bias – Whether the recurrent layers use bias weights. Defaults to False.
- forward(input_ids: Integer[Tensor, 'batch seq'], attention_mask: Integer[Tensor, 'batch seq'] | None = None, **kwargs)[source]
Performs the forward pass of the RNN.
- Parameters:
input_ids – Tensor of token IDs with shape (batch_size, seq_len).
attention_mask – Optional tensor indicating valid tokens (1) and padding (0) with shape (batch_size, seq_len).
**kwargs – Additional keyword arguments for model inputs.
- Returns:
A dictionary where keys are string names of outputs (e.g., “logits”, “last_hidden_state”) and values are the corresponding `torch.Tensor`s.
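Usage mirrors the LSTM sketch above; a minimal example, again assuming a pre-built Vocabulary:

```python
import torch

from banhxeo.model import RNN

# `vocab` is assumed to be a pre-built Vocabulary instance.
rnn = RNN(vocab=vocab, embedding_dim=64, hidden_size=128)  # bias defaults to False

input_ids = torch.randint(0, 100, (2, 10))  # (batch=2, seq=10); 100 is a stand-in vocab size
outputs = rnn(input_ids=input_ids)          # attention_mask is optional
```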
- class banhxeo.model.NGram(vocab: Vocabulary, n: int = 2, smoothing: bool | str = False, k: int | None = None)[source]
Bases: BaseLanguageModel
- ConfigClass: alias of NGramConfig
- __init__(vocab: Vocabulary, n: int = 2, smoothing: bool | str = False, k: int | None = None)[source]
Initializes the NGram model.
- Parameters:
vocab – The Vocabulary instance to be used by the model.
n – The order of the n-gram model (e.g., 2 for bigrams). Defaults to 2.
smoothing – Whether (or which kind of) smoothing to apply. Defaults to False.
k – The constant for add-k smoothing, if applicable. Defaults to None.
Submodules
- banhxeo.model.base module
- banhxeo.model.config module
- banhxeo.model.neural module
NeuralModelConfig
NeuralLanguageModel
NeuralLanguageModel.__init__()
NeuralLanguageModel.freeze()
NeuralLanguageModel.unfreeze()
NeuralLanguageModel.summary()
NeuralLanguageModel.forward()
NeuralLanguageModel.generate_sequence()
NeuralLanguageModel.attach_downstream_head()
NeuralLanguageModel.get_downstream_head_output()
NeuralLanguageModel.save_model()
NeuralLanguageModel.load_model()
NeuralLanguageModel.to_gpu()
NeuralLanguageModel.to_cpu()