banhxeo.model.classic.word2vec module
- class banhxeo.model.classic.word2vec.Word2VecDataset(text_dataset: BaseTextDataset, window_size: int, k_negative_samples: int, alpha: float, tokenizer: Tokenizer, vocab: Vocabulary)[source]
Bases: Dataset
- __init__(text_dataset: BaseTextDataset, window_size: int, k_negative_samples: int, alpha: float, tokenizer: Tokenizer, vocab: Vocabulary)[source]
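To illustrate the role of alpha and k_negative_samples, here is a minimal, self-contained sketch (not this class's internals) of drawing negative samples from an alpha-smoothed unigram distribution:

```python
import torch

# Toy word frequencies for a 5-word vocabulary.
counts = torch.tensor([50.0, 30.0, 10.0, 5.0, 5.0])

# alpha smooths the unigram distribution: P(w) ∝ count(w) ** alpha
# (0.75 is the value used in the original word2vec paper).
alpha = 0.75
probs = counts.pow(alpha)
probs = probs / probs.sum()

# Draw k negatives for one positive (target, context) pair.
k_negative_samples = 3
negatives = torch.multinomial(probs, k_negative_samples, replacement=True)
print(negatives)  # e.g. tensor([1, 0, 0]) — word ids used as negatives
```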
- class banhxeo.model.classic.word2vec.Word2Vec(model_config: NeuralModelConfig, vocab: Vocabulary)[source]
Bases: NeuralLanguageModel
An implementation of Word2Vec using the Skip-gram with Negative Sampling (SGNS) variant.
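For a positive pair of target word t and context word c, with k negative words w_1, …, w_k drawn from a noise distribution P_n(w) (typically the unigram distribution raised to the power alpha, matching the alpha parameter above), the standard SGNS objective minimized per pair is:

$$
\mathcal{L}(t, c) = -\log \sigma\!\left(\mathbf{v}_c^{\top} \mathbf{v}_t\right) - \sum_{i=1}^{k} \log \sigma\!\left(-\mathbf{v}_{w_i}^{\top} \mathbf{v}_t\right)
$$

where σ is the sigmoid and v_t, v_c are the target and context embeddings. This is the textbook formulation (Mikolov et al., 2013); the exact tensors this class returns may differ.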
- __init__(model_config: NeuralModelConfig, vocab: Vocabulary)[source]
Initializes the Word2Vec model.
- Parameters:
model_config – The configuration object, instance of NeuralModelConfig or its subclass.
vocab – The Vocabulary instance.
- forward(target_words: Integer[Tensor, 'batch'], context_words: Integer[Tensor, 'batch']) Dict[str, Tensor] [source]
Defines the computation performed at every call.
Runs the Skip-gram with Negative Sampling computation for a batch of (target, context) word pairs.
- Parameters:
target_words – Integer tensor of target word indices, shape (batch,).
context_words – Integer tensor of context word indices, shape (batch,).
- Returns:
A dictionary where keys are string names of outputs (e.g., “loss”) and values are the corresponding `torch.Tensor`s.
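A hypothetical call sketch, assuming `model` is an already-constructed Word2Vec instance and the word ids fall inside its vocabulary (the output key names are an assumption, not part of the documented API):

```python
import torch

batch_size = 4
# Hypothetical word ids; in practice these come from Word2VecDataset batches.
target_words = torch.randint(0, 100, (batch_size,))
context_words = torch.randint(0, 100, (batch_size,))

outputs = model(target_words=target_words, context_words=context_words)
# `outputs` is a Dict[str, Tensor]; for SGNS the batch loss is typically
# among its entries (e.g. outputs["loss"], key name assumed here).
```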
- static prepare_data(data: BaseTextDataset, tokenizer: Tokenizer, vocab: Vocabulary, window_size: int = 2, k_negative_samples: int = 3, **kwargs)[source]
Builds the training data for Word2Vec from a BaseTextDataset using the given tokenizer and vocabulary; window_size sets the context span around each target word, and k_negative_samples the number of negatives drawn per positive pair.
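A sketch of the intended pipeline, assuming prepare_data returns a dataset compatible with torch.utils.data.DataLoader and that corpus, tokenizer, and vocab are pre-built BaseTextDataset, Tokenizer, and Vocabulary instances:

```python
from torch.utils.data import DataLoader

from banhxeo.model.classic.word2vec import Word2Vec

# `corpus`, `tokenizer`, and `vocab` are assumed to exist already.
dataset = Word2Vec.prepare_data(
    corpus, tokenizer, vocab, window_size=2, k_negative_samples=3
)
loader = DataLoader(dataset, batch_size=256, shuffle=True)

for batch in loader:
    ...  # feed batch tensors to Word2Vec.forward during training
```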