banhxeo.model.classic.word2vec module

class banhxeo.model.classic.word2vec.Word2VecDataset(text_dataset: BaseTextDataset, window_size: int, k_negative_samples: int, alpha: float, tokenizer: Tokenizer, vocab: Vocabulary)[source]

Bases: Dataset

__init__(text_dataset: BaseTextDataset, window_size: int, k_negative_samples: int, alpha: float, tokenizer: Tokenizer, vocab: Vocabulary)[source]
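The dataset pairs each target word with every context word inside window_size, and attaches k_negative_samples negatives drawn from the unigram distribution raised to the power alpha (0.75 in Mikolov et al., 2013, which up-weights rare words relative to raw frequency). The following is an illustrative sketch only, in plain Python with hypothetical names, not banhxeo's actual implementation:

```python
import random

# Hypothetical sketch (not banhxeo's API): turn a token-ID sequence into
# SGNS training triples. For each target position, every word within
# `window_size` becomes a positive context, and `k` negatives are drawn
# from the unigram distribution raised to the power `alpha`.
def sgns_examples(ids, window_size, k, counts, alpha, rng=random):
    vocab = list(range(len(counts)))
    # Smoothed unigram weights: count(w) ** alpha.
    weights = [counts[w] ** alpha for w in vocab]
    examples = []
    for i, target in enumerate(ids):
        lo, hi = max(0, i - window_size), min(len(ids), i + window_size + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # skip the target itself
            negatives = rng.choices(vocab, weights=weights, k=k)
            examples.append((target, ids[j], negatives))
    return examples

# 4-token corpus, vocabulary of 3 word IDs with toy counts.
pairs = sgns_examples([0, 1, 2, 1], window_size=2, k=3,
                      counts=[4, 2, 1], alpha=0.75)
```

Each element of `pairs` is a `(target, positive_context, [negative, ...])` triple; a real dataset class would typically yield these as tensors from `__getitem__`.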
class banhxeo.model.classic.word2vec.Word2Vec(model_config: NeuralModelConfig, vocab: Vocabulary)[source]

Bases: NeuralLanguageModel

An implementation of the Word2Vec SGNS (skip-gram with negative sampling) variant.

__init__(model_config: NeuralModelConfig, vocab: Vocabulary)[source]

Initializes the Word2Vec model.

Parameters:
  • model_config – The configuration object, instance of NeuralModelConfig or its subclass.

  • vocab – The Vocabulary instance.

forward(target_words: Integer[Tensor, 'batch'], context_words: Integer[Tensor, 'batch']) Dict[str, Tensor][source]

Computes the SGNS objective for a batch of (target, context) word pairs.

Parameters:
  • target_words – Integer tensor of target word IDs, shape (batch,).

  • context_words – Integer tensor of context word IDs, shape (batch,).

Returns:

A dictionary where keys are string names of outputs (e.g., “loss”) and values are the corresponding `torch.Tensor`s.
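For intuition, the SGNS objective for one target word t with positive context c and negatives n_1..n_k is -log σ(v_t · u_c) - Σ_n log σ(-v_t · u_n): it pushes the target embedding toward the true context and away from the sampled negatives. A minimal numeric sketch in plain Python (a hypothetical helper, not the model's actual forward pass):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Illustrative only: SGNS loss for a single target word.
#   loss = -log σ(v_t · u_c)  -  Σ_n log σ(-v_t · u_n)
def sgns_loss(target_vec, context_vec, negative_vecs):
    loss = -math.log(sigmoid(dot(target_vec, context_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(target_vec, neg)))
    return loss

# Aligned positive pair, two negatives (one anti-aligned, one orthogonal).
loss = sgns_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0], [0.0, 1.0]])
```

A batched implementation would vectorize the same expression over embedding lookups and sum (or average) the per-pair losses into the "loss" tensor returned by forward.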

static prepare_data(data: BaseTextDataset, tokenizer: Tokenizer, vocab: Vocabulary, window_size: int = 2, k_negative_samples: int = 3, **kwargs)[source]