raggie package

Submodules

raggie.data module

class raggie.data.RaggieData(file_path: str)

Bases: RaggieDataClass

Data handler for paired data, constructed from a file path.

This class extends RaggieDataClass, the interface for iterating over paired data, and exposes the pairs through the data property. Users can implement their own data handling logic by extending RaggieDataClass.

property data: List[Tuple[str, str]]

Get the paired data.

Returns: A list of tuples containing paired data.
Return type: List[Tuple[str, str]]

class raggie.data.RaggieDataLoader(data_dir: str)

Bases: RaggieDataLoaderClass

Default implementation of RaggieDataLoaderClass.

This class provides methods to load and manage training, testing, and validation data from a specified directory. It supports JSONL, JSON, and CSV file formats.

find_data_files() → List[str]

Find and return a list of data files in the specified directory.

Returns: A list of file names found in the directory.
Return type: List[str]

label_data_files() → Dict[str, str]

Label data files in the directory for training, testing, and validation.

Returns: A dictionary mapping data types to file paths.
Return type: Dict[str, str]
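
The rule that label_data_files() uses to assign files to splits is not stated in this reference. As a rough sketch only, assuming files are matched by the substrings "train", "test", and "val" in their names, the two methods above could be approximated with standalone helpers. The names sketch_find_data_files, sketch_label_data_files, and demo_labels are hypothetical and not part of raggie.

```python
import os
import tempfile
from typing import Dict, List

# File formats named in the RaggieDataLoader description.
SUPPORTED_EXTENSIONS = (".jsonl", ".json", ".csv")


def sketch_find_data_files(data_dir: str) -> List[str]:
    """Return the sorted names of supported data files in data_dir."""
    return sorted(
        name
        for name in os.listdir(data_dir)
        if name.lower().endswith(SUPPORTED_EXTENSIONS)
    )


def sketch_label_data_files(data_dir: str) -> Dict[str, str]:
    """Map 'train', 'test', and 'val' to file paths.

    Assumption: a file belongs to a split when the split name appears
    in its file name. The real library may use a different rule.
    """
    labels: Dict[str, str] = {}
    for name in sketch_find_data_files(data_dir):
        stem = os.path.splitext(name)[0].lower()
        for split in ("train", "test", "val"):
            if split in stem and split not in labels:
                labels[split] = os.path.join(data_dir, name)
    return labels


def demo_labels() -> Dict[str, str]:
    """Create a throwaway directory and label the files in it."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("train.jsonl", "test.csv", "notes.txt"):
            open(os.path.join(tmp, name), "w").close()
        labels = sketch_label_data_files(tmp)
        # Report base names only, since the temporary path is random.
        return {split: os.path.basename(path) for split, path in labels.items()}
```

In this sketch a directory without a validation file simply produces no "val" entry, and unsupported files such as notes.txt are ignored.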

property test: RaggieDataClass

Load and return the testing data.

Returns: An instance of RaggieDataClass containing the testing data.
Return type: RaggieDataClass
Raises: ValueError – If no testing data file is found.

property train: RaggieDataClass

Load and return the training data.

Returns: An instance of RaggieDataClass containing the training data.
Return type: RaggieDataClass
Raises: ValueError – If no training data file is found.

property val: RaggieDataClass

Load and return the validation data.

Returns: An instance of RaggieDataClass containing the validation data.
Return type: RaggieDataClass
Raises: ValueError – If no validation data file is found.
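
Because each split property raises ValueError when its file is missing, callers that treat a split as optional may want to guard the access. The sketch below illustrates that pattern with a stand-in loader backed by a dictionary. DictBackedLoader and optional_split are hypothetical names, and the stand-in returns plain lists of pairs rather than RaggieDataClass instances.

```python
from typing import Dict, List, Optional, Tuple

Pairs = List[Tuple[str, str]]


class DictBackedLoader:
    """Hypothetical stand-in that mimics the documented ValueError behaviour."""

    def __init__(self, splits: Dict[str, Pairs]) -> None:
        self._splits = splits

    def _get(self, name: str) -> Pairs:
        if name not in self._splits:
            raise ValueError(f"No {name} data file found.")
        return self._splits[name]

    @property
    def train(self) -> Pairs:
        return self._get("train")

    @property
    def test(self) -> Pairs:
        return self._get("test")

    @property
    def val(self) -> Pairs:
        return self._get("val")


def optional_split(loader: DictBackedLoader, name: str) -> Optional[Pairs]:
    """Return the requested split, or None when the loader raises ValueError."""
    try:
        return getattr(loader, name)
    except ValueError:
        return None
```

A guard like this leaves a missing validation split as None, while a missing training split can still be treated as an error by accessing loader.train directly.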

raggie.main module

class raggie.main.Raggie(model: RaggieModelClass, data: RaggieDataClass)

Bases: object

Raggie is a retriever that uses a model to find similar keys and values based on embeddings.

evaluate_rank(value: str, ground_truth: str, top_k: int = 10) → int: Evaluate the rank of a key based on its ground truth value.

most_similar(queries: List[str] | None = None, keys: List[str] | None = None, return_all_scores: bool = False, *args, **kwargs) → List[List[Tuple[str, float]]] | List[List[str]]: Retrieve most similar keys for given queries.

retrieve(queries: List[str], top_k: int = 5, verbose: bool = True, return_all_scores: bool = False) → List[List[Tuple[str, float]]]: Retrieve keys based on queries.
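
This reference does not say which similarity measure Raggie uses or what evaluate_rank returns when the ground truth is not among the results. The following is a minimal sketch of embedding-based retrieval under two assumptions: similarity is cosine similarity, and a missing ground truth is reported as -1. The names cosine, retrieve_sketch, rank_sketch, and TOY_EMBEDDINGS are hypothetical, and the vectors are hand-written rather than produced by a model.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_sketch(
    query_vec: List[float],
    key_vecs: Dict[str, List[float]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top_k (key, score) pairs, highest similarity first."""
    scored = [(key, cosine(query_vec, vec)) for key, vec in key_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def rank_sketch(results: List[Tuple[str, float]], ground_truth: str) -> int:
    """Return the 1-based position of ground_truth in results, or -1 if absent.

    The -1 convention is an assumption made for this sketch.
    """
    for position, (key, _score) in enumerate(results, start=1):
        if key == ground_truth:
            return position
    return -1


# Toy two-dimensional embeddings standing in for model output.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
}
```

With the query vector [1.0, 0.0] and top_k=2, the sketch ranks "cat" first and "dog" second, so "car" falls outside the returned results.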

raggie.model module

class raggie.model.RaggieModel(model_path: str = None, base_model_name: str = None, output_dir: str = None)

Bases: RaggieModelClass

Default implementation of the RaggieModelClass.

This class provides methods for training, saving, and predicting embeddings using a Sentence Transformer model.

predict(docs: List[str]) → ndarray

Generate embeddings for the given documents.

Parameters: docs (List[str]) – List of documents to encode.
Returns: Array of embeddings.
Return type: numpy.ndarray

save(model_path: str) → None: Save the trained model to the specified path.

train(train_data: RaggieDataClass, val_data: RaggieDataClass | None = None, epochs: int = 5) → None

Train the model using the provided data.

Parameters:

train_data (RaggieDataClass) – The data handler providing training data.
val_data (Optional[RaggieDataClass]) – The data handler providing validation data.
epochs (int) – Number of training epochs.

raggie.types module

class raggie.types.RaggieDataClass

Bases: ABC

Abstract base class for Raggie data handling.

This class defines the interface for iterating over paired data. Users can implement their own data handling logic by extending this class.

abstract property data: List[Tuple[str, str]]

Get the paired data.

Returns: A list of tuples containing paired data.
Return type: List[Tuple[str, str]]
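
As an illustration of extending this interface, here is a minimal sketch of a custom data handler that keeps its pairs in memory. To stay runnable without raggie installed, it defines a local stand-in for RaggieDataClass that mirrors only the documented abstract data property. InMemoryPairs is a hypothetical name.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Tuple


class RaggieDataClass(ABC):
    """Local stand-in mirroring the documented interface.

    With raggie installed, the class documented as
    raggie.types.RaggieDataClass would be used instead.
    """

    @property
    @abstractmethod
    def data(self) -> List[Tuple[str, str]]:
        """Get the paired data."""


class InMemoryPairs(RaggieDataClass):
    """Hypothetical handler that serves pairs supplied at construction time."""

    def __init__(self, pairs: Iterable[Tuple[str, str]]) -> None:
        # Normalise to strings so the property matches the documented type.
        self._pairs = [(str(key), str(value)) for key, value in pairs]

    @property
    def data(self) -> List[Tuple[str, str]]:
        # Return a copy so callers cannot mutate the stored pairs.
        return list(self._pairs)
```

Because data is declared abstract, instantiating the base class directly raises TypeError, while the subclass works once it overrides the property.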

class raggie.types.RaggieDataLoaderClass

Bases: ABC

Abstract base class for Raggie data loaders.

This class defines the interface for loading and managing training, testing, and validation data. Users can implement their own data handling logic by extending this class.

abstract property test: RaggieDataClass

Load and return the testing data.

Returns: An instance of RaggieDataClass containing the testing data.
Return type: RaggieDataClass

abstract property train: RaggieDataClass

Load and return the training data.

Returns: An instance of RaggieDataClass containing the training data.
Return type: RaggieDataClass

abstract property val: RaggieDataClass

Load and return the validation data.

Returns: An instance of RaggieDataClass containing the validation data.
Return type: RaggieDataClass

class raggie.types.RaggieModelClass

Bases: ABC

Abstract base class for Raggie models.

This class defines the interface for training, saving, and predicting embeddings using a machine learning model.

abstractmethod predict(docs: List[str]) → ndarray

Generate embeddings for the given documents.

Parameters: docs (List[str]) – List of documents to encode.
Returns: Array of embeddings.
Return type: numpy.ndarray

abstractmethod save() → None: Save the trained model to the output directory.

abstractmethod train(data: RaggieDataClass, *args, **kwargs) → None

Train the model using the provided data.

Parameters:

data (RaggieDataClass) – The data handler providing training and validation data.
*args – Additional arguments for training.
**kwargs – Additional keyword arguments for training.

class raggie.types.RaggiePlotterClass

Bases: ABC

Abstract base class for Raggie plotters.

This class defines the interface for visualizing embeddings using t-SNE and optional clustering.

abstractmethod plot(keys: List[str], perplexity: float | None = None, learning_rate: float | str = 'auto', n_iter_without_progress: int = 1000, random_state: int = 42, n_clusters: int = None, show: bool = True, save_path: str | None = None) → None

Plot the t-SNE visualization of keys with optional clustering.

Parameters:

keys (List[str]) – List of keys to visualize.
perplexity (Optional[float]) – Perplexity parameter for t-SNE.
learning_rate (Union[float, str]) – Learning rate for t-SNE.
n_iter_without_progress (int) – Number of iterations without progress before stopping.
random_state (int) – Random seed for reproducibility.
n_clusters (Optional[int]) – Number of clusters for k-means. Defaults to None.
show (bool) – Whether to display the plot.
save_path (Optional[str]) – Path to save the plot (optional).

raggie.utils module

class raggie.utils.RaggiePlotter(model)

Bases: RaggiePlotterClass

Raggie plotter for visualizing keys using t-SNE and optional k-means clustering.

This class provides methods to reduce embeddings to 2D space and visualize them with clustering and annotations.

plot(keys: List[str], perplexity: float | None = None, learning_rate: float | str = 'auto', n_iter_without_progress: int = 1000, random_state: int = 42, n_clusters: int = None, show: bool = True, save_path: str | None = None) → None

Perform t-SNE dimensionality reduction and visualize keys with optional k-means clustering.

Parameters:

keys (List[str]) – List of keys to visualize.
perplexity (Optional[float]) – Perplexity parameter for t-SNE.
learning_rate (Union[float, str]) – Learning rate for t-SNE.
n_iter_without_progress (int) – Number of iterations without progress before stopping.
random_state (int) – Random seed for reproducibility.
n_clusters (Optional[int]) – Number of clusters for k-means. Defaults to None.
show (bool) – Whether to display the plot.
save_path (Optional[str]) – Path to save the plot (optional).
