raggie package
Submodules
raggie.data module
- class raggie.data.RaggieData(file_path: str)
Bases:
RaggieDataClass
Abstract base class for Raggie data handling.
This class defines the interface for iterating over paired data. Users can implement their own data handling logic by extending this class.
- property data: List[Tuple[str, str]]
Get the paired data.
- Returns:
A list of tuples containing paired data.
- Return type:
List[Tuple[str, str]]
- class raggie.data.RaggieDataLoader(data_dir: str)
Bases:
RaggieDataLoaderClass
Default implementation of RaggieDataLoaderClass.
This class provides methods to load and manage training, testing, and validation data from a specified directory. It supports JSONL, JSON, and CSV file formats.
- find_data_files() → List[str]
Find and return a list of data files in the specified directory.
- Returns:
A list of file names found in the directory.
- Return type:
List[str]
- label_data_files() → Dict[str, str]
Label data files in the directory for training, testing, and validation.
- Returns:
A dictionary mapping data types to file paths.
- Return type:
Dict[str, str]
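To illustrate the kind of mapping label_data_files returns, here is a minimal sketch that labels files by looking for a split name in the file name. The matching rule, the helper name `label_files`, and the example file names are assumptions for illustration only; the documentation does not state how RaggieDataLoader decides which file belongs to which split.

```python
from pathlib import Path
from typing import Dict, List

# File extensions the documentation lists as supported.
SUPPORTED_SUFFIXES = {".jsonl", ".json", ".csv"}


def label_files(file_names: List[str]) -> Dict[str, str]:
    """Hypothetical helper: map 'train', 'test' and 'val' to a matching file name."""
    labels: Dict[str, str] = {}
    for name in sorted(file_names):
        path = Path(name)
        if path.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue  # ignore files in unsupported formats
        stem = path.stem.lower()
        for split in ("train", "test", "val"):
            # Assumed rule: the split name appears somewhere in the file stem.
            if split in stem and split not in labels:
                labels[split] = name
    return labels


labels = label_files(["train.jsonl", "test.csv", "validation.json", "notes.txt"])
print(labels)
```

With this assumed rule, `notes.txt` is ignored and each of the three splits maps to one file.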
- property test: RaggieDataClass
Load and return the testing data.
- Returns:
An instance of RaggieDataClass containing the testing data.
- Return type:
RaggieDataClass
- Raises:
ValueError – If no testing data file is found.
- property train: RaggieDataClass
Load and return the training data.
- Returns:
An instance of RaggieDataClass containing the training data.
- Return type:
RaggieDataClass
- Raises:
ValueError – If no training data file is found.
- property val: RaggieDataClass
Load and return the validation data.
- Returns:
An instance of RaggieDataClass containing the validation data.
- Return type:
RaggieDataClass
- Raises:
ValueError – If no validation data file is found.
raggie.main module
- class raggie.main.Raggie(model: RaggieModelClass, data: RaggieDataClass)
Bases:
object
Raggie is a retriever that uses a model to find similar keys and values based on embeddings.
- evaluate_rank(value: str, ground_truth: str, top_k: int = 10) → int
Evaluate the rank of a key based on its ground truth value.
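The documentation does not spell out how the rank is computed, so the following is only a sketch of one plausible reading: rank all keys by cosine similarity to a query embedding, keep the top_k, and report the position of the ground-truth key. The helper names `cosine` and `rank_of`, the 1-based rank, the -1 return for a miss, and the toy vectors are all assumptions, not Raggie's actual behaviour.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_of(
    query_vec: Sequence[float],
    key_vecs: Dict[str, Sequence[float]],
    ground_truth: str,
    top_k: int = 10,
) -> int:
    """Hypothetical: 1-based rank of ground_truth among the top_k keys, or -1 if absent."""
    ranked = sorted(
        key_vecs, key=lambda k: cosine(query_vec, key_vecs[k]), reverse=True
    )[:top_k]
    return ranked.index(ground_truth) + 1 if ground_truth in ranked else -1


# Toy embeddings standing in for model output.
keys = {"apple": [1.0, 0.0], "banana": [0.6, 0.8], "cherry": [0.0, 1.0]}
query = [0.9, 0.1]

print(rank_of(query, keys, "banana"))           # 2
print(rank_of(query, keys, "cherry", top_k=2))  # -1
```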
raggie.model module
- class raggie.model.RaggieModel(model_path: str = None, base_model_name: str = None, output_dir: str = None)
Bases:
RaggieModelClass
Default implementation of the RaggieModelClass.
This class provides methods for training, saving, and predicting embeddings using a Sentence Transformer model.
- predict(docs: List[str]) → ndarray
Generate embeddings for the given documents.
- Parameters:
docs (List[str]) – List of documents to encode.
- Returns:
Array of embeddings.
- Return type:
np.ndarray
- train(train_data: RaggieDataClass, val_data: RaggieDataClass | None = None, epochs: int = 5) → None
Train the model using the provided data.
- Parameters:
train_data (RaggieDataClass) – The data handler providing training data.
val_data (Optional[RaggieDataClass]) – The data handler providing validation data.
epochs (int) – Number of training epochs.
raggie.types module
- class raggie.types.RaggieDataClass
Bases:
ABC
Abstract base class for Raggie data handling.
This class defines the interface for iterating over paired data. Users can implement their own data handling logic by extending this class.
- abstract property data: List[Tuple[str, str]]
Get the paired data.
- Returns:
A list of tuples containing paired data.
- Return type:
List[Tuple[str, str]]
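Extending this class might look like the sketch below. To keep the example self-contained, `RaggieDataClass` is re-declared locally as a stand-in that mirrors the documented interface rather than imported from raggie; the subclass `InMemoryData` and its sample pairs are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class RaggieDataClass(ABC):
    """Local stand-in mirroring the documented interface (not imported from raggie)."""

    @property
    @abstractmethod
    def data(self) -> List[Tuple[str, str]]:
        ...


class InMemoryData(RaggieDataClass):
    """Hypothetical subclass that serves pairs already held in memory."""

    def __init__(self, pairs: List[Tuple[str, str]]) -> None:
        self._pairs = list(pairs)

    @property
    def data(self) -> List[Tuple[str, str]]:
        return self._pairs


pairs = InMemoryData([("capital of France", "Paris"), ("capital of Japan", "Tokyo")])
for key, value in pairs.data:
    print(key, "->", value)
```

Because `data` is abstract, a subclass that does not override it cannot be instantiated.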
- class raggie.types.RaggieDataLoaderClass
Bases:
ABC
Abstract base class for Raggie data handling.
This class defines the interface for loading and managing training, testing, and validation data. Users can implement their own data handling logic by extending this class.
- abstract property test: RaggieDataClass
Load and return the testing data.
- Returns:
An instance of RaggieDataClass containing the testing data.
- Return type:
RaggieDataClass
- abstract property train: RaggieDataClass
Load and return the training data.
- Returns:
An instance of RaggieDataClass containing the training data.
- Return type:
RaggieDataClass
- abstract property val: RaggieDataClass
Load and return the validation data.
- Returns:
An instance of RaggieDataClass containing the validation data.
- Return type:
RaggieDataClass
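A custom loader built on this interface could be sketched as follows. The abstract class is re-declared locally as a stand-in so the example runs without raggie installed; `PairData`, `DictLoader`, and the choice to raise ValueError for a missing split (borrowed from the behaviour documented for RaggieDataLoader) are assumptions for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

Pairs = List[Tuple[str, str]]


class PairData:
    """Hypothetical minimal data holder exposing the documented `data` property."""

    def __init__(self, pairs: Pairs) -> None:
        self._pairs = list(pairs)

    @property
    def data(self) -> Pairs:
        return self._pairs


class RaggieDataLoaderClass(ABC):
    """Local stand-in mirroring the documented interface (not imported from raggie)."""

    @property
    @abstractmethod
    def train(self) -> PairData: ...

    @property
    @abstractmethod
    def test(self) -> PairData: ...

    @property
    @abstractmethod
    def val(self) -> PairData: ...


class DictLoader(RaggieDataLoaderClass):
    """Hypothetical loader that serves splits from an in-memory dict."""

    def __init__(self, splits: Dict[str, Pairs]) -> None:
        self._splits = splits

    def _load(self, name: str) -> PairData:
        if name not in self._splits:
            # Assumption: mirror the ValueError documented for RaggieDataLoader.
            raise ValueError(f"No {name} data found.")
        return PairData(self._splits[name])

    @property
    def train(self) -> PairData:
        return self._load("train")

    @property
    def test(self) -> PairData:
        return self._load("test")

    @property
    def val(self) -> PairData:
        return self._load("val")


loader = DictLoader({"train": [("query", "answer")]})
print(loader.train.data)
```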
- class raggie.types.RaggieModelClass
Bases:
ABC
Abstract base class for Raggie models.
This class defines the interface for training, saving, and predicting embeddings using a machine learning model.
- abstractmethod predict(docs: List[str]) → ndarray
Generate embeddings for the given documents.
- Parameters:
docs (List[str]) – List of documents to encode.
- Returns:
Array of embeddings.
- Return type:
numpy.ndarray
- abstractmethod train(data: RaggieDataClass, *args, **kwargs) → None
Train the model using the provided data.
- Parameters:
data (RaggieDataClass) – The data handler providing training and validation data.
*args – Additional arguments for training.
**kwargs – Additional keyword arguments for training.
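As a rough sketch of what implementing this interface involves, the toy model below produces hashed bag-of-words vectors and learns nothing. It assumes NumPy is installed; `RaggieModelClass` is re-declared locally as a stand-in, and `HashingModel`, its `dim` parameter, and the hashing scheme are hypothetical and unrelated to the Sentence Transformer model used by RaggieModel.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import List

import numpy as np


class RaggieModelClass(ABC):
    """Local stand-in mirroring the documented interface (not imported from raggie)."""

    @abstractmethod
    def train(self, data, *args, **kwargs) -> None: ...

    @abstractmethod
    def predict(self, docs: List[str]) -> np.ndarray: ...


class HashingModel(RaggieModelClass):
    """Hypothetical toy model: hashed bag-of-words vectors, no learning."""

    def __init__(self, dim: int = 16) -> None:
        self.dim = dim

    def train(self, data, *args, **kwargs) -> None:
        # Nothing to fit here; a real model would update its weights.
        return None

    def predict(self, docs: List[str]) -> np.ndarray:
        vectors = np.zeros((len(docs), self.dim), dtype=np.float64)
        for row, doc in enumerate(docs):
            for token in doc.lower().split():
                # sha256 gives a hash that is stable across Python processes.
                digest = hashlib.sha256(token.encode("utf-8")).digest()
                vectors[row, digest[0] % self.dim] += 1.0
        # L2-normalise non-empty rows so dot products act as cosine similarities.
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        return vectors / np.where(norms == 0, 1.0, norms)


model = HashingModel(dim=16)
embeddings = model.predict(["red apple", "green apple", ""])
print(embeddings.shape)  # (3, 16)
```

Each row of the returned array is the embedding of the document at the same index, which is the shape of contract the predict method describes.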
- class raggie.types.RaggiePlotterClass
Bases:
ABC
Abstract base class for Raggie plotters.
This class defines the interface for visualizing embeddings using t-SNE and optional clustering.
- abstractmethod plot(keys: List[str], perplexity: float | None = None, learning_rate: float | str = 'auto', n_iter_without_progress: int = 1000, random_state: int = 42, n_clusters: int = None, show: bool = True, save_path: str | None = None) → None
Plot the t-SNE visualization of keys with optional clustering.
- Parameters:
keys (List[str]) – List of keys to visualize.
perplexity (Optional[float]) – Perplexity parameter for t-SNE.
learning_rate (Union[float, str]) – Learning rate for t-SNE.
n_iter_without_progress (int) – Number of iterations without progress before stopping.
random_state (int) – Random seed for reproducibility.
n_clusters (int) – Number of clusters for k-means (optional).
show (bool) – Whether to display the plot.
save_path (Optional[str]) – Path to save the plot (optional).
raggie.utils module
- class raggie.utils.RaggiePlotter(model)
Bases:
RaggiePlotterClass
Raggie plotter for visualizing keys using t-SNE and optional k-means clustering.
This class provides methods to reduce embeddings to 2D space and visualize them with clustering and annotations.
- plot(keys: List[str], perplexity: float | None = None, learning_rate: float | str = 'auto', n_iter_without_progress: int = 1000, random_state: int = 42, n_clusters: int = None, show: bool = True, save_path: str | None = None) → None
Perform t-SNE dimensionality reduction and visualize keys with optional k-means clustering.
- Parameters:
keys (List[str]) – List of keys to visualize.
perplexity (Optional[float]) – Perplexity parameter for t-SNE.
learning_rate (Union[float, str]) – Learning rate for t-SNE.
n_iter_without_progress (int) – Number of iterations without progress before stopping.
random_state (int) – Random seed for reproducibility.
n_clusters (int) – Number of clusters for k-means (optional).
show (bool) – Whether to display the plot.
save_path (Optional[str]) – Path to save the plot (optional).