Online Hubs

Dataset Hub

This module provides utilities for loading benchmark datasets from the OmniGenome hub. It handles automatic downloading, configuration loading, and dataset initialization for various genomic benchmarks.

omnigenbench.utility.dataset_hub.dataset_hub.load_benchmark_datasets(benchmark: str, tokenizer: OmniTokenizer | str = None, **kwargs: dict)

This function automatically downloads benchmark datasets if they don’t exist locally, loads their configurations, and initializes train/validation/test datasets with the specified tokenizer.

Parameters:

benchmark (str) – Name or path of the benchmark to load. If the benchmark doesn’t exist locally, it will be downloaded from the hub.
tokenizer (Union[OmniTokenizer, str], optional) – Tokenizer to use for dataset preprocessing. Can be an OmniTokenizer instance or a string identifier for a pre-trained tokenizer. If None, the tokenizer will be loaded from the benchmark configuration.
**kwargs – Additional keyword arguments to override benchmark configuration. These will be passed to the dataset classes and tokenizer initialization.

Returns:

dict – Dictionary keyed by benchmark task name. Each value is itself a dictionary holding that task’s ‘train’, ‘valid’, and ‘test’ datasets.

Raises:

FileNotFoundError – If the benchmark cannot be found or downloaded.
ValueError – If the benchmark configuration is invalid.
ImportError – If required dependencies are not available.

Example

>>> from omnigenbench import OmniSingleNucleotideTokenizer
>>> from omnigenbench.utility.dataset_hub.dataset_hub import load_benchmark_datasets
>>> tokenizer = OmniSingleNucleotideTokenizer.from_pretrained("model_name")
>>> datasets = load_benchmark_datasets("RGB", tokenizer, max_length=512)
>>> print(f"Loaded {len(datasets)} benchmark tasks")
>>> for task_name, task_datasets in datasets.items():
...     print(f"{task_name}: {len(task_datasets['train'])} train samples")

Note

The function automatically handles U/T conversion and other preprocessing based on the benchmark configuration.
If a tokenizer string is provided, it will be loaded with the benchmark’s trust_remote_code setting.
The function supports multiple seeds for robust evaluation.
Long sequences can be dropped or truncated based on configuration.
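
The preprocessing behaviour listed in this note can be pictured with a small stand-alone sketch. It does not call omnigenbench: preprocess_sequences, to_rna, and drop_long are hypothetical names chosen for illustration, and the real function presumably reads the equivalent settings from the benchmark configuration.

```python
from typing import List


def preprocess_sequences(
    sequences: List[str],
    max_length: int,
    to_rna: bool = True,      # hypothetical flag: T -> U when True, U -> T otherwise
    drop_long: bool = False,  # hypothetical flag: drop over-long sequences instead of truncating
) -> List[str]:
    """Illustrative U/T conversion followed by length handling."""
    processed = []
    for seq in sequences:
        seq = seq.upper()
        # Convert between the DNA (T) and RNA (U) alphabets.
        seq = seq.replace("T", "U") if to_rna else seq.replace("U", "T")
        if len(seq) > max_length:
            if drop_long:
                continue  # skip the sequence entirely
            seq = seq[:max_length]  # keep only the first max_length bases
        processed.append(seq)
    return processed


truncated = preprocess_sequences(["ATCGATCG", "acgt"], max_length=6)
dropped = preprocess_sequences(["ATCGATCG", "acgt"], max_length=6, drop_long=True)
print(truncated)
print(dropped)
```

Truncating keeps every sample at the cost of losing the tail of long sequences, while dropping keeps only complete sequences and shrinks the dataset.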

Model Hub

class omnigenbench.utility.model_hub.model_hub.ModelHub(*args, **kwargs)

Bases: object

This class provides a unified interface for loading pre-trained models from the OmniGenome hub or local paths. It supports various model types and handles tokenizer loading and device placement automatically, downloading models from the hub when they are not available locally.

Variables:

metadata (dict) – Environment metadata information

Example

>>> from omnigenbench import ModelHub
>>> hub = ModelHub()
>>> # Load a model from the hub
>>> model, tokenizer = ModelHub.load_model_and_tokenizer("model_name")
>>> # Check available models
>>> models = hub.available_models()
>>> print(list(models.keys()))

available_models(model_name_or_path=None, local_only=False, repo='', **kwargs)

This method queries the OmniGenome hub to retrieve information about available models. It can filter models by name and supports both local and remote queries.

Parameters:

model_name_or_path (str, optional) – Filter models by name. Defaults to None
local_only (bool, optional) – Whether to use only local cache. Defaults to False
repo (str, optional) – Repository URL to query. Defaults to “”
**kwargs – Additional keyword arguments

Returns:

dict – Dictionary containing information about available models

Example

>>> # List all available models
>>> hub = ModelHub()
>>> models = hub.available_models()
>>> print(f"Available models: {len(models)}")
>>> # Filter models by name
>>> dna_models = hub.available_models("DNA")
>>> print(f"DNA models: {list(dna_models.keys())}")

static load(model_name_or_path, local_only=False, device=None, dtype=torch.float16, **kwargs)

This method handles model loading from various sources including local paths and the OmniGenome hub. It automatically downloads models if they’re not available locally.

Parameters:

model_name_or_path (str) – Name or path of the model to load
local_only (bool, optional) – Whether to use only local cache. Defaults to False
device (str, optional) – Device to load the model on. If None, uses auto-detection
dtype (torch.dtype, optional) – Data type for the model. Defaults to torch.float16
**kwargs – Additional keyword arguments passed to the model loading functions

Returns:

torch.nn.Module – The loaded model

Raises:

ValueError – If model_name_or_path is not a string

Example

>>> model = ModelHub.load("yangheng/OmniGenome-186M")
>>> print(f"Model type: {type(model)}")

static load_model_and_tokenizer(model_name_or_path, local_only=False, device=None, dtype=torch.float16, **kwargs)

This method loads both the model and tokenizer, places them on the specified device, and returns them as a tuple. It handles automatic device selection if none is specified.

Parameters:

model_name_or_path (str) – Name or path of the model to load
local_only (bool, optional) – Whether to use only local cache. Defaults to False
device (str, optional) – Device to load the model on. If None, uses auto-detection
dtype (torch.dtype, optional) – Data type for the model. Defaults to torch.float16
**kwargs – Additional keyword arguments passed to the model loading functions

Returns:

tuple – A tuple containing (model, tokenizer)

Example

>>> model, tokenizer = ModelHub.load_model_and_tokenizer("yangheng/OmniGenome-186M")
>>> print(f"Model loaded on device: {next(model.parameters()).device}")

push(model, **kwargs)

Push a model to the hub.

This method is not yet implemented and will raise a NotImplementedError.

Parameters:

model – The model to push to the hub
**kwargs – Additional keyword arguments

Raises:

NotImplementedError – This method has not been implemented yet
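
Because push() currently raises, calling code may want to guard against it. The following is a minimal sketch of that guard using a stand-in class; HubStandIn and try_push are hypothetical names and are not part of omnigenbench.

```python
class HubStandIn:
    """Stand-in with the same unimplemented push() contract as described above."""

    def push(self, model, **kwargs):
        raise NotImplementedError("push is not implemented yet")


def try_push(hub, model) -> bool:
    """Return True if the push succeeded, False if the hub does not support it."""
    try:
        hub.push(model)
    except NotImplementedError:
        return False
    return True


pushed = try_push(HubStandIn(), model=object())
print(pushed)
```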

Pipeline Hub

This module provides the PipelineHub class for managing and loading pre-built pipelines from the OmniGenome hub. Pipelines combine models, tokenizers, datasets, and trainers into ready-to-use workflows.

class omnigenbench.utility.pipeline_hub.pipeline_hub.PipelineHub(*args, **kwargs)

Bases: object

The PipelineHub provides a centralized interface for accessing pre-built pipelines that combine models, tokenizers, datasets, and training configurations. It handles automatic downloading and loading of pipelines from the OmniGenome hub.

Variables:

metadata (dict) – Environment metadata including system information, package versions, and hardware details.

Example

>>> from omnigenbench import PipelineHub
>>> hub = PipelineHub()
>>> pipeline = hub.load("yangheng/OmniGenome-RNA-Classification")
>>> predictions = pipeline("ATCGATCG")
>>> print(predictions['predictions'])

Note

Pipelines can be loaded from local paths or downloaded from the hub
The hub automatically handles model, tokenizer, and dataset loading
Environment metadata is collected for reproducibility

static load(pipeline_name_or_path, local_only=False, **kwargs)

This method loads a complete pipeline including the model, tokenizer, datasets, and trainer configuration. If the pipeline doesn’t exist locally and local_only is False, it will be downloaded from the hub.

Parameters:

pipeline_name_or_path (str) – Name or path of the pipeline to load. Can be a local directory path or a hub identifier.
local_only (bool, optional) – If True, only load from local paths. If False, download from hub if not found locally. Defaults to False.
**kwargs –
Additional keyword arguments passed to the Pipeline constructor. Common options include:
- device: Target device for the model
- trust_remote_code: Whether to trust remote code in tokenizers
- name: Custom name for the pipeline

Returns:

Pipeline – Loaded pipeline instance with model, tokenizer, datasets, and trainer ready for use.

Raises:

FileNotFoundError – If the pipeline cannot be found locally and local_only is True.
ValueError – If the pipeline configuration is invalid.
ImportError – If required dependencies are not available.

Example

>>> hub = PipelineHub()
>>> # Load from hub
>>> pipeline = hub.load("yangheng/OmniGenome-RNA-Classification")
>>> # Load from local path
>>> pipeline = hub.load("./my_pipeline", local_only=True)
>>> # Use pipeline for inference
>>> results = pipeline("ATCGATCG")

Note

The pipeline includes all necessary components for training and inference
Model weights, tokenizer, and datasets are automatically loaded
The pipeline can be used immediately for inference or fine-tuning

push(pipeline, **kwargs)

This method is intended to upload custom pipelines to the OmniGenome hub for sharing and distribution. It is not yet implemented and raises NotImplementedError.

Parameters:

pipeline (Pipeline) – Pipeline instance to upload to the hub.
**kwargs – Additional keyword arguments for the upload process.

Raises:

NotImplementedError – This method has not been implemented yet.

Note

Future implementation will support:
- Pipeline metadata and documentation
- Model weights and configuration
- Tokenizer and dataset specifications
- Training configurations and results

Pipeline

This module provides the Pipeline class for creating and managing complete machine learning workflows that combine models, tokenizers, datasets, and trainers. Pipelines provide a unified interface for training, inference, and model management.

class omnigenbench.utility.pipeline_hub.pipeline.Pipeline(name, *, model_name_or_path, tokenizer=None, datasets=None, trainer=None, **kwargs)

Bases: object

The Pipeline class provides a unified interface for managing complete machine learning workflows. It handles model initialization, training, inference, and persistence. Pipelines can be loaded from pre-built configurations or created from scratch with custom components.

Variables:

model (OmniModel) – The underlying model for the pipeline.
tokenizer – Tokenizer for preprocessing input sequences.
dataset (dict) – Dictionary containing train/validation/test datasets.
metadata (dict) – Environment and pipeline metadata.
trainer (Trainer) – Trainer instance for model training.
device (str) – Target device for model execution (CPU/GPU).
name (str) – Name identifier for the pipeline.

Example

>>> from omnigenbench import Pipeline, OmniModelForSequenceClassification
>>> # `tokenizer` and `datasets` are assumed to have been created beforehand
>>> # Create pipeline from model
>>> model = OmniModelForSequenceClassification("model_path", tokenizer)
>>> pipeline = Pipeline("my_pipeline", model_name_or_path=model)
>>> # Use for inference
>>> predictions = pipeline("ATCGATCG")
>>> # Train the model
>>> pipeline.train(datasets)
>>> # Save pipeline
>>> pipeline.save("./saved_pipeline")

Note

Pipelines automatically handle device placement and model optimization
Environment metadata is collected for reproducibility
Pipelines can be saved and loaded for easy deployment
Supports both local models and hub-based model loading

dataset: dict = None

inference(inputs, **kwargs)

This method provides the complete inference pipeline including preprocessing, model forward pass, and postprocessing. It’s the recommended method for production inference.

Parameters:

inputs –
Input data for inference. Can be:
- str: Single sequence string
- list: List of sequence strings
- tensor: Preprocessed input tensors
**kwargs –
Additional keyword arguments for inference including:
- return_attention: Whether to return attention weights
- return_hidden_states: Whether to return hidden states
- temperature: Temperature for sampling (if applicable)

Returns:

dict –

Complete inference results including:

predictions: Final predictions
confidence: Confidence scores
attention: Attention weights (if requested)
hidden_states: Hidden states (if requested)

Example

>>> pipeline = Pipeline("my_pipeline", model_name_or_path=model)
>>> # Basic inference
>>> results = pipeline.inference("ATCGATCG")
>>> print(results['predictions'])
>>> # Inference with attention
>>> results = pipeline.inference("ATCGATCG", return_attention=True)
>>> print(results['attention'].shape)

Note

This is the most comprehensive inference method
Handles all preprocessing and postprocessing automatically
Returns rich information about the model’s internal states
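
Since attention and hidden_states are only returned when requested, code that consumes the result dictionary should not assume those keys exist. The sketch below works on a hand-written stand-in dictionary rather than real pipeline output; summarize_results is a hypothetical helper.

```python
from typing import Any, Dict


def summarize_results(results: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the core fields and note which optional ones came back."""
    summary = {
        "predictions": results["predictions"],
        "confidence": results.get("confidence"),
    }
    # attention and hidden_states are only present when they were requested.
    summary["has_attention"] = results.get("attention") is not None
    summary["has_hidden_states"] = results.get("hidden_states") is not None
    return summary


# Hand-written stand-in for what pipeline.inference(...) might return.
fake_results = {"predictions": [1], "confidence": [0.9]}
summary = summarize_results(fake_results)
print(summary)
```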

init_pipeline(*, model_name_or_path, tokenizer=None, **kwargs)

This method handles loading the model, tokenizer, and configuration from a model path or identifier. It tries to load from the ModelHub first, then falls back to HuggingFace transformers.

Parameters:

model_name_or_path (str) – Path or identifier of the model to load.
tokenizer (optional) – Tokenizer instance. If None, will be loaded from the model path. Defaults to None.
**kwargs –
Additional keyword arguments for model loading including:
- trust_remote_code (bool): Whether to trust remote code
- device (str): Target device for the model
- Other model-specific parameters

Returns:

Pipeline – Self for method chaining.

Raises:

ValueError – If model loading fails.
ImportError – If required dependencies are not available.

Example

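>>> # model_name_or_path is a required keyword argument of Pipeline
>>> pipeline = Pipeline("my_pipeline", model_name_or_path="yangheng/OmniGenome-186M")
>>> pipeline.init_pipeline(model_name_or_path="yangheng/OmniGenome-186M")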

Note

First attempts to load from OmniGenome ModelHub
Falls back to HuggingFace transformers if ModelHub fails
Automatically handles tokenizer loading and configuration
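
The try-first-then-fall-back order described here can be sketched generically. This is an illustration of the pattern only, not the library's code: load_with_fallback, hub_loader, and transformers_loader are hypothetical stand-ins, and the two loaders simply simulate a failing first attempt and a succeeding second one.

```python
from typing import Any, Callable, Sequence


def load_with_fallback(name: str, loaders: Sequence[Callable[[str], Any]]) -> Any:
    """Try each loader in order and return the first successful result."""
    errors = []
    for loader in loaders:
        try:
            return loader(name)
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            errors.append(f"{getattr(loader, '__name__', repr(loader))}: {exc}")
    raise ValueError(f"Could not load {name!r}; tried: " + "; ".join(errors))


def hub_loader(name: str) -> str:
    # Hypothetical primary loader that fails, standing in for the ModelHub attempt.
    raise FileNotFoundError(f"{name} not found in hub")


def transformers_loader(name: str) -> str:
    # Hypothetical fallback loader, standing in for the HuggingFace transformers attempt.
    return f"transformers:{name}"


loaded = load_with_fallback("yangheng/OmniGenome-186M", [hub_loader, transformers_loader])
print(loaded)
```

Collecting the per-loader errors means the final ValueError explains every attempt rather than only the last one.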

static load(pipeline_name_or_path, local_only=False, **kwargs)

This static method loads a complete pipeline including model, tokenizer, datasets, and trainer from a saved pipeline directory or hub identifier.

Parameters:

pipeline_name_or_path (str) – Path to saved pipeline directory or hub identifier for downloading.
local_only (bool, optional) – If True, only load from local paths. If False, download from hub if not found locally. Defaults to False.
**kwargs –
Additional keyword arguments for pipeline initialization:
- device: Target device for the model
- name: Custom name for the pipeline
- trust_remote_code: Whether to trust remote code

Returns:

Pipeline – Loaded pipeline instance ready for use.

Raises:

FileNotFoundError – If pipeline cannot be found locally and local_only is True.
ValueError – If pipeline files are corrupted or invalid.
ImportError – If required dependencies are not available.

Example

>>> # Load from local path
>>> pipeline = Pipeline.load("./saved_pipeline")
>>> # Load from hub
>>> pipeline = Pipeline.load("yangheng/OmniGenome-RNA-Classification")
>>> # Use loaded pipeline
>>> results = pipeline("ATCGATCG")

Note

Loads all pipeline components (model, tokenizer, datasets, trainer)
Automatically handles device placement
Preserves all training configurations and metadata

metadata: dict = None

model: OmniModel = None

predict(inputs, **kwargs)

This method provides a high-level interface for generating predictions from the pipeline’s model. It handles preprocessing and postprocessing automatically.

Parameters:

inputs –
Input data for prediction. Can be:
- str: Single sequence string
- list: List of sequence strings
- tensor: Preprocessed input tensors
**kwargs – Additional keyword arguments passed to model prediction.

Returns:

dict –

Prediction results including:

predictions: Predicted labels or values
confidence: Confidence scores (if available)
logits: Raw model outputs (if requested)

Example

>>> pipeline = Pipeline("my_pipeline", model_name_or_path=model)
>>> # Single prediction
>>> result = pipeline.predict("ATCGATCG")
>>> print(result['predictions'])
>>> # Batch prediction
>>> results = pipeline.predict(["ATCGATCG", "GCTAGCTA"])
>>> print(results['predictions'])

Note

Input preprocessing is handled automatically
Results are formatted consistently across different model types
Confidence scores are included when available
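
Accepting either a single sequence or a list usually comes down to normalizing the input into a batch first. A minimal sketch of that step, assuming string inputs only (the tensor case is left out), with normalize_inputs as a hypothetical helper name:

```python
from typing import List, Sequence, Union


def normalize_inputs(inputs: Union[str, Sequence[str]]) -> List[str]:
    """Wrap a single sequence in a list so downstream code always sees a batch."""
    if isinstance(inputs, str):
        return [inputs]
    if isinstance(inputs, (list, tuple)) and all(isinstance(s, str) for s in inputs):
        return list(inputs)
    raise TypeError(f"Unsupported input type: {type(inputs).__name__}")


single = normalize_inputs("ATCGATCG")
batch = normalize_inputs(["ATCGATCG", "GCTAGCTA"])
print(single)
print(batch)
```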

save(path, overwrite=False, **kwargs)

This method saves the complete pipeline including model, tokenizer, datasets, trainer, and metadata to a directory. The saved pipeline can be loaded later using Pipeline.load().

Parameters:

path (str) – Directory path where to save the pipeline.
overwrite (bool, optional) – If True, overwrite existing directory. If False, raise error if directory exists. Defaults to False.
**kwargs – Additional keyword arguments for model saving.

Raises:

FileExistsError – If path exists and overwrite is False.
OSError – If there are issues creating the directory or writing files.
RuntimeError – If saving fails due to model or data issues.

Example

>>> pipeline = Pipeline("my_pipeline", model_name_or_path=model)
>>> # Train the pipeline
>>> pipeline.train(datasets)
>>> # Save the trained pipeline
>>> pipeline.save("./trained_pipeline", overwrite=True)
>>> # Load the saved pipeline later
>>> loaded_pipeline = Pipeline.load("./trained_pipeline")

Note

Saves all pipeline components (model, tokenizer, datasets, trainer)
Preserves training configurations and metadata
Model is temporarily moved to CPU during saving to avoid GPU memory issues
Creates a complete, self-contained pipeline directory
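
The overwrite contract (refuse to replace an existing directory unless asked) can be sketched with the standard library alone. save_bundle and the metadata.json file name are assumptions made for this illustration; the actual on-disk layout of a saved pipeline is not specified here.

```python
import json
import os
import shutil
import tempfile


def save_bundle(path: str, payload: dict, overwrite: bool = False) -> str:
    """Write payload to path/metadata.json, refusing to clobber unless asked."""
    if os.path.exists(path):
        if not overwrite:
            raise FileExistsError(f"{path} already exists; pass overwrite=True to replace it")
        shutil.rmtree(path)  # remove the old directory before writing the new one
    os.makedirs(path)
    with open(os.path.join(path, "metadata.json"), "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "trained_pipeline")
    save_bundle(target, {"name": "my_pipeline"})
    try:
        save_bundle(target, {"name": "my_pipeline"})
        second_save_blocked = False
    except FileExistsError:
        second_save_blocked = True
    save_bundle(target, {"name": "my_pipeline_v2"}, overwrite=True)
    with open(os.path.join(target, "metadata.json"), encoding="utf-8") as fh:
        saved_name = json.load(fh)["name"]

print(second_save_blocked, saved_name)
```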

to(device)

Move the pipeline to a specific device.

Parameters:

device (str) – Target device (‘cpu’, ‘cuda’, ‘cuda:0’, etc.).

Returns:

Pipeline – Self for method chaining.

Example

>>> pipeline = Pipeline("my_pipeline", model_name_or_path=model)
>>> pipeline.to("cuda:0")  # Move to GPU
>>> pipeline.to("cpu")     # Move to CPU
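
Returning self is what allows calls to be chained. The stand-in class below only records the requested device and does not move any model; ChainableStandIn is a hypothetical name used to show the pattern.

```python
class ChainableStandIn:
    """Minimal stand-in showing the return-self pattern behind method chaining."""

    def __init__(self) -> None:
        self.device = "cpu"
        self.calls = []

    def to(self, device: str) -> "ChainableStandIn":
        self.device = device        # remember the most recent target
        self.calls.append(device)   # keep a history for inspection
        return self                 # returning self makes chaining possible


obj = ChainableStandIn()
same = obj.to("cuda:0").to("cpu")
print(same is obj, obj.device)
```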

tokenizer = None

train(datasets: dict = None, trainer=None, **kwargs)

This method initiates training of the model using the provided datasets and trainer configuration. If no trainer is provided, the pipeline’s existing trainer will be used.

Parameters:

datasets (dict, optional) – Dictionary containing train/validation/test datasets. If None, uses the pipeline’s existing datasets. Keys should be ‘train’, ‘valid’, ‘test’. Defaults to None.
trainer (Trainer, optional) – Trainer instance to use for training. If None, uses the pipeline’s existing trainer. Defaults to None.
**kwargs – Additional keyword arguments passed to the trainer.

Raises:

ValueError – If no trainer is available or datasets are invalid.
RuntimeError – If training fails.

Example

>>> pipeline = Pipeline("my_pipeline", model_name_or_path=model)
>>> # Train with existing datasets
>>> pipeline.train()
>>> # Train with custom datasets
>>> custom_datasets = {'train': train_data, 'valid': valid_data}
>>> pipeline.train(datasets=custom_datasets)
>>> # Train with custom trainer
>>> from omnigenbench import Trainer
>>> custom_trainer = Trainer(model, train_dataset=train_data)
>>> pipeline.train(trainer=custom_trainer)

Note

Training uses the pipeline’s current model and device
Progress and metrics are logged during training
The trained model is automatically saved in the pipeline
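
The fallback from explicit arguments to the pipeline's stored datasets and trainer can be sketched as follows. resolve_training_inputs is a hypothetical helper, and the rule that datasets must contain a 'train' split is an assumption made for this sketch; the library's own validation may differ.

```python
from typing import Any, Optional, Tuple


def resolve_training_inputs(
    stored_datasets: Optional[dict],
    stored_trainer: Any,
    datasets: Optional[dict] = None,
    trainer: Any = None,
) -> Tuple[dict, Any]:
    """Prefer explicitly passed arguments, otherwise fall back to stored ones."""
    chosen_datasets = datasets if datasets is not None else stored_datasets
    chosen_trainer = trainer if trainer is not None else stored_trainer
    if chosen_trainer is None:
        raise ValueError("No trainer available")
    # Assumed validity rule for this sketch: a 'train' split must be present.
    if not chosen_datasets or "train" not in chosen_datasets:
        raise ValueError("Datasets must contain at least a 'train' split")
    return chosen_datasets, chosen_trainer


stored = {"train": ["AUCG"], "valid": ["GCUA"]}
ds, tr = resolve_training_inputs(stored, "stored-trainer")
ds2, tr2 = resolve_training_inputs(
    stored, "stored-trainer", datasets={"train": ["AAAA"]}, trainer="custom-trainer"
)
print(tr, tr2)
```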

Hub Utilities

omnigenbench.utility.hub_utils.check_version(repo: str = None) → None

Checks version compatibility between the local OmniGenome installation and the remote hub.

Parameters:

repo (str, optional) – The repository URL to check. If None, uses the default hub.

Example

>>> check_version()  # Check version compatibility

omnigenbench.utility.hub_utils.download_benchmark(benchmark_name_or_path: str, local_only: bool = False, repo: str = None, cache_dir=None) → str

Downloads a benchmark from a given URL. It supports both remote and local-only modes.

Parameters:

benchmark_name_or_path (str) – The name or path of the benchmark to download.
local_only (bool) – If True, use only the local cache and do not download the benchmark. Defaults to False.
repo (str, optional) – The URL of the repository to download the benchmark from.
cache_dir (str, optional) – The directory to cache the downloaded benchmark. If None, uses “__OMNIGENOME_DATA__/benchmarks/”.

Returns:

str – A string representing the path to the downloaded benchmark.

Raises:

ConnectionError – If the benchmark download fails.
ValueError – If the benchmark is not found in the repository.

Example

>>> # Download a benchmark
>>> benchmark_path = download_benchmark("RGB")
>>> print(benchmark_path)  # Path to the downloaded benchmark
>>> # Download with custom cache directory
>>> benchmark_path = download_benchmark("RGB", cache_dir="./benchmarks")
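
How local_only and cache_dir might interact can be shown with a stand-alone sketch. It performs no network access: resolve_resource and fake_fetch are hypothetical names, the downloader is a stub that only creates a directory, and raising FileNotFoundError in offline mode is a choice made for this illustration.

```python
import os
import tempfile
from typing import Callable


def resolve_resource(
    name_or_path: str,
    cache_dir: str,
    fetch: Callable[[str, str], None],  # hypothetical downloader: fetch(name, destination)
    local_only: bool = False,
) -> str:
    """Return a local path for name_or_path, downloading only when allowed."""
    if os.path.exists(name_or_path):
        return name_or_path  # already a usable local path
    cached = os.path.join(cache_dir, name_or_path)
    if os.path.exists(cached):
        return cached  # previously downloaded
    if local_only:
        raise FileNotFoundError(f"{name_or_path} is not cached and local_only=True")
    fetch(name_or_path, cached)
    return cached


def fake_fetch(name: str, destination: str) -> None:
    # Stand-in for a real download: just create the destination directory.
    os.makedirs(destination)


with tempfile.TemporaryDirectory() as cache:
    try:
        resolve_resource("demo-benchmark", cache, fake_fetch, local_only=True)
        offline_failed = False
    except FileNotFoundError:
        offline_failed = True
    first = resolve_resource("demo-benchmark", cache, fake_fetch)
    second = resolve_resource("demo-benchmark", cache, fake_fetch, local_only=True)
    same_path = first == second

print(offline_failed, same_path)
```

Once the resource has been fetched, the offline call succeeds because the cached copy is found before the local_only check is reached.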

omnigenbench.utility.hub_utils.download_model(model_name_or_path: str, local_only: bool = False, repo: str = None, cache_dir=None) → str

Downloads a model from a given URL. It supports both remote and local-only modes.

Parameters:

model_name_or_path (str) – The name or path of the model to download.
local_only (bool) – If True, use only the local cache and do not download the model. Defaults to False.
repo (str, optional) – The URL of the repository to download the model from.
cache_dir (str, optional) – The directory to cache the downloaded model. If None, uses “__OMNIGENOME_DATA__/models/”.

Returns:

str – A string representing the path to the downloaded model.

Raises:

ConnectionError – If the model download fails.
ValueError – If the model is not found in the repository.

Example

>>> # Download a model
>>> model_path = download_model("DNABERT-2")
>>> print(model_path)  # Path to the downloaded model
>>> # Download with custom cache directory
>>> model_path = download_model("DNABERT-2", cache_dir="./models")

omnigenbench.utility.hub_utils.download_pipeline(pipeline_name_or_path: str, local_only: bool = False, repo: str = None, cache_dir=None) → str

Downloads a pipeline from a given URL. It supports both remote and local-only modes.

Parameters:

pipeline_name_or_path (str) – The name or path of the pipeline to download.
local_only (bool) – If True, use only the local cache and do not download the pipeline. Defaults to False.
repo (str, optional) – The URL of the repository to download the pipeline from.
cache_dir (str, optional) – The directory to cache the downloaded pipeline. If None, uses “__OMNIGENOME_DATA__/pipelines/”.

Returns:

str – A string representing the path to the downloaded pipeline.

Raises:

ConnectionError – If the pipeline download fails.
ValueError – If the pipeline is not found in the repository.

Example

>>> # Download a pipeline
>>> pipeline_path = download_pipeline("classification_pipeline")
>>> print(pipeline_path)  # Path to the downloaded pipeline

omnigenbench.utility.hub_utils.query_benchmarks_info(keyword: list | str, repo: str = None, local_only: bool = False, **kwargs) → Dict[str, Any]

This function retrieves benchmark information from the OmniGenome hub, either from a remote repository or from a local cache. It supports filtering by keywords to find specific benchmarks.

Parameters:

keyword (Union[list, str]) – A keyword or list of keywords to filter benchmarks.
repo (str, optional) – The repository URL to query. If None, uses the default hub.
local_only (bool) – Whether to use only local cache. Defaults to False.
**kwargs – Additional keyword arguments.

Returns:

Dict[str, Any] – A dictionary containing benchmark information filtered by the keyword.

Example

>>> # Query all benchmarks
>>> benchmarks = query_benchmarks_info("")
>>> print(len(benchmarks))  # Number of available benchmarks
>>> # Query specific benchmarks
>>> benchmarks = query_benchmarks_info("RGB")
>>> print(benchmarks.keys())  # Benchmarks containing "RGB"
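
Filtering a catalogue by one keyword or a list of keywords could look like the sketch below. The matching rule (case-insensitive substring match on the entry name, with every keyword required) is an assumption for illustration, as are filter_by_keyword and the made-up catalogue entries; the hub's actual matching logic may differ.

```python
from typing import Any, Dict, List, Union


def filter_by_keyword(
    entries: Dict[str, Any], keyword: Union[str, List[str]]
) -> Dict[str, Any]:
    """Keep entries whose name contains every keyword (case-insensitive)."""
    keywords = [keyword] if isinstance(keyword, str) else list(keyword)
    lowered = [k.lower() for k in keywords]
    return {
        name: info
        for name, info in entries.items()
        if all(k in name.lower() for k in lowered)
    }


# Made-up catalogue entries, used only to exercise the filter.
catalogue = {
    "demo-rna-bench": {"tasks": 1},
    "demo-dna-bench": {"tasks": 2},
    "other-set": {"tasks": 3},
}
everything = filter_by_keyword(catalogue, "")
rna_only = filter_by_keyword(catalogue, ["demo", "RNA"])
print(len(everything), list(rna_only))
```

An empty keyword matches every name, which mirrors how the examples above pass "" to list everything.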

omnigenbench.utility.hub_utils.query_models_info(keyword: list | str, repo: str = None, local_only: bool = False, **kwargs) → Dict[str, Any]

This function retrieves model information from the OmniGenome hub, either from a remote repository or from a local cache. It supports filtering by keywords to find specific models.

Parameters:

keyword (Union[list, str]) – A keyword or list of keywords to filter models.
repo (str, optional) – The repository URL to query. If None, uses the default hub.
local_only (bool) – Whether to use only local cache. Defaults to False.
**kwargs – Additional keyword arguments.

Returns:

Dict[str, Any] – A dictionary containing model information filtered by the keyword.

Example

>>> # Query all models
>>> models = query_models_info("")
>>> print(len(models))  # Number of available models
>>> # Query specific models
>>> models = query_models_info("DNA")
>>> print(models.keys())  # Models containing "DNA"

omnigenbench.utility.hub_utils.query_pipelines_info(keyword: list | str, repo: str = None, local_only: bool = False, **kwargs) → Dict[str, Any]

This function retrieves pipeline information from the OmniGenome hub, either from a remote repository or from a local cache. It supports filtering by keywords to find specific pipelines.

Parameters:

keyword (Union[list, str]) – A keyword or list of keywords to filter pipelines.
repo (str, optional) – The repository URL to query. If None, uses the default hub.
local_only (bool) – Whether to use only local cache. Defaults to False.
**kwargs – Additional keyword arguments.

Returns:

Dict[str, Any] – A dictionary containing pipeline information filtered by the keyword.

Example

>>> # Query all pipelines
>>> pipelines = query_pipelines_info("")
>>> print(len(pipelines))  # Number of available pipelines
>>> # Query specific pipelines
>>> pipelines = query_pipelines_info("classification")
>>> print(pipelines.keys())  # Pipelines containing "classification"

omnigenbench.utility.hub_utils.unzip_checkpoint(checkpoint_path)

This function extracts a zipped checkpoint file to a directory, making it ready for use by the model loading functions.

Parameters:

checkpoint_path (str) – The path to the checkpoint file.

Returns:

str – The path to the extracted checkpoint directory.

Example

>>> extracted_path = unzip_checkpoint("model.zip")
>>> print(extracted_path)  # "model"