Package Design Principles

OmniGenBench is a unified, extensible framework for genomic foundation models. Its design centers on abstraction, modularity, and interoperability, so that you can build, extend, and integrate genomic pipelines with minimal friction.

This guide explores the core architecture, the main abstract classes, and the patterns you can follow to extend the library for your own needs.

Core Philosophy

The entire framework is built upon a set of abstract base classes (ABCs). These classes define a clear “contract” or interface for every major component. This approach provides several key advantages:

Consistency

All components of the same type (e.g., all models) share the same interface, making them predictable and reducing bugs.

Extensibility

Adding new functionality is as simple as subclassing an existing abstract class and implementing the required methods.

Interoperability

Because all components adhere to a standard interface, they can be easily swapped or combined, like LEGO bricks.

Maintainability

A clear and consistent structure makes the codebase easier to understand, debug, and maintain over time.
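
The "contract" idea can be illustrated with Python's standard abc module. This is only a minimal sketch: BaseMetric, Accuracy, ErrorRate, Incomplete, and evaluate are hypothetical names invented for the illustration, not OmniGenBench classes. It shows that two implementations of one interface are interchangeable, and that a subclass which skips a required method cannot be instantiated.

```python
from abc import ABC, abstractmethod


class BaseMetric(ABC):
    """Hypothetical stand-in for an abstract component (not an OmniGenBench class)."""

    @abstractmethod
    def compute_metric(self, y_true, y_pred):
        """Return a dict mapping a metric name to its value."""


class Accuracy(BaseMetric):
    def compute_metric(self, y_true, y_pred):
        correct = sum(t == p for t, p in zip(y_true, y_pred))
        return {"accuracy": correct / len(y_true)}


class ErrorRate(BaseMetric):
    def compute_metric(self, y_true, y_pred):
        wrong = sum(t != p for t, p in zip(y_true, y_pred))
        return {"error_rate": wrong / len(y_true)}


class Incomplete(BaseMetric):
    """Does not implement compute_metric, so it cannot be instantiated."""


def evaluate(metrics, y_true, y_pred):
    """Accepts any BaseMetric subclass because they all share one interface."""
    results = {}
    for metric in metrics:
        results.update(metric.compute_metric(y_true, y_pred))
    return results


report = evaluate([Accuracy(), ErrorRate()], [0, 1, 1, 0], [0, 1, 0, 0])
print(report)  # {'accuracy': 0.75, 'error_rate': 0.25}

try:
    Incomplete()
except TypeError as exc:
    print(f"Contract enforced: {exc}")
```

The evaluate function never checks which concrete class it receives, which is the property the Interoperability point describes.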

The Core Components

OmniGenBench is built around four fundamental abstract classes: OmniModel, OmniDataset, OmniTokenizer, and OmniMetric. Understanding these is key to mastering the library.

Abstract Model (OmniModel)

The OmniModel class is the foundation for all models, providing a unified interface for initialization, forward passes, and inference.

Key Features:

Flexible initialization from pre-trained weights, configs, or PyTorch modules.
Automatic loss computation for various task types.
Standardized predict() and inference() methods.
Built-in support for saving and loading.

Core Methods:

__init__(config_or_model, tokenizer, **kwargs)
forward(**inputs)
predict(sequence)
save_model(path) / load_model(path)

Usage Example:

from omnigenbench import OmniModelForSequenceClassification

model = OmniModelForSequenceClassification("model_path", tokenizer)
# Training: forward pass with labels
outputs = model(input_ids=..., attention_mask=..., labels=...)
loss = outputs.loss
# Inference
predictions = model.predict("ACGU...")
print(predictions)

Abstract Dataset (OmniDataset)

The OmniDataset class standardizes data handling, supporting various file formats and integrating seamlessly with tokenizers and PyTorch DataLoaders.

Key Features:

Handles multiple data formats (JSON, CSV, Parquet, TXT).
Integrates tokenization directly into the data loading pipeline.
Automatic mapping between string labels and integer indices.
Built-in data validation and flexible configuration.

Core Methods:

__init__(data_path, tokenizer, **kwargs)
__getitem__(index) & __len__()
get_labels()
get_label_mapping()

Usage Example:

from omnigenbench import OmniDatasetForSequenceClassification

dataset = OmniDatasetForSequenceClassification("data.json", tokenizer, max_length=512)
# Access a sample
sample = dataset[0]
print(sample['input_ids'].shape) # torch.Size([512])
# Get dataset info
print(f"Dataset size: {len(dataset)}")
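
The label mapping mentioned under Key Features can be pictured with a few lines of plain Python. This is a sketch of the general technique only: build_label_mapping is a hypothetical helper, and the sorted ordering is an assumption made here so the mapping is deterministic. It is not a description of how get_label_mapping() is implemented.

```python
def build_label_mapping(labels):
    """Map each distinct string label to an integer index, in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


raw_labels = ["promoter", "enhancer", "promoter", "silencer"]
label2id = build_label_mapping(raw_labels)
# The inverse mapping turns predicted indices back into readable labels.
id2label = {index: label for label, index in label2id.items()}

encoded = [label2id[label] for label in raw_labels]
print(label2id)  # {'enhancer': 0, 'promoter': 1, 'silencer': 2}
print(encoded)   # [1, 0, 1, 2]
```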

Abstract Tokenizer (OmniTokenizer)

The OmniTokenizer class provides a consistent wrapper for various tokenization strategies, from simple k-mers to complex pre-trained tokenizers.

Key Features:

Consistent API regardless of the underlying tokenization logic.
Automatic handling of special tokens (BOS, EOS, PAD).
Built-in preprocessing options (e.g., U-to-T conversion).
Easy integration with custom tokenization logic.

Core Methods:

__init__(base_tokenizer, **kwargs)
tokenize(sequence, **kwargs)
encode(sequence, **kwargs) & decode(token_ids, **kwargs)
from_pretrained(model_name)

Usage Example:

from omnigenbench import OmniSingleNucleotideTokenizer

tokenizer = OmniSingleNucleotideTokenizer.from_pretrained("model_name")
# Tokenize a sequence
inputs = tokenizer("ATCG", max_length=128, padding=True)
print(inputs['input_ids'].shape)
# Decode back to string
decoded = tokenizer.decode(inputs['input_ids'][0])
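
To make the special-token handling and U-to-T preprocessing concrete, here is a minimal standalone sketch of single-nucleotide encoding. Everything in it is assumed for illustration: the VOCAB table, the token names, and the encode and decode helpers are hypothetical, and a real pre-trained tokenizer defines its own vocabulary and padding behavior.

```python
# Hypothetical vocabulary; real tokenizers ship their own.
VOCAB = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "<unk>": 3, "A": 4, "C": 5, "G": 6, "T": 7}
ID_TO_TOKEN = {index: token for token, index in VOCAB.items()}


def encode(sequence, max_length, u2t=True):
    """Encode one nucleotide per token, add BOS/EOS, then truncate or pad.

    Assumes max_length >= 2 so that BOS and EOS always fit.
    """
    sequence = sequence.upper()
    if u2t:
        sequence = sequence.replace("U", "T")  # treat RNA input as DNA
    body = [VOCAB.get(base, VOCAB["<unk>"]) for base in sequence]
    body = body[: max_length - 2]  # leave room for BOS and EOS
    input_ids = [VOCAB["<bos>"]] + body + [VOCAB["<eos>"]]
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [VOCAB["<pad>"]] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }


def decode(input_ids):
    """Drop BOS, EOS, and PAD, then join the remaining tokens."""
    special = {VOCAB["<pad>"], VOCAB["<bos>"], VOCAB["<eos>"]}
    return "".join(ID_TO_TOKEN[i] for i in input_ids if i not in special)


inputs = encode("AUCG", max_length=8)
print(inputs["input_ids"])          # [1, 4, 7, 5, 6, 2, 0, 0]
print(inputs["attention_mask"])     # [1, 1, 1, 1, 1, 1, 0, 0]
print(decode(inputs["input_ids"]))  # ATCG
```

Note that decoding returns ATCG rather than the original AUCG, because the U-to-T conversion is not reversible.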

Abstract Metric (OmniMetric)

The OmniMetric class standardizes evaluation, building on scikit-learn's metrics behind a consistent interface.

Key Features:

Integration with scikit-learn’s metric collection.
Ignored labels (e.g., -100, the default ignore_index of PyTorch's CrossEntropyLoss) are excluded before scoring.
Standardized result dictionary format.
Support for classification, regression, and ranking metrics.

Core Methods:

__init__(ignore_y=None, **kwargs)
compute_metric(y_true, y_pred, **kwargs)
get_metric_name()

Usage Example:

from omnigenbench import ClassificationMetric

metric = ClassificationMetric(ignore_y=-100)
y_true = [0, 1, -100, 1]
y_pred = [0, 1, 0, 0]
results = metric.compute_metric(y_true, y_pred)
print(results) # {'accuracy_score': 0.667, ...}  (the -100 position is skipped, so 2 of 3 match)
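
The effect of ignore_y on the numbers above can be checked by hand. The following is a plain-Python sketch of the masking step, assuming that positions whose true label equals ignore_y are dropped before scoring. masked_accuracy is a hypothetical helper, not the library's implementation.

```python
def masked_accuracy(y_true, y_pred, ignore_y=None):
    """Accuracy over the positions whose true label is not ignore_y."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != ignore_y]
    if not pairs:
        raise ValueError("No labels left to score after filtering ignore_y.")
    correct = sum(t == p for t, p in pairs)
    return {"accuracy_score": correct / len(pairs)}


results = masked_accuracy([0, 1, -100, 1], [0, 1, 0, 0], ignore_y=-100)
print(results)  # accuracy_score is 2/3: two of the three scored positions match
```

Without the mask, the third position would count as a wrong prediction and the score would drop to 2/4.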

Extending OmniGenBench: A How-To

OmniGenBench is designed to be extended. To add a custom component, inherit from one of the core abstract classes and implement the required methods.

Below are implementation patterns for each component type.

Custom Model

Inherit from OmniModel and override the forward method to add your custom layers or logic.

from omnigenbench import OmniModel
import torch

class CustomModel(OmniModel):
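    def __init__(self, config, tokenizer, **kwargs):
        super().__init__(config, tokenizer, **kwargs)
        # Task-specific head; fill in the input and output sizes for your task.
        self.classifier = torch.nn.Linear(...)

    def forward(self, **inputs):
        # Run the wrapped backbone, then project its hidden states to logits.
        outputs = self.base_model(**inputs)
        logits = self.classifier(outputs.last_hidden_state)
        # ... compute loss ...
        return loss, logits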

Custom Dataset

Inherit from an OmniDataset subclass and override _load_data or _process_data to handle your specific data format or structure.

from omnigenbench import OmniDatasetForSequenceClassification

class CustomDataset(OmniDatasetForSequenceClassification):
    def _load_data(self, data_path):
        # Your custom logic to read a file
        # and return a list of examples.
        ...
        return processed_data
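
As an illustration of the kind of logic such an override might contain, here is a standalone sketch that parses a tab-separated file into a list of examples. The column names sequence and label, the load_examples helper, and the shape of each example dict are all assumptions made for this sketch. Check the parent class for the structure it actually expects from _load_data.

```python
import csv
import io


def load_examples(handle):
    """Read rows with 'sequence' and 'label' columns into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {"sequence", "label"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    examples = []
    for row in reader:
        sequence = row["sequence"].strip().upper()
        if sequence:  # skip rows with an empty sequence
            examples.append({"sequence": sequence, "label": row["label"].strip()})
    return examples


# An in-memory file stands in for a real path so the sketch runs anywhere.
sample_file = io.StringIO("sequence\tlabel\nacgt\tpromoter\n\tenhancer\nGGCC\tenhancer\n")
processed_data = load_examples(sample_file)
print(processed_data)
# [{'sequence': 'ACGT', 'label': 'promoter'}, {'sequence': 'GGCC', 'label': 'enhancer'}]
```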

Custom Tokenizer

Inherit from OmniTokenizer and implement the core tokenize method with your unique tokenization strategy.

from omnigenbench import OmniTokenizer

class KmerTokenizer(OmniTokenizer):
    def tokenize(self, seq, **kw):
        k = self.k  # k-mer length, assumed to be set in __init__
        return [seq[i:i+k] for i in ...]
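
The snippet above leaves the iteration range open. One common choice, shown here as a standalone sketch rather than as the library's behavior, is a sliding window with a configurable stride. kmer_tokenize and its stride parameter are hypothetical names introduced for this example.

```python
def kmer_tokenize(seq, k=3, stride=1):
    """Split seq into k-mers; stride=1 gives overlapping windows, stride=k gives disjoint ones."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive integers.")
    # The last valid start index is len(seq) - k, so trailing partial windows are dropped.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


print(kmer_tokenize("ATCGGA", k=3))            # ['ATC', 'TCG', 'CGG', 'GGA']
print(kmer_tokenize("ATCGGA", k=3, stride=3))  # ['ATC', 'GGA']
print(kmer_tokenize("AT", k=3))                # []
```

Overlapping windows preserve every position's context at the cost of a longer token sequence; disjoint windows are shorter but depend on where the first window starts.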

Custom Metric

Inherit from OmniMetric and implement compute_metric to calculate your custom evaluation score.

from omnigenbench import OmniMetric
from your_lib import special_metric

class MyMetric(OmniMetric):
    def compute_metric(self, y_true, y_pred, **kwargs):
        score = special_metric(y_true, y_pred)
        return {"my_special_metric": score}

Best Practices for Contributors

When extending the library, please follow these guidelines to ensure your contributions are robust and align with the framework’s philosophy.

Always Inherit: Start by inheriting from the most relevant abstract base class.
Implement Abstract Methods: Ensure all required methods from the parent class are implemented.
Document Everything: Provide clear docstrings for your new class and its methods, including examples.
Write Unit Tests: Every new feature should be accompanied by tests to prevent future regressions.
Follow Conventions: Adhere to the existing coding style and design patterns for consistency.
Handle Errors Gracefully: Provide meaningful error messages for invalid inputs or failed operations.
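
The last few guidelines (docstrings, unit tests, meaningful errors) can be sketched together in one small example. validate_sequence, VALID_BASES, and the test class are hypothetical and exist only for this illustration. The accepted alphabet is an assumption, not a rule taken from the library.

```python
import unittest

VALID_BASES = set("ACGTUN")  # assumed alphabet for this sketch


def validate_sequence(sequence):
    """Return the upper-cased sequence, or raise an error that names the problem.

    Example:
        validate_sequence(" acgu ") returns "ACGU".
    """
    if not isinstance(sequence, str):
        raise TypeError(f"sequence must be a str, got {type(sequence).__name__}")
    cleaned = sequence.strip().upper()
    if not cleaned:
        raise ValueError("sequence is empty")
    invalid = sorted(set(cleaned) - VALID_BASES)
    if invalid:
        raise ValueError(f"sequence contains unsupported characters: {invalid}")
    return cleaned


class ValidateSequenceTest(unittest.TestCase):
    def test_normalizes_case_and_whitespace(self):
        self.assertEqual(validate_sequence(" acgu "), "ACGU")

    def test_rejects_unknown_characters(self):
        with self.assertRaisesRegex(ValueError, "unsupported characters"):
            validate_sequence("ACGX")

    def test_rejects_non_string(self):
        with self.assertRaises(TypeError):
            validate_sequence(None)


# Run the tests explicitly so the sketch also works when pasted into a script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ValidateSequenceTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
print("all tests passed:", outcome.wasSuccessful())
```

Each error message states what was wrong with the input, which is usually enough for a caller to fix the problem without reading the source.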