Background
Currently, combined_score in OpenEvolve is a scalar. While this is straightforward, in many real-world scenarios (e.g., MoE Load Balancing), we often need the Agent to balance multiple metrics (e.g., maximizing accuracy while minimizing runtime).
At present, addressing this need mostly relies on prompting to "nudge" the Agent toward secondary goals. However, this can be difficult to track and evaluate systematically. I’d like to explore if we could bring more structure to this, potentially providing a foundational step for the goals mentioned in #110 and #109.
Idea: Lexicographical Multi-Objective (Tuple-based)
Instead of a single number, what if combined_score could be a tuple[float, ...]?
The core idea is to use lexicographic ordering (Python’s natural tuple comparison). This gives the objectives a clear hierarchy: tuple[0] is the primary goal, tuple[1] is the tie-breaker, and so on.
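Python’s built-in tuple comparison already behaves exactly this way, so no custom comparator is strictly required. A quick illustration (the metric values are made up):

```python
# Python compares tuples element by element (lexicographic order),
# so the first element dominates and later elements only break ties.
a = (0.95, -120.0)  # accuracy 0.95, runtime 120 ms (negated: higher is better)
b = (0.95, -80.0)   # same accuracy, faster runtime
c = (0.97, -200.0)  # higher accuracy, much slower runtime

assert b > a            # tie on accuracy -> the faster candidate wins
assert c > b            # the primary metric dominates regardless of runtime
assert max(a, b, c) == c
```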
Why this might be a good starting point:
Pragmatic & Simple: It’s a "softened" version of multi-objective optimization. It doesn't require complex Pareto front management yet, but it gives us a much more formal way to handle secondary constraints than just prompts.
Agent-Friendly: In my experience (primarily CUDA C++ kernel optimization), LLMs are quite good at following "Priority A > Priority B." A tuple-based score provides a clear, structured signal that reflects this hierarchy.
Flexible: It avoids the need for users to come up with "magic" weights (like $0.7 \times A + 0.3 \times B$), which can be tricky to tune.
Engineering Considerations
I considered an "integer scaling" approach (packing multiple scores into one large integer), but it felt a bit fragile and might be difficult for the Agent to interpret. Using a tuple (or a simple wrapper class) seems like a cleaner path.
To maintain compatibility with existing logic (like fitness averaging or leaderboard plotting), we might consider:
A Score Wrapper Class: Wrapping the tuple in a small helper class to manage comparisons.
Refining Evolution Logic: Exploring how combined_score is utilized during evolution so the system can respect the full tuple rather than defaulting only to a single primary metric.
Direction of Optimization: To keep it simple, we could adopt a convention where all elements are "higher is better" (negating metrics like runtime).
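Putting these three points together, a minimal sketch of such a wrapper could look like the following (the `Score` name and the `weights` parameter are illustrative, not existing OpenEvolve API):

```python
import functools

@functools.total_ordering
class Score:
    """Lexicographically ordered multi-objective score.

    Follows the "higher is better" convention: pass a weight of -1.0
    to negate minimization metrics (e.g. runtime) at construction time.
    """
    def __init__(self, values, weights=None):
        weights = weights or (1.0,) * len(values)
        self.values = tuple(v * w for v, w in zip(values, weights))

    def __eq__(self, other):
        return self.values == other.values

    def __lt__(self, other):
        # Plain tuple comparison gives the lexicographic hierarchy.
        return self.values < other.values

    @property
    def primary(self):
        # Scalar projection for code paths that still expect one number.
        return self.values[0]

# Usage: accuracy (maximize) and runtime in ms (minimize)
s1 = Score((0.95, 120.0), weights=(1.0, -1.0))
s2 = Score((0.95, 80.0), weights=(1.0, -1.0))
assert s2 > s1 and s2.primary == 0.95
```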
Note on Diversity & Multi-island Migration
Calculating the "distance" between combined_score values becomes non-trivial with tuples. I suggest that for migration and diversity logic, we could try:
Primary Metric Projection: Default to the primary metric (tuple[0]) for migration logic requiring a scalar delta, ensuring zero disruption to existing island coordination.
Rank-based Migration: Explore shifting migration triggers from "Score Delta" to "Lexicographical Rank."
Distance Helper: Implement a distance_to(other) helper to provide a weighted scalar distance solely for diversity metrics if needed.
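To make the three options concrete, here is a rough sketch on plain score tuples (all helper names here are hypothetical, not existing OpenEvolve functions):

```python
def primary_delta(a, b):
    """Option 1: project to the primary metric for scalar-delta logic."""
    return a[0] - b[0]

def lex_rank(population):
    """Option 2: rank candidates lexicographically (rank 0 = best)."""
    order = sorted(population, reverse=True)
    return {score: rank for rank, score in enumerate(order)}

def distance_to(a, b, weights=None):
    """Option 3: weighted Euclidean distance, for diversity metrics only."""
    weights = weights or (1.0,) * len(a)
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)) ** 0.5
```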
Leveraging Best Practices (DEAP & NSGA-II)
To ensure this remains extensible, we could draw inspiration from established frameworks:
Fitness Abstraction (DEAP-style): Using a Fitness wrapper that supports a weights vector (e.g., weights=(1.0, -1.0)) to handle max/min natively.
Future-Proofing (NSGA-II): While starting with Lexicographical priority for simplicity, this structure allows for a smoother transition to Non-dominated Sorting or Crowding Distance in the future, which is particularly valuable for non-linear trade-offs in kernel optimization.
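For the NSGA-II direction, the key building block is a Pareto dominance check rather than lexicographic comparison. A minimal sketch under the all-"higher is better" convention (runtime already negated):

```python
def dominates(a, b):
    """Pareto dominance: a dominates b iff a is at least as good on
    every objective and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

assert dominates((0.97, -80.0), (0.95, -120.0))       # better on both
assert not dominates((0.97, -200.0), (0.95, -80.0))   # a trade-off: neither dominates
```

Nothing in the tuple-based representation would need to change for this step; only the comparison logic would be swapped out.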
Questions
I'm very curious to hear thoughts on this direction.
I'd be more than happy to help draft a PR or a prototype if this seems like a path worth pursuing!