Background
Currently, combined_score in OpenEvolve is a scalar. While this is straightforward, in many real-world scenarios (e.g., MoE Load Balancing), we often need the Agent to balance multiple metrics (e.g., maximizing accuracy while minimizing runtime).
At present, addressing this need mostly relies on prompting to "nudge" the Agent toward secondary goals. However, this can be difficult to track and evaluate systematically. I’d like to explore if we could bring more structure to this, potentially providing a foundational step for the goals mentioned in #110 and #109.
Idea: Lexicographical Multi-Objective (Tuple-based)
Instead of a single number, what if combined_score could be a tuple[float, ...]?
The core idea is to use lexicographic ordering (Python’s natural tuple comparison). This gives the objectives a clear hierarchy: tuple[0] is the primary goal, tuple[1] is the tie-breaker, and so on.
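Python’s built-in tuple comparison already behaves exactly this way, so no custom comparator is strictly required. A quick illustration (the metric values are made up):

```python
# Python compares tuples element by element (lexicographic order),
# so the first element dominates and later elements only break ties.
a = (0.95, -120.0)  # accuracy 0.95, runtime 120 ms (negated: higher is better)
b = (0.95, -80.0)   # same accuracy, faster runtime
c = (0.97, -200.0)  # higher accuracy, much slower runtime

assert b > a            # tie on accuracy -> the faster candidate wins
assert c > b            # the primary metric dominates regardless of runtime
assert max(a, b, c) == c
```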
Why this might be a good starting point:
Pragmatic & Simple: It’s a "softened" version of multi-objective optimization. It doesn't require complex Pareto front management yet, but it gives us a much more formal way to handle secondary constraints than just prompts.
Agent-Friendly: In my experience (primarily CUDA C++ kernel optimization), LLMs are quite good at following "Priority A > Priority B." A tuple-based score provides a clear, structured signal that reflects this hierarchy.
Flexible: It avoids the need for users to come up with "magic" weights (like $0.7 \times A + 0.3 \times B$), which can be tricky to tune.
Engineering Considerations
I considered an "integer scaling" approach (packing multiple scores into one large integer), but it felt a bit fragile and might be difficult for the Agent to interpret. Using a tuple (or a simple wrapper class) seems like a cleaner path.
To maintain compatibility with existing logic (like fitness averaging or leaderboard plotting), we might consider:
A Score Wrapper Class: Wrapping the tuple in a small helper class to manage comparisons.
Refining Evolution Logic: Exploring how combined_score is utilized during evolution so the system can respect the full tuple rather than defaulting only to a single primary metric.
Direction of Optimization: To keep it simple, we could adopt a convention where all elements are "higher is better" (negating metrics like runtime).
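Putting these three points together, a minimal sketch of such a wrapper could look like the following (the `Score` name and the `weights` parameter are illustrative, not existing OpenEvolve API):

```python
import functools

@functools.total_ordering
class Score:
    """Lexicographically ordered multi-objective score.

    Follows the "higher is better" convention: pass a weight of -1.0
    to negate minimization metrics (e.g. runtime) at construction time.
    """
    def __init__(self, values, weights=None):
        weights = weights or (1.0,) * len(values)
        self.values = tuple(v * w for v, w in zip(values, weights))

    def __eq__(self, other):
        return self.values == other.values

    def __lt__(self, other):
        # Plain tuple comparison gives the lexicographic hierarchy.
        return self.values < other.values

    @property
    def primary(self):
        # Scalar projection for code paths that still expect one number.
        return self.values[0]

# Usage: accuracy (maximize) and runtime in ms (minimize)
s1 = Score((0.95, 120.0), weights=(1.0, -1.0))
s2 = Score((0.95, 80.0), weights=(1.0, -1.0))
assert s2 > s1 and s2.primary == 0.95
```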
Note on Diversity & Multi-island Migration
Calculating the "distance" between combined_score values becomes non-trivial with tuples. I suggest that for migration and diversity logic, we could try:
Primary Metric Projection: Default to the primary metric (tuple[0]) for migration logic requiring a scalar delta, ensuring zero disruption to existing island coordination.
Rank-based Migration: Explore shifting migration triggers from "Score Delta" to "Lexicographical Rank."
Distance Helper: Implement a distance_to(other) helper to provide a weighted scalar distance solely for diversity metrics if needed.
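To make the three options concrete, here is a rough sketch on plain score tuples (all helper names here are hypothetical, not existing OpenEvolve functions):

```python
def primary_delta(a, b):
    """Option 1: project to the primary metric for scalar-delta logic."""
    return a[0] - b[0]

def lex_rank(population):
    """Option 2: rank candidates lexicographically (rank 0 = best)."""
    order = sorted(population, reverse=True)
    return {score: rank for rank, score in enumerate(order)}

def distance_to(a, b, weights=None):
    """Option 3: weighted Euclidean distance, for diversity metrics only."""
    weights = weights or (1.0,) * len(a)
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)) ** 0.5
```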
Leveraging Best Practices (DEAP & NSGA-II)
To ensure this remains extensible, we could draw inspiration from established frameworks:
Fitness Abstraction (DEAP-style): Using a Fitness wrapper that supports a weights vector (e.g., weights=(1.0, -1.0)) to handle max/min natively.
Future-Proofing (NSGA-II): While starting with Lexicographical priority for simplicity, this structure allows for a smoother transition to Non-dominated Sorting or Crowding Distance in the future, which is particularly valuable for non-linear trade-offs in kernel optimization.
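For the NSGA-II direction, the key building block is a Pareto dominance check rather than lexicographic comparison. A minimal sketch under the all-"higher is better" convention (runtime already negated):

```python
def dominates(a, b):
    """Pareto dominance: a dominates b iff a is at least as good on
    every objective and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

assert dominates((0.97, -80.0), (0.95, -120.0))       # better on both
assert not dominates((0.97, -200.0), (0.95, -80.0))   # a trade-off: neither dominates
```

Nothing in the tuple-based representation would need to change for this step; only the comparison logic would be swapped out.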
Questions
I'm very curious to hear thoughts on this direction.
I'd be more than happy to help draft a PR or a prototype if this seems like a path worth pursuing!