graph LR
    Metric_Orchestrator["Metric Orchestrator"]
    Sample_Preprocessor["Sample Preprocessor"]
    Sample_Level_Metric_Calculator["Sample-Level Metric Calculator"]
    LLM_as_Judge_Evaluator["LLM-as-Judge Evaluator"]
    Text_Normalization_Utilities["Text Normalization Utilities"]
    Harness_Compatibility_Layer["Harness Compatibility Layer"]
    Specialized_Metric_Integrations["Specialized Metric Integrations"]
    Statistical_Error_Calculator["Statistical Error Calculator"]
    Metric_Orchestrator -- "requests samples from" --> Sample_Preprocessor
    Metric_Orchestrator -- "dispatches prepared samples to" --> Sample_Level_Metric_Calculator
    Metric_Orchestrator -- "delegates specialized evaluations to" --> LLM_as_Judge_Evaluator
    Sample_Level_Metric_Calculator -- "utilizes functions from" --> Text_Normalization_Utilities
    LLM_as_Judge_Evaluator -- "may utilize" --> Text_Normalization_Utilities
    Harness_Compatibility_Layer -- "calls utilities from" --> Text_Normalization_Utilities
    Specialized_Metric_Integrations -- "relies on" --> Text_Normalization_Utilities
    Statistical_Error_Calculator -- "computes errors on results provided by" --> Metric_Orchestrator

Details

The lighteval metrics subsystem is coordinated by the Metric Orchestrator, which directs the flow of evaluation. The Sample Preprocessor first converts raw samples into the input formats that the metric functions expect. Metrics are then computed by the Sample-Level Metric Calculator for standard NLP metrics, or delegated to the LLM-as-Judge Evaluator for LLM-based assessments. The Text Normalization Utilities standardize text for these calculations and are shared by the Harness Compatibility Layer, which adapts metric implementations from other evaluation frameworks, and by the Specialized Metric Integrations, which bridge to third-party metrics. Finally, the Statistical Error Calculator estimates the uncertainty of the computed results. Because each responsibility sits in its own component, the design stays flexible and extensible for evaluating language models.

Metric Orchestrator

Serves as the central coordinator for all metric computations. It manages the overall flow, dispatching evaluation tasks to the appropriate metric calculators and aggregating their results. Routing every metric through a single coordinator keeps the evaluation pipeline coherent from sample preparation to final aggregation.
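
The dispatch-and-aggregate flow described here can be sketched in a few lines. This is a minimal illustration, not lighteval's implementation: `run_metrics`, `exact_match`, and `prefix_match` are hypothetical names, and mean aggregation is an assumption.

```python
from statistics import mean
from typing import Callable

# Hypothetical type: a metric scores one (gold, prediction) pair.
SampleMetric = Callable[[str, str], float]


def exact_match(gold: str, pred: str) -> float:
    """1.0 if the prediction equals the gold answer after stripping whitespace."""
    return float(gold.strip() == pred.strip())


def prefix_match(gold: str, pred: str) -> float:
    """1.0 if the prediction starts with the gold answer."""
    return float(pred.strip().startswith(gold.strip()))


def run_metrics(
    samples: list[tuple[str, str]], metrics: dict[str, SampleMetric]
) -> dict[str, float]:
    """Dispatch every sample to every metric, then aggregate by mean (an assumption)."""
    per_metric: dict[str, list[float]] = {name: [] for name in metrics}
    for gold, pred in samples:
        for name, fn in metrics.items():
            per_metric[name].append(fn(gold, pred))
    return {name: mean(scores) for name, scores in per_metric.items()}


samples = [("Paris", "Paris"), ("4", "42"), ("blue", "red")]
results = run_metrics(
    samples, {"exact_match": exact_match, "prefix_match": prefix_match}
)
```

Keeping the registry as a plain mapping from name to function means a new metric can be added without touching the dispatch loop.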

Related Classes/Methods:

Sample Preprocessor

Standardizes and transforms raw evaluation samples into the specific input formats required by various metric functions (e.g., LogprobCorpusMetricInput, GenerativeCorpusMetricInput). This ensures data consistency across different metric types.
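
As a rough sketch of this kind of preprocessing, the example below converts one raw record type into two metric-specific input shapes. All class and function names (`RawSample`, `GenerativeInput`, `LogprobInput`, and the two converters) are hypothetical stand-ins; the fields of lighteval's own `LogprobCorpusMetricInput` and `GenerativeCorpusMetricInput` are not shown in this document and may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawSample:
    """Hypothetical raw record produced by a model run."""
    gold_text: str
    generated_text: str = ""
    choice_logprobs: Optional[list[float]] = None
    gold_index: Optional[int] = None


@dataclass
class GenerativeInput:
    """Hypothetical input for text-generation metrics."""
    golds: list[str]
    preds: list[str]


@dataclass
class LogprobInput:
    """Hypothetical input for multiple-choice (log-probability) metrics."""
    golds: list[int]
    preds: list[int]


def to_generative_input(samples: list[RawSample]) -> GenerativeInput:
    # Strip surrounding whitespace so every generative metric sees the same text.
    return GenerativeInput(
        golds=[s.gold_text for s in samples],
        preds=[s.generated_text.strip() for s in samples],
    )


def to_logprob_input(samples: list[RawSample]) -> LogprobInput:
    golds, preds = [], []
    for s in samples:
        if s.choice_logprobs is None or s.gold_index is None:
            raise ValueError("sample has no log-probability data")
        golds.append(s.gold_index)
        # The predicted choice is the index with the highest log-probability.
        lps = s.choice_logprobs
        preds.append(max(range(len(lps)), key=lps.__getitem__))
    return LogprobInput(golds=golds, preds=preds)


samples = [
    RawSample("Paris", " Paris\n", choice_logprobs=[-2.0, -0.5, -3.0], gold_index=1),
    RawSample("Rome", "Milan", choice_logprobs=[-0.1, -4.0], gold_index=0),
]
gen = to_generative_input(samples)
lp = to_logprob_input(samples)
```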

Related Classes/Methods:

Sample-Level Metric Calculator

Implements a broad range of common NLP metrics (e.g., ROUGE, BLEU, Pass@k, edit similarity) that are computed on individual samples or batches. It represents the core quantitative scoring functionality.
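
To make one of these metrics concrete, the sketch below computes Pass@k for a single problem using the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of generated samples and c the number that pass. Whether lighteval computes Pass@k in exactly this way is an assumption; `pass_at_k` is an illustrative name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct,
    given that c of the n samples are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case any draw of k samples must contain a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


score = pass_at_k(n=10, c=3, k=1)
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n.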

Related Classes/Methods:

LLM-as-Judge Evaluator

Manages the specialized logic for metrics where an LLM itself acts as a judge. It handles interactions with various LLM inference backends (e.g., Hugging Face Inference API, LiteLLM, vLLM, Transformers) to obtain judgments.
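
One way to keep the judging logic independent of the inference backend is to treat the backend as a plain callable. The sketch below is a hypothetical illustration of that pattern: the prompt template, the `Score: <n>` reply format, and the names `JudgeBackend`, `judge_answer`, and `fake_backend` are assumptions, not lighteval's API.

```python
import re
from typing import Callable, Optional

# Hypothetical backend interface: prompt in, raw completion text out.
JudgeBackend = Callable[[str], str]

PROMPT = (
    "Rate the answer from 1 to 5.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply as 'Score: <n>'."
)


def judge_answer(question: str, answer: str, backend: JudgeBackend) -> Optional[int]:
    """Ask the judge model for a score and parse it; None if the reply is unparseable."""
    raw = backend(PROMPT.format(question=question, answer=answer))
    match = re.search(r"Score:\s*([1-5])\b", raw)
    return int(match.group(1)) if match else None


def fake_backend(prompt: str) -> str:
    """Stand-in for a real inference call so the sketch runs offline."""
    return "The answer is correct. Score: 4"


score = judge_answer("What is 2 + 2?", "4", fake_backend)
```

A real backend would wrap a call to a hosted API or a local model behind the same one-argument signature, so the parsing code stays unchanged.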

Related Classes/Methods:

Text Normalization Utilities

Provides a collection of reusable text normalization functions (e.g., helm_normalizer, math_normalizer) to standardize text inputs and outputs before metric calculation. This ensures consistency and comparability of results across different evaluations.
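
The sketch below shows the general shape of such a normalizer, using the common question-answering recipe of lowercasing, removing punctuation and English articles, and collapsing whitespace. It is an illustration only and does not reproduce `helm_normalizer` or `math_normalizer`; `normalize_answer` is a hypothetical name.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


normalized = normalize_answer("  The Eiffel Tower! ")
```

Applying the same normalizer to both the gold answer and the prediction keeps superficial differences such as casing or trailing punctuation from being scored as errors.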

Related Classes/Methods:

Harness Compatibility Layer

Adapts lighteval's metric computation to be compatible with specific metric implementations found in other evaluation harnesses (e.g., EleutherAI's LM Evaluation Harness for TruthfulQA, DROP). This facilitates interoperability and lets lighteval reuse existing benchmark definitions.
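
A compatibility layer of this kind is often a thin adapter around a scorer with a different calling convention. The sketch below assumes a hypothetical harness-style function that takes one prediction plus a list of references and returns a dictionary of scores, and wraps it into a `(golds, pred) -> float` metric. Neither signature is taken from lighteval or from any specific harness.

```python
from typing import Callable

# Hypothetical harness-style signature: (prediction, references) -> {name: score}.
HarnessFn = Callable[[str, list[str]], dict[str, float]]


def harness_style_em(prediction: str, references: list[str]) -> dict[str, float]:
    """Stand-in for an external scorer: exact match against any reference."""
    hit = any(prediction.strip() == ref.strip() for ref in references)
    return {"em": float(hit)}


def adapt(harness_fn: HarnessFn, key: str) -> Callable[[list[str], str], float]:
    """Wrap a harness-style scorer as a (golds, pred) -> float sample metric."""
    def metric(golds: list[str], pred: str) -> float:
        # Reorder the arguments and pull out the single score of interest.
        return harness_fn(pred, golds)[key]
    return metric


em = adapt(harness_style_em, "em")
score = em(["Paris", "paris"], "paris ")
```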

Related Classes/Methods:

Specialized Metric Integrations

Integrates advanced or domain-specific metrics from external libraries, such as BERTScore for semantic similarity and SummaC for summarization consistency. This component acts as a bridge to third-party metric implementations.

Related Classes/Methods:

Statistical Error Calculator

Computes statistical measures of error, such as standard deviation and bootstrap standard error, for the calculated metrics. These measures quantify the uncertainty of evaluation results and can serve as the basis for confidence intervals.
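
The two error measures named here can be sketched with the standard library alone. This is a generic illustration of the analytic standard error of the mean and a bootstrap estimate of it; the function names, the default of 1000 resamples, and the fixed seed are assumptions rather than lighteval's settings.

```python
import random
from statistics import mean, stdev


def mean_stderr(scores: list[float]) -> float:
    """Analytic standard error of the mean: sample stdev / sqrt(n)."""
    return stdev(scores) / len(scores) ** 0.5


def bootstrap_stderr(scores: list[float], n_resamples: int = 1000, seed: int = 0) -> float:
    """Standard deviation of the mean across resamples drawn with replacement."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    means = [
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    ]
    return stdev(means)


scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
analytic = mean_stderr(scores)
bootstrapped = bootstrap_stderr(scores)
```

The bootstrap needs no distributional formula, so the same function also works when the aggregate is something other than a mean, provided `mean` is swapped for that aggregate.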

Related Classes/Methods: