```mermaid
graph LR
Command_Line_Interface_CLI_["Command-Line Interface (CLI)"]
Configuration_Task_Registry["Configuration & Task Registry"]
Data_Prompt_Preparation["Data & Prompt Preparation"]
Model_Interface_Inference_Backends["Model Interface & Inference Backends"]
Evaluation_Pipeline_Core["Evaluation Pipeline Core"]
Metric_Computation_Engine["Metric Computation Engine"]
Result_Management_Reporting["Result Management & Reporting"]
External_Services["External Services"]
Command_Line_Interface_CLI_ -- "loads from" --> Configuration_Task_Registry
Configuration_Task_Registry -- "provides to" --> Data_Prompt_Preparation
Data_Prompt_Preparation -- "supplies to" --> Evaluation_Pipeline_Core
Evaluation_Pipeline_Core -- "requests from" --> Model_Interface_Inference_Backends
Model_Interface_Inference_Backends -- "returns to" --> Evaluation_Pipeline_Core
Evaluation_Pipeline_Core -- "sends to" --> Metric_Computation_Engine
Metric_Computation_Engine -- "returns to" --> Evaluation_Pipeline_Core
Evaluation_Pipeline_Core -- "sends to" --> Result_Management_Reporting
Result_Management_Reporting -- "pushes to" --> External_Services
click Command_Line_Interface_CLI_ href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/lighteval/Command_Line_Interface_CLI_.md" "Details"
click Configuration_Task_Registry href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/lighteval/Configuration_Task_Registry.md" "Details"
click Data_Prompt_Preparation href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/lighteval/Data_Prompt_Preparation.md" "Details"
click Model_Interface_Inference_Backends href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/lighteval/Model_Interface_Inference_Backends.md" "Details"
click Evaluation_Pipeline_Core href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/lighteval/Evaluation_Pipeline_Core.md" "Details"
click Metric_Computation_Engine href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/lighteval/Metric_Computation_Engine.md" "Details"
click Result_Management_Reporting href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/lighteval/Result_Management_Reporting.md" "Details"
click External_Services href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/lighteval/External_Services.md" "Details"
```
The lighteval project is structured as a clear, modular pipeline for evaluating Large Language Models (LLMs). The process begins with the Command-Line Interface (CLI), the primary entry point through which users define and launch evaluation runs. The CLI consults the Configuration & Task Registry to load predefined evaluation tasks, including their datasets, prompts, and metric configurations. Once tasks are resolved, the Data & Prompt Preparation component loads the raw data and generates prompts suitable for LLM inference.

The prepared prompts are then fed into the Evaluation Pipeline Core, the central orchestrator of the end-to-end evaluation flow. This core component calls into the Model Interface & Inference Backends to perform the actual LLM inference, abstracting away the differences between model types and inference systems. Model outputs are passed to the Metric Computation Engine, which calculates the evaluation metrics. Finally, the Result Management & Reporting component stores, loads, and presents the evaluation outcomes; it also integrates with External Services to push results to platforms such as Hugging Face Hub, Weights & Biases, and TensorBoard for storage, visualization, and collaboration.
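To make the flow concrete, here is a minimal, self-contained Python sketch of this dataflow. Every name in it is a hypothetical stand-in for the corresponding component; it mirrors the layering described above, not lighteval's actual API.

```python
"""Minimal sketch of the dataflow described above.

All names are hypothetical stand-ins for the real components.
"""

def load_tasks(spec: str) -> list[dict]:
    # Configuration & Task Registry: resolve a task spec into task configs.
    return [{"name": spec, "docs": [{"query": "2+2=", "gold": "4"}]}]

def build_prompts(task: dict) -> list[dict]:
    # Data & Prompt Preparation: turn raw docs into model-ready prompts.
    return [{"prompt": d["query"], "gold": d["gold"]} for d in task["docs"]]

def dummy_model(prompt: str) -> str:
    # Model Interface & Inference Backends: stand-in for real inference.
    return "4"

def exact_match(pred: str, gold: str) -> float:
    # Metric Computation Engine: one sample-level metric.
    return float(pred.strip() == gold.strip())

def run(spec: str) -> dict:
    # Evaluation Pipeline Core: orchestrate the end-to-end flow.
    results = {}
    for task in load_tasks(spec):
        scores = [exact_match(dummy_model(p["prompt"]), p["gold"])
                  for p in build_prompts(task)]
        results[task["name"]] = sum(scores) / len(scores)
    return results  # Result Management & Reporting would persist this.

print(run("demo|arithmetic|0|0"))  # {'demo|arithmetic|0|0': 1.0}
```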
### Command-Line Interface (CLI)
The primary user interface for initiating evaluations, handling command-line argument parsing, and loading initial configurations. It acts as the orchestrator for different evaluation modes.
Related Classes/Methods:
- src/lighteval/__main__.py
- src/lighteval/main_accelerate.py
- src/lighteval/main_baseline.py
- src/lighteval/main_custom.py
- src/lighteval/main_endpoint.py
- src/lighteval/main_nanotron.py
- src/lighteval/main_sglang.py
- src/lighteval/main_tasks.py
- src/lighteval/main_vllm.py
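The one-module-per-backend layout above (main_accelerate.py, main_vllm.py, and so on) suggests a subcommand dispatcher. Below is a minimal sketch of that pattern using argparse; lighteval's real CLI wiring and flags may differ, so treat the argument names here as assumptions.

```python
# Hypothetical sketch of a subcommand dispatcher in the spirit of
# __main__.py + main_<backend>.py; the real CLI may differ.
import argparse

def main_accelerate(args): print(f"accelerate run: {args.tasks}")
def main_vllm(args): print(f"vllm run: {args.tasks}")

def main():
    parser = argparse.ArgumentParser(prog="lighteval")
    sub = parser.add_subparsers(dest="backend", required=True)
    for name, fn in {"accelerate": main_accelerate, "vllm": main_vllm}.items():
        p = sub.add_parser(name, help=f"evaluate with the {name} backend")
        p.add_argument("model_args", help="backend-specific model config")
        p.add_argument("tasks", help="task spec, e.g. 'suite|task|fewshot|truncate'")
        p.set_defaults(func=fn)
    args = parser.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()
```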
### Configuration & Task Registry
Defines and manages evaluation tasks, including their associated datasets, prompts, and metric configurations. It serves as a central repository for all available evaluation tasks.
Related Classes/Methods:
- src/lighteval/tasks/registry.py
- src/lighteval/tasks/default_prompts.py
- src/lighteval/tasks/extended/
- src/lighteval/tasks/lighteval_task.py
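A registry like this is commonly implemented as a task-config object plus a module-level mapping from task names to configs. The sketch below illustrates that pattern with simplified, hypothetical names; the actual classes in registry.py and lighteval_task.py are richer.

```python
# Illustrative registry pattern; field and class names are simplified
# stand-ins for what registry.py / lighteval_task.py provide.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TaskConfig:
    name: str
    dataset: str                      # e.g. a Hugging Face dataset id
    prompt_fn: Callable[[dict], str]  # maps a raw example to a prompt
    metrics: list[str] = field(default_factory=list)

REGISTRY: dict[str, TaskConfig] = {}

def register(cfg: TaskConfig) -> None:
    REGISTRY[cfg.name] = cfg

register(TaskConfig(
    name="demo|gsm8k",
    dataset="gsm8k",
    prompt_fn=lambda ex: f"Question: {ex['question']}\nAnswer:",
    metrics=["exact_match"],
))

print(REGISTRY["demo|gsm8k"].metrics)  # ['exact_match']
```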
### Data & Prompt Preparation
Responsible for loading raw datasets, preparing them for evaluation, and generating prompts tailored for LLMs based on task definitions. It ensures data is in the correct format for model inference.
Related Classes/Methods:
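As an illustration of the prompt-generation step, the hypothetical helper below builds a k-shot prompt from solved examples; lighteval's own formatting logic is task-specific and more involved.

```python
# Hypothetical few-shot prompt builder illustrating the kind of work
# this component does; not lighteval's actual implementation.
def build_fewshot_prompt(doc: dict, fewshot_docs: list[dict], k: int) -> str:
    """Prepend k solved examples to the query of `doc`."""
    shots = [
        f"Question: {ex['question']}\nAnswer: {ex['answer']}"
        for ex in fewshot_docs[:k]
    ]
    query = f"Question: {doc['question']}\nAnswer:"
    return "\n\n".join(shots + [query])

examples = [{"question": "2+2?", "answer": "4"},
            {"question": "3+3?", "answer": "6"}]
print(build_fewshot_prompt({"question": "5+5?"}, examples, k=2))
```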
### Model Interface & Inference Backends
Provides a unified, abstract interface (AbstractModel) for interacting with various LLM inference systems. It encapsulates the logic for loading models and performing inference (greedy generation, log-likelihood computation) across different backends.
Related Classes/Methods:
- src/lighteval/models/model_loader.py
- src/lighteval/models/abstract_model.py
- src/lighteval/models/transformers/transformers_model.py
- src/lighteval/models/vllm/vllm_model.py
- src/lighteval/models/nanotron/nanotron_model.py
- src/lighteval/models/sglang/sglang_model.py
- src/lighteval/models/endpoints/
- src/lighteval/models/custom/custom_model.py
- src/lighteval/models/dummy/dummy_model.py
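The pattern here is a small abstract base class that every backend implements. In the sketch below, the method names for greedy generation and log-likelihood scoring are simplified assumptions rather than the exact AbstractModel signatures:

```python
# Sketch of the abstract-interface pattern described above. Method
# names are simplified; the real AbstractModel's signatures differ.
from abc import ABC, abstractmethod

class AbstractModel(ABC):
    @abstractmethod
    def greedy_until(self, prompts: list[str], stop: list[str]) -> list[str]:
        """Greedy generation until a stop sequence or max length."""

    @abstractmethod
    def loglikelihood(self, pairs: list[tuple[str, str]]) -> list[float]:
        """log P(continuation | context) for each (context, continuation)."""

class DummyModel(AbstractModel):
    # In the spirit of dummy_model.py: fixed outputs for pipeline tests.
    def greedy_until(self, prompts, stop):
        return ["dummy answer"] * len(prompts)

    def loglikelihood(self, pairs):
        return [0.0] * len(pairs)

model: AbstractModel = DummyModel()
print(model.greedy_until(["Q: 2+2? A:"], stop=["\n"]))
```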
### Evaluation Pipeline Core
The central orchestrator of the end-to-end evaluation process. It manages the flow from running models with prepared prompts to collecting responses and initiating metric computations. It handles both synchronous and asynchronous model runs.
Related Classes/Methods:
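One way to support both synchronous and asynchronous backends behind a single call site is to detect coroutines at dispatch time. The following is a hedged sketch of that idea, not lighteval's actual pipeline code:

```python
# Hedged sketch of sync/async dispatch in an orchestrator; the real
# pipeline differs in detail, but the shape is similar.
import asyncio
from typing import Awaitable, Callable, Union

Runner = Union[Callable[[list[str]], list[str]],
               Callable[[list[str]], Awaitable[list[str]]]]

def run_model(runner: Runner, prompts: list[str]) -> list[str]:
    result = runner(prompts)
    if asyncio.iscoroutine(result):
        # Async backends (e.g. HTTP endpoints) return coroutines.
        return asyncio.run(result)
    return result  # Sync backends return responses directly.

def sync_backend(prompts): return [p.upper() for p in prompts]
async def async_backend(prompts): return [p.lower() for p in prompts]

print(run_model(sync_backend, ["Hello"]))   # ['HELLO']
print(run_model(async_backend, ["Hello"]))  # ['hello']
```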
### Metric Computation Engine
Implements a wide array of evaluation metrics, including standard NLP metrics, LLM-as-a-judge paradigms, and specialized metrics. It takes model outputs and ground truth to compute quantitative scores.
Related Classes/Methods:
- src/lighteval/metrics/metrics_corpus.py
- src/lighteval/metrics/metrics_sample.py
- src/lighteval/metrics/llm_as_judge.py
- src/lighteval/metrics/normalizations.py
- src/lighteval/metrics/harness_compatibility/
- src/lighteval/metrics/metrics.py
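The split between sample-level metrics, normalizations, and corpus-level aggregation can be illustrated with a toy exact-match metric; the normalization rules below are simplified examples, not the ones in normalizations.py:

```python
# Illustrative sample-level metric plus corpus-level aggregation,
# mirroring the metrics_sample / normalizations / metrics_corpus split.
import string

def normalize(text: str) -> str:
    # normalizations.py-style cleanup: lowercase, strip punctuation.
    return text.lower().translate(
        str.maketrans("", "", string.punctuation)).strip()

def exact_match(prediction: str, gold: str) -> float:
    # Sample-level metric: 1.0 if the normalized strings agree.
    return float(normalize(prediction) == normalize(gold))

def corpus_mean(sample_scores: list[float]) -> float:
    # Corpus-level aggregation over all samples of a task.
    return sum(sample_scores) / len(sample_scores) if sample_scores else 0.0

scores = [exact_match(p, g) for p, g in [("Paris.", "paris"), ("Rome", "paris")]]
print(corpus_mean(scores))  # 0.5
```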
### Result Management & Reporting
Handles the storage, loading, and presentation of evaluation results. It manages the persistence of detailed responses and aggregated scores, and facilitates integration with external platforms for visualization and sharing.
Related Classes/Methods:
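A typical shape for this component is a writer that persists aggregated scores alongside per-sample details. The file layout in the sketch below is an assumption for illustration only:

```python
# Hypothetical result-persistence sketch: aggregated scores plus
# per-sample details, written as JSON. The layout is an assumption.
import json, time
from pathlib import Path

def save_results(output_dir: str, model_name: str,
                 results: dict, details: list[dict]) -> Path:
    run_dir = Path(output_dir) / model_name.replace("/", "__")
    run_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y-%m-%dT%H-%M-%S")
    payload = {"model": model_name, "timestamp": stamp,
               "results": results, "details": details}
    path = run_dir / f"results_{stamp}.json"
    path.write_text(json.dumps(payload, indent=2))
    return path

print(save_results("out", "org/model", {"demo|task": {"acc": 0.5}},
                   [{"prompt": "2+2=", "prediction": "4", "gold": "4"}]))
```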
### External Services
Manages the pushing of evaluation results and details to external platforms and services for enhanced storage, visualization, and collaboration (e.g., Hugging Face Hub, Weights & Biases, TensorBoard, Amazon S3).
Related Classes/Methods:
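For the Hugging Face Hub case, pushing a results file can be done with the huggingface_hub client as sketched below; the repo id and paths are placeholders, and lighteval's own integration may organize uploads differently.

```python
# Sketch of pushing a results file to the Hugging Face Hub with
# huggingface_hub; repo id and file paths are placeholders.
from huggingface_hub import HfApi

def push_results(local_path: str, repo_id: str) -> None:
    api = HfApi()  # uses the token from `huggingface-cli login`
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=local_path.split("/")[-1],
        repo_id=repo_id,
        repo_type="dataset",
    )

# Example (requires authentication; path is a placeholder):
# push_results("out/org__model/results.json", "my-org/eval-results")
```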