graph LR
LMTPDcModel["LMTPDcModel"]
Scheduler["Scheduler"]
LMTP_Clients["LMTP Clients"]
LMTP_Server["LMTP Server"]
LMTPBalancer["LMTPBalancer"]
Batched_OpenAI_API_Adapter["Batched OpenAI API Adapter"]
Model_Backends["Model Backends"]
LMTPDcModel -- "delegates request sending to" --> LMTP_Clients
LMTPDcModel -- "enqueues requests to" --> Scheduler
LMTPDcModel -- "leverages" --> Batched_OpenAI_API_Adapter
Scheduler -- "dispatches batched requests to" --> Model_Backends
LMTP_Clients -- "send requests to" --> LMTP_Server
LMTP_Server -- "receives requests from" --> LMTP_Clients
LMTP_Server -- "uses" --> LMTPDcModel
LMTPBalancer -- "forwards requests to" --> LMTP_Server
Batched_OpenAI_API_Adapter -- "provides an interface for" --> LMTPDcModel
Model_Backends -- "receive requests from" --> Scheduler
The LMTP subsystem in LMQL facilitates interaction with various LLM backends. LMTP Clients initiate requests, which are then sent to the LMTP Server. If a load balancer is in place, the LMTPBalancer forwards these incoming requests to available LMTP Server instances. The LMTP Server receives these requests and utilizes the LMTPDcModel to abstract the LLM interaction. The LMTPDcModel enqueues requests to the Scheduler, which optimizes and dispatches batched requests to specific Model Backends (e.g., HuggingFace, Llama.cpp) or leverages the Batched OpenAI API Adapter for OpenAI interactions. Model Backends process these requests and return results, which are then handled by the Scheduler and ultimately relayed back through the LMTP Server to the LMTP Clients.
Serves as the core abstraction for LMQL's interaction with various LLM backends. It provides a unified, high-level interface for model loading, tokenization, and core generate and score operations, abstracting away underlying communication and inference details. This component acts as the primary entry point from the LMQL runtime into the LLM integration layer.
Related Classes/Methods:
Orchestrates and optimizes LLM inference requests by batching them (GenerateBatch) and dispatching them to available Model Backends. It manages the lifecycle of loaded models and handles streaming results back to callers, embodying a key part of the runtime optimization.
Related Classes/Methods:
Provide various communication protocols and mechanisms for LMQL to interact with the LMTP Server or directly with remote/local model instances. They handle the low-level message passing and connection management, forming the client-side of the Client-Server Architecture.
Related Classes/Methods:
lmql.models.lmtp.lmtp_clientlmql.models.lmtp.lmtp_multiprocessinglmql.models.lmtp.lmtp_asynclmql.models.lmtp.lmtp_replicate_client
Acts as the central entry point for external LMTP Clients, providing a network interface (e.g., WebSocket) to expose LLM capabilities. It receives client requests and routes them for processing, forming the server-side of the Client-Server Architecture.
Related Classes/Methods:
Distributes incoming LMTP requests across multiple LMTP Server instances (workers) to ensure efficient resource utilization, load balancing, and high availability within the distributed setup.
Related Classes/Methods:
Provides a robust, batched, and fault-tolerant interface for interacting specifically with the OpenAI API. It handles request queuing, retries, and response parsing, acting as a specialized Adapter for a key external LLM service.
Related Classes/Methods:
Encapsulate the specific implementation details for interacting with different underlying LLM libraries (e.g., HuggingFace Transformers, Llama.cpp). They provide a standardized generate and score interface for the Scheduler, embodying the Adapter and Strategy patterns for diverse LLM integrations.
Related Classes/Methods: