graph LR
LMQLTokenizer["LMQLTokenizer"]
_load_tokenizer["_load_tokenizer"]
TiktokenTokenizer["TiktokenTokenizer"]
TransformersTokenizer["TransformersTokenizer"]
SentencePieceTokenizer["SentencePieceTokenizer"]
PythonBackedTokenizer["PythonBackedTokenizer"]
_load_tokenizer -- "initializes" --> LMQLTokenizer
_load_tokenizer -- "instantiates" --> TiktokenTokenizer
_load_tokenizer -- "instantiates" --> TransformersTokenizer
_load_tokenizer -- "instantiates" --> SentencePieceTokenizer
_load_tokenizer -- "instantiates" --> PythonBackedTokenizer
LMQLTokenizer -- "uses" --> TiktokenTokenizer
LMQLTokenizer -- "uses" --> TransformersTokenizer
LMQLTokenizer -- "uses" --> SentencePieceTokenizer
LMQLTokenizer -- "uses" --> PythonBackedTokenizer
The Tokenizer subsystem provides the text-to-token and token-to-text conversion services that LMQL needs to interact with Large Language Models. It applies the Adapter and Strategy patterns to support diverse tokenizer implementations behind a single interface.
LMQLTokenizer serves as the primary abstract interface and facade for all tokenization and detokenization operations within LMQL. It provides a consistent API that abstracts the complexities of the underlying tokenizer implementations, and it manages special tokens and tokenizer properties. This component maintains a clean separation between the LMQL runtime and model-specific tokenizer details, embodying the Adapter pattern.
Related Classes/Methods:
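The facade role described above can be sketched as a minimal abstract interface with one toy backend. The names here (`BaseTokenizer`, `CharTokenizer`, the `encode`/`decode` signatures) are illustrative assumptions, not LMQL's actual API.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseTokenizer(ABC):
    """Hypothetical adapter interface; LMQL's real facade is LMQLTokenizer."""

    @abstractmethod
    def encode(self, text: str) -> List[int]:
        """Convert text into a sequence of token IDs."""

    @abstractmethod
    def decode(self, ids: List[int]) -> str:
        """Convert token IDs back into text."""

    @property
    @abstractmethod
    def eos_token_id(self) -> int:
        """ID of the end-of-sequence special token."""


class CharTokenizer(BaseTokenizer):
    """Toy backend: one token per character (illustration only)."""

    def encode(self, text: str) -> List[int]:
        return [ord(c) for c in text]

    def decode(self, ids: List[int]) -> str:
        return "".join(chr(i) for i in ids)

    @property
    def eos_token_id(self) -> int:
        return 0
```

Because callers depend only on the abstract interface, any backend (tiktoken, Transformers, SentencePiece, pure Python) can be swapped in without touching the runtime.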
_load_tokenizer is a factory function responsible for dynamically selecting and instantiating the appropriate concrete tokenizer backend based on the provided model identifier or configuration. It ensures the correct tokenizer is loaded and initialized for use by LMQLTokenizer, acting as the selection point of the Strategy pattern.
Related Classes/Methods:
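A backend-selection factory of this kind might dispatch on the model identifier roughly as follows. The dispatch rules and the function name below are assumptions for illustration; the real `_load_tokenizer` also handles library availability checks and fallbacks.

```python
def load_tokenizer(model_identifier: str) -> str:
    """Hypothetical sketch: pick a tokenizer backend by model name."""
    # OpenAI-style model names use tiktoken.
    if model_identifier.startswith(("gpt-3.5", "gpt-4", "openai/")):
        return "tiktoken"
    # Llama-family models typically ship SentencePiece vocabularies.
    if "llama" in model_identifier.lower():
        return "sentencepiece"
    # HuggingFace-style "org/model" identifiers go to Transformers.
    if "/" in model_identifier:
        return "transformers"
    # Otherwise fall back to the pure-Python tokenizer.
    return "python"
```

Each branch corresponds to one of the concrete backends in the diagram; ordering matters, since a Llama model hosted on HuggingFace matches both the `llama` and `org/model` rules.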
TiktokenTokenizer provides the concrete implementation for tokenization using the tiktoken library, primarily used for OpenAI models. It handles encoding text to token IDs and decoding token IDs back to text, and manages tiktoken-specific special tokens. This is one of the core strategies for LLM integration.
Related Classes/Methods:
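A wrapper around tiktoken might look like the sketch below. The class name and default model are assumptions; `tiktoken.encoding_for_model` is the library's standard way to resolve an OpenAI model name to its encoding.

```python
class TiktokenAdapter:
    """Hypothetical tiktoken-backed tokenizer (not LMQL's actual class)."""

    def __init__(self, model: str = "gpt-4"):
        import tiktoken  # deferred import: only needed for OpenAI models
        self._enc = tiktoken.encoding_for_model(model)

    def encode(self, text: str) -> list:
        return self._enc.encode(text)

    def decode(self, ids: list) -> str:
        return self._enc.decode(ids)
```

Deferring the import keeps tiktoken an optional dependency: users targeting only local models never pay for it.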
TransformersTokenizer implements tokenization by wrapping HuggingFace Transformers tokenizers. This component leverages the extensive model support of the HuggingFace ecosystem for encoding, decoding, and handling special tokens, significantly broadening LMQL's compatibility with LLMs.
Related Classes/Methods:
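Wrapping a HuggingFace tokenizer behind the same encode/decode interface could look like this sketch. The class name and default model id are assumptions; `AutoTokenizer.from_pretrained` is the standard Transformers entry point.

```python
class TransformersAdapter:
    """Hypothetical wrapper around a HuggingFace tokenizer."""

    def __init__(self, model_id: str = "gpt2"):
        from transformers import AutoTokenizer  # deferred, optional dependency
        self._tok = AutoTokenizer.from_pretrained(model_id)

    def encode(self, text: str) -> list:
        # Special tokens are managed by the facade, so skip them here.
        return self._tok.encode(text, add_special_tokens=False)

    def decode(self, ids: list) -> str:
        return self._tok.decode(ids)
```

Because `AutoTokenizer` resolves the right tokenizer class from the model's configuration, one thin wrapper covers thousands of model architectures.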
SentencePieceTokenizer offers tokenization for models that use SentencePiece, a common subword tokenization algorithm (e.g., Llama models). It handles the specific encoding and decoding nuances of SentencePiece, ensuring accurate tokenization for these model architectures.
Related Classes/Methods:
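A SentencePiece backend needs a trained `.model` file on disk, so the sketch below only shows the wrapper's shape; the class name is an assumption, while `SentencePieceProcessor`, `encode(..., out_type=int)`, and `decode` are the library's documented API.

```python
class SentencePieceAdapter:
    """Hypothetical wrapper; requires a trained SentencePiece .model file."""

    def __init__(self, model_path: str):
        import sentencepiece as spm  # deferred, optional dependency
        self._sp = spm.SentencePieceProcessor(model_file=model_path)

    def encode(self, text: str) -> list:
        # out_type=int yields token IDs rather than subword strings.
        return self._sp.encode(text, out_type=int)

    def decode(self, ids: list) -> str:
        return self._sp.decode(ids)
```

Unlike byte-pair encodings, SentencePiece encodes whitespace explicitly (as the `▁` meta-symbol), which is why decoding must go through the library rather than simple string joining.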
PythonBackedTokenizer provides a pure-Python tokenizer implementation. It serves as a fallback, or covers internal LMQL tokenization needs where external libraries are unsuitable or unavailable, guaranteeing a baseline tokenization capability.
Related Classes/Methods:
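A dependency-free fallback of this kind can be sketched with a simple regex-based scheme. This is a toy illustration, not LMQL's actual implementation, which would use a fixed vocabulary rather than one built on the fly.

```python
import re


class PythonBackedAdapter:
    """Toy pure-Python fallback: word/space/punctuation tokens with a
    vocabulary grown on first sight. Illustration only."""

    _PATTERN = re.compile(r"\w+|\s+|[^\w\s]")

    def __init__(self):
        self._vocab = {}    # token string -> id
        self._inverse = []  # id -> token string

    def encode(self, text: str) -> list:
        ids = []
        for piece in self._PATTERN.findall(text):
            if piece not in self._vocab:
                self._vocab[piece] = len(self._inverse)
                self._inverse.append(piece)
            ids.append(self._vocab[piece])
        return ids

    def decode(self, ids: list) -> str:
        return "".join(self._inverse[i] for i in ids)
```

Because the regex partitions the input exhaustively (words, whitespace runs, single punctuation marks), decoding exactly reconstructs the original text, which is the key invariant any fallback tokenizer must preserve.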