```mermaid
graph LR
    LMQLTokenizer["LMQLTokenizer"]
    _load_tokenizer["_load_tokenizer"]
    TiktokenTokenizer["TiktokenTokenizer"]
    TransformersTokenizer["TransformersTokenizer"]
    SentencePieceTokenizer["SentencePieceTokenizer"]
    PythonBackedTokenizer["PythonBackedTokenizer"]
    _load_tokenizer -- "initializes" --> LMQLTokenizer
    _load_tokenizer -- "instantiates" --> TiktokenTokenizer
    _load_tokenizer -- "instantiates" --> TransformersTokenizer
    _load_tokenizer -- "instantiates" --> SentencePieceTokenizer
    _load_tokenizer -- "instantiates" --> PythonBackedTokenizer
    LMQLTokenizer -- "uses" --> TiktokenTokenizer
    LMQLTokenizer -- "uses" --> TransformersTokenizer
    LMQLTokenizer -- "uses" --> SentencePieceTokenizer
    LMQLTokenizer -- "uses" --> PythonBackedTokenizer
```
## Details

The Tokenizer subsystem is crucial for the LMQL project, providing the essential text-to-token and token-to-text conversion services required for interaction with Large Language Models. It embodies the Adapter and Strategy patterns to support diverse tokenizer implementations.

### LMQLTokenizer

Serves as the primary abstract interface and facade for all tokenization and detokenization operations within LMQL. It provides a consistent API, abstracting the complexities of various underlying tokenizer implementations, and manages special tokens and tokenizer properties. This component is critical for maintaining a clean separation between the LMQL runtime and specific LLM tokenizer details, embodying the Adapter Pattern.

Related Classes/Methods:
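The facade idea can be sketched as follows. Note that the backend protocol, the `DummyBackend` class, and the method bodies are illustrative assumptions for this sketch, not LMQL's actual implementation:

```python
class DummyBackend:
    """Stand-in backend (hypothetical): maps each character to its code point."""

    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)


class LMQLTokenizer:
    """Facade: exposes one tokenize/detokenize API and delegates to a backend."""

    def __init__(self, backend, model_identifier):
        self.backend = backend
        self.model_identifier = model_identifier

    def tokenize(self, text):
        return self.backend.encode(text)

    def detokenize(self, ids):
        return self.backend.decode(ids)


tok = LMQLTokenizer(DummyBackend(), "dummy-model")
ids = tok.tokenize("hi")
print(ids)                  # [104, 105]
print(tok.detokenize(ids))  # hi
```

Because callers only see `tokenize`/`detokenize`, any of the concrete backends below can be swapped in without touching the runtime.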

### _load_tokenizer

A factory function responsible for dynamically selecting and instantiating the appropriate concrete tokenizer backend based on the provided model identifier or configuration. It ensures the correct tokenizer is loaded and initialized for use by LMQLTokenizer, acting as a key part of the Strategy Pattern implementation.

Related Classes/Methods:
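A simplified factory in the spirit of `_load_tokenizer` might dispatch on the model identifier like this; the dispatch rules, the `load_tokenizer` name, and the empty backend classes are assumptions for the sketch, not LMQL's real selection logic:

```python
# Hypothetical stand-ins for the concrete backends.
class TiktokenTokenizer: ...
class TransformersTokenizer: ...
class PythonBackedTokenizer: ...


def load_tokenizer(model_identifier: str):
    """Pick a tokenizer backend from the model identifier (illustrative rules)."""
    if model_identifier.startswith(("gpt-3.5", "gpt-4", "openai/")):
        return TiktokenTokenizer()
    try:
        import transformers  # prefer HuggingFace tokenizers when available
        return TransformersTokenizer()
    except ImportError:
        return PythonBackedTokenizer()  # pure-Python fallback


print(type(load_tokenizer("gpt-4")).__name__)  # TiktokenTokenizer
```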

### TiktokenTokenizer

Provides the concrete implementation for tokenization using the tiktoken library, primarily used for OpenAI models. It handles encoding text to token IDs and decoding token IDs back to text, along with managing tiktoken-specific special tokens. This is one of the core strategies for LLM integration.

Related Classes/Methods:

### TransformersTokenizer

Implements tokenization logic by wrapping HuggingFace Transformers tokenizers. This component leverages the extensive model support of the HuggingFace ecosystem for encoding, decoding, and handling special tokens, significantly broadening LMQL's compatibility with various LLMs.

Related Classes/Methods:
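The HuggingFace side looks similar via `AutoTokenizer` (requires the `transformers` package; the `gpt2` tokenizer files are downloaded on first use and are chosen here purely as an example model):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

ids = tok.encode("Hello, tokenizer!")   # text -> token IDs
text = tok.decode(ids)                  # token IDs -> text
assert text == "Hello, tokenizer!"
```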

### SentencePieceTokenizer

Offers tokenization capabilities for models that utilize SentencePiece, a common subword tokenization algorithm (e.g., Llama models). It handles the specific encoding and decoding nuances of SentencePiece, ensuring accurate tokenization for these model architectures.

Related Classes/Methods:

### PythonBackedTokenizer

Provides a pure Python implementation of a tokenizer. This serves as a fallback or for specific internal LMQL tokenization needs where external libraries might not be suitable or available, ensuring a baseline tokenization capability.

Related Classes/Methods:
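A minimal pure-Python fallback can be sketched as a byte-level tokenizer with no external dependencies; the `ByteTokenizer` class below is an illustrative assumption, not LMQL's actual implementation:

```python
class ByteTokenizer:
    """Hypothetical fallback: tokenizes text into UTF-8 byte values (vocab size 256)."""

    vocab_size = 256

    def encode(self, text: str) -> list:
        return list(text.encode("utf-8"))

    def decode(self, ids: list) -> str:
        return bytes(ids).decode("utf-8")


t = ByteTokenizer()
ids = t.encode("déjà vu")
assert t.decode(ids) == "déjà vu"
print(len(ids))  # 9: the two accented characters take two bytes each
```

A byte-level scheme trades longer sequences for guaranteed coverage of any input, which is the usual rationale for such a baseline fallback.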