```mermaid
graph LR
    AudioBufferProcessor["AudioBufferProcessor"]
    VADAnalyzer["VADAnalyzer"]
    TranscriptProcessor["TranscriptProcessor"]
    DTMFAggregator["DTMFAggregator"]
    AudioBufferProcessor -- "provides processed audio to" --> VADAnalyzer
    VADAnalyzer -- "provides speech activity context to" --> TranscriptProcessor
```
This subsystem processes the distinct inputs derived from an audio stream, primarily speech and DTMF signals. The AudioBufferProcessor forms the foundation by managing raw audio data. The processed audio is then passed to the VADAnalyzer for voice activity detection, which supplies speech-activity context to downstream stages. The TranscriptProcessor uses this context to accurately aggregate textual transcriptions, while the DTMFAggregator independently handles DTMF tones. Separating these paths allows parallel, specialized processing of each audio-derived data type, keeping conversational AI interactions responsive and accurate.
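The data flow above can be sketched with a minimal frame-pipeline abstraction. All names here (`Frame`, `Processor`, `link`, `push`) are hypothetical, chosen for illustration; they are not the framework's actual API:

```python
from dataclasses import dataclass


@dataclass
class Frame:
    kind: str      # e.g. "audio", "vad", "transcript", "dtmf"
    payload: object


class Processor:
    """Base stage: handle a frame, then forward results downstream."""

    def __init__(self):
        self.downstream = []

    def link(self, other):
        """Connect a downstream stage; returns it so links can be chained."""
        self.downstream.append(other)
        return other

    def push(self, frame):
        for out in self.process(frame):
            for nxt in self.downstream:
                nxt.push(out)

    def process(self, frame):
        yield frame  # default behavior: pass the frame through unchanged


# Wiring mirrors the diagram: audio -> VAD -> transcript,
# with DTMF handled on an independent, parallel path.
audio = Processor()
vad = Processor()
transcript = Processor()
dtmf = Processor()
audio.link(vad).link(transcript)
```

Each concrete stage would override `process` to transform or filter frames; the base class only defines the plumbing.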
Manages the buffering, recording, and initial processing of raw audio data, serving as the entry point for audio streams into the system.
Related Classes/Methods:
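A core concern of such a component is chunking a continuous byte stream into fixed-size frames. This is a minimal sketch of that idea, not the actual AudioBufferProcessor implementation:

```python
class AudioBuffer:
    """Illustrative sketch: accumulate raw PCM chunks and emit
    fixed-size frames for downstream processing."""

    def __init__(self, frame_bytes: int = 320):
        # 320 bytes = 10 ms of 16 kHz, 16-bit, mono PCM (assumed format)
        self.frame_bytes = frame_bytes
        self._buf = bytearray()

    def write(self, chunk: bytes) -> list[bytes]:
        """Append a chunk and return any complete frames it produced."""
        self._buf.extend(chunk)
        frames = []
        while len(self._buf) >= self.frame_bytes:
            frames.append(bytes(self._buf[: self.frame_bytes]))
            del self._buf[: self.frame_bytes]
        return frames
```

Incoming chunks rarely align with frame boundaries, so the buffer carries the remainder over to the next `write` call.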
Analyzes processed audio frames to detect human voice activity (VAD) and compute volume, providing crucial signals for subsequent speech-related processing.
Related Classes/Methods:
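The simplest form of volume computation and voice activity detection is an RMS energy gate over each PCM frame. This toy version illustrates the signals involved; real analyzers add smoothing and typically use a model-based detector:

```python
import math
import struct


def frame_volume(pcm: bytes) -> float:
    """RMS volume of a 16-bit little-endian mono PCM frame, normalized to 0..1."""
    if not pcm:
        return 0.0
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms / 32768.0


def is_speech(pcm: bytes, threshold: float = 0.02) -> bool:
    """Toy energy-gate VAD: a frame counts as speech if it is loud enough.
    The threshold value here is an arbitrary illustration."""
    return frame_volume(pcm) > threshold
```

The boolean speech/no-speech signal is what downstream stages consume as "speech activity context."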
Aggregates and processes textual transcription frames, distinguishing between speakers and preparing the aggregated text for further use. It operates on data derived from speech-to-text (STT) output.
Related Classes/Methods:
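Aggregation typically means collecting per-speaker STT fragments (ignoring interim results) and joining them when an utterance completes. A sketch under those assumptions, with hypothetical names throughout:

```python
from dataclasses import dataclass


@dataclass
class TranscriptFrame:
    speaker: str   # e.g. "user" or "assistant"
    text: str
    final: bool    # STT engines emit interim results before a final one


class TranscriptAggregator:
    """Sketch: collect final STT fragments per speaker, join them on flush."""

    def __init__(self):
        self._parts: dict[str, list[str]] = {}

    def add(self, frame: TranscriptFrame):
        if frame.final:  # drop interim hypotheses
            self._parts.setdefault(frame.speaker, []).append(frame.text.strip())

    def flush(self, speaker: str) -> str:
        """Return and clear the aggregated text for one speaker."""
        return " ".join(self._parts.pop(speaker, []))
```

Keying the buffer by speaker is what lets one aggregator distinguish user and assistant transcription streams.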
Aggregates DTMF (Dual-Tone Multi-Frequency) signals, managing their detection and handling interruptions, and thereby provides a pathway for non-speech input.
Related Classes/Methods: