A natural language to SQL data analysis agent. Users ask business questions in natural language; the system extracts keywords, retrieves metadata context, generates and validates SQL, executes it, and streams back an LLM-generated summary with tabular results.
User Question
↓
extract_keywords (jieba segmentation + LLM keyword extraction)
↓
┌──────────────────────────────────┐
│ recall_column │ recall_value │ recall_metric │ (parallel)
└──────────────────────────────────┘
↓
merge_retrieved_info
↓
┌──────────────────────────────────┐
│ filter_table │ filter_metric │ (parallel)
└──────────────────────────────────┘
↓
add_extra_context → generate_sql → validate_sql
↓
execute_sql → [has result] → summarize_result → END
→ [no result] → END
Each node retrieves schema metadata from MySQL (table structures, column descriptions, metric definitions) and vector search results from Qdrant/ES, then passes enriched context to the LLM for SQL generation. SQL is validated via EXPLAIN before execution. Results are summarized in natural language via LLM before streaming to the frontend.
| Layer | Technology |
|---|---|
| Backend | Python >= 3.12 · FastAPI · LangGraph · LangChain |
| Frontend | Vue 3 · Vite |
| Storage | MySQL · Qdrant · Elasticsearch |
| LLM / Embedding | OpenAI-compatible API (DashScope / DeepSeek) |
| Keyword Extraction | jieba (Chinese word segmentation) |
| ORM | SQLAlchemy |
| Communication | Server-Sent Events (SSE) |
InsightData/
├── data-agent/ # Python backend
│ ├── app/
│ │ ├── agent/ # LangGraph state graph and node implementations
│ │ │ ├── graph.py # Graph definition (node wiring + conditional edges)
│ │ │ └── nodes/ # extract_keywords, recall_*, filter_*, generate_sql, etc.
│ │ ├── api/
│ │ │ ├── routers/ # FastAPI route handlers (query_router with SSE)
│ │ │ └── schemas/ # Request/response Pydantic models
│ │ ├── clients/ # Database + embedding client managers
│ │ ├── conf/ # Config structure definitions
│ │ ├── core/ # Lifespan management, logging, request context
│ │ ├── entities/ # Data entity definitions (column_info, metric_info, etc.)
│ │ ├── models/ # SQLAlchemy ORM models
│ │ ├── prompt/ # Prompt loading logic
│ │ ├── repositories/ # Data access layer (MySQL, Qdrant, ES)
│ │ ├── scripts/ # Meta knowledge initialization scripts
│ │ └── services/ # Business logic (query orchestration)
│ ├── conf/ # Runtime config (app_config.yaml — gitignored)
│ ├── prompts/ # LLM prompt templates (8 .prompt files)
│ ├── deployment/ # Docker compose + reference configs
│ ├── main.py # Application entry point
│ └── pyproject.toml # Dependency definitions
├── data-agent-frontend/ # Vue 3 frontend
└── CLAUDE.md
- Docker Desktop (MySQL / ES / Qdrant)
- uv (Python package manager)
- Node.js >= 18
cd data-agent/deployment/docker
docker compose up -d| Service | Address |
|---|---|
| MySQL | localhost:3307 |
| Elasticsearch | localhost:9200 |
| Kibana | localhost:5601 |
| Qdrant | localhost:6333 |
cd data-agent
cp conf/app_config.sample.yaml conf/app_config.yamlEdit conf/app_config.yaml:
- MySQL credentials (
db_meta/db_dw) - LLM API key and base URL (e.g., DashScope / DeepSeek)
- Embedding service API key
cd data-agent
uv sync --frozenuv run python -m app.scripts.build_meta_knowledge -c conf/meta_config.yamlThis populates MySQL with table/column/metric definitions, builds Qdrant vector index, and creates ES search index.
uv run uvicorn main:app --host 0.0.0.0 --port 8000 --reloadAPI docs: http://localhost:8000/docs
cd ../data-agent-frontend
npm ci
npm run devOpen http://localhost:5173 and start asking business questions.
- "统计2025年各地区的销售总额" (Total sales by region in 2025)
- "哪个产品的销量最高" (Which product has the highest sales volume)
- "各渠道的订单数量对比" (Order count comparison across channels)
| File | Purpose |
|---|---|
conf/app_config.yaml |
Runtime config with credentials (gitignored) |
conf/app_config.sample.yaml |
Public sample config |
conf/meta_config.yaml |
Meta knowledge base schema definitions |
deployment/docker/docker-compose.yaml |
Infrastructure service orchestration |