docs: update plan status to implemented and add Package 9 API examples to README
Parent: 032dd75e17
Commit: 3e1f053f73

@@ -2,7 +2,26 @@

**Source**: User request (2026-05-25)
**Scope**: New backend API endpoints for generating test results (audio + text inputs) and evaluating pipeline accuracy against ground truth. Results Generation APIs capture every pipeline stage as structured JSON. Results Evaluation API measures transcription quality (CER/WER), key question extraction quality (dual-model marking scheme), chunk retrieval/filtering accuracy (batched binary relevance methodology), and final response completeness.
**Status**: ✅ Implemented (Sub-Phases 9.0–9.4 complete, 78/78 tests passing); previously Draft, Planning Phase (no code)

---

## Progress Summary

| Sub-Phase | Commit | Status | Tests |
|-----------|--------|--------|-------|
| 9.0 Config & Models | `852430f` | ✅ | 32/32 |
| 9.1 Generation APIs | `ac81df0` | ✅ | 51/51 (cumulative) |
| 9.2 Evaluation Engine | `098be35` | ✅ | 76/76 (cumulative) |
| 9.3 Evaluate API | `032dd75` | ✅ | 78/78 (cumulative) |
| 9.4 Polish | `032dd75` | ✅ | 78/78, no regressions |

### Implementation Notes (Deviations from Plan)

- **Pydantic model naming**: All `Test*`-prefixed models were renamed to drop the `Test` prefix (e.g., `TestGenerateResult` → `GenerateResult`, `TestEvaluationResult` → `EvaluationResult`). This avoids pytest collection conflicts, since pytest treats `Test*` classes as test classes by default.
- **`DimensionScores` field type**: Changed from `int` to `float` to support averaged scores with decimal precision.
- **Single test file**: `test_phase9_chunk_response_eval.py` combines the chunk evaluator and response evaluator tests (the plan had separate files).
- **Pipeline refactoring**: Created a `RAGPipeline` class in `services/rag_pipeline.py` instead of modifying `query.py` directly. Its reusable async generator yields `PipelineSnapshot` dataclasses. The existing `_query_stream()` in `query.py` was left unchanged, so it carries no regression risk.

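
The pipeline refactoring noted above can be pictured with a minimal sketch. Only the names `RAGPipeline` and `PipelineSnapshot` come from the notes; the `run` method, the snapshot fields, the stage names, and the stubbed stage outputs are assumptions made for illustration and are not the project's actual code.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, AsyncIterator


@dataclass
class PipelineSnapshot:
    # Hypothetical fields; the real dataclass may carry different ones.
    stage: str
    payload: dict[str, Any] = field(default_factory=dict)


class RAGPipeline:
    """Illustrative stand-in that yields one snapshot per completed stage."""

    async def run(self, question: str) -> AsyncIterator[PipelineSnapshot]:
        # Every stage is stubbed; a real pipeline would await LLM and retriever calls.
        yield PipelineSnapshot("decompose", {"key_questions": [question]})
        chunks = [{"id": "c1"}, {"id": "c2"}]
        yield PipelineSnapshot("retrieve", {"chunks": chunks})
        yield PipelineSnapshot("filter", {"chunks": chunks[:1]})
        yield PipelineSnapshot("generate", {"final_answer": "stub answer"})


async def collect(question: str) -> list[PipelineSnapshot]:
    # A test endpoint could gather every snapshot into one structured result.
    return [snap async for snap in RAGPipeline().run(question)]


snapshots = asyncio.run(collect("example question"))
print([snap.stage for snap in snapshots])
```

Because the consumer decides what to do with each snapshot, a streaming endpoint and a result-capturing test endpoint could share the same generator without touching `_query_stream()`.
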
---

@@ -60,12 +79,14 @@ Evaluated by two advanced models independently, scores averaged:

### What's Missing

All items previously listed as "missing" have been implemented in Sub-Phases 9.0–9.4:

- ✅ **Test profile config**: Profiles A/B/C in `PromptService` only control prompt templates. Model selection stays with env vars (`LLM_MODEL_NAME`, `DP_MODEL_NAME`), unchanged and as intended: profiles switch prompts, not models.
- ✅ **Results JSON format**: All Pydantic models are defined in `backend/app/models/testing.py` (26 models). Previously there was no defined schema for capturing pipeline stage outputs in a structured, machine-readable format.
- ✅ **Evaluation infrastructure**: Full evaluation engine in `services/key_questions_evaluator.py`, `chunk_evaluator.py`, `response_evaluator.py`, and `cer_wer.py`. Previously the codebase had no evaluation metrics, scoring functions, or comparison logic.
- ✅ **Ground-truth chunk determination**: Batched LLM-based binary relevance evaluation in `chunk_evaluator.py`. Previously no method existed for independently identifying which chunks should be relevant to a question.
- ✅ **CER/WER calculation**: Levenshtein-distance implementation in `services/cer_wer.py`. Previously there were no transcription accuracy measurement utilities.
- ✅ **Dual-model scoring**: `key_questions_evaluator.py` uses separate `LLMClient` instances, `asyncio.gather` parallelism, and 3x retry. Previously there was no infrastructure for having two LLMs independently evaluate the same output and averaging the scores.

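
For readers unfamiliar with the CER/WER item above, here is a minimal sketch of a Levenshtein-based error rate. It is not the contents of `services/cer_wer.py`: the function names are hypothetical, and splitting words on whitespace is an assumption that would not suit unsegmented Chinese text without a tokenizer.

```python
def levenshtein(ref, hyp) -> int:
    """Edit distance between two sequences; insert, delete, substitute cost 1 each."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(list(reference), list(hypothesis)) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens (an assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


print(cer("abcd", "abxd"))                       # one substitution over four characters
print(wer("the cat sat", "the cat sat down"))    # one insertion over three words
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, which is why the denominator is always the reference length.
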
### Language Context

README.md (246 lines changed)

@@ -316,6 +316,252 @@ wget -qO- https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.
cp /tmp/ffmpeg-*-static/ffmpeg ~/.local/bin/
```

## Accuracy Testing API (Package 9)

Backend endpoints for generating test results and evaluating RAG pipeline accuracy. Designed for programmatic use: call via `curl`, Python `requests`, or any HTTP client.

All endpoints live under `/api/v1/test/*` and are reachable through the same domain as the frontend (nginx proxies all paths to FastAPI).

### 1. Generate Test Result (Text)

Run the full RAG pipeline on a text question and capture every intermediate stage.

```bash
# Basic text generation (-d makes curl send a POST request)
# The Cantonese question reads: "What issues did the Legislative Council discuss today?"
curl http://localhost:8000/api/v1/test/generate/text \
  -H "Content-Type: application/json" \
  -d '{
    "question": "立法會今日討論咗咩議題?",
    "profile": "A",
    "label": "Test run 2026-05-25"
  }'

# Response includes:
# - extracted_key_questions: decomposed sub-questions
# - retrieval: per-sub-question chunks with metadata and distance scores
# - filtered: chunks after relevance filter (with relevance_score)
# - response: final answer with source citations
# - timing: per-stage timing in ms
```

**Response** (partial):

```json
{
  "result_id": "a1b2c3d4e5f6",
  "input_type": "text",
  "profile": "A",
  "label": "Test run 2026-05-25",
  "input": { "text": "立法會今日討論咗咩議題?" },
  "extracted_key_questions": [
    "立法會:今日討論的主要議題",
    "會議記錄:近期立法會會議的討論內容"
  ],
  "retrieval": {
    "total_chunks_retrieved": 20,
    "retriever_time_ms": 456
  },
  "filtered": {
    "total_chunks_filtered": 14,
    "filter_time_ms": 789
  },
  "response": {
    "final_answer": "## Sub-question 0: ...\n\n- 今日立法會討論了三項主要議題... [meeting_minutes.pdf, page 1]",
    "generate_time_ms": 1011
  },
  "timing": {
    "decomposer_time_ms": 234,
    "total_time_ms": 2490
  }
}
```

### 2. Generate Test Result (Audio)

Transcribe audio via ASR, then run the RAG pipeline on the transcribed text. Optionally provide a reference transcript for later CER/WER evaluation.

```bash
# Audio generation with reference transcript (for later CER/WER scoring)
curl http://localhost:8000/api/v1/test/generate/audio \
  -F "audio_file=@legco_clip.wav" \
  -F "profile=A" \
  -F "reference_transcript=立法會今日討論咗咩議題?" \
  -F "language=yue" \
  -F "label=Cantonese LegCo audio test"

# Without reference transcript (CER/WER will return N/A in evaluation)
curl http://localhost:8000/api/v1/test/generate/audio \
  -F "audio_file=@meeting.mp3" \
  -F "profile=B" \
  -F "language=yue"
```

Compared to the text endpoint, the audio result includes extra fields:

```json
{
  "input_type": "audio",
  "input": {
    "text": "立法會今日討論咗咩議題?",
    "reference_transcript": "立法會今日討論咗咩議題?",
    "audio_filename": "legco_clip.wav",
    "audio_duration_seconds": 45.2,
    "asr_language": "yue"
  },
  "timing": {
    "asr_time_ms": 1234,
    "total_time_ms": 3724
  }
}
```

### 3. Evaluate Test Result

Run all four evaluation dimensions on a previously generated result:

- **(i) Audio transcription accuracy**: CER/WER (only for audio inputs with a reference transcript)
- **(ii) Key questions quality**: Two evaluator LLMs score against a 4-dimension rubric, and their scores are averaged
- **(iii) Chunk accuracy**: An LLM determines the ground-truth chunks, then precision/recall/F1 are computed against them
- **(iv) Response completeness**: An ideal response is generated from the ground-truth chunks and compared with the actual response

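
As a rough illustration of dimension (iii), the sketch below computes precision, recall, and F1 from two sets of chunk IDs. The function name and the ID values are hypothetical, and the real `chunk_evaluator.py` may aggregate per sub-question differently.

```python
def precision_recall_f1(retrieved_ids, relevant_ids):
    """Compare the chunks the pipeline returned with the ground-truth relevant chunks."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Five chunks retrieved, three of them relevant, and no relevant chunk missed.
p, r, f1 = precision_recall_f1(["c1", "c2", "c3", "c4", "c5"], ["c1", "c2", "c3"])
print(round(p, 2), round(r, 2), round(f1, 2))
```

In this example the filter stage would reach a perfect score by dropping `c4` and `c5`, which is the kind of before/after comparison that separate unfiltered and filtered scores make possible.
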
```bash
# Evaluate a previously saved result
curl http://localhost:8000/api/v1/test/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "result_id": "a1b2c3d4e5f6",
    "evaluation_config": {
      "key_questions_evaluators": [
        {
          "model_name": "deepseek-v4-pro",
          "base_url": "https://api.deepseek.com",
          "api_key_env": "DP_API_KEY",
          "enable_thinking": true
        },
        {
          "model_name": "qwen3-7b-max",
          "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
          "api_key_env": "DASHSCOPE_API_KEY",
          "enable_thinking": true
        }
      ],
      "chunk_evaluator": {
        "model_name": "qwen/qwen3.6-35b-a3b",
        "base_url": "https://openrouter.ai/api/v1",
        "api_key_env": "LLM_API_KEY",
        "enable_thinking": true
      },
      "response_evaluator": {
        "model_name": "qwen/qwen3.6-35b-a3b",
        "base_url": "https://openrouter.ai/api/v1",
        "api_key_env": "LLM_API_KEY",
        "enable_thinking": true
      }
    }
  }'
```

**Response** (partial, showing the scoring structure):

```json
{
  "evaluation_id": "eval-abc123",
  "result_id": "a1b2c3d4e5f6",
  "status": "completed",
  "audio_evaluation": {
    "status": "completed",
    "cer": 0.052,
    "wer": 0.083
  },
  "key_questions_evaluation": {
    "average_scores": {
      "dimension_1_準確性": 36.0,
      "dimension_2_完整性": 22.5,
      "dimension_3_清晰度": 17.5,
      "dimension_4_簡潔性": 13.5
    },
    "average_total": 89.5
  },
  "chunk_evaluation": {
    "overall_unfiltered": { "avg_precision": 0.60, "avg_recall": 1.00, "avg_f1": 0.75 },
    "overall_filtered": { "avg_precision": 1.00, "avg_recall": 1.00, "avg_f1": 1.00 }
  },
  "response_evaluation": {
    "overall_completeness": 0.85,
    "overall_factual_accuracy": 0.92
  }
}
```

### 4. Manage Results & Evaluations

```bash
# List all saved test results
# (quote the URL: an unquoted & would send curl to the background and drop offset=0)
curl "http://localhost:8000/api/v1/test/results?limit=10&offset=0"

# Retrieve a specific result
curl http://localhost:8000/api/v1/test/results/a1b2c3d4e5f6

# Delete a result
curl -X DELETE http://localhost:8000/api/v1/test/results/a1b2c3d4e5f6

# List all evaluation results
curl "http://localhost:8000/api/v1/test/evaluations?limit=10"

# Retrieve a specific evaluation
curl http://localhost:8000/api/v1/test/evaluations/eval-abc123

# Delete an evaluation
curl -X DELETE http://localhost:8000/api/v1/test/evaluations/eval-abc123
```

### Full Workflow Example

```bash
# 1. Generate a test result
# The Cantonese question reads: "What housing policy did the Legislative Council discuss?"
RESULT=$(curl -s http://localhost:8000/api/v1/test/generate/text \
  -H "Content-Type: application/json" \
  -d '{"question": "立法會討論咗咩房屋政策?", "profile": "A", "label": "housing policy test"}')
RESULT_ID=$(echo "$RESULT" | python3 -c "import sys,json; print(json.load(sys.stdin)['result_id'])")
echo "Generated result: $RESULT_ID"

# 2. Evaluate that result
curl -s http://localhost:8000/api/v1/test/evaluate \
  -H "Content-Type: application/json" \
  -d "{
    \"result_id\": \"$RESULT_ID\",
    \"evaluation_config\": {
      \"key_questions_evaluators\": [
        {\"model_name\": \"deepseek-v4-pro\", \"base_url\": \"https://api.deepseek.com\", \"api_key_env\": \"DP_API_KEY\", \"enable_thinking\": true},
        {\"model_name\": \"qwen3-7b-max\", \"base_url\": \"https://dashscope.aliyuncs.com/compatible-mode/v1\", \"api_key_env\": \"DASHSCOPE_API_KEY\", \"enable_thinking\": true}
      ],
      \"chunk_evaluator\": {\"model_name\": \"qwen/qwen3.6-35b-a3b\", \"base_url\": \"https://openrouter.ai/api/v1\", \"api_key_env\": \"LLM_API_KEY\", \"enable_thinking\": true},
      \"response_evaluator\": {\"model_name\": \"qwen/qwen3.6-35b-a3b\", \"base_url\": \"https://openrouter.ai/api/v1\", \"api_key_env\": \"LLM_API_KEY\", \"enable_thinking\": true}
    }
  }" | python3 -m json.tool

# 3. Check results
curl -s "http://localhost:8000/api/v1/test/results?limit=5" | python3 -m json.tool
curl -s "http://localhost:8000/api/v1/test/evaluations?limit=5" | python3 -m json.tool
```

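
The same workflow can be driven from Python, which avoids escaping JSON inside a shell string. The sketch below uses only the standard library; the helper names (`evaluator`, `build_evaluate_payload`, `post_json`, `main`) are hypothetical, and the endpoint paths and field names are copied from the curl examples rather than from the backend source.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1/test"  # same host as the curl examples


def evaluator(model_name, base_url, api_key_env, enable_thinking=True):
    """One evaluator entry, mirroring the fields used in the curl payload."""
    return {"model_name": model_name, "base_url": base_url,
            "api_key_env": api_key_env, "enable_thinking": enable_thinking}


def build_evaluate_payload(result_id):
    """Assemble the /evaluate request body for a previously generated result."""
    openrouter = evaluator("qwen/qwen3.6-35b-a3b",
                           "https://openrouter.ai/api/v1", "LLM_API_KEY")
    return {
        "result_id": result_id,
        "evaluation_config": {
            "key_questions_evaluators": [
                evaluator("deepseek-v4-pro", "https://api.deepseek.com", "DP_API_KEY"),
                evaluator("qwen3-7b-max",
                          "https://dashscope.aliyuncs.com/compatible-mode/v1",
                          "DASHSCOPE_API_KEY"),
            ],
            "chunk_evaluator": openrouter,
            "response_evaluator": dict(openrouter),
        },
    }


def post_json(path, payload):
    """POST a JSON body and decode the JSON reply (requires a running backend)."""
    request = urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def main():
    result = post_json("/generate/text", {"question": "立法會討論咗咩房屋政策?",
                                          "profile": "A",
                                          "label": "housing policy test"})
    evaluation = post_json("/evaluate", build_evaluate_payload(result["result_id"]))
    print(json.dumps(evaluation, ensure_ascii=False, indent=2))
```

Calling `main()` with the backend running performs steps 1 and 2 of the shell workflow; nothing is sent over the network until then.
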
### Key Questions Marking Scheme (4 Dimensions)

| Dimension | Weight | Full-mark criterion |
|-----------|--------|---------------------|
| 1. 準確性 (Fidelity) | 40 points | Fully faithful to the original meaning; numbers and keywords are correct |
| 2. 完整性 (Completeness) | 25 points | Covers all key elements (question + background + purpose) |
| 3. 清晰度 (Clarity) | 20 points | Precise language, clear logic, easy to read and understand |
| 4. 簡潔性 (Conciseness) | 15 points | Conveys the most complete meaning in the fewest words |

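
To make the averaging concrete, the sketch below combines two evaluators' scores under the 40/25/20/15 weights from the table. The per-evaluator scores are invented for illustration, and the English dictionary keys and the `average_scores` helper are simplifications, not the API's field names.

```python
# Maximum points per dimension, taken from the marking scheme table.
MAX_POINTS = {"fidelity": 40, "completeness": 25, "clarity": 20, "conciseness": 15}


def average_scores(evaluations):
    """Average each dimension across evaluators and sum the averages."""
    for scores in evaluations:
        for name, limit in MAX_POINTS.items():
            if not 0 <= scores[name] <= limit:
                raise ValueError(f"{name} score {scores[name]} is outside 0..{limit}")
    averages = {name: sum(s[name] for s in evaluations) / len(evaluations)
                for name in MAX_POINTS}
    return averages, sum(averages.values())


# Hypothetical raw scores from the two evaluator models.
evaluator_a = {"fidelity": 37, "completeness": 23, "clarity": 18, "conciseness": 14}
evaluator_b = {"fidelity": 35, "completeness": 22, "clarity": 17, "conciseness": 13}

averages, total = average_scores([evaluator_a, evaluator_b])
print(averages, total)
```

Averaging two integer scores can produce halves such as 22.5, which is consistent with the `float` scores shown in the sample evaluation response.
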
### Requirements for Evaluation

- All evaluation prompts, marking schemes, and LLM interactions are in **Chinese**
- Both key questions evaluator models must succeed (3 retries each); if either fails, no partial score is returned
- Chunk evaluation processes ALL chunks from ALL documents in batches of 10
- Thinking mode (`enable_thinking: true`) is enabled on all evaluation models
- Stored results and evaluations are not auto-deleted; remove them via the DELETE endpoints

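
The "both must succeed" rule can be sketched with `asyncio.gather`, which raises as soon as any awaited evaluator fails for good. This is an illustration under assumptions, not the project's `key_questions_evaluator.py`: the helper names are hypothetical, the evaluators are simulated, and "3 retries" is read here as three attempts in total.

```python
import asyncio

MAX_ATTEMPTS = 3  # assumption: three attempts in total per evaluator


async def score_with_retry(evaluate, attempts=MAX_ATTEMPTS):
    """Call one evaluator, retrying on any exception up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return await evaluate()
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"evaluator failed after {attempts} attempts") from last_error


async def score_with_both(evaluators):
    """Run all evaluators concurrently; any final failure aborts the whole score."""
    totals = await asyncio.gather(*(score_with_retry(e) for e in evaluators))
    return sum(totals) / len(totals)


def make_flaky(score, failures):
    """Simulated evaluator that fails `failures` times before returning `score`."""
    state = {"left": failures}

    async def evaluate():
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("simulated transient failure")
        return score

    return evaluate


average = asyncio.run(score_with_both([make_flaky(90.0, 2), make_flaky(89.0, 0)]))
print(average)
```

An evaluator that fails on all three attempts makes `score_with_both` raise instead of returning the other model's score alone, which matches the no-partial-scores requirement.
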
## Notes

- PDF upload limit: 300MB