diff --git a/.plans/accuracy_testing_enhancement_plan.md b/.plans/accuracy_testing_enhancement_plan.md
index 7667ca5..d4b4b49 100644
--- a/.plans/accuracy_testing_enhancement_plan.md
+++ b/.plans/accuracy_testing_enhancement_plan.md
@@ -2,7 +2,26 @@
 
 **Source**: User request (2026-05-25)
 **Scope**: New backend API endpoints for generating test results (audio + text inputs) and evaluating pipeline accuracy against ground truth. Results Generation APIs capture every pipeline stage as structured JSON. Results Evaluation API measures transcription quality (CER/WER), key question extraction quality (dual-model marking scheme), chunk retrieval/filtering accuracy (batched binary relevance methodology), and final response completeness.
-**Status**: Draft — Planning Phase (no code)
+**Status**: ✅ Implemented (Sub-Phases 9.0–9.4 complete, 78/78 tests passing)
+
+---
+
+## Progress Summary
+
+| Sub-Phase | Commit | Status | Tests |
+|-----------|--------|--------|-------|
+| 9.0 Config & Models | `852430f` | ✅ | 32/32 |
+| 9.1 Generation APIs | `ac81df0` | ✅ | 51/51 (cumulative) |
+| 9.2 Evaluation Engine | `098be35` | ✅ | 76/76 (cumulative) |
+| 9.3 Evaluate API | `032dd75` | ✅ | 78/78 (cumulative) |
+| 9.4 Polish | `032dd75` | ✅ | 78/78, no regressions |
+
+### Implementation Notes (Deviations from Plan)
+
+- **Pydantic model naming**: All `Test*`-prefixed models renamed to drop the `Test` prefix (e.g., `TestGenerateResult` → `GenerateResult`, `TestEvaluationResult` → `EvaluationResult`). This avoids pytest collection conflicts where `Test*` classes are treated as test cases.
+- **`DimensionScores` field type**: Changed from `int` to `float` to support averaged scores with decimal precision.
+- **Single test file**: `test_phase9_chunk_response_eval.py` combines chunk evaluator and response evaluator tests (plan had separate files).
+- **Pipeline refactoring**: Created `RAGPipeline` class in `services/rag_pipeline.py` instead of modifying `query.py` directly. The reusable async generator yields `PipelineSnapshot` dataclasses. The existing `_query_stream()` in `query.py` was left unchanged (no regression risk).
 
 ---
 
@@ -60,12 +79,14 @@ Evaluated by two advanced models independently, scores averaged:
 
 ### What's Missing
 
-- **No test profile config**: Profiles A/B/C in `PromptService` only control prompt templates. Model selection is via env vars (`LLM_MODEL_NAME`, `DP_MODEL_NAME`). This is the intended behavior — profiles switch prompts, not models.
-- **No results JSON format**: No defined schema for capturing pipeline stage outputs in a structured, machine-readable format.
-- **No evaluation infrastructure**: Zero evaluation metrics, scoring functions, or comparison logic exist in the codebase.
-- **No ground-truth chunk determination**: No method exists for independently identifying which chunks SHOULD be relevant to a question.
-- **No CER/WER calculation**: No transcription accuracy measurement utilities.
-- **No dual-model scoring**: No infrastructure for having two LLMs independently evaluate the same output and averaging scores.
+All items previously listed as "missing" have been implemented in Sub-Phases 9.0–9.4:
+
+- ✅ **Test profile config**: Profiles A/B/C in `PromptService` only control prompt templates. Model selection via env vars — unchanged, as intended.
+- ✅ **Results JSON format**: All Pydantic models defined in `backend/app/models/testing.py` (26 models).
+- ✅ **Evaluation infrastructure**: Full evaluation engine in `services/key_questions_evaluator.py`, `chunk_evaluator.py`, `response_evaluator.py`, `cer_wer.py`.
+- ✅ **Ground-truth chunk determination**: Batched LLM-based binary relevance evaluation in `chunk_evaluator.py`.
+- ✅ **CER/WER calculation**: Levenshtein-distance implementation in `services/cer_wer.py`.
+- ✅ **Dual-model scoring**: `key_questions_evaluator.py` with separate `LLMClient` instances, `asyncio.gather` parallelism, 3x retry.
 
 ### Language Context
 
diff --git a/README.md b/README.md
index 2a9f387..e660840 100644
--- a/README.md
+++ b/README.md
@@ -316,6 +316,252 @@ wget -qO- https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.
 cp /tmp/ffmpeg-*-static/ffmpeg ~/.local/bin/
 ```
 
+## Accuracy Testing API (Package 9)
+
+Backend endpoints for generating test results and evaluating RAG pipeline accuracy. Designed for programmatic use — call via `curl`, Python `requests`, or any HTTP client.
+
+All endpoints are at `/api/v1/test/*` and accessible through the same domain as the frontend (nginx proxies all paths to FastAPI).
+
+### 1. Generate Test Result (Text)
+
+Run the full RAG pipeline on a text question and capture every intermediate stage.
+
+```bash
+# Basic text generation
+curl http://localhost:8000/api/v1/test/generate/text \
+  -H "Content-Type: application/json" \
+  -d '{
+    "question": "立法會今日討論咗咩議題?",
+    "profile": "A",
+    "label": "Test run 2026-05-25"
+  }'
+
+# Response includes:
+# - extracted_key_questions: decomposed sub-questions
+# - retrieval: per-sub-question chunks with metadata and distance scores
+# - filtered: chunks after relevance filter (with relevance_score)
+# - response: final answer with source citations
+# - timing: per-stage timing in ms
+```
+
+**Response** (partial):
+
+```json
+{
+  "result_id": "a1b2c3d4e5f6",
+  "input_type": "text",
+  "profile": "A",
+  "label": "Test run 2026-05-25",
+  "input": { "text": "立法會今日討論咗咩議題?" },
+  "extracted_key_questions": [
+    "立法會:今日討論的主要議題",
+    "會議記錄:近期立法會會議的討論內容"
+  ],
+  "retrieval": {
+    "total_chunks_retrieved": 20,
+    "retriever_time_ms": 456
+  },
+  "filtered": {
+    "total_chunks_filtered": 14,
+    "filter_time_ms": 789
+  },
+  "response": {
+    "final_answer": "## Sub-question 0: ...\n\n- 今日立法會討論了三項主要議題... [meeting_minutes.pdf, page 1]",
+    "generate_time_ms": 1011
+  },
+  "timing": {
+    "decomposer_time_ms": 234,
+    "total_time_ms": 2490
+  }
+}
+```
+
+### 2. Generate Test Result (Audio)
+
+Transcribe audio via ASR, then run the RAG pipeline on the transcribed text. Optionally provide a reference transcript for later CER/WER evaluation.
+
+```bash
+# Audio generation with reference transcript (for later CER/WER scoring)
+curl http://localhost:8000/api/v1/test/generate/audio \
+  -F "audio_file=@legco_clip.wav" \
+  -F "profile=A" \
+  -F "reference_transcript=立法會今日討論咗咩議題?" \
+  -F "language=yue" \
+  -F "label=Cantonese LegCo audio test"
+
+# Without reference transcript (CER/WER will return N/A in evaluation)
+curl http://localhost:8000/api/v1/test/generate/audio \
+  -F "audio_file=@meeting.mp3" \
+  -F "profile=B" \
+  -F "language=yue"
+```
+
+Compared to the text endpoint, the audio result includes extra fields:
+
+```json
+{
+  "input_type": "audio",
+  "input": {
+    "text": "立法會今日討論咗咩議題?",
+    "reference_transcript": "立法會今日討論咗咩議題?",
+    "audio_filename": "legco_clip.wav",
+    "audio_duration_seconds": 45.2,
+    "asr_language": "yue"
+  },
+  "timing": {
+    "asr_time_ms": 1234,
+    "total_time_ms": 3724
+  }
+}
+```
+
+### 3. Evaluate Test Result
+
+Run all four evaluation dimensions on a previously generated result:
+
+- **(i) Audio transcription accuracy** — CER/WER (only for audio inputs with reference transcript)
+- **(ii) Key questions quality** — Two evaluator LLMs score against 4-dimension rubric, scores averaged
+- **(iii) Chunk accuracy** — LLM determines ground truth chunks, computes precision/recall/F1
+- **(iv) Response completeness** — Generate ideal response from ground truth chunks, compare
+
+```bash
+# Evaluate a previously saved result
+curl http://localhost:8000/api/v1/test/evaluate \
+  -H "Content-Type: application/json" \
+  -d '{
+    "result_id": "a1b2c3d4e5f6",
+    "evaluation_config": {
+      "key_questions_evaluators": [
+        {
+          "model_name": "deepseek-v4-pro",
+          "base_url": "https://api.deepseek.com",
+          "api_key_env": "DP_API_KEY",
+          "enable_thinking": true
+        },
+        {
+          "model_name": "qwen3-7b-max",
+          "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
+          "api_key_env": "DASHSCOPE_API_KEY",
+          "enable_thinking": true
+        }
+      ],
+      "chunk_evaluator": {
+        "model_name": "qwen/qwen3.6-35b-a3b",
+        "base_url": "https://openrouter.ai/api/v1",
+        "api_key_env": "LLM_API_KEY",
+        "enable_thinking": true
+      },
+      "response_evaluator": {
+        "model_name": "qwen/qwen3.6-35b-a3b",
+        "base_url": "https://openrouter.ai/api/v1",
+        "api_key_env": "LLM_API_KEY",
+        "enable_thinking": true
+      }
+    }
+  }'
+```
+
+**Response** (partial — shows scoring structure):
+
+```json
+{
+  "evaluation_id": "eval-abc123",
+  "result_id": "a1b2c3d4e5f6",
+  "status": "completed",
+  "audio_evaluation": {
+    "status": "completed",
+    "cer": 0.052,
+    "wer": 0.083
+  },
+  "key_questions_evaluation": {
+    "average_scores": {
+      "dimension_1_準確性": 36.0,
+      "dimension_2_完整性": 22.5,
+      "dimension_3_清晰度": 17.5,
+      "dimension_4_簡潔性": 13.5
+    },
+    "average_total": 89.5
+  },
+  "chunk_evaluation": {
+    "overall_unfiltered": { "avg_precision": 0.60, "avg_recall": 1.00, "avg_f1": 0.75 },
+    "overall_filtered": { "avg_precision": 1.00, "avg_recall": 1.00, "avg_f1": 1.00 }
+  },
+  "response_evaluation": {
+    "overall_completeness": 0.85,
+    "overall_factual_accuracy": 0.92
+  }
+}
+```
+
+### 4. Manage Results & Evaluations
+
+```bash
+# List all saved test results (URL quoted so the shell does not interpret ? and &)
+curl "http://localhost:8000/api/v1/test/results?limit=10&offset=0"
+
+# Retrieve a specific result
+curl http://localhost:8000/api/v1/test/results/a1b2c3d4e5f6
+
+# Delete a result
+curl -X DELETE http://localhost:8000/api/v1/test/results/a1b2c3d4e5f6
+
+# List all evaluation results
+curl "http://localhost:8000/api/v1/test/evaluations?limit=10"
+
+# Retrieve a specific evaluation
+curl http://localhost:8000/api/v1/test/evaluations/eval-abc123
+
+# Delete an evaluation
+curl -X DELETE http://localhost:8000/api/v1/test/evaluations/eval-abc123
+```
+
+### Full Workflow Example
+
+```bash
+# 1. Generate a test result
+RESULT=$(curl -s http://localhost:8000/api/v1/test/generate/text \
+  -H "Content-Type: application/json" \
+  -d '{"question": "立法會討論咗咩房屋政策?", "profile": "A", "label": "housing policy test"}')
+RESULT_ID=$(echo "$RESULT" | python3 -c "import sys,json; print(json.load(sys.stdin)['result_id'])")
+echo "Generated result: $RESULT_ID"
+
+# 2. Evaluate that result
+curl -s http://localhost:8000/api/v1/test/evaluate \
+  -H "Content-Type: application/json" \
+  -d "{
+    \"result_id\": \"$RESULT_ID\",
+    \"evaluation_config\": {
+      \"key_questions_evaluators\": [
+        {\"model_name\": \"deepseek-v4-pro\", \"base_url\": \"https://api.deepseek.com\", \"api_key_env\": \"DP_API_KEY\", \"enable_thinking\": true},
+        {\"model_name\": \"qwen3-7b-max\", \"base_url\": \"https://dashscope.aliyuncs.com/compatible-mode/v1\", \"api_key_env\": \"DASHSCOPE_API_KEY\", \"enable_thinking\": true}
+      ],
+      \"chunk_evaluator\": {\"model_name\": \"qwen/qwen3.6-35b-a3b\", \"base_url\": \"https://openrouter.ai/api/v1\", \"api_key_env\": \"LLM_API_KEY\", \"enable_thinking\": true},
+      \"response_evaluator\": {\"model_name\": \"qwen/qwen3.6-35b-a3b\", \"base_url\": \"https://openrouter.ai/api/v1\", \"api_key_env\": \"LLM_API_KEY\", \"enable_thinking\": true}
+    }
+  }" | python3 -m json.tool
+
+# 3. Check results
+curl -s "http://localhost:8000/api/v1/test/results?limit=5" | python3 -m json.tool
+curl -s "http://localhost:8000/api/v1/test/evaluations?limit=5" | python3 -m json.tool
+```
+
+### Key Questions Marking Scheme (4 Dimensions)
+
+| Dimension | Weight | Criterion for full marks |
+|------|------|------|
+| 1. 準確性 (Fidelity) | 40 points | Fully faithful to the original meaning; numbers and keywords are correct |
+| 2. 完整性 (Completeness) | 25 points | Covers all key elements (question + background + purpose) |
+| 3. 清晰度 (Clarity) | 20 points | Precise language, clear logic, easy to read and understand |
+| 4. 簡潔性 (Conciseness) | 15 points | Conveys the most complete meaning in the fewest words |
+
+### Requirements for Evaluation
+
+- All evaluation prompts, marking schemes, and LLM interactions are in **Chinese**
+- Both key questions evaluator models must succeed (3 retries each) — no partial scores
+- Chunk evaluation processes ALL chunks from ALL documents in batches of 10
+- Thinking mode (`enable_thinking: true`) is enabled on all evaluation models
+- Stored results and evaluations are not auto-deleted — manage via DELETE endpoints
+
 ## Notes
 
 - PDF upload limit: 300MB
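
Note on the CER/WER item in the plan diff above: it describes `services/cer_wer.py` as a Levenshtein-distance implementation, but that file is not part of this diff. The following is only a minimal sketch of how CER and WER can be derived from edit distance. The function names (`levenshtein`, `error_rate`, `cer`, `wer`) and the whitespace handling are assumptions made for illustration, not the project's actual code.

```python
from typing import Sequence


def levenshtein(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of insertions, deletions and substitutions turning ref into hyp."""
    # prev[j] holds the distance between the first i-1 ref tokens and the first j hyp tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from i ref tokens to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a ref token
                curr[j - 1] + 1,     # insertion of a hyp token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens: Sequence[str], hyp_tokens: Sequence[str]) -> float:
    """Edit distance normalised by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return levenshtein(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate. Assumption: whitespace is ignored."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    return error_rate(ref_chars, hyp_chars)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate. Assumption: words are whitespace-separated."""
    return error_rate(reference.split(), hypothesis.split())


if __name__ == "__main__":
    # One substituted character out of twelve in the reference.
    print(cer("立法會今日討論咗咩議題?", "立法會今日討論左咩議題?"))
    # One substitution plus one insertion against three reference words.
    print(wer("the cat sat", "the cat sit down"))
```

Whitespace splitting does not segment Cantonese or Chinese text, so a WER computed this way on unsegmented `yue` transcripts treats the whole utterance as one word. A real implementation would need a word segmenter for WER, or would rely on CER for such input.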