diff --git a/.plans/accuracy_testing_enhancement_plan.md b/.plans/accuracy_testing_enhancement_plan.md
index 7667ca5..d4b4b49 100644
--- a/.plans/accuracy_testing_enhancement_plan.md
+++ b/.plans/accuracy_testing_enhancement_plan.md
@@ -2,7 +2,26 @@
 
 **Source**: User request (2026-05-25)
 **Scope**: New backend API endpoints for generating test results (audio + text inputs) and evaluating pipeline accuracy against ground truth. Results Generation APIs capture every pipeline stage as structured JSON. Results Evaluation API measures transcription quality (CER/WER), key question extraction quality (dual-model marking scheme), chunk retrieval/filtering accuracy (batched binary relevance methodology), and final response completeness.
-**Status**: Draft — Planning Phase (no code)
+**Status**: ✅ Implemented (Sub-Phases 9.0–9.4 complete, 78/78 tests passing)
+
+---
+
+## Progress Summary
+
+| Sub-Phase | Commit | Status | Tests |
+|-----------|--------|--------|-------|
+| 9.0 Config & Models | `852430f` | ✅ | 32/32 |
+| 9.1 Generation APIs | `ac81df0` | ✅ | 51/51 (cumulative) |
+| 9.2 Evaluation Engine | `098be35` | ✅ | 76/76 (cumulative) |
+| 9.3 Evaluate API | `032dd75` | ✅ | 78/78 (cumulative) |
+| 9.4 Polish | `032dd75` | ✅ | 78/78, no regressions |
+
+### Implementation Notes (Deviations from Plan)
+
+- **Pydantic model naming**: All `Test*`-prefixed models renamed to drop the `Test` prefix (e.g., `TestGenerateResult` → `GenerateResult`, `TestEvaluationResult` → `EvaluationResult`). This avoids pytest collection conflicts where `Test*` classes are treated as test cases.
+- **`DimensionScores` field type**: Changed from `int` to `float` to support averaged scores with decimal precision.
+- **Single test file**: `test_phase9_chunk_response_eval.py` combines chunk evaluator and response evaluator tests (plan had separate files).
+- **Pipeline refactoring**: Created `RAGPipeline` class in `services/rag_pipeline.py` instead of modifying `query.py` directly. The reusable async generator yields `PipelineSnapshot` dataclasses. The existing `_query_stream()` in `query.py` was left unchanged (no regression risk).
 
 ---
 
@@ -60,12 +79,14 @@ Evaluated by two advanced models independently, scores averaged:
 
 ### What's Missing
 
-- **No test profile config**: Profiles A/B/C in `PromptService` only control prompt templates. Model selection is via env vars (`LLM_MODEL_NAME`, `DP_MODEL_NAME`). This is the intended behavior — profiles switch prompts, not models.
-- **No results JSON format**: No defined schema for capturing pipeline stage outputs in a structured, machine-readable format.
-- **No evaluation infrastructure**: Zero evaluation metrics, scoring functions, or comparison logic exist in the codebase.
-- **No ground-truth chunk determination**: No method exists for independently identifying which chunks SHOULD be relevant to a question.
-- **No CER/WER calculation**: No transcription accuracy measurement utilities.
-- **No dual-model scoring**: No infrastructure for having two LLMs independently evaluate the same output and averaging scores.
+The gaps previously listed here were closed in Sub-Phases 9.0–9.4; the first item was intended behavior and needed no change:
+
+- ✅ **Test profile config**: No change needed. Profiles A/B/C in `PromptService` control prompt templates only; model selection stays in env vars (`LLM_MODEL_NAME`, `DP_MODEL_NAME`), as intended.
+- ✅ **Results JSON format**: All Pydantic models defined in `backend/app/models/testing.py` (26 models).
+- ✅ **Evaluation infrastructure**: Full evaluation engine in `services/key_questions_evaluator.py`, `chunk_evaluator.py`, `response_evaluator.py`, `cer_wer.py`.
+- ✅ **Ground-truth chunk determination**: Batched LLM-based binary relevance evaluation in `chunk_evaluator.py`.
+- ✅ **CER/WER calculation**: Levenshtein-distance implementation in `services/cer_wer.py`.
+- ✅ **Dual-model scoring**: `key_questions_evaluator.py` with separate `LLMClient` instances, `asyncio.gather` parallelism, 3x retry.
 
 ### Language Context
 
diff --git a/README.md b/README.md
index 2a9f387..e660840 100644
--- a/README.md
+++ b/README.md
@@ -316,6 +316,252 @@ wget -qO- https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.
 cp /tmp/ffmpeg-*-static/ffmpeg ~/.local/bin/
 ```
 
+## Accuracy Testing API (Package 9)
+
+Backend endpoints for generating test results and evaluating RAG pipeline accuracy. Designed for programmatic use — call via `curl`, Python `requests`, or any HTTP client.
+
+All endpoints live under `/api/v1/test/*` and are reachable through the same domain as the frontend (nginx proxies all paths to FastAPI).
+
+### 1. Generate Test Result (Text)
+
+Run the full RAG pipeline on a text question and capture every intermediate stage.
+
+```bash
+# Basic text generation
+curl http://localhost:8000/api/v1/test/generate/text \
+  -H "Content-Type: application/json" \
+  -d '{
+    "question": "立法會今日討論咗咩議題？",
+    "profile": "A",
+    "label": "Test run 2026-05-25"
+  }'
+
+# Response includes:
+#   - extracted_key_questions: decomposed sub-questions
+#   - retrieval: per-sub-question chunks with metadata and distance scores
+#   - filtered: chunks after relevance filter (with relevance_score)
+#   - response: final answer with source citations
+#   - timing: per-stage timing in ms
+```
+
+**Response** (partial):
+
+```json
+{
+  "result_id": "a1b2c3d4e5f6",
+  "input_type": "text",
+  "profile": "A",
+  "label": "Test run 2026-05-25",
+  "input": { "text": "立法會今日討論咗咩議題？" },
+  "extracted_key_questions": [
+    "立法會：今日討論的主要議題",
+    "會議記錄：近期立法會會議的討論內容"
+  ],
+  "retrieval": {
+    "total_chunks_retrieved": 20,
+    "retriever_time_ms": 456
+  },
+  "filtered": {
+    "total_chunks_filtered": 14,
+    "filter_time_ms": 789
+  },
+  "response": {
+    "final_answer": "## Sub-question 0: ...\n\n- 今日立法會討論了三項主要議題... [meeting_minutes.pdf, page 1]",
+    "generate_time_ms": 1011
+  },
+  "timing": {
+    "decomposer_time_ms": 234,
+    "total_time_ms": 2490
+  }
+}
+```
+
+### 2. Generate Test Result (Audio)
+
+Transcribe audio via ASR, then run the RAG pipeline on the transcribed text. Optionally provide a reference transcript for later CER/WER evaluation.
+
+```bash
+# Audio generation with reference transcript (for later CER/WER scoring)
+curl http://localhost:8000/api/v1/test/generate/audio \
+  -F "audio_file=@legco_clip.wav" \
+  -F "profile=A" \
+  -F "reference_transcript=立法會今日討論咗咩議題？" \
+  -F "language=yue" \
+  -F "label=Cantonese LegCo audio test"
+
+# Without reference transcript (CER/WER will return N/A in evaluation)
+curl http://localhost:8000/api/v1/test/generate/audio \
+  -F "audio_file=@meeting.mp3" \
+  -F "profile=B" \
+  -F "language=yue"
+```
+
+Compared to the text endpoint, the audio result includes extra fields:
+
+```json
+{
+  "input_type": "audio",
+  "input": {
+    "text": "立法會今日討論咗咩議題？",
+    "reference_transcript": "立法會今日討論咗咩議題？",
+    "audio_filename": "legco_clip.wav",
+    "audio_duration_seconds": 45.2,
+    "asr_language": "yue"
+  },
+  "timing": {
+    "asr_time_ms": 1234,
+    "total_time_ms": 3724
+  }
+}
+```
+
+### 3. Evaluate Test Result
+
+Run all four evaluation dimensions on a previously generated result:
+
+- **(i) Audio transcription accuracy**: CER/WER, computed only for audio inputs that include a reference transcript
+- **(ii) Key questions quality**: two evaluator LLMs each score against the 4-dimension rubric, and their scores are averaged
+- **(iii) Chunk accuracy**: an LLM labels the ground-truth chunks, then precision/recall/F1 are computed for both the retrieved and the filtered chunk sets
+- **(iv) Response completeness**: an ideal response is generated from the ground-truth chunks, and the pipeline's final answer is compared against it
+
+```bash
+# Evaluate a previously saved result
+curl http://localhost:8000/api/v1/test/evaluate \
+  -H "Content-Type: application/json" \
+  -d '{
+    "result_id": "a1b2c3d4e5f6",
+    "evaluation_config": {
+      "key_questions_evaluators": [
+        {
+          "model_name": "deepseek-v4-pro",
+          "base_url": "https://api.deepseek.com",
+          "api_key_env": "DP_API_KEY",
+          "enable_thinking": true
+        },
+        {
+          "model_name": "qwen3-7b-max",
+          "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
+          "api_key_env": "DASHSCOPE_API_KEY",
+          "enable_thinking": true
+        }
+      ],
+      "chunk_evaluator": {
+        "model_name": "qwen/qwen3.6-35b-a3b",
+        "base_url": "https://openrouter.ai/api/v1",
+        "api_key_env": "LLM_API_KEY",
+        "enable_thinking": true
+      },
+      "response_evaluator": {
+        "model_name": "qwen/qwen3.6-35b-a3b",
+        "base_url": "https://openrouter.ai/api/v1",
+        "api_key_env": "LLM_API_KEY",
+        "enable_thinking": true
+      }
+    }
+  }'
+```
+
+**Response** (partial — shows scoring structure):
+
+```json
+{
+  "evaluation_id": "eval-abc123",
+  "result_id": "a1b2c3d4e5f6",
+  "status": "completed",
+  "audio_evaluation": {
+    "status": "completed",
+    "cer": 0.052,
+    "wer": 0.083
+  },
+  "key_questions_evaluation": {
+    "average_scores": {
+      "dimension_1_準確性": 36.0,
+      "dimension_2_完整性": 22.5,
+      "dimension_3_清晰度": 17.5,
+      "dimension_4_簡潔性": 13.5
+    },
+    "average_total": 89.5
+  },
+  "chunk_evaluation": {
+    "overall_unfiltered": { "avg_precision": 0.60, "avg_recall": 1.00, "avg_f1": 0.75 },
+    "overall_filtered": { "avg_precision": 1.00, "avg_recall": 1.00, "avg_f1": 1.00 }
+  },
+  "response_evaluation": {
+    "overall_completeness": 0.85,
+    "overall_factual_accuracy": 0.92
+  }
+}
+```
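+
+The `cer` and `wer` values in `audio_evaluation` are edit-distance ratios. The block below is a minimal sketch of that calculation, not the code in `services/cer_wer.py`: the function names are hypothetical, and splitting on whitespace for WER is an assumption (Cantonese text would need a word segmenter first).
+
+```python
+def edit_distance(ref, hyp):
+    """Levenshtein distance between two sequences, using two DP rows."""
+    prev = list(range(len(hyp) + 1))
+    for i, r in enumerate(ref, start=1):
+        curr = [i]
+        for j, h in enumerate(hyp, start=1):
+            cost = 0 if r == h else 1
+            curr.append(min(prev[j] + 1,          # deletion
+                            curr[j - 1] + 1,      # insertion
+                            prev[j - 1] + cost))  # substitution or match
+        prev = curr
+    return prev[-1]
+
+
+def cer(reference, hypothesis):
+    """Character error rate: character edits divided by reference length."""
+    ref = list(reference.replace(" ", ""))
+    hyp = list(hypothesis.replace(" ", ""))
+    if not ref:
+        raise ValueError("reference transcript is empty")
+    return edit_distance(ref, hyp) / len(ref)
+
+
+def wer(reference, hypothesis):
+    """Word error rate; assumes words are already separated by whitespace."""
+    ref = reference.split()
+    hyp = hypothesis.split()
+    if not ref:
+        raise ValueError("reference transcript is empty")
+    return edit_distance(ref, hyp) / len(ref)
+```
+
+With these definitions, `cer("abcd", "abxd")` is one substitution over four reference characters, i.e. 0.25. Both rates can exceed 1.0 when the hypothesis is much longer than the reference.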
+
+### 4. Manage Results & Evaluations
+
+```bash
+# List all saved test results
+curl "http://localhost:8000/api/v1/test/results?limit=10&offset=0"
+
+# Retrieve a specific result
+curl http://localhost:8000/api/v1/test/results/a1b2c3d4e5f6
+
+# Delete a result
+curl -X DELETE http://localhost:8000/api/v1/test/results/a1b2c3d4e5f6
+
+# List all evaluation results
+curl "http://localhost:8000/api/v1/test/evaluations?limit=10"
+
+# Retrieve a specific evaluation
+curl http://localhost:8000/api/v1/test/evaluations/eval-abc123
+
+# Delete an evaluation
+curl -X DELETE http://localhost:8000/api/v1/test/evaluations/eval-abc123
+```
+
+### Full Workflow Example
+
+```bash
+# 1. Generate a test result
+RESULT=$(curl -s http://localhost:8000/api/v1/test/generate/text \
+  -H "Content-Type: application/json" \
+  -d '{"question": "立法會討論咗咩房屋政策？", "profile": "A", "label": "housing policy test"}')
+RESULT_ID=$(echo "$RESULT" | python3 -c "import sys,json; print(json.load(sys.stdin)['result_id'])")
+echo "Generated result: $RESULT_ID"
+
+# 2. Evaluate that result
+curl -s http://localhost:8000/api/v1/test/evaluate \
+  -H "Content-Type: application/json" \
+  -d "{
+    \"result_id\": \"$RESULT_ID\",
+    \"evaluation_config\": {
+      \"key_questions_evaluators\": [
+        {\"model_name\": \"deepseek-v4-pro\", \"base_url\": \"https://api.deepseek.com\", \"api_key_env\": \"DP_API_KEY\", \"enable_thinking\": true},
+        {\"model_name\": \"qwen3-7b-max\", \"base_url\": \"https://dashscope.aliyuncs.com/compatible-mode/v1\", \"api_key_env\": \"DASHSCOPE_API_KEY\", \"enable_thinking\": true}
+      ],
+      \"chunk_evaluator\": {\"model_name\": \"qwen/qwen3.6-35b-a3b\", \"base_url\": \"https://openrouter.ai/api/v1\", \"api_key_env\": \"LLM_API_KEY\", \"enable_thinking\": true},
+      \"response_evaluator\": {\"model_name\": \"qwen/qwen3.6-35b-a3b\", \"base_url\": \"https://openrouter.ai/api/v1\", \"api_key_env\": \"LLM_API_KEY\", \"enable_thinking\": true}
+    }
+  }" | python3 -m json.tool
+
+# 3. Check results
+curl -s "http://localhost:8000/api/v1/test/results?limit=5" | python3 -m json.tool
+curl -s "http://localhost:8000/api/v1/test/evaluations?limit=5" | python3 -m json.tool
+```
+
+### Key Questions Marking Scheme (4 Dimensions)
+
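+| Dimension | Weight (max points) | Full-marks criterion |
+|-----------|---------------------|----------------------|
+| 1. 準確性 (Fidelity) | 40 | Fully faithful to the original meaning; numbers and keywords are correct |
+| 2. 完整性 (Completeness) | 25 | Covers all key elements (question + background + purpose) |
+| 3. 清晰度 (Clarity) | 20 | Precise language, clear logic, easy to read and understand |
+| 4. 簡潔性 (Conciseness) | 15 | Conveys the complete meaning in the fewest words |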
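+
+To make the averaging step concrete, here is a hedged sketch of how two evaluators' per-dimension scores could be combined into `average_scores` and `average_total`. The dictionary keys mirror the sample evaluation response; the function name and the individual evaluator scores are hypothetical, and this is not the code in `key_questions_evaluator.py`.
+
+```python
+# Maximum points per dimension, taken from the marking scheme table.
+MAX_POINTS = {
+    "dimension_1_準確性": 40,
+    "dimension_2_完整性": 25,
+    "dimension_3_清晰度": 20,
+    "dimension_4_簡潔性": 15,
+}
+
+
+def average_evaluator_scores(evaluator_scores):
+    """Average per-dimension scores across evaluators.
+
+    Raises ValueError when fewer than two evaluators reported or when a
+    score falls outside its dimension's range, so no partial result is
+    ever returned.
+    """
+    if len(evaluator_scores) < 2:
+        raise ValueError("scores from both evaluators are required")
+    averaged = {}
+    for dimension, max_points in MAX_POINTS.items():
+        values = [scores[dimension] for scores in evaluator_scores]
+        for value in values:
+            if not 0 <= value <= max_points:
+                raise ValueError(f"{dimension}: {value} is outside 0..{max_points}")
+        averaged[dimension] = sum(values) / len(values)
+    return averaged, sum(averaged.values())
+
+
+# Hypothetical scores from two evaluator models.
+evaluator_a = {"dimension_1_準確性": 38, "dimension_2_完整性": 23,
+               "dimension_3_清晰度": 18, "dimension_4_簡潔性": 14}
+evaluator_b = {"dimension_1_準確性": 34, "dimension_2_完整性": 22,
+               "dimension_3_清晰度": 17, "dimension_4_簡潔性": 13}
+average_scores, average_total = average_evaluator_scores([evaluator_a, evaluator_b])
+```
+
+Averaging produces fractional values such as 22.5, which is why the score fields need to be floats rather than integers.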
+
+### Requirements for Evaluation
+
+- All evaluation prompts, marking schemes, and LLM interactions are in **Chinese**
+- Both key questions evaluator models must succeed (3 retries each) — no partial scores
+- Chunk evaluation processes ALL chunks from ALL documents in batches of 10
+- Thinking mode (`enable_thinking: true`) is enabled on all evaluation models
+- Stored results and evaluations are not auto-deleted — manage via DELETE endpoints
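+
+The chunk metrics and the batching rule above can be illustrated with a short sketch. It assumes chunks are identified by string IDs and that the evaluator LLM has already labelled the relevant ones; the function names are hypothetical and this is not the code in `chunk_evaluator.py`.
+
+```python
+def in_batches(items, size=10):
+    """Yield consecutive slices of at most `size` items."""
+    for start in range(0, len(items), size):
+        yield items[start:start + size]
+
+
+def precision_recall_f1(selected_ids, relevant_ids):
+    """Binary-relevance precision, recall, and F1 for one sub-question."""
+    selected = set(selected_ids)
+    relevant = set(relevant_ids)
+    true_positives = len(selected & relevant)
+    precision = true_positives / len(selected) if selected else 0.0
+    recall = true_positives / len(relevant) if relevant else 0.0
+    if precision + recall == 0:
+        return precision, recall, 0.0
+    f1 = 2 * precision * recall / (precision + recall)
+    return precision, recall, f1
+
+
+# Hypothetical example: 5 retrieved chunks, 3 of them judged relevant.
+retrieved = ["c1", "c2", "c3", "c4", "c5"]
+ground_truth = ["c1", "c2", "c3"]
+p, r, f1 = precision_recall_f1(retrieved, ground_truth)
+print(round(p, 2), round(r, 2), round(f1, 2))
+```
+
+In this example precision is 3/5 and recall is 3/3. Running the same function on the filtered set instead of the retrieved set gives the filtered metrics.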
+
 ## Notes
 
 - PDF upload limit: 300MB