Accuracy Testing & Evaluation APIs — Enhancement Plan
Source: User request (2026-05-25)
Scope: New backend API endpoints for generating test results (audio + text inputs) and evaluating pipeline accuracy against ground truth. Results Generation APIs capture every pipeline stage as structured JSON. Results Evaluation API measures transcription quality (CER/WER), key question extraction quality (dual-model marking scheme), chunk retrieval/filtering accuracy (batched binary relevance methodology), and final response completeness.
Status: ✅ Implemented (Sub-Phases 9.0–9.4 complete, 78/78 tests passing)
Progress Summary
| Sub-Phase | Commit | Status | Tests |
|---|---|---|---|
| 9.0 Config & Models | 852430f | ✅ | 32/32 |
| 9.1 Generation APIs | ac81df0 | ✅ | 51/51 (cumulative) |
| 9.2 Evaluation Engine | 098be35 | ✅ | 76/76 (cumulative) |
| 9.3 Evaluate API | 032dd75 | ✅ | 78/78 (cumulative) |
| 9.4 Polish | 032dd75 | ✅ | 78/78, no regressions |
Implementation Notes (Deviations from Plan)
- Pydantic model naming: All Test*-prefixed models were renamed to drop the Test prefix (e.g., TestGenerateResult → GenerateResult, TestEvaluationResult → EvaluationResult). This avoids pytest collection conflicts where Test* classes are treated as test cases.
- DimensionScores field type: Changed from int to float to support averaged scores with decimal precision.
- Single test file: test_phase9_chunk_response_eval.py combines the chunk evaluator and response evaluator tests (the plan had separate files).
- Pipeline refactoring: Created a RAGPipeline class in services/rag_pipeline.py instead of modifying query.py directly. The reusable async generator yields PipelineSnapshot dataclasses. The existing _query_stream() in query.py was left unchanged (no regression risk).
Objective
Results Generation APIs
Two endpoints that run the full RAG pipeline on a given input and produce a self-contained JSON file capturing every intermediate stage:
- POST /api/v1/test/generate/audio: Accepts an audio file upload plus an optional reference text transcript. Runs ASR → full RAG pipeline. Returns JSON with: (A) transcribed audio-to-text, (B) extracted key questions, (C) retrieved chunks per key question, (D) filtered chunks for generation, (E) final response. If the reference transcript is omitted, CER/WER evaluation will be N/A.
- POST /api/v1/test/generate/text: Accepts a text question. Runs the full RAG pipeline. Returns JSON with: (A) input text, (B) extracted key questions, (C) retrieved chunks per key question, (D) filtered chunks for generation, (E) final response.
Both endpoints accept a profile parameter (A, B, or C) to select which prompt profile is used. The profile controls the RAG pipeline's prompt templates (via the existing PromptService — decompose, filter, generate, etc.). LLM model selection remains driven by env vars (LLM_MODEL_NAME, DP_MODEL_NAME); profiles do NOT switch models.
Results Evaluation API
One endpoint that reads a results JSON file (output from the generation endpoints above) and evaluates pipeline accuracy across four dimensions:
POST /api/v1/test/evaluate: Accepts a results JSON + evaluation config. Evaluates:
- (i) Audio transcription accuracy: CER (Character Error Rate) and WER (Word Error Rate) against the reference text transcript (audio input only).
- (ii) Key questions extraction quality: Two advanced evaluation LLMs (DeepSeek V4 Pro / dp_model_name + Qwen 3.7 Max) score extracted key questions against the marking scheme below. Both models enable thinking (xHigh reasoning). Scores are averaged per dimension. Both models MUST succeed: if either fails after retries, the evaluation is abandoned (no partial scores).
- (iii) Chunk accuracy: Batched ground-truth methodology using the same generation model (e.g., Qwen3.6 35B A3B) with thinking enabled. Evaluates both unfiltered (retrieved) and filtered chunks against independently determined ground truth.
- (iv) Final response accuracy: Independently generates a response from ground-truth relevant chunks, then compares completeness against the pipeline's generated response.
Marking Scheme for Key Questions Quality
Evaluated by two advanced models independently, scores averaged:
| 維度 | 權重 | 滿分描述 | 扣分指引 |
|---|---|---|---|
| 1. 準確性 (Fidelity to Original) | 40分 | 完全忠於原發言的核心意思、數字、關鍵詞及邏輯,沒有扭曲、遺漏或添加原意沒有的內容。 | 意思走樣(如把「先後緩急」改成其他概念)→ 扣 10–20 分;數字錯誤或遺漏(如 1065 戶、889 戶)→ 扣 15–25 分;完全偏離原意 → 扣 30–40 分 |
| 2. 完整性 (Completeness) | 25分 | 涵蓋原發言中該部分的所有關鍵元素(問題 + 背景 + 目的),無明顯遺漏。 | 漏掉重要背景(如「當前財政緊張」)→ 扣 8–12 分;只問一半(例如只問「可否先處理主幹道」,漏掉「後處理單車徑」)→ 扣 10–18 分;完全只剩一句問句 → 扣 20 分以上 |
| 3. 清晰度 (Clarity) | 20分 | 語言精準、邏輯清楚、易讀易懂,問題焦點一目了然,適合正式會議場合使用。 | 句子過長或結構混亂 → 扣 6–10 分;出現歧義或模糊詞 → 扣 10–15 分;完全看不懂重點 → 扣 16–20 分 |
| 4. 簡潔性 (Conciseness) | 15分 | 用最少的字數表達最完整的意思,無多餘贅詞,適合口頭提問或書面記錄。 | 過於冗長(比原發言還長)→ 扣 6–10 分;過度簡化導致意思不全 → 扣 8–13 分 |
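A minimal sketch of how the two evaluators' scores could be combined, assuming each evaluator returns one integer per dimension. The names `MAX_POINTS` and `average_scores` are illustrative and are not taken from the implementation; the "both must succeed" rule is modeled here by refusing to average fewer than two results.

```python
# Maximum points per dimension, taken from the marking scheme above.
MAX_POINTS = {
    "dimension_1_準確性": 40,
    "dimension_2_完整性": 25,
    "dimension_3_清晰度": 20,
    "dimension_4_簡潔性": 15,
}


def average_scores(evaluations: list[dict[str, int]]) -> dict[str, float]:
    """Average per-dimension scores across evaluators (hypothetical helper)."""
    if len(evaluations) < 2:
        raise ValueError("both evaluators must succeed; no partial scores")
    for scores in evaluations:
        for name, cap in MAX_POINTS.items():
            if not 0 <= scores[name] <= cap:
                raise ValueError(f"{name} out of range: {scores[name]}")
    averaged = {
        name: sum(s[name] for s in evaluations) / len(evaluations)
        for name in MAX_POINTS
    }
    # The total is the sum of the averaged dimensions (maximum 100).
    averaged["average_total"] = sum(averaged.values())
    return averaged
```

Validating each score against its dimension cap before averaging catches an evaluator that ignores the weights, which would otherwise skew the mean silently.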
Current State
What Already Exists (Reusable)
| Component | Location | What it provides for accuracy testing |
|---|---|---|
| Full RAG pipeline | routers/query.py:_query_stream() | All 4 stages: decompose → retrieve → filter → generate. Every intermediate data structure accessible in-memory |
| ASR transcription | services/asr_client.py:transcribe_full() | bytes → str transcription with language param |
| Prompt profiles A/B/C | services/prompt_service.py | 3 profiles with 8 prompt template steps each. activate_profile() switches active. get_active_profile_name() reads current |
| Query history | services/history_service.py | HTML table stores every pipeline stage: timings, prompts, chunks XML, sources JSON, profile used |
| SSE streaming events | models/query.py | DecomposedEvent, RetrievingEvent, FilteringEvent, etc., which define the canonical pipeline phases |
| Source metadata | models/common.py:SourceMetadata | Universal chunk reference: filename, page, summary, document_id |
| LLM clients | services/llm_client.py + services/llm_client_dp.py | Two separate clients: main (OpenRouter/vLLM) for filter/generate, DeepSeek for decompose. Both support enable_thinking |
| Config | core/config.py:Settings | All env-driven: model names, thresholds, chunk sizes, API keys |
| Test infrastructure | test/conftest.py | Mock LLM/ASR clients, mock PromptService with A/B/C, isolated ChromaDB tmp_path |
| Acceptance tests | test/acceptance/ | @pytest.mark.acceptance + @pytest.mark.slow pattern for real LLM/ASR tests |
What's Missing
All items previously listed as "missing" have been implemented in Sub-Phases 9.0–9.4:
- ✅ Test profile config: Profiles A/B/C in PromptService only control prompt templates. Model selection via env vars is unchanged, as intended.
- ✅ Results JSON format: All Pydantic models defined in backend/app/models/testing.py (26 models).
- ✅ Evaluation infrastructure: Full evaluation engine in services/key_questions_evaluator.py, chunk_evaluator.py, response_evaluator.py, cer_wer.py.
- ✅ Ground-truth chunk determination: Batched LLM-based binary relevance evaluation in chunk_evaluator.py.
- ✅ CER/WER calculation: Levenshtein-distance implementation in services/cer_wer.py.
- ✅ Dual-model scoring: key_questions_evaluator.py with separate LLMClient instances, asyncio.gather parallelism, 3x retry.
Language Context
All documents, chunks, key questions, and responses in this system are primarily Chinese (繁體中文). All evaluation prompts, marking schemes, and LLM interactions for evaluation must be written in Chinese. The CER/WER calculation operates at the character level, which is appropriate for Chinese text.
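The error rates can be sketched with a plain Levenshtein distance. This is an illustrative version, not the code in services/cer_wer.py: the function names are hypothetical, whitespace is ignored for CER, and WER assumes the text has already been segmented into whitespace-separated tokens (unsegmented Chinese would need a word segmenter first).

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / reference characters."""
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens (assumption)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)
```

Raising on an empty reference matches the plan's rule that CER/WER is N/A when no reference transcript is supplied; the caller can catch the error and report N/A.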
Pipeline Data Flow (for results capture)
QueryRequest.question (str)
↓ [Stage 1: Decompose]
LLMClientDP.complete_structured() with DeepSeek
→ (List[str] sub_questions, str decompose_prompt)
↓ [Stage 2: Retrieve]
RAGService.retrieve_per_subquestion() → ChromaDB per sub-q
→ List[Tuple[str, List[Tuple[str, Dict, float]]]]
= [(sub_q_text, [(chunk_text, metadata, distance), ...]), ...]
↓ [Stage 3: Filter]
RelevanceFilter.filter_per_subquestion() → single LLM call
→ List[Tuple[str, List[Tuple[str, Dict]]]]
= [(sub_q_text, [(chunk_text, metadata_with_relevance_score), ...]), ...]
↓ [Stage 4: Generate]
RAGService.generate_response_per_subquestion() → one LLM call
→ (str answer, str generate_prompt, List[List[Dict]] grouped_sources)
All intermediate data is accessible at lines 192-197 (decompose), 225-236 (retrieve), 269-280 (filter), 319-351 (generate) in routers/query.py.
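As a rough illustration of results capture, the Stage 2 output shape shown above can be converted into a JSON-serializable structure. The function name `retrieval_to_json` is hypothetical, and the output keys follow the retrieval object of the results JSON described in this plan.

```python
from typing import Any

# (chunk_text, metadata, distance), as returned by the retrieve stage
RetrievedChunk = tuple[str, dict[str, Any], float]


def retrieval_to_json(
    per_sub_question: list[tuple[str, list[RetrievedChunk]]],
    retriever_time_ms: int,
) -> dict[str, Any]:
    """Turn the in-memory retrieval tuples into a plain dict (sketch)."""
    entries = []
    for sq_index, (sq_text, chunks) in enumerate(per_sub_question):
        entries.append({
            "sub_question_index": sq_index,
            "sub_question_text": sq_text,
            "chunks": [
                {
                    "chunk_index": i,
                    "text": text,
                    "metadata": dict(meta),  # copy so later edits do not leak
                    "distance": distance,
                }
                for i, (text, meta, distance) in enumerate(chunks)
            ],
        })
    return {
        "per_sub_question": entries,
        "total_chunks_retrieved": sum(len(e["chunks"]) for e in entries),
        "retriever_time_ms": retriever_time_ms,
    }
```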
Target State
Architecture Overview
┌─────────────────────────────────────────────────────────┐
│ RESULTS GENERATION │
│ │
│ POST /api/v1/test/generate/text │
│ │ │
│ ├─ profile=A|B|C → selects prompt templates │
│ ├─ Run full RAG pipeline │
│ ├─ Capture ALL intermediate data │
│ └─ Return ResultsJSON (stored to disk) │
│ │
│ POST /api/v1/test/generate/audio │
│ │ │
│ ├─ Accept audio file + reference transcript │
│ ├─ Run ASR → transcribed text │
│ ├─ Run RAG pipeline on transcribed text │
│ └─ Return ResultsJSON (includes ref transcript) │
│ │
│ GET /api/v1/test/results │
│ └─ List all generated result files │
│ │
│ GET /api/v1/test/results/{result_id} │
│ └─ Retrieve a specific result file │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ RESULTS EVALUATION │
│ │
│ POST /api/v1/test/evaluate │
│ │ │
│ ├─ result_id: str (or inline results JSON) │
│ ├─ evaluation_config: { │
│ │ evaluator_models: ["deepseek-v4-pro", │
│ │ "qwen3-7b-max"], │
│ │ chunk_model: "qwen/qwen3.6-35b-a3b", │
│ │ enable_thinking: true │
│ │ } │
│ │ │
│ ├─ (i) CER/WER Calculation (audio only) │
│ │ └─ compare transcribed vs reference text │
│ │ │
│ ├─ (ii) Key Questions Evaluation │
│ │ ├─ Evaluator Model 1 (deepseek-v4-pro) │
│ │ │ └─ Score against 4-dimension rubric │
│ │ ├─ Evaluator Model 2 (qwen3-7b-max) │
│ │ │ └─ Score against same rubric │
│ │ ├─ BOTH must succeed (retry 3x on failure) │
│ │ └─ Average scores per dimension │
│ │ │
│ ├─ (iii) Chunk Accuracy Evaluation │
│ │ ├─ Per key question: split ALL chunks │
│ │ │ from ALL documents into batches of 10 │
│ │ ├─ Each batch → LLM (thinking enabled) │
│ │ │ → binary true/false relevance per chunk │
│ │ │ → XML format with <sub_question> + │
│ │ │ <chunk_N doc="x" page="y"> tags │
│ │ ├─ ALL batches run in parallel (async) │
│ │ └─ Union of relevant chunks = ground truth │
│ │ │
│ │ └─ COMPARE: Precision/Recall/F1 │
│ │ ├─ Unfiltered chunks vs ground truth │
│ │ └─ Filtered chunks vs ground truth │
│ │ │
│ └─ (iv) Response Accuracy Evaluation │
│ ├─ Generate "ideal" response from │
│ │ ground-truth relevant chunks │
│ ├─ Compare against pipeline's response │
│ │ (completeness, factual accuracy) │
│ └─ Return completeness score + comments │
│ │
│ GET /api/v1/test/evaluations │
│ └─ List all evaluation results │
│ │
│ GET /api/v1/test/evaluations/{eval_id} │
│ └─ Retrieve a specific evaluation │
└─────────────────────────────────────────────────────────┘
Storage
backend/
├── data/
│ ├── test_results/ # Generated ResultsJSON files
│ │ ├── {result_id}.json
│ │ └── ...
│ └── test_evaluations/ # Evaluation result files
│ ├── {eval_id}.json
│ └── ...
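A minimal sketch of the file-per-result layout, assuming IDs are UUIDs as in the examples in this plan. `save_result` and `load_result` are illustrative names, not the implemented service API.

```python
import json
import uuid
from pathlib import Path


def save_result(result: dict, directory: Path) -> str:
    """Write one result as {result_id}.json and return its ID."""
    directory.mkdir(parents=True, exist_ok=True)
    result_id = result.get("result_id") or str(uuid.uuid4())
    payload = {**result, "result_id": result_id}
    path = directory / f"{result_id}.json"
    # ensure_ascii=False keeps Chinese text readable in the stored file
    path.write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return result_id


def load_result(result_id: str, directory: Path) -> dict:
    """Read a stored result; rejects IDs that are not UUIDs."""
    uuid.UUID(result_id)  # raises ValueError, which also blocks path traversal
    path = directory / f"{result_id}.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

The same pair of helpers would work for the test_evaluations directory by passing a different `directory`.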
API Design
1. POST /api/v1/test/generate/text
Purpose: Run the full RAG pipeline on a text question, capture every intermediate stage.
Request:
{
"question": "立法會今日討論咗咩議題?",
"profile": "A",
"label": "Test run for LegCo meeting 2026-05-25"
}
| Field | Type | Required | Description |
|---|---|---|---|
| question | string | Yes | The text question to run through the RAG pipeline |
| profile | "A" \| "B" \| "C" | Yes | Which prompt profile to use (controls prompt templates only; model selection stays env-driven) |
| label | string | No | Human-readable label for the test result |
Response (TestGenerateResult):
{
"result_id": "uuid",
"input_type": "text",
"profile": "A",
"label": "Test run for LegCo meeting 2026-05-25",
"created_at": "2026-05-25T10:30:00Z",
"input": {
"text": "立法會今日討論咗咩議題?"
},
"extracted_key_questions": [
"立法會:今日討論的主要議題",
"會議記錄:近期立法會會議的討論內容"
],
"retrieval": {
"per_sub_question": [
{
"sub_question_index": 0,
"sub_question_text": "立法會:今日討論的主要議題",
"chunks": [
{
"chunk_index": 0,
"text": "立法會今日討論了三項主要議題...",
"metadata": {
"filename": "meeting_minutes_2026-05-25.pdf",
"upload_date": "2026-05-25T08:00:00Z",
"content_summary": "立法會今日討論了三項主要議題:房屋政策、交通基建、醫療資源...",
"chunk_index": 37,
"page_number": 1,
"document_id": "abc-123",
"strategy_type": "token",
"chunk_file_path": "/data/chunks/abc-123_37.pdf"
},
"distance": 0.1245
}
]
}
],
"total_chunks_retrieved": 20,
"retriever_time_ms": 456
},
"filtered": {
"per_sub_question": [
{
"sub_question_index": 0,
"sub_question_text": "立法會:今日討論的主要議題",
"chunks": [
{
"chunk_index": 0,
"text": "立法會今日討論了三項主要議題...",
"metadata": {
"filename": "meeting_minutes_2026-05-25.pdf",
"upload_date": "2026-05-25T08:00:00Z",
"content_summary": "立法會今日討論了三項主要議題...",
"chunk_index": 37,
"page_number": 1,
"document_id": "abc-123",
"strategy_type": "token",
"chunk_file_path": "/data/chunks/abc-123_37.pdf",
"relevance_score": 8.7
}
}
]
}
],
"total_chunks_filtered": 14,
"filter_time_ms": 789
},
"response": {
"final_answer": "## Sub-question 0: 立法會:今日討論的主要議題\n\n- 今日立法會討論了三項主要議題:房屋政策、交通基建、醫療資源分配 [meeting_minutes_2026-05-25.pdf, page 1]\n- ...\n\n## Sub-question 1: ...\n",
"sub_question_sources": [
{
"sub_question_index": 0,
"sub_question_text": "立法會:今日討論的主要議題",
"sources": [
{
"filename": "meeting_minutes_2026-05-25.pdf",
"upload_date": "2026-05-25T08:00:00Z",
"content_summary": "...",
"chunk_index": 37,
"page_number": 1,
"document_id": "abc-123"
}
]
}
],
"generate_time_ms": 1011
},
"timing": {
"decomposer_time_ms": 234,
"retriever_time_ms": 456,
"filter_time_ms": 789,
"generator_time_ms": 1011,
"total_time_ms": 2490
}
}
2. POST /api/v1/test/generate/audio
Purpose: Accept audio upload + reference transcript, transcribe via ASR, run RAG pipeline, capture all stages.
Request (multipart/form-data):
audio_file: <binary audio file> # Required
reference_transcript: <string> # Optional - if omitted, CER/WER evaluation returns N/A
profile: <"A"|"B"|"C"> # Required
label: <string> # Optional
language: <"yue"|"zh"|"en"|"auto"> # Optional, default "yue"
Response (TestGenerateResult): Same as text endpoint, with additional fields:
{
"result_id": "uuid",
"input_type": "audio",
"profile": "A",
"label": "Test run with LegCo audio clip",
"created_at": "2026-05-25T10:30:00Z",
"input": {
"text": "立法會今日討論咗咩議題?", // ASR-transcribed text
"reference_transcript": "立法會今日討論咗咩議題?", // Ground truth
"audio_filename": "legco_clip_01.wav",
"audio_duration_seconds": 45.2,
"asr_language": "yue"
},
"extracted_key_questions": [...], // Same structure as text endpoint
"retrieval": {...},
"filtered": {...},
"response": {...},
"timing": {
"asr_time_ms": 1234,
"decomposer_time_ms": 234,
"retriever_time_ms": 456,
"filter_time_ms": 789,
"generator_time_ms": 1011,
"total_time_ms": 3724
}
}
3. POST /api/v1/test/evaluate
Purpose: Read a results JSON and run all four evaluation dimensions.
Request:
{
"result_id": "uuid-of-generated-result",
"evaluation_config": {
"key_questions_evaluators": [
{
"model_name": "deepseek-v4-pro",
"base_url": "https://api.deepseek.com",
"api_key_env": "DP_API_KEY",
"enable_thinking": true
},
{
"model_name": "qwen3-7b-max",
"base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
"api_key_env": "DASHSCOPE_API_KEY",
"enable_thinking": true
}
],
"chunk_evaluator": {
"model_name": "qwen/qwen3.6-35b-a3b",
"base_url_env": "LLM_BASE_URL",
"api_key_env": "LLM_API_KEY",
"enable_thinking": true
},
"response_evaluator": {
"model_name": "qwen/qwen3.6-35b-a3b",
"base_url_env": "LLM_BASE_URL",
"api_key_env": "LLM_API_KEY",
"enable_thinking": true
}
}
}
Alternatively, accept inline results JSON:
{
"results": { /* inline TestGenerateResult */ },
"evaluation_config": { ... }
}
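One way to handle the two request shapes is to require exactly one of result_id or results. This is a sketch under that assumption; `resolve_results` and the `load_result` callback are hypothetical names.

```python
from typing import Any, Callable, Optional


def resolve_results(
    result_id: str,
    results: Optional[dict[str, Any]],
    load_result: Callable[[str], dict[str, Any]],
) -> dict[str, Any]:
    """Return the results JSON from the inline body or from storage."""
    has_id = bool(result_id)
    has_inline = results is not None
    if has_id == has_inline:
        # Either both were supplied or neither was
        raise ValueError("provide exactly one of result_id or results")
    return results if has_inline else load_result(result_id)
```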
Response (TestEvaluationResult):
{
"evaluation_id": "uuid",
"result_id": "uuid-of-original-result",
"created_at": "2026-05-25T10:35:00Z",
"status": "completed",
"audio_evaluation": {
"status": "completed",
"cer": 0.052,
"wer": 0.083,
"reference_length": 42,
"transcribed_length": 40,
"substitutions": 1,
"deletions": 2,
"insertions": 0,
"hits": 39
},
"key_questions_evaluation": {
"evaluations": [
{
"model_name": "deepseek-v4-pro",
"scores": {
"dimension_1_準確性": 35,
"dimension_2_完整性": 22,
"dimension_3_清晰度": 18,
"dimension_4_簡潔性": 13
},
"total_score": 88,
"max_score": 100,
"comments": "提取的關鍵問題準確反映了原文的核心內容,數字和關鍵詞無誤...",
"thinking_trace": "完整的思考過程...",
"time_ms": 3421
},
{
"model_name": "qwen3-7b-max",
"scores": {
"dimension_1_準確性": 37,
"dimension_2_完整性": 23,
"dimension_3_清晰度": 17,
"dimension_4_簡潔性": 14
},
"total_score": 91,
"max_score": 100,
"comments": "關鍵問題提取完整,背景資訊保留得當...",
"thinking_trace": "完整的思考過程...",
"time_ms": 2890
}
],
"average_scores": {
"dimension_1_準確性": 36.0,
"dimension_2_完整性": 22.5,
"dimension_3_清晰度": 17.5,
"dimension_4_簡潔性": 13.5
},
"average_total": 89.5
},
"chunk_evaluation": {
"per_sub_question": [
{
"sub_question_index": 0,
"sub_question_text": "立法會:今日討論的主要議題",
"ground_truth": {
"relevant_documents": ["abc-123", "def-456"],
"relevant_chunks": [
{"document_id": "abc-123", "chunk_index": 37},
{"document_id": "abc-123", "chunk_index": 38},
{"document_id": "def-456", "chunk_index": 5}
],
"total_relevant_chunks": 3,
"chunk_evaluation_time_ms": 8765
},
"unfiltered_accuracy": {
"precision": 0.60,
"recall": 1.00,
"f1": 0.75,
"pipeline_chunks": 5,
"relevant_in_pipeline": 3
},
"filtered_accuracy": {
"precision": 1.00,
"recall": 1.00,
"f1": 1.00,
"pipeline_chunks": 3,
"relevant_in_pipeline": 3
}
}
],
"overall_unfiltered": {
"avg_precision": 0.60,
"avg_recall": 1.00,
"avg_f1": 0.75
},
"overall_filtered": {
"avg_precision": 1.00,
"avg_recall": 1.00,
"avg_f1": 1.00
}
},
"response_evaluation": {
"per_sub_question": [
{
"sub_question_index": 0,
"sub_question_text": "立法會:今日討論的主要議題",
"ground_truth_response": "今日立法會討論了三項主要議題:房屋政策(包括公屋輪候時間檢討)...",
"pipeline_response_section": "今日立法會討論了三項主要議題:房屋政策、交通基建...",
"completeness_score": 0.85,
"factual_accuracy_score": 0.92,
"comments": "Pipeline response covers all key topics but misses the specific detail about 公屋輪候時間檢討",
"ground_truth_generation_time_ms": 4567,
"comparison_time_ms": 2345
}
],
"overall_completeness": 0.85,
"overall_factual_accuracy": 0.92
},
"timing": {
"audio_evaluation_time_ms": 23,
"key_questions_evaluation_time_ms": 6311,
"chunk_evaluation_time_ms": 14197,
"response_evaluation_time_ms": 6912,
"total_evaluation_time_ms": 27443
}
}
Listing Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/test/results | List all generated test result files (with ?limit=50&offset=0) |
| GET | /api/v1/test/results/{result_id} | Retrieve a specific test result JSON |
| DELETE | /api/v1/test/results/{result_id} | Delete a test result file |
| GET | /api/v1/test/evaluations | List all evaluation result files |
| GET | /api/v1/test/evaluations/{eval_id} | Retrieve a specific evaluation JSON |
| DELETE | /api/v1/test/evaluations/{eval_id} | Delete an evaluation file |
Test Profile Configuration
Profiles A, B, C are managed entirely by the existing PromptService (SQLite-backed, prompts.db). Each profile contains 8 prompt template steps: decompose, decompose_format, filter, filter_intro, filter_section, filter_outro, generate, generate_per_subq. The test generation endpoint calls PromptService.activate_profile(profile) before running the pipeline — this is the same mechanism used by the existing query endpoint.
No new config settings are needed for profiles. LLM model selection (which model runs decompose, filter, generate) is controlled by the existing env vars (LLM_MODEL_NAME, DP_MODEL_NAME). Profiles only switch prompt templates. If you want different models for different profiles, change the env vars between test runs.
Profile Usage in Test Generation
# In test_runner_service.py:
prompt_service = PromptService(db_path=settings.prompts_db_path)
prompt_service.activate_profile(profile) # "A", "B", or "C"
# Then run the pipeline as normal — templates are pulled from the active profile
Chunk Evaluation Methodology (Detailed)
The evaluation model directly processes ALL chunks from ALL documents in batches of 10 to determine relevance. There is no document-level pre-filtering — every chunk is evaluated against its key question.
How It Works
Goal: For each extracted key question, determine which chunks across ALL documents are relevant. This produces the "ground truth" list of relevant chunks, which is then compared against the pipeline's retrieved (unfiltered) and filtered chunks.
Batch size: 10 chunks per LLM call (fixed). All chunks are split into batches of 10 and evaluated in parallel.
Judgment: Binary true/false — each chunk is either relevant or not. The LLM declares relevance per chunk. Thinking mode is enabled on the evaluation model.
XML format: The LLM prompt uses XML tags to structure the key question and chunks:
System:
你正在评估文档块与关键问题的相关性。
对于每个<chunk_N>,判断其是否包含与<sub_question>相关的信息。
返回JSON:{"relevant_chunk_indices": [0, 3, 7]}(仅包含相关块的批内索引,0-9)
<sub_question>
立法会今日讨论了哪些议题?
</sub_question>
<chunk_1 doc="abc-123" page="3">
立法会今日讨论了三项主要议题:房屋政策、交通基建、医疗资源分配...
</chunk_1>
<chunk_2 doc="abc-123" page="4">
房屋政策方面,政府提出公屋轮候时间检讨,目标由5.3年缩短至4.5年...
</chunk_2>
... (up to <chunk_10>)
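The user-message part of the prompt above could be assembled as follows. This is a sketch: `build_batch_prompt` and the chunk dict keys are assumptions, and XML-escaping the chunk text is an added precaution that the plan does not mention. Tags are numbered from 1 as in the example, so how they relate to the 0-9 indices in the instruction is left to the prompt text.

```python
from xml.sax.saxutils import escape, quoteattr


def build_batch_prompt(sub_question: str, batch: list[dict]) -> str:
    """Render one key question and up to 10 chunks as XML-style tags."""
    parts = [f"<sub_question>\n{escape(sub_question)}\n</sub_question>"]
    for n, chunk in enumerate(batch, start=1):
        doc = quoteattr(str(chunk["document_id"]))   # adds the quotes
        page = quoteattr(str(chunk["page_number"]))
        parts.append(
            f"<chunk_{n} doc={doc} page={page}>\n"
            f"{escape(chunk['text'])}\n"
            f"</chunk_{n}>"
        )
    return "\n".join(parts)
```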
Process:
- Fetch ALL chunks from ALL documents via RAGService.list_chunks() for every document_id
- Split all chunks into batches of exactly 10
- For each key question, send ALL batches in parallel via asyncio.gather
- Each batch call: LLM receives the XML-format prompt (key question + 10 chunks), returns {"relevant_chunk_indices": [0, 2, 5]}, i.e. the batch-local indices (0-9) of relevant chunks only
- Map batch-local indices back to global (document_id, chunk_index) tuples using batch metadata
- Union of all relevant chunks across all batches = ground-truth relevant chunks list
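The split and merge steps can be sketched as below, with chunks represented only by their (document_id, chunk_index) keys. The helper names are illustrative. In this sketch the final batch is simply shorter when the chunk count is not a multiple of 10, and indices outside a batch are ignored as a guard against malformed LLM output.

```python
ChunkKey = tuple[str, int]  # (document_id, chunk_index)


def split_into_batches(
    chunks: list[ChunkKey], size: int = 10
) -> list[list[ChunkKey]]:
    """Split the full chunk list into consecutive batches."""
    return [chunks[i:i + size] for i in range(0, len(chunks), size)]


def merge_ground_truth(
    batches: list[list[ChunkKey]],
    batch_results: list[list[int]],
) -> set[ChunkKey]:
    """Map batch-local indices back to global keys and take the union."""
    relevant: set[ChunkKey] = set()
    for batch, local_indices in zip(batches, batch_results):
        for idx in local_indices:
            if 0 <= idx < len(batch):  # skip indices the LLM made up
                relevant.add(batch[idx])
    return relevant
```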
Why no document-level pre-filtering: Processing every chunk directly in batches of 10 is simpler and avoids the risk of the document-level step incorrectly excluding relevant documents. The batch size of 10 is small enough that the LLM gives each chunk focused attention for accurate binary judgment. Batches run in parallel up to the concurrency cap (default 10), so wall-clock time grows with the number of waves of batches rather than with the total chunk count.
Comparison: Precision / Recall / F1
For unfiltered chunks (from retrieval stage):
- Precision = |ground_truth ∩ retrieved| / |retrieved|
- Recall = |ground_truth ∩ retrieved| / |ground_truth|
- F1 = 2 × (precision × recall) / (precision + recall)
For filtered chunks (after relevance filter):
- Same formulas, using filtered chunks instead of retrieved
Chunks are matched by (document_id, chunk_index) tuple.
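A small sketch of these formulas over sets of (document_id, chunk_index) tuples. The function name is illustrative, and returning 0.0 when a denominator is zero is an assumed convention that the plan does not specify.

```python
def precision_recall_f1(ground_truth: set, pipeline: set) -> dict[str, float]:
    """Compare pipeline chunks against ground truth by set intersection."""
    hits = len(ground_truth & pipeline)
    precision = hits / len(pipeline) if pipeline else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "pipeline_chunks": len(pipeline),
        "relevant_in_pipeline": hits,
    }
```

With 3 ground-truth chunks and 5 retrieved chunks that include all 3, this gives precision 0.6, recall 1.0, and an F1 of about 0.75, in line with the unfiltered_accuracy example in the evaluation response.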
Response Accuracy Evaluation
Goal: Generate an "ideal" response from the ground-truth relevant chunks and compare it against the pipeline's actual response.
Process:
- Generate "ground truth response": Send the ground-truth relevant chunks + key question to the evaluation LLM (same model as generation, thinking enabled). Prompt: "Using ONLY these document chunks, answer the key question. Cite your sources."
- Compare responses: Send both responses (ground truth + pipeline) + key question to the evaluation LLM with thinking enabled. Prompt: "Compare these two responses to the key question. Does the second response contain all the key information present in the first response? Score completeness 0-1 and factual accuracy 0-1. Note any missing information."
- LLM returns {"completeness_score": 0.85, "factual_accuracy_score": 0.92, "comments": "..."}
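Because the comparison result comes back as LLM-generated JSON, it is worth validating before it is stored. A minimal sketch, assuming the raw reply is a bare JSON object; `parse_comparison` is a hypothetical helper.

```python
import json


def parse_comparison(raw: str) -> dict:
    """Parse the comparison reply and check both scores are within 0-1."""
    data = json.loads(raw)
    result = {}
    for key in ("completeness_score", "factual_accuracy_score"):
        value = float(data[key])  # KeyError if the LLM omitted the field
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key} must be within 0-1, got {value}")
        result[key] = value
    result["comments"] = str(data.get("comments", ""))
    return result
```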
Batching Strategy for Chunk Evaluation
The chunk evaluation processes ALL chunks from ALL documents in batches of 10. Without batching, sending hundreds of chunks in a single LLM call hits context limits, degrades attention quality, and is expensive with thinking mode enabled. Batching splits the work into manageable units of 10 chunks per call.
Why Batching Is Needed
| Constraint | Impact |
|---|---|
| Context window | 100+ chunks with 1000 tokens each = 100K+ tokens before the prompt itself |
| LLM quality | Long contexts degrade attention quality. LLMs lose focus on middle sections of very long prompts |
| Cost | Sending 100K+ tokens per call × thinking enabled = expensive. 10 smaller calls use fewer total tokens |
| Reliability | One giant call failing means restarting everything. Batched calls can retry individually |
How Batching Works
Batching splits chunks into groups of 10, sends each batch to the LLM independently in parallel, then merges results.
INPUT: 60 chunks from all documents, 1 key question
│
▼
SPLIT into batches of 10:
Batch 1: chunks 0-9 → LLM → relevant indices from this batch
Batch 2: chunks 10-19 → LLM → relevant indices from this batch
Batch 3: chunks 20-29 → LLM → relevant indices from this batch
Batch 4: chunks 30-39 → LLM → relevant indices from this batch
Batch 5: chunks 40-49 → LLM → relevant indices from this batch
Batch 6: chunks 50-59 → LLM → relevant indices from this batch
│
▼
MERGE: map batch-local indices → global (doc_id, chunk_index)
→ union of all relevant chunks from all batches
Chunk Batching Details
Batch size: 10 chunks per LLM call (fixed; optimized for attention quality).
Judgment: Binary true/false. The LLM declares each chunk as relevant or not (no 0-10 score). Thinking mode is enabled on the evaluation model.
XML format: The LLM prompt uses XML tags to structure the key question and chunks:
System/Instruction:
你正在评估文档块与关键问题的相关性。
对于每个<chunk_N>,判断其是否包含与<sub_question>相关的信息。
返回JSON:{"relevant_chunk_indices": [0, 3, 7]}(仅包含相关的块索引)
<sub_question>
立法会今日讨论了哪些议题?
</sub_question>
<chunk_1 doc="abc-123" page="3">
立法会今日讨论了三项主要议题:房屋政策、交通基建、医疗资源分配。多位议员就公屋轮候时间...
</chunk_1>
<chunk_2 doc="abc-123" page="4">
房屋政策方面,政府提出公屋轮候时间检讨,目标由5.3年缩短至4.5年...
</chunk_2>
<chunk_3 doc="def-456" page="7">
新界北发展区的交通配套包括兴建连接路及增设巴士路线...
</chunk_3>
... (up to <chunk_10>)
What's in each batch:
- <sub_question>: the key question being evaluated
- Exactly 10 <chunk_N> elements, each containing:
  - doc attribute: document ID for mapping results back
  - page attribute: page number for context
  - Full chunk text as element content
Scoring (binary): The LLM returns a JSON array of batch-local indices (0-9) for chunks judged relevant:
{"relevant_chunk_indices": [0, 2, 5]}
Chunks NOT in this array are considered not relevant. There is no score or threshold — it's a binary true/false decision per chunk.
Merge logic: After all batches complete, collect all relevant_chunk_indices arrays. For each returned index, map the batch-local index back to the global (document_id, chunk_index) tuple using the batch's metadata. The union across all batches = the ground-truth relevant chunks list.
Failure handling per batch: If a batch fails (timeout, API error), retry up to 2 times with the same 10 chunks. If it still fails → log the error, mark the batch as incomplete, and continue with other batches. The ground truth will be missing chunks from the failed batch — flag this as "round_2_incomplete_batches": [batch_index, ...].
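The retry-then-continue behavior might look like the following sketch. `evaluate_with_retry` and `evaluate_all` are hypothetical names, each batch call is modeled as a zero-argument coroutine function, and a batch that exhausts its retries is returned as None and reported by index.

```python
import asyncio
from typing import Awaitable, Callable, Optional

BatchCall = Callable[[], Awaitable[list[int]]]


async def evaluate_with_retry(
    call: BatchCall, retries: int = 2, delay_s: float = 2.0
) -> Optional[list[int]]:
    """Run one batch; return None once every attempt has failed."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except Exception:
            # A real service would log the error here
            if attempt < retries:
                await asyncio.sleep(delay_s)
    return None


async def evaluate_all(
    calls: list[BatchCall], retries: int = 2, delay_s: float = 2.0
) -> tuple[list[Optional[list[int]]], list[int]]:
    """Run all batches; also return the indices of incomplete batches."""
    results = await asyncio.gather(
        *(evaluate_with_retry(c, retries, delay_s) for c in calls)
    )
    incomplete = [i for i, r in enumerate(results) if r is None]
    return list(results), incomplete
```

Returning None instead of raising keeps one failing batch from cancelling the others inside asyncio.gather, which is what lets the evaluation continue with a flagged, partial ground truth.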
Batch Parallelization
All batches are independent but concurrency is limited to avoid overloading the LLM service and hitting rate limits. A semaphore caps concurrent LLM calls at a configurable maximum (default: 10).
import asyncio

# Cap on concurrent LLM calls (default: 10)
semaphore = asyncio.Semaphore(settings.eval_max_concurrent_batches)

async def evaluate_batch_with_limit(batch, key_question):
    # Wait for a free slot before calling the LLM
    async with semaphore:
        return await evaluate_chunk_batch(batch, key_question)

# 60 chunks → 6 batches → up to 6 concurrent LLM calls (well within limit)
# 250 chunks → 25 batches → up to 10 concurrent at a time, the rest queue
chunk_results = await asyncio.gather(*[
    evaluate_batch_with_limit(batch, key_question)
    for batch in chunk_batches
])
With 10 chunks per call and a max of 10 concurrent calls, each wave processes up to 100 chunks simultaneously. The semaphore ensures that even with 250+ batches, the LLM provider never sees more than 10 concurrent requests at once.
Batching Configuration
# In core/config.py:
class Settings(BaseSettings):
    # Evaluation batching (Package 9)
    eval_chunk_batch_size: int = 10        # Chunks per batch (fixed; 10 is optimal)
    eval_max_concurrent_batches: int = 10  # Max concurrent LLM calls (rate limiting)
    eval_batch_retry_count: int = 2        # Retries per failed batch
    eval_batch_retry_delay_ms: int = 2000  # Delay between retries
Batching Flow (Complete)
For each key question (1-3 sub-questions):
┌─ Chunk Evaluation (10 chunks/batch) ─────────────────┐
│ │
│ ALL chunks from ALL documents (60 chunks) │
│ │ │
│ ├─ Batch 1 (chunks 0-9) → LLM → [0,2,5] ✅ │
│ ├─ Batch 2 (chunks 10-19) → LLM → [3,7] ✅ │
│ ├─ Batch 3 (chunks 20-29) → LLM → [1] ✅ │
│ ├─ Batch 4 (chunks 30-39) → LLM → [] ✅ │
│ ├─ Batch 5 (chunks 40-49) → LLM → [4,8] ✅ │
│ └─ Batch 6 (chunks 50-59) → LLM → [0,5] ✅ │
│ │ │
│ ▼ │
│ MERGE: batch-local → global (doc_id, chunk_index) │
│ chunks 0,2,5,13,17,21,44,48,50,55 │
│ │
│ Ground truth: 10 relevant chunks │
└───────────────────────────────────────────────────────┘
Batching vs. No Batching: Decision
| Scenario | Chunks total | Batches (10/batch) | Notes |
|---|---|---|---|
| Small (1-3 PDFs) | 60-150 | 6-15 | Always batched |
| Medium (5-10 PDFs) | 160-400 | 16-40 | Always batched |
| Large (20+ PDFs) | 500-2500 | 50-250 | Always batched |
Batch size is always 10. Even for small tests, splitting into batches of 10 ensures the LLM gives each chunk focused attention for binary relevance judgment. Batches run in parallel up to the concurrency cap (default 10), so up to 100 chunks finish in roughly one round-trip, while larger corpora queue in additional waves. More batches always mean more API cost and, beyond the cap, more waiting time.
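The batch counts in the table, and a rough number of sequential waves under the concurrency cap, can be estimated with simple ceiling division. `estimate_waves` is an illustrative helper; the wave count is only a lower bound on sequential round-trips and assumes every call takes about the same time and none is retried.

```python
import math


def estimate_waves(
    total_chunks: int, batch_size: int = 10, max_concurrent: int = 10
) -> tuple[int, int]:
    """Return (number of batches, approximate number of waves)."""
    batches = math.ceil(total_chunks / batch_size)
    waves = math.ceil(batches / max_concurrent)
    return batches, waves
```

For example, 60 chunks give 6 batches in a single wave, while 250 chunks give 25 batches spread over 3 waves.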
Implementation Plan
Sub-Phase 9.0: Config & Models (Foundation)
Objective: Add all config settings, Pydantic models, and type definitions for the accuracy testing system.
Test Files:
- backend/app/test/test_phase9_config.py: test storage dir config, batching config defaults
- backend/app/test/test_phase9_models.py: test Pydantic model validation, serialization
Implementation Tasks:
-
Extend
backend/app/core/config.py:- Add
test_results_dir: str = "./data/test_results" - Add
test_evaluations_dir: str = "./data/test_evaluations" - Add
eval_chunk_batch_size: int = 10— chunks per batch (fixed at 10) - Add
eval_max_concurrent_batches: int = 10— max concurrent LLM calls - Add
eval_batch_retry_count: int = 2— retries per failed batch - Add
eval_batch_retry_delay_ms: int = 2000— delay between retries
- Add
-
Create
backend/app/models/testing.py— ALL testing-related Pydantic models:Request models:
TestGenerateTextRequest(question: str, profile: Literal["A","B","C"], label: str = "")TestGenerateAudioRequest(profile: Literal["A","B","C"], label: str = "", language: str = "yue")— used as Form fields
Result models (the output JSON):
TestChunkEntry(chunk_index: int, text: str, metadata: Dict, distance: float = 0.0)— a single chunk in resultsTestSubQuestionChunks(sub_question_index: int, sub_question_text: str, chunks: List[TestChunkEntry])TestRetrievalResult(per_sub_question: List[TestSubQuestionChunks], total_chunks_retrieved: int, retriever_time_ms: int)TestFilteredResult(per_sub_question: List[TestSubQuestionChunks], total_chunks_filtered: int, filter_time_ms: int)TestSubQuestionSources(sub_question_index: int, sub_question_text: str, sources: List[SourceMetadata])TestResponseResult(final_answer: str, sub_question_sources: List[TestSubQuestionSources], generate_time_ms: int)TestInputInfo(text: str, reference_transcript: str = "", audio_filename: str = "", audio_duration_seconds: float = 0, asr_language: str = "")TestTimingInfo(decomposer_time_ms: int, retriever_time_ms: int, filter_time_ms: int, generator_time_ms: int, total_time_ms: int, asr_time_ms: int = 0)TestGenerateResult(result_id: str, input_type: Literal["text","audio"], profile: str, label: str, created_at: str, input: TestInputInfo, extracted_key_questions: List[str], retrieval: TestRetrievalResult, filtered: TestFilteredResult, response: TestResponseResult, timing: TestTimingInfo)
Evaluation models:
  - `EvaluatorConfig(model_name: str, base_url: str, api_key_env: str, enable_thinking: bool)`
  - `EvaluationConfigRequest(key_questions_evaluators: List[EvaluatorConfig], chunk_evaluator: EvaluatorConfig, response_evaluator: EvaluatorConfig)`
  - `TestEvaluateRequest(result_id: str = "", results: Optional[TestGenerateResult] = None, evaluation_config: EvaluationConfigRequest)`
  - `DimensionScores(dimension_1_準確性: int, dimension_2_完整性: int, dimension_3_清晰度: int, dimension_4_簡潔性: int)`
  - `KeyQuestionsEvalEntry(model_name: str, scores: DimensionScores, total_score: int, max_score: int, comments: str, thinking_trace: str, time_ms: int)`
  - `KeyQuestionsEvalResult(evaluations: List[KeyQuestionsEvalEntry], average_scores: DimensionScores, average_total: float)`
  - `AudioEvalResult(cer: float, wer: float, reference_length: int, transcribed_length: int, substitutions: int, deletions: int, insertions: int, hits: int)`: `null` (N/A) when no reference transcript was provided
  - `ChunkAccuracy(precision: float, recall: float, f1: float, pipeline_chunks: int, relevant_in_pipeline: int)`
  - `GroundTruthInfo(relevant_documents: List[str], relevant_chunks: List[Dict], total_relevant_chunks: int, chunk_evaluation_time_ms: int)`
  - `SubQuestionChunkEval(sub_question_index: int, sub_question_text: str, ground_truth: GroundTruthInfo, unfiltered_accuracy: ChunkAccuracy, filtered_accuracy: ChunkAccuracy)`
  - `ChunkEvalResult(per_sub_question: List[SubQuestionChunkEval], overall_unfiltered: ChunkAccuracy, overall_filtered: ChunkAccuracy)`
  - `SubQuestionResponseEval(sub_question_index: int, sub_question_text: str, ground_truth_response: str, pipeline_response_section: str, completeness_score: float, factual_accuracy_score: float, comments: str, ground_truth_generation_time_ms: int, comparison_time_ms: int)`
  - `ResponseEvalResult(per_sub_question: List[SubQuestionResponseEval], overall_completeness: float, overall_factual_accuracy: float)`
  - `EvaluationTiming(audio_evaluation_time_ms: int, key_questions_evaluation_time_ms: int, chunk_evaluation_time_ms: int, response_evaluation_time_ms: int, total_evaluation_time_ms: int)`
  - `TestEvaluationResult(evaluation_id: str, result_id: str, created_at: str, status: Literal["completed","partial","failed"], audio_evaluation: Optional[AudioEvalResult] = None, key_questions_evaluation: Optional[KeyQuestionsEvalResult] = None, chunk_evaluation: Optional[ChunkEvalResult] = None, response_evaluation: Optional[ResponseEvalResult] = None, timing: EvaluationTiming)`
- Update `backend/.env.example` with new env vars:

      # Test results storage (Package 9)
      TEST_RESULTS_DIR=./data/test_results
      TEST_EVALUATIONS_DIR=./data/test_evaluations
      # Evaluation batching (Package 9)
      EVAL_CHUNK_BATCH_SIZE=10
      EVAL_MAX_CONCURRENT_BATCHES=10
      EVAL_BATCH_RETRY_COUNT=2
      EVAL_BATCH_RETRY_DELAY_MS=2000

- Profile resolution: No new service needed. The existing `PromptService.activate_profile(name)` is called before running the pipeline, and `get_active_profile_name()` is used to record which profile was active in the test result.
Sub-Phase 9.1: Results Generation APIs
Objective: Implement the two generation endpoints (/api/v1/test/generate/text and /api/v1/test/generate/audio) and the results storage/retrieval endpoints.
Test Files:
- `backend/app/test/test_phase9_generate_text.py`: text generation endpoint tests
- `backend/app/test/test_phase9_generate_audio.py`: audio generation endpoint tests
- `backend/app/test/test_phase9_results_storage.py`: results file CRUD tests
- `backend/app/test/acceptance/test_acceptance_phase9_generate_text.py`: real LLM run for text input
- `backend/app/test/acceptance/test_acceptance_phase9_generate_audio.py`: real ASR + LLM run
Implementation Tasks:
- Create `backend/app/services/test_runner_service.py`:
  - `async run_text_test(question: str, profile: str, settings: Settings) -> TestGenerateResult`
    - Resolves profile-specific LLM config via `test_profile_service`
    - Creates `LLMClient`/`LLMClientDP` with profile-specific models
    - Activates profile via `PromptService.activate_profile(profile)`
    - Runs the full 4-stage RAG pipeline (reusing existing services: `QueryDecomposer`, `RAGService`, `RelevanceFilter`)
    - Captures all intermediate data at each stage (same data available in `_query_stream()`)
    - Returns structured `TestGenerateResult` with all stage outputs
  - `async run_audio_test(audio_bytes, reference_transcript, profile, language, settings) -> TestGenerateResult`
    - Runs ASR transcription via `ASRClient.transcribe_full()`
    - Uses transcribed text as question for `run_text_test()`
    - Adds `input.reference_transcript`, `input.audio_filename`, `input.audio_duration_seconds`, `timing.asr_time_ms`
- Create `backend/app/services/test_storage_service.py`:
  - `save_result(result: TestGenerateResult) -> str`: writes JSON to `data/test_results/{result_id}.json`, returns file path
  - `load_result(result_id: str) -> TestGenerateResult | None`: reads and parses JSON
  - `list_results(limit: int = 50, offset: int = 0) -> List[dict]`: lists saved results
  - `delete_result(result_id: str) -> bool`: deletes result file
  - `save_evaluation(eval: TestEvaluationResult) -> str`: writes evaluation JSON
  - `load_evaluation(eval_id: str) -> TestEvaluationResult | None`
  - `list_evaluations(limit: int = 50, offset: int = 0) -> List[dict]`
  - `delete_evaluation(eval_id: str) -> bool`
- Create `backend/app/routers/test_generate.py`:
  - `POST /test/generate/text`: accepts `TestGenerateTextRequest` JSON body
  - `POST /test/generate/audio`: accepts multipart form with `audio_file`, `reference_transcript`, `profile`, `label`, `language`
  - `GET /test/results`: lists results
  - `GET /test/results/{result_id}`: retrieves specific result
  - `DELETE /test/results/{result_id}`: deletes result
- Register in `backend/app/main.py`:

      from app.routers import test_generate
      app.include_router(test_generate.router, prefix="/api/v1")

- Pipeline instrumentation: The key challenge is capturing intermediate data. Two approaches:
  - Approach A (recommended): Refactor `_query_stream()` in `routers/query.py` to extract the core pipeline logic into a reusable async generator in `test_runner_service.py` that yields both SSE events AND structured data snapshots. The existing query endpoint wraps this generator for SSE streaming; the test endpoint collects the snapshots.
  - Approach B (simpler but duplicative): Duplicate the pipeline logic in `test_runner_service.py`, calling the same services (decomposer, RAG, filter) directly. This is easier but creates a maintenance burden.

  Decision: Use Approach A. Extract `async def _run_pipeline(question, settings, prompt_service) -> AsyncGenerator[PipelineStep, None]`, where each `PipelineStep` is a dataclass containing both the SSE event data AND the intermediate data structures. The query router yields SSE; the test runner collects snapshots.
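A minimal sketch of the Approach A shape, assuming hypothetical stage names and payloads: one async generator yields a `PipelineStep` per stage, the query side forwards only the SSE payloads, and the test side merges the snapshots. The stage bodies here are stand-ins and do not call the real decomposer, RAG, or filter services.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, AsyncGenerator, Dict, List


@dataclass
class PipelineStep:
    """One pipeline stage: the SSE payload plus the structured snapshot."""
    stage: str
    sse_event: Dict[str, Any]
    snapshot: Dict[str, Any] = field(default_factory=dict)


async def run_pipeline(question: str) -> AsyncGenerator[PipelineStep, None]:
    # Hypothetical stand-ins for decomposition, retrieval, filtering, generation.
    sub_questions = [question]
    yield PipelineStep("decompose", {"type": "key_questions"}, {"key_questions": sub_questions})
    chunks = [{"document_id": "doc-1", "chunk_index": 0}]
    yield PipelineStep("retrieve", {"type": "chunks"}, {"retrieved_chunks": chunks})
    yield PipelineStep("filter", {"type": "filtered"}, {"filtered_chunks": chunks})
    yield PipelineStep("generate", {"type": "answer"}, {"final_response": "..."})


async def stream_sse(question: str) -> List[Dict[str, Any]]:
    """Query-router side: forward only the SSE payloads."""
    return [step.sse_event async for step in run_pipeline(question)]


async def collect_snapshots(question: str) -> Dict[str, Any]:
    """Test-runner side: merge every stage snapshot into one result dict."""
    result: Dict[str, Any] = {}
    async for step in run_pipeline(question):
        result.update(step.snapshot)
    return result
```

Both consumers iterate the same generator, so a stage added to the pipeline would reach the SSE stream and the captured result together.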
Sub-Phase 9.2: Evaluation Engine
Objective: Implement all evaluation logic — CER/WER calculation, key questions marking scheme evaluator, chunk evaluation, and response comparison.
Design Note — Separate LLMClient instances: Each evaluator creates its own LLMClient instance with enable_thinking=True hardcoded. This is because enable_thinking is constructor-level (not a complete() parameter), and the existing pipeline's LLMClient may have thinking disabled for cost/speed. Evaluation clients are lightweight AsyncOpenAI wrappers — no shared state, zero impact on existing services.
Test Files:
- `backend/app/test/test_phase9_cer_wer.py`: CER/WER calculation unit tests
- `backend/app/test/test_phase9_key_questions_eval.py`: marking scheme evaluator tests (mocked LLM)
- `backend/app/test/test_phase9_chunk_evaluator.py`: chunk evaluation tests (mocked LLM)
- `backend/app/test/test_phase9_response_evaluator.py`: response comparison tests (mocked LLM)
Implementation Tasks:
- Create `backend/app/services/cer_wer.py`:
  - `calculate_cer(reference: str, hypothesis: str) -> dict`: Character Error Rate via Levenshtein distance at character level
  - `calculate_wer(reference: str, hypothesis: str) -> dict`: Word Error Rate via Levenshtein distance at word level
  - Returns `{cer, reference_length, transcribed_length, substitutions, deletions, insertions, hits}`

  Implementation: Use `python-Levenshtein` or implement a simple edit distance. For Chinese text, CER operates on individual characters. Mixed Chinese/English text needs a tokenizer that handles both (e.g., jieba for Chinese words, spaCy for English).

  Simpler approach: For CER, split on characters. For WER, split on whitespace + punctuation. This avoids adding heavy NLP dependencies.
- Create `backend/app/services/key_questions_evaluator.py`:
  - `async evaluate_key_questions(original_text: str, extracted_questions: List[str], evaluator_configs: List[EvaluatorConfig]) -> KeyQuestionsEvalResult`

  For each evaluator model (run in parallel via `asyncio.gather`):
  - Create an `LLMClient` with the evaluator's config (base_url, api_key, model_name)
  - Set `enable_thinking=True` (the evaluator models need thinking for quality scoring)
  - Build prompt with the 4-dimension Chinese marking scheme
  - Use `complete_structured()` with a Pydantic model for structured scoring output
  - Retry up to 3 times on failure (exponential backoff: 2s, 4s, 8s)
  - If a model still fails after 3 retries → abandon the entire key questions evaluation (mark `evaluation_status: "failed"`, do NOT return partial scores)

  Scoring model:

      from pydantic import BaseModel, Field

      class KeyQuestionScore(BaseModel):
          dimension_1_準確性: int = Field(ge=0, le=40, description="...")
          dimension_2_完整性: int = Field(ge=0, le=25, description="...")
          dimension_3_清晰度: int = Field(ge=0, le=20, description="...")
          dimension_4_簡潔性: int = Field(ge=0, le=15, description="...")
          total_score: int = Field(ge=0, le=100)
          comments: str = Field(description="Brief rationale for the scores")

  Prompt template (prompt key: `"evaluate_key_questions"`, stored in a new step):

  Language: All evaluation prompts, scoring rubrics, and LLM interactions are in Chinese, since all key questions, document chunks, and final responses are primarily Chinese-language LegCo content.

      你正在评估从立法会文件中提取的关键问题的质量。
      原文/转录文本: {original_text}
      提取的关键问题: {extracted_questions}
      请根据以下评分标准评估这些关键问题:
      [Full marking scheme table with 4 dimensions, weights, scoring descriptions, deduction guidelines]
      对每个维度,请给出分数和简要理由。返回符合指定格式的JSON。

- Create `backend/app/services/chunk_evaluator.py`:
  - `async evaluate_chunks(result: TestGenerateResult, evaluator_config: EvaluatorConfig, settings: Settings) -> ChunkEvalResult`

  Chunk Evaluation (single-round, all chunks, no document pre-filtering):
  - Fetch ALL chunks from ALL documents via `RAGService.list_documents()` + `list_chunks()`
  - For each key question, split all chunks into batches of 10 (the final batch may be smaller when the chunk count is not a multiple of 10)
  - Build XML-format prompt per batch: `<sub_question>...</sub_question>` + `<chunk_1 doc="x" page="y">text</chunk_1>` × 10
  - Send all batches in parallel via `asyncio.gather`
  - Use evaluator LLM with thinking enabled for binary true/false relevance judgment
  - LLM returns `{"relevant_chunk_indices": [0, 3, 5]}`: batch-local indices (0-9) of relevant chunks only
  - Map batch-local indices to global `(document_id, chunk_index)` tuples using batch metadata
  - Union of all batches → ground-truth relevant chunks list
  - Use `complete_structured()` with a simple Pydantic model `ChunkRelevanceResult`

  Comparison:
  - Build sets of `(document_id, chunk_index)` tuples for both ground truth and pipeline results
  - Compute precision, recall, F1 for unfiltered chunks vs ground truth
  - Compute precision, recall, F1 for filtered chunks vs ground truth
- Create `backend/app/services/response_evaluator.py`:
  - `async evaluate_response(result: TestGenerateResult, ground_truth_chunks: List, evaluator_config: EvaluatorConfig) -> ResponseEvalResult`

  Step 1: Generate ground-truth response:
  - Use the ground-truth relevant chunks (from chunk evaluation) as context
  - Call `RAGService.generate_response_per_subquestion()` with the same generation settings but using the ground-truth chunks as input
  - This produces an "ideal" response from the independently-identified relevant chunks

  Step 2: Compare responses:
  - Send ground-truth response + pipeline response + key questions to evaluation LLM with thinking enabled
  - Prompt (Chinese): "比较以下两个对关键问题的回答。第一个回答(Ground Truth)是由独立验证的相关文档块生成的。第二个回答(Pipeline)是由被评估的系统生成的。对于每个关键问题:Pipeline回答是否包含了Ground Truth回答中的所有关键信息?对完整性评分0-1,对事实准确性评分0-1。"
  - Use `complete_structured()` with:

        class ResponseComparison(BaseModel):
            completeness_score: float = Field(ge=0, le=1)
            factual_accuracy_score: float = Field(ge=0, le=1)
            comments: str
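For the CER/WER task in this sub-phase, the "simpler approach" (characters for CER, whitespace and punctuation splitting for WER) can be sketched without any extra dependency. This is an illustrative pure-Python edit distance, not the `python-Levenshtein` path; dropping whitespace before CER and the handling of an empty reference are assumptions, and `align_counts` is a hypothetical helper name.

```python
import re
from typing import Dict, List, Sequence


def align_counts(ref: Sequence[str], hyp: Sequence[str]) -> Dict[str, int]:
    """Levenshtein alignment returning substitution/deletion/insertion/hit counts."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Walk back from the bottom-right corner to classify each edit.
    counts = {"substitutions": 0, "deletions": 0, "insertions": 0, "hits": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            counts["hits" if ref[i - 1] == hyp[j - 1] else "substitutions"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    return counts


def _error_rate(ref: List[str], hyp: List[str], key: str) -> dict:
    counts = align_counts(ref, hyp)
    errors = counts["substitutions"] + counts["deletions"] + counts["insertions"]
    # Assumption: an empty reference scores 0.0 if the hypothesis is also empty, else 1.0.
    rate = errors / len(ref) if ref else float(bool(hyp))
    return {key: rate, "reference_length": len(ref), "transcribed_length": len(hyp), **counts}


def calculate_cer(reference: str, hypothesis: str) -> dict:
    # Character level; whitespace is dropped first (an assumption of this sketch).
    chars = lambda s: [c for c in s if not c.isspace()]
    return _error_rate(chars(reference), chars(hypothesis), "cer")


def calculate_wer(reference: str, hypothesis: str) -> dict:
    # Word level: \w+ keeps runs of word characters and drops punctuation.
    # Unspaced Chinese text collapses into long tokens, so WER is coarse there.
    words = lambda s: re.findall(r"\w+", s)
    return _error_rate(words(reference), words(hypothesis), "wer")
```

Because unspaced Chinese yields very few "words" under this split, character-level CER is the more informative number for such text.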
Sub-Phase 9.3: Evaluation API & Service
Objective: Wire the evaluation engine into an API endpoint with storage and retrieval.
Test Files:
- `backend/app/test/test_phase9_evaluate.py`: evaluation endpoint integration tests (mocked LLMs)
- `backend/app/test/test_phase9_evaluate_edge_cases.py`: edge cases (missing audio, empty results)
- `backend/app/test/acceptance/test_acceptance_phase9_evaluate.py`: full evaluation with real LLMs
Implementation Tasks:
- Create `backend/app/services/test_evaluation_service.py`:
  - `async run_evaluation(request: TestEvaluateRequest, settings: Settings) -> TestEvaluationResult`
    - Loads the results JSON (from `result_id` or inline)
    - Runs evaluations with dependency chain: chunk eval provides ground truth chunks for response eval
    - For audio inputs: runs CER/WER evaluation (N/A if no reference transcript)
    - Runs key questions evaluation: two evaluators in parallel via `asyncio.gather`, 3 retries each on failure
    - Runs chunk evaluation: batched chunk-level relevance judgments, then comparison (precision/recall/F1)
    - Runs response evaluation: uses ground-truth chunks from chunk evaluation
    - Saves evaluation result via `TestStorageService`
    - Returns `TestEvaluationResult`
    - Config parameters: `eval_chunk_batch_size=10`, `eval_max_concurrent_batches=10`, `eval_batch_retry_count=2`, `eval_batch_retry_delay_ms=2000`
- Create `backend/app/routers/test_evaluate.py`:
  - `POST /test/evaluate`: accepts `TestEvaluateRequest`, returns `TestEvaluationResult`
  - `GET /test/evaluations`: lists evaluations
  - `GET /test/evaluations/{eval_id}`: retrieves evaluation
  - `DELETE /test/evaluations/{eval_id}`: deletes evaluation
- Register in `backend/app/main.py`:

      from app.routers import test_evaluate
      app.include_router(test_evaluate.router, prefix="/api/v1")

- Error handling:
  - If a key questions evaluator model fails after retries → `key_questions_evaluation` is `null`, overall `status: "failed"`, error details logged. No partial key questions scores.
  - If chunk evaluation or response evaluation fails → mark those sections as `"failed"` with error details, overall `status: "partial"` (these dimensions are independent).
  - Never fail the entire evaluation because one non-critical dimension failed.
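The retry and all-or-nothing rules above can be sketched as follows. This assumes "retry up to 3 times" means one initial attempt plus three retries with 2s, 4s, 8s delays; `with_retries`, `evaluate_all_or_nothing`, and `overall_status` are hypothetical helper names, not functions from the codebase.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


async def with_retries(call: Callable[[], Awaitable[T]], retries: int = 3, base_delay: float = 2.0) -> T:
    """Run `call`; on failure retry up to `retries` times with doubling delays."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except Exception:
            if attempt == retries:
                raise
            # Delays are base_delay * 1, 2, 4, ... (2s, 4s, 8s with the defaults).
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


async def evaluate_all_or_nothing(
    calls: Sequence[Callable[[], Awaitable[T]]], retries: int = 3, base_delay: float = 2.0
) -> Optional[List[T]]:
    """Run every evaluator in parallel; return None if any still fails after retries."""
    results = await asyncio.gather(
        *(with_retries(call, retries, base_delay) for call in calls),
        return_exceptions=True,
    )
    if any(isinstance(r, BaseException) for r in results):
        return None  # no partial key questions scores
    return list(results)


def overall_status(key_questions_ok: bool, chunk_ok: bool, response_ok: bool) -> str:
    """Map per-dimension outcomes to the overall status described above."""
    if not key_questions_ok:
        return "failed"
    return "completed" if chunk_ok and response_ok else "partial"
```

Using `return_exceptions=True` lets every evaluator finish its own retries before the all-or-nothing decision is made, instead of cancelling the others on the first failure.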
Sub-Phase 9.4: Integration & Acceptance Testing
Objective: End-to-end verification with real LLMs and test materials.
Test Files:
- `backend/app/test/test_phase9_integration.py`: full pipeline, generate → evaluate (mocked LLMs)
- `backend/app/test/acceptance/test_acceptance_phase9_full_pipeline.py`: real LLM generation + evaluation
- `backend/app/test/acceptance/test_acceptance_phase9_key_questions_eval.py`: real dual-model key questions scoring
- `backend/app/test/acceptance/test_acceptance_phase9_chunk_eval.py`: real chunk evaluation
Implementation Tasks:
- Run `pytest backend/app/test/test_phase9*.py -v`: all integration tests pass
- Run `pytest backend/app/test/acceptance/test_acceptance_phase9*.py -v -m acceptance`: acceptance tests pass
- Test with real LegCo test materials: `test materials/LTT/` PDFs + LegCo MP4 files
- Verify profile switching: run same question with profiles A, B, C → different prompt templates used
- Verify evaluation reproducibility: run same evaluation twice → same scores
- Test error recovery: missing API key, timeout, invalid result_id
File Change Summary
New Files
| File | Purpose |
|---|---|
| `backend/app/models/testing.py` | All Pydantic models for test generation and evaluation (20+ models) |
| `backend/app/routers/test_generate.py` | Generation endpoints (text + audio) + results CRUD |
| `backend/app/routers/test_evaluate.py` | Evaluation endpoint + evaluations CRUD |
| `backend/app/services/test_runner_service.py` | Run full RAG pipeline and capture intermediate data |
| `backend/app/services/test_storage_service.py` | File-based JSON storage for test results and evaluations |
| `backend/app/services/cer_wer.py` | CER/WER calculation via character/word edit distance |
| `backend/app/services/key_questions_evaluator.py` | Dual-model key questions scoring with marking scheme |
| `backend/app/services/chunk_evaluator.py` | Single-round batched chunk relevance evaluation (all chunks, no document pre-filtering) |
| `backend/app/services/response_evaluator.py` | Response comparison via ground-truth generation |
| `backend/app/services/test_evaluation_service.py` | Orchestrates all four evaluation dimensions |
| `backend/app/test/test_phase9_config.py` | Config validation tests |
| `backend/app/test/test_phase9_models.py` | Pydantic model validation/serialization tests |
| `backend/app/test/test_phase9_generate_text.py` | Text generation endpoint tests |
| `backend/app/test/test_phase9_generate_audio.py` | Audio generation endpoint tests |
| `backend/app/test/test_phase9_results_storage.py` | Results file CRUD tests |
| `backend/app/test/test_phase9_cer_wer.py` | CER/WER unit tests |
| `backend/app/test/test_phase9_key_questions_eval.py` | Key questions evaluator tests |
| `backend/app/test/test_phase9_chunk_evaluator.py` | Chunk evaluator tests |
| `backend/app/test/test_phase9_response_evaluator.py` | Response evaluator tests |
| `backend/app/test/test_phase9_evaluate.py` | Evaluation endpoint integration tests |
| `backend/app/test/test_phase9_evaluate_edge_cases.py` | Edge case tests |
| `backend/app/test/test_phase9_integration.py` | Full generate→evaluate integration tests |
| `backend/app/test/acceptance/test_acceptance_phase9_generate_text.py` | Real LLM text generation |
| `backend/app/test/acceptance/test_acceptance_phase9_generate_audio.py` | Real ASR + LLM audio generation |
| `backend/app/test/acceptance/test_acceptance_phase9_full_pipeline.py` | Real LLM generate + evaluate |
| `backend/app/test/acceptance/test_acceptance_phase9_key_questions_eval.py` | Real dual-model scoring |
| `backend/app/test/acceptance/test_acceptance_phase9_chunk_eval.py` | Real chunk evaluation |
Modified Files
| File | Changes |
|---|---|
| `backend/app/core/config.py` | Add `test_results_dir`, `test_evaluations_dir`, `eval_chunk_batch_size`, `eval_max_concurrent_batches`, `eval_batch_retry_count`, `eval_batch_retry_delay_ms` settings |
| `backend/app/main.py` | Register `test_generate.router` and `test_evaluate.router` with prefix `/api/v1` |
| `backend/app/routers/query.py` | Extract pipeline logic into reusable async generator (`_run_pipeline()`) for both SSE streaming and test result capture |
| `backend/app/services/llm_client.py` | Add `enable_thinking` override in constructor (so evaluation LLMs can enable thinking independently of global setting) |
| `backend/.env.example` | Add new profile + test storage env vars |
| `backend/requirements.txt` | Add `python-Levenshtein` (for CER/WER), `jieba` (for Chinese WER tokenization) if not already present |
Test Files (Modified)
| File | Changes |
|---|---|
| `backend/app/test/conftest.py` | Add `mock_test_storage_service`, `mock_evaluator_llm` fixtures |
Data Directories (New)
| Directory | Purpose |
|---|---|
| `backend/data/test_results/` | Generated test result JSON files (.gitignored) |
| `backend/data/test_evaluations/` | Evaluation result JSON files (.gitignored) |
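As an illustration of how one-file-per-ID JSON storage in these directories could work, here is a minimal sketch. `JsonFileStore` and its method names are hypothetical and operate on plain dicts; the planned `test_storage_service.py` works with Pydantic models and may differ. The ID check is an added safeguard assumed for this sketch.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional


class JsonFileStore:
    """Stores one JSON document per ID under a base directory."""

    def __init__(self, base_dir: Path) -> None:
        self.base_dir = base_dir
        self.base_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, item_id: str) -> Path:
        # Reject IDs containing path separators so files stay inside base_dir.
        if not item_id or Path(item_id).name != item_id:
            raise ValueError(f"invalid id: {item_id!r}")
        return self.base_dir / f"{item_id}.json"

    def save(self, item_id: str, payload: dict) -> str:
        path = self._path(item_id)
        # ensure_ascii=False keeps Chinese text readable in the stored file.
        path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
        return str(path)

    def load(self, item_id: str) -> Optional[dict]:
        path = self._path(item_id)
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding="utf-8"))

    def list_items(self, limit: int = 50, offset: int = 0) -> List[dict]:
        files = sorted(self.base_dir.glob("*.json"))
        return [{"id": p.stem, "size_bytes": p.stat().st_size} for p in files[offset:offset + limit]]

    def delete(self, item_id: str) -> bool:
        path = self._path(item_id)
        if not path.exists():
            return False
        path.unlink()
        return True


# Usage with a throwaway directory standing in for data/test_results/.
store = JsonFileStore(Path(tempfile.mkdtemp()))
store.save("result-001", {"profile": "A", "final_response": "示例"})
```

Including `size_bytes` in the listing mirrors the idea of exposing file sizes so that disk usage can be monitored.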
Frontend (No Changes — Out of Scope)
This is a backend-only enhancement. No frontend changes are planned.
Acceptance Criteria
Results Generation
- AC1: `POST /api/v1/test/generate/text` with `profile=A` returns a complete `TestGenerateResult` JSON with all 5 sections: (A) input, (B) key questions, (C) retrieved chunks, (D) filtered chunks, (E) final response
- AC2: `POST /api/v1/test/generate/audio` accepts an audio file + reference transcript, runs ASR, and returns results with `input_type: "audio"` and `input.reference_transcript` populated
- AC3: Switching `profile` between A, B, C produces different prompt templates (verified by different `decompose_prompt` in results)
- AC4: Invalid `profile` value returns HTTP 400 with descriptive error
- AC5: `GET /api/v1/test/results` returns paginated list of generated results
- AC6: `GET /api/v1/test/results/{result_id}` returns the full result JSON
- AC7: `DELETE /api/v1/test/results/{result_id}` removes the result file and returns 200
Results Evaluation
- AC8: `POST /api/v1/test/evaluate` with a valid `result_id` runs all 4 evaluation dimensions and returns `TestEvaluationResult`
- AC9: Audio evaluation: CER/WER calculated when reference transcript is provided; returns `null`/N/A when omitted (not an error)
- AC10: Key questions evaluation: Two evaluator models produce independent scores in the 4-dimension format, scores averaged correctly
- AC11: Chunk evaluation: processes ALL chunks from ALL documents in batches of 10 with binary true/false relevance, precision/recall/F1 computed correctly for both unfiltered and filtered
- AC12: Response evaluation: Ground-truth response generated from independently-verified chunks, comparison scores returned
- AC13: Evaluations are stored to disk and retrievable via GET/DELETE
- AC14: Missing/invalid `result_id` returns HTTP 404
- AC15: Key questions evaluation: both evaluator models return valid scores → averaged. If either fails after 3 retries → `key_questions_evaluation: null`, `status: "failed"` (no partial key questions scores)
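To make the chunk-evaluation criterion (AC11) concrete, here is a small sketch of batching, mapping batch-local indices back to `(document_id, chunk_index)` tuples, and computing precision/recall/F1 over sets. The function names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than something the plan specifies.

```python
from typing import Dict, List, Set, Tuple

ChunkKey = Tuple[str, int]  # (document_id, chunk_index)


def make_batches(chunks: List[ChunkKey], batch_size: int = 10) -> List[List[ChunkKey]]:
    """Split the full chunk list into consecutive batches (the last may be shorter)."""
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]


def to_global(batch: List[ChunkKey], relevant_chunk_indices: List[int]) -> Set[ChunkKey]:
    """Map batch-local indices to global chunk keys, ignoring out-of-range indices."""
    return {batch[i] for i in relevant_chunk_indices if 0 <= i < len(batch)}


def chunk_accuracy(pipeline: Set[ChunkKey], ground_truth: Set[ChunkKey]) -> Dict[str, float]:
    """Precision/recall/F1 of pipeline chunks against ground-truth relevant chunks."""
    relevant_in_pipeline = len(pipeline & ground_truth)
    # Assumed convention: a zero denominator yields 0.0 rather than an error.
    precision = relevant_in_pipeline / len(pipeline) if pipeline else 0.0
    recall = relevant_in_pipeline / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "pipeline_chunks": len(pipeline),
        "relevant_in_pipeline": relevant_in_pipeline,
    }


# Usage: 25 chunks from one hypothetical document, LLM verdicts given per batch.
all_chunks = [("doc-1", i) for i in range(25)]
batches = make_batches(all_chunks)
ground_truth = to_global(batches[1], [0, 3]) | to_global(batches[2], [4])
scores = chunk_accuracy({("doc-1", 10), ("doc-1", 11)}, ground_truth)
```

Running the same `chunk_accuracy` call once with the unfiltered chunk set and once with the filtered set gives the two accuracy blocks the criterion asks for.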
Testing
- AC16: All integration tests pass (`pytest backend/app/test/test_phase9*.py -v`)
- AC17: All acceptance tests pass (`pytest backend/app/test/acceptance/test_acceptance_phase9*.py -v -m acceptance`)
- AC18: Existing tests still pass: no regression in query, ingest, or document endpoints
- AC19: Pipeline extraction does not break existing SSE streaming query endpoint
Risks & Mitigations
| Risk | Severity | Mitigation |
|---|---|---|
| Pipeline extraction breaks SSE streaming: Refactoring `_query_stream()` to reuse logic for both SSE and test capture could introduce regressions in the main query endpoint | High | Extract into a parallel function rather than modifying `_query_stream()`. Keep the existing SSE endpoint untouched. The test runner calls the same services directly. |
| Chunk evaluation latency: Processing hundreds of chunks through the evaluation LLM (batches of 10, thinking enabled) can take time even with concurrent execution | Medium | All batches run via asyncio.gather with an asyncio.Semaphore capping concurrency at eval_max_concurrent_batches (default 10). 100 chunks = 10 batches = all run in one wave. 500 chunks = 50 batches → 5 waves of 10 concurrent calls. Timeout per batch: 120s. |
| Evaluation LLM cost: Two evaluator models × thinking enabled + chunk evaluation model × thinking enabled × all chunks in batches = potentially expensive LLM calls | Medium | Make evaluation config optional — user can choose to skip dimensions. Provide ?dimensions=key_questions,chunks query param. Document cost estimates per evaluation run. |
| CER/WER for Cantonese: Chinese characters are single characters but Cantonese has unique word boundaries (e.g., 咗嘢嘅). Character-level CER works but word-level WER needs a Cantonese-aware tokenizer | Medium | Use character-level CER as primary metric. For WER, use jieba with custom Cantonese dictionary OR fall back to character-level segmentation. Document the tokenization approach. |
| Ground truth chunk quality: If the evaluation LLM is not reliable, the "ground truth" chunk list will be wrong, making all downstream metrics invalid | High | Use the same model architecture as generation (e.g., Qwen3.6 35B A3B) with thinking enabled for maximum reliability. Implement sanity checks: if the LLM returns ALL or ZERO chunks as relevant across all batches, flag as potential issue. Allow manual ground-truth annotation as alternative. |
| Thinking trace capture: `enable_thinking` is set at `LLMClient.__init__()` level (not per-request). The evaluation models need thinking enabled while the existing pipeline may have it disabled. Not all providers expose thinking/reasoning tokens in the response. | Medium | Separate `LLMClient` instances for evaluation: each evaluator creates its own lightweight client with `enable_thinking=True`. This has zero impact on the existing pipeline's `LLMClient` (which may have thinking disabled). For providers that don't support thinking tokens, fall back to standard generation. Log warning if thinking is requested but unsupported. |
| Profile prompt consistency: If a profile's prompt templates are customized (non-default), those customizations affect test results. Tests run with different profiles may produce different results — this is expected and is the purpose of the profile parameter. | Low | Document which profile was used in each result (result.profile). Test results include the active profile name. |
| Storage growth: Test result JSONs can be large (5-50MB per file with full chunks). Over many runs, disk usage accumulates. | Medium | Results are stored as JSON files on disk with no automatic cutoff — users are responsible for periodic cleanup via the DELETE endpoint. The GET listing endpoint includes file sizes for monitoring. |
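The concurrency cap from the chunk evaluation latency row can be sketched as below, assuming each batch is wrapped in a zero-argument coroutine function. `gather_bounded` and `wave_count` are hypothetical names. A semaphore gives a sliding window rather than strict waves, so the wave count is only an approximation of wall-clock rounds.

```python
import asyncio
import math
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


def wave_count(total_chunks: int, batch_size: int = 10, max_concurrent: int = 10) -> int:
    """Approximate number of rounds needed to push all batches through the cap."""
    batches = math.ceil(total_chunks / batch_size)
    return math.ceil(batches / max_concurrent)


async def gather_bounded(
    jobs: List[Callable[[], Awaitable[T]]],
    max_concurrent: int = 10,
    timeout_s: float = 120.0,
) -> List[T]:
    """Run all jobs via asyncio.gather, with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run(job: Callable[[], Awaitable[T]]) -> T:
        async with semaphore:
            # Per-batch timeout; raises asyncio.TimeoutError when exceeded.
            return await asyncio.wait_for(job(), timeout=timeout_s)

    return await asyncio.gather(*(run(job) for job in jobs))
```

With the defaults, `wave_count(100)` and `wave_count(500)` reproduce the one-wave and five-wave figures quoted in the table.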
Dependencies
- Existing codebase: No breaking changes. Pipeline extraction is additive; existing `_query_stream()` untouched.
- New Python packages:
  - `python-Levenshtein` (for fast CER/WER edit distance): lightweight compiled extension (not pure Python); prebuilt wheels on PyPI typically mean no extra system deps
  - `jieba` (for Chinese word segmentation in WER): already popular, may already be in transitive deps
- LLM providers for evaluation:
  - DeepSeek V4 Pro: Already configured via `DP_BASE_URL`/`DP_API_KEY` (used for decompose step)
  - Qwen 3.7 Max: Configured via `DASHSCOPE_API_KEY`/DashScope endpoint
  - Qwen3.6 35B A3B: Already configured via `LLM_BASE_URL`/`LLM_API_KEY` (main generation model)
- ChromaDB: Read-only access for fetching documents/chunks during evaluation (no schema changes)
- PromptService: New prompt step keys needed: `"evaluate_key_questions"`, `"evaluate_chunk_relevance"`, `"evaluate_response_comparison"`. Seed templates stored in `sqlite_db.py`.
- SQLite/HistoryService: No changes needed (test results use file-based JSON, not SQLite)
Timeline Estimate
| Sub-Phase | Description | Estimated Effort |
|---|---|---|
| 9.0 | Config & Models (Foundation) | 1 day |
| 9.1 | Results Generation APIs | 2 days |
| 9.2 | Evaluation Engine | 3 days |
| 9.3 | Evaluation API & Service | 1.5 days |
| 9.4 | Integration & Acceptance Testing | 1.5 days |
| Total | | 9 days |
References
Internal
- `backend/app/routers/query.py`: Full RAG pipeline (SSE streaming, all intermediate data)
- `backend/app/services/rag.py`: Retrieval + generation methods
- `backend/app/services/query_decomposer.py`: Structured question decomposition
- `backend/app/services/relevance_filter.py`: Per-sub-question chunk filtering
- `backend/app/services/history_service.py`: Pipeline stage recording (reference for what data to capture)
- `backend/app/services/prompt_service.py`: Profile A/B/C management
- `backend/app/services/asr_client.py`: Audio transcription interface
- `backend/app/core/config.py`: All env-configurable settings
- `backend/app/models/query.py`: SSE event types + pipeline contract
- `backend/app/models/common.py`: SourceMetadata (universal chunk reference)
- `test materials/LTT/`: 4 LegCo PDFs for testing
- `test materials/`: LegCo MP4 files for audio testing
- `.plans/package8_enhancement_plan.md`: Plan format reference