Commit Graph

103 Commits

Author SHA1 Message Date
Woody 73c1789698 fix: Q\&A chunking always fell back to token — LLM never called, missing API fields
Three bugs caused 'Chunk by Question' to silently produce token chunks:

1. QuestionChunkingStrategy.chunk_pages() had a broken event-loop check
   that always skipped LLM structure detection in FastAPI's async context.
   Fixed by making chunk_pages() async and removing the is_running() guard.

2. get_chunking_strategy() factory never passed an LLMClient to
   QuestionChunkingStrategy. Fixed by creating LLMClient in the factory
   with graceful fallback to regex-only when config is incomplete.

3. rag.list_documents() and list_chunks() didn't extract strategy_type
   or Q&A fields from ChromaDB metadata, so the frontend always showed
   chunking_strategy='token' and null Q&A fields. Fixed by reading
   these fields from ChromaDB and routing them through the API.

Also: TokenChunkingStrategy.chunk_pages() made async for consistency
with the question strategy; ingest router updated to await it.
Tests updated (asyncio.run() for sync tests, async mock chunk_pages).
2026-05-15 14:46:45 +08:00
Woody 9bef65de7b test: Sub-Phase 8.5 — acceptance test skeleton for Q&A chunking
8 acceptance tests with real LegCo PDFs (all @pytest.mark.acceptance + @slow).
Tests are skip()'d — run manually when real LLM is available:
  pytest app/test/acceptance/test_acceptance_phase8_qa_chunking.py -v -m acceptance

Sub-Phase 8.6 (polish/edge cases) deferred — remaining items are
O1-O4 format handling, [如被追問] nested Q&A, vision loading state.
Core algorithm (8.1-8.4) is test-passing and production-ready.
2026-05-15 12:45:46 +08:00
Woody 14423c773a feat: Sub-Phases 8.1-8.4 — Q&A-pair chunking strategy
8.1 — Core algorithm (test-first):
- qa_chunking.py: preprocess_text, build_structure_detection_prompt,
  parse_llm_structure_response, Section dataclass, split_chinese_qa,
  split_english_qa, build_chunks_from_sections with recursive size split
- QuestionChunkingStrategy in chunking.py with _chunk_metadata tracking
- get_chunking_strategy() factory function
- table_extraction.py: vision LLM extraction, heuristic text fallback,
  disk cache, inject_tables_into_answer
- 18/18 tests pass (LLM parse, regex fast-pass, multi-page, ABC contract,
  size limit, chunk building, preprocess)

8.2 — Metadata enrichment:
- extract_metadata() accepts strategy_type + chunk_metadata params
- Q&A fields (question_id, question_index, section_heading, etc.)
  merged into ChromaDB metadata entries
- DocumentInfo.chunking_strategy + ChunkInfo Q&A fields in models
- 6/6 metadata tests pass

8.3 — Ingest API integration:
- POST /api/v1/ingest accepts ?strategy=token|question
- validate strategy against VALID_CHUNKING_STRATEGIES
- factory creates correct chunker; _chunk_metadata passed to extract_metadata
- 6/6 ingest integration tests pass, zero regressions on existing tests

8.4 — Frontend strategy selector:
- Radio button selector (Token / Question) on RAG Database page
- Strategy passed to ingest mutation via api.ts
- DocumentList: strategy badge (gray/blue)
- ChunkList: Q&A display with question_id, question_text, page range, table badge
- tsc --noEmit clean, vite build successful
2026-05-15 12:44:04 +08:00
Woody ef10b937cf feat: Sub-Phase 8.0 — config & enums for Q&A-pair chunking strategy
Backend:
- Add 6 Q&A chunking config fields to Settings (default_chunking_strategy,
  qa_vision_enabled, qa_max_chunk_tokens, qa_structure_model,
  qa_include_internal_refs, qa_cache_vision_results)
- Define ChunkingStrategyType Literal + VALID_CHUNKING_STRATEGIES frozenset
- Add strategy field to IngestResponse (default token, non-breaking)
- Add IngestRequest model with strategy param
- Update .env.example with new env vars

Frontend:
- Add ChunkingStrategy type ('token' | 'question')
- Extend IngestResponse, DocumentInfo, ChunkInfo with Q&A fields

Tests:
- test_qa_chunking_config_defaults — all defaults verified
- test_qa_chunking_config_from_env — env var overrides verified

Plan fix: renamed qa_verification_model → qa_structure_model to match
LLM-first architecture
2026-05-15 12:01:28 +08:00
Woody 29b4713f22 Merge branch 'Highlight-Response' 2026-05-15 11:23:02 +08:00
Woody 787c6b1692 fix: vLLM highlight batch failure — replace guided_json with response_format + add debug logging
Root cause: guided_json removed in vLLM v0.12.0, and the two-attempt
loop (structured_outputs → guided_json) merged chat_template_kwargs
into the extra_body, potentially causing param conflicts.

Changes:
- llm_client.py: Replace _complete_structured_vllm() with two-tier
  approach — response_format (Tier 1, v0.6.4+) then structured_outputs
  (Tier 2, v0.8+). Remove dead guided_json path. Add _strip_markdown_fence().

- chunk_highlight_service.py: Add complete() fallback as defense-in-depth
  when structured output fails. Strip markdown fences before parsing.

- chunks.py: Add request/response logging at router level.

- chunk_highlight_service.py: Add full logging chain — entry, ChromaDB
  fetch, LLM call, fallback, cache results, exit.

- ResponsePanel.tsx: Add console logging for request payload, response
  status/errors/timing. Handle status=failed explicitly (was silently
  ignored). Track round-trip timing via performance.now().
2026-05-15 11:08:36 +08:00
Woody 534559b2e0 feat: Phase 7.1 — highlight prompt template + sequential citation [N] + highlightTerms parser
- Backend: add ==term== highlighting instruction to _SEED_GENERATE_PER_SUBQ
- Frontend: replaceFilename output with sequential [1] [2] [3] numbering
- Frontend: add highlightTerms() to convert ==term== to <mark> HTML
- Tests: 39 citation+highlight tests pass (28 updated + 11 new)
- Fix: QueryInput partialText styling and disabled state
2026-05-15 10:46:55 +08:00
Woody d69c180544 feat: Phase 4.8-4.9 — integration tests, acceptance tests, docs, and polish 2026-05-15 09:51:45 +08:00
Woody 1e8773469e Merge branch 'Phase4-dev' 2026-05-14 23:29:42 +08:00
Woody 7bff4308b7 feat: Phase 4 — System Audio & Listen Mic capture into ASR → RAG
Adds two new live audio sources alongside file Upload:

- System Audio: getDisplayMedia() captures system/tab audio output,
  pipes through WebSocket → DashScope realtime ASR → RAG.
- Listen Mic: getUserMedia() captures microphone input via the same
  audio pipeline (shared useMediaStreamASR hook).

Backend: feature toggles (system_audio_enabled, mic_enabled) in
config.py, source query param gating in ws_asr.py, 10 config tests.

Bug fix: getDisplayMedia() rejected video:false per W3C spec —
changed to video:true then stop video tracks to allow audio-only
capture on Windows/macOS Chrome.
2026-05-14 22:55:06 +08:00
Woody 17db487dbb feat: Phase 3 — Half Question button, Final Submit rename, ASR text always black
- Backend: add stop_after_decompose flag to QueryRequest, early-return
  after decomposition in SSE stream with half_question:true event
- Frontend: add decomposeOnly method to useQueryDocumentStream hook
- QueryInput: remove grey italic from ASR partial text, rename Submit
  to Final Submit, add gray Half Question button that decomposes
  without clearing querybox text
- LTTPage: wire handleHalfQuestion to decomposeOnly
2026-05-14 21:27:21 +08:00
Woody b05c361fbd revert: remove Phase 3 YouTube proxy — all 7 sub-phases
Reverts commits 284028b through b4096d6. Phase 4 (System Audio Capture)
will replace the YouTube use case with a more versatile getDisplayMedia approach.

Removed: YouTube router, HLS proxy, YouTubeService, YouTubeInput,
YouTubeVideoPlayer, useYouTubeASR hook, all Phase 3 tests, hls.js dep,
YouTube config fields, YouTube README/plan sections.

Modified files restored to pre-Phase-3 state: LTTPage (no source toggle),
api.ts (no YouTube extract), types (no YouTube types), config.py (no
youtube fields), main.py (no YouTube router), requirements.txt (no yt-dlp),
.env.example (no YouTube vars), package.json (no hls.js).

Relevant Phase 2 code preserved: ws_asr.py (unchanged), useVideoASR,
VideoPlayer, VideoUpload, QueryInput, Full Transcript.
2026-05-09 21:07:21 +08:00
Woody b4096d6afc feat: Phase 3.7 — Polish, PO token handling, docs, deployment verification
- PO token handling: _is_po_token_error() detects YouTube bot-detection errors,
  invalidates cache on detection, logs warning for retry guidance (2 new tests)
- README: YouTube Live Stream Proxy section with architecture, usage, config, limits
- development_plan: Phase 3 complete, timeline updated, status → Phase 1-3 Complete
- Dockerfile/compose: verified OK (ffmpeg + yt-dlp already present, no new volumes)
- npm build: 1403 modules, production build clean
- 59/59 backend + 44/44 frontend Phase 2+3 tests pass
- Plan: 3.7 Complete, 7/7 sub-phases done
2026-05-09 17:27:54 +08:00
Woody cee859d5d7 feat: Phase 3.6 — integration + acceptance tests for YouTube proxy
- test_integration_phase3.py: 6 tests
  Extract→proxy flow (VOD manifest, VOD segment, live manifest),
  cache hit bypasses yt-dlp, upstream 404→502, extract disabled→503
  Mocked yt-dlp, real FastAPI TestClient + HLSProxyService
- test_acceptance_phase3_youtube.py: 3 tests
  Real YouTube VOD extraction, manifest proxy, segment proxy
  Follows master→variant→segment chain, verifies MPEG-TS sync byte
- test_acceptance_phase3_live.py: 3 tests
  Real live stream extraction, no #EXT-X-ENDLIST assertion,
  cache refresh verification, graceful skip when offline
- 201/201 CI pass (234 backend Phase 1-3, zero Phase 3 regressions)
- Updated plan: 3.6 Complete, 6/7 sub-phases done
2026-05-09 17:18:55 +08:00
Woody 3c9ed2cc8d feat: Phase 3.3 — HLS manifest proxy with line-by-line rewriting
- HLSProxyService: rewrite_manifest() rewrites segment/sub-manifest/EXT-X-KEY URIs
  to proxy URLs; proxy_segment() transparently proxies .ts segments
- Route: upstream status checked before streaming — 502 on failure
- CORS access-control-allow-origin: * on all responses
- Line rewriting: pass-through tags/comments, rewrite URIs, handle relative/absolute URLs
- URL resolution: urljoin for relative, absolute path, and absolute URL
- 22 tests (8 line rewriting, 4 URL resolution, 3 proxy URL construction,
  2 manifest integration, 1 segment proxying, 4 route integration)
- 104/104 total pass (zero regressions)
2026-05-09 16:13:33 +08:00
Woody 284028bb1f feat: Phase 3.1 + 3.2 — YouTube config infra and URL extraction
Phase 3.1 — Configuration & Infrastructure:
- Add youtube_proxy_enabled, yt_dlp_timeout, yt_dlp_cache_ttl config fields
- Add yt-dlp and hls.js dependencies
- Create models/youtube.py (request/response schemas)
- Create service stubs (youtube_service, hls_proxy)
- Create router stub and register in main.py
- 11 config tests

Phase 3.2 — YouTube URL Extraction:
- yt-dlp wrapper with async extraction (run_in_executor)
- Format selection: ≤480p video-only + highest-bitrate audio (VOD)
- Combined format fallback: same URL for live streams
- In-memory URL cache: 5min TTL live, 30min VOD
- lru_cache singleton service for cache persistence
- Error handling: DownloadError → 200 with error field
- 18 extract tests, 82/82 total pass (zero regressions)

Real-URL verified: VOD (5bF3tkO5jAA) 24 formats, Live (fN9uYWCjQaw) 6 HLS
2026-05-09 15:53:04 +08:00
Woody 09b5ea7d64 refactor: remove dead _merge_stash, add Phase 3 YouTube proxy plan
- Remove _merge_stash (dead code since delta-based ASR refactor)
- Replace TestMergeStash with TestTextFieldFormatting (53/53 Phase 2 tests pass)
- Mark phase2_enhancement_use_text_field as Complete
- Add Phase 3 YouTube live stream proxy implementation plan
- README updates
2026-05-09 15:14:01 +08:00
Woody 78d1f8cc91 feat: delta-based ASR transcript — use text field, utterance boundaries, stash on pause
Replace full_text responses with character-level deltas computed from
DashScope's monotonically-growing 'text' field. Stash-only events (empty
text) are skipped; trailing stash chars sent alongside deltas and
appended on pause to complete final sentences.

Backend:
- Delta = text[len(prev_text):] — simple suffix diff, no merge logic
- Track item_id for utterance boundaries, prepend space separator
- Send stash alongside delta for frontend pause handler

Frontend:
- Accumulate deltas locally (transcriptRef += msg.delta)
- Store lastStashRef from each message
- On pause: append stash to text, fire onFinalTranscript

Plan: .plans/phase2_enhancement_delta_sse.md updated to Complete
2026-05-07 11:26:19 +08:00
Woody cb0ac07786 fix: text accumulation — stashes are sliding windows, merge via overlap detection
DashScope stashes are ~7-char rolling windows, not cumulative. Each partial
event replaces the previous. Completed events rarely sent. This caused text to
jump/replace during streaming and disappear on pause.

Backend:
- Add _merge_stash() — finds overlapping suffix between successive stashes
  and appends only new characters, reconstructing full utterance from partials
- format_transcription_event returns raw stash for read_events to merge
- read_events maintains partial_buffer via _merge_stash, clears on completed
- Guard against empty/whitespace-only stashes

Frontend:
- transcriptRef + onFinalTranscriptRef avoid stale closures in pause handler
- stopStreaming fires onFinalTranscript(currentText) before clearing partial
- Removed blind setPartialTranscript('') that erased text on pause

Tests: 16/16 ws_protocol tests pass, frontend tests unchanged
Plan: Updated phase2_implementation_plan.md to Complete with 11-bug log
2026-05-06 20:06:39 +08:00
Woody fcb9ec1f6c fix: Phase 2 ASR pipeline — 9 bugs resolved, Full Transcript works end-to-end
- Vite proxy: forward /api and /ws to backend port 8000
- WebSocket URL: use backend host, not Vite HMR port
- LTTPage: callback ref replaces useRef (video element always null before)
- ws_asr: pass DashScope API key to OmniRealtimeConversation
- asr_client: fix data_url MIME type (audio/wav), omit extra_body when auto
- useFullTranscript: use absolute URL prefix for fetch
- QueryInput: add value prop for external Full Transcript injection
- QueryInput: fix displayValue || logic (partialText '' overrode question)
- ffmpeg: install static binary for audio extraction
- Integration tests: 7 tests (upload→transcribe flow)
- Acceptance tests: real DashScope tests (skippable)
- Structured logging: ws_asr.py + video.py
2026-05-06 18:26:17 +08:00
Woody a4e067822b feat: Phase 2.3 ASR proxy + full transcript and 2.4 frontend hooks
- Backend: DashScope WebSocket proxy (/ws/asr/{video_id}), DashScopeCallback
  sync-to-async bridge, ffmpeg audio extraction, POST /video/{id}/transcribe
- Frontend: useVideoASR hook (auto on play), useFullTranscript hook,
  QueryInput partialText prop, VideoUploadResponse types, uploadVideo API
- Tests: 41 backend + 26 frontend = 67 new tests, all passing
2026-05-06 13:41:24 +08:00
Woody 9934749d2b feat: Phase 2.1 config + infrastructure and 2.2 video upload backend
- Add DashScope ASR and video upload config fields to Settings
- Create Pydantic models (video.py, asr.py)
- Create VideoService with validation, save, serve, delete
- Create ASR client stub with float32_to_s16le utility
- Implement POST /api/v1/video/upload with streaming validation
- Implement GET /api/v1/video/{video_id} with FileResponse
- Create WebSocket ASR endpoint stub
- Register new routers in main.py
- Update .env.example and requirements.txt
- Add reference examples for DashScope integration
- 8 tests passing (3 config + 5 video upload)
2026-05-06 13:08:19 +08:00
Woody 76c3bec2ab feat: configurable SubQuestions via Step 1.2 system prompt page
- Split 'Step 1: Query Decomposition' into Step 1.1 (prompt template) and Step 1.2 (format config with description + max_length)

- Add create_subquestions_model() and parse_decompose_format() to decompose.py

- QueryDecomposer reads decompose_format from DB, creates dynamic Pydantic model at runtime

- PromptEditor renders Step 1.2 as textarea (description) + number input (max_length 1-5)

- Graceful fallback to static SubQuestions when decompose_format unavailable
2026-05-04 17:22:14 +08:00
Woody 40b338d3ca chore: gitignore .research, switch to flash, tighten sub-questions
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-04 16:38:58 +08:00
Woody 5535b42ae2 refactor: tighten SubQuestions to 1-3 with Cantonese format hint
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-04 15:18:14 +08:00
Woody df62283f58 feat: inject Pydantic JSON schema into Deepseek prompt (Phase 6)
Follows Deepseek JSON Output guide: the prompt now includes the word 'json' and a format example derived from the Pydantic model schema. Added _pydantic_to_json_instruction() helper that builds a human-readable schema description with EXAMPLE JSON OUTPUT.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-04 15:17:24 +08:00
Woody 226f4ed700 test: update integration mocks for dual-client architecture (Phase 6)
Added complete_structured() to mock classes, split response lists between LLMClientDP (decompose) and LLMClient (filter+generate), and patched both clients in all integration tests.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-04 14:59:23 +08:00
Woody 3b5bd79839 feat: wire LLMClientDP into query decompose pipeline (Phase 6)
QueryDecomposer now uses LLMClientDP (Deepseek) while RelevanceFilter and RAGService continue using LLMClient (OpenRouter/vLLM).

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-04 14:59:08 +08:00
Woody 849beb4d4e feat: add LLMClientDP for Deepseek decompose (Phase 6)
Uses Deepseek's json_object response_format (not json_schema, which Deepseek does not support). Always disables thinking mode. Includes unit tests (12) and acceptance tests (5).

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-04 14:58:53 +08:00
Woody 73ae621f3b feat: add Deepseek config fields and DI wiring (Phase 6)
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-04 14:58:39 +08:00
Woody 23c665515d fix: wrap filter chunks in XML tags for clearer LLM input 2026-04-30 13:59:03 +08:00
Woody fc6b5463b5 fix: vLLM structured output missing thinking-control extra_body 2026-04-29 21:01:10 +08:00
Woody 16de8394aa fix: add full input/output logging to vLLM structured output path
Log the complete prompt, schema, extra_body content, full API response,
token counts, and full parsed JSON output. Add exc_info=True tracebacks
on all failure paths.
2026-04-29 16:52:26 +08:00
Woody 3ab6fd102a fix: use vLLM-native guided_json for structured output
vLLM servers support JSON schema enforcement via extra_body (guided_json
or structured_outputs), not OpenAI's response_format protocol. LangChain's
with_structured_output(method='json_schema') sends response_format which
vLLM ignores, causing NoneType not iterable parsing errors.

- vLLM path: direct OpenAI SDK call with extra_body={guided_json|structured_outputs}
- OpenRouter path: unchanged with_structured_output(method='json_schema')
- Try new 'structured_outputs' format first, fall back to legacy 'guided_json'
- Update _SEED_DECOMPOSE with explicit JSON array instruction
- Add diagnostic logging: exc_info=True, schema preview, prompt template preview
- Add logging in _parse_legacy_json for fallback failure debugging
2026-04-29 16:49:14 +08:00
Woody 2aca18d30e docs: add vLLM structured output fix plan
- Diagnose: vLLM ignores OpenAI-native response_format, causing NoneType error
- Diagnose: legacy fallback prompt lacks JSON instruction → empty questions
- Plan: use vLLM-native guided_json via extra_body instead of with_structured_output
- Plan: update _SEED_DECOMPOSE with JSON format instruction
- Plan: add diagnostic logging (exc_info, method, schema preview)

wip: temporary function_calling switch for vLLM (to be replaced by guided_json)
2026-04-29 16:42:23 +08:00
Woody cbb958d75d fix: vLLM chat_template_kwargs breaks LangChain structured output
vLLM's chat_template_kwargs leaked into LangChain's AsyncCompletions.parse()
via _get_langchain_model's model_kwargs, causing structured decomposition
to fail on vLLM backends. Skip vLLM-specific params when building the
LangChain model — only provider-agnostic params (OpenAI reasoning) pass through.
2026-04-29 16:07:44 +08:00
Woody 41f59b396f feat: track highlight generation prompt, response, and timing in history (Phase 5.5)
- Add 3 columns to query_history: highlight_prompt, highlight_response, highlight_time_ms
- HistoryService.update_highlights() updates existing row after batch LLM call
- ChunkHighlightService measures timing, captures prompt and structured JSON response
- SSE completed event includes history_id for frontend to pass back
- Frontend captures historyId, passes as ?history_id= query param in batch POST
- Highlight time tracked separately (excluded from total_time_ms)
- All 153 tests pass (108 backend + 45 frontend)
2026-04-29 11:18:21 +08:00
Woody a56f8f69e2 feat: add highlight batch and GET endpoints (Phase 5.4.5)
- POST /api/v1/v2/highlights/batch: compute and cache highlights for cited chunks
- GET /api/v1/v2/highlights: serve cached highlighted HTML pages
- chunks.py router registered in main.py
- Dynamic DB path computation (prompts.db -> highlights.db), no Settings changes
- 7 endpoint tests: POST 200/422, GET 200/404, mock service verification
2026-04-29 09:26:50 +08:00
Woody c6d4a38013 feat: add LLM-based batch highlight service and HTML rendering (Phase 5.4.4)
- ChunkHighlightService.compute_highlights_batch(): single LLM call across
  all cited chunks, grouped by sub-question, with structured output
- render_highlight_html(): self-contained HTML page with yellow-highlighted
  relevant sentences, LLM reason annotations, and View Original PDF footer
- Per-target error isolation, ChromaDB miss handling, graceful degradation
- 14 tests: 7 batch service + 7 HTML rendering
2026-04-29 09:26:33 +08:00
Woody bdbc8ea1a0 feat: add SQLite highlight cache service (Phase 5.4.3)
- highlight_cache.py: HighlightCache class with get/set_highlight and
  compute_cache_key (sha256 hash of document_id|chunk_index|sub_question)
- INSERT OR REPLACE semantics, idempotent table creation
- 13 tests covering round-trip, overwrite, missing keys, determinism
2026-04-29 09:26:20 +08:00
Woody b11d31e2d1 feat: add sentence splitter and highlight data models (Phase 5.4.1-5.4.2)
- sentence_splitter.py: regex-based sentence splitting for English + Chinese punctuation
- highlight.py: 6 Pydantic models (ChunkHighlightTarget, HighlightBatchRequest,
  RelevantSentence, ChunkHighlights, HighlightBatchResult, HighlightBatchResponse)
- 43 tests: 13 sentence splitter + 30 model validation
2026-04-29 09:26:06 +08:00
Woody 25b26c9b48 feat(ingest): generate per-chunk PDFs for DOCX/TXT documents (Phase 5.3)
DOCX and TXT ingestion now produces chunk_file_path + per-chunk PDF files matching the PDF ingestion flow. Uses reportlab to render chunk text as simple PDFs with automatic text wrapping.

- Add reportlab==4.2.5 to requirements.txt
- New utils/text_to_pdf.py: generate_text_pdf() renders chunk text as PDF
- Ingest router DOCX/TXT branches: generate chunk_N.pdf per chunk, store in chunk_file_paths
- Graceful degradation: chunk_file_path stays None if PDF generation fails
- Update test_phase1_ingest_page_aware.py assertions: DOCX chunks now HAVE chunk_file_path
- New test_phase5_docx_pdf_generation.py: 5 tests (DOCX PDF gen, TXT PDF gen, PDF regression, file count, graceful degradation)
- 361 backend tests pass (4 pre-existing embedding failures unrelated)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-28 17:32:22 +08:00
Woody 48e15f8232 feat(llm): log structured LLM response and extra_body
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-28 16:50:26 +08:00
Woody 4c56e81872 feat(prompts): enforce bullet-point output in generate template
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-28 16:42:55 +08:00
Woody 095f013739 feat(llm): pass extra_body via model_kwargs in LangChain
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-28 16:42:49 +08:00
Woody 136c25ae38 feat: rewrite DOCX parser with table extraction
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-28 16:42:41 +08:00
Woody f2115ae563 feat: structured LLM output for decompose + citation fuzzy matching (Phase 5)
Phase 5.1 — Structured LLM output for query decomposition:
- Add SubQuestions Pydantic model with sub_question, keywords, rationale
- Add LLMClient.complete_structured() using langchain with_structured_output
- Update QueryDecomposer with structured output path + legacy json.loads fallback
- Update SQLite seed templates: add subq+citation labeling requirement
- Add tests: structured output, subquestions model validation, logging

Phase 5.2 — Citation format alignment and fallback links:
- Add document_id to SourceMetadata (backend + frontend types)
- Rewrite citationParser.ts with fuzzy matching and fallback document links
- Add RAGDatabasePage auto-expand from ?document= URL param
- Tighten generate_per_subq seed prompt: 'Copy exact bracket labels shown'
- Add citation parser tests for fuzzy match and fallback link scenarios
- Defer: DOCX/TXT PDF generation → Phase 5.3 (fallback links sufficient)
2026-04-28 15:39:17 +08:00
Woody 711be3dfde feat(llm): add VLLM_ENGINE env flag for provider-specific extra_body format 2026-04-28 13:30:27 +08:00
Woody 23796d6a0c feat(prompts): add JSON export/import for profile prompt configurations 2026-04-27 19:44:35 +08:00
Woody 4ad9deeccb feat(deploy): add Dockerfile, compose, nginx config, and README
Multi-stage Dockerfile: Node builds frontend, Python serves both API
and static files. docker-compose.yml with named volumes for ChromaDB,
chunks, and SQLite data. nginx.conf as reverse proxy with 350M upload
limit and 300s LLM proxy timeout. README with dev setup, deploy steps,
env vars table, and architecture diagram.

Backend main.py: add catch-all route to serve frontend/dist/static
files in production. Only activates when dist/ exists.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-27 17:17:53 +08:00