legco_ai_assistant

Commit Graph

Author	SHA1	Message	Date
Woody	d49756f374	feat: add chunk PDF serving endpoint and frontend clickable source links (1.5.6) - Add page_number and chunk_file_path to SourceMetadata model and query router - Add GET /chunks/{file_path}/pdf endpoint with path traversal protection - Add View PDF links in ResponsePanel source cards and ChunkList component - Update TypeScript types and API helper for chunk PDF URLs - Add backend tests (5) and frontend ChunkList tests (7) - Update enhancement plan: all 3 features complete	2026-04-24 11:49:39 +08:00
Woody	b2dd385443	feat(backend): refactor ingest pipeline for page-aware chunking with PDF generation PDF uploads now use parse_pdf_by_page() -> chunk_pages() -> extract page PDFs -> enhanced metadata with page_number, chunk_file_path, and document_id. Same-filename replacement deletes old chunks and PDFs before re-ingest. DOCX/TXT keep original flat flow with document_id added. RAGService.ingest_document() accepts optional document_id parameter. Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-24 10:53:17 +08:00
Woody	8c84062996	feat(backend): add PDF page extractor and chunk PDF storage config New pdf_extractor.py with extract_page_as_pdf() and extract_pages_as_pdf() for extracting individual PDF pages as separate files. Adds document_chunk_path setting to config and document_chunk/ to .gitignore. Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-24 10:52:57 +08:00
Woody	b97264c66a	feat(backend): add page_number, chunk_file_path, document_id to chunk metadata Enhance extract_metadata() with three new optional fields for page-aware chunking support. Validates list length mismatches. Fully backward compatible — existing callers unaffected. Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-24 10:30:40 +08:00
Woody	0995c685fa	feat(backend): add page-aware chunking with adjacent-page overlap Add chunk_pages() to TokenChunkingStrategy: one chunk per page with 200-token overlap from adjacent pages. Uses original page text for main content, decoded tokens for overlap. Never splits a page regardless of size. Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-24 10:30:18 +08:00
Woody	f4fa577fb0	feat(backend): add page-aware PDF parsing with per-page text extraction Add parse_pdf_by_page() that returns List[Tuple[int, str]] with 1-indexed page numbers. Pages with no extractable text are skipped. Follows same error handling as existing parse_pdf(). Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-24 10:30:04 +08:00
Woody	f21085b3df	feat(backend): add documents CRUD endpoints and tests Add 4 REST endpoints for RAG database management: GET /documents, GET /documents/{id}/chunks, DELETE /documents/{id}, DELETE /chunks/{id}. Register documents router in main.py. 8 unit tests covering all CRUD operations. Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-23 19:02:28 +08:00
Woody	33b960f786	fix(backend): extract JSON from markdown code blocks in LLM responses The LLM (Qwen3.5 via OpenRouter) returns JSON wrapped in markdown code blocks: ```json ["project manager", "limits", ...] ``` But the code was trying to parse this directly with json.loads(), causing: - QueryDecomposer to return empty keywords - RelevanceFilter to fail with "Expecting value: line 1 column 1" Changes: - Added _extract_json_from_markdown() helper function to both modules - Strips markdown code block markers (```json and ```) before JSON parsing - Added unit tests for markdown code block handling Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus \u003cclio-agent@sisyphuslabs.ai\u003e	2026-04-23 16:28:07 +08:00
Woody	be5e75e67c	test(backend): update unit tests for LLM monitoring changes - Fixed MockLLMClient to accept step_name parameter - Updated test mocks for OpenAI SDK structure Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-23 14:52:41 +08:00
Woody	351950f512	test(backend): update Phase 1 test suite Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-23 13:27:40 +08:00
Woody	7493b3aaf6	feat: Phase 1.4 acceptance tests, error handling, and polish - Implement acceptance tests for ingest (real ChromaDB) and query (real LLM) - Full 3-step RAG pipeline verified: decompose → retrieve → filter → generate - Add logging to ingest and query routers - Improve error handling: empty doc detection, proper HTTPException re-raising - Add .txt file support to ingest endpoint - Fix query router: strip distance from retrieve tuples before relevance filter - Update plan: Phase 1 backend complete (all acceptance criteria met) - Tests: 41 unit passed, 5 acceptance passed (real OpenRouter calls)	2026-04-22 17:45:50 +08:00
Woody	181f4eca5b	feat: Phase 1.3 query pipeline with decomposition, relevance filter, and response - Add QueryDecomposer: extracts keywords from question via LLM JSON response - Add RelevanceFilter: batch scores chunks 0-10, filters by threshold - Add POST /api/v1/query endpoint with full 3-step pipeline: 1. QueryDecomposer.decompose() → keywords 2. RAGService.retrieve() → chunks from ChromaDB 3. RelevanceFilter.filter() → score and filter chunks 4. RAGService.generate_response() → bullet-point answer - Fix SourceMetadata.upload_date type from datetime to str for flexibility - Test-first: 13 new tests pass (5 decomposer, 5 relevance filter, 3 query endpoint) - All Phase 1 tests: 41 passed, 2 skipped	2026-04-22 17:19:21 +08:00
Woody	d94abaac77	feat: Phase 1.2 ingestion pipeline with chunking and metadata - Add document parsers (DOCX, PDF) with lazy imports - Add TokenChunkingStrategy with ABC for future replacement - Add metadata extraction (filename, upload_date, content_summary) - Add RAGService for ChromaDB ingestion/retrieval/response generation - Add POST /api/v1/ingest endpoint with file validation - Test-first: 20 passed, 2 skipped (python-docx not installed)	2026-04-22 16:49:52 +08:00
Woody	3712397d64	feat: Phase 1.1 project setup with config, database, and models - Add requirements.txt with all dependencies - Add .env.example with required environment variables - Add Pydantic Settings (config.py) with .env loading - Add ChromaDB persistent client (database.py) - Add Pydantic schemas (ingest.py) for request/response - Add FastAPI main.py with CORS middleware - Add package __init__.py files - Add tests: test_phase1_config.py, test_phase1_database.py - All 5 tests pass	2026-04-22 16:13:52 +08:00
Woody	be48b1d8c7	docs: add sub-phase development rules and acceptance test structure	2026-04-22 15:27:31 +08:00
Woody	3c2d647943	init: project setup with AGENTS.md, test structure, and plan directory	2026-04-22 15:22:29 +08:00

16 Commits