legco_ai_assistant

Commit Graph

Author	SHA1	Message	Date
Woody	25b26c9b48	feat(ingest): generate per-chunk PDFs for DOCX/TXT documents (Phase 5.3) DOCX and TXT ingestion now produces chunk_file_path + per-chunk PDF files matching the PDF ingestion flow. Uses reportlab to render chunk text as simple PDFs with automatic text wrapping. - Add reportlab==4.2.5 to requirements.txt - New utils/text_to_pdf.py: generate_text_pdf() renders chunk text as PDF - Ingest router DOCX/TXT branches: generate chunk_N.pdf per chunk, store in chunk_file_paths - Graceful degradation: chunk_file_path stays None if PDF generation fails - Update test_phase1_ingest_page_aware.py assertions: DOCX chunks now HAVE chunk_file_path - New test_phase5_docx_pdf_generation.py: 5 tests (DOCX PDF gen, TXT PDF gen, PDF regression, file count, graceful degradation) - 361 backend tests pass (4 pre-existing embedding failures unrelated) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-28 17:32:22 +08:00
Woody	b2dd385443	feat(backend): refactor ingest pipeline for page-aware chunking with PDF generation PDF uploads now use parse_pdf_by_page() -> chunk_pages() -> extract page PDFs -> enhanced metadata with page_number, chunk_file_path, and document_id. Same-filename replacement deletes old chunks and PDFs before re-ingest. DOCX/TXT keep original flat flow with document_id added. RAGService.ingest_document() accepts optional document_id parameter. Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-24 10:53:17 +08:00
Woody	b48c23001e	fix(backend): preserve original filename in chunk metadata instead of temp file name When uploading files, the backend passes them through NamedTemporaryFile, causing os.path.basename to return temp names like 'tmp90i7xqa8.pdf'. Added original_filename parameter to extract_metadata() so the actual upload filename is stored. Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-24 10:14:58 +08:00
Woody	4a22b906e4	refactor(backend): update ingest and query routers Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-23 13:26:32 +08:00
Woody	7493b3aaf6	feat: Phase 1.4 acceptance tests, error handling, and polish - Implement acceptance tests for ingest (real ChromaDB) and query (real LLM) - Full 3-step RAG pipeline verified: decompose → retrieve → filter → generate - Add logging to ingest and query routers - Improve error handling: empty doc detection, proper HTTPException re-raising - Add .txt file support to ingest endpoint - Fix query router: strip distance from retrieve tuples before relevance filter - Update plan: Phase 1 backend complete (all acceptance criteria met) - Tests: 41 unit passed, 5 acceptance passed (real OpenRouter calls)	2026-04-22 17:45:50 +08:00
Woody	d94abaac77	feat: Phase 1.2 ingestion pipeline with chunking and metadata - Add document parsers (DOCX, PDF) with lazy imports - Add TokenChunkingStrategy with ABC for future replacement - Add metadata extraction (filename, upload_date, content_summary) - Add RAGService for ChromaDB ingestion/retrieval/response generation - Add POST /api/v1/ingest endpoint with file validation - Test-first: 20 passed, 2 skipped (python-docx not installed)	2026-04-22 16:49:52 +08:00

6 Commits