A system for agentic LLM-powered data processing and ETL
Updated Jun 12, 2025 - Python
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
Generic framework for historical document processing
TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the visual template shared across documents.
A Python framework for multi-modal document understanding with Amazon Bedrock
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized for PDFs and HTML files. Extensive support for tabular data extraction and multimodal queries.
An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.
Semantic extraction from conference proceedings.
This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned formatted as CSV.
Low-Cost LLM-Powered Data Processing with Theoretical Guarantees
Cutting-edge tool that unlocks the full potential of semantic chunking
DocGenius AI - Generative AI Chatbot for your Documents
Serverless Intelligent Document Processing (IDP) solution for invoice automation using Amazon Bedrock Data Automation. Features automated data extraction, annotation, and processing pipeline built with AWS services and CDK.
OnnxTR OCR plugin for Docling
Cutting-edge tool designed to intelligently segment text documents into optimally-sized chunks
Text line detection for Urdu OCR (UTRNet)
A Python package that converts table images into HTML format using an object detection model and OCR.
A GenAI-powered Flask app that audits documents for GDPR/HIPAA compliance, using regex rules and the Anthropic Claude API to suggest remediations.
Use data from MongoDB in LangChain, Llama and OpenAI