Have we solved materials data extraction from PDFs?

Uploaded: Apr 22, 2026
Duration: 434 s

Taylor Sparks (49.7K subscribers)

436 views

In this Materials Minute, we highlight “KnowMat: An Agentic Approach to Transforming Unstructured Material Science Literature into Structured Data” from Integrating Materials and Manufacturing Innovation (IMMI). KnowMat is a multi-agent LLM pipeline that converts full-text materials science PDFs into schema-aligned, machine-readable JSON. Using typed tool-calling, iterative evaluation, aggregation, and confidence scoring, the system dramatically reduces hallucinations and enforces consistent structure. Across 30 real-world materials science papers, KnowMat achieved an 89.9% F1 score, compared to just 41.4% for a zero-shot GPT-5 baseline, demonstrating the power of agentic workflows for scientific data extraction. All code is open source and available on GitHub.

📄 Article: Sayeed, Clark, Mohanty, Sparks. KnowMat: An Agentic Approach to Transforming Unstructured Material Science Literature into Structured Data. Integrating Materials and Manufacturing Innovation (2026). IMMJ-D-25-00228_R1

⏱️ Time Stamps
00:00 – The materials data bottleneck (it’s not scarcity, it’s extraction)
01:05 – Schema-enforced tool calling (TrustCall + Pydantic)
02:40 – Multi-agent architecture: extractor, evaluator, manager, flagging
05:20 – Results: 89.9% F1 vs. 41.4% zero-shot baseline
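
For readers unfamiliar with the F1 metric quoted above, here is a minimal sketch of how an F1 score can be computed for an extraction task. It assumes exact-match scoring over (field, value) pairs, which is a common convention but not necessarily the matching protocol used in the KnowMat paper. The function name, field names, and values are hypothetical and chosen only for illustration.

```python
def f1_score(predicted: set, gold: set) -> float:
    """Harmonic mean of precision and recall over exact-match items.

    Assumption: an extracted item counts as correct only if it appears
    verbatim in the gold set. The paper may use a different matching rule.
    """
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)  # fraction of extractions that are correct
    recall = true_positives / len(gold)          # fraction of gold items that were found
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold annotations and model output for a single paper.
gold = {
    ("composition", "Ti-6Al-4V"),
    ("yield_strength_MPa", "880"),
    ("process", "annealed"),
}
predicted = {
    ("composition", "Ti-6Al-4V"),
    ("yield_strength_MPa", "880"),
    ("hardness_HV", "349"),  # not in the gold set, so it counts against precision
}

score = f1_score(predicted, gold)
print(f"F1 = {score:.3f}")
```

In this toy case two of three extracted pairs are correct and two of three gold pairs are recovered, so precision and recall are both 2/3. A hallucinated field lowers precision and a missed field lowers recall, which is why the metric is sensitive to both failure modes.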
