Unstructured
The unstructured package from Unstructured.IO extracts clean text from raw source documents like PDFs and Word documents. This page covers how to use the unstructured ecosystem within LangChain.
Installation and Setup
If you are using a loader that runs locally, use the following steps to get unstructured and its dependencies running locally.
- Install the Python SDK with pip install unstructured.
  - You can install document-specific dependencies with extras, e.g. pip install "unstructured[docx]".
  - To install the dependencies for all document types, use pip install "unstructured[all-docs]".
- Install the following system dependencies if they are not already available on your system. Depending on what document types you're parsing, you may not need all of these.
  - libmagic-dev (filetype detection)
  - poppler-utils (images and PDFs)
  - tesseract-ocr (images and PDFs)
  - libreoffice (MS Office docs)
  - pandoc (EPUBs)
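Before parsing documents locally, it can help to see which of the system dependencies listed above are already present. The following is a minimal sketch using only the Python standard library. The executable names it looks for (pdftotext, tesseract, soffice, pandoc) are assumptions about what each package typically installs, not something this page specifies, and the helper name check_system_dependencies is hypothetical.

```python
import ctypes.util
import shutil

# Assumed executable installed by each package; adjust for your platform.
SYSTEM_TOOLS = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def check_system_dependencies() -> dict:
    """Report which system dependencies appear to be installed."""
    # A tool counts as present when its executable is found on PATH.
    found = {pkg: shutil.which(exe) is not None for pkg, exe in SYSTEM_TOOLS.items()}
    # libmagic is a shared library rather than a command, so look it up by name.
    found["libmagic-dev"] = ctypes.util.find_library("magic") is not None
    return found


# Example usage:
#   for package, present in check_system_dependencies().items():
#       print(f"{package}: {'found' if present else 'missing'}")
```

A missing entry is only a problem if you parse the matching document types, so treat the result as a hint rather than a hard requirement.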
If you want to get up and running with less setup, you can simply run pip install unstructured and use UnstructuredAPIFileLoader or UnstructuredAPIFileIOLoader. That will process your document using the hosted Unstructured API.
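As a rough sketch of the hosted route, the helper below passes a file path and an API key to UnstructuredAPIFileLoader. The function name load_with_hosted_api and the environment variable UNSTRUCTURED_API_KEY are hypothetical names chosen for illustration. The file_path and api_key parameters are taken from the langchain_community loader and may differ between versions.

```python
import os
from typing import Optional


def load_with_hosted_api(file_path: str, api_key: Optional[str] = None):
    """Send one file to the hosted Unstructured API and return LangChain documents."""
    # UNSTRUCTURED_API_KEY is a hypothetical environment variable name.
    key = api_key or os.environ.get("UNSTRUCTURED_API_KEY")
    if not key:
        raise ValueError("An Unstructured API key is required")

    # Imported lazily so the key check works even without langchain_community.
    from langchain_community.document_loaders import UnstructuredAPIFileLoader

    loader = UnstructuredAPIFileLoader(file_path=file_path, api_key=key)
    return loader.load()


# Example usage (requires network access and a valid key):
#   docs = load_with_hosted_api("example.pdf")
#   print(docs[0].page_content[:200])
```

Checking for the key before constructing the loader surfaces a missing credential as a clear error instead of a failed request.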
The Unstructured API requires API keys to make requests.
You can request an API key here and start using it today!
Check out the README here to get started making API calls.
We'd love to hear your feedback; let us know how it goes in our community Slack.
And stay tuned for improvements to both quality and performance!
Check out the instructions here if you'd like to self-host the Unstructured API or run it locally.
Data Loaders
The primary usage of Unstructured is in data loaders.
UnstructuredAPIFileIOLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredAPIFileIOLoader
UnstructuredAPIFileLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredAPIFileLoader
UnstructuredCHMLoader
CHM means Microsoft Compiled HTML Help.
See a usage example in the API documentation.
from langchain_community.document_loaders import UnstructuredCHMLoader
UnstructuredCSVLoader
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.
See a usage example.
from langchain_community.document_loaders import UnstructuredCSVLoader
UnstructuredEmailLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredEmailLoader
UnstructuredEPubLoader
EPUB is an e-book file format that uses the ".epub" file extension. The term is short for electronic publication and is sometimes styled ePub. EPUB is supported by many e-readers, and compatible software is available for most smartphones, tablets, and computers.
See a usage example.
from langchain_community.document_loaders import UnstructuredEPubLoader
UnstructuredExcelLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredExcelLoader
UnstructuredFileIOLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredFileIOLoader
UnstructuredFileLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredFileLoader
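A minimal sketch of loading a local file follows. It assumes the mode argument of the langchain_community loader, which accepts "single", "elements", or "paged"; this page does not describe it. The helper names load_local_file and count_categories are hypothetical, and the "category" metadata key is what the loader is expected to attach in "elements" mode.

```python
from collections import Counter

VALID_MODES = {"single", "elements", "paged"}


def load_local_file(file_path: str, mode: str = "elements"):
    """Parse a local file with unstructured and return LangChain documents."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")

    # Imported lazily so the mode check works even without langchain_community.
    from langchain_community.document_loaders import UnstructuredFileLoader

    loader = UnstructuredFileLoader(file_path, mode=mode)
    return loader.load()


def count_categories(docs) -> Counter:
    """Count documents by their "category" metadata value."""
    return Counter(doc.metadata.get("category", "unknown") for doc in docs)


# Example usage (requires unstructured and its dependencies for the file type):
#   docs = load_local_file("report.docx", mode="elements")
#   print(count_categories(docs))
```

In "single" mode the whole file comes back as one document, while "elements" keeps each detected element separate, which is why the category count is only informative in the latter.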
UnstructuredHTMLLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredHTMLLoader
UnstructuredImageLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredImageLoader
UnstructuredMarkdownLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredMarkdownLoader
UnstructuredODTLoader
The Open Document Format for Office Applications (ODF), also known as OpenDocument, is an open file format for word processing documents, spreadsheets, presentations, and graphics that uses ZIP-compressed XML files. It was developed with the aim of providing an open, XML-based file format specification for office applications.
See a usage example.
from langchain_community.document_loaders import UnstructuredODTLoader
UnstructuredOrgModeLoader
Org Mode is a document editing, formatting, and organizing mode, designed for notes, planning, and authoring within the free software text editor Emacs.
See a usage example.
from langchain_community.document_loaders import UnstructuredOrgModeLoader
UnstructuredPDFLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredPDFLoader
UnstructuredPowerPointLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredPowerPointLoader
UnstructuredRSTLoader
A reStructuredText (RST) file is a file format for textual data used primarily in the Python programming language community for technical documentation.
See a usage example.
from langchain_community.document_loaders import UnstructuredRSTLoader
UnstructuredRTFLoader
See a usage example in the API documentation.
from langchain_community.document_loaders import UnstructuredRTFLoader
UnstructuredTSVLoader
A tab-separated values (TSV) file is a simple, text-based file format for storing tabular data. Records are separated by newlines, and values within a record are separated by tab characters.
See a usage example.
from langchain_community.document_loaders import UnstructuredTSVLoader
UnstructuredURLLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredURLLoader
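The sketch below shows one way this loader might be called. It assumes the urls and continue_on_failure parameters of the langchain_community loader. The helpers filter_http_urls and load_urls are hypothetical, and the pre-filtering step is an illustrative choice rather than something the loader requires.

```python
from urllib.parse import urlparse


def filter_http_urls(urls):
    """Keep only entries that have an http(s) scheme and a host."""
    kept = []
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            kept.append(url)
    return kept


def load_urls(urls):
    """Fetch and parse web pages with unstructured, returning LangChain documents."""
    cleaned = filter_http_urls(urls)
    if not cleaned:
        raise ValueError("No valid http(s) URLs were provided")

    # Imported lazily so the URL check works even without langchain_community.
    from langchain_community.document_loaders import UnstructuredURLLoader

    # continue_on_failure lets the loader skip pages that cannot be fetched.
    loader = UnstructuredURLLoader(urls=cleaned, continue_on_failure=True)
    return loader.load()


# Example usage (requires network access):
#   docs = load_urls(["https://example.com"])
```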
UnstructuredWordDocumentLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredWordDocumentLoader
UnstructuredXMLLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredXMLLoader