Source Code
This notebook covers how to load source code files using a special approach with language parsing: each top-level function and class in the code is loaded into separate documents. Any remaining code top-level code outside the already loaded functions and classes will be loaded into a separate document.
This approach can potentially improve the accuracy of QA models over source code.
The supported languages for code parsing are:
- C (*)
- C++ (*)
- C# (*)
- COBOL
- Go (*)
- Java (*)
- JavaScript (requires package
esprima
) - Kotlin (*)
- Lua (*)
- Perl (*)
- Python
- Ruby (*)
- Rust (*)
- Scala (*)
- TypeScript (*)
Items marked with (*) require the packages tree_sitter
and tree_sitter_languages
.
It is straightforward to add support for additional languages using tree_sitter
,
although this currently requires modifying LangChain.
The language used for parsing can be configured, along with the minimum number of lines required to activate the splitting based on syntax.
If a language is not explicitly specified, LanguageParser
will infer one from
filename extensions, if present.
%pip install -qU esprima esprima tree_sitter tree_sitter_languages
import warnings
warnings.filterwarnings("ignore")
from pprint import pprint
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import LanguageParser
from langchain_text_splitters import Language
API Reference:
loader = GenericLoader.from_filesystem(
"./example_data/source_code",
glob="*",
suffixes=[".py", ".js"],
parser=LanguageParser(),
)
docs = loader.load()
len(docs)
6
for document in docs:
pprint(document.metadata)
{'content_type': 'functions_classes',
'language': <Language.PYTHON: 'python'>,
'source': 'example_data/source_code/example.py'}
{'content_type': 'functions_classes',
'language': <Language.PYTHON: 'python'>,
'source': 'example_data/source_code/example.py'}
{'content_type': 'simplified_code',
'language': <Language.PYTHON: 'python'>,
'source': 'example_data/source_code/example.py'}
{'content_type': 'functions_classes',
'language': <Language.JS: 'js'>,
'source': 'example_data/source_code/example.js'}
{'content_type': 'functions_classes',
'language': <Language.JS: 'js'>,
'source': 'example_data/source_code/example.js'}
{'content_type': 'simplified_code',
'language': <Language.JS: 'js'>,
'source': 'example_data/source_code/example.js'}
print("\n\n--8<--\n\n".join([document.page_content for document in docs]))
class MyClass:
def __init__(self, name):
self.name = name
def greet(self):
print(f"Hello, {self.name}!")
--8<--
def main():
name = input("Enter your name: ")
obj = MyClass(name)
obj.greet()
--8<--
# Code for: class MyClass:
# Code for: def main():
if __name__ == "__main__":
main()
--8<--
class MyClass {
constructor(name) {
this.name = name;
}
greet() {
console.log(`Hello, ${this.name}!`);
}
}
--8<--
function main() {
const name = prompt("Enter your name:");
const obj = new MyClass(name);
obj.greet();
}
--8<--
// Code for: class MyClass {
// Code for: function main() {
main();
The parser can be disabled for small files.
The parameter parser_threshold
indicates the minimum number of lines that the source code file must have to be segmented using the parser.
loader = GenericLoader.from_filesystem(
"./example_data/source_code",
glob="*",
suffixes=[".py"],
parser=LanguageParser(language=Language.PYTHON, parser_threshold=1000),
)
docs = loader.load()
len(docs)
1
print(docs[0].page_content)
class MyClass:
def __init__(self, name):
self.name = name
def greet(self):
print(f"Hello, {self.name}!")
def main():
name = input("Enter your name: ")
obj = MyClass(name)
obj.greet()
if __name__ == "__main__":
main()
Splitting
Additional splitting could be needed for those functions, classes, or scripts that are too big.
loader = GenericLoader.from_filesystem(
"./example_data/source_code",
glob="*",
suffixes=[".js"],
parser=LanguageParser(language=Language.JS),
)
docs = loader.load()
from langchain_text_splitters import (
Language,
RecursiveCharacterTextSplitter,
)
API Reference:
js_splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.JS, chunk_size=60, chunk_overlap=0
)
result = js_splitter.split_documents(docs)
len(result)
7
print("\n\n--8<--\n\n".join([document.page_content for document in result]))
class MyClass {
constructor(name) {
this.name = name;
--8<--
}
--8<--
greet() {
console.log(`Hello, ${this.name}!`);
}
}
--8<--
function main() {
const name = prompt("Enter your name:");
--8<--
const obj = new MyClass(name);
obj.greet();
}
--8<--
// Code for: class MyClass {
// Code for: function main() {
--8<--
main();
Adding Languages using Tree-sitter Template
Expanding language support using the Tree-Sitter template involves a few essential steps:
- Creating a New Language File:
- Begin by creating a new file in the designated directory (langchain/libs/community/langchain_community/document_loaders/parsers/language).
- Model this file based on the structure and parsing logic of existing language files like
cpp.py
. - You will also need to create a file in the langchain directory (langchain/libs/langchain/langchain/document_loaders/parsers/language).
- Parsing Language Specifics:
- Mimic the structure used in the
cpp.py
file, adapting it to suit the language you are incorporating. - The primary alteration involves adjusting the chunk query array to suit the syntax and structure of the language you are parsing.
- Mimic the structure used in the
- Testing the Language Parser:
- For thorough validation, generate a test file specific to the new language. Create
test_language.py
in the designated directory(langchain/libs/community/tests/unit_tests/document_loaders/parsers/language). - Follow the example set by
test_cpp.py
to establish fundamental tests for the parsed elements in the new language.
- For thorough validation, generate a test file specific to the new language. Create
- Integration into the Parser and Text Splitter:
- Incorporate your new language within the
language_parser.py
file. Ensure to update LANGUAGE_EXTENSIONS and LANGUAGE_SEGMENTERS along with the docstring for LanguageParser to recognize and handle the added language. - Also, confirm that your language is included in
text_splitter.py
in class Language for proper parsing.
- Incorporate your new language within the
By following these steps and ensuring comprehensive testing and integration, you'll successfully extend language support using the Tree-Sitter template.
Best of luck!