LangChain — LLMs and SQL?

Laxfed Paulacy
Published in Straight Bias Devs
3 min read · Feb 24, 2024


Real artists ship. — Steve Jobs

LangChain is a powerful library that offers multiple SQL chains and an SQL agent to simplify interactions with SQL databases. In this article, we will explore how LLMs (Large Language Models) can be used with SQL to generate meaningful queries. We’ll discuss the challenges, solutions, and best practices for using LLMs with SQL.
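
The examples in this article assume a working LangChain setup. As a rough sketch (import paths and model names vary between LangChain releases, so treat the details below as assumptions), connecting to the Chinook SQLite database and building an SQL agent looks something like this:

from langchain_community.agent_toolkits import SQLDatabaseToolkit, create_sql_agent
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI

# Connect to a local copy of the Chinook sample database (the file path is an assumption).
db = SQLDatabase.from_uri("sqlite:///Chinook.db")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# The toolkit bundles tools for listing tables, fetching schemas,
# double-checking queries, and executing them.
toolkit = SQLDatabaseToolkit(db=db, llm=llm)
agent_executor = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)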

Describing your database

Describing the structure and data of your database is crucial for guiding the LLM to generate reasonable queries. By providing detailed information about the tables, columns, and example rows, we can enhance the LLM’s ability to create optimal queries.

Describing the schema

You can describe the schema of a table by including the table name, its columns, and their types. For instance, for the ‘Track’ table in the Chinook database:

CREATE TABLE "Track" (
"TrackId" INTEGER NOT NULL,
"Name" NVARCHAR(200) NOT NULL,
"AlbumId" INTEGER,
"MediaTypeId" INTEGER NOT NULL,
"GenreId" INTEGER,
"Composer" NVARCHAR(220),
"Milliseconds" INTEGER NOT NULL,
"Bytes" INTEGER,
"UnitPrice" NUMERIC(10, 2) NOT NULL,
PRIMARY KEY ("TrackId"),
FOREIGN KEY("MediaTypeId") REFERENCES "MediaType" ("MediaTypeId"),
FOREIGN KEY("GenreId") REFERENCES "Genre" ("GenreId"),
FOREIGN KEY("AlbumId") REFERENCES "Album" ("AlbumId")
)
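
If you use LangChain’s SQLDatabase wrapper, you don’t have to write this description by hand. A minimal sketch (assuming a local Chinook SQLite file, as in the setup example above) that prints the CREATE TABLE text the LLM will actually see:

# Only describe the tables relevant to the question to keep the prompt small.
db = SQLDatabase.from_uri("sqlite:///Chinook.db", include_tables=["Track"])

# get_table_info() returns the CREATE TABLE statements that get
# injected into the prompt.
print(db.get_table_info())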

Describing the data

Adding examples of what the data looks like can further improve the LLM’s query generation. Providing a few sample rows in a SELECT statement after the CREATE TABLE description can significantly enhance performance. Here's what that looks like for the 'Track' table:

SELECT * FROM "Track" LIMIT 3;
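
With LangChain’s SQLDatabase wrapper, the number of example rows appended to each table description is controlled by the sample_rows_in_table_info argument. A minimal sketch, again assuming a local Chinook SQLite file:

# Append three sample rows to every CREATE TABLE description in the prompt.
db = SQLDatabase.from_uri(
    "sqlite:///Chinook.db",
    sample_rows_in_table_info=3,
)

# table_info now contains each schema followed by 3 example rows.
print(db.table_info)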

Using custom table information

You can override the automatic schema and data descriptions with custom information, giving the LLM more targeted context. This is useful for hiding sensitive columns or for hand-picking example rows. For instance, a trimmed-down description of the ‘Track’ table:

CREATE TABLE "Track" (
"TrackId" INTEGER NOT NULL,
"Name" NVARCHAR(200) NOT NULL,
"Composer" NVARCHAR(220),
PRIMARY KEY ("TrackId")
)
SELECT * FROM 'Track' LIMIT 4;
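
With the SQLDatabase wrapper, this is done through the custom_table_info argument, a mapping from table name to whatever text should replace the automatic description. A minimal sketch, assuming the same Chinook SQLite file:

custom_track_info = """CREATE TABLE "Track" (
    "TrackId" INTEGER NOT NULL,
    "Name" NVARCHAR(200) NOT NULL,
    "Composer" NVARCHAR(220),
    PRIMARY KEY ("TrackId")
)
SELECT * FROM "Track" LIMIT 4;"""

# Tables listed here use the supplied text; all other tables keep
# the automatically generated description.
db = SQLDatabase.from_uri(
    "sqlite:///Chinook.db",
    custom_table_info={"Track": custom_track_info},
)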

Constraining the size of the output

Limiting the size of query results is essential so they don’t overwhelm the LLM’s context window. You can achieve this by selecting as few columns as possible and capping the number of returned rows. Here’s an example question the agent can answer while keeping its intermediate queries small:

agent_executor.run("List the total sales per country. Which country's customers spent the most?")
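
Both the SQL chains and the SQL agent expose a row cap that is baked into the prompt, telling the LLM to add a LIMIT clause to its queries. A rough sketch, reusing the llm and toolkit from the setup example (the top_k keyword is how recent LangChain releases name this cap; treat the exact argument as an assumption):

# Instruct the agent never to request more than 5 rows per query.
agent_executor = create_sql_agent(
    llm=llm,
    toolkit=toolkit,
    top_k=5,
    verbose=True,
)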

Checking syntax

If the LLM-generated query is broken, the agent needs to identify and correct the issue, much as a human would. The agent toolkit provides a query_checker_sql_db tool that the agent can use to double-check a query before executing it, and execution errors come back as observations the agent can recover from. Here’s an example where a generated query fails because the Track table has no ArtistId column:

Action Input: SELECT Artist.Name, SUM(InvoiceLine.Quantity) AS TotalQuantity FROM Artist INNER JOIN Track ON Artist.ArtistId = Track.ArtistId INNER JOIN InvoiceLine ON Track.TrackId = InvoiceLine.TrackId GROUP BY Artist.Name ORDER BY TotalQuantity DESC LIMIT 3
Observation: Error: (sqlite3.OperationalError) no such column: Track.ArtistId
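
The query_checker_sql_db tool behind this flow simply asks the LLM to review a query for common mistakes before it is executed. You can also invoke the underlying tool directly; a rough sketch, reusing the db and llm objects from the setup example (module paths vary between LangChain releases):

from langchain_community.tools.sql_database.tool import QuerySQLCheckerTool

checker = QuerySQLCheckerTool(db=db, llm=llm)

# Ask the LLM to review the failing query; in the Chinook schema,
# ArtistId lives on the Album table, not on Track, so the checker
# should suggest joining through Album instead.
print(checker.run(
    "SELECT Artist.Name, SUM(InvoiceLine.Quantity) AS TotalQuantity "
    "FROM Artist INNER JOIN Track ON Artist.ArtistId = Track.ArtistId "
    "INNER JOIN InvoiceLine ON Track.TrackId = InvoiceLine.TrackId "
    "GROUP BY Artist.Name ORDER BY TotalQuantity DESC LIMIT 3"
))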

Future work

The field of LLM-SQL interaction is evolving rapidly, and there are ongoing efforts to optimize the process. Some future work includes integrating few-shot examples and leveraging subqueries to enhance the LLM’s query generation accuracy.

In conclusion, by providing detailed descriptions of the database schema, data, and custom information, along with constraining the size of the output and checking syntax, we can effectively harness LLMs for SQL interactions.

If you’re interested in contributing or sharing best practices, join the discussion in the #sql channel on Discord or consider contributing via a PR.

This tutorial has provided an overview of using LLMs with SQL, focusing on practical examples and best practices for guiding the LLM to generate meaningful queries.
