# installs
! pip install -q duckdb
! pip install -q mongita
! pip install -q kuzu
! pip install -q lancedb
! pip install -q sentence-transformers
! pip install -q requests
In my teaching, I often want to demonstrate a variety of database options to my students. However, setting up servers introduces a layer of complexity that isn’t practical for certain offerings. Fortunately, there’s a category of databases that eliminate the need for running servers. These are commonly referred to as embedded, or in-process databases.
Below, I’ll highlight some options across different database types:
- SQLite: A lightweight OLTP database.
- DuckDB: A fast OLAP database quickly becoming a favorite tool for data professionals.
- Mongita: A document database designed to mimic MongoDB’s functionality.
- Kuzu: A graph database that leverages Neo4j’s Cypher query language.
- LanceDB: A vector database built for modern data applications.
Let’s get started.
Setup
# imports
import pandas as pd
import requests
from sentence_transformers import SentenceTransformer
# get an example database for sqlite
! wget -qO music.sqlite https://github.com/lerocha/chinook-database/raw/refs/heads/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite
# get an example for the graph database
! wget -q https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
I am on a Mac and installed wget via Homebrew. If you want to follow along, the easiest path is probably Google Colab.
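If wget isn't available on your machine (for example, on Windows), the same files can be fetched in pure Python. Below is a minimal sketch using requests and the two URLs above.
# download the sample data without wget
import requests

urls = {
    "music.sqlite": "https://github.com/lerocha/chinook-database/raw/refs/heads/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite",
    "ml-latest-small.zip": "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip",
}

for fname, url in urls.items():
    r = requests.get(url)
    r.raise_for_status()
    with open(fname, "wb") as f:
        f.write(r.content)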
With the libraries installed and some sample data downloaded from the web, let’s define the connections to each database.
# instantiate each database

# oltp - single file
import sqlite3
oltp = sqlite3.connect("music.sqlite")

# olap - single file
import duckdb
olap = duckdb.connect("olap.duckdb")

# doc - creates a folder
from mongita import MongitaClientDisk
doc = MongitaClientDisk("docdb")

# graph - creates a folder
import kuzu
g = kuzu.Database("graphdb")
graph = kuzu.Connection(g)

# vector - creates a folder
import lancedb
vec = lancedb.connect("vectordb")
OLTP via SQLite
I won’t spend much time on SQLite and Python, as there are plenty of tutorials on the web. Below, I demonstrate how to interface with SQLite via Python and pandas.
# execute a query to show the tables
# list tables in the sqlite database
oltp.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
[('Album',),
('Artist',),
('Customer',),
('Employee',),
('Genre',),
('Invoice',),
('InvoiceLine',),
('MediaType',),
('Playlist',),
('PlaylistTrack',),
('Track',)]
# of course, we can query data directly into pandas
# join the artists and albums
sql = """
select a.*,
       b.albumid,
       b.title
from artist a
join album b
  on a.artistid = b.artistid
"""
df = pd.read_sql(sql, oltp)
df.head(3)
|   | ArtistId | Name   | AlbumId | Title                                 |
|---|----------|--------|---------|---------------------------------------|
| 0 | 1        | AC/DC  | 1       | For Those About To Rock We Salute You |
| 1 | 2        | Accept | 2       | Balls to the Wall                     |
| 2 | 2        | Accept | 3       | Restless and Wild                     |
# create a silly table
create_table_query = """
CREATE TABLE IF NOT EXISTS applicants (
    id INTEGER PRIMARY KEY,
    name TEXT,
    position TEXT,
    salary REAL
)
"""

# Execute the query
oltp.execute(create_table_query)
oltp.commit()
Ensure the table exists.
"SELECT name FROM sqlite_master WHERE type='table';").fetchall() oltp.execute(
[('Album',),
('Artist',),
('Customer',),
('Employee',),
('Genre',),
('Invoice',),
('InvoiceLine',),
('MediaType',),
('Playlist',),
('PlaylistTrack',),
('Track',),
('applicants',)]
# a simple dataset
applicants = [
    ("Alice", "Manager", 70000),
    ("Bob", "Analyst", 50000),
    ("Charlie", "Clerk", 30000)
]

# insert into the table
insert_query = """
INSERT INTO applicants (name, position, salary) VALUES (?, ?, ?)
"""
oltp.executemany(insert_query, applicants)
oltp.commit()
# let's grab the data with pandas
pd.read_sql("SELECT * FROM applicants", oltp)
|   | id | name    | position | salary  |
|---|----|---------|----------|---------|
| 0 | 1  | Alice   | Manager  | 70000.0 |
| 1 | 2  | Bob     | Analyst  | 50000.0 |
| 2 | 3  | Charlie | Clerk    | 30000.0 |
# of course we can update records
update_query = """
UPDATE applicants SET salary = ? WHERE name = ?
"""
oltp.execute(update_query, (80000, "Alice"))
oltp.commit()

pd.read_sql("SELECT * FROM applicants", oltp)
|   | id | name    | position | salary  |
|---|----|---------|----------|---------|
| 0 | 1  | Alice   | Manager  | 80000.0 |
| 1 | 2  | Bob     | Analyst  | 50000.0 |
| 2 | 3  | Charlie | Clerk    | 30000.0 |
# we can even use pandas to insert/append to the table
new_data = pd.DataFrame({
    "name": ["David", "Eve"],
    "position": ["Engineer", "HR"],
    "salary": [60000, 55000]
})
new_data.to_sql("applicants", oltp, if_exists="append", index=False)  # returns the number of rows written
2
# review the data
applicants_table = pd.read_sql("SELECT * FROM applicants", oltp)
applicants_table
|   | id | name    | position | salary  |
|---|----|---------|----------|---------|
| 0 | 1  | Alice   | Manager  | 80000.0 |
| 1 | 2  | Bob     | Analyst  | 50000.0 |
| 2 | 3  | Charlie | Clerk    | 30000.0 |
| 3 | 4  | David   | Engineer | 60000.0 |
| 4 | 5  | Eve     | HR       | 55000.0 |
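To round out the basic CRUD operations, deletes use the same parameterized pattern. A minimal sketch, removing the Eve row we just appended:
# delete a record and confirm
oltp.execute("DELETE FROM applicants WHERE name = ?", ("Eve",))
oltp.commit()
pd.read_sql("SELECT * FROM applicants", oltp)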
OLAP via DuckDB
I am a huge fan of DuckDB.
- Like SQLite, it’s an in-process database, and it can also run fully in-memory (a quick sketch follows below).
- It’s incredibly performant.
- It can be used as an analytics engine or as a backend to power data applications.
- MotherDuck is a cloud-hosted data warehouse built on DuckDB.
There are a variety of applications. If you are interested in learning more about how DuckDB is empowering data teams, I refer you to the free resource here.
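On the in-memory point above, a minimal sketch: calling duckdb.connect() with no path (or with ":memory:") gives you a throwaway database that never touches disk.
# an in-memory duckdb connection - nothing is written to disk
mem = duckdb.connect()
mem.execute("SELECT 42 AS answer").df()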
# we can create a table directly from pandas
olap.execute("CREATE TABLE applicants AS SELECT * FROM applicants_table")
<duckdb.duckdb.DuckDBPyConnection at 0x1453148f0>
# verify that the data exists
olap.execute("SELECT * FROM applicants").df()
|   | id | name    | position | salary  |
|---|----|---------|----------|---------|
| 0 | 1  | Alice   | Manager  | 80000.0 |
| 1 | 2  | Bob     | Analyst  | 50000.0 |
| 2 | 3  | Charlie | Clerk    | 30000.0 |
| 3 | 4  | David   | Engineer | 60000.0 |
| 4 | 5  | Eve     | HR       | 55000.0 |
# create two csv files
applicants_table.to_csv("e1.csv", index=False)
applicants_table.to_csv("e2.csv", index=False)

# duckdb is amazing - scan a directory of files and return as a single dataframe
combined_csvs = olap.execute("SELECT * FROM '*.csv'").df()
combined_csvs
|   | id | name    | position | salary  |
|---|----|---------|----------|---------|
| 0 | 1  | Alice   | Manager  | 80000.0 |
| 1 | 2  | Bob     | Analyst  | 50000.0 |
| 2 | 3  | Charlie | Clerk    | 30000.0 |
| 3 | 4  | David   | Engineer | 60000.0 |
| 4 | 5  | Eve     | HR       | 55000.0 |
| 5 | 1  | Alice   | Manager  | 80000.0 |
| 6 | 2  | Bob     | Analyst  | 50000.0 |
| 7 | 3  | Charlie | Clerk    | 30000.0 |
| 8 | 4  | David   | Engineer | 60000.0 |
| 9 | 5  | Eve     | HR       | 55000.0 |
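DuckDB is also handy for handing results back to files. A minimal sketch (the applicants.parquet filename is just illustrative) that writes the combined CSV scan out as Parquet and queries it right back:
# write the combined csv scan to a single parquet file and query it back
olap.execute("COPY (SELECT * FROM '*.csv') TO 'applicants.parquet' (FORMAT PARQUET)")
olap.execute("SELECT count(*) AS n FROM 'applicants.parquet'").df()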
Document Database via Mongita
From the repo, Mongita is a lightweight embedded document database that implements a commonly-used subset of the MongoDB/PyMongo interface.
To demonstrate some basic data operations, I am going to use the Fruityvice API.
# get the data from the API
resp = requests.get("https://www.fruityvice.com/api/fruit/all")
fruits_resp = resp.json()
fruits_resp[0]
{'name': 'Persimmon',
'id': 52,
'family': 'Ebenaceae',
'order': 'Rosales',
'genus': 'Diospyros',
'nutritions': {'calories': 81,
'fat': 0.0,
'sugar': 18.0,
'carbohydrates': 18.0,
'protein': 0.0}}
# from the documentation
# create the database within the document store
fruits_db = doc.fruits_db

# the collection
fruits = fruits_db.fruits
# add documents to the fruits collection
fruits.insert_many(fruits_resp)
<mongita.results.InsertManyResult at 0x112bea0f0>
# how many documents are there - no query filter
fruits.count_documents({})
49
# get one document
fruits.find_one()
{'name': 'Persimmon',
'id': 52,
'family': 'Ebenaceae',
'order': 'Rosales',
'genus': 'Diospyros',
'nutritions': {'calories': 81,
'fat': 0.0,
'sugar': 18.0,
'carbohydrates': 18.0,
'protein': 0.0},
'_id': ObjectId('677187a7635da2c06e2f038a')}
# we can query documents where the calories key is greater than 100
list( fruits.find({'nutritions.calories': {'$gt': 100}}) )
[{'name': 'Durian',
'id': 60,
'family': 'Malvaceae',
'order': 'Malvales',
'genus': 'Durio',
'nutritions': {'calories': 147,
'fat': 5.3,
'sugar': 6.75,
'carbohydrates': 27.1,
'protein': 1.5},
'_id': ObjectId('677187a7635da2c06e2f038f')},
{'name': 'Avocado',
'id': 84,
'family': 'Lauraceae',
'order': 'Laurales',
'genus': 'Persea',
'nutritions': {'calories': 160,
'fat': 14.66,
'sugar': 0.66,
'carbohydrates': 8.53,
'protein': 2.0},
'_id': ObjectId('677187a7635da2c06e2f03ac')},
{'name': 'Hazelnut',
'id': 96,
'family': 'Betulaceae',
'order': 'Fagales',
'genus': 'Corylus',
'nutritions': {'calories': 628,
'fat': 61.0,
'sugar': 4.3,
'carbohydrates': 17.0,
'protein': 15.0},
'_id': ObjectId('677187a7635da2c06e2f03b3')}]
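Mongita also covers the common PyMongo write methods. Assuming update_one and delete_one behave as they do in PyMongo, a small sketch:
# update a single document - flag the Durian record
fruits.update_one({"name": "Durian"}, {"$set": {"reviewed": True}})

# delete a single document and recount
fruits.delete_one({"name": "Hazelnut"})
fruits.count_documents({})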
Resources
The GitHub repo is fantastic. Review the README. As you will see, the library is also pretty performant!
Graph Database via Kuzu
Kuzu is an embedded graph database. Graph databases allow us to store our data as nodes and edges, where nodes are the entities in our application and the edges, or relationships, describe how those entities are related.
While I am a huge fan of Neo4j, Kuzu provides an attractive embedded alternative that also focuses on graph analytics tasks.
For this demo, I am going to use a small subset of the MovieLens data (ml-latest-small), which was downloaded via wget above.
# unzip the dataset
! unzip ml-latest-small.zip
Archive: ml-latest-small.zip
creating: ml-latest-small/
inflating: ml-latest-small/links.csv
inflating: ml-latest-small/tags.csv
inflating: ml-latest-small/ratings.csv
inflating: ml-latest-small/README.txt
inflating: ml-latest-small/movies.csv
# read in two of the datasets
movies = pd.read_csv("ml-latest-small/movies.csv")
ratings = pd.read_csv("ml-latest-small/ratings.csv")

movies.sample(3)
|      | movieId | title                                           | genres          |
|------|---------|-------------------------------------------------|-----------------|
| 2289 | 3036    | Quest for Fire (Guerre du feu, La) (1981)       | Adventure|Drama |
| 9711 | 187717  | Won't You Be My Neighbor? (2018)                | Documentary     |
| 1753 | 2351    | Nights of Cabiria (Notti di Cabiria, Le) (1957) | Drama           |
ratings.sample(3)
|       | userId | movieId | rating | timestamp  |
|-------|--------|---------|--------|------------|
| 23954 | 166    | 3052    | 3.5    | 1188773616 |
| 42564 | 288    | 2380    | 1.0    | 975692910  |
| 483   | 4      | 3851    | 5.0    | 986849180  |
Kuzu can import from a variety of sources, including Pandas dataframes. For the dataset above, we need to parse out the users into their own dataframe.
# kuzu can read from pandas dataframes
user = ratings[['userId']].drop_duplicates()
user.to_csv("user.csv", index=False)

user.head(3)
|     | userId |
|-----|--------|
| 0   | 1      |
| 232 | 2      |
| 261 | 3      |
Unlike Neo4j, it appears that we have to define Node and Relationship (edge) tables prior to import. We will do that below.
# create the schema - kuzu specific
# nodes
graph.execute("CREATE NODE TABLE Movie(movieId INT64, title STRING, genres STRING, PRIMARY KEY (movieId))")
graph.execute("CREATE NODE TABLE User(userId INT64, PRIMARY KEY (userId))")

# relationships/edges
graph.execute("CREATE REL TABLE Rating(FROM User TO Movie, rating FLOAT, timestamp INT64)")
<kuzu.query_result.QueryResult at 0x1459c6b70>
# import the data directly from the DataFrames
graph.execute('COPY Movie FROM movies')
graph.execute('COPY User FROM user')
graph.execute('COPY Rating FROM ratings')
<kuzu.query_result.QueryResult at 0x1471a2ed0>
Below is a Cypher query that matches the pattern of users rating movies, then aggregates to rank movies by their number of ratings while also reporting the average rating, which is a property stored on the Rating relationship.
# query
cql = """
MATCH (u:User)-[r]->(m:Movie)
WITH m.title AS title,
     AVG(r.rating) AS avg_rating,
     COUNT(r) AS num_ratings
ORDER BY num_ratings DESC
LIMIT 10
RETURN title, avg_rating, num_ratings
"""
graph.execute(cql).get_as_df()
|   | title                                     | avg_rating | num_ratings |
|---|-------------------------------------------|------------|-------------|
| 0 | Forrest Gump (1994)                       | 4.164134   | 329         |
| 1 | Shawshank Redemption, The (1994)          | 4.429022   | 317         |
| 2 | Pulp Fiction (1994)                       | 4.197068   | 307         |
| 3 | Silence of the Lambs, The (1991)          | 4.161290   | 279         |
| 4 | Matrix, The (1999)                        | 4.192446   | 278         |
| 5 | Star Wars: Episode IV - A New Hope (1977) | 4.231076   | 251         |
| 6 | Jurassic Park (1993)                      | 3.750000   | 238         |
| 7 | Braveheart (1995)                         | 4.031646   | 237         |
| 8 | Terminator 2: Judgment Day (1991)         | 3.970982   | 224         |
| 9 | Schindler's List (1993)                   | 4.225000   | 220         |
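Pattern matching also makes per-user questions straightforward. A minimal sketch (userId 1 is an arbitrary choice) that pulls one user's highest-rated movies:
# top rated movies for a single user
cql_user = """
MATCH (u:User)-[r:Rating]->(m:Movie)
WHERE u.userId = 1
RETURN m.title AS title, r.rating AS rating
ORDER BY rating DESC
LIMIT 5
"""
graph.execute(cql_user).get_as_df()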
Resources
- https://docs.kuzudb.com/import/copy-from-dataframe/
- Learn Cypher, which has been widely adopted as the query language for graph databases.
Vector Database via LanceDB
LanceDB is an embeddable vector database, a category that has become incredibly popular with the rise of LLMs and Generative AI. Search interest for “Vector Database” over the last five years spikes just after the release of ChatGPT in November 2022.
# a synthetic dataset of text abstracts
abstracts = [
    {"id": "1",
"title": "Optimizing Transformers for Real-Time Applications",
"abstract": "This paper explores optimization techniques for transformer architectures to reduce latency in real-time applications. We evaluate pruning, quantization, and hardware-specific tuning, achieving up to 40% reduction in inference time without compromising accuracy.",
"topic": "Transformer Optimization"
},
{"id": "2",
"title": "Adversarial Robustness in Deep Neural Networks",
"abstract": "We present a novel adversarial training algorithm that enhances the robustness of deep neural networks against various attack strategies. Our method outperforms baseline approaches by 15% on benchmark datasets like CIFAR-10 and ImageNet.",
"topic": "Adversarial Robustness"
},
{"id": "3",
"title": "Generative Adversarial Networks for Data Augmentation",
"abstract": "This study proposes a GAN-based data augmentation technique to improve model generalization. Our experiments demonstrate significant accuracy improvements in medical imaging and natural language processing tasks.",
"topic": "GANs and Data Augmentation"
},
{"id": "4",
"title": "Explainability in Large Language Models",
"abstract": "We introduce a framework for probing the decision-making processes of large language models, offering insights into token importance and context understanding. Case studies reveal increased transparency in summarization and classification tasks.",
"topic": "Explainability in AI"
},
{"id": "5",
"title": "Efficient Training of Sparse Neural Networks",
"abstract": "Our research focuses on sparse neural networks and their potential for reducing computational overhead. Using structured pruning techniques, we achieve up to 60% parameter reduction while maintaining competitive performance.",
"topic": "Sparse Neural Networks"
    }
]
# initialize an embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")
# apply the model to the abstracts
for a in abstracts:
    a["vector"] = model.encode(a["abstract"])
# create a table index
tbl = vec.create_table("abstracts", data=abstracts)
# database inspection
vec.table_names()
['abstracts']
# get the embedding for a new abstract
new_abstract = [
    {"id": "20",
"title": "Automating Feature Engineering with Generative AI",
"abstract": "We propose a generative AI framework for automating feature engineering in machine learning pipelines. Case studies demonstrate reduced development time and improved model performance.",
"topic": "Automated Feature Engineering"
}
]
## nn search
embedding = model.encode(new_abstract[0]["abstract"])

# search
results = tbl.search(embedding).limit(3).to_pandas()
results
|   | id | title                                              | abstract                                           | topic                       | vector                                             | _distance |
|---|----|----------------------------------------------------|----------------------------------------------------|-----------------------------|----------------------------------------------------|-----------|
| 0 | 3  | Generative Adversarial Networks for Data Augme...  | This study proposes a GAN-based data augmentat...  | GANs and Data Augmentation  | [-0.05354594, -0.09120101, 0.08235424, 0.03139...  | 1.060543  |
| 1 | 2  | Adversarial Robustness in Deep Neural Networks     | We present a novel adversarial training algori...  | Adversarial Robustness      | [-0.09800007, -0.018522585, -0.0128599, 0.0551...  | 1.359992  |
| 2 | 1  | Optimizing Transformers for Real-Time Applicat...  | This paper explores optimization techniques fo...  | Transformer Optimization    | [-0.06164691, 0.058088887, -0.047632158, -0.01...  | 1.428792  |
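LanceDB can also combine vector search with SQL-style metadata filtering. A minimal sketch, assuming the topic values used above, restricts the nearest-neighbor search with a where clause:
# nn search restricted by a metadata filter on the topic column
tbl.search(embedding).where("topic = 'Transformer Optimization'").limit(3).to_pandas()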
Resources
- The documentation is fantastic and provides a range of resources, including a conceptual overview of where vector databases fit into the analytics ecosystem.
Summary
Above is a brief review of how to get started with a variety of embedded databases that can power application and analytical workloads without requiring a client/server architecture.
It’s worth noting that there are online hosted versions of the technologies above. The list below highlights offerings that have generous free tiers, which have proven to be helpful for in-class projects.