<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Brock Tibert</title>
<link>https://brocktibert.com/blog.html</link>
<atom:link href="https://brocktibert.com/blog.xml" rel="self" type="application/rss+xml"/>
<description>Intersection of Cloud, Analytics, Product and Education and a touch of sports.</description>
<generator>quarto-1.6.32</generator>
<lastBuildDate>Sun, 29 Dec 2024 05:00:00 GMT</lastBuildDate>
<item>
  <title>Single-File and In-process/Embedded Databases</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/20241229-single-file-db/single_file_embedded_db.html</link>
  <description><![CDATA[ 





<p>In my teaching, I often want to demonstrate a variety of database options to my students. However, setting up servers introduces a layer of complexity that isn’t practical for certain offerings. Fortunately, there’s a category of databases that eliminate the need for running servers. These are commonly referred to as embedded, or in-process databases.</p>
<p>Below, I’ll highlight some options across different database types:</p>
<ul>
<li><strong>SQLite</strong>: A lightweight OLTP database.<br>
</li>
<li><strong>DuckDB</strong>: A fast OLAP database quickly becoming a favorite tool for data professionals.<br>
</li>
<li><strong>Mongita</strong>: A document database designed to mimic MongoDB’s functionality.<br>
</li>
<li><strong>Kuzu</strong>: A graph database that leverages Neo4j’s Cypher query language.<br>
</li>
<li><strong>LanceDB</strong>: A vector database built for modern data applications.</li>
</ul>
<p>Let’s get started.</p>
<section id="setup" class="level2">
<h2 class="anchored" data-anchor-id="setup">Setup</h2>
<div id="cell-2" class="cell" data-outputid="544b4f6d-fab6-40a9-e429-4487b1ea675b" data-execution_count="25">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># installs</span></span>
<span id="cb1-2"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span> pip install <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>q duckdb</span>
<span id="cb1-3"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span> pip install <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>q mongita</span>
<span id="cb1-4"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span> pip install <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>q kuzu</span>
<span id="cb1-5"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span> pip install <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>q lancedb</span>
<span id="cb1-6"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span> pip install <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>q sentence<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>transformers</span>
<span id="cb1-7"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span> pip install <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>q requests</span></code></pre></div>
</div>
<div id="cell-3" class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># imports</span></span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> requests</span>
<span id="cb2-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sentence_transformers <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SentenceTransformer</span>
<span id="cb2-5"></span></code></pre></div>
</div>
<div id="cell-4" class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># get an example database for sqlite</span></span>
<span id="cb3-2"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span> wget <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>qO music.sqlite https:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span>github.com<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>lerocha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>chinook<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>database<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>raw<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>refs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>heads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>master<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>ChinookDatabase<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>DataSources<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>Chinook_Sqlite.sqlite</span>
<span id="cb3-3"></span>
<span id="cb3-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># get an example for the graph database</span></span>
<span id="cb3-5"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span> wget <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>q https:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span>files.grouplens.org<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>datasets<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>movielens<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>ml<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>latest<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>small.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span></span></code></pre></div>
</div>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>I am on a Mac and installed <code>wget</code> via homebrew. To follow along, the easiest path might be to follow along via Google Colab.</p>
</div>
</div>
<p>With the libraries installed and some sample data downloaded from the web, let’s define the connections to each database.</p>
<div id="cell-7" class="cell" data-execution_count="9">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># instantiate each database</span></span>
<span id="cb4-2"></span>
<span id="cb4-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># oltp - single file</span></span>
<span id="cb4-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> sqlite3</span>
<span id="cb4-5">oltp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sqlite3.<span class="ex" style="color: null;
background-color: null;
font-style: inherit;">connect</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"music.sqlite"</span>)</span>
<span id="cb4-6"></span>
<span id="cb4-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># olap - single file</span></span>
<span id="cb4-8"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> duckdb</span>
<span id="cb4-9">olap <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> duckdb.<span class="ex" style="color: null;
background-color: null;
font-style: inherit;">connect</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"olap.duckdb"</span>)</span>
<span id="cb4-10"></span>
<span id="cb4-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># doc - creates a folder</span></span>
<span id="cb4-12"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> mongita <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> MongitaClientDisk</span>
<span id="cb4-13">doc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MongitaClientDisk(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"docdb"</span>)</span>
<span id="cb4-14"></span>
<span id="cb4-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># graph - creates a folder</span></span>
<span id="cb4-16"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> kuzu</span>
<span id="cb4-17">g <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> kuzu.Database(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"graphdb"</span>)</span>
<span id="cb4-18">graph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> kuzu.Connection(g)</span>
<span id="cb4-19"></span>
<span id="cb4-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># vector - creates a folder</span></span>
<span id="cb4-21"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> lancedb</span>
<span id="cb4-22">vec <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lancedb.<span class="ex" style="color: null;
background-color: null;
font-style: inherit;">connect</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"vectordb"</span>)</span></code></pre></div>
</div>
</section>
<section id="oltp-via-sqlite" class="level2">
<h2 class="anchored" data-anchor-id="oltp-via-sqlite">OLTP via SQLite</h2>
<p>I won’t spend much time on SQlite and python, as there are plenty of tutorials on the web. Below will demonstrate how to interface with sqlite via Python and pandas.</p>
<div id="cell-10" class="cell" data-outputid="52bfc6b0-e84c-419f-fb62-ed9c056bb579" data-execution_count="10">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># execute a query to show the tables</span></span>
<span id="cb5-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># list tables in sqlite database</span></span>
<span id="cb5-3">oltp.execute(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SELECT name FROM sqlite_master WHERE type='table';"</span>).fetchall()</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="10">
<pre><code>[('Album',),
 ('Artist',),
 ('Customer',),
 ('Employee',),
 ('Genre',),
 ('Invoice',),
 ('InvoiceLine',),
 ('MediaType',),
 ('Playlist',),
 ('PlaylistTrack',),
 ('Track',)]</code></pre>
</div>
</div>
<div id="cell-11" class="cell" data-outputid="91ddf424-c774-43c7-ec2e-87b29c39fe22" data-execution_count="11">
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># of course, we can query data directly into pandas</span></span>
<span id="cb7-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># join the artists and albumns</span></span>
<span id="cb7-3">sql <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb7-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">select a.*,</span></span>
<span id="cb7-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">       b.albumid,</span></span>
<span id="cb7-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">       b.title</span></span>
<span id="cb7-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">from artist a</span></span>
<span id="cb7-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">join album b</span></span>
<span id="cb7-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">on a.artistid = b.artistid</span></span>
<span id="cb7-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb7-11">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_sql(sql, oltp)</span>
<span id="cb7-12">df.head(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb7-13"></span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="11">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">ArtistId</th>
<th data-quarto-table-cell-role="th">Name</th>
<th data-quarto-table-cell-role="th">AlbumId</th>
<th data-quarto-table-cell-role="th">Title</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">0</td>
<td>1</td>
<td>AC/DC</td>
<td>1</td>
<td>For Those About To Rock We Salute You</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">1</td>
<td>2</td>
<td>Accept</td>
<td>2</td>
<td>Balls to the Wall</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">2</td>
<td>2</td>
<td>Accept</td>
<td>3</td>
<td>Restless and Wild</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<div id="cell-12" class="cell" data-execution_count="12">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># create a silly table</span></span>
<span id="cb8-2">create_table_query <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb8-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">CREATE TABLE IF NOT EXISTS applicants (</span></span>
<span id="cb8-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    id INTEGER PRIMARY KEY,</span></span>
<span id="cb8-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    name TEXT,</span></span>
<span id="cb8-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    position TEXT,</span></span>
<span id="cb8-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    salary REAL</span></span>
<span id="cb8-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">)</span></span>
<span id="cb8-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb8-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Execute the query</span></span>
<span id="cb8-11">oltp.execute(create_table_query)</span>
<span id="cb8-12">oltp.commit()</span></code></pre></div>
</div>
<p>Ensure the table exists.</p>
<div id="cell-14" class="cell" data-execution_count="13">
<div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">oltp.execute(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SELECT name FROM sqlite_master WHERE type='table';"</span>).fetchall()</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="13">
<pre><code>[('Album',),
 ('Artist',),
 ('Customer',),
 ('Employee',),
 ('Genre',),
 ('Invoice',),
 ('InvoiceLine',),
 ('MediaType',),
 ('Playlist',),
 ('PlaylistTrack',),
 ('Track',),
 ('applicants',)]</code></pre>
</div>
</div>
<div id="cell-15" class="cell" data-execution_count="14">
<div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># a simple dataset</span></span>
<span id="cb11-2">applicants <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb11-3">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Alice"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Manager"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">70000</span>),</span>
<span id="cb11-4">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bob"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Analyst"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50000</span>),</span>
<span id="cb11-5">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Charlie"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Clerk"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30000</span>)</span>
<span id="cb11-6">]</span>
<span id="cb11-7"></span>
<span id="cb11-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># insert into the table,</span></span>
<span id="cb11-9">insert_query <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb11-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">INSERT INTO applicants (name, position, salary) VALUES (?, ?, ?)</span></span>
<span id="cb11-11"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb11-12"></span>
<span id="cb11-13">oltp.executemany(insert_query, applicants)</span>
<span id="cb11-14">oltp.commit()</span></code></pre></div>
</div>
<div id="cell-16" class="cell" data-outputid="fbfe37dc-d5e4-40dd-c5d1-3793dcd42e22" data-execution_count="15">
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># lets grab the data with pandas</span></span>
<span id="cb12-2">pd.read_sql(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SELECT * FROM applicants"</span>, oltp)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="15">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">id</th>
<th data-quarto-table-cell-role="th">name</th>
<th data-quarto-table-cell-role="th">position</th>
<th data-quarto-table-cell-role="th">salary</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">0</td>
<td>1</td>
<td>Alice</td>
<td>Manager</td>
<td>70000.0</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">1</td>
<td>2</td>
<td>Bob</td>
<td>Analyst</td>
<td>50000.0</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">2</td>
<td>3</td>
<td>Charlie</td>
<td>Clerk</td>
<td>30000.0</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<div id="cell-17" class="cell" data-execution_count="17">
<div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># of course we can update records</span></span>
<span id="cb13-2">update_query <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb13-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">UPDATE applicants SET salary = ? WHERE name = ?</span></span>
<span id="cb13-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb13-5">oltp.execute(update_query, (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">80000</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Alice"</span>))</span>
<span id="cb13-6">oltp.commit()</span>
<span id="cb13-7"></span>
<span id="cb13-8">pd.read_sql(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SELECT * FROM applicants"</span>, oltp)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="17">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">id</th>
<th data-quarto-table-cell-role="th">name</th>
<th data-quarto-table-cell-role="th">position</th>
<th data-quarto-table-cell-role="th">salary</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">0</td>
<td>1</td>
<td>Alice</td>
<td>Manager</td>
<td>80000.0</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">1</td>
<td>2</td>
<td>Bob</td>
<td>Analyst</td>
<td>50000.0</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">2</td>
<td>3</td>
<td>Charlie</td>
<td>Clerk</td>
<td>30000.0</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<div id="cell-18" class="cell" data-outputid="400768f9-0431-4e59-8f42-a4960acb28b7" data-execution_count="18">
<div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># we can even use pandas to insert/append to the table</span></span>
<span id="cb14-2">new_data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame({</span>
<span id="cb14-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"David"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Eve"</span>],</span>
<span id="cb14-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"position"</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Engineer"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"HR"</span>],</span>
<span id="cb14-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"salary"</span>: [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">60000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">55000</span>]</span>
<span id="cb14-6">})</span>
<span id="cb14-7">new_data.to_sql(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"applicants"</span>, oltp, if_exists<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"append"</span>, index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="18">
<pre><code>2</code></pre>
</div>
</div>
<div id="cell-19" class="cell" data-outputid="4e622f04-5820-48bd-92a6-8277bd3c5c22" data-execution_count="19">
<div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># review the data</span></span>
<span id="cb16-2">applicants_table <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_sql(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SELECT * FROM applicants"</span>, oltp)</span>
<span id="cb16-3">applicants_table</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="19">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">id</th>
<th data-quarto-table-cell-role="th">name</th>
<th data-quarto-table-cell-role="th">position</th>
<th data-quarto-table-cell-role="th">salary</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">0</td>
<td>1</td>
<td>Alice</td>
<td>Manager</td>
<td>80000.0</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">1</td>
<td>2</td>
<td>Bob</td>
<td>Analyst</td>
<td>50000.0</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">2</td>
<td>3</td>
<td>Charlie</td>
<td>Clerk</td>
<td>30000.0</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">3</td>
<td>4</td>
<td>David</td>
<td>Engineer</td>
<td>60000.0</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">4</td>
<td>5</td>
<td>Eve</td>
<td>HR</td>
<td>55000.0</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
</section>
<section id="olap-via-duckdb" class="level2">
<h2 class="anchored" data-anchor-id="olap-via-duckdb">OLAP via DuckDB</h2>
<p>I am a huge fan of DuckDB.</p>
<ul>
<li>Like SQLite, it’s an in-process database. You can also use duckdb in-memory as well.</li>
<li>Incredibly performant</li>
<li>Can be used as an analytics engine or a backend to power data applications.</li>
<li>Motherduck is a cloud hosted data warehouse based on Duckdb.</li>
</ul>
<p>There are a variety of applications. If you are interested in learning more about how DuckDB is empowering data teams, I refer you to the free resource <a href="https://motherduck.com/duckdb-book-brief/">here</a>.</p>
<div id="cell-21" class="cell" data-outputid="3ed20e29-6ed5-4a89-9390-0c299aaed354" data-execution_count="20">
<div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># we can create a table directly from pandas</span></span>
<span id="cb17-2">olap.execute(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CREATE TABLE applicants as select * from applicants_table"</span>)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="20">
<pre><code>&lt;duckdb.duckdb.DuckDBPyConnection at 0x1453148f0&gt;</code></pre>
</div>
</div>
<div id="cell-22" class="cell" data-outputid="6a35ea01-741f-4d96-fce7-83ebbb251f70" data-execution_count="21">
<div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># verify that the data exist</span></span>
<span id="cb19-2">olap.execute(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SELECT * FROM applicants"</span>).df()</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="21">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">id</th>
<th data-quarto-table-cell-role="th">name</th>
<th data-quarto-table-cell-role="th">position</th>
<th data-quarto-table-cell-role="th">salary</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">0</td>
<td>1</td>
<td>Alice</td>
<td>Manager</td>
<td>80000.0</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">1</td>
<td>2</td>
<td>Bob</td>
<td>Analyst</td>
<td>50000.0</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">2</td>
<td>3</td>
<td>Charlie</td>
<td>Clerk</td>
<td>30000.0</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">3</td>
<td>4</td>
<td>David</td>
<td>Engineer</td>
<td>60000.0</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">4</td>
<td>5</td>
<td>Eve</td>
<td>HR</td>
<td>55000.0</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<div id="cell-23" class="cell" data-execution_count="22">
<div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># create two csv files</span></span>
<span id="cb20-2">applicants_table.to_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"e1.csv"</span>, index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb20-3">applicants_table.to_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"e2.csv"</span>, index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span></code></pre></div>
</div>
<div id="cell-24" class="cell" data-outputid="957a431f-5f3a-4670-aecf-3d6bcc3bc74d" data-execution_count="23">
<div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># duckdb is amazing - scan a directory of files and return as a single dataframe</span></span>
<span id="cb21-2">combined_csvs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> olap.execute(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SELECT * FROM '*.csv'"</span>).df()</span>
<span id="cb21-3">combined_csvs</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="23">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">id</th>
<th data-quarto-table-cell-role="th">name</th>
<th data-quarto-table-cell-role="th">position</th>
<th data-quarto-table-cell-role="th">salary</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">0</td>
<td>1</td>
<td>Alice</td>
<td>Manager</td>
<td>80000.0</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">1</td>
<td>2</td>
<td>Bob</td>
<td>Analyst</td>
<td>50000.0</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">2</td>
<td>3</td>
<td>Charlie</td>
<td>Clerk</td>
<td>30000.0</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">3</td>
<td>4</td>
<td>David</td>
<td>Engineer</td>
<td>60000.0</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">4</td>
<td>5</td>
<td>Eve</td>
<td>HR</td>
<td>55000.0</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">5</td>
<td>1</td>
<td>Alice</td>
<td>Manager</td>
<td>80000.0</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">6</td>
<td>2</td>
<td>Bob</td>
<td>Analyst</td>
<td>50000.0</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">7</td>
<td>3</td>
<td>Charlie</td>
<td>Clerk</td>
<td>30000.0</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">8</td>
<td>4</td>
<td>David</td>
<td>Engineer</td>
<td>60000.0</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">9</td>
<td>5</td>
<td>Eve</td>
<td>HR</td>
<td>55000.0</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<section id="resources" class="level3">
<h3 class="anchored" data-anchor-id="resources">Resources</h3>
<ul>
<li><a href="https://motherduck.com/">MotherDuck</a></li>
<li><a href="https://duckdb.org/">DuckDB</a></li>
<li><a href="https://medium.com/@kkyon/profile-a-data-lake-built-with-aws-lambda-and-duckdb-2fc810ff9f4d">Using DuckDB on AWS Lambda</a></li>
</ul>
</section>
</section>
<section id="document-database-via-mongita" class="level2">
<h2 class="anchored" data-anchor-id="document-database-via-mongita">Document Database via Mongita</h2>
<p>From the repo, <a href="https://github.com/scottrogowski/mongita">Mongita</a> <em>is a lightweight embedded document database that implements a commonly-used subset of the MongoDB/PyMongo interface.</em></p>
<p>To demonstrate some basic data operations, I am going to use the <a href="https://www.fruityvice.com/">Fruityvice API</a>.</p>
<div id="cell-27" class="cell" data-outputid="52c37260-8793-4c31-875c-a5a6706400d2" data-execution_count="26">
<div class="sourceCode cell-code" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb22-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># get the data from the API</span></span>
<span id="cb22-2"></span>
<span id="cb22-3">resp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> requests.get(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://www.fruityvice.com/api/fruit/all"</span>)</span>
<span id="cb22-4">fruits_resp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> resp.json()</span>
<span id="cb22-5">fruits_resp[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="26">
<pre><code>{'name': 'Persimmon',
 'id': 52,
 'family': 'Ebenaceae',
 'order': 'Rosales',
 'genus': 'Diospyros',
 'nutritions': {'calories': 81,
  'fat': 0.0,
  'sugar': 18.0,
  'carbohydrates': 18.0,
  'protein': 0.0}}</code></pre>
</div>
</div>
<div id="cell-28" class="cell" data-execution_count="27">
<div class="sourceCode cell-code" id="cb24" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb24-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># from the documentation</span></span>
<span id="cb24-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># create the database within the document store</span></span>
<span id="cb24-3">fruits_db <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> doc.fruits_db</span></code></pre></div>
</div>
<div id="cell-29" class="cell" data-execution_count="28">
<div class="sourceCode cell-code" id="cb25" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb25-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the collection</span></span>
<span id="cb25-2">fruits <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> fruits_db.fruits</span></code></pre></div>
</div>
<div id="cell-30" class="cell" data-outputid="4cae1661-4612-4f9b-bbb0-83a4410caa78" data-execution_count="29">
<div class="sourceCode cell-code" id="cb26" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb26-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add documents to the fruits collection</span></span>
<span id="cb26-2">fruits.insert_many(fruits_resp)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="29">
<pre><code>&lt;mongita.results.InsertManyResult at 0x112bea0f0&gt;</code></pre>
</div>
</div>
<div id="cell-31" class="cell" data-outputid="54cb3d8d-6ad9-461c-eef5-3236337a5c28" data-execution_count="30">
<div class="sourceCode cell-code" id="cb28" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb28-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># how many documents are there - no query filter</span></span>
<span id="cb28-2">fruits.count_documents({})</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="30">
<pre><code>49</code></pre>
</div>
</div>
<div id="cell-32" class="cell" data-outputid="a5fed3f4-2700-4120-98ba-f259e2b1c65d" data-execution_count="31">
<div class="sourceCode cell-code" id="cb30" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb30-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># get one document</span></span>
<span id="cb30-2">fruits.find_one()</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="31">
<pre><code>{'name': 'Persimmon',
 'id': 52,
 'family': 'Ebenaceae',
 'order': 'Rosales',
 'genus': 'Diospyros',
 'nutritions': {'calories': 81,
  'fat': 0.0,
  'sugar': 18.0,
  'carbohydrates': 18.0,
  'protein': 0.0},
 '_id': ObjectId('677187a7635da2c06e2f038a')}</code></pre>
</div>
</div>
<div id="cell-33" class="cell" data-outputid="9db1981a-ca02-4eda-e033-87b40cd7a82b" data-execution_count="33">
<div class="sourceCode cell-code" id="cb32" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb32-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># we can query documents where the calories key is greater than 100</span></span>
<span id="cb32-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>( fruits.find({<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'nutritions.calories'</span>: {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'$gt'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>}}) )</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="33">
<pre><code>[{'name': 'Durian',
  'id': 60,
  'family': 'Malvaceae',
  'order': 'Malvales',
  'genus': 'Durio',
  'nutritions': {'calories': 147,
   'fat': 5.3,
   'sugar': 6.75,
   'carbohydrates': 27.1,
   'protein': 1.5},
  '_id': ObjectId('677187a7635da2c06e2f038f')},
 {'name': 'Avocado',
  'id': 84,
  'family': 'Lauraceae',
  'order': 'Laurales',
  'genus': 'Persea',
  'nutritions': {'calories': 160,
   'fat': 14.66,
   'sugar': 0.66,
   'carbohydrates': 8.53,
   'protein': 2.0},
  '_id': ObjectId('677187a7635da2c06e2f03ac')},
 {'name': 'Hazelnut',
  'id': 96,
  'family': 'Betulaceae',
  'order': 'Fagales',
  'genus': 'Corylus',
  'nutritions': {'calories': 628,
   'fat': 61.0,
   'sugar': 4.3,
   'carbohydrates': 17.0,
   'protein': 15.0},
  '_id': ObjectId('677187a7635da2c06e2f03b3')}]</code></pre>
</div>
</div>
<section id="resources-1" class="level3">
<h3 class="anchored" data-anchor-id="resources-1">Resources</h3>
<p>The <a href="https://github.com/scottrogowski/mongita">Github repo</a> is fantastic. Review the README. As you will see, the library is also pretty performant!</p>
</section>
</section>
<section id="graph-database-via-kuzu" class="level2">
<h2 class="anchored" data-anchor-id="graph-database-via-kuzu">Graph Database via Kuzu</h2>
<p><a href="https://docs.kuzudb.com/">Kuzu</a> is an embedded graph database. A graph databases allow us to store our data as nodes and edges, where nodes are the entities in our application, and the edges, or relationships, are how the nodes are related.</p>
<p>While I am a huge fan of <a href="https://neo4j.com/">Neo4j</a>, Kuzu provides an attractive solution that also aims to focus on graph analytics tasks.</p>
<p>For this demo, I am going to use a small subset <a href="https://grouplens.org/datasets/movielens/">Movielens</a> dataset, which was downloaded via <code>wget</code> above.</p>
<div id="cell-36" class="cell" data-outputid="ebcf82dd-84d8-4101-85a5-6ff88729532e" data-execution_count="34">
<div class="sourceCode cell-code" id="cb34" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb34-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># unzip the dataset</span></span>
<span id="cb34-2"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span> unzip ml<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>latest<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>small.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span></span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  </code></pre>
</div>
</div>
<div id="cell-37" class="cell" data-execution_count="35">
<div class="sourceCode cell-code" id="cb36" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb36-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># read in two of the datasets</span></span>
<span id="cb36-2">movies <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ml-latest-small/movies.csv"</span>)</span>
<span id="cb36-3">ratings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ml-latest-small/ratings.csv"</span>)</span></code></pre></div>
</div>
<div id="cell-38" class="cell" data-outputid="d45f5fa6-9721-45e7-c2c2-dcb5ea04e8ac" data-execution_count="36">
<div class="sourceCode cell-code" id="cb37" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb37-1">movies.sample(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="36">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">movieId</th>
<th data-quarto-table-cell-role="th">title</th>
<th data-quarto-table-cell-role="th">genres</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">2289</td>
<td>3036</td>
<td>Quest for Fire (Guerre du feu, La) (1981)</td>
<td>Adventure|Drama</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">9711</td>
<td>187717</td>
<td>Won't You Be My Neighbor? (2018)</td>
<td>Documentary</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">1753</td>
<td>2351</td>
<td>Nights of Cabiria (Notti di Cabiria, Le) (1957)</td>
<td>Drama</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<div id="cell-39" class="cell" data-outputid="98210d75-419e-4c3b-c15c-16d70bda3b30" data-execution_count="37">
<div class="sourceCode cell-code" id="cb38" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb38-1">ratings.sample(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="37">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">userId</th>
<th data-quarto-table-cell-role="th">movieId</th>
<th data-quarto-table-cell-role="th">rating</th>
<th data-quarto-table-cell-role="th">timestamp</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">23954</td>
<td>166</td>
<td>3052</td>
<td>3.5</td>
<td>1188773616</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">42564</td>
<td>288</td>
<td>2380</td>
<td>1.0</td>
<td>975692910</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">483</td>
<td>4</td>
<td>3851</td>
<td>5.0</td>
<td>986849180</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p>Kuzu can import from a variety of sources, including Pandas <a href="https://docs.kuzudb.com/import/copy-from-dataframe/">dataframes</a>. For the dataset above, we need to parse out the users into their own dataframe.</p>
<div id="cell-41" class="cell" data-outputid="f91bf823-308f-44c9-81fb-7ae817f035f8" data-execution_count="38">
<div class="sourceCode cell-code" id="cb39" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb39-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># kuzu can read from pandas dataframes</span></span>
<span id="cb39-2">user <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ratings[[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'userId'</span>]].drop_duplicates()</span>
<span id="cb39-3">user.to_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user.csv"</span>, index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb39-4"></span>
<span id="cb39-5">user.head(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="38">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">userId</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">0</td>
<td>1</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">232</td>
<td>2</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">261</td>
<td>3</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p>Unlike Neo4j, it appears that we have define Node and Edge/Relationship <em>tables</em> prior to import. We will do that below.</p>
<div id="cell-43" class="cell" data-outputid="e00ec18c-3847-4abe-e01a-30311daa9a9c" data-execution_count="39">
<div class="sourceCode cell-code" id="cb40" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb40-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># create the schema - kuzu specific</span></span>
<span id="cb40-2"></span>
<span id="cb40-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># nodes</span></span>
<span id="cb40-4">graph.execute(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CREATE NODE TABLE Movie(movieId INT64, title STRING, genres STRING, PRIMARY KEY (movieId))"</span>)</span>
<span id="cb40-5">graph.execute(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CREATE NODE TABLE User(userId INT64, PRIMARY KEY (userId))"</span>)</span>
<span id="cb40-6"></span>
<span id="cb40-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># relationships/edges</span></span>
<span id="cb40-8">graph.execute(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CREATE REL TABLE Rating(FROM User TO Movie, rating FLOAT, timestamp INT64)"</span>)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="39">
<pre><code>&lt;kuzu.query_result.QueryResult at 0x1459c6b70&gt;</code></pre>
</div>
</div>
<div id="cell-44" class="cell" data-outputid="acb31646-d510-4e46-8fbc-f8eeebce7403" data-execution_count="40">
<div class="sourceCode cell-code" id="cb42" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb42-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># import the data directly from the DataFrames</span></span>
<span id="cb42-2">graph.execute(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'COPY Movie FROM movies'</span>)</span>
<span id="cb42-3">graph.execute(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'COPY User FROM user'</span>)</span>
<span id="cb42-4">graph.execute(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'COPY Rating FROM ratings'</span>)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="40">
<pre><code>&lt;kuzu.query_result.QueryResult at 0x1471a2ed0&gt;</code></pre>
</div>
</div>
<p>Below will be a cypher query that queries the graph database via a pattern that looks at user’s rating movies, and then summarizes the data to sort the movies based on the number of ratings while also displaying the average rating, which is a property on the rating relationship.</p>
<div id="cell-46" class="cell" data-outputid="6e57aa0e-9ac4-43b8-c951-b9e1fd9066a4" data-execution_count="41">
<div class="sourceCode cell-code" id="cb44" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb44-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># query</span></span>
<span id="cb44-2">cql <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb44-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">MATCH (u:User)-[r]-&gt;(m:Movie)</span></span>
<span id="cb44-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">WITH  m.title AS title,</span></span>
<span id="cb44-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      AVG(r.rating) AS avg_rating,</span></span>
<span id="cb44-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      COUNT(r) AS num_ratings</span></span>
<span id="cb44-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ORDER BY num_ratings DESC</span></span>
<span id="cb44-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">LIMIT 10</span></span>
<span id="cb44-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">RETURN title, avg_rating, num_ratings</span></span>
<span id="cb44-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb44-11"></span>
<span id="cb44-12">graph.execute(cql).get_as_df()</span>
<span id="cb44-13"></span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="41">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">title</th>
<th data-quarto-table-cell-role="th">avg_rating</th>
<th data-quarto-table-cell-role="th">num_ratings</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">0</td>
<td>Forrest Gump (1994)</td>
<td>4.164134</td>
<td>329</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">1</td>
<td>Shawshank Redemption, The (1994)</td>
<td>4.429022</td>
<td>317</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">2</td>
<td>Pulp Fiction (1994)</td>
<td>4.197068</td>
<td>307</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">3</td>
<td>Silence of the Lambs, The (1991)</td>
<td>4.161290</td>
<td>279</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">4</td>
<td>Matrix, The (1999)</td>
<td>4.192446</td>
<td>278</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">5</td>
<td>Star Wars: Episode IV - A New Hope (1977)</td>
<td>4.231076</td>
<td>251</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">6</td>
<td>Jurassic Park (1993)</td>
<td>3.750000</td>
<td>238</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">7</td>
<td>Braveheart (1995)</td>
<td>4.031646</td>
<td>237</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">8</td>
<td>Terminator 2: Judgment Day (1991)</td>
<td>3.970982</td>
<td>224</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">9</td>
<td>Schindler's List (1993)</td>
<td>4.225000</td>
<td>220</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<section id="resources-2" class="level3">
<h3 class="anchored" data-anchor-id="resources-2">Resources</h3>
<ul>
<li>https://docs.kuzudb.com/import/copy-from-dataframe/</li>
<li><a href="https://docs.kuzudb.com/cypher/">Learn Cypher</a>, which has been widely adopated as <em>the</em> query language for graph databases.</li>
</ul>
</section>
</section>
<section id="vector-database-via-lancedb" class="level2">
<h2 class="anchored" data-anchor-id="vector-database-via-lancedb">Vector Database via lancedb</h2>
<p><a href="https://lancedb.github.io/lancedb/">LanceDB</a> is an embeddable vector database, a format that has become incredibly popular with the rise of LLMs and Generative AI.</p>
<p>Below is a search of <code>Vector Database</code> over the last 5 years, which spikes just after the release of ChatGPT in November 2022.</p>
<p><img src="https://brocktibert.com/posts/20241229-single-file-db/single_file_embedded_db_files/figure-html/cell-48-1-e0ae31cc-5603-404e-a626-16f5eb93c1a3.png" class="img-fluid"></p>
<div id="cell-49" class="cell" data-execution_count="42">
<div class="sourceCode cell-code" id="cb45" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb45-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># a synthetic dataset of text abstracts</span></span>
<span id="cb45-2"></span>
<span id="cb45-3">abstracts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb45-4">  {</span>
<span id="cb45-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"id"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"1"</span>,</span>
<span id="cb45-6">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"title"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Optimizing Transformers for Real-Time Applications"</span>,</span>
<span id="cb45-7">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"abstract"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"This paper explores optimization techniques for transformer architectures to reduce latency in real-time applications. We evaluate pruning, quantization, and hardware-specific tuning, achieving up to 40</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">% r</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">eduction in inference time without compromising accuracy."</span>,</span>
<span id="cb45-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"topic"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Transformer Optimization"</span></span>
<span id="cb45-9">  },</span>
<span id="cb45-10">  {</span>
<span id="cb45-11">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"id"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2"</span>,</span>
<span id="cb45-12">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"title"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Adversarial Robustness in Deep Neural Networks"</span>,</span>
<span id="cb45-13">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"abstract"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"We present a novel adversarial training algorithm that enhances the robustness of deep neural networks against various attack strategies. Our method outperforms baseline approaches by 15</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">% o</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">n benchmark datasets like CIFAR-10 and ImageNet."</span>,</span>
<span id="cb45-14">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"topic"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Adversarial Robustness"</span></span>
<span id="cb45-15">  },</span>
<span id="cb45-16">  {</span>
<span id="cb45-17">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"id"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"3"</span>,</span>
<span id="cb45-18">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"title"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Generative Adversarial Networks for Data Augmentation"</span>,</span>
<span id="cb45-19">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"abstract"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"This study proposes a GAN-based data augmentation technique to improve model generalization. Our experiments demonstrate significant accuracy improvements in medical imaging and natural language processing tasks."</span>,</span>
<span id="cb45-20">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"topic"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"GANs and Data Augmentation"</span></span>
<span id="cb45-21">  },</span>
<span id="cb45-22">  {</span>
<span id="cb45-23">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"id"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"4"</span>,</span>
<span id="cb45-24">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"title"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Explainability in Large Language Models"</span>,</span>
<span id="cb45-25">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"abstract"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"We introduce a framework for probing the decision-making processes of large language models, offering insights into token importance and context understanding. Case studies reveal increased transparency in summarization and classification tasks."</span>,</span>
<span id="cb45-26">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"topic"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Explainability in AI"</span></span>
<span id="cb45-27">  },</span>
<span id="cb45-28">  {</span>
<span id="cb45-29">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"id"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"5"</span>,</span>
<span id="cb45-30">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"title"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Efficient Training of Sparse Neural Networks"</span>,</span>
<span id="cb45-31">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"abstract"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Our research focuses on sparse neural networks and their potential for reducing computational overhead. Using structured pruning techniques, we achieve up to 60% parameter reduction while maintaining competitive performance."</span>,</span>
<span id="cb45-32">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"topic"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sparse Neural Networks"</span></span>
<span id="cb45-33">  }</span>
<span id="cb45-34">]</span></code></pre></div>
</div>
<div id="cell-50" class="cell" data-outputid="5091c8d9-28e3-4b7a-b190-927544d58c04" data-execution_count="43">
<div class="sourceCode cell-code" id="cb46" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb46-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># initialize an embedding model</span></span>
<span id="cb46-2">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SentenceTransformer(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"all-MiniLM-L6-v2"</span>)</span></code></pre></div>
</div>
<div id="cell-51" class="cell" data-execution_count="44">
<div class="sourceCode cell-code" id="cb47" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb47-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># apply the model the abstracts</span></span>
<span id="cb47-2"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> a <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> abstracts:</span>
<span id="cb47-3">  a[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"vector"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.encode(a[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"abstract"</span>])</span></code></pre></div>
</div>
<div id="cell-52" class="cell" data-execution_count="45">
<div class="sourceCode cell-code" id="cb48" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb48-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># create a table index</span></span>
<span id="cb48-2">tbl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> vec.create_table(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"abstracts"</span>, data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>abstracts)</span></code></pre></div>
</div>
<div id="cell-53" class="cell" data-outputid="531680c1-d6bc-441b-c13f-6ed2e3733fec" data-execution_count="46">
<div class="sourceCode cell-code" id="cb49" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb49-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># database inspection</span></span>
<span id="cb49-2">vec.table_names()</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="46">
<pre><code>['abstracts']</code></pre>
</div>
</div>
<div id="cell-54" class="cell" data-execution_count="47">
<div class="sourceCode cell-code" id="cb51" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb51-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># get the embedding for a new abstract</span></span>
<span id="cb51-2">new_abstract <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb51-3">  {</span>
<span id="cb51-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"id"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"20"</span>,</span>
<span id="cb51-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"title"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Automating Feature Engineering with Generative AI"</span>,</span>
<span id="cb51-6">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"abstract"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"We propose a generative AI framework for automating feature engineering in machine learning pipelines. Case studies demonstrate reduced development time and improved model performance."</span>,</span>
<span id="cb51-7">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"topic"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Automated Feature Engineering"</span></span>
<span id="cb51-8">  }</span>
<span id="cb51-9">]</span>
<span id="cb51-10"></span>
<span id="cb51-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## nn search</span></span>
<span id="cb51-12">embedding <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.encode(new_abstract[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"abstract"</span>])</span>
<span id="cb51-13"></span>
<span id="cb51-14"></span></code></pre></div>
</div>
<div id="cell-55" class="cell" data-outputid="251e0d6d-c236-461b-f59e-ef99dd49b995" data-execution_count="48">
<div class="sourceCode cell-code" id="cb52" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb52-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># search</span></span>
<span id="cb52-2">results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tbl.search(embedding).limit(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>).to_pandas()</span>
<span id="cb52-3">results</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="48">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">id</th>
<th data-quarto-table-cell-role="th">title</th>
<th data-quarto-table-cell-role="th">abstract</th>
<th data-quarto-table-cell-role="th">topic</th>
<th data-quarto-table-cell-role="th">vector</th>
<th data-quarto-table-cell-role="th">_distance</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">0</td>
<td>3</td>
<td>Generative Adversarial Networks for Data Augme...</td>
<td>This study proposes a GAN-based data augmentat...</td>
<td>GANs and Data Augmentation</td>
<td>[-0.05354594, -0.09120101, 0.08235424, 0.03139...</td>
<td>1.060543</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">1</td>
<td>2</td>
<td>Adversarial Robustness in Deep Neural Networks</td>
<td>We present a novel adversarial training algori...</td>
<td>Adversarial Robustness</td>
<td>[-0.09800007, -0.018522585, -0.0128599, 0.0551...</td>
<td>1.359992</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">2</td>
<td>1</td>
<td>Optimizing Transformers for Real-Time Applicat...</td>
<td>This paper explores optimization techniques fo...</td>
<td>Transformer Optimization</td>
<td>[-0.06164691, 0.058088887, -0.047632158, -0.01...</td>
<td>1.428792</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<section id="resources-3" class="level3">
<h3 class="anchored" data-anchor-id="resources-3">Resources</h3>
<ul>
<li>The <a href="https://lancedb.github.io/lancedb/concepts/vector_search/">documentation</a> is fantastic and provide a range of resources, including conceptual understanding of where Vector databases fit into the analytics ecosystem.</li>
</ul>
</section>
</section>
<section id="summary" class="level2">
<h2 class="anchored" data-anchor-id="summary">Summary</h2>
<p>Above is a brief review of how to get started with a variety of databases that can power application and analytical workloads, especially those that are targeted at embedded applications that do not require a client/server architecture.</p>
<p>It’s worth noting that there are online hosted versions of the technologies above. The list below highlights offerings that have generous free tiers, which have proven to be helpful for in-class projects.</p>
<ul>
<li><a href="https://motherduck.com/product/pricing/">OLAP: MotherDuck</a></li>
<li><a href="https://www.mongodb.com/pricing">Document: MongoDB Atlas Cloud</a></li>
<li><a href="https://neo4j.com/pricing/">Graph: Neo4j AuraDB</a></li>
<li><a href="https://www.pinecone.io/pricing/">Vector: Pinecone Serverless</a></li>
</ul>


</section>

 ]]></description>
  <category>sqlite</category>
  <category>duckdb</category>
  <category>MongoDB</category>
  <category>Neo4j</category>
  <category>Vector Database</category>
  <guid>https://brocktibert.com/posts/20241229-single-file-db/single_file_embedded_db.html</guid>
  <pubDate>Sun, 29 Dec 2024 05:00:00 GMT</pubDate>
</item>
<item>
  <title>AWS S3, Glue, Data Lakes and DuckDB</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/20241223-aws-s3-duckdb-lake/</link>
  <description><![CDATA[ 





<p>It’s increasingly clear that data teams need to lean into patterns that embrace the separation of storage and compute. This brief post below will walk through some access patterns which hopefully highlight the benefits of writing the artifacts of your data pipelines to cloud storage services like AWS S3.</p>
<p>Moreover, and while certainly not new, I want to emphasize the use of <a href="https://duckdb.org/docs/data/parquet/overview.html">DuckDB</a> as a data processing engine that drastically reduces the surface area of your data stack. Perhaps by the end you might start to question whether you should be leaning into large, bulky, provisioned warehouses like Redshift if you are an AWS shop.</p>
<p>This post will assume the following:</p>
<ul>
<li>An AWS Account</li>
<li>DuckDB and Python installed</li>
<li>Familiarity with services like S3 and basic data engineering tasks like ETL and ELT.</li>
</ul>
<section id="the-dataset" class="level2">
<h2 class="anchored" data-anchor-id="the-dataset">The Dataset</h2>
<p>For the sake of illustration, I will going to use play-by-play data files from the NHL to highlight how a data team might store and analyze these datasets outside of a full load into a “data warehouse”. Common query patterns might summarize statistics by season, or more granularly, by month or game. With this in mind, let’s jump into how we might store the data on S3 to enable the analysis of the data.</p>
<p>Let’ assume I have a bucket called <code>nhl-pbp</code>. I am using “folders” to partition the data via a number of core fields. For example, here is the S3 URI for one file.</p>
<pre><code>s3://nhl-pbp/pbp/season=2024/year=2024/month=12/day=22/gameid=2024020539/pbp.parquet</code></pre>
<ul>
<li>Think of the first folder <code>pbp</code> as the table in our database. Other tables might be players, or teams, or venue which would reside at /players, /teams and /venue accordingly.</li>
<li>Within the pbp table, we then partition the data by season</li>
<li>Within a given season, we can partition the data by calendar items like year month and day to enable granular queries such as “How many goals did Brad Marchand have in the month of December 2024?”</li>
<li>Finally, because there can be multiple games per calendar day, we add a partition for the game id itself.</li>
</ul>
<p>Let’s consider a nightly pipeline that collects the data files from <em>yesterday’s</em> games. Our pipeline might do the following:</p>
<ul>
<li>Run overnight via a schedule (e.g.&nbsp;EventBridge Scheduler)</li>
<li>Determine if there were any new games for today() - 1 day</li>
<li>If there are games, grab the JSON, perform a light transform to flatten into a tabular format, and write a parquet file to a path on S3 that conforms with the partitions.</li>
</ul>
<p>For the sake of illustration, I already have some data on S3; it’s a handful of games to keep the surface area small.</p>
</section>
<section id="querying-the-data-via-duckdb" class="level2">
<h2 class="anchored" data-anchor-id="querying-the-data-via-duckdb">Querying the Data via DuckDB</h2>
<p>Without hyperbole, I tell everyone I can in the data space about DuckDB. I bring it up in nearly every conversation. To me, DuckDB will be considered the swiss army knife for data teams if it isn’t considered that already. The benefits are many, but for this post, I will use DuckDB as a query engine, which will use S3 as a data store. If you are new to DuckDb, let that sink in for a second.</p>
<p>To get started, you will need to ensure you have the HTTPFS duckdb extension installed.</p>
<div id="68617868" class="cell" data-execution_count="1">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># import the library</span></span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> duckdb</span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># create an in memory connection</span></span>
<span id="cb2-5">db <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> duckdb.<span class="ex" style="color: null;
background-color: null;
font-style: inherit;">connect</span>()</span>
<span id="cb2-6"></span>
<span id="cb2-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ensure the extension is ready to go</span></span>
<span id="cb2-8">db.execute(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"INSTALL httpfs; LOAD httpfs;"</span>)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="19">
<pre><code>&lt;duckdb.duckdb.DuckDBPyConnection at 0x16a2e50b0&gt;</code></pre>
</div>
</div>
<p>Newer versions of duckdb have made it really simple to authenticate with Cloud storage services. In my case, I have already installed and configured my AWS profile, and the snippet below will tell duckdb to use that configuration when making calls to S3.</p>
<div id="03b8d196" class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1">sql <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb4-2"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">CREATE SECRET aws_secret (</span></span>
<span id="cb4-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    TYPE S3,</span></span>
<span id="cb4-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    PROVIDER CREDENTIAL_CHAIN</span></span>
<span id="cb4-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb4-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb4-7"></span>
<span id="cb4-8">db.execute(sql)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="20">
<pre><code>&lt;duckdb.duckdb.DuckDBPyConnection at 0x16a2e50b0&gt;</code></pre>
</div>
</div>
<p>For a basic example, below I will print out the number of plays for one specific file on S3.</p>
<div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">sql <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""" </span></span>
<span id="cb6-2"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">select count(*)</span></span>
<span id="cb6-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">from 's3://nhl-pbp/pbp/season=2023/year=2023/month=12/day=14/gameid=2023020448/pbp.parquet'</span></span>
<span id="cb6-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb6-5"></span>
<span id="cb6-6">db.execute(sql).df()</span></code></pre></div>
<div id="b54dcad4" class="cell" data-execution_count="3">
<div class="cell-output cell-output-display" data-execution_count="21">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">count_star()</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">0</td>
<td>331</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p>While this hopefully highlights the power, simplicity, and speed of duckdb, this is a simple example. The true power comes when we want to use S3 as our data store.</p>
<p>Below we are counting the number of rows from all of the games in the month of December across all years.</p>
<div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">sql <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb7-2"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">SELECT COUNT(*)</span></span>
<span id="cb7-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">FROM read_parquet(</span></span>
<span id="cb7-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">  's3://nhl-pbp/pbp/season=*/year=*/month=12/day=*/gameid=*/pbp.parquet'</span></span>
<span id="cb7-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb7-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb7-7"></span>
<span id="cb7-8">db.execute(sql).df()</span></code></pre></div>
<div id="c1c97431" class="cell" data-execution_count="4">
<div class="cell-output cell-output-display" data-execution_count="22">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">count_star()</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">0</td>
<td>1575</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p>You may have noticed above that we are using the <code>read_parquet</code> function as way to scan for files that match our path definition. DuckDB is smart enough to review the files on S3 and calculate the results which span multiple parquet files on S3. How cool is that?</p>
<p>If you want to calculate how many unique games are being evaluated, it’s as simple as:</p>
<div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1">sql <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb8-2"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">SELECT COUNT(distinct gameid)</span></span>
<span id="cb8-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">FROM read_parquet(</span></span>
<span id="cb8-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">  's3://nhl-pbp/pbp/season=*/year=*/month=12/day=*/gameid=*/pbp.parquet'</span></span>
<span id="cb8-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">);</span></span>
<span id="cb8-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb8-7"></span>
<span id="cb8-8">db.execute(sql).df()</span></code></pre></div>
<div id="12254251" class="cell" data-execution_count="5">
<div class="cell-output cell-output-display" data-execution_count="23">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">count(DISTINCT gameid)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">0</td>
<td>4</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p>It’s not too hard to imagine the type of data apps you could build by simplying writing parquet files to S3 and using DuckDB as the query engine that sits in between the the frontend and the S3 backend!</p>
</section>
<section id="aws-glue-data-catalog-and-athena" class="level2">
<h2 class="anchored" data-anchor-id="aws-glue-data-catalog-and-athena">AWS Glue Data Catalog and Athena</h2>
<p>AWS Glue includes a number of execellent services to empower data teams. One central service is AWS Glue Data Catalog, which allows us to formalize a “Database” and tables over the data stored in S3. There are some tradeoffs as it relates to DuckDB above. While it’s entirely possible to programmatically configure the steps below, I am going to use the console for sake of demonstration.</p>
<section id="step-1-create-a-database" class="level3">
<h3 class="anchored" data-anchor-id="step-1-create-a-database">Step 1: Create a Database</h3>
<p>After navigating to AWS Glue, select Databases from the left hand menu. Click Add Database.</p>
<p><img src="https://brocktibert.com/posts/20241223-aws-s3-duckdb-lake/20241248-154830.png" class="img-fluid"></p>
<p>Give the database a name, and click Create Database. In my case, I am calling it <code>nhl-blog-example</code>.</p>
<div class="callout callout-style-default callout-tip callout-titled" title="Glue? Database?">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Glue? Database?
</div>
</div>
<div class="callout-body-container callout-body">
<p>AWS Glue is a fully managed ETL (Extract, Transform, Load) service that simplifies data integration by enabling data preparation, transformation, and cataloging at scale. A <strong>Glue Database</strong> acts as a logical container in the Glue Data Catalog, organizing metadata about data sources, such as schemas and table definitions, to support querying and processing. Glue Tables represent structured metadata about datasets, defining their schema and location, which allows tools like Athena or Spark to interact with data seamlessly in storage, such as S3.</p>
</div>
</div>
</section>
<section id="step-2-create-a-crawler" class="level3">
<h3 class="anchored" data-anchor-id="step-2-create-a-crawler">Step 2: Create a Crawler</h3>
<p>From the left-hand menu, select Crawler &gt; Create Crawler.</p>
<p>For the unique name, I calling this <code>nhl-blog-crawler</code>. Click next. For step 2, we haven’t yet mapped data (keep that option selected) from our source into Glue, and our source is s3. Let’s get that setup now.</p>
<p>Below I am configuring the crawler for the <code>pbp</code> folder in the bucket.</p>
<p><img src="https://brocktibert.com/posts/20241223-aws-s3-duckdb-lake/20241253-155351.png" class="img-fluid"></p>
<p>The Step 2 screen should look similar to below. Click next.</p>
<p><img src="https://brocktibert.com/posts/20241223-aws-s3-duckdb-lake/20241254-155441.png" class="img-fluid"></p>
<p>If you do not already have an IAM Service role, let AWS create and select that for you before moving onto the next screen.</p>
<p><img src="https://brocktibert.com/posts/20241223-aws-s3-duckdb-lake/20241255-155550.png" class="img-fluid"></p>
<p>On step 4, select the database we created above. Under Advanced Options, select the first checkbox to Create a Single Schema for each S3 path.</p>
<p><img src="https://brocktibert.com/posts/20241223-aws-s3-duckdb-lake/20241203-160344.png" class="img-fluid"></p>
<p>There are a host of options we can configure here, but they are outside the scope for this post.</p>
<p>Finally, on the last page, select Create Crawler.</p>
</section>
<section id="step-3-run-the-crawler" class="level3">
<h3 class="anchored" data-anchor-id="step-3-run-the-crawler">Step 3: Run the Crawler</h3>
<p>After completing above, you should be on a screen where you can create an ondemand run of the Crawler. We will do that now. Select Run Crawler.</p>
<p><img src="https://brocktibert.com/posts/20241223-aws-s3-duckdb-lake/20241257-155758.png" class="img-fluid"></p>
<p>In my case, this will only take a minute or so since I only have a handful of files on S3.</p>
<p>Once it completes, select Databases from the left under Data Catalog.</p>
<p><img src="https://brocktibert.com/posts/20241223-aws-s3-duckdb-lake/20241204-160452.png" class="img-fluid"></p>
<p>We now have created a table of data in Glue based on the parquet files in S3. Selecting the table name &gt; Partitions highlights how the crawler used the S3 folders to partition our data.</p>
<p><img src="https://brocktibert.com/posts/20241223-aws-s3-duckdb-lake/20241205-160547.png" class="img-fluid"></p>
</section>
<section id="step-4-query-with-athena" class="level3">
<h3 class="anchored" data-anchor-id="step-4-query-with-athena">Step 4: Query with Athena</h3>
<p>Now we can use Athena to query our data via the Glue Data Catalog backed by data in S3. Search for Athena in the search bar on the web console.</p>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Important
</div>
</div>
<div class="callout-body-container callout-body">
<p>I already have Athena setup. If this is your first time using the service, you may need to select an S3 location for the Athena output.</p>
</div>
</div>
<p>From the left hand side, you can see that we have our Data Catalog, the database within the catalog, and a query editor to execute queries against S3.</p>
<p><img src="https://brocktibert.com/posts/20241223-aws-s3-duckdb-lake/20241209-160925.png" class="img-fluid"></p>
<p>This is the query I will run</p>
<div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode sql code-with-copy"><code class="sourceCode sql"><span id="cb9-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">select</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">year</span>, <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">month</span>, <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">day</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count</span>(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">as</span> total <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">from</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"AwsDataCatalog"</span>.<span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"nhl-blog-example"</span>.<span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"pbp"</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">group</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">by</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>;</span></code></pre></div>
<p>This returns</p>
<p><img src="https://brocktibert.com/posts/20241223-aws-s3-duckdb-lake/20241210-161028.png" class="img-fluid"></p>
</section>
</section>
<section id="adding-new-data" class="level2">
<h2 class="anchored" data-anchor-id="adding-new-data">Adding new Data</h2>
<p>Let’s step back, and imagine that our nightly service has added data to S3. To simulate this, I am going to add a new play-by-play parquet file for a single game. To add context, this will be for the 21st of December 2024. The S3 URI is below.</p>
<pre><code>s3://nhl-pbp/pbp/season=2024/year=2024/month=12/day=21/gameid=2024020539/pbp.parquet</code></pre>
<p>To demonstrate the final point of this post, let’s hop back to Athena and re-run the query.</p>
<p>We get the same results. This happens because Athena is using the Glue Data Catalog to support the queries against our table, and because above was a <strong>new</strong> partition for the data lake, our Data Catalog, and as a result Athena, are unaware to look for that file.</p>
<p>To emphasize our core different relative to the DuckDB example, we will run the same query.</p>
<div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1">sql <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb11-2"></span>
<span id="cb11-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">select year, month, day, count(*) as total </span></span>
<span id="cb11-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">FROM read_parquet(</span></span>
<span id="cb11-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">  's3://nhl-pbp/pbp/season=*/year=*/month=*/day=*/gameid=*/pbp.parquet'</span></span>
<span id="cb11-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">)</span></span>
<span id="cb11-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">group by 1,2,3;</span></span>
<span id="cb11-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb11-9"></span>
<span id="cb11-10">db.execute(sql).df()</span></code></pre></div>
<div id="146af3c2" class="cell" data-execution_count="6">
<div class="cell-output cell-output-display" data-execution_count="24">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">year</th>
<th data-quarto-table-cell-role="th">month</th>
<th data-quarto-table-cell-role="th">day</th>
<th data-quarto-table-cell-role="th">total</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">0</td>
<td>2024</td>
<td>12</td>
<td>21</td>
<td>323</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">1</td>
<td>2024</td>
<td>12</td>
<td>15</td>
<td>277</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">2</td>
<td>2024</td>
<td>12</td>
<td>22</td>
<td>323</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">3</td>
<td>2024</td>
<td>12</td>
<td>16</td>
<td>321</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">4</td>
<td>2023</td>
<td>12</td>
<td>14</td>
<td>331</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p>DuckDB correctly returns the results as expected.</p>
<p>What’s important to note above is that if our data pipeline adds new files to S3, DuckDB will <strong>automatically</strong> factor these files in as long as it meets the path in the <code>read_parquet</code> definition. You could consider this schema-on-read.</p>
<p>You likely are asking what we need to do to have Athena properly identify the new data added as part of our pipeline. There are ways to programmatically update the table in the Data Catalog, or, we could have our pipeline fire a new run of the crawler. For the latter, it’s important to note that you are billed for the processing time, so keep that in mind on which approach makes the most sense.</p>
</section>
<section id="summary" class="level2">
<h2 class="anchored" data-anchor-id="summary">Summary</h2>
<p>We saw above, via a simple example, that data teams can leverage services like S3 as the data store for their pipelines. Tools like DuckDB or services from AWS such as Glue’s Data Catalog and Athena remove the need to have a “warehouse-first” data strategy.</p>
<p>There isn’t always one right answer, but in future posts, we will explore the concepts of data lakes and lakehouses via Apache Iceberg in order provide alternatives to the traditional cloud data warehouse pattern.</p>


</section>

 ]]></description>
  <category>AWS</category>
  <category>Data Lake</category>
  <category>DuckDB</category>
  <category>Glue</category>
  <category>Athena</category>
  <guid>https://brocktibert.com/posts/20241223-aws-s3-duckdb-lake/</guid>
  <pubDate>Mon, 23 Dec 2024 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Calling OpenAI from within AWS Step Functions</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/20241222-aws-sfn-openai/</link>
  <description><![CDATA[ 





<p>In this short post, I will demonstrate how to authenticate and make API calls during the execution of a workflow built on AWS Step Functions. Prior to the addition of the HTTPS Endpoint state, we would need to fall back and invoke a Lambda function. While this approach still has benefits, the post below will walk you through the steps of invoking ChatGPT inside a Step Function Workflow.</p>
<div class="callout callout-style-simple callout-none no-icon">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-body-container">
<p>Even if you pay the $20 for the web version of ChatGPT, the API product is separate. Below assumes you have an API account and know how to create an API key.</p>
</div>
</div>
</div>
<section id="step-1-eventbridge-connection" class="level2">
<h2 class="anchored" data-anchor-id="step-1-eventbridge-connection">Step 1: EventBridge Connection</h2>
<p>The HTTP Endpoint State requires an <a href="https://aws.amazon.com/blogs/compute/using-api-destinations-with-amazon-eventbridge/">Eventbridge Connection</a>, which in this case, stores our API token.</p>
<p>In the web console, navigate to Eventbridge Connections. Click Create Connection.</p>
<p><img src="https://brocktibert.com/posts/20241222-aws-sfn-openai/20241254-115425.png" class="img-fluid"></p>
<p>On the first portion of the screen, give your connection a valid name and select Public. Optionally provide a description.</p>
<p><img src="https://brocktibert.com/posts/20241222-aws-sfn-openai/20241256-115602.png" class="img-fluid"></p>
<p>At the bottom of the page, select API Key.</p>
<p><img src="https://brocktibert.com/posts/20241222-aws-sfn-openai/20241256-115632.png" class="img-fluid"></p>
<p>This part is <strong>important!</strong>.</p>
<ul>
<li>For the <strong>API Key Name</strong>, enter <code>Authorization</code></li>
<li>for the <strong>Value</strong>, enter <code>Bearer sk-XXXXXX</code> where your sk-XXXXXX is your OpenAI API token.</li>
</ul>
<p>Click Create.</p>
<p>In my case, this is what I see <strong>after</strong> creating the connection.</p>
<p><img src="https://brocktibert.com/posts/20241222-aws-sfn-openai/20241200-120004.png" class="img-fluid"></p>
<div class="callout callout-style-default callout-tip callout-titled" title="Important">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Important
</div>
</div>
<div class="callout-body-container callout-body">
<p>There is a space between the Bearer and your API token!</p>
</div>
</div>
</section>
<section id="step-2-sfn-workflow" class="level2">
<h2 class="anchored" data-anchor-id="step-2-sfn-workflow">Step 2: SFN Workflow</h2>
<p>The first iteration will be as simple as it can get. Create a new Step Function on the AWS Console. <strong>NOTE</strong>: When prompted, select JSONata as the Query language.</p>
<p><img src="https://brocktibert.com/posts/20241222-aws-sfn-openai/20241202-120229.png" class="img-fluid"></p>
<p>After selecting the State, we will need to add a few bits of information to configure our HTTP call out.</p>
<ol type="1">
<li>API Endpoint: <code>https://api.openai.com/v1/chat/completions</code></li>
<li>Method: <code>POST</code></li>
<li>Connection: Select the Connection Name you created in the Step 1 above</li>
</ol>
<p><img src="https://brocktibert.com/posts/20241222-aws-sfn-openai/20241203-120334.png" class="img-fluid"></p>
<p>Below is a basic request body in the format OpenAI expects</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"model"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-4o"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-3">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"messages"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb1-4">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-5">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"role"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"developer"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-6">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"content"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"You are a helpful assistant."</span></span>
<span id="cb1-7">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-8">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-9">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"role"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-10">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"content"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Hello!"</span></span>
<span id="cb1-11">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb1-12">  <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb1-13"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div>
<p><img src="https://brocktibert.com/posts/20241222-aws-sfn-openai/20241206-120630.png" class="img-fluid"></p>
<p>As I noted in my previous article, beyond the low-code Workflow builder on the Web Console, we also can <strong>fully define</strong> our workflows via ASL, enabling Infrastructure-as-Code deployments.</p>
<p>Below is the JSON ASL for the workflow</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"QueryLanguage"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"JSONata"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-3">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Comment"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Dec 2024 flow updated as POC for alternatives, and also to show we don't always need to hop into a Lambda."</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-4">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"StartAt"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Call HTTPS APIs"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-5">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"States"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-6">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Call HTTPS APIs"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-7">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Task"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-8">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Resource"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"arn:aws:states:::http:invoke"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-9">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Arguments"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-10">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"ApiEndpoint"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://api.openai.com/v1/chat/completions"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-11">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Method"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"POST"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-12">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"InvocationConfig"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-13">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"ConnectionArn"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"arn:aws:events:us-east-1:00000000000000:connection/OpenaiDec24/6a2df29c-c945-4e9f-a446-9957ee9de7e1"</span></span>
<span id="cb2-14">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb2-15">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"RequestBody"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-16">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"model"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-4o"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-17">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"messages"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb2-18">            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-19">              <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"role"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"developer"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-20">              <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"content"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"You are a helpful assistant."</span></span>
<span id="cb2-21">            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-22">            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-23">              <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"role"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-24">              <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"content"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Hello!"</span></span>
<span id="cb2-25">            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-26">          <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb2-27">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-28">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb2-29">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Retry"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb2-30">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-31">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"ErrorEquals"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb2-32">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"States.ALL"</span></span>
<span id="cb2-33">          <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-34">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"BackoffRate"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-35">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"IntervalSeconds"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-36">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"MaxAttempts"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-37">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"JitterStrategy"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"FULL"</span></span>
<span id="cb2-38">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-39">      <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-40">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"End"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">true</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-41">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Comment"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"- Authorization is the key, Bearer &lt;TOKEN&gt; is the value via event Bridge connection"</span></span>
<span id="cb2-42">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-43">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-44"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div>
<div class="callout callout-style-default callout-tip callout-titled" title="IaC">
<div class="callout-header d-flex align-content-center" data-bs-toggle="collapse" data-bs-target=".callout-2-contents" aria-controls="callout-2" aria-expanded="true" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
IaC
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-2" class="callout-2-contents callout-collapse collapse show">
<div class="callout-body-container callout-body">
<p>You can copy the JSON ASL above and paste this into the Workflow editor on the Console. Select the <strong>Code</strong> pill at the top of the screen, and on the left hand side, paste the definition above and navigate back to the Design tab.</p>
<p><img src="https://brocktibert.com/posts/20241222-aws-sfn-openai/20241209-120956.png" class="img-fluid"></p>
</div>
</div>
</div>
</section>
<section id="step-3-execute-the-workflow" class="level2">
<h2 class="anchored" data-anchor-id="step-3-execute-the-workflow">Step 3: Execute the Workflow</h2>
<p>If you haven’t done so, we need to rename and save the Workflow. Click the Config tab, rename the workflow, select your Execution Role which allows HTTP invocations, and click save.</p>
<div class="callout callout-style-default callout-tip callout-titled" title="IAM">
<div class="callout-header d-flex align-content-center" data-bs-toggle="collapse" data-bs-target=".callout-3-contents" aria-controls="callout-3" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
IAM
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-3" class="callout-3-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p>You can let the Console create the IAM Execution role for you automatically. The benefit is that AWS will ensure you have the necessary permissions. If you are new to IAM configuration, this is a great feature and one that can help you learn the necessary elements. To review the role created, navigate to IAM &gt; Roles and search the Execution Role created by AWS for you. Clicking into role will allow you to review your permissions.</p>
<p><img src="https://brocktibert.com/posts/20241222-aws-sfn-openai/20241214-121428.png" class="img-fluid"></p>
</div>
</div>
</div>
<p>With our worfklow defined and saved, we can now test the Execution. I am <strong>not</strong> going to define a Payload. Leave this screen blank and click Start Execution.</p>
<p><img src="https://brocktibert.com/posts/20241222-aws-sfn-openai/20241215-121542.png" class="img-fluid"></p>
<p>In the graph view, if everything is configured correctly (i.e.&nbsp;the Eventbridge Connection is setup as outlined above), you should see a successful execution.</p>
<p><img src="https://brocktibert.com/posts/20241222-aws-sfn-openai/20241216-121644.png" class="img-fluid"></p>
<p>For the Task State Output, you should see a <code>RequestBody</code> key, with the data sent back from the API call out to ChatGPT.</p>
<p>That’s it. In my case, ChatGPT responded with <code>Hello! How can I help you today?</code>.</p>
</section>
<section id="optional-variables-and-jsonata" class="level2">
<h2 class="anchored" data-anchor-id="optional-variables-and-jsonata">Optional: Variables and JSONata</h2>
<p>Above was a simple example to ensure everything is wire up correctly in order to call out to OpenAI APIs from within our Step Function workflow. Below will demonstrate how we can dynamically pass in the prompt to OpenAI. As I did with the previous post, I am going to use a Pass state to enter the prompt, and then refer to this variable within the defition of the request body.</p>
<p>The updated workflow will make use of Variables and JSONata to refer to the prompt defined the Pass state.</p>
<p><img src="https://brocktibert.com/posts/20241222-aws-sfn-openai/20241226-122641.png" class="img-fluid"></p>
<p>Below is the ASL definition of the updated workflow</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"QueryLanguage"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"JSONata"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-3">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Comment"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Dec 2024 flow updated as POC for alternatives, and also to show we don't always need to hop into a Lambda."</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-4">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"StartAt"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-5">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"States"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-6">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-7">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-8">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Next"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Call HTTPS APIs"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-9">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Assign"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-10">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"prompt"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"How many wins do the Boston Bruisn have this season?"</span></span>
<span id="cb3-11">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb3-12">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb3-13">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Call HTTPS APIs"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-14">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Task"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-15">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Resource"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"arn:aws:states:::http:invoke"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-16">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Arguments"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-17">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"ApiEndpoint"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://api.openai.com/v1/chat/completions"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-18">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Method"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"POST"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-19">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"InvocationConfig"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-20">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"ConnectionArn"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"arn:aws:events:us-east-1:000000000000:connection/OpenaiDec24/6a2df29c-c945-4e9f-a446-9957ee9de7e1"</span></span>
<span id="cb3-21">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb3-22">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"RequestBody"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-23">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"model"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-4o"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-24">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"messages"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb3-25">            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-26">              <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"role"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"developer"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-27">              <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"content"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"You are a helpful assistant."</span></span>
<span id="cb3-28">            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-29">            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-30">              <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"role"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-31">              <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"content"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"{% $prompt %}"</span></span>
<span id="cb3-32">            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb3-33">          <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb3-34">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb3-35">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb3-36">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Retry"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb3-37">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-38">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"ErrorEquals"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb3-39">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"States.ALL"</span></span>
<span id="cb3-40">          <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-41">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"BackoffRate"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-42">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"IntervalSeconds"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-43">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"MaxAttempts"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-44">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"JitterStrategy"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"FULL"</span></span>
<span id="cb3-45">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb3-46">      <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-47">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"End"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">true</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-48">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Comment"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"- Authorization is the key, Bearer &lt;TOKEN&gt; is the value via event Bridge connection"</span></span>
<span id="cb3-49">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb3-50">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb3-51"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div>
<p>A few notes on above:</p>
<ul>
<li>We define a variable called <code>prompt</code> in the Pass state.</li>
<li>In the defintion of the Request body, you can see that we are using JSONata to dynamically pull in the prompt via the variable defined above via <code>"content": "{% $prompt %}"</code></li>
</ul>
<p>After a successfull execution, we see that ChatGPT acknowledges that the model does not have the ability to recall current information.</p>
<p><img src="https://brocktibert.com/posts/20241222-aws-sfn-openai/20241229-122904.png" class="img-fluid"></p>
</section>
<section id="optional-input-prompt" class="level2">
<h2 class="anchored" data-anchor-id="optional-input-prompt">Optional: Input Prompt</h2>
<p>Finally, a more realistic workflow would expect that the prompt would be included in the payload that triggers the execution of the Step Function.</p>
<p>The changes:</p>
<ul>
<li>Remove the Pass state that was added above to highlight Variables</li>
<li>Include the prompt in the payload at the start of the execution.</li>
</ul>
<p>We will make use of the <a href="https://docs.aws.amazon.com/step-functions/latest/dg/transforming-data.html#transforming-reserved-variable-states">reserved variables</a>, namely the input key, to isolate the prompt passed into the Workflow Execution.</p>
<p>The ASL definition is below:</p>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"QueryLanguage"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"JSONata"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-3">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Comment"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Dec 2024 flow updated as POC for alternatives, and also to show we don't always need to hop into a Lambda."</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-4">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"StartAt"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Call HTTPS APIs"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-5">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"States"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-6">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Call HTTPS APIs"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-7">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Task"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-8">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Resource"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"arn:aws:states:::http:invoke"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-9">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Arguments"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-10">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"ApiEndpoint"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://api.openai.com/v1/chat/completions"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-11">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Method"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"POST"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-12">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"InvocationConfig"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-13">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"ConnectionArn"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"arn:aws:events:us-east-1:0000000000000:connection/OpenaiDec24/6a2df29c-c945-4e9f-a446-9957ee9de7e1"</span></span>
<span id="cb4-14">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb4-15">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"RequestBody"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-16">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"model"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-4o"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-17">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"messages"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb4-18">            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-19">              <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"role"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"developer"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-20">              <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"content"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"You are a helpful assistant."</span></span>
<span id="cb4-21">            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-22">            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-23">              <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"role"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-24">              <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"content"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"{% $states.input.prompt %}"</span></span>
<span id="cb4-25">            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb4-26">          <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb4-27">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb4-28">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb4-29">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Retry"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb4-30">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-31">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"ErrorEquals"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb4-32">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"States.ALL"</span></span>
<span id="cb4-33">          <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-34">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"BackoffRate"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-35">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"IntervalSeconds"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-36">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"MaxAttempts"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-37">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"JitterStrategy"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"FULL"</span></span>
<span id="cb4-38">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb4-39">      <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-40">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"End"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">true</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-41">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Comment"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"- Authorization is the key, Bearer &lt;TOKEN&gt; is the value via event Bridge connection"</span></span>
<span id="cb4-42">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb4-43">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb4-44"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div>
<ul>
<li>Notice that the body of our request to OpenAI was modified to refer to the prompt key from <code>states.input</code> via <code>"content": "{% $states.input.prompt %}"</code> which passes the input directly to OpenAI.</li>
</ul>
<p>Save and execute the workflow with the input shown below.</p>
<div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb5-2"> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"prompt"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Tell me a joke"</span></span>
<span id="cb5-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div>
<p><img src="https://brocktibert.com/posts/20241222-aws-sfn-openai/20241239-123936.png" class="img-fluid"></p>
<p>That’s it! We were able to pull in the input that triggered the execution into the call to OpenAI.</p>
</section>
<section id="summary" class="level2">
<h2 class="anchored" data-anchor-id="summary">Summary</h2>
<p>The post above walks you through the basics of getting started with making callouts to external APIs from inside an AWS Step Function workflow. Above, I used OpenAI’s API to make simple calls to Chat GPT, highlighting that you have options beyond AWS’s Bedrock inside your Agentic Solutions built on top of Step Functions.</p>
<p>Finally, you <strong>can</strong> call public apis that do not require user/pass by creating a EventBridge Connection with placeholder username/password. Even though the API will you want to leverage may not require a user/pass combintation, this will allow you to get around the <strong>required</strong> connection input.</p>


</section>

 ]]></description>
  <category>AWS</category>
  <category>Step Functions</category>
  <category>OpenAI</category>
  <category>API</category>
  <category>LLM</category>
  <guid>https://brocktibert.com/posts/20241222-aws-sfn-openai/</guid>
  <pubDate>Sun, 22 Dec 2024 05:00:00 GMT</pubDate>
</item>
<item>
  <title>AWS Step Functions - Variables and JSONata</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/20241220-aws-sfn-vars/</link>
  <description><![CDATA[ 





<p>I finally had a chance to review some of the newer features of AWS Step Functions. In particular, I was very excited to explore the possibilities of defining <a href="https://aws.amazon.com/blogs/compute/simplifying-developer-experience-with-variables-and-jsonata-in-aws-step-functions/">Variables</a> which are available throughout the entire execution. Moreover, JSONata appears to add an incredibly expressive syntax to expand the logic that we can apply within native states/tasks. I encourage the reader to explore some of the blog posts demonstrating the “before and after”, in which JSONata has the potential to reduce the need to hop into Lambda functions for more “advnaced” logic.</p>
<p>Below I am going to run through a simple flow that uses conditional execution. While the example below is basic, it’s not a far stretch to think about how this type of solution could be applied to create agentic workflows. More on that below.</p>
<div class="callout callout-style-default callout-note callout-titled" title="Step Functions are underrated">
<div class="callout-header d-flex align-content-center" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Step Functions are underrated
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p>I don’t think AWS Step Functions get enough attention in the data world. In a future post I will discuss the benefits of this service, but for now, a simple list:</p>
<ul>
<li>Managed Service</li>
<li>Error Handling</li>
<li>Retry logic</li>
<li>Self-loops, parallel execution, and even using child executions</li>
<li>Logging</li>
<li>Low Code builder but with full IaC support. Think: sandbox an idea in the console, and then rip out the code for deployment.</li>
</ul>
</div>
</div>
</div>
<section id="our-first-flow" class="level2">
<h2 class="anchored" data-anchor-id="our-first-flow">Our First Flow</h2>
<p>Go ahead and create a new Step Function from the AWS Console. For my purposes, I am going to use a <strong>Standard</strong> workflow, but note that in practice we very likely would be leverage Express workflows. Discussing the tradeoffs between the two is outside the scope of this article.</p>
<p>I am going to step through the logic in order to hopefully help with the intuition of Variables. If you haven’t used Step Functions in the past, below is the visual Workflow editor.</p>
<p><img src="https://brocktibert.com/posts/20241220-aws-sfn-vars/20241213-201358.png" class="img-fluid"></p>
<p>The <a href="https://docs.aws.amazon.com/step-functions/latest/dg/state-pass.html">Pass</a> state is incredibly powerful, and in this case, we can set variables and apply other logic as needed at the start of our execution.</p>
<p><img src="https://brocktibert.com/posts/20241220-aws-sfn-vars/20241216-201608.png" class="img-fluid"></p>
<p>If you select the Pass State in the canvas, and then the Variables tab, you can manually specify the variables.</p>
<p><img src="https://brocktibert.com/posts/20241220-aws-sfn-vars/20241217-201751.png" class="img-fluid"></p>
<p>From the pills at the top of the screen, you can select <strong>Config</strong> to rename the Workflow and define the execution role for the Step Function. I already have an Execution Role, but for this purpose of this post, you can let AWS create one for you.</p>
<p><img src="https://brocktibert.com/posts/20241220-aws-sfn-vars/20241219-201930.png" class="img-fluid"></p>
<p>As I noted earlier, not only can we define our workflow logic via drag-and-drop, but this solution is backed by a templating language that we can later use to package our projects via IaC. Below is the first workflow via JSON.</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"QueryLanguage"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"JSONPath"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-3">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Comment"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I can set a comment about my workflow, what is required at invocation, the task this solves, etc."</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-4">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"StartAt"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-5">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"States"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-6">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-7">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-8">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"End"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">true</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-9">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Comment"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"This will set the variables that will pass through the execution"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-10">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Assign"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb1-11">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"max_iter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-12">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"counter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb1-13">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb1-14">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb1-15">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb1-16"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div>
<p>If you are following along with this post, it’s important to note that you can copy the code above and paste the contents into the code editor in the console. Select the <strong>Code</strong> pill, and paste the workflow definition on the left.</p>
<p><img src="https://brocktibert.com/posts/20241220-aws-sfn-vars/20241222-202225.png" class="img-fluid"></p>
<p>With our workflow created, we can now review the execution logic. Click Execute, this will run the workflow. No need to enter a payload yet, just click Start Execution in the lower right.</p>
<p>We can see that the execution successfully ran, and that the variables were assigned. These will be accessible throughout the execution of this pipeline.</p>
<p><img src="https://brocktibert.com/posts/20241220-aws-sfn-vars/20241225-202515.png" class="img-fluid"></p>
</section>
<section id="second-flow" class="level2">
<h2 class="anchored" data-anchor-id="second-flow">Second Flow</h2>
<p>Let’s update the workflow to add some additional logic. As with above, you can paste the code definition directly into the console to see the changes.</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"QueryLanguage"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"JSONPath"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-3">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"StartAt"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-4">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"States"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-5">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-6">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-7">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Assign"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-8">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"variableName"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"$.states.input"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-9">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"max_iter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-10">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"counter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb2-11">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb2-12">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"QueryLanguage"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"JSONata"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-13">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Comment"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Manually creating an array for now, but ideally just say if you get to this state, we can specify an iterator as a for loop.  Might be missing something, but this expects </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\"</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">data</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\"</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;"> to iterate over, versus a for loop wih an exit (for max 3 iteratons, do this, else exit) -&gt; like training, but Agent loops is the obvious use case."</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-14">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Next"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Choice"</span></span>
<span id="cb2-15">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb2-16">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Choice"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-17">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Choice"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-18">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Choices"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb2-19">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-20">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Variable"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"$counter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-21">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"NumericLessThanPath"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"$max_iter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-22">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Next"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"update counter"</span></span>
<span id="cb2-23">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-24">      <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-25">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Default"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Success"</span></span>
<span id="cb2-26">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb2-27">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"update counter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-28">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-29">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Next"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Success"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-30">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Assign"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-31">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"counter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"{% $counter + 1 %}"</span></span>
<span id="cb2-32">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-33">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb2-34">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Success"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-35">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Succeed"</span></span>
<span id="cb2-36">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-37">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb2-38">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Comment"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Play around with the new JSONata things.  Really expressive syntax (is there a plugin/vs AI help) and the variables is huge.  Seems like just reference states.input.&lt;object&gt; and then assign variables as needed.  ONly keep what you need.  The concept of jsut passing JSON around is kinda gross, but also, JSONPath is very simple"</span></span>
<span id="cb2-39">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div>
<p>Above we have added a few additional nodes to our workflow, namely the <a href="https://docs.aws.amazon.com/step-functions/latest/dg/state-choice.html">Choice</a> which allows us to conditionally route the execution path within our workflow.</p>
<p><img src="https://brocktibert.com/posts/20241220-aws-sfn-vars/20241242-204220.png" class="img-fluid"></p>
<p>Let’s step back for a moment.</p>
<ul>
<li>We have defined two variables, <code>max_iter</code> and <code>counter</code>. The former defines the upper bound for how many times we would allow the self-loop, as I may refer to it, self-reflection, to occur. The counter variable allows us to keep track of how many times we evaluted this path.<br>
</li>
<li>For each evaluation, we increment the counter by 1.<br>
</li>
<li>When the expression counter &lt; max_iter is no longer True, the choice state will follow a different path and complete the exeuction.</li>
</ul>
<p>Save the workflow and execute the workflow. In reviewing the output, you will notice the the counter doesn’t properly update.</p>
<p><img src="https://brocktibert.com/posts/20241220-aws-sfn-vars/20241228-112803.png" class="img-fluid"></p>
<p><img src="https://brocktibert.com/posts/20241220-aws-sfn-vars/20241229-112925.png" class="img-fluid"></p>
</section>
<section id="jsonata" class="level2">
<h2 class="anchored" data-anchor-id="jsonata">JSONata</h2>
<p>Let’s update the workflow to explicilty define the use of JSONata in the <code>update counter</code> state/task.</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"QueryLanguage"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"JSONPath"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-3">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"StartAt"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-4">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"States"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-5">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-6">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-7">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Assign"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-8">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"max_iter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-9">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"counter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb3-10">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb3-11">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"QueryLanguage"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"JSONata"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-12">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Comment"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Manually creating an array for now, but ideally just say if you get to this state, we can specify an iterator as a for loop.  Might be missing something, but this expects </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\"</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">data</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\"</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;"> to iterate over, versus a for loop wih an exit (for max 3 iteratons, do this, else exit) -&gt; like training, but Agent loops is the obvious use case."</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-13">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Next"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Choice"</span></span>
<span id="cb3-14">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb3-15">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Choice"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-16">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Choice"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-17">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Choices"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb3-18">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-19">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Variable"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"$counter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-20">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"NumericLessThanPath"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"$max_iter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-21">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Next"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"update counter"</span></span>
<span id="cb3-22">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb3-23">      <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-24">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Default"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Success"</span></span>
<span id="cb3-25">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb3-26">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"update counter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-27">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-28">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Next"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Success"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-29">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Assign"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-30">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"counter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"{% $counter + 1 %}"</span></span>
<span id="cb3-31">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb3-32">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"QueryLanguage"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"JSONata"</span></span>
<span id="cb3-33">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb3-34">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Success"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-35">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Succeed"</span></span>
<span id="cb3-36">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb3-37">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb3-38">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Comment"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Play around with the new JSONata things.  Really expressive syntax (is there a plugin/vs AI help) and the variables is huge.  Seems like just reference states.input.&lt;object&gt; and then assign variables as needed.  ONly keep what you need.  The concept of jsut passing JSON around is kinda gross, but also, JSONPath is very simple"</span></span>
<span id="cb3-39"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div>
<p><img src="https://brocktibert.com/posts/20241220-aws-sfn-vars/20241257-205701.png" class="img-fluid"></p>
<p>After saving and executing the workflow, review the output. You can see above that the interface highlights that our counter is changing values, a subtle but nice touch!</p>
<p>However, you can see from the logging that the update counter task didn’t fire three times.</p>
<p><img src="https://brocktibert.com/posts/20241220-aws-sfn-vars/20241243-114310.png" class="img-fluid"></p>
</section>
<section id="the-final-edit" class="level2">
<h2 class="anchored" data-anchor-id="the-final-edit">The Final Edit</h2>
<p>Above, everything is working correctly, but we are immediately ending our workflow’s execution without going back for another evaluation pass.</p>
<p>Below we will update the workflow to go back to the Choice state in order to enable to allow our workflow to perform another pass.</p>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"QueryLanguage"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"JSONPath"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-3">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"StartAt"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-4">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"States"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-5">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-6">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-7">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Assign"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-8">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"max_iter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-9">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"counter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb4-10">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb4-11">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"QueryLanguage"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"JSONata"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-12">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Comment"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Manually creating an array for now, but ideally just say if you get to this state, we can specify an iterator as a for loop.  Might be missing something, but this expects </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\"</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">data</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\"</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;"> to iterate over, versus a for loop wih an exit (for max 3 iteratons, do this, else exit) -&gt; like training, but Agent loops is the obvious use case."</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-13">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Next"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Choice"</span></span>
<span id="cb4-14">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb4-15">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Choice"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-16">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Choice"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-17">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Choices"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb4-18">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-19">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Variable"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"$counter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-20">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"NumericLessThanPath"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"$max_iter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-21">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Next"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"update counter"</span></span>
<span id="cb4-22">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb4-23">      <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-24">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Default"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Success"</span></span>
<span id="cb4-25">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb4-26">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"update counter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-27">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pass"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-28">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Next"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Choice"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-29">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Assign"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-30">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"counter"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"{% $counter + 1 %}"</span></span>
<span id="cb4-31">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb4-32">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"QueryLanguage"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"JSONata"</span></span>
<span id="cb4-33">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb4-34">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Success"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb4-35">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Succeed"</span></span>
<span id="cb4-36">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb4-37">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb4-38">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Comment"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Play around with the new JSONata things.  Really expressive syntax (is there a plugin/vs AI help) and the variables is huge.  Seems like just reference states.input.&lt;object&gt; and then assign variables as needed.  ONly keep what you need.  The concept of jsut passing JSON around is kinda gross, but also, JSONPath is very simple"</span></span>
<span id="cb4-39"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div>
<p>Save and execute.</p>
<p><img src="https://brocktibert.com/posts/20241220-aws-sfn-vars/20241201-210152.png" class="img-fluid"></p>
<p>By looking at the events, we can now see that the update counter State/Task was invoked three times, as expected.</p>
<p><img src="https://brocktibert.com/posts/20241220-aws-sfn-vars/20241202-210259.png" class="img-fluid"></p>
</section>
<section id="summary" class="level2">
<h2 class="anchored" data-anchor-id="summary">Summary</h2>
<ul>
<li>Variables allow us to keep track of core pieces of information throughout the full execution of our workflow. There is a limit to how much data we can pass around, but the limit is quite reasonable.<br>
</li>
<li>JSONata allows us to reference these variables, and shown above, update the values conditionally as needed</li>
<li>It’s worth pointing out that you can specify if you want to use JSONata at the creation of your workflow. I created my workflow with the previous JSONPath evaluation to highlight that it is possible to mix-and-match as needed, and it’s just a matter of defining the engine within the task state.</li>
</ul>
<p>The above example is silly and a basic proof-of-concept, but let’s zoom out to why I find this new capability incredibly interesting. <a href="https://airbyte.com/data-engineering-resources/ai-agentic-workflows">Agentic workflows</a> allow us to combine multiple generative AI calls together in a system that aims to perform a given task.</p>
<p>As one example, consider an <a href="https://huggingface.co/learn/cookbook/en/llm_judge">LLM-as-a-judge</a> workflow which will use one LLM to generate output, and another to evalute the output. Above demonstrated that we could allow up to three calls to the LLM judge. When thinking about structured output, we could have the judge provide feedback for the initial agent to improve, or, if the judge “approved”, move onto new tasks in our workflow.</p>


</section>

 ]]></description>
  <category>AWS</category>
  <category>Step Functions</category>
  <category>Agentic Workflows</category>
  <guid>https://brocktibert.com/posts/20241220-aws-sfn-vars/</guid>
  <pubDate>Sat, 21 Dec 2024 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Deploy Streamlit as a Serverless App on GCP</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/20241213-streamlit-gcp/</link>
  <description><![CDATA[ 





<section id="before-you-get-started" class="level2">
<h2 class="anchored" data-anchor-id="before-you-get-started">Before you get started</h2>
<p>Below itemizes the information that will need to have setup prior to moving onto the next session below.</p>
<ul>
<li>Google Cloud Project</li>
<li>Google Cloud CLI installed and configured</li>
<li>If you are coding locally, Docker installed (Docker Desktop is the easiest path to get started)</li>
</ul>
<div class="callout callout-style-default callout-tip callout-titled" title="Cloud Shell">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Cloud Shell
</div>
</div>
<div class="callout-body-container callout-body">
<p>In my opinion, the easiest way to code against GCP is with Cloud Shell Terminal, which you can find at <a href="shell.cloud.google.com">shell.cloud.google.com</a></p>
</div>
</div>
<p>Below assumes a *nix environment or Cloud Shell. If you are coding locally on Windows you will need to modify the commands as necessary. In my case, this is a simple POC application that uses Secret Manager to access a Token that is used to create a connection to a <a href="https://motherduck.com/">Motherduck</a> cloud data warehouse.</p>
<p>The app is a simple POC to help students learn streamlit, but the focus of this post is to walk you through the steps to deploy a <strong>serverless</strong> Streamlit app on GCP.</p>
</section>
<section id="setup" class="level2">
<h2 class="anchored" data-anchor-id="setup">Setup</h2>
<p>First, you will need to ensure strealmit is installed:</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install streamlit </span></code></pre></div>
<p>Now, let’s setup the project structure:</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mkdir</span> streamlit</span>
<span id="cb2-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> streamlit</span></code></pre></div>
<p>Finally, make sure you have enabled APIs on GCP</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> services enable run.googleapis.com</span>
<span id="cb3-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> services enable cloudbuild.googleapis.com</span>
<span id="cb3-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> services enable artifactregistry.googleapis.com</span>
<span id="cb3-4"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> services enable secretmanager.googleapis.com</span>
<span id="cb3-5"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> services enable iam.googleapis.com</span></code></pre></div>
</section>
<section id="files" class="level2">
<h2 class="anchored" data-anchor-id="files">Files</h2>
<p>Below we are going to define the files for our application.</p>
<section id="requirements" class="level3">
<h3 class="anchored" data-anchor-id="requirements">Requirements</h3>
<p>Modify this for your project and application’s needs.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>streamlit/requirements.txt</strong></pre>
</div>
<div class="sourceCode" id="cb4" data-filename="streamlit/requirements.txt" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">streamlit</span> </span>
<span id="cb4-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">matplotlib</span></span>
<span id="cb4-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">networkx</span></span>
<span id="cb4-4"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">google-cloud-secret-manager</span></span>
<span id="cb4-5"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">numpy</span></span>
<span id="cb4-6"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pandas</span></span>
<span id="cb4-7"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">duckdb</span></span></code></pre></div>
</div>
</section>
<section id="streamlit-app" class="level3">
<h3 class="anchored" data-anchor-id="streamlit-app">Streamlit App</h3>
<p>Below is my application. Modify for your needs.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>streamlit/app.py</strong></pre>
</div>
<div class="sourceCode" id="cb5" data-filename="streamlit/app.py" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> streamlit <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> st</span>
<span id="cb5-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> duckdb</span>
<span id="cb5-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb5-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> google.cloud <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> secretmanager</span>
<span id="cb5-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb5-6"></span>
<span id="cb5-7"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> itertools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> combinations</span>
<span id="cb5-8"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> networkx <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> nx</span>
<span id="cb5-9"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> collections <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Counter</span>
<span id="cb5-10"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb5-11"></span>
<span id="cb5-12">project_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'_your project id_'</span></span>
<span id="cb5-13">secret_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mother_duck'</span>   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#&lt;---------- this is the name of the Google Cloud Secret</span></span>
<span id="cb5-14">version_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'latest'</span></span>
<span id="cb5-15"></span>
<span id="cb5-16">db <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'awsblogs'</span></span>
<span id="cb5-17">schema <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"stage"</span></span>
<span id="cb5-18">db_schema <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>db<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">.</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>schema<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb5-19"></span>
<span id="cb5-20">sm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> secretmanager.SecretManagerServiceClient()</span>
<span id="cb5-21"></span>
<span id="cb5-22">name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"projects/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>project_id<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/secrets/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>secret_id<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/versions/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>version_id<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb5-23"></span>
<span id="cb5-24">response <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sm.access_secret_version(request<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>: name})</span>
<span id="cb5-25">md_token <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> response.payload.data.decode(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"UTF-8"</span>)</span>
<span id="cb5-26"></span>
<span id="cb5-27">md <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> duckdb.<span class="ex" style="color: null;
background-color: null;
font-style: inherit;">connect</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'md:?motherduck_token=</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>md_token<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>) </span>
<span id="cb5-28"></span>
<span id="cb5-29"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">############################################ Streamlit App</span></span>
<span id="cb5-30"></span>
<span id="cb5-31">st.set_page_config(page_title<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"My Fancy Streamlit App"</span>, layout<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"wide"</span>)</span>
<span id="cb5-32"></span>
<span id="cb5-33">sql <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb5-34"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">select </span></span>
<span id="cb5-35"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    min(published) as min,</span></span>
<span id="cb5-36"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    max(published) as max,</span></span>
<span id="cb5-37"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">from</span></span>
<span id="cb5-38"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    awsblogs.stage.posts</span></span>
<span id="cb5-39"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb5-40">date_range <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> md.sql(sql).df()</span>
<span id="cb5-41">start_date <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> date_range[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'min'</span>].to_list()[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb5-42">end_date <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> date_range[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'max'</span>].to_list()[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb5-43"></span>
<span id="cb5-44"></span>
<span id="cb5-45">st.title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Streamlit - Example"</span>)</span>
<span id="cb5-46">st.subheader(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"To demonstrate Streamlit concepts - It's just a python script!"</span>)</span>
<span id="cb5-47"></span>
<span id="cb5-48"></span>
<span id="cb5-49">st.sidebar.header(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Filters"</span>)</span>
<span id="cb5-50">st.sidebar.markdown(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"One option is to use sidebars for inputs"</span>)</span>
<span id="cb5-51">author_filter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> st.sidebar.text_input(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Search by Author"</span>)</span>
<span id="cb5-52">date_filter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> st.sidebar.date_input(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Post Start Date"</span>, (start_date, end_date))</span>
<span id="cb5-53"></span>
<span id="cb5-54">st.sidebar.button(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A button to control inputs"</span>)</span>
<span id="cb5-55"></span>
<span id="cb5-56">st.sidebar.file_uploader(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Users can upload files that your app analyzes!"</span>)</span>
<span id="cb5-57"></span>
<span id="cb5-58">st.sidebar.markdown(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"These controls are not wired up to control data, just highlighting you have a lot of control!"</span>)</span>
<span id="cb5-59"></span>
<span id="cb5-60"></span>
<span id="cb5-61"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">############ A simple line plot</span></span>
<span id="cb5-62"></span>
<span id="cb5-63">num_days <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">365</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Number of days for the time series</span></span>
<span id="cb5-64">start_date <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'2023-01-01'</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Start date</span></span>
<span id="cb5-65"></span>
<span id="cb5-66">date_range <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.date_range(start<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>start_date, periods<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>num_days, freq<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'D'</span>)</span>
<span id="cb5-67"></span>
<span id="cb5-68">np.random.seed(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># For reproducibility</span></span>
<span id="cb5-69">values <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.randint(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">150</span>, size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>num_days)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Example: random sales values between 50 and 150</span></span>
<span id="cb5-70"></span>
<span id="cb5-71">time_series_data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame({</span>
<span id="cb5-72">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'date'</span>: date_range,</span>
<span id="cb5-73">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'value'</span>: values</span>
<span id="cb5-74">})</span>
<span id="cb5-75"></span>
<span id="cb5-76">st.line_chart(time_series_data, x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"date"</span>, y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"value"</span>)</span>
<span id="cb5-77"></span>
<span id="cb5-78"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">############ Graph of co-association of tags, a touch forward looking</span></span>
<span id="cb5-79">st.markdown(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"---"</span>)</span>
<span id="cb5-80"></span>
<span id="cb5-81">pt_sql <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb5-82"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">select post_id, term from awsblogs.stage.tags</span></span>
<span id="cb5-83"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb5-84">pt_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> md.sql(pt_sql).df()</span>
<span id="cb5-85"></span>
<span id="cb5-86">st.markdown(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"### You _can_ show data tables"</span>)</span>
<span id="cb5-87">st.dataframe(pt_df)</span>
<span id="cb5-88"></span>
<span id="cb5-89">st.markdown(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"### A static network graph"</span>)</span>
<span id="cb5-90">st.markdown(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"We can think of relationships as a graph"</span>)</span>
<span id="cb5-91"></span>
<span id="cb5-92">cotag_pairs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb5-93"></span>
<span id="cb5-94"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> _, group <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> pt_df.groupby(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'post_id'</span>):</span>
<span id="cb5-95">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Get the unique list of authors for each post</span></span>
<span id="cb5-96">    terms <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> group[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'term'</span>].unique()</span>
<span id="cb5-97">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate all possible pairs of co-authors for this post</span></span>
<span id="cb5-98">    pairs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> combinations(terms, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb5-99">    cotag_pairs.extend(pairs)</span>
<span id="cb5-100"></span>
<span id="cb5-101">cotag_counter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Counter(cotag_pairs)</span>
<span id="cb5-102"></span>
<span id="cb5-103">G <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.Graph()</span>
<span id="cb5-104"></span>
<span id="cb5-105"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (term1, term2), weight <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> cotag_counter.items():</span>
<span id="cb5-106">    G.add_edge(term1, term2, weight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>weight)</span>
<span id="cb5-107"></span>
<span id="cb5-108">degree_centrality <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.degree_centrality(G)</span>
<span id="cb5-109"></span>
<span id="cb5-110">node_sizes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> degree_centrality[node] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> node <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.nodes()]</span>
<span id="cb5-111"></span>
<span id="cb5-112">edge_weights <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [G[u][v][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'weight'</span>] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> u, v <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.edges()]</span>
<span id="cb5-113"></span>
<span id="cb5-114">pos <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.spring_layout(G, k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>, seed<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>)  </span>
<span id="cb5-115"></span>
<span id="cb5-116">fig <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>))</span>
<span id="cb5-117"></span>
<span id="cb5-118">nx.draw_networkx_nodes(G, pos, node_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>node_sizes, node_color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'skyblue'</span>, alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7</span>)</span>
<span id="cb5-119"></span>
<span id="cb5-120">nx.draw_networkx_edges(G, pos, width<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>edge_weights, alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, edge_color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'gray'</span>)</span>
<span id="cb5-121"></span>
<span id="cb5-122">plt.title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Tag Graph"</span>)</span>
<span id="cb5-123">st.pyplot(fig)</span>
<span id="cb5-124"></span>
<span id="cb5-125"></span>
<span id="cb5-126"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">############ There are some chat support features, more coming</span></span>
<span id="cb5-127"></span>
<span id="cb5-128">st.markdown(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"---"</span>)</span>
<span id="cb5-129"></span>
<span id="cb5-130">st.markdown(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"### There are even some chat features - more coming on the roadmap."</span>)</span>
<span id="cb5-131"></span>
<span id="cb5-132">prompt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> st.chat_input(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Say something"</span>)</span>
<span id="cb5-133"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> prompt:</span>
<span id="cb5-134">    st.write(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"User has sent the following prompt: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>prompt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div>
</div>
</section>
<section id="dockerfile" class="level3">
<h3 class="anchored" data-anchor-id="dockerfile">Dockerfile</h3>
<p>Our solution uses a Docker container.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>streamlit/Dockerfile</strong></pre>
</div>
<div class="sourceCode" id="cb6" data-filename="streamlit/Dockerfile" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb6-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">FROM</span> python:3.11-slim</span>
<span id="cb6-2"></span>
<span id="cb6-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">WORKDIR</span> /app</span>
<span id="cb6-4"></span>
<span id="cb6-5"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">COPY</span> . /app</span>
<span id="cb6-6"></span>
<span id="cb6-7"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">RUN</span> pip install <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--no-cache-dir</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-r</span> requirements.txt</span>
<span id="cb6-8"></span>
<span id="cb6-9"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">EXPOSE</span> 8501</span>
<span id="cb6-10"></span>
<span id="cb6-11"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">ENV</span> STREAMLIT_SERVER_HEADLESS=true</span>
<span id="cb6-12"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">ENV</span> STREAMLIT_SERVER_ADDRESS=0.0.0.0</span>
<span id="cb6-13"></span>
<span id="cb6-14"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">CMD</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"streamlit"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"run"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"app.py"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--server.port"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"8080"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--server.address"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"0.0.0.0"</span>]</span></code></pre></div>
</div>
</section>
</section>
<section id="deployment" class="level2">
<h2 class="anchored" data-anchor-id="deployment">Deployment</h2>
<p>The bash script below will build the image and deploy it to Google Cloud run as a service.</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>streamlit/deploy.sh</strong></pre>
</div>
<div class="sourceCode" id="cb7" data-filename="streamlit/deploy.sh" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb7-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> config set project {your_project_here}</span>
<span id="cb7-2"></span>
<span id="cb7-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">echo</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"======================================================"</span></span>
<span id="cb7-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">echo</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"build (no cache)"</span></span>
<span id="cb7-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">echo</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"======================================================"</span></span>
<span id="cb7-6"></span>
<span id="cb7-7"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">docker</span> build <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--no-cache</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-t</span> gcr.io/{your_project_here}/streamlit-poc .</span>
<span id="cb7-8"></span>
<span id="cb7-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">echo</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"======================================================"</span></span>
<span id="cb7-10"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">echo</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"push"</span></span>
<span id="cb7-11"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">echo</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"======================================================"</span></span>
<span id="cb7-12"></span>
<span id="cb7-13"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">docker</span> push gcr.io/{your_project_here}/streamlit-poc</span>
<span id="cb7-14"></span>
<span id="cb7-15"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">echo</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"======================================================"</span></span>
<span id="cb7-16"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">echo</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"deploy run"</span></span>
<span id="cb7-17"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">echo</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"======================================================"</span></span>
<span id="cb7-18"></span>
<span id="cb7-19"></span>
<span id="cb7-20"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> run deploy streamlit-poc <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-21">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--image</span> gcr.io/{your_project_here}/streamlit-poc <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-22">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--platform</span> managed <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-23">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--region</span> us-central1 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-24">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--allow-unauthenticated</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-25">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--service-account</span> {service_account_name}@{your_project_here}.iam.gserviceaccount.com_ <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-26">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--memory</span> 1Gi</span></code></pre></div>
</div>
<p>A few notes on the deploy script</p>
<ul>
<li>You will need to replace <code>{your_project_here}</code> with your actual project id</li>
<li>I am using the us-central1 region</li>
<li>Above specifies a service account that Cloud Run will use to access various services on GCP. You will need to ensure that when you define the Service Account, you provide access to Secret Manager</li>
</ul>
<p>From your shell, deploy the application.</p>
<div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb8-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bash</span> streamlit/deploy.sh</span></code></pre></div>
<p>When the process completes, you should see something similar to below:</p>
<div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb9-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">======================================================</span></span>
<span id="cb9-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">deploy</span> run</span>
<span id="cb9-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">======================================================</span></span>
<span id="cb9-4"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">Deploying</span> container to Cloud Run service <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">[</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">streamlit</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">-</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">poc</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">]</span> in project <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">[</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">{your</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">-</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">project</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">-</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">here}</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">]</span> region <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">[</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">us</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">-</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">central1</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb9-5"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">OK</span> Deploying... Done.                                                                                                          </span>
<span id="cb9-6">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">OK</span> Creating Revision...                                                                                                      </span>
<span id="cb9-7">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">OK</span> Routing traffic...                                                                                                        </span>
<span id="cb9-8">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">OK</span> Setting IAM Policy...                                                                                                     </span>
<span id="cb9-9"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">Done.</span>                                                                                                                          </span>
<span id="cb9-10"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">Service</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">[</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">streamlit</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">-</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">poc</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">]</span> revision <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">[</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">streamlit</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">-</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">poc</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">-</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">00003</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">-</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">588</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">]</span> has been deployed and is serving 100 percent of traffic.</span>
<span id="cb9-11"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">Service</span> URL: https://streamlit-poc-111222333444.us-central1.run.app</span></code></pre></div>
<p>That’s it! Navigate to the URL and you have deployed a publicly available, serverless, Streamlit app on GCP via Cloud Run.</p>


</section>

 ]]></description>
  <category>Steramlit</category>
  <category>GCP</category>
  <category>Motherduck</category>
  <category>Serverless</category>
  <guid>https://brocktibert.com/posts/20241213-streamlit-gcp/</guid>
  <pubDate>Fri, 13 Dec 2024 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Calling RapidMiner from Python</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/migration/calling-rapidminer-from-python.html</link>
  <description><![CDATA[ 





<p>I recently discovered that RapidMiner (RM) has a python library that allows us to start an RM process in order to perform various actions. While this may appear counter-intuitive, I am interested in exploring how I might be able to write “tests” that let me evaluate the process submitted by a student as a homework assignment. Simply, if I were teaching a python-only course, I might use <code>pytest</code> to evaluate and autograde an assignment in order to provide immediate feedback to my students. With this library, I am hopeful that I might be able to come up with a comparable solution for when I teach with RapidMiner.</p>
<p>The code snippets below are mostly adapted from the <a href="https://github.com/rapidminer/python-rapidminer/blob/master/examples/studio_examples.ipynb">following resource</a>, but with some additional context added. First, getting setup is easy, as it’s a simple <code>pip install rapidminer</code>. The repository <a href="https://github.com/rapidminer/python-rapidminer">can be found here</a>, and is something that you should explore, as RapidMiner has been working on “operators” that allow us to easily include tooling backed by scikit-learn in python. That is, we can use scikit-learn <strong>inside</strong> a RapidMiner process.</p>
<p>After installing the package, I suspect that the hardest part will be the configuration to ensure that you inform the python library where RapidMiner is installed on your machine.</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1">rm_home <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/Applications/RapidMiner Studio.app/Contents/Resources/RapidMiner-Studio/"</span></span>
<span id="cb1-2">connector <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> rapidminer.Studio(rm_home)</span></code></pre></div>
<p>Above establishes the <code>rm_home</code> variable and points to appropriate folder on my macbook. It is worth noting that instead of passing a string, the python library will also look for an environment variable <code>RAPIDMINER_HOME</code> instead of explicitly defining the location in your script.</p>
<blockquote class="blockquote">
<p>If you dive into the python library’s source code, you will notice that <code>Studio</code> is looking for a folder called <code>scripts</code> within the installation directory.</p>
</blockquote>
<p>With the <code>connector</code> established, let’s run through some basic use-cases.</p>
<section id="save-a-dataset-to-a-rapidminer-repository" class="level2">
<h2 class="anchored" data-anchor-id="save-a-dataset-to-a-rapidminer-repository">Save a dataset to a RapidMiner Repository</h2>
<p>Before diving into the code, let’s get some terminology out of the way. RapidMiner has the notion of <code>repositories</code>. These are simply folders on your machine that will act as the container for your work. Once you have a repository setup, you can reference these repositories via <code>//repository</code>. You will see that in the code below.</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> ast <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> operator</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> concurrent.futures <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> process</span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os</span>
<span id="cb2-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> rapidminer</span>
<span id="cb2-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb2-6"></span>
<span id="cb2-7">URL <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/diamonds.csv"</span></span>
<span id="cb2-8">dia <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(URL)</span>
<span id="cb2-9">connector.write_resource(dia, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"//BU/data/diamonds"</span>)</span></code></pre></div>
<p>Let’s step through above:</p>
<ol type="1">
<li>We import a number of libraries, including pandas. I would recommend using a package manager like conda. In my case, I create an environment called <code>rapidminer</code> and pip-install all of the tools that I need there. I also use this environment <strong>within</strong> RM when <strong>calling</strong> python.</li>
<li><code>URL</code> is simply a pointer to a csv file on the web.</li>
<li>I use pandas to read the csv file into a DataFrame, a python object that RapidMiner can easily work with.</li>
<li>Using the <code>connector</code> object, I am writing the dataset (called diamonds) into the <code>data</code> sub-folder in my <strong>repository</strong> called <code>BU</code>. Note the <code>//BU</code> shortcut I referenced above.</li>
<li>You should now see the data object within the repository of your choice. TIP: You may need to refresh the view within RM Studio to see the change.</li>
</ol>
<p>While RapidMiner also has the concept of a <code>project</code> (i.e.&nbsp;a git-backed folder), I tend to simply create repositories in RM that point to folders that are already under version control on my machine. I prefer to use other tools (e.g.&nbsp;CLI, Github Desktop) to commit/push my diffs instead of doing these tasks within RM.</p>
</section>
<section id="save-a-dataset-to-a-rapidminer-repository-1" class="level2">
<h2 class="anchored" data-anchor-id="save-a-dataset-to-a-rapidminer-repository-1">Save a dataset to a RapidMiner Repository</h2>
<p>Now this is where functionality gets exciting. The code below will use the same <code>connector</code>, but this time, it is pointed at a RapidMiner process stored in the repository.</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">hw <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> connector.run_process(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"//BU/hw1"</span>)</span></code></pre></div>
<p>A few notes:</p>
<ul>
<li>The process is stored in my repository and is called <code>hw1</code>.</li>
<li>Based on the configuration above, the <strong>entire</strong> RapidMiner process will be executed. We will see another example below where we can execute a specific operator within the process.</li>
<li>Python will invoke RapidMiner and start the engine in order to execute the process. This will take a few moments depending on your machine and installed extensions. NOTE: You can tell python whether or not you want to see the messages while RM is getting started and executing the process flow.</li>
<li>Above you can notice that the I am assigning the output of the process to a variable called <code>hw</code>. The results stored in this python object will be dependent on the process setup. Above, my process simply <code>Retrieve</code>s the <code>diamonds</code> dataset from the earlier example, performs an aggregation, and connects the resulting <code>ExampleSet</code> to the results port. As such the <code>hw</code> variable is a pandas DataFrame with the aggregated results.</li>
</ul>
<p>The key takeaway from above is that we don’t have any knowledge of the steps in the process, or the work that was done along the way, but the python library allows us to extract the result ports and bring those objects back into our python session.</p>
<blockquote class="blockquote">
<p>You might be wondering why I used the ExampleSet reference above. An ExampleSet is simply a dataset in RapidMiner, and in most cases from my experience, will be a pandas DataFrame, but it doesn’t have to be.</p>
</blockquote>
</section>
<section id="call-an-operator-within-the-process" class="level2">
<h2 class="anchored" data-anchor-id="call-an-operator-within-the-process">Call an Operator within the Process</h2>
<p>Finally, the last task that appears to be supported is the ability to call a specific operator (e.g.&nbsp;a function or unit of work in RM) and get the output from that run. This is a touch tricky because I suspect that when calling this operator, we must properly pass the inputs the operator expects.</p>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># using the dia object from earlier, only keep rows where the cut value is Good</span></span>
<span id="cb4-2">dia2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dia.loc[dia.cut<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Good'</span>, :]</span>
<span id="cb4-3"></span>
<span id="cb4-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the Aggregate operator has two outputs, as such, specifying two output objects</span></span>
<span id="cb4-5">agg, ori <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> connector.run_process(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"//BU/hw1"</span>, inputs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>dia2, operator<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Aggregate"</span>)</span></code></pre></div>
<p>A few notes on above:</p>
<ul>
<li>I am keeping a subset of the original diamonds dataset that we grabbed from the web. This is to test the output of the <code>Aggregate</code> operator, which is essentially a <code>group-by</code> utility. In this case, there should only be one row since the process will be grouping by the variable, or <code>Attribute</code> cut.</li>
<li>We are still using the <code>run_process</code> method, but now are specifying the inputs to use, and the name of the Operator within the process.</li>
<li>We are also storing multiple output, as the <code>Aggregate</code> operator exports the results as an ExampleSet (that is coverted to a DataFrame), as well as the original ExampleSet passed to the operator.</li>
</ul>
<p>How neat is that?</p>
</section>
<section id="looking-ahead" class="level2">
<h2 class="anchored" data-anchor-id="looking-ahead">Looking Ahead</h2>
<p>This is a great start, and considering that RapidMiner processes are simply XML files, I do wonder if I will be able to think about ways to write tests against my students’ assignments by parsing the XML files and then running various operators along the way.</p>
<p>This leads me to my wishlist:</p>
<ol type="1">
<li>Instead of parsing the XML document, it would be nice if we could inspect both the process itself, as well as get feedback on the entire run of the process. Ideally we would be able to see the data flow through, understand the changes that occur along the way, etc. Having this information in python, after the call to <code>run_process</code> would enable us to write tests to ensure that a process is configured as expected, the output matches expectations, etc.</li>
<li>In addition to running a single operator, it could be helpful to run a portion, or sequence of the process. The same rationale still applies; write tests against the process.</li>
</ol>


</section>

 ]]></description>
  <guid>https://brocktibert.com/posts/migration/calling-rapidminer-from-python.html</guid>
  <pubDate>Thu, 20 Jan 2022 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Deploy your scikit Model with FastAPI and Docker</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/migration/2021-09-28-fastapi-docker.html</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://snipboard.io/WFMXsl.jpg" class="img-fluid figure-img"></p>
<figcaption>Image</figcaption>
</figure>
</div>
<p>This quick post will highlight the necessary bits to get you thinking about how you might consider deploying your Machine Learning models into production. I will be using a python stack, and more specifically:</p>
<ul>
<li>FastAPI</li>
<li>Docker</li>
<li>Scikit-Learn</li>
</ul>
<p>This <strong>very basic</strong> proof of concept has scripts that:</p>
<ol type="1">
<li><code>train.py</code> a really horrible model, but that’s not the point. It fits a text classifier and saves the model to disk for use downstream.</li>
<li>Serve the model with FastAPI via <code>app.py</code> and <code>uvicorn app:app --reload</code> if you want to test it locally. To view the docs, go to <code>localhost:8000/docs</code>.</li>
<li>Dockerize the FastAPI app and serve.</li>
</ol>
<section id="why" class="level2">
<h2 class="anchored" data-anchor-id="why">Why</h2>
<ul>
<li>Remind my future self on the basic stack needed to deploy ML APIs.</li>
<li>Act as a Quick Start guide that data product owners can use to have discussions with downstream engineers</li>
<li>Highlight how easy it is to get started serving models with scikit-learn, a robust ML framework that allows for reproducible pipelines. Current SOTA models tend to be built in DL frameworks like Tensorflow, but for a large set of business needs, the marginal gain is not worth the marginal effort.</li>
</ul>
<blockquote class="blockquote">
<p>The last point is especially true for teams that are starting to figure out how to deploy models, even if they are batch-oriented and not via API responses.</p>
</blockquote>


</section>

 ]]></description>
  <category>ML Ops</category>
  <guid>https://brocktibert.com/posts/migration/2021-09-28-fastapi-docker.html</guid>
  <pubDate>Tue, 28 Sep 2021 04:00:00 GMT</pubDate>
</item>
<item>
  <title>Document your Architecture with Diagrams</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/migration/document-your-architecture-with-diagrams.html</link>
  <description><![CDATA[ 





<section id="document-your-architecture-with-diagrams" class="level1">
<h1>Document your Architecture with Diagrams</h1>
<p>This quick post will highlight <a href="https://diagrams.mingrammer.com/">the python package diagrams</a>, which aims to provide a toolkit for <code>diagrams as code</code>. This is a fantastic toolkit that empowers us to use code to drive documentation.</p>
<p>This library has a large number of use-cases, but for me, this boils down to technical documentation that can help visualize:</p>
<ul>
<li>Architecture reviews</li>
<li>Technical prototypes (e.g.&nbsp;ETL, app stack, data product)</li>
<li>ML pipelines</li>
</ul>
<p>To get started, it’s as easy as</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1">pip install diagrams</span></code></pre></div>
<blockquote class="blockquote">
<p>N.B. If you are developing locally, you will need to ensure that you have <code>graphviz</code> installed. See <a href="https://diagrams.mingrammer.com/docs/getting-started/installation">link link</a> for more.</p>
</blockquote>
<p>With that behind us, let’s use the example from the docs</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> diagrams <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Diagram</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> diagrams.aws.compute <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> EC2</span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> diagrams.aws.database <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> RDS</span>
<span id="cb2-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> diagrams.aws.network <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ELB</span>
<span id="cb2-5"></span>
<span id="cb2-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> Diagram(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Web Service"</span>, show<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>):</span>
<span id="cb2-7">    ELB(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lb"</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;</span> EC2(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"web"</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;</span> RDS(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"userdb"</span>)</span></code></pre></div>
<p>Which will render the following diagram:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://brocktibert.com/posts/migration/web_service.png" class="img-fluid figure-img"></p>
<figcaption>Web Service Diagram</figcaption>
</figure>
</div>
<p>Let’s call out a few amazing bits from above:</p>
<ol type="1">
<li>The code creates an image file that we can use downstream. Above, the resulting file is <code>web_service.png</code>. This may not be obvious at first, but customizable based on your approach.</li>
<li>We can leverage the iconography (i.e.&nbsp;product logo) from a wide range of technologies and cloud providers.</li>
<li>What’s more, you can also control the labels of these nodes in the diagram. I find this to be helpful when prototyping ideas for feature tickets or tech design meetings with devs/engineers.</li>
<li>This is reproducible and removes the need to drag-and-drop updates in GUI tools.</li>
</ol>
<blockquote class="blockquote">
<p>The last item above <ins>could</ins> be viewed as both a strength and a weakness from a “risk” point-of-view. Using code to build our diagram is reproducible, but does impose a more <em>technical</em> approach when balanced against drag/drop GUI apps.</p>
</blockquote>
<p>How about some other examples?</p>
<section id="clustersgroups-of-services" class="level3">
<h3 class="anchored" data-anchor-id="clustersgroups-of-services">Clusters/Groups of Services</h3>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> diagrams <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Cluster, Diagram</span>
<span id="cb3-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> diagrams.aws.compute <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ECS</span>
<span id="cb3-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> diagrams.aws.database <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ElastiCache, RDS</span>
<span id="cb3-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> diagrams.aws.network <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ELB</span>
<span id="cb3-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> diagrams.aws.network <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Route53</span>
<span id="cb3-6"></span>
<span id="cb3-7"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> Diagram(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Clustered Web Services"</span>, show<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>):</span>
<span id="cb3-8">    dns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Route53(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dns"</span>)</span>
<span id="cb3-9">    lb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ELB(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lb"</span>)</span>
<span id="cb3-10"></span>
<span id="cb3-11">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> Cluster(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Services"</span>):</span>
<span id="cb3-12">        svc_group <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [ECS(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"web1"</span>),</span>
<span id="cb3-13">                     ECS(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"web2"</span>),</span>
<span id="cb3-14">                     ECS(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"web3"</span>)]</span>
<span id="cb3-15"></span>
<span id="cb3-16">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> Cluster(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"DB Cluster"</span>):</span>
<span id="cb3-17">        db_main <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> RDS(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"userdb"</span>)</span>
<span id="cb3-18">        db_main <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> [RDS(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"userdb ro"</span>)]</span>
<span id="cb3-19"></span>
<span id="cb3-20">    memcached <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ElastiCache(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"memcached"</span>)</span>
<span id="cb3-21"></span>
<span id="cb3-22">    dns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;</span> lb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;</span> svc_group</span>
<span id="cb3-23">    svc_group <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;</span> db_main</span>
<span id="cb3-24">    svc_group <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;</span> memcached</span></code></pre></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://brocktibert.com/posts/migration/clustered_web_services.png" class="img-fluid figure-img"></p>
<figcaption>Clustered Web Services</figcaption>
</figure>
</div>
<p>How cool is that?!?</p>


</section>
</section>

 ]]></description>
  <category>Data Management</category>
  <guid>https://brocktibert.com/posts/migration/document-your-architecture-with-diagrams.html</guid>
  <pubDate>Mon, 30 Aug 2021 04:00:00 GMT</pubDate>
</item>
<item>
  <title>Trigger Airflow DAGs via the REST API</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/migration/trigger-airflow-dags-via-the-rest-api.html</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://brocktibert.com/post/trigger-airflow-dags-via-the-rest-api/.png" class="img-fluid figure-img"></p>
<figcaption>Data Management</figcaption>
</figure>
</div>
<p>This post will discuss how to use the REST api in Airflow 2 to trigger the run of a DAG as well as pass parameters that can be used in the run. I will not go over how to get setup and install Airflow, but I will say that the documentation is pretty straight forward as long as you follow it step-by-step.</p>
<section id="step-1---enable-the-rest-api" class="level2">
<h2 class="anchored" data-anchor-id="step-1---enable-the-rest-api">Step 1 - Enable the REST API</h2>
<p>By default, airflow does not accept requests made to the API. However, it’s easy enough to turn on:</p>
<pre><code># auth_backend = airflow.api.auth.backend.deny_all
auth_backend = airflow.api.auth.backend.basic_auth</code></pre>
<blockquote class="blockquote">
<p>Above I am commenting out the original line, and including the basic auth scheme.</p>
</blockquote>
<p>To be validated by the API, we simply need to pass an <code>Authorization</code> header and the base64 encoded form of <code>username:password</code> where username and password are for the user created in Airflow.</p>
<p>For example:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://snipboard.io/h12rNQ.jpg" class="img-fluid figure-img"></p>
<figcaption>Example</figcaption>
</figure>
</div>
<p>Above, I have blurred out a series of text, but that is the <code>username:password</code> Base64 encoded. There are plenty of tools on the web that can encode this for you.</p>
<blockquote class="blockquote">
<p>NOTE: You see the encoded information is prefaced by <code>Basic</code>.</p>
</blockquote>
</section>
<section id="step-2-test-the-api-by-listing-dags" class="level2">
<h2 class="anchored" data-anchor-id="step-2-test-the-api-by-listing-dags">Step 2: Test the API by Listing Dags</h2>
<p>With above in place, we can list the dags in Airflow easily via /dags.</p>
<p>Because I am running locally, it’s as simple as a <code>GET</code> request to <code>http://localhost:8080/api/v1/dags</code>. Just remember to include the Authorization bits. The call to list the DAGs is also shown in the screenshot above.</p>
</section>
<section id="step-3-the-dag-setup-and-configuration" class="level2">
<h2 class="anchored" data-anchor-id="step-3-the-dag-setup-and-configuration">Step 3: The DAG setup and configuration</h2>
<p>Of course, if we are going to pass information to the DAG, we would expect the tasks to be able to consume and use that information. Below provides snippets of my DAG to help refer to the core pieces.</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># airflow bits</span></span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> airflow <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> DAG</span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> airflow.operators.python <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> PythonOperator</span>
<span id="cb2-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> airflow.utils.dates <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> days_ago</span>
<span id="cb2-5"></span>
<span id="cb2-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># a function to read the parameters passed</span></span>
<span id="cb2-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> main_task(ti, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>context):</span>
<span id="cb2-8">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the sql</span></span>
<span id="cb2-9">    SQL <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb2-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    select count(a.dim_user_id) as total</span></span>
<span id="cb2-11"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    from table1 a,</span></span>
<span id="cb2-12"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">        table2 b</span></span>
<span id="cb2-13"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    where a.dim_account_id = b.id</span></span>
<span id="cb2-14"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    and   b.global_id = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span></span>
<span id="cb2-15"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    and   a.dim_site_id = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span></span>
<span id="cb2-16"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    and   a.message NOT IN ('JOINED', 'LEFT')</span></span>
<span id="cb2-17"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    and   a.activity_date between '</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">' and '</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span></span>
<span id="cb2-18"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb2-19">    </span>
<span id="cb2-20">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># connect to redshift</span></span>
<span id="cb2-21">    rs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> connect_redshift()  </span>
<span id="cb2-22">    </span>
<span id="cb2-23">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># extract the parameters passed from the REST API Trigger </span></span>
<span id="cb2-24">    gid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> context[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dag_run'</span>].conf[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'gid'</span>]</span>
<span id="cb2-25">    sid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> context[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dag_run'</span>].conf[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sid'</span>]</span>
<span id="cb2-26">    date_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> context[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dag_run'</span>].conf[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'date_start'</span>]</span>
<span id="cb2-27">    date_end <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> context[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dag_run'</span>].conf[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'date_end'</span>]</span>
<span id="cb2-28">    </span>
<span id="cb2-29">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># build the SQL and get the data</span></span>
<span id="cb2-30">    SQL <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SQL.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(gid, sid, date_start, date_end)</span>
<span id="cb2-31">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> rs.redshift_to_pandas(SQL)</span>
<span id="cb2-32">    </span>
<span id="cb2-33">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># do more things and </span></span>
<span id="cb2-34"></span>
<span id="cb2-35"></span>
<span id="cb2-36">args <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb2-37">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'owner'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'brock'</span>,</span>
<span id="cb2-38">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'depends_on_past'</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb2-39">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'start_date'</span>: days_ago(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb2-40">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'email'</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'brocktibert@gmail.com'</span>],</span>
<span id="cb2-41">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'email_on_failure'</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb2-42">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'email_on_retry'</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb2-43">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'retries'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,</span>
<span id="cb2-44">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'retry_delay'</span>: timedelta(seconds<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>),</span>
<span id="cb2-45">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'schedule_interval'</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>,</span>
<span id="cb2-46">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'provide_context'</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span>
<span id="cb2-47">}</span>
<span id="cb2-48"></span>
<span id="cb2-49">dag <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DAG(</span>
<span id="cb2-50">    dag_id<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'rest-trigger'</span>,</span>
<span id="cb2-51">    default_args<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>args,</span>
<span id="cb2-52">    tags<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"brock"</span>]</span>
<span id="cb2-53">)</span>
<span id="cb2-54"></span>
<span id="cb2-55">t1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> PythonOperator(task_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"main_etl"</span>, </span>
<span id="cb2-56">                    python_callable <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> main_task,</span>
<span id="cb2-57">                    dag <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dag)</span></code></pre></div>
<p>Let’s review:</p>
<ol type="1">
<li>We are importing the airflow bits that we need.</li>
<li>I am not showing all of my code, or even the real code I created for the client, but I am simply creating a function that reads from <code>**context</code> and extracts the parameters from the DAG’s <code>conf</code> object, where those params are stored. This will make more sense in a moment.</li>
<li>I am adding <code>provide_context=True</code> to the args.</li>
<li>I use the function <code>main_task</code> in a PythonOperator and associate it to the DAG.</li>
</ol>
<blockquote class="blockquote">
<p>Note: The <code>dag_id</code> is really important. This is what is available airflow uses as the id for a DAG. Please be mindful of the values you use here.</p>
</blockquote>
</section>
<section id="step-4.-trigger-the-dag" class="level2">
<h2 class="anchored" data-anchor-id="step-4.-trigger-the-dag">Step 4. Trigger the dag</h2>
<p>Earlier in the screenshot we saw that we can use a <code>GET</code> request to /dags to get a simple list. We can use a <code>POST</code> request to trigger the dag by <strong>name</strong>. Above, the DAG I want to trigger is called <code>rest-trigger</code>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://snipboard.io/C0qrBQ.jpg" class="img-fluid figure-img"></p>
<figcaption>Example</figcaption>
</figure>
</div>
<p>Let’s review above:</p>
<ol type="1">
<li>We are making a call to POSTing data to <a href="http://localhost:8080/api/v1/dags/rest-trigger/dagRuns">http://localhost:8080/api/v1/dags/rest-trigger/dagRuns</a>. Note that we are including the dag name / dagRuns.</li>
<li>The Authorization bits still need to be included and are not shown above, but are the same as the earlier screenshot.</li>
<li>In the body, we are passing json and including the information that we want as inside the <code>conf</code> key.</li>
<li>I am also specifiying my own ID for the <code>dag_run_id</code>. This will be auto generated for us, but is helpful if we have systems/logic that sits above the Airflow API.</li>
</ol>
</section>
<section id="review-the-triggered-dag" class="level2">
<h2 class="anchored" data-anchor-id="review-the-triggered-dag">Review the Triggered Dag</h2>
<p>With the DAG triggered, we can use the UI to review the process, but we can also use the REST API to poll the job for it’s status via a <code>GET</code> call to <code>http://localhost:8080/api/v1/dags/rest-trigger/dagRuns</code> where again, we are passing in the dag name. In this case, we are passing in <code>rest-trigger</code> but you would use the name of your own dag.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://snipboard.io/ytrNiJ.jpg" class="img-fluid figure-img"></p>
<figcaption>Example</figcaption>
</figure>
</div>
</section>
<section id="closing-remarks" class="level2">
<h2 class="anchored" data-anchor-id="closing-remarks">Closing Remarks</h2>
<p>This was a harder to get up and running than I would like to admit. In the end,</p>
<ul>
<li>If you hit issues, the logs <em>can</em> be helpful. <code>print</code> statements will be included in the logs, so if you need to, you can leverage that flow to identify hang ups.</li>
<li>The name of the dag drives a lot of the functionality, which is why I stressed this earlier in the post.</li>
<li>From the web UI, you can access the REST API docs.</li>
</ul>
<p>While not shown above, you can pass the parameters around your DAG. Below is a simple example of taking in a single parameter and passing it around via XCOM and the <code>ti</code> parameter that was included but not used above.</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> datetime <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> timedelta</span>
<span id="cb3-2"></span>
<span id="cb3-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> airflow <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> DAG</span>
<span id="cb3-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> airflow.operators.python <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> PythonOperator</span>
<span id="cb3-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> airflow.utils.dates <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> days_ago</span>
<span id="cb3-6"></span>
<span id="cb3-7"></span>
<span id="cb3-8">dag <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DAG(</span>
<span id="cb3-9">    dag_id<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"trigger-me"</span>,</span>
<span id="cb3-10">    default_args<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"start_date"</span>: days_ago(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"owner"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"brock"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"provide_context"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>},</span>
<span id="cb3-11">    schedule_interval<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb3-12">)</span>
<span id="cb3-13"></span>
<span id="cb3-14"></span>
<span id="cb3-15"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> push(ti, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>context):</span>
<span id="cb3-16">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># gets the parameter gid which was passed as a key in the json of conf</span></span>
<span id="cb3-17">    gid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> context[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dag_run'</span>].conf[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'gid'</span>]</span>
<span id="cb3-18">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># use the task</span></span>
<span id="cb3-19">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># https://marclamberti.com/blog/airflow-xcom/</span></span>
<span id="cb3-20">    ti.xcom_push(key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'global_id'</span>, value<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>gid)</span>
<span id="cb3-21">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> gid</span>
<span id="cb3-22"></span>
<span id="cb3-23"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> pull(ti):</span>
<span id="cb3-24">    gid2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ti.xcom_pull(key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"global_id"</span>, task_ids<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pusher'</span>])</span>
<span id="cb3-25">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(gid2)</span>
<span id="cb3-26"></span>
<span id="cb3-27">t1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> PythonOperator(task_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pusher"</span>, python_callable<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>push, provide_context<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, dag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>dag)</span>
<span id="cb3-28">t2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> PythonOperator(task_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"puller"</span>, python_callable<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>pull, provide_context<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, dag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>dag)</span>
<span id="cb3-29"></span>
<span id="cb3-30">t1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;</span> t2</span></code></pre></div>
<p>That’s it. I had to piece together above from a range of sources, so I hope this helps you (and my future self) if you need to explore this functionality.</p>


</section>

 ]]></description>
  <category>Data Management</category>
  <guid>https://brocktibert.com/posts/migration/trigger-airflow-dags-via-the-rest-api.html</guid>
  <pubDate>Wed, 25 Aug 2021 04:00:00 GMT</pubDate>
</item>
<item>
  <title>RapidMiner and Tableau</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/migration/2020-03-04-rapidminer-and-tableau.html</link>
  <description><![CDATA[ 





<p>I am a huge fan of incorporating Tableau into my data analytics projects. While some may use R/python <strong>or</strong> Tableau, I use both; Tableau allows us to rapidly explore our data in order to find errors that need to be addressed before moving onto downstream modeling and reporting tasks.</p>
<p>As you can imagine, I was thrilled when I recently noticed that Rapidminer has an Tableau Writer Extension. In short, it intends to do exactly what it says; export our ExampleSet to a Tableau extract file for use directly in Tableau.</p>
<p><img src="https://github.com/Btibert3/public-figs/raw/master/brocktibert/tableau-operator.png" class="img-fluid"></p>
<p>However, when you dive <a href="https://docs.rapidminer.com/latest/studio/connect/tableau.html">into the documentation</a>, the setup process is somewhat complex. Unfortunately, I was unsucessful and could not get things to behave.</p>
<p>Not a problem. Conda environments and python to the rescue!</p>
<p>Rapidminer allows us to call R and python from within the tool, and even provides us with the ability to manage virtual environments.</p>
<p><img src="https://github.com/Btibert3/public-figs/raw/master/brocktibert/rm-python-setup.png" class="img-fluid"></p>
<p>You can access above by navigating to Rapidminer &gt; preferences.</p>
<p>The big thing to note above is that you see we can specify our package manager and environment. We are going to use that in this post to create a Conda environment for our Rapidminer work.</p>
<p>To get started, I assume that you already have Conda installed. I prefer Miniconda, which you can install <a href="https://docs.conda.io/en/latest/miniconda.html">here</a>.</p>
<p>Let’s create the environment. I am a Mac user, so the below commands will be entered into Terminal. Windows users absolutely can perform the same actions, though the sytnax may be slightly different.</p>
<pre><code>conda create -n rapidminer python=3.7 pandas scikit-learn</code></pre>
<p>Above we created a new conda environment called rapidminer which uses python version 3.7, and includes pandas and scikit-learn out of the box. When prompted, say ‘y’ to install the necessary toolling.</p>
<p>One more step. We need to activate the environment to install <code>pantab</code> via pip.</p>
<pre><code>conda activate rapidminer
pip install pantab</code></pre>
<p>Last but not least, go back to Rapidminer &gt; Preferences and select our newly created rapidminer environment. Note, you may need to refresh the interface for it to be made available.</p>
<p><img src="https://github.com/Btibert3/public-figs/raw/master/rm-conda-rm.png" class="img-fluid"></p>
<p>That is all we need for setup!</p>
<p>From here, let’s do a basic test. We will use the included Golf dataset, and write the file to a Tableau hyper file.</p>
<p><img src="https://github.com/Btibert3/public-figs/raw/master/brocktibert/rm-tableau-process.png" class="img-fluid"></p>
<blockquote class="blockquote">
<p>Note that the dataset is going into our input port.</p>
</blockquote>
<p>The only other thing that we need to do is create a simple script to run. In this case, there are some details specific to me and my machine, but you can easily change these as needed.</p>
<pre><code>import pandas
import os
import pantab
import shutil

# rm_main is a mandatory function, 
# the number of arguments has to be the number of input ports (can be none),
#     or the number of input ports plus one if "use macros" parameter is set
# if you want to use macros, use this instead and check "use macros" parameter:
#def rm_main(data,macros):
def rm_main(data):
    # SETUP VARS
    FNAME = "brock.hyper"
    TNAME = "brock"
    FPATH = "/Users/btibert/Downloads/" + FNAME

    # checking
    print('Hello, world!')
    # output can be found in Log View
    print(type(data))
    # where are we
    print(os.curdir)
    print(data.shape)
    ## ^^ TO SEE ABOVE, add log as a VIEW from the top ribbon

    # the dir for the data to start
    os.chdir("/tmp")

    # write the file to a hyper
    pantab.frame_to_hyper(data, FNAME, table=TNAME)

    # move the file
    shutil.move(FNAME, FPATH)

    return data
</code></pre>
<p>This script is included within the script portion of the operator.</p>
<p><img src="https://github.com/Btibert3/public-figs/raw/master/brocktibert/rm-tableau-python-script.png" class="img-fluid"></p>
<p>Based on my setup, when I run the process, I now have a file called <code>brock.hyper</code> in my Downloads folder, which is our golf dataset written to a Tableau extract file via the excellent <a href="https://pantab.readthedocs.io/en/latest/index.html">pantab</a> library for python.</p>



 ]]></description>
  <category>Data Science</category>
  <guid>https://brocktibert.com/posts/migration/2020-03-04-rapidminer-and-tableau.html</guid>
  <pubDate>Fri, 20 Aug 2021 04:00:00 GMT</pubDate>
  <media:content url="https://github.com/Btibert3/public-figs/raw/master/brocktibert/tableau-operator.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Tensorflow, Tip-Ins, and Tableau, Oh My</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/migration/tensorflow-tip-ins-and-tableau-oh-my.html</link>
  <description><![CDATA[ 





<p>This notebook aims to show the basics of:</p>
<ol type="1">
<li>Tensorflow 2.0</li>
<li>Shooter Embedding estimation for NHL Player evaluation</li>
<li>Evaluate feasibility generating a post that switches between <code>R</code> and <code>python</code> via reticulate</li>
<li>Demonstrate code similarity/approach in both languages side-by-side</li>
</ol>
<section id="tldr" class="level2">
<h2 class="anchored" data-anchor-id="tldr">TL;DR</h2>
<ul>
<li>Combine Tensorflow/Keras with R</li>
<li>NHL Data to estimate Shooter Player Embeddings</li>
<li>Export to Tableau for exploration (yes we could use ggplot et. al, but highlights we have other options, especially for those new to the language)</li>
</ul>
</section>
<section id="r-setup" class="level2">
<h2 class="anchored" data-anchor-id="r-setup">R Setup</h2>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># packages</span></span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(keras)</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">suppressPackageStartupMessages</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyverse))</span>
<span id="cb1-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(reticulate)</span>
<span id="cb1-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">suppressPackageStartupMessages</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(caret))</span>
<span id="cb1-7"></span>
<span id="cb1-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># options</span></span>
<span id="cb1-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">options</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">stringsAsFactors =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)</span>
<span id="cb1-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">use_condaenv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tensorflow"</span>)</span></code></pre></div>
</section>
<section id="python-setup" class="level2">
<h2 class="anchored" data-anchor-id="python-setup">Python setup</h2>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># imports</span></span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb2-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.preprocessing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> MinMaxScaler</span>
<span id="cb2-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.model_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> train_test_split</span>
<span id="cb2-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> tensorflow.keras.layers <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Activation, concatenate, Dense, Dropout, Embedding, Input, Reshape, Flatten</span>
<span id="cb2-7"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> tensorflow.keras.models <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Model</span>
<span id="cb2-8"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> tensorflow.keras.utils <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> plot_model</span>
<span id="cb2-9"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> tensorflow.keras.preprocessing.text <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Tokenizer</span></code></pre></div>
</section>
<section id="get-the-data" class="level2">
<h2 class="anchored" data-anchor-id="get-the-data">Get the data</h2>
<section id="r" class="level3">
<h3 class="anchored" data-anchor-id="r">R</h3>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">URL <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"http://peter-tanner.com/moneypuck/downloads/shots_2019.zip"</span></span>
<span id="cb3-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">download.file</span>(URL, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">destfile=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"shots.zip"</span>)</span>
<span id="cb3-3">shots_raw <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read_csv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"shots.zip"</span>)</span></code></pre></div>
<p>What’s the shape?</p>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dim</span>(shots_raw)</span></code></pre></div>
<pre class="plaintext"><code>[1] 88592   124</code></pre>
</section>
<section id="python" class="level3">
<h3 class="anchored" data-anchor-id="python">Python</h3>
<div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">URL <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"http://peter-tanner.com/moneypuck/downloads/shots_2019.zip"</span></span>
<span id="cb6-2">shots_raw <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(URL)</span></code></pre></div>
<p>What do we have?</p>
<div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">shots_raw.shape</span></code></pre></div>
<pre class="plaintext"><code>(88592, 124)</code></pre>
</section>
</section>
<section id="filter-rows" class="level2">
<h2 class="anchored" data-anchor-id="filter-rows">Filter rows</h2>
<p>We want to keep shots on net, and not on an empty net, as well as remove records where the shooter id is 0.</p>
<section id="r-1" class="level3">
<h3 class="anchored" data-anchor-id="r-1">R</h3>
<div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># keep shots that were on goal</span></span>
<span id="cb9-2">shots_raw <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> shots_raw <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(shotWasOnGoal <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> )</span>
<span id="cb9-3">shots_raw <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> shots_raw <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(shotOnEmptyNet <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb9-4">shots_raw <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> shots_raw <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(shooterPlayerId <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb9-5">shots_raw <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> shots_raw <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(shooterPlayerId))</span></code></pre></div>
<p>What we do have for a shape?</p>
<div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dim</span>(shots_raw)</span></code></pre></div>
<pre class="plaintext"><code>[1] 64318   124</code></pre>
</section>
<section id="python-1" class="level3">
<h3 class="anchored" data-anchor-id="python-1">Python</h3>
<div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1">shots_raw <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> shots_raw.loc[shots_raw.shotOnEmptyNet <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, :]</span>
<span id="cb12-2">shots_raw <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> shots_raw.loc[shots_raw.shotWasOnGoal <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, :]</span>
<span id="cb12-3">shots_raw <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> shots_raw.loc[shots_raw.shooterPlayerId <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, :]</span>
<span id="cb12-4">shots_raw <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> shots_raw.loc[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>shots_raw.shooterPlayerId.isna(), :]</span></code></pre></div>
<p>What do we have for a shape?</p>
<div class="sourceCode" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1">shots_raw.shape</span></code></pre></div>
<pre class="plaintext"><code>(64318, 124)</code></pre>
</section>
</section>
<section id="select-columns" class="level2">
<h2 class="anchored" data-anchor-id="select-columns">Select Columns</h2>
<p>With the rows select, let’s keep the columns that we want to include in this analysis.</p>
<section id="r-2" class="level3">
<h3 class="anchored" data-anchor-id="r-2">R</h3>
<div class="sourceCode" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb15-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># keep just the columns that we need</span></span>
<span id="cb15-2">shots_raw <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> shots_raw <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(shooterPlayerId, shotType, goal, arenaAdjustedShotDistance, </span>
<span id="cb15-3">                                   arenaAdjustedXCord, arenaAdjustedYCord,  shotAngle, offWing)</span></code></pre></div>
<p>The shape …</p>
<div class="sourceCode" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb16-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dim</span>(shots_raw)</span></code></pre></div>
<pre class="plaintext"><code>[1] 64318     8</code></pre>
</section>
<section id="python-2" class="level3">
<h3 class="anchored" data-anchor-id="python-2">Python</h3>
<div class="sourceCode" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1">COLS <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'shooterPlayerId'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'shotType'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'goal'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'offWing'</span>, </span>
<span id="cb18-2">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'arenaAdjustedShotDistance'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'arenaAdjustedXCord'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'arenaAdjustedYCord'</span>, </span>
<span id="cb18-3">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'shotAngle'</span>]</span>
<span id="cb18-4">shots_raw <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> shots_raw[COLS]</span></code></pre></div>
<p>The shape …</p>
<div class="sourceCode" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1">shots_raw.shape</span></code></pre></div>
<pre class="plaintext"><code>(64318, 8)</code></pre>
</section>
</section>
<section id="encode-the-shot-types" class="level2">
<h2 class="anchored" data-anchor-id="encode-the-shot-types">Encode the shot types</h2>
<p>I am going to one-hot the shot types, though in the future I will explore the use of <code>keras.preprocessing.text.one_hot</code>. The result will be new columns added to our <code>shots_raw</code> dataset, with each shot type flagged as 0/1.</p>
<section id="r-3" class="level3">
<h3 class="anchored" data-anchor-id="r-3">R</h3>
<div class="sourceCode" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb21-1">x <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dummyVars</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" ~ ."</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> shots_raw)</span>
<span id="cb21-2">shots_raw <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">newdata =</span> shots_raw))</span>
<span id="cb21-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rm</span>(x)</span></code></pre></div>
<p>What do we have?</p>
<div class="sourceCode" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb22-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glimpse</span>(shots_raw)</span></code></pre></div>
<pre class="plaintext"><code>Observations: 64,318
Variables: 14
$ shooterPlayerId           &lt;dbl&gt; 8480801, 8476853, 8476331, 8476853, 8475197…
$ shotTypeBACK              &lt;dbl&gt; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ shotTypeDEFL              &lt;dbl&gt; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ shotTypeSLAP              &lt;dbl&gt; 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ shotTypeSNAP              &lt;dbl&gt; 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0…
$ shotTypeTIP               &lt;dbl&gt; 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
$ shotTypeWRAP              &lt;dbl&gt; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ shotTypeWRIST             &lt;dbl&gt; 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1…
$ goal                      &lt;dbl&gt; 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arenaAdjustedShotDistance &lt;dbl&gt; 4.123106, 59.000000, 30.000000, 40.000000, …
$ arenaAdjustedXCord        &lt;dbl&gt; 85, -30, 60, -56, -40, -48, -34, -77, -61, …
$ arenaAdjustedYCord        &lt;dbl&gt; -1, -2, -7, -22, -30, -8, 41, -13, -34, 11,…
$ shotAngle                 &lt;dbl&gt; -14.036243, 2.009554, -12.994617, 33.690068…
$ offWing                   &lt;dbl&gt; 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1…</code></pre>
</section>
<section id="python-3" class="level3">
<h3 class="anchored" data-anchor-id="python-3">Python</h3>
<div class="sourceCode" id="cb24" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb24-1">shots_raw <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.get_dummies(shots_raw, columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'shotType'</span>])</span>
<span id="cb24-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(shots_raw.shape)</span></code></pre></div>
<pre class="plaintext"><code>(64318, 14)</code></pre>
<pre><code>print(shots_raw.head(3).T)</code></pre>
<pre class="plaintext"><code>                                      0             2             5
shooterPlayerId            8.480801e+06  8.476853e+06  8.476331e+06
goal                       1.000000e+00  0.000000e+00  0.000000e+00
offWing                    1.000000e+00  0.000000e+00  0.000000e+00
arenaAdjustedShotDistance  4.123106e+00  5.900000e+01  3.000000e+01
arenaAdjustedXCord         8.500000e+01 -3.000000e+01  6.000000e+01
arenaAdjustedYCord        -1.000000e+00 -2.000000e+00 -7.000000e+00
shotAngle                 -1.403624e+01  2.009554e+00 -1.299462e+01
shotType_BACK              0.000000e+00  0.000000e+00  0.000000e+00
shotType_DEFL              0.000000e+00  0.000000e+00  0.000000e+00
shotType_SLAP              0.000000e+00  0.000000e+00  0.000000e+00
shotType_SNAP              0.000000e+00  1.000000e+00  1.000000e+00
shotType_TIP               1.000000e+00  0.000000e+00  0.000000e+00
shotType_WRAP              0.000000e+00  0.000000e+00  0.000000e+00
shotType_WRIST             0.000000e+00  0.000000e+00  0.000000e+00</code></pre>
</section>
</section>
<section id="scale-the-numeric-data-to-01" class="level2">
<h2 class="anchored" data-anchor-id="scale-the-numeric-data-to-01">Scale the numeric data to 0/1</h2>
<section id="r-4" class="level3">
<h3 class="anchored" data-anchor-id="r-4">R</h3>
<div class="sourceCode" id="cb28" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb28-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># clunky, but break out columns to standardize</span></span>
<span id="cb28-2">tmp <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> shots_raw <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(arenaAdjustedShotDistance<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>shotAngle)</span>
<span id="cb28-3">tmp2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">preProcess</span>(tmp, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"range"</span>)</span>
<span id="cb28-4">pp <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(tmp2, tmp)</span>
<span id="cb28-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rm</span>(tmp, tmp2)</span>
<span id="cb28-6"></span>
<span id="cb28-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># drop the original and append these</span></span>
<span id="cb28-8">shots_raw <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(shots_raw, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>arenaAdjustedShotDistance<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:-</span>shotAngle)</span>
<span id="cb28-9">shots_raw <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cbind</span>(shots_raw, pp)</span>
<span id="cb28-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dim</span>(shots_raw)</span></code></pre></div>
<pre class="plaintext"><code>[1] 64318    14</code></pre>
</section>
<section id="python-4" class="level3">
<h3 class="anchored" data-anchor-id="python-4">Python</h3>
<div class="sourceCode" id="cb30" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb30-1">scaler <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MinMaxScaler()</span>
<span id="cb30-2">COLS <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'arenaAdjustedShotDistance'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'arenaAdjustedXCord'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'arenaAdjustedYCord'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'shotAngle'</span>]</span>
<span id="cb30-3">shots_raw[COLS] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> scaler.fit_transform(shots_raw[COLS])</span>
<span id="cb30-4">shots_raw.shape</span></code></pre></div>
<pre class="plaintext"><code>(64318, 14)</code></pre>
</section>
</section>
<section id="setup-the-tokenizer-and-fit-to-the-player-ids" class="level2">
<h2 class="anchored" data-anchor-id="setup-the-tokenizer-and-fit-to-the-player-ids">Setup the tokenizer and fit to the Player IDs</h2>
<p>For this exercise, instead of converting the player ids to be 0-based, I am going to treat the player ids as if they are unique words, with the unique number of players representing our complete vocabulary. As such, document represents a shot of the puck on net, and each document only includes one “word”, or shooter.</p>
<blockquote class="blockquote">
<p>The trick here is that we have to treat our player ids as character strings.</p>
</blockquote>
<section id="r-5" class="level3">
<h3 class="anchored" data-anchor-id="r-5">R</h3>
<div class="sourceCode" id="cb32" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb32-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ensure that the shooter ID is a string</span></span>
<span id="cb32-2">shots_raw<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>shooterPlayerId <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.character</span>(shots_raw<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>shooterPlayerId)</span>
<span id="cb32-3"></span>
<span id="cb32-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># setup the tokenizer</span></span>
<span id="cb32-5">shooter_tokenizer <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">text_tokenizer</span>()</span>
<span id="cb32-6"></span>
<span id="cb32-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># fit the shooters </span></span>
<span id="cb32-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fit_text_tokenizer</span>(shooter_tokenizer, shots_raw<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>shooterPlayerId)</span></code></pre></div>
<p>What do we have?</p>
<div class="sourceCode" id="cb33" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb33-1">shooter_tokenizer<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>index_word[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]</span></code></pre></div>
<pre class="plaintext"><code>$`1`
[1] "8477492"

$`2`
[1] "8471214"

$`3`
[1] "8474157"</code></pre>
<div class="sourceCode" id="cb35" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb35-1">shooter_tokenizer<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>word_index[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]</span></code></pre></div>
<pre class="plaintext"><code>$`8477492`
[1] 1

$`8471214`
[1] 2

$`8474157`
[1] 3</code></pre>
<p>And how many?</p>
<div class="sourceCode" id="cb37" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb37-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(shooter_tokenizer<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>index_word)</span></code></pre></div>
<pre class="plaintext"><code>[1] 869</code></pre>
</section>
<section id="python-5" class="level3">
<h3 class="anchored" data-anchor-id="python-5">Python</h3>
<div class="sourceCode" id="cb39" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb39-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># make an integer so zero is not parsed</span></span>
<span id="cb39-2">shots_raw.shooterPlayerId <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> shots_raw.shooterPlayerId.astype(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'int'</span>)</span>
<span id="cb39-3"></span>
<span id="cb39-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ensure that the player ID is a string</span></span>
<span id="cb39-5">shots_raw.shooterPlayerId <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> shots_raw.shooterPlayerId.astype(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'str'</span>)</span>
<span id="cb39-6"></span>
<span id="cb39-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># setup the tokenizer</span></span>
<span id="cb39-8">shooter_tokenizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Tokenizer()</span>
<span id="cb39-9"></span>
<span id="cb39-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># fit the tokenizer to shooters</span></span>
<span id="cb39-11">shooter_tokenizer.fit_on_texts(shots_raw.shooterPlayerId)</span></code></pre></div>
<p>What do we have?</p>
<div class="sourceCode" id="cb40" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb40-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(shooter_tokenizer.index_word.items())[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]</span></code></pre></div>
<pre class="plaintext"><code>[(1, '8477492'), (2, '8471214'), (3, '8474157')]</code></pre>
<div class="sourceCode" id="cb42" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb42-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(shooter_tokenizer.word_index.items())[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]</span></code></pre></div>
<pre class="plaintext"><code>[('8477492', 1), ('8471214', 2), ('8474157', 3)]</code></pre>
<p>And how many?</p>
<div class="sourceCode" id="cb44" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb44-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(shooter_tokenizer.index_word.items())</span></code></pre></div>
<pre class="plaintext"><code>869</code></pre>
</section>
</section>
<section id="create-the-shooter-sequences" class="level2">
<h2 class="anchored" data-anchor-id="create-the-shooter-sequences">Create the Shooter <code>sequences</code></h2>
<p>These are size 1 sequences that do not require padding, as we only allow 1 word (or player) per shot. The key here is that we are using <code>keras</code> to help us easily map our data to the new id system.</p>
<section id="r-6" class="level3">
<h3 class="anchored" data-anchor-id="r-6">R</h3>
<div class="sourceCode" id="cb46" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb46-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># make sequences with the new index</span></span>
<span id="cb46-2">shooters <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">texts_to_sequences</span>(shooter_tokenizer, shots_raw<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>shooterPlayerId)</span>
<span id="cb46-3">shooters <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unlist</span>(shooters)</span></code></pre></div>
<p>What do we have?</p>
<div class="sourceCode" id="cb47" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb47-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">class</span>(shooters)</span></code></pre></div>
<pre class="plaintext"><code>[1] "integer"</code></pre>
<div class="sourceCode" id="cb49" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb49-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(shooters)</span></code></pre></div>
<pre class="plaintext"><code>[1] 64318</code></pre>
</section>
<section id="python-6" class="level3">
<h3 class="anchored" data-anchor-id="python-6">Python</h3>
<div class="sourceCode" id="cb51" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb51-1">shooters <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> shooter_tokenizer.texts_to_sequences(shots_raw.shooterPlayerId)</span>
<span id="cb51-2">shooters <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> x <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> shooters]</span>
<span id="cb51-3">shooters <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array(shooters)</span></code></pre></div>
<p>What do we have?</p>
<div class="sourceCode" id="cb52" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb52-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(shooters)</span></code></pre></div>
<pre class="plaintext"><code>&lt;class 'numpy.ndarray'&gt;</code></pre>
<div class="sourceCode" id="cb54" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb54-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(shooters)</span></code></pre></div>
<pre class="plaintext"><code>64318</code></pre>
</section>
</section>
<section id="isolate-the-other-featurestargets" class="level2">
<h2 class="anchored" data-anchor-id="isolate-the-other-featurestargets">Isolate the other features/targets</h2>
<section id="r-7" class="level3">
<h3 class="anchored" data-anchor-id="r-7">R</h3>
<div class="sourceCode" id="cb56" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb56-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Was the shot a goal?   This is our target.</span></span>
<span id="cb56-2">goal <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> shots_raw<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>goal</span>
<span id="cb56-3"></span>
<span id="cb56-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the shot info</span></span>
<span id="cb56-5">shot_info <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> shots_raw <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>shooterPlayerId, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>goal)</span>
<span id="cb56-6">shot_info <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.matrix</span>(shot_info)</span></code></pre></div>
<p>What do we have now?</p>
<div class="sourceCode" id="cb57" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb57-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(goal); <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(goal);</span></code></pre></div>
<pre class="plaintext"><code>[1] 64318</code></pre>
<pre class="plaintext"><code>[1] 0.09103206</code></pre>
<div class="sourceCode" id="cb60" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb60-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dim</span>(shot_info)</span></code></pre></div>
<pre class="plaintext"><code>[1] 64318    12</code></pre>
<div class="sourceCode" id="cb62" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb62-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(shot_info)</span></code></pre></div>
<pre class="plaintext"><code> [1] "shotTypeBACK"              "shotTypeDEFL"             
 [3] "shotTypeSLAP"              "shotTypeSNAP"             
 [5] "shotTypeTIP"               "shotTypeWRAP"             
 [7] "shotTypeWRIST"             "offWing"                  
 [9] "arenaAdjustedShotDistance" "arenaAdjustedXCord"       
[11] "arenaAdjustedYCord"        "shotAngle"                </code></pre>
</section>
<section id="python-7" class="level3">
<h3 class="anchored" data-anchor-id="python-7">Python</h3>
<div class="sourceCode" id="cb64" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb64-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Was the shot a goal?   This is our target.</span></span>
<span id="cb64-2">goal <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array(shots_raw.goal)</span>
<span id="cb64-3"></span>
<span id="cb64-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the shot info</span></span>
<span id="cb64-5">shot_info <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> shots_raw.drop(columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'shooterPlayerId'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'goal'</span>], axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, inplace<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span></code></pre></div>
<p>What do we have?</p>
<div class="sourceCode" id="cb65" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb65-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(goal)</span></code></pre></div>
<pre class="plaintext"><code>64318</code></pre>
<div class="sourceCode" id="cb67" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb67-1">goal.mean()</span></code></pre></div>
<pre class="plaintext"><code>0.09103205945458503</code></pre>
<div class="sourceCode" id="cb69" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb69-1">shot_info.shape</span></code></pre></div>
<pre class="plaintext"><code>(64318, 12)</code></pre>
<div class="sourceCode" id="cb71" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb71-1">shot_info.columns</span></code></pre></div>
<pre class="plaintext"><code>Index(['offWing', 'arenaAdjustedShotDistance', 'arenaAdjustedXCord',
       'arenaAdjustedYCord', 'shotAngle', 'shotType_BACK', 'shotType_DEFL',
       'shotType_SLAP', 'shotType_SNAP', 'shotType_TIP', 'shotType_WRAP',
       'shotType_WRIST'],
      dtype='object')</code></pre>
</section>
</section>
<section id="define-the-model-architecture" class="level2">
<h2 class="anchored" data-anchor-id="define-the-model-architecture">Define the model architecture</h2>
<section id="r-8" class="level3">
<h3 class="anchored" data-anchor-id="r-8">R</h3>
<blockquote class="blockquote">
<p>Note the +1, it’s needed to avoid the index error</p>
</blockquote>
<div class="sourceCode" id="cb73" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb73-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the setup</span></span>
<span id="cb73-2">NUM_SHOOTERS <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unique</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unlist</span>(shooter_tokenizer<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>index_word))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb73-3">SHOT_COLS <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ncol</span>(shot_info)</span>
<span id="cb73-4">VEC_SIZE <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span></span>
<span id="cb73-5"></span>
<span id="cb73-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the input layers</span></span>
<span id="cb73-7">shooter_input <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">layer_input</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">shape=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">name =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"shooter_input"</span>)</span>
<span id="cb73-8">shot_input <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">layer_input</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">shape=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(SHOT_COLS), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">name =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"shot_input"</span>)</span>
<span id="cb73-9"></span>
<span id="cb73-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># shooter layers</span></span>
<span id="cb73-11">s1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">layer_embedding</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">input_dim =</span> NUM_SHOOTERS, </span>
<span id="cb73-12">                     <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">output_dim =</span> VEC_SIZE, </span>
<span id="cb73-13">                     <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">input_length =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, </span>
<span id="cb73-14">                     <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">name=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"shooter_embedding"</span>)(shooter_input)</span>
<span id="cb73-15">s2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">layer_flatten</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">name =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"shooter_flat"</span>)(s1)</span>
<span id="cb73-16">s3 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">layer_dense</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">units =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">activation =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sigmoid"</span>)(s2)</span>
<span id="cb73-17"></span>
<span id="cb73-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># put the model together</span></span>
<span id="cb73-19">model <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">keras_model</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">inputs =</span> shooter_input, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">outputs =</span> s3)</span></code></pre></div>
<p>Summarize:</p>
<div class="sourceCode" id="cb74" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb74-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(model)</span></code></pre></div>
<pre class="plaintext"><code>Model: "model"
________________________________________________________________________________
Layer (type)                        Output Shape                    Param #
================================================================================
shooter_input (InputLayer)          [(None, 1)]                     0
________________________________________________________________________________
shooter_embedding (Embedding)       (None, 1, 50)                   43500
________________________________________________________________________________
shooter_flat (Flatten)              (None, 50)                      0
________________________________________________________________________________
dense (Dense)                       (None, 1)                       51
================================================================================
Total params: 43,551
Trainable params: 43,551
Non-trainable params: 0
________________________________________________________________________________</code></pre>
</section>
<section id="python-8" class="level3">
<h3 class="anchored" data-anchor-id="python-8">Python</h3>
<blockquote class="blockquote">
<p>Note the +2, it’s needed to avoid the index error and differs from abvoe</p>
</blockquote>
<div class="sourceCode" id="cb76" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb76-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># setup</span></span>
<span id="cb76-2">NUM_SHOOTERS <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(np.unique(shooters)) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb76-3">SHOT_COLS <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> shot_info.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb76-4">VEC_SIZE <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span></span>
<span id="cb76-5"></span>
<span id="cb76-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the input layers</span></span>
<span id="cb76-7">shooter_input <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Input(shape<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, ), name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"shooter_input"</span>)</span>
<span id="cb76-8">shot_input <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Input(shape<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(SHOT_COLS, ), name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"shot_input"</span>)</span>
<span id="cb76-9"></span>
<span id="cb76-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># shooter layers</span></span>
<span id="cb76-11">s1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Embedding(NUM_SHOOTERS, VEC_SIZE, input_length<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)(shooter_input)</span>
<span id="cb76-12">s2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Flatten()(s1)</span>
<span id="cb76-13">s3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Dense(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, activation<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sigmoid"</span>)(s2)</span>
<span id="cb76-14"></span>
<span id="cb76-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># put the model together</span></span>
<span id="cb76-16">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Model(inputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> shooter_input, outputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> s3)</span></code></pre></div>
<p>What do we have?</p>
<div class="sourceCode" id="cb77" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb77-1">model.summary()</span></code></pre></div>
<pre class="plaintext"><code>Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
shooter_input (InputLayer)   [(None, 1)]               0
_________________________________________________________________
embedding (Embedding)        (None, 1, 50)             43500
_________________________________________________________________
flatten (Flatten)            (None, 50)                0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 51
=================================================================
Total params: 43,551
Trainable params: 43,551
Non-trainable params: 0
_________________________________________________________________</code></pre>
<p>and plot the model, this is not available within R at the moment.</p>
<div class="sourceCode" id="cb79" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb79-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># below might choke RMD</span></span>
<span id="cb79-2">plot_model(model, to_file<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'model.png'</span>)</span></code></pre></div>
</section>
</section>
<section id="train-and-evaluate-the-model" class="level2">
<h2 class="anchored" data-anchor-id="train-and-evaluate-the-model">Train and Evaluate the Model</h2>
<section id="r-9" class="level3">
<h3 class="anchored" data-anchor-id="r-9">R</h3>
<p>Compile the model.</p>
<div class="sourceCode" id="cb80" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb80-1">model <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb80-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">compile</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">optimizer =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"adam"</span>, </span>
<span id="cb80-3">          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">loss=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"binary_crossentropy"</span>, </span>
<span id="cb80-4">          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">metrics =</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"accuracy"</span>))</span></code></pre></div>
<p>Fit the model and record the history for plotting, if needed</p>
<div class="sourceCode" id="cb81" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb81-1">history <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> </span>
<span id="cb81-2">  model <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb81-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fit</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(shooters), </span>
<span id="cb81-4">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y=</span>goal, </span>
<span id="cb81-5">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">epochs =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, </span>
<span id="cb81-6">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">verbose =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span></code></pre></div>
</section>
<section id="python-9" class="level3">
<h3 class="anchored" data-anchor-id="python-9">Python</h3>
<p>Compile the model.</p>
<div class="sourceCode" id="cb82" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb82-1">model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">compile</span>(optimizer<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"adam"</span>, loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"binary_crossentropy"</span>, metrics <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'accuracy'</span>])</span></code></pre></div>
<p>Fit the model.</p>
<div class="sourceCode" id="cb83" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb83-1">X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [shooters, shot_info]</span>
<span id="cb83-2">history <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.fit(X, goal, epochs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span></code></pre></div>
</section>
</section>
<section id="get-the-embeddings" class="level2">
<h2 class="anchored" data-anchor-id="get-the-embeddings">Get the Embeddings</h2>
<p>With our simple model, we have estimated embeddings for each shooter. Let’s grab those.</p>
<section id="r-10" class="level3">
<h3 class="anchored" data-anchor-id="r-10">R</h3>
<div class="sourceCode" id="cb84" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb84-1">shooter_embeddings <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_weights</span>(model)[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span></code></pre></div>
<p>What do we have?</p>
<div class="sourceCode" id="cb85" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb85-1">shooter_embeddings[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]</span></code></pre></div>
<pre class="plaintext"><code>            [,1]       [,2]        [,3]
[1,] -0.03280475 0.02733728 -0.01639561
[2,]  0.08887081 0.12082704  0.12113131
[3,]  0.09737433 0.06078419  0.05767342</code></pre>
<p>The shape.</p>
<div class="sourceCode" id="cb87" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb87-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dim</span>(shooter_embeddings)</span></code></pre></div>
<pre class="plaintext"><code>[1] 870  50</code></pre>
</section>
<section id="python-10" class="level3">
<h3 class="anchored" data-anchor-id="python-10">Python</h3>
<div class="sourceCode" id="cb89" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb89-1">shooter_embeddings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.layers[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].get_weights()[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span></code></pre></div>
<p>What do we have?</p>
<div class="sourceCode" id="cb90" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb90-1">shooter_embeddings[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>]</span></code></pre></div>
<pre class="plaintext"><code>array([[-0.07189947, -0.08770541,  0.04801337],
       [-0.03720884, -0.04936351,  0.04588475],
       [-0.1379198 , -0.04528455,  0.08142862]], dtype=float32)</code></pre>
<p>The shape.</p>
<div class="sourceCode" id="cb92" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb92-1">shooter_embeddings.shape</span></code></pre></div>
<pre class="plaintext"><code>(870, 50)</code></pre>
</section>
</section>
<section id="map-the-embeddings-to-the-players" class="level2">
<h2 class="anchored" data-anchor-id="map-the-embeddings-to-the-players">Map the embeddings to the players</h2>
<p>The embeddings are related to a player, so we are intereseted extracting these vectors and looking at player similarity, etc.</p>
<section id="r-11" class="level3">
<h3 class="anchored" data-anchor-id="r-11">R</h3>
<blockquote class="blockquote">
<p>This is to help with some of the mapping. There may be more elegant ways to do this, but below is intuitive and simple in my opinion.</p>
</blockquote>
<div class="sourceCode" id="cb94" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb94-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># build our vocabulary (player) dataframe</span></span>
<span id="cb94-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># https://www.r-bloggers.com/word-embeddings-with-keras/</span></span>
<span id="cb94-3">players <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb94-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">playerid =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(shooter_tokenizer<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>word_index), </span>
<span id="cb94-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">id =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.integer</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unlist</span>(shooter_tokenizer<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>word_index)), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">stringsAsFactors=</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)</span>
<span id="cb94-6"></span>
<span id="cb94-7">players <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">arrange</span>(players, id)</span></code></pre></div>
<p>The embeddings with names and references</p>
<div class="sourceCode" id="cb95" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb95-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># keep only those rows where the indexes align - R is 1-based</span></span>
<span id="cb95-2">shooter_embeddings <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> shooter_embeddings[players<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>id, ]</span>
<span id="cb95-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rownames</span>(shooter_embeddings) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> players<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>playerid</span>
<span id="cb95-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(shooter_embeddings) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"e"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ncol</span>(shooter_embeddings))</span>
<span id="cb95-5">shooter_embeddings[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]</span></code></pre></div>
<pre class="plaintext"><code>                 e1         e2          e3
8477492 -0.03280475 0.02733728 -0.01639561
8471214  0.08887081 0.12082704  0.12113131
8474157  0.09737433 0.06078419  0.05767342</code></pre>
</section>
<section id="python-11" class="level3">
<h3 class="anchored" data-anchor-id="python-11">Python</h3>
<div class="sourceCode" id="cb97" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb97-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># make the embed vectors a pandas dataframe</span></span>
<span id="cb97-2">shooter_embeddings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(shooter_embeddings)</span>
<span id="cb97-3"></span>
<span id="cb97-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># a list of true shooter ids</span></span>
<span id="cb97-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#shooter_id = [v for k, v in shooter_tokenizer.index_word.items()]</span></span>
<span id="cb97-6">shooter_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {k:v <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> k, v <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> shooter_tokenizer.index_word.items()}</span>
<span id="cb97-7">shooter_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame.from_dict(shooter_id, orient<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'index'</span>, columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"playerid"</span>])</span>
<span id="cb97-8"></span>
<span id="cb97-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># name the columns</span></span>
<span id="cb97-10">shooter_embeddings.columns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"e"</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>(i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(shooter_embeddings.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])]</span>
<span id="cb97-11"></span>
<span id="cb97-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># align the data by index</span></span>
<span id="cb97-13">shooter_embeddings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.merge(shooter_embeddings, shooter_df, how<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'inner'</span>, left_index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, right_index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb97-14"></span>
<span id="cb97-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># clean up the index so its the player</span></span>
<span id="cb97-16">shooter_embeddings.index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> shooter_embeddings.playerid</span>
<span id="cb97-17"></span>
<span id="cb97-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the first few</span></span>
<span id="cb97-19">shooter_embeddings.iloc[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, :<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]</span></code></pre></div>
<pre class="plaintext"><code>                e1        e2        e3
playerid                              
8477492   0.093861 -0.071899 -0.087705
8471214   0.072550 -0.037209 -0.049364
8474157   0.059455 -0.137920 -0.045285</code></pre>
</section>
</section>
<section id="export-the-data-to-tableau" class="level2">
<h2 class="anchored" data-anchor-id="export-the-data-to-tableau">Export the data to Tableau</h2>
<p>Whether it is R or python, you might be asking why I am exporting the data to Tableau. That is a fair question, but the point is to show how the ecosystem of data science programming libraries can also leverage best-of-breed data visualization suites such as Tableau. The tool plays a key role in my exploratory analysis pipeline, and the goal below is show how in 1-line of code, we can export our data for rapid exploration, which can aid in our data cleaning and modeling tasks within R/python.</p>


</section>

 ]]></description>
  <category>R</category>
  <category>Python</category>
  <category>NHL</category>
  <category>Tensorflow</category>
  <category>Keras</category>
  <guid>https://brocktibert.com/posts/migration/tensorflow-tip-ins-and-tableau-oh-my.html</guid>
  <pubDate>Thu, 05 Mar 2020 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Python Development with Rstudio using Reticulate</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/migration/python-development-with-rstudio-using-reticulate.html</link>
  <description><![CDATA[ 





<p>I have been diving back into python a bit lately, and admittedly, I have yet to find a tool that fits my workflow similar to that of R and Rstudio. There are all sorts of tools out there, but in the end, it feels like I am fighting the tool, not my code.</p>
<p>To be honest, I really like using <a href="https://code.visualstudio.com/">VSCode</a> for other projects, but I feel like this product is aimed more at developers working on large applications, not data scientists.</p>
<p>My teaching tool of choice is <a href="https://colab.research.google.com/notebooks/welcome.ipynb#recent=true">Google Colaboratory</a>, but the lack of a dedicated <code>R</code> runtime is really brutal. There are workarounds, but this renders many of the features that I love about Colab useless. For example, you can’t connect your session to Google Drive. To me, and when I teach in class, this is a deal breaker.</p>
<section id="rstudio-and-reticulate" class="level2">
<h2 class="anchored" data-anchor-id="rstudio-and-reticulate">Rstudio and Reticulate</h2>
<p>There has been plenty written about <a href="https://rstudio.github.io/reticulate/">reticulate</a>, so I will let you dive into the tutorials and background. It is not without its quirks, but by and large, the combo works really well, especially for a younger solution. Historically there have been other attempts to bridge the gap, but from a feature and usability perspective, this is by far the most robust offering if you ask me.</p>
<p>Over the last year, I have been (slowly) working on a python package to help facilitate the collection and analysis of datasets that are openly available within higher education. I mention the tools above because my development has been really slow <strong>outside</strong> of RStudio. Last week, I got fed up and came back to RStudio. If I must say, the experience has been really pleasant, but more importantly, I am writing code at a much faster rate. Is it because I am more comfortable with the Rstudio interface? Perhaps, but I really do believe RStudio <em>could be</em> THE data science IDE of the future.</p>
<p>With that said, here are a few things that tripped me up along the way. This is not meant to call out the quirks of developing python packages using RStudio and reticulate, but it is a note to my future self as to the tricks necessary to work around some issues that are pretty annoying.</p>
<section id="restarting-r-sessions" class="level3">
<h3 class="anchored" data-anchor-id="restarting-r-sessions">1. Restarting R sessions</h3>
<p>Reticulate is fantastic and can hook into environments on our machine. For my package above, I manage my environment using <a href="https://docs.conda.io/en/latest/miniconda.html">conda</a>. Here is the thing. If I make a change to my package, and need to retest the code locally, things start to get hairy. You may properly uninstall/install your work locally, even in the conda environment, but you won’t see the changes in Rstudio.</p>
<p>The flow below solves the issue above, and represents the process by which I have been editing and seamlessly testing my code all without leaving RStudio.</p>
<ol type="1">
<li>Open Rstudio project for my python package</li>
<li>Load the reticulate package, set conda with <code>use_condaenv()</code> and then <code>repl_python</code></li>
<li>Make edits to my functions, methods, whatever. The trick here is that now we have an interactive python repl which is <strong>very</strong> helpful as I step through the development of methods and classes.</li>
<li>With the changes made, on the Terminal tab within Rstudio, uninstall the package with <code>pip uninstall &lt;packagename&gt;</code> and then <code>pip install .</code></li>
<li>Once the package is updated, <strong>you must restart your R session</strong> via Session &gt; Restart R. Failure to do so will not bring in the changes to your package, which is now updated within your conda environment.</li>
<li>After restarting R, repeat step 2.</li>
<li>Voila, it works!</li>
</ol>
<p>It’s not the worst workflow, but without it, you inevitably will be banging your head against the wall wondering why your python package updates didn’t hold.</p>
<blockquote class="blockquote">
<p>If there is a flaw above, or an easier way to address my issue, please reach out. As I noted above, Rstudio for python package development is my preferred solution at the moment.</p>
</blockquote>
</section>
<section id="history-doesnt-work" class="level3">
<h3 class="anchored" data-anchor-id="history-doesnt-work">2. History doesn’t work</h3>
<p>The section heading says it all. While we log the commands to the History tab, if you try to send a selected entry to the console, the python repl will break.</p>
</section>
<section id="incomplete-function-runs" class="level3">
<h3 class="anchored" data-anchor-id="incomplete-function-runs">3. Incomplete function runs</h3>
<p>This isn’t the worst, but it tripped me up once or twice. When I select the full block of code for a function or method definition, the code will bulk run in the repl, but I have to select the repl and hit enter for the code to fully execute. It’s as if we the repl doesn’t know that we are done with our code run.</p>
<blockquote class="blockquote">
<p>This might be happening if I select only the function, but that said, it does appear to occur here and there.</p>
</blockquote>
</section>
<section id="environment" class="level3">
<h3 class="anchored" data-anchor-id="environment">4. Environment</h3>
<p>I feel like this works from time to time, but the objects in my python session are not shown within the Environment tab.</p>
<p>To me, this would be a game changer, and have feature parity with <a href="https://www.spyder-ide.org/">spyder</a>.</p>


</section>
</section>

 ]]></description>
  <category>R</category>
  <category>Python</category>
  <guid>https://brocktibert.com/posts/migration/python-development-with-rstudio-using-reticulate.html</guid>
  <pubDate>Tue, 17 Dec 2019 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Carbon Code Snippets</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/migration/carbone-code-snippets.html</link>
  <description><![CDATA[ 





<p>I recently learned today of [carbon](https://carbon.now.sh/?bg=rgba(171%2C%20184%2C%20195%2C%201&amp;t=seti&amp;wt=none&amp;l=auto&amp;ds=true&amp;dsyoff=20px&amp;dsblur=68px&amp;wc=true&amp;wa=true&amp;pv=56px&amp;ph=56px&amp;ln=false&amp;fm=Hack&amp;fs=14px&amp;lh=133%25&amp;si=false&amp;es=2x&amp;wm=false) and it is absolutely fantastic.</p>
<p>In my own words, carbon provides a terminal-like formatting for your code snippets, which can be included in blog posts and the like. It just makes things easier to read, in my opinion.</p>
<p>Where my head goes is taking a snippet that looks like this:</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">options</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">stringsAsFactors =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)</span>
<span id="cb1-2"></span>
<span id="cb1-3"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## load the packages</span></span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(wakefield)</span>
<span id="cb1-5"></span>
<span id="cb1-6"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## generate a dataset of random users</span></span>
<span id="cb1-7">users <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">r_data_frame</span>(</span>
<span id="cb1-8">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>,</span>
<span id="cb1-9">  id,</span>
<span id="cb1-10">  state,</span>
<span id="cb1-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">date_stamp</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">name=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"registration_date"</span>),</span>
<span id="cb1-12">  dob,</span>
<span id="cb1-13">  language</span>
<span id="cb1-14">)</span>
<span id="cb1-15">users<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>ID <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.numeric</span>(users<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>ID)</span></code></pre></div>
<p>Could look like below. It does not add a ton of value to the code itself, but the legibility and aesthetic does elevate the visual nature of the post.</p>
<p>![Carbon Embed](https://carbon.now.sh/embed/?bg=rgba(171%2C%20184%2C%20195%2C%201&amp;t=seti&amp;wt=none&amp;l=auto&amp;ds=true&amp;dsyoff=20px&amp;dsblur=31px&amp;wc=true&amp;wa=false&amp;pv=0px&amp;ph=0px&amp;ln=false&amp;fm=Hack&amp;fs=14px&amp;lh=133%25&amp;si=false&amp;es=4x&amp;wm=false)</p>



 ]]></description>
  <category>R</category>
  <guid>https://brocktibert.com/posts/migration/carbone-code-snippets.html</guid>
  <pubDate>Sun, 24 Feb 2019 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Convert R dataframe to JSON</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/migration/r-dataframe-to-json.html</link>
  <description><![CDATA[ 





<p>Below is a post aimed at my future self. Be forewarned.</p>
<p>The idea is to take an <code>R</code> data frame and convert it to a JSON object where each entry in the JSON is a row from my dataset, and the entry has key/value (<code>k/v</code>) pairs where each column is a key.</p>
<p>Finally, if the value is missing for an arbitrary key, remove that <code>k/v</code> pair from the JSON entry.</p>
<p>Huh?</p>
<section id="generate-a-dataset" class="level3">
<h3 class="anchored" data-anchor-id="generate-a-dataset">Generate a dataset</h3>
<p>This is probably more easily explained via a toy dataset.</p>
<blockquote class="blockquote">
<p>I am using the <code>wakefield</code> package, which I believe to be fantastic. Check it out <a href="https://github.com/trinker/wakefield">here</a>.</p>
<p>The <code>README</code> really starts to highlight what is possible in regards to <a href="https://towardsdatascience.com/synthetic-data-generation-a-must-have-skill-for-new-data-scientists-915896c0c1ae">synthentic data generation</a>.</p>
</blockquote>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># load the libraries</span></span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">suppressPackageStartupMessages</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyverse))</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(wakefield)</span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(jsonlite)</span></code></pre></div>
<p>If you have any issues with above, it should be as easy as <code>install.packages('&lt;package_name_here&gt;')</code> to install the required package(s).</p>
<p>Now that we are good to go, let’s build out a fake dataset to demonstrate my use-case.</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## synthetic dataset using wakefield -- super expressive, right?</span></span>
<span id="cb2-2">dat <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">r_data_frame</span>(</span>
<span id="cb2-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,</span>
<span id="cb2-4">  id,</span>
<span id="cb2-5">  race,</span>
<span id="cb2-6">  age,</span>
<span id="cb2-7">  died</span>
<span id="cb2-8">) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">r_na</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prob =</span> .<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb2-10"></span>
<span id="cb2-11"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## take a quick peak at the data</span></span>
<span id="cb2-12"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glimpse</span>(dat)</span>
<span id="cb2-13">Rows<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span></span>
<span id="cb2-14">Columns<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span></span>
<span id="cb2-15"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span> ID   <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>chr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"3"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"4"</span></span>
<span id="cb2-16"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span> Race <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>fct<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span>, White, Bi<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>Racial, <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span></span>
<span id="cb2-17"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span> Age  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>int<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">62</span>, <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">52</span></span>
<span id="cb2-18"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span> Died <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>lgl<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span>, <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>, <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>, <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span></span></code></pre></div>
<p>What we have is 4 rows of data, and 4 columns of data. However, the bigger point is that we can see that each of the final 3 variables have some <code>NA</code> injected into their values. This is done via the <code>r_na</code> function available in <code>wakefield</code>.</p>
<blockquote class="blockquote">
<p>You should really check out the docs with <code>?r_na</code></p>
</blockquote>
<p>Ok, moving on.</p>
</section>
<section id="parsing" class="level3">
<h3 class="anchored" data-anchor-id="parsing">Parsing</h3>
<p>The goal is to generate JSON datasets where each entry is a record from our dataset, with the appropriate key:value pairs representing the features for each observation.</p>
<p>The <code>jsonlite</code> package is feature-rich, and while its totally <code>RTFM</code>, I only just noticed the <code>dataframe</code> parameter for <code>toJSON</code>, which takes one of 3 options:</p>
<ul>
<li><code>rows</code></li>
<li><code>columns</code></li>
<li><code>values</code></li>
</ul>
<p>Not really knowing what each did, the code below applies each transformation, and then generates an R <code>list</code> for each.</p>
<p>What I expect is a list of length that is equivalent to 4, with each entry being another list (the actual row of data) with a length of 1 to 4, but not exactly 4 for each entry as the <code>NA</code> values should be omitted from the JSON entry.</p>
<p>The code below creates three objects, one testing each value.</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## write to a json file - note how to handle dataframes</span></span>
<span id="cb3-2">dat_r <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">toJSON</span>(dat, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dataframe =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rows"</span>)</span>
<span id="cb3-3">dat_c <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">toJSON</span>(dat, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dataframe =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"columns"</span>)</span>
<span id="cb3-4">dat_v <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">toJSON</span>(dat, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dataframe =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"values"</span>)</span>
<span id="cb3-5"></span>
<span id="cb3-6"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## bring the character vectors back but as lists in R</span></span>
<span id="cb3-7"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## might be a way to do this in jsonlite, but my old habits stay here</span></span>
<span id="cb3-8">dat_rl <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> rjson<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fromJSON</span>(dat_r)</span>
<span id="cb3-9">dat_cl <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> rjson<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fromJSON</span>(dat_c)</span>
<span id="cb3-10">dat_vl <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> rjson<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fromJSON</span>(dat_v)</span></code></pre></div>
<p>In recent versions of Rstudio, it’s possible to inspect lists within the Environment pane. Given these are small objects, I would encourage you to poke around with the <code>dat_[r|c|v]l</code> objects from above.</p>
</section>
<section id="rows-to-the-answer" class="level2">
<h2 class="anchored" data-anchor-id="rows-to-the-answer"><code>rows</code> to the answer</h2>
<p>Jumping ahead, let’s look at the printout of both <code>dat</code> and the list which was built using the <code>columns</code> as the value to the <code>dataframe</code> parameter.</p>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1">dat</span>
<span id="cb4-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># A tibble: 4 x 4</span></span>
<span id="cb4-3">  ID    Race        Age Died</span>
<span id="cb4-4">  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>chr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span>fct<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span>     <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span>int<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span>lgl<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb4-5"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>     <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span>         <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span></span>
<span id="cb4-6"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>     White        <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">62</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span></span>
<span id="cb4-7"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>     Bi<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>Racial    <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span></span>
<span id="cb4-8"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>     <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span>         <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">52</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span></span></code></pre></div>
<p>And now the list form, which represents <strong>exactly</strong> the JSON format I am looking for.</p>
<div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lapply</span>(dat_rl, names)</span>
<span id="cb5-2">[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb5-3">[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ID"</span></span>
<span id="cb5-4"></span>
<span id="cb5-5">[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]]</span>
<span id="cb5-6">[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ID"</span>   <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Race"</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Age"</span>  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Died"</span></span>
<span id="cb5-7"></span>
<span id="cb5-8">[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]]</span>
<span id="cb5-9">[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ID"</span>   <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Race"</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Died"</span></span>
<span id="cb5-10"></span>
<span id="cb5-11">[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>]]</span>
<span id="cb5-12">[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ID"</span>  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Age"</span></span></code></pre></div>
<p>We can see from above that the length is what we expected, and the columns of data for each observation are not the same, as the <code>NA</code> values were omitted from the data structure.</p>
</section>
<section id="so-what" class="level2">
<h2 class="anchored" data-anchor-id="so-what">So what</h2>
<p>The next logical question is why should I care, right?</p>
<ul>
<li>When converting between various data formats, if we store the fields with missing data, it’s just taking up space on disk</li>
<li>I am a huge fan of <code>neo4j</code>, and a quick way to create nodes is to pass a JSON entry, but the key is that you start to get into trouble when attempting to write properties (keys) that have missing values.</li>
<li>Don’t attempt to reinvent the wheel</li>
</ul>


</section>

 ]]></description>
  <category>R</category>
  <guid>https://brocktibert.com/posts/migration/r-dataframe-to-json.html</guid>
  <pubDate>Mon, 21 Jan 2019 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Rasa Chatbot Setup</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/migration/rasa-chatbot-setup.html</link>
  <description><![CDATA[ 





<section id="rasa-chatbot-setup" class="level1">
<h1>Rasa Chatbot Setup</h1>
<p>In this post, I am going to walk through some issues that I recently encountered when attempting to get up and running with the <a href="https://rasa.com/">Rasa stack</a>. I am a big fan of the work they are doing, and by and large, it makes a complex problem, chatbot development, accessible and leverages machine learning under the hood. This is in contrast to tools that levergae simple rule-based approaches.</p>
<p>Below we will be using <a href="https://conda.io/docs/glossary.html#miniconda-glossary">conda</a> to manage our python environments and ensure that the package dependencies align. I will assume that you have conda properly installed, and will later show how to get the toolset configured on a Digital Ocean droplet.</p>
<section id="why-this-post" class="level4">
<h4 class="anchored" data-anchor-id="why-this-post">Why this post</h4>
<p>Previously, installing rasa was a breeze. Admittedly I never really had issues until recently. I am uncertain if the issue occurred on my machine with the new Mac OS (I am a Mac user), or if new features were added and downstream dependencies were missing from my machine. Regardless, I want to remind myself of how I was able to fix the issue with the help of this <a href="https://github.com/RasaHQ/rasa_core/issues/1426">Github Issue</a>. I will do the setup on a Mac, as well as a simple cloud droplet on Digital Ocean.</p>
</section>
<section id="install-on-a-mac" class="level2">
<h2 class="anchored" data-anchor-id="install-on-a-mac">Install on a Mac</h2>
<p>The first thing that we need to do is create an environment in conda. An environment encapsulates our tooling and helps avoid issues where an upgrade to a python package could break your ability to run your previous code. Basically, use environments whenever possible.</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">conda</span> create <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-n</span> rasabot python=3.6</span></code></pre></div>
<p>With above, the conda tool will manage the dependencies for you and get a base install of python 3.6 contained with an environment called <code>rasabot</code>. Let’s activate this environment:</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">source</span> activate rasabot</span></code></pre></div>
<p>On my machine, my terminal now tells me that I am in the rasabot environment</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">rasabot</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span> <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">QST-ML-0208:brocktibert</span> btibert$</span></code></pre></div>
<p>Ok, one thing that I want to do is enforce my version of numpy. Let’s install that below.</p>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install numpy==1.14.5</span></code></pre></div>
<p>Ok, with that covered, let’s get Rasa Core setup.</p>
<div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb5-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install rasa_core</span></code></pre></div>
<p>As noted from the solution on the Github ticket, your machine may need to have the Mac Command Line tools installed (which is found in the Stackoverflow Answer). In short, you will create an Apple Developer account and install the tool as described <a href="https://stackoverflow.com/a/52511046">here</a>.</p>
<p>Ok, with rasa seemingly good to go, let’s validate:</p>
<div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb6-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">python</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-c</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"import rasa_core; print(rasa_core.__version__)"</span></span></code></pre></div>
<p>This returned <code>0.12.3</code>.</p>
<p>Ok, we also will want to leverage Rasa NLU. We will install that below.</p>
<div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb7-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install rasa_nlu<span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">[</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">tensorflow</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">]</span></span></code></pre></div>
<p>And we will do the same for validation:</p>
<div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb8-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">python</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-c</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"import rasa_nlu; print(rasa_nlu.__version__)"</span></span></code></pre></div>
<p>yields <code>0.13.8</code></p>
<p>We are off to the races.</p>
</section>
<section id="setup-on-digital-ocean" class="level2">
<h2 class="anchored" data-anchor-id="setup-on-digital-ocean">Setup on Digital Ocean</h2>
<p>In addition to local dev, I wanted to make sure that I could spin up rasa in the cloud. I really enjoy using Digital Ocean, but this is applicable anywhere.</p>
<p>Once you have ssh’d into your remote box:</p>
<div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb9-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">apt-get</span> update</span>
<span id="cb9-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">apt-get</span> install</span>
<span id="cb9-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">apt-get</span> install libxml2-dev libxslt-dev</span>
<span id="cb9-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> apt-get install gcc</span>
<span id="cb9-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">wget</span> https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh</span>
<span id="cb9-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bash</span> Miniconda3-latest-Linux-x86_64.sh</span>
<span id="cb9-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">exit</span></span>
<span id="cb9-8"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">conda</span> create <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-n</span> rasabot python=3.6</span>
<span id="cb9-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">source</span> activate rasabot</span>
<span id="cb9-10"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install numpy==1.14.5</span>
<span id="cb9-11"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install rasa_core</span>
<span id="cb9-12"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install rasa_nlu<span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">[</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">tensorflow</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">]</span></span></code></pre></div>
<p>Above, when you install Miniconda, you will be prompted to hit enter and confirm the locations. I accepted all of the defaults.</p>
<p>That’s it!</p>


</section>
</section>

 ]]></description>
  <category>Data Science</category>
  <guid>https://brocktibert.com/posts/migration/rasa-chatbot-setup.html</guid>
  <pubDate>Thu, 13 Dec 2018 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Usefull Snippets for using Neo4j within R</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/migration/2015-11-29-useful-snippets-interfacing-with-neo4j-from-r.html</link>
  <description><![CDATA[ 





<section id="usefull-snippets-for-using-neo4j-within-r" class="level1">
<h1>Usefull Snippets for using Neo4j within R</h1>
<p>I have been playing with Neo4j quite a bit, mostly for fun as I learn how I figure out when and where I could apply it to solve various analytics problems. Neo4j, at it’s core, is a database, which allows us to query data in a structured way. While the graph model within Neo4j is very flexible, the <code>cypher</code> query language is fantastic. Once you get over the learning curve, with only a few lines of code you can do some really powerful queries.</p>
<p>With that said, I have increasingly realized that it’s better to move the analytics outside of the database. Even though you can execute <code>cypher</code> statements against the database, how and where you execute them will matter once you go beyond “toy datasets.”</p>
<p>This post is to two helpful code snippets that I inject into my workflow when combining <code>R</code> and <code>Neo4j</code> for my project.</p>
<section id="snippet-1-load-csv-from-within-r" class="level2">
<h2 class="anchored" data-anchor-id="snippet-1-load-csv-from-within-r">Snippet 1: Load CSV from within R</h2>
<p>First off, I use <a href="https://github.com/nicolewhite/RNeo4j">RNeo4j</a> to connect R to <code>Neo4j</code>. It is totally possible to load data within dataframes into <code>Neo4j</code> using the <code>?cypher</code> command; Nicole shows you how to do this in the <code>Import</code> section of the README on the project page. But be warned, once you get to larger datasets, this might not be your best option with respect to speed performance.</p>
<p>If you haven’t already, play around with the <code>neo4j-shell</code> and the <code>LOAD CSV</code> functionality. It’s pretty fast and can handle files of a few million records.</p>
<p>The snippet below is a quick way to simply call that procedure from within your <code>R</code> script.</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1">NEO_SHELL <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"~/neo4j-community-2.3.1/bin/neo4j-shell"</span></span>
<span id="cb1-2">build_import <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(neo_shell, cypher_file) {</span>
<span id="cb1-3">  cmd <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sprintf</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"%s -file %s"</span>, neo_shell, cypher_file)</span>
<span id="cb1-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">system</span>(cmd)</span>
<span id="cb1-5">}</span></code></pre></div>
<p>Just point to your shell and store it in the variable <code>NEO_SHELL</code>. From calling it is simple.</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">build_import</span>(NEO_SHELL, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"../cypher/add-contstraints.cql"</span>)</span></code></pre></div>
<p>I simply reference the shell, and pass a text file that contains my cypher statements.</p>
<p>From my experiences, there are huge performance gains when using the shell to execute commands, so the helper function above let’s me stay within R but benefit from the performance gains.</p>
</section>
<section id="snippet-2-read-a-cypher-query-file-into-r" class="level2">
<h2 class="anchored" data-anchor-id="snippet-2-read-a-cypher-query-file-into-r">Snippet 2: Read a Cypher query file into <code>R</code></h2>
<p>When developing your code, inevitably you will be playing around with the incredibly helpful browser tool. I use it to prototype my queries, especially before running <code>LOAD CSV</code> on a larger file.</p>
<p>When using R, I have two options.</p>
<ol type="1">
<li>Type in the cypher query into a string, and then pass the query to the <code>cypher</code> function within the R package. Or,</li>
<li>Development my queries in separate text files. I use a <code>.cql</code> extension. To me, it’s easier to maintain, but needs to be brought into R.</li>
</ol>
<p>The process below will bring in a cypher query as a statement into <code>R</code> that can be further passed to the cypher function.</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## read the cypher query into a string variable  </span></span>
<span id="cb3-2"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## http://stackoverflow.com/questions/9068397/import-text-file-as-single-character-string  </span></span>
<span id="cb3-3">FILE <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"../cypher/get-edges.cql"</span></span>
<span id="cb3-4">cypher_edges <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">readChar</span>(FILE,<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">file.info</span>(FILE)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>size)</span>
<span id="cb3-5"></span>
<span id="cb3-6"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## get the edges</span></span>
<span id="cb3-7">edges <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cypher</span>(graph, cypher_edges)</span></code></pre></div>
<p>Above, I have a query file that I read into the variable <code>cypher_edges</code> and pass that to cypher. I use this query to return data back to R. To me, I would rather manage my queries in separate files, not within R, and this allows me to do that. Moreover, I believe it makes organizing your project’s structure much easier.</p>
</section>
<section id="aside-my-current-workflow" class="level2">
<h2 class="anchored" data-anchor-id="aside-my-current-workflow">Aside: My current workflow</h2>
<p>I recently implemented the two helper snippets above when working through a dataset for work. After trying to load data, and calculate similarities within the database (a future post), I arrived at this workflow. Given my tool of choice is <code>R</code>, the workflow below is both manageable and pretty fast.</p>
<ol type="1">
<li>Query my data warehouse for new data</li>
<li>Use <code>R</code> to clean and tidy the data for import into <code>Neo4j</code></li>
<li>Write the data to CSV files that can be consumed by <code>LOAD CSV</code>. I use Snippet 1 to do that.</li>
<li>Once the data are in the database, bring back a subgraph of interest into R using Snippet 2.</li>
<li>Manipulate the data as needed. For example, calculate jaccard similarities across the nodes of interest.</li>
<li>Write the similarities to another csv file.</li>
<li>Use Snippet 1 again to write the similarities back into the database for use in other queries.</li>
</ol>
<p>For one project, in less than 5 minutes, I was able to import two years of applicants, along with various demographics on that pool, calculate the similarity based on interactions, and write those similarities (nearly 2.8 million rows) back to the database. In terms of speed, this workflow will enable me to opertationalize all sorts of models that can be further implemented into marketing and recruitment efforts. I will post on this shortly …</p>
</section>
<section id="summary" class="level2">
<h2 class="anchored" data-anchor-id="summary">Summary</h2>
<p>This post was meant to capture two helpful code snippets that improve my workflow, and speed, of analyzing datasets in <code>R</code> and <code>Neo4j</code>.</p>


</section>
</section>

 ]]></description>
  <category>Data Science</category>
  <guid>https://brocktibert.com/posts/migration/2015-11-29-useful-snippets-interfacing-with-neo4j-from-r.html</guid>
  <pubDate>Sat, 08 Dec 2018 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Using DiagrammeR to help with Data Modeling in Neo4j</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/migration/2015-12-08-diagrammer-for-neo4j-data-modeling.html</link>
  <description><![CDATA[ 





<section id="using-diagrammer-to-help-with-data-modeling-in-neo4j" class="level1">
<h1>Using DiagrammeR to help with Data Modeling in Neo4j</h1>
<p>I have been watching the <a href="http://rich-iannone.github.io/DiagrammeR/index.html">DiagrammeR</a> package for a while now, and at this stage, it’s pretty impressive. I encourage you to take a look at what is possible, but be assured the framework is there to do some really awesome things.</p>
<p>One use-case that applies to me is that of data modeling an app within <a href="http://neo4j.com/">Neo4j</a>. There are already some tools out there, namely:</p>
<ul>
<li><a href="http://www.apcjones.com/arrows/">Arrows</a></li>
<li><a href="http://graphgen.graphaware.com/">Graphgen by GraphAware</a></li>
<li><a href="https://gist.github.com/nawroth/5880880">And you can always use graphgists</a></li>
</ul>
<p>The last link above is a sample graph gist that is a decent overview.</p>
<p>In this post, however, I am going to demo the idea that you can use <code>DiagrammeR</code> to assist in the data modeling process. The benefits, in my opinion, are:</p>
<ul>
<li><strong>Reproducibility.</strong> The arrows tool above is a fantastic in-browser solution, but it lends itself to working on one model at a time. And when you want to restore a previous data model, you have to re-build it again through point-and-click.</li>
<li><strong>The syntax is pretty expressive.</strong> The package builds on top of <a href="http://www.graphviz.org/">Graphviz</a>. Read through the documentation. The syntax is fairly straightforward but enables you to do some really powerful diagrams, including ERDs for a relational database.</li>
</ul>
<section id="a-basic-model" class="level2">
<h2 class="anchored" data-anchor-id="a-basic-model">A Basic Model</h2>
<p>The code and data model below are intended to highlight a simple proof-of-concept about how you might leverage graphViz to make the data-modeling tasks in Neo4j easier.</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">grViz</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb1-2"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      digraph neo4j {</span></span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      # a 'graph' statement</span></span>
<span id="cb1-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      graph [overlap = false, fontsize = 10]</span></span>
<span id="cb1-6"></span>
<span id="cb1-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      # several 'node' statements</span></span>
<span id="cb1-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      node [shape = circle, fontname = Helvetica]</span></span>
<span id="cb1-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      a [label = 'Student'];</span></span>
<span id="cb1-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      b [label = '@@1-1'];</span></span>
<span id="cb1-11"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      c [label = '@@1-2'];</span></span>
<span id="cb1-12"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      d [label = '@@1-3'];</span></span>
<span id="cb1-13"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      e [label = '@@1-4'];</span></span>
<span id="cb1-14"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      f [label = '@@1-5'];</span></span>
<span id="cb1-15"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      g [label = 'Marketing Persona'];</span></span>
<span id="cb1-16"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      h [label = 'Gender'];</span></span>
<span id="cb1-17"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      i [label = 'State'];</span></span>
<span id="cb1-18"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      j [label = 'Region'];</span></span>
<span id="cb1-19"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      k [label = '@@2-1'];</span></span>
<span id="cb1-20"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      l [label = '@@2-2'];</span></span>
<span id="cb1-21"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      m [label = '@@2-3'];</span></span>
<span id="cb1-22"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      n [label = '@@2-4'];</span></span>
<span id="cb1-23"></span>
<span id="cb1-24"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      # several 'edge' statements</span></span>
<span id="cb1-25"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      a -&gt; b [label = 'WAS_SENT' fontsize = 9.5];</span></span>
<span id="cb1-26"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      b -&gt; c [label = 'NEXT'];</span></span>
<span id="cb1-27"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      c -&gt; d [label = 'NEXT'];</span></span>
<span id="cb1-28"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      d -&gt; e [label = 'NEXT'];</span></span>
<span id="cb1-29"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      e -&gt; f [label = 'NEXT'];</span></span>
<span id="cb1-30"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      a -&gt; g [label = 'FROM_PERSONA'];</span></span>
<span id="cb1-31"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      a -&gt; h [label = 'HAS_GENDER'];</span></span>
<span id="cb1-32"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      a -&gt; i [label = 'LIVES_IN'];</span></span>
<span id="cb1-33"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      i -&gt; j [label = 'IS_IN'];</span></span>
<span id="cb1-34"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      a -&gt; k [label = 'HAS_INTERACTION' fontsize = 9.5];</span></span>
<span id="cb1-35"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      k -&gt; l [label = 'NEXT'];</span></span>
<span id="cb1-36"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      l -&gt; m [label = 'NEXT'];</span></span>
<span id="cb1-37"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      m -&gt; n [label = 'NEXT'];</span></span>
<span id="cb1-38"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      }</span></span>
<span id="cb1-39"></span>
<span id="cb1-40"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      [1]: rep('Email', 5)</span></span>
<span id="cb1-41"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      [2]: rep('Interaction', 4)</span></span>
<span id="cb1-42"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      "</span>,</span>
<span id="cb1-43"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">engine =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"circo"</span>)</span></code></pre></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://github.com/Btibert3/public-figs/blob/master/diagrammer-testing/diagrammer-testing.png?raw=true" class="img-fluid figure-img"></p>
<figcaption>example</figcaption>
</figure>
</div>
<p>From above, a few things that I wanted to call out:</p>
<ul>
<li>My example graph is very specific to Enrollment Management. In this case, the data are very student-centric, in that a Student is sent marketing emails, has various demographics associated with them, and interacts with you in a variety of ways (i.e., visit campus, request’s information, etc.). Your domains might yield “prettier” graphs.</li>
<li>I am leveraging the <code>@@</code> option to dynamically build the labels, which you reference at the end of the script through a footnote.</li>
<li>You can control a large number of elements. In a couple of cases, I manually specify the font size for the edge label.</li>
</ul>
</section>
<section id="summary" class="level2">
<h2 class="anchored" data-anchor-id="summary">Summary</h2>
<p>By no means is this a robust demonstration, but simply a quick post to demonstrate an option that you might want to leverage when documenting and building out your database. As mentioned above, the fact that you can reproduce your graph is why I will probably use <code>DiagrammeR</code> to work through my data modeling tasks.</p>


</section>
</section>

 ]]></description>
  <category>R</category>
  <guid>https://brocktibert.com/posts/migration/2015-12-08-diagrammer-for-neo4j-data-modeling.html</guid>
  <pubDate>Sat, 08 Dec 2018 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Predict Competition of Undergraduate Institutions using Neo4j</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/migration/2015-02-22-predict-competition-amongst-undergraduate-institutions-using-neo4j.html</link>
  <description><![CDATA[ 





<section id="predict-competition-of-undergraduate-institutions-using-neo4j" class="level1">
<h1>Predict Competition of Undergraduate Institutions using Neo4j</h1>
<p><strong>9 min read</strong></p>
<section id="intro" class="level3">
<h3 class="anchored" data-anchor-id="intro">Intro</h3>
<p>The use of graphs to solve business problems is not new, as companies like Amazon, Netflix, and nearly all major social media sites have been doing this for some time. I have been obsessed with graphs for just as long, and after learning as much as I can about analysis of graphs and graph databases, I am finally getting the time to take what I have learned and apply it to real world data problems I face within Enrollment Management.</p>
<p>In this post, I am going to explore the network structure of competition at the undergraduate level. To get the data, I crawled a popular “College Search Site” to extract the information that I needed. I won’t go into specifics, but the web is one huge dataset, and sometimes you need to be able to write a bot (computer program) to crawl and build your datasets.</p>
<p>In short, this post attempts to predict if one college is “similar” to another by leveraging the other institutions they are similar to, and then looking at the connections those schools have as well. For the time being, I am only going to look 1 hop away from the school of interest, but there is so much more that we can do with this type of database.</p>
<p>I am going to leverage <a href="http://neo4j.com/">Neo4j</a>, a graph database, for this analysis. Neo4j fits perfectly with this use case, as it not only belongs to the <code>NoSQL</code> family of databases, but with only a few lines of code, you can quickly generate recommendations, or predictions, from your graph. Pretty powerful!</p>
<p>Lastly, it’s worth noting that I am only looking at similarity between institutions. A more robust analysis would be the connections students make with these schools. The data exist, but outside of one company, I don’t know if other college search sites are taking advantage of this data structure.</p>
<blockquote class="blockquote">
<p>If you work at one of these companies and aren’t sure, reach out. <strong>I would love to chat about what we might be able to do!</strong> I am always looking for real-world datasets for the academic research I do outside of my role in Enrollment Management, so maybe we can work together!</p>
</blockquote>
<hr>
</section>
<section id="the-dataset" class="level3">
<h3 class="anchored" data-anchor-id="the-dataset">The Dataset</h3>
<p>As always, I am using <code>R</code> for my analysis. Nicole White has created an awesome R package <a href="https://github.com/nicolewhite/RNeo4j">RNeo4j</a> to connect to Neo4j. I am using that library below to connect to the database and wipe out all my data for a clean environment.</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## load the library</span></span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(RNeo4j)</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## connect to the graph</span></span>
<span id="cb1-5">graph <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">startGraph</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"http://localhost:7474/db/data/"</span>)</span>
<span id="cb1-6">graph<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>version</span>
<span id="cb1-7"></span>
<span id="cb1-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># clear it out entirely</span></span>
<span id="cb1-9"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## CAVEAT: deletes the data without user confirmation.  Do not copy and paste!</span></span>
<span id="cb1-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># cypher(graph, "MATCH (n) RETURN COUNT(n)")</span></span>
<span id="cb1-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">clear</span>(graph, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">input =</span> F)</span></code></pre></div>
<p>Now that we have a clean database, let’s talk about the data.</p>
<p>Lets suppose you have a dataset that looks like this:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th style="text-align: right;">from_unitid</th>
<th style="text-align: right;">to_unitid</th>
<th style="text-align: right;">from_rank</th>
<th style="text-align: right;">from_rating</th>
<th style="text-align: right;">to_rating</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: right;">199148</td>
<td style="text-align: right;">198507</td>
<td style="text-align: right;">25</td>
<td style="text-align: right;">4.0263</td>
<td style="text-align: right;">3.7073</td>
</tr>
<tr class="even">
<td style="text-align: right;">212160</td>
<td style="text-align: right;">213987</td>
<td style="text-align: right;">6</td>
<td style="text-align: right;">3.7730</td>
<td style="text-align: right;">3.8522</td>
</tr>
<tr class="odd">
<td style="text-align: right;">217484</td>
<td style="text-align: right;">217518</td>
<td style="text-align: right;">6</td>
<td style="text-align: right;">3.8444</td>
<td style="text-align: right;">4.2442</td>
</tr>
<tr class="even">
<td style="text-align: right;">152390</td>
<td style="text-align: right;">151351</td>
<td style="text-align: right;">3</td>
<td style="text-align: right;">3.9082</td>
<td style="text-align: right;">4.0535</td>
</tr>
<tr class="odd">
<td style="text-align: right;">188641</td>
<td style="text-align: right;">196176</td>
<td style="text-align: right;">18</td>
<td style="text-align: right;">3.6977</td>
<td style="text-align: right;">3.5899</td>
</tr>
<tr class="even">
<td style="text-align: right;">186122</td>
<td style="text-align: right;">190415</td>
<td style="text-align: right;">10</td>
<td style="text-align: right;">0.0000</td>
<td style="text-align: right;">4.5446</td>
</tr>
<tr class="odd">
<td style="text-align: right;">100858</td>
<td style="text-align: right;">221759</td>
<td style="text-align: right;">8</td>
<td style="text-align: right;">4.0607</td>
<td style="text-align: right;">3.9113</td>
</tr>
<tr class="even">
<td style="text-align: right;">193283</td>
<td style="text-align: right;">195988</td>
<td style="text-align: right;">8</td>
<td style="text-align: right;">3.1148</td>
<td style="text-align: right;">3.4138</td>
</tr>
<tr class="odd">
<td style="text-align: right;">173902</td>
<td style="text-align: right;">160977</td>
<td style="text-align: right;">24</td>
<td style="text-align: right;">4.4167</td>
<td style="text-align: right;">4.5600</td>
</tr>
<tr class="even">
<td style="text-align: right;">434584</td>
<td style="text-align: right;">102553</td>
<td style="text-align: right;">11</td>
<td style="text-align: right;">0.0000</td>
<td style="text-align: right;">3.3206</td>
</tr>
</tbody>
</table>
<p>Here is a description of what these fields represent.</p>
<ol type="1">
<li><code>from_unitid</code>: The IPEDS <code>unitid</code> for the school of interest.</li>
<li><code>to_unitid</code>“: The IPEDS <code>unitid</code> for the school that is similar to <code>from_unitid</code></li>
<li><code>from_rank</code>: A rank-ordered estimate of <em>how similar</em> the <code>from</code> institution is to the <code>to</code> institution</li>
<li><code>from_rating</code>: The rating - out of 5 stars - for the <code>to_unitid</code> school scraped from the source website</li>
<li><code>to_rating</code>: The rating - out of 5 stars - for the <em>similar</em> school</li>
</ol>
<p>There already has been a <a href="https://scholar.google.com/scholar?q=graph+edge+link+prediction">ton of research</a> on the prediction of edges (links) in graphs, so my approach to this problem is only scratching the surface, and quite frankly, basic.</p>
<p>Since we are trying to build a model that predicts if one school is defined as <strong>smililar</strong> to another, we need to remove a small portion of our dataset in order to validate the accuracy of that estimate.</p>
<p>The code below is how I randomly deleted a few edges from our graph in order to retain a test dataset</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## ensure that there are 25 rows for each school</span></span>
<span id="cb2-2">tmp <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tbl_df</span>(edges) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(from_unitid) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">num =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(to_unitid))</span>
<span id="cb2-3">tmp <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(tmp, num <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span>)</span>
<span id="cb2-4">edges <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">subset</span>(edges, from_unitid <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%in%</span> tmp<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>from_unitid)</span>
<span id="cb2-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rm</span>(tmp)</span>
<span id="cb2-6"></span>
<span id="cb2-7"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## split into the traing and test dataset</span></span>
<span id="cb2-8">ROWS <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(edges), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">150</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace=</span>F)</span>
<span id="cb2-9"></span>
<span id="cb2-10"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## training graph</span></span>
<span id="cb2-11">graph_test<span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> edges[ROWS, ]</span>
<span id="cb2-12">graph_train <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> edges[<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>ROWS,]</span>
<span id="cb2-13"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rm</span>(ROWS)</span>
<span id="cb2-14"></span>
<span id="cb2-15"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## how many unique schools in the test dataset</span></span>
<span id="cb2-16"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unique</span>(graph_test<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>from_unitid))</span>
<span id="cb2-17"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">save</span>(graph_test, graph_train, edges, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">file=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"data/graph-data.rdata"</span>)</span></code></pre></div>
</section>
<section id="database-creation" class="level3">
<h3 class="anchored" data-anchor-id="database-creation">Database Creation</h3>
<p>The image below is how we are translating the data into a network graph.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://github.com/Btibert3/public-figs/blob/master/brocktibert/school-data-model.png?raw=true" class="img-fluid figure-img"></p>
<figcaption>data-model</figcaption>
</figure>
</div>
<p>Our dataset is more comparable to Twitter than Facebook. School 1 is similar to School 2, but School 2 isn’t similar to School 1. In graph terms, this is a directed network. Unlike Facebook, where users both agree to to a friendship, our dataset is comprised of connections between institutions that may not be reciprocated.</p>
<p>Because of this, we can start to map out the competitive landscape within higher education. Even though the dataset comes from only one college search site, it’s a start, and allows us to look at who are competitors are competing with. That’s some powerful stuff!</p>
<p>Before we do any analysis, we need to put the data into Neo4j.</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## set the constraints first</span></span>
<span id="cb3-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addConstraint</span>(graph, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"School"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unitid"</span>)</span>
<span id="cb3-3"></span>
<span id="cb3-4"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## the query</span></span>
<span id="cb3-5">query <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb3-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">MERGE (s1:School { unitid:{from_unitid} } )</span></span>
<span id="cb3-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ON CREATE SET s1.rating = {from_rating}</span></span>
<span id="cb3-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">MERGE (s2:School { unitid:{to_unitid} } )</span></span>
<span id="cb3-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ON CREATE SET s2.rating = {to_rating}</span></span>
<span id="cb3-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">CREATE (s1) -[r:SIMILAR_TO {rank:{from_rank}}]-&gt; (s2)</span></span>
<span id="cb3-11"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb3-12"></span>
<span id="cb3-13"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## start the initial transaction</span></span>
<span id="cb3-14">tx <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">newTransaction</span>(graph)</span>
<span id="cb3-15"></span>
<span id="cb3-16"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## just using the small dataset for exploring and valid queries</span></span>
<span id="cb3-17">start <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Sys.time</span>()</span>
<span id="cb3-18"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (i <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(graph_train)) {</span>
<span id="cb3-19"> <span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## for every 500 rows, commit the transaction</span></span>
<span id="cb3-20"> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (i <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>) {</span>
<span id="cb3-21">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># commit the transaction</span></span>
<span id="cb3-22">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">commit</span>(tx)</span>
<span id="cb3-23">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cat</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Batch "</span>, i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" committed </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-24">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># open a new transaction</span></span>
<span id="cb3-25">  tx <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">newTransaction</span>(graph)</span>
<span id="cb3-26"> }</span>
<span id="cb3-27"></span>
<span id="cb3-28"> <span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## create the parameter values</span></span>
<span id="cb3-29"> from_unitid <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> graph_train[i, ]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>from_unitid</span>
<span id="cb3-30"> from_rating <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> graph_train[i, ]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>from_rating</span>
<span id="cb3-31"> to_unitid <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> graph_train[i, ]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>to_unitid</span>
<span id="cb3-32"> to_rating <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> graph_train[i, ]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>to_rating</span>
<span id="cb3-33"> from_rank <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> graph_train[i, ]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>from_rank</span>
<span id="cb3-34"></span>
<span id="cb3-35"> <span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## appead the query</span></span>
<span id="cb3-36"> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">appendCypher</span>(tx,</span>
<span id="cb3-37">              query,</span>
<span id="cb3-38">              <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">from_unitid =</span> from_unitid,</span>
<span id="cb3-39">              <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">from_rating =</span> from_rating,</span>
<span id="cb3-40">              <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">to_unitid =</span> to_unitid,</span>
<span id="cb3-41">              <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">to_rating =</span> to_rating,</span>
<span id="cb3-42">              <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">from_rank =</span> from_rank)</span>
<span id="cb3-43">} <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#endfor</span></span>
<span id="cb3-44"></span>
<span id="cb3-45"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## committ the last transaction and record the time it takes</span></span>
<span id="cb3-46"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">commit</span>(tx)</span>
<span id="cb3-47">end <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Sys.time</span>()</span>
<span id="cb3-48">end <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start</span></code></pre></div>
<p>Using cypher transactions, as I did above, isn’t the fastest way of getting data into Neo4j, but it’s clean and easy to read. For reference, it took a tad over 12 minutes to put our training dataset into the database.</p>
<p>If you need to throw larger volumes of data into the database, you should check out the shell tools that are part of the base install. More specifically, <code>neo4j-shell</code> and <code>neo4j-import</code>.</p>
</section>
<section id="explore-the-data" class="level3">
<h3 class="anchored" data-anchor-id="explore-the-data">Explore the data</h3>
<p>Now let’s confirm that we have some data.</p>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## make sure we are connected</span></span>
<span id="cb4-2">graph <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">startGraph</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"http://localhost:7474/db/data/"</span>)</span>
<span id="cb4-3"></span>
<span id="cb4-4"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## the number of nodes</span></span>
<span id="cb4-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cypher</span>(graph,</span>
<span id="cb4-6">       <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb4-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">       MATCH (n)</span></span>
<span id="cb4-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">       RETURN COUNT(n) as `Total Nodes`</span></span>
<span id="cb4-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">       "</span>)</span></code></pre></div>
<div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">  Total Nodes</span>
<span id="cb5-2"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>        <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3477</span></span></code></pre></div>
<div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## summary of the data model</span></span>
<span id="cb6-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(graph)</span></code></pre></div>
<div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1">    This         To   That</span>
<span id="cb7-2"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> School SIMILAR_TO School</span></code></pre></div>
<div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## how many relationships</span></span>
<span id="cb8-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cypher</span>(graph,</span>
<span id="cb8-3">       <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb8-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">       MATCH () -[r]- ()</span></span>
<span id="cb8-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">       WITH r</span></span>
<span id="cb8-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">       RETURN type(r), count(*) as total</span></span>
<span id="cb8-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">       "</span>)</span></code></pre></div>
<div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1">     <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">type</span>(r)  total</span>
<span id="cb9-2"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> SIMILAR_TO <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">161860</span></span></code></pre></div>
<p>Finally, let’s look at a quick plot from a small subset of the data</p>
<div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## the query</span></span>
<span id="cb10-2">query <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb10-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">MATCH (n)-[r]-&gt;(b)</span></span>
<span id="cb10-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">RETURN n.unitid AS from, b.unitid AS to, r.rank AS rank</span></span>
<span id="cb10-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">LIMIT 50</span></span>
<span id="cb10-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb10-7"></span>
<span id="cb10-8"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## the dataframe</span></span>
<span id="cb10-9">dat <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cypher</span>(graph, query)</span>
<span id="cb10-10">dat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>rand <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span></span>
<span id="cb10-11"></span>
<span id="cb10-12"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## graph the dataframe</span></span>
<span id="cb10-13">g <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">graph.data.frame</span>(dat, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">directed =</span> T)</span>
<span id="cb10-14"></span>
<span id="cb10-15"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## simple plot</span></span>
<span id="cb10-16"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggnet</span>(g, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span></code></pre></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://raw.githubusercontent.com/Btibert3/btibert3.github.io/master/images/predict-competition-amongst-undergraduate-institutions-using-neo4j-netplot-1.png" class="img-fluid figure-img"></p>
<figcaption>plot of chunk netplot</figcaption>
</figure>
</div>
<p>The plot above just pulls out 50 nodes. I won’t go into describing the graph, but I find it interesting that there are two schools that are common to the larger clusters, but these two institutions are not <strong>similar</strong> to each other.</p>
</section>
<section id="predicting-edges" class="level3">
<h3 class="anchored" data-anchor-id="predicting-edges">Predicting Edges</h3>
<p>As mentioned above, I wanted to create a very basic predictive model to test if leveraging the structure of the graph is better than just guessing.</p>
<p>I don’t know if it’s my machine, a sub-optimal <code>CYPHER</code> query, or simply the expected run-time, but generating the predictions for 150 schools took nearly two hours. I have had much better performance on larger data, so something doesn’t feel right here.</p>
<p>Let’s take a quick at the predictions to ensure the output matches our expectations.</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th style="text-align: right;">from</th>
<th style="text-align: right;">freq</th>
<th style="text-align: right;">pred_school</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: right;">218487</td>
<td style="text-align: right;">23</td>
<td style="text-align: right;">218964</td>
</tr>
<tr class="even">
<td style="text-align: right;">138789</td>
<td style="text-align: right;">11</td>
<td style="text-align: right;">140322</td>
</tr>
<tr class="odd">
<td style="text-align: right;">220127</td>
<td style="text-align: right;">11</td>
<td style="text-align: right;">219602</td>
</tr>
<tr class="even">
<td style="text-align: right;">420398</td>
<td style="text-align: right;">24</td>
<td style="text-align: right;">229027</td>
</tr>
<tr class="odd">
<td style="text-align: right;">244233</td>
<td style="text-align: right;">9</td>
<td style="text-align: right;">132471</td>
</tr>
<tr class="even">
<td style="text-align: right;">196006</td>
<td style="text-align: right;">20</td>
<td style="text-align: right;">196176</td>
</tr>
</tbody>
</table>
<p>Note that column <code>freq</code> is how many schools were included in the prediction. If we wanted to extend this, a school can have up to 25 <code>SIMILAR</code> schools. With this in hand, you could estimate a <code>confidence</code> of the prediction by <code>freq / 25</code>, or more accuractley, <code>freq / # Schools Similar to</code>. This would yield a percentage, with numbers closer to 1 indicating that nearly all other schools have this recommendation as a competitor.</p>
<p>We finally have what we need!</p>
</section>
<section id="results" class="level3">
<h3 class="anchored" data-anchor-id="results">Results</h3>
<p>Let’s see how the recommendation did.</p>
<p>First, in order to measure how well our simple model performs, we need to understand the <em>compared to what</em> question. To create this baseline, we go back to our original dataset, identify the most popular schools, and then select the most popular insitution not already listed by the school of interest. This is just a <code>Top-N</code> prediction.</p>
<div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1">pop_accurate</span>
<span id="cb11-2"><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span></span>
<span id="cb11-3">  <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">150</span></span></code></pre></div>
<p>From above, we can see that the <code>Top-N</code> prediction did not predict any connections correctly, as evidenced by the 150 underneath the <code>FALSE</code>. In reality, if my test set was larger than 150 edges, undoubtedly we would have got a handful of correct guesses.</p>
<p>Now, let’s look at the model.</p>
<div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1">pred_accurate</span>
<span id="cb12-2"><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>  <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span></span>
<span id="cb12-3">  <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">105</span>    <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">45</span></span></code></pre></div>
<p>From above, we see that the model accurately predicted 45 of the 150 connections. That is 30 % correct, not bad.</p>
</section>
<section id="summary" class="level3">
<h3 class="anchored" data-anchor-id="summary">Summary</h3>
<p>Overall, there is some evidence to suggest that the network structure of the competition higher ed can be leveraged to predict which schools compete with each other. The holdout sample of 150 cases wasn’t that large, but predicting 30% of the cases with a very simple model isn’t a bad start.</p>
<p>Neo4j is a pretty powerful database, but I still have a lot to learn. Generating predictions (recommendations) for 150 cases shouldn’t take nearly two hours. This would never fly in a production environment, but as mentioend above, I suspect that there are some things that I can do on my end to greatly enhance performance. In another study (not published), I was able to generate predictions for 1000 entities in less than ten minutes. Neo4j admits that it’s not an analytical engine, which is why there are some great tools being devolped on top, like <a href="https://github.com/kbastani/neo4j-mazerunner">Mazerunner</a> which aims to bring graph ETL capabilities to the database.</p>
<p>Stepping back, I am excited to extend the use of Neo4j into other projects within Enrollment Management, as I strongly believe that analyzing our data in graph format can yield strategic insights above and beyond what we can already do with more traditional methods like regression, clustering, and machine learning.</p>


</section>
</section>

 ]]></description>
  <category>Enrollment Management</category>
  <guid>https://brocktibert.com/posts/migration/2015-02-22-predict-competition-amongst-undergraduate-institutions-using-neo4j.html</guid>
  <pubDate>Sat, 08 Dec 2018 05:00:00 GMT</pubDate>
  <media:content url="https://brocktibert.com/media/icon_hu8f52f26052f65ffc2b93d4a731c0c142_272978_512x512_fill_lanczos_center_3.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Environment Variables in RStudio on Mac</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/migration/2015-12-08-environment-variables-in-rstudio-on-mac.html</link>
  <description><![CDATA[ 





<section id="environment-variables-in-rstudio-on-mac" class="level1">
<h1>Environment Variables in RStudio on Mac</h1>
<p>I recently asked a question on <a href="http://stackoverflow.com/questions/34160664/environment-variables-in-rstudio-on-mac">Stack Overflow</a> on the best way to set environment variables on a Mac for use within an RStudio session.</p>
<p>It wasn’t as straightforward as I would have thought, so I wanted to share this quick post as a way to remind my future self of a quick way to solve the issue.</p>
<section id="overview" class="level2">
<h2 class="anchored" data-anchor-id="overview">Overview</h2>
<p>Generally, you can set environment variables by:</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode sh code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">export</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">YOUR_VAR</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>abc123</span></code></pre></div>
<p>within a terminal. In a new session,</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode sh code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">echo</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$YOUR_VAR</span></span></code></pre></div>
<p>should yield what you need.</p>
<p>Within python, you can get at it by:</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os</span>
<span id="cb3-2">os.getenv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"YOUR_VAR"</span>)</span></code></pre></div>
<p>but if you are using <code>R</code> and <code>Rstudio</code>, <code>Sys.getenv("YOUR_VAR")</code> will return <code>""</code>.</p>
<p>No bueno.</p>
</section>
<section id="a-solution" class="level2">
<h2 class="anchored" data-anchor-id="a-solution">A solution</h2>
<p>Navigate to <code>~</code>, and create the <code>.Renviron</code> file if it doesn’t already exist</p>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode sh code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> ~</span>
<span id="cb4-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">touch</span> .Renviron</span>
<span id="cb4-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">open</span> .Renviron</span></code></pre></div>
<p>And in the file, type</p>
<pre><code>YOUR_VAR="abc123"</code></pre>
<p>Save the file and restart/reopen Rstudio.</p>
<p>From there, <code>Sys.getenv("YOUR_VAR")</code> should be good to go.</p>
</section>
<section id="deeper-dive" class="level2">
<h2 class="anchored" data-anchor-id="deeper-dive">Deeper Dive</h2>
<p>For a more granular look at this functionality, feel free to reference the links below</p>
<ul>
<li><a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html">Startup</a></li>
<li><a href="https://www.biostat.wisc.edu/~kbroman/Rintro/">The reference that I used</a></li>
</ul>


</section>
</section>

 ]]></description>
  <category>R</category>
  <guid>https://brocktibert.com/posts/migration/2015-12-08-environment-variables-in-rstudio-on-mac.html</guid>
  <pubDate>Wed, 05 Dec 2018 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Forecasting College Enrollment</title>
  <dc:creator>Brock Tibert</dc:creator>
  <link>https://brocktibert.com/posts/migration/2014-05-06-forecast-college-enrollment.html</link>
  <description><![CDATA[ 





<section id="forecasting-college-enrollment" class="level1">
<h1>Forecasting College Enrollment</h1>
<p>As of late, there has been a surge in conversation around the topic of the <code>college-going population</code> here in the United States.</p>
<p>One one hand, we have long talked about the “The Perfect Storm” of demographics. For example, here is a simple <a href="http://goo.gl/T3OyCF">Google Search</a>. On the other, the decline in college enrollment, has been connected to changes in the <a href="http://fivethirtyeight.com/features/more-high-school-grads-decide-college-isnt-worth-it/">labor market</a>.</p>
<p>In the end, it might be nice to review what data exist and highlight how these flashy headlines could have been predictable well in advance of 2014.</p>
<section id="about-this-post" class="level2">
<h2 class="anchored" data-anchor-id="about-this-post">About this post</h2>
<p>For this post, I will be using using the <code>R</code> language to download the data from <a href="http://www.wiche.edu/knocking-8th">WICHE</a>, an amazing resource for projections of High School graduates by state. Using these data, we can do all sorts of fun analyses.</p>
<p>In a future post, I will show you how to link WICHE to <a href="http://nces.ed.gov/ipeds">IPEDS</a> data in order to forecast college participation rates by state.</p>
<p>While I will provide a few code snippets below, you should feel free to clone my <a href="https://github.com/Btibert3/Parse-WICHE">Github Repo</a> which everything you need to replicate this post.</p>
<p>Also included is a Tableau Workbook. If you have Tableau Desktop, this super basic workbook highlights how you can leverage parameters to create your own forecasts.</p>
<p>Below is a screenshot of the workbook, which is a basic “Create-Your-Own College Enrollment Forecast” of sorts.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://raw.githubusercontent.com/Btibert3/Parse-WICHE/master/figs/Tableau.PNG" class="img-fluid figure-img"></p>
<figcaption>Tableau-ss</figcaption>
</figure>
</div>
</section>
<section id="why-this-post" class="level2">
<h2 class="anchored" data-anchor-id="why-this-post">Why this post</h2>
<p>The changing demographics and volume of students that would be considering a college education should not be news to anyone in Enrollment Management. I hope to highlight how with just a few lines of code, we can:</p>
<ol type="1">
<li>Grab data that forecasts the volume of high school graduates</li>
<li>Use <code>R</code> to parse, clean, and reshape the data (originally stored in Excel)</li>
<li>Save out the data and leverage Tableau to do some basic forecasting</li>
</ol>
<p>For those of you that might be new to <code>R</code>, reading code can be extremely helpful when attempting to learn a new language. When possible, I always try to comment the heck of out my code. Hopefully these comments can help you in your journey.</p>
</section>
<section id="get-the-data" class="level2">
<h2 class="anchored" data-anchor-id="get-the-data">Get the data</h2>
<p>With <code>R</code>, it’s super simple to grab data from the web. The command below will download the WICHE Excel Workbook.</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## download the dataset into your working directory</span></span>
<span id="cb1-2"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## use mode option below so the file can open in R, error w/o it</span></span>
<span id="cb1-3">WICHE_DATA <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"http://wiche.edu/info/knocking-8th/tables/allProjections.xlsx"</span></span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">download.file</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">url=</span>WICHE_DATA, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">destfile=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"raw/wiche.xlsx"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mode=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"wb"</span>)</span></code></pre></div>
<p>It should be noted that the code above assumes that your current directory (where you are running the code) has a folder called <code>raw</code>. To assure that this is the case, just do this:</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## ensure that we have a directory to store the raw data</span></span>
<span id="cb2-2"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">file.exists</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"raw"</span>)) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dir.create</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"raw"</span>)</span>
<span id="cb2-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">file.exists</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"figs"</span>)) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dir.create</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"figs"</span>)</span></code></pre></div>
<p>Now we can use the <code>RODBC</code> package (on Windows) to connect to the workbook and query it as if the sheets were database tables.</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">xl <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">odbcConnectExcel2007</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"raw/wiche.xlsx"</span>)</span></code></pre></div>
<p>Because each state is a tab in the workbook, let’s use <code>R</code> to define an object that holds the state abbreivations, which we will use while looping through the workbook.</p>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## how cool is it that R has the State names and Abbreviations preloaded?</span></span>
<span id="cb4-2">?state.name</span>
<span id="cb4-3">(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">states =</span> state.name)</span>
<span id="cb4-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(states)</span>
<span id="cb4-5">states <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(states, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"District of Columbia"</span>)</span></code></pre></div>
<p>Finally, let’s loop and build a dataset in the format we want:</p>
<div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## use a for loop -- not ideal but easy to read and debug</span></span>
<span id="cb5-2">wiche <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">stringsAsFactors=</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)</span>
<span id="cb5-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (state <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> states) {</span>
<span id="cb5-4"> raw <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqlFetch</span>(xl, state, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">stringsAsFactors=</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)</span>
<span id="cb5-5"> <span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## bc there is a structure to each sheet, we can reference each column by index</span></span>
<span id="cb5-6"> <span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## no way is this ideal, but quick when data doesnt change</span></span>
<span id="cb5-7"> ROWS <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">40</span></span>
<span id="cb5-8"> COLS <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb5-9"> <span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## create a flag for actual/projected -- hard coded from looking at Excel file</span></span>
<span id="cb5-10"> status <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"actual"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">13</span>), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"projected"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">19</span>))</span>
<span id="cb5-11"> <span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## keep the data</span></span>
<span id="cb5-12"> df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> raw[ROWS, COLS]</span>
<span id="cb5-13"> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colnames</span>(df) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'year'</span>,</span>
<span id="cb5-14">                  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pub_amind'</span>,</span>
<span id="cb5-15">                  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pub_asian'</span>,</span>
<span id="cb5-16">                  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pub_black'</span>,</span>
<span id="cb5-17">                  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pub_hisp'</span>,</span>
<span id="cb5-18">                  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pub_white'</span>,</span>
<span id="cb5-19">                  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pub_total'</span>,</span>
<span id="cb5-20">                  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'np_total'</span>,</span>
<span id="cb5-21">                  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'total'</span>)</span>
<span id="cb5-22"> <span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## remove the commas -- using a for loop not ideal, but intuitive</span></span>
<span id="cb5-23"> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (i <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ncol</span>(df)) {</span>
<span id="cb5-24">  df[,i] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.numeric</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gsub</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">","</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>, df[,i]))</span>
<span id="cb5-25"> }</span>
<span id="cb5-26"> df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>state <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> state</span>
<span id="cb5-27"> df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>status <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> status</span>
<span id="cb5-28"> <span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## bind onto the master data frame</span></span>
<span id="cb5-29"> wiche <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbind.fill</span>(wiche, df)</span>
<span id="cb5-30"> <span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## status</span></span>
<span id="cb5-31"> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cat</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"finished "</span>, state, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb5-32">}</span></code></pre></div>
</section>
<section id="a-quick-plot" class="level2">
<h2 class="anchored" data-anchor-id="a-quick-plot">A quick plot</h2>
<p>When playing around with data, it’s usually a good practice to visualize what you have. Below is a quick plot which represents both the actual and forecasted volume of high school graduates going until the 2027/28 Academic year.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://raw.githubusercontent.com/Btibert3/Parse-WICHE/master/figs/Total-HS-Grads.jpg" class="img-fluid figure-img"></p>
<figcaption>plot</figcaption>
</figure>
</div>
</section>
<section id="summary" class="level2">
<h2 class="anchored" data-anchor-id="summary">Summary</h2>
<p>I would encourage the reader to browse the code, and if possible, fire up the Tableau workbook. As an Enrollment Scientist, <code>R</code> and <code>Tableau</code> are my two tools that I use on a daily basis.</p>


</section>
</section>

 ]]></description>
  <category>Enrollment Management</category>
  <guid>https://brocktibert.com/posts/migration/2014-05-06-forecast-college-enrollment.html</guid>
  <pubDate>Wed, 05 Dec 2018 05:00:00 GMT</pubDate>
</item>
</channel>
</rss>
