Tensorflow, Tip-Ins, and Tableau, Oh My
This notebook aims to show the basics of:
- Tensorflow 2.0
- Shooter Embedding estimation for NHL Player evaluation
- Evaluate feasibility generating a post that switches between
R
andpython
via reticulate - Demonstrate code similarity/approach in both languages side-by-side
TL;DR
- Combine Tensorflow/Keras with R
- NHL Data to estimate Shooter Player Embeddings
- Export to Tableau for exploration (yes we could use ggplot et. al, but highlights we have other options, especially for those new to the language)
R Setup
# packages
library(keras)
suppressPackageStartupMessages(library(tidyverse))
library(reticulate)
suppressPackageStartupMessages(library(caret))
# options
options(stringsAsFactors = FALSE)
use_condaenv("tensorflow")
Python setup
# imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Activation, concatenate, Dense, Dropout, Embedding, Input, Reshape, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model
from tensorflow.keras.preprocessing.text import Tokenizer
Get the data
R
URL = "http://peter-tanner.com/moneypuck/downloads/shots_2019.zip"
download.file(URL, destfile="shots.zip")
shots_raw = read_csv("shots.zip")
What’s the shape?
dim(shots_raw)
[1] 88592 124
python
URL = "http://peter-tanner.com/moneypuck/downloads/shots_2019.zip"
shots_raw = pd.read_csv(URL)
What do we have?
shots_raw.shape
(88592, 124)
Filter rows
We want to keep shots on net, and not on an empty net, as well as remove records where the shooter id is 0.
R
# keep shots that were on goal
shots_raw = shots_raw %>% filter(shotWasOnGoal == 1 )
shots_raw = shots_raw %>% filter(shotOnEmptyNet == 0)
shots_raw = shots_raw %>% filter(shooterPlayerId != 0)
shots_raw = shots_raw %>% filter(!is.na(shooterPlayerId))
What we do have for a shape?
dim(shots_raw)
[1] 64318 124
python
shots_raw = shots_raw.loc[shots_raw.shotOnEmptyNet == 0, :]
shots_raw = shots_raw.loc[shots_raw.shotWasOnGoal == 1, :]
shots_raw = shots_raw.loc[shots_raw.shooterPlayerId != 0, :]
shots_raw = shots_raw.loc[~shots_raw.shooterPlayerId.isna(), :]
What do we have for a shape?
shots_raw.shape
(64318, 124)
Select Columns
With the rows select, let’s keep the columns that we want to include in this analysis.
R
# keep just the columns that we need
shots_raw = shots_raw %>% select(shooterPlayerId, shotType, goal, arenaAdjustedShotDistance,
arenaAdjustedXCord, arenaAdjustedYCord, shotAngle, offWing)
The shape …
dim(shots_raw)
[1] 64318 8
python
COLS = ['shooterPlayerId', 'shotType', 'goal', 'offWing',
'arenaAdjustedShotDistance', 'arenaAdjustedXCord', 'arenaAdjustedYCord',
'shotAngle']
shots_raw = shots_raw[COLS]
The shape …
shots_raw.shape
(64318, 8)
Encode the shot types
I am going to one-hot the shot types, though in the future I will explore the use of keras.preprocessing.text.one_hot
. The result will be new columns added to our shots_raw
dataset, with each shot type flagged as 0/1.
R
x <- dummyVars(" ~ .", data = shots_raw)
shots_raw <- data.frame(predict(x, newdata = shots_raw))
rm(x)
What do we have?
glimpse(shots_raw)
Observations: 64,318
Variables: 14
$ shooterPlayerId <dbl> 8480801, 8476853, 8476331, 8476853, 8475197…
$ shotTypeBACK <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ shotTypeDEFL <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ shotTypeSLAP <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ shotTypeSNAP <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0…
$ shotTypeTIP <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
$ shotTypeWRAP <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ shotTypeWRIST <dbl> 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1…
$ goal <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arenaAdjustedShotDistance <dbl> 4.123106, 59.000000, 30.000000, 40.000000, …
$ arenaAdjustedXCord <dbl> 85, -30, 60, -56, -40, -48, -34, -77, -61, …
$ arenaAdjustedYCord <dbl> -1, -2, -7, -22, -30, -8, 41, -13, -34, 11,…
$ shotAngle <dbl> -14.036243, 2.009554, -12.994617, 33.690068…
$ offWing <dbl> 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1…
python
shots_raw = pd.get_dummies(shots_raw, columns=['shotType'])
print(shots_raw.shape)
(64318, 14)
print(shots_raw.head(3).T)
0 2 5
shooterPlayerId 8.480801e+06 8.476853e+06 8.476331e+06
goal 1.000000e+00 0.000000e+00 0.000000e+00
offWing 1.000000e+00 0.000000e+00 0.000000e+00
arenaAdjustedShotDistance 4.123106e+00 5.900000e+01 3.000000e+01
arenaAdjustedXCord 8.500000e+01 -3.000000e+01 6.000000e+01
arenaAdjustedYCord -1.000000e+00 -2.000000e+00 -7.000000e+00
shotAngle -1.403624e+01 2.009554e+00 -1.299462e+01
shotType_BACK 0.000000e+00 0.000000e+00 0.000000e+00
shotType_DEFL 0.000000e+00 0.000000e+00 0.000000e+00
shotType_SLAP 0.000000e+00 0.000000e+00 0.000000e+00
shotType_SNAP 0.000000e+00 1.000000e+00 1.000000e+00
shotType_TIP 1.000000e+00 0.000000e+00 0.000000e+00
shotType_WRAP 0.000000e+00 0.000000e+00 0.000000e+00
shotType_WRIST 0.000000e+00 0.000000e+00 0.000000e+00
Scale the numeric data to 0/1
R
# clunky, but break out columns to standardize
tmp = shots_raw %>% select(arenaAdjustedShotDistance:shotAngle)
tmp2 = preProcess(tmp, method = "range")
pp = predict(tmp2, tmp)
rm(tmp, tmp2)
# drop the original and append these
shots_raw = select(shots_raw, -arenaAdjustedShotDistance:-shotAngle)
shots_raw = cbind(shots_raw, pp)
dim(shots_raw)
[1] 64318 14
python
scaler = MinMaxScaler()
COLS = ['arenaAdjustedShotDistance', 'arenaAdjustedXCord', 'arenaAdjustedYCord', 'shotAngle']
shots_raw[COLS] = scaler.fit_transform(shots_raw[COLS])
shots_raw.shape
(64318, 14)
Setup the tokenizer and fit to the Player IDs
For this exercise, instead of converting the player ids to be 0-based, I am going to treat the player ids as if they are unique words, with the unique number of players representing our complete vocabulary. As such, document represents a shot of the puck on net, and each document only includes one “word”, or shooter.
The trick here is that we have to treat our player ids as character strings.
R
# ensure that the shooter ID is a string
shots_raw$shooterPlayerId = as.character(shots_raw$shooterPlayerId)
# setup the tokenizer
shooter_tokenizer = text_tokenizer()
# fit the shooters
fit_text_tokenizer(shooter_tokenizer, shots_raw$shooterPlayerId)
What do we have?
shooter_tokenizer$index_word[1:3]
$`1`
[1] "8477492"
$`2`
[1] "8471214"
$`3`
[1] "8474157"
shooter_tokenizer$word_index[1:3]
$`8477492`
[1] 1
$`8471214`
[1] 2
$`8474157`
[1] 3
And how many?
length(shooter_tokenizer$index_word)
[1] 869
python
# make an integer so zero is not parsed
shots_raw.shooterPlayerId = shots_raw.shooterPlayerId.astype('int')
# ensure that the player ID is a string
shots_raw.shooterPlayerId = shots_raw.shooterPlayerId.astype('str')
# setup the tokenizer
shooter_tokenizer = Tokenizer()
# fit the tokenizer to shooters
shooter_tokenizer.fit_on_texts(shots_raw.shooterPlayerId)
What do we have?
list(shooter_tokenizer.index_word.items())[:3]
[(1, '8477492'), (2, '8471214'), (3, '8474157')]
list(shooter_tokenizer.word_index.items())[:3]
[('8477492', 1), ('8471214', 2), ('8474157', 3)]
And how many?
len(shooter_tokenizer.index_word.items())
869
Create the Shooter sequences
These are size 1 sequences that do not require padding, as we only allow 1 word (or player) per shot. The key here is that we are using keras
to help us easily map our data to the new id system.
R
# make sequences with the new index
shooters = texts_to_sequences(shooter_tokenizer, shots_raw$shooterPlayerId)
shooters = unlist(shooters)
What do we have?
class(shooters)
[1] "integer"
length(shooters)
[1] 64318
python
shooters = shooter_tokenizer.texts_to_sequences(shots_raw.shooterPlayerId)
shooters = [x[0] for x in shooters]
shooters = np.array(shooters)
What do we have?
type(shooters)
<class 'numpy.ndarray'>
len(shooters)
64318
Isolate the other features/targets
R
# Was the shot a goal? This is our target.
goal = shots_raw$goal
# the shot info
shot_info = shots_raw %>% select(-shooterPlayerId, -goal)
shot_info = as.matrix(shot_info)
What do we have now?
length(goal); mean(goal);
[1] 64318
[1] 0.09103206
dim(shot_info)
[1] 64318 12
colnames(shot_info)
[1] "shotTypeBACK" "shotTypeDEFL"
[3] "shotTypeSLAP" "shotTypeSNAP"
[5] "shotTypeTIP" "shotTypeWRAP"
[7] "shotTypeWRIST" "offWing"
[9] "arenaAdjustedShotDistance" "arenaAdjustedXCord"
[11] "arenaAdjustedYCord" "shotAngle"
python
# Was the shot a goal? This is our target.
goal = np.array(shots_raw.goal)
# the shot info
shot_info = shots_raw.drop(columns=['shooterPlayerId', 'goal'], axis=1, inplace=False)
What do we have?
len(goal)
64318
goal.mean()
0.09103205945458503
shot_info.shape
(64318, 12)
shot_info.columns
Index(['offWing', 'arenaAdjustedShotDistance', 'arenaAdjustedXCord',
'arenaAdjustedYCord', 'shotAngle', 'shotType_BACK', 'shotType_DEFL',
'shotType_SLAP', 'shotType_SNAP', 'shotType_TIP', 'shotType_WRAP',
'shotType_WRIST'],
dtype='object')
Define the model architecture
R
Note the +1, it’s needed to avoid the index error
# the setup
NUM_SHOOTERS = length(unique(unlist(shooter_tokenizer$index_word))) +1
SHOT_COLS = ncol(shot_info)
VEC_SIZE = 50
# the input layers
shooter_input = layer_input(shape=c(1), name = "shooter_input")
shot_input = layer_input(shape=c(SHOT_COLS), name = "shot_input")
# shooter layers
s1 = layer_embedding(input_dim = NUM_SHOOTERS,
output_dim = VEC_SIZE,
input_length = 1,
name="shooter_embedding")(shooter_input)
s2 = layer_flatten(name = "shooter_flat")(s1)
s3 = layer_dense(units = 1, activation = "sigmoid")(s2)
# put the model together
model = keras_model(inputs = shooter_input, outputs = s3)
Summarize:
summary(model)
Model: "model"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
shooter_input (InputLayer) [(None, 1)] 0
________________________________________________________________________________
shooter_embedding (Embedding) (None, 1, 50) 43500
________________________________________________________________________________
shooter_flat (Flatten) (None, 50) 0
________________________________________________________________________________
dense (Dense) (None, 1) 51
================================================================================
Total params: 43,551
Trainable params: 43,551
Non-trainable params: 0
________________________________________________________________________________
python
Note the +2, it’s needed to avoid the index error and differs from abvoe
# setup
NUM_SHOOTERS = len(np.unique(shooters)) + 1
SHOT_COLS = shot_info.shape[1]
VEC_SIZE = 50
# the input layers
shooter_input = Input(shape=(1, ), name="shooter_input")
shot_input = Input(shape=(SHOT_COLS, ), name="shot_input")
# shooter layers
s1 = Embedding(NUM_SHOOTERS, VEC_SIZE, input_length=1)(shooter_input)
s2 = Flatten()(s1)
s3 = Dense(1, activation="sigmoid")(s2)
# put the model together
model = Model(inputs = shooter_input, outputs = s3)
What do we have?
model.summary()
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
shooter_input (InputLayer) [(None, 1)] 0
_________________________________________________________________
embedding (Embedding) (None, 1, 50) 43500
_________________________________________________________________
flatten (Flatten) (None, 50) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 51
=================================================================
Total params: 43,551
Trainable params: 43,551
Non-trainable params: 0
_________________________________________________________________
and plot the model, this is not available within R at the moment.
# below might choke RMD
plot_model(model, to_file='model.png')
Train and Evaluate the Model
R
Compile the model.
model %>%
compile(optimizer = "adam",
loss="binary_crossentropy",
metrics =c("accuracy"))
Fit the model and record the history for plotting, if needed
history =
model %>%
fit(x=list(shooters),
y=goal,
epochs = 5,
verbose = 2)
python
Compile the model.
model.compile(optimizer="adam", loss = "binary_crossentropy", metrics = ['accuracy'])
Fit the model.
X = [shooters, shot_info]
history = model.fit(X, goal, epochs=5)
Get the Embeddings
With our simple model, we have estimated embeddings for each shooter. Let’s grab those.
R
shooter_embeddings = get_weights(model)[[1]]
What do we have?
shooter_embeddings[1:3, 1:3]
[,1] [,2] [,3]
[1,] -0.03280475 0.02733728 -0.01639561
[2,] 0.08887081 0.12082704 0.12113131
[3,] 0.09737433 0.06078419 0.05767342
The shape.
dim(shooter_embeddings)
[1] 870 50
python
shooter_embeddings = model.layers[1].get_weights()[0]
What do we have?
shooter_embeddings[1:4, 1:4]
array([[-0.07189947, -0.08770541, 0.04801337],
[-0.03720884, -0.04936351, 0.04588475],
[-0.1379198 , -0.04528455, 0.08142862]], dtype=float32)
The shape.
shooter_embeddings.shape
(870, 50)
Map the embeddings to the players
The embeddings are related to a player, so we are intereseted extracting these vectors and looking at player similarity, etc.
R
This is to help with some of the mapping. There may be more elegant ways to do this, but below is intuitive and simple in my opinion.
# build our vocabulary (player) dataframe
# https://www.r-bloggers.com/word-embeddings-with-keras/
players = data.frame(
playerid = names(shooter_tokenizer$word_index),
id = as.integer(unlist(shooter_tokenizer$word_index)), stringsAsFactors=FALSE)
players = dplyr::arrange(players, id)
The embeddings with names and references
# keep only those rows where the indexes align - R is 1-based
shooter_embeddings = shooter_embeddings[players$id, ]
rownames(shooter_embeddings) = players$playerid
colnames(shooter_embeddings) = paste0("e", 1:ncol(shooter_embeddings))
shooter_embeddings[1:3, 1:3]
e1 e2 e3
8477492 -0.03280475 0.02733728 -0.01639561
8471214 0.08887081 0.12082704 0.12113131
8474157 0.09737433 0.06078419 0.05767342
python
# make the embed vectors a pandas dataframe
shooter_embeddings = pd.DataFrame(shooter_embeddings)
# a list of true shooter ids
#shooter_id = [v for k, v in shooter_tokenizer.index_word.items()]
shooter_id = {k:v for k, v in shooter_tokenizer.index_word.items()}
shooter_df = pd.DataFrame.from_dict(shooter_id, orient='index', columns=["playerid"])
# name the columns
shooter_embeddings.columns = ["e" + str(i + 1) for i in range(shooter_embeddings.shape[1])]
# align the data by index
shooter_embeddings = pd.merge(shooter_embeddings, shooter_df, how='inner', left_index=True, right_index=True)
# clean up the index so its the player
shooter_embeddings.index = shooter_embeddings.playerid
# the first few
shooter_embeddings.iloc[:3, :3]
e1 e2 e3
playerid
8477492 0.093861 -0.071899 -0.087705
8471214 0.072550 -0.037209 -0.049364
8474157 0.059455 -0.137920 -0.045285
Export the data to Tableau
Whether it is R or python, you might be asking why I am exporting the data to Tableau. That is a fair question, but the point is to show how the ecosystem of data science programming libraries can also leverage best-of-breed data visualization suites such as Tableau. The tool plays a key role in my exploratory analysis pipeline, and the goal below is show how in 1-line of code, we can export our data for rapid exploration, which can aid in our data cleaning and modeling tasks within R/python.
R
I ported a copy of the pantab
library in python into R. The trick is that I use reticulate
to port the python bits into R. As such, at present, it will not work if you are using Google Colab.
Installation is simple:
devtools::install_github("btibert3/pantabR")
sdf = as.data.frame(shooter_embeddings)
pantabR::frame_to_hyper(sdf, f="embeddings.hyper", tbl="shooters")
In the python section, I am going to use t-SNE to reduce the estimated shooter embeddings into a two dimensional space. The process is similar in
R
using theRtsne
package.
python
The pantab
library is easy to install:
pip install pantab
# if in a notebook environment
# !pip install pantab
However, prior to writing out the data for exploration, I am going to use t-SNE to compact the estimated shooter embeddings into a two dimensional coordinate system. For more on t-SNE, refer to this introduction.
from sklearn.manifold import TSNE
shooter_tsne = TSNE(n_components=2).fit_transform(shooter_embeddings.iloc[:, :50])
Add them to the dataframe
shooter_tsne = pd.DataFrame(shooter_tsne)
shooter_tsne.columns = ['t1', 't2']
shooter_embeddings.reset_index(inplace=True, drop=True)
shooter_embeddings = pd.concat([shooter_embeddings, shooter_tsne], axis=1)
With pantab setup, that package makes it really simple to write pandas dataframes to hyper
files for Tableau.
import pantab
pantab.frame_to_hyper(shooter_embeddings, "embeddings.hyper", table="shooters")
And the simple embeddings, plotted from our exported embeddings.hyper
file within Tableau.