This notebook aims to show the basics of:
- Tensorflow 2.0
- Shooter Embedding estimation for NHL Player evaluation
- Evaluate the feasibility of generating a post that switches between R and python via reticulate
- Demonstrate code similarity/approach in both languages side-by-side
TL;DR
- Combine Tensorflow/Keras with R
- NHL Data to estimate Shooter Player Embeddings
- Export to Tableau for exploration (yes, we could use ggplot et al., but this highlights that we have other options, especially for those new to the language)
R Setup
# packages
library(keras)
suppressPackageStartupMessages(library(tidyverse))
library(reticulate)
suppressPackageStartupMessages(library(caret))
# options
options(stringsAsFactors = FALSE)
use_condaenv("tensorflow")
Python setup
# imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Activation, concatenate, Dense, Dropout, Embedding, Input, Reshape, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model
from tensorflow.keras.preprocessing.text import Tokenizer
Get the data
R
URL = "http://peter-tanner.com/moneypuck/downloads/shots_2019.zip"
download.file(URL, destfile="shots.zip")
shots_raw = read_csv("shots.zip")
What’s the shape?
dim(shots_raw)
[1] 88592 124
Python
URL = "http://peter-tanner.com/moneypuck/downloads/shots_2019.zip"
shots_raw = pd.read_csv(URL)
What do we have?
shots_raw.shape
(88592, 124)
Filter rows
We want to keep shots on net, and not on an empty net, as well as remove records where the shooter id is 0.
R
# keep shots that were on goal
shots_raw = shots_raw %>% filter(shotWasOnGoal == 1)
shots_raw = shots_raw %>% filter(shotOnEmptyNet == 0)
shots_raw = shots_raw %>% filter(shooterPlayerId != 0)
shots_raw = shots_raw %>% filter(!is.na(shooterPlayerId))
What do we have for a shape?
dim(shots_raw)
[1] 64318 124
Python
shots_raw = shots_raw.loc[shots_raw.shotOnEmptyNet == 0, :]
shots_raw = shots_raw.loc[shots_raw.shotWasOnGoal == 1, :]
shots_raw = shots_raw.loc[shots_raw.shooterPlayerId != 0, :]
shots_raw = shots_raw.loc[~shots_raw.shooterPlayerId.isna(), :]
What do we have for a shape?
shots_raw.shape
(64318, 124)
Select Columns
With the rows selected, let’s keep the columns that we want to include in this analysis.
R
# keep just the columns that we need
shots_raw = shots_raw %>% select(shooterPlayerId, shotType, goal, arenaAdjustedShotDistance,
                                 arenaAdjustedXCord, arenaAdjustedYCord, shotAngle, offWing)
The shape …
dim(shots_raw)
[1] 64318 8
Python
COLS = ['shooterPlayerId', 'shotType', 'goal', 'offWing',
        'arenaAdjustedShotDistance', 'arenaAdjustedXCord', 'arenaAdjustedYCord',
        'shotAngle']
shots_raw = shots_raw[COLS]
The shape …
shots_raw.shape
(64318, 8)
Encode the shot types
I am going to one-hot the shot types, though in the future I will explore the use of keras.preprocessing.text.one_hot. The result will be new columns added to our shots_raw dataset, with each shot type flagged as 0/1.
R
x <- dummyVars(" ~ .", data = shots_raw)
shots_raw <- data.frame(predict(x, newdata = shots_raw))
rm(x)
What do we have?
glimpse(shots_raw)
Observations: 64,318
Variables: 14
$ shooterPlayerId <dbl> 8480801, 8476853, 8476331, 8476853, 8475197…
$ shotTypeBACK <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ shotTypeDEFL <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ shotTypeSLAP <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ shotTypeSNAP <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0…
$ shotTypeTIP <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
$ shotTypeWRAP <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ shotTypeWRIST <dbl> 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1…
$ goal <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arenaAdjustedShotDistance <dbl> 4.123106, 59.000000, 30.000000, 40.000000, …
$ arenaAdjustedXCord <dbl> 85, -30, 60, -56, -40, -48, -34, -77, -61, …
$ arenaAdjustedYCord <dbl> -1, -2, -7, -22, -30, -8, 41, -13, -34, 11,…
$ shotAngle <dbl> -14.036243, 2.009554, -12.994617, 33.690068…
$ offWing <dbl> 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1…
Python
shots_raw = pd.get_dummies(shots_raw, columns=['shotType'])
print(shots_raw.shape)
(64318, 14)
print(shots_raw.head(3).T)
0 2 5
shooterPlayerId 8.480801e+06 8.476853e+06 8.476331e+06
goal 1.000000e+00 0.000000e+00 0.000000e+00
offWing 1.000000e+00 0.000000e+00 0.000000e+00
arenaAdjustedShotDistance 4.123106e+00 5.900000e+01 3.000000e+01
arenaAdjustedXCord 8.500000e+01 -3.000000e+01 6.000000e+01
arenaAdjustedYCord -1.000000e+00 -2.000000e+00 -7.000000e+00
shotAngle -1.403624e+01 2.009554e+00 -1.299462e+01
shotType_BACK 0.000000e+00 0.000000e+00 0.000000e+00
shotType_DEFL 0.000000e+00 0.000000e+00 0.000000e+00
shotType_SLAP 0.000000e+00 0.000000e+00 0.000000e+00
shotType_SNAP 0.000000e+00 1.000000e+00 1.000000e+00
shotType_TIP 1.000000e+00 0.000000e+00 0.000000e+00
shotType_WRAP 0.000000e+00 0.000000e+00 0.000000e+00
shotType_WRIST 0.000000e+00 0.000000e+00 0.000000e+00
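To make the one-hot step above concrete, here is a minimal sketch of the same idea in plain Python on a handful of made-up shot types. The helper name `one_hot` and the sample values are hypothetical and not part of the notebook; `dummyVars` and `pd.get_dummies` handle this for you.

```python
def one_hot(values):
    """Return (sorted categories, one 0/1 row per input value)."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows


# made-up shot types for illustration only
shot_types = ["TIP", "SNAP", "SNAP", "WRIST"]
categories, encoded = one_hot(shot_types)

# each shot gets exactly one flag set to 1
for shot, row in zip(shot_types, encoded):
    print(shot, dict(zip(categories, row)))
```

Each row sums to 1, which is why the seven shot-type columns in the real data can stand in for the single `shotType` column.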
Scale the numeric data to 0/1
R
# clunky, but break out columns to standardize
tmp = shots_raw %>% select(arenaAdjustedShotDistance:shotAngle)
tmp2 = preProcess(tmp, method = "range")
pp = predict(tmp2, tmp)
rm(tmp, tmp2)
# drop the original and append these
shots_raw = select(shots_raw, -arenaAdjustedShotDistance:-shotAngle)
shots_raw = cbind(shots_raw, pp)
dim(shots_raw)
[1] 64318 14
Python
scaler = MinMaxScaler()
COLS = ['arenaAdjustedShotDistance', 'arenaAdjustedXCord', 'arenaAdjustedYCord', 'shotAngle']
shots_raw[COLS] = scaler.fit_transform(shots_raw[COLS])
shots_raw.shape
(64318, 14)
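The range scaling used above boils down to (x - min) / (max - min) per column. A small sketch in plain Python, assuming made-up distances; the helper `min_max_scale` is hypothetical, and mapping a constant column to 0.0 is a choice made for this sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # a constant column has no spread; this sketch maps it to 0.0
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


# made-up shot distances for illustration only
distances = [4.0, 59.0, 30.0, 40.0]
scaled = min_max_scale(distances)
print(scaled)
```

The smallest value becomes 0 and the largest becomes 1, so all four numeric columns end up on the same scale as the 0/1 shot-type flags.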
Setup the tokenizer and fit to the Player IDs
For this exercise, instead of converting the player ids to be 0-based, I am going to treat the player ids as if they are unique words, with the unique number of players representing our complete vocabulary. As such, each document represents a shot of the puck on net, and each document includes only one “word”, or shooter.
The trick here is that we have to treat our player ids as character strings.
R
# ensure that the shooter ID is a string
shots_raw$shooterPlayerId = as.character(shots_raw$shooterPlayerId)
# setup the tokenizer
shooter_tokenizer = text_tokenizer()
# fit the shooters
fit_text_tokenizer(shooter_tokenizer, shots_raw$shooterPlayerId)
What do we have?
shooter_tokenizer$index_word[1:3]
$`1`
[1] "8477492"
$`2`
[1] "8471214"
$`3`
[1] "8474157"
shooter_tokenizer$word_index[1:3]
$`8477492`
[1] 1
$`8471214`
[1] 2
$`8474157`
[1] 3
And how many?
length(shooter_tokenizer$index_word)
[1] 869
Python
# cast to integer first so a trailing ".0" is not tokenized as a separate word
shots_raw.shooterPlayerId = shots_raw.shooterPlayerId.astype('int')
# ensure that the player ID is a string
shots_raw.shooterPlayerId = shots_raw.shooterPlayerId.astype('str')
# setup the tokenizer
shooter_tokenizer = Tokenizer()
# fit the tokenizer to shooters
shooter_tokenizer.fit_on_texts(shots_raw.shooterPlayerId)
What do we have?
list(shooter_tokenizer.index_word.items())[:3]
[(1, '8477492'), (2, '8471214'), (3, '8474157')]
list(shooter_tokenizer.word_index.items())[:3]
[('8477492', 1), ('8471214', 2), ('8474157', 3)]
And how many?
len(shooter_tokenizer.index_word.items())
869
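As a rough illustration of the "player ids as words" idea above, the sketch below builds a 1-based, frequency-ordered index in plain Python, similar in spirit to what the tokenizer produces. It is not the Keras implementation; the helper `build_vocabulary` and the list of shots are made up for illustration.

```python
from collections import Counter


def build_vocabulary(player_ids):
    """Map each distinct id to a 1-based index, most frequent id first."""
    counts = Counter(player_ids)
    word_index = {pid: i for i, (pid, _) in enumerate(counts.most_common(), start=1)}
    index_word = {i: pid for pid, i in word_index.items()}
    return word_index, index_word


# one "document" per shot, each holding a single "word" (the shooter id);
# the shot list itself is made up for illustration
shots = ["8471214", "8477492", "8477492", "8474157", "8477492", "8471214"]
word_index, index_word = build_vocabulary(shots)

# size-1 sequences flattened to a plain list of indices
shooters = [word_index[pid] for pid in shots]
print(word_index)
print(shooters)
```

Index 0 is never assigned to a player, which matters later when sizing the embedding layer.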
Create the Shooter sequences
These are size 1 sequences that do not require padding, as we only allow 1 word (or player) per shot. The key here is that we are using keras to help us easily map our data to the new id system.
R
# make sequences with the new index
shooters = texts_to_sequences(shooter_tokenizer, shots_raw$shooterPlayerId)
shooters = unlist(shooters)
What do we have?
class(shooters)
[1] "integer"
length(shooters)
[1] 64318
Python
shooters = shooter_tokenizer.texts_to_sequences(shots_raw.shooterPlayerId)
shooters = [x[0] for x in shooters]
shooters = np.array(shooters)
What do we have?
type(shooters)
<class 'numpy.ndarray'>
len(shooters)
64318
Isolate the other features/targets
R
# Was the shot a goal? This is our target.
goal = shots_raw$goal
# the shot info
shot_info = shots_raw %>% select(-shooterPlayerId, -goal)
shot_info = as.matrix(shot_info)
What do we have now?
length(goal); mean(goal);
[1] 64318
[1] 0.09103206
dim(shot_info)
[1] 64318 12
colnames(shot_info)
[1] "shotTypeBACK" "shotTypeDEFL"
[3] "shotTypeSLAP" "shotTypeSNAP"
[5] "shotTypeTIP" "shotTypeWRAP"
[7] "shotTypeWRIST" "offWing"
[9] "arenaAdjustedShotDistance" "arenaAdjustedXCord"
[11] "arenaAdjustedYCord" "shotAngle"
Python
# Was the shot a goal? This is our target.
goal = np.array(shots_raw.goal)
# the shot info
shot_info = shots_raw.drop(columns=['shooterPlayerId', 'goal'], axis=1, inplace=False)
What do we have?
len(goal)
64318
goal.mean()
0.09103205945458503
shot_info.shape
(64318, 12)
shot_info.columns
Index(['offWing', 'arenaAdjustedShotDistance', 'arenaAdjustedXCord',
'arenaAdjustedYCord', 'shotAngle', 'shotType_BACK', 'shotType_DEFL',
'shotType_SLAP', 'shotType_SNAP', 'shotType_TIP', 'shotType_WRAP',
'shotType_WRIST'],
dtype='object')
Define the model architecture
R
Note the +1: the tokenizer's indices start at 1 and 0 is reserved, so the embedding needs one extra row to avoid an index error.
# the setup
NUM_SHOOTERS = length(unique(unlist(shooter_tokenizer$index_word))) + 1
SHOT_COLS = ncol(shot_info)
VEC_SIZE = 50
# the input layers
shooter_input = layer_input(shape=c(1), name = "shooter_input")
shot_input = layer_input(shape=c(SHOT_COLS), name = "shot_input")
# shooter layers
s1 = layer_embedding(input_dim = NUM_SHOOTERS,
                     output_dim = VEC_SIZE,
                     input_length = 1,
                     name="shooter_embedding")(shooter_input)
s2 = layer_flatten(name = "shooter_flat")(s1)
s3 = layer_dense(units = 1, activation = "sigmoid")(s2)
# put the model together
model = keras_model(inputs = shooter_input, outputs = s3)
Summarize:
summary(model)
Model: "model"
________________________________________________________________________________
Layer (type)                        Output Shape                    Param #
================================================================================
shooter_input (InputLayer)          [(None, 1)]                     0
________________________________________________________________________________
shooter_embedding (Embedding)       (None, 1, 50)                   43500
________________________________________________________________________________
shooter_flat (Flatten)              (None, 50)                      0
________________________________________________________________________________
dense (Dense)                       (None, 1)                       51
================================================================================
Total params: 43,551
Trainable params: 43,551
Non-trainable params: 0
________________________________________________________________________________
Python
Note the +1; as in the R version above, it’s needed to avoid the index error.
# setup
NUM_SHOOTERS = len(np.unique(shooters)) + 1
SHOT_COLS = shot_info.shape[1]
VEC_SIZE = 50
# the input layers
shooter_input = Input(shape=(1, ), name="shooter_input")
shot_input = Input(shape=(SHOT_COLS, ), name="shot_input")
# shooter layers
s1 = Embedding(NUM_SHOOTERS, VEC_SIZE, input_length=1)(shooter_input)
s2 = Flatten()(s1)
s3 = Dense(1, activation="sigmoid")(s2)
# put the model together
model = Model(inputs = shooter_input, outputs = s3)
What do we have?
model.summary()
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
shooter_input (InputLayer)   [(None, 1)]               0
_________________________________________________________________
embedding (Embedding)        (None, 1, 50)             43500
_________________________________________________________________
flatten (Flatten)            (None, 50)                0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 51
=================================================================
Total params: 43,551
Trainable params: 43,551
Non-trainable params: 0
_________________________________________________________________
And plot the model; this is not available within R at the moment.
# below might choke RMD
plot_model(model, to_file='model.png')
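The parameter counts in the model summaries can be checked by hand. The arithmetic sketch below uses the 869 players reported by the tokenizer and the 50-dimensional vectors set in the code; the variable names are illustrative only.

```python
num_players = 869                 # distinct shooters found by the tokenizer
num_shooters = num_players + 1    # index 0 is reserved, so valid indices run 0..869
vec_size = 50

embedding_params = num_shooters * vec_size   # one 50-d vector per index
dense_params = vec_size * 1 + 1              # 50 weights plus 1 bias
total_params = embedding_params + dense_params

print(embedding_params, dense_params, total_params)
```

These figures line up with the 43,500, 51, and 43,551 shown in both summaries, which is a quick way to confirm that the R and python models have the same shape.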
Train and Evaluate the Model
R
Compile the model.
model %>%
  compile(optimizer = "adam",
          loss = "binary_crossentropy",
          metrics = c("accuracy"))
Fit the model and record the history for plotting, if needed
history = model %>%
  fit(x = list(shooters),
      y = goal,
      epochs = 5,
      verbose = 2)
Python
Compile the model.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=['accuracy'])
Fit the model.
# the model defined above takes only the shooter input, matching the R call
X = [shooters]
history = model.fit(X, goal, epochs=5)
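Because only about 9.1% of the filtered shots are goals (the `mean(goal)` reported earlier), accuracy can look strong even for a model that never predicts a goal. The sketch below computes two no-skill baselines from that reported rate. The variable names are illustrative and this is not part of the notebook's training code.

```python
import math

goal_rate = 0.09103206  # mean(goal) reported earlier in the post

# accuracy of always predicting "no goal"
baseline_accuracy = 1 - goal_rate

# binary cross-entropy of always predicting the base rate
baseline_loss = -(goal_rate * math.log(goal_rate)
                  + (1 - goal_rate) * math.log(1 - goal_rate))

print(round(baseline_accuracy, 3), round(baseline_loss, 3))
```

Comparing the training loss against this constant-prediction loss is one simple way to judge whether the shooter embeddings are picking up any signal.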
Get the Embeddings
With our simple model, we have estimated embeddings for each shooter. Let’s grab those.
R
shooter_embeddings = get_weights(model)[[1]]
What do we have?
shooter_embeddings[1:3, 1:3]
[,1] [,2] [,3]
[1,] -0.03280475 0.02733728 -0.01639561
[2,] 0.08887081 0.12082704 0.12113131
[3,] 0.09737433 0.06078419 0.05767342
The shape.
dim(shooter_embeddings)
[1] 870 50
Python
shooter_embeddings = model.layers[1].get_weights()[0]
What do we have?
shooter_embeddings[1:4, 1:4]
array([[-0.07189947, -0.08770541, 0.04801337],
[-0.03720884, -0.04936351, 0.04588475],
[-0.1379198 , -0.04528455, 0.08142862]], dtype=float32)
The shape.
shooter_embeddings.shape
(870, 50)
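To see how these embeddings are used, here is a sketch of the model's forward pass in plain Python: a row lookup, a dot product with the dense weights, and a sigmoid. The tiny 3-dimensional weights and the function `predict_goal_probability` are made up for illustration; the notebook's model uses 50 dimensions and learned weights.

```python
import math


def predict_goal_probability(shooter_index, embeddings, dense_weights, dense_bias):
    """Embedding lookup -> dot product with the dense weights -> sigmoid."""
    vector = embeddings[shooter_index]  # the embedding layer is a row lookup
    logit = sum(v * w for v, w in zip(vector, dense_weights)) + dense_bias
    return 1.0 / (1.0 + math.exp(-logit))


# made-up weights for illustration only
embeddings = [
    [0.0, 0.0, 0.0],    # index 0: reserved, never produced by the tokenizer
    [0.5, -0.2, 0.1],   # index 1
    [-0.3, 0.4, 0.2],   # index 2
]
dense_weights = [1.0, 1.0, 1.0]
dense_bias = -2.0

probability = predict_goal_probability(1, embeddings, dense_weights, dense_bias)
print(probability)
```

Since the shooter index is the only input, each player's vector is the only thing the model can adjust to explain that player's goal rate.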
Map the embeddings to the players
Each embedding belongs to a player, so we are interested in extracting these vectors and looking at player similarity, etc.
R
This is to help with some of the mapping. There may be more elegant ways to do this, but below is intuitive and simple in my opinion.
# build our vocabulary (player) dataframe
# https://www.r-bloggers.com/word-embeddings-with-keras/
players = data.frame(
  playerid = names(shooter_tokenizer$word_index),
  id = as.integer(unlist(shooter_tokenizer$word_index)), stringsAsFactors=FALSE)
players = dplyr::arrange(players, id)
The embeddings with names and references
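# keep only those rows where the indexes align - R is 1-based
# caution: row 1 of the weight matrix holds Keras index 0 (the reserved slot),
# so players$id + 1 may be the offset that matches the Python merge below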
shooter_embeddings = shooter_embeddings[players$id, ]
rownames(shooter_embeddings) = players$playerid
colnames(shooter_embeddings) = paste0("e", 1:ncol(shooter_embeddings))
shooter_embeddings[1:3, 1:3]
e1 e2 e3
8477492 -0.03280475 0.02733728 -0.01639561
8471214 0.08887081 0.12082704 0.12113131
8474157 0.09737433 0.06078419 0.05767342
Python
# make the embed vectors a pandas dataframe
shooter_embeddings = pd.DataFrame(shooter_embeddings)
# a list of true shooter ids
#shooter_id = [v for k, v in shooter_tokenizer.index_word.items()]
shooter_id = {k:v for k, v in shooter_tokenizer.index_word.items()}
shooter_df = pd.DataFrame.from_dict(shooter_id, orient='index', columns=["playerid"])
# name the columns
shooter_embeddings.columns = ["e" + str(i + 1) for i in range(shooter_embeddings.shape[1])]
# align the data by index
shooter_embeddings = pd.merge(shooter_embeddings, shooter_df, how='inner', left_index=True, right_index=True)
# clean up the index so its the player
shooter_embeddings.index = shooter_embeddings.playerid
# the first few
shooter_embeddings.iloc[:3, :3]
e1 e2 e3
playerid
8477492 0.093861 -0.071899 -0.087705
8471214 0.072550 -0.037209 -0.049364
8474157 0.059455 -0.137920 -0.045285
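One common way to look at player similarity with vectors like these is cosine similarity. The sketch below assumes that measure (the post does not say which one is used) and applies it to the first three dimensions printed above for two players. A real comparison would use all 50 dimensions, and the helper `cosine_similarity` is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# first three embedding dimensions from the python output above
player_a = [0.093861, -0.071899, -0.087705]   # 8477492
player_b = [0.072550, -0.037209, -0.049364]   # 8471214

similarity = cosine_similarity(player_a, player_b)
print(similarity)
```

Values near 1 mean the two vectors point in nearly the same direction, and values near 0 mean they are unrelated.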
Export the data to Tableau
Whether it is R or python, you might be asking why I am exporting the data to Tableau. That is a fair question, but the point is to show how the ecosystem of data science programming libraries can also leverage best-of-breed data visualization suites such as Tableau. The tool plays a key role in my exploratory analysis pipeline, and the goal below is to show how, in one line of code, we can export our data for rapid exploration, which can aid our data cleaning and modeling tasks within R/python.