Tensorflow, Tip-Ins, and Tableau, Oh My

This notebook aims to show the basics of: Tensorflow 2.0 Shooter Embedding estimation for NHL Player evaluation…

R
Python
NHL
Tensorflow
Keras
Author

Brock Tibert

Published

March 5, 2020

This notebook aims to show the basics of:

  1. Tensorflow 2.0
  2. Shooter Embedding estimation for NHL Player evaluation
  3. Evaluate feasibility generating a post that switches between R and python via reticulate
  4. Demonstrate code similarity/approach in both languages side-by-side

TL;DR

  • Combine Tensorflow/Keras with R
  • NHL Data to estimate Shooter Player Embeddings
  • Export to Tableau for exploration (yes we could use ggplot et. al, but highlights we have other options, especially for those new to the language)

R Setup

# packages
library(keras)

suppressPackageStartupMessages(library(tidyverse))
library(reticulate)
suppressPackageStartupMessages(library(caret))

# options
options(stringsAsFactors = FALSE)
use_condaenv("tensorflow")

Python setup

# imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Activation, concatenate, Dense, Dropout, Embedding, Input, Reshape, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model
from tensorflow.keras.preprocessing.text import Tokenizer

Get the data

R

URL = "http://peter-tanner.com/moneypuck/downloads/shots_2019.zip"
download.file(URL, destfile="shots.zip")
shots_raw = read_csv("shots.zip")

What’s the shape?

dim(shots_raw)
[1] 88592   124

Python

URL = "http://peter-tanner.com/moneypuck/downloads/shots_2019.zip"
shots_raw = pd.read_csv(URL)

What do we have?

shots_raw.shape
(88592, 124)

Filter rows

We want to keep shots on net, and not on an empty net, as well as remove records where the shooter id is 0.

R

# keep shots that were on goal
shots_raw = shots_raw %>% filter(shotWasOnGoal == 1 )
shots_raw = shots_raw %>% filter(shotOnEmptyNet == 0)
shots_raw = shots_raw %>% filter(shooterPlayerId != 0)
shots_raw = shots_raw %>% filter(!is.na(shooterPlayerId))

What we do have for a shape?

dim(shots_raw)
[1] 64318   124

Python

shots_raw = shots_raw.loc[shots_raw.shotOnEmptyNet == 0, :]
shots_raw = shots_raw.loc[shots_raw.shotWasOnGoal == 1, :]
shots_raw = shots_raw.loc[shots_raw.shooterPlayerId != 0, :]
shots_raw = shots_raw.loc[~shots_raw.shooterPlayerId.isna(), :]

What do we have for a shape?

shots_raw.shape
(64318, 124)

Select Columns

With the rows select, let’s keep the columns that we want to include in this analysis.

R

# keep just the columns that we need
shots_raw = shots_raw %>% select(shooterPlayerId, shotType, goal, arenaAdjustedShotDistance, 
                                   arenaAdjustedXCord, arenaAdjustedYCord,  shotAngle, offWing)

The shape …

dim(shots_raw)
[1] 64318     8

Python

COLS = ['shooterPlayerId', 'shotType', 'goal', 'offWing', 
        'arenaAdjustedShotDistance', 'arenaAdjustedXCord', 'arenaAdjustedYCord', 
        'shotAngle']
shots_raw = shots_raw[COLS]

The shape …

shots_raw.shape
(64318, 8)

Encode the shot types

I am going to one-hot the shot types, though in the future I will explore the use of keras.preprocessing.text.one_hot. The result will be new columns added to our shots_raw dataset, with each shot type flagged as 0/1.

R

x <- dummyVars(" ~ .", data = shots_raw)
shots_raw <- data.frame(predict(x, newdata = shots_raw))
rm(x)

What do we have?

glimpse(shots_raw)
Observations: 64,318
Variables: 14
$ shooterPlayerId           <dbl> 8480801, 8476853, 8476331, 8476853, 8475197…
$ shotTypeBACK              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ shotTypeDEFL              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ shotTypeSLAP              <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ shotTypeSNAP              <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0…
$ shotTypeTIP               <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
$ shotTypeWRAP              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ shotTypeWRIST             <dbl> 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1…
$ goal                      <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arenaAdjustedShotDistance <dbl> 4.123106, 59.000000, 30.000000, 40.000000, …
$ arenaAdjustedXCord        <dbl> 85, -30, 60, -56, -40, -48, -34, -77, -61, …
$ arenaAdjustedYCord        <dbl> -1, -2, -7, -22, -30, -8, 41, -13, -34, 11,…
$ shotAngle                 <dbl> -14.036243, 2.009554, -12.994617, 33.690068…
$ offWing                   <dbl> 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1…

Python

shots_raw = pd.get_dummies(shots_raw, columns=['shotType'])
print(shots_raw.shape)
(64318, 14)
print(shots_raw.head(3).T)
                                      0             2             5
shooterPlayerId            8.480801e+06  8.476853e+06  8.476331e+06
goal                       1.000000e+00  0.000000e+00  0.000000e+00
offWing                    1.000000e+00  0.000000e+00  0.000000e+00
arenaAdjustedShotDistance  4.123106e+00  5.900000e+01  3.000000e+01
arenaAdjustedXCord         8.500000e+01 -3.000000e+01  6.000000e+01
arenaAdjustedYCord        -1.000000e+00 -2.000000e+00 -7.000000e+00
shotAngle                 -1.403624e+01  2.009554e+00 -1.299462e+01
shotType_BACK              0.000000e+00  0.000000e+00  0.000000e+00
shotType_DEFL              0.000000e+00  0.000000e+00  0.000000e+00
shotType_SLAP              0.000000e+00  0.000000e+00  0.000000e+00
shotType_SNAP              0.000000e+00  1.000000e+00  1.000000e+00
shotType_TIP               1.000000e+00  0.000000e+00  0.000000e+00
shotType_WRAP              0.000000e+00  0.000000e+00  0.000000e+00
shotType_WRIST             0.000000e+00  0.000000e+00  0.000000e+00

Scale the numeric data to 0/1

R

# clunky, but break out columns to standardize
tmp = shots_raw %>% select(arenaAdjustedShotDistance:shotAngle)
tmp2 = preProcess(tmp, method = "range")
pp = predict(tmp2, tmp)
rm(tmp, tmp2)

# drop the original and append these
shots_raw = select(shots_raw, -arenaAdjustedShotDistance:-shotAngle)
shots_raw = cbind(shots_raw, pp)
dim(shots_raw)
[1] 64318    14

Python

scaler = MinMaxScaler()
COLS = ['arenaAdjustedShotDistance', 'arenaAdjustedXCord', 'arenaAdjustedYCord', 'shotAngle']
shots_raw[COLS] = scaler.fit_transform(shots_raw[COLS])
shots_raw.shape
(64318, 14)

Setup the tokenizer and fit to the Player IDs

For this exercise, instead of converting the player ids to be 0-based, I am going to treat the player ids as if they are unique words, with the unique number of players representing our complete vocabulary. As such, document represents a shot of the puck on net, and each document only includes one “word”, or shooter.

The trick here is that we have to treat our player ids as character strings.

R

# ensure that the shooter ID is a string
shots_raw$shooterPlayerId = as.character(shots_raw$shooterPlayerId)

# setup the tokenizer
shooter_tokenizer = text_tokenizer()

# fit the shooters 
fit_text_tokenizer(shooter_tokenizer, shots_raw$shooterPlayerId)

What do we have?

shooter_tokenizer$index_word[1:3]
$`1`
[1] "8477492"

$`2`
[1] "8471214"

$`3`
[1] "8474157"
shooter_tokenizer$word_index[1:3]
$`8477492`
[1] 1

$`8471214`
[1] 2

$`8474157`
[1] 3

And how many?

length(shooter_tokenizer$index_word)
[1] 869

Python

# make an integer so zero is not parsed
shots_raw.shooterPlayerId = shots_raw.shooterPlayerId.astype('int')

# ensure that the player ID is a string
shots_raw.shooterPlayerId = shots_raw.shooterPlayerId.astype('str')

# setup the tokenizer
shooter_tokenizer = Tokenizer()

# fit the tokenizer to shooters
shooter_tokenizer.fit_on_texts(shots_raw.shooterPlayerId)

What do we have?

list(shooter_tokenizer.index_word.items())[:3]
[(1, '8477492'), (2, '8471214'), (3, '8474157')]
list(shooter_tokenizer.word_index.items())[:3]
[('8477492', 1), ('8471214', 2), ('8474157', 3)]

And how many?

len(shooter_tokenizer.index_word.items())
869

Create the Shooter sequences

These are size 1 sequences that do not require padding, as we only allow 1 word (or player) per shot. The key here is that we are using keras to help us easily map our data to the new id system.

R

# make sequences with the new index
shooters = texts_to_sequences(shooter_tokenizer, shots_raw$shooterPlayerId)
shooters = unlist(shooters)

What do we have?

class(shooters)
[1] "integer"
length(shooters)
[1] 64318

Python

shooters = shooter_tokenizer.texts_to_sequences(shots_raw.shooterPlayerId)
shooters = [x[0] for x in shooters]
shooters = np.array(shooters)

What do we have?

type(shooters)
<class 'numpy.ndarray'>
len(shooters)
64318

Isolate the other features/targets

R

# Was the shot a goal?   This is our target.
goal = shots_raw$goal

# the shot info
shot_info = shots_raw %>% select(-shooterPlayerId, -goal)
shot_info = as.matrix(shot_info)

What do we have now?

length(goal); mean(goal);
[1] 64318
[1] 0.09103206
dim(shot_info)
[1] 64318    12
colnames(shot_info)
 [1] "shotTypeBACK"              "shotTypeDEFL"             
 [3] "shotTypeSLAP"              "shotTypeSNAP"             
 [5] "shotTypeTIP"               "shotTypeWRAP"             
 [7] "shotTypeWRIST"             "offWing"                  
 [9] "arenaAdjustedShotDistance" "arenaAdjustedXCord"       
[11] "arenaAdjustedYCord"        "shotAngle"                

Python

# Was the shot a goal?   This is our target.
goal = np.array(shots_raw.goal)

# the shot info
shot_info = shots_raw.drop(columns=['shooterPlayerId', 'goal'], axis=1, inplace=False)

What do we have?

len(goal)
64318
goal.mean()
0.09103205945458503
shot_info.shape
(64318, 12)
shot_info.columns
Index(['offWing', 'arenaAdjustedShotDistance', 'arenaAdjustedXCord',
       'arenaAdjustedYCord', 'shotAngle', 'shotType_BACK', 'shotType_DEFL',
       'shotType_SLAP', 'shotType_SNAP', 'shotType_TIP', 'shotType_WRAP',
       'shotType_WRIST'],
      dtype='object')

Define the model architecture

R

Note the +1, it’s needed to avoid the index error

# the setup
NUM_SHOOTERS = length(unique(unlist(shooter_tokenizer$index_word))) +1
SHOT_COLS = ncol(shot_info)
VEC_SIZE = 50

# the input layers
shooter_input = layer_input(shape=c(1), name = "shooter_input")
shot_input = layer_input(shape=c(SHOT_COLS), name = "shot_input")

# shooter layers
s1 = layer_embedding(input_dim = NUM_SHOOTERS, 
                     output_dim = VEC_SIZE, 
                     input_length = 1, 
                     name="shooter_embedding")(shooter_input)
s2 = layer_flatten(name = "shooter_flat")(s1)
s3 = layer_dense(units = 1, activation = "sigmoid")(s2)

# put the model together
model = keras_model(inputs = shooter_input, outputs = s3)

Summarize:

summary(model)
Model: "model"
________________________________________________________________________________
Layer (type)                        Output Shape                    Param #
================================================================================
shooter_input (InputLayer)          [(None, 1)]                     0
________________________________________________________________________________
shooter_embedding (Embedding)       (None, 1, 50)                   43500
________________________________________________________________________________
shooter_flat (Flatten)              (None, 50)                      0
________________________________________________________________________________
dense (Dense)                       (None, 1)                       51
================================================================================
Total params: 43,551
Trainable params: 43,551
Non-trainable params: 0
________________________________________________________________________________

Python

Note the +2, it’s needed to avoid the index error and differs from abvoe

# setup
NUM_SHOOTERS = len(np.unique(shooters)) + 1
SHOT_COLS = shot_info.shape[1]
VEC_SIZE = 50

# the input layers
shooter_input = Input(shape=(1, ), name="shooter_input")
shot_input = Input(shape=(SHOT_COLS, ), name="shot_input")

# shooter layers
s1 = Embedding(NUM_SHOOTERS, VEC_SIZE, input_length=1)(shooter_input)
s2 = Flatten()(s1)
s3 = Dense(1, activation="sigmoid")(s2)

# put the model together
model = Model(inputs = shooter_input, outputs = s3)

What do we have?

model.summary()
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
shooter_input (InputLayer)   [(None, 1)]               0
_________________________________________________________________
embedding (Embedding)        (None, 1, 50)             43500
_________________________________________________________________
flatten (Flatten)            (None, 50)                0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 51
=================================================================
Total params: 43,551
Trainable params: 43,551
Non-trainable params: 0
_________________________________________________________________

and plot the model, this is not available within R at the moment.

# below might choke RMD
plot_model(model, to_file='model.png')

Train and Evaluate the Model

R

Compile the model.

model %>% 
  compile(optimizer = "adam", 
          loss="binary_crossentropy", 
          metrics =c("accuracy"))

Fit the model and record the history for plotting, if needed

history = 
  model %>% 
  fit(x=list(shooters), 
      y=goal, 
      epochs = 5, 
      verbose = 2)

Python

Compile the model.

model.compile(optimizer="adam", loss = "binary_crossentropy", metrics = ['accuracy'])

Fit the model.

X = [shooters, shot_info]
history = model.fit(X, goal, epochs=5)

Get the Embeddings

With our simple model, we have estimated embeddings for each shooter. Let’s grab those.

R

shooter_embeddings = get_weights(model)[[1]]

What do we have?

shooter_embeddings[1:3, 1:3]
            [,1]       [,2]        [,3]
[1,] -0.03280475 0.02733728 -0.01639561
[2,]  0.08887081 0.12082704  0.12113131
[3,]  0.09737433 0.06078419  0.05767342

The shape.

dim(shooter_embeddings)
[1] 870  50

Python

shooter_embeddings = model.layers[1].get_weights()[0]

What do we have?

shooter_embeddings[1:4, 1:4]
array([[-0.07189947, -0.08770541,  0.04801337],
       [-0.03720884, -0.04936351,  0.04588475],
       [-0.1379198 , -0.04528455,  0.08142862]], dtype=float32)

The shape.

shooter_embeddings.shape
(870, 50)

Map the embeddings to the players

The embeddings are related to a player, so we are intereseted extracting these vectors and looking at player similarity, etc.

R

This is to help with some of the mapping. There may be more elegant ways to do this, but below is intuitive and simple in my opinion.

# build our vocabulary (player) dataframe
# https://www.r-bloggers.com/word-embeddings-with-keras/
players = data.frame(
  playerid = names(shooter_tokenizer$word_index), 
  id = as.integer(unlist(shooter_tokenizer$word_index)), stringsAsFactors=FALSE)

players = dplyr::arrange(players, id)

The embeddings with names and references

# keep only those rows where the indexes align - R is 1-based
shooter_embeddings = shooter_embeddings[players$id, ]
rownames(shooter_embeddings) = players$playerid
colnames(shooter_embeddings) = paste0("e", 1:ncol(shooter_embeddings))
shooter_embeddings[1:3, 1:3]
                 e1         e2          e3
8477492 -0.03280475 0.02733728 -0.01639561
8471214  0.08887081 0.12082704  0.12113131
8474157  0.09737433 0.06078419  0.05767342

Python

# make the embed vectors a pandas dataframe
shooter_embeddings = pd.DataFrame(shooter_embeddings)

# a list of true shooter ids
#shooter_id = [v for k, v in shooter_tokenizer.index_word.items()]
shooter_id = {k:v for k, v in shooter_tokenizer.index_word.items()}
shooter_df = pd.DataFrame.from_dict(shooter_id, orient='index', columns=["playerid"])

# name the columns
shooter_embeddings.columns = ["e" + str(i + 1) for i in range(shooter_embeddings.shape[1])]

# align the data by index
shooter_embeddings = pd.merge(shooter_embeddings, shooter_df, how='inner', left_index=True, right_index=True)

# clean up the index so its the player
shooter_embeddings.index = shooter_embeddings.playerid

# the first few
shooter_embeddings.iloc[:3, :3]
                e1        e2        e3
playerid                              
8477492   0.093861 -0.071899 -0.087705
8471214   0.072550 -0.037209 -0.049364
8474157   0.059455 -0.137920 -0.045285

Export the data to Tableau

Whether it is R or python, you might be asking why I am exporting the data to Tableau. That is a fair question, but the point is to show how the ecosystem of data science programming libraries can also leverage best-of-breed data visualization suites such as Tableau. The tool plays a key role in my exploratory analysis pipeline, and the goal below is show how in 1-line of code, we can export our data for rapid exploration, which can aid in our data cleaning and modeling tasks within R/python.