Tensorflow, Tip-Ins, and Tableau, Oh My

Last updated on Mar 5, 2020 10 min read R

This notebook aims to show the basics of:

Tensorflow 2.0
Shooter Embedding estimation for NHL Player evaluation
Evaluate feasibility generating a post that switches between R and python via reticulate
Demonstrate code similarity/approach in both languages side-by-side

TL;DR

Combine Tensorflow/Keras with R
NHL Data to estimate Shooter Player Embeddings
Export to Tableau for exploration (yes we could use ggplot et. al, but highlights we have other options, especially for those new to the language)

R Setup

# packages
library(keras)

suppressPackageStartupMessages(library(tidyverse))
library(reticulate)
suppressPackageStartupMessages(library(caret))

# options
options(stringsAsFactors = FALSE)
use_condaenv("tensorflow")

Python setup

# imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Activation, concatenate, Dense, Dropout, Embedding, Input, Reshape, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model
from tensorflow.keras.preprocessing.text import Tokenizer

Get the data

R

URL = "http://peter-tanner.com/moneypuck/downloads/shots_2019.zip"
download.file(URL, destfile="shots.zip")
shots_raw = read_csv("shots.zip")

What’s the shape?

dim(shots_raw)

[1] 88592   124

python

URL = "http://peter-tanner.com/moneypuck/downloads/shots_2019.zip"
shots_raw = pd.read_csv(URL)

What do we have?

shots_raw.shape

(88592, 124)

Filter rows

We want to keep shots on net, and not on an empty net, as well as remove records where the shooter id is 0.

R

# keep shots that were on goal
shots_raw = shots_raw %>% filter(shotWasOnGoal == 1 )
shots_raw = shots_raw %>% filter(shotOnEmptyNet == 0)
shots_raw = shots_raw %>% filter(shooterPlayerId != 0)
shots_raw = shots_raw %>% filter(!is.na(shooterPlayerId))

What we do have for a shape?

dim(shots_raw)

[1] 64318   124

python

shots_raw = shots_raw.loc[shots_raw.shotOnEmptyNet == 0, :]
shots_raw = shots_raw.loc[shots_raw.shotWasOnGoal == 1, :]
shots_raw = shots_raw.loc[shots_raw.shooterPlayerId != 0, :]
shots_raw = shots_raw.loc[~shots_raw.shooterPlayerId.isna(), :]

What do we have for a shape?

shots_raw.shape

(64318, 124)

Select Columns

With the rows select, let’s keep the columns that we want to include in this analysis.

R

# keep just the columns that we need
shots_raw = shots_raw %>% select(shooterPlayerId, shotType, goal, arenaAdjustedShotDistance, 
                                   arenaAdjustedXCord, arenaAdjustedYCord,  shotAngle, offWing)

The shape …

dim(shots_raw)

[1] 64318     8

python

COLS = ['shooterPlayerId', 'shotType', 'goal', 'offWing', 
        'arenaAdjustedShotDistance', 'arenaAdjustedXCord', 'arenaAdjustedYCord', 
        'shotAngle']
shots_raw = shots_raw[COLS]

The shape …

shots_raw.shape

(64318, 8)

Encode the shot types

I am going to one-hot the shot types, though in the future I will explore the use of keras.preprocessing.text.one_hot. The result will be new columns added to our shots_raw dataset, with each shot type flagged as 0/1.

R

x <- dummyVars(" ~ .", data = shots_raw)
shots_raw <- data.frame(predict(x, newdata = shots_raw))
rm(x)

What do we have?

glimpse(shots_raw)

Observations: 64,318
Variables: 14
$ shooterPlayerId           <dbl> 8480801, 8476853, 8476331, 8476853, 8475197…
$ shotTypeBACK              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ shotTypeDEFL              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ shotTypeSLAP              <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ shotTypeSNAP              <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0…
$ shotTypeTIP               <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
$ shotTypeWRAP              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ shotTypeWRIST             <dbl> 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1…
$ goal                      <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ arenaAdjustedShotDistance <dbl> 4.123106, 59.000000, 30.000000, 40.000000, …
$ arenaAdjustedXCord        <dbl> 85, -30, 60, -56, -40, -48, -34, -77, -61, …
$ arenaAdjustedYCord        <dbl> -1, -2, -7, -22, -30, -8, 41, -13, -34, 11,…
$ shotAngle                 <dbl> -14.036243, 2.009554, -12.994617, 33.690068…
$ offWing                   <dbl> 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1…

python

shots_raw = pd.get_dummies(shots_raw, columns=['shotType'])
print(shots_raw.shape)

(64318, 14)

print(shots_raw.head(3).T)

                                      0             2             5
shooterPlayerId            8.480801e+06  8.476853e+06  8.476331e+06
goal                       1.000000e+00  0.000000e+00  0.000000e+00
offWing                    1.000000e+00  0.000000e+00  0.000000e+00
arenaAdjustedShotDistance  4.123106e+00  5.900000e+01  3.000000e+01
arenaAdjustedXCord         8.500000e+01 -3.000000e+01  6.000000e+01
arenaAdjustedYCord        -1.000000e+00 -2.000000e+00 -7.000000e+00
shotAngle                 -1.403624e+01  2.009554e+00 -1.299462e+01
shotType_BACK              0.000000e+00  0.000000e+00  0.000000e+00
shotType_DEFL              0.000000e+00  0.000000e+00  0.000000e+00
shotType_SLAP              0.000000e+00  0.000000e+00  0.000000e+00
shotType_SNAP              0.000000e+00  1.000000e+00  1.000000e+00
shotType_TIP               1.000000e+00  0.000000e+00  0.000000e+00
shotType_WRAP              0.000000e+00  0.000000e+00  0.000000e+00
shotType_WRIST             0.000000e+00  0.000000e+00  0.000000e+00

Scale the numeric data to 0/1

R

# clunky, but break out columns to standardize
tmp = shots_raw %>% select(arenaAdjustedShotDistance:shotAngle)
tmp2 = preProcess(tmp, method = "range")
pp = predict(tmp2, tmp)
rm(tmp, tmp2)

# drop the original and append these
shots_raw = select(shots_raw, -arenaAdjustedShotDistance:-shotAngle)
shots_raw = cbind(shots_raw, pp)
dim(shots_raw)

[1] 64318    14

python

scaler = MinMaxScaler()
COLS = ['arenaAdjustedShotDistance', 'arenaAdjustedXCord', 'arenaAdjustedYCord', 'shotAngle']
shots_raw[COLS] = scaler.fit_transform(shots_raw[COLS])
shots_raw.shape

(64318, 14)

Setup the tokenizer and fit to the Player IDs

For this exercise, instead of converting the player ids to be 0-based, I am going to treat the player ids as if they are unique words, with the unique number of players representing our complete vocabulary. As such, document represents a shot of the puck on net, and each document only includes one “word”, or shooter.

The trick here is that we have to treat our player ids as character strings.

R

# ensure that the shooter ID is a string
shots_raw$shooterPlayerId = as.character(shots_raw$shooterPlayerId)

# setup the tokenizer
shooter_tokenizer = text_tokenizer()

# fit the shooters 
fit_text_tokenizer(shooter_tokenizer, shots_raw$shooterPlayerId)

What do we have?

shooter_tokenizer$index_word[1:3]

$`1`
[1] "8477492"

$`2`
[1] "8471214"

$`3`
[1] "8474157"

shooter_tokenizer$word_index[1:3]

$`8477492`
[1] 1

$`8471214`
[1] 2

$`8474157`
[1] 3

And how many?

length(shooter_tokenizer$index_word)

[1] 869

python

# make an integer so zero is not parsed
shots_raw.shooterPlayerId = shots_raw.shooterPlayerId.astype('int')

# ensure that the player ID is a string
shots_raw.shooterPlayerId = shots_raw.shooterPlayerId.astype('str')

# setup the tokenizer
shooter_tokenizer = Tokenizer()

# fit the tokenizer to shooters
shooter_tokenizer.fit_on_texts(shots_raw.shooterPlayerId)

What do we have?

list(shooter_tokenizer.index_word.items())[:3]

[(1, '8477492'), (2, '8471214'), (3, '8474157')]

list(shooter_tokenizer.word_index.items())[:3]

[('8477492', 1), ('8471214', 2), ('8474157', 3)]

And how many?

len(shooter_tokenizer.index_word.items())

Create the Shooter `sequences`

These are size 1 sequences that do not require padding, as we only allow 1 word (or player) per shot. The key here is that we are using keras to help us easily map our data to the new id system.

R

# make sequences with the new index
shooters = texts_to_sequences(shooter_tokenizer, shots_raw$shooterPlayerId)
shooters = unlist(shooters)

What do we have?

class(shooters)

[1] "integer"

length(shooters)

[1] 64318

python

shooters = shooter_tokenizer.texts_to_sequences(shots_raw.shooterPlayerId)
shooters = [x[0] for x in shooters]
shooters = np.array(shooters)

What do we have?

type(shooters)

<class 'numpy.ndarray'>

len(shooters)

Isolate the other features/targets

R

# Was the shot a goal?   This is our target.
goal = shots_raw$goal

# the shot info
shot_info = shots_raw %>% select(-shooterPlayerId, -goal)
shot_info = as.matrix(shot_info)

What do we have now?

length(goal); mean(goal);

[1] 64318

[1] 0.09103206

dim(shot_info)

[1] 64318    12

colnames(shot_info)

 [1] "shotTypeBACK"              "shotTypeDEFL"             
 [3] "shotTypeSLAP"              "shotTypeSNAP"             
 [5] "shotTypeTIP"               "shotTypeWRAP"             
 [7] "shotTypeWRIST"             "offWing"                  
 [9] "arenaAdjustedShotDistance" "arenaAdjustedXCord"       
[11] "arenaAdjustedYCord"        "shotAngle"

python


# Was the shot a goal?   This is our target.
goal = np.array(shots_raw.goal)

# the shot info
shot_info = shots_raw.drop(columns=['shooterPlayerId', 'goal'], axis=1, inplace=False)

What do we have?

len(goal)

goal.mean()

0.09103205945458503

shot_info.shape

(64318, 12)

shot_info.columns

Index(['offWing', 'arenaAdjustedShotDistance', 'arenaAdjustedXCord',
       'arenaAdjustedYCord', 'shotAngle', 'shotType_BACK', 'shotType_DEFL',
       'shotType_SLAP', 'shotType_SNAP', 'shotType_TIP', 'shotType_WRAP',
       'shotType_WRIST'],
      dtype='object')

Define the model architecture

R

Note the +1, it’s needed to avoid the index error

# the setup
NUM_SHOOTERS = length(unique(unlist(shooter_tokenizer$index_word))) +1
SHOT_COLS = ncol(shot_info)
VEC_SIZE = 50

# the input layers
shooter_input = layer_input(shape=c(1), name = "shooter_input")
shot_input = layer_input(shape=c(SHOT_COLS), name = "shot_input")

# shooter layers
s1 = layer_embedding(input_dim = NUM_SHOOTERS, 
                     output_dim = VEC_SIZE, 
                     input_length = 1, 
                     name="shooter_embedding")(shooter_input)
s2 = layer_flatten(name = "shooter_flat")(s1)
s3 = layer_dense(units = 1, activation = "sigmoid")(s2)

# put the model together
model = keras_model(inputs = shooter_input, outputs = s3)

Summarize:

summary(model)

Model: "model"
________________________________________________________________________________
Layer (type)                        Output Shape                    Param #     
================================================================================
shooter_input (InputLayer)          [(None, 1)]                     0           
________________________________________________________________________________
shooter_embedding (Embedding)       (None, 1, 50)                   43500       
________________________________________________________________________________
shooter_flat (Flatten)              (None, 50)                      0           
________________________________________________________________________________
dense (Dense)                       (None, 1)                       51          
================================================================================
Total params: 43,551
Trainable params: 43,551
Non-trainable params: 0
________________________________________________________________________________

python

Note the +2, it’s needed to avoid the index error and differs from abvoe

# setup
NUM_SHOOTERS = len(np.unique(shooters)) + 1
SHOT_COLS = shot_info.shape[1]
VEC_SIZE = 50

# the input layers
shooter_input = Input(shape=(1, ), name="shooter_input")
shot_input = Input(shape=(SHOT_COLS, ), name="shot_input")

# shooter layers
s1 = Embedding(NUM_SHOOTERS, VEC_SIZE, input_length=1)(shooter_input)
s2 = Flatten()(s1)
s3 = Dense(1, activation="sigmoid")(s2)

# put the model together
model = Model(inputs = shooter_input, outputs = s3)

What do we have?

model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
shooter_input (InputLayer)   [(None, 1)]               0         
_________________________________________________________________
embedding (Embedding)        (None, 1, 50)             43500     
_________________________________________________________________
flatten (Flatten)            (None, 50)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 51        
=================================================================
Total params: 43,551
Trainable params: 43,551
Non-trainable params: 0
_________________________________________________________________

and plot the model, this is not available within R at the moment.

# below might choke RMD
plot_model(model, to_file='model.png')

Train and Evaluate the Model

R

Compile the model.

model %>% 
  compile(optimizer = "adam", 
          loss="binary_crossentropy", 
          metrics =c("accuracy"))

Fit the model and record the history for plotting, if needed

history = 
  model %>% 
  fit(x=list(shooters), 
      y=goal, 
      epochs = 5, 
      verbose = 2)

python

Compile the model.

model.compile(optimizer="adam", loss = "binary_crossentropy", metrics = ['accuracy'])

Fit the model.

X = [shooters, shot_info]
history = model.fit(X, goal, epochs=5)

Get the Embeddings

With our simple model, we have estimated embeddings for each shooter. Let’s grab those.

R

shooter_embeddings = get_weights(model)[[1]]

What do we have?

shooter_embeddings[1:3, 1:3]

            [,1]       [,2]        [,3]
[1,] -0.03280475 0.02733728 -0.01639561
[2,]  0.08887081 0.12082704  0.12113131
[3,]  0.09737433 0.06078419  0.05767342

The shape.

dim(shooter_embeddings)

[1] 870  50

python

shooter_embeddings = model.layers[1].get_weights()[0]

What do we have?

shooter_embeddings[1:4, 1:4]

array([[-0.07189947, -0.08770541,  0.04801337],
       [-0.03720884, -0.04936351,  0.04588475],
       [-0.1379198 , -0.04528455,  0.08142862]], dtype=float32)

The shape.

shooter_embeddings.shape

(870, 50)

Map the embeddings to the players

The embeddings are related to a player, so we are intereseted extracting these vectors and looking at player similarity, etc.

R

This is to help with some of the mapping. There may be more elegant ways to do this, but below is intuitive and simple in my opinion.

# build our vocabulary (player) dataframe
# https://www.r-bloggers.com/word-embeddings-with-keras/
players = data.frame(
  playerid = names(shooter_tokenizer$word_index), 
  id = as.integer(unlist(shooter_tokenizer$word_index)), stringsAsFactors=FALSE)

players = dplyr::arrange(players, id)

The embeddings with names and references

# keep only those rows where the indexes align - R is 1-based
shooter_embeddings = shooter_embeddings[players$id, ]
rownames(shooter_embeddings) = players$playerid
colnames(shooter_embeddings) = paste0("e", 1:ncol(shooter_embeddings))
shooter_embeddings[1:3, 1:3]

                 e1         e2          e3
8477492 -0.03280475 0.02733728 -0.01639561
8471214  0.08887081 0.12082704  0.12113131
8474157  0.09737433 0.06078419  0.05767342

python

# make the embed vectors a pandas dataframe
shooter_embeddings = pd.DataFrame(shooter_embeddings)

# a list of true shooter ids
#shooter_id = [v for k, v in shooter_tokenizer.index_word.items()]
shooter_id = {k:v for k, v in shooter_tokenizer.index_word.items()}
shooter_df = pd.DataFrame.from_dict(shooter_id, orient='index', columns=["playerid"])

# name the columns
shooter_embeddings.columns = ["e" + str(i + 1) for i in range(shooter_embeddings.shape[1])]

# align the data by index
shooter_embeddings = pd.merge(shooter_embeddings, shooter_df, how='inner', left_index=True, right_index=True)

# clean up the index so its the player
shooter_embeddings.index = shooter_embeddings.playerid

# the first few
shooter_embeddings.iloc[:3, :3]

                e1        e2        e3
playerid                              
8477492   0.093861 -0.071899 -0.087705
8471214   0.072550 -0.037209 -0.049364
8474157   0.059455 -0.137920 -0.045285

Export the data to Tableau

Whether it is R or python, you might be asking why I am exporting the data to Tableau. That is a fair question, but the point is to show how the ecosystem of data science programming libraries can also leverage best-of-breed data visualization suites such as Tableau. The tool plays a key role in my exploratory analysis pipeline, and the goal below is show how in 1-line of code, we can export our data for rapid exploration, which can aid in our data cleaning and modeling tasks within R/python.

R

I ported a copy of the pantab library in python into R. The trick is that I use reticulate to port the python bits into R. As such, at present, it will not work if you are using Google Colab.

Installation is simple:

devtools::install_github("btibert3/pantabR")

sdf = as.data.frame(shooter_embeddings)
pantabR::frame_to_hyper(sdf, f="embeddings.hyper", tbl="shooters")

In the python section, I am going to use t-SNE to reduce the estimated shooter embeddings into a two dimensional space. The process is similar in R using the Rtsne package.

python

The pantab library is easy to install:

pip install pantab
# if in a notebook environment
# !pip install pantab

However, prior to writing out the data for exploration, I am going to use t-SNE to compact the estimated shooter embeddings into a two dimensional coordinate system. For more on t-SNE, refer to this introduction.

from sklearn.manifold import TSNE
shooter_tsne = TSNE(n_components=2).fit_transform(shooter_embeddings.iloc[:, :50])

Add them to the dataframe

shooter_tsne = pd.DataFrame(shooter_tsne)
shooter_tsne.columns = ['t1', 't2']
shooter_embeddings.reset_index(inplace=True, drop=True)
shooter_embeddings = pd.concat([shooter_embeddings, shooter_tsne], axis=1)

With pantab setup, that package makes it really simple to write pandas dataframes to hyper files for Tableau.

import pantab
pantab.frame_to_hyper(shooter_embeddings, "embeddings.hyper", table="shooters")

And the simple embeddings, plotted from our exported embeddings.hyper file within Tableau.

NHL Tensorflow Keras

Tensorflow, Tip-Ins, and Tableau, Oh My

TL;DR

R Setup

Python setup

Get the data

R

python

Filter rows

R

python

Select Columns

R

python

Encode the shot types

R

python

Scale the numeric data to 0/1

R

python

Setup the tokenizer and fit to the Player IDs

R

python

Create the Shooter sequences

R

python

Isolate the other features/targets

R

python

Define the model architecture

R

python

Train and Evaluate the Model

R

python

Get the Embeddings

R

python

Map the embeddings to the players

R

python

Export the data to Tableau

R

python

Brock Tibert

Lecturer, Information Systems

Related

Create the Shooter `sequences`