In the National Football League (NFL), every play has the potential to change the game. One such game-changing play is the sack, where the quarterback is tackled behind the line of scrimmage before he can throw a forward pass. Using logistic regression, we can predict whether a sack is likely to happen based on different circumstances within a game.
The first step involves setting up the necessary Python libraries. Among these are:
- nfl_data_py: A package for fetching NFL data.
- pandas, numpy: For data manipulation.
- matplotlib, seaborn: For visualization.
- sklearn, xgboost: For machine learning.

For this we are using NFL play-by-play data from the years 2020, 2021, and 2022.
pbp = nfl.import_pbp_data([2020, 2021, 2022])
The dataset undergoes a cleaning process, filtering out non-passing plays and plays labeled as "no_play". This ensures that the data used for modeling is relevant and noise-free.
pbp_clean = pbp[(pbp['pass'] == 1) & (pbp['play_type'] != "no_play")]
Visualization plays a crucial role in understanding the dataset, so we start with some exploratory data analysis (EDA). One simple view, sketched below, is how the sack rate varies by down.
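This is only an illustrative EDA snippet (the original post's plots aren't reproduced here); it assumes the 'down' and 'sack' columns from the play-by-play data and arbitrary plot styling:

# Illustrative EDA sketch: sack rate by down
sack_rate_by_down = pbp_clean.groupby('down')['sack'].mean().reset_index()
sns.barplot(data=sack_rate_by_down, x='down', y='sack')
plt.ylabel('Sack rate')
plt.title('Sack rate by down')
plt.show()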
Looking at the data, we think about features we can add to make predicting a sack easier. From our EDA, a sack is most likely to happen on third down. Based on this, we can define an "obvious pass" variable to help with the analysis: an obvious pass is a play on third down with at least 6 yards to go for a first down.
pbp_clean['obvious_pass'] = np.where((pbp_clean['down'] == 3) & (pbp_clean['ydstogo'] >= 6), 1,0)
After this, we select the relevant columns and clean the dataframe so that all rows with null values are removed.
pre_df = pbp_clean[['game_id', 'play_id', 'season', 'name', 'down', 'ydstogo', 'yardline_100', 'game_seconds_remaining',
'defenders_in_box', 'number_of_pass_rushers', 'xpass', 'obvious_pass', 'sack']]
df = pre_df.dropna()
Next, the 'down' variable (1st down, 2nd down, etc.) is treated as a categorical feature and one-hot encoded, making it more digestible for the models.
df['down'] = df['down'].astype('category')
df_no_ids = df.drop(columns = ['game_id', 'play_id', 'name', 'season'])
df_no_ids = pd.get_dummies(df_no_ids, columns = ['down'])
Before we run our machine learning models, stratified training and test sets need to be created. This ensures that the training and test sets have a similar distribution of the target variable, which is especially useful when working with unbalanced classes such as this one, where sacks make up only a small fraction of pass plays.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
for train_index, test_index in sss.split(df_no_ids, df_no_ids['sack']):
    strat_train_set = df_no_ids.iloc[train_index]
    strat_test_set = df_no_ids.iloc[test_index]
X_train = strat_train_set.drop(columns = ['sack'])
Y_train = strat_train_set['sack']
X_test = strat_test_set.drop(columns = ['sack'])
Y_test = strat_test_set['sack']
Three machine learning models are then trained on the data. Each model's performance is evaluated using the Brier score, which measures the mean squared difference between the predicted probabilities and the actual outcomes.
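As a hedged sketch of what that training and scoring could look like (the original post does not show which three models or which hyperparameters were used, so the choices below are assumptions), two candidates grounded in the libraries above are logistic regression and XGBoost:

# Illustrative only: fit two candidate models and score them with the Brier score
log_reg = LogisticRegression(max_iter=1000)           # assumed hyperparameter
log_reg.fit(X_train, Y_train)
log_reg_probs = log_reg.predict_proba(X_test)[:, 1]   # predicted sack probabilities
print('Logistic regression Brier score:', brier_score_loss(Y_test, log_reg_probs))

xgb_model = XGBClassifier(n_estimators=200, learning_rate=0.05)  # assumed hyperparameters
xgb_model.fit(X_train, Y_train)
xgb_probs = xgb_model.predict_proba(X_test)[:, 1]
print('XGBoost Brier score:', brier_score_loss(Y_test, xgb_probs))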
We can also get visual insight into the importance of the different features used by the XGBClassifier model. The importance values are fetched and then sorted to rank the features by their influence on the model's predictions.
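A minimal sketch of how that ranking could be produced, assuming the fitted XGBClassifier from the snippet above (the variable name xgb_model and the plot styling are illustrative):

# Rank features by the XGBClassifier's built-in importance scores
importances = pd.Series(xgb_model.feature_importances_, index=X_train.columns)
importances = importances.sort_values(ascending=False)
sns.barplot(x=importances.values, y=importances.index)
plt.xlabel('Feature importance')
plt.title('XGBClassifier feature importance')
plt.show()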
A link to the full GitHub repo is located here.