First, let's load the libraries and get the play-by-play data for the last few NFL seasons. We'll use nflfastR to download the data and then transform it into a format that can be used for our analysis.
library(nflfastR)
library(dplyr)
library(ggplot2)
library(tidyr)
# Load play-by-play data from the 2018-2023 seasons
pbp_data <- load_pbp(2018:2023)
Using nflfastR's load_pbp() function, we loaded play-by-play data for the 2018 through 2023 seasons. Every game-level metric we compare against wins below is derived from this data.
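Before building on the download, it can help to confirm that the columns used later are actually present. The sketch below is only an illustration: missing_columns() is a hypothetical helper (not part of nflfastR), and toy_pbp is a made-up stand-in for pbp_data so the example runs without a download.

```r
# Columns the rest of this walkthrough relies on
required_cols <- c("game_id", "season", "week", "home_team", "away_team",
                   "posteam", "result", "yards_gained", "interception",
                   "fumble_lost", "pass", "rush", "success")

# Hypothetical helper: returns the required columns that a data frame lacks
missing_columns <- function(df, required = required_cols) {
  setdiff(required, names(df))
}

# Toy stand-in for pbp_data (made-up row, deliberately missing most columns)
toy_pbp <- data.frame(game_id = "toy_game_1", posteam = "AAA", yards_gained = 5)

missing <- missing_columns(toy_pbp)
print(missing)
```

With the real data, missing_columns(pbp_data) should return an empty character vector if every column is available.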
To explore what leads to a win, we need to create a dataset that summarizes key metrics per game. We’ll group the play-by-play data by game and team, then calculate metrics such as total yards, turnovers, and the success rate of passing and rushing plays.
# Transform the data to calculate metrics for each team in each game
team_summary <- pbp_data %>%
  filter(!is.na(posteam)) %>%
  group_by(game_id, season, week, home_team, away_team, posteam, result) %>%
  summarise(
    total_yards = sum(yards_gained, na.rm = TRUE),
    # Sum the two columns separately so an NA in one does not hide the other
    turnovers = sum(interception, na.rm = TRUE) + sum(fumble_lost, na.rm = TRUE),
    # Success rate among pass (or rush) plays only, not the share of all plays
    pass_success_rate = mean(success[pass == 1], na.rm = TRUE),
    rush_success_rate = mean(success[rush == 1], na.rm = TRUE),
    .groups = "drop"
  )
In the above code, we calculated some common metrics for each team in each game: total yards, number of turnovers, and success rates for passing and rushing plays. We used group_by() to group the plays by game and team, and summarise() to calculate the metrics for each group.
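One detail worth checking when computing a success rate is the denominator. The toy example below (made-up plays, not real NFL data) contrasts the share of all plays that were successful passes with the success rate among pass plays only. The second number is what "pass success rate" usually means.

```r
# Toy plays: 1 = yes, 0 = no (made-up values for illustration)
plays <- data.frame(
  pass    = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  success = c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)
)

# Share of ALL plays that were successful passes (denominator: 10 plays)
joint_share <- mean(plays$pass == 1 & plays$success == 1)

# Success rate among pass plays only (denominator: 4 pass plays)
pass_success_rate <- mean(plays$success[plays$pass == 1])

print(c(joint_share = joint_share, pass_success_rate = pass_success_rate))
```

Here three of the four passes succeed, so the pass success rate is 0.75, while the joint share is only 0.3 because rushing plays inflate the denominator.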
Next, we want to visualize how these metrics relate to wins. Let’s create some visualizations to better understand how factors like turnovers and total yards gained influence winning outcomes.
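# Create a win indicator. result is the home team's margin (home score minus
# away score), so the away team wins when it is negative; ties count as non-wins
team_summary <- team_summary %>%
  mutate(win = as.integer(if_else(posteam == home_team, result > 0, result < 0)))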
# Visualize the relationship between turnovers and win probability
ggplot(team_summary, aes(x = turnovers, y = win)) +
  geom_jitter(width = 0.1, height = 0.05, alpha = 0.3) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE, color = "red") +
  labs(title = "Relationship Between Turnovers and Win Probability",
       x = "Number of Turnovers",
       y = "Win (1 = yes) / Fitted Win Probability")
Here, we used a logistic regression model to show how the probability of winning changes as the number of turnovers increases. As expected, the more turnovers a team commits, the lower their chances of winning.
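The red curve comes from the same kind of model that glm() fits directly. As a minimal sketch on simulated data (the data-generating numbers below are assumptions chosen for illustration, not estimates from NFL games), we can fit the model ourselves and read off the predicted win probability at each turnover count:

```r
set.seed(42)
n <- 500

# Simulated team-games: turnover counts and a win flag.
# Assumption: each extra turnover lowers the log-odds of winning by 0.6.
turnovers <- rpois(n, lambda = 1.3)
win <- rbinom(n, size = 1, prob = plogis(0.8 - 0.6 * turnovers))
toy <- data.frame(turnovers, win)

# Logistic regression of win on turnovers
fit <- glm(win ~ turnovers, data = toy, family = binomial)

# Predicted win probability for 0 to 4 turnovers
new_games <- data.frame(turnovers = 0:4)
new_games$win_prob <- predict(fit, newdata = new_games, type = "response")
print(new_games)
```

Swapping toy for team_summary would give the fitted probabilities behind the plot, which are easier to quote than values read off a curve.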
We also want to see if gaining more yards contributes to a higher chance of winning. Let's create another plot for this relationship:
# Visualize the relationship between total yards and win probability
ggplot(team_summary, aes(x = total_yards, y = win)) +
  geom_jitter(width = 0.1, height = 0.05, alpha = 0.3) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE, color = "blue") +
  labs(title = "Total Yards vs Win Probability",
       x = "Total Yards Gained",
       y = "Win (1 = yes) / Fitted Win Probability")
From the above graph, we can see that gaining more yards generally increases a team's chances of winning the game, but it isn't the only determining factor. Teams need to excel in multiple areas to secure a win consistently.
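A fitted curve can hide how noisy the underlying games are, so a simple cross-check is to bin games by yardage and compute the raw win rate in each bin. The sketch below uses simulated data (the distribution of yards and the strength of the effect are assumptions for illustration), and the bin edges are arbitrary choices:

```r
set.seed(1)
n <- 600

# Simulated team-games. Assumption: more yards means higher odds of winning.
total_yards <- round(rnorm(n, mean = 340, sd = 80))
win <- rbinom(n, size = 1, prob = plogis((total_yards - 340) / 80))
toy <- data.frame(total_yards, win)

# Bin games by total yards (bin edges are arbitrary illustrative choices)
toy$yard_bin <- cut(toy$total_yards,
                    breaks = c(-Inf, 250, 300, 350, 400, 450, Inf),
                    labels = c("<=250", "251-300", "301-350",
                               "351-400", "401-450", ">450"))

# Raw win rate and number of games in each bin
win_rate <- tapply(toy$win, toy$yard_bin, mean)
games <- table(toy$yard_bin)
print(data.frame(games = as.vector(games),
                 win_rate = round(as.vector(win_rate), 2),
                 row.names = names(win_rate)))
```

Reporting the game count next to each win rate makes it clear which bins are too small to trust.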
Our analysis shows that teams that commit fewer turnovers and gain more total yards win NFL games more often. Neither metric decides a game on its own, though: the plots show plenty of wins and losses at every level, because other factors like defense, special teams, and in-game decisions also play a role in determining the outcome.
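Since turnovers and yards were examined one at a time, a natural next step is to put both in a single logistic regression so each effect is estimated while holding the other fixed. This is a sketch on simulated data, with effect sizes that are assumptions rather than NFL estimates. The yards_100 column (yards in units of 100) is introduced here only to make the odds ratio easier to read.

```r
set.seed(7)
n <- 800

# Simulated team-games. Assumptions: turnovers hurt, yards help.
turnovers <- rpois(n, lambda = 1.3)
total_yards <- round(rnorm(n, mean = 340, sd = 80))
win <- rbinom(n, size = 1,
              prob = plogis(0.5 - 0.6 * turnovers + 0.008 * (total_yards - 340)))
toy <- data.frame(turnovers, total_yards, win)

# Rescale yards so the coefficient is "per 100 yards"
toy$yards_100 <- toy$total_yards / 100

# Logistic regression with both predictors
fit <- glm(win ~ turnovers + yards_100, data = toy, family = binomial)

# Odds ratios: multiplicative change in the odds of winning
# for a one-unit increase in each predictor
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 3))
```

An odds ratio below 1 for turnovers and above 1 for yards_100 would match the direction seen in the two plots.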
We have only scratched the surface here. By using nflfastR, more sophisticated models can be built to better predict game outcomes and understand deeper patterns within the data. If you're interested in more detailed exploration, consider incorporating EPA metrics, play success rates by situation, or advanced machine learning models to uncover even more insights.
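As a starting point for the EPA idea, here is a minimal sketch of average EPA per play for each offense in each game. It assumes the play-by-play data has an epa column alongside game_id and posteam, and it uses a made-up toy data frame in place of pbp_data so it runs on its own.

```r
# Toy stand-in for pbp_data (made-up EPA values, one play missing EPA)
toy_pbp <- data.frame(
  game_id = rep("toy_game_1", 8),
  posteam = rep(c("AAA", "BBB"), each = 4),
  epa     = c(0.5, -0.2, 1.1, NA, -0.4, 0.3, -0.9, 0.2)
)

# Mean EPA per play by game and offense.
# With the formula interface, aggregate() drops rows with NA by default.
epa_summary <- aggregate(epa ~ game_id + posteam, data = toy_pbp, FUN = mean)
names(epa_summary)[names(epa_summary) == "epa"] <- "epa_per_play"
print(epa_summary)
```

The resulting epa_per_play column could then be joined to the game-level summary and used as another predictor of winning.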