March Madness is one of the most unpredictable sporting events in the world. Every year, millions of fans attempt to build the perfect bracket, only to watch their predictions unravel with unexpected upsets and Cinderella stories. What if we could leverage data analytics to improve our chances of making accurate predictions?
In this article, we’ll use Dataiku, a powerful data science and machine learning platform, to analyze historical NCAA tournament data and uncover patterns that could help us forecast game outcomes. By applying analytical techniques such as exploratory data analysis, correlation analysis, and feature engineering, we aim to determine whether data can bring more order to the madness.
Along the way, we’ll demonstrate how Dataiku makes it easy to ingest, clean, and analyze data, helping us build the foundation for a predictive model. Ultimately, this analysis will set the stage for our next step: using AI, including ChatGPT, to make game predictions for the 2025 tournament.
Selection Sunday: finding the right data
Any analysis or predictive task depends on the availability and quality of the underlying data. Thankfully, a treasure trove of March Madness data has been collected in the March Madness Data Kaggle project, which is updated annually. We’re going to use two files from this project: Resumes (originally found here), which contains team-level statistics for every tournament team since 2008, and Tournament Matchups, which contains the final score of each tournament matchup since 2008.
The Resumes dataset contains a wealth of team metrics that could potentially be useful for predicting tournament success. However, to keep our analysis focused and interpretable, we’ll concentrate on a single key metric: each team’s Elo rating going into the tournament. Elo is a widely used rating system designed to quantify a team’s relative strength based on past performance. While Elo isn’t the only factor that determines a team’s success, it provides a strong baseline for understanding relative performance.
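To make the idea of an Elo rating concrete, here is a minimal sketch of the textbook Elo formulas in Python. It is not necessarily the variant used to produce the Elo values in the Resumes dataset: the 400-point scale and the K-factor of 20 are conventional defaults assumed for illustration, and the ratings in the example are hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score (roughly, win probability) for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def elo_update(rating: float, expected: float, actual: float, k: float = 20.0) -> float:
    """Move a rating toward the observed result (actual: 1.0 for a win, 0.0 for a loss)."""
    return rating + k * (actual - expected)


# Hypothetical ratings, chosen only to illustrate the arithmetic.
favorite, underdog = 1600.0, 1500.0
p_favorite = elo_expected_score(favorite, underdog)
print(f"Expected score for the favorite: {p_favorite:.3f}")

# If the favorite wins, its rating rises by a small amount.
print(f"Favorite's rating after a win: {elo_update(favorite, p_favorite, 1.0):.1f}")
```

The larger the rating gap, the closer the expected score gets to 1, which is why a single number can serve as a reasonable baseline for comparing two teams.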
For this analysis, we will download these raw datasets from Kaggle and upload them to a new Dataiku project where we’ll perform the analysis. If you’re new to Dataiku, it’s easy to get set up by either downloading the free local version or signing up for the free 14-day online trial.
First round: EDA
Exploratory Data Analysis (EDA) is a crucial step in any data analysis process, as it helps uncover hidden patterns, relationships, and anomalies within the dataset. By thoroughly exploring the data, we can identify key trends, assess data quality, and determine which variables may have the most predictive power. EDA allows us to form hypotheses and guide subsequent modeling efforts, ensuring that we extract meaningful insights rather than relying on assumptions. In the context of March Madness, EDA helps us understand how simple factors like team seed and Elo rating correlate with tournament success, laying the groundwork for more advanced predictive modeling.
To begin our analysis, we’ll use Dataiku to explore key insights from the Resumes dataset. This dataset provides a historical record of every NCAA tournament team since 2008, including their seed (1-16), number of tournament wins (0-6), and Elo national ranking entering the tournament. It’s important to note that the Elo metric in this dataset represents a team’s national ranking rather than its raw rating score, so a lower number indicates a stronger team. By examining these factors, we can start identifying trends and relationships that may help us predict future outcomes. Through this initial EDA, we’ll establish a foundational understanding of how seed and Elo ranking influence tournament success, insights that will be critical as we move toward building a predictive model.
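Outside of Dataiku’s visual tools, the same kind of relationship can be checked with a few lines of Python. The sketch below computes a plain Pearson correlation coefficient; the function name and the sample values are placeholders invented for illustration, not figures from the Resumes dataset. Because a lower seed and a lower Elo ranking both indicate a stronger team, we would expect either one to correlate negatively with tournament wins.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values only; replace with the seed and wins columns from the real data.
example_seeds = [1, 4, 8, 12, 16]
example_wins = [4, 2, 1, 1, 0]
print(f"seed vs. wins correlation: {pearson(example_seeds, example_wins):.2f}")
```

Since seeds and rankings are ordinal, a rank-based measure such as Spearman correlation may be a better fit in practice; Pearson is used here only because it is the simplest to show.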
Seeds of tournament success
One of the most widely used factors in tournament predictions is a team’s seed. By analyzing the Resumes dataset, we can examine how a team’s seed has historically correlated with tournament performance and what patterns emerge from past results.
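As a rough sketch of what this seed-level summary involves, the Python below groups team records by seed and averages their tournament wins. The field names `seed` and `wins` are assumptions made for illustration and may not match the headers in the Kaggle file, and the sample rows are placeholders rather than real results.

```python
from collections import defaultdict


def average_wins_by_seed(rows):
    """Return {seed: mean tournament wins} for an iterable of dict-like rows."""
    totals = defaultdict(lambda: [0, 0])  # seed -> [total wins, team count]
    for row in rows:
        seed = int(row["seed"])  # hypothetical column name
        wins = int(row["wins"])  # hypothetical column name
        totals[seed][0] += wins
        totals[seed][1] += 1
    return {seed: total / count for seed, (total, count) in sorted(totals.items())}


# Placeholder rows only; in practice these would come from the Resumes file,
# for example via csv.DictReader.
sample_rows = [
    {"seed": 1, "wins": 4},
    {"seed": 1, "wins": 2},
    {"seed": 16, "wins": 0},
]
for seed, mean_wins in average_wins_by_seed(sample_rows).items():
    print(f"Seed {seed}: {mean_wins:.2f} average wins")
```

The resulting mapping, one average per seed, is the kind of table a bar chart of wins by seed would be drawn from.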










