A window function has three components:
- Partition: the “grouping” key consisting of one or more columns
- Order: the column that sorts records within each partition
- (Window) Frame: limits the range of records included in the function evaluation
Depending on the task and intended summary information, one or more of these components may be required. For example, total sales by customer only requires a partition (by customer) and a summation of all sales records. If instead, you want to see cumulative sales over time, you need to include an order (e.g., date of sale). Further, if you wanted to see a rolling three-month sales total, you would need to add a window frame (again based on date of sale) that would include only the last three months of time up until the current record (date) being evaluated.
Based on the configuration of these window components (referred to as the window definition), many different analyzes are possible resulting in a variety of summary information. There are three general types of analyzes that window functions facilitate:
- Rankings - numbering of records
- Aggregations - grouping calculations
- Lags (Leads) - evaluating with prior/future records
People are usually most familiar with the “group-by” style of aggregations (min, max, average, sum, count, etc.), and these work as expected for window functions. Rankings are probably the most straightforward summary, as these simply assign a number (“rank”) to each record within each partition based on the order and an optional frame. However, there are several variations to how these ranks can be calculated, such as how to handle ties or duplicate records, which we’ll see later.
Perhaps the most useful feature, and uniquely specific to window functions, lagging (or leading) evaluations are used to compare the current record to previous (or future) records, such as the difference between values. These functions can also bring the value of these other records to the current record. For example, you could use a lag function on customer order dates to calculate the number of days since their previous order, as well as including the date of that previous order.
Next, we’ll walk through an example of each one of these types of window functions. Generally, these are all commonly used together, and over a variety of window definitions on the same dataset. With this ability to generate additional information while retaining the original dataset schema (i.e., more columns for every row), window functions are especially useful for machine learning tasks like time-series analysis and supervised prediction on transactional datasets (shopping, banking, etc.).
Choosing a dataset for our window recipe
In this tutorial, we will use a classic example database known as Northwind Traders. This synthetic dataset simulates the many operations of a small business that imports and exports specialty foods from around the world. Importantly, the database contains stereotypical examples of common business entities, and this allows us to illustrate common analytics use cases. Here we will focus on only a few aspects of the database, namely: customers, products, and orders. In essence, our simplified operating narrative is: customers make orders that contain some quantity of item(s) from the product list.
This dataset is small by today’s standards, but it serves well for our demonstration purposes. It contains 830 orders (2,155 line items) over 89 customers and 77 products. Within our Dataiku project, only minor preparation is used to combine the database tables (imported as separate CSV files) into analytics-ready datasets for our window recipes. Two datasets will be used in this tutorial: 1) “order-level” summary with one record per order (Orders_DetailRollup), and 2) “line-item” data with many records per order, where each line (record) is a specific product that is part of the sale (Orders_LineItems). See our annotated Dataiku project flow below.