My wife manages the family’s short-term apartment rental business in Budapest and maintains a detailed Google Sheet for all bookings. This weekend, she asked me if I could help calculate a monthly utilization and a profit share report for this year (2024). Calculating these metrics can be cumbersome with traditional data analytics software even for moderately sized datasets. This is especially true when bookings span multiple months and pricing varies based on factors like weekends. The request made me contemplate the beauty and expressiveness of q programming language and inspired me to write this article.
The data model
Each row corresponds to a booking. Among other fields (like channel, tax, gross income), she stores the start date, the number of nights and the net profit of the overall stay for each booking.
The problem
For each month, calculate
- the utilization ratio, which equals the number of taken days divided by the number of days of the month. For example, if 20 days were booked in April, then April’s utilization rate is 21/30 = 70%.
- the profit distribution, i.e. what percentage of the total profit came from each month. For example, if the profit in 2024 is 10000 USD and January contributed 800 USD to this amount then she would like to see an 8% profit rate for January. For simplicity, we assume that the profit is spread evenly across the booked days. In reality, the weekends are e.g. 20% more expensive than weekdays. We will see how easily kdb+ can cope with this extra complexity.
A booking may span over multiple months. For example, checking in on the 29 of March for 7 days means 3 days in March and 4 days in April.
First attempt with Google Sheets
Intuitively, we need to work with lists of dates. Google Sheets can generate a sequence of numbers by function SEQUENCE but adding (up to 31) new columns would pollute the sheet. I don’t see an elegant way of solving the problem with Google Sheets although I am open to conviction.
kdb+ and vector programming
Let us create a mock table for demonstration purposes.
q)t:flip `startdate`daycount`profit!flip ((2024.01.08;4;110);(2024.02.12;2;60);(2024.02.20;12;380);(2024.03.10;15;400);(2024.05.15;50;1900))
q)t
startdate daycount profit
--------------------------
2024.01.08 4 110
2024.02.12 2 60
2024.02.20 12 380
2024.03.10 15 400
2024.05.15 50 1900
We can get the list of booked days by generating a sequence of numbers starting from zero (function til) that we add to the start date. q
is a vector programming language so adding a scalar to a list works as expected:
q)2024.01.08 + til 4
2024.01.08 2024.01.09 2024.01.10 2024.01.11
Unlike ANSI SQL, kdb+ supports storing lists in table cells. Let us extend the original table using the each
adverb for explicit iteration over the rows
q)update dates: startdate + til each daycount from t
startdate daycount profit dates ..
-----------------------------------------------------------------------------------------------------------------------------------------------------------..
2024.01.08 4 110 2024.01.08 2024.01.09 2024.01.10 2024.01.11 ..
2024.02.12 2 60 2024.02.12 2024.02.13 ..
2024.02.20 12 380 2024.02.20 2024.02.21 2024.02.22 2024.02.23 2024.02.24 2024.02.25 2024.02.26 2024.02.27 2024.02.28 2024.02.29 2024.03.01 2024.03..
2024.03.10 15 400 2024.03.10 2024.03.11 2024.03.12 2024.03.13 2024.03.14 2024.03.15 2024.03.16 2024.03.17 2024.03.18 2024.03.19 2024.03.20 2024.03..
2024.05.15 50 1900 2024.05.15 2024.05.16 2024.05.17 2024.05.18 2024.05.19 2024.05.20 2024.05.21 2024.05.22 2024.05.23 2024.05.24 2024.05.25 2024.05..```
Monthly utilization ratio
We only care about the months, so we convert dates
column to months ($) and create a new table that only contains the relevant data:
q)t1: select month: raze `mm$startdate + til each daycount from t
q)t1
month
-----
1
1
1
1
2
2
2
2
2
2
2
2
2
..
a simple group-by select statement returns the number of takes days by month
q)select daycount:count i by month from t1
month| daycount
-----| --------
1 | 4
2 | 12
3 | 17
5 | 17
6 | 30
7 | 3
To get the ratios, we need to store the number of days of all months. A dictionary does the job
q)monthlength:(1 + til 12)!31 29 31 30 31 30 31 31 30 31 30 31
q)monthlength
1 | 31
2 | 29
3 | 31
4 | 30
5 | 31
6 | 30
7 | 31
8 | 31
9 | 30
10| 31
11| 30
12| 31
We can avoid hardcoding the data and handle non-leap years by generating all days and grouping by month.
q)count each group `mm$2024.01.01 + til 2025.01.01-2024.01.01
1 | 31
2 | 29
3 | 31
4 | 30
5 | 31
6 | 30
7 | 31
8 | 31
9 | 30
10| 31
11| 30
12| 31
q expressions can smoothly be injected into q-sql statements.
q)select utilization_ratio: (count month)%first monthlength month by month from t1
month| utilization_ratio
-----| -----------------
1 | 0.1290323
2 | 0.4137931
3 | 0.5483871
5 | 0.5483871
6 | 1
7 | 0.09677419
We made use of the vectorian nature of q again. A dictionary is indexed by a list then a list of values is returned.
Profit distribution
If the profit is shared evenly across the booked days, then we can create a new table that stores the daily profit and the list of months for each booking.
q)select daily_profit: profit%daycount, month: `mm$startdate+til each daycount from t
daily_profit month
-----------------------------------------------------------------------------------------------------------------
27.5 1 1 1 1i
30 2 2i
31.66667 2 2 2 2 2 2 2 2 2 2 3 3i
26.66667 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3i
38 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7i
The daily_profit
column contains scalars, and the remaining (single) columns contain lists. Function ungroup performs the inverse operation of group and creates multiple rows for each group
q)t2:ungroup select daily_profit: profit%daycount, month: `mm$startdate+til each daycount from t
q)t2
daily_profit month
------------------
27.5 1
27.5 1
27.5 1
27.5 1
30 2
30 2
31.66667 2
31.66667 2
31.66667 2
31.66667 2
31.66667 2
31.66667 2
31.66667 2
..
The final steps require a basic aggregation and nested select statements.
q)select month, profit_rate: daily_profit%sum daily_profit from select sum daily_profit by month from t2
month profit_rate
-----------------
1 0.03859649
2 0.1321637
3 0.1625731
5 0.2266667
6 0.4
7 0.04
Taking weekends into account
Weekends are 20% more expensive than weekdays. How do we decide if a date is a weekday in q? All dates are represented by an integer, 0 representing 2001.01.01. Modulo calculation works seamlessly with dates, so we can write a function that returns true if a given date is Saturday or Sunday.
q)isWeekend: {(x mod 7) in 0 1}
q)isWeekend 2024.11.17
1b
q)isWeekend 2024.11.18
0b
How fantastic the vectorial nature of q is? If your custom function is made of operators that support implicit iteration then so does your custom function. You don’t need ot write a single line of code. It does not matter if you pass a scalar, a list, a dictionary, etc the function’s output conforms to the input.
q)isWeekend 2024.11.17 + til 3
100b
We can convert boolean values to weights 1.2 and 1 using an (vectorial) if-then-else statement (denoted by ?
) or a dictionary or making use of implicit boolean casts in arithmetic operations.
q)?[isWeekend 2024.11.17 + til 3;1.2;1]
1.2 1 1
q)(10b!1.2 1) isWeekend 2024.11.17 + til 3
1.2 1 1
q)1 + 0.2 * isWeekend 2024.11.17 + til 3
1.2 1 1
Let us wrap getting the relative weights into a function
q)getWeights: {w: 1+0.2 * isWeekend x; w%sum w}
We can update the query that generates internal table t2
by calling getWeights
for each booking.
q)t2:ungroup select daily_profit: profit * getWeights each dates, month: `mm$dates from select profit, dates: startdate+til each daycount from t
q)t2
daily_profit month
------------------
27.5 1
27.5 1
27.5 1
27.5 1
30 2
30 2
30.15873 2
30.15873 2
30.15873 2
30.15873 2
36.19048 2
36.19048 2
30.15873 2
..
All in one
We can put the two metrics into one table either by employing a left-join or preferably calculating the two metrics in one go
q)(select month, profit_rate: daily_profit%sum daily_profit from select sum daily_profit by month from t2) lj select utilization_ratio: (count month)%first monthlength month by month from t1
month profit_rate utilization_ratio
-----------------------------------
1 0.03859649 0.1290323
2 0.1311055 0.4137931
3 0.1636313 0.5483871
5 0.2247475 0.5483871
6 0.4040404 1
7 0.03787879 0.09677419
q)update profit_rate: profit%sum profit from select profit: sum daily_profit, utilization_ratio: (count month)%first monthlength month by month from t2
month| profit utilization_ratio profit_rate
-----| --------------------------------------
1 | 110 0.1290323 0.03859649
2 | 373.6508 0.4137931 0.1311055
3 | 466.3492 0.5483871 0.1636313
5 | 640.5303 0.5483871 0.2247475
6 | 1151.515 1 0.4040404
7 | 107.9545 0.09677419 0.03787879
Conclusion
kdb+ is an outstanding, general-purpose tool for data analysis. The vector programming, temporal arithmetics, q-sql and the built-in function all play well together to create a robust, easy-to-maintain, easy-to-read analytical solution.
If you liked this article, then set a trip to Budapest, book my wife’s apartment and enjoy reading Q for Mortals in a tranquil environment 😎