Practical SQL Puzzles That Will Level Up Your Skill

There are some SQL patterns that, once you know them, you start seeing everywhere. The solutions to the puzzles that I’ll show you today are actually very simple SQL queries, but understanding the concept behind them will surely unlock new solutions to the queries you write on a day-to-day basis.

These challenges are all based on real-world scenarios, as over the past few months I made a point of writing down every puzzle-like query that I had to build. I also encourage you to try them yourself first, as challenging yourself will improve your learning!

All queries to generate the datasets are provided in a PostgreSQL and DuckDB-friendly syntax, so that you can easily copy and play with them. At the end I will also give you a link to a GitHub repo containing all the code, as well as the answer to the bonus challenge I’ll leave for you!

I organized these puzzles in order of increasing difficulty, so if you find the first ones too easy, at least take a look at the last one, which uses a technique that I truly believe you won’t have seen before.

Okay, let’s get started.

I love this puzzle because of how short and simple the final query is, even though it deals with many edge cases. The data for this challenge shows tickets moving between Kanban stages, and the objective is to find how long, on average, tickets stay in the Doing stage.

The data contains the ID of the ticket, the date the ticket was created, the date of the move, and the “from” and “to” stages of the move. The stages present are New, Doing, Evaluation, and Carried out.

Some things you need to know (edge cases):

  • Tickets can move backwards, meaning tickets can return to the Doing stage.
  • You shouldn’t include tickets that are still stuck in the Doing stage, as there is no way to know how long they will stay there.
  • Tickets are not always created in the New stage.
```SQL

CREATE TABLE ticket_moves (
    ticket_id INT NOT NULL,
    create_date DATE NOT NULL,
    move_date DATE NOT NULL,
    from_stage TEXT NOT NULL,
    to_stage TEXT NOT NULL
);

```
```SQL

INSERT INTO ticket_moves (ticket_id, create_date, move_date, from_stage, to_stage)
    VALUES
        -- Ticket 1: Created in "New", then moves to Doing, Evaluation, Carried out.
        (1, '2024-09-01', '2024-09-03', 'New', 'Doing'),
        (1, '2024-09-01', '2024-09-07', 'Doing', 'Evaluation'),
        (1, '2024-09-01', '2024-09-10', 'Evaluation', 'Carried out'),
        -- Ticket 2: Created in "New", then moves: New → Doing → Evaluation → Doing again → Evaluation.
        (2, '2024-09-05', '2024-09-08', 'New', 'Doing'),
        (2, '2024-09-05', '2024-09-12', 'Doing', 'Evaluation'),
        (2, '2024-09-05', '2024-09-15', 'Evaluation', 'Doing'),
        (2, '2024-09-05', '2024-09-20', 'Doing', 'Evaluation'),
        -- Ticket 3: Created in "New", then moves to Doing. (Edge case: no later move out of Doing.)
        (3, '2024-09-10', '2024-09-16', 'New', 'Doing'),
        -- Ticket 4: Created already in "Doing", then moves to Evaluation.
        (4, '2024-09-15', '2024-09-22', 'Doing', 'Evaluation');
```

A summary of the data:

  • Ticket 1: Created in the New stage, moves normally to Doing, then Evaluation, and then Carried out.
  • Ticket 2: Created in New, then moves: New → Doing → Evaluation → Doing again → Evaluation.
  • Ticket 3: Created in New, moves to Doing, but it’s still stuck there.
  • Ticket 4: Created in the Doing stage, moves to Evaluation afterward.

It might be a good idea to stop for a bit and think about how you would deal with this. Can you find out how long a ticket stays in a single stage?

Honestly, this sounds intimidating at first, and it seems like it will be a nightmare to deal with all the edge cases. Let me show you the full solution to the problem, and then I’ll explain what is happening afterward.

```SQL

WITH stage_intervals AS (
    SELECT
        ticket_id,
        from_stage,
        move_date 
        - COALESCE(
            LAG(move_date) OVER (
                PARTITION BY ticket_id 
                ORDER BY move_date
            ), 
            create_date
        ) AS days_in_stage
    FROM
        ticket_moves
)
SELECT
    -- Multiply by 1.0 so PostgreSQL does not truncate the result with integer division.
    SUM(days_in_stage) * 1.0 / COUNT(DISTINCT ticket_id) AS avg_days_in_doing
FROM
    stage_intervals
WHERE
    from_stage = 'Doing';
```

The first CTE uses the LAG function to find the previous move of the ticket, which is the date the ticket entered that stage. Calculating the duration is as simple as subtracting the previous date from the move date.

What you should notice is the use of COALESCE on the previous move date. If a ticket doesn’t have a previous move, the query falls back to the ticket’s creation date. This takes care of tickets created directly in the Doing stage, as the time it took to leave the stage is still calculated properly.

This is the result of the first CTE, showing the time spent in each stage. Notice how Ticket 2 has two entries for Doing, as it visited that stage on two separate occasions.
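To reproduce that intermediate result yourself, one option is to run the CTE on its own. This is a small sketch against the `ticket_moves` table defined earlier; it adds `move_date` to the output (not part of the original CTE) purely so the rows can be sorted.

```SQL
-- Inspect the per-stage durations before any filtering or aggregation.
WITH stage_intervals AS (
    SELECT
        ticket_id,
        move_date,
        from_stage,
        move_date - COALESCE(
            LAG(move_date) OVER (PARTITION BY ticket_id ORDER BY move_date),
            create_date  -- first move of a ticket: fall back to its creation date
        ) AS days_in_stage
    FROM ticket_moves
)
SELECT *
FROM stage_intervals
ORDER BY ticket_id, move_date;
-- ticket_id | move_date  | from_stage | days_in_stage
-- 1         | 2024-09-03 | New        | 2
-- 1         | 2024-09-07 | Doing      | 4
-- 1         | 2024-09-10 | Evaluation | 3
-- 2         | 2024-09-08 | New        | 3
-- 2         | 2024-09-12 | Doing      | 4
-- 2         | 2024-09-15 | Evaluation | 3
-- 2         | 2024-09-20 | Doing      | 5
-- 3         | 2024-09-16 | New        | 6
-- 4         | 2024-09-22 | Doing      | 7
```

Ticket 3 has no row with `from_stage = 'Doing'`, which is why the final filter leaves it out without any extra handling.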

With this done, it’s just a matter of getting the average as the SUM of total days spent in Doing, divided by the distinct number of tickets that ever left the stage. Doing it this way, instead of simply using AVG, makes sure that the two rows for Ticket 2 are properly accounted for as a single ticket.
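To see the difference between the two ways of averaging, here is a minimal sketch that computes both side by side on the same data. The column names `avg_per_visit` and `avg_per_ticket` are made up for this illustration.

```SQL
WITH stage_intervals AS (
    SELECT
        ticket_id,
        from_stage,
        move_date - COALESCE(
            LAG(move_date) OVER (PARTITION BY ticket_id ORDER BY move_date),
            create_date
        ) AS days_in_stage
    FROM ticket_moves
)
SELECT
    -- Plain AVG treats each visit to Doing as its own observation.
    AVG(days_in_stage) AS avg_per_visit,
    -- SUM / COUNT(DISTINCT) treats each ticket as one observation.
    -- The * 1.0 avoids integer division in PostgreSQL.
    SUM(days_in_stage) * 1.0 / COUNT(DISTINCT ticket_id) AS avg_per_ticket
FROM stage_intervals
WHERE from_stage = 'Doing';
-- avg_per_visit  = 5      (20 days over 4 visits)
-- avg_per_ticket ≈ 6.67   (20 days over 3 tickets)
```

Which of the two is “right” depends on the question being asked; the puzzle asks about tickets, so the per-ticket figure is the one that matches.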

Not so bad, right?

The goal of this second challenge is to find the latest contract sequence of every employee. A break of sequence happens when two contracts have a gap of more than one day between them.

In this dataset, there are no contract overlaps, meaning that a contract for the same employee either has a gap or ends a day before the new one starts.

```SQL
CREATE TABLE contracts (
    contract_id integer PRIMARY KEY,
    employee_id integer NOT NULL,
    start_date date NOT NULL,
    end_date date NOT NULL
);

INSERT INTO contracts (contract_id, employee_id, start_date, end_date)
VALUES 
    -- Employee 1: Two continuous contracts
    (1, 1, '2024-01-01', '2024-03-31'),
    (2, 1, '2024-04-01', '2024-06-30'),
    -- Employee 2: One contract, then a gap of three days, then two contracts
    (3, 2, '2024-01-01', '2024-02-15'),
    (4, 2, '2024-02-19', '2024-04-30'),
    (5, 2, '2024-05-01', '2024-07-31'),
    -- Employee 3: One contract
    (6, 3, '2024-03-01', '2024-08-31');
```

As a summary of the data:

  • Employee 1: Has two continuous contracts.
  • Employee 2: One contract, then a gap of three days, then two contracts.
  • Employee 3: One contract.

The expected result, given the dataset, is that all contracts should be included except for the first contract of Employee 2, which is the only one followed by a gap.

Before explaining the logic behind the solution, I would like you to think about what operation can be used to join the contracts that belong to the same sequence. Focus only on the second row of data: what information do you need to know whether this contract was a break or not?

I hope it’s clear that this is the perfect situation for window functions, again. They’re incredibly useful for solving problems like this, and understanding when to use them helps a lot in finding clean solutions.

The first thing to do, then, is to get the end date of the previous contract for the same employee with the LAG function. With that, it’s simple to compare both dates and check whether there was a break of sequence.

```SQL
WITH ordered_contracts AS (
    SELECT
        *,
        LAG(end_date) OVER (PARTITION BY employee_id ORDER BY start_date) AS previous_end_date
    FROM
        contracts
),
gapped_contracts AS (
    SELECT
        *,
        -- Deals with the case of the first contract, which won't have
        -- a previous end date. In this case, it's still the start of a new
        -- sequence.
        CASE WHEN previous_end_date IS NULL
            OR previous_end_date < start_date - INTERVAL '1 day' THEN
            1
        ELSE
            0
        END AS is_new_sequence
    FROM
        ordered_contracts
)
SELECT * FROM gapped_contracts ORDER BY employee_id ASC;
```

An intuitive way to continue the query is to number the sequences of each employee. For example, an employee who has no gap will always be on their first sequence, but an employee who had five breaks in contracts will be on their sixth sequence. Funnily enough, this is done by another window function.

```SQL
--
-- Previous CTEs
--
sequences AS (
    SELECT
        *,
        SUM(is_new_sequence) OVER (PARTITION BY employee_id ORDER BY start_date) AS sequence_id
    FROM
        gapped_contracts
)
SELECT * FROM sequences ORDER BY employee_id ASC;
```

Notice how Employee 2 starts sequence #2 after the first gap. To finish this query, I grouped the data by employee, got the value of their most recent sequence, and then did an inner join with the sequences to keep only the latest one.

```SQL
--
-- Previous CTEs
--
max_sequence AS (
    SELECT
        employee_id,
        MAX(sequence_id) AS max_sequence_id
    FROM
        sequences
    GROUP BY
        employee_id
),
latest_contract_sequence AS (
    SELECT
        c.contract_id,
        c.employee_id,
        c.start_date,
        c.end_date
    FROM
        sequences c
        JOIN max_sequence m ON c.sequence_id = m.max_sequence_id
            AND c.employee_id = m.employee_id
)
SELECT
    *
FROM
    latest_contract_sequence
-- Sort in the outer query: an ORDER BY inside a CTE does not guarantee
-- the order of the final result.
ORDER BY
    employee_id,
    start_date;
```

As expected, our final result is basically our starting data, just with the first contract of Employee 2 missing!
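As a possible variation on the last step, the grouped CTE and the join can be swapped for a window MAX over each employee. The sketch below assembles the whole pipeline in one self-contained query against the `contracts` table; it is an alternative I am suggesting rather than the query described above, and `with_latest` is a name introduced here for illustration.

```SQL
WITH ordered_contracts AS (
    SELECT
        *,
        LAG(end_date) OVER (PARTITION BY employee_id ORDER BY start_date) AS previous_end_date
    FROM contracts
),
gapped_contracts AS (
    SELECT
        *,
        -- 1 marks the first contract of a sequence, 0 a continuation.
        CASE WHEN previous_end_date IS NULL
            OR previous_end_date < start_date - INTERVAL '1 day' THEN 1
        ELSE 0
        END AS is_new_sequence
    FROM ordered_contracts
),
sequences AS (
    SELECT
        *,
        -- Running total of the flags numbers each sequence per employee.
        SUM(is_new_sequence) OVER (PARTITION BY employee_id ORDER BY start_date) AS sequence_id
    FROM gapped_contracts
),
with_latest AS (
    SELECT
        *,
        -- Highest sequence number per employee, attached to every row.
        MAX(sequence_id) OVER (PARTITION BY employee_id) AS max_sequence_id
    FROM sequences
)
SELECT contract_id, employee_id, start_date, end_date
FROM with_latest
WHERE sequence_id = max_sequence_id
ORDER BY employee_id, start_date;
-- Returns contracts 1, 2, 4, 5 and 6 for the sample data.
```

Both versions give the same rows; the window version avoids the self-join, while the join version keeps each step a little more explicit.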

Finally, the last puzzle. I’m glad you made it this far.

For me, this is the most mind-blowing one, as when I first encountered this problem I thought of a completely different solution that would be a mess to implement in SQL.

For this puzzle, I’ve changed the context from what I had to deal with at my job, as I think it will make it easier to explain.

Imagine you’re a data analyst at an event venue, and you’re analyzing the talks scheduled for an upcoming event. You want to find the time of day when the highest number of talks will be happening at the same time.

This is what you should know about the schedules:

  • Rooms are booked in increments of 30 min, e.g. from 9h to 10h30.
  • The data is clean: there are no overbookings of meeting rooms.
  • There can be back-to-back meetings in a single meeting room.

Meeting schedule visualized (this is the actual data).

```SQL
CREATE TABLE conferences (
    room TEXT NOT NULL,
    start_time TIMESTAMP NOT NULL,
    end_time TIMESTAMP NOT NULL
);

INSERT INTO conferences (room, start_time, end_time) VALUES
    -- Room A meetings
    ('Room A', '2024-10-01 09:00', '2024-10-01 10:00'),
    ('Room A', '2024-10-01 10:00', '2024-10-01 11:00'),
    ('Room A', '2024-10-01 11:00', '2024-10-01 12:00'),
    -- Room B meetings
    ('Room B', '2024-10-01 09:30', '2024-10-01 11:30'),
    -- Room C meetings
    ('Room C', '2024-10-01 09:00', '2024-10-01 10:00'),
    ('Room C', '2024-10-01 11:30', '2024-10-01 12:00');
```

The way to solve this is by using what is called a Sweep Line Algorithm, also known as an event-based solution. This last name actually helps to understand what will be done, as the idea is that instead of dealing with intervals, which is what we have in the original data, we deal with events.

To do this, we need to transform every row into two separate events. The first event will be the Start of the meeting, and the second event will be the End of the meeting.

```SQL
WITH occasions AS (
  -- Create an event for the start of each meeting (+1)
  SELECT 
    start_time AS event_time, 
    1 AS delta
  FROM conferences
  UNION ALL
  -- Create an event for the end of each meeting (-1)
  SELECT 
    -- Small trick to handle back-to-back meetings (explained later)
    end_time - INTERVAL '1 minute' AS event_time,
    -1 AS delta
  FROM conferences
)
SELECT * FROM occasions;
```

Take the time to understand what is happening here. To create two events from a single row of data, we’re simply unioning the dataset with itself; the first half uses the start time as the timestamp, and the second half uses the end time.

You might already notice the delta column and see where this is going. When a meeting starts, we count it as +1; when it ends, we count it as -1. You might even be already thinking of another window function to solve this, and you’re actually right!

But before that, let me just explain the trick I used on the end times. As I don’t want back-to-back meetings to count as two concurrent meetings, I’m subtracting a single minute from every end time. This way, if a meeting ends and another starts at 10h30, it won’t be assumed that two meetings are happening simultaneously at 10h30.

Okay, back to the query and yet another window function. This time, though, the function of choice is a rolling SUM.

```SQL
--
-- Previous CTEs
--
ordered_events AS (
  SELECT
    event_time,
    delta,
    SUM(delta) OVER (ORDER BY event_time, delta DESC) AS concurrent_meetings
  FROM occasions
)
SELECT * FROM ordered_events ORDER BY event_time DESC;
```

The rolling SUM on the delta column is essentially walking down every record and finding how many meetings are active at that time. For example, at 9 am sharp, it sees two meetings starting, so it marks the number of concurrent meetings as two!

When the third meeting starts, the count goes up to three. But when it gets to 9h59 (10 am), two meetings end, bringing the counter back to one. With this data, the only thing missing is to find when the highest value of concurrent meetings happens.

```SQL
--
-- Previous CTEs
--
max_events AS (
  -- Find the maximum concurrent meetings value
  SELECT 
    event_time, 
    concurrent_meetings,
    RANK() OVER (ORDER BY concurrent_meetings DESC) AS rnk
  FROM ordered_events
)
SELECT event_time, concurrent_meetings
FROM max_events
WHERE rnk = 1;
```

That’s it! The interval of 9h30–10h is the one with the largest number of concurrent meetings, which checks out with the schedule visualization above!

This solution looks incredibly simple in my opinion, and it works for so many situations. Every time you’re dealing with intervals now, you should consider whether the query wouldn’t be easier if you thought about it from the perspective of events.
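One variation worth knowing, sketched here as an alternative and not as the approach used above: instead of shifting end times by one minute, the end times can be left untouched and the tie at a shared timestamp resolved in the window ordering. Sorting by `delta` ascending makes the -1 events count before the +1 events at the same moment, so a back-to-back handover never registers as two concurrent meetings. The query below is self-contained and runs against the `conferences` table defined earlier.

```SQL
WITH occasions AS (
    -- +1 when a meeting starts
    SELECT start_time AS event_time, 1 AS delta FROM conferences
    UNION ALL
    -- -1 when a meeting ends (end time left as-is)
    SELECT end_time AS event_time, -1 AS delta FROM conferences
),
ordered_events AS (
    SELECT
        event_time,
        delta,
        -- At equal timestamps, -1 sorts before +1, so ends are applied first.
        SUM(delta) OVER (ORDER BY event_time, delta) AS concurrent_meetings
    FROM occasions
)
SELECT event_time, concurrent_meetings
FROM ordered_events
WHERE concurrent_meetings = (SELECT MAX(concurrent_meetings) FROM ordered_events);
-- For the sample data: 2024-10-01 09:30:00 with 3 concurrent meetings.
```

The trade-off is that the event timestamps stay exact (10:00 instead of 9:59), at the cost of relying on a tie-break rule that is easy to overlook when reading the query.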

But before you move on, and to really nail down this concept, I want to leave you with a bonus challenge, which is also a common application of the Sweep Line Algorithm. I hope you give it a try!

Bonus challenge

The context for this one is still the same as the last puzzle, but now, instead of trying to find the period with the most concurrent meetings, the objective is to find bad scheduling. It seems that there are overlaps in the meeting rooms, which need to be listed so they can be fixed ASAP.

How would you find out if the same meeting room has two or more meetings booked at the same time? Here are some tips on how to solve it:

  • It’s still the same algorithm.
  • This means you’ll still do the UNION, but it will look slightly different.
  • You should think from the perspective of each meeting room.

You can use this data for the challenge:

```SQL
CREATE TABLE meetings_overlap (
    room TEXT NOT NULL,
    start_time TIMESTAMP NOT NULL,
    end_time TIMESTAMP NOT NULL
);

INSERT INTO meetings_overlap (room, start_time, end_time) VALUES
    -- Room A meetings
    ('Room A', '2024-10-01 09:00', '2024-10-01 10:00'),
    ('Room A', '2024-10-01 10:00', '2024-10-01 11:00'),
    ('Room A', '2024-10-01 11:00', '2024-10-01 12:00'),
    -- Room B meetings
    ('Room B', '2024-10-01 09:30', '2024-10-01 11:30'),
    -- Room C meetings
    ('Room C', '2024-10-01 09:00', '2024-10-01 10:00'),
    -- Overlaps with the previous meeting.
    ('Room C', '2024-10-01 09:30', '2024-10-01 12:00');
```

If you’re interested in the solution to this puzzle, as well as the rest of the queries, check this GitHub repo.

The first takeaway from this blog post is that window functions are overpowered. Ever since I got more comfortable with using them, I feel that my queries have gotten so much simpler and easier to read, and I hope the same happens to you.

If you’re interested in learning more about them, you’ll probably enjoy reading this other blog post I’ve written, where I go over how you can understand and use them effectively.

The second takeaway is that the patterns used in these challenges really do show up in many other places. You might need to find sequences of subscriptions, measure customer retention, or find overlapping tasks. There are many situations where you will need to use window functions in a very similar fashion to what was done in the puzzles.

The third thing I want you to remember is this idea of using events instead of dealing with intervals. I’ve looked at some problems I solved a long time ago where I could have used this pattern to make my life easier, and unfortunately, I didn’t know about it at the time.


I really do hope you enjoyed this post and gave the puzzles a shot yourself. And I’m sure that if you made it this far, you either learned something new about SQL or strengthened your knowledge of window functions!

Thank you so much for reading. If you have questions or just want to get in touch with me, don’t hesitate to contact me at mtrentz.com.

All images by the author unless stated otherwise.