Originally posted on LinkedIn
Learning SQL is hard at even if you are an experienced programmer. You have to think in terms of relationships: what matches up with what and how do I aggregate correctly? It is even harder if you come across a resource that looks trustworthy but is actually full of mistakes.
This week I got an email asking for help understanding two SQL examples from the Advanced SQL section of a seemingly normal SQL tutorial. The examples were around finding the Percent of Total and Cumulative Percent of Total.
I wrote this article to broadcast the issues I found and the correct answers to a greater audience. I hope this helps others who are on their journey to learning Data Science and Data Engineering.
The Resources Provided by the Tutorial in Question
The SQL tutorial provided a table:
The SQL tutorial provided a query on how to find the Percent of Total:
The SQL tutorial provided a query on how to find the Cumulative Percent of Total:
The Question I Received
“How does this work? I don’t understand WHERE a1.Sales <= a2.sales and I only see one table with two columns, so how do they get a2.Sales and a2.Name?”
I was a bit confused at first too because there are lots of problems with these exercises!
The First Issue: Use a Join
If you also were confused why there were two aliases but only one table, take a look at the code following the keyword FROM. You can see the table is referenced twice separated by a comma.
When you put two tables behind the word FROM it will create an INNER JOIN between them. In my opinion, this is bad practice. It is easier for everyone to understand if you use the standard syntax: INNER JOIN.
The Second Issue: Unused Code
In the Percent of Total query, there is a lot of unused code.
- There is nothing in the SELECT statement uses the alias ‘a2’
- There’s no apparent filtering purposes in the WHERE clause that explains the purpose of a2
You can achieve the same results with a simple SELECT/FROM query. See the image below on how to find the Percent of Total for each row:
The Third Issue: Cumulative Percent of Total Doesn’t Work
In the Cumulative Percent of Total query, this is where this confusing WHERE clause is actually used.
What they are trying to do is for each sales number match up all the sales numbers equal or lesser to it and also include any records that are themselves. Then aggregate.
The logic they were going for was this:
But the query they wrote allows both Stella and Jeff to match up with each other because they have the same sales amount. So they actually ended up with this:
So when you run the query exactly, it doesn’t print out the same numbers as Results section says it should.
The Fourth Issue: Redundant Code
If you didn’t catch the flawed logic right away and are wondering how that got past you, don’t worry it was hard to read in the first place because there was redundant code.
The section after OR is actually redundant:
- It is redundant since a1.Sales <= a2.Sales covers the a1.Sales=a2.Sales case
- Additionally if a1.Name=a2.Name, then the sales would already have matched up so the a1.Sales=a2.Sales case covers this
Take a look, it returns the same results when I comment out the redundant code:
Cumulative Percent of Total: The Answer
So how can we actually find the Cumulative Percent of Total? What we need is to label each record with a unique number that also reflects numerical order.
This way we can JOIN the table back to itself and eliminate the possibility of a ‘tie’ which was our third issue (above). We basically need a row number. There is a window function called ROW_NUMBER() which is supported in many versions of SQL.
But unfortunately the editor I’m using is a release of SQLite that doesn’t include window functions yet. So I just wrote them in, hope that is ok 🙂
To find the Cumulative Percent of Total, we need to JOIN the table to itself and find for each record all numbers that come before that record. Here is the JOIN I wrote to accomplish this. In case you are a visual learner, I printed out what the data looks like before I aggregate it. This is so you can see that each row accumulates the info of the ones before it.
Then I apply the SUM() aggregation function and use GROUP BY to bring us back to being at the name level.
Unfortunately, not everything on the internet is true even sometimes online tutorials! If something stumps you, don’t automatically assume that you are missing something. The tutorial may be wrong! Reach out to someone you trust (if I have time, I’m happy to help)! If all else fails, ask on StackOverflow. 🙂
Share this article with a friend who is studying for data projects or positions! If you have questions feel free to leave them in the comments & I will try to get back to you.
Thanks for reading!
Check out my essentials list on Amazon