Linear Regression Analysis - Part 2
Now that you understand the Pearson test, let’s start working towards a real world example. But first…
You’re likely going to be pulling your variables from a structured database, so, what would this look like?
Table - ProductInformation
Brand | Price | Rating | SalesTerritory | Sales |
---|---|---|---|---|
Brand A | 50.00 | 2.5 | Territory A | 323 |
Brand A | 50.00 | 5.0 | Territory B | 891 |
Brand A | 50.00 | 1.0 | Territory C | 104 |
Brand B | 45.00 | 5.0 | Territory A | 1240 |
Brand B | 45.00 | 4.0 | Territory B | 843 |
Brand B | 45.00 | 4.0 | Territory C | 923 |
Brand C | 55.00 | 3.5 | Territory A | 409 |
Brand C | 55.00 | 5.0 | Territory B | 822 |
Brand C | 55.00 | 4.5 | Territory C | 659 |
So, how would you interact with this data? SQL.
SQL is fun, and incredibly easy to learn! It’s worth your time to get a handle of the basic syntax, which is highly transferable between flavors; SQL, MySQL, PostgreSQL etc all share largely the same fundamental querying verbage. I’ll give you a few simple functional examples before we get back to regression analysis.
Let’s say you are only interested in the rows containing Brand A, you could extract it’s information with the following query:
SELECT * FROM ProductInformation WHERE Brand = ‘Brand A’
Your output would be:
Brand | Price | Rating | SalesTerritory | Sales |
---|---|---|---|---|
Brand A | 50.00 | 2.5 | Territory A | 323 |
Brand A | 50.00 | 5.0 | Territory B | 891 |
Brand A | 50.00 | 1.0 | Territory C | 104 |
You selected every column by using the * symbol, and constrained the results to the rows also containing Brand A.
Now, let’s say you wanted to find the poorest performing region for Brand A, and return a list ranked by lowest sales.
You would query:
SELECT SalesTerritory, Rating, Sales FROM ProductInformation ORDER BY TotalSales ASC
Your output would be:
SalesTerritory | Rating | Sales |
---|---|---|
Territory C | 1.0 | 104 |
Territory A | 2.5 | 323 |
Territory B | 5.0 | 891 |
Here, you defined the specific columns you wanted by name, and ordered them from lowest-to-highest by using the ASC keywork, short for ASCending. We can see that Territory C needs some serious market research to understand the poor performance.
You can even engage in data analysis and manipulation just using SQL. Reusing the above example, we will generate a report with a column called “TotalSales” that sums up the sales throughout each division:
SELECT SUM(Sales) AS TotalSales FROM ProductInformation
TotalSales |
---|
6214 |
We’ll get deeper into SQL at another date and cover features like JOIN and INSERT, but I hope you see how simple and intuitive the SQL syntax itself can be.