Data Aggregation
Aggregation: The process of collecting or gathering many separate pieces into a whole.
Data aggregation: The process of gathering data from multiple sources and combining it into a single, summarized collection.
Using VLOOKUP in Spreadsheets
MATCH: A spreadsheet function used to locate the position of a specific lookup value.
VALUE: A spreadsheet function that converts a text string that represents a number to a numeric value.
Basic syntax: =VLOOKUP(”value_to_search”, sheet2!$A$1:$D$10, column_number, FALSE)
TRUE searches for approximate results, and FALSE searches for exact matches.
Absolute reference: A reference within a function that is locked so that rows and columns won’t change if the function is copied.
Using JOIN in SQL
JOIN: A SQL function that is used to combine rows from two or more tables based on a related column.
Example:
SELECT
employees.name AS employee_name,
employees.role AS employy_role,
departments.name AS department_name
FROM
employees
INNER JOIN
deparments ON
employees.deparment_id = deparments.deparment_id
Other JOINs: LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN
INNER JOIN: A SQL function that returns records with matching values in both tables.
LEFT JOIN: A SQL function that will return all the records from the left table and only the matching records from the right table.
RIGHT JOIN: A SQL function that will return all records from the right table and only the matching records from the left.
OUTER JOIN: A SQL function that combines RIGHT and LEFT JOIN to return all matching records in both tables.
Using COUNT and COUNT DISTINCT in SQL
COUNT DISTINCT: A SQL function that only returns the distinct values in a specified range.
Example:
SELECT
orders.*,
warehouse.warehouse_alias,
warehouse.state
FROM
warehouse_orders.Orders AS orders
JOIN
warehouse_orders.Warehouse AS warehouse ON
orders.warehouse_id = warehouse.warehouse_id
LIMIT 100
SELECT
COUNT(DISTINCT warehouse.state) AS num_states
FROM
warehouse_orders.Orders AS orders
JOIN
warehouse_orders.Warehouse AS warehouse ON
orders.warehouse_id = warehouse.warehouse_id
GROUP BY
warehouse.state
Aliasing: Temporarily naming a table or column in a query to make it easier to read and write.
Work with Subqueries
Subquery: A SQL query that is nested inside a larger query.
Inner query: A SQL subquery that is inside of another SQL statement.
Outer query: A SQL statement containing a subquery.
Examples:
SELECT
station_id,
num_bikes_available
(SELECT
AVG(num_bikes_available)
FROM bigquery-public-data.new_york.citibike_stations) AS avg_num_bikes_available
FROM
bigquery-public-data.new_york.citibike_stations
SELECT
station_id,
name,
number_of_rides AS number_of_rides_starting_at_station
FROM
(
SELECT
start_station_id
COUNT(*) number_of_rides
FROM
bigquery-public-data.new_york.trips
GROUP BY
start_station_id
)
AS station_num_trips
INNER JOIN
bigquery-public-data.new_york.citibike_stations ON
station_id = start_station_id
ORDER BY
number_of_rides DESC
SELECT
station_id,
name
FROM
bigquery-public-data.new_york.citibike_stations
WHERE
station_id IN
(
SELECT
start_station_id
FROM
bigquery-public-data.new_york.trips
WHERE
usertype = 'Subscriber'
)
Use HAVING and CASE in subqueries.
HAVING: A SQL clause that adds a filter to a query instead of the underlying table that can only be used with aggregate functions.
Example:
SELECT
Warehouse.warehouse_id,
CONCAT(Warehouse.state, ': ', Warehouse.warehouse_alias) AS warehouse_name
COUNT(Orders.order_id) AS number_of_orders,
(
SELECT
COUNT(*)
FROM warehouse_orders.Orders Orders
)
AS total_orders,
CASE
WHEN COUNT(Orders.order_id)/(SELECT COUNT(*) FROM warehouse_orders.Orders Orders) <= 0.20
THEN "Fulfilled 0-20% of Orders"
WHEN COUNT(Orders.order_id)/(SELECT COUNT(*) FROM warehouse_orders.Orders Orders) > 0.20
AND COUNT(Orders.order_id)/(SELECT COUNT(*) FROM warehouse_orders.Orders Orders) <= 0.60
THEN "Fulfilled 21-60% of Orders"
ELSE "Fulfilled moree than 60% of Orders"
END AS fulfillment_summary
FROM
warehouse_orders.Warehouse Warehouse
LEFT JOIN
warehouse_orders.Orders Orders ON
Orders.warehouse_id = Warehouse.warehouse_id
GROUP BY
Warehouse.warehouse_id,
warehouse_name
HAVING
COUNT(Orders.order_id) > 0