Using SQL to Clean Data
SELECT
name
FROM
table_name
INSERT INTO xxx ()
VALUES ()
UPDATE table_name
SET variable_nanme1 = 'value1'
WHERE variable_name2 = 'value2'
To avoid duplicated data, we can use the DISTINCT
function.
DISTINCT: A keyword that is added to a SQL SELECT statement to retrieve only non-duplicate entries.
SELECT
DISTINCT variable_name
FROM
table_name
Use the LENGTH
or LEN
function to check if the text string variable matches.
SELECT
LENGTH(variable_name) AS column_name
FROM
table_nanme
SELECT
variable_nanme
FROM
table_name
WHERE
LENGTH(coulumn_nanme) > number
Use SUBSTR
function to select data we want.
SUBSTR: A SQL function that extracts a substring from a string variable.
SELECT
DISTINCT variable_name
FROM
table_name
WHERE
SUBSTR(column_nanme,start_letter_position,number_of_letter) = 'value'
Use the TRIM
function to remove extra spaces.
SELECT
DISTINCT variable_name
FROM
table_nanme
WHERE
TRIM(variable_name) = 'value'
Typecasting: Converting data from one type to another.
Use CAST
function to convert data into different data types.
CAST: A SQL function that converts data from one datatype to another.
Float: A number that contains a decimal.
SELECT
CAST(variable_name AS FLOAT64)
FROM
table_name
ORDER BY
CAST(variable_name AS FLOAT64) DESC
SELECT
CAST(date AS date) AS date_only, variable_name2
FROM
table_name
WHERE
date BETWEEN '2020-12-01' AND '2020-12-31'
Use CONCAT
function to combine two columns.
CONCAT: A SQL function that adds strings together to create new text strings that can be used as unique keys.
SELECT
CONCAT(varibale_name1,variable_name2) AS new_name
FROM
table_name
WHERE
variable = 'value'
When some information is missing, COALESCE
function gives a secondary reference.
COALESCE: A SQL function that returns non-null values in a list.
SELECT
COALESCE(variable_name1,variable_name2) AS new_name
FROM
table_name
WHERE
variable = 'value'