Ways to Clean Data
Clean data: Data that is complete, correct, and relevant to the problem being solved.
Dirty data: Data that is incomplete, incorrect, or irrelevant to problem to be solved.
Data mapping: The process of matching fields from one data source to another.
Data merging: The process of combing two or more datasets into a single dataset.
Data validation: A tool for checking the accuracy and quality of data.
Why Cleaning Data is Important?
Compatibility: How well two or more datasets are able to work together.
Duplicate data: Any record that inadvertently shares data with another record.
Incomplete data: Data that is important fields.
Inconsistent data: Data that uses different formats to represent the same thing.
Incorrect/Inaccurate data: Data that is complete but inaccurate.
Outdated data: Any data that has been superseded by newer and more accurate information.
Null: An indication that a value does not exist in a dataset.
Data engineer: A professional who transforms data into a useful format for analysis and gives it a reliable infrastructure.
Data warehousing specialist: A professional who develops processes and procedures to effectively store and organize data.
Merger: An agreement that unites two organizations into a single new one.
Cleaning Data Using Spreadsheets
Syntax: The predetermined structure of a language that includes all required words, symbols, and punctuation, as well as their proper placement.
Conditional formatting: A spreadsheet tool that changes how cells appear when values meet specific conditions.
Remove duplicates: A spreadsheet tool that automatically searches for and eliminates duplicate entries from a spreadsheet.
Split: A function that divides text around a specified character and puts each fragment into a new, separate cell.
CONCATENATE: A spreadsheet function that joins together two or more text strings.
COUNTIF: A spreadsheet function that returns the number of cells that match a specified value.
Text string: A group of characters within a cell, most often composed of letters.
Substring: A smaller subset of a text string.
LEFT: A function that returns a set number of characters from the left side of a text string.
RIGHT: A function that returns a set number of characters from the right side of a text string.
Length: The number of characters in a text string.
LEN: A function that returns the length of a text string by counting the number of characters it contains.
MID: A function that returns a segment from the middle of a text string.
TRIM: A function that removes leading, trailing, and repeated spaces in data.
Delimiter: A character that indicates the beginning or end of a data item.
Field length: A tool for determining how many characters can be keyed into a spreadsheet field.