# Data Preprocessing

I have a databases report to write and need an explanation and answer to help me learn.

You are given the Store dataset including the following variables:

Order ID – A unique identifier for each order.

Customer ID – A unique identifier for each customer.

Order Date – The date of the order placement.

Ship Date – The date the order was shipped.

Ship Mode – The shipping mode for the order (e.g. standard, same-day).

Segment – The customer segment (e.g. Consumer, Corporate, Home Office).

Region – The region where the customer is located (e.g. West, Central, East).

Category – The category of the product purchased (e.g. Furniture, Technology, Office Supplies).

Sub-Category – The sub-category of the product purchased (e.g. Chairs, Desktops, Paper).

Product Name – The name of the product purchased.

Sales – The sales revenue for the product purchased.

Quantity – The number of units of the product purchased.

Discount – The discount applied to the product purchased.

Profit – The profit generated by the product purchased.

Questions:

Choose any two prediction tasks: Classification or regression

Based on the chosen tasks, apply all the preprocessing steps.

Justify your choices.

Requirements: 1-2 pages

Assignment-1

CS466-SEM451

This assignment can be done by a group of no more than 3 students.


Data Mining: Concepts and Techniques (3rd ed.), Chapter 3
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.

Chapter 3: Data Preprocessing

● Data Preprocessing: An Overview
● Data Cleaning
● Data Integration
● Data Reduction
● Data Transformation and Data Discretization
● Summary

Data Quality: Why Preprocess the Data?

Data quality includes:
● Accuracy: are the values correct or wrong?
● Completeness: are values not recorded or unavailable?
● Consistency: have some copies been modified while others have not?
● Timeliness: is the data updated in a timely way?
● Believability: how much can the data be trusted to be correct?
● Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing

● Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
● Data integration: integration of multiple databases, data cubes, or files
● Data reduction: dimensionality reduction, numerosity reduction, data compression
● Data transformation and data discretization: normalization, concept hierarchy generation


Data Cleaning

Data in the real world is “dirty”: it contains lots of potentially incorrect data, caused by, e.g., faulty instruments, human or computer error, or transmission errors.
● Missing values: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data, e.g., Occupation=“ ” (missing data)
● Noisy: containing noise, errors, or outliers, e.g., Salary=“−10” (an error)
● Inconsistent: containing discrepancies in codes or names, e.g.,
  ● Age=“42”, Birthday=“03/07/2018”
  ● Was rating “1, 2, 3”, now rating “A, B, C”
  ● Discrepancy between duplicate records
● Intentional (e.g., disguised missing data): Jan. 1 as everyone’s birthday?

Missing Data

● Data is not always available, e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
● Missing data may be due to:
  ● equipment malfunction
  ● data that was inconsistent with other recorded data and was therefore deleted
  ● data not entered due to misunderstanding
  ● certain data not being considered important at the time of entry
  ● history or changes of the data not being registered
● Missing data may need to be inferred

How to Handle Missing Data?

● Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
● Fill in the missing value manually: tedious and often infeasible
● Fill it in automatically with:
  ● a global constant, e.g., “unknown” (effectively a new class)
  ● a measure of central tendency for the attribute (e.g., the mean or median)
  ● the attribute mean or median for all samples belonging to the same class as the given tuple
  ● the most probable value: inference-based, such as a Bayesian formula or a decision tree
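
The automatic fill-in strategies above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `orders` rows are made up, missing values are assumed to be encoded as `None`, and `Segment` is used as a stand-in class label.

```python
from statistics import mean, median

def fill_with_central_tendency(values, measure=median):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = measure(observed)
    return [fill if v is None else v for v in values]

def fill_with_class_mean(rows, value_key, class_key):
    """Replace a missing value with the mean of the rows in the same class."""
    by_class = {}
    for row in rows:
        if row[value_key] is not None:
            by_class.setdefault(row[class_key], []).append(row[value_key])
    class_means = {label: mean(vals) for label, vals in by_class.items()}
    return [
        {**row, value_key: class_means[row[class_key]]}
        if row[value_key] is None else dict(row)
        for row in rows
    ]

# Hypothetical Store-like rows (assumption): Profit is missing for two orders.
orders = [
    {"Segment": "Consumer", "Profit": 10.0},
    {"Segment": "Consumer", "Profit": None},
    {"Segment": "Consumer", "Profit": 30.0},
    {"Segment": "Corporate", "Profit": 100.0},
    {"Segment": "Corporate", "Profit": None},
]

profits = [row["Profit"] for row in orders]
filled_global = fill_with_central_tendency(profits, measure=median)
filled_by_class = fill_with_class_mean(orders, "Profit", "Segment")
```

The median is often preferred over the mean when the attribute is skewed. Note that `fill_with_class_mean` raises a `KeyError` if a class has no observed value at all; a real pipeline would need a fallback for that case.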

Noisy Data

● Noise: random error or variance in a measured variable
● Incorrect attribute values may be due to:
  ● faulty data collection instruments
  ● data entry problems
  ● data transmission problems
  ● technology limitations
  ● inconsistency in naming conventions

How to Handle Noisy Data?

Binning: smooth a sorted data value by consulting its “neighborhood”. First sort the data and partition it into (equal-frequency) bins. One can then:
● smooth by bin means: each value in a bin is replaced by the mean value of the bin
● smooth by bin medians: each value in a bin is replaced by the bin median
● smooth by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is then replaced by the closest boundary value
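
As a rough sketch of the binning procedure, the following Python code sorts a list, splits it into equal-frequency bins, and smooths by bin means and by bin boundaries. The `prices` list and the bin size of 3 are illustrative assumptions, and ties between the two boundaries are resolved toward the lower one (also an assumption, since the slide does not say).

```python
from statistics import mean

def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

# Illustrative data (assumption, not taken from the Store dataset).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
```

With these values the bins are [4, 8, 15], [21, 21, 24], and [25, 28, 34]; smoothing by means gives 9, 22, and 29 for the three bins, and smoothing by boundaries gives [4, 4, 15], [21, 21, 24], and [25, 25, 34].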

How to Handle Noisy Data?

● Clustering: similar values are organized into groups (clusters); values that fall outside the clusters can be detected as outliers and then analyzed or removed
● Combined computer and human inspection: detect suspicious values automatically and have a human check them (e.g., to deal with possible outliers)

Data Cleaning as a Process

Data discrepancy detection:
● Use metadata (e.g., domain, range, dependency, distribution)
● Check for field overloading: this happens when developers squeeze new attribute definitions into unused (bit) portions of already defined attributes (e.g., an unused bit of an attribute whose value range uses only, say, 31 out of 32 bits)
● Check the uniqueness rule: each value of the given attribute must be different from all other values for that attribute
● Check the consecutive rule: there can be no missing values between the lowest and highest values for the attribute, and all values must also be unique (e.g., as in check numbers)
● Check the null rule: it specifies the use of blanks, question marks, special characters, or other strings that indicate that a value for a given attribute is not available, and how such values should be handled

Data Cleaning as a Process

● Use commercial tools: a number of commercial tools can aid in the discrepancy detection step
  ● Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-checking) to detect errors and make corrections
  ● Data auditing: analyze the data to discover rules and relationships and to detect violators; such tools may employ statistical analysis to find correlations, clustering to identify outliers, or the basic statistical data descriptions
● Some data inconsistencies may be corrected manually using external references

Data Cleaning as a Process

● Data migration tools: allow transformations to be specified
● ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
● Integration of the two processes of discrepancy detection and data transformation (to correct discrepancies): iterative and interactive (e.g., Potter’s Wheel)


Data Integration

● Data integration: combines data from multiple sources into a coherent store
● Schema integration: e.g., A.cust-id ≡ B.cust-#; integrate metadata from different sources
● Entity identification problem: identify real-world entities from multiple data sources, e.g., Customer_id in one database and cust_number in another refer to the same attribute
● For the same real-world entity, attribute values from different sources may differ; possible reasons include different representations and different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration

● Redundant data often occur when multiple databases are integrated
  ● Object identification: the same attribute or object may have different names in different databases
  ● Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
● Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality

Correlation Analysis (Nominal Data)

Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.
● For nominal data, we use the χ² (chi-square) test
● For numeric attributes, we can use the correlation coefficient and the covariance

Correlation Analysis (Nominal Data)

For nominal data, a correlation relationship between two attributes, A and B, can be discovered by a χ² test. Suppose A has c distinct values, namely a1, a2, …, ac, and B has r distinct values, namely b1, b2, …, br. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. Let (Ai, Bj) denote the joint event that A = ai and B = bj; every possible (Ai, Bj) joint event has its own cell in the table. The χ² value (also known as the Pearson χ² statistic) is computed as:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

where the sum runs over all cells of the contingency table.

Chi-Square Calculation: An Example

This example studies whether liking science fiction is correlated with gender. Numbers in parentheses are expected counts, calculated based on the data distribution in the two categories.

| | male | female | Sum (row) |
|---|---|---|---|
| Like science fiction | 250 (90) | 200 (360) | 450 |
| Not like science fiction | 50 (210) | 1000 (840) | 1050 |
| Sum (col.) | 300 | 1200 | 1500 |

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$

How should this result be interpreted?

Correlation Analysis (Nominal Data)

The χ² statistic tests the hypothesis that A and B are independent, that is, that there is no correlation between them. The test is based on a significance level, with (r − 1) × (c − 1) degrees of freedom.

How do we interpret the value of χ²? If df is the degrees of freedom and l is the significance level, we read the corresponding cell of the chi-square distribution table. For the previous example, the degrees of freedom are df = (2 − 1)(2 − 1) = 1. At the significance level l = 0.001, the χ² value needed to reject the hypothesis is 10.828 (taken from the table of upper percentage points of the χ² distribution). Our computed value is 507.93 > 10.828, so we can reject the hypothesis that gender and preferred reading are independent and conclude that the two attributes are (strongly) correlated for the given group of people.
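
The calculation and the decision can be checked with a short Python sketch. It derives the expected counts from the row and column totals, sums the per-cell contributions, and compares the result with the critical value 10.828 quoted above. The function and variable names are illustrative, and the critical value is hard-coded from the text rather than computed.

```python
def chi_square(observed):
    """Return (statistic, degrees of freedom, expected counts) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, count in enumerate(row):
            # Expected count under independence: row total * column total / grand total.
            e = row_totals[i] * col_totals[j] / total
            expected_row.append(e)
            statistic += (count - e) ** 2 / e
        expected.append(expected_row)
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof, expected

# Rows: like / not like science fiction; columns: male / female (from the example).
observed = [[250, 200], [50, 1000]]
statistic, dof, expected = chi_square(observed)

# Critical value for df = 1 at significance level 0.001, copied from the text.
CRITICAL_0_001_DF1 = 10.828
reject_independence = statistic > CRITICAL_0_001_DF1
```

The expected counts come out as 90, 360, 210, and 840, matching the parenthesized values in the example, and the statistic is far above the critical value, so independence is rejected.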

[Table: Chi-Square Distribution table]

Correlation Analysis (Numeric Data)

Correlation coefficient (also called Pearson’s product moment coefficient):

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B} \qquad \text{(I)}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(ai bi) is the sum of the AB cross-product.
● If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
● rA,B = 0: A and B are uncorrelated (no linear relationship)
● rA,B < 0: A and B are negatively correlated

Correlation Coefficient Example

We are given two variables X and Y. We first compute the means: x̄ = 2.57 and ȳ = 72.29.
● Standard deviation of X = sqrt(Σ(xi − x̄)² / 6) = sqrt(9.71 / 6) = 1.27
● Standard deviation of Y = sqrt(Σ(yi − ȳ)² / 6) = sqrt(171.43 / 6) = 5.34
● Applying formula (I): rX,Y = 38.86 / (6 × 1.27 × 5.34) = 0.95
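
The arithmetic of this example can be replayed from the quoted sums alone. The sketch below assumes that the divisor 6 is n − 1 (so n = 7) and that 38.86 is the numerator of formula (I), i.e. the sum of cross-deviations; the raw X and Y values are not given in the text, so they are not reconstructed here.

```python
import math

n = 7                  # assumption: the divisor 6 in the example is n - 1
sum_sq_dev_x = 9.71    # sum of (x_i - mean_x)^2, from the example
sum_sq_dev_y = 171.43  # sum of (y_i - mean_y)^2, from the example
sum_cross_dev = 38.86  # assumption: numerator of formula (I)

std_x = math.sqrt(sum_sq_dev_x / (n - 1))
std_y = math.sqrt(sum_sq_dev_y / (n - 1))
r_xy = sum_cross_dev / ((n - 1) * std_x * std_y)
```

Without intermediate rounding the coefficient is still about 0.95, which indicates a strong positive correlation.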

Correlation is not Causality

Correlation does not imply causality. That is, if A and B are correlated, this does not necessarily imply that A causes B or that B causes A. For example, in analyzing a demographic database, we may find that attributes representing the number of hospitals and the number of car thefts in a region are correlated. This does not mean that one causes the other. Both are actually causally linked to a third attribute, namely, population.

Covariance (Numeric Data)

Covariance is similar to correlation:

$$Cov(A,B) = E\big((A-\bar{A})(B-\bar{B})\big) = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n}$$

There is a relationship between correlation and covariance:

$$r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective mean or expected values of A and B, and σA and σB are the respective standard deviations of A and B.
● Positive covariance: if CovA,B > 0, then A and B both tend to be larger than their expected values.
● Negative covariance: if CovA,B < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
● Independence: if A and B are independent, then CovA,B = 0, but the converse is not true. Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.

Covariance: An Example

The computation can be simplified as:

$$Cov(A,B) = E(A \cdot B) - \bar{A}\bar{B}$$

Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14). Question: if the stocks are affected by the same industry trends, will their prices rise or fall together?
● E(A) = (2 + 3 + 5 + 4 + 6) / 5 = 20 / 5 = 4
● E(B) = (5 + 8 + 10 + 11 + 14) / 5 = 48 / 5 = 9.6
● Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14) / 5 − 4 × 9.6 = 4

Thus, A and B rise together since Cov(A, B) > 0.
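
The stock example can be reproduced with the simplified formula in a few lines of Python. The helper names are illustrative; the correlation line additionally assumes population standard deviations (division by n), to stay consistent with the covariance computed here.

```python
from statistics import fmean, pstdev

def covariance(a, b):
    """Population covariance via E(A*B) - mean(A) * mean(B)."""
    products = [x * y for x, y in zip(a, b)]
    return fmean(products) - fmean(a) * fmean(b)

def correlation(a, b):
    """Covariance divided by the product of the population standard deviations."""
    return covariance(a, b) / (pstdev(a) * pstdev(b))

# Weekly values of the two stocks from the example.
stock_a = [2, 3, 5, 4, 6]
stock_b = [5, 8, 10, 11, 14]

cov_ab = covariance(stock_a, stock_b)
r_ab = correlation(stock_a, stock_b)
```

The covariance is 4 (up to floating-point error), and the positive sign indicates that the two prices tend to rise together. Dividing by the standard deviations rescales that value into the range [−1, 1].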


Data Reduction Strategies

● Data reduction: obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
● Why data reduction? A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.
● Data reduction strategies
  ● Dimensionality reduction, e.g., remove unimportant attributes
    ● Wavelet transforms
    ● Principal Components Analysis (PCA)
    ● Feature subset selection, feature creation
  ● Numerosity reduction (or simply: data reduction)
    ● Regression and log-linear models
    ● Histograms, clustering, sampling
    ● Data cube aggregation
  ● Data compression

Data Reduction 1: Dimensionality Reduction

● Issues with dimensionality
  ● When dimensionality increases, data becomes increasingly sparse
  ● Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
  ● The possible combinations of subspaces grow exponentially
● Dimensionality reduction
  ● Avoids the issues of high dimensionality
  ● Helps eliminate irrelevant features and reduce noise
  ● Reduces the time and space required in data mining
  ● Allows easier visualization
● Dimensionality reduction techniques
  ● Wavelet transforms
  ● Principal Component Analysis
  ● Supervised and nonlinear techniques (e.g., feature selection)

Wavelet Transformation

The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X′, of coefficients. The two vectors are of the same length.

“How can this technique be useful for data reduction if the wavelet transformed data are of the same length as the original data?” The usefulness lies in the fact that the wavelet transformed data can be truncated, by storing only a small fraction of the strongest of the wavelet coefficients. For example, all wavelet coefficients larger than some user-specified threshold can be retained.
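
To make the idea concrete, here is a minimal sketch that uses the Haar wavelet, which is one simple choice of DWT (the text does not name a specific wavelet). It assumes the input length is a power of two. The `signal` values and the threshold are illustrative.

```python
import math

def haar_dwt(values):
    """Orthonormal Haar transform; the output has the same length as the input."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    output = [float(v) for v in values]
    length = n
    while length > 1:
        half = length // 2
        # Pairwise scaled sums (smooth part) and differences (detail part).
        approx = [(output[2 * i] + output[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        detail = [(output[2 * i] - output[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        output[:length] = approx + detail
        length = half  # recurse on the smooth part only
    return output

def keep_strongest(coefficients, threshold):
    """Zero out every coefficient whose magnitude is not above the threshold."""
    return [c if abs(c) > threshold else 0.0 for c in coefficients]

signal = [2, 2, 0, 2, 3, 5, 4, 4]                 # illustrative data (assumption)
coefficients = haar_dwt(signal)
truncated = keep_strongest(coefficients, threshold=1.0)
stored = [c for c in truncated if c != 0.0]       # only these need to be stored
```

The transformed vector has as many entries as the input, but after thresholding only the non-zero coefficients (plus their positions) have to be kept, which is where the reduction comes from.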

What Is Wavelet Transform?

● Decomposes a signal into different frequency subbands
● Applicable to n-dimensional signals
● Used for image compression

Principal Component Analysis (PCA)

● Find a projection that captures the largest amount of variation in data
● The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space

Principal Component Analysis (Steps)

● Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can be best used to represent the data
  ● Normalize the input data: each attribute falls within the same range
  ● Compute k orthonormal (unit) vectors, i.e., principal components
  ● Each input data vector is a linear combination of the k principal component vectors
  ● The principal components are sorted in order of decreasing “significance” or strength
  ● Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
● Works for numeric data only
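
The steps can be illustrated for the first principal component only. The sketch below centers the data, builds the covariance matrix, and uses power iteration to approximate its dominant eigenvector. Power iteration is a substitute chosen here to avoid a linear-algebra library; it is not the method named in the slides. The sketch only centers the data and assumes the attributes are already on comparable scales. The `points` are made up.

```python
import math

def center(rows):
    """Subtract the column mean from every value."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def covariance_matrix(centered):
    """Sample covariance matrix of already centered rows."""
    n = len(centered)
    dims = len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]

def first_principal_component(matrix, iterations=200):
    """Approximate the dominant eigenvector by power iteration."""
    dims = len(matrix)
    vector = [1.0 / math.sqrt(dims)] * dims  # assumption: not orthogonal to the answer
    for _ in range(iterations):
        product = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
        norm = math.sqrt(sum(p * p for p in product))
        if norm == 0:
            break
        vector = [p / norm for p in product]
    return vector

# Illustrative 2-D points lying exactly on the line y = 2x (assumption).
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered = center(points)
pc1 = first_principal_component(covariance_matrix(centered))
projected = [sum(v * w for v, w in zip(row, pc1)) for row in centered]
```

Because the points lie on one line, a single component captures all of the variance, so each 2-D point can be replaced by one projected coordinate.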

Attribute Subset Selection

● Another way to reduce the dimensionality of data
● Redundant attributes
  ● Duplicate much or all of the information contained in one or more other attributes
  ● E.g., purchase price of a product and the amount of sales tax paid
● Irrelevant attributes
  ● Contain no information that is useful for the data mining task at hand
  ● E.g., students’ ID is often irrelevant to the task of predicting students’ GPA

Heuristic Search in Attribute Selection

Some heuristic attribute selection methods:
● Best single attribute under the attribute independence assumption: choose by significance tests
● Step-wise forward selection: the best single attribute is picked first, then the next best attribute conditioned on the first, and so on
● Step-wise backward elimination: repeatedly eliminate the worst attribute
● Combination of forward selection and backward elimination: at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes
● Decision trees: a tree is constructed from the data. All attributes that do not appear in the tree are assumed to be irrelevant (Figure 3.6, page 104)

Data Reduction 2: Numerosity Reduction

● Reduce data volume by choosing alternative, smaller forms of data representation
● Parametric methods (e.g., regression)
  ● Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  ● Ex.: log-linear models, which obtain the value at a point in m-D space as the product on appropriate marginal subspaces
● Non-parametric methods
  ● Do not assume models
  ● Major families: histograms, clustering, sampling, …

Parametric Data Reduction: Regression and Log-Linear Models

● Linear regression: Y = w X + b
  ● Data are modeled to fit a straight line
  ● Often uses the least-squares method to fit the line
● Multiple regression: Y = b0 + b1 X1 + b2 X2
  ● Allows a variable Y to be modeled as a linear function of two or more feature variables
● Log-linear model
  ● Approximates discrete multidimensional probability distributions
  ● Estimates the probability of each point (tuple) in a multi-dimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations
  ● Useful for dimensionality reduction and data smoothing
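
As a small illustration of parametric reduction, the sketch below fits Y = w X + b by least squares, so that only the two parameters need to be stored. The `quantity` and `sales` values are invented and deliberately lie on an exact line; real data would leave residuals.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope w and intercept b for Y = w X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b

# Hypothetical values (assumption): sales = 10 * quantity + 2 exactly.
quantity = [1, 2, 3, 4, 5]
sales = [12.0, 22.0, 32.0, 42.0, 52.0]

w, b = fit_line(quantity, sales)
predicted = [w * q + b for q in quantity]
```

Here five (quantity, sales) pairs are summarized by the pair (w, b); with noisy data the stored parameters would reproduce the values only approximately.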

Non-Parametric: Histogram Analysis

Divide the data into buckets. Partitioning rules:
● Equal-width: the width of each bucket range is uniform (e.g., a width of $10 for the buckets in Figure 3.8, page 107: 1-10, 11-20, 21-30, …)
● Equal-frequency (or equal-depth): the frequency of each bucket is constant, i.e., each bucket contains the same number of data samples

Clustering

● Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
● Can be very effective if the data is clustered, but not if the data is “dirty”
● Can use hierarchical clustering, with the clusters stored in multi-dimensional index tree structures
● There are many choices of clustering definitions and clustering algorithms

Sampling

● Sampling: obtaining a small sample s to represent the whole data set N
● Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
● Key principle: choose a representative subset of the data
  ● Simple random sampling may have very poor performance in the presence of skew
  ● Develop adaptive sampling methods, e.g., stratified sampling

Types of Sampling (Figure 3.9, page 109)

● Simple random sampling: there is an equal probability of selecting any particular item
● Sampling without replacement: once an object is selected, it is removed from the population
● Sampling with replacement: a selected object is not removed from the population
● Stratified sampling
  ● Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
  ● Used in conjunction with skewed data
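
The three sampling schemes can be sketched with Python's standard `random` module. The `orders` rows are hypothetical and deliberately skewed (90 Consumer rows, 10 Home Office rows). Taking at least one row per stratum is a design assumption of this sketch, not something the slide specifies.

```python
import random
from collections import defaultdict

def sample_without_replacement(items, k, rng):
    """Each selected item is removed from the pool, so no item repeats."""
    return rng.sample(items, k)

def sample_with_replacement(items, k, rng):
    """A selected item stays in the pool, so it may be drawn again."""
    return [rng.choice(items) for _ in range(k)]

def stratified_sample(rows, key, fraction, rng):
    """Draw roughly the same fraction from every stratum defined by key."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # assumption: at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical skewed data (assumption): 90 Consumer rows, 10 Home Office rows.
orders = (
    [{"Order ID": i, "Segment": "Consumer"} for i in range(90)]
    + [{"Order ID": i, "Segment": "Home Office"} for i in range(90, 100)]
)

srswor = sample_without_replacement(orders, 10, rng)
srswr = sample_with_replacement(orders, 10, rng)
stratified = stratified_sample(orders, "Segment", 0.1, rng)
```

A simple random sample of 10 rows may contain no Home Office order at all, while the stratified sample always keeps the 9:1 proportion, which is why stratification is recommended for skewed data.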

[Figure: Sampling: Cluster or Stratified Sampling. Panels: Raw Data; Cluster/Stratified Sample]

Data Cube Aggregation

● The lowest level of a data cube (base cuboid)
  ● The aggregated data for an individual entity of interest
  ● E.g., a specific customer in a data warehouse
● Multiple levels of aggregation in data cubes
  ● Further reduce the size of the data to deal with
● Reference appropriate levels
  ● Use the smallest representation that is enough to solve the task
● Queries regarding aggregated information should be answered using the data cube, when possible (Figure 3.11, page 111)


Data Transformation

Maps the entire set of values of an attribute to a new set of values such that each old value can be identified with one of the new values. Methods:
● Smoothing: remove noise from the data
● Attribute/feature construction: new attributes are constructed from the given ones
● Aggregation: summarization, data cube construction
● Normalization: values are scaled to fall within a smaller, specified range
  ● min-max normalization
  ● z-score normalization
  ● normalization by decimal scaling
● Discretization: continuous values are replaced by discrete intervals
● Concept hierarchy climbing: for example, street can be generalized to higher-level concepts, like city or country

Normalization

● Min-max normalization to [new_minA, new_maxA]:

$$v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$

  Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$

● Z-score normalization (µ: mean, σ: standard deviation):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

  Ex. Let µ = 54,000 and σ = 16,000. Then

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$

● Normalization by decimal scaling:

$$v' = \frac{v}{10^j}$$

  where j is the smallest integer such that Max(|v′|) < 1
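
The three normalization formulas translate directly into code. The sketch below reuses the income example from the text; the values passed to `decimal_scaling` are illustrative, and the search for j starts at 0, which assumes the data are not already far below 1 in magnitude.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Number of standard deviations between v and the mean."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer giving max |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

income_min_max = min_max(73_600, 12_000, 98_000)   # income example from the text
income_z = z_score(73_600, 54_000, 16_000)         # income example from the text
scaled, j = decimal_scaling([-986, 917])           # illustrative values (assumption)
```

Min-max normalization is sensitive to outliers because they define the range, whereas the z-score is usually the safer choice when the true minimum and maximum are unknown.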

Discretization

Three types of attributes:
● Nominal: values from an unordered set, e.g., color, profession
● Ordinal: values from an ordered set, e.g., military or academic rank
● Numeric: quantitative values, e.g., integer or real numbers

Discretization: divide the range of a continuous attribute into intervals
● Interval labels can then be used to replace actual data values
● Reduce data size by discretization
● Supervised vs. unsupervised
● Split (top-down) vs. merge (bottom-up)
● Discretization can be performed recursively on an attribute
● Can be a preparation for further analysis, e.g., classification

Data Discretization Methods

● Binning: top-down split, unsupervised
● Histogram analysis: top-down split, unsupervised
● Clustering analysis: unsupervised, top-down split or bottom-up merge
● Decision-tree analysis: supervised, top-down split
● Correlation (e.g., χ²) analysis: supervised (it uses class information), bottom-up merge

Simple Discretization: Binning

● Equal-width (distance) partitioning
  ● Divides the range into N intervals of equal size: uniform grid
  ● If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N
  ● The most straightforward, but outliers may dominate the presentation
  ● Skewed data is not handled well
● Equal-depth (frequency) partitioning
  ● Divides the range into N intervals, each containing approximately the same number of samples
  ● Good data scaling
  ● Managing categorical attributes can be tricky
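
Both partitioning schemes can be sketched as follows. The `sales` values are illustrative. Placing the maximum value in the last interval, rather than opening an extra one, is a convention chosen for this sketch. The equal-width functions assume the values are not all identical, since the width would then be zero.

```python
def equal_width_edges(values, n_bins):
    """Interval edges for N equal-width bins, with W = (B - A) / N."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [low + i * width for i in range(n_bins + 1)]

def equal_width_labels(values, n_bins):
    """Index of the equal-width interval each value falls into."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    # The maximum value is assigned to the last interval (sketch convention).
    return [min(int((v - low) / width), n_bins - 1) for v in values]

def equal_depth_bins(values, n_bins):
    """Split the sorted values into N bins of (almost) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

sales = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative values (assumption)
edges = equal_width_edges(sales, 3)
labels = equal_width_labels(sales, 3)
depth_bins = equal_depth_bins(sales, 3)
```

With these values the equal-width intervals hold 2, 3, and 4 values respectively, while the equal-depth bins hold exactly 3 each, which illustrates why equal-depth partitioning copes better with skewed data.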

Discretization by Classification & Correlation Analysis

● Classification (e.g., decision tree analysis)
  ● Supervised: given class labels, e.g., cancerous vs. benign
  ● Uses entropy to determine the split point (discretization point)
  ● Top-down, recursive split (to be covered in Chapter 7)
● Correlation analysis (e.g., Chi-merge: χ²-based discretization)
  ● Supervised: uses class information
  ● Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ² values) to merge
  ● Merge is performed recursively, until a predefined stopping condition is met

Concept Hierarchy Generation

● A concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse
● Concept hierarchies facilitate exploring data warehouses to view data at multiple granularities
● Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior)
● Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers
● Concept hierarchies can be automatically formed for both numeric and nominal data. For numeric data, use the discretization methods shown.

Concept Hierarchy Generation for Nominal Data

● Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  ● street < city < state < country
● Specification of a hierarchy for a set of values by explicit data grouping
  ● {Urbana, Champaign, Chicago} < Illinois
● Specification of only a partial set of attributes
  ● E.g., only street < city, not others
● Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
  ● E.g., for a set of attributes: {street, city, state, country}

Automatic Concept Hierarchy Generation

● Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  ● The attribute with the most distinct values is placed at the lowest level of the hierarchy
  ● Exceptions, e.g., weekday, month, quarter, year

Example hierarchy, from the highest to the lowest level:
● country: 15 distinct values
● province_or_state: 365 distinct values
● city: 3567 distinct values
● street: 674,339 distinct values
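
The distinct-value heuristic amounts to a sort. The sketch below counts distinct values per attribute and orders the attributes from the top of the hierarchy (fewest distinct values) to the bottom (most). The counts are the ones quoted above; the `rows` passed to `count_distinct` are a made-up miniature example.

```python
def count_distinct(rows, attributes):
    """Number of distinct values per attribute in a list of dict rows."""
    return {attr: len({row[attr] for row in rows}) for attr in attributes}

def hierarchy_by_distinct_values(distinct_counts):
    """Attributes ordered from highest level (fewest values) to lowest (most)."""
    return sorted(distinct_counts, key=distinct_counts.get)

# Distinct-value counts quoted in the example.
counts = {"street": 674339, "country": 15, "city": 3567, "province_or_state": 365}
hierarchy = hierarchy_by_distinct_values(counts)

# Hypothetical miniature data set (assumption) to show the counting step.
rows = [
    {"Region": "West", "Category": "Furniture", "Sub-Category": "Chairs"},
    {"Region": "West", "Category": "Furniture", "Sub-Category": "Tables"},
    {"Region": "West", "Category": "Technology", "Sub-Category": "Phones"},
]
small_counts = count_distinct(rows, ["Region", "Category", "Sub-Category"])
```

As the slide warns, the heuristic is not reliable for attributes such as weekday, month, quarter, and year, where a higher-level concept can have more distinct values than a lower-level one, so the result should be reviewed by hand.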

Summary

● Data reduction
  ● Dimensionality reduction
  ● Numerosity reduction
● Data transformation and data discretization
  ● Normalization
  ● Concept hierarchy generation