Outlier Analysis: How to Ensure the Sanity of Size-Related Brand Data

Fitrrati is a cloud-based, data-driven, personalized Fit Technology enterprise solution for fashion brands and e-tailers. It addresses a common consumer problem: not knowing which size fits well when shopping online for fashion products such as apparel and footwear.

The technology collects a large amount of size-related data from brands and retailers and combines it with consumer data to make very accurate size recommendations, along with fit details, for individual users, taking into account their individual fit preferences. The performance of this technology depends heavily on the quality of the data: garbage in, garbage out. Hence the need to ensure data sanity and detect aberrant values. Two different approaches are used for this, IQR analysis and a probability density function; both are discussed in detail, along with results, in this paper.

The outliers detected by each approach are subjected to exhaustive quality checks (for correction), which include:

1. Re-checking the original source of data
2. Verifying the values against at least one more reliable public source
3. Contacting the brand or retailers to confirm the values
4. Arranging a brand store visit to manually record the size related values

The two approaches are discussed below.

IQR Analysis

Interquartile range (IQR) is a measure of statistical dispersion, also called the midspread or middle fifty. It is the range within which the middle 50% of the values lie. So, if

the 25% quantile is the point with 25% of the values below it and 75% above it, and

the 75% quantile is the point with 75% of the values below it and 25% above it, then

IQR value = 75% quantile value - 25% quantile value

Now, a value is flagged as an outlier if it is below ‘25% quantile value - a x IQR value’ (Min-Outlier) or above ‘75% quantile value + a x IQR value’ (Max-Outlier), where ‘a’ is a factor of the IQR that defines the outlier limits. Several approaches to determining the value of ‘a’ were tested (including using the standard deviation, the 90% quantile, and the 75% quantile of a bigger size & 25% quantile of a smaller size). The value settled on is a = 1.5, which flags less than 20% of the total values as outliers.
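The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the production implementation: the function name `iqr_outliers`, the sample chest measurements and the choice of the "inclusive" quantile method are all assumptions made for the example. Only the limits (25% quantile - a x IQR, 75% quantile + a x IQR) and a = 1.5 come from the text.

```python
from statistics import quantiles


def iqr_outliers(values, a=1.5):
    """Return (min_outlier_limit, max_outlier_limit, outliers) for one group.

    `a` is the IQR factor; 1.5 is the value the text settles on.
    """
    # quantiles(..., n=4) returns the 25%, 50% and 75% cut points.
    # The "inclusive" interpolation method is an assumption of this sketch.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - a * iqr  # Min-Outlier limit
    upper = q3 + a * iqr  # Max-Outlier limit
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Hypothetical chest measurements (inches) for a single size label.
chest_values = [38.0, 38.5, 39.0, 39.0, 39.5, 40.0, 40.0, 40.5, 41.0, 46.0]
lower, upper, outliers = iqr_outliers(chest_values)
print(lower, upper, outliers)
```

With this made-up sample, only the value 46.0 falls outside the limits and would be sent to the quality checks.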

IQR analysis was conducted on all combinations of gender (male or female), product type (shirts, t-shirts etc.), fit type (regular, slim etc.) and measurement type (chest, shoulder etc.) for each size (S, M, L, 38, 40 etc.), and a box plot was plotted for each size in every combination to determine the outliers.
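As a rough sketch of how the per-combination analysis might be organised, the values can be grouped by the key (gender, product type, fit type, measurement type, size) and the IQR rule applied to each group separately. The record layout, the function name `flag_by_group`, the sample values and the `min_count` guard for very small groups are assumptions of this example, not details from the original analysis.

```python
from collections import defaultdict
from statistics import quantiles


def flag_by_group(records, a=1.5, min_count=4):
    """Apply the IQR rule separately to every combination key.

    Each record is (gender, product_type, fit_type, measurement_type,
    size, value). Returns {key: [outlier values]} for groups with outliers.
    """
    groups = defaultdict(list)
    for *key, value in records:
        groups[tuple(key)].append(value)

    flagged = {}
    for key, values in groups.items():
        if len(values) < min_count:
            continue  # assumption: skip groups too small for stable quartiles
        q1, _median, q3 = quantiles(values, n=4, method="inclusive")
        iqr = q3 - q1
        lower, upper = q1 - a * iqr, q3 + a * iqr
        outliers = [v for v in values if v < lower or v > upper]
        if outliers:
            flagged[key] = outliers
    return flagged


# Hypothetical data: chest measurements for two size labels of one combination.
m_values = [40.0, 40.5, 41.0, 41.0, 41.5, 47.0]
l_values = [42.0, 42.5, 43.0, 43.5, 44.0]
records = (
    [("male", "shirt", "regular", "chest", "M", v) for v in m_values]
    + [("male", "shirt", "regular", "chest", "L", v) for v in l_values]
)
flagged = flag_by_group(records)
print(flagged)
```

Grouping first keeps the limits specific to each size label, so a value that is normal for size L is not judged against the spread of size M.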

Here’s a box plot of Men’s Regular Fit Shirts (Chest Measurement) for all the size labels. In a box plot, the thick line inside the box represents the 50% quantile value, while the base and top of the box are the 25% and 75% quantile values respectively. The whiskers extending from the bottom and top of the box represent the Min-Outlier and Max-Outlier values respectively.

Probability Density

In another approach, probability density is used to determine the chance of a value occurring in the multi-modal distribution of size-related data. Every measurement value in a combination of gender, product type, fit type and measurement type for each size is associated with a probability density (PD) value. The PD value is derived from the frequency of occurrence of the values: the lower the frequency, the lower the probability density value.

To detect outliers, a threshold of 0.05 is set. A measurement value whose PD value is below the threshold is flagged as an outlier.
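The text does not spell out how the PD value is estimated, so the following Python sketch makes a simplifying assumption: the PD value of a measurement is taken to be its relative frequency within its group (a kernel density estimate would be another option). The function name `pd_outliers` and the sample data are hypothetical; the 0.05 threshold is the one stated above.

```python
from collections import Counter


def pd_outliers(values, threshold=0.05):
    """Flag values whose relative frequency in the group is below `threshold`.

    Relative frequency stands in for the probability density value here;
    that substitution is an assumption of this sketch.
    """
    counts = Counter(values)
    total = len(values)
    density = {value: count / total for value, count in counts.items()}
    flagged = sorted(v for v, p in density.items() if p < threshold)
    return density, flagged


# Hypothetical chest measurements (inches) for one size label, 25 values.
xs_chest = [36.0] * 6 + [36.5] * 8 + [37.0] * 7 + [37.5] * 3 + [42.0]
density, flagged = pd_outliers(xs_chest)
print(flagged)
```

In this made-up sample, 42.0 occurs once in 25 values (relative frequency 0.04), so it falls below the threshold, while 37.5 (0.12) does not. Because the rule depends on frequency rather than distance from the quartiles, it tolerates several distinct peaks in a multi-modal group.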

Here’s a probability density plot of Men’s Regular Fit Shirts (Chest Measurement, Size XS).

Conclusion

IQR analysis flagged 19% of the total size charts as outliers, of which 30% underwent correction during the quality check. In comparison, the Probability Density approach flagged 5% of the total size charts as outliers, of which 60% underwent correction during the quality check.
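One way to read these figures side by side is to multiply the flagged share by the corrected share, which gives the portion of all size charts that each approach led to correcting. The short Python sketch below only redoes that arithmetic with the percentages quoted above; the function name is hypothetical and no new data is introduced.

```python
def corrected_share_of_total(flagged_share, corrected_share_of_flagged):
    """Share of all size charts that were flagged and then corrected."""
    return flagged_share * corrected_share_of_flagged


# Percentages quoted in the conclusion, expressed as fractions.
iqr_corrected = corrected_share_of_total(0.19, 0.30)
pd_corrected = corrected_share_of_total(0.05, 0.60)

print(f"IQR: {iqr_corrected:.1%} of all charts corrected")
print(f"PD:  {pd_corrected:.1%} of all charts corrected")
```

On these numbers, IQR analysis led to corrections in about 5.7% of all charts and the Probability Density approach in about 3%. So the Probability Density flags were more often genuine errors, while IQR analysis surfaced more corrections overall at the cost of more charts to review.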