Science Behind Sizes in Fashion: Clustering Size Related Brand Data

Fitrrati’s aim is to get to the root cause of size related issues in online fashion retail and provide an intelligent fit technology platform to fashion brands and e-tailers. In one such attempt, we explored the science behind size related brand data by clustering the product measurements of various Size Labels across different genders, brands, product types and fit types.

The data used in the analysis involved 18,213 size label records across 224 Indian fashion brands in 17 product categories in both men’s & women’s clothing. The clustering approach used in this study is explained below.

K-means Clustering Approach

The clustering technique deployed is K-means. It is one of the simplest unsupervised learning algorithms: it partitions a set of data points into a small number of clusters such that each data point belongs to the cluster with the nearest mean.
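The two alternating steps of K-means (assign each point to the nearest mean, then recompute each mean) can be sketched in plain Python. This is a minimal illustration, not the implementation used in the study: the function names, the random initialisation and the (chest, shoulder) values are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, iterations=100, seed=0):
    """Basic K-means (Lloyd's algorithm); returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k random data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest center.
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(mean_point(members) if members else centers[c])
        if new_centers == centers:  # no center moved, so the result is stable
            break
        centers = new_centers
    return centers, labels

def wcss(points, centers, labels):
    """Within-cluster sum of squares for a given clustering."""
    return sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))

# Hypothetical (chest, shoulder) measurements in inches, for illustration only.
measurements = [
    (36.0, 16.0), (36.5, 16.2), (37.0, 16.5),
    (40.0, 17.5), (40.5, 17.8), (41.0, 18.0),
    (44.0, 19.0), (44.5, 19.2), (45.0, 19.5),
]
centers, labels = kmeans(measurements, k=3)
print(centers, labels, wcss(measurements, centers, labels))
```

K-means can settle on different clusterings depending on the starting centers, so in practice it is usually run several times and the run with the lowest WCSS is kept.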

To determine the optimum number of clusters, a plot of Within-cluster sum of squares (WCSS) v/s No. of clusters (N) is generated. WCSS keeps falling as N grows, so the optimum number of clusters is taken at the ‘elbow’ of the plot: the point where WCSS stops dropping steeply and begins to level off.

Here’s a plot of WCSS v/s N for Men’s Regular Fit Shirt (Chest & Shoulder Measurements). The optimum number of clusters in this case is 8.
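The optimum is read off the plot by eye in this study. If the choice had to be automated, one simple heuristic is to stop at the first N where adding one more cluster reduces WCSS by less than a chosen fraction. The sketch below assumes a 10% cut-off and uses made-up WCSS values; neither comes from the study.

```python
def elbow(wcss_by_n, min_relative_drop=0.10):
    """Smallest N after which one more cluster cuts WCSS by less than min_relative_drop.

    wcss_by_n maps a number of clusters N to the WCSS obtained with N clusters.
    The threshold rule is an assumed heuristic, not the article's method.
    """
    ns = sorted(wcss_by_n)
    for current, nxt in zip(ns, ns[1:]):
        if wcss_by_n[current] == 0:
            return current  # already a perfect fit, nothing left to gain
        drop = (wcss_by_n[current] - wcss_by_n[nxt]) / wcss_by_n[current]
        if drop < min_relative_drop:
            return current
    return ns[-1]  # WCSS was still dropping steeply at the largest N tried

# Made-up WCSS values for N = 1..6, for illustration only.
wcss_by_n = {1: 1000.0, 2: 520.0, 3: 260.0, 4: 240.0, 5: 228.0, 6: 220.0}
best_n = elbow(wcss_by_n)
print(best_n)  # 3: going from 3 to 4 clusters cuts WCSS by under 8%
```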

Cluster plot of ‘Men’s Regular Fit Shirts’ is shown below. The points in this plot are product measurements (Shoulder v/s Chest) of different sizes across various brands offering Men’s Regular Fit Shirts in India. Each cluster (shown by lines connecting its points to one center point) represents an ideal Size Label (XS, S, M, L, XL, XXL, 3XL, 4-6XL) meant for users of particular body measurements. The center point of each cluster is the mean of all points in that cluster.

The plot below shows the same size from two brands offering ‘Men’s Regular Fit Shirts’, United Colors of Benetton (UCB) and Nautica, highlighted on the cluster plot with a + symbol. Size ‘S’ of UCB lies in the cluster ‘S’ while Size ‘S’ of Nautica lies in cluster ‘M’. The comparison shows that although the two shirts have the same size tag (Size ‘S’), they are meant for customers of different body measurements. If Size ‘S’ of UCB fits a customer perfectly, chances are Size ‘S’ of Nautica will be loose for the same customer.
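The comparison amounts to finding the nearest cluster center for a brand's measurements and checking how far that cluster is from the size printed on the tag. A minimal sketch follows; the center coordinates and the two brand measurements are hypothetical placeholders, not values from the study.

```python
import math

# Ideal Size Labels in increasing order, as listed for Men's Regular Fit Shirts.
SIZE_ORDER = ["XS", "S", "M", "L", "XL", "XXL", "3XL", "4-6XL"]

# Hypothetical cluster centers as (chest, shoulder) in inches, for illustration only.
CENTERS = {
    "XS": (36.0, 16.0), "S": (38.0, 16.8), "M": (40.0, 17.5), "L": (42.0, 18.2),
    "XL": (44.0, 19.0), "XXL": (46.0, 19.8), "3XL": (48.0, 20.5), "4-6XL": (52.0, 21.5),
}

def nearest_cluster(measurement, centers=CENTERS):
    """Label of the cluster whose center is closest to the measurement."""
    return min(centers, key=lambda label: math.dist(measurement, centers[label]))

def size_offset(tagged_size, measurement):
    """How many sizes the garment runs bigger (+) or smaller (-) than its tag."""
    cluster = nearest_cluster(measurement)
    return SIZE_ORDER.index(cluster) - SIZE_ORDER.index(tagged_size)

# Two hypothetical brands, both tagging the shirt as Size S.
brand_a_s = (38.2, 16.9)
brand_b_s = (40.3, 17.4)
print(nearest_cluster(brand_a_s), size_offset("S", brand_a_s))  # S 0
print(nearest_cluster(brand_b_s), size_offset("S", brand_b_s))  # M 1
```

An offset of 0 means the tag matches the ideal Size Label, while +1 means the garment runs one size bigger than its tag.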

Similarly, cluster plots of other product categories (Men’s Regular Fit T-shirts, Women’s Regular Fit Dresses, Women’s Regular Fit Kurtas & Women’s Regular Fit Tees & Tops) are generated using size related data, and the resulting clusters are analysed.

Below is the cluster plot of Men’s Regular Fit T-shirts.

Again, same sizes of two brands offering ‘Men’s Regular Fit T-shirts’ are analysed in the cluster plot below (depicted by + symbol). Size ‘S’ of Nautica is observed to run two sizes bigger (lies in Cluster ‘XL’) as compared to Size ‘S’ of Puma (lies in Cluster ‘M’).

Below is the cluster plot of Women’s Regular Fit Dresses.

Size S of Nautica and Anouk Regular Fit Dresses are analysed and compared in the cluster plot below (depicted by + symbol). Anouk’s Size S (lies in Cluster XS) is observed to run two sizes smaller compared to Nautica’s Size S (lies in Cluster M).

Below is the cluster plot of Women’s Regular Fit Kurtas.

Similarly, Size XS of AND & Soch Regular Fit Kurtas are analysed on the cluster plot below (depicted by + symbol). The comparison shows that Size XS of the two brands lies in different clusters, again indicating that they are meant for customers of different body measurements.

Below is the cluster plot of Women’s Regular Fit Tees and Tops.

Again, two brands (Levi’s & Wills Lifestyle) are compared at the same size (Size XS) in Regular Fit Tees & Tops on the cluster plot below (depicted by + symbol). Levi’s Size XS (lies in Cluster XS) is observed to run one size smaller than Size XS of Wills Lifestyle (lies in Cluster S).


The clustering of size related brand data brings out an important insight. In all product categories for men’s and women’s clothing, the same size in two brands is not necessarily meant for customers with similar body measurements. A possible reason for this is the difference in brand identity and target customers. One brand may be targeting customers with a petite body structure and hence keeps the product measurements smaller while retaining the standard size labels, while another brand may be targeting customers with fuller bodies, so its product measurements run larger in the same standard size labels.

This phenomenon results in non-standardized size charts in the fashion industry, causing a lot of confusion for a normal shopper. It is also one of the major reasons why order return rates of online fashion retailers remain as high as 15-30%. Clearly, there’s a need for a simple technology solution which can understand the science behind sizes in fashion and offer a confident experience to online shoppers by eliminating guesswork.

Outlier Analysis: How to ensure the sanity of size related brand data

Fitrrati is a cloud based and data driven personalized Fit Technology enterprise solution for fashion brands and e-tailers. It aims to address the consumer issue of not knowing the right size (which fits well) while shopping online for fashion products like apparel, footwear etc.

The technology collects a lot of size related data from brands and retailers and crunches it along with consumer data to make very accurate size recommendations, along with fit details, to individual users, taking into account their individual fit preferences. Now, the performance of this technology depends a lot on the quality of data: ‘Garbage in’ gives ‘Garbage out’. Hence the need to ensure data sanity and detect aberrant values. Two different approaches are used for this, IQR analysis and Probability Density function, which are discussed in detail along with results in this paper.

The outliers detected by each approach are subjected to exhaustive quality checks (for correction), which include:

  1. Re-checking the original source of data
  2. Verifying the values with at least one more alternate reliable public source
  3. Contacting the brand or retailers to confirm the values
  4. Arranging a brand store visit to manually record the size related values

The two approaches are discussed below.

IQR Analysis

Interquartile range (IQR) is a measure of statistical dispersion, also called midspread or middle fifty. It is the range within which the middle 50% of the values reside. In terms of quantiles:

25% quantile means 25% of values lie below that point and 75% above it

75% quantile means 75% of values lie below that point and 25% above it

IQR value = 75% quantile value - 25% quantile value

Now, a value is flagged as an outlier if it is below ‘25% quantile value - a x IQR value’ (Min-Outlier) or above ‘75% quantile value + a x IQR value’ (Max-Outlier), where ‘a’ is a suitable factor of IQR which defines the outlier limits. Various approaches to determining the value of ‘a’ were tested (including Standard Deviation, 90% quantile, 75% quantile of a bigger size & 25% quantile of a smaller size); the value finally chosen is a = 1.5, which flags less than 20% of the total values as outliers.
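The Min-Outlier and Max-Outlier limits can be computed with Python's standard library as sketched below. Quantiles can be estimated in several slightly different ways and the article does not say which was used, so the "inclusive" method here is an assumption, as are the function names and the sample chest values.

```python
import statistics

def iqr_outlier_limits(values, a=1.5):
    """Return (Min-Outlier, Max-Outlier) limits for a list of measurements."""
    # Assumption: "inclusive" quantile estimation; other methods shift the limits slightly.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - a * iqr, q3 + a * iqr

def flag_outliers(values, a=1.5):
    """Values lying below the Min-Outlier limit or above the Max-Outlier limit."""
    low, high = iqr_outlier_limits(values, a)
    return [v for v in values if v < low or v > high]

# Hypothetical chest measurements (inches) for one size label, for illustration only.
chest_values = [39.0, 39.5, 40.0, 40.0, 40.5, 41.0, 41.0, 41.5, 48.0]
print(iqr_outlier_limits(chest_values))  # (38.5, 42.5)
print(flag_outliers(chest_values))       # [48.0]
```

Here the 25% and 75% quantiles are 40.0 and 41.0, so the IQR is 1.0 and only the 48.0 entry falls outside the limits.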

IQR analysis was conducted on all combinations of gender (male or female), product type (shirts, t-shirts etc.), fit type (regular, slim etc.) and measurement type (chest, shoulder etc.) for each size (S, M, L, 38, 40 etc.), and a box plot for each size in every combination was plotted to determine the outliers.

Here’s a box plot of Men’s Regular Fit Shirts (Chest Measurement) for all the size labels. In a box plot, the thick line inside the box represents the 50% quantile value, while the base and top of the box are the 25% quantile & 75% quantile values respectively. The whiskers extending from the bottom & top of the box represent the Min-Outlier & Max-Outlier values respectively.


Probability Density

In another approach, probability density is used to determine the chances of occurrence of a value in the multi-modal distribution of size related data. Every measurement value in a combination of gender, product type, fit type & measurement type for each size is associated with a probability density (PD) value. The PD value is derived from the frequency of occurrence of values: the lower the frequency, the lower the probability density value.

In order to detect outliers, a threshold of 0.05 is set. A measurement value whose PD value is less than the threshold is flagged as an outlier.
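The article does not state how the PD value is estimated (a smoothed kernel density estimate is equally plausible). As a simple stand-in, the sketch below assumes the PD value of a measurement is its relative frequency within the group and applies the 0.05 threshold to it. The function names and sample values are hypothetical.

```python
from collections import Counter

def density_by_value(values):
    """Relative frequency of each distinct value (an assumed stand-in for the PD value)."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def flag_low_density(values, threshold=0.05):
    """Distinct values whose relative frequency falls below the threshold."""
    density = density_by_value(values)
    return sorted(v for v, d in density.items() if d < threshold)

# Hypothetical chest measurements (inches) for one size label: 25 records in total.
chest_xs = [36.0] * 10 + [36.5] * 8 + [37.0] * 6 + [42.0]
print(density_by_value(chest_xs)[42.0])  # 0.04
print(flag_low_density(chest_xs))        # [42.0]
```

The single 42.0 record accounts for 1 of 25 values (0.04), which is below the 0.05 threshold, so it is flagged, while the frequently occurring values are not.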

Here’s a probability density plot of Men’s Regular Fit Shirts (Chest Measurement, Size XS).


IQR analysis flagged 19% of total size charts as outliers, of which 30% underwent correction during the quality check. In comparison, the Probability Density approach flagged 5% of total size charts as outliers, of which 60% underwent correction during the quality check.
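Multiplying the two percentages for each approach gives the share of all size charts that ended up being corrected. The short calculation below uses only the figures quoted above; the function name is introduced here for illustration.

```python
def share_corrected(flagged_share, corrected_share_of_flagged):
    """Share of ALL size charts that were both flagged and then corrected."""
    return flagged_share * corrected_share_of_flagged

iqr_corrected = share_corrected(0.19, 0.30)  # IQR analysis
pd_corrected = share_corrected(0.05, 0.60)   # Probability Density approach

print(f"IQR: {iqr_corrected:.1%} of all size charts corrected")  # 5.7%
print(f"PD:  {pd_corrected:.1%} of all size charts corrected")   # 3.0%
```

So IQR analysis led to more corrections overall (about 5.7% of all charts against 3%), while a chart flagged by the Probability Density approach was twice as likely to actually need correction (60% against 30%).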