Data Analysis/Visualizations of Diamond Prices
As the saying goes, a diamond is forever, so you wouldn't want to make a rash decision when buying (or selling) one. This is where I come in. 😉
In this article, I'll be working with a dataset on the prices and attributes of approximately 54,000 round-cut diamonds. I'll go through the steps of an exploratory data visualization, starting with univariate visualizations, then bivariate visualizations and finally multivariate visualizations. The variables of interest are:
- price: Price in dollars. Data were collected in 2008.
- carat: Diamond weight. 1 carat is equal to 0.2 grams.
- cut: Quality of diamond cut, affects its shine. Grades go from (low) Fair, Good, Very Good, Premium, Ideal (best).
- color: Measure of diamond coloration. Increasing grades go from (some color) J, I, H, G, F, E, D (colorless).
- clarity: Measure of diamond inclusions. Increasing grades go from (inclusions) I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF (internally flawless).
In short, I will help you make better decisions.
The dataset is provided courtesy of Udacity and alx_africa. (I was granted a scholarship to study data analytics.)
To begin, I imported the appropriate libraries (NumPy, pandas, Matplotlib and seaborn) and inspected the structure of the data. Then we convert cut, color, and clarity into ordered categorical types, in order of increasing quality.
ordinal_var_dict = {'cut': ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'],
                    'color': ['J', 'I', 'H', 'G', 'F', 'E', 'D'],
                    'clarity': ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']}

for var in ordinal_var_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered=True, categories=ordinal_var_dict[var])
    diamonds[var] = diamonds[var].astype(ordered_var)
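To see what the ordered type buys us, here is a minimal sketch on a tiny hand-made Series. The values and the names `toy_cut`, `cut_levels` and `cut_dtype` are made up for illustration and are not taken from the dataset. With an ordered dtype, pandas sorts and compares grades by quality rather than alphabetically.

```python
import pandas as pd

# Cut grades ordered from worst to best, as listed in the article
cut_levels = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
cut_dtype = pd.api.types.CategoricalDtype(categories=cut_levels, ordered=True)

# Toy data, made up for illustration only
toy_cut = pd.Series(['Ideal', 'Fair', 'Premium', 'Good']).astype(cut_dtype)

# Sorting follows the quality order instead of alphabetical order
sorted_cuts = toy_cut.sort_values().tolist()

# Ordered categories can be compared against a grade
at_least_premium = (toy_cut >= 'Premium').tolist()

print(sorted_cuts)
print(at_least_premium)
```

The same ordering is what lets the plots further down show the grades from worst to best on the x-axis.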
Univariate Exploration
Let’s start our exploration by looking at the main variable of interest: price. Is the distribution skewed or symmetric? Is it unimodal or multimodal?
bins = np.arange(300, diamonds['price'].max() + 300, 300)
plt.figure(figsize=[10, 5])
plt.hist(data=diamonds, x='price', bins=bins)
plt.xlabel('Price ($)');
The distribution is right-skewed and looks bimodal, with peaks below 1k and just below 5k. To see the two modes more clearly, I apply a log transform to the price variable. The main reason, though, has to do with modeling: price spans almost two orders of magnitude (from roughly $300 to almost $20k), so if there is a linear model in mind, it may be difficult for it to capture small differences at the low end of the scale and large differences at the high end at the same time.
# there's a long tail in the distribution, so let's put it on a log scale instead
bins = 10 ** np.arange(2.5, np.log10(diamonds['price'].max()) + 0.05, 0.05)
plt.figure(figsize=[8, 5])
plt.hist(data=diamonds, x='price', bins=bins)
plt.xscale('log')
plt.xticks([300, 1000, 3000, 10000, 30000], [300, '1k', '3k', '10k', '30k'])
plt.xlabel('Price ($) (log scale)');
We can now see more clearly where the two peaks lie.
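If the bin construction for the log-scaled histogram looks opaque, the sketch below unpacks it with NumPy alone. The value of `max_price` is a stand-in I picked for illustration, not the actual maximum in the dataset. The idea is that the edges are evenly spaced in log10 units, so on the raw scale each bin is wider than the previous one by a constant factor.

```python
import numpy as np

# Hypothetical maximum price, for illustration only
max_price = 20000

# Edges equally spaced in log10 units (step 0.05), starting at 10**2.5
log_edges = np.arange(2.5, np.log10(max_price) + 0.05, 0.05)
bins = 10 ** log_edges

# Ratio between consecutive edges on the raw (dollar) scale
ratios = bins[1:] / bins[:-1]

print(bins[:3])                         # first few edges in dollars
print(np.allclose(ratios, 10 ** 0.05))  # every bin grows by the same factor
```

Because the bins grow multiplicatively, they look equally wide once the x-axis is put on a log scale.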
Next, create a plot of our first diamond ‘C’ metric: carat weight. Is there an interesting pattern in these values?
# univariate plot of carat weights on a smaller scale
bins = np.arange(0.2, diamonds['carat'].max() + 0.01, 0.01)
plt.figure(figsize=[8, 5])
plt.hist(diamonds['carat'], bins=bins)
plt.xlabel('Carat')
plt.xlim(0, 2);
The full plot extends further than this, but I zoomed in on weights below 2 carats to find the reason for the intermittent spikes in the data. The spikes sit at particular carat weights (e.g. 0.3, 0.5, 0.7, 1.0 and 1.5). A reasonable conclusion is that these are the standard weights in the market, and that smaller diamonds are more common than larger ones.
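One way to check the spikes without squinting at the histogram is to count how often each weight occurs. The sketch below does that on a handful of made-up weights (the Series `toy_carat` is hypothetical, not an extract of the dataset); on the real data the same call would be applied to the carat column.

```python
import pandas as pd

# Toy carat weights, made up for illustration only
toy_carat = pd.Series([0.30, 0.31, 0.30, 0.50, 0.49, 0.50,
                       0.50, 1.00, 1.01, 0.30, 0.70, 0.30])

# Count each weight (rounded to 0.01 carat), most frequent first
counts = toy_carat.round(2).value_counts()

# The most common weights are the candidates for "standard" sizes
print(counts.head(2))
```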
Now, let’s move on to exploring the other three ‘C’ quality measures: cut, color, and clarity. For each of these measures, does the data we have tend to be higher on the quality scale, or lower?
fig, ax = plt.subplots(nrows=3, figsize=[8, 9])
color = sns.color_palette()[0]
sns.countplot(data=diamonds, x='cut', color=color, ax=ax[0])
sns.countplot(data=diamonds, x='color', color=color, ax=ax[1])
sns.countplot(data=diamonds, x='clarity', color=color, ax=ax[2])
We can see that there are more diamonds with higher-quality cuts, while for color and clarity the number of diamonds rises and then falls as quality increases, so the middle grades are the most common.
Bivariate Exploration
We have looked at the univariate distribution of five features in the diamonds dataset: price, carat, cut, color, and clarity. Now, we'll investigate relationships between pairs of these variables, particularly how each of them relates to diamond price.
To start, let’s construct a plot of the price against carat weight. What kind of shape does the relationship between these variables take?
# bivariate plot of price vs. carat
plt.scatter(data=diamonds, x='carat', y='price', alpha=0.1)
plt.xlabel('Carat')
plt.ylabel('Price ($)');
Oops, we forgot to log-transform the price axis. The carat axis gets a transformation too, a cube root, because carat is a weight and so scales with volume, which is roughly the product of the diamond's x, y and z dimensions. One more thing to notice here is that prices appear to be capped: there is an upper limit on the maximum price in this dataset.
# functions for the log transform and the cube-root transform, respectively
def log_trans(x, inverse=False):
    if not inverse:
        return np.log10(x)
    else:
        return np.power(10, x)

def cbe_root_trans(x, inverse=False):
    if not inverse:
        return np.cbrt(x)
    else:
        return x**3

# scatter plot of price vs. carat, with a log transform on the price axis
# and a cube-root transform on carat
plt.figure(figsize=[8, 6])
plt.scatter(x=diamonds['carat'].apply(cbe_root_trans), y=diamonds['price'], alpha=1/10)
plt.xlabel('Carat (cube-root scale)')
plt.yscale('log')
plt.yticks([300, 1e3, 3e3, 1e4, 3e4], [300, '1k', '3k', '10k', '30k'])
plt.ylabel('Price ($)');
The equivalent heat map of price against carat can be drawn with:
# heat map of log price vs. cube-root carat, matching the scatter plot's transforms
plt.hist2d(x=diamonds['carat'].apply(cbe_root_trans), y=diamonds['price'].apply(log_trans), cmin=0.5, cmap='viridis_r')
plt.colorbar()
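The two transform helpers each take an inverse flag, and a quick round trip is a cheap way to confirm they undo each other. The sketch below restates them compactly and runs them on a few illustrative numbers; the arrays `prices` and `carats` are made up, not rows from the dataset. The forward transform is also what you would use to position axis ticks that are then labeled with the raw carat values.

```python
import numpy as np

def log_trans(x, inverse=False):
    """log10 transform, or its inverse (10**x)."""
    return np.power(10.0, x) if inverse else np.log10(x)

def cbe_root_trans(x, inverse=False):
    """Cube-root transform, or its inverse (x**3)."""
    return np.power(x, 3) if inverse else np.cbrt(x)

# Illustrative values only
prices = np.array([300.0, 1000.0, 10000.0])
carats = np.array([0.125, 1.0, 8.0])

# Applying a transform and then its inverse should return the input
price_round_trip = log_trans(log_trans(prices), inverse=True)
carat_round_trip = cbe_root_trans(cbe_root_trans(carats), inverse=True)

# Positions on a cube-root axis where ticks for the raw carat values would go
tick_positions = cbe_root_trans(carats)

print(np.allclose(price_round_trip, prices))
print(np.allclose(carat_round_trip, carats))
print(tick_positions)
```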
Now let’s take a look at the relationship between price, carat and the three categorical quality features, cut, color, and clarity. Are there any surprising trends to be seen here?
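def boxgrid(x, y, **kwargs):
    # draw every box plot in the same default color
    default_color = sns.color_palette()[0]
    sns.boxplot(x=x, y=y, color=default_color)

# PairGrid creates its own figure, sized through height and aspect
g = sns.PairGrid(data=diamonds, y_vars=['price', 'carat'], x_vars=['cut', 'color', 'clarity'], height=3, aspect=1.5)
g.map(boxgrid)
plt.show();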
Looking at the first row of plots, the median price tends to fall as the quality of the diamonds increases. That is very surprising, because we would normally associate a higher price with higher quality. Why is it different here?
Look at the second row: carat against cut, color and clarity. As quality increases, the median carat weight falls. That is, the higher-quality diamonds in this dataset tend to be smaller.
That is why the best-quality diamonds are less pricey: they are smaller than the lower-grade diamonds.
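To make that reasoning concrete, here is a small sketch on six made-up rows. The DataFrame `toy` and its numbers are hypothetical and chosen only to show the mechanism; they are not results from the dataset. Overall, the better cut looks cheaper because most of its stones are small, but once carat is held fixed the better cut is the more expensive one.

```python
import pandas as pd

# Toy rows, made up for illustration only (not from the dataset)
toy = pd.DataFrame({
    'cut':   ['Fair', 'Fair', 'Fair', 'Ideal', 'Ideal', 'Ideal'],
    'carat': [1.0, 1.0, 0.5, 0.5, 0.5, 1.0],
    'price': [4000, 4200, 1000, 1300, 1400, 5500],
})

# Ignoring size: the Ideal stones have the lower median price
overall_median = toy.groupby('cut')['price'].median()

# Holding carat fixed: Ideal is pricier than Fair at each size
median_by_size = toy.groupby(['carat', 'cut'])['price'].median()

print(overall_median)
print(median_by_size)
```

Running the same grouped comparison on the real data, for example within narrow carat bands, would be one way to check whether quality pushes the price up once size is accounted for.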
We could infer from this that carat weight has more influence on the price of a diamond than its quality grades do. Keep that in mind when buying your next rock!