# CS 4435 Data Mining: eight measurements of each utility

Fixed = fixed-charge covering ratio (income/debt)

RoR = rate of return on capital

Cost = cost per kilowatt capacity in place

Load = annual load factor

Demand = peak kilowatt-hour demand growth from 1974 to 1975

Sales = sales (kilowatt-hour use per year)

Nuclear = percent nuclear

Fuel Cost = total fuel costs (cents per kilowatt-hour)

Please load the data as a pandas DataFrame and set the row names (index) to the utilities column (company). Convert all columns to float.
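The loading step could look roughly like the sketch below. The data file is not named in this handout, so the example reads a small in-memory CSV with made-up placeholder values. The column name `Company` and the three feature columns are assumptions for illustration only.

```python
import io
import pandas as pd

# Placeholder CSV text with made-up values (NOT the real utilities data).
# The header names, including "Company", are assumptions.
csv_text = """Company,Fixed_charge,RoR,Cost
Alpha,1.06,9.2,151
Beta,0.89,10.3,202
Gamma,1.43,15.4,113
"""

# For the assignment, pass the path of the actual data file to read_csv
# instead of the StringIO buffer.
df = pd.read_csv(io.StringIO(csv_text))
df = df.set_index("Company")  # row names (index) = utility names
df = df.astype(float)         # convert every measurement column to float

print(df.dtypes)
```

Setting the index first keeps the company names out of the `astype(float)` conversion, which would otherwise fail on the text column.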

1) a. Use “from sklearn.metrics import pairwise” to calculate the Euclidean distance between each pair of utilities and show the distance matrix.

b. Standardize each feature by subtracting its mean and dividing by its standard deviation, then recalculate the pairwise Euclidean distance matrix.
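A minimal sketch of both parts of question 1, assuming a tiny DataFrame of made-up placeholder values in place of the real data. The helper `distance_frame` is a hypothetical name introduced here. Note that pandas `.std()` uses the sample standard deviation (`ddof=1`); whether the assignment expects that or the population version is not stated.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise

# Made-up placeholder rows (NOT the real utilities data).
df = pd.DataFrame(
    {
        "Sales": [9077.0, 5088.0, 9212.0, 6423.0],
        "Fuel_Cost": [0.628, 1.555, 1.058, 0.700],
    },
    index=["Alpha", "Beta", "Gamma", "Delta"],
)


def distance_frame(frame):
    """Return the Euclidean distance matrix as a labeled DataFrame."""
    d = pairwise.pairwise_distances(frame, metric="euclidean")
    return pd.DataFrame(d, index=frame.index, columns=frame.index)


# Part a: raw features. The large-scale column dominates the distances.
dist_raw = distance_frame(df)

# Part b: z-score each column, then recompute the distances.
# pandas .std() defaults to ddof=1 (sample standard deviation).
df_norm = (df - df.mean()) / df.std()
dist_norm = distance_frame(df_norm)

print(dist_raw.round(2))
print(dist_norm.round(2))
```

Comparing the two matrices shows why standardization matters: before it, the distance is driven almost entirely by whichever feature has the largest numeric range.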

2) a. Use “from scipy.cluster.hierarchy import linkage” and plot the dendrogram using single linkage.

c. Use “from scipy.cluster.hierarchy import fcluster” and apply it to the linkage results for both single and average linkage to separate the data points into 6 clusters, then print the clusters with their corresponding members. (Set criterion='maxclust' for fcluster.)
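One way the hierarchical clustering steps of question 2 might be sketched is shown below. The feature matrix is random placeholder data, and the names `Utility_0` to `Utility_9` are hypothetical. In the assignment, the (standardized) utilities DataFrame would presumably take its place.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Random placeholder features (NOT the real utilities data).
rng = np.random.default_rng(0)
names = [f"Utility_{i}" for i in range(10)]
X = rng.normal(size=(10, 3))

# Build the merge hierarchies for both linkage methods.
Z_single = linkage(X, method="single", metric="euclidean")
Z_average = linkage(X, method="average", metric="euclidean")

# Dendrogram for single linkage.
fig, ax = plt.subplots(figsize=(8, 4))
tree = dendrogram(Z_single, labels=names, ax=ax, leaf_rotation=90)
ax.set_title("Single linkage")
ax.set_ylabel("Euclidean distance")
fig.tight_layout()  # call plt.show() in an interactive session

# Cut each hierarchy into at most 6 flat clusters and list the members.
clusters = {}
for method, Z in [("single", Z_single), ("average", Z_average)]:
    ids = fcluster(Z, t=6, criterion="maxclust")
    members = {}
    for name, cid in zip(names, ids):
        members.setdefault(int(cid), []).append(name)
    clusters[method] = members
    print(method)
    for cid in sorted(members):
        print(f"  cluster {cid}: {members[cid]}")
```

Note that `fcluster` takes the linkage matrix, not the plotted dendrogram, and that `criterion='maxclust'` treats `t=6` as an upper bound on the number of clusters.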

3) a. Use “from sklearn.cluster import KMeans” to cluster the data into 6 clusters. Set the KMeans random state to 0. Print the clusters and their members.

b. For 1 to 7 clusters, plot the average SSE against the number of clusters as a line plot. Use the “inertia_” attribute of KMeans to get the SSE, and divide it by the number of clusters to get the average SSE.
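The two k-means tasks in question 3 could be sketched as follows, again on random placeholder data with hypothetical utility names. The `n_init=10` argument is an assumption added so the result does not depend on the scikit-learn version's default. The handout itself only fixes the random state.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Random placeholder features (NOT the real utilities data).
rng = np.random.default_rng(0)
names = [f"Utility_{i}" for i in range(12)]
X = rng.normal(size=(12, 3))

# Part a: six clusters with a fixed random state, then list the members.
km = KMeans(n_clusters=6, random_state=0, n_init=10).fit(X)
members = {}
for name, cid in zip(names, km.labels_):
    members.setdefault(int(cid), []).append(name)
for cid in sorted(members):
    print(f"cluster {cid}: {members[cid]}")

# Part b: average SSE (inertia_ divided by k) for k = 1..7.
ks = list(range(1, 8))
avg_sse = []
for k in ks:
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    avg_sse.append(model.inertia_ / k)

fig, ax = plt.subplots()
ax.plot(ks, avg_sse, marker="o")
ax.set_xlabel("Number of clusters")
ax.set_ylabel("Average SSE (inertia_ / k)")
fig.tight_layout()  # call plt.show() in an interactive session
```

`inertia_` is the total within-cluster sum of squared distances to the nearest centroid, so for k = 1 it equals the total sum of squares around the overall mean.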