Data Science



Predictive Clustering for Big Data

Prediction is the process of estimating results for unseen data. In clustering, the data points themselves determine how the clusters form, and each cluster reveals the conditions under which records settle into the same group. How this data is stored largely determines how well unstructured formats can be handled.


The main objective of this classification is a categorical split for functional medicine. The input is a large patient dataset of 30 million records and 447 categories, with the corresponding imbalances, symptoms, reviews and surveys. A diagnosis is based on the imbalances, symptoms and surveys reported by the patients. This data is encoded and clustered. After clustering, each patient's values for these factors determine the cluster the patient belongs to, which paves the way for predictive analytics.

Key Challenges

The key challenge observed during data pre-processing was that the data contained a lot of noise, missing values and erroneous entries. Additionally, many different data types were present, which prevented the data from fitting into the grouping variable.

Our Approach

First, the large dataset was loaded into Spark. The pre-processing steps removed missing data, non-categorical data and noisy data. Ordinal encoding and one-hot encoding were used for the categorization. For the centroid calculation, we used the Inuit and Huang algorithms. To determine the clusters, K-means, K-modes, hierarchical clustering, DBSCAN and OPTICS were used.
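The cleaning and encoding steps above can be sketched in a few lines. This is a minimal, pure-Python illustration and not the Spark pipeline used in the project; the field names, the sample records and the helper functions (drop_incomplete, ordinal_encode, one_hot_encode) are all hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def drop_incomplete(records: List[Record]) -> List[Record]:
    """Keep only records where every field is present and non-empty."""
    return [
        r for r in records
        if all(v is not None and v.strip() != "" for v in r.values())
    ]


def ordinal_encode(values: List[str], order: List[str]) -> List[int]:
    """Map each value to its rank in an explicitly ordered list of levels."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


def one_hot_encode(values: List[str]) -> List[List[int]]:
    """Map each value to a 0/1 indicator vector over the sorted distinct levels."""
    levels = sorted(set(values))
    return [[1 if v == level else 0 for level in levels] for v in values]


# Hypothetical patient records; names and values are illustrative only.
raw = [
    {"severity": "mild", "symptom": "fatigue"},
    {"severity": "severe", "symptom": "headache"},
    {"severity": None, "symptom": "fatigue"},    # missing value: dropped
    {"severity": "moderate", "symptom": ""},     # empty value: dropped
    {"severity": "moderate", "symptom": "fatigue"},
]

clean = drop_incomplete(raw)
severity_codes = ordinal_encode(
    [r["severity"] for r in clean], order=["mild", "moderate", "severe"]
)
symptom_vectors = one_hot_encode([r["symptom"] for r in clean])
```

Ordinal encoding fits fields with a natural order (such as severity), while one-hot encoding avoids implying an order between unrelated categories (such as symptoms).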

Our Solution

Pre-processing leaves a valid set of data for the learning process. In the next stage, the encoding process converts this data into categorical features. The encoded data is then passed to the clustering algorithms, which form and fit clusters based on the data values. Of the algorithms tried, K-modes clustering was the best suited to big data processing.
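To make the clustering step concrete, here is a minimal sketch of k-modes-style clustering for categorical records. It assumes Hamming distance (the number of mismatched attributes) and seeds the modes with the first k distinct records for determinism; that seeding is a simplification chosen for this example, not the initialization used in the project, and the sample rows are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Row = Tuple[str, ...]


def hamming(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest(row: Row, modes: List[Row]) -> int:
    """Index of the closest mode; ties go to the lowest index."""
    return min(range(len(modes)), key=lambda j: hamming(row, modes[j]))


def column_modes(rows: List[Row]) -> Row:
    """Most frequent value in each column of a non-empty list of rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(rows: List[Row], k: int, max_iter: int = 20):
    # Assumption: seed with the first k distinct records (simplified init).
    modes = list(dict.fromkeys(rows))[:k]
    if len(modes) < k:
        raise ValueError("need at least k distinct records")
    for _ in range(max_iter):
        labels = [nearest(r, modes) for r in rows]
        new_modes = []
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            # An empty cluster keeps its previous mode.
            new_modes.append(column_modes(members) if members else modes[j])
        if new_modes == modes:  # converged: modes no longer change
            break
        modes = new_modes
    labels = [nearest(r, modes) for r in rows]
    return modes, labels


# Hypothetical encoded records: (symptom, energy level, sleeps well).
rows = [
    ("fatigue", "low", "yes"),
    ("headache", "high", "no"),
    ("fatigue", "low", "no"),
    ("headache", "high", "yes"),
    ("fatigue", "low", "yes"),
    ("headache", "high", "no"),
]
modes, labels = k_modes(rows, k=2)
```

Unlike K-means, the cluster centre here is a per-attribute mode rather than a mean, which is why this family of algorithms applies directly to categorical features.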

Technology Used

  • Big data – combination of Hadoop and PySpark
  • Machine Learning
  • Unsupervised Learning
  • K-modes
  • Data Analytics – MinMax scaler


The results showed that K-modes performed well and ran efficiently. It also produced a small number of well-defined clusters that represent the underlying data. When new data is observed, the cluster it belongs to can be predicted.
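Predicting the cluster of a new record amounts to finding the nearest cluster centre. The sketch below assumes the centres are categorical modes compared by Hamming distance; the modes, the new record and the predict_cluster helper are hypothetical and only illustrate the idea.

```python
from typing import List, Sequence, Tuple


def hamming(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_cluster(
    record: Sequence[str], modes: List[Sequence[str]]
) -> Tuple[int, int]:
    """Return (index of the nearest mode, distance to it)."""
    distances = [hamming(record, m) for m in modes]
    best = min(range(len(modes)), key=distances.__getitem__)
    return best, distances[best]


# Hypothetical modes from an earlier clustering run.
modes = [
    ("fatigue", "low", "yes"),
    ("headache", "high", "no"),
]
new_patient = ("headache", "low", "no")
cluster, distance = predict_cluster(new_patient, modes)
```

The returned distance gives a rough indication of how typical the new record is for its cluster: a large value suggests the record fits none of the existing clusters well.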
