Practical Data Science

A deep dive into handling data in more complex formats. This course covered the data analysis life cycle, from initial access and acquisition through transformation, integration, and querying to the application of statistical learning, modeling, and data mining methods.

  1. Star Wars - Explore and cleanse the given Star Wars dataset. Investigate 3 research areas:
    • Ranking of each Star Wars movie
    • Pairwise comparisons between
      • Fans of Star Wars and Fans of Star Trek
      • Fans of Star Wars and (Fans of / Familiarity with) the Expanded Universe
    • Demographic impact (Gender, Age, Income, Education or Region) on the favorability of Star Wars characters. Perform chi-square tests on the associations.
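
The chi-square test of association mentioned above can be sketched as follows. This is a minimal illustration rather than the assignment's actual code: the function name `chi_square_statistic` and the counts in `observed` are hypothetical, and only the test statistic and degrees of freedom are computed (no p-value).

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed_count - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows = two demographic groups,
# columns = favorable / unfavorable view of one character.
observed = [[30, 20], [20, 30]]
stat, dof = chi_square_statistic(observed)
print(stat, dof)
```

The statistic is then compared against the chi-square critical value for `dof` degrees of freedom (3.841 at the 5% level when `dof` is 1). A library routine such as `scipy.stats.chi2_contingency` also returns the p-value; note that it applies a continuity correction to 2x2 tables by default, so its statistic can differ from this uncorrected one.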

  2. RSSI - Identify signal measurement patterns in the given dataset and investigate whether clustering can help group these patterns into an organized structure.
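
As an illustration of the clustering idea, here is a minimal one-dimensional k-means sketch. The summary does not say which clustering algorithm was used, so k-means is an assumption, and the function name `kmeans_1d` and the RSSI readings are hypothetical.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster 1-D values with k-means; returns (centroids, clusters)."""
    ordered = sorted(values)
    # Deterministic start: take the middle element of each of k equal chunks
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Recompute each centroid as the mean of its cluster (keep it if empty)
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids, clusters


# Hypothetical RSSI readings in dBm: a strong-signal and a weak-signal group
rssi = [-42, -45, -44, -80, -78, -83]
centroids, clusters = kmeans_1d(rssi, k=2)
print(centroids, clusters)
```

Real RSSI data usually has one reading per access point, so each sample would be a vector rather than a single number; the same assign-then-update loop applies with a vector distance.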


Machine Learning

  1. Data Preparation and KNN - 2 Tasks are given in the assignment:
    • Perform the necessary data pre-processing steps on a given dataset from the UCI ML Repository so that the cleaned dataset can be fed directly into any classification algorithm in the Scikit-Learn Python module without further changes.
    • Given a dataset with the CPI and descriptive features of various countries, use the KNN algorithm to predict the CPI of Russia when only its descriptive features are provided.
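
The KNN prediction step can be sketched as below. This is a simplified illustration, not the assignment's solution: the function name `knn_predict`, the feature values, and the targets are all hypothetical, and the features are assumed to already be on comparable scales.

```python
import math


def knn_predict(train, query, k=3):
    """Predict a numeric target as the mean target of the k nearest rows.

    train: list of (feature_tuple, target) pairs
    query: feature tuple for the instance to predict
    """
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical (features, target) rows standing in for countries
train = [
    ((1.0, 2.0), 3.0),
    ((1.5, 1.8), 3.4),
    ((5.0, 8.0), 7.0),
    ((6.0, 9.0), 8.0),
    ((1.2, 2.1), 2.6),
]
prediction = knn_predict(train, query=(1.1, 2.0), k=3)
print(prediction)
```

Because Euclidean distance is dominated by features with large ranges, descriptive features are typically normalized before the nearest neighbors are computed.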


  2. Decision Tree - Using the given data as the entire training dataset, demonstrate an understanding of the decision tree algorithm by:
    • Calculating the impurity of the target feature
    • Determining the optimal root node of the decision tree
    • Determining the leaf predictions if a specific descriptive feature is assumed to be the root node
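
The impurity and root-node calculations can be sketched as follows. The summary does not name the impurity measure, so entropy with information gain is an assumption here (Gini impurity would follow the same pattern), and the toy rows and feature names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(rows, feature, target):
    """Reduction in target entropy obtained by splitting on `feature`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_root(rows, features, target):
    """Pick the feature with the highest information gain as the root."""
    return max(features, key=lambda f: information_gain(rows, f, target))


# Hypothetical training rows
rows = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]
root = best_root(rows, ["outlook", "windy"], "play")
print(entropy([r["play"] for r in rows]), root)
```

Once a root feature is fixed, the prediction at each leaf is the majority target label among the rows that fall into that branch.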


  3. Naive Bayes - Treating the given data as the entire training dataset, use the built-in sklearn Naive Bayes module in Python to perform the following tasks:
    • Transform all descriptive features into binary features, then train and search for the best Bernoulli NB model by tuning the alpha parameter. Score the best Bernoulli NB model by accuracy.
    • Transform all descriptive features into binary features, then train and search for the best Gaussian NB model by tuning the var_smoothing parameter. Score the best Gaussian NB model by accuracy.
    • Retain the original format of the dataset, train a Bernoulli NB model on the categorical features and a Gaussian NB model on the numeric features, then combine the results into a hybrid NB classifier. Score the hybrid NB classifier by accuracy.
    • List all the obtained scores in a dataframe
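
To show what tuning the alpha parameter does, here is a small from-scratch Bernoulli Naive Bayes sketch that tries several alpha values and scores each by accuracy. It deliberately avoids sklearn so the smoothing is visible; the helper names, the toy data, and the alpha grid are hypothetical. Scoring is done on the training data, in line with treating the given data as the entire training dataset.

```python
import math


def fit_bernoulli_nb(X, y, alpha):
    """Estimate class priors and smoothed P(feature = 1 | class)."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        n = len(rows)
        # Additive smoothing: alpha is added to both outcomes (0 and 1)
        probs = [(sum(col) + alpha) / (n + 2 * alpha) for col in zip(*rows)]
        model[c] = (n / len(y), probs)
    return model


def predict(model, x):
    """Return the class with the highest log posterior for binary vector x."""
    def log_posterior(c):
        prior, probs = model[c]
        return math.log(prior) + sum(
            math.log(p if xi else 1 - p) for xi, p in zip(x, probs)
        )
    return max(model, key=log_posterior)


def accuracy(model, X, y):
    return sum(predict(model, x) == label for x, label in zip(X, y)) / len(y)


# Hypothetical binary features and labels
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y = [1, 1, 0, 0, 1, 0]

# alpha must be > 0 here, otherwise a zero probability breaks the logarithm
scores = {a: accuracy(fit_bernoulli_nb(X, y, a), X, y) for a in [0.1, 0.5, 1.0, 2.0]}
best_alpha = max(scores, key=scores.get)
print(scores, best_alpha)
```

With sklearn, the same search would typically be run over the `alpha` parameter of `BernoulliNB` (and `var_smoothing` of `GaussianNB`), and the resulting scores collected into a dataframe.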


  4. Model Evaluation - Given a score file for binary prediction, perform the following tasks:
    • Construct the confusion matrix (default threshold=0.5)
    • Calculate 5 metrics (default threshold=0.5):
      • Error Rate
      • Precision
      • TPR
      • F1 Score
      • FPR
    • Vary the score threshold from 0.1 to 0.9 (both inclusive) in 0.1 increments, compute the TPR and FPR values at each threshold, and plot the ROC curve.
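
The evaluation steps above can be sketched as follows. The labels and scores are hypothetical stand-ins for the score file, and treating a score greater than or equal to the threshold as a positive prediction is an assumption, since the summary does not state how ties are handled.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Return (tp, fp, tn, fn), predicting positive when score >= threshold."""
    tp = fp = tn = fn = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Compute the five metrics, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    error_rate = (fp + fn) / (tp + fp + tn + fn)
    return {"error_rate": error_rate, "precision": precision,
            "tpr": tpr, "f1": f1, "fpr": fpr}


def roc_points(y_true, scores):
    """Return (threshold, fpr, tpr) for thresholds 0.1, 0.2, ..., 0.9."""
    points = []
    for i in range(1, 10):
        threshold = i / 10
        m = metrics(*confusion_counts(y_true, scores, threshold))
        points.append((threshold, m["fpr"], m["tpr"]))
    return points


# Hypothetical true labels and model scores
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.7, 0.1]

counts = confusion_counts(y_true, scores)
summary = metrics(*counts)
roc = roc_points(y_true, scores)
print(counts, summary, roc)
```

Plotting the FPR values on the x-axis against the TPR values on the y-axis (for example with matplotlib) gives the ROC curve.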


  5. Final Project - Speed Dating* - Using a speed dating dataset we chose from a provided GitHub site, perform data pre-processing, visualize and explore the data, then split it into training and testing portions. Use grid search and hyperparameter tuning (with feature selection) on the training data to select the best-tuned model for each of 5 machine learning algorithms. Then draw conclusions based on comparing the prediction performance of these best-tuned models on the testing data.



N.B. * indicates open-ended assignment