Python Repository Overview

Practical Data Science

Dive into handling data in more complex formats. This course covered the data analysis life cycle, from initial access and acquisition through transformation, integration, querying, statistical learning, and modeling, to data mining methods.

Star Wars - Explore and cleanse the given Star Wars dataset. Investigate 3 research areas:
- Ranking of each Star Wars movie
- Pairwise comparisons between:
  - Fans of Star Wars and fans of Star Trek
  - Fans of Star Wars and those who are fans of, or familiar with, the Expanded Universe
- Demographic impact (gender, age, income, education, or region) on the favorability of Star Wars characters. Perform a chi-square test on the associations.
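
As a rough illustration of the chi-square step, the sketch below tests whether one demographic variable is associated with favorability ratings. The counts are hypothetical placeholders, not values from the survey, and `scipy.stats.chi2_contingency` is assumed as the test implementation; the assignment itself may have used a different routine.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts (NOT from the real survey): rows are gender,
# columns are favorable / neutral / unfavorable ratings of one character.
observed = [
    [45, 20, 15],  # female respondents
    [60, 10, 30],  # male respondents
]

# Returns the test statistic, p-value, degrees of freedom,
# and the expected counts under independence.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
if p_value < 0.05:
    print("Reject independence at the 5% level")
else:
    print("No evidence of association at the 5% level")
```

In practice the contingency table would likely be built from the cleaned survey responses, for example with a cross-tabulation of two columns.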
RSSI - Identify signal measurement patterns in the given dataset and investigate whether clustering can help group these patterns into an organized structure.
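
One way to probe whether clustering helps is sketched below. The overview does not say which algorithm was used, so k-means and the silhouette score are assumptions here, and the RSSI readings are synthetic values generated only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic RSSI readings in dBm from 3 hypothetical access points,
# sampled at two made-up locations.
near = rng.normal(loc=[-40, -70, -80], scale=2.0, size=(50, 3))
far = rng.normal(loc=[-80, -50, -45], scale=2.0, size=(50, 3))
X = np.vstack([near, far])

# Try several cluster counts and keep the silhouette score of each.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"best k by silhouette: {best_k}")
```

A silhouette score close to 1 suggests compact, well-separated groups; scores near 0 suggest the patterns do not cluster cleanly.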

Machine Learning

Data Preparation and KNN - Two tasks are given in the assignment:
- Perform the necessary data pre-processing steps on a given dataset from the UCI ML Repository so that the cleaned dataset can be fed directly into any classification algorithm in the Scikit-Learn Python module without further changes.
- Given a dataset with the CPI and descriptive features of various countries, use the KNN algorithm to predict the CPI of Russia when only its descriptive features are provided.
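
The KNN prediction task can be sketched roughly as follows. The feature columns, all numeric values, and the choice of `n_neighbors=3` with distance weighting are hypothetical; scaling the features first is also an assumption, made because KNN distances are sensitive to feature ranges.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [life expectancy, mean school years, Gini index].
X_train = np.array([
    [81.0, 12.5, 27.0],
    [79.5, 12.0, 33.0],
    [68.0, 7.5, 45.0],
    [65.5, 6.0, 48.0],
    [74.0, 9.5, 38.0],
])
cpi_train = np.array([8.8, 7.5, 3.1, 2.4, 4.6])  # made-up CPI scores

# Scale features, then average the CPI of the 3 nearest countries,
# weighting closer neighbours more heavily.
model = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=3, weights="distance"),
)
model.fit(X_train, cpi_train)

# Made-up descriptive features for the country whose CPI is unknown.
query = np.array([[69.0, 12.0, 40.0]])
predicted_cpi = model.predict(query)[0]
print(f"predicted CPI: {predicted_cpi:.2f}")
```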

Decision Tree - Using the given data as the entire training dataset, demonstrate an understanding of the decision tree algorithm by:
- Calculating the impurity of the target feature
- Determining the optimal root node of the decision tree
- Determining the leaf predictions if a specific descriptive feature is assumed to be the root node
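
The three steps above can be sketched in plain Python. The overview does not name the impurity measure, so Shannon entropy with information gain is assumed here (Gini impurity would slot in the same way), and the toy feature names and values are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the target minus the weighted entropy after splitting."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def best_root(features, labels):
    """Name of the feature with the highest information gain."""
    return max(features, key=lambda name: information_gain(features[name], labels))


def leaf_predictions(feature_values, labels):
    """Majority class for each value of the feature chosen as root."""
    leaves = {}
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        leaves[value] = Counter(subset).most_common(1)[0][0]
    return leaves


# Toy data, invented for illustration only.
target = ["yes", "yes", "no", "no"]
features = {
    "outlook": ["sunny", "sunny", "rain", "rain"],
    "windy": ["t", "f", "t", "f"],
}

root = best_root(features, target)
print(entropy(target), root, leaf_predictions(features[root], target))
```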

Naive Bayes - Treating the given data as the entire training dataset, use the built-in sklearn naive Bayes module in Python to perform the following tasks:
- Transform all descriptive features into binary features, then train and search for the best Bernoulli NB model by fine-tuning the alpha parameter. Score the best Bernoulli NB model by accuracy.
- Transform all descriptive features into binary features, then train and search for the best Gaussian NB model by fine-tuning the var_smoothing parameter. Score the best Gaussian NB model by accuracy.
- Retain the original format of the dataset, fit a Bernoulli NB model on the categorical features and a Gaussian NB model on the numeric features, then combine the results into a hybrid NB classifier. Score the hybrid NB classifier by accuracy.
- List all the obtained scores in a dataframe.
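
A minimal sketch of the first two tuning tasks is shown below. The binary data is synthetic, the parameter grids are arbitrary choices, and the use of `GridSearchCV` with 5-fold cross-validation is an assumption; the assignment may instead have scored models directly on the training data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(0)

# Synthetic binary features; the label depends on the first two columns.
X = rng.integers(0, 2, size=(200, 5))
y = X[:, 0] | X[:, 1]

# One grid search per model, each tuning the parameter named in the task.
searches = {
    "BernoulliNB": GridSearchCV(
        BernoulliNB(),
        {"alpha": [0.01, 0.1, 0.5, 1.0, 2.0]},
        scoring="accuracy",
        cv=5,
    ),
    "GaussianNB": GridSearchCV(
        GaussianNB(),
        {"var_smoothing": np.logspace(-9, 0, 10)},
        scoring="accuracy",
        cv=5,
    ),
}

# Collect (best parameters, best mean cross-validated accuracy) per model.
results = {}
for name, search in searches.items():
    search.fit(X, y)
    results[name] = (search.best_params_, search.best_score_)

for name, (params, score) in results.items():
    print(f"{name}: {params} -> accuracy {score:.3f}")
```

The collected `results` dictionary could then be turned into a dataframe for the final listing step.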

Model Evaluation - Given a score file for binary prediction, perform the following tasks:
- Construct a confusion matrix (default threshold = 0.5)
- Calculate 5 metrics (default threshold = 0.5):
  - Error rate
  - Precision
  - TPR
  - F1 score
  - FPR
- Vary the score threshold from 0.1 to 0.9 (both inclusive) in 0.1 increments, compute the TPR and FPR values at each threshold, and plot the ROC curve.
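
The evaluation steps above can be sketched without any libraries. The labels and scores are invented, and treating a score greater than or equal to the threshold as a positive prediction is an assumption, since the overview does not state the tie rule. Plotting is left out; the `roc_points` list holds the (FPR, TPR) pairs that would be drawn.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Return (tp, fp, tn, fn), predicting positive when score >= threshold."""
    tp = fp = tn = fn = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def metrics(tp, fp, tn, fn):
    """Compute the five metrics from the confusion-matrix counts."""
    precision = ratio(tp, tp + fp)
    tpr = ratio(tp, tp + fn)
    return {
        "error_rate": ratio(fp + fn, tp + fp + tn + fn),
        "precision": precision,
        "tpr": tpr,
        "f1": ratio(2 * precision * tpr, precision + tpr),
        "fpr": ratio(fp, fp + tn),
    }


# Invented labels and scores, for illustration only.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.2, 0.3, 0.7]

default_metrics = metrics(*confusion_counts(y_true, scores))

# Thresholds 0.1, 0.2, ..., 0.9; each yields one (FPR, TPR) point.
roc_points = []
for k in range(1, 10):
    tp, fp, tn, fn = confusion_counts(y_true, scores, k / 10)
    roc_points.append((ratio(fp, fp + tn), ratio(tp, tp + fn)))

print(default_metrics)
print(roc_points)
```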

Final Project - Speed Dating* - We chose a speed dating dataset from a provided GitHub site, then pre-processed, visualized, and explored the data before splitting it into training and testing portions. Using the training data, we applied grid search and hyperparameter tuning (with feature selection) to select the best-tuned model for each of 5 machine learning algorithms. We then drew conclusions by comparing the prediction performance of these best-tuned models on the testing data.
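
The tuning workflow described above might look roughly like the sketch below. It uses synthetic data from `make_classification` in place of the speed dating dataset, shows only two classifiers because the overview does not name the five algorithms, and assumes `SelectKBest` with `f_classif` as the feature selection step. All grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the pre-processed dataset.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Two illustrative algorithms with small hyperparameter grids.
candidates = {
    "knn": (KNeighborsClassifier(), {"clf__n_neighbors": [3, 5, 7]}),
    "tree": (DecisionTreeClassifier(random_state=0), {"clf__max_depth": [2, 4, 6]}),
}

test_scores = {}
for name, (clf, clf_grid) in candidates.items():
    # Feature selection sits inside the pipeline so it is refit per CV fold.
    pipe = Pipeline([("select", SelectKBest(f_classif)), ("clf", clf)])
    grid = {"select__k": [4, 6, 10], **clf_grid}
    search = GridSearchCV(pipe, grid, scoring="accuracy", cv=5)
    search.fit(X_train, y_train)
    # Score the refit best model on the held-out test split.
    test_scores[name] = search.score(X_test, y_test)

for name, score in test_scores.items():
    print(f"{name}: test accuracy {score:.3f}")
```

Keeping the selector inside the pipeline means the test split never influences which features are chosen, which keeps the final comparison fair.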

N.B. * indicates an open-ended assignment.