Thursday, December 19, 2019

An Exploration of Classification Models Using a Mobile Price Dataset

I wanted to review some classification algorithms, so I found a clean data set on Kaggle about classifying mobile price ranges to practice with logistic regression, decision trees, and random forests. I will walk through my thought process as I explore this data set.

The Dataset

The data set comes with its own story: Bob wants to start his own mobile company, but does not know how to price the phones his company creates. So he collects sales data on mobile phones from various companies and tries to find a relationship between a phone's features and its selling price.

The features of the phone include battery power, Bluetooth capability, clock speed, dual SIM support, front camera megapixels, 4G capability, internal memory, mobile depth, mobile weight, number of processor cores, primary camera megapixels, pixel height resolution, pixel width resolution, RAM, screen height, screen width, talk time on a single battery charge, 3G capability, touch screen, and Wi-Fi capability.

The data set has a test and a train file, but because the test file has no price-range labels, I decided to use only the training data and carve a separate test set out of it. This decreased the amount of data I could train on, but I felt that was fine for this exploration.

Exploratory Data Analysis

Before running any models, I looked at pairplots using seaborn. Since there are so many features, I separated the dataset into three groups of about 7-8 features each and compared each group against the price range, which is denoted 0, 1, 2, or 3. I looked at the pairplots to see whether any features are linearly separable. (That would tell me that logistic regression should work well.)
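
Here is a minimal sketch of how those pairplots can be generated, assuming the Kaggle training file is named train.csv and uses the dataset's short column names (e.g., fc for front camera megapixels):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Kaggle training file (the path is an assumption)
df = pd.read_csv("train.csv")

# Compare a handful of features at a time, colored by the target
subset = ["battery_power", "ram", "px_width", "fc", "price_range"]
sns.pairplot(df[subset], hue="price_range")
plt.show()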

How can I tell whether a feature is linearly separable? I looked at each pairplot for features whose class distributions overlapped the least. For example, in the distribution plot of battery power with the price range as the "hue" attribute, the blue and red hues are clearly separated. There is definitely overlap, but there is a wide gap between the blue and red distributions.

[Figure: distribution plot of battery_power, with price_range as the hue]

There is some separation in the pixel width feature too.

[Figure: distribution plot of px_width, with price_range as the hue]

But the separation in the RAM feature is the most significant.

[Figure: distribution plot of ram, with price_range as the hue]

So you might ask: what are some examples of features that do not look separable? I would say the front camera megapixels feature is not linearly separable, since all four colors are distributed in roughly the same way. Look at its distribution plot below: the colors overlap almost perfectly.

[Figure: distribution plot of fc (front camera megapixels), with price_range as the hue]

So now I know that certain features like RAM, battery power, and pixel width may be very telling of a mobile phone's price range, while something like front camera megapixels probably is not.

Are classes balanced?

Another thing to check before doing any modeling is the class balance of the target. My target has four classes that indicate how expensive a mobile phone is: 0, 1, 2, and 3. Looking at a distribution plot of the price range, I see that it is uniform. This rarely happens in real life, but again, this is a clean Kaggle data set.

Balanced classes mean I do not need to oversample or undersample, apply class weights, or change the class thresholds after modeling.
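
A quick way to verify the balance, continuing with the df loaded earlier:

# With balanced classes, all four counts come out equal
print(df["price_range"].value_counts())

sns.countplot(x="price_range", data=df)
plt.show()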

[Figure: countplot of price_range showing four equally sized classes]

Logistic Regression Model

I used scikit-learn to run my logistic regression model. I did two train/test splits to divide my data into a 60-20-20 train/validate/test split. I also imported StandardScaler to standardize my features. I looked at several metrics, including F1, accuracy, precision, and recall, and I examined a confusion matrix to practice and to make sure I was computing the metrics correctly.

For my logistic regression model, I focused on the three features that looked most linearly separable: battery_power, ram, and px_width. I standard-scaled the features, fit the model to my training set, predicted on my validation set, and used a classification report to see the precision, recall, F1, and accuracy scores. Using just three features, I got an accuracy of 82%, with F1 scores of .91 for class 0, .68 for class 1, .71 for class 2, and .93 for class 3. That's not bad.
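
Here is a minimal sketch of that pipeline, continuing from the df loaded earlier (the random_state values and max_iter are my own choices, not necessarily what produced the numbers above):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X = df[["battery_power", "ram", "px_width"]]
y = df["price_range"]

# Two splits to get a 60-20-20 train/validate/test division
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# Standardize the features, fitting the scaler on the training set only
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train_s, y_train)
print(classification_report(y_val, logreg.predict(X_val_s)))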

Logistic Regression - Confusion Matrix

[Figure: confusion matrix of actual vs. predicted price ranges on the validation set]

This part covers how to calculate precision, recall, and accuracy from the confusion matrix. The confusion matrix shows the actual vs. predicted values of the target. For example, in this confusion matrix, 95 phones were predicted as class 0 and were actually class 0.
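
Continuing from the logistic regression sketch above, the matrix itself comes from:

from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes (scikit-learn's convention)
print(confusion_matrix(y_val, logreg.predict(X_val_s)))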

So, what is accuracy in relation to the confusion matrix? Accuracy is the total number of correct predictions over the total number of cases. (An example of a correct prediction is a case that is predicted to be 0 and is actually 0.) In this case, it is the sum of the diagonal that runs from top left to bottom right, divided by the sum of every number in the whole matrix, which works out to the .82 reported above.

What is precision in relation to the confusion matrix? Precision is the number of true positives over the sum of true positives and false positives. (An example of a false positive is a case that is predicted to be 0 but is not actually 0.) Precision has to be computed class by class. In this case, class 0 has 95 true positives and 19 false positives, so its precision is 95/114, which rounds to .83.

What is recall in relation to the confusion matrix? Recall is the number of true positives over the sum of true positives and false negatives. (An example of a false negative is a case that is predicted to not be 0 but is actually 0.) In this case, class 0 has 95 true positives and 0 false negatives, so its recall is 1.

What is F1 in relation to precision and recall? F1 is actually the harmonic mean of the precision and recall scores.
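
For class 0 above, that works out to F1 = 2 * (precision * recall) / (precision + recall) = 2 * (.83 * 1) / (.83 + 1) ≈ .91, which matches the class 0 F1 score from the classification report.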

Decision Tree

Using sklearn's decision tree algorithm is simple: instantiate DecisionTreeClassifier, set a max depth for the tree, fit it on the training set, and predict on the validation set. The decision tree performed poorly compared to logistic regression, with an accuracy of .26. Randomly guessing among the four balanced classes would do about as well.

I want to make a note that instead of using just three features, I used all the features available to me when training and fitting my decision tree, as in the sketch below.
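
A sketch of those steps, continuing from the data loaded earlier; the post above does not state the exact depth, so the max_depth value here is a stand-in:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# All features this time, split 60-20-20 as before
X_all = df.drop(columns="price_range")
Xa_train, Xa_rest, ya_train, ya_rest = train_test_split(X_all, y, test_size=0.4, random_state=42)
Xa_val, Xa_test, ya_val, ya_test = train_test_split(Xa_rest, ya_rest, test_size=0.5, random_state=42)

tree = DecisionTreeClassifier(max_depth=2, random_state=42)  # example depth, an assumption
tree.fit(Xa_train, ya_train)
print(classification_report(ya_val, tree.predict(Xa_val)))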

The F1 score for class 0 was .28, class 1 was .33, class 2 was .14, and class 3 was .27.

Random Forest

Similarly, the random forest did not perform well compared to logistic regression. I used 1000 estimators (essentially individual decision trees) and a max depth of 2 in my random forest model. A random forest averages out many decision trees to give a more robust prediction, but in this case the accuracy score turned out to be .24. The F1 score for class 0 was .28, class 1 was .31, class 2 was .12, and class 3 was .27.

However, the feature importances in my random forest model were interesting to see. The feature importance graph below shows the relative importance of each feature, i.e., how much information is gained by splitting on it. Here, ram, battery_power, px_height, and px_width showed the most information gain. This is consistent with what I noticed in the pairplots during the exploratory data analysis above, where ram, battery_power, and px_width showed the most linear separability.
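
A sketch with the settings mentioned above (1000 estimators, max depth of 2), continuing from the decision tree splits:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=1000, max_depth=2, random_state=42)
rf.fit(Xa_train, ya_train)
print(classification_report(ya_val, rf.predict(Xa_val)))

# Rank features by the impurity-based importance the forest assigns them
importances = pd.Series(rf.feature_importances_, index=X_all.columns)
importances.sort_values().plot(kind="barh")
plt.show()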

[Figure: random forest feature importances, with ram, battery_power, px_height, and px_width ranked highest]

Logistic Regression worked best

A reason why I think logistic regression worked best in this scenario is that some features were linearly separable. In cases with little linear separability, the random forest might have worked better.

A final note

Improvements could be made to my model with more investigation into raising the F1 scores for class 1 and class 2 phones. Perhaps running a logistic regression model with all of the features and lowering the C parameter (scikit-learn's inverse regularization strength, which I left at its default of 1) would keep the model from over-fitting. That would be interesting to see.
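
One way to try that is a quick sweep over C on the three-feature model from earlier (the grid of values is arbitrary):

# Smaller C means stronger L2 regularization in scikit-learn
for c in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=c, max_iter=1000)
    model.fit(X_train_s, y_train)
    print(c, model.score(X_val_s, y_val))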
