Random Forest Classifier is an ensemble algorithm. In the next one or two posts we shall explore such algorithms. Ensemble algorithms are those which combine more than one algorithm of the same or different…
In this article, we shall look at the mathematics behind the Random Forest Classifier. Then we shall code a small example that classifies emails as spam or ham, and check its accuracy against the previous classifiers.
A random forest classifier creates a set of decision trees, each built from a randomly selected subset of the training set. It then aggregates the votes of the different decision trees to decide the final class of the test object.
Suppose the training set is given as [X1, X2, X3, X4] with corresponding labels [L1, L2, L3, L4]. A random forest may create three decision trees, each taking a subset as input, for example [X1, X2, X3], [X1, X2, X4] and [X2, X3, X4].
Finally, it predicts the class that receives the majority of the votes cast by these decision trees.
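The voting idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `fit_stump`, `fit_forest` and `predict` are hypothetical helpers, each "tree" is reduced to a single threshold on one numeric feature, and the subsets are drawn without replacement, whereas real random forests usually draw bootstrap samples.

```python
import random
from collections import Counter

def fit_stump(xs, labels):
    """Toy stand-in for a decision tree: one threshold on a single feature."""
    ones = [x for x, l in zip(xs, labels) if l == 1]
    zeros = [x for x, l in zip(xs, labels) if l == 0]
    if not ones or not zeros:
        # The subset holds a single class, so always predict that class.
        only = labels[0]
        return lambda x: only
    mean_one = sum(ones) / float(len(ones))
    mean_zero = sum(zeros) / float(len(zeros))
    threshold = (mean_one + mean_zero) / 2.0
    one_is_high = mean_one > mean_zero
    return lambda x: int((x > threshold) == one_is_high)

def fit_forest(xs, labels, n_trees=3, subset_size=3, seed=0):
    """Train each toy tree on its own randomly selected subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.sample(range(len(xs)), subset_size)
        forest.append(fit_stump([xs[i] for i in idx], [labels[i] for i in idx]))
    return forest

def predict(forest, x):
    """Return the class that gets the majority of the tree votes."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

xs = [1.0, 2.0, 8.0, 9.0]   # plays the role of [X1, X2, X3, X4]
labels = [0, 0, 1, 1]       # plays the role of [L1, L2, L3, L4]
forest = fit_forest(xs, labels)
print(predict(forest, 0.5), predict(forest, 9.5))
```

With three trees and two classes a tie cannot occur; with an even number of trees a tie-breaking rule would be needed.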
Alternatively, the random forest can weight the result of each decision tree. Trees with a high error rate are given a low weight and vice versa, which increases the influence of the trees with a low error rate on the final decision.
The basic parameters of the Random Forest Classifier are the total number of trees to generate and decision-tree parameters such as the minimum split size and the split criterion.
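A weighted vote could look roughly like the sketch below. The function `weighted_vote` and the weighting rule (weight = 1 - error rate) are assumptions made for illustration, not the formula of any particular library; as far as I know, sklearn's RandomForestClassifier averages the class probabilities of its trees rather than weighting them by error.

```python
from collections import defaultdict

def weighted_vote(predictions, error_rates):
    """predictions: one class label per tree.
    error_rates: one error rate per tree, between 0 and 1."""
    scores = defaultdict(float)
    for label, err in zip(predictions, error_rates):
        # Hypothetical weighting: a low error rate gives a high weight.
        scores[label] += 1.0 - err
    return max(scores, key=scores.get)

# Two inaccurate trees say "spam", one accurate tree says "ham".
# A plain majority would answer "spam"; the weights flip the result.
print(weighted_vote(["spam", "spam", "ham"], [0.5, 0.6, 0.05]))
```

When all trees have the same error rate, this reduces to the plain majority vote described earlier.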
Random Forest and Sklearn in Python (Coding example).
Let's try out RandomForestClassifier on our previous code for classifying emails into spam or ham.
I have created a git repository for the data set and the sample code. You can download it from here (use the chapter 5 folder). It is the same data set discussed in this chapter. I suggest you follow the discussion and do the coding yourself. If that fails, you can refer to my version to understand how it works.
1. A little bit about cleaning and extracting the features
You may skip this part if you have already gone through the coding part of Naive Bayes. (This is for readers who have jumped here directly.)
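Before we can apply the sklearn classifiers, we must clean the data. Cleaning involves removing stop words, extracting the most common words from the text, and so on. In the code example we perform the following steps:
Build a dictionary of words from the email documents in the training set.
Consider the most common 3000 words.
For each document in the training set, create a frequency matrix for these dictionary words, along with the corresponding labels.
To understand this in detail, once again please refer to the coding part of chapter 1 here.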
The code snippet below does this:
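import os
import numpy as np
from collections import Counter

def make_Dictionary(root_dir):
    all_words = []
    emails = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]
    for mail in emails:
        with open(mail) as m:
            for line in m:
                all_words += line.split()
    dictionary = Counter(all_words)
    # list() copies the keys so entries can be deleted while looping (works in python 2 and 3.x).
    list_to_remove = list(dictionary)
    for item in list_to_remove:
        # remove if numerical.
        if not item.isalpha():
            del dictionary[item]
        elif len(item) == 1:
            del dictionary[item]
    # consider only most 3000 common words in dictionary.
    dictionary = dictionary.most_common(3000)
    return dictionary

def extract_features(mail_dir):
    # uses the module-level dictionary, a list of (word, count) pairs from make_Dictionary.
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files), 3000))
    train_labels = np.zeros(len(files))
    docID = 0
    for fil in files:
        with open(fil) as fi:
            for i, line in enumerate(fi):
                # the mail body is on the third line of each file.
                if i == 2:
                    words = line.split()
                    for word in words:
                        for wordID, d in enumerate(dictionary):
                            if d[0] == word:
                                features_matrix[docID, wordID] = words.count(word)
        # file names starting with "spmsg" are spam (label 1), all others are ham (label 0).
        train_labels[docID] = 0
        if os.path.basename(fil).startswith("spmsg"):
            train_labels[docID] = 1
        docID = docID + 1
    return features_matrix, train_labels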
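To see the dictionary and the frequency matrix without any files on disk, here is a tiny in-memory sketch of the same idea. The three sample documents, the names `docs`, `vocabulary` and `features`, and the choice to keep the 2 most common words instead of 3000 are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical mini corpus standing in for the email bodies.
docs = ["win money now", "meeting at noon", "win a prize now now"]

# Build the dictionary: count words, then drop non-alphabetic and one-letter tokens.
counts = Counter(word for doc in docs for word in doc.split())
for word in list(counts):
    if not word.isalpha() or len(word) == 1:
        del counts[word]

# Keep only the most common words (2 here, 3000 in the article).
vocabulary = [word for word, _ in counts.most_common(2)]

# One row per document, one column per dictionary word.
features = [[doc.split().count(word) for word in vocabulary] for doc in docs]

print(vocabulary)
print(features)
```

Each row of `features` plays the role of one row of the frequency matrix: the number of times each dictionary word occurs in that document.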
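2. Using Random Forest Classifier
The code for using the Random Forest Classifier is similar to that of the previous classifiers: import the library, create the model, train, and predict.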
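from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# TRAIN_DIR and TEST_DIR must hold the paths of the training and test mail folders of the data set.
dictionary = make_Dictionary(TRAIN_DIR)
print("reading and processing emails from file.")
features_matrix, labels = extract_features(TRAIN_DIR)
test_feature_matrix, test_labels = extract_features(TEST_DIR)
model = RandomForestClassifier()
print("Training model.")
model.fit(features_matrix, labels)
predicted_labels = model.predict(test_feature_matrix)
print("FINISHED classifying. accuracy score : ")
print(accuracy_score(test_labels, predicted_labels))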
Try this out and check the accuracy. You should get an accuracy of around 95.7%. That's pretty good compared to the previous classifiers, isn't it?
Let's understand and play with some of the tuning parameters.
n_estimators: number of trees in the forest. Default is 10 in older scikit-learn releases and 100 from version 0.22 onward.
criterion: “gini” or “entropy”, the same as for the decision tree classifier.
min_samples_split: minimum number of samples a node must hold before it can be split. Default is 2.
Play with these parameters by changing values individually and in combination and check if you can improve accuracy.
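One way to try several combinations in a single run is a small loop like the sketch below. It is not the article's experiment: the data comes from `make_classification` as a synthetic stand-in for the email feature matrix, and the candidate values are arbitrary choices, so the accuracies it prints say nothing about the spam data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the email features (an assumption, not the article's data).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

results = {}
for n_estimators in (10, 50):
    for criterion in ("gini", "entropy"):
        for min_samples_split in (2, 10):
            model = RandomForestClassifier(
                n_estimators=n_estimators,
                criterion=criterion,
                min_samples_split=min_samples_split,
                random_state=0,  # fixed seed so repeated runs are comparable
            )
            model.fit(X_train, y_train)
            acc = accuracy_score(y_test, model.predict(X_test))
            results[(n_estimators, criterion, min_samples_split)] = acc

best = max(results, key=results.get)
print("best parameters:", best, "accuracy:", results[best])
```

To run the same loop on the spam data, replace the synthetic arrays with the matrices returned by extract_features. Note that picking parameters by test-set accuracy tends to give an optimistic estimate; a separate validation split is the safer choice.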
I tried the following combination and obtained the accuracy shown in the image below.