- The code example performs the following steps (for details, please refer again to the coding part of chapter 1): build a dictionary of words from the email documents in the training set; keep the 3000 most common words; and, for each document in the training set, create a frequency matrix for these dictionary words together with the corresponding labels.
- The code snippet below does this (written for Python 3):

      import os
      from collections import Counter

      import numpy as np


      def make_Dictionary(root_dir):
          all_words = []
          emails = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]
          for mail in emails:
              with open(mail) as m:
                  for line in m:
                      words = line.split()
                      all_words += words
          dictionary = Counter(all_words)
          # On Python 3, iterate over a copy of the keys so that entries
          # can be deleted while looping.
          for item in list(dictionary):
              # Drop tokens that are not purely alphabetic or are a single character.
              if not item.isalpha():
                  del dictionary[item]
              elif len(item) == 1:
                  del dictionary[item]
          # Consider only the 3000 most common words in the dictionary.
          dictionary = dictionary.most_common(3000)
          return dictionary


      def extract_features(mail_dir):
          # Uses the module-level `dictionary` returned by make_Dictionary.
          files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
          features_matrix = np.zeros((len(files), 3000))
          train_labels = np.zeros(len(files))
          count = 0
          docID = 0
          for fil in files:
              with open(fil) as fi:
                  for i, line in enumerate(fi):
                      # Only the third line of each file (index 2) is processed.
                      if i == 2:
                          words = line.split()
                          for word in words:
                              # Each dictionary entry is a (word, count) pair.
                              for wordID, d in enumerate(dictionary):
                                  if d[0] == word:
                                      features_matrix[docID, wordID] = words.count(word)
              # Files whose name starts with "spmsg" are labelled 1; all others 0.
              train_labels[docID] = 0
              filepathTokens = fil.split('/')
              lastToken = filepathTokens[len(filepathTokens) - 1]
              if lastToken.startswith("spmsg"):
                  train_labels[docID] = 1
                  count = count + 1
              docID = docID + 1
          return features_matrix, train_labels
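- Using Random Forest Classifier: the code for using the Random Forest Classifier is similar to that of the previous classifiers. The steps are: import the library, create the model, train, and predict. The script imports from sklearn.ensemble, points to the test emails under a path ending in /test-mails, builds the dictionary, and reports "reading and processing emails from file."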
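The four steps listed above (import, create, train, predict) can be sketched roughly as follows. This is a minimal sketch on a made-up word-count matrix, not the article's email data: the toy arrays, the label convention (1 for spam, 0 for non-spam), and the `n_estimators` and `random_state` values are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hypothetical word-count matrix: rows are emails, columns are counts of
# four dictionary words. Label 1 = spam, 0 = non-spam (assumed convention).
train_features = np.array([
    [5, 3, 0, 0],
    [4, 4, 0, 1],
    [6, 2, 1, 0],
    [0, 0, 5, 4],
    [1, 0, 4, 6],
    [0, 1, 3, 5],
])
train_labels = np.array([1, 1, 1, 0, 0, 0])

test_features = np.array([[7, 5, 0, 0], [0, 0, 6, 6]])
test_labels = np.array([1, 0])

# Create the model; the hyperparameters here are illustrative choices.
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Train on the training matrix, then predict labels for the test matrix.
model.fit(train_features, train_labels)
predicted_labels = model.predict(test_features)

print("accuracy score:", accuracy_score(test_labels, predicted_labels))
```

With real data, the feature matrices and labels would come from `extract_features` applied to the training and test directories instead of the hand-written arrays.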
Random Forest Classifier is an ensemble algorithm. In the next one or two posts we shall explore such algorithms. Ensemble algorithms are those that combine more than one algorithm of the same or different…
Continue reading “Chapter 5: Random Forest Classifier – Machine Learning 101 – Medium”