February 21, 2023
For my third and final RMIT Practical Data Science with Python assignment, the task was to help fictitious company, Connect 5G, to address a growing spam problem impacting their customers. Two machine-learning classification models were built, tested and compared - K-means and Decision Tree methods. Recommendations were made based on the results, taking prediction time into consideration given the needs of their customers.
An important part of this exercise for me was to understand not just how to implement the model, but also the theory behind each method. I experimented with a variety of parameter values for each method including more extreme values, just to get a feel for how they impacted the models which was helpful in learning more about how the models work. It is this kind of iterative process that I enjoy about working with coding such as Python, as when set up properly, it can all be run again with a minimum of fuss.
Below is the Python code used for the data wrangling including tokenisation, removal of stopwords, balancing of data with SMOTE, building, test and compare K-means and Decision Tree data models.
Guiding questions (from case study) to be addressed in presentation:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score,balanced_accuracy_score,confusion_matrix, ConfusionMatrixDisplay, classification_report
from imblearn.over_sampling import SMOTE
from collections import Counter
import ssl
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
pd.set_option("display.max_rows", 100)
# set style for plots
plt.style.use("seaborn-white")
# setting colours for plots
color ="orange"
color_r = "navajowhite"
colors = ["orange", "navajowhite"]
colors_r = list(reversed(colors))
/var/folders/pb/f8gg2y7w8xjbbn0059bcyddh0000gr/T/ipykernel_47553/681184966.py:6: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-<style>'. Alternatively, directly use the seaborn API instead.
plt.style.use("seaborn-white")
df = pd.read_csv("A3_sms.csv", encoding="utf8")
df
Unnamed: 0 | sms | spam | Unnamed: 3 | |
---|---|---|---|---|
0 | 0 |
|
False | NaN |
1 | 1 | Hhahhaahahah rofl was leonardo in your room or something | False | NaN |
2 | 4 | Oh for sake she’s in like | False | NaN |
3 | 5 | No da:)he is stupid da..always sending like this:)don believe any of those message.pandy is a :) | False | NaN |
4 | 6 | Lul im gettin some juicy gossip at the hospital. Oyea. | False | NaN |
… | … | … | … | … |
5346 | 5348 | Congratulations! Thanks to a good friend U have WON the £2,000 Xmas prize. 2 claim is easy, just call 08718726971 NOW! Only 10p per minute. BT-national-rate. | True | NaN |
5347 | 5349 | Congratulations - Thanks to a good friend U have WON the £2,000 Xmas prize. 2 claim is easy, just call 08712103738 NOW! Only 10p per minute. BT-national-rate | True | NaN |
5348 | 5350 | URGENT! Your mobile number *************** WON a £2000 Bonus Caller prize on 10/06/03! This is the 2nd attempt to reach you! Call 09066368753 ASAP! Box 97N7QP, 150ppm | True | NaN |
5349 | 5351 | URGENT! Your Mobile No was awarded a £2,000 Bonus Caller Prize on 1/08/03! This is our 2nd attempt to contact YOU! Call 0871-4719-523 BOX95QU BT National Rate | True | NaN |
5350 | 5352 | Do whatever you want. You know what the rules are. We had a talk earlier this week about what had to start happening, you showing responsibility. Yet, every week it’s can i bend the rule this way? What about that way? Do whatever. I’m tired of having thia same argument with you every week. And a <#> movie DOESNT inlude the previews. You’re still getting in after 1. | False | NaN |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5351 entries, 0 to 5350
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 5351 non-null int64
1 sms 5351 non-null object
2 spam 5351 non-null bool
3 Unnamed: 3 49 non-null object
dtypes: bool(1), int64(1), object(2)
memory usage: 130.8+ KB
df.shape
(5351, 4)
df.nunique()
Unnamed: 0 5351
sms 4948
spam 2
Unnamed: 3 2
dtype: int64
#looking at values in each column for any issues
for col in df.columns:
print (f"Column \"{col}\" values: ")
print (df.loc[:, col].unique(), "\n")
Column "Unnamed: 0" values:
[ 0 1 4 ... 5350 5351 5352]
Column "sms" values:
['1. Tension face 2. Smiling face 3. Waste face 4. Innocent face 5.Terror face 6.Cruel face 7.Romantic face 8.Lovable face 9.decent face <#> .joker face.'
'Hhahhaahahah rofl was leonardo in your room or something'
"Oh for sake she's in like " ...
'URGENT! Your mobile number *************** WON a £2000 Bonus Caller prize on 10/06/03! This is the 2nd attempt to reach you! Call 09066368753 ASAP! Box 97N7QP, 150ppm'
'URGENT! Your Mobile No was awarded a £2,000 Bonus Caller Prize on 1/08/03! This is our 2nd attempt to contact YOU! Call 0871-4719-523 BOX95QU BT National Rate'
"Do whatever you want. You know what the rules are. We had a talk earlier this week about what had to start happening, you showing responsibility. Yet, every week it's can i bend the rule this way? What about that way? Do whatever. I'm tired of having thia same argument with you every week. And a <#> movie DOESNT inlude the previews. You're still getting in after 1."]
Column "spam" values:
[False True]
Column "Unnamed: 3" values:
[nan '********' '\\/\\/\\/\\/\\/']
# Column 0 - basically an index number - not required for this project
# Column 1 - sms content - what our models will be using
# Column 2 - spam marker - target class for evaluating model accuracy
# Column 3 - unknown - possibly de-indentification of numbers - not required for this project
# No missing values
# No need to change data types
# SMS text is what it is, typos and all (especially as typos can be a spam indicator)... just make case consistent (lower)
# Spam - boolean - fine, no typos
df_preparation = df["sms"].str.lower()
df_preparation.head()
0 1. tension face 2. smiling face 3. waste face ...
1 hhahhaahahah rofl was leonardo in your room or...
2 oh for sake she's in like
3 no da:)he is stupid da..always sending like th...
4 lul im gettin some juicy gossip at the hospita...
Name: sms, dtype: object
## disabling SSl check to download the package "punkt"
try:
_create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
pass
else:
ssl._create_default_https_context = _create_unverified_https_context
# load tokens (words)
nltk.download("punkt")
[nltk_data] Downloading package punkt to /Users/adam/nltk_data...
[nltk_data] Package punkt is already up-to-date!
True
df_preparation = [word_tokenize(sms) for sms in df_preparation]
df_preparation[0] #print first list item
['1.',
'tension',
'face',
'2.',
'smiling',
'face',
'3.',
'waste',
'face',
'4.',
'innocent',
'face',
'5.terror',
'face',
'6.cruel',
'face',
'7.romantic',
'face',
'8.lovable',
'face',
'9.decent',
'face',
'&',
'lt',
';',
'#',
'&',
'gt',
';',
'.joker',
'face',
'.']
nltk.download("stopwords")
[nltk_data] Downloading package stopwords to /Users/adam/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True
list_stopwords=stopwords.words("english")
list_stopwords[0:10]
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
df_preparation= [[word for word in sms if not word in list_stopwords] for sms in df_preparation]
print(df_preparation[0:4])
[['1.', 'tension', 'face', '2.', 'smiling', 'face', '3.', 'waste', 'face', '4.', 'innocent', 'face', '5.terror', 'face', '6.cruel', 'face', '7.romantic', 'face', '8.lovable', 'face', '9.decent', 'face', '&', 'lt', ';', '#', '&', 'gt', ';', '.joker', 'face', '.'], ['hhahhaahahah', 'rofl', 'leonardo', 'room', 'something'], ['oh', 'sake', "'s", 'like'], ['da', ':', ')', 'stupid', 'da', '..', 'always', 'sending', 'like', ':', ')', 'believe', 'message.pandy', ':', ')']]
df_words = df.copy()
df_words["words"] = df_preparation
# dataset for checking most frequent words
df_words.head()
Unnamed: 0 | sms | spam | Unnamed: 3 | words | |
---|---|---|---|---|---|
0 | 0 |
|
False | NaN | [1., tension, face, 2., smiling, face, 3., waste, face, 4., innocent, face, 5.terror, face, 6.cruel, face, 7.romantic, face, 8.lovable, face, 9.decent, face, &, lt, ;, #, &, gt, ;, .joker, face, .] |
1 | 1 | Hhahhaahahah rofl was leonardo in your room or something | False | NaN | [hhahhaahahah, rofl, leonardo, room, something] |
2 | 4 | Oh for sake she’s in like | False | NaN | [oh, sake, ’s, like] |
3 | 5 | No da:)he is stupid da..always sending like this:)don believe any of those message.pandy is a :) | False | NaN | [da, :, ), stupid, da, .., always, sending, like, :, ), believe, message.pandy, :, )] |
4 | 6 | Lul im gettin some juicy gossip at the hospital. Oyea. | False | NaN | [lul, im, gettin, juicy, gossip, hospital, ., oyea, .] |
Apply exploratory methods as appropriate to textual data. Marks will be awarded as follows:
len(df[df["spam"]==True])/len(df)
0.13156419360867128
df_spam = df[df["spam"]==True]
df_ham = df[df["spam"]==False]
shape_spam = df_spam.shape
shape_ham = df_ham.shape
total = df.shape[0]
num_spam = shape_spam[0]
num_ham = shape_ham[0]
print (f"Spam: {shape_spam[0]} - {round(num_spam/total * 100, 2)}%")
print (f"Ham: {shape_ham[0]} - {round(num_ham/total * 100, 2)}%")
Spam: 704 - 13.16%
Ham: 4647 - 86.84%
Unbalanced - Spam 704 - approx 13% - Ham 4647 - approx 87%
labels = ["Spam\n ("+str(shape_spam[0])+")", "Ham \n("+str(shape_ham[0])+")"]
counts = [shape_spam[0], shape_ham[0]]
title = "Fig. 1 Spam vs Ham"
plt.pie(counts
, labels = labels
, colors = colors_r
, counterclock = True
, startangle = 0
, labeldistance = 1.2
, pctdistance = 0.65
, autopct = lambda p: f"{int(p)}%"
, textprops={"fontsize": 16}
)
plt.title(f"{title}", fontsize=18, fontweight="bold")
plt.savefig(f"Fig. 1 {title}.png", dpi=300, transparent=True, bbox_inches = "tight")
plt.show()
# Feature extraction
CountVec = CountVectorizer(lowercase=True,analyzer="word",stop_words="english")
# Get feature vectors
feature_vectors = CountVec.fit_transform(df["sms"])
# Prints all strings - commented out for purposes of length
# for x in CountVec.get_feature_names_out():
# print(x, end=", ")
feature_vectors
<5351x8011 sparse matrix of type '<class 'numpy.int64'>'
with 40832 stored elements in Compressed Sparse Row format>
# Investigate word frequency
# Create "spam" and "ham" subsets (using the dataset with tokenised emails)
#filter
df_words_spam = df_words[df_words["spam"]==True]
df_words_ham = df_words[df_words["spam"]==False]
# Get the most frequent words for each subset
word_counts_spam = df_words_spam["words"].apply(pd.Series).stack().value_counts()
word_counts_ham = df_words_ham["words"].apply(pd.Series).stack().value_counts()
word_counts_spam.head(35)
. 846
! 508
, 351
call 333
free 209
& 165
? 163
: 161
2 160
txt 143
ur 141
u 126
mobile 121
* 113
claim 113
4 110
text 101
stop 101
reply 97
prize 92
get 76
nokia 65
send 64
's 63
urgent 63
new 63
cash 62
win 60
) 58
contact 56
please 54
week 52
- 52
guaranteed 50
service 49
dtype: int64
word_counts_ham.head(35)
. 3644
, 1457
? 1307
... 1067
u 939
! 763
; 745
& 724
.. 675
: 537
) 420
's 405
'm 375
n't 310
gt 309
lt 309
2 287
get 285
# 275
go 241
ok 240
ur 238
got 236
come 227
call 226
'll 223
good 223
know 219
like 215
time 190
day 188
- 171
love 170
4 165
going 164
dtype: int64
# checking number of words in spam and ham -
word_counts_spam.shape, word_counts_ham.shape
((2808,), (6933,))
# convert to pandas datasets for joining/filtering
word_counts_spam = pd.DataFrame(word_counts_spam).reset_index()
word_counts_ham = pd.DataFrame(word_counts_ham).reset_index()
# setting "type" to be able to filter later by "spam" and "ham"
word_counts_spam ["type"] = "spam"
word_counts_spam
index | 0 | type | |
---|---|---|---|
0 | . | 846 | spam |
1 | ! | 508 | spam |
2 | , | 351 | spam |
3 | call | 333 | spam |
4 | free | 209 | spam |
… | … | … | … |
2803 | 08704439680. | 1 | spam |
2804 | passes | 1 | spam |
2805 | lounge | 1 | spam |
2806 | airport | 1 | spam |
2807 | 0871-4719-523 | 1 | spam |
word_counts_ham ["type"] = "ham"
word_counts_ham
index | 0 | type | |
---|---|---|---|
0 | . | 3644 | ham |
1 | , | 1457 | ham |
2 | ? | 1307 | ham |
3 | … | 1067 | ham |
4 | u | 939 | ham |
… | … | … | … |
6928 | andre | 1 | ham |
6929 | virgil | 1 | ham |
6930 | dismay | 1 | ham |
6931 | enjoying | 1 | ham |
6932 | previews | 1 | ham |
len(df_words_spam), len(df_words_ham)
(704, 4647)
all_words = pd.concat([word_counts_spam, word_counts_ham], axis=0).rename(columns={"index": "word", 0: "count"}).reset_index(drop=True)
all_words
word | count | type | |
---|---|---|---|
0 | . | 846 | spam |
1 | ! | 508 | spam |
2 | , | 351 | spam |
3 | call | 333 | spam |
4 | free | 209 | spam |
… | … | … | … |
9736 | andre | 1 | ham |
9737 | virgil | 1 | ham |
9738 | dismay | 1 | ham |
9739 | enjoying | 1 | ham |
9740 | previews | 1 | ham |
all_words.shape
(9741, 3)
#remove all duplicates (keep neither) to keep only unique words
ham_or_spam = all_words.drop_duplicates(subset=["word"], keep=False)
#remove all words only found in "ham" - keep "spam"
spam_words_only = ham_or_spam[ham_or_spam["type"]=="spam"].reset_index(drop=True)
spam_words_only
word | count | type | |
---|---|---|---|
0 | claim | 113 | spam |
1 | prize | 92 | spam |
2 | guaranteed | 50 | spam |
3 | tone | 48 | spam |
4 | cs | 41 | spam |
… | … | … | … |
1903 | villa | 1 | spam |
1904 | someonone | 1 | spam |
1905 | 08704439680. | 1 | spam |
1906 | passes | 1 | spam |
1907 | 0871-4719-523 | 1 | spam |
print (spam_words_only.head(30))
word count type
0 claim 113 spam
1 prize 92 spam
2 guaranteed 50 spam
3 tone 48 spam
4 cs 41 spam
5 awarded 38 spam
6 â£1000 35 spam
7 150ppm 34 spam
8 ringtone 29 spam
9 collection 26 spam
10 tones 26 spam
11 entry 25 spam
12 16+ 25 spam
13 weekly 24 spam
14 mob 23 spam
15 valid 23 spam
16 500 23 spam
17 â£100 22 spam
18 150p 21 spam
19 sae 21 spam
20 delivery 21 spam
21 8007 21 spam
22 bonus 21 spam
23 vouchers 20 spam
24 â£2000 20 spam
25 â£5000 20 spam
26 86688 19 spam
27 18 19 spam
28 â£500 19 spam
29 750 18 spam
ham_words_only = ham_or_spam[ham_or_spam["type"]=="ham"].reset_index(drop=True)
ham_words_only
word | count | type | |
---|---|---|---|
0 | gt | 309 | ham |
1 | lt | 309 | ham |
2 | lor | 162 | ham |
3 | da | 137 | ham |
4 | later | 130 | ham |
… | … | … | … |
6028 | andre | 1 | ham |
6029 | virgil | 1 | ham |
6030 | dismay | 1 | ham |
6031 | enjoying | 1 | ham |
6032 | previews | 1 | ham |
print (ham_words_only.head(30))
word count type
0 gt 309 ham
1 lt 309 ham
2 lor 162 ham
3 da 137 ham
4 later 130 ham
5 ã¼ 120 ham
6 happy 104 ham
7 amp 88 ham
8 work 88 ham
9 ask 88 ham
10 said 79 ham
11 lol 74 ham
12 anything 73 ham
13 cos 72 ham
14 morning 71 ham
15 sure 68 ham
16 something 65 ham
17 gud 63 ham
18 thing 58 ham
19 feel 56 ham
20 gon 56 ham
21 dun 55 ham
22 went 54 ham
23 sleep 54 ham
24 always 54 ham
25 told 52 ham
26 㜠52 ham
27 nice 51 ham
28 haha 51 ham
29 thk 50 ham
# Frquency of words in spam
count = Counter()
for word_list in df_words_spam["words"]:
for word in word_list:
count[word] += 1
# List most common
Counter(count).most_common(30)
[('.', 846),
('!', 508),
(',', 351),
('call', 333),
('free', 209),
('&', 165),
('?', 163),
(':', 161),
('2', 160),
('txt', 143),
('ur', 141),
('u', 126),
('mobile', 121),
('*', 113),
('claim', 113),
('4', 110),
('stop', 101),
('text', 101),
('reply', 97),
('prize', 92),
('get', 76),
('nokia', 65),
('send', 64),
("'s", 63),
('new', 63),
('urgent', 63),
('cash', 62),
('win', 60),
(')', 58),
('contact', 56)]
# Getting rid of duplicates within an email,
# to get number based on how many emails have a particular word, but no repetition
# Ham
count = Counter()
for word_list in df_words_spam["words"]:
for word in list(set(word_list)): # this makes a "set" which removes any duplicates within that email before counting
count[word] += 1
Counter(count).most_common(40)
[('.', 454),
('!', 341),
('call', 311),
(',', 208),
('free', 160),
('&', 138),
('txt', 137),
(':', 130),
('2', 125),
('?', 123),
('ur', 111),
('claim', 108),
('mobile', 107),
('4', 101),
('u', 101),
('text', 89),
('reply', 86),
('prize', 84),
('stop', 80),
('get', 75),
('send', 63),
('new', 62),
('urgent', 62),
('cash', 61),
('win', 60),
('contact', 56),
('please', 54),
("'s", 51),
('-', 50),
('guaranteed', 50),
('customer', 49),
('nokia', 49),
(')', 48),
('service', 48),
('*', 47),
('c', 45),
('week', 45),
('(', 44),
('tone', 41),
('cs', 41)]
# I did want to try and put in this information about % of spam that contains a particular word...
# but decided to keep focus on the modelling - and knowing that would be hard enough to cover in detail anyway!
top_words_spam = [("call", 333),
("free", 209),
("txt", 143),
("ur", 141),
("u", 126),
("mobile", 121),
("claim", 113),
("stop", 101),
("text", 101),
("reply", 97),
("prize", 92),
("get", 76),
("nokia", 65),
("send", 64),
("new", 63)]
df_top_words_spam = pd.DataFrame(top_words_spam, columns=["word", "count"])
df_top_words_spam["% present in Total"] = df_top_words_spam.apply(lambda x: round(x["count"]/len(df_words_spam)*100, 2), axis=1)
print(df_top_words_spam)
word count % present in Total
0 call 333 47.30
1 free 209 29.69
2 txt 143 20.31
3 ur 141 20.03
4 u 126 17.90
5 mobile 121 17.19
6 claim 113 16.05
7 stop 101 14.35
8 text 101 14.35
9 reply 97 13.78
10 prize 92 13.07
11 get 76 10.80
12 nokia 65 9.23
13 send 64 9.09
14 new 63 8.95
top_words_spam_unique = [("call", 311),
("free", 160),
("txt", 137),
("ur", 111),
("claim", 108),
("mobile", 107),
("u", 101),
("text", 89),
("reply", 86),
("prize", 84),
("stop", 80),
("get", 75),
("send", 63),
("new", 62),
("urgent", 62)]
df_top_words_spam_unique = pd.DataFrame(top_words_spam_unique, columns=["word", "count"])
df_top_words_spam_unique["% present in Total"] = df_top_words_spam_unique.apply(lambda x: round(x["count"]/len(df_words_spam)*100, 2), axis=1)
print(df_top_words_spam_unique)
word count % present in Total
0 call 311 44.18
1 free 160 22.73
2 txt 137 19.46
3 ur 111 15.77
4 claim 108 15.34
5 mobile 107 15.20
6 u 101 14.35
7 text 89 12.64
8 reply 86 12.22
9 prize 84 11.93
10 stop 80 11.36
11 get 75 10.65
12 send 63 8.95
13 new 62 8.81
14 urgent 62 8.81
# Frquency of words in ham
count = Counter()
for word_list in df_words_ham["words"]:
for word in word_list:
count[word] += 1
# List most common ham words
Counter(count).most_common(30)
[('.', 3644),
(',', 1457),
('?', 1307),
('...', 1067),
('u', 939),
('!', 763),
(';', 745),
('&', 724),
('..', 675),
(':', 537),
(')', 420),
("'s", 405),
("'m", 375),
("n't", 310),
('lt', 309),
('gt', 309),
('2', 287),
('get', 285),
('#', 275),
('go', 241),
('ok', 240),
('ur', 238),
('got', 236),
('come', 227),
('call', 226),
('good', 223),
("'ll", 223),
('know', 219),
('like', 215),
('time', 190)]
# Getting rid of duplicates within an email
# Ham
count = Counter()
for word_list in df_words_ham["words"]:
for word in list(set(word_list)):
count[word] += 1
Counter(count).most_common(40)
[('.', 2026),
('?', 1046),
(',', 1022),
('...', 684),
('u', 660),
('!', 508),
('..', 429),
(':', 402),
("'s", 368),
("'m", 344),
(')', 334),
(';', 331),
('&', 319),
('get', 265),
("n't", 257),
('2', 239),
('lt', 235),
('ok', 234),
('gt', 233),
('got', 224),
('go', 222),
("'ll", 217),
('call', 212),
('come', 212),
('good', 210),
('#', 209),
('know', 209),
('like', 200),
('ur', 187),
('time', 179),
('day', 176),
('going', 158),
('4', 157),
('home', 154),
('one', 149),
('want', 146),
('lor', 145),
('sorry', 144),
('-', 143),
('still', 143)]
Top 15 - Ham
Top 15 - no duplicates - (“u”, 660), - (“get”, 265), - (“ok”, 234), - (“got”, 224), - (“go”, 222), - (“call”, 212), - (“come”, 212), - (“good”, 210), - (“know”, 209), - (“like”, 200), - (“ur”, 187), - (“time”, 179), - (“day”, 176), - (“going”, 158), - (“home”, 154)
top_words_ham = [("u", 939),
("get", 285),
("go", 241),
("ok", 240),
("ur", 238),
("got", 236),
("come", 227),
("call", 226),
("good", 223),
("know", 219),
("like", 215),
("time", 190),
("day", 188),
("love", 170),
("going", 164)]
df_top_words_ham = pd.DataFrame(top_words_ham, columns=["word", "count"])
df_top_words_ham["% present in Total"] = df_top_words_ham.apply(lambda x: round(x["count"]/len(df_words_ham)*100, 2), axis=1)
print(df_top_words_ham)
word count % present in Total
0 u 939 20.21
1 get 285 6.13
2 go 241 5.19
3 ok 240 5.16
4 ur 238 5.12
5 got 236 5.08
6 come 227 4.88
7 call 226 4.86
8 good 223 4.80
9 know 219 4.71
10 like 215 4.63
11 time 190 4.09
12 day 188 4.05
13 love 170 3.66
14 going 164 3.53
top_words_ham_unique = [("u", 660),
("get", 265),
("ok", 234),
("got", 224),
("go", 222),
("call", 212),
("come", 212),
("good", 210),
("know", 209),
("like", 200),
("ur", 187),
("time", 179),
("day", 176),
("going", 158),
("home", 154)]
df_top_words_ham_unique = pd.DataFrame(top_words_ham_unique, columns=["word", "count"])
df_top_words_ham_unique["% present in Total"] = df_top_words_ham_unique.apply(lambda x: round(x["count"]/len(df_words_ham)*100, 2), axis=1)
print(df_top_words_ham_unique)
word count % present in Total
0 u 660 14.20
1 get 265 5.70
2 ok 234 5.04
3 got 224 4.82
4 go 222 4.78
5 call 212 4.56
6 come 212 4.56
7 good 210 4.52
8 know 209 4.50
9 like 200 4.30
10 ur 187 4.02
11 time 179 3.85
12 day 176 3.79
13 going 158 3.40
14 home 154 3.31
def plot_top_15 (dataframe, title="TBC", fig_num="TBC", color=color, size=3.8, total_freq=False):
#create plot for Top 15 words for spam/ham
words = dataframe["word"]
counts = dataframe["count"]
y_pos = np.arange(len(dataframe["word"])) # the label locations
width = 0.25 # the width of the bars
multiplier = 0
fig, ax = plt.subplots(figsize=(size,5), constrained_layout=True)
ax.barh(y_pos, counts, color=color)
# Add some text for labels, title and custom x-axis tick labels, etc.
if total_freq:
ax.set_xlabel("Total Count")
else:
ax.set_xlabel("Count of Emails")
ax.invert_yaxis()
ax.set_title(full_title, fontsize=15, fontweight="bold")
ax.set_yticks(y_pos, labels = words, fontsize=16)
#ax.set_ylim(0, 250)
# Show the plot
return fig
# Selected for report
# this is based on counting a word only once per email
data = df_top_words_ham_unique
title = "Ham"
fig_num = "2"
full_title = f"Fig. {fig_num} Top 15 words - {title}"
fig = plot_top_15 (data, full_title, fig_num, color_r)
plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")
plt.show()
# Selected for report
# this is based on counting a word only once per email
data = df_top_words_spam_unique
title = "Spam"
fig_num = "3"
full_title = f"Fig. {fig_num} Top 15 words - {title}"
fig = plot_top_15 (data, full_title, fig_num, color)
plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")
plt.show()
# word Frequency - allows for repetition in an email
data = df_top_words_ham
title = "Ham"
fig_num = "1.2"
full_title = f"Fig {fig_num} Top 15 words - {title}"
fig = plot_top_15 (data, full_title, fig_num, color_r, total_freq=True)
plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")
plt.show()
# word Frequency - allows for repetition in an email
data = df_top_words_spam
title = "Spam"
fig_num = "1.4"
full_title = f"Fig {fig_num} Top 15 words - {title}"
fig = plot_top_15 (data, full_title, fig_num, color, total_freq=True)
plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")
plt.show()
# Selected for Report
# based on word frquency...
data = spam_words_only.head(10)
title = "Spam-only words"
fig_num = "4"
full_title = f"Fig. {fig_num} Top 10 {title}"
size = 4.3
plt.figure(figsize = (5,4))
fig = plot_top_15 (data, full_title, fig_num, color="darkorange", size=size, total_freq=True)
plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")
plt.show()
<Figure size 500x400 with 0 Axes>
data = ham_words_only.head(10)
title = "Ham-only words"
fig_num = "1.6"
full_title = f"Fig {fig_num} Top 10 {title}"
fig = plot_top_15 (data, full_title, fig_num, color_r, total_freq=True)
plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")
plt.show()
# Creates a randomised matrix of dimensions based :
# - feature_vectors - sms content
# - df_ham and df_spam - an array based on 0 for ham and 1 for spam
X_train, X_test, y_train, y_test = train_test_split(feature_vectors, [0] * len(df_ham) + [1] * len(df_spam), random_state = 42, test_size=0.2)
# Balance training set using SMOTE
X_train_smote, y_train_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)
# Check the ratio of True (1) for y training set in total number of y
# should end up with 0.5...
print (y_train_smote.count(1), len(y_train_smote))
print (y_train_smote.count(1)/len(y_train_smote))
3725 7450
0.5
# Make instance of KNN model and set hyperparameters
knn = KNeighborsClassifier()
# Set hyperparameters
hyperparameters = {
"n_neighbors": [1, 3, 5, 9, 11],
"p": [1, 2]
}
#Grid search
knn_base = GridSearchCV(knn, hyperparameters, scoring="accuracy")
knn_base.fit(X_train, y_train)
print("Best p:", knn_base.best_estimator_.get_params()["p"])
print("Best n_neighbors:", knn_base.best_estimator_.get_params()["n_neighbors"])
Best p: 2
Best n_neighbors: 1
knn_smote = GridSearchCV(knn, hyperparameters, scoring="accuracy")
knn_smote.fit(X_train_smote, y_train_smote)
print("Best p:", knn_smote.best_estimator_.get_params()["p"])
print("Best n_neighbors:", knn_smote.best_estimator_.get_params()["n_neighbors"])
Best p: 2
Best n_neighbors: 1
# Make instance of Decision Tree model and set hyperparameters
# NOTE - I have removed a number of the values at extremes and in between that were used for testing,
# so that it doesn't take so long to run when reviewing.
dt = DecisionTreeClassifier()
hyperparameters = {
"min_samples_split": [2, 3, 5, 10, 15, 20],
"min_samples_leaf": [3, 4, 5, 6, 8],
"max_depth": [10, 20, 40, 60, 80, 120]
}
# train & evaluate
dt_base = GridSearchCV(dt, hyperparameters, scoring="accuracy").fit(X_train, y_train)
print("Best max_depth:", dt_base.best_estimator_.get_params()["max_depth"])
print("Best min_samples_leaf:", dt_base.best_estimator_.get_params()["min_samples_leaf"])
print("Best min_samples_split:", dt_base.best_estimator_.get_params()["min_samples_split"])
print("Best criterion:", dt_base.best_estimator_.get_params()["criterion"])
Best max_depth: 20
Best min_samples_leaf: 3
Best min_samples_split: 3
Best criterion: gini
# train & evaluate
# setting hyperparameter variables here for use for tree map at end.
dt_smote = GridSearchCV(dt, hyperparameters, scoring="accuracy").fit(X_train_smote, y_train_smote)
print("Best max_depth:", max_depth := dt_smote.best_estimator_.get_params()["max_depth"])
print("Best min_samples_leaf:", min_samples_leaf := dt_smote.best_estimator_.get_params()["min_samples_leaf"])
print("Best min_samples_split:", min_samples_split := dt_smote.best_estimator_.get_params()["min_samples_split"])
print("Best criterion:", criterion := dt_smote.best_estimator_.get_params()["criterion"])
Best max_depth: 120
Best min_samples_leaf: 3
Best min_samples_split: 2
Best criterion: gini
def plot_confusion_matrix (model, y_predicted, title="TBC", fig_num="TBC"):
# Confusion Matrix
plt.figure(figsize=(4,4))
cm = confusion_matrix(y_test, y_predicted, labels=model.classes_, )
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot(cmap="YlOrBr_r")
plt.title(f"Fig. {fig_num} {title}", fontsize=20, fontweight="bold")
plt.xlabel("Predicted", fontsize=14)
plt.ylabel("True label", fontsize=14)
return disp
def calc_accuracy (model_name, y_predicted):
# Accuracy
results.loc[model_name,"accuracy"] = accuracy_score(y_test,y_predicted)
return results.loc[model_name,"accuracy"]
def calc_balanced_accuracy (model_name, y_predicted):
# Balanced Accuracy
results.loc[model_name,"balanced_accuracy"] = balanced_accuracy_score(y_test,y_predicted)
return results.loc[model_name,"balanced_accuracy"]
def calc_training_time (model, model_name):
# Training Time
results.loc[model_name,"training_time"] = model.cv_results_["mean_fit_time"].mean()
return results.loc[model_name,"training_time"]
def calc_prediction_time (model, model_name):
# Training Time
results.loc[model_name,"prediction_time"] = model.cv_results_["mean_score_time"].mean()/len(y_test)
return results.loc[model_name,"prediction_time"]
# Create empty results table
results = pd.DataFrame()
model = knn_base
model_name = "KNN"
y_predicted = model.predict(X_test)
Metrics for KNN evaluation
# Confusion Matrix
title = "KNN (unbalanced)"
fig_num = "5"
disp = plot_confusion_matrix(model, y_predicted, title=title, fig_num=fig_num)
plt.savefig(f"Fig. {fig_num} {title}.png", dpi=300, transparent=True, bbox_inches = "tight")
<Figure size 400x400 with 0 Axes>
# KNN_base results
# Accuracy
accuracy = calc_accuracy(model_name, y_predicted)
# Balanced Accuracy
balanced = calc_balanced_accuracy(model_name, y_predicted)
# Training Time
train_time = calc_training_time(model, model_name)
# Prediction Time
pred_time = calc_prediction_time(model, model_name)
print (f"Accuracy: {accuracy:.10f}")
print (f"Balanced Accuracy: {balanced:.10f}")
print (f"Training Time: {train_time:.10f}")
print (f"Prediction Time: {pred_time:.10f}")
Accuracy: 0.9187675070
Balanced Accuracy: 0.7136805020
Training Time: 0.0025524473
Prediction Time: 0.0001030172
model = knn_smote
model_name = "KNN-SMOTE"
y_predicted = model.predict(X_test)
# Confusion Matrix
title = "KNN (SMOTE)"
fig_num = "7"
disp = plot_confusion_matrix(model, y_predicted, title=title, fig_num=fig_num)
plt.savefig(f"Fig. {fig_num} {title}.png", dpi=300, transparent=True, bbox_inches = "tight")
<Figure size 400x400 with 0 Axes>
# KNN_SMOTE results
# Accuracy
accuracy = calc_accuracy(model_name, y_predicted)
# Balanced Accuracy
balanced = calc_balanced_accuracy(model_name, y_predicted)
# Training Time
train_time = calc_training_time(model, model_name)
# Prediction Time
pred_time = calc_prediction_time(model, model_name)
print (f"Accuracy: {accuracy:.10f}")
print (f"Balanced Accuracy: {balanced:.10f}")
print (f"Training Time: {train_time:.10f}")
print (f"Prediction Time: {pred_time:.10f}")
Accuracy: 0.8085901027
Balanced Accuracy: 0.7312779339
Training Time: 0.0037259293
Prediction Time: 0.0002646201
model = dt_base
model_name = "DT"
y_predicted = model.predict(X_test)
# Confusion Matrix
title = "Decision Tree (unbalanced)"
fig_num = "6"
disp = plot_confusion_matrix(model, y_predicted, title=title, fig_num=fig_num)
plt.savefig(f"Fig. {fig_num} {title}.png", dpi=300, transparent=True, bbox_inches = "tight")
<Figure size 400x400 with 0 Axes>
# DT_BASE results
# Accuracy
accuracy = calc_accuracy(model_name, y_predicted)
# Balanced Accuracy
balanced = calc_balanced_accuracy(model_name, y_predicted)
# Training Time
train_time = calc_training_time(model, model_name)
# Prediction Time
pred_time = calc_prediction_time(model, model_name)
print (f"Accuracy: {accuracy:.10f}")
print (f"Balanced Accuracy: {balanced:.10f}")
print (f"Training Time: {train_time:.10f}")
print (f"Prediction Time: {pred_time:.10f}")
Accuracy: 0.8832866480
Balanced Accuracy: 0.6424318304
Training Time: 0.1051164484
Prediction Time: 0.0000014010
model = dt_smote
model_name = "DT-SMOTE"
y_predicted = model.predict(X_test)
# Confusion Matrix
title = "Decision Tree (SMOTE)"
fig_num = "8"
disp = plot_confusion_matrix(model, y_predicted, title=title, fig_num=fig_num)
plt.savefig(f"Fig. {fig_num} {title}.png", dpi=300, transparent=True, bbox_inches = "tight")
<Figure size 400x400 with 0 Axes>
# DT_SMOTE results
# Accuracy
accuracy = calc_accuracy(model_name, y_predicted)
# Balanced Accuracy
balanced = calc_balanced_accuracy(model_name, y_predicted)
# Training Time
train_time = calc_training_time(model, model_name)
# Prediction Time
pred_time = calc_prediction_time(model, model_name)
print (f"Accuracy: {accuracy:.10f}")
print (f"Balanced Accuracy: {balanced:.10f}")
print (f"Training Time: {train_time:.10f}")
print (f"Prediction Time: {pred_time:.10f}")
Accuracy: 0.6937441643
Balanced Accuracy: 0.5886131695
Training Time: 0.1291490380
Prediction Time: 0.0000019057
results.index
Index(['KNN', 'KNN-SMOTE', 'DT', 'DT-SMOTE'], dtype='object')
# show collated results
results.style.highlight_max(color=color, axis=0).highlight_min(color=color_r, axis=0)
#Accuracy & balanced accuracy
fig_num="9"
labels = list(results.index)
fig = results[["accuracy","balanced_accuracy"]].plot(kind="bar", color=colors_r, figsize=(8,5))
plt.title(f"Fig. {fig_num} Accuracy & Balanced Accuracy by Model"
, fontweight="bold"
, fontsize=14)
plt.xticks(rotation=0)
plt.ylabel("Percentage", fontsize=14)
plt.legend()
plt.savefig(f"Fig. {fig_num} Accuracy & Balanced Accuracy by Model.png", dpi=300, transparent=True, bbox_inches = "tight")
# Training Time
fig_num="10"
results["training_time"].plot(kind="bar", color=color, figsize=(7,3))
plt.title(f"Fig. {fig_num} Training Time by Model"
, fontweight="bold"
, fontsize=14)
plt.xticks(rotation=0)
plt.ylabel("Training Time", fontsize=14)
plt.savefig(f"Fig. {fig_num} Training Time by Model.png", dpi=300, transparent=True, bbox_inches = "tight")
# Prediction Time
fig_num="11"
results["prediction_time"].plot(kind="bar", color=color, figsize=(7,3))
plt.title(f"Fig. {fig_num} Prediction Time by Model"
, fontweight="bold"
, fontsize=14)
plt.xticks(rotation=0)
plt.ylabel("Prediction Time", fontsize=14)
plt.savefig(f"Fig. {fig_num} Prediction Time by Model.png", dpi=300, transparent=True, bbox_inches = "tight")
all_words["type"].shape
(9741,)
df_words["check"] = df_words["spam"].replace({True: "Spam", False: "Ham"}).astype(str)
# Classification report for KNN (unbalanced)
y_predicted = knn_base.fit(X_train, y_train).predict(X_test)
report = classification_report(y_test, y_predicted, target_names= df_words["check"].unique()) #(all_words["type"].unique()))
print(report)
precision recall f1-score support
Ham 0.92 1.00 0.95 922
Spam 0.97 0.43 0.60 149
accuracy 0.92 1071
macro avg 0.94 0.71 0.78 1071
weighted avg 0.92 0.92 0.90 1071
# Classification report for Decision Tree (unbalanced)
y_predicted = dt_base.fit(X_train, y_train).predict(X_test)
report = classification_report(y_test, y_predicted, target_names= df_words["check"].unique()) #(all_words["type"].unique()))
print(report)
precision recall f1-score support
Ham 0.90 0.98 0.94 922
Spam 0.70 0.32 0.44 149
accuracy 0.89 1071
macro avg 0.80 0.65 0.69 1071
weighted avg 0.87 0.89 0.87 1071
# Classification report for Decision Tree (SMOTE)
y_predicted = dt_smote.fit(X_train_smote, y_train_smote).predict(X_test)
report = classification_report(y_test, y_predicted, target_names= df_words["check"].unique()) #(all_words["type"].unique()))
print(report)
precision recall f1-score support
Ham 0.89 0.75 0.81 922
Spam 0.22 0.45 0.30 149
accuracy 0.71 1071
macro avg 0.56 0.60 0.56 1071
weighted avg 0.80 0.71 0.74 1071
# Classification report for KNN (SMOTE)
y_predicted = knn_smote.fit(X_train_smote, y_train_smote).predict(X_test)
report = classification_report(y_test, y_predicted, target_names= df_words["check"].unique()) #(all_words["type"].unique()))
print(report)
precision recall f1-score support
Ham 0.93 0.84 0.88 922
Spam 0.38 0.62 0.48 149
accuracy 0.81 1071
macro avg 0.66 0.73 0.68 1071
weighted avg 0.86 0.81 0.83 1071
Commented out for this HTML document due to output length
Trying just out of curiosoity - not included in final report.
Parameters have been set from above based on
best_estimator.get_params
print (criterion, max_depth, min_samples_split, min_samples_leaf)
gini 120 2 3
# dt_tree = DecisionTreeClassifier(criterion=criterion
# , max_depth=max_depth
# , min_samples_split=min_samples_split
# , min_samples_leaf=min_samples_leaf)
# dt_tree = dt_tree.fit(X_train_smote, y_train_smote)
# plt.figure(figsize=(15,50))
# tree.plot_tree(decision_tree=dt_tree, class_names=all_words["type"],\
# filled=True, rounded=True)
Copyright © 2023 Adam Simmons, Inc. All rights reserved.