February 21, 2023


For my third and final RMIT Practical Data Science with Python assignment, the task was to help fictitious company, Connect 5G, to address a growing spam problem impacting their customers. Two machine-learning classification models were built, tested and compared - K-means and Decision Tree methods. Recommendations were made based on the results, taking prediction time into consideration given the needs of their customers.

An important part of this exercise for me was to understand not just how to implement the model, but also the theory behind each method. I experimented with a variety of parameter values for each method including more extreme values, just to get a feel for how they impacted the models which was helpful in learning more about how the models work. It is this kind of iterative process that I enjoy about working with coding such as Python, as when set up properly, it can all be run again with a minimum of fuss.


Below is the Python code used for the data wrangling including tokenisation, removal of stopwords, balancing of data with SMOTE, building, test and compare K-means and Decision Tree data models.


Practical Data Science with Python Assessment Task 3: Code for data modelling presentation

Guiding questions (from case study) to be addressed in presentation:

Set Up

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import GridSearchCV 
from sklearn.metrics import accuracy_score,balanced_accuracy_score,confusion_matrix, ConfusionMatrixDisplay, classification_report 

from imblearn.over_sampling import SMOTE 

from collections import Counter

import ssl
import nltk 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
pd.set_option("display.max_rows", 100)


# set style for plots

plt.style.use("seaborn-white")


# setting colours for plots

color ="orange"
color_r = "navajowhite"
colors = ["orange", "navajowhite"]
colors_r = list(reversed(colors))
/var/folders/pb/f8gg2y7w8xjbbn0059bcyddh0000gr/T/ipykernel_47553/681184966.py:6: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-<style>'. Alternatively, directly use the seaborn API instead.
  plt.style.use("seaborn-white")

Load Data

df = pd.read_csv("A3_sms.csv", encoding="utf8")
df
Unnamed: 0 sms spam Unnamed: 3
0 0
  1. Tension face 2. Smiling face 3. Waste face 4. Innocent face 5.Terror face 6.Cruel face 7.Romantic face 8.Lovable face 9.decent face &lt;#&gt; .joker face.
False NaN
1 1 Hhahhaahahah rofl was leonardo in your room or something False NaN
2 4 Oh for sake she’s in like False NaN
3 5 No da:)he is stupid da..always sending like this:)don believe any of those message.pandy is a :) False NaN
4 6 Lul im gettin some juicy gossip at the hospital. Oyea. False NaN
5346 5348 Congratulations! Thanks to a good friend U have WON the £2,000 Xmas prize. 2 claim is easy, just call 08718726971 NOW! Only 10p per minute. BT-national-rate. True NaN
5347 5349 Congratulations - Thanks to a good friend U have WON the £2,000 Xmas prize. 2 claim is easy, just call 08712103738 NOW! Only 10p per minute. BT-national-rate True NaN
5348 5350 URGENT! Your mobile number *************** WON a £2000 Bonus Caller prize on 10/06/03! This is the 2nd attempt to reach you! Call 09066368753 ASAP! Box 97N7QP, 150ppm True NaN
5349 5351 URGENT! Your Mobile No was awarded a £2,000 Bonus Caller Prize on 1/08/03! This is our 2nd attempt to contact YOU! Call 0871-4719-523 BOX95QU BT National Rate True NaN
5350 5352 Do whatever you want. You know what the rules are. We had a talk earlier this week about what had to start happening, you showing responsibility. Yet, every week it’s can i bend the rule this way? What about that way? Do whatever. I’m tired of having thia same argument with you every week. And a &lt;#&gt; movie DOESNT inlude the previews. You’re still getting in after 1. False NaN
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5351 entries, 0 to 5350
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  5351 non-null   int64 
 1   sms         5351 non-null   object
 2   spam        5351 non-null   bool  
 3   Unnamed: 3  49 non-null     object
dtypes: bool(1), int64(1), object(2)
memory usage: 130.8+ KB
df.shape
(5351, 4)
df.nunique()
Unnamed: 0    5351
sms           4948
spam             2
Unnamed: 3       2
dtype: int64
#looking at values in each column for any issues

for col in df.columns:
    print (f"Column \"{col}\" values: ")
    print (df.loc[:, col].unique(), "\n")
Column "Unnamed: 0" values: 
[   0    1    4 ... 5350 5351 5352] 

Column "sms" values: 
['1. Tension face 2. Smiling face 3. Waste face 4. Innocent face 5.Terror face 6.Cruel face 7.Romantic face 8.Lovable face 9.decent face &lt;#&gt; .joker face.'
 'Hhahhaahahah rofl was leonardo in your room or something'
 "Oh for  sake she's in like " ...
 'URGENT! Your mobile number *************** WON a £2000 Bonus Caller prize on 10/06/03! This is the 2nd attempt to reach you! Call 09066368753 ASAP! Box 97N7QP, 150ppm'
 'URGENT! Your Mobile No was awarded a £2,000 Bonus Caller Prize on 1/08/03! This is our 2nd attempt to contact YOU! Call 0871-4719-523 BOX95QU BT National Rate'
 "Do whatever you want. You know what the rules are. We had a talk earlier this week about what had to start happening, you showing responsibility. Yet, every week it's can i bend the rule this way? What about that way? Do whatever. I'm tired of having thia same argument with you every week. And a  &lt;#&gt;  movie DOESNT inlude the previews. You're still getting in after 1."] 

Column "spam" values: 
[False  True] 

Column "Unnamed: 3" values: 
[nan '********' '\\/\\/\\/\\/\\/'] 
# Column 0 - basically an index number - not required for this project
# Column 1 - sms content - what our models will be using
# Column 2 - spam marker - target class for evaluating model accuracy
# Column 3 - unknown - possibly de-indentification of numbers - not required for this project

# No missing values
# No need to change data types
# SMS text is what it is, typos and all (especially as typos can be a spam indicator)... just make case consistent (lower)
# Spam - boolean - fine, no typos

Clean/prepare data

Conversion to lower-case

df_preparation = df["sms"].str.lower()
df_preparation.head()
0    1. tension face 2. smiling face 3. waste face ...
1    hhahhaahahah rofl was leonardo in your room or...
2                          oh for  sake she's in like 
3    no da:)he is stupid da..always sending like th...
4    lul im gettin some juicy gossip at the hospita...
Name: sms, dtype: object

Tokenisation of emails into words:

## disabling SSl check to download the package "punkt"
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass 
else:
    ssl._create_default_https_context = _create_unverified_https_context
    
# load tokens (words) 

nltk.download("punkt") 
[nltk_data] Downloading package punkt to /Users/adam/nltk_data...
[nltk_data]   Package punkt is already up-to-date!





True
df_preparation = [word_tokenize(sms) for sms in df_preparation]
df_preparation[0] #print first list item
['1.',
 'tension',
 'face',
 '2.',
 'smiling',
 'face',
 '3.',
 'waste',
 'face',
 '4.',
 'innocent',
 'face',
 '5.terror',
 'face',
 '6.cruel',
 'face',
 '7.romantic',
 'face',
 '8.lovable',
 'face',
 '9.decent',
 'face',
 '&',
 'lt',
 ';',
 '#',
 '&',
 'gt',
 ';',
 '.joker',
 'face',
 '.']

Removing stopwords

nltk.download("stopwords") 
[nltk_data] Downloading package stopwords to /Users/adam/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!





True
list_stopwords=stopwords.words("english") 

list_stopwords[0:10] 
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
df_preparation= [[word for word in sms if not word in list_stopwords] for sms in df_preparation] 

print(df_preparation[0:4]) 
[['1.', 'tension', 'face', '2.', 'smiling', 'face', '3.', 'waste', 'face', '4.', 'innocent', 'face', '5.terror', 'face', '6.cruel', 'face', '7.romantic', 'face', '8.lovable', 'face', '9.decent', 'face', '&', 'lt', ';', '#', '&', 'gt', ';', '.joker', 'face', '.'], ['hhahhaahahah', 'rofl', 'leonardo', 'room', 'something'], ['oh', 'sake', "'s", 'like'], ['da', ':', ')', 'stupid', 'da', '..', 'always', 'sending', 'like', ':', ')', 'believe', 'message.pandy', ':', ')']]
df_words = df.copy()
df_words["words"] = df_preparation
# dataset for checking most frequent words

df_words.head()
Unnamed: 0 sms spam Unnamed: 3 words
0 0
  1. Tension face 2. Smiling face 3. Waste face 4. Innocent face 5.Terror face 6.Cruel face 7.Romantic face 8.Lovable face 9.decent face &lt;#&gt; .joker face.
False NaN [1., tension, face, 2., smiling, face, 3., waste, face, 4., innocent, face, 5.terror, face, 6.cruel, face, 7.romantic, face, 8.lovable, face, 9.decent, face, &, lt, ;, #, &, gt, ;, .joker, face, .]
1 1 Hhahhaahahah rofl was leonardo in your room or something False NaN [hhahhaahahah, rofl, leonardo, room, something]
2 4 Oh for sake she’s in like False NaN [oh, sake, ’s, like]
3 5 No da:)he is stupid da..always sending like this:)don believe any of those message.pandy is a :) False NaN [da, :, ), stupid, da, .., always, sending, like, :, ), believe, message.pandy, :, )]
4 6 Lul im gettin some juicy gossip at the hospital. Oyea. False NaN [lul, im, gettin, juicy, gossip, hospital, ., oyea, .]

Explore Data

Check if balanced

len(df[df["spam"]==True])/len(df)
0.13156419360867128
df_spam = df[df["spam"]==True]
df_ham = df[df["spam"]==False]

shape_spam = df_spam.shape
shape_ham = df_ham.shape

total = df.shape[0]
num_spam = shape_spam[0] 
num_ham = shape_ham[0]


print (f"Spam: {shape_spam[0]} - {round(num_spam/total * 100, 2)}%")
print (f"Ham: {shape_ham[0]} - {round(num_ham/total * 100, 2)}%")
Spam: 704 - 13.16%
Ham: 4647 - 86.84%

Unbalanced - Spam 704 - approx 13% - Ham 4647 - approx 87%

labels = ["Spam\n ("+str(shape_spam[0])+")", "Ham  \n("+str(shape_ham[0])+")"]
counts = [shape_spam[0], shape_ham[0]]

title = "Fig. 1 Spam vs Ham"

    
plt.pie(counts
          , labels = labels
          , colors = colors_r
          , counterclock = True
          , startangle = 0
          , labeldistance = 1.2
          , pctdistance = 0.65
          , autopct = lambda p: f"{int(p)}%"
          , textprops={"fontsize": 16}  
       )

plt.title(f"{title}", fontsize=18, fontweight="bold")



plt.savefig(f"Fig. 1 {title}.png", dpi=300, transparent=True, bbox_inches = "tight")



plt.show()
png
png

Extract features - Count Vectorizer

Feature extraction

# Feature extraction

CountVec = CountVectorizer(lowercase=True,analyzer="word",stop_words="english") 
# Get feature vectors

feature_vectors = CountVec.fit_transform(df["sms"]) 
# Prints all strings - commented out for purposes of length
# for x in CountVec.get_feature_names_out():
#    print(x, end=", ")
feature_vectors
<5351x8011 sparse matrix of type '<class 'numpy.int64'>'
    with 40832 stored elements in Compressed Sparse Row format>

Identify most common words for spam and ham SMS messages

# Investigate word frequency

# Create "spam" and "ham" subsets (using the dataset with tokenised emails)

#filter
df_words_spam = df_words[df_words["spam"]==True]
df_words_ham = df_words[df_words["spam"]==False]


# Get the most frequent words for each subset

word_counts_spam = df_words_spam["words"].apply(pd.Series).stack().value_counts()
word_counts_ham = df_words_ham["words"].apply(pd.Series).stack().value_counts()
word_counts_spam.head(35)
.             846
!             508
,             351
call          333
free          209
&             165
?             163
:             161
2             160
txt           143
ur            141
u             126
mobile        121
*             113
claim         113
4             110
text          101
stop          101
reply          97
prize          92
get            76
nokia          65
send           64
's             63
urgent         63
new            63
cash           62
win            60
)              58
contact        56
please         54
week           52
-              52
guaranteed     50
service        49
dtype: int64
word_counts_ham.head(35)
.        3644
,        1457
?        1307
...      1067
u         939
!         763
;         745
&         724
..        675
:         537
)         420
's        405
'm        375
n't       310
gt        309
lt        309
2         287
get       285
#         275
go        241
ok        240
ur        238
got       236
come      227
call      226
'll       223
good      223
know      219
like      215
time      190
day       188
-         171
love      170
4         165
going     164
dtype: int64
# checking number of words in spam and ham - 

word_counts_spam.shape, word_counts_ham.shape
((2808,), (6933,))
# convert to pandas datasets for joining/filtering

word_counts_spam = pd.DataFrame(word_counts_spam).reset_index()
word_counts_ham = pd.DataFrame(word_counts_ham).reset_index()
# setting "type" to be able to filter later by "spam" and "ham"

word_counts_spam ["type"] = "spam"
word_counts_spam
index 0 type
0 . 846 spam
1 ! 508 spam
2 , 351 spam
3 call 333 spam
4 free 209 spam
2803 08704439680. 1 spam
2804 passes 1 spam
2805 lounge 1 spam
2806 airport 1 spam
2807 0871-4719-523 1 spam
word_counts_ham ["type"] = "ham"
word_counts_ham
index 0 type
0 . 3644 ham
1 , 1457 ham
2 ? 1307 ham
3 1067 ham
4 u 939 ham
6928 andre 1 ham
6929 virgil 1 ham
6930 dismay 1 ham
6931 enjoying 1 ham
6932 previews 1 ham
len(df_words_spam), len(df_words_ham)
(704, 4647)
all_words = pd.concat([word_counts_spam, word_counts_ham], axis=0).rename(columns={"index": "word", 0: "count"}).reset_index(drop=True)
all_words
word count type
0 . 846 spam
1 ! 508 spam
2 , 351 spam
3 call 333 spam
4 free 209 spam
9736 andre 1 ham
9737 virgil 1 ham
9738 dismay 1 ham
9739 enjoying 1 ham
9740 previews 1 ham
all_words.shape
(9741, 3)

#remove all duplicates (keep neither) to keep only unique words
ham_or_spam = all_words.drop_duplicates(subset=["word"], keep=False)

#remove all words only found in "ham" - keep "spam"
spam_words_only = ham_or_spam[ham_or_spam["type"]=="spam"].reset_index(drop=True)

spam_words_only
word count type
0 claim 113 spam
1 prize 92 spam
2 guaranteed 50 spam
3 tone 48 spam
4 cs 41 spam
1903 villa 1 spam
1904 someonone 1 spam
1905 08704439680. 1 spam
1906 passes 1 spam
1907 0871-4719-523 1 spam
print (spam_words_only.head(30))
          word  count  type
0        claim    113  spam
1        prize     92  spam
2   guaranteed     50  spam
3         tone     48  spam
4           cs     41  spam
5      awarded     38  spam
6       â£1000     35  spam
7       150ppm     34  spam
8     ringtone     29  spam
9   collection     26  spam
10       tones     26  spam
11       entry     25  spam
12         16+     25  spam
13      weekly     24  spam
14         mob     23  spam
15       valid     23  spam
16         500     23  spam
17       â£100     22  spam
18        150p     21  spam
19         sae     21  spam
20    delivery     21  spam
21        8007     21  spam
22       bonus     21  spam
23    vouchers     20  spam
24      â£2000     20  spam
25      â£5000     20  spam
26       86688     19  spam
27          18     19  spam
28       â£500     19  spam
29         750     18  spam
ham_words_only = ham_or_spam[ham_or_spam["type"]=="ham"].reset_index(drop=True)

ham_words_only
word count type
0 gt 309 ham
1 lt 309 ham
2 lor 162 ham
3 da 137 ham
4 later 130 ham
6028 andre 1 ham
6029 virgil 1 ham
6030 dismay 1 ham
6031 enjoying 1 ham
6032 previews 1 ham
print (ham_words_only.head(30))
         word  count type
0          gt    309  ham
1          lt    309  ham
2         lor    162  ham
3          da    137  ham
4       later    130  ham
5          ã¼    120  ham
6       happy    104  ham
7         amp     88  ham
8        work     88  ham
9         ask     88  ham
10       said     79  ham
11        lol     74  ham
12   anything     73  ham
13        cos     72  ham
14    morning     71  ham
15       sure     68  ham
16  something     65  ham
17        gud     63  ham
18      thing     58  ham
19       feel     56  ham
20        gon     56  ham
21        dun     55  ham
22       went     54  ham
23      sleep     54  ham
24     always     54  ham
25       told     52  ham
26         㜠    52  ham
27       nice     51  ham
28       haha     51  ham
29        thk     50  ham
# Frquency of words in spam

count = Counter()
for word_list in df_words_spam["words"]:
    for word in word_list:
        count[word] += 1
        
# List most common 
Counter(count).most_common(30)
[('.', 846),
 ('!', 508),
 (',', 351),
 ('call', 333),
 ('free', 209),
 ('&', 165),
 ('?', 163),
 (':', 161),
 ('2', 160),
 ('txt', 143),
 ('ur', 141),
 ('u', 126),
 ('mobile', 121),
 ('*', 113),
 ('claim', 113),
 ('4', 110),
 ('stop', 101),
 ('text', 101),
 ('reply', 97),
 ('prize', 92),
 ('get', 76),
 ('nokia', 65),
 ('send', 64),
 ("'s", 63),
 ('new', 63),
 ('urgent', 63),
 ('cash', 62),
 ('win', 60),
 (')', 58),
 ('contact', 56)]
# Getting rid of duplicates within an email, 
# to get number based on how many emails have a particular word, but no repetition

# Ham

count = Counter()

for word_list in df_words_spam["words"]:
    for word in list(set(word_list)):   # this makes a "set" which removes any duplicates within that email before counting
        count[word] += 1

Counter(count).most_common(40)
[('.', 454),
 ('!', 341),
 ('call', 311),
 (',', 208),
 ('free', 160),
 ('&', 138),
 ('txt', 137),
 (':', 130),
 ('2', 125),
 ('?', 123),
 ('ur', 111),
 ('claim', 108),
 ('mobile', 107),
 ('4', 101),
 ('u', 101),
 ('text', 89),
 ('reply', 86),
 ('prize', 84),
 ('stop', 80),
 ('get', 75),
 ('send', 63),
 ('new', 62),
 ('urgent', 62),
 ('cash', 61),
 ('win', 60),
 ('contact', 56),
 ('please', 54),
 ("'s", 51),
 ('-', 50),
 ('guaranteed', 50),
 ('customer', 49),
 ('nokia', 49),
 (')', 48),
 ('service', 48),
 ('*', 47),
 ('c', 45),
 ('week', 45),
 ('(', 44),
 ('tone', 41),
 ('cs', 41)]

Top 15 words - Spam

  • (“call”, 333)
  • (“free”, 209),
  • (“txt”, 143),
  • (“ur”, 141),
  • (“u”, 126),
  • (“mobile”, 121),
  • (“claim”, 113),
  • (“stop”, 101),
  • (“text”, 101),
  • (“reply”, 97),
  • (“prize”, 92),
  • (“get”, 76),
  • (“nokia”, 65),
  • (“send”, 64),
  • (“new”, 63),

Top 15 words - Spam - no duplicates

  • (“call”, 311),
  • (“free”, 160),
  • (“txt”, 137),
  • (“ur”, 111),
  • (“claim”, 108),
  • (“mobile”, 107),
  • (“u”, 101),
  • (“text”, 89),
  • (“reply”, 86),
  • (“prize”, 84),
  • (“stop”, 80),
  • (“get”, 75),
  • (“send”, 63),
  • (“new”, 62),
  • (“urgent”, 62),
# I did want to try and put in this information about % of spam that contains a particular word... 
# but decided to keep focus on the modelling - and knowing that would be hard enough to cover in detail anyway!

top_words_spam = [("call", 333),
 ("free", 209),
 ("txt", 143),
 ("ur", 141),
 ("u", 126),
 ("mobile", 121),
 ("claim", 113),
 ("stop", 101),
 ("text", 101),
 ("reply", 97),
 ("prize", 92),
 ("get", 76),
 ("nokia", 65),
 ("send", 64),
 ("new", 63)]

df_top_words_spam = pd.DataFrame(top_words_spam, columns=["word", "count"])
df_top_words_spam["% present in Total"] = df_top_words_spam.apply(lambda x: round(x["count"]/len(df_words_spam)*100, 2), axis=1)
print(df_top_words_spam)
      word  count  % present in Total
0     call    333               47.30
1     free    209               29.69
2      txt    143               20.31
3       ur    141               20.03
4        u    126               17.90
5   mobile    121               17.19
6    claim    113               16.05
7     stop    101               14.35
8     text    101               14.35
9    reply     97               13.78
10   prize     92               13.07
11     get     76               10.80
12   nokia     65                9.23
13    send     64                9.09
14     new     63                8.95
top_words_spam_unique = [("call", 311),
("free", 160),
("txt", 137),
("ur", 111),
("claim", 108),
("mobile", 107),
("u", 101),
("text", 89),
("reply", 86),
("prize", 84),
("stop", 80),
("get", 75),
("send", 63),
("new", 62),
("urgent", 62)]


df_top_words_spam_unique = pd.DataFrame(top_words_spam_unique, columns=["word", "count"])
df_top_words_spam_unique["% present in Total"] = df_top_words_spam_unique.apply(lambda x: round(x["count"]/len(df_words_spam)*100, 2), axis=1)
print(df_top_words_spam_unique)
      word  count  % present in Total
0     call    311               44.18
1     free    160               22.73
2      txt    137               19.46
3       ur    111               15.77
4    claim    108               15.34
5   mobile    107               15.20
6        u    101               14.35
7     text     89               12.64
8    reply     86               12.22
9    prize     84               11.93
10    stop     80               11.36
11     get     75               10.65
12    send     63                8.95
13     new     62                8.81
14  urgent     62                8.81
# Frquency of words in ham

count = Counter()

for word_list in df_words_ham["words"]:
    for word in word_list:
        count[word] += 1

# List most common ham words
Counter(count).most_common(30)
[('.', 3644),
 (',', 1457),
 ('?', 1307),
 ('...', 1067),
 ('u', 939),
 ('!', 763),
 (';', 745),
 ('&', 724),
 ('..', 675),
 (':', 537),
 (')', 420),
 ("'s", 405),
 ("'m", 375),
 ("n't", 310),
 ('lt', 309),
 ('gt', 309),
 ('2', 287),
 ('get', 285),
 ('#', 275),
 ('go', 241),
 ('ok', 240),
 ('ur', 238),
 ('got', 236),
 ('come', 227),
 ('call', 226),
 ('good', 223),
 ("'ll", 223),
 ('know', 219),
 ('like', 215),
 ('time', 190)]
# Getting rid of duplicates within an email

# Ham

count = Counter()

for word_list in df_words_ham["words"]:
    for word in list(set(word_list)):
        count[word] += 1

Counter(count).most_common(40)
[('.', 2026),
 ('?', 1046),
 (',', 1022),
 ('...', 684),
 ('u', 660),
 ('!', 508),
 ('..', 429),
 (':', 402),
 ("'s", 368),
 ("'m", 344),
 (')', 334),
 (';', 331),
 ('&', 319),
 ('get', 265),
 ("n't", 257),
 ('2', 239),
 ('lt', 235),
 ('ok', 234),
 ('gt', 233),
 ('got', 224),
 ('go', 222),
 ("'ll", 217),
 ('call', 212),
 ('come', 212),
 ('good', 210),
 ('#', 209),
 ('know', 209),
 ('like', 200),
 ('ur', 187),
 ('time', 179),
 ('day', 176),
 ('going', 158),
 ('4', 157),
 ('home', 154),
 ('one', 149),
 ('want', 146),
 ('lor', 145),
 ('sorry', 144),
 ('-', 143),
 ('still', 143)]

Top 15 - Ham

  • (“u”, 939),
  • (“get”, 285),
  • (“go”, 241),
  • (“ok”, 240),
  • (“ur”, 238),
  • (“got”, 236),
  • (“come”, 227),
  • (“call”, 226),
  • (“good”, 223),
  • (“know”, 219),
  • (“like”, 215),
  • (“time”, 190),
  • (“day”, 188),
  • (“love”, 170),
  • (“going”, 164),

Top 15 - no duplicates - (“u”, 660), - (“get”, 265), - (“ok”, 234), - (“got”, 224), - (“go”, 222), - (“call”, 212), - (“come”, 212), - (“good”, 210), - (“know”, 209), - (“like”, 200), - (“ur”, 187), - (“time”, 179), - (“day”, 176), - (“going”, 158), - (“home”, 154)

top_words_ham = [("u", 939),
 ("get", 285),
 ("go", 241),
 ("ok", 240),
 ("ur", 238),
 ("got", 236),
 ("come", 227),
 ("call", 226),
 ("good", 223),
 ("know", 219),
 ("like", 215),
 ("time", 190),
 ("day", 188),
 ("love", 170),
 ("going", 164)]


df_top_words_ham = pd.DataFrame(top_words_ham, columns=["word", "count"])
df_top_words_ham["% present in Total"] = df_top_words_ham.apply(lambda x: round(x["count"]/len(df_words_ham)*100, 2), axis=1)
print(df_top_words_ham)
     word  count  % present in Total
0       u    939               20.21
1     get    285                6.13
2      go    241                5.19
3      ok    240                5.16
4      ur    238                5.12
5     got    236                5.08
6    come    227                4.88
7    call    226                4.86
8    good    223                4.80
9    know    219                4.71
10   like    215                4.63
11   time    190                4.09
12    day    188                4.05
13   love    170                3.66
14  going    164                3.53
top_words_ham_unique = [("u", 660),
("get", 265),
("ok", 234),
("got", 224),
("go", 222),
("call", 212),
("come", 212),
("good", 210),
("know", 209),
("like", 200),
("ur", 187),
("time", 179),
("day", 176),
("going", 158),
("home", 154)]

df_top_words_ham_unique = pd.DataFrame(top_words_ham_unique, columns=["word", "count"])
df_top_words_ham_unique["% present in Total"] = df_top_words_ham_unique.apply(lambda x: round(x["count"]/len(df_words_ham)*100, 2), axis=1)
print(df_top_words_ham_unique)
     word  count  % present in Total
0       u    660               14.20
1     get    265                5.70
2      ok    234                5.04
3     got    224                4.82
4      go    222                4.78
5    call    212                4.56
6    come    212                4.56
7    good    210                4.52
8    know    209                4.50
9    like    200                4.30
10     ur    187                4.02
11   time    179                3.85
12    day    176                3.79
13  going    158                3.40
14   home    154                3.31
def plot_top_15 (dataframe, title="TBC", fig_num="TBC", color=color, size=3.8, total_freq=False):
    #create plot for Top 15 words for spam/ham
    words = dataframe["word"]
    counts = dataframe["count"]

    y_pos = np.arange(len(dataframe["word"]))  # the label locations
    width = 0.25  # the width of the bars
    multiplier = 0

    fig, ax = plt.subplots(figsize=(size,5), constrained_layout=True)

    ax.barh(y_pos, counts, color=color)

    # Add some text for labels, title and custom x-axis tick labels, etc.
    if total_freq:
        ax.set_xlabel("Total Count")
    else:
        ax.set_xlabel("Count of Emails")
    ax.invert_yaxis() 
    ax.set_title(full_title, fontsize=15, fontweight="bold")
    ax.set_yticks(y_pos, labels = words, fontsize=16)

    #ax.set_ylim(0, 250)

    # Show the plot
    return fig
# Selected for report
# this is based on counting a word only once per email

data = df_top_words_ham_unique
title = "Ham"
fig_num = "2"

full_title = f"Fig. {fig_num} Top 15 words - {title}"

fig = plot_top_15 (data, full_title, fig_num, color_r)

plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")

plt.show()
png
png
# Selected for report
# this is based on counting a word only once per email

data = df_top_words_spam_unique
title = "Spam"
fig_num = "3"
full_title = f"Fig. {fig_num} Top 15 words - {title}"

fig = plot_top_15 (data, full_title, fig_num, color)

plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")

plt.show()
png
png
# word Frequency - allows for repetition in an email

data = df_top_words_ham
title = "Ham"
fig_num = "1.2"

full_title = f"Fig {fig_num} Top 15 words - {title}"

fig = plot_top_15 (data, full_title, fig_num, color_r, total_freq=True)

plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")

plt.show()
png
png
# word Frequency - allows for repetition in an email

data = df_top_words_spam
title = "Spam"
fig_num = "1.4"

full_title = f"Fig {fig_num} Top 15 words - {title}"


fig = plot_top_15 (data, full_title, fig_num, color, total_freq=True)

plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")

plt.show()
png
png
# Selected for Report
# based on word frquency... 

data = spam_words_only.head(10)
title = "Spam-only words"
fig_num = "4"
full_title = f"Fig. {fig_num} Top 10 {title}"
size = 4.3

plt.figure(figsize = (5,4))
fig = plot_top_15 (data, full_title, fig_num, color="darkorange", size=size, total_freq=True)

plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")

plt.show()
<Figure size 500x400 with 0 Axes>
png
png
data = ham_words_only.head(10)
title = "Ham-only words"
fig_num = "1.6"
full_title = f"Fig {fig_num} Top 10 {title}"

fig = plot_top_15 (data, full_title, fig_num, color_r, total_freq=True)

plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")

plt.show()
png
png

Data modelling

Model Training

  • Split the dataset into training and test sets, and ensure the training dataset is balanced (using SMOTE)
# Creates a randomised matrix of dimensions based :
# - feature_vectors - sms content
# - df_ham and df_spam - an array based on 0 for ham and 1 for spam

X_train, X_test, y_train, y_test = train_test_split(feature_vectors, [0] * len(df_ham) + [1] * len(df_spam), random_state = 42, test_size=0.2) 
# Balance training set using SMOTE

X_train_smote, y_train_smote = SMOTE(random_state=42).fit_resample(X_train, y_train) 

# Check the ratio of True (1) for y training set in total number of y
# should end up with 0.5...

print (y_train_smote.count(1), len(y_train_smote))
print (y_train_smote.count(1)/len(y_train_smote))
3725 7450
0.5

Apply machine learning/model approaches

Model 1 - KNN

# Make instance of KNN model and set hyperparameters

knn = KNeighborsClassifier() 
# Set hyperparameters

hyperparameters = { 
        "n_neighbors": [1, 3, 5, 9, 11], 
        "p": [1, 2] 
    } 

KNN - base (unbalanced)

#Grid search

knn_base = GridSearchCV(knn, hyperparameters, scoring="accuracy") 

knn_base.fit(X_train, y_train) 

print("Best p:", knn_base.best_estimator_.get_params()["p"]) 

print("Best n_neighbors:", knn_base.best_estimator_.get_params()["n_neighbors"]) 
Best p: 2
Best n_neighbors: 1

KNN - SMOTE (balanced)

knn_smote = GridSearchCV(knn, hyperparameters, scoring="accuracy") 
knn_smote.fit(X_train_smote, y_train_smote) 

print("Best p:", knn_smote.best_estimator_.get_params()["p"]) 

print("Best n_neighbors:", knn_smote.best_estimator_.get_params()["n_neighbors"]) 
Best p: 2
Best n_neighbors: 1

Model 2 - Decision Tree

# Make instance of Decision Tree model and set hyperparameters

# NOTE - I have removed a number of the values at extremes and in between that were used for testing, 
# so that it doesn't take so long to run when reviewing.

dt = DecisionTreeClassifier() 

hyperparameters = { 
    "min_samples_split": [2, 3, 5, 10, 15, 20], 
    "min_samples_leaf": [3, 4, 5, 6, 8], 
    "max_depth": [10, 20, 40, 60, 80, 120] 

} 

Decision Tree - base (unbalanced)

# train & evaluate

dt_base = GridSearchCV(dt, hyperparameters, scoring="accuracy").fit(X_train, y_train) 

print("Best max_depth:", dt_base.best_estimator_.get_params()["max_depth"]) 

print("Best min_samples_leaf:", dt_base.best_estimator_.get_params()["min_samples_leaf"]) 

print("Best min_samples_split:", dt_base.best_estimator_.get_params()["min_samples_split"]) 

print("Best criterion:", dt_base.best_estimator_.get_params()["criterion"]) 
Best max_depth: 20
Best min_samples_leaf: 3
Best min_samples_split: 3
Best criterion: gini

Decision Tree - SMOTE (balanced)

# train & evaluate
# setting hyperparameter variables here for use for tree map at end. 

dt_smote = GridSearchCV(dt, hyperparameters, scoring="accuracy").fit(X_train_smote, y_train_smote) 

print("Best max_depth:", max_depth := dt_smote.best_estimator_.get_params()["max_depth"]) 

print("Best min_samples_leaf:", min_samples_leaf := dt_smote.best_estimator_.get_params()["min_samples_leaf"]) 

print("Best min_samples_split:", min_samples_split := dt_smote.best_estimator_.get_params()["min_samples_split"]) 

print("Best criterion:", criterion := dt_smote.best_estimator_.get_params()["criterion"]) 
Best max_depth: 120
Best min_samples_leaf: 3
Best min_samples_split: 2
Best criterion: gini

Model Evaluation

Functions



def plot_confusion_matrix (model, y_predicted, title="TBC", fig_num="TBC"):
    # Confusion Matrix
    plt.figure(figsize=(4,4))
    cm = confusion_matrix(y_test, y_predicted, labels=model.classes_, ) 
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_) 
    disp.plot(cmap="YlOrBr_r") 

    plt.title(f"Fig. {fig_num} {title}", fontsize=20, fontweight="bold")
    plt.xlabel("Predicted", fontsize=14)
    plt.ylabel("True label", fontsize=14)
    
    return disp 

def calc_accuracy (model_name, y_predicted):
    # Accuracy
    results.loc[model_name,"accuracy"] = accuracy_score(y_test,y_predicted) 
    return results.loc[model_name,"accuracy"]

def calc_balanced_accuracy (model_name, y_predicted):
    # Balanced Accuracy
    results.loc[model_name,"balanced_accuracy"] = balanced_accuracy_score(y_test,y_predicted) 
    return results.loc[model_name,"balanced_accuracy"]

def calc_training_time (model, model_name):
    # Training Time
    results.loc[model_name,"training_time"] = model.cv_results_["mean_fit_time"].mean()
    return results.loc[model_name,"training_time"]

def calc_prediction_time (model, model_name):
    # Training Time
    results.loc[model_name,"prediction_time"] = model.cv_results_["mean_score_time"].mean()/len(y_test)
    return results.loc[model_name,"prediction_time"]


Metrics for KNN evaluation - Confusion Matrix and metric calculations

# Create empty results table

results = pd.DataFrame()
model = knn_base
model_name = "KNN"

y_predicted = model.predict(X_test)

Metrics for KNN evaluation

# Confusion Matrix

title = "KNN (unbalanced)"
fig_num = "5"

disp = plot_confusion_matrix(model, y_predicted, title=title, fig_num=fig_num)

plt.savefig(f"Fig. {fig_num} {title}.png", dpi=300, transparent=True, bbox_inches = "tight")
 
<Figure size 400x400 with 0 Axes>
png
png
# KNN_base results
# Accuracy
accuracy = calc_accuracy(model_name, y_predicted)

# Balanced Accuracy 
balanced = calc_balanced_accuracy(model_name, y_predicted)

# Training Time
train_time = calc_training_time(model, model_name)

# Prediction Time
pred_time = calc_prediction_time(model, model_name)

print (f"Accuracy: {accuracy:.10f}")
print (f"Balanced Accuracy: {balanced:.10f}")
print (f"Training Time: {train_time:.10f}")
print (f"Prediction Time: {pred_time:.10f}")
Accuracy: 0.9187675070
Balanced Accuracy: 0.7136805020
Training Time: 0.0025524473
Prediction Time: 0.0001030172
model = knn_smote
model_name = "KNN-SMOTE"

y_predicted = model.predict(X_test)
# Confusion Matrix

title = "KNN (SMOTE)"
fig_num = "7"

disp = plot_confusion_matrix(model, y_predicted, title=title, fig_num=fig_num)


plt.savefig(f"Fig. {fig_num} {title}.png", dpi=300, transparent=True, bbox_inches = "tight")
<Figure size 400x400 with 0 Axes>
png
png
# KNN_SMOTE results
# Accuracy
accuracy = calc_accuracy(model_name, y_predicted)

# Balanced Accuracy 
balanced = calc_balanced_accuracy(model_name, y_predicted)

# Training Time
train_time = calc_training_time(model, model_name)

# Prediction Time
pred_time = calc_prediction_time(model, model_name)

print (f"Accuracy: {accuracy:.10f}")
print (f"Balanced Accuracy: {balanced:.10f}")
print (f"Training Time: {train_time:.10f}")
print (f"Prediction Time: {pred_time:.10f}")
Accuracy: 0.8085901027
Balanced Accuracy: 0.7312779339
Training Time: 0.0037259293
Prediction Time: 0.0002646201
model = dt_base
model_name = "DT"

y_predicted = model.predict(X_test)
# Confusion Matrix

title = "Decision Tree (unbalanced)"
fig_num = "6"

disp = plot_confusion_matrix(model, y_predicted, title=title, fig_num=fig_num)


plt.savefig(f"Fig. {fig_num} {title}.png", dpi=300, transparent=True, bbox_inches = "tight")
<Figure size 400x400 with 0 Axes>
png
png
# DT_BASE results
# Accuracy
accuracy = calc_accuracy(model_name, y_predicted)

# Balanced Accuracy 
balanced = calc_balanced_accuracy(model_name, y_predicted)

# Training Time
train_time = calc_training_time(model, model_name)

# Prediction Time
pred_time = calc_prediction_time(model, model_name)

print (f"Accuracy: {accuracy:.10f}")
print (f"Balanced Accuracy: {balanced:.10f}")
print (f"Training Time: {train_time:.10f}")
print (f"Prediction Time: {pred_time:.10f}")
Accuracy: 0.8832866480
Balanced Accuracy: 0.6424318304
Training Time: 0.1051164484
Prediction Time: 0.0000014010
model = dt_smote
model_name = "DT-SMOTE"

y_predicted = model.predict(X_test)
# Confusion Matrix

title = "Decision Tree (SMOTE)"
fig_num = "8"

disp = plot_confusion_matrix(model, y_predicted, title=title, fig_num=fig_num)

plt.savefig(f"Fig. {fig_num} {title}.png", dpi=300, transparent=True, bbox_inches = "tight")
<Figure size 400x400 with 0 Axes>
png
png
# DT_SMOTE results
# Accuracy
accuracy = calc_accuracy(model_name, y_predicted)

# Balanced Accuracy 
balanced = calc_balanced_accuracy(model_name, y_predicted)

# Training Time
train_time = calc_training_time(model, model_name)

# Prediction Time
pred_time = calc_prediction_time(model, model_name)

print (f"Accuracy: {accuracy:.10f}")
print (f"Balanced Accuracy: {balanced:.10f}")
print (f"Training Time: {train_time:.10f}")
print (f"Prediction Time: {pred_time:.10f}")
Accuracy: 0.6937441643
Balanced Accuracy: 0.5886131695
Training Time: 0.1291490380
Prediction Time: 0.0000019057

Results

results.index
Index(['KNN', 'KNN-SMOTE', 'DT', 'DT-SMOTE'], dtype='object')
# show collated results
results.style.highlight_max(color=color, axis=0).highlight_min(color=color_r, axis=0)
png
png
#Accuracy & balanced accuracy
fig_num="9"
labels = list(results.index)

fig = results[["accuracy","balanced_accuracy"]].plot(kind="bar", color=colors_r, figsize=(8,5)) 

plt.title(f"Fig. {fig_num} Accuracy & Balanced Accuracy by Model"
         , fontweight="bold"
         , fontsize=14)


plt.xticks(rotation=0)
plt.ylabel("Percentage", fontsize=14)
plt.legend()

plt.savefig(f"Fig. {fig_num} Accuracy & Balanced Accuracy by Model.png", dpi=300, transparent=True, bbox_inches = "tight")
png
png
# Training Time
fig_num="10" 

results["training_time"].plot(kind="bar", color=color, figsize=(7,3)) 

plt.title(f"Fig. {fig_num} Training Time by Model"
         , fontweight="bold"
         , fontsize=14)

plt.xticks(rotation=0)
plt.ylabel("Training Time", fontsize=14)

plt.savefig(f"Fig. {fig_num} Training Time by Model.png", dpi=300, transparent=True, bbox_inches = "tight")
png
png
# Prediction Time

fig_num="11"

results["prediction_time"].plot(kind="bar", color=color, figsize=(7,3)) 

plt.title(f"Fig. {fig_num} Prediction Time by Model"
         , fontweight="bold"
         , fontsize=14)

plt.xticks(rotation=0)
plt.ylabel("Prediction Time", fontsize=14)

plt.savefig(f"Fig. {fig_num} Prediction Time by Model.png", dpi=300, transparent=True, bbox_inches = "tight")
png
png

Classification Report (including F1-score)

all_words["type"].shape
(9741,)
df_words["check"] = df_words["spam"].replace({True: "Spam", False: "Ham"}).astype(str)
# Classification report for KNN (unbalanced)

y_predicted = knn_base.fit(X_train, y_train).predict(X_test)

report = classification_report(y_test, y_predicted, target_names= df_words["check"].unique()) #(all_words["type"].unique()))
print(report)
              precision    recall  f1-score   support

         Ham       0.92      1.00      0.95       922
        Spam       0.97      0.43      0.60       149

    accuracy                           0.92      1071
   macro avg       0.94      0.71      0.78      1071
weighted avg       0.92      0.92      0.90      1071
# Classification report for Decision Tree (unbalanced)

y_predicted = dt_base.fit(X_train, y_train).predict(X_test)

report = classification_report(y_test, y_predicted, target_names= df_words["check"].unique()) #(all_words["type"].unique()))
print(report)
              precision    recall  f1-score   support

         Ham       0.90      0.98      0.94       922
        Spam       0.70      0.32      0.44       149

    accuracy                           0.89      1071
   macro avg       0.80      0.65      0.69      1071
weighted avg       0.87      0.89      0.87      1071
# Classification report for Decision Tree (SMOTE)

y_predicted = dt_smote.fit(X_train_smote, y_train_smote).predict(X_test)

report = classification_report(y_test, y_predicted, target_names= df_words["check"].unique()) #(all_words["type"].unique()))
print(report)
              precision    recall  f1-score   support

         Ham       0.89      0.75      0.81       922
        Spam       0.22      0.45      0.30       149

    accuracy                           0.71      1071
   macro avg       0.56      0.60      0.56      1071
weighted avg       0.80      0.71      0.74      1071
# Classification report for KNN (SMOTE)


y_predicted = knn_smote.fit(X_train_smote, y_train_smote).predict(X_test)

report = classification_report(y_test, y_predicted, target_names= df_words["check"].unique()) #(all_words["type"].unique()))
print(report)

              precision    recall  f1-score   support

         Ham       0.93      0.84      0.88       922
        Spam       0.38      0.62      0.48       149

    accuracy                           0.81      1071
   macro avg       0.66      0.73      0.68      1071
weighted avg       0.86      0.81      0.83      1071

Tree chart

Commented out for this HTML document due to output length

Trying just out of curiosoity - not included in final report.
Parameters have been set from above based on best_estimator.get_params

print (criterion, max_depth, min_samples_split, min_samples_leaf)
gini 120 2 3
# dt_tree = DecisionTreeClassifier(criterion=criterion
#                            , max_depth=max_depth
#                            , min_samples_split=min_samples_split
#                            , min_samples_leaf=min_samples_leaf)

# dt_tree = dt_tree.fit(X_train_smote, y_train_smote)

# plt.figure(figsize=(15,50))


# tree.plot_tree(decision_tree=dt_tree, class_names=all_words["type"],\
#              filled=True, rounded=True)


Copyright © 2023 Adam Simmons, Inc. All rights reserved.