February 21, 2023


For my third and final RMIT Practical Data Science with Python assignment, the task was to help a fictitious company, Connect 5G, address a growing spam problem impacting their customers. Two machine-learning classification models, K-Nearest Neighbours (KNN) and Decision Tree, were built, tested and compared. Recommendations were made based on the results, taking prediction time into consideration given the needs of their customers.

An important part of this exercise for me was to understand not just how to implement each model, but also the theory behind each method. I experimented with a variety of parameter values for each method, including some extreme ones, just to get a feel for how they impacted the models, which helped me learn more about how each method works. It is this kind of iterative process that I enjoy about coding in Python: when set up properly, everything can be re-run with a minimum of fuss.


Below is the Python code used for the data wrangling, including tokenisation, removal of stopwords and balancing of the data with SMOTE, followed by building, testing and comparing the K-Nearest Neighbours and Decision Tree models.


Practical Data Science with Python Assessment Task 3: Code for data modelling presentation

Guiding questions (from case study) to be addressed in presentation:

Set Up

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import GridSearchCV 
from sklearn.metrics import accuracy_score,balanced_accuracy_score,confusion_matrix, ConfusionMatrixDisplay, classification_report 

from imblearn.over_sampling import SMOTE 

from collections import Counter

import ssl
import nltk 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
pd.set_option("display.max_rows", 100)


# set style for plots

plt.style.use("seaborn-v0_8-white")  # "seaborn-white" is deprecated since Matplotlib 3.6


# setting colours for plots

color ="orange"
color_r = "navajowhite"
colors = ["orange", "navajowhite"]
colors_r = list(reversed(colors))

Load Data

df = pd.read_csv("A3_sms.csv", encoding="utf8")
df
Unnamed: 0 sms spam Unnamed: 3
0 0
  1. Tension face 2. Smiling face 3. Waste face 4. Innocent face 5.Terror face 6.Cruel face 7.Romantic face 8.Lovable face 9.decent face &lt;#&gt; .joker face.
False NaN
1 1 Hhahhaahahah rofl was leonardo in your room or something False NaN
2 4 Oh for sake she’s in like False NaN
3 5 No da:)he is stupid da..always sending like this:)don believe any of those message.pandy is a :) False NaN
4 6 Lul im gettin some juicy gossip at the hospital. Oyea. False NaN
5346 5348 Congratulations! Thanks to a good friend U have WON the £2,000 Xmas prize. 2 claim is easy, just call 08718726971 NOW! Only 10p per minute. BT-national-rate. True NaN
5347 5349 Congratulations - Thanks to a good friend U have WON the £2,000 Xmas prize. 2 claim is easy, just call 08712103738 NOW! Only 10p per minute. BT-national-rate True NaN
5348 5350 URGENT! Your mobile number *************** WON a £2000 Bonus Caller prize on 10/06/03! This is the 2nd attempt to reach you! Call 09066368753 ASAP! Box 97N7QP, 150ppm True NaN
5349 5351 URGENT! Your Mobile No was awarded a £2,000 Bonus Caller Prize on 1/08/03! This is our 2nd attempt to contact YOU! Call 0871-4719-523 BOX95QU BT National Rate True NaN
5350 5352 Do whatever you want. You know what the rules are. We had a talk earlier this week about what had to start happening, you showing responsibility. Yet, every week it’s can i bend the rule this way? What about that way? Do whatever. I’m tired of having thia same argument with you every week. And a &lt;#&gt; movie DOESNT inlude the previews. You’re still getting in after 1. False NaN
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5351 entries, 0 to 5350
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  5351 non-null   int64 
 1   sms         5351 non-null   object
 2   spam        5351 non-null   bool  
 3   Unnamed: 3  49 non-null     object
dtypes: bool(1), int64(1), object(2)
memory usage: 130.8+ KB
df.shape
(5351, 4)
df.nunique()
Unnamed: 0    5351
sms           4948
spam             2
Unnamed: 3       2
dtype: int64
# Looking at values in each column for any issues

for col in df.columns:
    print (f"Column \"{col}\" values: ")
    print (df.loc[:, col].unique(), "\n")
Column "Unnamed: 0" values: 
[   0    1    4 ... 5350 5351 5352] 

Column "sms" values: 
['1. Tension face 2. Smiling face 3. Waste face 4. Innocent face 5.Terror face 6.Cruel face 7.Romantic face 8.Lovable face 9.decent face &lt;#&gt; .joker face.'
 'Hhahhaahahah rofl was leonardo in your room or something'
 "Oh for  sake she's in like " ...
 'URGENT! Your mobile number *************** WON a £2000 Bonus Caller prize on 10/06/03! This is the 2nd attempt to reach you! Call 09066368753 ASAP! Box 97N7QP, 150ppm'
 'URGENT! Your Mobile No was awarded a £2,000 Bonus Caller Prize on 1/08/03! This is our 2nd attempt to contact YOU! Call 0871-4719-523 BOX95QU BT National Rate'
 "Do whatever you want. You know what the rules are. We had a talk earlier this week about what had to start happening, you showing responsibility. Yet, every week it's can i bend the rule this way? What about that way? Do whatever. I'm tired of having thia same argument with you every week. And a  &lt;#&gt;  movie DOESNT inlude the previews. You're still getting in after 1."] 

Column "spam" values: 
[False  True] 

Column "Unnamed: 3" values: 
[nan '********' '\\/\\/\\/\\/\\/'] 
# Column 0 - basically an index number - not required for this project
# Column 1 - sms content - what our models will be using
# Column 2 - spam marker - target class for evaluating model accuracy
# Column 3 - unknown - possibly de-identification of numbers - not required for this project

# No missing values
# No need to change data types
# SMS text is what it is, typos and all (especially as typos can be a spam indicator)... just make case consistent (lower)
# Spam - boolean - fine, no typos
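Although the models below simply ignore them, the two unneeded columns could also be dropped explicitly at this point. A minimal sketch using a toy frame with the same column layout (`df_toy` and `df_model` are hypothetical names):

```python
import pandas as pd

# Toy frame with the same column layout as A3_sms.csv
df_toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "sms": ["hello there", "WIN a prize now"],
    "spam": [False, True],
    "Unnamed: 3": [None, None],
})

# Keep only the columns the models actually use
df_model = df_toy.drop(columns=["Unnamed: 0", "Unnamed: 3"])
print(list(df_model.columns))  # ['sms', 'spam']
```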

Clean/prepare data

Conversion to lower-case

df_preparation = df["sms"].str.lower()
df_preparation.head()
0    1. tension face 2. smiling face 3. waste face ...
1    hhahhaahahah rofl was leonardo in your room or...
2                          oh for  sake she's in like 
3    no da:)he is stupid da..always sending like th...
4    lul im gettin some juicy gossip at the hospita...
Name: sms, dtype: object

Tokenisation of SMS messages into words:

# Disable SSL certificate check so the NLTK package "punkt" can be downloaded
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass 
else:
    ssl._create_default_https_context = _create_unverified_https_context
    
# load tokens (words) 

nltk.download("punkt") 
[nltk_data] Downloading package punkt to /Users/adam/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
df_preparation = [word_tokenize(sms) for sms in df_preparation]
df_preparation[0] #print first list item
['1.',
 'tension',
 'face',
 '2.',
 'smiling',
 'face',
 '3.',
 'waste',
 'face',
 '4.',
 'innocent',
 'face',
 '5.terror',
 'face',
 '6.cruel',
 'face',
 '7.romantic',
 'face',
 '8.lovable',
 'face',
 '9.decent',
 'face',
 '&',
 'lt',
 ';',
 '#',
 '&',
 'gt',
 ';',
 '.joker',
 'face',
 '.']

Removing stopwords

nltk.download("stopwords") 
[nltk_data] Downloading package stopwords to /Users/adam/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True
list_stopwords=stopwords.words("english") 

list_stopwords[0:10] 
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
df_preparation= [[word for word in sms if not word in list_stopwords] for sms in df_preparation] 

print(df_preparation[0:4]) 
[['1.', 'tension', 'face', '2.', 'smiling', 'face', '3.', 'waste', 'face', '4.', 'innocent', 'face', '5.terror', 'face', '6.cruel', 'face', '7.romantic', 'face', '8.lovable', 'face', '9.decent', 'face', '&', 'lt', ';', '#', '&', 'gt', ';', '.joker', 'face', '.'], ['hhahhaahahah', 'rofl', 'leonardo', 'room', 'something'], ['oh', 'sake', "'s", 'like'], ['da', ':', ')', 'stupid', 'da', '..', 'always', 'sending', 'like', ':', ')', 'believe', 'message.pandy', ':', ')']]
df_words = df.copy()
df_words["words"] = df_preparation
# dataset for checking most frequent words

df_words.head()
Unnamed: 0 sms spam Unnamed: 3 words
0 0
  1. Tension face 2. Smiling face 3. Waste face 4. Innocent face 5.Terror face 6.Cruel face 7.Romantic face 8.Lovable face 9.decent face &lt;#&gt; .joker face.
False NaN [1., tension, face, 2., smiling, face, 3., waste, face, 4., innocent, face, 5.terror, face, 6.cruel, face, 7.romantic, face, 8.lovable, face, 9.decent, face, &, lt, ;, #, &, gt, ;, .joker, face, .]
1 1 Hhahhaahahah rofl was leonardo in your room or something False NaN [hhahhaahahah, rofl, leonardo, room, something]
2 4 Oh for sake she’s in like False NaN [oh, sake, ’s, like]
3 5 No da:)he is stupid da..always sending like this:)don believe any of those message.pandy is a :) False NaN [da, :, ), stupid, da, .., always, sending, like, :, ), believe, message.pandy, :, )]
4 6 Lul im gettin some juicy gossip at the hospital. Oyea. False NaN [lul, im, gettin, juicy, gossip, hospital, ., oyea, .]

Explore Data

Check if balanced

len(df[df["spam"]==True])/len(df)
0.13156419360867128
df_spam = df[df["spam"]==True]
df_ham = df[df["spam"]==False]

shape_spam = df_spam.shape
shape_ham = df_ham.shape

total = df.shape[0]
num_spam = shape_spam[0] 
num_ham = shape_ham[0]


print (f"Spam: {shape_spam[0]} - {round(num_spam/total * 100, 2)}%")
print (f"Ham: {shape_ham[0]} - {round(num_ham/total * 100, 2)}%")
Spam: 704 - 13.16%
Ham: 4647 - 86.84%

Unbalanced: Spam 704 (approx. 13%), Ham 4647 (approx. 87%)

labels = ["Spam\n ("+str(shape_spam[0])+")", "Ham  \n("+str(shape_ham[0])+")"]
counts = [shape_spam[0], shape_ham[0]]

title = "Fig. 1 Spam vs Ham"

    
plt.pie(counts
          , labels = labels
          , colors = colors_r
          , counterclock = True
          , startangle = 0
          , labeldistance = 1.2
          , pctdistance = 0.65
          , autopct = lambda p: f"{int(p)}%"
          , textprops={"fontsize": 16}  
       )

plt.title(f"{title}", fontsize=18, fontweight="bold")



plt.savefig(f"Fig. 1 {title}.png", dpi=300, transparent=True, bbox_inches = "tight")



plt.show()

Extract features - Count Vectorizer

Feature extraction

# Feature extraction

CountVec = CountVectorizer(lowercase=True,analyzer="word",stop_words="english") 
# Get feature vectors

feature_vectors = CountVec.fit_transform(df["sms"]) 
# Prints all strings - commented out for purposes of length
# for x in CountVec.get_feature_names_out():
#    print(x, end=", ")
feature_vectors
<5351x8011 sparse matrix of type '<class 'numpy.int64'>'
    with 40832 stored elements in Compressed Sparse Row format>

Identify most common words for spam and ham SMS messages

# Investigate word frequency

# Create "spam" and "ham" subsets (using the dataset with tokenised messages)

#filter
df_words_spam = df_words[df_words["spam"]==True]
df_words_ham = df_words[df_words["spam"]==False]


# Get the most frequent words for each subset

word_counts_spam = df_words_spam["words"].apply(pd.Series).stack().value_counts()
word_counts_ham = df_words_ham["words"].apply(pd.Series).stack().value_counts()
word_counts_spam.head(35)
.             846
!             508
,             351
call          333
free          209
&             165
?             163
:             161
2             160
txt           143
ur            141
u             126
mobile        121
*             113
claim         113
4             110
text          101
stop          101
reply          97
prize          92
get            76
nokia          65
send           64
's             63
urgent         63
new            63
cash           62
win            60
)              58
contact        56
please         54
week           52
-              52
guaranteed     50
service        49
dtype: int64
word_counts_ham.head(35)
.        3644
,        1457
?        1307
...      1067
u         939
!         763
;         745
&         724
..        675
:         537
)         420
's        405
'm        375
n't       310
gt        309
lt        309
2         287
get       285
#         275
go        241
ok        240
ur        238
got       236
come      227
call      226
'll       223
good      223
know      219
like      215
time      190
day       188
-         171
love      170
4         165
going     164
dtype: int64
# Checking the number of unique words in spam and ham

word_counts_spam.shape, word_counts_ham.shape
((2808,), (6933,))
# convert to pandas datasets for joining/filtering

word_counts_spam = pd.DataFrame(word_counts_spam).reset_index()
word_counts_ham = pd.DataFrame(word_counts_ham).reset_index()
# setting "type" to be able to filter later by "spam" and "ham"

word_counts_spam ["type"] = "spam"
word_counts_spam
index 0 type
0 . 846 spam
1 ! 508 spam
2 , 351 spam
3 call 333 spam
4 free 209 spam
2803 08704439680. 1 spam
2804 passes 1 spam
2805 lounge 1 spam
2806 airport 1 spam
2807 0871-4719-523 1 spam
word_counts_ham ["type"] = "ham"
word_counts_ham
index 0 type
0 . 3644 ham
1 , 1457 ham
2 ? 1307 ham
3 1067 ham
4 u 939 ham
6928 andre 1 ham
6929 virgil 1 ham
6930 dismay 1 ham
6931 enjoying 1 ham
6932 previews 1 ham
len(df_words_spam), len(df_words_ham)
(704, 4647)
all_words = pd.concat([word_counts_spam, word_counts_ham], axis=0).rename(columns={"index": "word", 0: "count"}).reset_index(drop=True)
all_words
word count type
0 . 846 spam
1 ! 508 spam
2 , 351 spam
3 call 333 spam
4 free 209 spam
9736 andre 1 ham
9737 virgil 1 ham
9738 dismay 1 ham
9739 enjoying 1 ham
9740 previews 1 ham
all_words.shape
(9741, 3)

#remove all duplicates (keep neither) to keep only unique words
ham_or_spam = all_words.drop_duplicates(subset=["word"], keep=False)

#remove all words only found in "ham" - keep "spam"
spam_words_only = ham_or_spam[ham_or_spam["type"]=="spam"].reset_index(drop=True)

spam_words_only
word count type
0 claim 113 spam
1 prize 92 spam
2 guaranteed 50 spam
3 tone 48 spam
4 cs 41 spam
1903 villa 1 spam
1904 someonone 1 spam
1905 08704439680. 1 spam
1906 passes 1 spam
1907 0871-4719-523 1 spam
print (spam_words_only.head(30))
          word  count  type
0        claim    113  spam
1        prize     92  spam
2   guaranteed     50  spam
3         tone     48  spam
4           cs     41  spam
5      awarded     38  spam
6       â£1000     35  spam
7       150ppm     34  spam
8     ringtone     29  spam
9   collection     26  spam
10       tones     26  spam
11       entry     25  spam
12         16+     25  spam
13      weekly     24  spam
14         mob     23  spam
15       valid     23  spam
16         500     23  spam
17       â£100     22  spam
18        150p     21  spam
19         sae     21  spam
20    delivery     21  spam
21        8007     21  spam
22       bonus     21  spam
23    vouchers     20  spam
24      â£2000     20  spam
25      â£5000     20  spam
26       86688     19  spam
27          18     19  spam
28       â£500     19  spam
29         750     18  spam
ham_words_only = ham_or_spam[ham_or_spam["type"]=="ham"].reset_index(drop=True)

ham_words_only
word count type
0 gt 309 ham
1 lt 309 ham
2 lor 162 ham
3 da 137 ham
4 later 130 ham
6028 andre 1 ham
6029 virgil 1 ham
6030 dismay 1 ham
6031 enjoying 1 ham
6032 previews 1 ham
print (ham_words_only.head(30))
         word  count type
0          gt    309  ham
1          lt    309  ham
2         lor    162  ham
3          da    137  ham
4       later    130  ham
5          ã¼    120  ham
6       happy    104  ham
7         amp     88  ham
8        work     88  ham
9         ask     88  ham
10       said     79  ham
11        lol     74  ham
12   anything     73  ham
13        cos     72  ham
14    morning     71  ham
15       sure     68  ham
16  something     65  ham
17        gud     63  ham
18      thing     58  ham
19       feel     56  ham
20        gon     56  ham
21        dun     55  ham
22       went     54  ham
23      sleep     54  ham
24     always     54  ham
25       told     52  ham
26         㜠    52  ham
27       nice     51  ham
28       haha     51  ham
29        thk     50  ham
# Frequency of words in spam

count = Counter()
for word_list in df_words_spam["words"]:
    for word in word_list:
        count[word] += 1
        
# List most common 
Counter(count).most_common(30)
[('.', 846),
 ('!', 508),
 (',', 351),
 ('call', 333),
 ('free', 209),
 ('&', 165),
 ('?', 163),
 (':', 161),
 ('2', 160),
 ('txt', 143),
 ('ur', 141),
 ('u', 126),
 ('mobile', 121),
 ('*', 113),
 ('claim', 113),
 ('4', 110),
 ('stop', 101),
 ('text', 101),
 ('reply', 97),
 ('prize', 92),
 ('get', 76),
 ('nokia', 65),
 ('send', 64),
 ("'s", 63),
 ('new', 63),
 ('urgent', 63),
 ('cash', 62),
 ('win', 60),
 (')', 58),
 ('contact', 56)]
# Remove duplicates within each message,
# to count how many messages contain a particular word (no repetition)

# Spam

count = Counter()

for word_list in df_words_spam["words"]:
    for word in list(set(word_list)):   # "set" removes any duplicates within the message before counting
        count[word] += 1

Counter(count).most_common(40)
[('.', 454),
 ('!', 341),
 ('call', 311),
 (',', 208),
 ('free', 160),
 ('&', 138),
 ('txt', 137),
 (':', 130),
 ('2', 125),
 ('?', 123),
 ('ur', 111),
 ('claim', 108),
 ('mobile', 107),
 ('4', 101),
 ('u', 101),
 ('text', 89),
 ('reply', 86),
 ('prize', 84),
 ('stop', 80),
 ('get', 75),
 ('send', 63),
 ('new', 62),
 ('urgent', 62),
 ('cash', 61),
 ('win', 60),
 ('contact', 56),
 ('please', 54),
 ("'s", 51),
 ('-', 50),
 ('guaranteed', 50),
 ('customer', 49),
 ('nokia', 49),
 (')', 48),
 ('service', 48),
 ('*', 47),
 ('c', 45),
 ('week', 45),
 ('(', 44),
 ('tone', 41),
 ('cs', 41)]

Top 15 words - Spam

  • (“call”, 333)
  • (“free”, 209),
  • (“txt”, 143),
  • (“ur”, 141),
  • (“u”, 126),
  • (“mobile”, 121),
  • (“claim”, 113),
  • (“stop”, 101),
  • (“text”, 101),
  • (“reply”, 97),
  • (“prize”, 92),
  • (“get”, 76),
  • (“nokia”, 65),
  • (“send”, 64),
  • (“new”, 63),

Top 15 words - Spam - no duplicates

  • (“call”, 311),
  • (“free”, 160),
  • (“txt”, 137),
  • (“ur”, 111),
  • (“claim”, 108),
  • (“mobile”, 107),
  • (“u”, 101),
  • (“text”, 89),
  • (“reply”, 86),
  • (“prize”, 84),
  • (“stop”, 80),
  • (“get”, 75),
  • (“send”, 63),
  • (“new”, 62),
  • (“urgent”, 62),
# I did want to try and put in this information about % of spam that contains a particular word... 
# but decided to keep focus on the modelling - and knowing that would be hard enough to cover in detail anyway!

top_words_spam = [("call", 333),
 ("free", 209),
 ("txt", 143),
 ("ur", 141),
 ("u", 126),
 ("mobile", 121),
 ("claim", 113),
 ("stop", 101),
 ("text", 101),
 ("reply", 97),
 ("prize", 92),
 ("get", 76),
 ("nokia", 65),
 ("send", 64),
 ("new", 63)]

df_top_words_spam = pd.DataFrame(top_words_spam, columns=["word", "count"])
df_top_words_spam["% present in Total"] = df_top_words_spam.apply(lambda x: round(x["count"]/len(df_words_spam)*100, 2), axis=1)
print(df_top_words_spam)
      word  count  % present in Total
0     call    333               47.30
1     free    209               29.69
2      txt    143               20.31
3       ur    141               20.03
4        u    126               17.90
5   mobile    121               17.19
6    claim    113               16.05
7     stop    101               14.35
8     text    101               14.35
9    reply     97               13.78
10   prize     92               13.07
11     get     76               10.80
12   nokia     65                9.23
13    send     64                9.09
14     new     63                8.95
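Rather than hard-coding the tuples as above, the same table could be built straight from the Counter. A sketch with a toy stand-in for the tokenised word-list column (the real df_words_spam["words"] would drop in the same way):

```python
from collections import Counter
import pandas as pd

# Toy stand-in for df_words_spam["words"]: lists of tokens per message
word_lists = [["call", "free", "call"], ["free", "prize"], ["call"]]

count = Counter()
for word_list in word_lists:
    count.update(word_list)

# Build the table directly from the most common words
top = pd.DataFrame(count.most_common(15), columns=["word", "count"])
top["% present in Total"] = round(top["count"] / len(word_lists) * 100, 2)
print(top)
```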
top_words_spam_unique = [("call", 311),
("free", 160),
("txt", 137),
("ur", 111),
("claim", 108),
("mobile", 107),
("u", 101),
("text", 89),
("reply", 86),
("prize", 84),
("stop", 80),
("get", 75),
("send", 63),
("new", 62),
("urgent", 62)]


df_top_words_spam_unique = pd.DataFrame(top_words_spam_unique, columns=["word", "count"])
df_top_words_spam_unique["% present in Total"] = df_top_words_spam_unique.apply(lambda x: round(x["count"]/len(df_words_spam)*100, 2), axis=1)
print(df_top_words_spam_unique)
      word  count  % present in Total
0     call    311               44.18
1     free    160               22.73
2      txt    137               19.46
3       ur    111               15.77
4    claim    108               15.34
5   mobile    107               15.20
6        u    101               14.35
7     text     89               12.64
8    reply     86               12.22
9    prize     84               11.93
10    stop     80               11.36
11     get     75               10.65
12    send     63                8.95
13     new     62                8.81
14  urgent     62                8.81
# Frequency of words in ham

count = Counter()

for word_list in df_words_ham["words"]:
    for word in word_list:
        count[word] += 1

# List most common ham words
Counter(count).most_common(30)
[('.', 3644),
 (',', 1457),
 ('?', 1307),
 ('...', 1067),
 ('u', 939),
 ('!', 763),
 (';', 745),
 ('&', 724),
 ('..', 675),
 (':', 537),
 (')', 420),
 ("'s", 405),
 ("'m", 375),
 ("n't", 310),
 ('lt', 309),
 ('gt', 309),
 ('2', 287),
 ('get', 285),
 ('#', 275),
 ('go', 241),
 ('ok', 240),
 ('ur', 238),
 ('got', 236),
 ('come', 227),
 ('call', 226),
 ('good', 223),
 ("'ll", 223),
 ('know', 219),
 ('like', 215),
 ('time', 190)]
# Removing duplicates within each message

# Ham

count = Counter()

for word_list in df_words_ham["words"]:
    for word in list(set(word_list)):
        count[word] += 1

Counter(count).most_common(40)
[('.', 2026),
 ('?', 1046),
 (',', 1022),
 ('...', 684),
 ('u', 660),
 ('!', 508),
 ('..', 429),
 (':', 402),
 ("'s", 368),
 ("'m", 344),
 (')', 334),
 (';', 331),
 ('&', 319),
 ('get', 265),
 ("n't", 257),
 ('2', 239),
 ('lt', 235),
 ('ok', 234),
 ('gt', 233),
 ('got', 224),
 ('go', 222),
 ("'ll", 217),
 ('call', 212),
 ('come', 212),
 ('good', 210),
 ('#', 209),
 ('know', 209),
 ('like', 200),
 ('ur', 187),
 ('time', 179),
 ('day', 176),
 ('going', 158),
 ('4', 157),
 ('home', 154),
 ('one', 149),
 ('want', 146),
 ('lor', 145),
 ('sorry', 144),
 ('-', 143),
 ('still', 143)]

Top 15 - Ham

  • (“u”, 939),
  • (“get”, 285),
  • (“go”, 241),
  • (“ok”, 240),
  • (“ur”, 238),
  • (“got”, 236),
  • (“come”, 227),
  • (“call”, 226),
  • (“good”, 223),
  • (“know”, 219),
  • (“like”, 215),
  • (“time”, 190),
  • (“day”, 188),
  • (“love”, 170),
  • (“going”, 164),

Top 15 - Ham - no duplicates

  • (“u”, 660),
  • (“get”, 265),
  • (“ok”, 234),
  • (“got”, 224),
  • (“go”, 222),
  • (“call”, 212),
  • (“come”, 212),
  • (“good”, 210),
  • (“know”, 209),
  • (“like”, 200),
  • (“ur”, 187),
  • (“time”, 179),
  • (“day”, 176),
  • (“going”, 158),
  • (“home”, 154)

top_words_ham = [("u", 939),
 ("get", 285),
 ("go", 241),
 ("ok", 240),
 ("ur", 238),
 ("got", 236),
 ("come", 227),
 ("call", 226),
 ("good", 223),
 ("know", 219),
 ("like", 215),
 ("time", 190),
 ("day", 188),
 ("love", 170),
 ("going", 164)]


df_top_words_ham = pd.DataFrame(top_words_ham, columns=["word", "count"])
df_top_words_ham["% present in Total"] = df_top_words_ham.apply(lambda x: round(x["count"]/len(df_words_ham)*100, 2), axis=1)
print(df_top_words_ham)
     word  count  % present in Total
0       u    939               20.21
1     get    285                6.13
2      go    241                5.19
3      ok    240                5.16
4      ur    238                5.12
5     got    236                5.08
6    come    227                4.88
7    call    226                4.86
8    good    223                4.80
9    know    219                4.71
10   like    215                4.63
11   time    190                4.09
12    day    188                4.05
13   love    170                3.66
14  going    164                3.53
top_words_ham_unique = [("u", 660),
("get", 265),
("ok", 234),
("got", 224),
("go", 222),
("call", 212),
("come", 212),
("good", 210),
("know", 209),
("like", 200),
("ur", 187),
("time", 179),
("day", 176),
("going", 158),
("home", 154)]

df_top_words_ham_unique = pd.DataFrame(top_words_ham_unique, columns=["word", "count"])
df_top_words_ham_unique["% present in Total"] = df_top_words_ham_unique.apply(lambda x: round(x["count"]/len(df_words_ham)*100, 2), axis=1)
print(df_top_words_ham_unique)
     word  count  % present in Total
0       u    660               14.20
1     get    265                5.70
2      ok    234                5.04
3     got    224                4.82
4      go    222                4.78
5    call    212                4.56
6    come    212                4.56
7    good    210                4.52
8    know    209                4.50
9    like    200                4.30
10     ur    187                4.02
11   time    179                3.85
12    day    176                3.79
13  going    158                3.40
14   home    154                3.31
def plot_top_15 (dataframe, title="TBC", fig_num="TBC", color=color, size=3.8, total_freq=False):
    # Create horizontal bar plot of Top 15 words for spam/ham
    words = dataframe["word"]
    counts = dataframe["count"]

    y_pos = np.arange(len(words))  # the label locations

    fig, ax = plt.subplots(figsize=(size,5), constrained_layout=True)

    ax.barh(y_pos, counts, color=color)

    # Add text for labels, title and custom x-axis tick labels
    if total_freq:
        ax.set_xlabel("Total Count")
    else:
        ax.set_xlabel("Count of Messages")
    ax.invert_yaxis()
    ax.set_title(title, fontsize=15, fontweight="bold")  # use the title parameter, not a global
    ax.set_yticks(y_pos, labels = words, fontsize=16)

    # Return the figure for saving
    return fig
# Selected for report
# this is based on counting a word only once per email

data = df_top_words_ham_unique
title = "Ham"
fig_num = "2"

full_title = f"Fig. {fig_num} Top 15 words - {title}"

fig = plot_top_15 (data, full_title, fig_num, color_r)

plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")

plt.show()
# Selected for report
# this is based on counting a word only once per email

data = df_top_words_spam_unique
title = "Spam"
fig_num = "3"
full_title = f"Fig. {fig_num} Top 15 words - {title}"

fig = plot_top_15 (data, full_title, fig_num, color)

plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")

plt.show()
# Word frequency - allows for repetition within a message

data = df_top_words_ham
title = "Ham"
fig_num = "1.2"

full_title = f"Fig {fig_num} Top 15 words - {title}"

fig = plot_top_15 (data, full_title, fig_num, color_r, total_freq=True)

plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")

plt.show()
# Word frequency - allows for repetition within a message

data = df_top_words_spam
title = "Spam"
fig_num = "1.4"

full_title = f"Fig {fig_num} Top 15 words - {title}"


fig = plot_top_15 (data, full_title, fig_num, color, total_freq=True)

plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")

plt.show()
# Selected for Report
# based on word frequency...

data = spam_words_only.head(10)
title = "Spam-only words"
fig_num = "4"
full_title = f"Fig. {fig_num} Top 10 {title}"
size = 4.3

fig = plot_top_15 (data, full_title, fig_num, color="darkorange", size=size, total_freq=True)

plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")

plt.show()
data = ham_words_only.head(10)
title = "Ham-only words"
fig_num = "1.6"
full_title = f"Fig {fig_num} Top 10 {title}"

fig = plot_top_15 (data, full_title, fig_num, color_r, total_freq=True)

plt.savefig(f"{full_title}.png", dpi=300, transparent=True, bbox_inches = "tight")

plt.show()

Data modelling

Model Training

  • Split the dataset into training and test sets, and ensure the training dataset is balanced (using SMOTE)
# Split into training (80%) and test (20%) sets.
# Labels come straight from the "spam" column (0 = ham, 1 = spam),
# so they stay aligned row-by-row with feature_vectors.

X_train, X_test, y_train, y_test = train_test_split(feature_vectors, df["spam"].astype(int).tolist(), random_state = 42, test_size=0.2) 
# Balance training set using SMOTE

X_train_smote, y_train_smote = SMOTE(random_state=42).fit_resample(X_train, y_train) 

# Check the ratio of True (1) for y training set in total number of y
# should end up with 0.5...

print (y_train_smote.count(1), len(y_train_smote))
print (y_train_smote.count(1)/len(y_train_smote))
3725 7450
0.5

Apply machine learning/model approaches

Model 1 - KNN

# Make instance of KNN model

knn = KNeighborsClassifier() 
# Set hyperparameters

hyperparameters = { 
        "n_neighbors": [1, 3, 5, 9, 11],  # number of neighbours that vote
        "p": [1, 2]  # Minkowski distance power: 1 = Manhattan, 2 = Euclidean
    } 

KNN - base (unbalanced)

#Grid search

knn_base = GridSearchCV(knn, hyperparameters, scoring="accuracy") 

knn_base.fit(X_train, y_train) 

print("Best p:", knn_base.best_estimator_.get_params()["p"]) 

print("Best n_neighbors:", knn_base.best_estimator_.get_params()["n_neighbors"]) 
Best p: 2
Best n_neighbors: 1

KNN - SMOTE (balanced)

knn_smote = GridSearchCV(knn, hyperparameters, scoring="accuracy") 
knn_smote.fit(X_train_smote, y_train_smote) 

print("Best p:", knn_smote.best_estimator_.get_params()["p"]) 

print("Best n_neighbors:", knn_smote.best_estimator_.get_params()["n_neighbors"]) 
Best p: 2
Best n_neighbors: 1

Model 2 - Decision Tree

# Make instance of Decision Tree model and set hyperparameters

# NOTE - I have removed a number of the values at extremes and in between that were used for testing, 
# so that it doesn't take so long to run when reviewing.

dt = DecisionTreeClassifier() 

hyperparameters = { 
    "min_samples_split": [2, 3, 5, 10, 15, 20], 
    "min_samples_leaf": [3, 4, 5, 6, 8], 
    "max_depth": [10, 20, 40, 60, 80, 120] 

} 

Decision Tree - base (unbalanced)

# train & evaluate

dt_base = GridSearchCV(dt, hyperparameters, scoring="accuracy").fit(X_train, y_train) 

print("Best max_depth:", dt_base.best_estimator_.get_params()["max_depth"]) 

print("Best min_samples_leaf:", dt_base.best_estimator_.get_params()["min_samples_leaf"]) 

print("Best min_samples_split:", dt_base.best_estimator_.get_params()["min_samples_split"]) 

print("Best criterion:", dt_base.best_estimator_.get_params()["criterion"]) 
Best max_depth: 20
Best min_samples_leaf: 3
Best min_samples_split: 3
Best criterion: gini

Decision Tree - SMOTE (balanced)

# train & evaluate
# capturing the best hyperparameter values here (walrus operator) for the tree plot at the end

dt_smote = GridSearchCV(dt, hyperparameters, scoring="accuracy").fit(X_train_smote, y_train_smote) 

print("Best max_depth:", max_depth := dt_smote.best_estimator_.get_params()["max_depth"]) 

print("Best min_samples_leaf:", min_samples_leaf := dt_smote.best_estimator_.get_params()["min_samples_leaf"]) 

print("Best min_samples_split:", min_samples_split := dt_smote.best_estimator_.get_params()["min_samples_split"]) 

print("Best criterion:", criterion := dt_smote.best_estimator_.get_params()["criterion"]) 
Best max_depth: 120
Best min_samples_leaf: 3
Best min_samples_split: 2
Best criterion: gini

Model Evaluation

Functions



def plot_confusion_matrix(model, y_predicted, title="TBC", fig_num="TBC"):
    # Confusion Matrix - draw into an explicit Axes so no stray empty figure is created
    fig, ax = plt.subplots(figsize=(4,4))
    cm = confusion_matrix(y_test, y_predicted, labels=model.classes_) 
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_) 
    disp.plot(cmap="YlOrBr_r", ax=ax) 

    ax.set_title(f"Fig. {fig_num} {title}", fontsize=20, fontweight="bold")
    ax.set_xlabel("Predicted", fontsize=14)
    ax.set_ylabel("True label", fontsize=14)
    
    return disp 

def calc_accuracy(model_name, y_predicted):
    # Accuracy
    results.loc[model_name,"accuracy"] = accuracy_score(y_test,y_predicted) 
    return results.loc[model_name,"accuracy"]

def calc_balanced_accuracy(model_name, y_predicted):
    # Balanced Accuracy
    results.loc[model_name,"balanced_accuracy"] = balanced_accuracy_score(y_test,y_predicted) 
    return results.loc[model_name,"balanced_accuracy"]

def calc_training_time(model, model_name):
    # Training Time
    results.loc[model_name,"training_time"] = model.cv_results_["mean_fit_time"].mean()
    return results.loc[model_name,"training_time"]

def calc_prediction_time(model, model_name):
    # Prediction Time (approximate, per test sample)
    results.loc[model_name,"prediction_time"] = model.cv_results_["mean_score_time"].mean()/len(y_test)
    return results.loc[model_name,"prediction_time"]
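One caveat on `calc_prediction_time`: GridSearchCV's `mean_score_time` measures scoring on the cross-validation folds (including the metric computation itself), so dividing by `len(y_test)` is only a rough per-sample estimate. If a tighter number is wanted, `predict()` can be timed directly; a sketch on synthetic data (not the assignment data):

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 20))
y_toy = (X_toy[:, 0] > 0).astype(int)
model = KNeighborsClassifier().fit(X_toy, y_toy)

# Time predict() directly on a batch rather than inferring it
# from GridSearchCV's scoring times.
start = time.perf_counter()
model.predict(X_toy)
per_sample = (time.perf_counter() - start) / len(X_toy)
print(f"{per_sample:.2e} s per sample")
```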


Metrics for model evaluation - Confusion Matrix and metric calculations

# Create empty results table

results = pd.DataFrame()
model = knn_base
model_name = "KNN"

y_predicted = model.predict(X_test)

Metrics for KNN evaluation

# Confusion Matrix

title = "KNN (unbalanced)"
fig_num = "5"

disp = plot_confusion_matrix(model, y_predicted, title=title, fig_num=fig_num)

plt.savefig(f"Fig. {fig_num} {title}.png", dpi=300, transparent=True, bbox_inches = "tight")
 
[Fig. 5 KNN (unbalanced) confusion matrix]
# KNN_base results
# Accuracy
accuracy = calc_accuracy(model_name, y_predicted)

# Balanced Accuracy 
balanced = calc_balanced_accuracy(model_name, y_predicted)

# Training Time
train_time = calc_training_time(model, model_name)

# Prediction Time
pred_time = calc_prediction_time(model, model_name)

print(f"Accuracy: {accuracy:.10f}")
print(f"Balanced Accuracy: {balanced:.10f}")
print(f"Training Time: {train_time:.10f}")
print(f"Prediction Time: {pred_time:.10f}")
Accuracy: 0.9187675070
Balanced Accuracy: 0.7136805020
Training Time: 0.0025524473
Prediction Time: 0.0001030172
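The gap between accuracy (0.92) and balanced accuracy (0.71) above comes from the class imbalance: balanced accuracy is the unweighted mean of per-class recall, so the minority spam class counts as much as ham. A toy hand calculation (made-up labels, not the real test set) to confirm against scikit-learn:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Imbalanced toy labels: 8 "ham" (0), 2 "spam" (1), mirroring the skew.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])  # one spam missed

# Balanced accuracy = unweighted mean of per-class recall.
recalls = [np.mean(y_pred[y_true == c] == c) for c in (0, 1)]
manual = np.mean(recalls)  # (1.0 + 0.5) / 2 = 0.75

print(manual, balanced_accuracy_score(y_true, y_pred))  # 0.75 0.75
```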
model = knn_smote
model_name = "KNN-SMOTE"

y_predicted = model.predict(X_test)
# Confusion Matrix

title = "KNN (SMOTE)"
fig_num = "7"

disp = plot_confusion_matrix(model, y_predicted, title=title, fig_num=fig_num)


plt.savefig(f"Fig. {fig_num} {title}.png", dpi=300, transparent=True, bbox_inches = "tight")
[Fig. 7 KNN (SMOTE) confusion matrix]
# KNN_SMOTE results
# Accuracy
accuracy = calc_accuracy(model_name, y_predicted)

# Balanced Accuracy 
balanced = calc_balanced_accuracy(model_name, y_predicted)

# Training Time
train_time = calc_training_time(model, model_name)

# Prediction Time
pred_time = calc_prediction_time(model, model_name)

print(f"Accuracy: {accuracy:.10f}")
print(f"Balanced Accuracy: {balanced:.10f}")
print(f"Training Time: {train_time:.10f}")
print(f"Prediction Time: {pred_time:.10f}")
Accuracy: 0.8085901027
Balanced Accuracy: 0.7312779339
Training Time: 0.0037259293
Prediction Time: 0.0002646201
model = dt_base
model_name = "DT"

y_predicted = model.predict(X_test)
# Confusion Matrix

title = "Decision Tree (unbalanced)"
fig_num = "6"

disp = plot_confusion_matrix(model, y_predicted, title=title, fig_num=fig_num)


plt.savefig(f"Fig. {fig_num} {title}.png", dpi=300, transparent=True, bbox_inches = "tight")
[Fig. 6 Decision Tree (unbalanced) confusion matrix]
# DT_BASE results
# Accuracy
accuracy = calc_accuracy(model_name, y_predicted)

# Balanced Accuracy 
balanced = calc_balanced_accuracy(model_name, y_predicted)

# Training Time
train_time = calc_training_time(model, model_name)

# Prediction Time
pred_time = calc_prediction_time(model, model_name)

print(f"Accuracy: {accuracy:.10f}")
print(f"Balanced Accuracy: {balanced:.10f}")
print(f"Training Time: {train_time:.10f}")
print(f"Prediction Time: {pred_time:.10f}")
Accuracy: 0.8832866480
Balanced Accuracy: 0.6424318304
Training Time: 0.1051164484
Prediction Time: 0.0000014010
model = dt_smote
model_name = "DT-SMOTE"

y_predicted = model.predict(X_test)
# Confusion Matrix

title = "Decision Tree (SMOTE)"
fig_num = "8"

disp = plot_confusion_matrix(model, y_predicted, title=title, fig_num=fig_num)

plt.savefig(f"Fig. {fig_num} {title}.png", dpi=300, transparent=True, bbox_inches = "tight")
[Fig. 8 Decision Tree (SMOTE) confusion matrix]
# DT_SMOTE results
# Accuracy
accuracy = calc_accuracy(model_name, y_predicted)

# Balanced Accuracy 
balanced = calc_balanced_accuracy(model_name, y_predicted)

# Training Time
train_time = calc_training_time(model, model_name)

# Prediction Time
pred_time = calc_prediction_time(model, model_name)

print(f"Accuracy: {accuracy:.10f}")
print(f"Balanced Accuracy: {balanced:.10f}")
print(f"Training Time: {train_time:.10f}")
print(f"Prediction Time: {pred_time:.10f}")
Accuracy: 0.6937441643
Balanced Accuracy: 0.5886131695
Training Time: 0.1291490380
Prediction Time: 0.0000019057

Results

results.index
Index(['KNN', 'KNN-SMOTE', 'DT', 'DT-SMOTE'], dtype='object')
# show collated results
results.style.highlight_max(color=color, axis=0).highlight_min(color=color_r, axis=0)
[Collated results table, with best and worst values highlighted per column]
#Accuracy & balanced accuracy
fig_num="9"
labels = list(results.index)

fig = results[["accuracy","balanced_accuracy"]].plot(kind="bar", color=colors_r, figsize=(8,5)) 

plt.title(f"Fig. {fig_num} Accuracy & Balanced Accuracy by Model"
         , fontweight="bold"
         , fontsize=14)


plt.xticks(rotation=0)
plt.ylabel("Percentage", fontsize=14)
plt.legend()

plt.savefig(f"Fig. {fig_num} Accuracy & Balanced Accuracy by Model.png", dpi=300, transparent=True, bbox_inches = "tight")
[Fig. 9 Accuracy & Balanced Accuracy by Model]
# Training Time
fig_num="10" 

results["training_time"].plot(kind="bar", color=color, figsize=(7,3)) 

plt.title(f"Fig. {fig_num} Training Time by Model"
         , fontweight="bold"
         , fontsize=14)

plt.xticks(rotation=0)
plt.ylabel("Training Time", fontsize=14)

plt.savefig(f"Fig. {fig_num} Training Time by Model.png", dpi=300, transparent=True, bbox_inches = "tight")
[Fig. 10 Training Time by Model]
# Prediction Time

fig_num="11"

results["prediction_time"].plot(kind="bar", color=color, figsize=(7,3)) 

plt.title(f"Fig. {fig_num} Prediction Time by Model"
         , fontweight="bold"
         , fontsize=14)

plt.xticks(rotation=0)
plt.ylabel("Prediction Time", fontsize=14)

plt.savefig(f"Fig. {fig_num} Prediction Time by Model.png", dpi=300, transparent=True, bbox_inches = "tight")
[Fig. 11 Prediction Time by Model]

Classification Report (including F1-score)

all_words["type"].shape
(9741,)
df_words["check"] = df_words["spam"].replace({True: "Spam", False: "Ham"}).astype(str)
# Classification report for KNN (unbalanced)

y_predicted = knn_base.fit(X_train, y_train).predict(X_test)

report = classification_report(y_test, y_predicted, target_names=df_words["check"].unique()) 
print(report)
              precision    recall  f1-score   support

         Ham       0.92      1.00      0.95       922
        Spam       0.97      0.43      0.60       149

    accuracy                           0.92      1071
   macro avg       0.94      0.71      0.78      1071
weighted avg       0.92      0.92      0.90      1071
# Classification report for Decision Tree (unbalanced)

y_predicted = dt_base.fit(X_train, y_train).predict(X_test)

report = classification_report(y_test, y_predicted, target_names=df_words["check"].unique()) 
print(report)
              precision    recall  f1-score   support

         Ham       0.90      0.98      0.94       922
        Spam       0.70      0.32      0.44       149

    accuracy                           0.89      1071
   macro avg       0.80      0.65      0.69      1071
weighted avg       0.87      0.89      0.87      1071
# Classification report for Decision Tree (SMOTE)

y_predicted = dt_smote.fit(X_train_smote, y_train_smote).predict(X_test)

report = classification_report(y_test, y_predicted, target_names=df_words["check"].unique()) 
print(report)
              precision    recall  f1-score   support

         Ham       0.89      0.75      0.81       922
        Spam       0.22      0.45      0.30       149

    accuracy                           0.71      1071
   macro avg       0.56      0.60      0.56      1071
weighted avg       0.80      0.71      0.74      1071
# Classification report for KNN (SMOTE)


y_predicted = knn_smote.fit(X_train_smote, y_train_smote).predict(X_test)

report = classification_report(y_test, y_predicted, target_names=df_words["check"].unique()) 
print(report)

              precision    recall  f1-score   support

         Ham       0.93      0.84      0.88       922
        Spam       0.38      0.62      0.48       149

    accuracy                           0.81      1071
   macro avg       0.66      0.73      0.68      1071
weighted avg       0.86      0.81      0.83      1071
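For reference, the f1-score column in these reports is the harmonic mean of precision and recall, 2PR/(P+R), which is why it drops sharply whenever either metric is weak. A toy check (made-up labels) against scikit-learn's `f1_score`:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)  # 2 TP / (2 TP + 1 FP) = 2/3
r = recall_score(y_true, y_pred)     # 2 TP / (2 TP + 2 FN) = 0.5
# F1 is the harmonic mean of precision and recall.
manual = 2 * p * r / (p + r)

print(round(manual, 4), round(f1_score(y_true, y_pred), 4))  # 0.5714 0.5714
```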

Tree chart

Commented out for this HTML document due to output length.

Tried just out of curiosity - not included in the final report.
Parameters are set from the best_estimator_.get_params values above.

print(criterion, max_depth, min_samples_split, min_samples_leaf)
gini 120 2 3
# dt_tree = DecisionTreeClassifier(criterion=criterion
#                            , max_depth=max_depth
#                            , min_samples_split=min_samples_split
#                            , min_samples_leaf=min_samples_leaf)

# dt_tree = dt_tree.fit(X_train_smote, y_train_smote)

# plt.figure(figsize=(15,50))


# tree.plot_tree(decision_tree=dt_tree, class_names=all_words["type"],\
#              filled=True, rounded=True)
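If only the structure of the fitted tree is of interest, `sklearn.tree.export_text` prints it as indented text rules, which is far more compact than `plot_tree` for large trees. A sketch using the iris dataset as a stand-in for the SMS features:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X_iris, y_iris = load_iris(return_X_y=True)
dt_small = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_iris, y_iris)

# export_text renders each split as an indented "|---" rule line.
print(export_text(dt_small))
```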


Copyright © 2023 Adam Simmons, Inc. All rights reserved.