April 20, 2023


For my third and final RMIT Data Wrangling assignment, we had to compare two datasets with some exploratory analysis. Having done a little mixed-methods research at KPMG with NVivo, but no text analysis in Python or R as yet, I chose two texts as my datasets. Or to be specific, two plays: Macbeth by William Shakespeare and The Tragedie of Mariam, the faire queene of Jewry by Elizabeth Cary - two contemporaries writing across the late 16th and early 17th centuries.

Approaching each play as a “dataset” and thinking about how to fit it to Tidy Data principles helped identify variables, values (words) and factors such as title, gender, character and act/scene (the last of which were out of my scope here due to time and space). I’m increasingly thinking about how things could be broken down similarly in other everyday situations.

One challenge I identified is that the two writers differ considerably in spelling - progressing beyond the scope of this report would require addressing this for a proper comparison. Even in this preliminary form, though, I believe there are some potentially interesting insights. Overall, it was a fascinating shift of thinking, applying familiar skills with a different approach.


Below is the R code used for the data wrangling, transformation, analysis and visualisations.


Data Wrangling Assessment Task 3: Dataset challenge

Setup

# load all required packages 
library(gutenbergr)
library(readr)
library(here)
## here() starts at /Users/adam/data/R/Github Projects/adam-tuoa.github.io
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(magrittr)
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
library(tidytext)
library(stringr)
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
# Set maximum number for printing
options(max.print=50)


1. Data Description

For this project, two texts will be processed in preparation for a comparative text analysis with consideration of gender. The two plays are by 17th century contemporaries. One author is considered “the greatest dramatist of all time as well as the most influential writer in the history of the English language” (EB 2020), while the other never saw her play performed in her lifetime - though she does have the distinction of having written the first known play published by a woman (Braganza 2022). And both plays happen to involve murderous royalty.

The intended result of this process will be a dataset that combines the text from both plays in a Tidy Data format, ready for exploratory data analysis.

The preparation will identify variables for extraction, creation or transformation. With gender as a key focus, the variables will be selected to support that analysis.

1.1 Macbeth

Use the gutenbergr R package to identify and access Macbeth on Project Gutenberg:

# Find Project Gutenberg ID for Macbeth

gutenberg_works() %>%
  filter(title == "Macbeth")
## # A tibble: 1 × 8
##   gutenberg_id title   author   gutenberg_author_id language gutenberg_bookshelf
##          <int> <chr>   <chr>                  <int> <chr>    <chr>              
## 1         1533 Macbeth Shakesp…                  65 en       Opera              
## # ℹ 2 more variables: rights <chr>, has_text <lgl>
# Download Macbeth
macbeth <- gutenberg_download(1533, mirror = "http://aleph.gutenberg.org")
# Check dataset
macbeth %>% head()
## # A tibble: 6 × 2
##   gutenberg_id text     
##          <int> <chr>    
## 1         1533 "cover " 
## 2         1533 ""       
## 3         1533 ""       
## 4         1533 ""       
## 5         1533 ""       
## 6         1533 "MACBETH"

This is just the beginning of the play’s text, so let’s check further down:

# View six rows from the middle of the text
macbeth[2001:2006, ]
## # A tibble: 6 × 2
##   gutenberg_id text                                
##          <int> <chr>                               
## 1         1533 "But who did bid thee join with us?"
## 2         1533 ""                                  
## 3         1533 "THIRD MURDERER."                   
## 4         1533 "Macbeth."                          
## 5         1533 ""                                  
## 6         1533 "SECOND MURDERER."


1.2 The Tragedie of Mariam, the faire queene of Jewry

“The Tragedie of Mariam, the faire queene of Jewry” - let’s shorten that to Mariam - check for Project Gutenberg access:

# Find Project Gutenberg ID for Mariam
gutenberg_works() %>%
  filter(title == "The Tragedie of Mariam, the faire queene of Jewry")
## # A tibble: 0 × 8
## # ℹ 8 variables: gutenberg_id <int>, title <chr>, author <chr>,
## #   gutenberg_author_id <int>, language <chr>, gutenberg_bookshelf <chr>,
## #   rights <chr>, has_text <lgl>

So, that confirms it is not in the Project Gutenberg library.

Fortunately, it is publicly available on the web courtesy of A Celebration of Women Writers, the result of work by Laura June Dziuban and Mary Mark Ockerbloom. From this site, I copied and pasted the entire text into a plain text document, saved with a .csv extension.
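As an aside (and not the route taken here), since the file is really free text rather than delimited data, readr’s read_lines would read it line by line and sidestep CSV parsing - and the parsing warning seen below - entirely:

# Alternative load, reading plain lines instead of parsing as CSV (not used in this report)
mariam_lines <- read_lines("The Tragedie of Mariam.csv")
mariam_alt   <- tibble(text = mariam_lines)   # tibble() is re-exported by dplyr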

# Load Mariam
mariam <- read_csv("The Tragedie of Mariam.csv", show_col_types = FALSE)
## Warning: One or more parsing issues, call `problems()` on
## your data frame for details, e.g.:
##   dat <- vroom(...)
##   problems(dat)
# Check dataset
mariam %>%  head()
## # A tibble: 6 × 1
##   `A Celebration of Women Writers`                                              
##   <chr>                                                                         
## 1 The tragedie of Mariam, the faire queene of Jewry: written by that learned, v…
## 2 THE                                                                           
## 3 TRAGEDIE                                                                      
## 4 OF MARIAM,                                                                    
## 5 THE FAIRE                                                                     
## 6 Queene of Jewry.

Ignoring that warning for now - it will likely be picked up in later checks - let’s have a look further down the text:

# View six rows from the middle of the text
mariam[2001:2006, ]
## # A tibble: 6 × 1
##   `A Celebration of Women Writers`                    
##   <chr>                                               
## 1 In base revenge there is no honor wonne.            
## 2 Who would a worthy courage overthrow,               
## 3 And who would wrastle with a worthles foe?          
## 4 We say our hearts are great and cannot yeeld,       
## 5 Because they cannot yeeld it proves them poore:     
## 6 Great hearts are task't beyond their power, but feld

Now that the plays are loaded, before merging them it will be necessary to inspect each dataset more carefully and ensure that data from each play remains identifiable as such.



2. Understand

To start, let’s look at the overall dimensions and structure in terms of variables and data types.

From here on, Macbeth and Mariam will refer to the plays, and macbeth and mariam to their respective datasets.

# Dimensions of each dataset
cat("macbeth:\n")
## macbeth:
macbeth %>% dim()
## [1] 4168    2
cat("\nmariam:\n")
## 
## mariam:
mariam %>% dim()
## [1] 2335    1
# Structure of each dataset
cat("macbeth:\n")
## macbeth:
macbeth %>% str()
## tibble [4,168 × 2] (S3: tbl_df/tbl/data.frame)
##  $ gutenberg_id: int [1:4168] 1533 1533 1533 1533 1533 1533 1533 1533 1533 1533 ...
##  $ text        : chr [1:4168] "cover " "" "" "" ...
cat("\nmariam:\n")
## 
## mariam:
mariam %>% str()
## spc_tbl_ [2,335 × 1] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ A Celebration of Women Writers: chr [1:2335] "The tragedie of Mariam, the faire queene of Jewry: written by that learned, vertuous, and truly noble ladie, E."| __truncated__ "THE" "TRAGEDIE" "OF MARIAM," ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `A Celebration of Women Writers` = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

That’s an initial overview - now for a look at each in turn in more detail.

2.1 Macbeth clean

Variables and structure

From above, we can see that macbeth has two variables:

  • gutenberg_id - numerical (integer)
  • text - character/string

The first identifies the play, while the second is the play’s text split by line into 4168 rows.

Attributes

Here are the attributes of the macbeth dataset, confirming we are working with a tibble-class dataframe:

macbeth %>% attributes()
## $class
## [1] "tbl_df"     "tbl"        "data.frame"
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
##  [ reached getOption("max.print") -- omitted 4118 entries ]
## 
## $names
## [1] "gutenberg_id" "text"

2.2 Mariam clean

Variables and structure

From above (section 2), we can see that mariam has one variable:

  • A Celebration of Women Writers - character/string type

This single column contains the play’s text split by line into 2335 rows.

Attributes

Here are the attributes of the mariam dataset. As it was created by readr::read_csv, it is a spec_tbl_df, which carries an additional specification attribute. For all intents and purposes this has no impact on our use of this tibble subclass, and it will become moot upon merging the two datasets anyway - and we will get to that shortly.

mariam %>% attributes()
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
##  [ reached getOption("max.print") -- omitted 2285 entries ]
## 
## $names
## [1] "A Celebration of Women Writers"
## 
## $spec
## cols(
##   `A Celebration of Women Writers` = col_character()
## )
## 
## $problems
## <pointer: 0x7fad808b40a0>
## 
## $class
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

These outputs for macbeth and mariam confirm the above in terms of size and variable names, and that both are indeed dataframes - more specifically, a tibble and a tibble subclass (spec_tbl_df) respectively. However, mariam lacks an identifier column, as well as a meaningful column name/header.

2.2.1 Create Character variable

Curiously, this section was added later in the drafting, but it is a necessary step at this point in the pipeline in order to make things easier later on.

For the purposes of ascribing gender to the use of each word, a variable based on the character’s names will be created. The initial formatting - uppercase and punctuation - of both works makes it possible to differentiate speakers/characters from the use of their names in speech.

# Character names
macbeth_names = c("DUNCAN", "MALCOLM", "DONALBAIN","MACBETH", "BANQUO", "MACDUFF","LENNOX", "ROSS", "MENTEITH", "ANGUS", "CAITHNESS", "FLEANCE", "SIWARD", "YOUNG SIWARD", "SEYTON", "BOY", "DOCTOR", "SOLDIER", "PORTER", "OLD MAN", "MESSENGER", "FIRST MESSENGER", "SECOND MESSENGER","LORDS", "LADY MACBETH", "LADY MACDUFF", "GENTLEWOMAN", "HECATE", "FIRST WITCH", "SECOND WITCH", "THIRD WITCH")

# For Mariam's characters, I opted for this route on this occasion, though I see how regex would excel here - next time!
mariam_names = list(c("Herod\\.", "Her\\."),
        c("Doris\\.", "Dor\\."),
        c("Mariam\\.","Ma\\.", "Mar\\."),
        c("Salome\\.", "Sal\\.", "Salom\\.", "Salo\\."),
        c("Antipater\\.", "Antipat\\."),
        c("Alexander\\.", "Alex\\."),
        c("Sillius\\.", "Sil\\.", "Sille\\."),
        c("Constabarus\\.", "Con\\."),
        c("Pheroras\\.", "Phero\\.", "Pheror\\."),
        c("Graphina\\.", "Graph\\."),
        c("Babus first Sonne\\.", "Babus\\. I\\. Sonne\\.", "Ba\\. 1\\. Sonne"),
        c("Babus second Sonne\\.", "Babus 2\\. Sonne\\.", "Ba\\. 2\\. Sonne"),
        c("Annanell\\.", "Ananell\\.", "Ana\\."),
        c("Sohemus\\.", "Sohem\\."),
        c("Nuntio\\.", "Nunti\\.", "Nun\\."),
        c("Bu\\."),
        c("Chorus\\."))
# Function to add character name for Macbeth
add_macbeth_character <- function(dataframe, names) {
  dataframe$character <- NA
  for (name in names) {
    dataframe$character[grepl(name, dataframe$text)] <- name
  }
  dataframe
}

# Function to add character name for Mariam - loops through the variations
# present, filling with the full name (the first element of each vector)
add_mariam_character <- function(dataframe, names_vector) {
  dataframe$character <- NA
  for (element in names_vector) {
    for (names in element) {
      dataframe$character[grepl(names, dataframe[[1]])] <- element[1]
    }
  }
  dataframe
}
# Create new column with character names corresponding to any line (row) they appear in
macbeth <- add_macbeth_character(macbeth, macbeth_names)   
mariam <- add_mariam_character(mariam, mariam_names)

What has just been done is the creation of a new character column, populated in any row where a correctly formatted character name appears. What remains is to “fill down” each name found in this new column - with some cleaning required of the character names in mariam.

macbeth %<>% fill(character)
mariam %<>% fill(character) 

# Remove unwanted characters from the new character column - i.e. "\."
mariam[[2]] %<>% str_remove_all("[:punct:]")

Please note, due to the method used, a small degree of inaccuracy is introduced through incorrect attribution of a speaker to general transitional instructions, scene changes, personnel lists and the like. For this project it was deemed prudent to focus on the core priorities; this can be addressed as required in the future.
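One hypothetical way to address this later (a sketch only - the patterns below are assumptions and would need checking against each text’s actual formatting) would be to drop lines that look like stage directions or scene headings, so their words are not attributed to the previous speaker:

# Hypothetical filter for stage-direction-like lines in macbeth (not applied in this report)
stage_directions <- "^(ACT|SCENE)\\b|_Exeunt|_Exit|_Enter|_Flourish"
macbeth_dialogue <- macbeth %>% filter(!str_detect(text, stage_directions))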

2.3 Merge datasets

To merge the files, all text will be combined into a single column, or variable. For comparative analysis after pre-processing, however, the two plays will still need to be identifiable, which calls for a new categorical variable.

In macbeth the gutenberg_id variable is currently numerical, but is only significant as an identifier. A quick check of unique values for the variable shows this to be the case.

macbeth$gutenberg_id %>% unique()
## [1] 1533

What is required is either to replicate this column in mariam, or to remove it and create a new identifier variable for both datasets. Either way, the result is the same: a new column with two values corresponding to the titles of the plays.

To keep things clean, let’s remove the gutenberg_id column and create a new title column in each dataset. Then a straightforward join by stacking one dataset on top of the other (row binding) will give our merged dataset.

A line_number column will also be added so that cross-referencing with the original text sources can be done easily.

From 2.2, it was noted mariam requires a column name change. This will be rectified now also.

# Remove unnecessary column (by "unselecting" it)
macbeth %<>% select (-1)

# Add new columns with titles and line numbers for values
macbeth$title <- "Macbeth"
mariam$title <- "The Tragedie of Mariam"
macbeth %<>% mutate(line_number = row_number())
mariam %<>% mutate(line_number = row_number())

# Rename column name
mariam %<>% rename("text"="A Celebration of Women Writers")

And now joining the two datasets together:

df <- rbind(macbeth, mariam) %>% select("title", "text", "character", "line_number")

A quick check of the top and tail of the dataset should show text from both plays:

# show top and tail of the merged dataset
df %>% head()
## # A tibble: 6 × 4
##   title   text      character line_number
##   <chr>   <chr>     <chr>           <int>
## 1 Macbeth "cover "  <NA>                1
## 2 Macbeth ""        <NA>                2
## 3 Macbeth ""        <NA>                3
## 4 Macbeth ""        <NA>                4
## 5 Macbeth ""        <NA>                5
## 6 Macbeth "MACBETH" MACBETH             6
df %>% tail()
## # A tibble: 6 × 4
##   title                  text                              character line_number
##   <chr>                  <chr>                             <chr>           <int>
## 1 The Tragedie of Mariam "In after times the schoole of w… Chorus           2330
## 2 The Tragedie of Mariam "FINIS."                          Chorus           2331
## 3 The Tragedie of Mariam "About This Edition"              Chorus           2332
## 4 The Tragedie of Mariam "The original spelling of the te… Chorus           2333
## 5 The Tragedie of Mariam "This book has been put on-line … Chorus           2334
## 6 The Tragedie of Mariam "Edited by Mary Mark Ockerbloom"  Chorus           2335

NOTE - it is not visible in the PDF report, but rest assured the tail shows two columns also.

From manual inspection, there is extraneous text before and after the actual play texts. For expediency, and given there are only four points to identify, the row numbers for deletion were identified manually.

# Remove Macbeth intro
df <- df[-(1:92),]
# Remove Mariam outro
df <- df[-(6388:6391),]
# Remove Macbeth end and Mariam intro
df <- df[-(4057:4197),]

We have our merged dataset containing the complete texts of Macbeth and The Tragedie of Mariam, a tibble of 4 columns and 6266 rows.

Now, let’s begin the real fun!



3. Tidy & Manipulate Data I

The dataset requires further manipulation to obtain the desired result.

Firstly, we need to conform the data to Tidy Data principles, as it fails to comply in its current state - only one of the three principles is satisfied.

The text column still contains combinations of values rather than each value having its own cell. Each value needs to be a single word, corresponding to one observation of that value.

[NOTE: I did get this wrong in the video - this has been updated since that recording]

Tokenisation will be the method used to achieve this.

# Tokenise the text column and return a new dataframe with each word in its own cell 
df_token <- df %>% unnest_tokens(word, text)

# convert character to lower case
df_token$character %<>% str_to_lower() 

df_token %>%  str()
## tibble [35,469 × 4] (S3: tbl_df/tbl/data.frame)
##  $ title      : chr [1:35469] "Macbeth" "Macbeth" "Macbeth" "Macbeth" ...
##  $ character  : chr [1:35469] "hecate" "hecate" "hecate" "hecate" ...
##  $ line_number: int [1:35469] 94 94 94 94 94 94 94 94 94 94 ...
##  $ word       : chr [1:35469] "scene" "in" "the" "end" ...
df_token %>% head()
## # A tibble: 6 × 4
##   title   character line_number word 
##   <chr>   <chr>           <int> <chr>
## 1 Macbeth hecate             94 scene
## 2 Macbeth hecate             94 in   
## 3 Macbeth hecate             94 the  
## 4 Macbeth hecate             94 end  
## 5 Macbeth hecate             94 of   
## 6 Macbeth hecate             94 the
df_token %>% tail()
## # A tibble: 6 × 4
##   title                  character line_number word      
##   <chr>                  <chr>           <int> <chr>     
## 1 The Tragedie of Mariam chorus           2334 ockerbloom
## 2 The Tragedie of Mariam chorus           2335 edited    
## 3 The Tragedie of Mariam chorus           2335 by        
## 4 The Tragedie of Mariam chorus           2335 mary      
## 5 The Tragedie of Mariam chorus           2335 mark      
## 6 The Tragedie of Mariam chorus           2335 ockerbloom

We have a tidy tibble with 35469 words, complete with removal of all additional punctuation and conversion of all words to lower case.



4. Tidy & Manipulate Data II

Now to create additional variables by extracting them from the variables we have so far - and we’ve done one already!

Characters and Gender

The endpoint of this pre-processing is being able to ascribe gender to each observation. For this it will be necessary to identify the speaker. It was at this point I realised that generating the character variable would be far easier prior to tokenisation and even before joining the two initial datasets. Section 2.2.1 is where I travelled back in time to insert the creation of the character variable into this process.

Creating the gender category is based on indications from the introduction for each play. Without intent to offend or exclude, a binary approach has been taken for this report’s categorisation.

# Identify characters
male_chars <- c("duncan", "malcolm", "donalbain", "macbeth", "banquo", "macduff", "lennox", "ross", "menteith", "angus", "caithness", "fleance", "siward", "young siward", "seyton", "boy", "doctor", "soldier", "porter", "old man", "messenger", "first messenger", "second messenger", "lords", "herod", "antipater", "sillius", "constabarus", "pheroras", "babus first sonne", "babus second sonne", "annanell", "sohemus", "nuntio", "bu")

female_chars <- c("lady macbeth", "lady macduff", "gentlewoman", "hecate", "first witch", "second witch", "third witch", "doris", "mariam", "salome", "alexander", "graphina")

undefined <- c("chorus")

# Create a new column and fill when matches with character name 
df_token %<>% mutate(gender = case_when(character %in% male_chars ~ "male",
                                 character %in% female_chars ~ "female",
                                 TRUE ~ "undefined")) %>% 
              select("title", "word", "character", "gender", "line_number")

Create Factors

The categorical variables - title, character and gender - will be converted to factors, which will aid further analysis and visualisation.

#Convert to factors
df_token$title <- as.factor(df_token$title) 
df_token$character <- as.factor(df_token$character) 
df_token$gender <- as.factor(df_token$gender)


5. Scan I

Things are now in a Tidy Data format, with the primary variables set to appropriate data types. Next, a check of the values themselves.

Missing values

A check for NA values shows no missing values.

# Check for NA values
colSums(is.na(df_token))
##       title        word   character      gender line_number 
##           0           0           0           0           0

Inconsistencies/Errors

Double-checking factors are correct:

for (variable in c("title", "character", "gender"))
{
 cat(paste(variable, ":\n"))
 cat(levels(df_token[[variable]]))
 cat("\n\n")
}
## title :
## Macbeth The Tragedie of Mariam
## 
## character :
## alexander angus annanell antipater babus first sonne babus second sonne banquo bu caithness chorus constabarus doctor donalbain doris duncan first witch fleance gentlewoman graphina hecate herod lady macbeth lady macduff lennox lords macbeth macduff malcolm mariam menteith messenger nuntio old man pheroras porter ross salome second witch seyton sillius siward sohemus soldier third witch young siward
## 
## gender :
## female male undefined

There will be little benefit in checking for spelling errors given the old style of English, especially in Mariam, which takes a “flexible” approach to spelling anyway. This difference in language will impact some conventional text analysis techniques without further transformation of the data, such as lemmatisation or conversion to modern spelling, both of which are beyond the scope of this report.
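Purely as an illustration of the kind of transformation meant here (and assuming a package such as textstem were used - an assumption on my part, not something applied in this report), lemmatisation of the tokenised words might look something like this. Note it targets modern English, so Cary’s spellings would still need a modernisation step or custom dictionary first:

# Illustrative sketch only - not run for this report
# install.packages("textstem")
library(textstem)
df_token %>%
  mutate(word = lemmatize_words(word)) %>%   # map each word to its lemma (modern English)
  count(title, word, sort = TRUE) %>%
  head()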



6. Scan II

A final scan to remove noise: trimming whitespace and removing or replacing unnecessary characters and outliers.

# Remove whitespace (it changes to a Large list, so convert back to tibble)
df_token %<>% lapply(str_trim) %>% as_tibble()

# Check for words that begin or end with a non-alphabetic character, or consist of a single character
df_check_start <- df_token[str_detect(df_token$word, "^[^A-Za-z]|^[^A-Za-z]$|^[A-Za-z]$"), ] %>% select("word")
df_check_end <- df_token[str_detect(df_token$word, "[^A-Za-z]$"), ] %>% select("word")

# print list of relevant values
cat(paste("Begins/ends with non-alpha:\n", unique(df_check_start)))
## Begins/ends with non-alpha:
##  c("i", "_exeunt", "a", "o", "_exit", "_tiger", "_drum", "_witches", "_aside", "_to", "_flourish", "v", "_a", "_within", "_looking", "_knocking", "_opens", "_alarum", "_lady", "c", "_assaults", "_dies", "_ghost", "_music", "t", "_thunder", "_descends", "_hautboys", "_stabbing", "_they", "2", "1", "3", "4", "5", "6", "7", "8", "æthiopian", "s", "j", "u", "w")

A few things to address:

# Remove "_"
df_token$word %<>% str_remove_all("[:punct:]")
# Standardise spelling to avoid possible confusion
df_token$word %<>% str_replace_all("æthiopian", "ethiopian")
# Remove numbers
df_token$word %<>% str_remove_all("^[:digit:]$")


7. Transform

7.1 Remove stopwords / Word frequency

When considering word frequency, good practice is to remove “stopwords” - commonly used words such as “a”, “the”, “but” and “about” - whose removal helps reduce noise and reveal the more meaningful words.

At this step, stopwords will be removed and then the dataframe regrouped to get the frequency for each word:

df_no_stopwords <- 
  df_token %>% 
  filter(!df_token$word %in% stop_words$word) %>% 
  count(title, word, gender, character) %>% 
  arrange(desc(n)) 

df_no_stopwords
## # A tibble: 10,210 × 5
##    title                  word    gender    character        n
##    <chr>                  <chr>   <chr>     <chr>        <int>
##  1 Macbeth                macbeth undefined macbeth        158
##  2 Macbeth                macbeth female    lady macbeth    63
##  3 Macbeth                macduff undefined macduff         62
##  4 Macbeth                lady    female    lady macbeth    59
##  5 The Tragedie of Mariam herod   male      herod           58
##  6 The Tragedie of Mariam thou    male      herod           54
##  7 Macbeth                thou    undefined macbeth         40
##  8 The Tragedie of Mariam thy     male      herod           40
##  9 Macbeth                malcolm male      malcolm         39
## 10 The Tragedie of Mariam mariam  male      herod           39
## # ℹ 10,200 more rows

This has reduced the overall number of rows to 10210, though keep in mind a word may appear in more than one row of this dataframe, relating to different combinations of title, gender and character.

The most frequent word in this list is “macbeth”, attributed to the character Macbeth. This highlights a flaw in my methodology for creating the character variable based on the appearance of the character’s name in the text (albeit in capitalised form in its first instance): the count (n) includes all of the matched instances marking Macbeth’s lines.

Another issue of note is the presence of Early Modern English words like “thou” and “thy”. Their modern equivalents, “you” and “your”, would be removed by a regular stopwords list, but such lists do not account for Early Modern English. Compounding the issue is the fact that Cary’s approach to spelling might be described as flexible at best. Further work could be undertaken to standardise the texts (as one option) and address these issues.
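As one illustration (not applied to the results above), the standard stop word list from tidytext could be extended with a custom lexicon - the words chosen here are only a small, assumed sample:

# Early Modern English forms missed by the standard lists (illustrative, non-exhaustive)
early_modern_stopwords <- tibble(word = c("thou", "thee", "thy", "thine", "hath", "doth", "tis"),
                                 lexicon = "custom")
custom_stop_words <- bind_rows(stop_words, early_modern_stopwords)

# The earlier filter would then become, for example:
# filter(!df_token$word %in% custom_stop_words$word)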

So, irrespective of these foreseeable and rectifiable flaws, the dataset is prepared for exploring word frequency in regard to comparison by title, gender and/or character.

# Function - plot word freq, with argument to vary minimum number
barplot_word_freq <- function (dataframe, num=30)
{  dataframe %>% filter(n > num) %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(n, word)) +
    geom_col() +
    labs(y = NULL)  }


Word frequency bar plots for both titles:

df_no_stopwords %>% 
  filter (title == "Macbeth") %>%
  barplot_word_freq(22)

df_no_stopwords %>% 
  filter (title == "The Tragedie of Mariam") %>%
  barplot_word_freq(16)


Word frequency bar plots for gender (male/female):

df_no_stopwords %>% 
  filter (gender == "male") %>%
  barplot_word_freq(18)

df_no_stopwords %>% 
  filter (gender == "female") %>%
  barplot_word_freq(15)

Now, let’s consider the potential to compare word frequency by proportion.

Creating functions to build the comparison table and the scatter plot:

# Function to build the word frequency comparison table
# (note: `pivot_pair` is used from the calling environment rather than passed as an argument)
word_frequency_compare <- function (dataframe, characters)
{ frequency <- dataframe %>% 
  filter (character %in% characters) %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(character, word) %>%
  group_by(character) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  pivot_wider(names_from = character, values_from = proportion) %>% 
  pivot_longer(pivot_pair[1]:pivot_pair[2],
               names_to = "character", values_to = "proportion")
}
plot_frequency_compare <- function (dataframe, y_character, pivot_pair)
{ ggplot(dataframe, aes(x = proportion, y = dataframe[[y_character]], 
                      color = abs(dataframe[[y_character]] - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), 
                       low = "darkslategray4", high = "gray75") +
  facet_wrap(~ character, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Macbeth", x = NULL)
}

Word frequency scatter plots for comparison of character word usage:

All words (including stopwords)

characters <- c("lady macbeth", "doris", "macbeth", "herod")
y_character <- c("macbeth")
pivot_pair <- c("lady macbeth", "doris")

df_token %>% 
  word_frequency_compare(characters) %>% 
  plot_frequency_compare(y_character, pivot_pair)
## Warning: Use of `dataframe[[y_character]]` is discouraged.
## ℹ Use `.data[[y_character]]` instead.
## Use of `dataframe[[y_character]]` is discouraged.
## ℹ Use `.data[[y_character]]` instead.
## Use of `dataframe[[y_character]]` is discouraged.
## ℹ Use `.data[[y_character]]` instead.
## Use of `dataframe[[y_character]]` is discouraged.
## ℹ Use `.data[[y_character]]` instead.
## Warning: Removed 7786 rows containing missing values
## (`geom_point()`).
## Warning: Removed 7786 rows containing missing values
## (`geom_text()`).


With stopwords removed

characters <- c("lady macbeth", "doris", "macbeth", "herod", "banquo")
y_character <- c("macbeth")
pivot_pair <- c("lady macbeth", "doris")


df_no_stopwords %>% 
  word_frequency_compare(characters) %>% 
  plot_frequency_compare(y_character, pivot_pair)
## Warning: Use of `dataframe[[y_character]]` is discouraged.
## ℹ Use `.data[[y_character]]` instead.
## Use of `dataframe[[y_character]]` is discouraged.
## ℹ Use `.data[[y_character]]` instead.
## Use of `dataframe[[y_character]]` is discouraged.
## ℹ Use `.data[[y_character]]` instead.
## Use of `dataframe[[y_character]]` is discouraged.
## ℹ Use `.data[[y_character]]` instead.
## Warning: Removed 7552 rows containing missing values
## (`geom_point()`).
## Warning: Removed 7552 rows containing missing values
## (`geom_text()`).


Bigrams/n-grams

An example of the dataset being used for generating n-grams - groups of n words appearing together. Here is a function to generate them for a given n:

#Function - create ngram of different numbers 
create_ngrams <- function (dataframe, n=2)
 { dataframe %>% unnest_tokens(output = "words", 
                 input = text, 
                 token = "ngrams", 
                 n = n) } 

Creating bigrams (words in pairs):

df_bigrams <- create_ngrams(df, 2)
df_bigrams %>% head()
## # A tibble: 6 × 4
##   title   character line_number words   
##   <chr>   <chr>           <int> <chr>   
## 1 Macbeth HECATE             93 <NA>    
## 2 Macbeth HECATE             94 scene in
## 3 Macbeth HECATE             94 in the  
## 4 Macbeth HECATE             94 the end 
## 5 Macbeth HECATE             94 end of  
## 6 Macbeth HECATE             94 of the

Frequency of bigrams for Macbeth:

df_bigrams %>% 
  filter(!df_bigrams$words %in% stop_words$word) %>% 
  filter (title == "Macbeth") %>% 
  count(title, words) %>% 
  arrange(desc(n)) %>% 
  head()
## # A tibble: 6 × 3
##   title   words            n
##   <chr>   <chr>        <int>
## 1 Macbeth <NA>          1433
## 2 Macbeth lady macbeth    70
## 3 Macbeth i have          40
## 4 Macbeth in the          36
## 5 Macbeth to the          35
## 6 Macbeth of the          30

Frequency of bigrams for Mariam:

df_bigrams %>% 
  filter(!df_bigrams$words %in% stop_words$word) %>% 
  filter (title == "The Tragedie of Mariam") %>% 
  count(title, words) %>% 
  arrange(desc(n)) %>% 
  head()
## # A tibble: 6 × 3
##   title                  words        n
##   <chr>                  <chr>    <int>
## 1 The Tragedie of Mariam to be       37
## 2 The Tragedie of Mariam i will      20
## 3 The Tragedie of Mariam i am        19
## 4 The Tragedie of Mariam it is       19
## 5 The Tragedie of Mariam my heart    19
## 6 The Tragedie of Mariam <NA>        18

Note - these were generated from the dataframe as it stood before tokenisation and cleaning (df), which explains the null values as well as the outlier count for “lady macbeth”. To address this, the dataframe would need to be processed in a slightly different order - largely a matter of reordering the methods, not changing them.
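A hedged aside on the stopword step above: filtering the bigram strings directly against stop_words$word has no real effect, because each bigram contains two words and so never matches a single stop word. One common approach (a sketch only, not applied here) is to split each bigram, filter each word, then recombine:

df_bigrams %>%
  separate(words, into = c("word1", "word2"), sep = " ") %>%   # split each bigram into its two words
  filter(!is.na(word1), !is.na(word2),                         # drop the NA bigrams
         !word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%                      # drop bigrams containing a stop word
  unite(words, word1, word2, sep = " ") %>%                    # recombine
  count(title, words, sort = TRUE)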



8. Reflective journal

Concept

I went into this project with the intention of presenting myself a challenge but was unsure of the direction to go. Two things helped clarify my decision to undertake this text-based project:

  • a job application I undertook that required a critical response to a massive collection of structured, but seemingly chaotic, data with long strings of unknown codes, symbols, URLs, names and places
  • the opportunity to delve into something that complements and expands on the more numerical/date-focused wrangling skills developed in this subject, which I have also applied in my final Data Visualisation assignment during this same teaching period.

I started with Shakespeare’s Macbeth as a seemingly obvious choice - available, significant and a work I am familiar with in a range of variations, including an Australian rock opera and Akira Kurosawa’s adaptation, Throne of Blood. I had thought to do another of Shakespeare’s works, but then I wondered: might there be a female author who was a contemporary of Shakespeare’s? And certainly there was: Elizabeth Cary, who may be less well-known but is not without her own significance as the first known published female playwright, with a work that coincidentally shares themes with Macbeth. This gave the project a meaningful underpinning: the foundation for a gender-focused comparative text analysis of the two works.

Getting started

At first, I scribbled and sketched out possibilities for different features and how they could be used for analytical purposes, with consideration for the assignment’s requirements. The latter was helpful in pushing me in directions I might not have explored otherwise.

I’m learning that RMarkdown offers an easy way to create a framework for a project with the flexibility to adapt - Jupyter Notebook also, though with differences. Starting with the template as a helpful base, I populated the document with my questions, aims, tasks and ideas, which helped give focus and direction when I strayed down coding tunnels.

Cleaning/Tidying

This was a good example of the 80/20 rule - I’m sure I spent all of that 80% and more on the cleaning/wrangling side of things, though the focus here is on the pre-processing anyway. At first, it felt like it might be on the easy side; how hard can it be when you start with only two columns? But yes, it’s all about what features are waiting to be extracted from the data.

I was confused by the spec_tbl_df subclass of tibble generated by reading in Mariam, and unsure of what problem there might be from reading the CSV file. It was all fine in the end, but it showed the importance of checking not just the shape and size, but also the nature of R objects. A problem did pop up after I used str_trim, which converted the tibble into a large list, though it was easily converted back again.

Creation/Transformation

One thing I got stuck on was how to create the new character variable and match by name, especially for the mariam dataset. I am unsure why it was so difficult - I tried a range of functions and approaches using group_by, mutate, filter, case_when and various stringr functions such as str_detect and str_match, until finally finding a solution I could work with. One issue was the need to deal with periods (“.”), though I wonder whether the spec_tbl_df subclass also played a part. Anyway, I got there. A positive learning outcome is that a mindset of knowing there will be a solution makes it easier to keep looking for it.

My method of variable creation is unfortunately flawed by design. As I used the mention of each speaker from the original text to assign a new variable in the same row, those mentions are also counted as words spoken by that speaker, incorrectly skewing their frequency of use. Within the time limits of this report, I have not yet addressed this issue, though I have considered potential solutions, such as assigning a null value to the corresponding text at the time of creating the character variable, making removal easy through a filter (using is.na, for instance).
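As a hedged sketch of that idea (not applied in this report), the Macbeth helper from section 2.2.1 could be adapted so that matched speaker-tag lines are blanked as well as labelled, and then dropped after the fill:

# Hypothetical variant of add_macbeth_character: record the character name AND blank
# the text of the matched speaker-tag line, so the name is not later counted as a
# spoken word. Blanking happens after the loop so that longer names (e.g. "LADY MACBETH")
# can still overwrite shorter matches.
add_macbeth_character2 <- function(dataframe, names) {
  dataframe$character <- NA
  is_tag <- rep(FALSE, nrow(dataframe))
  for (name in names) {
    matched <- grepl(name, dataframe$text)
    dataframe$character[matched] <- name
    is_tag <- is_tag | matched
  }
  dataframe$text[is_tag] <- NA
  dataframe
}

# Usage would then be, for example:
# macbeth %>% add_macbeth_character2(macbeth_names) %>% fill(character) %>% filter(!is.na(text))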

My original design also included extracting features based on the play structure, i.e. acts and scenes. Using something based on the character variable extraction method, variables would be created for both act and scene (see the sketch after this paragraph). This would allow a kind of time-based analysis of presence and participation by character, and consideration of how the language develops (or not) as the play progresses. Again, the constraints of time and size on this occasion meant curtailing my aspirations in order to present a timely and focused report. But I’m excited by the possibilities of these approaches to data that are opening up to me.
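For what it’s worth, here is a rough sketch of how that act/scene idea might look, reusing the same “detect a marker line, then fill down” pattern; the regexes assume heading lines like “ACT I.” and “SCENE II.” and are an assumption that would need checking against each text:

# Hypothetical act/scene extraction for macbeth (not run for this report)
macbeth_structured <- macbeth %>%
  mutate(act   = if_else(str_detect(text, "^ACT\\b"),   text, NA_character_),
         scene = if_else(str_detect(text, "^SCENE\\b"), text, NA_character_)) %>%
  fill(act, scene)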

Learnings

Many and varied!! Trying to distill them into a few points:

  • Organisation/Planning
    • I am enjoying developing better habits for organising my work, driven from a few directions. The assignments and their requirements help lead by example, and working with RStudio/RMarkdown as tools that streamline and support workflow has been a positive contribution to my work during this teaching period.
  • Conceptual
    • Learning about Tidy Data principles has been invaluable for conceiving how to approach a dataset such as a play: considering from first principles how to identify features and create a structural framework to build upon. Seeing how this is applied in practice through my recent and final Data Visualisation assignment has helped clarify how the conceptual approach feeds directly into practical application.
  • Adaptability
    • There is no one way - but there will be a way. As I problem-solve, I keep in mind that there is a good chance someone has faced a similar issue before, and I can see from the Tidyverse tools that there is a strong commitment to supporting the R community. I’m also enjoying the adaptability of the skills we are learning - approaching numerical or text datasets with similar skillsets and principles means I go into a new project with a stronger feeling of having a foundation to build from.
    • In the instance of extracting the character and gender variables, it became obvious I needed to adapt my plan and do this prior to merging the datasets. Fortunately, it was easy to insert this into my pipeline without undoing other work. Aiming for a methodical approach that allows for flexibility when change becomes necessary helped me here. And not being afraid to change tack when required!

Finally

At the end of this report, there is still work to be done, namely:

  • removal of non-spoken words (the character-name speaker tags) from the word list, to avoid skewing the data
  • addressing differences in spelling through transformational methods such as lemmatisation and modernisation
  • adjusting the stopwords list to account for Early Modern English usage

I trust, though, that this project works as a “proof of concept”, offering the foundation for a gender-focused comparative analysis of the works by Shakespeare and Cary.



References

Main data source

Shakespeare W (1623) Macbeth, Project Gutenberg website. https://www.gutenberg.org/ebooks/1533

Cary E (1613) The Tragedie of Mariam, the faire queene of Jewry, A Celebration of Women Writers website, accessed 17 April 2023. https://digital.library.upenn.edu/women/cary/mariam/mariam.html

References

Agrawal R (14 June 2021) Must Known Techniques for text preprocessing in NLP, Analytics Vidhya website, accessed 18 April 2023. https://www.analyticsvidhya.com/blog/2021/06/must-known-techniques-for-text-preprocessing-in-nlp/

Braganza V (22 June 2022) 10 Women Writing in the Time of Shakespeare, Mental Floss website, accessed 17 April 2023. https://www.mentalfloss.com/posts/women-writers-contemporary-with-shakespeare

Dempsey M (7 May 2010) Regular expression “^[a-zA-Z]” or “[^a-zA-Z]”, Stack Overflow website, accessed 18 April 2023. https://stackoverflow.com/questions/2790813/regular-expression-a-za-z-or-a-za-z

Mowat B and Werstine P (n.d.) An Introduction to This Text: Macbeth, Folger Shakespeare Library website, accessed 17 April 2023. https://www.folger.edu/explore/shakespeares-works/macbeth/an-introduction-to-this-text

Silge J and Robinson D (2022) Text Mining with R, Text Mining with R website, accessed 17 April 2023. https://www.tidytextmining.com

Rydberg-Cox J (n.d.) Statistical Methods for Studying Literature Using R, Daedalus: Projects in Digital Humanities website, accessed 17 April 2023. https://www.chlt.org/StatisticalMethods/analyzing-literary-data.html

UQ (The University of Queensland) (6 April 2023) Text mining & text analysis, UQ website, accessed 17 April 2023. https://guides.library.uq.edu.au/research-techniques/text-mining-analysis/preparing-text-for-analysis

Tidyverse (27 December 2018) readr 1.3.1, Tidyverse website, accessed 17 April 2023. https://www.tidyverse.org/blog/2018/12/readr-1-3-1/

EB (Encyclopedia Britannica) (2 October 2020) William Shakespeare’s Important Works, EB website, accessed 18 April 2023. https://www.britannica.com/summary/William-Shakespeares-Important-Works


R packages

Johnston M, Robinson D (2022). gutenbergr: Download and Process Public Domain Works from Project Gutenberg. R package version 0.2.3. https://CRAN.R-project.org/package=gutenbergr.

Müller K (2020) here: A Simpler Way to Find Your Files, R package version 1.0.1. https://CRAN.R-project.org/package=here

Bache S, Wickham H (2022) magrittr: A Forward-Pipe Operator for R, R package version 2.0.3. https://CRAN.R-project.org/package=magrittr

Silge J, Robinson D (2016) tidytext: Text Mining and Analysis Using Tidy Data Principles in R. JOSS, 1(3). doi:10.21105/joss.00037 https://doi.org/10.21105/joss.00037

Wickham H (2022). stringr: Simple, Consistent Wrappers for Common String Operations, R package version 1.5.0. https://CRAN.R-project.org/package=stringr

Wickham H, François R, Henry L, Müller K, Vaughan D (2023) dplyr: A Grammar of Data Manipulation, R package version 1.1.0. https://CRAN.R-project.org/package=dplyr

Wickham H, Hester J, Bryan J (2023) readr: Read Rectangular Text Data, R package version 2.1.4. https://CRAN.R-project.org/package=readr

Wickham H, Seidel D (2022) scales: Scale Functions for Visualization. R package version 1.2.1. https://CRAN.R-project.org/package=scales.

Wickham H, Vaughan D, Girlich M (2023) tidyr: Tidy Messy Data, R package version 1.3.0. https://CRAN.R-project.org/package=tidyr

Wickham H. (2016) ggplot2: Elegant Graphics for Data Analysis, Springer-Verlag New York.

Xie Y (2023) knitr: A General-Purpose Package for Dynamic Report Generation in R, R package version 1.42.


