What is our motivation?

This project takes a closer look at The Office, the American sitcom that depicts the everyday lives of office employees at the Dunder Mifflin Paper Company. Our dataset consists of character page content from the Fandom wiki Dunderpedia as well as the characters' dialogue from the entire show, 9 seasons in total.

The Office is a very entertaining and thoroughly binge-worthy series which, despite its at times dark humour, will have you laughing repeatedly in every episode. Being fans of the show, our main motivation for choosing this dataset was to gain a deeper insight into the characters, their relationships and how they evolve throughout the series.

Compared to the Zelda BotW network, the Fandom wiki for The Office contains fewer and less detailed character descriptions, so we expect our analysis to depend on the dialogue dataset to a greater extent. This allows us to explore the text analysis tools we have learned in the course and to come up with new analyses that combine the methods we’ve learned in new ways. Because the show is a mockumentary, the characters are caricatures of regular office workers, and we hope to describe the characters from their language and compare the results with how we already know them from watching the show.

For readers who are unfamiliar with the TV series, we have included a brief description of the main characters in the show (WARNING: spoilers!):
Michael: the regional manager of the Dunder Mifflin Scranton branch in Seasons 1-7 and the main character of the series. He is a well-intentioned man with quite a special type of humour: while it seems innocent to him, he often offends and annoys his employees. At the end of the 7th season, he proposes to HR representative Holly Flax and moves to Colorado, leaving the manager position to Andy.
Dwight: a salesman and the assistant to the regional manager, a fictional title created by Michael. He is notorious for his lack of social skills and common sense, his love for martial arts and the justice system, and his office rivalry with fellow salesman Jim Halpert. He is also known for his romantic relationship with Angela Martin, head of the accounting department.
Jim: also a salesman at the office. He is intelligent and mild-mannered, but also defined by the pranks he plays on his rival, fellow salesman Dwight Schrute. For several seasons he and the receptionist Pam are the “will-they-won’t-they” couple. They begin dating in the fourth season, marry in the sixth, and have children in the sixth and eighth.
Pam: initially the receptionist at Dunder Mifflin, before becoming a saleswoman and eventually office administrator. Her character is shy, friendly, assertive, and artistically inclined. At the beginning of the series, Pam has been engaged to Roy for three years, but shares a romantic interest with Jim.


About the data - basic stats

We based our analysis on two main data sources: Dunderpedia, the online encyclopedia about The Office, and the dialogues, a collection of all lines spoken in every scene of 185 episodes.

To collect the data for our network, we extracted the raw HTML behind the wiki pages that list the characters. As character attributes we used gender, the list of seasons a character appeared in and the character's Dunder Mifflin branch.
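As a rough illustration of how links between character pages could be pulled out of raw HTML, here is a minimal sketch using only Python's standard library. The sample markup, the `/wiki/` link prefix and the helper names (`WikiLinkCollector`, `extract_links`) are assumptions made for this example; they are not taken from our actual scraping code or from the real Dunderpedia page structure.

```python
from html.parser import HTMLParser


class WikiLinkCollector(HTMLParser):
    """Collect the targets of internal wiki links (<a href="/wiki/...">)."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Assumption: internal links start with "/wiki/"; namespaced pages
        # such as "Category:..." contain a colon and are skipped.
        if href.startswith("/wiki/") and ":" not in href:
            self.targets.append(href[len("/wiki/"):].replace("_", " "))


def extract_links(html):
    """Return the linked page names, de-duplicated in first-seen order."""
    parser = WikiLinkCollector()
    parser.feed(html)
    return list(dict.fromkeys(parser.targets))


# Hypothetical snippet of a character page, made up for illustration
sample_html = """
<div class="page-content">
  <p><a href="/wiki/Jim_Halpert">Jim</a> pranks
     <a href="/wiki/Dwight_Schrute">Dwight</a>, and
     <a href="/wiki/Jim_Halpert">Jim</a> later marries
     <a href="/wiki/Pam_Beesly">Pam</a>.</p>
  <a href="/wiki/Category:Characters">Characters</a>
  <a href="https://example.com/">external</a>
</div>
"""

links = extract_links(sample_html)
print(links)
```

Each extracted name can then be treated as a directed link from the page being parsed to the page it mentions.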

The figure below shows that our dataset contains 158 male characters, 106 female characters and 25 characters of unknown gender.

Gender distribution

The dialogues dataset consists of 54626 lines along with information about which season, episode and scene each line was said in and by whom (speaker column).

Below you can see the number of lines spoken by each character. Even though Michael wasn’t present in the 8th season, he still has the highest number of lines! Apart from that, we can see that Dwight, Jim and Pam are the most prominent characters in the show. The character with the 5th-highest number of lines is Andy Bernard, who replaced Michael Scott as boss in the 8th season. The rest of the main characters speak roughly the same number of lines.

Dialogues
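The per-character line counts can be reproduced along the following lines. This is only a sketch on a tiny made-up table: the `speaker` column is mentioned above, while the other column names and the toy rows are assumptions for illustration.

```python
import csv
import io
from collections import Counter

# Toy stand-in for the dialogues file; every column name except "speaker"
# is assumed, and the rows are placeholders rather than real dialogue.
sample_csv = """season,episode,scene,speaker,line
1,1,1,Michael,first toy line
1,1,1,Jim,second toy line
1,1,2,Michael,third toy line
1,1,3,Pam,fourth toy line
1,1,3,Michael,fifth toy line
"""


def lines_per_speaker(rows):
    """Count how many rows (spoken lines) belong to each speaker."""
    return Counter(row["speaker"] for row in rows)


rows = list(csv.DictReader(io.StringIO(sample_csv)))
counts = lines_per_speaker(rows)

# most_common() sorts speakers from most to fewest lines
for speaker, n_lines in counts.most_common():
    print(speaker, n_lines)
```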

We created a new column, number_of_words, which holds the number of words in each line. Michael speaks the longest lines on average. His standard deviation is also the highest, which might indicate that he has the highest number of monologues, i.e. scenes where he talks to the camera alone.
Kevin speaks the shortest lines (with the lowest standard deviation as well), which makes sense since this character is not the smartest and lacks communication skills. Kevin got his job as an accountant at Dunder Mifflin after applying for a job in the warehouse, because Michael Scott had “a feeling about him.”

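| Speaker | Total words | Mean words per line | Standard deviation |
|---|---|---|---|
| Michael | 147523 | 13.56 | 16.27 |
| Dwight | 75134 | 11.06 | 12.46 |
| Jim | 57550 | 9.19 | 10.84 |
| Pam | 44807 | 8.94 | 10.78 |
| Andy | 43746 | 11.72 | 12.69 |
| Angela | 13391 | 8.63 | 9.65 |
| Erin | 12763 | 8.93 | 10.37 |
| Kevin | 12292 | 7.97 | 8.77 |
| Oscar | 11903 | 8.78 | 9.04 |
| Ryan | 11624 | 9.79 | 10.43 |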
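The statistics in the table can be computed roughly as follows. This is a sketch under assumptions: words are obtained by splitting on whitespace, and the sample standard deviation is used; the exact tokenisation and estimator behind the table may differ. The function name `word_stats` and the toy lines are made up for the example.

```python
from collections import defaultdict
from statistics import mean, stdev


def word_stats(lines):
    """lines: iterable of (speaker, text) pairs.

    Returns {speaker: (total_words, mean_words_per_line, std_dev)}.
    """
    words_per_line = defaultdict(list)
    for speaker, text in lines:
        # Assumption: a "word" is any whitespace-separated token
        words_per_line[speaker].append(len(text.split()))

    stats = {}
    for speaker, counts in words_per_line.items():
        # The sample standard deviation needs at least two lines
        spread = stdev(counts) if len(counts) > 1 else 0.0
        stats[speaker] = (sum(counts), mean(counts), spread)
    return stats


# Placeholder lines, not real dialogue
toy_lines = [
    ("Michael", "one two three four five six"),
    ("Michael", "one two"),
    ("Kevin", "one"),
    ("Kevin", "one two three"),
]

stats = word_stats(toy_lines)
for speaker, (total, avg, spread) in stats.items():
    print(f"{speaker}: total={total}, mean={avg:.2f}, std={spread:.2f}")
```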

About the network - brief introduction

Based on the content of the wiki pages, we built our first directed network. After removing isolated nodes and extracting the largest connected component, it consists of 284 nodes and 1241 links.
The top 5 nodes by in-degree are: Michael Scott (113), Dwight Schrute (94), Jim Halpert (81), Pam Beesly (77) and Andy Bernard (57).
The top 5 nodes by out-degree are: Andy Bernard (33), Michael Scott (26), Phyllis Vance (26), Pam Beesly (24) and Dwight Schrute (24).
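The construction steps can be sketched in plain Python as below. Two things are assumptions here: that "largest connected component" means the largest weakly connected component (edge direction ignored), and the toy `page_links` mapping, whose "Extra" and "Loner" entries are hypothetical. In practice a graph library would typically do this work; the sketch only shows the logic.

```python
from collections import defaultdict


def build_digraph(page_links):
    """page_links: {page: [pages it links to]}. Keep links between known pages."""
    nodes = set(page_links)
    edges = set()
    for src, targets in page_links.items():
        for dst in targets:
            if dst in nodes and dst != src:  # drop unknown targets and self-links
                edges.add((src, dst))
    return nodes, edges


def largest_weak_component(nodes, edges):
    """Largest component when edge direction is ignored (an assumption)."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, best = set(), set()
    for start in nodes:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:  # depth-first search from the start node
            for nxt in neighbours[stack.pop()]:
                if nxt not in component:
                    component.add(nxt)
                    stack.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def degrees(nodes, edges):
    """Return (in_degree, out_degree) dictionaries for the given nodes."""
    in_deg = {n: 0 for n in nodes}
    out_deg = {n: 0 for n in nodes}
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return in_deg, out_deg


# Toy link structure, made up for illustration
page_links = {
    "Michael": ["Dwight", "Jim"],
    "Dwight": ["Michael", "Jim"],
    "Jim": ["Michael", "Pam"],
    "Pam": ["Jim", "Michael"],
    "Extra A": ["Extra B"],  # small separate component
    "Extra B": [],
    "Loner": [],             # isolated node
}

nodes, edges = build_digraph(page_links)
core = largest_weak_component(nodes, edges)
core_edges = {(u, v) for u, v in edges if u in core and v in core}
in_deg, out_deg = degrees(core, core_edges)

# Sort by degree (descending), then by name for a stable order
top_in = sorted(in_deg.items(), key=lambda kv: (-kv[1], kv[0]))[:5]
print(len(core), "nodes,", len(core_edges), "links")
print("top in-degree:", top_in)
```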

Wiki pages network

The out-degree distribution of our network (right) resembles the degree distribution of a random Erdős–Rényi network. The in-degree distribution (left) is quite different from the Poisson distribution that characterizes random networks and seems to follow a power law, which is the defining characteristic of a scale-free network.
The reason for this difference may lie in the way the wiki pages are written. A page does not describe in detail who a character interacted with in every single scene; it mostly mentions the connections to the main characters like Michael, Dwight, Pam and Jim. These characters therefore have the highest in-degree, which means that they are mentioned on a lot of character pages.

Distribution
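To make the comparison with a random network concrete, the sketch below computes an empirical degree distribution and the Poisson probabilities for the same mean degree. The mean degree 1241/284 (about 4.37) follows from the node and link counts given above; the toy degree list and the function names are assumptions for illustration only.

```python
import math
from collections import Counter


def degree_distribution(degree_values):
    """Return {k: fraction of nodes with degree k}, sorted by k."""
    counts = Counter(degree_values)
    n = len(degree_values)
    return {k: c / n for k, c in sorted(counts.items())}


def poisson_pmf(k, mean_degree):
    """Probability of degree k in a random network with this mean degree."""
    return math.exp(-mean_degree) * mean_degree ** k / math.factorial(k)


# In a directed network, mean in-degree = mean out-degree = links / nodes
mean_degree = 1241 / 284

# Toy degree sequence, made up for illustration
toy_degrees = [0, 1, 1, 2, 2, 2, 3, 3, 4, 12]
empirical = degree_distribution(toy_degrees)

for k, p in empirical.items():
    print(f"k={k}: empirical={p:.2f}, poisson={poisson_pmf(k, mean_degree):.4f}")

# A Poisson distribution with this mean gives a vanishing probability to an
# in-degree as large as the 113 observed for the top hub.
tail = poisson_pmf(113, mean_degree)
print(tail < 1e-50)
```

A hub with in-degree 113 is therefore essentially impossible under the random-network model, which is consistent with the heavy-tailed in-degree distribution described above.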

To investigate the relationship between the in- and out-degree of each node, we created the visualisation below. It shows a clear positive relationship between in-degree and out-degree on the log(1+x) scale, which is reflected in a Pearson correlation of 0.75. This means that a node with many incoming links tends to have many outgoing links as well. Additionally, the third plot shows the excess degree, defined as k_ex = k_in - k_out [1]; for most characters the in-degree is close to the out-degree. The exceptions form a long right tail: these are the hubs, main characters like Michael, Jim, Dwight and Pam, who have many incoming links and far fewer outgoing ones.

In_out_degree
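The two quantities in this comparison can be sketched as follows. The hub degrees are the values quoted in the top-5 lists above (only characters appearing in both lists are included); the toy degree lists used for the correlation and the helper name `pearson` are assumptions for illustration and will not reproduce the reported 0.75.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# (in-degree, out-degree) pairs quoted in the text for four hubs
hubs = {
    "Michael Scott": (113, 26),
    "Dwight Schrute": (94, 24),
    "Pam Beesly": (77, 24),
    "Andy Bernard": (57, 33),
}

# Excess degree k_ex = k_in - k_out
excess = {name: k_in - k_out for name, (k_in, k_out) in hubs.items()}
print(excess)

# Toy degree lists (made up) to show the log(1 + x) transform before correlating
toy_in = [0, 1, 2, 5, 9, 40]
toy_out = [1, 1, 3, 4, 6, 12]
log_in = [math.log1p(k) for k in toy_in]
log_out = [math.log1p(k) for k in toy_out]
r = pearson(log_in, log_out)
print(round(r, 2))
```

The log(1 + x) transform keeps nodes with degree 0 in the plot, since log(1 + 0) = 0 is defined while log(0) is not.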


Authors:
Maja Jønck Hjuler (s164590)
Katarzyna Otko (s202872)
Malthe Andreas Lejbølle Jelstrup (s184291)

[1] Liu XF, Liu YL, Lu XH, Wang QX, Wang TX (2016) The Anatomy of the Global Football Player Transfer Network: Club Functionalities versus Network Properties. PLOS ONE 11(6): e0156504. https://doi.org/10.1371/journal.pone.0156504.