# Practice Session 04: Networks from text

In this session we will learn to construct a network from a set of implicit relationships. The relationships that we will study are between accounts on Twitter, a micro-blogging service.

We will create two networks: one directed and one undirected.

* In the **directed mention network**, there is a link of weight *w* from account *x* to account *y* if account *x* has re-tweeted (re-posted) or mentioned account *y* a total of *w* times.

* In the **undirected co-mention network**, there is a link of weight *w* between accounts *x* and *y* if both accounts have been mentioned together in *w* tweets.

The input material you will use is a file named `CovidLockdownCatalonia.json.gz`, available in the [data/](data/) directory. This is a gzip-compressed file, which you can de-compress using the `gunzip` command. The file contains about 35,500 messages ("tweets") posted between March 13th, 2020, and March 14th, 2020, containing a hashtag or keyword related to COVID-19, and posted by a user declaring a location in Catalonia.

The tweets are in a format known as [JSON](https://en.wikipedia.org/wiki/JSON#Example). Python's JSON library takes care of translating it into a dictionary.
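As a minimal sketch of that translation, the snippet below parses a hand-written JSON string. The record is made up for illustration; it only mimics the two fields used later in this session (`user.screen_name` and `full_text`), and real tweets carry many more fields.

```python
import json

# Hypothetical one-line record, shaped like the fields used in this session
line = '{"user": {"screen_name": "example_user"}, "full_text": "Hello @another_user"}'

tweet = json.loads(line)  # JSON object -> Python dictionary

print(type(tweet).__name__)
print(tweet["user"]["screen_name"])
print(tweet["full_text"])
```

Nested JSON objects become nested dictionaries, which is why the author name is read with two keys in a row.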

**How was this file obtained?** This file was obtained from [CrisisNLP](https://crisisnlp.qcri.org/covid19), a website that provides collections of tweets about COVID-19. It only provides the identifier of each tweet, known as a tweet-id. To recover the entire tweet, a process commonly known as *re-hydration* was used, which involves querying a Twitter API with the tweet-id to obtain the tweet. This can be done with a little bit of programming or with software such as [twarc](https://github.com/DocNow/twarc#dehydrate).

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

Author: <font color="blue">Your name here</font>

E-mail: <font color="blue">Your e-mail here</font>

Date: <font color="blue">The current date here</font>

# 1. Create the directed mention network

Create the **directed mention network**, which has a weighted edge (source, target, weight) if user *source* mentioned user *target* at least once; with *weight* indicating the number of mentions.

Create two files: one containing all edges, and one containing only the edges with *weight* greater than or equal to 2.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [14]:
import io
import json
import gzip
import csv
import re

from IPython.display import Image

In [15]:
# Leave this code as-is

# Input file
COMPRESSED_INPUT_FILENAME = "CovidLockdownCatalonia.json.gz"

# These are the output files, leave as-is
OUTPUT_ALL_EDGES_FILENAME = "CovidLockdownCatalonia.csv"
OUTPUT_FILTERED_EDGES_FILENAME = "CovidLockdownCatalonia-min-weight-filtered.csv"
OUTPUT_CO_MENTIONS_FILENAME = "CovidLockdownCatalonia-co-mentions.csv"

## 1.1. Extract mentions

The `extract_mentions(text)` function is used to extract mentions, so that if we give it, for instance, `RT @DiariDeSabadell: check this post by @EspaiNaturaSbd`, it returns the list `['DiariDeSabadell', 'EspaiNaturaSbd']`.

You can now print all the people mentioned in a tweet by doing:

```python
mentions = extract_mentions(message)
for mention in mentions:
    print("%s mentioned %s" % (author, mention))
```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [16]:
# Leave this code as-is

def extract_mentions(text):
    return re.findall("@([a-zA-Z0-9_]{5,20})", text)

print(extract_mentions("RT @DiariDeSabadell: check this post by @EspaiNaturaSbd"))

['DiariDeSabadell', 'EspaiNaturaSbd']


## 1.2. Count mentions

We do not need to uncompress this file (it is about 236 MB uncompressed, but only 31 MB compressed); we can read it directly while it is compressed.

```python
with gzip.open(COMPRESSED_INPUT_FILENAME, "rt", encoding="utf-8") as input_file:
    for line in input_file:
        tweet = json.loads(line)
        author = tweet["user"]["screen_name"]
        message = tweet["full_text"]
        print("%s: '%s'" % (author, message))
```
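To try this reading pattern without the full dataset, here is a small self-contained sketch. It assumes the same layout as the real file (one JSON object per line), writes two made-up records to a temporary gzip file, and reads them back; `sample_tweets` and `sample_path` are names invented for this example.

```python
import gzip
import json
import os
import tempfile

# Two made-up records, one JSON object per line
sample_tweets = [
    {"user": {"screen_name": "alice_bcn"}, "full_text": "Thanks @salutcat"},
    {"user": {"screen_name": "bob_girona"}, "full_text": "No mentions here"},
]

# Write them to a temporary gzip file in text mode
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as output_file:
    for tweet in sample_tweets:
        output_file.write(json.dumps(tweet) + "\n")

# Read the file back line by line, without decompressing it on disk
authors = []
with gzip.open(sample_path, "rt", encoding="utf-8") as input_file:
    for line in input_file:
        tweet = json.loads(line)
        authors.append(tweet["user"]["screen_name"])

print(authors)
```

The `"rt"` mode asks `gzip.open` for decoded text rather than bytes, so each `line` can be passed to `json.loads` as a string.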

To count how many times each mention happens, you will keep a dictionary:

```python
mentions_counter = {}
```

Each key in the dictionary will be a tuple `(author, mention)` where `author` is the username of the person who writes the message, and `mention` the username of someone who is mentioned in the message. To update the dictionary, use this code while you are reading the input file:

```python
for mention in mentions:
    key = (author, mention)
    if key in mentions_counter:
        mentions_counter[key] += 1
    else:
        mentions_counter[key] = 1
```
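An optional alternative, shown here only as a sketch on made-up data, is `collections.Counter` from the standard library. It behaves like a dictionary whose missing keys count as zero, so the `if`/`else` is not needed. The list `parsed_tweets` is a hypothetical stand-in for tweets that have already been read and passed through `extract_mentions`.

```python
from collections import Counter

# Made-up (author, mentions) pairs standing in for tweets already parsed
parsed_tweets = [
    ("alice_bcn", ["salutcat", "govern"]),
    ("alice_bcn", ["salutcat"]),
    ("bob_girona", ["govern"]),
]

mentions_counter = Counter()
for author, mentions in parsed_tweets:
    for mention in mentions:
        # Missing keys start at 0, so no explicit membership test is needed
        mentions_counter[(author, mention)] += 1

print(mentions_counter[("alice_bcn", "salutcat")])   # seen twice
print(mentions_counter[("bob_girona", "salutcat")])  # never seen: 0, no KeyError
```

Either approach yields the same counts; a `Counter` can be iterated like a regular dictionary.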

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to read the compressed input file and create the mentions_counter dictionary.</font>

Print the number of times the account `BCN_Mobilitat` mentioned `TMBinfo`. It should be 8.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to print all the pairs of accounts (u,v) in which account *u* mentioned account *v*, and account *v* mentioned account *u*. Do not repeat pairs, i.e., if you print "Accounts @a and @b mention each other" do not additionally print "Accounts @b and @a mention each other"</font>

Now we write a file `OUTPUT_ALL_EDGES_FILENAME` with **all** the edges in this graph `(Source, Target, Weight)` as a tab-separated file, and a file `OUTPUT_FILTERED_EDGES_FILENAME` with only the edges of weight greater than or equal to 2.
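The tab-separated layout can be tried in memory before writing to disk. The sketch below is an illustration on a made-up counter, not the reference solution: it writes to an `io.StringIO` buffer with `csv.writer`, keeps only edges at or above a threshold, and reads the rows back. The names `toy_counter` and `min_weight` are hypothetical.

```python
import csv
import io

# Made-up edge weights: (source, target) -> number of mentions
toy_counter = {("alice_bcn", "salutcat"): 3, ("bob_girona", "govern"): 1}
min_weight = 2  # hypothetical threshold for the filtered file

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", quotechar='"', lineterminator="\n")
writer.writerow(["Source", "Target", "Weight"])
for (source, target), weight in toy_counter.items():
    if weight >= min_weight:
        writer.writerow([source, target, weight])

# Read the buffer back to check the header and the surviving edges
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter="\t"))
print(rows)
```

Note that `csv.reader` returns every field as a string, so the weight comes back as `"3"` rather than `3`.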

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [20]:
# Leave this code as-is

lines_written = 0
with io.open(OUTPUT_ALL_EDGES_FILENAME, "w") as output_file:
    writer = csv.writer(output_file, delimiter='\t', quotechar='"', lineterminator='\n')
    writer.writerow(["Source", "Target", "Weight"])
    for key in mentions_counter:
        author = key[0]
        mention = key[1]
        weight = mentions_counter[key]
        writer.writerow([author, mention, weight])
        lines_written += 1
        
print("Wrote %d lines to file %s" % (lines_written, OUTPUT_ALL_EDGES_FILENAME))

Wrote 33870 lines to file CovidLockdownCatalonia.csv


<font size="+1" color="red">Replace this cell with your code to create a file named `OUTPUT_FILTERED_EDGES_FILENAME` containing all (author, mention) pairs with a weight greater than or equal to 2.</font>

# 2. Visualize the directed mention network

## 2.1. Visualize the largest connected component


Open the **filtered** edge file in Cytoscape, by importing its CSV file. You may have to set the delimiter to "Tab" in the advanced options, when importing.

(a) Execute ``Layout > Edge weighted spring embedded layout > Edge betweenness``

(b) Style edges to add arrows at the end of each edge.

(c) **Extract the largest connected component of the graph.** Click on any node in that connected component, then click on the "two houses" (neighbor) icon repeatedly until selecting all nodes in the connected component. Then ``File > New Network > From Selected Nodes, all Edges``.

*Tip*: To count nodes in Cytoscape, hold shift while clicking and select the nodes. In the lower-right corner you should see a count of nodes and edges.
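If you want to cross-check the numbers you read in Cytoscape, the largest connected component can also be computed in Python. The following is a rough sketch on made-up edges, using a breadth-first search that ignores edge direction; `toy_edges` and `component_of` are names invented for this example and are not part of the provided code.

```python
from collections import defaultdict, deque

# Made-up directed edges (source, target); direction is ignored below
toy_edges = [("a", "b"), ("b", "c"), ("d", "e")]

# Build an undirected adjacency structure
neighbors = defaultdict(set)
for source, target in toy_edges:
    neighbors[source].add(target)
    neighbors[target].add(source)

def component_of(start, neighbors):
    """Return the set of nodes reachable from start, ignoring edge direction."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for other in neighbors[node]:
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen

# Visit every node once, collecting one set per connected component
components = []
visited = set()
for node in list(neighbors):
    if node not in visited:
        component = component_of(node, neighbors)
        visited |= component
        components.append(component)

largest = max(components, key=len)
share = 100.0 * len(largest) / len(neighbors)
print("%d nodes, %.1f%% of all nodes" % (len(largest), share))
```

On the real data, the edges would come from the filtered edge file rather than from `toy_edges`.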

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">What is the size of the largest connected component, both as a number of nodes and as a percentage of the nodes in the graph? What is the diameter of the largest connected component, disregarding edge direction? </font>

Style nodes in the largest connected component:

* Run `Tools > Analyze Network ...` -- select **directed graph** because this graph is directed
* Node size and label size proportional to their in-degree
* Node color proportional to in-degree (white=small, blue=large)
* Edge width proportional to weight

Save the image as `mentions-largest-cc.png`; the next cell should display it. It is OK if your image does not look exactly like the example we provide.

*Tip*: The file is large so if you want to see all details while zooming out you may have to set ``View > Always show Graphic Details``. Note this makes the program run slower.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [22]:
# Adjust width/height as needed

Image(url="mentions-largest-cc.png", width=1200)

<font size="+1" color="red">Replace this cell with some observations about this graph. Which accounts are mentioned by many other accounts? Why do you think these accounts are often mentioned? Which accounts mention many other accounts? Why do you think they do that?</font>

## 2.2. Cluster the largest connected component


Keep only the largest connected component, deleting the rest of the nodes (you can hold shift while drawing a rectangle, to select some nodes).

Run the ClusterMaker2 plug-in to create a clustering (affinity propagation clustering) of this graph using the *weight* edge attribute. Color nodes according to their cluster, using a discrete mapping. Note that if you right-click on "Mapping type" when creating a discrete mapping, you can use an automatic mapping generator that you can fine-tune later.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Look at the cluster containing the account ``@salutcat``. Are there other thematically-related accounts in the same cluster? Name three of them. Indicate why you think they are in the same cluster.</font>

## 2.3. Examine degree distributions

In the **network containing the largest connected component** look at the Results Panel of the network analyzer. From there, when `Node Table` is selected in the panel below, you can click on `Node degree distribution ...` and obtain in-degree and out-degree plots. 

Export the distributions as `mentions-largest-cc-indegree.png` and `mentions-largest-cc-outdegree.png`; the next cell should display them.
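To build intuition for what these plots contain, the sketch below computes in-degree and out-degree distributions in Python for a made-up edge dictionary. It counts edges, not weights, which is an assumption about how the degree is defined here; `toy_counter` is a hypothetical stand-in for the edges of the largest connected component.

```python
from collections import Counter

# Made-up edges: (source, target) -> weight
toy_counter = {("a", "c"): 2, ("b", "c"): 1, ("c", "d"): 4}

in_degree = Counter()
out_degree = Counter()
for (source, target) in toy_counter:
    out_degree[source] += 1  # one more outgoing edge for the source
    in_degree[target] += 1   # one more incoming edge for the target

# Distribution: degree value -> number of nodes having that degree
nodes = set(in_degree) | set(out_degree)
in_degree_distribution = Counter(in_degree[node] for node in nodes)
out_degree_distribution = Counter(out_degree[node] for node in nodes)

print(sorted(in_degree_distribution.items()))
print(sorted(out_degree_distribution.items()))
```

Nodes that never appear as a target still count in the in-degree distribution, with degree 0, because a `Counter` returns 0 for missing keys.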

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [26]:
# Adjust width/height as needed

display(Image(url="mentions-largest-cc-indegree.png", width=400))

display(Image(url="mentions-largest-cc-outdegree.png", width=400))

<font size="+1" color="red">Replace this cell with a brief commentary, in your own words, about these degree distributions.</font>

# 3. Create the undirected co-mention network

The **undirected co-mention network** connects two accounts if they are both mentioned in the same tweet. The weight of the edge is the number of tweets in which the accounts are co-mentioned.

Suppose the mentions in a tweet are in the list ``mentions``. Then you can iterate through all pairs of co-mentioned accounts like this:

```python
for mention1 in mentions:
    for mention2 in mentions:
        if mention1 < mention2:
            key = (mention1, mention2)
```

Read the input file again to create a dictionary `co_mentions_counter` in which keys are tuples (user1, user2) in which user1 lexicographically precedes user2 (user1 < user2), and values are the number of times user1 and user2 have appeared together in a tweet.
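A possible variant, given only as a sketch on made-up data, uses `itertools.combinations` over the sorted set of mentions in each tweet. Sorting guarantees that user1 precedes user2 in every key, and the set ensures that each pair is counted at most once per tweet, so counts may differ from those of the nested loop above when a tweet repeats a handle. The list `toy_mentions_per_tweet` is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Made-up mention lists, one per tweet
toy_mentions_per_tweet = [
    ["govern", "emergenciescat"],
    ["emergenciescat", "govern", "salutcat"],
    ["govern", "govern"],  # a repeated mention does not form a pair
]

co_mentions_counter = Counter()
for mentions in toy_mentions_per_tweet:
    unique_sorted = sorted(set(mentions))
    # combinations() on a sorted list yields pairs with user1 < user2
    for user1, user2 in combinations(unique_sorted, 2):
        co_mentions_counter[(user1, user2)] += 1

for pair, count in sorted(co_mentions_counter.items()):
    print(pair, count)
```

Because keys are always stored in lexicographic order, a lookup such as `co_mentions_counter[("emergenciescat", "govern")]` must use the same order.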

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to create the `co_mentions_counter`.</font>

As a verification, print the number of times the accounts `emergenciescat` and `govern` have been mentioned together. It should be 31.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [39]:
# KEEP AS-IS

print(co_mentions_counter[('emergenciescat', 'govern')])

31


<font size="+1" color="red">Replace this cell with your code to print all pairs of accounts that have been co-mentioned 20 times or more.</font>

Now create a file named `OUTPUT_CO_MENTIONS_FILENAME` containing co-mentions in tab-separated columns `Source, Target, Weight`.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to create the co-mentions file.</font>

# 4. Visualize the undirected co-mention network in Cytoscape


Open the `OUTPUT_CO_MENTIONS_FILENAME` file in Cytoscape.

**Select nodes having degree (in + out) greater than or equal to 15.** You can do that with the `Filter` panel on the left, then create a new network from the selected nodes.

Use `Layout > Edge weighted spring embedded layout > Weight` to create a layout by edge weight. You can also move nodes around to adjust this layout if you want it to be more readable.

Style the network so that:

* All nodes have the same size
* Edges have width proportional to weight.
* Edges are black for small weight, and red for large weight

Export the image as `co-mentions-min-degree-15.png`; the next cell should display it.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [42]:
# Adjust width/height as needed

Image(url="co-mentions-min-degree-15.png", width=1200)

<font size="+1" color="red">Find two dense communities in which nodes seem to be thematically related, for instance, a densely connected sub-graph or connected component in which (according to the names of the accounts) many nodes seem to have something in common. Replace this cell with a commentary on these two communities, indicating some example nodes and why you think they are related.</font>


# DELIVER (individually)

Deliver a zip file containing:

* Your code as a Python notebook (a `.ipynb` file).
   * Remove all unnecessary elements
   * Add comments when needed
* Any png files that you inserted in the notebook

## Extra points available

For more learning and extra points, create a file `account-type.csv` containing the type of account of the top 50 accounts with the most mentions. You can use types "journalist", "media", "politician", "government institution", "individual", "health-related", etc. which you should categorize manually. Create a visualization of the **mentions** graph either including only these 50 accounts, or including more accounts but highlighting these top 50 with colors. Use broad categories as needed and **do not worry if there are some ambiguities in the categorization,** e.g., if you are not 100% sure on whether someone should be in one category or another; just do your best.

**Note:** if you go for the extra points, add ``<font size="+2" color="blue">Additional results: account types</font>`` at the top of your notebook.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>


<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, text, and figures were produced by myself.</font>
