# Practice Session 03: Management of network data

In this session we will study an application of complex network analysis to cooking. We will start with the *flavors network*, a bi-partite network connecting culinary ingredients to flavour compounds [*].

The initial dataset, prepared by [Ling Cheng in 2016](https://github.com/lingcheng99/Flavor-Network), contains three files, plus a fourth that is used only for extra points:

* `ingredients.tsv` -- information about culinary ingredients
* `compounds.tsv` -- information about flavour compounds
* `ingredient-compound.tsv` -- flavour compounds present in each culinary ingredient
* `recipes.csv` -- ingredients used in recipes around the world (used only for extra points)

[*] Ahn, Y. Y., Ahnert, S. E., Bagrow, J. P., & Barabasi, A. L. (2011). [Flavor network and the principles of food pairing](https://doi.org/10.1038/srep00196). Scientific reports, 1(1), 1-7.


<font size="-1" color="gray">(Remove this cell when delivering.)</font>

Author: <font color="blue">Your name here</font>

E-mail: <font color="blue">Your e-mail here</font>

Date: <font color="blue">The current date here</font>

# 1. The flavors bi-partite graph

## 1.0. Examine your input files

Before you begin, we highly recommend that you:

1. Copy the input files to a local directory on your computer
2. Open them in a spreadsheet and look at them

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

## 1.1. Read the bipartite graph in a dataframe


The following code, which you can leave as-is, reads the ingredient-compound relationship into a dataframe.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [1]:
# Feel free to add imports if you need them

import io
import csv
import pandas as pd
import networkx as nx

from networkx.algorithms import bipartite

import numpy as np
import matplotlib
import scipy

import itertools

from IPython.display import Image

In [2]:
# Leave this code as-is

INPUT_INGR_FILENAME = "ingredients.tsv"
INPUT_COMP_FILENAME = "compounds.tsv"
INPUT_INGR_COMP_FILENAME = "ingredient-compound.tsv"

In [5]:
# Leave this code as-is

ingredients = pd.read_csv(INPUT_INGR_FILENAME, sep="\t")
display(ingredients.head(3))

compounds = pd.read_csv(INPUT_COMP_FILENAME, sep="\t")
display(compounds.head(3))

ingr_comp = pd.read_csv(INPUT_INGR_COMP_FILENAME, sep="\t")
display(ingr_comp.head(3))


Unnamed: 0,ingredient_id,ingredient_name,ingredient_category
0,0,magnolia_tripetala,flower
1,1,calyptranthes_parriculata,plant
2,2,chamaecyparis_pisifera_oil,plant derivative


Unnamed: 0,compound_id,compound_name,compound_code
0,0,jasmone,488-10-8
1,1,5-methylhexanoic_acid,628-46-6
2,2,l-glutamine,56-85-9


Unnamed: 0,ingredient_id,compound_id
0,1392,906
1,1259,861
2,1079,673


## 1.2. Create the flavors bipartite network


Create a new dataframe named `flavors` by joining `ingredients` and `compounds` through `ingr_comp`.

*Tips*:

* To [join](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html) a DataFrame A and a DataFrame B using a column X, use `result = A.set_index('X').join(B.set_index('X'), how='inner')`
* You will need to do two joins to solve this. First, join `ingredients` and `ingr_comp`, then join the result with `compounds`.
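
As an illustration of the two-step join described in these tips, here is a minimal sketch on toy tables. The table and column names (`items`, `links`, `tags`, `item_id`, `tag_id`) are hypothetical and only mirror the shape of the course files; this is not the solution for the real data.

```python
import pandas as pd

# Toy tables (hypothetical data, not taken from the course files)
items = pd.DataFrame({"item_id": [0, 1, 2], "item_name": ["a", "b", "c"]})
links = pd.DataFrame({"item_id": [0, 0, 1], "tag_id": [10, 11, 10]})
tags = pd.DataFrame({"tag_id": [10, 11], "tag_name": ["x", "y"]})

# First join: items with the link table, using item_id as the key
step1 = items.set_index("item_id").join(links.set_index("item_id"), how="inner")

# Second join: the intermediate result with the tag table, using tag_id as the key
step2 = step1.set_index("tag_id").join(tags.set_index("tag_id"), how="inner")

print(step2)
```

Note that `set_index` replaces the previous index, so after the second join the identifier columns are gone and only the name columns remain. Item `c` has no link and is dropped by the inner join.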

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to create the `flavors` dataframe and show its first 20 rows.</font>

Drop the `compound_code` column from the resulting dataframe, sort by `ingredient_name` then by `compound_name`, and reset its index.

*Tips:*

* To [drop column](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) x from DataFrame A, you can do: `A = A.drop(columns=['x'])`
* To [sort](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) a DataFrame A by column *x*, then by column *y*, you can do: `A = A.sort_values(['x', 'y'])`
* To [reset the index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) of a DataFrame A, you can do: `A = A.reset_index(drop=True)`; the index is the column appearing in boldface in front of every row of a DataFrame
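
The three operations in these tips can be chained as in the following sketch. It uses a toy DataFrame with hypothetical column names (`name`, `tag`, `code`), so it shows the mechanics only.

```python
import pandas as pd

# Toy DataFrame (hypothetical columns and values)
toy = pd.DataFrame({
    "name": ["b", "a", "a"],
    "tag": ["y", "z", "x"],
    "code": ["c1", "c2", "c3"],
})

toy = toy.drop(columns=["code"])        # remove the column we do not need
toy = toy.sort_values(["name", "tag"])  # sort by name, then by tag
toy = toy.reset_index(drop=True)        # renumber rows 0, 1, 2, ...

print(toy)
```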

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to modify the `flavors` dataframe as explained above, and show its first 20 rows.</font>

Write this dataframe to a `flavors.tsv` file, which should be a tab-separated file containing the three fields `ingredient_name`, `ingredient_category` and `compound_name`. Use the function [pandas.DataFrame.to_csv](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html).
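
A minimal sketch of writing a tab-separated file with `to_csv`, assuming a toy DataFrame with hypothetical columns. It writes to an in-memory buffer so that it can be run without creating a file; passing a filename instead of the buffer writes to disk. Setting `index=False` keeps the row index out of the output, so the file has only the data columns.

```python
import io
import pandas as pd

# Toy DataFrame (hypothetical columns and values)
toy = pd.DataFrame({"name": ["a", "b"], "tag": ["x", "y"]})

# Write tab-separated text without the index column
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False)

text = buffer.getvalue()
print(text)
```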

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to save *flavors* into a tab-separated file.</font>

## 1.3. Open this bi-partite network in Cytoscape


### 1.3.1. Examine the file you generated

Open the ``flavors.tsv`` file in a spreadsheet program to make sure you generated it correctly; it should have exactly 3 tab-separated columns.

### 1.3.2. Import this file in Cytoscape

Remember these files are imported with ``File > Import > Network from File ...``. Then, you have to select:

* ingredient_name as a ``Source Node``
* ingredient_category as a ``Source Node Attribute``.
* compound_name as a ``Target Node`` 

### 1.3.3. Draw a small part of this graph

Find the `garlic` node and everything connected to it at distance 1 or 2. To do this, find "garlic" and then click on the "two-houses" (neighbors) icon twice. Extract the selected nodes as a sub-graph by doing `File > New network > From selected nodes, all edges`.

Run the network analyzer and then perform `Layout > Edge weighted spring embedded layout` using edge betweenness.

Style the network so that ingredient nodes have a color that depends on their category. Use any color except white for the categories, and set white as the default node color so that compound nodes remain white. Set the label color to black and the node shape to ellipse.

Save the image as `flavors.png` and its corresponding legend as `flavors-legend.gif`. Use the next cell to display the network and legend. This time, node labels do not need to be visible or readable; we just want to appreciate the overall clusters.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [16]:
# KEEP THIS CELL AS-IS

# Just adjust width/height if necessary

Image(url="flavors.png", width=1200)

### 1.3.4. Compounds in common

*Onion* and *Garlic* get their distinctive smell from sulfur-containing compounds. How many compounds do onion and garlic have in common? Based solely on their names, how many of them do you think contain sulfur?

To answer this question, extract the nodes *Onion*, *Garlic*, and all their compounds in common as a graph. Lay it out as a hierarchical graph and modify it so that *Onion* appears at the top of the image, the compounds in the middle, and *Garlic* at the bottom of the image.

Save the image as `compounds-in-common.png`; the next cell should display it.
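
If you want to cross-check the count you read off the Cytoscape image, one possible approach is a set intersection in pandas. The sketch below is an assumption-laden illustration: it uses a made-up table with the column names of `flavors` and placeholder compound names, and the helper `compounds_of` is hypothetical.

```python
import pandas as pd

# Made-up rows using the column names of `flavors`; compound names are placeholders
toy_flavors = pd.DataFrame({
    "ingredient_name": ["onion", "onion", "garlic", "garlic"],
    "compound_name": ["compound_a", "compound_b", "compound_b", "compound_c"],
})

def compounds_of(df, ingredient):
    """Return the set of compound names listed for one ingredient."""
    return set(df.loc[df["ingredient_name"] == ingredient, "compound_name"])

# Compounds present in both ingredients
common = compounds_of(toy_flavors, "onion") & compounds_of(toy_flavors, "garlic")
print(len(common), sorted(common))
```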

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [19]:
# KEEP THIS CELL AS-IS

# Just adjust width/height if necessary

Image(url="compounds-in-common.png", width=1200)

<font size="+1" color="red">Replace this cell with a brief commentary indicating how many compounds they have in common and how many of them seem to contain sulfur (based on their names). Name a couple of those sulfur-containing compounds.</font>

# 2. The ingredient-ingredient graph

The bi-partite flavors graph is hard to visualize as it mixes ingredients and compounds. We will now try to visualize only the connections between ingredients.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>


## 2.1. Create an ingredient-ingredient.gml file


First, copy the list of ingredient names into an array `ingredients_array`. To convert column *x* of DataFrame *A* to an array, use `np.asarray(A['x'])`.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to create `ingredients_array` with the list of ingredients and to print the number of ingredients.</font>

Then, create a dictionary named `ingredient_to_compounds`, in which keys are ingredients, and values are sets of compounds. To create an empty set, you can use `s = set()`. To add to a set, you can do `s.add(element)`. Your code should look like this:

```python
ingredients_array = ...
print("There are %d ingredients" % (len(ingredients_array)))

ingredient_to_compounds = {}

for index, row in flavors.iterrows():
    ...

```
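
The pattern of filling a dictionary of sets while iterating over rows can be sketched on toy data as follows. The column names `name` and `tag` and the dictionary `name_to_tags` are hypothetical stand-ins.

```python
import pandas as pd

# Toy DataFrame (hypothetical columns and values)
toy = pd.DataFrame({"name": ["a", "a", "b"], "tag": ["x", "y", "x"]})

name_to_tags = {}

for index, row in toy.iterrows():
    key = row["name"]
    # Create an empty set the first time a key is seen
    if key not in name_to_tags:
        name_to_tags[key] = set()
    name_to_tags[key].add(row["tag"])

print("There are %d keys" % len(name_to_tags))
```

Because values are sets, adding the same element twice has no effect, so duplicate rows do not inflate the counts.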

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to create the dictionary `ingredient_to_compounds`, with a set of compounds for each ingredient. Print the number of keys of this dictionary. It should be less than or equal to the number of ingredients.</font>

Next, we create a NetworkX graph with nodes representing ingredients and edges of weight `x` connecting two ingredients having `x` flavor compounds in common.

To create an empty graph, do `ingredient_ingredient = nx.Graph()`.

Now, iterate through all pairs of ingredients in `ingredients_array` and compute the compounds each pair has in common. To iterate through all pair combinations of an array X, you can use:

```python
for u, v in itertools.combinations(X,2):
    ...

```

The size of the intersection of two sets of compounds `l1` and `l2` can be obtained with `len(l1.intersection(l2))`. This will be the weight of the edge connecting the two ingredients corresponding to those sets.

Note that you may need to check whether both ingredients have compounds. You can test this with `if u in ingredient_to_compounds and v in ingredient_to_compounds`.

To facilitate visualization, we will keep only edges connecting two ingredients having **MIN_COMMON_COMPOUNDS or more compounds in common**. Set the value of **MIN_COMMON_COMPOUNDS** so that the resulting graph has somewhere around 150 +/- 30 nodes.

To add to graph *G* an edge between nodes *u* and *v* having weight *w*, do `G.add_edge(u, v, weight=w)`.
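
Putting these pieces together, the loop can be sketched on toy data as below. The names `toy_sets`, `names`, and `MIN_COMMON` are hypothetical, and the threshold of 2 is chosen only to suit the toy data; you still need to pick your own value for the real graph.

```python
import itertools
import networkx as nx

# Toy data (hypothetical): element sets per name; "d" has no entry on purpose
toy_sets = {"a": {"x", "y", "z"}, "b": {"x", "y"}, "c": {"z"}}
names = ["a", "b", "c", "d"]

MIN_COMMON = 2  # keep only pairs sharing at least this many elements

G = nx.Graph()

for u, v in itertools.combinations(names, 2):
    # Skip pairs in which one of the names has no set at all
    if u in toy_sets and v in toy_sets:
        w = len(toy_sets[u].intersection(toy_sets[v]))
        if w >= MIN_COMMON:
            G.add_edge(u, v, weight=w)

print(G.number_of_nodes(), G.number_of_edges())
```

Since nodes are only created by `add_edge`, names without any edge above the threshold never enter the graph. This is why raising the threshold reduces the number of nodes as well as the number of edges.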

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to create the `ingredient_ingredient` graph</font>

In [41]:
# Leave as-is
print("The ingredient-ingredient graph has %d nodes and %d edges" %
      (ingredient_ingredient.number_of_nodes(), ingredient_ingredient.number_of_edges()))

The ingredient-ingredient graph has 145 nodes and 1374 edges


Save the resulting graph into a file. You can use [write_gml](https://networkx.org/documentation/stable/reference/readwrite/generated/networkx.readwrite.gml.write_gml.html#networkx.readwrite.gml.write_gml) to use the *GML* format.
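
A minimal sketch of a GML round trip on a toy graph, assuming a temporary directory for the output (the path and the graph contents are hypothetical). Reading the file back with `read_gml` is a quick way to confirm that node names and weights were saved.

```python
import os
import tempfile
import networkx as nx

# Toy graph (hypothetical contents)
G = nx.Graph()
G.add_edge("a", "b", weight=2)

# Write to a temporary location, then read it back to verify
path = os.path.join(tempfile.mkdtemp(), "toy.gml")
nx.write_gml(G, path)

H = nx.read_gml(path)
print(list(H.edges(data=True)))
```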

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [42]:
OUTPUT_INGR_INGR_FILENAME = 'ingredient-ingredient.gml'

<font size="+1" color="red">Replace this cell with your code to save the `ingredient_ingredient` graph to the file OUTPUT_INGR_INGR_FILENAME</font>

## 2.2. Work with this file in Cytoscape

### 2.2.1. Inspect this file

*Tip:* Open the ``ingredient-ingredient.gml`` file in a text editor first to see how it is structured.


### 2.2.2. Import this file into Cytoscape

To import this file into Cytoscape:

* `File > Import > Network from file ...`
* Open the `ingredient-ingredient.gml` file

Now we need to import ingredient categories:

* `File > Import > Table from file ...`
* Open the `ingredients.tsv` file
* Import data as "Node Table Columns"
* `ingredient_name`: key
* `ingredient_category`: attribute

Do a `Layout > Edge weighted spring embedded` layout on the *weight* attribute.

### 2.2.3. Style and add simple annotations

Style the lines connecting nodes so that their thickness and color reflect the number of compounds in common.

Color the nodes with colors representing the ingredient categories. Note that if you right-click on "Mapping type" when creating a discrete mapping, you can use an automatic mapping generator to start with.

Save the main connected component of this graph as `ingr-ingr.png` using `File > Export > Network to image ...`.

Save a legend as `ingr-ingr-legend.gif` using the hamburger menu in `Style` and selecting `Create legend ...`

The next cell should display your graph and its legend.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [45]:
# Change width if necessary

display(Image(url="ingr-ingr.png", width=1200))

display(Image(url="ingr-ingr-legend.gif", width=400))

<font size="+1" color="red">Replace this cell with two interesting pairings (combinations of two or more ingredients) suggested by this network. By a pairing we mean ingredients that may taste good together because they have shared compounds. Add a plausible explanation considering the network structure.</font>

# DELIVER (individually)

Read the section on "delivering your code" in the [course evaluation guidelines](https://github.com/chatox/networks-science-course/blob/master/upf/upf-evaluation.md).

Deliver a zip file containing:

* This notebook
* The ``flavors.tsv``, ``flavors.png``, and ``flavors-legend.gif`` files
* The `compounds-in-common.png` file
* The ``ingredient-ingredient.gml``, ``ingr-ingr.png``, and ``ingr-ingr-legend.gif`` files

## Extra points available

For more learning and extra points, get the `recipes.csv` file. It contains one recipe per line, in this format:

```
EastAsian,roasted_sesame_seed,garlic,cayenne,seaweed,sesame_oil
```

This means there is one East Asian dish whose recipe requires the ingredients "roasted_sesame_seed", "garlic", "cayenne", "seaweed", and "sesame_oil".
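
Because each line can list a different number of ingredients, one possible way to read this format is the standard `csv` module, treating the first field as the cuisine and the rest as ingredients. The sketch below parses only the sample line shown above from an in-memory string and assumes there is no header line; for the real file you would open `recipes.csv` instead.

```python
import csv
import io

# The sample line from the text above, held in memory for illustration
sample = "EastAsian,roasted_sesame_seed,garlic,cayenne,seaweed,sesame_oil\n"

recipes = []
for fields in csv.reader(io.StringIO(sample)):
    # First field is the cuisine; remaining fields are the ingredients
    cuisine, recipe_ingredients = fields[0], fields[1:]
    recipes.append((cuisine, recipe_ingredients))

print(recipes[0])
```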

Select 3 recipes and use Cytoscape to draw a graph with their ingredients and the compounds in those ingredients. Include those subgraphs here, plus a brief commentary about whether the ingredients used share many compounds, few compounds, or none at all, and any other observations you want to make about the selected recipes.

**Note:** if you go for the extra points, add ``<font size="+2" color="blue">Additional results: recipes</font>`` at the top of your notebook.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>
