{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Practice Session 03: Management of networks data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this session we will study an application of complex networks analysis to cooking. We will start with the *flavors network*, a bi-partite network connecting culinary ingredients to flavour compounds [*].\n", "\n", "The initial dataset, prepared by [Ling Cheng in 2016](https://github.com/lingcheng99/Flavor-Network), contains four files:\n", "\n", "* `ingredients.tsv` -- information about culinary ingredients\n", "* `compounds.tsv` -- information about flavour compounds\n", "* `ingredient-compound.tsv` -- flavour compounds present in each culinary ingredient\n", "* `recipes.csv` -- ingredients used in recipes around the world (used only for extra points)\n", "\n", "[*] Ahn, Y. Y., Ahnert, S. E., Bagrow, J. P., & Barabasi, A. L. (2011). [Flavor network and the principles of food pairing](https://doi.org/10.1038/srep00196). Scientific reports, 1(1), 1-7.\n", "\n", "\n", "(Remove this cell when delivering.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Author: Your name here\n", "\n", "E-mail: Your e-mail here\n", "\n", "Date: The current date here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. The flavors bi-partite graph" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.0. Examine your input files\n", "\n", "Before you begin, we highly recommend that you:\n", "\n", "1. Copy the input files to a local directory on your computer \n", "2. Open them in a spreadsheet and look at them\n", "\n", "(Remove this cell when delivering.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.1. 
Read the bipartite graph in a dataframe\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code, which you can leave as-is, reads the ingredient-compound relationship into a dataframe.\n", "\n", "(Remove this cell when delivering.)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Feel free to add imports if you need them\n", "\n", "import io\n", "import csv\n", "import pandas as pd\n", "import networkx as nx\n", "\n", "from networkx.algorithms import bipartite\n", "\n", "import numpy as np\n", "import matplotlib\n", "import scipy\n", "\n", "import itertools\n", "\n", "from IPython.display import Image" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Leave this code as-is\n", "\n", "INPUT_INGR_FILENAME = \"ingredients.tsv\"\n", "INPUT_COMP_FILENAME = \"compounds.tsv\"\n", "INPUT_INGR_COMP_FILENAME = \"ingredient-compound.tsv\"" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ingredient_idingredient_nameingredient_category
00magnolia_tripetalaflower
11calyptranthes_parriculataplant
22chamaecyparis_pisifera_oilplant derivative
\n", "
" ], "text/plain": [ " ingredient_id ingredient_name ingredient_category\n", "0 0 magnolia_tripetala flower\n", "1 1 calyptranthes_parriculata plant\n", "2 2 chamaecyparis_pisifera_oil plant derivative" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
compound_idcompound_namecompound_code
00jasmone488-10-8
115-methylhexanoic_acid628-46-6
22l-glutamine56-85-9
\n", "
" ], "text/plain": [ " compound_id compound_name compound_code\n", "0 0 jasmone 488-10-8\n", "1 1 5-methylhexanoic_acid 628-46-6\n", "2 2 l-glutamine 56-85-9" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ingredient_idcompound_id
01392906
11259861
21079673
\n", "
" ], "text/plain": [ " ingredient_id compound_id\n", "0 1392 906\n", "1 1259 861\n", "2 1079 673" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Leave this code as-is\n", "\n", "ingredients = pd.read_csv(INPUT_INGR_FILENAME, sep=\"\\t\")\n", "display(ingredients.head(3))\n", "\n", "compounds = pd.read_csv(INPUT_COMP_FILENAME, sep=\"\\t\")\n", "display(compounds.head(3))\n", "\n", "ingr_comp = pd.read_csv(INPUT_INGR_COMP_FILENAME, sep=\"\\t\")\n", "display(ingr_comp.head(3))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.2. Create the flavors bipartite network" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Create a new dataframe named `flavors` by joining `ingredients` and `compounds`.\n", "\n", "*Tips*:\n", "\n", "* To [join](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html) a DataFrame A and a DataFrame B using a column X, use `result = A.set_index('X').join(B.set_index('X'), how='inner')`\n", "* You will need to do two joins to solve this. First, join `ingredients` and `ingr_comp`, then join the result with `compounds`.\n", "\n", "(Remove this cell when delivering.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace this cell with your code to create the `flavors` dataframe and show its first 20 rows." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Drop the `compound_code` column from the resulting dataframe, sort by `ingredient_name` then by `compound_name`, and reset its index.\n", "\n", "*Tips:*\n", "\n", "* To [drop column](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) x from DataFrame A, you can do: `A = A.drop(columns=['x'])`\n", "* To [sort](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) a DataFrame A by column *x*, then by column *y*, you can do: `A = A.sort_values(['x', 'y'])`\n", "* To [reset the index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) of a DataFrame A, you can do: `A = A.reset_index(drop=True)`; the index is the column appearing in boldface in front of every row of a DataFrame\n", "\n", "(Remove this cell when delivering.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace this cell with your code to modify the `flavors` dataframe as explained above, and show its first 20 rows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write this dataframe to a `flavors.tsv` file, which should be a tab-separated file containing the three fields `ingredient_name`, `ingredient_category` and `compound_name`. Use the function [pandas.DataFrame.to_csv](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html).\n", "\n", "(Remove this cell when delivering.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace this cell with your code to save *flavors* into a tab-separated file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.3. Open this bi-partite network in Cytoscape\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3.1. Examine the file you generated\n", "\n", "Open the ``flavors.tsv`` file in a spreadsheet program to make sure you generated it correctly; it should have exactly 3 tab-separated columns.\n", "\n", "### 1.3.2. 
Import this file in Cytoscape\n", "\n", "Remember these files are imported with ``File > Import > Network from File ...``. Then, you have to select:\n", "\n", "* ingredient_name as a ``Source Node``\n", "* ingredient_category as a ``Source Node Attribute``.\n", "* compound_name as a ``Target Node`` \n", "\n", "### 1.3.3. Draw a small part of this graph\n", "\n", "Find the `garlic` node and everything connected to it at distance 1 or 2. To do this, find \"garlic\" and then click on the \"two-houses\" (neighbors) icon twice. Extract the selected nodes as a sub-graph by doing `File > New network > From selected nodes, all edges`.\n", "\n", "Run the network analyzer and then perform `Layout > Edge weighted spring embedded layout` using edge betweenness.\n", "\n", "Style the network so that ingredient nodes have a color that depends on their category, using any color except white, and set white as the default node color so that compound nodes remain white. Set the label color to black. Set the node shape to ellipse. \n", "\n", "Save the image as `flavors.png` and its corresponding legend as `flavors-legend.gif`. Use the next cell to display the network and legend. This time, node labels do not need to be visible or readable; we just want to appreciate overall clusters.\n", "\n", "(Remove this cell when delivering.)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# KEEP THIS CELL AS-IS\n", "\n", "# Just adjust width/height if necessary\n", "\n", "Image(url=\"flavors.png\", width=1200)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3.4. Compounds in common\n", "\n", "*Onion* and *Garlic* get their distinctive smell from sulfur-containing compounds. How many compounds do onion and garlic have in common? 
Based solely on their names, how many of them do you think contain sulfur?\n", "\n", "To answer this question, extract the nodes *Onion*, *Garlic*, and all their compounds in common as a graph. Lay it out as a hierarchical graph and adjust it so that *Onion* appears at the top of the image, the compounds in the middle, and *Garlic* at the bottom of the image.\n", "\n", "Save the image as `compounds-in-common.png`; the next cell should display it.\n", "\n", "(Remove this cell when delivering.)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# KEEP THIS CELL AS-IS\n", "\n", "# Just adjust width/height if necessary\n", "\n", "Image(url=\"compounds-in-common.png\", width=1200)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace this cell by a brief commentary indicating how many compounds they have in common and how many of them seem to contain sulfur (based on their names). Name a couple of those sulfur-containing compounds." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. The ingredient-ingredient graph" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The bi-partite flavors graph is hard to visualize as it mixes ingredients and compounds. We will now try to visualize only the connections between ingredients.\n", "\n", "(Remove this cell when delivering.)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.1. Create an ingredient-ingredient.gml file\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, copy the list of ingredient names into an array `ingredients_array`. 
To convert column *x* of DataFrame *A* to an array, use `np.asarray(A['x'])`.\n", "\n", "(Remove this cell when delivering.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace this cell with your code to create `ingredients_array` with the list of ingredients and to print the number of ingredients." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, create a dictionary named `ingredient_to_compounds`, in which keys are ingredients, and values are sets of compounds. To create an empty set, you can use `s = set()`. To add to a set, you can do `s.add(element)`. Your code should look like this:\n", "\n", "```python\n", "ingredients_array = ...\n", "print(\"There are %d ingredients\" % (len(ingredients_array)))\n", "\n", "ingredient_to_compounds = {}\n", "\n", "for index, row in flavors.iterrows():\n", " ...\n", "\n", "```\n", "\n", "(Remove this cell when delivering.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace this cell with your code to create dictionary `ingredient_to_compounds` with a set of compounds for each ingredient. Print the number of keys of this dictionary. It should be less than or equal to the number of ingredients." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we create a NetworkX graph with nodes representing ingredients and edges of weight `x` connecting two ingredients having `x` flavor compounds in common.\n", "\n", "To create an empty graph, do `ingredient_ingredient = nx.Graph()`.\n", "\n", "Now, iterate through all pairs of ingredients in `ingredients_array` and compute the compounds each pair has in common. To iterate through all pair combinations of an array X, you can use:\n", "\n", "```\n", "for u, v in itertools.combinations(X,2):\n", " ...\n", "\n", "```\n", "\n", "The size of the intersection of two sets of compounds `l1`, `l2` can be obtained with `len(l1.intersection(l2))`. 
This will be the weight of the edge connecting the two ingredients corresponding to those sets.\n", "\n", "Please note you may need to check whether both ingredients have compounds. You can test it by asking `if u in ingredient_to_compounds and v in ingredient_to_compounds`\n", "\n", "To facilitate visualization, we will keep only edges connecting two ingredients having **MIN_COMMON_COMPOUNDS or more compounds in common**. Set the value of **MIN_COMMON_COMPOUNDS** so that the resulting graph has somewhere around 150 +/- 30 nodes.\n", "\n", "To add to graph *G* an edge between nodes *u* and *v* having weight *w*, do `G.add_edge(u, v, weight=w)`.\n", "\n", "(Remove this cell when delivering.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace this cell with your code to create the `ingredient_ingredient` graph" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The ingredient-ingredient graph has 145 nodes and 1374 edges\n" ] } ], "source": [ "# Leave as-is\n", "print(\"The ingredient-ingredient graph has %d nodes and %d edges\" %\n", " (ingredient_ingredient.number_of_nodes(), ingredient_ingredient.number_of_edges()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the resulting graph into a file. You can use [write_gml](https://networkx.org/documentation/stable/reference/readwrite/generated/networkx.readwrite.gml.write_gml.html#networkx.readwrite.gml.write_gml) to use the *GML* format.\n", "\n", "(Remove this cell when delivering.)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "OUTPUT_INGR_INGR_FILENAME = 'ingredient-ingredient.gml'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace this cell with your code to save the `ingredient_ingredient` graph to the file `OUTPUT_INGR_INGR_FILENAME`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.2. 
Work with this file in Cytoscape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2.1. Inspect this file\n", "\n", "*Tip:* Open the ``ingredient-ingredient.gml`` file in a text editor first to see how it is structured.\n", "\n", "\n", "### 2.2.2. Import this file into Cytoscape\n", "\n", "To import this file into Cytoscape:\n", "\n", "* `File > Import > Network from file ...`\n", "* Open the `ingredient-ingredient.gml` file\n", "\n", "Now we need to import ingredient categories:\n", "\n", "* `File > Import > Table from file ...`\n", "* Open the `ingredients.tsv` file\n", "* Import data as \"Node Table Columns\"\n", "* `ingredient_name`: key\n", "* `ingredient_category`: attribute\n", "\n", "Do a `Layout > Edge weighted spring embedded` layout on the *weight* attribute.\n", "\n", "### 2.2.3. Style and add simple annotations\n", "\n", "Style lines connecting nodes so their thickness and color reflect the number of compounds in common.\n", "\n", "Color the nodes with colors representing the ingredient categories. 
Note that if you right-click on \"Mapping type\" when creating a discrete mapping, you can use an automatic mapping generator to start with.\n", "\n", "Save the main connected component of this graph as `ingr-ingr.png` using `File > Export > Network to image ...`.\n", "\n", "Save a legend as `ingr-ingr-legend.gif` using the hamburger menu in `Style` and selecting `Create legend ...`\n", "\n", "The next cell should display your graph and its legend.\n", "\n", "(Remove this cell when delivering.)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Change width if necessary\n", "\n", "display(Image(url=\"ingr-ingr.png\", width=1200))\n", "\n", "display(Image(url=\"ingr-ingr-legend.gif\", width=400))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace this cell by two interesting pairings (combinations of two or more ingredients) suggested by this network. By a pairing we mean ingredients that may taste good together because they have shared compounds. Add a plausible explanation considering the network structure." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# DELIVER (individually)\n", "\n", "Read the section on \"delivering your code\" in the [course evaluation guidelines](https://github.com/chatox/networks-science-course/blob/master/upf/upf-evaluation.md).\n", "\n", "Deliver a zip file containing:\n", "\n", "* This notebook\n", "* The ``flavors.tsv``, ``flavors.png``, and ``flavors-legend.gif`` files\n", "* The `compounds-in-common.png` file\n", "* The ``ingredient-ingredient.gml``, ``ingr-ingr.png``, and ``ingr-ingr-legend.gif`` files\n", "\n", "## Extra points available\n", "\n", "For more learning and extra points, get the `recipes.csv` file. 
It contains one recipe per line, in this format:\n", "\n", "```\n", "EastAsian,roasted_sesame_seed,garlic,cayenne,seaweed,sesame_oil\n", "```\n", "\n", "This means there is one East Asian dish whose recipe requires the ingredients \"roasted_sesame_seed\", \"garlic\", \"cayenne\", \"seaweed\", and \"sesame_oil\".\n", "\n", "Select 3 recipes and draw using Cytoscape a graph with their ingredients and the compounds in those ingredients. Include those subgraphs here, plus a brief commentary about whether the ingredients used share many compounds, few compounds, or not at all, and any other observations you want to make about the selected recipes.\n", "\n", "**Note:** if you go for the extra points, add ``Additional results: recipes`` at the top of your notebook.\n", "\n", "(Remove this cell when delivering.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 2 }