BiblioTools2.1 : Mapping Scientific Institutions & Communities

This page presents BiblioTools 2.1, a set of Python scripts I developed for transforming raw bibliographic data, as extracted from the Web of Science, into “maps of science” that gather relevant information about millions of articles in a single picture. These scripts are designed to be used as a black box, i.e. no prior knowledge of Python is required to use them [however, feel free to change the code if you want to and can!].

Here are links to two papers showing some examples of what the BiblioTools can help you achieve. The first one presents different maps of the research carried out in a scientific institution (the ENS Lyon). The second one focuses on the representation of an emerging scientific field: complex systems science.

  • S Grauwin, P Jensen, Mapping Scientific Institutions. Scientometrics 89(3), 943-954 (2011). [paper online]
  • S Grauwin, G Beslon, E Fleury, S Franceschelli, C Robardet, JB Rouquier, P Jensen, Complex systems science: dreams of universality, reality of interdisciplinarity. Journal of the American Society for Information Science and Technology, ASIS&T (2012). [paper online]

Quick Install

Download the BiblioTools2.1 zip file, unzip it and place the unzipped folder wherever you want.

The scripts have been heavily tested on Unix and Windows. In order to run them, you need to be able to run python (obviously!). You’ll need to have the following packages installed:

Moreover, you will need access to the Web of Science to extract the bibliometric data, Gephi to open and visualize the maps produced by the scripts, and a LaTeX compiler to view in PDF format some of the tables produced by the scripts.

Here are the most useful commands for those of you already familiar with the BiblioTools. If that is not your case, read the tutorial below!

recode utf-16..utf-8 data-wos/*.txt
./scripts/parser.py -i data-wos -o data -v
./scripts/first_distrib.py -i data -v
./scripts/BC.py -i data -o networks -v
./scripts/prep_het_graph.py -i data -o networks -v

Tutorial

Data extraction (from WOS)

The data have to be extracted from the Web of Science. Depending on your project, you might want to collect records of articles matching a given set of keywords / references / authors / institutions / journals, etc. For example, we used the query AD="Ecole Normale Super Lyon" OR AD="ENS LYON" OR AD="ENS de Lyon" OR AD="ENS LSH" OR AD="ENS-LSH" to extract records of articles written by researchers from the ENS de Lyon.
To extract the data, use the “Output records” menu at the bottom of the results page:

  • step 1: you’ll see that you can only download 500 records at a time. It’s a shame, but you should still be able to download ~ 50,000 records / hour.
  • step 2: choose the “Full records + cited references” option.
  • step 3: use the [save to other reference software] drop-down menu and choose “save to tab-delimited (win)”.

Put all the extracted “savedrecs” txt files in the “data-wos” folder of the BiblioTools.

Data parsing

The first step is to extract the relevant information from the downloaded files. This is done by the parser.py script, which depends in part on the structure of the “savedrecs” files. Since the guys from WOS change the structure of these files every few months, this parser might need to be updated from time to time. If you run into a problem, you can either try to adapt the code yourself or contact me so that I update it.

Random example showing how the information about an article is displayed within WOS. The goal of the parsing process is to properly extract some of this information.

For example, since spring 2012, the “savedrecs” txt files are encoded in UTF-16 whereas they used to be encoded in UTF-8… You’ll need to re-encode them in UTF-8, e.g. with “recode”. All the command lines presented below assume that you are working from the BiblioTools directory.

recode utf-16..utf-8 data-wos/*.txt
./scripts/parser.py -i data-wos -o data -v
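If “recode” is not available on your system (on Windows for instance), the re-encoding step can be sketched in a few lines of Python. This is only an illustration, not part of the BiblioTools: the function name reencode_to_utf8 is hypothetical, and the sketch assumes that the UTF-16 files start with a byte-order mark.

```python
from pathlib import Path


def reencode_to_utf8(folder, pattern="*.txt"):
    """Rewrite every UTF-16 file matching `pattern` in `folder` as UTF-8, in place.

    Returns the names of the files that were converted.
    """
    converted = []
    for path in sorted(Path(folder).glob(pattern)):
        raw = path.read_bytes()
        # Assumption: UTF-16 exports start with a byte-order mark (BOM).
        # Files without one are left untouched, so running this twice is harmless.
        if not raw.startswith((b"\xff\xfe", b"\xfe\xff")):
            continue
        text = raw.decode("utf-16")  # the BOM selects the byte order and is dropped
        # Write bytes rather than text so that line endings are kept unchanged.
        path.write_bytes(text.encode("utf-8"))
        converted.append(path.name)
    return converted
```

Calling reencode_to_utf8("data-wos") from the BiblioTools directory would then play the role of the “recode” command above, before running parser.py.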

This script produces 7 files (see the “data” folder) listing some info about each article in your database: general info (title, journal, year of publication, …), authors, keywords, (ISI) subject categories, references, institutions and countries.

First analyses (global stats)

The goal here is to give you a general idea of the content of your database, by computing the number of occurrences of each author / keyword / reference / etc.

./scripts/first_distrib.py -i data -v

The script outputs several “proba.dat” files within the “data” folder, listing items from the most to the least frequent.
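To make the idea concrete, here is a minimal sketch of such a frequency count. It is not the code of first_distrib.py: the function name, the in-memory input and the choice of counting each item once per article are assumptions made for the illustration.

```python
from collections import Counter


def item_distribution(article_items):
    """Return (item, count, share) tuples sorted from most to least frequent.

    `article_items` maps an article id to the items (e.g. keywords) it uses.
    `share` is the fraction of articles in which the item appears.
    """
    counts = Counter()
    for items in article_items.values():
        counts.update(set(items))  # count each item at most once per article
    n_articles = len(article_items)
    # Sort by decreasing count, then alphabetically to get a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(item, n, n / n_articles) for item, n in ordered]
```

For example, item_distribution({1: ["networks", "modularity"], 2: ["networks"]}) ranks "networks" first, since it appears in both articles.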

If you closely examine these “proba.dat” files, I have no doubt that you will find some unexpected discrepancies: authors with too many articles, under-represented institutions, … Several biases are to be expected. Homonymous authors are one of them [I have no way to distinguish one "P Smith" from another]. The labels used to name institutions are another: different authors may use different but synonymous names for the same institution. A single author might even name his own institution in different ways across his articles (see below)!

The "addresses" fields as displayed in WOS for 3 articles I co-authored. Notice the different variations used in the articles for naming the Complex System Institute (IXXI) or the Phys Lab of the ENS Lyon...

The BiblioTools include a script for cleaning the “institutions” data. To be clear, the “institutions” extracted by the scripts are all the comma-separated fields of the WOS address lines except the last two (which always display cities and countries). In order to clean your institution data, you may

  • list all the different ways of naming a given lab or institution that you can detect (in the “proba_institution.dat” file for example)
  • write them in the “inst_synonyms.py” file (in the “scripts” folder of the BiblioTools; the file already contains some examples)
  • run the “clean_institutions.py” script.

./scripts/clean_institutions.py -i data -o data2 -v
./scripts/first_distrib.py -i data2 -v

Don’t forget to create a ‘data2’ folder before running the scripts. You may then delete the ‘data’ folder and rename ‘data2’ to ‘data’.
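The cleaning logic described above can be sketched as follows. This is an illustration only: the SYNONYMS table is hypothetical and does not reproduce the actual layout of “inst_synonyms.py”, and the two function names are made up for the example.

```python
def institutions_from_address(address):
    """Return every comma-separated field of an address line except the last two."""
    fields = [field.strip() for field in address.split(",")]
    # The last two fields hold the city and the country, so they are dropped.
    return [field for field in fields[:-2] if field]


# Hypothetical synonym table: variant spelling -> canonical label.
SYNONYMS = {
    "ENS LYON": "ENS Lyon",
    "ENS de Lyon": "ENS Lyon",
    "Ecole Normale Super Lyon": "ENS Lyon",
}


def clean_institutions(names, synonyms=SYNONYMS):
    """Map each name to its canonical label and drop duplicates, keeping order."""
    cleaned = []
    for name in names:
        canonical = synonyms.get(name, name)  # unknown names are kept as they are
        if canonical not in cleaned:
            cleaned.append(canonical)
    return cleaned
```

With a made-up address such as "Univ Lyon, ENS de Lyon, Lab Phys, F-69364 Lyon, France", the first function keeps the first three fields and the second one turns "ENS de Lyon" into "ENS Lyon".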

Filtering

To extract only the data related to the articles of your database published in a given period – say between 1991 and 1995, you may use the ‘filter.py’ script in the following way:

./scripts/filter.py -i data -o data2 -y1 1991 -y2 1995 -v

Filtered data will be created in the ‘data2’ folder, which you must create beforehand. You can also easily change the script in order to extract the data relative to the articles of a given author, or using a specific keyword or reference, etc.
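If you want to adapt the filter, the underlying logic is roughly the one sketched below. The record layout (dictionaries with "id" and "year" keys, rows starting with an article id) and the inclusive bounds are assumptions made for the example; they are not taken from filter.py.

```python
def filter_by_year(articles, first_year, last_year):
    """Keep the articles published between first_year and last_year (inclusive)."""
    return [a for a in articles if first_year <= a["year"] <= last_year]


def filter_related(rows, kept_ids):
    """Keep the rows (authors, keywords, ...) whose first field is a kept article id."""
    return [row for row in rows if row[0] in kept_ids]
```

Filtering on another criterion (a given author, a keyword, ...) amounts to changing the test in filter_by_year and then reusing filter_related on the other files with the ids of the kept articles.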

Detecting Bibliographic Coupling communities

Bibliographic Coupling (BC) measures the similarity between two articles by comparing their references. The goal of the following script is to detect the BC communities, i.e. to split the articles of the database into groups of articles sharing very similar references. BC communities are a ‘natural’ way to define scientific fields or disciplines, which makes it possible, among other things, to detect emerging communities. For details about the interest of BC communities, see our JASIST paper on Complex Systems Science.

./scripts/BC.py -i data -o networks -v

BC Communities detected within our “ENS LYON” database. Node size is proportional to the number of articles within the community, link width and spatial distance reflect a normalized similarity between the communities. We used the names of the most prolific authors and most frequent keywords as labels. Our script automatically drew a global map of the ENS de Lyon: we were indeed able to check that each community corresponded to a research team. The colors here represent the different labs within the ENS (Phys, astroPhys, Geophys, Chem, Bio, Maths & Computer Science)…

The script provided in the BiblioTools performs several tasks:

  • It creates the BC network by computing the BC weight w_ij = n_ij / sqrt(n_i*n_j) between each pair of articles, where n_ij is the number of shared references, and n_i (resp n_j ) the number of references of article i (resp j)
  • It detects the BC communities using Thomas Aynaud’s Python implementation of the Louvain algorithm, based on the maximization of a weighted modularity function.
  • The Louvain algorithm produces a hierarchy of community partitions. My script allows you to choose which partition you want to extract [default is the last one, with fewer communities and higher modularity]. You may also choose the minimum size of the communities you want to keep [default is 10]. Two files are then created by the script (in the “networks” folder):
    • Output 1: a .tex file, which you’ll have to compile, displaying an “ID Card” for each community, i.e. the list of the most frequent keywords, subjects, authors, references, etc. used by the articles within this community. These ID cards allow you to identify the characteristics of each community.
    • Output 2: a .gdf file you’ll have to open with Gephi. This file will allow you to visualize the BC network at the community level. Only basic knowledge of Gephi is required.
      • Resize the nodes: ranking tab / resize with the “size” parameter.
      • Run a spatialisation algo: Force Atlas 2 is rather good.
      • Display the labels. You can display the ‘most_frequent_k’ field, corresponding to the most frequent keywords of each community. You can change the labels (using the info from the ID cards) in the ‘data laboratory’ tab of Gephi.
      • You can also play around with the filters to delete small communities or weak links.
  • Finally, the script offers to output the BC network at the article level, so that you can open it with Gephi. Keep in mind that Gephi can only handle a limited number of nodes (beyond a few thousand nodes, it’s going to be slow).
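The BC weight defined in the first step above can be computed along the following lines. This is a minimal sketch rather than the code of BC.py: the function name and the in-memory input (a dictionary mapping each article id to its references) are assumptions, and the community detection itself is not shown.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def bc_weights(references):
    """Return {(i, j): w_ij} for every pair of articles sharing a reference.

    `references` maps an article id to the collection of references it cites,
    and w_ij = n_ij / sqrt(n_i * n_j).
    """
    refs = {article: set(cited) for article, cited in references.items()}
    # Inverted index (reference -> citing articles), so that only the pairs
    # with at least one shared reference are ever examined.
    citing = defaultdict(list)
    for article in sorted(refs):
        for ref in refs[article]:
            citing[ref].append(article)

    shared = defaultdict(int)  # (i, j) -> n_ij, with i < j
    for articles in citing.values():
        for i, j in combinations(articles, 2):
            shared[(i, j)] += 1

    return {
        (i, j): n_ij / sqrt(len(refs[i]) * len(refs[j]))
        for (i, j), n_ij in shared.items()
    }
```

Two articles with identical reference lists get a weight of 1, and pairs without any common reference do not appear in the result at all, which keeps the network sparse.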

ID card of the Hansen / Molecular-Dynamics community at the center of the map presented in the previous figure.

Co-occurrence maps

The goal of the “prep_het_graph” script is to create a heterogeneous co-occurrence network, gathering different items (authors, keywords, journals, subjects, references, institutions, …) along with some properties: number of occurrences of each item, number of co-occurrences of each pair of items within the same article…

./scripts/prep_het_graph.py -i data -o networks -v

 

Scientometric Map of the ENS Lyon. This heterogeneous network shows the authors, keywords, references and institutions appearing in at least 10 of the ~ 8500 articles in the database we extracted from the Web of Science. Two nodes are closer to each other (and have a stronger link) if they are often used in the same articles. Inset: the global map shows the overall structure in distinct “clusters” corresponding to the different teams of the ENS labs. Detail: zoom on the central part. Notice in particular the specificity of the “Joliot Curie” biophysics lab, linking a team of physicists to a team of biologists.

Follow the instructions given by the script. You will be able to change the different thresholds (keep items used at least xx times, keep links between items co-occurring at least xx times). Keep in mind that lower thresholds mean more nodes in the network, i.e. a larger gdf file (in the “networks” folder). The weight of the co-occurrence links is defined as w_kl = n_kl / sqrt(n_k*n_l), where n_kl is the number of articles where both items k and l appear and n_k (resp n_l) the number of articles in which k (resp l) appears.
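The effect of the two thresholds on the weighted links can be sketched as follows. The function name and its parameters min_items and min_links are hypothetical (they are not the prompts of prep_het_graph.py), and each article is simply represented as a list of items.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cooccurrence_links(article_items, min_items=1, min_links=1):
    """Return {(k, l): w_kl} for the item pairs that pass both thresholds.

    `article_items` is a list with one collection of items per article,
    and w_kl = n_kl / sqrt(n_k * n_l).
    """
    n = Counter()       # n_k: number of articles in which item k appears
    n_pair = Counter()  # n_kl: number of articles in which both k and l appear
    for items in article_items:
        unique = sorted(set(items))  # an item counts once per article
        n.update(unique)
        n_pair.update(combinations(unique, 2))

    return {
        (k, l): n_kl / sqrt(n[k] * n[l])
        for (k, l), n_kl in n_pair.items()
        if n_kl >= min_links and n[k] >= min_items and n[l] >= min_items
    }
```

To mix item types in one network, each item can be a pair such as ("author", "P Smith") or ("keyword", "networks"), so that an author and a keyword sharing the same spelling stay distinct.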

Play around with the different tools of Gephi (colors, sizes, filters, etc…) to produce nice visualizations. See our Scientometrics paper for some examples!

If you only want to create a co-authors map (dealing only with the authors of the articles) or a co-citations map (dealing only with the references of the articles), you may use the following scripts, which will run faster:

./scripts/prep_coauthors.py -i data -o networks -v
./scripts/prep_cocitations.py -i data -o networks -v
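For readers curious about what such a map looks like on disk, here is a minimal sketch that builds a co-author network and renders it in the GDF text format read by Gephi. It is not the code of prep_coauthors.py: the function name, the use of raw counts as node sizes and link weights, and the assumption that author names contain no comma are all choices made for the illustration.

```python
from collections import Counter
from itertools import combinations


def coauthor_gdf(article_authors):
    """Return the text of a GDF file describing a co-author network.

    `article_authors` is a list with one collection of author names per article.
    Assumption: author names contain no comma (GDF is comma-separated).
    """
    papers = Counter()  # author -> number of articles
    joint = Counter()   # (author, author) -> number of co-signed articles
    for authors in article_authors:
        unique = sorted(set(authors))
        papers.update(unique)
        joint.update(combinations(unique, 2))

    # Short node identifiers; the readable name goes in the label column.
    ids = {author: f"n{i}" for i, author in enumerate(sorted(papers))}
    lines = ["nodedef>name VARCHAR,label VARCHAR,size DOUBLE"]
    for author in sorted(papers):
        lines.append(f"{ids[author]},{author},{papers[author]}")
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE")
    for (a, b), n_ab in sorted(joint.items()):
        lines.append(f"{ids[a]},{ids[b]},{n_ab}")
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file with a .gdf extension should give something Gephi can open, with the “size” column available for resizing the nodes.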

 

___________________________________________________________________________________________
© Sebastian Grauwin, July 2012.
Do not hesitate to contact me with any question / detail / comment!
