Create your own map of sciences!

Edit, Spring 2017: This page, written in 2011, is now out of date. I will soon share the source code of the recently developed BiblioTools 3.x versions, which allow you to create interactive web maps such as those presented here or here. In the meantime, don’t hesitate to contact me if you are interested in the newest developments of BiblioTools.

Overview

This page presents BiblioTools 2.2, a set of Python scripts I developed to transform raw bibliographic data extracted from the Web of Science into “maps of science” gathering relevant information about millions of articles in a single picture. These scripts are designed to be used as a black box, i.e., no prior knowledge of Python is required to use them [however, feel free to change the code if you want to!], and they should help you generate your own “maps of science”, similar to those I present in the “Research” section.

Here are links to two papers showing examples of what the BiblioTools can help you achieve. The first presents different maps of the research carried out in a scientific institution (the ENS de Lyon). The second focuses on the representation of an emerging scientific field: complex systems science.

  • S Grauwin, P Jensen, Mapping Scientific Institutions. Scientometrics 89(3), 943-954 (2011). [paper online]
  • S Grauwin, G Beslon, E Fleury, S Franceschelli, C Robardet, JB Rouquier, P Jensen, Complex systems science: dreams of universality, reality of interdisciplinarity. Journal of the American Society for Information Science and Technology, ASIS&T (2012). [paper online, arxiv version]

If you are looking for tools similar to the BiblioTools, you might want to check out the Scientometric portal, a gateway to scientometric-related materials and resources.

History of the project

  • 2011: Release of the BiblioTools 1.0, based on SQL queries. (Not up-to-date!)
  • July 2012: Release of the BiblioTools 2.1, based on Python scripts.
  • Sept 2012: BiblioTools 2.2, with corrections in the ‘BC’ script (now running faster and dealing with larger networks), minor corrections in the parsing of references, and the addition of the ‘filter’ script, which lets you select articles published in a given period.
  • February 2014: I patched the 2.2 scripts to account for a change in the WOS export format.
  • Upcoming (spring 2014): I started working again on this project. I will soon release a new improved version of the BiblioTools, with many new features. Stay tuned!

Quick Install

Download the BiblioTools 2.2 archive, unzip it, and place the unzipped folder wherever you want.

The scripts have been heavily tested on Unix, Mac and Windows. In order to run them, you need to be able to run Python (obviously!), along with a few Python packages.

Moreover, you will need access to the Web of Science to extract bibliometric data, Gephi to open and visualize the different maps produced by the scripts, and a LaTeX compiler to view some of the tables produced by the scripts in PDF format.

Here are the most useful commands for those of you already familiar with the BiblioTools. If that is not your case, read the tutorial below!

./scripts/parser.py -i data-wos -o data -v
./scripts/first_distrib.py -i data -v
./scripts/BC.py -i data -o networks -v
./scripts/prep_het_graph.py -i data -o networks -v

Tutorial

Data extraction (from WOS)

The data have to be extracted from the Web of Science. Depending on your project, you might want to collect records of articles using a given set of keywords / references / authors / institutions / publication journals, etc. For example, we used the query AD=”Ecole Normale Super Lyon” OR AD=”ENS LYON” OR AD=”ENS de Lyon” OR AD=”ENS LSH” OR AD=”ENS-LSH” to extract the records of articles written by researchers from the ENS de Lyon.
To extract the data, use the “Output records” menu at the bottom of the results page:

  • step 1: you’ll see that you can only download 500 records at a time.
  • step 2: choose the “Full records + cited references” option.
  • step 3: use the [save to other reference software] drop-down menu and choose “save to tab-delimited (win)”.

Put all the extracted “savedrecs” txt files in the “data-wos” folder of the BiblioTools.

Data parsing

The first step is to extract the relevant information from the downloaded files. This is done by the parser.py script, which depends in part on the structure of the “savedrecs” files. Since the people at WOS change the structure of these files every few months, this parser might need to be updated from time to time. If you run into a problem, you can either try to adapt the code or contact me to update it.

For example, since spring 2012, the “savedrecs” txt files have been encoded in utf-16 whereas they used to be encoded in utf-8. You’ll need to re-encode them in utf-8, e.g. with “recode” [EDIT Jun 2013: this step is no longer necessary since files can once again be downloaded in utf-8]. All the command lines presented in the following assume that you are working from the BiblioTools directory.

recode utf-16..utf-8 data-wos/*.txt
./scripts/parser.py -i data-wos -o data -v

This script produces 7 files (see the “data” folder) listing information about each article within your database: general information (title, journal, year of publication, …), authors, keywords, (ISI) subject categories, references, institutions and countries.
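If you are curious about what this step involves under the hood, here is a minimal Python sketch of how tab-delimited WOS records could be parsed. It is not the actual parser.py code; the two-letter field tags (AU, TI, SO, PY, CR) are the usual column headers of the tab-delimited export, but check them against your own “savedrecs” files, since WOS changes the format from time to time.

# Illustrative sketch only -- not the actual parser.py script.
import csv
import glob

def split_field(row, tag):
    # WOS packs multiple values (authors, references, ...) into one field, separated by "; "
    value = (row.get(tag) or "").strip()
    return value.split("; ") if value else []

def parse_savedrecs(folder):
    records = []
    for filename in glob.glob(folder + "/*.txt"):
        # adapt the encoding to your own files (see the remark about utf-16 above)
        with open(filename, encoding="utf-8") as f:
            reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                records.append({
                    "authors": split_field(row, "AU"),
                    "title": row.get("TI", ""),
                    "journal": row.get("SO", ""),
                    "year": row.get("PY", ""),
                    "references": split_field(row, "CR"),
                })
    return records

recs = parse_savedrecs("data-wos")
print(len(recs), "records parsed")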

First analyses (global stats)

The goal here is to give you a general idea of the content of your database by computing the number of occurrences of each author / keyword / reference / etc.

./scripts/first_distrib.py -i data -v

The script outputs several “proba.dat” files within the “data” folder, listing items from the most to the least frequent.
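To give an idea of the kind of counting involved, here is a small sketch (not the actual first_distrib.py code) that builds such a frequency table from the records produced by the parsing sketch above:

# Illustrative sketch only -- not the actual first_distrib.py script.
from collections import Counter

def frequency_table(records, field):
    counts = Counter()
    for rec in records:
        counts.update(rec[field])          # 'field' holds a list (authors, references, ...)
    total = len(records)
    # items sorted from the most to the least frequent, with the fraction of articles using them
    return [(item, n, n / total) for item, n in counts.most_common()]

# Example: the ten most frequent authors
for author, n, share in frequency_table(recs, "authors")[:10]:
    print(author, n, round(100 * share, 1), "%")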

If you examine the “proba.dat” files closely, I have no doubt that you will find some unexpected discrepancies: authors with too many articles, under-represented institutions, … Several biases are to be expected. The problem of homonymous authors is one example [I have no way to distinguish one “P Smith” from another]. The problem of the labels used to name institutions is another: different authors may use different but synonymous names for their institution. The same author might even name his own institution in different ways across his articles (see below)!

The BiblioTools provide a script to clean the “institutions” data. To be clear, the “institutions” extracted by the scripts are all the comma-separated fields on the WOS address lines except the last two (which always contain the city and the country). In order to clean your institution data, you may:

  • list all the different ways to name a given lab or institution that you can detect (in the “proba_institution.dat” file for example)
  • write them in the “inst_synonyms.py” file (see the “scripts” folder of the BiblioTools, which contains some examples)
  • run the “clean_institutions.py” script:
./scripts/clean_institutions.py -i data -o data2 -v
./scripts/first_distrib.py -i data2 -v

Don’t forget to create a ‘data2’ folder before running the scripts. You may then delete the ‘data’ folder and rename ‘data2’ to ‘data’.
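If you are wondering what the synonym mapping looks like in practice, here is a hedged sketch of the idea; the entries below are purely illustrative, and the real synonym list belongs in scripts/inst_synonyms.py (which shows actual examples).

# Illustrative sketch only -- the real synonym list lives in scripts/inst_synonyms.py.
inst_synonyms = {
    "ENS LYON": "Ecole Normale Super Lyon",
    "ENS DE LYON": "Ecole Normale Super Lyon",
}

def clean_institution(raw_name):
    # map every known variant onto a single canonical label
    key = raw_name.strip().upper()
    return inst_synonyms.get(key, raw_name.strip())

print(clean_institution("ENS de Lyon"))   # -> "Ecole Normale Super Lyon"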

Filtering

To extract only the data related to the articles of your database published in a given period (say, between 1991 and 1995), you may use the ‘filter.py’ script as follows:

./scripts/filter.py -i data -o data2 -y1 1991 -y2 1995 -v

Filtered data will be created in the ‘data2’ folder, which you must create beforehand. You can also easily change the script in order to extract the data related to articles by a given author / using a specific keyword or reference, etc.
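For illustration, the year filtering essentially boils down to the following sketch (not the actual filter.py code), applied to the records produced by the parsing step:

# Illustrative sketch only -- not the actual filter.py script.
def filter_by_year(records, y1, y2):
    kept = []
    for rec in records:
        try:
            year = int(rec["year"])
        except ValueError:
            continue                      # skip records without a usable publication year
        if y1 <= year <= y2:
            kept.append(rec)
    return kept

recs_91_95 = filter_by_year(recs, 1991, 1995)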

Detecting Bibliographic Coupling communities

Bibliographic Coupling (BC) measures the similarity between two articles by comparing their references. The goal of the following script is to detect the BC communities, i.e., to separate the articles within the database into groups of articles sharing very similar references. BC communities are a ‘natural’ way to define scientific fields or disciplines, allowing among other things the detection of emerging communities. For details about the interest of BC communities, see our JASIST paper on complex systems science.

./scripts/BC.py -i data -o networks -v

The script proposed in the BiblioTools performs several tasks:

  • It creates the BC network by computing the BC weight w_ij = n_ij / sqrt(n_i * n_j) between each pair of articles, where n_ij is the number of shared references and n_i (resp. n_j) is the number of references of article i (resp. j). A minimal sketch of this computation and of the community detection step is given after this list.
  • It detects the BC communities using Thomas Aynaud’s Python implementation of the Louvain algorithm, based on the maximization of a weighted modularity function.
  • The Louvain algorithm produces a hierarchy of community partitions. My script allows you to choose which partition you want to extract [default is the last one, with fewer communities and higher modularity]. You may also choose the minimum size of the communities you want to keep [default is 10]. Two files are then created by the script (in the “networks” folder):
    • Output 1: a .tex file you’ll have to compile, displaying an “ID card” for each community, i.e., the list of the most frequent keywords, subjects, authors, references, etc. used by the articles within this community. These ID cards allow you to identify the characteristics of each community.
    • Output 2: a .gdf file you’ll have to open with Gephi. This file will allow you to visualize the BC network at the community level. Only basic knowledge of Gephi is required.
      • Resize the nodes: ranking tab / resize with the “size” parameter.
      • Run a layout algorithm: Force Atlas 2 works rather well.
      • Display the labels. You can display the ‘most_frequent_k’ field, corresponding to the most frequent keywords of each community. You can change the labels (using the information from the ID cards) in the ‘Data Laboratory’ tab of Gephi.
      • You can also play around with the filters to delete small communities or weak links.
  • Finally, the script offers to output the BC network at the article level, to be opened with Gephi. Keep in mind that Gephi can only handle a limited number of nodes (beyond a few thousand nodes, it’s going to be slow). [EDIT Jan 2014: there was a small bug in the part of the code dealing with that, which is now fixed. Thanks to the people who reported it to me!]
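To make the first two steps above more concrete, here is a minimal sketch of how the BC weights can be computed and fed to Thomas Aynaud’s Louvain implementation (the “community” module of the python-louvain package), using networkx to store the network. This is not the actual BC.py code, which is optimized to deal with much larger networks:

# Illustrative sketch only -- not the actual BC.py script.
from itertools import combinations
from math import sqrt

import networkx as nx
import community            # Thomas Aynaud's python-louvain package

def bc_network(records):
    # BC weight w_ij = n_ij / sqrt(n_i * n_j) between each pair of articles
    refs = [set(rec["references"]) for rec in records]
    G = nx.Graph()
    G.add_nodes_from(range(len(records)))
    for i, j in combinations(range(len(records)), 2):   # naive O(N^2) pass
        n_ij = len(refs[i] & refs[j])
        if n_ij:
            G.add_edge(i, j, weight=n_ij / sqrt(len(refs[i]) * len(refs[j])))
    return G

G = bc_network(recs)
partition = community.best_partition(G)    # Louvain: dict mapping node -> community id
print(len(set(partition.values())), "BC communities detected")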

Co-occurrence maps

The goal of the “prep_het_graph” script is to create a heterogeneous co-occurrence network, gathering different items (authors, keywords, journals, subjects, references, institutions, …) along with some properties: the number of occurrences of each item, the number of co-occurrences within the same article of each pair of items, etc.

./scripts/prep_het_graph.py -i data -o networks -v

Follow the instructions of the script. You will be able to change the different thresholds (keep items used at least xx times, keep links between items co-cited at least xx times). Keep in mind that a lower threshold means more nodes in the network, i.e., a larger gdf file (in the “networks” folder). The weight of the co-occurrence links is defined as w_kl = n_kl / sqrt(n_k * n_l), where n_kl is the number of articles in which both items k and l appear and n_k (resp. n_l) is the number of articles in which k (resp. l) appears.
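As an illustration, the core of the computation can be sketched as follows. This is not the actual prep_het_graph.py code, which combines many more item types and asks you for the thresholds interactively; the field names and default thresholds below are just placeholders.

# Illustrative sketch only -- not the actual prep_het_graph.py script.
from collections import Counter
from itertools import combinations
from math import sqrt

def cooccurrence_links(records, fields=("authors", "references"), min_occ=5, min_cooc=2):
    occ, cooc = Counter(), Counter()
    for rec in records:
        # each item is tagged with its type so that different kinds of items can be mixed
        items = sorted({(f, it) for f in fields for it in rec.get(f, [])})
        occ.update(items)
        cooc.update(combinations(items, 2))
    links = []
    for (k, l), n_kl in cooc.items():
        if n_kl >= min_cooc and occ[k] >= min_occ and occ[l] >= min_occ:
            links.append((k, l, n_kl / sqrt(occ[k] * occ[l])))   # w_kl
    return links

links = cooccurrence_links(recs)
print(len(links), "co-occurrence links kept")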

Play around with the different tools of Gephi (colors, sizes, filters, etc…) to produce nice visualizations. See our Scientometrics paper for some examples!

If you only want to create a co-author map (dealing only with the authors of the articles) or a co-citation map (dealing only with the references of the articles), you may use the following scripts, which will run faster:

./scripts/prep_coauthors.py -i data -o networks -v
./scripts/prep_cocitations.py -i data -o networks -v

___________________________________________________________________________________________
© Sebastian Grauwin, July 2012.
Do not hesitate to contact me for any question / detail / commentary!