The scientific literature is growing at an unprecedented rate; global scientific output is estimated to double every nine years. Scientific digital libraries now hold millions of research publications, and thousands more are added every day. Consider, for instance, MEDLINE (https://www.nlm.nih.gov/bsd/medline.html), a popular bibliographic database. It contains more than 26 million journal articles, mainly in the life sciences and biomedicine, and is updated with roughly 2000-4000 scientific papers a day. This enormous growth of scientific literature, together with its easy accessibility via the World Wide Web (WWW), has opened up massive opportunities for scientists to explore novel research directions.
However, at the same time, this overwhelming amount of information has created huge barriers for scientists seeking to connect their work with work from other disciplines. It is widely accepted that solutions derived through interdisciplinary scientific problem solving are more impactful and innovative than solutions proposed within a single problem domain. Nevertheless, the massive influx of scientific literature has made it extremely difficult for scientists to identify suitable cross-domain topics that complement their own areas of study. More specifically, researchers typically specialise in a limited number of branches of knowledge. Thus, researchers in each area of academic specialisation see only a part of the big picture, which often makes it difficult to identify complementary cross-domain topics.
Consider a scientist who is interested in exploring novel research directions in dementia. To construct a scientifically sensible novel research hypothesis, the scientist must analyse the existing and emerging knowledge in the literature and combine the observations in a creative way. At the time of writing, a simple search in MEDLINE alone for the query “dementia” returns more than 210,000 scientific articles. Even if the scientist decided to investigate only research published in the past 12 months, MEDLINE would still return more than 13,000 records.
Despite this staggering amount of information, human reading capacity has remained the same over the years. In 2012, it was reported that US scientists read 264 papers per year on average, which is similar to the figure recorded in an identical survey conducted in 2005. Given the sheer volume and rapid growth of scientific literature, it is clear that no one can keep abreast of all the advancements across the entire body of the literature. Consequently, potentially valuable cross-silo linkages in the literature tend to remain unnoticed. This indicates the need for tools that efficiently search the knowledge in the literature to assist researchers in forging novel research hypotheses. In this regard, recent advances in text summarisation techniques may assist researchers to some extent by providing them with a high-level overview of the literature. However, such tools are not tailored to capture novel knowledge linkages between seemingly distinct knowledge areas in the literature.
Motivated by this, Literature-Based Discovery (LBD) research (a.k.a. Hypotheses Generation) focuses on developing efficient knowledge discovery models that elicit new, implicit knowledge linkages from existing cross-domain scientific facts. Given the sheer volume of scientific knowledge, LBD is becoming an increasingly important tool in the research development process. For instance, Arrowsmith, which was initiated by the pioneers of the LBD discipline and is considered to be its most popular and well-maintained tool, has approximately 1200 unique monthly users. The escalating benefits that LBD tools offer, as well as their practicality and capacity to accelerate innovation, have attracted more and more research contributions from the text mining community. Smalheiser, a pioneer of the discipline, defines LBD as follows:
“LBD refers to a particular type of text mining that seeks to identify nontrivial assertions that are implicit, and not explicitly stated, within (generally a large body of) documents.”