The Embase project

Background

For many years, Cochrane has been feeding reports of trials from PubMed and Embase into the Cochrane Central Register of Controlled Trials (CENTRAL). This has made CENTRAL an incredibly rich and valuable resource for authors and others trying to identify the evidence. The way Embase records are fed into CENTRAL changed in 2013, when a new model that included crowdsourcing (the Embase project) was introduced. Records of possible RCTs and quasi-RCTs from Embase are now identified in two ways:

1. Through an autofeed

2. Through human processing/screening (using a ‘crowd’)

The autofeed

Approximately two-thirds of all the reports of RCTs in Embase are indexed with the EMTREE term RCT or CCT. Every month (around the 20th), we feed these records directly into CENTRAL. That means two-thirds of the records we want in CENTRAL are identified already, with no human screening needed.

The crowd approach

The remaining third is retrieved through a sensitive search strategy developed by Julie Glanville at YHEC. The search (complete strategy available at: http://www.cochranelibrary.com/help/central-creation-details.html) is run in Embase every month via Ovid SP, and the retrieved records are then screened by a crowd. Anyone can join the crowd and start screening: when someone signs up, they complete a brief, interactive training module before screening ‘live’ records.
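
For readers who think in code, the two routes amount to a simple split over each month's records. The sketch below is illustrative only: the record structure, the 'emtree' field name and the term strings are assumptions made for the example, not the actual Cochrane or Embase data model.

    # Illustrative Python sketch of the two routes described above.
    # The field name 'emtree' and the term strings are assumptions.
    AUTOFEED_TERMS = {"randomized controlled trial", "controlled clinical trial"}

    def route_records(monthly_records):
        """Split a month's Embase records into the autofeed set (indexed
        as RCT/CCT, fed straight into CENTRAL) and the crowd queue
        (records retrieved by the sensitive search, needing screening)."""
        autofeed, crowd_queue = [], []
        for record in monthly_records:
            terms = {t.lower() for t in record.get("emtree", [])}
            if terms & AUTOFEED_TERMS:
                autofeed.append(record)     # route 1: the autofeed
            else:
                crowd_queue.append(record)  # route 2: human screening
        return autofeed, crowd_queue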

How do we ensure quality in this process?

To be included in CENTRAL, a record must be assessed by at least two different screeners. We have evaluated this method, and the results show very high levels of accuracy in the crowd’s ability both to identify the records we want in CENTRAL and to reject the records we don’t.
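
The precise agreement rules are not spelled out above, but dual screening of this kind is often implemented along the following lines; in particular, the escalation step for disagreements is an assumption for illustration, not a documented project rule.

    # Illustrative dual-screening logic. Escalating disagreements to a
    # further screener is an assumption about one common way to run
    # "assessed by at least two different screeners".
    def classify(assessments):
        """Decide a record's fate from its crowd assessments, each
        either 'include' or 'exclude'."""
        if len(assessments) < 2:
            return "needs_more_screening"  # fewer than two assessments so far
        if assessments[0] == assessments[1]:
            return assessments[0]          # two screeners agree
        return "needs_resolver"            # disagreement: escalate

    classify(["include", "include"])  # -> 'include'
    classify(["include", "exclude"])  # -> 'needs_resolver'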

Progress to date

Our vision is that, in the future, authors and information specialists will only need to search CENTRAL to find relevant reports of randomized and quasi-randomized trials. As far as Embase is concerned, we are much closer to this goal: we have established the new crowd model, evaluated its accuracy, and cleared several years’ worth of records. As of mid-December 2015, the crowd were screening records added to Embase in October 2015. The number of records needing human screening roughly doubled in the last year with the introduction of conference records into the crowd process, but despite this we are closing the small time lag between the date of publication in Embase and publication in CENTRAL.

Does this mean review author teams won’t need to search Embase anymore?

Not quite, or at least not completely. If you only searched CENTRAL at the time of writing, you would potentially miss RCTs added to Embase in October, November and the first couple of weeks of December 2015. For the time being, you will need to run a ‘top-up’ search in Embase covering the last few months to be completely up to date.
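
As a rough illustration of working out that top-up window, the hypothetical helper below takes the last month fully cleared into CENTRAL and the current date; the YYYYMM month format is an assumption made for the example.

    # Hypothetical helper for the 'top-up' window: search Embase from
    # the month after the last month fully cleared into CENTRAL up to
    # the current month. The YYYYMM format is an assumption.
    from datetime import date

    def topup_window(last_cleared, today):
        year, month = last_cleared.year, last_cleared.month + 1
        if month > 12:
            year, month = year + 1, 1
        return f"{year}{month:02d}", f"{today.year}{today.month:02d}"

    # Matching the situation above: records cleared through September
    # 2015, writing in mid-December 2015 -> search 201510 to 201512.
    topup_window(date(2015, 9, 30), date(2015, 12, 15))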

Looking ahead

Now that we have established the method as robust, we are focusing on improving efficiency by closing the time lag, and on investigating the feasibility of bringing other databases into the crowd model. Work on a ‘centralised search service’ is at an early stage. There are challenges ahead, particularly the issue of getting permission to republish records from other commercial databases. However, we are working right now on a way of identifying and including reports of RCTs from ClinicalTrials.gov. The approach outlined above won’t be right for every database, but using a combination of highly sensitive search filter development, crowdsourcing and machine learning, we’re pressing ahead with the project!

Where can I find out more?

For any other questions or queries, contact Anna Noel-Storr (anna.noel-storr@rdm.ox.ac.uk), Gordon Dooley (Gordon@metaxis.com) or Ruth Foxlee (rfoxlee@cochrane.org).