Language corpora are becoming increasingly relevant across human language technology. Leading universities, research groups, and NLP and speech technology companies use corpora of vetted linguistic data for purposes ranging from studying how language develops to training NLP models.
Oxford University Press (OUP) is known as a creator of reliable human language technology tools and services. For years, OUP has been developing and using the New Monitoring Corpus (NMC) system, which harvests RSS feeds to collect data. To date, NMC has stored more than 26 million documents with around 9 million tokens.
However, newer technologies make it possible to improve on these numbers and extend the capabilities of corpora. With this in mind, OUP decided to create a system that could collect more data without hitting performance limits.
The main goal was to build a platform that scales in both performance and functionality. To meet this challenge, OUP chose cloud technology, which is not a standard approach to building a corpus. The resulting Super Corpus Platform proved to be a powerful system that exceeded all expectations.
OUP’s Super Corpus Platform is a cloud-based modular solution built on Microsoft Azure. It was designed and deployed jointly by OUP and Digiteum, OUP’s official technology partner, in collaboration with the Microsoft Azure team.
As product owner, OUP provided a high-level linguistic team and made an invaluable contribution to the system architecture and design. The Digiteum team took the lead on the complex technology design: it created the system architecture, implemented the project, and deployed the platform. The Microsoft Azure team assisted with the development of the system architecture.
In short, Super Corpus Platform is a system for corpus creation. It builds a corpus by converting raw text data into balanced, annotated data.
This is how it works. The system collects raw text data from selected data sources. It then performs several types of processing: it filters the initial data, enriches the existing metadata with new attributes, deduplicates documents, and performs linguistic annotation. Finally, the system sends the processed data to Azure Cosmos DB for storage and exports it to external systems such as Sketch Engine.
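The processing order described above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's actual code: the `Document` class, the function names, the five-word filter threshold, and the hash-based exact-duplicate check are all assumptions made for the example, and the annotation step is reduced to whitespace tokenization.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)
    annotations: list = field(default_factory=list)


def keep(doc):
    # Filtering: drop documents too short to be useful (hypothetical threshold).
    return len(doc.text.split()) >= 5


def enrich(doc):
    # Metadata enrichment: add a derived attribute to what the source supplied.
    doc.metadata["token_count"] = len(doc.text.split())
    return doc


def fingerprint(doc):
    # Exact-duplicate key: hash of the case- and whitespace-normalized text.
    normalized = " ".join(doc.text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def annotate(doc):
    # Stand-in for linguistic annotation: whitespace tokens only.
    doc.annotations = doc.text.split()
    return doc


def run_pipeline(raw_docs):
    """Filter, enrich, deduplicate, and annotate, in that order."""
    seen = set()
    processed = []
    for doc in raw_docs:
        if not keep(doc):
            continue
        doc = enrich(doc)
        key = fingerprint(doc)
        if key in seen:
            continue
        seen.add(key)
        processed.append(annotate(doc))
    return processed
```

The list returned by `run_pipeline` is what a real system would then hand to storage and export; those steps are left out of the sketch.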
The OUP team considered three data sources and chose the Event Registry news aggregator, which proved to be the most suitable text data provider for editors. Cosmos DB was chosen for data storage. This document-oriented database service can handle massive volumes of text data and fully meets the system requirements for elastic performance and sustainability.
The system harvests as many as 80 thousand documents per day in cold mode. For the record, these 80 thousand documents comprise all the English-language news articles that Event Registry collects from across the Internet each day. Super Corpus Platform can process this one-day news feed in 3 hours. Moreover, it can handle roughly 1 million documents a day in hot mode, with a peak performance of 8.6 million documents in 24 hours.
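To put these figures on a common scale, the short Python sketch below converts them to documents per second. The input numbers are the ones quoted above; the function names are made up for the example, and extrapolating the 3-hour rate to a full day is only a back-of-the-envelope assumption, not a measured result.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

DAILY_FEED = 80_000          # documents harvested per day in cold mode
PROCESSING_HOURS = 3         # time needed to process one day's feed
HOT_MODE_PER_DAY = 1_000_000
PEAK_PER_DAY = 8_600_000


def docs_per_second(documents, seconds):
    """Average processing rate over the given time window."""
    return documents / seconds


def extrapolated_daily_volume(documents, hours):
    """Documents per day if the observed rate were sustained for 24 hours."""
    return docs_per_second(documents, hours * SECONDS_PER_HOUR) * SECONDS_PER_DAY


if __name__ == "__main__":
    feed_rate = docs_per_second(DAILY_FEED, PROCESSING_HOURS * SECONDS_PER_HOUR)
    peak_rate = docs_per_second(PEAK_PER_DAY, SECONDS_PER_DAY)
    print(f"Daily feed rate: {feed_rate:.1f} docs/s")
    print(f"Sustained for 24 h: {extrapolated_daily_volume(DAILY_FEED, PROCESSING_HOURS):,.0f} docs")
    print(f"Hot mode: {docs_per_second(HOT_MODE_PER_DAY, SECONDS_PER_DAY):.1f} docs/s")
    print(f"Peak: {peak_rate:.1f} docs/s, {PEAK_PER_DAY / DAILY_FEED:.1f}x the daily feed")
```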
Today, Super Corpus Platform compiles 2 million documents a month, which corresponds to 800-900 thousand documents of balanced corpus content. For comparison, the NMC system collected 26 million documents in 5 years, or roughly 430 thousand a month. This makes the new Super Corpus Platform almost 5 times more efficient than the old system, not to mention the quality of the balanced corpus content that the new system creates.
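The comparison can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the function names are illustrative, and it assumes the 5 years of NMC collection are counted as 60 equal months.

```python
NMC_TOTAL_DOCS = 26_000_000
NMC_YEARS = 5
SCP_DOCS_PER_MONTH = 2_000_000
SCP_BALANCED_PER_MONTH = (800_000, 900_000)


def monthly_rate(total_docs, years):
    """Average documents per month over the given number of years."""
    return total_docs / (years * 12)


def speedup(new_monthly, old_monthly):
    """How many times more documents per month the new system collects."""
    return new_monthly / old_monthly


if __name__ == "__main__":
    nmc_monthly = monthly_rate(NMC_TOTAL_DOCS, NMC_YEARS)
    low, high = (n / SCP_DOCS_PER_MONTH for n in SCP_BALANCED_PER_MONTH)
    print(f"NMC: about {nmc_monthly:,.0f} docs per month")
    print(f"Speedup: {speedup(SCP_DOCS_PER_MONTH, nmc_monthly):.1f}x")
    print(f"Balanced share of raw intake: {low:.0%} to {high:.0%}")
```

By this estimate, a little under half of the raw monthly intake ends up in the balanced corpus.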
Moreover, the modular architecture of Super Corpus Platform makes it highly scalable in both performance and functionality. New components, such as sentiment analysis or genre recognition, can be added to a given data creation process, as can other data sources (Twitter, hybrid search using the Bing API) and additional languages. Technically, even the core components, the processing pipeline and the data storage, can be swapped from Azure components to equivalent modules on other platforms.
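One common way to get this kind of modularity is to keep processing stages and the storage backend behind small interfaces. The Python sketch below illustrates the idea under that assumption; it does not describe the platform's real design. `Storage`, `InMemoryStorage`, `Pipeline`, and `toy_sentiment` are hypothetical names, and the sentiment stage is a toy keyword check rather than a real model.

```python
from typing import Callable, Protocol


class Storage(Protocol):
    """Anything with a save() method can act as the storage backend."""

    def save(self, doc: dict) -> None: ...


class InMemoryStorage:
    """Hypothetical stand-in for a real backend such as a document database."""

    def __init__(self) -> None:
        self.docs: list[dict] = []

    def save(self, doc: dict) -> None:
        self.docs.append(doc)


Stage = Callable[[dict], dict]


class Pipeline:
    def __init__(self, storage: Storage) -> None:
        self.storage = storage
        self.stages: list[Stage] = []

    def add_stage(self, stage: Stage) -> "Pipeline":
        # New functionality is added by registering another stage.
        self.stages.append(stage)
        return self

    def process(self, doc: dict) -> dict:
        for stage in self.stages:
            doc = stage(doc)
        self.storage.save(doc)
        return doc


def toy_sentiment(doc: dict) -> dict:
    # Toy placeholder for a sentiment component, not a real model.
    positive = {"good", "great", "excellent"}
    words = set(doc["text"].lower().split())
    return {**doc, "sentiment": "positive" if positive & words else "neutral"}
```

With this shape, adding a component is a call such as `Pipeline(InMemoryStorage()).add_stage(toy_sentiment)`, and swapping the storage backend means passing a different object that provides `save`.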
All in all, Super Corpus Platform is a corpus creation solution with huge potential and wide applicability.
- Innovative cloud-based corpus creation platform that uses Microsoft Azure Cosmos DB;
- Linguistic annotation based on industry-leading toolkits: Stanford CoreNLP and Apache OpenNLP;
- Platform with a modular architecture that ensures unparalleled scalability potential;
- High performance: Super Corpus Platform can process all the English-language news articles published across the Internet in a day in 3 hours;
- Elastic productivity: the system can absorb up to a 100x overload, scaling from 80 thousand to more than 8 million raw documents per day;
- Super Corpus Platform compiles 800-900 thousand documents of balanced content monthly and can collect 12-15 terabytes of linguistic data yearly.
CLIENT: Oxford University Press