Lexical Data Software Development for Oxford Languages — Digiteum Case Study

Dictionaries Conversion Framework

Oxford Languages case study

The Oxford Languages division of Oxford University Press (OUP) is the world leader in human language technology and the primary provider of quality lexical data to academics, technologists and businesses around the globe.

Within its Oxford Global Languages (OGL) initiative, Oxford Languages seeks to create a major digital repository of language data in hundreds of languages and make it available to researchers, learners and developers worldwide. The project started in 2015 and hasn’t slowed down since. The results of this work, clean and structured language datasets, are used by engineers and innovators for a range of use cases, from lexical software development to powering machine translation, NLP and Machine Learning technologies.

Our role

As an official technology partner, Digiteum has been actively engaged in many projects for Oxford Languages and directly involved in the implementation of the OGL initiative. In particular, the team has designed and developed a custom Dictionaries Conversion Framework and lexical analysis software that make it possible to create quality digital dictionaries and lexical datasets quickly and efficiently. To this day, the Digiteum team works side by side with the Oxford team to evolve the technologies used to create top-tier language content and to implement the OGL initiative.



Unified data pipeline and analytics tools for efficient lexical data processing

Challenges

Originally, Oxford Languages outsourced the conversion of each dictionary to different teams across the globe depending on the source language. Every team built its own parsing tool to process language data. These tools were incompatible and couldn’t be integrated into a unified workflow. This approach posed a set of challenges in terms of scalability, efficiency and quality of the conversion:

    • Lack of automation and standardization. The conversion process was largely manual. It depended on individual specialists and required custom lexical software development for each language and type of dictionary.
    • Specifics of formats and data. Original data can be structured (e.g. XML), unstructured (e.g. PDF) or semi-structured (e.g. XML that requires further parsing) and has to be transformed from different source formats to different target formats.
    • Slow process, high cost. It took from 3 weeks to 3 months to create a well-structured and verified dataset for a digital dictionary.
    • High error rate. The mostly manual process resulted in up to 20% data loss and a high error rate.
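
To illustrate the difference between structured and semi-structured sources mentioned above, here is a minimal Python sketch. It is not taken from the Dictionaries Conversion Framework (which, per the Highlights, is built on .NET). The element names (hw, pos, tr), the Entry record and the one-line entry layout are all assumptions made up for this example.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical common record that every source is normalized into."""
    headword: str
    pos: str
    translation: str


def from_structured(xml_text: str) -> list[Entry]:
    """Structured source: each field already lives in its own element."""
    root = ET.fromstring(xml_text)
    return [
        Entry(e.findtext("hw", ""), e.findtext("pos", ""), e.findtext("tr", ""))
        for e in root.iter("entry")
    ]


# Assumed line layout for the semi-structured case: "headword (pos) translation"
LINE = re.compile(r"^(?P<hw>\S+)\s+\((?P<pos>[^)]+)\)\s+(?P<tr>.+)$")


def from_semistructured(xml_text: str) -> list[Entry]:
    """Semi-structured source: XML wrapper, fields packed into one string."""
    entries = []
    for e in ET.fromstring(xml_text).iter("entry"):
        m = LINE.match((e.text or "").strip())
        if m:  # lines that do not match the layout are skipped, not guessed
            entries.append(Entry(m["hw"], m["pos"], m["tr"]))
    return entries


structured = "<dict><entry><hw>cat</hw><pos>noun</pos><tr>gato</tr></entry></dict>"
semi = "<dict><entry>dog (noun) perro</entry></dict>"

entries = from_structured(structured) + from_semistructured(semi)
```

Both readers return the same record type, so later steps do not need to know which kind of source an entry came from.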

Solutions

The Digiteum team started with a detailed analysis of the original conversion process. The goal was to solve all the listed challenges and roll out an efficient and flexible toolkit and workflow based on modern big data processing technologies and practices.

Digiteum built a custom data pipeline and developed and repurposed a range of lexical analysis and structuring tools to automate and unify the conversion process. The Dictionaries Conversion Framework turned a manual process into an automated conveyor-based workflow that:

  • reduces conversion time at least tenfold.
  • allows a small team of 4 to perform the whole cycle of data processing, from initial analysis to post-processing and testing, regardless of the language or volume.
  • is flexible and adjustable depending on the source/target format, conversion goals, language variation and the type of dictionary.
  • provides a 99% data accuracy rate at the output.
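
The idea of a conveyor-based workflow can be sketched in a few lines of Python. This is an illustrative assumption, not the framework's actual design: the stage names, the record layout and the run_conveyor helper are hypothetical. The sketch shows records passing through an ordered list of stages while the record count is logged at each step, so that any data loss is visible rather than silent.

```python
from typing import Callable

Record = dict[str, str]
Stage = Callable[[list[Record]], list[Record]]


def normalize(records: list[Record]) -> list[Record]:
    """Trim stray whitespace from every field."""
    return [{k: v.strip() for k, v in r.items()} for r in records]


def drop_incomplete(records: list[Record]) -> list[Record]:
    """Keep only records that have both a headword and a translation."""
    return [r for r in records if r.get("headword") and r.get("translation")]


def run_conveyor(records: list[Record], stages: list[Stage]):
    """Apply stages in order and log (stage name, count in, count out)."""
    report = []
    for stage in stages:
        before = len(records)
        records = stage(records)
        report.append((stage.__name__, before, len(records)))
    return records, report


raw = [
    {"headword": " cat ", "translation": "gato"},
    {"headword": "dog", "translation": " "},  # empty after trimming
    {"headword": "bird", "translation": "pájaro"},
]
clean, report = run_conveyor(raw, [normalize, drop_incomplete])
retained = len(clean) / len(raw)  # share of input records that survived
```

Because every stage has the same signature, stages can be added, removed or reordered per project without changing the runner.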

Fueling high-end technologies and innovation

Oxford Languages dictionaries are licensed and used by startups and world-known tech giants such as Amazon, Apple, Google and Microsoft for different purposes, including the development of Natural Language Processing software, search engines, multilingual applications and machine translators. The workflow can be tailored depending on the requirements and goals of each conversion project and adjusted to address the needs of each customer precisely.

For example, a custom XML target format was added for Amazon projects, and the workflow was repurposed to create wordlists using the Neo4j database to provide other customers with quality language content.

The Dictionaries Conversion Framework is a blend of data processing technology and modern engineering and QA practices. The Digiteum team has evolved the framework and methodology over the 5-year journey and significantly improved conversion speed and performance. In 2020 alone, the team has already produced about 90 dictionaries and counting, compared with a total of 28 dictionaries issued in 2017.
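
One way to make target formats pluggable is a small registry of writer functions, sketched below in Python. Everything here is hypothetical: the WRITERS registry, the element names and the entry layout are invented for illustration and do not reflect the actual Amazon XML format. The wordlist writer also returns plain text only, without the Neo4j storage step.

```python
import xml.etree.ElementTree as ET
from typing import Callable

Entry = dict[str, str]
WRITERS: dict[str, Callable[[list[Entry]], str]] = {}


def writer(name: str):
    """Register a function as the writer for the named target format."""
    def register(fn):
        WRITERS[name] = fn
        return fn
    return register


@writer("xml")
def to_xml(entries: list[Entry]) -> str:
    """Serialize entries to a made-up XML layout."""
    root = ET.Element("dictionary")
    for e in entries:
        node = ET.SubElement(root, "entry")
        ET.SubElement(node, "headword").text = e["headword"]
        ET.SubElement(node, "translation").text = e["translation"]
    return ET.tostring(root, encoding="unicode")


@writer("wordlist")
def to_wordlist(entries: list[Entry]) -> str:
    """Emit unique headwords, sorted, one per line."""
    return "\n".join(sorted({e["headword"] for e in entries}))


entries = [
    {"headword": "dog", "translation": "perro"},
    {"headword": "cat", "translation": "gato"},
    {"headword": "dog", "translation": "can"},
]
wordlist = WRITERS["wordlist"](entries)
xml_out = WRITERS["xml"](entries[:1])
```

Adding a format for a new customer then means registering one more writer, while the rest of the workflow stays unchanged.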

Oxford Languages seeks to create a major digital repository of language data in hundreds of languages. Video credit: Oxford Languages.

Highlights

  • Development of a custom framework for efficient and flexible big data processing.
  • Custom lexical data software development and repurposing of lexical analysis software (e.g. PDFminer).
  • Broad tech stack (.NET, C#, Visual Studio, ANTLR, MSBuild, etc.) and modern deployment (CI/CD) and QA practices (automated testing).
  • 260+ dictionaries (bilingual, monolingual, thesaurus, etc.) in dozens of world languages issued in 5 years.
  • Quality language content provided to develop natural language understanding software, machine translation technologies, multilingual applications, etc.

PROJECT DETAILS

DATE: 2015-present
CLIENT: Oxford Languages of Oxford University Press
TEAM: 2 software engineers, 1 QA engineer and 1 computational linguist cooperating with the Oxford Languages team


Interested in custom lexical software development and data processing technologies? Let’s talk about your project.
