The Oxford Languages division of Oxford University Press (OUP) is the world leader in human language technology and the primary provider of quality lexical data to academics, technologists and businesses around the globe.
Within its Oxford Global Languages (OGL) initiative, Oxford Languages seeks to create a major digital repository of language data in hundreds of languages and make it available to researchers, learners and developers worldwide. The project started in 2015 and hasn’t slowed down since. The results of this work, clean and structured language datasets, are used by engineers and innovators for a range of use cases, from lexical software development to powering machine translation, NLP and machine learning technologies.
As an official technology partner, Digiteum has been actively engaged in many projects for Oxford Languages and directly involved in the implementation of the OGL initiative. In particular, the team has designed and developed a custom Dictionaries Conversion Framework and lexical analysis software that make it possible to create quality digital dictionaries and lexical datasets quickly and efficiently. To this day, the Digiteum team works side by side with the Oxford team, evolving the technologies used to create top-tier language content and implement the OGL initiative.
Originally, Oxford Languages outsourced the conversion of each dictionary to different teams across the globe depending on the source language. Every team built its own parsing tool to process language data. These tools were incompatible and couldn’t be integrated into a unified workflow. This approach posed a set of challenges in terms of scalability, efficiency and quality of the conversion:
- Lack of automation and standardization. The conversion process was largely manual. It involved unique specialists and required custom lexical software development for each language and type of dictionary.
- Specifics of formats and data. Original data can be structured (e.g. XML), unstructured (e.g. PDF) or semi-structured (e.g. XML whose content still requires parsing), and has to be transformed from different source formats to different target formats.
- Slow process, high cost. It took from 3 weeks to 3 months to create a well-structured and verified dataset for a digital dictionary.
- High error rate. The mostly manual process resulted in up to 20% data loss and a high error rate.
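To make the semi-structured case from the list above concrete, here is a minimal Python sketch of XML in which each entry is wrapped in a tag but its fields are packed into a single text string. The sample data, the `<entry>` element name, the `headword pos. definition` layout and the `parse_entries` helper are all assumptions made for illustration; they are not taken from the Oxford data or from the Dictionaries Conversion Framework, which is built on .NET.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical semi-structured source: the XML wraps each entry, but the
# headword, part of speech and definition are packed into one text string.
SOURCE = """<dictionary>
  <entry>cat n. a small domesticated feline</entry>
  <entry>run v. to move quickly on foot</entry>
  <entry>this line does not follow the pattern</entry>
</dictionary>"""

# Assumed entry layout: "<headword> <pos>. <definition>"
ENTRY_PATTERN = re.compile(
    r"^(?P<headword>\S+)\s+(?P<pos>[a-z]+)\.\s+(?P<definition>.+)$"
)


def parse_entries(xml_text):
    """Split each entry's text into fields; collect entries that do not match."""
    parsed, rejected = [], []
    for element in ET.fromstring(xml_text).iter("entry"):
        text = (element.text or "").strip()
        match = ENTRY_PATTERN.match(text)
        if match:
            parsed.append(match.groupdict())
        else:
            # Keep unparsed entries instead of dropping them silently,
            # so data loss stays visible and can be reviewed.
            rejected.append(text)
    return parsed, rejected


parsed, rejected = parse_entries(SOURCE)
```

Collecting the entries that fail to parse, rather than discarding them, is one simple way to keep data loss measurable during a conversion.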
The Digiteum team started with a detailed analysis of the original conversion process. The goal was to solve all the listed challenges and roll out an efficient and flexible toolkit and workflow based on modern big data processing technologies and practices.
Digiteum has built a custom data pipeline and developed and repurposed a range of lexical analysis and structuring tools to automate and unify the conversion process. The Dictionaries Conversion Framework turned a manual process into an automated conveyor-based workflow that:
- reduces conversion time by a factor of at least 10.
- allows a small team of 4 to perform the whole cycle of data processing from the initial analysis to post-processing and testing regardless of the language or volume.
- is flexible and adjustable depending on the source/target format, conversion goals, language variation and the type of dictionary.
- delivers a 99% data accuracy rate at the output.
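The idea of a conveyor-based workflow can be sketched as a chain of small stages, each taking a batch of entries and handing a new batch to the next stage. The Python sketch below is only an illustration under stated assumptions: the `Batch` type, the `normalize` and `validate` stages and the `pass_rate` metric are hypothetical names, the actual framework is built on .NET, and the source does not say how its 99% accuracy figure is measured.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Entry = Dict[str, str]


@dataclass
class Batch:
    """Entries moving along the conveyor, plus those rejected on the way."""
    entries: List[Entry]
    rejected: List[Entry] = field(default_factory=list)


Stage = Callable[[Batch], Batch]


def normalize(batch: Batch) -> Batch:
    """Trim the headword and lowercase the part-of-speech label."""
    cleaned = [
        {
            **e,
            "headword": e.get("headword", "").strip(),
            "pos": e.get("pos", "").strip().lower(),
        }
        for e in batch.entries
    ]
    return Batch(cleaned, batch.rejected)


def validate(batch: Batch) -> Batch:
    """Move entries without a headword or a definition to the rejected list."""
    ok, bad = [], list(batch.rejected)
    for e in batch.entries:
        (ok if e.get("headword") and e.get("definition") else bad).append(e)
    return Batch(ok, bad)


def run_pipeline(batch: Batch, stages: List[Stage]) -> Batch:
    """Pass the batch through every stage in order."""
    for stage in stages:
        batch = stage(batch)
    return batch


def pass_rate(batch: Batch) -> float:
    """Share of entries that survived all stages (an assumed quality proxy)."""
    total = len(batch.entries) + len(batch.rejected)
    return len(batch.entries) / total if total else 1.0


raw = Batch([
    {"headword": " cat ", "pos": "N", "definition": "a small domesticated feline"},
    {"headword": "", "pos": "v", "definition": "to move quickly"},
])
result = run_pipeline(raw, [normalize, validate])
```

Because every stage has the same signature, stages can be added, removed or reordered per project without touching the others, which is one plausible way to get the flexibility described in the list above.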
Oxford Languages dictionaries are licensed and used by startups and world-known tech giants such as Amazon, Apple, Google and Microsoft for different purposes, including the development of Natural Language Processing software, search engines, multilingual applications and machine translators. The workflow can be tailored depending on the requirements and goals of each conversion project and adjusted to address the needs of each customer precisely.
For example, a custom XML target format was added for Amazon projects. The workflow was also repurposed to create wordlists using the Neo4j database to provide other customers with quality language content.
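One way to support several target formats from the same converted entries is a small registry of writer functions, chosen by name for each project. The following Python sketch is hypothetical: the element names, the `WRITERS` registry and the `export` helper are invented for illustration, the XML layout is not the format used for any real customer, and the wordlist is built from a plain in-memory set rather than Neo4j.

```python
import xml.etree.ElementTree as ET


def to_xml(entries):
    """Serialize entries to a simple, made-up XML layout."""
    root = ET.Element("dictionary")
    for e in entries:
        node = ET.SubElement(root, "entry", {"pos": e["pos"]})
        ET.SubElement(node, "headword").text = e["headword"]
        ET.SubElement(node, "definition").text = e["definition"]
    return ET.tostring(root, encoding="unicode")


def to_wordlist(entries):
    """Return the unique headwords in alphabetical order."""
    return sorted({e["headword"] for e in entries})


# Registry of target formats; a new format only needs a new writer function.
WRITERS = {"xml": to_xml, "wordlist": to_wordlist}


def export(entries, target):
    """Write entries in the requested target format."""
    try:
        writer = WRITERS[target]
    except KeyError:
        raise ValueError(f"unsupported target format: {target}") from None
    return writer(entries)


entries = [
    {"headword": "run", "pos": "v", "definition": "to move quickly on foot"},
    {"headword": "cat", "pos": "n", "definition": "a small domesticated feline"},
    {"headword": "run", "pos": "n", "definition": "an act of running"},
]
wordlist = export(entries, "wordlist")
xml_output = export(entries, "xml")
```

With this layout, adding a customer-specific format means registering one more function, while the upstream parsing and validation steps stay unchanged.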
The Dictionaries Conversion Framework is a blend of data processing technology and modern engineering and QA practices. The Digiteum team has evolved the framework and methodology over the 5-year journey, significantly improving conversion speed and performance. In 2020 alone, the team has already produced about 90 dictionaries and counting, compared with a total of 28 dictionaries issued in 2017.
- Development of a custom framework for efficient and flexible big data processing.
- Custom lexical data software development and repurposing of lexical analysis software (e.g. PDFminer).
- Broad tech stack (.NET, C#, Visual Studio, ANTLR, MSBuild, etc.) and modern deployment (CI/CD) and QA practices (automated testing).
- 260+ dictionaries (bilingual, monolingual, thesaurus, etc.) in dozens of world languages issued in 5 years.
- Quality language content provided to develop natural language understanding software, machine translation technologies, multilingual applications, etc.
CLIENT: Oxford Languages of Oxford University Press
TEAM: 2 software engineers, 1 QA engineer and 1 computational linguist cooperating with the Oxford Languages team