Pleias, GSMA release CommonLingua for 61 African languages


Pleias and GSMA release CommonLingua, an open-source language-identification model covering 61 African languages

Pleias and the GSMA have released CommonLingua, an open-source language-identification model purpose-built for African-language text. The 2-million-parameter model covers 334 languages in total, including 61 African languages, and ships under permissive licences as the first joint output of the GSMA’s “AI Language Models in Africa, by Africa, for Africa” initiative, the partners said in a 28 April announcement from London.

Language identification is the first step in any natural-language pipeline: before a Swahili, Yoruba or Wolof model can be trained, the underlying text has to be correctly recognised by language. Existing tools commonly fail on African content, often mislabelling Swahili or Hausa as English or French. The most widely used open systems, fastText, GlotLID and OpenLID, were trained largely on European and Asian high-resource languages, and frontier models drop roughly 30 percentage points in accuracy on African content compared to major world languages, according to the partners.

How CommonLingua compares

On the new CommonLID benchmark, CommonLingua reaches 83% accuracy and a macro F1 of 0.79, more than ten percentage points ahead of the leading alternatives under comparable evaluation conditions. It does so at roughly one three-hundredth of the parameter count of competing frontier models. The 8-megabyte checkpoint runs about 20 texts per second on a CPU and up to 3,000 per second on a single GPU, the kind of footprint that fits inside on-device inference paths rather than requiring cloud calls.
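Macro F1 averages each language's F1 score with equal weight, so a low-resource language counts as much as a high-resource one, which is why it is reported alongside plain accuracy here. A minimal illustration in plain Python (the three-language counts are invented for the example, not taken from the CommonLID benchmark):

```python
# Macro F1: average per-language F1 scores equally, regardless of how many
# examples each language has. Counts below are hypothetical.
per_lang = {
    "swa": {"tp": 90, "fp": 5, "fn": 10},   # Swahili
    "yor": {"tp": 40, "fp": 20, "fn": 10},  # Yoruba
    "wol": {"tp": 10, "fp": 2, "fn": 30},   # Wolof
}

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

scores = [f1(**c) for c in per_lang.values()]
macro_f1 = sum(scores) / len(scores)
print(round(macro_f1, 3))
```

Note how the poorly served language (Wolof, F1 ≈ 0.38) drags the macro average well below what a frequency-weighted accuracy figure would suggest.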

The model operates directly on UTF-8 byte sequences rather than relying on a language-specific tokeniser. That allows consistent handling across Latin, Arabic, Ethiopic, N’Ko and Tifinagh scripts.
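Working on raw bytes sidesteps tokeniser coverage gaps: every script reduces to the same 256-symbol byte alphabet before any model sees it. A hedged sketch of the general idea (hashed byte n-gram features of the kind byte-level classifiers typically consume; this is not CommonLingua's actual feature code):

```python
# Turn text into hashed byte-trigram feature IDs with no language-specific
# tokeniser: any script -- Latin, Arabic, Ethiopic, N'Ko, Tifinagh -- is
# first reduced to plain UTF-8 bytes.
def byte_ngrams(text: str, n: int = 3, buckets: int = 2**16) -> list[int]:
    data = text.encode("utf-8")
    return [hash(data[i:i + n]) % buckets for i in range(len(data) - n + 1)]

latin = byte_ngrams("Habari za asubuhi")  # Swahili, Latin script
geez = byte_ngrams("ሰላም እንደምን አለህ")       # Amharic, Ethiopic script
# Both texts become plain lists of IDs drawn from the same 0..65535 space,
# so one classifier handles every script uniformly.
```

Because each Ethiopic character occupies three UTF-8 bytes, non-Latin scripts simply yield more trigrams per character; the downstream model never needs script-specific handling.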

The 61 African languages, by family

The African coverage spans seven groupings across the continent's major language families: 21 Bantu languages, 18 Niger-Congo and West African languages, 7 Afro-Asiatic and Semitic, 4 Cushitic and Chadic, 3 Berber, 3 Nilo-Saharan, and 5 pidgins, creoles and other contact languages.

CommonLingua is trained on Common Corpus content, the open multilingual pretraining dataset Pleias maintains. Training sources include Wikipedia, OpenAlex scientific publications, VOA Africa, WaxalNLP, Cultural Heritage and Pralekha datasets, all under permissive licences.

Why language ID matters first

“African languages are not an edge case. They are the working languages of hundreds of millions of people, and they deserve AI infrastructure built with the same care as any other language,” said Pierre-Carl Langlais, Pleias’s co-founder and chief technology officer. “CommonLingua is deliberately the first brick we are laying: you cannot curate what you cannot identify.”

Louis Powell, the GSMA’s Director of AI Initiatives, framed the model as foundational rather than end-user infrastructure. “Closing the gap in African-language AI is fundamental to digital inclusion and unlocking economic opportunity. Progress has long been held back by the lack of foundational infrastructure, beginning with something as essential as language identification,” he said.

In the wider GSMA Africa-AI strategy

CommonLingua is the first joint release from the GSMA Africa-AI initiative but slots into a wider sequence of GSMA work in this area. In March, the body partnered with Zindi on an AI safety challenge to stress-test language models for harmful and biased outputs in African languages, and earlier coverage of Meta's effort to translate 55 African languages signalled how thinly Africa is served by frontier-lab translation work.

The conversation continues at MWC26 Kigali, the GSMA's Africa MWC event, where the partners say industry leaders will discuss how to build on the open-source release. Whether CommonLingua's small-model footprint becomes the blueprint for further African-language infrastructure, or remains a one-off proof point, will become clearer once downstream tools such as translation, summarisation and retrieval models start integrating it.


Oluniyi D. Ajao

