SEI in 2026: Scripts, Strategies, and Open Questions

At any given time, the Script Encoding Initiative has upwards of a dozen research projects in progress, aimed at including new scripts in the Unicode Standard. Because each script presents its own technical, historical, and community-specific questions, it can be difficult to see the shape of the work as a whole. What follows here is an overview of our planned near-term work by region: the scripts under investigation, the collaborators involved, and the kinds of decisions we’re navigating as we identify the right approaches for encoding these scripts.

Africa

Script	Region	Period	Languages	Script Type	Collaborators	Status
Masaba	Mali	1930 to Present	Bambara	syllabary	Oreen Yousuf	Revision nearly ready 2024 Proposal
Minim Dag Noore	Burkina Faso	2006 to Present	Mooré	alphabet	Oreen Yousuf	Revision nearly ready 2025 Proposal
Ndiko Jonam	Kenya	2009 to Present	Luo languages	alphabet	Oreen Yousuf	Revision under review 2025 Proposal
N’ti	Mali, Republic of Congo, Ivory Coast	1985 to Present	Soninke	alphabet	Ibrahima Ceesay, Oreen Yousuf	Revision under review 2025 Proposal
Odùduwà	Benin, Nigeria	2016 to Present	Yoruba	alphabet	Vyshantha Simha, Oreen Yousuf	Revised proposal planned 2025 Proposal

The majority of the scripts we’re tackling from Africa are new inventions. They range from Masaba, invented just under a century ago, to Odùduwà, invented in 2016. For such scripts, the challenge is to demonstrate two things: stability, that the characters’ shapes and meanings have remained consistent over time, and adoption, that the script has been embraced by a user community.

While it’s easy to group these uniformly under the umbrella of “modern neographies,” or newly-invented scripts, in fact the scripts vary quite a bit in intention and use. A script like Masaba is used by a few hundred people in rural villages – perhaps suggesting minimal adoption at first glance. However, it seems to have been used continuously since its invention, certainly outliving the creator, and for some of its users, is the only script with which they are literate. This suggests a high level of success in carrying out what the script was intended to do.

Map of West African language communities from Galtier (1987)

Masaba sign list from Galtier (1987)

Handwritten Masaba letter from 2024 proposal

In contrast, Ndiko Jonam, also known as Luo Lakeside, mirrors the pattern of more “viral” recent inventions, catching on with swaths of people in dense areas and spurring material investment in the script, evidenced in things like fridge magnets of the letters, representing a different kind of success.

The challenge for us, and for Unicode standards-makers, is to determine what kind of justification is “enough” – what nature of evidence convinces that these newer inventions are here to stay and that users desire to communicate in them over the internet. We’re still in the midst of learning more about this class of cases, but thus far believe the candidates above are worth investigating further.¹

Fridge magnets of Ndiko Jonam script from 2025 proposal

Secondary school teachers in Botswana holding clock and shirt with Ndiko Jonam script

Secondary school teachers in Botswana with supplies produced with Ndiko Jonam from 2025 proposal

Middle East

Script	Region	Period	Languages	Script Type	Collaborators	Status
Book Pahlavi	Iran	200 to 1100 CE	Middle Iranian	abjad	Anshuman Pandey, Roozbeh Pournader, Arash Zeini	Revised proposal planned 2024 Proposal
Linear Elamite	Iran	2300 to 1850 BCE	Elamite	logosyllabary, semisyllabary	François Desset, Sina Fakour, Thomas Huot-Marchand, Anshuman Pandey	Revised proposal planned 2021 Proposal
Persian Siyaq	Iran	900 to 1900 CE	Persian	numbers	Kourosh Beigpour, Anshuman Pandey	Revision nearly ready 2021 Proposal
Proto-Elamite	Iran	3100 to 2900 BCE	Unknown	logosyllabary	Anshuman Pandey	Revised proposal planned; to follow Linear Elamite 2020 Proposal
Proto-Sinaitic	Egypt	1850 to 1550 BCE	Unknown Canaanite language	abjad	Anshuman Pandey	Proposal nearly ready 2019 Report

The Middle Eastern scripts on our docket are all historic, which present a unique set of challenges. One of the central tenets of the Unicode Standard is that it encodes only characters, not glyphs. Character is meant to refer to the abstract notion of a unit of a writing system, roughly equivalent to a grapheme. Glyphs, on the other hand, are considered to be the visual variations that map more closely to style than meaning. These variations can be handled by different font options, rather than unique encodings.

Diagram showing comparison between glyphs and Unicode characters — For alphabets, at least, a single character is far simpler to encode than an infinite set of glyphs. From the Unicode Standard, chapter 2

This principle proves difficult to apply for more complicated script types, and specifically for ancient scripts where scholars may not agree upon or know what the core constitutive character is from a collection of similar-looking markings. Our strategy here is to only encode scripts once the relevant scholarly community achieves consensus on a repertoire. Characters do not even need to be deciphered, but there needs to be an agreed-upon inventory that is labeled and recognized in a consistent way.

For several of the ancient scripts above (Linear Elamite, Proto-Elamite, Proto-Sinaitic), there have been recent breakthroughs in scholarly understanding that are driving these projects forward. For the two relatively younger scripts, our current task involves reviewing sources for additional attestations before finalizing the repertoire for Persian Siyaq, and homing in on a character repertoire that will enable users to represent the complexities of Book Pahlavi using a simple encoding model.

Historic scripts differ most significantly from modern ones in that the contemporary users are quite specialized: they are typically scholars looking to publish commentary or digitize materials. The question is whether Unicode’s current encoding strategy works for them – to what extent is important information lost in abstracting to “characters”? As we hear of examples of scholars continuing to use ad hoc methods to suit their needs, resorting to images, hacked fonts, or Unicode’s private use area, they raise the question of what approach should be used moving forward.

text from academic article showing drawn proto sinaitic characters interspersed in typeset text — Proto-Sinaitic in running text (Goldwasser 2011)

Americas

Script	Region	Period	Languages	Script Type	Collaborators	Status
Classic Maya Hieroglyphs	Mexico, Guatemala, Belize, Honduras	250 to 900 CE	Mayan	logosyllabary	Alexandre Bassi, Andrew Glass, Gabrielle Vail	Font work underway; proposal to follow
Codical Maya Hieroglyphs	Mexico, Guatemala, Belize, Honduras	1100 to 1519 CE	Mayan	logosyllabary	Andrew Glass, Carlos Pallan, Céline Tamignaux	Proposal nearly ready

Continuing along the historic theme, we have two ongoing projects in the Americas, both aimed at representing Maya hieroglyphic writing. Maya writing is incredibly complex. The ornate signs are agglomerated into blocks which generally follow a 6×6 grid. These blocks are usually arranged in paired columns. Rendering the large sign repertoire and block arrangements echo the complexity of Egyptian Hieroglyphs, simpler in some respects and more complex in others. A special challenge for Maya writing is that the glyphs need to precisely touch within a block, leaving no room for error.

Comparison of Egyptian quadrats and Maya hieroglyph quadrats — Comparison of spacing between Egyptian and Mayan quadrats. From “Preliminary Proposal to Encode Maya” (unpublished)

The first project of ours is Codical Maya, done in conjunction with the Unicode Consortium. Because there is a bounded set of codices and characters from the post-Classic period, these represent a tractable way to first establish the Maya code block. The follow-on project is Classic Maya, which will be proposed as extensions to that initial block. The potential corpus from the Classic period is quite sprawling, so the approach here is to prioritize characters that would allow the representation of key texts used by scholars.² The hope is that future extensions would continue to add material from the Classic period.

Excerpts from Mayan codices represented in the initial proposal

Sample inscription of Maya hieroglyphs from the Classic period

Sample inscription from Classic period (Source: Alexandre Bassi)

These projects have thus required extensive coordination with Unicode standards-makers (the leads on these projects began meeting with standards-makers over a decade ago to find a workable strategy for this proposal), and interdisciplinary input from archeologists, type designers, and font engineers. Andrew Glass, product manager at Microsoft and our frequent collaborator, serves as an essential linchpin across these projects, working in parallel with both expert teams to hone a common strategy for the encodings.

Southeast Asia

Script	Region	Period	Languages	Script Type	Collaborators	Status
Lampung	Sumatra, Indonesia	1600 to Present	Lampung	abugida	Febri Muhammad Nasrullah	Revision under review 2025 Proposal
Kerinci	Sumatra, Indonesia	1370 to Present	Kerinci	abugida	Febri Muhammad Nasrullah, Anshuman Pandey, Aditya Bayu Perdana	Revised proposal planned 2016 Proposal
Kulitan	Pampanga, Philippines	1600 to Present	Kapampangang	abugida	Anshuman Pandey, Julie Sayo	Proposal nearly ready 2015 Report
Rejang	Sumatra, Indonesia	1700 to Present	Malay, Bengkulu, Rejang	abugida	Ariq Syauqi	Script published in Unicode 5.1 (2008). Comments on codeblock underway 2025 Review

From Southeast Asia, we are working on one script from the Philippines (Kulitan), and a cluster from Indonesia (Lampung, Kerinci, Rejang). All of these are examples of historic scripts that have found serious modern interest. Such cases present their own challenges. Which orthography should we hold fidelity to, the version attested in historic manuscripts or a modern reform that is part of revitalization efforts today? If both, should the orthographies share an encoding or be treated separately by Unicode? In the past, such questions have elicited strong opinions, sometimes leading to protracted conflicts that delay encoding.

Lampung manuscript from the National Library of Indonesia

Example of historic orthography in biscriptual Lampung manuscript 1416131_93-E-31-Naskah-Rencong, housed in the National Library of Indonesia

Standardized Lampung script as being taught in schools with glyph variants of the same grapheme

Standardized Lampung script as being taught in schools. Characters highlighted were not found in the manuscripts or were previously considered glyph variants of the same grapheme. Image taken from a textbook for the 5th grade elementary school pupils (Source: Ariq Syauqi)

Lampung new orthography from Novri Rahman, one of the best graduates of Lampung language and culture master's program

Modern orthography preferred by graduate of Lampung masters program (Source: Ariq Syauqi)

Contemporary use of Lampung script with Latin transcription, showing new orthography and novel graphemes: word spacing is used, and O & U sounds are distinguished (Source: Ariq Syauqi)

Our proposal work on Kulitan will be supported by research fellow Julie Sayo.³ For this, we are in the midst of collecting documentation on modern use and determining what kind of encoding model makes sense for this complicated script – does it work more like Devanagari or Hangul in how the characters are formed?

The Indonesian proposal projects include collaborations with current and past research fellows Febri Muhammad Nasrullah and Ariq Syauqi, working with our core proposal team.⁴ Some relevant matters here include how modifying characters should be classified – are they beside or above the base character, which can be hard to tell when pulling from handwritten manuscript attestations – and what sequence we anticipate users to type in. These are recurring questions for Indic scripts, which have many combining and reordered pieces that must be handled by Unicode in coordination with input methods and text rendering software.

Excerpt from the "Review of the Rejang Unicode Range" document discussing challenges of vowel mark positioning — Excerpt from “Review on the Rejang Unicode Range” discussing challenges of defining character properties for vowel marks

South Asia

Script	Region	Period	Languages	Script Type	Collaborators	Status
Box-Headed	Central and Southern India	200 to 800 CE	Sanskrit, Kannada	abugida	Jan Kučera, Biswajit Mandal	Revised proposal planned 2024 Proposal
Kurukh Banna	Odisha, India	1991 to Present	Kurukh	abugida	Biswajit Mandal, Anshuman Pandey	Revised proposal planned 2024 Proposal
Vatteluttu	South India	400 to 1500 CE	Tamil, Malayalam	abugida	Biswajit Mandal, Anshuman Pandey	Revision nearly ready 2016 Proposal
Zou	Northeastern India	1952 to Present	Zomi, Zou, Zo	alphabet	Biswajit Mandal, Anshuman Pandey	Revision nearly ready 2010 Proposal

Closing out our list, we have a mix of modern and historic scripts originating in South Asia. Amongst the modern, we have Kurukh Banna. Kurukh Banna was invented for the Kurukh language and is used primarily in Odisha state in India. A different script for the same language, Tolong Siki, was recently encoded in Unicode 17.0 (described in our blog post here). Though Tolong Siki benefits from official state recognition, we assessed that Kurukh Banna has reached a similar level of stability and adoption, and are thus putting it forward for Unicode inclusion. We are also working on the Zou (or Zolai) script, not to be confused with the modern Zou script from the Republic of Congo. This Zou script is also from northeastern India, a particularly active region for script invention. Our proposal authors have been accumulating evidence for this script and have been in touch with a community of users keen to use the script on digital devices.

The two historic scripts on the docket, Vatteluttu and Box-Headed, are both from South and Central India and are precursors to scripts such as Tamil, Malayalam, and Kannada. For each script, the primary decision concerns how much internal historical variation to represent explicitly. These scripts were used across several centuries, regions, and dynasties. One approach could be to delineate these different epochs and encode them as distinct scripts. The other approach could be to unify them under a general model, allowing distinctions to be handled in fonts. These decisions will also determine what the script name should be, whether described graphically (e.g. “Box-headed,” “Arrow-headed”), by dynasty (e.g. Kadamba), or something else entirely.

Excerpt from the "Proposal to Encode the Box-Headed script in Unicode" document that shows the relationship between the script varieties — Excerpt from “Proposal to Encode the Box-Headed script in Unicode” showing relationships between script varieties

As you can see, the challenges facing script encoding researchers are not purely technical, though familiarity with the priorities and principles governing the text stack is necessary. But the challenges are also paleographic and philologic, ultimately requiring the development of classifications that balance historical understanding with contemporary functionality.

We do our best to work across a constellation of stakeholders, but are continually trying to incorporate broader feedback to inform the encoding process. If you have expertise on any of the scripts mentioned above, do get in touch with us.

We are able to undertake this long-horizon, intensive research because of sustained support from the Mellon Foundation and, in earlier phases, the National Endowment for the Humanities – institutions whose commitment to scholarship and the public interest makes projects like this possible.

You can read about how our contributor Oreen Yousuf thinks about these issues in his interview on our blog here. ↩︎
Type designer Alexandre Bassi details the process in his guest post here. ↩︎
Julie’s project is described here. ↩︎
Read about Febri’s work here and Ariq’s here. ↩︎