Invisible Connectors: ZWJs and ZWNJs from Arabic to Emoji
For a long time, I have been working toward fluency in Arabic. With no one else in my household speaking the language, I’ve had to rely on online conversations with native speakers. Yet, without fluency, much of this correspondence happens in a liminal space between languages, otherwise referred to as ‘Arablish’.
In speech, Arablish may be fluid. In writing, however, the two scripts clash: English flows from left to right (LTR), its Latin letters distinct and neatly separated, while Arabic flows from right to left (RTL) in a cursive script whose letters can take up to four different forms (isolated, beginning, middle, and end). Typing text messages in Arablish often proves frustrating: punctuation jumps to unintended places and flips in unintended directions, and words show up in unintended orders, causing unintended meanings.
Because of experiences like this, I have long been curious about how the Arabic script can be typed with such fluidity and how it behaves in a bidirectional text environment. SEI gave me the opportunity to study the very control characters that make this possible, not only for RTL scripts like Hebrew and Arabic, but also for Indic scripts with their intricate conjunct formations, and even emoji, which use the same control characters to form unique combinations and support diversified emoji variants.
These characters are invisible to users, but they tell a machine how to process text accurately. What are these control characters? Namely, the Zero Width Joiner (ZWJ) and Zero Width Non-Joiner (ZWNJ), which are two of the most important yet understudied elements in the Unicode Standard.
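As a small illustration (not part of the original project), Python's standard `unicodedata` module can show how the Unicode Character Database describes these two code points. The helper name `describe` is invented for this example.

```python
import unicodedata

ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

def describe(ch: str) -> dict:
    """Return a few Unicode Character Database properties for one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),         # Cf = format character
        "bidi_class": unicodedata.bidirectional(ch),  # BN = boundary neutral
    }

for ch in (ZWJ, ZWNJ):
    print(describe(ch))

# The joiner takes up no width when rendered, yet it still counts as a character.
print(len("a" + ZWJ + "b"))  # 3
```

Both characters fall in the general category `Cf` (format), which is why they take up no space on screen while still being present in the underlying string.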
Background Readings
My project was guided by Anushah Hossain and assisted by Helena Kansa, the two of whom provided continuous feedback, thoughtful reflections, and exciting ideas to pursue throughout the summer.
I began in early July with lots of reading, working through a syllabus composed of case studies from three language families united by the use of ZWJ and ZWNJ. Our weekly meetings worked as any book club would: meeting at local coffee shops and discussing.
Though my notes were full of background information and any mention of ZWJ and ZWNJ I could find, our discussions were often filled with questions about the ethics of decision-making for invisible standards like Unicode. For instance, how do certain milestones in the development of Unicode affect different populations? How has corporate influence shaped script encoding? What does it mean for bias to be embedded in something that most users never see or think about? These questions sat alongside more script-specific topics.
Though I will briefly summarize how ZWJ and ZWNJ operate as technical tools, our conversations instead helped me frame these control characters as historical and social actors in the story of script encoding.
To understand their role, here is an overview of how they answer the unique challenges of different script families:
Bidirectional Scripts
Bidirectional (bidi) scripts require Unicode to support mixed-direction text. For example, if Arabic or Hebrew (RTL scripts) text is embedded with Latin words or numerical data (both LTR), Unicode’s algorithm must parse how to display the string.
The Unicode Bidirectional Algorithm (UBA) governs how such text is reordered to match reading expectations. In its general stages, UBA:
- Determines the base direction of the paragraph (typically from its first strong directional character, whether LTR or RTL), using each character’s directional type, or “bidi class”.
- Splits the text string into segments of consistent directionality called “clumps” or “runs”.
- Reorders those “clumps” according to rules that depend on the context the characters fall in. For example: numbers stay LTR even in RTL environments; a neutral character between two strong characters of the same directional type inherits that directionality; and when a space or punctuation mark falls between two strong characters of different directionality, the neutral character inherits the directionality of the base.
- Displays the reordered text to match users’ reading expectations.
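The first stage above can be sketched with Python's standard `unicodedata` module, which exposes each character's bidi class. This is only a rough illustration of the "first strong character" heuristic, not an implementation of the UBA: the function names `bidi_classes` and `base_direction` are invented here, and the sketch ignores details such as isolates and explicit embeddings.

```python
import unicodedata

def bidi_classes(text: str) -> list[str]:
    """Look up the bidi class of every character (e.g. L, R, AL, EN, WS)."""
    return [unicodedata.bidirectional(ch) for ch in text]

def base_direction(text: str, default: str = "LTR") -> str:
    """Guess the base direction from the first strong character.

    L is strong left-to-right; R (e.g. Hebrew) and AL (Arabic letters)
    are strong right-to-left. Digits, spaces, and punctuation are skipped.
    """
    for cls in bidi_classes(text):
        if cls == "L":
            return "LTR"
        if cls in ("R", "AL"):
            return "RTL"
    return default  # no strong character found

arabic_first = "\u0633\u0644\u0627\u0645 hello 42"   # Arabic word, then Latin and digits
latin_first = "hello \u05E9\u05DC\u05D5\u05DD"       # Latin word, then Hebrew

print(bidi_classes(arabic_first))
print(base_direction(arabic_first))  # RTL
print(base_direction(latin_first))   # LTR
```

The same string content can therefore be laid out differently depending on which strong character happens to come first, which is exactly the kind of surprise that mixed-script typing produces.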
Text handling and compatibility with legacy systems where LTR is the default is mostly aided not by ZWJ and ZWNJ, but by the use of implicit directional formatting, explicit directional embedding, and explicit directional isolate control characters. At one point in Xerox history, though, these control characters for directionality were combined with joiners and non-joiners (a thread we flagged from a hint that Unicode co-founder Mark Davis gave Anushah Hossain in an earlier interview).
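For reference, the directional control characters mentioned above can be listed with Python's `unicodedata` module. The grouping labels in `DIRECTIONAL_CONTROLS` and the helper `isolate` are my own naming for this sketch; the code points and names come from the Unicode Character Database.

```python
import unicodedata

# Grouping labels are informal; the characters themselves are standard.
DIRECTIONAL_CONTROLS = {
    "implicit marks": ["\u200E", "\u200F", "\u061C"],
    "explicit embeddings and overrides": ["\u202A", "\u202B", "\u202D", "\u202E", "\u202C"],
    "explicit isolates": ["\u2066", "\u2067", "\u2068", "\u2069"],
}

for group, chars in DIRECTIONAL_CONTROLS.items():
    print(group)
    for ch in chars:
        name = unicodedata.name(ch)
        bidi = unicodedata.bidirectional(ch)
        print(f"  U+{ord(ch):04X}  {name}  (bidi class {bidi})")

def isolate(text: str) -> str:
    """Wrap text in FIRST STRONG ISOLATE ... POP DIRECTIONAL ISOLATE."""
    return "\u2068" + text + "\u2069"

# Wrapping an embedded snippet keeps its direction from affecting its neighbors.
message = "Order " + isolate("\u0633\u0644\u0627\u0645") + " shipped"
print([f"U+{ord(c):04X}" for c in message if ord(c) > 0x2000])
```

Wrapping inserted text in isolates like this is one common way to keep a snippet of RTL text from reordering the punctuation and numbers around it.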
Beyond the UBA, bidi scripts often involve contextual letter shaping, particularly in Arabic, where letters take on different forms depending on their position in a word. The Arabic alphabet is split into the following primary join types:
| Type | Description | Examples |
|---|---|---|
| (R) right joining | glyphs that join on the right | … ,ز, ر, ذ, د, ا |
| (L) left joining | glyphs that join on the left | none in Arabic |
| (D) dual joining | glyphs that join on both sides | … ,ب, ت, ث, ج |
| (C) join causing | distinguished from (D) in that they do not change shape themselves | ZWJ, Arabic tatweel, … |
| (U) non joining | spacing characters (except those in the other categories), digits, punctuation, non-Arabic letters, and so on | ZWNJ, ١٢٣, abcs, … |
| (T) transparent | all nonspacing marks and most format control characters | tashkil, quranic annotation signs |
ZWJ forces a connection where cursive joining or ligature formation is possible while ZWNJ explicitly prevents it, making them essential for rendering the script digitally.
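The join types in the table can be turned into a small sketch of how a renderer might choose a letter's positional form. `JOINING_TYPE` is a hand-picked subset chosen for this example (the complete data is published in the Unicode Character Database file ArabicShaping.txt), and `positional_forms` is a hypothetical helper rather than a real shaping engine. Text is processed in logical (typing) order.

```python
ZWJ, ZWNJ = "\u200D", "\u200C"
ALEF, BEH, HEH, FATHA = "\u0627", "\u0628", "\u0647", "\u064E"

# Illustrative subset only; anything unlisted is treated as non-joining (U).
JOINING_TYPE = {
    ALEF: "R", "\u062F": "R", "\u0631": "R",   # alef, dal, reh
    BEH: "D", "\u062A": "D", HEH: "D",         # beh, teh, heh
    "\u0640": "C", ZWJ: "C",                   # tatweel, ZWJ
    ZWNJ: "U",
    FATHA: "T",                                # a tashkil mark
}

def joining_type(ch: str) -> str:
    return JOINING_TYPE.get(ch, "U")

def neighbor_type(text: str, i: int, step: int) -> str:
    """Join type of the nearest non-transparent neighbor, or U at the edge."""
    j = i + step
    while 0 <= j < len(text):
        t = joining_type(text[j])
        if t != "T":
            return t
        j += step
    return "U"

def positional_forms(text: str) -> list[str]:
    """Return the form chosen for each R or D letter, in logical order."""
    forms = []
    for i, ch in enumerate(text):
        t = joining_type(ch)
        if t not in ("R", "D"):
            continue  # controls, marks, and other characters have no form
        joins_prev = neighbor_type(text, i, -1) in ("D", "L", "C")
        joins_next = t == "D" and neighbor_type(text, i, +1) in ("R", "D", "C")
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms

print(positional_forms(BEH + HEH))         # ['initial', 'final']
print(positional_forms(BEH + ZWNJ + HEH))  # ['isolated', 'isolated']
print(positional_forms(ZWJ + BEH + ZWJ))   # ['medial']
```

The last two calls show the control characters at work: ZWNJ breaks a connection that would otherwise form, while ZWJ requests a joined form even when no neighboring letter is present.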

Indic Scripts
Indic scripts present a different set of challenges due to their syllabic structure and ligature-rich orthographies. Consonant clusters often merge into conjunct ligatures, which may be rendered as composite shapes despite having multiple underlying codepoints. To control the formation/suppression of these ligatures, Unicode uses ZWJ to request conjunct formation and ZWNJ to block conjunct formation.
Another key mechanism for managing conjuncts is the VIRAMA. In Indic scripts, each consonant comes with an inherent vowel. The VIRAMA is a diacritic mark, typically written below a consonant, that suppresses the consonant’s inherent vowel. Consonant clusters and their ligature forms rely on these “dead” (vowel-less) consonants. Therefore, while VIRAMA is not a control character, it enables conjunct formation by working alongside the ZWJ and ZWNJ.
In practice:
- <C1 + VIRAMA + C2> → triggers the ligated form of a conjunct from the font
- <C1 + VIRAMA + ZWJ + C2> → requests the half form of the first consonant in the cluster
- <C1 + ZWJ + VIRAMA + C2> → requests the subjoined/post-base form of the second consonant in the cluster
- <C1 + VIRAMA + ZWNJ + C2> → forces an overt virama and prevents conjunct joining
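The four patterns above can be built mechanically as code point sequences. The sketch below uses the Devanagari pair KA and SSA purely as an example of my own choosing; whether a given sequence actually produces the described shape depends on the script and on the font, so the third pattern in particular is shown only to illustrate the ordering of code points.

```python
import unicodedata

KA, SSA = "\u0915", "\u0937"   # DEVANAGARI LETTER KA, DEVANAGARI LETTER SSA
VIRAMA = "\u094D"              # DEVANAGARI SIGN VIRAMA
ZWJ, ZWNJ = "\u200D", "\u200C"

SEQUENCES = {
    "conjunct ligature": KA + VIRAMA + SSA,
    "half form of C1": KA + VIRAMA + ZWJ + SSA,
    "subjoined/post-base form of C2": KA + ZWJ + VIRAMA + SSA,
    "visible virama, no conjunct": KA + VIRAMA + ZWNJ + SSA,
}

def code_points(text: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

for label, seq in SEQUENCES.items():
    # How the final column looks depends on the font used to display it.
    print(f"{label:32} {code_points(seq):28} {seq}")

print(unicodedata.name(VIRAMA))
```

Printing the code points alongside the rendered text makes the invisible joiners visible, which is useful when two strings look different on screen but seem identical in an editor.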

Emoji
One of the most unique opportunities during my summer was the chance to sit in on an official Emoji Standard and Research Working Group meeting—a group within the Unicode Consortium whose existence I was unaware of until working with SEI.
While scrolling through the meeting agenda, I noticed a “you are here” marker that was translated into six scripts. Among these were Latin and Cuneiform, while even Spanish was missing. Having studied both Latin and Cuneiform, I felt very at home in this script encoding community.
The agenda itself actually proved very useful to my research. Aside from discussing the new emojis and their edits (very exciting for me to see!), the Emoji Working Group discussed issues with interpretability between different devices (e.g. how Samsung and WhatsApp emojis differ from, say, Apple and Microsoft ones). One area of relevance was the direction of the emoji for users (the examples they used being how the teapot emoji or pickup truck face one way or the other). This lends itself to ZWJs and ZWNJs when considered in bidirectional environments, because emojis can be assigned strong, weak, and neutral directionality. This is described in L2/23-030 as follows:
- Semantic Movement: Encodes semantic movement (e.g. 🏃♂️, 🦅, 🎢, 🚁)
- Agentive Directionality: Encodes semantics involving transitivity. The direction of the emoji does not express movement, but can affect its own meaning or the meaning of its surrounding linguistic context when directionality is accounted for (e.g. 🗝️, 📣, 🎥, 🔭)
- Neutral: Directionality has no effect on meaning (e.g. 😃, 👄, 🗑️, 📍)
These are then rendered using the aforementioned UBA, though the categories must work in a complementary way, since reordering an emoji sequence risks expressing an alternate meaning. L2/22-275 gives the example:
| Directionality | Sequence | Meaning |
|---|---|---|
| LTR | 🏃💨⚠️🚗🚗🚗 | Quickly running away from a line of cars |
| RTL | 🚗🚗🚗⚠️💨🏃 | Warning to not run behind car fumes |
Here, ZWJ plays a role because the direction an emoji faces is expressed by joining it to an arrow, using the combinations: → + ZWJ and ← + ZWJ.
While none of the emojis revealed during this meeting called for the use of ZWJ sequences, this control character remains at the heart of emoji composition with the ability to combine multiple emoji code points into a single glyph. ZWJ sequences enable combinations of atomic emoji to form new sequences (such as family units, skin tone variants and hair types, gendered forms with professions, and other less formulaic combinations), allowing for more diverse and creative emojis without inflating the Unicode character count.
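To make the composition concrete, here is a minimal Python sketch that assembles a few well-known ZWJ sequences from their atomic code points and splits them back apart. The helper names `zwj_sequence` and `components` are invented for this example; how each sequence is drawn (one glyph or several) depends on the platform's emoji font.

```python
ZWJ = "\u200D"
VS16 = "\uFE0F"  # variation selector requesting emoji presentation

def zwj_sequence(*parts: str) -> str:
    """Join atomic emoji with ZWJ to request a single combined glyph."""
    return ZWJ.join(parts)

def components(sequence: str) -> list[str]:
    """Split a ZWJ sequence back into its atomic parts."""
    return sequence.split(ZWJ)

WOMAN, MAN, GIRL, BOY = "\U0001F469", "\U0001F468", "\U0001F467", "\U0001F466"
LAPTOP, RUNNER, MALE_SIGN = "\U0001F4BB", "\U0001F3C3", "\u2642"

woman_technologist = zwj_sequence(WOMAN, LAPTOP)
family = zwj_sequence(MAN, WOMAN, GIRL, BOY)
man_running = zwj_sequence(RUNNER, MALE_SIGN + VS16)

for seq in (woman_technologist, family, man_running):
    hexes = " ".join(f"{ord(ch):04X}" for ch in seq)
    print(f"{seq}  parts={len(components(seq))}  code points: {hexes}")

# One visible glyph on supporting platforms, but seven code points underneath.
print(len(family))  # 7
```

On a platform without a glyph for a given sequence, the parts are typically shown side by side instead, which is what lets new combinations degrade gracefully.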

Archival Search
Working through my syllabus of readings helped us form a timeline for ZWJs and ZWNJs: we followed them from their origin in Arabic, shepherded into Unicode at its founding through Xerox’s influence; to their expansion into Indic scripts, which got more attention in the 2000s and served as a testing ground for ZWJ and ZWNJ’s capabilities; and finally to their role in emoji, a newer phenomenon for Unicode in the 2010s, with ZWJs unexpectedly becoming a large part of how these graphic characters are put together.
As we worked through case studies, we noticed certain storylines unfolding for different script communities facing complications that needed to be solved by control characters. Though all our readings contained outlines of the substantial progress made for appropriate digital script representations, we couldn’t help but acknowledge the remaining work to be done.
Our team regrouped at SEI’s headquarters in Berkeley to come up with leading narratives to trace when we embarked on the main event of the summer: an archival research visit to Stanford’s Unicode Special Collections. This included following the notes of Unicode innovators Joe Becker and Mark Davis about ZWJ’s function splitting off from bidi usage in the 80s and the formation of the bidirectional algorithm in the 90s. We also traced how the fates of bidi, Indic, and emoji scripts became intertwined because of ZWJ and ZWNJ despite each group using them in different ways. We requested the relevant boxes to pull from the warehouse, and by the following week, they were ready.
On a Tuesday morning, we set out through the rolling hills of 280, from fog into sun, until arriving at Stanford’s Cecil H. Green Library. The library itself felt like a testing environment: it was quiet, we were not allowed to bring in any external items (even notebooks and pens; we instead had to use their provided pencils and paper), and we had to sit apart from each other. But these rules keep their special collections pristine, safe, and organized.
I was blown away by the vast amount of primary source documents in their library, ranging from emails and newspaper clippings to hand-written notes, minutes of meetings, and many submitted proposals. Our team flipped past birth certificates, Byzantine musical notation documents, harsh opinions on the ‘silliness’ of encoding Klingon, and maps of Tokyo to guide people traveling for international meetings. Truly, I could spend months in that room and still be excited. But we only had two days, so we got scanning…with stops at Coupa Cafe, the Stanford favorite, of course.
After Wednesday evening, we had acquired many new primary sources to organize and read through, setting this project up to trace the story of these control characters in more detail with the real back and forth between Unicode committees and other organizations. This trip solidified the idea that the Unicode Standard is an artifact—not just an instrument—molded by debates and compromises that most users will never see.
Conclusion
I was a bit nervous when starting with SEI as I had never studied language at such a graphemic level. However, I felt tremendously supported throughout the summer and found the research nothing less than fascinating—both in the details of how these writing systems behave and what goes into encoding them. I mean this not only on a technological level, but also from a historian’s standpoint: what were the iterations of this script’s encoding? What sorts of feedback did they get, and from whom? What were the ramifications of the early iterations on the present model and population it serves? These questions were emphasized in the reading phase, but were even harder to ignore in the archives when flipping through primary sources of the very documents shaping this story.
Before starting, I had very little knowledge of ZWJs and ZWNJs, since there is no single place that specifically traces their history and function. My hope is that this project will help set the foundations for a better understanding of ZWJs and ZWNJs and how they have been used, are used, and can be used for script encoding. ZWJs are all about making more with less and giving users the power to compose what isn’t necessarily in Unicode. Tracing their story reveals how a few invisible (or zero-width) code points can alter global communication.
Bryce Hoenigman interned with the Script Encoding Initiative in Summer 2025, where she traced the history and evolution of control characters through archival research. She is currently studying Linguistics at the University of Chicago and has a background in Semitic languages. Her digital interaction with the Arabic script and Akkadian cuneiform first piqued her interest in Unicode. Now, she is eager to explore the encoding challenges and technologies that come with how these scripts are developed and used today.
