Terminal Emulators Battle Royale – Unicode Version! · Articles

2023-12-20 05:57:22

It turns out that Unicode support in terminals is much more difficult than it
first appears. A quick overview of special support for Unicode characters in
terminals:

  • “Wide” or “Fullwidth” characters, notably for East Asian languages and
    emojis, are codepoints that occupy two cells in a terminal instead of one.
  • “Zero” width combining characters used in languages such as Arabic, Hebrew,
    or Hindi don’t occupy any cells themselves; instead, they modify the previous
    character.
  • “Zero Width Joiner” (ZWJ U+200D) reduces and combines many codepoints into
    a single emoji. This is similar to combining, but encoded in a completely
    different way.
  • “Variation Selector-16” (VS-16 U+FE0F) is a special character that, for
    specific “Narrow” emojis consuming one cell, causes them to become “Wide”,
    consuming two cells.

I share maintenance of the python wcwidth library, which is responsible for
determining the printable width of a string when displayed to a terminal. I
worked hard to close all open issues, adding support for VS-16, ZWJ, and several
bug fixes to the Zero-Width table definitions. I now believe it to be the most
accurate implementation.

Additionally, I authored a formal Specification detailing how characters should
be measured. Then, I updated the python ucs-detect tool to systematically assess
terminal emulators for their compliance with the specification.

Finally, I’ve published results for the most popular terminal emulators on
Linux, macOS, and Windows. This article is a summary of my findings.

Wide Character support

Across all Unicode capabilities tested, Wide character support is best. This is
likely attributed to the widespread adoption of emojis, which are treated as wide
characters, generating interest across developers and users of all languages.

While all tested terminals demonstrate support for wide characters, there are
differences in the Unicode versions they support. Notably, Konsole, iTerm2,
and Kovid Goyal’s kitty support wide characters up to Unicode release version
15.0.0 (2022). In contrast, Hyper and Visual Studio Code, both built on
xterm.js, provide support only up to Unicode release 12.1.0 (2019).

This means that, in those terminals, wide characters introduced after that
release take up one cell instead of two and are often occluded by the next
character.

/images/hyper-wide.png

Pictured here in Hyper terminal, the wcwidth developer tool
wcwidth-browser.py shows several Wide Emoji mistakenly displayed as Narrow
instead of Wide, due to out-of-date code tables in xterm.js, causing
some to be partially occluded by the Pipe character (|).

The wcwidth project Specification describes Wide characters as:

> Any character defined by East Asian Fullwidth (F)
> or Wide (W) properties in EastAsianWidth txt
> files, except those that are defined by the
> Category codes of Nonspacing Mark (Mn) and
> Spacing Mark (Mc).

The “except” clarification is needed, as there are several characters officially
categorized as Wide or Fullwidth that are given contradictory definitions of
Zero width by other data files!
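
As a rough illustration of this rule, here is a minimal sketch using only
Python's standard unicodedata module. The helper name is_wide is hypothetical
and this is not the wcwidth library's own code. Results depend on the Unicode
version bundled with the interpreter (see unicodedata.unidata_version), which
is the same kind of version skew described above for terminals.

```python
import unicodedata


def is_wide(ch):
    """Return True if ch should occupy two cells under the quoted rule.

    Hypothetical helper for illustration, not part of the wcwidth API.
    """
    # The "except" clause: combining marks are never Wide, even when
    # EastAsianWidth classifies them as W or F.
    if unicodedata.category(ch) in ("Mn", "Mc"):
        return False
    return unicodedata.east_asian_width(ch) in ("F", "W")


# U+3099 is listed as East Asian Wide, yet it is a Nonspacing Mark (Mn),
# so the exception applies to it.
for ch in ("\u4e2d", "\uff21", "a", "\u3099"):
    print(f"U+{ord(ch):04X}",
          unicodedata.east_asian_width(ch),
          unicodedata.category(ch),
          is_wide(ch))
```

U+3099 “Combining Katakana-Hiragana Voiced Sound Mark” is one of the
contradictory cases: Wide by one data file, Nonspacing Mark by another.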

The definition continues:

> Any characters of Modifier Symbol category,
> 'Sk' where 'FULLWIDTH' is present in comment
> of unicode data file, aprox. 3 characters.

The definition is further expanded to include any characters falling within the
Modifier Symbol ‘Sk’ category, specifically those with ‘FULLWIDTH’ mentioned in
the comment field of the Unicode data file, approximately three characters in
total.

This clause is crucial for a small set of characters from the modifier symbol
category that, while not officially designated as Fullwidth or Wide, indeed
exhibit these properties. Detecting these characters necessitates parsing the
comment field of the data files.
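
A minimal sketch of how such characters could be located with Python's
standard library follows. It makes one loud assumption: it uses the character
name from unicodedata as a stand-in for the comment field of the data file,
since that comment carries the character name. The function name is
hypothetical and this is not how the wcwidth project generates its tables.

```python
import sys
import unicodedata


def fullwidth_modifier_symbols():
    """List codepoints of category 'Sk' whose name contains 'FULLWIDTH'.

    Assumption: the character name stands in for the data file comment.
    """
    found = []
    for codepoint in range(sys.maxunicode + 1):
        ch = chr(codepoint)
        if unicodedata.category(ch) != "Sk":
            continue
        # Unnamed codepoints yield the empty string and never match.
        if "FULLWIDTH" in unicodedata.name(ch, ""):
            found.append(codepoint)
    return found


for codepoint in fullwidth_modifier_symbols():
    print(f"U+{codepoint:04X} {unicodedata.name(chr(codepoint))}")
```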

The “Modifier Symbol” category is a strange category. It is a set of combining
characters that don’t act as combining characters; they are for lone display,
except for the Emoji Modifier Fitzpatrick codepoints, which modify the
skin tone of the preceding Emoji in sequence, making them a kind of combining
character unlike all other characters of this category.

How difficult! The Unicode.org data files present contradictory
categorizations. It's no wonder that developers, even those that strive for full
compliance, can still encounter difficulty in accurately categorizing a small
percentage of characters.

Zero Width

Testing support for Zero Width characters poses a particular challenge. It is
possible to combine some combining characters with any other Unicode
character, like U+0309 “Combining Hook Above” with box drawing character
U+2532:

>          ┲̉

However, this is not the case for most combining characters, which can only
combine with specific characters. For instance, U+094D “Devanagari Sign Virama”
successfully combines with an appropriate Devanagari letter, like U+0915
“Devanagari Letter Ka”:

>           क्

However, it fails to combine with non-Devanagari letters, such as U+0061 “Latin Small Letter A”:

>           a्

The “dotted circle” depicted after “Latin Small Letter A” is used as a
placeholder for these illegal combinations.

/images/iterm2-combining-latin.png

Depicted here in iTerm2 are several combining characters after
U+006F “Latin Small Letter O”, where many
fail to combine, resulting in the display of a “dotted circle”.

To explore and visualize combining characters in a naive way, you can use the
developer tool wcwidth-browser.py from the wcwidth repository. Press ‘c’
after launch or use the CLI argument --combining. However, this tool serves
primarily to demonstrate that naive combining is not possible for a vast number
of characters.

A Rosetta Stone?

The Universal Declaration of Human Rights (UDHR) is a remarkable document
translated to over 500 languages. The UDHR Unicode project curates a collection
of these translations, offering a valuable resource for testing support of
Zero-Width characters.

Outside of Emoji, we really only care about whether any particular language is
supported, and for many languages, Zero-Width characters are necessary to
properly write them.

Using the ucs-detect tool to display words from UDHR in each language and
measuring the displayed width, we can conduct a comprehensive test for
Zero-Width character support of each Terminal by Language.

Zero Width Results

The Windows-only terminals, Terminal.exe, cmd.exe, and ConsoleZ,
as well as the cross-platform ExtraTermQt and for-pay commercial zoc
terminal all fail to correctly display many Zero-Width characters, failing
for about 100 of the world’s languages.

The common error of these terminals is that they count characters of category
codes Nonspacing Mark (Mn) and Spacing Mark (Mc) as Narrow instead of Zero width.

One example of the Hindi language from ConsoleZ, where U+093E
of the ‘Mc’ category is incorrectly measured as Narrow:

Codepoint  Python    Category  wcwidth  Name
U+092E     '\u092e'  Lo        1        DEVANAGARI LETTER MA
U+093E     '\u093e'  Mc        0        DEVANAGARI VOWEL SIGN AA
U+0928     '\u0928'  Lo        1        DEVANAGARI LETTER NA
U+0935     '\u0935'  Lo        1        DEVANAGARI LETTER VA

And another, of the Vietnamese language, from Microsoft’s Terminal.exe, where
U+0300 “Combining Grave Accent” of the ‘Mn’
category is incorrectly measured as Narrow:

Codepoint  Python    Category  wcwidth  Name
U+0074     't'       Ll        1        LATIN SMALL LETTER T
U+006F     'o'       Ll        1        LATIN SMALL LETTER O
U+0061     'a'       Ll        1        LATIN SMALL LETTER A
U+0300     '\u0300'  Mn        0        COMBINING GRAVE ACCENT
U+006E     'n'       Ll        1        LATIN SMALL LETTER N
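
The difference between the two measurements can be sketched with Python's
standard unicodedata module. This is a simplified illustration under stated
assumptions, not the wcwidth implementation: the function name measure and its
zero_categories parameter are hypothetical, and only the Mn and Mc categories
discussed here are treated as zero width.

```python
import unicodedata


def measure(text, zero_categories=("Mn", "Mc")):
    """Sum the cell width of text, one codepoint at a time.

    Hypothetical helper: codepoints whose category is listed in
    zero_categories count as 0 cells, East Asian F/W as 2, others as 1.
    """
    total = 0
    for ch in text:
        if unicodedata.category(ch) in zero_categories:
            continue
        total += 2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
    return total


hindi = "\u092e\u093e\u0928\u0935"   # the Hindi word from the first table
vietnamese = "toa\u0300n"            # the Vietnamese word from the second table

for word in (hindi, vietnamese):
    correct = measure(word)
    # Passing no zero-width categories mimics the error of the failing
    # terminals, which count Mn and Mc characters as Narrow.
    naive = measure(word, zero_categories=())
    print(f"{word!a}: correct={correct} naive={naive}")
```

In both words the naive measurement is one cell larger than the correct one,
which is the discrepancy shown in the tables.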

It is understandable that these category codes are not considered for Zero-Width
support by so many other wcwidth and terminal developers. Unicode.org documents
make only general statements about the purpose of these categories and they do
not make any direct statements about Terminal Emulators. Developers must then
look for answers among thousands of pages of documents that can be cryptic and
verbose. Without a search engine and a “hunch”, it would be very difficult to
discover naturally!

From Unicode Standard Annex #24, Unicode Script Property:

> Implementations that determine the boundaries
> between characters of given scripts should never
> break between a combining mark (a character with
> General_Category value of Mc, Mn or Me)

And, from Unicode Standard Annex #14, Unicode Line Breaking Algorithm:

> The CM line break class includes all combining
> characters with General_Category Mc, Me, and Mn,
> unless listed explicitly elsewhere. This includes
> viramas that don’t have line break class VI or VF.

Variation Selector-16

U+FE0F “Variation Selector-16” is peculiar.

I suspect it is some kind of “fixup” or compatibility sequence for the earliest
emojis. These emojis may be displayed in either “text” or “emoji” style, and
default to “text” style. In “text” style, emojis should appear without color in
a single cell (Narrow), while in “emoji” style, they should display in color and
occupy two cells (Wide).

Despite this distinction, very few fonts effectively differentiate between the two
styles, often rendering both kinds in color. When not in sequence with U+FE0F
“Variation Selector-16”, they are occluded by any subsequent character.

For example, U+23F1 “Stopwatch”:

/images/iterm2-stopwatch-without-vs16.png

Depicted here in iTerm2 is a single U+23F1
“Stopwatch” character partially occluded by the subsequent character. Surprisingly,
this is the correct behavior of a terminal when U+FE0F “Variation
Selector-16” is not in sequence.

From the python wcwidth Specification on Wide characters:

> Any character in sequence with `U+FE0F`_
> (Variation Selector 16) defined by Emoji
> Variation Sequences txt as ``emoji style``.

A list of such characters is found in emoji-variation-sequences.txt.
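
A minimal sketch of this rule in Python, using only the standard library,
might look like the following. The set EMOJI_STYLE_BASES is a one-entry
excerpt chosen for illustration, and the function names are hypothetical. A
real implementation would load the full list of “emoji style” sequences from
the data file.

```python
import unicodedata

VS16 = "\ufe0f"

# Illustrative excerpt only (an assumption of this sketch): U+23F1
# Stopwatch is one character that has an "emoji style" sequence.
EMOJI_STYLE_BASES = {"\u23f1"}


def base_width(ch):
    """Width of a single codepoint, ignoring any sequence rules."""
    if unicodedata.category(ch) in ("Mn", "Mc"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1


def width_with_vs16(text):
    """Sum cell widths, widening a Narrow emoji that is followed by VS-16."""
    total = 0
    previous = ""
    for ch in text:
        if ch == VS16:
            # VS-16 occupies no cell itself, but promotes a listed
            # Narrow base character from one cell to two.
            if previous in EMOJI_STYLE_BASES and base_width(previous) == 1:
                total += 1
        else:
            total += base_width(ch)
        previous = ch
    return total


print(width_with_vs16("\u23f1"))        # Stopwatch alone
print(width_with_vs16("\u23f1\ufe0f"))  # Stopwatch followed by VS-16
```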

VS-16 Results

Out of the 23 terminals tested, only 7 demonstrated correct behavior by
displaying these emojis as “Wide” characters when combined with VS-16 in sequence.

Remarkably, I found scarce documentation, if any, about VS-16 and its effects in
terminals. The absence of documentation on this topic was the primary motivation
for writing this article.

Wezterm, for example, excels in complying with all other Unicode specifications
outlined in this article and tested by ucs-detect. However, like 16 other
terminals tested, it falls short in supporting VS-16. These emojis are
consistently occluded by the next character, even when in sequence with VS-16.

/images/wezterm-vs16.png

Depicted here in Wezterm is U+23F1
“Stopwatch” followed in sequence by U+FE0F “Variation Selector-16”. However,
the stopwatch is displayed as Narrow. Wezterm does, however, do a good job of
scaling the font to fit within a single cell, while most other terminals cause
it to be partially occluded by any subsequent character.

Emoji ZWJ

U+200D “Zero Width Joiner” is a special character facilitating the reduction
of multiple emojis into a single representation that embodies their combination.
This feature resembles a special case of combining, but it is encoded in a
completely different manner.

The python wcwidth Specification on “Width of 0” reads:

> Any character following a ZWJ (U+200D) when
> in sequence by function wcwidth.wcswidth().

An instance of a terminal lacking ZWJ support is Kovid Goyal’s kitty. It is
important to note that this terminal should not be confused with KiTTY, another
terminal emulator sharing a similar name but predating it by 14 years.
Mr. Goyal expresses particular hostility about
this naming conflict.

Codepoint    Python        Category  wcwidth  Name
U+0001F9D1   '\U0001f9d1'  So        2        ADULT
U+200D       '\u200d'      Cf        0        ZERO WIDTH JOINER
U+0001F9BC   '\U0001f9bc'  So        2        MOTORIZED WHEELCHAIR
U+200D       '\u200d'      Cf        0        ZERO WIDTH JOINER
U+27A1       '\u27a1'      So        1        BLACK RIGHTWARDS ARROW
U+FE0F       '\ufe0f'      Mn        0        VARIATION SELECTOR-16

/images/kitty-zwj.png

In this kitty example, the depicted sequence is expected to measure a width of 2.
However, kitty measures it as 6 because it does not interpret the Zero Width
Joiner character to reduce the three wide characters into one.
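
The arithmetic behind the 2 and the 6 can be sketched in Python with the
standard library alone. This is an illustration under stated assumptions, not
the wcwidth or kitty code: the function names, the interpret_zwj flag, and
the two-entry VS16_BASES excerpt are hypothetical, and it assumes an
interpreter whose Unicode data is at least version 12.0 so that both emojis
are known as Wide.

```python
import unicodedata

ZWJ = "\u200d"
VS16 = "\ufe0f"

# Illustrative excerpt only (an assumption of this sketch): Narrow
# characters that become Wide when followed by VS-16.
VS16_BASES = {"\u23f1", "\u27a1"}


def cell_width(ch):
    """Width of a single codepoint, ignoring any sequence rules."""
    if unicodedata.category(ch) in ("Mn", "Mc"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1


def sequence_width(text, interpret_zwj=True):
    """Sum cell widths, applying the ZWJ and VS-16 rules quoted above."""
    total = 0
    previous, previous_width = "", 0
    skip_next = False
    for ch in text:
        if skip_next:
            # A character following a ZWJ contributes no width.
            skip_next = False
            width = 0
        elif ch == ZWJ:
            skip_next = interpret_zwj
            width = 0
        elif ch == VS16:
            # Widen the previous character only if it was counted as Narrow.
            if previous in VS16_BASES and previous_width == 1:
                total += 1
            width = 0
        else:
            width = cell_width(ch)
        total += width
        previous, previous_width = ch, width
    return total


sequence = "\U0001f9d1\u200d\U0001f9bc\u200d\u27a1\ufe0f"
print(sequence_width(sequence))                       # ZWJ interpreted
print(sequence_width(sequence, interpret_zwj=False))  # ZWJ ignored
```

With the joiner interpreted, only the first emoji contributes its two cells.
With it ignored, the two emojis and the widened arrow each contribute two,
which is the 6 measured by kitty.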
