Arabic status revisited.


Arabic is a somewhat complex language when it comes to computer processing. It requires certain capabilities from the rendering and processing engines.
I'd consider two things to be the most important: BiDi and shaping. Shaping deals with the different forms of a letter according to its position in the word, while BiDi deals with the arrangement of embedded Arabic and English words.

Most Arabic letters change their shape according to their position in the word, so we have initial, medial, final and isolated forms.
Take the letter "LAM" (0x644), for example. It has 4 shapes: 0xFEDD (isolated), 0xFEDE (final), 0xFEDF (initial) and 0xFEE0 (medial). Some letters have fewer; alef (0x627), for example, has only isolated and final forms.
Theoretically speaking, we have 28 letters, although the Unicode standard defines more characters in the Arabic range. I'm not sure whether they are all used for Farsi, since some of the letters there are for Farsi and, I guess, other languages too.

It also defines presentation forms. These describe how a letter will look depending on its position in the word.
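As a small illustration of the LAM shapes listed earlier, Python's standard `unicodedata` module records which base letter each presentation form is a positional variant of. This is only a lookup sketch, not a shaping engine; the helper name `describe_form` is made up for the example.

```python
import unicodedata

# The four presentation forms of LAM (U+0644) mentioned in the text.
LAM_FORMS = ["\uFEDD", "\uFEDE", "\uFEDF", "\uFEE0"]

def describe_form(ch):
    """Return (position, base letter code point) for a presentation form."""
    # For these characters the decomposition looks like "<isolated> 0644".
    tag, base = unicodedata.decomposition(ch).split()
    return tag.strip("<>"), int(base, 16)

for ch in LAM_FORMS:
    position, base = describe_form(ch)
    print(f"U+{ord(ch):04X} is the {position} form of U+{base:04X}")
```

The helper only handles forms that map back to a single base letter; ligatures such as lam-alef decompose to two letters and would need separate handling.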

BiDi is described in Unicode Standard Annex #9. It handles embedding Arabic and other right-to-left languages within left-to-right languages (English, for example). The rendering backend must be able to resolve this situation and reorder the text segments to obtain a visually correct order.
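To give a feel for what "resolving" means, here is a deliberately simplified sketch that splits a string into directional runs using the bidirectional class of each character. It is not the full algorithm from Annex #9: neutral characters such as spaces are simply attached to the preceding run, and no reordering is done. The function names are made up for the example.

```python
import unicodedata

def strong_direction(ch):
    """Return "LTR" or "RTL" for strong characters, None for everything else."""
    cls = unicodedata.bidirectional(ch)
    if cls == "L":
        return "LTR"
    if cls in ("R", "AL"):  # AL is the class used by Arabic letters
        return "RTL"
    return None

def directional_runs(text):
    """Split text into (direction, substring) runs.

    Simplification: neutrals inherit the direction of the run before them,
    and text that starts with neutrals is treated as LTR.
    """
    runs = []
    current = "LTR"
    for ch in text:
        current = strong_direction(ch) or current
        if runs and runs[-1][0] == current:
            runs[-1][1] += ch
        else:
            runs.append([current, ch])
    return [(direction, chunk) for direction, chunk in runs]

for direction, chunk in directional_runs("abc \u0645\u0631\u062D xyz"):
    print(direction, repr(chunk))
```

A real backend then reorders these runs for display and resolves neutrals by looking at both neighbours, which is where most of the complexity lives.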

The rendering backend should be aware of both of these to be able to display Arabic correctly.
Arabic letters are stored in text files in the isolated form. The rendering backend must then "interpret" the letters and join them correctly, otherwise we'll have non-joined letters.
There is another thing the rendering system has to take care of: diacritics, or what we call in Arabic "tashkeel", or the accents ;-)
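From the text's point of view, the tashkeel are separate combining characters that follow the letter they belong to. A minimal sketch, assuming the Unicode general category `Mn` (nonspacing mark) is a good enough test for "is a diacritic"; the sample word and the helper name are only for illustration.

```python
import unicodedata

# "kataba" with full tashkeel: KAF, FATHA, TEH, FATHA, BEH, FATHA.
ACCENTED = "\u0643\u064E\u062A\u064E\u0628\u064E"

def split_marks(text):
    """Separate base letters from combining marks (category Mn)."""
    letters = [c for c in text if unicodedata.category(c) != "Mn"]
    marks = [c for c in text if unicodedata.category(c) == "Mn"]
    return letters, marks

letters, marks = split_marks(ACCENTED)
print(len(ACCENTED), "code points:", len(letters), "letters,", len(marks), "marks")
```

The renderer has to position each mark above or below the shaped letter before it, which is why accent support is a separate job from joining.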

I'll be basing all my points on GTK+/Pango because I'm familiar with them. Pango is a text layout and rendering engine. Without Pango, we wouldn't have had any Arabic in GNOME ;-)

Now I'd like to say a few words regarding diacritics since I won't be talking about them again (although this is more of a cultural issue):
A few months ago I was talking with a cool British guy who was trying to teach himself Arabic by means of flash cards. He discovered that the words, even in the Arabeyes wordlist, are NOT accented, so he would have problems pronouncing them. I don't write the accents myself, so I can't blame anyone.

So I can say that we have the input, the output and some tools for the Arabic user. The input is mainly the keyboard, although it can be an OCR application or a speech recognition application. The output is mainly the monitor, a text-to-speech application or a printer. The tools are mainly the applications we use daily.

A GNU/Linux system is composed of several components working together.
The desktop is built on the X server, or the X Window System, which is the layer responsible for interfacing with the hardware. It gives you a plain desktop on which the desktop environment puts the background, the icons, a panel, and so on.
You open applications (windows), and the position of these windows is controlled by the window manager.

With an open system like GNU/Linux you'll find yourself with multiple window managers and multiple desktop environments to choose from, and you can even assemble your own desktop from several components.
The most popular desktop environments are KDE and GNOME.

KDE is written in C++ using the Qt toolkit, while GNOME is written in C using the GTK+ toolkit.

A few years ago the situation of Arabic was really bad. We had Arabic only on the console, plus a single closed-source text editor called axmedit, which was blue colored :-) and written in Motif (an older GUI toolkit). Mozilla's Arabic support was yet to happen, and Konqueror's support was not so good.

Some time later GTK+ hit version 2.0 and Qt hit 3.0, and both brought good Arabic support. Some small problems remained, but they are almost all solved by now. One of them was GTK+ not supporting the letter accents. This remained a problem for a long time, but it has been solved.

I'll be talking mainly about GTK+ since I'm more familiar with it.

As we said before, we must have a rendering backend. The rendering backend draws strings on the toolkit widgets. It might be incorporated into the toolkit, as in Qt, or separated into another library, as in GTK+, which uses Pango as the rendering backend. Since the X layer sits below the GUI toolkits, X provides functions to draw strings too. I'll be addressing this later.
GTK+ uses UTF-8 internally to represent strings. UTF-8 is one of the Unicode Transformation Formats, and it is a multi-byte encoding.
Now let's try to explain this.
How do you map a certain character stored in a file to the corresponding character in the font? Characters are stored in files as bytes. If a character is one byte, and a byte is 8 bits, we can't have more than 2^8 = 256 characters. That might be enough to represent a language or two at the same time, but it is not enough to represent all the languages at the same time. Thus we had something called the encoding: one font with the Arabic letters, another one with the Hebrew letters, a third one with the Greek letters and so on, with only one usable at a time to represent a given byte. This, simply put, is the encoding.
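The one-byte limitation is easy to see with Python's built-in codecs: the same byte means a different character depending on which 8-bit encoding you assume. The three codecs below were picked only as examples (Western European, Greek and Windows Arabic), and the helper name is made up.

```python
# One raw byte, as it might sit in a legacy text file.
RAW = bytes([0xC7])

# Three single-byte encodings chosen for illustration.
CODECS = ["latin_1", "iso8859_7", "cp1256"]

def decode_all(raw, codecs=CODECS):
    """Decode the same bytes under several single-byte encodings."""
    return {name: raw.decode(name) for name in codecs}

for name, ch in decode_all(RAW).items():
    print(f"{name}: byte 0xC7 -> U+{ord(ch):04X} {ch}")
```

Without knowing the encoding, there is no way to tell from the byte alone whether the author meant a Latin, a Greek or an Arabic letter.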

With the Unicode standard we now have enough space to represent all the languages with one encoding.

UTF-8 is one of the Unicode representations. A character might take 1, 2, 3 or 4 bytes. Arabic (0x06XX) falls into the 2-byte segment.
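A quick way to check the byte counts is to encode a few characters and measure the result. The sample characters other than LAM are arbitrary picks, one for each length.

```python
# One sample character per UTF-8 length; only LAM comes from the text.
SAMPLES = {
    "LATIN SMALL LETTER A": "a",
    "ARABIC LETTER LAM": "\u0644",
    "EURO SIGN": "\u20AC",
    "GRINNING FACE": "\U0001F600",
}

def utf8_length(ch):
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

for name, ch in SAMPLES.items():
    encoded = ch.encode("utf-8")
    print(f"{name}: U+{ord(ch):04X} -> {encoded.hex(' ')} ({utf8_length(ch)} bytes)")
```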

So now we have toolkits capable of rendering Arabic and applying BiDi and shaping. Not all toolkits can do this, but I'm talking about the major two.

BiDi can get very complex when we have an Arabic string in which we embed an English string, and then embed another Arabic string into the English string. Here come the Unicode control characters. They are used to help the rendering backend resolve the BiDi correctly. We have no keys on the keyboard to input them, but GTK+ text widgets have a right-click context menu that allows inserting them. AFAIK this is not present in Qt ATM.
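For reference, a few of these control characters can be inspected from Python. The list is a selection, not the complete set, and `embed_ltr` is a hypothetical helper showing how an embedded left-to-right phrase could be wrapped explicitly.

```python
import unicodedata

# A selection of BiDi control characters (not the complete set).
CONTROLS = ["\u200E", "\u200F", "\u202A", "\u202B", "\u202C"]

def describe(ch):
    """Return the Unicode name and bidirectional class of a character."""
    return unicodedata.name(ch), unicodedata.bidirectional(ch)

def embed_ltr(text):
    """Wrap text in LEFT-TO-RIGHT EMBEDDING ... POP DIRECTIONAL FORMATTING."""
    return "\u202A" + text + "\u202C"

for ch in CONTROLS:
    name, cls = describe(ch)
    print(f"U+{ord(ch):04X} {name} (class {cls})")

# The wrapped string looks the same on screen; the controls are invisible.
print(repr(embed_ltr("GTK+")))
```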
I think this is a quick overview of the current state of the desktop. Now let's talk about the problems.

1) No standard on how to normalize Arabic text. Stripping the diacritics or the kashida (Arabic tatweel, 0x640) is possible, but what do you do when you are searching some fully accented text for a partially accented word? How do you normalize letters like 0x622 and 0x623? Do you convert them to 0x627, or what? It becomes more complicated when you consider the huge amount of text throughout the web. If the search engine doesn't take normalization into account, you'll miss a lot of results unless you search with each form of the letter. It's even more problematic when it comes to text processing, and more so when you think about database servers like MySQL. It can be done using a fuzzy search, but a fuzzy search can't really be optimized. Think about wikis creating wiki links: how would a wiki engine know that 2 words starting with 2 forms of the letter alef are the same?
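To make point 1 concrete, here is one possible normalization policy, sketched in Python. Since there is no standard, every choice here is an assumption: stripping U+064B to U+0652, dropping the tatweel, and folding 0x622, 0x623 and also 0x625 into the bare alef 0x627. A search engine or wiki could apply it to both the indexed text and the query.

```python
import re

# Assumed policy: the basic tashkeel range, fathatan (U+064B) to sukun (U+0652).
TASHKEEL = re.compile("[\u064B-\u0652]")
TATWEEL = "\u0640"

# Assumed policy: fold alef with madda / hamza above / hamza below to bare alef.
ALEF_VARIANTS = str.maketrans({
    "\u0622": "\u0627",
    "\u0623": "\u0627",
    "\u0625": "\u0627",
})

def normalize_for_search(text):
    """Return a search key: no tashkeel, no tatweel, one form of alef."""
    text = TASHKEEL.sub("", text)
    text = text.replace(TATWEEL, "")
    return text.translate(ALEF_VARIANTS)

# A fully accented name and its plain spelling give the same key.
accented = "\u0623\u064E\u062D\u0652\u0645\u064E\u062F"
plain = "\u0627\u062D\u0645\u062F"
print(normalize_for_search(accented) == normalize_for_search(plain))
```

Folding like this is lossy, so it only suits matching; the original text still has to be stored as written.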

2) Letter accents. They are not used, which leaves non-native Arabic speakers unable to pronounce the words correctly, but this is not really a problem with the software ;-)

3) The lam-alef problem.
Lam (0x644) and alef (0x627/0x623/0x625) are combined to form a third glyph, which does not have a code point in the standard Unicode Arabic range (you can obtain these glyphs via Shift+T, Shift+G, Shift+B). They are defined as presentation forms, which is not bad by itself. The problem is that a few years ago the XFree guys refused to bind one keyboard key to 2 letters. As a workaround, someone used the presentation forms in the keyboard layout. This is not bad by itself either, but iconv complains about those characters with an error (iconv: illegal input sequence at position XX) whenever you try to convert them to, say, the Windows-1256 encoding. No one has tried talking to the xorg guys yet.
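The conversion failure in point 3 can be reproduced with Python's own cp1256 codec (used here as a stand-in for iconv, which is an assumption about equivalent behaviour). One possible workaround, sketched below, is to apply Unicode compatibility normalization (NFKC) first, which turns the lam-alef presentation form back into the two plain letters. Note that NFKC also rewrites other compatibility characters, so it is not a neutral step.

```python
import unicodedata

# ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM, a presentation form.
LAM_ALEF = "\uFEFB"

def to_cp1256(text):
    """Decompose presentation forms with NFKC, then encode as Windows-1256."""
    return unicodedata.normalize("NFKC", text).encode("cp1256")

try:
    LAM_ALEF.encode("cp1256")
except UnicodeEncodeError as err:
    print("direct conversion fails:", err.reason)

decomposed = unicodedata.normalize("NFKC", LAM_ALEF)
print([f"U+{ord(c):04X}" for c in decomposed])
print(len(to_cp1256(LAM_ALEF)), "bytes after normalization")
```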

4) Lately, we discovered that some keys produce different characters than they are supposed to (the key with the 2 curly braces, for example). Some sweet guy created a patch, but no one has bothered to submit it yet.

5) Automatic translation. I really have no idea about that. I know that Google has a beta Arabic <-> English translation service, but I never had enough time to try it.

6) OCR and voice recognition. Same as above, with the additional point that no Arabic website follows the accessibility guidelines, so I'm not really sure. A member of the Arabeyes community is working on an application called "Siragi", which is supposed to be an Arabic OCR, but it's still in the early stages (he said that gocr won't work for Arabic). I have no experience with such things, so I can't tell whether he's correct or not.

7) We have no good free (as in free speech) font so far. The KACST fonts are fine, as are the fonts from the Arabeyes "khotot" project, but the English glyphs in those fonts are bad, and sometimes the English and Arabic glyph sizes are not proportional. We might get a good font by merging them with the Bitstream Vera font, for example, but I don't know anyone out there with good knowledge of how to do this.

8) No central library implementing shaping, so each application has to implement its own shaping algorithm. The CVS version of fribidi has the shaping code, but it's not released yet. I personally tried to have a look and see what's missing, but I failed. I felt that the code is a bit over-engineered, and the API itself is not really a friendly one. Behdad sent a call for API proposals, but I didn't know about that until it was implemented ;-)
The absence of a fribidi release means that any application not using GTK+ or Qt must be manually patched to support Arabic, and we'll end up with separate shaping code embedded in each application. That makes bugs hard to fix, unlike the case where a library implements it. Besides, patching each application is tedious.
Pango has its own shaping algorithm, as do Qt, OpenOffice and Mozilla.
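To show the kind of code each application ends up carrying, here is a toy shaper in Python. It is not how fribidi, Pango or Qt do it. It rests on two assumptions: the table of forms is derived from the Unicode decompositions of the presentation forms, and a letter is treated as joining to the following letter only if it has an initial form. Diacritics and the lam-alef ligature are ignored entirely.

```python
import unicodedata

def build_forms():
    """Map each base letter to its {position: presentation form} table."""
    forms = {}
    for cp in range(0xFE70, 0xFEFD):  # Arabic Presentation Forms-B
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) != 2:  # skip unassigned code points and ligatures
            continue
        position = parts[0].strip("<>")
        base = chr(int(parts[1], 16))
        forms.setdefault(base, {})[position] = chr(cp)
    return forms

FORMS = build_forms()

def shape(word, forms=FORMS):
    """Replace each Arabic letter with the form matching its neighbours."""
    shaped = []
    for i, ch in enumerate(word):
        if ch not in forms:
            shaped.append(ch)  # not a letter we know how to shape
            continue
        prev = word[i - 1] if i > 0 else ""
        nxt = word[i + 1] if i + 1 < len(word) else ""
        # The previous letter must join forward, and this one backward.
        joins_prev = "final" in forms[ch] and "initial" in forms.get(prev, {})
        # This letter must join forward, and the next one backward.
        joins_next = "initial" in forms[ch] and "final" in forms.get(nxt, {})
        if joins_prev and joins_next:
            position = "medial"
        elif joins_prev:
            position = "final"
        elif joins_next:
            position = "initial"
        else:
            position = "isolated"
        shaped.append(forms[ch][position])
    return "".join(shaped)

# LAM + BEH + MEEM becomes initial, medial and final forms.
print([f"U+{ord(c):04X}" for c in shape("\u0644\u0628\u0645")])
```

Even this toy needs a joining table and neighbour rules, so a bug fixed in one copy does not help the other applications.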

9) Even when fribidi is released, we must patch the applications manually. I had to do this for 2 applications in 1 week. I then thought: what if the shaping code were in the X library itself? Or in the Xft library? The problem is that I'm not sure this is the best approach, and Pango, Qt, OpenOffice, Mozilla and any application implementing its own BiDi would break. This would be really bad. It's like cutting off your hand to get rid of the pain in your finger.

10) Printing plain text files from the command line is broken. A guy from Arabeyes tried to fix it. He wrote a patch, but he ran into problems with PostScript fonts. No one was able to help him, as we don't have PostScript knowledge, and the project is dead.

11) Copy and paste generally works fine between all the GTK+ applications, including Firefox, though it needs further testing.
However, it is broken between KDE applications and GTK+ applications (especially Konqueror and OpenOffice).

12) The spell checker. At this moment we have 3 aspell dictionaries. The first 2 are produced using the Tim Buckwalter data.
One of them is being developed by Google but is not yet released. The other one is by Dan of hspell (the Hebrew spell checker). Neither knew about the existence of the other, and they'll probably merge.
The third one is being developed by me. It is not based on the Buckwalter data, and it's still small.
I'm not really going to state why I feel they are going in the wrong direction.
Another problem is that when Dan loaded the spell checker data in OpenOffice, it took 200 MB of RAM, which is really huge. I didn't test it myself, and probably the guys from Google didn't either. My list probably won't take much RAM, as it's still small. I don't know yet whether they used myspell or hunspell, as I only learned about this today. I also have no idea about the diacritics.

13) Some applications, like gedit (the standard GNOME editor), hardcode the supported languages in the application. Now we have an Arabic aspell dictionary, but we can't use it with gedit. I wrote a patch to add Arabic (Egypt). It's now in CVS, but we need to wait for the next stable GNOME release to use it.
I don't know yet about other applications behaving like this. I didn't test much.

14) Some applications, like gaim, don't have a way to switch the spell checker language. I'm not really sure how it detects the language, maybe from the locale, but I haven't had a look at the code yet. I have an en_US locale, but sometimes I like to type in Arabic, and that won't be spell checked. Most of the GUI applications use aspell as the backend. AFAICT, aspell can't handle more than one language at a time (not really sure).

15) Software doesn't get a lot of testing when it comes to Arabic support, and those testing or patching from the Arabic community don't really use Arabic extensively.

16) Translation. I think that the process has stalled. I don't really know why, although I have a theory.

17) GTK+ had an API to enforce the direction of the text in the text editing widget. They removed this call and are auto-aligning the text now. Editing HTML pages, for example, sometimes becomes annoying: the direction is set to LTR because the tags are strong characters, even when the content itself is Arabic.
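The HTML annoyance in point 17 is what you would expect if the auto-detection follows the "first strong character" rule from Unicode Annex #9. That the widget works exactly this way is an assumption on my part; the sketch below only shows the rule itself, with a made-up function name.

```python
import unicodedata

def base_direction(text):
    """Guess paragraph direction from the first strong character."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "LTR"
        if cls in ("R", "AL"):
            return "RTL"
    return "LTR"  # assumed default when there is no strong character

# "<" is neutral, but the "p" of the tag is a strong LTR character,
# so the line is treated as LTR although its content is Arabic.
print(base_direction("<p>\u0645\u0631\u062D\u0628\u0627</p>"))
print(base_direction("\u0645\u0631\u062D\u0628\u0627 <b>"))
```

Starting the line with a RIGHT-TO-LEFT MARK (U+200F) flips the result under this rule, which is one way to use the control characters mentioned earlier.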

18) Mozilla problems with BiDi. I'm not a Mozilla user myself, so the best thing I can do is to point you to this page, which lists some Mozilla bugs (among other bugs).

19) Mozilla problems with joining text. Mozilla has been working fine with accented letters since the introduction of Pango support. However, it sometimes fails with some websites.

20) Arabic in the terminal.
We have 2 options:
* mlterm (Multilingual terminal). It renders Arabic (among other languages) fine.
* BiCon from Arabeyes. It should offer Arabic support on any terminal emulator. I haven't tried it.

21) Quran + Unicode. I'm no expert in this subject, but it looks like the Quran can't be fully encoded using Unicode. A few discussions took place on Arabeyes, but I don't know what the status is.

22) A cultural issue. Not much Arabic content is available, and the available content is mainly Islamic or old books. They are either locked somehow like: or available as scanned PDF files, which are not indexable or searchable.

In my opinion this is a quick overview of the state of the Arabic language in general, based on my poor experience!
