It all started with a question from Amr Gharbeia:
"Why are we working on an Arabic spell checker? Why not a wordlist for aspell?"
Oh, that's a simple question, but it changed a lot of things.
The answer could be: because aspell won't work with Arabic.
This triggered another question: Who said so? Did you test?
Mo. Elzubeir started the Duali project, so I assumed he had done the testing and discovered that aspell wouldn't work.
But did I do the testing? No.
Too bad I wasted months doing something because someone didn't do the testing. I should've done my part of the testing too.
So it's been four years without a spell checker just because some people are lame? Too bad ;-)
Also, given that Dwayne of translate.org.za suggested we use the word list approach and said that it would work, we can see that our approach was wrong.
Now Baghdad can continue, but not as a spell checker. It can be a grammar checker, an engine that understands the sentence, or anything else, but not a spell checker.
For Baghdad to be like that, we need a dataset in a specific format, and I'm sure no one will help generate it. All those voices out there crying about the absence of an Arabic spell checker didn't (and won't) move or help generate the word list. I can code, but the thing I can't do is the word list.
Now that I have an Arabic wordlist generated from the words of the Holy Quran, and given that it worked like a charm, I can say that all we need is the word list.
Too bad I assumed that aspell wouldn't work for Arabic.
Now for the wordlist we want, I'd say that:
* It must come from modern Arabic as it is used today. Arabic is full of ancient words; if you know them, then you don't need a spell checker :-)
* It must be correct.
That's why I still object to using the Buckwalter dataset: we don't know whether it's 100% correct, and we still don't know how many ancient words it contains.
For the same reasons, I didn't really release the list generated from the Quran. I'm sure it's correct, but the Quran is special: some words are written in a different way, and I don't know which ones, nor do I have the time or knowledge to proofread it.
Now for the dataset Baghdad needs in order to be a grammar checker:
We need a table that lists all the Arabic words, which derivation rules apply to them, and their position in the sentence. (The derivation rules were a problem with my previous approach: a word can be derived correctly yet not exist in the Arabic dictionary, which makes it wrong.)
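To make the shape of that table a bit more concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration only: the `Entry` type, the `is_valid_derivation` helper, the rule names, and the toy data are all made up, and this is not Baghdad's actual format. The only point is that a derived form is accepted when the table lists that rule for that word, not merely because the rule can be applied mechanically.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One row of the hypothetical table (field names are illustrative)."""
    word: str                                    # base word as listed in the dictionary
    rules: set = field(default_factory=set)      # derivation rules allowed for this word
    positions: set = field(default_factory=set)  # sentence positions the word may take


def is_valid_derivation(table, word, rule):
    """Accept a derivation only if the table lists `rule` for `word`."""
    entry = table.get(word)
    return entry is not None and rule in entry.rules


# Toy data: the rule and position names are placeholders, not a real rule set.
table = {
    "كتب": Entry("كتب", rules={"agent_noun"}, positions={"verb"}),
}

print(is_valid_derivation(table, "كتب", "agent_noun"))       # listed in the toy table
print(is_valid_derivation(table, "كتب", "instrument_noun"))  # not listed in the toy table
```

Filling such a table in for the whole language is exactly the part that needs people, not code.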
At the moment, what we need is a wordlist, or some modern Arabic text, and I'd be glad to maintain the list after that.
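For anyone who has modern Arabic text lying around, turning it into a candidate wordlist is the easy part. Below is a minimal sketch in Python, not the script I used for the Quran list. The function name `build_wordlist`, the `min_count` parameter, and the choice to strip the short-vowel marks and tatweel are my assumptions here; the result would still need proofreading by someone who knows the language.

```python
import re
from collections import Counter

# Assumption: drop short-vowel marks (U+064B..U+0652) and tatweel (U+0640)
# so that vocalized and plain spellings collapse into one entry.
_MARKS = re.compile("[\u064B-\u0652\u0640]")
# Arabic letters only: U+0621..U+063A and U+0641..U+064A.
_WORD = re.compile("[\u0621-\u063A\u0641-\u064A]+")


def build_wordlist(text, min_count=1):
    """Return the sorted unique Arabic words seen at least `min_count` times."""
    counts = Counter(_WORD.findall(_MARKS.sub("", text)))
    return sorted(word for word, n in counts.items() if n >= min_count)


if __name__ == "__main__":
    sample = "ذهب الولد إلى المدرسة. ذَهَبَ الولد إلى البيت."
    # One word per line, which is the usual shape for a plain wordlist file.
    for word in build_wordlist(sample):
        print(word)
```

Raising `min_count` is a cheap way to drop one-off typos from a large corpus, at the cost of losing rare but valid words.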