7.3.2.2 What the phonetic code should do
Normal text comparison works well as long as the typist misspells a word by pressing a key they did not mean to press. In these cases, usually only one character differs from the original word.
In cases where the writer does not know the correct spelling, the word may have several characters that differ from the original, but it usually still sounds like the original. Someone might think that ‘tough’ is spelled ‘taff’. A spell checker without phonetic code will not arrive at the idea that this might be ‘tough’, but a spell checker that knows that ‘taff’ is pronounced like ‘tough’ will make good suggestions to the user. Another example is ‘funetik’ and ‘phonetic’.
These examples show that the phonetic transformation should not be too fussy or too precise. Implementing a complete phonetic dictionary of the kind found in books is not very useful, because many characters could still differ between the misspelled word and the desired word. When you implement the phonetic transformation table, you should instead reduce the set of letters used to those that are really necessary.
Characters that sound similar should be reduced to one. In English, for example, ‘Z’ sounds like ‘S’, which is why the transformation rule ‘Z -> S’ is present in the replacement table. ‘PH’ is spoken like ‘F’, so there is a ‘PH -> F’ rule.
If you take a closer look you will even see that vowels sound very similar in English: ‘contradiction’, ‘cuntradiction’, ‘cantradiction’ and ‘centradiction’ sound nearly the same, don’t they? Therefore the English phonetic replacement rules not only reduce all vowels to one but remove them altogether (a letter is removed simply by setting up no rule for it). The phonetic code of ‘contradiction’ is ‘KNTRTKXN’, and if you try to read this letter-monster aloud you will hear that it still sounds a bit like ‘contradiction’. You can also see that ‘D’ is transformed to ‘T’ because the two sound nearly the same.
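As a rough illustration of the reduction described above, the following Python sketch applies a tiny, made-up rule table. It is not Aspell's English table and not Aspell's matching engine: the names `RULES` and `toy_phonetic_code` are hypothetical, and matching is simplified to "the first rule that matches at the current position wins". Letters without a rule (here the vowels and ‘Y’) are dropped, as described in the text.

```python
# Hypothetical toy table for illustration only, NOT Aspell's English rules.
RULES = [
    ("PH", "F"),  # 'PH' is spoken like 'F'
    ("Z", "S"),   # 'Z' sounds like 'S'
    ("D", "T"),   # 'D' and 'T' sound nearly the same
    ("C", "K"),   # simplification: every 'C' is treated as 'K'
]
# The remaining consonants map to themselves. Vowels get no rule at all,
# so they are removed from the code.
RULES += [(c, c) for c in "BFGHJKLMNPQRSTVWX"]


def toy_phonetic_code(word):
    """Return a simplified phonetic code using first-match-wins rules."""
    text = word.upper()
    out = []
    i = 0
    while i < len(text):
        for pattern, replacement in RULES:
            if text.startswith(pattern, i):
                out.append(replacement)
                i += len(pattern)
                break
        else:
            i += 1  # no rule for this letter: drop it
    return "".join(out)


print(toy_phonetic_code("funetik"))        # FNTK
print(toy_phonetic_code("phonetic"))       # FNTK
print(toy_phonetic_code("contradiction"))  # KNTRTKTN
print(toy_phonetic_code("cantradiction"))  # KNTRTKTN
```

The toy table has no rule for ‘TION’, so its code for ‘contradiction’ differs from the ‘KNTRTKXN’ produced by the real English table. The point is only that the misspelling and the intended word collapse to the same code.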
If you think you have found a regularity, you should always take your word list and grep for the regular expression that corresponds to the transformation rule you want to make. An example: suppose you come to the idea that all English words ending in ‘ough’ sound like ‘AF’ at the end, because you are thinking of ‘enough’ and ‘tough’. If you then grep for the corresponding regular expression with
grep -i 'ough$' wordlist
you will see that the rule you wanted to set up is not correct, because it does not fit words like ‘although’ or ‘bough’. So you have to define your rule more precisely, or set up exceptions if the number of words that deviate from the desired rule is not too big.
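The same check can be sketched in Python when grep is not at hand. The word list below is a small inline stand-in for a real word list, and the name `words_matching` is hypothetical. The script only lists the candidates; deciding which of them actually follow the intended pronunciation still has to be done by hand.

```python
import re


def words_matching(pattern, words):
    """Return the words matching the pattern, ignoring case (like grep -i)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [w for w in words if regex.search(w)]


# Stand-in for a real word list; normally this would be read from a file.
sample_words = ["enough", "tough", "although", "bough", "Rough", "table"]

for word in words_matching(r"ough$", sample_words):
    print(word)
```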
Don’t forget about follow-up rules, which can help in many cases but can also lead to confusion and unwanted side effects. It is also important to write exceptions in front of the more general rules (‘GH’ before ‘G’, etc.).
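Why the order matters can be seen in a small Python sketch. It assumes a simplified matcher in which the first rule that matches at the current position wins; the real engine is more elaborate. The rules and the names `apply_rules`, `specific_first` and `general_first` are invented for the illustration.

```python
def apply_rules(word, rules):
    """Apply the first matching rule at each position; drop unmatched letters."""
    text = word.upper()
    out = []
    i = 0
    while i < len(text):
        for pattern, replacement in rules:
            if text.startswith(pattern, i):
                out.append(replacement)
                i += len(pattern)
                break
        else:
            i += 1  # no rule: the letter is removed
    return "".join(out)


# Hypothetical rules, for illustration only.
specific_first = [("GH", "F"), ("G", "K"), ("H", "H"), ("R", "R")]
general_first = [("G", "K"), ("GH", "F"), ("H", "H"), ("R", "R")]

print(apply_rules("rough", specific_first))  # RF
print(apply_rules("rough", general_first))   # RKH
```

With the general ‘G’ rule listed first, the ‘GH’ exception can never fire, because ‘G’ always matches before it is tried.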
If you think you have set up a number of rules that may produce good results, try them out! If you run Aspell as
aspell --lang=your_language pipe
you get a prompt at which you can type in words. If you just type words, Aspell checks them and, if they are misspelled, makes suggestions. If you type in
$$Sw word
you will see the phonetic transformation of that word and can test whether your work does what you want.
Another good way to check that changes to your rules have no bad side effects is to create a second list from your word list that contains, on each line, not only the word but also its phonetic version. If you do this once before the change and once after it, you can make a diff (see man diff) to see what really changed. To do this, use the command
aspell --lang=your_language soundslike
In this mode Aspell outputs, for each word you give it, the original word and then its soundslike, separated by a tab character.
If you are interested in seeing how the algorithm works, you can download a set of useful programs from http://members.xoom.com/maccy/spell/phonet-utils.tar.gz. This includes a program that produces a list as mentioned above and another program that illustrates how the algorithm works. It uses the same transformation table as Aspell, so it helps a lot during the process of creating a phonetic transformation table for Aspell.
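The before-and-after comparison described above can also be sketched with Python's standard difflib module. The two lists stand in for soundslike output captured before and after a rule change; the phonetic codes in them are invented for the example, and the name `changed_entries` is hypothetical.

```python
import difflib

# Invented sample data in the "word<TAB>soundslike" format, one entry per line.
before = ["enough\tENF", "tough\tTF", "zebra\tSBR"]
after = ["enough\tENF", "tough\tTKH", "zebra\tSBR"]


def changed_entries(old_lines, new_lines):
    """Return only the removed ('-') and added ('+') lines of a unified diff."""
    changes = []
    for line in difflib.unified_diff(old_lines, new_lines, lineterm="", n=0):
        # Skip the file headers and hunk markers. Caveat: this would also
        # skip an entry whose word itself begins with "--" or "++".
        if line.startswith(("---", "+++", "@@")):
            continue
        changes.append(line)
    return changes


for entry in changed_entries(before, after):
    print(entry)
```

Here only the entry for ‘tough’ is reported, once with its old code and once with its new one, which makes unintended side effects of a rule change easy to spot.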
During your work you should write down your basic ideas so that other people are able to understand what you did (and so that you still understand it yourself after a few weeks). The English table has extensive documentation appended as an example.
Now you can start experimenting with everything you have just read and perhaps set up a nice phonetic transformation table for your language, so that Aspell can come up with the best correction suggestions for your language too. Take a look at the Aspell homepage to see whether a transformation table for your language already exists. If there is one, you might also look at it to see whether it could be improved.
If you think that this section helped you, or if you think that it is just a waste of time, you can send feedback to bjoern.jacke@gmx.de.