flashjazzcat

Dictionary Files

I know there were many spell-checker/proofreader programs written for the Atari8 over the years, either as part of Word Processors or stand-alone utilities. My question is this: does anyone know of a freely usable dictionary (perhaps in various languages) which could be used as the core of a spell-checking program? I know that, for example, AtariWriter Plus came with a 30,000 word dictionary compressed onto a 90K floppy. Would it be prudent to use this, even if one could figure out how to detokenize the text? Or perhaps dictionaries from other 8-bit platforms could be adapted for the job... Or even something from the PC platform. Any thoughts? I don't particularly want to start with an empty dictionary along the lines of the PD Personal Spelling Checker. With the technology and storage at our disposal now, surely it's possible to make a fairly comprehensive compressed dictionary file.


Try the SCOWL and/or 12Dicts lists from http://wordlist.sourceforge.net/

 

Or google for "/usr/dict/words"; this is the traditional place where UNIX systems store the wordlist for the "spell" program. Quite a few versions of this file have been posted on the web... if you're going to use them in your own software, check the license terms.


On further thought... those word lists are in the 1/2 to 1 megabyte range in compressed form (much larger unzipped). Not sure how useful they'll be on the Atari, even an Atari with a 1 meg RAM expansion. Even if you have a hard disk or lots of flash storage, searching would be kinda slow...

 

Actually that's a decent argument in favor of starting with an empty or at least very small wordlist: most people's vocabularies aren't anywhere near the full set of words in their language; starting with an empty/tiny database allows them to create a wordlist that actually matches the words they use. On the other hand, people are spoiled rotten by modern software; they don't really think of creating a word list as something they should have to do themselves...

 

You might also check out the "cwl" (compressed word list) storage format, used by "ispell" and "aspell". For each word, there's a set of flags with meanings like "this is a standard noun, pluralize with -s, make possessive with 's" or "this is a standard verb, so the same word with -ed and -ing are also valid words". If you find you only need 7 flags, the flag-byte won't even waste any space in the file (a plain wordlist needs separator bytes, you'd just be saying "a byte with bit 7 set is a separator byte, with flags in bits 0-6").
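As a rough sketch of that flag-byte layout (the flag assignments and helper names here are invented for illustration, not aspell's actual on-disk format):

```python
# Each word is stored as plain ASCII, terminated by a byte with bit 7 set
# whose low 7 bits are affix flags: the separator byte doubles as the flags.
FLAG_S, FLAG_ED, FLAG_ING = 0x01, 0x02, 0x04   # example flag assignments

def encode(words_with_flags):
    out = bytearray()
    for word, flags in words_with_flags:
        out += word.encode("ascii")
        out.append(0x80 | flags)               # separator/flag byte
    return bytes(out)

def expand(data):
    """Decode the stream and yield every valid word, affix variants included."""
    word = bytearray()
    for b in data:
        if b & 0x80:                           # bit 7 set: end of word
            base = word.decode("ascii")
            yield base
            if b & FLAG_S:   yield base + "s"
            if b & FLAG_ED:  yield base + "ed"
            if b & FLAG_ING: yield base + "ing"
            word.clear()
        else:
            word.append(b)

data = encode([("jump", FLAG_S | FLAG_ED | FLAG_ING), ("atari", 0)])
print(sorted(expand(data)))
```

Five valid words come out of the 11-byte stream: the flags cost nothing over a plain separated list.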

 

Another idea is to store the list in frcode (front-code) form. Say the words "abdomen" and "abduct" appear in order in the list: you store the first one normally, and the 2nd one gets stored with a prefix length value of 3, plus the letters "uct" (the first 3 letters come from the previous word). This algorithm is used by the GNU findutils, and you can find an implementation here: http://www.sfr-fresh.com/unix/misc/slocate...cate-2.7/main.c
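A minimal front-coding sketch in Python (just the idea; findutils' frcode also handles the byte-level encoding and signed deltas):

```python
# Each entry stores how many leading characters it shares with the previous
# word, plus the differing suffix. The list must be sorted for this to pay off.
def front_encode(sorted_words):
    prev, out = "", []
    for w in sorted_words:
        n = 0
        while n < min(len(prev), len(w)) and prev[n] == w[n]:
            n += 1
        out.append((n, w[n:]))
        prev = w
    return out

def front_decode(entries):
    prev, words = "", []
    for n, suffix in entries:
        prev = prev[:n] + suffix
        words.append(prev)
    return words

print(front_encode(["abdomen", "abduct"]))   # → [(0, 'abdomen'), (3, 'uct')]
```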


Thanks for those links. The SCOWL list is pretty ideal, if I can trim it down a little. Among everything you've suggested, I should be able to construct something. Really I just wanted plain text file lists. I'll port one of a manageable size over to the Atari, figure out some compression algorithm and run it through a BASIC tokenizer. It will then remain to write the spell-checker app itself and we should be good to go. Creating the list and keeping it to a manageable size will be where the bulk of the work lies. Modern storage devices have at least removed some of the space limitations, although I still wouldn't want the dictionary to be more than 200-300K in size. The whole thing will be disk based; I've yet to think about a good binary chop search algorithm that can be applied to disk files. I would imagine - certainly with DOS 2.5 - it would be better to deal at the raw sector level.


I wonder... could you do something like create a CRC for each entry in a large dictionary.

 

Maybe a 24 bit one would be enough. Then do word-check for a valid CRC match on each word.
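Sketching that idea (zlib's CRC-32 truncated to 24 bits here just for illustration; a real implementation would pick its own polynomial):

```python
import bisect
import zlib

# Keep only a sorted table of 24-bit CRCs instead of the words themselves:
# 3 bytes per word, with membership tested by binary search.
def crc24(word):
    return zlib.crc32(word.encode("ascii")) & 0xFFFFFF

table = sorted({crc24(w) for w in ["the", "quick", "brown", "fox"]})

def check(word):
    h = crc24(word)
    i = bisect.bisect_left(table, h)
    return i < len(table) and table[i] == h

print(check("quick"))
```

One catch: a CRC table can only answer yes/no; it can't reconstruct the word list, so it can't suggest corrections.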


I did a little spell checker with a 60K dictionary a long time ago. I used exactly the approaches Urchlay suggested: Flags indicating 'stems' (like -ing, -d, -s, -ed, -es, etc), and 'prefixes' (sharing chars from previous words).

 

My prefix codes were signed, so you either added or removed characters from the previous prefix.

 

I also used 4 bits per letter, with the 14 most used letters each using a nybble and two escape codes. One for less used letters (thus using 8 bits for those) and one for word boundaries. This technique saved something like 40%.
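That scheme might look something like this in Python (the letter ranking here is an assumption, not the original's): codes 0-13 are the 14 commonest letters, code 14 escapes to a second nybble indexing the other 12 letters (8 bits total for those), and code 15 marks a word boundary.

```python
COMMON = "etaoinshrdlucm"          # 14 frequent letters (assumed ranking)
RARE   = "wfgypbvkjxqz"            # the remaining 12
ESC, END = 14, 15

def to_nybbles(words):
    nyb = []
    for w in words:
        for ch in w:
            if ch in COMMON:
                nyb.append(COMMON.index(ch))
            else:
                nyb += [ESC, RARE.index(ch)]   # escape nybble + index nybble
        nyb.append(END)
    if len(nyb) % 2:
        nyb.append(END)                        # pad to a whole byte
    return bytes(nyb[i] << 4 | nyb[i + 1] for i in range(0, len(nyb), 2))

def from_nybbles(data):
    nyb = [x for b in data for x in (b >> 4, b & 15)]
    words, w, i = [], "", 0
    while i < len(nyb):
        c = nyb[i]
        if c == END:
            if w:
                words.append(w)
            w = ""
        elif c == ESC:
            i += 1
            w += RARE[nyb[i]]
        else:
            w += COMMON[c]
        i += 1
    return words

packed = to_nybbles(["atari", "jazz"])
print(len(packed), from_nybbles(packed))
```

Here "atari" and "jazz" take 11 bytes as a plain separated list but pack into 7 bytes, the same ballpark as the ~40% saving described.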

 

I also used a flag lookup table to store the top 15 most common combined stem flags and prefix codes. The majority of all words used a 4-bit flag and the rest needed 12 bits.

 

I fit just over 20,000 words (stolen from a Unix dict-file), so those compression techniques are equal to a 3-byte CRC in density.

 

CRCs have some shortcomings. The biggest is that you can't suggest spellings since you don't have a word list. Another is that collisions will occur more frequently than you expect. Google the birthday paradox for details. It's likely a few common misspellings will never be caught by your dictionary because of a hash collision.
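The birthday-paradox numbers are easy to estimate: hashing n words into M = 2^bits values, the expected number of colliding pairs is roughly n(n-1)/2M.

```python
# Standard birthday-paradox approximation for colliding pairs.
def expected_collisions(n, bits):
    m = 2 ** bits
    return n * (n - 1) / (2 * m)

# A 20,000-word dictionary with 24-bit CRCs already expects about a dozen
# colliding pairs -- and any misspelling that lands on a used CRC slips by.
print(expected_collisions(20000, 24))
```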

 

One issue with prefixes is that they make binary searches pretty tough, and scanning the whole list is just unreasonably slow. I tried various approaches to speed up binary searches, then I gave up on that approach.

 

Instead I sorted the document alphabetically by word in memory and then scanned the whole dictionary once, sequentially. This is definitely the way to go if you're reading your dictionary off disk.

 

- KS

Edited by kskunk


Sounds interesting: it seems the mechanics of the thing are going to be a little more involved than I'd thought. Do you have any code examples to share? This thing is going to be written in pure assembler and is going to have to be compact. It will run inside The Last Word as an add-in, so it won't have to do anything other than interface with the file already loaded into memory. I can't see the possibility of loading the dictionary into RAM in this case. I do recall writing a spell-checker many years ago in Turbo Basic and I think the dictionary file was non-alphabetical. The program just read the words from the dictionary sequentially and checked them off as it encountered them in the text file. Any words left unchecked in the text file clearly weren't found in the dictionary. One advantage is that the dictionary file could be appended to: there was no need to keep a custom dictionary file for user additions.


You can shorten the search with some creative organisation.

 

Have the database sorted by word length, then in alphabetical order within each length. Employ some sort of caching system so that disk reads are cut down.

Still, doing a word checker that runs at any sort of decent speed on the Atari would be a fairly tough job.


Wow, sounds like an interesting challenge. I'll be interested to hear about performance constraints. If you have 20KB of user text to spell check, some kind of sorting seems mandatory for performance. Otherwise you're looking at maybe 0.1 seconds per word tested, pretty rough with a 20,000 word dictionary.

 

Given memory constraints, maybe you could set aside 512 bytes for an ordered table of pointers to sorted words in the user text. Even a simple-to-code sort like an insertion sort could provide enough performance to build the ordered table as you scan the user text. You can also skip inserting duplicate words to keep the table from growing too fast.

 

If the table overflows, that's fine -- you'll end up with a list of words starting from 'a' thru some letter. Just keep track of where the table ended, and then come back to build the table again, skipping all words before the end point.

 

If your word database is also ordered alphabetically, you can do the whole search with one sweep through the word file, only pausing to rebuild the ordered table of pointers whenever you hit the end.

 

I believe such an approach could fit in a fairly small footprint and do the job pretty quickly even on a 6502.
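In outline (modern pseudocode, with an in-memory sorted list standing in for the sequential disk reads), the overflow-and-resume scheme works out to something like:

```python
# Sort the document's words into a bounded table, sweep the sorted dictionary
# front-to-back once per batch, and resume past the last table entry whenever
# the table overflowed.
def sweep(batch, dictionary):
    """One sequential pass over the sorted dictionary; batch is sorted too."""
    bad, i = [], 0
    for entry in dictionary:               # stands in for sequential disk reads
        while i < len(batch) and batch[i] < entry:
            bad.append(batch[i])           # passed its slot with no match
            i += 1
        if i < len(batch) and batch[i] == entry:
            i += 1
    return bad + batch[i:]                 # anything left never matched

def spell_check(words, dictionary, table_size=64):
    misspelled, start = [], ""
    while True:
        batch = sorted({w for w in words if w > start})[:table_size]
        if not batch:
            return sorted(misspelled)
        misspelled += sweep(batch, dictionary)
        start = batch[-1]                  # resume here on the next pass

words = ["the", "qwick", "brown", "fox", "teh"]
print(spell_check(words, sorted(["brown", "fox", "quick", "the"]), table_size=2))
# → ['qwick', 'teh']
```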

 

- KS


It's an interesting approach. I was thinking of using a spare extended bank to store an alphabetical list of all the words in the file (it will always fit, as a file segment will never be more than 16K in length). I then use an optimised, double-buffered read routine to scan the (unordered and therefore expandable) dictionary file past this list, removing words from the alpha list as they are encountered in the dictionary. What's left in the list could be matched against the original text file, and the opportunity given to correct mistakes or add new words to the dictionary.

 

The trouble with any approach like this is that even when searching for a single word, the entire dictionary file has to be read. I do like the simplicity of it, though, and the fact that new words can be added to the main dictionary.

 

More fun, perhaps, would be a fancy data structure. I was trying to design a structure the other night, and it turns out I was a short (but important) step away from this interesting idea:

 

Directed Acyclic Word Graphs

 

This looks ideal for a non-modifiable structure on disk, but it's hard to predict how much space it would consume. I predict the file sizes would be very manageable for a modestly sized dictionary, however. What baffles me at the moment is designing a PC-based program to convert a word list to an efficient DAWG. Still, you'd probably be looking at a sector read per every couple of letters in a word (assuming the non-child node lists don't cross into another sector), so checking a long document might prove somewhat distressing for the disk drive.

 

The unordered (expandable) dictionary approach really turns the traditional approach on its head: the ordered (RAM based) list of words in the text file becomes the dictionary, and the unordered dictionary becomes the (very considerable) text file being checked against it.

 

I'd like to at least have a go at implementing the DAWG (roughly as described in the link), but I think the encoder would be very difficult to write. Certainly a job for someone well versed in C on the PC. Interrogating the dictionary, conversely, would be quite simple in assembler. Support for custom, user dictionaries would have to be provided when using the DAWG approach, of course.

 

...Just thinking... no real reason why the DAWG approach couldn't be updated with new words, too. A compelling argument...

Edited by flashjazzcat


This is a pretty nice structure. I wonder if you could order your user text using a graph like this for faster searching?

 

I suspect both the user text and the dictionary text must have some kind of ordering, given the meager performance of the 6502. Even linear in-memory searches are too expensive to do per-word.

 

Anything that causes a seek per word seems pretty nasty given the Atari floppy's seek time. It's nice to imagine algorithms that can read the disk one end to the other, possibly skipping sectors that are (somehow) known to contain nothing useful. I think AtariWriter Plus might have this property. It's been a long time, but I remember being impressed by its performance.

 

What kind of encoding would you use for the on-disk version of a DAWG? It seems like it could inflate the size of a dictionary quite a bit.

 

The dictionary compression ideas I was floating earlier are purely to get a lot of words into a small space. I hadn't thought about ways to make the dictionary more searchable. Interesting thoughts...

 

- KS

Edited by kskunk


Well, if I managed to implement a disk-based lookup system, the text in memory could stay as it is since we'd just be stepping through the words in order and looking them up in turn. However, some kind of optimisation might be gained from sorting the words in the text file.

 

I was similarly impressed by AtariWriter Plus, which appeared to use an extremely effective seek per word algorithm. I used to use it to check my Uni assignments (written with TextPro) when the Atari8 was my main computer in the mid nineties! A commented disassembly of the AW+ proofreader would be ideal, but I do know that it used a tokenised dictionary (with the expanded text of the tokens embedded in the executable), along with a memory-resident dictionary of a couple of hundred (IIRC) or so of the most common English words like "the" and "a".

 

The single most bothersome problem is that DOS 2.5 doesn't have a random access file system like SpartaDOS. The disk implementation of the DAWG would be simple, since if each node is (say) three bytes long, the computation of the file pointer offset to node n is an easy multiplication. This won't work with a DOS 2.5 file, which uses sector chains, which is probably why the AW+ dictionary was stored on a raw 90K disk. The computation of node offsets once again becomes simple when dealing with sector numbers and byte offsets. Also, the three byte overhead of the file linkage system is dispensed with. But with today's 16MB partitions on SD cards, it would be nice to set the whole thing up on a hard disk without having to "Insert dictionary disk".

 

I mentioned three bytes per node: on a 90K disk, we have space for about 30,000 3 byte nodes. Therefore we need only use 15 bits for the node number (although we might wish to use 16 bits for maximum flexibility). That leaves 8 or 9 bits for the character and the "end of word" flag. We only need at the very most 7 bits for the character, and setting the high bit could indicate the "end of word" flag (meaning it's permissible for a word to end on this character having traversed the tree).
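A sketch of walking such a node array. The exact layout here is hypothetical: byte 0 holds the 7-bit character with bit 7 as the end-of-word flag, and the next two bytes hold a 15-bit first-child index with the top bit marking the last sibling in a list (siblings stored consecutively, so one pointer per node suffices).

```python
import struct

def make_node(ch, end_of_word, child, last_sibling):
    b0 = ord(ch) | (0x80 if end_of_word else 0)
    b12 = child | (0x8000 if last_sibling else 0)
    return struct.pack("<BH", b0, b12)       # 3 bytes per node

def lookup(nodes, word):
    """Walk the node array; the root sibling list starts at index 0."""
    i = 0                                    # first node of current list
    for pos, ch in enumerate(word):
        while True:
            b0, b12 = struct.unpack_from("<BH", nodes, i * 3)
            if chr(b0 & 0x7F) == ch:
                if pos == len(word) - 1:
                    return bool(b0 & 0x80)   # must be a valid word end
                if (b12 & 0x7FFF) == 0:
                    return False             # 0 doubles as "no children" here
                i = b12 & 0x7FFF
                break
            if b12 & 0x8000:                 # last sibling: letter not found
                return False
            i += 1
    return False

# Tiny graph holding "a", "an", "at":
nodes = (make_node("a", True, 1, True) +     # node 0: 'a', children at 1
         make_node("n", True, 0, False) +    # node 1: 'an'
         make_node("t", True, 0, True))      # node 2: 'at'
for w in ["a", "an", "at", "ax"]:
    print(w, lookup(nodes, w))
```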

 

Reading this extremely informative article, it's apparent that a 178,691 word dictionary will fit into 123,699 nodes: roughly four times what we can fit on a 90K disk, and twice what we can accommodate using a 16 bit node pointer on a 180K disk. I think a 40-50,000 word dictionary would be extremely respectable on an 8-bit Atari, and this would seemingly fit on a 90 or 130K DOS 2.5 disk.

 

The problem of storage space thus overcome, the one remaining hurdle to attaining a usable test system is writing the software to create the DAWG. Thinking about the idea of allowing the user to add words to the DAWG-based dictionary, it occurred to me that this could be done easily, but additional nodes would introduce data redundancy that the original DAWG lacked. Well - why not include in the spell-checker the ability to re-optimise the DAWG when additional words have been added? The logical conclusion is that it might be best to write the DAWG creation software in 6502 assembler (or C or Action! if the recursive stack implementation got to be too much to handle). This is easier for me, since I haven't yet learned any languages on the PC (and increasingly I feel I should).

 

Once a test system is built, we'd understand the severity of the disk read hit, and perhaps we could attempt further optimisation. If the dictionary was held in linear RAM, the DAWG would be a no-brainer. It's fast to search and incredibly compact. Unfortunately it's the iterations required to traverse the tree which make it less suited to a disk-based solution. The very nature of its low data redundancy would tend to make the nodes widely spaced across the disk.

 

Anyway, food for thought there. I'm going to think some about what's the quickest way to write a bare-bones program to take a word list and convert it to a DAWG on the Atari.

Edited by flashjazzcat


40K words seems plausible using an Optimal DAWG. That article is just awesome. I only had to scroll down to the picture to immediately understand why the compression works so well. It's using both prefix and suffix compression, which is better than the usual prefix compression approaches like Urchlay and I were discussing.

 

The downside is the high number of seeks. That seems to make a DAWG-on-disk unusable. Even if you can read 15 random sectors per second (assuming peak utilization of SIO), 16KB of text would take roughly 18 minutes to check. :/ And a real 1050 gets more like 3 random sectors per second.

 

A DAWG pre-loaded in memory would totally rock. No need to pre-sort the input text -- the DAWG itself is such a fast structure that you can just test the text linearly.

 

The downside is that it would still take a while to load the dictionary in advance. But it's possible to imagine an 8KB or 16KB DAWG containing thousands of frequently used words, combined with a larger on-disk DAWG to soak up the occasional big word.

 


The algorithm described in the article above would not fit on an Atari. He uses dozens of times the memory required for the final DAWG in his processing. The amount of recursion and depth of queues alone give me pause -- these are not things that would fit in a 256 byte stack.

 

As you said, it's easy to make DAWGs, but Optimal DAWGs definitely seem like a PC pre-processing step that takes many billions of instructions and many megabytes. :/

 

I think you're on a really solid path though. Imagine if these techniques had existed in the days of AtariWriter Plus!

 

- KS


This evening I tried designing a dictionary format that might be easy for the 6502 to deal with. With a pretty simple compressor, it fit 80472 words into 103910 bytes.

 

The dictionary is first sorted, then converted into front-code format. Then it is tokenized. Each token represents 1-16 characters. The token 'directory' is a fixed 1.5KB that expands typical sequences like 'qu' or 'tion' or 'ed/ing/s' (a single byte representing three additional words in front-code format).

 

The tricky part of compressing the dictionary is picking the best possible token directory. My naive compressor finds a token directory that achieves 10.33 bits/word, but just by fiddling with the token directory by hand I could improve performance. I suspect a smarter compressor or a hand-picked token directory could do even better.

 

Like other approaches I've mentioned, this one is designed to be read in one pass front-to-end per document, instead of seeking around per word.

 

I'm a little embarrassed to admit that this achieves better performance than my older more elaborate compression scheme. It turns out nybble-compression of characters and all those flags indicating plural words and stems like ing/ed/etc can be more efficiently handled by a universal tokenization scheme. And the resulting checker code can be much simpler.
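The tokenizing step might look like this (the directory entries here are invented; a real directory would be tuned against the actual word list, and each token would occupy one byte in the file):

```python
# Greedy longest-match tokenization against a fixed token directory.
# Multi-character tokens like 'tion' cover common sequences; single letters
# guarantee everything can be encoded.
DIRECTORY = ["tion", "ing", "qu", "th", "er"] + [chr(c) for c in range(97, 123)]
DIRECTORY.sort(key=len, reverse=True)      # try longest tokens first

def tokenize(text):
    out = []
    while text:
        tok = next(t for t in DIRECTORY if text.startswith(t))
        out.append(DIRECTORY.index(tok))   # one byte per token in a real file
        text = text[len(tok):]
    return out

def detokenize(tokens):
    return "".join(DIRECTORY[t] for t in tokens)

toks = tokenize("question")
print(len(toks), detokenize(toks))         # 4 tokens for 8 characters
```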

 

- KS

Edited by kskunk


Sounds great. Can you explain front-code?


Now that's interesting. I would have thought it'd be the other way around...

 

If you're building the dictionary on a modern machine, it shouldn't be too hard to write some code to generate the optimum token directory (with a brute-force search and lots of memory, or with a clever algorithm that I can't really think of right now).

 

I may code up the brute-force version in perl, just to get an idea whether it's worth it. Did you use one byte per token?

 

Sounds great. Can you explain front-code?

 

It's described above, as "frcode".


So it is! Thanks. There seem to be so many effective ways of optimising storage space, but few which optimise seek times (or rather, search iterations) sufficiently to implement the structure on disk. Again I'm impressed by AtariWriter Plus. I'd dearly love to know which seek algorithm is used there. I notice that the dictionary appears to achieve 100% tokenisation (on inspection of the disk sectors), through clever choices of token string. With a fixed-entity dictionary, I imagine the first few sectors could contain pointers to, say, words beginning A, AB, AC... through ZZ (the dictionary does seem to have many tables at the front). Moreover, the offset table could be read into RAM once and thereafter never read again. Given that the words are 100% tokenised, once you've loaded (with a single sector read) the first sector full of tokenized words beginning "AB", it may only take a further one or two sector reads at most to scan the entire table of "AB" words.

 

The more I think about this, the more it seems to me that some kind of RAM-based table should be read or assembled as soon as the spell-checker is loaded up. You'd then be within a few sectors of your target word without any disk access whatsoever. A table of offsets like this would at most occupy around 26*26 (676) bytes. Maybe we'd use up to 40 characters (40*40)... I don't know. I feel a strong instinct that AW+ uses a big memory-based table before reading anything at all from the data disk.

 

I'd be inclined to try a reasonably simple program to perform comprehensive tokenisation of an alpha list of around 30,000 words, then build a table of pointers as I've described. With a compression of 10.33 bits per word, that's around a hundred words per 128 byte sector. A 90K disk (720 sectors) could thus hold (presumably) an impressive 72,000 words. With (in theory) 676 blocks of words (each pointed to by a byte in the table), the (very rough) average sector read overhead per complete block is almost 1:1. Practically speaking, some blocks will occupy much more space, but I think this is a good direction to head in.
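A sketch of that index (sector size, separator byte, and layout assumed; a real DOS 2.5 file sector holds 125 data bytes plus 3 link bytes, but 128 keeps the sketch simple):

```python
# Build a RAM table mapping each two-letter prefix to the first sector that
# holds a word with that prefix, then search only from that sector onward.
SECTOR = 128   # raw sector size; a DOS file sector would lose 3 link bytes

def build_index(sorted_words):
    data, index = bytearray(), {}
    for w in sorted_words:
        index.setdefault(w[:2], len(data) // SECTOR)   # first sector for prefix
        data += w.encode("ascii") + b"\x9B"            # ATASCII EOL separator
    return bytes(data), index

def find(data, index, word):
    sector = index.get(word[:2])
    if sector is None:
        return False                       # no words with this prefix at all
    tail = data[sector * SECTOR:]          # read forward from that sector
    # (A real search must also cope with a word split across the sector
    # boundary; skipped here for brevity.)
    return word.encode("ascii") in tail.split(b"\x9B")

data, index = build_index(sorted(["atari", "basic", "point", "pointer", "sector"]))
print(find(data, index, "pointer"), find(data, index, "dos"))   # → True False
```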

Edited by flashjazzcat


Yeah, one byte per token. The token directory contains some interesting stuff like a token for the suffixes '/ly/ness/'. This one token marks the end of the current word, then creates two more variants ending in -ly and -ness, leaving the checker ready to start a new word. -ly and -ness very frequently occur together, and rarely with other kinds of suffixes.

 

Another interesting token is 'ous/ly/'. Many words that end in -ous also have a variant ending in -ously.

 

In my old compressor, I only had a handful of suffixes and didn't have enough bits to spare to think of suffixes like '-able'. Of course -able and -ably tend to occur together so that's another compound token.

 

I think brute force could do a much better job. My naive compressor does some pretty dumb things like greedily matching on a 2 character token even though it ruins the chance of matching the next character as part of a 10 character token.
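For what it's worth, that particular greedy pitfall can be fixed per-word without brute force, using a small dynamic program that minimises the token count (directory invented for the example):

```python
# best[i] = fewest tokens covering text[:i]; back[i] remembers the last token
# used to reach position i, so the optimal split can be reconstructed.
def optimal_tokenize(text, directory):
    best = [0] + [None] * len(text)
    back = [None] * (len(text) + 1)
    for i in range(len(text)):
        if best[i] is None:
            continue
        for tok in directory:
            j = i + len(tok)
            if text[i:j] == tok and (best[j] is None or best[i] + 1 < best[j]):
                best[j] = best[i] + 1
                back[j] = tok
    toks, i = [], len(text)
    while i > 0:
        toks.append(back[i])
        i -= len(back[i])
    return toks[::-1]

# Greedy longest-match grabs "ab" first and then spells out the rest letter by
# letter; the dynamic program finds the 2-token split instead.
directory = ["ab", "bcdefghij"] + list("abcdefghij")
print(optimal_tokenize("abcdefghij", directory))   # → ['a', 'bcdefghij']
```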

 

- KS


That seems pretty logical. Assuming you pre-sort your user text, you can simply skip over sectors of the dictionary that don't contain any starting letters you're interested in. And since you are reading the dictionary from one end to the other, the seeks you do take will be quite fast compared to true random seeks.

 

I think that approach could be very fast even with large dictionaries.

 

About the only improvement I can think of is to have a dictionary of a few thousand common words in memory, like you mentioned before. That could greatly reduce the number of actual seeks/reads from the 'full dictionary'.

 

- KS

...since you are reading the dictionary from one end to the other, the seeks you do take will be quite fast compared to true random seeks.

 

I think that approach could be very fast even with large dictionaries.

 

About the only improvement I can think of is to have a dictionary of a few thousand common words in memory, like you mentioned before. That could greatly reduce the number of actual seeks/reads from the 'full dictionary'.

The point is that with the pointer table system, we're not reading the dictionary from one end to the other at all. You could tell the spell-checker to look up the word "Pointer", for example. It would search the RAM-based table for the "PO" entry (pointing to the first word beginning "PO") and the program would read the first sector of that block, and scan the tokenised data for "Pointer". I believe that with a highly tokenised list (up to 100 words per sector), the word would be found within 1-3 sector reads.
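That lookup flow can be sketched like so. Everything here is hypothetical scaffolding: `read_sector` is a stand-in that returns a sector's already-detokenised word list, and the tiny two-sector "disk" is toy data. Because the dictionary is sorted, the scan can stop as soon as it sees a word that sorts after the target.

```python
# Sketch of the prefix-table lookup described above, with a toy two-sector
# "disk". read_sector is a stand-in for sector I/O plus detokenisation.

def lookup(word, prefix_table, read_sector):
    """Return (found, sectors_read) for a lowercase target word."""
    sector = prefix_table[word[:2]]
    reads = 0
    while True:
        reads += 1
        for w in read_sector(sector):
            if w == word:
                return True, reads
            if w > word:
                return False, reads  # sorted data has passed the target
        sector += 1

disk = [["point", "pointer"], ["polish", "pony"]]
table = {"po": 0}
hit = lookup("pointer", table, lambda n: disk[n])   # found in one read
miss = lookup("pond", table, lambda n: disk[n])     # rejected after two reads
```

The early-exit on `w > word` is what keeps a miss to roughly the same cost as a hit.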

 

Coupled with a small memory-resident dictionary of a hundred or so of the most common words, I think this would be about as fast as AtariWriter Plus. We'd also reserve memory for a custom, memory-resident dictionary which the user could add words to. That would be best stored as a plain text file which could be loaded into the word processor for editing/deletions.

 

How about posting some of these dictionary creation programs? What I want to create is a highly compressed dictionary of about 90K in size which I can then scan with a BASIC program on the Atari to create a separate file of offsets to AA, AB, AC through ZZ words. Then I can set about writing the spell-checker.


Something interesting...

 

http://en.wiktionary.org/wiki/Wiktionary:F...mporary_fiction

 

That's a list of the top 2000 most frequently used words in contemporary fiction, sorted in order by frequency.

 

I grabbed the first 208 words from that list (i.e. the first 1K's worth) and wrote a little perl script that takes some text as input, and shows the percentage of words in the text that are in that list, vs. the percentage that aren't. I checked the text of a few novels I have on my hard drive, plus some forum posts (including ones from this thread) and emails, and the contents of the fortune cookie database on my Linux box.

 

The hit rate averages around 55% or so... meaning, if you dedicate 1K to storing the top 208 words, only 45% of the words you spell-check would require disk seeks (assuming the user's actually spelled them right, I mean).

 

Extending the list to 400 words (requiring over 2K of RAM) only increases the hit rate to around 65%.
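For anyone wanting to reproduce the measurement, here's the gist of it re-sketched in Python (the original was a perl script; the eight-word `common` set below is a stand-in, not the actual Wiktionary top-208 list).

```python
# Re-sketch of the hit-rate measurement described above: what fraction of
# a text's words appear in a small resident list of common words?

import re

def hit_rate(text, common_words):
    """Fraction of words in text found in common_words (0.0 if no words)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in common_words)
    return hits / len(words)

# Stand-in common-word set; the real test used the top 208 fiction words.
common = {"the", "of", "and", "a", "to", "in", "is", "it"}
rate = hit_rate("The cat sat on the mat and it purred", common)
```

Feeding it a real novel with the real top-208 list is what produced the ~55% figure quoted above.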

 

I'm also playing with tokenizing and front-coding and such... so far I'm only down to 13.5 bits/word, so my results aren't as good as kskunk's (not yet anyway), but I'm still messing with it.

 

A thought: if you're going with the "insert dictionary disk" approach, you could have 2 or 3 versions of the disk (one each for single, 1050 enhanced, and double densities). The code that reads the dictionary would need to be smart enough to detect the density of the inserted disk, but that's no big deal... Ideally, the same program would be used (probably on a PC) to generate the disk images, you'd just feed it different wordlists as input.


Hello flashjazzcat

 

Some of us are able to write in a couple of languages. Would it be possible in LW to choose the language that is used? It's kinda silly to spellcheck a text in Dutch, German, Spanish, Italian, etc. with an English Dictionary.

 

The advantage of being able to select different languages/dictionaries would also be that one can choose whether (s)he wants to have color and aluminum marked as spelling errors, or colour and aluminium.

 

greetings

 

Mathy (Who only speaks Limburgian dialect, Dutch, German and English)

 

PS I hear French, Belgian and Canadian French isn't the same either. Latin American Spanish isn't 100% the same as Spanish Spanish and Dutch Dutch isn't exactly the same as Belgian Dutch.

The advantage of being able to select different languages/dictionaries would also be that one can choose whether (s)he wants to have color and aluminum marked as spelling errors, or colour and aluminium.

 

When you're writing in French or Danish or whatever on the Atari 8-bit, are you using the XL/XE international character set to encode the accented/umlauted/etc characters? (I assume so; this means character codes 1 to 26 aren't available as valid multi-character tokens. Bummer.)

 

PS I hear French, Belgian and Canadian French isn't the same either. Latin American Spanish isn't 100% the same as Spanish Spanish and Dutch Dutch isn't exactly the same as Belgian Dutch.

 

Ideally, and this is just my own idea, you'd be able to create a dictionary text file on a PC/whatever, and upload it to a web form that crunches it and spits out an ATR image (which your browser would download). This would allow anyone to define their own dictionary for any language (or anyway, any language representable on the Atari: Hebrew, Chinese, and Klingon are probably impossible, Arabic might be doable for Arabic XE models).

 

It also might be that France's French and Canadian French are close enough that the same dictionary file could be used for either, with a smaller user-defined dictionary of words that are specific to the dialect. Not sure whether this is viable (I don't speak French), but it's a thought.


To get an idea whether I'm on the right track or not.. here's an example of the stuff my encoder produces. <XX> = hex bytes, ^X = control-X.

 

original word = word with human-readable tokens = binary tokenized word = tokenized and possibly front-coded word, which is what would get written to disk.

 

Leading ^A to ^O (1 to 15) are front-code prefix length (meaning "prefix is this many bytes of previous word"), except ^A (prefix 1 byte) is never used.

 

The (s|ed|ing) stuff gets encoded as uppercase letters. This isn't as smart as it should be: a stem word has to exist as a word in its own right to qualify for suffix encoding. This means "apply, applied, applying" aren't recognized as the same word (because the Y changed to an I). However "fill, filled, filling" are recognized correctly and encoded as a single word (because "fill" exists as a word by itself). Right now, the suffix combinations are hard-coded, and I don't have codes for all the common combinations yet.

 

The tokens (e.g. the (fi) and (ly) and such) are found by a dumb brute-force loop that extracts all 2- to 8-character strings from every word and keeps track of how frequently they occur.
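That brute-force candidate search is a few lines of Python on a modern machine. This sketch is a plausible reconstruction of the loop described, not the poster's actual code: pull every 2- to 8-character substring out of every word and tally occurrences; the highest counts become token candidates.

```python
# Sketch of the brute-force token candidate search described above:
# extract every 2- to 8-character substring from every word and count
# how often each occurs across the whole list.

from collections import Counter

def count_substrings(words, min_len=2, max_len=8):
    counts = Counter()
    for w in words:
        for i in range(len(w)):
            for length in range(min_len, max_len + 1):
                if i + length <= len(w):
                    counts[w[i:i + length]] += 1
    return counts

counts = count_substrings(["firm", "firmly", "firmness"])
```

A real builder would then weight each candidate by (occurrences × characters saved) rather than raw frequency, since long rare strings can beat short common ones.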

 

The on-disk structure would have to have a separator byte (a null, $00) appended to the 4th column.

 

firmly = (fi)rm(ly) = <FB>rm^V = <FB>rm^V
firm(s|ed|ing) = (fi)rmB = <FB>rmB = ^CB
firmness('s) = (fi)rmn(ess)Y = <FB>rmn<93>Y = ^Cn<93>Y
firmware('s) = (fi)rmwa(re)Y = <FB>rmwa<AA>Y = ^Cwa<AA>Y
firmest = (fi)rm(est) = <FB>rm<8A> = ^C<8A>
firmer = (fi)rm(er) = <FB>rm<A5> = ^C<A5>
firing('s) = (fi)r(ing)Y = <FB>r<86>Y = ^B<86>Y
fire's = (fi)r(e's) = <FB>r<91> = ^B<91>
firstly = (fi)r(st)(ly) = <FB>r<A9>^V = ^B<A9>^V
firsthand = (fi)r(st)h(an)d = <FB>r<A9>h<B4>d = ^Ch<B4>d
firsts = (fi)r(st)s = <FB>r<A9>s = ^Cs
fixture(s|'s) = (fi)x(tu)(re)W = <FB>x0<AA>W = <FB>x0<AA>W
fix(es|ed|ing) = (fi)xC = <FB>xC = ^BC
fixable = (fi)xa(ble) = <FB>xa<94> = ^Ba<94>
fixation(s|'s) = (fi)x(ation)W = <FB>x{W = ^B{W
fixing's = (fi)x(ing)('s) = <FB>x<86><C4> = ^B<86><C4>
fizz(es|ed|ing) = (fi)zzC = <FB>zzC = <FB>zzC
fizzy = (fi)zzy = <FB>zzy = ^Cy

 

Looking at the first line, "firmly", there's no ^A - ^O, because it doesn't share a prefix with the word before it. <FB> is the token for "fi", ^V is the token for "ly"...

 

Next line, "firm(s|ed|ing)", shares a prefix of 3 tokens (the ^C = ASCII code 3). This is the words "firm, firms, firmed, firming" (the B means "word by itself, plus -s, -ed, -ing suffixes"). Ideally, there ought to be a letter that means s|ed|ing|ly (this line wouldn't even be in the file, that would save 3 bytes).

 

In case you're wondering, the word "first" is missing from the dictionary list I'm using, because it's in the most-frequent-word list (which would be stored in 1K of RAM and stay resident).

 

This chunk of encoded dictionary is 70 bytes. Expanded to its original form, it'd be 266 bytes in 34 words (counting separator bytes between words). At 70 bytes, that's 16.5 bits/word (overall, the file's compression is better than that). I could also strip off the <FB> bytes, since we're using an index (it would tell us where to look for words starting with an <FB> token, so there's no need to actually store the <FB> byte for every word). That would save 5 bytes, for 15.2 bits/word, but we'd need a marker at the end of the <FB> words to tell us we're at the end (probably two end-of-word/null bytes in a row), which costs one of the 5 saved bytes.
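The figures above check out arithmetically. A quick sanity check (just the stated numbers, nothing new):

```python
# Sanity-checking the compression figures quoted above:
# 70 encoded bytes covering 34 words, versus 266 bytes of plain text.

encoded_bytes, word_count, plain_bytes = 70, 34, 266
bits_per_word = encoded_bytes * 8 / word_count   # ~16.5, as stated
ratio = encoded_bytes / plain_bytes              # ~0.26, i.e. nearly 4:1
```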

 

However... I think this might be pretty slow to decode on the Atari, and/or require too much code.

 

kskunk, how does this compare to your 10.33 bits/word structure? Am I on the right track at all?

To get an idea whether I'm on the right track or not..

 

Leading ^A to ^O (1 to 15) are front-code prefix length (meaning "prefix is this many bytes of previous word"), except ^A (prefix 1 byte) is never used.

 

I did this slightly differently. Instead of storing how many bytes of the previous word to use, I store how many bytes of the previous word to discard. This seemed to help compression because stems start to look like this:

 

Urchlay0ed2ing3s

 

As you can guess, the string '0ed2ing3s' occurs very frequently.
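The discard-count front coding decodes very simply; this is a small sketch of the idea (the `(discard, suffix)` pair representation is illustrative, not the actual on-disk layout), expanding the `Urchlay0ed2ing3s` example above.

```python
# Sketch of discard-count front coding as described above: each entry says
# how many trailing characters of the previous expanded word to drop, then
# supplies the new suffix. Expands "Urchlay0ed2ing3s" style entries.

def decode_front_coded(stream):
    """stream: iterable of (discard_count, suffix) pairs; yields full words."""
    word = ""
    for discard, suffix in stream:
        word = word[:len(word) - discard] + suffix
        yield word

entries = [(0, "urchlay"), (0, "ed"), (2, "ing"), (3, "s")]
words = list(decode_front_coded(entries))
# -> urchlay, urchlayed, urchlaying, urchlays
```

The decoder keeps only one word buffer, which is why it suits a 6502: no backtracking, no random access within the block.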

 

a stem word has to exist as a word in its own right, to qualify for suffix encoding. This means "apply, applied, applying" aren't recognized as the same word (because the Y changed to an I).

 

I don't have any stem handling. It just falls out naturally because groups of frequently occurring stems turn up as tokens. It turns out that my token directory builder finds better combinations of stems than ever occurred to me when writing my own stem handlers.

 

The tokens (e.g. the (fi) and (ly) and such) are found by a dumb brute-force loop that extracts all 2- to 8-character strings from every word and keeps track of how frequently they occur.

 

The on-disk structure would have to have a separator byte (a null, $00) appended to the 4th column.

 

My longest token is 22 characters: ed1r1st3ing3ly2ness0's

 

I don't use any other separators besides the front code. Any time I see a front code I assume the word is split.

 

However... I think this might be pretty slow to decode on the Atari, and/or require too much code.

 

In my case, all I have to do is detokenize. I haven't tried writing it in 6502 assembly, but it seems pretty fast in C.

 

One optimization is that you don't need to actually do all the detokenization steps if the letters you already have in your front-code buffer don't match. For a given token, you can just encode how many bytes it discards, and then skip forward through tokens counting up the bytes discarded until enough wrong bytes are gone that you have a chance at decoding the next letter of the word you are testing. (Sorry if that's confusing, I'm writing it in a hurry before bed!)

 

Nonetheless, I think this algorithm requires a sorted input to perform well on a 6502. It's much cheaper computationally to sort the input (which also discards duplicate words) than to attempt to search several kilobytes of compressed text for every single word. Orders of magnitude cheaper!

 

kskunk, how does this compare to your 10.33 bits/word structure? Am I on the right track at all?

 

You have most of the pieces I do, but I'm doing it a little differently. In particular, I think letting the tokenizer handle stems is a nice simplification.

 

Also, I think I'm benefiting by using a dictionary with 80472 words. It's much easier to get a good compression ratio on a very large data set.

 

My current stats:

 

9.4 bits/word

1787 byte token directory

 

While out enjoying my weekend I thought of some ways to improve the token directory finder, so I think I can get down to 9 bits/word or maybe lower. That should allow 80K words to fit comfortably on a 90K floppy.

 

I'll have more time to play with this later in the week.

 

- KS

