flashjazzcat #26 Posted July 27, 2009 (edited)

Some more very interesting ideas in these posts. It seems compression is more-or-less cut and dried. The only fly in the ointment would be multi-language support, in which case I think you could get away with 6 bits per character instead of 5. That would allow for A-Z, CTRL+A to CTRL+Z, plus the three other international characters. The token directory would have to change, too. Ideally, this would be stored as part of the dictionary file.

The pointer table I've suggested (which I'm certain would slash the number of sector reads) could be stored as a separate file if we decided to use the native filing system. It could be generated by the spell-checker from a DOS-independent table at the front of the dictionary file. This could be translated into a file of absolute byte/sector values every time the file was copied or moved. Conversely, if we decide to use a raw (unformatted) disk image, we don't have to worry about this at all. The first half dozen or so sectors of the disk would contain the 676-element jump table (2,028 bytes if each element was 3 bytes long). International support would cause the table to grow with the square of the alphabet size, too: we could be looking at 9K for the table if we included support for 55 characters!

The RAM-based dictionary of 200 or so of the most common words is a solid idea I've mentioned before (AtariWriter Plus uses the same technique). Coupled with something like a 2-sector read average for disk-based words, we'd probably be looking at one disk access per word on average at the worst. I think this would be quite acceptable when balanced against the inherent flexibility of the system. It means we could spell check a 100-word file by reading at most a hundred sectors, rather than first sorting those hundred words, storing them somewhere, then reading 100K of dictionary in its entirety.
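The packing and jump-table arithmetic above can be sketched in a few lines (Python purely for illustration; the terminator code 0 and the reservation of codes 27-63 for control/international characters are my assumptions, not LW's actual on-disk format):

```python
def pack_word(word):
    """Pack a lowercase a-z word into bytes, 6 bits per character.
    Code 0 is an assumed terminator; 'a' = 1 ... 'z' = 26, leaving
    codes 27-63 free for CTRL and international characters."""
    bits = 0
    nbits = 0
    out = bytearray()
    for ch in word + "\x00":            # append the terminator code
        code = 0 if ch == "\x00" else ord(ch) - ord("a") + 1
        bits = (bits << 6) | code
        nbits += 6
        while nbits >= 8:               # flush whole bytes as they fill
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:                           # pad the final partial byte
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

def prefix_index(word):
    """Index into the 676-entry (26*26) jump table for a two-letter
    prefix; at 3 bytes per entry that's the 2,028-byte table."""
    return (ord(word[0]) - ord("a")) * 26 + (ord(word[1]) - ord("a"))

print(len(pack_word("spelling")))   # 8 chars + terminator = 54 bits -> 7 bytes
print(prefix_index("zz"))           # 675, the last table entry
```

With 55 characters the same table would need 55*55 = 3,025 entries, which at 3 bytes each is the roughly 9K mentioned above.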
We could also do very quick single-word lookups, and have spelling suggestions if I can get my head around the "edit distance" algorithms I've been reading about. Perhaps if we get an English version of the program working first, international support would come next.

Edited July 27, 2009 by flashjazzcat
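The "edit distance" idea mentioned above is usually the standard Levenshtein distance; a dictionary word a short distance from the unrecognised word makes a plausible suggestion. A minimal sketch:

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b, computed
    with the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(edit_distance("spellling", "spelling"))  # 1: one extra letter
```

In practice a checker would only compare the mistyped word against dictionary words with the same or a neighbouring prefix, and offer the ones within distance 1 or 2.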
Mathy #27 Posted July 27, 2009

Hello flashjazzcat

Unformatted disks have one huge disadvantage: if you look at one using DOS, it doesn't make sense. That means that if it's a real floppy, you have to write clearly on the label what's on it. Since many of us use hard disks and/or memory cards, that might not be easy. Try to come up with an 8+3 character name that'll still tell you instantly, even in two years, that the file on the disk is the dictionary for LW. That's not gonna be easy, especially if you're gonna allow multiple dictionaries to be used.

greetings

Mathy
flashjazzcat #28 Posted July 27, 2009 (edited)

> Unformatted disks have one huge disadvantage: if you look at one using DOS, it doesn't make sense. That means that if it's a real floppy, you have to write clearly on the label what's on it. Since many of us use hard disks and/or memory cards, that might not be easy. Try to come up with an 8+3 character name that'll still tell you instantly, even in two years, that the file on the disk is the dictionary for LW. That's not gonna be easy, especially if you're gonna allow multiple dictionaries to be used.

Agreed - this stems from the problem that DOS 2.5 uses absolute values for NOTE/POINT operations. The files aren't really random access at all, since you can't use pre-computed values to read them. MyDOS is the same. If we keep the dictionaries as files (and use a quick jump table), the table will have to be kept as a separate file, rebuilt every time the dictionary file is moved. Alternatively, we could make the speller SpartaDOS X only and do away with all these problems. We'd even cut down on the size of the table (nine or ten bits would be adequate to describe the offsets to the various word sets). A drawback of using the filing system with any DOS is that there'll be additional sector read overheads involved in any seek/read operation.

Edited July 27, 2009 by flashjazzcat
Mathy #29 Posted July 27, 2009

Hello flashjazzcat

Or you use dummy directories. If you'd look at them via DOS (you could have two dummy dirs, one at sector 361 and one at sector 7), they would spell "Last Word Directory - version x.y - Dutch" (or something similar). Since they would not really have links in them (or links that lead "nowhere"), they would be "dummy" directories instead of real directories. You could even fill the VTOC and bitmap sectors with numbers that would indicate which sectors are used. All the sectors that aren't used for fooling/informing the user, or for telling DOS which sectors are used/that the disk is full, can be used for the actual dictionary data.

greetings

Mathy
Urchlay #30 Posted July 27, 2009

> The RAM-based dictionary of 200 or so of the most common words is a solid idea I've mentioned before (AtariWriter Plus uses the same technique).

True. I was having my doubts about how much a 200-word dictionary would match, so I ran those tests... it still kind of blows my mind that half the words we use are from that set of 200. I wonder if it's worthwhile to have the word processor store all instances of those 200 words as 1-byte tokens in the main text buffer... or anyway, the ones that are more than one letter. Probably not; the space savings would get eaten up by the code that does the (de)tokenizing.

> Perhaps if we get an English version of the program working first, international support would come next.

If the file format supports encoding the international characters, it would just be a matter of making the dictionary-creator program available to users (possibly as a web app). Once the dictionary format is finalized, I'd be interested in writing such a web app (in Perl or PHP), if anyone's got a place to host it.
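The 1-byte-token idea sketches out simply: each common word becomes a single byte in the text buffer (Python for illustration; the five-word list and the 128-255 token range are assumptions, and a real list would be the full top 200):

```python
# Stand-in for the top-200 common-word list discussed above.
COMMON = ["the", "of", "and", "to", "in"]
TOKEN = {w: bytes([128 + i]) for i, w in enumerate(COMMON)}  # word -> 1 byte
WORD = {t: w for w, t in TOKEN.items()}                      # 1 byte -> word

def tokenize(text):
    """Replace each common word with its single-byte token."""
    return b" ".join(TOKEN.get(w, w.encode("ascii")) for w in text.split(" "))

def detokenize(data):
    """Expand single-byte tokens back into words."""
    return " ".join(WORD.get(c) or c.decode("ascii") for c in data.split(b" "))

s = "the cat and the dog"
packed = tokenize(s)
assert detokenize(packed) == s
print(len(packed), "vs", len(s))   # 13 vs 19 bytes
```

The saving per word is small (word length minus one byte), which is why the thread's verdict - the (de)tokenizing code would eat the savings - sounds right for a 3K budget.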
flashjazzcat #31 Posted July 27, 2009 (edited)

> Or you use dummy directories. If you'd look at them via DOS (you could have two dummy dirs, one at sector 361 and one at sector 7), they would spell "Last Word Directory - version x.y - Dutch" (or something similar). Since they would not really have links in them (or links that lead "nowhere"), they would be "dummy" directories instead of real directories. You could even fill the VTOC and bitmap sectors with numbers that would indicate which sectors are used. All the sectors that aren't used for fooling/informing the user, or for telling DOS which sectors are used/that the disk is full, can be used for the actual dictionary data.

A perfect solution, I think. Once again, AtariWriter Plus's dictionary is arranged like this, IIRC.

> I was having my doubts about how much a 200-word dictionary would match, so I ran those tests... still kind of blows my mind that half the words we use are from the set of 200. I wonder if it's worthwhile to have the word processor store all instances of those 200 words as 1-byte tokens in the main text buffer... or anyway, the ones that are more than one letter. Probably not, the space savings would get eaten up by the code that does the (de)tokenizing.

That last idea - very cool! It would fall down for reasons of speed as well as code size, though. Still - I like your thinking. Speaking of compression, LW's loading screen is run-length encoded: a whole GR.8 display is held in less than 4K (IIRC). Decompression (though fast) is a one-off operation there, and wasn't really an issue.

> If the file format supports encoding the international characters, it would just be a matter of making the dictionary-creator program available to users (possibly as a web app). Once the dictionary format is finalized, I'd be interested in writing such a web app (in Perl or PHP), if anyone's got a place to host it.

This particular user wants a dictionary-creator program ASAP, please!
I'm too busy typing up the rewritten manual for LW and debugging the software to write an encoder at the moment (even if I knew how!). I feel the decoder will be somewhat easier to write (in assembler). If you write it in PHP, the natural place to host it would be www.atari8.co.uk (I certainly have PHP on the server). If you want to write the web app, you're guaranteed a place in the hallowed pages of the LW 3.0 user manual!

Edited July 27, 2009 by flashjazzcat
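The run-length encoding mentioned above for the loading screen works well here because a GR.8 display (40 bytes x 192 scan lines = 7,680 bytes) is full of long runs. A generic (count, byte) sketch - the exact on-disk format LW uses is not stated in the thread, so this is an assumption:

```python
def rle_encode(data):
    """Encode as (count, byte) pairs, run lengths 1-255."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        run = 1
        while i + run < len(data) and data[i + run] == b and run < 255:
            run += 1
        out += bytes([run, b])
        i += run
    return bytes(out)

def rle_decode(data):
    """Expand (count, byte) pairs back into the original bytes."""
    out = bytearray()
    for i in range(0, len(data), 2):
        out += bytes([data[i + 1]]) * data[i]
    return bytes(out)

screen = bytes(40 * 192)            # a blank GR.8 frame: 7,680 bytes
packed = rle_encode(screen)
assert rle_decode(packed) == screen
print(len(packed))                  # 62 bytes for the blank frame
```

The decoder side is indeed the easy half, which matches the remark above about writing it in assembler first: it's just a loop copying a byte N times per pair.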
kskunk #32 Posted July 28, 2009 (edited)

> It means we could spell check a 100-word file by reading at most a hundred sectors, rather than first sorting those hundred words, storing them somewhere, then reading 100K of dictionary in its entirety.

I think I was unclear when I mentioned reading the file end to end. I wasn't suggesting reading the entire dictionary, simply seeking forward to the appropriate sectors by prefix. But pre-sorting the input allows you to do the seeks in order instead of randomly, which of course is never slower and usually much faster. Duplicate words are skipped, too. In a little simulation I was doing last Thursday, pre-sorting reduced the number of disk reads by a factor of 20. But I didn't try the top-200-words thing. I'll look at that too!

Pre-sorting the input isn't so bad. You don't need a lot of memory - 512 bytes is plenty for 32KB of text. I don't know what size text your worst case is, though.

- KS

Edited July 28, 2009 by kskunk
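The gain from pre-sorting can be sketched as follows: after sorting the unique words, all words sharing a two-letter prefix land next to each other, so each distinct prefix costs one forward seek instead of one random seek per word (Python for illustration; the two-letter prefix scheme follows the jump table discussed earlier in the thread):

```python
def seek_plan(words):
    """Return the distinct two-letter prefixes to visit, in order.
    Each prefix corresponds to one forward seek into the dictionary;
    sorting makes the seeks sequential and drops duplicate words."""
    unique = sorted(set(w.lower() for w in words))
    plan = []
    for w in unique:
        p = w[:2]
        if not plan or plan[-1] != p:
            plan.append(p)
    return plan

text = "the cat sat on the mat and the cat ran".split()
print(seek_plan(text))
# ['an', 'ca', 'ma', 'on', 'ra', 'sa', 'th'] - 7 forward seeks for 10 words
```

Checked in order, the drive head only ever moves forward through the dictionary, which is where the factor-of-20 reduction in the simulation comes from.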
flashjazzcat #33 Posted July 28, 2009

> I think I was unclear when I mentioned reading the file end to end. I wasn't suggesting reading the entire dictionary, simply seeking forward to the appropriate sectors by prefix. But pre-sorting the input allows you to do the seeks in order instead of randomly, which of course is never slower and usually much faster. Also duplicate words are skipped. In a little simulation thing I was doing last Thursday, pre-sorting reduced the number of disk reads by a factor of 20. But I didn't try the top 200 words thing. I'll look at that too! Pre-sorting the input isn't so bad. You don't need a lot of memory -- 512 bytes is plenty for 32KB of text. I don't know what size text your worst case is though.

Ah - I think I understand now. I really wanted it to work like AtariWriter Plus and progress through the text in a visually linear fashion. There might be a performance hit this way, but I think the 200-word memory-resident dictionary will soak a lot of that up. The text banks are never more than 16K, and you could always store the sorted words in an extended bank. But there's still going to be a pre-processing pass when sorting the words, and the words offered for correction (if we're going to insist on doing it alphabetically) will appear out of context.

We must also remember that the code for the spell-checker must not exceed 3K, so simplicity is a must. The software will mostly be making calls to routines in the main LW executable, so apart from the detokenisation, most of the program will be JSRs and calls to the PRINTF routine and disk I/O functions. It should be perfectly "doable".
flashjazzcat #34 Posted August 10, 2009

Has anyone done anything more with these compressed dictionary files? I'm afraid I haven't had time to write a tokeniser yet.
flashjazzcat #35 Posted October 13, 2009

I've had a rethink on the spelling checker, and simplicity seems the most advisable imperative. I was reading a review of OSS's "The Writer's Tool" (about the only Atari 8-bit word processor I haven't been able to track down a copy of yet), and it mentioned that the spelling checker first alphabeticised the words in the file to be checked, before scanning them against the entire dictionary.

Now, with The Last Word, given that the spell checker will only be available on expanded-memory machines, we might reasonably expect to have at least one 16K bank devoted to running the spell checker. If we split the on-disk dictionary into 16K chunks, not only would they remain editable, but we could keep them in alpha order (by keeping the whole segment in RAM while it's being scanned) and so speed up searches by using a binary chop, and insert any new words using the same method. In main RAM, we also hold a list of the 100 or so most common words, which will enable us to blast through a fair percentage of the text.

The proofreading stages would be as follows:

1. In a bank of extended RAM, create a set of 2-byte pointers which refer to unique words in the target file, in alphabetical order. If a word occurs more than once, the pointer refers to the first occurrence.

2. Scan the 100 most common words list against the target file. If a word is matched, set the pointer in the table to NULL (the word is spelled correctly).

3. For each dictionary "segment" file, read the file (of up to 16K) into another extended bank, and scan every word in the dictionary against the words pointed to in the table. If a word is found in the dictionary, set the pointer to NULL.

4. What remains is a list of unrecognized words.

The trouble with this system is we can't suggest "possible matches", because we don't know that a word's unrecognized until we've worked our way through the entire dictionary.
Also, we can't double-check words against the dictionary after getting the user to correct them. Added words just get inserted into the last dictionary segment file before it's written out to disk again. Corrected words just get inserted back into the text at the pointer address, using the original word-delimiting logic to first remove the offending word (we'll have to take care to adjust all the pointer values to account for the corrected word being longer or shorter than the original and changing the position of subsequent words in the list. In fact, that would be a real headache, since the pointers won't even be in linear order any more...).

Anyway, that's the plan.
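The four proofreading stages above can be sketched in miniature (Python for illustration only: here the "pointers" are word strings set to None rather than 2-byte addresses, the common-word list and dictionary segments are tiny stand-ins, and the linear scan in stage 3 is where the binary chop would go on a real 16K segment):

```python
COMMON_100 = {"the", "a", "and", "is", "of"}      # stand-in for the top 100
SEGMENTS = [["cat", "dog", "mat"], ["on", "sat"]]  # stand-ins for 16K segments

def unrecognized_words(text):
    # Stage 1: unique words of the target file, in alphabetical order.
    pointers = sorted(set(text.lower().split()))
    # Stage 2: NULL out anything matched by the common-word list.
    pointers = [w if w not in COMMON_100 else None for w in pointers]
    # Stage 3: scan each dictionary segment against the surviving words
    # (on the Atari this would be a binary chop within the 16K bank).
    for segment in SEGMENTS:
        pointers = [w if w not in segment else None for w in pointers]
    # Stage 4: whatever is still non-NULL is unrecognized.
    return [w for w in pointers if w is not None]

print(unrecognized_words("the cat sat on teh mat"))  # ['teh']
```

This also makes the stated limitation concrete: nothing can be reported or suggested until the last segment has been scanned, because any earlier segment might still have recognised the word.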