
2600 Words (Wordle clone)


Karl G


23 minutes ago, Thomas Jentzsch said:

If that's too slow, you could start searching in the background while the player is still active, e.g. as soon as the first letter is set. 

I didn't think of doing this, but it makes sense, since letters have to be entered in order from left to right.


36 minutes ago, Thomas Jentzsch said:

I just did some coarse math based on the 2,315 Wordle words. Unfortunately the gain from Huffman compression is not very impressive. Instead of 2315*5*5 = 57,875 bits, the compressed data would require ~50,100 bits. So that would only save ~13%. Maybe the result becomes a bit better with Scrabble words, but the difference shouldn't be significant.

IMO you could do a bit better by rotating the dictionary (all first letters, followed by all second letters, ...) and using RLE. When mksmith started the 7800 version, I did some analysis of related approaches and was able to get the ten-thousand-word dictionary down to a bit under 20K. It wasn't enough to avoid bankswitching, so Matt carried on without compression. It sounds like a similar calculus will play out here, since sufficient banked ROM is available.
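
For illustration, here's a minimal sketch of that idea in Python (nothing from the actual 7800 work): transpose the sorted dictionary into columns of first letters, second letters, and so on, then run-length encode each column. Sorted input makes the early columns extremely repetitive, which is where the gain comes from.

def rotate_and_rle(words):
    """Per-column run-length encoding of a sorted word list."""
    encoded = []
    for pos in range(5):                      # five letter positions
        column = [w[pos] for w in words]      # e.g. all first letters
        runs = []
        for ch in column:
            if runs and runs[-1][0] == ch:
                runs[-1][1] += 1              # extend the current run
            else:
                runs.append([ch, 1])          # start a new run
        encoded.append(runs)
    return encoded

words = ["aback", "abase", "abate", "abbey"]  # tiny stand-in for the real list
print(rotate_and_rle(words)[0])               # -> [['a', 4]], one run covers column 0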


2 hours ago, Karl G said:

For the bigger word validation dictionary, I'll likely have address lookup tables for every combination of the first two letters of a word, which will allow me to store only the last 3 letters of each word to save space and speed up the lookup for word validation.

For optimal compression, I think this idea could be extended. If we store the words in alphabetical order, then for each consecutive word we can store the number of leading letters (0..4) it shares with the previous word.

 

Example:

  • aback
  • abase
  • abate
  • abbey

These would be encoded as "aback3se3te2bey". 
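
A quick sketch in Python (illustrative only) that reproduces this encoding:

def prefix_encode(words):
    out = words[0]                        # first word stored in full
    prev = words[0]
    for w in words[1:]:
        n = 0
        while n < 5 and w[n] == prev[n]:  # leading letters shared with previous word
            n += 1
        out += str(n) + w[n:]             # shared count, then the differing tail
        prev = w
    return out

print(prefix_encode(["aback", "abase", "abate", "abbey"]))  # -> aback3se3te2bey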

 

As a next step, we could encode deltas instead of letters (with wrap-around). Initially we start at 00000 (= AAAAA). Then our example would be encoded as:

a  b  a  c  k  3  s  e  3  t  e  2  b  e  y
0  1  0  2 10  3 16 20  3  1  0  2  1 11 20

The idea is to profit from the ordering and get as many low numbers as possible. In the end we will have a lot of 0s and 1s, some 2s, 3s and 4s, and only a few larger numbers (up to 25). This can then be encoded very efficiently with Huffman encoding.

 

Instead of storing the lengths of the matching first letters, we could just use 0s. Since we know those will encode very well, this might also become quite effective:

a  b  a  c  k  a  b  a  s  e  a  b  a  t  e  a  b  b  e  y
0  1  0  2 10  0  0  0 16 20  0  0  0  1  0  0  0  1 11 20

Hope that makes sense. :) 
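
For concreteness, here is the second variant as a small Python sketch (illustrative only): every letter becomes its wrap-around distance from the same position in the previous word, starting from an implicit "aaaaa".

def delta_encode(words):
    symbols = []
    prev = "aaaaa"                                  # implicit starting word
    for w in words:
        for p, c in zip(prev, w):
            symbols.append((ord(c) - ord(p)) % 26)  # wrap-around delta, 0..25
        prev = w
    return symbols

print(delta_encode(["aback", "abase", "abate", "abbey"]))
# -> [0, 1, 0, 2, 10, 0, 0, 0, 16, 20, 0, 0, 0, 1, 0, 0, 0, 1, 11, 20]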

 

And it would probably make sense to add RLE for each letter position. As you can see, the first letter will almost always be encoded as 0, and the 2nd one too. Even the 3rd letter will still have a lot of 0s.

 

a b a c k 0 1 0 2 10
a b a s e 0 0 0 16 20
a b a t e 0 0 0 1 0
a b b e y 0 0 1 11 20
a b b o t 0 0 0 10 21
a b h o r 0 0 6 0 24
a b i d e 0 0 1 15 13
a b l e d 0 0 3 1 25
a b o d e 0 0 3 25 1
a b o r t 0 0 0 14 15
a b o u t 0 0 0 3 0
a b o v e 0 0 0 1 11
a b u s e 0 0 6 23 0
a b y s s 0 0 4 0 14
a c o r n 0 1 16 25 21
a c r i d 0 0 3 17 16
a c t o r 0 0 2 6 14
a c u t e 0 0 1 5 13
a d a g e 0 1 6 13 0
a d a p t 0 0 0 9 15
a d e p t 0 0 4 0 0
a d m i n 0 0 8 19 20
a d m i t 0 0 0 0 6
a d o b e 0 0 2 19 11
a d o p t 0 0 0 14 15
a d o r e 0 0 0 2 11
a d o r n 0 0 0 0 9
a d u l t 0 0 6 20 6
a f f i x 0 2 11 23 4
a f i r e 0 0 3 9 7
a f o o t 0 0 6 23 15
a f o u l 0 0 0 6 18
a f t e r 0 0 5 10 6
a g a i n 0 1 7 4 22
a g a p e 0 0 0 7 17
a g a t e 0 0 0 4 0
a g e n t 0 0 4 20 15
a g i l e 0 0 4 24 11
a g i n g 0 0 0 2 2
a g l o w 0 0 3 1 16
a g o n y 0 0 3 25 2
a g o r a 0 0 0 4 2
a g r e e 0 0 3 13 4
a h e a d 0 1 13 22 25
a i d e r 0 1 25 4 14
a i s l e 0 0 15 7 13
a l a r m 0 3 8 6 8
a l b u m 0 0 1 3 0
a l e r t 0 0 3 23 7
a l g a e 0 0 2 9 11
a l i b i 0 0 2 1 4
a l i e n 0 0 0 3 5
a l i g n 0 0 0 2 0
a l i k e 0 0 0 4 17
a l i v e 0 0 0 11 0
a l l a y 0 0 3 5 20
a l l e y 0 0 0 4 0
a l l o t 0 0 0 10 21
a l l o w 0 0 0 0 3
a l l o y 0 0 0 0 2
a l o f t 0 0 3 17 21
a l o n e 0 0 0 8 11
a l o n g 0 0 0 0 2
a l o o f 0 0 0 1 25
a l o u d 0 0 0 6 24
a l p h a 0 0 1 13 23
a l t a r 0 0 4 19 17
a l t e r 0 0 0 4 0
a m a s s 0 1 7 14 1
a m a z e 0 0 0 7 12
a m b e r 0 0 1 5 13
a m b l e 0 0 0 7 13
a m e n d 0 0 3 2 25
a m i s s 0 0 4 5 15
a m i t y 0 0 0 1 6
a m o n g 0 0 6 20 8
a m p l e 0 0 1 24 24
a m p l y 0 0 0 0 20
a m u s e 0 0 5 7 6
a n g e l 0 1 12 12 7
a n g e r 0 0 0 0 6
a n g l e 0 0 0 7 13
a n g r y 0 0 0 6 20
a n g s t 0 0 0 1 21
a n i m e 0 0 2 20 11
a n k l e 0 0 2 25 0
a n n e x 0 0 3 19 19
a n n o y 0 0 0 10 1
a n n u l 0 0 0 6 13
a n o d e 0 0 1 9 19
a n t i c 0 0 5 5 24
a n v i l 0 0 2 0 9
a o r t a 0 1 22 11 15
a p a r t 0 1 9 24 19
a p h i d 0 0 7 17 10
a p i n g 0 0 1 5 3
a p n e a 0 0 5 17 20
a p p l e 0 0 2 7 4
a p p l y 0 0 0 0 20

 

BTW: I haven't calculated the first option yet. The 2nd one compresses (without RLE) the Wordle dictionary from 57,875 to ~35,500 bits (~54% are 0s!).
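
The ~35,500-bit figure can be estimated along these lines (Python sketch; the four-word demo stands in for the real 2,315-word list). It uses the standard trick that the total Huffman-coded length equals the sum of the weights of all merged nodes.

import heapq
from collections import Counter

def huffman_total_bits(counts):
    """Total bits to Huffman-encode a stream with these symbol counts."""
    heap = list(counts)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b                    # each merge adds (a+b) bits overall
        heapq.heappush(heap, a + b)
    return total

def delta_encode(words):                  # same scheme as sketched above
    prev, out = "aaaaa", []
    for w in words:
        out += [(ord(c) - ord(p)) % 26 for p, c in zip(prev, w)]
        prev = w
    return out

demo = delta_encode(["aback", "abase", "abate", "abbey"])
print(huffman_total_bits(Counter(demo).values()), "bits for the demo list")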


9 hours ago, Karl G said:

For the bigger word validation dictionary, I'll likely have address lookup tables for every combination of the first two letters of a word, which will allow me to store only the last 3 letters of each word to save space and speed up the lookup for word validation.

This is pretty much the "related" approach I was talking about. The majority of the repetition is in the first two letters, after which the RLE-type compressibility peters out.

 

Heads up that there are 310 unique first+second letter combinations in that 10k-entry dictionary, with the biggest entry being "co", which has 191 entries. So no one table is larger than your bank size, which is good.

 

You'll need one or two LUTs to figure out which bank+table is holding your remaining 3 letters. Squeezing the 3-letter entries down to 2 bytes will give you a total of 20,672 bytes of storage, in addition to the first+second index table(s). Plus a bit more, since the tables will be variable length, and you'll also lose some space to packing, since each set of tables needs to be contained within 4K.
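
Packing three letters into two bytes works because three 5-bit letter codes need only 15 bits; here is a Python sketch of one such packing (on the 6502 this would be a few shifts and ORs):

def pack3(tail):
    """Pack a 3-letter tail like 'ack' into two bytes."""
    a, b, c = (ord(ch) - ord('a') for ch in tail)   # 5-bit codes, 0..25
    v = (a << 10) | (b << 5) | c                    # 15 bits used, top bit spare
    return v >> 8, v & 0xFF                         # high byte, low byte

def unpack3(hi, lo):
    v = (hi << 8) | lo
    return ''.join(chr(ord('a') + ((v >> s) & 0x1F)) for s in (10, 5, 0))

hi, lo = pack3("ack")                               # tail of "aback"
print(hi, lo, unpack3(hi, lo))                      # -> 0 74 ack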


7 minutes ago, RevEng said:

This is pretty much the "related" approach I was talking about. The majority of the repetition gain is in the first two letters, after which the RLE-type compressibility peters out almost completely.

Right. If my math is correct, that approach should be good enough to fit everything into my ROM with space to spare, without additional compression.

 

11 minutes ago, RevEng said:

Heads up that there are 310 unique first+second letter combinations in that 10k-entry dictionary, with the biggest entry being "co", which has 191 entries.

That's handy to know. I was wondering what my worst-case scenario would be for 5-letter words sharing the first two letters.

 

14 minutes ago, RevEng said:

You'll need one or two LUTs to figure out which table is holding your remaining 3 letters. Squeezing the 3-letter entries down to 2 bytes will give you a total of 20,672 bytes of storage, in addition to the first+second index table(s).

Well, I guess if I store 3 letters in 2 bytes, then it won't be hard to find the word boundaries, at least. I don't follow the lookup tables part, though. Having a table for each two-letter combo with a 2-byte address for each third letter would be 310 × 2 × 26 = 16,120 bytes, right?


10 minutes ago, Karl G said:

Well, I guess if I store 3 letters in 2 bytes, then it won't be hard to find the word boundaries, at least. I don't follow the lookup tables part, though. Having a table for each two-letter combo with a 2-byte address for each third letter would be 310 × 2 × 26 = 16,120 bytes, right?

Turn it on its head. If you build a 310-entry LUT with each combination of two-letter indexes, you can search for your specific word's two-letter entry in the table. When you find your entry, its index is unique, and can be used to point to a specific bank+table. The one less-attractive bit is that the table is >256 entries, but it's otherwise easy enough.
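
In other words (an illustrative Python sketch, with made-up names):

def build_pair_lut(words):
    """The ~310 first+second letter combos that actually occur, sorted."""
    return sorted({w[:2] for w in words})

def find_pair_index(pairs, word):
    """Linear scan for the word's two-letter entry; the index that comes back
    is unique and selects the bank + tail table to search next."""
    try:
        return pairs.index(word[:2])
    except ValueError:
        return None                # no valid word starts with these two letters

pairs = build_pair_lut(["aback", "abase", "acorn", "crane"])
print(find_pair_index(pairs, "crane"))   # -> 2, i.e. the third bank/table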

 

9 minutes ago, Thomas Jentzsch said:

No ATARI in either list?! :x 

LOL.

 

I started to put together a hobby/gaming-related dictionary, but it was tough to come up with very many.

 

 

amiga
apple
atari
bonus
bubsy
bytes
carts
cheat
disks
eprom
exidy
gamer
giana
hacks
iwata
joust
kirby
mario
mappy
mspac
namco
pengo
pixel
ports
qbert
quake
rtype
retro
rogue
samus
score
shoot
snafu
spawn
spyro
stage
steam
sonic
taito
tandy
tapes
valve
video
voxel
wario
yoshi
zelda

 


13 minutes ago, RevEng said:

Turn it on its head. If you build a 310-entry LUT with each combination of two-letter indexes, you can search for your specific word's two-letter entry in the table. When you find your entry, its index is unique, and can be used to point to a specific bank+table. The one less-attractive bit is that the table is >256 entries, but it's otherwise easy.

Okay; I misunderstood. I thought you meant additional lookup tables to navigate the data for that particular two-letter combo after the address was found.

 

How did you guys implement the valid word search in a timely manner? I guess in my case, with a worst case of 191 entries for a 2-letter combo, it might not be too bad to brute force it, but I'm curious whether you had a more clever approach.


9 minutes ago, Karl G said:

Okay; I misunderstood. I thought you meant additional lookup tables to navigate the data for that two-letter combo.

 

How did you guys implement the valid word search in a timely manner? I guess in my case, with a worst case of 191 entries for a 2-letter combo, it might not be too bad to brute force it, but I'm curious whether you had a more clever approach.

Matt went with a single-letter index, since that code was simpler and the 7800 has 16K banks, which is sufficient for its worst-case table size.

 

I'd start with the brute force, as it's not that many entries, and you can spread the whole process over multiple frames if needed. But if I were pressed to speed it up, I'd throw a single-letter LUT at it, which returns the start of that letter's entries in the two-letter LUT. Binary search is also a possible approach, but IMO it's not worth the extra effort.
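
A sketch of that single-letter LUT in Python (the 6502 version would just be a 26+1 entry offset table): each first letter maps to where its combos start in the sorted two-letter LUT, so only that letter's slice gets scanned.

def build_first_letter_lut(pairs):
    """pairs: sorted two-letter combos. Returns 27 offsets (26 letters + sentinel)."""
    lut, j = [], 0
    for i in range(26):
        lut.append(j)                       # where this letter's combos begin
        while j < len(pairs) and pairs[j][0] == chr(ord('a') + i):
            j += 1
    lut.append(len(pairs))                  # sentinel end
    return lut

pairs = ["ab", "ac", "ad", "ba", "be", "ca", "co"]   # toy two-letter LUT
lut = build_first_letter_lut(pairs)
b = ord('b') - ord('a')
print(pairs[lut[b]:lut[b + 1]])             # -> ['ba', 'be'], the only entries scanned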


I believe you could use a bitwise trie with 26 bits per entry.

So the first 26 bits represent the first letter (e.g. if the first letter is A, the first bit will be on).

Then, for every bit that is set, you have another 26-bit entry representing the possible second letters.

This means you can represent all possible 2-letter words in a maximum of 27 entries (each entry just over three bytes). However, since not all combinations are valid words, the count quickly shrinks: in the two-letter example, if you have 10 real words, you'll need at worst 11 entries, and that's only if no two words start with the same letter.

 

So for 5 letters, I believe you'd be able to encode all valid words in about 4KB.
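
One way to sanity-check that estimate (Python sketch): every distinct prefix of length 0..4 becomes one 26-bit node, and complete words need no node of their own, so the total is just a prefix count.

def trie_bits(words):
    nodes = 1                               # root node holds the first-letter bits
    for depth in range(1, 5):
        nodes += len({w[:depth] for w in words})   # one node per distinct prefix
    return nodes * 26                       # 26 bits per node

words = ["aback", "abase", "abate", "abbey"]       # stand-in for a real list
print(trie_bits(words), "bits")                    # -> 234 bits for these 4 words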

 

The challenge is reading this efficiently... 



I wrote the code that compressed the word dictionary for the NES game Wheel of Fortune, many years ago. I didn't know much about encoding back then, so I created a weird sort of encoding system that described words as a kind of mini-program. These days I guess I'd classify it as a variant of Huffman, although it was implemented at a higher level in human-readable format. Basically something like this (and remember, it's been over 25 years since I thought about this...): SHAR(E|K) --> describes two words, SHARE and SHARK. However, it was much more sophisticated than just this simple example. I can't remember the savings I got, and it's probably not of any use here, but I thought I'd jot it down for posterity because I don't think I've ever referred to it before.

We had to remove all potential "naughty words" from the dictionary because Nintendo didn't want to offend "the moral majority" in the USA.
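
Purely for illustration (Python; the real encoder was, as said, far more sophisticated): a toy version that folds words differing only in the last letter into one SHAR(E|K)-style pattern.

from collections import defaultdict

def stem_patterns(words):
    groups = defaultdict(list)
    for w in sorted(words):
        groups[w[:-1]].append(w[-1])        # stem -> alternative last letters
    return [s.upper() + "(" + "|".join(e.upper() for e in ends) + ")"
            if len(ends) > 1 else (s + ends[0]).upper()
            for s, ends in groups.items()]

print(stem_patterns(["share", "shark", "shard"]))  # -> ['SHAR(D|E|K)']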

 


2 hours ago, Bit Expander said:

I believe you could use a bitwise trie with 26 bits per entry.

Interesting idea. But I am afraid the overhead for addressing the tree entries will become a problem. For each bit set, you need a pointer to the next tree entry. 


Just now, Albert said:

It would be interesting to use  AtariVox/SaveKey storage to remember where you are in the word list, so it could be walked through sequentially instead of just choosing random words. You could even save the state of the current game so you can come back to it later!

 

  ..Al

This was what I was thinking as well. It would be easy to implement either saving an index or a seed to ensure unique games each session. 


12 minutes ago, Karl G said:

This was what I was thinking as well. It would be easy to implement either saving an index or a seed to ensure unique games each session.

Awesome! Might also be cool if it kept a counter of how many words you've completed, possibly showing it as something like "0046/2315", and maybe even toggling between that and "2%", lol.

 

 ..Al


@Karl G Very nice mate!

 

I'll send you my source so you can see how I did it. As Mike mentioned, it's just brute force, checking each character and continuing or exiting as required. For the answers list I check the whole thing, but for the available-words list I just check the required set based on the first char.


7 hours ago, Thomas Jentzsch said:

Interesting idea. But I am afraid the overhead for addressing the tree entries will become a problem. For each bit set, you need a pointer to the next tree entry. 

You don't need pointers if you put the bit entries in a vector. So, for example, if you have 4 bits on in the first entry, for the letters A, C, F and G, you know that the next entries come immediately afterwards: first for A, then for C, etc.

 

It means you'll need to walk through uninteresting data just to find the relevant part, without the ability to efficiently skip ahead, but it's much more compact.
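
One way to realize that layout as a sketch (Python; it assumes subtrees are stored depth-first, one after another for each set bit): to reach a given child you first decode and skip the subtrees of its earlier siblings, which is exactly the walk through "uninteresting data" described above.

def skip_subtree(nodes, i, depth):
    """Return the index just past the subtree rooted at nodes[i]."""
    if depth == 4:                          # a depth-4 node's bits are the last
        return i + 1                        # letters; nothing follows beneath it
    j = i + 1
    for b in range(26):
        if nodes[i] >> b & 1:
            j = skip_subtree(nodes, j, depth + 1)
    return j

def contains(nodes, word):
    i = 0
    for depth, ch in enumerate(word):
        b = ord(ch) - ord('a')
        if not (nodes[i] >> b & 1):
            return False                    # letter not present at this node
        if depth == 4:
            return True                     # fifth letter's bit confirmed
        j = i + 1                           # first child follows its parent
        for prior in range(b):              # skip earlier siblings' subtrees
            if nodes[i] >> prior & 1:
                j = skip_subtree(nodes, j, depth + 1)
        i = j
    return True

def build(words, depth=0):
    """Flatten a sorted word list into the depth-first node vector."""
    mask = 0
    for w in words:
        mask |= 1 << (ord(w[depth]) - ord('a'))
    out = [mask]
    if depth < 4:
        for b in range(26):
            if mask >> b & 1:
                ch = chr(ord('a') + b)
                out += build([w for w in words if w[depth] == ch], depth + 1)
    return out

nodes = build(sorted(["aback", "abase", "abate", "crane"]))
print(contains(nodes, "abase"), contains(nodes, "abbey"))   # -> True False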


2 hours ago, Bit Expander said:

You don't need pointers if you put the bit entries in a vector. So, for example, if you have 4 bits on in the first entry, for the letters A, C, F and G, you know that the next entries come immediately afterwards: first for A, then for C, etc.

 

It means you'll need to walk through uninteresting data just to find the relevant part, without the ability to efficiently skip ahead, but it's much more compact.

Yes, that would work, but worst case you would have to decode (almost) the whole dictionary. 

