Game Text Encoding Problem


8 replies to this topic

#1 DanBoris OFFLINE  

DanBoris

    Dragonstomper

  • 970 posts
  • Location:New Jersey, USA

Posted Sat Jul 17, 2010 10:22 AM

I need some help figuring out something, and thought some of the people here might spot whatever I am missing...

I'm playing around with decoding the format of the MS-DOS Adventure Construction Set game files. Some of it has been pretty easy to figure out, but when I got to the object names I found that the text is compressed, and I can't quite figure out the logic. Here is what I know:

- Each character can be one of 40 characters (26 letters, 10 digits, and 4 symbols)
- Every three characters are encoded into 2 bytes
- Here are some of the encodings I have seen, showing the character, hex values and binary values:

	1   = C0 A8    11000000 10101000
	A   = 40 06    01000000 00000110
	B   = 80 0C    10000000 00001100
	C   = C0 12    11000000 00010010
	D   = 00 19    00000000 00011001
	E   = 40 1F    01000000 00011111
	F   = 80 25    10000000 00100101

	A   = 40 06    01000000 00000110
	AA  = 38 13    00111000 00010011
	AAA = 69 06    01101001 00000110

	ABC = 93 06    10010011 00000110

Anyone have any ideas on this?

#2 GroovyBee OFFLINE  

GroovyBee

    Games Developer

  • 7,949 posts
  • Busy bee!
  • Location:North, England

Posted Sat Jul 17, 2010 10:45 AM

It's some form of radix 40

e.g.

(X-'A')*1
+ (Y-'A')*40
+ (Z-'A')*1600

That'll fit into an unsigned 16 bit word.

#3 SeaGtGruff OFFLINE  

SeaGtGruff

    Quadrunner

  • 5,455 posts
  • Location:Georgia, USA

Posted Sat Jul 17, 2010 11:33 AM

It's some form of radix 40

e.g.

(X-'A')*1
+ (Y-'A')*40
+ (Z-'A')*1600

That'll fit into an unsigned 16 bit word.

Yes, I said the same thing, but I lost my internet connection while I was typing my reply, so it got lost when I tried to post it. :(

Notice the pattern with some of the values shown:

b = 00 00 (I'm guessing that 00 00 is a space) ?
A = 40 06 (add 40 06)
B = 80 0C (add 40 06)
C = C0 12 (add 40 06)
D = 00 19 (add 40 06, then add the carry flag)
E = 40 1F (add 40 06)
F = 80 25 (add 40 06)
G = C0 2B (add 40 06) ?
H = 00 32 (add 40 06, then add the carry flag) ?
etc.

Michael

Edit: Also, notice that it takes the same number of bytes to code 1, 2, or 3 characters, further pointing to a base-40 system.

The only other way I know to code 3 characters in 2 bytes is to split the bits, 5 bits per character, with 1 bit left over, but that gives only 32 characters.
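
To make the size argument concrete, here is a rough Python check. The helper name `fits_in_word` is made up for illustration; it only counts how many distinct three-character strings an alphabet produces and compares that with the 65536 values two bytes can hold.

```python
# Two bytes can hold 2**16 = 65536 distinct values.
WORD_VALUES = 2 ** 16

def fits_in_word(alphabet_size, chars_per_word=3):
    """True if every string of chars_per_word symbols can get its own 16-bit value."""
    return alphabet_size ** chars_per_word <= WORD_VALUES

print(40 ** 3, fits_in_word(40))  # 40 symbols: 64000 combinations
print(41 ** 3, fits_in_word(41))  # 41 symbols: 68921 combinations
print(2 ** 5)                     # 5 bits per character: only 32 symbols
```

So 40 is the largest alphabet for which three characters still fit in one 16-bit word, which matches the 40 characters Dan counted.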

Edited by SeaGtGruff, Sat Jul 17, 2010 11:35 AM.


#4 SeaGtGruff OFFLINE  

SeaGtGruff

    Quadrunner

  • 5,455 posts
  • Location:Georgia, USA

Posted Sat Jul 17, 2010 12:05 PM

Anyone have any ideas on this?

Adding to my previous comments, I suggest looking at the encoding systematically (b = SPACE):

bbb = ?? ??
bbA = ?? ??
bbB = ?? ??
bbC = ?? ??
etc.

That should give you the values for the 1s place (0 to 39).

The rest should be a matter of just multiplying by decimal 40 for the 10s place, or by decimal 1600 for the 100s place, but you could verify that systematically:

bAb = ?? ?? (should be the same as bbA times decimal 40)
bBb = ?? ?? (should be the same as bbB times decimal 40)
bCb = ?? ?? (should be the same as bbC times decimal 40)
etc.

Abb = 40 06 (should be the same as bAb times decimal 40, or bbA times decimal 1600)
Bbb = 80 0C
etc.

But you'd have to take the carry into consideration, since it appears that the carry might be getting added back to the lo byte?

Michael

#5 SeaGtGruff OFFLINE  

SeaGtGruff

    Quadrunner

  • 5,455 posts
  • Location:Georgia, USA

Posted Sat Jul 17, 2010 12:14 PM

Another thought:

I think the values shown are lo byte first:

Abb = hex 40 06 = $0640 = decimal 1600 = 1*1600
Bbb = hex 80 0C = $0C80 = decimal 3200 = 2*1600
Cbb = hex C0 12 = $12C0 = decimal 4800 = 3*1600
Dbb = hex 00 19 = $1900 = decimal 6400 = 4*1600
etc.
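
A quick way to check the lo-byte-first reading against the single-letter samples from the first post is sketched below. The function name `word_from_bytes` is just an illustrative choice; the byte pairs are the ones Dan listed.

```python
def word_from_bytes(lo, hi):
    """Combine two bytes stored lo byte first into one 16-bit value."""
    return lo | (hi << 8)

# Byte pairs exactly as listed in the first post (first byte, second byte).
samples = {
    "A": (0x40, 0x06),
    "B": (0x80, 0x0C),
    "C": (0xC0, 0x12),
    "D": (0x00, 0x19),
    "E": (0x40, 0x1F),
    "F": (0x80, 0x25),
}

for letter, (lo, hi) in samples.items():
    word = word_from_bytes(lo, hi)
    quotient, remainder = divmod(word, 1600)
    # A remainder of 0 means the value is an exact multiple of 1600.
    print(letter, f"${word:04X}", word, quotient, remainder)
```

Each sample comes out as an exact multiple of 1600, with the multiplier stepping 1, 2, 3, ... from A to F.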

Michael

#6 GroovyBee OFFLINE  

GroovyBee

    Games Developer

  • 7,949 posts
  • Busy bee!
  • Location:North, England

Posted Sat Jul 17, 2010 12:17 PM

I think the values shown are lo byte first:


Makes sense, since the files come from an x86-based machine, which is little-endian.

#7 SeaGtGruff OFFLINE  

SeaGtGruff

    Quadrunner

  • 5,455 posts
  • Location:Georgia, USA

Posted Sat Jul 17, 2010 12:30 PM

This seems to work for some, but not all, of the examples you posted:

bbb = $0000 = 0
bbA = $0001 = 1
bbB = $0002 = 2
bbC = $0003 = 3

bAb = $0028 = 1*40
bBb = $0050 = 2*40
bCb = $0078 = 3*40

Abb = $0640 = 1*1600
Bbb = $0C80 = 2*1600
Cbb = $12C0 = 3*1600

AAb = $0640+$0028=$0668 -- you gave $1338, or 38 13
AAA = $0640+$0028+$0001=$0669
ABC = $0640+$0050+$0003=$0693

By my figuring, $1338 should be CCb, not AAb.
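
The arithmetic above can be sketched as a small Python encoder. This is only a sketch under the assumptions made so far: space = 0, A = 1, B = 2 and so on, short names padded on the right with spaces, and the result stored lo byte first. Digits and symbols are left out because their values have not been worked out at this point, and `encode3` is a made-up name.

```python
# Assumed values: space = 0, A = 1, ..., Z = 26 (digits and symbols omitted here).
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode3(text):
    """Pack up to three characters into two bytes, lo byte first."""
    padded = text.ljust(3)[:3]          # pad short names with trailing spaces
    first, second, third = (ALPHABET.index(ch) for ch in padded)
    word = first * 1600 + second * 40 + third
    return word.to_bytes(2, "little")

for name in ("A", "AA", "AAA", "ABC"):
    print(name, encode3(name).hex(" ").upper())
```

This gives 40 06 for A, 69 06 for AAA and 93 06 for ABC, matching the samples, and 68 06 for AA rather than the 38 13 that was posted.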

Michael

Edited by SeaGtGruff, Sat Jul 17, 2010 12:32 PM.


#8 SeaGtGruff OFFLINE  

SeaGtGruff

    Quadrunner

  • 5,455 posts
  • Location:Georgia, USA

Posted Sat Jul 17, 2010 12:45 PM

Since 1 (presumably 1bb) is encoded as C0 A8, or $A8C0, which is decimal 43200, which is 27*1600, I'm guessing the characters have the following values:

b = 0 (space)
A = 1
B = 2
C = 3
D = 4
E = 5
F = 6
G = 7
H = 8
I = 9
J = 10
K = 11
L = 12
M = 13
N = 14
O = 15
P = 16
Q = 17
R = 18
S = 19
T = 20
U = 21
V = 22
W = 23
X = 24
Y = 25
Z = 26
1 = 27
2 = 28
3 = 29
4 = 30
5 = 31
6 = 32
7 = 33
8 = 34
9 = 35
0 = 36
? = 37 (unknown symbol)
? = 38 (unknown symbol)
? = 39 (unknown symbol)

These are multiplied by 40^0=1, 40^1=40, or 40^2=1600, depending on their position. In example ABC, A is in the 100s place, B is in the 10s place, and C is in the 1s place.
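
Putting the table and the place values together, a decoder could look roughly like the Python sketch below. Several things here are assumptions rather than confirmed facts: only the value for "1" is backed by a sample, the rest of the digit order is the guess from the table, the three unknown symbols are shown as "?", and treating a name as a run of consecutive 2-byte words is my guess at the layout. The names `decode_word` and `decode_name` are made up.

```python
# Guessed table: space, A-Z, 1-9, 0, then three unknown symbols shown as "?".
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890" + "???"

def decode_word(lo, hi):
    """Unpack two bytes (lo byte first) into three characters."""
    word = lo | (hi << 8)
    if word >= 40 ** 3:
        raise ValueError(f"${word:04X} is outside the base-40 range")
    first, rest = divmod(word, 1600)    # 100s place
    second, third = divmod(rest, 40)    # 10s place and 1s place
    return ALPHABET[first] + ALPHABET[second] + ALPHABET[third]

def decode_name(data):
    """Decode consecutive 2-byte words and drop the trailing space padding."""
    pairs = range(0, len(data) - 1, 2)
    return "".join(decode_word(data[i], data[i + 1]) for i in pairs).rstrip()

print(repr(decode_word(0x93, 0x06)))   # the ABC sample
print(repr(decode_word(0xC0, 0xA8)))   # the "1" sample, padded with spaces
```

The range check is there because values from 64000 up to 65535 do not correspond to any three-character string in a 40-symbol alphabet.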

Michael

#9 DanBoris OFFLINE  

DanBoris

    Dragonstomper

  • Topic Starter
  • 970 posts
  • Location:New Jersey, USA

Posted Sat Jul 17, 2010 4:40 PM

You guys rock! Thanks!

Yes, "AAb = $1338" was a mistake; $0668 is the correct value.

Dan

Edited by DanBoris, Sat Jul 17, 2010 4:40 PM.




