
Statically Tracing 6502 Disassembler


Xuel


I wasn't satisfied with existing alternatives, so I wrote a simple 6502 disassembler in Perl. You can download it from GitHub here.

 

Some features:

  • Statically traces code from entry points that you provide in order to distinguish code from data
  • Automatically generates labels if desired
  • Emits XASM/MADS syntax
  • Emits "a:" as needed when absolute addressing is used for zero page addresses
  • Can generate labels for addresses in the middle of instructions, e.g. "l1234 equ *-2". This occurs when BIT is used to skip an instruction, for example.
  • Callers are annotated in a comment at every label so you can see who calls an address
  • The current address and the raw data are annotated in a comment for every instruction
  • Based on C= Hacking opcode table.
Example output:

 

l1150                   ; Callers: 111F 1124
    lda $10C3           ; 1150: AD C3 10
    lsr @               ; 1153: 4A
    lsr @               ; 1154: 4A
    cmp #$20            ; 1155: C9 20
    bcc l115E           ; 1157: 90 05
    beq l114F           ; 1159: F0 F4
    lda #$01            ; 115B: A9 01
    bit a:$00A9         ; 115D: 2C A9 00
l115E equ *-2           ; Callers: 1157
    sta $1166           ; 1160: 8D 66 11
    jmp l1130           ; 1163: 4C 30 11
    dta $1              ; 1166: 01
    dta $2              ; 1167: 02
    dta $0              ; 1168: 00
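
The "equ *-2" label in the listing above can be worked out with a little arithmetic. Here is a minimal Python sketch of that calculation, assuming the label line is emitted right after the instruction that contains the target; the function name is made up for illustration and is not taken from dis.

```python
def mid_instruction_label(target, insn_addr, insn_len):
    """Return an 'equ *-N' label line for a target inside an instruction.

    The label is emitted after the instruction, so '*' is the address of
    the next instruction and N is the distance back to the target.
    """
    next_addr = insn_addr + insn_len
    if not insn_addr < target < next_addr:
        raise ValueError("target is not inside the instruction")
    return "l%04X equ *-%d" % (target, next_addr - target)

# Values from the listing above: the 3-byte BIT at $115D is followed by
# $1160, and the branch at $1157 targets $115E.
print(mid_instruction_label(0x115E, 0x115D, 3))  # l115E equ *-2
```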

I've tested a few dumps and I've confirmed that XASM is able to reproduce the exact image when given the output of disassembling the image. But there are probably still some bugs lurking, so take the output with a grain of salt.

 

Suggestions for improvements are welcome. It currently only handles raw memory dumps. I'll probably add a mode to handle XEX files. I also want to add a mechanism to supply user-defined labels. It should also detect the BIT trick and replace it with "dta $2C" so the skipped instruction can be disassembled.


I added support for Atari XEX and Commodore 64 PRG files. The disassembler automatically determines code entry points when disassembling such executables.

 

Download from the github project page.

 

I've tested a few executables and verified that running XASM on the disassembled code produces the exact same executable. I'll try some more exhaustive tests soon.

 

The XEX mode is aware that segments can overlap and that RUN and INI segments can refer to previous segments. However, it doesn't yet understand that code from one segment could call code in another segment. If that occurs, the called code may be treated as data instead of code.

 

One tricky bit to reproducing the exact same XEX is to emit a $FFFF segment header only when it was present in the original executable, but I did implement this.
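
For anyone who wants to experiment with this, here is a rough Python sketch of reading XEX segments while remembering which ones carried a $FFFF header. It relies on the usual Atari binary load format (optional $FFFF, then little-endian start and end addresses, then end-start+1 data bytes). The function name and the tuple layout are illustrative assumptions, not how dis does it internally.

```python
import struct

def parse_xex(data):
    """Split an Atari XEX image into segments.

    Returns a list of (start, end, payload, had_ffff) tuples, where
    had_ffff records whether the segment was preceded by a $FFFF header.
    """
    segments = []
    pos = 0
    while pos < len(data):
        had_ffff = False
        (word,) = struct.unpack_from("<H", data, pos)
        if word == 0xFFFF:
            had_ffff = True
            pos += 2
        start, end = struct.unpack_from("<HH", data, pos)
        pos += 4
        size = end - start + 1
        payload = data[pos:pos + size]
        if size <= 0 or len(payload) != size:
            raise ValueError("truncated or malformed segment at offset %d" % pos)
        segments.append((start, end, payload, had_ffff))
        pos += size
    return segments

# Hand-made three-segment image: code at $2000, an INI vector at
# $02E2-$02E3 pointing to $2000, then a segment with its own $FFFF header.
image = bytes([0xFF, 0xFF, 0x00, 0x20, 0x01, 0x20, 0xA9, 0x02,
               0xE2, 0x02, 0xE3, 0x02, 0x00, 0x20,
               0xFF, 0xFF, 0x00, 0x20, 0x00, 0x20, 0x60])
for start, end, payload, had_ffff in parse_xex(image):
    print("%04X-%04X %d bytes, $FFFF header: %s" % (start, end, len(payload), had_ffff))
```

Keeping the had_ffff flag per segment is what makes it possible to write the header back only where the original file had it.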

 

Here's an example from Ransack. The first two code and INI segments disable BASIC and DMACTL. The next segment has an $FFFF segment header as it was just a separate executable compressed with exomizer. Notice also that all of the labels are prefixed with the segment number to avoid label collisions in case of overlapping segments. I may add an option to output MADS .local/.endl directives instead or only prefix labels that actually collide.

 

    org $2000           ; end 2013
s1l2000
    lda #$02            ; 2000: A9 02 <--- Entry
    ora $D301           ; 2002: 0D 01 D3
    sta $D301           ; 2005: 8D 01 D3
    lda #$00            ; 2008: A9 00
    sta $022F           ; 200A: 8D 2F 02
    lda $14             ; 200D: A5 14
s1l200F                 ; Callers: 2011
    cmp $14             ; 200F: C5 14
    beq s1l200F         ; 2011: F0 FC
    rts                 ; 2013: 60
    ini $2000
    opt h-
    dta a($FFFF)        ; Segment header
    opt h+
    org $2000           ; end 3D47
s2l2000
    ldy #$11            ; 2000: A0 11 <--- Entry
    tsx                 ; 2002: BA
s2l2003                 ; Callers: 200A
    lda $3C76,x         ; 2003: BD 76 3C
    sta a:$00FC,x       ; 2006: 9D FC 00
    dex                 ; 2009: CA
    bne s2l2003         ; 200A: D0 F7
    jmp s2l3C3B         ; 200C: 4C 3B 3C
    dta $7E             ; 200F: 7E
    dta $B6             ; 2010: B6
    dta $20             ; 2011: 20
    ...

will have a look, too... how do you decide if it's data and not code?

That's the "statically tracing" part. It starts by assuming everything is data. Code entry points are then determined from the executable's RUN and INI segments, or specified manually by the user with -e XXXX. The disassembler traces from each entry point until it hits a JMP, JSR, BXX branch, RTI, RTS or illegal instruction. If it's a JMP, JSR or BXX, it recursively traces the target addresses as new code entry points. So, theoretically, if it knows the initial entry point, it can find all memory locations that correspond to code.

However, this is done statically, i.e. without knowledge of how the program may change memory at runtime. So there are several cases where it can't find code, e.g. self-modifying jumps, indirect jumps through memory that changes, PHA/PHA/RTS-style jumps, interrupt vectors that are changed at runtime, generated code, decompressed code, etc. The best results are probably achieved using a memory dump taken after the executable has decompressed itself, but you're still only able to capture one state of the machine. If the executable goes on to create more speed code, change vectors, or otherwise self-modify then all bets are off.

That being said, you can still get decent results on code that doesn't use a lot of tricks. And you can use as many -e XXXX options as you like to tell the tracer to visit places it otherwise would have missed. Static tracing is mostly conservative in what it treats as code, but there are some cases, like an always-taken branch, where it might go off the rails into data. For those situations you can use -c XXXX to force the tracing to stop at specific addresses.
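
To make the idea concrete, here is a small Python sketch of such a tracing loop. It is not the Perl code from dis. The opcode table is a tiny subset chosen for the demo, the names are invented, and it assumes that the path after a JSR or a conditional branch is code as well (the listings in this thread suggest as much, but treat it as an assumption).

```python
# Tiny subset of the 6502 opcode table: opcode -> (mnemonic, length).
OPCODES = {
    0xA9: ("lda", 2), 0x8D: ("sta", 3), 0xCA: ("dex", 1), 0xEA: ("nop", 1),
    0xD0: ("bne", 2), 0xF0: ("beq", 2), 0x90: ("bcc", 2),
    0x4C: ("jmp", 3), 0x20: ("jsr", 3), 0x60: ("rts", 1), 0x40: ("rti", 1),
}
BRANCHES = {"bne", "beq", "bcc"}

def trace(mem, origin, entries):
    """Return the addresses that hold the first byte of a traced instruction."""
    code = set()
    work = list(entries)
    end = origin + len(mem)
    while work:
        pc = work.pop()
        while origin <= pc < end and pc not in code:
            op = mem[pc - origin]
            if op not in OPCODES:
                break  # not in this sketch's table: stop tracing here
            name, length = OPCODES[op]
            if pc + length > end:
                break  # operand would run past the end of the image
            code.add(pc)
            if name in BRANCHES:
                offset = mem[pc + 1 - origin]
                if offset >= 0x80:
                    offset -= 0x100  # branch offsets are signed
                work.append((pc + 2 + offset) & 0xFFFF)
            elif name in ("jmp", "jsr"):
                work.append(mem[pc + 1 - origin] | (mem[pc + 2 - origin] << 8))
            if name in ("jmp", "rts", "rti"):
                break  # control never falls through these
            pc += length
    return code

# lda #0 / bne $1007 / jmp $1008 / rts / rts / two data bytes
mem = bytes([0xA9, 0x00, 0xD0, 0x03, 0x4C, 0x08, 0x10, 0x60, 0x60, 0x01, 0x02])
print(["%04X" % a for a in sorted(trace(mem, 0x1000, [0x1000]))])
```

Everything the loop never reaches, here the two bytes at $1009 and $100A, stays classified as data.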


did you compare your output with Dis6502?

 

Some advantages of dis over dis6502:

  • dis offers static tracing. As far as I know dis6502 does no tracing and attempts to disassemble everything as code. It appears to only use .byte for illegal opcodes and BRK.
  • dis can disassemble illegal opcodes. dis6502 treats them as data.
  • dis uses "a:" to distinguish absolute addressing from z-page addressing when the address lies within the zero page. So there are cases where assembling the output of dis6502 won't give you back the original file but dis will.
  • dis uses unique labels for each segment. dis6502 may create duplicate labels if segments overlap.
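
The "a:" rule from the list above fits in a few lines. This is only an illustrative Python sketch with a made-up function name. It assumes the caller already knows from the opcode whether the instruction uses absolute or zero-page addressing.

```python
def format_operand(addr, absolute):
    """Format a memory operand in XASM/MADS style.

    An absolute-mode operand that happens to lie in the zero page gets
    an 'a:' prefix so the assembler does not shorten it to zero page.
    """
    if absolute:
        prefix = "a:" if addr < 0x100 else ""
        return "%s$%04X" % (prefix, addr)
    return "$%02X" % addr

print(format_operand(0x00A9, absolute=True))   # a:$00A9
print(format_operand(0x10C3, absolute=True))   # $10C3
print(format_operand(0x14, absolute=False))    # $14
```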

Some advantages of dis6502 over dis:

  • dis6502 has built-in system equates and supports user equates. dis doesn't yet.
  • dis6502 creates labels for data and supports address ranges. dis doesn't do this yet.
  • dis6502's GUI is very nice. dis is just a command-line tool.
  • dis6502 supports more input file formats including ATR, XFD and CAS. dis only supports raw, XEX and PRG.
  • dis6502 lets you redefine the assembler syntax. dis only supports XASM/MADS at the moment.
  • dis6502 can put multiple data bytes on the same line. dis currently puts each byte on a separate line which can make the output huge.
  • way more features in general

Thanks Xuel, this sounds very good. I have thought out such a system before, though you've gone much further and even implemented it. I thought that it would be possible to some extent.

 

Feature request (though it may take some time)....

 

If it sees:

LDA #0

STA 559

 

... then it should add " ; Switching off screen" as a comment onto the end of the STA 559.

 

You could load values from a csv which the user could supply and we as a community could build that file up.

 

It would only work for statically coded information

i.e.

LDA 203

STA 559

... could not be commented.... unless it said, "Doing something with the screen"
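
A hypothetical sketch of what such a community CSV lookup could look like in Python. The three-column layout (address, value, comment) and both function names are invented here for illustration. Nothing like this exists in dis as far as this thread shows.

```python
import csv
import io

def load_hints(text):
    """Parse 'address,value,comment' rows; an empty value matches any store."""
    hints = {}
    for addr, value, comment in csv.reader(io.StringIO(text)):
        key = (int(addr), int(value) if value.strip() else None)
        hints[key] = comment
    return hints

def comment_for(hints, addr, value=None):
    """Prefer an exact (address, value) hint, else the address-only hint."""
    if value is not None and (addr, value) in hints:
        return hints[(addr, value)]
    return hints.get((addr, None))

# Hypothetical file contents covering the two cases described above.
HINTS = "559,0,Switching off screen\n559,,Doing something with the screen\n"
hints = load_hints(HINTS)
print(comment_for(hints, 559, 0))  # value known: LDA #0 / STA 559
print(comment_for(hints, 559))     # value unknown: LDA 203 / STA 559
```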


Hi,

 

Heh, I have a similar project on my HDD. Unfortunately, it isn't finished. Nice features that I have and you don't (as far as I can see):

- detect reading from uninitialized memory (be careful with hardware registers),

- detect executing from uninitialized memory,

- advanced breakpoints (memory1=x, memory2=y, memory3>z and PC=xxx),

- call VBL about every 30000 cycles (to make some VBL clocks work).


Hello,

 

And what about something like this (I have nothing to assemble with, just a keyboard at my desk)? As hidden is never called in the code... I confess that I didn't read all the posts, just your first one.

    * = $600

    jmp (toto)

toto
    .word hidden

hidden
    ; here some code...
loop
    jmp loop

Hi,

 

Heh, I have a similar project on my HDD. Unfortunately, it isn't finished. Nice features that I have and you don't (as far as I can see):

- detect reading from uninitialized memory (be careful with hardware registers),

- detect executing from uninitialized memory,

- advanced breakpoints (memory1=x, memory2=y, memory3>z and PC=xxx),

- call VBL about every 30000 cycles (to make some VBL clocks work).

This sounds like you are actually simulating a 6502. I'm just stepping forward one instruction at a time based on the instruction length. The only instructions which I interpret in any fashion are JMP, JSR, BXX, RTI, and RTS.

 

I think your method has many advantages, including being able to handle self-modifying code and the uninitialized memory checks you described. I'd like to see it in action!


Hello,

 

And what about something like this (I have nothing to assemble with, just a keyboard at my desk)? As hidden is never called in the code... I confess that I didn't read all the posts, just your first one.

    * = $600

    jmp (toto)

toto
    .word hidden

hidden
    ; here some code...
loop
    jmp loop

 

dis does handle indirect JMPs but only uses the pointer value stored at "toto" at the time of disassembly (here, the address of "hidden"). If that value changes over the course of the run time, then the other values would have to be given manually with -e XXXX to ensure that dis can traverse them.
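
Resolving such a jump statically just means reading the little-endian pointer out of the image. Here is a minimal Python sketch, using a hand-assembled version of the snippet above with "hidden" and "loop" both landing at $0605. The function name is illustrative only.

```python
def indirect_target(mem, origin, pointer):
    """Read the 16-bit little-endian word that JMP (pointer) would load."""
    lo = mem[pointer - origin]
    hi = mem[pointer + 1 - origin]
    return lo | (hi << 8)

# $0600: jmp ($0603)   6C 03 06
# $0603: .word $0605   05 06      (toto)
# $0605: jmp $0605     4C 05 06   (hidden / loop)
mem = bytes([0x6C, 0x03, 0x06, 0x05, 0x06, 0x4C, 0x05, 0x06])
print("%04X" % indirect_target(mem, 0x0600, 0x0603))  # 0605
```

The result is only as good as the snapshot: if the program rewrites the word at $0603 later, a static read will not see the new target.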


ok.... cart will be assembled correctly... good start... I was just wondering that sometimes code is not disassembled but remains in DTA statements?

You have to help it by supplying the code entry points that it can't determine statically. As I mentioned in post 6, there are many types of code paths that it can't trace statically. It traces as much as it can through JMP/JSR/BXX instructions, but interrupts, indirect jumps and self-modifying code can throw it off.

 

As an example, consider the Joust.rom file on Atarimania. We can get an initial pass with the following command:

 

dis.pl Joust.rom -l -o 8000 -v bffa -v bffe > joust.asm

The -v options tell dis to trace from the code entry points specified by the Cartridge B start and init vectors. This will allow dis to trace the mainline code.

 

However, Joust uses a deferred VBLANK routine which dis won't see:

 

    lda #$38            ; 840C: A9 38
    sta $0224           ; 840E: 8D 24 02
    lda #$A6            ; 8411: A9 A6
    sta $0225           ; 8413: 8D 25 02

And it uses indirect JMP instructions fed with entries from a couple of jump tables. One of them looks like this:

 

    lda $B691,y         ; B684: B9 91 B6
    sta $B7             ; B687: 85 B7
    lda $B692,y         ; B689: B9 92 B6
    sta $B8             ; B68C: 85 B8
    jmp ($00B7)         ; B68E: 6C B7 00
    dta $B9             ; B691: B9
    dta $B6             ; B692: B6
    dta $9D             ; B693: 9D
    dta $B6             ; B694: B6
    dta $9C             ; B695: 9C
    dta $BF             ; B696: BF
    dta $9D             ; B697: 9D
    dta $B6             ; B698: B6
    dta $B9             ; B699: B9
    dta $B6             ; B69A: B6
    dta $D4             ; B69B: D4
    dta $B6             ; B69C: B6

We can augment the command-line to tell dis to trace these as well:

 

dis.pl Joust.rom -l -o 8000 -v bffa -v bffe -e a638 -e b6b9 -e b69d -e bf9c -e b6d4 > joust.asm

Basically, you can keep running dis with more -e switches until you're satisfied that it has covered all of the code. I'll probably add some options to ignore tracing and just force disassembly as well.
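
The -e values for the jump table don't have to be worked out by hand. A short Python sketch that decodes the twelve table bytes from the listing above into switches follows. It assumes the table really is just those six little-endian words, and the function name is made up.

```python
def table_entries(table):
    """Decode little-endian 16-bit words and return unique targets in order."""
    seen = []
    for i in range(0, len(table) - 1, 2):
        target = table[i] | (table[i + 1] << 8)
        if target not in seen:
            seen.append(target)
    return seen

# The twelve bytes at $B691-$B69C from the Joust listing.
table = bytes([0xB9, 0xB6, 0x9D, 0xB6, 0x9C, 0xBF,
               0x9D, 0xB6, 0xB9, 0xB6, 0xD4, 0xB6])
print(" ".join("-e %04x" % t for t in table_entries(table)))
# -e b6b9 -e b69d -e bf9c -e b6d4
```

The remaining switch, -e a638, comes from the deferred VBLANK setup: $38 is stored to $0224 and $A6 to $0225, which gives $A638.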


I should also mention that dis doesn't really support banked cartridges. Currently, the best bet with a banked cartridge would be to separate it into smaller virtual carts, maybe 4K or 8K pieces depending on the banking scheme, and then disassemble them individually. But the number of -e switches that you'd have to keep track of would probably make this a really unpleasant process. Perhaps I can apply some of the ideas I have for supporting XEX inter-segment calls to banked carts as well. At the minimum the -e flag would need a way to specify the bank number in addition to the code entry address.
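
Cutting a banked image into pieces is simple enough to script. A minimal Python sketch, assuming fixed-size banks stored back to back in the dump; the 8K default and the function name are illustrative, and the right size depends on the cart's banking scheme.

```python
def split_banks(rom, bank_size=0x2000):
    """Cut a ROM image into equal-sized banks (8K by default)."""
    if bank_size <= 0 or len(rom) % bank_size != 0:
        raise ValueError("image size is not a multiple of the bank size")
    return [rom[i:i + bank_size] for i in range(0, len(rom), bank_size)]

# Example: a dummy 32K image yields four 8K banks.
banks = split_banks(bytes(0x8000))
print(len(banks), hex(len(banks[0])))
```

Each piece could then be written to its own file and disassembled separately with the origin the bank is mapped to.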

 

Also, dis doesn't yet support the CAR format. You can fake it out by giving -o as 16 bytes before the real starting address to skip over the CART header, e.g. -o 7FF0. This will work fine for unbanked carts, but the banked cart issue remains.


  • 3 weeks later...

Release v0.4 is now available here.

 

Major new features include:

  • Added support for user-defined labels with optional address ranges
  • Added support for reading options, including labels, from a set of files
  • Added support for SAP format files
  • Added tracing between XEX segments
  • Fixed bugs - Thanks to fox!
  • See change log for more details
Example of disassembling some of the Atari OS using sys.dop and hardware.dop:

 

lC0E2                   ; Callers: SYSVBV -v 0222 C029
    inc RTCLOK+2        ; C0E2: E6 14
    bne lC0EE           ; C0E4: D0 08
    inc ATRACT          ; C0E6: E6 4D
    inc RTCLOK+1        ; C0E8: E6 13
    bne lC0EE           ; C0EA: D0 02
    inc RTCLOK          ; C0EC: E6 12
lC0EE                   ; Callers: C0E4 C0EA
    lda #$FE            ; C0EE: A9 FE
    ldx #$00            ; C0F0: A2 00
    ldy ATRACT          ; C0F2: A4 4D
    bpl lC0FC           ; C0F4: 10 06
    sta ATRACT          ; C0F6: 85 4D
    ldx RTCLOK+1        ; C0F8: A6 13
    lda #$F6            ; C0FA: A9 F6
lC0FC                   ; Callers: C0F4
    sta DRKMSK          ; C0FC: 85 4E
    stx COLRSH          ; C0FE: 86 4F

  • 6 years later...
On 12/11/2014 at 11:22 AM, Xuel said:

So there are several cases where it can't find code, e.g. self-modifying jumps, indirect jumps through memory that changes, PHA/PHA/RTS-style jumps, interrupt vectors that are changed at runtime, generated code, decompressed code, etc.

 

On 12/11/2014 at 11:22 AM, Xuel said:

If the executable goes on to create more speed code, change vectors, or otherwise self-modify then all bets are off.

Absolutely cool project you did/do! I'd like to suggest the following:

You mentioned the limits - what dis can and can't do/handle.

*1* How about making the list complete? (Collect ALL possibilities where dis would fail, instead of "etc.")

*2* How about DETECTING these cases? It would be great if dis could say e.g. "self modifying jump(s) detected at $...." or "stack jump(s) detected at $...." or "irq/nmi/reset vectors manipulated at $...." or "jump table detected at $...."! This would help a lot in deciding where to investigate further.

*3* Already mentioned in this thread: How about building up and sharing an additional info file for e.g. the C64's VIC, SID and CIA addresses? Does nobody have interest in that? Why reinvent the wheel a thousand times individually?

It would be fantastic if these 3 points could get implemented!

