
About This Club

Atari VCS/2600 development using the Harmony/Melody board. See the GENERAL discussion area for setting up your environment. See the CDFJ+ discussion area for the tutorial.

  1. What's new in this club
  2. Status update from @batari: I'm heading to Austin for Tesla Con Texas later this week, so even if this gets figured out soon I won't be able to do anything with it until next week.
  3. Mac: I installed the arm-none-eabi-gcc package from macports then built and ran your CDFJ Collect3 program without any problems. (Likely using some other stuff I already had installed.)
  4. Going to be a little bit longer: Part of the team is on vacation, so we've not been able to fully test the new build processes. Batari spotted an issue with the driver that causes it to perform differently when run via Harmony menu vs. when flashed to a Melody board. This could cause a game that runs OK during testing to have screen jitter/rolls when released on stand-alone cartridges.
  5. That is correct. If you have _DS_FROM_ARM somewhere other than directly after the _DS_TO_ARM section you could either write the _DS_FROM_ARM address to DSPTR in the 6507 code, or use setPointer(DSCOMM, _DS_FROM_ARM) in the C code.
  6. When/how is what DSCOMM is "pointing" to ... reset? That is, when is the pointer reset to point to the first item? If I load from DSCOMM, and there are a bunch of things I'm loading that way, how do I know what it's feeding me? Edit: OK, I think I might understand... please correct where wrong! It's basically hardwired, or intended to be used, such that you first write vars which go to the storage at DS_TO_ARM. More specifically, the address of this storage is written as lo/hi to DSPTR. Then you write multiple bytes to DSWRITE, which are then stored at that DS_TO_ARM memory area. The label "DS_FROM_ARM" is superfluous; it's there as a visual placeholder only. As soon as you start reading from DSCOMM (lda #DSCOMM), you are in fact reading from whatever DSPTR currently points to (which increments by one with each write to DSWRITE, or read via DSCOMM). There is no "reset" per se, other than writing DSPTR. You can either write it, or assume that once you've written your vars for TO ARM, you can do some ARM stuff and then read the next n bytes (however many are in your FROM section) via "lda #DSCOMM".
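     The behaviour described in this exchange (one pointer shared by DSWRITE and DSCOMM, advancing on every access, changed only by writing DSPTR) can be sketched as a small host-side C model. This is a minimal sketch and not driver code: the names CommStream, comm_set_ptr, comm_write and comm_read are hypothetical, the 4K size is the default Display Data size, and wrapping at the end of Display Data is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define DD_SIZE 4096u   /* default Display Data size; assumption for this model */

/* Hypothetical model of the communication data stream. */
typedef struct {
    uint8_t  display_data[DD_SIZE];
    uint16_t ptr;       /* where the next DSWRITE or DSCOMM access lands */
} CommStream;

/* Models the two writes to DSPTR (low byte first, then high byte). */
static void comm_set_ptr(CommStream *s, uint8_t lo, uint8_t hi)
{
    assert(s);
    s->ptr = (uint16_t)((((uint16_t)hi << 8) | lo) % DD_SIZE);
}

/* Models "stx DSWRITE": store, then advance the shared pointer. */
static void comm_write(CommStream *s, uint8_t value)
{
    assert(s);
    s->display_data[s->ptr] = value;
    s->ptr = (uint16_t)((s->ptr + 1u) % DD_SIZE);   /* wrap is an assumption */
}

/* Models "lda #DSCOMM": fetch, then advance the same pointer. */
static uint8_t comm_read(CommStream *s)
{
    assert(s);
    uint8_t value = s->display_data[s->ptr];
    s->ptr = (uint16_t)((s->ptr + 1u) % DD_SIZE);   /* wrap is an assumption */
    return value;
}
```

     In this model, reads that follow a run of writes continue at the next byte, which is why a FROM section placed directly after the TO section needs no extra pointer setup.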
  7. Sorry, but the next update will be at least another week. Found out last night that friends from Mexico will be in town for the weekend. We're also working on getting the build process to work under Linux, Mac, and Windows! While I've been using a Virtual Machine running Linux to compile the ARM code, @Andrew Davie and @johnnywc have been using native tools under MacOS and Windows. If we get this working then you will not need to use a virtual machine like I've had to over the years (the ARM compiler was not available for Mac when I started).
  8. I've wrapped up and committed the changes to Stella for CDFJ+ support. They'll be in the 6.7 release. I've finished converting Part 3 of the CDFJ tutorial to use CDFJ+. A major part of the conversion was to figure out routines that automatically configure the project based on these 4 symbols that are defined at the beginning of the 6507 code:

         ;===============================================================================
         ; Project Configuration
         ;----------------------------------------
         ;   _PC_ROM_SIZE    Size of ROM, in kilobytes
         ;                        32      32 KB of ROM and  8 KB of RAM
         ;                        64      64 KB of ROM and 16 KB of RAM
         ;                       128     128 KB of ROM and 16 KB of RAM
         ;                       256     256 KB of ROM and 32 KB of RAM
         ;                       512     512 KB of ROM and 32 KB of RAM
         ;
         ;   _PC_DD_SIZE     Size of Display Data, in bytes
         ;                       4096
         ;
         ;   _PC_CDFJ_FF     Select CDFJ+ driver based on Fast Fetcher usage:
         ;                       FF_LDA          LDA # only
         ;                       FF_LDA_LDX      LDA # and LDX #
         ;                       FF_LDA_LDY      LDA # and LDY #
         ;                       FF_LDA_LDX_LDY  LDA #, LDX #, and LDY #
         ;
         ;   _PC_FF_OFFSET   offset for Fast Fetchers
         ;                       0 - 220 adjusts which LDA # values are overridden for fast
         ;                               fetcher use (and optionally LDX # and/or LDY #)
         ;
         ;===============================================================================

         _PC_ROM_SIZE    = 32
         _PC_DD_SIZE     = 4096
         _PC_CDFJ_FF     = FF_LDA
         _PC_FF_OFFSET   = $80

     If you're building a 32K project and decide to make it 64K, all you need to do is change _PC_ROM_SIZE to 64. The proper driver will then be included (as 64K and up uses a different chipset), the RAM allocation will become 16K, and the C code will adjust accordingly (such as 8K of additional C Variable space).

     If you're running out of C Variable space and have unused space in Display Data, just decrease the size of _PC_DD_SIZE. If you're using a larger ROM with more RAM you can even increase Display Data size as CDFJ+ no longer limits Display Data to 4K. In both cases the C Variable space will adjust accordingly.

     My next step is to review the project, such as this section of code:

         MenuKernel:
                 sta WSYNC
                 FFSTA _DS_MENU_P0, GRP0
                 FFSTA _DS_MENU_P1, GRP1
                 dey
                 bne MenuKernel

     which needs comments added to explain what's going on with the new FFSTA macros. I'm also going to review the comments and corrections in the CDFJ tutorial, such as reply 3 and reply 14 by @Dionoid in Part 5 - Source Improvements and implement them in Part 3 of the CDFJ+ tutorial. Have family in town for the weekend, so Part 3 will most likely be posted next weekend.
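     The ROM-to-RAM pairing listed in the Project Configuration comment, and the way C Variable space shrinks or grows with _PC_DD_SIZE, can be sketched in C as follows. This is only an illustration of the arithmetic: the function names are hypothetical, and the driver's RAM footprint is passed in as a parameter because the post does not state it for each ROM size.

```c
#include <assert.h>

/* RAM size in KB for a given _PC_ROM_SIZE, per the table in the post.
 * Returns -1 for a ROM size the table does not list. */
static int ram_kb_for_rom_kb(int rom_kb)
{
    switch (rom_kb) {
    case 32:            return 8;
    case 64:  case 128: return 16;
    case 256: case 512: return 32;
    default:            return -1;
    }
}

/* Bytes left for C variables and stack once the driver and Display Data
 * are carved out of RAM. driver_bytes is an assumption supplied by the
 * caller, not a value taken from the CDFJ+ sources. */
static int c_usage_bytes(int rom_kb, int dd_bytes, int driver_bytes)
{
    int ram_kb = ram_kb_for_rom_kb(rom_kb);
    assert(ram_kb > 0);
    int left = ram_kb * 1024 - driver_bytes - dd_bytes;
    assert(left >= 0);
    return left;
}
```

     With a 2K driver and 4K of Display Data (both assumed), moving from 32K to 64K of ROM raises the result from 2048 to 10240 bytes, which matches the "8K of additional C Variable space" mentioned above.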
  9. Note: a copy of Part 2 - CDFJ Details with addition of the Fast Fetchers section towards the bottom

     CDFJ+ is built around data streams. A data stream is a sequence of data elements made available over time - basically a list of values such as:

         10
         55
         20
         25
         ...

     The data stream will auto-advance so that the first time you read it you'd get 10, the next time you'd get 55, then 20 and so on. Data streams are very helpful during the kernel as you can update any TIA register in just 5 cycles:

         LDA #DS0DATA
         STA GRP0

     General Purpose Data Streams
     There are 32 general purpose data streams named DS0DATA thru DS31DATA. Each data stream has an increment value associated with it for the auto-advance feature. Example increments:
       • 1.0 for 1LK player graphics
       • 0.20 to repeat chunky playfield graphics over 5 scanlines
       • 2.0 to skip every other value. This is extremely useful for interlaced bitmap graphics, which are typically seen as 96 or 128 pixels across.

     Communication Data Stream
     There is a dedicated communication data stream named DSCOMM used for transferring data between the 6507 and ARM processors.

     Jump Data Streams
     There are 2 data streams for jumps named DSJMP1 and DSJMP2. These override JMP $0000 and JMP $0001 respectively, providing 3 cycle flow control within the kernel.
     This means instead of counting scanlines and branching your kernel would look something like this kernel from Draconian:

         ; data stream usage for game screen
         DS_GRP0         = DS0DATA
         DS_GRP1         = DS1DATA
         DS_HMP0         = DS2DATA
         DS_HMP1         = DS3DATA
         DS_MISSILE0     = DS4DATA   ; HMM0 and ENAM0
         DS_MISSILE1     = DS5DATA   ; HMM1 and ENAM1
         DS_BALL         = DS6DATA   ; HMBL and ENABL
         DS_COLOR        = DS7DATA   ; color change for players and ball only
         DS_SIZE         = DS8DATA   ; size change for all objects

         NormalKernel:                   ;   20
                 lda #DS_SIZE            ; 2 22 <- just to keep stream in sync
         nk1:    lda #DS_COLOR           ; 2 24   2 33 from Resm0Strobe28 <- just to keep stream in sync
                 lda #DS_HMP0            ; 2 26
                 sta HMP0                ; 3 29
                 lda #DS_HMP1            ; 2 31
                 sta HMP1                ; 3 34
                 lda #DS_MISSILE0        ; 2 36
                 tax                     ; 2 38
                 stx HMM0                ; 3 41
                 lda #DS_MISSILE1        ; 2 43
                 tay                     ; 2 45
                 sty HMM1                ; 3 48
                 lda #DS_BALL            ; 2 50
                 sta HMBL                ; 3 53
                 sta ENABL               ; 3 56
                 lda #DS_GRP0            ; 2 58
                 sta GRP0                ; 3 61
                 lda #DS_GRP1            ; 2 63
                 sta WSYNC               ; 3 66/0
                 sta HMOVE               ; 3  3
                 sta GRP1                ; 3  6 <- also updates GRP0 and BL
                 DIGITAL_AUDIO           ; 5 11
                 stx ENAM0               ; 3 14
                 sty ENAM1               ; 3 17
                 jmp FASTJMP1            ; 3 20

         ExitKernel:
                 ...

         Resm0Strobe23:                  ;   20
                 sta RESM0               ; 3 23
                 lda #DS_SIZE            ; 2 25
                 sta NUSIZ0              ; 3 28 <- changes missile size
                 jmp nk1                 ; 3 31

         Resm0Strobe28:
                 ...

     The data stream DSJMP1 is initially filled with addresses for NormalKernel, and ends with the address for ExitKernel. The player, missile and ball reuse routines will change individual values in DSJMP1 to jump to reposition kernels such as Resm0Strobe23.

     Audio Data Stream
     Lastly there's an audio data stream named AMPLITUDE. It will return a stream of data to play back a digital sample, or to play back 3 voice music with custom waveforms. The macro DIGITAL_AUDIO in the above Draconian kernel is defined as:

                 MAC DIGITAL_AUDIO
                 lda #AMPLITUDE
                 sta AUDV0
                 ENDM

     6507 Interface
     From the Atari's point of view CDFJ only has 4 registers defined in the cartridge space.
       • DSWRITE at $1FF0
       • DSPTR at $1FF1
       • SETMODE at $1FF2
       • CALLFN at $1FF3

     DSPTR is used to set the Display Data address for the DSCOMM data stream - basically setting the RAM location the 6507 code wishes to write to. Write the low byte of the address first, then the high byte.

     DSWRITE writes to the address set by DSPTR. After writing, DSPTR advances to the next RAM location in preparation for the next write:

         ; define storage in Display Data
         _DS_TO_ARM:
         _SWCHA:     ds 1        ; controller state to ARM code
         _SWCHB:     ds 1        ; console switches state to ARM code
         _INPT4:     ds 1        ; firebutton state to ARM code
         _INPT5:     ds 1        ; firebutton state to ARM code

         ...

                 ldx #<_DS_TO_ARM
                 stx DSPTR
                 ldx #>_DS_TO_ARM
                 stx DSPTR
                 ldx SWCHA       ; read state of both joysticks
                 stx DSWRITE     ; written to _SWCHA
                 ldx SWCHB       ; read state of console switches
                 stx DSWRITE     ; written to _SWCHB
                 ldx INPT4       ; read state of left joystick firebutton
                 stx DSWRITE     ; written to _INPT4
                 ldx INPT5       ; read state of right joystick firebutton
                 stx DSWRITE     ; written to _INPT5

     SETMODE controls Fast Fetch Mode and Audio Mode. Fast Fetch mode overrides the LDA #immediate mode instruction and must be turned on to read from the data streams. Audio Mode selects between digital sample mode or 3-voice music mode.

     CALLFN is used to call the function main() in your C code. The value written to CALLFN determines if an interrupt will run to periodically update AUDV0. The interrupt is needed when playing back digital samples or 3-voice music.

                 ldy #$FE        ; generate interrupt to update AUDV0 while running ARM code
                 sty CALLFN

                 ldy #$FF        ; do not update AUDV0
                 sty CALLFN

     Fast Fetchers
     Fast Fetchers in CDFJ+ have seen 2 optional updates from CDFJ.

     1) Besides LDA #, you can optionally use LDX # and/or LDY # for Fast Fetch mode.
     This would free up 4 cycles in the NormalKernel from above, look for the <*==-- comments:

         NormalKernel:                   ;   20
                 lda #DS_SIZE            ; 2 22 <- just to keep stream in sync
         nk1:    lda #DS_COLOR           ; 2 24   2 33 from Resm0Strobe28 <- just to keep stream in sync
                 lda #DS_HMP0            ; 2 26
                 sta HMP0                ; 3 29
                 lda #DS_HMP1            ; 2 31
                 sta HMP1                ; 3 34
                 ldx #DS_MISSILE0        ; 2 36 <*==-- no longer need to use TAX for the deferred STX ENAM0
                 nop                     ; 2 38 <*==-- 2 freed up cycles
                 stx HMM0                ; 3 41
                 ldy #DS_MISSILE1        ; 2 43 <*==-- no longer need to use TAY for the deferred STY ENAM1
                 nop                     ; 2 45 <*==-- 2 freed up cycles
                 sty HMM1                ; 3 48
                 lda #DS_BALL            ; 2 50
                 sta HMBL                ; 3 53
                 sta ENABL               ; 3 56
                 lda #DS_GRP0            ; 2 58
                 sta GRP0                ; 3 61
                 lda #DS_GRP1            ; 2 63
                 sta WSYNC               ; 3 66/0
                 sta HMOVE               ; 3  3
                 sta GRP1                ; 3  6 <- also updates GRP0 and BL
                 DIGITAL_AUDIO           ; 5 11
                 stx ENAM0               ; 3 14
                 sty ENAM1               ; 3 17
                 jmp FASTJMP1            ; 3 20

     2) You can now set a Fast Fetcher offset. By default the offset is 0 so only LDA #DS0DATA thru LDA #AMPLITUDE are overridden, while LDA #$24 thru LDA #$FF would put the normal value of $24 thru $FF into the Accumulator. If you change the offset to $80 then LDA #DS0DATA + $80 thru LDA #AMPLITUDE + $80 are overridden, while LDA #$00 thru LDA #$7F and LDA #$A4 thru LDA #$FF would put the normal value of $00 thru $7F and $A4 thru $FF into the Accumulator.

     Macros are provided to make it easier to use the offset and LDA/X/Y # immediate mode instructions:
       • FFA - Fast Fetch using LDA #
       • FFX - Fast Fetch using LDX #
       • FFY - Fast Fetch using LDY #
       • FFSTA - Fast Fetch plus Store using LDA # / STA
       • FFSTX - Fast Fetch plus Store using LDX # / STX
       • FFSTY - Fast Fetch plus Store using LDY # / STY
       • SLDA - Safe LDA # - validates that value is not overridden for data stream usage.
       • SLDX - Safe LDX # - validates that value is not overridden for data stream usage.
       • SLDY - Safe LDY # - validates that value is not overridden for data stream usage.
         ; macro
                 FFA DS0DATA
         ; becomes
                 LDA #(DS0DATA + OFFSET)

         ; macro
                 FFSTX AMPLITUDE, AUDV0
         ; becomes
                 LDX #(AMPLITUDE + OFFSET)
                 STX AUDV0

         ; macro
                 SLDY 10
         ; If offset is 0 a compile time error will occur:
         ;       LDY # $a is within Fast Fetcher range of 0 and $23
         ; If OFFSET is > 10 becomes
                 LDY #10

     C Interface
     From the C code a number of functions have been defined to interact with CDFJ and Display Data:
       • setPointer()
       • setPointerFrac()
       • setIncrement()
       • setWaveform()
       • setSamplePtr()
       • setNote()
       • resetWave()
       • getWavePtr()
       • getPitch()
       • getRandom32()
       • myMemset()
       • myMemcpy()
       • myMemsetInt()
       • myMemcpyInt()

     This section will be expanded upon later.
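     The Fast Fetcher offset rule and the check performed by the SLDA/SLDX/SLDY macros can be illustrated with a short C sketch. It assumes the fast fetcher window is the 36 values $00 thru $23 (DS0DATA thru AMPLITUDE) shifted by the offset, as the text above describes; the function name is hypothetical and this is not how the macros themselves are implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* At offset 0 the overridden immediates are $00..$23 (DS0DATA..AMPLITUDE). */
enum { FF_LAST = 0x23, FF_MAX_OFFSET = 220 };

/* True when "LDA #immediate" would be replaced by a data stream read,
 * false when the literal value is loaded as usual. */
static bool is_fast_fetch(uint8_t immediate, uint8_t offset)
{
    assert(offset <= FF_MAX_OFFSET);   /* _PC_FF_OFFSET is documented as 0 - 220 */
    return immediate >= offset && immediate <= offset + FF_LAST;
}
```

     For example, with an offset of 0 the value 10 falls inside the window (so SLDY 10 is rejected), while with an offset of $80 the same value loads normally and the window becomes $80 thru $A3.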
  10. I remember that and your values are correct. But I think the emulation has some flaws. We are adding fixed numbers of 6507 cycles to the counter. These cycles assume that the Thumb code executes in no time. I think we have to differentiate here when we emulate Thumb cycles.
  11. We did tests on real hardware to validate the ARM timer values: theoretical 11d329 for NTSC, saw 11d311 thru 11d32e; theoretical 11e8ff for SECAM, saw 11e8f7 thru 11e8fd; theoretical 11fd2b for PAL, saw 11fd00 thru 11fd35. Also conducted a poll in 2020 to see if anybody had incorrect detection.
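      To show why the measured ranges above are easy to tell apart, here is a minimal C sketch that picks the TV type whose theoretical timer value is closest to a measurement. The nearest-value rule and all names are assumptions made for illustration; the post does not say how the driver actually makes the decision.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { TV_NTSC = 0, TV_SECAM = 1, TV_PAL = 2 } TvType;

/* Theoretical ARM timer values quoted in the post (hexadecimal). */
static const uint32_t THEORETICAL[3] = { 0x11d329u, 0x11e8ffu, 0x11fd2bu };

/* Hypothetical classifier: choose the closest theoretical value. */
static TvType classify_timer(uint32_t measured)
{
    TvType   best      = TV_NTSC;
    uint32_t best_dist = UINT32_MAX;

    for (int i = 0; i < 3; i++) {
        uint32_t t = THEORETICAL[i];
        uint32_t dist = (measured > t) ? (measured - t) : (t - measured);
        if (dist < best_dist) {
            best_dist = dist;
            best = (TvType)i;
        }
    }
    return best;
}
```

      The observed spread on hardware is a few dozen counts at most, while neighbouring theoretical values are more than 5000 counts apart, so every measurement listed above lands on the expected type under this rule.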
  12. These are only the 6507 cycles, and do not include the extra cycles which the Thumb code takes after starting (~2250), between the two updates (~300) and before stopping the counter (~100), right? Since the emulators may or may not count the Thumb cycles, it might be a good idea to: reduce the Thumb cycles (especially after starting the counter), and adjust the test values. I am not 100% sure, but it seems that the auto detect currently only works on real hardware because the gaps between NTSC, SECAM and PAL are large enough (~5400 cycles).
  13. CDFJ+

      CDFJ+ is an updated version of CDFJ that has some additional features, such as optionally also using LDX # and LDY # as fast fetchers, plus support for larger programs:
       • 32K ROM & 8K RAM
       • 64K ROM & 16K RAM
       • 128K ROM & 16K RAM
       • 256K ROM & 32K RAM
       • 512K ROM & 32K RAM

      The 32K ROM & 8K RAM is compatible with 48-Pin LPC210X Family (Harmony, Harmony Encore, Melody). Larger sizes require new hardware, based on the 64-Pin LPC213X Family, which also uses its own version of the CDFJ+ driver.

      CDFJ+ layout
      For CDFJ the C Code is before the 6507 banks. As the C code grows it takes over banks 0, 1, 2, 3, 4, and then 5. Because of this, the CDFJ 6507 code always starts in bank 6.

      For CDFJ+ the C Code is after the 6507 banks. As the C code grows it takes over banks 6, 5, 4, 3, 2, and then 1. Because of this, the CDFJ+ 6507 code always starts in bank 0. If a game grows to need more than 32K of ROM the additional ROM is added after the existing ROM. This simplifies usage for the C code - if the original CDFJ was used then the extra ROM would be discontiguous from the C code.

      CDFJ+ Driver
      The CDFJ+ [1] Driver is the ARM code that emulates a CDFJ+ coprocessor. The 6507 in the Atari has no access to the driver. Likewise the ARM has no access to the internals of the Atari.

      While the driver is located in ROM, it is copied into RAM when the Harmony/Melody first powers up. This is because the code runs faster when located in RAM and the extra speed is required in order for the coprocessor emulation to keep up with the Atari.

      Bank 0
      Bank 0 is always used for 6507 code. When a CDFJ+ cartridge is powered up bank 0 will already be selected. If you've selected another bank then access $FFF4 [2] to reselect bank 0.

      Bank 1
      The 6507 can select bank 1 by accessing memory location $FFF5 [2].

      Bank 2
      The 6507 can select bank 2 by accessing memory location $FFF6 [2].

      Bank 3
      The 6507 can select bank 3 by accessing memory location $FFF7 [2].

      Bank 4
      The 6507 can select bank 4 by accessing memory location $FFF8 [2].

      Bank 5
      The 6507 can select bank 5 by accessing memory location $FFF9 [2].

      Bank 6
      The 6507 can select bank 6 by accessing memory location $FFFA [2].

      C code & data
      CDFJ+ has a dedicated section in the ROM for the compiled C code and its data. The size depends upon the size of the ROM:
       • 2K on a 32K ROM
       • 34K on a 64K ROM
       • 98K on a 128K ROM
       • 226K on a 256K ROM
       • 482K on a 512K ROM

      The C code & data can also expand downward into the 6507 banks of ROM.

      Display Data, C Variables & Stack
      While the chart shows 4K and 2K, with CDFJ+ the size of Display Data is no longer locked to 4K. This is because larger ROM configurations include additional RAM, so you now get to control how RAM is divided between Display Data and C Variables & Stack.

      When the Harmony/Melody is first powered on the RAM holding Display Data is not initialized, so it's up to your code to do so. This was done to keep the CDFJ+ driver 2K in size.

      While the Atari cannot "bank in" the Display Data, it can read its contents using Data Streams [3]. This is how the custom C code will pass information to the 6507.

      The Atari can also write to Display Data by using a Data Stream. This is how the 6507 code will pass information, such as the current state of the joysticks and console switches, to the custom C code.

      [1] CDFJ+ stands for Chris (@cd-w), Darrell (@SpiceWare), Fred (@batari) and John (@johnnywc), who were involved in its creation.

      [2] Due to the 6507's 8K addressing space, these locations are mirrored multiple times in memory. The mirrors are $1FFx, $3FFx, $5FFx, $7FFx, $9FFx, $BFFx and $DFFx. Any of the mirror addresses may be used. I think of them as being located at $FFF4-FFFA due to them being just before the RESET and IRQ vectors. The 6507 is a reduced package version of the 6502, which I first learned to program in the early 80s on my Vic 20. On a 6502 these vectors are, by definition, located at addresses $FFFC-FFFD and $FFFE-FFFF.

      [3] Data Streams will be covered in Part 2.
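      The bank-select addresses, their mirrors, and the dedicated C code sizes listed above follow simple arithmetic, sketched here in C. The function names are hypothetical. The "ROM size minus 30K" rule is an observation that fits all five listed sizes (it would be consistent with a 2K driver plus seven 4K banks), not something the post states outright.

```c
#include <assert.h>
#include <stdint.h>

/* Hotspot that selects a 6507 bank: $FFF4 for bank 0 up to $FFFA for bank 6. */
static uint16_t bank_hotspot(int bank)
{
    assert(bank >= 0 && bank <= 6);
    return (uint16_t)(0xFFF4 + bank);
}

/* The 6507 only has 13 address lines, so every address is seen modulo 8K.
 * This is why $1FFx, $3FFx ... $FFFx all reach the same location. */
static uint16_t to_6507_bus(uint16_t address)
{
    return (uint16_t)(address & 0x1FFF);
}

/* Dedicated C code & data section in KB; the "- 30" is an inferred pattern. */
static int dedicated_c_kb(int rom_kb)
{
    assert(rom_kb >= 32);
    return rom_kb - 30;
}
```

      Under these assumptions, a 6507 access to $1FF6, $9FF6 or $FFF6 all select bank 2.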
  14. NOTE: I want to clarify up front that this is an advanced Atari 2600 programming series. As such, I will not be covering basic things like triggering a Vertical Sync, what a Kernel is, why you need to set a timer for Vertical Blank, etc. If you need to learn that you should check out the following:
       • Collect - detailed development of a 2K game
       • 2600 Programming for Newbies - use the Sorted Table of Contents topic that's pinned at the top in order to easily access the tutorial topics in order.

      CDFJ+ Tutorial Index
       • Part 1 - CDFJ+ Overview: Initial overview of CDFJ+
       • Part 2 - CDFJ+ Details: Registers, Datastreams, etc.
       • Part 3 - Beginnings of Collect 3: coming soon

      NOTE: Check out Start Here in the General forum for instructions on how to set up an environment for CDFJ+ development.
  15. A while ago I started working on a header file with a bunch of constants for division. I'm posting what I have here so that other people can use it. Note, some of the values don't work for the full 16 bit range but most can handle a pretty decent amount of numbers. Math.h
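      For readers who have not seen the technique: the ARM7TDMI has no hardware divide instruction, so division by a constant is often replaced by a multiply and a shift. The sketch below shows the idea for dividing by 10. It is not taken from the attached Math.h; the constant, shift and function name are chosen here for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Divide a 16-bit value by 10 without a division instruction.
 * 0xCCCD is 2^19 / 10 rounded up, so (x * 0xCCCD) >> 19 equals x / 10
 * for every 16-bit x. The product needs 32 bits. */
static uint16_t div10(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 0xCCCDu) >> 19);
}
```

      Smaller constants or shifts save a little work but stop being exact partway through the range, which is the kind of limitation the post mentions for some of its values.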
  16. I've used the data and ROMs Thomas has posted and used them to improve the cycle counting in Gopher2600. There are a couple of problem areas but on the whole it seems fairly consistent in all combinations. Certainly, all real-world ROMs that I have been using for testing (Turbo, Draconian, Zaxxon, etc.) continue to perform as expected (causing screen roll or not depending on the MAM settings). If nothing else it shows that 100% accuracy is within reach.
  17. While the ARM7TDMI-S microprocessor used in the Harmony/Melody only supports 8, 16 and 32 bit data types, I found that using 64-bit integers is allowed when doing bitwise operations only (bitwise and, or, shifting, etc.). I'm currently using an unsigned long long variable to create on-the-fly 48-bit wide graphics, and also for masking and validating a 35-bit wide password (i.e. 7 BASE32 characters). Sometimes 32 bits just isn't enough 🙂 Note that doing calculations with 64-bit integers isn't supported, but at least all bitwise operations are. Maybe these 64-bit tricks could come in handy for other developers too.
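      As an illustration of the password idea, here is a minimal C sketch that packs 7 BASE32 digits (5 bits each, 35 bits total) into a 64-bit integer and unpacks them again using only shifts, AND and OR. The function names and the digit order are assumptions; this is not the poster's code.

```c
#include <assert.h>
#include <stdint.h>

#define PASSWORD_CHARS 7   /* 7 BASE32 digits = 35 bits */

/* Pack digit values (0..31) into one 64-bit word, first digit in the top bits. */
static uint64_t pack_password(const uint8_t digits[PASSWORD_CHARS])
{
    uint64_t bits = 0;
    for (int i = 0; i < PASSWORD_CHARS; i++) {
        bits = (bits << 5) | (uint64_t)(digits[i] & 0x1F);
    }
    return bits;
}

/* Reverse of pack_password: peel 5 bits at a time off the low end. */
static void unpack_password(uint64_t bits, uint8_t digits[PASSWORD_CHARS])
{
    for (int i = PASSWORD_CHARS - 1; i >= 0; i--) {
        digits[i] = (uint8_t)(bits & 0x1F);
        bits >>= 5;
    }
}
```

      The only 64-bit operations used are shift, AND and OR, in line with the restriction described in the post.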
  18. This is really useful data. For what it's worth, Gopher2600 is close in some areas and not in others. If we look at MAM-0 for example, O0 and Os are pretty much at 100% but O2 and O3 are less than 100%, so something is happening in the optimised code which is upsetting the emulation. I'll take a closer look to see what the differences are.
  19. I have completed my testing using Collect3 and also further improved Stella's cycle counting using the results:

                            | Console (PAL)                     | Stella               | Delta
      Banks MAM O Opt Bytes |  ROM   Dec      %  RAM  Dec     % |  ROM   Dec  RAM  Dec |    ROM    RAM
      -     0   1 all 12108 | 3570 13680 278,6% 10FB 4347 88,5% | 35B1 13745 115A 4442 | 100,5% 102,2%
      -     0   s all 11972 | 350E 13582 276,6% 1048 4168 84,9% | 35ED 13805 11BA 4538 | 101,6% 108,9%
      -     0   2 all 12328 | 28CD 10445 212,7%  E4E 3662 74,6% | 28FB 10491  E9A 3738 | 100,4% 102,1%
      -     0   3 all 12412 | 28A8 10408 212,0%  E44 3652 74,4% | 28D4 10452  E91 3729 | 100,4% 102,1%
      1     1   1 all 12108 | 16A1  5793 118,0% 10E6 4326 88,1% | 1652  5714 114B 4427 |  98,6% 102,3%
      1     1   s all 11972 | 162B  5675 115,6% 10C3 4291 87,4% | 1646  5702 11A5 4517 | 100,5% 105,3%
      1     1   2 all 12328 | 11D1  4561  92,9%  E39 3641 74,2% | 1236  4662  E85 3717 | 102,2% 102,1%
      1     1   3 all 12412 | 11B5  4533  92,3%  E2F 3631 74,0% | 11B8  4536  E82 3714 | 100,1% 102,3%
      1     2   1 all 12108 | 131C  4892  99,6%  FFC 4092 83,3% | 1097  4247 104C 4172 |  86,8% 102,0%
      1     2   s all 11972 | 132E  4910 100,0%  FF1 4081 83,1% | 10C1  4289 10AC 4268 |  87,4% 104,6%
      1     2   2 all 12328 |  E4A  3658  74,5%  D5E 3422 69,7% |  E0A  3594  D86 3462 |  98,3% 101,2%
      1     2   3 all 12412 |  E31  3633  74,0%  D54 3412 69,5% |  D95  3477  D89 3465 |  95,7% 101,6%
      2     1   1 all 12108 | 2500  9472 192,9% 10EC 4332 88,2% | 164C  5708 1145 4421 |  60,3% 102,1%
      2     1   s all 11972 | 246C  9324 189,9% 10C9 4297 87,5% | 1643  5699 11A5 4517 |  61,1% 105,1%
      2     1   2 all 12328 | 1CCD  7373 150,2%  E3F 3647 74,3% | 11BE  4542  E85 3717 |  61,6% 101,9%
      2     1   3 all 12412 | 1C75  7285 148,4%  E35 3637 74,1% | 11AC  4524  E7C 3708 |  62,1% 102,0%
      2     2   1 all 12108 | 12D4  4820  98,2%  FFC 4092 83,3% | 1091  4241 1046 4166 |  88,0% 101,8%
      2     2   s all 11972 | 1294  4756  96,9%  FF1 4081 83,1% | 10BE  4286 10AC 4268 |  90,1% 104,6%
      2     2   2 all 12328 |  D76  3446  70,2%  D5E 3422 69,7% |  D92  3474  D86 3462 | 100,8% 101,2%
      2     2   3 all 12412 |  D65  3429  69,8%  D5B 3419 69,6% |  D89  3465  D83 3459 | 101,0% 101,2%

      Legend:
       • Banks = Number of Flash banks (LPC2103 = 1, LPC2104/5 = 2)
       • MAM = MAM mode
       • O = local optimization level
       • Opt: all 3 routines optimized
       • Bytes: size of main.c
       • Dec: decimal value of left hex
       • %: speed relative to LPC2103, -Os, ROM
       • ROM: code executed in ROM
       • RAM: code executed in RAM

      Findings:
       • Compared to running the code in ROM with default optimization -Os and MAM = 2, running with local optimization and in RAM results in 25-30% fewer cycles.
       • Local optimizations are more efficient than RAM code. E.g. for the LPC2104 with -O3 and MAM = 2, there is only very little (0.2%) gained when moving the code to RAM. RAM is more effective with less optimization. This shows that the MAM very effectively buffers Flash memory and that the optimization helps here (probably by using proper alignment, more testing required here).
       • The difference between -O2 and -O3 is minimal.

      Notes:
       • The LPC2104 I have clearly has a bug in MAM mode 1 (marked in red in the original table). The numbers are (much) worse than for the LPC2103, especially for ROM code. It would be nice if someone with an LPC2105 could verify this.
       • Stella is already quite close for MAM modes 0 or 1. But in mode 2, especially with -O1 and -Os in ROM, there is room for improvement.

      Collect.zip
  20. By default the ARM's 8K of RAM is configured as:
       • 2K CDFJ Driver (or 3K DPC+ driver)
       • 4K Display Data
       • 2K C Usage for Variables, RAM Functions, and Stack (or 1K if using DPC+)

      2K of space for C usage can be used up quickly if you're keeping track of a lot of objects, such as the positions of the 159 sprites that are in Draconian, or if you're putting functions in RAM for a performance boost. If you're not using the full 4K of Display Data you can reallocate part of it for C Usage.

      The division between Display Data and C Usage is controlled by the ram entry in the MEMORY section of the custom.boot.lds file, located in the custom directory:

          MEMORY
          {
              boot   (RX) : ORIGIN = 0x800     , LENGTH = 0x80    /* C-runtime booter */
              C_code (RX) : ORIGIN = 0x880     , LENGTH = 0x6780  /* C code (26K) */
              ram         : ORIGIN = 0x40001800, LENGTH = 0x800   /* 2K free RAM */
          }

      If you're going to change it the best practice is to:
       • duplicate the ram entry
       • comment out one of them by putting /* before ram and */ after 0x800
       • edit the duplicate entry

      As an example we will reallocate 1K of Display Data to C Usage. Both ORIGIN and LENGTH need to be adjusted by the amount - subtract from ORIGIN, add to LENGTH. (1K = 0x400).

          MEMORY
          {
              boot   (RX) : ORIGIN = 0x800     , LENGTH = 0x80    /* C-runtime booter */
              C_code (RX) : ORIGIN = 0x880     , LENGTH = 0x6780  /* C code (26K) */
          /*  ram         : ORIGIN = 0x40001800, LENGTH = 0x800 */ /* 2K free RAM */
              ram         : ORIGIN = 0x40001400, LENGTH = 0xC00   /* 3K free RAM, took 1K from Display Data */
          }

      Warning: do not forget to update your Initialize routine so it does not overwrite the RAM that has been reallocated for C Usage. This is especially critical if you have put functions into RAM for a performance boost.

          void Initialize()
          {
              ...
              // 1K of Display Data RAM was given to C Variables and Stack.
              // this is done in custom.boot.lds
              myMemsetInt(RAM_INT, 0, 3072/4);

              // When the Harmony/Melody is powered up the 4K of Display Data RAM will
              // contain random values, so we should zero it out to have a known starting
              // point.  Using myMemsetInt is faster than using myMemset, but requires
              // dividing the number of bytes by 4 because an integer is stored in 4 bytes.
              // myMemsetInt(RAM_INT, 0, 4096/4);
              ...
          }

      In Speed boost - run code in RAM the location of the functions that ended up in RAM can be seen in armcode.map:

          0x0000000040001800 RAM_BitReversal
          0x0000000040001828 RAM_ColorConvert
          0x000000004000185c RAM_PrepArenaBuffers

      After changing custom.boot.lds the RAM functions now start at 0x40001400 instead of 0x40001800:

          0x0000000040001400 RAM_PrepArenaBuffers
          0x0000000040001570 RAM_BitReversal
          0x0000000040001598 RAM_ColorConvert

      Source: Collect3RAMallocation.zip
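      The arithmetic behind the edit above (subtract from ORIGIN, add to LENGTH, and shrink the Display Data clear accordingly) can be double-checked with a small host-side C sketch. The struct and function names are hypothetical helpers for the calculation only; they are not part of the CDFJ tool chain.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the "ram" entry of the MEMORY section in custom.boot.lds. */
typedef struct {
    uint32_t origin;
    uint32_t length;
} RamRegion;

/* Give `bytes` of Display Data to C usage: the region starts earlier
 * and becomes longer by the same amount. */
static RamRegion grow_c_ram(RamRegion ram, uint32_t bytes)
{
    assert(bytes % 4 == 0);          /* keep the region word aligned */
    RamRegion grown = { ram.origin - bytes, ram.length + bytes };
    return grown;
}

/* Count argument for myMemsetInt() after the move (it works in 32-bit words). */
static uint32_t display_data_words(uint32_t dd_bytes, uint32_t moved_bytes)
{
    assert(moved_bytes <= dd_bytes);
    return (dd_bytes - moved_bytes) / 4;
}
```

      For the 1K example this gives ORIGIN 0x40001400, LENGTH 0xC00, and 768 words to clear, the same as the 3072/4 used in Initialize().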
  21. generating that was never added to the makefile, probably because we didn't utilize it before. I think this investigation of putting functions into RAM is the first time I've used it. I had to go looking for one of @cd-w's posts as I didn't remember how it was done.
  22. @SpiceWare How can I enable assembler output (incl. source code) for main.c in the makefile? There is a file armcode.txt, but it is not updated during make. I really would like to have a look at the assembly created by the several __attribute__ settings.
  23. I did some more testing, all with my Harmony cart (LPC2103):
       • no local optimizing: 0x1339 (ROM), 0x1090 (RAM)
       • optimizing PrepArenaBuffers only: 0x0e31 (ROM), 0x0d57 (RAM)
       • optimizing BitReversal only: 0x177c (ROM), 0x148a (RAM)
       • optimizing ColorConvert only: 0x1407 (ROM), 0x107c (RAM)
       • optimizing all three functions: 0x0e31 (ROM), 0x0d5a (RAM)

      So PrepArenaBuffers is the key function here. Optimizing only other functions even makes the result worse! I suppose if the code becomes too large, then the small 128-bit buffers of the MAM can become much less efficient.

      The alignment of loops might play a major role here too. E.g. if you have a (small) loop with an extra branch, the branch trail buffer will always be a miss unless the loop and branch target are within the same 128 bits. If the loop is aligned to 128 bits the chance of a hit increases. However, if the loop is larger and has multiple small branches, then putting the small branches into the same 128-bit page might be more efficient. Though in both cases the prefetch buffers may come to the rescue. And for the LPC2104/5 of the dev cart, we have two buffers each, which changes the situation again.

      Another optimization is to put the data into the same 128-bit page. Or make sure that consecutively used data is in the same page.

      Since the compilers are not aware of the MAM, they cannot optimize for it. Maybe we can find some parameters which can help here. E.g. it might be helpful to optimize for speed only moderately and reduce the code size. Then the buffer hit rate and the overall speed might increase. But I suppose there is a lot of trial and error. And while changing code, the results may vary quite a lot unexpectedly.

      I suggest that the critical functions are optimized with level 2 or 3 by default. And when the coding is done, one can experiment with different settings.
      Some other maybe useful GCC function attributes:
       • aligned (alignment): This attribute specifies a minimum alignment for the function, measured in bytes.
       • always_inline: Generally, functions are not inlined unless optimization is specified. For functions declared inline, this attribute inlines the function even if no optimization level was specified.
       • flatten: Generally, inlining into a function is limited. For a function marked with this attribute, every call inside this function will be inlined, if possible. Whether the function itself is considered for inlining depends on its size and the current inlining parameters.
       • optimize: The optimize attribute is used to specify that a function is to be compiled with different optimization options than specified on the command line. Arguments can either be numbers or strings. Numbers are assumed to be an optimization level. Strings that begin with O are assumed to be an optimization option, while other options are assumed to be used with a -f prefix. You can also use the `#pragma GCC optimize' pragma to set the optimization options that affect more than one function. This can be used for instance to have frequently executed functions compiled with more aggressive optimization options that produce faster and larger code, while other functions can be called with less aggressive options.

      The same for variables:
       • aligned (alignment): This attribute specifies a minimum alignment for the variable or structure field, measured in bytes. Example: int x __attribute__ ((aligned (16))) = 0;
       • packed: The packed attribute specifies that a variable or structure field should have the smallest possible alignment (one byte for a variable, and one bit for a field), unless you specify a larger value with the aligned attribute.

      These are only the most relevant I found. There are more.

      We should also experiment with '#pragma GCC optimize' inside larger functions, so that only the most relevant parts (e.g. the main loop) of a function are optimized differently.
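      As a concrete starting point for such experiments, the sketch below applies the optimize and aligned attributes to one function and aligned to a lookup table. The function and table are made up for illustration (a table-based bit reversal, not the RAM_BitReversal from Collect3), and the choice of O2 and 16 bytes (128 bits, the MAM buffer width mentioned above) is an assumption to be tuned by measurement. The optimize attribute is GCC specific; other compilers may ignore it with a warning.

```c
#include <assert.h>
#include <stdint.h>

/* 4-bit reversal table, aligned so that it sits inside one 16-byte block. */
static const uint8_t nibble_reverse[16] __attribute__((aligned(16))) = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Compile just this function at -O2 and start it on a 16-byte boundary,
 * regardless of the optimization level given on the command line. */
__attribute__((optimize("O2"), aligned(16)))
uint8_t reverse_bits(uint8_t value)
{
    /* Swap the two nibbles while reversing the bits inside each one. */
    return (uint8_t)((nibble_reverse[value & 0x0F] << 4) |
                      nibble_reverse[value >> 4]);
}
```

      Whether this helps on a given chip still has to be confirmed with the cycle counts, as the measurements earlier in this thread show.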
  24. True, though this investigation is helping to revive my interest in 2600 projects. Hopefully I'll be interested enough to get back to it when our friends leave in a few days.
  25. This is more something for @johnnywc to test for Turbo Arcade anyway.