Everything posted by speccery

  1. A quick update: I have been working together with the ULX3S developers to improve the icy99 implementation on the ULX3S FPGA board. emard and lawrie have been testing the TI-99/4A core. emard added support for the screen overlay feature he uses with many other retro computer implementations. The cool thing is that the overlay runs on an ESP32 microcontroller running MicroPython. The overlay screen already allows loading of ROM images from SD card. When I have time I will continue to work on the core itself, for instance to add support for a disk subsystem with the ESP32. I can probably reuse some of the code I created for my ET-PEB project. In the process I accidentally broke my FPGA board: the primary micro USB socket came loose and tore off the PCB pads. Luckily I could fix this with a new micro USB connector, as it is the same connector used in my StrangeCart board. Some pictures are here in the gallery.
  2. speccery


    Broken pads are visible next to the five narrow micro USB connector pads.
  3. speccery


    Here the new socket has been soldered in and the board works again.
  4. speccery

    ULX3S FPGA broken USB connector

    The micro USB sockets can sometimes become loose. I managed to fix my FPGA board by soldering a new socket. The old socket got stuck on the plug, and ripped the PCB pads.
  5. Just published the current version of Icy99 at icy99 (github) due to some interest from the ULX3S development community. Thanks @pnr for letting me know about that, and for the digital video interface as well as the TMS9902 cores used here.
  6. The Blackice-II board, which I was using as the primary design target when writing the above, only has 512K of memory in total. Thus I was trying to come up with an allocation of that memory which would support most use cases. Clearly it's not possible to allocate 640K for GROM/GRAM space when there is only 512K to work with in total. On other boards, such as the ULX3S, there is an SDRAM chip on board. It provides 32 megabytes of memory, but I'm not yet supporting that in this design, although I have earlier worked on another board with SDRAM providing system memory for my TMS9900 core. With that kind of memory space it becomes possible to support the full GROM space, as well as the full 1M AMS memory space. I don't actually even know what would be a reasonable amount of memory to allocate for GROM content. If memory serves me right the Ubergrom supports around 100K of GROM space. I was thinking that 40K of cartridge GROM space would support pretty much all individual GROM programs, but of course it is possible to have multiple programs in different GROM bases simultaneously.
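As a rough sketch of the arithmetic behind the 640K figure: assuming the usual 99/4A layout of 16 GROM bases, each with five 8K cartridge GROM slots (40K per base), the full cartridge GROM space does not fit into 512K. The constants and function names below are assumptions for illustration and are not taken from the Icy99 source.

```python
# Rough GROM/GRAM budget check. All constants are illustrative assumptions.
KIB = 1024
GROM_SLOT_SIZE = 8 * KIB   # one slot in the 64K GROM address space
CARTRIDGE_SLOTS = 5        # slots 3..7; slots 0..2 hold the console GROMs
GROM_BASES = 16            # number of GROM bases assumed here


def cartridge_grom_bytes(bases: int = GROM_BASES) -> int:
    """Bytes needed to back the cartridge GROM space for `bases` GROM bases."""
    return bases * CARTRIDGE_SLOTS * GROM_SLOT_SIZE


def fits(total_memory: int, bases: int = GROM_BASES) -> bool:
    """True if the cartridge GROM space alone fits into total_memory bytes."""
    return cartridge_grom_bytes(bases) <= total_memory


if __name__ == "__main__":
    print(cartridge_grom_bytes() // KIB)   # 640 (K for all 16 bases)
    print(fits(512 * KIB))                 # False: a 512K board is too small
    print(fits(32 * 1024 * KIB))           # True: 32 MB SDRAM has room
```

With a single base the same calculation gives the 40K mentioned in the post.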
  7. I ported the Icy99 again for the ULX3S FPGA board. I looked at my notes, in this thread actually, and I can see I did port it before. But I couldn't find the files anymore, probably since I only have some of my computers in the temporary apartment we're staying in (moving to a permanent location next week). I clearly use too many computers for the same stuff... Anyway it was not a lot of work to port the design from the Flea FPGA Ohm files, as both boards use the Lattice ECP5 family FPGA chips.

    Once I got the porting done I started to look at the cause of the VDP video bug I mentioned above, and it only took me 5 minutes to both find and fix the bugs. Sometimes it helps to return to a project after a pause, although I wouldn't recommend 6 month pauses... The basic problem was that my verilog code checked for end of scanline and end of frame conditions in a stupid way: if a CPU memory access occurred while the logic was waiting for the said event, the logic would miss the event... The fix was simple, just moving these end of line/frame checks to their own parallel blocks, evaluated on every clock cycle. Now the image is solid, although the first scanline repeats now for lines, so there is an edge condition still needing fixing.

    I'm still only using internal block memory on the ULX3S. The ECP5-85 chip has over 400K of internal block RAM, and I am now using just over 100K to support the 99/4A: system ROMs, system GROMs, 32K RAM expansion, VDP VRAM, 16K cartridge ROM space and 32K cartridge GROM space.

    I am now wondering how I could connect StrangeCart to the FPGA design, as that would be a fun exercise. Perhaps I will just go with a simple SPI bus, as that would only require four wires. The code on the StrangeCart side would need to be changed, so that instead of listening to the parallel TMS9900 address & data buses it would get the address and data over the serial bus. Also, the FPGA would need to capture reads and writes to cartridge space and transfer the data in SPI format. If the SPI clock is something like 25MHz I would get very close to the bandwidth of the standard cartridge interface: 16 clocks to transfer the address, perhaps 8 bits for the command (read/write, ROM/GROM), and 16 bits for data.
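The SPI estimate above can be checked with a few lines of arithmetic. This is only a sketch: it counts the 16 + 8 + 16 bits from the post and ignores chip select handling and any turnaround time, and the function names are made up for illustration.

```python
def spi_transaction_seconds(clock_hz: float,
                            address_bits: int = 16,
                            command_bits: int = 8,
                            data_bits: int = 16) -> float:
    """Time to clock one address + command + data transaction over SPI."""
    total_bits = address_bits + command_bits + data_bits
    return total_bits / clock_hz


def transactions_per_second(clock_hz: float) -> float:
    """Upper bound on cartridge accesses per second at the given SPI clock."""
    return 1.0 / spi_transaction_seconds(clock_hz)


if __name__ == "__main__":
    t = spi_transaction_seconds(25_000_000)
    print(f"{t * 1e6:.2f} us per transaction")            # 1.60 us
    print(f"{transactions_per_second(25_000_000):.0f}")   # 625000 per second
```

At 25MHz one 40-bit transaction takes 1.6 microseconds, which is the figure to compare against the access time of the real cartridge port.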
  8. Thanks Jim for the link, unfortunately I'm having a hard time downloading the file, but retrying seems to work, fingers crossed... Big file.
  9. Hi Tony, thanks for the comments, very interesting. As discussed earlier in this thread, I have three TMS99105 chips, and at least one of them seems to have the TMS99110 floating point macrostore ROM. I am curious, what were the systems you were involved in using the TMS99000 series chips outside of TI?
  10. That is in fact what I was thinking of doing, i.e. just have a long stream of LI R0 instructions (and perhaps an occasional LI R1 to update the VDP pointer) in cartridge memory. This way you would have sections of the TMS9918 VDP memory exposed to the ARM processor, so that every 4th byte would be a VDP memory byte and the rest code. I was thinking that at the end of each long stretch of LI R0 instructions there would be a branch to the beginning, and that branch instruction fetch could cause the ARM side to switch to another "ROM" bank. That would be a simple way to synchronise the ARM and TMS9900. While the TMS9900 is running through one stretch, the ARM would generate the next.
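A minimal sketch of how the ARM side might lay out such a stretch of code. The opcodes are the standard TMS9900 encodings (LI R0 is >0200, B @addr is >0460); everything else is an assumption for illustration: the data byte is placed in the high byte of each immediate with the low byte left at zero, the function name build_li_stream is hypothetical, and how the LI R0 result actually reaches the VDP is outside this sketch.

```python
LI_R0 = 0x0200       # TMS9900 opcode word for LI R0,<immediate>
B_SYMBOLIC = 0x0460  # TMS9900 opcode word for B @<address>


def build_li_stream(vdp_bytes, loop_address: int) -> bytes:
    """Emit one LI R0 per data byte, then a branch back to loop_address.

    Each instruction is 4 bytes: opcode (2 bytes) + immediate (2 bytes).
    Assumption: the VDP data byte sits in the high byte of the immediate.
    """
    out = bytearray()
    for value in vdp_bytes:
        if not 0 <= value <= 0xFF:
            raise ValueError("VDP data must be bytes")
        out += LI_R0.to_bytes(2, "big")
        out += bytes([value, 0x00])
    # The fetch of this branch is the event the ARM could use to swap banks.
    out += B_SYMBOLIC.to_bytes(2, "big")
    out += loop_address.to_bytes(2, "big")
    return bytes(out)


if __name__ == "__main__":
    print(build_li_stream([0x12, 0x34], 0x6000).hex())
    # 020012000200340004606000
```

In this layout the byte at offset 2 of every 4-byte instruction is the one the ARM rewrites for each new frame.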
  11. Interesting picture. Since I don't have a PEB, I am always learning about new things that were designed for the PEB, and I continue to be intrigued by these. As I know nothing about the board, I guess I can make completely random comments: it appears the Z80 could be controlling a printer buffer. There is 64K RAM on board. Looking at the PCB design it would be interesting to be able to read what chip U5 is, as it seems to sit between the Z80 computer and TI's bus. Looks like they basically slammed a full CP/M computer on board.
  12. Thanks for sharing your thoughts, interesting discussion here. When you say that the bandwidth is around 2k per frame on the TI, what is the frame rate you're assuming? Probably not 60fps, more like 30fps?
  13. The MiSTer version of TI-99/4A is based on my EP994A core. Integrating an existing component is conceptually similar to integrating a C library. But frequently it is not like that. There are differences in the chip capabilities, pinouts, peripheral chips a design relies on, and so on. I haven't studied the F18A code even if I have had access to it for a while, but it is designed with the Xilinx dual port memories in mind. My Icy99 core targets multiple FPGA chips, including the Lattice ICE40HX family, which has really humble specs.

    One key area of differences in specs is the on-chip block memories. The ICE40HX chips do have dual port block memories, but only 16K in total, and one port is a fixed write-only port while the other is a fixed read-only port. The Xilinx block memories support changing roles, so you can have for example one read-write port and another read port. This setup works well for a VDP implementation. Long story short - I do not think it is possible to port the F18A for the ICE40HX chips. I have already tested Icy99 with Lattice ECP5 FPGAs as well as Xilinx chips, and those could support the full F18A.

    16K is not enough for a TMS9918 + VRAM implementation, since in practice you always want to include a scan doubler to have VGA or DVI output. The scan doubler needs some memory too, and anyway in my opinion it would be wasteful to use all the internal block RAM for the VDP alone when doing a complete system-on-a-chip setup. So the Icy99 VDP only uses a few internal block memories, and relies on fast external SRAM access (or, in the case of the larger FPGAs, on internal block RAMs).

    Now the reason I am writing all of this is that I believe the bug occurs due to external SRAM bus arbitration and the way the VDP is using the bus. This part is very different from how I think the F18A works (without looking at the source) and definitely different from how the EP994A core works.
  14. Thanks, now this makes sense! I did the same, wrote a DSR before starting to think how they get called. The 99/4A would not be the 99/4A unless the PABs and file buffers were in VDP RAM too. That was my first thought when working on the DSR. Of course for an unexpanded computer that's nearly all the RAM one's got.
  15. I haven't worked on the Icy99 lately, but I did run a test with it today in connection with my new project, the StrangeCart (pls see that thread if you're interested). I did find a bug in the VDP code with the test. Making a video has been on my mind, I will try to find some time to do it.
  16. Thanks. Yes I have the solution for this already: do all processing by the Cortex M4 on the cartridge and write the data to be written to VDP memory to a memory area in the cartridge memory, and then have a very simple routine in scratchpad memory for the TMS9900 to just loop through the data and write it to VDP. Then the whole thing will become VDP memory write bandwidth limited. Simple structures like: count, VDP destination address, count bytes of data. My current scrolling routine already sets VDP address registers very seldom (that is the reason for the vertical strips). But before I go there, I want to first play around with TMS9900 code without help from the Cortex M4 core since that is kind of cheating, although I will be more than happy to cheat once I get to that part of the project 😁

    But I did do some benchmarking of my prototype scrolling code, comparing the same code on some of the different systems I have:

    - TI-99/4A (8.3 seconds)
    - My "legacy" EP994A FPGA core (around 0.7s @ 100MHz) with 31 wait states but cache enabled.
    - My "new" Icy99 core on the ICE40HX (around 1.5s @ 25 MHz).

    I could not run my EP994A FPGA core at full speed due to some stupid versioning reason, and I did not want to run the synthesis again.

    I did find a bug in my Icy99 core: on the ICE40HX platform the VDP memory is located in external SRAM. It is actually a UMA system (unified memory architecture - just one memory bus shared between CPU, VDP, and DMA from host). With the scrolling routine putting "a lot" of memory transfer load between the CPU and VDP, the VDP screen refresh engine sometimes runs too slowly, which annoyingly causes a temporary loss of vertical sync, i.e. the screen flashes. It is funny how the different projects reveal new issues. By keeping the code in TMS9900 assembly it is actually a lot of fun to test drive it in all of these other systems.
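The "count, VDP destination address, count bytes of data" structure mentioned at the start of this post could look roughly like the sketch below. The field widths (16-bit big-endian words), the zero-count terminator and the function names are assumptions for illustration; the post does not specify them.

```python
import struct


def encode_records(records) -> bytes:
    """Pack (vdp_address, data) pairs as: count, address, then count data bytes.

    Assumed layout: 16-bit big-endian count and address; a count of 0 ends the list.
    """
    out = bytearray()
    for address, data in records:
        data = bytes(data)
        if not data:
            continue  # an empty record would look like the terminator
        out += struct.pack(">HH", len(data), address)
        out += data
    out += struct.pack(">H", 0)  # terminator
    return bytes(out)


def decode_records(blob: bytes):
    """Walk the list the way a simple copy loop would, returning the records."""
    records = []
    pos = 0
    while True:
        (count,) = struct.unpack_from(">H", blob, pos)
        pos += 2
        if count == 0:
            return records
        (address,) = struct.unpack_from(">H", blob, pos)
        pos += 2
        records.append((address, blob[pos:pos + count]))
        pos += count


if __name__ == "__main__":
    blob = encode_records([(0x1800, b"\x01\x02\x03")])
    print(blob.hex())             # 0003180001020300 00 -> "00031800010203" + "0000"
    print(decode_records(blob))   # [(6144, b'\x01\x02\x03')]
```

The decoder mirrors what the TMS9900 loop would do: set the VDP write address once per record, then copy count bytes to the VDP data port.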
  17. Thanks Lee, that is interesting. One silly question (I have only written the receiving end of a DSR, i.e. the actual DSR). The calling convention you mention is:

        BLWP @DSRLNK
        DATA 8

    I am curious about the use of the parameter 8. How is it used? If I have understood properly, scratchpad location >8356 must be initialised with the PAB structure pointer (offset 9) in VDP memory. Sorry I haven't taken any time to dig into this.
  18. Wow that is a lot! The best I could do (on a very quick attempt, and I don't have the speech module here in my temporary apartment) was 77 cycles, with MOV @>9800(R1),@>A000(R1). Measured with your fantastic classic99. BTW: is there a way to save / restore breakpoints? I think you already answered me in some message before, but can I reload a cartridge image (during development) without exiting classic99?
  19. Thanks to the tip from @Asmusr I used classic99 to measure the speed of my code. With some simple changes I was able to get the time of a one bit left shift for 16 scanlines down from around 135K cycles to about 97K cycles. Only modifying the pattern tables, not touching color tables. There are in total 32*16=512 VDP RAM reads and the same amount of writes. Come to think of it, my code actually probably reads one extra strip at the end. Anyway, in the big picture that's around 100 cycles for each VDP access and processing.

    I still find it mind boggling that instructions such as "MOVB *R7,@STRIPBUF" take 38 clock cycles (over 10 microseconds) when running from cartridge ROM space. In this example R7 points to a VDP register, workspace and STRIPBUF are in scratchpad. Of course there are 4 opcode bytes to fetch, and the store is a read-modify-write cycle, so there are quite a few memory cycles involved in the execution of that instruction. If I did my math properly there are 6 memory cycles in total, each 16 bits. Since two of these are to ROM memory and broken up with wait states into four byte wide fetches, the total is in a way 8 memory cycles.

    This actually makes me wonder what would be the slowest MOV instruction one could have? I suppose it would mean placing code and operands into 8-bit wide memory, and then perhaps accessing GROM to get the maximum penalty...
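The figures in this post can be cross-checked with a short calculation. It assumes the standard 3 MHz TMS9900 clock of the TI-99/4A; the function names are made up for illustration.

```python
CPU_HZ = 3_000_000  # TMS9900 clock in the TI-99/4A (assumed 3 MHz)


def cycles_to_us(cycles: float) -> float:
    """Convert CPU clock cycles to microseconds."""
    return cycles / CPU_HZ * 1e6


def cycles_per_access(total_cycles: float, columns: int = 32, scanlines: int = 16) -> float:
    """Average cycles per VDP access: one read plus one write per pattern byte."""
    accesses = 2 * columns * scanlines
    return total_cycles / accesses


def shifts_per_second(total_cycles: float) -> float:
    """How many one-pixel shifts of the band fit into one second."""
    return CPU_HZ / total_cycles


if __name__ == "__main__":
    print(f"{cycles_to_us(38):.1f} us for the 38-cycle MOVB")          # 12.7 us
    print(f"{cycles_per_access(135_000):.0f} cycles/access before")    # 132
    print(f"{cycles_per_access(97_000):.0f} cycles/access after")      # 95
    print(f"{shifts_per_second(97_000):.0f} shifts per second")        # 31
```

So the "around 100 cycles" estimate corresponds to roughly 95 cycles per access after the optimisation, and 38 cycles is about 12.7 microseconds.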
  20. Thanks Lee, good comment. A quick breakpoint in Parsec loads R0 with 6, so a shift right of 6. Which I suppose in practice means a shift of 2 pixels to the left, if the bytes are organised in memory the way I think. I did not pay too much attention to this instruction not matching my expectation, as I was more focused on getting the framework in place. I have now done that.
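To illustrate why a circular shift right by 6 behaves like a 2 pixel shift to the left: on a 16-bit word, rotating right by 6 is the same as swapping the two bytes and then rotating left by 2. The sketch below models the TMS9900 SRC instruction on plain integers; the helper names are made up, and the assumption (as in the post) is that the word holds bytes from two adjacent columns.

```python
def src(word: int, count: int) -> int:
    """Model of TMS9900 SRC: rotate a 16-bit word right by count bits."""
    count %= 16
    return ((word >> count) | (word << (16 - count))) & 0xFFFF


def swap_bytes(word: int) -> int:
    """Exchange the high and low bytes of a 16-bit word (like SWPB)."""
    return ((word >> 8) | (word << 8)) & 0xFFFF


def rotate_left(word: int, count: int) -> int:
    """Rotate a 16-bit word left by count bits."""
    count %= 16
    return ((word << count) | (word >> (16 - count))) & 0xFFFF


if __name__ == "__main__":
    word = 0x1234
    print(hex(src(word, 6)))                        # 0xd048
    print(hex(rotate_left(swap_bytes(word), 2)))    # 0xd048
```

The high byte of the result (>D0) is the original low byte (>34) moved two pixels left, with the top two bits of the other byte shifted in from the right.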
  21. Nice, thanks for the tip!
  22. I have been having fun programming in TMS9900 assembly. I've been thinking about how to demo the StrangeCart, which somehow evolved into looking at, for instance, how Parsec works (i.e. the scrolling there). As a result I wrote some miscellaneous code:

    - I wrote an assembler routine to convert material on the screen from graphics mode 1 to graphics mode 2. This was a good reminder of how these modes work. I wrote the routine since my previous test cartridge code wrote some text on the screen in GM1, and I wanted to use that as test material. This conversion uses the 32K extended memory. It simply first reads VRAM contents to the memory extension, and then from that it writes the same stuff back in GM2 memory layout (and it also sets the VDP to GM2 of course). At the end the same content appears but in GM2. This was a good reminder that the CPU is not very fast, as it takes surprisingly long, something like a second, for this operation to complete.
    - I then proceeded to write some support code to get back to GM1.
    - As part of the GM1 to GM2 conversion I wrote the character indices so that for the top 8 rows they run linearly from 0..255, which is pretty much the default for GM2. I did the same for the next 8 rows. But for the bottom third of the screen, I used a character layout similar to Parsec: this way the pattern definitions (i.e. contents of character cells) are arranged so that each strip of 64 vertical pixels is in consecutive addresses. So if the character x position is a 5 bit number from 0 to 31, and the character y position is a 3 bit number from 0 to 7 (to be able to address all 256 character cells in the bottom third of the screen), I calculated the character cell values like this: (x << 3) | y. Below is the assembly code:

        ; Fill the last third with Parsec style scrolling arrangement.
        ; First column 0, second 8, third 16. Thus the data written is as follows
        ; X = 5 bit column value, Y = 3 bit row value
        ; Char codes: X4 X3 X2 X1 X0 Y2 Y1 Y0
            CLR  R1
        !   MOV  R1,R2
            SLA  R2,3       ; Shift left by three, making bottom three bits of higher byte vacant
            MOV  R1,R3
            SRL  R3,5       ; Shift right by 5, move top 3 bits to three lsbs of higher byte
            SOC  R3,R2      ; set ones corresponding, i.e. OR
            MOVB R2,@VDPWD
            AI   R1,>0100
            JNE  -!

    - I then went on to test Parsec style scrolling routines. My code was in the cartridge ROM area, workspace in scratchpad memory. I set aside a Parsec style buffer of 128 bytes for 2 strips of 8 pixels wide times 64 pixels tall. Then, I read two columns into this buffer, so that bytes from the first and second column alternate. With this arrangement the actual scrolling goes the same way as in Parsec (although in the Parsec source code the shift instruction on line 3050 of the PDF I found is SRC R1,0 which makes no sense to me at first glance).

        *---------------------------------------------
        * Write a strip of 64 pixels while scrolling.
        * Parsec style. VDP write pointer must have been set.
        SCROLLSTRIP:
            mov  @STRIPBU2(R5),r1   ; read first and second column bytes
            sla  r1,1               ; shift left by 1 bit (i.e. pixel)
            movb r1,@>8c00          ; write to VRAM
            inct R5                 ; inc R5 by 2 (i.e. skip the other column)
            jlt  SCROLLSTRIP        ; loop back
            RT

    This turned out to be quite slow for the entire 64 pixel high area, far away from instantaneous. My first attempt was to use an algorithm which on every loop iteration reads two strips, and then shifts and writes one strip, repeated 32 times, once for every column. This is kind of how Parsec works, except it only scrolls a small horizontal section at a time, and the height is 30 pixel rows. My scroll was not going fast. I then modified the SCROLLSTRIP routine above to perform byte swapping and to store the swapped bytes; that way I only needed to read one strip and write one strip, except on the first iteration. This was a bit better but still slow.

    As the last test I limited my code to only scroll 16 scanlines, and moved the "strip buffer" to scratchpad memory (my workspace is at >8300 and the buffer, now just 32 bytes, is at >8320). With this setup, scrolling the 16 scanlines through all 256 pixels takes about 10 seconds, i.e. it runs at about 25 one-pixel steps per second (25fps). I could copy the code to scratchpad memory too, but did not bother at this point.

    This was a pretty interesting exercise, as a lead-in to what would happen if I used the faster processor on the cartridge to compute the scrolls, and have the TMS9900 just copy the data to VDP memory. Or something like that. Overall it was fun to do some assembly programming after a while. One question which came to mind was how one would easily do cycle counting. Does classic99 for example support it? I.e. reset the cycle counter at a breakpoint, and read the count at another breakpoint?
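The character layout for the bottom third of the screen, described earlier in this post, can be modelled in a few lines to check that the register shuffling in the fill loop really produces (x << 3) | y. This is a sketch that mimics the 16-bit register operations of that loop (SLA, SRL, SOC, MOVB, AI) on plain integers; the function names are made up for illustration.

```python
def parsec_char_code(x: int, y: int) -> int:
    """Character code for column x (0..31) and row y (0..7): X4..X0 Y2..Y0."""
    return ((x << 3) | y) & 0xFF


def name_table_from_register_ops():
    """Replay the fill loop: R1 counts the screen position in its high byte."""
    codes = []
    r1 = 0
    while True:
        r2 = (r1 << 3) & 0xFFFF   # SLA R2,3: column bits move up, row bits fall off
        r3 = r1 >> 5              # SRL R3,5: row bits land in the 3 lsbs of the high byte
        r2 |= r3                  # SOC R3,R2: OR the two together
        codes.append(r2 >> 8)     # MOVB writes the high byte to the VDP
        r1 = (r1 + 0x0100) & 0xFFFF   # AI R1,>0100
        if r1 == 0:               # JNE falls through after 256 positions
            return codes


if __name__ == "__main__":
    codes = name_table_from_register_ops()
    print(codes[:4])    # [0, 8, 16, 24]: first row, columns 0..3
    print(codes[32])    # 1: second row, column 0
```

Reading the pattern table in character code order then walks down one 64 pixel tall column before moving to the next one, which is what makes the strip based scrolling possible.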