matthew180

Members
  • Posts

    3,184
  • Joined

  • Last visited

  • Days Won

    4

matthew180 last won the day on January 21 2022

matthew180 had the most liked content!

Profile Information

  • Gender
    Male
  • Location
    Central Florida
  • Interests
    My family, FPGA, electronics, retro/vintage computers, programming, coin-op games, reading, outdoor activities.
  • Currently Playing
    Divinity2

Recent Profile Visitors

22,112 profile views

matthew180's Achievements

River Patroller (8/9)

2.9k

Reputation

  1. There are three primary branching instructions on the 9900:
     • B - Unconditional branch (fastest)
     • BL - Branch and Link (almost as fast as B)
     • BLWP - Branch and Load Workspace Pointer (slowest)
To really understand the differences, you need to understand that the general purpose registers (R0..R15) on the 9900 are not internal to the 9900 CPU. The only *real* hardware registers are the PC (Program Counter), WP (Workspace Pointer), and SR (Status Register). The general purpose registers R0..R15 are stored in RAM, in a 32-byte block called a "workspace". The 9900 uses the WP to "point" to where the registers are located in memory. This is very different from just about all other commercial CPUs of the era (and most CPUs in general, at all, ever). There are advantages and disadvantages to having the CPU's registers set up this way, but let's skip that and just focus on the branching for now.
Note, on the 99/4A you ALWAYS want your registers stored in the 256 bytes of 16-bit scratch-pad RAM. All other memory in the system is slow, wait-state 8-bit RAM, and placing your registers anywhere outside of the scratch-pad RAM (>8300 to >83FF) will slow your program down significantly. Again, there are only 256 bytes of 16-bit RAM in the 99/4A, so you need to use it carefully.
The B instruction literally just loads a new value into the PC. Program execution continues at whatever address was specified with the instruction.
The BL instruction is just like B, however the current value of the PC is first copied into R11 (the "link" part of the instruction) before the PC is replaced with its new value. Copying the PC to R11 is hard-wired in the CPU, so be sure you don't need the value in R11 before using BL. Having the PC saved in R11 this way allows you to "return" to where the BL instruction was issued by using B *R11 (the assembler has a pseudo opcode "RT" that is converted into this instruction; if you see "RT" in an assembly program, it is literally "B *R11").
Thus, BL is a basic subroutine instruction. As long as the called subroutine does not change R11, it can return to the caller. If you need to nest subroutines, then it is up to you to save the value in R11 before issuing another BL instruction (and restore R11 before issuing B *R11).
BLWP is similar to BL, however when BLWP is executed both the PC and WP get new values, which means the subroutine is usually (but not always) using a new set of registers (changing the WP changes what memory the CPU uses for R0..R15). The instruction also copies the old WP into R13, the old PC into R14, and the old SR into R15 of the new workspace. You typically use RTWP to return from a subroutine called with BLWP, which will restore the caller's PC, WP, and SR. Your subroutine must not change R13, R14, or R15 for the BLWP / RTWP mechanism to work.
What you need to keep in mind with BLWP is that, if you use it as intended, every subroutine needs its own 32-byte chunk of memory for the new R0..R15 workspace. On the 99/4A, where you only have 256 bytes of 16-bit memory, this can get used up really fast. If you limit yourself to only one BLWP at any one time, i.e. do not nest BLWP calls, then you can get away with just one additional 32-byte chunk for subroutines to use as their R0..R15 workspace. But then this limits the benefits of BLWP and you might as well just use the faster BL.
You can also give your subroutines their own workspace when using BL, just like BLWP, so BLWP does not really have much benefit on the 99/4A. The 9900 was designed to be the heart of a minicomputer where context switching would happen frequently, and BLWP would help a lot in that situation. But for single programs, BL is a better, faster, more flexible choice. IMO, for games, avoid BLWP. For libraries or code designed to be reused, IMO still avoid BLWP and just set up a new workspace if necessary (you have to do that with BLWP anyway).
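To make the register traffic above concrete, here is a minimal Python sketch of B, BL, BLWP, and RTWP. It is only a model, not an emulator: the class name `Tms9900Model`, the dictionary used as memory, and the example addresses are all made up for illustration, and the PC is assumed to already point at the instruction following the branch when the branch executes.

```python
class Tms9900Model:
    """Toy model: PC, WP and SR are the only CPU registers; R0..R15 live in RAM."""

    def __init__(self, wp, pc, sr=0):
        self.mem = {}  # hypothetical word memory, keyed by even byte address
        self.wp, self.pc, self.sr = wp, pc, sr

    def reg_addr(self, n):
        # Rn is the 16-bit word at WP + 2*n, so a workspace occupies 32 bytes.
        return (self.wp + 2 * n) & 0xFFFF

    def get_reg(self, n):
        return self.mem.get(self.reg_addr(n), 0)

    def set_reg(self, n, value):
        self.mem[self.reg_addr(n)] = value & 0xFFFF

    def b(self, target):
        # B: just load a new PC.
        self.pc = target

    def bl(self, target):
        # BL: save the return address in R11, then branch.
        self.set_reg(11, self.pc)
        self.pc = target

    def blwp(self, vector):
        # BLWP: the vector holds the new WP followed by the new PC.
        old_wp, old_pc, old_sr = self.wp, self.pc, self.sr
        self.wp = self.mem[vector]
        self.set_reg(13, old_wp)  # saved into the NEW workspace
        self.set_reg(14, old_pc)
        self.set_reg(15, old_sr)
        self.pc = self.mem[vector + 2]

    def rtwp(self):
        # RTWP: read R13..R15 before WP changes, then restore the caller.
        wp, pc, sr = self.get_reg(13), self.get_reg(14), self.get_reg(15)
        self.wp, self.pc, self.sr = wp, pc, sr


cpu = Tms9900Model(wp=0x8300, pc=0x6004)
cpu.bl(0x6100)             # R11 (at >8316) now holds >6004
cpu.b(cpu.get_reg(11))     # B *R11 returns to the caller

cpu.mem[0x6200] = 0x8320   # BLWP vector: new workspace...
cpu.mem[0x6202] = 0x6300   # ...and subroutine entry point
cpu.blwp(0x6200)           # caller's WP/PC/SR saved in R13..R15 at >833A..>833E
cpu.rtwp()                 # back to WP=>8300, PC=>6004
```

Note how BL touches one word of the current workspace, while BLWP needs a second 32-byte workspace to exist before it can save anything.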
You might want to check out the first 3 or 4 pages (at the very least) of the Assembly Programming thread here on this subforum. It was started to help people get into writing games in assembly on the 99/4A.
3D does not mean your program organization needs to be complex. Always try to keep your programs as simple and well organized as possible, regardless of what the code is doing.
Also, 3D on any retro computer is not for the faint of heart. The 9918A VDP (used in a lot of systems of the era, e.g. 99/4A, ColecoVision, MSX1, ADAM, Tomy, etc.) does not have a true bit-addressable display, so plotting pixels is *slow*. There are plenty of tricks for this kind of thing, and you are going to have to use all of them. Of course you can use pseudo 3D, i.e. using the tile map to draw scenes that look 3D, like Tunnels of Doom. You could also consider using the F18A to draw, but that depends on what criteria you are setting for your project.
  2. We are all in the past here. It is hard to convey intent with text, so my comment probably came across a little wrong. I agree with you, and I like to imagine using just enough technology to solve the problem. However, that is getting harder and harder to do, and we find ourselves in a situation where the smallest, cheapest, and best supported (software and vendors) solutions are battle axes with 1000x more capability than needed.
Want to blink an LED? Get a 133MHz MCU with a USB interface, download a 500MB IDE onto your 8GB operating system, fight with USB drivers, download an I/O library and a framework, find the right location in a 100-line "simple example" template, write 3 lines of code, compile, upload to your MCU, reset, and bask in the glory of a blinking LED. That crap drives me nuts! But that is how things are now. Going against the grain is painful (I do it all the time, and I'm probably very miserable for it).
  3. When I finished the base 9918A functionality in the F18A, I had a lot of room left in the FPGA. I always intended to have a DMA for exactly this, but defining all the DMA features eventually morphed into "Why not just have a CPU?" And if you have a CPU in the VDP, why not have it be a 9900? Thus the GPU.
There is, actually, also a DMA in the F18A. I should probably put the GPU's memory map in the register use spreadsheet. It is documented in this thread somewhere too.
-- DMA
-- 8xx0 - MSB src
-- 8xx1 - LSB src
-- 8xx2 - MSB dst
-- 8xx3 - LSB dst
-- 8xx4 - width
-- 8xx5 - height
-- 8xx6 - stride
-- 8xx7 - 0..5 | !INC/DEC | !COPY/FILL
-- 8xx8 - trigger
--
-- src, dst, width, height, stride are copied to dedicated counters when
-- the DMA is triggered, thus the original values remain unchanged.
This will access VRAM at 10ns per byte, per read and write (but this will probably change a little in the future firmware, and will be slightly variable between 10ns and 30ns per byte). So copying a byte will be 10ns read and 10ns write. So, in 1us 50 bytes can be copied. Clearing the screen can be done in about 16us. Moving a 2K table takes about 40us.
There is also the PIX instruction (replaces the XOP instruction) that is designed to read/write/update BML pixels, and can also calculate the GM2 byte to update from a pixel X,Y location. This instruction should be documented in this thread, I hope, and I think there are some examples (I can dig some up as well).
-- PIX XY,CMD
-- Can only operate on 16K VRAM addresses.
-- Can be written like this: XOP src,dst
-- Uses XOP addressing modes for src (XY) and dst (CMD)
-- SRC: XY is the pixel x,y location in 8:8 format. Uses all source
--      addressing modes.
-- DST: MAxxRWCE xxOOxxPP
--   M  - 1 = calculate the effective address for GM2 instead of the new bitmap layer,
--            placing the VRAM address in the dst.
--        0 = use the remainder of the bits for the new bitmap layer pixels
--   A  - 1 = retrieve the pixel's BML effective address instead of setting a pixel,
--            placing the VRAM address in the dst.
--        0 = read or set a pixel according to the other bits
--   R  - 1 = read current pixel into PP, only after possibly writing PP
--        0 = do not read current pixel into PP
--   W  - 1 = do not write PP
--        0 = write PP to current pixel
--   C  - 1 = compare OO with PP according to E, and write PP only if true
--        0 = always write
--   E  - 1 = only write PP if current pixel is equal to OO
--        0 = only write PP if current pixel is not equal to OO
--   OO - pixel to compare to existing pixel
--   PP - new pixel to write, and previous pixel when reading
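As a rough illustration of the PIX operands described above, the following Python sketch packs the XY and CMD words. The helper names `pix_xy` and `pix_command` are hypothetical, and two things are assumptions on my part rather than something stated in the post: that the `MAxxRWCE xxOOxxPP` pattern is written most-significant bit first, and that "8:8 format" means X in the high byte and Y in the low byte.

```python
def pix_xy(x, y):
    """Pack a pixel location into one word (assumed: X high byte, Y low byte)."""
    if not (0 <= x <= 255 and 0 <= y <= 255):
        raise ValueError("x and y must each fit in 8 bits")
    return (x << 8) | y


def pix_command(pp=0, oo=0, gm2_addr=False, bml_addr=False,
                read=False, no_write=False, compare=False, equal=False):
    """Build the CMD word, assuming the layout MAxxRWCE xxOOxxPP is MSB first."""
    if not (0 <= pp <= 3 and 0 <= oo <= 3):
        raise ValueError("pp and oo are 2-bit pixel values")
    word = 0
    word |= int(gm2_addr) << 15  # M: return the GM2 effective address
    word |= int(bml_addr) << 14  # A: return the BML effective address
    word |= int(read) << 11      # R: read current pixel into PP
    word |= int(no_write) << 10  # W: 1 = do NOT write PP
    word |= int(compare) << 9    # C: conditional write, see E
    word |= int(equal) << 8      # E: 1 = write if equal to OO, 0 = if not equal
    word |= oo << 4              # OO: pixel value to compare against
    word |= pp                   # PP: pixel value to write
    return word


plot = pix_command(pp=3)                                      # unconditional plot
masked = pix_command(pp=2, oo=1, compare=True, equal=True)    # write 2 only over 1
peek = pix_command(read=True, no_write=True)                  # read without writing
```

Under those assumptions, `plot` is >0003, `masked` is >0312, and `peek` is >0C00. If the real bit order differs, only the shift amounts need to change.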
  4. That is the native functionality if you just start changing the scroll registers without any masking (from TL2, etc.) or setting up the additional name tables. Here is a screenshot of some early testing of the BML and scrolling. This is the stock Master Title Screen, with a GPU program running that is updating registers (like the BML control), and the horizontal scroll register being updated at certain locations. The "READY-PRESS ANY KEY TO BEGIN" text looks blurry because it is scrolling horizontally, along with the color bars in places. The console has no idea this stuff is happening, and there is no 9900 code involved.
I can probably make that happen.
  5. If you were doing parallax scrolling, that would be left-to-right, so I'm confused about what masking you needed at the bottom. Just put empty tiles on TL1 and TL2 for those rows. If you are using the expanded name tables, then you do not need any left or right edge masking (which TL2 was designed to do in cases where it was needed).
Not currently. The BML and TL1 can swap which is on top, but TL2 will always be in front of both. You could use TL1 and the BML for the scrolling, and leave TL2 for things like the score and fixed-place text. The BML is like a big sprite and can be moved around. However, I would also think that some of the techniques Rasmus has come up with (like the driving games) could be used to create a parallax effect much more easily and efficiently.
Alternatively you can use a single tile layer and use the GPU to set the horizontal scroll for every pixel row. Every 8 rows it could also shift the horizontal tiles in a row so you do not need the expanded name tables.
I'm working on a firmware update, and a possible new feature is support for horizontal and vertical scrolling without needing the extra name tables. I think I'm calling it "border scroll mode", but I might change it to "window scroll mode" (although I don't like "modes" so I should probably pick a new word for that too). Basically the name table becomes 34x26 (or 34x32 if ROW30 is enabled), but the displayed tiles will be the center 32x24. This leaves a border of tiles all the way around the tile layer that is used to provide the edge data when scrolling takes place. This does mean that after scrolling 8 pixels in any direction you will have to reset the scroll and tile-shift the whole name table, but you eventually have to do this anyway. With this technique the name table only needs to grow by 116 bytes, i.e. 768 to 884 (1088 in ROW30), which is way less memory than doubling or quadrupling the name table space for each layer.
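The name table sizes quoted above are easy to check. This is a small Python sketch of the arithmetic only; the function name `name_table_bytes` is hypothetical, and the one-tile border on every side is taken from the 34x26 and 34x32 figures in the post (one byte per tile).

```python
def name_table_bytes(cols=32, rows=24, border=0):
    """Bytes needed for a name table with `border` extra tiles on every side."""
    return (cols + 2 * border) * (rows + 2 * border)


standard = name_table_bytes()                        # 32 x 24 = 768
bordered = name_table_bytes(border=1)                # 34 x 26 = 884
bordered_row30 = name_table_bytes(rows=30, border=1) # 34 x 32 = 1088
growth = bordered - standard                         # 116 extra bytes

doubled = 2 * standard      # 1536 bytes for a two-page layout
quadrupled = 4 * standard   # 3072 bytes for a four-page layout
```

So the bordered layout costs 116 extra bytes per layer, against 768 or 2304 extra bytes for the doubled or quadrupled name tables.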
  6. Check out Arcade Shopper: https://www.arcadeshopper.com/wp/store/#!/Keyboards-and-adapters/c/23836460
Most people here probably have more than one 99/4A, and lots of random spare parts. I'm a little surprised no one replied before me.
Of course rip-off eBay has them (~$38), but people seem to think "retro" and "vintage" means $$$, and that everyone is a museum or flush with cash and stupid. The 99/4A keyboards were also available as generic keyboards from Radio Shack, and I see some new-old-stock of those on eBay (but $65 is insane!).
TI made a lot of 99/4A consoles, so look around a little bit. It is usually pretty easy to get one cheap for parts.
  7. Arguably a better use of time... 😄
That's crap. The tools should be free! If Microchip is charging for software to program their chips, then it's time to find something else.
As others have mentioned, the older PALs and GALs were programmable via standard IC programmers. Parts like the Atmel ATF16V8 series are $1.25 on Mouser and are programmed with a standard chip programmer. The 16V8 has been around a long time and has enough logic to do a bit of address decoding and such.
The world moves on, and the older tech is going to get a lot harder to find. What people are using these days depends, but mostly they are hacking ready-made microcontroller boards like the Arduino, Rpi, AdaFruit tinsy, etc. These days, for less than the price of a GAL ($1.25) you can get a dual-core 133MHz ARM core RP2040 ($0.80). And it sure as hell is hard to beat the $5 cost of the RP2040 already on a circuit board with support regulators and such. You might as well just buy the RP2040 board, stick it on your cartridge board, and call it a day.
However, for retro-computing the biggest problem with the new stuff is interfacing with the 5V systems. Almost everywhere you will need protection for the inputs to the 3.3V (or lower) chips, which is where the older GALs still have an advantage.
Newer chips are also usually designed to be in-circuit programmed, since flash memory is cheap and added to almost every embedded chip these days. The MCUs also come with hardware support for storage (like SD cards) and USB, which makes in-system programming even easier.
You might as well not even consider price for a project like this, since the most logical or oldest-tech solution will not be the cheapest solution. Also, for $6 to $8 you can get an FPGA with enough capability to implement the entire computer, let alone something as simple as cartridge address decoding and logic, flash memory (or any other kind of memory) interfacing, etc. And the $8 chip has built-in flash, so you don't need an external flash IC to load your bitstream. These would be the Trion T8 and T13. I'm having to change the F18A over to this line of FPGA because they are cheaper and more available than the Xilinx FPGAs these days (seems Xilinx does not want to be in the low-end FPGA business any more... which is a bummer). It is still hard to justify over the RP2040 though; that chip is just so cheap, and probably fast enough (maybe).
Anyway, I digress. If you stick with surface mount components, then you can also have places like JLCPCB manufacture and assemble the boards, generally for less than you can buy the parts and assemble them yourself. This is the way.
  8. The register use spreadsheet for V1.9 says "Priority over tiles". Could it be that you have an older version of the spreadsheet? I have attached the latest spreadsheet just in case.
Not currently, no. Sorry. Although I think this could be added in the next firmware update. The BML is only addressed during TL1 processing, so the BML's priority bit is between TL1 and the BML. Transparent / 0-bit pixels will allow the layer below to show through. Also, the BML does not currently have a priority bit to interact with sprites, but this could be added (or it could inherit TL1's sprite priority bit). But the layering is pretty complicated already, and I'm not sure how much more should be added since it will start to be really hard to reason about what covers what.
If the GPU is not IDLE when the HSYNC trigger occurs, nothing happens and the HSYNC will be missed. The GPU does not have real interrupts. If you need exact timing to the HSYNC, then your GPU program needs to run up to the point where it will wait for HSYNC, then execute the IDLE instruction. This will halt the GPU until the HSYNC is received, or the host CPU issues a GPU trigger, or GPU load and trigger.
If you don't want to run IDLE waiting for the HSYNC, you can always poll the horizontal scanline counter in a loop and compare to the previous value (which you would need to track). You could combine the two methods to wait on HSYNC if it has not occurred since the last loop iteration, or immediately start the next loop if the HSYNC was missed. Something like this (in a really weird pseudo code):
    main_loop:
        if last_scanline != scanline then B start_loop
        IDLE
    start_loop:
        last_scanline = scanline
        ...
f18a_register_use.ods
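The combined poll-or-IDLE idea above can be modeled in a few lines of Python to see how it behaves when a loop iteration runs long. This is only a behavioral sketch, not GPU code: `wait_strategy` and `simulate` are made-up names, the `reads` list stands in for the scanline counter value seen at the top of each loop, and it assumes the counter has advanced by exactly one line when the GPU wakes from IDLE.

```python
def wait_strategy(last_scanline, scanline):
    """Decide what the top of the loop does, mirroring the pseudo code.

    'run'  -> the counter already changed, HSYNC was missed, start immediately
    'idle' -> still on the same line, execute IDLE and wait for HSYNC
    """
    return "run" if scanline != last_scanline else "idle"


def simulate(reads, last_scanline):
    """Replay counter values seen at the top of each loop iteration."""
    actions = []
    for scanline in reads:
        action = wait_strategy(last_scanline, scanline)
        if action == "idle":
            # Assumption: waking from IDLE means the next line has started.
            scanline += 1
        actions.append(action)
        last_scanline = scanline  # start_loop: last_scanline = scanline
    return actions


# Short iteration (still on line 5), then two iterations that overran.
overrun_case = simulate([5, 7, 8], last_scanline=5)
# Two overruns followed by an iteration that finished early.
early_case = simulate([10, 11, 11], last_scanline=9)
```

In the first case only the first iteration idles; in the second only the last one does. The loop never waits a full extra line after an overrun, which is the point of combining the two methods.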
  9. The one on the left is the original F18A and I have not made those since about 2017, so unless you bought a used one from someone, then you do not have an original F18A. The one on the right is the F18A-MK1, and I started making those in 2022, IIRC. What you see above are the ONLY variations of the F18A, and there are no other boards or sockets involved. Please recheck what you have. As RickyDean mentioned, you probably have the Tang 9K, which is not the F18A (and why I would appreciate it if you could edit the thread topic to say Tang 9K, rather than F18A). Ah, right, I forgot, Hans23 did make a few batches of the original F18A. I do not know what PCB pins were used for those units, so if putting a machine socket into the 99/4A VDP socket is working for you, then great.
  10. If you do not have one of these two boards, then you do not have an F18A. If that is the case, I would really appreciate it if you could change the thread title to not say "F18A".
  11. @nuxi Do you have an original F18A (wider than the socket), or are you talking about the newer MK1? I do not have a lot of reports of the F18A coming out of the socket that easily, but it seems you found a solution that is working. Any chance you took a photo or two?
  12. You might want to re-watch the video then. The whole "idea" being proposed is that you can use faster DRAMs with your 9958, set bit >02 of R8, and boom, "turbo mode". And what kind of speed-up he expects, or what "turbo mode" might mean, is not discussed. There are really no "checks" or "CRC" routines to be done inside a VDP. I don't know where you got your view, but the video is not talking about anything like that. Nope, nothing like that going on inside the VDP, and not what was proposed in the video either. Until a valid hypothesis is presented, there is nothing to test or confirm. And since we are talking about undocumented functionality anyway, the only real way to know is to decap a 9958 and reverse engineer it enough to see what bit >02 of R8 actually does.
  13. This is a hobby, you don't need to make excuses (and certainly not to anyone here).
Any reason you don't have all the chips in a horizontal orientation? The buffers at the bottom are functional in that orientation, since I believe the input and output pins are on opposite sides of the chips. Since the internal signals interface with the other chips on the board, keeping the same orientation might make routing easier. I am also a fan of functionality first, so chip orientation is secondary to cleaner routing. But in cases like this, where it is all mostly lower speed and you have very large boards, you could probably use any orientation you want without much trouble.
Also, be careful of chips too close to the edge; there may be mechanical keep-out areas. Always think about the assembly process, and give yourself room to work. Things always look further apart in the software layout than they are in reality. Use the 3D viewer to check clearances. Update to KiCad 8 (if you have not already).
  14. No, I have not tested on a 9958. However, I would say we are making educated guesses based on knowing something about this line of VDP chips at a very low level, and knowing something about how DRAM works.
Yeah, I sat through the video, designed to stretch a one-sentence idea out to 10 minutes, based on nothing. Here is the summary: "Maybe bit >02 in R8 on the 9958 messes with the DRAM timing a little. Maybe that means the VDP has a TURBO MODE!!!"
Sometimes, sure. But more often than not those things are side effects of using the hardware in ways not intended by the designers. If those cases turn out to be useful, then great. However, in this specific case, the idea being proposed is that setting bit >02 in R8 will so drastically modify the DRAM timing on the 9958 that it acts like a "turbo mode" for VRAM access. If you know anything about DRAM access or how digital circuits work at all, you can very quickly make a very safe educated guess that this idea is completely wrong. Changing DRAM access timing will either:
     1. break the VDP timing and probably cause all kinds of random behavior, or
     2. do nothing at all, because the internal state machine running the chip does not modify its timing based on a little faster DRAM access.
All the reasons have already been posted, so please read them again slowly and carefully. I am ready to be proven wrong. I would love to see a trick like this work, because if it does, I would learn some pretty awesome new digital design concepts.
It is common to have a clock synthesizer that will make a multi-phase clock. Not unlike what the 9904 does for the 9900, but internal to the VDP. Any one phase is still 21.477MHz, but the time between phases would be 1/4 of the period, or about 12ns.
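The "about 12ns" figure follows directly from the clock frequency. A quick Python check, assuming four evenly spaced phases as the post describes (the variable names are just for illustration):

```python
CLOCK_HZ = 21.477e6  # VDP master clock from the post
PHASES = 4           # assumed: four evenly spaced clock phases

period_ns = 1e9 / CLOCK_HZ           # one full clock period in nanoseconds
phase_step_ns = period_ns / PHASES   # spacing between adjacent phase edges

print(f"period     = {period_ns:.2f} ns")
print(f"phase step = {phase_step_ns:.2f} ns")
```

The period comes out near 46.56ns, so adjacent phases are roughly 11.64ns apart, which rounds to the 12ns quoted above.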
  15. @FarmerPotato pretty much summed it up. The VDP is a carefully timed state machine that cannot tolerate variation in timing, otherwise the display would be messed up. If access to the DRAM chips was just a little faster, that does not magically translate into another CPU window after some amount of time.
They are pretty close, and the 9958 was only released 3 years later and has very few additions (the biggest being horizontal scrolling). The 9958 datasheet has zero information on DRAM timing, i.e. RAS, CAS, precharge, etc., which means it has to be the same as the 9938, which means its memory access to VRAM is the same. If the DRAM timing is different between the 9938 and 9958, then, well, I have nothing good to say about Yamaha or these chips, other than that it would be pretty f-ed up to not include that in the datasheet.