
Community Reputation

145 Excellent

About pnr

  • Rank
    Chopper Commander

  1. The unroll also reduces the cost of the loop counter and loop jump. In my Unix code (for an 8 bit CF card) I first used this: https://1587660.websites.xs4all.nl/cgi-bin/9995/artifact/c22c09b80a674a44?ln=75,78 but soon switched to: https://1587660.websites.xs4all.nl/cgi-bin/9995/artifact/ad1f9336c3316aa1?ln=86,92 This made it much faster. In your case the loop overhead is less in relative terms, so it isn't as critical. Another learning was dealing with interrupts. Your disk access code may be interrupted, and the interrupt may cause another disk access to happen before returning. Leaving interrupts off for long periods is not a good idea (e.g. your 9902s need servicing), so you have to think about the time it takes to read a sector and whether you can afford to leave interrupts off for that long; if not, you have to make sure the disk code is not re-entered while an access is still in progress. The third learning was that using the CPU to read the disk is a hog. Once you start running parallel jobs (not sure MDOS supports this) you really start to notice this, although the short seek times on flash disks offset some of this. I am planning to add DMA capability to my next board.
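To make the two points in the post above concrete, here is a minimal C sketch rather than the linked 9995 assembly. It assumes a hypothetical memory-mapped CF data register (`data_port`) and a 512 byte sector; the function names and the `disk_busy` flag are illustrative only. The unrolled loop pays the counter update and branch 64 times instead of 512, and the guard shows one simple way to refuse re-entry.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/* Simple version: one byte per iteration, so the loop counter and
   branch are paid 512 times per sector. */
void read_sector_simple(volatile const uint8_t *data_port, uint8_t *buf)
{
    for (size_t i = 0; i < SECTOR_SIZE; i++)
        buf[i] = *data_port;
}

/* Unrolled version: eight bytes per iteration, so the loop overhead
   is paid only 64 times per sector. */
void read_sector_unrolled(volatile const uint8_t *data_port, uint8_t *buf)
{
    for (size_t i = 0; i < SECTOR_SIZE; i += 8) {
        buf[i + 0] = *data_port;
        buf[i + 1] = *data_port;
        buf[i + 2] = *data_port;
        buf[i + 3] = *data_port;
        buf[i + 4] = *data_port;
        buf[i + 5] = *data_port;
        buf[i + 6] = *data_port;
        buf[i + 7] = *data_port;
    }
}

/* Hypothetical re-entrancy guard. On real hardware the test-and-set
   below would itself need interrupts masked for those few instructions. */
static volatile int disk_busy;

/* Returns 1 if the caller now owns the disk code, 0 if it is already in use. */
int disk_try_enter(void)
{
    if (disk_busy)
        return 0;
    disk_busy = 1;
    return 1;
}

void disk_leave(void)
{
    disk_busy = 0;
}
```

An interrupt handler that needs the disk would call `disk_try_enter()` and defer its request when it returns 0, so interrupts can stay enabled during the sector transfer itself.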
  2. I think you have found the fastest form, maybe unrolling the loop 4 times would gain a few percent but that is it. Slower but more compact options could be: - If a 256 byte table is too much space, you could consider using a nibble table with 16 entries and do it in two steps. - There is this hack for bit reversing a byte using MPY: https://graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith32Bits If you place the hypothetical shift register in memory instead of parallel CRU space, your last example would not need the R12 adjustments and could be a bit faster still. For Unix on the Cortex I've found that disk access speed does matter a lot (but early Unix was quite disk intensive, maybe more so than MDOS).
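The two compact alternatives mentioned in the post above can be sketched in C as follows. The first uses a 16 entry nibble table in two steps; the second is the multiply-based hack from the linked Stanford bit hacks page, written here with 32 bit unsigned arithmetic. The function names are illustrative, and a 9900 version would of course be written around MPY rather than C.

```c
#include <stdint.h>

/* Bit-reversed value of each 4 bit nibble. */
static const uint8_t nibble_rev[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Two table lookups: the reversed low nibble becomes the high nibble
   and the reversed high nibble becomes the low nibble. */
uint8_t reverse_bits_nibble(uint8_t b)
{
    return (uint8_t)((nibble_rev[b & 0x0F] << 4) | nibble_rev[b >> 4]);
}

/* Multiply-based reversal (no table): two multiplies spread copies of
   the byte, the masks pick the bits in reversed order, and the final
   multiply gathers them into bits 16..23. */
uint8_t reverse_bits_mul(uint8_t b)
{
    uint32_t x = b;
    uint32_t t = ((x * 0x0802u) & 0x22110u) | ((x * 0x8020u) & 0x88440u);
    return (uint8_t)((t * 0x10101u) >> 16);
}
```

The nibble table costs 16 bytes instead of 256 at the price of a second lookup plus a shift and an OR per byte.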
  3. Please read the full post for some context. The 8087 is actually much faster than the others on plain arithmetic (add/sub, mul, div). I did not make a comparison for the transcendental functions at all. If one is looking for simple & fast mathematical functions, consider using lookup tables. The original Forth needed that (it was created for software controlling telescopes); if I remember correctly the arithmetic was done using scaled 32 bit integers, with each function looked up in a 64 entry table and using interpolation.
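A minimal sketch of the table-plus-interpolation idea from the post above, in C. It is not the original Forth code: the Q16 scaling, the 65 knot layout (64 intervals plus an end point) and the demo function x*x are all assumptions chosen so the table can be filled with integer arithmetic only.

```c
#include <stdint.h>

#define TABLE_BITS 6                    /* 64 intervals */
#define TABLE_SIZE (1 << TABLE_BITS)
#define FRAC_BITS  (16 - TABLE_BITS)    /* low 10 bits select a point inside an interval */

/* Demo table: f(x) = x*x on [0, 1], with x and f(x) scaled by 65536 (Q16).
   Knot i sits at x = i/64, so f = i*i/4096, which is i*i*16 in Q16. */
void fill_square_table(int32_t table[TABLE_SIZE + 1])
{
    for (int32_t i = 0; i <= TABLE_SIZE; i++)
        table[i] = i * i * 16;
}

/* Look up f(x) for a 16 bit scaled argument: the top 6 bits pick the
   interval, the low 10 bits interpolate linearly between its two knots. */
int32_t table_lookup(const int32_t *table, uint16_t x)
{
    uint32_t idx  = (uint32_t)x >> FRAC_BITS;
    int32_t  frac = x & ((1 << FRAC_BITS) - 1);
    int32_t  y0   = table[idx];
    int32_t  y1   = table[idx + 1];
    /* 64 bit product avoids overflow for large table values. */
    return y0 + (int32_t)(((int64_t)(y1 - y0) * frac) / (1 << FRAC_BITS));
}
```

For a real function such as sine the table would be precomputed offline; the lookup routine stays the same.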
  4. Actually, it is 99000 code. The difference is not big, but the code uses such things as 32 bit addition and shifts. So it is not cut and paste, but the amount of effort needed to make it true 9900 code would not be big. Also, John Walker (of AutoCAD fame) wrote some single & double precision routines for the 9900 that were fast for their time: https://www.fourmilab.ch/fbench/fbench.html I don't have source code for this, but the object code libraries can be reverse engineered of course. Happy to post the object code if anybody is interested. Note that the RADIX 100 code has the benefit of being exact for decimal fractions, i.e. it is much better suited to writing financial program code than IBM370 or IEEE (adding 0.01 to an amount one hundred times will give exactly 1.00 rather than a value that is slightly off due to rounding issues).
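The exactness argument in the post above can be illustrated with a deliberately simplified base 100 number in C. This is not the TI RADIX 100 floating point format (which also carries an exponent); it is a hypothetical fixed-point layout with two integer digits and two fraction digits, each holding 0..99, so that 0.01 is exactly one unit in the first fraction digit. A binary `double` version is included for comparison only.

```c
#include <stdint.h>

#define R100_DIGITS 4   /* d[0], d[1]: integer digits; d[2], d[3]: fraction digits */

/* Value = d[0]*100 + d[1] + d[2]/100 + d[3]/10000, each digit in 0..99. */
typedef struct { uint8_t d[R100_DIGITS]; } r100_t;

/* acc += x, digit by digit from least significant, with decimal carry. */
void r100_add(r100_t *acc, const r100_t *x)
{
    unsigned carry = 0;
    for (int i = R100_DIGITS - 1; i >= 0; i--) {
        unsigned s = acc->d[i] + x->d[i] + carry;
        acc->d[i] = (uint8_t)(s % 100);
        carry = s / 100;
    }
}

/* Add `step` to zero n times. */
r100_t r100_repeat_add(r100_t step, int n)
{
    r100_t acc = { {0, 0, 0, 0} };
    for (int i = 0; i < n; i++)
        r100_add(&acc, &step);
    return acc;
}

/* Same sum in binary floating point: 0.01 has no exact binary
   representation, so the result is only approximately 1.0. */
double binary_repeat_add(double step, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += step;
    return acc;
}
```

Adding the base 100 value {0, 0, 1, 0} (that is, 0.01) one hundred times carries cleanly into the integer digit and yields {0, 1, 0, 0}, exactly 1.00.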
  5. Well, the analog shows that the signals are as I would expect them to be. Had a quick look at the spec sheet for the VB-8012: https://www.ni.com/pdf/manuals/371527d.pdf It says that the input threshold on the digital inputs can be adjusted between 0V and 2V. For TTL signals I think low is 0-0.7V and high is 2.4-5V. Maybe the threshold is currently set to quite a low or high value, where overshoots are detected as a reverse signal for one sample. Mathematically, 1.5V should be ideal, but in a circuit that mixes (LS-)TTL, HCT, NMOS, etc. some experimentation may be in order.
  6. Again, congrats on getting it to work! This by itself is not certain. My understanding (after experimentation) is that reset is only sampled by the CPU on the rising edge of CLKOUT. Although the datasheet says that it must be asserted for at least 3 machine cycles (clocks), actually one is enough for the processor to reset. If the glitches occur outside the setup/hold time around the clock edge, the CPU would not notice. It may be interesting to do an analogue measurement of CLKOUT and RESET and see how that relates to the digital measurements.
  7. Maybe it is not a DC but an AC issue: maybe the bus line is ringing? Have you tried a 100R series resistor to dampen reflections (as in the firehose interface)?
  8. I am not sure that is a good idea. Initially my thoughts were like yours, and I was aiming for the PLA to be in block RAM. Two things changed my mind: (i) The "ROM" has lots of duplication in it and it turns out that generating signals from the state vector does not take all that many LUTs. Probably this is the reason that CPUs from that era often used PLAs for microcode in the real silicon. (ii) The LUT version is faster than the "ROM" version. This was the case on the ICE40 chips and perhaps even more so on the ECP5 chips. Maybe the second reason drops away when the microcode lookup is more pipelined than in my design. A now obsolete reason was that I wanted to conserve block RAM on the limited ICE40 chip. Yes. This design choice was driven by a wish to stay close to the original silicon (see here and figure 3 in the 99105/99110 data manual). This too uses a constant table. Trying to eliminate multiplexers is a good idea, I think. In the NMOS silicon of the era, it was almost free to have a tri-state bus on the chip. On an FPGA this translates to multiplexers. The natural multiplexer seems to be a 4 bit 2:1 multiplexer in a single logic block and an 8-way multiplexer takes 3 layers of LUT. Including all the wire routing, the actual layout quickly becomes hard to predict/understand. Selecting ALU inputs and ALU function, and generating flag bits, is a critical timing path for me. The 99000 microcode is 152 bits wide. Mine is much narrower, but in part that is optical. Fields have often been constrained to 4 bits, so that 1 LUT can derive single signals. I've never counted how many bits I have after such expansion. For another take on microcode organisation, take a look at the microcode word of the 990/12. It is described briefly in one of the assembler manuals, but I cannot find the right link at the moment. It is 64 bits wide.
  9. Happy to hear that you found the problem. Yes, with AS I meant ALATCH; I was working with an M68K recently and got the signal names confused. Wow, that VB-8012 is a serious bit of kit. Does it have an input mode that adds some hysteresis to the 32 inputs? If so, it could maybe help with the cross-talk. Maybe @Jimhearne and @Stuart have suggestions -- they are much better with hardware issues than I am.
  10. This is a very interesting avenue of development! Just throwing out some thoughts: 1. I heard (read) the GPL processor thing as well, but I am not sure it is correct. As I understood, the original plan was for a 99xx CPU with an 8 bit data path but this project did not materialise in time and the 16-bit 9900 was shoehorned in at a late stage. I also think I remember reading that the designers did not mind the "double interpreter" because they expected that a dedicated CPU would be used for a next gen system. I am not sure how the two things relate, if at all. 2. For a microcoded design, have a look at my 99000 version. It has ~200 states for the 9995 instruction set. 3. Another route could be to use the co-processor design of the 99xxx series. I am not implementing that, but it could help to keep complexity down, by separating the GPL part in a co-processor. That co-processor could have a data path optimised for GPL, with maybe a separate address ALU etc. The co-processor interface has facilities to transfer the WP, PC and ST registers between the CPU and the co-processor, so integration could be quite seamless.
  11. When I look at the scope output picture(s) I am surprised by some of the signals. It is not clear why CLKOUT should not show a nice regular square wave, and I don't think that the BST lines should change state when the AS signal is low. Is it possible that the scope / analyser is not grounded and hence mis-measuring the signals? If your system is multi-board, is it possible that ground does not feed through? Or a ground loop perhaps?
  12. Maybe this works for ya: https://www.reichelt.nl/gb/en/sr-32kx8-28p-62256-80-p2673.html?r=1
  13. It is not about the line count so much, it is about maximum simplicity. When using internal ram (the smallest version of the ULX3S has 112KB internal ram/rom capacity), doing a 9918 that just supports basic 256x192 VGA DVI output is very simple indeed, hardly more complex than the video circuit in the 99/2. The complexity is in the sprites, which are done with comparators/counters in 9918 silicon, 4 blocks of that. I'm thinking of duplicating that design in the FPGA, hopefully it leads to very simple & readable code. However, writing that takes time, which I currently don't have. Just the other day I learned that Yosys currently cannot infer true two-port anyway. It is limited to one R/W port and one R port -- this bit of the Yosys code is currently being rewritten, so hopefully this limitation will be gone soon. For true 2-port one currently has to use a library block (Emard has that in his repo). Yes, it does not use GPL, and it does not need to as the RAM is connected to the CPU. When debugging the TI99/2 I disassembled some parts of the 32KB ROM and it has a table driven parser that compiles into a token byte code ("IF", "NEXT", etc.). This token byte code is then interpreted by calling a subroutine for each token. I did not manage to fully understand the parser, but I think it is a bottom-up parser with separate left and right priorities for each token - I did not get to the bottom of it. At another time, yes please. At the moment work projects are keeping me away from hobby stuff and I'd like to complete three other hobby projects first: - A 4-way write-back cache, to make sdram access fast. I have that working for Oberon, but I'm not happy with it yet. - True HDMI video (as opposed to DVI). This means implementing data islands and sound encoding. - Clean up TCP/IP for the Cortex So, we're talking mid-2021 at the earliest, maybe 2022. Maybe it is a cool project for a Tomy Tutor enthusiast...
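The token interpreter described in the post above (byte code executed by calling one subroutine per token) has roughly the following shape in C. This is only a sketch of the dispatch technique: the tokens, the tiny stack machine and all names are hypothetical and have nothing to do with the actual TI99/2 ROM tokens, and there is no bounds checking.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical token set for illustration only. */
enum { TOK_PUSH1, TOK_ADD, TOK_DOUBLE, TOK_END, TOK_COUNT };

typedef struct {
    int32_t stack[16];
    size_t  sp;          /* number of values currently on the stack */
} vm_t;

typedef void (*token_fn)(vm_t *);

static void do_push1(vm_t *vm)  { vm->stack[vm->sp++] = 1; }
static void do_add(vm_t *vm)    { int32_t b = vm->stack[--vm->sp]; vm->stack[vm->sp - 1] += b; }
static void do_double(vm_t *vm) { vm->stack[vm->sp - 1] *= 2; }

/* One handler per token, indexed by the token byte. TOK_END has no handler. */
static const token_fn handlers[TOK_COUNT] = { do_push1, do_add, do_double, NULL };

/* Run a token string until TOK_END and return the top of stack (0 if empty). */
int32_t run_tokens(const uint8_t *code)
{
    vm_t vm = { {0}, 0 };
    while (*code != TOK_END)
        handlers[*code++](&vm);
    return vm.sp ? vm.stack[vm.sp - 1] : 0;
}
```

With the token string PUSH1, PUSH1, ADD, DOUBLE, END the interpreter computes (1 + 1) * 2.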
  14. Actually, the original Cortex did that, using a technique called "write under". Initially, reads are from ROM, but writes to the same addresses are sent to RAM. Once the copy is complete, the ROM is switched off (using a CRU bit) and both reads and writes go to RAM. There are wait states for slow ROM access. Have you considered the TMS9911 DMA controller?
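The "write under" sequence in the post above can be modelled in a few lines of C. This is a software model under stated assumptions, not Cortex code: the 256 byte window, the `machine_t` struct and the `rom_enabled` flag (standing in for the CRU bit) are all hypothetical. The key point is that the copy loop simply reads and writes the same address.

```c
#include <stdint.h>

#define SHADOW_SIZE 256u   /* hypothetical size of the shadowed window */

typedef struct {
    uint8_t rom[SHADOW_SIZE];
    uint8_t ram[SHADOW_SIZE];
    int     rom_enabled;   /* stands in for the CRU bit that switches the ROM off */
} machine_t;

/* Reads come from ROM while it is enabled, otherwise from RAM.
   Callers must keep addr below SHADOW_SIZE. */
uint8_t bus_read(const machine_t *m, uint16_t addr)
{
    return m->rom_enabled ? m->rom[addr] : m->ram[addr];
}

/* Writes always land in the RAM "under" the ROM. */
void bus_write(machine_t *m, uint16_t addr, uint8_t value)
{
    m->ram[addr] = value;
}

/* Copy ROM to the RAM at the same addresses, then switch the ROM off. */
void shadow_rom(machine_t *m)
{
    for (uint16_t a = 0; a < SHADOW_SIZE; a++)
        bus_write(m, a, bus_read(m, a));
    m->rom_enabled = 0;
}
```

After `shadow_rom()` returns, reads see the same contents as before but come from RAM, so the slow ROM wait states no longer apply and the region becomes writable.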
  15. The TI99/2 is here: https://gitlab.com/pnru/ti99/-/tree/master/ti99_2 The Mini-Cortex is here: https://gitlab.com/pnru/cortex I've focused on the Unix side of it. Its main claim to fame is that it hosts a 99-native C compiler and tool chain, and hence it can re-compile itself. I have a native TCP/IP stack working, but the experience is not smooth yet. It uses the ESP32 as an ISP, and connects to it using a PPP serial line. Yes, I've been thinking about that as well. The CPU in the Mini-Cortex is my best approximation of the 9995 yet, implementing the extra 4 instructions. It is almost cycle accurate and the bus interface is that of the 99105. It also has code to emulate the 9995's interrupt lines, for the internal timer and CRU bits, etc. Tongue-in-cheek, I'm calling it the 99095. What I had in the back of my mind was to do a version of the 9918 that mimicked the data paths of the real vintage silicon. I think it should fit in some 500-700 lines of Verilog and would of course have the same limits (4 sprites on a line, no 80 column text mode, etc.). Never got around to doing that code. Together with the 99095 it would allow for a very compact implementation of the Tomy Tutor. Probably using your 9918 is a quicker route to success.