
pnr

Members
  • Content Count: 157

Posts posted by pnr


  1. On 3/19/2021 at 9:51 PM, Ksarul said:

    There are exactly three options, as we have already indicated:

     

    1. Clint Pulley's Small c Compiler--runs on a TI-99/4A, a Geneve 9640, and emulators.

    2. Al Beard's C Compiler--runs on a Geneve 9640, an Amiga, and emulators. It has more than Clint Pulley's compiler, but it is not a full C compiler.

    3. Dave Pitts' GCC Compiler--runs on Windows or Linux.

     

    I seem to recall that Al Beard's C compiler also supports overlays (on the Geneve). Al is still around, maybe he can clarify.

     

    EDIT:

     

    I looked up the source. The manual says this about TIC:

    "The GENEVE MDOS version of the compiler utilizes sophisticated
      memory swapping to gain an 85k workspace for the compiler, even
      though the compiler is over 50k in length.  This allows compilation
      of fairly sophisticated C programs.  The total memory required to
      run the TIC compiler is 144k."

    However, this appears to be achieved via custom memory management in the compiler source code (grep for "swap" in the source code), where specific compiler tables are swapped in and out of the actual memory space. So, in the general sense the TIC compiler does not support overlays.

    Sorry for the red herring.


  2. On 3/20/2021 at 12:03 AM, Tursi said:

    No C compiler targeting the TMS9900 supports overlays today. Nobody is really writing any software that requires it, so nobody has been motivated to develop the support.

     

    True, but there is something close. The C compiler used for Mini-Cortex Unix (which is both a cross and native compiler) almost has support for that. The source is here:
    https://www.jslite.net/cgi-bin/9995/dir?ci=tip
    (ccom, cc and c2 directories). It is K&R C though, so if you want to compile recent C code you need to do some tweaking.

    This C compiler is derived from the original C compiler developed by Dennis Ritchie for the PDP-11 minicomputer. As many people on this list will be aware, the instruction sets of the PDP-11 and the 9900 are quite close. In the late 70's, this compiler was modified to generate overlays for programs that could not fit in a 16-bit (64KB) address space (this work was originally done at Berkeley for 2.10BSD).

     

    The overlay system is quite clever and does not require modification to the C source: all the work is done by the linker. In essence, functions that call across overlays do so via a small (automatically generated) thunk that adjusts the memory mapping as needed. The process is described here (nroff document):
    https://www.tuhs.org/cgi-bin/utree.pl?file=2.11BSD/doc/2.10/ovpap

     

    The Mini-Cortex compiler has the code in it to support this feature; however I've never written the bits of support code that it needs. Hence, it *almost* supports this.
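    To make the thunk idea concrete, here is a minimal C model. It is not the 2.10BSD or Mini-Cortex code: a global variable stands in for the memory-mapping hardware, and every name in it (`current_overlay`, `map_overlay`, `overlay_call`, `square`) is hypothetical. In the real scheme the linker emits the thunks and the remapping is done in a small assembly routine.

```c
/* Stand-in for the mapper/MMU register that selects the visible overlay. */
static int current_overlay = 0;

/* Real code would program the memory mapper here. */
static void map_overlay(int n) { current_overlay = n; }

typedef int (*ov_fn)(int);

/* Generic cross-overlay call: map the target overlay, call, then restore
   the caller's overlay so the return lands in the right code. */
static int overlay_call(int target, ov_fn fn, int arg) {
    int saved = current_overlay;
    if (saved != target) map_overlay(target);
    int result = fn(arg);
    if (saved != target) map_overlay(saved);
    return result;
}

/* A function that "lives" in overlay 2: it only works when overlay 2 is mapped. */
static int square_in_ovl2(int x) { return current_overlay == 2 ? x * x : -1; }

/* The thunk that callers actually link against; C code calling square()
   needs no changes, which is the point of the linker-based scheme. */
static int square(int x) { return overlay_call(2, square_in_ovl2, x); }
```

    With overlay 1 mapped, `square(7)` returns 49 and overlay 1 is mapped again afterwards, whereas calling `square_in_ovl2` directly from overlay 1 fails.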

     


  3. On 2/4/2021 at 6:21 PM, FarmerPotato said:

    Unrolling to four bytes/loop would double the number of instructions--nothing to gain there.

    The unroll also reduces the cost of the loop counter and loop jump. In my Unix code (for an 8-bit CF card interface) I first used this:
    https://1587660.websites.xs4all.nl/cgi-bin/9995/artifact/c22c09b80a674a44?ln=75,78

    but soon switched to:

    https://1587660.websites.xs4all.nl/cgi-bin/9995/artifact/ad1f9336c3316aa1?ln=86,92

    This made it much faster. In your case the loop overhead is less in relative terms, so it isn't as critical.
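    The linked code is 9900 assembly; the same idea in C looks roughly like this. It is a sketch only: `port_read_fn` is a hypothetical hook standing in for a read of the CF card's data register, and `fake_port` is a simulated register for testing. A C compiler may unroll the simple loop by itself, but in hand-written 9900 code the saving is explicit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook standing in for a read of the CF card's 8-bit data register. */
typedef uint8_t (*port_read_fn)(void);

/* One byte per iteration: every byte also pays for a counter update and a branch. */
static void read_sector_simple(uint8_t *buf, size_t len, port_read_fn rd) {
    for (size_t i = 0; i < len; i++)
        buf[i] = rd();
}

/* Unrolled 8 times: the counter update and branch are paid once per 8 bytes.
   Assumes len is a multiple of 8, which holds for a 512-byte sector. */
static void read_sector_unrolled(uint8_t *buf, size_t len, port_read_fn rd) {
    for (size_t i = 0; i < len; i += 8) {
        buf[i + 0] = rd(); buf[i + 1] = rd(); buf[i + 2] = rd(); buf[i + 3] = rd();
        buf[i + 4] = rd(); buf[i + 5] = rd(); buf[i + 6] = rd(); buf[i + 7] = rd();
    }
}

/* Simulated data register for testing: returns 0, 1, 2, ... (wrapping at 256). */
static uint8_t fake_port_next = 0;
static uint8_t fake_port(void) { return fake_port_next++; }
```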

    Another lesson was dealing with interrupts. Your disk access code may be interrupted, and the interrupt may cause another disk access to happen before returning. Leaving interrupts off for long periods is not a good idea (e.g. your 9902's need servicing), so you have to think about the time it takes to read a sector and whether you can afford to leave interrupts off for that long; otherwise you have to make sure the disk code cannot be re-entered.
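    One common way to keep the disk code from being re-entered is a busy flag that is tested and set with interrupts masked only briefly, so interrupts can stay enabled during the long transfer itself. A minimal C sketch, assuming hypothetical hooks `irq_disable`/`irq_enable` (on the 9900 these would correspond to LIMI instructions) and a caller that retries or queues its request when it gets `DISK_BUSY`; `transfer_with_nested_irq` merely simulates an interrupt that attempts a nested access.

```c
/* Hypothetical interrupt-mask hooks (on a 9900: LIMI 0 to mask, LIMI n to unmask). */
static int irq_masked = 0;
static void irq_disable(void) { irq_masked = 1; }
static void irq_enable(void)  { irq_masked = 0; }

enum { DISK_OK = 0, DISK_BUSY = 1 };

static volatile int disk_busy = 0;

typedef void (*transfer_fn)(unsigned sector);

/* Refuses to start a second access while one is in progress instead of
   corrupting the controller state; the caller must retry or queue. */
static int disk_read(unsigned sector, transfer_fn transfer) {
    irq_disable();              /* make the test-and-set of the flag atomic */
    if (disk_busy) {
        irq_enable();
        return DISK_BUSY;
    }
    disk_busy = 1;
    irq_enable();               /* interrupts stay on during the slow transfer */
    transfer(sector);
    disk_busy = 0;
    return DISK_OK;
}

/* Simulated transfer during which an "interrupt handler" tries a nested read. */
static int nested_result = -1;
static void transfer_with_nested_irq(unsigned sector) {
    nested_result = disk_read(sector + 1, transfer_with_nested_irq);
}
```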
     

    The third lesson was that using the CPU to move the disk data hogs the processor. Once you start running parallel jobs (I am not sure MDOS supports this) you really start to notice this, although the short seek times of flash disks offset some of it. I am planning to add DMA capability to my next board.


    I think you have found the fastest form; maybe unrolling the loop 4 times would gain a few percent, but that is it.

     

    Slower but more compact options could be:

    - If a 256 byte table is too much space, you could consider using a nibble table with 16 entries and do it in two steps.

    - There is this hack for bit reversing a byte using MPY: https://graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith32Bits
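    Both compact options can be sketched in C; the function names are hypothetical and a plain bit loop is included as a reference. The nibble version trades the 256-byte table for a 16-byte one at the cost of a second lookup and a shift. The multiply version is the 32-bit variant from the bithacks page; on a 9900 the 32-bit products would have to be assembled from MPY results, which this sketch does not show.

```c
#include <stdint.h>

/* 16-entry nibble table: rev4[n] is n with its 4 bits reversed. */
static const uint8_t rev4[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Two lookups: the reversed low nibble becomes the high nibble and vice versa. */
static uint8_t reverse_byte_nibbles(uint8_t b) {
    return (uint8_t)((rev4[b & 0x0F] << 4) | rev4[b >> 4]);
}

/* Multiply-and-mask variant (bithacks "ReverseByteWith32Bits"). The final
   product may wrap mod 2^32, which does not disturb bits 16..23 holding the result. */
static uint8_t reverse_byte_mul(uint8_t b) {
    uint32_t x = b;
    return (uint8_t)((((x * 0x0802u) & 0x22110u) | ((x * 0x8020u) & 0x88440u)) * 0x10101u >> 16);
}

/* Reference implementation: shift bits out of b and into r one at a time. */
static uint8_t reverse_byte_loop(uint8_t b) {
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1u));
        b >>= 1;
    }
    return r;
}
```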

     

    If you place the hypothetical shift register in memory instead of parallel CRU space, your last example would not need the R12 adjustments and could be a bit faster still. For Unix on the Cortex I've found that disk access speed does matter a lot (but early Unix was quite disk intensive, maybe more so than MDOS).


  5. On 1/29/2021 at 2:21 AM, FarmerPotato said:

    @pnr estimates the 6MHz 99110 at 15 kFLOPS and concludes:

     

    Quote

    This compares well with the FPU chips of the late seventies and early eighties. The three main choices in 1981 were the AMD9511/i8231 from 1978, the AMD9512/i8232 from 1979 and the i8087 from 1980/81. The 99110 is from 1981 as well.

    http://www.cpushack.com/2010/09/23/arithmetic-processors-then-and-now/

    Please read the full post for some context. The 8087 is actually much faster than the others on plain arithmetic (add/sub, mul, div). I did not make a comparison for the transcendental functions at all.

     

    If one is looking for simple & fast mathematical functions, consider using lookup tables. The original Forth needed that (it was created for software controlling telescopes); if I remember correctly, the arithmetic was done using scaled 32-bit integers, with each function looked up in a 64-entry table and refined by interpolation.
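    A minimal sketch of the table-plus-interpolation approach in C, assuming choices that are not from the original Forth: sine as the example function, Q16 fixed point (65536 = 1.0), and angles given as a fraction of a quarter turn. The 64 intervals match the post; a 65th guard entry avoids a special case at the top end. The table is built once with a Taylor series so no floating-point library is needed at run time.

```c
#include <stdint.h>

#define TABLE_BITS 6                     /* 64 intervals */
#define TABLE_SIZE (1 << TABLE_BITS)
#define ONE_Q16    65536                 /* 1.0 in Q16 fixed point */

static int32_t sine_table[TABLE_SIZE + 1];   /* +1 guard entry for interpolation */

/* Taylor series up to x^15; accurate to ~1e-11 on [0, pi/2]. Only used to build the table. */
static double taylor_sin(double x) {
    double term = x, sum = x;
    for (int n = 1; n < 8; n++) {
        term *= -x * x / ((2 * n) * (2 * n + 1));
        sum += term;
    }
    return sum;
}

static void sine_table_init(void) {
    const double half_pi = 1.57079632679489661923;
    for (int i = 0; i <= TABLE_SIZE; i++)
        sine_table[i] = (int32_t)(taylor_sin(half_pi * i / TABLE_SIZE) * ONE_Q16 + 0.5);
}

/* x_q16: angle as a fraction of a quarter turn (0..65536 means 0..pi/2).
   Returns sin(angle) in Q16 using one lookup and a linear interpolation. */
static int32_t sine_q16(uint32_t x_q16) {
    if (x_q16 >= ONE_Q16) return sine_table[TABLE_SIZE];
    uint32_t idx  = x_q16 >> (16 - TABLE_BITS);                  /* top 6 bits: interval */
    uint32_t frac = x_q16 & ((1u << (16 - TABLE_BITS)) - 1);     /* low 10 bits: position */
    int32_t a = sine_table[idx], b = sine_table[idx + 1];
    return a + (int32_t)(((int64_t)(b - a) * frac) >> (16 - TABLE_BITS));
}
```

    With 64 intervals the interpolation error stays within a few Q16 units (a few parts in 100,000), which is the kind of trade-off that made this attractive on small integer machines.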

     


  6. On 1/29/2021 at 2:21 AM, FarmerPotato said:

    This code is 9900 and fits into 2K. It would run at half the speed inside a 4A ROM, worse in slow RAM. But in a F18A it could go much faster!

    It does use IBM single precision, not RADIX 100. Suppose all the 4A routines in ROM were replaced by this? Would everything work after being on top of that layer?

     

    Actually, it is 99000 code. The difference is not big, but the code uses such things as 32-bit addition and shifts. So it cannot simply be cut and pasted, but the effort needed to turn it into true 9900 code would be small.

    Also, John Walker (of AutoCAD fame) wrote some single & double precision routines for the 9900 that were fast for their time:
    https://www.fourmilab.ch/fbench/fbench.html
    I don't have source code for this, but the object code libraries can be reverse engineered of course. Happy to post the object code if anybody is interested.

    Note that the RADIX 100 code has the benefit of being exact for decimal fractions, i.e. it is much better suited to writing financial program code than IBM370 or IEEE formats (adding 0.01 to an amount one hundred times gives exactly 1.00, rather than a value like 0.9999999 that truncates to 0.99 due to rounding issues).
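    The difference can be shown with a small C sketch. The base-100 type below is a simplified fixed-point layout (digit 0 holds hundredths), not the TI-99's actual RADIX 100 floating-point format, which also carries an exponent; the names are hypothetical. Because 0.01 is a single base-100 digit, repeated addition stays exact, while 0.01 has no exact binary representation.

```c
#include <stdint.h>

#define NDIGITS 4   /* base-100 digits; digit 0 = hundredths, so up to 999999.99 */

typedef struct { uint8_t d[NDIGITS]; } Radix100Fixed;   /* hypothetical layout */

/* Add x to acc digit by digit, carrying in base 100. */
static void r100_add(Radix100Fixed *acc, const Radix100Fixed *x) {
    unsigned carry = 0;
    for (int i = 0; i < NDIGITS; i++) {
        unsigned s = acc->d[i] + x->d[i] + carry;
        acc->d[i] = (uint8_t)(s % 100);
        carry = s / 100;
    }
}

/* Binary floating point for comparison: every step rounds 0.01 to the nearest double. */
static double add_cents_binary(int times) {
    volatile double total = 0.0;   /* volatile: keep each step rounded to double */
    for (int i = 0; i < times; i++) total += 0.01;
    return total;
}
```

    Adding one cent a hundred times leaves the base-100 accumulator at exactly 1.00 (digit 1 = 1, digit 0 = 0), whereas the binary sum ends up slightly off 1.0.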


  7. Well, the analog trace shows that the signals are as I would expect them to be. I had a quick look at the spec sheet for the VB-8012:
    https://www.ni.com/pdf/manuals/371527d.pdf

    It says that the input threshold on the digital inputs can be adjusted between 0V and 2V. For TTL inputs, a valid low is 0-0.8V and a valid high is 2.0-5V (TTL outputs guarantee at most 0.4V low and at least 2.4V high). Maybe the threshold is currently set to a rather low or high value, so that overshoots are detected as a reverse signal for one sample. In theory a threshold of around 1.4-1.5V, in the middle of the undefined band, should be ideal, but in a circuit that mixes (LS-)TTL, HCT, NMOS, etc. some experimentation may be in order.


  8. Again, congrats on getting it to work!

    1 hour ago, FarmerPotato said:

    With glitches appearing on RESET*, the cpu keeps on ticking, so they can't be real. (But then again that is too fast)

    This by itself is not certain. My understanding (after experimentation) is that reset is only sampled by the CPU on the rising edge of CLKOUT. Although the datasheet says that it must be asserted for at least 3 machine cycles (clocks), actually one is enough for the processor to reset. If the glitches occur outside the setup/hold time around the clock edge, the CPU would not notice.

    It may be interesting to do an analogue measurement of CLKOUT and RESET and see how that relates to the digital measurements.
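    The edge-sampling behaviour described above can be modelled in a few lines of C, which shows why a glitch between clock edges goes unnoticed. This is a simplification: each array element is one logic-analyser sample, the setup/hold window is reduced to the single sample at the rising edge, and `reset_seen` is a hypothetical name.

```c
#include <stddef.h>

/* Returns 1 if RESET* (active low) is low at any rising edge of CLKOUT.
   Glitches that start and end between edges are never sampled. */
static int reset_seen(const int *clkout, const int *reset_n, size_t n) {
    for (size_t i = 1; i < n; i++)
        if (!clkout[i - 1] && clkout[i] && !reset_n[i])   /* rising edge, RESET* low */
            return 1;
    return 0;
}
```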


  9. 20 hours ago, FarmerPotato said:

    Well, it is not solved.  I can sample SERENA* with the hi-Z analog probe or digital probe. As soon as I connect a gate input, it fails. Both LS and HCT. The driver is an ALS138A output (decoding the IO bus state and A14:13). As far as  I can tell, it should be compatible.

    Maybe it is not a DC but an AC issue: maybe the bus line is ringing? Have you tried a 100R series resistor to dampen reflections (as in the firehose interface)?


  10. On 1/2/2021 at 9:19 PM, speccery said:

    I did take a look at your design - very interesting and good reference material! I am in the process of adopting a partially microcoded approach, so that I can hopefully do the microcode implementation incrementally, without necessitating the creation of an entirely microcoded processor from the start. As I was looking at your design, I noticed that your microcode (the array "pla") is not clocked. I wonder if it maps to block RAM during synthesis? I am going to attempt to make a clocked design for the microcode ROM, to ensure that the microcode will indeed be in ROM (i.e. initialised block RAM).

    I am not sure that is a good idea.

     

    Initially my thoughts were like yours, and I was aiming for the PLA to be in block RAM. Two things changed my mind:
    (i) The "ROM" has lots of duplication in it, and it turns out that generating signals from the state vector does not take all that many LUTs. This is probably the reason that CPUs from that era often used PLAs for microcode in the real silicon.

    (ii) The LUT version is faster than the "ROM" version. This was the case on the ICE40 chips and perhaps even more so on the ECP5 chips.
    Maybe the second reason drops away when the microcode lookup is more pipelined than in my design.

     

    A now obsolete reason was that I wanted to conserve block RAM on the limited ICE40 chip.

     

    On 1/2/2021 at 9:19 PM, speccery said:

    I am planning to make the microcode quite wide. I noticed that in your design you have a separate constant array, and the microcode only contains an index to the constants. That makes sense in keeping the microcode width smaller, but in order to accelerate GPL the microcode needs to generate a lot of constants, so I am going to just have a 16-bit field in every microcode step just for constants. That saves one level of indirection (a multiplexer) during the decode phase.


    Yes. This design choice was driven by a wish to stay close to the original silicon (see here and figure 3 in the 99105/99110 data manual). This too uses a constant table. Trying to eliminate multiplexers is a good idea, I think. In the NMOS silicon of the era, it was almost free to have a tri-state bus on the chip. On an FPGA this translates to multiplexers. The natural multiplexer seems to be a 4 bit 2:1 multiplexer in a single logic block and an 8-way multiplexer takes 3 layers of LUT. Including all the wire routing, the actual layout quickly becomes hard to predict/understand. Selecting ALU inputs and ALU function, and generating flag bits, is a critical timing path for me.
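    The trade-off between a constant index and an inline constant can be sketched as two hypothetical microword layouts in C. The field names and table values are illustrative only, not the 99000's or either FPGA design's. The indexed form keeps the word narrow but needs a second selection step (in hardware, an extra multiplexer level); the inline form costs 16 bits in every step but feeds the constant straight to the datapath.

```c
#include <stdint.h>

/* Hypothetical microword with a 4-bit index into a shared constant table. */
typedef struct {
    uint16_t next_addr;   /* next micro-address */
    uint8_t  alu_op;      /* ALU function select */
    uint8_t  const_idx;   /* 4-bit index: narrow word, but an extra lookup */
} MicrowordIndexed;

/* Hypothetical microword with the constant stored inline in every step. */
typedef struct {
    uint16_t next_addr;
    uint8_t  alu_op;
    uint16_t constant;    /* 16 bits in every step, no extra lookup */
} MicrowordInline;

/* Illustrative constant table. */
static const uint16_t const_table[16] = {
    0x0000, 0x0001, 0x0002, 0x0004, 0x0010, 0x0020, 0x00FF, 0xFF00,
    0x8000, 0xFFFE, 0xFFFF, 0x0040, 0x0080, 0x0100, 0x0200, 0x0400
};

static uint16_t constant_indexed(const MicrowordIndexed *w) {
    return const_table[w->const_idx & 0x0F];   /* second level of selection */
}

static uint16_t constant_inline(const MicrowordInline *w) {
    return w->constant;                        /* taken straight from the word */
}
```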

     

    The 99000 microcode is 152 bits wide. Mine is much narrower, but in part that is only appearance: fields have often been constrained to 4 bits, so that a single 4-input LUT can derive each control signal from a field. I've never counted how many bits I have after such expansion.
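    Why 4-bit fields map so well onto the FPGA: the ICE40 and ECP5 logic cells are built around 4-input LUTs, and a 4-input LUT is just a 16-entry truth table indexed by its inputs. A C model of that, with a hypothetical control signal as the example:

```c
#include <stdint.h>

/* A 4-input LUT: the 4-bit input selects one bit of a 16-bit truth table. */
static int lut4(uint16_t truth_table, unsigned field4) {
    return (truth_table >> (field4 & 0xF)) & 1;
}

/* Hypothetical control signal, asserted when a 4-bit microcode field is 3, 5 or 12.
   Any such decode of one 4-bit field costs exactly one LUT. */
#define MASK_FOR(a, b, c) (uint16_t)((1u << (a)) | (1u << (b)) | (1u << (c)))
static const uint16_t SIG_WRITE_BACK = MASK_FOR(3, 5, 12);
```

    A 5-bit field would need two LUTs plus a multiplexer per signal, which is where the timing and area costs start to grow.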

    For another take on microcode organisation, take a look at the microcode word of the 990/12. It is described briefly in one of the assembler manuals, but I cannot find the right link at the moment. It is 64 bits wide.

     

     


  11. Happy to hear that you found the problem.
     

    Yes, with AS I meant ALATCH; I was working with an M68K recently and got the signal names confused.

     

    Wow, that VB-8012 is a serious bit of kit. Does it have an input mode that adds some hysteresis to the 32 inputs? If so, it could maybe help with the cross-talk. Maybe @Jimhearne and @Stuart have suggestions -- they are much better with hardware issues than I am.


  12. This is a very interesting avenue of development!


    Just throwing out some thoughts:

     

    1. I heard (read) the GPL processor story as well, but I am not sure it is correct. As I understood it, the original plan was for a 99xx CPU with an 8-bit data path, but this project did not materialise in time and the 16-bit 9900 was shoehorned in at a late stage. I also think I remember reading that the designers did not mind the "double interpreter" because they expected that a dedicated CPU would be used for a next-gen system. I am not sure how the two things relate, if at all.

     

    2. For a microcoded design, have a look at my 99000 version. It has ~200 states for the 9995 instruction set.

     

    3. Another route could be to use the co-processor design of the 99xxx series. I am not implementing that, but it could help to keep complexity down by separating the GPL part into a co-processor. That co-processor could have a data path optimised for GPL, with maybe a separate address ALU etc. The co-processor interface has facilities to transfer the WP, PC and ST registers between the CPU and the co-processor, so integration could be quite seamless.
     


  13. When I look at the scope output picture(s) I am surprised by some of the signals. It is not clear why CLKOUT would not show a nice regular square wave, and I don't think that the BST lines should change state when the AS signal is low. Is it possible that the scope / analyser is not grounded and hence mis-measuring the signals? If your system is multi-board, is it possible that ground does not feed through? Or a ground loop perhaps?


  14. 1 hour ago, speccery said:

    but probably more than what your target line count would be

    It is not about the line count so much, it is about maximum simplicity. When using internal RAM (the smallest version of the ULX3S has 112KB of internal RAM/ROM capacity), doing a 9918 that just supports basic 256x192 VGA DVI output is very simple indeed, hardly more complex than the video circuit in the 99/2. The complexity is in the sprites, which the 9918 silicon implements with comparators and counters, replicated in 4 blocks (one for each sprite that can appear on a scan line). I'm thinking of duplicating that design in the FPGA; hopefully it leads to very simple & readable code. However, writing that takes time, which I currently don't have.
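    The vertical-compare stage that loads those 4 sprite blocks can be sketched in C. This is a behavioural sketch, not a description of the silicon: it assumes the usual TMS9918 conventions that a sprite is shown from line Y+1 onward and that a Y value of 208 (0xD0) ends the sprite list, and all names are hypothetical. The horizontal counters and pattern shift registers of each block are not shown.

```c
#include <stdint.h>

#define MAX_SPRITES      32
#define SPRITES_PER_LINE 4      /* one hardware sprite block each */
#define Y_TERMINATOR     0xD0   /* Y = 208 ends the sprite attribute list */

typedef struct { uint8_t y, x, pattern, color; } SpriteAttr;

typedef struct {
    int count;                      /* sprite blocks loaded for this line */
    int index[SPRITES_PER_LINE];    /* sprite number held by each block */
    int fifth;                      /* number of the 5th sprite found, or -1 */
} LineSprites;

/* size: sprite height in lines (8, 16, or 32 when magnified). */
static LineSprites select_sprites(const SpriteAttr sat[MAX_SPRITES], int line, int size) {
    LineSprites r = { 0, { 0 }, -1 };
    for (int i = 0; i < MAX_SPRITES; i++) {
        if (sat[i].y == Y_TERMINATOR) break;
        /* Row within the sprite; the 8-bit wrap lets Y near 255 hang off the top. */
        uint8_t row = (uint8_t)(line - (sat[i].y + 1));
        if (row >= size) continue;
        if (r.count == SPRITES_PER_LINE) { r.fifth = i; break; }
        r.index[r.count++] = i;
    }
    return r;
}
```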
     

    1 hour ago, speccery said:

    They don't have true dual port block memories

    Just the other day I learned that Yosys currently cannot infer true dual-port RAM anyway. It is limited to one R/W port and one R port; this bit of the Yosys code is currently being rewritten, so hopefully this limitation will be gone soon. For true dual-port one currently has to use a library block (Emard has that in his repo).
     

    1 hour ago, speccery said:

    With regards to the TI-99/2, it would be interesting to port the Basic from it to the 99/4A. That would be quite a bit faster, as it is not using GPL to my understanding.

    Yes, it does not use GPL, and it does not need to, as the RAM is connected directly to the CPU. When debugging the TI99/2 I disassembled some parts of the 32KB ROM: it has a table-driven parser that compiles into a token byte code ("IF", "NEXT", etc.). This token byte code is then interpreted by calling a subroutine for each token. I did not manage to fully understand the parser, but I think it is a bottom-up parser with separate left and right priorities for each token.
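    For readers unfamiliar with the technique, here is a generic C sketch of an operator-precedence parser with separate priorities per token, feeding a byte-code interpreter that calls one subroutine per token. It is not a reconstruction of the TI-99/2 ROM: the token codes and priorities are hypothetical, it handles integer arithmetic only, and it assumes well-formed input.

```c
#include <ctype.h>
#include <stdint.h>

enum { T_NUM, T_ADD, T_SUB, T_MUL, T_DIV, T_POW, T_LPAREN, T_COUNT };

/* Priority on arrival vs. priority once on the stack. Equal values give left
   associativity; stack < input gives right associativity (used for '^'). */
static const int in_prio[T_COUNT]    = { 0, 1, 1, 2, 2, 4, 9 };
static const int stack_prio[T_COUNT] = { 0, 1, 1, 2, 2, 3, 0 };

typedef struct { uint8_t op; int32_t val; } Insn;

/* Bottom-up parse into postfix token code; returns the number of instructions. */
static int compile_expr(const char *s, Insn *out) {
    int ops[64], sp = 0, n = 0;
    while (*s) {
        if (isspace((unsigned char)*s)) { s++; continue; }
        if (isdigit((unsigned char)*s)) {
            int32_t v = 0;
            while (isdigit((unsigned char)*s)) v = v * 10 + (*s++ - '0');
            out[n++] = (Insn){ T_NUM, v };
            continue;
        }
        int t = 0;
        switch (*s++) {
        case '+': t = T_ADD; break;    case '-': t = T_SUB; break;
        case '*': t = T_MUL; break;    case '/': t = T_DIV; break;
        case '^': t = T_POW; break;    case '(': t = T_LPAREN; break;
        case ')':   /* reduce until the matching '(' and drop it */
            while (sp > 0 && ops[sp - 1] != T_LPAREN) out[n++] = (Insn){ (uint8_t)ops[--sp], 0 };
            if (sp > 0) sp--;
            continue;
        default: return -1;
        }
        while (sp > 0 && stack_prio[ops[sp - 1]] >= in_prio[t]) out[n++] = (Insn){ (uint8_t)ops[--sp], 0 };
        ops[sp++] = t;
    }
    while (sp > 0) out[n++] = (Insn){ (uint8_t)ops[--sp], 0 };
    return n;
}

static int32_t op_add(int32_t a, int32_t b) { return a + b; }
static int32_t op_sub(int32_t a, int32_t b) { return a - b; }
static int32_t op_mul(int32_t a, int32_t b) { return a * b; }
static int32_t op_div(int32_t a, int32_t b) { return b ? a / b : 0; }
static int32_t op_pow(int32_t a, int32_t b) { int32_t r = 1; while (b-- > 0) r *= a; return r; }

/* One subroutine per token code. */
static int32_t (*const handler[T_COUNT])(int32_t, int32_t) =
    { 0, op_add, op_sub, op_mul, op_div, op_pow, 0 };

static int32_t run_code(const Insn *code, int n) {
    int32_t st[64]; int sp = 0;
    for (int i = 0; i < n; i++) {
        if (code[i].op == T_NUM) { st[sp++] = code[i].val; continue; }
        int32_t b = st[--sp], a = st[--sp];
        st[sp++] = handler[code[i].op](a, b);
    }
    return st[0];
}
```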
     

    1 hour ago, speccery said:

    Anyway, this is just a long way of saying that if you want to use my TMS9918, let me know and I will tidy up the code to make it easier to read. I am planning to clean up the code when I have a moment, also to make a version which does not have the external memory support to simplify the core.

    At another time, yes please. At the moment work projects are keeping me away from hobby stuff and I'd like to complete three other hobby projects first:
    - A 4-way write-back cache, to make SDRAM access fast. I have that working for Oberon, but I'm not happy with it yet.
    - True HDMI video (as opposed to DVI). This means implementing data islands and sound encoding.
    - Clean up TCP/IP for the Cortex
    So, we're talking mid-2021 at the earliest, maybe 2022.

    Maybe it is a cool project for a Tomy Tutor enthusiast...

     


  15. On 11/18/2020 at 12:16 AM, FarmerPotato said:

    It's kind of hard for the 99105 to copy its own BIOS ROM into a RAM at the same address space.

    Actually, the original Cortex did that, using a technique called "write under". Initially, reads are from ROM, but writes to the same addresses are sent to RAM. Once the copy is complete, the ROM is switched off (using a CRU bit) and both reads and writes go to RAM. There are wait states for slow ROM access.
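    The "write under" sequence can be modelled in C to show why it works: the copy loop reads each word and writes it back to the same address, and the memory decoding routes the read to ROM and the write to the RAM underneath. The memory size, the `rom_enabled` flag standing in for the CRU bit, and all function names are hypothetical.

```c
#include <stdint.h>

#define MEM_WORDS 256   /* hypothetical size of the ROM window, in words */

static uint16_t rom[MEM_WORDS], ram[MEM_WORDS];
static int rom_enabled = 1;   /* stands in for the CRU bit that switches ROM off */

/* Reads come from ROM while it is enabled, from RAM afterwards. */
static uint16_t mem_read(unsigned a) { return rom_enabled ? rom[a] : ram[a]; }

/* Writes always land in RAM, even while ROM is being read at the same address. */
static void mem_write(unsigned a, uint16_t v) { ram[a] = v; }

static void shadow_rom(void) {
    for (unsigned a = 0; a < MEM_WORDS; a++)
        mem_write(a, mem_read(a));   /* read from ROM, write "under" it into RAM */
    rom_enabled = 0;                 /* ROM off: reads and writes now go to RAM */
}
```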

     

    On 11/18/2020 at 12:16 AM, FarmerPotato said:

    The DMA feature will work just as TI intended. The essential is that a peripheral takes over the bus for a period, and has full access to read/write the other peripherals just like the CPU would.

    Have you considered the TMS9911 DMA controller?


  16. 2 hours ago, speccery said:

    Could you provide links to the TI-99/2 and Mini-Cortex?

    The TI99/2 is here:

    https://gitlab.com/pnru/ti99/-/tree/master/ti99_2

     

    The Mini-Cortex is here:

    https://gitlab.com/pnru/cortex

    I've focused on the Unix side of it. Its main claim to fame is that it hosts a 99-native C compiler and tool chain, and hence it can re-compile itself. I have a native TCP/IP stack working, but the experience is not smooth yet. It uses the ESP32 as an ISP, and connects to it using a PPP serial line.
     

    2 hours ago, speccery said:

    In addition to the computers you listed, there is also the Tomy Tutor which I've been thinking to support. It's a simple computer, and with all the parts of TI-99/4A available should be a simple thing to do. Probably TMS9995 instructions would be needed.

     

    Yes, I've been thinking about that as well. The CPU in the Mini-Cortex is my best approximation of the 9995 yet, implementing the extra 4 instructions. It is almost cycle accurate and the bus interface is that of the 99105. It also has code to emulate the 9995's interrupt lines, for the internal timer and CRU bits, etc. Tongue-in-cheek, I'm calling it the 99095.

     

    What I had in the back of my mind was to do a version of the 9918 that mimicked the data paths of the real vintage silicon. I think it should fit in some 500-700 lines of Verilog and would of course have the same limits (4 sprites on a line, no 80 column text mode, etc.). Never got around to doing that code. Together with the 99095 it would allow for a very compact implementation of the Tomy Tutor. Probably using your 9918 is a quicker route to success.

     


  17. Just for info to the interested: after the first production run of the ULX3S board of almost 1000 pieces sold out in days, there is now a second production batch available on Mouser:
    https://eu.mouser.com/Search/Refine?Keyword=ulx3s
    If interested in this stuff, get one whilst supplies last. There is now the Icy99 implementation of the 99/4A + extensions, and there is also a TI99/2 implementation, and an implementation of the Mini-Cortex. Maybe an implementation of the TI99/8 will emerge over time.

    ULX3S development is discussed at:

    https://gitter.im/ulx3s/Lobby

    The complete open source Verilog tool chain can be downloaded from
    https://github.com/open-tool-forge/fpga-toolchain

    it is a big install (~700MB installed), but still much, much better than the multi-gigabyte installs that the vendor tool chains require.


  18. 19 hours ago, FarmerPotato said:

    Oh great gabbleblotchits. There is indeed a note on p.90 that says MOV skips the DOP fetch.

    However, I still have this problem for MOVB.

    For future readers: I think you mean page 80.

    When developing the Verilog model for the 99000 I came across this optimisation of MOV. It is not only the DOP fetch, but also the destination WS fetch that is skipped. When debugging the FPGA version of the TI99/2 I disabled this optimisation for a while. Much to my surprise the TI99/2 ran some 10% slower without the optimisation.

     

    There is some interesting analysis by Karl Guttag (the 9995 and 99000 chip designer) here (note that the code name for the 99000 was "Alpha"):
    https://hansotten.file-hunter.com/uploads/files/99000 (Alpha) Misc Documents.pdf
    It says that MOV accounts for 25% of instructions, and MOV and MOVB together for 30%. Throw JMP into the mix and it is 50%.

    Do you still have the problem for MOVB? On the 9995 interim solution it is not an issue and on the 99105 you can ignore the read-before-write using the BST outputs, as you already described above. What other current scenario is still problematic?


  19. By coincidence I was looking at RTC chips a few weeks ago. Looking at some old boards, I arrived at the MM58174 as a chip that was period correct and has direct links to the very first RTC chips to appear on the market in the 1970's (the OKI M5832).

    For an 8 bit variant have a look at the MM58167.


    A third choice could be an early serial chip, the NEC uPD4990. This can maybe be coerced to act like a CRU interface chip.

     

    I think all are still available on eBay, but I have no direct buying experiences for any of the above.

     


  20. As I understand it, Kienzle got started as a manufacturer of mechanical taximeters in the 1920's and from there expanded into electro-mechanical tabulating/accounting machines in the 1950's/60's. From there they moved to electronic accounting machines (what in Germany used to be called "Mittlere Datentechnik", mid-range information technology). This was the 6000 series. Some sources say it used the 9900, but considering it was launched in 1968, well before the 9900 appeared in 1976, that is probably wrong. Perhaps the later models did.

     

    Kienzle got stuck in that technology and was late converting to true mini-computers. They launched the 9000 series in 1979 and this used the tms9900 for sure. As they were entering the market late without clear differentiation, it was not a commercial success. I visited the Kienzle factory in 1984 (I think) as a student on a group field trip and came across a 9000 series machine in passing. They mainly wanted to showcase their automobile technologies but I got a few questions in. What I remember of that is that it was based on the tms9900 and that it ran MTOS. They also claimed that it could "run Unix as a sub-system under MTOS". I think what that meant was that it had a C compiler and a C library that worked under MTOS. I've never found any other reference to that, maybe it was a research skunkworks project. I think the main workhorse in the 9000 series was the Kienzle 9066. I am not sure what MTOS was. It could have been an in-house development, it is also possible that it was a translated version of DX10, or something like that.

     

    Later on they had the 9100, 9200 etc. series, which I believe to have been TMS99000 based (99105 most likely). In view of the timeline it is possible that the later 9x00 series used Ten-X technology, but I am speculating here.

     

    It is quite likely that (Mannesmann-) Kienzle took similar steps as TI in the late 80's, switching to x86 and 68K based Unix systems, with software support to run the installed base of 990 COBOL programs, before giving up altogether.
     

    There is a list of Kienzle models here:
    http://www.computer-archiv.de (go to section K, select Kienzle, for the specific page).
    As Ksarul already observed, there are also MIPS machines in the list, the 2800 series.

     


  21. I was querying a correspondent about the MMU on the Ten-X 7-XP board and the following is what he reported wrt the 99105

    Would that correspondent be Daren Appelt by any chance?

     

    The TMS99105 had some undocumented macro space (32 bytes) inside the processor. This was 0 wait state memory and I utilized this to implement a variant of the LMF opcode to load the MMU.

     

    That might refer to the macro space workspace, 16 registers = 32 bytes. This was located at macro memory address 0. It is the bottom-right square on the die shot here: https://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#/media/File:TI_TMS99105A_die.JPG


    What Ten-X docs I have are here: ftp://www.dragonsweb.org/pub/ti/docs/Ten-X/

     

    That is interesting; looking at it now.

     

    He also tells me the 7-XP stuff was later sold to a German company, but doesn't recall which one. Anyone has any idea I sure would like to know.

     

    It could also refer to Kienzle. Kienzle was one of the many German mini-computer companies focusing on business computers for mid-sized businesses. They used the 9900 CPU in the late 70's and the 99000 family in their later models. Like much of the New England computer scene, they did not make it through the 80's and ended up being acquired by the industrial conglomerate Mannesmann. In 1991 it was sold on to Digital Equipment Corp. For German speakers (readers) the full story is here.

     

    990 microcode is of course a different animal than 99K macrocode assembly, but if your aim is to implement the rest of the /12 instruction set, it would be useful to know. I'll post anything I find here.

     

    Thanks. It would be interesting. I did a full analysis & commentary on the 99110 ROM a while back.


  22. The source code for TIC appears to be partial - it looks like a few files did not get uploaded.

     

    Alan was kind enough to share the sources to TIC with me some 6 years ago, when I first got started on the Unix on 9900 port. These files have been on WHTech ever since:
    http://ftp.whtech.com/programming/TIC/

    That zip file contains a full set of sources, which are verified to build. Dave Pitts' assembler can be used as an alternative to TASM (for TIC output); see the README.txt for details.

     

    (The Unix port is long since completed, but I used a port of the PDP11 Unix compiler in the end).
