E474 Posted August 28, 2020

Hi,

I was just wondering if it would be possible to upload code to a Happy that would support much higher SIO transfer speeds, such as Pokey divisor 0. My thinking is that if the 6502 in the 8-bit can handle sending and receiving data at Pokey divisor 0 rates, surely this could also work with a Happy drive if the appropriate code was uploaded to the Happy's RAM. Obviously the transfer would be fastest if it was done from the Happy track buffer, but I think there would be enough RAM to upload an enhanced transfer routine and still have space for a track buffer. I think Hias's SIO code is only about 1K, and a Happy (1050) has either 6K or 8K of RAM.
+Nezgar Posted August 28, 2020

The CPU in the 1050 runs at 1 MHz, and the timing of its bit-banged loops depends on rates that line up with timing loops on the Atari at 1.79 MHz. The fastest rates I've seen used in real drives are 68 Kbps (Indus GT, not a 1050 though) and 71 Kbps in the 1050 Turbo. The Speedy 1050 normally operates at divisor 9 (55 Kbps), but the HSS copier runs it at a higher speed, probably around 68-71 Kbps. So yes, it may be possible, but current example implementations suggest 71 Kbps might be the maximum... maybe other factors like physical wire impedance/capacitance limit things as well. Curious if anyone with knowledge of timing on both the drive and computer side could determine if faster compatible rates may be possible...
phaeron Posted August 28, 2020

Pokey divisor 0 is 14 cycles/bit on the computer and ~7.8 cycles/bit on the 1050. There might be some way to barely send at that rate, but receiving probably not, as there isn't enough CPU speed to sync to the start bit.
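The cycle counts above follow from the POKEY serial clock relation, baud = clock / (2 * (divisor + 7)). A minimal Python sketch of that arithmetic, assuming the usual NTSC and PAL machine clocks (not stated in the thread) and the 1 MHz drive CPU mentioned earlier; the function names are made up for illustration:

```python
# Assumed machine clocks in Hz (not given in the thread).
NTSC_CLOCK = 1_789_773
PAL_CLOCK = 1_773_447
DRIVE_CLOCK = 1_000_000  # 1050 CPU at 1 MHz


def pokey_cycles_per_bit(divisor):
    """Computer-side CPU cycles per serial bit for a given POKEY divisor."""
    return 2 * (divisor + 7)


def pokey_bps(divisor, clock=NTSC_CLOCK):
    """Serial bit rate in bits per second for a given POKEY divisor."""
    return clock / pokey_cycles_per_bit(divisor)


def drive_cycles_per_bit(divisor, clock=NTSC_CLOCK):
    """How many 1 MHz drive cycles fit into one bit at that rate."""
    return DRIVE_CLOCK / pokey_bps(divisor, clock)


if __name__ == "__main__":
    for d in (0, 3, 9):
        print(d, pokey_cycles_per_bit(d), round(pokey_bps(d)),
              round(drive_cycles_per_bit(d), 2))
```

Divisor 0 gives 14 computer cycles per bit and a little under 8 drive cycles per bit, which is why an 8-cycle drive loop is the natural candidate.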
ijor Posted August 29, 2020

It would be interesting to test how fast the 1050 CPU can transmit. It doesn't matter so much if receiving would be slower. The most important direction of transfer is from the drive to the computer, and it is possible to use an asymmetric protocol.

We now know something very important that wasn't known at the time of the Speedy and the Turbo: the Pokey late capture. It is then conceivable that faster rates could be achieved by taking this into consideration and extending the start bit to compensate.
ijor Posted August 29, 2020 (edited)

So how fast can the 6502 on the 1050 bit-bang the SIO output? You can't go very fast with "traditional" shifting and rotating of bits. A better 6502 coder might correct me, but I don't think it is feasible to do it at 8 cycles per bit, not even by unrolling the bit loop. But what about the old self-modifying code trick? Something as simple as this should work:

    STX PORTB    ; 4 cycles
    STA PORTB    ; 4 cycles
    ; repeat for a total of 8 instructions

"Just" modify the opcodes according to each bit, preload the correct values in the X and A registers, and you can transmit as fast as 4 cycles per bit, for a whopping 250 Kbps!

Of course, there is considerable overhead to modify the code, and you must do it for each byte. So even if you technically could transmit that fast, the delay between bytes would lower the effective bit rate.

An alternative that might require less overhead is to use precompiled code covering all 256 values, and just jump to the appropriate address for the byte you want to transmit. The problem here is, of course, the amount of RAM. Now, you don't need a full separate routine for each byte value, because it is possible to share a lot of code. And even at 7 or 8 cycles per bit you still have time to add a JMP or branch instruction between each bit:

    ; Transmit 1 0 0 0
    STX PORTB    ; 4 cycles
    JMP x01      ; 3 cycles, might need to spend an extra cycle

    ; Transmit 0 0 0 0
    STA PORTB    ; 4 cycles
    STA dummy    ; Spend 3 or 4 cycles
    x01 STA PORTB
    ...

Not sure exactly how much RAM it would require. It is complicated to calculate because you can include conditional branches as well.

There might still be, in addition, a granularity problem in achieving a close match to the Pokey frequency. But it seems that 8 cycles per bit should be close enough for divisor zero.

All of this ignores the hardware side. As Nezgar said, there might be circuit limitations in the drive's external logic.
Edited August 29, 2020 by ijor
Kyle22 Posted August 30, 2020

Divisor zero USD. Great. Now, let's get the Z80 coders to do it for the Indus and ATR-8000. Also Happy/Duplicator functions and compatibility. These devices could be the most powerful copying devices ever made. Plenty of CPU power and RAM on the Indus and ATR. These devices are worthy of programming. :)
+Nezgar Posted August 30, 2020

@ijor 250 Kbps is pretty cool, but wouldn't it be hard to get a consistent arbitrary speed like 125 Kbps? It wouldn't be a straight-up loop: you'd have to add varying amounts of delay cycles to achieve specific rates, ideally multiples that line up with code loops... Bit timing, I would think, would be quite inconsistent ("jittery"), with some bits sustained longer than others, plus the Atari running at a different clock frequency will be sampling those bits at different intervals as well... And of course it would be wonky in the other direction as well.

Kind of like why 9600 bps was marginally unreliable sometimes on the 850 (some modems would reportedly echo at 9600, but would not respond to an AT command as they could not train on the "jittery" bit rate), and 19200 was a mess, because the input levels were being copied to the outputs in a software loop at 1 MHz. At 9600 bps, some bits were sustained longer than others due to the processor cycles sampling the inputs. At 19200, bits would be missed entirely.
E474 Posted August 30, 2020 (Author)

Hi,

I am a little bit rusty on programming PIAs, but isn't the (unrolled) code going to be something like:

    LDX #PORTB_SET_SPECIFIC_BIT_DIRECTION_TO_OUTPUT
    LDA #FIRST_BIT_OF_FIRST_BYTE
    STX PORTB_CONTROL
    STA PORTB_DATA
    LDA #SECOND_BIT_OF_FIRST_BYTE
    STX PORTB_CONTROL
    STA PORTB_DATA
    LDA #THIRD_BIT_OF_FIRST_BYTE
    STX PORTB_CONTROL
    STA PORTB_DATA
    ...
    LDA #EIGTH_BIT_OF_BYTE_256
    STX PORTB_CONTROL
    STA PORTB_DATA

I think I am missing start and stop bits for each byte sent, but that would be extra unrolled code. Prior to transmitting the sector, you would update the immediate operands according to the sector data.

The LDA, STA, STX code would be 8 bytes per bit, so for 256 bytes that would be 256 * 8 bits = 2K bits, and 2K * 8 bytes = 16K of unrolled code. So I don't think that would work.

Maybe you could do:

        LDX #SET_SPECIFIC_BIT_DIRECTION
        LDA #<START_OF_BIT_BUFFER
        STA ZERO_PAGE_BUFFER_POINTER_LO
        LDA #>START_OF_BIT_BUFFER
        STA ZERO_PAGE_BUFFER_POINTER_HI
        LDY #0
    LOOP1
        LDA (ZERO_PAGE_BUFFER_POINTER_LO),Y
        STX PORTB_CONTROL
        STA PORTB_DATA
        INY
        BNE LOOP1
        INC ZERO_PAGE_BUFFER_POINTER_HI
        LDA ZERO_PAGE_BUFFER_POINTER_HI    ; reload: A still holds the last data byte
        CMP #END_OF_BUFFER_VAL
        BNE LOOP1
        RTS

On the other hand, if you are using the high-speed code on the 8-bit side as a basis, you could transfer the 256 bytes of sector data in smaller chunks, so a 32-byte chunk of data could be transferred with 2K of unrolled code on the Happy.

Not sure what other approaches could be used, or if I am actually using the right code for PIAs, or in general.
E474 Posted August 30, 2020 (Author)

Hi,

Thinking about this a bit more, if you are only setting or clearing a single bit in PORTB_DATA (I don't know if this is the case or not), you could use:

        LDX #SET_SPECIFIC_BIT_DIRECTION
        LDY #BIT_IS_ONE
        LDA #BIT_IS_ZERO
    WRITE_DATA
        ; transmit first bit from sector data
        STX PORTB_CONTROL
        STA PORTB_DATA
        ; transmit second bit from data
        STX PORTB_CONTROL
        STA PORTB_DATA
        ...
        ; transmit 2048th bit from data

and before calling it, go through the code changing each STA to STA or STY depending on whether you want to write a 1 or a 0 to PORTB_DATA. Unfortunately, this is still too big, e.g. 6 bytes * 2048 bit writes (not counting start and stop bits) = 12K. Maybe 2048 JSR calls would do it though, and just modify each JSR call depending on whether you wanted to write a 1 or a 0 bit? That would be about 6K of unrolled code (2048 * 3 bytes), plus start and stop bits?
ijor Posted August 30, 2020 (edited)

11 hours ago, Nezgar said:
> @ijor 250kbps is pretty cool, but wouldn't it be hard to get a consistent arbitrary speed like 125kbps? It wouldn't be a straightup loop - you'd have to add varying amounts of delay cycles to achieve specific rates, ideally multiples that line up with code loops... bit timing I would think would be quite inconsistent "jittery", with some bits sustaining longer than others, plus the Atari running at a different clock frequency will be 'sampling' those bits at different intervals as well... And of course wonky the other direction as well.

The 250 Kbps figure was just to show how fast it is possible to transmit, in theory. Pokey can't operate at such a high frequency, except in synchronous mode, which the 1050 can't use anyway. So the maximum we can normally use is divisor zero, which is ~127 Kbps.

Yes, there is a granularity issue and it is not possible to use an arbitrary frequency. But the frequencies of the sender and the receiver don't need to match exactly. They just have to be close enough. Async serial transmission can tolerate about a 5% difference. And since here we can adjust the length of the start bit, we can achieve a more optimal alignment that could tolerate close to a 10% difference in the frequencies.

So at divisor zero the drive would transmit at 125 Kbps and Pokey would receive at ~127 Kbps (slightly different depending on PAL or NTSC). As you can see, the difference is about 2% in the worst case, which should be good enough. Of course, this has to be tested to confirm the theory.

But again, to achieve such high speeds you need to implement some kind of precompiled code that would require either considerable overhead between bytes or considerable amounts of RAM. If I'm not mistaken, a "standard" code loop could be as fast as 11 cycles per bit, perhaps even 10 (or 9?) cycles using some undocumented opcode tricks. Still, 11 cycles per bit with Pokey divisor 3 would give you ~89 Kbps. Not bad.

Yes, the other direction, from the computer to the drive, is a completely different game.

Edited August 30, 2020 by ijor
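To put numbers on the "close enough" argument, here is a small Python sketch that compares a drive loop of N cycles per bit against the POKEY rate for a given divisor. It assumes the usual NTSC and PAL machine clocks (not stated in the thread) and a 1 MHz drive CPU; `mismatch_percent` is a name invented for this illustration.

```python
# Assumed clocks in Hz (not given in the thread).
NTSC_CLOCK = 1_789_773
PAL_CLOCK = 1_773_447
DRIVE_CLOCK = 1_000_000  # 1050 CPU


def mismatch_percent(drive_cycles_per_bit, divisor, clock):
    """Signed difference between drive and POKEY bit rates, in percent.

    Positive means the drive is faster than POKEY expects.
    """
    drive_bps = DRIVE_CLOCK / drive_cycles_per_bit
    pokey_bps = clock / (2 * (divisor + 7))
    return 100.0 * (drive_bps - pokey_bps) / pokey_bps


if __name__ == "__main__":
    for cycles, divisor in ((8, 0), (11, 3)):
        for name, clock in (("NTSC", NTSC_CLOCK), ("PAL", PAL_CLOCK)):
            pct = mismatch_percent(cycles, divisor, clock)
            print(f"{cycles} cycles/bit vs divisor {divisor} ({name}): {pct:+.1f}%")
```

Under these assumptions both pairings stay within a few percent, consistent with the roughly 2% figure quoted for divisor zero. Whether real hardware tolerates that is a separate question that only testing can answer.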
ijor Posted August 30, 2020 (edited)

5 hours ago, E474 said:
> I am a little bit rusty on programming PIAs, but isn't the (unrolled) code going to be something like:
>
>     LDX #PORTB_SET_SPECIFIC_BIT_DIRECTION_TO_OUTPUT
>     LDA #FIRST_BIT_OF_FIRST_BYTE
>     STX PORTB_CONTROL
>     STA PORTB_DATA
>     LDA #SECOND_BIT_OF_FIRST_BYTE
>     STX PORTB_CONTROL
>     STA PORTB_DATA
>     ...
>     LDA #EIGTH_BIT_OF_BYTE_256
>     STX PORTB_CONTROL
>     STA PORTB_DATA
>
> I think I am missing start and stop bits for each byte sent, but that would be extra unrolled code. Prior to transmitting the sector, you would update the immediate #opcodes according to the sector data. The LDA, STA, STX code would be 8 bytes/bit, so for 256 bytes would be 256 * 8 bits = 2k, 2k * 8 bytes = 16 k of unrolled code. So I don't think that would work.

No. In the first place, the 1050 has a RIOT, not a PIA. The chips are quite similar, but the RIOT doesn't have a control register; you can always address the PORT register directly.

In the second place, when I said unrolled loop I didn't mean at the byte level, but at the bit level. You still obviously have to loop for each byte. So, for the maximum theoretical limit of 250 Kbps, the code would be something like this:

    ; Process start of byte
    LDA ValueFor0    ; Actual values depend on the stepper output bits,
    LDX ValueFor1    ; that share the same RIOT port
    STA PORTB        ; Xmit start bit
    ; Might need a couple of cycles delay here to compensate for Pokey late capture
    STX PORTB        ; Xmit bit 0. Opcode would be precompiled depending on byte
    STA PORTB        ; Xmit bit 1. Precompiled STA or STX opcode
    ...
    STX PORTB        ; Xmit stop bit
    ; Loop for next byte

Edited August 30, 2020 by ijor
E474 Posted August 30, 2020 (Author)

Hi,

I thought the RIOT contained a PIA as a logical sub-part (the only info I checked was Wikipedia/AtariAge). Either way, if you only need to set the port control direction at the start, and can just do sequential (bit) writes afterwards, you might as well unroll the code for writing a whole 256-byte sector's worth of data, and update the writes to STA or STY depending on the actual sector data.

        LDA #BIT_IS_ONE
        LDY #BIT_IS_ZERO
        LDX #SET_PORTB_BIT_CONTROL_DIRECTION
        STX PORTB_CONTROL
    TRANSFER_DATA
        STA PORTB_DATA
        STA PORTB_DATA
        ...
        STA PORTB_DATA
        RTS

and then have a routine that converts each bit of each byte of sector data into STA or STY opcodes in the TRANSFER_DATA code. That would give just over 6K of unrolled code, but maybe it's better to have a fast, unrolled 128-byte transfer routine and call it twice for 256-byte sectors, though you would expect a pause after 128 bytes, as the routine would have to be updated with the second 128 bytes of the sector data. That would be a bit over 3K. I think that's as fast as you could go?
ijor Posted August 30, 2020

2 hours ago, E474 said:
> Either way, if you only need to set the port control direction at the start, and can just do sequential (bit) writes afterwards, you might as well unroll the code for writing a whole 256 byte sectors worth of data, and update the writes to STA or STY depending on the actual sector data. ... and then have a routine that converts each bit of each byte of sector data into STA or STY opcodes in the TRANSFER_DATA code. That would give just over 6K of unrolled code, ...

You don't need to set the port direction at all. That is performed only once, at reset time. The SIO signals are unidirectional, and they are open collector. There is no need to ever change the port to input.

But yes, what you suggest is perfectly possible if you have enough RAM. As a matter of fact, fully unrolled code for the whole sector is the only way to transmit the whole sector at a 250 Kbps effective rate.

But 250 Kbps is not realistic. As I said, Pokey can't operate asynchronously at that frequency. And regardless, the overhead necessary to precompile the code is too expensive. Precompiling the opcode would probably require something like 10 cycles per bit, I guess. Add to this the 4 cycles per bit for the actual transmission, and in the end this turns out to be slower than using non-precompiled code.

Using 125 Kbps with Pokey divisor zero might not be much better. The extra 4 cycles you can spend between each bit might, conceivably, make a difference and allow for some clever optimization. It might even be possible to process the next byte, at least partially, while you are transmitting the current one. But I don't know. It might also require too expensive an overhead between bytes or between sectors.
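The cost argument above can be written out as simple arithmetic. This Python sketch uses ijor's guessed 10 cycles per bit to patch the opcode, the 4-cycle store, and the 11-cycle conventional loop estimate from earlier in the thread; all three numbers are estimates from the posts, not measurements, and framing bits are ignored.

```python
DRIVE_CLOCK = 1_000_000  # 1050 CPU at 1 MHz


def effective_bps(prepare_cycles_per_bit, send_cycles_per_bit):
    """Bits per second once preparation time is charged to each bit sent."""
    return DRIVE_CLOCK / (prepare_cycles_per_bit + send_cycles_per_bit)


# Guessed 10-cycle patch cost plus the 4-cycle store per bit.
precompiled = effective_bps(10, 4)
# Estimated 11-cycle conventional loop with no preparation step.
conventional = effective_bps(0, 11)

if __name__ == "__main__":
    print(round(precompiled), round(conventional))
```

With those inputs the precompiled route spends 14 cycles per bit overall against 11 for the plain loop, which is the sense in which it "turns out to be slower".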
E474 Posted August 30, 2020 (Author, edited)

Hi,

I think you could unroll the code for writing individual bits from a byte to the transfer code, but you would have to loop through the sector data byte by byte, and jump over the code for stop and start bits. You could also replace the send stop/send start bits with a JSR to a routine that does this, as it would result in smaller unrolled code. You could just about fit unrolled code for 256-byte sectors into the Happy memory (probably not for a 6K Happy), but you would have to give up the track buffer, though you could probably try to buffer the next few sectors.

If you were dealing with 128-byte sectors, then you could handle those in 3K of unrolled code. If you wanted to handle 256-byte sectors, you could generate unrolled code for 128 bytes of data, execute it to transfer that data, generate the code for the next 128 bytes, then execute that code to transfer the data. Unfortunately that would mean a pause halfway through a 256-byte sector while the code for the second half of the sector's data was generated. I don't know how well this would be tolerated, but on the other hand, if there is custom code on the 8-bit for handling the transfer, I think it would be OK.

Actually, for a single-density disk, you would have 18 * 128 bytes = 2304 bytes for the track buffer and 3K for the unrolled transfer routine, so you could probably get the maximum throughput on a 6K Happy, at which point you would have to consider whether your receiving code on the 8-bit could keep up.

Edited August 30, 2020 by E474: added single density track buffer size
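The RAM figures in this exchange can be checked with a short Python sketch. It assumes one 3-byte absolute store (STA/STX/STY) per transmitted bit and a single trailing RTS, as in the unrolled routine sketched above; `unrolled_bytes` and its `framing_bits` parameter are names invented here.

```python
STORE_ABS_BYTES = 3            # opcode plus two address bytes
TRACK_BUFFER_SD = 18 * 128     # single-density track buffer, from the post


def unrolled_bytes(sector_bytes, framing_bits=0):
    """Size of a fully unrolled send routine for one sector.

    framing_bits=2 adds inline start and stop bits for every data byte.
    """
    bits_per_byte = 8 + framing_bits
    return sector_bytes * bits_per_byte * STORE_ABS_BYTES + 1  # +1 for RTS


if __name__ == "__main__":
    for size in (128, 256):
        print(size, unrolled_bytes(size), unrolled_bytes(size, framing_bits=2))
    print("SD track buffer:", TRACK_BUFFER_SD)
```

Data bits alone give "a bit over 3K" for 128 bytes and "just over 6K" for 256 bytes. Putting start and stop bits inline adds another 25% on top of that.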
Rybags Posted August 31, 2020

I don't know a lot about the Happy hardware, but what about the delay of generating the code? Generating bit-bang instructions for about 1300 transitions per sector might potentially cause sufficient delay that rotational latency screws up any speed gains?
+Nezgar Posted August 31, 2020

7 hours ago, E474 said:
> on the other hand, if there is custom code on the 8-bit for handling the transfer, I think it would be OK.

Love the theoretical hypothesis going on here, but way over my head. The thought occurred to me with the quoted statement that with "custom code on the 8-bit" you could also just transfer the entire track at once using existing known working bitrates and probably double the practical throughput of sequential reads/writes by skipping all the inter-sector overhead... Kinda like how PCLINK uses large block transfers over SIO...
E474 Posted August 31, 2020 (Author)

Hi,

I was thinking about what would be the best way to generate the transfer code, and thought maybe:

    CREATE_SKELETON_TRANSFER_CODE
        ; LOOP TO FILL TRANSFER_CODE WITH SKELETON CODE BYTES
    LOOP1
        ; FIRST BIT OF BYTE
        .BYTE ??,#<PORTB_DATA,#>PORTB_DATA
        ; SECOND BIT OF BYTE
        .BYTE ??,#<PORTB_DATA,#>PORTB_DATA
        ....
        ; EIGHTH BIT OF BYTE
        .BYTE ??,#<PORTB_DATA,#>PORTB_DATA
        .BYTE #OPCODE_FOR_JSR,#<WRITE_STOP_START_BITS,#>WRITE_STOP_START_BITS
        ; BRANCH TO LOOP1 IF THERE ARE MORE BYTES IN THE BUFFER TO GENERATE SKELETON CODE FOR
        RTS

You only need to call this routine once, ever, as the ?? will get replaced when you process the actual buffer data, and the rest never changes.

    ; A CONTAINS VALUE FOR STOP BIT
    ; Y CONTAINS VALUE FOR START BIT
    WRITE_STOP_START_BITS
        STA PORTB_DATA
        STY PORTB_DATA
        RTS

For processing the bit values in each byte, it is probably quicker to use an 8-byte buffer to store each STA or STY opcode, depending on the value of each bit of the byte being processed, and then update the appropriate set of 8 instructions in the TRANSFER_DATA unrolled code.

I think the code from @HiasofT would be the best basis for extending to support this transfer speed, so it could be possible to actually read from the drive at this speed.

If you were looking at a specific use case like copying an unprotected disk from Happy Drive 1 to Happy Drive 2, you might as well delegate formatting the destination drive at the start of the copy, and read the track layout of the source (that is, transfer a list saying which sectors in the current track have data), then only read and write those sectors.

Also, if you wanted to transfer 256-byte sectors, and had to do UPDATE_SKELETON_CODE, execute code, UPDATE_SKELETON_CODE, execute code, you would know fairly accurately how long to wait between 128-byte transfers, as UPDATE_SKELETON_CODE should take the same amount of time to execute regardless of the actual byte data (I think).
ijor Posted August 31, 2020

15 hours ago, Rybags said:
> Generating bit-bang instructions for about 1300 transitions per sector might potentially cause sufficient delay that rotational latency screws up any speed gains?

The problem is not exactly the rotational latency. But yes, generating all those instructions like that is so slow that it makes the whole thing not worth it.
ijor Posted August 31, 2020 (edited)

19 hours ago, E474 said:
> You could also replace the send stop/send start bits with a jsr to a routine to do this as this would result in smaller unrolled code.

No, you can't, because that would disrupt the timing. Writing the stop bit must be done as fast as any other bit or it would create framing errors. Using a JSR (or JMP) instruction there would produce a fatal delay.

5 hours ago, E474 said:
> For processing the bit value in each byte, probably quicker to use an 8 byte buffer to store each STA or STY opcode, depending on the value of each bit of the byte being processed, and then update the appropriate set of 8 instructions in the TRANSFER_DATA unrolled code

But how could you do that efficiently? Unless you can find a very fast method to do that, the overhead is too much and you lose all the advantage of using such a high bit rate.

5 hours ago, E474 said:
>     STA PORTB_DATA
>     STY PORTB_DATA
>     ...
>
> I think the code from @HiasofT would be the best basis for extending to support this transfer speed, so it could be possible to actually have read from drive speeds at this speed.

Once again, no, that's impossible. You are bit-banging at 250 Kbps and Pokey can't work at such a high frequency. It's not a software issue; it is a limitation of Pokey. The maximum frequency at which Pokey can operate asynchronously is ~127 Kbps.

I initially showed the code for 250 Kbps just to show how fast the 1050 can transmit, at least in theory. This can't really be used, at least not against Pokey. Conceivably, 250 Kbps could be used when connected to a PC.

Edited August 31, 2020 by ijor
E474 Posted August 31, 2020 (Author)

Hi,

Thanks for the feedback! It sounds like the code on the 1050 needs to be a little slower. I think adding NOPs between outputting each bit would bloat the code too much, so maybe it is better to JSR to either OUTPUT_BIT_SET or OUTPUT_BIT_CLEAR in the TRANSFER_DATA code? Provided these routines are on the same page, you are only updating one byte in the TRANSFER_DATA routine per bit output. Not sure how many cycles per bit, but the code would look something like:

    TRANSFER_DATA
        JSR OUTPUT_BIT_SET
        JSR OUTPUT_BIT_SET
        JSR OUTPUT_BIT_CLEAR
        ..

    OUTPUT_BIT_SET
        STA PORTB_DATA
        RTS

    OUTPUT_BIT_CLEAR
        STY PORTB_DATA
        RTS

But it would be a constant rate.

You might as well unroll the code for testing each bit in a byte of sector data. The basic idea is: 8 bits -> 8 bytes, where each byte is the low part of the address of either OUTPUT_BIT_CLEAR or OUTPUT_BIT_SET. Then the 8 bytes get copied into a block of JSR statements for outputting that particular data byte.

        LDA DEST_JSR_TABLE
        STA (Z_DATA_TRANSFER_BLOCK),Y
        INY
        INY
        INY
        LDA DEST_JSR_TABLE+1
        STA (Z_DATA_TRANSFER_BLOCK),Y
        INY
        INY
        INY
        ...
        LDA DEST_JSR_TABLE+7
        STA (Z_DATA_TRANSFER_BLOCK),Y
        CLC
        LDA Z_DATA_TRANSFER_BLOCK
        ADC #30
        STA Z_DATA_TRANSFER_BLOCK
        BCC ONLY_LO
        INC Z_DATA_TRANSFER_BLOCK + 1
    ONLY_LO

although I don't know if it is really worth unrolling this code, as putting it inside a loop would not slow it down that much. The only other code would be data byte -> 8 bytes of low JSR addresses.
ijor Posted September 1, 2020

2 hours ago, E474 said:
> I think adding NOPs between outputting each bit would bloat the code too much, so maybe it is better to JSR to either OUTPUT_BIT_SET or OUTPUT_BIT_CLEAR in the TRANSFER_DATA code? Providing these routines are on the same page, you are only updating one byte in the TRANSFER_DATA routine per bit output. Not sure how many cycles per bit, but code would look something like:
>
>     TRANSFER_DATA
>         JSR OUTPUT_BIT_SET
>         JSR OUTPUT_BIT_SET
>         JSR OUTPUT_BIT_CLEAR

That would be very slow. It would work, but it would take more cycles than using a simple conventional loop.

Btw, I realized that I forgot about the issue of the drive's SIO output being open collector. I think this eliminates the possibility of using 250 Kbps altogether. Not sure if it would work at 125 Kbps. I think all devices using divisor zero use push-pull drivers. So besides all the other issues, this remains to be tested and confirmed.

It seems better to be a little more modest and try a "conventional" loop to transmit at a "mere" ~89 Kbps, using 11 cycles per bit and Pokey divisor 3. If it works, it would probably still be a record.
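A quick cycle count shows why the JSR-per-bit idea ends up slower than a plain loop. The Python sketch below uses the documented 6502 timings for JSR, an absolute-addressed store, and RTS, and assumes a 1 MHz drive CPU; the helper names are invented for this illustration.

```python
# Documented 6502 instruction timings in cycles.
CYCLES = {"JSR": 6, "STA abs": 4, "RTS": 6}
DRIVE_CLOCK = 1_000_000  # 1050 CPU at 1 MHz


def jsr_cycles_per_bit():
    """Cycles for one 'JSR OUTPUT_BIT_x / STA PORTB_DATA / RTS' round trip."""
    return CYCLES["JSR"] + CYCLES["STA abs"] + CYCLES["RTS"]


def bps(cycles_per_bit):
    """Bit rate produced by a loop spending that many cycles per bit."""
    return DRIVE_CLOCK / cycles_per_bit


if __name__ == "__main__":
    print(jsr_cycles_per_bit(), bps(jsr_cycles_per_bit()), round(bps(11)))
```

At 16 cycles per bit the JSR scheme tops out around 62.5 Kbps, below the roughly 91 Kbps of an 11-cycle conventional loop.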
ivop Posted September 1, 2020

Here's one with a LUT of 2 kB. 8 cycles per bit, 12 cycles for the stop bit. Not tested, but here you go:

    ; eight tables of 256 bytes with proper bit masks
    ; each table must be page aligned (abs,x is 4 cycles)
    ; routine must be within one page (bpl/bne is 3 cycles)
    ; startbit and stopbit must not be on page zero

    send_data_128
        ldy #0              ; 2
    loop
        lda (data),y        ; 5
        tax                 ; 2 = 7 (+5 = 12, like one and half stopbit)
        lda startbit        ; 4
        sta dataport        ; 4 = 8
        lda bit7table,x     ; 4
        sta dataport        ; 4 = 8
        lda bit6table,x     ; 4
        sta dataport        ; 4 = 8
        lda bit5table,x     ; 4
        sta dataport        ; 4 = 8
        lda bit4table,x     ; 4
        sta dataport        ; 4 = 8
        lda bit3table,x     ; 4
        sta dataport        ; 4 = 8
        lda bit2table,x     ; 4
        sta dataport        ; 4 = 8
        lda bit1table,x     ; 4
        sta dataport        ; 4 = 8
        lda bit0table,x     ; 4
        sta dataport        ; 4 = 8
        lda stopbit         ; 4
        sta dataport        ; 4 = 8
        iny                 ; 2
        bpl loop            ; 3 = 5 (and routine must be in the same page)
    ;   bne loop            ; 3 this one for send_data_256
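The eight lookup tables this routine relies on could be generated along these lines. This is only a Python sketch of the table layout, not drive code: the mask values are hypothetical (the real ones depend on the other outputs sharing the RIOT port, as noted earlier in the thread), and the mapping of logic 1 to `value_for_1` is an assumption about polarity.

```python
def build_bit_tables(value_for_0, value_for_1):
    """Return eight 256-entry tables; tables[b][v] is the port value for bit b of v."""
    return [
        bytes(value_for_1 if (v >> bit) & 1 else value_for_0 for v in range(256))
        for bit in range(8)
    ]


def port_writes(byte, tables, startbit, stopbit):
    """Port values in the order the posted loop stores them: start, bit 7..0, stop."""
    data_bits = [tables[bit][byte] for bit in range(7, -1, -1)]
    return [startbit] + data_bits + [stopbit]


# Hypothetical masks: data line on bit 0, other port bits held at 0x30.
V0, V1 = 0x30, 0x31
tables = build_bit_tables(V0, V1)

if __name__ == "__main__":
    print(sum(len(t) for t in tables))                      # total table size
    print([hex(v) for v in port_writes(0xA5, tables, V0, V1)])
```

Eight tables of 256 bytes account for the 2 kB quoted in the post. Because every entry embeds the state of the other port bits, the tables would have to be regenerated whenever those bits change.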
ivop Posted September 1, 2020 (edited)

Number two. 11 cycles per bit. Although the 0 and 1 code paths are both 11 cycles, 0 is set one cycle earlier. Again, the stop bit is slightly longer, but that's actually being nice to Pokey. My sio2world code uses 8N2 for divisors 3 and lower, without snapping the caps in your Atari.

    ; buffer where data points to must be page aligned

    send_128_bytes
        ldx #value_for_1    ; 2 never change x
    loop
        ldy #0              ; 2
        lda (data),y        ; 5
        ldy #value_for_0    ; 2
    send_byte
        sty dataport        ; startbit 4
        NOP6                ; 6
    bit7
        rol                 ; 2
        bcc bit7clr         ; 2, 3 when taken
        sty dataport        ; 4
        bcc bit6            ; 3 (2+2+4+3) = 11
    bit7clr
        stx dataport        ; 4
        NOP2                ; 2 (2+3+4+2) = 11
    bit6
        rol
        ......
    bit0
        rol
        bcc bit0clr
        sty dataport
        bcc done
    bit0clr
        stx dataport
        nop
    done
        inc data            ; 5 we have to spend 5 cycles anyway
        stx dataport        ; stopbit 4 P unchanged
        bpl loop            ; 3
    ;   bne loop            ; for 256 bytes version
        rts

Edited September 1, 2020 by ivop
ijor Posted September 1, 2020

2 hours ago, ivop said:
> Here's one with a LUT of 2kB. 8 cycles per bit, 12 cycles for stopbit. Not tested, but here you go:

Great. Very clever, Ivop. But there are a few problems.

First, I'm not sure that divisor zero is reliable with open collector drivers. This has to be tested, and even then it might not be fully conclusive. Do any of the SIO2xxx devices work at divisor zero using open drain outputs?

The second issue is that at divisor zero you might need to stretch the start bit to compensate for the Pokey late capture. I need to do the math to be sure how long the start bit should ideally be, but probably a couple of extra cycles would be enough. Of course, ideally these extra cycles should be borrowed from the already longer stop bit.

Finally, your method requires the whole table to be rebuilt for every track, because the correct bitmasks depend on the state of the stepper. You might need a very fast update routine for this task as well.

2 hours ago, ivop said:
> My sio2world code uses 8N2 for divisors 3 and lower, without snapping the caps in your Atari.

That's very strange. I made tests at frequencies even higher than divisor zero (using synchronous mode) and didn't find that a longer stop bit helped much. The two critical issues were, again, not using open drain and stretching the start bit. What were the exact frequencies you tried? If the device bit rate was slightly faster, it is possible that the extra stop bit compensated for the Pokey late capture issue. But I'm not sure how the caps would make much of a difference at divisor zero.
StickJock Posted September 1, 2020

2 hours ago, ivop said:
>     bit7
>         rol                 ; 2
>         bcc bit7clr         ; 2, 3 when taken
>         sty dataport        ; 4
>         bcc bit6            ; 3 (2+2+4+3) = 11
>     bit7clr
>         stx dataport        ; 4
>         NOP2                ; 2 (2+3+4+2) = 11
>     bit6

I think that your second bcc to skip the other polarity (bcc bit6) should be a bcs.
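The point about the second branch can be traced with a tiny Python model of one bit slot. It only models control flow: which register gets stored to the port, given the carry flag after `rol`. The function name and the "bcc"/"bcs" switch are invented for this illustration.

```python
def writes_for_bit(carry, second_branch):
    """Trace which registers are stored to the port during one bit slot.

    carry: carry flag after ROL (the bit just shifted out).
    second_branch: "bcc" (as posted) or "bcs" (the suggested fix).
    """
    writes = []
    if carry:                        # first BCC is not taken
        writes.append("Y")           # sty dataport; stores leave flags unchanged
        taken = (not carry) if second_branch == "bcc" else carry
        if taken:
            return writes            # branch skips over the bitNclr path
    writes.append("X")               # bitNclr: stx dataport
    return writes


if __name__ == "__main__":
    for branch in ("bcc", "bcs"):
        print(branch, writes_for_bit(True, branch), writes_for_bit(False, branch))
```

With `bcc`, a set carry falls through into the `bitNclr` path and the port is written twice in the same bit slot. With `bcs`, each slot gets exactly one store.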