Propeller Update

potatohead


After an extended time off, I've started to tinker with the Propeller again. Another user managed to do what I was trying to do; namely, build an NTSC video driver that's got stable color clocks on every scan line. Now it's possible to emulate the look of nearly any classic machine on this chip! There is enough control over the signal to essentially output any kind of NTSC / PAL signal you want. IMHO, this is really cool --and what I was hoping was doable.

 

I'll have to get a camera, or something for screenies... perhaps this weekend!

 

I was just reading the Closed Captioning thread. This is completely doable on the Propeller. It's got full control of the entire video signal. I'll have to go find a sample waveform to encode...

 

To get warmed up, I modified and built on that driver to output 80x192x8. This is a lot like the GTIA mode on the Atari, only with more color bits. I did this to check out the color possibilities, with nice big, easily debugged pixels. Turns out about 96 different colors are possible: 16 distinct hues per pixel, times the 6 intensity levels the chip provides. It's quite possible to use two of the video generators together to get more, but that's gonna be outside the scope for now. Sub-pixel artifacting, which I've not really tried yet, is another route. Either approach should yield plenty of additional color and intensity levels beyond the 96 done by the hardware.
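
As a rough sketch of where the 96 comes from, the Python below enumerates color bytes under an assumed layout: hue in the upper four bits, a chroma-enable flag in bit 3, and intensity in the low three bits, with only intensities 2 through 7 counted as visible. That layout is inferred from the demo code further down, not quoted from the chip documentation, and the function names are made up for illustration.

```python
# Assumed color-byte layout (inferred from the demo, not authoritative):
#   bits 7..4 = hue, bit 3 = chroma enable, bits 2..0 = intensity
VISIBLE_INTENSITIES = range(2, 8)   # %010 (black) .. %111 (white); 0 and 1 are sync levels


def color_byte(hue, intensity, chroma=True):
    """Pack one pixel byte from hue (0-15) and intensity (0-7)."""
    return (hue << 4) | (0x08 if chroma else 0) | intensity


def visible_colors():
    """All hue/intensity combinations at or above black."""
    return {color_byte(h, i) for h in range(16) for i in VISIBLE_INTENSITIES}


if __name__ == "__main__":
    print(len(visible_colors()))   # 16 hues x 6 intensities
```

Under the same assumption, the border color 251 used by the demo would be hue 15 at intensity 3.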

 

As things stand right now, I've got a driver that outputs one pixel per NTSC color clock, non interlaced. This is exactly what the older Atari hardware does. If the chip had more intensity levels, it would all map out nicely, but it doesn't. It's totally possible to overdrive things to sub-color clock levels and get more direct control of what happens in a given pixel. I'm gonna save that for later.

 

Onboard RAM, being limited to 32K, does limit the color screens. 160x192x8 takes up 30K of RAM! Eek. That's just enough to build a color demo or two, and that's it. Maybe, just maybe, a small game in assembler. Breakout or something... On the plus side, the bitmap is byte addressable: no color limitations, just a nice high-color bitmap.

 

It appears there is plenty of time between pixels for all sorts of tricks. Pixel packing to generate 32 or 64 color displays is doable, and that would make for a screen of roughly 20K at classic game resolutions. Now there is room to actually do stuff.
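
To make the memory numbers concrete, here is a small Python sketch of the bitmap sizes. The 5 and 6 bits per pixel figures are my reading of "32 or 64 colors" with tightly packed pixels; a real packing scheme may well waste a few bits.

```python
HUB_RAM = 32 * 1024   # on-chip RAM in bytes


def framebuffer_bytes(width, height, bits_per_pixel):
    """Bytes needed for a tightly packed bitmap, rounded up to a whole byte."""
    return (width * height * bits_per_pixel + 7) // 8


if __name__ == "__main__":
    for w, h, bpp in [(80, 192, 8), (160, 192, 8), (160, 192, 6), (160, 192, 5)]:
        size = framebuffer_bytes(w, h, bpp)
        print(f"{w}x{h}x{bpp}: {size} bytes, {HUB_RAM - size} left of {HUB_RAM}")
```

The 160x192 byte-per-pixel case leaves only 2K for everything else, while the packed cases free up 9K to 13K.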

 

Generating video on the fly will do better, or one can add external RAM. Andre, the designer of the HYDRA game system, which uses the chip, has done the latter, giving 64K of random access memory, with 500K or so of sequential access above that. More than plenty for high-color graphics in full on bitmap mode. I'll be playing with that later...

 

For now, the next task is to build an emulation of some of the better Atari modes with sprites. I've a coupla choices in this. Somebody already worked out how to connect Atari joysticks to the VGA port for quick and dirty control interfaces on the Demo Board. On that note, it's a bummer, sort of. The HYDRA game machine has Nintendo style ports, and code to read the controllers. This is actually very cool in that modern controllers all function in a similar fashion. The Demo Board lacks external storage, which essentially limits you to the 32K on chip unless you build something yourself, and it has no game inputs, but it does have mouse, keyboard, VGA, etc... in common with the HYDRA. If one is using a TV for the display, then that VGA port can easily become a controller port instead! Either system can do this, because all the I/O is bidirectional.

 

Time to make some cables and hook up my Atari stuff! Paddles are gonna be an annoyance, as a capacitor, charge timing, etc... are gonna be necessary. All possible, but extra work right now. Joysticks, driving controllers, etc... are gonna be just fine, so I'll start with those.

 

The plan: have one COG (that's one of the 8 CPUs running on this little bugger) build the screen display one scanline at a time, in high color mode. It works from a buffer that's built by one or two more of the COGs. Those will be building sprite graphics, read from memory and drawn into the buffer on the fly. Graphics done this way only need the storage necessary to define the images, with no full screen bitmap required for display. It's like a programmable TIA, essentially. Once the sprite engine is up and running, it will appear as just video hardware to the real game program running on one or more of the other CPUs.
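
Here is a minimal Python model of that scanline-buffer idea, just to show that only the sprite images need storing. Everything in it (the dict layout, `render_scanline`, treating 0 as transparent) is a made-up illustration, not the driver's actual data format.

```python
LINE_WIDTH = 80
BLACK = 2          # the demo's black value; 0 would be a sync level


def render_scanline(row, sprites, width=LINE_WIDTH):
    """Build one scanline buffer from sprite definitions only.

    Each sprite is a dict with 'x', 'y' (top-left corner) and 'rows',
    a list of lists of color bytes; a 0 inside the image is transparent.
    """
    line = [BLACK] * width
    for spr in sprites:
        offset = row - spr["y"]
        if 0 <= offset < len(spr["rows"]):
            for i, c in enumerate(spr["rows"][offset]):
                x = spr["x"] + i
                if c != 0 and 0 <= x < width:
                    line[x] = c
    return line


if __name__ == "__main__":
    ball = {"x": 3, "y": 10, "rows": [[251, 0, 251]]}
    print(render_scanline(10, [ball])[:8])
```

On the chip, the renderer and the video COG would each run a loop like this in parallel, one line apart.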

 

I'm finding the Atari style Player/Missile graphics an interesting design option. The existing sprite engines for the chip all use your typical rectangular sprite definitions. Drawing these into the buffer takes time, more time than would be required if all the sprites were screen height! With screen-height sprites, vertical movement means actually moving data, while horizontal movement just means changing a register. Maybe add an origin register so we get to say where a particular bit of data ends up on the screen. This might make packing things in memory easier, and vertical movement easier, as long as the sprite is not reused vertically the way most every game on the 8-bitters did.

 

I'm not sure how many will be possible. I strongly suspect this number is high, given the 8 CPU's to work with. It will be possible to have them be more than one color for sure. I don't think I'm there yet where emulating hardware collisions is concerned, either. No biggie, there is plenty of speed and ram to check these things in the usual non-hardware ways.

 

That's it for now. By way of reference, here is the actual 80x192x8 driver code, and the little quick and dirty color demo that went with it.

 

********** Color Demo code first ***************

{ ********  80 x 192 High Color 1 byte / Pixel display Demo ***********
 *  This demo writes all the useful color values to the screen	   *
 *  Linear addressing, no tiles									  *
 *  Derived from CardBoardGuru's Simple NTSC display example		 *
 *  Written for HYDRA												*
 *********************************************************************
}

CON
 ' Set up the processor clock in the standard way for 80MHz
 _CLKMODE = xtal1 + pll8x
 _XINFREQ = 10_000_000 + 0000

VAR
 byte		  displayb[15360]   'allocate display buffer in RAM
							  '80 x 192 x 1 byte / pixel
 long		  index			 'temp offset for byte statement below

OBJ
 tv : "80x192_NTSC"  'the TV driver, for this example running at 80x192
 
PUB start | j, c, k, o

{
fill bitmap with NTSC black color.  A zero is below black and will hose the display.
Perhaps it's not a bad idea to have the TV driver watch for this condition...
8 bits / pixel appears to be a waste in that only 120 distinct pixel conditions
 result.  Of these, perhaps 90 or so are really useful.

 Of the 8 bits, the low three deal with intensity:
   Black    = %010
   01 Grey  = %011
   02 Grey  = %100
   03 Grey  = %101
   04 Grey  = %110
   White    = %111

 Useful color exists in the remaining bits as shown by this program.  The first row
 of darker colors is questionable.  On my better TV, it works.  On lesser ones, it
 doesn't... 

}

  'fill bitmap with black pixels, before triggering display
  'the buffer holds 15360 bytes, so the last valid index is 15359
repeat index from 0 to 15359
  byte [@displayb] [index] := 2  'another way to point at HUB memory...
	
tv.start(@displayb)  'start the tv cog & pass it the address of the bitmap

  'draw a border around the visible graphics screen (80 x 192)
repeat j from 0 to 79
 plot(j,0,251)
 plot(j,191,251)
repeat j from 0 to 191
 plot(0,j,251)
 plot(79,j,251) 

'draw 6 intensities possible  (black is one)
repeat o from 15 to 55 step 10
  repeat j from o to o+9
    repeat k from 160 to 180
      plot(j,k,o/10+2)

'draw useful sets of colors possible
repeat k from 8 to 15
  if k == 9
    k := 10
  c := k
  repeat o from 3 to 77 step 5
    c := c + 16
    repeat j from 20 to 30
      plot(o,j+16*(k-8),c)
      plot(o+1,j+16*(k-8),c)

	
pub plot(x,y,c)
 'very simple dot plotter
 'one byte per pixel is sweet!  
 displayb[y*80+x] := c

 

And the driver... actually, it's full of commentary, written by me and the other guy! (Attached instead. Actually, I don't know how to attach after the AA upgrade. No biggie...)

 

Just for fun, this is all it really takes to encode the bitmap display, once the framework is all up and running:

 

 

                mov     VSCL, CALC_user_data_VSCL

                mov     r1, #20             '80 pixels horizontal resolution is 20 waitvids
:draw_pixels    rdlong  B, A                'get four pixels from hub
                waitvid B, #%%3210          'draw them to screen
                add     A, #4               'point to next pixel group
                djnz    r1, #:draw_pixels   'line done? Move to sync...

 

Just one tight loop, grabbing four bytes at a time, writing them to the screen, etc... BTW, the entire signal is encoded as colors. This is why there are only 6 intensity levels, when three bits (8 values) are actually defined for this. The other two values are below black, for sync. These produce interesting results when present on screen, in the display graphics area! Not all TVs will cope with this either... That's why the demo program above fills the display memory with black (value 2) rather than zero before starting the display. I'll have the driver itself handle this going forward, so zero will actually be black and not sync!
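
One possible way for a driver to keep sync levels out of the picture area is to clamp the intensity bits, sketched here in Python. This is only a guess at an approach, assuming intensity lives in the low three bits as described in the demo's comments; `safe_pixel` is a hypothetical name.

```python
BLACK_LEVEL = 0b010   # lowest visible intensity per the demo's comment


def safe_pixel(value):
    """Raise the low three (intensity) bits of a color byte to at least black."""
    luma = value & 0b111
    if luma < BLACK_LEVEL:
        value = (value & 0xF8) | BLACK_LEVEL
    return value


if __name__ == "__main__":
    print([safe_pixel(v) for v in (0, 1, 2, 251)])
```

The trade-off is that the "black with color" values mentioned in the comments below would be nudged up to normal black intensity.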

 

One other thing... I saw the great little project to make a 7800 adapter for the 5200 machine. Didn't know that one had a video input! I'm assuming this is a pass thru kind of thing. If so, one of these little chips could be on a cart, and present itself as ROM, maybe co-exist with the ROM, and output a video signal to be used in lieu of the 5200 one!

 

blogentry-4836-1178779668_thumb.jpg blogentry-4836-1178779651_thumb.jpg

 

These screens were taken with the emulator. I don't have a good video capture system right now, so this will have to do! The emulator works from a compiled binary from the Propeller IDE. See comment thread for more...



19 Comments



Interesting. How much control do you have over the color amplitude? The main problem with the Atari colors, versus the NES, was that the Atari only had a single amplitude, which was used for both burst and active video. This made the colors relatively dark. I think the NES also clicked up the base luma when not in greyscale.

 

Would the Propeller be capable of A7800 style display list processing? Have two COGs working as the GPU, one reading the display list and writing to a line RAM buffer, the other reading from a second line RAM buffer kicking out the video.

 

An A7800 display list goes like this: a register points to a series of display list list (DLL) entries. Each DLL entry has a pointer to a display list and a byte with some flags and the number of lines the display list is used for. The display list is made up of a series of display list entries. Each entry has a pointer to the sprite data or a tile list, the number of bytes in the sprite or the tile list, and a horizontal position.
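
The structure described above can be sketched as a couple of Python dataclasses. The field names are invented for illustration and a Python list stands in for the pointer; the real bit layouts are not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DisplayListEntry:
    data_addr: int     # pointer to sprite data or a tile list
    byte_count: int    # number of bytes in the sprite or tile list
    xpos: int          # horizontal position


@dataclass
class DLLEntry:
    display_list: List[DisplayListEntry]   # stands in for the pointer
    flags: int
    line_count: int    # number of scanlines this display list is used for


def display_list_for_line(dll, line):
    """Walk the DLL and return the display list covering a scanline."""
    first = 0
    for zone in dll:
        if first <= line < first + zone.line_count:
            return zone.display_list
        first += zone.line_count
    return []
```

The point of the two-level list is that finding what is on a line costs one walk over a handful of zones, not a scan of every sprite.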


 

As things stand right now, the Propeller video generators (8 of them, all identical) output pure luma when no color is specified, and luma plus chroma when color is specified. Colors range from dark to very bright. I think the chroma is a function of the luma. Let's put it this way: I can ask for black pixels that have colors! I found this interesting, and it is the source of the comment above in the code I posted. On some televisions, these colors do not display well.

 

The video hardware essentially has 16 sub-pixel clocks that translate into chroma when combined with the color burst. Their intensity, again, I believe is keyed to the actual luma being generated at the time.

 

As far as the hardware goes, it's really simple. Basically, you've got a serializer that takes color and pixel data, four pixels at a time, with each one being one of the available hardware generated colors. Or, you can have 2 bit color and feed it 16 pixels at a time. That's the norm for the chip.
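
A tiny Python model of the four-pixels-at-a-time case may help here. It assumes, as I understand the chip docs, that the pixel selectors are consumed two bits at a time from the low end and that each one picks a byte out of the color long; `waitvid_4color` and `quat` are names made up for this sketch.

```python
def waitvid_4color(colors, pixels, count=4):
    """Model of one waitvid with four 8-bit colors.

    colors: 32-bit long holding four color bytes (byte 0 is the low byte).
    pixels: packed 2-bit selectors, consumed from the low bits upward.
    Returns the color bytes in the order they would be serialized.
    """
    out = []
    for _ in range(count):
        sel = pixels & 0b11
        out.append((colors >> (8 * sel)) & 0xFF)
        pixels >>= 2
    return out


def quat(digits):
    """Parse a base-4 digit string such as '3210' (the %%3210 notation)."""
    return int(digits, 4)


if __name__ == "__main__":
    print([hex(c) for c in waitvid_4color(0x44332211, quat("3210"))])
```

With `%%3210` the bytes of the long come out in memory order, which is why a plain byte-per-pixel bitmap can be fed straight through.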

 

In addition to that, there is a scale register and a PLL counter that together define how long a pixel clock is. This can be synced to the NTSC color burst for pixel perfect graphics, and that's what this driver code does. I was not able to get it, but another user did. (fine by me) Additionally, it's running in the 4 pixels at a time mode, so there are 8 bits per pixel color. Requires more CPU time, but there are 8 of them, so no biggie, as long as it all keeps up.

 

There are no interrupts on the thing; all timing is deterministic. Nearly all instructions take 4 cycles. The exceptions are conditional jumps such as DJNZ, which take 8 when they fall through instead of jumping, and HUB memory accesses, which take up to 20 or so, depending on where they land in the round robin memory access scheme.

 

The pixel clocks can be quite fast, with 1280 x 1024 VGA being possible! Overclocking the NTSC pixels, to essentially pack more than one pixel into an NTSC color clock is gonna be possible, but I'm not going to bother for now.

 

As for the display lists, this is a very interesting question. I'm quite sure the answer is yes.

 

I've been writing a 16 color display system that's modeled after the Atari 8 bit way of doing things. So, it's gonna have 16 color registers, bitmap modes and sprites --or some combination of those.

 

There are three core approaches:

 

1. Have one COG output a bitmap, then have other COGs draw sprites into it, in the usual way one would see on the Apple ][, for example. This is quick and easy, but memory intensive, and has color limitations because a full color bitmap consumes all the RAM. (Sounds familiar, huh?) Engines of this kind therefore use tiled displays where 16x16 pixel regions are 4 colors each. Sprites take on the color of the region they are in.

 

2. Have one COG output the core NTSC sync, etc... but have it render from a scanline buffer that is built by one or more other COGs. It's possible to have flags that tell the group of CPUs when VBLANK, etc... is happening, so things can be frame locked, if you want, or not. This is the approach I'm currently taking. Run one COG in high color mode, then have the others, however many it takes, read color, bitmap, sprite data, etc... from memory. In this fashion, it can all act like Antic + GTIA (sort of).

 

And that's why I bought one of the darn things. Wanted to be able to have a software video environment to recreate nice frame locked displays, like the 2600 does and the 8 bitters do.

 

3. Have more than one COG output video, on the same pins, at the same time. This one has proven difficult, but I don't think it's impossible. If the COGs are synced, then essentially you've got video on layers that end up being OR'ed together. This is gonna yield more colors, as would sub NTSC color clocking.

 

All the COGs have their own video generators. They are independent, and can output to their own pins, or the same pins, etc... One could drive a TV, VGA and another coupla TVs if that was the goal. Not sure where the RAM would come from, but the chip is happy to output whatever it's asked to.

 

Getting display list style sprites would involve building an engine to feed the main COG the scanline information, which then ends up on screen. Said engine could run on multiple additional COGs, depending on the demand. Lots of options there that currently lie beyond my skills.

 

It took a lot to grok just how to make the signal --and how to program in a parallel fashion. Picking up speed now though. I think the curve is very similar to that I've seen many here experience when working on a new classic machine.


Just re-read your display list comment again. One problem I'm having with a flat lookup table for which sprite ends up on the scan line, is simple speed. Takes a while to determine what actually is on the line...

 

Might be a better solution, in the end. Likely to use less cogs, or provide more sprite options.


Ahh, now I understand the video generation a little better. There's no separate chroma signal, it's just generating a signal with a bandwidth greater than 3.58MHz. How fast does the rdlong/waitvid loop run? (i.e. 4 pixels in x cycles at y MHz) (Or how many bytes per line?)

 

Yep, the A7800 significantly decreases the typical sprite table issue of determining which sprites are on the line. One other advantage of the display list is the display list read cog could translate a single sprite byte into multiple line RAM bytes which the output COG would then read. This would simplify color translation. There's also the opportunity to add transparency.


 

Hmmm...

 

Take a look at the schematic here: http://www.parallax.com/dl/docs/prod/prop/PropDemoDschem.pdf

 

Pins 12-15 handle video output. Did some looking at the stock TV driver. Seems that chroma can be output stand alone, or mixed in with the luma. So, it will do an S-video signal, if that's the desired effect.

 

The behavior I noted was with the output circuit shown. In that configuration, chroma and luma are added together. I've got one of these boards, and the HYDRA game machine. Both are really just support circuits for the Prop, but the HYDRA does have a cart / card slot for expansion, tinkering, etc... The Demo Board has a small breadboard for the same thing. Tinkering with different video output would just involve a few resistors, a cap or two and another RCA jack plugged into the breadboard.

 

Really, one could connect any kind of video, it seems. The circuit shown is the current standard, however.


Each of the 8 CPUs runs at 80 MHz, clocked together. Instructions are 4 cycles each, for 20 MIPS.
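
For a feel of how much room the rdlong/waitvid loop has, here is a back-of-the-envelope Python sketch. It assumes the standard NTSC color burst of 3.579545 MHz and one pixel per color clock; the function names are mine.

```python
CLOCK_HZ = 80_000_000
COLOR_BURST_HZ = 3_579_545      # NTSC color subcarrier
CYCLES_PER_INSTRUCTION = 4


def mips(clock_hz=CLOCK_HZ):
    """Millions of ordinary 4-cycle instructions per second."""
    return clock_hz / CYCLES_PER_INSTRUCTION / 1e6


def cycles_per_waitvid(pixels_per_waitvid, pixels_per_color_clock=1, clock_hz=CLOCK_HZ):
    """CPU cycles available between successive waitvids."""
    color_clocks = pixels_per_waitvid / pixels_per_color_clock
    return clock_hz * color_clocks / COLOR_BURST_HZ


if __name__ == "__main__":
    print(mips(), round(cycles_per_waitvid(4), 1))
```

At four pixels per waitvid that is roughly 89 cycles per pass through the loop, so a hub read plus a few ordinary instructions fits with time to spare.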

 

Bytes per line is completely arbitrary. Well, mostly. Depends on how you want to do pixels. The waitvid instruction runs for either 4, or 16 or 32 pixels. 8 bit color, 2 bit color, or one bit.

 

The driver I'm building off of works by setting the colors to be at SYNC levels, and the scales to be equal to the time required for them. One waitvid then draws the HSYNC, etc... when necessary. Right as the user graphics area happens, the scale is changed to pixel size and then my bitmap code, in this case, just starts grabbing bytes, and feeding them to waitvids for that scanline. Right now, it's running at one byte per pixel, for a total of 160, or 320 interlaced.

 

Sprite engine code, and other video code, would consume less, depending on how long unpacking pixels takes, the number of pixels, colors, etc...


Yeah, I've been looking at the docs. Definitely interesting. You could easily do 3.58MHz (160 res) graphics with 4 colors (phase 0,90,180,270) + B&W and a huge number of shades via a single cog output driver. You might even be able to do 4.77MHz (240 res) graphics on a single cog.

 

set VCFG so CMode = 1 (2 bpp)
set VSCL so pixel clock = 14.31818MHz (4 * color burst) & frame length = 3

                mov     r1, #60             '240 pixels horizontal resolution
:draw_pixels    rdlong  B, A                'get pixel (4 samples) from hub
                waitvid B, #%%3210          'draw them to screen
                add     A, #4               'point to next pixel group
                rdlong  B, A                'get pixel (4 samples) from hub
                waitvid B, #%%2103          'draw them to screen
                add     A, #4               'point to next pixel group
                rdlong  B, A                'get pixel (4 samples) from hub
                waitvid B, #%%1032          'draw them to screen
                add     A, #4               'point to next pixel group
                rdlong  B, A                'get pixel (4 samples) from hub
                waitvid B, #%%0321          'draw them to screen
                add     A, #4               'point to next pixel group
                djnz    r1, #:draw_pixels   'line done? Move to sync...


 

That's an interesting idea for sure. (Likely to work too.) Tempted to try doing something like that next time I get some time on the chip. BTW, there is an emulator, Gear, that will run code and display (most) TV output. It's slow, but effective for tinkering with code ideas. I've attached a coupla screenies from that. The IDE runs on win32, is free to use, Propeller or not, and can be found at the Parallax site. Gear is OSS, runs on .Net and is also free. I'm using the emulator to work on stuff when I can't get time in front of the chip and a television...

 

http://sourceforge.net/projects/gear-emu

 

I think what I am going to do next is generate video on the fly. That's a kernel, essentially, built for the game graphics I want to display. This will feed into the scanline buffer and will be a nice run through some of the tougher elements. This chip is both very easy and difficult at the same time. The assembly language, for example, has no registers. It's a memory to memory design, with conditional instructions. So, the A, B above simply reference two of the 512 cog longs, and I treat them as registers. That's cool in that any of the 512 longs in the COG can be instructions or data. Everything is a long in the COG as well. (No byte level access to COG memory. It's a long or nothing.) Self modifying code is the norm for loops and indexing. No biggie there, I've done that on the 8 bit chips before. One nice element I'm still working through is that writing results, setting flags and executing instructions are all conditional things. Do an AND instruction, for example, have it set the flags but not write the result, then have the next instruction execute or not, based on the flag set by the AND. Lots of things happen in just two instructions, where short strings of variable length instructions are required on most of the CPUs I know how to program.

 

Interacting with the HUB is done with pointers from COG memory, as is passing parameters from the higher level, on chip, SPIN interpreter. Knowing where things are, what is running on what COG, and when it's actually doing it is the trick. So far, it's been an exercise in timing, using flags or just knowing how long things take to happen, to know when a passed value is OK to use. That round robin memory access timing also requires one to interleave tasks to make best use of HUB access windows.

 

On this first pass to actually get something playable done, the plan is to program one COG to interpret memory to generate on-screen graphics (a sprite or two), another to generate the actual video, another to capture user input from a joystick, etc..., and another for sound. So 4 or 5 COGs define the environment such that the actual game can be written in SPIN, which runs from an interpreter on one COG. All it really should be doing is manipulating memory locations, branching, etc...

 

(how does one do attachments now??)

 

Looks like they go in blog entries... so they are in the main post.

So, the A, B above simply reference two of the 512 cog longs, and I treat them as registers. That's cool in that any of the 512 longs in the COG can be instructions or data.

 

Ahh, so it's like the 8051, which I always thought of as having zero page RAM instead of registers. I was just basing my code on the sample you posted, trying to show how to turn four [Y+U,Y+V,Y-U,Y-V] 14.31818MHz samples into a 4.77MHz pixel.

 

Hmm... I wonder if you could set up two cogs with the same code. The code would have two phases: phase 1 would read the display list & sprite data from main memory and save the result to line RAM in cog memory. Second phase would bash that line RAM out to waitvid (along with generating sync).

 

And although I said you'd only be able to generate 4 colors, rethinking this I'm fairly certain that you'd be able to generate the entire NTSC color gamut. I was thinking that you'd need more samples to generate an accurate color phase signal, but the TV is really just going to run the signal through a bandpass filter then two de-modulators to extract the U & V. And since you control the U & V exactly, it should be possible to generate any color.
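
The claim that controlling U and V exactly gives any color can be checked with a toy Python model. It treats the TV as an ideal demodulator that sees exactly four quadrature samples per color clock, which is a big simplification of a real bandpass filter and decoder.

```python
def encode_samples(y, u, v):
    """Four samples across one color clock: Y+U, Y+V, Y-U, Y-V."""
    return [y + u, y + v, y - u, y - v]


def decode_samples(s):
    """Recover Y, U, V the way an idealized demodulator would."""
    y = sum(s) / 4
    u = (s[0] - s[2]) / 2
    v = (s[1] - s[3]) / 2
    return y, u, v


if __name__ == "__main__":
    print(decode_samples(encode_samples(0.5, 0.1, -0.2)))
```

In this idealized picture any (Y, U, V) triple survives the round trip, so the limit on colors would come from how finely the output levels can be set, not from the number of samples.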


Start of the output code...

d4	LONG	4 << 9			' adds 4 to a destination field (bits 9..17)
pixcnt	LONG	0
LINERAM	RES	256

	MOV	pixcnt, #240/4
	MOVD	pixel0, #LINERAM
	MOVD	pixel1, #LINERAM+1
	MOVD	pixel2, #LINERAM+2
	MOVD	pixel3, #LINERAM+3
pixel0	WAITVID	LINERAM, #%%3210
	ADD	pixel0, d4
pixel1	WAITVID	LINERAM+1, #%%2103
	ADD	pixel1, d4
pixel2	WAITVID	LINERAM+2, #%%1032
	ADD	pixel2, d4
pixel3	WAITVID	LINERAM+3, #%%0321
	ADD	pixel3, d4
	DJNZ	pixcnt, #pixel0

I'll see if I can do some display list pseudo code.


I've been doing some thinking, so now it's time to write it down.

 

For the output logic, the system needs to be running at 62MHz or better (preferably some multiple of 4*colorburst). That gives 13 CPU cycles per pixel (or 2 normal instructions + WAITVID). That should be enough to handle the output logic. That also gives 3943 CPU cycles per line for the input logic (or max 246 hub accesses per line)
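
Those figures can be reproduced with a few lines of Python. The sketch assumes a standard NTSC line of 227.5 color clocks and one hub access window per cog every 16 clocks; both numbers are my assumptions, not stated in the comment.

```python
COLOR_BURST_HZ = 3_579_545
COLOR_CLOCKS_PER_LINE = 227.5       # assumed standard NTSC line length
HUB_WINDOW_CYCLES = 16              # assumed: one hub slot per cog every 16 clocks


def budget(cycles_per_pixel=13, samples_per_pixel=3):
    """Return (system clock in Hz, CPU cycles per line, hub accesses per line)."""
    sample_hz = 4 * COLOR_BURST_HZ                 # 14.31818 MHz
    pixel_hz = sample_hz / samples_per_pixel       # about 4.77 MHz
    clock_hz = cycles_per_pixel * pixel_hz         # about 62 MHz
    cycles_per_line = clock_hz * COLOR_CLOCKS_PER_LINE / COLOR_BURST_HZ
    hub_accesses = int(cycles_per_line // HUB_WINDOW_CYCLES)
    return clock_hz, cycles_per_line, hub_accesses


if __name__ == "__main__":
    clock, per_line, hub = budget()
    print(round(clock / 1e6, 2), int(per_line), hub)
```

Thirteen cycles per 4.77MHz pixel lands almost exactly on the 62MHz, 3943 cycle and 246 access numbers quoted above.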

 

Okay, then we need the input logic. I did think some about display lists, but I ran into two issues. First is because there are two cogs running the display code, one doing the odd lines and one doing the even lines, handling the # of lines per zone gets tricky. The second issue is I'm not sure the added complexity of having separate display lists buys you anything.

 

Consider instead a basic sprite table in main RAM. Each entry would be a single 32-bit long containing a 16 bit address pointer to the sprite graphics and two bytes for horizontal and vertical position. Each sprite would be a fixed size (i.e. 8x8) and the graphics would include transparency. A zero entry would indicate the entry is unused and the code could skip to the next entry. So the code kinda looks like:

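.1	RDLONG sprdata, sprptr wz	' read table entry, Z set if it is zero
	ADD sprptr, #4			' advance to the next entry (flags untouched)
 IF_Z	JMP #.1				' unused entry, skip it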

 

Okay, then the code needs to figure out whether the sprite is on the current line.

	MOV sprYoff, sprdata
AND sprYoff, #$0FF
SUB sprYoff, curYpos
AND sprYoff, #$1F8 wz,nr
 IF_NZ JMP #.1
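
For anyone following along without a Propeller handy, the same table entry and line test can be modeled in Python. The layout used here (address in the top 16 bits, X in bits 8-15, Y in the low byte) is what the shifts and masks in these snippets imply; the function names are invented.

```python
def pack_entry(addr, xpos, ypos):
    """Build a 32-bit sprite table entry: addr[16]:xpos[8]:ypos[8]."""
    return ((addr & 0xFFFF) << 16) | ((xpos & 0xFF) << 8) | (ypos & 0xFF)


def unpack_entry(entry):
    """Split a 32-bit table entry into (address, xpos, ypos)."""
    return (entry >> 16) & 0xFFFF, (entry >> 8) & 0xFF, entry & 0xFF


def row_offset(entry, cur_ypos):
    """Mirror the SUB / AND #$1F8 test.

    Returns 0-7 if the difference between the sprite's Y and the current
    row falls inside the 8-line sprite, else None. The subtraction wraps
    to 32 bits as it would on the cog.
    """
    _, _, ypos = unpack_entry(entry)
    diff = (ypos - cur_ypos) & 0xFFFFFFFF
    return diff if (diff & 0x1F8) == 0 else None


if __name__ == "__main__":
    e = pack_entry(0x1234, 40, 100)
    print([row_offset(e, row) for row in (92, 93, 100, 101)])
```

The single mask works because any difference outside 0-7, negative ones included, has at least one of bits 3 through 8 set.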

 

Then it's just a matter of translating the address+Yoffset into RDLONGs, and then translating those bytes read into line RAM longs.

	MOV	sprtemp, sprdata
SHR	sprtemp, #16
SHL	sprYoff, #3
ADD	sprYoff, sprtemp
RDLONG	sprtemp, sprYoff
ADD	sprYoff, #4
RDLONG	sprYoff, sprYoff

Each 8 pixel wide sprite will therefore need 3 RDLONGs, for a max of 82 sprites @ 62MHz (probably less than 64, since it will take more than 2 instructions to decode the words into bytes).


Thanks for thinking that through.

 

I did the same this weekend and arrived at a similar conclusion. There is enough horsepower to do sprites full on, without the display lists. Given that, it's probably the more direct way to go.

 

I'm out on business travel right now. Brought the Prop with me. Damn TV has no composite input, so I'm gonna have to spend tonight adding in the broadcast functionality to the driver I want to use before progressing on this. The good thing is that it is possible, however!

 

Edit: This worked, but not anywhere near as well with HYDRA as it did with the demo board. So, I'll be looking for an inexpensive RF modulator... :)


More thinky stuff:

 

One cycle chewer is going to be the byte->long color translation. Something about only 512 longs of cog RAM to squeeze the 256 byte line RAM not leaving much space for a 256 byte CLUT :-)

 

I had thought about how best to do a background. NES style tiles are relatively storage efficient, but need multiple RDLONGs. A simple static background would only need 32 RDLONGs per line, but would require 64K of main RAM for the 256x256 pixelmap. So for now it's foreground sprites only.

 

Okay, back to the CLUTs. There's two ways of doing it. One is to store the bytes in the line RAM and do the translation later. Second is to translate the bytes before storing them to line RAM. I was leaning to the former, but I just realized that since there's no background, that 90% of the screen will need to have the long=black code. So any savings realized by not translating overlapping sprites will be lost to translating the black background (while initializing the background to the fixed value is no cost since it has to be done anyway).

 

Each byte has three possible codings:

$00 is transparent

$01-$0F is greyscale (15 values - black to white)

$10-$FF is color, with the first nibble being saturation/value & the second a color index
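
The three codings can be written out as a small Python classifier, treating the high four bits as saturation/value and the low four as the color index. This is only a sketch of the decode rule; the tuple format it returns is made up.

```python
def classify(pixel):
    """Return ('transparent',), ('grey', level) or ('color', high, index)."""
    pixel &= 0xFF
    if pixel == 0x00:
        return ("transparent",)
    if (pixel & 0xF0) == 0:
        return ("grey", pixel)                     # $01-$0F
    return ("color", pixel >> 4, pixel & 0x0F)     # $10-$FF


if __name__ == "__main__":
    for value in (0x00, 0x0F, 0x10, 0xFF):
        print(hex(value), classify(value))
```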

 

Output would be 5 bits: 80IRE, 40IRE, 20IRE, 10IRE and 5IRE although this may be the middle bits of the output so scaling doesn't overflow between subpixels.


I love/hate when I get caught into an "ooh, shiney" moment. I know I should be focusing my attention on other things, but I can't resist playing with something new.

 

Anyway, I've coded up 90+% of the sprite routine:

 

	ORG	$000
LineRAM	RES	256

sprtable	LONG	$xxxx			' address of sprite table in main memory
sprptr	RES	1			' pointer to sprite table (main memory)
sprdata	RES	1			' sprite table entry / pointer to lineRAM (xPos)
sprtemp	RES	1			' Ypos, temp for CLUT
sprbyte	RES	1			' current pixel
sprgfx0	RES	1			' right 4 pixels
sprgfx1	RES	1			' left 4 pixels
spraddr	EQU	sprgfx1			' pointer to sprite graphics (main memory)
counter	RES	1			' counter

MOV	counter, #
MOV	sprptr, sprtabl
:dospr	RDLONG	sprdata, sprptr	wz	' 0	get sprite table entry
IF_Z	JMP	#:nxtspr		' 7	zero entry, go to next sprite
MOV	sprtemp, sprdata	' 11	format addr[16]:xpos[8]:ypos[8]
AND	sprtemp, #$0FF		' 15	mask off ypos
SUB	sprtemp, curYpos	' 19	relative to current row
AND	sprtemp, #$1F8	wz,nr	' 23	check for outside 0-7
IF_NZ	JMP	#:nxtspr		' 27	outside range, go to next sprite

MOV	spraddr, sprdata	' 31	calculate sprite address
SHR	spraddr, #16		' 35	base address for top left
SHL	sprtemp, #3		' 39	8 bytes per row
ADD	spraddr, sprtemp	' 43
RDLONG	sprgfx0, spraddr	' 47->48
ADD	spraddr, #4		' 7
SHR	sprdata, #8		' 11	shift xpos to lsb
RDLONG	sprgfx1, spraddr	' 15->16
AND	sprdata, #$0FF		' 7	mask off xpos
MOV	sprbyte, sprgfx0	'	handle each byte (LSB first!)
CALL	#CLUT
SHR	sprgfx0, #8
MOV	sprbyte, sprgfx0
CALL	#CLUT
SHR	sprgfx0, #8
MOV	sprbyte, sprgfx0
CALL	#CLUT
SHR	sprgfx0, #8
MOV	sprbyte, sprgfx0
CALL	#CLUT
MOV	sprbyte, sprgfx1
CALL	#CLUT
SHR	sprgfx1, #8
MOV	sprbyte, sprgfx1
CALL	#CLUT
SHR	sprgfx1, #8
MOV	sprbyte, sprgfx1
CALL	#CLUT
SHR	sprgfx1, #8
MOV	sprbyte, sprgfx1
CALL	#CLUT
:nxtspr	ADD	sprptr, #4
DJNZ	counter, #:dospr	' 48+16+107+92*8 = 907 (912) max cycles / sprite



CLUT	AND	sprbyte, #$0FF	wz	' 4
 IF_Z	JMP	#:next			' 8	$00 is transparent
AND	sprbyte, #$1F0	wz,nr	' 12
 IF_NZ	JMP	#:color			' 16	fall through for $01-$0F greyscale
ADD	sprbyte, #Greys-1
MOVS	:grey, sprbyte
:grey	MOV	sprbyte, Greys
JMP	#:wrbyte
:color	MOV	sprtemp, sprbyte	' 20	$10-$FF color
AND	sprbyte, #$00F		' 24	format %lllscccc
ADD	sprbyte, #Colors	' 28	cccc is color index
MOVS	:clut, sprbyte		' 32	s is saturation
:clut	MOV	sprbyte, Colors		' 36	llls is luma
SHR	sprtemp, #4		' 40
AND	sprtemp, #$001	wz,nr	' 44
 IF_Z	SHR	sprbyte, #1		' 48	%xxx0cccc = less color
ADD	sprbyte, sprtemp	' 52	add luma to each subpixel
SHL	sprtemp, #8		' 56
ADD	sprbyte, sprtemp	' 60
SHL	sprtemp, #8		' 64
ADD	sprbyte, sprtemp	' 68
SHL	sprtemp, #8		' 72
ADD	sprbyte, sprtemp	' 76
:wrbyte	MOVD	:write, sprdata		' 80	write pixel to [xpos]
:write	MOV	LineRAM, sprbyte	' 84
:next	ADD	sprdata, #1		' 88	next xpos
CLUT_ret	RET			' 92 cycles (max)
Blank	LONG	$08080808	' 0 IRE
Sync	LONG	$00000000	' -40 IRE
Burst	LONG	$000C0004	' 0/20/0/-20 IRE
Greys	LONG	$09090909	' 5 IRE (black)
LONG	$0B0B0B0B	' 15 IRE
LONG	$0C0C0C0C	' 20 IRE
LONG	$0D0D0D0D	' 25 IRE
LONG	$0F0F0F0F	' 35 IRE
LONG	$10101010	' 40 IRE
LONG	$11111111	' 45 IRE
LONG	$13131313	' 55 IRE
LONG	$14141414	' 60 IRE
LONG	$15151515	' 65 IRE
LONG	$17171717	' 75 IRE
LONG	$18181818	' 80 IRE
LONG	$19191919	' 85 IRE
LONG	$1B1B1B1B	' 95 IRE
LONG	$1C1C1C1C	' 100 IRE (white)
Colors

 

The big, BIG problem is the routine requires close to 1000 cycles per sprite. Even at 80MHz, that's only 5 sprites per line. (More like 4 active sprites plus N sprites not on the current line.) Ugh. Not good.
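For reference, here is a rough Python sketch of where the "5 sprites per line" figure comes from. It assumes the whole scan line (227.5 NTSC color clocks) is available for sprite work, which is optimistic; the function names are illustrative only.

```python
from fractions import Fraction

COLORBURST_HZ = Fraction(315_000_000, 88)   # NTSC color subcarrier, ~3.579545 MHz
CLOCKS_PER_LINE = Fraction(455, 2)          # 227.5 color clocks per scan line


def cycles_per_line(sysclk_hz):
    """System clock cycles that fit in one NTSC scan line (rounded down)."""
    return int(sysclk_hz * CLOCKS_PER_LINE / COLORBURST_HZ)


def sprites_per_line(sysclk_hz, cycles_per_sprite):
    """How many sprites fit if each one costs a fixed number of cycles."""
    return cycles_per_line(sysclk_hz) // cycles_per_sprite


print(cycles_per_line(80_000_000))          # cycle budget for one line
print(sprites_per_line(80_000_000, 912))    # worst-case cost from the listing
```

Any cycles spent outside the sprite loop come straight out of that budget, so the real number can only be lower.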

 

The big cycle waster is the CLUT routine which needs almost 100 cycles per pixel to translate each byte to a long. Although some savings might be achieved by adding a luma lookup table to the color branch, that would still leave around 50 cycles per pixel which wouldn't do more than double the number of sprites.

 

Which leads me back to the pure lookup table option, not my preferred choice (although it does mean the game could have a custom palette). The palette would be smaller since it would be limited to the space not used by the code & line RAM.



Link to comment

Holy Crap! I'm hosed at the moment too. No display option :)

 

I PM'ed you on this...

 

Five per line is what others have seen as well. You clearly grok this thing.

 

There are 8 COGs, so that's not the end of the world...

 

(thinking about that and your code block)

 

Edit: Just finished looking over the code, over lunch. Seems to me the color table lookup method makes the most sense. You are assuming the sub-pixel method worked out above is in play, right?

 

(I think that will work too)

 

One would have to pay for the table in RAM somewhere, but the higher number of objects possible seems to outweigh that. I'm thinking about what the actual line driver can do in this regard. It's not clear to me just yet, how those play together. I'll have to do some more looking things over this evening...

 

The DK game in progress uses a lot of COGs to get its graphics display. Essentially, it leaves one for sound and another one to process the actual game logic in SPIN. That just seems to be a lot of CPU to get all of that done.

 

On the background, lower resolutions might make sense, 2600 style. A larger sprite size might cut down on the bit chopping required as well, though that will cost RAM.



Link to comment

My vision is to have two cogs generating video, trading off during the sync pulse (which sets all outputs to 0). While the cog is generating output, it's doing little else. (Well, maybe some minor initialization.)

 

I've done up the lookup table option and it still chews up far more cycles than I'd expect. I'm not sure that more bits per pixel in main RAM would help things, given the cost of doing RDLONGs.

 

The Propeller ISA is part of the problem. It's really tuned for 32 bit register to register operations. So the lookup code turns into:

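	MOV  sprbyte, sprdata
AND  sprbyte, #$0FF		' isolate the low byte (color index)
MOVS  :byte0, sprbyte		' patch source: CLUT entry
MOVD  :byte0, sprXpos		' patch destination: LineRAM slot
SHR  sprdata, #8		' next byte; also keeps one instruction between the patch and :byte0
:byte0  MOV  sprXpos, sprbyte
ADD  sprXpos, #1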

7 instructions = 28 cycles per pixel. Three ops to extract a byte from a long, and three more ops for the indirect read/write.
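In higher-level terms, the per-pixel work amounts to the following Python model. It is only a sketch: the `clut` contents are placeholders, and skipping index 0 is an assumption carried over from the "$00 is transparent" coding rather than something the seven instructions above do.

```python
TRANSPARENT = 0


def blit_long(line_ram, xpos, gfx_long, clut):
    """Write the four pixels packed in gfx_long to line_ram, LSB first."""
    for _ in range(4):
        index = gfx_long & 0xFF             # extract one byte from the long
        if index != TRANSPARENT:
            line_ram[xpos] = clut[index]    # indirect read, indirect write
        xpos += 1                           # next LineRAM slot
        gfx_long >>= 8                      # next byte
    return xpos


# Placeholder palette: entry i is the byte i repeated in all four subpixels.
clut = [i * 0x01010101 for i in range(256)]
line_ram = [0x09090909] * 16                # pre-filled with black
next_x = blit_long(line_ram, 4, 0x04000201, clut)
print(next_x, [hex(v) for v in line_ram[4:8]])
```

On the Propeller each of those array accesses needs a self-modified instruction, which is where most of the 28 cycles go.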

 

Hmm... only one cog for sound... I wonder how tough it would be to stick a simple tune player into the VSYNC routines for the video cog. That would give you 2 channel music for free. Read a frequency and counter from main memory and update the B oscillator every frame.

 

 

Oh, which reminds me. One thing I'm fighting with is clock generation. As I specified previously, the sub pixel clock is 14,318,182 Hz (4*colorburst), and the CPU clock has to be at least 62MHz (preferably higher for the sprite section). The question is whether it's possible to meet both objectives. 16*colorburst is only 57MHz, so you can't just strap a colorburst crystal to Xin/Xout. 4.77MHz (4/3 colorburst) would be fast enough, but I don't know whether those exist. (I know 14.31818MHz exists, I've used them.) And how feasible is it to generate the 14.31818MHz output clock from 4.77MHz?
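The arithmetic behind those numbers, as a quick Python sketch. It assumes a 16x clock PLL and uses the exact NTSC colorburst value of 315/88 MHz; the crystal list and names are just for illustration.

```python
from fractions import Fraction

COLORBURST = Fraction(315, 88)      # MHz, NTSC color subcarrier (~3.579545)
SUBPIXEL = 4 * COLORBURST           # MHz, ~14.31818
PLL_MULT = 16                       # assumed clock PLL multiplier
MIN_SYSCLK = 62                     # MHz, the lower bound quoted above

crystals = {
    "colorburst": COLORBURST,
    "4/3 colorburst": COLORBURST * Fraction(4, 3),
    "5 MHz": Fraction(5),
}


def evaluate(xtal):
    """Return (system clock, fast enough?, system clock / subpixel clock)."""
    sysclk = xtal * PLL_MULT
    return sysclk, sysclk >= MIN_SYSCLK, sysclk / SUBPIXEL


for name, xtal in crystals.items():
    sysclk, fast_enough, ratio = evaluate(xtal)
    print(f"{name:>15}: sysclk {float(sysclk):7.3f} MHz, "
          f"fast enough: {fast_enough}, sysclk/subpixel = {ratio}")
```

With the 4/3 colorburst crystal the system clock comes out at 16/3 of the subpixel clock, so a plain integer divider won't produce 14.31818MHz; it would have to come from something like a counter or PLL.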



Link to comment
4.77MHz (4/3 colorburst) would be fast enough, but I don't know whether those exist. (I know 14.31818MHz exists, I've used them.) And how feasible is it to generate the 14.31818MHz output clock from 4.77MHz?

 

You can order oscillators with whatever frequency you want from Digi-Key; in moderate quantities (100 or so) they're not much more than standard ones, and even in onesie-twosies they're not too bad.



Link to comment
4.77MHz (4/3 colorburst) would be fast enough, but I don't know whether those exist. (I know 14.31818MHz exists, I've used them.) And how feasible is it to generate the 14.31818MHz output clock from 4.77MHz?

 

You can order oscillators with whatever frequency you want from Digi-Key; in moderate quantities (100 or so) they're not much more than standard ones, and even in onesie-twosies they're not too bad.

 

It's got PLL circuits for each COG. They are good to 120MHz with the 5MHz crystal. There is some harmonic distortion at the higher frequencies, as the generated frequency exceeds the CPU clock, but I think the ranges we are discussing are far enough below that not to worry so much. Another user has used this thing as a fairly solid frequency generator, and another wrote a logic probe application that can actually debug one COG using another one to capture the pin states! In this regard, it's a very flexible design.

 

Clocked at 80MHz, it does broadcast on channel 3 very well. Not sure why I had so much trouble with this on the HYDRA though. Could be my environment, could be the different crystals used or some artifact of the circuit in the HYDRA -vs- the demo board.

 

I think one COG could do Donkey Kong sounds with the video just fine, given some work replicating them. Samples would take their own COG for any sort of quality.

 

Will be interesting to see how that all shakes down. I find it amazing it takes so many COGs to draw the game graphics. Maybe that's how it is, but I am reluctant to buy into that given all the options. It's just gonna take some thought.

 

(or maybe some more RAM...)

 

Byte addressing in the COG takes multiple ops to sort out. Byte addressing from the HUB costs cycles... Seems to me, changing how sprite data is stored might yield faster solutions.

 

I was thinking about one driver, building lines from a buffer. The problem with that happens to be the transfer time from HUB to COG. Can't really move much when doing waitvids. Two COGs alternating is doable. Something similar to that was done to add a cursor to a high resolution VGA driver. Another option is to change the waitvids to place things at specific times, but that is not gonna work well for a lot of sprites.

 

There is the overlay idea too. Essentially have more than one COG draw to the pins at the same time. If they are color cycle synced, the only real problem appears to be video getting or'ed together. This allows for more intensities, but also looks bad when things overlap. (no priorities, no collisions without COG to COG communication.)
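A tiny Python illustration of the or'ing problem. The byte values are borrowed from the grey table earlier in the thread purely as examples, and the function name is made up.

```python
def pins_seen(*cog_outputs):
    """Model pins driven by several cogs at once: the outputs are OR'ed."""
    level = 0
    for out in cog_outputs:
        level |= out
    return level


# $0D (25 IRE) over $0F (35 IRE): the brighter value simply wins.
print(hex(pins_seen(0x0D, 0x0F)))
# $0D (25 IRE) over $13 (55 IRE): the result matches neither input.
print(hex(pins_seen(0x0D, 0x13)))
```

So an overlap is neither "top sprite wins" nor a clean sum, which is why it looks bad without some kind of priority scheme.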



Link to comment

Okay, I've gotten my head wrapped around the counters. Given a reasonably accurate system clock crystal, it should be possible to generate that 14.31818MHz clock. I'm still not clear on what VCFG is doing under the covers, (I think I saw something stating it's better covered in the Hydra manuals.) but I can relax on the whole clocking issue.
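As a rough check on the counter idea, here is generic phase-accumulator arithmetic in Python. It assumes the counter adds a 32-bit tuning word to its phase register on every system clock, so the output frequency is word * sysclk / 2^32; the names are illustrative, and how the video hardware actually consumes this clock is not covered.

```python
from fractions import Fraction

TARGET_HZ = Fraction(315_000_000, 22)   # 4 x colorburst, ~14,318,181.8 Hz


def tuning_word(f_out_hz, f_clk_hz, bits=32):
    """Nearest tuning word for the requested output frequency."""
    return round(Fraction(f_out_hz) * (1 << bits) / f_clk_hz)


def actual_hz(word, f_clk_hz, bits=32):
    """Exact frequency a given tuning word produces."""
    return Fraction(word * f_clk_hz, 1 << bits)


word = tuning_word(TARGET_HZ, 80_000_000)
error = float(actual_hz(word, 80_000_000) - TARGET_HZ)
print(f"tuning word = ${word:08X}, error = {error:+.4f} Hz")
```

At 80MHz the step size is about 0.019 Hz, so the rounding error is far smaller than the tolerance of any crystal you'd use.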

 

The following code is the sprite->LineRAM routine using a color lookup table stored in cog RAM (not shown):

' Sprite Table format addr[16]:xpos[8]:ypos[8] (all 0 for unused)
' Sprites are 8x8 pixels (1 byte/pixel) stored in raster order
' pixel 0 is transparent, all others are color index
' xpos MUST be limited to 0-232 - no wrap around or scrolling!
' vertical scrolling is possible (i.e. ypos=255 will scroll in from top, ypos=233 will scroll off bottom)

sprtbl	LONG	$xxxx			' address of sprite table in main memory
TBLSIZE	EQU	$xx			' number of entries in sprite table
MAXSPR	EQU	$xx

curYpos	RES	1			' current row (0=first active row)
counter	RES	1			' multipurpose counter
count2	RES	1			' maximum number of active sprites
sprptr	RES	1			' pointer to sprite table entry (in main memory)
sprdata	RES	1			' sprite table entry / pointer to lineRAM (xPos)
sprtemp	RES	1			' Ypos
spraddr	RES	1			' pointer to sprite graphics (main memory)
sprbyte	RES	1			' current pixel
sprgfx0	EQU	sprtemp			' left 4 pixels (LSByte = leftmost)
sprgfx1	EQU	spraddr			' right 4 pixels (MSByte = rightmost)

MOV	counter, #TBLSIZE	' 	number of entries in sprite table
MOV	count2, #MAXSPR		' 	maximum number of active sprites
MOV	sprptr, sprtbl
sprloop	RDLONG	sprdata, sprptr	wz	' 7	get sprite table entry
IF_Z	JMP	#:nxtspr		' 11	zero entry, go to next sprite
MOV	sprtemp, sprdata	' 15	format addr[16]:xpos[8]:ypos[8]
AND	sprtemp, #$0FF		' 19	mask off ypos
SUB	sprtemp, curYpos	' 23	relative to current row
AND	sprtemp, #$1F8	wz,nr	' 27	check for outside 0-7	
IF_NZ	JMP	#:nxtspr		' 31	outside range, go to next sprite

MOV	spraddr, sprdata	' 35	calculate sprite address
SHR	spraddr, #16		' 39	base address for top left
SHL	sprtemp, #3		' 43	8 bytes per row
ADD	spraddr, sprtemp	' 47
RDLONG	sprgfx0, spraddr	' 48->7
ADD	spraddr, #4		' 11
SHR	sprdata, #8		' 15	shift xpos to lsb
RDLONG	sprgfx1, spraddr	' 16->7
AND	sprdata, #$0FF		' 11	mask off xpos
OR	sprdata, #$100		' 15	add LineRAM base address
MOV	sprbyte, sprgfx0	'	handle each byte (LSB first!)
AND	sprbyte, #$0FF	wz	'	$00 is transparent
MOVS	:byte0, sprbyte		'	set source
MOVD	:byte0, sprdata		'	set destination
:byte0	IF_NZ	MOV	sprdata, sprbyte
ADD	sprdata, #1
SHR	sprgfx0, #8
MOV	sprbyte, sprgfx0
AND	sprbyte, #$0FF	wz
MOVS	:byte1, sprbyte
MOVD	:byte1, sprdata
:byte1	IF_NZ	MOV	sprdata, sprbyte
ADD	sprdata, #1
SHR	sprgfx0, #8
MOV	sprbyte, sprgfx0
AND	sprbyte, #$0FF	wz
MOVS	:byte2, sprbyte
MOVD	:byte2, sprdata
:byte2	IF_NZ	MOV	sprdata, sprbyte
ADD	sprdata, #1
SHR	sprgfx0, #8	wz
MOVS	:byte3, sprgfx0
MOVD	:byte3, sprdata
:byte3	IF_NZ	MOV	sprdata, sprbyte
ADD	sprdata, #1
MOV	sprbyte, sprgfx1
AND	sprbyte, #$0FF	wz
MOVS	:byte4, sprbyte
MOVD	:byte4, sprdata
:byte4	IF_NZ	MOV	sprdata, sprbyte
ADD	sprdata, #1
SHR	sprgfx1, #8
MOV	sprbyte, sprgfx1
AND	sprbyte, #$0FF	wz
MOVS	:byte5, sprbyte
MOVD	:byte5, sprdata
:byte5	IF_NZ	MOV	sprdata, sprbyte
ADD	sprdata, #1
SHR	sprgfx1, #8
MOV	sprbyte, sprgfx1
AND	sprbyte, #$0FF	wz
MOVS	:byte6, sprbyte
MOVD	:byte6, sprdata
:byte6	IF_NZ	MOV	sprdata, sprbyte
ADD	sprdata, #1
SHR	sprgfx1, #8	wz
MOVS	:byte7, sprgfx1
MOVD	:byte7, sprdata
:byte7	IF_NZ	MOV	sprdata, sprbyte
DJNZ	count2, #:nxtspr
JMP	#:exit
:nxtspr	ADD	sprptr, #4
DJNZ	counter, #sprloop	' 48+16+219 = 287 (288) max cycles / sprite
:exit

ORG	$100
LineRAM	RES	240			' bottom half of cog RAM
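For anyone building the sprite table on the host side, the entry format can be sketched in Python as below. The helper names are hypothetical; only the addr[16]:xpos[8]:ypos[8] layout, the 0-232 xpos limit, and the 8 bytes per sprite row come from the listing.

```python
def pack_entry(addr, xpos, ypos):
    """Build one sprite table long: addr[16]:xpos[8]:ypos[8]."""
    if not (0 <= addr <= 0xFFFF and 0 <= xpos <= 232 and 0 <= ypos <= 0xFF):
        raise ValueError("field out of range")
    return (addr << 16) | (xpos << 8) | ypos


def unpack_entry(entry):
    """Split a sprite table long back into (addr, xpos, ypos)."""
    return entry >> 16, (entry >> 8) & 0xFF, entry & 0xFF


def row_address(addr, row):
    """Main memory address of one 8-pixel row of an 8x8 sprite."""
    return addr + 8 * row


entry = pack_entry(0x1234, 100, 50)
print(f"${entry:08X}", unpack_entry(entry), hex(row_address(0x1234, 3)))
```

Since an all-zero long marks an unused slot, a sprite with graphics at address 0 sitting at (0, 0) can't be represented.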

 

If we assume the standard 80MHz system clock, then the input routine has 5084 cycles to populate the LineRAM. At 288 cycles/sprite, that's ~17 sprites per line. Better, but not hugely impressive. One item to note is it takes 35 cycles (effectively 48 due to hub delays) to handle a sprite not on the current line. Crunching the numbers, this means the routine could handle a 64 entry sprite table with 8 active sprites per line (i.e. NES equivalent). Smaller sprite tables mean more active sprites (i.e. a 32 entry table could handle 14 active sprites).
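The number crunching can be reproduced with a few lines of Python. This is a simplified model: every table entry costs the 48-cycle scan, and an active sprite costs the remaining 240 cycles on top of that. The constant and function names are made up for the sketch.

```python
LINE_CYCLES = 5084      # cycles per scan line at 80 MHz (figure from the text)
SKIP_COST = 48          # cycles for a sprite not on the current line
ACTIVE_COST = 288       # cycles for a sprite that gets drawn


def max_active_sprites(table_entries):
    """Active sprites per line that fit after scanning the whole table."""
    scan = table_entries * SKIP_COST
    extra = ACTIVE_COST - SKIP_COST
    fit = max(0, (LINE_CYCLES - scan) // extra)
    return min(table_entries, fit)


for size in (64, 32, 17):
    print(size, "entries ->", max_active_sprites(size), "active sprites")
```

The model ignores setup and exit overhead, so treat the results as upper bounds.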

 

Hmm.. just checked. This routine will take up 84 longs of cog RAM. And each long of code is one less entry in the color lookup table. (Yeah, you could use some of the code longs as pixel values, but I suspect there won't be many which will actually have the desired Y+U,Y+V,Y-U,Y-V property.)



Link to comment