
Speech Synthesis on VCS


Max-T


Can anyone give me a link to some information on how the programmers were able to make the 2600 say "Quadrun?" I remember thinking it was a feat of the black arts when I heard my NES exclaim "Double Dribble" and "Blades Of Steel", but it fascinates me that the Atari could be made to synthesize anything close to a spoken word.

 

From what I understand, the TIA chip is only able to produce square-wave or noise type sounds. How then was it pushed to produce a perfectly distinguishable (if aliased) word like "Quadrun?"

 

If only I could afford to own one of those coveted cartridges for myself...


I'd like to know what the process is to convert the audio. The only thing I've even heard mentioned is that you have to convert the file to 4-bit audio, and even doing that's tough. Compiling it in a ROM must be even harder...

 

 

I did speech on the NES (Three Stooges) and the process for the '2600 is exactly the same. Speech is just a series of volumes for the speaker, played in sequence very rapidly. Your ear can probably handle frequencies (= changes in volume) up to about 20kHz (ie: 20,000 changes per second). However, human speech, if I recall correctly, comes in at about 4kHz. So to reproduce speech, all you really need to be able to do is change the volume about 4,000 times per second.

 

Now consider a typical '2600 display; it is 262 lines at 60Hz. That's 15,720 scanlines per second. If you changed the volume on EVERY scanline, you'd be updating the volume at a rate approaching the top of the human hearing range. But that's not really necessary. All you really need to do is change the volume (say) once every 2 or 4 scanlines, giving you an update rate of very roughly 8kHz or 4kHz.
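
The scanline arithmetic above can be checked with a quick sketch. The function name is made up for illustration, and Python is only used here to show the numbers; on the real machine this would be 6502 code timed against the display.

```python
# Figures from the post: 262 scanlines per frame, 60 frames per second.
SCANLINES_PER_FRAME = 262
FRAMES_PER_SECOND = 60


def volume_updates_per_second(scanlines_per_update):
    """Volume changes per second if the volume is rewritten once
    every `scanlines_per_update` scanlines (hypothetical helper)."""
    scanlines_per_second = SCANLINES_PER_FRAME * FRAMES_PER_SECOND
    return scanlines_per_second / scanlines_per_update


for n in (1, 2, 4):
    print(n, volume_updates_per_second(n))
```

Updating every 2 or 4 scanlines works out to 7,860 or 3,930 updates per second, which is where the "very roughly 8kHz or 4kHz" comes from.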

 

But what is 'speech'? It's a sound, just like anything else, consisting purely of a series of "volumes" at varying frequencies, all mixed together. It's amazing that the ear/brain can distinguish individual elements of a series of sounds 'mixed' together, but it can.

 

To play speech, all you really need to do is find the sample you want to play (a recording of a famous speech, say). Then you figure your sample rate. That is, how many times per second you are going to change the volume on the '2600. We already figured 4kHz was sufficient. So 4,000 times per second, sample the volume of the recording, and save the volume readings to an array. When we play back those readings on the '2600, we hear the original recording!
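
As a rough sketch of the sampling step just described: read the recording at evenly spaced instants and keep the readings in a list. The "recording" below is a stand-in (a plain 440Hz tone), and both function names are assumptions for illustration; a real tool would read the values from an audio file instead.

```python
import math


def stand_in_recording(t):
    """Hypothetical stand-in for a real recording: a 440Hz tone,
    returning a level between -1.0 and 1.0 at time t (seconds)."""
    return math.sin(2 * math.pi * 440 * t)


def sample_recording(recording, sample_count, sample_rate_hz):
    """Read `recording` at `sample_count` evenly spaced instants,
    `sample_rate_hz` readings per second, and return the readings."""
    return [recording(i / sample_rate_hz) for i in range(sample_count)]


# Half a second of sound at the 4kHz rate from the post = 2,000 readings.
samples = sample_recording(stand_in_recording, 2000, 4000)
print(len(samples))
```

Playing back is the reverse: step through the list at the same rate and write each reading to the volume control.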

 

There's just one other issue to consider; the resolution of each sample. Typically, we can hear quite a range of volume -- from very quiet to very loud. Actually, it's a logarithmic scale. But the point is, when we take a sample, we are representing it with a number - if we used a single byte, we could represent volume from 0 to 255. Now the Atari doesn't have 8 bits of volume fidelity; it has just 4 bits. So when we sample, instead of ending up with a volume from 0-255, we need to shrink that range down to 0-15. Not high fidelity, but still does the job.
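
One simple way to shrink 0-255 down to 0-15 is to keep only the top four bits of each reading. This is a sketch of that one approach, not necessarily what any particular game's tools did, and the function name is made up.

```python
def to_4_bit(volume_8bit):
    """Reduce an 8-bit volume (0-255) to the 4-bit range (0-15)
    by dropping the low four bits."""
    if not 0 <= volume_8bit <= 255:
        raise ValueError("expected a value from 0 to 255")
    return volume_8bit >> 4


print([to_4_bit(v) for v in (0, 15, 16, 128, 255)])
```

Every group of 16 neighbouring input values collapses to one output value, which is the loss of fidelity mentioned above.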

 

There are a few other complications, for example, we're not REALLY dealing with 0-255 as our range, but instead -128 to 127 (0 is the midpoint of the 256-value range). Sound is represented as 'vibration' or oscillation around a 0-point, with positive and negative values. But the fundamentals are the same; convert your sample into individual volume elements. Downsize to the resolution of your output method, and then play back the samples at the appropriate rate.
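
For signed samples, a minimal sketch of the conversion is: shift the -128 to 127 range up so it starts at 0, then cut it down to 4 bits. The function name is an assumption, and dropping the low bits is just one way of downsizing.

```python
def signed_to_tia_volume(sample):
    """Map a signed 8-bit sample (-128..127) to a 4-bit volume (0..15).
    Silence (0) lands on the midpoint, 8."""
    if not -128 <= sample <= 127:
        raise ValueError("expected a value from -128 to 127")
    unsigned = sample + 128   # now 0..255, with the midpoint at 128
    return unsigned >> 4      # keep the top four bits


print([signed_to_tia_volume(s) for s in (-128, -64, 0, 64, 127)])
# prints [0, 4, 8, 12, 15]
```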

 

Once you can play *a* digitised speech sample, you can play back pretty much *any* digitised speech sample. It's not that tricky, really!

 

Cheers

A


www.gooddealgames.com has an interview with Steve Woita where he discusses the process . . . (Basically, just as Andrew described it) . . .

 

I've done it on the Apple via an old Compute (or possibly Byte) type-in as well. Does anybody know where to find that program, and whether or not it works in emulation?


Prompted by Andrew's post above I must say that Andrew Davie is the man. Whatever I read of his (Programming for Newbies, etc.), no matter how complicated it is, it's always easy to understand. If he neglects to write a book the world is missing out.

 

Now, If he can only explain quantum physics to me. :)

 

 

Tim


Prompted by Andrew's post above I must say that Andrew Davie is the man. Whatever I read of his (Programming for Newbies, etc.), no matter how complicated it is, it's always easy to understand. If he neglects to write a book the world is missing out.

 

Yeah, Andrew is quite amazing. Although I'm proud of 2600 101 (currently undergoing a revision) my own meager knowledge of 6502 and TIA is always eclipsed by his. Hopefully I can persuade him and some of the other [stella]ites to put in some entries in my next project, the 2600 Cookbook...

 

Andrew, what is the status of your Newbie forum? It seems to be a bit neglected...(I know how easy it is to let that happen as real world timesucks make their demands...)


Mr. Davie, thanks very much!

 

I'll be spending the next couple of days attached to online technical journals, 'cause you really went above and beyond the call answering my puny little post. But seriously, I do thank you, and I appreciate the time you took to help out a newbie.

 

Max'


If I remember correctly from the early days of the stella list,

somebody did program a demo called stella says and it had speech in it too.

I don't remember how they did it though.

 

That was Eckhard, and my understanding is that's the same code used in Berzerk VE. I'm considering using it for the RPG if we have space left over.

 

-paul


Andrew, what is the status of your Newbie forum?  It seems to be a bit neglected...(I know how easy it is to let that happen as real world timesucks make their demands...)

 

Well it's probably sufficient to point out that here it is currently 1:30 in the morning and I'm working on an important project and I will be waking up in another 6 hours or so and putting on a suit and tie and going to my "real" job and I'm likely to be doing that routine for a few months to come, at least.

 

I'm just too busy to do all the things I have on my plate. I'd really really love to turn the Newbie forum into an actual book. Somebody find me a publisher and I promise I'll do just that. I like to think that I'm able to explain things in a way that helps people understand.

 

On the subject of speech synthesis, it is helpful to think about what we are actually DOING when we are making sounds. The bottom line is that we cause our eardrums to vibrate. Nothing more, nothing less. The eardrum is a simple membrane... it vibrates quickly, slowly, with large or small intensity. And (for all intents and purposes) with that mechanism alone we hear every sound that you care to name.

 

So recreating speech or music is simply a matter of making the eardrum move in the same way as the original sound made the eardrum move. But no matter HOW complex the original sound, it was still moving the eardrum back and forward... at different speed, with different intensity... but still the same basic back and forward oscillation.

 

If we can reproduce that oscillation then we hear the original sound. Think about a speaker in a hi-fi system. That can reproduce incredibly complex sounds, speech, music. But ultimately, it's just a simple cardboard cone with a membrane vibrating back and forth and causing pressure waves in the air which in turn cause our eardrums to vibrate back and forth... just like the original sound.

 

So the Atari... just needs to control the speaker (the vibrating membrane) in the same way. To vibrate it back and forth, we can simply change the volume of the sound we're sending to it. Loud, soft, loud, soft, loud, soft... we have an oscillating membrane! Do that once and you hear a click. Do it quicker and you hear a low buzz. Do it a bit quicker and you hear a higher toned buzz. Quicker still and it's higher pitched 'note'. Do it with varying intensity, and you hear a warbling sound. Do it from a digitised speech sample and you hear... speech!
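
The loud, soft, loud, soft idea can be sketched as a list of volume values to write, one per update. Both function names and the default levels (15 for loud, 0 for soft, matching the 4-bit range) are assumptions for illustration.

```python
def loud_soft_pattern(updates_per_half_cycle, total_updates, loud=15, soft=0):
    """Volume to write at each update: `loud` for one half cycle,
    `soft` for the next, and so on."""
    return [
        loud if (i // updates_per_half_cycle) % 2 == 0 else soft
        for i in range(total_updates)
    ]


def tone_frequency(update_rate_hz, updates_per_half_cycle):
    """Pitch of the resulting buzz: one full cycle is a loud half
    plus a soft half."""
    return update_rate_hz / (2 * updates_per_half_cycle)


print(loud_soft_pattern(2, 8))   # [15, 15, 0, 0, 15, 15, 0, 0]
print(tone_frequency(4000, 2))   # 1000.0
print(tone_frequency(4000, 4))   # 500.0
```

Shortening the half cycle raises the pitch, which is the "do it quicker" progression described above. Replacing the fixed loud/soft values with digitised readings is what turns the buzz into speech.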

 

Let's throw a bit of theory at you while you're reading... given any waveform (ie: a recording of someone speaking), that waveform consists of various frequency components (remember, we already noted that the human ear can hear up to 20kHz). Some of the speech has 4kHz bits, some of it is at 8kHz, etc. To reliably reproduce the waveform from a sampled subset of the original data, Shannon's sampling theorem states that you need to sample at twice (or more) the highest frequency component of the original. So if we decide that we want to reproduce sound with frequency up to 8kHz, then we'd need to sample at 16kHz (ie: have 16,000 samples per second) to accurately reproduce the original waveform.
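
Here is the sampling rule as a small sketch. The function names are made up; the third function uses the standard frequency-folding formula to show what happens to a pure tone that is too high for the chosen sample rate (it shows up at a lower, "aliased" frequency).

```python
def minimum_sample_rate(highest_frequency_hz):
    """Sample rate needed to capture components up to the given frequency."""
    return 2 * highest_frequency_hz


def highest_reproducible_frequency(sample_rate_hz):
    """Highest component a given sample rate can capture."""
    return sample_rate_hz / 2


def apparent_frequency(tone_hz, sample_rate_hz):
    """Frequency a pure tone appears to have after sampling.
    Tones above half the sample rate fold back down."""
    nearest_multiple = sample_rate_hz * round(tone_hz / sample_rate_hz)
    return abs(tone_hz - nearest_multiple)


print(minimum_sample_rate(8000))             # 16000
print(highest_reproducible_frequency(4000))  # 2000.0
print(apparent_frequency(3000, 8000))        # 3000
print(apparent_frequency(6000, 8000))        # 2000
```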

 

Even at 4 bits per sample (the resolution of the '2600 sound register), 16,000 samples per second equates to 8K-bytes of ROM per second of speech (16,000 x 4 bits = 64,000 bits)! So you can see, although sound is simple, it's also very expensive. Clever packing techniques (eg: a delta between successive samples, rather than absolute values for each) can reduce this burden, but they, of course, can cost you valuable processing time. So I would expect that sound which is compressed will probably play with blank screens, and uncompressed sound will be very short in duration :)
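
A sketch of the storage arithmetic, plus the two packing ideas in their simplest form. All names here are hypothetical, and the delta functions only show the round trip; the actual saving would come from storing each delta in fewer than 4 bits, which means limiting how far one sample can differ from the last.

```python
def rom_bytes_per_second(sample_rate_hz, bits_per_sample):
    """ROM needed for one second of uncompressed sound."""
    return sample_rate_hz * bits_per_sample // 8


def pack_nibbles(samples):
    """Pack 4-bit samples two to a byte, first sample in the high nibble.
    An odd-length list is padded with a trailing 0."""
    samples = list(samples)
    if len(samples) % 2:
        samples.append(0)
    return bytes(
        (samples[i] << 4) | samples[i + 1] for i in range(0, len(samples), 2)
    )


def delta_encode(samples):
    """Store the change from the previous sample instead of the sample."""
    deltas, previous = [], 0
    for s in samples:
        deltas.append(s - previous)
        previous = s
    return deltas


def delta_decode(deltas):
    """Rebuild the samples by accumulating the deltas."""
    samples, current = [], 0
    for d in deltas:
        current += d
        samples.append(current)
    return samples


print(rom_bytes_per_second(16000, 4))   # 8000
print(pack_nibbles([1, 2, 3, 4]).hex()) # 1234
print(delta_encode([8, 9, 9, 7]))       # [8, 1, 0, -2]
```

The decode loop is the "valuable processing time" part: every sample costs an extra add on playback compared with reading absolute values straight out of ROM.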

 

Feel free to correct any errors in the above; it's rather late and clearly I'd rather be doing anything else but working right now :)

 

Cheers

A


I've done it on the Apple via an old Compute (or possibly Byte) type-in as well.  Does anybody know where to find that program, and whether or not it works in emulation?

The samples themselves should work in emulation...since you are just playing a speaker click at the proper frequency. Getting the samples would be the difficult part, because the program just used the cassette input to gather it. IIRC, it was printed in Creative Computing.

