AtariVox

Feasibility of prerecorded speech packages for Stella


DirtyHairy


I have been wondering whether it would be possible to prerecord the phrases used by individual games and distribute them as a kind of "speech pack" for Stella. To all of you who have already written games that use the AtariVox: what kind of commands are you using? Are you sending them over to the AtariVox in a single "transaction", or is the transmission stretched over time? If all commands for an individual phrase were sent in a single transmission, we could just collect them and calculate a hash to identify the matching recording.
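
Something along these lines is what I have in mind on the emulator side. The names and the choice of hash (FNV-1a) are purely illustrative, not actual Stella code:

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// FNV-1a over the buffered command bytes of one phrase.
static std::uint64_t hashPhrase(const std::vector<std::uint8_t>& bytes) {
  std::uint64_t h = 0xcbf29ce484222325ull;
  for (std::uint8_t b : bytes) {
    h ^= b;
    h *= 0x100000001b3ull;
  }
  return h;
}

struct SpeechPack {
  // hash of a phrase's command bytes -> recording file inside the pack
  std::map<std::uint64_t, std::string> recordings;

  // Returns the matching recording, or an empty string if the phrase is
  // unknown (playback could then fall back to synthesized speech).
  std::string lookup(const std::vector<std::uint8_t>& phraseBytes) const {
    auto it = recordings.find(hashPhrase(phraseBytes));
    return it == recordings.end() ? std::string() : it->second;
  }
};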


Sort of the way MAME used speech samples for games where it hadn't emulated the speech circuitry?

 

That's an interesting idea. We'd need the authors to generate a binary for recording the phrases (or words) for each of their games.

 

At least one game (Sync) doesn't use words, but rather makes the AtariVox "sing" along with what you're doing. I'm not sure how that would work. (And Seemo hasn't been around for quite some time.)

 

One of the things I want to document is how each game is currently using the AtariVox: voices, scores, settings, etc. I just have to plow through the game descriptions.


1 hour ago, DirtyHairy said:

 what kind of commands are you using? Are you sending them over to the atarivox in a single "transaction", or is the transmission stretched over time? 

We're sending single-byte commands (or 2-byte commands) that contain either phonemes, pauses, or some parameter that affects the speech quality.

 

The joystick port uses bit-banged serial at 19200 baud, so I think it's pretty rare to send more than a byte a frame. It's possible to do more, but a byte a frame will keep the speech buffer full, and CPU time is precious.
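
(For reference: assuming standard 8N1 framing, a byte is 10 bits on the wire, so at 19200 baud it takes roughly 0.52 ms versus a ~16.7 ms NTSC frame; the one-byte-per-frame pattern reflects the 2600's CPU budget rather than the serial bandwidth.)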

 

My spell&speak app won't work with your idea either (in addition to Nathan's observation about Sync), since all of its phrases are unique.

 

Nathan, check out toymailman's thread. There might be a few new ones to add, but I think it has most of them listed by vox functionality.


5 hours ago, RevEng said:

Nathan, check out toymailman's thread. There might be a few new ones to add, but I think it has most of them listed by vox functionality.

Thanks! I'd even posted in that thread, but like so much AtariVox info I'd forgotten where it was (or that it even existed). I'll add it to the Information section.


6 hours ago, RevEng said:

The joystick port uses bit-banged serial at 19200 baud, so I think it's pretty rare to send more than a byte a frame. It's possible to do more, but a byte a frame will keep the speech buffer full, and CPU time is precious.

Hm...

 

I suppose consecutive allophones are not independent, correct? So we cannot simply play the allophones one by one. How long is the dependency chain? More than just the next neighbor?

 

Without dependencies, we could simply play one MP3 per allophone. If the resulting sound depends on two allophones, we could delay output by one frame and stitch the MP3s together. So, as long as two (or more) candidate samples cannot be told apart, we split the MP3s (assuming that the initial part would sound identical in this case). Once we have found a unique sample, we play the remaining MP3.
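
As a hypothetical sketch of that one-frame delay (all names invented here, and the pair samples are only needed if such a dependency actually exists):

#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>

struct AllophonePlayer {
  // Recording of an allophone as it sounds when followed by a specific
  // successor, keyed by (current, next).
  std::map<std::pair<std::uint8_t, std::uint8_t>, std::string> contextSamples;
  // Recording of an allophone on its own (or at the end of a phrase).
  std::map<std::uint8_t, std::string> plainSamples;
  std::optional<std::uint8_t> pending;  // allophone held back for one frame

  // Feed the next allophone each frame (nullopt at the end of a phrase);
  // returns the recording to start this frame, once the lookahead lets us
  // pick the right variant of the previous allophone.
  std::optional<std::string> onAllophone(std::optional<std::uint8_t> next) {
    std::optional<std::string> out;
    if (pending) {
      if (next) {
        auto it = contextSamples.find({*pending, *next});
        if (it != contextSamples.end()) out = it->second;
      }
      if (!out) {
        auto it = plainSamples.find(*pending);
        if (it != plainSamples.end()) out = it->second;
      }
    }
    pending = next;
    return out;
  }
};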

 

Sounds complicated, but is it even feasible?

 

Maybe we'd better look at the MAME solution?

Edited by Thomas Jentzsch

You're welcome, Nathan!

 

Thomas, as far as I can tell, consecutive phonemes are independent with the SpeakJet. The formant frequencies don't change depending on previous or following phonemes, and I haven't detected any other dependencies.

 

I was originally responding to DirtyHairy's question about using phrase samples. As for phoneme samples, the approach could work, but probably not with a comprehensive representation of the phonemes. A comprehensive capture would involve a lot of samples: number_of_phoneme_samples = phonemes * speeds * pitches * bends, which works out to 66,584,576 samples. If we estimate an average phoneme length of 0.1 s, that's about 77 days' worth of recording time.
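
As a back-of-the-envelope check (the ranges assumed below, 127 sound codes, 128 speeds, 256 pitches and 16 bends, are one reading of the SpeakJet parameters that reproduces that figure):

#include <cstdint>
#include <cstdio>

int main() {
  const std::uint64_t phonemes = 127, speeds = 128, pitches = 256, bends = 16;
  const std::uint64_t samples = phonemes * speeds * pitches * bends;
  const double seconds = samples * 0.1;   // ~0.1 s per phoneme
  std::printf("%llu samples, about %.1f days of audio\n",
              static_cast<unsigned long long>(samples),
              seconds / 86400.0);          // 66584576 samples, ~77.1 days
  return 0;
}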

 

Ideally, you'd also want to capture the samples starting and ending on the same phase of the fundamental frequency, so there's no "pop" of static between phonemes during playback, though I don't know how noticeable the pop would be in practice.
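
An alternative to phase-matched capture would be a short crossfade at each seam when the samples are stitched together. This is just a sketch of a generic audio trick, not anything the SpeakJet itself does:

#include <algorithm>
#include <cstddef>
#include <vector>

// Appends `next` to `out`, overlapping the last `fadeLen` samples of `out`
// with the first `fadeLen` samples of `next` using a linear crossfade.
void appendWithCrossfade(std::vector<float>& out,
                         const std::vector<float>& next,
                         std::size_t fadeLen) {
  fadeLen = std::min({fadeLen, out.size(), next.size()});
  const std::size_t start = out.size() - fadeLen;
  for (std::size_t i = 0; i < fadeLen; ++i) {
    const float t = static_cast<float>(i + 1) / static_cast<float>(fadeLen);
    out[start + i] = (1.0f - t) * out[start + i] + t * next[i];
  }
  out.insert(out.end(), next.begin() + fadeLen, next.end());
}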


Using programmatic speed and pitch changes would certainly put you back into workable numbers.
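
As a minimal sketch of applying a parameter at playback time instead of recording it, here is a naive linear-interpolation resampler. Note that this couples pitch and duration, so a real speech pack would want a proper time-stretch/pitch-shift; the function below only illustrates the idea:

#include <cstddef>
#include <vector>

// Resamples `in` by `ratio`: ratio > 1.0 raises the pitch (and shortens
// the sample), ratio < 1.0 lowers it.
std::vector<float> resample(const std::vector<float>& in, double ratio) {
  std::vector<float> out;
  if (in.size() < 2 || ratio <= 0.0) return out;
  for (double pos = 0.0; pos < static_cast<double>(in.size() - 1); pos += ratio) {
    const std::size_t i = static_cast<std::size_t>(pos);
    const double frac = pos - static_cast<double>(i);
    out.push_back(static_cast<float>((1.0 - frac) * in[i] + frac * in[i + 1]));
  }
  return out;
}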

 

The bends are described in the SpeakJet manual as follows: "The frequency Bend adjusts the output frequencies of the oscillators. This will change the voicing from a deep-hollow sounding voice to a High-metallic sounding voice."

 

I don't think you'd be able to do that with sample manipulation in the same way the SpeakJet does.

