Chapter 6. Sound

Though sometimes overlooked, sound is an important component of games. Whether it’s to provide audio cues to gameplay situations or enhance the overall atmosphere, games lose a great deal without quality sound. To experience how important sound is, try playing one of your favorite games on mute—something is definitely lost if the sound is disabled.

This chapter first covers how source data is translated into “cues” that are played back by the game code. It then moves on to more advanced sound techniques, such as the Doppler effect, digital signal processing, and sound occlusion, that might be used in certain situations.

Basic Sound

At the most basic level, sound in a game could simply involve playing back standalone sound files at the appropriate points in time. But in many cases a single event does not necessarily correspond to a single sound. Suppose a game has a character who runs around the world. Every time the character’s foot hits the ground, a footstep sound should play. If there were only one footstep sound file played over and over again, it would very quickly become repetitive. At the very least, it would be preferable to have multiple footstep sounds, one of which is randomly selected every time a footstep is triggered.

An additional consideration is that there is a finite number of channels, or separate sounds, that can be played simultaneously. Imagine a game where a large number of enemies are running around the player—if all of them play their footstep sounds, it may very quickly use up all the available audio channels. Certain sounds are going to be far more important than enemy footstep sounds, and so there needs to be some sort of prioritization system, as well. All of these considerations, and more, very quickly lead to a situation where we need more information than what is stored in a sound file. Because of this, most games store an additional set of data that describes how and in what circumstances specific sound files should be played.

Source Data

Source data refers to the original audio files that are created by the sound designer using a tool such as Audacity (http://audacity.sourceforge.net). In the footstep scenario, there might be dozens of different source files, such as fs1.wav, fs2.wav, fs3.wav, and so on. A common approach is to store short sound effects as WAV files or another uncompressed file format, and store longer sounds, such as music or dialogue, in a compressed format such as MP3 or OGG.

When it comes to playing back these sound files in a game, there are two common approaches. It usually makes sense to preload short sound effects into memory, so when it’s time to play the sound there is no time spent fetching the file from the disk. On the other hand, because compressed music or dialogue files are typically much larger in size, they are usually streamed off of a storage device. This means that as the sound file is being played, small segments of it are loaded on demand from the disk.

In order to load and play back source data, certain platforms have built-in sound libraries (such as CoreAudio on iOS). But for cross-platform support, OpenAL (http://kcat.strangesoft.net/openal.html) is a very popular solution.

Sound Cues

A sound cue, sometimes called a sound event, maps to one or more source data files. The sound cue is what is actually triggered by the game code—so rather than having code that directly plays the fs1.wav data file, there might be code that triggers a sound cue called “footstep.” The idea is that the sound cue can be a container for any number of source data files as well as store metadata about the sound as a whole.

For example, suppose there is an explosion cue. This cue should randomly trigger one of five different explosion WAV files. Additionally, because an explosion is something that can be heard from far away, there might be meta information that specifies the maximum distance the sound will be heard. It would be wise to also have a high priority assigned to the explosion cue, so even if all the channels are currently in use, the explosion is still audible. The basic layout of this cue is illustrated in Figure 6.1.

Figure 6.1 Explosion sound cue.

As for how this metadata might be stored for game use, there are a number of possibilities. One option is to store it in a JSON file, which might look something like this in the case of the explosion cue:

{
   "name": "explosion",
   "falloff": 150,
   "priority": 10,
   "sources":
   [
      "explosion1.wav",
      "explosion2.wav",
      "explosion3.wav"
   ]
}

The JSON file format is covered in further detail in Chapter 11, “Scripting Languages and Data Formats.” But any format will work as long as it allows for enough flexibility. In any event, during the course of parsing the data, it can be directly mapped to a class implementation of the sound cue, like so:

class SoundCue
   string name
   int falloff
   int priority
   // List of strings of all the sources
   List sources

   function Play()
      // Randomly select from sources and play
      ...
   end
end
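
To make this a bit more concrete, here is a minimal sketch of the same class in C++ rather than pseudocode. The PlayFile helper is a hypothetical stand-in for whatever low-level playback call (OpenAL or a platform library) the engine actually uses.

// A minimal C++ sketch of the SoundCue pseudocode above; PlayFile is a
// hypothetical stand-in for the engine's low-level playback call.
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

void PlayFile(const std::string& fileName)
{
   // Stub: a real engine would submit the preloaded WAV to a free channel here.
   std::cout << "Playing " << fileName << "\n";
}

class SoundCue
{
public:
   std::string name;
   int falloff = 0;
   int priority = 0;
   std::vector<std::string> sources; // file names parsed from the "sources" array

   void Play() const
   {
      if (sources.empty()) { return; }
      // Randomly select one of the source files and play it.
      PlayFile(sources[std::rand() % sources.size()]);
   }
};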

The preceding system might be sufficient for many games, but for the footstep example it may not be enough. If the character can run across different surfaces—stone, sand, grass, and so on—different sound effects must play depending on what surface the character currently is on. In this case, the system needs some way to categorize the different sounds and then randomly select from the correct category based on the current surface. Or in other words, the system needs some way to switch between the different sets of source data based on the current surface.

The JSON file for this more advanced type of cue might look like this:

{
   "name": "footstep",
   "falloff": 25,
   "priority": "low",
   "switch_name": "foot_surface",
   "sources":
   [
      {
         "switch": "sand",
         "sources":
         [
            "fs_sand1.wav",
            "fs_sand2.wav",
            "fs_sand3.wav"
         ]
      },
      {
         "switch": "grass",
         "sources":
         [
            "fs_grass1.wav",
            "fs_grass2.wav",
            "fs_grass3.wav"
         ]
      }
   ]
}

It’s certainly possible to just add this additional functionality to the SoundCue class. A preferable solution, however, would be to have an ISoundCue interface that is implemented by both SoundCue and a new SwitchableSoundCue class:

interface ISoundCue
   function Play()
end

class SwitchableSoundCue implements ISoundCue
   string name
   int falloff
   int priority
   string switch_name

   // Hash map that stores (string, List) pairs
   // For example ("sand", ["fs_sand1.wav", "fs_sand2.wav", "fs_sand3.wav"])
   HashMap sources

   function Play()
      // Grab the current value for switch_name
      // Then look up the list in the hash map and randomly play a sound
      ...
   end
end

In order for this implementation to work, there needs to be a global way to get and set switches. This way, the player’s movement code can determine the surface and set the switch prior to actually triggering the footstep cue. Then the footstep cue’s Play function can query the current value of the switch, and it will result in the correct type of footstep sound playing.
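
As a rough illustration, here is a C++ sketch of such a global switch table together with the lookup done in Play. The names SetSwitch, GetSwitch, and PlayFile are illustrative only, not functions from any particular engine.

// A sketch of a global switch table plus the lookup in Play; SetSwitch,
// GetSwitch, and PlayFile are illustrative names, not engine functions.
#include <cstdlib>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

static std::unordered_map<std::string, std::string> gSwitches;

void SetSwitch(const std::string& name, const std::string& value)
{
   gSwitches[name] = value;
}

std::string GetSwitch(const std::string& name)
{
   auto it = gSwitches.find(name);
   return it != gSwitches.end() ? it->second : std::string();
}

void PlayFile(const std::string& fileName)
{
   std::cout << "Playing " << fileName << "\n"; // stub
}

class SwitchableSoundCue
{
public:
   std::string switchName; // e.g. "foot_surface"
   // Maps a switch value such as "sand" to that surface's source files.
   std::unordered_map<std::string, std::vector<std::string>> sources;

   void Play() const
   {
      auto it = sources.find(GetSwitch(switchName));
      if (it == sources.end() || it->second.empty()) { return; }
      // Randomly select from the list that matches the current switch value.
      PlayFile(it->second[std::rand() % it->second.size()]);
   }
};

With this in place, the player’s movement code can call something like SetSwitch("foot_surface", "grass") just before triggering the footstep cue, and Play will pick one of the grass footstep files.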

Ultimately, the key component when implementing a sound cue system is to allow for enough configurability to determine when and how sounds are played. Having as many dynamic parameters as possible will give the audio designers far more flexibility to create an immersive environment.

3D Sound

2D sound is typically positionless, though this is not an absolute requirement. This means that for most 2D games, the sound will just play equally out of the left and right speakers. Some 2D games might introduce some aspects of position into the sound, such as panning or distance-based volume reductions, but it will depend on the particular game.

For 3D sound, and 3D games by extension, the position of the audio is extremely important. Most sounds are positional, and their characteristics change based on their distance and orientation relative to the listener, the virtual microphone that picks up the audio in the 3D world.

That is not to say that 3D games don’t utilize 2D sounds; they are still used for elements such as user interface sounds, narration, ambient background effects, and music. But any sound effects that occur in the world itself are typically represented as 3D sounds.

Listeners and Emitters

Whereas the listener is what picks up the sound in the world, the emitter is what actually emits a particular sound effect. For example, if there is a fireplace that crackles, it will have a sound emitter placed at its location that plays back the crackle cue. Then based on the distance between the listener and the fireplace’s sound emitter, it would be possible to determine how loud the sound should be. Similarly, the orientation of the emitter relative to the listener will determine which speaker the sound should be heard from. This is shown in Figure 6.2.

Figure 6.2 Sound listener and emitter; in this case, the sound effect should be heard on the right.

Because the listener picks up all the audio in the 3D world, determining the position and orientation of the listener is extremely important. If the listener is not set appropriately, the rest of the 3D sound system will not work properly—either sounds will be too quiet or too loud, or sounds might even come out of illogical speakers.

For many types of games, it makes sense to have the listener directly use the position and orientation of the camera. For example, in a first-person shooter, the camera is also where the player’s point of reference is, so it makes perfect sense to have the listener use that same position and orientation. The same can be said for a cutscene where the camera is following a set path; the listener can just update as the camera updates.

Although it may be tempting to always have the listener track the position and orientation of the camera, there are certain types of games where this will not work properly. Take, for instance, a third-person action game. In such a game, the camera is following the main character at a distance. Suppose in one particular game, the camera’s follow distance is 15 meters. What happens when a sound effect gets played right at the feet of the player character?

Well, if the listener is at the same position as the camera, the sound will seem like it’s 15 meters away. Depending on the type of sound and its falloff range, it might result in a sound that’s barely audible, which is odd because it’s something that’s happening right next to the player’s character. Now if it’s a sound effect that the player is triggering, we can usually identify these sounds and treat them differently. However, if it’s a sound triggered by an enemy adjacent to the player, we don’t have a simple way around the distance problem.

One solution that might come to mind is to have the listener position and orientation set to that of the player character. Though this would solve the issue of sounds being 15 meters away, it introduces a major problem. Suppose an explosion occurs between the camera and the player. If the listener position and orientation inherits that of the player, this explosion will register as “behind” the listener, and therefore come out of the rear speakers. Furthermore, if the game allows the camera to rotate independently of the player, it may be possible to get into a scenario where a sound emitter that’s to the left of the player (and therefore, the listener) is actually on the right part of the screen. This means that if an explosion occurs on the right part of the screen, it may actually come out of the left speaker. This seems incorrect because we expect an explosion on the right side of the screen to always come out of the right speaker, regardless of how the player is oriented.

In the end, for the melee-focused third-person game Lord of the Rings: Conquest, we arrived at a bit of a compromise. First, we based the orientation of the listener on the orientation of the camera, rather than on that of the player. Next, instead of placing the position of the listener precisely at the camera or precisely at the player, we placed it at a location between the two positions. Or in other words, the position of the listener was the result of an interpolation between the camera and player position, as in Figure 6.3. The percentage of the way between the two will vary depending on the game, but usually somewhere between 33% and 66% works well.

Figure 6.3 Listener position in a third-person game.

With this solution, although there still is the possibility that a sound between the camera and player will play from the rear speakers, the chance is reduced. And even though a sound at the player will not play at a distance of zero, it will play at a closer distance than it would were the listener at the camera. At the very least, there is no problem with the orientation of the listener, because that will be inherited from the camera. For some games, the correct decision is to always place the position of the listener at the camera. However, because our particular game heavily featured melee combat, this was not the case for us.
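A minimal sketch of this placement scheme follows: the listener position is computed each frame as a linear interpolation between the camera and player positions, while the orientation still comes from the camera. The Vector3 type and the 0.5 blend factor are assumptions for illustration.

// A sketch of the interpolated listener placement described above; the
// Vector3 type and the 0.5 blend factor are assumptions.
struct Vector3
{
   float x = 0.0f;
   float y = 0.0f;
   float z = 0.0f;
};

Vector3 Lerp(const Vector3& a, const Vector3& b, float t)
{
   return { a.x + (b.x - a.x) * t,
            a.y + (b.y - a.y) * t,
            a.z + (b.z - a.z) * t };
}

// t = 0 places the listener at the camera and t = 1 at the player;
// values in the 33%-66% range are typical. Orientation is taken
// from the camera separately.
Vector3 ComputeListenerPosition(const Vector3& cameraPos,
                                const Vector3& playerPos,
                                float t = 0.5f)
{
   return Lerp(cameraPos, playerPos, t);
}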

Falloff

Falloff describes how the volume of the sound decreases as the emitter gets farther away from the listener. It is possible to use any sort of function to represent falloff. However, because the unit of sound measurement, the decibel (dB), is a logarithmic scale, a linear decibel falloff would translate into logarithmic behavior. This sort of linear decibel function often ends up being the default method, but it certainly is not the only method.

As with point lights, it is also possible to add further parameters. Perhaps there could be an “inner” radius where the falloff function does not apply, or an “outer” radius after which the sound is automatically inaudible. Certain sound systems even allow the sound designer to create a multipart falloff function that exhibits different behavior at different distances.
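The following is a minimal sketch of what such a falloff function might look like, with an inner radius where no attenuation applies and an outer radius beyond which the sound is silent. The linear ramp in decibels and the -60 dB floor are assumptions for illustration.

// A sketch of a falloff function with inner and outer radii; the linear
// ramp in decibels and the -60 dB floor are assumptions.
#include <cmath>

// Returns a linear gain in [0, 1] to apply to the emitter's volume.
float ComputeFalloffGain(float distance, float innerRadius, float outerRadius)
{
   if (distance <= innerRadius) { return 1.0f; } // no attenuation up close
   if (distance >= outerRadius) { return 0.0f; } // inaudible past the outer radius

   // Interpolate linearly in decibels from 0 dB down to -60 dB...
   float t = (distance - innerRadius) / (outerRadius - innerRadius);
   float db = -60.0f * t;
   // ...and convert decibels back to a linear gain.
   return std::pow(10.0f, db / 20.0f);
}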

Surround Sound

Certain platforms don’t support the idea of surround sound—most mobile devices really only support stereo and that’s about it. However, for PC or console games, it is possible to have more than just two speakers. In a 5.1 surround system, there are a total of five regular speakers and one subwoofer, which is used for the low-frequency effects (LFE) channel.

The traditional 5.1 configuration is to place three speakers in the front and two in the back. The front channels are left, right, and center, whereas the back features only left and right. The position of the subwoofer relative to the listener doesn’t matter, though placing it in the corner of a room will cause its low-frequency sounds to resonate more. Figure 6.4 illustrates a common 5.1 layout.

Figure 6.4 A standard 5.1 surround sound speaker configuration.

The neat thing about a 5.1 configuration is that it gives a much greater perception of the position of sounds. If a spaceship flies overhead in the game, the sound can pan from the back speakers to the front, so it feels like it’s flying past you. Or in a horror game, it might be possible to determine if there’s an enemy creeping up on you from the left or right.

But as much as it can add to the gaming experience, the reality is that a large percentage of players will not have a 5.1 surround configuration. Because of this, it is not really viable to create a game that relies on surround sound to function. Although it certainly will sound better on a $1000 home theater setup, the game still has to work on tinny built-in TV speakers or a bad pair of headphones.

Although you can’t separate front and back in a stereo configuration, this does not mean the game can’t still support positional sounds. The positions of both the listener and the emitters, as well as the falloffs and other parameters, will still affect the volume and left/right placement of the sound.

Digital Signal Processing

In a broad sense, digital signal processing (DSP) is the computational manipulation of a signal. In the realm of audio, DSP typically refers to taking a sound source file and modifying it on playback to sound differently. A relatively simple DSP effect would be to take a sound and increase or decrease its pitch.

It might seem like there would be no need to perform DSP effects on the fly, and rather bake such effects into the sound files that the game engine plays back. But the reason a runtime DSP effect is useful is because it can save a great deal of memory. Suppose that in a sword-fighting game, there are 20 different sound effects for swords clanging against each other. These sound effects were authored so they sound roughly like they are clanging in a wide open field. Now imagine that the game has a wide variety of locales where the sword fights can occur, in addition to the open field—it can be in a small cave, a large cathedral, and a great deal of other places.

The problem is that swords clanging in a small cave are going to sound dramatically different than in an open field. Specifically, when swords clang in a small cave, there is going to be a great deal of echo, or reverb. Without DSP effects, the only recourse would be to author a set of those 20 sword-clanging sounds for every single locale. If there’s five such distinct locales, that means a fivefold increase, to a total of 100 sword-clanging sounds. Now if this is the case for all the combat sounds, not just the sword ones, the game may very quickly run out of memory. But if DSP effects are available, the same 20 sword-clanging sounds can be used everywhere; they just have to be modified by the effects to create the desired end result.

To actually implement DSP effects requires knowledge of linear systems and advanced mathematical operations such as the Fourier transform. Because of this, the implementation of these effects is well beyond the scope of this book. However, if you are interested in learning how to implement the effects covered in this section, check the references at the end of the chapter. In any event, even though we won’t implement any of the effects, it is worthwhile to at least discuss what the most common effects represent.

Common DSP Effects

One of the most common DSP effects encountered in games is the aforementioned reverb. Any game that wants to re-create the echo of loud sounds in enclosed spaces will want to have some sort of implementation of reverb. A very popular open source reverb library is Freeverb3, which is available at http://freeverb3.sourceforge.net. Freeverb3 is an impulse-driven system, which means in order for it to apply the reverb effect to an arbitrary sound, it needs a source sound file that represents a specific test sound playing in the desired environment.

Another DSP effect that gets a large amount of use is the pitch shift, especially if Doppler shift (covered later in the chapter) is desired. A pitch shift increases or decreases the pitch of a sound by altering the frequency. Although a Doppler shift would be the most common use, another example might be in a car racing game, where the pitch of the engine might adjust depending on the number of RPMs.

Most of the other DSP effects that see use in games usually modify the range of frequencies or decibel levels output. For example, a compressor narrows the volume range so that very quiet sounds have their volume amplified while very loud sounds have their volume reduced. This might be used to try to normalize volume levels if there are wildly different levels in different sound files.

Another example is a low-pass filter, which reduces the volume of sounds with frequencies (and therefore pitches) higher than the cutoff point. This is commonly used in games that implement a “shell shock” effect when an explosion occurs near the player. To sell the effect, time is dilated, a low-pass filter is applied, and a distinct ringing sound is played.
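For illustration only—production filters and reverbs are far more involved—here is a minimal one-pole low-pass filter in C++, which attenuates content above the cutoff by smoothing the signal one sample at a time.

// A minimal one-pole low-pass filter, shown only to illustrate the kind of
// per-sample work a DSP effect does; real game filters are more sophisticated.
#include <cmath>

class OnePoleLowPass
{
public:
   OnePoleLowPass(float cutoffHz, float sampleRate)
   {
      // Standard one-pole coefficient: a higher cutoff keeps more of the input.
      mAlpha = 1.0f - std::exp(-2.0f * 3.14159265f * cutoffHz / sampleRate);
   }

   float Process(float input)
   {
      // Each output sample moves a fraction of the way toward the input,
      // which smooths away (attenuates) content above the cutoff.
      mState += mAlpha * (input - mState);
      return mState;
   }

private:
   float mAlpha = 0.0f;
   float mState = 0.0f;
};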

There are quite a few other effects that might find use in a game, but these four are some of the most common effects you’ll come across.

Marking Regions

It is rare that an effect—especially reverb—should be applied uniformly to an entire level. Instead, it is likely necessary to apply the reverb only in certain regions of the level. For example, if a level has both outdoor areas and a small cave, the reverb might only be enabled inside the cave. A region can be marked in a number of ways, but one of the simplest is to use a convex polygon that lies on the ground plane.

Recall that a convex polygon is one where all the vertices of the polygon point outward. More specifically, all of the interior angles of a convex polygon are less than 180°. Figure 6.5 demonstrates both a convex and a concave polygon.

Figure 6.5 Convex (a) and concave (b) polygons.

The reason why a convex polygon is preferred is that given a point, it is relatively straightforward to determine whether that point is inside or outside the convex polygon. So, in our case, the convex polygon represents the region where the reverb effect should be applied. Given the position of the player, it will then be possible to determine whether the player is inside or outside that region; if the player is inside the region, the reverb should be enabled, and vice versa. The algorithm for determining whether or not a point is inside a convex polygon is covered in detail in Chapter 7, “Physics,” as it has many applications beyond just DSP effects.

However, we don’t want to just suddenly switch on the reverb as soon as the player enters the cave. Otherwise, the effect will be pretty jarring. This means that once the player crosses over into a marked region, we want to slowly interpolate between the reverb being off and on, to make sure that the transition feels organic.
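One simple way to implement that transition is to fade a “wet level” toward 1 while the player is inside the region and back toward 0 while outside, as in the sketch below. The one-second fade time and the SetReverbWetLevel call are assumptions for illustration.

// A sketch of fading the reverb in and out as the player crosses a marked
// region; the one-second fade time and SetReverbWetLevel are assumptions.
#include <algorithm>

void SetReverbWetLevel(float level)
{
   (void)level; // Hypothetical engine call: 0 = reverb off, 1 = reverb fully on.
}

class ReverbRegionFader
{
public:
   // Call once per frame with whether the player is currently in the region.
   void Update(bool playerInRegion, float deltaTime)
   {
      const float fadeTime = 1.0f; // seconds to fade fully on or off (assumed)
      float step = deltaTime / fadeTime;
      mWetLevel += playerInRegion ? step : -step;
      mWetLevel = std::clamp(mWetLevel, 0.0f, 1.0f);
      SetReverbWetLevel(mWetLevel);
   }

private:
   float mWetLevel = 0.0f;
};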

Note that there is one big caveat to using convex polygons to mark DSP regions. If it’s possible to have areas above or below other areas in the level, and those areas have different DSP effect needs, this approach will not work. So, for example, if there is a tunnel that goes under a grassy field, a convex polygon for the tunnel would also flag the areas above the tunnel as needing the DSP effect. If this problem must be solved, instead of a convex polygon you will have to use a bounding volume of some kind, which is also covered in Chapter 7.

Other Sound Topics

Although this chapter has covered a great deal of topics related to sound, there still are a couple items that didn’t really warrant entire sections but at least should be mentioned in some capacity.

Doppler Effect

If you stand on a street corner and a police car drives toward you with its sirens on, the pitch of the sound increases as the police car approaches. Conversely, once the police car passes by, the pitch of the sound decreases. This is a manifestation of the Doppler effect, which is demonstrated in Figure 6.6.

Figure 6.6 Doppler effect on a police car siren.

The Doppler effect (or Doppler shift) occurs because sound waves take time to travel through the air. As the police car gets closer and closer to you, it means each successive wave will arrive a little bit earlier than the previous one. This causes an increase in frequency, which leads to the heightened pitch. At the exact moment the police car is next to you, the true pitch of the sound is audible. Finally, as the car begins to travel away from you, the waves will take increasingly longer to get to you, which leads to a lowered pitch.

It’s interesting to note that the Doppler effect doesn’t only apply to sound waves, but it applies to any kind of wave. Most notably, for light waves, a redshift occurs if the object moves further away and a blueshift occurs if the object moves closer. But in order for the shift to be noticeable, the object must either be travelling at exceptionally fast speeds or be exceptionally far away. This won’t happen at mundane speeds on earth, so the shifts are usually only noticeable in astronomy.

In games, a dynamic Doppler effect will usually only be applied for faster moving objects, such as vehicles. It technically could also apply to something like a bullet, but because they’re travelling so quickly, it’s typically preferred to just play a canned bullet flyby sound. Because Doppler shift results in the pitch increasing or decreasing, it can only be dynamically implemented if a DSP pitch shift effect is available.
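In practice, a dynamic Doppler shift boils down to computing a pitch multiplier from the emitter’s and listener’s velocities along the line between them and feeding that multiplier into the pitch shift effect. The sketch below mirrors the model OpenAL uses; the 343 m/s speed of sound and the small vector helpers are assumptions for illustration.

// A sketch of computing a Doppler pitch multiplier from the listener's and
// emitter's velocities, similar in spirit to the model OpenAL uses.
#include <algorithm>
#include <cmath>

struct Vector3 { float x, y, z; };

static float Dot(const Vector3& a, const Vector3& b)
{
   return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Returns the factor to feed into the pitch shift effect: greater than 1
// while the emitter approaches the listener, less than 1 as it recedes.
float DopplerPitchFactor(const Vector3& listenerPos, const Vector3& listenerVel,
                         const Vector3& emitterPos, const Vector3& emitterVel)
{
   const float speedOfSound = 343.0f; // meters per second (assumed units)

   Vector3 toListener = { listenerPos.x - emitterPos.x,
                          listenerPos.y - emitterPos.y,
                          listenerPos.z - emitterPos.z };
   float dist = std::sqrt(Dot(toListener, toListener));
   if (dist <= 0.0001f) { return 1.0f; }

   // Velocity components along the emitter-to-listener direction.
   float listenerSpeed = Dot(listenerVel, toListener) / dist;
   float emitterSpeed = Dot(emitterVel, toListener) / dist;

   // Clamp so nothing is treated as moving faster than sound.
   listenerSpeed = std::min(listenerSpeed, speedOfSound - 1.0f);
   emitterSpeed = std::min(emitterSpeed, speedOfSound - 1.0f);

   return (speedOfSound - listenerSpeed) / (speedOfSound - emitterSpeed);
}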

Sound Occlusion and Obstruction

Imagine you’re living in a college dorm. Being the diligent student you are, you’re hard at work studying your copy of Game Programming Algorithms and Techniques. Suddenly, a party starts up down the hall. The music is blaring, and even though your door is closed, the sound is so loud it’s distracting. You recognize the song, but something about it sounds different. The most noticeable difference is that the bass is dominant and the higher frequency sounds are muffled. This is sound occlusion in action and is illustrated in Figure 6.7(a).

Figure 6.7 Sound occlusion (a), sound obstruction (b), and Fresnel acoustic diffraction (c).

Sound occlusion occurs when sound does not have a direct path from emitter to listener, but rather must travel through some material to reach the listener. The predominant result of sound occlusion is that a low-pass filtering occurs, which means the volume of higher frequency sounds is reduced. That’s because lower frequency waves have an easier time passing through surfaces than higher frequency ones. However, another outcome of sound occlusion is an overall reduction in volume levels of all the sounds.

Similar but different is the idea of sound obstruction (also known as diffraction). With sound obstruction, the sound may not have a straight line path, but is able to travel around the obstacle, as shown in Figure 6.7(b). For example, if you yell at someone on the other side of a pillar, the sound will diffract around the pillar. One interesting result of sound obstruction is that the split waves may arrive at slightly different times. So if the waves travel toward the pillar and split, the ones on the left split may arrive a little earlier than those on the right split. This means that someone standing on the other side of the pillar may experience interaural time difference, because she will hear the sound in each ear at slightly different times.

One way to detect both occlusion and obstruction is to construct a series of vectors from the emitter to an arc around the listener. If none of the vectors can get to the listener without passing through another object first, it’s occlusion. If some of the vectors get through, then it’s obstruction. Finally, if all of the vectors get through, it is neither. This approach is known as Fresnel acoustic diffraction and is illustrated in Figure 6.7(c).

Implementing this requires being able to determine whether or not a vector intersects with any object in the world, which we have not covered yet. However, many types of such intersections are covered in Chapter 7. Specifically, we can use ray casts to determine whether the path from emitter to listener is blocked.
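A sketch of that classification might look like the following, assuming the engine exposes a RayCast query (hypothetical here) that reports whether the straight-line path between two points is blocked by level geometry.

// A sketch of the occlusion/obstruction classification described above;
// RayCast is a hypothetical physics query.
#include <vector>

struct Vector3 { float x, y, z; };

bool RayCast(const Vector3& from, const Vector3& to); // hypothetical engine query

enum class SoundPath { Clear, Obstructed, Occluded };

// samplePoints are positions on an arc around the listener (including the
// listener position itself); one ray is cast from the emitter to each point.
SoundPath ClassifySoundPath(const Vector3& emitterPos,
                            const std::vector<Vector3>& samplePoints)
{
   if (samplePoints.empty()) { return SoundPath::Clear; }

   int blocked = 0;
   for (const Vector3& point : samplePoints)
   {
      if (RayCast(emitterPos, point)) { ++blocked; }
   }

   if (blocked == static_cast<int>(samplePoints.size())) { return SoundPath::Occluded; }
   if (blocked > 0) { return SoundPath::Obstructed; }
   return SoundPath::Clear;
}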

Summary

Sound is an important part of any game. Typically, we want to have some sort of sound cue system that allows the use of metadata that describes how and when different sound files will be played. DSP effects such as reverb can be applied to sound files on playback. For 3D sounds, we care about the position of the listener in the world as well as the sound emitters. Finally, some games may have to implement the Doppler effect for fast-moving objects or sound occlusion/obstruction, depending on the environment.

Review Questions

1. Describe the difference between source sound data and the metadata associated with it.

2. What advantage do “switchable” sound cues provide over regular sound cues?

3. What are listeners and sound emitters?

4. When deciding on the position of the listener in a third-person action game, what problems must be taken into consideration?

5. What type of scale is the decibel scale?

6. What is digital signal processing? Give examples of three different audio DSP effects.

7. Why is it useful to be able to mark regions where DSP effects should be played?

8. What drawback does using a convex polygon have for DSP regions?

9. Describe the Doppler effect.

10. What are the differences between sound occlusion and sound obstruction?

Additional References

Boulanger, Richard and Victor Lazzarini, Eds. The Audio Programming Book. Boston: MIT Press, 2010. This book is a solid overview of the basics of audio signal processing and includes a large number of code examples to reinforce the concepts.

Lane, John. DSP Filter Cookbook. Stamford: Cengage Learning, 2000. This book is a bit more advanced than the prior one, but it has implementations of many specific effects, including compressors and low-pass filters.