Sampling Sound with the HandyBoard
In order to perform any analysis on sound received by the microphone beyond simple volume levels, a method of sampling the sound has to be used. By sampling the inputs from the microphone over time into an array, a wave representation can be constructed.
The first thing to note about using microphones with the Handyboard is that the ports used must be specially configured to accept the microphone input. This simply means that they have to be configured for infrared sensor inputs, by cutting the resistor behind the port. Without this, the readings received from the microphones will not be useable.
The accuracy, or resolution, of a sampled sound is determined by its sampling frequency: the number of times per second that the signal is measured. CD-quality music is sampled at 44.1kHz, or 44,100 times per second, and telephone audio at around 8kHz. Our initial estimates of the sampling frequency achievable on the Handyboard were high; since the processor runs at 2MHz, it seemed safe to assume that the inputs could be sampled faster than the telephone rate, which is perfectly acceptable for voice communications. Unfortunately, we quickly discovered that the sampling rate was far lower than expected: around 500Hz.
The code to sample the sound is a very simple for-loop:
for (i = 0; i < 300; i++)
    array[i] = analog(3);   /* store one reading from analog port 3 */
In this way, the numbers in the array represent a wave. An array of size 300 was used because, at a sampling rate of 500Hz, it holds about 0.6 seconds of sound, which is roughly the length of most simple commands such as "yes" and "no".
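The buffer-size arithmetic can be checked with a small helper. This is only a sketch in plain C rather than IC; the function names are hypothetical, and the 500Hz rate is the approximate figure quoted above, not a guaranteed value.

```c
/* Duration covered by a sample buffer, in milliseconds.
   Hypothetical helper; the text uses 300 samples at roughly 500Hz. */
long buffer_duration_ms(long n_samples, long rate_hz)
{
    return (n_samples * 1000L) / rate_hz;
}

/* Number of samples needed to cover a given duration at a given rate. */
long samples_for_duration(long duration_ms, long rate_hz)
{
    return (duration_ms * rate_hz) / 1000L;
}
```

With these figures, buffer_duration_ms(300, 500) works out to 600 milliseconds, and samples_for_duration(500, 500) shows that exactly half a second would need 250 samples.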
Research on the Internet provided many sources of information and methods for speech analysis. Many of these are very complicated, relying on search networks and multiple samples to identify words. They are accurate for a large vocabulary, but require a fair amount of processing power and memory to run correctly. Additionally, most of the speech recognition projects we found were final-year student projects developed over months, not weeks. One of the works we looked at was that of Barbara Webb of the University of Nottingham, on the subject of cricket phonotaxis. Webb's research suggests that there could be a simple method for certain sound analysis problems. Considering these factors, we decided to limit the vocabulary to two words: "yes" and "no". In this way, the problem becomes simpler: finding a property that exists in one word and not the other. I searched the web for such a property, and found some interesting and useful information at:
http://www.cs.dartmouth.edu/~dwagn/aiproj/speech.html
Words are composed of individual speech sounds called phones. Fricatives are one kind of phone, produced by forcing air through a narrow gap in the mouth. As it turns out, the fricative "s" sound in "yes" has a very distinctive waveform. It is characterized by a wave that has a high frequency, or lots of "zero-crossings", and a low energy level or amplitude. By looking for this sound, software can be written to determine the difference between "yes" and "no".
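The two properties mentioned here, zero-crossings and energy level, can each be measured with a short loop over the sample array. The following is a minimal sketch in plain C, not the code used on the robot. The function names are hypothetical, and the resting level `mid` passed in by the caller (128 in the example, assuming 8-bit readings centred on mid-scale) is an assumption.

```c
#include <stdlib.h>  /* abs */

/* Count sign changes of the signal around a resting level `mid`.
   `mid` is an assumption supplied by the caller. */
int count_zero_crossings(const int *samples, int n, int mid)
{
    int crossings = 0;
    int i;
    for (i = 1; i < n; i++) {
        int prev = samples[i - 1] - mid;
        int curr = samples[i] - mid;
        if ((prev < 0 && curr >= 0) || (prev >= 0 && curr < 0))
            crossings++;
    }
    return crossings;
}

/* Mean absolute deviation from `mid`, used here as a simple
   stand-in for the energy level of the window. */
int mean_amplitude(const int *samples, int n, int mid)
{
    long total = 0;
    int i;
    if (n <= 0)
        return 0;
    for (i = 0; i < n; i++)
        total += abs(samples[i] - mid);
    return (int)(total / n);
}
```

A hiss-like window such as {130, 126, 130, 126, 130, 126} with `mid` set to 128 gives many crossings and a small mean amplitude, while a slow, loud swing such as {128, 178, 208, 178, 128, 78} gives few crossings and a large one.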
By recording the words "yes" and "no" into an array, they can be plotted in graphing software such as Excel. Transferring the values from the array into a useable file can be tricky, so we simply printed the values onto the screen using IC and copy-pasted them into emacs.
The resulting waveforms are shown next:

Figure 1: Waveform for "yes"

Figure 2: Another waveform for "yes"

Figure 3: Waveform for "no"

Figure 4: Another waveform for "no"
As seen in the above waveforms, the 'S' part of "yes" is distinctive and does not appear in the word "no". By looking for this part of the wave, a word heard by the robot can be recognized.
Unfortunately, the process isn't as easy as described. The recognizable properties of the 'S' fricative are its high number of zero-crossings and its low energy level. The problem lies with the zero-crossings: a signal sampled at 500Hz can only represent frequencies below 250Hz, so the number of zero-crossings observed is much lower than the actual number. This is true of ordinary sounds as well as the fricative, so the zero-crossing count can no longer be used to determine which sound is being made. The only remaining indication of an 'S' sound is therefore its low energy level over time. This is found by checking whether a certain length of sound lies within a range of amplitudes, as illustrated in Figure 2. If every point in a section of the wave lies within that band, the sound is probably an 'S', so the word is assumed to be "yes".
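A rough calculation shows how large the gap in zero-crossings can be. The sketch below is plain C with hypothetical function names. The 4000Hz tone and 200 millisecond duration in the usage note are illustrative assumptions standing in for a hiss, not measurements from the robot.

```c
/* Approximate zero-crossings of a steady tone: two per cycle. */
long tone_zero_crossings(long freq_hz, long duration_ms)
{
    return (2L * freq_hz * duration_ms) / 1000L;
}

/* A buffer of n samples can show at most n - 1 sign changes,
   however fast the underlying signal really is. */
long max_observable_crossings(long n_samples)
{
    return n_samples > 0 ? n_samples - 1 : 0;
}
```

Under those assumptions, tone_zero_crossings(4000, 200) is 1600, while the 100 samples taken in the same 200 milliseconds at 500Hz can show no more than max_observable_crossings(100), which is 99.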
A few more refinements have to be made to this model. If we only look for sounds lying within the boundary described, background noise would also give a result of true, so a minimum boundary, below which a point cannot fall, has to be set as well. The remaining problem is that sound is not always so uniform. In Figure 1, part of the fricative rises above the boundary, and any given 'S' sound may contain the occasional discrepancy of this kind. The algorithm we use takes this into account by allowing up to 10 errors, or points that fall outside the boundary, per sample.
The resulting algorithm is fairly accurate: as long as they are spoken clearly and slowly, the words "yes" and "no" are recognized well.
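The band test with an error allowance might look something like the sketch below. It is written in plain C and is not the code that ran on the robot. The function names, the window length, and the choice to count errors per window are assumptions. The text does not say whether the values compared are raw readings or amplitudes, so the functions simply take whatever values and bounds the caller supplies.

```c
/* Returns 1 if samples[start .. start+len-1] stays inside
   [lower, upper] with at most max_errors points outside, else 0.
   Counting errors per window is an assumption of this sketch. */
int window_in_band(const int *samples, int n, int start, int len,
                   int lower, int upper, int max_errors)
{
    int errors = 0;
    int i;
    if (start < 0 || len <= 0 || start + len > n)
        return 0;
    for (i = start; i < start + len; i++) {
        if (samples[i] < lower || samples[i] > upper) {
            errors++;
            if (errors > max_errors)
                return 0;
        }
    }
    return 1;
}

/* Slide the window across the recording; returns 1 as soon as
   one 'S'-like stretch is found, 0 if none is. */
int contains_s_like_stretch(const int *samples, int n, int len,
                            int lower, int upper, int max_errors)
{
    int start;
    for (start = 0; start + len <= n; start++)
        if (window_in_band(samples, n, start, len, lower, upper, max_errors))
            return 1;
    return 0;
}
```

A recording would then be classed as "yes" when contains_s_like_stretch returns 1 and as "no" otherwise. The lower bound rejects stretches of near-silence, and the error allowance lets the occasional spike through.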