Copyright © 1992 by Alan Stancliff. All rights reserved.

This means that we need to think of a dictionary of 200,000 words and phrases, the
largest of which might contain 50 or more phonemes. So let us assume that the largest
word or phrase has 50 phonemes, that a few words have only one, and that the
dictionary holds 200,000 entries. Then let us take a dictation of 1000 lines. If the
average word or phrase contained three or four syllables, it is not unrealistic to say
that each word or phrase would contain an average of six or eight phonemes. At any
rate, if a line had fifty keystrokes in it, it would be fair to say that each line had an
average of 45 phonemes, since there is not an exact correspondence between letters
and phonemes. The previous sentence occupies three typical lines and has about 137
phonemes, or 42 words, in it. This comes to about forty-five phonemes, or fourteen
words, per line. Assuming that these averages apply to our sample dictation, there
would be around 45,000 phonemes, or about 14,000 words, in the report.
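These per-line averages multiply out as follows (a minimal sketch; the figures of 45 phonemes and 14 words per line are the essay's own assumptions, not measurements):

```python
# Estimated size of the sample dictation, using the essay's assumed averages.
LINES = 1000
PHONEMES_PER_LINE = 45   # assumed average; letters and phonemes do not
WORDS_PER_LINE = 14      # correspond one-to-one

total_phonemes = LINES * PHONEMES_PER_LINE
total_words = LINES * WORDS_PER_LINE

print(total_phonemes)  # 45000 phonemes in the report
print(total_words)     # 14000 words in the report
```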

If our dictator clearly demarcates each word or phrase in the dictation with a clear
vocal silence, then the program might have to compare each of the 14,000 words in
this dictation to each of the 200,000 entries in the dictionary to find the closest match.
That is to say, it might have to do 2,800,000,000 comparisons (two billion, eight
hundred million). But if it needs to figure out where the word breaks are, then it needs
to compare each single phoneme, each two consecutive phonemes, each three
consecutive phonemes, and so on up to each run of 50 consecutive phonemes, for a
potential match. Following the logic illustrated in the example of our imaginary
language, there would be 45,000 single phonemes in our dictation, 44,999 consecutive
two-phoneme sequences, and so on, up to 44,951 possible fifty-phoneme sequences.
When one adds up all these possibilities, there are 2,248,775. Each of these
possibilities would require 200,000 comparisons, for a total of 449,755,000,000 (four
hundred forty-nine billion, seven hundred fifty-five million). In other words, a
computer 2,249 times more powerful than the 486 computers currently used would be
required. One could also assume that there would be several hundred to several
thousand times more ambiguities, which would mean that the screen would become
very cluttered indeed with choices.
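The counting argument above can be reproduced with a short calculation (a sketch of the arithmetic only, not of any real recognizer): a run of k consecutive phonemes can start at 45,000 − k + 1 positions, so summing over run lengths 1 through 50 gives the total number of candidate matches, and each candidate must be checked against every dictionary entry.

```python
# Count every run of 1 to 50 consecutive phonemes in the dictation,
# then multiply by the number of dictionary entries each run is
# compared against. Figures are the essay's assumptions.
DICTATION_PHONEMES = 45000
MAX_ENTRY_PHONEMES = 50       # longest dictionary entry
DICTIONARY_ENTRIES = 200000

candidate_runs = sum(DICTATION_PHONEMES - k + 1
                     for k in range(1, MAX_ENTRY_PHONEMES + 1))
comparisons = candidate_runs * DICTIONARY_ENTRIES

print(candidate_runs)  # 2248775 candidate phoneme runs
print(comparisons)     # 449755000000 dictionary comparisons
```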

But the inherent problems in creating voice recognition technology powerful enough
to replace humans would actually be much more difficult than this suggests, because
the rules of interpretation would be much more complex. Therefore, we cannot begin
to solve the problem of computer interpretation of continuous speech until the field of
parallel processing has evolved to the point where our computers and their programs
begin to approach the organization, sophistication, and complexity of that most
wondrous of computers, the human brain.