This means that we need to think of a dictionary of 200,000 words and phrases,
the largest of which might contain 50 or more phonemes. So let us assume that
the dictionary has 200,000 entries, that the largest word or phrase has 50
phonemes, and that a few words have only one. Then let us take our dictation
with 1,000 lines. If the average word or phrase contained three or four
syllables, it is not unrealistic to say that each word or phrase would contain
an average of six to eight phonemes. At any rate, if a line had fifty keystrokes
in it, it would be fair to say each line had an average of 45 phonemes, being as
there is not an exact correspondence between letters and phonemes. The previous
sentence occupies three typical lines and has about 137 phonemes or 42 words in
it. This comes to about forty-five phonemes or fourteen words a line, and
assuming that these averages apply to our sample dictation, there would be
around 45,000 phonemes or about 14,000 words in the report.
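
The per-line estimates above can be reproduced with a few lines of arithmetic.
The Python sketch below is only an illustration: the variable names are invented
here, the figures are the ones quoted in the text, and the per-line averages are
rounded down on the assumption that this is how the text arrives at "about
forty-five" and "fourteen".

```python
# Figures quoted in the text; all variable names are hypothetical.
SAMPLE_PHONEMES = 137    # phonemes counted in the three-line sample sentence
SAMPLE_WORDS = 42        # words counted in the same sentence
SAMPLE_LINES = 3         # typical lines the sample sentence occupies
DICTATION_LINES = 1000   # length of the sample dictation

# Per-line averages, rounded down (assumption: the text rounds 45.67 to 45).
phonemes_per_line = SAMPLE_PHONEMES // SAMPLE_LINES
words_per_line = SAMPLE_WORDS // SAMPLE_LINES

# Scale the per-line averages up to the whole dictation.
total_phonemes = phonemes_per_line * DICTATION_LINES
total_words = words_per_line * DICTATION_LINES

print(phonemes_per_line, words_per_line)
print(f"{total_phonemes:,} phonemes, {total_words:,} words")
```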
If our dictator sets off each dictionary word or phrase with a clear vocal
silence, then the program might have to compare each of the 14,000 words in
this dictation to each of the 200,000 entries in the dictionary to find the
closest match.
That is to say, it might have to do 2,800,000,000 comparisons (two billion,
eight hundred million). But if it needs to figure out where the word breaks
are, then it needs to compare each phoneme, each two consecutive phonemes, each
three consecutive phonemes, etc., up to each 50 consecutive phonemes for a
potential match. Following the logic illustrated in the example of our
imaginary language, there would be 45,000 single phonemes in our dictation,
44,999 runs of two consecutive phonemes, etc., up to 44,951 runs of fifty
consecutive phonemes. When one adds up all these possibilities, there are
2,248,775. Each of these possibilities would require 200,000 comparisons, for a
total of 449,755,000,000 (four hundred forty-nine billion, seven hundred
fifty-five million). In other words, a computer which was 2,249 times more
powerful than the 486 computers currently used would be required. One could
also assume that there would be several hundred to several thousand times more
ambiguities, which would mean that the screen would become very cluttered,
indeed, with choices.
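
The comparison counts above follow from a simple sum, which the short Python
sketch below reproduces. It is an illustration rather than part of any actual
recognition program: the function and variable names are invented here, and it
assumes, as the text does, that every candidate run of phonemes is compared
against every dictionary entry.

```python
# Figures quoted in the text; all names are hypothetical.
DICTIONARY_ENTRIES = 200_000
DICTATION_WORDS = 14_000
DICTATION_PHONEMES = 45_000
MAX_ENTRY_PHONEMES = 50   # longest word or phrase in the dictionary


def candidate_runs(total_phonemes: int, max_len: int) -> int:
    """Count the runs of 1..max_len consecutive phonemes in the dictation.

    A dictation of n phonemes contains n - k + 1 runs of length k.
    """
    return sum(total_phonemes - k + 1 for k in range(1, max_len + 1))


# Case 1: the speaker marks every word break with a silence.
demarcated = DICTATION_WORDS * DICTIONARY_ENTRIES

# Case 2: the program must find the word breaks itself.
candidates = candidate_runs(DICTATION_PHONEMES, MAX_ENTRY_PHONEMES)
continuous = candidates * DICTIONARY_ENTRIES

print(f"{demarcated:,}")   # 2,800,000,000
print(f"{candidates:,}")   # 2,248,775
print(f"{continuous:,}")   # 449,755,000,000
```

Changing the constants at the top shows how quickly the total grows with the
length of the dictation or the size of the dictionary.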
But the problem of creating voice recognition technology powerful enough to
replace humans would actually be much more difficult than this, because the
rules of interpretation would be much more complex. Therefore, we cannot begin
to solve the problem of computer interpretation of continuous speech until the
field of parallel processing has evolved to the point where our computers and
their programs begin to approach the organization, sophistication, and
complexity of that most wondrous of computers, the human brain.