Thanks to my classmate Navi for showing me this method of doing speech recognition. I'd have had a much tougher time without that introduction! I also evaluated
PocketSphinx for the same task and ended up sharing minor bug fixes with the developer, but due to the lack of documentation and support on setting up the code in my environment, I had to abandon that approach after a few days of work :(
Before we proceed, readers on Python 2.4 or 2.5 may be interested in checking out
PySpeech.
Installation
- Install pywin32-214.win32-py2.6.exe into Panda3D 1.7.1. Different versions of Panda ship with different Python releases, and pywin32 depends on the Python version.
- Execute the following command to make python aware of pywin32 (assuming Panda was installed to C: )
C:\Panda3D-1.7.1\python\python.exe C:\Panda3D-1.7.1\python\Lib\site-packages\win32com\client\makepy.py
- If the program does not work, the Microsoft Windows SDK may be missing; install it. You'll need an internet connection for this, and it's a 600+ MB install. It should not be required if Visual Studio 2008 or newer is installed on the machine.
SpeechRecognition.py
(code based on
ActiveState's recipe, which can be found in many places on the internet:
here (who I *think* is the original author, Inigo Surguy),
here,
here, and many other places I came across while building and customizing this solution)
from win32com.client import constants
import win32com.client
import pythoncom
import sys

# Import the Panda3D modules needed to create a task that pumps
# recognized speech events into the application every frame.
import direct.directbase.DirectStart  # for taskMgr
from direct.task import Task  # for Task.cont
from pandac.PandaModules import *

"""SAPI 5.4 docs: http://msdn.microsoft.com/en-us/library/ee125077%28v=VS.85%29.aspx"""
"""Sample code for using the Microsoft Speech SDK 5.4 via COM in Python."""

class SpeechRecognition:
    def __init__(self, wordsToAdd):
        """Initialize the speech recognition with the passed-in list of words."""
        print wordsToAdd
        # For text-to-speech
        self.speaker = win32com.client.Dispatch("SAPI.SpVoice")
        # For speech recognition - first create a listener
        self.listener = win32com.client.Dispatch("SAPI.SpSharedRecognizer")
        # Then a recognition context
        self.context = self.listener.CreateRecoContext()
        # which has an associated grammar
        self.grammar = self.context.CreateGrammar()
        # Do not allow free word recognition - only command and control,
        # recognizing the words in the grammar only
        self.grammar.DictationSetState(0)
        # Create a new rule for the grammar that is top level (so it begins
        # a recognition) and dynamic (i.e. we can change it at runtime)
        self.wordsRule = self.grammar.Rules.Add("wordsRule",
                constants.SRATopLevel + constants.SRADynamic, 0)
        # Clear the rule (not necessary the first time, but useful if we're
        # changing it dynamically)
        self.wordsRule.Clear()
        # Go through the list of words, adding each to the rule
        for word in wordsToAdd:
            self.wordsRule.InitialState.AddWordTransition(None, word)
        # Set the wordsRule to be active
        self.grammar.Rules.Commit()
        self.grammar.CmdSetRuleState("wordsRule", 1)
        # Commit the changes to the grammar
        self.grammar.Rules.Commit()
        # Add an event handler that's called back when recognition occurs
        self.eventHandler = ContextEvents(self.context)
        # Announce we've started, using speech synthesis
        self.say("Welcome!")
        # Add the task that'll push recognized sounds every frame
        taskMgr.add(self.pushMsgs, "pushMsgs")

    def say(self, phrase):
        """Speak a word or phrase."""
        self.speaker.Speak(phrase)

    def pushMsgs(self, task):
        pythoncom.PumpWaitingMessages()
        return Task.cont

    def stopListening(self):
        """The engine and audio input become inactive and no audio is read,
        even if there are rules active. The audio device is closed in this state.
        http://msdn.microsoft.com/en-us/library/ee431860%28v=VS.85%29.aspx"""
        self.listener.State = constants.__getattr__("SRSInactiveWithPurge")

    def startListening(self):
        self.listener.State = constants.__getattr__("SRSActive")

    def isListening(self):
        # SRSActive == 1, SRSActiveAlways == 2
        return self.listener.State == 1 or self.listener.State == 2

class ContextEvents(win32com.client.getevents("SAPI.SpSharedRecoContext")):
    """The callback class that handles the events raised by the speech object.
    See "Automation | SpSharedRecoContext (Events)" in the MS Speech SDK
    online help for documentation of the other events supported."""
    def OnRecognition(self, StreamNumber, StreamPosition, RecognitionType, Result):
        """Called when a word/phrase is successfully recognized -
        i.e. it is found in a currently open grammar with a sufficiently
        high confidence."""
        newResult = win32com.client.Dispatch(Result)
        print "Guest said: ", newResult.PhraseInfo.GetText()
        # Raise an event with the said word
        messenger.send(newResult.PhraseInfo.GetText())
Using the class in Angela.py
# Import the speech handler class
from SpeechRecognition import *
from direct.showbase.DirectObject import DirectObject

WORDS_TO_RECOGNIZE = ["One", "Two", "Three", "Four", "ego"]

"""After running this, saying 'One', 'Two', 'Three', 'Four' or 'ego' should
display 'Guest said: One' etc. on the console. When 'ego' is said, an additional
line saying 'Angela caught the event.' should also be displayed. The recognition
can be a bit shaky at first until you've trained it (via the Speech entry in the
Windows Control Panel)."""

class Angela(DirectObject):
    def __init__(self):
        # Initialize speech recognition
        self.speechReco = SpeechRecognition(WORDS_TO_RECOGNIZE)
        # Events accepted by the world
        self.accept('ego', self.event_message_ego)
        self.accept('s', self.event_keypress_s_toggleSpeechRecognition)

    def event_message_ego(self):
        print 'Angela caught the event.'

    def event_keypress_s_toggleSpeechRecognition(self):
        if self.speechReco.isListening():
            self.speechReco.stopListening()
        else:
            self.speechReco.startListening()

if __name__ == '__main__':
    spooky = Angela()
    run()
Explaining Angela.py
Once a word is recognized, SpeechRecognition.py raises an event carrying that word, just as Panda3D does for a keypress; in the example this is done for the keypress 's'. Since 'ego' is one of the words, I handle it being said exactly as if I were handling a keystroke. Anything else that is said, if it is not a command, is ignored by the speech engine system-wide (as we are using the recognizer in shared mode).
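Panda3D's messenger is what carries the recognized word from SpeechRecognition.py to Angela. A minimal pure-Python stand-in (this is an illustrative sketch, not Panda3D's actual implementation) shows the send/accept flow:

```python
# A minimal stand-in for Panda3D's messenger: objects register handlers
# for named events with accept(), and anyone can fire an event with send().
class Messenger:
    def __init__(self):
        self.handlers = {}  # event name -> list of callables

    def accept(self, event, handler):
        self.handlers.setdefault(event, []).append(handler)

    def send(self, event):
        # Events nobody accepted are silently dropped, just like
        # recognized words the game code does not care about.
        for handler in self.handlers.get(event, []):
            handler()

messenger = Messenger()
caught = []
messenger.accept('ego', lambda: caught.append('Angela caught the event.'))

messenger.send('ego')    # handled, like a recognized keyword
messenger.send('hello')  # no handler registered, ignored
```

The real messenger also delivers keypress events through the same mechanism, which is why speech words and the 's' key are handled identically in Angela.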
Adding more features
You can add more functionality to SpeechRecognition.py by exploring sapi.dll (%systemroot%\System32\Speech\Common\sapi.dll) in any COM browser. Here is how to view it in the COM browser that ships with the Microsoft Windows SDK.
- Install Microsoft SDK if not already installed.
- Fire up "C:\Program Files\Microsoft SDKs\Windows\v6.0A\Bin\OleView.Exe" OR Start Menu > Programs > Microsoft Windows SDK v6.0A > Tools > OLE-COM Object Viewer
- Select File > View TypeLib
- Select %systemroot%\System32\Speech\Common\sapi.dll in it.
- Click View > Group By Kind to see the code organized in a meaningful way.
Note:
- When a class implements multiple interfaces, as SAPI.SpSharedRecognizer does, the methods of the interface shown with the red icon are the ones used, since that interface is marked as the default for the class.
- When a method you select has propget in its description (in the right pane), it can only appear on the right-hand side of an = sign: you can read the value but never assign anything to it. You can call, or assign values to, a method that has propput in its description.
- Everything defined as a constant, even inside an enum, can be had as constants.__getattr__("constantName").
eg: SpeechRecognizerState is
enum SpeechRecognizerState {SRSInactive, SRSActive, SRSActiveAlways, SRSInactiveWithPurge}
We can use its constant as
constants.__getattr__("SRSInactiveWithPurge")
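Note that constants.__getattr__("name") is just the explicit spelling of normal attribute lookup; getattr(constants, "name") and plain constants.name resolve to the same value. A small stand-in class (not the real win32com constants object, which resolves names from the type library) demonstrates the equivalence:

```python
# A stand-in for the constants container win32com generates from the type
# library; the real one also resolves unknown names via __getattr__.
class FakeConstants(object):
    _values = {"SRSInactive": 0, "SRSActive": 1,
               "SRSActiveAlways": 2, "SRSInactiveWithPurge": 3}

    def __getattr__(self, name):
        try:
            return self._values[name]
        except KeyError:
            raise AttributeError(name)

constants = FakeConstants()
# All three lookups resolve to the same enum value:
a = constants.__getattr__("SRSInactiveWithPurge")  # explicit call
b = getattr(constants, "SRSInactiveWithPurge")     # idiomatic spelling
c = constants.SRSInactiveWithPurge                 # plain attribute access
```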
EDIT (Oct 21 2010):
Avoiding Windows commands: SpInProcRecognizer
The following code will detect only the words you have defined in your vocabulary, i.e. the wordsToAdd list. This avoids random things happening when the speech engine recognizes Windows commands like 'close', 'escape' and 'switch'.
Thinking aloud, it would be a good idea to associate a bigger vocabulary, i.e. the engine's default word list, with your program and fire off/handle events only when a word you actually want has been detected. This takes care of the ambient-noise problem to a great extent. If you instead keep a small vocabulary, you may get many wrong matches, because the engine will try to match any sound you utter against the short list it has. One way around this would be to monitor the confidence level of each match, but I don't know whether SAPI will expose that; Sphinx does.
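That idea can be sketched as a filter between the recognizer and the event layer. This is plain illustrative Python: handle_recognition is a hypothetical stand-in for what the OnRecognition callback would do, and the word lists are examples.

```python
# A large vocabulary keeps the engine from force-matching every noise onto
# a tiny word list; we then forward only the words the application wants.
VOCABULARY = set(["one", "two", "three", "four", "ego",
                  "close", "escape", "switch", "hello"])  # fed to the grammar
COMMANDS = set(["ego"])  # the only words that should fire events

def handle_recognition(word, fired):
    """Hypothetical dispatch step, called with each recognized word."""
    word = word.lower()
    if word in COMMANDS:
        fired.append(word)  # raise the event
    # everything else the engine recognized is simply dropped

fired = []
for heard in ["hello", "ego", "switch", "ego"]:
    handle_recognition(heard, fired)
# fired is now ["ego", "ego"]
```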
Well, while writing this post, I went looking for a good default word list to load into the program. After some searching, I found
Grady Ward's Moby word lists. I first used singles.txt from that archive and ended up adding 177,470+ words. The speech engine took a good few minutes to register them, and the sensitivity went beyond control. So I discarded that list, picked up commons.txt, lowercased all capital letters, and removed every line containing ' ', ',', '!', '\', '/', '&' or '-', as those entries made the engine throw an exception. We want to detect single words only, and the list should contain just those, in a manageable number. I am now working with 17,700+ words, and the list takes around 10 seconds to load on a quad-core CPU with an NVIDIA GT330.
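The cleanup just described can be done in a few lines of plain Python. The sample words below are illustrative; the filter logic is the point:

```python
# Keep only single, punctuation-free words from a raw list, lowercased.
REFUSE = [' ', ',', '!', '\\', '/', '&', '-']

def clean_words(lines):
    cleaned = []
    for line in lines:
        word = line.strip().lower()
        if word and not any(sym in word for sym in REFUSE):
            cleaned.append(word)
    return cleaned

raw = ["Aachen\n", "a la carte\n", "anti-hero\n", "AC/DC\n", "ego\n"]
usable = clean_words(raw)  # ['aachen', 'ego']
```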
Refining the code:
Speech recognition has its own challenges. I would
strongly suggest testing in an environment that resembles your presentation space. 99% of the time, the engine will not give you the word you want but a similar-sounding one. You want to collect all these similar-sounding words and fire the same event when any of them is detected. Different word lists will also give different results. For example, for the word 'ego' I have added the following words, and will test more with the live performer before the final show.
self.accept('ego', self.event_message_ego)
self.accept('ito', self.event_message_ego)
self.accept('ido', self.event_message_ego)
self.accept('edile', self.event_message_ego)
self.accept('beetle', self.event_message_ego)
self.accept('beadle', self.event_message_ego)
self.accept('kiel', self.event_message_ego)
self.accept('yell', self.event_message_ego)
self.accept('gold', self.event_message_ego)
self.accept('told', self.event_message_ego)
self.accept('toll', self.event_message_ego)
self.accept('gaul', self.event_message_ego)
self.accept('whole', self.event_message_ego)
These were added after speaking softly and not-so-softly into the mic and writing down what the speech engine thought I said.
Caveats: the code will throw an exception if Windows cannot detect any mic connected to the computer.
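Rather than a dozen accept() calls, the alias list above can be kept in one table and registered in a loop. This is a sketch: register_aliases and the dict-based demo stand in for Angela's real DirectObject.accept calls.

```python
# Map every similar-sounding word to the one event handler it should fire.
EGO_ALIASES = ["ego", "ito", "ido", "edile", "beetle", "beadle", "kiel",
               "yell", "gold", "told", "toll", "gaul", "whole"]

def register_aliases(accept, aliases, handler):
    """Call accept(word, handler) once per alias, as Angela.__init__ would."""
    for word in aliases:
        accept(word, handler)

# Demo with a plain dict instead of Panda3D's DirectObject.accept:
registry = {}
register_aliases(lambda w, h: registry.__setitem__(w, h),
                 EGO_ALIASES, "event_message_ego")
```

Keeping the aliases in one list also makes it painless to grow the set during rehearsals: add the misheard word to the list and every alias still routes to the same handler.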
from win32com.client import constants
import win32com.client
import pythoncom
import sys

# Import the Panda3D modules needed to create a task that pumps
# recognized speech events into the application every frame.
from pandac.PandaModules import *
#loadPrcFileData("", "want-directtools #t")
#loadPrcFileData("", "want-tk #t")
import direct.directbase.DirectStart  # for taskMgr
from direct.task import Task  # for Task.cont

"""SAPI 5.4 docs: http://msdn.microsoft.com/en-us/library/ee125077%28v=VS.85%29.aspx"""
"""Sample code for using the Microsoft Speech SDK via COM in Python.
Requires that the SDK be installed; it's a free download from
http://microsoft.com/speech
and that MakePy has been used on it (in PythonWin,
select Tools | COM MakePy Utility | Microsoft Speech Object Library 5.1).
After running this, saying "One", "Two", "Three" or "Four" should
display "You said One" etc. on the console. The recognition can be a bit
shaky at first until you've trained it (via the Speech entry in the Windows
Control Panel)."""

class SpeechRecognition:
    def __init__(self, wordsToAdd = None):
        """Initialize the speech recognition with the passed-in list of words."""
        # For text-to-speech
        self.speaker = win32com.client.Dispatch("SAPI.SpVoice")
        # For speech recognition - first create a listener. The in-process
        # recognizer keeps recognition private to this application instead
        # of sharing the engine (and Windows commands) system-wide.
        self.listener = win32com.client.Dispatch("SAPI.SpInProcRecognizer")
        # Attach the mic (as recognized by the Windows multimedia layer)
        # to the speech recognition engine.
        self.listener.AudioInputStream = win32com.client.Dispatch("SAPI.SpMMAudioIn")
        # Then a recognition context
        self.context = self.listener.CreateRecoContext()
        # which has an associated grammar
        self.grammar = self.context.CreateGrammar()
        # Do not allow free word recognition - only command and control,
        # recognizing the words in the grammar only
        self.grammar.DictationSetState(0)
        # Create a new rule for the grammar that is top level (so it begins
        # a recognition) and dynamic (i.e. we can change it at runtime)
        self.wordsRule = self.grammar.Rules.Add("wordsRule",
                constants.SRATopLevel + constants.SRADynamic, 0)
        # Clear the rule (not necessary the first time, but useful if we're
        # changing it dynamically)
        self.wordsRule.Clear()
        # Go through the list of words, adding each to the rule
        print '\nSpeechRecognition.py: Starting to add words to be recognized.'
        numWordsAdded = 0
        # BEGIN DIRTY CODE TO GET A USABLE FILE FROM COMMON.TXT
        #refuse = [' ', '.', '-', '/', '!', '&']
        #writeOut = []
        #wordList = open("COMMON.TXT", "r")
        #for line in wordList:
        #    word = line.strip()
        #    if not any(sym in word for sym in refuse):
        #        writeOut.append(word)
        #wordList.close()
        #newC = open("commonWords.txt", "w")
        #for w in writeOut:
        #    newC.write(w + '\n')
        #newC.close()
        #sys.exit()
        # END DIRTY CODE TO GET A USABLE FILE FROM COMMON.TXT
        if wordsToAdd == None:
            # Iterate over the file directly; the earlier
            # "while wordList.readline() != ''" loop called readline()
            # twice per pass and silently skipped every other word.
            wordList = open("commonWords.txt", "r")
            for line in wordList:
                word = line.strip()
                if word:
                    self.wordsRule.InitialState.AddWordTransition(None, word)
                    numWordsAdded += 1
            wordList.close()
        else:
            for word in wordsToAdd:
                self.wordsRule.InitialState.AddWordTransition(None, word)
                numWordsAdded += 1
        print 'SpeechRecognition.py: ', numWordsAdded, ' words added.'
        # Set the wordsRule to be active
        self.grammar.Rules.Commit()
        self.grammar.CmdSetRuleState("wordsRule", 1)
        # Commit the changes to the grammar
        self.grammar.Rules.Commit()
        # Add an event handler that's called back when recognition occurs
        self.eventHandler = ContextEvents(self.context)
        # Announce we've started, using speech synthesis
        #self.say("Welcome!")
        # Add the task that'll push recognized sounds every frame
        taskMgr.add(self.pushMsgs, "pushMsgs")
        print 'SpeechRecognition.py: Done setting up speech recognition.\n'

    def say(self, phrase):
        """Speak a word or phrase."""
        self.speaker.Speak(phrase)

    def pushMsgs(self, task):
        pythoncom.PumpWaitingMessages()
        return Task.cont

    def stopListening(self):
        """The engine and audio input become inactive and no audio is read,
        even if there are rules active. The audio device is closed in this state.
        http://msdn.microsoft.com/en-us/library/ee431860%28v=VS.85%29.aspx"""
        self.listener.State = constants.__getattr__("SRSInactiveWithPurge")

    def startListening(self):
        self.listener.State = constants.__getattr__("SRSActive")

    def isListening(self):
        # SRSActive == 1, SRSActiveAlways == 2
        return self.listener.State == 1 or self.listener.State == 2

class ContextEvents(win32com.client.getevents("SAPI.SpInProcRecoContext")):
    """The callback class that handles the events raised by the speech object.
    See "Automation | SpSharedRecoContext (Events)" in the MS Speech SDK
    online help for documentation of the other events supported."""
    def OnRecognition(self, StreamNumber, StreamPosition, RecognitionType, Result):
        """Called when a word/phrase is successfully recognized -
        i.e. it is found in a currently open grammar with a sufficiently
        high confidence."""
        newResult = win32com.client.Dispatch(Result)
        print "Guest said: ", newResult.PhraseInfo.GetText()
        # Raise an event with the said word
        messenger.send(newResult.PhraseInfo.GetText())

'''
Dev notes: seeing the COM interface and application
- Install the Microsoft Windows SDK.
- Fire up "C:\Program Files\Microsoft SDKs\Windows\v6.0A\Bin\OleView.Exe".
- Select File > View TypeLib.
- Open C:\Windows\System32\Speech\Common\sapi.dll in it.
- Click View > Group By Kind to see the code organized in a meaningful way.
Note:
1. When multiple interfaces are implemented, as in the case of
   SAPI.SpSharedRecognizer, the methods of the interface with the red icon
   are used, as that one is marked as the default interface for the class.
2. When a method you select has propget in its description (in the right
   pane), it can only appear on the right-hand side of an = sign: you can
   read the value but never assign anything to it. You can call, or assign
   values to, a method that has propput in its description.
3. Everything defined as a constant, even in an enum, can be had as
   constants.__getattr__("constantName").
   eg: SpeechRecognizerState is
   enum SpeechRecognizerState {SRSInactive, SRSActive, SRSActiveAlways, SRSInactiveWithPurge}
   We can use its constant as
   constants.__getattr__("SRSInactiveWithPurge")
'''