Sunday, October 17, 2010

Panda 1.7.1 and Speech Recognition with SAPI 5.4 on Win 7 x64



Thanks to my classmate Navi for showing me this method of doing speech recognition. I'd have had a tougher time had an introduction not been given!
I also evaluated PocketSphinx for the same task and ended up sharing minor bug fixes with the developer. But due to a lack of documentation and support on setting up the code in my environment, I had to ditch that approach after a few days of work :(

Before we proceed, readers may be interested in checking out PySpeech if they are using Python 2.4 or 2.5.

Installation

  1. Install pywin32-214.win32-py2.6.exe into Panda 1.7.1's Python. Different versions of Panda ship with different Python releases, and the pywin32 installer must match the Python version.
  2. Execute the following command to make Python aware of pywin32 (assuming Panda was installed to C: )
    C:\Panda3D-1.7.1\python\python.exe C:\Panda3D-1.7.1\python\Lib\site-packages\win32com\client\makepy.py
  3. If the program does not work, the Microsoft SDK may be missing; install it. YOU'LL NEED INTERNET ACCESS FOR THIS AND IT'S A 600+ MB INSTALL. This should not be required if Visual Studio 2008 or newer is installed on the machine.
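As a quick sanity check after the steps above (a hypothetical helper, not part of the original recipe), you can verify that pywin32 can reach the SAPI COM objects:

```python
# Hypothetical sanity check: returns True only when pywin32 is installed
# and the SAPI voice object can be created (i.e. on a working Windows setup).
def sapi_available():
    try:
        import win32com.client
        win32com.client.Dispatch("SAPI.SpVoice")
        return True
    except Exception:
        return False
```

If this returns False, revisit steps 1 to 3 above.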
SpeechRecognition.py
(code based on ActiveState's recipe, copies of which can be found in many places on the internet; I *think* the original author is Inigo Surguy. I came across several versions while building and customizing this solution.)
from win32com.client import constants
import win32com.client
import pythoncom
import sys
#import the panda modules to make a task to push speech values to code.
import direct.directbase.DirectStart    #for taskMgr
from direct.task import Task            #for Task.cont
from pandac.PandaModules import *

"""SAPI 5.4 docs: http://msdn.microsoft.com/en-us/library/ee125077%28v=VS.85%29.aspx"""

"""Sample code for using the Microsoft Speech SDK 5.4 via COM in Python."""
class SpeechRecognition:
    """ Initialize the speech recognition with the passed in list of words """
    def __init__(self, wordsToAdd):
        print wordsToAdd
        # For text-to-speech
        self.speaker = win32com.client.Dispatch("SAPI.SpVoice")
        # For speech recognition - first create a listener
        self.listener = win32com.client.Dispatch("SAPI.SpSharedRecognizer")
        # Then a recognition context
        self.context = self.listener.CreateRecoContext()
        # which has an associated grammar
        self.grammar = self.context.CreateGrammar()
        # Do not allow free word recognition - only command and control
        # recognizing the words in the grammar only
        self.grammar.DictationSetState(0)
        # Create a new rule for the grammar, that is top level (so it begins
        # a recognition) and dynamic (ie we can change it at runtime)
        self.wordsRule = self.grammar.Rules.Add("wordsRule",
                        constants.SRATopLevel + constants.SRADynamic, 0)
        # Clear the rule (not necessary first time, but if we're changing it
        # dynamically then it's useful)
        self.wordsRule.Clear()
        # And go through the list of words, adding each to the rule
        for word in wordsToAdd:
            self.wordsRule.InitialState.AddWordTransition(None, word)
        # Set the wordsRule to be active
        self.grammar.Rules.Commit()
        self.grammar.CmdSetRuleState("wordsRule", 1)
        # Commit the changes to the grammar
        self.grammar.Rules.Commit()
        # And add an event handler that's called back when recognition occurs
        self.eventHandler = ContextEvents(self.context)
        # Announce we've started using speech synthesis
        self.say("Welcome!")
        #Add the task that'll push recognized sounds every frame.
        taskMgr.add(self.pushMsgs, "pushMsgs")


    """Speak a word or phrase"""
    def say(self, phrase):
        self.speaker.Speak(phrase)

    def pushMsgs(self, task):
        pythoncom.PumpWaitingMessages()
        return Task.cont

    """The engine and audio input are inactive and no audio is being read,
    even if there are rules active. The audio device will be closed in this state.
    http://msdn.microsoft.com/en-us/library/ee431860%28v=VS.85%29.aspx"""
    def stopListening(self):
        self.listener.State = constants.__getattr__("SRSInactiveWithPurge")

    def startListening(self):
        self.listener.State = constants.__getattr__("SRSActive")

    def isListening(self):
        # SRSActive == 1, SRSActiveAlways == 2 in SpeechRecognizerState
        if self.listener.State == 1 or self.listener.State == 2:
            return True

        return False


"""The callback class that handles the events raised by the speech object.
    See "Automation | SpSharedRecoContext (Events)" in the MS Speech SDK
    online help for documentation of the other events supported. """
class ContextEvents(win32com.client.getevents("SAPI.SpSharedRecoContext")):
    """Called when a word/phrase is successfully recognized  -
        ie it is found in a currently open grammar with a sufficiently high
        confidence"""
    def OnRecognition(self, StreamNumber, StreamPosition, RecognitionType, Result):
        newResult = win32com.client.Dispatch(Result)
        print "Guest said: ",newResult.PhraseInfo.GetText()
        #raise an event with the said word
        messenger.send(newResult.PhraseInfo.GetText())
Using the class in Angela.py
#Import the speech handler class
from SpeechRecognition import *


WORDS_TO_RECOGNIZE = [ "One", "Two", "Three", "Four", "ego" ] 
"""After running this, saying 'One', 'Two', 'Three', 'Four' or 'ego' should 
display 'Guest said: One' etc. on the console. When 'ego' is said, an additional 
line saying 'Angela caught the event.' should also be displayed. The recognition 
can be a bit shaky at first until you've trained it (via the Speech entry in the 
Windows Control Panel)."""
class Angela(DirectObject):
    def __init__(self):
        #INITIALIZE SPEECH RECOGNITION
        self.speechReco = SpeechRecognition(WORDS_TO_RECOGNIZE)

        #Events accepted by the world
        self.accept('ego', self.event_message_ego)
        self.accept('s', self.event_keypress_s_toggleSpeechRecognition)

    def event_message_ego(self):
        print 'Angela caught the event.'

    def event_keypress_s_toggleSpeechRecognition(self):
        if self.speechReco.isListening():
            self.speechReco.stopListening()
        else:
            self.speechReco.startListening()

if __name__ == '__main__':
    spooky = Angela()
    run()

Explaining Angela.py

Once a word is recognized, SpeechRecognition.py raises an event carrying that word, just as Panda does for a keypress; in the given example this is done for the keypress 's'. Since 'ego' is one of the words, I handle the case when 'ego' is said exactly as if I were handling a keystroke. Anything else that is said, if it is not a command, will be ignored by the speech engine system-wide (as we are using the recognizer in shared mode).

Adding more features

You can add more functionality to SpeechRecognition.py by exploring sapi.dll (%systemroot%\System32\Speech\Common\sapi.dll) in any COM browser. Below is a method to view it in one of the COM browsers that come with the Microsoft SDK.
  1. Install Microsoft SDK if not already installed.
  2. Fire up "C:\Program Files\Microsoft SDKs\Windows\v6.0A\Bin\OleView.Exe" OR Start Menu > Programs > Microsoft Windows SDK v6.0A > Tools > OLE-COM Object Viewer
  3. Select File > View TypeLib
  4. Select %systemroot%\System32\Speech\Common\sapi.dll in it.
  5. Click View > Group By Kind to see the code organized in a meaningful way.


    Note:
    • When multiple interfaces are implemented, as in the case of SAPI.SpSharedRecognizer, the methods of the interface with the red icon will be used, as that is marked as the default interface for the class.
    • When a method you select from the class has propget in its description (in the right pane), it can appear only on the right-hand side of the = sign: you can read its value but never assign to it. A method with propput in its description can be called or assigned to.
    • Everything defined as constants, even in enum, can be had as constants.__getattr__("constantName").
      eg: SpeechRecognizerState is
      enum SpeechRecognizerState {SRSInactive, SRSActive, SRSActiveAlways, SRSInactiveWithPurge}
      
      We can use its constant as
      constants.__getattr__("SRSInactiveWithPurge")

EDIT (Oct 21 2010):

Avoiding Windows commands: SpInProcRecognizer

The following code will detect only the words you have defined in your vocabulary, i.e. the wordsToAdd list. This will avoid random things happening when the speech engine recognizes Windows commands like 'close', 'escape', 'switch', etc.
Thinking aloud, it would be a good idea to associate a bigger vocabulary, i.e. the engine's default word list, with your program and fire off/handle events only when a word you want has been detected. This would take care of the ambient-noise problem to a great extent. If you instead keep a small word list as the vocabulary, you may end up with many wrong matches, as the engine will try to match any sound you utter against the list it has. One way around this would be to monitor the confidence level of each match, but I don't know whether SAPI exposes that. Sphinx does.
Well, while writing this post, I went looking for a good default word list to load into the program. After some searching, I found Grady Ward's Moby word lists. I used singles.txt from that archive and ended up adding 177,470+ words. The speech engine took a good few minutes to register them, and the sensitivity went beyond control. So I discarded that list, picked up commons.txt, replaced all capital letters with their lowercase form, and removed all lines containing ' ', ',', '!', '\', '/', '&' or '-', as they were causing the engine to throw an exception. We want to detect only single words, and the list should contain only those, in a manageable number. I am now working with 17,700+ words, and it takes around 10 seconds to load on a quad-core CPU with an NVIDIA GT330.
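The commons.txt clean-up described above can be sketched as a small filter (the REJECT set and function name are mine, not part of the Moby archive):

```python
# Lowercase every entry and drop any line containing one of the characters
# that made the engine throw an exception; keep only single words.
REJECT = set(" ,!\\/&-")

def clean_word_list(lines):
    words = []
    for line in lines:
        word = line.strip().lower()
        if word and not any(ch in REJECT for ch in word):
            words.append(word)
    return words
```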
Refining code:

Speech recognition has its own challenges. I would strongly suggest testing in an environment that resembles your presentation space. 99% of the time, the engine will not give you the word you want but a similar-sounding word. You want to collect these similar-sounding words and fire the same event when any of them is detected. Different word lists will also give different results. For example, for the word 'ego', I have added the following words, and will test more with the live performer before the final show.
self.accept('ego', self.event_message_ego)
self.accept('ito', self.event_message_ego)
self.accept('ido', self.event_message_ego)
self.accept('edile', self.event_message_ego)
self.accept('beetle', self.event_message_ego)
self.accept('beadle', self.event_message_ego)
self.accept('kiel', self.event_message_ego)
self.accept('yell', self.event_message_ego)
self.accept('gold', self.event_message_ego)
self.accept('told', self.event_message_ego)
self.accept('toll', self.event_message_ego)
self.accept('gaul', self.event_message_ego)
self.accept('whole', self.event_message_ego)
These were added after speaking softly and not-so-softly into the mic and writing down what the speech engine thought of my speech.
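The repeated accept calls above can be collapsed with a small helper (accept_aliases and EGO_ALIASES are names I made up; any object with Panda's accept method will do):

```python
# Register one handler for a whole list of similar-sounding aliases.
EGO_ALIASES = ["ego", "ito", "ido", "edile", "beetle", "beadle",
               "kiel", "yell", "gold", "told", "toll", "gaul", "whole"]

def accept_aliases(obj, aliases, handler):
    for alias in aliases:
        obj.accept(alias, handler)

# usage inside Angela.__init__:
#   accept_aliases(self, EGO_ALIASES, self.event_message_ego)
```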
Caveats: The code will throw an exception if Windows cannot detect any mic connected to the computer.

from win32com.client import constants
import win32com.client
import pythoncom
import sys
#import the panda modules to make a task to push speech values to code.
from pandac.PandaModules import *
#loadPrcFileData("", "want-directtools #t")
#loadPrcFileData("", "want-tk #t")
import direct.directbase.DirectStart    #for taskMgr
from direct.task import Task            #for Task.cont

"""SAPI 5.4 docs: http://msdn.microsoft.com/en-us/library/ee125077%28v=VS.85%29.aspx"""

"""Sample code for using the Microsoft Speech SDK 5.4 via COM in Python.
    Requires that the SDK be installed; it's a free download from
            http://microsoft.com/speech
    and that MakePy has been used on it (in PythonWin,
    select Tools | COM MakePy Utility | Microsoft Speech Object Library 5.4).

    After running this, saying "One", "Two", "Three" or "Four" should
    display "Guest said: One" etc. on the console. The recognition can be a bit
    shaky at first until you've trained it (via the Speech entry in the Windows
    Control Panel)."""
class SpeechRecognition:
    """ Initialize the speech recognition with the passed in list of words """
    def __init__(self, wordsToAdd = None):
        # For text-to-speech
        self.speaker = win32com.client.Dispatch("SAPI.SpVoice")
        # For speech recognition - first create a listener
        self.listener = win32com.client.Dispatch("SAPI.SpInProcRecognizer")
        #Set the mic (as recognized by Windows MultiMedia layer) to speech recognition engine.
        self.listener.AudioInputStream =  win32com.client.Dispatch("SAPI.SpMMAudioIn")
        # Then a recognition context
        self.context = self.listener.CreateRecoContext()
        # which has an associated grammar
        self.grammar = self.context.CreateGrammar()
        # Do not allow free word recognition - only command and control
        # recognizing the words in the grammar only
        self.grammar.DictationSetState(0)
        # Create a new rule for the grammar, that is top level (so it begins
        # a recognition) and dynamic (ie we can change it at runtime)
        self.wordsRule = self.grammar.Rules.Add("wordsRule",
                        constants.SRATopLevel + constants.SRADynamic, 0)
        # Clear the rule (not necessary first time, but if we're changing it
        # dynamically then it's useful)
        self.wordsRule.Clear()
        # And go through the list of words, adding each to the rule
        print '\nSpeechRecognition.py: Starting to add words to be recognized.'
        numWordsAdded = 0
        #BEGIN DIRTY CODE TO GET USABLE FILE FROM COMMON.TXT
        #refuse = [' ', '.', '-', '/', '!', '&']

        #writeOut = []
        #wordList = open("COMMON.TXT", "r")
        #for line in wordList:
            #word = line.strip()
            #add = True
            #for sym in refuse:
                #if sym in word:
                    #add = False

            #if add:
                #writeOut.append(word)

        #newC = open("commonWords.txt", "w")
        #for w in writeOut:
            #newC.write(w+'\n')
        #newC.close()
        #wordList.close()

        #sys.exit()
        #END DIRTY CODE TO GET USABLE FILE FROM COMMON.TXT
        if wordsToAdd is None:
            wordList = open("commonWords.txt", "r")
            for line in wordList:
                word = line.strip()
                if word:
                    self.wordsRule.InitialState.AddWordTransition(None, word)
                    numWordsAdded += 1
            wordList.close()
        else:
            for word in wordsToAdd:
                self.wordsRule.InitialState.AddWordTransition(None, word)
                numWordsAdded += 1
        print 'SpeechRecognition.py: ', numWordsAdded, ' words added.'

        # Set the wordsRule to be active
        self.grammar.Rules.Commit()
        self.grammar.CmdSetRuleState("wordsRule", 1)
        # Commit the changes to the grammar
        self.grammar.Rules.Commit()
        # And add an event handler that's called back when recognition occurs
        self.eventHandler = ContextEvents(self.context)
        # Announce we've started using speech synthesis
        #self.say("Welcome!")
        #Add the task that'll push recognized sounds every frame.
        taskMgr.add(self.pushMsgs, "pushMsgs")
        print 'SpeechRecognition.py: Done setting up speech recognition.\n'

    """Speak a word or phrase"""
    def say(self, phrase):
        self.speaker.Speak(phrase)

    def pushMsgs(self, task):
        pythoncom.PumpWaitingMessages()
        return Task.cont

    """The engine and audio input are inactive and no audio is being read,
    even if there are rules active. The audio device will be closed in this state.
    http://msdn.microsoft.com/en-us/library/ee431860%28v=VS.85%29.aspx"""
    def stopListening(self):
        self.listener.State = constants.__getattr__("SRSInactiveWithPurge")

    def startListening(self):
        self.listener.State = constants.__getattr__("SRSActive")

    def isListening(self):
        # SRSActive == 1, SRSActiveAlways == 2 in SpeechRecognizerState
        if self.listener.State == 1 or self.listener.State == 2:
            return True

        return False


"""The callback class that handles the events raised by the speech object.
    See "Automation | SpSharedRecoContext (Events)" in the MS Speech SDK
    online help for documentation of the other events supported. """
class ContextEvents(win32com.client.getevents("SAPI.SpInProcRecoContext")):
    """Called when a word/phrase is successfully recognized  -
        ie it is found in a currently open grammar with a sufficiently high
        confidence"""
    def OnRecognition(self, StreamNumber, StreamPosition, RecognitionType, Result):
        newResult = win32com.client.Dispatch(Result)
        print "Guest said: ",newResult.PhraseInfo.GetText()
        #raise an event with the said word
        messenger.send(newResult.PhraseInfo.GetText())

'''
Dev notes:
seeing the COM interface and application
- install winSDK
- fire up "C:\Program Files\Microsoft SDKs\Windows\v6.0A\Bin\OleView.Exe"
- select File > View TypeLib
- open C:\Windows\System32\Speech\Common\sapi.dll in it.
- Click View > Group By Kind to see the code organized in a meaningful way.
Note:
1. When multiple interfaces are implemented, like in case of SAPI.SpSharedRecognizer, methods of the
ones having red icon will be used, as that is marked as the default interface for that class
2. When you select some method from the class and it has propget in its description (in the right pane),
that means it will always come on the right hand side of = sign. This means you can only read this
value and never assign anything to it. You will be able to call as a method or assign values to
method that has propput in its description.
3. everything defined as constants, even in enum, can be had as
constants.__getattr__("constantName").
eg: SpeechRecognizerState is
enum SpeechRecognizerState {SRSInactive, SRSActive, SRSActiveAlways, SRSInactiveWithPurge}
We can use its constant as
constants.__getattr__("SRSInactiveWithPurge")
'''

Saturday, March 06, 2010

.NET typed dataset bug

When working with typed datasets that contain more than one table acting as a parent table (with primary keys) in a relationship, you are bound to get an exception.

Reason -
The moment the first table in the dataset is populated, ALL constraints are enforced. Since some parent tables will not yet have been populated, an exception is raised.

You can't act smart and return all parent tables in one data set and populate them in one go using
DataAdapter.TableMappings, because the exception will be thrown the moment the first table is mapped to the dataset.

The only workaround I could think of was to do the following:

  1. Populate all parent tables in one method and ensure that this method is called before any other data operation or population method.
  2. In this method, set EnforceConstraints to false before starting data retrieval and set it back to true once the retrieval is complete.
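The shape of that workaround, sketched generically in Python since the original is C#/ADO.NET (class and method names here are illustrative only, mirroring the EnforceConstraints=false / load / EnforceConstraints=true pattern):

```python
# Suspend validation while loading interdependent parent tables, then turn
# it back on once every parent table is present.
class DataSetSketch(object):
    def __init__(self):
        self.enforce_constraints = True
        self.tables = {}

    def load_parent_tables(self, parent_rows):
        self.enforce_constraints = False
        try:
            for name, rows in parent_rows.items():
                self.tables[name] = list(rows)
        finally:
            self.enforce_constraints = True
```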
Ideal behaviour - 
Only the constraints related to the populated DataTable should be activated.

This wasted loads and loads of my time. I hope it saves some of yours.

Tuesday, January 19, 2010

Tora 2.1.1 on slackware

Compiling TOra 2.1.1 from source on multilib Slackware64-13.0, with a patched cmake version 2.6.2.
The ./configure script does not detect QScintilla even though it is installed. CMake does detect it; however, the line
cmake_policy(SET CMP0011 NEW)
in CMakeLists.txt will cause the build to break.
Workaround:
Change the above line to
#cmake_policy(SET CMP0011 NEW) #comment the line out
Change
CMAKE_MINIMUM_REQUIRED(VERSION 2.4.5 FATAL_ERROR)
to
#CMAKE_MINIMUM_REQUIRED(VERSION 2.4.5 FATAL_ERROR) #commented
Just below the line where the previous change was made, add the lines from the workaround link:
cmake_minimum_required(VERSION 2.6.2)
if(POLICY CMP0011)
cmake_policy(SET CMP0011 OLD) # or even better, NEW
endif(POLICY CMP0011)
Save the file.
Installing without Oracle and PostgreSQL support (I intend to use it with MySQL):
cmake -DENABLE_PGSQL=0 -DENABLE_ORACLE=0 .
make
make install

Edit: On a 64-bit system, TOra looks for
/usr/lib64/libqscintilla2.so.5
However, QScintilla does not install the shared library in the path TOra is looking at. Solution: make a symlink!
bash-3.1# ln -s /usr/lib64/qt/lib/libqscintilla2.so.5 /usr/lib64/libqscintilla2.so.5

Pinpointed the error using strace (the system-call tracer):
strace /usr/local/bin/tora

Saturday, January 16, 2010

Installing flash on Slackware64-13.0

The Flash plugin is available in the extra directory, but it needs nspluginwrapper installed to work. The following steps install to /usr/lib, so make sure you have write permissions for that directory.
1) Install nspluginwrapper
a. Make sure the target file listed below has executable permissions, then issue the following command.
<slackware64 Dump>/extra/source/nspluginwrapper/nspluginwrapper.SlackBuild
At the end of this command, something like the following will appear; it is the name and path of the package that is ready to be installed.
Slackware package /tmp/nspluginwrapper-1.0.0-x86_64-1.txz created.
b. Now use installpkg to install the package just created. For the file mentioned above, I used the following command:
installpkg /tmp/nspluginwrapper-1.0.0-x86_64-1.txz
2) Time to install Flash using the same procedure as above.
a. Download and compile the package for your system. The SlackBuild script will additionally handle the download for you and produce an installable package as it normally does. Again, check that the SlackBuild script has executable permissions.
<slackware64 Dump>/extra/source/flashplayer-plugin/flashplayer-plugin.SlackBuild
b. Use installpkg with the .txz file generated by the above script to install it.
installpkg /path/to/flashplayer-plugin-10.0.32.18-x86_64-1.txz

This installs Flash system-wide. It also worked in Seamonkey.

