Continuous speech recognition using SFSpeechRecognizer (iOS 10 beta)

I am trying to perform continuous speech recognition with AVCapture on iOS 10 beta. I have set up captureOutput(...) so that I continuously get CMSampleBuffers. I put these buffers directly into the SFSpeechAudioBufferRecognitionRequest I set up earlier, like this:

    ... do some setup
    SFSpeechRecognizer.requestAuthorization { authStatus in
        if authStatus == SFSpeechRecognizerAuthorizationStatus.authorized {
            self.m_recognizer = SFSpeechRecognizer()
            self.m_recognRequest = SFSpeechAudioBufferRecognitionRequest()
            self.m_recognRequest?.shouldReportPartialResults = false
            self.m_isRecording = true
        } else {
            print("not authorized")
        }
    }
    .... do further setup

    func captureOutput(_ captureOutput: AVCaptureOutput!, didOutputSampleBuffer sampleBuffer: CMSampleBuffer!, from connection: AVCaptureConnection!) {
        if (!m_AV_initialized) {
            print("captureOutput(...): not initialized !")
            return
        }
        if (!m_isRecording) {
            return
        }

        let formatDesc = CMSampleBufferGetFormatDescription(sampleBuffer)
        let mediaType = CMFormatDescriptionGetMediaType(formatDesc!)
        if (mediaType == kCMMediaType_Audio) {
            // process audio here
            m_recognRequest?.appendAudioSampleBuffer(sampleBuffer)
        }
        return
    }

The whole thing only runs for a few seconds. Then captureOutput is not called any more. If I comment out the appendAudioSampleBuffer(sampleBuffer) line, captureOutput is called for as long as the app runs (as expected). Obviously, putting the sample buffers into the speech recognition engine somehow blocks further execution. My guess is that the available buffers are consumed after some time and the process somehow stalls because it cannot get any more buffers???

I should mention that everything recorded during the first 2 seconds leads to correct recognition. I just don't know how exactly the SFSpeech API works, since Apple did not put any text into the beta docs. BTW: how do I use SFSpeechAudioBufferRecognitionRequest.endAudio()?
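As far as I can tell, endAudio() just signals that no more audio will be appended, so the recognizer can finalize the pending result. A minimal sketch of how I would expect to call it, reusing the m_isRecording / m_recognRequest properties from the setup above (my own guess, untested):

    // Sketch only (hypothetical stop helper, not part of the code above):
    func stopRecognition() {
        m_isRecording = false          // captureOutput(...) stops appending buffers
        m_recognRequest?.endAudio()    // tell the request that the audio stream is finished
        m_recognRequest = nil          // a fresh request is needed for the next utterance
    }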

Does anyone have any insight into what is going on here?

Thanks, Chris

I converted the SpeakToMe sample Swift code from the Speech Recognition WWDC developer talk to Objective-C, and it worked for me. For Swift, see https://developer.apple.com/videos/play/wwdc2016/509/ , or for Objective-C see below.

    - (void)viewDidAppear:(BOOL)animated {
        _recognizer = [[SFSpeechRecognizer alloc] initWithLocale:[NSLocale localeWithLocaleIdentifier:@"en-US"]];
        [_recognizer setDelegate:self];

        [SFSpeechRecognizer requestAuthorization:^(SFSpeechRecognizerAuthorizationStatus authStatus) {
            switch (authStatus) {
                case SFSpeechRecognizerAuthorizationStatusAuthorized:
                    // User gave access to speech recognition
                    NSLog(@"Authorized");
                    break;
                case SFSpeechRecognizerAuthorizationStatusDenied:
                    // User denied access to speech recognition
                    NSLog(@"SFSpeechRecognizerAuthorizationStatusDenied");
                    break;
                case SFSpeechRecognizerAuthorizationStatusRestricted:
                    // Speech recognition restricted on this device
                    NSLog(@"SFSpeechRecognizerAuthorizationStatusRestricted");
                    break;
                case SFSpeechRecognizerAuthorizationStatusNotDetermined:
                    // Speech recognition not yet authorized
                    break;
                default:
                    NSLog(@"Default");
                    break;
            }
        }];

        audioEngine = [[AVAudioEngine alloc] init];
        _speechSynthesizer = [[AVSpeechSynthesizer alloc] init];
        [_speechSynthesizer setDelegate:self];
    }

    - (void)startRecording {
        [self clearLogs:nil];

        NSError *outError;

        AVAudioSession *audioSession = [AVAudioSession sharedInstance];
        [audioSession setCategory:AVAudioSessionCategoryRecord error:&outError];
        [audioSession setMode:AVAudioSessionModeMeasurement error:&outError];
        [audioSession setActive:true withOptions:AVAudioSessionSetActiveOptionNotifyOthersOnDeactivation error:&outError];

        request2 = [[SFSpeechAudioBufferRecognitionRequest alloc] init];

        inputNode = [audioEngine inputNode];

        if (request2 == nil) {
            NSLog(@"Unable to create a SFSpeechAudioBufferRecognitionRequest object");
        }
        if (inputNode == nil) {
            NSLog(@"Unable to create an inputNode object");
        }

        request2.shouldReportPartialResults = true;
        _currentTask = [_recognizer recognitionTaskWithRequest:request2 delegate:self];

        [inputNode installTapOnBus:0 bufferSize:4096 format:[inputNode outputFormatForBus:0] block:^(AVAudioPCMBuffer *buffer, AVAudioTime *when) {
            NSLog(@"Block tap!");
            [request2 appendAudioPCMBuffer:buffer];
        }];

        [audioEngine prepare];
        [audioEngine startAndReturnError:&outError];
        NSLog(@"Error %@", outError);
    }

    - (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishRecognition:(SFSpeechRecognitionResult *)result {
        NSLog(@"speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishRecognition");
        NSString *translatedString = [[[result bestTranscription] formattedString] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
        [self log:translatedString];

        if ([result isFinal]) {
            [audioEngine stop];
            [inputNode removeTapOnBus:0];
            _currentTask = nil;
            request2 = nil;
        }
    }

I use SFSpeechRecognizer continuously with success. The key point is to capture the audio with AVCaptureSession and pass it to the SpeechRecognizer. Sorry, I am poor at Swift, so here is the ObjC version only.

Here is my sample code (some UI code is left out; the important parts are marked with comments):

    @interface ViewController ()
    @property (nonatomic, strong) AVCaptureSession *capture;
    @property (nonatomic, strong) SFSpeechAudioBufferRecognitionRequest *speechRequest;
    @end

    @implementation ViewController

    - (void)startRecognizer {
        [SFSpeechRecognizer requestAuthorization:^(SFSpeechRecognizerAuthorizationStatus status) {
            if (status == SFSpeechRecognizerAuthorizationStatusAuthorized) {
                NSLocale *local = [[NSLocale alloc] initWithLocaleIdentifier:@"fr_FR"];
                SFSpeechRecognizer *sf = [[SFSpeechRecognizer alloc] initWithLocale:local];
                self.speechRequest = [[SFSpeechAudioBufferRecognitionRequest alloc] init];
                [sf recognitionTaskWithRequest:self.speechRequest delegate:self];
                // should call startCapture method in main queue or it may crash
                dispatch_async(dispatch_get_main_queue(), ^{
                    [self startCapture];
                });
            }
        }];
    }

    - (void)endRecognizer {
        // END capture and END voice Reco
        // or Apple will terminate this task after 30000ms.
        [self endCapture];
        [self.speechRequest endAudio];
    }

    - (void)startCapture {
        NSError *error;
        self.capture = [[AVCaptureSession alloc] init];

        AVCaptureDevice *audioDev = [AVCaptureDevice defaultDeviceWithMediaType:AVMediaTypeAudio];
        if (audioDev == nil) {
            NSLog(@"Couldn't create audio capture device");
            return;
        }

        // create mic device
        AVCaptureDeviceInput *audioIn = [AVCaptureDeviceInput deviceInputWithDevice:audioDev error:&error];
        if (error != nil) {
            NSLog(@"Couldn't create audio input");
            return;
        }

        // add mic device in capture object
        if ([self.capture canAddInput:audioIn] == NO) {
            NSLog(@"Couldn't add audio input");
            return;
        }
        [self.capture addInput:audioIn];

        // export audio data
        AVCaptureAudioDataOutput *audioOutput = [[AVCaptureAudioDataOutput alloc] init];
        [audioOutput setSampleBufferDelegate:self queue:dispatch_get_main_queue()];
        if ([self.capture canAddOutput:audioOutput] == NO) {
            NSLog(@"Couldn't add audio output");
            return;
        }
        [self.capture addOutput:audioOutput];
        [audioOutput connectionWithMediaType:AVMediaTypeAudio];
        [self.capture startRunning];
    }

    - (void)endCapture {
        if (self.capture != nil && [self.capture isRunning]) {
            [self.capture stopRunning];
        }
    }

    - (void)captureOutput:(AVCaptureOutput *)captureOutput didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer fromConnection:(AVCaptureConnection *)connection {
        [self.speechRequest appendAudioSampleBuffer:sampleBuffer];
    }

    // some Recognition Delegate

    @end

Here is a Swift (3.0) implementation of @cube's answer:

    import UIKit
    import Speech
    import AVFoundation

    class ViewController: UIViewController {

        @IBOutlet weak var console: UITextView!

        var capture: AVCaptureSession?
        var speechRequest: SFSpeechAudioBufferRecognitionRequest?

        override func viewDidLoad() {
            super.viewDidLoad()
        }

        override func viewDidAppear(_ animated: Bool) {
            super.viewDidAppear(animated)
            startRecognizer()
        }

        func startRecognizer() {
            SFSpeechRecognizer.requestAuthorization { (status) in
                switch status {
                case .authorized:
                    let locale = NSLocale(localeIdentifier: "fr_FR")
                    let sf = SFSpeechRecognizer(locale: locale as Locale)
                    self.speechRequest = SFSpeechAudioBufferRecognitionRequest()
                    sf?.recognitionTask(with: self.speechRequest!, delegate: self)
                    // start capture on the main queue, as in the original answer
                    DispatchQueue.main.async {
                        self.startCapture()
                    }
                case .denied:
                    fallthrough
                case .notDetermined:
                    fallthrough
                case .restricted:
                    print("User Authorization Issue.")
                }
            }
        }

        func endRecognizer() {
            endCapture()
            speechRequest?.endAudio()
        }

        func startCapture() {
            capture = AVCaptureSession()

            guard let audioDev = AVCaptureDevice.defaultDevice(withMediaType: AVMediaTypeAudio) else {
                print("Could not get capture device.")
                return
            }

            guard let audioIn = try? AVCaptureDeviceInput(device: audioDev) else {
                print("Could not create input device.")
                return
            }

            guard true == capture?.canAddInput(audioIn) else {
                print("Could not add input device")
                return
            }
            capture?.addInput(audioIn)

            let audioOut = AVCaptureAudioDataOutput()
            audioOut.setSampleBufferDelegate(self, queue: DispatchQueue.main)

            guard true == capture?.canAddOutput(audioOut) else {
                print("Could not add audio output")
                return
            }
            capture?.addOutput(audioOut)
            audioOut.connection(withMediaType: AVMediaTypeAudio)
            capture?.startRunning()
        }

        func endCapture() {
            if true == capture?.isRunning {
                capture?.stopRunning()
            }
        }
    }

    extension ViewController: AVCaptureAudioDataOutputSampleBufferDelegate {
        func captureOutput(_ captureOutput: AVCaptureOutput!, didOutputSampleBuffer sampleBuffer: CMSampleBuffer!, from connection: AVCaptureConnection!) {
            speechRequest?.appendAudioSampleBuffer(sampleBuffer)
        }
    }

    extension ViewController: SFSpeechRecognitionTaskDelegate {
        func speechRecognitionTask(_ task: SFSpeechRecognitionTask, didFinishRecognition recognitionResult: SFSpeechRecognitionResult) {
            console.text = console.text + "\n" + recognitionResult.bestTranscription.formattedString
        }
    }

Don't forget to add a value for NSSpeechRecognitionUsageDescription in the Info.plist file, or it will crash.

It turns out that Apple's new native speech recognition does not automatically detect silence at the end of speech (a bug?), which is useful in your case, because the recognition stays active for close to one minute (the maximum duration allowed by Apple's service). So basically, if you need continuous ASR, you have to restart speech recognition whenever this delegate method fires:

    func speechRecognitionTask(task: SFSpeechRecognitionTask, didFinishSuccessfully successfully: Bool) // whether successfully == true or not
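A minimal restart sketch, assuming the startNativeRecording() / stopNativeRecording() helpers from the recording code below; treat it as an illustration and adapt it to your own setup:

    // Sketch: restart recognition whenever the task finishes so ASR keeps running.
    func speechRecognitionTask(task: SFSpeechRecognitionTask, didFinishSuccessfully successfully: Bool) {
        stopNativeRecording()          // assumed helper: stop the engine, remove the tap, call endAudio()
        do {
            try startNativeRecording() // start a fresh request/task (see below)
        } catch {
            print("Could not restart native recording: \(error)")
        }
    }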

Here is the recording / speech recognition SWIFT code I use, and it works very well. Ignore the part where I compute the average power of the microphone volume if you don't need it; I use it to animate a waveform. Don't forget to set the SFSpeechRecognitionTaskDelegate and its delegate methods; if you need the extra code, let me know.

    func startNativeRecording() throws {
        LEVEL_LOWPASS_TRIG = 0.01
        //Setup Audio Session
        node = audioEngine.inputNode!
        let recordingFormat = node!.outputFormatForBus(0)

        node!.installTapOnBus(0, bufferSize: 1024, format: recordingFormat) { (buffer, _) in
            self.nativeASRRequest.appendAudioPCMBuffer(buffer)

            //Code to animate a waveform with the microphone volume, ignore if you don't need it:
            var inNumberFrames: UInt32 = buffer.frameLength
            var samples: Float32 = buffer.floatChannelData[0][0] //https://github.com/apple/swift-evolution/blob/master/proposals/0107-unsaferawpointer.md
            var avgValue: Float32 = 0
            vDSP_maxmgv(buffer.floatChannelData[0], 1, &avgValue, vDSP_Length(inNumberFrames)) //Accelerate Framework
            //vDSP_maxmgv returns peak values
            //vDSP_meamgv returns mean magnitude of a vector

            let avg3: Float32 = ((avgValue == 0) ? (0 - 100) : 20.0)
            var averagePower = (self.LEVEL_LOWPASS_TRIG * avg3 * log10f(avgValue)) + ((1 - self.LEVEL_LOWPASS_TRIG) * self.averagePowerForChannel0)
            print("AVG. POWER: " + averagePower.description)

            dispatch_async(dispatch_get_main_queue(), { () -> Void in
                //print("VU: "+vu.description)
                var fAvgPwr = CGFloat(averagePower)
                print("AvgPwr: " + fAvgPwr.description)

                var waveformFriendlyValue = 0.5 + fAvgPwr //-0.5 is AvgPwrValue when user is silent
                if (waveformFriendlyValue < 0) { waveformFriendlyValue = 0 } //round values <0 to 0
                self.waveview.hidden = false
                self.waveview.updateWithLevel(waveformFriendlyValue)
            })
        }

        audioEngine.prepare()
        try audioEngine.start()
        isNativeASRBusy = true

        nativeASRTask = nativeSpeechRecognizer?.recognitionTaskWithRequest(nativeASRRequest, delegate: self)
        nativeSpeechRecognizer?.delegate = self

        //I use this timer to track no speech timeouts, ignore if not needed:
        self.endOfSpeechTimeoutTimer = NSTimer.scheduledTimerWithTimeInterval(utteranceTimeoutSeconds, target: self, selector: #selector(ViewController.stopNativeRecording), userInfo: nil, repeats: false)
    }
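stopNativeRecording() itself is not shown above; a minimal sketch of what such a helper might do, assuming the same audioEngine / node / nativeASRRequest / nativeASRTask properties (hypothetical, adapt as needed):

    // Hypothetical counterpart to startNativeRecording(): tear the current session down.
    // A fresh SFSpeechAudioBufferRecognitionRequest must be created before the next start.
    func stopNativeRecording() {
        endOfSpeechTimeoutTimer?.invalidate()
        audioEngine.stop()
        node?.removeTapOnBus(0)        // matches the installTapOnBus(0, ...) above
        nativeASRRequest.endAudio()    // no more buffers will be appended
        nativeASRTask?.finish()        // or cancel(), if you don't need the final result
        isNativeASRBusy = false
    }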