Voice Recognition: A Way for Hands-Free | Primathon


May 8, 2021

In today’s technology-driven world, voice recognition software is gaining popularity fast. It is a subset of speech recognition that gives users a robust speech recognition experience. Intelligent assistants like Cortana, Siri, and Amazon’s Alexa enable hands-free requests by simply converting your voice into text.

History of voice recognition:

The first speech recognition system traces back to 1952, when Bell Laboratories designed the “Audrey” system. It could understand an individual voice, but the computer recognised only digits, not words. Later, IBM introduced ‘Shoebox’, which understood just 16 English words.

The US Department of Defense and DARPA developed the Speech Understanding Research (SUR) program, which produced Carnegie Mellon’s Harpy speech system. Harpy understood over 1,000 words, a huge advance for the field. In the 1990s Dragon Dictate became widely used, and BellSouth’s VAL gave voice recognition a new definition: a dial-in interactive voice recognition system that introduced the myriad phone-tree systems still in use today. Later, in the 2000s, Google voice search was released, leading to a wave of voice-recognising apps.

How does it work?

The entire voice recognition system is complex. Whatever language is given to the system as input carries context and semantics, which are processed by ML models trained on their datasets. The following steps are followed:

  1. Input signal– The user’s voice is taken as the raw input.
  2. Signal extraction– Intelligible speech is kept while excess noise signals are filtered out.
  3. The acoustic model– The system works on the sounds you make, detecting each sound and pairing it with words using stored statistical representations of those sounds.
  4. Output production– The input signal is decoded and candidate words are extracted from it.
  5. The language model– The words you speak are arranged into a sequence that forms a sentence. This is done using probability distributions over word sequences.
  6. Final result– The full text of what you said is recognised by the system.
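The steps above can be sketched in miniature. This toy is illustrative only: the phoneme units, word scores, and probabilities below are invented for the example, whereas a real system uses trained statistical models over large vocabularies.

```python
# 1-2. Input + extraction: pretend we already have clean acoustic "frames".
frames = ["k", "ae", "t"]  # hypothetical phoneme-like units

# 3. Acoustic model: maps sound units to candidate words with scores.
acoustic_model = {
    ("k", "ae", "t"): {"cat": 0.7, "cut": 0.3},
}

# 5. Language model: probability of a word given the previous word.
language_model = {
    ("the", "cat"): 0.6,
    ("the", "cut"): 0.1,
}

def decode(prev_word, frames):
    """4/6. Combine acoustic and language scores and pick the best word."""
    candidates = acoustic_model[tuple(frames)]
    return max(candidates,
               key=lambda w: candidates[w] * language_model.get((prev_word, w), 0.01))

print(decode("the", frames))  # -> cat
```

Note how the language model breaks the tie: “cat” and “cut” sound similar, but “the cat” is far more probable than “the cut”, so the decoder picks “cat”.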

AI then turns your normal speech into text the computer can understand, via a Recurrent Neural Network (RNN). Before that, pre-processing of the speech takes place, because a computer can only work with sound waves broken into bits, as follows:

  1. Pre-processing in the speech recognition model breaks the audio into groups of data.
  2. Each sampled value is very short, around 1/16000th of a second.
  3. Each group of samples covers around 20-25 milliseconds.
  4. The computer system then understands the bits fed to it.
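The arithmetic behind these numbers is worth making concrete. The sketch below assumes a 16 kHz sample rate (one sample per 1/16000 s, as above) and 20 ms frames; the frame-splitting helper is a simplification, since real pipelines use overlapping frames.

```python
SAMPLE_RATE = 16_000          # samples per second (one sample per 1/16000 s)
FRAME_MS = 20                 # frame length in milliseconds

samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000

def frame_signal(samples, frame_len):
    """Split raw samples into fixed-size frames (the 'groups' above)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

one_second = list(range(SAMPLE_RATE))     # stand-in for one second of audio
frames = frame_signal(one_second, samples_per_frame)

print(samples_per_frame)  # -> 320 samples per 20 ms frame
print(len(frames))        # -> 50 frames per second
```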

Here are the steps that the RNN follows:

1. The RNN reads each letter while predicting the next letter.

2. Every prediction it makes is stored in memory and used for future predictions.

3. The RNN’s predictions become unreliable over long sequences, so the results for the entire sequence must be combined when processing the final output.

For this, Long Short-Term Memory (LSTM) is used, which can handle long sequences and provide the final output as a result.
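The recurrence behind steps 1-2 above can be sketched in a few lines. The weights here are made up for illustration; a real RNN or LSTM learns them from data, and an LSTM additionally adds gates that decide what to keep in memory.

```python
import math

W_IN, W_HIDDEN = 0.5, 0.9

def rnn_step(hidden, x):
    """One RNN step: the new memory depends on the input AND the old memory."""
    return math.tanh(W_IN * x + W_HIDDEN * hidden)

def run(sequence):
    hidden = 0.0                      # memory starts empty
    for x in sequence:
        hidden = rnn_step(hidden, x)  # each prediction feeds the next
    return hidden

# The final state differs for different orderings of the same inputs:
# the network remembers the sequence, not just its contents.
print(run([1, 0, 0]) != run([0, 0, 1]))  # -> True
```

Because `hidden` is squashed through `tanh` at every step, the influence of early inputs fades over long sequences, which is exactly the weakness LSTM gating addresses.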

Algorithms like PLP features, Viterbi search, deep neural networks (like RNNs), discriminative training, and the WFST framework, together with other machine learning and API tools, play an important role in the process. An Application Programming Interface (API) is a software-to-software interface, so it takes charge once the voice is converted into text, for example via the Google Cloud Speech API.
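Of the algorithms named above, Viterbi search is the easiest to show in miniature: given per-frame word scores (standing in for an acoustic model) and transition probabilities (standing in for a language model), it finds the single most probable word sequence. All the probabilities below are invented for illustration.

```python
states = ["hello", "yellow"]

# P(audio frame | word), one dict per frame (toy acoustic scores):
emissions = [{"hello": 0.6, "yellow": 0.4},
             {"hello": 0.7, "yellow": 0.3}]

# P(next word | previous word) (toy language-model scores):
transitions = {("hello", "hello"): 0.8, ("hello", "yellow"): 0.2,
               ("yellow", "hello"): 0.5, ("yellow", "yellow"): 0.5}
start = {"hello": 0.5, "yellow": 0.5}

def viterbi(emissions):
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start[s] * emissions[0][s], [s]) for s in states}
    for frame in emissions[1:]:
        best = {s: max(((p * transitions[(prev, s)] * frame[s], path + [s])
                        for prev, (p, path) in best.items()),
                       key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])[1]

print(viterbi(emissions))  # -> ['hello', 'hello']
```

Keeping only the best path into each state at each frame is what makes Viterbi efficient: the work grows linearly with the number of frames rather than exponentially with all possible word sequences.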

Uses and Effects of Voice Recognition:

Speech recognition software comes in many types. For example, software that takes commands to control apps on the same device, such as “open contacts” or “call mom”, and follows every instruction falls under the category of ‘command and control’ voice recognition software. Each type has its own use, and together they form the intelligent assistants. Each is used differently, such as:

1. Amazon’s Alexa– It not only controls home entertainment systems but can also help in professional settings: it can join a meeting via voice commands, link your email and calendar to schedule or re-schedule meetings, and inform participants automatically. It can manage audio conferencing devices, your tasks, and your schedule.

2. Home Assistants– Google’s voice-activated virtual Home assistant offers around 50 voice-related games. Its open software development kit helps developers customise device actions and build their own voice features into their products.

3. Microsoft’s Cortana– It can manage your emails across Office 365, Outlook.com, and Gmail accounts, and users can also manage their accounts via the speaker.

From home security to managing accounts and banking, voice recognition excels in several fields, but it has its own effects and drawbacks, especially for digital marketing. Voice SEO is still under development, and if your brand’s name or experience is searched by voice, your website may not appear first on the results page.

That is because you may not have added the likely answers. Voice search will not only change the ranking format but may also be unable to pick out your website’s name if it is too hard to pronounce. So, think twice before deciding on a name, and plan your website’s content to include answers to the most commonly asked queries.

The predicted impact of voice recognition is especially high in tourism and other businesses.
