Virtual Personal Assistants

State of the art 2019
Here I would like to share some perspectives pertaining to the current state of Virtual Personal Assistant technology and the ability to leverage its strengths in an education enterprise.


  • Virtual Personal Assistant Technology
  • State of the Art
    • A Note of Caution
  • The Aural Modality
    • Strength and Weakness
    • Who is listening?
  • VPA Categories
    • General Knowledge Assistant
    • Specialized Knowledge Assistant
    • Healthcare Assistant
    • Task Oriented Assistant
    • Information Service Assistant
    • Entertainment Assistant
    • Lifelong Learning Companion
  • Technology Overview
    • Integration and Utilization
    • Voice Processing
    • Cutting Edge
  • Recommendations
    • Audible Educational Materials
    • Extending Casual Use
    • Mentoring

Virtual Personal Assistant Technology

The Virtual Personal Assistant (VPA) uses a conversational approach, either visual or aural, as the user interface. The VPA can initiate actions in other systems or operate from within its own domain of knowledge. Because the concept of a VPA has only recently been married to a technology capable of realizing its promise (AI), there are many evolving theories on how best to use this modality in a way that fits its capabilities. Even when backed by advanced AI, the most prominent VPAs (Alexa, Siri, Cortana and Google Assistant) fall short in their most ambitious effort: to be a general knowledge assistant.

It is important to remember that Amazon developed Alexa as a means to embed its consumer sales in the home. In tests, users are commonly directed to Amazon products and services through an ever-present “Amazon recommends…” feature. Google developed its VPA in order to extend its search and marketing services into people’s homes and mobile devices. Consumer use employs Google’s search engine as its primary fulfillment mechanism, but its ability to catalog consumer sentiment is a boon for its marketing line of business. Each successful utilization supports a core business function.

State of the Art

Identical tests run against the big four providers (Amazon’s Alexa, Microsoft’s Cortana, Google’s Home and smartphone Assistants, and Apple’s Siri) show mixed levels of accuracy. While encouraging from a technological point of view, performance still lags expectations.

Two independent researchers summarize their comparison studies on the big four in the following manner.

Google Assistant running on a Smartphone remains the leader in number of questions answered, and in answering questions fully and correctly. Cortana came in a close second for answering questions fully and correctly. Alexa made the most progress by far in closing the gap from the 2017 results, increasing the number of questions answered by 2.7 times. Siri also made material improvements. There is no doubt that this space is seriously heating up. One major area not covered in this test is the overall connectivity of each personal assistant with other apps and services. This is an incredibly important part of rating a personal assistant as well. You can expect all four companies to be pressing hard to connect to as many quality apps and service providers as possible, as this will have a major bearing on how effective they all are.
[Source: Stone Temple]

Google Assistant continued its out performance, answering 86% correctly and understanding all 800 questions. Siri was close behind, correctly answering 79% and only misunderstanding 11 questions. Alexa correctly answered 61% and misunderstood 13. Cortana was the laggard, correctly answering just 52% and misunderstanding 19.

Note that nearly every misunderstood question involved a proper noun, often the name of a local town or restaurant. Both the voice recognition and natural language processing of digital assistants across the board has improved to the point where, within reason, they will understand everything you say to them.
[Source: Loup Ventures]

On one hand, the raw numbers and trajectory appear quite positive. On the other hand, consumer expectations are well above what the VPA can currently provide. This is a familiar inflection point in the hype-cycle. According to Gartner 2018:

Two to Five Years Away from Mainstream Adoption

"Increasingly, behavior and event triggers will enhance virtual assistants," said Van Baker, research vice president at Gartner. "App development leaders need to anticipate their proliferation as more and more people and businesses move to conversational user interfaces. Businesses that haven't begun deploying AI to interact with customers and employees should start now, because customers and employees are increasingly expecting conversational interfaces to be available to address help desk and customer service issues."
Chatbots are expected to exhibit huge growth over the next few years. While less than 4 percent of organizations have already deployed conversational interfaces (including chatbots), 38 percent of organizations are planning to implement or actively experimenting with the technology according to Gartner’s 2018 CIO Survey. Although customer service is the area that uses the most chatbots, they are likely to be deployed elsewhere in the organization. When chatbots are used as application interfaces, the way we work will change from "the user having to learn the interface" to "the chatbot learning what the user wants." This will greatly stimulate on-boarding, training, productivity and efficiency inside the workplace.
[Source: Gartner]

Estimates of the number of ‘smart speakers’ in use range from 39 to 50 million active units, with a household penetration of 18-24%. These figures are high in part because industry leaders can subsidize their platform’s inclusion in consumer products not dedicated to its use. These numbers should therefore be taken with a grain of salt. Among those who own an actual smart speaker device, early mover Amazon still leads. That lead is, however, being eroded by Google’s Home device. Of those ‘devices in use’, a full 31% reportedly are never used. In fact, only 6% represent actual adoption by the user.

A Note of Caution

There is every reason to believe that AI and Personal Assistant technology will become a mainstream consumer expectation that can be delivered upon. However, not every technology exits the hype-cycle on a Plateau of Productivity. Some exit the cycle in failure, without living up to expectations or delivering revenue. It would therefore be prudent to invest only as long as a coherent product offering can be formulated and the underlying technology is seen as reliable.

Despite some encouraging numbers shown above, the public expects a highly sophisticated offering, one which is still not entirely possible. This is undoubtedly due to depictions of such technology in science fiction. Presently unreachable levels of sophistication have been represented in the level of personability desired, as expressed in films such as “Her”, in which a man falls in love with an AI-powered chatbot. Similarly, near-clairvoyant sophistication on par with Star Trek is just not possible in the near future. Because of this, no matter how advanced, early versions should expect to be met with some level of consumer disappointment. Evidence of the need for measured expectations is plainly expressed by the news provider Forbes, whose link to the Stone Temple analysis was cited under an article titled “Dumb And Dumber: Comparing Alexa, Siri, Cortana And The Google Assistant”.

The Aural Modality

It is vital to understand the opportunities and deficiencies involved when communicating with users through speech rather than text. Our most familiar user interfaces are visual. Visually, users actively scan for relevant information because it is presented all at once. The aural modality progresses in a linear fashion but can be consumed passively by the user. When designing a solution which uses speech and conversation as its keyboard and mouse, there are scenarios for which this modality is well suited and others for which it is poorly suited.

  • System speaks at length
    • Good – Read an article
    • Bad – Present a long list of choices
  • User speaks at length
    • Good – Dictation / voice-to-text
    • Bad – Commands requiring lengthy clarification and specification
  • Short command
    • Good – Internet of Things or Personal
    • Bad – Select one of many items in a list
  • Short system response
    • Good – Word definition, long background process
    • Bad – Incomplete answer requiring additional input
  • Conversational
    • Good – Short form completion, finite action clarification
    • Bad – Short interaction where the [hello, what’s up, goodbye] cycle is

Strength and Weakness

In conclusion, it is useful to imagine the execution of any given function desired as if it were described to another person over the phone. When these scenarios become tedious to imagine they will be tedious to use. When they free us from being tied to a screen without limiting our abilities, they will be well received and used repeatedly.

Who is listening?

A deficiency of the aural modality which has nothing to do with its technical capability is the perception of its security and its potential for documented liability. VPAs come with an inherent question: ‘Who is listening to this?’ While bits of audio are constantly sent to and processed by outside servers connected to the VPA, the service should not need to persist that information beyond what is required to discern user intent. However, the documented consequences of recorded conference calls and off-hand remarks will impact user perceptions and willingness to engage. It is important to consider this perception when devising products which must be present in a group setting or scenarios where the service is constantly on and listening for commands. Efforts to mitigate this general perception have been limited and remain tied to trust in the product company. The major providers of VPA services are already bearing the brunt of poor public relations and unintended consequences.

VPA Categories

The following is a discussion of the most prominent strategies to utilize a VPA in the enterprise. Some, but not all, have the potential to benefit the healthcare industry.

General Knowledge Assistant

The General Knowledge Assistant (GKA) is the most familiar and ambitious utilization of a VPA and is typified by the familiar names Alexa, Siri, Cortana and Google Assistant. A GKA can potentially be asked any question on any topic and direct the user to the most relevant information, much the way our search engines do. The Sisyphean task here is to map everything the user could possibly be interested in to their desired, relevant response. What makes this seemingly impossible is the breadth of knowledge necessary to make it minimally viable and the instant loss of authority the VPA experiences when it returns irrelevant information. While industry leaders are still pursuing the GKA, even with their resources its maturity is still many years away from general user acceptance.

Specialized Knowledge Assistant

The purpose of a Specialized Knowledge Assistant (SKA) is identical to that of the GKA. It is, however, restricted to a specific domain of knowledge. Users of the SKA understand they can only ask this VPA about information in its specialty. Because of this restriction, the SKA is much more accurate in its ability to assist the user with relevant information. It can more reliably find and return desired information as well as suggest a course of action.

Healthcare Assistant

Specifically relevant to the industry are VPAs already focused on the healthcare specialty. In this category is the Microsoft Health Bot Service, which is still in development at this time. Its advertised capabilities include:

  • Symptom checkers, based on built-in medical protocols or their own protocols
  • Information about conditions, symptoms, type of doctors
  • Health plan services, benefits, eligibility and costs information
  • Service providers lookup and scheduling
  • Clinical trials information

Task Oriented Assistant

A Task Oriented Assistant (TOA) focuses on short, explicit commands to carry out defined tasks. It is loaded with a limited vocabulary focused only on the tasks it is programmed to facilitate. The most familiar examples of this utilization are in the Internet of Things (IoT). Here, turning appliances on and off and retrieving sensor data (environmental readings like temperature and humidity) are the most common actions. Another use, sometimes mentioned separately, is in the office, where setting a reminder, checking a schedule or sending an email can be accomplished through short explicit commands.

Information Service Assistant

Information Service Assistants query the user for their general-to-specific intent until the user’s desired result is achieved or they are routed to a human being. These are used for help desk inquiries, reservation services and other limited tasks where, by facilitating self-service, the enterprise can stretch its human capital.
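
The general-to-specific narrowing described here can be pictured as a walk down a small decision tree. The following is a minimal sketch in Python, not any provider's API: the tree contents, the `route` function and the hand-off message are all hypothetical names chosen for illustration.

```python
# Hypothetical help-desk tree: each node either narrows the intent further
# (it has "options") or is a self-service outcome (it has a "result").
HELP_TREE = {
    "question": "Is this about billing or technical support?",
    "options": {
        "billing": {
            "question": "Is this about a refund or an invoice?",
            "options": {
                "refund": {"result": "Opening the refund request form."},
                "invoice": {"result": "Sending your latest invoice by email."},
            },
        },
        "technical": {"result": "Starting the troubleshooting guide."},
    },
}

HUMAN_HANDOFF = "Routing you to a human agent."


def route(tree, answers):
    """Walk the tree using the user's answers, from general to specific.

    Returns the self-service result, or the human hand-off message when an
    answer is not understood or the answers run out before a result is reached.
    """
    node = tree
    for answer in answers:
        if "result" in node:
            break  # already at an outcome; ignore any further answers
        node = node["options"].get(answer.strip().lower())
        if node is None:
            return HUMAN_HANDOFF
    return node.get("result", HUMAN_HANDOFF)
```

In this sketch, `route(HELP_TREE, ["billing", "refund"])` reaches a self-service outcome, while an unrecognized answer such as `"sales"` falls back to a person, mirroring the two exits described in the paragraph.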

Entertainment Assistant

While the industry term for this utilization is primarily focused on retrieving multi-media for pleasure, its purpose is to return media for the user’s passive consumption. The defining characteristic is of a searchable index of media and its lengthy consumption by the user.

Lifelong Learning Companion

Aimed at the educational industry, the Lifelong Learning Companion (LLC) is a VPA which is a hybrid of a Specialized Knowledge Assistant and an Entertainment Assistant. Focused on a professional specialty or student development, the LLC interacts conversationally to identify the knowledge articles or educational materials the user wants on a given topic, then reads the contents aloud to the user. Metadata concerning each article links them together topically, and short quizzes can be injected periodically into the system’s dictation of knowledge. The LLC is aware of both the full library of knowledge available and the specific skill level and educational needs of the user, so it can find specific materials or suggest relevant ones.
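
As a rough illustration of how topical metadata and a learner's skill level could drive suggestions, consider the sketch below. The `Article` and `Learner` records, the 1-to-3 level scale and the "no more than one level above the learner" rule are assumptions made for this example only.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    title: str
    topics: set
    level: int  # 1 = introductory ... 3 = advanced (illustrative scale)


@dataclass
class Learner:
    level: int
    completed: set = field(default_factory=set)  # titles already consumed


def suggest(library, learner, current):
    """Suggest unread articles that share a topic with the current article
    and are no more than one level above the learner."""
    candidates = [
        a for a in library
        if a.title != current.title
        and a.title not in learner.completed
        and a.topics & current.topics
        and a.level <= learner.level + 1
    ]
    # Most topical overlap first, then easier material first.
    return sorted(
        candidates,
        key=lambda a: (-len(a.topics & current.topics), a.level),
    )
```

A production system would likely weigh quiz results and professional requirements as well; the point here is only that the linking metadata is what makes a relevant suggestion possible.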

Technology Overview

A Virtual Personal Assistant transmits audio to a Speech-to-Text system, then passes that text to a Natural Language Understanding (NLU) system which derives metadata such as the user’s intent and mentioned entities. This data is interpreted according to the ontology developed for it, which reflects its targeted domain of knowledge. Alexa’s domain of knowledge is targeted towards Amazon products and services, while Google’s Home Assistant leverages its search while enriching its advertising data. Each solution was designed for a business purpose, not simple consumer entertainment.

Providing VPA services means extending these various individual technologies (speech-to-text, NLU and metadata analysis) to a consumer product. Consumer products include desktop PCs, mobile devices and specialty devices such as the Amazon Alexa powered Echo, and the range of Google Assistant powered devices such as Google Home (Standard/Max/Mini) as well as over a dozen 3rd party manufacturers such as Sony, JBL, Nest and Sonos.

Each of the big four providers, arguably three, extends its service to a physical machine: specialized like the Echo, or generalized when using an app on your mobile device. From a technological standpoint, this is analogous to the browser wars and their differing JavaScript support.
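
The flow from audio to intent and entities can be sketched in a few lines of Python. This is a toy under stated assumptions: `speech_to_text` is a stand-in that treats the "audio" as UTF-8 text, the NLU step is plain substring matching rather than a trained model, and the `ONTOLOGY` contents are invented for an education domain.

```python
from dataclasses import dataclass, field

# Hypothetical ontology: intents map to trigger phrases, entity types map
# to the values the assistant knows about.
ONTOLOGY = {
    "intents": {
        "read_article": ["read", "play"],
        "start_quiz": ["quiz", "test me"],
    },
    "entities": {
        "topic": ["ethics", "taxation", "auditing"],
    },
}


@dataclass
class Understanding:
    intent: str
    entities: dict = field(default_factory=dict)


def speech_to_text(audio: bytes) -> str:
    """Stand-in for a real speech-to-text service: the 'audio' here is
    simply UTF-8 encoded text so the sketch stays runnable."""
    return audio.decode("utf-8")


def understand(text: str, ontology: dict) -> Understanding:
    """Toy NLU: derive the intent and entities by substring matching."""
    lowered = text.lower()
    intent = next(
        (name for name, phrases in ontology["intents"].items()
         if any(phrase in lowered for phrase in phrases)),
        "unknown",
    )
    entities = {
        kind: value
        for kind, values in ontology["entities"].items()
        for value in values
        if value in lowered
    }
    return Understanding(intent, entities)


def handle(audio: bytes) -> Understanding:
    """Full pipeline: audio -> text -> intent and entities."""
    return understand(speech_to_text(audio), ONTOLOGY)
```

Note that everything the assistant can recognize lives in the ontology, not in the pipeline code, which is why the ontology is the part that reflects the targeted domain of knowledge.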

  • Each device may or may not support all available capabilities
  • The quality of said devices will vary impacting voice exchange
  • The availability of other native applications will vary such as Outlook or IM
  • The vital ontology used to accurately respond to the user is siloed by provider

Now is the time to mention a provider of all the necessary services which does not make the big four list: IBM Watson Assistant. Its strategy is enterprise oriented rather than consumer oriented, which makes it an attractive option for corporate uses.

Any VPA investment will include developing a robust ontology covering the target knowledge domain. This ontology is a durable source of value and should be considered part of the secret sauce which makes your company’s products smarter than the competition. While all providers offer some base ontology collections such as hospitality, driving directions and help desk, in-house developed ontologies are the specialist knowledge stores which differentiate us from me-too adopters.

First, Watson Assistant is a white label product. There’s no Watson animated globe, or “OK Watson” wake-word — companies can add their own flair rather than ceding territory to Amazon or Apple. Second, clients can train their assistants using their own datasets, and IBM says it’s easier to add relevant actions and commands than with other assistant tech. And third, each integration of Watson Assistant keep its data to itself, meaning big tech companies aren’t pooling information on users’ activities across multiple domains.
[Source IBM]

Integration and Utilization

Each of the big four offers good levels of integration and opportunities to extend its product. Nomenclature is fairly consistent across providers as well. Capabilities developed specifically for your application running under Alexa, Cortana, et al. are referred to as Skills. Each provider’s terminology refers to names, places and things (nouns) as Entities and actions or attitudes (verbs) as Intents. Identical ontologies loaded into different providers will result in variations in accuracy, relative to the sophistication of the underlying AI used and the length of time it has had to ‘learn’ through normal usage.

Voice Processing

The quality and sophistication of the text-to-speech solution varies among providers, yet they all use Speech Synthesis Markup Language (SSML) in an attempt to unify development. As with the browser wars analogy above, however, not every provider supports the entire SSML specification. Be that as it may, it will be necessary to transform any text documents into this markup language, a simple XML derivative, before feeding them into a voice synthesis service.
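
A minimal sketch of that transformation follows, assuming the source material is plain text with blank lines between paragraphs. The `text_to_ssml` helper is a hypothetical name; it emits only the `<speak>` root and `<p>` elements from the W3C SSML specification, and individual providers may expect a different root form or support a different subset of elements.

```python
from xml.sax.saxutils import escape


def text_to_ssml(text: str, lang: str = "en-US") -> str:
    """Wrap plain text in minimal SSML: one <p> per blank-line-separated
    paragraph, with XML special characters escaped."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # Collapse internal line breaks and runs of spaces inside each paragraph.
    body = "".join(f"<p>{escape(' '.join(p.split()))}</p>" for p in paragraphs)
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">{body}</speak>'
    )
```

Escaping matters because characters such as `&` and `<` are common in written material and would otherwise produce invalid XML that a synthesis service is likely to reject.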

Cutting Edge

Further technical advancements allow us to sample other aspects of user response which may enrich educational engagement. We can analyze the emotional tone of the user response to determine if the student finds the subject difficult despite being right. With much larger sets of user responses we can infer certain personality traits organized into such constructs as the Big Five and preferred modes of consumption using IBM Personality Insights.

These technology opportunities are mentioned for future consideration and do not constitute required components of a coherent solution. They do represent the augmentations that will become mainstream in five or more years.


Recommendations

Our recommendations at this time (2019) have been tempered by our conclusion that while the pieces of the VPA puzzle are individually present, user acceptance and the maturity of the technology as a cohesive service are not. Given this state of maturity, we recommend targeting the technology components of the overall solution by augmenting existing services to prepare for the VPA’s coming user engagement.

Big picture: Artificial Intelligence is the product, and the Virtual Personal Assistant is one way to utilize it. The same is true of Big Data analysis, Pattern Recognition and Algorithmic Trading, all of which use unremarkable techniques at their core, supercharged by AI’s ability to draw inferences we are not used to seeing from a machine. The maturation of AI will be the largest determining factor in the VPA’s success, which still has significant hurdles to overcome and some ‘known unknowns’ yet to encounter. The robotics industry grapples with the uncanny valley effect, which causes users to become uneasy with the technology the more realistic it becomes. Will there be a similar uncanny valley as AI behaves more like a person? Watching AI learn from a technologist’s console, it is somewhat creepy knowing how input becomes output in a binary system, yet observing this black box seem to reason.

Therefore, prudent approaches to the Virtual Personal Assistant would include extending casual use of existing company assets and initiating the product vision of a virtual mentor.

Audible Educational Materials

It is our opinion that converting existing text-based educational materials to an audible format will increase casual use of company services and accomplish a foundational component of any VPA: having something to say. The effort would begin with a targeted selection of materials relevant to an existing business service. Audible materials aimed at specific courses would be produced and promoted during student engagement. Utilization rates, as well as the effect their use has on the students’ subsequent grades, would provide efficacy feedback for tuning and further promotion.

The effort would require further examination of the company’s Terms of Use agreements for partner materials. Selected written material would at first be manually converted to a machine-readable format to produce audio files of those same materials being read aloud. Initial manual conversion efforts would reveal the means to automate the transformation, making further efforts less costly.
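
The two feedback measures named here, utilization rate and effect on grades, reduce to simple arithmetic. The sketch below assumes a hypothetical record per student with a `listened` flag and a subsequent `grade`; the field names and the function are illustrative only.

```python
from statistics import mean


def efficacy_summary(students):
    """students: list of dicts with 'listened' (bool) and 'grade' (float).

    Returns the utilization rate and the difference in mean grade between
    listeners and non-listeners. This is a descriptive comparison only; it
    does not establish that listening caused any difference.
    """
    listeners = [s["grade"] for s in students if s["listened"]]
    others = [s["grade"] for s in students if not s["listened"]]
    utilization = len(listeners) / len(students) if students else 0.0
    grade_gap = mean(listeners) - mean(others) if listeners and others else None
    return {"utilization": utilization, "grade_gap": grade_gap}
```

Because students who choose to listen may differ from those who do not, a gap computed this way is best read as a prompt for further tuning rather than as proof of effect.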

Extending Casual Use

The longer the user interacts with the system, presented with unique information, the greater the likelihood that they will be made aware of and utilize other company services. This approach is common in the news-to-advertiser model, whereby news outlets provide relevant and changing information which compels the user to remain engaged and more likely to click through to an advertiser. Education organizations can utilize this approach by converting their written materials to voice and facilitating their dictation back to the user. This foundational content will create the critical volume of material and interest necessary to weave complementary services into the application flow.

Beginning with straight dictation of written educational materials, the member will be able to consume such media passively and through a more varied collection of devices. This will untether the member from direct observation of their PC or mobile screen, allowing the user to take your company with them during more varied physical experiences. Improved ease of use will increase user interaction on its own. This new product will mature from a straight dictation of educational materials to one which:

  • Mixes quizzes or chapter reviews into each section, evaluating the user’s comprehension in a true, audible chat.
    • Quiz results enrich the company’s ability to tailor the member’s educational needs on a topical and fine grained basis.
  • Asks closing questions about the member’s affinity for certain covered topics and uses their responses, combined with other professional requirements, to suggest further topics.
  • Uses this opportunity of engagement to notify the user of non-audible services within the company.
    • Can send a reminder of an event, a link to a page on the portal or elsewhere on the internet to the user.
    • Able to enroll them in courses.
  • Mentoring
    • This capability is very important and, it turns out, so well suited for the VPA that it deserves its own section below.
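
The first step in that maturation, mixing quizzes into the dictation, can be sketched as building a playlist. The `build_session` function, the `("read", ...)` and `("quiz", ...)` tuples and the `every` parameter are assumptions made for illustration.

```python
def build_session(sections, quizzes, every=2):
    """Interleave quiz prompts into a dictation playlist.

    sections: list of section titles in reading order.
    quizzes:  dict mapping a section title to a quiz question.
    every:    insert a quiz after every Nth section, when one is available.
    """
    playlist = []
    for index, title in enumerate(sections, start=1):
        playlist.append(("read", title))
        if index % every == 0 and title in quizzes:
            playlist.append(("quiz", quizzes[title]))
    return playlist
```

The answers gathered at each quiz step are what would feed the tailoring and topic suggestions listed above.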


Mentoring

An under-served aspect of any professional education seems to be the teaching of critical thinking skills. This is a place where the VPA is able to leverage its strengths. If we take a member on an educational journey wherein we intersperse questions throughout their education, we can tell them not only when they are right, but also WHY they are right. We can propel the student’s current understanding of what to do and how to do it to an understanding of why we do it this way. Rote regurgitation can be replaced with reasoned conclusion.

Your company may already deliver services meant to mentor its students. It is a different thing entirely to be a mentor to someone than to give them the tools. I wish to discuss how the common perception of a human mentor may be replicated as a service.

There is no single list of qualities which make a good mentor. However, the following is a reasonably complete list of the characteristics needed to fit the role, along with how automated systems might provide them.

Willingness to Share Skills, Knowledge and Expertise

The very first and most important quality is the sharing of knowledge. A mentor starts off by knowing a great deal, but not everyone with an encyclopedic knowledge would make a good mentor. The solution must start with a quantity and variety of information necessary to not exhaust its usefulness. Knowledge and expertise are educational products delivered to the student. The role of a VPA would be to unlock that information from active to passive consumption. The demonstrable skill of a mentor relevant to students in this circumstance is to pass on critical thinking skills.

Demonstrates a Positive Attitude and Acts as a Positive Role Model

The demonstration of a positive attitude is conveyed through consistent enthusiasm and determination. These are hard characteristics to convey on a web page using the tools of punctuation and images. Interaction through the spoken word presents us a richer palette from which to exhibit the system’s positive attitude and connect with the user. Human beings detect such emotional feedback first in a person’s face, but secondly and no less importantly in how they communicate. Tone, meter and pronunciation are configurable variables in modern voice generating systems, which works to our advantage.

Takes a Personal Interest in the Mentoring Relationship

A machine cannot take a personal interest in anything. It can however be taught to ape the behavior that does. Punctuality, follow through, deep knowledge of the student and oddly enough the ability to listen are all positive characteristics machines may exhibit to the student.

Exhibits Enthusiasm in the Field

Language, tone and inflection are the verbal means to exhibit enthusiasm. It will be necessary to ensure the system responses are mindful of how they are perceived by the student. When aural responses are used, the ability to produce natural sounding language will depend on the ability to configure their meter and tone.
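
In SSML, meter and tone are typically configured through the `<prosody>` element. The helper below is a small sketch under that assumption; `with_prosody` is a hypothetical name, and support for particular `rate` and `pitch` values differs between providers.

```python
from xml.sax.saxutils import escape, quoteattr


def with_prosody(sentence: str, rate: str = "medium", pitch: str = "medium") -> str:
    """Wrap one sentence in an SSML <prosody> element.

    'rate' takes labels such as "slow", "medium" or "fast"; 'pitch' takes
    labels such as "low", "medium" or "high".
    """
    return (
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(sentence)}</prosody>"
    )
```

For example, praise after a correct quiz answer might be rendered with `with_prosody("Well done!", rate="fast", pitch="high")`, while a correction might use a slower rate to sound more measured.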

Values Ongoing Learning and Growth in the Field

Your business may have numerous routes to fulfill this wide ranging student need such as through a web portal. In many instances, the visual presentation of such opportunities is more appropriate than attempting to wedge them into the aural modality. Currently though, some material is perhaps not very approachable and would benefit from a different means of surfacing such opportunities to the student.

In our opinion, increasing casual interaction with the system will build opportunities which direct the user to relevant company services. Emphasizing services available through secondary systems can be done in a relevant manner and a reminder email or other click-thru device can be utilized to give them near effortless follow through. Events triggering such mentions during casual use can be based on content being presently consumed or other system or calendar events suggesting present relevance.

Provides Guidance and Constructive Feedback

The system needs not only to know the right answer when asked a question, but also why it is the right answer. Constructive feedback at this level is expected to be holistic and personalized to the individual. Your company knows the aptitude of your students and may perhaps provide guidance in their educational journey. However, constructive feedback is both topically relevant and personable.

Guidance, beyond what to do next in a list of educational chores, means explaining the “why” behind any given question. Here critical thinking skills can be developed and, when delivered at the relevant moment in a consistent manner, increase the likelihood the student will take action of their own accord beyond casual consumption.

Respected by Colleagues and Employees in All Levels of the Organization

An established educational company with an existing knowledge base is likely to be acknowledged as authoritative as long as it is presented in a very professional manner on the web and in person. Any verbal response made by the system will need to be consistent with those high values and by doing so, will hopefully engender respect for our artificial entity.

Sets and Meets Ongoing Personal and Professional Goals

The system must make the accomplishment of the student’s goals its own. Your company probably already collates a personalized data set on each of its students and encourages accomplishment of professional goals. A virtual mentor would need to express its earnestness in helping the student do so.