How Siri Works

Once again someone has offered us incredible artificial intelligence, and once again we are bracing for disappointment. It happened with handwriting recognition on the Newton, which proved to be slow and clumsy. It happened with the not-as-smart-as-they-first-appeared creatures of Lionhead’s Black and White. And remember the Kinect debut video showing a kid interacting with an on-screen villain effortlessly, the AI character perfectly intoning the kid’s name? Kinect brought some of the innovations promised in that early teaser, but clearly the video implied a level of sophistication and polish that turned to vapor in the end.

But it’s Apple this time, with Siri on the iPhone 4S. And although Apple has screwed up before—witness the aforementioned Newton—if anyone has the motivation, the resources, and the smarts to get AI right, the iPhone dev team is it.

Having programmed and taught artificial intelligence in video games for almost twenty years, I am deeply skeptical—you might almost say cynical—about claims to offer a truly useful and usable intelligent agent. Ordinary people—those who don’t study AI—have big hopes (and fears) about AI, and marketers prey on these fantasies. In reality AI is, on the whole, a hoax. Virtually everything we call “AI” today is either a theatrical display of essentially scripted behavior (that’s how most game AI works), a massive database (such as Google Suggestions and expert systems) or a vague and decidedly unintelligent jumble of neural networks and genetic algorithms. So-called “artificially intelligent” programs are generally either too limited or too clumsy to be useful in helping ordinary people do ordinary tasks. So will Siri be different?

Despite my skepticism, I actually think the answer is “yes.” I think Siri will do more or less what Apple promised yesterday.

The reason it will work is that it actually has fairly modest ambitions—more modest than they first appear.

Take a close look at the Siri site. What exactly can you ask Siri to do? Apple gives you a list:

  • Ask for a reminder.
  • Ask to send a text.
  • Ask about the weather.
  • Ask for information (from Yelp, Wolfram|Alpha, or Wikipedia).
  • Ask to set a meeting.
  • Ask to send an email.
  • Ask for a number.
  • Ask to set an alarm.
  • Ask for directions.
  • Ask about stocks.
  • Ask to set the timer.
  • Ask Siri about Siri.

The last item simply gets Siri to repeat this very list.

Consider the list closely and you’ll notice that it is not as open-ended as it first appears. Siri can’t understand just anything. It can do a certain set of key tasks. In a nutshell:

  • Interact with the calendar.
  • Search contacts.
  • Read and write messages (text and email).
  • Interact with the Maps app and location services.
  • Forward search phrases to certain pre-defined data providers (Yahoo! Weather, Yahoo! Finance, Yelp, Wolfram|Alpha, or Wikipedia).

This is still an impressive and—most importantly—wildly useful set of functions. But it is a limited, focused set. And that’s what makes me think Siri’s “AI” may actually work.

Looking at it from a programmer’s perspective, it seems to me that Siri consists of three layers: a speech-to-text analyzer, a grammar analyzer, and a set of service providers. If all three of these work well, then Siri will be fun and helpful. If one of them is as troubled as traditional intelligent agents have tended to be, then Siri will go the same way those other agents went—tumbling into the trash heap of misguided innovations.
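To make the three layers concrete, here is a minimal sketch in Python. It is my guess at the shape of the thing, not Apple's code: the function names, the `SERVICES` table, and the canned replies are all hypothetical, and the "audio" is just text bytes so the example can run without a real recognizer.

```python
from typing import Callable, Dict

# All names below are hypothetical stand-ins, not Apple APIs.

def speech_to_text(audio: bytes) -> str:
    """Layer 1: pretend the audio has been transcribed (here it is just UTF-8 text)."""
    return audio.decode("utf-8").lower().strip()

def analyze_grammar(text: str) -> Dict[str, str]:
    """Layer 2: turn raw text into a tiny record of which service to call and with what."""
    if text.startswith("what is the weather"):
        return {"service": "weather", "query": "today"}
    if text.startswith("look up "):
        return {"service": "wikipedia", "query": text[len("look up "):]}
    return {"service": "unknown", "query": text}

# Layer 3: service providers, keyed by name. These only describe what they would do.
SERVICES: Dict[str, Callable[[str], str]] = {
    "weather": lambda query: f"(would ask a weather provider about {query})",
    "wikipedia": lambda query: f"(would search Wikipedia for {query})",
}

def handle(audio: bytes) -> str:
    """Run one request through all three layers in order."""
    text = speech_to_text(audio)
    request = analyze_grammar(text)
    service = SERVICES.get(request["service"])
    if service is None:
        return "Sorry, I did not understand that."
    return service(request["query"])

print(handle(b"What is the weather like today"))
# prints: (would ask a weather provider about today)
```

If any one layer fails, the whole request fails, which is why the weakest of the three decides how useful the assistant feels.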

A speech-to-text analyzer is a piece of software that takes audio and turns it into text. Simple as that. Except it’s not so simple—systems like Dragon have been refining this process for years. It’s really hard to get right, and I’ve never seen an analyzer that didn’t jumble a significant portion of what I say. (If you’ve got a Mac, you can experience the joy of being constantly misunderstood by a computer by playing with your “Speech Recognition” settings. Try a game of chess using nothing but speech. It’ll miss your move as often as not.)

Siri, however, has a much easier job than Dragon or your Mac’s Speech Recognition facility. And that, again, is because its job is limited and focused. It doesn’t have to understand just anything you might say. It only has to understand words and sentences that pertain to appointments, contacts, messages, and maps. This makes it easier for Siri to pick out what you’re saying, because there are only so many things that you’re allowed to talk about.
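Here is a rough illustration of why a small vocabulary helps. I have no idea how Siri's recognizer actually works internally; this sketch only shows the general idea of snapping a slightly garbled word onto the nearest entry in a short list, using Python's standard `difflib` module. The word list and the function name are made up for the example.

```python
import difflib
from typing import List, Optional

# A made-up vocabulary drawn from the kinds of tasks in Apple's list.
VOCABULARY: List[str] = [
    "remind", "text", "email", "weather", "meeting",
    "alarm", "directions", "timer", "call", "send",
]

def snap_to_vocabulary(heard: str,
                       vocabulary: List[str] = VOCABULARY,
                       cutoff: float = 0.6) -> Optional[str]:
    """Return the vocabulary word closest to what was heard, or None if nothing is close."""
    matches = difflib.get_close_matches(heard.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(snap_to_vocabulary("wether"))  # weather
print(snap_to_vocabulary("qqq"))     # None
```

With only a handful of candidates, a sloppy guess still lands on the right word. With the whole English dictionary as the candidate list, the same sloppy guess has many more wrong neighbors to land on.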

Another advantage is physical. A phone has a much better chance of hearing your voice up-close than a computer does. Phone microphone technology already incorporates a degree of noise cancellation. So your phone is more likely to be able to hear you clearly, even in the midst of noise, than your computer is.

Despite these advantages, Siri is likely to misunderstand much more than it seemed to during yesterday’s Apple presentation. Did you notice how carefully Scott Forstall asked, “What Is The Weather Like Today?” Each word clearly articulated. Contractions fastidiously avoided. Reading from a script. Siri understood him well, but note that this was in a quiet room: no TV going in the background, no car humming, no coworkers laughing, no kids arguing. I think it’s possible that Siri’s voice recognition could learn to understand my voice pretty darned reliably even in noisier settings like those. But I wouldn’t be surprised if it often gets me wrong, sometimes with disastrous results. Just think how much fun it will be when I say, “Send a text to Andrea that says ‘I love you,’” and Siri hears, “Send a text to Andrew that says ‘I love you.’” I look forward to seeing how reliable it really is.

The job of the speech-to-text analyzer is to turn your voice into written text. Text on its own, however, is just a jumble of letters to a computer. An additional piece of software is needed to turn the text into something useful. Siri needs to recognize that the string “send a message…” maps to the action of creating a new text message. It needs to understand that the phrase “my son” refers to the contact “Liam Wofford.” It needs to connect the word “here” with your current GPS position. This complex mapping of strings to functions is the job of a lexical and grammatical analyzer.

This is a tough job. In the ’80s there was a game company called Infocom that dramatically raised the bar on how computers understand text. Before Infocom, text-based games could only understand two-word phrases. “Hit ball.” “Eat mushroom.” Infocom gave their games the ability to understand whole sentences, complete with nouns, verbs, objects—even prepositional phrases. You could tell the game, “Hit the ball with the wooden bat,” and it would reply, “You swing with all your might and knock the ball out of the park!” It was amazing, and it made for some terrific games.

Siri has taken that kind of grammatical analysis to a new level. But despite the gap of almost thirty years, Siri is inches, not light-years, beyond Infocom’s Zork. Grammatical analysis still comes down to searching a string for certain key phrases and using those phrases to build up a simple model of what the user wants to do and what he or she wants to do it to. Again, Siri’s limited focus on appointments, contacts, messages, and maps makes this technically viable.
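A key-phrase matcher of this kind can be sketched in a few lines of Python. The patterns and action names below are my own invention for illustration, and a real system would need far more of them, but the principle is the same: find a known phrase, then pull out the action and the thing it applies to.

```python
import re
from typing import Dict, Optional

# Hypothetical key phrases; each one names an action and captures its arguments.
PATTERNS = [
    ("send_text", re.compile(
        r"^send a (?:text|message) to (?P<target>.+?) that says (?P<body>.+)$", re.I)),
    ("set_alarm", re.compile(r"^set an alarm for (?P<target>.+)$", re.I)),
    ("get_directions", re.compile(r"^(?:give me )?directions to (?P<target>.+)$", re.I)),
]

def parse(text: str) -> Optional[Dict[str, str]]:
    """Return a simple model of the request, or None if no key phrase matches."""
    cleaned = text.strip().rstrip(".!?")
    for action, pattern in PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return {"action": action, **match.groupdict()}
    return None

print(parse("Send a text to Andrea that says I love you"))
# {'action': 'send_text', 'target': 'Andrea', 'body': 'I love you'}
print(parse("Hit the ball with the wooden bat"))
# None
```

Notice that the matcher has no idea what a text message is. It only knows that certain words in a certain order mean "call this function with these strings."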

What makes Siri’s grammatical analysis impressive is its integration with other aspects of the phone. One of the most exciting parts of the demonstration was when Scott Forstall told Siri (at 79:45 in the linked video), “Remind me to call my wife when I leave work.” Along with understanding that “leave work” means move outside of a defined GPS area, Siri had to know that “my wife” mapped to Scott’s wife—an entry in his Contact list.
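The "leave work" half of that reminder is easy to picture in code. This is only a sketch under my own assumptions: that "work" is stored as a center point plus a radius, and that the phone compares its previous and current positions. The `Fence` type, the function names, and the coordinates are all made up.

```python
import math
from typing import NamedTuple, Tuple

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

class Fence(NamedTuple):
    """A circular area: center latitude and longitude in degrees, radius in meters."""
    lat: float
    lon: float
    radius_m: float

def distance_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def has_left(fence: Fence, previous: Tuple[float, float], current: Tuple[float, float]) -> bool:
    """True only when the last position was inside the fence and the new one is outside."""
    def inside(point: Tuple[float, float]) -> bool:
        return distance_m(fence.lat, fence.lon, point[0], point[1]) <= fence.radius_m
    return inside(previous) and not inside(current)

work = Fence(lat=10.0, lon=20.0, radius_m=200.0)  # made-up coordinates
at_desk = (10.0, 20.0)
down_the_road = (10.01, 20.0)  # roughly 1.1 km north of the center

print(has_left(work, at_desk, down_the_road))  # True
print(has_left(work, at_desk, at_desk))        # False
```

The reminder fires on the transition from inside to outside, not merely on being outside, so it does not go off all evening once you are home.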

But how did Siri learn who Scott’s wife was? The demo didn’t show us, but I have a suspicion about how it works.

The Mac Address Book has long had an entry for setting up relationships between contacts. I can indicate who my spouse is in Address Book. I suspect that the iPhone Contacts app will gain similar new fields in iOS 5. Siri will use this information to create the mapping between the phrases “my husband”, “my wife”, “my spouse” and the person whom you’ve identified as your spouse. This mapping will no doubt be mechanical, not “insightful.” Siri won’t understand who your spouse is—it’ll just record a string-to-Contact mapping. For example, you might be able to say “my husband” and have Siri find your wife. As far as I know, Address Book doesn’t keep information about the sex of each person, so Siri will probably treat all “spouse” words as identical. Let’s try it when it comes out.

Will Siri be able to recognize the phrase, “my boyfriend” or “my girlfriend”? Perhaps. What about arbitrary terms of endearment, like “my pookums” or “honeybuns”? Again, it’s quite possible. Address Book has an option for “Custom…” in the relationship field. You can add a custom label “pookums” and indicate your spouse or girlfriend or dog or whatever there. Now if Siri hears you say “pookums,” Siri can recognize that contact.
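If my suspicion is right, the whole relationship feature could be as simple as the Python sketch below. To be clear, this is speculation: the function names are hypothetical, "Pat Example" is a placeholder contact, and I am assuming the relationship labels arrive as a plain label-to-contact table.

```python
from typing import Dict, Optional

# Assumption: all "spouse" words are treated as interchangeable.
SPOUSE_WORDS = {"husband", "wife", "spouse"}

def build_label_index(relationships: Dict[str, str]) -> Dict[str, str]:
    """Map every usable label (lowercased) to a contact name."""
    index: Dict[str, str] = {}
    for label, contact in relationships.items():
        key = label.lower()
        index[key] = contact
        if key in SPOUSE_WORDS:
            # Fill in the other spouse words without overwriting explicit entries.
            for word in SPOUSE_WORDS:
                index.setdefault(word, contact)
    return index

def resolve(phrase: str, index: Dict[str, str]) -> Optional[str]:
    """Look up a phrase such as "my wife" or "pookums"; return None if it is unknown."""
    words = phrase.lower().split()
    if words and words[0] == "my":
        words = words[1:]
    return index.get(" ".join(words))

index = build_label_index({
    "spouse": "Pat Example",   # placeholder contact
    "pookums": "Pat Example",  # a custom label
    "son": "Liam Wofford",
})

print(resolve("my husband", index))  # Pat Example
print(resolve("pookums", index))     # Pat Example
print(resolve("my boss", index))     # None
```

There is no understanding here at all, just a dictionary lookup. That is exactly why "my husband" and "my wife" would land on the same person.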

What I hope you’re seeing is that what Siri does isn’t science fiction and it certainly isn’t magic. It is the old and still-developing technology of speech-to-text analysis and the old and fairly mature technology of simple grammatical analysis and string matching.

And then there’s the third component, which is the set of services that Siri can send your commands to. This is the most modest and familiar part of the system. You already have a calendar app and you can press buttons to view and create appointments. Siri will push those buttons for you, in essence. You already have a maps app and you can search and find directions there. Siri will enter your search text for you, and can toggle traffic on and off by voice rather than by button. You already have Wikipedia and you can type search terms into it. Now Siri can type your search terms for you.

At this level, Siri isn’t doing anything you can’t already do. It’s just doing it hands free, by voice. This, clearly, is the big benefit of Siri, even if it’s not the most technically interesting part of the system.

Whether Siri is successful will depend fundamentally on the quality of its speech-to-text analyzer. If it can understand me, it will work. The grammatical analysis and service providing parts of the system are relatively modest in terms of technical difficulty and I suspect Apple has these in hand. I don’t want to trivialize these technologies—judging from the demo, Apple has done their usual, remarkable job of building a slick and natural-feeling user experience, and that takes tremendous skill and effort. But whether Siri becomes the model for how humans interact with computers in the future or whether it gets laughed off the stage of technical innovation like so many AI systems that have come before hinges on whether it can tell the difference between “Andrew” and “Andrea”—especially when I’m in a crowded coffee shop, speaking with a Southern drawl, with a stuffed-up nose from a bad cold.

I hope it does work. I’ve wanted this functionality for years—decades. I want it in my house and my car as well, but if I can get it on my phone the rest will follow.

Apple also deserves credit for doing some delightful things to amplify the “theatrics” of Siri’s AI. By that I mean that they’ve cooked up little alternative phrases and responses that make Siri seem smarter than it (she?) is. For example, she’ll say, “Let me check on that,” or “Let me think,” when a traditional computer would spin a spinner or just say “Loading…” In another demo video (see below) a woman asks whether it will be chilly in Napa Valley. (Actually she asks first about San Francisco, then changes the location to Napa Valley without having to repeat the question. Nice.) Siri replies, “Doesn’t seem like it.” That’s a very nice alias for “No.” It doesn’t take any more “smarts” to say “Doesn’t seem like it” than “No,” but it sounds a lot smarter. More natural. That’s what I mean by “theatrics”: making the computer seem smarter by changing the way it expresses output. Again, I don’t want to trivialize what Apple has done. Theatrics are important and getting them right is non-trivial. But it’s important to keep a realistic view of how intelligent Siri really is.
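The theatrical trick is about as simple as programming gets. Here is a hedged sketch in Python, using only the phrases quoted above; the `PHRASES` table and the `say` function are my own invention, and I am assuming the phrase is picked at random.

```python
import random
from typing import Dict, List

# Each underlying meaning has several surface phrasings (all quoted from the demos).
PHRASES: Dict[str, List[str]] = {
    "no": ["No.", "Doesn't seem like it."],
    "wait": ["Loading…", "Let me check on that.", "Let me think."],
}

def say(meaning: str) -> str:
    """Pick one phrasing for a meaning; the choice adds variety, not intelligence."""
    return random.choice(PHRASES[meaning])

print(say("wait"))
print(say("no"))
```

The program knows exactly as much whichever string it prints. Only the listener's impression changes.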

Here’s Apple’s Siri demo trailer.
