A New Turing Test

It’s been an odd couple of years.

Take this post on LinkedIn from last week:

[Image: LinkedIn post suggesting the author did some AI automation]

The meta-point the author is trying to make is that testing is always testing. Whether it is AI or Selenium or Playwright doesn’t matter; tools just assist in testing. I don’t disagree.

My question though was: What are these repetitive processes he is automating? What did he actually DO?

The deeper I dug into this, the more it became clear the poster was not comfortable talking about the specifics of what he actually did. It is possible, even likely, that the author didn’t actually do anything with AI at all. Instead, he was using AI to make a rhetorical point.

How many people read that post but not the replies, took him seriously, and now have AI Envy?

Let’s talk about what AI is for just a minute, and then propose a new Turing test.

How we got here

Alan Turing, the British computer scientist who helped crack the Enigma code in the Second World War, proposed a well-known test for Artificial Intelligence. Turing’s claim was that, essentially, if you could sit at a terminal and have a conversation, and not tell whether the words coming back from the other side were coming from a computer or a human, then we could say we have achieved Artificial Intelligence. The first serious attempt at this was ELIZA, a computer program from the 1960s that simulated a psychotherapist, the kind that would reply to nearly any statement with “… and how do you feel about that?”, or notice keywords like “mother” and ask “tell me about your mother?” Of course, ELIZA could be tricked; it didn’t understand that some words have double meanings that an educated English speaker would pick up on, and it would guess the wrong meaning.

Some would argue that we are there, today, with ChatGPT and other large language models. Of course, the output of ChatGPT is stilted, awkward, subservient. Some even argue that because of this, ELIZA is better at passing the Turing test.

I would suggest something different, at least for AI in software delivery. I propose that the Turing test might have outlived its usefulness.

When it comes to AI at work, we don’t want to type at a desk and be unable to tell the machine from our next-door neighbor. Increasingly, we don’t just want it to have some subject matter expertise we lack. Instead, we want the AI to do something for us.

With AI Agents, companies like Microsoft are claiming the next generation of tools will do just that.

The problem with the Microsoft post I mentioned above is that it is sort of hollow. Information will be available at our fingertips, it says. Is that really much of a leap forward from clicking the search button in the documentation? Or the tools will alert supply chain managers to low inventory and automatically re-order. That sounds like something, right?

I’ve thought of a new test to see when these tools will actually do what they claim. Read it, then feel free to tell me how I’m wrong.

A New Turing Test

Tools like Playwright can drive my browser. LLMs are supposed to be able to figure out how to use a user interface.

To mimic a human, an AI script would need to be able to solve problems at least this complex:

Based on a few sentences, do my trip planning for me. Show a plan, then offer a single button for review and approve. The approve button actually purchases the plane tickets, the hotel, and the rental car.

That’s it. That’s the new Turing test. Write an app that can do trip planning all the way through checkout, and that works. To do that, it would have to work on multiple search engines and aggregate their results.
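
To make the shape of that test concrete, here is a minimal sketch in Python. Every name in it (Offer, cheapest_offer, review_and_approve, the fake engines) is hypothetical and made up for illustration. It assumes each search engine can be wrapped as a function that returns priced offers, which is exactly the hard part a real tool would have to solve.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Sequence


@dataclass(frozen=True)
class Offer:
    provider: str
    description: str
    price: float


# Assumption: each "search engine" is wrapped as a callable returning offers.
SearchEngine = Callable[[str], Iterable[Offer]]


def cheapest_offer(request: str, engines: Sequence[SearchEngine]) -> Offer:
    """Aggregate offers from every engine and pick the cheapest one."""
    offers = [offer for engine in engines for offer in engine(request)]
    if not offers:
        raise LookupError("no offers found")
    return min(offers, key=lambda offer: offer.price)


def review_and_approve(
    plan: Offer, approved: bool, purchase: Callable[[Offer], str]
) -> Optional[str]:
    """Nothing is bought unless the human pressed the approve button."""
    if not approved:
        return None
    return purchase(plan)


# Stand-in engines with hard-coded results, for illustration only.
def fake_engine_a(request: str) -> Iterable[Offer]:
    return [Offer("A", "PDX round trip", 412.0)]


def fake_engine_b(request: str) -> Iterable[Offer]:
    return [Offer("B", "PDX round trip", 389.0)]


def fake_purchase(offer: Offer) -> str:
    return f"bought {offer.description} from {offer.provider}"


plan = cheapest_offer("Oregon, April 7-16", [fake_engine_a, fake_engine_b])
receipt = review_and_approve(plan, approved=True, purchase=fake_purchase)
```

The plumbing is the easy part. Everything interesting hides inside the engines and the purchase step, which is where logins, CAPTCHAs, and credit cards live.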

Call it the Heusser test, maybe.

This is a simple, straightforward, obvious tool that would be useful for an incredibly large percentage of the population. It should be relatively easy to build. Based on the current rhetoric around AI, it should have been available in December.

And yet it is not.

Why don’t we have this tool? Shouldn’t we have this tool? Before we claim AI Agents are going to be awesome and automagically put our code in production, shouldn’t a tool like this, you know … exist?

Take a minute, as a thought exercise. See if you can figure out why this tool doesn’t exist.

The most obvious problems are password storage, usernames, credit card information, and liability. In some cases, right now, sites use CAPTCHAs, refuse to let passwords be cached, or otherwise try to force a human click. These sites are designed to prevent automation. Yet I dare submit we’ll have to overcome very similar problems for agents to live up to the claims that have been made.

You see, we will have the exact same problems when we try to have agents checking in code, filing issues, and taking code to production (or, if you take a different view of agents, bypassing the GUI layer and talking right to the appropriate databases). Some of them, such as caching a password while keeping it private, have actually been solved first by software engineering through secret management.
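
As a rough illustration of what secret management means in practice, here is a minimal Python sketch. It assumes the secret is handed to the process through an environment variable, which is only one common approach. The helper names load_secret and masked are hypothetical, invented for this example.

```python
import os


def load_secret(name: str) -> str:
    """Read a secret from the environment instead of hard-coding it."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Secret {name!r} is not set")
    return value


def masked(secret: str, visible: int = 4) -> str:
    """Return a log-safe form that shows at most the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

The script can use the password without the password ever appearing in the code or the logs. That handles storage. It does nothing for the CAPTCHA or the liability question.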

When it comes to technology, there are thousands of times more people than companies, so usually the mass-market solutions come first. Apple Computer, for example, has an amazing history of this, taking proofs of concept built for techies and getting them to work cleanly for a mass audience. Apple did not invent the personal computer, the windowed operating system, the portable music player, or the tablet. Instead, they came to these emergent categories and made things that were widely appealing. I look to Apple to do what is next, only once it is possible to do what is next.

When Apple came out with Apple Intelligence, they did it with what was actually possible with LLMs: summarizing documents, composing emails, and correcting grammar. I expect the next Turing test, as I’ve defined it, might be passed by Apple. It won’t be today. If and when that happens, I’ll see the rhetoric around agents as having more potential. Until then, it’s hard for me to get excited about yet another claim that something new and exciting is coming. “Watch this space” and dancing unicorns, all six months out, is a little too tiring for me. Like Steve Jobs’s old keynotes, just tell me when the product exists, please.

Project Mariner is probably the most promising of the current bunch.

One thing I know is, the day that tool is broadly available won’t be today, and it won’t be tomorrow.

Now tell me how I am wrong.


Postscript

You might ask what kind of input I would feed into the computer to see if it would pass the new Turing test, so I wrote it for you:

“I want to go to Oregon during spring break for school. Taking (Child name) and Matt. That is April 7-16 – she needs to be in school at 3:00PM on the 15th and back home on the 16th in the afternoon to rest for school on the 17th. If we can save significant money by flying a day or two later than the 7th or a day or two earlier than the 16th, that is worth considering. We’ll fly into PDX. She doesn’t like to fly early – the flight should not leave before 10AM, and we need to arrive before 9PM Pacific. For departure we can leave anytime after 6AM Pacific. We’ll drive from there to Salem, OR (need a rental car) to stay at a 2-star hotel. The hotel needs free wifi, breakfast, I’d like a manager’s special included, free parking, a pool. I would really like a hot tub. You could consider AirBnBs if the stay is so cheap I save enough money to get the breakfast and dinner. Consider the full cost of the stay; a cheap restaurant with decent reviews next to the hotel might work. We like a real hot breakfast, not a continental breakfast. Check the reviews to make sure the breakfast is not all carbs. We want to stay within 5 miles of (mumble) adult care home. I figure there is a 20% chance we cancel, so if the trip insurance is less than 20% of the cost, go ahead and get it.”
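
The last sentence of that request is the one part that reduces cleanly to arithmetic. A minimal sketch in Python follows. The function name is hypothetical, and it assumes the trip cost would be lost entirely on cancellation and that the insurance refunds it in full, neither of which the request actually says.

```python
def should_buy_insurance(
    trip_cost: float, premium: float, cancel_probability: float = 0.20
) -> bool:
    """Buy when the premium is below the expected loss from cancelling.

    Assumes a cancelled trip loses its full cost and insurance refunds all of it.
    """
    expected_loss = cancel_probability * trip_cost
    return premium < expected_loss


# Example figures are invented: a $2,000 trip with two different quotes.
cheap_quote = should_buy_insurance(trip_cost=2000.0, premium=300.0)
pricey_quote = should_buy_insurance(trip_cost=2000.0, premium=450.0)
```

With a 20% chance of cancelling, the expected loss on a $2,000 trip is $400, so the $300 quote is worth buying and the $450 quote is not. Most of the other sentences in the request are nowhere near this tidy.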

Automating this ticket purchase is significantly easier than writing software.

And yet …

How is it going to quantify the value of a hot tub? Or the distance to the adult care home? What if there is an ideal match that is 5.5 miles away? As James Bach once told me: “The core problem with building a machine that does what we want is that we don’t yet know what we want.”
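
The 5.5-mile question can be sketched as the difference between a hard filter and a soft score. The Python below is only an illustration. The hotels, the prices, and especially the weights (40 dollars per extra mile, 15 dollars for a hot tub) are invented, and picking those numbers is precisely the part nobody knows how to do for me.

```python
from dataclasses import dataclass


@dataclass
class Hotel:
    name: str
    miles_to_care_home: float
    nightly_rate: float
    has_hot_tub: bool


def hard_filter(hotels, max_miles=5.0):
    """Treat "within 5 miles" as a strict rule: anything farther is dropped."""
    return [h for h in hotels if h.miles_to_care_home <= max_miles]


def soft_score(hotel, max_miles=5.0, penalty_per_mile=40.0, hot_tub_value=15.0):
    """Lower is better. The weights are guesses, not anyone's real preferences."""
    miles_over = max(0.0, hotel.miles_to_care_home - max_miles)
    score = hotel.nightly_rate + penalty_per_mile * miles_over
    if hotel.has_hot_tub:
        score -= hot_tub_value
    return score


# Invented example data.
hotels = [
    Hotel("Near but pricey", 2.0, 140.0, False),
    Hotel("Ideal but 5.5 miles out", 5.5, 95.0, True),
]

strict_choices = hard_filter(hotels)
best_by_score = min(hotels, key=soft_score)
```

The strict rule throws away the hotel at 5.5 miles, while the soft score picks it. Both readings of my request are defensible, and the software has no way to know which one I meant without asking.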

You might say the ideal software would go out to the internet, do the analysis, then come back and ask me a set of questions. Over time, it could determine my values, and fair enough.

My point here is: This is easier than writing software. It would be used almost immediately by millions. Yet we don’t have it.

Until we do, it is hard for me to take seriously the claim that autonomous agents are going to get rid of programmers or SaaS or whatever the newest claim is.

Now, you could think of this blog post as a sort of call to arms, a challenge for someone to go build the app. Be the first one to do it. Push computer science forward.

That would be great. I would use the app. And in building it, I dare submit, we would learn a lot.

 
