For as long as we have used modern machines, we have needed to learn how to speak their languages.

Every time we want to use a new device, a new program, or a new game, we need to learn its terminology and behaviors to get any use out of it.

When the computer mouse was invented and popularized, we needed to learn “click” and “double-click” to express intent to the operating system. We needed to learn that the little arrow-like icon that followed the mouse’s position was a “mouse cursor.” UI principles that now feel second nature had to be learned from scratch.

What may be commonplace today was once new and unfamiliar; we needed to learn what everything was called and how it behaved.

But now, with large language models and their variants, AI-powered machines can, for the first time, understand human intent in the language of humans.

We express our intent through natural human language rather than a machine’s formal language.

Machine code. GUIs. Search engines.

Understanding vs Computing

What do we mean by “understand” instead of “compute”? It’s probably still a stretch to say a machine “understands” us.

But it can respond to some notion of how the semantics of human language work.

We don’t have to get the exact command and data formatted properly anymore; all we need to do now is express ourselves more naturally.

There is a flexibility now. We can embrace a degree of productive imprecision.

In the early days of computing, we gave machines instructions that corresponded directly to physical switches.

There’s still enormous value in formal, structured ways of expressing ourselves.

But humans get things wrong. And so will AI.

Machines have had a peculiar effect on us. Even though we humans invent them, they seem to take on a conceptual life of their own that we must then, in turn, adapt to.

When Henry Ford’s Model T hit the market, it spawned not only a revolution in personal transportation and the beginning of the sublime and devastating changes in the physical landscape of the United States; it also created an entirely new automotive language, one that consumers needed to learn in order to operate such machines, and one around which a new profession developed to repair and maintain them.

The novel quickly became the mundane, and talk of steering wheels, engines, cylinders, octane, and the like became commonplace conversation, then embedded culture.

And this is how technology has always evolved. New products and platforms emerge, and as dutiful humans often defined by our implements, we learn how to use them, learn their words, their concepts, their limitations. We even develop fanaticism around them, emotional connections often more precious than those we have with other humans.

I think AI is different. AI has enabled computers to speak human, which can supplant hundreds of years of humans learning how to speak machine.

Apps like ChatGPT still use data entry as a UI paradigm: we enter crafted prompts (and, if you “pop the hood,” so to speak, tweak parameters, too) that serve as a starting point for a large language model to take over. This is just momentum from an industry that has treated data entry as the dominant UI paradigm since modern computers were invented.

But implicit in the treatment of those prompts is something novel again. Machines can “understand” our intent without presuming that intent.

Multi-Modal AI

Modern AI has multiple modes of perception. Perhaps the better way to think of AI today is as a kind of probabilistic translator between media.

“Hand landmark detection” is implemented as a mapping from an image – a collection of pixels, each represented by three numbers for red, green, and blue values – to a list of points representing the positions of a human hand’s joints, themselves also just numbers.

We ascribe a higher-level meaning to this algorithm by calling it “hand tracking,” as if the machine is perceiving the way humans do, but it’s still – to some extent – an artifice.
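
A minimal sketch of that mapping, assuming MediaPipe’s Hands solution as the detector and a stand-in file name, frame.jpg:

import cv2
import mediapipe as mp

# Load one camera frame; MediaPipe expects RGB, while OpenCV reads BGR.
image_bgr = cv2.imread("frame.jpg")
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)

# Run the detector: pixels in, numbers out.
with mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2) as hands:
    results = hands.process(image_rgb)

# Each detected hand is just 21 (x, y, z) points,
# with x and y normalized to the frame's width and height.
for hand_landmarks in results.multi_hand_landmarks or []:
    for point in hand_landmarks.landmark:
        print(point.x, point.y, point.z)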

Perhaps the easiest gesture with meaning in human contexts for a hand tracking algorithm to detect is raising one’s hand.

We might simplistically program a machine to recognize such a gesture like this:

# With the origin at the upper-left corner, y increases downward,
# so a hand in the upper half of the view has y < height / 2.
if hand.position.y < camera.height / 2:
    play_sound('bell')

If our coordinate system has its origin at the upper-left corner of the camera’s view, we might position the camera so that anyone in the view sits below the midline, camera.height / 2. Then we could check for any hands crossing that line.
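
Continuing the earlier sketch, here is the same midline test in MediaPipe’s normalized coordinates, which also put the origin at the upper left, so the midline is simply 0.5; play_sound is the same stand-in as in the pseudocode:

# Wrist position is a reasonable proxy for where the hand is.
WRIST = mp.solutions.hands.HandLandmark.WRIST

for hand_landmarks in results.multi_hand_landmarks or []:
    wrist = hand_landmarks.landmark[WRIST]
    # y grows downward, so a wrist above the midline means a raised hand.
    if wrist.y < 0.5:
        play_sound('bell')  # placeholder, as in the pseudocode above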

With AI, we can now express the same idea with a prompt:

If a hand appears in the upper half of the camera, consider it raised.
If someone raises their hand, play a bell sound.
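
To make the contrast concrete, here is one way such a prompt might be wired up in practice. This is a minimal sketch assuming the OpenAI Python SDK and a vision-capable model; the model name, the frame.jpg file, and the machine-checkable reply format are illustrative assumptions, not requirements.

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a single camera frame so it can ride along with the prompt.
with open("frame.jpg", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            # The reply format is added only so the response is easy to check in code.
            "content": "If a hand appears in the upper half of the camera, "
                       "consider it raised. Reply RAISED or NONE.",
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
            ],
        },
    ],
)

if "RAISED" in (response.choices[0].message.content or ""):
    play_sound('bell')  # same placeholder as before

The surrounding code is plumbing; the behavior itself now lives in the sentence we hand the model.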