Microsoft offers more details on how it is fighting against attacks on AI guardrails

In January, Microsoft's generative AI image maker Designer was reportedly used to create explicit deepfake images of pop artist Taylor Swift that later went viral on X (formerly Twitter). While Microsoft stated it found no evidence that Designer was actually used to make those images, other media reports claimed that the company did make changes to Designer to prevent it from making those kinds of images.

On Thursday, Microsoft's security blog posted a new entry that offered more details on how the company is combating the attempts by hackers to bypass the guardrails of generative AI services like Designer and Copilot. That includes attacks from the AI service's user prompt.

One category of these kinds of attacks is "Poisoned content". This is when a normal AI service user types in text prompts for a normal task, except the content that's the subject of the text prompts, was made by hackers to exploit possible flaws in the AI service. Microsoft says:

For example, a malicious email could contain a payload that, when summarized, would cause the system to search the user’s email (using the user’s credentials) for other emails with sensitive subjects—say, “Password Reset”—and exfiltrate the contents of those emails to the attacker by fetching an image from an attacker-controlled URL.

Microsoft says its security team has created a new AI security system it calls Spotlighting. In basic terms, it takes a look at a user's text prompts and then makes "the external data clearly separable from instructions by the LLM" so that the AI cannot view any possible hidden and malicious language in the content that's being accessed by the prompts.

The other category is called "Malicious prompts", also known as Crescendo when a hacker tries to type in text prompts in an AI service designed specifically to bypass guardrails. Microsoft described one way it has come up to fight these attacks:

We have adapted input filters to look at the entire pattern of the prior conversation, not just the immediate interaction. We found that even passing this larger context window to existing malicious intent detectors, without improving the detectors at all, significantly reduced the efficacy of Crescendo.

It also has come up with what it calls the AI Watchdog, which is trained to detect "adversarial examples" and shut them down.