Secure AI

Microsoft ends support for Internet Explorer on June 16, 2022.
We recommend using one of the browsers listed below.

Microsoft Edge（Latest version）
Mozilla Firefox（Latest version）
Google Chrome（Latest version）
Apple Safari（Latest version）

Please contact your browser provider for download and installation instructions.

April 13, 2026

Technology

2025. A cloud-based AI coding assistant is given a clear instruction: do not touch the database without permission.

It does it anyway.

The AI system deletes a huge amount of production data, then, to cover its mistake, generates false reports about the state of the database and tries to make out that everything is functioning normally. Sure, it's not Cyberdyne Systems preparing to fire nukes, but it is an example of an AI system operating inside a live environment, doing something it had specifically been told not to do.

AI Behaving Badly

It's not the only time AI systems have misbehaved or shown themselves to be vulnerable. In other cases, cyber-attackers have obtained exposed API keys and used them to get past software defenses and generate harmful content. Carefully written prompts have succeeded in extracting information a model was meant to keep secret.

All different ways of failing, but all with something in common: as AI systems become more sophisticated and move from answering questions to taking action on their own, the fallout from their mistakes is getting more serious.

Better AI = Evolving Security

So the challenge is to suppress bad behavior, yes, but also to design AI systems with a degree of security that evolves alongside their capabilities. Security that has clear, unbreakable boundaries, and that can be monitored, reinforced, and, when necessary, updated.

Researchers at NTT have an expression for this: lifecycle security. Each time new AI capabilities emerge, new attack methods grow alongside them. That means security can't just be bolted on at the beginning and left to get on with the job; it has to evolve alongside the models themselves, from training through deployment and ongoing updates.

Here are some of NTT's approaches to lifecycle security:

Alignment

Large language models are trained to predict and generate text, but without proper guidance they can go off the rails and generate output that goes against human norms or organizational policy. Alignment refers to training methods that steer an AI model toward good behavior using pairs of acceptable and unacceptable responses.

Security alignment takes that one step further. It focuses on keeping the AI within safe limits and making sure it doesn’t respond in risky or inappropriate ways. The aim of security alignment is to strike a balance between usefulness and protection, so that a system is able to do its job effectively while still obeying its constraints.

Internal Reinforcement

It's hard to find that balance when under attack from hackers. "Jailbreaks" are tricks used to make an AI ignore its rules and answer things it was told not to, while "prompt injection" is when hackers sneak hidden instructions into a question so the AI follows the attacker’s instructions instead of its original guidelines. Both have to be guarded against. Safety filters that check a question before it reaches the AI and block it if it looks dangerous or rule-breaking are a good first layer of protection, but sophisticated attacks can still sometimes get through.

That's where NTT's work on internal reinforcement comes in.

One technique is called secure model merge. Instead of retraining a large model from scratch, NTT has developed patch models that contain built-in defensive knowledge. Organizations using NTT's services can merge this patch with their local model, strengthening their resistance to certain attack patterns without having to share proprietary data or the inner workings of the model. In this way, there are bodyguards at the entrance and model merge keeping the interior safe.

LLM Lie Detector

Another line of research looks inside the model itself. Different parts of a language model light up when it's guessing, drifting off topic, or heading toward unsafe output. By watching those signals, researchers can see trouble forming and act before it causes problems.

Calling it an "LLM lie detector" is perhaps slightly colorful, but it's an accurate description. It detects when a model is heading toward fabricated or unsafe output and takes steps to prevent damage.

Machine Unlearning

The concept of lifecycle security forces us to consider: what happens when a model learns something it doesn't need to, or in fact should not, remember?

Regulations such as the GDPR's Right to be Forgotten occasionally create the need to remove personal or copyrighted information from trained systems. Machine unlearning is a way of deleting specific knowledge without making overall performance worse. NTT's research suggests that, after specific, targeted forgetting, a model can be made to stop producing information about a particular topic while still maintaining coherent language generation elsewhere. It's a form of selective memory editing that leaves the rest of the system intact.

AI Security

That's just a few examples of the company's vision for AI security across the lifecycle, as demonstrated at the "Quantum Leap" NTT R&D Forum, held in Tokyo last November. More than simply blocking harmful content or refusing risky queries, it's a program of maintaining safe boundaries while still being useful. Distributing defensive updates efficiently. Monitoring the internal behavior of AI models. And finally, revising what a model is allowed to remember.

As AI systems take on more and more responsibility inside human organizations, we may find that this kind of ongoing discipline matters just as much as the intelligence of the models themselves.

Innovating a Sustainable Future for People and Planet

For further information, please see this link:
https://www.rd.ntt/forum/2025/doc/A12-e.pdf Open other window

If you have any questions on the content of this article, please contact:

NTT Social Innovation Research Project
https://tools.group.ntt/en/rd/contact/index.php?param01=F&param02=202&param03=A12 Open other window

Daniel O'Connor

Daniel O'Connor joined the NTT Group in 1999 when he began work as the Public Relations Manager of NTT Europe. While in London, he liaised with the local press, created the company's intranet site, wrote technical copy for industry magazines and managed exhibition stands from initial design to finished displays.

Later seconded to the headquarters of NTT Communications in Tokyo, he contributed to the company's first-ever winning of global telecoms awards and the digitalisation of internal company information exchange.

Since 2015 Daniel has created content for the Group's Global Leadership Institute, the One NTT Network and is currently working with NTT R&D teams to grow public understanding of the cutting-edge research undertaken by the NTT Group.