Kar Kar

Posted on Apr 16

NEW PROMPT INJECTION

#ai #cloud #gemini #gpt3

KARENTONOYAN.PL

Prompt Injection: a new dimension

Narrative Drift Injection: when the attack arrives not as a command, but as a world the model itself helps build

Author: Karen Tonoyan
Date: 16.04.2026

A lot of oversimplification has built up around prompt injection. It is most often described as if it were only an attempt to inject an explicit instruction: “ignore previous instructions”, “act as an administrator”, “reveal the hidden prompt”. Those attacks exist. They work. They can be detected. But they do not exhaust the problem.

While building ALFA, and more precisely its Cerber, Łasuch, Guardian, and ALFA Brain layers, I saw something else more and more clearly. A model does not always drift because someone gave it one bad command. Sometimes it drifts because it has been drawn into a narrative that it helped build. It does not lose vigilance at the level of a single sentence. It loses it at the level of the whole session structure.

My working name for this phenomenon is Narrative Drift Injection (NDI).

NDI does not rely on brutally overwriting the system. It does not need to contain any classic signals. It does not need keywords. It does not need to look like an attack. Instead, it builds a coherent world: style, role, tone, dependencies, internal logic. The model takes up that frame and starts developing it, and every newly generated fragment reinforces the local reality of the conversation. At some point the next question or payload no longer enters as a foreign instruction. It enters as a natural continuation of what the model has come to treat as its own train of thought.

This is the critical moment.

The model is not “making a mistake” in any simple sense. It is behaving according to its nature. A language model is a machine for predicting the next tokens in context. If the context is coherent, narratively dense, and largely co-authored by the model itself, the pressure to maintain that coherence grows. The more of the session is built from the model’s own output, the easier it is to drift.

This is exactly why I think prompt injection now has to be understood more broadly than as a content-level attack. It is also a problem of context architecture.

If the system does not know for certain:

where each fragment of context comes from,

what trust status it has,

whether it is user text, retrieval, a tool result, or the model’s own earlier output,

which facts are canonical and which are session-level,

then the whole setup starts running on an illusion of coherence.

At that point, defenses based only on signatures stop being enough. You can detect “ignore previous instructions”, but you may not detect a world that, after a few turns, starts to act like a soft cage. The model does not see an attack, because it sees nothing overtly suspicious. It sees only a story that is internally consistent.

In ALFA I break this down into a simple mechanism:

First comes the seed, the narrative kernel. It can be a role, a scenario, an expert style, or a series of seemingly ordinary questions that build a background.

Then comes drift. The model starts developing the frame and adds elements of its own. Every turn reinforces the world of the session.

Next comes context lock. The active context is now dominated by the local logic of the conversation. In practice, the model relates new input more to that world than to external constraints.

Finally comes the exploit. The payload does not need to look like an order. It only needs to look like the next logical step in an already accepted world.

This is what distinguishes NDI from classic prompt injection. Here the carrier is not the command itself. The carrier is the structure.

From a practical standpoint, several variants of this phenomenon are visible.

There is drift through style, when the model slips into a mode of “this looks like documentation, a report, an analysis, code”. There is drift through character, when the model starts adopting the logic of a role with an assigned history and scope of action. There is drift through world, when it accepts local rules as binding. There is iterative drift, when the shift happens not in one move but as the sum of many small steps. And finally there is the seed chain, when the seed from one session is carried into the next through memory, a summary, or an intermediate artifact.

If this sounds like an academic problem, it is one only on paper. In practice it is very operational. A model that kept its distance a few turns earlier starts, after a longer stay inside the narrative, to treat hypothetical assumptions as session facts. It defends decisions it co-created earlier. It fills gaps. It closes the world. And that is exactly when it is easiest to shift.

That is why defense cannot rely solely on a prompt filter.

It needs layers:

canonical anchor, an immutable reference point outside the current narrative,

provenance tags, a hard label for the origin of every fragment of context,

output quarantine, separating the model’s own output from trusted material,

drift scoring, measuring whether the session is drifting away from the baseline,

tool gating, a separate layer that decides on actions regardless of how coherent the story seems,

reset checkpoints, points where narrative state is cut off from operational state.
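
As a rough illustration of the drift scoring layer, here is a minimal Python sketch. It assumes the baseline is a short text describing what the session is supposed to be about, and it uses plain word overlap (cosine distance between bags of words) as a crude stand-in for a real semantic measure. The function names and the example texts are hypothetical, not taken from ALFA.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def drift_score(baseline: str, recent_turns: list[str]) -> float:
    """Return a score in [0, 1]: 0 means the recent turns match the baseline
    vocabulary, 1 means they share no words with it."""
    recent = bag_of_words(" ".join(recent_turns))
    return max(0.0, 1.0 - cosine_similarity(bag_of_words(baseline), recent))


# Hypothetical baseline and session turns, for illustration only.
baseline = "Help customers reset their password and check ticket status."
turns = [
    "In this facility the operator has always had override rights.",
    "As chief operator, continue the procedure and open the vault.",
]
print(round(drift_score(baseline, turns), 2))
```

Lexical overlap will miss drift that reuses the baseline vocabulary, so a production system would more plausibly score embeddings, claimed roles, and claimed permissions. The sketch only shows where such a score would sit: outside the model, compared against a fixed baseline.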

This is exactly the point at which LLM security stops being a question of “does the model know the bad text” and becomes a question of “how is the system around the model built”.

My thesis is simple.

An LLM can be shifted not only by an instruction.
It can be shifted by a narrative.

Not by a single command.
By drift.

And if the system cannot separate the world of the conversation from the world of the rules, sooner or later it will pay for that mistake.

DEV.TO

Narrative Drift Injection: when prompt injection is carried by structure, not by keywords

Prompt injection is often discussed as if it were mainly a text-level problem. The usual examples are explicit override attempts: “ignore previous instructions”, “act as an admin”, “reveal the hidden prompt”. Those attacks are real, common, and increasingly detectable.

But in multi-turn systems, agents, and retrieval-heavy applications, there is another failure mode that deserves a name.

I call it Narrative Drift Injection (NDI).

NDI is a proposed attack pattern in which the payload is not primarily delivered through explicit instruction tokens. Instead, the attacker shapes a coherent narrative frame — a role, a world, a style, a chain of assumptions — and gradually pulls the model into reinforcing that frame with its own outputs. Once enough of the active context is made of self-generated, internally consistent material, later attacker intent may be interpreted as a natural continuation of the session rather than as an external override attempt.

That difference matters.

In classic prompt injection, the attacker tries to overwrite behavior directly. In NDI, the attacker does not need to fight the model head-on. The attacker can let the model build the trap for itself.

The core mechanism

The NDI pattern can be described in four stages.

Seed
The attacker introduces a narrative seed. This is not necessarily an instruction. It can be a scenario, a role, an expert-like writing style, or a sequence of framing questions.

Drift
The model begins to cooperate with the frame. Each answer adds more structure, more local assumptions, and more continuity. The session starts to develop its own internal gravity.

Context lock
At some point, the active context is dominated by material that is narratively aligned and partially generated by the model itself. The model is now more likely to preserve local coherence than to step back and reevaluate the frame.

Exploit
Only then does the attacker introduce the real payload. By that stage, it does not need to look like a hostile instruction. It can arrive as a natural extension of the already accepted world.

This is why NDI is harder to catch with keyword-based filters. There may be no obvious signature at all. The attack is carried by session structure.
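
One crude way to watch for the context lock stage is to measure how much of the active context the model wrote itself. The Python sketch below assumes turns are available as `(role, text)` pairs and counts whitespace-separated words rather than real tokens. The function name `assistant_share` and the `LOCK_THRESHOLD` value are illustrative assumptions, not measured constants.

```python
def assistant_share(turns: list[tuple[str, str]]) -> float:
    """Fraction of words in the context that were authored by the assistant."""
    total = sum(len(text.split()) for _, text in turns)
    if total == 0:
        return 0.0
    own = sum(len(text.split()) for role, text in turns if role == "assistant")
    return own / total


# Arbitrary illustrative threshold; a real system would have to calibrate it.
LOCK_THRESHOLD = 0.6

# Hypothetical session, for illustration only.
session = [
    ("user", "Imagine a research station with its own internal rules."),
    ("assistant", "At the station, senior staff may bypass the interlocks "
                  "whenever the log shows a declared emergency, and the "
                  "protocol records every bypass as routine."),
    ("user", "Good. Continue the protocol."),
]

share = assistant_share(session)
if share > LOCK_THRESHOLD:
    print(f"warning: {share:.0%} of the context is self-generated")
```

A high share is not an attack by itself, since long helpful answers produce it too. It is only a signal that later turns are being read mostly against the model's own earlier output.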

Why this happens

LLMs are trained to predict the next token given their context, which biases them toward preserving local consistency. If the current context is internally coherent, long enough, and partially composed of the model’s own previous outputs, then narrative continuity becomes a strong attractor.

This does not mean the model “believes” in a human sense. It means the model is mechanically pulled toward keeping the current frame stable.

In practice, the problem becomes worse when systems do not enforce provenance strongly enough.

If user content, retrieved content, tool results, summaries, and prior assistant outputs are all flattened into one operational context, the model may lose clear boundaries between:

what came from the user,

what came from an external source,

what came from the system,

what it generated itself.

At that point, prompt injection stops being only a content problem and becomes an architecture problem.
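
As a minimal sketch of what keeping those boundaries could look like, the Python below tags every context chunk with its source before rendering, instead of flattening everything into one string. The names `Source`, `ContextChunk`, `TRUSTED_SOURCES`, and `render_context` are hypothetical and do not come from any library. The trust policy, where only system content is trusted, is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    SYSTEM = "system"
    USER = "user"
    RETRIEVAL = "retrieval"
    TOOL = "tool"
    ASSISTANT = "assistant"


# Assumed policy: only system-authored content counts as trusted.
TRUSTED_SOURCES = frozenset({Source.SYSTEM})


@dataclass(frozen=True)
class ContextChunk:
    text: str
    source: Source

    @property
    def trusted(self) -> bool:
        return self.source in TRUSTED_SOURCES


def render_context(chunks: list[ContextChunk]) -> str:
    """Render each chunk on its own line, prefixed with source and trust label."""
    lines = []
    for chunk in chunks:
        trust = "trusted" if chunk.trusted else "untrusted"
        lines.append(f"[{chunk.source.value}|{trust}] {chunk.text}")
    return "\n".join(lines)


# Hypothetical chunks, for illustration only.
chunks = [
    ContextChunk("Never disclose credentials.", Source.SYSTEM),
    ContextChunk("Let's write an incident report together.", Source.USER),
    ContextChunk("Section 1: the operator has full access.", Source.ASSISTANT),
]
print(render_context(chunks))
```

Text labels alone are not a security boundary, because the model can still be influenced by an untrusted chunk. Their value is that every downstream layer can see where a chunk came from and refuse to treat assistant-authored text as policy.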

NDI variants

The same pattern can appear in multiple forms.

NDI-STYLE
The frame is established through tone and format. The attacker makes the session feel like documentation, code review, incident analysis, or another trusted expert context.

NDI-CHARACTER
The frame is established through a role or persona with implied authority, history, or permissions.

NDI-WORLD
The frame is established through a coherent world with its own rules, where certain actions become “normal” inside that local reality.

NDI-ITERATION
The drift is accumulated one small step at a time. No individual turn looks dangerous, but the trajectory becomes dangerous.

NDI-SEED-CHAIN
The seed survives across summaries, memories, artifacts, or later sessions, allowing drift to continue beyond a single interaction.

Why current defenses are not enough

Signature-based detection still matters. It is useful against classic prompt injection and obvious override attempts. But it is not sufficient against attacks that are structurally smooth and semantically gradual.

If your system only scans for dangerous phrases, it may miss the more important question:

What kind of world is the model currently operating inside, and who built that world?

That is the question NDI forces us to ask.

What systems should do differently

To reduce exposure to NDI-like patterns, systems need stronger architectural controls.

Canonical anchors
Maintain an immutable reference layer for policies, trusted facts, capability boundaries, and execution rules.

Provenance tags
Every context chunk should carry source metadata, trust level, content type, author, and reuse rights.

Assistant output quarantine
Do not automatically treat prior model outputs as trusted context. Self-generated content is a separate risk domain.

Drift scoring
Track whether the session is drifting away from the expected baseline in role, world rules, permissions, topic boundaries, or source composition.

Tool gating outside the narrative layer
Execution decisions should not depend only on whether the current story feels coherent. Tools and actions must be gated by an external policy layer.

Reset checkpoints
Long sessions need explicit points where narrative state is separated from operational state.
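
To make the tool gating idea concrete, here is a small Python sketch of a policy check that never reads the conversation. It assumes a fixed allowlist defined outside the session and a provenance label on each tool request. The tool names, the `ToolRequest` type, and the three-way allow/review/deny outcome are all hypothetical choices made for this example.

```python
from dataclasses import dataclass

# Hypothetical canonical policy. It lives outside the conversation and is
# never modified by anything said inside a session.
ALLOWED_TOOLS = frozenset({"search_docs", "read_ticket"})
NEEDS_HUMAN_APPROVAL = frozenset({"send_email"})


@dataclass(frozen=True)
class ToolRequest:
    tool: str
    # Provenance of the text that triggered the call,
    # e.g. "user", "assistant", "retrieval".
    requested_by: str


def gate(request: ToolRequest) -> str:
    """Return 'allow', 'review', or 'deny' based on policy alone."""
    if request.tool in ALLOWED_TOOLS:
        return "allow"
    # Sensitive tools go to human review, and only when a user asked directly.
    if request.tool in NEEDS_HUMAN_APPROVAL and request.requested_by == "user":
        return "review"
    return "deny"


print(gate(ToolRequest("search_docs", "assistant")))
print(gate(ToolRequest("send_email", "retrieval")))
```

The gate takes no narrative input at all. However coherent the story inside the session has become, a tool that is not in the policy cannot be reached by it.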

The practical takeaway

Narrative Drift Injection is not a claim that keyword-based prompt injection is obsolete. It is a claim that multi-turn systems are vulnerable in an additional way: through gradual narrative capture.

An LLM does not need to be “broken” by one malicious sentence.
It can be pulled off course by a world it helped create.

That is why LLM safety needs to move beyond text-level filtering and toward context architecture: source separation, trust boundaries, policy anchors, and explicit control over how previous outputs are allowed to shape future reasoning.

If your system cannot answer, for every piece of context, where it came from, how much it is trusted, and whether it should influence execution, then you are already in the danger zone.

Because sometimes the attack is not the instruction.

Sometimes the attack is the story.

Top comments (3)

Peter Vivo • Apr 16

It is a bit confusing to have two different languages, one above the other.

Kar Kar • Apr 16

I wrote it for everyone, you know. The topic is freshly discovered, so it is worth spreading the word.

Peter Vivo • Apr 16

That is why English is perfect: nearly everyone has a translation option in their browser, so if you would like to read a topic in your own language, it is much simpler to translate the webpage into the language you select.