A short Information Flow Control primer
This is a DRAFT POST and provides more formal background for the ideas discussed in
Information flow control is an old idea. The formal concepts were introduced in 1976 and there has been continuous work introducing new flavors since then. I’m going to briefly summarize one such flavor, just enough to build the later examples on.
Fundamentally, IFC is about determining which flows of information are safe and thus allowed (preventing disallowed information flows is a separate problem). To determine that, IFC assigns labels to nodes and then has rules about what kinds of edges are allowed between these nodes.
Labels have two components:
Integrity, e.g. that this is an unmodified statement by a specific user; that no prompt could have been injected; that it doesn’t contain any age-restricted imagery; or (much simpler) that it is properly encoded or validated against a schema, etc.
Confidentiality, e.g. that this is private to a specific user (or two users and both have to agree to release it); that this is a private key and can’t leave certain secure environments; that this might contain something age-restricted; or that it might contain a prompt injection.
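To make this concrete, here is a minimal sketch in Python of a label as two sets of claims. The class and the claim strings are my own illustrative names, not taken from any particular IFC system:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Label:
    # Integrity claims that hold for this data,
    # e.g. "statement-by:alice", "no-prompt-injection", "schema-validated".
    integrity: frozenset = frozenset()
    # Confidentiality claims restricting who may see or release this data,
    # e.g. "secret-to:alice", "might-contain-prompt-injection".
    confidentiality: frozenset = frozenset()

# Example: an inbound email is secret to its recipient and, coming from
# outside the system, might carry a prompt injection.
inbound_email = Label(
    integrity=frozenset(),
    confidentiality=frozenset({"secret-to:alice", "might-contain-prompt-injection"}),
)
```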
Information can by default only flow towards lower or equal integrity and higher or equal confidentiality: If a node receives secrets from two different users, it must have a confidentiality label that is at least a secret that both have to agree to release. If a node receives two bits of data and one might have a prompt injection, it can’t have the “no prompt injected” label.
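Continuing the sketch, the default flow rule and the label of a node that combines several inputs could be written like this (a non-normative rendering of the rule above in terms of set operations):

```python
def may_flow(src: Label, dst: Label) -> bool:
    # Flow is allowed only towards lower-or-equal integrity (dst claims no
    # more integrity than src has) and higher-or-equal confidentiality
    # (dst keeps at least all of src's confidentiality).
    return dst.integrity <= src.integrity and dst.confidentiality >= src.confidentiality

def combine(*inputs: Label) -> Label:
    # A node reading several inputs keeps only the integrity they all share
    # and accumulates the union of their confidentiality.
    if not inputs:
        return Label()
    integrity = frozenset.intersection(*(l.integrity for l in inputs))
    confidentiality = frozenset().union(*(l.confidentiality for l in inputs))
    return Label(integrity, confidentiality)
```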
We see that there is a duality between the two label components: We can use the integrity component to declare that there is no prompt injected (and treat the absence of it as dangerous) or the confidentiality component to warn that a prompt might be injected (and treat the absence of it as safe). This is useful when data crosses boundaries, from e.g. a system that isn’t aware of LLMs (and hence the risk of prompt injection) to one that is. In such a case we should err on the side of caution and mark all data as potentially dangerous (i.e. add confidentiality), unless it is explicitly marked as safe (i.e. has the corresponding integrity). We’ll see later how this can further simplify our system.
Labels are ordered – technically, they form a semi-lattice – and we can compute for two labels what directions of information flow, if any, are allowed. Or we can compute the minimally required label for each node, so that some graph of nodes and directed edges is allowed. If no such labels can be found, that graph is not valid. This sounds a lot like type checking and type inference, and indeed this is a good way to look at this: Labels are type modifiers (such as const) and a graph can be type checked for validity before it is instantiated.
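In that type-checking spirit, here is a hedged sketch of inferring minimal labels over a directed acyclic graph and validating nodes that declare a required label (say, an LLM insisting on certain integrity). It builds on the Label, combine and may_flow sketches above; the dict shapes are assumptions for illustration, and graphlib needs Python 3.9+:

```python
from graphlib import TopologicalSorter

def infer_labels(reads_from: dict[str, set[str]],
                 sources: dict[str, Label],
                 required: dict[str, Label]) -> dict[str, Label]:
    # reads_from maps each node to the nodes it reads from; sources gives the
    # labels of the graph's inputs; required lists labels some nodes insist on.
    labels = dict(sources)
    for node in TopologicalSorter(reads_from).static_order():
        if node in labels:
            continue  # a source: its label was given, not inferred
        labels[node] = combine(*(labels[p] for p in reads_from.get(node, set())))
        if node in required and not may_flow(labels[node], required[node]):
            raise ValueError(f"invalid graph: {node} cannot satisfy its required label")
    return labels
```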
A valid graph has a property called “non-interference”, meaning that lower integrity data can’t impact higher integrity data and higher confidentiality data can’t impact lower confidentiality data. Simon’s proposal has that property.
But this is of course quite limiting: For most graphs, as processing goes on, data has to be treated as increasingly untrusted (low integrity) and overly secret (high confidentiality, requiring a lot of entities to agree to release it). This is where the techniques mentioned above, limiting training and explicitly overriding labels, come in. This is called “downgrading” labels, i.e. explicitly increasing integrity (called “endorsing”) or lowering confidentiality (“declassifying”).
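In the sketch’s terms, the two downgrade operations are just explicit edits to a label, to be used only under the conditions discussed below:

```python
def endorse(label: Label, claim: str) -> Label:
    # Downgrading on the integrity side: explicitly add an integrity claim.
    return Label(label.integrity | {claim}, label.confidentiality)

def declassify(label: Label, claim: str) -> Label:
    # Downgrading on the confidentiality side: explicitly drop a confidentiality claim.
    return Label(label.integrity, label.confidentiality - {claim})
```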
Downgrading naïvely is risky: one would have to manually review each graph to see whether a given downgrade is a problem, which would be a major drawback (we’d have to treat graphs as trusted, and we’ll soon see why we’d rather not).
Luckily, there are ways to be more rigorous about this. This is going to be a bit abstract, but bear with me and I’ll give you examples:
Robust declassification requires that the integrity of any inputs to the declassification decision be trusted by the security principal behind the confidentiality label that is to be lowered. We can include any code performing the operation as an input as well, i.e. formalize that the code has to be trusted by the security principal. For example, say user A has a bit of data that he is willing to share with user B against payment; then B’s payment token has to be trusted by A. So here, what robust declassification requires is that we keep track of where B’s token comes from, what it means for it to be trusted (accounting for double spend), and so on. We could do that manually for a given graph, but this gives a formal, machine-verifiable formulation.
Transparent endorsement requires a maximum confidentiality on all the inputs, formalizing the condition that to endorse data you have to be allowed to read it. In the previous example, B can’t sufficiently endorse the payment token, only A can. Applied to the code doing the endorsement (treating the code as an input), this means that A should be able to inspect the code in order to trust it. And of course requiring the absence of might-have-prompt-injection confidentiality on inputs that affect endorsements protects against that threat vector.
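Here is one possible, deliberately simplified rendering of the two checks, building on the Label sketch. Principals are plain strings, and the meaning of “trusted” and “allowed to read” is delegated to predicates that stand in for the delegation machinery discussed further below:

```python
from typing import Callable

def robust_declassification_ok(owner: str, inputs: list[Label],
                               trusts: Callable[[str, str], bool]) -> bool:
    # To lower confidentiality owned by `owner`, every input to the decision
    # (including the code performing it, modeled as just another input) must
    # carry at least one integrity claim that `owner` trusts.
    return all(any(trusts(owner, claim) for claim in label.integrity)
               for label in inputs)

def transparent_endorsement_ok(endorser: str, inputs: list[Label],
                               may_read: Callable[[str, str], bool]) -> bool:
    # To endorse, the endorser must be allowed to read every input: no
    # confidentiality claim on any input may exclude them.
    return all(all(may_read(endorser, claim) for claim in label.confidentiality)
               for label in inputs)
```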
This is where the aforementioned duality of the label components comes in: These two operations are also duals of each other. They can translate between domains where either the absence or the presence of the label means safety, which means we can simplify things again without giving up formal rigor:
Trusted tools endorse data (i.e. add an integrity component to the label, e.g. mark it as safe), while requiring both minimum integrity and maximum confidentiality, expressed as conditions on the output integrity: “This has <property> as long as <A>, <B>, etc. are trusted (for integrity) or agree (for confidentiality)”, which includes the code itself. Note that in some cases the safety property can be inferred automatically from the code, in which case that automated method is what has to be trusted: For example, in a classifier with an output schema of fixed options, verifying that schema property and trusting OpenAI’s function-calling feature to enforce the schema is enough to treat the output as safe from prompt injection, as long as the schema itself wasn’t at risk of containing a prompt injection (see the sketch after this list).
Declassification (i.e. removing a confidentiality component of the label) is then handled by the system and happens transparently when the corresponding integrity component is present. This can then be expressed as simple rules, e.g. “no prompt injected here” removes “might have prompt injection” and “user agreed to publish this data” removes “this is secret to the user”.
Locally swapping out the label for conditions: e.g. an LLM might require the integrity property that prompts only come from allowed users, which can be derived from there being no might-have-a-prompt-injection confidentiality present other than, at most, from allowed users.
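A hedged sketch of how these simplifications could look, again building on the Label sketch: a small rule table mapping integrity claims to the confidentiality they may remove, and a trusted wrapper that endorses a classifier’s output because it is constrained to a fixed schema. The claim strings and ALLOWED_OPTIONS are illustrative assumptions, not any particular API:

```python
# If the integrity claim on the left is present, the confidentiality claim
# on the right may be removed automatically by the system.
DECLASSIFY_RULES = {
    "no-prompt-injection": "might-contain-prompt-injection",
    "user-agreed-to-publish": "secret-to:user",
}

def auto_declassify(label: Label) -> Label:
    removable = {conf for integ, conf in DECLASSIFY_RULES.items()
                 if integ in label.integrity}
    return Label(label.integrity, label.confidentiality - removable)

ALLOWED_OPTIONS = {"approve", "reject", "escalate"}  # fixed output schema

def endorse_classifier_output(output: str, label: Label) -> Label:
    # Trusted wrapper: if the output is one of the fixed options, nothing
    # attacker-controlled can ride along in it, so we add the integrity claim
    # and let the rule table drop the injection confidentiality.
    if output in ALLOWED_OPTIONS:
        label = Label(label.integrity | {"no-prompt-injection"},
                      label.confidentiality)
    return auto_declassify(label)
```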
As an example of why the extra conditions are important, imagine a bidding process between two parties. The bidding process shall remain secret and only the winning bid should be released. And of course bids should be fair, so A can’t use B’s secret information to make bids and vice versa. We can run this in a system both trust to enforce the security constraints expressed in the labels. Both parties inject their bidding strategies (possibly as a plain text prompt, and crucially including a maximum price they are willing to pay) and there is a simple trusted component that endorses a bid as a valid bid from a party (i.e. that it was generated with that party’s bidding strategy). In an arbitrary graph we might end up feeding B’s secret to A’s bidding strategy, so we must prevent that. In such a case, an input to the endorsement would carry a confidentiality label marking it as secret to B, hence requiring B’s endorsement as well; there is no possible path for that, so such a graph is impossible. As valid bids of the other party are an input to the bidding strategy, bids will still be confidential to both parties. Another mutually trusted tool can then release the winning bid to both parties separately once the losing party agrees to endorse it as the final bid; that endorsement is configured to also allow removing the confidentiality labels. This might sound a bit convoluted, but the powerful property here is that we only provide a few outside conditions and simple trusted tools, and any arbitrary graph can then contain valid auctions between these parties. Those graphs can be subgraphs of much more complex graphs!
Note that the security principal (trusted entity) for might-have-prompt-injection confidentiality is not the origin of the data but a principal the user trusts to determine what is a prompt injection and/or who is allowed to express prompts. So the above condition just means that this principal has to agree.
This raises the question of what trust means. Here delegation comes in. In practice, the user (the agent’s owner) will delegate to parties they trust, who might further delegate, to eventually express trust in the security principals the label components represent. If multiple users are present, e.g. the sender of an email, the other party in the bidding scenario, and so on, then there are multiple delegation roots. Given a label and all the conditions on its components (this is trusted by A as long as X is trusted by A, etc.), the system must make sure that there is a delegation path from the root to these principals.
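The delegation check itself can be as simple as reachability in a delegation graph. A minimal sketch, assuming delegations is just a mapping from a principal to the principals they directly trust:

```python
def is_trusted(root: str, principal: str,
               delegations: dict[str, set[str]]) -> bool:
    # Walk the delegation graph from the root (e.g. the agent's owner) and
    # check whether there is a path to the principal named in a label component.
    seen, frontier = set(), [root]
    while frontier:
        current = frontier.pop()
        if current == principal:
            return True
        if current in seen:
            continue
        seen.add(current)
        frontier.extend(delegations.get(current, set()))
    return False

# Example: the owner trusts a security vendor, who in turn trusts a classifier.
delegations = {"owner": {"security-vendor"},
               "security-vendor": {"injection-classifier"}}
assert is_trusted("owner", "injection-classifier", delegations)
```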
To recap, we now have a relatively simple system to express a wide range of safety scenarios:
Trusted tools that add integrity to their output’s labels, possibly conditioned on some of their input’s labels.
Mapping between integrity and what is required to declassify (i.e. remove) a confidentiality component.
Trust delegation to determine trust in these tools and mappings.
Some key labels are set at the outset to establish the policies that govern this system, through a combination of delegation, mappings and trusted tools.
For the last point, this could be a confidentiality label for the user (thus mapping to their delegation root). A little trick here is that output to the user’s screen might require a bit of declassification, allowing the user to set conditions on what kind of processing they consider to be in their interest. Maybe those are conditions on how recommendations are computed from their interest profile. Any other recommendation might still be computed, but then there is no way for it (or data derived from it) to ever be presented to the user.
In other cases it’s the components that bring in the policies, e.g. LLMs guarding against prompt injection by requiring certain integrity on their inputs. Their owners can then trust the necessary tools via delegation, or the user might pick a separate set of tools and implicitly ask for those to be trusted by the LLM.
So policies can come from users and from the tools they use. This sets up a key foundation for the ecosystem this newsletter imagines.
Thanks to Tiziano Santoro, Andrew Ferraiuolo, Sarah de Haas, Hong-Seok Kim and Ben Laurie for valuable early discussions on IFC.