Preface


To the extent possible under law, Corey Richardson has waived all copyright and related or neighboring rights to Robigalia: An Operating System for the Modern Era. This work is published from: United States.

This book is the culmination of many hours of thought and discussion by many parties. I would especially like to thank Alex Elsayed, Gernot Heiser, Kevin Elphinstone, Gerwin Klein, and Graham Northup.

Introduction

TODO: rewrite more defensively.

Current mainstream operating systems such as GNU/Linux and Windows have proven themselves not up to the task of secure application development. They encourage ad hoc design and make it challenging for programs to communicate securely. Mark Miller, in his dissertation "Robust Composition: Towards a Unified Approach to Access Control and Concurrency Control", describes an architecture for building applications using object capability systems. These systems have, by construction, properties which make it much easier to write programs and components which compose constructively rather than destructively. They provide the ability to isolate programs and ensure their privilege is limited to the minimum required for their proper functioning. These ideas have taken hold in newer operating systems like Android and, to a limited extent, iOS and macOS. (CITE: EROS, iOS reference?, Android Binder)

Meanwhile, in the operating systems space, microkernels have improved dramatically since the original implementations. Third-generation microkernels such as seL4 are built entirely around capability systems for secure communication. They are small, usually fewer than 10,000 lines of code, and have formal mathematical specifications of their behavior. In the case of seL4, the entire implementation is verified to meet its specification. This comes at no expense to performance: seL4 has some of the fastest IPC of any kernel. Building on this foundation, it is possible to achieve a small trusted computing base (TCB) upon which robust systems can be built.

Programming languages have also made significant advances in the past decades. The Rust programming language brings many concepts from the academic realm into an industry-ready language. Using affine types, algebraic data types, traits (type classes), and lifetimes (regions), it offers the ability to write safe, high-performance programs with minimal memory-management overhead and at a higher level of abstraction than languages like C++.

These themes run throughout the ideas behind Robigalia: formal methods, capability systems, and programming languages. This is not new, or ground-breaking. But it is our hope that we can provide a practical, usable system as an example of what a modern capability operating system could look like.

Rust

Some background info, especially about traits, references, and abstraction boundaries.

seL4

seL4 is an operating system kernel. The kernel is the portion of the OS that (usually) runs at the most privileged hardware mode. It is responsible for enforcing separation and allowing access to system resources, especially space (memory) and time (processor time). seL4 is different from the Linux or NT kernel in that it is very small, and avoids making policy decisions. Instead, it follows the UNIX principle of separating policy from mechanism, providing only mechanism. It consists of less than 10k lines of C code.

This chapter provides an overview of the important aspects of seL4 from our perspective. See the seL4 manual for the complete API and explanations of kernel primitives.

Capabilities

The fundamental abstraction in seL4 is the capability. A capability (or cap) is a "communicable, unforgeable token of authority" (CITE: erights wiki). Capabilities form the basis of seL4’s access control system. There are only seven operations that can be done with a capability, and these form seL4’s set of system calls:

  • Send: deliver a message along a capability

  • NBSend: deliver a message along a capability without blocking

  • Call: atomic combination of Send and Recv

  • Reply: respond to a Call

  • ReplyRecv: atomic combination of Reply and Recv

  • Recv: receive a message on a capability

  • NBRecv: receive a message on a capability without blocking

Capabilities are conceptually a reference to some kernel object along with a set of access rights. Which kernel objects are available depends on the architecture the kernel is running on. There are, however, a few architecture-independent kernel objects:

  • Endpoint objects: channel for inter-process communication (IPC)

  • Notification objects: asynchronous signaling, akin to a word-sized array of binary semaphores

  • CNode objects: mechanism for storing and addressing capabilities

  • Thread control blocks (TCBs): a thread of execution

  • Untyped memory objects: ranges of physical memory used for creating kernel objects

The architecture-dependent objects include device access and virtual memory management.

Abstract model

Resource management

Present the resource management model - retyping untyped memory.

Communication

Endpoints and notifications for communication. Blocking versus non-blocking.

Capability spaces

What they are.

Gotchas

The seL4 model also has a few quirks which tend to trip up people new to it. There are some restrictions in the API that prevent information flow between disconnected components. There are also some limitations in how some objects may be used, which largely stem from keeping the formal verification tractable.

  • Comparing capabilities for equivalence

  • Reusing untyped memory

  • Paging restrictions to allow efficient revocation

  • Determining object type a capability refers to

  • Observing success/failure of message send or recv

Communication patterns

Avoiding denial of service.

The most important property of a system is that it’s available when needed. Network services are well known for folding under intense traffic. On a local system, there are different concerns, foremost being exhaustion of limited resources: time, memory, and disk. Whenever a process can cause the system to consume resources on its behalf, there is opportunity for denial of service (DoS). The solution on seL4 is multifaceted.

Reduce or eliminate all artificial resource limits not caused by platform or hardware restrictions. For example, on Linux it is quite easy to exhaust the supply of available file descriptors. Instead, prefer to use capabilities and badges judiciously, as they are constrained only by available physical memory (indirectly, via the CNodes that store them).

Use mechanisms which permit clients to provide the storage space for processing their requests. This can be done by negotiating a shared memory region for large message communication, or having the client provide raw untyped memory or frames to satisfy allocation requests.

Especially for servers, try to use the client’s scheduling budget where seL4’s real-time extensions make that possible. These extensions are not yet available in the mainline kernel, but design your systems so they can take advantage of them when they are.

Strive to use algorithms and data structures which provide predictable performance, especially with regard to memory consumption. This is not always possible, but it leads to easier analysis of the time and space used by a server.

Be careful with how you use blocking endpoint operations.

Endpoints and shared memory

Endpoints should not be used for transferring large messages: doing so forces a large copy in the kernel. Instead, use a shared-memory ring buffer, and use endpoints to transfer indexes into the ring buffer.
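A minimal sketch of this pattern, assuming a single producer and consumer (in real code the slots would live in a shared frame and the index would travel over an endpoint; the types here are illustrative, not an actual Robigalia API):

```rust
// Minimal single-producer ring sketch: large payloads stay in the shared
// `slots`; only the small index would travel over the endpoint.
pub struct Ring<T> {
    slots: Vec<Option<T>>,
    tail: usize, // total number of pushes so far
}

impl<T> Ring<T> {
    pub fn new(capacity: usize) -> Self {
        Ring { slots: (0..capacity).map(|_| None).collect(), tail: 0 }
    }

    /// Producer side: place a payload, returning the index to send
    /// over the endpoint (or None if the ring is full).
    pub fn push(&mut self, value: T) -> Option<usize> {
        let idx = self.tail % self.slots.len();
        if self.slots[idx].is_some() {
            return None; // consumer hasn't drained this slot yet
        }
        self.slots[idx] = Some(value);
        self.tail += 1;
        Some(idx)
    }

    /// Consumer side: redeem an index received over the endpoint.
    pub fn take(&mut self, idx: usize) -> Option<T> {
        self.slots[idx].take()
    }
}
```

The message on the endpoint then carries only a word-sized index, never the payload itself.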

Object escrow

When receiving a capability from an untrusted client, especially for untyped memory, it can be important for availability that the capability not be revoked while the server is using it. Generalizing somewhat, a thread can share an untyped region with a pool of mutually untrusting peers. This can be achieved with a simple object escrow service, which implements a form of reference-counted capabilities.

A client wishing to share a capability invokes the escrow service with the capability it wishes to share and a threshold, the number of peers that must agree to lock or unlock the capability.

An escrow has four important pieces of state:

  • threshold, the number of peers that must agree to lock or unlock the object

  • voters, the number of peers which have voted to lock the object

  • locked, a flag indicating that the peers have successfully agreed to lock the object, enabling it for use

  • the raw capability under consideration

and three important methods:

  • new_voter, incrementing the voters count and returning a Notification

  • get_status, returning locked, voters, and threshold

  • get_cap, which returns a derived capability if the peers have agreed to lock the object

When a process wishes to share a capability, it invokes the escrow service with the capability and the desired threshold. The escrow service returns a capability to the new escrow. The client can hand out this escrow (grant-only) and a new_voter to each of the peers that want to share the capability.
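A toy sketch of this interface — Cap and Notification here are stand-in types, not real seL4 bindings, and in a real server this state would live behind the escrow capability:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Cap(u64);          // placeholder for a raw capability
pub struct Notification(u64); // placeholder for a notification cap

pub struct Escrow {
    threshold: u32, // peers that must agree to lock or unlock
    voters: u32,    // peers that have voted to lock
    locked: bool,   // set once voters >= threshold
    cap: Cap,       // the raw capability under escrow
}

impl Escrow {
    pub fn new(cap: Cap, threshold: u32) -> Self {
        Escrow { threshold, voters: 0, locked: false, cap }
    }

    /// Register a new voter; returns the notification it will be signaled on.
    pub fn new_voter(&mut self) -> Notification {
        self.voters += 1;
        if self.voters >= self.threshold {
            self.locked = true;
        }
        Notification(self.voters as u64)
    }

    /// (locked, voters, threshold)
    pub fn get_status(&self) -> (bool, u32, u32) {
        (self.locked, self.voters, self.threshold)
    }

    /// Returns a derived capability only once the peers have locked the object.
    pub fn get_cap(&self) -> Option<Cap> {
        if self.locked { Some(self.cap) } else { None }
    }
}
```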

TODO: this isn’t quite powerful enough, as it doesn’t allow Bob to extend the guarantee of object liveness to other threads.

Capability Equality

In general, capability equality cannot be implemented in seL4. However, the following protocol implements a limited case of it, which is all that is needed in most cases. For a full capability equality test, a privileged server would need to be used to track all CNode and Untyped operations, which would then have enough knowledge to know if two capabilities referred to the same object.

In our case, the only equality operation we need is comparing a capability of unknown provenance to a capability owned by the current thread. This is useful if a process servicing multiple endpoints needs to determine whether a transferred capability corresponds to one of the other endpoints on the system. The protocol looks like this:

  • Alice selects a nonce and sends an "identify" message over endpoint B, transferring cap A along with it.

  • B receives the message. If the badge is unwrapped, it knows that the cap is its own. It replies with the nonce and whether the badge was unwrapped or not.

TODO: implement, sanity check. What was the nonce for again?

Notification Trees

Notification objects allow you to block until at least one of WORD_BITS events is signaled by a (NB)Send on some capability. To allow blocking on more than WORD_BITS events, we use notification trees. Conceptually, a notification tree is an arbitrary tree with a fanout of WORD_BITS. For brevity, we’ll consider WORD_BITS=3 here.

Here, A… are notification capabilities that the user wants to wait on, and a… are internal notification capabilities ("overhead caps"). M is a cap the user wants signaled when one of the events occurs, but it is not actually part of the tree: a is the root. . is a null capability, indicating an empty slot.

Wait Trees

In this first scenario, called a wait tree, each overhead cap has a one-to-one correspondence with some TCB, which is bound to that overhead cap. That TCB is, in general, always sleeping on an endpoint. If the thread is ever woken, it knows that its bound overhead cap was signaled.

Each column indicates another layer of the tree. | is used as a visual guide. < or > indicates the direction notifications will flow.

M<a
 ||A
 ||B
 |<b
 |||C
 |||.
 |||D

In this example, doing a Send on A or B will cause a to wake. It immediately signals M (its parent) and goes back to sleep. Doing a Send on D will cause b to wake, which will signal a (its parent), which will signal M.

Physically, here are the objects involved:

NTFNM - indirectly referenced as the object which M points to.
NTFN1 - the slots of a
NTFN2 - the slots of b
TCB0 - NT waiter a. CSpace = [NTFNM.x (parent), NTFN1.1, NTFN1.2, NTFN1.4] Bound ntfn = NTFN1.7
TCB1 - NT waiter b. CSpace = [NTFN1.2 (parent), NTFN2.1, null, NTFN2.4] Bound ntfn = NTFN2.7

A = NTFN1.1
B = NTFN1.2
C = NTFN2.1
D = NTFN2.4

This is a generalization of Recv on a single Notification, and allows a single listener to wait on arbitrarily many events.
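The upward propagation can be modeled with plain words of bits — each `usize` below stands in for a notification object's data word, and the loop plays the role of the waiter TCBs. This is a simulation of the flow, not real seL4 code:

```rust
// Each internal node's waiter forwards any signal on its own word
// to the bit that node occupies in its parent's word.
#[derive(Clone, Copy)]
pub struct Waiter {
    pub parent: usize,     // index of the parent's word
    pub parent_bit: usize, // bit this node sets in the parent
}

/// Signal `bit` on `node`'s word, then let the waiters propagate
/// the signal up toward the root word (M).
pub fn signal_up(words: &mut [usize], waiters: &[Option<Waiter>], mut node: usize, bit: usize) {
    words[node] |= 1 << bit;
    while let Some(w) = waiters[node] {
        words[w.parent] |= 1 << w.parent_bit;
        node = w.parent;
    }
}
```

For the tree above: words[0] is M, words[1] is a (with b in its bit 2), words[2] is b; a Send on C sets a bit in b, then b's waiter signals a, then a's waiter signals M.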

Signal trees

We can turn the tree around, with some sleight of hand:

 >a
 ||A
 ||B
 |>b
 |||C
 |||.
 |||D

The simplicity of this notation is deceiving. Physically, this situation is completely different. Before explaining what happens, here’s the physical layout:

NTFN0 - the slots of a
NTFN1 - the slots of b
TCB0 - NT waiter a. CSpace = [NTFN0.1, NTFN0.2, NTFN0.4] Bound ntfn = NTFN0.7
TCB1 - NT waiter b. CSpace = [NTFN1.1, null, NTFN1.4] Bound ntfn = NTFN1.7

A = NTFN0.1
B = NTFN0.2
C = NTFN1.1
D = NTFN1.4

Here, a is exposed directly to the user as the interface to the signal tree. When a is signaled, its bound TCB wakes, inspects the bits set in the notification word, and signals each of its children whose corresponding bit was set. For example, signaling a with b’s bit set wakes b, which then signals its own children in the same way.

This is a generalization of Send on a single Notification, and allows a single sender to signal arbitrarily many listeners.
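The downward fanout can be sketched the same way. One point the text leaves open is what payload an internal node forwards to a child; the sketch below simply broadcasts to all of the child's slots, which is an assumed policy, not something the design above fixes:

```rust
#[derive(Clone)]
pub enum Slot {
    Empty,        // null capability (the "." above)
    Leaf(char),   // a user-visible notification such as A, B, C, D
    Child(usize), // an internal node ("overhead cap")
}

pub struct Node {
    pub slots: Vec<Slot>,
}

/// Signal `bits` on node `n`, fanning out toward the leaves.
/// Every user notification that fires is appended to `fired`.
pub fn signal_down(nodes: &[Node], n: usize, bits: usize, fired: &mut Vec<char>) {
    for (i, slot) in nodes[n].slots.iter().enumerate() {
        if bits & (1 << i) == 0 {
            continue;
        }
        match slot {
            Slot::Empty => {}
            Slot::Leaf(name) => fired.push(*name),
            // The child's waiter wakes and re-signals its own children;
            // modeled here as a broadcast of all bits.
            Slot::Child(c) => signal_down(nodes, *c, usize::MAX, fired),
        }
    }
}
```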

Summary and Analysis

By combining wait trees and signal trees in a sufficiently alternating fashion, one can construct an arbitrary many-listener, many-sender event system. This does not come without a cost: each internal node consumes 2K (32-bit) or 4K (64-bit) of memory for its CNode, plus an extra TCB.

The extra space overhead is:

ceil(log(N, WORD_BITS)) * (2KiB + 1KiB) for ia32
ceil(log(N, WORD_BITS)) * (4KiB + 2KiB) for x86_64
ceil(log(N, WORD_BITS)) * (2KiB + 512B) for aarch32

Where N is the number of events being multiplexed.
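The formulas above can be checked mechanically; the helper below mirrors the formula as written, using the per-platform CNode and TCB sizes quoted:

```rust
/// ceil(log base `base` of `n`) for n >= 1, computed with integers.
pub fn ceil_log(base: u64, n: u64) -> u64 {
    let mut levels = 0;
    let mut capacity = 1;
    while capacity < n {
        capacity *= base;
        levels += 1;
    }
    levels
}

/// Extra space (in bytes) to multiplex `n` events:
/// one CNode plus one TCB per tree level.
pub fn tree_overhead(n: u64, word_bits: u64, cnode_bytes: u64, tcb_bytes: u64) -> u64 {
    ceil_log(word_bits, n) * (cnode_bytes + tcb_bytes)
}
```

For example, multiplexing 1000 events on x86_64 needs two levels of a 64-ary tree.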

Phoma: A Capability Runtime

Phoma (/ˈfoʊmɑ/) is a persistent object-capability system built with Rust and Cap’n Proto. Much of Phoma is inspired by KeyKOS, EROS, and Coyotos.

Capability Basics

Object-capability systems have a rich history in operating systems, going as far back as the late 1960s. Currently, the authoritative reference for objcap systems is Mark Miller’s PhD dissertation, "Robust Composition: Towards a Unified Approach to Access Control and Concurrency Control" (TODO: cite). The fundamental insight of object-capability systems is that the object graph of a program can serve equally well as the access graph. In this chapter, we’ll explore the basics of object-capability systems.

Access Control

An access control system (ACS) mediates the access of subjects to resources. The goal is to ensure that a given subject can only access resources according to some policy. The goals of an access control policy are usually ensuring confidentiality or integrity of some special set of resources. Butler Lampson introduced the idea of an access control matrix as a way of expressing and understanding access control policies at a point in time. (TODO: cite)

In the traditional UNIX filesystem, subjects are groups or users, and resources are file system entities (files or directories). Each resource is associated with a single group and user, and has "file mode bits" which determine the operations allowed on the file. There is one set of "read, write, execute" bits for "user", "group", and "other". Here’s an example access control matrix in this system:
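The mode-bit scheme can be illustrated in a few lines — this is standard UNIX semantics, and the mode value is made up for illustration:

```rust
/// Decode one row of the access matrix from a UNIX mode such as 0o754:
/// three bits (read, write, execute) per class of subject.
pub fn mode_row(mode: u32, who: &str) -> String {
    let shift = match who {
        "user" => 6,
        "group" => 3,
        _ => 0, // "other"
    };
    let bits = (mode >> shift) & 0o7;
    let mut s = String::new();
    s.push(if bits & 0o4 != 0 { 'r' } else { '-' });
    s.push(if bits & 0o2 != 0 { 'w' } else { '-' });
    s.push(if bits & 0o1 != 0 { 'x' } else { '-' });
    s
}
```

A file with mode 0o754 thus grants rwx to its user, r-x to its group, and r-- to everyone else — one coarse row per subject class, which is exactly the granularity problem discussed below.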

(TODO: example)

And here’s what it looks like inspecting the files from inside the system:

(TODO: example)

However, this system is too coarse-grained. As an example, Steam once accidentally deleted all files owned by the user due to a scripting bug. All programs running as a user have the authority to modify all files owned by that user. In a better system, a program will have the minimum amount of authority required to function. This is the Principle of Least Authority. To achieve this fine-grained control, a different method of access control is needed.

Object-capability systems enable this fine-grained control, as we’ll see.

Objects

TODO: present object/actor systems, using Rust examples.

From Objects to Capabilities

TODO: glue

Capabilities are, as you might remember from the seL4 chapter, "communicable, unforgeable tokens of authority". In safe Rust code, references are capabilities. seL4 has capabilities as its primitive notion of communication with the kernel. In Phoma, capabilities are no different.

Capabilities in Phoma

Capability invocation corresponds to sending a message to some object, where the capability designates the object. The interfaces objects expose, and the format of data transferred in messages, are defined by Cap’n Proto serialization and RPC. We do not currently use anything like the normal implementation of Cap’n Proto RPC, so to avoid confusion, it’s referred to more generally as Phoma RPC.

The canonical, transferable capability is the badged endpoint, building on seL4’s capability model.

Operating model

This operating model has evolved informally for a long time, in discussions between eternaleye and myself (with most good ideas coming from eternaleye). Originally, it was a first-principles derivation of what a decent storage model would look like on seL4, given these constraints:

  • Structurally avoid denial of service attacks

  • Allow revocation of all resources

  • Allow delegation of all resources

  • Efficiently implementable given seL4’s underlying restrictions on untyped memory

Of course, I’m enamored with capability systems for providing a simple, robust, and mathematically sound model of dealing with all of these requirements. The last requirement was especially tricky, though.

eternaleye pointed me at Houyhnhnm Computing quite early on, which is an absolutely wonderful series of essays about a very different computing model. I had already examined Urbit at some point and found it… lacking. Many of the reasons can be found here, written quite elegantly. I am greatly inspired by the vision of a Houyhnhnm world. Thus, my priorities shifted somewhat to building a good low-level operating environment for an eventual Houyhnhnm platform. This was my earliest indication that persistence might live in the core.

Much later, I discovered EROS, Coyotos, and many emails authored by shap discussing their storage model and designs for L4-HURD. I first started investigating these systems (and the works of shap in general) by encouragement from Gernot Heiser and Kevin Elphinstone.

Eventually, I was convinced that orthogonal, transparent persistence belonged in the system core. This was not an easy conclusion, as it is a huge departure from both my original goal ("POSIX on seL4") and all operating system design that I was familiar with. It took a lot of thinking to end up with even the first version of this architecture. We welcome constructive feedback, especially from experts on capability systems, especially persistent capability systems.

The whole model is a fairly big, interdependent ball of many ideas. I will do my best to lay it out here, but eventual educational material will probably need to be written by someone other than myself, if only for the reason that I am so deeply aware of the space of trade offs that it becomes difficult to view the system from the outside.

System Overview

At a high level, a Phoma system is composed of two parts: the firmament and applications. This corresponds loosely to the traditional split of kernelspace and userland in monolithic operating systems. shap rightly pointed out (pg 3) that such a system ought to be considered as three pieces: the nucleus (traditional kernel), the kernel (nucleus and userland device drivers), and applications. However, given that seL4 calls itself the kernel, we avoid confusion by not repurposing that term.

The firmament is responsible for providing applications the view of a persistent, capability-based world. The underlying system, however, is usually anything but: machines today keep their running state almost completely transient, and use a global namespace for many operations. From the perspective of applications, the "machine interface" is entirely object-capability based; it is the firmament that runs directly on hardware and communicates with cap-insecure devices. Robigalia provides a firmament that runs on seL4 and has very strong isolation guarantees. For compatibility, we also provide a firmament which runs on any tier-1 Rust platform (notably Linux, Windows, and macOS), with significantly weaker isolation guarantees.

Each application is composed of a number of processes, which in turn are composed of multiple threads. Each thread hosts a number of objects. A thread has a virtual address space, a badge space, and a capability space. The virtual address space is the mapping from its addresses to physical memory. The badge space is its internal mapping of the objects external capabilities refer to. The capability space is used for storing capabilities to external objects. These correspond approximately to near and far references in E.

TODO: transition

Storage

The storage subsystem is the most important, and complex, part of Robigalia. It provides all system state. The model is largely simple (TODO: I think/hope), but the implementation can be fiendishly complex. This model is largely inspired by EROS and Coyotos (although it was arrived at largely independently).

Permanent Storage

Permanent (information) storage is provided by Nemo, which uses nameless writes as the communication primitive for reading and storing information. Nameless writes are a (somewhat) new technique for managing storage, introduced to deal with the limitations of flash storage devices. Fundamentally, they are quite simple:

TODO: nameless interface
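A rough sketch of the shape such an interface might take — the names and signatures here are illustrative, not the actual Nemo API. The defining property of nameless writes is that the device, not the client, chooses where data lands, and reports that address back:

```rust
use std::collections::HashMap;

/// Illustrative nameless-write interface (hypothetical, not Nemo's).
pub trait NamelessStore {
    /// Write data; the device picks the location and returns its address.
    fn write(&mut self, data: Vec<u8>) -> u64;
    /// Read back by the device-chosen address.
    fn read(&self, addr: u64) -> Option<&[u8]>;
    /// Release a block, allowing the device to reuse it.
    fn free(&mut self, addr: u64);
}

/// Toy in-memory model of the interface above.
pub struct MemStore {
    blocks: HashMap<u64, Vec<u8>>,
    next: u64,
}

impl MemStore {
    pub fn new() -> Self {
        MemStore { blocks: HashMap::new(), next: 0 }
    }
}

impl NamelessStore for MemStore {
    fn write(&mut self, data: Vec<u8>) -> u64 {
        let addr = self.next; // the "device" picks the location
        self.next += 1;
        self.blocks.insert(addr, data);
        addr
    }
    fn read(&self, addr: u64) -> Option<&[u8]> {
        self.blocks.get(&addr).map(|v| v.as_slice())
    }
    fn free(&mut self, addr: u64) {
        self.blocks.remove(&addr);
    }
}
```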

On top of this, a logical volume manager and so forth are implemented. Nemo internally has an implementation of nameless writes for devices which don’t natively support such an interface (such as most HDDs). Devices need not be drives attached to the local system; any object store, such as Tahoe-LAFS or Lustre, could serve.

There is a well-defined on-disk format for the pickled state of objects in the system.

TODO: Should probably inspect SCSI OSD to see what it has to offer in this space. It has a crypto capability system, and provides object storage. The technology doesn’t seem very widely available as HW.

Transient Storage

Transient storage, on the other hand, is managed entirely by the firmament, and is firmament-specific. It only needs to expose the required interfaces to the rest of the system, and uphold the system’s invariants. Further, it’s the job of the firmament to persist all system state according to its configuration (say, every 5 minutes).

This mostly includes memory and page tables.

Space Banks

Space banks (name taken from EROS, although our inspiration was seL4’s Untyped objects) are how the system does resource accounting. They are called banks to invoke a visceral understanding of how resource accounting works on a robust system: you can only spend what you have, and you can only get more from someone else. We will often use analogies to (a gross simplification of) real-world monetary systems to convey intuitions about resources.

Space banks allow these operations:

trait SpaceBank {
  /// Allocate `bytes` of page-backed storage, charged against this bank's quota.
  fn buy_page_region(&self, bytes: u64) -> Result<Region>;
  /// Return a region to the bank, refunding its cost to the quota.
  fn sell_page_region(&self, region: Region);
  /// Quota left in this bank alone.
  fn remaining_quota(&self) -> u64;
  /// Minimum remaining quota along the chain of parent banks.
  fn effective_quota(&self) -> u64;
  /// Destroy the bank, revoking everything allocated from it.
  fn destroy(&self);
  /// Create a sub-bank with its own (possibly overcommitted) quota.
  fn child_bank(&self, quota: u64) -> Result<SpaceBank>;
  /// Does `other` denote a space bank sharing heritage with this one?
  fn verify(&self, other: Cap) -> bool;
}

You can buy a region from the bank, which gives you more storage space for information. You can also sell your regions back to the bank. You pay (and the bank compensates) in quota. Every space bank has associated with it a fixed quota, and tracks how much you have bought using that quota. You can create sub-banks using child_bank. The child bank can have an arbitrary quota. However, in practice, it will only be able to use the minimum of all the remaining quota in its transitive parents, up to the Prime Bank.

Destroying a bank will revoke all access to objects allocated from it. When an object is destroyed, the resources used to pay for it are returned to the quota.

You can also ask a space bank to determine if it recognizes some other capability as being a space bank with some common heritage. This is useful in the core of the system, but not much elsewhere.

Why can space banks have a quota larger than their parent? Simple! For implementing non-robust systems. Really. This is known as "overcommit" in traditional operating systems. The OS pretends it has more resources than it actually does, so that allocation requests don’t initially fail. A process can "reserve" address space, but when it actually goes to use it, the OS may be out of memory and swap and it will fault and die if your siblings happened to be particularly greedy at that point in time. Robustness destroyed. But, while robustness is a useful property of vital system components, it’s often not that important to some user applications.
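A toy model of the quota arithmetic (not the real space bank implementation) shows why a child's effective quota is the minimum along its chain of parents, and what overcommit looks like:

```rust
/// `remaining[0]` is the Prime Bank's unspent quota; each later entry
/// is a descendant bank, down to the leaf we are spending from.
pub fn effective_quota(remaining: &[u64]) -> u64 {
    // A bank may declare any quota it likes, but can only actually
    // allocate the minimum remaining along its transitive parents.
    *remaining.iter().min().unwrap()
}

/// Spend `bytes` from the leaf bank, charging every ancestor too.
/// Fails (overcommit caught) if any ancestor would be overdrawn.
pub fn buy(remaining: &mut [u64], bytes: u64) -> bool {
    if effective_quota(remaining) < bytes {
        return false;
    }
    for r in remaining.iter_mut() {
        *r -= bytes;
    }
    true
}
```

A child declared with quota 100 under a parent with only 40 remaining can still be created — that is the overcommit — but its allocations start failing once the parent runs dry.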

To use the space bank to create other, non-information storage objects, you’ll need to use an appropriate Creator, and pass it a space bank.

Note
This is most similar to Coyotos, but see the EROS documentation:

IPC

Processes, Badges, Objects, and IPC

Processes in Phoma are a group of threads which share a capability, address, and badge space (or "bspace"). The bspace describes how a process allocates and uses badges for identifying mutually-isolated "bushels" of objects.

Most important principle: client pays for storage.

When a process is created, its initialized environment usually contains at least two capabilities: its object endpoint (OEP) and parent endpoint (PEP). The OEP is the raw, unbadged endpoint capability that the server will receive ~all of its messages on. It should hold this capability very tightly, and only ever give out freshly minted badged copies, as anyone with the raw OEP can forge connections. The parent endpoint is an endpoint that the process can use to communicate with its creator. It can request further authority ("powerbox request"), or offer up a capability it would like shared with some party (for example, a capability to create objects). It’s vital to the security of the whole setup that new processes are always completely confined, and only endowed with authority granted by the parent.

TODO: clean this up, presentation wise.

A bushel is represented by some state in the process which provides it, and is accessed by other processes via a badged endpoint. This endpoint is the input to a state machine which decides how to process the message.

Creating bushels is done by sending messages to a special creator capability, which some servers can offer. This is largely application-defined. The first message sent to a creator transfers a capability to a space bank, as well as any other information needed to allocate the bushel. The server will reply with "success" or "fail", and other information needed to continue the protocol. In the case of success, it will also transfer a badged capability to the new bushel.

Internally, the server is parameterized by an n, which is a number between PAGE_BITS and WORD_BITS - BADGE_BITS. PAGE_BITS is typically 12, WORD_BITS is either 32 or 64, and BADGE_BITS is 28. n - PAGE_BITS is the size (in 4K pages) of a fixed-length slab, which stores the initial bookkeeping and state for a bushel. Later dynamic allocations from the space bank are allowed, and pointers outside of the slab are also allowed.

The IPC buffer for a bushel can also be extended. If a message larger than the standard seL4 IPC buffer is required, the client can negotiate extended virtual message registers (EVMRs). At that point, all messages should be written into the EVMR before sending, with only physical message registers used for the actual endpoint send operation.

Silos

A silo is a sort of container, similar to Solaris zones. Each silo is a mostly isolated operating system instance in its own right. On initialization, Phoma consists of a single root silo. This silo is responsible for bringing up the rest of the system based on the configuration it receives from its own creator. On a Robigalia system, this will be the seL4 root task. Processes can ask the root silo to create more silos, given some Untyped memory. There are no technical restrictions on processes creating new silos manually, but creating a silo is somewhat laborious. Thus it is usually left to the root silo to create all silos on the system.

Each silo ends up providing a consistent operating environment for the tasks it hosts. For example, Robigo is a silo type which provides POSIX compatibility. Building on Robigo, there is a Linux silo which provides compatibility via either linking to a custom libc or via system call emulation. Phoma also provides its own silo type. Silo types are not baked into the system, but arise from convention and simplicity. We’ll discuss here native Phoma silos.

Each silo is host to a number of processes. A process consists of a number of threads, all of which share an address and capability space. Each thread runs an event loop which processes messages from its inbox. These messages are arbitrary buffers of bytes, to be interpreted as Cap’n Proto messages. Both the silo itself and other processes on the system can send messages to a process. If a process stops processing its event loop, it will possibly be killed and restarted.

Silos are responsible for implementing resource allocation and managing process capability and address spaces. Processes, on creation, are endowed with some amount of untyped memory. They can use this untyped memory to create objects, transfer to other processes, or redeem it for a raw seL4 Untyped object. To get more Untyped memory, processes must request it from the silo (which may be denied), or receive it from some other process.

Channels

Channels are the mechanism Phoma uses for asynchronously transferring messages between processes. A process can have many channels associated with it, typically about one per thread. It is possible to "select", or wait on, multiple channels at once. Conceptually, a channel is an iterator of io::Read objects. Physically, channels are implemented using shared memory and seL4 Notification objects, with the shared memory holding a lock-free queue.

TODO: needs lots more details.

Cap’n Proto

Cap’n Proto is a sophisticated distributed capability system. It additionally includes a highly tuned serialization format.

Robigo: POSIX on Phoma

This is probably a thing!

Concluding Remarks

Rust is awesome, seL4 is awesome, Robigalia is awesome.