Scalable and Usable Domain-Specific Automated Reasoning
By
Federico Mora
A dissertation submitted in partial satisfaction of the requirements for the degree of
Doctor of Philosophy
in
Computer Science
in the
Graduate Division
of the
University of California, Berkeley
Committee in charge:
Professor Sanjit A. Seshia, Chair
Associate Professor Alvin Cheung
Assistant Professor Natacha Crooks
Assistant Professor Lindsey Kuper
Summer 2025
Distributed systems are pervasive today, from cloud computing to Internet-of-Things (IoT) systems to connected automotive vehicles. However, building dependable distributed systems remains a challenging problem. While major strides have been made in formal methods for verifying distributed systems, successful users of formal methods must be both domain experts and experts in automated reasoning—a rare combination. In this dissertation, we address this challenge through a new automated reasoning stack that integrates artificial intelligence and formal methods at every layer.
The new automated reasoning stack consists of four key tools: Eudoxus, the P Verifier, Algaroba, and MedleySolver. Eudoxus is a neuro-symbolic tool that helps users generate formal models and specifications from informal artifacts. The P Verifier extends P, a formal modeling and specification language that developers are already using to test their distributed system designs, with support for formal verification and specification mining. To do this, the P Verifier generates satisfiability modulo theories (SMT) queries that Algaroba solves. Algaroba is currently one of the fastest SMT solvers in the world for the theory of quantifier-free algebraic data types. MedleySolver coordinates solvers, like Algaroba, to get better performance than any individual solver over realistic workloads, like those generated by the P Verifier.
We used our work to verify existing systems from related work and three new case studies; our industrial collaborators are actively using our work to verify their systems and classic protocols. Beyond these industrial applications, our stack represents the beginning of a broader vision for automated reasoning: one whose fundamental principle is that the full stack should be tailored to users’ needs rather than the other way around.
This dissertation is the product of six years’ worth of formal and informal collaborations and mentorships. Chief among my mentors is Sanjit A. Seshia, my advisor, who provided the steady hand and support that anchors this dissertation. I was fortunate to have two thesis committees. My formal committee, Alvin Cheung, Natacha Crooks, and Lindsey Kuper, encouraged me to be more ambitious and to pursue problems that excite me. My informal committee, Sarah E. Chasins, Marsha Chechik, Ankush Desai, Vijay Ganesh, Yi Li, Yatin A. Manerkar, Elizabeth Polgreen, and Max Willsey, helped me develop my research vision and shaped the researcher that I am today.
During my PhD, I was fortunate to mentor Isaac Chan, Selina Kim, Haley Lepe, Annamira O’Toole, Nikhil Pimpalkhare, and Amar Shah. Their work, ideas, and enthusiasm were my main source of energy. I was also fortunate to work with or learn from Murad Akhundov, Rohan Bavishi, Murphy Berzish, Sahil Bhatia, Dmitry Blotsky, Serena Caraco, Benjamin Caulfield, Kai-Chun Chang, Anirudh Chaudhary, Kevin Cheang, Pei-Wei Chen, Karim Elmaaroufi, Nick Feng, Pranav Gaddamadugu, Adwait Godbole, Mike He, Vincent Hui, Aniruddha Joshi, Sebastian Junges, Edward Kim, Mitja Kulczynski, Shivendra Kushwah, Kevin Laeufer, Niklas Lauffer, Caroline Lemieux, Shaokai Lin, Justin Lubin, Mae Milano, Victor Nicolet, Rohan Padhye, Yash Pant, Jiwon Park, Lauren Pick, Eric Rawn, Alejandro Sanchez Ocegueda, Joseph Scott, Ameesh Shah, Pramod Subramanyan, Hazem Torfah, Victoria Tuck, Marcell Vazquez-Chanlatte, Justin Wong, Ang Zhendong, and Parker Ziegler. Most of the work in this dissertation would not be possible without these folks.
Outside of work, I shared many happy moments with Emily Aiken, Kwasi Amofa, Jordan Baker, Sidney Buchbinder, Linda Chio, Amrutha Dorai, Allison Lui, Jackie Maslyn, Gabriel Matute, Elizabeth May, Nikita Mehandru, Amanda Meriwether, Sara Monedero, Andoni Mourdoukoutas, Suraj Nair, Carlos Ng, Jerome Quenum, Samantha Robertson, Ricardo Sandoval, Alok Tripathy, Tomás Valencia Zuluaga, and Ann Wang.
None of this work would be relevant if it were not for my family. Manuela Mora Rocha, Esteban Mora Rocha, Santiago Mora Camargo, Claudia Rocha Tamayo, Beatriz Tamayo Escobar, María Mercedes Tamayo Escobar, Amir El-Bayoumi, Jardena Gertler-Jaffe, Maja Kalaba, Mira Kalaba, Roy Mugume, Shizuka Ebata, Johnny Zhang, and Sena Ebata-Zhang all contributed with their love and support.
Finally, María Díaz de León Derby’s influence is hidden in every page of this dissertation. Gracias por ser mi equipo.
Everyday tasks, like scheduling and tax reporting, require logical reasoning. If I book a long camping trip in the summer, I might not have vacation days to spare to see family in the winter. If I contribute to my retirement fund, I can reduce my overall tax rate at the cost of money to use today. In general, this reasoning might be boring, but, with the help of pen and paper, it is easy to handle at the scale that most individuals need. These same reasoning tasks also exist at much larger scales, though. Scheduling a professional sports league will require more time and effort than scheduling my time off; reporting corporate taxes will be more complicated than reporting my graduate student income. While the underlying reasoning is the same—basic deduction—at these larger scales, pen and paper will no longer suffice. At these larger scales, we need specialized computational support.
This dissertation is about building tools for automating logical reasoning at scale. These tools must be domain-specific: we would not use the same tools for scheduling, tax reporting, and every other possible domain. Everything from the user interface to the kinds of questions that we ask will change across domains. The only constant is that the underlying reasoning will always be basic logical deduction. As such, developers frequently build automated reasoning stacks that, layer by layer, transform domain-specific questions into logical queries, solve these queries with general-purpose reasoning engines, and then transform the results back into domain-specific answers. The overall success of these automated reasoning stacks depends on how they perform at scale.
Performing well at scale depends not only on the individual layers but also on how they interact. Logical reasoning problems are usually intractable or even undecidable, so logical queries must be carefully constructed to match the heuristics of the chosen reasoning engine. For example, Dafny (Leino 2010) is a programming language and verification engine that is tightly integrated with the Z3 (de Moura and Bjørner 2008) solver. When users try to replace Z3 with other solvers, even ones that generally perform comparably to Z3, Dafny’s overall performance greatly suffers.1 In this dissertation, we show how to build a full automated reasoning stack, including both theory and tools, that is tailored to a single domain. Intentionally building a full automated reasoning stack all at once is an uncommon pursuit. We aim to provide a blueprint for similar efforts along with components and principles that are reusable in, and adaptable to, new domains. To ground our work, we focus on the domain of distributed systems safety verification.
Distributed systems are sets of computing nodes that work together to act as one system. These systems are increasingly pervasive, complex, and critical. For example, some estimates suggest that over 75% of organizations across industries use some form of cloud computing and that cloud revenues could reach $2 trillion by 2030.2 3 These industries include everything from entertainment and banking to health and science, and are geographically distributed across the world. To meet this demand, software engineers are building globally distributed systems (which power cloud services) that coordinate with each other to provide users with strong performance, availability, and fault tolerance. For example, AWS uses 114 Availability Zones and over 700 edge locations to power their cloud services.
Unfortunately, these complex cloud systems are not immune to bugs. For example, in 2024, the three major cloud providers—Amazon, Google, and Microsoft—all experienced significant failures, including outages and customer data loss.4 5 6 These failures happen despite extensive engineering efforts. For example, Liu et al. (2019) find that fault and timing-related bugs account for 44% of production Microsoft Azure incidents despite these kinds of bugs being carefully tested for. More concerning still, these issues exist in designs, not just implementations. For example, Michael et al. (2017) find that three previously published distributed system designs suffer from “serious correctness violations.”
Much of the challenge of designing and implementing distributed systems stems from bugs that can arise from complex combinations of machine failures and message orderings. This complexity is difficult for humans to reason about manually (Gabrielson 2019). As distributed systems become increasingly critical infrastructure, engineers will need more computational support to build and deploy them correctly.
Automated reasoning engines promise to be the computational support that engineers need. These engines can solve tedious, mission-critical logic problems billions of times a day (e.g., Rungta (2022)). In the domain of distributed systems specifically, these tools have helped find bugs and prove the correctness of industrial systems (e.g., Newcombe et al. (2015)). Unfortunately, despite their power and flexibility, adoption of automated reasoning engines remains low for one fundamental reason. Today, successful users must be experts in their application domain and in automated reasoning—a rare combination. Specifically, users must be able to precisely describe their problem in a formal language, such as first-order logic, and know enough about automated reasoning to make sure their encoded problem is practically solvable.7 8 9
This dissertation describes a stack of automated reasoning tools that help software engineers reason about their distributed systems. The stack reduces the need for automated reasoning expertise by allowing engineers to verify their systems in a modeling language they are already using (Chapter 5), providing one of the fastest solvers in the world for a relevant fragment of logic (Chapter 3), and partially automating the interaction with automated reasoning tools (Chapter 4, Chapter 6, and Chapter 7).
Our work follows a long tradition in formal methods that took an important turn with Lynch and Tuttle (1987). Lynch and Tuttle introduced a new modeling formalism for distributed algorithms, I/O Automata, that accentuated the distinction between internal and external actions of system components. This distinction helps decompose systems and the reasoning about them, and did not exist in many earlier modeling languages, such as Communicating Sequential Processes (Hoare 1978) and the Calculus of Communicating Systems (Milner 1980). We provide a brief overview of I/O Automata in Chapter 2.5.
Work on I/O automata and similar modeling formalisms has continued for decades with an increasing emphasis on automation and executable code generation. For example, Garland and Lynch (2000) defined a formal language for describing and reasoning about I/O automata that interfaces to both a theorem prover and a model checker, and can generate code in a standard programming language. Around the same time, TLA+ and its extensions, which are still used today, provided similar functionality and interfaces (Lamport 2002). More recent works, such as Ivy (Kenneth L. McMillan and Padon 2020), have continued this tradition as well, incorporating more automation by encoding complex questions about models in a decidable fragment of logic.
Much of the work since the 80s has focused on two kinds of specifications: safety and liveness. Informally, safety properties state that nothing bad happens in a system; liveness properties state that something good will eventually happen. For example, for a leader election protocol, one might want to ensure the safety property that at most one leader is elected at any given time. It would be bad if two candidates were elected for one position. One might also want to ensure the liveness property that we will have a leader: it would be good for the election to conclude successfully. Leader election protocols are a class of consensus protocols where a set of nodes, each performing its local computation and communicating by sending and receiving messages, must agree on a specific value or result. Consensus protocols, in turn, are a class of distributed systems.
In distributed systems, like the consensus protocols described above, we often draw a distinction between synchronous and asynchronous systems. Informally, when messages are always sent and received within a known, fixed amount of time and nodes execute at a known, fixed relative speed, we call the system synchronous (Dwork, Lynch, and Stockmeyer 1988). When messages can take an arbitrarily long amount of time to deliver, or nodes execute at arbitrary relative speeds, we call the system asynchronous. Real-life systems are generally asynchronous. When we send a packet over the Internet, we do not have formal guarantees about the delivery time or whether the packet will be delivered at all. Unfortunately, this makes liveness properties difficult and often impossible to prove in the asynchronous setting. For example, Fischer, Lynch, and Paterson (1983) proved that, in the asynchronous setting, distributed consensus is impossible in the presence of even one faulty process.
The automated reasoning stack that we build, therefore, focuses on safety properties of fully asynchronous systems. As we will see throughout the dissertation, certain decisions are tailored exclusively to this context while others are more general. We pair the more general contributions with work that helps tailor them to our specific domain. This mix of general-but-adaptable and highly domain-specific contributions resolves the central tension explored by the body of work in this dissertation: creating an entirely new approach for each new domain is costly, so we instead develop general tools and techniques that can express problems across multiple domains and then tailor them to a specific one. In short, we must be general yet adaptable to achieve both usability and scalability.
To better understand our domain and contributions, consider the following distributed system inspired by Le Lann (1977) and Chang and Roberts (1979), called the ring leader election protocol. The system consists of a set of nodes that work together to elect a leader. Each node has a unique integer label, and the goal of the protocol is for the nodes to collectively discover the node with the greatest label. Nodes are arranged in a ring: each node can only receive messages from the node on its “left” and can only send messages to the node on its “right.” Nodes execute asynchronously and begin their execution by sending their label “right.” If a node receives a label greater than its own, it forwards the received label. If the received label is smaller, then the node sends its own label “right.” Finally, if the received label is equal, then the node declares itself the winner. Figure 1(A) depicts an instance of this system with three nodes and one of its executions: node 3 is to the right of node 2; 2 to the right of 1; 1 to the right of 3; and node 3 declares itself the winner after receiving its own label from node 2. Figure 1(B) depicts one possible safety specification for this system: at most one node should declare itself the winner of the election.
Given a distributed system model and a specification, such as the ring leader election protocol and the simple safety property described above, the goal of our work is to automatically decide if the model is guaranteed to satisfy the specification. The execution depicted in Figure 1(A) does satisfy the specification depicted in Figure 1(B), but will all possible executions of the system do the same? Figure 1(G) depicts the response from the automated reasoning stack that we describe in this dissertation. The checkmark represents that the system does indeed satisfy the specification for all executions; the X mark represents that the attempted proof failed. Together, Figure 1(A), (B), and (G) depict the external interface to the automated reasoning stack. Interactions between users and the automated reasoning stack can be longer—Figure 11 depicts a more complicated example.
The next three subsections (Chapter 1.2.1, Chapter 1.2.2, and Chapter 1.2.3) give an overview of the automated reasoning stack itself—the right side of Figure 1. These three subsections cover Chapter 3, Chapter 4, Chapter 5, Chapter 6, and Chapter 7. In Chapter 2, we provide the necessary background on first-order logic to understand our work; we present a simple logical language, mar, that we use to ground our work, and we describe performance optimizations for checking mar programs related to proof caching and decomposition. We conclude this dissertation in Chapter 8 with plans for future work. Specifically, in Chapter 8.1, we present the problem of crossword construction as an example where automated reasoning can be useful but cannot exist on its own. We then briefly describe exciting future directions that combine automated reasoning engines with neural network-based tools.
In Chapter 5, we take a domain-specific language for modeling distributed systems, called P (Desai et al. 2018, 2013), and add support for formal verification through proofs by induction. This work corresponds to the arrow from Figure 1(D) to Figure 1(E): it takes a formal model and specification and generates a logical query, Figure 1(E). The logical query corresponds to the question, “Does induction prove that the model satisfies the specification?” Before our work, P had support for explicit-state model checking (similar to testing), and engineers inside Amazon were already successfully using it to find bugs in real systems (Desai 2023; Terry 2024). Our work allows these same engineers to write proofs of correctness in the modeling language they are already using (by providing inductive specifications). In short, we make the modeling process more direct and usable for distributed systems engineers. The material in Chapter 5 is based on Mora et al. (2023), which is joint work with Elizabeth Polgreen, Ankush Desai, and Sanjit A. Seshia.
Formal verification is a challenging task that is often simplified through abstraction. For distributed systems, this means that existing verification frameworks often abstract away the details of explicit message passing. The main added challenge in verifying P programs is that they explicitly model message passing. Proofs, therefore, have to consider lemmas over the low-level details of single messages. To remedy this, we provide users with language support for reasoning about sequences of logically related messages, called message chains. Message chains are related to message sequence charts (ITU-T 2011) from the software engineering community and choreographic programming languages (Montesi 2014) from the programming languages community. Both message sequence charts and choreographic programming offer developers a way to reason about explicit message passing but with the context of how messages flow through the system in question. We use message chains to bring these same ideas to formal verification.
We implemented our approach in a prototype tool called the P Verifier. The P Verifier uses UCLID5 (Polgreen et al. 2022; Seshia and Subramanyan 2018), a general-purpose verification engine, as an intermediate backend. UCLID5, in turn, generates satisfiability modulo theories (SMT) queries that are satisfiable if a proof by induction fails. We evaluated the P Verifier by verifying models of systems from related work, as well as two system models used within Amazon Web Services. We found that, with message chains, proofs use a similar number of lemmas as, and runtime is comparable to, verification tools that require more abstract modeling. A new version of the P Verifier, which is part of the P language toolchain, is powering verification efforts inside Amazon Web Services today (P Team 2025).
Formally modeling and specifying a domain-specific task can be a tedious process. It is one thing to give an informal description of a distributed system like the ring leader election protocol and its safety specification. It is another thing altogether to formally define the system in a verification language and prove its correctness for all possible executions. This is true when using domain-specific languages, like P, and also in general purpose verification languages, like UCLID5.
In Chapter 7, we present an approach to building text-to-formal-model tools that help automate the modeling process. These tools correspond to the arrow into Figure 1(C): they take informal artifacts, Figure 1(A) and (B), and produce formal code, Figure 1(C). To handle informal artifacts, like unstructured text, these tools depend on language models (LMs). The primary challenge is that LMs struggle to generate code in low-resource programming languages, such as most verification languages. Existing solutions either provide no guarantees and perform poorly (e.g., fine-tuning and prompting) or are not expressive enough to capture real modeling languages (e.g., constrained decoding with context-free languages). We overcome these challenges with two key ideas originating from two distinct fields of computer science: human-computer interaction and formal methods. The material in Chapter 7 is based on Mora et al. (2024), which is joint work with Justin Wong, Haley Lepe, Sahil Bhatia, Karim Elmaaroufi, George Varghese, Joseph E. González, Elizabeth Polgreen, and Sanjit A. Seshia.
The first key idea behind our approach draws inspiration from natural programming elicitation, a class of studies that help programming language designers understand how programmers naturally approach problems within a given programming domain (Myers, Pane, and Ko 2004). Programming language designers use the findings of these studies to create languages that align with users’ expectations, resulting in reduced programming friction and more effective developers. We borrow this idea for the setting where LMs are the “users.” Akin to uncovering what human users find natural for a given domain, we uncover what LMs find “natural.” Specifically, our first insight is to embrace LMs’ tendencies and design an intermediate language that aligns with the LM’s output distribution for tasks in our domain.
Our first key idea makes it possible for LMs to generate reasonable programs, but it presents a new challenge: different languages have different features, such as type systems, that make translation difficult. The second key idea behind our approach is to use an automated reasoning engine—a MAX-SMT solver in this case—to identify the minimal set of locations in the intermediate program that prevent a correct translation to the target language. We then repair these locations using deductive techniques and LM calls, repeating the process as necessary. This symbolic component guarantees that generated programs pass all compiler checks, including type checking, for the target modeling language.
We implemented a prototype of our approach in a tool called Eudoxus, which translates natural language text into UCLID5 programs. Eudoxus outperforms fine-tuning, self-repair, few-shot prompting, and chain-of-thought prompting over test problems collected from three well-known textbooks. Specifically, our tool correctly solved 33% of textbook problems, while the next best approach solved only 3%.
Once we have formal code, we can analyze it, execute it, and augment it to find bugs or help prove correctness properties. In Chapter 6, we show how to automatically augment formal models with specifications—specifically some of the lemmas required for verification in Chapter 5—using variations on classic learning algorithms, such as non-erasing pattern learning (Angluin 1980) and recursive function synthesis. This process is represented by the arrow from Figure 1(C) to Figure 1(D): it takes a formal artifact, Figure 1(C), and augments it with formal specifications, Figure 1(D). The intuition behind our approach is that message chains reveal enough structure for existing techniques to learn meaningful specifications from distributed system execution data alone. That is, given only example message chains (no access to the system definition or specifications), existing learning techniques can discover properties that hold for all executions of a system. Users can then use these properties in their safety proofs. The material in Chapter 6 is based on Mora et al. (2023), which is joint work with Elizabeth Polgreen, Ankush Desai, and Sanjit A. Seshia.
In Chapter 3, we describe a solver called Algaroba that answers logical queries like those generated by Eudoxus and the P Verifier: Algaroba is the arrow from Figure 1(E) to Figure 1(F). Specifically, Algaroba handles SMT queries with algebraic data types, which naturally capture programming data structures, like message chains, and appear extensively in the questions generated by the P Verifier and other distributed systems verification projects (e.g., Zhang et al. (2024)). Algaroba is currently one of the fastest solvers available: it won the quantifier-free algebraic data type track at the 2024 SMT competition (Bromberger, Bobot, and Jonáš 2024). The material in Chapter 3 is based on Shah, Mora, and Seshia (2024), which is joint work with Amar Shah and Sanjit A. Seshia.
Algaroba takes an eager approach to solving queries. Specifically, it takes a quantifier-free query containing algebraic data types and translates it to a quantifier-free query without algebraic data types that existing non-algebraic data type solvers can handle. Algaroba eliminates algebraic data types by replacing them with uninterpreted sorts and functions along with a finite number of quantifier-free axioms. The key technical challenge with eagerly solving algebraic data type queries is that this kind of translation seems like it should require an unbounded number of axioms (e.g., lists can be arbitrarily long, but no list is equal to any of its sub-lists). Despite this, we proved that our finite translation is sound and complete.
Most other automated reasoning engines in this space take a lazy approach. In the lazy approach, theory axioms are introduced as needed rather than upfront. Since Algaroba takes a fundamentally different approach to existing tools, its performance profile is unique: Algaroba solves many queries that no other solver succeeds on, and other solvers succeed on queries for which Algaroba fails. This phenomenon is typical for SMT solvers. The problem is that, for any given query, it is hard to know which solver will succeed ahead of time. Visually, if every solver is an arrow out of Figure 1(E), the question is, “Which arrow will lead to an answer in the least amount of time?”
In Mora et al. (2021), we defined three solving algorithms for SMT queries over the theory of strings and provided a fixed classifier for deciding when to use which algorithm. In Chapter 4, we generalize that work to an online learning scheme that supports any logical theory. Our tool, called MedleySolver, takes a stream of queries from a given domain. For each query, MedleySolver picks a sequence of solvers to execute, returns an overall answer, and learns from each individual solver’s performance. Over time, without any human expertise, MedleySolver automatically discovers the best solver coordination strategy for queries in the target domain and solves more queries than any individual solver using significantly less time. This work corresponds to the arrow at the bottom of Figure 1 and is based on Pimpalkhare et al. (2021), which is joint work with Nikhil Pimpalkhare, Elizabeth Polgreen, and Sanjit A. Seshia.
Overall, this dissertation presents novel integrations of artificial intelligence and formal methods at every level of the automated reasoning stack for scalable and usable domain-specific automated reasoning. We apply our work to distributed systems modeling and verification but aim to be an exemplar for how to build a full automated reasoning stack that targets any domain. In particular, we make the following contributions.
We create the P Verifier, a formal verification engine for the P programming language. Engineers already use P for modeling and testing their distributed systems. Existing verification engines for distributed systems do not provide users with language primitives or abstractions to support explicit message passing. Working at the message passing level—where a significant source of implementation errors lies—therefore usually comes at a cost: significant manual modeling and user effort. We introduce the idea of message chain invariants, seamlessly integrate them into the P Verifier, and show that, with message chain invariants, explicitly reasoning about message passing does not make verification more challenging. Specifically, message chain invariants give users the vocabulary to reason about the low-level details of explicit message passing while remaining at an intuitive level of abstraction. We demonstrate this empirically by replicating verification efforts from related work, simplifying these proofs using message chains, and then verifying a system with complex message passing (the onion routing network) and two industrial case studies.
We present a new specification mining toolkit that uses the structure provided by message chains to discover meaningful invariants for distributed system models. Our tools take execution data and infer specifications that likely hold for all executions. This reduces the burden on verification users by suggesting invariants that would otherwise require manual effort to discover and formalize. We extend a classic learning algorithm for non-erasing patterns, give a domain-specific definition of optimality, and prove that our algorithm correctly discovers the optimal specification using an NP oracle. We empirically evaluate our tools and find that six out of 24 invariants that were used in P Verifier proofs can be mined automatically in under a second each. Without the structure of message chains, the underlying learning algorithms cannot discover meaningful properties.
We present a new auto-formalization method that generates quality code in very low resource programming languages—like languages for verification—where existing work fails to even generate syntactically correct code. Auto-formalization methods take natural language text and return code in a formal language. Existing work focuses on prompting, constrained decoding, and fine-tuning strategies, which are unreliable, insufficiently expressive, or data hungry. We propose a complementary design-based solution. Specifically, we design an intermediate language that LMs are good at writing and can be compiled down to the target, low resource language. When LMs write code that is not exactly in the intermediate language, we use automated reasoning to identify the minimal error source and automatically repair the code at that source. We empirically evaluate our approach on a very low resource language in the verification domain, and show that we can substantially improve the syntactic correctness of generated code, producing code that passes all compiler checks 84.8% of the time compared to 9.1% by the best baselines.
We present one of the fastest SMT solvers in the world for queries over the theory of quantifier-free algebraic data types. Most existing solvers take a lazy approach. We define a new eager approach, prove that it is sound and complete, and demonstrate its empirical performance over three large benchmark sets. These results were independently validated at the 2024 SMT competition (Bromberger, Bobot, and Jonáš 2024), where our solver won the quantifier-free algebraic data type track.
SMT users typically solve more than one query at a time and do not need to solve the same query more than once. We present the first online learning-based solution for the problem of improving solver performance in this typical SMT setting. Existing work—like solver selection tools—can help by using expensive solver runtime data to optimize solver performance generally. We frame the problem as an online learning problem and combine multi-armed bandits with timeout prediction schemes to match, and sometimes exceed, the performance of existing work without any expensive pre-training.
In this chapter, we provide a short background on satisfiability modulo theories (SMT) solvers. We then define algebraic data types, both as a programming construct and as a logical theory, and we define a simple programming language, called the Modeling and Automated Reasoning (mar) language, that can express useful satisfiability queries over algebraic data types. We use mar to express all the contributions of this dissertation. For example, later in this chapter, we define I/O automata and how to reason about I/O automata using mar.
Boolean satisfiability (SAT) solvers have been shown to efficiently solve a number of NP-hard problems in areas such as AI planning (Kautz and Selman 1992), verification (Clarke et al. 2001), and software testing (Cadar et al. 2006). Satisfiability modulo theories (SMT) solvers are a natural extension of SAT solvers that reason about many-sorted first-order logic instead of just propositional logic. Specifically, SMT solvers can reason about first-order structures defined by background theories (Barrett et al. 2021), allowing them to tackle more general problems or to accept more succinct inputs. For example, SMT solvers can reason about bit-vectors (Brummayer and Biere 2009), floating-point numbers (Rümmer and Wahl 2010), strings (Bjørner et al. 2012), and algebraic data types (Barrett, Fontaine, and Tinelli 2017).
In this section, we provide a minimal reminder of concepts from many-sorted first-order logic and model theory that are required to understand what SMT solvers are. Many-sorted first-order logic is like first-order logic but with the universe partitioned into different sorts (Barrett et al. 2021). For a complete introduction to first-order logic, we refer the reader to Lubarsky (2008). We focus on quantifier-free many-sorted first-order logic.
Symbols in many-sorted first-order logic are either logical or non-logical. Logical symbols include logical connectives, like disjunction (\(\lor\)), conjunction (\(\land\)), and negation (\(\neg\)). Non-logical symbols include predicate symbols and function symbols. Predicates and functions have an associated arity. In first-order logic, the arity is the number of arguments that the predicate or function takes. In our context, many-sorted first-order logic, the arity is more like a type signature: it defines the input sorts and the return sort. We treat predicates as functions whose return sort is Boolean: the set of predicates is a subset of the set of functions. When a function takes no arguments, we call it a constant. The set of non-logical symbols and their associated arity is given by a signature.
Expressions are formed from the symbols in a given signature. There are two kinds of expressions: terms and formulas. The set of terms is the smallest set such that every constant is a term and, if \(t_1, ..., t_n\) are terms and \(f\) is a function with appropriate arity, then \(f(t_1, ..., t_n)\) is a term. The set of formulas is the smallest set such that every application of a predicate to terms of appropriate sorts is a formula and, if \(\phi_1\) and \(\phi_2\) are formulas, then \(\neg \phi_1\), \(\phi_1 \land \phi_2\), and \(\phi_1 \lor \phi_2\) are formulas.
A theory literal is a predicate application or its negation. These are the base units of SMT solving and are the equivalent of literals in SAT solving. A clause is a disjunction of theory literals. A cube is a conjunction of theory literals. A formula is in conjunctive normal form (CNF) iff it is a conjunction of clauses. A formula is in disjunctive normal form (DNF) iff it is a disjunction of cubes. A formula is in negation normal form (NNF) if only theory literals are negated. A formula is flat if all theory literals are of the form \(\neg (t_1 = t_2)\), \(t_1 = t_2\), or \(t_1 = g(t_2, ..., t_n)\) where \(t_i\) are constants.
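For example, over constants \(w\), \(x\), \(y\), \(z\) and a unary function \(f\), the formula \((x = y \lor \neg(y = z)) \land (w = f(x))\) is flat, in NNF, and in CNF (it is a conjunction of two clauses), but it is not in DNF.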
A first-order theory is a set of formulas (axioms). For example, the theory of Equality and Uninterpreted Functions (UF) consists of axioms that restrict the possible interpretations of \(=\) (reflexivity, symmetry, transitivity) and of all function symbols (congruence) (Burch and Dill 1994). A structure is a many-sorted universe along with a function that describes how all non-logical symbols must be interpreted over the many-sorted universe. We extend the notion of interpretation to terms and formulas inductively, as expected. When a formula \(\phi\) is interpreted as true in a structure \(\mathcal{M}\), we say that \(\mathcal{M}\) satisfies \(\phi\), denoted \(\mathcal{M} \models \phi\), and we call \(\mathcal{M}\) a model of \(\phi\). As a slight abuse of notation, for a set of formulas \(\Phi = \{\phi_1, ..., \phi_n\}\), we use \(\mathcal{M} \models \Phi\) to mean \(\mathcal{M} \models \bigwedge_{i=1}^n \phi_i\). Similarly, we use \(\phi \models \psi\) to mean every model of \(\phi\) is a model of \(\psi\). For a model \(\mathcal{M}\), we use \(\mathcal{M}[f]\) to represent the interpretation of \(f\) in \(\mathcal{M}\).
Definition 1 (SMT Query and Solver). An SMT query is a formula-theory pair. We say that an SMT query (\(\phi, T\)) is SAT iff there exists a model \(\mathcal{M}\) such that \(\mathcal{M} \models T \cup \{\phi\}\). Otherwise we say the query is UNSAT. SMT solvers take an SMT query as input and return SAT or UNSAT. Figure 1(E) represents an SMT query, Figure 1(F) represents a SAT/UNSAT answer, and the arrow from (E) to (F) represents an SMT solver.
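For example, the SMT query \((f(x) = x \land \neg(f(f(x)) = x), \text{UF})\) is UNSAT: the equality and congruence axioms force \(f(f(x)) = f(x) = x\). Dropping the first conjunct yields a SAT query; one witnessing model has a two-element universe \(\{0, 1\}\) with \(\mathcal{M}[x] = 0\) and \(\mathcal{M}[f]\) mapping both elements to \(1\).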
An SMT solver \(S\) is sound for a theory \(T\) if, for every formula \(\phi\), \(S(\phi, T) = \text{SAT}\) implies that the query (\(\phi, T\)) is SAT. An SMT solver is complete for a theory \(T\) if, for every formula \(\phi\), the query (\(\phi, T\)) is SAT implies that \(S(\phi, T) = \text{SAT}\). The distinctive aspect of SMT solvers is that they perform an encoding to SAT, either implicitly or explicitly. Eager SMT solvers perform a satisfiability-preserving reduction to SAT in a single phase (e.g., see Seshia (2005)) whereas lazy solvers perform an iterative encoding, on demand.
In Chapter 3, we use the following definition to build a sound and complete solver for the quantifier-free theory of algebraic data types using a sound and complete solver for the theory of uninterpreted functions and equality.
Definition 2 (Theory Reduction). A theory \(T\) reduces to a theory \(R\) if there is a computable function \(m\) such that \[(\psi, T)\; \text{is SAT} \leftrightarrow (m(\psi), R)\; \text{is SAT}\]
In Chapter 7, we use an optimization variant of SMT solvers, called weighted maximum satisfiability modulo theories solvers (MAX-SMT solvers).
Definition 3 (MAX-SMT Query and Solver). The maximum satisfiability problem is, for a given (CNF) formula \(\phi\), to determine the largest subset of \(\phi\)’s clauses that is SAT. That is, solvers aim to satisfy as many clauses as possible. The weighted maximum satisfiability problem (MAX-SMT) is a variation with two differences. First, some clauses can be “hard”—meaning they must be satisfied. Second, every “soft” clause (any clause that is not “hard”) has an associated weight. The problem is then to determine the subset of \(\phi\)’s clauses that maximizes the sum of weights while being SAT and containing all “hard” clauses.
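For example, suppose the hard clause is \(x = y\) and the soft clauses are \(\neg(y = z)\), with weight 2, and \(x = z\), with weight 1. No model satisfies all three clauses, because the hard clause together with \(x = z\) implies \(y = z\), so a MAX-SMT solver returns the subset containing the hard clause and \(\neg(y = z)\), for a total weight of 2.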
Algebraic data types (ADTs) are a representation of finite trees that are common in functional programming languages. The power behind ADTs lies in how they can succinctly express complex structures at a high level of abstraction while avoiding common programming pitfalls, like null pointer dereferencing (Hoare 1975). For most of their history, ADTs lived exclusively inside functional programming languages, like NPL (Burstall 1977), Standard ML (Milner 1997), and Haskell (Hudak et al. 2007). Recently, however, interest in ADTs has exploded, with a number of mainstream languages either being released with support for ADTs, e.g., Rust (Jung et al. 2021), or adding support later, e.g., Python10 and Java.11
An algebraic data type consists of a set of constructors (node names), selectors (directed edge names), and testers (predicates). Each constructor has a fixed set of selectors associated with it (possible edges out). Each selector has an associated type (each edge can only point to a fixed set of node names). Every constructor is associated with exactly one unique tester: that tester returns true on a given tree iff the root of the tree is labeled with the corresponding constructor. Every instance of an algebraic data type (a particular finite tree built from those node and edge names) must be acyclic.
We denote the theory of algebraic data types as DT. It contains the full theory of UF along with the following additional axioms.
Definition 4 (Theory of Algebraic Data Types). An algebraic data type \(A\) is a sort \(\sigma\) along with a set of constructors, a set of selectors, and a set of testers, as described above.
Furthermore, every algebraic data type \(A\) satisfies the following axioms—the last axiom is known as the acyclicality axiom.
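Informally, and in line with standard presentations of the theory, these axioms require that selectors invert constructors (applying a selector associated with a constructor \(C\) to a term built with \(C\) returns the corresponding argument), that the tester associated with \(C\) returns true exactly on terms whose root is \(C\), that every term of sort \(\sigma\) is built by exactly one of the constructors, and that no term is equal to any of its proper descendants (acyclicality).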
For two terms \(s\) and \(t\), if \(s\) can be obtained by applying a sequence of \(l\) selectors to \(t\), then we say \(s\) is an (\(l^\text{th}\)) descendant of \(t\).
An algebraic data type term is any expression of an algebraic data type sort. The set of normal algebraic data type terms is the smallest set containing (1) constants (0-ary constructors) and (2) constructors applied only to normal algebraic data type terms. It is useful to think of normal terms as trees: constants are leaves and we can build larger trees by applying constructors to normal terms.
In Chapter 3.2 we define an algebraic data type called tower. The definition of tower uses two constructors: Empty and Stack. These are the two possible ways to build a tower. Empty is a function that takes no inputs and outputs a tower. Stack is a function that takes a block (another simple algebraic data type) and a tower and outputs a tower. Each constructor has a corresponding set of selectors. Empty has no selectors and so it is a normal term, but Stack has two selectors, top and rest. We often apply selectors using dot notation, e.g., x.top. Selectors can be thought of as de-constructors—we use them to get back the terms used to construct a tower. The tower definition implicitly defines two testers: is Empty and is Stack. These predicates take a tower and return true iff the argument was built using the matching constructor. In mar syntax, which we describe in Chapter 2.3, tower is defined as follows.
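// sketch of the declaration from Chapter 3.2; block is a separate, simple algebraic data type
data tower = Empty() | Stack(top: block, rest: tower)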
We deviate from the SMT-LIB standard slightly in the vocabulary we use related to algebraic data types. The following clarifies the connection. If a sort \(t\) has a non-empty set of constructors, then we call \(t\) an algebraic data type. If \(t\) has a single constructor \(c\) and every selector \(s\) of \(c\) is such that it is not the case that \(t\) appears in the sort of \(s\), then we call \(t\) a record. When \(t\) is a record with constructor \(c\), we call every selector \(s\) of \(c\) a field of \(t\). If \(t\) is an algebraic data type and every constructor of \(t\) has no associated selectors, then we call \(t\) an enum. When \(t\) is an enum, we call every constructor of \(t\) a variant of \(t\) and often treat the enum as a set.
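For example, in mar syntax, the following two illustrative declarations define a record point with fields x and y (where natural is a data type representing natural numbers, defined later in this chapter), and an enum color with variants Red, Green, and Blue.

data point = Point(x: natural, y: natural)
data color = Red() | Green() | Blue()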
The modeling and automated reasoning language (mar) is a simple functional programming language that supports satisfiability testing, algebraic data types, and simple function contracts.12 We use mar as an intermediate language: it is easy to translate to SMT-LIB 2.6—the standard input language for most modern SMT solvers—and it introduces just enough of an abstraction to express the contents of this dissertation. One can think of mar as the smallest version of UCLID5 needed to understand the contributions of this dissertation.
We describe the syntax of mar programs in seven pieces: algebraic data type declarations, type synonym definitions, type expressions, function declarations, annotations, expressions, and commands. The syntax for algebraic data type declarations with type parameters is as follows, where r is a name representing the data type, each Ci is a name representing a constructor, each sij is a name representing a selector, and each tij is a type expression corresponding to the selector sij.
The syntax for type declarations is as follows, where n is a name representing a type synonym, and t is a type expression.
The syntax for type expressions is as follows, where a is a name representing a type parameter or synonym.
The syntax for a function declaration is as follows, where f is a name representing the function, each xi is a name representing a parameter, each ti is a type expression representing the type of xi, y is a name representing the return value, t is a type expression representing the return type, and e is an optional expression.
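Concretely, the function declarations used later in this chapter take shapes like the following sketch; the named return value and the = e body can each be omitted, as in let subtract-two (x: natural) -> natural = x.p.p and let a: natural.

b ::= let f (x1: t1, ..., xn: tn) -> (y: t) = e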
Function declarations can be annotated with pre- and post-conditions. The syntax for these annotations is as follows, where f is the name of a previously declared function, e1 is a (boolean) expression over the arguments xi of f, and e2 is a (boolean) expression over the arguments xi of f and f(x1, ..., xn).
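Concretely, the annotations used throughout this dissertation take the following shape, where n is a name for the annotation (a sketch based on the examples in this chapter).

a ::= f requires n: e1
a ::= f ensures n: e2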
The syntax for an expression is as follows, where each Ci is a name representing a constructor, each xi is a name representing a variable, f is a name representing a function, each ei is an expression, s is a selector, and each ti is a type expression.
e ::= true
e ::= false
e ::= x
e ::= e1 is C
e ::= e1.s
e ::= e1 == e2
e ::= e1 != e2
e ::= not e
e ::= e1 and e2
e ::= e1 or e2
e ::= e1 ==> e2
e ::= if e1 then e2 else e3
e ::= f(e1, ..., en)
e ::= C(e1, ..., en)
e ::= match e1 with | C1(x1 ... xn) -> e2
| ...
| Cm(x1 ... xn) -> em
The syntax for commands is as follows, where e is an expression, and each ni is a name representing a declared annotation.
The syntax for a program is as follows, where d is a type declaration, b is a function declaration, a is a function annotation, and c is a command.
For example, the following is a simple mar program:
data natural = Z() | S(p: natural)
data option-natural = None() | Some(value: natural)
let get-natural (x: option-natural) -> (y: natural) =
match x with
| Some(n) -> S(n)
| None() -> Z()
print get-natural(Some(S(S(Z())).p))
get-natural requires x-is-some: x is Some
get-natural ensures y-is-s: not y is Z
A mar program is valid if it is well-typed and all functions are total. We define the semantics of valid mar programs. Algebraic data type declarations, function declarations, and expressions have their expected meaning as described in Chapter 2.2. Given the one-to-one correspondence, we will frequently move between mathematical syntax, e.g., \(p \land q\), and mar syntax, e.g., p and q. For code and transformations we will use mar syntax; for formal reasoning and meta theory we will use mathematical syntax. The main point of interest for this section is defining the semantics of mar commands.
We split the semantics of mar programs into two pieces: composed and contract semantics. The composed semantics describes the behavior of assert and print statements using function contracts in place of function bodies. The contract semantics describes the meaning of proof commands, which check if function bodies and calls satisfy their contracts.
Let \(\phi\) be the conjunction of expressions in assert commands, let \(f\) be a function with input parameters \(x_1, ..., x_n\) and output parameter \(y\), and let \(e_f(x_1, ..., x_n, y)\) be a boolean function capturing the conjunction of expressions in ensures annotations for the function \(f\). The meaning of mar programs in the composed semantics is the satisfiability query \((\phi_f \land \psi, DT)\), where \(\phi_f\) is like \(\phi\) but with every call to \(f\) replaced by a fresh variable \(y\), and \(\psi\) constrains every fresh variable \(y\) to respect \(e_f\). If \((\phi_f \land \psi, DT)\) is SAT, then the meaning of print commands is determined, as expected, by any model of \((\phi_f \land \psi, DT)\). If \((\phi_f \land \psi, DT)\) is UNSAT, then the meaning of every print command is “UNSAT.”
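For instance, consider the simple mar program shown earlier that defines get-natural. It contains no assert commands, so \(\phi\) is simply true. The call get-natural(Some(S(S(Z())).p)) in the print command is likewise replaced by a fresh variable \(y\), and \(\psi\) is the ensures annotation y-is-s instantiated at \(y\), i.e., \(\neg(y~\text{is}~Z)\). The resulting query is SAT, so the print command displays the value of \(y\) in some model, e.g., S(Z()).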
The meaning of mar programs in the contract semantics is two labels on each function. There are two possible label values: “proven” and “rejected.” Let \(f\) be a function with input parameters \(x_1, ..., x_n\), let \(r_f(x_1, ..., x_n)\) be a boolean function capturing the conjunction of expressions in requires annotations for the function \(f\), and let \(e_f(x_1, ..., x_n)\) be a boolean function capturing the conjunction of expressions in ensures annotations for the function \(f\). The first label on a function \(f\) is “proven” iff every call to \(f\) with arguments \(e_1, ..., e_n\) is such that the satisfiability query \((\neg r_f(e_1, ..., e_n), DT)\) is UNSAT. The first label on a function \(f\) is “rejected” if there exists a call to \(f\) with arguments \(e_1, ..., e_n\) such that the satisfiability query \((\neg r_f(e_1, ..., e_n), DT)\) is SAT. The second label on a function \(f\) is “proven” iff the satisfiability query \((r_f(x_1, ..., x_n) \land \neg e_f(x_1, ..., x_n), DT)\) is UNSAT, where \(x_1, ..., x_n\) are fresh variables. Like for the first label, if the same query is SAT, then the second label is “rejected.” In all cases, the expressions \(e_1, ..., e_n\) use the composed semantics.
To execute a mar program is to solve all the corresponding satisfiability queries.
The semantics of mar programs as described above is sufficient for understanding the contributions of this dissertation. In this section, we shift our focus to more practical considerations related to writing and reasoning about mar programs. These contributions are yet to be published but are an essential part of making automated reasoning usable and scalable. Specifically, in this section, we introduce proof commands and give more detailed semantics, including a new label, “unknown.”
In general, writing mar programs is not a one-shot process. Users refine their models and specifications repeatedly, and they make frequent calls to solvers under the hood as they go. In Chapter 4 we describe how to automatically learn to pick the best solver over sequences of queries, like those that we would expect to see from a typical mar user. This is a form of adaptability. In this chapter, we describe a mar backend strategy that gives users two more desired features: non-repetition and progress.

Non-repetition refers to a backend that avoids calling a solver to perform the same reasoning multiple times. Progress refers to a backend that avoids reopening previously settled reasoning questions. To better understand the features and the distinction between them, consider the following mar program.
data natural = Z() | S(p: natural)
let subtract-two (x: natural) -> natural = x.p.p
let add-three (x: natural) -> natural = S(S(S(x)))
The program declares a data type that represents natural numbers, natural, and two functions that act on natural numbers, subtract-two and add-three. These functions use selectors and constructors. Since selectors can lead to undefined behavior—selecting p from Z() can result in any value—we likely want to require that subtract-two is never called with an input value of less than two. The following code snippet enforces this precondition.
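subtract-two requires x-is-gte-2: x is S and x.p is S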
Every time we use subtract-two in the code, the backend must check that the precondition holds. For example, if we declare an uninterpreted function a of type natural and then print subtract-two(add-three(a)), then the backend must check that add-three(a) is greater than or equal to two. Without any postconditions on add-three, the solver will not be able to prove that the precondition on subtract-two holds for that call. The simplest possible postcondition that we can add says that the result of an application of add-three is greater than or equal to two. The following snippet captures this constraint.
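add-three ensures y-is-gte-2: add-three(x) is S and
                              add-three(x).p is S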
Running the mar backend will result in a successful proof that the precondition x-is-gte-2 is satisfied for the function application in the print statement. If we run the backend again on this same input, we should not need to check this fact again. This is the simplest form of non-repetition. This version of non-repetition can be achieved by caching queries.

Now suppose that we add the assertion subtract-two(add-three(a)) is Z. One might expect this assertion to block all models and therefore that printing subtract-two(add-three(a)) should fail. But, in the composed semantics, the backend must only consider the function contracts, and therefore the print statement should actually display Z(). To get the response that we might expect when considering the body of the add-three function, we need a different postcondition. Specifically, we can add the following postcondition.
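add-three ensures y-is-gte-3: add-three(x) is S and
                              add-three(x).p is S and
                              add-three(x).p.p is S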
For the composed semantics, this new postcondition will allow the backend to conclude UNSAT, and the print statement will fail, as expected. Unfortunately, this will also result in a new query for the contract semantics. We already proved that the precondition x-is-gte-2 will hold for the call subtract-two(add-three(a)) using the postcondition y-is-gte-2. But now, the naive backend would have to re-prove this fact with a larger query: the new query would contain both y-is-gte-2 and y-is-gte-3. In many realistic cases—programs that are more complex than this toy example—adding constraints to previously solved queries can make the solver time out. If adding y-is-gte-3 caused the solver to time out, then our edits to the mar program would have actually set back our proof efforts. Guaranteeing that this cannot happen (for a class of updates) is a form of progress.

Non-repetition and progress are similar. Providing progress opens up opportunities for non-repetition: if we can avoid changing backend checks, then we will get more cache hits. The opposite can also be true. More fine-grained caching, like through saving UNSAT cores, can automatically guarantee some progress—adding irrelevant facts will not lead to a new solver call. Another form of progress comes from building on previously proved facts. For example, if we have previously proven y-is-gte-3, then we can easily prove y-is-gte-2 without even looking at the body of add-three. For more complicated functions and properties, building on previously proven facts can lead to speedups over the entire verification effort.
We propose a lightweight solution to both non-repetition and progress called proof scripts. Users can explicitly label proof dependencies and the backend can use the implied dependency graph to minimize the number and size of generated queries. The following code consists of the program and contracts described above along with four proof commands.
// ----- Program -----
data natural = Z() | S(p: natural)
let subtract-two (x: natural) -> natural = x.p.p
let add-three (x: natural) -> natural = S(S(S(x)))
let a: natural
print subtract-two(add-three(a))
assert subtract-two(add-three(a)) is Z
// ----- Contracts -----
subtract-two requires x-is-gte-2: x is S and x.p is S
subtract-two ensures y-is-0-if-2: subtract-two(x) is Z
                                  ==> x == S(S(Z()))
add-three ensures y-is-gte-2: add-three(x) is S and
                              add-three(x).p is S
add-three ensures y-is-gte-3: add-three(x) is S and
                              add-three(x).p is S and
                              add-three(x).p.p is S
// ----- Proofs -----
prove y-is-0-if-2 using x-is-gte-2
prove x-is-gte-2 using y-is-gte-2
prove y-is-gte-2
prove y-is-gte-3
Each proof command is of the form prove ID (using ID+)?, where every ID must point to the name of either a requires or an ensures clause. We call the first ID the target and subsequent IDs the dependencies. In the example above, with the command prove x-is-gte-2 using y-is-gte-2, the user is telling the backend that it only needs to use the postcondition y-is-gte-2 to prove the precondition x-is-gte-2 (the target). Later, the user also asks the backend to prove y-is-gte-2, this time with no dependencies: the checker can verify that property with only the body of the add-three function. In terms of the dependency graph, we say that x-is-gte-2 depends on y-is-gte-2 (there is an edge from x-is-gte-2 to y-is-gte-2) and that y-is-gte-2 is a leaf (there is no edge from y-is-gte-2). To ensure soundness, we require that the dependency graph is acyclic. We also require that every annotation appears as the target of a proof command at most once.
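For instance, a proof script like the following would be rejected, because its two commands form a cycle in the dependency graph:
prove x-is-gte-2 using y-is-gte-2
prove y-is-gte-2 using x-is-gte-2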
Given an acyclic dependency graph, we re-define the semantics of mar programs originally described in Chapter 2 as follows. The meaning of mar programs in the contract semantics is a label on each annotation. There are three possible labels: “proven” (☑), “rejected” (☒), and “unknown” (☐). If an annotation is not the target of any proof command, then it receives the label “unknown.” Otherwise, let \(f\) be a function with input parameters \(x_1, ..., x_n\) and let \(a\) be the target annotation of some proof command with dependency set \(D\) that refers to \(f\). There are two cases: either \(a\) is a precondition or it is a postcondition. We will consider both cases, in sequence, using the following definitions. Let \(r_f(x_1, ..., x_n)\) be a boolean function capturing the conjunction of expressions in requires annotations for the function \(f\) that are in \(D\), and let \(e_f(x_1, ..., x_n, y)\) be a boolean function capturing the conjunction of expressions in ensures annotations for the function \(f\) that are in \(D\).
Suppose \(a\) is a requires annotation. The precondition \(a\) receives the label “proven” iff every call to \(f\) with arguments \(e_1, ..., e_n\) is such that the satisfiability query \[(r_f(e_1, ..., e_n) \land \neg a(e_1, ..., e_n), DT)\] is UNSAT. The precondition \(a\) receives the label “rejected” if there exists a call to \(f\) with arguments \(e_1, ..., e_n\) such that the same satisfiability query is SAT. In both cases, each \(e_i\) uses the composed semantics of mar programs restricted to the annotations in \(D\).
Suppose \(a\) is an ensures annotation. The postcondition \(a\) receives the label “proven” iff the satisfiability query \[
\begin{aligned}
(r_f(x_1, ..., x_n) &\land e_f(x_1, ..., x_n, f(x_1, ..., x_n)) \\
&\land \neg a(x_1, ..., x_n, f(x_1, ..., x_n)), DT)
\end{aligned}
\] is UNSAT, where \(x_1, ..., x_n\) are fresh variables. Like for requires annotations, if the same query is SAT, then the ensures annotation receives the label “rejected.” In both cases, each \(e_i\) uses the composed semantics of mar programs restricted to the annotations in \(D\).
Therefore, running the following mar program produces output reporting that print subtract-two(add-three(a)) results in Z() and that one of the three annotations, x-is-gte-2, is successfully proved. The other two annotations are either not the target of any proof command (y-is-gte-2) or do not hold (y-is-0-if-2).
data natural = Z() | S(p: natural)
let subtract-two (x: natural) -> natural = x.p.p
let add-three (x: natural) -> natural = S(S(S(x)))
let a: natural
print subtract-two(add-three(a))
assert subtract-two(add-three(a)) is Z
subtract-two requires x-is-gte-2: x is S and x.p is S
subtract-two ensures y-is-0-if-2: subtract-two(x) is Z
==> x == S(S(Z()))
add-three ensures y-is-gte-2: add-three(x) is S and
add-three(x).p is S
prove y-is-0-if-2 using x-is-gte-2
prove x-is-gte-2 using y-is-gte-2
Every proof command can be checked independently, opening up the door to parallel solving. For large systems, like the industrial systems described in Chapter 5, this can mean hundreds, or even thousands, of independent checks. Ideally, we could run all these checks in parallel until the first one fails or they all succeed. For each check, we could run every available solver with every possible configuration. In reality, we usually do not have enough processors available for this strategy to work.
Instead, we can use a tool like MedleySolver (Chapter 4) to pick solvers and to order checks. This usage of MedleySolver would reduce the number of solver calls per individual check. We can automatically learn to prioritize checks that will terminate more quickly, helping us find counterexamples faster. We are exploring these ideas in ongoing work.
In this section, we give the necessary background on I/O automata and provide a basic encoding of I/O automata into the mar language. We present a version of I/O automata (Lynch 1996) tailored to our context along with examples focused on our encoding described in Chapter 5. Let \(\textit{states}\) and \(\textit{actions}\) be disjoint sets. An I/O automaton \(A\) is a six-tuple \[(\textit{inp}(A), \textit{int}(A), \textit{out}(A), \textit{states}(A), \textit{start}(A), \textit{trans}(A)),\] where \(\textit{inp}(A)\) (input actions), \(\textit{int}(A)\) (internal actions), and \(\textit{out}(A)\) (output actions) are disjoint subsets of \(\textit{actions}\), \(\textit{states}(A)\) is a subset of \(\textit{states}\), \(\textit{start}(A)\) is a subset of \(\textit{states}(A)\), and \(\textit{trans}(A)\) is a transition relation over state-action-state triples. Specifically, \(\textit{trans}(A)\) is a relation on \[\textit{states}(A) \times (\textit{inp}(A) \cup \textit{int}(A) \cup \textit{out}(A)) \times \textit{states}(A).\] We call individual triples \((s,a,s')\) in \(\textit{trans}(A)\) steps of \(A\). Like Lynch, we require I/O automata to be input-enabled. That is, we require that for every state \(s\) and action \(\alpha \in \textit{inp}(A)\) there exists a state \(s'\) such that \((s, \alpha, s') \in \textit{trans}(A)\).
Example 1. We define an I/O automaton called the simple universal buffer (\(\textit{SUB}\)). Intuitively, \(\textit{SUB}\) represents a set of integers that other automata will be able to add to or remove from. Formally, for a non-empty index set \(R\), let \(\textit{inp}(\textit{SUB})\) be \(\{\textit{put(v)}_{i, j} \; \vert \; v \in \mathbb{Z};\, i, j \in R\}\); \(\textit{int}(\textit{SUB})\) be \(\emptyset\); \(\textit{out}(\textit{SUB})\) be \(\{\textit{get(v)}_{i, j} \; \vert \; v \in \mathbb{Z};\, i, j \in R\}\); \(\textit{states}(\textit{SUB})\) be the set of all sets of integers unioned with a special error state, i.e., \(2^{\mathbb{Z}} \cup \{\bot\}\); \(\textit{start}(\textit{SUB})\) be the singleton set containing the empty set; and \(\textit{trans}(\textit{SUB})\) be a relation such that \(\textit{put(v)}_{i, j}\) actions add the integer \(v\) to the state, and, if \(v\) is in the set, \(\textit{get(v)}_{i, j}\) actions remove the integer \(v\) from the state. If \(v\) is not in the set, then \(\textit{get(v)}_{i, j}\) actions move \(\textit{SUB}\) to the error state. Once in the error state, no actions change the state. Figure 2 (a) illustrates the interface of \(\textit{SUB}\), where input (output) actions are arrows toward (away from) the automaton.
We often define transition relations piecemeal through precondition-effect pairs.
Definition 5. Let \(A\) be an I/O automaton. A precondition-effect pair \((p, e)\) of \(A\) consists of a predicate \(p(s, \alpha)\) over an input state and an input action, and a function \(e(s, \alpha)\) that takes in an input state and an input action and returns a new state. Given a precondition-effect pair \((p, e)\), if \(p(s, \alpha)\) is true and \(e(s, \alpha) = s'\), then we say that \((s, \alpha, s') \in \textit{trans}(A)\). As a matter of convention, when the I/O automaton in question has an error state (like \(\bot\) for \(\textit{SUB}\)), actions that do not satisfy any precondition move the state of the I/O automaton to that error state. This convention makes it easier to define input-enabled I/O automata in terms of precondition-effect pairs. When non-determinism is needed, we extend \(p\) and \(e\) to take extra, non-deterministic arguments. However, to simplify the presentation, we elide non-deterministic arguments for the remainder of this chapter.
For example, the simple universal buffer (\(\textit{SUB}\)) has a precondition-effect pair with precondition \(p(s, \textit{get(v)}_{i, j}) \triangleq v \in s\) and effect \(e(s, \textit{get(v)}_{i, j}) \triangleq s \smallsetminus \{v\}\). This precondition asserts that \(v\) must be in the set (we cannot get an element which is not present), and this effect removes \(v\) from the set \(s\).
Definition 6. When I/O automata are compatible, they can be composed to form a new I/O automaton. A finite, non-empty set of I/O automata \(M\) is compatible if their internal actions and output actions are pairwise disjoint, respectively. That is, the automata in \(M\) are compatible if \(\forall A, B \in M\), \(A \neq B\) implies that \(\textit{int}(A) \cap \textit{int}(B) = \emptyset\) and \(\textit{out}(A) \cap \textit{out}(B) = \emptyset\).
The composition of machines \(M\) for a finite, non-empty index set \(R\) and enumeration \(\mu\), denoted \(M \cdot\), is a new I/O automaton where \(\textit{inp}(M \cdot) = \bigcup_{A \in M} \textit{inp}(A) \smallsetminus \bigcup_{A \in M} \textit{out}(A)\); \(\textit{int}(M \cdot) = \bigcup_{A \in M} \textit{int}(A)\); \(\textit{out}(M \cdot) = \bigcup_{A \in M} \textit{out}(A)\); \(\textit{states}(M \cdot)\) is the set of all arrays \(s\) such that for every \(i \in R\), \(s[i]\) is a member of \(\textit{states}(\mu(i))\); \(\textit{start}(M \cdot)\) is the set of all arrays \(s\) such that for every \(i \in R\), \(s[i]\) is a member of \(\textit{start}(\mu(i))\); and \(\textit{trans}(M \cdot)\) is the set of triples \((s, \alpha, s')\) such that for every \(i \in R\), if \(\alpha\) is in the actions of \(\mu(i)\) then \((s[i], \alpha, s'[i]) \in \textit{trans}(\mu(i))\) and otherwise \(s'[i] = s[i]\). Note that our definition of composition is associative.
Figure 2: Visualizations of I/O automata for examples 1 and 2. Irrelevant actions omitted.
Example 2. We define a class of I/O automata called simple machines (\(\textit{SM}_i\)) and then compose an instance of the class with \(\textit{SUB}\) from Example 1. Intuitively, \(\textit{SM}_i\) automata receive integers from \(\textit{SUB}\), keep track of the most recent integer received, and send integers to \(\textit{SUB}\) that are greater than the most recent integer received. Formally, for a non-empty index set \(R\), let \(\textit{inp}(\textit{SM}_i)\) be \(\{\textit{get(v)}_{i, j} \; \vert \; v \in \mathbb{Z};\, j \in R\}\); \(\textit{int}(\textit{SM}_i)\) be \(\emptyset\); \(\textit{out}(\textit{SM}_i)\) be \(\{\textit{put(v)}_{i, j} \; \vert \; v \in \mathbb{Z};\, j \in R\}\); \(\textit{states}(\textit{SM}_i)\) be the integers unioned with an error state, i.e., \(\mathbb{Z} \cup \{\bot\}\); \(\textit{start}(\textit{SM}_i)\) be the singleton set containing the integer zero; and \(\textit{trans}(\textit{SM}_i)\) be a relation such that \(\textit{get(v)}_{i, j}\) sets the state to \(v\), and \(\textit{put(v)}_{i, j}\) can occur if \(v\) is greater than the current state. Figure 2 (b) illustrates the interface of \(\textit{SM}_i\). The automaton \(\textit{SM}_1\) is compatible with \(\textit{SUB}\), and so \(\textit{SM}_1 \cdot \textit{SUB}\) is also an I/O automaton. Figure 2 (c) illustrates this composition with irrelevant actions omitted. Note that \(\textit{SM}_1 \cdot \textit{SUB}\) is compatible with an instance \(\textit{SM}_j\) iff \(j \neq 1\). Therefore, we could further compose simple machine instances.
Definition 7. Let \(A\) be an I/O automaton. An execution of \(A\) is a finite sequence of alternating states and actions \(s_0,\alpha_1,s_1, ...,\alpha_k, s_k\) such that \(s_0\) belongs to \(\textit{start}(A)\), every \(s_i\) belongs to \(\textit{states}(A)\), and every triple \((s_i,\alpha_{i+1},s_{i+1})\) belongs to \(\textit{trans}(A)\). We use \(\textit{exec}(A)\) to refer to the set of executions of \(A\). We say that a state \(s\) is reachable by \(A\) if there exists an execution \(s_0,\alpha_1,s_1, ...,\alpha_k, s_k \in \textit{exec}(A)\) with \(s_k = s\).
For example, the sequence \(\emptyset,\textit{put(7)}_{i, j},\{7\},\textit{get(7)}_{i, j},\emptyset\) is an execution of \(\textit{SUB}\) from Example 1. On the other hand, the sequence \(\emptyset,\textit{get(7)}_{i, j},\emptyset\) is not an execution of \(\textit{SUB}\).
Definition 8. An invariant assertion (invariant) is a predicate \(p\) over states. We say that an invariant holds of an I/O automaton \(A\) if for all reachable states \(s\), \(p(s)\) holds. A counterexample is a reachable state \(s\) such that \(p(s)\) does not hold. An inductive invariant is an invariant \(p\) such that for every step \((s, \alpha, s') \in \textit{trans}(A)\) if \(p(s)\) holds then \(p(s')\) holds. A counterexample to induction is a step \((s, \alpha, s') \in \textit{trans}(A)\) such that \(p(s)\) holds but \(p(s')\) does not. A proof by induction for an invariant \(p\) is a proof that \(p\) is inductive and that \(p(s)\) holds for every state \(s \in \textit{start}(A)\). A verification is a successful proof by induction.
For example, consider \(\textit{SM}_1 \cdot \textit{SUB}\) from Example 2 and the invariant assertion which states that every element in the set maintained by the \(\textit{SUB}\) component is greater than zero. That is, the invariant is \(p([ s_{\textit{SM}_1}, s_{\textit{SUB}} ]) \triangleq \forall_{i \in s_{\textit{SUB}}} \, i > 0\). A counterexample to induction for this I/O automaton and invariant is the step \(([ -1, \{\} ], \textit{put}_0, [ -1, \{0\} ])\). Note that the pre-state \([ -1, \{\} ]\) is not reachable and that the invariant does indeed always hold. Nevertheless, the proof by induction fails.
Our encoding of I/O automata into mar is straightforward: each element of the six-tuple defining an I/O automaton maps to a mar construct. Specifically, for an I/O automaton \(A\), we encode the set of actions, \(\textit{actions}\), as a sort \(a\), and we encode \(\textit{inp}(A)\), \(\textit{int}(A)\), and \(\textit{out}(A)\) as predicates on \(a\). We encode the state space, \(\textit{states}\), as a sort \(s\), and we encode \(\textit{states}(A)\) and \(\textit{start}(A)\) as predicates on \(s\). Finally, we encode \(\textit{trans}(A)\) as a predicate on triples \((s, a, s)\).
Precondition-effect pairs map to mar functions with annotations. Invariants are preconditions and postconditions on all functions.
For example, we can encode a version of SUB from Example 1 into mar as follows, where \(\textit{states}(\textit{SUB})\) is redefined to be the set of all sets containing at most one natural number unioned with a special error state.
data natural = Zero () | Successor (n: natural)
data action = Put (p: natural) | Get (g: natural)
data state = Empty () | Full (current: natural) | Error ()
let start (s: state) -> bool =
match s with
| Empty() -> true
| Full(_) -> false
| Error() -> false
let trans (s: state, a: action, s': state) -> bool =
match a with
| Put(p) ->
(match s with
| Empty() -> s' == Full(p)
| Full(_) -> s' == Full(p)
| Error() -> s' == Error())
| Get(g) ->
(match s with
| Empty() -> s' == Error()
| Full(current) ->
(if current == g then
s' == Empty()
else
s' == Error())
| Error() -> s' == Error())
let s1: state
let s2: state
let s3: state
assert start(s1)
assert trans(s1, Put(Zero()), s2)
assert trans(s2, Get(Successor(Zero())), s3)
print(s1) // will print "Empty()"
print(s2) // will print "Full(Zero())"
print(s3) // will print "Error()"
start ensures start-means-empty: start(s) ==> s is Empty
trans requires not-error: trans(s, a, s') ==> not s is Error
trans ensures correct-action: trans(s, a, s') and
(a is Put) ==> s' == Full(a.p)
trans ensures correct-get: trans(s, a, s') and
(a is Get) and
(s is Full) and
(a.g != s.current) ==> s' is Error
The first block of code (lines 1-3) declares a data type to represent natural numbers, natural, the set of actions, action, and the set of states, state, which restricts the set of states to be all sets of size at most one containing natural numbers unioned with a special error state. The next block of code (lines 5-25) defines the start predicate, a function called start, and the transition relation, a function called trans. These first two blocks of code correspond closely to the description in Example 1. The next block of code (lines 27-32) declares three uninterpreted constants of type state, called s1, s2, and s3, and then uses assertions to encode that s1 is a valid start state, s2 is a valid step from s1, and s3 is a valid step from s2. The print statements on lines 34-36, along with the comments displaying their results, show an execution of the SUB machine. Finally, the last block of code (lines 38-45) provides contracts for start and trans that are both correct in the contract semantics and sufficient to ensure a valid execution of the SUB machine in the composed semantics.
Examples 1 and 2, and our work in Chapter 5, use aspects of many-sorted first-order logic that go beyond what mar supports. Specifically, in Chapter 5 we will use uninterpreted sorts, richer theories (like arrays to represent sets), theory combinations, recursive functions, quantifiers, and parametric data types. Barrett et al. (2021) provide detailed background on these standard extensions. We call mar with these additions “extended mar.”
In this chapter, we describe the arrow from Figure 1(E) to Figure 1(F): a solver called Algaroba that can execute mar programs. Specifically, Algaroba handles quantifier-free SMT queries with algebraic data types. As we will see in Chapter 5, algebraic data types naturally capture programming data structures and appear extensively in the queries generated by distributed systems verification projects (e.g., Zhang et al. (2024)). For these tools to perform well, we need Algaroba to be as fast as possible. Fortunately, Algaroba is one of the fastest solvers available: it won the quantifier-free algebraic data type track at the 2024 SMT competition (SMT-COMP) (Bromberger, Bobot, and Jonáš 2024). The technical contents of this chapter first appeared in Shah, Mora, and Seshia (2024).
Automated reasoning about algebraic data types is important beyond distributed systems because this construct appears in many different software applications. As the popularity of algebraic data types grows, the demand for efficient SMT solvers will continue to increase. Unfortunately, the state-of-the-art tools in this space are already struggling to keep up. We demonstrate this empirically by generating a new benchmark set and showing that existing solvers, working together, are only able to solve 56.2% of the new queries in under 20 minutes per query (Chapter 3.5.2).
This imbalance between programming languages and SMT solvers is manifest as a gap in the SMT-solving literature. Oppen (1978) was the first to provide a decision procedure for the quantifier-free theory of algebraic data types, but algebraic data types do not seem to have permeated the community much further. In 2003, a concerted effort to unify the SMT community began with the first official input standard and competition, called SMT-LIB and SMT-COMP (Barrett et al. 2011), respectively. Algebraic data types were not officially integrated into the standard until 2017, as part of version 2.6 (Barrett, Fontaine, and Tinelli 2017). In the 2023 iteration of SMT-COMP (the iteration before the work we describe in this chapter), only two solvers participated in the algebraic data type track, the fewest of any track. Both solvers use a variation of the same solving approach: a lazy SMT architecture combined with theory-specific reasoning based on the work by Oppen from 1978 (see Chapter 3.3).
In this chapter, we propose a new solving technique that departs from the standard approach. Instead of a lazy approach, we take an eager approach (Barrett et al. 2021) that translates the original SMT formula into an equi-satisfiable formula without algebraic data types. Our work fills the gap in the literature on SMT solving for algebraic data type queries and, by doing so, solves more queries than existing solvers (see Chapter 3.5.1). More importantly, we make the largest empirical contribution to the solving community on SMT-COMP benchmarks, solving different queries than existing tools (see Chapter 3.5.2).
The rest of this chapter is organized as follows. In Chapter 3.2, we describe algebraic data types, satisfiability, and our approach through an example planning problem called blocks world. In Chapter 3.3, we survey related work. In Chapter 3.4, we describe our eager reduction from algebraic data type queries to queries containing only uninterpreted functions (UF). Chapter 3.4 includes a proof of soundness and completeness, along with a complexity analysis. In Chapter 3.5, we describe a prototype implementation of our approach, called Algaroba, and we evaluate it over two research questions. We find that Algaroba outperforms the state of the art overall and in terms of contribution to the solving community. Chapter 3.5 also describes a new benchmark set consisting of blocks world queries. This set addresses an important gap in the existing literature: the queries in this set contain all important kinds of algebraic data types but are not easy to solve. Overall, we make the following contributions.
We define a notion of query depth and a finite, quantifier-free reduction from algebraic data type queries to UF queries that uses depths. This is a new eager solving approach.
We prove the soundness and completeness of our approach and show that it generates at most a finite number of assertions.
We generate a new benchmark set that contains all kinds of algebraic data types and is not trivial to solve. Existing benchmarks do not enjoy both these properties.
We implement our reduction approach in a prototype tool called Algaroba and we compare its performance to the state of the art. Algaroba outperforms existing tools and makes the largest empirical contribution to the community of solvers.
To better understand algebraic data types, satisfiability queries, our approach, and the benchmark set that we generate in Chapter 3.5, consider the classic blocks world problem first proposed by Winograd (1971). We use a version of the problem that Sussman (1973) used to illustrate the Sussman anomaly (Russell 2010) and that Gupta and Nau (1992) used to show that the associated decision problem is NP-Hard.
In the simplified version of the blocks world problem, there is a table with (possibly empty) stacks of blocks on it (an initial configuration), a desired way that the stacks should be arranged (a target configuration), and three rules:
The general problem is to find a sequence of legal moves that leads from the initial configuration to the target configuration. The associated decision problem is to determine if there is such a sequence of length less than some input \(k\).
Figure 3: Solution (Figure 3 (b), Figure 3 (c), Figure 3 (d)) to a simple blocks world puzzle. Figure 3 (a) is the initial configuration; Figure 3 (e) is the target configuration.
Figure 3 shows an example blocks world solution. The initial configuration is given in Figure 3 (a), the target configuration in Figure 3 (e), and Figure 3 (b), Figure 3 (c), and Figure 3 (d) show a sequence of three legal moves that solve the problem, where faded blocks denote the previous position of the block moved at that step.
The blocks world problem is a useful illustrative example for an algebraic data type solver because the encoding uses sum, product, and inductive algebraic data types. The following mar code gives the required type definitions for the example in Figure 3.
data block = A() | B()
data tower =
| Empty ()
| Stack (top: block, rest: tower)
data config =
| table (left: tower, center: tower, right: tower)
Specifically, this code defines an enumerated type for blocks (block at line 1), a record type for table configurations (config at lines 5-6), and an inductive type for stacks (tower at lines 2-4). Terms of an enumerated type can take on any of the values listed in the type definition. For example, terms of type block can take on the values A or B. Terms of a record type can take on any combination of values of the type arguments listed in the type definition. For example, terms of type config can take on a triple of any three tower values. Enumerated types are the simplest form of a sum type, while records are the simplest form of a product type. Algebraic data types allow for definitions that are both sum and product types. For example, terms of type tower can either be Empty or they can be a Stack but not both (sum). When these terms take on a Stack value, they are a pair of block and tower values (product). Notice that the definition of tower depends on itself; this makes it an inductive type.
The blocks world problem is a useful illustrative example for satisfiability queries for two reasons. First, satisfiability-based solutions for similar planning problems have been around for decades (Kautz and Selman 1992). Second, encoding the problem as a satisfiability problem is simple when using bounded model checking (Biere et al. 1999) for bounded-horizon planning (e.g., see Rintanen (2003)). Specifically, the bounded model checking-based encoding is given by a transition system and a specification. The transition system starts at the initial configuration, and, at each step, makes a non-deterministic choice of which legal move to make. The specification is that the transition system never reaches the target configuration. Given this transition system, specification, and a bound \(k\), bounded model checking will generate a satisfiability query whose solutions are assignments to the non-deterministic choices—concrete choices that get us to the target configuration from the initial configuration in less than \(k\) steps. SMT solvers, like our new solver, generate these solutions or prove that none exist. We use this encoding in Chapter 3.5 to generate a new benchmark set.
The blocks world problem also gives a useful intuition for our approach. While terms of inductive data types, like tower, can take on arbitrarily large values (e.g., a stack of a million A blocks), there is a bound on the size of relevant values. For the blocks world problem in Figure 3, it would never make sense to have a tower value of size greater than two. Such a bound exists for all quantifier-free algebraic data type queries; the challenge is that automatically inferring the bound and using it to speed up solving is non-trivial. In this chapter, we give an automated procedure for computing an over-approximation of this bound, and then we use this over-approximation to replace algebraic data type functions with uninterpreted functions along with a finite number of quantifier-free axioms.
You can think of the blocks world problem in terms of distributed systems: it shares all the characteristics of the systems we will work with in Chapter 5. Specifically, each table position is a node in the system. Each node keeps track of its own tower and can move its topmost block to another node by sending that other node a message. Assuming that the choice of target node is non-deterministic, solving the blocks world problem is exactly the task of verifying that the distributed system never reaches the target configuration. The subsequent code block is an encoding of the blocks world problem from Figure 3 as a mar program following the encoding of I/O Automata in Chapter 2.5.1. We use bounded model checking for three steps and omit the definition of trans and its contracts for brevity.
data block = A () | B ()
data tower =
| Empty ()
| Stack (top: block, rest: tower)
data config =
| table (left: tower, center: tower, right: tower)
data action =
| Left-to-Center ()
| Left-to-Right ()
| Center-to-Left ()
| Center-to-Right ()
| Right-to-Center ()
| Right-to-Left ()
let trans (before: config, a: action, after: config) -> bool
// ... OMITTED ...
let start: config = table(Stack(B(),Stack(A(), Empty())),
Empty(),
Empty())
let goal: config = table(Empty(),
Empty(),
Stack(B(),Stack(A(), Empty())))
let c0: config
let c1: config
let c2: config
let c3: config
let a0: action
let a1: action
let a2: action
assert(c0 == start)
assert(trans(c0, a0, c1))
assert(trans(c1, a1, c2))
assert(trans(c2, a2, c3))
assert(c3 == goal)
print(c0) // table(Stack(B, Stack(A, Empty)), Empty, Empty)
print(a0) // Left-to-Center
print(c1) // table(Stack(A, Empty), Stack(B, Empty), Empty)
print(a1) // Left-to-Right
print(c2) // table(Empty, Stack(B, Empty), Stack(A, Empty))
print(a2) // Center-to-Right
print(c3) // table(Empty, Empty, Stack(B, Stack(A, Empty)))
We propose a new SMT solver for the DT theory defined in Chapter 2.2. For a quantifier-free input formula p, our solver generates a quantifier-free formula q in UF and then calls an existing UF solver to get a final result. We cannot compute a quantifier-free reduction directly, since DT axioms have universal quantifiers. Instead, we only instantiate the DT axioms over terms that are relevant to the input query.
Listing 1: Pseudocode for the Reduce(p) algorithm.
When the universe of the input query is finite (e.g., it only contains enums), we instantiate the entire universe (see Chapter 3.4.1). Otherwise, we follow the procedure in Listing 1. This procedure transforms the input query to NNF, flattens the result, applies the rewrite rules in Listing 2, and adds the axioms in Listing 3. The depth of a query is the number of constants in the flattened NNF version of the query. In Listing 1, the depth is k. The depth is linear in the size of the input query because the NNF and flattening transformations introduce at most a linear number of constants.
Rules A and B from Listing 2 correspond to rewriting constructor and selector applications, respectively, so that they work well with other constructors, selectors, and testers. Rule B contains existential quantifiers, but these are handled through skolemization (replacing existentially bound variables by fresh constants). Axioms 1 and 2 from Listing 3 ensure that only one tester returns true for each term, and that this tester corresponds to a constant iff the term is a constant. Axiom 3 encodes the acyclicality constraint.
To better understand Axiom 3 and acyclicality, consider the following example query.
data block = A() | B()
data tower =
| Empty ()
| Stack (top: block, rest: tower)
let x: tower
let y: tower
assert (x is Stack) and
(y is Stack) and
(y == x.rest) and
(x == y.rest)
This query is UNSAT because any satisfying assignment would need to violate acyclicality. Therefore, to avoid spurious models, we must encode the acyclicality property in our reduction. The challenge is that we need to capture this seemingly infinite property using only a finite number of quantifier-free formulas.
Our key result is that we only need to enforce acyclicality for all \(l^{\text{th}}\) descendants with \(l < k\), where k is the number of algebraic data type constants in the flattened query. To see the intuition behind this, consider the following generalization of the previous example, where x1, …, xk-1, xk are of type tower.
assert (x1 is Stack) and ... and (xk is Stack) and
       (x2 == x1.rest) and (x3 == x2.rest) and ... and
       (xk == xk-1.rest) and (x1 == xk.rest)
This query asserts that there is a cycle of size k. Since our reduction asserts acyclicality for all \(l^{\text{th}}\) descendants with \(l < k\), we correctly return UNSAT on this query. Furthermore, it is impossible to have a query with a cycle of size more than k using k or fewer constants (in the flat query), so our encoding is sufficient (see Chapter 3.4.3 for a full proof).
Listing 2: Rewrites, where t is an algebraic data type term, f is a constructor for that algebraic data type, ai are well-typed arguments, and each si is the \(i^{\text{th}}\) selector matching that constructor. The left-hand side of the arrow is what exists before the rewrite; the right-hand side is what exists after the rewrite.
Listing 3: Axioms for every algebraic data type term t and s, where f1, …, fn are the constructors for the relevant algebraic data type.
If we recognize that an algebraic data type has a finite universe, we create constants for every term in the universe and instantiate the axioms over the entire, finite universe. This is a source of doubly exponential blowup, but algebraic data types with finite universes are rare and often small enough to prevent noticeable blowup.
In the finite universe case, we can have a doubly exponential blowup. One adversarial case is an algebraic data type that is records of records of enums:
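// Illustrative definitions; only the type names enum, rec1, and rec2 come from the text.
data enum = E1 () | E2 ()
data rec1 = MkRec1 (a: enum, b: enum)
data rec2 = MkRec2 (x: rec1, y: rec1)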
Here, enum has a universe of size two, rec1 has a universe of size four, and rec2 has a universe of size \(16\). This gives a doubly exponential blowup since we are creating fresh constants to represent every normal term of every data type.
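To make the growth explicit, each additional level of record nesting squares the size of the universe. For a hypothetical family of types \(\textit{rec}_d\), each built from two copies of \(\textit{rec}_{d-1}\) with \(\textit{rec}_0 = \textit{enum}\), we have \[|\textit{rec}_d| = |\textit{rec}_{d-1}|^2 = 2^{2^d},\] which is doubly exponential in the nesting depth \(d\); for example, \(|\textit{rec}_2| = 2^{2^2} = 16\), matching the count above.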
In the infinite universe case, we have at worst an exponential blowup in the size of the query. We know the depth k is at most linear in the size of the query; however, for a term x of the tree type definition below, the number of sequences of selector applications of length up to k is \(2^{k+1} - 2\), thus giving us an exponential blowup in the number of terms.
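// Illustrative definition; only the type name tree comes from the text.
data tree =
  | Leaf ()
  | Node (left: tree, right: tree)
With two selectors, there are \(2^l\) distinct selector sequences of length \(l\), and summing over \(l = 1, ..., k\) gives \(2 + 4 + ... + 2^k = 2^{k+1} - 2\).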
In this proof, when we refer to a rule or axiom, we mean those from Listing 2 and Listing 3, respectively. We assume p is flattened and in NNF (i.e., it is p2 in Listing 1). For simplicity, we also assume that p does not contain any uninterpreted functions or sorts, but the proof can be extended to include them.
Theorem 1 (Reduction Correctness). Listing 1 is a correct reduction from DT to UF. Specifically, when the algebraic data type universe is infinite, \((p, \text{DT})\) is SAT iff \((q, \text{UF})\) is SAT.
Proof (\(\rightarrow\)). If \((p, \text{DT})\) is SAT, then there is a model \(\mathcal{M}\models p \cup \text{DT}\). The modification from p to q involves either replacing a theory literal, \(p_i\), with a conjunction of theory literals, \(q_{i,1} \wedge ... \wedge q_{i,n}\), or adding a set of theory literals, \(q_1 \wedge ... \wedge q_m\), to the end of the formula. In both cases, all added theory literals are redundant information in the DT theory (see Definition 4, Listing 2, and Listing 3). That is, \(p_i \cup \text{DT} \models q_{i,1} \wedge ... \wedge q_{i,n}\) and \(\text{DT} \models q_1 \wedge ... \wedge q_m\). Therefore, \(\mathcal{M} \models q \cup \text{DT}\). Since every model in DT is a model in UF, it is also true that \(\mathcal{M} \models q \cup \text{UF}\) and that \((q, \text{UF})\) is SAT.
Proof (\(\leftarrow\)). If \((q, \text{UF})\) is SAT, then there is a model \(\mathcal{M}\models q \cup \text{UF}\). To show that this implies that \((p, \text{DT})\) is SAT, we will use \(\mathcal{M}\) to create a model \(\mathcal{N}\) such that \(\mathcal{N}\models p \cup \text{DT}\).
Without loss of generality, we assume that q is a conjunction of theory literals, \(q_1 \wedge ... \wedge q_n\). You can think of this as the satisfying assignment to the propositional structure of q given by \(\mathcal{M}\). Every theory literal in q is a constructor equality, \(f(x_1, ..., x_m) = x_0\); a selector equality, \(f^j(x_0) = x_1\); a tester application, \(\textit{is-f}(x_0)\); or an equality, \(x_0 = x_1\). Tester applications and equalities can also appear negated. Every \(x_i\) is a constant due to the flattening of q. We say that a constant \(x_0\) is “constructed” in q when it is on the right-hand side of a constructor equality; we say that a constant \(x_0\) is “selected from” in q when it is the constant on the left-hand side of a selector equality.
We begin by setting the universe of \(\mathcal{N}\) to be all algebraic data type normal terms and setting the interpretations of tester, constructor, and selector symbols to the expected values over this universe. So if \(f\), \(g\), and \(h\) are two-, one-, and zero-ary constructors, then the normal term \(f(g(h), h)\) is in the universe of \(\mathcal{N}\) and \(\mathcal{N}[f(g(h), h)] = f(g(h), h)\). Similarly, \(\mathcal{N}[f^2(f(g(h), h))] = h\). In short, \(\mathcal{N} \models \text{DT}\). In the next two paragraphs, we describe an algorithm that determines the value of \(\mathcal{N}\) for every constant that appears in q. Finally, we argue that \(\mathcal{N}\models p \cup \text{DT}\).
For each \(x \in V\), where \(V\) is the set of constants that appear in q, by axiom 1, there is exactly one tester such that \(\mathcal{M} \models \textit{is-f}(x)\). We construct \(\mathcal{N}\) for each \(x\) based on the corresponding \(f\). There are two “base cases” for our construction of \(\mathcal{N}\). First, if \(f\) is some constant constructor, by axiom 2 of the reduction, we know that \(\mathcal{M}[x] = \mathcal{M}[f]\), so we set \(\mathcal{N}[x] \triangleq f\). Note that \(f\) is an algebraic data type normal term, so it exists in the universe of \(\mathcal{N}\). Second, if \(x\) is some constant that was never constructed or selected from (either directly or transitively), then we set \(\mathcal{N}[x]\) to any algebraic data type normal term that satisfies the tester applications and equalities. We can always find such a normal term because there are infinitely many algebraic data type normal terms, our construction of \(\mathcal{N}\) is finite, and the existence of \(\mathcal{M}\) guarantees there is no contradiction in the equations.
If we are not in one of the two base cases, we know that \(f\) is an \(m\)-ary constructor for some \(m > 0\) and that \(x\) was either constructed or selected from in q. There are four “inductive cases” for our construction of \(\mathcal{N}\). First, if \(x\) was set equal to a constructor application, by rewrite A, there are constants \(x_1, ..., x_m \in V\) such that \(\mathcal{M}\models f^1(x) = x_1 \wedge ... \wedge f^m(x) = x_m\). We set \(\mathcal{N}[x] \triangleq f(\mathcal{N}[x_1], ..., \mathcal{N}[x_m])\). Note that \(f\) applied to algebraic data type normal terms results in an algebraic data type normal term, so \(f(\mathcal{N}[x_1], ..., \mathcal{N}[x_m])\) is in the universe of \(\mathcal{N}\). Second, if \(x\) is correctly selected from (the antecedent of the implication in rewrite rule B evaluates to true in \(\mathcal{M}\)), by rewrite B, there are constants \(x_1, ..., x_m \in V\) such that \(\mathcal{M}\models f^1(x) = x_1 \wedge ... \wedge f^m(x) = x_m\) and we again set \(\mathcal{N}[x] \triangleq f(\mathcal{N}[x_1], ..., \mathcal{N}[x_m])\). Third, if \(x\) is incorrectly selected from (the antecedent of the implication in rewrite rule B evaluates to false in \(\mathcal{M}\)), then we pick an appropriate normal term, like in the second “base case” above. Fourth, if multiple of the above cases are triggered, we make sure that the assignments are consistent. For example, if the first two cases are triggered, then we ensure that the existentially bound variables (that are later skolemized) introduced by rule B are equal to the corresponding argument constants in rule A.
We claim that \(\mathcal{N} \models q \cup UF\). If this is true, then together with the fact that \(\mathcal{N} \models DT\), we have that \(\mathcal{N} \models q \cup DT\). By the same reasoning as the proof in the forward direction (\(\rightarrow\)), we can then conclude that \(\mathcal{N} \models p \cup DT\), as desired. The remainder of the proof demonstrates that \(\mathcal{N} \models q \cup UF\).
Suppose for contradiction that \(\mathcal{N} \nvDash q \cup UF\). Then there is some theory literal \(q_i\) in q such that \(\mathcal{N}[q_i]\) is false. Consider the case that this theory literal is a constructor equality, \(f(x_1, ..., x_m) = x_0\). But \[
\begin{aligned}
\mathcal{N}[f(x_1, ..., x_m) &= x_0] \iff
\mathcal{N}[f](\mathcal{N}[x_1], ..., \mathcal{N}[x_m]) =
\mathcal{N}[x_0] \\
&\iff f(\mathcal{N}[x_1], ..., \mathcal{N}[x_m]) = \mathcal{N}[x_0]
\\
&\iff f(\mathcal{N}[x_1], ..., \mathcal{N}[x_m]) =
f(\mathcal{N}[x_1], ..., \mathcal{N}[x_m]) \\
&\iff \text{true},
\end{aligned}
\] which is a contradiction.
Consider the case that this theory literal is a selector equality, \(f^j(x_0) = x_j\), and that the selector is correctly applied. That is, \(\textit{is-f}(x_0)\) is also a theory literal in q. But \[
\begin{aligned}
\mathcal{N}[f^j(x_0) = x_j] &\iff \mathcal{N}[f^j](\mathcal{N}[x_0])
= \mathcal{N}[x_j] \\
&\iff f^j(\mathcal{N}[x_0]) = \mathcal{N}[x_j] \\
&\iff f^j(f(\mathcal{N}[x_1], ..., \mathcal{N}[x_j], ...,
\mathcal{N}[x_m])) = \mathcal{N}[x_j] \\
&\iff \mathcal{N}[x_j] = \mathcal{N}[x_j] \\
&\iff \text{true},
\end{aligned}
\] which is a contradiction.
Consider the case that this theory literal is a tester application \(\textit{is-f}(x)\). But, for every \(x\), \(\mathcal{N}\) assigns an algebraic data type normal term with constructor \(f\) iff \(\mathcal{M}[\textit{is-f}(x)]\) evaluates to true. Together with the fact that testers in \(\mathcal{N}\) are interpreted as expected, this case also leads to an immediate contradiction.
The rest of the cases come down to equality. You can think of the appeals to \(\mathcal{N}[x_i]\) in the inductive steps of our construction as recursive calls. Since there are \(k\) constants in q and q is flat, the longest possible recursive call chain in our construction has length \(k\). Since q ensures acyclicality up to a bound of \(k\) (axiom 3), \(\mathcal{M}\) is acyclic over the set of values it assigns to constants in q. This means that, over the constants in q, the equalities in \(\mathcal{M}\) are exactly the same equalities as in \(\mathcal{N}\). Thus, \(\mathcal{N} \models q \cup UF\) and so \(\mathcal{N}\models p \cup \text{DT}\), giving that \((p, \text{DT})\) is SAT.
In this section we empirically compare the performance of our approach to state-of-the-art solvers. Specifically, we aim to answer the following research questions. RQ1: How does the overall performance of Algaroba compare to the state of the art? RQ2: How complementary is Algaroba to existing solvers?
We implement a prototype of our approach, called Algaroba, in approximately 2900 lines of OCaml code. We use the Z3 API as the default UF back-end solver but we allow for any UF solver to be used instead. Algaroba takes inputs in the SMT-LIB language and includes a number of simple optimizations, like hash-consing (Ershov 1958), incremental solving, and theory-specific query simplifications. All experiments are conducted on an Ubuntu workstation with nine Intel(R) Core(TM) i9-9900X CPUs running at 3.50 GHz and with 62 GB of RAM. All solvers were given a 1200 second timeout on each query to be consistent with SMT-COMP. The state-of-the-art solvers in this space are cvc5 (we use version 1.0.6-dev.214.97a64fc16) and Z3 (we use version 4.12.2). We also include Princess (latest release as of 2023-06-19) in our evaluation since it is the most related approach. We describe all three solvers in Chapter 3.3.
Figure 4: Number of queries solved \(y\) in less than \(x\) seconds for Bouvier (a) and blocks world (b) benchmark sets using a 1200s timeout. Higher (more queries solved) left (in less time) points are better. The legend lists the contribution rank and percentage of queries solved for each solver. Algaroba solves the most queries and achieves the highest contribution rank for both sets.
Our evaluation covers two existing benchmark sets from SMT-COMP, one originally from Bouvier (2021) and one originally from Barrett, Shikanian, and Tinelli (2006) (BST for short). These two benchmark sets are useful but limited: every solver succeeds on every BST query so it is difficult to draw performance conclusions; Bouvier queries are more challenging but only contain sum types.
To address these limitations, we introduce a new benchmark set consisting of randomly generated blocks world queries. Blocks world queries, which we describe in Chapter 3.2, are more challenging to solve than those in BST and, unlike those from Bouvier, contain sum, product, and inductive types. To generate blocks world queries we use the same table configuration as in Chapter 3.2 (three places for towers), but we randomly select a set of blocks (ranging from two to 26) and we randomly generate an initial and target configuration (two sets of three random block towers). We call these three random samples a blocks world setup. For each blocks world setup, we randomly sample a set of step numbers (ranging from one to two times the number of blocks) and generate a blocks world query for each step number. This process resulted in 500 individual queries that each ask “can we get from this initial configuration to this target configuration in exactly this number of steps?”
To answer our first research question, we time the execution of Algaroba, cvc5, Princess, and Z3 on all queries in all three benchmark sets. When more than one solver terminates on a given query we compare the results to check for disagreements. There was not a single case where one solver returned sat and another returned unsat; therefore, we focus the remainder of our evaluation on execution times.
For the BST benchmark set, which consists of 8000 queries, every solver successfully terminates on every query within the timeout. cvc5 performs the best on average with an average solve time of 0.05 seconds (compared to 0.08 seconds for Algaroba and 0.10 seconds for Z3). Z3 performed the most consistently with a standard deviation of 0.05 seconds (compared to 0.10 seconds for Algaroba and 0.15 seconds for cvc5). Given the magnitude of these values, we conclude that the performance differences between Algaroba, cvc5, and Z3 are negligible on this set. Princess is the slowest (2.20 seconds on average) and least consistent (standard deviation of 1.69 seconds) but still effective.
Results are more interesting for the remaining benchmark sets. Figure 4 (a) shows the execution times for every solver on every query in the Bouvier benchmark set (excluding timeouts). No solver succeeds on more than half the queries in the set but Algaroba clearly outperforms the rest. In terms of number of queries solved, Algaroba succeeds on 8.5 percentage points more queries than the next best (45.75% versus cvc5 and Z3’s 37.25%). In terms of average run time on successful queries, Algaroba is 2.30 times faster than the next best (190.11 seconds versus Z3’s 437.18 seconds). In terms of standard deviation on successful queries, Algaroba is 1.30 times more consistent than the next best (267.13 seconds versus cvc5’s 346.83 seconds).
Figure 4 (b) shows the execution times for every solver on every query in the blocks world benchmark set (excluding timeouts). Again, Algaroba outperforms the state-of-the-art solvers. Algaroba solves 5.2 percentage points more queries than the next best (61.4% versus Z3’s 56.2%) but the average and standard deviation results are more complicated. Princess is the fastest and most consistent solver on solved queries (31.89 seconds and 74.77 seconds, respectively) but it succeeds on 44.6 percentage points fewer queries than Algaroba. Compared with Z3, which solves the second most number of queries, Algaroba is 2.00 times faster (87.34 seconds on average compared to Z3’s 175.05 seconds) and 1.51 times more consistent in terms of standard deviation (198.67 seconds versus Z3’s 300.15 seconds).
Across both interesting benchmarks, Algaroba solves the most sat queries (167 versus cvc5’s 70), and the second most unsat queries (322 versus Z3’s 406). The average (median) increase in query size from our reduction was 259x (146x). Given the overall success of Algaroba on both interesting benchmark sets, we answer RQ1 by concluding that our performance compares favorably to the state of the art.
Measuring overall performance is useful but it does not give an accurate perspective on how the community uses these tools. When faced with an SMT query, practitioners are likely to use multiple different solvers. This could be in parallel, like Rungta (2022), or through algorithm selection, like Pimpalkhare et al. (2021). Contribution ranks capture this practical perspective by evaluating solvers in terms of how complementary they are to other solvers. A higher rank means a higher contribution to the community of solvers.
To evaluate how complementary our approach is to existing solvers, we use SMT-COMP’s contribution ranking. This ranking uses the notion of a virtual best solver, which is defined as \(vb(q, S) \triangleq s(q)\), where \(S\) is a set of solvers and \(s\) is the solver in \(S\) that terminates most quickly on \(q\). Informally, the ranking answers, “which solver can I remove from the virtual best solver to hurt performance the most?”
In terms of number of queries solved (the primary SMT-COMP metric), there is a four-way tie on the BST benchmark set—all solvers solve all queries. For both other benchmark sets, Algaroba is ranked highest. For blocks world, the virtual best solver without Algaroba succeeds on 56.2% of the queries, less than Algaroba on its own (61.4%). With Algaroba, the virtual best solver succeeds on 62.2%. For Bouvier, without Algaroba, the virtual best solver succeeds on 64.25% of the queries. With Algaroba, this number rises to 83.75%. These positive results are in part because Algaroba solves the most queries, but are mainly due to the uniqueness of our approach. cvc5 and Z3 use a similar underlying algorithm, so removing one does not affect the performance of the virtual best solver. On the other hand, while Princess is the most similar approach to our own, their reduction is different enough to not interfere with our ranking.
Figure 5 shows this same data as a stacked histogram: each colored bar represents the number of queries that are uniquely solved (y-axis) by each solver (x-axis) over the given benchmark set (color and hash-style). In short, we solve many queries that no other solver can (108/900). Given the winning contribution rank of Algaroba on both interesting benchmark sets, and the dominant number of uniquely solved queries, we answer RQ2 by concluding that our performance is complementary to existing solvers—we solve many benchmarks that no other solver can.
In this chapter, we described the arrow from Figure 1(E) to Figure 1(F): an SMT solver for the quantifier-free theory of algebraic data types called Algaroba. Algaroba takes in queries containing user-defined algebraic data types, transforms them to queries without algebraic data types, calls an existing core solver on the generated queries, and returns the result. We call this transformation a reduction and we say that the approach is eager because it introduces all of the necessary axioms for reasoning upfront as part of the reduction. Most existing solvers for this theory take a lazy approach: they introduce axioms iteratively.
The main theoretical challenge is to construct a terminating reduction and to prove its correctness. The main practical challenge is to outperform existing tools with years of engineering. We do both. Specifically, we present a new measure based on the depth of the input query; we define our reduction by bounding the introduction of axioms using this measure; we prove that this bound is sufficient; and we empirically demonstrate that the bound is not too loose by outperforming state-of-the-art SMT solvers on a collection of benchmarks from related work. Perhaps the most interesting result is that the unique algorithmic approach that we take (eager instead of lazy) leads to unique and excellent empirical performance. Our solver handles many queries that no other solver can, leading to the best contribution score in our evaluation.
Beyond showing that Algaroba solves many queries that no other solver can, Figure 5 also shows that no solver strictly dominates all others. Similarly, the virtual best solver data detailed in Chapter 3.5 suggests that it is better to use a set of solvers to answer a query than to stick to any one given solver for all queries. This phenomenon holds beyond the theory of algebraic data types. Another example of differentiation between solvers is in quantifier reasoning, where the number of different algorithms implemented is reflected in the wide spectrum of literature on the subject, e.g., Ge and de Moura (2009), Reynolds et al. (2015), Löding, Madhusudan, and Peña (2017), Kovács, Robillard, and Voronkov (2017), and Barth et al. (2019). In the same vein, solvers use very different techniques for different theories; for example, there are various techniques that can be used for bit-precise reasoning, e.g., Bryant et al. (2007), Niemetz, Preiner, and Biere (2014), Jha, Limaye, and Seshia (2009), Barrett et al. (2011), Jonás and Strejcek (2016), de Moura and Bjørner (2008), Hansen (2012), Dutertre (2014), Bruttomesso et al. (2008), and Niemetz and Preiner (2020).
SMT solvers are becoming more widely used across various applications including verification, automated software testing, and policy verification, e.g., Bjørner (2009), Backes et al. (2018), and Kroening and Strichman (2016), making them particularly useful to industry practitioners and non-SMT researchers. Given the performance differential between solvers, a key question for such practitioners wishing to apply SMT solving to a problem in a specific domain is “which solver should I use?”
In this chapter, we describe the arrow that leaves Figure 1(F): a new method for automatically coordinating solvers, per query, over a sequence of queries. This setting matches the sequences of queries generated by the verification engine that we will describe in Chapter 5 and many similar verification engines. We call our method MedleySolver and we implement a prototype that can be used out of the box. In short, we provide a simple answer to the question of what solver to use for your query: “Let MedleySolver choose for you!”
The technical contents of this chapter first appeared in Pimpalkhare et al. (2021). In Y. Li et al. (2025), we extended this work to large language model prompt strategy selection in program synthesis, showing significant improvements over the state of the art, but we do not cover those contributions in this dissertation.
MedleySolver frames the problem of choosing an SMT solver as a modified Multi-Armed Bandit (MAB) problem, a classic reinforcement learning formulation in which an agent must repeatedly pick from several different choices with unknown reward distributions, minimizing overall regret. This agent must trade-off between exploitation (choosing a solver that is already believed to be fast) and exploration (testing out other solvers). For a given SMT query, MedleySolver selects a sequence of solvers to run, running the solver it believes is most likely to solve the query first and the solver that is least likely last. MedleySolver also predicts the time it should spend running each solver before it should give up and move onto the next solver in the sequence.
We apply classic algorithm selection techniques from the domain of Multi-Armed Bandit problems to the order selection problem. In this work, we highlight Thompson Sampling and \(k\)-Nearest-Neighbor (\(k\)-NN) classification. We select these two as high-performing instances of a non-contextual and a contextual algorithm, respectively, but perform a more extensive comparison with a variety of other Multi-Armed Bandit algorithms for completeness. These algorithms traditionally predict one optimal action, but we use them to rank SMT solvers on a given query based on the behavior observed on previous queries and, in the case of the contextual bandit algorithms, a feature vector. This ranking allows the algorithm to explore and exploit in a single round.
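For intuition, a standard Bernoulli Thompson Sampling scheme (sketched here as an illustration; it is not necessarily the exact reward model our implementation uses) keeps a Beta posterior \(\mathrm{Beta}(\alpha_s, \beta_s)\) for each solver \(s\), samples \[\theta_s \sim \mathrm{Beta}(\alpha_s, \beta_s) \quad \text{for each } s \in S,\] ranks the solvers by their sampled \(\theta_s\) to obtain the sequence for the current query, and then, for each solver that was run, updates \(\alpha_s \leftarrow \alpha_s + 1\) if it solved the query within its allotted time and \(\beta_s \leftarrow \beta_s + 1\) otherwise. Sampling from the posterior, rather than taking its mean, is what balances exploration against exploitation.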
Non-contextual multi-armed bandit algorithms have been directly applied to selecting search heuristics and variable orderings for constraint satisfaction problems (Xia and Yap 2018; Wattez et al. 2020), and to implementing co-operative sharing of clauses in parallel SAT solving (Lazaar et al. 2012). Our use of a contextual multi-armed bandit framework to select a sequence of SMT solvers for a given SMT query is novel.
We pair these order selection algorithms with two runtime prediction algorithms. The first runtime prediction algorithm estimates the time each solver should be run by modeling its performance as exponential distributions with a parameter that is updated dynamically. The second fits a linear model using stochastic gradient descent (SGD). Runtime estimation helps reduce the cost of exploratory solver choices that do not produce rewards by stopping solvers when we are confident they will not finish before the overall timeout.
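As a sketch of the first idea (an assumed form for illustration, not necessarily the exact update used): if a solver's runtimes on a stream of queries are modeled as draws from an exponential distribution with rate \(\lambda\), then after observing runtimes \(t_1, ..., t_n\) the maximum-likelihood estimate of the rate and the estimated probability of finishing within a budget \(t\) are \[\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} t_i}, \qquad \Pr[T \le t] = 1 - e^{-\hat{\lambda} t},\] so a solver can be cut off once this probability, for the time it has left, drops below a chosen threshold.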
Our work is inspired by recent work showing machine learning techniques can be used to solve SMT queries faster. For example, FastSMT (Balunovic, Bielik, and Vechev 2018) is a tool that uses machine learning to find an optimal sequence of tactics, or query transformations, for SMT solvers to use on queries from a given domain. One issue with such approaches is that the complexity of the learning methods leads to training times grossly larger than the time spent solving. In this work, we achieve comparable performance boosts with no pre-training or additional burden on the SMT end-user. To meet this goal, we approached the problem of algorithm selection for SMT solvers in a dynamic, or “online,” manner. This ensures the cost of our training remains small, proportional, and justified by how SMT practitioners use SMT solvers. For example, techniques such as counterexample-guided inductive synthesis (CEGIS) (Solar-Lezama et al. 2006) produce long sequences of similar SMT queries that are not easy to obtain prior to solving for offline training. Overall, we make the following contributions.
An adaptation of standard regression techniques to predicting when a given solver will timeout on a query, and a novel approach for the same time allocation problem that models runtime as exponential distributions and estimates timeouts dynamically and with context.
A framing of the SMT solver selection problem as a Multi-Armed Bandit (MAB) problem combined with a timeout prediction scheme. Specifically, we extend the MAB problem to selecting sequences of solvers per query instead of a single solver and use the timeout prediction scheme to allocate time to each solver in the sequence. This interaction lets us use lightweight techniques for both problems that do not require pre-training while retaining comparable performance to pre-trained techniques.
An empirical evaluation on a set of \(2000\) benchmarks representing a typical user’s workload. Our approach solves \(1813\) queries on this set, \(128\) more than the next best solver, in roughly three fifths of the time and with no pre-training.
SMT users rarely aim to solve a single query in isolation and usually care about resource consumption. For example, verification engines generate many verification queries for one verification problem and aim to solve these queries in the least amount of time. As such, we define the practical SMT problem as that of taking a set of SMT queries \(Q = \{q_1, ..., q_m\},\) a set of SMT algorithms (usually solvers) \(S\), and producing a set of answers \(A = \{a_1, ..., a_m\},\) where each \(a_i\) corresponds to the matching \(q_i,\) while using the least computational resources.
Our approach to the practical SMT problem, MedleySolver, is a program that takes \(Q\), \(S\), and a timeout \(T\) per query, and aims to maximize the number of instances solved while minimizing the cumulative time spent. We decompose our approach into two parts: for each query \(q \in Q\), we predict a sequence of solvers to run and a time allocation for each solver in that sequence.
Given a query \(q_i\), MedleySolver generates a sequence of solvers \(\sigma\) and a sequence of time allocations \(t_1, ..., t_n\), so that the solver \(\sigma_1\) is run for \(t_1\) seconds and so on. If a solver in the sequence successfully solves the query, we do not run the rest of the solver sequence on that query and instead move on to the next query. For the remainder of this chapter, we use \(\sigma\) to denote a sequence of solvers and \(\sigma_i\) to denote the \(i^{th}\) solver in the sequence \(\sigma\). We process each query \(q_1, ..., q_m\) in order and our solver selection algorithms learn as we go, so the solver selection for \(q_m\) uses information from queries \(q_1, ..., q_{m-1}\). In practice, this historical information can be reset at any time; in our experimental results, we reset it when confronting a new set of queries (for instance, a new category of the SMT competition). An overview of our approach is shown in Figure 6. We describe the three main components of Figure 6 (the solver selector, the timeout predictor, and the featurization of queries) in detail in the next two sections.
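To make this workflow concrete, the following Python sketch shows one way the outer loop could be organized. It is illustrative only: the `selector` and `predictor` objects stand in for the solver selector and timeout predictor of Figure 6, their method names are our assumptions rather than MedleySolver's actual interface, and the later sketches in this chapter reuse the same assumed interface.

```python
import subprocess
import time

def run_solver(cmd, query_path, timeout):
    """Run one SMT solver binary on a query file.
    Returns (answer, elapsed_seconds); answer is None on timeout or no output."""
    start = time.time()
    try:
        out = subprocess.run(cmd + [query_path], capture_output=True,
                             text=True, timeout=timeout)
        lines = out.stdout.strip().splitlines()
        return (lines[0].strip() if lines else None), time.time() - start
    except subprocess.TimeoutExpired:
        return None, timeout

def medley_solve(queries, solvers, selector, predictor, T, extract_features):
    """Outer loop: rank solvers per query, split the per-query timeout T into
    per-solver budgets, run the solvers in order, and learn online from the
    observed outcomes.  `solvers` maps a solver name to its command line."""
    answers = {}
    for q in queries:
        v = extract_features(q)                    # feature vector for q
        order = selector.rank(v, list(solvers))    # sequence sigma of solver names
        budgets = predictor.allocate(v, order, T)  # time allocations t_1..t_n
        for name, t in zip(order, budgets):
            if t <= 0:
                continue
            ans, elapsed = run_solver(solvers[name], q, t)
            solved = ans in ("sat", "unsat")
            selector.update(v, name, solved)
            predictor.update(v, name, elapsed, solved)
            if solved:                             # stop early and move on
                answers[q] = ans
                break
    return answers
```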
In the Multi-armed Bandit (MAB) problem (Cesa-Bianchi and Lugosi 2006) an agent sequentially selects between choices with unknown associated reward distributions, aiming to maximize the reward achieved over time. The agent must trade off exploration (trying new actions and learning about them) and exploitation (deploying actions we know have the potential for high reward).
We frame the solver selection problem as a MAB problem. The agent is selecting the solver to use and the payout is based on successfully solving queries. We assume that running a solver for a randomly selected SMT query is equivalent to sampling from some unknown distribution that we seek to approximate. Contextual MABs extend the problem by giving agents access to a feature vector before each round. This allows us to add information about the characteristics of the SMT query we are trying to solve in each round, as described in Chapter 4.3.1.2.
We modify the MAB problem in one key way: we select a sequence of solvers to run (with corresponding time allocations that we consider later), instead of selecting a single solver. This contribution allows solver selection to perform exploration on each query until it observes a reward, has tried all solvers, or reaches the time-out per query. We also use a time-prediction algorithm, described in Chapter 4.3.2, which predicts how long it is worth running a solver on a given query, allowing us to perform “partial exploration” instead of committing to running a single solver until time-out or termination. Both of these extensions to the MAB algorithms allow the solver selection to correct incorrect solver choices, reducing the cost of exploration relative to exploitation. In the following subsections, we adapt one non-contextual algorithm and one contextual algorithm (i.e., an algorithm that uses the feature vector) from the literature to our domain. We choose the algorithms based on their popularity in the classic MAB literature and on the compatibility of each algorithm's assumptions with our domain (for instance, we omit algorithms such as LinUCB (L. Li et al. 2010), which assumes that the reward is linearly correlated with the feature vector). In Chapter 4.4 we evaluate our choices against other MAB algorithms for completeness.
We use a binary reward structure where a solver receives a reward of 1 if it is observed solving a query and 0 if it is observed failing to solve a query, which decouples solving time from rewards. We also explored an exponential reward structure where a solver receives a reward of \((1 - t/T)^4\) if it solves a query in time \(t\), but found the binary reward more effective which we believe is due to its ability to differentiate more clearly between benchmarks that are slow to solve and benchmarks that are not solved.
Thompson Sampling (S. Agrawal and Goyal 2012, 2013) uses Bayes’ rule to choose an action that maximizes the expected reward. Each round of the MAB in this context is picking a random query from the set of queries and trying to solve it with a specific solver \(\sigma_i\). To adapt non-contextual Thompson Sampling to our SMT solver selection problem, we model the outcome of an experiment with a Bernoulli distribution where the solver solves the query with a probability \(\theta_i\) and fails to solve it within the time-out with a probability \(1-\theta_i\). In Thompson Sampling, the agent does not know the value of each \(\theta_i\) but begins with some prior belief over each one. These priors are beta-distributed: the prior for \(\theta_i\) is \[p(\theta_i) = \frac{\Gamma(\alpha_i + \beta_i)}{\Gamma(\alpha_i)\Gamma(\beta_i)}\theta_i^{\alpha_i - 1}(1 - \theta_i)^{\beta_i - 1}.\] We initially take this distribution to be uniform: \(\alpha_i = \beta_i = 1\). That is, we assume a prior that, for a random query, each solver has a \(50\)% chance of solving the query within the timeout and a \(50\)% chance of failing to do so.
To select a solver to deploy, Thompson Sampling takes a sample from each distribution \(p(\theta_i)\) corresponding to a solver. Note that, because Thompson Sampling samples from the distributions \(p(\theta_1), \ldots, p(\theta_n)\) rather than simply returning the solvers in order of the \(p(\theta)\) with the highest mean, it is more likely to pick solvers that we are uncertain about, allowing exploration. Conventional Thompson Sampling returns the solver with the highest valued sample. Our algorithm returns a sequence of solvers in descending order of these sampled values, i.e., the solver whose sampled value is closest to 1 comes first.
After deploying the solvers and observing the results, the distributions over \(\theta_1, \ldots, \theta_n\) are updated according to Bayes’ rule. Each time a solver \(\sigma_i\) is run, a reward \(r_i\) is observed. The posterior beta distribution (Castillo, Hadi, and Solares 1997) is obtained by adding the reward \(r_i\) to \(\alpha_i\) and \(1 - r_i\) to \(\beta_i\), as illustrated in Figure 7.
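As a rough illustration of this selector, here is a minimal Python sketch of non-contextual Thompson Sampling over solvers, assuming the interface from the sketch earlier in this chapter; the class name and method names are our own, not part of MedleySolver.

```python
import random

class ThompsonSelector:
    """Non-contextual Thompson Sampling over solvers with Beta(alpha, beta)
    posteriors and the binary reward (1 if the solver solved the query)."""

    def __init__(self, solver_names):
        # Uniform Beta(1, 1) priors: initially each solver is assumed to have
        # a 50% chance of solving a random query within the timeout.
        self.alpha = {s: 1.0 for s in solver_names}
        self.beta = {s: 1.0 for s in solver_names}

    def rank(self, _features, solver_names):
        # Draw theta_i ~ Beta(alpha_i, beta_i) for each solver and return the
        # solvers in descending order of the sampled values.
        samples = {s: random.betavariate(self.alpha[s], self.beta[s])
                   for s in solver_names}
        return sorted(solver_names, key=samples.get, reverse=True)

    def update(self, _features, solver, solved):
        # Bayesian update with reward r in {0, 1}: alpha += r, beta += 1 - r.
        r = 1.0 if solved else 0.0
        self.alpha[solver] += r
        self.beta[solver] += 1.0 - r
```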
Thompson Sampling assumes events are independent, i.e., the probability of a solver being able to solve a query is independent of all queries the solver has seen before. This could be true if our time-out prediction algorithm is perfect so that if a solver can solve a query within the timeout \(T\), the solver will also always be able to solve that query within the time \(t\) allocated to the solver.
Contextual approaches depend on the assumption that queries with similar characteristics will cause SMT solvers to perform similarly. We capture the characteristics of each SMT query \(q_i\) in a feature vector \(\boldsymbol{v}_i \in V\). We use these feature vectors both in contextual bandit algorithms (described next) and in our contextual time-prediction (described in Chapter 4.3.2).
We identify a list of 24 features that are quick to extract and that we believe correlate with solving time for specific solvers. These features include context-free qualities like counts of specific operators (e.g., array store operators), the maximum value of literals, the sum of literal values, and so on. The features also include context-sensitive qualities like quantifier nesting and alternations, as well as the size of a given query’s abstract syntax tree represented as a minimal graph (we refer to this representation as a term graph). All feature extraction procedures run in \(\mathcal{O}(n)\), where \(n\) is the size of the term graph. The term graph construction is efficient and the term graph itself is often exponentially smaller than the input query. Therefore, the cost of extracting features is relatively small.
Figure 8: Empirical intuition for features and time predictors. (a) Solver performance on QF_ABV queries as a function of one feature value, showing that different solvers are more or less sensitive to changes in that feature. (b) As time increases, fewer and fewer queries are solved.
As a heuristic, we try to build features that can differentiate solver performance on their own. For example, Figure 8 (a) shows the performance of three different SMT solvers over one of these features. For this example, intuitively, we would want to favor the use of Bitwuzla as the feature value increases, instead of CVC4, which appears to be more sensitive to changes in that feature. While we did not do any empirical feature selection, we did use our prior knowledge of SMT solvers to decide which feature extractors to build, and we evaluate the impact of our decisions in Chapter 4.4.4.4.
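To make the flavor of these features concrete, the sketch below extracts a few illustrative syntactic counts directly from an SMT-LIB query string. It is only a stand-in: the actual extractor uses 24 features and works over the term graph rather than raw text, and the particular features chosen here are assumptions for exposition.

```python
import re

def extract_features(smt2_text):
    """Crude feature vector for an SMT-LIB query given as a string.
    Each entry is a cheap syntactic count in the spirit of the features
    described above (operator counts, quantifier counts, literal statistics)."""
    tokens = re.findall(r"[^\s()]+", smt2_text)
    literals = [int(t) for t in tokens if t.lstrip("-").isdigit()]
    return [
        tokens.count("store"),                            # array store operators
        tokens.count("select"),                           # array select operators
        tokens.count("forall") + tokens.count("exists"),  # quantifiers
        tokens.count("assert"),                           # number of assertions
        max(literals, default=0),                         # largest integer literal
        sum(literals),                                    # sum of integer literals
        len(tokens),                                      # crude proxy for query size
    ]
```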
The \(k\)-Nearest-Neighbor (\(k\)-NN) algorithm is a simple contextual approach. Viewed from the perspective of a MAB problem, given solvers \(s_1, ..., s_n\), \(k\)-NN classifies SMT queries into \(n\) classes: queries where solver \(s_1\) is the best choice, queries where solver \(s_2\) is the best choice, and so on. We extend the standard \(k\)-NN algorithm to return a sequence of solvers.
Given a query, \(q\), the basic \(k\)-NN algorithm looks at the \(k\) nearest queries to \(q\), tallies the number solved by each solver, and orders the solvers by their tallies. We calculate the distance between two queries—how “near” two queries are—by computing the Euclidean distance between their feature vectors. The idea behind this algorithm is that if a solver succeeded on many queries similar to \(q\), then it is likely to succeed on \(q\). If any solvers in the solver set are not included in the neighbors, we randomly shuffle these solvers and append them to the end of the sequence—we make an exception for the \(k=1\) case: when \(k=1\) we return the solver that solved the nearest neighbor, followed by the solver that solved the next nearest neighbor and so on, without replacement.
Once an order is selected, we run the solvers in sequence. If a solver \(s\) succeeds, we add the feature vector of that query to our data set of previously solved queries along with the label \(s\). See Figure 9 for three steps of a 1-nearest neighbor example.
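The following Python sketch shows one way to implement this contextual selector, again using the interface assumed in the earlier sketches. It omits the special case for \(k=1\) described above and uses a plain Euclidean distance over feature vectors.

```python
import math
import random
from collections import Counter

class KNNSelector:
    """Contextual k-NN solver selector: order solvers by how often they solved
    the k nearest previously solved queries."""

    def __init__(self, solver_names, k=10):
        self.k = k
        self.history = []  # list of (feature_vector, solver_that_solved_it)

    @staticmethod
    def _dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    def rank(self, features, solver_names):
        if not self.history:
            order = list(solver_names)
            random.shuffle(order)          # no data yet: arbitrary order
            return order
        nearest = sorted(self.history, key=lambda h: self._dist(features, h[0]))
        tallies = Counter(s for _, s in nearest[: self.k])
        ranked = [s for s, _ in tallies.most_common() if s in solver_names]
        # Solvers that never appear among the neighbours go last, shuffled.
        rest = [s for s in solver_names if s not in ranked]
        random.shuffle(rest)
        return ranked + rest

    def update(self, features, solver, solved):
        # Only successful runs are recorded, labelled with the solver used.
        if solved:
            self.history.append((list(features), solver))
```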
The second component of our approach comprises time predictors, which we use to split the per-query timeout \(T\) into per-solver sub-timeouts \(t_1, \ldots, t_n\). We train a time predictor for each solver. Our goal is to find a time \(t_i\) such that we can stop running the \(i^{th}\) solver in the sequence and be highly confident that it would not have solved the current query even with more time (up to \(T\)). Formally, we are trying to find the minimum \(t_i\) such that \(P(t_i < u_i < T) \leq \delta\), where \(u_i\) is the true runtime of \(\sigma_i\) on the current query and \(\delta\) is the accepted error probability. We consider this event the only relevant error scenario because it implies that, had we allocated more time to solver \(\sigma_i\), we could have solved the current query.
To calculate each \(t_i\), we model each solver’s runtime as an exponential distribution, justified by our experimental observations illustrated in Figure 8 (b): solvers usually succeed early or not at all. We employ Maximum Likelihood Estimation (MLE) (Myung 2003) to fit an exponential distribution to the runtime samples gathered up to that point. In MLE, we find \(\max_{\lambda} P(u_1, \ldots, u_n \mid \lambda)\), where \(u_1, \ldots, u_n\) are the observed runtime samples, which we assume are drawn from \(Exponential(\lambda)\). Using the exponential’s probability density function as the likelihood, this is equivalent to maximizing the log-likelihood \(n\ln\lambda -\lambda(\sum_iu_i)\), leading to the following maximizer: \[\lambda^* = \frac{n}{\sum_iu_i}\] Applying the cumulative distribution function and using the memoryless property of the exponential distribution, we can calculate \(t_i\) as follows: \[t_i = \frac{-\ln(\delta + e^{-\lambda^*T})}{\lambda^*}\] We split \(T\) into sub-timeouts greedily; we use the above process to allocate time for solvers starting from the beginning of our ordering and stop once we reach the overall timeout, allocating zero time to the remaining solvers in the order. If we reach the end of the ordering and still have time remaining, we give the remaining timeout to the last solver.
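A minimal sketch of this non-contextual timeout predictor follows, using the interface assumed in the sketches above. The choice to record only successful runtimes, and the default error probability \(\delta\), are simplifying assumptions.

```python
import math

class ExponentialTimePredictor:
    """Per-solver timeout prediction: fit Exponential(lambda) to observed
    runtimes by MLE and choose the smallest t with P(t < u < T) <= delta."""

    def __init__(self, solver_names, delta=0.05):
        self.delta = delta
        self.samples = {s: [] for s in solver_names}

    def _budget(self, solver, T):
        runtimes = self.samples[solver]
        if not runtimes:
            return T                                  # no data yet: full timeout
        lam = len(runtimes) / sum(runtimes)           # MLE: lambda* = n / sum(u_i)
        # Smallest t such that e^{-lam * t} - e^{-lam * T} <= delta.
        return min(T, -math.log(self.delta + math.exp(-lam * T)) / lam)

    def allocate(self, _features, order, T):
        # Greedily split the per-query timeout T over the solver sequence;
        # any leftover time is handed to the last solver in the order.
        budgets, remaining = [], T
        for solver in order:
            t = min(self._budget(solver, T), remaining)
            budgets.append(t)
            remaining -= t
        if remaining > 0 and budgets:
            budgets[-1] += remaining
        return budgets

    def update(self, _features, solver, elapsed, solved):
        if solved:                                    # simplification: keep only
            self.samples[solver].append(max(elapsed, 1e-6))  # solved runtimes
```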
We present a contextual runtime prediction system based on the \(k\)-NN algorithm. Instead of using every past sample point to estimate \(\lambda^*\), we limit our estimation scheme to the \(k\) nearest data-points. As with the \(k\)-NN based solver selection, the distance between two queries is the Euclidean distance between their feature vectors. The rest of the estimation scheme remains identical to the non-contextual scheme.
Finally, we present a contextual runtime prediction system that finds a linear relationship between our feature vector and the associated runtime. To do so, it minimizes the L2-regularized squared loss of the linear model using stochastic gradient descent, solving: \[\min_{w, b} \sum_i(w^Tx_i + b - u_i)^2 + \alpha ||w||_2,\] where \(w\) is the vector of learned feature weights, \(b\) is a learned intercept, \(x_i\) is the feature vector of the \(i^{th}\) query, \(u_i\) is the true runtime of the \(i^{th}\) query, and \(\alpha\) is a regularization constant.
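The sketch below fits such a per-solver linear model with a hand-rolled SGD step; the learning rate, the regularization constant, and the way a predicted runtime is turned into a budget are all assumptions made for illustration.

```python
import numpy as np

class LinearTimePredictor:
    """Contextual runtime prediction: fit runtime ~ w^T x + b for each solver
    by stochastic gradient descent on an L2-regularised squared loss."""

    def __init__(self, solver_names, dim, lr=1e-3, alpha=1e-2):
        self.lr, self.alpha = lr, alpha
        self.w = {s: np.zeros(dim) for s in solver_names}
        self.b = {s: 0.0 for s in solver_names}

    def predict(self, solver, features):
        x = np.asarray(features, dtype=float)
        return float(self.w[solver] @ x + self.b[solver])

    def update(self, features, solver, elapsed, solved):
        # One SGD step on (w^T x + b - u)^2 + alpha * ||w||_2^2.
        x = np.asarray(features, dtype=float)
        err = self.w[solver] @ x + self.b[solver] - elapsed
        self.w[solver] -= self.lr * (2 * err * x + 2 * self.alpha * self.w[solver])
        self.b[solver] -= self.lr * 2 * err

    def allocate(self, features, order, T):
        # Budget each solver by its predicted runtime, capped by remaining time.
        budgets, remaining = [], T
        for solver in order:
            t = min(max(self.predict(solver, features), 0.0), remaining)
            budgets.append(t)
            remaining -= t
        if remaining > 0 and budgets:
            budgets[-1] += remaining
        return budgets
```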
We implemented a prototype of MedleySolver in Python.14 The input is a directory of queries, and the output is the result, solver used, and time elapsed per query. In this section, we evaluate this prototype and aim to answer the following research questions:
How does MedleySolver compare to individual solvers on the practical SMT problem?
How does MedleySolver compare to the best available alternatives on the practical SMT problem?
How do the individual components of MedleySolver affect the overall performance?
We equip MedleySolver with six SMT solvers—CVC4 v1.8 (Barrett et al. 2011), MathSAT v5.6.3 (Bruttomesso et al. 2008), Z3 v4.8.7 (de Moura and Bjørner 2008), Boolector v2.4.1 (Niemetz, Preiner, and Biere 2014), Bitwuzla v.0.9999 (Niemetz and Preiner 2020) and Yices v2.6.2 (Dutertre 2014)—and run on four benchmark sets, each with 500 queries. Some SMT solvers do not support all needed syntax. For example, the BV set includes quantifiers that Boolector cannot handle. We expect MedleySolver to learn to avoid solvers that fail on specific kinds of queries.
The benchmark queries simulate a typical user’s workload in that they are similar in nature (i.e., they use related logical theories and come from similar applications) but are diverse enough to expose issues a normal user will encounter (e.g., deviations in SMT-LIB conformance). We selected a random sample of 500 queries from an existing benchmark set, Sage2, derived from a test generation tool; 500 queries from 140 UCLID5 (Seshia and Subramanyan 2018) verification tasks; and 500 queries each from the BV and QF_ABV theory SMT-COMP tracks.
We ran every individual solver with a timeout of 60 seconds for every query on a Dell PowerEdge C6220 server blade equipped with two Intel Xeon 10-core Ivy Bridge processors running Scientific Linux 7 at 2.5 GHz with 64 GB of 1866 MHz DDR3 memory. We saved these results and used them to simulate runs of MedleySolver. This ensured results are deterministic and reproducible, and it lowered our carbon emissions. The overhead of running MedleySolver on all 2000 queries varies between learning algorithms and features used but is always less than two minutes for the full set of queries, and is therefore negligible.
To evaluate the utility of MedleySolver for a typical user, we ran \(k\)-NN and Thompson with the three timeout predictors on every set individually and then over the combined set. The combined set represents a realistic combination of queries a typical user might want to solve.
Configuration | BV | QF_ABV | Sage2 | UCLID5 | Split | Combined
---|---|---|---|---|---|---
10-NN Exp (Par-2) | 2081.8 | 1208.2 | 14855.4 | 5386.9 | 23532.3 | 29111.2
10-NN Exp (solved) | 484 | 492 | 396 | 457 | 1829 | 1799
10-NN 10-NN Exp (Par-2) | 2857.7 | 1039.7 | 17558.2 | 5386.9 | 26842.5 | 25133.8
10-NN 10-NN Exp (solved) | 477 | 493 | 367 | 457 | 1794 | 1813
10-NN Linear (Par-2) | 4229.1 | 824.4 | 13789.0 | 8136.4 | 26978.9 | 87898.4
10-NN Linear (solved) | 468 | 496 | 409 | 452 | 1825 | 1308
Thomp Exp (Par-2) | 2986.4 | 1308.5 | 15323.1 | 5540.9 | 25158.9 | 51757.3
Thomp Exp (solved) | 480 | 491 | 401 | 456 | 1828 | 1676
Thomp 10-NN Exp (Par-2) | 2840.8 | 1105.1 | 17949.7 | 5536.0 | 27431.6 | 27555.9
Thomp 10-NN Exp (solved) | 479 | 493 | 365 | 456 | 1793 | 1816
Thomp Linear (Par-2) | 4291.1 | 1474.1 | 15661.3 | 4344.6 | 25771.1 | 45267.7
Thomp Linear (solved) | 466 | 489 | 392 | 473 | 1820 | 1658
Boolector (Par-2) | 60000.0 | 1408.7 | 31502.3 | 60000.0 | 152911.1 |
Boolector (solved) | 0 | 491 | 265 | 0 | 756 |
Bitwuzla (Par-2) | 3872.3 | 822.9 | 22316.4 | 60000.0 | 87011.7 |
Bitwuzla (solved) | 471 | 496 | 349 | 0 | 1316 |
CVC4 (Par-2) | 3332.1 | 5874.1 | 49161.5 | 7395.5 | 65763.3 |
CVC4 (solved) | 477 | 459 | 117 | 459 | 1512 |
MathSat (Par-2) | 11455.0 | 1724.0 | 34159.4 | 50783.3 | 98121.7 |
MathSat (solved) | 406 | 488 | 232 | 77 | 1203 |
Yices (Par-2) | 7244.3 | 922.1 | 13544.4 | 60000.0 | 81710.9 |
Yices (solved) | 442 | 494 | 411 | 0 | 1347 |
Z3 (Par-2) | 2888.6 | 1202.0 | 35279.2 | 2637.2 | 42007.1 |
Z3 (solved) | 477 | 492 | 232 | 484 | 1685 |
Virtual Best (Par-2) | 964.6 | 530.8 | 9006.2 | 2192.1 | 12786.5 |
Virtual Best (solved) | 493 | 497 | 453 | 476 | 1931 |
Table 1 reports the results of our experiment in terms of Par-2 score, i.e., the sum of all runtimes for solved instances plus \(2\times\) the timeout for each unsolved instance (Weber et al. 2019), where “Virtual Best” is calculated by using the best-performing individual solver for each query. On individual sets where one solver dominates, like Z3 on UCLID5, MedleySolver approaches the best solver but does not reach it. Conversely, on sets where no single solver is close to the virtual best, like BV, we find that MedleySolver can exploit this performance differentiation and approach the virtual best solver. The combined set, which is less uniform than BV but less dominated than UCLID5, demonstrates the power of MedleySolver: with no training, MedleySolver solves 94.5% of the queries solved by the virtual best using only 72.3% of the time taken by the most successful individual solver, Z3, which solves 87.3%. Table 1 gives a clear answer to RQ1: MedleySolver outperforms every individual solver on the practical SMT problem.
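For reference, the Par-2 score used throughout these tables can be computed in a few lines of Python; the dictionary-based interface here is only for illustration.

```python
def par2(results, timeout):
    """Par-2 score: runtimes of solved queries plus 2 * timeout per unsolved
    query.  `results` maps each query to its runtime, or None if unsolved."""
    return sum(t if t is not None else 2 * timeout for t in results.values())

# Three queries with a 60-second timeout, one of them unsolved:
print(par2({"q1": 1.2, "q2": 10.0, "q3": None}, 60))  # 131.2
```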
The Thompson MAB selector generally does better when summing up individual sets than when running on the combined set, while the \(k\)-NN selector is the opposite. This suggests contextual approaches can effectively carry over lessons between sets, and non-contextual approaches benefit from being used in the context of a benchmark.
We now compare MedleySolver to alternative portfolio approaches, including those based on pre-trained machine learning techniques. Parallel portfolio solvers like PAR4 (Weber 2019) run multiple solvers in parallel and stop all solvers when the first one solves the query. We can calculate the hypothetical performance of such a solver by multiplying the virtual best solver time by the number of solvers we run. This would give a Par-2 score of \(65481\), in comparison to \(10\)-NN’s better score of \(32632\) over the combined benchmark set.
Configuration | BV | QF_ABV | Sage2 | UCLID5 | Combined
---|---|---|---|---|---
MedleySolver (Par-2) | 1638.7 | 310.5 | 9245.3 | 4248.0 | 18565.5
MedleySolver (training time) | N/A | N/A | N/A | N/A | N/A
MachSMT (Par-2) | 1458.3 | 919.2 | 8516.1 | 2430.9 | 12539.1
MachSMT (training time) | 33895.5s | 4498.9s | 55115.5s | 276419.8s | 300072.8s
Virtual Best (Par-2) | 801.7 | 184.3 | 5204.2 | 1464.7 | 6746.0
Virtual Best (training time) | N/A | N/A | N/A | N/A | N/A
MachSMT (Scott et al. 2021) uses a neural network to select which solver to run on a given query. Table 2 shows the performance of MachSMT and MedleySolver on the same benchmarks as in Table 1, but with \(2/5\) of the queries set aside for MachSMT to train on per set. Although MachSMT slightly outperforms MedleySolver on the test sets, the training time required by MachSMT is orders of magnitude larger than the time spent solving, largely because MachSMT must first run all the solvers on all the queries in the training set to obtain labels, which takes a considerable amount of time. So, on the practical SMT problem, MedleySolver achieves similar results with significantly less resource consumption, and we argue the cost of training is not worth it when online learning can achieve competitive performance. Scott et al. (2021) do make a trained model available that could eliminate training time for a user. However, this pre-trained model is trained on specific versions of a specific set of SMT solvers running on a specific system. In practice, and in our setting, these specifics rarely match the user's, so local training is required.
FastSMT (Balunovic, Bielik, and Vechev 2018) is an offline approach that synthesizes strategies for the Z3 SMT solver. FastSMT requires significant training time and needs to be trained per benchmark set. We can run the pre-trained FastSMT model on the Sage2 benchmark set, where it solves \(358\) queries in \(12766\)s (Par-2 score of \(29806\)). This is a substantial performance gain over Z3, but all MedleySolver configurations still outperform it without any pre-training. FastSMT improves on Z3, but it is limited to a single solver and, unlike MedleySolver, cannot take advantage of the range of SMT solvers implementing different heuristics.
In this section, we evaluate the performance of the individual pieces of MedleySolver. Specifically, we aim to answer the following questions. 1. How do our learning algorithms compare to other well-known MAB algorithms? 2. How well do our order selectors perform? 3. What is the impact of selecting an order instead of a single solver on performance? 4. Which query features are most responsible for MedleySolver’s performance?
We have highlighted the results from Thompson Sampling and \(k\)-NN but we also adapted and evaluated the following Multi-Armed Bandit algorithms: the classic non-contextual epsilon-greedy bandit algorithm (Johnson, Li, and Chen 2000); LinUCB (L. Li et al. 2010), an upper-confidence bound algorithm that assumes a linear relationship between the rewards and the feature vector; Exp3 (Auer et al. 2002): an adversarial bandit algorithm; and an adaptation of a neural network classification based bandit (Zhou, Li, and Gu 2019). All of these bandits performed comparably or better than the best individual solver, but no non-contextual algorithm performed as well as Thompson Sampling, and no contextual algorithm performed as well as \(k\)-NN.
To better understand MedleySolver’s performance, we measure how frequently it selects the best solver as the first solver to try. Over our five case-study benchmarks, the best performing selector is \(k\)-Nearest-Neighbor with \(k=10\) (\(10\)-NN), which correctly picks the best solver 74.3% of the time. The highest success rate is on the BV benchmark, where \(10\)-NN is 92.0% accurate. The lowest success rate is on the QF_ABV benchmark, where \(10\)-NN is 52.0% accurate. This difference demonstrates MedleySolver’s accuracy is proportional to the cost of mistakes: solvers are much better overall on the QF_ABV category so MedleySolver can frequently pick a sub-optimal solver that will still terminate quickly; on the other hand, in BV, picking the wrong solver will often lead to a timeout.
To better understand MedleySolver’s performance, we measure the impact of selecting an order of solvers instead of a single solver. To do this we run MedleySolver without timeout prediction, giving the entire time per query to the first solver in the sequence. We find, all else equal, on the combined set, selecting a single solver produces a Par-2 score 350% worse than our best MedleySolver configuration. This difference is due to the direct cost of mispredictions and because, without the time prediction, MedleySolver is unable to learn from mistakes on a given query.
In this section, we aim to interpret what our results tell us about SMT solvers and optimize performance through feature analysis. To better understand the SMT solvers we use, we measure how well each feature correlates with solver performance differentiation. Specifically, for every feature \(f\), for every pair of solvers \((s_i, s_j) \in S \times S\), we measure the Pearson correlation coefficient between \(f\) and \(\text{time}(s_i) - \text{time}(s_j)\). Using this technique we found, on the BV benchmark set, the most distinguishing features were the number of universal quantifiers for Bitwuzla and MathSat5; the number of free variables for Yices and Boolector; the number of bound variables for Boolector and Bitwuzla; and the size of the term graph for Bitwuzla and Z3.
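A sketch of this analysis using NumPy is shown below; the array layout of the inputs is an assumption made for illustration.

```python
import numpy as np

def differentiating_features(features, runtimes, solvers):
    """For every feature f and solver pair (s_i, s_j), compute the Pearson
    correlation between f and time(s_i) - time(s_j) across all queries.
    `features` is an (n_queries, n_features) array and `runtimes[s]` is an
    array of per-query runtimes for solver s."""
    scores = {}
    for f in range(features.shape[1]):
        for i, si in enumerate(solvers):
            for sj in solvers[i + 1:]:
                diff = runtimes[si] - runtimes[sj]
                scores[(f, si, sj)] = np.corrcoef(features[:, f], diff)[0, 1]
    return scores
```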
To optimize performance, we search for the subset of features that induce the best performance from MedleySolver. Specifically, we use backward step-wise feature selection (BSFS) to iteratively remove features from our feature vector whose removal does not negatively affect performance. Using this technique we found that, all else equal, using only three features on the BV benchmark (number of quantifiers, number of variables, and term graph size) improves the Par-2 score of the best configuration of MedleySolver by 30%. We observe a similar reduction in feature vector size across benchmarks and an average Par-2 performance improvement of 11%. Interestingly, BSFS often removes the feature with the smallest correlation score, as described above.
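The following sketch captures the backward step-wise loop; `evaluate` is a hypothetical callback that reruns MedleySolver restricted to a feature subset and returns its Par-2 score (lower is better).

```python
def backward_stepwise_selection(evaluate, n_features):
    """Iteratively drop the feature whose removal does not hurt the score."""
    selected = set(range(n_features))
    best = evaluate(selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for f in sorted(selected):
            candidate = selected - {f}
            score = evaluate(candidate)
            if score <= best:          # removal did not hurt: keep it removed
                selected, best, improved = candidate, score, True
                break
    return selected, best
```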
We evaluated MedleySolver on a set of benchmarks and a combination of solvers we believe represent a real user’s SMT workload, but we understand these results may not generalize. To mitigate this possibility, we also evaluated on the SMT-COMP data curated by the MachSMT authors. On the QF_UFBV benchmark, the best individual solver was Bitwuzla with a score of 2614.6 while MedleySolver scored 2600.5; on the QF_LIA benchmark, the best individual solver was Yices with a score of 45871.2 while MedleySolver scored 25737.2; on the QF_BVFPLRA benchmark, the best individual solver was MathSAT5 with a score of 3015.2 while MedleySolver scored 567.7; and on the NRA benchmark, the best individual solver was Z3 with a score of 1455.6 while MedleySolver scored 1068.6. In all cases, MedleySolver outperformed every individual solver while using no pre-training, often by margins greater than those observed in our case study, suggesting our results do generalize.
The queries MedleySolver has seen in the past affect the prediction MedleySolver makes for the current query, so the order in which MedleySolver receives the queries could affect the overall performance on the full dataset. MAB algorithms such as Thompson Sampling use random sampling, and the choice of random seed could also affect the results. To gain confidence in our claims, we repeated our experiments with 20 different random seeds and found the standard deviation in MedleySolver’s overall Par-2 score to be approximately 1% of its average score. The margins between MedleySolver and any individual solver are significantly larger than 1%, so MedleySolver is consistently and comfortably better than any individual solver in our evaluation regardless of this variation.
In this chapter, we described the arrow that leaves Figure 1(F): a new way to coordinate solvers, like those described in Chapter 3, called MedleySolver. MedleySolver addresses a practical need born of theoretical limitations. Solver performance can differ greatly across domains due to the use of carefully tuned heuristics that maximize performance in settings where no dominant algorithmic strategy exists. At the same time, solver users usually need to solve multiple related queries in sequence. If the user does not know which solver is best for their domain, then performance over the entire sequence can be prohibitively poor. MedleySolver learns to use a set of solvers over a stream of input logical queries to optimize performance over the sequence without human intervention.
We empirically demonstrated the power of MedleySolver over a set of benchmarks from related work in comparison with existing alternatives. We found that MedleySolver outperforms every individual solver and is comparable to existing pre-trained approaches, even though MedleySolver uses no pre-training. When accounting for pre-training time, MedleySolver significantly outperforms all related work. In short, MedleySolver eliminates the need for users to know which solver is best for their domain.
In this chapter, we build a verification engine for distributed systems, called the P Verifier. This engine generates SMT queries, like those that Algaroba (Chapter 3) and MedleySolver (Chapter 4) handle. Specifically, we describe the arrow to Figure 1(E): a new verification engine, based on a novel notion of message chain invariants, that distributed system engineers can use to (automatically) reason about the fruits of their labor. The technical contents of this chapter first appeared in Mora et al. (2023). Since then, we have built a new version of the P Verifier that includes the advanced solver interface described in Chapter 2.4.
Programming error-free, reliable distributed systems is hard. Engineers need to reason about numerous control branches stemming from myriad interleavings of messages. The systems are incredibly tricky to debug due to the nested mesh of messages and failures, and it is surprisingly easy to introduce subtle errors while improvising to fill in gaps between high-level system descriptions and their concrete implementations. Yet, we depend on distributed systems to deliver high-performance computing, fault-tolerant networked services, and global-scale cloud infrastructures. What can we do to ease the job of the engineer tasked with programming a distributed system? In this chapter we turn to a combination of formal modeling and automated reasoning.
Modeling languages and automated reasoning already play an important role in raising the level of assurance in distributed systems and cloud services. Classic domain-specific examples of the combination include TLA+ (Lamport 2002); newer examples include Ivy (Kenneth L. McMillan and Padon 2020). These languages and associated automated reasoning tools can be used to guarantee correctness for all executions of a given system model, or at least to systematically explore the space in search of errors. Once developers have a correct model, they can build their production systems with confidence. Unfortunately, these modeling languages do not provide users with language primitives or abstractions to support explicit message passing. Working at the message passing level, where a significant source of implementation errors lies, therefore requires manual modelling and significant user effort, making the translation to and from production systems more challenging.
More specifically, modern programming languages provide software engineers with a choice of different distributed systems programming paradigms. Two popular paradigms are the actor model, supported natively by, e.g., Erlang, and the model of communicating sequential processes, supported natively by, e.g., Go. No mainstream verification language supports either of these popular paradigms. For example, a software engineer working with the actor model will likely turn to Ivy (Kenneth L. McMillan and Padon 2020), TLA\(^+\) (Lamport 1999), or Dafny (Leino 2010) for formal verification. Ivy works in the paradigm of synchronous reactive programs; TLA\(^+\) is explicitly designed not to resemble any programming language (Lamport 2021); and Dafny targets imperative programs similar to Java.
To remedy this, we propose message chains, an automated method for keeping track of message chains, and a verification framework that uses message chains to give users the vocabulary to reason about the low-level details of explicit message passing while remaining at an intuitive level of abstraction. Informally, message chains are sequences of logically related messages. This idea is related to message sequence charts (ITU-T 2011) from the software engineering community, and choreographic programming languages (Montesi 2014) from the programming languages community. Message sequence charts are partial specifications that describe the flow of messages through distributed systems. Engineers use these charts for testing, design, simulation, and documentation. Choreographic programming languages, on the other hand, are high-level languages that define distributed systems from a global perspective, focusing on communication flows. Both message sequence charts and choreographic programming offer developers a way to reason about explicit message passing but with the context of how messages flow through the system in question. In this work, we use message chains to bring these same ideas to formal verification.
The rest of this chapter is organized as follows. We begin by revisiting and generalizing the running example from Chapter 1.2 in Chapter 5.2. We then survey related work in Chapter 5.3 and present the necessary background to understand our approach in Chapter 2.5. We formally define the verification problem and our encoding in Chapter 5.4. The problem definition captures systems with any number of machines executing for any number of steps; the encoding captures the notion of message chains and the instrumentation needed to use them. Chapter 5.4 also formally defines message chain invariants and gives a detailed verification of the running example using them. In Chapter 5.4.6 we implement an instance of our verification approach for the P programming language specifically and we call the resulting tool the P Verifier. We then empirically evaluate our implementation in Chapter 5.5. Overall, we make the following contributions.
We define message-passing distributed systems based on I/O automata. We define the verification problem and build a framework that compiles the problem to mar programs.
We define the notion of message chains and integrate it into our message-passing distributed systems. This notion is suited to programming languages like Erlang,15 Akka,16 and P (Desai et al. 2018, 2013).
We instantiate our framework for a version of the P language and call the result the P Verifier. We then replicate verification efforts from related work, and simplify these proofs using message chains. We then verify a correctness property for a system with complex message passing—the onion routing network—and two industrial case studies.
Consider again the ring leader election protocol from Chapter 1.2. We display an implementation of this protocol in P-like syntax in Listing 4. This language is like mar but with the extensions described in Chapter 2.5.1, inline let-bindings, P-like syntax for machines and events (described below), and a new induction command that is syntactic sugar for requires and ensures conditions. This code represents an input model to the P Verifier; it is an example of Figure 1(C).
P programs consist primarily of event and machine declarations. Event declarations define the kind of messages machines exchange and the payloads these messages hold. Machine declarations define the kinds of machines in the system and how they interact with each other. In this figure, we define one kind of event, eNominate, which holds an identifier (line 9). We also define one kind of machine, Node, which has no fields, and two states: Search and Won (lines 10-21). When nodes enter the Search state, they send their value to the node on their right (lines 12-13). After sending their value, while nodes are still in the Search state, nodes react to receiving nominations in one of three ways (lines 14-20). First, when nodes receive a value less than their value, they send their value to the node on their right (lines 19-20). Second, when nodes receive a value greater than their value, they forward this new value to the node on their right (lines 17-18). Third, when nodes receive their own value, they move to the Won state (line 16). The code blocks that begin with the keyword on are handlers.
Note that Listing 4 makes no mention of the number of nodes in the system. In Chapter 1.2 we considered only instances of the ring leader election protocol with three nodes. In this chapter, we consider all possible instances of the protocol. We describe this in detail in Chapter 5.4. For now, it suffices to know that the node (lower-case “n”) definition (line 1) declares a new uninterpreted sort that represents a pointer to, or label of, Node (upper-case “N”) machines. The size of the universe of node determines the number of Node machines, and the verification engine that we define will be free to search for counterexamples over any possible universe of node.
Listing 4: Ring leader election in the P Verifier. Definitions of btw (“between”), right, and le (“less than or equal to”) functions omitted. btw is used to define the ring topology, right uses btw to determine the next node in the ring given an input node, and le is used to determine who the winner should be. Lines 1-21 define the model; lines 22-42 define the specification.
data node
type mc = message_chain[node]
type sys = system[node]

let btw(w: node, x: node, y: node): boolean
let right(n: node): node
let le(x: node, y: node): boolean

event eNominate = {id: node}
machine Node {
  state Search {
    on entry do {
      send right(this), eNominate(this)}
    on eNominate e do {
      let curr = e.payload.id in
      if curr == this then goto Won
      else if le(this, curr) then
        send right(this), eNominate(curr)
      else
        send right(this), eNominate(this)}}
  state Won {}}
let in_flight(s: sys, e: mc): boolean =
  s.events[e] and e is send
let leader(s: sys, l: node): boolean =
  s.machines[l].Node_state is Won
let rec participated(e: mc, r: node): boolean =
  if e is empty then false
  else e.source == r or participated(e.history, r)

induction (s: sys)
invariant unique_leader: forall (l: node, n: node)
  leader(s, l) and leader(s, n) ==> l == n
invariant leader_max: forall (l: node, n: node)
  leader(s, l) ==> le(n, l)
invariant participated_means_le_head:
  forall (e: mc, n: node) let head = e.payload.id in
  in_flight(s, e) and participated(e, n)
  ==> le(n, head)
invariant not_participated_means_going_to_visit:
  forall (e: mc, n: node) let head = e.payload.id in
  in_flight(s, e) and not participated(e, n)
  ==> btw(head, e.target, n) or e.target == n
Figure 10 shows the Lamport diagram for an execution of the ring leader election protocol on four nodes (the universe of node is the set {1, 2, 3, 4}). Every event in Figure 10 corresponds to the entry into a handler or the execution of a send instruction in Listing 4. In this example, node 4 is to the right of node 1, node 2 is to the right of node 4, node 3 is to the right of node 2, and node 1 is to the right of node 3. The execution begins with node 1 sending its label value to node 4 (event 1:0, which corresponds to entering the on entry handler of Listing 4). When node 4 receives the value 1 (event 4:0), it compares 1 to its value and decides to send its value to node 2 (event 4:1). This process goes on until node 4 receives its value (event 4:3) and declares itself the winner. Figure 10 also shows four message chains observed during the same execution. The first message chain, in blue, consists of the events 1:0, 4:0, 4:1, 2:3, 2:4, 3:3, 3:4, 1:2, 1:3, and 4:3, along with omitted associated payloads. The remaining message chains are in orange, green, and pink.
Intuitively, message chains behave like email chains. Each message in the system is passed around with the message it was responding to, creating a history consisting of the sequence of messages that caused the current message. Looking at this history is often useful. For example, if you receive a reply to an email, you know that the sender received your email, and you can tell which email they are responding to by looking at the chain. It does not matter whether the sender also sent other emails to you or to other people, in what order they sent them, or what those other emails contained. For your conversation with the sender, the email chain itself provides sufficient information. We use the same principles to reason about message-passing distributed systems.
We capture this intuitive notion by instrumenting handlers. Specifically, when a handler sends a message without having received a message, our instrumentation begins a new message chain containing only that message. For example, when the on entry handler of Listing 4 is triggered at event 1:0 of Figure 10, we start a new message chain (in blue) containing a single message from node 1 to node 4 containing the payload 1. When a machine receives a message and reacts in the same handler by sending a message, our instrumentation extends the message chain associated with the received message. For example, when the on eNominate handler of Listing 4 is triggered at event 4:0, we extend the blue message chain with a message from node 4 to node 2 with payload 4 (event 4:1).
Our goal, in general, is to prove that systems of asynchronously communicating state machines guarantee specified safety properties. The P Verifier is responsible for this process. It takes a set of properties and returns true if the conjunction of properties passes a proof by induction. For the ring leader election protocol specifically, we prove that there is only ever a single node that declares itself the winner and that the winning node has the label with the greatest value, regardless of the number of nodes included in the system. In general, and specifically for the ring leader election protocol, our proofs will rely on message chains. Note that while a successful proof by induction implies there are no bugs, a failed proof by induction does not imply there are bugs. This fact makes our tool unsuitable for bug finding, like all other proof-by-induction-based tools. But our work can be easily extended to support bug-finding algorithms, like bounded model checking.
Textbook proofs of correctness for such a ring leader election protocol often assume synchronous execution of nodes (e.g., Aspnes (2020)). For the synchronous version of this protocol, for every round \(k < n\), every node sends the greatest value it has seen so far to the node on its right. After \(n\) rounds, the protocol terminates as every node is aware of the greatest value in the ring. Proofs of correctness for the synchronous version of this protocol use induction on the round number. Unfortunately, most realistic implementations are not synchronous so these textbook proofs of correctness do not apply. In the asynchronous setting, proofs of correctness for this protocol tend to ignore message passing altogether. For example, Koenig et al. (2020) provide a proof in mypyvy that abstracts away individual message buffers into a single, shared, append-only relation.17 Padon et al. (2016) provide a similar proof in Ivy.
In this chapter, we propose a new verification approach based on message chains and use it to prove the correctness of asynchronous distributed systems at the message-passing level of abstraction. For example, for the ring leader election protocol, we can encode and machine-check a proof of correctness that is simple, like the textbook proof, and asynchronous, like the mypyvy proof, all while being at a lower level of abstraction than both, since it captures message passing directly. Specifically, Listing 4 displays the full verification of the ring leader election protocol in our prototype verification framework, the P Verifier. This proof, which we revisit in Chapter 5.4, uses two target invariants and two auxiliary invariants—invariants used to make the target specification inductive. The first invariant captures the desired property that there is only ever a single node that declares itself the winner. The second invariant captures the desired property that the winning node has the label with the greatest value. The third invariant is an auxiliary invariant. It says that the head of every message chain always holds the largest label it has seen so far. The fourth invariant is also an auxiliary invariant. It says that, for every message chain, if there is a node that has not yet participated in the message chain, then the message chain is going towards that node. Together, these four invariants capture the textbook proof of correctness over every message chain: for every message chain, the round number in the textbook proof is equal to the length of the message chain.
We provide the full proof of correctness for the ring leader election protocol in Chapter 5.4.5. However, proofs are generally not correct on the first attempt. Figure 11 is a visualization of what a more realistic interaction with the P Verifier would look like. The user first provides a system model (Figure 11 A) and specification (Figure 11 B), and then calls the P Verifier. Usually specifications, like the two target invariants for the ring leader election protocol, are not inductive on their own. In Figure 11 C, the P Verifier rejects the first proof attempt and provides a counterexample to induction. This does not necessarily mean that the model violates the specification, it only means that the proof by induction failed. To fix this, the user must provide auxiliary invariants. Figure 11 D represents an auxiliary message chain invariant. Specifically, Figure 11 D is a visualization of an invariant that asserts that the head of every message chain always holds the largest label in the message chain so far. With the added auxiliary invariants, the user will call the P Verifier again. Figure 11 E represents a successful proof attempt.
In this section, we formulate the problem of verifying asynchronously communicating state machines, we describe our verification workflow, and we define message chain invariants. All of this is built on top of FOLDS Automata (first-order logic distributed systems automata), a short-hand for parametric compositions of I/O automata with restrictions that
make modelling distributed systems easy (see Chapter 5.4.6),
make encoding these systems into SMT straightforward (see Chapter 5.4.3), and
enable our verification workflow by integrating message chains.
At a high-level, FOLDS Automata represent compositions of many state machines (similar to instances of \(\textit{SM}_i\) of Example 2) with a single universal buffer (similar to \(\textit{SUB}\) of Example 1). Our main contribution, and one of the key differences between FOLDS Automata and the I/O automata in Examples 1 and 2, is that, instead of communicating with integers, FOLDS Automata communicate with algebraic data types representing chains of messages.
Definition 9 (Message Chains). MC is an algebraic data type with two sort parameters and two constructors. The sort parameters are E, representing the type of messages, and R, representing the set of machine identifiers. The two constructors are empty and send. The constructor empty has no selectors. The constructor send has four selectors: source, target, payload, and history. The source selector represents the machine that sent the current message and is of sort R. The target selector represents the destination of the current message and is of sort R. The payload selector represents the contents of the current message and is of sort E. The history selector represents the tail of the message chain and is of sort MC[E, R]. A message chain is an instance of the MC algebraic data type. This definition is captured succinctly by an extended mar datatype declaration.
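As a rough illustration (and not the mar syntax used by our framework), the same algebraic data type can be sketched in Python as follows; representing machine identifiers as integers and payloads as tuples is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import Any, Union

@dataclass(frozen=True)
class Empty:
    """The `empty` constructor: a message chain with no messages."""

@dataclass(frozen=True)
class Send:
    """The `send` constructor with its four selectors."""
    source: int                      # machine that sent the current message (sort R)
    target: int                      # destination of the current message (sort R)
    payload: Any                     # contents of the current message (sort E)
    history: Union["Empty", "Send"]  # tail of the message chain (sort MC[E, R])

MC = Union[Empty, Send]

# The green message chain of Figure 10, in this encoding (cf. the example below):
green = Send(2, 3, ("eNominate", 4), Send(4, 2, ("eNominate", 4), Empty()))
```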
For example, send(2, 3, eNominate(4), send(4, 2, eNominate(4), empty)) is the green message chain in Figure 10. Note that message chains are essentially lists of triples. For convenience, we sometimes treat message chains as sequences of triples. This is relevant in Chapter 6.
Definition 10 (FOLDS Automata). A FOLDS Automaton \(A\) is a triple \((E, R, M_i)\), where \(E\) is a sort representing messages, \(R\) is a sort representing machine identifiers, and \(M_i\) is a class of I/O automata that satisfies the following conditions.
For every \(i \in R\), \(M_i\) is an I/O automaton.
If \(i, j \in R\) and \(i \neq j\), then \(M_i\) and \(M_j\) are compatible.
\(\textit{inp}(M_i)\) is \(\{\textit{get(v)}_{i, j} \; \vert \; v \in 2^{MC};\, j \in R\}\).
\(\textit{out}(M_i)\) is \(\{\textit{put(v)}_{i, j} \; \vert \; v \in 2^{MC};\, j \in R\}\).
\(\textit{states}(M_i)\) is an error state \(\bot\), or a record with two sets of message chains, \(\textit{input}_i\) and \(\textit{output}_i\).
\(\textit{start}(M_i)\) must only include non-error states where \(\textit{input}_i\) and \(\textit{output}_i\) are empty.
\(\textit{trans}(M_i)\) is a relation such that \(\textit{get(v)}_{i, j}\) updates \(\textit{input}_i\) to be \(\textit{input}_i \cup v\), and \(\textit{put(v)}_{i, j}\) updates \(\textit{output}_i\) to be \(\textit{output}_i \smallsetminus v\), if \(v\) was in \(\textit{output}_i\) (otherwise \(M_i\) moves to \(\bot\)).
No internal action adds to \(\textit{input}_i\).
If an internal action removes from \(\textit{input}_i\), then it removes exactly one message chain. We call this message chain the received message chain.
Internal actions can only depend on the received message chain.
No internal action removes from \(\textit{output}_i\).
Every message chain that is added to \(\textit{output}_i\) is of the form \(\textit{send}(e, h)\), where \(e\) is a message and \(h\) is either the received message chain or empty, if the internal action did not receive a message chain. We call \(e\) the current message in the message chain and we call \(h\) the history of the message chain.
No internal action can depend on the history of the received message chain.
Internal actions can only add to \(\textit{output}_i\) after any local state changes; internal actions can only make local state changes after any \(\textit{input}_i\) removal.
Intuitively, the first seven conditions (1-7) define network communication: machines send and receive sets of message chains and have input and output buffers that are sets of message chains. The next four conditions (8-11) ensure that machines use their own input and output buffers correctly. Condition 12 ensures that message chains are constructed correctly. Condition 13 ensures that message chains do not affect the behavior of the system we are modeling. That is, message chains behave like ghost variables. Finally, condition 14 ensures that, even though machines take turns stepping, our system is equivalent to an asynchronous composition. This follows from the same argument put forth by Padon (2018), which is in turn due to Lipton (1975).
To give the semantics of FOLDS Automata, we first give the definition of the Universal Buffer, which mediates machine communication by holding and distributing message chains.
Definition 11 (Universal Buffer). For a given non-empty index set \(R\), the Universal Buffer \(\textit{UB}\) is an I/O automaton such that \(\textit{inp}(\textit{UB})\) is \(\{\textit{put(v)}_{i, j} \; \vert \; v \in 2^{MC};\, i, j \in R\}\); \(\textit{out}(\textit{UB})\) is \(\{\textit{get(v)}_{i, j} \; \vert \; v \in 2^{MC};\, i, j \in R\}\); \(\textit{int}(\textit{UB})\) is the empty set; \(\textit{states}(\textit{UB})\) is \(2^{MC} \cup \{\bot\}\); \(\textit{start}(\textit{UB})\) is the singleton set containing the empty set; and \(\textit{trans}(\textit{UB})\) is a relation such that \(\textit{put(v)}_{i, j}\) updates the state \(s\) to \(s \cup v\) and \(\textit{get(v)}_{i, j}\) updates the state \(s\) to \(s \smallsetminus v\), if \(s\) contained \(v\) (otherwise \(\textit{UB}\) moves to the error state \(\bot\)).
A FOLDS Automaton \((E, R, M_i)\) is equivalent to the I/O Automata composition of \(\{M_i \, \vert \, i \in R\} \cup \{\textit{UB}\}\). All definitions that apply to this composed I/O automaton can be lifted to FOLDS Automata. For example, an invariant assertion on a FOLDS Automaton \(F\) is a predicate \(p\) over states of the I/O automaton corresponding to \(F\). Similarly, a proof by induction for an invariant \(p\) of a FOLDS Automaton \(F\) is a proof that \(p\) is inductive and that \(p(s)\) holds for every starting state \(s\) of \(F\).
We wish to verify FOLDS Automata for all possible values of \(R\). For example, we want to verify that the ring leader election protocol is correct no matter how many nodes participate in the election. To do this, we define parametric FOLDS Automata.
Definition 12 (Parametric FOLDS Automata). A parametric FOLDS Automaton is a function \(F(R) \triangleq (E, R, M_i)\), where \(E\) is a sort representing messages, \(R\) is the parameter, and \(M_i\) is a class of I/O automata satisfying the FOLDS Automaton conditions. We call every output of \(F\) an instance of \(F\).
Given a parametric FOLDS Automaton \(F\) and an invariant \(p\), the verification problem is to verify every possible instance of \(F\). Specifically, the verification problem is to check that \(\forall \, R\)
\(\forall \, s \in \textit{start}(F(R)) \; p(s)\); and
\(\forall \, (s, \alpha, s') \in \textit{trans}(F(R)) \; p(s) \implies p(s')\).
In our setting, given our encoding into I/O automata, error states encode the malfunction of the runtime itself—something that we do not wish to reason about. For example, an action \(\textit{get(v)}_{i, j}\) when \(v\) is not in the set maintained by the Universal Buffer will result in the Universal Buffer transitioning to the error state \(\bot\) and the message chain \(v\) “magically” appearing in the input buffer of the \(j^\textit{th}\) machine. Therefore, to avoid nonsensical counterexamples deriving from these transitions, we implicitly assume that all invariants are such that \(inv(s) = inv(s) \lor \textit{error}(s)\), where \(\textit{error}\) is a predicate that returns true iff any machine is currently in an error state. When input programs do not specify an initial set of states, we take it to be equal to the conjunction of invariants.
Encoding FOLDS Automata is exactly like encoding I/O automata. See Chapter 2.5.1 for the details on the encoding and a concrete example of a mar program that represents an I/O automaton. Encoding parametric FOLDS Automata requires only a small change. Specifically, we encode the state space of a parametric FOLDS Automaton as an extensional array whose index sort is a parameter and whose element sort is the sort representing states of the individual I/O automata in the composition. The problem of verification by induction can then be encoded as an SMT query using an uninterpreted sort for the parameter of the parametric FOLDS Automaton. Satisfying interpretations represent counterexamples, while unsatisfiability proofs represent proofs of correctness.
When invariants contain quantifiers, the verification queries are as expressive as first-order logic and the problem is undecidable. Note that there are fragments of first-order logic that are decidable (e.g., EPR (Lewis 1980)), but we support logical theories that fall outside of these fragments (e.g., the theory of sequences is not captured by EPR). When no invariant contains quantifiers, the verification by induction problem can be posed as a quantifier-free SMT query. Unfortunately, it is a long-standing open problem to determine the decidability of quantifier-free queries over sequences with length constraints (Day et al. 2023), and so the decidability of our verification by induction problem is also unknown. Without quantifiers and without sequences, since the theories of equality, linear integer arithmetic, arrays, and algebraic data types are strongly polite and decidable (Sheng et al. 2020; Bonacina et al. 2019), their combination is decidable, and so is our verification problem.
Our problem formulation implicitly supports a novel kind of invariant, message chain invariants, that developers can use along with their usual invariants to verify their systems. Intuitively, message chain invariants describe the way that messages flow through a distributed system. Formally, a message chain invariant is an invariant of the form \[\forall c_1, ..., c_n \in {MC} \; \left(\bigwedge^n_{i=1} \textit{alive}(s, c_i)\right) \implies r(c_1, ..., c_n),\] where \(s\) is the current state of the system, \(\textit{alive}\) is a predicate that returns true iff the message chain \(c_i\) is not empty and is held by the universal buffer or any machine’s buffers, and \(r\) is a predicate over message chains. We call \(r\) the message chain invariant since the rest of the predicate is fixed.
Consider the four message chains in Figure 10. One message chain invariant that is true for all four message chains is that the payload at the head of every message chain holds the largest label value seen so far by that message chain. This turns out to be true for all message chains in the system, and is a crucial fact used in the verification of the ring leader election protocol.
Since predicates in our language can be recursive and solvers directly support these functions (Suter, Köksal, and Kuncak 2011), message chain invariants over a single message chain (\(n = 1\)) can theoretically express any recursively enumerable language over sequences of messages. When properties are over more than a single message chain \((n > 1),\) these invariants can express properties akin to hyperproperties for traces (Clarkson and Schneider 2010). However, message chain invariants cannot express properties about the state of individual machines.
To better understand message chain invariants and our end-to-end workflow, consider the P implementation and verification by induction query for the ring leader election protocol displayed in Listing 4. This verification by induction query has two target invariants. The first invariant, unique_leader, states that there is only ever one node that declares itself as the winner. The second invariant, leader_max, states that this winning node has the greatest label. Note that neither of these two invariants is a message chain invariant: we support message chain invariants, non-message-chain invariants, and combinations thereof.
Theorem 2 (Ring Protocol Correctness). The protocol introduced in Chapter 5.2 and implemented in Listing 4 is correct.
Proof (Ring). We define the semantics of P programs in terms of FOLDS Automata in Chapter 5.4.6. Given these semantics, the verification query in Listing 4, without the last two invariants, is not inductive. This can be explained by two counterexamples. Consider a two-node system—nodes labeled 1 and 2—and the following two message chains: \(c \triangleq\) send(2, 1, eNominate(1), send(1, 2, eNominate(1), empty)) and \(c' \triangleq\) send(1, 1, eNominate(1), empty). For the first counterexample, suppose node 2 has declared itself the winner and \(\textit{input}_1\) contains the message chain \(c\). This system will satisfy both target invariants before taking a step but will falsify unique_leader when node 1 receives \(c\). For the second counterexample, suppose node 2 has declared itself the winner and \(\textit{input}_1\) contains the message chain \(c'\). Again, this system will satisfy both invariants before taking a step but will falsify unique_leader when node 1 receives \(c'\).
To complete the proof by induction, the user can introduce two auxiliary message chain invariants—the last two invariants in Listing 4. The first auxiliary invariant blocks the first counterexample by asserting that the head of every message chain always holds the largest label it has seen so far. Formally, \[r(c) \triangleq \forall n \; \textit{participated}(c, n) \implies n.id \leq c.id,\] where \(\textit{participated}(c, n)\) is a helper function that is true iff the node \(n\) was the source or target of a message in the message chain \(c\). For the counterexample two-node system above, the head of \(c\) cannot hold the label 1 since the node labeled 2 has participated in \(c\), thus our added auxiliary invariant \(r\) blocks the first counterexample.
The second auxiliary invariant blocks the second counterexample above by asserting that, for every message chain, if there is a node that has not yet participated in the message chain then the message chain is going towards that node. Formally, this is the message chain invariant \[ \begin{aligned} r'(c) \triangleq \forall n \; &\neg \textit{participated}(c, n) \implies \\ & (\textit{between}(c.id, c.\textit{target}, n) \lor c.\textit{target} = n) \end{aligned} \] where \(\textit{between}(c.id, c.\textit{target}, n)\) is a helper function that is true iff the target of the message chain \(c\) is between the node labeled \(c.id\) and the node labeled \(n\) in the ring. For the counterexample two-node system above, the node labeled 2 has not participated in the message chain \(c'\) so the target of \(c'\) should be 2 (not 1), thus our added auxiliary invariant \(r'\) blocks the second counterexample.
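To make these two auxiliary invariants concrete, here is a minimal executable sketch, assuming a message chain is represented as a Python list of (source, target, payload) triples ordered oldest to newest and that node labels double as ring positions; the helper names mirror \(\textit{participated}\) and \(\textit{between}\) but are otherwise illustrative.

```python
def participated(mc, n):
    # True iff node n was the source or target of some message in the chain.
    return any(n == src or n == tgt for (src, tgt, _) in mc)

def head_id(mc):
    # Label carried by the payload of the most recent message in the chain.
    return mc[-1][2]

def head_target(mc):
    # Target of the most recent message in the chain.
    return mc[-1][1]

def between(ring, a, b, n):
    # True iff b lies strictly between a and n when walking the ring from a.
    i = ring.index(a)
    for _ in range(len(ring)):
        i = (i + 1) % len(ring)
        if ring[i] == n:
            return False
        if ring[i] == b:
            return True
    return False

def r(mc, ring):
    # First auxiliary invariant: the head holds the largest label seen so far.
    return all(not participated(mc, n) or n <= head_id(mc) for n in ring)

def r_prime(mc, ring):
    # Second auxiliary invariant: any node that has not yet participated is
    # still "ahead" of the chain (or is the chain's current target).
    return all(participated(mc, n)
               or between(ring, head_id(mc), head_target(mc), n)
               or head_target(mc) == n
               for n in ring)

# The counterexample chains from the proof are blocked on the ring [1, 2].
assert not r([(1, 2, 1), (2, 1, 1)], [1, 2])       # first counterexample, blocked by r
assert not r_prime([(1, 1, 1)], [1, 2])            # second counterexample, blocked by r'
```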
Together, these two auxiliary invariants block all counterexamples to induction for all system instances—not just for the two-node system above. The P Verifier, our prototype implementation for the P programming language defined in Chapter 5.4.6, using Z3 (de Moura and Bjørner 2008), will accept this proof by induction in less than a tenth of a second (see Chapter 5.5). This proof is like doing the textbook proof of correctness described in Chapter 5.2 on every message chain at once: for every message chain, the round number in the textbook proof is equal to the length of the message chain. Yet our proof captures the nitty-gritty details of individual machine message passing.
To demonstrate the expressive power of our problem formulation, we implement a verification framework accepting programs written in a version of the P programming language—a large, functional subset that treats message buffers as sets of message chains instead of queues of messages. We call our implementation the P Verifier. The P Verifier implementation is available as an OCaml project consisting of approximately 5000 lines of code.
P is a natural fit for our approach because of its domain-specific structure, but our approach generalizes to standard message-passing languages for distributed systems. In particular, the aspect of P most important for supporting message chains, the notion of a handler, is shared across many programming languages. For example, Erlang uses the receive keyword and Akka uses the receiveMessage method on Behaviors for a similar notion.
We support only the core features of P that are specific to distributed systems. For example, we support the definition of machines and events (messages), but we do not support foreign function calls or modules. We also assume that sequential code blocks—like the bodies of event handlers—are written in mar syntax and use the associated, extended semantics.
At a high level, our compiler encodes P programs into parametric FOLDS Automata. We use P event declarations to create a message sort and P machine declarations to create a class of I/O automata, the two components required for a FOLDS Automaton. The sort of messages (\(E\)) is given by an algebraic data type with \(e\) constructors, where \(e\) is the number of events in the input P program. The \(i^{th}\) constructor of \(E\) holds a single record representing the payload type of the \(i^{th}\) event declaration in the input P program.
The state space of P machines follows a similar encoding. It is represented by an algebraic data type \(M\) with \(m + 1\) constructors, where \(m\) is the number of machine kind declarations in the input P program. The \(i^{th}\) constructor of \(M\) holds a single record representing the fields of the \(i^{th}\) machine kind declaration in the input P program. Specifically, the \(i^{th}\) record holds the variables of the P machines, a state indicator, an entry flag, a “this” reference, and the required \(\textit{input}_i\) and \(\textit{output}_i\) buffers. The state indicator is an enum whose variants are the set of states named by the P machines. The entry flag is a Boolean. The last constructor of \(M\) is \(\bot\) and has no selectors.
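As an illustration, the following sketch builds these two ADTs in Z3's Python API for a hypothetical P program with two events and a single machine kind; the event names, field names, and the use of a sequence sort for the buffers are simplifying assumptions made for brevity.

```python
from z3 import Datatype, IntSort, BoolSort, SeqSort, EnumSort

# One constructor per P event, each holding its payload (names are illustrative).
E = Datatype("E")
E.declare("eNominate", ("nominee", IntSort()))
E.declare("eWon", ("winner", IntSort()))
E = E.create()

# The state indicator for the single machine kind, as an enum.
NodeState, (Init, Waiting, Won) = EnumSort("NodeState", ["Init", "Waiting", "Won"])

# One constructor per machine kind, plus the selector-free error constructor.
M = Datatype("M")
M.declare("Node",
          ("label", IntSort()),     # machine variables
          ("state", NodeState),     # state indicator
          ("entry", BoolSort()),    # entry flag
          ("this", IntSort()),      # "this" reference
          ("input", SeqSort(E)),    # input buffer (message chains elided to events here)
          ("output", SeqSort(E)))   # output buffer
M.declare("Bot")                    # the error state, with no selectors
M = M.create()
```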
The internal actions of the family of I/O automata are \(\{e_{j, i} \; \vert \; e \in N, \; j \in R\} \cup \{\epsilon_i\}\), where \(N\) is the set of event names. We use the action \(e_{j, i}\) to indicate that the current machine, \(i\), receives a message of kind \(e\) from some other machine, \(j\). We use \(\epsilon_i\) to trigger P entry handlers: the first code block that executes after a machine enters a given state. The transitions are defined by precondition-effect pairs derived from handlers as follows. For a handler of event \(e\) within P machine state declaration \(l\) for a machine kind \(K\), let \(m\) be a message chain and let \(s\) be the machine instance at index \(i\); the precondition for the action \(e_{j, i}\) asserts that the \(i^\textit{th}\) machine (1) is the correct target, (2) is of the correct kind, (3) is in the correct state, (4) has its entry flag set to false, (5) is actually receiving a message chain, and (6) that this message chain’s head is of the correct kind. For an entry handler, the precondition for the action \(\epsilon_i\) asserts that the \(i^\textit{th}\) machine (1) is the correct machine, (2) is of the correct kind, (3) is in the correct state, and (4) has its entry flag set to true. We also enforce that every entry handler effect sets the entry flag to false and that every event handler that receives a message chain \(e\) removes \(e\) from \(\textit{input}_i\).
The effects for the actions \(e_{j, i}\) and \(\epsilon_i\) are given by the corresponding handler blocks (\(e\) event handlers and entry handlers). We extend the handler language with some syntactic sugar. First, we provide a send keyword that represents a procedure that takes the index of the target machine t, the payload of the message p, and (implicitly) the message chain that triggered the handler h, and adds send(this, t, p, h) to \(\textit{output}_i\). Second, we provide a goto keyword that represents a procedure that takes a state label, sets the state value to that label, and sets the entry flag to true.
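A rough, dictionary-based sketch of what this sugar expands to (not the compiler's literal output) is the following; the field names mirror the machine record described above.

```python
def do_send(machine, t, p, h):
    # send t, p  ~~>  add send(this, t, p, h) to the output buffer, where h is
    # the message chain that (implicitly) triggered the current handler.
    machine["output"].append(("send", machine["this"], t, p, h))

def do_goto(machine, state_label):
    # goto S  ~~>  set the state indicator to S and raise the entry flag, so the
    # entry handler of S fires on the machine's next epsilon action.
    machine["state"] = state_label
    machine["entry"] = True
```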
We make a few simplifications in the implementation that manifest in our evaluation. First, instead of representing the state of the system as an array, we represent it as a record with two selectors: events and machines. The events selector represents the state of the universal buffer; the machines selector represents the states of all other machines. Second, instead of two buffers per machine, we collapse all buffers into the events selector.
We aim to answer the following research questions.
RQ1: Is the P Verifier expressive enough to replicate existing proofs from other systems? How does the performance of the P Verifier compare to the state of the art?
RQ2: Do message chains help simplify proofs?
RQ3: Can we use the P Verifier to verify industrial distributed systems?
To evaluate the expressive power of the P Verifier, we take verified benchmarks from Koenig et al. (2020) written in mypyvy, and re-verify them using the P Verifier. This set of benchmarks contains consensus protocols like the ring election protocol, standard systems like sharded key-value store services, and more complicated systems, like the hybrid reliable broadcast originally described by Widder and Schmid (2007) and modeled by Berkovits et al. (2019).
The P Verifier and mypyvy make different modeling choices and so the translation process is not always straightforward. For example, while the P Verifier intrinsically encodes implementation details like machine kinds and message buffers, mypyvy users would need to manually model these details. This encourages mypyvy users to write proofs at higher levels of abstraction and means that the corresponding P Verifier proof must explicitly link the implementation details to the abstract proof. For example, for the client-server system (described in detail in Chapter 5.5.2.1), the mypyvy model uses three logical relations to keep track of all messages that have been sent and received so far in the system. These relations abstract the notion of a message buffer—we can infer that a message is in the buffer if it has been sent but not yet received. We use the same logical relations in the P Verifier version of the proof but also maintain the implementation level detail of message buffers by adding one auxiliary invariant that amounts to saying that these abstract relations are append-only versions of the concrete message buffers in the system.
On the other hand, when mypyvy models do include low-level detail, translation into the P Verifier can actually simplify the proof since these details are frequently handled by the programming model itself. For example, the “ticket” benchmark models threads as state machines with three states. The P Verifier version of the proof encodes this directly by defining one machine kind called “Thread” which has three states. The mypyvy model defines three relations on threads, each acting as a flag to indicate when a thread is in the given state. This complicates their proof because auxiliary invariants must be added to ensure that no thread can be in more than one state at a time, a property that is tautological in the P Verifier model.
Benchmark | mypyvy | P Verifier | # Inv M | # Inv P | # Inv P* | \(\exists\forall\)? | \(\exists\forall\)?*
---|---|---|---|---|---|---|---
ring-id | 0.365s | 0.138s | 4 | 4 | 4 | N | N
toy-consensus-forall | 0.380s | 0.064s | 4 | 5 | | N |
consensus-wo-decide | 0.338s | 0.072s | 5 | 8 | | N |
sharded-kv | 0.407s | 0.045s | 5 | 5 | | N |
learning-switch | 0.410s | 0.048s | 6 | 6 | | N |
consensus-forall | 0.566s | 0.120s | 7 | 10 | | N |
lockserv | 0.406s | 0.049s | 9 | 9 | | N |
ticket | 0.402s | 0.068s | 14 | 9 | | N |
firewall | 0.364s | 0.033s | 2 | 2 | 1 | Y | N
sharded-kv-no-lost | 0.328s | 0.589s | 2 | 2 | | Y |
client-server-ae | 0.323s | 0.037s | 2 | 3 | | Y |
toy-consensus-epr | 0.392s | 0.106s | 4 | 4 | | Y |
client-server-db-ae | 0.403s | 0.060s | 5 | 8 | 4 | Y | N
ring-id-not-dead | 0.479s | 0.194s | 6 | 5 | 5 | Y | Y
consensus-epr | 0.557s | 0.088s | 7 | 8 | | Y |
hybrid-reliable-bcast | 0.676s | 0.489s | 8 | 5 | | Y |
Table 3 summarizes the data required to answer RQ1. Each row of the table corresponds to a benchmark. The first column contains the name of the benchmark in question, and the columns labeled “# Inv M” and “# Inv P” display the number of invariants used in the mypyvy and P Verifier proofs, respectively. While the number of invariants is not always indicative of the complexity of a proof, in our context, where a one-to-one translation was made whenever possible, it gives an idea of the extra proof considerations required.
Overall, we find that the P Verifier is comparably expressive to mypyvy and that it can successfully verify existing benchmarks. When our programming model requires extra proof considerations, we are able to address them with a few auxiliary invariants. To evaluate the performance of the P Verifier, we revisit Table 3. The second column shows the wall-clock time taken by mypyvy to verify each benchmark; the third column shows the wall-clock time taken by the P Verifier to verify the corresponding benchmark. All verification times were collected on a 2.3 GHz Quad-Core Intel Core i7 CPU with 32 GB of RAM. With the exception of the benchmark “sharded-kv-no-lost-keys,” the P Verifier is faster in every case.
Thus, for RQ1, we answer that yes, the P Verifier is expressive enough to replicate proofs from other systems, and the P Verifier performs comparably to mypyvy in terms of runtime. Note that, as a sanity check, we introduced bugs into all benchmarks and attempted to verify the buggy version. No proof by induction succeeded, as expected.
To evaluate the simplification power of message chains, we take four benchmarks from related work and attempt to use message chains to improve the proofs. We then explore a new benchmark that is difficult for existing tools but is trivial in our framework. The four benchmarks from existing work, “client-server-db-ae,” “firewall,” “ring-id,” and “ring-id-not-dead,” are those in Table 3 that display message chains longer than three messages.
The first system is a client-server system inspired by Feldman et al. (2017). This system involves three kinds of machines: clients, servers, and databases. Clients send requests to servers which then consult a database. Once the server receives an answer from the database, it forwards the response to the original client. Feldman et al. verify that whenever a client in the client-server-database system receives a response, the same client had sent a matching request. This property is like the no-forge property of Griffin et al. (2020) (which states that no machine “forges” a message) and requires significant machinery to state and verify. Namely, for a message-passing model without message chains, this proof requires three ghost variables, which track every request and response ever sent by the system, three invariants to ensure that the concrete message buffers are abstracted by these ghost variables, and four invariants of the form \(\forall \exists\) that state that for every response, there existed a request that “matches” that response.
Stating and verifying this no-forge property using message chain invariants, on the other hand, is easy and requires no modeling tricks: message chains provide the vocabulary that would otherwise need to be integrated into the system model manually. Specifically, the message chain-based proof uses four invariants. The first invariant uses the path-pattern on machine indexes \(x_1x_2x_3x_2x_1[x_1 \in C \land x_2 \in S \land x_3 \in D]\), where \(\Sigma \triangleq C \cup S \cup D\), \(C\) is the set of client machine indexes, \(S\) is the set of server machine indexes, and \(D\) is the set of database machine indexes. We define path-patterns in Chapter 6.4.2. This path-pattern captures the desired no-forge property, and the invariant states that every message chain is a prefix of that path-pattern. The second invariant states that, for every message chain \(mc\), mc.message.val = mc.history.message.val. The third invariant states that the payload of the first message in every message chain is the original client. Finally, the fourth invariant again uses a path-pattern, but this time on sequences of message kinds \(qqpp[]\), where \(q\) represents an eRequest message, \(p\) represents an eResponse message, and \(\Sigma \triangleq \{q, p\}\). Together, these four invariants are inductive for all system instances and message interleavings.
The second system models a set of nodes communicating through a firewall, where the firewall blocks traffic coming from outside a network unless the traffic is in response to an internal request. In the firewall case, the property asserts that every message received by an internal node was either sent by another internal node or was sent by an external node in response to an internal node. Like the client-server-database system, the mypyvy encoding defines abstract relations to keep track of what messages have been sent out, and then, when a message is received, the encoding asserts that there exists some message in the abstract relation with the desired property. This specification is a \(\forall\exists\) property that establishes the provenance of node interactions. To make the encoding inductive, auxiliary invariants are needed.
Using message chains and the P Verifier we can replace the \(\forall\exists\) properties with a single message chain invariant that does not contain existential quantifiers. Specifically, the solution to the existential part of the property is given by the first element of every message chain. Again, no abstraction is needed to verify this system at the message-passing level.
The final two systems from related work, “ring-id” and “ring-id-not-dead,” are variants of the ring election protocol described in Chapter 5.2 and verified in Chapter 5.4.5, where “ring-id-not-dead” includes an extra goal invariant specifying that the system is never “stuck.” As previously described, we use message chain invariants to verify both systems using a proof that is simple and asynchronous, all while being at a lower level of abstraction than proofs from related work, since it captures message passing directly.
Finally, we detail a new verification that is difficult for related work but easy for our framework: the Onion Routing (Tor) network. Tor is a distributed network comprised of relay nodes used to anonymize TCP-based web traffic. TCP-based web traffic is normally not anonymous because anyone in the network can view the source and destination of packets. In the Tor network, clients build a path through network nodes to a target server and then send their requests through this path. Each step of the path is encrypted on top of the previous steps of the path, making a layered structure (like an onion). When a node receives a packet, it decrypts the message (peels one layer of the onion) to determine the node it should forward the packet to. In this way, each node in the path knows the identity of its successor and predecessor, but no other nodes. To send a response back to a client, servers follow a similar process in reverse. By scrambling network activity, Tor makes it difficult for observers to trace client-server communication.
We modeled the Tor network using three machine kinds: Client, Node, and Server. Client machines build a path to Server machines through some sequence of Node machines and then back to themselves through a second sequence of Node machines. Message payloads hold a todo variable that corresponds to the path left to traverse. When a Node machine receives a message, it “peels” an element of the todo variable and forwards the message to the next machine in line. Server machines behave similarly but could also do some processing to serve the given request.
There are interesting statistical and cryptographic properties to verify for the Tor network, but we focus on basic correctness: every complete message chain should start at a client, go through some non-zero number of intermediate nodes, reach the desired server, return through some non-zero number of intermediate nodes, and end at the same client that began the interaction. To verify this property, we introduced a single ghost variable to messages, done, and added code to maintain it such that concat(done, todo) stays constant throughout every message chain. Intuitively, whenever a machine peels an element off of todo, it adds it onto done. From there it is easy to prove that concat(done, todo) always has the shape defined by the target specification and that every message chain is a prefix of concat(done, todo). Since this proof used interpreted sequence functions heavily (e.g., prefix and concat), we chose to represent the message chain using the SMT-LIB theory of sequences instead of algebraic data types, which would have required us to manually define these functions. Related work would have a much harder time verifying the same property of the same system at the individual message passing level of abstraction.
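The following sketch restates the proof idea in plain Python, assuming each message carries done and todo lists of machine names and that the sets of clients, nodes, and servers are given; the shape check is an illustrative stand-in for the target specification, not our actual SMT encoding.

```python
def well_formed_route(route, clients, nodes, servers):
    # A client, one or more nodes, exactly one server, one or more nodes, and
    # finally the same client that started the interaction.
    if len(route) < 5 or route[0] not in clients or route[-1] != route[0]:
        return False
    server_positions = [i for i, r in enumerate(route) if r in servers]
    if len(server_positions) != 1:
        return False
    k = server_positions[0]
    return (2 <= k <= len(route) - 3
            and all(r in nodes for r in route[1:k])
            and all(r in nodes for r in route[k + 1:-1]))

def chain_invariant(head_msg, visited, clients, nodes, servers):
    # concat(done, todo) is constant along the chain, has the required shape,
    # and the machines visited so far form a prefix of it.
    route = head_msg["done"] + head_msg["todo"]
    return (well_formed_route(route, clients, nodes, servers)
            and route[:len(visited)] == visited)

clients, nodes, servers = {"c"}, {"n1", "n2"}, {"s"}
head = {"done": ["c", "n1", "s"], "todo": ["n2", "c"]}
assert chain_invariant(head, ["c", "n1", "s"], clients, nodes, servers)
```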
For RQ2, we find that message chains can simplify proofs when the system involves communication.
To evaluate the P Verifier on industrial benchmarks, we conduct two case studies on verification problems provided to us by our industrial collaborators.
The first industrial case study models and verifies a system like the ClockBound (Amazon.com 2023) protocol, which is a mechanism for generating timestamps that are guaranteed to be within some bound of “true time.” Here, “true time” is just the time of a reference machine, but when this reference machine is reliable, the protocol can be used to evaluate distributed consistency by comparing the timestamps of events.
The protocol offers an API that clients can query. When one requests a time from the system, one receives a time range \(T_i = (E_i, L_i)\), where \(E_i\) corresponds to the earliest time guaranteed to have passed and \(L_i\) represents the latest time guaranteed to have not yet passed. Internally, the API consists of many systems with local clocks communicating with a corresponding global reference clock. When time is requested from a local clock, the local clock sends a query to its corresponding global clock, which then responds with a “true time” that the local clock uses to return a lower and upper bound on time.
Our industrial partners have target specifications for this system. Specifically, that (1) for two time requests \(T_1\) and \(T_2\), if \(T_1\) and \(T_2\) happened on the same node and \(T_2\) happened after \(T_1\) then \(E_2 \geq E_1\); (2) for two time requests \(T_1\) and \(T_2\), if \(T_2\) happened after \(T_1\) then \(L_2 > E_1\); and (3) for two time requests \(T_1\) and \(T_2\), if \(L_1 < E_2\) then \(T_1\) happened before \(T_2\).
The model and proof is 150 lines of code and takes less than two seconds to verify. We used ten auxiliary invariants, including one message chain invariant. The message chains in this system are up to five messages long and we use a message chain invariant to ensure that every local clock only ever participates in one message chain at a time. This is a powerful invariant because it blocks spurious counterexamples to induction stemming from rogue messages targeting local clocks. This is an example of a message chain invariant over multiple message chains, as described in Chapter 5.4.4.
The second industrial case study is a distributed transaction commit protocol that implements multi-version concurrency control (MVCC) (Bernstein and Goodman 1983) using a two-phase commit (2PC) (Gray and Lamport 2006) protocol for agreeing on transactions. The general idea of MVCC is to allow non-blocking reads and writes by maintaining multiple versions of the data in question. Whenever a write occurs, a new version of the data is created. Reads can observe the most recent version of the data and do not need to wait for concurrent writes to terminate. The main issue is when two writes occur simultaneously. In this case, the system must agree on what writes to commit and in what order. 2PC is used to resolve this issue. Whenever a write transaction is initiated, it must broadcast a commit request to all participants. If all participants agree, then the transaction can be committed—the write can take place—and the transaction announces its success.
Our industrial partners have target specifications for this system. Specifically, (1) we only read the latest committed transactions; (2) for two read transactions \(T_1\) and \(T_2\), if \(T_1\) happened before \(T_2\) then the version read by \(T_1\) must be earlier or equal to the version read by \(T_2\); and (3) if a transaction announces success then all participants must have agreed to commit the corresponding write. This model and proof is 250 lines of code and takes approximately 12 minutes to verify.
The 2PC portion of this protocol is particularly interesting because it demonstrates a common phenomenon in distributed systems where an action depends on multiple concurrent messages. That is, in 2PC, coordinators require all participants to vote “yes” before telling them all to commit (and abort if any vote “no”). In terms of message chains, this means that many message chains end at the coordinator and only the message chain corresponding to the last “yes” vote or the first “no” vote is extended. An interesting aspect of our approach is that, since it is a non-deterministic choice of which message chain is extended (in the asynchronous setting the order that votes arrive in is non-deterministic), message chain invariants reason about all of them at once.
For example, it may be surprising, but a single message chain invariant is sufficient to capture important implementation details, like the fact that the coordinator in 2PC will never send a commit message if any participant voted “no.” Specifically, the message chain invariant is (in extended mar): c.head is commit ==> c.tail.head is vote("yes"). This works because the verification engine is free to pick any of the votes as the prefix that is extended, so the verification only succeeds if all of the possible prefixes satisfy the invariant. The same principle applies for different thresholds (e.g., quorums).
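As a concrete illustration, the following sketch checks this invariant on an explicit list representation of a message chain (newest message first); the message kinds and payloads are hypothetical.

```python
def commit_implies_yes_vote(chain):
    # chain is a list of (kind, payload) pairs, head (newest message) first.
    if not chain:
        return True
    head_kind, _ = chain[0]
    if head_kind != "commit":
        return True
    # The prefix that the coordinator extended must end in a "yes" vote.
    return len(chain) > 1 and chain[1] == ("vote", "yes")

assert commit_implies_yes_vote([("commit", None), ("vote", "yes"), ("prepare", None)])
assert not commit_implies_yes_vote([("commit", None), ("vote", "no"), ("prepare", None)])
```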
In summary, we give an affirmative answer to RQ3: the P Verifier can be used to verify industrial benchmarks and that message chains play an important role.
In this chapter, we described the arrow to Figure 1(E): a verification engine that takes formal models and specifications, and generates logical queries that can be solved by tools like our work in Chapter 3—the arrow from Figure 1(E) to Figure 1(F). The main challenge with verifying distributed systems is in managing the level of abstraction of models. More abstract models can be easier to verify, but harder to link to implementations. Low-level models are hard to verify, but easier to link to implementations. We introduce a new abstraction, called message chains, that can be automatically extracted from and linked to low-level models. Message chains allow users to reason about sequences of related messages instead of only individual messages. This leads to proofs at a more abstract level while retaining the low-level details of low-level models.
Our empirical evaluation demonstrates that message chains are a powerful new tool that distributed system engineers can use to verify their systems. Specifically, we show that the P Verifier is expressive enough to replicate proofs from other systems; that the P Verifier performs comparably to mypyvy in terms of runtime; that message chains can simplify proofs when the system involves communication; and that the P Verifier can be used to verify industrial benchmarks.
A new version of the P Verifier, which is part of the P language toolchain and includes the advanced solver interface described in Chapter 2.4, is in use at Amazon Web Services today. The new P Verifier was used to verify an industrial version of the 2PC protocol. This protocol works for multiple rounds of transactions and is used to maintain agreement on a partitioned key-value store. See P Team (2025) for more details.
In Chapter 5, we presented a tool—a verification engine—that takes in a model and a set of properties, and attempts to prove by induction that the model satisfies the properties. Formal verification, however, can be a time-consuming task that requires significant manual effort. Especially for complex systems, users often need to manually provide, for example, loop invariants, function summaries, or environment models. Synthesis and learning have the potential to alleviate some of this manual burden (Seshia 2015). For example, prior work has used synthesis to reason about program loops (David, Kroening, and Lewis 2015), and to automate program repair (Le et al. 2017). In this chapter, we employ specification mining techniques, based on program synthesis and classic learning techniques, to lessen this burden. Specifically, in this chapter, we describe the arrow from Figure 1(C) to Figure 1(D): a new way to augment formal models with inferred specifications.
The intuition behind our approach is that message chains reveal enough structure that they can even be used to automatically learn meaningful specifications from only distributed system execution data. That is, given only example message chains (no access to the system definition or specifications), existing learning techniques can discover properties that hold for all executions of a system. In Chapter 6.4.1 we describe how an existing program synthesis technique can be used to discover the third invariant of Listing 4 using only example message chains. Without message chains, given only example execution logs, the same off-the-shelf, state-of-the-art tool is unable to learn any meaningful specifications.
This learning from examples problem is usually called specification mining (Ammons, Bodı́k, and Larus 2002) or likely invariant synthesis (Ernst et al. 2001) and has applications beyond verification. For example, specification mining can be used for automated program repair (Demsky et al. 2006), testing (Schuler, Dallmeier, and Zeller 2009), system understanding (Beschastnikh et al. 2015), and more. In this chapter, we focus on validating the use of message chains for specification mining by showing that learning techniques can discover invariants that appear in our manual verification efforts.
Figure 12 shows a workflow diagram that illustrates how specification mining and verification fit together. The user first inputs a model (Figure 12.a) of their system written in the version of P (Desai et al. 2018, 2013) described in Chapter 5.4.6. This model is fuzzed (Figure 12.b) to generate positive example message chains (Figure 12.c). We formally defined message chains in Chapter 5.4 and we formally define positive (and negative) examples in Chapter 6.3. The message chains in Figure 10 correspond to the “Positive Examples” in Figure 12.c: these can be obtained automatically by fuzzing the input model (Figure 12.a). The user can optionally add negative examples (Figure 12.d) and then run the specification mining framework (Figure 12.e) described in Chapter 6.4, producing likely invariants (Figure 12.f). The user can use these likely invariants as their specification, provide their own invariants (Figure 12.g), or both. Either way, the model (Figure 12.a) and invariants (Figure 12.f and Figure 12.g) are given to the verification engine (Figure 12.h) described in Chapter 5.4 which returns either true, if the proof by induction passes, or false otherwise. This process is usually iterative with the user changing their model, invariants, or adding examples manually or through fuzzing.
The rest of this chapter describes the specification mining components of Figure 12 in detail. We begin with a survey of related work in Chapter 6.2. We then present the necessary background to understand our approach in Chapter 6.3. We formally define the specification mining problem and describe two approaches for mining message chain invariants in Chapter 6.4. Finally, we evaluate both approaches empirically in Chapter 6.5. Overall, we make the following contributions.
We define the notion of positive and negative message chain examples and describe new ways to use existing learning and program synthesis techniques to automatically mine specifications from examples only.
We evaluate this technique by automatically mining useful specifications for the distributed systems we verified in Chapter 5.
In this section, we define specification mining and give the necessary background on two learning techniques that we use to solve the specification mining problem.
The function synthesis problem corresponds to the second-order query \[\exists f \; \forall \vec{x} \; \sigma(f, \vec{x}),\] where \(f\) is the function to synthesize, \(\vec{x}\) is the set of all possible inputs, and \(\sigma\) is the specification to be satisfied. Many program synthesis engines, like Burst (Miltner et al. 2022), focus on synthesizing recursive functions over algebraic data types.
When \(\sigma\) is a conjunction of input-output examples, like \(f(0) = 1 \land f(1) = 2 \land f(2) = 3\), we call the program synthesis query a programming-by-example query (Halbert 1984). When the function to synthesize is a predicate, we call examples of the form \(f(x) = \text{true}\) positive examples. Similarly, we call examples of the form \(f(x) = \text{false}\) negative examples. In this chapter, we use existing recursive function synthesis tools to learn predicates from positive and negative examples and call this process specification mining.
Specifically, the input to specification mining tools in our setting is a set of pairs. The first element of each pair is a message chain and the second element is a Boolean indicating whether the message chain can appear in an execution of the system. For example, for the ring leader election protocol, (Send(1, 4, eNominate(1), Empty), true) is a positive example and (Send(1, 4, eNominate(4), Empty), false) is a negative example. Given a set of positive and negative examples, the output of a specification mining tool is a function that evaluates to true for the positive examples and false for the negative examples. Intuitively, this function represents a system specification that is likely to hold, but is not guaranteed to hold. Note that allowing negative examples makes this definition slightly more general than the related definitions of Ernst et al. (2001) and Ammons, Bodı́k, and Larus (2002).
We present patterns and pattern languages (Angluin 1980), but with definitions tailored to SMT-LIB. Let \(\Sigma\) be a zero-ary sort with finite cardinality greater than two, which represents a finite alphabet of size equal to the cardinality of the sort. Let \(X = \{ x_1, x_2, \ldots\}\) be a countable set of symbols disjoint from \(\Sigma\), which we refer to as pattern variables. A sequence over \(\Sigma\) is an \(n\)-ary concatenation of symbols \(t_1, \ldots, t_n\) where every \(t_i\) is an element of \(\Sigma\), and the set of non-empty finite sequences over \(\Sigma\) is denoted \(\Sigma^+\).
A pattern \(p\) is a sequence over \(\Sigma \cup X\), i.e., an \(n\)-ary concatenation of symbols \(t_1, \ldots, t_n\) where every \(t_i\) is an element in \(\Sigma \cup X\). We use juxtaposition to represent concatenation, so write \(p \triangleq t_1 ... t_n\). We use \(\textit{Var}(p)\) to denote the pattern variables that occur in \(p\). Patterns define languages in the following way. For a pattern \(p\), call \(e(\omega') \triangleq p = \omega' \; \land \bigwedge_{x \in \textit{Var}(p)} \vert x \vert \geq 1\) the characteristic formula of \(p\), where the pattern variables in \(p\) are free variables, \(|x|\) denotes the length of \(x\), and \(\omega'\) is a given sequence in \(\Sigma^+\). The language of a pattern \(p\), \(L(p)\), is the largest set of sequences such that for every \(\omega\in L(p)\), \(e(\omega)\) is satisfiable. When \(m\) is a model of \(e(\omega)\) for a particular \(p\), we say that \(m\) justifies \(\omega\in L(p)\). For example, for \(\Sigma \triangleq \{f, g, h\}\) and \(X \triangleq \{x_1, x_2, ...\}\) the pattern \(p \triangleq x_1x_2x_3x_2x_1\) defines a pattern language \(L(p)\) that contains the sequence \(\mathit{fghgf}\) and the sequence \(\mathit{hfggggfh}\) but does not contain \(\mathit{fghfg}\), and the model \(\{x_1 \triangleq f, x_2 \triangleq g, x_3 \triangleq h\}\) justifies that \(\mathit{fghgf} \in L(p)\). We use this notion of justification to, at a high level, learn the most strict but justified semantic constraints on patterns over a useful concept class. The characteristic formula constrains the model to map each variable in \(\textit{Var}(p)\) to a concatenation of at least one symbol in \(\omega\). We do not use erasing patterns, where variables can map to the empty sequence, in this work.
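The membership check \(\omega \in L(p)\) reduces to a satisfiability query over the characteristic formula. The following sketch performs this check with Z3 sequences for the example pattern \(x_1x_2x_3x_2x_1\); the alphabet and the concrete sequences are the ones from the worked example above.

```python
from z3 import EnumSort, SeqSort, Const, Concat, Unit, Length, And, Solver, sat

Sigma, (f, g, h) = EnumSort("Sigma", ["f", "g", "h"])
Seq = SeqSort(Sigma)

def seq_of(symbols):
    # Build the concrete sequence value for omega from a list of alphabet elements.
    units = [Unit(s) for s in symbols]
    return Concat(*units) if len(units) > 1 else units[0]

def in_pattern_language(omega):
    # Characteristic formula of p = x1 x2 x3 x2 x1: the concatenation equals
    # omega and every pattern variable is non-empty (non-erasing).
    x1, x2, x3 = Const("x1", Seq), Const("x2", Seq), Const("x3", Seq)
    characteristic = And(Concat(x1, x2, x3, x2, x1) == seq_of(omega),
                         Length(x1) >= 1, Length(x2) >= 1, Length(x3) >= 1)
    solver = Solver()
    solver.add(characteristic)
    return solver.check() == sat  # any model justifies membership

print(in_pattern_language([f, g, h, g, f]))           # expected: True
print(in_pattern_language([h, f, g, g, g, g, f, h]))  # expected: True
print(in_pattern_language([f, g, h, f, g]))           # expected: False
```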
Angluin (1980) defines \(\ell\textit{-minl}\) and Shinohara (1982) presents an algorithm to compute it. The \(\ell\textit{-minl}\) algorithm takes in a set of sequences in \(\Sigma^+\) and returns the longest pattern that best describes the examples. We treat the \(\ell\textit{-minl}\) algorithm as a black box but rely on the following theorems.
Theorem 3 (Angluin (1980)). Let \(\Sigma\) be a zero-ary sort with finite cardinality greater than two, let \(\Omega\subseteq \Sigma^+\) be a set of finite sequences over \(\Sigma\) and let \(p\) be the output of the \(\ell\textit{-minl}\) algorithm. Then \(p\) is the longest pattern such that \(\Omega\subseteq L(p)\) and there is no pattern \(q\) with \(\Omega\subseteq L(q)\) and \(L(q) \subsetneq L(p)\).
Theorem 4 (Shinohara (1982)). The \(\ell\textit{-minl}\) algorithm can be computed by a deterministic polynomial-time Turing machine using an oracle in NP. Specifically, for a given pattern \(p\) and set of sequences \(\Omega\subseteq \Sigma^+\), the NP oracle checks \(\Omega\subseteq L(p)\) by checking \(\omega\in L(p)\) at most \(\vert \Omega\vert\) times. This membership check is NP-Complete for erasing and non-erasing patterns (Jiang et al. 1994).
Up until now, we have assumed that users provide system models and specifications. This is the same requirement imposed by related work. However, in practice, specifications are not always obvious and formally expressing them is not always easy. In this section, we describe an approach to mining specifications—message chain invariants—using existing learning techniques. Positive examples can be generated easily by fuzzing the target system. Negative examples, on the other hand, are harder to generate automatically. In our setting, we take negative examples to be hints from the user: they are examples that the user believes can never occur during an execution.
When message chains are represented as algebraic data types, our specification mining problem matches the programming-by-example problem for functional, recursive programs. In this section, we describe how to use an existing program synthesis technique in this space, Burst, to mine message chain invariants.
Burst is a synthesis tool based on bottom-up enumeration. The Burst algorithm, by design, will produce syntactically short functions. Burst begins by synthesizing a recursive function that satisfies the specification (positive and negative examples) assuming that undefined recursive calls behave exactly as required to satisfy the specification (referred to as “angelic” semantics). It then checks whether the synthesized recursive function satisfies the specification under the actual semantics, and, if not, strengthens the specification based on the existing assumptions. This process repeats until a recursive function is found that satisfies the specification and has no undefined recursive calls. However, the synthesizer may have to backtrack some of the specification strengthening if the specification is made unsatisfiable by any added assumptions.
At a high level, to synthesize message chain invariants using Burst, we give it queries consisting of the algebraic data type definition of message chains; a function signature (mapping a message chain to a Boolean); examples (message chains that do and do not occur); and a basic library of helper functions. This library includes functions for computing logical operators (like conjunction and negation) and simple functions for reasoning about message chains (like a function that takes a message chain and returns the innermost message if one exists). Burst then returns recursive functions that evaluate to true on all positive examples and false on all negative examples.
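For intuition, the following is the kind of recursive predicate over the message-chain ADT that such a query asks Burst to produce, written here in Python rather than Burst's input language; the field names, the particular property, and both examples are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Msg:
    source: int
    target: int
    payload: int
    rest: Optional["Msg"]  # the message chain this message extends, or None

def payload_nondecreasing(chain: Optional[Msg]) -> bool:
    # True iff payloads never decrease from the tail of the chain to its head,
    # i.e., the head always carries the largest label seen so far.
    if chain is None or chain.rest is None:
        return True
    return chain.payload >= chain.rest.payload and payload_nondecreasing(chain.rest)

# Hypothetical positive example: a node forwards the larger label it received.
pos = Msg(2, 1, 2, Msg(1, 2, 1, None))
# Hypothetical negative example: a node forwards a smaller label.
neg = Msg(2, 1, 1, Msg(1, 2, 2, None))
assert payload_nondecreasing(pos) and not payload_nondecreasing(neg)
```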
When message chains are represented as sequences, our specification mining problem matches the problem of learning formal languages from examples. In this section, we describe how to use non-erasing pattern languages to mine message chain invariants using only positive examples. We first give the necessary background on non-erasing pattern learning and then provide an extension to tailor it to our distributed systems domain.
Given only positive input message chain examples, the \(\ell\textit{-minl}\) algorithm can quickly learn a non-erasing pattern that likely holds for all message chains in the target system. Note that most similar learning algorithms, like those for learning regular languages, require both positive and negative examples. Unfortunately, non-erasing patterns on their own are not specific enough for our domain. This is in part because no non-erasing pattern containing at least one variable can represent a finite language, and many distributed systems only exhibit bounded message chains. For example, suppose you have a simple client-server system where clients send requests to servers and servers respond to the client. Every message chain in this system has length at most two, but no non-erasing pattern can represent this language.
To fix this, we extend the \(\ell\textit{-minl}\) algorithm with an enumerative approach that learns a conjunction of constraints. These constraints are useful for the distributed systems domain and yet guarantee that the enumerative algorithm terminates. We call the resulting pattern a path-pattern and denote it \(p[l]\), where \(p\) is the base \(\ell\textit{-minl}\) pattern and \(l\) is a set of added constraints. When adding constraints, the goal is to be as specific as possible without being too specific. We formalize the notion of too specific in the following definition. Given a set of sequences \(\Omega\) and a pattern \(p\) with characteristic formula \(e\), we say that a path-pattern \(p[l]\) is too specific iff \(\bigvee_{\omega\in \Omega} e(\omega) \nvDash l.\) That is, if every model that can be used to justify the training set is preserved by \(l\), \(l\) is not too specific.
Consider the following two examples. First, let \(\Sigma \triangleq \{f, g\}\), let \(\Omega\triangleq \{fg, gf, ff\}\), and the corresponding pattern \(p \triangleq x_1x_2\). In this case, the constraint \(l \triangleq \vert x_1 \vert = 1 \land \vert x_2 \vert = 1\) is better than the constraint \(l' \triangleq \textit{true}\) because it gives us more information about the variables in the pattern. It is more specific. Second, keep \(\Sigma\) and \(p\) as before, but take \(\Omega\) to be \(\{fg, gff, gggg\}\) and consider the three different constraints \(l \triangleq \vert x_1 \vert = 1\), \(l' \triangleq \vert x_2 \vert = 1\), and \(l'' \triangleq \textit{true}\). While it is the case that \(\Omega\subseteq L(p[l]) = L(p[l']) = L(p[l''])\), the first two constraints are too specific while the last constraint is not. That is, they impose constraints on the roles the variables in the pattern play that are not supported by evidence: if these path-patterns represented machine indexes, the first path-pattern would say that \(x_1\) always represents a single machine, the second path-pattern would say that \(x_2\) always represents a single machine, and the last path-pattern would make no such unfounded judgments. Therefore, in this case, we would consider \(l''\) to be the best choice. We now describe our procedure for learning two kinds of constraints over \(\ell\textit{-minl}\) patterns: length constraints and membership constraints.
Learning Lengths. The procedure to discover length constraints uses the following lemma.
Lemma 5 (Lengths). Let \(\Omega\) be a set of sequences, \(p[l]\) be a pattern such that \(\Omega\subseteq L(p[l])\) and \(p[l]\) is not too specific, \(x \in \textit{Var}(p)\) be a variable, \(x_1, x_2 \not\in \textit{Var}(p)\) be two fresh variables, and \(p' \triangleq p\langle x_1x_2 / x\rangle\) be \(p\) but with all occurrences of \(x\) substituted with \(x_1x_2\). If \(\Omega\cap L(p'[l]) = \emptyset\), then
\(p[l \land \vert x \vert = 1]\) is more specific than \(p[l]\) and
\(p[l \land \vert x \vert = 1]\) is not too specific.
Proof (Part 1). Let \(e_l\) be the characteristic formula of \(p[l]\). The characteristic formula of \(p[l \land \vert x \vert = 1]\) is then \(e_l \land \vert x \vert = 1\), and \(e_l \land \vert x \vert = 1 \models e_l\) holds by the semantics of conjunction.
Proof (Part 2). Suppose for contradiction that \(p[l \land \vert x \vert = 1]\) is too specific. That is, suppose that \[\bigvee_{\omega \in \Omega} e_l(\omega) \nvDash l \land \vert x \vert = 1.\] Then there exists a model \(m\) of \(\bigvee_{\omega\in \Omega} e_l(\omega)\) that is not a model of \(l \land \vert x \vert = 1\) and that interprets \(x\) as the sequence \(z\) such that \(e_l(\omega)\) evaluates to true for some sequence \(\omega\in \Omega\). Since \(m \models l\) (\(p[l]\) is not too specific by assumption), we have that \(m \nvDash \vert x \vert = 1\) and therefore that \(z\) is of length at least two. Let \(e_{p'}\) be the characteristic formula of \(p'[l]\) and let \(m'\) be the model that is like \(m\) in every way but extended to interpret \(x_1\) as the first element of \(z\) and \(x_2\) as the rest of \(z\). The contradiction is that we have \(m' \models e_{p'}(\omega)\) but, by the antecedent of the implication, \(\omega\not\in L(p'[l])\).
The procedure itself starts with the set of constraints \(\bigwedge_{x \in \textit{Var}(p)} \vert x \vert = 1\). For every \(x \in \textit{Var}(p)\), we check \(\Omega\cap L(p'[l]) = \emptyset\), where \(p'\) is constructed as in Lemma 5. If \(\Omega\cap L(p'[l]) = \emptyset\) holds for \(x\), then we keep \(\vert x \vert = 1\) in the conjunction and we move to the next variable. If it does not hold, then we remove \(\vert x \vert = 1\) from the conjunction and continue.
Theorem 6 (Effective Procedure for Discovering Lengths). Let \(\Omega\) be a set of sequences over \(\Sigma\), \(p\) be a pattern with \(\Omega\subseteq L(p)\), and \(l\) be the output of the above procedure. We claim that \(p[l]\) is not too specific and that for every set of constraints \(l'\) of the same form that is not too specific, \(p[l]\) is more specific than \(p[l']\). Furthermore, the above procedure runs in \(\mathcal{O}(\vert p \vert)\) time using an oracle in NP.
Proof (Lengths). We use the same oracle as the \(\ell\textit{-minl}\) algorithm in Theorem 4 but instead of checking \(\Omega\subseteq L(p)\) we check \(\Omega\cap L(p'[l]) = \emptyset\), where \(p'\) is constructed as in Lemma 5. We call this oracle once for every variable that appears in \(p\), which is at most the size of \(p\). Correctness follows directly from Lemma 5 and two facts. First, if \(\vert x \vert = 1\) is too specific on its own, then so is every conjunction that contains \(\vert x \vert = 1\). Second, the most specific set of constraints is the largest set.
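A compact sketch of this length-discovery loop, assuming a pattern is a list of tokens and that intersects is the NP-oracle from the proof above (e.g., implemented with a Z3 membership check like the one sketched earlier), is:

```python
def split(pattern, x, x1, x2):
    # p<x1 x2 / x>: replace every occurrence of variable x with the pair x1 x2.
    return [sym for tok in pattern for sym in ((x1, x2) if tok == x else (tok,))]

def discover_lengths(pattern, variables, examples, intersects):
    # Start from the most specific candidate and drop constraints that some
    # example refutes, exactly as in the procedure above.
    constraints = {x: "|x| = 1" for x in variables}
    for x in variables:
        x1, x2 = ("fresh", x, 1), ("fresh", x, 2)
        p_prime = split(pattern, x, x1, x2)
        remaining = {v: c for v, c in constraints.items() if v != x}
        if intersects(p_prime, remaining, examples):
            # Some example needs |x| >= 2, so |x| = 1 would be too specific.
            del constraints[x]
    return constraints
```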
Discovering Membership. Discovering membership constraints follows a similar process. To make things more tractable, we assume that users supply a list of pairwise disjoint subsets of \(\Sigma\) of interest. The following theorem is key.
Theorem 7 (Membership). Let \(\Omega\) be a set of sequences over \(\Sigma\), \(p[l]\) be a pattern such that \(\Omega\subseteq L(p[l])\) and \(p[l]\) is not too specific, \(x \in \textit{Var}(p)\) be a variable, \(x_1, x_2 \not\in \textit{Var}(p)\) be two fresh variables, and \(\sigma\) be a strict subset of \(\Sigma\). If for every \(c \in \Sigma \smallsetminus \sigma\) it is the case that \(\Omega\cap L(p\langle x_1cx_2 / x \rangle\{x_1, x_2\}[l]) = \emptyset\), then 1. \(p[l \land x \in \sigma^+]\) is more specific than \(p[l]\) and 2. \(p[l \land x \in \sigma^+]\) is not too specific.
Proof (Memberships). Similar to Lemma 5, suppose for contradiction that \(p[l \land x \in \sigma^+]\) is too specific. That is, suppose that \(\bigvee_{\omega \in \Omega} e_l(\omega) \nvDash l \land x \in \sigma^+.\) Then there exists a model \(m\) of \(\bigvee_{\omega \in \Omega} e_l(\omega)\) that is not a model of \(l \land x \in \sigma^+\) and that interprets \(x\) as the sequence \(z\) such that \(e_l(\omega)\) evaluates to true for some sequence \(\omega \in \Omega\). Since \(m \models l\) (\(p[l]\) is not too specific by assumption), we have that \(m \nvDash x \in \sigma^+\) and therefore that \(z\) is a sequence of the form \(y_1cy_2\), where \(y_1\) and \(y_2\) are fresh, erasing variables and \(c \in \Sigma \smallsetminus \sigma\). Let \(e_{p'}\) be the characteristic formula of \(p\langle x_1cx_2 / x \rangle\{x_1, x_2\}[l]\) for the same \(c\) and let \(m'\) be the model that is like \(m\) in every way but extended to interpret \(x_1\) as \(y_1\) and \(x_2\) as \(y_2\). The contradiction is that we have \(m' \models e_{p'}(\omega)\) but, by the antecedent of the implication, \(\omega \not\in L(p\langle x_1cx_2 / x \rangle\{x_1, x_2\}[l])\).
The procedure to discover membership constraints is again a simple loop. We start with the set of constraints \(\bigwedge_{\sigma} \bigwedge_{x \in \textit{Var}(p)} x \in \sigma^+\). For every \(x \in \textit{Var}(p)\) and input subset \(\sigma\), we check \(\Omega\cap L(p'[l]) = \emptyset\) for every \(c \in \Sigma \smallsetminus \sigma\), where \(p'\) is constructed as in Theorem 7. If \(\Omega\cap L(p'[l]) = \emptyset\) holds for \(x\) for every \(c\), then we keep \(x \in \sigma^+\) in the conjunction and we move to the next variable. If it does not hold for some \(c\), then we remove \(x \in \sigma^+\) from the conjunction and we move to the next variable. The proof that this is an effective procedure is exactly the same as that of Theorem 6, but note that the running time also depends on the number and size of the input sets \(\sigma\).
In summary, \(\ell\textit{-minl}\) gives us the best pattern from a syntactic perspective (Theorem 3) and we refine this pattern to get the best path-pattern from a semantic perspective. At a high level, to synthesize message chain invariants using this algorithm, we give it queries consisting of positive examples generated by fuzzing, and it returns path-patterns that capture these examples.
We combine the recursive function synthesis approach and the extended non-erasing pattern learning approach into the same framework as the P Verifier. This code includes fuzzing infrastructure to generate positive example message chains. We implement the extended \(\ell\textit{-minl}\) algorithm as a Python program of approximately 250 lines that uses Z3 for the NP oracle that checks sequence membership. The Python code is available with the OCaml code described in Chapter 5.4.6. We use Burst out of the box for synthesis of recursive functions.
For each target distributed system of interest, we give the framework many different queries. The queries all follow the same structure described above, but we categorize and clean the input data before calling the individual tools.
For categorization, we tag example message chains with the last instruction relevant to them, and we group message chains by this tag. For example, for the ring leader election protocol in Listing 4, there are four groups: message chains that have just started at line 13; message chains that have reached the leader at line 16; and message chains that have just been extended at lines 18 and 20. For cleaning, we take each group of message chains and we generate new sets of message chains that each focus on a different aspect. Specifically, we generate a set that removes all payload information from message chains, leaving only the sequence of nodes visited by each message chain (the source of each message along with the final target); and we generate a set that removes all source and target information, leaving only the sequence of payload values exchanged.
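A small sketch of this categorization and cleaning step, assuming each example message chain is a list of (source, target, payload, tag) tuples with the newest message last, where tag records the last relevant instruction, is:

```python
from collections import defaultdict

def categorize(chains):
    # Group example chains by the tag of their newest message.
    groups = defaultdict(list)
    for chain in chains:
        groups[chain[-1][3]].append(chain)
    return groups

def nodes_only(chain):
    # Keep only the machines visited: each source plus the final target.
    return [src for (src, _, _, _) in chain] + [chain[-1][1]]

def payloads_only(chain):
    # Keep only the sequence of payload values exchanged.
    return [payload for (_, _, payload, _) in chain]

def make_queries(chains):
    # One full query and two cleaned queries per group, as described above.
    queries = []
    for tag, group in categorize(chains).items():
        queries.append((tag, "full", group))
        queries.append((tag, "nodes", [nodes_only(c) for c in group]))
        queries.append((tag, "payloads", [payloads_only(c) for c in group]))
    return queries
```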
For the ring leader election protocol, this categorization and cleaning process generates 12 queries (4 groups, 3 versions of each query). We give all twelve queries to both specification mining methods (using the algebraic data type and sequence encodings as appropriate) and we return 24 suggested invariants to the user. The input queries to Burst contain three negative examples generated by hand and eight positive examples generated automatically by fuzzing. The three negative examples represent three obviously impossible message chains: one where a node sends a value that is not its own and that it did not receive; one where a node sends a value to a node that is not on its “right;” and one where a node receives its own value and, instead of declaring itself the winner and stopping, extends the message chain by sending a new message.
We evaluate our specification mining framework on the four unique systems from the benchmarks that display message chains longer than three messages, because these are the hardest cases. The other benchmarks from related work are either simplified versions of the four (e.g., ‘client-server-ae’ is like ‘client-server-db-ae’ but without a database) or do not display very complicated message passing (e.g., ‘ticket,’ which we briefly describe in Chapter 5.5.1). We generate examples and create specification mining queries as described in Chapter 6.4. All specification mining runs terminated in less than a second.
For “ring,” we are able to automatically discover the first auxiliary invariant of Theorem 2, and the correct, but less immediately useful fact that the suffix of message chains leading to a victory is a list of nodes from left to right, ending in the leader. For “client-server-db-ae,” we are able to automatically discover the first (target) and fourth (auxiliary) invariants described in the verification in Chapter 5.5.2.1; and that every message chain in the client-server-db system begins with a Client—a fact that is very close to the third invariant of the same verification. For “firewall” we are able to discover the specification that every message arriving at an internal node must have originated from an internal node; that the firewall blocks messages originating from external nodes; that blocked message chains are of length at most one; and that complete message chains are of length at least one. For “tor,” we are able to automatically mine the target invariant described in Chapter 5.5.2.4.
In total, this means six out of 24 invariants used in the proofs were mined. Removing the mined specifications from their respective proofs makes each of the proofs fail: these six specifications are essential. The mined specifications that do not appear in the handwritten proofs may still be useful for other proofs or other applications, like testing, debugging, and documentation, but a thorough evaluation of their effectiveness for those applications is needed to draw any conclusions about them. In conclusion, we find that we can learn meaningful specifications using only message chain examples.
In this chapter, we presented the arrow from Figure 1(C) to Figure 1(D): a specification mining toolkit that uses the structure provided by message chains to discover meaningful invariants for distributed system models. In short, our tools take execution examples and infer specifications that likely hold for all executions. This reduces the burden on verification users by suggesting invariants that would otherwise require manual effort to discover and formalize.
We empirically evaluated our tools and found that six out of 24 invariants that were used in the proofs in Chapter 5 can be mined automatically in under a second each. Without the structure of message chains, the underlying learning algorithms cannot discover meaningful properties. Therefore, we conclude that message chains expose enough structure in distributed executions to make automated specification mining feasible and effective.
Our techniques for generating specifications for distributed systems assume that we have access to a formal model of the system. Unfortunately, the process of formally modeling a distributed system can be as arduous as specifying it.
In Mora et al. (2020) we integrated classic synthesis techniques, like those explored in Chapter 6, into UCLID5, our formal modelling and verification tool. The synthesis integration allowed users to generate parts of their models driven by verification goals defined by bounded model checking, k-induction, sequential program verification, and hyperproperty verification. We used this integration to generate 25 program synthesis benchmarks with simple, known solutions but found that the state-of-the-art formal synthesis engines of the time could not find solutions.
In this chapter, based on Mora et al. (2024), we explore a modern alternative that uses large language models. Specifically, we describe the arrow into Figure 1(C): a new technique for generating formal models from text descriptions of systems. Instances of this methodology are domain-specific. The methodology itself, however, is general.
Language models (LMs) have demonstrated an exceptional ability to generate code from natural language prompts for popular programming languages, like Python and Java (M. Chen et al. 2021). Unfortunately, these same language models struggle to generate code for low-resource programming languages (LPLs), like many domain-specific languages (e.g., CUDA (Tarassow 2023)). These challenges are even more pronounced for very low-resource programming languages (VLPLs) (e.g., formal verification languages like UCLID5 (Seshia and Subramanyan 2018; Polgreen et al. 2022) and mar). Existing work has attempted to remedy this issue through prompting, constrained decoding, and fine-tuning strategies. Unfortunately, these approaches fail to capture the intricacies of real VLPLs, and so success remains limited. To see why, consider the following three exemplars.
First, B. Wang et al. (2023) include context-free grammars in text-to-code prompts to guide LMs toward syntactically correct answers. This approach works well for simple languages but cannot capture many programming languages that are context-sensitive. Second, L. Agrawal et al. (2023) use static analysis techniques to reject tokens that lead to syntactically incorrect output programs. This technique can go beyond context-free languages but assumes a linear programming process. Unfortunately, it is well known that the root cause of a programming error need not surface at the point where it is written (Pavlinovic, King, and Wies 2014), necessitating backtracking and a nonlinear programming process. Third, Cassano et al. (2024) translate training data from high-resource languages to low-resource languages and then use this new data to fine-tune models. This approach is restricted to languages that the LM can already translate into reliably, even when it cannot generate code in them from natural language. Further, this approach makes the overly restrictive assumption that the target low-resource language is general purpose: e.g., we cannot translate arbitrary Java programs to CUDA.
In this chapter, we propose a text-to-code approach that is fundamentally different from (and complementary to) prompting, decoding, and fine-tuning strategies. The first key idea behind our approach comes from natural programming elicitation, a kind of study that helps programming language designers understand how programmers “naturally” approach problems from a given programming domain (Myers, Pane, and Ko 2004; Chasins, Glassman, and Sunshine 2021). Programming language designers use the results of these studies to create languages that are aligned with the expectations of users, leading to less programming friction and more effective developers. We borrow this idea for the setting where LMs are the “users” of programming languages. Akin to uncovering what human users find “natural” for a given domain, we uncover what LMs find “natural.” Specifically, our first insight is to embrace LMs’ tendencies and design an intermediate language that aligns with them.
The second key idea in our approach is that program analyses and repairs that are overly aggressive for human users may be suitable for LM “users.” For example, in UCLID5, all variables have to be declared and statically typed: an assignment like x = 0; would require a corresponding declaration like var x: integer;. But, if an LM generates code that has an assignment without a declaration, instead of crashing, one could automatically “repair” the program and output the result.
We use these two ideas to define a new text-to-code approach called synthetic programming elicitation and compilation (SPEAC, pronounced “speak”). Specifically, for a target VLPL \(T\), we use synthetic programming elicitation to select an intermediate language \(P\) (the “parent” language) and define a subset of that language, \(C\) (the “child” language). The language \(P\) should be one that LMs are good at generating (e.g., Python); the language \(C\) should be easy to compile to the target VLPL \(T\). Our approach takes \(P\), \(C\), and a compiler from \(C\) to \(T\), and produces a text-to-code pipeline for the VLPL \(T\). This pipeline uses deductive techniques to automatically repair programs generated by LMs that are in \(P\) but not in \(C\). When these deductive techniques are unable to fully repair a program, we insert a “hole” and ask an LM to finish the repair, repeating as necessary.
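The following Python sketch summarizes this pipeline; the function names (llm, repair, compile_to_target) are placeholders for an LM call, the deductive repair of Chapter 7.4.2, and the \(C\)-to-\(T\) compiler, not the actual Eudoxus API.

```python
def speac(task: str, llm, repair, compile_to_target, max_iters: int = 5) -> str:
    """Sketch of the SPEAC loop: generate in P, repair toward C, fill holes, compile to T."""
    # Ask the LM for a first attempt in the child language C.
    program = llm(f"Solve the following task in the child language:\n{task}")
    for _ in range(max_iters):
        # Deductively repair the program toward C; holes (??) mark what we could not fix.
        repaired, has_holes = repair(program)
        if not has_holes:
            return compile_to_target(repaired)  # a complete program in C: emit T
        # Otherwise, ask the LM to complete the partial program and try again.
        program = llm(f"Complete the following partial program:\n{repaired}")
    raise RuntimeError("iteration limit reached without a complete program")
```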
We demonstrate the effectiveness of this idea by implementing a prototype, called Eudoxus, that targets UCLID5 (Seshia and Subramanyan 2018; Polgreen et al. 2022), a language used for formal modeling and verification of state transition systems. UCLID5 has code examples numbering in the hundreds rather than thousands or millions. Furthermore, UCLID5 programs rely heavily on the notion of a transition system, which is not frequently found in other programming languages. As such, state-of-the-art LMs are unable to generate any useful UCLID5 code out-of-the-box (see Chapter 7.4.1). In our case study, we use Python as the parent language \(P\) and a subset of Python as the child language \(C\), and improve the performance of LM code generation for UCLID5. Overall, we make the following contributions.
We present SPEAC, a novel method for generating syntactically correct code from LMs in very low resource programming languages;
we implement this method for the UCLID5 verification language; and
we demonstrate substantial improvement with SPEAC in syntactic correctness, producing parsable code in UCLID5 84.8% of the time compared to 9.1% by gpt-3.5-turbo fine-tuning and 12.1% by gpt-4-turbo in-context learning.
Given a natural language description of a programming task and a target programming language \(T\), the text-to-code problem is to generate a program \(t \in T\) that satisfies the task specification. Listing 5 and Listing 6 show a real input-output pair generated by an instance of our approach targeting the UCLID5 programming language. Specifically, Listing 5 shows a task extracted from Lee and Seshia (2016), and Listing 6 shows the output corresponding to that task using a prototype implementation of our approach. Listing 6 passes all compiler checks but has a subtle semantic mistake on line 4.
Listing 5: Partial task description from Lee and Seshia (2016).
Model a state machine describing a traffic light at a pedestrian
crosswalk. This is a time triggered machine that assumes it will
react once per second. It starts in the red state and counts 60
seconds with the help of the variable count. It then transitions
to green, where it will remain until the pure input pedestrian
is present. That input is generated by some other subsystem that
detects when a pedestrian is present, and should be modelled as
nondeterministic. When pedestrian is present, the machine
transitions to yellow if it has been green for at least 60 seconds.
Otherwise, it transitions to pending, where it stays for the
remainder of the 60 second interval. This ensures that once the
light goes green, it stays green for at least 60 seconds. At the
end of 60 seconds, it will transition to yellow ...
Listing 6: Partial output of Eudoxus in UCLID5. We interpret sigG, sigR, sigY to represent green, red, and yellow light signals, respectively. The procedure step captures the transition relation of the state machine. state appears to be a bookkeeping variable that is used to track the cases in the task, and count represents a timer.
module TrafficLight {
var sigG, sigR, sigY: boolean;
var count, state: integer;
var pedestrian: boolean;
init {
sigG = false; sigY = false; sigR = true;
state = 0; count = 0; pedestrian = false; }
procedure step()
modifies sigG; modifies sigY; modifies sigR;
modifies count; modifies state; {
if (state == 0) {
sigG = false; sigY = false; sigR = true;
count = 0;
if (count < 60) { count = (count + 1); }
else { state = 1;}}
else { if (state == 1) {
sigR = false; sigY = false; sigG = true;
if (pedestrian) {
if (count >= 60) { state = 2; }
else { state = 3; }}}
...
Figure 13 shows the workflow that generates models like Listing 6 from task descriptions like Listing 5. The workflow is parameterized by an LM, \(L\) (e.g., gpt-3.5-turbo-0125); a target language, \(T\) (e.g., UCLID5); a parent language, \(P\) (e.g., Python); a child language, \(C \subset P\) (e.g., a subset of Python); and a compiler, \(f\), from \(C\) to \(T\) (e.g., a syntax-directed translation from the subset of Python to UCLID5). Given an input task, we create a prompt \(q\) that asks the LM \(L\) to generate code, \(p \in C\), which satisfies the task description, \(q\). The second step of the workflow is to repair \(p\). If there is nothing wrong with \(p\), or \(p\) can be fixed using the formal techniques described in Chapter 7.4.2, then repairing will generate a new, complete program \(p'\) and return \(f(p')\) (i.e., a program in the target language, like Listing 6). Frequently, however, repairing will generate a partial program containing “holes” (denoted “??”). For example, Listing 10 shows the first \(p\) generated for the task in Listing 5 and Listing 11 shows the corresponding partial program \(p'\) that was automatically generated using our formal techniques. Programs with holes cannot be compiled to the target language, so the third step of the workflow is to ask the LM to complete the partial program \(p'\), generating a new \(p\). We use the template in Listing 9 to generate the LM prompt. Listing 12 shows the output generated by gpt-3.5-turbo-0125 when asked to complete the partial program in Listing 11. This program is still incorrect, but it is now close enough that we can automatically repair it to a complete program without holes. Listing 13 shows the relevant portion of that complete program. This final, complete program is directly translated to the output in Listing 6. Chapter 7.4 elaborates on each of the workflow components.
To understand the subtle mistake in Listing 6 one needs some understanding of UCLID5. UCLID5 is a verification language used to model hardware and software systems as “state machine”-like transition systems and to automatically check if the transition system does indeed satisfy a formal logic specification. UCLID5 transition systems primarily consist of a state space given by a set of variable declarations (e.g., lines 2-4 in Listing 6), an initialization block that represents a set of acceptable start states (e.g., lines 5-7 in Listing 6), and a transition relation block that defines how the transition system moves from one state to the next state (e.g., the code starting at line 8 in Listing 6). The var keyword is used to declare variables that are internal to the transition system in question, while the input keyword is used to declare read-only variables that are outside the control of the transition system in question. Listing 6 passes all compiler checks but has a subtle semantic mistake on line 4: var pedestrian: boolean; should be input pedestrian: boolean; because the presence of a pedestrian should not be controlled by the traffic light transition system. When manually assessing correctness in Chapter 7.5, this subtle mistake would prevent us from marking this example as fully correct.
In this section we describe the SPEAC approach. We begin with how to select \(P\) and \(C\) (Chapter 7.4.1) and the prompting that we use to interface with LMs. We then describe two formal techniques at the heart of our automated repair (Chapter 7.4.2).
Listing 7: Hypothetical LM output with grounded theory codes as comments. Codes are determined and assigned manually.
In this section, we present synthetic programming elicitation as a kind of programming study—where LMs are the subject—that follows three steps. The results of these studies are used to select \(P\) and \(C\). We call this process synthetic programming elicitation since it is inspired heavily by natural programming elicitation (Myers, Pane, and Ko 2004). To make this section more concrete, for each step of the study, we describe the specific steps we followed for targeting UCLID5.
First Step: Setup. First prepare LM subjects, select a target language, and collect or construct a set of programming tasks. In our case, our target language is UCLID5, our LM subjects are gpt-4-turbo-2024-04-09 and gpt-3.5-turbo-0125, and our programming tasks are a set of regression tests written by UCLID5 developers with corresponding natural language descriptions that were written by hand (see Chapter 7.5 for more details).
The second part of the study setup is to create multiple kinds of specialized prompts. The first will ask for the output to be in the target language directly. Subsequent prompts will ask for the output to use an imaginary API in a popular host language, like Python or Java. For example, for UCLID5, we generated prompts that look like the following.
“Write a UCLID5 model for a system that counts the number of times the temperature exceeds a threshold [\(\ldots\)]”
“FormalVerificationLibrary is a Python library for generating verification models [\(\ldots\)]. Use the FormalVerificationLibrary to model a system that counts the number of times the temperature exceeds a threshold [\(\ldots\)]”
Second Step: Execute. Now we execute every LM subject on every prompt and collect the outputs. Each query should be executed independently.
Third Step: Analyze and Design. Finally, we analyze the collected LM outputs and select \(P\) and \(C\) based on the analysis. To do this analysis, we follow a lightweight version of grounded theory (Glaser and Strauss 2017). See e.g., Stol, Ralph, and Fitzgerald (2016) for general guidelines on how to use grounded theory in software engineering research. See e.g., Barke, James, and Polikarpova (2023) for a specific, detailed case-study. Synthetic programming elicitation studies are inspired by and borrow many of the methods from those works. Our studies are categorically different, however, because we are not working with humans. We cannot interact with our participants in the same way, e.g., we cannot conduct semi-structured interviews. The goal of our studies is to design an intermediate language that more closely matches the code that LMs tend to generate.
The first step of our lightweight grounded theory analysis is to “code” the outputs generated by the LMs. In grounded theory parlance, “code” is akin to manually labeling parts of the generated outputs. For example, Listing 7 shows a hypothetical LM output for our running example along with grounded theory codes as comments. The second step is to group codes into concepts. Concepts are slightly more abstract than codes. For example, a concept that may emerge from Listing 7 is the group of codes 1, 2, and 3. The corresponding concept could be “using a class from the hypothetical library.” In the third step, we group concepts into categories. For example, we may group concepts related to the object oriented programming paradigm as one category. Finally, we select a \(P\) and \(C\) that are consistent with the final categories we observed across multiple prompts.
In our study, we found that, when asked to write UCLID5 code without any special prompting or training, no LM was able to produce code that parses (660 attempts: 10 samples per LM per benchmark, 33 benchmarks, two LMs). Worse still, the code generated by LMs was inconsistent, with each LM giving different outputs that resemble different programming languages at different times. When asked to write Python code that used a non-existent formal verification API, however, the LM outputs were more consistent. Therefore, we selected Python as our parent language, \(P\).
Specifically, the Python code was more consistent because LM outputs fell into three broad categories, which we call “from-scratch,” “procedural,” and “object-oriented,” respectively. Programs in the “from-scratch” category used existing APIs (e.g., the Z3 solver API (de Moura and Bjørner 2008)) to re-implement what UCLID5 does, e.g., to manually model transition systems. This was the smallest significant category. Programs in the “procedural” category imported a class from the hypothetical API, created an instance of it, and then called methods from the class to declare variables, assert specifications and so on. Programs in the “object-oriented” category imported a class from the hypothetical API and extended it, including methods that correspond closely to parts of UCLID5 code. For example, these extended classes often included a method that represented a transition relation—a critical component of UCLID5 programs and of transition systems generally.
After analyzing the outputs, we concluded that it would be easiest to translate Python programs from the “object-oriented” category to UCLID5. For example, we observed that methods which represent transition relations used a limited set of names, and that methods themselves could be compiled to UCLID5 rather directly. Therefore we defined a subset of Python that matches the “object-oriented” category and used that as our child language, \(C\). Essentially, \(C\) is an abstract class that the LM must extend. For example, the abstract class includes an abstract method called “next” that corresponds to the transition relation. Listing 13 shows a portion of an example of a Python program in \(C\) and Listing 6 shows a portion of its corresponding UCLID5 program.
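As a rough illustration of what the child language looks like, the sketch below defines an abstract base class in the spirit of the Module class that appears in Listing 10; the exact set of methods and their signatures here are assumptions, not the precise Eudoxus interface.

```python
from abc import ABC, abstractmethod

class Module(ABC):
    """Hypothetical base class that child-language programs must extend."""

    @abstractmethod
    def locals(self):
        """Declare the state variables internal to the transition system."""

    @abstractmethod
    def init(self):
        """Constrain the initial states."""

    @abstractmethod
    def next(self):
        """Define the transition relation: how the state evolves in one step."""

# A tiny example program in the child language C.
class Counter(Module):
    def locals(self):
        self.count = int

    def init(self):
        self.count = 0

    def next(self):
        self.count = self.count + 1
```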
Listing 8: Partial SPEAC prompt template for first generation.
Listing 9: Partial SPEAC prompt template for hole filling.
After our synthetic programming elicitation study—once \(P\) and \(C\) have been selected—we use minimal prompting to interface with LMs. For example, for UCLID5, we use the prompt template in Listing 8 to create initial programs and the prompt template in Listing 9 to fill in holes. More advanced prompting techniques are complementary and could help here. However, for this work, we used minimal prompting so as to independently evaluate our contribution.
Given a program \(p\) in \(P\), but not in \(C\), our aim is to remove as little as possible from \(p\) to bring it into the language \(C\). That is, given the AST for \(p\), we wish to generate the largest subtree of the AST, possibly containing holes, that is not rejected by the static checks of the language \(C\). For example, the code in Listing 10 contains some errors, including mismatches between variable and constant types (UCLID5 does not automatically cast constants, so a line like self.count = 0 is interpreted as assigning the integer literal 0 to the bit-vector variable self.count, which is a type error).
We find the largest consistent subtree in three steps. First, we use an error-tolerant parser to build an AST, \(A\), for the language \(P\). Second, beginning at the root of \(A\), we recursively descend and prune anything that cannot exist in an AST for the language \(C\), placing holes wherever they are needed. This is an aggressive program slice, similar to Akhundov et al. (2021), and it yields a new AST \(A'\). \(A'\) may or may not pass compiler checks, like type checking, or be semantically correct.
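A minimal sketch of this pruning pass, assuming a simplified AST representation and an illustrative whitelist of node kinds (neither taken from the Eudoxus code base):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str
    children: list["Node"] = field(default_factory=list)

HOLE = Node("hole")  # printed as ?? in the partial program

# Node kinds permitted by the child language C (an illustrative subset).
ALLOWED = {"module", "def", "assign", "if", "attribute", "name", "int", "bool"}

def prune(node: Node) -> Node:
    """Keep the largest subtree whose nodes all belong to C; replace the rest with holes."""
    if node.kind not in ALLOWED:
        return HOLE
    return Node(node.kind, [prune(child) for child in node.children])
```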
The third step finds the minimal number of AST nodes that need to be replaced with holes such that all static checks of the language \(C\) pass, using the approach of Pavlinovic, King, and Wies (2014). Specifically, for every static check of the language and for every node of the AST, we generate a set of soft clauses. Let \(S\) be the union of all the generated clauses, and let \(S'\) be the largest subset of clauses that is satisfiable (the output of a MAX-SMT solver). Any clause, \(c\), that appears in \(S\) but does not appear in \(S'\) represents a bug: if we add \(c\) to \(S'\), then we get an unsatisfiable set of clauses. So, for every such \(c\), we replace the AST node that generated \(c\) with a hole.
For example, in UCLID5, the two statements var x: bv32; and x = 0; would produce a set of clauses corresponding to the assertions that (a) x is a bitvector and that (b) x is an integer (since 0 is an integer). These assertions, together, are unsat, so MAX-SMT would return a set of clauses that does not include both (a) and (b). One solution would be to remove the bitvector type, producing the partial, but consistent, two statements var x: ??; and x = 0;.
We can bias the solver to include one or the other by adjusting the weights on the corresponding clauses. In practice, we assign clauses weights that are proportional to the depth of the corresponding node in the AST. This ensures that edits are closer to the leaves of the AST but it breaks syntactic minimality guarantees. In the case where weights are uniform and every node corresponds to exactly one clause, the MAX-SMT approach will make the minimal edit to the AST.
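The following sketch encodes the var x: bv32; / x = 0; example with Z3's Optimize engine; it is a minimal illustration of the soft-clause idea, not the Eudoxus implementation, and the selector names and weights are invented for this example.

```python
from z3 import Optimize, String, Bool, Implies, is_false, sat

opt = Optimize()

# One selector Boolean per AST node that contributed a static check.
decl_ok = Bool("decl_bv32")   # from `var x: bv32;`
assign_ok = Bool("assign_0")  # from `x = 0;`

# Hard constraints tie each selector to the check it represents.
x_type = String("type_of_x")
opt.add(Implies(decl_ok, x_type == "bv32"))      # check (a): x is a bit-vector
opt.add(Implies(assign_ok, x_type == "integer")) # check (b): x is an integer

# Soft clauses: prefer to keep every node; the weights bias which node the
# solver blames when the checks conflict (here, the declaration is cheaper to drop).
opt.add_soft(decl_ok, weight=1)
opt.add_soft(assign_ok, weight=2)

assert opt.check() == sat
model = opt.model()
# Selectors forced to false mark the AST nodes to replace with holes.
holes = [b for b in (decl_ok, assign_ok) if is_false(model.eval(b, model_completion=True))]
print(holes)  # [decl_bv32] -> repair `var x: bv32;` to `var x: ??;`
```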
Listing 10: Partial response for task in Listing 5 using gpt-3.5-turbo-0125. Line 5 declares state as a local variable of bit-vector type; lines 25, 33, and 42 use state as an integer.
class TrafficLight(Module):
def types(self):
self.state_t = BitVector(2)
def locals(self):
self.state = BitVector(2)
self.count = BitVector(6)
self.pedestrian = Boolean()
self.sigG = Boolean()
self.sigY = Boolean()
self.sigR = Boolean()
def inputs(self):
self.pedestrian = Boolean()
def outputs(self):
self.sigG = Boolean()
self.sigY = Boolean()
self.sigR = Boolean()
def init(self):
self.state = BitVector(2)
self.count = BitVector(6)
self.pedestrian = Boolean()
self.sigG = Boolean()
self.sigY = Boolean()
self.sigR = Boolean()
def next(self):
if self.state == 0:
self.sigG = False
self.sigY = False
self.sigR = True
self.count = 0
if self.count < 60:
self.count += 1
else:
self.state = 1
...
elif self.state == 3:
self.sigG = False
self.sigY = False
self.sigR = False
if self.count < 60:
self.count += 1
else:
self.state = 0
Listing 11: Partial first repair. Line 3 repairs the type of state to be an integer.
class TrafficLight(Module):
def locals(self):
self.state = int
self.count = int
self.pedestrian = bool
def outputs(self):
self.sigG = bool
self.sigY = bool
self.sigR = bool
def init(self):
self.state = ??
self.count = ??
self.pedestrian = ??
self.sigG = ??
self.sigY = ??
self.sigR = ??
def next(self):
if self.state == 0:
self.sigG = False
self.sigY = False
self.sigR = True
self.count = 0
...
Once we have a satisfiable set of clauses, we can generate a corresponding satisfying model and use the model to repair the partial program. This is where our work most differs from Pavlinovic, King, and Wies (2014). For example, the partial program var x: ??; from above would correspond to the set of clauses \(\{\texttt{is\_Integer}(\texttt{??})\}\). This clause would be satisfied by setting ?? to integer, and so we can repair the partial program to be var x: integer;.
Listing 12 shows one buggy AST and Listing 13 shows the corresponding fix. For example, the variable count is declared as an integer in the repaired program because it is used as an integer in the buggy program. For repairs that cannot be automatically resolved by the MAX-SMT model, we use the LM to generate code to replace the AST holes, as shown in Figure 13.
Listing 12: Partial response to the partial program in Listing 11 using gpt-3.5-turbo-0125. The outputs method should declare variables and their types, but on lines 6-9 it assigns variables to values (i.e., False instead of bool).
Listing 13: Partial second repair. The repair declares variables with the appropriate types on lines 5-7.
In this section, we implement a prototype of SPEAC for UCLID5, called Eudoxus, and use it to evaluate the performance of SPEAC across three research questions.
Does Eudoxus perform better than LM baselines on syntactic correctness (passing compiler checks)?
Does Eudoxus perform better on semantic correctness?
Is the computation time spent solving MAX-SMT queries prohibitive?
Eudoxus uses the \(P\) and \(C\) described in Chapter 7.4 and is implemented in Python. We use tree-sitter to parse and search partial programs, and Z3 (de Moura and Bjørner 2008; Bjørner, Phan, and Fleckenstein 2015) to solve MAX-SMT queries. We allow Eudoxus to repeat the repair loop five times. All MAX-SMT queries are solved locally on a 2.3 GHz Quad-Core Intel Core i7 with 32 GB of RAM. All LM calls are made through the OpenAI Python API.[Code available at https://github.com/uclid-org/eudoxus]
We compare Eudoxus against three LM baselines: few-shot, self-repair, and fine-tuning, using gpt-3.5-turbo-0125 (GPT3t) and gpt-4-turbo-2024-04-09 (GPT4t). For the few-shot baselines, we provide examples taken from the UCLID5 GitHub repository to the LM in context; we tried one in-context example and then three in-context examples, and we also combine few-shot prompting with chain-of-thought (Wei et al. 2022). As with many VLPLs, there are only a very limited number of natural language to UCLID5 examples in the public domain, and certainly not enough for fine-tuning. We do, however, have access to \(317\) regression tests from the open-source code base. For the purposes of fine-tuning, we use self-instruct (Y. Wang et al. 2023) to first ask the model to summarize the UCLID5 regression tests in natural language and then provide these summaries as the natural language descriptions during fine-tuning. We fine-tune GPT3t for 284 steps with a learning-rate multiplier of 0.16. For self-repair, we give the LMs compiler feedback in a loop for five iterations, starting from a zero-shot prompt.
We use three sets of UCLID5 benchmarks. The first set is a curated set of \(33\) regression tests with handwritten natural language descriptions. These tests are designed to cover a broad range of UCLID5 features and were used for the synthetic programming elicitation study in Chapter 7.4.1. The second set is a set of \(317\) regression tests without descriptions taken directly from the UCLID5 GitHub repository. These tests were used for fine-tuning our baseline model, as described above. This set is representative of the quantity and quality of data that a user may have available for a VLPL. The third and final set is a set of \(33\) exercises and examples taken from three textbooks (Baier and Katoen 2008; Lee and Seshia 2016; Huth and Ryan 2004). These benchmarks cover formal modeling of concurrent systems, linear-time properties, model checking and systems with discrete dynamics. We used this final set for the end-to-end evaluation below.
We run two variations of Eudoxus (one using GPT3t, one using GPT4t) and 11 variations of the baselines on all \(33\) textbook benchmarks. We report the fraction of outputs that pass all compiler checks (“parse,” for short) and semantic correctness over all 13 approaches in Table 4. Semantic correctness is manually assessed by one author using a combination of manual reasoning and hand-written specifications for all programs that compile. Correctness is rated on a scale from \(1\) to \(5\), where \(1\) is entirely wrong and \(4\) indicates that the output is correct except for a few minor errors (e.g., the example output described in Chapter 7.2 and shown in Listing 6). Intuitively, any score of \(\geq 4\) indicates that we believe the output would be useful to text-to-code users.
Table 4: Parse rate and distribution of semantic correctness scores (1 to 5) for each approach over the 33 textbook benchmarks.

| Model | Parse rate | Score 1 | Score 2 | Score 3 | Score 4 | Score 5 |
|---|---|---|---|---|---|---|
| Eudoxus (GPT4t) | 24/33 | 1/24 | 1/24 | 3/24 | 8/24 | 11/24 |
| Eudoxus (GPT3t) | 28/33 | 3/28 | 6/28 | 5/28 | 5/28 | 9/28 |
| Fine-tuned GPT3t | 2/33 | 0 | 2/2 | 0 | 0 | 0 |
| One-shot with COT (GPT4t) | 1/33 | 0 | 0 | 0 | 1/1 | 0 |
| One-shot with COT (GPT3t) | 0 | - | - | - | - | - |
| One-shot (GPT4t) | 0 | - | - | - | - | - |
| One-shot (GPT3t) | 0 | - | - | - | - | - |
| Three-shot with COT (GPT4t) | 3/33 | 0 | 0 | 0 | 2/3 | 1/3 |
| Three-shot with COT (GPT3t) | 1/33 | 0 | 0 | 0 | 0 | 1/1 |
| Three-shot (GPT4t) | 4/33 | 0 | 0 | 0 | 3/4 | 1/4 |
| Three-shot (GPT3t) | 2/33 | 0 | 0 | 1/2 | 0 | 1/2 |
| Self-repair (GPT4t) | 0 | - | - | - | - | - |
| Self-repair (GPT3t) | 0 | - | - | - | - | - |
Eudoxus outperforms all baselines in terms of compiler checks (see “Parse rate” in Table 4), passing all compiler checks on \(78\%\) of benchmarks (52 of the 66 attempts across the two Eudoxus variants). There are four benchmarks on which Eudoxus hits the iteration limit and fails to produce a result, and three benchmarks with small syntax errors due to bugs in our Eudoxus implementation (e.g., failing to extract code snippets from LM outputs or printing UCLID5 code incorrectly). In contrast, we find that GPT3t and GPT4t rarely produce UCLID5 models that even parse. The best results for GPT3t come from fine-tuning, but it is only able to produce two programs that parse. The best results for GPT4t come from three-shot prompting, but it is only able to produce four programs that parse. Given this data, we answer RQ1 in the affirmative: Eudoxus performs better than standard LM baselines on syntactic correctness. More interestingly, the two versions of Eudoxus generated programs that passed all compiler checks for 30/33 unique problems; the 11 versions of the baselines together generated programs that passed all compiler checks for only 6/33 unique problems.
Eudoxus outperforms all baselines in terms of semantic correctness (see Figure 14), scoring 5/5 on more benchmarks than any other approach. Every unique problem that the baselines perform well on (four unique problems where some baseline scored four or higher) was also solved by Eudoxus; these are likely the easiest problems in the test set. On the other hand, 33 of the 52 compiling programs produced by Eudoxus (covering 24 unique problems) scored four or higher. This data suggests that our approach does not negatively impact semantic performance, but it is difficult to draw conclusions since so few of the programs generated by the baselines pass compiler checks.
In terms of execution time, Eudoxus with GPT3t took an average of 0.9 seconds (SD 0.6) in repair steps and 7.2 seconds (SD 4.6) in LM calls. Eudoxus with GPT4t took an average of 1.8 seconds (SD 2.0) in repair steps and 35.1 seconds (SD 24.7) in LM calls. We conclude that, in terms of execution time, LM calls are more expensive than our static repairs.
While our results are promising, there are three limitations to our evaluation. First, we only evaluated SPEAC on one VLPL, UCLID5. It remains to be seen whether our results can be replicated across different VLPLs, especially those in different domains. Second, we only evaluated SPEAC with two LMs. It is possible that our work will not generalize across different kinds of LMs. Third, only one author evaluated the semantic correctness of the programs generated by Eudoxus. While the main takeaway of our work is that we are able to generate syntactically correct programs where baselines cannot, it is possible that we have over- or under-estimated semantic correctness.
In this chapter, we described how to build the arrow into Figure 1(C). That is, we described a new methodology for building tools that generate code in very low-resource programming and formal languages from natural language descriptions. Our approach, SPEAC, bridges the gap between code that LMs are good at writing and code that we need to write for formal verification and related tasks through a mix of programming language design, human-computer interaction, and formal methods. We borrow ideas from the intersection of human-computer interaction and programming languages research to discover what LMs find “natural” in a given domain. We then design a corresponding intermediate language for which LMs are good at writing code and for which we can automatically translate programs into our target language, using ideas from formal methods.
We empirically demonstrated that this combination makes it possible to generate syntactically correct and semantically meaningful code in VLPLs like UCLID5, where traditional prompting, decoding, and fine-tuning strategies fall short. Specifically, SPEAC produces code that passes compiler checks at a much higher rate than prior approaches (syntactic correctness), and shows promising results in semantic correctness.
Building a correct distributed system requires logical reasoning. For simple distributed systems in low-stakes environments, reasoning with pen and paper may be enough. At larger scales, however, like industrial settings, reasoning about all possible executions of a distributed system quickly becomes unwieldy. While the underlying reasoning is the same (basic deduction), at these larger scales distributed systems engineers need computational support. In this dissertation, we built the needed computational support: a full automated reasoning stack for verifying the correctness of distributed systems. The left side of Figure 1 visualizes the external interface to our stack; the right side of Figure 1 visualizes the internal details of the stack itself.
Automated reasoning stacks, like our own, transform domain-specific questions, layer by layer, into logical queries, solve these queries with general-purpose reasoning engines, and then transform the results back into domain-specific answers. Unlike existing work, in this dissertation, we tailored every layer of the automated reasoning stack to our target domain. In Chapter 5 we built a verification engine that is directly and exclusively tailored to distributed systems safety verification. In Chapter 6 we built specification mining tools for the same setting. Other layers were more general but still built to specifically address challenges encountered in our target domain. In Chapter 3 we built a solver that handles the data structures that our verification engine depends on. In Chapter 4 we built a tool that improves solver performance on the workloads generated by our verification engine. In Chapter 7 we provided a methodology for getting LMs to generate code in very low resource programming languages, like our verification language.
Overall, this dissertation aims to be an exemplar for how to build a full automated reasoning stack that targets a single domain. Along the way, however, we made more general contributions. Specifically, we provided components and principles that are reusable in, and adaptable to, new domains. In the next section, we explore an entirely new domain that highlights the power and limitations of automated reasoning, and motivates our future work directions discussed in Chapter 8.2.
American-style crosswords, like those found in the New York Times, consist of a set of clues, a grid of white and black cells, and a solution. The solution to a crossword is a sequence of characters, usually a word, per clue. For example, the answer for the clue “software platform suitable for Starbucks?” could be “java.” You can think of the clues as semantic constraints and the grid as a set of syntactic constraints. Every clue has an orientation, down or across, and a starting cell in the grid. The answer to each clue must be laid out, character by character, cell by cell, starting from the designated cell, moving in the correct orientation, over only the white cells of the grid. This imposes length constraints on answers and leads to answers that share characters. The goal of the crossword game is to guess the solutions from the clues and the grid: players must guess a solution that satisfies both the semantic and syntactic constraints. Figure 15 shows one example crossword puzzle.
Many people play crossword games every day; fewer people construct crosswords. Similarly, existing work, like Wallace et al. (2022), has looked at automatically solving crossword puzzles, but few projects exist for automatically generating crossword puzzles. In this section, we focus on the latter problem, and specifically on the challenges of solving it with current LMs.
The distinction between semantic and syntactic crossword constraint generation helps us isolate the capabilities and limitations of LMs: LMs are relatively good at generating clues and answers (semantic constraints) but bad at generating crossword grids (syntactic constraints). On the other hand, solvers cannot generate clues and answers—they are not built to reason about natural language—but they can easily generate crossword grids. In this section, we evaluate the above claims about syntactic constraints.
To ask LMs to generate a crossword grid (syntactic constraints), we use the prompt immediately following this paragraph, where {WORDS} is an input parameter. We provide the LM with a list of words for which we know a subset will fit nicely in a grid; the LM just has to lay out the input words in any correct grid. In Chapter 8.1.2 we test LM capabilities with this prompt, using existing crosswords as the input value for {WORDS}. The crossword data that we use in the evaluation has been publicly available on GitHub for more than four years and is likely in the training set of the model. To avoid simple regurgitation, we include an instruction in the prompt to generate a mini crossword with the fewest possible black cells. At the very least, the model will have to regurgitate the input crossword with the fewest number of blanks. We use few-shot prompting and request the output in markdown.
Q: Make a 5 by 5 crossword grid with the fewest possible blanks
(use * for blanks) as a markdown table using the following
fills: DIP, IMOUT, SAUCE, CYCLE, HAN, DISC, IMAY, POUCH, UCLA,
TEEN.
A:
| | | | | |
|-|-|-|-|-|
|D|I|P|*|*|
|I|M|O|U|T|
|S|A|U|C|E|
|C|Y|C|L|E|
|*|*|H|A|N|
Q: Make a 5 by 5 crossword grid with the fewest possible blanks
(use * for blanks) as a markdown table using the following
fills: {WORDS}
A:
| | | | | |
|-|-|-|-|-|
In this section, we describe how to generate the syntactic constraints for a crossword with a bounded number of mar programs. Our encoding is inspired by Dimitri Bohlender's blog. Our encoding depends on a formal problem definition. Given a set of words and a natural number k, the crossword construction problem is to generate a k × k table such that: (1) every cell contains either a blank or a letter appearing in the input words; (2) every maximal horizontal or vertical run of letters spells a word from the input set; (3) every letter is “checked,” i.e., it belongs to both a horizontal and a vertical word; (4) the placement of blanks is rotationally symmetric; and (5) the number of blanks is minimized. The fourth constraint is standard but not strictly enforced: full-size New York Times crosswords sometimes have different symmetries (like along an axis), and the mini puzzles often relax the symmetry constraint altogether. Notice that Figure 15 is rotationally symmetric.
To make the details of the encoding more concrete and easier to follow, we will use the running example of a 2 × 2 table with {to, do} as the input set of words. For example, because of the symmetry constraint, the crossword construction problem defined by a 2 × 2 table with {to, do} as the input set of words has no solution with fewer than four blanks.
The first four conditions of the crossword construction problem can be solved by a single mar program. We will start with these conditions and then extend our encoding to handle the fifth condition, the optimization, in the next subsection. We begin by declaring an enumerated type (enum) to represent letters. We want to capture exactly the unique set of characters that appear in the input set of words (ignoring case) plus a special blank character. For the running example, the enum would contain the letters t, o, and d, plus the blank.
For a k × k table, we create k·k + 1 constants of type letter: one for each table position plus one to represent a special off-table position. For example, for a 2 × 2 table, the mar program has five such constants: p00, p01, p10, and p11 for the four cells, and off for the off-table position.
For each word in the input set, we create a set of boolean constants that represent possible table positions and orientations (down or across) for that word. For example, for a 2 × 2 table and an input set of words {to, do}, we create 16 boolean constants (2 words times 4 possible start positions times 2 possible orientations).
...
let to_00_down: boolean
let to_00_across: boolean
let to_01_down: boolean
let to_01_across: boolean
let to_10_down: boolean
let to_10_across: boolean
let to_11_down: boolean
let to_11_across: boolean
let do_00_down: boolean
...
let do_11_across: boolean
...
The type declaration above—and to some extent the constant declarations—enforce the first condition of the crossword construction problem. We now need to enforce the second, third, and fourth conditions. We accomplish this with five different kinds of constraints: position, word start, off table, checked, and symmetry.
Position constraints ensure that if a word is selected, then the table is filled in correctly at the appropriate position and orientation: the cell before the word is a blank, the sequence of characters starting at the given position following the given orientation spells out the letters of the given word, and the cell after the word is a blank. For example, part of the final assertion for the running example will ensure that, if to_00_down is set to true, then p00 will have the letter t and p10 will have the letter o.
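As a rough sketch of what such a position constraint looks like in a solver, the snippet below uses Z3's Python API rather than mar; the sort and constant names mirror the running example but are otherwise illustrative.

```python
from z3 import And, Bool, Const, EnumSort, Implies, Solver

# Letters appearing in {to, do}, plus a special blank value.
Letter, (T, O, D, BLACK) = EnumSort("Letter", ["T", "O", "D", "BLACK"])

# One constant per cell of the 2x2 table, plus the special off-table position.
p00, p01, p10, p11, off = (Const(n, Letter) for n in ["p00", "p01", "p10", "p11", "off"])

to_00_down = Bool("to_00_down")

s = Solver()
# If "to" is placed downward from cell (0,0): the cell before the word and the
# cell after it (both off the table for a 2x2 grid) are blank, and the cells
# visited spell t then o.
s.add(Implies(to_00_down, And(off == BLACK, p00 == T, p10 == O)))
```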
We also need to ensure that at most one position and orientation is selected for each word. For example, part of the final assertion for the running example will ensure that the word “to” is only placed in one position.
...
assert at-most-one(to_00_down, to_00_across, to_01_down, to_01_across,
to_10_down, to_10_across, to_11_down, to_11_across)
...
Here at-most-one is a helper function that ensures that at most one from a set of boolean variables can be true at the same time.
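A minimal sketch of such a helper, using a pairwise encoding over Z3 Booleans; the mar helper may use a different encoding.

```python
from itertools import combinations
from z3 import And, Bool, Not, Or, Solver

def at_most_one(*bs):
    """At most one of the given Boolean variables is true (pairwise encoding)."""
    return And([Or(Not(a), Not(b)) for a, b in combinations(bs, 2)])

# Usage: the word "to" may be placed at no more than one position/orientation.
placements = [Bool(f"to_{r}{c}_{o}") for r in (0, 1) for c in (0, 1) for o in ("down", "across")]
s = Solver()
s.add(at_most_one(*placements))
```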
Word start constraints ensure that whenever a blank is followed by two letters in a given orientation, some word from the input set starts at the first letter's position. For the running example, we would include the following constraint.
...
assert ((off == BLACK() and p00 != BLACK() and p10 != BLACK()) ==>
(to_00_down or to_00_across or to_01_down or to_01_across or
to_10_down or to_10_across or to_11_down or to_11_across or
do_00_down or do_00_across or do_01_down or do_01_across or
do_10_down or do_10_across or do_11_down or do_11_across))
...
The off table constraint is the simplest: the off-table position, off, is a BLACK().
A letter is checked if it belongs to a vertical and a horizontal word. All letters must be checked and the checked constraints ensure this. For each cell in the table, if the cell holds a letter, then there must be a horizontally neighboring letter and a vertically neighboring letter. For the running example, this would include constraints like the following.
...
assert p00 != BLACK() ==>
((off != BLACK() or p01 != BLACK()) and
(off != BLACK() or p10 != BLACK()))
...
Finally, symmetry constraints ensure that the table is rotationally symmetric with regards to the blanks. For a 2 × 2 table, these constraints require that p00 is a blank if and only if p11 is a blank, and that p01 is a blank if and only if p10 is a blank.
One possible solution to all the constraints described above is to set all the cells to be blanks. But that is not a very interesting crossword! The fifth condition of the crossword construction problem aims to minimize the number of blanks in the hopes of finding an interesting solution.
The mar language does not support minimization directly, however, so instead we use a sequence of mar programs to solve the optimization problem. Each individual program combines the constraints for the first four conditions of the crossword construction problem with an at-most-k constraint that puts an upper limit on the number of blanks in the solution. For the running example, this means adding an at-most-k constraint over the four indicators that say whether each cell is a blank.
Here at-most-k is a helper function that uses an efficient existing encoding to ensure that at most k from a set of boolean variables can be true at the same time, similar to the at-most-one constraint we used before. The highest possible number of blanks is the number of cells and the lowest possible number of blanks is zero. To solve the optimization problem, we conduct a binary search between these two limits until we find the smallest k for which our at-most-k constraint holds. Each step of the binary search is a mar program. If the mar program has a solution, we try a smaller k in the next step; if there is no solution, we try a larger k.
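The sketch below shows this outer binary search in Python with Z3, using Z3's built-in AtMost pseudo-Boolean constraint in place of the mar at-most-k helper; grid_constraints and blank_indicators are placeholders for the real encoding of the first four conditions.

```python
from z3 import Solver, AtMost, sat

def min_blanks(grid_constraints, blank_indicators):
    """Binary-search for the smallest k such that the first four conditions
    are satisfiable with at most k blank cells.

    grid_constraints: list of Z3 Booleans encoding conditions (1)-(4).
    blank_indicators: one Z3 Boolean per cell, true iff that cell is a blank.
    Both arguments are placeholders for the real encoding.
    """
    lo, hi = 0, len(blank_indicators)  # fewest and most possible blanks
    best = None
    while lo <= hi:
        k = (lo + hi) // 2
        s = Solver()
        s.add(grid_constraints)
        s.add(AtMost(*blank_indicators, k))  # the at-most-k constraint
        if s.check() == sat:
            best = (k, s.model())
            hi = k - 1   # a solution with k blanks exists: try fewer
        else:
            lo = k + 1   # no solution with at most k blanks: allow more
    return best
```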
To test if LMs can generate syntactic constraints for crossword puzzles, we asked OpenAI's o4-mini-2025-04-16 (o4m) to build crossword grids from words and clues that previously appeared in the New York Times. Specifically, we took words and clues from five mini crossword puzzles (5 by 5 grids, roughly 10 words each) published between January 2019 and February 2021 and asked o4m to create a mini crossword grid from the words using the prompt described in Chapter 8.1.1.1.
We repeated this experiment 10 times. Each time we used a fresh combination of five existing mini crosswords from a set of 10 randomly selected crosswords. Previous GPT models fail spectacularly. The o4m model performs surprisingly well. In 10 attempts, o4m generated three syntactically correct crosswords. That is, it was able to regurgitate an existing crossword three times. Of the three correctly generated grids, none are minimal. That is, none of the three are the crossword from the input with the fewest number of blanks. In these three cases, o4m successfully regurgitated an existing crossword but did not (successfully) do any reasoning about minimality. The following is an example of a correct grid generated by o4m, which appeared in the New York Times on August 26th, 2019.
Seven out of 10 times, o4m hallucinated words or created nonsense grids. The following grid is an example of an incorrect grid generated by o4m, which includes hallucinated words, like “orosg.”
Laying out words in a grid is an obvious application of backtracking search. Generating text from left to right, top to bottom, will require making guesses that may or may not pay off. When the guesses do not pay off, backtracking is required. For example, consider the incorrect grid generated by o4m above. The model was given 50 words (five mini crosswords of 10 words each), and asked to construct a five by five crossword grid. The LM started the construction by selecting the word “amigo.” Immediately, this meant that the word down the middle column had to start with an “i,” but no such word existed in this particular input set. At this point, the LM should have backtracked: it should have started again with a word other than “amigo,” ideally learning from its mistake. Instead, the LM hallucinated the word “ivado,” which was not in the input set of words, and continued its construction.
Using our logical encoding, we can instantly solve (\(<1\) second on a 2.3 GHz Quad-Core Intel Core i7 processor) all of the benchmarks described in Chapter 8.1.1.1. We can also generate more interesting crosswords. For example, Figure 15 was automatically generated by our solver using the encoding described in this chapter. Specifically, we took every word from every Monday crossword (usually 15 by 15 grids) published in the New York Times between January 1st, 2010, and December 31st, 2014, generated the encoding described above, and used a solver to generate the grid in Figure 15.
Building a crossword grid is like programming. The lines of code that we write are guesses. These guesses have implications that are hard to predict, and we may or may not need to revisit them. Data structures and functions are rarely exactly right the first time we define them, and LMs are not beyond human reasoning capabilities. Despite previous work, like Gandhi et al. (2024) and Xie et al. (2025), LMs still struggle with backtracking search. In contrast, solvers routinely perform better than expected on real-world problems that require backtracking (Ganesh and Vardi 2020). Programming and the crossword construction problem are no exception to this pattern. In Chapter 7, we use a MAX-SMT solver to pick the optimal backtracking point. For the crossword construction problem, the SMT solver handles all backtracking directly.
But crossword construction and programming will never be solved by solvers alone. Clue and answer generation is a quintessential natural language problem. Similarly, naming variables and many other programming tasks require natural language understanding. In this regard, LMs have a hope of solving these problems while solvers do not: LMs may one day perform backtracking search effectively but SMT solvers will never reason about natural language effectively. Until that time, which may never come, the most effective tools will combine the powers of both approaches.
It is not yet clear how best to combine neural and symbolic approaches. In Chapter 7 we proposed one combination that has exciting implications for the neuro-symbolic reasoning research area. For example, T. Olausson et al. (2023) follow the tool-use paradigm to solve natural language word problems by equipping an LM with an automated theorem prover tool. These same authors find that one of the biggest limitations of this approach is that LMs frequently generate syntactically incorrect calls to the theorem prover, which is precisely the problem we solved with Eudoxus for UCLID5. We aim to explore these exciting implications, particularly for reasoning about distributed systems described in natural language and diagrams. The dream is that developers will be able to ask questions about their documented designs and get meaningful feedback powered by scalable automated reasoning.
Our work has interesting implications for the neuro-symbolic research area beyond the tool-use paradigm. For example, agentic AI systems, which have recently gained popularity, can be thought of as distributed systems: independent computing nodes work together to accomplish a shared goal by sending and receiving messages. Today, however, little work exists on verifying properties of agentic systems. For example, there are few tools available to ensure that private information does not leak from one agent in a system to another. We believe that programming, testing, and verification languages like P will play a crucial role in building safer agentic systems.
All of these advances will depend on the availability of fast solvers. The problem is that we do not know what logical theories and fragments we will need to support or what industrial queries will look like tomorrow. This is true for the whole field of automated reasoning: the demand for solvers is growing in often unpredictable directions. We aim to meet this demand through eager solvers. Eager solvers are promising for distributed solving, but handling theory combinations is currently challenging. We are particularly excited about facing these challenges by generating eager solvers from declarative specifications. Similarly, we are excited to explore solvers with custom LM generated components. Existing work, like Romera-Paredes et al. (2024), has successfully generated heuristics for combinatorial solvers. We believe that similar approaches can be used to generate hyper domain-specific heuristics that improve solver performance well beyond what general purpose solvers can attain today.
Finally, language design will continue to play a crucial role in all future developments, whether for getting the most out of LMs, like our work in Chapter 7, or for communicating with engineers, like our work in Chapter 5. Automated reasoning engines do not currently support many programming paradigms and features. For example, OCaml's new concurrency model (Sivaramakrishnan et al. 2021), based on algebraic effects, is popular in programming language circles but has received much less attention in the automated reasoning community. Therefore, today, human and non-human programmers have few choices when it comes to proving the correctness of these programs. We are interested in new programming language paradigms and features, like algebraic effects, and in building full automated reasoning stacks for their users, broadly construed.
“Dafny currently relies heavily on Z3’s ability to quickly instantiate an absurd number of quantified axioms, and I think other solvers just can’t keep up with that.” (Quiles, Tomb, and Pearce 2023)↩︎
“Cloud computing sales are expected to rise to $2 trillion by the end of the decade… Generative artificial intelligence is forecast to account for about 10-15% of the spending” (Goldman Sachs 2024)↩︎
“… at least 75% of the respondents work for organizations using the cloud.” (Loukides 2021)↩︎
“UniSuper, an Australian pension fund that manages $135 billion worth of funds and has 647,000 members, had its entire account wiped out at Google Cloud, including all its backups that were stored on the service.” (Amadeo 2024)↩︎
“Microsoft has notified customers that it’s missing more than two weeks of security logs for some of its cloud products, leaving network defenders without critical data …” (Whittaker 2024)↩︎
“Users reported issues accessing critical AWS services, including EC2 instances, S3 storage, and RDS databases. Amazon acknowledged the problem …” (Sayegh 2024)↩︎
“We cannot expect all AWS users to be experts in formal methods, have the time to be trained in the use of formal methods tools, or even be experts in the cloud domain” (Rungta 2022).↩︎
“This limits the reach of formal verification: scalability will require teaching many more people to engineer proofs” (Dodds 2022).↩︎
“The full power of automated reasoning is not yet available to everyone because today’s tools are either difficult to use or weak” (Cook 2019).↩︎
See https://docs.python.org/3.10/whatsnew/3.10.html.↩︎
See https://openjdk.org/jeps/360.↩︎
A simple implementation of the mar compiler is available at https://github.com/FedericoAureliano/thesis.↩︎
Code available at https://github.com/uclid-org/algaroba.↩︎
Code and data available at https://github.com/uclid-org/medley-solver/tree/SAT2021.↩︎
See https://www.erlang.org/doc/index.html.↩︎
See https://akka.io/.↩︎
Available at https://github.com/wilcoxjay/mypyvy/blob/master/examples/ring_leader_election.pyv.↩︎
Code available at https://github.com/uclid-org/upverifier↩︎
Appeared in the New York Times Crossword on Monday, March 3, 2014.↩︎
Across: ALMA; SLEET; HOIST; INCAN; PEAS. Down: ALONE; LEICA; MESAS; ATTN; SHIP.↩︎
See https://bohlender.pro/blog/generating-crosswords-with-sat-smt/.↩︎
Data from https://github.com/chrisnunes57/nyt-mini-crosswords.↩︎
Data from https://github.com/doshea/nyt_crosswords.↩︎