Scalable and Usable Domain-Specific Automated Reasoning
By
Federico Mora
A dissertation submitted in partial satisfaction of the requirements for the degree of
Doctor of Philosophy
in
Computer Science
in the
Graduate Division
of the
University of California, Berkeley
Committee in charge:
Professor Sanjit A. Seshia, Chair
Associate Professor Alvin Cheung
Assistant Professor Natacha Crooks
Assistant Professor Lindsey Kuper
Summer 2025
Distributed systems are pervasive today, from cloud computing to Internet-of-Things (IoT) systems to connected automotive vehicles. However, building dependable distributed systems remains a challenging problem. While major strides have been made in formal methods for verifying distributed systems, successful users of formal methods must be both domain experts and experts in automated reasoning—a rare combination. In this dissertation, we address this challenge through a new automated reasoning stack that integrates artificial intelligence and formal methods at every layer.
The new automated reasoning stack consists of four key tools: Eudoxus, the P Verifier, Algaroba, and MedleySolver. Eudoxus is a neuro-symbolic tool that helps users generate formal models and specifications from informal artifacts. The P Verifier extends P, a formal modeling and specification language that developers are already using to test their distributed system designs, with support for formal verification and specification mining. To do this, the P Verifier generates satisfiability modulo theories (SMT) queries that Algaroba solves. Algaroba is currently one of the fastest SMT solvers in the world for the theory of quantifier-free algebraic data types. MedleySolver coordinates solvers, like Algaroba, to get better performance than any individual solver over realistic workloads, like those generated by the P Verifier.
We used our work to verify existing systems from related work and three new case studies; our industrial collaborators are actively using our work to verify their systems and classic protocols. Beyond these industrial applications, our stack represents the beginning of a broader vision for automated reasoning: one whose fundamental principle is that the full stack should be tailored to users’ needs rather than the other way around.
This dissertation is the product of six years’ worth of formal and informal collaborations and mentorships. Chief among my mentors is Sanjit A. Seshia, my advisor, who provided the steady hand and support that anchors this dissertation. I was fortunate to have two thesis committees. My formal committee, Alvin Cheung, Natacha Crooks, and Lindsey Kuper, encouraged me to be more ambitious and to pursue problems that excite me. My informal committee, Sarah E. Chasins, Marsha Chechik, Ankush Desai, Vijay Ganesh, Yi Li, Yatin A. Manerkar, Elizabeth Polgreen, and Max Willsey, helped me develop my research vision and shaped the researcher that I am today.
During my PhD, I was fortunate to mentor Isaac Chan, Selina Kim, Haley Lepe, Annamira O’Toole, Nikhil Pimpalkhare, and Amar Shah. Their work, ideas, and enthusiasm were my main source of energy. I was also fortunate to work with or learn from Murad Akhundov, Rohan Bavishi, Murphy Berzish, Sahil Bhatia, Dmitry Blotsky, Serena Caraco, Benjamin Caulfield, Kai-Chun Chang, Anirudh Chaudhary, Kevin Cheang, Pei-Wei Chen, Karim Elmaaroufi, Nick Feng, Pranav Gaddamadugu, Adwait Godbole, Mike He, Vincent Hui, Aniruddha Joshi, Sebastian Junges, Edward Kim, Mitja Kulczynski, Shivendra Kushwah, Kevin Laeufer, Niklas Lauffer, Caroline Lemieux, Shaokai Lin, Justin Lubin, Mae Milano, Victor Nicolet, Rohan Padhye, Yash Pant, Jiwon Park, Lauren Pick, Eric Rawn, Alejandro Sanchez Ocegueda, Joseph Scott, Ameesh Shah, Pramod Subramanyan, Hazem Torfah, Victoria Tuck, Marcell Vazquez-Chanlatte, Justin Wong, Ang Zhendong, and Parker Ziegler. Most of the work in this dissertation would not be possible without these folks.
Outside of work, I shared many happy moments with Emily Aiken, Kwasi Amofa, Jordan Baker, Sidney Buchbinder, Linda Chio, Amrutha Dorai, Allison Lui, Jackie Maslyn, Gabriel Matute, Elizabeth May, Nikita Mehandru, Amanda Meriwether, Sara Monedero, Andoni Mourdoukoutas, Suraj Nair, Carlos Ng, Jerome Quenum, Samantha Robertson, Ricardo Sandoval, Alok Tripathy, Tomás Valencia Zuluaga, and Ann Wang.
None of this work would be relevant if it were not for my family. Manuela Mora Rocha, Esteban Mora Rocha, Santiago Mora Camargo, Claudia Rocha Tamayo, Beatriz Tamayo Escobar, María Mercedes Tamayo Escobar, Amir El-Bayoumi, Jardena Gertler-Jaffe, Maja Kalaba, Mira Kalaba, Roy Mugume, Shizuka Ebata, Johnny Zhang, and Sena Ebata-Zhang all contributed with their love and support.
Finally, María Díaz de León Derby’s influence is hidden in every page of this dissertation. Gracias por ser mi equipo.
Everyday tasks, like scheduling and tax reporting, require logical reasoning. If I book a long camping trip in the summer, I might not have vacation days to spare to see family in the winter. If I contribute to my retirement fund, I can reduce my overall tax rate at the cost of money to use today. In general, this reasoning might be boring, but, with the help of pen and paper, it is easy to handle at the scale that most individuals need. These same reasoning tasks also exist at much larger scales, though. Scheduling a professional sports league will require more time and effort than scheduling my time off; reporting corporate taxes will be more complicated than reporting my graduate student income. While the underlying reasoning is the same—basic deduction—at these larger scales, pen and paper will no longer suffice. At these larger scales, we need specialized computational support.
This dissertation is about building tools for automating logical reasoning at scale. These tools must be domain-specific: we would not use the same tools for scheduling, tax reporting, and every other possible domain. Everything from the user interface to the kinds of questions that we ask will change across domains. The only constant is that the underlying reasoning will always be basic logical deduction. As such, developers frequently build automated reasoning stacks that, layer by layer, transform domain-specific questions into logical queries, solve these queries with general-purpose reasoning engines, and then transform the results back into domain-specific answers. The overall success of these automated reasoning stacks depends on how they perform at scale.
Performing well at scale depends not only on the individual layers but also on how they interact. Logical reasoning problems are usually intractable or even undecidable, so logical queries must be carefully constructed to match the heuristics of the chosen reasoning engine. For example, Dafny (Leino 2010) is a programming language and verification engine that is tightly integrated with the Z3 (de Moura and Bjørner 2008) solver. When users try to replace Z3 with other solvers, even ones that generally perform comparably to Z3, Dafny’s overall performance greatly suffers.1 In this dissertation, we show how to build a full automated reasoning stack, including both theory and tools, that is tailored to a single domain. Intentionally building a full automated reasoning stack all at once is an uncommon pursuit. We aim to provide a blueprint for similar efforts along with components and principles that are reusable in, and adaptable to, new domains. To ground our work, we focus on the domain of distributed systems safety verification.
Distributed systems are sets of computing nodes that work together to act as one system. These systems are increasingly pervasive, complex, and critical. For example, some estimates suggest that over 75% of organizations across industries use some form of cloud computing and that cloud revenues could reach $2 trillion by 2030.2 3 These industries include everything from entertainment and banking to health and science, and are geographically distributed across the world. To meet this demand, software engineers are building globally distributed systems (which power cloud services) that coordinate with each other to provide users with strong performance, availability, and fault tolerance. For example, AWS uses 114 Availability Zones and over 700 edge locations to power their cloud services.
Unfortunately, these complex cloud systems are not immune to bugs. For example, in 2024, the three major cloud providers—Amazon, Google, and Microsoft—all experienced significant failures, including outages and customer data loss.4 5 6 These failures happen despite extensive engineering efforts. For example, Liu et al. (2019) find that fault and timing-related bugs account for 44% of production Microsoft Azure incidents despite these kinds of bugs being carefully tested for. More concerning still, these issues exist in designs, not just implementations. For example, Michael et al. (2017) find that three previously published distributed system designs suffer from “serious correctness violations.”
Much of the challenge of designing and implementing distributed systems stems from bugs that can arise from complex combinations of machine failures and message orderings. This complexity is difficult for humans to reason about manually (Gabrielson 2019). As distributed systems become increasingly critical infrastructure, engineers will need more computational support to build and deploy them correctly.
Automated reasoning engines promise to be the computational support that engineers need. These engines can solve tedious, mission-critical logic problems billions of times a day (e.g., Rungta (2022)). In the domain of distributed systems specifically, these tools have helped find bugs and prove the correctness of industrial systems (e.g., Newcombe et al. (2015)). Unfortunately, despite their power and flexibility, adoption of automated reasoning engines remains low for one fundamental reason. Today, successful users must be experts in their application domain and in automated reasoning—a rare combination. Specifically, users must be able to precisely describe their problem in a formal language, such as first-order logic, and know enough about automated reasoning to make sure their encoded problem is practically solvable.7 8 9
This dissertation describes a stack of automated reasoning tools that help software engineers reason about their distributed systems. The stack reduces the need for automated reasoning expertise by allowing engineers to verify their systems in a modeling language they are already using (Chapter 5), providing one of the fastest solvers in the world for a relevant fragment of logic (Chapter 3), and partially automating the interaction with automated reasoning tools (Chapter 4, Chapter 6, and Chapter 7).
Our work follows a long tradition in formal methods that took an important turn with Lynch and Tuttle (1987). Lynch and Tuttle introduced a new modeling formalism for distributed algorithms, I/O Automata, that accentuated the distinction between internal and external actions of system components. This distinction helps decompose systems and the reasoning about them, and did not exist in many earlier modeling languages, such as Communicating Sequential Processes (Hoare 1978) and the Calculus of Communicating Systems (Milner 1980). We provide a brief overview of I/O Automata in Chapter 2.5.
Work on I/O automata and similar modeling formalisms has continued for decades with an increasing emphasis on automation and executable code generation. For example, Garland and Lynch (2000) defined a formal language for describing and reasoning about I/O automata that interfaces to both a theorem prover and a model checker, and can generate code in a standard programming language. Around the same time, TLA+ and its extensions, which are still used today, provided similar functionality and interfaces (Lamport 2002). More recent works, such as Ivy (Kenneth L. McMillan and Padon 2020), have continued this tradition as well, incorporating more automation by encoding complex questions about models in a decidable fragment of logic.
Much of the work since the 80s has focused on two kinds of specifications: safety and liveness. Informally, safety properties state that nothing bad happens in a system; liveness properties state that something good will eventually happen. For example, for a leader election protocol, one might want to ensure the safety property that at most one leader is elected at any given time. It would be bad if two candidates were elected for one position. One might also want to ensure the liveness property that we will have a leader: it would be good for the election to conclude successfully. Leader election protocols are a class of consensus protocols where a set of nodes, each performing its local computation and communicating by sending and receiving messages, must agree on a specific value or result. Consensus protocols, in turn, are a class of distributed systems.
In distributed systems, like the consensus protocols described above, we often draw a distinction between synchronous and asynchronous systems. Informally, when messages are always sent and received within a known, fixed amount of time and nodes execute at a known, fixed relative speed, we call the system synchronous (Dwork, Lynch, and Stockmeyer 1988). When messages can take an arbitrarily long amount of time to deliver, or nodes execute at arbitrary relative speeds, we call the system asynchronous. Real-life systems are generally asynchronous. When we send a packet over the Internet, we do not have formal guarantees about the delivery time or whether the packet will be delivered at all. Unfortunately, this makes liveness properties difficult and often impossible to prove in the asynchronous setting. For example, Fischer, Lynch, and Paterson (1983) proved that, in the asynchronous setting, distributed consensus is impossible in the presence of even one faulty process.
The automated reasoning stack that we build, therefore, focuses on safety properties of fully asynchronous systems. As we will see throughout the dissertation, certain decisions are tailored exclusively to this context while others are more general. We pair the more general contributions with work that helps tailor them to our specific domain. This mix of general-but-adaptable and highly domain-specific contributions resolves the central tension explored by the body of work in this dissertation: creating an entirely new approach for each new domain is costly, so we instead develop general tools and techniques that can express problems across multiple domains and then tailor them to a specific one. In short, we must be general yet adaptable to achieve both usability and scalability.
To better understand our domain and contributions, consider the following distributed system inspired by Le Lann (1977) and Chang and Roberts (1979), called the ring leader election protocol. The system consists of a set of nodes that work together to elect a leader. Each node has a unique integer label, and the goal of the protocol is for the nodes to collectively discover the node with the greatest label. Nodes are arranged in a ring: each node can only receive messages from the node on its “left” and can only send messages to the node on its “right.” Nodes execute asynchronously and begin their execution by sending their label “right.” If a node receives a label greater than its own, it forwards the received label. If the received label is smaller, then the node sends its own label “right.” Finally, if the received label is equal, then the node declares itself the winner. Figure 1(A) depicts an instance of this system with three nodes and one of its executions: node 3 is to the right of node 2; 2 to the right of 1; 1 to the right of 3; and node 3 declares itself the winner after receiving its own label from node 2. Figure 1(B) depicts one possible safety specification for this system: at most one node should declare itself the winner of the election.
Given a distributed system model and a specification, such as the ring leader election protocol and the simple safety property described above, the goal of our work is to automatically decide if the model is guaranteed to satisfy the specification. The execution depicted in Figure 1(A) does satisfy the specification depicted in Figure 1(B), but will all possible executions of the system do the same? Figure 1(G) depicts the response from the automated reasoning stack that we describe in this dissertation. The checkmark represents that the system does indeed satisfy the specification for all executions; the X mark represents that the attempted proof failed. Together, Figure 1(A), (B), and (G) depict the external interface to the automated reasoning stack. Interactions between users and the automated reasoning stack can be longer—Figure 11 depicts a more complicated example.
The next three subsections (Chapter 1.2.1, Chapter 1.2.2, and Chapter 1.2.3) give an overview of the automated reasoning stack itself—the right side of Figure 1. These three subsections cover Chapter 3, Chapter 4, Chapter 5, Chapter 6, and Chapter 7. In Chapter 2, we provide the necessary background on first-order logic to understand our work; we present a simple logical language, mar, that we use to ground our work, and we describe performance optimizations for checking mar programs related to proof caching and decomposition. We conclude this dissertation in Chapter 8 with plans for future work. Specifically, in Chapter 8.1, we present the problem of crossword construction as an example where automated reasoning can be useful but cannot exist on its own. We then briefly describe exciting future directions that combine automated reasoning engines with neural network-based tools.
In Chapter 5, we take a domain-specific language for modeling distributed systems, called P (Desai et al. 2018, 2013), and add support for formal verification through proofs by induction. This work corresponds to the arrow from Figure 1(D) to Figure 1(E): it takes a formal model and specification and generates a logical query, Figure 1(E). The logical query corresponds to the question, “Does induction prove that the model satisfies the specification?” Before our work, P had support for explicit-state model checking (similar to testing), and engineers inside Amazon were already successfully using it to find bugs in real systems (Desai 2023; Terry 2024). Our work allows these same engineers to write proofs of correctness in the modeling language they are already using (by providing inductive specifications). In short, we make the modeling process more direct and usable for distributed systems engineers. The material in Chapter 5 is based on Mora et al. (2023), which is joint work with Elizabeth Polgreen, Ankush Desai, and Sanjit A. Seshia.
Formal verification is a challenging task that is often simplified through abstraction. For distributed systems, this means that existing verification frameworks often abstract away the details of explicit message passing. The main added challenge in verifying P programs is that they explicitly model message passing. Proofs, therefore, have to consider lemmas over the low-level details of single messages. To remedy this, we provide users with language support for reasoning about sequences of logically related messages, called message chains. Message chains are related to message sequence charts (ITU-T 2011) from the software engineering community and choreographic programming languages (Montesi 2014) from the programming languages community. Both message sequence charts and choreographic programming offer developers a way to reason about explicit message passing but with the context of how messages flow through the system in question. We use message chains to bring these same ideas to formal verification.
We implemented our approach in a prototype tool called the P Verifier. The P Verifier uses UCLID5 (Polgreen et al. 2022; Seshia and Subramanyan 2018), a general-purpose verification engine, as an intermediate backend. UCLID5, in turn, generates satisfiability modulo theories (SMT) queries that are satisfiable if a proof by induction fails. We evaluated the P Verifier by verifying models of systems from related work, as well as two system models used within Amazon Web Services. We found that, with message chains, proofs use a similar number of lemmas as, and runtime is comparable to, verification tools that require more abstract modeling. A new version of the P Verifier, which is part of the P language toolchain, is powering verification efforts inside Amazon Web Services today (P Team 2025).
Formally modeling and specifying a domain-specific task can be a tedious process. It is one thing to give an informal description of a distributed system like the ring leader election protocol and its safety specification. It is another thing altogether to formally define the system in a verification language and prove its correctness for all possible executions. This is true when using domain-specific languages, like P, and also in general purpose verification languages, like UCLID5.
In Chapter 7, we present an approach to building text-to-formal-model tools that help automate the modeling process. These tools correspond to the arrow into Figure 1(C): they take informal artifacts, Figure 1(A) and (B), and produce formal code, Figure 1(C). To handle informal artifacts, like unstructured text, these tools depend on language models (LMs). The primary challenge is that LMs struggle to generate code in low-resource programming languages, such as most verification languages. Existing solutions either provide no guarantees and perform poorly (e.g., fine-tuning and prompting) or are not expressive enough to capture real modeling languages (e.g., constrained decoding with context-free languages). We overcome these challenges with two key ideas originating from two distinct fields of computer science: human-computer interaction and formal methods. The material in Chapter 7 is based on Mora et al. (2024), which is joint work with Justin Wong, Haley Lepe, Sahil Bhatia, Karim Elmaaroufi, George Varghese, Joseph E. González, Elizabeth Polgreen, and Sanjit A. Seshia.
The first key idea behind our approach draws inspiration from natural programming elicitation, a class of studies that help programming language designers understand how programmers naturally approach problems within a given programming domain (Myers, Pane, and Ko 2004). Programming language designers use the findings of these studies to create languages that align with users’ expectations, resulting in reduced programming friction and more effective developers. We borrow this idea for the setting where LMs are the “users.” Akin to uncovering what human users find natural for a given domain, we uncover what LMs find “natural.” Specifically, our first insight is to embrace LMs’ tendencies and design an intermediate language that aligns with the LM’s output distribution for tasks in our domain.
Our first key idea makes it possible for LMs to generate reasonable programs, but it presents a new challenge: different languages have different features, such as type systems, that make translation difficult. The second key idea behind our approach is to use an automated reasoning engine—a MAX-SMT solver in this case—to identify the minimal set of locations in the intermediate program that prevent a correct translation to the target language. We then repair these locations using deductive techniques and LM calls, repeating the process as necessary. This symbolic component guarantees that generated programs pass all compiler checks, including type checking, for the target modeling language.
We implemented a prototype of our approach in a tool called Eudoxus, which translates natural language text into UCLID5 programs. Eudoxus outperforms fine-tuning, self-repair, few-shot prompting, and chain-of-thought prompting over test problems collected from three well-known textbooks. Specifically, our tool correctly solved 33% of textbook problems, while the next best approach solved only 3%.
Once we have formal code, we can analyze it, execute it, and augment it to find bugs or help prove correctness properties. In Chapter 6, we show how to automatically augment formal models with specifications—specifically some of the lemmas required for verification in Chapter 5—using variations on classic learning algorithms, such as non-erasing pattern learning (Angluin 1980) and recursive function synthesis. This process is represented by the arrow from Figure 1(C) to Figure 1(D): it takes a formal artifact, Figure 1(C), and augments it with formal specifications, Figure 1(D). The intuition behind our approach is that message chains reveal enough structure for existing techniques to learn meaningful specifications from distributed system execution data alone. That is, given only example message chains (no access to the system definition or specifications), existing learning techniques can discover properties that hold for all executions of a system. Users can then use these properties in their safety proofs. The material in Chapter 6 is based on Mora et al. (2023), which is joint work with Elizabeth Polgreen, Ankush Desai, and Sanjit A. Seshia.
In Chapter 3, we describe a solver called Algaroba that answers logical queries like those generated by Eudoxus and the P Verifier: Algaroba is the arrow from Figure 1(E) to Figure 1(F). Specifically, Algaroba handles SMT queries with algebraic data types, which naturally capture programming data structures, like message chains, and appear extensively in the questions generated by the P Verifier and other distributed systems verification projects (e.g., Zhang et al. (2024)). Algaroba is currently one of the fastest solvers available: it won the quantifier-free algebraic data type track at the 2024 SMT competition (Bromberger, Bobot, and Jonáš 2024). The material in Chapter 3 is based on Shah, Mora, and Seshia (2024), which is joint work with Amar Shah and Sanjit A. Seshia.
Algaroba takes an eager approach to solving queries. Specifically, it takes a quantifier-free query containing algebraic data types and translates it to a quantifier-free query without algebraic data types that existing non-algebraic data type solvers can handle. Algaroba eliminates algebraic data types by replacing them with uninterpreted sorts and functions along with a finite number of quantifier-free axioms. The key technical challenge with eagerly solving algebraic data type queries is that this kind of translation seems like it should require an unbounded number of axioms (e.g., lists can be arbitrarily long, but no list is equal to any of its sub-lists). Despite this, we proved that our finite translation is sound and complete.
Most other automated reasoning engines in this space take a lazy approach. In the lazy approach, theory axioms are introduced as needed rather than upfront. Since Algaroba takes a fundamentally different approach to existing tools, its performance profile is unique: Algaroba solves many queries that no other solver succeeds on, and other solvers succeed on queries for which Algaroba fails. This phenomenon is typical for SMT solvers. The problem is that, for any given query, it is hard to know which solver will succeed ahead of time. Visually, if every solver is an arrow out of Figure 1(E), the question is, “Which arrow will lead to an answer in the least amount of time?”
In Mora et al. (2021), we defined three solving algorithms for SMT queries over the theory of strings and provided a fixed classifier for deciding when to use which algorithm. In Chapter 4, we generalize that work to an online learning scheme that supports any logical theory. Our tool, called MedleySolver, takes a stream of queries from a given domain. For each query, MedleySolver picks a sequence of solvers to execute, returns an overall answer, and learns from each individual solver’s performance. Over time, without any human expertise, MedleySolver automatically discovers the best solver coordination strategy for queries in the target domain and solves more queries than any individual solver using significantly less time. This work corresponds to the arrow at the bottom of Figure 1 and is based on Pimpalkhare et al. (2021), which is joint work with Nikhil Pimpalkhare, Elizabeth Polgreen, and Sanjit A. Seshia.
Overall, this dissertation presents novel integrations of artificial intelligence and formal methods at every level of the automated reasoning stack for scalable and usable domain-specific automated reasoning. We apply our work to distributed systems modeling and verification but aim to be an exemplar for how to build a full automated reasoning stack that targets any domain. In particular, we make the following contributions.
We create the P Verifier, a formal verification engine for the P programming language. Engineers already use P for modeling and testing their distributed systems. Existing verification engines for distributed systems do not provide users with language primitives or abstractions to support explicit message passing. Working at the message passing level—where a significant source of implementation errors lies—therefore usually comes at a cost: significant manual modeling and user effort. We introduce the idea of message chain invariants, seamlessly integrate them into the P Verifier, and show that, with message chain invariants, explicitly reasoning about message passing does not make verification more challenging. Specifically, message chain invariants give users the vocabulary to reason about the low-level details of explicit message passing while remaining at an intuitive level of abstraction. We demonstrate this empirically by replicating verification efforts from related work, simplifying these proofs using message chains, and then verifying a system with complex message passing (the onion routing network) and two industrial case studies.
We present a new specification mining toolkit that uses the structure provided by message chains to discover meaningful invariants for distributed system models. Our tools take execution data and infer specifications that likely hold for all executions. This reduces the burden on verification users by suggesting invariants that would otherwise require manual effort to discover and formalize. We extend a classic learning algorithm for non-erasing patterns, give a domain-specific definition of optimality, and prove that our algorithm correctly discovers the optimal specification using an NP oracle. We empirically evaluate our tools and find that six out of 24 invariants that were used in P Verifier proofs can be mined automatically in under a second each. Without the structure of message chains, the underlying learning algorithms cannot discover meaningful properties.
We present a new auto-formalization method that generates quality code in very low resource programming languages—like languages for verification—where existing work fails to even generate syntactically correct code. Auto-formalization methods take natural language text and return code in a formal language. Existing work focuses on prompting, constrained decoding, and fine-tuning strategies, which are unreliable, insufficiently expressive, or data hungry. We propose a complementary design-based solution. Specifically, we design an intermediate language that LMs are good at writing and can be compiled down to the target, low resource language. When LMs write code that is not exactly in the intermediate language, we use automated reasoning to identify the minimal error source and automatically repair the code at that source. We empirically evaluate our approach on a very low resource language in the verification domain, and show that we can substantially improve the syntactic correctness of generated code, producing code that passes all compiler checks 84.8% of the time compared to 9.1% by the best baselines.
We present one of the fastest SMT solvers in the world for queries over the theory of quantifier-free algebraic data types. Most existing solvers take a lazy approach. We define a new eager approach, prove that it is sound and complete, and demonstrate its empirical performance over three large benchmark sets. These results were independently validated at the 2024 SMT competition (Bromberger, Bobot, and Jonáš 2024), where our solver won the quantifier-free algebraic data type track.
SMT users typically solve more than one query at a time and do not need to solve the same query more than once. We present the first online learning-based solution for the problem of improving solver performance in this typical SMT setting. Existing work—like solver selection tools—can help by using expensive solver runtime data to optimize solver performance generally. We frame the problem as an online learning problem and combine multi-armed bandits with timeout prediction schemes to match, and sometimes exceed, the performance of existing work without any expensive pre-training.
In this chapter, we provide a short background on satisfiability modulo theories (SMT) solvers. We then define algebraic data types, both as a programming construct and as a logical theory, and we define a simple programming language, called the Modeling and Automated Reasoning (mar) language, that can express useful satisfiability queries over algebraic data types. We use mar to express all the contributions of this dissertation. For example, later in this chapter, we define I/O automata and how to reason about I/O automata using mar.
Boolean satisfiability (SAT) solvers have been shown to efficiently solve a number of NP-hard problems in areas such as AI planning (Kautz and Selman 1992), verification (Clarke et al. 2001), and software testing (Cadar et al. 2006). Satisfiability modulo theories (SMT) solvers are a natural extension of SAT solvers that reason about many-sorted first-order logic instead of just propositional logic. Specifically, SMT solvers can reason about first-order structures defined by background theories (Barrett et al. 2021), allowing them to tackle more general problems or to accept more succinct inputs. For example, SMT solvers can reason about bit-vectors (Brummayer and Biere 2009), floating-point numbers (Rümmer and Wahl 2010), strings (Bjørner et al. 2012), and algebraic data types (Barrett, Fontaine, and Tinelli 2017).
In this section, we provide a minimal reminder of concepts from many-sorted first-order logic and model theory that are required to understand what SMT solvers are. Many-sorted first-order logic is like first-order logic but with the universe partitioned into different sorts (Barrett et al. 2021). For a complete introduction to first-order logic, we refer the reader to Lubarsky (2008). We focus on quantifier-free many-sorted first-order logic.
Symbols in many-sorted first-order logic are either logical or non-logical. Logical symbols include logical connectives, like disjunction (\(\lor\)), conjunction (\(\land\)), and negation (\(\neg\)). Non-logical symbols include predicate symbols and function symbols. Predicates and functions have an associated arity. In first-order logic, the arity is the number of arguments that the predicate or function takes. In our context, many-sorted first-order logic, the arity is more like a type signature: it defines the input sorts and the return sort. We treat predicates as functions whose return sort is Boolean: the set of predicates is a subset of the set of functions. When a function takes no arguments, we call it a constant. The set of non-logical symbols and their associated arity is given by a signature.
Expressions are formed from the symbols in a given signature. There are two kinds of expressions: terms and formulas. The set of terms is the smallest set such that every constant is a term and, if \(t_1, ..., t_n\) are terms and \(f\) is a function with appropriate arity, then \(f(t_1, ..., t_n)\) is a term. The set of formulas is the smallest set such that every application of a predicate to terms of appropriate sorts is a formula and, if \(\phi_1\) and \(\phi_2\) are formulas, then \(\neg \phi_1\), \(\phi_1 \land \phi_2\), and \(\phi_1 \lor \phi_2\) are formulas.
A theory literal is a predicate application or its negation. These are the base units of SMT solving and are the equivalent of literals in SAT solving. A clause is a disjunction of theory literals. A cube is a conjunction of theory literals. A formula is in conjunctive normal form (CNF) iff it is a conjunction of clauses. A formula is in disjunctive normal form (DNF) iff it is a disjunction of cubes. A formula is in negation normal form (NNF) if only theory literals are negated. A formula is flat if all theory literals are of the form \(\neg (t_1 = t_2)\), \(t_1 = t_2\), or \(t_1 = g(t_2, ..., t_n)\) where \(t_i\) are constants.
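For example, over constants \(w\), \(x\), \(y\), \(z\) and a unary function \(f\), the formula \((x = y \lor \neg(y = z)) \land (w = f(x))\) is flat, in NNF, and in CNF (it is a conjunction of two clauses), but it is not in DNF.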
A first-order theory is a set of formulas (axioms). For example, the theory of Equality and Uninterpreted Functions (UF) consists of axioms that restrict the possible interpretations of \(=\) (reflexivity, symmetry, transitivity) and of all function symbols (congruence) (Burch and Dill 1994). A structure is a many-sorted universe along with a function that describes how all non-logical symbols must be interpreted over the many-sorted universe. We extend the notion of interpretation to terms and formulas inductively, as expected. When a formula \(\phi\) is interpreted as true in a structure \(\mathcal{M}\), we say that \(\mathcal{M}\) satisfies \(\phi\), denoted \(\mathcal{M} \models \phi\), and we call \(\mathcal{M}\) a model of \(\phi\). As a slight abuse of notation, for a set of formulas \(\Phi = \{\phi_1, ..., \phi_n\}\), we use \(\mathcal{M} \models \Phi\) to mean \(\mathcal{M} \models \bigwedge_{i=1}^n \phi_i\). Similarly, we use \(\phi \models \psi\) to mean every model of \(\phi\) is a model of \(\psi\). For a model \(\mathcal{M}\), we use \(\mathcal{M}[f]\) to represent the interpretation of \(f\) in \(\mathcal{M}\).
Definition 1 (SMT Query and Solver). An SMT query is a formula-theory pair. We say that an SMT query (\(\phi, T\)) is SAT iff there exists a model \(\mathcal{M}\) such that \(\mathcal{M} \models T \cup \{\phi\}\). Otherwise we say the query is UNSAT. SMT solvers take an SMT query as input and return SAT or UNSAT. Figure 1(E) represents an SMT query, Figure 1(F) represents a SAT/UNSAT answer, and the arrow from (E) to (F) represents an SMT solver.
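For example, the SMT query \((f(x) = x \land \neg(f(f(x)) = x), \text{UF})\) is UNSAT: the equality and congruence axioms force \(f(f(x)) = f(x) = x\). Dropping the first conjunct yields a SAT query; one witnessing model has a two-element universe \(\{0, 1\}\) with \(\mathcal{M}[x] = 0\) and \(\mathcal{M}[f]\) mapping both elements to \(1\).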
An SMT solver \(S\) is sound for a theory \(T\) if, for every formula \(\phi\), \(S(\phi, T) = \text{SAT}\) implies that the query (\(\phi, T\)) is SAT. An SMT solver is complete for a theory \(T\) if, for every formula \(\phi\), the query (\(\phi, T\)) is SAT implies that \(S(\phi, T) = \text{SAT}\). The distinctive aspect of SMT solvers is that they perform an encoding to SAT, either implicitly or explicitly. Eager SMT solvers perform a satisfiability-preserving reduction to SAT in a single phase (e.g., see Seshia (2005)) whereas lazy solvers perform an iterative encoding, on demand.
In Chapter 3, we use the following definition to build a sound and complete solver for the quantifier-free theory of algebraic data types using a sound and complete solver for the theory of uninterpreted functions and equality.
Definition 2 (Theory Reduction). A theory \(T\) reduces to a theory \(R\) if there is a computable function \(m\) such that \[(\psi, T)\; \text{is SAT} \leftrightarrow (m(\psi), R)\; \text{is SAT}\]
In Chapter 7, we use an optimization variant of SMT solvers, called weighted maximum satisfiability modulo theories solvers (MAX-SMT solvers).
Definition 3 (MAX-SMT Query and Solver). The maximum satisfiability problem is, for a given (CNF) formula \(\phi\), to determine the largest subset of \(\phi\)’s clauses that is SAT. That is, solvers aim to satisfy as many clauses as possible. The weighted maximum satisfiability problem (MAX-SMT) is a variation with two differences. First, some clauses can be “hard”—meaning they must be satisfied. Second, every “soft” clause (any clause that is not “hard”) has an associated weight. The problem is then to determine the subset of \(\phi\)’s clauses that maximizes the sum of weights while being SAT and containing all “hard” clauses.
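For example, suppose the hard clause is \(x = y\) and the soft clauses are \(\neg(y = z)\), with weight 2, and \(x = z\), with weight 1. No model satisfies all three clauses, because the hard clause together with \(x = z\) implies \(y = z\), so a MAX-SMT solver returns the subset containing the hard clause and \(\neg(y = z)\), for a total weight of 2.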
Algebraic data types (ADTs) are a representation of finite trees that are common in functional programming languages. The power behind ADTs lies in how they can succinctly express complex structures at a high level of abstraction while avoiding common programming pitfalls, like null pointer dereferencing (Hoare 1975). For most of their history, ADTs lived exclusively inside functional programming languages, like NPL (Burstall 1977), Standard ML (Milner 1997), and Haskell (Hudak et al. 2007). Recently, however, interest in ADTs has exploded, with a number of mainstream languages either being released with support for ADTs, e.g., Rust (Jung et al. 2021), or adding support later, e.g., Python10 and Java.11
An algebraic data type consists of a set of constructors (node names), selectors (directed edge names), and testers (predicates). Each constructor has a fixed set of selectors associated with it (possible edges out). Each selector has an associated type (each edge can only point to a fixed set of node names). Every constructor is associated with exactly one unique tester: that tester returns true on a given tree iff the root of the tree is labeled with the corresponding constructor. Every instance of an algebraic data type (a particular finite tree built from those node and edge names) must be acyclic.
We denote the theory of algebraic data types as DT. It contains the full theory of UF along with the following additional axioms.
Definition 4 (Theory of Algebraic Data Types). An algebraic data type \(A\) is a sort \(\sigma\) along with a set of constructors, a set of selectors, and a set of testers, as described above.
Furthermore, every algebraic data type \(A\) satisfies the following axioms—the last axiom is known as the acyclicality axiom.
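Informally, and in line with standard presentations of the theory, these axioms require that selectors invert constructors (applying a selector associated with a constructor \(C\) to a term built with \(C\) returns the corresponding argument), that the tester associated with \(C\) returns true exactly on terms whose root is \(C\), that every term of sort \(\sigma\) is built by exactly one of the constructors, and that no term is equal to any of its proper descendants (acyclicality).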
For two terms \(s\) and \(t\), if \(s\) can be obtained by applying a sequence of \(l\) selectors to \(t\), then we say \(s\) is an (\(l^\text{th}\)) descendant of \(t\).
An algebraic data type term is any expression of an algebraic data type sort. The set of normal algebraic data type terms is the smallest set containing (1) constants (0-ary constructors) and (2) constructors applied only to normal algebraic data type terms. It is useful to think of normal terms as trees: constants are leaves and we can build larger trees by applying constructors to normal terms.
In Chapter 3.2 we define an algebraic data type called tower. The definition of tower uses two constructors: Empty and Stack. These are the two possible ways to build a tower. Empty is a function that takes no inputs and outputs a tower. Stack is a function that takes a block (another simple algebraic data type) and a tower and outputs a tower. Each constructor has a corresponding set of selectors. Empty has no selectors and so it is a normal term, but Stack has two selectors, top and rest. We often apply selectors using dot notation, e.g., x.top. Selectors can be thought of as de-constructors—we use them to get back the terms used to construct a tower. The tower definition implicitly defines two testers: is Empty and is Stack. These predicates take a tower and return true iff the argument was built using the matching constructor. In mar syntax, which we describe in Chapter 2.3, tower is defined as follows.
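// sketch of the declaration from Chapter 3.2; block is a separate, simple algebraic data type
data tower = Empty() | Stack(top: block, rest: tower)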
We deviate from the SMT-LIB standard slightly in the vocabulary we use related to algebraic data types. The following clarifies the connection. If a sort \(t\) has a non-empty set of constructors, then we call \(t\) an algebraic data type. If \(t\) has a single constructor \(c\) and every selector \(s\) of \(c\) is such that it is not the case that \(t\) appears in the sort of \(s\), then we call \(t\) a record. When \(t\) is a record with constructor \(c\), we call every selector \(s\) of \(c\) a field of \(t\). If \(t\) is an algebraic data type and every constructor of \(t\) has no associated selectors, then we call \(t\) an enum. When \(t\) is an enum, we call every constructor of \(t\) a variant of \(t\) and often treat the enum as a set.
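For example, in mar syntax, the following two illustrative declarations define a record point with fields x and y (where natural is a data type representing natural numbers, defined later in this chapter), and an enum color with variants Red, Green, and Blue.

data point = Point(x: natural, y: natural)
data color = Red() | Green() | Blue()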
The modeling and automated reasoning language (mar) is a simple functional programming language that supports satisfiability testing, algebraic data types, and simple function contracts.12 We use mar as an intermediate language: it is easy to translate to SMT-LIB 2.6—the standard input language for most modern SMT solvers—and it introduces just enough of an abstraction to express the contents of this dissertation. One can think of mar as the smallest version of UCLID5 needed to understand the contributions of this dissertation.
We describe the syntax of mar programs in seven pieces: algebraic data type declarations, type synonym definitions, type expressions, function declarations, annotations, expressions, and commands. The syntax for algebraic data type declarations with type parameters is as follows, where r is a name representing the data type, each Ci is a name representing a constructor, each sij is a name representing a selector, and each tij is a type expression corresponding to the selector sij.
The syntax for type declarations is as follows, where n is a name representing a type synonym, and t is a type expression.
The syntax for type expressions is as follows, where a is a name representing a type parameter or synonym.
The syntax for a function declaration is as follows, where f is a name representing the function, each xi is a name representing a parameter, each ti is a type expression representing the type of xi, y is a name representing the return value, t is a type expression representing the return type, and e is an optional expression.
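Concretely, the function declarations used later in this chapter take shapes like the following sketch; the named return value and the = e body can each be omitted, as in let subtract-two (x: natural) -> natural = x.p.p and let a: natural.

b ::= let f (x1: t1, ..., xn: tn) -> (y: t) = e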
Function declarations can be annotated with pre- and post-conditions. The syntax for these annotations is as follows, where f is the name of a previously declared function, e1 is a (boolean) expression over the arguments xi of f, and e2 is a (boolean) expression over the arguments xi of f and f(x1, ..., xn).
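Concretely, the annotations used throughout this dissertation take the following shape, where n is a name for the annotation (a sketch based on the examples in this chapter).

a ::= f requires n: e1
a ::= f ensures n: e2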
The syntax for an expression is as follows, where each Ci is a name representing a constructor, each xi is a name representing a variable, f is a name representing a function, each ei is an expression, s is a selector, and each ti is a type expression.
e ::= true
e ::= false
e ::= x
e ::= e1 is C
e ::= e1.s
e ::= e1 == e2
e ::= e1 != e2
e ::= not e
e ::= e1 and e2
e ::= e1 or e2
e ::= e1 ==> e2
e ::= if e1 then e2 else e3
e ::= f(e1, ..., en)
e ::= C(e1, ..., en)
e ::= match e1 with | C1(x1 ... xn) -> e2
| ...
| Cm(x1 ... xn) -> em
The syntax for commands is as follows, where e is an expression, and each ni is a name representing a declared annotation.
The syntax for a program is as follows, where d is a type declaration, b is a function declaration, a is a function annotation, and c is a command.
For example, the following is a simple mar program:
data natural = Z() | S(p: natural)
data option-natural = None() | Some(value: natural)
let get-natural (x: option-natural) -> (y: natural) =
match x with
| Some(n) -> S(n)
| None() -> Z()
print get-natural(Some(S(S(Z())).p))
get-natural requires x-is-some: x is Some
get-natural ensures y-is-s: not y is Z
A mar program is valid if it is well-typed and all functions are total. We define the semantics of valid mar programs. Algebraic data type declarations, function declarations, and expressions have their expected meaning as described in Chapter 2.2. Given the one-to-one correspondence, we will frequently move between mathematical syntax, e.g., \(p \land q\), and mar syntax, e.g., p and q. For code and transformations we will use mar syntax; for formal reasoning and meta theory we will use mathematical syntax. The main point of interest for this section is defining the semantics of mar commands.
We split the semantics of mar programs into two pieces: composed and contract semantics. The composed semantics describes the behavior of assert and print statements using function contracts in place of function bodies. The contract semantics describes the meaning of proof commands, which check if function bodies and calls satisfy their contracts.
Let \(\phi\) be the conjunction of expressions in assert commands, let \(f\) be a function with input parameters \(x_1, ..., x_n\) and output parameter \(y\), and let \(e_f(x_1, ..., x_n, y)\) be a boolean function capturing the conjunction of expressions in ensures annotations for the function \(f\). The meaning of mar programs in the composed semantics is the satisfiability query \((\phi_f \land \psi, DT)\), where \(\phi_f\) is like \(\phi\) but with every call to \(f\) replaced by a fresh variable \(y\), and \(\psi\) constrains every fresh variable \(y\) to respect \(e_f\). If \((\phi_f \land \psi, DT)\) is SAT, then the meaning of print commands is determined, as expected, by any model of \((\phi_f \land \psi, DT)\). If \((\phi_f \land \psi, DT)\) is UNSAT, then the meaning of every print command is “UNSAT.”
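For instance, consider the simple mar program shown earlier that defines get-natural. It contains no assert commands, so \(\phi\) is simply true. The call get-natural(Some(S(S(Z())).p)) in the print command is likewise replaced by a fresh variable \(y\), and \(\psi\) is the ensures annotation y-is-s instantiated at \(y\), i.e., \(\neg(y~\text{is}~Z)\). The resulting query is SAT, so the print command displays the value of \(y\) in some model, e.g., S(Z()).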
The meaning of mar programs in the contract semantics is two labels on each function. There are two possible label values: “proven” and “rejected.” Let \(f\) be a function with input parameters \(x_1, ..., x_n\), let \(r_f(x_1, ..., x_n)\) be a boolean function capturing the conjunction of expressions in requires annotations for the function \(f\), and let \(e_f(x_1, ..., x_n)\) be a boolean function capturing the conjunction of expressions in ensures annotations for the function \(f\). The first label on a function \(f\) is “proven” iff every call to \(f\) with arguments \(e_1, ..., e_n\) is such that the satisfiability query \((\neg r_f(e_1, ..., e_n), DT)\) is UNSAT. The first label on a function \(f\) is “rejected” if there exists a call to \(f\) with arguments \(e_1, ..., e_n\) such that the satisfiability query \((\neg r_f(e_1, ..., e_n), DT)\) is SAT. The second label on a function \(f\) is “proven” iff the satisfiability query \((r_f(x_1, ..., x_n) \land \neg e_f(x_1, ..., x_n), DT)\) is UNSAT, where \(x_1, ..., x_n\) are fresh variables. Like for the first label, if the same query is SAT, then the second label is “rejected.” In all cases, the expressions \(e_1, ..., e_n\) use the composed semantics.
To execute a mar program is to solve all the corresponding satisfiability queries.
The semantics of mar programs as described above is sufficient for understanding the contributions of this dissertation. In this section, we shift our focus to more practical considerations related to writing and reasoning about mar programs. These contributions are yet to be published but are an essential part of making automated reasoning usable and scalable. Specifically, in this section, we introduce proof commands and give more detailed semantics, including a new label, “unknown.”
In general, writing mar programs is not a one-shot process. Users refine their models and specifications repeatedly, and they make frequent calls to solvers under the hood as they go. In Chapter 4 we describe how to automatically learn to pick the best solver over sequences of queries, like those that we would expect to see from a typical mar user. This is a form of adaptability. In this chapter, we describe a mar backend strategy that gives users two more desired features: non-repetition and progress.

Non-repetition refers to a backend that avoids calling a solver to perform the same reasoning multiple times. Progress refers to a backend that avoids reopening previously settled reasoning questions. To better understand the features and the distinction between them, consider the following mar program.
data natural = Z() | S(p: natural)
let subtract-two (x: natural) -> natural = x.p.p
let add-three (x: natural) -> natural = S(S(S(x)))
The program declares a data type that represents natural numbers, natural, and two functions that act on natural numbers, subtract-two and add-three. These functions use selectors and constructors. Since selectors can lead to undefined behavior—selecting p from Z() can result in any value—we likely want to require that subtract-two is never called with an input value of less than two. The following code snippet enforces this precondition.
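subtract-two requires x-is-gte-2: x is S and x.p is S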
Every time we use subtract-two in the code, the backend must check that the precondition holds. For example, if we declare an uninterpreted function a of type natural and then print subtract-two(add-three(a)), then the backend must check that add-three(a) is greater than or equal to two. Without any postconditions on add-three, the solver will not be able to prove that the precondition on subtract-two holds for that call. The simplest possible postcondition that we can add says that the result of an application of add-three is greater than or equal to two. The following snippet captures this constraint.
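add-three ensures y-is-gte-2: add-three(x) is S and
                              add-three(x).p is S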
Running the mar backend will result in a successful proof that the precondition x-is-gte-2 is satisfied for the function application in the print statement. If we run the backend again on this same input, we should not need to check this fact again. This is the simplest form of non-repetition. This version of non-repetition can be achieved by caching queries.

Now suppose that we add the assertion subtract-two(add-three(a)) is Z. One might expect this assertion to block all models and therefore that printing subtract-two(add-three(a)) should fail. But, in the composed semantics, the backend must only consider the function contracts, and therefore the print statement should actually display Z(). To get the response that we might expect when considering the body of the add-three function, we need a different postcondition. Specifically, we can add the following postcondition.
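add-three ensures y-is-gte-3: add-three(x) is S and
                              add-three(x).p is S and
                              add-three(x).p.p is S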
For the composed semantics, this new postcondition will allow the backend to conclude UNSAT, and the print statement will fail, as expected. Unfortunately, this will also result in a new query for the contract semantics. We already proved that the precondition x-is-gte-2 will hold for the call subtract-two(add-three(a)) using the postcondition y-is-gte-2. But now, the naive backend would have to re-prove this fact with a larger query: the new query would contain both y-is-gte-2 and y-is-gte-3. In many realistic cases—programs that are more complex than this toy example—adding constraints to previously solved queries can make the solver time out. If adding y-is-gte-3 caused the solver to time out, then our edits to the mar program would have actually set back our proof efforts. Guaranteeing that this cannot happen (for a class of updates) is a form of progress.

Non-repetition and progress are similar. Providing progress opens up opportunities for non-repetition: if we can avoid changing backend checks, then we will get more cache hits. The opposite can also be true. More fine-grained caching, like through saving UNSAT cores, can automatically guarantee some progress—adding irrelevant facts will not lead to a new solver call. Another form of progress comes from building on previously proved facts. For example, if we have previously proven y-is-gte-3, then we can easily prove y-is-gte-2 without even looking at the body of add-three. For more complicated functions and properties, building on previously proven facts can lead to speedups over the entire verification effort.
We propose a lightweight solution to both non-repetition and progress called proof scripts. Users can explicitly label proof dependencies and the backend can use the implied dependency graph to minimize the number and size of generated queries. The following code consists of the program and contracts described above along with four proof commands.
// ----- Program -----
data natural = Z() | S(p: natural)
let subtract-two (x: natural) -> natural = x.p.p
let add-three (x: natural) -> natural = S(S(S(x)))
let a: natural
print subtract-two(add-three(a))
assert subtract-two(add-three(a)) is Z
// ----- Contracts -----
subtract-two requires x-is-gte-2: x is S and x.p is S
subtract-two ensures y-is-0-if-2: subtract-two(x) is Z
                                  ==> x == S(S(Z()))
add-three ensures y-is-gte-2: add-three(x) is S and
                              add-three(x).p is S
add-three ensures y-is-gte-3: add-three(x) is S and
                              add-three(x).p is S and
                              add-three(x).p.p is S
// ----- Proofs -----
prove y-is-0-if-2 using x-is-gte-2
prove x-is-gte-2 using y-is-gte-2
prove y-is-gte-2
prove y-is-gte-3
Each proof command is of the form prove ID (using ID+)?, where every ID must point to the name of either a requires or an ensures clause. We call the first ID the target and subsequent IDs the dependencies. In the example above, with the command prove x-is-gte-2 using y-is-gte-2, the user is telling the backend that it only needs to use the postcondition y-is-gte-2 to prove the precondition x-is-gte-2 (the target). Later, the user also asks the backend to prove y-is-gte-2, this time with no dependencies: the checker can verify that property with only the body of the add-three function. In terms of the dependency graph, we say that x-is-gte-2 depends on y-is-gte-2 (there is an edge from x-is-gte-2 to y-is-gte-2) and that y-is-gte-2 is a leaf (there is no edge from y-is-gte-2). To ensure soundness, we require that the dependency graph is acyclic. We also require that every annotation appears as the target of a proof command at most once.
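For instance, a proof script like the following would be rejected, because its two commands form a cycle in the dependency graph:
prove x-is-gte-2 using y-is-gte-2
prove y-is-gte-2 using x-is-gte-2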
Given an acyclic dependency graph, we re-define the semantics of mar programs originally described in Chapter 2 as follows. The meaning of mar programs in the contract semantics is a label on each annotation. There are three possible labels: “proven” (☑), “rejected” (☒), and “unknown” (☐). If an annotation is not the target of any proof command, then it receives the label “unknown.” Otherwise, let \(f\) be a function with input parameters \(x_1, ..., x_n\) and let \(a\) be the target annotation of some proof command with dependency set \(D\) that refers to \(f\). There are two cases: either \(a\) is a precondition or it is a postcondition. We will consider both cases, in sequence, using the following definitions. Let \(r_f(x_1, ..., x_n)\) be a boolean function capturing the conjunction of expressions in requires annotations for the function \(f\) that are in \(D\), and let \(e_f(x_1, ..., x_n, y)\) be a boolean function capturing the conjunction of expressions in ensures annotations for the function \(f\) that are in \(D\).
Suppose \(a\) is a requires annotation. The precondition \(a\) receives the label “proven” iff every call to \(f\) with arguments \(e_1, ..., e_n\) is such that the satisfiability query \[(r_f(e_1, ..., e_n) \land \neg a(e_1, ..., e_n), DT)\] is UNSAT. The precondition \(a\) receives the label “rejected” if there exists a call to \(f\) with arguments \(e_1, ..., e_n\) such that the same satisfiability query is SAT. In both cases, each \(e_i\) uses the composed semantics of mar programs restricted to the annotations in \(D\).
Suppose \(a\) is an ensures annotation. The postcondition \(a\) receives the label “proven” iff the satisfiability query \[
\begin{aligned}
(r_f(x_1, ..., x_n) &\land e_f(x_1, ..., x_n, f(x_1, ..., x_n)) \\
&\land \neg a(x_1, ..., x_n, f(x_1, ..., x_n)), DT)
\end{aligned}
\] is UNSAT, where \(x_1, ..., x_n\) are fresh variables. Like for requires annotations, if the same query is SAT, then the ensures annotation receives the label “rejected.” In both cases, each \(e_i\) uses the composed semantics of mar programs restricted to the annotations in \(D\).
Therefore, running the following mar program produces output reporting that print subtract-two(add-three(a)) results in Z() and that one of the three annotations, x-is-gte-2, is successfully proved. The other two annotations are either not the target of any proof command (y-is-gte-2) or do not hold (y-is-0-if-2).
data natural = Z() | S(p: natural)
let subtract-two (x: natural) -> natural = x.p.p
let add-three (x: natural) -> natural = S(S(S(x)))
let a: natural
print subtract-two(add-three(a))
assert subtract-two(add-three(a)) is Z
subtract-two requires x-is-gte-2: x is S and x.p is S
subtract-two ensures y-is-0-if-2: subtract-two(x) is Z
==> x == S(S(Z()))
add-three ensures y-is-gte-2: add-three(x) is S and
add-three(x).p is S
prove y-is-0-if-2 using x-is-gte-2
prove x-is-gte-2 using y-is-gte-2
Every proof command can be checked independently, opening up the door to parallel solving. For large systems, like the industrial systems described in Chapter 5, this can mean hundreds, or even thousands, of independent checks. Ideally, we could run all these checks in parallel until the first one fails or they all succeed. For each check, we could run every available solver with every possible configuration. In reality, we usually do not have enough processors available for this strategy to work.
Instead, we can use a tool like MedleySolver (Chapter 4) to pick solvers and to order checks. This usage of MedleySolver would reduce the number of solver calls per individual check. We can automatically learn to prioritize checks that will terminate more quickly, helping us find counterexamples faster. We are exploring these ideas in ongoing work.
In this section, we give the necessary background on I/O automata and provide a basic encoding of I/O automata into the mar language. We present a version of I/O automata (Lynch 1996) tailored to our context along with examples focused on our encoding described in Chapter 5. Let \(\textit{states}\) and \(\textit{actions}\) be disjoint sets. An I/O automaton \(A\) is a six-tuple \[(\textit{inp}(A), \textit{int}(A), \textit{out}(A), \textit{states}(A), \textit{start}(A), \textit{trans}(A)),\] where \(\textit{inp}(A)\) (input actions), \(\textit{int}(A)\) (internal actions), and \(\textit{out}(A)\) (output actions) are disjoint subsets of \(\textit{actions}\), \(\textit{states}(A)\) is a subset of \(\textit{states}\), \(\textit{start}(A)\) is a subset of \(\textit{states}(A)\), and \(\textit{trans}(A)\) is a transition relation over state-action-state triples. Specifically, \(\textit{trans}(A)\) is a relation on \[\textit{states}(A) \times (\textit{inp}(A) \cup \textit{int}(A) \cup \textit{out}(A)) \times \textit{states}(A).\] We call individual triples \((s,a,s')\) in \(\textit{trans}(A)\) steps of \(A\). Like Lynch, we require I/O automata to be input-enabled. That is, we require that for every state \(s\) and action \(\alpha \in \textit{inp}(A)\) there exists a state \(s'\) such that \((s, \alpha, s') \in \textit{trans}(A)\).
Example 1. We define an I/O automaton called the simple universal buffer (\(\textit{SUB}\)). Intuitively, \(\textit{SUB}\) represents a set of integers that other automata will be able to add to or remove from. Formally, for a non-empty index set \(R\), let \(\textit{inp}(\textit{SUB})\) be \(\{\textit{put(v)}_{i, j} \; \vert \; v \in \mathbb{Z};\, i, j \in R\}\); \(\textit{int}(\textit{SUB})\) be \(\emptyset\); \(\textit{out}(\textit{SUB})\) be \(\{\textit{get(v)}_{i, j} \; \vert \; v \in \mathbb{Z};\, i, j \in R\}\); \(\textit{states}(\textit{SUB})\) be the set of all sets of integers unioned with a special error state, i.e., \(2^{\mathbb{Z}} \cup \{\bot\}\); \(\textit{start}(\textit{SUB})\) be the singleton set containing the empty set; and \(\textit{trans}(\textit{SUB})\) be a relation such that \(\textit{put(v)}_{i, j}\) actions add the integer \(v\) to the state, and, if \(v\) is in the set, \(\textit{get(v)}_{i, j}\) actions remove the integer \(v\) from the state. If \(v\) is not in the set, then \(\textit{get(v)}_{i, j}\) actions move \(\textit{SUB}\) to the error state. Once in the error state, no actions change the state. Figure 2 (a) illustrates the interface of \(\textit{SUB}\), where input (output) actions are arrows toward (away from) the automaton.
We often define transition relations piecemeal through precondition-effect pairs.
Definition 5. Let \(A\) be an I/O automaton. A precondition-effect pair \((p, e)\) of \(A\) consists of a predicate \(p(s, \alpha)\) over an input state and an input action, and a function \(e(s, \alpha)\) that takes in an input state and an input action and returns a new state. Given a precondition-effect pair \((p, e)\), if \(p(s, \alpha)\) is true and \(e(s, \alpha) = s'\), then we say that \((s, \alpha, s') \in \textit{trans}(A)\). As a matter of convention, when the I/O automaton in question has an error state (like \(\bot\) for \(\textit{SUB}\)), actions that do not satisfy any precondition move the state of the I/O automaton to that error state. This convention makes it easier to define input-enabled I/O automata in terms of precondition-effect pairs. When non-determinism is needed, we extend \(p\) and \(e\) to take extra, non-deterministic arguments. However, to simplify the presentation, we elide non-deterministic arguments for the remainder of this chapter.
For example, the simple universal buffer (\(\textit{SUB}\)) has a precondition-effect pair with precondition \(p(s, \textit{get(v)}_{i, j}) \triangleq v \in s\) and effect \(e(s, \textit{get(v)}_{i, j}) \triangleq s \smallsetminus \{v\}\). This precondition asserts that \(v\) must be in the set (we cannot get an element which is not present), and this effect removes \(v\) from the set \(s\).
Definition 6. When I/O automata are compatible, they can be composed to form a new I/O automaton. A finite, non-empty set of I/O automata \(M\) is compatible if their internal actions and output actions are pairwise disjoint, respectively. That is, the automata in \(M\) are compatible if \(\forall A, B \in M\), \(A \neq B\) implies that \(\textit{int}(A) \cap \textit{int}(B) = \emptyset\) and \(\textit{out}(A) \cap \textit{out}(B) = \emptyset\).
The composition of machines \(M\) for a finite, non-empty index set \(R\) and enumeration \(\mu\), denoted \(M \cdot\), is a new I/O automaton where \(\textit{inp}(M \cdot) = \bigcup_{A \in M} \textit{inp}(A) \smallsetminus \bigcup_{A \in M} \textit{out}(A)\); \(\textit{int}(M \cdot) = \bigcup_{A \in M} \textit{int}(A)\); \(\textit{out}(M \cdot) = \bigcup_{A \in M} \textit{out}(A)\); \(\textit{states}(M \cdot)\) is the set of all arrays \(s\) such that for every \(i \in R\), \(s[i]\) is a member of \(\textit{states}(\mu(i))\); \(\textit{start}(M \cdot)\) is the set of all arrays \(s\) such that for every \(i \in R\), \(s[i]\) is a member of \(\textit{start}(\mu(i))\); and \(\textit{trans}(M \cdot)\) is the set of triples \((s, \alpha, s')\) such that for every \(i \in R\), if \(\alpha\) is in the actions of \(\mu(i)\) then \((s[i], \alpha, s'[i]) \in \textit{trans}(\mu(i))\) and otherwise \(s'[i] = s[i]\). Note that our definition of composition is associative.
Figure 2: Visualizations of I/O automata for examples 1 and 2. Irrelevant actions omitted.
Example 2. We define a class of I/O automata called simple machines (\(\textit{SM}_i\)) and then compose an instance of the class with \(\textit{SUB}\) from Example 1. Intuitively, \(\textit{SM}_i\) automata receive integers from \(\textit{SUB}\), keep track of the most recent integer received, and send integers to \(\textit{SUB}\) that are greater than the most recent integer received. Formally, for a non-empty index set \(R\), let \(\textit{inp}(\textit{SM}_i)\) be \(\{\textit{get(v)}_{i, j} \; \vert \; v \in \mathbb{Z};\, j \in R\}\); \(\textit{int}(\textit{SM}_i)\) be \(\emptyset\); \(\textit{out}(\textit{SM}_i)\) be \(\{\textit{put(v)}_{i, j} \; \vert \; v \in \mathbb{Z};\, j \in R\}\); \(\textit{states}(\textit{SM}_i)\) be the integers unioned with an error state, i.e., \(\mathbb{Z} \cup \{\bot\}\); \(\textit{start}(\textit{SM}_i)\) be the singleton set containing the integer zero; and \(\textit{trans}(\textit{SM}_i)\) be a relation such that \(\textit{get(v)}_{i, j}\) sets the state to \(v\), and \(\textit{put(v)}_{i, j}\) can occur if \(v\) is greater than the current state. Figure 2 (b) illustrates the interface of \(\textit{SM}_i\). The automaton \(\textit{SM}_1\) is compatible with \(\textit{SUB}\), and so \(\textit{SM}_1 \cdot \textit{SUB}\) is also an I/O automaton. Figure 2 (c) illustrates this composition with irrelevant actions omitted. Note that \(\textit{SM}_1 \cdot \textit{SUB}\) is compatible with an instance \(\textit{SM}_j\) iff \(j \neq 1\). Therefore, we could further compose simple machine instances.
Definition 7. Let \(A\) be an I/O automaton. An execution of \(A\) is a finite sequence of alternating states and actions \(s_0,\alpha_1,s_1, ...,\alpha_k, s_k\) such that \(s_0\) belongs to \(\textit{start}(A)\), every \(s_i\) belongs to \(\textit{states}(A)\), and every triple \((s_i,\alpha_{i+1},s_{i+1})\) belongs to \(\textit{trans}(A)\). We use \(\textit{exec}(A)\) to refer to the set of executions of \(A\). We say that a state \(s\) is reachable by \(A\) if there exists an execution \(s_0,\alpha_1,s_1, ...,\alpha_k, s_k \in \textit{exec}(A)\) with \(s_k = s\).
For example, the sequence \(\emptyset,\textit{put(7)}_{i, j},\{7\},\textit{get(7)}_{i, j},\emptyset\) is an execution of \(\textit{SUB}\) from Example 1. On the other hand, the sequence \(\emptyset,\textit{get(7)}_{i, j},\emptyset\) is not an execution of \(\textit{SUB}\).
Definition 8. An invariant assertion (invariant) is a predicate \(p\) over states. We say that an invariant holds of an I/O automaton \(A\) if for all reachable states \(s\), \(p(s)\) holds. A counterexample is a reachable state \(s\) such that \(p(s)\) does not hold. An inductive invariant is an invariant \(p\) such that for every step \((s, \alpha, s') \in \textit{trans}(A)\) if \(p(s)\) holds then \(p(s')\) holds. A counterexample to induction is a step \((s, \alpha, s') \in \textit{trans}(A)\) such that \(p(s)\) holds but \(p(s')\) does not. A proof by induction for an invariant \(p\) is a proof that \(p\) is inductive and that \(p(s)\) holds for every state \(s \in \textit{start}(A)\). A verification is a successful proof by induction.
For example, consider \(\textit{SM}_1 \cdot \textit{SUB}\) from Example 2 and the invariant assertion which states that every element in the set maintained by the \(\textit{SUB}\) component is greater than zero. That is, the invariant is \(p([ s_{\textit{SM}_1}, s_{\textit{SUB}} ]) \triangleq \forall_{i \in s_{\textit{SUB}}} \, i > 0\). A counterexample to induction for this I/O automaton and invariant is the step \(([ -1, \{\} ], \textit{put}_0, [ -1, \{0\} ])\). Note that the pre-state \([ -1, \{\} ]\) is not reachable and that the invariant does indeed always hold. Nevertheless, the proof by induction fails.
Our encoding of I/O automata into mar is straightforward: each element of the six-tuple defining an I/O automaton maps to a mar construct. Specifically, for an I/O automaton \(A\), we encode the set of actions, \(\textit{actions}\), as a sort \(a\), and we encode \(\textit{inp}(A)\), \(\textit{int}(A)\), and \(\textit{out}(A)\) as predicates on \(a\). We encode the state space, \(\textit{states}\), as a sort \(s\), and we encode \(\textit{states}(A)\) and \(\textit{start}(A)\) as predicates on \(s\). Finally, we encode \(\textit{trans}(A)\) as a predicate on triples \((s, a, s)\).
Precondition-effect pairs map to mar functions with annotations. Invariants are preconditions and postconditions on all functions.
For example, we can encode a version of SUB from Example 1 into mar as follows, where \(\textit{states}(\textit{SUB})\) is redefined to be the set of all sets containing at most one natural number unioned with a special error state.
data natural = Zero () | Successor (n: natural)
data action = Put (p: natural) | Get (g: natural)
data state = Empty () | Full (current: natural) | Error ()
let start (s: state) -> bool =
match s with
| Empty() -> true
| Full(_) -> false
| Error() -> false
let trans (s: state, a: action, s': state) -> bool =
match a with
| Put(p) ->
(match s with
| Empty() -> s' == Full(p)
| Full(_) -> s' == Full(p)
| Error() -> s' == Error())
| Get(g) ->
(match s with
| Empty() -> s' == Error()
| Full(current) ->
(if current == g then
s' == Empty()
else
s' == Error())
| Error() -> s' == Error())
let s1: state
let s2: state
let s3: state
assert start(s1)
assert trans(s1, Put(Zero()), s2)
assert trans(s2, Get(Successor(Zero())), s3)
print(s1) // will print "Empty()"
print(s2) // will print "Full(Zero())"
print(s3) // will print "Error()"
start ensures start-means-empty: start(s) ==> s is Empty
trans requires not-error: trans(s, a, s') ==> not s is Error
trans ensures correct-action: trans(s, a, s') and
(a is Put) ==> s' == Full(a.p)
trans ensures correct-get: trans(s, a, s') and
(a is Get) and
(s is Full) and
(a.g != s.current) ==> s' is Error
The first block of code (lines 1-3) declares a data type to represent natural numbers, natural, the set of actions, action, and the set of states, state, which restricts the set of states to be all sets of size at most one containing natural numbers unioned with a special error state. The next block of code (lines 5-25) defines the start predicate, a function called start, and the transition relation, a function called trans. These first two blocks of code correspond closely to the description in Example 1. The next block of code (lines 27-32) declares three uninterpreted constants of type state, called s1, s2, and s3, and then uses assertions to encode that s1 is a valid start state, s2 is a valid step from s1, and s3 is a valid step from s2. The print statements on lines 34-36, along with the comments displaying their results, show an execution of the SUB machine. Finally, the last block of code (lines 38-45) provides contracts for start and trans that are both correct in the contract semantics and sufficient to ensure a valid execution of the SUB machine in the composed semantics.
Examples 1 and 2, and our work in Chapter 5, use aspects of many-sorted first-order logic that go beyond what mar supports. Specifically, in Chapter 5 we will use uninterpreted sorts, richer theories (like arrays to represent sets), theory combinations, recursive functions, quantifiers, and parametric data types. Barrett et al. (2021) provide detailed background on these standard extensions. We call mar with these additions “extended mar.”
In this chapter, we describe the arrow from Figure 1(E) to Figure 1(F): a solver called Algaroba that can execute mar programs. Specifically, Algaroba handles quantifier-free SMT queries with algebraic data types. As we will see in Chapter 5, algebraic data types naturally capture programming data structures and appear extensively in the queries generated by distributed systems verification projects (e.g., Zhang et al. (2024)). For these tools to perform well, we need Algaroba to be as fast as possible. Fortunately, Algaroba is one of the fastest solvers available: it won the quantifier-free algebraic data type track at the 2024 SMT competition (SMT-COMP) (Bromberger, Bobot, and Jonáš 2024). The technical contents of this chapter first appeared in Shah, Mora, and Seshia (2024).
Automated reasoning about algebraic data types is important beyond distributed systems because this construct appears in many different software applications. As the popularity of algebraic data types grows, the demand for efficient SMT solvers will continue to increase. Unfortunately, the state-of-the-art tools in this space are already struggling to keep up. We demonstrate this empirically by generating a new benchmark set and showing that existing solvers, working together, are only able to solve 56.2% of the new queries in under 20 minutes per query (Chapter 3.5.2).
This imbalance between programming languages and SMT solvers is manifest as a gap in the SMT-solving literature. Oppen (1978) was the first to provide a decision procedure for the quantifier-free theory of algebraic data types, but algebraic data types do not seem to have permeated the community much further. In 2003, a concerted effort to unify the SMT community began with the first official input standard and competition, called SMT-LIB and SMT-COMP (Barrett et al. 2011), respectively. Algebraic data types were not officially integrated into the standard until 2017, as part of version 2.6 (Barrett, Fontaine, and Tinelli 2017). In the 2023 iteration of SMT-COMP (the iteration before the work we describe in this chapter), only two solvers participated in the algebraic data type track, the fewest of any track. Both solvers use a variation of the same solving approach: a lazy SMT architecture combined with theory-specific reasoning based on the work by Oppen from 1978 (see Chapter 3.3).
In this chapter, we propose a new solving technique that departs from the standard approach. Instead of a lazy approach, we take an eager approach (Barrett et al. 2021) that translates the original SMT formula into an equi-satisfiable formula without algebraic data types. Our work fills the gap in the literature on SMT solving for algebraic data type queries and, by doing so, solves more queries than existing solvers (see Chapter 3.5.1). More importantly, we make the largest empirical contribution to the solving community on SMT-COMP benchmarks, solving different queries than existing tools (see Chapter 3.5.2).
The rest of this chapter is organized as follows. In Chapter 3.2, we describe algebraic data types, satisfiability, and our approach through an example planning problem called blocks world. In Chapter 3.3, we survey related work. In Chapter 3.4, we describe our eager reduction from algebraic data type queries to queries containing only uninterpreted functions (UF). Chapter 3.4 includes a proof of soundness and completeness, along with a complexity analysis. In Chapter 3.5, we describe a prototype implementation of our approach, called Algaroba, and we evaluate it over two research questions. We find that Algaroba outperforms the state of the art overall and in terms of contribution to the solving community. Chapter 3.5 also describes a new benchmark set consisting of blocks world queries. This set addresses an important gap in the existing literature: the queries in this set contain all important kinds of algebraic data types but are not easy to solve. Overall, we make the following contributions.
We define a notion of query depth and a finite, quantifier-free reduction from algebraic data type queries to UF queries that uses depths. This is a new eager solving approach.
We prove the soundness and completeness of our approach and show that it generates at most a finite number of assertions.
We generate a new benchmark set that contains all kinds of algebraic data types and is not trivial to solve. Existing benchmarks do not enjoy both these properties.
We implement our reduction approach in a prototype tool called Algaroba and we compare its performance to the state of the art. Algaroba outperforms existing tools and makes the largest empirical contribution to the community of solvers.
To better understand algebraic data types, satisfiability queries, our approach, and the benchmark set that we generate in Chapter 3.5, consider the classic blocks world problem first proposed by Winograd (1971). We use a version of the problem that Sussman (1973) used to illustrate the Sussman anomaly (Russell 2010) and that Gupta and Nau (1992) used to show that the associated decision problem is NP-Hard.
In the simplified version of the blocks world problem, there is a table with (possibly empty) stacks of blocks on it (an initial configuration), a desired way that the stacks should be arranged (a target configuration), and three rules:
The general problem is to find a sequence of legal moves that leads from the initial configuration to the target configuration. The associated decision problem is to determine if there is such a sequence of length less than some input \(k\).
Figure 3: Solution (Figure 3 (b), Figure 3 (c), Figure 3 (d)) to a simple blocks world puzzle. Figure 3 (a) is the initial configuration; Figure 3 (e) is the target configuration.
Figure 3 shows an example blocks world solution. The initial configuration is given in Figure 3 (a), the target configuration in Figure 3 (e), and Figure 3 (b), Figure 3 (c), and Figure 3 (d) show a sequence of three legal moves that solve the problem, where faded blocks denote the previous position of the block moved at that step.
The blocks world problem is a useful illustrative example for an algebraic data type solver because the encoding uses sum, product, and inductive algebraic data types. The following mar code gives the required type definitions for the example in Figure 3.
data block = A() | B()
data tower =
| Empty ()
| Stack (top: block, rest: tower)
data config =
| table (left: tower, center: tower, right: tower)
Specifically, this code defines an enumerated type for blocks (block at line 1), a record type for table configurations (config at lines 5-6), and an inductive type for stacks (tower at lines 2-4). Terms of an enumerated type can take on any of the values listed in the type definition. For example, terms of type block can take on the values A or B. Terms of a record type can take on any combination of values of the type arguments listed in the type definition. For example, terms of type config can take on a triple of any three tower values. Enumerated types are the simplest form of a sum type, while records are the simplest form of a product type. Algebraic data types allow for definitions that are both sum and product types. For example, terms of type tower can either be Empty or they can be a Stack but not both (sum). When these terms take on a Stack value, they are a pair of block and tower values (product). Notice that the definition of tower depends on itself; this makes it an inductive type.
The blocks world problem is a useful illustrative example for satisfiability queries for two reasons. First, satisfiability-based solutions for similar planning problems have been around for decades (Kautz and Selman 1992). Second, encoding the problem as a satisfiability problem is simple when using bounded model checking (Biere et al. 1999) for bounded-horizon planning (e.g., see Rintanen (2003)). Specifically, the bounded model checking-based encoding is given by a transition system and a specification. The transition system starts at the initial configuration, and, at each step, makes a non-deterministic choice of which legal move to make. The specification is that the transition system never reaches the target configuration. Given this transition system, specification, and a bound \(k\), bounded model checking will generate a satisfiability query whose solutions are assignments to the non-deterministic choices—concrete choices that get us to the target configuration from the initial configuration in less than \(k\) steps. SMT solvers, like our new solver, generate these solutions or prove that none exist. We use this encoding in Chapter 3.5 to generate a new benchmark set.
The blocks world problem also gives a useful intuition for our approach. While terms of inductive data types, like tower, can take on arbitrarily large values (e.g., a stack of a million A blocks), there is a bound on the size of relevant values. For the blocks world problem in Figure 3, it would never make sense to have a tower value of size greater than two. Such a bound exists for all quantifier-free algebraic data type queries; the challenge is that automatically inferring the bound and using it to speed up solving is non-trivial. In this chapter, we give an automated procedure for computing an over-approximation of this bound, and then we use this over-approximation to replace algebraic data type functions with uninterpreted functions along with a finite number of quantifier-free axioms.
You can think of the blocks world problem in terms of distributed systems: it shares all the characteristics of the systems we will work with in Chapter 5. Specifically, each table position is a node in the system. Each node keeps track of its own tower and can move its topmost block to another node by sending that other node a message. Assuming that the choice of target node is non-deterministic, solving the blocks world problem is exactly the task of verifying that the distributed system never reaches the target configuration. The subsequent code block is an encoding of the blocks world problem from Figure 3 as a mar program following the encoding of I/O Automata in Chapter 2.5.1. We use bounded model checking for three steps and omit the definition of trans and its contracts for brevity.
data block = A () | B ()
data tower =
| Empty ()
| Stack (top: block, rest: tower)
data config =
| table (left: tower, center: tower, right: tower)
data action =
| Left-to-Center ()
| Left-to-Right ()
| Center-to-Left ()
| Center-to-Right ()
| Right-to-Center ()
| Right-to-Left ()
let trans (before: config, a: action, after: config) -> bool
// ... OMITTED ...
let start: config = table(Stack(B(),Stack(A(), Empty())),
Empty(),
Empty())
let goal: config = table(Empty(),
Empty(),
Stack(B(),Stack(A(), Empty())))
let c0: config
let c1: config
let c2: config
let c3: config
let a0: action
let a1: action
let a2: action
assert(c0 == start)
assert(trans(c0, a0, c1))
assert(trans(c1, a1, c2))
assert(trans(c2, a2, c3))
assert(c3 == goal)
print(c0) // table(Stack(B, Stack(A, Empty)), Empty, Empty)
print(a0) // Left-to-Center
print(c1) // table(Stack(A, Empty), Stack(B, Empty), Empty)
print(a1) // Left-to-Right
print(c2) // table(Empty, Stack(B, Empty), Stack(A, Empty))
print(a2) // Center-to-Right
print(c3) // table(Empty, Empty, Stack(B, Stack(A, Empty)))
We propose a new SMT solver for the DT theory defined in Chapter 2.2. For a quantifier-free input formula p, our solver generates a quantifier-free formula q in UF and then calls an existing UF solver to get a final result. We cannot compute a quantifier-free reduction directly, since DT axioms have universal quantifiers. Instead, we only instantiate the DT axioms over terms that are relevant to the input query.
Listing 1: Pseudocode for the Reduce(p) algorithm.
When the universe of the input query is finite (e.g., it only contains enums), we instantiate the entire universe (see Chapter 3.4.1). Otherwise, we follow the procedure in Listing 1. This procedure transforms the input query to NNF, flattens the result, applies the rewrite rules in Listing 2, and adds the axioms in Listing 3. The depth of a query is the number of constants in the flattened NNF version of the query. In Listing 1, the depth is k. The depth is linear in the size of the input query because the NNF and flattening transformations introduce at most a linear number of constants.
Rules A and B from Listing 2 correspond to rewriting constructor and selector applications, respectively, so that they work well with other constructors, selectors, and testers. Rule B contains existential quantifiers, but these are handled through skolemization (replacing existentially bound variables by fresh constants). Axioms 1 and 2 from Listing 3 ensure that only one tester returns true for each term, and that this tester corresponds to a constant iff the term is a constant. Axiom 3 encodes the acyclicality constraint.
To better understand Axiom 3 and acyclicality, consider the following example query.
data block = A() | B()
data tower =
| Empty ()
| Stack (top: block, rest: tower)
let x: tower
let y: tower
assert (x is Stack) and
(y is Stack) and
(y == x.rest) and
(x == y.rest)
This query is UNSAT because any satisfying assignment would need to violate acyclicality. Therefore, to avoid spurious models, we must encode the acyclicality property in our reduction. The challenge is that we need to capture this seemingly infinite property using only a finite number of quantifier-free formulas.
Our key result is that we only need to enforce acyclicality for all \(l^{\text{th}}\) descendants with \(l < k\), where k is the number of algebraic data type constants in the flattened query. To see the intuition behind this, consider the following generalization of the previous example, where x1, …, xk-1, xk are of type tower.
assert (x1 is Stack) and ... and (xk is Stack) and
       (x2 == x1.rest) and (x3 == x2.rest) and ... and
       (xk == xk-1.rest) and (x1 == xk.rest)
This query asserts that there is a cycle of size k. Since our reduction asserts acyclicality for all \(l^{\text{th}}\) descendants with \(l < k\), we correctly return UNSAT on this query. Furthermore, it is impossible to have a query with a cycle of size more than k using k or fewer constants (in the flat query), so our encoding is sufficient (see Chapter 3.4.3 for a full proof).
Listing 2: Rewrites, where t is an algebraic data type term, f is a constructor for that algebraic data type, ai are well-typed arguments, and each si is the \(i^{\text{th}}\) selector matching that constructor. The left-hand side of the arrow is what exists before the rewrite; the right-hand side is what exists after the rewrite.
Listing 3: Axioms for every algebraic data type term t and s, where f1, …, fn are the constructors for the relevant algebraic data type.
If we recognize that an algebraic data type has a finite universe, we create constants for every term in the universe and instantiate the axioms over the entire, finite universe. This is a source of doubly exponential blowup, but algebraic data types with finite universes are rare and often small enough to prevent noticeable blowup.
In the finite universe case, we can have a doubly exponential blowup. One adversarial case is an algebraic data type that is records of records of enums:
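// Illustrative definitions; only the type names enum, rec1, and rec2 come from the text.
data enum = E1 () | E2 ()
data rec1 = MkRec1 (a: enum, b: enum)
data rec2 = MkRec2 (x: rec1, y: rec1)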
Here, enum has a universe of size two, rec1 has a universe of size four, and rec2 has a universe of size \(16\). This gives a doubly exponential blowup since we are creating fresh constants to represent every normal term of every data type.
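To make the growth explicit, each additional level of record nesting squares the size of the universe. For a hypothetical family of types \(\textit{rec}_d\), each built from two copies of \(\textit{rec}_{d-1}\) with \(\textit{rec}_0 = \textit{enum}\), we have \[|\textit{rec}_d| = |\textit{rec}_{d-1}|^2 = 2^{2^d},\] which is doubly exponential in the nesting depth \(d\); for example, \(|\textit{rec}_2| = 2^{2^2} = 16\), matching the count above.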
In the infinite universe case, we have at worst an exponential blowup in the size of the query. We know the depth k is at most linear in the size of the query; however, for a term x of the tree type definition below, the number of sequences of selector applications of length up to k is \(2^{k+1} - 2\), thus giving us an exponential blowup in the number of terms.
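// Illustrative definition; only the type name tree comes from the text.
data tree =
  | Leaf ()
  | Node (left: tree, right: tree)
With two selectors, there are \(2^l\) distinct selector sequences of length \(l\), and summing over \(l = 1, ..., k\) gives \(2 + 4 + ... + 2^k = 2^{k+1} - 2\).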
In this proof, when we refer to a rule or axiom, we mean those from Listing 2 and Listing 3, respectively. We assume p is flattened and in NNF (i.e., it is p2 in Listing 1). For simplicity, we also assume that p does not contain any uninterpreted functions or sorts, but the proof can be extended to include them.
Theorem 1 (Reduction Correctness). Listing 1 is a correct reduction from DT to UF. Specifically, when the algebraic data type universe is infinite, \((p, \text{DT})\) is SAT iff \((q, \text{UF})\) is SAT.
Proof (\(\rightarrow\)). If \((p, \text{DT})\) is SAT, then there is a model \(\mathcal{M}\models p \cup \text{DT}\). The modification from p to q involves either replacing a theory literal, \(p_i\), with a conjunction of theory literals, \(q_{i,1} \wedge ... \wedge q_{i,n}\), or adding a set of theory literals, \(q_1 \wedge ... \wedge q_m\), to the end of the formula. In both cases, all added theory literals are redundant information in the DT theory (see Definition 4, Listing 2, and Listing 3). That is, \(p_i \cup \text{DT} \models q_{i,1} \wedge ... \wedge q_{i,n}\) and \(\text{DT} \models q_1 \wedge ... \wedge q_m\). Therefore, \(\mathcal{M} \models q \cup \text{DT}\). Since every model in DT is a model in UF, it is also true that \(\mathcal{M} \models q \cup \text{UF}\) and that \((q, \text{UF})\) is SAT.
Proof (\(\leftarrow\)). If \((q, \text{UF})\) is SAT, then there is a model \(\mathcal{M}\models q \cup \text{UF}\). To show that this implies that \((p, \text{DT})\) is SAT, we will use \(\mathcal{M}\) to create a model \(\mathcal{N}\) such that \(\mathcal{N}\models p \cup \text{DT}\).
Without loss of generality, we assume that q is a conjunction of theory literals, \(q_1 \wedge ... \wedge q_n\). You can think of this as the satisfying assignment to the propositional structure of q given by \(\mathcal{M}\). Every theory literal in q is a constructor equality, \(f(x_1, ..., x_m) = x_0\); a selector equality, \(f^j(x_0) = x_1\); a tester application, \(\textit{is-f}(x_0)\); or an equality, \(x_0 = x_1\). Tester applications and equalities can also appear negated. Every \(x_i\) is a constant due to the flattening of q. We say that a constant \(x_0\) is “constructed” in q when it is on the right-hand side of a constructor equality; we say that a constant \(x_0\) is “selected from” in q when it is the constant on the left-hand side of a selector equality.
We begin by setting the universe of \(\mathcal{N}\) to be all algebraic data type normal terms and setting the interpretations of tester, constructor, and selector symbols to the expected values over this universe. So if \(f\), \(g\), and \(h\) are two-, one-, and zero-ary constructors, then the normal term \(f(g(h), h)\) is in the universe of \(\mathcal{N}\) and \(\mathcal{N}[f(g(h), h)] = f(g(h), h)\). Similarly, \(\mathcal{N}[f^2(f(g(h), h))] = h\). In short, \(\mathcal{N} \models \text{DT}\). In the next two paragraphs, we describe an algorithm that determines the value of \(\mathcal{N}\) for every constant that appears in q. Finally, we argue that \(\mathcal{N}\models p \cup \text{DT}\).
For each \(x \in V\), where \(V\) is the set of constants that appear in q, by axiom 1, there is exactly one tester such that \(\mathcal{M} \models \textit{is-f}(x)\). We construct \(\mathcal{N}\) for each \(x\) based on the corresponding \(f\). There are two “base cases” for our construction of \(\mathcal{N}\). First, if \(f\) is some constant constructor, by axiom 2 of the reduction, we know that \(\mathcal{M}[x] = \mathcal{M}[f]\), so we set \(\mathcal{N}[x] \triangleq f\). Note that \(f\) is an algebraic data type normal term, so it exists in the universe of \(\mathcal{N}\). Second, if \(x\) is some constant that was never constructed or selected from (either directly or transitively), then we set \(\mathcal{N}[x]\) to any algebraic data type normal term that satisfies the tester applications and equalities. We can always find such a normal term because there are infinitely many algebraic data type normal terms, our construction of \(\mathcal{N}\) is finite, and the existence of \(\mathcal{M}\) guarantees there is no contradiction in the equations.
If we are not in one of the two base cases, we know that \(f\) is an \(m\)-ary constructor for some \(m > 0\) and that \(x\) was either constructed or selected from in q. There are four “inductive cases” for our construction of \(\mathcal{N}\). First, if \(x\) was set equal to a constructor application, by rewrite A, there are constants \(x_1, ..., x_m \in V\) such that \(\mathcal{M}\models f^1(x) = x_1 \wedge ... \wedge f^m(x) = x_m\). We set \(\mathcal{N}[x] \triangleq f(\mathcal{N}[x_1], ..., \mathcal{N}[x_m])\). Note that \(f\) applied to algebraic data type normal terms results in an algebraic data type normal term, so \(f(\mathcal{N}[x_1], ..., \mathcal{N}[x_m])\) is in the universe of \(\mathcal{N}\). Second, if \(x\) is correctly selected from (the antecedent of the implication in rewrite rule B evaluates to true in \(\mathcal{M}\)), by rewrite B, there are constants \(x_1, ..., x_m \in V\) such that \(\mathcal{M}\models f^1(x) = x_1 \wedge ... \wedge f^m(x) = x_m\) and we again set \(\mathcal{N}[x] \triangleq f(\mathcal{N}[x_1], ..., \mathcal{N}[x_m])\). Third, if \(x\) is incorrectly selected from (the antecedent of the implication in rewrite rule B evaluates to false in \(\mathcal{M}\)), then we pick an appropriate normal term, like in the second “base case” above. Fourth, if multiple of the above cases are triggered, we make sure that the assignments are consistent. For example, if the first two cases are triggered, then we ensure that the existentially bound variables (that are later skolemized) introduced by rule B are equal to the corresponding argument constants in rule A.
We claim that \(\mathcal{N} \models q \cup UF\). If this is true, then together with the fact that \(\mathcal{N} \models DT\), we have that \(\mathcal{N} \models q \cup DT\). By the same reasoning as the proof in the forward direction (\(\rightarrow\)), we can then conclude that \(\mathcal{N} \models p \cup DT\), as desired. The remainder of the proof demonstrates that \(\mathcal{N} \models q \cup UF\).
Suppose for contradiction that \(\mathcal{N} \nvDash q \cup UF\). Then there is some theory literal \(q_i\) in q such that \(\mathcal{N}[q_i]\) is false. Consider the case that this theory literal is a constructor equality, \(f(x_1, ..., x_m) = x_0\). But \[
\begin{aligned}
\mathcal{N}[f(x_1, ..., x_m) &= x_0] \iff
\mathcal{N}[f](\mathcal{N}[x_1], ..., \mathcal{N}[x_m]) =
\mathcal{N}[x_0] \\
&\iff f(\mathcal{N}[x_1], ..., \mathcal{N}[x_m]) = \mathcal{N}[x_0]
\\
&\iff f(\mathcal{N}[x_1], ..., \mathcal{N}[x_m]) =
f(\mathcal{N}[x_1], ..., \mathcal{N}[x_m]) \\
&\iff \text{true},
\end{aligned}
\] which is a contradiction.
Consider the case that this theory literal is a selector equality, \(f^j(x_0) = x_j\), and that the selector is correctly applied. That is, \(\textit{is-f}(x_0)\) is also a theory literal in q. But \[
\begin{aligned}
\mathcal{N}[f^j(x_0) = x_j] &\iff \mathcal{N}[f^j](\mathcal{N}[x_0])
= \mathcal{N}[x_j] \\
&\iff f^j(\mathcal{N}[x_0]) = \mathcal{N}[x_j] \\
&\iff f^j(f(\mathcal{N}[x_1], ..., \mathcal{N}[x_j], ...,
\mathcal{N}[x_m])) = \mathcal{N}[x_j] \\
&\iff \mathcal{N}[x_j] = \mathcal{N}[x_j] \\
&\iff \text{true},
\end{aligned}
\] which is a contradiction.
Consider the case that this theory literal is a tester application \(\textit{is-f}(x)\). But, for every \(x\), \(\mathcal{N}\) assigns an algebraic data type normal term with constructor \(f\) iff \(\mathcal{M}[\textit{is-f}(x)]\) evaluates to true. Together with the fact that testers in \(\mathcal{N}\) are interpreted as expected, this case also leads to an immediate contradiction.
The rest of the cases come down to equality. You can think of the appeals to \(\mathcal{N}[x_i]\) in the inductive steps of our construction as recursive calls. Since there are \(k\) constants in q and q is flat, the longest possible recursive call chain in our construction has length \(k\). Since q ensures acyclicality up to a bound of \(k\) (axiom 3), \(\mathcal{M}\) is acyclic over the set of values it assigns to constants in q. This means that, over the constants in q, the equalities in \(\mathcal{M}\) are exactly the same equalities as in \(\mathcal{N}\). Thus, \(\mathcal{N} \models q \cup UF\) and so \(\mathcal{N}\models p \cup \text{DT}\), giving that \((p, \text{DT})\) is SAT.
In this section we empirically compare the performance of our approach to state-of-the-art solvers. Specifically, we aim to answer the following research questions. RQ1: How does the overall performance of Algaroba compare to the state of the art? RQ2: How complementary is Algaroba to existing solvers?
We implement a prototype of our approach, called Algaroba, in approximately 2900 lines of OCaml code. We use the Z3 API as the default UF back-end solver but we allow for any UF solver to be used instead. Algaroba takes inputs in the SMT-LIB language and includes a number of simple optimizations, like hash-consing (Ershov 1958), incremental solving, and theory-specific query simplifications. All experiments are conducted on an Ubuntu workstation with nine Intel(R) Core(TM) i9-9900X CPUs running at 3.50 GHz and with 62 GB of RAM. All solvers were given a 1200 second timeout on each query to be consistent with SMT-COMP. The state-of-the-art solvers in this space are cvc5 (we use version 1.0.6-dev.214.97a64fc16) and Z3 (we use version 4.12.2). We also include Princess (latest release as of 2023-06-19) in our evaluation since it is the most related approach. We describe all three solvers in Chapter 3.3.
Figure 4: Number of queries solved \(y\) in less than \(x\) seconds for Bouvier (a) and blocks world (b) benchmark sets using a 1200s timeout. Higher (more queries solved) left (in less time) points are better. The legend lists the contribution rank and percentage of queries solved for each solver. Algaroba solves the most queries and achieves the highest contribution rank for both sets.
Our evaluation covers two existing benchmark sets from SMT-COMP, one originally from Bouvier (2021) and one originally from Barrett, Shikanian, and Tinelli (2006) (BST for short). These two benchmark sets are useful but limited: every solver succeeds on every BST query so it is difficult to draw performance conclusions; Bouvier queries are more challenging but only contain sum types.
To address these limitations, we introduce a new benchmark set consisting of randomly generated blocks world queries. Blocks world queries, which we describe in Chapter 3.2, are more challenging to solve than those in BST and, unlike those from Bouvier, contain sum, product, and inductive types. To generate blocks world queries we use the same table configuration as in Chapter 3.2 (three places for towers), but we randomly select a set of blocks (ranging from two to 26) and we randomly generate an initial and target configuration (two sets of three random block towers). We call these three random samples a blocks world setup. For each blocks world setup, we randomly sample a set of step numbers (ranging from one to two times the number of blocks) and generate a blocks world query for each step number. This process resulted in 500 individual queries that each ask “can we get from this initial configuration to this target configuration in exactly this number of steps?”
To answer our first research question, we time the execution of Algaroba, cvc5, Princess, and Z3 on all queries in all three benchmark sets. When more than one solver terminates on a given query we compare the results to check for disagreements. There was not a single case where one solver returned sat and another returned unsat; therefore, we focus the remainder of our evaluation on execution times.
For the BST benchmark set, which consists of 8000 queries, every solver successfully terminates on every query within the timeout. cvc5 performs the best on average with an average solve time of 0.05 seconds (compared to 0.08 seconds for Algaroba and 0.10 seconds for Z3). Z3 performed the most consistently with a standard deviation of 0.05 seconds (compared to 0.10 seconds for Algaroba and 0.15 seconds for cvc5). Given the magnitude of these values, we conclude that the performance differences between Algaroba, cvc5, and Z3 are negligible on this set. Princess is the slowest (2.20 seconds on average) and least consistent (standard deviation of 1.69 seconds) but still effective.
Results are more interesting for the remaining benchmark sets. Figure 4 (a) shows the execution times for every solver on every query in the Bouvier benchmark set (excluding timeouts). No solver succeeds on more than half the queries in the set but Algaroba clearly outperforms the rest. In terms of number of queries solved, Algaroba succeeds on 8.5 percentage points more queries than the next best (45.75% versus cvc5 and Z3’s 37.25%). In terms of average run time on successful queries, Algaroba is 2.30 times faster than the next best (190.11 seconds versus Z3’s 437.18 seconds). In terms of standard deviation on successful queries, Algaroba is 1.30 times more consistent than the next best (267.13 seconds versus cvc5’s 346.83 seconds).
Figure 4 (b) shows the execution times for every solver on every query in the blocks world benchmark set (excluding timeouts). Again, Algaroba outperforms the state-of-the-art solvers. Algaroba solves 5.2 percentage points more queries than the next best (61.4% versus Z3’s 56.2%) but the average and standard deviation results are more complicated. Princess is the fastest and most consistent solver on solved queries (31.89 seconds and 74.77 seconds, respectively) but it succeeds on 44.6 percentage points fewer queries than Algaroba. Compared with Z3, which solves the second most number of queries, Algaroba is 2.00 times faster (87.34 seconds on average compared to Z3’s 175.05 seconds) and 1.51 times more consistent in terms of standard deviation (198.67 seconds versus Z3’s 300.15 seconds).
Across both interesting benchmarks, Algaroba solves the most sat queries (167 versus cvc5’s 70), and the second most unsat queries (322 versus Z3’s 406). The average (median) increase in query size from our reduction was 259x (146x). Given the overall success of Algaroba on both interesting benchmark sets, we answer RQ1 by concluding that our performance compares favorably to the state of the art.
Measuring overall performance is useful but it does not give an accurate perspective on how the community uses these tools. When faced with an SMT query, practitioners are likely to use multiple different solvers. This could be in parallel, like Rungta (2022), or through algorithm selection, like Pimpalkhare et al. (2021). Contribution ranks capture this practical perspective by evaluating solvers in terms of how complementary they are to other solvers. A higher rank means a higher contribution to the community of solvers.
To evaluate how complementary our approach is to existing solvers, we use SMT-COMP’s contribution ranking. This ranking uses the notion of a virtual best solver, which is defined as \(vb(q, S) \triangleq s(q)\), where \(S\) is a set of solvers and \(s\) is the solver in \(S\) that terminates most quickly on \(q\). Informally, the ranking answers, “which solver can I remove from the virtual best solver to hurt performance the most?”
In terms of number of queries solved (the primary SMT-COMP metric), there is a four-way tie on the BST benchmark set—all solvers solve all queries. For both other benchmark sets, Algaroba is ranked highest. For blocks world, the virtual best solver without Algaroba succeeds on 56.2% of the queries, less than Algaroba on its own (61.4%). With Algaroba, the virtual best solver succeeds on 62.2%. For Bouvier, without Algaroba, the virtual best solver succeeds on 64.25% of the queries. With Algaroba, this number rises to 83.75%. These positive results are in part because Algaroba solves the most queries, but are mainly due to the uniqueness of our approach. cvc5 and Z3 use a similar underlying algorithm, so removing one does not affect the performance of the virtual best solver. On the other hand, while Princess is the most similar approach to our own, their reduction is different enough to not interfere with our ranking.
Figure 5 shows this same data as a stacked histogram: each colored bar represents the number of queries that are uniquely solved (y-axis) by each solver (x-axis) over the given benchmark set (color and hash-style). In short, we solve many queries that no other solver can (108/900). Given the winning contribution rank of Algaroba on both interesting benchmark sets, and the dominant number of uniquely solved queries, we answer RQ2 by concluding that our performance is complementary to existing solvers—we solve many benchmarks that no other solver can.
In this chapter, we described the arrow from Figure 1(E) to Figure 1(F): an SMT solver for the quantifier-free theory of algebraic data types called Algaroba. Algaroba takes in queries containing user-defined algebraic data types, transforms them to queries without algebraic data types, calls an existing core solver on the generated queries, and returns the result. We call this transformation a reduction and we say that the approach is eager because it introduces all of the necessary axioms for reasoning upfront as part of the reduction. Most existing solvers for this theory take a lazy approach: they introduce axioms iteratively.
The main theoretical challenge is to construct a terminating reduction and to prove its correctness. The main practical challenge is to outperform existing tools with years of engineering. We do both. Specifically, we present a new measure based on the depth of the input query; we define our reduction by bounding the introduction of axioms using this measure; we prove that this bound is sufficient; and we empirically demonstrate that the bound is not too loose by outperforming state-of-the-art SMT solvers on a collection of benchmarks from related work. Perhaps the most interesting result is that the unique algorithmic approach that we take (eager instead of lazy) leads to unique and excellent empirical performance. Our solver handles many queries that no other solver can, leading to the best contribution score in our evaluation.
Beyond showing that Algaroba solves many queries that no other solver can, Figure 5 also shows that no solver strictly dominates all others. Similarly, the virtual best solver data detailed in Chapter 3.5 suggests that it is better to use a set of solvers to answer a query than to stick to any one given solver for all queries. This phenomenon holds beyond the theory of algebraic data types. Another example of differentiation between solvers is in quantifier reasoning, where the number of different algorithms implemented is reflected in the wide spectrum of literature on the subject, e.g., Ge and de Moura (2009), Reynolds et al. (2015), Löding, Madhusudan, and Peña (2017), Kovács, Robillard, and Voronkov (2017), and Barth et al. (2019). In the same vein, solvers use very different techniques for different theories; for example, there are various techniques that can be used for bit-precise reasoning, e.g., Bryant et al. (2007), Niemetz, Preiner, and Biere (2014), Jha, Limaye, and Seshia (2009), Barrett et al. (2011), Jonás and Strejcek (2016), de Moura and Bjørner (2008), Hansen (2012), Dutertre (2014), Bruttomesso et al. (2008), and Niemetz and Preiner (2020).
SMT solvers are becoming more widely used across various applications including verification, automated software testing, and policy verification, e.g., Bjørner (2009), Backes et al. (2018), and Kroening and Strichman (2016), making them particularly useful to industry practitioners and non-SMT researchers. Given the performance differential between solvers, a key question for such practitioners wishing to apply SMT solving to a problem in a specific domain is “which solver should I use?”
In this chapter, we describe the arrow that leaves Figure 1(F): a new method for automatically coordinating solvers, per query, over a sequence of queries. This setting matches the sequences of queries generated by the verification engine that we will describe in Chapter 5 and many similar verification engines. We call our method MedleySolver and we implement a prototype that can be used out of the box. In short, we provide a simple answer to the question of what solver to use for your query: “Let MedleySolver choose for you!”
The technical contents of this chapter first appeared in Pimpalkhare et al. (2021). In Y. Li et al. (2025), we extended this work to large language model prompt strategy selection in program synthesis, showing significant improvements over the state of the art, but we do not cover those contributions in this dissertation.
MedleySolver frames the problem of choosing an SMT solver as a modified Multi-Armed Bandit (MAB) problem, a classic reinforcement learning formulation in which an agent must repeatedly pick from several different choices with unknown reward distributions, minimizing overall regret. This agent must trade-off between exploitation (choosing a solver that is already believed to be fast) and exploration (testing out other solvers). For a given SMT query, MedleySolver selects a sequence of solvers to run, running the solver it believes is most likely to solve the query first and the solver that is least likely last. MedleySolver also predicts the time it should spend running each solver before it should give up and move onto the next solver in the sequence.
We apply classic algorithm selection techniques from the domain of Multi-Armed Bandit problems to the order selection problem. In this work, we highlight Thompson Sampling and \(k\)-Nearest-Neighbor (\(k\)-NN) classification. We select these two as high-performing instances of a non-contextual and a contextual algorithm, respectively, but perform a more extensive comparison with a variety of other Multi-Armed Bandit algorithms for completeness. These algorithms traditionally predict one optimal action, but we use them to rank SMT solvers on a given query based on the behavior observed on previous queries and, in the case of the contextual bandit algorithms, a feature vector. This ranking allows the algorithm to explore and exploit in a single round.
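For intuition, a standard Bernoulli Thompson Sampling scheme (sketched here as an illustration; it is not necessarily the exact reward model our implementation uses) keeps a Beta posterior \(\mathrm{Beta}(\alpha_s, \beta_s)\) for each solver \(s\), samples \[\theta_s \sim \mathrm{Beta}(\alpha_s, \beta_s) \quad \text{for each } s \in S,\] ranks the solvers by their sampled \(\theta_s\) to obtain the sequence for the current query, and then, for each solver that was run, updates \(\alpha_s \leftarrow \alpha_s + 1\) if it solved the query within its allotted time and \(\beta_s \leftarrow \beta_s + 1\) otherwise. Sampling from the posterior, rather than taking its mean, is what balances exploration against exploitation.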
Non-contextual multi-armed bandit algorithms have been directly applied to selecting search heuristics and variable orderings for constraint satisfaction problems (Xia and Yap 2018; Wattez et al. 2020), and to implementing co-operative sharing of clauses in parallel SAT solving (Lazaar et al. 2012). Our use of a contextual multi-armed bandit framework to select a sequence of SMT solvers for a given SMT query is novel.
We pair these order selection algorithms with two runtime prediction algorithms. The first runtime prediction algorithm estimates the time each solver should be run by modeling its performance as exponential distributions with a parameter that is updated dynamically. The second fits a linear model using stochastic gradient descent (SGD). Runtime estimation helps reduce the cost of exploratory solver choices that do not produce rewards by stopping solvers when we are confident they will not finish before the overall timeout.
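As a sketch of the first idea (an assumed form for illustration, not necessarily the exact update used): if a solver's runtimes on a stream of queries are modeled as draws from an exponential distribution with rate \(\lambda\), then after observing runtimes \(t_1, ..., t_n\) the maximum-likelihood estimate of the rate and the estimated probability of finishing within a budget \(t\) are \[\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} t_i}, \qquad \Pr[T \le t] = 1 - e^{-\hat{\lambda} t},\] so a solver can be cut off once this probability, for the time it has left, drops below a chosen threshold.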
Our work is inspired by recent work showing machine learning techniques can be used to solve SMT queries faster. For example, FastSMT (Balunovic, Bielik, and Vechev 2018) is a tool that uses machine learning to find an optimal sequence of tactics, or query transformations, for SMT solvers to use on queries from a given domain. One issue with such approaches is that the complexity of the learning methods leads to training times grossly larger than the time spent solving. In this work, we achieve comparable performance boosts with no pre-training or additional burden on the SMT end-user. To meet this goal, we approached the problem of algorithm selection for SMT solvers in a dynamic, or “online,” manner. This ensures the cost of our training remains small, proportional, and justified by how SMT practitioners use SMT solvers. For example, techniques such as counterexample-guided inductive synthesis (CEGIS) (Solar-Lezama et al. 2006) produce long sequences of similar SMT queries that are not easy to obtain prior to solving for offline training. Overall, we make the following contributions.
An adaptation of standard regression techniques to predicting when a given solver will timeout on a query, and a novel approach for the same time allocation problem that models runtime as exponential distributions and estimates timeouts dynamically and with context.
A framing of the SMT solver selection problem as a Multi-Armed Bandit (MAB) problem combined with a timeout prediction scheme. Specifically, we extend the MAB problem to selecting sequences of solvers per query instead of a single solver and use the timeout prediction scheme to allocate time to each solver in the sequence. This interaction lets us use lightweight techniques for both problems that do not require pre-training while retaining comparable performance to pre-trained techniques.
An empirical evaluation on a set of \(2000\) benchmarks representing a typical user’s workload. Our approach solves \(1813\) queries on this set, \(128\) more than the next best solver, in roughly three fifths of the time and with no pre-training.
SMT users rarely aim to solve a single query in isolation and usually care about resource consumption. For example, verification engines generate many verification queries for one verification problem and aim to solve these queries in the least amount of time. As such, we define the practical SMT problem as that of taking a set of SMT queries \(Q = \{q_1, ..., q_m\},\) a set of SMT algorithms (usually solvers) \(S\), and producing a set of answers \(A = \{a_1, ..., a_m\},\) where each \(a_i\) corresponds to the matching \(q_i,\) while using the least computational resources.
Our approach to the practical SMT problem, MedleySolver, is a program that takes \(Q\), \(S\), and a timeout \(T\) per query, and aims to maximize the number of instances solved while minimizing the cumulative time spent. We decompose our approach into two parts: for each query \(q \in Q\), we predict a sequence of solvers to run and a time allocation for each solver in that sequence.
Given a query \(q_i\), MedleySolver generates a sequence of solvers \(\sigma\) and a sequence of time allocations \(t_1, ..., t_n\), so that the solver \(\sigma_1\) is run for \(t_1\) seconds and so on. If a solver in the sequence successfully solves the query, we do not run the rest of the solver sequence on that query and instead move on to the next query. For the remainder of this chapter, we use \(\sigma\) to denote a sequence of solvers and \(\sigma_i\) to denote the \(i^{th}\) solver in the sequence \(\sigma\). We process each query \(q_1, ..., q_m\) in order and our solver selection algorithms learn as we go, so the solver selection for \(q_m\) uses information from queries \(q_1, ..., q_{m-1}\). In practice, this historical information can be reset at any time; in our experimental results, we reset it when confronting a new set of queries (for instance, a new category of the SMT competition). An overview of our approach is shown in Figure 6. We describe the three main components of Figure 6 (the solver selector, the timeout predictor, and the featurization of queries) in detail in the next two sections.
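To make this workflow concrete, the following Python sketch shows one way the outer loop could be organized. It is illustrative only: the `selector` and `predictor` objects stand in for the solver selector and timeout predictor of Figure 6, their method names are our assumptions rather than MedleySolver's actual interface, and the later sketches in this chapter reuse the same assumed interface.

```python
import subprocess
import time

def run_solver(cmd, query_path, timeout):
    """Run one SMT solver binary on a query file.
    Returns (answer, elapsed_seconds); answer is None on timeout or no output."""
    start = time.time()
    try:
        out = subprocess.run(cmd + [query_path], capture_output=True,
                             text=True, timeout=timeout)
        lines = out.stdout.strip().splitlines()
        return (lines[0].strip() if lines else None), time.time() - start
    except subprocess.TimeoutExpired:
        return None, timeout

def medley_solve(queries, solvers, selector, predictor, T, extract_features):
    """Outer loop: rank solvers per query, split the per-query timeout T into
    per-solver budgets, run the solvers in order, and learn online from the
    observed outcomes.  `solvers` maps a solver name to its command line."""
    answers = {}
    for q in queries:
        v = extract_features(q)                    # feature vector for q
        order = selector.rank(v, list(solvers))    # sequence sigma of solver names
        budgets = predictor.allocate(v, order, T)  # time allocations t_1..t_n
        for name, t in zip(order, budgets):
            if t <= 0:
                continue
            ans, elapsed = run_solver(solvers[name], q, t)
            solved = ans in ("sat", "unsat")
            selector.update(v, name, solved)
            predictor.update(v, name, elapsed, solved)
            if solved:                             # stop early and move on
                answers[q] = ans
                break
    return answers
```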
In the Multi-armed Bandit (MAB) problem (Cesa-Bianchi and Lugosi 2006) an agent sequentially selects between choices with unknown associated reward distributions, aiming to maximize the reward achieved over time. The agent must trade off exploration (trying new actions and learning about them) and exploitation (deploying actions we know have the potential for high reward).
We frame the solver selection problem as a MAB problem. The agent is selecting the solver to use and the payout is based on successfully solving queries. We assume that running a solver for a randomly selected SMT query is equivalent to sampling from some unknown distribution that we seek to approximate. Contextual MABs extend the problem by giving agents access to a feature vector before each round. This allows us to add information about the characteristics of the SMT query we are trying to solve in each round, as described in Chapter 4.3.1.2.
We modify the MAB problem in one key way: we select a sequence of solvers to run (with corresponding time allocations that we consider later), instead of selecting a single solver. This contribution allows solver selection to perform exploration on each query until it observes a reward, has tried all solvers, or reaches the time-out per query. We also use a time-prediction algorithm, described in Chapter 4.3.2, which predicts how long it is worth running a solver on a given query, allowing us to perform “partial exploration” instead of committing to running a single solver until time-out or termination. Both of these extensions to the MAB algorithms allow the solver selection to correct incorrect solver choices, reducing the cost of exploration relative to exploitation. In the following subsections, we adapt one non-contextual algorithm and one contextual algorithm (i.e., an algorithm that uses the feature vector) from the literature to our domain. We choose the algorithms based on their popularity in the classic MAB literature and on the compatibility of each algorithm's assumptions with our domain (for instance, we omit algorithms such as LinUCB (L. Li et al. 2010), which assumes that the reward is linearly correlated with the feature vector). In Chapter 4.4 we evaluate our choices against other MAB algorithms for completeness.
We use a binary reward structure where a solver receives a reward of 1 if it is observed solving a query and 0 if it is observed failing to solve a query, which decouples solving time from rewards. We also explored an exponential reward structure where a solver receives a reward of \((1 - t/T)^4\) if it solves a query in time \(t\), but found the binary reward more effective which we believe is due to its ability to differentiate more clearly between benchmarks that are slow to solve and benchmarks that are not solved.
Thompson Sampling (S. Agrawal and Goyal 2012, 2013) uses Bayes’ rule to choose an action that maximizes the expected reward. Each round of the MAB in this context is picking a random query from the set of queries and trying to solve it with a specific solver \(\sigma_i\). To adapt non-contextual Thompson Sampling to our SMT solver selection problem, we model the outcome of an experiment with a Bernoulli distribution where the solver solves the query with a probability \(\theta_i\) and fails to solve it within the time-out with a probability \(1-\theta_i\). In Thompson Sampling, the agent does not know the value of each \(\theta_i\) but begins with some prior belief over each one. These priors are beta-distributed: the prior for \(\theta_i\) is \[p(\theta_i) = \frac{\Gamma(\alpha_i + \beta_i)}{\Gamma(\alpha_i)\Gamma(\beta_i)}\theta_i^{\alpha_i - 1}(1 - \theta_i)^{\beta_i - 1}.\] We initially take this distribution to be uniform: \(\alpha_i = \beta_i = 1\). That is, we assume a prior that, for a random query, each solver has a \(50\)% chance of solving the query within the timeout and a \(50\)% chance of failing to do so.
To select a solver to deploy, Thompson Sampling takes a sample from each distribution \(p(\theta_i)\) corresponding to a solver. Note that, because Thompson Sampling samples from the distributions \(p(\theta_1), \ldots, p(\theta_n)\) rather than simply returning the solvers in order of the \(p(\theta)\) with the highest mean, it is more likely to pick solvers that we are uncertain about, allowing exploration. Conventional Thompson Sampling returns the solver with the highest valued sample. Our algorithm returns a sequence of solvers in descending order of these sampled values, i.e., the solver whose sampled value is closest to 1 comes first.
After deploying the solvers and observing the results, the distributions over \(\theta_1, \ldots, \theta_n\) are updated according to Bayes’ rule. Each time a solver \(\sigma_i\) is run, a reward \(r_i\) is observed. The posterior beta distribution (Castillo, Hadi, and Solares 1997) is obtained by adding the reward \(r_i\) to \(\alpha_i\) and \(1 - r_i\) to \(\beta_i\), as illustrated in Figure 7.
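As a rough illustration of this selector, here is a minimal Python sketch of non-contextual Thompson Sampling over solvers, assuming the interface from the sketch earlier in this chapter; the class name and method names are our own, not part of MedleySolver.

```python
import random

class ThompsonSelector:
    """Non-contextual Thompson Sampling over solvers with Beta(alpha, beta)
    posteriors and the binary reward (1 if the solver solved the query)."""

    def __init__(self, solver_names):
        # Uniform Beta(1, 1) priors: initially each solver is assumed to have
        # a 50% chance of solving a random query within the timeout.
        self.alpha = {s: 1.0 for s in solver_names}
        self.beta = {s: 1.0 for s in solver_names}

    def rank(self, _features, solver_names):
        # Draw theta_i ~ Beta(alpha_i, beta_i) for each solver and return the
        # solvers in descending order of the sampled values.
        samples = {s: random.betavariate(self.alpha[s], self.beta[s])
                   for s in solver_names}
        return sorted(solver_names, key=samples.get, reverse=True)

    def update(self, _features, solver, solved):
        # Bayesian update with reward r in {0, 1}: alpha += r, beta += 1 - r.
        r = 1.0 if solved else 0.0
        self.alpha[solver] += r
        self.beta[solver] += 1.0 - r
```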
Thompson Sampling assumes events are independent, i.e., the probability of a solver being able to solve a query is independent of all queries the solver has seen before. This could be true if our time-out prediction algorithm is perfect so that if a solver can solve a query within the timeout \(T\), the solver will also always be able to solve that query within the time \(t\) allocated to the solver.
Contextual approaches depend on the assumption that queries with similar characteristics will cause SMT solvers to perform similarly. We capture the characteristics of each SMT query \(q_i\) in a feature vector \(\boldsymbol{v}_i \in V\). We use these feature vectors both in contextual bandit algorithms (described next) and in our contextual time-prediction (described in Chapter 4.3.2).
We identify a list of 24 features that are quick to extract and that we believe correlate with solving time for specific solvers. These features include context-free qualities like counts of specific operators (e.g., array store operators), the maximum value of literals, the sum of literal values, and so on. The features also include context-sensitive qualities like quantifier nesting and alternations, as well as the size of a given query’s abstract syntax tree represented as a minimal graph (we refer to this representation as a term graph). All feature extraction procedures run in \(\mathcal{O}(n)\), where \(n\) is the size of the term graph. The term graph construction is efficient and the term graph itself is often exponentially smaller than the input query. Therefore, the cost of extracting features is relatively small.
Figure 8: Empirical intuition for features and time predictors. (a) Solver performance on QF_ABV queries as a function of one feature value, showing that different solvers are more or less sensitive to changes in that feature. (b) As time increases, fewer and fewer queries are solved.
As a heuristic, we try to build features that can differentiate solver performance on their own. For example, Figure 8 (a) shows the performance of three different SMT solvers over one of these features. For this example, intuitively, we would want to favor the use of Bitwuzla as the feature value increases, instead of CVC4, which appears to be more sensitive to changes in that feature. While we did not do any empirical feature selection, we did use our prior knowledge of SMT solvers to decide which feature extractors to build, and we evaluate the impact of our decisions in Chapter 4.4.4.4.
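To make the flavor of these features concrete, the sketch below extracts a few illustrative syntactic counts directly from an SMT-LIB query string. It is only a stand-in: the actual extractor uses 24 features and works over the term graph rather than raw text, and the particular features chosen here are assumptions for exposition.

```python
import re

def extract_features(smt2_text):
    """Crude feature vector for an SMT-LIB query given as a string.
    Each entry is a cheap syntactic count in the spirit of the features
    described above (operator counts, quantifier counts, literal statistics)."""
    tokens = re.findall(r"[^\s()]+", smt2_text)
    literals = [int(t) for t in tokens if t.lstrip("-").isdigit()]
    return [
        tokens.count("store"),                            # array store operators
        tokens.count("select"),                           # array select operators
        tokens.count("forall") + tokens.count("exists"),  # quantifiers
        tokens.count("assert"),                           # number of assertions
        max(literals, default=0),                         # largest integer literal
        sum(literals),                                    # sum of integer literals
        len(tokens),                                      # crude proxy for query size
    ]
```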
The \(k\)-Nearest-Neighbor (\(k\)-NN) algorithm is a simple contextual approach. Viewed from the perspective of a MAB problem, given solvers \(s_1, ..., s_n\), \(k\)-NN classifies SMT queries into \(n\) classes: queries where solver \(s_1\) is the best choice, queries where solver \(s_2\) is the best choice, and so on. We extend the standard \(k\)-NN algorithm to return a sequence of solvers.
Given a query, \(q\), the basic \(k\)-NN algorithm looks at the \(k\) nearest queries to \(q\), tallies the number solved by each solver, and orders the solvers by their tallies. We calculate the distance between two queries—how “near” two queries are—by computing the Euclidean distance between their feature vectors. The idea behind this algorithm is that if a solver succeeded on many queries similar to \(q\), then it is likely to succeed on \(q\). If any solvers in the solver set are not included in the neighbors, we randomly shuffle these solvers and append them to the end of the sequence—we make an exception for the \(k=1\) case: when \(k=1\) we return the solver that solved the nearest neighbor, followed by the solver that solved the next nearest neighbor and so on, without replacement.
Once an order is selected, we run the solvers in sequence. If a solver \(s\) succeeds, we add the feature vector of that query to our data set of previously solved queries along with the label \(s\). See Figure 9 for three steps of a 1-nearest neighbor example.
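The following Python sketch shows one way to implement this contextual selector, again using the interface assumed in the earlier sketches. It omits the special case for \(k=1\) described above and uses a plain Euclidean distance over feature vectors.

```python
import math
import random
from collections import Counter

class KNNSelector:
    """Contextual k-NN solver selector: order solvers by how often they solved
    the k nearest previously solved queries."""

    def __init__(self, solver_names, k=10):
        self.k = k
        self.history = []  # list of (feature_vector, solver_that_solved_it)

    @staticmethod
    def _dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    def rank(self, features, solver_names):
        if not self.history:
            order = list(solver_names)
            random.shuffle(order)          # no data yet: arbitrary order
            return order
        nearest = sorted(self.history, key=lambda h: self._dist(features, h[0]))
        tallies = Counter(s for _, s in nearest[: self.k])
        ranked = [s for s, _ in tallies.most_common() if s in solver_names]
        # Solvers that never appear among the neighbours go last, shuffled.
        rest = [s for s in solver_names if s not in ranked]
        random.shuffle(rest)
        return ranked + rest

    def update(self, features, solver, solved):
        # Only successful runs are recorded, labelled with the solver used.
        if solved:
            self.history.append((list(features), solver))
```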
The second component of our approach comprises time predictors, which we use to split the per-query timeout \(T\) into per-solver sub-timeouts \(t_1, \ldots, t_n\). We train a time predictor for each solver. Our goal is to find a time \(t_i\) such that we can stop running the \(i^{th}\) solver in the sequence and be highly confident that it would not have solved the current query even with more time (up to \(T\)). Formally, we are trying to find the minimum \(t_i\) such that \(P(t_i < u_i < T) \leq \delta\), where \(u_i\) is the true runtime of \(\sigma_i\) on the current query and \(\delta\) is the accepted error probability. We consider this event the only relevant error scenario because it implies that, had we allocated more time to solver \(\sigma_i\), we could have solved the current query.
To calculate each \(t_i\), we model each solver’s runtime as an exponential distribution, justified by our experimental observations illustrated in Figure 8 (b): solvers usually succeed early or not at all. We employ Maximum Likelihood Estimation (MLE) (Myung 2003) to fit an exponential distribution to the runtime samples gathered up to that point. In MLE, we find \(\max_{\lambda} P(u_1, \ldots, u_n \mid \lambda)\), where \(u_1, \ldots, u_n\) are the observed runtime samples, which we assume are drawn from \(Exponential(\lambda)\). Using the exponential’s probability density function as the likelihood, this is equivalent to maximizing the log-likelihood \(n\ln\lambda -\lambda(\sum_iu_i)\), leading to the following maximizer: \[\lambda^* = \frac{n}{\sum_iu_i}\] Applying the cumulative distribution function and using the memoryless property of the exponential distribution, we can calculate \(t_i\) as follows: \[t_i = \frac{-\ln(\delta + e^{-\lambda^*T})}{\lambda^*}\] We split \(T\) into sub-timeouts greedily; we use the above process to allocate time for solvers starting from the beginning of our ordering and stop once we reach the overall timeout, allocating zero time to the remaining solvers in the order. If we reach the end of the ordering and still have time remaining, we give the remaining timeout to the last solver.
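A minimal sketch of this non-contextual timeout predictor follows, using the interface assumed in the sketches above. The choice to record only successful runtimes, and the default error probability \(\delta\), are simplifying assumptions.

```python
import math

class ExponentialTimePredictor:
    """Per-solver timeout prediction: fit Exponential(lambda) to observed
    runtimes by MLE and choose the smallest t with P(t < u < T) <= delta."""

    def __init__(self, solver_names, delta=0.05):
        self.delta = delta
        self.samples = {s: [] for s in solver_names}

    def _budget(self, solver, T):
        runtimes = self.samples[solver]
        if not runtimes:
            return T                                  # no data yet: full timeout
        lam = len(runtimes) / sum(runtimes)           # MLE: lambda* = n / sum(u_i)
        # Smallest t such that e^{-lam * t} - e^{-lam * T} <= delta.
        return min(T, -math.log(self.delta + math.exp(-lam * T)) / lam)

    def allocate(self, _features, order, T):
        # Greedily split the per-query timeout T over the solver sequence;
        # any leftover time is handed to the last solver in the order.
        budgets, remaining = [], T
        for solver in order:
            t = min(self._budget(solver, T), remaining)
            budgets.append(t)
            remaining -= t
        if remaining > 0 and budgets:
            budgets[-1] += remaining
        return budgets

    def update(self, _features, solver, elapsed, solved):
        if solved:                                    # simplification: keep only
            self.samples[solver].append(max(elapsed, 1e-6))  # solved runtimes
```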
We present a contextual runtime prediction system based on the \(k\)-NN algorithm. Instead of using every past sample point to estimate \(\lambda^*\), we limit our estimation scheme to the \(k\) nearest data-points. As with the \(k\)-NN based solver selection, the distance between two queries is the Euclidean distance between their feature vectors. The rest of the estimation scheme remains identical to the non-contextual scheme.
Finally, we present a contextual runtime prediction system that finds a linear relationship between our feature vector and the associated runtime. To do so, it minimizes the L2-regularized squared loss of the linear model using stochastic gradient descent, solving: \[\min_{w, b} \sum_i(w^Tx_i + b - u_i)^2 + \alpha ||w||_2,\] where \(w\) is the vector of learned feature weights, \(b\) is a learned intercept, \(x_i\) is the feature vector of the \(i^{th}\) query, \(u_i\) is the true runtime of the \(i^{th}\) query, and \(\alpha\) is a regularization constant.
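The sketch below fits such a per-solver linear model with a hand-rolled SGD step; the learning rate, the regularization constant, and the way a predicted runtime is turned into a budget are all assumptions made for illustration.

```python
import numpy as np

class LinearTimePredictor:
    """Contextual runtime prediction: fit runtime ~ w^T x + b for each solver
    by stochastic gradient descent on an L2-regularised squared loss."""

    def __init__(self, solver_names, dim, lr=1e-3, alpha=1e-2):
        self.lr, self.alpha = lr, alpha
        self.w = {s: np.zeros(dim) for s in solver_names}
        self.b = {s: 0.0 for s in solver_names}

    def predict(self, solver, features):
        x = np.asarray(features, dtype=float)
        return float(self.w[solver] @ x + self.b[solver])

    def update(self, features, solver, elapsed, solved):
        # One SGD step on (w^T x + b - u)^2 + alpha * ||w||_2^2.
        x = np.asarray(features, dtype=float)
        err = self.w[solver] @ x + self.b[solver] - elapsed
        self.w[solver] -= self.lr * (2 * err * x + 2 * self.alpha * self.w[solver])
        self.b[solver] -= self.lr * 2 * err

    def allocate(self, features, order, T):
        # Budget each solver by its predicted runtime, capped by remaining time.
        budgets, remaining = [], T
        for solver in order:
            t = min(max(self.predict(solver, features), 0.0), remaining)
            budgets.append(t)
            remaining -= t
        if remaining > 0 and budgets:
            budgets[-1] += remaining
        return budgets
```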
We implemented a prototype of MedleySolver in Python.14 The input is a directory of queries, and the output is the result, solver used, and time elapsed per query. In this section, we evaluate this prototype and aim to answer the following research questions:
How does MedleySolver compare to individual solvers on the practical SMT problem?
How does MedleySolver compare to the best available alternatives on the practical SMT problem?
How do the individual components of MedleySolver affect the overall performance?
We equip MedleySolver with six SMT solvers—CVC4 v1.8 (Barrett et al. 2011), MathSAT v5.6.3 (Bruttomesso et al. 2008), Z3 v4.8.7 (de Moura and Bjørner 2008), Boolector v2.4.1 (Niemetz, Preiner, and Biere 2014), Bitwuzla v.0.9999 (Niemetz and Preiner 2020) and Yices v2.6.2 (Dutertre 2014)—and run on four benchmark sets, each with 500 queries. Some SMT solvers do not support all needed syntax. For example, the BV set includes quantifiers that Boolector cannot handle. We expect MedleySolver to learn to avoid solvers that fail on specific kinds of queries.
The benchmark queries simulate a typical user’s workload in that they are similar in nature (i.e., they use related logical theories and come from similar applications) but are diverse enough to expose issues a normal user will encounter (e.g., deviations in SMT-LIB conformance). We selected a random sample of 500 queries from an existing benchmark set, Sage2, derived from a test generation tool; 500 queries from 140 UCLID5 (Seshia and Subramanyan 2018) verification tasks; and 500 queries each from the BV and QF_ABV theory SMT-COMP tracks.
We ran every individual solver with a timeout of 60 seconds for every query on a Dell PowerEdge C6220 server blade equipped with two Intel Xeon 10-core Ivy Bridge processors running Scientific Linux 7 at 2.5 GHz with 64 GB of 1866 MHz DDR3 memory. We saved these results and used them to simulate runs of MedleySolver. This ensured results are deterministic and reproducible, and it lowered our carbon emissions. The overhead of running MedleySolver on all 2000 queries varies between learning algorithms and features used but is always less than two minutes for the full set of queries, and is therefore negligible.
To evaluate the utility of MedleySolver for a typical user, we ran \(k\)-NN and Thompson with the three timeout predictors on every set individually and then over the combined set. The combined set represents a realistic combination of queries a typical user might want to solve.
Configuration | BV | QF_ABV | Sage2 | UCLID5 | Split | Combined
---|---|---|---|---|---|---
10-NN Exp (Par-2) | 2081.8 | 1208.2 | 14855.4 | 5386.9 | 23532.3 | 29111.2
10-NN Exp (solved) | 484 | 492 | 396 | 457 | 1829 | 1799
10-NN 10-NN Exp (Par-2) | 2857.7 | 1039.7 | 17558.2 | 5386.9 | 26842.5 | 25133.8
10-NN 10-NN Exp (solved) | 477 | 493 | 367 | 457 | 1794 | 1813
10-NN Linear (Par-2) | 4229.1 | 824.4 | 13789.0 | 8136.4 | 26978.9 | 87898.4
10-NN Linear (solved) | 468 | 496 | 409 | 452 | 1825 | 1308
Thomp Exp (Par-2) | 2986.4 | 1308.5 | 15323.1 | 5540.9 | 25158.9 | 51757.3
Thomp Exp (solved) | 480 | 491 | 401 | 456 | 1828 | 1676
Thomp 10-NN Exp (Par-2) | 2840.8 | 1105.1 | 17949.7 | 5536.0 | 27431.6 | 27555.9
Thomp 10-NN Exp (solved) | 479 | 493 | 365 | 456 | 1793 | 1816
Thomp Linear (Par-2) | 4291.1 | 1474.1 | 15661.3 | 4344.6 | 25771.1 | 45267.7
Thomp Linear (solved) | 466 | 489 | 392 | 473 | 1820 | 1658
Boolector (Par-2) | 60000.0 | 1408.7 | 31502.3 | 60000.0 | 152911.1 |
Boolector (solved) | 0 | 491 | 265 | 0 | 756 |
Bitwuzla (Par-2) | 3872.3 | 822.9 | 22316.4 | 60000.0 | 87011.7 |
Bitwuzla (solved) | 471 | 496 | 349 | 0 | 1316 |
CVC4 (Par-2) | 3332.1 | 5874.1 | 49161.5 | 7395.5 | 65763.3 |
CVC4 (solved) | 477 | 459 | 117 | 459 | 1512 |
MathSat (Par-2) | 11455.0 | 1724.0 | 34159.4 | 50783.3 | 98121.7 |
MathSat (solved) | 406 | 488 | 232 | 77 | 1203 |
Yices (Par-2) | 7244.3 | 922.1 | 13544.4 | 60000.0 | 81710.9 |
Yices (solved) | 442 | 494 | 411 | 0 | 1347 |
Z3 (Par-2) | 2888.6 | 1202.0 | 35279.2 | 2637.2 | 42007.1 |
Z3 (solved) | 477 | 492 | 232 | 484 | 1685 |
Virtual Best (Par-2) | 964.6 | 530.8 | 9006.2 | 2192.1 | 12786.5 |
Virtual Best (solved) | 493 | 497 | 453 | 476 | 1931 |
Table 1 reports the results of our experiment in terms of Par-2 score, i.e., the sum of all runtimes for solved instances plus \(2\times\) the timeout for each unsolved instance (Weber et al. 2019), where “Virtual Best” is calculated by using the best-performing individual solver for each query. On individual sets where one solver dominates, like Z3 on UCLID5, MedleySolver approaches the best solver but does not reach it. Conversely, on sets where no single solver is close to the virtual best, like BV, we find that MedleySolver can exploit this performance differentiation and approach the virtual best solver. The combined set, which is less uniform than BV but less dominated than UCLID5, demonstrates the power of MedleySolver: with no training, MedleySolver solves 94.5% of the queries solved by the virtual best using only 72.3% of the time taken by the most successful individual solver, Z3, which solves 87.3%. Table 1 gives a clear answer to RQ1: MedleySolver outperforms every individual solver on the practical SMT problem.
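For reference, the Par-2 score used throughout these tables can be computed in a few lines of Python; the dictionary-based interface here is only for illustration.

```python
def par2(results, timeout):
    """Par-2 score: runtimes of solved queries plus 2 * timeout per unsolved
    query.  `results` maps each query to its runtime, or None if unsolved."""
    return sum(t if t is not None else 2 * timeout for t in results.values())

# Three queries with a 60-second timeout, one of them unsolved:
print(par2({"q1": 1.2, "q2": 10.0, "q3": None}, 60))  # 131.2
```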
The Thompson MAB selector generally does better when summing up individual sets than when running on the combined set, while the \(k\)-NN selector is the opposite. This suggests contextual approaches can effectively carry over lessons between sets, and non-contextual approaches benefit from being used in the context of a benchmark.
We now compare MedleySolver to alternative portfolio approaches, including those based on pre-trained machine learning techniques. Parallel portfolio solvers like PAR4 (Weber 2019) run multiple solvers in parallel and stop all solvers when the first one solves the query. We can calculate the hypothetical performance of such a solver by multiplying the virtual best solver time by the number of solvers we run. This would give a Par-2 score of \(65481\), in comparison to \(10\)-NN’s better score of \(32632\) over the combined benchmark set.
Configuration | BV | QF_ABV | Sage2 | UCLID5 | Combined
---|---|---|---|---|---
MedleySolver (Par-2) | 1638.7 | 310.5 | 9245.3 | 4248.0 | 18565.5
MedleySolver (training time) | N/A | N/A | N/A | N/A | N/A
MachSMT (Par-2) | 1458.3 | 919.2 | 8516.1 | 2430.9 | 12539.1
MachSMT (training time) | 33895.5s | 4498.9s | 55115.5s | 276419.8s | 300072.8s
Virtual Best (Par-2) | 801.7 | 184.3 | 5204.2 | 1464.7 | 6746.0
Virtual Best (training time) | N/A | N/A | N/A | N/A | N/A
MachSMT (Scott et al. 2021) uses a neural network to select which solver to run on a given query. Table 2 shows the performance of MachSMT and MedleySolver on the same benchmarks as in Table 1, but with \(2/5\) of the queries set aside for MachSMT to train on per set. Although MachSMT slightly outperforms MedleySolver on the test sets, the training time required by MachSMT is orders of magnitude larger than the time spent solving, largely because MachSMT must first run all the solvers on all the queries in the training set to obtain labels, which takes a considerable amount of time. So, on the practical SMT problem, MedleySolver achieves similar results with significantly less resource consumption, and we argue the cost of training is not worth it when online learning can achieve competitive performance. Scott et al. (2021) do make a trained model available that could eliminate training time for a user. However, this pre-trained model is trained on specific versions of a specific set of SMT solvers running on a specific system. In practice, and in our setting, these specifics rarely match the user's, so local training is required.
FastSMT (Balunovic, Bielik, and Vechev 2018) is an offline approach that synthesizes strategies for the Z3 SMT solver. FastSMT requires significant training time and needs to be trained per benchmark set. We can run the pre-trained FastSMT model on the Sage2 benchmark set, where it solves \(358\) queries in \(12766\)s (Par-2 score of \(29806\)). This is a substantial performance gain over Z3, but all MedleySolver configurations still outperform it without any pre-training. FastSMT improves on Z3, but it is limited to a single solver and, unlike MedleySolver, cannot take advantage of the range of SMT solvers implementing different heuristics.
In this section, we evaluate the performance of the individual pieces of MedleySolver. Specifically, we aim to answer the following questions. 1. How do our learning algorithms compare to other well-known MAB algorithms? 2. How well do our order selectors perform? 3. What is the impact of selecting an order instead of a single solver on performance? 4. Which query features are most responsible for MedleySolver’s performance?
We have highlighted the results from Thompson Sampling and \(k\)-NN but we also adapted and evaluated the following Multi-Armed Bandit algorithms: the classic non-contextual epsilon-greedy bandit algorithm (Johnson, Li, and Chen 2000); LinUCB (L. Li et al. 2010), an upper-confidence bound algorithm that assumes a linear relationship between the rewards and the feature vector; Exp3 (Auer et al. 2002): an adversarial bandit algorithm; and an adaptation of a neural network classification based bandit (Zhou, Li, and Gu 2019). All of these bandits performed comparably or better than the best individual solver, but no non-contextual algorithm performed as well as Thompson Sampling, and no contextual algorithm performed as well as \(k\)-NN.
To better understand MedleySolver’s performance, we measure how frequently it selects the best solver as the first solver to try. Over our five case-study benchmarks, the best performing selector is \(k\)-Nearest-Neighbor with \(k=10\) (\(10\)-NN), which correctly picks the best solver 74.3% of the time. The highest success rate is on the BV benchmark, where \(10\)-NN is 92.0% accurate. The lowest success rate is on the QF_ABV benchmark, where \(10\)-NN is 52.0% accurate. This difference demonstrates MedleySolver’s accuracy is proportional to the cost of mistakes: solvers are much better overall on the QF_ABV category so MedleySolver can frequently pick a sub-optimal solver that will still terminate quickly; on the other hand, in BV, picking the wrong solver will often lead to a timeout.
To better understand MedleySolver’s performance, we measure the impact of selecting an order of solvers instead of a single solver. To do this we run MedleySolver without timeout prediction, giving the entire time per query to the first solver in the sequence. We find, all else equal, on the combined set, selecting a single solver produces a Par-2 score 350% worse than our best MedleySolver configuration. This difference is due to the direct cost of mispredictions and because, without the time prediction, MedleySolver is unable to learn from mistakes on a given query.
In this section, we aim to interpret what our results tell us about SMT solvers and optimize performance through feature analysis. To better understand the SMT solvers we use, we measure how well each feature correlates with solver performance differentiation. Specifically, for every feature \(f\), for every pair of solvers \((s_i, s_j) \in S \times S\), we measure the Pearson correlation coefficient between \(f\) and \(\text{time}(s_i) - \text{time}(s_j)\). Using this technique we found, on the BV benchmark set, the most distinguishing features were the number of universal quantifiers for Bitwuzla and MathSat5; the number of free variables for Yices and Boolector; the number of bound variables for Boolector and Bitwuzla; and the size of the term graph for Bitwuzla and Z3.
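A sketch of this analysis using NumPy is shown below; the array layout of the inputs is an assumption made for illustration.

```python
import numpy as np

def differentiating_features(features, runtimes, solvers):
    """For every feature f and solver pair (s_i, s_j), compute the Pearson
    correlation between f and time(s_i) - time(s_j) across all queries.
    `features` is an (n_queries, n_features) array and `runtimes[s]` is an
    array of per-query runtimes for solver s."""
    scores = {}
    for f in range(features.shape[1]):
        for i, si in enumerate(solvers):
            for sj in solvers[i + 1:]:
                diff = runtimes[si] - runtimes[sj]
                scores[(f, si, sj)] = np.corrcoef(features[:, f], diff)[0, 1]
    return scores
```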
To optimize performance, we search for the subset of features that induce the best performance from MedleySolver. Specifically, we use backward step-wise feature selection (BSFS) to iteratively remove features from our feature vector whose removal does not negatively affect performance. Using this technique we found that, all else equal, using only three features on the BV benchmark (number of quantifiers, number of variables, and term graph size) improves the Par-2 score of the best configuration of MedleySolver by 30%. We observe a similar reduction in feature vector size across benchmarks and an average Par-2 performance improvement of 11%. Interestingly, BSFS often removes the feature with the smallest correlation score, as described above.
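The following sketch captures the backward step-wise loop; `evaluate` is a hypothetical callback that reruns MedleySolver restricted to a feature subset and returns its Par-2 score (lower is better).

```python
def backward_stepwise_selection(evaluate, n_features):
    """Iteratively drop the feature whose removal does not hurt the score."""
    selected = set(range(n_features))
    best = evaluate(selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for f in sorted(selected):
            candidate = selected - {f}
            score = evaluate(candidate)
            if score <= best:          # removal did not hurt: keep it removed
                selected, best, improved = candidate, score, True
                break
    return selected, best
```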
We evaluated MedleySolver on a set of benchmarks and a combination of solvers we believe represent a real user’s SMT workload, but we understand these results may not generalize. To mitigate this possibility, we also evaluated on the SMT-COMP data curated by the MachSMT authors. On the QF_UFBV benchmark, the best individual solver was Bitwuzla with a score of 2614.6 while MedleySolver scored 2600.5; on the QF_LIA benchmark, the best individual solver was Yices with a score of 45871.2 while MedleySolver scored 25737.2; on the QF_BVFPLRA benchmark, the best individual solver was MathSAT5 with a score of 3015.2 while MedleySolver scored 567.7; and on the NRA benchmark, the best individual solver was Z3 with a score of 1455.6 while MedleySolver scored 1068.6. In all cases, MedleySolver outperformed every individual solver while using no pre-training, often by margins greater than those observed in our case study, suggesting our results do generalize.
The queries MedleySolver has seen in the past affect the prediction MedleySolver makes for the current query, so the order in which MedleySolver receives the queries could affect the overall performance on the full dataset. MAB algorithms such as Thompson Sampling use random sampling, and the choice of random seed could also affect the results. To gain confidence in our claims, we repeated our experiments with 20 different random seeds and found the standard deviation in MedleySolver’s overall Par-2 score to be approximately 1% of its average score. The margins between MedleySolver and any individual solver are significantly larger than 1%, so MedleySolver is consistently and comfortably better than any individual solver in our evaluation regardless of this variation.
In this chapter, we described the arrow that leaves Figure 1(F): a new way to coordinate solvers, like those described in Chapter 3, called MedleySolver. MedleySolver addresses a practical need born of theoretical limitations. Solver performance can differ greatly across domains due to the use of carefully tuned heuristics that maximize performance in settings where no dominant algorithmic strategy exists. At the same time, solver users usually need to solve multiple related queries in sequence. If the user does not know which solver is best for their domain, then performance over the entire sequence can be prohibitively poor. MedleySolver learns to use a set of solvers over a stream of input logical queries to optimize performance over the sequence without human intervention.
We empirically demonstrated the power of MedleySolver over a set of benchmarks from related work in comparison with existing alternatives. We found that MedleySolver outperforms every individual solver and is comparable to existing pre-trained approaches, even though MedleySolver uses no pre-training. When accounting for pre-training time, MedleySolver significantly outperforms all related work. In short, MedleySolver eliminates the need for users to know which solver is best for their domain.
In this chapter, we build a verification engine for distributed systems, called the P Verifier. This engine generates SMT queries, like those that Algaroba (Chapter 3) and MedleySolver (Chapter 4) handle. Specifically, we describe the arrow to Figure 1(E): a new verification engine, based on a novel notion of message chain invariants, that distributed system engineers can use to (automatically) reason about the fruits of their labor. The technical contents of this chapter first appeared in Mora et al. (2023). Since then, we have built a new version of the P Verifier that includes the advanced solver interface described in Chapter 2.4.
Programming error-free, reliable distributed systems is hard. Engineers need to reason about numerous control branches stemming from myriad interleavings of messages. The systems are incredibly tricky to debug due to the nested mesh of messages and failures, and it is surprisingly easy to introduce subtle errors while improvising to fill in gaps between high-level system descriptions and their concrete implementations. Yet, we depend on distributed systems to deliver high-performance computing, fault-tolerant networked services, and global-scale cloud infrastructures. What can we do to ease the job of the engineer tasked with programming a distributed system? In this chapter we turn to a combination of formal modeling and automated reasoning.
Modeling languages and automated reasoning already play an important role in raising the level of assurance in distributed systems and cloud services. Classic domain-specific examples of the combination include TLA+ (Lamport 2002); newer examples include Ivy (Kenneth L. McMillan and Padon 2020). These languages and associated automated reasoning tools can be used to guarantee correctness for all executions of a given system model, or at least to systematically explore the space in search of errors. Once developers have a correct model, they can build their production systems with confidence. Unfortunately, these modeling languages do not provide users with language primitives or abstractions to support explicit message passing. Working at the message passing level, where a significant source of implementation errors lies, therefore requires manual modelling and significant user effort, making the translation to and from production systems more challenging.
More specifically, modern programming languages provide software engineers with a choice of different distributed systems programming paradigms. Two popular paradigms are the actor model, supported natively by, e.g., Erlang, and the model of communicating sequential processes, supported natively by, e.g., Go. No mainstream verification language supports either of these popular paradigms. For example, a software engineer working with the actor model will likely turn to Ivy (Kenneth L. McMillan and Padon 2020), TLA\(^+\) (Lamport 1999), or Dafny (Leino 2010) for formal verification. Ivy works in the paradigm of synchronous reactive programs; TLA\(^+\) is explicitly designed not to resemble any programming language (Lamport 2021); and Dafny targets imperative programs similar to Java.
To remedy this, we propose message chains, an automated method for keeping track of message chains, and a verification framework that uses message chains to give users the vocabulary to reason about the low-level details of explicit message passing while remaining at an intuitive level of abstraction. Informally, message chains are sequences of logically related messages. This idea is related to message sequence charts (ITU-T 2011) from the software engineering community, and choreographic programming languages (Montesi 2014) from the programming languages community. Message sequence charts are partial specifications that describe the flow of messages through distributed systems. Engineers use these charts for testing, design, simulation, and documentation. Choreographic programming languages, on the other hand, are high-level languages that define distributed systems from a global perspective, focusing on communication flows. Both message sequence charts and choreographic programming offer developers a way to reason about explicit message passing but with the context of how messages flow through the system in question. In this work, we use message chains to bring these same ideas to formal verification.
The rest of this chapter is organized as follows. We begin by revisiting and generalizing the running example from Chapter 1.2 in Chapter 5.2. We then survey related work in Chapter 5.3 and present the necessary background to understand our approach in Chapter 2.5. We formally define the verification problem and our encoding in Chapter 5.4. The problem definition captures systems with any number of machines executing for any number of steps; the encoding captures the notion of message chains and the instrumentation needed to use them. Chapter 5.4 also formally defines message chain invariants and gives a detailed verification of the running example using them. In Chapter 5.4.6 we implement an instance of our verification approach for the P programming language specifically and we call the resulting tool the P Verifier. We then empirically evaluate our implementation in Chapter 5.5. Overall, we make the following contributions.
We define message-passing distributed systems based on I/O automata. We define the verification problem and build a framework that compiles the problem to mar programs.
We define the notion of message chains and integrate it into our message-passing distributed systems. This notion is suited to programming languages like Erlang,15 Akka,16 and P (Desai et al. 2018, 2013).
We instantiate our framework for a version of the P language and call the result the P Verifier. We then replicate verification efforts from related work, and simplify these proofs using message chains. We then verify a correctness property for a system with complex message passing—the onion routing network—and two industrial case studies.
Consider again the ring leader election protocol from Chapter 1.2. We display an implementation of this protocol in P-like syntax in Listing 4. This language is like mar but with the extensions described in Chapter 2.5.1, inline let-bindings, P-like syntax for machines and events (described below), and a new induction command that is syntactic sugar for requires and ensures conditions. This code represents an input model to the P Verifier; it is an example of Figure 1(C).
P programs consist primarily of event and machine declarations. Event declarations define the kind of messages machines exchange and the payloads these messages hold. Machine declarations define the kinds of machines in the system and how they interact with each other. In this figure, we define one kind of event, eNominate, which holds an identifier (line 9). We also define one kind of machine, Node, which has no fields, and two states: Search and Won (lines 10-21). When nodes enter the Search state, they send their value to the node on their right (lines 12-13). After sending their value, while nodes are still in the Search state, nodes react to receiving nominations in one of three ways (lines 14-20). First, when nodes receive a value less than their value, they send their value to the node on their right (lines 19-20). Second, when nodes receive a value greater than their value, they forward this new value to the node on their right (lines 17-18). Third, when nodes receive their own value, they move to the Won state (line 16). The code blocks that begin with the keyword on are handlers.
Note that Listing 4 makes no mention of the number of nodes in the system. In Chapter 1.2 we considered only instances of the ring leader election protocol with three nodes. In this chapter, we consider all possible instances of the protocol. We describe this in detail in Chapter 5.4. For now, it suffices to know that the node (lower-case “n”) definition (line 1) declares a new uninterpreted sort that represents a pointer to, or label of, Node (upper-case “N”) machines. The size of the universe of node determines the number of Node machines, and the verification engine that we define will be free to search for counterexamples over any possible universe of node.
Listing 4: Ring leader election in the P Verifier. Definitions of btw (“between”), right, and le (“less than or equal to”) functions omitted. btw is used to define the ring topology, right uses btw to determine the next node in the ring given an input node, and le is used to determine who the winner should be. Lines 1-21 define the model; lines 22-42 define the specification.
data node
type mc = message_chain[node]
type sys = system[node]

let btw(w: node, x: node, y: node): boolean
let right(n: node): node
let le(x: node, y: node): boolean

event eNominate = {id: node}
machine Node {
  state Search {
    on entry do {
      send right(this), eNominate(this)}
    on eNominate e do {
      let curr = e.payload.id in
      if curr == this then goto Won
      else if le(this, curr) then
        send right(this), eNominate(curr)
      else
        send right(this), eNominate(this)}}
  state Won {}}
let in_flight(s: sys, e: mc): boolean =
  s.events[e] and e is send
let leader(s: sys, l: node): boolean =
  s.machines[l].Node_state is Won
let rec participated(e: mc, r: node): boolean =
  if e is empty then false
  else e.source == r or participated(e.history, r)

induction (s: sys)
invariant unique_leader: forall (l: node, n: node)
  leader(s, l) and leader(s, n) ==> l == n
invariant leader_max: forall (l: node, n: node)
  leader(s, l) ==> le(n, l)
invariant participated_means_le_head:
  forall (e: mc, n: node) let head = e.payload.id in
  in_flight(s, e) and participated(e, n)
  ==> le(n, head)
invariant not_participated_means_going_to_visit:
  forall (e: mc, n: node) let head = e.payload.id in
  in_flight(s, e) and not participated(e, n)
  ==> btw(head, e.target, n) or e.target == n
Figure 10 shows the Lamport diagram for an execution of the ring leader election protocol on four nodes (the universe of node is the set {1, 2, 3, 4}). Every event in Figure 10 corresponds to the entry into a handler or the execution of a send instruction in Listing 4. In this example, node 4 is to the right of node 1, node 2 is to the right of node 4, node 3 is to the right of node 2, and node 1 is to the right of node 3. The execution begins with node 1 sending its label value to node 4 (event 1:0, which corresponds to entering the on entry handler of Listing 4). When node 4 receives the value 1 (event 4:0), it compares 1 to its value and decides to send its value to node 2 (event 4:1). This process goes on until node 4 receives its value (event 4:3) and declares itself the winner. Figure 10 also shows four message chains observed during the same execution. The first message chain, in blue, consists of the events 1:0, 4:0, 4:1, 2:3, 2:4, 3:3, 3:4, 1:2, 1:3, and 4:3, along with omitted associated payloads. The remaining message chains are in orange, green, and pink.
Intuitively, message chains behave like email chains. Each message in the system is passed around with the message it was responding to, creating a history consisting of the sequence of messages that caused the current message. Looking at this history is often useful. For example, if you receive a reply to an email, you know that the sender received your email, and you can tell which email they are responding to by looking at the chain. It does not matter whether the sender also sent other emails to you or to other people, in what order they sent them, or what those other emails contained. For your conversation with the sender, the email chain itself provides sufficient information. We use the same principles to reason about message-passing distributed systems.
We capture this intuitive notion by instrumenting handlers. Specifically, when a handler sends a message without having received a message, our instrumentation begins a new message chain containing only that message. For example, when the on entry handler of Listing 4 is triggered at event 1:0 of Figure 10, we start a new message chain (in blue) containing a single message from node 1 to node 4 containing the payload 1. When a machine receives a message and reacts in the same handler by sending a message, our instrumentation extends the message chain associated with the received message. For example, when the on eNominate handler of Listing 4 is triggered at event 4:0, we extend the blue message chain with a message from node 4 to node 2 with payload 4 (event 4:1).
Our goal, in general, is to prove that systems of asynchronously communicating state machines guarantee specified safety properties. The P Verifier is responsible for this process. It takes a set of properties and returns true if the conjunction of properties passes a proof by induction. For the ring leader election protocol specifically, we prove that there is only ever a single node that declares itself the winner and that the winning node has the label with the greatest value, regardless of the number of nodes included in the system. In general, and specifically for the ring leader election protocol, our proofs will rely on message chains. Note that while a successful proof by induction implies there are no bugs, a failed proof by induction does not imply there are bugs. This fact makes our tool unsuitable for bug finding, like all other proof-by-induction-based tools. But our work can be easily extended to support bug-finding algorithms, like bounded model checking.
Textbook proofs of correctness for such a ring leader election protocol often assume synchronous execution of nodes (e.g., Aspnes (2020)). For the synchronous version of this protocol, for every round \(k < n\), every node sends the greatest value it has seen so far to the node on its right. After \(n\) rounds, the protocol terminates as every node is aware of the greatest value in the ring. Proofs of correctness for the synchronous version of this protocol use induction on the round number. Unfortunately, most realistic implementations are not synchronous so these textbook proofs of correctness do not apply. In the asynchronous setting, proofs of correctness for this protocol tend to ignore message passing altogether. For example, Koenig et al. (2020) provide a proof in mypyvy that abstracts away individual message buffers into a single, shared, append-only relation.17 Padon et al. (2016) provide a similar proof in Ivy.
In this chapter, we propose a new verification approach based on message chains and use it to prove the correctness of asynchronous distributed systems at the message-passing level of abstraction. For example, for the ring leader election protocol, we can encode and machine-check a proof of correctness that is simple, like the textbook proof, and asynchronous, like the mypyvy proof, all while being at a lower level of abstraction than both, since it captures message passing directly. Specifically, Listing 4 displays the full verification of the ring leader election protocol in our prototype verification framework, the P Verifier. This proof, which we revisit in Chapter 5.4, uses two target invariants and two auxiliary invariants—invariants used to make the target specification inductive. The first invariant captures the desired property that there is only ever a single node that declares itself the winner. The second invariant captures the desired property that the winning node has the label with the greatest value. The third invariant is an auxiliary invariant. It says that the head of every message chain always holds the largest label it has seen so far. The fourth invariant is also an auxiliary invariant. It says that, for every message chain, if there is a node that has not yet participated in the message chain, then the message chain is going towards that node. Together, these four invariants capture the textbook proof of correctness over every message chain: for every message chain, the round number in the textbook proof is equal to the length of the message chain.
We provide the full proof of correctness for the ring leader election protocol in Chapter 5.4.5. However, proofs are generally not correct on the first attempt. Figure 11 is a visualization of what a more realistic interaction with the P Verifier would look like. The user first provides a system model (Figure 11 A) and specification (Figure 11 B), and then calls the P Verifier. Usually specifications, like the two target invariants for the ring leader election protocol, are not inductive on their own. In Figure 11 C, the P Verifier rejects the first proof attempt and provides a counterexample to induction. This does not necessarily mean that the model violates the specification, it only means that the proof by induction failed. To fix this, the user must provide auxiliary invariants. Figure 11 D represents an auxiliary message chain invariant. Specifically, Figure 11 D is a visualization of an invariant that asserts that the head of every message chain always holds the largest label in the message chain so far. With the added auxiliary invariants, the user will call the P Verifier again. Figure 11 E represents a successful proof attempt.
In this section, we formulate the problem of verifying asynchronously communicating state machines, we describe our verification workflow, and we define message chain invariants. All of this is built on top of FOLDS Automata (first-order logic distributed systems automata), a short-hand for parametric compositions of I/O automata with restrictions that
make modelling distributed systems easy (see Chapter 5.4.6),
make encoding these systems into SMT straightforward (see Chapter 5.4.3), and
enable our verification workflow by integrating message chains.
At a high-level, FOLDS Automata represent compositions of many state machines (similar to instances of \(\textit{SM}_i\) of Example 2) with a single universal buffer (similar to \(\textit{SUB}\) of Example 1). Our main contribution, and one of the key differences between FOLDS Automata and the I/O automata in Examples 1 and 2, is that, instead of communicating with integers, FOLDS Automata communicate with algebraic data types representing chains of messages.
Definition 9 (Message Chains). MC is an algebraic data type with two sort parameters and two constructors. The sort parameters are E, representing the type of messages, and R, representing the set of machine identifiers. The two constructors are empty and send. The constructor empty has no selectors. The constructor send has four selectors: source, target, payload, and history. The source selector represents the machine that sent the current message and is of sort R. The target selector represents the destination of the current message and is of sort R. The payload selector represents the contents of the current message and is of sort E. The history selector represents the tail of the message chain and is of sort MC[E, R]. A message chain is an instance of the MC algebraic data type. This definition is captured succinctly by an extended mar datatype declaration.
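As a rough illustration (and not the mar syntax used by our framework), the same algebraic data type can be sketched in Python as follows; representing machine identifiers as integers and payloads as tuples is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import Any, Union

@dataclass(frozen=True)
class Empty:
    """The `empty` constructor: a message chain with no messages."""

@dataclass(frozen=True)
class Send:
    """The `send` constructor with its four selectors."""
    source: int                      # machine that sent the current message (sort R)
    target: int                      # destination of the current message (sort R)
    payload: Any                     # contents of the current message (sort E)
    history: Union["Empty", "Send"]  # tail of the message chain (sort MC[E, R])

MC = Union[Empty, Send]

# The green message chain of Figure 10, in this encoding (cf. the example below):
green = Send(2, 3, ("eNominate", 4), Send(4, 2, ("eNominate", 4), Empty()))
```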
For example, send(2, 3, eNominate(4), send(4, 2, eNominate(4), empty)) is the green message chain in Figure 10. Note that message chains are essentially lists of triples. For convenience, we sometimes treat message chains as sequences of triples. This is relevant in Chapter 6.
Definition 10 (FOLDS Automata). A FOLDS Automaton \(A\) is a triple \((E, R, M_i)\), where \(E\) is a sort representing messages, \(R\) is a sort representing machine identifiers, and \(M_i\) is a class of I/O automata that satisfies the following conditions.
For every \(i \in R\), \(M_i\) is an I/O automaton.
If \(i, j \in R\) and \(i \neq j\), then \(M_i\) and \(M_j\) are compatible.
\(\textit{inp}(M_i)\) is \(\{\textit{get(v)}_{i, j} \; \vert \; v \in 2^{MC};\, j \in R\}\).
\(\textit{out}(M_i)\) is \(\{\textit{put(v)}_{i, j} \; \vert \; v \in 2^{MC};\, j \in R\}\).
\(\textit{states}(M_i)\) is an error state \(\bot\), or a record with two sets of message chains, \(\textit{input}_i\) and \(\textit{output}_i\).
\(\textit{start}(M_i)\) must only include non-error states where \(\textit{input}_i\) and \(\textit{output}_i\) are empty.
\(\textit{trans}(M_i)\) is a relation such that \(\textit{get(v)}_{i, j}\) updates \(\textit{input}_i\) to be \(\textit{input}_i \cup v\), and \(\textit{put(v)}_{i, j}\) updates \(\textit{output}_i\) to be \(\textit{output}_i \smallsetminus v\), if \(v\) was in \(\textit{output}_i\) (otherwise \(M_i\) moves to \(\bot\)).
No internal action adds to \(\textit{input}_i\).
If an internal action removes from \(\textit{input}_i\), then it removes exactly one message chain. We call this message chain the received message chain.
Internal actions can only depend on the received message chain.
No internal action removes from \(\textit{output}_i\).
Every message chain that is added to \(\textit{output}_i\) is of the form \(\textit{send}(e, h)\), where \(e\) is a message and \(h\) is either the received message chain or empty, if the internal action did not receive a message chain. We call \(e\) the current message in the message chain and we call \(h\) the history of the message chain.
No internal action can depend on the history of the received message chain.
Internal actions can only add to \(\textit{output}_i\) after any local state changes; internal actions can only make local state changes after any \(\textit{input}_i\) removal.
Intuitively, the first seven conditions (1-7) define network communication: machines send and receive sets of message chains and have input and output buffers that are sets of message chains. The next four conditions (8-11) ensure that machines use their own input and output buffers correctly. Condition 12 ensures that message chains are constructed correctly. Condition 13 ensures that message chains do not affect the behavior of the system we are modeling. That is, message chains behave like ghost variables. Finally, condition 14 ensures that, even though machines take turns stepping, our system is equivalent to an asynchronous composition. This follows from the same argument put forth by Padon (2018), which is in turn due to Lipton (1975).
To give the semantics of FOLDS Automata, we first give the definition of the Universal Buffer, which mediates machine communication by holding and distributing message chains.
Definition 11 (Universal Buffer). For a given non-empty index set \(R\), the Universal Buffer \(\textit{UB}\) is an I/O automaton such that \(\textit{inp}(\textit{UB})\) is \(\{\textit{put(v)}_{i, j} \; \vert \; v \in 2^{MC};\, i, j \in R\}\); \(\textit{out}(\textit{UB})\) is \(\{\textit{get(v)}_{i, j} \; \vert \; v \in 2^{MC};\, i, j \in R\}\); \(\textit{int}(\textit{UB})\) is the empty set; \(\textit{states}(\textit{UB})\) is \(2^{MC} \cup \{\bot\}\); \(\textit{start}(\textit{UB})\) is the singleton set containing the empty set; and \(\textit{trans}(\textit{UB})\) is a relation such that \(\textit{put(v)}_{i, j}\) updates the state \(s\) to \(s \cup v\) and \(\textit{get(v)}_{i, j}\) updates the state \(s\) to \(s \smallsetminus v\), if \(s\) contained \(v\) (otherwise \(\textit{UB}\) moves to the error state \(\bot\)).
A FOLDS Automaton \((E, R, M_i)\) is equivalent to the I/O Automata composition of \(\{M_i \, \vert \, i \in R\} \cup \{\textit{UB}\}\). All definitions that apply to this composed I/O automaton can be lifted to FOLDS Automata. For example, an invariant assertion on a FOLDS Automaton \(F\) is a predicate \(p\) over states of the I/O automaton corresponding to \(F\). Similarly, a proof by induction for an invariant \(p\) of a FOLDS Automaton \(F\) is a proof that \(p\) is inductive and that \(p(s)\) holds for every starting state \(s\) of \(F\).
We wish to verify FOLDS Automata for all possible values of \(R\). For example, we want to verify that the ring leader election protocol is correct no matter how many nodes participate in the election. To do this, we define parametric FOLDS Automata.
Definition 12 (Parametric FOLDS Automata). A parametric FOLDS Automaton is a function \(F(R) \triangleq (E, R, M_i)\), where \(E\) is a sort representing messages, \(R\) is the parameter, and \(M_i\) is a class of I/O automata satisfying the FOLDS Automaton conditions. We call every output of \(F\) an instance of \(F\).
Given a parametric FOLDS Automaton \(F\) and an invariant \(p\), the verification problem is to verify every possible instance of \(F\). Specifically, the verification problem is to check that \(\forall \, R\)
\(\forall \, s \in \textit{start}(F(R)) \; p(s)\); and
\(\forall \, (s, \alpha, s') \in \textit{trans}(F(R)) \; p(s) \implies p(s')\).
In our setting, given our encoding into I/O automata, error states encode the malfunction of the runtime itself—something that we do not wish to reason about. For example, an action \(\textit{get(v)}_{i, j}\) when \(v\) is not in the set maintained by the Universal Buffer will result in the Universal Buffer transitioning to the error state \(\bot\) and the message chain \(v\) “magically” appearing in the input buffer of the \(j^\textit{th}\) machine. Therefore, to avoid nonsensical counterexamples deriving from these transitions, we implicitly assume that all invariants are such that \(inv(s) = inv(s) \lor \textit{error}(s)\), where \(\textit{error}\) is a predicate that returns true iff any machine is currently in an error state. When input programs do not specify an initial set of states, we take it to be equal to the conjunction of invariants.
Encoding FOLDS Automata is exactly like encoding I/O automata. See Chapter 2.5.1 for the details on the encoding and a concrete example of a mar program that represents an I/O automaton. Encoding parametric FOLDS Automata requires only a small change. Specifically, we encode the state space of a parametric FOLDS Automaton as an extensional array whose index sort is a parameter and whose element sort is the sort representing states of the individual I/O automata in the composition. The problem of verification by induction can then be encoded as an SMT query using an uninterpreted sort for the parameter of the parametric FOLDS Automaton. Satisfying interpretations represent counterexamples, while unsatisfiability proofs represent proofs of correctness.
When invariants contain quantifiers, the verification queries are as expressive as first-order logic and the problem is undecidable. Note that there are fragments of first-order logic that are decidable (e.g., EPR (Lewis 1980)), but we support logical theories that fall outside of these fragments (e.g., the theory of sequences is not captured by EPR). When no invariant contains quantifiers, the verification by induction problem can be posed as a quantifier-free SMT query. Unfortunately, it is a long-standing open problem to determine the decidability of quantifier-free queries over sequences with length constraints (Day et al. 2023), and so the decidability of our verification by induction problem is also unknown. Without quantifiers and without sequences, since the theories of equality, linear integer arithmetic, arrays, and algebraic data types are strongly polite and decidable (Sheng et al. 2020; Bonacina et al. 2019), their combination is decidable, and so is our verification problem.
Our problem formulation implicitly supports a novel kind of invariant, message chain invariants, that developers can use along with their usual invariants to verify their systems. Intuitively, message chain invariants describe the way that messages flow through a distributed system. Formally, a message chain invariant is an invariant of the form \[\forall c_1, ..., c_n \in {MC} \; \left(\bigwedge^n_{i=1} \textit{alive}(s, c_i)\right) \implies r(c_1, ..., c_n),\] where \(s\) is the current state of the system, \(\textit{alive}\) is a predicate that returns true iff the message chain \(c_i\) is not empty and is held by the universal buffer or any machine’s buffers, and \(r\) is a predicate over message chains. We call \(r\) the message chain invariant since the rest of the predicate is fixed.
Consider the four message chains in Figure 10. One message chain invariant that is true for all four message chains is that the payload at the head of every message chain holds the largest label value seen so far by that message chain. This turns out to be true for all message chains in the system, and is a crucial fact used in the verification of the ring leader election protocol.
Since predicates in our language can be recursive and solvers directly support these functions (Suter, Köksal, and Kuncak 2011), message chain invariants over a single message chain (\(n = 1\)) can theoretically express any recursively enumerable language over sequences of messages. When properties are over more than a single message chain \((n > 1),\) these invariants can express properties akin to hyperproperties for traces (Clarkson and Schneider 2010). However, message chain invariants cannot express properties about the state of individual machines.
To better understand message chain invariants and our end-to-end workflow, consider the P implementation and verification by induction query for the ring leader election protocol displayed in Listing 4. This verification by induction query has two target invariants. The first invariant, unique_leader, states that there is only ever one node that declares itself as the winner. The second invariant, leader_max, states that this winning node has the greatest label. Note that neither of these two invariants is a message chain invariant: we support message chain invariants, non-message-chain invariants, and combinations thereof.
Theorem 2 (Ring Protocol Correctness). The protocol introduced in Chapter 5.2 and implemented in Listing 4 is correct.
Proof (Ring). We define the semantics of P programs in terms of FOLDS Automata in Chapter 5.4.6. Given these semantics, the verification query in Listing 4, without the last two invariants, is not inductive. This can be explained by two counterexamples. Consider a two-node system—nodes labeled 1 and 2—and the following two message chains: \(c \triangleq\) send(2, 1, eNominate(1), send(1, 2, eNominate(1), empty)) and \(c' \triangleq\) send(1, 1, eNominate(1), empty). For the first counterexample, suppose node 2 has declared itself the winner and \(\textit{input}_1\) contains the message chain \(c\). This system will satisfy both target invariants before taking a step but will falsify unique_leader when node 1 receives \(c\). For the second counterexample, suppose node 2 has declared itself the winner and \(\textit{input}_1\) contains the message chain \(c'\). Again, this system will satisfy both invariants before taking a step but will falsify unique_leader when node 1 receives \(c'\).
To complete the proof by induction, the user can introduce two auxiliary message chain invariants—the last two invariants in Listing 4. The first auxiliary invariant blocks the first counterexample by asserting that the head of every message chain always holds the largest label it has seen so far. Formally, \[r(c) \triangleq \forall n \; \textit{participated}(c, n) \implies n.id \leq c.id,\] where \(\textit{participated}(c, n)\) is a helper function that is true iff the node \(n\) was the source or target of a message in the message chain \(c\). For the counterexample two-node system above, the head of \(c\) cannot hold the label 1 since the node labeled 2 has participated in \(c\), thus our added auxiliary invariant \(r\) blocks the first counterexample.
The second auxiliary invariant blocks the second counterexample above by asserting that, for every message chain, if there is a node that has not yet participated in the message chain then the message chain is going towards that node. Formally, this is the message chain invariant \[ \begin{aligned} r'(c) \triangleq \forall n \; &\neg \textit{participated}(c, n) \implies \\ & (\textit{between}(c.id, c.\textit{target}, n) \lor c.\textit{target} = n) \end{aligned} \] where \(\textit{between}(c.id, c.\textit{target}, n)\) is a helper function that is true iff the target of the message chain \(c\) is between the node labeled \(c.id\) and the node labeled \(n\) in the ring. For the counterexample two-node system above, the node labeled 2 has not participated in the message chain \(c'\) so the target of \(c'\) should be 2 (not 1), thus our added auxiliary invariant \(r'\) blocks the second counterexample.
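To make these two auxiliary invariants concrete, here is a minimal executable sketch, assuming a message chain is represented as a Python list of (source, target, payload) triples ordered oldest to newest and that node labels double as ring positions; the helper names mirror \(\textit{participated}\) and \(\textit{between}\) but are otherwise illustrative.

```python
def participated(mc, n):
    # True iff node n was the source or target of some message in the chain.
    return any(n == src or n == tgt for (src, tgt, _) in mc)

def head_id(mc):
    # Label carried by the payload of the most recent message in the chain.
    return mc[-1][2]

def head_target(mc):
    # Target of the most recent message in the chain.
    return mc[-1][1]

def between(ring, a, b, n):
    # True iff b lies strictly between a and n when walking the ring from a.
    i = ring.index(a)
    for _ in range(len(ring)):
        i = (i + 1) % len(ring)
        if ring[i] == n:
            return False
        if ring[i] == b:
            return True
    return False

def r(mc, ring):
    # First auxiliary invariant: the head holds the largest label seen so far.
    return all(not participated(mc, n) or n <= head_id(mc) for n in ring)

def r_prime(mc, ring):
    # Second auxiliary invariant: any node that has not yet participated is
    # still "ahead" of the chain (or is the chain's current target).
    return all(participated(mc, n)
               or between(ring, head_id(mc), head_target(mc), n)
               or head_target(mc) == n
               for n in ring)

# The counterexample chains from the proof are blocked on the ring [1, 2].
assert not r([(1, 2, 1), (2, 1, 1)], [1, 2])       # first counterexample, blocked by r
assert not r_prime([(1, 1, 1)], [1, 2])            # second counterexample, blocked by r'
```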
Together, these two auxiliary invariants block all counterexamples to induction for all system instances—not just for the two-node system above. The P Verifier, our prototype implementation for the P programming language defined in Chapter 5.4.6, using Z3 (de Moura and Bjørner 2008), will accept this proof by induction in less than a tenth of a second (see Chapter 5.5). This proof is like doing the textbook proof of correctness described in Chapter 5.2 on every message chain at once: for every message chain, the round number in the textbook proof is equal to the length of the message chain. Yet our proof captures the nitty-gritty details of individual machine message passing.
To demonstrate the expressive power of our problem formulation, we implement a verification framework accepting programs written in a version of the P programming language—a large, functional subset that treats message buffers as sets of message chains instead of queues of messages. We call our implementation the P Verifier. The P Verifier implementation is available as an OCaml project consisting of approximately 5000 lines of code.
P is a natural fit for our approach because of its domain-specific structure, but our approach generalizes to standard message-passing languages for distributed systems. In particular, the aspect of P most important for supporting message chains, the notion of a handler, is shared across many programming languages. For example, Erlang uses the receive keyword and Akka uses the receiveMessage method on Behaviors for a similar notion.
We support only the core features of P that are specific to distributed systems. For example, we support the definition of machines and events (messages), but we do not support foreign function calls or modules. We also assume that sequential code blocks—like the bodies of event handlers—are written in mar syntax and use the associated, extended semantics.
At a high level, our compiler encodes P programs into parametric FOLDS Automata. We use P event declarations to create a message sort and P machine declarations to create a class of I/O automata, the two components required for a FOLDS Automaton. The sort of messages (\(E\)) is given by an algebraic data type with \(e\) constructors, where \(e\) is the number of events in the input P program. The \(i^{th}\) constructor of \(E\) holds a single record representing the payload type of the \(i^{th}\) event declaration in the input P program.
The state space of P machines follows a similar encoding. It is represented by an algebraic data type \(M\) with \(m + 1\) constructors, where \(m\) is the number of machine kind declarations in the input P program. The \(i^{th}\) constructor of \(M\) holds a single record representing the fields of the \(i^{th}\) machine kind declaration in the input P program. Specifically, the \(i^{th}\) record holds the variables of the P machines, a state indicator, an entry flag, a “this” reference, and the required \(\textit{input}_i\) and \(\textit{output}_i\) buffers. The state indicator is an enum whose variants are the set of states named by the P machines. The entry flag is a Boolean. The last constructor of \(M\) is \(\bot\) and has no selectors.
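As an illustration, the following sketch builds these two ADTs in Z3's Python API for a hypothetical P program with two events and a single machine kind; the event names, field names, and the use of a sequence sort for the buffers are simplifying assumptions made for brevity.

```python
from z3 import Datatype, IntSort, BoolSort, SeqSort, EnumSort

# One constructor per P event, each holding its payload (names are illustrative).
E = Datatype("E")
E.declare("eNominate", ("nominee", IntSort()))
E.declare("eWon", ("winner", IntSort()))
E = E.create()

# The state indicator for the single machine kind, as an enum.
NodeState, (Init, Waiting, Won) = EnumSort("NodeState", ["Init", "Waiting", "Won"])

# One constructor per machine kind, plus the selector-free error constructor.
M = Datatype("M")
M.declare("Node",
          ("label", IntSort()),     # machine variables
          ("state", NodeState),     # state indicator
          ("entry", BoolSort()),    # entry flag
          ("this", IntSort()),      # "this" reference
          ("input", SeqSort(E)),    # input buffer (message chains elided to events here)
          ("output", SeqSort(E)))   # output buffer
M.declare("Bot")                    # the error state, with no selectors
M = M.create()
```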
The internal actions of the family of I/O automata are \(\{e_{j, i} \; \vert \; e \in N, \; j \in R\} \cup \{\epsilon_i\}\), where \(N\) is the set of event names. We use the action \(e_{j, i}\) to indicate that the current machine, \(i\), receives a message of kind \(e\) from some other machine, \(j\). We use \(\epsilon_i\) to trigger P entry handlers: the first code block that executes after a machine enters a given state. The transitions are defined by precondition-effect pairs derived from handlers as follows. For a handler of event \(e\) within P machine state declaration \(l\) for a machine kind \(K\), let \(m\) be a message chain and let \(s\) be the machine instance at index \(i\); the precondition for the action \(e_{j, i}\) asserts that the \(i^\textit{th}\) machine (1) is the correct target, (2) is of the correct kind, (3) is in the correct state, (4) has its entry flag set to false, (5) is actually receiving a message chain, and (6) that this message chain’s head is of the correct kind. For an entry handler, the precondition for the action \(\epsilon_i\) asserts that the \(i^\textit{th}\) machine (1) is the correct machine, (2) is of the correct kind, (3) is in the correct state, and (4) has its entry flag set to true. We also enforce that every entry handler effect sets the entry flag to false and that every event handler that receives a message chain \(e\) removes \(e\) from \(\textit{input}_i\).
The effects for the actions \(e_{j, i}\) and \(\epsilon_i\) are given by the corresponding handler blocks (\(e\) event handlers and entry handlers). We extend the handler language with some syntactic sugar. First, we provide a send keyword that represents a procedure that takes the index of the target machine t, the payload of the message p, and (implicitly) the message chain that triggered the handler h, and adds send(this, t, p, h) to \(\textit{output}_i\). Second, we provide a goto keyword that represents a procedure that takes a state label, sets the state value to that label, and sets the entry flag to true.
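A rough, dictionary-based sketch of what this sugar expands to (not the compiler's literal output) is the following; the field names mirror the machine record described above.

```python
def do_send(machine, t, p, h):
    # send t, p  ~~>  add send(this, t, p, h) to the output buffer, where h is
    # the message chain that (implicitly) triggered the current handler.
    machine["output"].append(("send", machine["this"], t, p, h))

def do_goto(machine, state_label):
    # goto S  ~~>  set the state indicator to S and raise the entry flag, so the
    # entry handler of S fires on the machine's next epsilon action.
    machine["state"] = state_label
    machine["entry"] = True
```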
We make a few simplifications in the implementation that manifest in our evaluation. First, instead of representing the state of the system as an array, we represent it as a record with two selectors: events and machines. The events selector represents the state of the universal buffer; the machines selector represents the states of all other machines. Second, instead of two buffers per machine, we collapse all buffers into the events selector.
We aim to answer the following research questions.
RQ1: Is the P Verifier expressive enough to replicate existing proofs from other systems? How does the performance of the P Verifier compare to the state of the art?
RQ2: Do message chains help simplify proofs?
RQ3: Can we use the P Verifier to verify industrial distributed systems?
To evaluate the expressive power of the P Verifier, we take verified benchmarks from Koenig et al. (2020) written in mypyvy, and re-verify them using the P Verifier. This set of benchmarks contains consensus protocols like the ring election protocol, standard systems like sharded key-value store services, and more complicated systems, like the hybrid reliable broadcast originally described by Widder and Schmid (2007) and modeled by Berkovits et al. (2019).
The P Verifier and mypyvy make different modeling choices and so the translation process is not always straightforward. For example, while the P Verifier intrinsically encodes implementation details like machine kinds and message buffers, mypyvy users would need to manually model these details. This encourages mypyvy users to write proofs at higher levels of abstraction and means that the corresponding P Verifier proof must explicitly link the implementation details to the abstract proof. For example, for the client-server system (described in detail in Chapter 5.5.2.1), the mypyvy model uses three logical relations to keep track of all messages that have been sent and received so far in the system. These relations abstract the notion of a message buffer—we can infer that a message is in the buffer if it has been sent but not yet received. We use the same logical relations in the P Verifier version of the proof but also maintain the implementation level detail of message buffers by adding one auxiliary invariant that amounts to saying that these abstract relations are append-only versions of the concrete message buffers in the system.
On the other hand, when mypyvy models do include low-level detail, translation into the P Verifier can actually simplify the proof since these details are frequently handled by the programming model itself. For example, the “ticket” benchmark models threads as state machines with three states. The P Verifier version of the proof encodes this directly by defining one machine kind called “Thread” which has three states. The mypyvy model defines three relations on threads, each acting as a flag to indicate when a thread is in the given state. This complicates their proof because auxiliary invariants must be added to ensure that no thread can be in more than one state at a time, a property that is tautological in the P Verifier model.
Benchmark | mypyvy | P Verifier | # Inv M | # Inv P | # Inv P* | \(\exists\forall\)? | \(\exists\forall\)?*
---|---|---|---|---|---|---|---
ring-id | 0.365s | 0.138s | 4 | 4 | 4 | N | N
toy-consensus-forall | 0.380s | 0.064s | 4 | 5 | | N |
consensus-wo-decide | 0.338s | 0.072s | 5 | 8 | | N |
sharded-kv | 0.407s | 0.045s | 5 | 5 | | N |
learning-switch | 0.410s | 0.048s | 6 | 6 | | N |
consensus-forall | 0.566s | 0.120s | 7 | 10 | | N |
lockserv | 0.406s | 0.049s | 9 | 9 | | N |
ticket | 0.402s | 0.068s | 14 | 9 | | N |
firewall | 0.364s | 0.033s | 2 | 2 | 1 | Y | N
sharded-kv-no-lost | 0.328s | 0.589s | 2 | 2 | | Y |
client-server-ae | 0.323s | 0.037s | 2 | 3 | | Y |
toy-consensus-epr | 0.392s | 0.106s | 4 | 4 | | Y |
client-server-db-ae | 0.403s | 0.060s | 5 | 8 | 4 | Y | N
ring-id-not-dead | 0.479s | 0.194s | 6 | 5 | 5 | Y | Y
consensus-epr | 0.557s | 0.088s | 7 | 8 | | Y |
hybrid-reliable-bcast | 0.676s | 0.489s | 8 | 5 | | Y |
Table 3 summarizes the data required to answer RQ1. Each row of the table corresponds to a benchmark. The first column contains the name of the benchmark in question, and the columns labeled “# Inv M” and “# Inv P” display the number of invariants used in the mypyvy and P Verifier proofs, respectively. While the number of invariants is not always indicative of the complexity of a proof, in our context, where a one-to-one translation was made whenever possible, it gives an idea of the extra proof considerations required.
Overall, we find that the P Verifier is comparably expressive to mypyvy and that it can successfully verify existing benchmarks. When our programming model requires extra proof considerations, we are able to address them with a few auxiliary invariants. To evaluate the performance of the P Verifier, we revisit Table 3. The second column shows the wall-clock time taken by mypyvy to verify each benchmark; the third column shows the wall-clock time taken by the P Verifier to verify the corresponding benchmark. All verification times were collected on a 2.3 GHz Quad-Core Intel Core i7 CPU with 32 GB of RAM. With the exception of the benchmark “sharded-kv-no-lost-keys,” the P Verifier is faster in every case.
Thus, for RQ1, we answer that yes, the P Verifier is expressive enough to replicate proofs from other systems, and the P Verifier performs comparably to mypyvy in terms of runtime. Note that, as a sanity check, we introduced bugs into all benchmarks and attempted to verify the buggy version. No proof by induction succeeded, as expected.
To evaluate the simplification power of message chains, we take four benchmarks from related work and attempt to use message chains to improve the proofs. We then explore a new benchmark that is difficult for existing tools but is trivial in our framework. The four benchmarks from existing work, “client-server-db-ae,” “firewall,” “ring-id,” and “ring-id-not-dead,” are those in Table 3 that display message chains longer than three messages.
The first system is a client-server system inspired by Feldman et al. (2017). This system involves three kinds of machines: clients, servers, and databases. Clients send requests to servers which then consult a database. Once the server receives an answer from the database, it forwards the response to the original client. Feldman et al. verify that whenever a client in the client-server-database system receives a response, the same client had sent a matching request. This property is like the no-forge property of Griffin et al. (2020) (which states that no machine “forges” a message) and requires significant machinery to state and verify. Namely, for a message-passing model without message chains, this proof requires three ghost variables, which track every request and response ever sent by the system, three invariants to ensure that the concrete message buffers are abstracted by these ghost variables, and four invariants of the form \(\forall \exists\) that state that for every response, there existed a request that “matches” that response.
Stating and verifying this no-forge property using message chain invariants, on the other hand, is easy and requires no modeling tricks: message chains provide the vocabulary that would otherwise need to be integrated into the system model manually. Specifically, the message chain-based proof uses four invariants. The first invariant uses the path-pattern on machine indexes \(x_1x_2x_3x_2x_1[x_1 \in C \land x_2 \in S \land x_3 \in D]\), where \(\Sigma \triangleq C \cup S \cup D\), \(C\) is the set of client machine indexes, \(S\) is the set of server machine indexes, and \(D\) is the set of database machine indexes. We define path-patterns in Chapter 6.4.2. This path-pattern captures the desired no-forge property, and the invariant states that every message chain is a prefix of that path-pattern. The second invariant states that, for every message chain \(mc\), mc.message.val = mc.history.message.val. The third invariant states that the payload of the first message in every message chain is the original client. Finally, the fourth invariant again uses a path-pattern, but this time on sequences of message kinds \(qqpp[]\), where \(q\) represents an eRequest message, \(p\) represents an eResponse message, and \(\Sigma \triangleq \{q, p\}\). Together, these four invariants are inductive for all system instances and message interleavings.
The second system models a set of nodes communicating through a firewall, where the firewall blocks traffic coming from outside a network unless the traffic is in response to an internal request. In the firewall case, the property asserts that every message received by an internal node was either sent by another internal node or was sent by an external node in response to an internal node. Like the client-server-database system, the mypyvy encoding defines abstract relations to keep track of what messages have been sent out, and then, when a message is received, the encoding asserts that there exists some message in the abstract relation with the desired property. This specification is a \(\forall\exists\) property that establishes the provenance of node interactions. To make the encoding inductive, auxiliary invariants are needed.
Using message chains and the P Verifier we can replace the \(\forall\exists\) properties with a single message chain invariant that does not contain existential quantifiers. Specifically, the solution to the existential part of the property is given by the first element of every message chain. Again, no abstraction is needed to verify this system at the message-passing level.
The final two systems from related work, “ring-id” and “ring-id-not-dead,” are variants of the ring election protocol described in Chapter 5.2 and verified in Chapter 5.4.5, where “ring-id-not-dead” includes an extra goal invariant specifying that the system is never “stuck.” As previously described, we use message chain invariants to verify both systems using a proof that is simple and asynchronous, all while being at a lower level of abstraction than proofs from related work, since it captures message passing directly.
Finally, we detail a new verification that is difficult for related work but easy for our framework: the Onion Routing (Tor) network. Tor is a distributed network comprised of relay nodes used to anonymize TCP-based web traffic. TCP-based web traffic is normally not anonymous because anyone in the network can view the source and destination of packets. In the Tor network, clients build a path through network nodes to a target server and then send their requests through this path. Each step of the path is encrypted on top of the previous steps of the path, making a layered structure (like an onion). When a node receives a packet, it decrypts the message (peels one layer of the onion) to determine the node it should forward the packet to. In this way, each node in the path knows the identity of its successor and predecessor, but no other nodes. To send a response back to a client, servers follow a similar process in reverse. By scrambling network activity, Tor makes it difficult for observers to trace client-server communication.
We modeled the Tor network using three machine kinds: Client, Node, and Server. Client machines build a path to Server machines through some sequence of Node machines and then back to themselves through a second sequence of Node machines. Message payloads hold a todo variable that corresponds to the path left to traverse. When a Node machine receives a message, it “peels” an element of the todo variable and forwards the message to the next machine in line. Server machines behave similarly but could also do some processing to serve the given request.
There are interesting statistical and cryptographic properties to verify for the Tor network, but we focus on basic correctness: every complete message chain should start at a client, go through some non-zero number of intermediate nodes, reach the desired server, return through some non-zero number of intermediate nodes, and end at the same client that began the interaction. To verify this property, we introduced a single ghost variable to messages, done, and added code to maintain it such that concat(done, todo) stays constant throughout every message chain. Intuitively, whenever a machine peels an element off of todo, it adds it onto done. From there it is easy to prove that concat(done, todo) always has the shape defined by the target specification and that every message chain is a prefix of concat(done, todo). Since this proof used interpreted sequence functions heavily (e.g., prefix and concat), we chose to represent the message chain using the SMT-LIB theory of sequences instead of algebraic data types, which would have required us to manually define these functions. Related work would have a much harder time verifying the same property of the same system at the individual message passing level of abstraction.
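The following sketch restates the proof idea in plain Python, assuming each message carries done and todo lists of machine names and that the sets of clients, nodes, and servers are given; the shape check is an illustrative stand-in for the target specification, not our actual SMT encoding.

```python
def well_formed_route(route, clients, nodes, servers):
    # A client, one or more nodes, exactly one server, one or more nodes, and
    # finally the same client that started the interaction.
    if len(route) < 5 or route[0] not in clients or route[-1] != route[0]:
        return False
    server_positions = [i for i, r in enumerate(route) if r in servers]
    if len(server_positions) != 1:
        return False
    k = server_positions[0]
    return (2 <= k <= len(route) - 3
            and all(r in nodes for r in route[1:k])
            and all(r in nodes for r in route[k + 1:-1]))

def chain_invariant(head_msg, visited, clients, nodes, servers):
    # concat(done, todo) is constant along the chain, has the required shape,
    # and the machines visited so far form a prefix of it.
    route = head_msg["done"] + head_msg["todo"]
    return (well_formed_route(route, clients, nodes, servers)
            and route[:len(visited)] == visited)

clients, nodes, servers = {"c"}, {"n1", "n2"}, {"s"}
head = {"done": ["c", "n1", "s"], "todo": ["n2", "c"]}
assert chain_invariant(head, ["c", "n1", "s"], clients, nodes, servers)
```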
For RQ2, we find that message chains can simplify proofs when the system involves communication.
To evaluate the P Verifier on industrial benchmarks, we conduct two case studies on verification problems provided to us by our industrial collaborators.
The first industrial case study models and verifies a system like the ClockBound (Amazon.com 2023) protocol, which is a mechanism for generating timestamps that are guaranteed to be within some bound of “true time.” Here, “true time” is just the time of a reference machine, but when this reference machine is reliable, the protocol can be used to evaluate distributed consistency by comparing the timestamps of events.
The protocol offers an API that clients can query. When one requests a time from the system, one receives a time range \(T_i = (E_i, L_i)\), where \(E_i\) corresponds to the earliest time guaranteed to have passed and \(L_i\) represents the latest time guaranteed to have not yet passed. Internally, the API consists of many systems with local clocks communicating with a corresponding global reference clock. When time is requested from a local clock, the local clock sends a query to its corresponding global clock, which then responds with a “true time” that the local clock uses to return a lower and upper bound on time.
Our industrial partners have target specifications for this system. Specifically, that (1) for two time requests \(T_1\) and \(T_2\), if \(T_1\) and \(T_2\) happened on the same node and \(T_2\) happened after \(T_1\) then \(E_2 \geq E_1\); (2) for two time requests \(T_1\) and \(T_2\), if \(T_2\) happened after \(T_1\) then \(L_2 > E_1\); and (3) for two time requests \(T_1\) and \(T_2\), if \(L_1 < E_2\) then \(T_1\) happened before \(T_2\).
The model and proof is 150 lines of code and takes less than two seconds to verify. We used ten auxiliary invariants, including one message chain invariant. The message chains in this system are up to five messages long and we use a message chain invariant to ensure that every local clock only ever participates in one message chain at a time. This is a powerful invariant because it blocks spurious counterexamples to induction stemming from rogue messages targeting local clocks. This is an example of a message chain invariant over multiple message chains, as described in Chapter 5.4.4.
The second industrial case study is a distributed transaction commit protocol that implements multi-version concurrency control (MVCC) (Bernstein and Goodman 1983) using a two-phase commit (2PC) (Gray and Lamport 2006) protocol for agreeing on transactions. The general idea of MVCC is to allow non-blocking reads and writes by maintaining multiple versions of the data in question. Whenever a write occurs, a new version of the data is created. Reads can observe the most recent version of the data and do not need to wait for concurrent writes to terminate. The main issue is when two writes occur simultaneously. In this case, the system must agree on what writes to commit and in what order. 2PC is used to resolve this issue. Whenever a write transaction is initiated, it must broadcast a commit request to all participants. If all participants agree, then the transaction can be committed—the write can take place—and the transaction announces its success.
Our industrial partners have target specifications for this system. Specifically, (1) we only read the latest committed transactions; (2) for two read transactions \(T_1\) and \(T_2\), if \(T_1\) happened before \(T_2\) then the version read by \(T_1\) must be earlier or equal to the version read by \(T_2\); and (3) if a transaction announces success then all participants must have agreed to commit the corresponding write. This model and proof is 250 lines of code and takes approximately 12 minutes to verify.
The 2PC portion of this protocol is particularly interesting because it demonstrates a common phenomenon in distributed systems where an action depends on multiple concurrent messages. That is, in 2PC, coordinators require all participants to vote “yes” before telling them all to commit (and abort if any vote “no”). In terms of message chains, this means that many message chains end at the coordinator and only the message chain corresponding to the last “yes” vote or the first “no” vote is extended. An interesting aspect of our approach is that, since it is a non-deterministic choice of which message chain is extended (in the asynchronous setting the order that votes arrive in is non-deterministic), message chain invariants reason about all of them at once.
For example, it may be surprising, but a single message chain invariant is sufficient to capture important implementation details, like the fact that the coordinator in 2PC will never send a commit message if any participant voted “no.” Specifically, the message chain invariant is (in extended mar): c.head is commit ==> c.tail.head is vote("yes"). This works because the verification engine is free to pick any of the votes as the prefix that is extended, so the verification only succeeds if all of the possible prefixes satisfy the invariant. The same principle applies for different thresholds (e.g., quorums).
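As a concrete illustration, the following sketch checks this invariant on an explicit list representation of a message chain (newest message first); the message kinds and payloads are hypothetical.

```python
def commit_implies_yes_vote(chain):
    # chain is a list of (kind, payload) pairs, head (newest message) first.
    if not chain:
        return True
    head_kind, _ = chain[0]
    if head_kind != "commit":
        return True
    # The prefix that the coordinator extended must end in a "yes" vote.
    return len(chain) > 1 and chain[1] == ("vote", "yes")

assert commit_implies_yes_vote([("commit", None), ("vote", "yes"), ("prepare", None)])
assert not commit_implies_yes_vote([("commit", None), ("vote", "no"), ("prepare", None)])
```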
In summary, we give an affirmative answer to RQ3: the P Verifier can be used to verify industrial benchmarks and that message chains play an important role.
In this chapter, we described the arrow to Figure 1(E): a verification engine that takes formal models and specifications, and generates logical queries that can be solved by tools like our work in Chapter 3—the arrow from Figure 1(E) to Figure 1(F). The main challenge with verifying distributed systems is in managing the level of abstraction of models. More abstract models can be easier to verify, but harder to link to implementations. Low-level models are hard to verify, but easier to link to implementations. We introduce a new abstraction, called message chains, that can be automatically extracted from and linked to low-level models. Message chains allow users to reason about sequences of related messages instead of only individual messages. This leads to proofs at a more abstract level while retaining the low-level details of low-level models.
Our empirical evaluation demonstrates that message chains are a powerful new tool that distributed system engineers can use to verify their systems. Specifically, we show that the P Verifier is expressive enough to replicate proofs from other systems; that the P Verifier performs comparably to mypyvy in terms of runtime; that message chains can simplify proofs when the system involves communication; and that the P Verifier can be used to verify industrial benchmarks.
A new version of the P Verifier, which is part of the P language toolchain and includes the advanced solver interface described in Chapter 2.4, is in use at Amazon Web Services today. The new P Verifier was used to verify an industrial version of the 2PC protocol. This protocol works for multiple rounds of transactions and is used to maintain agreement on a partitioned key-value store. See P Team (2025) for more details.
In Chapter 5, we presented a tool—a verification engine—that takes in a model and a set of properties, and attempts to prove by induction that the model satisfies the properties. Formal verification, however, can be a time-consuming task that requires significant manual effort. Especially for complex systems, users often need to manually provide, for example, loop invariants, function summaries, or environment models. Synthesis and learning have the potential to alleviate some of this manual burden (Seshia 2015). For example, prior work has used synthesis to reason about program loops (David, Kroening, and Lewis 2015), and to automate program repair (Le et al. 2017). In this chapter, we employ specification mining techniques, based on program synthesis and classic learning techniques, to lessen this burden. Specifically, in this chapter, we describe the arrow from Figure 1(C) to Figure 1(D): a new way to augment formal models with inferred specifications.
The intuition behind our approach is that message chains reveal enough structure that they can even be used to automatically learn meaningful specifications from only distributed system execution data. That is, given only example message chains (no access to the system definition or specifications), existing learning techniques can discover properties that hold for all executions of a system. In Chapter 6.4.1 we describe how an existing program synthesis technique can be used to discover the third invariant of Listing 4 using only example message chains. Without message chains, given only example execution logs, the same off-the-shelf, state-of-the-art tool is unable to learn any meaningful specifications.
This learning from examples problem is usually called specification mining (Ammons, Bodı́k, and Larus 2002) or likely invariant synthesis (Ernst et al. 2001) and has applications beyond verification. For example, specification mining can be used for automated program repair (Demsky et al. 2006), testing (Schuler, Dallmeier, and Zeller 2009), system understanding (Beschastnikh et al. 2015), and more. In this chapter, we focus on validating the use of message chains for specification mining by showing that learning techniques can discover invariants that appear in our manual verification efforts.
Figure 12 shows a workflow diagram that illustrates how specification mining and verification fit together. The user first inputs a model (Figure 12.a) of their system written in the version of P (Desai et al. 2018, 2013) described in Chapter 5.4.6. This model is fuzzed (Figure 12.b) to generate positive example message chains (Figure 12.c). We formally defined message chains in Chapter 5.4 and we formally define positive (and negative) examples in Chapter 6.3. The message chains in Figure 10 correspond to the “Positive Examples” in Figure 12.c: these can be obtained automatically by fuzzing the input model (Figure 12.a). The user can optionally add negative examples (Figure 12.d) and then run the specification mining framework (Figure 12.e) described in Chapter 6.4, producing likely invariants (Figure 12.f). The user can use these likely invariants as their specification, provide their own invariants (Figure 12.g), or both. Either way, the model (Figure 12.a) and invariants (Figure 12.f and Figure 12.g) are given to the verification engine (Figure 12.h) described in Chapter 5.4 which returns either true, if the proof by induction passes, or false otherwise. This process is usually iterative with the user changing their model, invariants, or adding examples manually or through fuzzing.
The rest of this chapter describes the specification mining components of Figure 12 in detail. We begin with a survey of related work in Chapter 6.2. We then present the necessary background to understand our approach in Chapter 6.3. We formally define the specification mining problem and describe two approaches for mining message chain invariants in Chapter 6.4. Finally, we evaluate both approaches empirically in Chapter 6.5. Overall, we make the following contributions.
We define the notion of positive and negative message chain examples and describe new ways to use existing learning and program synthesis techniques to automatically mine specifications from examples only.
We evaluate this technique by automatically mining useful specifications for the distributed systems we verified in Chapter 5.
In this section, we define specification mining and give the necessary background on two learning techniques that we use to solve the specification mining problem.
The function synthesis problem corresponds to the second-order query \[\exists f \; \forall \vec{x} \; \sigma(f, \vec{x}),\] where \(f\) is the function to synthesize, \(\vec{x}\) is the set of all possible inputs, and \(\sigma\) is the specification to be satisfied. Many program synthesis engines, like Burst (Miltner et al. 2022), focus on synthesizing recursive functions over algebraic data types.
When \(\sigma\) is a conjunction of input-output examples, like \(f(0) = 1 \land f(1) = 2 \land f(2) = 3\), we call the program synthesis query a programming-by-example query (Halbert 1984). When the function to synthesize is a predicate, we call examples of the form \(f(x) = \text{true}\) positive examples. Similarly, we call examples of the form \(f(x) = \text{false}\) negative examples. In this chapter, we use existing recursive function synthesis tools to learn predicates from positive and negative examples and call this process specification mining.
Specifically, the input to specification mining tools in our setting is a set of pairs. The first element of each pair is a message chain and the second element is a Boolean indicating whether the message chain can appear in an execution of the system. For example, for the ring leader election protocol, (Send(1, 4, eNominate(1), Empty), true) is a positive example and (Send(1, 4, eNominate(4), Empty), false) is a negative example. Given a set of positive and negative examples, the output of a specification mining tool is a function that evaluates to true for the positive examples and false for the negative examples. Intuitively, this function represents a system specification that is likely to hold, but is not guaranteed to hold. Note that allowing negative examples makes this definition slightly more general than the related definitions of Ernst et al. (2001) and Ammons, Bodı́k, and Larus (2002).
We present patterns and pattern languages (Angluin 1980), but with definitions tailored to SMT-LIB. Let \(\Sigma\) be a zero-ary sort with finite cardinality greater than two, which represents a finite alphabet of size equal to the cardinality of the sort. Let \(X = \{ x_1, x_2, \ldots\}\) be a countable set of symbols disjoint from \(\Sigma\), which we refer to as pattern variables. A sequence over \(\Sigma\) is an \(n\)-ary concatenation of symbols \(t_1, \ldots, t_n\) where every \(t_i\) is an element of \(\Sigma\), and the set of non-empty finite sequences over \(\Sigma\) is denoted \(\Sigma^+\).
A pattern \(p\) is a sequence over \(\Sigma \cup X\), i.e., an \(n\)-ary concatenation of symbols \(t_1, \ldots, t_n\) where every \(t_i\) is an element in \(\Sigma \cup X\). We use juxtaposition to represent concatenation, so write \(p \triangleq t_1 ... t_n\). We use \(\textit{Var}(p)\) to denote the pattern variables that occur in \(p\). Patterns define languages in the following way. For a pattern \(p\), call \(e(\omega') \triangleq p = \omega' \; \land \bigwedge_{x \in \textit{Var}(p)} \vert x \vert \geq 1\) the characteristic formula of \(p\), where the pattern variables in \(p\) are free variables, \(|x|\) denotes the length of \(x\), and \(\omega'\) is a given sequence in \(\Sigma^+\). The language of a pattern \(p\), \(L(p)\), is the largest set of sequences such that for every \(\omega\in L(p)\), \(e(\omega)\) is satisfiable. When \(m\) is a model of \(e(\omega)\) for a particular \(p\), we say that \(m\) justifies \(\omega\in L(p)\). For example, for \(\Sigma \triangleq \{f, g, h\}\) and \(X \triangleq \{x_1, x_2, ...\}\) the pattern \(p \triangleq x_1x_2x_3x_2x_1\) defines a pattern language \(L(p)\) that contains the sequence \(\mathit{fghgf}\) and the sequence \(\mathit{hfggggfh}\) but does not contain \(\mathit{fghfg}\), and the model \(\{x_1 \triangleq f, x_2 \triangleq g, x_3 \triangleq h\}\) justifies that \(\mathit{fghgf} \in L(p)\). We use this notion of justification to, at a high level, learn the most strict but justified semantic constraints on patterns over a useful concept class. The characteristic formula constrains the model to map each variable in \(\textit{Var}(p)\) to a concatenation of at least one symbol in \(\omega\). We do not use erasing patterns, where variables can map to the empty sequence, in this work.
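The membership check \(\omega \in L(p)\) reduces to a satisfiability query over the characteristic formula. The following sketch performs this check with Z3 sequences for the example pattern \(x_1x_2x_3x_2x_1\); the alphabet and the concrete sequences are the ones from the worked example above.

```python
from z3 import EnumSort, SeqSort, Const, Concat, Unit, Length, And, Solver, sat

Sigma, (f, g, h) = EnumSort("Sigma", ["f", "g", "h"])
Seq = SeqSort(Sigma)

def seq_of(symbols):
    # Build the concrete sequence value for omega from a list of alphabet elements.
    units = [Unit(s) for s in symbols]
    return Concat(*units) if len(units) > 1 else units[0]

def in_pattern_language(omega):
    # Characteristic formula of p = x1 x2 x3 x2 x1: the concatenation equals
    # omega and every pattern variable is non-empty (non-erasing).
    x1, x2, x3 = Const("x1", Seq), Const("x2", Seq), Const("x3", Seq)
    characteristic = And(Concat(x1, x2, x3, x2, x1) == seq_of(omega),
                         Length(x1) >= 1, Length(x2) >= 1, Length(x3) >= 1)
    solver = Solver()
    solver.add(characteristic)
    return solver.check() == sat  # any model justifies membership

print(in_pattern_language([f, g, h, g, f]))           # expected: True
print(in_pattern_language([h, f, g, g, g, g, f, h]))  # expected: True
print(in_pattern_language([f, g, h, f, g]))           # expected: False
```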
Angluin (1980) defines \(\ell\textit{-minl}\) and Shinohara (1982) presents an algorithm to compute it. The \(\ell\textit{-minl}\) algorithm takes in a set of sequences in \(\Sigma^+\) and returns the longest pattern that best describes the examples. We treat the \(\ell\textit{-minl}\) algorithm as a black box but rely on the following theorems.
Theorem 3 (Angluin (1980)). Let \(\Sigma\) be a zero-ary sort with finite cardinality greater than two, let \(\Omega\subseteq \Sigma^+\) be a set of finite sequences over \(\Sigma\) and let \(p\) be the output of the \(\ell\textit{-minl}\) algorithm. Then \(p\) is the longest pattern such that \(\Omega\subseteq L(p)\) and there is no pattern \(q\) with \(\Omega\subseteq L(q)\) and \(L(q) \subsetneq L(p)\).
Theorem 4 (Shinohara (1982)). The \(\ell\textit{-minl}\) algorithm can be computed by a deterministic polynomial-time Turing machine using an oracle in NP. Specifically, for a given pattern \(p\) and set of sequences \(\Omega\subseteq \Sigma^+\), the NP oracle checks \(\Omega\subseteq L(p)\) by checking \(\omega\in L(p)\) at most \(\vert \Omega\vert\) times. This membership check is NP-Complete for erasing and non-erasing patterns (Jiang et al. 1994).
Up until now, we have assumed that users provide system models and specifications. This is the same requirement imposed by related work. However, in practice, specifications are not always obvious and formally expressing them is not always easy. In this section, we describe an approach to mining specifications—message chain invariants—using existing learning techniques. Positive examples can be generated easily by fuzzing the target system. Negative examples, on the other hand, are harder to generate automatically. In our setting, we take negative examples to be hints from the user: they are examples that the user believes can never occur during an execution.
When message chains are represented as algebraic data types, our specification mining problem matches the programming-by-example problem for functional, recursive programs. In this section, we describe how to use an existing program synthesis technique in this space, Burst, to mine message chain invariants.
Burst is a synthesis tool based on bottom-up enumeration. The Burst algorithm, by design, will produce syntactically short functions. Burst begins by synthesizing a recursive function that satisfies the specification (positive and negative examples) assuming that undefined recursive calls behave exactly as required to satisfy the specification (referred to as “angelic” semantics). It then checks whether the synthesized recursive function satisfies the specification under the actual semantics, and, if not, strengthens the specification based on the existing assumptions. This process repeats until a recursive function is found that satisfies the specification and has no undefined recursive calls. However, the synthesizer may have to backtrack some of the specification strengthening if the specification is made unsatisfiable by any added assumptions.
At a high level, to synthesize message chain invariants using Burst, we give it queries consisting of the algebraic data type definition of message chains; a function signature (mapping a message chain to a Boolean); examples (message chains that do and do not occur); and a basic library of helper functions. This library includes functions for computing logical operators (like conjunction and negation) and simple functions for reasoning about message chains (like a function that takes a message chain and returns the innermost message if one exists). Burst then returns recursive functions that evaluate to true on all positive examples and false on all negative examples.
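For intuition, the following is the kind of recursive predicate over the message-chain ADT that such a query asks Burst to produce, written here in Python rather than Burst's input language; the field names, the particular property, and both examples are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Msg:
    source: int
    target: int
    payload: int
    rest: Optional["Msg"]  # the message chain this message extends, or None

def payload_nondecreasing(chain: Optional[Msg]) -> bool:
    # True iff payloads never decrease from the tail of the chain to its head,
    # i.e., the head always carries the largest label seen so far.
    if chain is None or chain.rest is None:
        return True
    return chain.payload >= chain.rest.payload and payload_nondecreasing(chain.rest)

# Hypothetical positive example: a node forwards the larger label it received.
pos = Msg(2, 1, 2, Msg(1, 2, 1, None))
# Hypothetical negative example: a node forwards a smaller label.
neg = Msg(2, 1, 1, Msg(1, 2, 2, None))
assert payload_nondecreasing(pos) and not payload_nondecreasing(neg)
```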
When message chains are represented as sequences, our specification mining problem matches the problem of learning formal languages from examples. In this section, we describe how to use non-erasing pattern languages to mine message chain invariants using only positive examples. We first give the necessary background on non-erasing pattern learning and then provide an extension to tailor it to our distributed systems domain.
Given only positive input message chain examples, the \(\ell\textit{-minl}\) algorithm can quickly learn a non-erasing pattern that likely holds for all message chains in the target system. Note that most similar learning algorithms, like those for learning regular languages, require both positive and negative examples. Unfortunately, non-erasing patterns on their own are not specific enough for our domain. This is in part because no non-erasing pattern containing at least one variable can represent a finite language, and many distributed systems only exhibit bounded message chains. For example, suppose you have a simple client-server system where clients send requests to servers and servers respond to the client. Every message chain in this system has length at most two, but no non-erasing pattern can represent this language.
To fix this, we extend the \(\ell\textit{-minl}\) algorithm with an enumerative approach that learns a conjunction of constraints. These constraints are useful for the distributed systems domain and yet guarantee that the enumerative algorithm terminates. We call the resulting pattern a path-pattern and denote it \(p[l]\), where \(p\) is the base \(\ell\textit{-minl}\) pattern and \(l\) is a set of added constraints. When adding constraints, the goal is to be as specific as possible without being too specific. We formalize the notion of too specific in the following definition. Given a set of sequences \(\Omega\) and a pattern \(p\) with characteristic formula \(e\), we say that a path-pattern \(p[l]\) is too specific iff \(\bigvee_{\omega\in \Omega} e(\omega) \nvDash l.\) That is, if every model that can be used to justify the training set is preserved by \(l\), \(l\) is not too specific.
Consider the following two examples. First, let \(\Sigma \triangleq \{f, g\}\), let \(\Omega\triangleq \{fg, gf, ff\}\), and the corresponding pattern \(p \triangleq x_1x_2\). In this case, the constraint \(l \triangleq \vert x_1 \vert = 1 \land \vert x_2 \vert = 1\) is better than the constraint \(l' \triangleq \textit{true}\) because it gives us more information about the variables in the pattern. It is more specific. Second, keep \(\Sigma\) and \(p\) as before, but take \(\Omega\) to be \(\{fg, gff, gggg\}\) and consider the three different constraints \(l \triangleq \vert x_1 \vert = 1\), \(l' \triangleq \vert x_2 \vert = 1\), and \(l'' \triangleq \textit{true}\). While it is the case that \(\Omega\subseteq L(p[l]) = L(p[l']) = L(p[l''])\), the first two constraints are too specific while the last constraint is not. That is, they impose constraints on the roles the variables in the pattern play that are not supported by evidence: if these path-patterns represented machine indexes, the first path-pattern would say that \(x_1\) always represents a single machine, the second path-pattern would say that \(x_2\) always represents a single machine, and the last path-pattern would make no such unfounded judgments. Therefore, in this case, we would consider \(l''\) to be the best choice. We now describe our procedure for learning two kinds of constraints over \(\ell\textit{-minl}\) patterns: length constraints and membership constraints.
Learning Lengths. The procedure to discover length constraints uses the following lemma.
Lemma 5 (Lengths). Let \(\Omega\) be a set of sequences, \(p[l]\) be a pattern such that \(\Omega\subseteq L(p[l])\) and \(p[l]\) is not too specific, \(x \in \textit{Var}(p)\) be a variable, \(x_1, x_2 \not\in \textit{Var}(p)\) be two fresh variables, and \(p' \triangleq p\langle x_1x_2 / x\rangle\) be \(p\) but with all occurrences of \(x\) substituted with \(x_1x_2\). If \(\Omega\cap L(p'[l]) = \emptyset\), then
\(p[l \land \vert x \vert = 1]\) is more specific than \(p[l]\) and
\(p[l \land \vert x \vert = 1]\) is not too specific.
Proof (Part 1). Let \(e_l\) be the characteristic formula of \(p[l]\). The characteristic formula of \(p[l \land \vert x \vert = 1]\) is then \(e_l \land \vert x \vert = 1\), and \(e_l \land \vert x \vert = 1 \models e_l\) holds by the semantics of conjunction.
Proof (Part 2). Suppose for contradiction that \(p[l \land \vert x \vert = 1]\) is too specific. That is, suppose that \[\bigvee_{\omega \in \Omega} e_l(\omega) \nvDash l \land \vert x \vert = 1.\] Then there exists a model \(m\) of \(\bigvee_{\omega\in \Omega} e_l(\omega)\) that is not a model of \(l \land \vert x \vert = 1\) and that interprets \(x\) as the sequence \(z\) such that \(e_l(\omega)\) evaluates to true for some sequence \(\omega\in \Omega\). Since \(m \models l\) (\(p[l]\) is not too specific by assumption), we have that \(m \nvDash \vert x \vert = 1\) and therefore that \(z\) is of length at least two. Let \(e_{p'}\) be the characteristic formula of \(p'[l]\) and let \(m'\) be the model that is like \(m\) in every way but extended to interpret \(x_1\) as the first element of \(z\) and \(x_2\) as the rest of \(z\). The contradiction is that we have \(m' \models e_{p'}(\omega)\) but, by the antecedent of the implication, \(\omega\not\in L(p'[l])\).
The procedure itself starts with the set of constraints \(\bigwedge_{x \in \textit{Var}(p)} \vert x \vert = 1\). For every \(x \in \textit{Var}(p)\), we check \(\Omega\cap L(p'[l]) = \emptyset\), where \(p'\) is constructed as in Lemma 5. If \(\Omega\cap L(p'[l]) = \emptyset\) holds for \(x\), then we keep \(\vert x \vert = 1\) in the conjunction and we move to the next variable. If it does not hold, then we remove \(\vert x \vert = 1\) from the conjunction and continue.
Theorem 6 (Effective Procedure for Discovering Lengths). Let \(\Omega\) be a set of sequences over \(\Sigma\), \(p\) be a pattern with \(\Omega\subseteq L(p)\), and \(l\) be the output of the above procedure. We claim that \(p[l]\) is not too specific and that for every set of constraints \(l'\) of the same form that is not too specific, \(p[l]\) is more specific than \(p[l']\). Furthermore, the above procedure runs in \(\mathcal{O}(\vert p \vert)\) time using an oracle in NP.
Proof (Lengths). We use the same oracle as the \(\ell\textit{-minl}\) algorithm in Theorem 4 but instead of checking \(\Omega\subseteq L(p)\) we check \(\Omega\cap L(p'[l]) = \emptyset\), where \(p'\) is constructed as in Lemma 5. We call this oracle once for every variable that appears in \(p\), which is at most the size of \(p\). Correctness follows directly from Lemma 5 and two facts. First, if \(\vert x \vert = 1\) is too specific on its own, then so is every conjunction that contains \(\vert x \vert = 1\). Second, the most specific set of constraints is the largest set.
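A compact sketch of this length-discovery loop, assuming a pattern is a list of tokens and that intersects is the NP-oracle from the proof above (e.g., implemented with a Z3 membership check like the one sketched earlier), is:

```python
def split(pattern, x, x1, x2):
    # p<x1 x2 / x>: replace every occurrence of variable x with the pair x1 x2.
    return [sym for tok in pattern for sym in ((x1, x2) if tok == x else (tok,))]

def discover_lengths(pattern, variables, examples, intersects):
    # Start from the most specific candidate and drop constraints that some
    # example refutes, exactly as in the procedure above.
    constraints = {x: "|x| = 1" for x in variables}
    for x in variables:
        x1, x2 = ("fresh", x, 1), ("fresh", x, 2)
        p_prime = split(pattern, x, x1, x2)
        remaining = {v: c for v, c in constraints.items() if v != x}
        if intersects(p_prime, remaining, examples):
            # Some example needs |x| >= 2, so |x| = 1 would be too specific.
            del constraints[x]
    return constraints
```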
Discovering Membership. Discovering membership constraints follows a similar process. To make things more tractable, we assume that users supply a list of pairwise disjoint subsets of \(\Sigma\) of interest. The following theorem is key.
Theorem 7 (Membership). Let \(\Omega\) be a set of sequences over \(\Sigma\), \(p[l]\) be a pattern such that \(\Omega\subseteq L(p[l])\) and \(p[l]\) is not too specific, \(x \in \textit{Var}(p)\) be a variable, \(x_1, x_2 \not\in \textit{Var}(p)\) be two fresh variables, and \(\sigma\) be a strict subset of \(\Sigma\). If for every \(c \in \Sigma \smallsetminus \sigma\) it is the case that \(\Omega\cap L(p\langle x_1cx_2 / x \rangle\{x_1, x_2\}[l]) = \emptyset\), then 1. \(p[l \land x \in \sigma^+]\) is more specific than \(p[l]\) and 2. \(p[l \land x \in \sigma^+]\) is not too specific.
Proof (Memberships). Similar to Lemma 5, suppose for contradiction that \(p[l \land x \in \sigma^+]\) is too specific. That is, suppose that \(\bigvee_{\omega \in \Omega} e_l(\omega) \nvDash l \land x \in \sigma^+.\) Then there exists a model \(m\) of \(\bigvee_{\omega \in \Omega} e_l(\omega)\) that is not a model of \(l \land x \in \sigma^+\) and that interprets \(x\) as the sequence \(z\) such that \(e_l(\omega)\) evaluates to true for some sequence \(\omega \in \Omega\). Since \(m \models l\) (\(p[l]\) is not too specific by assumption), we have that \(m \nvDash x \in \sigma^+\) and therefore that \(z\) is a sequence of the form \(y_1cy_2\), where \(y_1\) and \(y_2\) are fresh, erasing variables and \(c \in \Sigma \smallsetminus \sigma\). Let \(e_{p'}\) be the characteristic formula of \(p\langle x_1cx_2 / x \rangle\{x_1, x_2\}[l]\) for the same \(c\) and let \(m'\) be the model that is like \(m\) in every way but extended to interpret \(x_1\) as \(y_1\) and \(x_2\) as \(y_2\). The contradiction is that we have \(m' \models e_{p'}(\omega)\) but, by the antecedent of the implication, \(\omega \not\in L(p\langle x_1cx_2 / x \rangle\{x_1, x_2\}[l])\).
The procedure to discover membership constraints is again a simple loop. We start with the set of constraints \(\bigwedge_{\sigma} \bigwedge_{x \in \textit{Var}(p)} x \in \sigma^+\). For every \(x \in \textit{Var}(p)\) and input subset \(\sigma\), we check \(\Omega\cap L(p'[l]) = \emptyset\) for every \(c \in \Sigma \smallsetminus \sigma\), where \(p'\) is constructed as in Theorem 7. If \(\Omega\cap L(p'[l]) = \emptyset\) holds for \(x\) for every \(c\), then we keep \(x \in \sigma^+\) in the conjunction and we move to the next variable. If it does not hold for some \(c\), then we remove \(x \in \sigma^+\) from the conjunction and we move to the next variable. The proof that this is an effective procedure is exactly the same as that of Theorem 6, but note that the running time also depends on the number and size of the input sets \(\sigma\).
In summary, \(\ell\textit{-minl}\) gives us the best pattern from a syntactic perspective (Theorem 3) and we refine this pattern to get the best path-pattern from a semantic perspective. At a high level, to synthesize message chain invariants using this algorithm, we give it queries consisting of positive examples generated by fuzzing, and it returns path-patterns that capture these examples.
We combine the recursive function synthesis approach and the extended non-erasing pattern learning approach into the same framework as the P Verifier. This code includes fuzzing infrastructure to generate positive example message chains. We implement the extended \(\ell\textit{-minl}\) algorithm as a Python program of approximately 250 lines that uses Z3 for the NP oracle that checks sequence membership. The Python code is available with the OCaml code described in Chapter 5.4.6. We use Burst out of the box for synthesis of recursive functions.
For each target distributed system of interest, we give the framework many different queries. The queries all follow the same structure described above, but we categorize and clean the input data before calling the individual tools.
For categorization, we tag example message chains with the last instruction relevant to them, and we group message chains by this tag. For example, for the ring leader election protocol in Listing 4, there are four groups: message chains that have just started at line 13; message chains that have reached the leader at line 16; and message chains that have just been extended at lines 18 and 20. For cleaning, we take each group of message chains and we generate new sets of message chains that each focus on a different aspect. Specifically, we generate a set that removes all payload information from message chains, leaving only the sequence of nodes visited by each message chain (the source of each message along with the final target); and we generate a set that removes all source and target information, leaving only the sequence of payload values exchanged.
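A small sketch of this categorization and cleaning step, assuming each example message chain is a list of (source, target, payload, tag) tuples with the newest message last, where tag records the last relevant instruction, is:

```python
from collections import defaultdict

def categorize(chains):
    # Group example chains by the tag of their newest message.
    groups = defaultdict(list)
    for chain in chains:
        groups[chain[-1][3]].append(chain)
    return groups

def nodes_only(chain):
    # Keep only the machines visited: each source plus the final target.
    return [src for (src, _, _, _) in chain] + [chain[-1][1]]

def payloads_only(chain):
    # Keep only the sequence of payload values exchanged.
    return [payload for (_, _, payload, _) in chain]

def make_queries(chains):
    # One full query and two cleaned queries per group, as described above.
    queries = []
    for tag, group in categorize(chains).items():
        queries.append((tag, "full", group))
        queries.append((tag, "nodes", [nodes_only(c) for c in group]))
        queries.append((tag, "payloads", [payloads_only(c) for c in group]))
    return queries
```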
For the ring leader election protocol, this categorization and cleaning process generates 12 queries (4 groups, 3 versions of each query). We give all twelve queries to both specification mining methods (using the algebraic data type and sequence encodings as appropriate) and we return 24 suggested invariants to the user. The input queries to Burst contain three negative examples generated by hand and eight positive examples generated automatically by fuzzing. The three negative examples represent three obviously impossible message chains: one where a node sends a value that is not its own and that it did not receive; one where a node sends a value to a node that is not on its “right;” and one where a node receives its own value and, instead of declaring itself the winner and stopping, extends the message chain by sending a new message.
We evaluate our specification mining framework on the four unique systems from the benchmarks that display message chains longer than three messages, because these are the hardest cases. The other benchmarks from related work are either simplified versions of the four (e.g., ‘client-server-ae’ is like ‘client-server-db-ae’ but without a database) or do not display very complicated message passing (e.g., ‘ticket,’ which we briefly describe in Chapter 5.5.1). We generate examples and create specification mining queries as described in Chapter 6.4. All specification mining runs terminated in less than a second.
For “ring,” we are able to automatically discover the first auxiliary invariant of Theorem 2, and the correct, but less immediately useful fact that the suffix of message chains leading to a victory is a list of nodes from left to right, ending in the leader. For “client-server-db-ae,” we are able to automatically discover the first (target) and fourth (auxiliary) invariants described in the verification in Chapter 5.5.2.1; and that every message chain in the client-server-db system begins with a Client—a fact that is very close to the third invariant of the same verification. For “firewall” we are able to discover the specification that every message arriving at an internal node must have originated from an internal node; that the firewall blocks messages originating from external nodes; that blocked message chains are of length at most one; and that complete message chains are of length at least one. For “tor,” we are able to automatically mine the target invariant described in Chapter 5.5.2.4.
In total, this means six out of 24 invariants used in the proofs were mined. Removing the mined specifications from their respective proofs makes each of the proofs fail: these six specifications are essential. The mined specifications that do not appear in the handwritten proofs may still be useful for other proofs or other applications, like testing, debugging, and documentation, but a thorough evaluation of their effectiveness for those applications is needed to draw any conclusions about them. In conclusion, we find that we can learn meaningful specifications using only message chain examples.
In this chapter, we presented the arrow from Figure 1(C) to Figure 1(D): a specification mining toolkit that uses the structure provided by message chains to discover meaningful invariants for distributed system models. In short, our tools take execution examples and infer specifications that likely hold for all executions. This reduces the burden on verification users by suggesting invariants that would otherwise require manual effort to discover and formalize.
We empirically evaluated our tools and found that six out of 24 invariants that were used in the proofs in Chapter 5 can be mined automatically in under a second each. Without the structure of message chains, the underlying learning algorithms cannot discover meaningful properties. Therefore, we conclude that message chains expose enough structure in distributed executions to make automated specification mining feasible and effective.
Our techniques for generating specifications for distributed systems assume that we have access to a formal model of the system. Unfortunately, the process of formally modeling a distributed system can be as arduous as specifying it.
In Mora et al. (2020) we integrated classic synthesis techniques, like those explored in Chapter 6, into UCLID5, our formal modelling and verification tool. The synthesis integration allowed users to generate parts of their models driven by verification goals defined by bounded model checking, k-induction, sequential program verification, and hyperproperty verification. We used this integration to generate 25 program synthesis benchmarks with simple, known solutions but found that the state-of-the-art formal synthesis engines of the time could not find solutions.
In this chapter, based on Mora et al. (2024), we explore a modern alternative that uses large language models. Specifically, we describe the arrow into Figure 1(C): a new technique for generating formal models from text descriptions of systems. Instances of this methodology are domain-specific. The methodology itself, however, is general.
Language models (LMs) have demonstrated an exceptional ability to generate code from natural language prompts for popular programming languages, like Python and Java (M. Chen et al. 2021). Unfortunately, these same language models struggle to generate code for low-resource programming languages (LPLs), like many domain-specific languages (e.g., CUDA (Tarassow 2023)). These challenges are even more pronounced for very low-resource programming languages (VLPLs) (e.g., formal verification languages like UCLID5 (Seshia and Subramanyan 2018; Polgreen et al. 2022) and mar). Existing work has attempted to remedy this issue through prompting, constrained decoding, and fine-tuning strategies. Unfortunately, these approaches fail to capture the intricacies of real VLPLs, and so success remains limited. To see why, consider the following three exemplars.
First, B. Wang et al. (2023) include context-free grammars in text-to-code prompts to guide LMs toward syntactically correct answers. This approach works well for simple languages but cannot capture many programming languages that are context-sensitive. Second, L. Agrawal et al. (2023) use static analysis techniques to reject tokens that lead to syntactically incorrect output programs. This technique can go beyond context-free languages but assumes a linear programming process. Unfortunately, it is well known that the root cause of a programming error need not surface at the point where it is written (Pavlinovic, King, and Wies 2014), necessitating backtracking and a nonlinear programming process. Third, Cassano et al. (2024) translate training data from high-resource languages to low-resource languages and then use this new data to fine-tune models. This approach is restricted to languages that the LM can already translate into reliably, even when it cannot generate code in them from natural language. Further, this approach makes the overly restrictive assumption that the target low-resource language is general purpose: e.g., we cannot translate arbitrary Java programs to CUDA.
In this chapter, we propose a text-to-code approach that is fundamentally different from (and complementary to) prompting, decoding, and fine-tuning strategies. The first key idea behind our approach comes from natural programming elicitation, a kind of study that helps programming language designers understand how programmers “naturally” approach problems from a given programming domain (Myers, Pane, and Ko 2004; Chasins, Glassman, and Sunshine 2021). Programming language designers use the results of these studies to create languages that are aligned with the expectations of users, leading to less programming friction and more effective developers. We borrow this idea for the setting where LMs are the “users” of programming languages. Akin to uncovering what human users find “natural” for a given domain, we uncover what LMs find “natural.” Specifically, our first insight is to embrace LMs’ tendencies and design an intermediate language that aligns with them.
The second key idea in our approach is that program analyses and repairs that are overly aggressive for human users may be suitable for LM “users.” For example, in UCLID5, all variables have to be declared and statically typed: an assignment like x = 0; would require a corresponding declaration like var x: integer;. But, if an LM generates code that has an assignment without a declaration, instead of crashing, one could automatically “repair” the program and output the result.
We use these two ideas to define a new text-to-code approach called synthetic programming elicitation and compilation (SPEAC, pronounced “speak”). Specifically, for a target VLPL \(T\), we use synthetic programming elicitation to select an intermediate language \(P\) (the “parent” language) and define a subset of that language, \(C\) (the “child” language). The language \(P\) should be one that LMs are good at generating (e.g., Python); the language \(C\) should be easy to compile to the target VLPL \(T\). Our approach takes \(P\), \(C\), and a compiler from \(C\) to \(T\), and produces a text-to-code pipeline for the VLPL \(T\). This pipeline uses deductive techniques to automatically repair programs generated by LMs that are in \(P\) but not in \(C\). When these deductive techniques are unable to fully repair a program, we insert a “hole” and ask an LM to finish the repair, repeating as necessary.
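The following Python sketch summarizes this pipeline; the function names (llm, repair, compile_to_target) are placeholders for an LM call, the deductive repair of Chapter 7.4.2, and the \(C\)-to-\(T\) compiler, not the actual Eudoxus API.

```python
def speac(task: str, llm, repair, compile_to_target, max_iters: int = 5) -> str:
    """Sketch of the SPEAC loop: generate in P, repair toward C, fill holes, compile to T."""
    # Ask the LM for a first attempt in the child language C.
    program = llm(f"Solve the following task in the child language:\n{task}")
    for _ in range(max_iters):
        # Deductively repair the program toward C; holes (??) mark what we could not fix.
        repaired, has_holes = repair(program)
        if not has_holes:
            return compile_to_target(repaired)  # a complete program in C: emit T
        # Otherwise, ask the LM to complete the partial program and try again.
        program = llm(f"Complete the following partial program:\n{repaired}")
    raise RuntimeError("iteration limit reached without a complete program")
```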
We demonstrate the effectiveness of this idea by implementing a prototype, called Eudoxus, that targets UCLID5 (Seshia and Subramanyan 2018; Polgreen et al. 2022), a language used for formal modeling and verification of state transition systems. UCLID5 has code examples numbering in the hundreds rather than thousands or millions. Furthermore, UCLID5 programs rely heavily on the notion of a transition system, which is not frequently found in other programming languages. As such, state-of-the-art LMs are unable to generate any useful UCLID5 code out-of-the-box (see Chapter 7.4.1). In our case study, we use Python as the parent language \(P\) and a subset of Python as the child language \(C\), and improve the performance of LM code generation for UCLID5. Overall, we make the following contributions.
We present SPEAC, a novel method for generating syntactically correct code from LMs in very low resource programming languages;
we implement this method for the UCLID5 verification language; and
we demonstrate substantial improvement with SPEAC in syntactic correctness, producing parsable code in UCLID5 84.8% of the time compared to 9.1% by gpt-3.5-turbo fine-tuning and 12.1% by gpt-4-turbo in-context learning.
Given a natural language description of a programming task and a target programming language \(T\), the text-to-code problem is to generate a program \(t \in T\) that satisfies the task specification. Listing 5 and Listing 6 show a real input-output pair generated by an instance of our approach targeting the UCLID5 programming language. Specifically, Listing 5 shows a task extracted from Lee and Seshia (2016), and Listing 6 shows the output corresponding to that task using a prototype implementation of our approach. Listing 6 passes all compiler checks but has a subtle semantic mistake on line 4.
Listing 5: Partial task description from Lee and Seshia (2016).
Model a state machine describing a traffic light at a pedestrian
crosswalk. This is a time triggered machine that assumes it will
react once per second. It starts in the red state and counts 60
seconds with the help of the variable count. It then transitions
to green, where it will remain until the pure input pedestrian
is present. That input is generated by some other subsystem that
detects when a pedestrian is present, and should be modelled as
nondeterministic. When pedestrian is present, the machine
transitions to yellow if it has been green for at least 60 seconds.
Otherwise, it transitions to pending, where it stays for the
remainder of the 60 second interval. This ensures that once the
light goes green, it stays green for at least 60 seconds. At the
end of 60 seconds, it will transition to yellow ...
Listing 6: Partial output of Eudoxus in UCLID5. We interpret sigG, sigR, sigY to represent green, red, and yellow light signals, respectively. The procedure step captures the transition relation of the state machine. state appears to be a bookkeeping variable that is used to track the cases in the task, and count represents a timer.
module TrafficLight {
var sigG, sigR, sigY: boolean;
var count, state: integer;
var pedestrian: boolean;
init {
sigG = false; sigY = false; sigR = true;
state = 0; count = 0; pedestrian = false; }
procedure step()
modifies sigG; modifies sigY; modifies sigR;
modifies count; modifies state; {
if (state == 0) {
sigG = false; sigY = false; sigR = true;
count = 0;
if (count < 60) { count = (count + 1); }
else { state = 1;}}
else { if (state == 1) {
sigR = false; sigY = false; sigG = true;
if (pedestrian) {
if (count >= 60) { state = 2; }
else { state = 3; }}}
...
Figure 13 shows the workflow that generates models like Listing 6 from task descriptions like Listing 5. The workflow is parameterized by an LM, \(L\) (e.g., gpt-3.5-turbo-0125); a target language, \(T\) (e.g., UCLID5); a parent language, \(P\) (e.g., Python); a child language, \(C \subset P\) (e.g., a subset of Python); and a compiler, \(f\), from \(C\) to \(T\) (e.g., a syntax-directed translation from the subset of Python to UCLID5). Given an input task, we create a prompt \(q\) that asks the LM \(L\) to generate code, \(p \in C\), which satisfies the task description, \(q\). The second step of the workflow is to repair \(p\). If there is nothing wrong with \(p\), or \(p\) can be fixed using the formal techniques described in Chapter 7.4.2, then repairing will generate a new, complete program \(p'\) and return \(f(p')\) (i.e., a program in the target language, like Listing 6). Frequently, however, repairing will generate a partial program containing “holes” (denoted “??”). For example, Listing 10 shows the first \(p\) generated for the task in Listing 5 and Listing 11 shows the corresponding partial program \(p'\) that was automatically generated using our formal techniques. Programs with holes cannot be compiled to the target language, so the third step of the workflow is to ask the LM to complete the partial program \(p'\), generating a new \(p\). We use the template in Listing 9 to generate the LM prompt. Listing 12 shows the output generated by gpt-3.5-turbo-0125 when asked to complete the partial program in Listing 11. This program is still incorrect, but it is now close enough that we can automatically repair it to a complete program without holes. Listing 13 shows the relevant portion of that complete program. This final, complete program is directly translated to the output in Listing 6. Chapter 7.4 elaborates on each of the workflow components.
To understand the subtle mistake in Listing 6 one needs some understanding of UCLID5. UCLID5 is a verification language used to model hardware and software systems as “state machine”-like transition systems and to automatically check if the transition system does indeed satisfy a formal logic specification. UCLID5 transition systems primarily consist of a state space given by a set of variable declarations (e.g., lines 2-4 in Listing 6), an initialization block that represents a set of acceptable start states (e.g., lines 5-7 in Listing 6), and a transition relation block that defines how the transition system moves from one state to the next state (e.g., the code starting at line 8 in Listing 6). The var keyword is used to declare variables that are internal to the transition system in question, while the input keyword is used to declare read-only variables that are outside the control of the transition system in question. Listing 6 passes all compiler checks but has a subtle semantic mistake on line 4: var pedestrian: boolean; should be input pedestrian: boolean; because the presence of a pedestrian should not be controlled by the traffic light transition system. When manually assessing correctness in Chapter 7.5, this subtle mistake would prevent us from marking this example as fully correct.
In this section we describe the SPEAC approach. We begin with how to select \(P\) and \(C\) (Chapter 7.4.1) and the prompting that we use to interface with LMs. We then describe two formal techniques at the heart of our automated repair (Chapter 7.4.2).
Listing 7: Hypothetical LM output with grounded theory codes as comments. Codes are determined and assigned manually.
In this section, we present synthetic programming elicitation as a kind of programming study—where LMs are the subject—that follows three steps. The results of these studies are used to select \(P\) and \(C\). We call this process synthetic programming elicitation since it is inspired heavily by natural programming elicitation (Myers, Pane, and Ko 2004). To make this section more concrete, for each step of the study, we describe the specific steps we followed for targeting UCLID5.
First Step: Setup. First prepare LM subjects, select a target language, and collect or construct a set of programming tasks. In our case, our target language is UCLID5, our LM subjects are gpt-4-turbo-2024-04-09 and gpt-3.5-turbo-0125, and our programming tasks are a set of regression tests written by UCLID5 developers with corresponding natural language descriptions that were written by hand (see Chapter 7.5 for more details).
The second part of the study setup is to create multiple kinds of specialized prompts. The first will ask for the output to be in the target language directly. Subsequent prompts will ask for the output to use an imaginary API in a popular host language, like Python or Java. For example, for UCLID5, we generated prompts that look like the following.
“Write a UCLID5 model for a system that counts the number of times the temperature exceeds a threshold [\(\ldots\)]”
“FormalVerificationLibrary is a Python library for generating verification models [\(\ldots\)]. Use the FormalVerificationLibrary to model a system that counts the number of times the temperature exceeds a threshold [\(\ldots\)]”
Second Step: Execute. Now we execute every LM subject on every prompt and collect the outputs. Each query should be executed independently.
Third Step: Analyze and Design. Finally, we analyze the collected LM outputs and select \(P\) and \(C\) based on the analysis. To do this analysis, we follow a lightweight version of grounded theory (Glaser and Strauss 2017). See e.g., Stol, Ralph, and Fitzgerald (2016) for general guidelines on how to use grounded theory in software engineering research. See e.g., Barke, James, and Polikarpova (2023) for a specific, detailed case-study. Synthetic programming elicitation studies are inspired by and borrow many of the methods from those works. Our studies are categorically different, however, because we are not working with humans. We cannot interact with our participants in the same way, e.g., we cannot conduct semi-structured interviews. The goal of our studies is to design an intermediate language that more closely matches the code that LMs tend to generate.
The first step of our lightweight grounded theory analysis is to “code” the outputs generated by the LMs. In grounded theory parlance, “code” is akin to manually labeling parts of the generated outputs. For example, Listing 7 shows a hypothetical LM output for our running example along with grounded theory codes as comments. The second step is to group codes into concepts. Concepts are slightly more abstract than codes. For example, a concept that may emerge from Listing 7 is the group of codes 1, 2, and 3. The corresponding concept could be “using a class from the hypothetical library.” In the third step, we group concepts into categories. For example, we may group concepts related to the object oriented programming paradigm as one category. Finally, we select a \(P\) and \(C\) that are consistent with the final categories we observed across multiple prompts.
In our study, we found that, when asked to write UCLID5 code without any special prompting or training, no LM was able to produce code that parses (660 attempts: 10 samples per LM per benchmark, 33 benchmarks, two LMs). Worse still, the code generated by LMs was inconsistent, with each LM giving different outputs that resemble different programming languages at different times. When asked to write Python code that used a non-existent formal verification API, however, the LM outputs were more consistent. Therefore, we selected Python as our parent language, \(P\).
Specifically, the Python code was more consistent because LM outputs fell into three broad categories, which we call “from-scratch,” “procedural,” and “object-oriented,” respectively. Programs in the “from-scratch” category used existing APIs (e.g., the Z3 solver API (de Moura and Bjørner 2008)) to re-implement what UCLID5 does, e.g., to manually model transition systems. This was the smallest significant category. Programs in the “procedural” category imported a class from the hypothetical API, created an instance of it, and then called methods from the class to declare variables, assert specifications and so on. Programs in the “object-oriented” category imported a class from the hypothetical API and extended it, including methods that correspond closely to parts of UCLID5 code. For example, these extended classes often included a method that represented a transition relation—a critical component of UCLID5 programs and of transition systems generally.
After analyzing the outputs, we concluded that it would be easiest to translate Python programs from the “object-oriented” category to UCLID5. For example, we observed that methods which represent transition relations used a limited set of names, and that methods themselves could be compiled to UCLID5 rather directly. Therefore we defined a subset of Python that matches the “object-oriented” category and used that as our child language, \(C\). Essentially, \(C\) is an abstract class that the LM must extend. For example, the abstract class includes an abstract method called “next” that corresponds to the transition relation. Listing 13 shows a portion of an example of a Python program in \(C\) and Listing 6 shows a portion of its corresponding UCLID5 program.
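As a rough illustration of what the child language looks like, the sketch below defines an abstract base class in the spirit of the Module class that appears in Listing 10; the exact set of methods and their signatures here are assumptions, not the precise Eudoxus interface.

```python
from abc import ABC, abstractmethod

class Module(ABC):
    """Hypothetical base class that child-language programs must extend."""

    @abstractmethod
    def locals(self):
        """Declare the state variables internal to the transition system."""

    @abstractmethod
    def init(self):
        """Constrain the initial states."""

    @abstractmethod
    def next(self):
        """Define the transition relation: how the state evolves in one step."""

# A tiny example program in the child language C.
class Counter(Module):
    def locals(self):
        self.count = int

    def init(self):
        self.count = 0

    def next(self):
        self.count = self.count + 1
```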
Listing 8: Partial SPEAC prompt template for first generation.
Listing 9: Partial SPEAC prompt template for hole filling.
After our synthetic programming elicitation study—once \(P\) and \(C\) have been selected—we use minimal prompting to interface with LMs. For example, for UCLID5, we use the prompt template in Listing 8 to create initial programs and the prompt template in Listing 9 to fill in holes. More advanced prompting techniques are complementary and could help here. However, for this work, we used minimal prompting so as to independently evaluate our contribution.
Given a program \(p\) in \(P\), but not in \(C\), our aim is to remove as little as possible from \(p\) to bring it into the language \(C\). That is, given the AST for \(p\), we wish to generate the largest subtree of the AST, possibly containing holes, that is not rejected by the static checks of the language \(C\). For example, the code in Listing 10 contains some errors, including mismatches between variable and constant types (UCLID5 does not automatically cast constants, so a line like self.count = 0 is interpreted as assigning the integer literal 0 to the bit-vector variable self.count, which is a type error).
We find the largest consistent subtree in three steps. First, we use an error-tolerant parser to build an AST, \(A\), for the language \(P\). Second, beginning at the root of \(A\), we recursively descend and prune anything that cannot exist in an AST for the language \(C\), placing holes wherever they are needed. This is an aggressive program slice, similar to Akhundov et al. (2021), and it yields a new AST \(A'\). \(A'\) may or may not pass compiler checks, like type checking, or be semantically correct.
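A minimal sketch of this pruning pass, assuming a simplified AST representation and an illustrative whitelist of node kinds (neither taken from the Eudoxus code base):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str
    children: list["Node"] = field(default_factory=list)

HOLE = Node("hole")  # printed as ?? in the partial program

# Node kinds permitted by the child language C (an illustrative subset).
ALLOWED = {"module", "def", "assign", "if", "attribute", "name", "int", "bool"}

def prune(node: Node) -> Node:
    """Keep the largest subtree whose nodes all belong to C; replace the rest with holes."""
    if node.kind not in ALLOWED:
        return HOLE
    return Node(node.kind, [prune(child) for child in node.children])
```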
The third step finds the minimal number of AST nodes that need to be replaced with holes such that all static checks of the language \(C\) pass, using the approach of Pavlinovic, King, and Wies (2014). Specifically, for every static check of the language and for every node of the AST, we generate a set of soft clauses. Let \(S\) be the union of all the generated clauses, and let \(S'\) be the largest subset of clauses that is satisfiable (the output of a MAX-SMT solver). Any clause, \(c\), that appears in \(S\) but does not appear in \(S'\) represents a bug: if we add \(c\) to \(S'\), then we get an unsatisfiable set of clauses. So, for every such \(c\), we replace the AST node that generated \(c\) with a hole.
For example, in UCLID5, the two statements var x: bv32; and x = 0; would produce a set of clauses corresponding to the assertions that (a) x is a bitvector and that (b) x is an integer (since 0 is an integer). These assertions, together, are unsat, so MAX-SMT would return a set of clauses that does not include both (a) and (b). One solution would be to remove the bitvector type, producing the partial, but consistent, two statements var x: ??; and x = 0;.
We can bias the solver to include one or the other by adjusting the weights on the corresponding clauses. In practice, we assign clauses weights that are proportional to the depth of the corresponding node in the AST. This ensures that edits are closer to the leaves of the AST but it breaks syntactic minimality guarantees. In the case where weights are uniform and every node corresponds to exactly one clause, the MAX-SMT approach will make the minimal edit to the AST.
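The following sketch encodes the var x: bv32; / x = 0; example with Z3's Optimize engine; it is a minimal illustration of the soft-clause idea, not the Eudoxus implementation, and the selector names and weights are invented for this example.

```python
from z3 import Optimize, String, Bool, Implies, is_false, sat

opt = Optimize()

# One selector Boolean per AST node that contributed a static check.
decl_ok = Bool("decl_bv32")   # from `var x: bv32;`
assign_ok = Bool("assign_0")  # from `x = 0;`

# Hard constraints tie each selector to the check it represents.
x_type = String("type_of_x")
opt.add(Implies(decl_ok, x_type == "bv32"))      # check (a): x is a bit-vector
opt.add(Implies(assign_ok, x_type == "integer")) # check (b): x is an integer

# Soft clauses: prefer to keep every node; the weights bias which node the
# solver blames when the checks conflict (here, the declaration is cheaper to drop).
opt.add_soft(decl_ok, weight=1)
opt.add_soft(assign_ok, weight=2)

assert opt.check() == sat
model = opt.model()
# Selectors forced to false mark the AST nodes to replace with holes.
holes = [b for b in (decl_ok, assign_ok) if is_false(model.eval(b, model_completion=True))]
print(holes)  # [decl_bv32] -> repair `var x: bv32;` to `var x: ??;`
```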
Listing 10: Partial response for task in Listing 5 using gpt-3.5-turbo-0125. Line 5 declares state as a local variable of bit-vector type; lines 25, 33, and 42 use state as an integer.
class TrafficLight(Module):
def types(self):
self.state_t = BitVector(2)
def locals(self):
self.state = BitVector(2)
self.count = BitVector(6)
self.pedestrian = Boolean()
self.sigG = Boolean()
self.sigY = Boolean()
self.sigR = Boolean()
def inputs(self):
self.pedestrian = Boolean()
def outputs(self):
self.sigG = Boolean()
self.sigY = Boolean()
self.sigR = Boolean()
def init(self):
self.state = BitVector(2)
self.count = BitVector(6)
self.pedestrian = Boolean()
self.sigG = Boolean()
self.sigY = Boolean()
self.sigR = Boolean()
def next(self):
if self.state == 0:
self.sigG = False
self.sigY = False
self.sigR = True
self.count = 0
if self.count < 60:
self.count += 1
else:
self.state = 1
...
elif self.state == 3:
self.sigG = False
self.sigY = False
self.sigR = False
if self.count < 60:
self.count += 1
else:
self.state = 0
Listing 11: Partial first repair. Line 3 repairs the type of state to be an integer.
class TrafficLight(Module):
def locals(self):
self.state = int
self.count = int
self.pedestrian = bool
def outputs(self):
self.sigG = bool
self.sigY = bool
self.sigR = bool
def init(self):
self.state = ??
self.count = ??
self.pedestrian = ??
self.sigG = ??
self.sigY = ??
self.sigR = ??
def next(self):
if self.state == 0:
self.sigG = False
self.sigY = False
self.sigR = True
self.count = 0
...
Once we have a satisfiable set of clauses, we can generate a corresponding satisfying model and use the model to repair the partial program. This is where our work most differs from Pavlinovic, King, and Wies (2014). For example, the partial program var x: ??; from above would correspond to the set of clauses \(\{\texttt{is\_Integer}(\texttt{??})\}\). This clause would be satisfied by setting ?? to integer, and so we can repair the partial program to be var x: integer;.
Listing 12 shows one buggy AST and Listing 13 shows the corresponding fix. For example, the variable count is declared as an integer in the repaired program because it is used as an integer in the buggy program. For repairs that cannot be automatically resolved by the MAX-SMT model, we use the LM to generate code to replace the AST holes, as shown in Figure 13.
Listing 12: Partial response to the partial program in Listing 11 using gpt-3.5-turbo-0125. The outputs method should declare variables and their types, but on lines 6-9 it assigns variables to values (i.e., False instead of bool).
Listing 13: Partial second repair. The repair declares variables with the appropriate types on lines 5-7.
In this section, we implement a prototype of SPEAC for UCLID5, called Eudoxus, and use it to evaluate the performance of SPEAC across three research questions.
Does Eudoxus perform better than LM baselines on syntactic correctness (passing compiler checks)?
Does Eudoxus perform better on semantic correctness?
Is the computation time spent solving MAX-SMT queries prohibitive?
Eudoxus uses the \(P\) and \(C\) described in Chapter 7.4 and is implemented in Python. We use tree-sitter to parse and search partial programs, and Z3 (de Moura and Bjørner 2008; Bjørner, Phan, and Fleckenstein 2015) to solve MAX-SMT queries. We allow Eudoxus to repeat the repair loop five times. All MAX-SMT queries are solved locally on a 2.3 GHz Quad-Core Intel Core i7 with 32 GB of RAM. All LM calls are made through the OpenAI Python API.[Code available at https://github.com/uclid-org/eudoxus]
We compare Eudoxus against three LM baselines: few-shot, self-repair, and fine-tuning, using gpt-3.5-turbo-0125 (GPT3t) and gpt-4-turbo-2024-04-09 (GPT4t). For the few-shot baselines, we provide examples taken from the UCLID5 GitHub repository to the LM in context; we tried one in-context example and then three in-context examples, and we also combine few-shot prompting with chain-of-thought (Wei et al. 2022). As with many VLPLs, there are only a very limited number of natural language to UCLID5 examples in the public domain, and certainly not enough for fine-tuning. We do, however, have access to \(317\) regression tests from the open-source code base. For the purposes of fine-tuning, we use self-instruct (Y. Wang et al. 2023) to first ask the model to summarize the UCLID5 regression tests in natural language and then provide these summaries as the natural language descriptions during fine-tuning. We fine-tune GPT3t for 284 steps with a learning-rate multiplier of 0.16. For self-repair, we give the LMs compiler feedback in a loop for five iterations, starting from a zero-shot prompt.
We use three sets of UCLID5 benchmarks. The first set is a curated set of \(33\) regression tests with handwritten natural language descriptions. These tests are designed to cover a broad range of UCLID5 features and were used for the synthetic programming elicitation study in Chapter 7.4.1. The second set is a set of \(317\) regression tests without descriptions taken directly from the UCLID5 GitHub repository. These tests were used for fine-tuning our baseline model, as described above. This set is representative of the quantity and quality of data that a user may have available for a VLPL. The third and final set is a set of \(33\) exercises and examples taken from three textbooks (Baier and Katoen 2008; Lee and Seshia 2016; Huth and Ryan 2004). These benchmarks cover formal modeling of concurrent systems, linear-time properties, model checking and systems with discrete dynamics. We used this final set for the end-to-end evaluation below.
We run two variations of Eudoxus (one using GPT3t, one using GPT4t) and 11 variations of the baselines on all \(33\) textbook benchmarks. We report the fraction of outputs that pass all compiler checks (“parse,” for short) and semantic correctness over all 13 approaches in Table 4. Semantic correctness is manually assessed by one author using a combination of manual reasoning and hand-written specifications for all programs that compile. Correctness is rated on a scale from \(1\) to \(5\), where \(1\) is entirely wrong and \(4\) indicates that the output is correct except for a few minor errors (e.g., the example output described in Chapter 7.2 and shown in Listing 6). Intuitively, any score of \(\geq 4\) indicates that we believe the output would be useful to text-to-code users.
Table 4: Parse rate and distribution of semantic correctness scores (1 to 5) for each approach over the 33 textbook benchmarks.

| Model | Parse rate | Score 1 | Score 2 | Score 3 | Score 4 | Score 5 |
|---|---|---|---|---|---|---|
| Eudoxus (GPT4t) | 24/33 | 1/24 | 1/24 | 3/24 | 8/24 | 11/24 |
| Eudoxus (GPT3t) | 28/33 | 3/28 | 6/28 | 5/28 | 5/28 | 9/28 |
| Fine-tuned GPT3t | 2/33 | 0 | 2/2 | 0 | 0 | 0 |
| One-shot with COT (GPT4t) | 1/33 | 0 | 0 | 0 | 1/1 | 0 |
| One-shot with COT (GPT3t) | 0 | - | - | - | - | - |
| One-shot (GPT4t) | 0 | - | - | - | - | - |
| One-shot (GPT3t) | 0 | - | - | - | - | - |
| Three-shot with COT (GPT4t) | 3/33 | 0 | 0 | 0 | 2/3 | 1/3 |
| Three-shot with COT (GPT3t) | 1/33 | 0 | 0 | 0 | 0 | 1/1 |
| Three-shot (GPT4t) | 4/33 | 0 | 0 | 0 | 3/4 | 1/4 |
| Three-shot (GPT3t) | 2/33 | 0 | 0 | 1/2 | 0 | 1/2 |
| Self-repair (GPT4t) | 0 | - | - | - | - | - |
| Self-repair (GPT3t) | 0 | - | - | - | - | - |
Eudoxus outperforms all baselines in terms of compiler checks (see “Parse rate” in Table 4), passing all compiler checks on \(78\%\) of benchmarks (52 of the 66 attempts across the two Eudoxus variants). There are four benchmarks on which Eudoxus hits the iteration limit and fails to produce a result, and three benchmarks with small syntax errors due to bugs in our Eudoxus implementation (e.g., failing to extract code snippets from LM outputs or printing UCLID5 code incorrectly). In contrast, we find that GPT3t and GPT4t rarely produce UCLID5 models that even parse. The best results for GPT3t come from fine-tuning, but it is only able to produce two programs that parse. The best results for GPT4t come from three-shot prompting, but it is only able to produce four programs that parse. Given this data, we answer RQ1 in the affirmative: Eudoxus performs better than standard LM baselines on syntactic correctness. More interestingly, the two versions of Eudoxus generated programs that passed all compiler checks for 30/33 unique problems; the 11 versions of the baselines together generated programs that passed all compiler checks for only 6/33 unique problems.
Eudoxus outperforms all baselines in terms of semantic correctness (see Figure 14), scoring 5/5 on more benchmarks than any other approach. Every unique problem that the baselines perform well on (four unique problems where some baseline scored four or higher) was also solved by Eudoxus; these are likely the easiest problems in the test set. On the other hand, 33 of the 52 compiling programs produced by Eudoxus (covering 24 unique problems) scored four or higher. This data suggests that our approach does not negatively impact semantic performance, but it is difficult to draw conclusions since so few of the programs generated by the baselines pass compiler checks.
In terms of execution time, Eudoxus with GPT3t took an average of 0.9 seconds (SD 0.6) in repair steps and 7.2 seconds (SD 4.6) in LM calls. Eudoxus with GPT4t took an average of 1.8 seconds (SD 2.0) in repair steps and 35.1 seconds (SD 24.7) in LM calls. We conclude that, in terms of execution time, LM calls are more expensive than our static repairs.
While our results are promising, there are three limitations to our evaluation. First, we only evaluated SPEAC on one VLPL, UCLID5. It remains to be seen whether our results can be replicated across different VLPLs, especially those in different domains. Second, we only evaluated SPEAC with two LMs. It is possible that our work will not generalize across different kinds of LMs. Third, only one author evaluated the semantic correctness of the programs generated by Eudoxus. While the main takeaway of our work is that we are able to generate syntactically correct programs where baselines cannot, it is possible that we have over- or under-estimated semantic correctness.
In this chapter, we described how to build the arrow into Figure 1(C). That is, we described a new methodology for building tools that generate code in very low-resource programming and formal languages from natural language descriptions. Our approach, SPEAC, bridges the gap between code that LMs are good at writing and code that we need to write for formal verification and related tasks through a mix of programming language design, human-computer interaction, and formal methods. We borrow ideas from the intersection of human-computer interaction and programming languages research to discover what LMs find “natural” in a given domain. We then design a corresponding intermediate language for which LMs are good at writing code and for which we can automatically translate programs into our target language, using ideas from formal methods.
We empirically demonstrated that this combination makes it possible to generate syntactically correct and semantically meaningful code in VLPLs like UCLID5, where traditional prompting, decoding, and fine-tuning strategies fall short. Specifically, SPEAC produces code that passes compiler checks at a much higher rate than prior approaches (syntactic correctness), and shows promising results in semantic correctness.
Building a correct distributed system requires logical reasoning. For simple distributed systems in low-stakes environments, reasoning with pen and paper may be enough. At larger scales, however, like industrial settings, reasoning about all possible executions of a distributed system quickly becomes unwieldy. While the underlying reasoning is the same (basic deduction), at these larger scales distributed systems engineers need computational support. In this dissertation, we built the needed computational support: a full automated reasoning stack for verifying the correctness of distributed systems. The left side of Figure 1 visualizes the external interface to our stack; the right side of Figure 1 visualizes the internal details of the stack itself.
Automated reasoning stacks, like our own, transform domain-specific questions, layer by layer, into logical queries, solve these queries with general-purpose reasoning engines, and then transform the results back into domain-specific answers. Unlike existing work, in this dissertation, we tailored every layer of the automated reasoning stack to our target domain. In Chapter 5 we built a verification engine that is directly and exclusively tailored to distributed systems safety verification. In Chapter 6 we built specification mining tools for the same setting. Other layers were more general but still built to specifically address challenges encountered in our target domain. In Chapter 3 we built a solver that handles the data structures that our verification engine depends on. In Chapter 4 we built a tool that improves solver performance on the workloads generated by our verification engine. In Chapter 7 we provided a methodology for getting LMs to generate code in very low resource programming languages, like our verification language.
Overall, this dissertation aims to be an exemplar for how to build a full automated reasoning stack that targets a single domain. Along the way, however, we made more general contributions. Specifically, we provided components and principles that are reusable in, and adaptable to, new domains. In the next section, we explore an entirely new domain that highlights the power and limitations of automated reasoning, and motivates our future work directions discussed in Chapter 8.2.
American-style crosswords, like those found in the New York Times, consist of a set of clues, a grid of white and black cells, and a solution. The solution to a crossword is a sequence of characters, usually a word, per clue. For example, the answer for the clue “software platform suitable for Starbucks?” could be “java.” You can think of the clues as semantic constraints and the grid as a set of syntactic constraints. Every clue has an orientation, down or across, and a starting cell in the grid. The answer to each clue must be laid out, character by character, cell by cell, starting from the designated cell, moving in the correct orientation, over only the white cells of the grid. This imposes length constraints on answers and leads to answers that share characters. The goal of the crossword game is to guess the solutions from the clues and the grid: players must guess a solution that satisfies both the semantic and syntactic constraints. Figure 15 shows one example crossword puzzle.
Many people play crossword games every day; fewer people construct crosswords. Similarly, existing work, like Wallace et al. (2022), has looked at automatically solving crossword puzzles, but few projects exist for automatically generating crossword puzzles. In this section, we focus on the latter problem, and specifically on the challenges of solving it with current LMs.
The distinction between semantic and syntactic crossword constraint generation helps us isolate the capabilities and limitations of LMs: LMs are relatively good at generating clues and answers (semantic constraints) but bad at generating crossword grids (syntactic constraints). On the other hand, solvers cannot generate clues and answers—they are not built to reason about natural language—but they can easily generate crossword grids. In this section, we evaluate the above claims about syntactic constraints.
To ask LMs to generate a crossword grid (syntactic constraints), we use the prompt immediately following this paragraph, where {WORDS} is an input parameter. We provide the LM with a list of words for which we know a subset will fit nicely in a grid; the LM just has to lay out the input words in any correct grid. In Chapter 8.1.2 we test LM capabilities with this prompt, using existing crosswords as the input value for {WORDS}. The crossword data that we use in the evaluation has been publicly available on GitHub for more than four years and is likely in the training set of the model. To avoid simple regurgitation, we include an instruction in the prompt to generate a mini crossword with the fewest possible black cells. At the very least, the model will have to regurgitate the input crossword with the fewest number of blanks. We use few-shot prompting and request the output in markdown.
Q: Make a 5 by 5 crossword grid with the fewest possible blanks
(use * for blanks) as a markdown table using the following
fills: DIP, IMOUT, SAUCE, CYCLE, HAN, DISC, IMAY, POUCH, UCLA,
TEEN.
A:
| | | | | |
|-|-|-|-|-|
|D|I|P|*|*|
|I|M|O|U|T|
|S|A|U|C|E|
|C|Y|C|L|E|
|*|*|H|A|N|
Q: Make a 5 by 5 crossword grid with the fewest possible blanks
(use * for blanks) as a markdown table using the following
fills: {WORDS}
A:
| | | | | |
|-|-|-|-|-|
In this section, we describe how to generate the syntactic constraints for a crossword with a bounded number of mar programs. Our encoding is inspired by Dimitri Bohlender's blog. Our encoding depends on a formal problem definition. Given a set of words and a natural number k, the crossword construction problem is to generate a k × k table such that: (1) every cell contains either a blank or a letter appearing in the input words; (2) every maximal horizontal or vertical run of letters spells a word from the input set; (3) every letter is “checked,” i.e., it belongs to both a horizontal and a vertical word; (4) the placement of blanks is rotationally symmetric; and (5) the number of blanks is minimized. The fourth constraint is standard but not strictly enforced: full-size New York Times crosswords sometimes have different symmetries (like along an axis), and the mini puzzles often relax the symmetry constraint altogether. Notice that Figure 15 is rotationally symmetric.
To make the details of the encoding more concrete and easier to follow, we will use the running example of a 2 × 2 table with {to, do} as the input set of words. For example, because of the symmetry constraint, the crossword construction problem defined by a 2 × 2 table with {to, do} as the input set of words has no solution with fewer than four blanks.
The first four conditions of the crossword construction problem can be solved by a single mar program. We will start with these conditions and then extend our encoding to handle the fifth condition, the optimization, in the next subsection. We begin by declaring an enumerated type (enum) to represent letters. We want to capture exactly the unique set of characters that appear in the input set of words (ignoring case) plus a special blank character. For the running example, the enum would contain the letters t, o, and d, plus the blank.
For a k × k table, we create k·k + 1 constants of type letter: one for each table position plus one to represent a special off-table position. For example, for a 2 × 2 table, the mar program has five such constants: p00, p01, p10, and p11 for the four cells, and off for the off-table position.
For each word in the input set, we create a set of boolean constants that represent possible table positions and orientations (down or across) for that word. For example, for a 2 × 2 table and an input set of words {to, do}, we create 16 boolean constants (2 words times 4 possible start positions times 2 possible orientations).
...
let to_00_down: boolean
let to_00_across: boolean
let to_01_down: boolean
let to_01_across: boolean
let to_10_down: boolean
let to_10_across: boolean
let to_11_down: boolean
let to_11_across: boolean
let do_00_down: boolean
...
let do_11_across: boolean
...
The type declaration above—and to some extent the constant declarations—enforce the first condition of the crossword construction problem. We now need to enforce the second, third, and fourth conditions. We accomplish this with five different kinds of constraints: position, word start, off table, checked, and symmetry.
Position constraints ensure that if a word is selected, then the table is filled in correctly at the appropriate position and orientation: the cell before the word is a blank, the sequence of characters starting at the given position following the given orientation spells out the letters of the given word, and the cell after the word is a blank. For example, part of the final assertion for the running example will ensure that, if to_00_down is set to true, then p00 will have the letter t and p10 will have the letter o.
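As a rough sketch of what such a position constraint looks like in a solver, the snippet below uses Z3's Python API rather than mar; the sort and constant names mirror the running example but are otherwise illustrative.

```python
from z3 import And, Bool, Const, EnumSort, Implies, Solver

# Letters appearing in {to, do}, plus a special blank value.
Letter, (T, O, D, BLACK) = EnumSort("Letter", ["T", "O", "D", "BLACK"])

# One constant per cell of the 2x2 table, plus the special off-table position.
p00, p01, p10, p11, off = (Const(n, Letter) for n in ["p00", "p01", "p10", "p11", "off"])

to_00_down = Bool("to_00_down")

s = Solver()
# If "to" is placed downward from cell (0,0): the cell before the word and the
# cell after it (both off the table for a 2x2 grid) are blank, and the cells
# visited spell t then o.
s.add(Implies(to_00_down, And(off == BLACK, p00 == T, p10 == O)))
```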
We also need to ensure that at most one position and orientation is selected for each word. For example, part of the final assertion for the running example will ensure that the word “to” is only placed in one position.
...
assert at-most-one(to_00_down, to_00_across, to_01_down, to_01_across,
to_10_down, to_10_across, to_11_down, to_11_across)
...
Here at-most-one is a helper function that ensures that at most one from a set of boolean variables can be true at the same time.
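A minimal sketch of such a helper, using a pairwise encoding over Z3 Booleans; the mar helper may use a different encoding.

```python
from itertools import combinations
from z3 import And, Bool, Not, Or, Solver

def at_most_one(*bs):
    """At most one of the given Boolean variables is true (pairwise encoding)."""
    return And([Or(Not(a), Not(b)) for a, b in combinations(bs, 2)])

# Usage: the word "to" may be placed at no more than one position/orientation.
placements = [Bool(f"to_{r}{c}_{o}") for r in (0, 1) for c in (0, 1) for o in ("down", "across")]
s = Solver()
s.add(at_most_one(*placements))
```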
Word start constraints ensure that whenever a blank is followed by two letters in a given orientation, some word from the input set starts at the first letter's position. For the running example, we would include the following constraint.
...
assert ((off == BLACK() and p00 != BLACK() and p10 != BLACK()) ==>
(to_00_down or to_00_across or to_01_down or to_01_across or
to_10_down or to_10_across or to_11_down or to_11_across or
do_00_down or do_00_across or do_01_down or do_01_across or
do_10_down or do_10_across or do_11_down or do_11_across))
...
The off table constraint is the simplest: the off-table position, off, is a BLACK().
A letter is checked if it belongs to a vertical and a horizontal word. All letters must be checked and the checked constraints ensure this. For each cell in the table, if the cell holds a letter, then there must be a horizontally neighboring letter and a vertically neighboring letter. For the running example, this would include constraints like the following.
...
assert p00 != BLACK() ==>
((off != BLACK() or p01 != BLACK()) and
(off != BLACK() or p10 != BLACK()))
...
Finally, symmetry constraints ensure that the table is rotationally symmetric with regards to the blanks. For a 2 × 2 table, these constraints require that p00 is a blank if and only if p11 is a blank, and that p01 is a blank if and only if p10 is a blank.
One possible solution to all the constraints described above is to set all the cells to be blanks. But that is not a very interesting crossword! The fifth condition of the crossword construction problem aims to minimize the number of blanks in the hopes of finding an interesting solution.
The mar language does not support minimization directly, however, so instead we use a sequence of mar programs to solve the optimization problem. Each individual program combines the constraints for the first four conditions of the crossword construction problem with an at-most-k constraint that puts an upper limit on the number of blanks in the solution. For the running example, this means adding an at-most-k constraint over the four indicators that say whether each cell is a blank.
Here at-most-k is a helper function that uses an efficient existing encoding to ensure that at most k from a set of boolean variables can be true at the same time, similar to the at-most-one constraint we used before. The highest possible number of blanks is the number of cells and the lowest possible number of blanks is zero. To solve the optimization problem, we conduct a binary search between these two limits until we find the smallest k for which our at-most-k constraint holds. Each step of the binary search is a mar program. If the mar program has a solution, we try a smaller k in the next step; if there is no solution, we try a larger k.
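The sketch below shows this outer binary search in Python with Z3, using Z3's built-in AtMost pseudo-Boolean constraint in place of the mar at-most-k helper; grid_constraints and blank_indicators are placeholders for the real encoding of the first four conditions.

```python
from z3 import Solver, AtMost, sat

def min_blanks(grid_constraints, blank_indicators):
    """Binary-search for the smallest k such that the first four conditions
    are satisfiable with at most k blank cells.

    grid_constraints: list of Z3 Booleans encoding conditions (1)-(4).
    blank_indicators: one Z3 Boolean per cell, true iff that cell is a blank.
    Both arguments are placeholders for the real encoding.
    """
    lo, hi = 0, len(blank_indicators)  # fewest and most possible blanks
    best = None
    while lo <= hi:
        k = (lo + hi) // 2
        s = Solver()
        s.add(grid_constraints)
        s.add(AtMost(*blank_indicators, k))  # the at-most-k constraint
        if s.check() == sat:
            best = (k, s.model())
            hi = k - 1   # a solution with k blanks exists: try fewer
        else:
            lo = k + 1   # no solution with at most k blanks: allow more
    return best
```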
To test if LMs can generate syntactic constraints for crossword puzzles, we asked OpenAI's o4-mini-2025-04-16 (o4m) to build crossword grids from words and clues that previously appeared in the New York Times. Specifically, we took words and clues from five mini crossword puzzles (5 by 5 grids, roughly 10 words each) published between January 2019 and February 2021 and asked o4m to create a mini crossword grid from the words using the prompt described in Chapter 8.1.1.1.
We repeated this experiment 10 times. Each time we used a fresh combination of five existing mini crosswords from a set of 10 randomly selected crosswords. Previous GPT models fail spectacularly. The o4m model performs surprisingly well. In 10 attempts, o4m generated three syntactically correct crosswords. That is, it was able to regurgitate an existing crossword three times. Of the three correctly generated grids, none are minimal. That is, none of the three are the crossword from the input with the fewest number of blanks. In these three cases, o4m successfully regurgitated an existing crossword but did not (successfully) do any reasoning about minimality. The following is an example of a correct grid generated by o4m, which appeared in the New York Times on August 26th, 2019.
Seven out of 10 times, o4m hallucinated words or created nonsense grids. The following grid is an example of an incorrect grid generated by o4m, which includes hallucinated words, like “orosg.”
Laying out words in a grid is an obvious application of backtracking search. Generating text from left to right, top to bottom, will require making guesses that may or may not pay off. When the guesses do not pay off, backtracking is required. For example, consider the incorrect grid generated by o4m above. The model was given 50 words (five mini crosswords of 10 words each), and asked to construct a five by five crossword grid. The LM started the construction by selecting the word “amigo.” Immediately, this meant that the word down the middle column had to start with an “i,” but no such word existed in this particular input set. At this point, the LM should have backtracked: it should have started again with a word other than “amigo,” ideally learning from its mistake. Instead, the LM hallucinated the word “ivado,” which was not in the input set of words, and continued its construction.
Using our logical encoding, we can instantly solve (\(<1\) second on a 2.3 GHz Quad-Core Intel Core i7 processor) all of the benchmarks described in Chapter 8.1.1.1. We can also generate more interesting crosswords. For example, Figure 15 was automatically generated by our solver using the encoding described in this chapter. Specifically, we took every word from every Monday crossword (usually 15 by 15 grids) published in the New York Times between January 1st, 2010, and December 31st, 2014, generated the encoding described above, and used a solver to generate the grid in Figure 15.
Building a crossword grid is like programming. The lines of code that we write are guesses. These guesses have implications that are hard to predict, and we may or may not need to revisit them. Data structures and functions are rarely exactly right the first time we define them, and LMs are not beyond human reasoning capabilities. Despite previous work, like Gandhi et al. (2024) and Xie et al. (2025), LMs still struggle with backtracking search. In contrast, solvers routinely perform better than expected on real-world problems that require backtracking (Ganesh and Vardi 2020). Programming and the crossword construction problem are no exception to this pattern. In Chapter 7, we use a MAX-SMT solver to pick the optimal backtracking point. For the crossword construction problem, the SMT solver handles all backtracking directly.
But crossword construction and programming will never be solved by solvers alone. Clue and answer generation is a quintessential natural language problem. Similarly, naming variables and many other programming tasks require natural language understanding. In this regard, LMs have a hope of solving these problems while solvers do not: LMs may one day perform backtracking search effectively but SMT solvers will never reason about natural language effectively. Until that time, which may never come, the most effective tools will combine the powers of both approaches.
It is not yet clear how best to combine neural and symbolic approaches. In Chapter 7 we proposed one combination that has exciting implications for the neuro-symbolic reasoning research area. For example, T. Olausson et al. (2023) follow the tool-use paradigm to solve natural language word problems by equipping an LM with an automated theorem prover tool. These same authors find that one of the biggest limitations of this approach is that LMs frequently generate syntactically incorrect calls to the theorem prover, which is precisely the problem we solved with Eudoxus for UCLID5. We aim to explore these exciting implications, particularly for reasoning about distributed systems described in natural language and diagrams. The dream is that developers will be able to ask questions about their documented designs and get meaningful feedback powered by scalable automated reasoning.
Our work has interesting implications for the neuro-symbolic research area beyond the tool-use paradigm. For example, agentic AI systems, which have recently gained popularity, can be thought of as distributed systems: independent computing nodes work together to accomplish a shared goal by sending and receiving messages. Today, however, little work exists on verifying properties of agentic systems. For example, there are few tools available to ensure that private information does not leak from one agent in a system to another. We believe that programming, testing, and verification languages like P will play a crucial role in building safer agentic systems.
All of these advances will depend on the availability of fast solvers. The problem is that we do not know what logical theories and fragments we will need to support or what industrial queries will look like tomorrow. This is true for the whole field of automated reasoning: the demand for solvers is growing in often unpredictable directions. We aim to meet this demand through eager solvers. Eager solvers are promising for distributed solving, but handling theory combinations is currently challenging. We are particularly excited about facing these challenges by generating eager solvers from declarative specifications. Similarly, we are excited to explore solvers with custom LM generated components. Existing work, like Romera-Paredes et al. (2024), has successfully generated heuristics for combinatorial solvers. We believe that similar approaches can be used to generate hyper domain-specific heuristics that improve solver performance well beyond what general purpose solvers can attain today.
Finally, language design will continue to play a crucial role in all future developments, whether for getting the most out of LMs, like our work in Chapter 7, or for communicating with engineers, like our work in Chapter 5. Automated reasoning engines do not currently support many programming paradigms and features. For example, OCaml's new concurrency model (Sivaramakrishnan et al. 2021), based on algebraic effects, is popular in programming language circles but has received much less attention in the automated reasoning community. Therefore, today, human and non-human programmers have few choices when it comes to proving the correctness of these programs. We are interested in new programming language paradigms and features, like algebraic effects, and in building full automated reasoning stacks for their users, broadly construed.
“Dafny currently relies heavily on Z3’s ability to quickly instantiate an absurd number of quantified axioms, and I think other solvers just can’t keep up with that.” (Quiles, Tomb, and Pearce 2023)↩︎
“Cloud computing sales are expected to rise to $2 trillion by the end of the decade… Generative artificial intelligence is forecast to account for about 10-15% of the spending” (Goldman Sachs 2024)↩︎
“… at least 75% of the respondents work for organizations using the cloud.” (Loukides 2021)↩︎
“UniSuper, an Australian pension fund that manages $135 billion worth of funds and has 647,000 members, had its entire account wiped out at Google Cloud, including all its backups that were stored on the service.” (Amadeo 2024)↩︎
“Microsoft has notified customers that it’s missing more than two weeks of security logs for some of its cloud products, leaving network defenders without critical data …” (Whittaker 2024)↩︎
“Users reported issues accessing critical AWS services, including EC2 instances, S3 storage, and RDS databases. Amazon acknowledged the problem …” (Sayegh 2024)↩︎
“We cannot expect all AWS users to be experts in formal methods, have the time to be trained in the use of formal methods tools, or even be experts in the cloud domain” (Rungta 2022).↩︎
“This limits the reach of formal verification: scalability will require teaching many more people to engineer proofs” (Dodds 2022).↩︎
“The full power of automated reasoning is not yet available to everyone because today’s tools are either difficult to use or weak” (Cook 2019).↩︎
See https://docs.python.org/3.10/whatsnew/3.10.html.↩︎
See https://openjdk.org/jeps/360.↩︎
A simple implementation of the mar compiler is available at https://github.com/FedericoAureliano/thesis.↩︎
Code available at https://github.com/uclid-org/algaroba.↩︎
Code and data available at https://github.com/uclid-org/medley-solver/tree/SAT2021.↩︎
See https://www.erlang.org/doc/index.html.↩︎
See https://akka.io/.↩︎
Available at https://github.com/wilcoxjay/mypyvy/blob/master/examples/ring_leader_election.pyv.↩︎
Code available at https://github.com/uclid-org/upverifier↩︎
Appeared in the New York Times Crossword on Monday, March 3, 2014.↩︎
Across: ALMA; SLEET; HOIST; INCAN; PEAS. Down: ALONE; LEICA; MESAS; ATTN; SHIP.↩︎
See https://bohlender.pro/blog/generating-crosswords-with-sat-smt/.↩︎
Data from https://github.com/chrisnunes57/nyt-mini-crosswords.↩︎
Data from https://github.com/doshea/nyt_crosswords.↩︎