Crafting Marwood
by Erik Bremen
This book is built with mdBook and Mermaid-JS. Its source may be found at marwood-book.
Introduction
Crafting Marwood is a book describing the process behind researching and building Marwood, a Scheme compiler and runtime written in Rust. Marwood may be interacted with on the web at repl.marwood.io, and its source is MIT Licensed and may be found on github.
What is Scheme?
Scheme is a programming language invented in 1975 by Guy Steel and Gerald Jay Sussman at MIT. It's the first dialect of lisp to use lexical scope, and its simplicity makes it a great choice for hobby programming language implementers.
Here's a sample of scheme's syntax, demonstrating a recursive implementation of factorial:
(define (factorial n)
(if (= n 0)
1
(* n (factorial (- n 1)))))
Why Scheme?
Scheme has very simple grammar, and small number of core language features that once implemented provide the building blocks for the rest of the implementation.
R7Small, the latest standard at the time of writing, was published in 2013 and is just under 90 pages of text including full library specification and appendix.
The simplicity of scheme allows the implementer to spend more time exploring complex branches of programming language implementations, such as garbage collection, macro expansion, optimizations, etc.
How is Marwood different?
When researching scheme implementation material I came across two types of sources:
-
Academic material dating back to the 1970s. This material is quite abstract and a majority of these papers and books describe implementing a metacircular evaluator. There are very few materials on implementing scheme in a language like C or Rust.
-
Modern scheme codebases such as Chez Scheme, Guile, Chicken, Chibi. These implementations have the best performance, but are composed of millions of lines of code over years of implementation from dozens of developers.
Marwood's goal was to provide an example of an implemenation of scheme that fits somewhere between #1 and #2.
Who is this book for?
This book is walkthrough through Marwood's source code. It is aimed at an audience with some experience with the Rust programming language, parsers, and compilers.
For a detailed tutorial series on writing an interpreter, I would recommend Crafting Interpreters.
Resources and Influences
These resources were found useful when researching scheme implementations:
- Scheme Bibliography Github
- Three Implementation Models for Scheme - Kent Dybvig
- An Incremental Approach to Compiler Construction
- Crafting Interpreters
- LISP System Implementation - Nils Holm
Design Overview
Highlevel Design Choices
The first major decision when constructing Marwood was to decide how Marwood would evaluate scheme:
-
Direct evaluation of the AST in the host language: This is the most natural way to attempt to write an interpreter for Scheme, but it has a few short-comings. Certain features like tail-call optimization and call/cc may be more difficult to implement depending on the host language, and other considerations like memory management are directly affected by the host language.
This approach makes a lot of sense if the host language is Scheme, however. There are numerous texts (e.g. sicp) on writing meta-circular evaluators in Scheme, and it is one of the many brilliant examples of Scheme's elegant design.
-
Compilation to an existing target architecture: The target architecture could be hardware based (e.g. x86, amd64, arm) such as in An Incremental Approach to Compiler Construction, or virtual (JVM, WASM, etc).
-
Compilation to a custom virtual machine: A custom virtual machine is created designed to evaluate Scheme.
Marwood uses a custom virtual machine approach.
Marwood Library
The Marwood library allows the creation of Scheme virtual machines, which may be used to evaluate scheme expressions. Each virtual machine represents the state of a scheme environment (global environment, heap, stack, etc).
Here's an example demonstrating creation of a recursive factorial procedure, and evaluation of (factorial n)
for n in 0..10:
use marwood::vm::Vm;
fn main() {
let mut vm = Vm::new();
let code = r#"
(define (factorial n)
(let factorial ([n n] [acc 1])
(if (zero? n)
acc
(factorial (- n 1) (* acc n)))))
"#;
vm.eval_text(&code).unwrap();
for it in 0..10 {
let (cell, _) = vm.eval_text(&format!("(factorial {})", it)).unwrap();
println!("the factorial of {} is {}", it, cell);
}
}
Which produces the following output:
the factorial of 0 is 1
the factorial of 1 is 1
the factorial of 2 is 2
the factorial of 3 is 6
the factorial of 4 is 24
the factorial of 5 is 120
the factorial of 6 is 720
the factorial of 7 is 5040
the factorial of 8 is 40320
the factorial of 9 is 362880
A breakdown of Marwood's Components
The remainder of this text describes the design decisions behind each of Marwood's major library components:
-
A parser and tokenizer used to create Marwood's AST cell object. This object acts as both input to the VM's
eval()
function, and also the result of an evaluation. Marwood's tokenizer is also useful in other situations, such as providing input to REPL syntax highlighting and bracket matching. -
A virtual machine that represents a scheme environment and may be used to execute scheme. The VM may be used to evaluate a
Cell
produced by Marwood's parser. Successful evaluation results in an outputCell
, that represents the result of the evaluation. -
A compiler that given a
Cell
and aVM
, constructs compiled bytecode to be executed on the VM. -
Numerous built-in Scheme library procedures written in rust, and a scheme prelude containing library procedures written in scheme.
-
The repl and web-repl crates. These creates are separate from the main Marwood library crates, and provide example implementations of REPLs that use the Marwood library.
This diagram illustrates a Marwood read-eval-print loop. Text is read from the readline library, tokenized, parsed, evaluated and then the resulting Cell
object is printed back to the user.
The VM object maintains any state expected by the user between evaluations. For example, if the user enters (define x 10)
, then on evaluation the VM object will modify its heap to represent the new binding for x.
flowchart LR A(Readline) -->|Text| B(Tokenizer) B -->|Token Stream| C(Parser) C -->|Cell| D(VM) D -->|Cell| A
The remaining sections of this text describe in detail how these different components were designed.
Parsing & Printing
Scheme, having an incredibly simple grammar, lends itself well to a hand-written lexer and recursive descent parser.
Marwood's parser is composed of three separate components:
-
the
Cell
object, which represents a Scheme sexpr in Rust -
the
lexer
(or scanner or tokenizer), which is responsible for turning plain-text scheme into a list of tokens to be consumed by Marwood'sparser
. -
the
parser
, which is responsible for constructing an aggregate structure calledCell
from the lexer output.
Example
This example demonstrates the manual use of Marwood's lexer, parser, and Cell types. It also shows use of Marwood's printer. The alternative printing syntax {:#}
provides additional escaping, and is the printer used for Scheme's write
procedure.
let scheme_code = r#"
(quote (#\x 20 puppies))
"#;
let tokens: Vec<Token> = lex::scan(scheme_code).unwrap();
let cell = parse::parse(scheme_code, &mut tokens.iter().peekable()).unwrap();
println!("{}", cell);
println!("{:#}", cell);
outputs:
'(x 20 puppies)
'(#\x 20 puppies)
S-Expressions
When designing a scheme parser, we first have to think a little bit about how to represent Scheme expressions in Rust's type system.
Fortunately Scheme's source code is already representable in a data structure called an S-expression (or sexpr). This is one of the fundamental properties of a lisp: source code is data.
A sexpr can be thought of as an aggregate type that can represent one of two things:
- an atom, such as a booleans, numbers, symbols, strings, nil, etc.
- a pair in the form (car . cdr), where *car *and cdr are themselves S-expressions
The expression (+ 2 3)
in scheme is sugar for a tree of pairs, which could be rewritten (+ . (2 . (3 . ())))
.
This diagram depicts the above expression as a tree:
flowchart TD A(( )) -->|car| +((+)) A -->|cdr| B(( )) B -->|car| 2((2)) B -->|cdr| C(( )) C -->|car| 3((3)) C -->|cdr| nil((nil))
Improper Lists
The expression (1 2 . 3)
is an example of an *improper list, a list not terminated with '(). It relies on special use of a dot preceding the very last element in a list. This list could also be written as (1 . (2 . 3))
.
flowchart TD A(( )) -->|car| +((+)) A -->|cdr| B(( )) B -->|car| 2((2)) B -->|cdr| 3((3))
Improper lists are less common in Scheme, but there are a few parts of R7Rs, such as variable argument support, that rely on improper lists.
Examples of improper lists are:
'(1 2 3 4 5 6 7 8 9 . 10)
'((10 10) (20 20) . (/ 30 30))
This example is invalid scheme because there are two expressions after the dot:
'(1 2 . 3 4)
Cell
If Scheme can be represented by an aggregate type called a sexpr, then we just need to create a type in Rust that represents an S-expression: a type that can be either an atom, or a pair. Fortunately Rust's enum type allows exactly this: the creation of a type that represents one of several possible variants.
Marwood's Cell is the enum type that represents Scheme data and code in Marwood. It is the main input to Marwood's compiler, macro transformers, and printer, and is the output of Marwood's VM as a result of evaluating scheme.
Most Cell's variants represent atom types in Scheme, such as booleans, numbers, symbols, characters, and strings:
pub enum Cell {
Bool(bool),
Char(char),
Number(Number),
Nil,
String(String),
Symbol(String),
...
}
As an example, parsing the scheme expression #f
would result in Marwood's parser outputting the value Cell::Bool(false)
.
Pairs
Cell also contains an aggregate type called Pair, which represents a sexpr pair:
pub enum Cell {
...
Pair(Box<Cell>, Box<Cell>),
...
}
Pair
is a pair of Cells representing the car
and cdr
parts of a cons pair. This structure can be used to represent pairs in scheme, which in turn may be used to represent lists. This is the main building block for representing Scheme code's tree structure in Rust.
Vectors
Vector
represents a literal scheme vector, such as #(10 20 30)
.
pub enum Cell {
...
Vector(Vec<Cell>),
...
}
Special Variants
The Cell enum also contains a few variants that can only be produced as output of Marwood's evaluator. For example, the Cell::Void
variant is produced by some Marwood procedures that evaluate to #<void>
(e.g. define, set!). It is not possible to produce these variants via Marwood's parser.
#[derive(Debug, Eq, PartialEq, Hash, Clone)]
pub enum Cell {
...
Continuation,
Macro,
Procedure(Option<String>),
Undefined,
Void,
}
Tokenizing
Marwood's lexer is in the form:
fn scan(text: &str) -> Result<Vec<Token>, Error>
Given scheme code as input, the scan function returns a vector of tokens according to the Scheme grammar Each token contains a span, which may be used to extract the token's text from the source text, and the token type (e.g. LeftParen).
pub struct Token {
/// (start, end) index of original span in the source &str
pub span: (usize, usize),
/// The type output by the scanner
pub token_type: TokenType,
}
pub enum TokenType {
Char,
Dot,
False,
LeftParen,
Number,
NumberPrefix,
RightParen,
SingleQuote,
String,
Symbol,
True,
WhiteSpace,
HashParen,
}
Error::Incomplete
The special error Error::Incomplete
is returned in certain circumstances where the input looks incomplete, and may be hint to the REPL interface that the user intends to enter a multi-line expression.
An example that may cause an Error::Incomplete
error in Marwood's scanner is a mismatched double quote: "hello world
.
Scanner Example
Given this code:
'(10 20 "puppies") ; heterogeneous
The tokenizer will produce the following tokens:
[
Token {
span: (0, 1),
token_type: SingleQuote,
},
Token {
span: (1, 2),
token_type: LeftParen,
},
Token {
span: (2,4),
token_type: Number,
},
Token {
span: (5,7),
token_type: Number,
},
Token {
span: (8,17),
token_type: String,
},
Token {
span: (17,18),
token_type: RightParen,
}
]
Parsing
The Parse Function
Marwood's parser is a direct consumer of Vec<Token>
returned by the lexer.
parse::parse()
takes an iterator over &Token
, the original source &str that was supplied to lex::scan
, and returns a Cell
.
pub fn parse<'a, T: Iterator<Item = &'a Token>>(
text: &str,
cur: &mut Peekable<T>,
) -> Result<Cell, Error>
parse
will only consume one expression from cur
. If cur
may contain multiple expressions, the cursor position that parse
left off at can be used to determine the cursor position of the next expression.
For example, this scheme source contains five expressions and would require five separate calls to parse
.
1 2 3
(+ 1 2)(+ 3 4)
Error::Incomplete
Similarly to scan, parse() will return Error::Incomplete if the token stream does not contain a complete expression.
Some examples of incomplete expressions:
'(1 2 3
'(1 2 3)'(4 5
#(1 2 3)
Virtual Machine
Marwood has a stack based virtual machine represented by the Vm
object. Each Vm represents a full scheme runtime environment.
Scheme expressions evaluated on the VM are compiled into byte code represented by [op codes], which are evaluated by the VM.
#[derive(Debug)]
pub struct Vm {
/// The heap and global environment
heap: Heap,
globenv: GlobalEnvironment,
/// The current program stack
stack: Stack,
/// Registers
acc: VCell,
ep: usize,
ip: (usize, usize),
bp: usize,
/// System Interface (display, write, etc).
sys: Box<dyn SystemInterface>,
}
The Vm
object provides an eval interface for evaluating scheme expressions, and is the main interface for the console and web repl:
fn eval(&mut self, cell: &Cell) -> Result<Cell, Error>
VM Components
The Vm
is composed of these high-level components, which are explored later in this chapter:
- the stack
- a garbage collected heap
- the global environment
- registers
Registers
The Vm has a handful of registers that are used to maintain currently running program state:
Register | Description |
---|---|
%acc | Accumulator |
%ip | Instruction Pointer |
%bp | Frame Base Pointer |
%sp | Stack Pointer |
%ep | Environment Pointer |
The %acc (or accumulator) register contains the result of any Vm instruction that produces a result. The %acc register is also used by procedures to store the result of procedure application. For example, the procedure (+ 10 10)
would result in %acc containing the result 20
.
The %ip register contains a tuple that points to the currently running block of bytecode, and offset in the bytecode of the next instruction to be executed.
The %bp, %sp and %ep registers maintain state of Marwood's stack and call frame state and are discussed in detail in the procedure application section.
Op Codes
Marwood's instruction set is built specifically for executing scheme. It's not a general purpose virtual machine.
The MOV instruction is used to move values between Marwood's registers, stack, global and lexical environments.
The CALL, CLOSURE, ENTER, RET, TCALL, and VARARG instructions support procedure application.
The JMP and JNT provide branching support for the compiler to support the scheme if
primitive.
The CONS and VPUSH instructions are instructions used to build lists or vectors at runtime, and are primarily used to support the quasiquote language feature.
The HALT instruction stops the virtual machine, returning the contents of the %acc register as the result to eval
.
Opcode | Description |
---|---|
CALL %acc | Given the arguments for the procedure in %acc have been pushed on the stack, jump to the procedure in %acc. |
CLOSURE %acc | Create a lexical environment as a result of evaluating a (lambda ...) expression. |
CONS | Performs the cons operation on the first two values of the stack, storing the resulting pair in %acc |
ENTER | Setup the currently executing procedure's stack frame |
HALT | Halt program, returning the result contained within ACC |
JNT <OFFSET> | Set %ip to OFFSET if %acc is #f |
JMP <OFFSET> | Set %ip to OFFSET |
MOV <SRC> <DEST> | Move the value from SRC into DEST |
PUSH | Push the value in ACC on to the stack |
RET | Return from a procedure entered via CALL |
TCALL %acc | Identical to a CALL instruction, except that a tail optimizing CALL is performed. |
VPUSH | Push the value in %acc onto the vector at the top of the stack, storing the resulting vector in %acc |
Why a second Cell type?
While Marwood's Cell
type is able to represent any scheme expression, it's not suitable for representing Scheme structure in Marwood's VM.
Consider the following expressions that create a small list bound to x, and then bind y to a subset of the list:
(define x '(1 2 3))
(define y (cdr x))
This is because the expression (define y (cdr x))
has bound y to a sub-structure of x. It is not a copy. If y were mutated with set-car!
or set-cdr!
we would expect the structure that x is bound to realize the mutation also.
flowchart TD x --> A y --> B A(( )) -->|car| 1((1)) A -->|cdr| B(( )) B -->|car| 2((2)) B -->|cdr| C(( )) C -->|car| 3((3)) C -->|cdr| nil((nil))
Further, if x were to be bound to another value, we expect that Marwood be able to garbage collect any part of the structure no longer referenced. In this example, the pair x is pointing to would no longer be required and may be collected. Only the structure referenced by y would remain:
(define x 0)
flowchart TD y --> B(( )) B -->|car| 2((2)) B -->|cdr| C(( )) C -->|car| 3((3)) C -->|cdr| nil((nil))
Representing this relationship with Marwood's existing Cell
data structure would be very difficult. With the Cell
data structure, pairs are represented as Pair(Box<Cell>, Box<Cell>)
. It's not possible for two pairs to refer to the same data structure.
VCell
The VCell
type is how Marwood represents scheme at runtime. It's also used to represent other runtime objects, such as VM op codes, registers, and even lexical environments. Marwood's stack and heaps are just vectors of VCell. It is the universal type!
Instead of a pair being represented by (Box<Cell>, Box<Cell>)
, a VCell pair is represented by (usize, usize)
, each usize corresponding to the locations of the car
and cdr
portions on the heap. This is explored further in the section on Marwood's heap.
#[derive(Clone, Debug, Eq, PartialEq)]
pub enum VCell {
Bool(bool),
Char(char),
Nil,
Number(Number),
Pair(usize, usize),
Symbol(Rc<String>),
String(Rc<RefCell<String>>),
Vector(Rc<Vector>),
...
}
Note: Marwood's VCell enum is kept a maximum of 24 bytes. 8 bytes represent the enum tag, and an additional 16 bytes remain for data. Any VCell variants that may exceed this size are boxed with
Rc
, though the most common types fit well within this limit. Keeping VCell as small as possible has the benefit that it may be trivially cloneable, which comes in handy when encountering any borrow checker related issues with reading VCell values from Marwood's heap and stack.
VCell > Scheme
VCell's primary role is to represent scheme data in Marwood's VM, but it's also used to represent Marwood's internal instruction format (byte code) and values on Marwood's stack, heap, global and lexical environments. The remaining VCell variants are used to represent the various types that may be encountered.
#[derive(Clone, Debug, Eq, PartialEq)]
pub enum VCell {
...
Continuation(Rc<Continuation>),
Closure(usize, usize),
Lambda(Rc<Lambda>),
LexicalEnv(Rc<LexicalEnvironment>),
LexicalEnvSlot(usize),
LexicalEnvPtr(usize, usize),
Macro(Rc<Transform>),
Acc,
ArgumentCount(usize),
BasePointer(usize),
BasePointerOffset(i64),
BuiltInProc(Rc<BuiltInProc>),
EnvironmentPointer(usize),
GlobalEnvSlot(usize),
InstructionPointer(usize, usize),
OpCode(OpCode),
Ptr(usize),
}
Stack
Marwood's stack is represented by an upward growing stack of VCell values, and a stack pointer %sp
that represents the current top of the stack.
The stack's main purpose is to support the application of procedures, which include the pushing of arguments and meta data required for call frame setup.
Here's an example stack during evaluation of the expression (+ 10 5)
, with the stack pointer pointing at slot 7, and the arguments to +
having been pushed in slots 5 and 6.
Slot | Value | SP |
---|---|---|
7 | %argc[2] | ← |
6 | Fixnum(5) | |
5 | Fixnum(10) | |
4 | %bp[$00] | |
3 | %ip[$223][$06] | |
2 | %ep[$ffffffffffffffff] | |
1 | %argc[0] | |
0 | Undefined |
Stack Structure
Marwood's stack is backed by a Rust Vec<VCell>
, and the stack pointer sp
which contains the index of the current top of the stack in the stack vector. Operations such as push and pop are backed by the rust Vector's implementation of push and pop.
pub struct Stack {
/// Stack contents
stack: Vec<VCell>,
/// Stack Pointer. SP points to the top value to be pushed onto the stack,
/// This value backs the SP register of the VM
sp: usize,
}
Stack manipulation is provided by push and pop functions:
pub fn push<T: Into<VCell> + Display>(&mut self, vcell: T)
pub fn pop(&mut self) -> Result<&VCell, Error>
Access to values values on the stack are provided by either direct index into the stack, or offset relative to the current stack pointer:
pub fn get(&self, index: usize) -> Result<&VCell, Error>
pub fn get_mut(&mut self, index: usize) -> Result<&mut VCell, Error>
pub fn get_offset(&self, offset: i64) -> Result<&VCell, Error> {
pub fn get_offset_mut(&mut self, offset: i64) -> Result<&mut VCell, Error>
Stack Growth
The
push
operation is backed directly by Vec's push function, which will grow the underlying vector automatically. The stack does not currently have any method of shrinking the stack vector.
Stack API Example
let mut stack = Stack::new();
// Push
stack.push(VCell::number(1));
stack.push(VCell::number(2));
// Access relative to %sp
assert_eq!(stack.get_offset(0), Ok(&VCell::number(2)));
assert_eq!(stack.get_offset(-1), Ok(&VCell::number(1)));
assert_eq!(stack.get_offset(-2), Ok(&VCell::Undefined));
// Pop
assert_eq!(stack.pop(), Ok(&VCell::number(2)));
assert_eq!(stack.pop(), Ok(&VCell::number(1)));
assert_eq!(stack.pop(), Err(InvalidStackIndex(0)));
Heap
This small snippset of scheme is useful for reasoning about why scheme might need a heap:
(define x 10)
(define y (* x x ))
(define animals '(cats otters puppies))
(define sub-animals (cdr animals))
Both the value for x and y are bound to numbers. Mutating x
and y
only require changing what value the symbol is bound to in the environment.
But, animals and sub-animals are a little more nuanced. animals is a list composed of pairs
and symbols
, and more importantly sub-animals refers to tail of the same list containing the values '(otters puppies)
. That is, both animals
and sub-animals
refer to parts of the same data structure. Not a copy.
If sub-animals were modified with set-car!
or set-cdr!
, the mutation must also be realized on the list that animals
is bound to.
One way of supporting this type of funcitonality is to store these shared values on a heap. animals
and sub-animals
can then contain references into the same data structure on a heap.
Heap Structure
Like the stack, Marwood's heap is composed of a Vec<VCell>
, along with various supporting structures used to track free slots, intern symbols, and garbage collection.
pub struct Heap {
chunk_size: usize,
free_list: Vec<usize>,
heap: Vec<VCell>,
heap_map: gc::Map,
symbol_table: HashMap<String, usize>,
}
chunk_size
refers to the number of heap slots allocated when the heap needs to grow. On initialization, the Heap will allocate chunk_size slots initialized to VCell::Undefined, and add the index of each unused heap slot to free_list
. This free list contains indexes on the heap of unused slots ready to be allocated.
pub fn new(chunk_size: usize) -> Heap {
Heap {
chunk_size,
heap: vec![VCell::undefined(); chunk_size],
free_list: (0..chunk_size).rev().into_iter().collect()
}
}
Heap References
References into Marwood's heap are of the type
type HeapRef = usize;
and are direct indexes into the heap. These heap references can be found in various places in Marwood's VCell and VM structures:
- VCell::Ptr(HeapRef)
- VCell::Pair(HeapRef, HeapRef)
- VCell::Closure(HeapRef, HeapRef)
Alloc & Free
The heap's core API is composed of alloc, free and get operations on the heap.
pub fn grow(&mut self)
pub fn alloc(&mut self) -> usize
pub fn free(&mut self, ptr: usize)
pub fn get_at_index(&self, ptr: usize) -> &VCell
pub fn get_at_index_mut(&mut self, ptr: usize) -> &mut VCell
pub fn get<'a, T: Into<Cow<'a, VCell>>>(&self, vcell: T) -> VCell
alloc
returns the next available slot on the free_list
, calling grow
to grow the heap by chunk_size
elements if the free_list
is empty. The disposition of the slot is also marked as State::Allocated in the heap_map, which is used by Marwood's garbage collector and discussed in more detail in later sections.
pub fn alloc(&mut self) -> usize {
match self.free_list.pop() {
None => {
self.grow();
self.alloc()
}
Some(ptr) => {
self.heap_map.set(ptr, State::Allocated);
ptr
}
}
}
free
returns a heap slot back to the free_list
. The contents is set back to VCell::Undefined
, which may help catch any bugs in Marwood where a heap reference is used when it no longer should be.
The slot is also marked State::Free in the heap_map, a data structure used by Marwood's garbage collector.
pub fn free(&mut self, ptr: usize) {
self.heap_map.set(ptr, State::Free);
if let Some(VCell::Symbol(sym)) = self.heap.get(ptr) {
self.symbol_table.remove(&**sym);
}
*self.heap.get_mut(ptr).unwrap() = VCell::Undefined;
self.free_list.push(ptr);
}
The core get operations on the heap are not notable, except that the get functions will panic if given an index that is not valid for the current heap.
The function get
acts as a wrapper around get_at_index
and only performs a heap lookup if the supplied vcell is a pointer type. This allows code in Marwood to access values on the stack or environment slots without having to check of the value is a reference first.
pub fn get_at_index(&self, ptr: usize) -> &VCell {
self.heap.get(ptr).expect("heap index out of bounds")
}
pub fn get_at_index_mut(&mut self, ptr: usize) -> &mut VCell {
self.heap.get_mut(ptr).expect("heap index out of bounds")
}
pub fn get<'a, T: Into<Cow<'a, VCell>>>(&self, vcell: T) -> VCell {
match vcell.into() {
Cow::Borrowed(vcell) => match vcell {
VCell::Ptr(ptr) => self.get_at_index(*ptr).clone(),
vcell => vcell.clone(),
},
Cow::Owned(vcell) => match vcell {
VCell::Ptr(ptr) => self.get_at_index(ptr).clone(),
vcell => vcell,
},
}
}
Put
The put
function puts the supplied vcell on the next available slot on the heap returned by alloc
. If vcell was already a pointer type, then it's returned instead of being double boxed by the heap. This removes the need for a lot of boxing code in Marwood to check if the value it's boxing is already boxed.
pub fn put<T: Into<VCell> + Clone>(&mut self, vcell: T) -> VCell {
let vcell = vcell.into();
match &vcell {
VCell::Ptr(_) => vcell,
VCell::Symbol(sym) => match self.symbol_table.get(sym.deref()) {
Some(ptr) => VCell::ptr(*ptr),
None => {
let ptr = self.alloc();
*self.heap.get_mut(ptr)
.expect("heap index is out of bounds") = vcell.clone();
self.symbol_table.insert(sym.deref().into(), ptr);
VCell::ptr(ptr)
}
},
vcell => {
let ptr = self.alloc();
*self.heap.get_mut(ptr)
.expect("heap index is out of bounds") = vcell.clone();
VCell::Ptr(ptr)
}
}
}
Symbol Interning
The heap provides symbol interning by use of Heap::symbol_table. On put of a symbol, the heap will check if the symbol already exists in the symbol table and will return the stored heap location if the symbol already exists. This results in the same symbols always having the same heap location in Marwood. Why is this useful? This allows various parts of Marwood to refer to symbols by their heap reference instead of the string.
The maybe_put
function acts as a wrapper around put
, and will only put a VCell value if the value is one that must be boxed on a heap. Atom values such as numbers, bool, nil, etc are immutable values that may not always need to be boxed.
pub fn maybe_put<T: Into<VCell> + Clone>(&mut self, vcell: T) -> VCell
Put & Get Cell
The put_cell
and maybe_put_cell
are versions of put
and maybe_put
that recursively place a Cell
structure on the heap. Aggregate structures such as pairs and vectors may end up allocating multiple heap slots to represent the structure in the heap.
pub fn put_cell(&mut self, ast: &cell::Cell) -> VCell
pub fn maybe_put_cell(&mut self, ast: &cell::Cell) -> VCell
The following call to put_cell ends up allocating 10 slots on the heap to create the list structure and storage for values:
let mut heap = Heap::new(8192);
heap.put_cell(&parse!("(puppies 42.0 cats puppies #t)"));
The chart below contains the resulting heap allocations for the storage of this list in the heap. Note that because puppies is an interned symbol at position $00 in the heap, both pairs that reference the value puppy point to the same heap position of $00.
Slot | Value |
---|---|
0 | Symbol(puppies) |
1 | Float(42.0) |
2 | Symbol(cats) |
3 | #t |
4 | () |
5 | ($03 . $04) |
6 | ($00 . $05) |
7 | ($02 . $06) |
8 | ($01 . $07) |
9 | ($00 . $08) |