ASDL for Oils for UNIX Contributors, Algebraic Datatypes in Python (and C++)

In Oils for Unix, we have an ML-style syntax for algebraic data structures. It looks like this:

module value
{
  value = Null
        | Bool(bool b)
        | Int(BigInt i)
        | Float(float f)
        | List(List[value] items)
        | Obj %Obj

  Obj = (Optional[Obj] prototype, Dict[str, value] d)
}

This, clearly, isn't Python nor C++. It's a domain specific language called ASDL. We didn't invent this ourselves, we forked it from the CPython implementation which in turn was derived from a 1997 paper by the Zephyr university compiler infrastructure project.

This article exists to document ASDL in its current form in Oils for UNIX; part of our documentation effort in our Dev Guide. It is meant to be read by Oils for UNIX contributors. If you are just interested in algebraic datatypes in Python, focus on the first 4 sections Product Types, Sum Types, Tags and Shared Types, Type Syntax, ASDL Modules and Use. The most interesting is "shared types", detailed in Sum Types, Tags and Shared Types.

I will focus on the Python API for ASDL, since contributors will most often write Python. However, all the features listed here are fully supported when the shell is translated to C++!

Table of Contents

Product Types

Product types are like structs or dataclasses.

The syntax for a product type in ASDL is:

Person = (str name, int age)

In general, this is a list of type-name pairs.

When compiled, the resulting python source is (minus some utility methods and imports):

class Person(pybase.CompoundObj):
  _type_tag = 64
  __slots__ = ('name', 'age')

  def __init__(self, name, age):
    # type: (str, int) -> None
    self.name = name
    self.age = age

You can create Person objects using a familiar python API:

thor = Person('Thor', 1500)

Sum Types, Tags and Shared Types

Sum types are close to enums, but far more powerful. If you've used Rust's enum type, ASDL is very similar. In other languages, you can use tagged unions or subclassing. Sum types in ASDL are safer and more powerful. They are ASDL's "killer feature".

The syntax for a sum type in ASDL is:

color = Red
      | Blue
      | Green
      | Other(int r, int g, int b)

This results in the Python source:

class color(object):
  Red = color__Red()
  Blue = color__Blue()
  Green = color__Green()

  class Other(color_t):
    _type_tag = 4
    __slots__ = ('r', 'g', 'b')

    def __init__(self, r, g, b):
      # type: (int, int, int) -> None
      self.r = r
      self.g = g
      self.b = b

You can use this API to construct colors as follows:

red = color.Red
blue = color.Blue
pink = color.Other(1, 0, 1)

Sum types are very useful in compilers and interpreters. For example, it is common to switch on all variants of a sum type. In Oils for UNIX Python with ASDL, this looks like this:

def PrintColor(color):
    # type: (color_t) -> None
    with tagswitch(color) as case:
        if case(color_e.Red):
            print('Red')
        elif case(color_e.Blue):
            print('Blue')
        elif case(color_e.Green):
            print('Green')
        elif case(color_e.Other):
            other = cast(color.Other, color)
            print('Other(r:%d g:%d b:%d)' % (other.r, other.g, other.b))

Alternatively, if you only want to check a single case, you can call the .tag() method:

def IsOther(color):
    # type: (color_t) -> bool
    return color.tag() == color_e.Other

ASDL also generates two other classes generated here (and used in the API above). The first is an enum with all the color variants:

class color_e(object):
  Red = 1
  Blue = 2
  Green = 3
  Other = 4

The second is a superclass of all the color variants. This allows to type-checking of functions that take in any variant (eg. (color_t) -> bool).

Notably, you can also specialize a function to take in a single variant using the color.Other type:

def PrintOther(other):
    # type: (color.Other) -> None
    print('Other(r:%d g:%d b:%d)' % (other.r, other.g, other.b))

One feature of ASDL's sum types is that it can avoid extra layers of indirection with "shared types".

Our color example could also look like this:

RgbColor = (int r, int g, int b)

color = Red
      | Blue
      | Green
      | Other %RgbColor

obj = Person %Person
    | RgbColor %RgbColor

The result is that you can use a "bare" RgbColor in functions that take either a color_t or an object_t. We call this "shared types".

In code, this looks like:

pink = RgbColor(1, 0, 1)

def PrintColor(color):
    # type: (color_t) -> None
    ...

def PrintObj(obj):
    # type: (obj_t) -> None
    ...

# Both of these pass a (static, mypy) type-check!
PrintColor(pink)
PrintObj(pink)

In languages like Rust, you have to deal with indirection:

struct RgbColor {
  r: i32,
  g: i32,
  b: i32,
}

enum Color {
  Red,
  Blue,
  Green,
  Other(RgbColor),
}

enum Obj {
  Person(Person),
  RgbColor(RgbColor),
}

You then have to wrap each call with the relevant enum type:

fn print_color(color: Color) { ... }
fn print_obj(obj: Obj) { ... }


let pink = RgbColor { r: 1, g: 0, b: 1 };
print_color(Color::Other(pink))
print_obj(Obj::RgbColor(pink))

Under the Hood

Shared types are possible with ASDL because all resulting types (sum type variants and product types) are objects with "tags". These are integer constants accessible with the .tag() method.

For the RgbColor example above, the tags look like this:

class RgbColor(color_t, obj_t):
  _type_tag = 65  # <-- The tag is `65`
  __slots__ = ('r', 'g', 'b')

  ...


# Then the color_e and object_e enums both use 65 as the RgbColor tag:

class color_e(object):
  Red = 1
  Blue = 2
  Green = 3
  Other = 65

class obj_e(object):
  Person = 64
  RgbColor = 65

Type checking is made possible with multiple inheritance. Ie. RgbColor is a subclass of both color_t and obj_t.

Type Syntax

For sum type constructors and product types, ASDL uses type signatures that match the Python type system. For historical reasons, we also support an older syntax with ? and * postfix operators. These correspond to Optional[T] and List[T] respectively.

Here are some example type expressions:

List[mytype]       # used to be mytype*
Optional[mytype]   # used to be mytype?
Dict[str, int]
Dict[int, List[str]]

The mytype* and mytype? syntax comes from the original ASDL language. We still use it in old code, but now prefer the List[mytype] and Optional[mytype] syntax in new code.

ASDL Modules and Use

Every .asdl file has a module specifier which names the current module. You can only have one of these per ASDL file. Each module can then be imported via use from another ASDL module.

For example, if we had the modules:

# demo2.asdl

module demo2
{
  names = Alice | Bob
}
# nested.demo3.asdl

module demo3
{
  player_class = Pirate | Wizard
}

We could import these types with use. Note that instead of using / as a path separator, use a space. This corresponds to the fact that directories create namespaces in ASDL just as they do in python.

# demo.asdl

module demo
{
  use demo2 { player_name }
  use nested demo3 { player_class }

  Player = (player_name name, player_class class)
}

When building this, you have to separately build demo.asdl, demo2.asdl and nested/demo3.asdl. The ASDL compiler will not automatically discover and compile all imported ASDL files.

Circular dependencies are allowed in ASDL.

Compiling ASDL

This is only relevant to contributors adding new ASDL files, or working on ASDL directly.

ASDL is written in .asdl files. It can then be compiled to python with:

build/py.sh gen-asdl-py $myfile.asdl

Or C++ with by adding an asdl_library in the corresponding NINJA_subgraph.py:

ru.asdl_library(
    'core/value.asdl',
    # #include in cc file from 'use' deps
    deps=['//frontend/syntax.asdl', '//core/runtime.asdl'])

When you run build.py.sh all, gen-asdl-py is called on all the ASDL files. The output is written to _devbuild/gen/*_asdl.py. Importing the types is done with:

from _devbuild.gen.id_kind_asdl import Id

Attributes / Generate

Most contributors will not need this feature.

ASDL also supports "generate" attributes on sum types. These can be used to control how ASDL generates those types in your target language (python or C++). These are an advanced feature not used by most types.

The syntax is like this:

color = Red | Blue | Green
        generate [no_namespace_suffix]

# Multiple items can also be passed
color = Red | Blue | Green
        generate [no_namespace_suffix, integers]

These are documented in asdl/parse.py:

_CODE_GEN_OPTIONS = [
    'no_namespace_suffix',  # Id.Foo instead of Id_e.Foo
    'integers',  # integer builtin_i instead of strongly typed builtin_e
    'uint16',  # like integers, but use uint16_t instead
    'bit_set',  # not implemented: 1 << n instead of n

    # probably don't need this
    # 'common_synthetic_field:left_tok',

    # Put this type, and transitive closure of types it references, in the
    # unique "first class variant" namespace, and generate type reflection.
    'reflect_all_types',

    # Squeeze and Freeze, with the number of bits as a option Hm the headers
    # here still need type reflection.  Probably OK.
    'mirror_all_types:16',
]

Conclusion

ASDL is a language to expressing algebraic data types. We use a custom fork of the language in Oils for UNIX.

ASDL has product types (like structs), sum types (like tagged unions / rust enums), and "shared types".

module value
{
  # sum type
  value = Null
        | Bool(bool b)
        | Obj %Obj      # shared type, variant

  # product type
  Obj = (Optional[Obj] prototype, Dict[str, value] d)
}

In the above example, ASDL will generate the following:

ASDL has features like a module system, so you can import other types defined in ASDL with use.

module demo
{
  use demo2 { player_name }
  use nested demo3 { player_class }

  Player = (player_name name, player_class class)
}

For more on the history of ASDL in Oils, see Using Zephyr ASDL in the wiki.