Language Configuration

Languages are specified with YAML syntax. This page documents the possible properties/configuration fields. See the section Using YAML for tips.

version

Type:

string, number

Required:

False

The language version, shown with --version if a usage pattern is specified.

Example:

version: 0.0.1

usage

Type:

string

Required:

False

Command line usage pattern specified with the Docopt language. This will be run with options_first=False and so options can occur in any order around positional arguments.

The only reserved identifier is the positional argument <src>, which is used to locate the language input file. It can be used in the following ways:

  • Using [<src>] means the input will be read from the file <src> or from stdin if the positional argument is not provided.

  • Using <src> means the input will only be read from the file <src>.

  • Not using it at all means the input will always be read from stdin.

Example:

usage: |
  language

  Usage:
      language [options] <config> [<src>]

  Options:
      -h, --help         Show this screen.
      -v, --version      Show version.
      -o, --output=FILE  Output file.

Note

Command line arguments are available in code through the identifier args. They will be represented with a dictionary with keys corresponding to positional and optional arguments as shown below.

$ serl run language -o out.txt example.cfg in.txt
{
  '--help': False,
  '--output': 'out.txt',
  '--version': False,
  '<config>': 'example.cfg',
  '<src>': 'in.txt'
}
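As a rough illustration of the three <src> modes described above, the rule can be sketched in Python as follows. The helper read_source is hypothetical and not part of serl; it only assumes that args is a dictionary shaped like the one above, and that an optional positional argument that was not supplied appears as None.

```python
import io
import sys


def read_source(args, stdin=None):
    """Hypothetical helper: return the language input text based on args."""
    stdin = sys.stdin if stdin is None else stdin
    # '<src>' is missing when the usage pattern never mentions it, and is
    # assumed to be None when declared as [<src>] but not supplied.
    path = args.get('<src>')
    if path is None:
        return stdin.read()
    with open(path, encoding='utf-8') as handle:
        return handle.read()


# No <src> supplied, so the text comes from the (here simulated) stdin.
text = read_source({'--output': 'out.txt', '<src>': None},
                   stdin=io.StringIO('1 + 2'))
```

This is a conceptual sketch of the rule only; it does not describe how serl locates the input internally.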

tokens

Type:

object

Required:

False

Property Type:

string

Tokens to be used when constructing the lexer. Tokens are specified as a mapping between a token identifier and regex pattern. Token identifiers can be used within grammar productions as terminals and can contain any character except for whitespace.

Tokens can be referenced and substituted into other tokens through token expansion. See the meta.tokens.ref property for details on the syntax used to reference other tokens.

Note

Any tokens defined but not used within the grammar object will be ignored. This could be because those tokens are used only to be substituted into another token for readability.

Tokens can also be specified implicitly. These are tokens used within a grammar production but not defined within this object. These tokens will be interpreted literally as a fully escaped regex. For example, if ** is used but not defined in this object then its corresponding token pattern would be \*\*. This is useful for tokens such as operators or delimiters.
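The effect of fully escaping an implicit token can be reproduced with re.escape from Python's standard library. This is a small sketch of the idea, not necessarily the call serl makes internally.

```python
import re

implicit = '**'                  # symbol used in a production but not defined in tokens
pattern = re.escape(implicit)    # every regex metacharacter is escaped

# The escaped pattern matches the literal text and nothing else.
matches_literal = re.fullmatch(pattern, '**') is not None
```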

Note

By default, regex patterns will be specified according to Python’s re module with the verbose flag. However, this can be changed with the meta.tokens.regex and meta.tokens.flags properties respectively.

Example:

tokens:
  +: \+
  '-': \-
  '*': \*
  /: /
  (: \(
  ): \)
  num: \d+

precedence

Type:

array

Required:

False

Item Type:

string

A list of token precedence levels, from lowest (first) to highest (last). This can be used to disambiguate shift/reduce or reduce/reduce parser conflicts. Precedence levels are specified with an association type followed by a whitespace separated list of identifiers from the tokens object. Association type can be left, right, or nonassoc.

The precedence of a specific grammar production can also be overridden by specifying the non-terminal name and position (name[pos]). This will only affect the rightmost terminal of the production. For example, this could be used to give higher precedence to unary minus.

Example:

precedence:
  - left + -
  - right * /
  - nonassoc < >
  - right exp[4]
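The entries above can be read as an association type followed by symbols. The sketch below splits each entry into a tuple, a shape similar to the precedence tables used by PLY-style parser generators. The function parse_precedence is hypothetical, and whether serl stores precedence this way is an assumption.

```python
VALID_ASSOCIATIONS = ('left', 'right', 'nonassoc')


def parse_precedence(levels):
    """Hypothetical helper: turn 'assoc sym sym ...' strings into tuples."""
    table = []
    for level in levels:
        assoc, *symbols = level.split()
        if assoc not in VALID_ASSOCIATIONS:
            raise ValueError(f'unknown association type: {assoc}')
        table.append((assoc, *symbols))
    return table  # lowest precedence first, highest last


table = parse_precedence(['left + -', 'right * /', 'nonassoc < >', 'right exp[4]'])
```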

error

Type:

string

Required:

False

The name of an error token to be used in the grammar object. The error token can be used to support panic-mode parsing. Typically, a good place to use an error token is before a delimiter.

This can be used to report more errors rather than stopping at the first or, if meta.grammar.permissive is set to True, to allow execution to continue.

The error token is accessible within code like other terminal variables; however, it won’t contain any capture groups. Instead, it will be a tuple containing the whole error span as the first element.

Example:

In the following grammar snippet, a new production has been added with the error token (err) placed before a semi-colon (marking the end of a statement).

error: err
grammar:
  err-stmt:
    - stmt ;
    - err ;
  stmt: ...

The following would happen if stmt contained a syntax error:

  • Any symbols pushed onto the stack will be popped off (assuming no error token within stmt) until the state corresponding to err-stmt is reached.

  • All tokens will be discarded until a semi-colon.

  • The $.grammar.err-stmt[1] production will be reduced.

  • On execution $.code.err-stmt[1] will be run.
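The effect of these steps can be imitated with a toy simulation. It is not an LR parser and not serl's implementation: the names is_valid and parse are hypothetical, and the stand-in grammar (numbers separated by operators) is an assumption made purely for illustration. The point it shows is that a bad statement is skipped up to the next semi-colon and surfaces as a one-element tuple holding the error span.

```python
OPERATORS = {'+', '-', '*', '/'}


def is_valid(stmt):
    """Toy stand-in for stmt: num (op num)*."""
    if len(stmt) % 2 == 0:
        return False
    return all(tok.isdigit() if i % 2 == 0 else tok in OPERATORS
               for i, tok in enumerate(stmt))


def parse(tokens, sync=';'):
    """Collect statements, replacing malformed ones with an error span."""
    results, current = [], []
    for tok in tokens:
        if tok != sync:
            current.append(tok)      # keep reading until the delimiter
            continue
        if is_valid(current):
            results.append(('stmt', current))
        else:
            # Discarded tokens are kept as the whole error span.
            results.append(('err', (' '.join(current),)))
        current = []
    return results


parsed = parse('1 + 2 ; 3 + ; 4 ;'.split())
```

Here the second statement is malformed, yet the third is still parsed, which is how more than one error can be reported in a single run.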

Note

The error token shouldn’t be used at the end of a grammar production.

grammar

Type:

object

Required:

True

Property Type:

string, array[string]

The language grammar specified as an object of productions. A grammar production consists of a head and a body, where the head is a non-terminal and the body is an arrangement of terminals (i.e., tokens) and other non-terminals.

A key of this property represents the head of a production, with the value being the corresponding body. To define multiple productions with the same head specify the value as a list.

Whitespace is ignored and so rules can be spread across multiple lines. The grammar start symbol will be taken as the head of the production defined first.

Example:

grammar:
  start: # production for start symbol
  non-terminal:
    - # production 0 for non-terminal
    - # production 1 for non-terminal
    - # production 2 for non-terminal
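Once the YAML is loaded, the grammar can be thought of as an ordered mapping. The sketch below assumes it has been loaded into a Python dict that preserves definition order, and shows one way to pick the start symbol and to treat single and multiple productions uniformly. The names productions and start_symbol, and the sample grammar, are hypothetical.

```python
grammar = {
    'start': 'non-terminal',
    'non-terminal': ['num', 'num   +\n  non-terminal'],
}


def productions(grammar):
    """Return every head mapped to a list of whitespace-normalised bodies."""
    normalised = {}
    for head, body in grammar.items():
        bodies = [body] if isinstance(body, str) else list(body)
        # Whitespace is insignificant, so collapse it to single spaces.
        normalised[head] = [' '.join(b.split()) for b in bodies]
    return normalised


start_symbol = next(iter(grammar))   # head of the production defined first
rules = productions(grammar)
```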

code

Type:

object

Required:

True

Property Type:

string, array[string | null]

Language functionality specified with code blocks written in Python code or Shell commands. Defined properties of this object directly correspond to the properties of the grammar object to allow functionality to be associated with syntax.

Example:

grammar:
  non-terminal: # production

code:
  non-terminal: # functionality for production

For multiple productions with the same non-terminal head, the list elements also correspond.

Example:

grammar:
  non-terminal:
    - # production 0
    - # production 1
    - # production 2

code:
  main: # main functionality must come first
  non-terminal:
    - # functionality for non-terminal production 0
    - # functionality for non-terminal production 1
    - # functionality for non-terminal production 2

The return value for properties defined within the grammar object but not within this object will be a Python dictionary of their Variable Environment. Details about return values can be found within Python Code or Shell Commands.

Variable Environment

Each code block has access to the global scope and variables of the symbols in the corresponding grammar production i.e., grammar variables. See non-terminal variables and terminal variables.

The following variables are initially available in the global scope:

  • __name__: The name of the executing language

  • args: A dictionary of the parsed command line argument values (see usage)

  • Start symbol non-terminal variable (only with main functionality)

Example:

tokens:
  name: (\w+):(\w+)

grammar:
  tag: |
    <name>
      value
    </name>
  value: ...

code:
  main: ...
  tag: ...
  value: ...

For the configuration above and the following source (details of value omitted):

<a:b>
  value
</c:d>

The code block code.tag (corresponding to grammar.tag) would have access to the following environment:

{
  # Any global variables, or keyword variables passed down through tag(...)
  '<': ('<',),
  '</': ('</',),
  '>': [('>',), ('>',)],
  'name': [('a:b', 'a', 'b'),('c:d', 'c', 'd')],
  'value': <function execute at 0x000002273B488AE0>
}

Note

  • The terminal variable name is returned as a list since the symbol is used multiple times in the grammar.tag production. Elements of this list correspond to the order they appear in the grammar production.

  • Calling the function value will execute the code block code.value.

  • </ is a single token because it is an implicit token (see tokens). To avoid this a space could be added between the symbols in the grammar, or < and > could be defined explicitly within the tokens object.
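The shape of the name entries above (whole match first, then each capture group) can be reproduced with Python's re module. This is a sketch of the structure only; the helper terminal_value is hypothetical and the lexing itself is simplified to a plain search.

```python
import re

NAME = re.compile(r'(\w+):(\w+)')


def terminal_value(match):
    """Whole match followed by each capture group, as a tuple."""
    return (match.group(0), *match.groups())


source = '<a:b> value </c:d>'
# The symbol occurs twice in the production, so the values form a list.
name = [terminal_value(m) for m in NAME.finditer(source)]
```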

Main functionality

If the first property doesn’t correspond to a defined grammar non-terminal then it acts as the main functionality and is executed in a global context. This allows code to be executed before and after the main AST traversal.

Note

If the main functionality is defined as a list then each element of the list will be executed separately in order.

If no main functionality is defined then traversal, and thus execution, is initiated with the code of the grammar start symbol. Otherwise, it is the responsibility of the main function to start traversal, which is done by calling the non-terminal variable corresponding to the grammar start symbol.

Any values returned from a main functionality code-block or the code-block corresponding to the grammar start symbol (if no main functionality defined) will be sent to stdout.

Python Code

Without the Shell Commands modifier ($), blocks are by default interpreted as normal Python code.

When non-terminal variables are called in Python, they can take any number of keyword arguments which will be passed down to the local environment of the called code block.

Note

Variables in Python can only be accessed by a limited character set. However, grammar variables that use characters outside this set can still be accessed through the locals or vars functions, which allow access to variables with arbitrary names.
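The following sketch shows the idea with plain exec: an environment containing names such as < and > is supplied as the local namespace, and the block reads them through locals() and vars(). The environment contents are taken from the earlier example; running the block with exec is an assumption made for illustration.

```python
env = {
    '<': ('<',),
    '>': ('>',),
    'name': [('a:b', 'a', 'b'), ('c:d', 'c', 'd')],
}

# '<' and '>' are not valid Python identifiers, so they are looked up by key.
block = "opening = locals()['<'][0] + name[0][0] + vars()['>'][0]"

exec(block, {}, env)
opening = env['opening']
```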

The value of the final Python statement of a code block will be used as the return value. If you don’t want to return anything you can explicitly make the final statement None or pass.

Note

  • Only the value of the final statement is used, and so if this is an assignment (e.g., a = 5) then the variable a would never be created, but 5 would be returned.

  • If the final statement doesn’t have a value (e.g., a function definition) then None will be returned.

  • The return keyword can only be used within functions or the final statement, but is not necessary for the latter.
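One common way to obtain "the value of the final statement" is to split the block with the ast module, execute everything but the last statement, and evaluate the last one separately. The sketch below follows the first two rules listed above. The function run_block is hypothetical, and handling of a trailing return statement is left out.

```python
import ast


def run_block(source, env):
    """Execute source in env and return the value of its final statement."""
    tree = ast.parse(source)
    if not tree.body:
        return None
    *head, last = tree.body
    # Run everything except the final statement as ordinary code.
    exec(compile(ast.Module(body=head, type_ignores=[]), '<block>', 'exec'), env)
    if isinstance(last, (ast.Expr, ast.Assign)):
        # Only the value is evaluated; an assignment target is never bound.
        return eval(compile(ast.Expression(body=last.value), '<block>', 'eval'), env)
    # Statements without a value (e.g. a function definition) yield None.
    exec(compile(ast.Module(body=[last], type_ignores=[]), '<block>', 'exec'), env)
    return None


env = {}
product = run_block('x = 2\nx * 3', env)   # final expression
assigned = run_block('a = 5', env)         # final assignment
```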

Example:

grammar:
  tag: ...

code:
  main: | # python
    # import modules, create classes/functions etc.
    val = tag() # Main execution on grammar start symbol called 'tag'
    # Do something with val
    val # return val to stdout
  tag: # Code for tag

Note

The YAML Embedded Languages extension, currently available for VS Code, provides syntax highlighting within YAML block scalars when the language name is given in a comment next to the block, as shown above.

Shell Commands

Shell commands can be used by making the first character of the code-block $. Global and grammar variables can be accessed using the Python format language.

Accessing non-terminal variables will be equivalent to calling them, although keyword arguments cannot be passed with the format language.

Note

  • Use of { or } for anything other than format strings requires escaping with {{ or }}, e.g., $ echo ${{HOME}}.

  • Grammar variables whose names are incompatible with the format language can be accessed through the special key locals(), e.g., {locals()[{]} for a variable named {.

The output (stdout) of a command will be used as the return value for the code block. If the command fails it will raise a CalledProcessError, which if caught allows access to stderr and the returncode.
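The behaviour described above can be approximated with str.format and subprocess.run. This is a sketch under assumptions: the helper run_shell is hypothetical, the command is run through the system shell, and non-terminal variables (which would be called on access) are not modelled.

```python
import subprocess


def run_shell(block, **env):
    """Hypothetical helper: format a '$ ...' block and return its stdout."""
    command = block.lstrip('$').strip().format(**env)
    done = subprocess.run(command, shell=True, check=True,
                          capture_output=True, text=True)
    return done.stdout


output = run_shell('$ echo {args[<src>]}', args={'<src>': 'in.txt'})

# Doubled braces survive formatting as literal braces for the shell.
escaped = 'echo ${{HOME}}'.format()

try:
    run_shell('$ exit 3')
except subprocess.CalledProcessError as error:
    failure = (error.returncode, error.stderr)
```

With check=True a non-zero exit status raises CalledProcessError, and because the output is captured the exception carries both stderr and returncode.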

Example:

code:
  non-terminal: $ echo {args[<src>]}

tokentypes

Type:

object

Required:

False

Property Type:

string

Tokens and corresponding type used in the syntax highlighter lexer. This is represented as a mapping between token identifiers from the tokens object and a dot separated list in title case (e.g., Token.Text.Whitespace) to represent token type. Arbitrary regex can also be assigned a token type.

Important

To take advantage of built-in Pygments styles it is recommended to use standard token names, see Pygments built-in tokens.

Example:

tokentypes:
  +: Operator
  '-': Operator
  '*': Operator
  /: Operator
  num: Number

styles

Type:

object

Required:

False

Property Type:

string

The style to be applied to a certain token type. This is represented as a mapping between a token type and a style specified with Pygments style rules.

Example:

styles:
  Number: "#42f2f5"
  Keyword.Constant: "bold #ff0000"
  Punctuation: "#f57242"
  String: "#75b54a"
  Whitespace: "bg:#e8dfdf"

Note

The quotes around the styles in the above example are necessary, as otherwise the hex colours using # would be treated as YAML comments. See Using YAML for tips.

See Static Syntax Highlighting for more details.

environment

Type:

string

Required:

False

The name of a virtual environment to be created to contain any Python dependencies specified in requirements.

This is only required if you plan to use dependencies that may clash with those used by the tool or other serl languages used in the same environment. Not setting this property means that language dependencies are installed to the environment where the instance of the tool being used is installed.

To list the dependencies used by the tool and then get a specific version thereof you can use:

$ pip show serl
$ pip show <dependency>

Note

When running a language that specifies an environment that doesn’t already exist, a new environment will be created and the specified requirements will be installed. This may take a bit of time to complete but will only be run once unless the environment is removed.

Environments are created using the venv module from the Python standard library and are located in the directory ~/.serl/environments.

Environments can be manually created, however they must be created in the aforementioned directory and with the same venv module. Creating environments manually would still require setting the value of this property to the name of the environment directory. If two languages specify an environment with the same name, the environment will be shared.
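Manual creation could look roughly like the sketch below, which uses the standard venv module and the directory named above. The helper names are hypothetical, and whether serl needs pip inside the environment (with_pip=True) is an assumption.

```python
import venv
from pathlib import Path

ENV_ROOT = Path.home() / '.serl' / 'environments'


def environment_path(name, root=ENV_ROOT):
    """Directory in which an environment with this name is expected."""
    return root / name


def ensure_environment(name, root=ENV_ROOT):
    """Create the environment with venv unless it already exists."""
    path = environment_path(name, root)
    if not path.exists():
        venv.EnvBuilder(with_pip=True).create(path)
    return path


target = environment_path('venv-lang')
```

Calling ensure_environment('venv-lang') would then create the directory once; later calls find it and return immediately.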

Example:

environment: venv-lang

requirements

Type:

string

Required:

False

The required dependencies for the language, which if specified as a pip requirements file, can be automatically downloaded with the command line run option -r or --requirements.

Example:

requirements: | # pip
  PyYAML==6.0
  docopt==0.6.2
  ply==3.11
  regex==2022.10.31
  networkx==2.8.8
  jsonschema==4.17.3
  Pygments==2.13.0
  Pillow==9.4.0
  requests==2.28.2

  # Dev
  pytest==7.2.2
  pytest-cov==4.0.0

meta

Type:

object

Required:

False

The meta object provides the ability to alter certain aspects of the configuration or language behaviour.

meta.tokens

Type:

object

Required:

False

Properties relating to the tokens object.

meta.tokens.ref

Type:

string, null

Required:

False

Default:

^token(?= )|(?<= )token(?= )|(?<= )token$

A regex used to determine how tokens can be referenced in other tokens and consequently expanded (substituted). If the value of this property is set to null or equivalently defined but not given a value, token expansion will not take place.

The special identifier token is used as a substitute for user-defined token names. If this special identifier isn’t used the defined regex is assumed to be a prefix to the token name.

Example:

meta:
  tokens:
    ref: \$token

In this example the regex for a token named text defined in the tokens object could be substituted into any other token by specifying $text. As previously mentioned if the identifier token is not used, the value of meta.tokens.ref is taken to be a prefix and so this example can be equivalently specified as:

meta:
  tokens:
    ref: \$

Note

The $ symbol has been escaped because this string is treated as a regex and this has the special meaning of signifying the end of a string.
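Token expansion can be pictured as a regex substitution in which the special identifier token is swapped for each defined token name. The sketch below is a simplified, single-pass illustration: the function expand_tokens is hypothetical, references are replaced with the referenced pattern as-is (no grouping is added), and nested references are not resolved. None of this is confirmed to match serl's implementation.

```python
import re

DEFAULT_REF = r'^token(?= )|(?<= )token(?= )|(?<= )token$'


def expand_tokens(tokens, ref=DEFAULT_REF):
    """Substitute references to other tokens into each token pattern."""
    if ref is None:
        return dict(tokens)          # expansion disabled
    if 'token' not in ref:
        ref += 'token'               # treat the regex as a prefix
    expanded = {}
    for name, pattern in tokens.items():
        for other, replacement in tokens.items():
            if other == name:
                continue
            reference = ref.replace('token', re.escape(other))
            # A function replacement avoids backslash processing by re.sub.
            pattern = re.sub(reference, lambda m, r=replacement: r, pattern)
        expanded[name] = pattern
    return expanded


with_dollar = expand_tokens({'digit': r'\d', 'num': '$digit+'}, ref=r'\$token')
with_prefix = expand_tokens({'digit': r'\d', 'num': '$digit+'}, ref=r'\$')
with_default = expand_tokens({'int': r'\d+', 'float': r'int \. int'})
```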

meta.tokens.regex

Type:

boolean

Required:

False

Default:

False

Setting this property to True allows the use of the more feature-rich third-party regex module for patterns in the tokens object.

Important

When used this will change the interface for language captures. Specifically, they will now be returned as a list rather than a single value. This is due to the fact that the regex package offers the ability to retain all captures within a group even when modified by a regex quantifier.
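The difference can be seen with the standard re module, which keeps only the final repetition of a quantified group. The third-party regex module additionally exposes every repetition through Match.captures(), which is mentioned in a comment only so that the snippet runs without extra dependencies.

```python
import re

match = re.fullmatch(r'(\w)+', 'abc')

# With re, a repeated group retains only its last capture.
last_only = match.group(1)

# With the third-party regex module, match.captures(1) on the same pattern
# would list all three repetitions, hence the switch to a list interface.
```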

Note

The regex module may only be used with CPython implementations.

Run the following two commands in Python’s interactive shell to see what implementation you’re using:

$ python
>>> import platform
>>> platform.python_implementation()

Example:

meta:
  tokens:
    regex: True

meta.tokens.ignore

Type:

string

Required:

False

Default:

\s

A regex specifying characters to be ignored by the lexer. This will have the lowest precedence in the lexer.

Note

The regex flags used for this property will be the same as those used in the tokens object. Therefore, changes to the meta.tokens.flags will also be reflected here.

Example:

meta:
  tokens:
    ignore: \s | \#.*
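To see what the pattern above covers, it can be compiled with the verbose flag and applied to a line of input. Removing the matches with sub is only a way of visualising them; a lexer would skip such text rather than strip it in advance.

```python
import re

# Under VERBOSE the spaces around | are insignificant and \# is a literal #.
IGNORE = re.compile(r'\s | \#.*', re.VERBOSE)

stripped = IGNORE.sub('', '1 + 2  # add two numbers')
```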

meta.tokens.flags

Type:

string

Required:

False

Default:

VERBOSE

A whitespace separated list of regex flags for the lexer to use corresponding to the regex patterns defined in the tokens object. Valid flags include any defined in the re module or if meta.tokens.regex is enabled, any flag in the regex module.

Example:

meta:
  tokens:
    flags: VERBOSE MULTILINE I
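A whitespace separated flag list such as the one above can be mapped onto the re module's flag constants by name and combined with bitwise OR. The function parse_flags is a hypothetical sketch of that mapping for the re module only.

```python
import re
from functools import reduce
from operator import or_


def parse_flags(spec):
    """Combine named re flags, e.g. 'VERBOSE MULTILINE I', into one value."""
    return reduce(or_, (getattr(re, name) for name in spec.split()),
                  re.RegexFlag(0))


flags = parse_flags('VERBOSE MULTILINE I')
pattern = re.compile(r'^ num $', flags)   # spaces ignored, case-insensitive
```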

meta.tokens.default

Type:

boolean

Required:

False

Default:

True

If set to True, the lexer won’t produce invalid character errors. Instead, characters that would normally be invalid are matched as default tokens, which means they can be matched by the error token.

Example:

meta:
  tokens:
    default: False

meta.grammar

Type:

object

Required:

False

Properties relating to the grammar object.

meta.grammar.permissive

Type:

boolean

Required:

False

Default:

True

If this property is set to False, language execution will not take place in the event of a syntax error, even if input was recovered during parsing.

Example:

meta:
  grammar:
    permissive: False