Language Configuration
Languages are specified with YAML syntax. This page documents the possible properties/configuration fields. See the section Using YAML for tips.
version
- Type: string, number
- Required: False
Language version shown with --version if a usage pattern is specified.
- Example:
version: 0.0.1
usage
- Type: string
- Required: False
Command line usage pattern specified with the Docopt language.
This will be run with options_first=False and so options can occur in any order around positional arguments.
The only reserved identifier is the positional argument <src>, which is used to locate the language input file.
It can be used in the following ways:
- Using [<src>] means the input will be read from the file <src>, or from stdin if the positional argument is not provided.
- Using <src> means the input will only be read from the file <src>.
- Not using it at all means the input will always be read from stdin.
- Example:
usage: |
  language

  Usage:
    language [options] <config> [<src>]

  Options:
    -h, --help         Show this screen.
    -v, --version      Show version.
    -o, --output=FILE  Output file.
Note
Command line arguments are available in code through the identifier args.
They will be represented with a dictionary with keys corresponding to positional and optional arguments as shown below.
$ serl run language -o out.txt example.cfg in.txt
{
    '--help': False,
    '--output': 'out.txt',
    '--version': False,
    '<config>': 'example.cfg',
    '<src>': 'in.txt'
}
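As a rough illustration of how a code block might consume args, the sketch below resolves the language input following the [<src>] convention described above. The helper name read_source is hypothetical and not part of serl, and the args dictionary is built by hand to mirror the one shown.

```python
import io
import sys


def read_source(args, stdin=sys.stdin):
    """Return the language input text (hypothetical helper, not part of serl)."""
    src = args.get("<src>")  # None or missing when [<src>] was not supplied
    if src:
        with open(src, encoding="utf-8") as handle:
            return handle.read()
    # Fall back to standard input when no file was given.
    return stdin.read()


# Hand-built stand-in for the dictionary available as `args` in a code block.
args = {
    "--help": False,
    "--output": "out.txt",
    "--version": False,
    "<config>": "example.cfg",
    "<src>": None,
}
text = read_source(args, stdin=io.StringIO("1 + 2\n"))
```

Passing the stream explicitly keeps the helper easy to exercise without a real terminal.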
tokens
- Type: object
- Required: False
- Property Type: string
Tokens to be used when constructing the lexer. Tokens are specified as a mapping between a token identifier and regex pattern. Token identifiers can be used within grammar productions as terminals and can contain any character except for whitespace.
Tokens can be referenced and substituted into other tokens through token expansion. See the meta.tokens.ref property for details on the syntax used to reference other tokens.
Note
Any tokens defined but not used within the grammar object will be ignored. This could be because those tokens are used only to be substituted into another token for readability.
Tokens can also be specified implicitly.
These are tokens used within a grammar production but not defined within this object.
These tokens will be interpreted literally as a fully escaped regex.
For example, if ** is used but not defined in this object then its corresponding token pattern would be \*\*.
This is useful for tokens such as operators or delimiters.
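The "fully escaped regex" for an implicit token can be pictured with re.escape from the Python standard library. This is only a sketch of the idea; the function name implicit_pattern is an assumption and whether serl uses re.escape internally is not stated here.

```python
import re


def implicit_pattern(symbol):
    """Pattern for a token used in the grammar but not defined in `tokens`."""
    return re.escape(symbol)


# The operator ** becomes a pattern that matches it literally.
pattern = implicit_pattern("**")
matches_literal = re.fullmatch(pattern, "**") is not None
```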
Note
By default, regex patterns will be specified according to Python’s re module with the verbose flag. However, this can be changed with the meta.tokens.regex and meta.tokens.flags properties respectively.
- Example:
tokens:
  +: \+
  '-': \-
  '*': \*
  /: /
  (: \(
  ): \)
  num: \d+
precedence
- Type: array
- Required: False
- Item Type: string
A list of token precedence levels, from lowest (first) to highest (last).
This can be used to disambiguate shift/reduce or reduce/reduce parser conflicts.
Precedence levels are specified with an association type followed by a whitespace separated list of identifiers from the tokens object.
Association type can be left, right, or nonassoc.
The precedence of a specific grammar production can also be overridden by specifying the non-terminal name and position (name[pos]).
This will only affect the rightmost terminal of the production.
For example, this could be used to give higher precedence to unary minus.
- Example:
precedence:
  - left + -
  - right * /
  - nonassoc < >
  - right exp[4]
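To make the structure of a precedence level concrete, the following sketch splits each level into an association type and its identifiers, producing tuples similar to the precedence tables of yacc-style parser generators. The function parse_precedence is hypothetical and only illustrates the format, not how serl processes it.

```python
ASSOCIATIONS = {"left", "right", "nonassoc"}


def parse_precedence(levels):
    """Split each level into (association, identifier, ...), lowest level first."""
    table = []
    for level in levels:
        assoc, *symbols = level.split()
        if assoc not in ASSOCIATIONS:
            raise ValueError(f"unknown association type: {assoc!r}")
        table.append((assoc, *symbols))
    return table


table = parse_precedence(
    ["left + -", "right * /", "nonassoc < >", "right exp[4]"]
)
```

The last entry keeps exp[4] as a single identifier, since a production override is written without internal whitespace.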
error
- Type: string
- Required: False
The name of an error token to be used in the grammar object. The error token can be used to support panic-mode parsing. Typically, a good place to use an error token is before a delimiter.
This can be used to find more errors rather than stopping on the first or, if meta.grammar.permissive is set to True, to allow execution to continue.
The error token is accessible within code like other terminal variables, however it won’t contain any capture groups. Instead it will be a tuple containing the whole error span as the first element.
- Example:
In the following grammar snippet, a new production has been added with the error token (err) placed before a semi-colon (marking the end of a statement).
error: err
grammar:
  err-stmt:
    - stmt ;
    - err ;
  stmt: ...
The following would happen if stmt contained a syntax error:
- Any symbols pushed onto the stack will be popped off (assuming no error token within stmt) until the state corresponding to err-stmt is reached.
- All tokens will be discarded until a semi-colon.
- The $.grammar.err-stmt[1] production will be reduced.
- On execution $.code.err-stmt[1] will be run.
Note
The error token shouldn’t be used at the end of a grammar production.
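The token-discarding step of panic-mode recovery can be modelled in a few lines. This is a simplified sketch that works on a flat list of token strings and ignores the parser stack entirely; the function recover is hypothetical and is not serl's parser.

```python
def recover(tokens, position, sync=";"):
    """Skip tokens from `position` up to the sync token.

    Returns the skipped tokens and the index at which parsing would resume.
    """
    end = position
    while end < len(tokens) and tokens[end] != sync:
        end += 1
    skipped = tokens[position:end]
    # Step past the sync token itself when it was found.
    return skipped, min(end + 1, len(tokens))


# The statement `x = 1 + + ;` contains an error starting at index 3.
tokens = ["x", "=", "1", "+", "+", ";", "y", "=", "2", ";"]
skipped, resume = recover(tokens, 3)
```

After recovery, parsing continues with the next statement, which is how more than one error can be reported in a single run.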
grammar
- Type: object
- Required: True
- Property Type: string, array[string]
The language grammar specified as an object of productions. A grammar production consists of a head and a body, where the head is a non-terminal and the body is an arrangement of terminals (i.e., tokens) and other non-terminals.
A key of this property represents the head of a production, with the value being the corresponding body. To define multiple productions with the same head specify the value as a list.
Whitespace is ignored and so rules can be spread across multiple lines. The grammar start symbol will be taken as the head of the production defined first.
- Example:
grammar:
  start: # production for start symbol
  non-terminal:
    - # production 0 for non-terminal
    - # production 1 for non-terminal
    - # production 2 for non-terminal
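A small sketch of how such a mapping could be walked once loaded into Python: string and list values are normalised to a list of bodies, whitespace is collapsed, and the start symbol is taken as the first key. The grammar contents and the function productions are illustrative assumptions.

```python
def productions(grammar):
    """Yield (head, index, body) for every production in a grammar mapping."""
    for head, body in grammar.items():
        bodies = body if isinstance(body, list) else [body]
        for index, item in enumerate(bodies):
            # Whitespace is insignificant, so collapse it for display.
            yield head, index, " ".join(item.split())


# Hypothetical grammar, as it might look after loading the YAML.
grammar = {
    "expr": ["expr + term", "term"],
    "term": "num",
}
start_symbol = next(iter(grammar))  # head of the production defined first
rules = list(productions(grammar))
```

This relies on mappings preserving insertion order, which Python dictionaries do.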
code
- Type: object
- Required: True
- Property Type: string, array[string | null]
Language functionality specified with code blocks written in Python code or Shell commands. Defined properties of this object directly correspond to the properties of the grammar object to allow functionality to be associated with syntax.
- Example:
grammar:
  non-terminal: # production
code:
  non-terminal: # functionality for production
For multiple productions with the same non-terminal head, the list elements also correspond.
- Example:
grammar:
  non-terminal:
    - # production 0
    - # production 1
    - # production 2
code:
  main: # main functionality must come first
  non-terminal:
    - # functionality for non-terminal production 0
    - # functionality for non-terminal production 1
    - # functionality for non-terminal production 2
The return value for properties defined within the grammar object but not within this object will be a Python dictionary of their Variable Environment. Details about return values can be found within Python Code or Shell Commands.
Variable Environment
Each code block has access to the global scope and variables of the symbols in the corresponding grammar production i.e., grammar variables. See non-terminal variables and terminal variables.
The following variables are initially available in the global scope:
- __name__: The name of the executing language
- args: A dictionary of the parsed command line argument values (see usage)
- Start symbol non-terminal variable (only with main functionality)
- Example:
tokens:
  name: (\w+):(\w+)
grammar:
  tag: |
    <name>
    value
    </name>
  value: ...
code:
  main: ...
  tag: ...
  value: ...
For the configuration above and the following source (details of value omitted):
<a:b>
value
</c:d>
The code block code.tag (corresponding to grammar.tag) would have access to the following environment:
{
    # Any global variables, or keyword variables passed down through tag(...)
    '<': ('<',),
    '</': ('</',),
    '>': [('>',), ('>',)],
    'name': [('a:b', 'a', 'b'), ('c:d', 'c', 'd')],
    'value': <function execute at 0x000002273B488AE0>
}
Note
- The terminal variable name is returned as a list since the symbol is used multiple times in the grammar.tag production. Elements of this list correspond to the order they appear in the grammar production.
- Calling the function value will execute the code block code.value.
- </ is a single token because it is an implicit token (see tokens). To avoid this a space could be added between the symbols in the grammar, or < and > could be defined explicitly within the tokens object.
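The shape of the name entries above, the whole match followed by each capture group, can be reproduced with the re module. This sketch assumes that shape from the example environment; terminal_value is a hypothetical helper rather than serl's lexer.

```python
import re

# Same pattern as the `name` token in the configuration above.
NAME = re.compile(r"(\w+):(\w+)")


def terminal_value(pattern, text):
    """Whole match followed by each capture group, as a tuple."""
    match = pattern.fullmatch(text)
    if match is None:
        raise ValueError(f"{text!r} does not match {pattern.pattern!r}")
    return (match.group(0), *match.groups())


# Two uses of the symbol in one production give a list, in order of appearance.
name = [terminal_value(NAME, "a:b"), terminal_value(NAME, "c:d")]
```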
Main functionality
If the first property doesn’t correspond to a defined grammar non-terminal then it acts as the main functionality and is executed in a global context. This allows code to be executed before and after the main AST traversal.
Note
If the main functionality is defined as a list then each element of the list will be executed separately in order.
If no main functionality is defined then traversal, and thus execution, is initiated with the code of the grammar start symbol. Otherwise, it is the responsibility of the main function to start traversal, which is done by calling the non-terminal variable corresponding to the grammar start symbol.
Any values returned from a main functionality code-block or the code-block corresponding to the grammar start symbol (if no main functionality defined) will be sent to stdout.
Python Code
Without the Shell Commands modifier ($), blocks are by default interpreted as normal Python code.
When non-terminal variables are called in Python, they can take any number of keyword arguments which will be passed down to the local environment of the called code block.
Note
Variables in Python can only be accessed by a limited character set. However, grammar variables that use characters outside this set can still be accessed through the locals or vars functions, which allow access to variables with arbitrary names.
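The difference between ordinary names and names such as </ can be tried out with a hand-built environment. In this sketch exec is only used to mimic how a code block might see its variables; the environment and the block body are assumptions based on the tag example earlier on this page.

```python
# Hand-built stand-in for the environment of code.tag shown earlier.
env = {
    "<": ("<",),
    "</": ("</",),
    "name": [("a:b", "a", "b"), ("c:d", "c", "d")],
}

# Body of a hypothetical code block: `name` is a valid identifier, `</` is not.
block = """
opening = name[0][0]
closing = locals()['</'][0]
"""

# exec only mimics block execution here; results are stored back into env.
exec(block, {}, env)
```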
The value of the final Python statement of a code block will be used as the return value.
If you don’t want to return anything you can explicitly make the final statement None or pass.
Note
- Only the value of the final statement is used, and so if this is an assignment (e.g., a = 5) then the variable a would never be created, but 5 would be returned.
- If the final statement doesn’t have a value (e.g., a function definition) then None will be returned.
- The return keyword can only be used within functions or the final statement, but is not necessary for the latter.
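One way to picture the "value of the final statement" rule is with the ast module: execute everything except the last statement, then evaluate the last statement's value. This is a minimal sketch under that assumption and not serl's implementation; run_block is a hypothetical name and a final return statement is not handled.

```python
import ast


def run_block(source, env=None):
    """Run a block and return the value of its final statement, if it has one."""
    env = {} if env is None else env
    tree = ast.parse(source)
    if not tree.body:
        return None
    *init, last = tree.body
    exec(compile(ast.Module(body=init, type_ignores=[]), "<block>", "exec"), env)
    if isinstance(last, (ast.Expr, ast.Assign)):
        # For an assignment only the right-hand side is evaluated,
        # so the target variable is never created.
        expression = ast.Expression(body=last.value)
        return eval(compile(expression, "<block>", "eval"), env)
    # Statements without a value (def, pass, ...) are executed and give None.
    exec(compile(ast.Module(body=[last], type_ignores=[]), "<block>", "exec"), env)
    return None


result = run_block("x = 2\nx * 3")
```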
- Example:
grammar:
  tag: ...
code:
  main: | # python
    # import modules, create classes/functions etc.
    val = tag() # Main execution on grammar start symbol called 'tag'
    # Do something with val
    val # return val to stdout
  tag: # Code for tag
Note
The YAML Embedded Languages extension, currently available for VS Code, provides syntax highlighting within YAML block scalars. Specify the language name in a comment next to the block to highlight, as shown above.
Shell Commands
Shell commands can be used by making the first character of the code-block $.
Global and grammar variables can be accessed using the Python format language.
Accessing non-terminal variables will be equivalent to calling them, although keyword arguments cannot be passed with the format language.
Note
- Use of { or } for anything other than format strings requires escaping with {{ or }} e.g., $ echo ${{HOME}}.
- Grammar variables whose names are incompatible with the format language can be accessed through the special key locals() e.g., {locals()[{]} for a variable named {.
The output (stdout) of a command will be used as the return value for the code block.
If the command fails it will raise a CalledProcessError, which if caught allows access to stderr and the returncode.
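The combination of the format language and CalledProcessError can be sketched with str.format and subprocess.run from the standard library. The helper run_shell and the environment are assumptions for illustration, a POSIX shell is assumed, and the special locals() key is not modelled.

```python
import subprocess


def run_shell(block, env):
    """Format a `$`-prefixed block with `env` and return the command's stdout."""
    command = block.lstrip()[1:].format(**env)  # drop the leading `$`
    result = subprocess.run(
        command, shell=True, check=True, capture_output=True, text=True
    )
    return result.stdout


# Hand-built stand-in for the variables available to the block.
env = {"args": {"<src>": "in.txt"}}
output = run_shell("$ echo {args[<src>]}", env)

# A failing command raises CalledProcessError, exposing returncode and stderr.
try:
    run_shell("$ exit 3", env)
except subprocess.CalledProcessError as error:
    failure = (error.returncode, error.stderr)
```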
- Example:
code:
  non-terminal: $ echo {args[<src>]}
tokentypes
- Type: object
- Required: False
- Property Type: string
Tokens and corresponding type used in the syntax highlighter lexer.
This is represented as a mapping between token identifiers from the tokens object and a dot separated list in title case (e.g., Token.Text.Whitespace) to represent token type.
Arbitrary regex can also be assigned a token type.
Important
To take advantage of built-in Pygments styles it is recommended to use standard token names, see Pygments built-in tokens.
- Example:
tokentypes:
  +: Operator
  '-': Operator
  '*': Operator
  /: Operator
  num: Number
styles
- Type: object
- Required: False
- Property Type: string
The style to be applied to a certain token type. This is represented as a mapping between a token type and a style specified with Pygments style rules.
- Example:
styles:
  Number: "#42f2f5"
  Keyword.Constant: "bold #ff0000"
  Punctuation: "#f57242"
  String: "#75b54a"
  Whitespace: "bg:#e8dfdf"
Note
The quotes around the styles in the above example are necessary, as otherwise the hex colours using # would be treated as YAML comments.
See Using YAML for tips.
See Static Syntax Highlighting for more details.
environment
- Type: string
- Required: False
The name of a virtual environment to be created to contain any python dependencies specified in requirements.
This is only required if you plan to use dependencies that may clash with those used by the tool or other serl languages used in the same environment. Not setting this property means that language dependencies are installed to the environment where the instance of the tool being used is installed.
To list the dependencies used by the tool and then get a specific version thereof you can use:
$ pip show serl
$ pip show <dependency>
Note
When running a language that specifies an environment that doesn’t already exist, a new environment will be created and the specified requirements will be installed. This may take a bit of time to complete but will only be run once unless the environment is removed.
Environments are created using the venv module from the Python standard library and are located in the directory ~/.serl/environments.
Environments can be manually created, however they must be created in the aforementioned directory and with the same venv module. Creating environments manually would still require setting the value of this property to the name of the environment directory. If two languages specify an environment with the same name, the environment will be shared.
- Example:
environment: venv-lang
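Manually creating an environment in the expected location could look roughly like the sketch below, which uses the standard library venv module and the directory named above. The helper names are hypothetical, and the steps serl itself performs (such as installing requirements) are not reproduced.

```python
import venv
from pathlib import Path

# Directory stated above as the location of serl environments.
ENV_ROOT = Path.home() / ".serl" / "environments"


def environment_path(name, root=ENV_ROOT):
    """Directory of the environment called `name`."""
    return Path(root) / name


def ensure_environment(name, root=ENV_ROOT, with_pip=True):
    """Create the environment on first use and reuse it afterwards."""
    path = environment_path(name, root)
    if not path.exists():
        venv.create(path, with_pip=with_pip)
    return path
```

Calling ensure_environment("venv-lang") would create ~/.serl/environments/venv-lang if it does not exist yet, matching the name used in the example.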
requirements
- Type: string
- Required: False
The required dependencies for the language, which if specified as a pip requirements file, can be automatically downloaded with the command line run option -r or --requirements.
- Example:
requirements: | # pip
  PyYAML==6.0
  docopt==0.6.2
  ply==3.11
  regex==2022.10.31
  networkx==2.8.8
  jsonschema==4.17.3
  Pygments==2.13.0
  Pillow==9.4.0
  requests==2.28.2
  # Dev
  pytest==7.2.2
  pytest-cov==4.0.0
meta
- Type: object
- Required: False
The meta object provides the ability to alter certain aspects of the configuration or language behaviour.
meta.tokens
- Type: object
- Required: False
Properties relating to the tokens object.
meta.tokens.ref
- Type: string, null
- Required: False
- Default: ^token(?= )|(?<= )token(?= )|(?<= )token$
A regex used to determine how tokens can be referenced in other tokens and consequently expanded (substituted). If the value of this property is set to null or equivalently defined but not given a value, token expansion will not take place.
The special identifier token is used as a substitute for user-defined token names.
If this special identifier isn’t used the defined regex is assumed to be a prefix to the token name.
- Example:
meta:
  tokens:
    ref: \$token
In this example the regex for a token named text defined in the tokens object could be substituted into any other token by specifying $text.
As previously mentioned if the identifier token is not used, the value of meta.tokens.ref is taken to be a prefix and so this example can be equivalently specified as:
meta:
  tokens:
    ref: \$
Note
The $ symbol has been escaped because this string is treated as a regex and this has the special meaning of signifying the end of a string.
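Token expansion with a reference regex can be sketched as a substitution over the token patterns. This is an illustrative single-pass version written under the rules described above (the token placeholder, the prefix fallback, and null disabling expansion); expand_tokens is a hypothetical function, and details such as nested references or names that are prefixes of one another are not handled.

```python
import re


def expand_tokens(tokens, ref=r"\$token"):
    """Substitute references to other tokens into each pattern (single pass)."""
    if ref is None:
        return dict(tokens)  # null disables token expansion
    if "token" not in ref:
        ref += "token"  # without the placeholder, ref is treated as a prefix
    expanded = {}
    for name, pattern in tokens.items():
        for other, other_pattern in tokens.items():
            if other == name:
                continue
            reference = ref.replace("token", re.escape(other))
            # A function replacement avoids backslash processing of the pattern.
            pattern = re.sub(reference, lambda _m: other_pattern, pattern)
        expanded[name] = pattern
    return expanded


# Hypothetical tokens object where `digit` exists only to be substituted.
tokens = {"digit": r"\d", "int": r"$digit+", "float": r"$digit+\.$digit+"}
expanded = expand_tokens(tokens)
```

Both ref values from the examples above, \$token and \$, produce the same result with this sketch.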
meta.tokens.regex
- Type: boolean
- Required: False
- Default: False
Setting this property to True allows for the use of the more feature rich 3rd party regex module for patterns in the tokens object.
Important
When used this will change the interface for language captures. Specifically, they will now be returned as a list rather than a single value. This is due to the fact that the regex package offers the ability to retain all captures within a group even when modified by a regex quantifier.
Note
The regex module may only be used with CPython implementations.
Run the following two commands in Python’s interactive shell to see what implementation you’re using:
$ python
>>> import platform
>>> platform.python_implementation()
- Example:
meta:
  tokens:
    regex: True
meta.tokens.ignore
- Type: string
- Required: False
- Default: \s
A regex specifying characters to be ignored by the lexer. This will have the lowest precedence in the lexer.
Note
The regex flags used for this property will be the same as those used in the tokens object. Therefore, changes to the meta.tokens.flags will also be reflected here.
- Example:
meta:
  tokens:
    ignore: \s | \#.*
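The effect of this ignore pattern under the default VERBOSE flag can be checked directly with the re module. The sketch removes ignored text globally, which is a simplification: a real lexer would skip it between tokens. The function strip_ignored is hypothetical.

```python
import re

# Under VERBOSE the spaces in the pattern are insignificant and \# is a literal #.
IGNORE = re.compile(r"\s | \#.*", re.VERBOSE)


def strip_ignored(source):
    """Remove whitespace and #-comments from a line of source text."""
    return IGNORE.sub("", source)


cleaned = strip_ignored("1 + 2  # add two numbers")
```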
meta.tokens.flags
- Type: string
- Required: False
- Default: VERBOSE
A whitespace separated list of regex flags for the lexer to use corresponding to the regex patterns defined in the tokens object. Valid flags include any defined in the re module or if meta.tokens.regex is enabled, any flag in the regex module.
- Example:
meta:
  tokens:
    flags: VERBOSE MULTILINE I
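A whitespace separated list of flag names maps naturally onto a bitwise OR of attributes of the re module. The sketch below shows one way this could be done; parse_flags is a hypothetical helper, and the module parameter is an assumption that would allow the third-party regex module to be passed instead.

```python
import re
from functools import reduce
from operator import or_


def parse_flags(names, module=re):
    """Combine a whitespace separated list of flag names into one flag value."""
    return reduce(or_, (getattr(module, name) for name in names.split()), 0)


flags = parse_flags("VERBOSE MULTILINE I")
# VERBOSE ignores the space in the pattern; I makes the match case-insensitive.
compiled = re.compile("a b", flags)
```

An unknown flag name raises AttributeError, which gives a reasonably direct error for a misspelt flag.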
meta.tokens.default
- Type: boolean
- Required: False
- Default: True
If set to True, the lexer won’t produce invalid character errors.
Instead, characters that would normally be invalid are matched as a default token.
This means they can be matched by the error token.
- Example:
meta:
  tokens:
    default: False
meta.grammar
- Type: object
- Required: False
Properties relating to the grammar object.
meta.grammar.permissive
- Type: boolean
- Required: False
- Default: True
If this property is set to False, then language execution will not take place in the event of a syntax error, even if any input was recovered during parsing.
- Example:
meta:
  grammar:
    permissive: False