Introduction

Disclaimer: this specification will change a lot as the language evolves. The project is still very new and nothing is set in stone yet.

The koj language was born as a side project of mine after I took a class on compilers during my second year at TELECOM Nancy.

Syntax

The goal for the syntax of this language is for it to be as consistent as possible. Each syntactic construct should have one function and one function only.

The first iteration of this language is going to be implemented as an interpreter written in OCaml, using sedlex as a lexer and menhir as a parser. The language may later become a compiled one, probably implemented in itself with a hand-written lexer and parser, and using LLVM as a backend.

Lexical Structure

This chapter describes the lexical structure of the language.

Input format

The input format of the koj language is a sequence of Unicode code points encoded in UTF-8. The input is read from a file or from the standard input.

Whitespace

The koj language uses the following whitespace characters:

' ' (U+0020 SPACE)
'\t' (U+0009 CHARACTER TABULATION)
'\n' (U+000A LINE FEED)
'\r' (U+000D CARRIAGE RETURN)
U+000B LINE TABULATION
U+000C FORM FEED
U+0085 NEXT LINE
U+200E LEFT-TO-RIGHT MARK
U+200F RIGHT-TO-LEFT MARK
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR

These characters are used to separate tokens. They carry no other meaning and are otherwise ignored by the lexer.

Outside of comments and of character and string literals, the meaning of a program is preserved if any whitespace character is replaced by another one.
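As an illustration, the whitespace set above can be sketched as a lookup table in Python. The names `KOJ_WHITESPACE`, `is_koj_whitespace` and `skip_whitespace` are hypothetical helpers introduced for this example; they are not part of the koj implementation.

```python
# Code points listed in the Whitespace section of this specification.
KOJ_WHITESPACE = frozenset({
    0x0020,  # SPACE
    0x0009,  # CHARACTER TABULATION
    0x000A,  # LINE FEED
    0x000D,  # CARRIAGE RETURN
    0x000B,  # LINE TABULATION
    0x000C,  # FORM FEED
    0x0085,  # NEXT LINE
    0x200E,  # LEFT-TO-RIGHT MARK
    0x200F,  # RIGHT-TO-LEFT MARK
    0x2028,  # LINE SEPARATOR
    0x2029,  # PARAGRAPH SEPARATOR
})


def is_koj_whitespace(ch: str) -> bool:
    """Return True if the single character `ch` is in the koj whitespace set."""
    return ord(ch) in KOJ_WHITESPACE


def skip_whitespace(source: str, pos: int) -> int:
    """Return the index of the first non-whitespace character at or after `pos`."""
    while pos < len(source) and is_koj_whitespace(source[pos]):
        pos += 1
    return pos
```

An explicit set is used here rather than Python's `str.isspace()`, because the two do not agree: for example, `str.isspace()` accepts U+00A0 NO-BREAK SPACE, which is not in the list above.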

Comments

Comments are used to document the code and are ignored by the compiler.

There are two types of comments:

Single-line comments start with // and end at the end of the line.
Block comments start with /* and end with */. They can span multiple lines.

COMMENT := LINE_COMMENT | BLOCK_COMMENT

LINE_COMMENT := //[^\n]*

BLOCK_COMMENT := /\*.*?\*/
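The two comment forms can be sketched with Python's `re` module. This is a simplified illustration, not the actual lexer: the helper name `strip_comments` is hypothetical, and the sketch assumes that `.` in the block comment rule may also match line breaks. It does not look at string literals, so a `//` inside a string would wrongly be treated as a comment.

```python
import re

# LINE_COMMENT: "//" up to (but not including) the end of the line.
# BLOCK_COMMENT: "/*" up to the first "*/"; DOTALL lets "." cross line breaks.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def strip_comments(source: str) -> str:
    """Replace every comment with one space so neighbouring tokens stay separated."""
    return COMMENT_RE.sub(" ", source)
```

Replacing a comment with a space, rather than deleting it, keeps `a/* x */b` from being glued into the single word `ab`.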

Keywords

The following keywords are reserved by the koj language:

KW_AS: as
KW_BREAK: break
KW_CATCH: catch
KW_CONST: const
KW_CONTINUE: continue
KW_DO: do
KW_EACH: each
KW_ELSE: else
KW_ENUM: enum
KW_FALSE: false
KW_FOR: for
KW_FUNC: func
KW_IF: if
KW_IS: is
KW_LET: let
KW_MATCH: match
KW_MUT: mut
KW_RETURN: return
KW_STRUCT: struct
KW_THROW: throw
KW_TRUE: true
KW_TRY: try
KW_TYPE: type
KW_UNION: union
KW_WHILE: while

This list may change as the language evolves.

These keywords are reserved and cannot be used as identifiers.

Identifiers

Identifiers are described by the regular expression [a-zA-Z_][a-zA-Z0-9_]*. They are used to name variables and functions. They may not be one of the keywords listed previously.

Type identifiers are similar, but they must start with an uppercase letter. The corresponding regular expression is [A-Z][a-zA-Z0-9_]*.
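The rules for keywords, identifiers and type identifiers can be combined into a small classifier. The sketch below is only an illustration: `classify_word` is a hypothetical helper, and it assumes that a word starting with an uppercase letter is always read as a type identifier, which this specification does not state explicitly.

```python
import re

# Reserved words from the Keywords section.
KEYWORDS = frozenset({
    "as", "break", "catch", "const", "continue", "do", "each", "else",
    "enum", "false", "for", "func", "if", "is", "let", "match", "mut",
    "return", "struct", "throw", "true", "try", "type", "union", "while",
})

IDENTIFIER_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")
TYPE_IDENTIFIER_RE = re.compile(r"[A-Z][a-zA-Z0-9_]*")


def classify_word(word: str) -> str:
    """Classify a word as keyword, type identifier, identifier or invalid."""
    if word in KEYWORDS:
        return "keyword"
    # Assumption: uppercase-initial words are type identifiers.
    if TYPE_IDENTIFIER_RE.fullmatch(word):
        return "type identifier"
    if IDENTIFIER_RE.fullmatch(word):
        return "identifier"
    return "invalid"
```

For instance, `classify_word("while")` gives `"keyword"`, `classify_word("Pizza")` gives `"type identifier"` and `classify_word("toppings")` gives `"identifier"`.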

Literals

Literals are used to represent values in the source code. They are used to initialize variables, to pass arguments to functions, etc.

Integer literals

Integer literals represent an integer value. They can be written in decimal, hexadecimal, octal or binary notation.

INTEGER_LITERAL :
     DECIMAL_LITERAL
     | HEXADECIMAL_LITERAL
     | OCTAL_LITERAL
     | BINARY_LITERAL

DECIMAL_LITERAL : DEC_DIGIT⁺

HEXADECIMAL_LITERAL : ( 0x | 0X )HEX_DIGIT⁺

OCTAL_LITERAL : ( 0o | 0O )OCT_DIGIT⁺

BINARY_LITERAL : ( 0b | 0B )BIN_DIGIT⁺

BIN_DIGIT : [ 0 - 1 ]

OCT_DIGIT : [ 0 - 7 ]

HEX_DIGIT : [ 0 - 9 a - f A - F ]

DEC_DIGIT : [ 0 - 9 ]
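The integer literal grammar can be sketched as a single regular expression with one named group per notation. This is a minimal illustration: `parse_integer_literal` is a hypothetical helper, and the sketch assumes that every literal needs at least one digit and that hexadecimal literals use HEX_DIGIT.

```python
import re

# One alternative per notation; the prefixed forms are tried first.
INTEGER_RE = re.compile(
    r"0[xX](?P<hex>[0-9a-fA-F]+)"
    r"|0[oO](?P<oct>[0-7]+)"
    r"|0[bB](?P<bin>[01]+)"
    r"|(?P<dec>[0-9]+)"
)
BASES = {"hex": 16, "oct": 8, "bin": 2, "dec": 10}


def parse_integer_literal(text: str) -> int:
    """Return the value of an integer literal, or raise ValueError."""
    match = INTEGER_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not an integer literal: {text!r}")
    kind = match.lastgroup  # name of the group that matched
    return int(match.group(kind), BASES[kind])
```

With this sketch, `parse_integer_literal("0x1F")` returns 31 and `parse_integer_literal("0b2")` raises `ValueError`.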

Floating-point literals

Floating-point literals represent a floating-point value. They are written in decimal notation, optionally followed by an exponent.

FLOATING_LITERAL :
     DECIMAL_LITERAL . DECIMAL_LITERAL^?
     | . DECIMAL_LITERAL
     | DECIMAL_LITERAL ( . DECIMAL_LITERAL )^? EXPONENT

EXPONENT := ( e|E ) ( +|- )^? DECIMAL_LITERAL
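The three alternatives of FLOATING_LITERAL translate directly into a regular expression. The sketch below is an illustration only; `parse_float_literal` is a hypothetical helper, and it relies on Python's `float` to convert the accepted text.

```python
import re

FLOAT_RE = re.compile(
    r"[0-9]+\.[0-9]*"                       # DECIMAL_LITERAL . DECIMAL_LITERAL?
    r"|\.[0-9]+"                            # . DECIMAL_LITERAL
    r"|[0-9]+(?:\.[0-9]+)?[eE][+-]?[0-9]+"  # DECIMAL_LITERAL (. DECIMAL_LITERAL)? EXPONENT
)


def parse_float_literal(text: str) -> float:
    """Return the value of a floating-point literal, or raise ValueError."""
    if FLOAT_RE.fullmatch(text) is None:
        raise ValueError(f"not a floating-point literal: {text!r}")
    return float(text)
```

Under this reading of the grammar, `1.`, `.5` and `1.5e3` are accepted, while `1` (an integer literal) and `.5e3` (an exponent after a leading dot) are not.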

String and character literals

Character literals represent a single character. String literals represent a sequence of characters.

Character literals

CHAR_LITERAL : ' CHAR_FRAGMENT '

CHAR_FRAGMENT :
     ~[ ' \ \n \r \t ]
     | QUOTE_ESCAPE
     | ASCII_ESCAPE
     | UNICODE_ESCAPE

QUOTE_ESCAPE : \' | \"

ASCII_ESCAPE : \x OCT_DIGIT HEX_DIGIT | \n | \r | \t | \\ | \0

UNICODE_ESCAPE : \u{ HEX_DIGIT^[1,6] }

String literals

STRING_LITERAL : " STRING_FRAGMENT^* "

STRING_FRAGMENT :
     ~[ " \ \n \r \t ]
     | QUOTE_ESCAPE
     | ASCII_ESCAPE
     | UNICODE_ESCAPE
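To show how the escape rules fit together, here is a sketch that decodes the escapes found between the quotes of a character or string literal. `unescape` is a hypothetical helper, not part of the koj implementation. It leaves unknown escapes such as `\q` untouched, whereas a real lexer would presumably reject them.

```python
import re

# QUOTE_ESCAPE and the single-character ASCII_ESCAPE forms.
SIMPLE_ESCAPES = {
    "'": "'", '"': '"', "n": "\n", "r": "\r", "t": "\t", "\\": "\\", "0": "\0",
}

ESCAPE_RE = re.compile(
    r"\\(?:"
    r"x(?P<ascii>[0-7][0-9a-fA-F])"         # \x OCT_DIGIT HEX_DIGIT (at most 0x7F)
    r"|u\{(?P<unicode>[0-9a-fA-F]{1,6})\}"  # \u{ HEX_DIGIT (1 to 6 times) }
    r"|(?P<simple>['\"nrt\\0])"
    r")"
)


def unescape(body: str) -> str:
    """Decode the escape sequences in the body of a literal."""
    def replace(match: re.Match) -> str:
        if match.group("ascii") is not None:
            return chr(int(match.group("ascii"), 16))
        if match.group("unicode") is not None:
            # chr raises ValueError for values above U+10FFFF.
            return chr(int(match.group("unicode"), 16))
        return SIMPLE_ESCAPES[match.group("simple")]

    return ESCAPE_RE.sub(replace, body)
```

Because the first digit after `\x` is an octal digit, this form can only produce code points up to U+007F, which is why it is called an ASCII escape.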

Punctuation

Token Name	Symbol	Description
LPAREN	`(`	Left parenthesis
RPAREN	`)`	Right parenthesis
LBRACE	`{`	Left brace
RBRACE	`}`	Right brace
LBRACKET	`[`	Left bracket
RBRACKET	`]`	Right bracket
COMMA	`,`	Separator
DOT	`.`	Property access
COLON	`:`	Type annotation
SEMICOLON	`;`	Separator
UNDERSCORE	`_`	Wildcard
BANG	`!`	Throwing type
QUESTION	`?`	Null typing
DOLLAR	`$`	Dollar sign
PLUS	`+`	Plus
MINUS	`-`	Minus
STAR	`*`	Multiplication
SLASH	`/`	Division
PERCENT	`%`	Modulo
CARET	`^`	Bitwise XOR
AMP	`&`	Bitwise AND
PIPE	`|`	Bitwise OR
EQUAL	`=`	Assignment
LT	`<`	Less than
GT	`>`	Greater than
NEQ	`!=`	Not equal
LEQ	`<=`	Less than or equal
GEQ	`>=`	Greater than or equal
EQEQ	`==`	Equal equal
AND	`&&`	Logical and
OR	`||`	Logical or
ARROW	`->`	Function type
SHR	`>>`	Shift right
SHL	`<<`	Shift left
PLUSPLUS	`++`	Incrementation
MINUSMINUS	`--`	Decrementation
PLUSEQ	`+=`	Add assign
MINUSEQ	`-=`	Minus assign
STAREQ	`*=`	Multiplication assign
SLASHEQ	`/=`	Division assign
PERCENTEQ	`%=`	Modulus assign
AMPEQ	`&=`	Bitwise and assign
PIPEEQ	`|=`	Bitwise or assign
CARETEQ	`^=`	Bitwise xor assign
SHREQ	`>>=`	Shift right assign
SHLEQ	`<<=`	Shift left assign
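Several punctuation tokens are prefixes of longer ones (for example `>`, `>>` and `>>=`). This specification does not say how the lexer chooses between them. The sketch below assumes the common "longest match" rule; both that rule and the helper name `next_punctuation` are assumptions made for this example.

```python
# Symbol -> token name, following the punctuation table.
PUNCTUATION = {
    "(": "LPAREN", ")": "RPAREN", "{": "LBRACE", "}": "RBRACE",
    "[": "LBRACKET", "]": "RBRACKET", ",": "COMMA", ".": "DOT",
    ":": "COLON", ";": "SEMICOLON", "_": "UNDERSCORE", "!": "BANG",
    "?": "QUESTION", "$": "DOLLAR", "+": "PLUS", "-": "MINUS",
    "*": "STAR", "/": "SLASH", "%": "PERCENT", "^": "CARET",
    "&": "AMP", "|": "PIPE", "=": "EQUAL", "<": "LT", ">": "GT",
    "!=": "NEQ", "<=": "LEQ", ">=": "GEQ", "==": "EQEQ",
    "&&": "AND", "||": "OR", "->": "ARROW", ">>": "SHR", "<<": "SHL",
    "++": "PLUSPLUS", "--": "MINUSMINUS", "+=": "PLUSEQ", "-=": "MINUSEQ",
    "*=": "STAREQ", "/=": "SLASHEQ", "%=": "PERCENTEQ", "&=": "AMPEQ",
    "|=": "PIPEEQ", "^=": "CARETEQ", ">>=": "SHREQ", "<<=": "SHLEQ",
}


def next_punctuation(source: str, pos: int):
    """Return (token name, symbol) for the longest symbol at `pos`, or None."""
    for length in (3, 2, 1):  # try the longest candidates first
        symbol = source[pos:pos + length]
        if symbol in PUNCTUATION:
            return PUNCTUATION[symbol], symbol
    return None
```

Under this assumption, the text `>>=` is read as one SHREQ token rather than as SHR followed by EQUAL.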

Tokens

A token is one of the following elements, described in the previous sections: a keyword, an identifier, a literal or a punctuation symbol.

Statements and Expressions

Koj is primarily an expression language, meaning most of the code you write produces a value that can be assigned to a variable.

Statements, on the other hand, serve mostly to define types, declare variables or chain expressions together in blocks.

Statements

^Syntax
Statement :
     ;
     | TypeDef
     | Declaration
     | ExpressionStatement

Expression Statements

^Syntax
ExpressionStatement :
Expression ;

Expression statements are used to evaluate an expression and discard the result. They are used to trigger the side effects of evaluating the expression.

All statements evaluate to (), the only value of the Unit type.

Type Definitions

^Syntax
TypeDef :
type TYPE_IDENTIFIER = Type ;

Type definitions can be used both to alias existing types and to create new ones.

Example:

You can define a Pizza type representing a recipe for a pizza in the following way:

type Pizza = struct {
  .crust: enum { `Thick, `Thin, `Cheesy },
  .base: enum { `Tomato, `Cream },
  .toppings: [String],
};

Declarations

^Syntax
Declaration :
VarDecl | FuncDecl

Variable Declarations

^Syntax
VarDecl :
let Mutability IDENTIFIER ( : Type )^? := Expression

Mutability :
mut | const

Variables can be declared as either mut or const. A const variable cannot be assigned a new value after it is declared.

Function Declarations

^Syntax
FuncDecl :
     let func IDENTIFIER :=
         ( FuncDeclArgs^? ) => Expression

FuncDeclArgs :
FuncDeclArg ( , FuncDeclArg )^*

FuncDeclArg :
     IDENTIFIER : TYPE_IDENTIFIER

Expressions

^Syntax
Expression :
     LiteralExpression
     | ParensExpression
     | BlockExpression
     | OperatorExpression
     | FunctionExpression
     | StructExpression
     | ArrayExpression
     | TupleExpression
     | EnumExpression
     | UnitExpression
     | IndexExpression
     | FieldExpression
     | CallExpression
     | ConditionalExpression
     | LoopExpression
     | MatchExpression
     | PlaceholderExpression

Expressions are used to produce a value and trigger side effects.

Literal Expressions

^Syntax
LiteralExpression :
     INTEGER_LITERAL
     | FLOATING_LITERAL
     | CHAR_LITERAL
     | STRING_LITERAL
     | true | false

Literal expressions are composed of a single token and evaluate to the value of that token.

Parenthesized Expressions

^Syntax
ParensExpression :
( Expression )

Parenthesized expressions wrap a single expression and evaluate to the value of said expression. They are used to control the order of evaluation of subexpressions within an expression.

Block Expressions

^Syntax
BlockExpression :
     { BlockComponents^? }

BlockComponents :
     Expression
     | Statement⁺
     | Statement⁺ Expression^?

Block expressions chain several statements together. The value of a block expression is that of its last component: the trailing expression if there is one, and () otherwise, since statements evaluate to the Unit type.

Operator Expressions

^Syntax
OperatorExpression :
     ArithmeticOpExpression
     | BitwiseOpExpression
     | LogicalOpExpression
     | ComparisonExpression
     | AssignementExpression
     | StringOpExpression

Arithmetic Operations

^Syntax
ArithmeticOpExpression :
     - Expression
     | Expression + Expression
     | Expression - Expression
     | Expression * Expression
     | Expression / Expression
     | Expression // Expression
     | Expression % Expression
     | Expression ** Expression

These operators are, in order, the negation, addition, subtraction, multiplication, division, integer division, modulo and exponentiation operators.

Bitwise Operations

^Syntax
BitwiseOpExpression :
     ~ Expression
     | Expression & Expression
     | Expression | Expression
     | Expression ^ Expression

These operators represent, in order, the bitwise NOT, AND, OR and XOR operators.

Logical Operations

^Syntax
LogicalOpExpression :
     ! Expression
     | Expression && Expression
     | Expression || Expression

These operators represent, respectively, the logical NOT, AND and OR operators.

Comparisons

^Syntax
ComparisonExpression :
     Expression > Expression
     | Expression >= Expression
     | Expression < Expression
     | Expression <= Expression
     | Expression == Expression
     | Expression != Expression

These operators represent, in order, the greater than, greater than or equal, less than, less than or equal, equal and not equal operators.

Assignments

^Syntax
AssignementExpression :
Expression := Expression

The expression on the left side of the operator must be assignable (an lvalue). It must also not have been declared as const.

String Operations

^Syntax
StringOpExpression :
Expression @ Expression

This operator represents string concatenation.

Type System

^Syntax
Type :
     TYPE_IDENTIFIER
     | PrimitiveType
     | ArrayType
     | TupleType
     | FunctionType
     | Struct
     | Enum
     | InferredType

Primitive Types

Koj offers these different primitive types:

Integer type

This type represents an integer value. It is represented by the identifier Int.

Floating point type

This type represents a floating point value. It is represented by the identifier Float.

Character type

This type represents a single character. It is represented by the identifier Char.

String type

This type represents a text string. It is represented by the identifier String.

For more information on the values of these types, see the Literals section.

Unit type

This type represents the return type of a function returning no actual value. It is represented by the identifier Unit, and its only possible value is ().

Never type

This type has no value. It is represented by the identifier Never. It is used to type computations that never produce a value, as opposed to Unit, which types computations that finish without a meaningful result.

Array Types

^Syntax
ArrayType :
[ Type ]

Array types represent a sequence of elements of the same type. They are dynamically sized.

Tuple Types

^Syntax
TupleType :
( Type ( , Type )^* )

Tuple types represent fixed-length sequences of values of possibly different types. The order of the fields matters, so the type (String, Int) is different from the type (Int, String).

Function Types

^Syntax
FunctionType :
( FunctionTypeArguments^? ) -> Type

FunctionTypeArguments :
Type ( , Type )^*

Example:

A function adding two integers together may have the following type:

(Int, Int) -> Int
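The array, tuple and function type forms can be modelled as a small set of immutable records, which makes the "order matters" rule for tuples easy to check. This is only a sketch of one possible representation: the class names and the `show` helper are hypothetical, and the koj interpreter may represent types differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Named:
    """A type referred to by name, such as Int or String."""
    name: str


@dataclass(frozen=True)
class ArrayType:
    element: object


@dataclass(frozen=True)
class TupleType:
    elements: tuple


@dataclass(frozen=True)
class FunctionType:
    params: tuple
    result: object


def show(t) -> str:
    """Render a type using the syntax of this chapter."""
    if isinstance(t, Named):
        return t.name
    if isinstance(t, ArrayType):
        return f"[{show(t.element)}]"
    if isinstance(t, TupleType):
        return "(" + ", ".join(show(e) for e in t.elements) + ")"
    if isinstance(t, FunctionType):
        return "(" + ", ".join(show(p) for p in t.params) + ") -> " + show(t.result)
    raise TypeError(f"unknown type node: {t!r}")


INT, STRING = Named("Int"), Named("String")
ADD = FunctionType((INT, INT), INT)  # the type of a function adding two integers
```

Frozen dataclasses compare field by field, so `TupleType((STRING, INT))` and `TupleType((INT, STRING))` are different values, mirroring the rule that tuple field order is significant.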

Struct Types

^Syntax
Struct :
     struct { StructFields^? }

StructFields :
     StructFieldDec ( , StructFieldDec )^* ,^?

StructFieldDec:
    . IDENTIFIER : Type

Structs are used to represent objects with fields. They are analogous to structs in C or record types in languages such as OCaml.

Example:

You may create a struct representing a point in 2D space the following way:

struct {
  .x: Int,
  .y: Int,
}

Enum Types

^Syntax
Enum :
enum { EnumMember ( , EnumMember )^* ,^? }

EnumMember : ` TYPE_IDENTIFIER ( EnumMemberOfType )^?

EnumMemberOfType : ( Type )

Examples:

To represent a direction on a D-pad, you could use the following enum:

enum { `Up, `Down, `Left, `Right }

To represent a user on a website, you could use the following enum:

enum {
 `Anon,
 `LoggedIn(struct { .id: String, .role: String }),
}

Inferred Type

^Syntax
InferredType : _

This type is used to let the compiler infer the type of the item.

Koj language reference