Introduction

Disclaimer: this specification is going to change a lot as the language evolves. It is still a very new project and nothing is set in stone yet.

The koj language was born as a side project of mine after I took a class on compilers in my second year at TELECOM Nancy.

Syntax

The goal for the syntax of this language is for it to be as consistent as possible. Each syntactic construct should have one function and one function only.

Implementation

The first iteration of this language is going to be implemented as an interpreter written in OCaml, using sedlex to generate the lexer and Menhir to generate the parser. The language may later evolve into a compiled one, probably implemented in itself, with a hand-written lexer and parser and LLVM as a backend.

Lexical Structure

This chapter describes the lexical structure of the language.

Input format

The input format of the koj language is a sequence of Unicode code points encoded in UTF-8. The input is read from a file or from the standard input.

Whitespace

The koj language uses the following whitespace characters:

  • ' ' (U+0020 SPACE)
  • '\t' (U+0009 CHARACTER TABULATION)
  • '\n' (U+000A LINE FEED)
  • '\r' (U+000D CARRIAGE RETURN)
  • U+000B LINE TABULATION
  • U+000C FORM FEED
  • U+0085 NEXT LINE
  • U+200E LEFT-TO-RIGHT MARK
  • U+200F RIGHT-TO-LEFT MARK
  • U+2028 LINE SEPARATOR
  • U+2029 PARAGRAPH SEPARATOR

These whitespace characters are used to separate tokens in the language. They carry no other meaning and are ignored by the lexer.

Outside of comments and of character and string literals, the meaning of the program is preserved if any whitespace character is replaced by another one.
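To make the list above concrete, here is a minimal Python sketch of how a lexer could skip whitespace. The names KOJ_WHITESPACE and skip_whitespace are mine and purely illustrative; the actual lexer is planned to be generated with sedlex.

```python
# The eleven whitespace code points listed in this section.
KOJ_WHITESPACE = frozenset(
    "\u0020\u0009\u000A\u000D\u000B\u000C\u0085\u200E\u200F\u2028\u2029"
)


def skip_whitespace(source: str, pos: int) -> int:
    """Return the index of the first non-whitespace code point at or after pos."""
    while pos < len(source) and source[pos] in KOJ_WHITESPACE:
        pos += 1
    return pos
```

The set is spelled out explicitly rather than relying on str.isspace(), because Python's notion of whitespace is not the same as the list above.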

Comments

Comments are used to document the code and are ignored by the compiler.

There are two types of comments:

  • Single-line comments start with // and end at the end of the line.
  • Block comments start with /* and end with */. They can span multiple lines.

COMMENT := LINE_COMMENT | BLOCK_COMMENT

LINE_COMMENT := //[^\n]*

BLOCK_COMMENT := /\*.*?\*/
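The two comment rules translate almost directly into regular expressions. The following Python sketch assumes that the block comment rule is non-greedy and that "." is allowed to match line feeds; strip_comments is a hypothetical helper, and it deliberately ignores the case of comment markers inside string literals.

```python
import re

# LINE_COMMENT: "//" up to, but not including, the next line feed.
LINE_COMMENT = re.compile(r"//[^\n]*")
# BLOCK_COMMENT: "/*", then as little as possible, then "*/".
# re.DOTALL lets "." match line feeds so the comment can span several lines.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
COMMENT = re.compile(f"{LINE_COMMENT.pattern}|{BLOCK_COMMENT.pattern}", re.DOTALL)


def strip_comments(source: str) -> str:
    """Replace every comment with a single space (string literals are not handled)."""
    return COMMENT.sub(" ", source)
```

With a non-greedy rule like this one, block comments do not nest: the first */ closes the comment.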

Keywords

The following keywords are reserved by the koj language:

  • KW_AS: as
  • KW_BREAK: break
  • KW_CATCH: catch
  • KW_CONST: const
  • KW_CONTINUE: continue
  • KW_DO: do
  • KW_EACH: each
  • KW_ELSE: else
  • KW_ENUM: enum
  • KW_FALSE: false
  • KW_FOR: for
  • KW_FUNC: func
  • KW_IF: if
  • KW_IS: is
  • KW_LET: let
  • KW_MATCH: match
  • KW_MUT: mut
  • KW_RETURN: return
  • KW_STRUCT: struct
  • KW_THROW: throw
  • KW_TRUE: true
  • KW_TRY: try
  • KW_TYPE: type
  • KW_UNION: union
  • KW_WHILE: while

This list may change as the language evolves.

These keywords are reserved and cannot be used as identifiers.

Identifiers

Identifiers are represented by the regular expression [a-zA-Z_][a-zA-Z0-9_]*. They are used to name variables and functions. They may not be one of the keywords described previously.

Type identifiers are similar, but they must start with an uppercase letter. The corresponding regular expression is [A-Z][a-zA-Z0-9_]*.
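As an illustration, the Python sketch below classifies a word as a keyword, a type identifier or an identifier using the two regular expressions above. Since every type identifier also matches the identifier expression, the sketch assumes that the type identifier rule is tried first; that ordering, like the classify_word name, is my assumption.

```python
import re

KEYWORDS = frozenset({
    "as", "break", "catch", "const", "continue", "do", "each", "else", "enum",
    "false", "for", "func", "if", "is", "let", "match", "mut", "return",
    "struct", "throw", "true", "try", "type", "union", "while",
})
IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")
TYPE_IDENTIFIER = re.compile(r"[A-Z][a-zA-Z0-9_]*")


def classify_word(word: str) -> str:
    """Return the token name for a word, or raise ValueError if it is not one."""
    if word in KEYWORDS:
        return "KW_" + word.upper()
    # Assumption: type identifiers take priority over plain identifiers.
    if TYPE_IDENTIFIER.fullmatch(word):
        return "TYPE_IDENTIFIER"
    if IDENTIFIER.fullmatch(word):
        return "IDENTIFIER"
    raise ValueError(f"not an identifier: {word!r}")
```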

Literals

Literals are used to represent values in the source code. They are used to initialize variables, to pass arguments to functions, etc.

Integer literals

Integer literals represent an integer value. They can be written in decimal, hexadecimal, octal or binary notation.

INTEGER_LITERAL :
     DECIMAL_LITERAL
     | HEXADECIMAL_LITERAL
     | OCTAL_LITERAL
     | BINARY_LITERAL

DECIMAL_LITERAL : DEC_DIGIT+

HEXADECIMAL_LITERAL : ( 0x | 0X )HEX_DIGIT+

OCTAL_LITERAL : ( 0o | 0O )OCT_DIGIT+

BINARY_LITERAL : ( 0b | 0B )BIN_DIGIT+

BIN_DIGIT : [ 0 - 1 ]

OCT_DIGIT : [ 0 - 7 ]

HEX_DIGIT : [ 0 - 9 a - f A - F ]

DEC_DIGIT : [ 0 - 9 ]
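The following Python sketch shows one way to recognise and evaluate the four notations. It assumes that hexadecimal literals are made of HEX_DIGIT and that every literal contains at least one digit; parse_integer_literal is a hypothetical helper name.

```python
import re

INTEGER_LITERAL = re.compile(
    r"0[xX](?P<hex>[0-9a-fA-F]+)"
    r"|0[oO](?P<oct>[0-7]+)"
    r"|0[bB](?P<bin>[01]+)"
    r"|(?P<dec>[0-9]+)"
)
BASES = {"hex": 16, "oct": 8, "bin": 2, "dec": 10}


def parse_integer_literal(text: str) -> int:
    """Return the value of an integer literal, or raise ValueError."""
    match = INTEGER_LITERAL.fullmatch(text)
    if match is None:
        raise ValueError(f"not an integer literal: {text!r}")
    kind = match.lastgroup  # name of the alternative that matched
    return int(match.group(kind), BASES[kind])
```

Note that under this grammar a leading zero does not switch to octal: 042 is simply the decimal value 42.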

Floating-point literals

Floating-point literals represent a floating-point value. They are written in decimal notation, with a decimal point, an exponent, or both.

FLOATING_LITERAL :
     | DECIMAL_LITERAL . DECIMAL_LITERAL?
     | . DECIMAL_LITERAL
     | DECIMAL_LITERAL ( . DECIMAL_LITERAL )? EXPONENT

EXPONENT := ( e|E ) ( +|- )? DECIMAL_LITERAL
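The three alternatives of FLOATING_LITERAL can be checked with a regular expression, as in this Python sketch. It assumes DECIMAL_LITERAL is a non-empty run of decimal digits; parse_floating_literal is a hypothetical helper that delegates the conversion to Python's float.

```python
import re

FLOATING_LITERAL = re.compile(
    r"[0-9]+\.[0-9]*"                       # 1.  or  1.5
    r"|\.[0-9]+"                            # .5
    r"|[0-9]+(?:\.[0-9]+)?[eE][+-]?[0-9]+"  # 1e10  or  1.5e-3
)


def parse_floating_literal(text: str) -> float:
    """Return the value of a floating-point literal, or raise ValueError."""
    if FLOATING_LITERAL.fullmatch(text) is None:
        raise ValueError(f"not a floating-point literal: {text!r}")
    return float(text)
```

Under this grammar, a literal such as 1.e5 is rejected, because the exponent form requires digits after the decimal point when one is present.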

String and character literals

Character literals represent a single character. String literals represent a sequence of characters.

Character literals

CHAR_LITERAL : ' CHAR_FRAGMENT '

CHAR_FRAGMENT :
     ~[ ' \ \n \r \t ]
     | QUOTE_ESCAPE
     | ASCII_ESCAPE
     | UNICODE_ESCAPE

QUOTE_ESCAPE : \' | \"

ASCII_ESCAPE : \x OCT_DIGIT HEX_DIGIT | \n | \r | \t | \\ | \0

UNICODE_ESCAPE : \u{ HEX_DIGIT[1,6] }

String literals

STRING_LITERAL : " STRING_FRAGMENT* "

STRING_FRAGMENT :
     ~[ " \ \n \r \t ]
     | QUOTE_ESCAPE
     | ASCII_ESCAPE
     | UNICODE_ESCAPE
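To illustrate the escape rules, here is a Python sketch that decodes the escapes found in the body of a character or string literal. The decode_escapes name is hypothetical, and the sketch assumes that HEX_DIGIT[1,6] means one to six hexadecimal digits. It leaves unknown escapes untouched, whereas a real lexer would presumably report them as errors.

```python
import re

SIMPLE_ESCAPES = {
    "n": "\n", "r": "\r", "t": "\t", "\\": "\\", "0": "\0", "'": "'", '"': '"',
}

ESCAPE = re.compile(
    r"""\\(?:
        x(?P<ascii>[0-7][0-9a-fA-F])         # 7-bit ASCII escape
      | u\{(?P<unicode>[0-9a-fA-F]{1,6})\}   # Unicode escape
      | (?P<simple>[nrt\\0'"])               # single-character escapes
    )""",
    re.VERBOSE,
)


def decode_escapes(body: str) -> str:
    """Replace every recognised escape sequence in body with its character."""
    def replace(match):
        if match.group("ascii") is not None:
            return chr(int(match.group("ascii"), 16))
        if match.group("unicode") is not None:
            # chr raises ValueError above U+10FFFF.
            return chr(int(match.group("unicode"), 16))
        return SIMPLE_ESCAPES[match.group("simple")]

    return ESCAPE.sub(replace, body)
```

Because the first digit after \x is an octal digit, this form can only express the code points U+0000 to U+007F.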

Punctuation

Token Name    Symbol    Description
LPAREN        (         Left parenthesis
RPAREN        )         Right parenthesis
LBRACE        {         Left brace
RBRACE        }         Right brace
LBRACKET      [         Left bracket
RBRACKET      ]         Right bracket
COMMA         ,         Separator
DOT           .         Property access
COLON         :         Type annotation
SEMICOLON     ;         Separator
UNDERSCORE    _         Wildcard
BANG          !         Throwing type
QUESTION      ?         Null typing
DOLLAR        $         Dollar sign
PLUS          +         Plus
MINUS         -         Minus
STAR          *         Multiplication
SLASH         /         Division
PERCENT       %         Modulo
CARET         ^         Bitwise XOR
AMP           &         Bitwise AND
PIPE          |         Bitwise OR
EQUAL         =         Assignment
LT            <         Less than
GT            >         Greater than
NEQ           !=        Not equal
LEQ           <=        Less than or equal
GEQ           >=        Greater than or equal
EQEQ          ==        Equal
AND           &&        Logical and
OR            ||        Logical or
ARROW         ->        Function type
SHR           >>        Shift right
SHL           <<        Shift left
PLUSPLUS      ++        Incrementation
MINUSMINUS    --        Decrementation
PLUSEQ        +=        Add assign
MINUSEQ       -=        Minus assign
STAREQ        *=        Multiplication assign
SLASHEQ       /=        Division assign
PERCENTEQ     %=        Modulus assign
AMPEQ         &=        Bitwise and assign
PIPEEQ        |=        Bitwise or assign
CARETEQ       ^=        Bitwise xor assign
SHREQ         >>=       Shift right assign
SHLEQ         <<=       Shift left assign
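Several of these symbols are prefixes of others (for instance >, >> and >>=). This specification does not yet say how such cases are resolved; a common choice, which the Python sketch below assumes, is to always pick the longest matching symbol. The sketch also assumes that PIPE is | and OR is ||, and lex_punctuation is a hypothetical helper name.

```python
PUNCTUATION = {
    "(": "LPAREN", ")": "RPAREN", "{": "LBRACE", "}": "RBRACE",
    "[": "LBRACKET", "]": "RBRACKET", ",": "COMMA", ".": "DOT",
    ":": "COLON", ";": "SEMICOLON", "_": "UNDERSCORE", "!": "BANG",
    "?": "QUESTION", "$": "DOLLAR", "+": "PLUS", "-": "MINUS",
    "*": "STAR", "/": "SLASH", "%": "PERCENT", "^": "CARET",
    "&": "AMP", "|": "PIPE", "=": "EQUAL", "<": "LT", ">": "GT",
    "!=": "NEQ", "<=": "LEQ", ">=": "GEQ", "==": "EQEQ",
    "&&": "AND", "||": "OR", "->": "ARROW", ">>": "SHR", "<<": "SHL",
    "++": "PLUSPLUS", "--": "MINUSMINUS", "+=": "PLUSEQ", "-=": "MINUSEQ",
    "*=": "STAREQ", "/=": "SLASHEQ", "%=": "PERCENTEQ", "&=": "AMPEQ",
    "|=": "PIPEEQ", "^=": "CARETEQ", ">>=": "SHREQ", "<<=": "SHLEQ",
}
MAX_LENGTH = max(len(symbol) for symbol in PUNCTUATION)


def lex_punctuation(source: str, pos: int = 0):
    """Return (token_name, next_pos) for the longest symbol starting at pos."""
    for length in range(MAX_LENGTH, 0, -1):
        symbol = source[pos:pos + length]
        # The length check rejects slices cut short by the end of the input.
        if len(symbol) == length and symbol in PUNCTUATION:
            return PUNCTUATION[symbol], pos + length
    raise ValueError(f"no punctuation token at position {pos}")
```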

Tokens

A token is one of the following elements, described in the previous sections: a keyword, an identifier, a literal or a punctuation symbol.

Statements and Expressions

Koj is primarily an expression language, meaning most of the code you write produces a value that can be assigned to a variable.

Statements, on the other hand, serve mostly to define types, declare variables or chain expressions together in blocks.

Statements

Syntax
Statement :
     ;
     | TypeDef
     | Declaration
     | ExpressionStatement

Expression Statements

Syntax
ExpressionStatement :
     Expression ;

Expression statements are used to evaluate an expression and discard the result. They are used to trigger the side effects of evaluating the expression.

All statements evaluate to the Unit type.

Type Definitions

Syntax
TypeDef :
     type TYPE_IDENTIFIER = Type ;

Type declarations can be used both to alias existing types and to create new ones.

Example:

You can define a Pizza type representing a recipe for a pizza in the following way:

type Pizza = struct {
  .crust: enum { `Thick, `Thin, `Cheesy },
  .base: enum { `Tomato, `Cream },
  .toppings: [String],
};

Declarations

Syntax
Declaration :
     VarDecl | FuncDecl

Variable Declarations

Syntax
VarDecl :
     let Mutability IDENTIFIER ( : Type )? := Expression

Mutability :
     mut | const

Variables can be declared as either mut or const. A const variable cannot be assigned a new value after it is declared.
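As a rough illustration of this rule, the Python sketch below shows how an interpreter's environment could record mutability and reject assignments to const variables. The Environment class and its method names are hypothetical, and the behaviour on redeclaration (here, the old binding is simply replaced) is an assumption, since this specification does not define it yet.

```python
class Environment:
    """Maps variable names to (value, is_mutable) pairs."""

    def __init__(self):
        self._bindings = {}

    def declare(self, name, value, mutable):
        # Models `let mut name := value` or `let const name := value`.
        self._bindings[name] = (value, mutable)

    def assign(self, name, value):
        # Models `name := value` on an already declared variable.
        if name not in self._bindings:
            raise NameError(f"undeclared variable: {name}")
        _, mutable = self._bindings[name]
        if not mutable:
            raise TypeError(f"cannot assign to const variable: {name}")
        self._bindings[name] = (value, True)

    def lookup(self, name):
        return self._bindings[name][0]
```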

Function Declarations

Syntax
FuncDecl :
     let func IDENTIFIER :=
         ( FuncDeclArgs? ) => Expression

FuncDeclArgs :
     FuncDeclArg ( , FuncDeclArg )*

FuncDeclArg :
     IDENTIFIER : TYPE_IDENTIFIER

Expressions

Syntax
Expression :
     LiteralExpression
     | ParensExpression
     | BlockExpression
     | OperatorExpression
     | FunctionExpression
     | StructExpression
     | ArrayExpression
     | TupleExpression
     | EnumExpression
     | UnitExpression
     | IndexExpression
     | FieldExpression
     | CallExpression
     | ConditionalExpression
     | LoopExpression
     | MatchExpression
     | PlaceholderExpression

Expressions are used to produce a value and trigger side effects.

Literal Expressions

Syntax
LiteralExpression :
     INTEGER_LITERAL
     | FLOATING_LITERAL
     | CHAR_LITERAL
     | STRING_LITERAL
     | true | false

Literal expressions are composed of a single token and evaluate to the value of that token.

Parenthesized Expressions

Syntax
ParensExpression :
     ( Expression )

Parenthesized expressions wrap a single expression and evaluate to the value of said expression. They are used to control the order of evaluation of subexpressions within an expression.

Block Expressions

Syntax
BlockExpression :
     { BlockComponents? }

BlockComponents :
     Expression
     | Statement+ Expression?

Block expressions are a way to chain several statements together. The value of a block expression is that of the last executed instruction. Since statements evaluate to the Unit type, a block that ends with a statement evaluates to ().
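The evaluation of a block can be sketched in Python as follows. Statements and the optional final expression are modelled as zero-argument callables, which is purely a device of this sketch, and the sketch assumes that an empty block evaluates to the Unit value.

```python
UNIT = ()  # stands in for koj's Unit value ()


def eval_block(statements, final_expression=None):
    """Run the statements in order, then return the value of the block."""
    for statement in statements:
        statement()  # the result of a statement is discarded
    if final_expression is None:
        return UNIT
    return final_expression()
```

For example, eval_block([], lambda: 1 + 2) models the block { 1 + 2 }, while eval_block([some_statement]) models a block that ends with a statement.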

Operator Expressions

Syntax
OperatorExpression :
     ArithmeticOpExpression
     | BitwiseOpExpression
     | LogicalOpExpression
     | ComparisonExpression
     | AssignementExpression
     | StringOpExpression

Arithmetic Operations

Syntax
ArithmeticOpExpression :
     - Expression
     | Expression + Expression
     | Expression - Expression
     | Expression * Expression
     | Expression / Expression
     | Expression // Expression
     | Expression % Expression
     | Expression ** Expression

These operators are, in order, the negation, addition, subtraction, multiplication, division, integer division, modulo and exponentiation operators.

Bitwise Operations

Syntax
BitwiseOpExpression :
     ~ Expression
     | Expression & Expression
     | Expression | Expression
     | Expression ^ Expression

These operators represent, in order, the bitwise NOT, AND, OR and XOR operators.

Logical Operations

Syntax
LogicalOpExpression :
     ! Expression
     | Expression && Expression
     | Expression || Expression

These operators represent, respectively, the logical NOT, AND and OR operators.

Comparisons

Syntax
ComparisonExpression :
     Expression > Expression
     | Expression >= Expression
     | Expression < Expression
     | Expression <= Expression
     | Expression == Expression
     | Expression != Expression

These operators represent, in order, the greater than, greater than or equal, less than, less than or equal, equal and not equal operators.

Assignments

Syntax
AssignementExpression :
     Expression := Expression

The expression on the left side of the operator must be assignable (an lvalue). It must also not have been declared as const.

String Operations

Syntax
StringOpExpression :
     Expression @ Expression

This operator represents string concatenation.

Type System

Syntax
Type :
     TYPE_IDENTIFIER
     | PrimitiveType
     | ArrayType
     | TupleType
     | FunctionType
     | Struct
     | Enum
     | InferredType

Primitive Types

Koj offers these different primitive types:

Integer type

This type represents an integer value. It is represented by the identifier Int.

Floating point type

This type represents a floating point value. It is represented by the identifier Float.

Character type

This type represents a single character. It is represented by the identifier Char.

String type

This type represents a text string. It is represented by the identifier String.

For more information on the values of these types, see the Literals section.

Unit type

This type represents the return type of a function returning no actual value. It is represented by the identifier Unit, and its only possible value is ().

Never type

This type has no value. It is represented by the identifier Never. It is used to type computations that never produce a value, as opposed to Unit, which types computations that complete without a meaningful result.

Array Types

Syntax
ArrayType :
     [ Type ]

Array types represent a sequence of elements of the same type. They are dynamically sized.

Tuple Types

Syntax
TupleType :
     ( Type ( , Type )* )

Tuple types represent fixed sequences of values of possibly different types. The order of the fields matters, so the type (String, Int) is different from the type (Int, String).
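One possible way for an implementation to represent these types is as immutable records compared structurally. The Python sketch below is only an illustration; the class names NamedType, ArrayType and TupleType and the show helper are mine, not part of the language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedType:
    name: str  # a primitive type or a TYPE_IDENTIFIER, e.g. "Int"


@dataclass(frozen=True)
class ArrayType:
    element: object  # the type of every element


@dataclass(frozen=True)
class TupleType:
    items: tuple  # the field types, in order


STRING = NamedType("String")
INT = NamedType("Int")


def show(ty) -> str:
    """Render a type using the syntax of this chapter."""
    if isinstance(ty, NamedType):
        return ty.name
    if isinstance(ty, ArrayType):
        return "[" + show(ty.element) + "]"
    if isinstance(ty, TupleType):
        return "(" + ", ".join(show(item) for item in ty.items) + ")"
    raise TypeError(f"unknown type: {ty!r}")
```

Because dataclasses compare field by field, TupleType((STRING, INT)) and TupleType((INT, STRING)) are different values, which matches the rule that field order matters.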

Function Types

Syntax
FunctionType :
     ( FunctionTypeArguments? ) -> Type

FunctionTypeArguments :
     Type ( , Type )*

Example:

A function adding two integers together may have the following type:

(Int, Int) -> Int

Struct Types

Syntax
Struct :
     struct { StructFields? }

StructFields :
     StructFieldDec ( , StructFieldDec )* ,?

StructFieldDec :
     . IDENTIFIER : Type

Structs are used to represent objects with fields. They are analogous to structs in C or record types in languages such as OCaml.

Example:

You may create a struct representing a point in 2D space the following way:

struct {
  .x: Int,
  .y: Int,
}

Enum Types

Syntax
Enum :
     enum { EnumMember ( , EnumMember )* ,? }

EnumMember : ` TYPE_IDENTIFIER ( EnumMemberOfType )?

EnumMemberOfType : ( Type )

Examples:

To represent a direction on a D-pad, you could use the following enum:

enum { `Up, `Down, `Left, `Right }

To represent a user on a website, you could use the following enum:

enum {
  `Anon,
  `LoggedIn(struct { .id: String, .role: String }),
}

Inferred Type

Syntax
InferredType : _

This type is used to let the compiler infer the type of the item.