Introduction
Disclaimer: this specification is going to change a lot as the language evolves. It is still a very new project and nothing is set in stone yet.
The koj language was born as a side project of mine after I took a class on compilers in my second year at TELECOM Nancy.
Syntax
The goal for the syntax of this language is for it to be as consistent as possible. Each syntactic construct should have one function and one function only.
Implementation
The first iteration of this language is going to be implemented as an interpreter written in OCaml, using sedlex for the lexer and Menhir for the parser. The language may then evolve into a compiled one, probably implemented in itself with a hand-written lexer and parser and leveraging LLVM as a backend.
Lexical Structure
This chapter describes the lexical structure of the language.
Input format
The input format of the koj language is a sequence of Unicode code points encoded in UTF-8. The input is read from a file or from the standard input.
Whitespace
The koj language uses the following whitespace characters:
- ' ' (U+0020 SPACE)
- '\t' (U+0009 CHARACTER TABULATION)
- '\n' (U+000A LINE FEED)
- '\r' (U+000D CARRIAGE RETURN)
- U+000B LINE TABULATION
- U+000C FORM FEED
- U+0085 NEXT LINE
- U+200E LEFT-TO-RIGHT MARK
- U+200F RIGHT-TO-LEFT MARK
- U+2028 LINE SEPARATOR
- U+2029 PARAGRAPH SEPARATOR
These characters are called whitespace characters and are used to separate tokens in the language. They are not part of the language and are ignored by the lexer.
Outside of comments and string or character literals, the meaning of the program is preserved if any whitespace character is replaced by another one.
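As an illustration, the whitespace set listed above can be written down as a lookup table. The sketch below is in Python rather than koj, and the names `WHITESPACE` and `skip_whitespace` are hypothetical helpers, not part of the specification or of the planned interpreter.

```python
# Code points listed as whitespace in this section.
WHITESPACE = frozenset(
    "\u0020"  # SPACE
    "\u0009"  # CHARACTER TABULATION
    "\u000a"  # LINE FEED
    "\u000d"  # CARRIAGE RETURN
    "\u000b"  # LINE TABULATION
    "\u000c"  # FORM FEED
    "\u0085"  # NEXT LINE
    "\u200e"  # LEFT-TO-RIGHT MARK
    "\u200f"  # RIGHT-TO-LEFT MARK
    "\u2028"  # LINE SEPARATOR
    "\u2029"  # PARAGRAPH SEPARATOR
)


def skip_whitespace(source: str, pos: int = 0) -> int:
    """Return the index of the first non-whitespace character at or after pos."""
    while pos < len(source) and source[pos] in WHITESPACE:
        pos += 1
    return pos
```

A lexer built this way would call `skip_whitespace` before trying to match each token.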
Comments
Comments are used to document the code and are ignored by the compiler.
There are two types of comments:
- Single-line comments start with // and end at the end of the line.
- Block comments start with /* and end with */. They can span multiple lines.
COMMENT := LINE_COMMENT | BLOCK_COMMENT
LINE_COMMENT := // [^\n]*
BLOCK_COMMENT := /* .*? */
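The two comment rules translate fairly directly into regular expressions. The following Python sketch is one possible reading of them: it assumes that `.` in BLOCK_COMMENT also matches line breaks (since block comments can span multiple lines) and that block comments do not nest. The function name `skip_comment` is hypothetical.

```python
import re

# LINE_COMMENT: "//" followed by anything up to (not including) the line feed.
LINE_COMMENT = re.compile(r"//[^\n]*")
# BLOCK_COMMENT: "/*", then the shortest possible run of characters, then "*/".
# re.DOTALL lets "." match line breaks so the comment can span several lines.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def skip_comment(source: str, pos: int) -> int:
    """Return the index just past a comment starting at pos, or pos if none starts there."""
    for pattern in (LINE_COMMENT, BLOCK_COMMENT):
        match = pattern.match(source, pos)
        if match:
            return match.end()
    return pos
```

Note that an unterminated block comment is simply not recognised here; a real lexer would probably report it as an error.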
Keywords
The following keywords are reserved by the koj language:
- KW_AS: as
- KW_BREAK: break
- KW_CATCH: catch
- KW_CONST: const
- KW_CONTINUE: continue
- KW_DO: do
- KW_EACH: each
- KW_ELSE: else
- KW_ENUM: enum
- KW_FALSE: false
- KW_FOR: for
- KW_FUNC: func
- KW_IF: if
- KW_IS: is
- KW_LET: let
- KW_MATCH: match
- KW_MUT: mut
- KW_RETURN: return
- KW_STRUCT: struct
- KW_THROW: throw
- KW_TRUE: true
- KW_TRY: try
- KW_TYPE: type
- KW_UNION: union
- KW_WHILE: while
This list may change as the language evolves.
These keywords are reserved and cannot be used as identifiers.
Identifiers
Identifiers are represented by the regular expression [a-zA-Z_][a-zA-Z0-9_]*. They are used to name variables and functions. They may not be one of the keywords described previously.
Type identifiers are similar, but they must start with an uppercase letter. The corresponding regular expression is [A-Z][a-zA-Z0-9_]*.
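To make the interaction between keywords and the two regular expressions concrete, here is a small Python sketch. The function `classify_word` is hypothetical. Because a word starting with an uppercase letter matches both regular expressions, the sketch checks the type identifier pattern first; that ordering is an assumption, as the text does not say how the overlap is resolved.

```python
import re

KEYWORDS = frozenset({
    "as", "break", "catch", "const", "continue", "do", "each", "else", "enum",
    "false", "for", "func", "if", "is", "let", "match", "mut", "return",
    "struct", "throw", "true", "try", "type", "union", "while",
})

IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")
TYPE_IDENTIFIER = re.compile(r"[A-Z][a-zA-Z0-9_]*")


def classify_word(word: str) -> str:
    """Return the token name for a keyword, type identifier or identifier."""
    if word in KEYWORDS:
        return "KW_" + word.upper()
    # Assumption: uppercase-initial words are type identifiers.
    if TYPE_IDENTIFIER.fullmatch(word):
        return "TYPE_IDENTIFIER"
    if IDENTIFIER.fullmatch(word):
        return "IDENTIFIER"
    raise ValueError(f"not a keyword or identifier: {word!r}")
```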
Literals
Literals are used to represent values in the source code. They are used to initialize variables, to pass arguments to functions, etc.
Integer literals
Integer literals represent an integer value. They can be written in decimal, hexadecimal, octal or binary notation.
INTEGER_LITERAL :
DECIMAL_LITERAL
| HEXADECIMAL_LITERAL
| OCTAL_LITERAL
| BINARY_LITERAL
DECIMAL_LITERAL : DEC_DIGIT+
HEXADECIMAL_LITERAL : ( 0x | 0X ) HEX_DIGIT+
OCTAL_LITERAL : ( 0o | 0O ) OCT_DIGIT+
BINARY_LITERAL : ( 0b | 0B ) BIN_DIGIT+
BIN_DIGIT : [0-1]
OCT_DIGIT : [0-7]
HEX_DIGIT : [0-9a-fA-F]
DEC_DIGIT : [0-9]
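The four notations can be recognised with a single regular expression. The Python sketch below is only an illustration: it assumes that hexadecimal literals are made of HEX_DIGIT characters and that every literal has at least one digit. The name `parse_integer_literal` is hypothetical, and the result is an unbounded Python integer because the specification does not state the width of Int.

```python
import re

INTEGER_LITERAL = re.compile(
    r"0[xX](?P<hex>[0-9a-fA-F]+)"  # HEXADECIMAL_LITERAL
    r"|0[oO](?P<oct>[0-7]+)"       # OCTAL_LITERAL
    r"|0[bB](?P<bin>[01]+)"        # BINARY_LITERAL
    r"|(?P<dec>[0-9]+)"            # DECIMAL_LITERAL
)
BASES = {"hex": 16, "oct": 8, "bin": 2, "dec": 10}


def parse_integer_literal(token: str) -> int:
    """Convert the text of an integer literal into its value."""
    match = INTEGER_LITERAL.fullmatch(token)
    if match is None:
        raise ValueError(f"not an integer literal: {token!r}")
    kind = match.lastgroup  # name of the alternative that matched
    return int(match.group(kind), BASES[kind])
```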
Floating-point literals
Floating-point literals represent a floating-point value. They can be written in decimal or hexadecimal notation.
FLOATING_LITERAL :
DECIMAL_LITERAL . DECIMAL_LITERAL?
| . DECIMAL_LITERAL
| DECIMAL_LITERAL ( . DECIMAL_LITERAL )? EXPONENT
EXPONENT : ( e | E ) ( + | - )? DECIMAL_LITERAL
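The three alternatives of the grammar above can be checked with one regular expression, one branch per alternative. This Python sketch assumes DECIMAL_LITERAL means one or more decimal digits; `parse_floating_literal` is a hypothetical name, and the conversion to a Python `float` (a 64-bit double) is an assumption, since the specification does not fix the precision of Float.

```python
import re

FLOATING_LITERAL = re.compile(
    r"[0-9]+\.[0-9]*"                        # DECIMAL_LITERAL . DECIMAL_LITERAL?
    r"|\.[0-9]+"                             # . DECIMAL_LITERAL
    r"|[0-9]+(?:\.[0-9]+)?[eE][+-]?[0-9]+"   # DECIMAL_LITERAL ( . DECIMAL_LITERAL )? EXPONENT
)


def parse_floating_literal(token: str) -> float:
    """Convert the text of a decimal floating-point literal into its value."""
    if FLOATING_LITERAL.fullmatch(token) is None:
        raise ValueError(f"not a floating-point literal: {token!r}")
    return float(token)
```

Under this reading, `1` alone is not a floating-point literal (it is an integer literal), while `1.` and `1e3` are.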
String and character literals
Character literals represent a single character. String literals represent a sequence of characters.
Character literals
CHAR_LITERAL : ' CHAR_FRAGMENT? '
CHAR_FRAGMENT :
~[' \ \n \r \t]
| QUOTE_ESCAPE
| ASCII_ESCAPE
| UNICODE_ESCAPE
QUOTE_ESCAPE : \' | \"
ASCII_ESCAPE : \x OCT_DIGIT HEX_DIGIT | \n | \r | \t | \\ | \0
UNICODE_ESCAPE : \u{ HEX_DIGIT[1,6] }
String literals
STRING_LITERAL : " STRING_FRAGMENT* "
STRING_FRAGMENT :
~[" \ \n \r \t]
| QUOTE_ESCAPE
| ASCII_ESCAPE
| UNICODE_ESCAPE
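As a rough illustration of how the escape rules could be applied, the Python sketch below turns the text of a string literal into the string it denotes. It reads QUOTE_ESCAPE as either `\'` or `\"`, and `HEX_DIGIT[1,6]` as one to six hexadecimal digits; both readings are assumptions. Since `\x` is followed by OCT_DIGIT HEX_DIGIT, it can only produce code points 0x00 to 0x7F. The name `decode_string_literal` is hypothetical.

```python
import re

# One escape sequence: QUOTE_ESCAPE, ASCII_ESCAPE or UNICODE_ESCAPE.
ESCAPE = re.compile(
    r"\\(?:"
    r"(?P<quote>['\"])"
    r"|x(?P<ascii>[0-7][0-9a-fA-F])"
    r"|(?P<simple>[nrt\\0])"
    r"|u\{(?P<unicode>[0-9a-fA-F]{1,6})\}"
    r")"
)
SIMPLE = {"n": "\n", "r": "\r", "t": "\t", "\\": "\\", "0": "\0"}


def decode_string_literal(token: str) -> str:
    """Return the value denoted by a string literal, quotes included in token."""
    if len(token) < 2 or token[0] != '"' or token[-1] != '"':
        raise ValueError("a string literal must be enclosed in double quotes")
    body = token[1:-1]
    out = []
    pos = 0
    while pos < len(body):
        ch = body[pos]
        if ch == "\\":
            match = ESCAPE.match(body, pos)
            if match is None:
                raise ValueError(f"invalid escape at offset {pos}")
            if match.group("quote") is not None:
                out.append(match.group("quote"))
            elif match.group("ascii") is not None:
                out.append(chr(int(match.group("ascii"), 16)))
            elif match.group("simple") is not None:
                out.append(SIMPLE[match.group("simple")])
            else:
                out.append(chr(int(match.group("unicode"), 16)))
            pos = match.end()
        elif ch in '"\n\r\t':
            # Excluded by STRING_FRAGMENT: these must be written as escapes.
            raise ValueError(f"unescaped character at offset {pos}")
        else:
            out.append(ch)
            pos += 1
    return "".join(out)
```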
Punctuation
Token Name | Symbol | Description |
---|---|---|
LPAREN | ( | Left parenthesis |
RPAREN | ) | Right parenthesis |
LBRACE | { | Left brace |
RBRACE | } | Right brace |
LBRACKET | [ | Left bracket |
RBRACKET | ] | Right bracket |
COMMA | , | Separator |
DOT | . | Property access |
COLON | : | Type annotation |
SEMICOLON | ; | Separator |
UNDERSCORE | _ | Wildcard |
BANG | ! | Throwing type |
QUESTION | ? | Null typing |
DOLLAR | $ | Dollar sign |
PLUS | + | Plus |
MINUS | - | Minus |
STAR | * | Multiplication |
SLASH | / | Division |
PERCENT | % | Modulo |
CARET | ^ | Bitwise XOR |
AMP | & | Bitwise AND |
PIPE | \| | Bitwise OR |
EQUAL | = | Assignment |
LT | < | Less than |
GT | > | Greater than |
NEQ | != | Not equal |
LEQ | <= | Less than or equal |
GEQ | >= | Greater than or equal |
EQEQ | == | Equal equal |
AND | && | Logical and |
OR | \|\| | Logical or |
ARROW | -> | Function type |
SHR | >> | Shift right |
SHL | << | Shift left |
PLUSPLUS | ++ | Incrementation |
MINUSMINUS | -- | Decrementation |
PLUSEQ | += | Add assign |
MINUSEQ | -= | Minus assign |
STAREQ | *= | Multiplication assign |
SLASHEQ | /= | Division assign |
PERCENTEQ | %= | Modulus assign |
AMPEQ | &= | Bitwise and assign |
PIPEEQ | \|= | Bitwise or assign |
CARETEQ | ^= | Bitwise xor assign |
SHREQ | >>= | Shift right assign |
SHLEQ | <<= | Shift left assign |
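Several of these symbols are prefixes of others (for example >, >> and >>=). The table does not say how a lexer should choose between them; a common convention is to take the longest symbol that matches. The Python sketch below illustrates that convention only as an assumption about koj. It takes PIPE and OR to be | and ||, and `match_punctuation` is a hypothetical name.

```python
# Symbol -> token name, following the punctuation table.
PUNCTUATION = {
    "(": "LPAREN", ")": "RPAREN", "{": "LBRACE", "}": "RBRACE",
    "[": "LBRACKET", "]": "RBRACKET", ",": "COMMA", ".": "DOT",
    ":": "COLON", ";": "SEMICOLON", "_": "UNDERSCORE", "!": "BANG",
    "?": "QUESTION", "$": "DOLLAR", "+": "PLUS", "-": "MINUS",
    "*": "STAR", "/": "SLASH", "%": "PERCENT", "^": "CARET",
    "&": "AMP", "|": "PIPE", "=": "EQUAL", "<": "LT", ">": "GT",
    "!=": "NEQ", "<=": "LEQ", ">=": "GEQ", "==": "EQEQ", "&&": "AND",
    "||": "OR", "->": "ARROW", ">>": "SHR", "<<": "SHL", "++": "PLUSPLUS",
    "--": "MINUSMINUS", "+=": "PLUSEQ", "-=": "MINUSEQ", "*=": "STAREQ",
    "/=": "SLASHEQ", "%=": "PERCENTEQ", "&=": "AMPEQ", "|=": "PIPEEQ",
    "^=": "CARETEQ", ">>=": "SHREQ", "<<=": "SHLEQ",
}
MAX_LEN = max(len(symbol) for symbol in PUNCTUATION)


def match_punctuation(source: str, pos: int = 0):
    """Return (token_name, symbol) for the longest symbol at pos, or None."""
    for length in range(MAX_LEN, 0, -1):  # try the longest candidates first
        candidate = source[pos:pos + length]
        if len(candidate) == length and candidate in PUNCTUATION:
            return PUNCTUATION[candidate], candidate
    return None
```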
Tokens
Tokens can be one of the following elements, described in the previous sections:
- Keywords
- Identifiers
- Literals
- Punctuation
Statements and Expressions
Koj is primarily an expression language, meaning most of the code you write produces a value that can be assigned to a variable.
Statements, on the other hand, serve mostly to define types, declare variables or chain expressions together in blocks.
Statements
Syntax
Statement :
;
| TypeDef
| Declaration
| ExpressionStatement
Expression Statements
Syntax
ExpressionStatement : Expression ;
Expression statements are used to evaluate an expression and discard the result. They are used to trigger the side effects of evaluating the expression.
All statements evaluate to the Unit type.
Type Definitions
Syntax
TypeDef : type TYPE_IDENTIFIER = Type ;
Type definitions can be used both to alias existing types and to create new ones.
Example:
You can define a Pizza type representing a recipe for a pizza in the following way:
type Pizza = struct {
    .crust: enum { `Thick, `Thin, `Cheesy },
    .base: enum { `Tomato, `Cream },
    .toppings: [String],
};
Declarations
Syntax
Declaration :
VarDecl | FuncDecl
Variable Declarations
Syntax
VarDecl : let Mutability IDENTIFIER ( : Type )? := Expression
Mutability : mut | const
Variables can be declared as either mut or const. const variables cannot be assigned a new value after they are declared.
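One way an interpreter could enforce the mut/const rule is to store a mutability flag next to each binding. The Python sketch below is purely illustrative: the `Environment` class and its methods are hypothetical, it assumes that redeclaring a name simply replaces the previous binding, and it performs the check at run time, whereas the specification does not say when the check happens.

```python
class Environment:
    """Maps variable names to (value, is_mutable) pairs."""

    def __init__(self):
        self._bindings = {}

    def declare(self, name, value, mutable):
        # Models `let mut name := value` (mutable=True)
        # or `let const name := value` (mutable=False).
        self._bindings[name] = (value, mutable)

    def assign(self, name, value):
        # Models `name := value` on an already declared variable.
        if name not in self._bindings:
            raise NameError(f"undeclared variable: {name}")
        _, mutable = self._bindings[name]
        if not mutable:
            raise TypeError(f"cannot assign to const variable: {name}")
        self._bindings[name] = (value, True)

    def lookup(self, name):
        return self._bindings[name][0]
```

For example, after `env.declare("x", 1, mutable=True)` a call to `env.assign("x", 2)` succeeds, while the same call on a const binding raises an error.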
Function Declarations
Syntax
FuncDecl : let func IDENTIFIER := ( FuncDeclArgs? ) => Expression
FuncDeclArgs : FuncDeclArg ( , FuncDeclArg )*
FuncDeclArg : IDENTIFIER : TYPE_IDENTIFIER
Expressions
Syntax
Expression :
LiteralExpression
| ParensExpression
| BlockExpression
| OperatorExpression
| FunctionExpression
| StructExpression
| ArrayExpression
| TupleExpression
| EnumExpression
| UnitExpression
| IndexExpression
| FieldExpression
| CallExpression
| ConditionalExpression
| LoopExpression
| MatchExpression
| PlaceholderExpression
Expressions are used to produce a value and trigger side effects.
Literal Expressions
Syntax
LiteralExpression :
INTEGER_LITERAL
| FLOATING_LITERAL
| CHAR_LITERAL
| STRING_LITERAL
| true
| false
Literal expressions are composed of a single token and evaluate to the value of that token.
Parenthesized Expressions
Syntax
ParensExpression : ( Expression )
Parenthesized expressions wrap a single expression and evaluate to the value of said expression. They are used to control the order of evaluation of subexpressions within an expression.
Block Expressions
Syntax
BlockExpression : { BlockComponents? }
BlockComponents :
Expression
| Statement+
| Statement+ Expression?
Block expressions are a way to chain several statements together. The value of a block expression is that of the last executed instruction.
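The evaluation rule for blocks can be sketched as follows in Python. This is an illustration only: statements and the optional trailing expression are modelled as zero-argument callables, the empty tuple stands in for the Unit value, and the choice that an empty block evaluates to Unit is an assumption. The name `eval_block` is hypothetical.

```python
UNIT = ()  # stand-in for koj's single Unit value, ()


def eval_block(statements, final_expression=None):
    """Run each statement in order, then return the block's value.

    statements: list of zero-argument callables, run for their side effects.
    final_expression: optional zero-argument callable producing the value.
    """
    for statement in statements:
        statement()  # the result of a statement is discarded
    if final_expression is None:
        # The block ends with a statement (or is empty), so its value is Unit.
        return UNIT
    return final_expression()
```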
Operator Expressions
Syntax
OperatorExpression :
ArithmeticOpExpression
| BitwiseOpExpression
| LogicalOpExpression
| ComparisonExpression
| AssignementExpression
| StringOpExpression
Arithmetic Operations
Syntax
ArithmeticOpExpression :
- Expression
| Expression + Expression
| Expression - Expression
| Expression * Expression
| Expression / Expression
| Expression // Expression
| Expression % Expression
| Expression ** Expression
These operators are, in order, the negation, addition, subtraction, multiplication, division, integer division, modulo and exponentiation operators.
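To show the difference between these operators on concrete numbers, the Python sketch below maps each one to the corresponding Python operation. The names `BINARY_ARITHMETIC`, `eval_binary` and `eval_negation` are hypothetical. The details (for instance how // and % behave on negative operands, or what / returns for two integers) follow Python here and are assumptions; the specification does not define them yet.

```python
import operator

# Binary arithmetic operators, in the order listed above.
BINARY_ARITHMETIC = {
    "+": operator.add,        # addition
    "-": operator.sub,        # subtraction
    "*": operator.mul,        # multiplication
    "/": operator.truediv,    # division
    "//": operator.floordiv,  # integer division
    "%": operator.mod,        # modulo
    "**": operator.pow,       # exponentiation
}


def eval_binary(op, left, right):
    """Apply a binary arithmetic operator to two already evaluated operands."""
    return BINARY_ARITHMETIC[op](left, right)


def eval_negation(operand):
    """Apply the unary negation operator."""
    return operator.neg(operand)
```

With these definitions, `eval_binary("/", 7, 2)` gives 3.5 while `eval_binary("//", 7, 2)` gives 3.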
Bitwise Operations
Syntax
BitwiseOpExpression :
~ Expression
| Expression & Expression
| Expression | Expression
| Expression ^ Expression
These operators represent, in order, the bitwise NOT, AND, OR and XOR operators.
Logical Operations
Syntax
LogicalOpExpression :
! Expression
| Expression && Expression
| Expression || Expression
These operators represent, respectively, the logical NOT, AND and OR operators.
Comparisons
Syntax
ComparisonExpression :
Expression > Expression
| Expression >= Expression
| Expression < Expression
| Expression <= Expression
| Expression == Expression
| Expression != Expression
These operators represent, in order, the greater than, greater than or equal, less than, less than or equal, equal and not equal operators.
Assignments
Syntax
AssignementExpression : Expression := Expression
The expression on the left side of the operator must be assignable (an lvalue). It must also not have been declared as const.
String Operations
Syntax
StringOpExpression : Expression @ Expression
This operator represents string concatenation.
Type System
Syntax
Type :
TYPE_IDENTIFIER
| PrimitiveType
| ArrayType
| FunctionType
| Struct
| Enum
| InferredType
Primitive Types
Koj offers these different primitive types:
Integer type
This type represents an integer value. It is represented by the identifier Int.
Floating point type
This type represents a floating point value. It is represented by the identifier Float.
Character type
This type represents a single character. It is represented by the identifier Char.
String type
This type represents a text string. It is represented by the identifier String.
For more information on these types, see the Literals section.
Unit type
This type represents the return type of a function returning no actual value. It is represented by the identifier Unit, and its only possible value is ().
Never type
This type has no value. It is represented by the identifier Never. It is used to type computations that never complete normally and therefore never produce a value.
Array Types
Syntax
ArrayType : [ Type ]
Array types represent a sequence of elements of the same type. They are dynamically sized.
Tuple Types
Tuple types represent lists of heterogeneous types. The order of the fields matters, so the type (String, Int) is different from the type (Int, String).
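The fact that field order is part of a tuple type can be illustrated with a small structural representation of types. The Python sketch below is only a model: the class names `Named`, `ArrayOf` and `TupleOf` are hypothetical, and comparing types by structure is an assumption about how a koj type checker might work.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Named:
    """A type referred to by name, such as Int or String."""
    name: str


@dataclass(frozen=True)
class ArrayOf:
    """An array type [element]."""
    element: "Type"


@dataclass(frozen=True)
class TupleOf:
    """A tuple type (elements[0], elements[1], ...); order is significant."""
    elements: Tuple["Type", ...]


Type = Union[Named, ArrayOf, TupleOf]

STRING = Named("String")
INT = Named("Int")

# Dataclass equality compares fields in order, so swapping the
# elements of a tuple type yields a different type.
string_int = TupleOf((STRING, INT))
int_string = TupleOf((INT, STRING))
```

Here `string_int == int_string` is false, while two separately built `TupleOf((STRING, INT))` values compare equal.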
Function Types
Syntax
FunctionType : ( FunctionTypeArguments? ) -> Type
Example:
A function adding two integers together may have the following type:
(Int, Int) -> Int
Struct Types
Syntax
Struct : struct { StructFields* }
StructFields : StructFieldDec ( , StructFieldDec )* ,?
StructFieldDec : . IDENTIFIER : Type
Structs are used to represent objects with fields. They are analogous to structs in C or record types in languages such as OCaml.
Example:
You may create a struct representing a point in 2D space in the following way:
struct {
    .x: Int,
    .y: Int,
}
Enum Types
Syntax
Enum : enum { EnumMember ( , EnumMember )* ,? }
EnumMember : ` TYPE_IDENTIFIER ( EnumMemberOfType )?
EnumMemberOfType : ( Type )
Examples:
To represent a direction on a D-pad, you could use the following enum:
enum { `Up, `Down, `Left, `Right }
To represent a user on a website, you could use the following enum:
enum {
    `Anon,
    `LoggedIn(struct { .id: String, .role: String }),
}
Inferred Type
Syntax
InferredType : _
This type is used to let the compiler infer the type of the item.