michael.gr: A Programming Language

"Coding Software Running On A Computer Monitor" by Scopio from NounProject.com

Abstract

My thoughts and notes on what I would like a new programming language to look like.

The goals of the language are:

Simple and elegant. (So that it is suitable for academia.)
Expressive. (So that it is suitable for experienced programmers.)
Consistent. (So that it is attractive to developer teams.)
Guiding. (So that it promotes best practices.)
Fast. (So that it is suitable for high performance computing.)
Lean. (So that it is suitable for resource-constrained computing.)

This is a work in progress; it is bound to be heavily amended as time passes, especially if I try some new language, like Kotlin or Rust.

Summary of language characteristics

(Useful pre-reading: About these papers)

The main goals of the language are achieved via the following characteristics:

For simplicity and elegance:

Scoping by indentation instead of curly braces. (Similar to Python.)
Keyword-rich syntax which avoids cryptic abbreviations and symbols.
Clear distinction between what is a statement and what is an expression.
Automatic memory management.

For expressiveness:

Lightweight properties, user-defined operators, generics, etc.
Type inference whenever possible.
Explicit nullability of reference types.
Full support for functional programming.
Full support for imperative programming without functional Nazism.
Async-transparency.

For consistency:

Extensive, mandatory, and in many cases non-suppressible, code inspections.
Whenever possible, only one way of expressing any given thing.
Extensive and strict formatting rules ensure all code looks the same.
Reformattability spares developers from having to type code in a particular way.

For performance:

Strongly typed.
Primitive value types correspond to machine words.
Intermediate-code-based, Just-In-Time compiled.
Fibers. (By means of async-transparency.)

For leanness:

Reference counting instead of garbage collection.
Minimalistic mandatory runtime library.
Separate and optional standard library.

Language characteristics in detail

For a list of shortcomings of other languages, which this language intends to fix, see:

Supports reference types and value types, as C# does.
The `null` value is valid only with explicitly nullable reference types.

As in C# 8.0 with `#nullable enable`:

A non-nullable reference can be used when a nullable reference is expected.
A nullable reference cannot be used when a non-nullable reference is expected, unless:

the compiler knows, via data-flow analysis, that the value is not null.

For example, by means of an if-statement which precludes null.

the value is explicitly cast to non-null. (As with the "null-forgiving" or "damnit" operator in C#.)

However, unlike C#:

The non-null cast is also an assertion against null, so it does not just circumvent the nullability checks of the compiler, it acts as an if-statement which precludes null.
Thus, a non-nullable reference can never accidentally hold null.
It is illegal to apply the non-null cast on a reference that is already non-nullable.
It is illegal to assign the result of the non-null-cast to a nullable reference.
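The asserting non-null cast can be sketched in Python, with `Optional` standing in for a nullable reference. This is only an illustration: the names `non_null`, `NullCastError`, and `find_name` are hypothetical, and Python cannot enforce the two compile-time prohibitions listed above.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


class NullCastError(Exception):
    """Raised when a non-null cast is applied to a value that is actually null."""


def non_null(value: Optional[T]) -> T:
    # Hypothetical helper: the cast doubles as a runtime assertion,
    # so a null can never slip through into a non-nullable reference.
    if value is None:
        raise NullCastError()
    return value


def find_name(user_id: int) -> Optional[str]:
    # Illustrative lookup that may legitimately yield no result.
    return {1: "alice"}.get(user_id)


name: str = non_null(find_name(1))      # the cast succeeds
try:
    non_null(find_name(2))              # fails at the cast itself, not later
    cast_failed = False
except NullCastError:
    cast_failed = True
```

The point of the design is that the failure happens at the cast, where the wrong assumption was made, rather than at some distant place where the null is eventually dereferenced.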

Compiles into an intermediate code format. There are two possibilities:

LLVM.
A new intermediate code format called ObjectCode, which is either interpreted or further compiled into machine code by a Just-In-Time (JIT) compiler.

Functionally, ObjectCode is a stack machine language, just as JVM ByteCode is.
ObjectCode is expressed as a hierarchical data structure.

A binary ObjectCode file is the result of serializing that data structure into a binary stream. Serialization into a text stream should also be possible.

ObjectCode is not trying hard to look like machine language, the way JVM ByteCode does. For example:

Instructions have no alternative short-form versions that accomplish the same thing but with fewer bytes.
There are no instructions for operations between `Integer`, `Real`, `Boolean`, etc; instead, these operations are available as methods exposed by those value types.

Very few instructions have knowledge of any particular data type:

Boolean operations have knowledge of the `boolean` type. (So that the compiler can apply short-circuit evaluation and branching.)
The `throw` instruction has knowledge of the `Exception` type.
The `switch` instruction has knowledge of the `integer` type.

No unnecessary JVM gimmicks like bytecode verification, stack verification, etc.
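To illustrate what a stack machine without arithmetic instructions might look like, here is a toy interpreter in Python. The instruction names (`push`, `call`) and the tuple encoding are assumptions made up for this sketch; they are not the actual ObjectCode format, which has not been defined.

```python
class Integer:
    """Illustrative value type; arithmetic lives here, not in the instruction set."""

    def __init__(self, value):
        self.value = value

    def Plus(self, other):
        return Integer(self.value + other.value)

    def Times(self, other):
        return Integer(self.value * other.value)


def run(program):
    # Each instruction is a tuple: ("push", value) or ("call", method_name, argc).
    stack = []
    for instruction in program:
        opcode = instruction[0]
        if opcode == "push":
            stack.append(instruction[1])
        elif opcode == "call":
            _, method_name, argument_count = instruction
            # Pop the arguments, then the receiver, and invoke the method by name.
            split = len(stack) - argument_count
            arguments = stack[split:]
            del stack[split:]
            receiver = stack.pop()
            stack.append(getattr(receiver, method_name)(*arguments))
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return stack.pop()


# (2 + 3) * 4, expressed without any arithmetic instructions.
program = [
    ("push", Integer(2)),
    ("push", Integer(3)),
    ("call", "Plus", 1),
    ("push", Integer(4)),
    ("call", "Times", 1),
]
result = run(program)
```

The interpreter knows nothing about `Integer`; adding a new value type requires no new instructions.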

Executable code is packaged into modules which correspond to C# assemblies.

So, no myriads of class files floating around.
Each class in a module has a timestamp.
When a module is being made, unchanged classes are copied verbatim from the old module instead of being recompiled, thus retaining their timestamps.
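A rough sketch of such a module build step, in Python. How "unchanged" is detected is not specified above; hashing the source text is an assumption of this sketch, as are the names `CompiledClass` and `build_module`.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CompiledClass:
    source_hash: str   # assumption: change detection by hashing the source text
    timestamp: int
    code: str


def compile_class(source: str, now: int) -> CompiledClass:
    # Stand-in for real compilation.
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    return CompiledClass(digest, now, f"compiled({source})")


def build_module(old_module: dict, sources: dict, now: int) -> dict:
    new_module = {}
    for name, source in sources.items():
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        previous = old_module.get(name)
        if previous is not None and previous.source_hash == digest:
            new_module[name] = previous          # copied verbatim, timestamp retained
        else:
            new_module[name] = compile_class(source, now)
    return new_module


first = build_module({}, {"A": "class A", "B": "class B"}, now=100)
second = build_module(first, {"A": "class A", "B": "class B v2"}, now=200)
```

After the second build, class `A` still carries the timestamp of the first build, while class `B` carries the new one.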

For the benefit of benchmarking, the runtime environment can be programmatically instructed to start JITting each method upon first invocation and never interpret anything.
Async-transparency and fibers.

Looks and feels synchronous, but works asynchronously under the hood.
A function can be declared as `async`; this signifies that the function works asynchronously, but nothing else changes:

When invoking: you call it and obtain its result just as with any other function.
When implementing: you just return a result, just like any other function.

When an async function is invoked, the compiler does not emit a direct invocation to the function; instead, it invokes a special InvokeAsync function of the runtime, which accepts the function to be invoked as a parameter, and returns the result returned by the function. So, it looks as if the runtime will invoke the target function, block-waiting for it to complete, and return the result. However, the runtime does the following instead:

starts the asynchronous operation,
obtains a promise under the hood,
sets aside the promise and the current stack,
proceeds to do other stuff.

When the promise is satisfied, the runtime:

gets the return value from the promise,
switches back to that stack,
continues execution from there.
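The park-and-resume cycle described above can be approximated in Python, using generators as a stand-in for switchable stacks. This is a sketch under heavy assumptions: the `Runtime`, `Promise`, and `worker` names are invented here, the asynchronous operation is simulated, and the explicit `yield` in `worker` is exactly the thing that the language would hide from the programmer.

```python
from collections import deque


class Promise:
    def __init__(self):
        self.done = False
        self.value = None

    def resolve(self, value):
        self.done = True
        self.value = value


class Runtime:
    def __init__(self):
        self.ready = deque()       # (fiber, value to resume it with)
        self.parked = []           # (promise, fiber) pairs that were set aside
        self.pending_io = deque()  # simulated asynchronous operations
        self.log = []

    def spawn(self, fiber):
        self.ready.append((fiber, None))

    def start_operation(self, result):
        # Simulated asynchronous operation: it completes on a later loop turn.
        promise = Promise()
        self.pending_io.append((promise, result))
        return promise

    def run(self):
        while self.ready or self.parked:
            while self.ready:
                fiber, value = self.ready.popleft()
                try:
                    promise = fiber.send(value)           # run until the next async call
                    self.parked.append((promise, fiber))  # set aside promise and stack
                except StopIteration:
                    pass                                  # the fiber ran to completion
            if self.pending_io:
                promise, result = self.pending_io.popleft()
                promise.resolve(result)                   # one operation completes
            still_parked = []
            for promise, fiber in self.parked:
                if promise.done:
                    self.ready.append((fiber, promise.value))  # switch back later
                else:
                    still_parked.append((promise, fiber))
            self.parked = still_parked
            if not self.ready and not self.pending_io and self.parked:
                raise RuntimeError("parked fibers can never be resumed")


def worker(runtime, name, payload):
    runtime.log.append(f"{name}: start")
    reply = yield runtime.start_operation(payload.upper())
    runtime.log.append(f"{name}: got {reply}")


runtime = Runtime()
runtime.spawn(worker(runtime, "a", "x"))
runtime.spawn(worker(runtime, "b", "y"))
runtime.run()
```

Both workers start before either receives its reply, which shows the runtime doing "other stuff" while an operation is outstanding, even though each worker reads as straight-line code.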

Note: something called "_hyperscript" already purports to support async-transparency; I do not know whether they switch stacks or pass promises/futures under the hood all over the place. See https://hyperscript.org/docs/#async
Note: this is related to OpenJDK JEP 425: Virtual Threads. See https://openjdk.org/jeps/425
An abstraction of an `EventDriver` is provided, which encapsulates an event driven system. A `ConcreteEventDriver` is provided, which is a default ("reference") implementation of an event-driven system.
The `EventDriver` does not contain a `post` method; instead, it exposes an `Injector` interface, which does. So, code that only needs to `post` only needs to have access to an `Injector`, not to the whole `EventDriver`.
Threads and thread-pools exist for interfacing with legacy systems; the preferred way of working is with fibers and fiber-pools.

Each fiber-pool has its own event-driver.
TODO: describe exactly what a fiber is.
TODO: describe how a fiber exposes a proxy for invocation from other fibers and how the proxy asserts that everything passed back and forth is either thread-safe or immutable.

Note that in multi-threaded execution models purity is of very limited usefulness because it does not prevent reading mutable state, so it does not avoid race conditions. However, this language makes use of fibers instead of threads, so there can be no race conditions, so purity becomes useful.

Support for functional programming.

For example:

Lambdas.
Tuples.
Everything is read-only by default.

The keyword `mutable` must be used to denote something which may vary. (Scala's `var` and `val` are too cryptic and too similar; mutability must stand out like a sore thumb.)
So, the syntax for declaring a mutable local integer is:

`mutable local x: integer`

It is an error to declare something as `mutable` and forget to ever mutate it.

All interfaces are pure by default.

A special keyword `impure` must be used to denote an interface which is allowed to contain impure methods.
It is an error to declare an interface as `impure` and forget to include any impure methods in it.

Methods can be either pure or impure, and this has severe implications on what they may and may not do.
Most language constructs like `if`, `for`, `while`, `switch` etc. have both functional and imperative forms.

The functional forms must be pure; the imperative forms can be impure.
The functional forms may not use flow-control keywords that would affect enclosing scopes; in other words,

A functional construct may not use the `return` keyword to exit the current function.
A functional construct may not use the `break` or `continue` keywords to exit or repeat an enclosing loop.

The functional forms make use of the `yield` keyword to produce values. So, the functional `if` statement is `if( x ) yield 5; else yield 6;`
Functional loops evaluate to `Enumerable` and each execution of `yield` produces a new element.
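Python generators are a reasonably close analogy for a functional loop that evaluates to an `Enumerable`. The sketch below is only an analogy: in Python the `yield` turns the whole enclosing function into a generator, whereas here the intent is for the loop itself to be the value-producing construct. The function names are made up for illustration.

```python
def squares_of_even(numbers):
    # Stand-in for a functional `for` loop: the loop as a whole evaluates
    # to a lazy sequence, and each `yield` contributes one element.
    for n in numbers:
        if n % 2 == 0:
            yield n * n


def sign_label(x):
    # Stand-in for the functional `if`: both branches produce a value.
    return "non-negative" if x >= 0 else "negative"


evens_squared = list(squares_of_even([1, 2, 3, 4]))
```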

The standard library offers various monads like `Optional`, `Try`, and other common functional goodies.
The standard collections support fluent constructs.

The functional constructs are like Scala's collections, which means that they are somewhat like C#'s LINQ and not like Java's collection streams.
There is no support in the standard collections for parallelization.

No such thing as the `ref` or `out` parameters of C#.

However:

No functional Nazism.

No obstacles to having mutable state, other than having to use an extra keyword here and there.
A proper `for` loop.

Even the functional version of the `for` loop is a first-class language construct, not yet another higher order function.
Thus, when single-stepping through code, you do not have to remember to use step-into instead of step-over in order to skip the header of the loop and reach the body of the loop.

Proper `break` and `continue` keywords.
Freedom to re-assign parameters.

Thus making the original value inaccessible.
To allow this, the `mutable` keyword must be added to the parameter.
The `mutable` keyword on a parameter has no meaning for the caller of the method, and therefore does not become part of the method prototype.

Everything that can be accomplished functionally can also be accomplished imperatively.

No functional gimmicks.

The expression evaluated last within a function does not magically become the return value of the function without a `return` statement; `return` statements cannot simply be omitted. Same for `yield` statements.
No copy-on-mutation collections.
No such thing as Scala's `Unit`. Two approaches are possible:

We maintain a clear distinction between functions and procedures, in which case `Unit` is unnecessary just as `void` is unnecessary.
Everything is a function, but instead of `Unit` we stick to good old familiar `void`, which now becomes an actual data type of which there exists only one instance.

Normally, the instance of void should never need to be accessed, (and therefore might not even be accessible,) because it is implied when necessary. For example, the statement `return` is equivalent to `return void.instance`.

The compiler makes a very clear distinction between statements and expressions.

A block scope consists of statements.
Statements and expressions are not interchangeable:

A statement may contain expressions, but an expression may not contain statements.
An expression cannot appear in place of a statement.
A statement cannot appear in place of an expression.

Most languages allow invoking a function and ignoring its return value; we put an end to that abhorrent malpractice.
When a statement is expected, and we use something which yields a value, that value must be dealt with, in order to be left with a statement and not an expression.
The language might provide a mechanism for ignoring a value, (perhaps a cast to void?) but this can also be accomplished by invoking a void-returning method which accepts one parameter and just ignores it.
Assignment is a statement, and it requires the use of the `let` keyword, as in `let a = 5;` unless a field or local is being declared and initialized at once, in which case the `let` keyword is omitted, as in `local a = 5;` This has some drawbacks and some benefits:

Drawback: We cannot initialize multiple variables in one go, as in `let a = b = c = 5;` because everything after the first `=` must be an expression. That's inconsequential, perhaps even arguably a benefit.
Drawback: We cannot assign and compare in one go, as in `if( ( let a = f() ) > 5 )...` because assignment is a statement, so it cannot be used inside an expression. That's inconsequential, perhaps even arguably a benefit.
Benefit: since the compiler can always tell whether it is compiling a statement or an expression, it can treat certain things differently depending on whether they appear in a statement or an expression. Namely, the equals sign can now be used either in a statement, as the assignment operator, or in an expression, as the equality check operator.
Thus, after so many decades, we can finally say good-bye to the inelegant double-equals (`==`) legacy of C, and start using the single equals sign for equality comparison, as it was always meant to be.
The inequality operator can either stay as `!=` or become `<>`.

The prefix and postfix increment operators are problematic because they are expressions with side-effects, (they both mutate an existing value and yield a new value,) so we might disallow them, and require the use of the long form instead: `let x = x + 1;`

If we keep them, then they will certainly only be allowed in expressions.
(You could make it a statement with `(void) x++;` but why would you?)

We keep `static` as in Java and avoid Scala's inelegant companion objects.
When declaring a lambda, the keyword `function` must be used.
When declaring a tuple, the keyword `tuple` must be used.

Everything is private by default, unless explicitly given a higher visibility.

Therefore, the language does not have a keyword to indicate that something is private.
Note that this also applies to interface methods: if you want an interface method to be public, you have to declare it as public, otherwise it stays private and may only be invoked from other methods of the same interface.

Everything is non-inheritable by default, unless explicitly declared as inheritable.

(Except for interfaces, which are by definition inheritable.)
Therefore, the language does not have a keyword to indicate that something is non-inheritable (sealed in C#, final in Java.)
Note that this also applies to interface methods: if you want an interface method to be overridable, you have to declare it as overridable.
This makes certain other rules unnecessary, for example we do not have to stipulate that it is an error to explicitly declare a method as non-overridable in a class which has already been declared as non-inheritable.
It is an error to declare something as inheritable and fail to ever inherit from it.

This is enforceable because inheritance is confined within a module, so all members of an inheritance hierarchy are known during the compilation of the module.

Emphasis on purity.

There are two ways we can go about this, and which way we will go is yet to be decided.

Procedures and functions

A method can be either a procedure or a function.
A procedure:

Does not return anything.
Is impure. (Must have at least one side-effect.)
Can indicate failure only by means of throwing an exception.

A function:

Returns something.
Can indicate failure either by throwing an exception or by returning a `Try` monad.
Is pure. (Must have no side-effects.)

Experimental idea: the keyword `method` can be used to denote a higher order method which is either a procedure or a function depending on whether its parameter is a procedure or a function.

It must have a parameter declared as `method` instead of the more specific `procedure` or `function`.
It may have additional parameters that are explicitly `function` or `procedure`.
It must treat its parameter method as a function, meaning that when it invokes that method, it must obtain a return value from it.
It can be coded as a function, meaning that it can return that value.
From the point of view of the caller, it behaves either as a procedure or as a function depending on whether the caller passes a procedure or a function to its method parameter.
The caller may actually pass yet another method to it, in which case the caller is in turn a method instead of a procedure or function.
Such a construct would eliminate the need to declare both a function and a procedure for each higher order operation, and at the same time avoid the inelegance of `Unit`.

Pure and impure methods

All methods are functions.
Methods that have nothing to return must be declared to return `void`, which is equivalent to Scala's `Unit` in the sense that it is an actual data type of which there exists only one instance.

Thus, void-returning and non-void-returning functions can be treated in exactly the same way in all situations. For example:

From within a `void` function we can use the `return` keyword to return the result of invoking another `void` function.

This in turn means that a single higher order function can operate both on void-returning and non-void-returning functions.
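A small Python sketch of `void` as a real type with a single instance, and of one higher order function that works on both kinds of callee. The class `Void` and the functions `timed`, `log_message`, and `double` are hypothetical names chosen for this illustration.

```python
class Void:
    """Illustrative unit-like type with exactly one instance."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance


calls = []


def timed(function, *arguments):
    # One higher order function serves both kinds of callee, because
    # "nothing" is an ordinary value that can be returned and passed on.
    calls.append(function.__name__)
    return function(*arguments)


def log_message(text) -> Void:
    return Void()


def double(x) -> int:
    return x * 2


a = timed(double, 21)
b = timed(log_message, "hi")
```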

Impure methods must be explicitly marked with the `impure` keyword.
An impure method may return either void or non-void.
A pure method must return non-void. (It would not make sense to return void, because it cannot perform any side-effects, so its sole reason for existence is to return something.)

In all cases:

A pure method / function:

May not assign to any field of `this`.
May not invoke any impure methods / procedures on any of its parameters, including `this`.
May not escape an impure interface of any of its parameters, including `this`.

It is okay to escape pure interfaces, since there will be no side-effects.

May still declare and manipulate mutable locals, including the ability to escape mutable locals or impure interfaces thereof.
It would be nice to be able to say that a pure method / function can never throw an exception; however, we cannot do that, because even a pure method / function can, for example, accidentally divide by zero.

Mechanisms are provided whereby purity checks can be suppressed when necessary, in order to allow for functions which, although formally pure, may under the hood modify caches, update statistics, perform diagnostic I/O, etc.

Emphasis on readability, at the expense of terseness when necessary.

Typing is not one of the major problems faced by our profession; unreadable code is.
The language should be suitable for universities to teach, so unlike Scala, it needs to have a low entry barrier.
All language keywords are fully spelled out and avoid unnecessary technicalities.

No inelegant abbreviations like `fun`, `def`, `mut`, etc;
A function is denoted by `function`. (Duh!)
A field is denoted by `field`. (Duh!)
A mutable field is denoted by `mutable field`. (Duh!)
A local is denoted by `local`. (Duh!)
A mutable local is denoted by `mutable local`. (Duh!)
The Boolean type is `boolean`, not `bool`.
The Integer type is `integer`, not `int`.

Nobody will ever have to type `i`, `n`, `t`, `e`, `g`, `e`, `r`, because any halfway decent code editor will give you `integer` if you just type `i`, hit `Ctrl+Space` to open up auto-completion, and then `Enter` to pick the first suggestion.

The Long Integer type is `long integer`, not `long`.
The 64-bit IEEE floating point type is `real`, not `double`.
The 32-bit IEEE floating point type is `short real`, not `float`.

In general, the language aims to reduce the number of parentheses.

Expressions may not be parenthesized, only sub-expressions may.

So, the popular construct `return (result)` is not just redundant; it is actually a compiler error.

In general, the language favors words over punctuation, so:

Inheritance by means of `extends` and `implements` keywords as in Java instead of the `:` character of C#.
Fully spelled out `for each a in b do` like C# instead of the `for( a : b )` of Java.
Boolean operators are words, like Pascal and Python and unlike the C family.

i.e. the operators are `and`, `or`, and `not` instead of `&&`, `||`, and `!`.

The compiler handles boolean operators, applying operator precedence and short-circuit evaluation.
The compiler maps all other operators to method calls, (observing operator precedence rules,) as follows:

`a + b` maps to `a.Plus( b )`.
`a - b` maps to `a.Minus( b )`.
`a * b` maps to `a.Times( b )`.
`a / b` maps to `a.Per( b )`.
`a % b` maps to `a.Modulo( b )`.
`a ^ b` maps to `a.Power( b )`.
`a = b` maps to `a.Equals( b )`.
`a < b` maps to `a.Below( b )`.
`a > b` maps to `a.Above( b )`.
`a != b` maps to `not a.Equals( b )`. (*)
`a <= b` maps to `not a.Above( b )`. (*)
`a >= b` maps to `not a.Below( b )`. (*)
`-a` maps to `a.Negative`.
`~a` maps to `a.TwosComplement`.
`++a` maps to `a.PreIncrement()`.
`a++` maps to `a.PostIncrement()`.

So, when we code `a + b`, this will only compile if the type of `a` has a function called `Plus` with a parameter of the type of `b`.
(*) These negations are meant to save us from having to have negative forms of the functions; I think they are okay; it remains to be seen if there are situations where this will not work. NaN comes to mind as a possible pitfall, but then again a comparison against NaN should perhaps throw an exception.
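The NaN pitfall can be made concrete with a short Python sketch. The `Real` class below is an illustrative stand-in that exposes only the positive comparison methods, and the three helper functions play the role of the compiler's negation-based mappings.

```python
import math


class Real:
    """Illustrative value type exposing only the positive comparison methods."""

    def __init__(self, value):
        self.value = value

    def Equals(self, other):
        return self.value == other.value

    def Below(self, other):
        return self.value < other.value

    def Above(self, other):
        return self.value > other.value


def not_equal(a, b):         # inequality, derived by negating Equals
    return not a.Equals(b)


def at_most(a, b):           # less-than-or-equal, derived by negating Above
    return not a.Above(b)


def at_least(a, b):          # greater-than-or-equal, derived by negating Below
    return not a.Below(b)


one, two, nan = Real(1.0), Real(2.0), Real(math.nan)

ordinary = (at_most(one, two), at_least(one, two), not_equal(one, two))
derived_nan = at_most(nan, one)      # True under the negation mapping
ieee_nan = math.nan <= 1.0           # False under IEEE 754 comparison rules
```

For ordinary values the derived operators agree with the usual ones, but for NaN the negation mapping reports "NaN is at most 1" while IEEE 754 says it is not. Throwing an exception on any comparison against NaN would make the discrepancy unobservable.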

Preference towards having only one way for any given thing.

When multiple ways of accomplishing the same thing are conceivable, the language design tries, when possible, and when it makes sense, to make a specific choice and prohibit all other ways. For example:

When it is unnecessary to qualify an instance member with `this`, it is an error to qualify it.
When it is unnecessary to qualify a static member with the class name, it is an error to qualify it.
When the body of the "then" part of an `if` statement never falls through (because it ends with either a `return` or a `throw` statement) it is an error to use the `else` keyword.

Encapsulation:

A nested scope has access to private members of the enclosing scope.
The enclosing scope never has access to private members of nested scopes.

Note that this corrects the insanity of Java which allows an enclosing class to have access to private members of nested classes. (Duh!?)

When a source file declares a namespace as public, only the classes in that source file are exported.

This stipulation is necessary since multiple source files may declare a namespace, but only some of those source files might declare the namespace as public.

A module may expose interfaces, enums, records (value types), and classes. However, when a module exposes a class, what actually gets exposed is only the interface of that class, not the class itself. In other words, the language will never expose across modules the constructor of a class, nor its protected methods. This has some very interesting implications:

All classes participating in an inheritance hierarchy must be defined within a single module: One cannot extend a class defined in another module.
All classes participating in an inheritance hierarchy are known during the compilation of the module that contains the hierarchy.

This allows for certain useful optimizations.

The creation of a new instance of a class defined in another module cannot be accomplished by invoking a constructor; it can only be accomplished via a factory method.

Memory management: Reference counting instead of garbage-collection.

The memory model looks a lot like the memory model of Java and C#:

The heap consists of big chunks of memory that are allocated from the operating system at once. The runtime does its own memory management within these chunks, for efficiency.
Objects are actually pointers to objects that live on the heap.
Pointers cannot be manipulated as they can in C++.
Value types live either in local storage or as members of other types.
When necessary, value types can be treated as reference types by means of boxing.

Pointers are implemented as smart (shared) pointers, so that:

There is no need for garbage collection.
There is no need for each object to have its own lock.
There is no need for finalization.
There are no preposterous situations like object resurrection.
There are fewer sources of randomness and non-determinism in the memory layout and in the responsiveness of the code.
Destruction is assured and immediate the moment an object ceases to be referenced.
Destruction involves real destructors as in C++.
While a destructor executes, all objects referenced by the object being destructed are guaranteed to still be present and alive. (Unlike garbage-collected languages, where finalizers have to cope with the fact that some of the referenced objects may have already been collected.)

The reference count is accommodated in the object itself, so smart pointers can be appreciably more lightweight than in C++.
The runtime may choose to implement smart pointers using double indirection, so as to be able to perform memory defragmentation.
Addressing the pitfalls of reference counting:

Reference counting suffers from two pitfalls:

Long reference chains:

May result in stack overflow when disposed.

Circular references:

Result in memory leaks.

We address these pitfalls as follows:

Long reference chains:

We solve this by making destructors deliberately fail if they are ever re-entered, so that we can detect the deallocation of even the smallest chain that consists of only two nodes. The programmer can then modify their code to do one of the following:

Manually perform the destruction of the chain in a way that avoids recursion.
Refactor things so that the objects are kept in a collection instead of forming an ad-hoc chain.
Explicitly unlink and destroy the chain using the `delete chain` keyword, which works in a non-recursive way.
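The following Python sketch models both halves of this idea: a destructor that deliberately fails when re-entered, and a non-recursive routine in the spirit of `delete chain`. Python has its own memory management, so destruction is simulated with a `destroyed` flag; `Node`, `destroy`, `delete_chain`, and `build_chain` are names invented for the sketch.

```python
class DestructorReentered(Exception):
    pass


class Node:
    def __init__(self, next_node=None):
        self.next = next_node
        self.destroyed = False


_destructor_active = False


def destroy(node):
    # Naive destruction: releasing `node` releases `node.next` from inside
    # the destructor. The guard makes even a two-node chain fail loudly.
    global _destructor_active
    if _destructor_active:
        raise DestructorReentered()
    _destructor_active = True
    try:
        node.destroyed = True
        if node.next is not None:
            destroy(node.next)
    finally:
        _destructor_active = False


def delete_chain(head):
    # Non-recursive alternative: unlink each node before destroying it,
    # so no destructor ever triggers another one.
    count = 0
    while head is not None:
        following = head.next
        head.next = None
        destroy(head)
        head = following
        count += 1
    return count


def build_chain(length):
    head = None
    for _ in range(length):
        head = Node(head)
    return head


try:
    destroy(build_chain(2))
    reentered = False
except DestructorReentered:
    reentered = True

deleted = delete_chain(build_chain(100_000))
```

The two-node chain already trips the guard, which is the point: the problem is detected on small test data instead of surfacing as a stack overflow on large production data.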

Circular references:

A debug-time-only mark-sweep checker that runs on its own thread detects leaked cyclic object graphs and warns the programmer about them. (It does not attempt to fix anything.) The programmer can then modify their code to do one of the following:

Break any cycles in the graph before unlinking it.
Explicitly unlink and destroy the cyclic graph using the `delete cyclic` keyword, which gracefully handles cyclic object graphs.
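A report-only checker of this kind could work roughly as follows. The Python sketch assumes that the runtime can enumerate all live objects and all roots; `Obj` and `find_leaked` are hypothetical names, and the threading aspect is left out.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.references = []


def find_leaked(all_objects, roots):
    # Mark phase: everything reachable from the roots is legitimately alive.
    marked = set()
    pending = list(roots)
    while pending:
        current = pending.pop()
        if id(current) in marked:
            continue
        marked.add(id(current))
        pending.extend(current.references)
    # "Sweep" phase: report, but never free, whatever was not reached.
    return sorted(o.name for o in all_objects if id(o) not in marked)


a, b, c, d = Obj("a"), Obj("b"), Obj("c"), Obj("d")
a.references.append(b)      # root -> a -> b: alive
c.references.append(d)      # c <-> d: a cycle that nothing else refers to
d.references.append(c)

leaked = find_leaked([a, b, c, d], roots=[a])
```

Under reference counting, an object that is still allocated yet unreachable from any root can only be kept alive by a cycle, so everything this checker reports deserves a warning.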

These means of addressing the pitfalls of reference counting are not perfect, so some extra maintenance will sometimes be required. For example, we might think that we are properly handling all cyclic object graphs, but as a result of a change somewhere, we may now discover that we have a new cyclic object graph, which we must deal with. Still, the extra trouble is expected to be rare, and it is expected to be very well worth all the trouble we save by not having to have a garbage collector.

Of interest: https://verdagon.dev/blog/hybrid-generational-memory

Constructor syntax like Scala.

Constructor parameters in the class header.
Constructor code in the class body. (With the additional restriction that it must all appear up-front.)
Additional constructors by means of static factory methods.
Any constructor parameters that are referenced by methods automatically become fields so that we do not have to declare extra fields and initialize them from the parameters.

Strong distinction between release runs and debug runs.

(But not necessarily different builds; optimization is a JIT concern.)

Externally supplied constant values.

A special type of constant can be defined, whose value is not specified in the source code, and must instead be supplied later:

During compilation, by means of a special parameter to the compiler, or
At runtime, by means of a special parameter to the launcher.

These constants are better than C-style "manifest constants" and C#-style "defined symbols" because they are well defined, strongly typed, mandatory, and obey normal static immutable field rules. This means that:

It is possible to know the set of all external constants that must be defined in order to compile and run something.
An attempt to compile or run something without supplying all external constant values will always result in an error.
An attempt to supply an external constant value for a non-existent external constant will always result in an error.
Each externally supplied constant value must be of the correct type expected by the constant declared in the code.
When using external constants for conditional compilation, the code paths that are not selected will result in no code being generated, but must still pass compilation, so there is no danger of code rot.
With some help from the loader we can write tests that exercise code under different values for runtime-supplied external constants.
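The validation rules for external constants can be sketched in Python as a single binding step. The function name `bind_external_constants`, the error class, and the example constants `MaxUsers` and `Verbose` are all hypothetical.

```python
class ExternalConstantError(Exception):
    pass


def bind_external_constants(declared, supplied):
    # `declared` maps constant names to their required types;
    # `supplied` maps names to the values given to the compiler or launcher.
    missing = sorted(set(declared) - set(supplied))
    if missing:
        raise ExternalConstantError(f"missing: {missing}")
    unknown = sorted(set(supplied) - set(declared))
    if unknown:
        raise ExternalConstantError(f"unknown: {unknown}")
    for name, expected_type in declared.items():
        # bool is a subclass of int in Python, hence the exact type check.
        if type(supplied[name]) is not expected_type:
            raise ExternalConstantError(f"wrong type: {name}")
    return dict(supplied)


declared = {"MaxUsers": int, "Verbose": bool}


def outcome(supplied):
    try:
        bind_external_constants(declared, supplied)
        return "ok"
    except ExternalConstantError as error:
        return str(error)
```

Contrast this with C-style manifest constants, where a misspelled or missing symbol silently selects a different code path.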

Integer types:

Fixed Integer types with explicitly defined sizes, as per C#.
Flex integer types whose size is determined by the runtime according to what is most efficient for the underlying hardware architecture.

Each flex integer has a "Guaranteed Width", which is the minimum width that this integer is guaranteed to have on any hardware architecture. These widths are:

8 bits for `tiny integer`
16 bits for `short integer`
32 bits for `integer`
64 bits for `long integer`

On debug runs, the runtime checks all operations on flex integers, and if there is an overflow past the guaranteed width, a runtime exception is thrown. Thus, we ensure consistent flex integer behavior on any architecture.
This corrects the narrow-mindedness of C# where `int` has been defined to be exactly 32 bits long, even on architectures with a larger machine word size. (Which is pretty much all major architectures today, now that 64-bit is the norm.)
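The debug-run check on flex integers might look like the Python sketch below. It assumes signed types and models only addition; `checked_add` and `FlexOverflow` are invented names, and the `debug` flag stands in for the distinction between debug runs and release runs.

```python
class FlexOverflow(Exception):
    pass


GUARANTEED_BITS = {
    "tiny integer": 8,
    "short integer": 16,
    "integer": 32,
    "long integer": 64,
}


def checked_add(type_name, a, b, debug=True):
    # Signed range implied by the guaranteed width, independent of the
    # width the runtime actually picked for the hardware.
    bits = GUARANTEED_BITS[type_name]
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    result = a + b
    if debug and not (low <= result <= high):
        raise FlexOverflow(f"{type_name}: {result} exceeds {bits} bits")
    return result


largest = (1 << 31) - 1
within_range = checked_add("integer", largest - 1, 1)
try:
    checked_add("integer", largest, 1)
    overflowed = False
except FlexOverflow:
    overflowed = True
```

Because the check is against the guaranteed width rather than the actual width, code that passes on a 64-bit machine during a debug run will behave identically on a 32-bit one.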

Full set of signed and unsigned integers as per C#, both for the fixed and flex flavors.
Exceptions

Lightweight exceptions that are inexpensive to throw and to catch.
No such thing as the "checked" exceptions of Java.
No extra baggage:

The base `Exception` class does not even have a "message", let alone a "localized message".

The `ToString()` method of the base `Exception` class:

Is not overridable.
Yields a string consisting of the class name of the exception followed by the name and the string representation of the value of each one of its fields, obtained using reflection.
If you want an exception to result in a human-readable error message that you can actually show to an end user, you have to accomplish this entirely by yourself. (Please make sure to do this in the end-user's native language, which, statistically speaking, is unlikely to be English.)
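In Python, a reflection-based string representation of this kind takes only a few lines. `BaseError` and `FileTooLargeError` are illustrative names; note that Python cannot prevent subclasses from overriding `__str__`, so the "not overridable" rule is not modeled here.

```python
class BaseError(Exception):
    def __str__(self):
        # Class name followed by every field name and value, found by reflection.
        fields = ", ".join(f"{name}={value!r}" for name, value in vars(self).items())
        return f"{type(self).__name__}({fields})"


class FileTooLargeError(BaseError):
    def __init__(self, path, size, limit):
        super().__init__()
        self.path = path
        self.size = size
        self.limit = limit


text = str(FileTooLargeError("/tmp/a.bin", 2048, 1024))
```

The exception carries structured data instead of a pre-formatted message, so a developer gets everything needed for diagnosis, and nothing that pretends to be suitable for an end user.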

Standard Collections Model

The standard language runtime provides the following:

An assortment of unmodifiable collection interfaces: `Enumerable`, `Collection`, `List`, `Map`, etc.
`Enumerable` exposes a property for accessing the current element, and separate methods for checking whether there exist more elements and for advancing to the next element, as in C#.
A `Collection` is an `Enumerable` with a length and the ability to check whether it contains a certain element, as in Java.
`Map` is also a collection of `Map.Entry`.

This is as in C#, where a `Dictionary` is a collection of `KeyValuePair`.
This is unlike Java, where `Map` is not a collection, and in order to obtain the collection of entries you must invoke `Map.entrySet()`.

Factory methods create immutable collection classes implementing the unmodifiable collection interfaces.
An assortment of "rigid" (i.e. mutable, but structurally immutable) interfaces which extend the unmodifiable interfaces adding methods to replace existing items but no methods to add or remove items: `RigidEnumerable`, `RigidCollection`, `RigidList`, `RigidMap`, etc.
An assortment of mutable collection interfaces which extend the rigid interfaces adding add/remove/clear methods: `MutableEnumerable`, `MutableCollection`, `MutableList`, `MutableMap`, `Queue`, `Stack`, etc.
A `MutableCollections` factory exposing methods that create mutable collection classes implementing the mutable collection interfaces.
The `Values` collection of a mutable map returns a `RigidCollection` of map values, so that:

You can replace an element in this collection, which will have the side-effect of associating an existing key with a new value.
You cannot add an element to this collection, which makes sense because you have no means of specifying the key that should map to the newly inserted value.

The method for adding an item to a collection is called 'Add', not 'Push'.

For consistency, even the `Stack` collection exposes an `Add` method, not a `Push` method.
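To make the unmodifiable / rigid layering a bit more concrete, here is a rough Python sketch of the `Values` collection of a mutable map. The method names `contains` and `replace`, the class name `MapValues`, and the use of a plain dictionary as backing store are all assumptions made for illustration, not part of any specification.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator


class Collection(ABC):
    """Unmodifiable interface: enumeration, length, and containment only."""

    @abstractmethod
    def __iter__(self) -> Iterator[object]: ...

    @abstractmethod
    def __len__(self) -> int: ...

    def contains(self, item: object) -> bool:
        return any(element == item for element in self)


class RigidCollection(Collection):
    """Adds replacement of existing items, but no way to add or remove items."""

    @abstractmethod
    def replace(self, old: object, new: object) -> None: ...


class MapValues(RigidCollection):
    """Rigid view over the values of a map (a plain dict in this sketch)."""

    def __init__(self, backing: Dict[object, object]) -> None:
        self._backing = backing

    def __iter__(self) -> Iterator[object]:
        return iter(self._backing.values())

    def __len__(self) -> int:
        return len(self._backing)

    def replace(self, old: object, new: object) -> None:
        # Associates the first key that currently maps to `old` with `new`.
        for key, value in self._backing.items():
            if value == old:
                self._backing[key] = new
                return
        raise KeyError(old)
```

Replacing a value through the view changes the map underneath, while the view offers no operation that would need a key to be invented.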

Collaboration between the language runtime and collections:

The for-each loop operates on `Enumerable`.

The loop variable can be reassigned, causing the current element of the Enumerable to be replaced with a new value. In this case, the for-each loop requires a `RigidEnumerable`.
A special keyword allows removing the current item, in which case the for-each loop requires a `MutableEnumerable`.
Since we have proper destructors, there is no need for special handling of disposable enumerators. (Something which C# provides, but Java lacks.)

An array literal evaluates to an instance of `RigidList`, so the language is free from arrays, like Scala.

Heavy promotion of assertions and plenty of built-in extra error-checking on debug runs, such as:

Arithmetic checking

An exception is thrown when any of the following occurs:

Division by zero.
Fixed integral type overflow. (This can be selectively suppressed on an individual expression basis as with the "unchecked" keyword of C#.)
Flex integer guaranteed width overflow.
(Possibly) Operations on NaNs.
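As an illustration of the overflow check and of its selective suppression, here is a small Python sketch for a 32-bit signed integer. Python integers never overflow, so the range check is written out by hand; the function and exception names are hypothetical.

```python
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1


class ArithmeticOverflow(Exception):
    """Hypothetical exception carrying the operands that overflowed."""

    def __init__(self, left: int, right: int) -> None:
        super().__init__()
        self.left = left
        self.right = right


def checked_add_int32(left: int, right: int) -> int:
    """Addition as it would behave by default: overflow throws."""
    result = left + right
    if not INT32_MIN <= result <= INT32_MAX:
        raise ArithmeticOverflow(left, right)
    return result


def unchecked_add_int32(left: int, right: int) -> int:
    """Addition with the check suppressed: two's complement wrap-around."""
    return (left + right + 2 ** 31) % 2 ** 32 - 2 ** 31
```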

Throwing Switches

If the switch data type is exhaustively switchable (e.g. boolean):

It is an error if not all cases are covered and no default case is provided.
It is an error if all cases are covered and a default case is provided.

If the switch data type is not exhaustively switchable (e.g. integer):

If no default case is provided, an implicit default case is supplied by the compiler which throws an exception.
This plays nicely with code coverage: no more uncoverable assertions in unreachable default clauses.
If you want a switch statement with default case fall-through on a non-exhaustively switchable type, add an empty `default` case. (Duh!)
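The behavior for non-exhaustively switchable types can be imitated with a helper function. This is only a sketch in Python; `switch` and `UnhandledSwitchValue` are made-up names, and the exhaustiveness checks for types like boolean are compile-time rules that this runtime sketch does not attempt to model.

```python
from typing import Callable, Dict, Optional


class UnhandledSwitchValue(Exception):
    """Hypothetical exception thrown by the implicit default case."""

    def __init__(self, value: object) -> None:
        super().__init__()
        self.value = value


def switch(
    value: object,
    cases: Dict[object, Callable[[], object]],
    default: Optional[Callable[[], object]] = None,
) -> object:
    handler = cases.get(value, default)
    if handler is None:
        # No case matched and no default was given: the implicit default throws.
        raise UnhandledSwitchValue(value)
    return handler()
```

Passing `default=lambda: None` plays the role of the empty `default` case: unmatched values then fall through silently.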

Big on warnings and errors.

Most things traditionally thought of as warnings are errors.
Most checks of the kind that IntelliJ IDEA calls "inspections" are built into the language as warnings, many of them even as errors.
Selective warning suppression only; no bulk suppression.

Warning suppression is possible only on the individual statement where the problem occurs, never on a larger scope.

Warnings always cause compilation to fail.

It is as if a "treat warnings as errors" option is always on and cannot be turned off.
The difference between warnings and errors is not that you can ignore warnings and proceed to run; the difference is that a warning can be suppressed, whereas an error cannot.
Furthermore, the language designates a message as a warning or an error based not on its severity, but instead on whether the programmer can reasonably be required to fix it or not.

If it is reasonable to require the programmer to fix it, then the programmer had better fix it, so there is no need to be able to suppress it, so it is an error.
If it is unreasonable to require the programmer to fix it, then the programmer should be able to suppress it, so it is a warning.

For example:

If you have an unused import statement, you can very easily remove that import statement, so it is reasonable to require you to fix it. Therefore, the "unused import" message is an error.
If you have marked something as deprecated, and yet you must still make use of it in a couple of places until the day that it gets completely removed, then you have no way of fixing this problem, therefore you must be allowed to suppress it, therefore the "use of deprecated symbol" message is a warning. You will, however, have to explicitly suppress that warning on each and every usage of that symbol.

A warning suppression on a statement that does not actually produce a warning is an error.
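The rules above boil down to a small decision procedure, sketched here in Python. The `Diagnostic` record, the diagnostic codes, and the idea of identifying a statement by its line number are all simplifying assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Diagnostic:
    line: int
    code: str
    suppressible: bool  # True for a warning, False for an error


def compilation_failures(
    diagnostics: List[Diagnostic], suppressions: Set[Tuple[int, str]]
) -> List[str]:
    """Returns every message that makes compilation fail; empty means success."""
    failures: List[str] = []
    used: Set[Tuple[int, str]] = set()
    for diagnostic in diagnostics:
        key = (diagnostic.line, diagnostic.code)
        if diagnostic.suppressible and key in suppressions:
            used.add(key)  # a suppressed warning does not fail the build
        else:
            kind = "warning" if diagnostic.suppressible else "error"
            failures.append(f"line {diagnostic.line}: {kind}: {diagnostic.code}")
    # A suppression that silenced nothing is itself an error.
    for line, code in sorted(suppressions - used):
        failures.append(f"line {line}: error: needless suppression of {code}")
    return failures
```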

Syntax:

Line-oriented, with scoping dictated by indentation (roughly as in Python) instead of curly braces.

Since it is very difficult (if not impossible) to express indentation rules in a formal grammar, this is handled by the tokenizer:

When the indentation increases, the tokenizer emits a hidden scope-start token.
When the indentation decreases, the tokenizer emits a hidden scope-end token.
The tokenizer also handles line breaking and line joining, so that the parser ends up parsing a C-style language.
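A toy version of such a tokenizer might look as follows. This is a Python sketch under several assumptions: whole lines stand in for real tokens, `{` and `}` stand in for the hidden scope-start and scope-end tokens, indentation is counted in leading tabs, and line splitting and line joining are ignored.

```python
from typing import List

SCOPE_START = "{"
SCOPE_END = "}"


def scope_tokens(source: str) -> List[str]:
    """Turns tab-indented lines into a flat stream with explicit scope tokens."""
    tokens: List[str] = []
    depth = 0
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines carry no scoping information
        indent = len(line) - len(line.lstrip("\t"))
        if indent > depth + 1:
            raise ValueError(f"indentation jumps by more than one level: {line!r}")
        if indent == depth + 1:
            tokens.append(SCOPE_START)
            depth += 1
        while depth > indent:
            tokens.append(SCOPE_END)
            depth -= 1
        tokens.append(line.strip())
    # Close any scopes that are still open at the end of the file.
    tokens.extend([SCOPE_END] * depth)
    return tokens
```

The parser that consumes this stream never sees indentation at all; as far as it can tell, it is parsing a language with braces.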

There are two types of statements: simple and compound.

A simple statement occupies a single line; it may contain expressions, but it may not contain any nested scopes.
A compound statement begins with a simple statement as a header, and is followed by a dependent scope.
A scope contains statements, which may in turn be either simple or compound.
Some constructs that normally correspond to compound statements (e.g. the `if` statement) also come in "expression form".

The `for` loop does not have an expression form, due to the extra complexity of the multiple statements that it contains; however, the `for-each` loop does come in expression form.

Normally, each simple statement must be on a separate line.
To allow joining multiple simple statements in one line, a special line-joining punctuation is used, which is the semicolon.
Therefore, the semicolon is illegal at the end of a line.
Normally, an entire simple statement must be contained within a single line; in other words, a simple statement may not span multiple lines.
To allow splitting a simple statement into multiple lines, a special line-splitting construct is used. This construct is to be determined:

It may be a backslash at the end of the line that is being split into the next.
It may be double the amount of indentation on the next line, signifying that it belongs to the previous one.
It may be both of the above.

Formatting:

The code formatting style of the language is thoroughly and unambiguously defined by an extensive set of code formatting rules.
Some degree of freedom is allowed, but even that is unambiguously controlled by special punctuation that exists specifically for that purpose.
This means that the formatting of a source file is thoroughly, accurately, and deterministically predictable from the language formatting rules and the punctuation present within the file.
This in turn allows code editors that can:

at any moment reflow an entire source file to its proper format, or even:
continuously reflow code, as it is being typed, to its proper format.

This in turn allows a compiler which imposes strict enforcement of the formatting rules, so that the slightest deviation, even by a single space, is a compiler error.
This brings us to the following paradox:

Even though the formatting rules are extremely detailed,
And even though the enforcement of the formatting rules is draconian,
The programmer never has to worry about code formatting, because it is being taken care of automatically.

The benefit of all this is that all code by all programmers will always have the exact same formatting, and yet no programmer will ever have to be bothered with having to type code in a specific way.
(It will also make the language parser slightly faster.)
Some indicative highlights of the formatting rules:

Tabs for indentation

The `tab` character denotes indentation, and may only appear at the beginning of a line; it is prohibited anywhere else.
Only the `tab` character may be used to denote indentation; the use of anything else, including the space character, is an error.
It is an error to have indentation in a line which is otherwise blank.
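These three rules are simple enough to check line by line. Here is a hedged Python sketch of such a check; the function name and the wording of the messages are invented for the example.

```python
from typing import List


def indentation_errors(source: str) -> List[str]:
    """Reports violations of the tab rules, one message per offending line."""
    errors: List[str] = []
    for number, line in enumerate(source.split("\n"), start=1):
        body = line.lstrip("\t")  # what remains after the leading tabs
        if line != body and body == "":
            errors.append(f"line {number}: indentation on a blank line")
        elif body.startswith(" "):
            errors.append(f"line {number}: space used for indentation")
        elif "\t" in body:
            errors.append(f"line {number}: tab after the start of the line")
    return errors
```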

The language defines where a space may and may not appear.

When a space is expected, exactly one space must be given. (For example, right after a comma.)
When zero spaces are expected, exactly zero spaces must be given. (For example, right before a comma.)
Note that this prevents tabular code formatting, which is the practice of inserting spaces to column-align similar parts of consecutive statements.

That is okay, because tabular code formatting is a bad idea anyway, since it is a source of needless git merge conflicts.
In any case, if some folks really need tabular code formatting, they can achieve it via spacing comments ( /* */ ).

The language strictly defines when and how blank lines may be used. For example:

There must never be two consecutive blank lines anywhere, at all, under any circumstances, for any reason, ever.
There must always be exactly one blank line before a block comment. (Even a single-line block comment.)

If you want a comment without a blank line, then use a line comment instead of a block comment.

There must never be a blank line anywhere else, including:

Between method definitions.

This allows us to define whole groups of single-line methods without wasting a lot of screen real estate.
If you want blank lines between method definitions, add a block comment before each method definition; thus, a blank line will be mandatory before the block comment.

Between lines of code.

Most programmers have the habit of using blank lines within method bodies, to separate logical groups of lines of code. This is bad practice, because only the programmer who wrote the code knows why those lines form a separate group and why that group should stand out from the rest.
If you have multiple conceptually distinct groups of lines of code within a single method, then either:

Add block comments explaining what each group does, (in which case a blank line before the block comment is mandatory,) or
Move each group into a separate function, and give the function a descriptive name.

The language supports functions nested within functions, so you can do this without polluting the namespace of the class.
The language uses no curly braces, so you will not be wasting a lot of screen real estate in doing so.

Between class definitions.

This allows us to define whole groups of single-line classes without wasting a lot of screen real estate. Admittedly, single-line classes are rare, so let's just say that this rule exists just for consistency.

Special formatting punctuation allows overriding language default formatting rules on a case-by-case basis. For example:

A "line splitter" is a special punctuation character which allows splitting a construct into multiple lines when the language formatting rules would have normally required that construct to be all in one line.

For example: the language formatting rule for expressions is that an expression must fit in one line; so, if an expression needs to span multiple lines, a line splitter must be used to indicate precisely at which point the expression is to break into the next line.
The use of a line splitter in a place where it is not required is an error.

The "line joiner" is a special punctuation character which allows a construct to appear all in one line when the language formatting rules would have normally required that construct to be split into multiple lines.

For example: the language formatting rule for methods is that the body of the method must be on a separate line from the prototype. So, if a very short method needs to fit entirely in one line, a line joiner can be used to allow this.
The use of a line joiner in a place where it is not required is an error.

Capitalization

The language is case sensitive, and capitalization matters a lot more than in other languages.
Identifier casing must be one of the following:

lowercase
SentenceCase
kebab-case
SentenceKebab-case

Note that kebab-case is possible because the language mandates spacing around operators, so there is no possibility to confuse an identifier containing a dash with the dash operator between two identifiers.
The following are expressly disallowed:

The dash as first or last character of an identifier.
camelCase.
UPPERCASE and SCREAMING-KEBAB-CASE.
Two or more consecutive capital letters.

For an explanation why, see the following section about spell-checking.
Separate capital letters with dashes; for example, `XSpacing` is not allowed, but `X-Spacing` is fine.
Do not use acronyms; use either:

fully spelled out words, e.g. "GraphicalUserInterfaceStyle", or
words that replace acronyms, e.g. "GuiStyle".

Underscores and all forms of snake_case. (Though an underscore alone might act as a special identifier, or special punctuation, to be determined.)

This is because we support kebab-case, and snake_case does not look sufficiently different from kebab-case.
Kebab-case is preferable to snake_case because on most keyboards the dash is slightly easier to produce than the underscore, because it does not require Shift.
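The casing rules can be condensed into a pattern. The following Python sketch is one possible reading of them, with two assumptions that the rules above do not spell out: digits are allowed after the first letter of a part, and each dash-separated part may independently be lowercase or SentenceCase.

```python
import re

# One dash-separated part: either all lowercase, or one or more capitalized words.
_PART = r"(?:[a-z][a-z0-9]*|(?:[A-Z][a-z0-9]*)+)"
_IDENTIFIER = re.compile(rf"{_PART}(?:-{_PART})*")
_CONSECUTIVE_CAPITALS = re.compile(r"[A-Z]{2}")


def is_valid_identifier(name: str) -> bool:
    """True if `name` uses one of the permitted casing styles."""
    if _CONSECUTIVE_CAPITALS.search(name):
        return False
    return _IDENTIFIER.fullmatch(name) is not None
```

Under this reading, `X-Spacing` and `GuiStyle` are accepted, while `XSpacing`, `camelCase`, and `snake_case` are rejected.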

Some capitalization rules apply to language constructs and are enforced by the compiler:

All names of types and namespaces must start with an uppercase letter.
All public and protected member names must start with an uppercase letter.
All private members must start with a lowercase letter.
All local and parameter names must start with a lowercase letter.

Spell Checking

The language comes together with a spell-checking dictionary, the contents of which are part of the language specification.
A module can have a supplemental user-defined spell-checking dictionary file which:

Is meant to be committed to source control.
Is meant to undergo code review just as any other source file.

The compiler spell-checks source code and issues a warning if it encounters any unrecognized words.

Specifically, the compiler will issue a warning when any of the following fails to pass spell-check:

Any part of an identifier.
A word inside a string literal.
A word in a comment, unless it is markup referring to an identifier.

For the purpose of spell-checking, identifiers are broken into parts based on SentenceCase and kebab-case boundaries, as well as boundaries between letters and digits. This means that:

"CryptoGraphy" will not pass spell-check unless "graphy" has been added to the spell-checker. (It shouldn't; it is not an English word; use "Cryptography" instead.)
"Mousepointer" will not pass spell-check unless `mousepointer` has been added to the spell-checker. (It shouldn't; it is not an English word; use "MousePointer" instead.)

A warning for a misspelled identifier is issued only at the point of definition and not on each occurrence of the identifier, so that:

You only see the warning once, not five hundred times.
There is no warning at all for identifiers that you have no control over, due to them being defined in external modules. In other words, a module does not have to duplicate the spelling dictionaries of external modules, nor does a module have to ship with its spelling dictionary.

Two or more consecutive capital letters are disallowed because:

Each individual capital letter acts as a word delimiter, so it constitutes a word by itself.
To allow for single-letter variables, every individual letter passes spell-check.
So, a word made of capital letters circumvents the spell-checker.
The language does not allow circumventing the spell-checker.

(One day someone will inevitably submit a feature request for some means of disabling the spell checker; the answer they will receive is that they do not have to use this language; there are so many other languages to choose from.)
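The way identifiers are broken into words for spell-checking can be sketched in a few lines of Python. The function names and the tiny dictionary used in the usage note are assumptions of the sketch; digits are treated as separators that never need spell-checking.

```python
import re
from typing import List, Set


def identifier_words(identifier: str) -> List[str]:
    """Splits at capital letters, dashes, and letter/digit boundaries."""
    # A word is one letter followed by lowercase letters; digit runs are kept apart.
    parts = re.findall(r"[A-Za-z][a-z]*|[0-9]+", identifier)
    return [part.lower() for part in parts]


def misspelled(identifier: str, dictionary: Set[str]) -> List[str]:
    """Words of the identifier that fail spell-check."""
    return [
        word
        for word in identifier_words(identifier)
        # Every individual letter passes, to allow single-letter variables.
        if len(word) > 1 and not word.isdigit() and word not in dictionary
    ]
```

With a dictionary containing "crypto", "cryptography", "mouse", and "pointer", `misspelled("CryptoGraphy", ...)` reports "graphy" and `misspelled("Mousepointer", ...)` reports "mousepointer", whereas "Cryptography" and "MousePointer" pass. The sketch also shows why consecutive capitals must be forbidden: each capital letter becomes a one-letter word, and one-letter words always pass.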

Comments

Special formatting within comments is achievable with the use of `Markdown` as opposed to HTML or any ad-hoc syntax.
This special formatting is available in all comments, not just doc-comments.
Some extensions to markdown are necessary in order to specify relationships between code.

For example, when defining a link, one can omit the part within the parentheses, in which case the part within the square brackets is expected to be a resolvable symbol, and the resulting link points to that symbol.
The syntax for specifying the symbol requires no gimmicks like the hash-sign which is needed in Java's doc-comments to separate the type name from the member name.
If the symbol is not fully qualified then there must be an import statement for that symbol somewhere within the source file.
The use of a symbol in a comment is enough to prevent the corresponding import statement from being flagged by the compiler as unused.

Possibly: allow the comment that describes a parameter to be placed with the parameter itself.

Inheritance

A class may extend only one other class but implement any number of interfaces.
The only difference between a class and an interface is that an interface cannot have fields, a constructor, or a destructor; in all other respects, classes and interfaces are equivalent, meaning that an interface can have static, public, protected, and private methods.
By default, a class cannot be extended unless it is marked as `extensible`.
By default, a method of a class cannot be overridden/extended unless it is marked as follows:

If it is abstract, it must be marked as `abstract`.
If it is overridable, it must be marked as `overridable`. (Duh!)

This corrects Java's exuberance of allowing any method to be overridden unless declared "final", and C#'s unwarranted technicalism of calling such methods "virtual".

If it is overridable with the provision that overriding methods must invoke the base method, it must be marked as `extensible`.

Methods that override other methods must be marked as follows:

A class method which implements an abstract method must be marked with `implements base`.
A class method which overrides an overridable method must be marked with `overrides base`.
A class method which extends an extensible method must be marked with `extends base`. Within the extending method:

The base method must be invoked exactly once.

(This can become a bit complicated with alternative execution paths, so we might want to mandate that there must be only one possible execution path at the point where the base method is invoked.)

The invocation of the base method can be simplified:

The name of the method can be replaced with `base`.
The parameters can be omitted.

In this case the base method is invoked with the values that the parameters have at the moment of the invocation, allowing the extending method to alter the values of the parameters before invoking base.

A class method which implements a method of an interface must be marked with `implements X`.

X is the interface-qualified-method-name of the method being implemented. The implementing method name may differ from the implemented method name (as long as the parameter list matches) and it will be accessible via both names.
X can also be a comma-separated list of interface-qualified-method-names, if the method implements multiple interface methods of different interfaces. In this case, the method will be accessible via any of the names.
Note that this corrects the stupidity of C# where no special marking is necessary for a class method that implements an interface method.
Note that C# provides a syntax for optionally specifying that a class method implements a method of a particular interface, but makes the implementing methods inaccessible, which renders the feature unusable.

Note that a class method may both override a superclass method and implement interface methods by adding both `overrides` and `implements`.
A class method does not automatically become overridable or extensible by virtue of implementing, overriding, or extending another method; it must in turn be marked as overridable or extensible if that is the intention.

Built-in Intertwine.
Built-in Domain-Oriented Programming features.

Alternatively, look into Scala's implicit parameter lists.

Built-in support for testing.

Bundled Testana.
Somewhat different testing semantics than JUnit:

The test class does not get re-instantiated prior to invoking each test method.
No 'before' method: use the test class constructor for this.
No 'after' method: use the test class destructor for this.
Use of the exact same assertion facility for test code as for production code.
No other test facility gimmicks like "expect", "assume", etc.: write the darn thing in code.
Test methods are always executed in the order in which they appear in the source file.
When a test class is derived from another test class, the test methods of the base class are always executed before the test methods of the derived class.
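For a feel of these semantics, here is a very small test runner sketched in Python. It is not Testana; the `test_` prefix for recognizing test methods and the names `ordered_test_methods` and `run_suite` are assumptions, and since Python has no deterministic destructors, the "after" part is not modeled.

```python
from typing import List


def ordered_test_methods(suite_class: type) -> List[str]:
    """Test method names: base classes first, then source order within a class."""
    names: List[str] = []
    for cls in reversed(suite_class.__mro__):  # base-most class first
        for name, member in vars(cls).items():  # definition order is preserved
            if name.startswith("test_") and callable(member) and name not in names:
                names.append(name)
    return names


def run_suite(suite_class: type) -> List[str]:
    """Instantiates the class once and runs every test method on that instance."""
    instance = suite_class()  # the constructor plays the role of 'before'
    executed: List[str] = []
    for name in ordered_test_methods(suite_class):
        getattr(instance, name)()
        executed.append(name)
    return executed


class BaseSuite:
    def __init__(self) -> None:
        self.log: List[str] = []

    def test_second_letter(self) -> None:
        self.log.append("b")

    def test_first_letter(self) -> None:
        self.log.append("a")


class DerivedSuite(BaseSuite):
    def test_third_letter(self) -> None:
        assert self.log == ["b", "a"]  # state from earlier tests is still there
```

Note that the methods run in source order, not alphabetical order, and that the derived test can rely on state left behind by the base class tests because the instance is shared.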

To enable separate testing of debug runs and release runs, assertions are always enabled for the testing code, but for the code-under-test they can be either enabled or disabled.
Even though all source files that constitute a module are compiled into a single binary file, (as per C# assemblies,) each class within that binary file comes with its own timestamp, to accommodate tools like Testana.

(Possibly) Explicit distinction between logic classes and data classes.

(Possibly) Built-in versioned externalization of data classes.
(Possibly) Built-in data-modelling framework for the data classes.

Built-in internationalization features (i.e. Unicode strings and culture-aware operations) but also full support for ANSI strings and culture-neutral operations.
Lightweight properties, exactly like in C#, with additional compiler support for obtaining a property as a separate entity and manipulating it independently of the object that it belongs to. (Probably a value type containing a reference to the object that owns the property and a reference to the reflection object that represents the property.)

NO compiler support for events.
Time Coordinate Data Type

Internally represented as a 64-bit IEEE floating point number of days since some epoch, allowing for:

Low-precision coordinates billions of years away from the epoch.
Femtosecond-precision coordinates near the epoch.
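The trade-off can be checked numerically. The Python sketch below (it needs Python 3.9 or later for `math.ulp`) computes the spacing between adjacent representable 64-bit floating point values at a given number of days from the epoch; the 365.25-day year is just a convenient approximation.

```python
import math

SECONDS_PER_DAY = 86_400.0
DAYS_PER_YEAR = 365.25  # approximation, good enough for an order of magnitude


def resolution_in_seconds(days_from_epoch: float) -> float:
    """Gap, in seconds, between this coordinate and the next representable one."""
    return math.ulp(days_from_epoch) * SECONDS_PER_DAY


one_second_out = resolution_in_seconds(1.0 / SECONDS_PER_DAY)
one_billion_years_out = resolution_in_seconds(1e9 * DAYS_PER_YEAR)
```

One second away from the epoch the resolution is well under a femtosecond, whereas a billion years away it is on the order of a few seconds.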

A special static "this" keyword (`this class`?) that you can use to refer to the current type in a static context without having to code the name of the type as you have to do in Java and in C#.
Proper method literals and field literals

No compromise like the `nameof()` of C#.
For example:

`Method m = method someMethod;` assigns to `m` the reflection method object of someMethod but causes a compiler error if someMethod has overloads.
`Method m = method someMethod(int);` assigns to `m` the reflection method object of a specific overload of someMethod.
`Method m = this method;` assigns to `m` the reflection method object of the method that is currently being compiled.
`Field f = field someField;` assigns to `f` the reflection field object of someField.

Source intrinsics

A source-line intrinsic similar to the `__LINE__` macro of C and C++ or the [CallerLineNumber] attribute of C#.
A source-file intrinsic similar to the `__FILE__` macro of C and C++ or the [CallerFilePath] attribute of C#. Note that the source filename yielded by this intrinsic is relative to the root of the source tree, not absolute.
A source-root intrinsic which yields the absolute path to the root of the source tree.

Namespaces, mostly as seen in C#, but with some differences.

You cannot just import a namespace and make everything in it accessible; instead, you have to do one of the following:

Import a specific name from a namespace; (like a Java import statement without a wildcard;) then, you can use that one name without qualification.
Import an entire namespace, but with a namespace alias, like in XML; then, you can use any name from that namespace, but each name will have to be qualified with the chosen alias.

Compiler-enforced conformity between directory names and namespace names, as in Java, and unlike C#. (Or, as in C# with ReSharper.)
Compiler-enforced conformity between source file names and class names, as in Java and unlike C#, with a couple of differences:

The file name of a source file may match a namespace defined in that file.
The file name of a source file may match the base-most class defined in the file, but the file may also contain additional classes derived from it.

System functionality injection

System functionality is available strictly via interfaces.
The main entry point function of a program declares in its argument list each system interface that it intends to use. Yes, this can become unwieldy; and yet that's how it is going to be.
When about to run an application, the runtime uses reflection to discover which interfaces are needed by the entry point, and passes each one of them to it.
From that moment on, user code makes sure to propagate system interfaces to all application code that needs them.
This means that no system functionality is provided statically. For example:

The data type for expressing time coordinates does not include a static method for obtaining the current time, as in most other languages. Instead, there is a `SystemClock` interface which provides this functionality, and this interface must be obtained via `main()` and propagated to all places that need to use it.
Similarly, if you want to open a file, you cannot just instantiate a file class; you have to obtain the `FileSystem` interface, and ask it to open a `File` for you.
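The mechanism can be mimicked in Python, where `inspect.signature` stands in for the reflection that the runtime would perform on the entry point. Everything except the names `SystemClock` and `FileSystem` is invented for this sketch, including the `launch` function, the stub implementations, and the methods `now` and `read_text`.

```python
import inspect
from typing import Callable, Dict


class SystemClock:
    def now(self) -> float:
        raise NotImplementedError


class FileSystem:
    def read_text(self, path: str) -> str:
        raise NotImplementedError


class FixedClock(SystemClock):
    """Stub clock that always reports the same instant."""

    def __init__(self, instant: float) -> None:
        self._instant = instant

    def now(self) -> float:
        return self._instant


class InMemoryFileSystem(FileSystem):
    """Stub file system backed by a dictionary."""

    def __init__(self, files: Dict[str, str]) -> None:
        self._files = dict(files)

    def read_text(self, path: str) -> str:
        return self._files[path]


def launch(entry_point: Callable[..., object], available: Dict[type, object]) -> object:
    """Plays the runtime: passes the entry point each interface that it declares."""
    arguments = {}
    for name, parameter in inspect.signature(entry_point).parameters.items():
        arguments[name] = available[parameter.annotation]
    return entry_point(**arguments)


def main(clock: SystemClock, files: FileSystem) -> str:
    # Application code receives system functionality only through its parameters.
    return f"{files.read_text('greeting.txt')} at {clock.now()}"
```

A side benefit that falls out of this arrangement is that a test can launch `main` with a fixed clock and an in-memory file system, without any mocking framework.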

Runtime environment:

`DirectoryPath` and `FilePath` interfaces that encapsulate file-system pathnames and filenames, so that one rarely needs to engage in string manipulation with paths.
No such thing as a "current directory"; all paths are absolute. When a path is constructed from a string, the absolute path is immediately computed, and the computation may take into account whatever the host system considers to be the "current directory" of the process. (So, by obtaining the `DirectoryPath` of "." one can discover the current directory, but one has no way of changing it, and the runtime environment will never change it.)
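A rough Python approximation of these two types can be built on top of `pathlib`. The members `text`, `file`, and `directory` are hypothetical; the only point of the sketch is that the absolute path is computed once, at construction, and never again.

```python
from pathlib import Path


class DirectoryPath:
    """Always-absolute directory path, resolved at construction time."""

    def __init__(self, text: str) -> None:
        # The host's notion of a current directory is consulted here, and only here.
        self._path = Path(text).resolve()

    @property
    def text(self) -> str:
        return str(self._path)

    def file(self, name: str) -> "FilePath":
        return FilePath(self._path / name)


class FilePath:
    """Always-absolute file path."""

    def __init__(self, path: Path) -> None:
        self._path = path.resolve()

    @property
    def text(self) -> str:
        return str(self._path)

    @property
    def directory(self) -> DirectoryPath:
        return DirectoryPath(str(self._path.parent))
```

Code that combines a directory with a file name through `file()` never has to concatenate strings or worry about separators.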

2022-07-16
