The Terse-yet-verbose Untyped Frugal Programming Language (TUF-PL)

Rainer Glaschick, Paderborn
2021-09-16

This is version 0.8.20 of TUF-PL and its environment.

Contents

1. Introduction
2. Language Elements
2.1. Lines and Statements
Lines
Statements
2.2. Comments and Pragmas
Comments
Pragmas
Inline C statements
Debugging
Deprecated Functions
2.3. Literals (Constants)
Void literal
Number literals
String Literals
Named Literals
2.4. Variables
Global variables
Local static variables
2.5. Assignments
2.6. Statement Blocks
2.7. Guards
Basic guards
Repeat loops
Block repeat and terminate
Scan guards and scan functions
Row, Map and Tuple scans
Returning bunches in scans
2.8. Operators
Arithmetic expressions
Boolean expressions
Void references and fault items in expressions
Conditional and logical expressions
String expressions
String comparison
Bit operations
Operator precedence
Bunch comparisons
2.9. Tuples, lists and tuple assignments
Dynamic tuple creation
Tuple equality
Tuple assignments
3. Items
3.1. Fields
3.2. Exact (integer and rational) Numbers
Integral numbers
Rational numbers
3.3. Approximate (Floating Point) Numbers
3.4. Complex numbers
3.5. Strings
ASCII strings
Raw strings
Percent-encoded UTF strings
String input and output
3.6. Bunches (aggregates)
Rows
Maps
Scans for rows, tuples and maps
Cyclic references
Tagged bunches
Sorting
3.7. Chunks of bytes
Byte order
Block IO
3.8. Fault items
Fault item special cases
Signals
Fault items: conclusions
4. Assertions and Constraints
Assertions
Constraints
Assertions for Bunches (rows and maps)
Weak and strong assertions
5. Functions
5.1. Function definition templates
Naming conventions for function templates
Export and import of functions
Function aliases
5.2. Function call
5.3. Function returns
5.4. Function references
5.5. Bunch Functions
Constructors and destructors
Factory functions
6. Standard library
6.1. File I/O
6.2. Network I/O
Basic I/O
Forking subprocesses
Standard library function examples
6.3. Arithmetic functions for floating point numbers
6.4. Arithmetic functions for integers (and exact numbers)
6.5. String functions
6.6. Conversion to and from numbers
6.7. Formatting
6.8. File functions
6.9. File execution
7. Various
System variables and parameters
Program and shell invocation
Printing
Tail recursion
8. Libraries and Modules
Libraries
Modules
External C functions
Modules
Pure binary modules
Message translation
9. Object oriented programming
Introduction
How to do object oriented programming
OO outdated section
Example
Questions and Answers
10. Examples
Hello world
Table of squares
Approximate square root
Greatest common divisor
Linked List
11. Features not (yet) approved
Memory shortage interception
Anonymous functions
Function parameter lists
Row optimisations
Sequences
Slices
Row and map literals
Field change callbacks
Overriding bunch element access
Overriding of operators
Overriding of string conversion
Regular expressions
Enumerated integers
Combining (join) bunch elements to strings
Output to standard output (print)
Multi-dimensional fields
Template permutations
Threads
Alternate scan functions
Switches
Macroprocessor
Coroutines
Unified Rows and maps
Shell usage
12. Implementation Remarks
Fields and Attributes
Storage Management
Rows and maps
Tagged bunches
Level 0 TUF
Strings
Named Constants
Library bindings
Multiarch 32bit and 64bit

1. Introduction

The programming language described in this document was created as a personal exercise to describe and implement a simple, but versatile programming language, that is easy to write, use and understand, yet more efficient than most interpreted languages. I started with this project in about 1990 (after already 20 years in computer programming) when microprocessors and personal computers came up, in order to have a simple programming language. While Python is close to my programming taste — it avoids some legacy of parenthesis from the paper tape area by using indentation, and dared to provide arbitrary precision integers as default — it still uses the notation of a function with a parameter list, requiring a pair of parenthesis for each call. But it is possible to intermix parameters and words in calling a function1.

The most prominent features are:

So the basic syntax is terse, using only symbols, but the functions are strings of words, and rather verbose, with intermixed parameters. As the words and phrases of function templates may undergo localisation, there is no need to learn English terminology; a program can be expressed in any language, provided that attributes are also localized.

The language is not primarily object oriented, but object oriented programming is partially supporetd. Note that object oriented is a design method, that can be used in any programming language, admittedly more or less easily.

There is no compile-time enforced strict type system; however, the constraint mechanism — if supported by a future compiler — is deemed to be powerful and probably more flexible.

Thus, it is a language that can be easily used by beginners, but has all necessary features for professional programming; furthermore, it can be fully localised, as there are no reserved words.

It is terse, because no keywords are used in the core language; all syntax is expressed using special characters and digraphs. although the function strings are long and verbose.

It is untyped, in so far as a variable is not bound to certain type. However, each individual item refers to by a variable is of a certain kind, i.e. number, string, bunch etc. A discretionary type system is designed and hopefully integrates well.

Function templates play a central role, and all parameters are passed uniformly by reference. The term functional was mistakenly selected to emphasize the central role of functions; but there are still operators, assignments etc, thus it is not what is claimed a functional system. Also, parameters that are rows and maps with fields can be modified affecting the caller's view, contradicting the functional programming view.

I had observed that nowadays programming depends on libraries, and even fairly basic operations are done by library calls. Thus, it seemed logical to translate all statements to subroutine calls in C and avoid the overhead of virtual machine code interpretation. Even with modern JIT compilers, some tests with Python and AWK showed that equivalent TUF-PL programs are often quicker.

Graphical programming libraries have been applied, but only of a very rudimentary level of proof of concept. This is defintely a huge task for the furture.

The acronym is still TUF, while TYV for Terse Yet Verbose would have been more appropriate and more distinctive, but what is a name?

2. Language Elements

2.1. Lines and Statements

The program consists of a sequence of lines that contain statements. Normally, a statement is just a line.

Statements are for example:

function calls:
print "Hello world"
assignments:
x =: 1
function returns:
:> "Error"
guards (if)
? i > 0
loops (while)
?* i < 15 .scans (for) ?# i =: from 1 upto 15
assertions:
! i >= 0
loop break or repeat:
?> ?^

Statements — as opposed to the programming language C — cannot be used where an expression is expected, because they don't have a value. This excludes many compact notations, that tend to be complex, hard to understand and hard to verify.

Lines

Normally, a statement is contained in one line.

Newer compiler versions may automatically detect line continuations, if a line ends in an operator symbol, or parenthesis are still unbalanced.

Instead of long lines, it is recommended to break down lines using extra variables; this will increase never execution time using modern optimizing compilers, and be negligible in other cases, but increases readability and thus reliability. Note that breaking down a complex expression allows to comment the parts individually.

If it cannot be avoided, a join of two lines can be forced by placing the double dot digraph (..) as the last symbol of a line (white space including comments may follow and is ignored). The compiler may convert this to a long line (removing the double dot) and the next line a totally blank line.

Statements

The program is composed of statements, normally one on a line.

Statements can be:

As empty lines are ignored and a comment is treated as white space, the empty (do nothing) statement requires a notation to define indentation; a line with just the void notation () may be used (or a single semicolon, if condensed notation is available). For an empty loop body, ?^ may be used instead.

First, the line is converted to a row of terms separated by white space, where a term is:

White space must be used if necessary to separate terms, but is optionally otherwise. A symbol may be composed of one or two characters not separated by white space; in the latter case, it is a digraph. (There are no trigraphs etc.) Underscores may be used inside a word. Note that the word must be followed by blank space as usual, if a number literal follows the word.

Then, the line is parsed from left to right and sublines generated as follows:

Thus, the line is a list of identifiers, constants, symbols or expressions. Then, symbols are processed in priority order; for unary operators, a pair, and for binary operators, a triple with expressions for left hand side and right hand side is set.

Note that the assignment is not an operator; when found as symbol, it tells the line is an assignment. Thus it does not yield a value.

Parsing of the line proceeds as follows:

  1. If the last symbol is a colon, it is a function definition template
  2. If it starts with an exclamation mark, it is an assertion or starts a constraint
  3. If it starts with a question mark or a digraph that starts with a question mark, it is a guard or loop header.

If none of the above is the case, the line is parsed into a tree, i.e. a list of lists. A list is composed of:

Opening parenthesis and single backticks start a subexpression that is ended by a (matched) closing parenthesis or a backtick.

An binary operator symbol reorders the list in such a way that it has the operator first, then the left hand sublist, then the right hand sublist.

A unary operator symbol splits the list into the operator symbol and the right hand sublist.

However, the list has first to be scanned if it contains more than one operator, in which case the priority of the operator determines the order in which the list is split.

Once all this has been done, the tree is scanned for sublists starting with a name.

2.2. Comments and Pragmas

I use to say that designing a programming language should start with the comment, then the dummy statement that does nothing, and after that the rest may be developed. So comments will be here at an early stage, and pragmas also, as they are syntactically special comments.

Comments (text that is completely ignored by the compiler) and pragmas (compiler options) start with a backslash (&bsl;) character. Unlike other programming languages, special characters inside strings are not indicated by backslash characters; backslashes inside strings remain backslashes.

Skip to Literals (Constants) when comment and pragmas are boring.

Comments

Three types of comments are possible:

Block comments:
A backslash followed by an open parenthesis (the digraph \() starts a comment, and a backslash followed by a closing parenthesis (the digraph \) terminates it:
\( this is a block comment \)
Line tail comments:
After a single backslash \ followed by a blank the rest of the line is a comment.
Conditional comments:
The digraphs \? and \~ allow comments conditionally.

Note that backslashes have no special meaning in string literals, so comments start and end always outside of string literals.

A block comment is often used if the comment block spans several lines; in this case it is strongly recommended to write nothing behind the closing \), as otherwise the indentation might be unclear.

Despite the use of parenthesis, nesting comments is not (yet) supported; the result is undefined. Instead, the digraphs \[ and \] are additional pairs for block comments, that are intended to exclude parts containing block comments unconditionally.

A simple mechanism for conditional compilation uses the enable command line option -e of the compiler and the bigraphs \~, \{ and \}:

If enable is set, the bigraphs determine the indentation level:

do something:
    i =: 1
    \~ i =+ 1
    \{ i =+ 3 \} 
    print i     \ prints 1 normally, and 5 if -e is set

This allows to include test code in a module:

a func to export:+
    ....
an internal func:
    ....
\{
main (parms):-
    a func to export
    an internal func
\}

All three digraphs use a swiveld line, horizontally for the line comment, vertical for the block comment. Note that alternatives cannot be selected this way.

More flexible conditional comment blocks use the digraph \? in column 1, followed by a condition, which may be a logical expression using only literals and named literals:

#MAIN = ?+
\? #MAIN
main (args):+
    ...
\?

It starts comments upto and including a line that starts with \?, with or without a boolean expression. As no expression is treated as true, this ends the conditional comment block.

Note that setting a named literal may be done by a compiler option and not necessarily in the code.

Alternatives (else) could invert the condition:

\? ~#MAIN

or use \| instead, if available.

Early compilers may lack this type of comments totally or may only support a single named literal as condition, which must exist to enable the block.

Programmers might consider adhering to the following conventions:

Some other digraphs are already defined (see next section), but may not be honored by the compiler yet:

Comments are treated as white space in the source line, i.e. digraphs, strings and identifiers are broken by comments.

To allow integration in shell scripts, if the first line starts with the two characters #!, it is treated as comment line.

Pragmas

For modularisation, there are pragmas followed by white space and a file path (or URL) :

The \^ variant effectively invokes the TUFPL_genhdr utility that extracts header information and is used to create the .tyvh files.

They may be nested; there is no limit imposed by the language. If \+ is found within a file imported (via \^), importing continues, i.e. there is no reverting to include mode. (For this to work, the header extraction must be idempotent.)

If the filename is not an absolute one (starting with a slash), the current directory is tried first, i.e. the filename used as is. If not found, the compiler uses a set of include directories to prepend and try (-I parameter, TUFINC environment variable) until the file is found.

Including a file that is currently open for including would never end, in particular if conditional compilation is not supported. In order to allow nested includes for libraries that may be redundant, this case is not rejected, but the new file silently ignored. For this, the pathname string is used in comparison, but it may be converted to a real name before comparison on file systems that support soft links.

Include pragmas should in general not be used inside function bodies.

The pragma \# may be used to ensure to indicate the langauge version required, so that defective code is not accidently produced, e.g. when new attributes are provided. Language versions comprise of a major (version) number, a minor (release) number, and a modification level, e.g. 1.12.2.

Also, a number of keywords may be given to indicate required language features (starting with a plus, read as required) and those not used (starting with a minus, read as not required). The latter is used to check for backward compatibility when new features are not required. Such keywords may be:

Tuples
BunchFunctions
FunctionPrototypes
CallbackFunctions
AttributeChangeCallbacks

For each major version, the list of available language features is fixed; new language features and thus new keywords require a new major version.

The check issues a warning if:

Also, the compiler could flag language changes, e.g. new or deprecated attributes, once the compiler (minor) version is larger than the one given in the pragma.

A new compiler should be able to compile any older language version correctly, but flag places, where either new features are, perhaps accidentally, used, as well as places where features no longer present in the newest version. If it cannot process that version, the whole programme must be rejected, and not compiled according to some whatever compatible2 version.

Thus, the user can:

Inline C statements

As TUF-PL is designed and implemented as a precompiler for the C language, it is often desirable for a library to write only the most basic functions in C and the rest in TUF-PL. The canonical way to do so is to provide a short module in C using the TUF-PL interface. If it is ever considered useful to include C code, the digraphs \{ and \} are reserved for this. The easiest way is to copy the lines verbatim into the generated code, which requires intimate knowledge of how code code is generated and mentioned here only as a possible future language extension.

Debugging

In order to insert compile time information, the HTML entity notation is expanded:

&lineno;    Source line number
&filename;  Source file name
&function;  Template of the current function
&date;      Date of compilation in ISO notation as YYYY-MM-DD
&time;      Time of compilation as HH:MM

Compile time definitions (like -Dxx in the C compiler) might be supplied as named literals.

There is a debug print function, that immediately returns unless a debug flag is set, and sends its output to standard error (stderr) instead of standard output (stdout). No context information is automatically included, the entities &filename; and &lineno; may be used, or the functions:

debug print lno (mixed):-   \ print '@&lineno;:', mixed
debug print here            \ print '&filename;@&lineno'
debug print stack           \ a full stack trace
debub print stack (n)       \ only the n top items
debug dump item             \ works with any item

Note that debug print is enabled at runtime, e.g. by the --debug option, and may give small performance losses (about 10%) even if not enabled, but only if in heavily used loops. If complex strings expressions are used as an argument to debug print, this may cost some small extra time, and the use of a tuple might be preferred. Using a guard (if) like the following example is not signifcantly quicker, unless elaborate functions are called to compose the parameter:

? is debug enabled
    print a, b, c

If the expression for a guard can be evaluated at compile time and is false, the body of the guard may not be compiled at all:

#Debug = ?-     \ enable by ?+, or use '-#Debug' compile time option
...
    ? #Debug 
        print ...       \ executed only if Debug = ?+
    always compiled

However, this is not equivalent to a conditional compilation using \?, as pragmas inside the block are evaluated and not all compilers support this.

Deprecated Functions

If (exported) funtions in libraries are deprecated, it might be desired give a compiler warning when used.

For this purpose, the :~ bigraph can be used instead of :- in definition files.

A compiler option (-x) allows to exclude all deprecated function declarations.

Currently, deprecated functions are marked in the source files by comment blocks starting with \(~.

Using :~ instead of :+ is currently not considered as :- and :~ are declarations, and : and :+ are followed by a definition block. so maybe :? will be used.

2.3. Literals (Constants)

Literals are sometimes called constants, although the term constant is better reserved for items that are immutable (for a certain time).

Literals allow the creation of items with given contents.

Conceptually, when a literal is assigned to a variable, an item of appropriate kind is created, and a reference to this item stored into the variable.

As plain items are conceptually immutable, assigning a literal to a variable (or bunch element) could be regarded as setting a reference to the respective literal.

The void reference, that indicates that the variable does not contain a valid reference, is completely unique and clearly immutable.

The only item that can be changed is the bunch, which has also features of a structure. To create a new bunch, assign the empty row ([]) or map ({}) to a variable, and then assign to its variables (or variable fields). See Rows… for details.

Void literal

The void reference, void for short, that refers to nothing, is denoted by a pair of parenthesis as a digraph, i.e. without space within:

()

Note that [] is a new empty row and {} a new empty map without elements, and "" or '' are empty strings, all not the void reference.

The digraph () might be misinterpreted as an empty tuple; however, there are no empty tuples, although ((),) is the same as (),, a tuple with only one element, which is a void reference. Note the trailing comma, as tuples with one element are normally useless.

From a structural point of view, digraphs like $. or $- would be a more logical choice, but the empty pair is much better readable. The digraphs || or -| or <> or ~~ had been considered and found to be not clearer.

Number literals

The basic elements of number literals are sequences of decimal digits, that form non-negative integer numbers which are used to express literals of rational numbers.

Most compilers will evaluate expressions with the usual numerical operators at compile time, thus the minus sign can be used to make a negative integer. If followed by the division operator slash (/) and another sequence of digits, a rational number is created 2/3. Using the exponentiation operator, powers can be expressed, in particular powers of 2 (2^12) or 10 (10^6). Note that it might be necessary to use parenthesis as in (2/3)*10^6 (or or (2/3)⏨6.

Floating point (real) numbers require a point between two digits, as in 1.1; the short form 1. is not supported. As with the exact numbers, additional operators may be used, e.g. in 12.45*10^-6. The commonly used form 12.45E-6 is supported due to its compact form, but a decimal point is required (the short form 3E5 is not possible). Using the subscript 10 (), the unicode character &#9192;, is planned to be supported instead of E or e, e.g. 12.45⏨-6, but not yet implemented.

As the comma is a list separator, even if the current locale specifies the comma instead of the decimal point, floating point numbers are created using points between digits. Accepting other locales requires that a pragma is provided to indicate the desired locale, as otherwise a program may not compile in another locale.

In order to have large numbers better readable, in particular in assertions, the apostrophe (single quote, ') may be used inside (which does not collide with non-localised strings):

n =: 100'000'000 

Neither blank space nor a comma or point may be used for this purpose.

Numbers in other bases than 10 must be created from string constants using library functions, i.e. there are no octal numbers by using a leading zero nor hexadecimal numbers by a leading 0x.

String Literals

String literals are one or more characters enclosed in quotes (") or apostrophes ('). Those in quotes are subject to message localization, those in apostrophes not. Character is used in the sense of a Unicode code point, not as a 8-bit byte. Internally, all strings are held as UTF-8 coded strings. If expansion is required, e.g. because heavy modifications of single characters, there is a function to convert a string to a row of code points.

With the only exception of a newline character, any character that is not the string delimiter may be part of the string. This means that the closing delimiter must occur before the end of the line, except in multiline string literals (see next). In particular, either form may contain the opposite delimiter; thus, if a string constant for HTML is needed, it is normally not subject to localization, and using the apostrophe is a good choice:

print '<a href="'

If the delimiter is needed, a HTML entity must be used:

print "<a href=&quot;"

So to encode special characters, HTML entities can be used (see https://www.w3schools.com/html/html_entities.asp).

This means, that the string constant "M&uuml;he" results in Mühe. Some commonly required named entities are:

quot    "        (double) quote
apos    '        apostrophe
amp     &        ampersand
sect    §        section, paragraph
not     ¬        not
euro    €        Euro currency sign

The HTML standard allows any code point using their numerical value, but does not have names for control characters, nor for the backslash, so TUF-PL provides additional names for these:

nl    &#10;     newline
cr    &#13;     carriage return
tab   &#9;      tabulator
bsl   &#92;     backslash
e10   &#9192;   subscript 10 (⏨) for number literals

Thus, a text to be localized is written as:

"Hab&apos; acht!"     \ ergibt "Hab' acht"!

The backslash is conceptually reserved for comments and pragmas, and is better written as &bsl;. As this would make regular expressions even harder to write as they are anyhow, the are treated like any other character.

Strings in general support zero bytes (zero code points), but not in string literals, as these are internally zero terminated strings.

Multiline strings (here documents) use digraphs as a starting delimiter for the string. To start a verbatim multiline string, use <' while <" starts a localized variant; both are terminated by a single apostrophe or quote as for any other string. Line changes and whitespace are copied verbatim:

s =: <'
    a 1
    b 2
    c 3'
 
! s[1] = '&nl;' & s[-1] = '3'

This string starts with a newline and ends with the digit 3.

The alternate multiline string digraphs >' and >" shrinks any whitespace to a single blank and removes leading and trailing blanks:

s2 =: >'
    a 1
    b 2
    c 3
'
! s2 = 'a 1 b 2 c 3'

A typical use is:

?# lne =: give parts of string >'
    1   void
    2   integer
    5   real
    ' delimited by newline

Another example is the usage message:

print usage for (progname):
    print replace each '&progname.' by progname in string <"
        Usage: &progname. [options] command [values] '
        Options may be:
            -x  execute
            -v  print version and exit
        "

Note that placeholders for dynamic contents must be replaced programmatically as shown, while localisation is done automatically.

Named Literals

It is sometimes useful to give numbers or strings a name to allow consistent use. This is done by named literals, that are words prefixed with the number sign or hash (#).

Setting named literals starts with a number sign (#) in column 1, followed by a word, followed by an equals sign, and a value that can be evaluated at compile time:

#one = 1
give me two:
    :> #one  + 1
#pi = 3.14159265358979323846
#hello = "Hello"        

The single equals sign is justified because both sides are equal, as in contrast to an assignment, see below.

Setting a named literal may be only done outside function bodies, i.e. the defining line must not be followed by an indented block.

The value maybe any literal or named literal defined lexically before. Compilers should support expressions of literals and named literals. Despite the use in the definiton of named literals, they can be located anywhere in the source code. It is recommended to define them lexically before the line of use too.

Named literals should not be redefined. As the compiler might allow to set named literals in the command line, and to avoid libraries to disable words, a second form allows redefinition by using the assignment bigraph:

#a = 1
#a =: 1

Note that the lexically last one wins for replacement in the total programme text, while the actual one is used in conditional comments.

To use a named literal, the same word with the number sign prefixed is used. While this looks clumsy first, it is absolutely clear and avoids rules about when a literal and a variable is meant; so it is consistent with globals.

Using the dollar sign instead of the number sign was discarded from the same reason. Note that the number sign is used elsewhere only as part of a digraph. In particular, the conversion to approximated (floating point) numbers is a double number sign (##), and some special float numbers use other digraphs, see Approximate (Floating Point) Numbers.

Some number of common constants may be defined in the standard library: in particular:

\(# Numerical constants 
#\)
#MaxInt31 =            2'147'483'647
#MaxInt32 =            4'294'967'295
#MaxInt63 =   9'223'372'036'854'775'807
#MaxInt64 =  18'446'744'073'709'551'615
#Pi = 3.141'592'653'589'793'238'46
#E =  2.718'281'828'459'045'235'36
 
\(#  String constants
#\)
#Whitespace = ' &tab;&nl;&cr;'
#ASCIIupperLetters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
#ASCIIlowerLetters = 'abcdefghijklmnopqrstuvwxyz'
#DecimalDigits = '0123456789'
#HexDigits = '01234567890abcdefABCDEF'

Implementation dependent values are attributes of the global variable $SYSTEM, in particular $SYSTEM.epsilon, see System variables and parameters.

It is good practice to start literal names with a capital letter if they are to be exported, while those used locally start with a small letter. To have literals exported, the are marked with a comment block enclosed in \(# and #\), preceeding the block of literals (see TUFPL_genhdr)

Named literals can be predefined by compiler options: -#MAIN defines #MAIN as ?+, and -#x=5 define x as integer 5.

These definitions are valid from the first line on and may be overwritten by the assignment form.

If the the named literal is not set, it is void, and thus false (?-) in boolean expressions, 0 for integer additions (and subtractions) and void elsewhere.

2.4. Variables

Variables just hold references to Items, which are similar to objects in other programming languages. Every variable does just hold a reference to an item, or they have none, called the void reference. Bunches like rows and maps provide a means to save several references to items, and in this sense are aggregates of variables.

Variable names must begin with a letter, followed by letters, digits or underscores, and neither begin nor end with an underscore (so that there is no collision with the string catenation operator). Small and capital letters are distinguished. Using medial capitals (camel-case), i.e. capital letters inside words, is preferred over the use of the underscore. However, long function bodies are discouraged, and a rather small number local variables is required so that complex names are not necessary.

Plain variables are local to functions, see below for Global variables.

All variables come into existence if used; there is no declaration, and thus no means to initialise them. To avoid undefined references, each variable is initially set to void (which intentionally is zero, i.e. the NULL pointer in C).

There are no static local variables of functions, thus no need to initialize them before program start. The overhead of dynamic initialisation is negligible as only a relatively small number of variables is concerned (see Bunches (aggregates)).

Variables are not typed; they can hold references to all kinds of items.

Constraint expressions allow a flexible discretionary static type checking, see Assertions and Constraints.

Plain items are conceptually immutable. In the following example, the reference to the number 1 is copied to b, so a and b refer to the same item. However, changing a in the third line replaces the reference to the item 1 by a reference to the item 2. There is no means to change the value of a plain item; operators on plain items always deliver a new item.

Example:

a =: 1
b =: a
a =: 2
! b =: 1

This is one of the reasons the i++ notation is not used here; it would look like the item referred to by i would be changed.

Although additional information about a plain item can be obtained by using standard functions or attributes of items, these attributes cannot be changed, as plain items are immutable; a new attribute can only be given for a copy thereof.

Consequently, a string may not be e.g. truncated by an operator, only a truncated new string can be generated by a function.

The only mutable kind are bunches (rows and maps); they can be changed, as they hold references like a set of variables, and the references can be changed. Byte chunks are a third species of bunches, because they are mutable, but do not contain references, just a (large) byte buffer.

In particular, if a reference to a bunch is copied by an assignment, and one reference is used to change a bunch cell or element, then either reference provides exactly the same information, like in this example:

r[1] =: 1
r[2] =: 2
p =: r
p[2] =: 3
/ because p and r refer to the same item that was changed:
! p[2] = r[2]

This difference coincides with the observation that for plain items comparison is well defined and supported by a predefined operators, while for bunches, there is no canonical way to compare two bunches.

The special variable name @ is used in some contexts as the current item, often denoted by this.

Global variables

Prefixing the character $ defines variables which are global to the module (compilation unit). No blank space may follow the dollar sign (it is not an operator).

Global variables are shared by all functions, but not visible outside the module. This greatly reduces the implementation effort, they are just static item* G_name variables in C. There are neither name collisions with local variables, nor a need to introduce declarations and definitions.

Global variables are not automatically discarded on termination; it is the programmer's obligation to leave all global variables void, otherwise, a memory leak message is likely to occur at termination. While it is a nice side effect to discourage using global variables, the deeper reason is that there are neither automatic module initialisation nor termination functions; like in C, there is just a function with the reserved pattern main () that starts the programme. Using just a few global variables referring to bunches, and using variable fields of the bunches instead of single global variables, alleviates the burden to clear global variables, as all references within a bunch are automatically freed when the bunch is freed.

A callback on exit may be registered to automate this:

main (arg):+
    args =: parse args options with valopts () 
    $opts =: args.options
    on normal exit call `clear opts` with ()
    ...
\ or with anonymous functions:
    on normal exit call `() $opts =: ()` with ()

Note that parameters are always passed by reference; thus there is no means to pass the address of a global variable to a function, as there is no address-of operator and no suitable item kind.

If there is gloabl data to be shared between modules, it should preferrably be kept and provided in bunches that are made available via mutual function calls.

In case it is unavoidable to share items globally, there is a reserved global variable $COMMON common to all modules, which is initialised with a map and thus maps strings to item references. Using it is done via a key, e.g. in hierarchical notation as in de.glaschick.tuf, a hash of a constant string, or via the string representation of a UUID, so that the chance of collision is neglectable. Of course, fields with module names can be used too. Further protection could be obtained by storing references to bunches with tags. As opposed to normal module-globals, there is no need to remove an entry before termination; this will automatically done on system exit. But of course the item including all its references will stay in memory, until cleared by setting it void.

Another reserved global variable is $SYSTEM, which has some special attributes and fields, see below on Fields and Attributes.

It may happen that upon termination of the main programme, that a variable is freed that contains references to millions of items; as these will be diligently freed one per one, so termination may take time, unless the function system exit (retcode) is used, that does not check for memory leaks, and thus is highly discouraged. But note that items are freed automatically once there a no longer any references, so this is only the case if all these items were still of potential use.

If threads are provided, global variables will be thread-local, i.e. not requiring synchronisation for access.

It is recommended to append the variable name to the dollar sign ($) without spaces, because future versions might require it.

Functions use a different syntax for globality.

Local static variables

Because normal globals are restricted to a compile unit, functions that are singletons can use a global variable to maintain their state. Name clashes are manageable, because they are restricted to the compilation unit. The singleton will normally register a function on exit to clear the global. Thus, it is important that more than one function have access to the static variable.

Thus, static variables local to a function are more difficult to clear at end and thus are not (yet) considered.

2.5. Assignments

Normally, an assignment using the assign symbol =: has on its left hand side the name of a variable, and on the right hand side an expression that calculates a value, and the variable is assigned a reference to that value. Several other assignment variants are possible:

=:  Normal assignment
=~  Set if void
=_  Append expression as string 
=+  Add expression to target
=-  Subtract expression from target
=*  Multiply target by expression
=/  Float only: divide target by expression

The remaining ones are partially reserved for future use:

=%  Not supported
=^  Not supported
=?  reserved for fault bubble assignment
=;  reserved for weak references
=&  reserved for boolean and
=|  reserved for boolean or
=<  reserved for getter callback assignment
=>  reserved for setter callback assignment
=.  not yet defined
=!  not yet defined
==  reserved 
=#  not yet defined (alias?)
=,  not yet defined

All are digraphs that start with an equals sign, and all supported such digraphs are assignments, that's why the ALGOL assignment digraph := is not used.

The left hand side is called a reference term, or just term for short, which allows to store item references, and the right hand side is a value expression, or just expression for short, that yields a reference to an item that shall be used to overwrite the reference given by the reference term.

For combined assignment and arithmetic, the digraphs =+, =-, =* and =/ (only float) are provided, see Arithmetic expressions.

Using =_, a string can be appended to the end of a string:

a =: x _ y
x =_ y
! a = x
 
n =: 5
n =_ 'x'
! n = '5x'

Note that string catenation converts numbers to strings in default format.

A frequent pattern is

? x = () :
    x =: something

While condensed notation (see Sequences) could be used to squeeze it on one line:

? x = () : x =: something

the destination has to be written twice identical, which is a (tiny) source of errors, so there is a set if void assignment using the digraph =~:

x =~ something

Another common pattern is to check if a variable refers to a bunch, and to create one, before a field is set:

? para = ()
    para =: {}
para.field =: someValue

which would be in condensed notation:

? para = (): para =: {}; para.field =: someValue

and using the set if void assignment:

para =~ {} 
para.field =: someValue

It may be use frequently also to set a tag if not yet set already:

item.tag =~ 'aTag'

instead of

? item.tag = ()
    item.tag =: 'aTag'

although it is often better to clarify before if the tag is set.

Note that the digraph ~= is also used and denotes not equal comparison.

2.6. Statement Blocks

A block is a sequence of lines with the same or deeper indentation. Its end is reached if the next line has a less deep indentation. A blank line is always ignored and does not end a block, as it would be unclear how many blocks are closed.

Theoretically a block may be empty, if there is no indented line when possible, but currently empty blocks are not supported. The typical do-nothing-except-advance loop like

for (i=start; i!=0; i=i.next)
    ;

is anyhow not supported this way, and empty blocks are rarely required. If a scan is to be repeated until it ends, the block repeat ?^ is the logical choice instead of an empty block.

The end of the input source file also closes the current block(s).

A line, if it has less indentation as the previous one, must have the same indentation as another line above to match; otherwise, it is an error. A block may be executed conditionally or repeated, if guarded; otherwise, blocks are used for function bodies and other groupings.

Detecting indentation is no problem if only blanks or only tabulators are used for the initial spacing. However, mixed use may be confusing, not only to identify equal depth, but in particular to identify deeper indentation. There are several solutions with different merits:

The latter requires a tab stop value. Traditional unix uses a raster of 8, i.e at 1, 9, 17, 25, etc, while nowadays with screen editors, a raster of 4 is more convenient (and at the same time the recommended indentation in Python).

Using spaces only is surely the least surprising method; the same is true if indentation is only done by tabulators, as it is independent of the tab stop value.

Despite the experience in Python, tabs are supported by expanding them, with 8 as default tab spacing. vi mode lines are detected and may be used to change the default tab spacing for the file in which they occur (at any place, the last one wins). Automatic detection may be feasible, because any dedentation must return to an already used indentation level. However, this has not been tried and is currently not supported. Also, many text editors have this function, and can be used to normalise the use of tabs and spaces.

2.7. Guards

A block can be guarded (courtesy to E.W. Dijkstra) by a condition that determines if the block is executed or not.

Basic guards

A basic guard starts with a question mark (?) as first character on the line, followed by a boolean expression. If the expression yields false, the indented (sub-) block that follows the guard command is skipped. If it is true, the indented block is executed.

From a mathematical point of view, no alternative (else) is necessary, as the condition could be inverted, as in:

? i < 0
    m =: "less"
? ~(i < 0)
    m =: "greater or equal"

However, this puristic approach is not sensible. In particular, as the condition must be written twice the same, which could be error prone.

Thus, the common otherwise or else is possible, telling that the condition of the immediately proceeding guard is inverted:

? i < 0
    m =: "less"
?~
    m =: "greater or equal"

As shown, the digraph ?~ would be the logical choice; but using the vertical bar | is much better readable and used normally:

? i < 0
    m =: "less"
|
    m =: "greater or equal

The common if-then-else like

if (i < 0) {
    m = "less";
} else if (i > 0) {
        m = "greater";
    } else {
        m = "equal";
    }
}

becomes:

? i < 0
    m =: "less"
|
    ? i > 0
        m =: "greater"
    |
        m =: "equal"

The else token is the first and only token on a line. While this seems to be a waste of vertical space, it is necessary to define the indentation of the following block.

Consequently, the combined elif in some languages is straightforward, but may not yet be supported by your compiler:

? i < 0
    m =: "less"
|? i > 0                    
    m =: "greater"
|
    m =: "equal"

Note that |? is a digraph, and no blank space may be used in between.

Within a guarded block, the block can be terminated or repeated using the break operator ?> or the repeat operator ?^; in the latter case, the guard expression is evaluated again, etc. Both symbols are digraphs, i.e. no white space possible inside.

Thus, to print squares, the program may read:

i =: 1
? i < 10
    print i*i
    i =+ 1
    ?^

Repeat loops

For convenience and better readability and because there are so many loops to repeat, the loop guard (?*) implies a repeat at the end of the block, which is essentially the same as the previous example:

i =: 1
?* i <10
    print i*i
    i = +1

Endless loops have no expression (or the true symbol ?+) after the loop guard (the following example should be replaced by a better one):

i =: 1
?*
    ? i > 9
        ?>
    print i*i
    i =+ 1

Note that, unlike several other programming languages, an assignment is not an expression; also the start value must be set on an extra line, unless scans like those in the next section are used.

Instead of a boolean expression, values with boolean kind can be used. Just the void reference is allowed alternatively for false, to make map processing easier. Anything else is a fatal runtime error, as the complexity of the rules don't make the programs more reliable or understandable.

Note that there is no rule to fixed indentations, i.e. in:

? cond1
    then1
|
        else1

the indentation is just bad programming style, but not invalid.

Block repeat and terminate

Within a block, the current iteration can be repeated or terminated by the operators ?^ (continue or repeat, go up to loop begin) and ?> (break). It was tempting to systematically use ?< for repeat, but experience showed that when looking for errors, ?< and ?> are not easily distinguished.

Thus we have:

x =: 1.0
?*
  z =: (a/x - x) / 2.0
  ? z < epsilon
     ?>
  x =+ z

As the break digraph is the only one on a line, a multi-level break could be indicated by having two or more in a line:

?*                  \ first level loop
    ...     
    ?*              \ second level loop
        ...
        ?>          \ exit inner loop, and repeat outer loop
        ...
        ?> ?>       \ exit both loops

This is not yet implemented, and if the same is useful for repeats, had not been evaluated.

Scan guards and scan functions

A standard guarded repeat like this:

i =: 1
?* i < 10
    print i*i
    i =+ 1

is a bit verbose and error prone because the loop key must be repeatedly written.

A scan guard allows to write the same loop as:

?# i =: from 1 upto 10
    print i*i

An assignment with a function at the right hand side is used instead of a boolean expression, and scan functions with specific side effects (similar to a generators, a variant of a coroutines) control the scan as follows:

A special global variable, the scan context, denoted as $ (formerly as $# to resemble ?#) is set to void, the right hand side evaluated, and the return value assigned. Then the body is processed, unless the scan context is void which terminates the loop.

The scan function saves its context required to continue when called again using the scan context variable $:

from (first) upto (last):
    ? $ = ()            \ first call?
        $ =: first      \ set context 
    |
        $ =+ 1          \ else advance
    rv =: $         \ value to be returned
    ? rv > last & rv < first
        $ =: ()     \ done, clear context variable
    :> rv               

Any function, called directly or indirectly during the evaluation of the right hand side of the expression, may get and set the value of the context variable using the symbol $.

The returned value is assigned to the variable even if void scan context indicates end of loop and the loop body is skipped. This is needed for fault items that terminate a loop prematurely.

The scan context is saved to a hidden local variable each time the assignment is evaluated, thus scans can be nested to any depth allowed by the stack. Using the scan context within a scan body changes the scan context of the current function, allowing scan functions to use scan functions itself, which is often used.

Note that the above scan function works for all kinds of numbers, as long as they are of the same kind (mixing rationals and integers is ok), so you may use:

?# f =: from 1.0 upto 2.0 step 0.1      # reals
    ...
?# q =: from 1/10 upto 1 step 1/10      # rationals
    ...

Scan functions are often used to enumerate the keys (indices) for rows and maps, e.g. for rows as folllows:

row (r) give keys:
    ? $ = ()                \ 1st call
        $ =: r.first
    |
        $ =+ 1
    ?* $ <= r.last
        ? r[$] ~= ()        \ non-void cell found
            :> $            \ return its key
        $ =+ 1              \ next key
    $ =: ()                 \ scan termination
    :> ()                   \ yields nothing

See below for more information about library scans for rows and maps.

The parameters of a scan function may be arbitrary expressions including function calls, that are evaluated each time the loop starts the next round. because the whole right hand side is evaluated just like any other assignment:

?# j =: 1 + row r give keys
    ...

To provide the values of rows, maps or tuples, you may use a wrapper function or a more than simple scan assignment, e.g.:

    map (m) give values:
        :> m{map m give keys}
    ...
    ?# v =: r[row r give keys]      \
        ...
The '... give values' functions in the standard library
are just the above wrappers; they are provided just for
convenience and clarity of code.

Using functions as arguments of the primary scan function often is very inefficient. The next example extracts all keys and sorts them for every round of the loop; also it works only fortuitously because the scan function row () give keys currntly just enumerates integers and applies them to whatever row is presented:

?# k =: row (sort (map m keys) ascending) give keys
    ...

It consumes several seconds instead of milliseconds for a map with 1000 entries. Better use:

mk =: sort map m keys ascending
?# k =: give mk values
    ...

Thus it is recommended to use only a single scan function with variables and literals as parameters. Enforcing this seems too restrictive. Note that subexpressions are calculated to temporary variables anyhow, thus using them explicitely does not cost any extra time or space, and often is much clearer.

Leaving the loop body by a return or break is normal and supported; but the scan function will not be notified; its whole context is in the scan context $, which will be released by the loop exit. Global variables in scan functions have not yet been used and are not expected to be useful at all.

Because scan functions are lightweight coroutines, several sequences may be generated, e.g. Fibonacci numbers. A tuple is used to hold two values in the context variable:

give Fibonacci numbers upto (n):
    $ =~ 0, 1               \ initial tuple if first call
    r =: $.1 + $.2          \ next number
    $ =: $.2, r             \ save previous and current value
    ? r >= n                        
        $ =: ()             \ end of scan
    :> r                        \ return next number in any case

An example for a scan in the body of a scan function (a row with a variable field is used as scan context):

give primes upto (n) :
    ? $ = ()                    \ first call
        $ =: [n]                \ table preset with void
        $.idx =: 3              \ start here at next call
        :> 2                    \ suppply first prime 
    p =: $.idx                  \ continue here
    ?* $[p] ~= ()               \ skip non-primes, always finds void
        p =+ 2                  
    ? p > n                     \ done?
        $ =: ()                 \ yes, clear scan context
        :> ()
    $.idx =: p + 2              \ prime found, save continue position
    ?# j =: from p*p upto n step 2*p    
        $[j] =: p              \ mark all multiples as non-prime
    :> p

Caveat

Currently, the loop context may not be used in the assignment starting the loop, it must be set aside before, which does not cost any performance:

m =: $
?# i =: give keys m
    ....

Check the documentation if this error has been fixed.

Because the scan function is called for each iteration, general wrapper functions that make a scan function out of an ordinary function are possible:

until void (val):
    $ =: val                
    :> val
 
until zero (val):
    $ =: val
    ? val ?= 0          \ not equal if not a number
        $ =: ()
    :> val

Two examples, exploiting that parameters are evaluted each time:

?# i =: until zero string 'abcx123x908' locate 'x' from i+1
    print i         \ prints 4 and 8
 
expand tabs in string (line):
    ?#  i =: until zero locate string '&tab;' in line
        r =: 1 + $tabset - (i %% $tabset) 
        line =: string line replace character '&tab' by string ' ' times i
    :> line 

Using a scan function as argument for these wrappers is undefined, because of the competing use of the scan context; in the best case, one of the scan functions will fail because it checks the scan context for plausibility:

\ result is undefined:
?# x =: while positive from 5 downto 0
    print x

For user provided scan functions, a check may use the function reference to itself:

protected scan (x):
    this =: `protected scan ()`
    ? $ = ()
      $ =: this, x
    ! $.1 = this
    :> x

The bigraph @# is reserved for a function reference to the current function, but will likely never be implemented.

Scan functions are also used to provide lines from files, until the end of the file is reached:

?# line =: file () give lines    \ stdin for void, void at EOF
    ? line.isfault
        .. error processing
    print line

The single $ does not conflict with global variables, because it is starts the global variable name without a blank following (and it's not an operator).

The use of a global variable and the fact that scan functions are not syntactically marked may be considered dangerous; see alternatives for more information.

Row, Map and Tuple scans

To print all element values of a Scans for rows, tuples and maps, a range scan may be used, as all keys are integer numbers:

?# i =: from r.first upto r.last
    print r[i]

Maps may have any item as key and have no inherent ordering of keys, hence a library function is required to obtain the set of keys actually used (in undefined order):

?# i =: map m give keys
    ! m{i} ~= ()
    print m{i}

As an entry is removed by assigning void, all elements with the set keys supplied are not void at that time. Correspondingly, the standard key scan for a row also delivers only those keys that are not void:

?# i =: row r give keys
    ! r[i] != 0
    ...

while the above range scan (from r.first upto r.last) may deliver keys with void values.

Although it might be useful to protect a row against changes in a scan loop, this proved be too restrictive in general.

In contrast, the map scan always first saves all keys (key references) in an internal row, and then supplies the row values:

map (m) give keys:
    ? $ = ()
        $ =: map m keys         \ as a row
        $.i =: 0
    $.i =+ 1
    ? $.i > $.last
        $ =: ()                 \ should be totally empty
        :> ()
    rv =: $[$.i]
    $[$.i] =: ()                \ no longer used, chance to free
    :> rv

Because the implemetation uses a LRU queue for the keys, the actual order may change even by reading, thus there is no chance to internally use the n-th key in the queue. Note the second-to-last statement, that sets the row entry void, so that the item can be freed early.

The standard row scan is fairly simple :

row (r) give keys:
    ? $ = ()                \ 1st call
        $ =: r.first        \ scan context is next key
 
\ find next non-void entry
?* $ <= r.last
    ? r[$] ~= ()        \ non-void cell found
        :> $            \ return its key
    $ =+ 1              \ next key
$ =: ()                 \ scan termination
:> ()                   \ yields nothing

Thus only keys for actually non-void elements are supplied in ascending order; if a not yet scanned element is set, it will be found; if it is voided, it will be skipped; also, if elements are added at the end, because the actual .last attribute is used.

The order of keys retrieved for a map is undefined, meaning that they can even change with each access, and while it is normally deterministic, any program depending on a specific order is incorrect. It is, however, guaranteed that each key is supplied, as long as the map is not changed in the loop.

To have the set of keys for a map scan stable during the scan, it is saved in row at the start of the scan; so keys added in the scan loop body are missed, and those removed before delivered skipped:

map (m) give keys:
    ? $ = ()                    \ initial call?
        $ =: map m keys         \ row of map keys
        $.idx =: $.first        \ next one to deliver
    ?* $.idx <= $.last          \ try until end of row
        key =: $[$.idx]         \ get key
        $.idx =+ 1              \ and advance
        ? m{key} ~= ()          \ key has value?
            :> key              \ then give it
    $ =: ()                     \ done
    :> ()                       

Note that the following functions are bad:

map (m) give keys ineffciently:
    \ very inefficient: creates the row each call
    :> row (map m keys) give keys   
map (m) give keys fails:
    \ scan function recieves already set context
    $ =~ map m keys
    :> row ($) give keys

For both, additional scans are provided for convenience:

row (r) give values:
    i =: row r give keys
    :> r[i]
map (m) give values:
    k =: map m give keys
    :> m{k}

Sometimes rows are used as sparse vectors and filled incrementally; thus a scan to provide the keys of the void elements is useful:

row (r) give keys for void:
    $ =~ r.first - 1        \ same as ? $ = () : $ =: r.first - 1
    ?* $ < r.last
        $ =+ 1
        ? r[$] = ()
            :> $
    $ =: ()
    :> ()

A row scan that detects changes during the scan might read:

row (r) give keys safe:
    ? $ = ()
        $ =: r.first, r.Revisions   \ key, revision counter
    ! r.Revisions = $.2
    idx =: $.1
    \ search next nonvoid
    ?* r[idx] = () 
        ? idx > r.last
            $ = ()
            :> idx
        idx =+ 1
    $ =: idx, $.2
    :> idx

It is reasonable to nest functions calls to get the keys or values of maps sorted, as in:

x =: sort map m values unstable ascending
y =: row map m keys sort descending

Note that sorting a row is done in-place.

For convenience and to avoid the extra line, there are many scan functions to provide keys and values sorted; although not all combinations are covered.

Returning bunches in scans

A scan function may well return a row, map or tuple, but this may give unexpected results if the context variable is set to a map which is also returned as the result, as in the following example of supplying fibonacci numbers in pairs:

give fibo pairs upto (n):
    ? $ = ()
        $ =: {}
        $.a =: 1
        $.b =: 1
        :> $
    t =: $.a + $.b
    ? t > 10
        $ = ()
        :> ()
    $.a =: $.b
    $.b =: t
    :> $
test printing:
    ?# m =: give fibo pairs upto 10
        print m.a __ m.b
test saving:
    r =: []
    ?# m =: give fibo pairs upto 10
        r[] =: m
    ?# i =: r.scan
        m =: r[i]
        print i _ ': ' _ m.a __ m.b
main (parms):+
    test printing
    test saving

The result will be:

1 1
1 2
2 3
3 5
5 8
1: 5 8
2: 5 8
3: 5 8
4: 5 8
5: 5 8

So in the second scan, only the last pair is shown. The reason is that the scan returns always a reference to the same map, so the row cells all refer to the same map and print the same (last) values.

Scans returning immutable items like tuples, numbers and strings do not have this problem, as the items are created afresh each round. The above could be repaired by returing a copy of the map:

:> map $ copy

This phenomenon is one of the reasons why tuples are conceptually immutable; however, the following code works, because not the tuple elements are overwritten, but the tuple replaced by a new one:

give fibo pairs upto (n):
    ? $ = ()
        $ =: 1,1 
        :> $
    t =: $[1] + $[2]
    ? t > 10
        $ = ()
        :> ()
    \ $[1] =: $[2]; $[2] =: t is not allowed
    \ instead, create a new tuple
    $ =: $[2], t
    :> $

Unless the returned tuples are saved in a row, they are freed after print (more accurate: when the value of the loop variable is overwritten from the scan function and its use count goes to zero).

2.8. Operators

Parenthesis may be used as usual to group subexpressions or lists; extra parenthesis are just redundant.

The comma (,) is used as a list element separator, and white space to separate sequence elements.

Note that in complex symbols composed of more than one special character these must follow without intervening spaces.

Execution operators:

?       Guard (if)
|?      Guard alternative (elif)
|       Guard default (else)
?*      Loop 
?#      Scan (scans, iterations)
??      Error Guard
?|      Error catch
=:      Assignment
$       Global variable name prefix
#       Named literal (global) prefix
!       assertion or constraint
:>      return function value
?^      loop repeat
?>      loop break
<~ ~>   coroutine yield or receive
:       functions
:(      reference call
::      bunch function definition
.:      bunch function call

Instead of special characters, (reserved) words may be used:

?*  while
?   if
|?  elif
|   else
?*  loop
?#  scan
?^  continue
?>  break
??  try
?|  catch
!   assert
:>  return

Note that the character for guard alternative and guard default are the first one on a line and thus cannot be confused with a logical operator.

Arithmetic expressions

Arithmetic operators are in ascending priority:

+   addition
-   subtraction
*   multiplication
/   division for approximated (floating point) or integer numbers for rational result 
//  division for accurate numbers (integer and rational, (see [[Integral Numbers]] for variants)
%%  integer modulus (see [[Integral Numbers]] for variants
^   exponentiation 
-   (unary) sign change
##  (unary) convert to floating point (also an attribute and a function)

See below for item attributes or functions to obtain the absolute value or the sign of numeric item.

The operators accept either exact or floating point numbers, not mixed. To combine exact and floating point numbers, the exact number has to be converted to floating point using the ## operator, the .float attribute or the float() function. (The single # would conflict with named literals.) Note that the .float attribute cannot be applied to integer literals (also named literals). The ##, although ugly, has the advantage of the high precedence of a unary operator, using the float () function often requires parenthesis (except the last expression):

print ##3 * 2.0
print (float 3) * 2.0   \ ok
print 2.0 * float 3     \ ok
print float 3 * 2.0     \ runtime error: print float (3 * 2.0)
print #3, 2.0
print (float 3), 2.0    \ ok
print float 3, 2.0      \ runtime error: print float (3, 2.0)

The last line might work in future, if the float () function applied to a tuple returns a tuple of floats.

Exponentiation for approximate (floating point) numbers uses the standard C library. Otherwise, with an integer or rational base and positive integer exponent, either a shift (base 2, exponent <32) or square-and-multiply is used. All other are not allowed; neither rational exponents nor negative integer exponents. Although not conforming to the strict rule that accurate numbers are never converted automatically to approximate ones, the exponentiation of a float with an integer (not a rational) may be tolerated and the exponent silently converted to a float (not yet implemented).

Ideally, a void operand in an arithmetic expression would be a fatal runtime error. However, when counting elements of a map, under this regime the coding has to be e.g.

? map{idx} ~= ()
    map{idx} =+ 1

so that the map is searched always one more time just to find out if the keyed value is void (even if the last used key is cached). It seems not too dangerous to treat void references as the number 0 in addition and subtraction, as an added zero is neutral. However, in multiplications a zero operand is dominant, so a void reference is not treated as zero, as in all other arithmetic operations. Adding two voids is a zero integer, so that an arithmetic operation always returns a number, as is the negation, and the conversion of a void reference to floating point numbers.

The consequence is that conversions from integer to string must return a fault item, although a void would be sufficient to indicate that the conversion failed. But this may lead to hard to detect errors, if bad number strings are just treated as zero, because comparing the result to void has be forgotten. So there are two variants:

integer from string (x)           \ returns fault item if faulty
integer from string (x) else (y)  \ if not an integer, return y

The programmer may still use a void as default, as in:

x =: integer from string str else ()

to return void on errors, and an integer number otherwise.

The combined assignment and calculation is defined for these:

=+  Add rhs to lhs
=-  Subtract rhs from lhs
=*  Multiply lhs by rhs (any number kind)
=/  Divide lhs by rhs (only floating point numbers)

The symbols are intentionally inverted over the form known from the C language. Note that all assignments are digraphs starting with an equals sign, while all comparisons that concern include equality end with an equals sign.

There is no equivalent to the famous ++x or x++ construct, which provides the value of x with a side effect; the increment (or decrement) assignment x =+ 1 must be used as a statement of its own; it also allows any step. Clarity of programming seems to be better this way.

The shortcut =^ for exponentiation is not supported, as it is assumed that it is not used often enough. Also, the shortcut =% for divide by rhs and supply the remainder is not supported, due to the three different version of remainder calculation; consequently, the =/ operator is limited to floating point numbers.

The shortcuts =& and =| are reserved for boolean assignments, even if not yet implemented by most compilers.

Note that no arithmetic or boolean operator starts with an equals sign, while all assignments do so.

Boolean expressions

Very similar to arithmetic expressions are boolean expressions, using these comparison operators:

=   equal
~=  not equal ( 'a ~= b' ≡ '~(a = b)' )
<   less
>   greater
<=  not greater ≡ less or equal (see below)
>=  not less ≡ greater or equal (see below)
.=  equal kind
~.  unequal kind
?=  guarded equal (of the same kind and equal)
~?  guarded unequal ( a ~? b ≡ ~(a ?= b)
@=  equal tag
~@  unequal tag

Note that the tilde is always prepended the symbol that, when followed by an equals sign, gives the comparison that is negated, while assignments always start with an equals sign, followed by one additional sign.

There are no overlaps with arithmetic operators, logical operators and assignment statements. (An overlap with the assignment digraphs is anyhow not possible, as the target of an assignment is not an expression.)

The comparison operator = (equal) first checks if the references are equal; if they are (also if both are void), the operator yields true without further inspection of the items.

If one is void and the other one not, it yields false in any case, so any reference can be compared to the void reference with any operator.

If both operands are of the same kind, only numbers, strings or booleans are compared by their value.

Comparing numbers, the rules are:

It might be surprising that equality is banned for floating point numbers, although the tests >= and <= are allowed, so that x <= y & x >= y could be used to replace the restriction; this, however, clearly indicates a rather suspicious case.

The reserved global variable $SYSTEM provides the field $SYSTEM.epsilon, which is the smallest difference between 1.0 and the next number, so that you might to terminate an iteration by

ep =: (a - x) / a
? ep < $SYSTEM.epsilon
    ?>

Note that the order comparisons <, >, <= and >= can be done with only one of them, plus negation:

a<b ≡ b>a
a<=b ≡ ~(a>b)
a>=b ≡ ~(a<b) ≡ ~(b>a)

So all four are done using > (greater) only; although the digraph suggests it, no test for equality is done for >= and <=. And the above trick to compare floats for equality, using x <= y & x >= y, is equivalent to ~(y > x | x > y), so there is still no equality compaison generated.

Item references of other kinds than integer, rational and approximated (float) require that same kind of both operands; otherwise a fatal runtime error occurs.

Equality comparison of references of the same kind are allowed and check the reference only; no contents is compared.

The fact that comparisons can crash the program might be regarded unfortunate; in cases where intentionally two items shall be compared if of the same kind, an yielding false otherwise, the code would be a bit cumbersome:

? x .= y
    ? x = y
        \ true case
    |
        \ otherwise case

The guarded equality ?= provides exactly this; it first checks that the kinds are equal; if not, the result is false. Otherwise, the item of the same kind are compared.

Admittedly the double question mark in

? x ?= 0

is a bit hard to read, but guarded comparisons are not quite often needed.

The equal kind (.=) comparison allows to use a prototype, as in:

x.field .= ''

This would otherwise require stacked field names and fields of literals, which are not supported by all compilers:

x.field.kind = ''.kind

As the attribute notation might be less clear, functions are provided to determine the kind of an item:

is () void:
is () integer:
is () rational:
is () float:
is () exact:        \ integer or rational
is () numeric:      \ integer, rational or float
is () string:
is () bunch:        \ map or row
is () map:
is () row:
is () chunk:
is () fault:

Note that the long text is is () a variable of kind xxx, thus the form is () fault instead of is () faulty.

The equal tag (@=) comparison is true if both sides have the same tag attribute, i.e. x @= y is the same as x.tag = y.tag. As the tag attribute is always a string, this comparision is always possilble.

It is normally used in assertions and constraints to avoid stacked field and attribute notation, as would be the check in a linked list:

x.next.tag = x.tag
x.next @= x

Instead of a bunch a string may be used, so the two following lines are equivalent:

x @= 'tagname'
x.tag = 'tagname'

There is no operator to check if two references are equal; comparing e.g. two integer references this way has no sensible applications.

There is a long standing discussion whether to use the single equals sign for assignment or for comparison. The extended diagnostics in modern C compilers as well as my personal experience in programming C leads to the conclusion that numerous hours have been wasted in C programming and debugging, and not at least having programmed in ALGOL 60 and PASCAL, the equals sign is for equality test, and the assignment uses the mirrored ALGOL colon-equals (=:) for consistency as assignment symbol.

It might be possible to relax this strict rule. The single equals could be used for both, assignment and comparison, because the assignment is not an operator that yields a value, and boolean expressions are not possible at the left hand side of an assignment. Further study would be required, and I will be very reluctant to do so, as the current regime stands for clarity. Any unnecessary ruling is a source for errors, and writing the assignment digraph instead of a simple equals sign is not at all a real burden.

Void references and fault items in expressions

As the void reference refers to nothing, it is something like the empty set and has any and no property at the same time.

If it were forbidden as an operand for any operator in an expression and could only be assigned to a variable in order to void its reference, this would have been a sensible decision.

But for pragmatic reasons, the void reference as operand for an arithmetic operator always is treated as a zero for additions and subtractions, and as an empty string for a string operator. Even if both operands are the void reference, the result is a number (for addition and subtraction) or a string (for catenation).

Maps and rows are dynamic and provide a void reference for non-existent or erased elements, which eases programming a lot, in particular as dealing with buffer boundaries is known to be error prone.

If then the void value were not allowed in arithmetic operations, the use of map elements as counters would require tedious programming for counting occurrencesof strings:

x =: arry{str}
? x = ()
    x =: 1
|
    x =+ 1

instead of

arry{str} =+ 1

Moreover, if an element (row or map) is requested for a void variable, a void reference is provided. The programmer still can check first if the bunch exists before to attempt to obtain an element???? HOW?

The scheme chosen is simple and ensures that the result of an operator always provides the expected kind of item (or a fault item). Note that other extensions, like treating the number 1 as true, or a non-empty string as 1 in a multiplication, was rejected in order to keep the rules simple.

As TUF-PL does not have exceptions, but instead uses fault items, a function within an expression may return a fault item instead of an item of the expected kind, in particular with operators with two operands and the case that two fault items are the operands. If only one or the only one is a fault item, it is simple just supply that fault item as result. If both are fault items, there is no sensible rule which one to pass, and what to do with the other one, as fault items must be acknowledged before being discarded.

Thus each function return is tested for a fault item, and if it is, skips all further expression evaluation and immediately returns the fault item as the function result.

Conditional and logical expressions

In a logical expression, the following boolean operators may be used:

&   for logical and
|   for logical or
~   for logical negation
=   for logical equivalence
~=  for logical antivalence (exclusive or)

The implication must be expressed using:

~x | y  for the implication x→y

The terms in logical operators can be either comparisons, or variables and functions providing an item of the boolean kind (or void, see below).

As with all other expressions, all operands, including function calls that might have side effects, are evaluated, before the logical operator determines the result.

The C programming language defined a special property of the logical and (&&) — which since has often been considered advantageous — that makes the logical and an exceptional case. For any other operator, both sides must be determined before the result can be obtained. Only for the logical and the result is already known to be false if one operand delivers a false value, and C specifies that in this case the second operand is not evaluated at all..

This allows to write in C:

    if (x != 0 && x.field > 0)
        ....
instead of 
    if (x != 0 ? x.field > 0 : 0)

or, better readable

if (x != 0)
    if (x.field > 0) {
        ....
    }

to clearly indicate a conditional evaluation.

The major disadvantage is the inconsistency, in particular that the operator is no longer commutative, and the only advantage is more compact code on the expense of creating a special case. It may have its merits in C, because it is often necessary to check if a pointer is null, before taking a field value, as otherwise the programme would crash. This is not necessary in TUF-PL, where the field of a void is just void, and void counts as zero (and false) in arithmetic and comparisons. Other programming languages have it intentionally undefined, witch is no remedy, as still the logical and needs special attention.

Thus, in TUF-PL, in all expressions all operands are evaluated; if this in no desired, nested conditions must be used (or for some common cases the guarded comparisons, which clearly flag the points of difference.).

Complex cases should be broken down into smaller pieces anyhow; efficiency is not gained by complex expressions, only unreliability.

Also, the use of scans often makes the structure clear, as it separates the loop conditions from the loop actions.

Finally, a conditional expression is very often used to avoid using a void reference, e.g. to obtain a field, an element of a map, etc. Originally, all those operations were not allowed for a void reference, and lead to a fatal runtime error, so that they must be guarded. However, this has been changed to a more liberal regime: Nearly every (read) operation is possible with a void reference, and returns a void reference. E.g. getting an element of a row for a variable that contains a void reference will give a void reference, which is the same as would be given if a key for a non-existing element of an existing row would be given.

To avoid using the integer numbers 0 and 1 or the strings true and false — even if memory space is the same — for clarity, a boolean item can be used, of which only two exist, namely true and false, denoted as ?+ and ?-.

As mentioned, void is considered false, if in a boolean context.

A variable or function result may used in a logical expression, but only if returns a boolean item.

This leads to the following pattern for a function returning a boolean result:

#Letters = 'abcdefghikjklmnopqrstuvwxyz'
string (inp) is a word:
    n =: count in string inp from 1 many of #Letters
    ? n = inp.count
        :> ?+
    :> ?-
some function:
    ? string 'word' is a word
        ... yes

In order to avoid the necessity to assign unset attributes a false value, the void reference is accepted as a false logical value. No other value is a boolean false, in particular not a number zero, an empty string, etc.

Note that the boolean false item is not the void reference, thus a function can return a three-valued logic of true, false and void, the latter something like unknown:

res =: some function
? res = ()      \ check for void first, as void is treated as false
    ... not applicable here, etc
? res           \ = ?+
    ... yes, positive answer
|
    ... no, definite negative answer

Although the boolean items can be assigned to a variable, the current compiler does not allow to assign or return a boolean expression; it is simply not implemented yet.

The unary boolean operator has very high precedence, in order to bind tightly to variables, thus, the negation of a comparison must be written with parenthesis:

? ~(i < 5)
? ~ check something     \ not needed for a single function call

There are no operators for the other possible boolean two-valued function, e.g. the implication, which must be expressed as ~a | b, etc.

As it is often necessry to check if a function result is a fault item, the attribute .isfault may be used; the follwing lines are equivalent:

? x.isfault
? is x fault
? x.kind = 'fault'

Because there are only two boolean values, these are statically allocated. If you want to have longer names, you can use named literals:

#True = ?+      \ or even #T = ?+
#False = ?-

Avoid global variables, as these must be cleared before termination.

String expressions

The primary String operators are:

=   equal
~=  not equal
_   catenation
__  catenation with a blank between, i.e. '_ ' ' _' 
=_  string append

The choice of the full stop for catenation would not only be conflicting with the attribute and field notation, it also does not help reading a line with mixed variables and string constants.

Some languages allow interpolation of strings, such that variables are marked inside and automatically replaced once the string is used. Such a mechanism is often use in localisation of error messages:

File &filename invalid
Ungültige Datei &filename

As the ampersand (&) is the only character with a special function inside a string, and everything that is not a valid (and known) HTML entity (ending in a semicolon) is left in the string unchanged, it can nicely be used as a parameter indicator.

Using the the replace function of the standard library provides a solution that is flexible, fairly usable and does not need extensions:

msg =: "File &filename. invalid"
msg =: replace each '&filename.' by fn in string msg
msg =: "Error copying from &source. to &target."
msg =: string msg replace ('&source.', srcfn), ('&target.', tgtfn) 

The programmer is free to choose the parameter syntax, e.g.:

msg =: "Error copying from «source» to «target»"
msg =: string msg replace multiple ('«source»', srcfn), ('«target»', tgtfn)

The multiple replacement function has the advantage of strict left-to-right processing, i.e. a replacement is final.

String catenation can also handle numbers, and converts them using a standard format without leading or trailing blanks. Thus, one can write:

p =+ 1
print 'p=' _ p

Note that the catenation operator _ just returns the other operand if one operand is void or the empty string, so it is perfecty proper to rely on the automatic conversion in

v =: 10
str =: '' _ v 

For debugging, often the following patterns are used:

debug print 'p="' _ p _ '" '
debug print 'ary{' _ ai _ '}="' _ ary{ai} _'" '

It would be nice to write instead

debug print ?p
debug print ?ary{ai} 

This could be supported by a Macroprocessor, because the current compiler structure parses expressions without setting aside the source text needed here.

String comparison

Strings are designed as rows of UTF code point numbers, thus string comparison concerns code points.

For equality, as the UTF-8 encoding according to the standard is unique, a common byte-by-byte comparison is done and works as expected: If two strings are equal, all corresponding code points are equal, and vice versa. If the lengths differ, the shorter one is smaller. This is also true if raw strings are compared to non-raw strings; they are always not equal (unless the string is marked falsely raw). As this comparison is presumed to happen seldom, there is no check for rawness yielding false whenever raw is compared to not raw.

Thanks to the excellent UTF-8 coding scheme, order comparisons, e.g. greater or less, can be done on a byte-by-byte basis, yielding a result as if the code points were first expanded to integers and then compared. This is the way the order comparison (< etc) works in TUF-PL, comparing code points, and thus is independent of the locale environment. However, order comparisons of strings marked as raw with UTF-8 encoded strings are highly suspicious and often cause unintended results. Thus, such an order comparison is a runtime error. Order comparisons of both raw strings are not an error and done bytewise, and also if one or both strings are known to be pure ASCII. Note that the rawness of strings is marked at creation and never changes because a string is immutable.

If comparing strings according to a locale is desired, a library function must be used:

string (str1) compare to (str2) under locale (locale):
\ returns -1 for less, 0 for equal and +1 for greater
string (str1) compare to (str2) under locale (locale) case ignored:
\ same as before, but letter case ignored, i.e. A=a

If the locale is void, the current locale is used.

If strings are compared several times to each other, the locale controlled comparison each time transforms the source strings such that the transformed strings, when compared bytewise, deliver the desired result; if e.g. two strings should be compared ignoring case, all small letters are transformed to upper case, and the results are compared. This intermediate string can be obtained by the function

transform string (str) under locale (locale):
\ returns a raw string that can be compared directly

The resulting string is marked raw, because its contents is undefined for any other purpose than comparing such strings. (It seems not necessary to provide a special flag just for this purpose.) If the locale is void, the current locale is used, and the resulting string may be different in further calls.

While it would be nice to have the current locale stored automatically as an attribute for a string, this might be useful in very few cases only, so the overhead is not justified. In these cases, a bunch may be used to associate the locale to a string.

There may be a function to add an attribute to a copy of a string; this requires a copy of the string, as strings are immutable, including their attributes:

copy (str) adding attribute (attr) value (val):
\ create a copy of a string, and add a user attribute 

Strings cannot be compared to numbers, rows, or the void reference; any attempt to do so gives a run time error.

As with arithmetic, the shorthand s1 =_ s2 is the same as s1 = s1 _ s2. Some time in far future the compiler and run time may use preallocated strings (string buffers) to speed up the process.

To prefix a string, explicit assignment must and can be used:

x =: p _ x

As strings are immutable, a copy is created anyhow and finally assigned, so there is no overlap problem.

No operators (only functions) are provided to access parts of a string, except the single character at a specific location, using row notation, which returns a string with a single character, not the integer equivalent of the character (nor Unicode code point). This notation cannot be used as a target of an assignment, thus to change a single character of a string, normally it is combined from the head before, the string with the new character, and the tail after it. A library routine

replace in string (str) character (pos) by string (repl):

will do so more efficiently, in particular if the strings are large.

Bit operations

As TUF-PL has the concept of arbitrary precision integers, bit operations on negative integers are conceptually difficult, as the number of bits to be used is not fixed. Furthermore, the use of integer numbers as rows of bits is neither intended nor efficient in TUF-PL.

It is, however, often required to test if a number is odd, i.e. its remainder modulo 2 is 1:

is (x) odd:
    ? x %% 2 = 1
        :> ?+
    :> ?-

This extends to any bit :

is bit (b) set in (x):
    ? x %% 2^b = 1
        :> ?+
    :> ?-

Both functions should not be used for negative integers, although they currently behave as if in two's complement form, also for big integers.

Boolean and and or can be programmed in TUF-PL as long as the numbers are not negative, using the remainder of a division by 2 as bit extractor in a straightforward (inefficient) way:

binary (x) and (y) :
    ! x >= 0 & y >= 0
    z =: 0
    ?* x > 0 & y > 0
        z =: z * 2 
        bx =: x %% 2
        by =: y %% 2
        ? bx > 0 & by > 0
            z =+ 1
        x =: x // 2
        y =: y // 2
    :> z
binary (x) or (y) :
    ! x >= 0 & y >= 0
    z =: 0
    ?* x > 0 | y > 0
        z =: z * 2          
        bx =: x %% 2
        by =: y %% 2
        ? bx > 0 | by > 0
            z =+ 1
        x =: x // 2
        y =: y // 2
    :> z

Because in the commonly used two's-complement representation a negative number is the bit inversion plus one, the bit inversion is the negative minus one:

binary (x) inverted:
    :> -x - 1
binary (x) negated:
    :> 1 + binary x inverted

This extends to the exclusive or:

binary (x) xor (y):
    a =: binary x and y
    b =: binary x or y
    :> binary a and (binary b inverted)

Some bit operations can be conveniently expressed by arithmetic operations, in particular shifting by division or multiplication of two and its powers3.

For the classical boolean bit operations there are library functions (to be implemented):

binary (x) and (y)
binary (x) or (y)
binary (x) xor (y)
binary (x) invert

Two additional functions allow to count bits:

integer (x) bit count
math binary log (x)     \ = ld x, e.g. ld 3 = 2, ld 4 = 3, ld 0 = 0

To access the bits as a row of bits (zero origin):

integer (x) as row of bits:
    bits =: math binary log x
    res =: [bits]
    ?# i =: from 0 upto bits-1
        ? is bit i set in x
            res[i] =: 1
        |
            res[i] =: 0
    :> res
 
row of bits (y) as integer:
    res =: 0
    ?# i =: from 0 upto res.last-1
        ? res[i] = 1
            res =+ 1
        res *= 2
    :> res

The first one creates a row of integer values 0 or 1 from the integer number, keyed starting from 0, the highest key corresponding to the highest bit set:

x =: 257
y =: integer (i) as row of bits
! y.first = 0 & y.last = 7 & y[0] = 1
z =: integer y as row of bits y
! z = 257

Thus, count integer (x) as row of bits returns the number of bits in i, and last (integer (x) as row of bits).last the highest bit number (zero based).

Operator precedence

The operator precedence just gives the order of evaluation in an expression, but does not ensure that only combinations are found that do not generate fatal errors. In particular, the various kinds are mostly one-way: the string catenation will convert numbers to string, but the numeric operators do not process strings. Comparisons and boolean operators generate boolean values, but only the catenation operator, again, will convert them back to strings.

Some binary operators of equal precedence bind from left to right, i.e. a+b+c = (a+b)+c and a-b-c = (a-b)-c; these are (not allowed for all others):

| & + - *

Because the division slash is not in the list, neither a/b/c nor a/b*c nor a*b/c are allowed, although the latter expression is independent of order (note that there is no integer overflow in TUF-PL).

Note that the symbols for the statement types, i.e. guard (?) etc, as well as assignment symbols like =:, =+ are not operators.

Operators (except comparison) in ascending precedence (b=binary, u=unary):

1   b  |
2   b  &
3   b  < > <= >= = ~=
4   b  _
5   b  + -
6   b  * / // %% /% etc
7   u  - ^ ~

No operators are literal denoting symbols, i.e. quote, accent and tic, as these have closing correspondents. Also the digraphs =:, :: etc, and the function template terminators are not operators. Note that $ is a special variable with context-dependent references and not an operator.

Map and row keying as well as field and attribute notation (the dot) is not an operator; it is part of a variable denotation.

The (round) parenthesis for grouping and priority change are not regarded as operators.

Bunch comparisons

For bunch comparisons, no operator is provided, as the comparison is too much dependent on the application. However, equality comparison with the void reference is allowed as with any other item. Comparison with any other reference result in an run time error; while equal references imply equal contents, the opposite is not true, so the comparison has been banned. If the programmer is unsure, the kind can be tested before.

Some attributes, i.e. the number of elements, could be compared directly:

? r1.count = r2.count

A function to compare two rows might read as follows:

compare bunch (r) to (s):   \ true if same number and same kind
                            \ run time error if the elements
                            \ with the same key have different kind
  ? r.kind ~= s.kind        \ we are not comparing rows and maps
    :> ?-
  ? r.count ~= s.count      \ same number of elements?
    :> ?-                   \ no
  ?# i =: r.scan            \ give all keys of r
    ? r[i] ~= s[i]          \ fatal error if comparison not defined
        :> ?-               \ unequal
  :> ?+                     \ ok, all keyed fields are the same

2.9. Tuples, lists and tuple assignments

A list is a sequence of elements, separated by commas. It can be used:

In the fist case, a list creates a tuple from the list elements. The second case is described below as tuple assignments.

A tuple is a kind of item: an immutable row without fields:

a =: 1, 'x', math sin 5.0
! a.count = 3
! a[1] = 1
! a[2] = 'x'
! is a[3] float

Empty list elements are ignored; to create a cell with a void element, it must be explicitely written:

l =: 1, (), 3
! l.count = 3
! l[2] = ()

Tuples with only one element can be created, but are quite useless; except in this case, empty list elements are without effect:

x =: 0,
! x.count = 1
! x[1] = 0

To access elements, either the same notation as for a row, i.e. the key in square parenthesis, which may be a integer variable or literal, or field notation with a literal instead of a word:

t =: 1, 2
! t[1] = t.2 - 1

To call a function with a variable number of parameters, use a single argument and a tuple when calling it:

maximum of (tuple):
    ? is tuple tuple
        mx =: tuple[1]
        ?# i =: from 2 upto tuple.count
            ? mx < tuple[i]
                mx =: tuple[i]
        :> mx
    :> tuple
print maximum of (4, 6, 5)          \ prints 6

The print function accepts tuples too and separates the parameters by blanks (and converts numbers to strings):

print (arg):
    ? is arg tuple
        sep =: ''
        ?# v =: give arg values
            print sep _ v nonl
            sep =: ' '
        print ''
    ? is arg string
        print arg
    |
        print number arg as string
 
print 4, 5, 6

Because a list is composed of expressions, the comma stops expression scan and avoids to add an integer to a tuple:

print 4 + 5, 6                      \ prints 9

When calling a function, a list may be used as parameter, not just an expression, thus the second line is the equivalent of the first one:

print maximum of 4, 6, 5            \ prints 6
print maximum of (4, 6, 5)          \ prints 6
print (maximum of 4), 6, 5          \ prints 4

To allow a lists of variables in a function template as in:

funct (a, b, c):
    ...

might be usefule if a fixed number of similar parameters is required. As in most of these cases, a variable number of elements is more natural, this notation is not supported.

Quite often, a tuple is returned from a function:

do something:
    ...
    :> code, value
...
cv =: do something
? cv.1 = code_x
    process cv.2

As a tuple is immutable, a tuple element cannot be used on the left hand side of an expression; the whole tuple must be created again. For example, to rotate tuple elements or exchange two variables, write:

l =: l.2, l.3, l.1      
x, y =: y, x

A tuple may be used as a key for a map, because tuples can be tested for equality (see below):

m =: {}
m{1,3,5} =: 135

For an array, it could be used as keying a multi-dimensional array via the matrix library:

a =: new matrix 10,10
a[5,5] =: 55

Using a tuple as a key for a string for picking the indicated characters is considered sparely used, and might confused with a multidimensional array, so this is not supported.

Dynamic tuple creation

Tuples are conceptually immutable like literals, but in particular in libraries it may be necessary to create a tuple with elements determined at runtime.

For this purpose, there are library functions:

row (row) as tuple:
    \ create a tuple with all (even void) elements
new tuple (count) values (vals):
    \ new tuple with all cells preset
tuple (tuple) as row:
    \ create a row with all elements of (tuple)

The second funtion is in particular useful if a tuple of unity elements is required.

A simple library function allows to create a tuple from a part of a tuple:

tuple (tup) from (from) upto (upto):
    new =: row tup from from upto upto  \ Works for tuples too
    :> row new as tuple

It is provided just for clarity, as are the following ones:

tuple (tup) drop first:
    :> tuple tup from 2 upto tup.last
tuple (tup) drop last:
    :> tuple tup from 1 upto tup.last - 1
tuple (tup) prepend (item):
    r =: tuple tup as row
    ! r.first = 1
    r[0] =: item
    :> row r as tuple 
tuple (tup) append (item):
    r =: tuple tup as row
    r[] =: item
    :> row r as tuple   

Tuple equality

While rows and maps intentionally cannot be compared for equality due to their dynamic nature and unclear merge of fields, comparision of tuples is possible, as tuples are fixed size without fields; thus, the comparison simply must check for equal size, and with equal size do a guarded comparison, i.e. the same used for map keys.

Tuple assignments

Not yet implemented

A list of variables (or other assignment destinations) can be used on the left hand side of an assignment:

a, b, c =: 1, 'x', sin 5
! a = 1, b = 'x', c = sin 5  
x, y =: 0
! x = 0, y = 0

Yet the left hand side is not a tuple item, as this would require references to variables (or bunch elements, etc), it is just a list of assignment targets.

The result depends on the result of the right hand side expression, that is evaluted once:

Missing elements are treated as void, additional elements silently ignored.

The tuple assignment

a, b =: RHS

is a shortcut for:

t =: RHS    \ right hand side, tuple or other item
? is t tuple
    a =: t[1]
    b =: t[2]
|
    a =: t
    b =: t

Thus, in the following example, the random number is obtained only once:

a, b =: random next integer
! a = b

To exchange two values, a tuple assignment cat be use:

a, b =: b, a

is equivalent to

t =: b, a
a =: t[1]
b =: t[2]

It might be tempting to assign the last variable a tuple with the remaining elements if there are less destinations on the left hand side than tuple elements on the right hand side.

The disadvantage is that in most cases the last variable on the left hand side must be checked whether it is a tuple or not, and thus destroying the terseness of the construction. It would add complexity without proved gains in clarity and reliability.

3. Items

Items are integer numbers, rational numbers and floating point numbers, booleans, character strings, rows or maps, chunks or functions, etc.

3.1. Fields

Because attribute fields of items are used in the description of the various kinds of items, they are introduced first.

Fields concern information for items, named at compile time by words (letters followed by letters and numbers). The name of a field follows the item variable, attached by a point (full stop, .), e.g.:

var.kind

There are three types of fields:

The current compiler supports fields only with variables, not with expressions, thus stacked fields are not recognized. This may change in future.

Variable fields are available for mutable items only, in particular for rows and maps. They are variables bound to the actual item dynamically; they are not statically declared as in strongly typed languages. Organisational information can be attached to bunches (rows and maps) this way:

m =: {}             \ create a map
m.longest =: -1     \ keep track of longest entry
r =: []             \ create a row
r.max =: 0          \ to keep track of maximum value set so far

Retrieving a not yet set variable field is void, and any such field can be removed by setting it void; after that, it cannot be distinguished from a never set field.

The attempt to create a variable field for an immutable item is a fatal runtime error, while trying to obtain such a field is just void without error.

Attribute fields supply metadata of items; they cannot be directly modified. Each item (void is not an item) has at least the .kind attribute:

i =: 1
print i.kind        \ prints "integer"

For mutable items, attributes provide information about the actual state of the item (they can change over time); e.g. .count is the number of (non-void) elements in a row or map or the number of characters (code points) in a string.

Attributes use the same notation as fields and cannot be distinguished from variable fields syntactically; they are like read-only, predefined fields. Attributes start with a capital letter; except some commonly used ones:

.kind .count .first .last

To allow consistently using capital letters for attributes, the variants with a leading capital letter are all aliases and cannot be used as fields too:

.Kind .Count .First .Last

Attributes are also available for immutable items, like numbers and strings. Attributes of numbers are also lower case (note that numbers are immutable and thus cannot have variable fields):

.abs .den .float .frac .trunc .isbig .num .rsd .round .quot .sign

For consistency, these also have capitalised aliases. To better support ??? overloading, these can be set, but are automaticall called as unary functions if the corresponding field is set with a function reference, i.e.

r =: cpx.re
\ instead of
r =: cplx.re:(cplx)

Examples for less commonly used attributes without small letter aliases are:

.Bytes .Amount .Usecount .Revisions 

Using attributes to convert strings to numbers and vice versa, e.g. .asInteger, .asString etc. is not considered worthwhile. It is not required very often, and using the fully spelled functions is more clear. While the .asString attribute could be considered for any item to provide a printable representation for debugging, a short string is only possible for simple items; for complex bunches, the debug dump function produces a printable version on error output. For simple items, catenating the empty string converts any number in standard a format.

There is no protection against using the small letter variants as fields; however, the compiler might warn if these words are encounted as lower case field names.

Attribute names cannot not be used as field names for those item kinds for which they are provided. As attributes are read-only, an attempt to use it as a field will be detected very soon.

It has been tempting to provide boolean attributes like .isInteger, .isFault, .isASCII for their brevity, but discarded, as the corresponding functions

is () integer
is () fault
is () ASCII

are also easy to read, and attribute notation would not increase readability significantly.

Control fields are neither variable fields nor (read-only) attributes and start like attributes with a capital letter.

Their content controls the way an item is processed. E.g., the control field .Quickadd of a map, if set true, suppresses the search if an element is already set and thus quickens filling maps with unique keys. Most control fields restrict the kind of items to be set, i.e. a wrong kind is a fatal runtime error.

Control fields can thus conveniently used in an assignment:

bunch.Lock =+ 1         \ increase lock counter
map.Quickadd =: ?+      \ map (map) quickadd ?+

Once field change callbacks are implemented, libraries can control which settings are permitted, so that a field like this is like a control field and will often start with a capitial letter.

A special case is the .Tag control, that can only be set once.

Using variable field names starting with lower case letters is recommended. New language versions might introduce new attributes and control fields, but always starting with capital letters, so there will be no collision with already used field names that start with small letters (except the above commonly used small-letter aliases).

Fault items have a single control field .checked that tells if the fault item is still unprocessed (not checked). All other information in a fault item is a preset field; for ease of use, all, including the .checked control field, are written in lower case; thus, to mark a fault item checked:

fault.checked =+ 1

As the field and attribute notation returns any item reference, the attribute .scan could provide the basic key scan function for a row or map. But attributes returning a function reference are deprecated now for clarity; use the appropriate scan function. Providing functions to get or set attributes or fields is mostly deprecated now except for those noted below. In fact, if a field returns a function reference, its call must be written like any other one:

this.sortcall:(x, y)

Some attributes are common to all kinds of items:

The attributes for numbers are explained in detail in their section. Some functions are provided (without the prefix math) for easier use in expressions:

For floats, .den, .num and .rsd are void, and .float is just a copy of the reference, as is .trunc for integer. The sign .sign is integer, as it can be easily converted to float again using the ## operator or float () function:

print math sin ##x.sign
print math sin float x.sign

The attributes for strings are:

The .count attribute for a string gives the number of characters (code points) in the string, so to convert a string to a map of single character strings might read:

row of characters from string (s):
    r =: [s.count]                  \ final size is known
    ?# i =: from 1 to s.count
        r[i] =: s[i]                \ or: string s code point at i 
    :> r

Mutable items (rows, maps, chunks, fault items etc) have a common attribute:

The .Revisions attribute counts the number of modifications; thus it can be used to ensure consistency. It may be used in a scan to ensure the item is unchanged. As the count may be 16 bits only, i.e. provided modulo 2^16, only equality should be checked.

Bunches (tuples, maps or rows) have the following attributes:

The .count attribute gives the number of elementes in use:

Maps have a control field:

Normally, keys in a map are uniquely stored; when setting a cell, the key is first located, and the item replaced if it already exists. Adding items with not yet existing keys thus takes a time that grows quadratically with the number of already present keys. Filling a map with a large number (>1000) of unique keys thus may consume significant time. In order to speed up such operations, the .Quickadd option may be set for a map. Then, any value set is inserted in front of a chain unconditionally, and not searched for to replace. A read with the key provides the just added value; any previous value is still there, but hidden. So the items are just accumulated; the .count attribute gives the number of allocated cells, no longer those with a different keys. Setting such a cell to void deletes the new value and unhides the previous one. If the .Quickadd option is switched off again, just the normal behaviour is restored: writing a value first locates the (lastest) key and either adds or replaces the value.

Obtaining the set of keys (as a row, tuple or via a scan) delivers all keys, including the duplicates. Using these keys to obtain all values via a normal keyed map access will provide the same (latest) value for duplicate keys. There is a standard library function to obtain all values, even the hidden ones, as a tuple or row. A third variant makes a row of all key/value pairs

The option may be set and cleared at any time; it just controls the method used to set a new value by a key. Switching it off does thus not eliminate duplicates; there is no attribute to show that it was ever set if it is off.

As these remain allocated, the map will consume more memory if there are duplicates. The standard key scan then provides duplicate keys more than once, but the value for these is always the same, latest added. To obtain keys without duplicates, use a scan that returns them sorted. Note that logically the system cannot determine if there are duplicates unless it collects all keys used; and sorting is the most efficient method to detect duplicates. Setting a key with a hidden duplicate to void unhides the hidden one. This may cause hard to find errors, so the feature is normally deactivated. The option can be switched on and off at any time; it just determines if the next write access will search (off) or add the new pair in front (on). Inspecting the .Quickadd attribute tells whether the option is on or off.

Removing all hidden entries should use the library function

map (map) remove hidden entries:

If a map is used in two phases; first created, e.g. from a file contents, and then used extensively, creating it with .Quickadd and removing hidden entries before use may be an efficient solution, unless a hashed map — provided as library function — is used.

Currently, maps with a hashtable to speed up access are provided as a library; if directly supported, two control fields would be used:

So instead of first line the following ones could be used to create a hashmap:

hm =: new hashmap with 100 slots using `string () hash`
hm =: {hashmap}
hm.HashFunction =: `string () hash`
hm.HashSlots =: 100

Rows have additionally (compared to maps):

As the .count attribute gives the number of non-void cells, it may be less than the number of elements actually allocated:

! r.count <= r.last -r.first + 1    

There are two more special attributes always attached rows or maps:

The .tag attribute is used to identify (classes of) items, set once by the function tag (item) by (string) or assigning a string to the attribute .tag. (To provide the tag within the braces when creating a row or map might be possible, but will probably never be implemented.) It can only be set once: a fatal runtime error will occur if not void before. Setting a tag is currently only possible for rows and maps.

The lock count is a means to protect a mutable item against changes. The .lock attribute is initially zero for mutable items, and void for others. If it is not zero, the item cannot be changed, i.e. neither values replaced nor new elements or fields added, with the one exception that the lock count can always be decreased and increased:

bunch () lock:-         \ increase the lock count and thus locks the item
bunch () unlock:-       \ decrease the lock count and thus unlock the item.

where decrement returns a fault item if the lock was zero and is unchanged. The lock count may be an unsigned short integer (max. 65'533 lock increments) and a fault item is returned if this maximum is reached. The runtime system can set the lock count to 65'535; any attempt to decrement it will return a fault item too, thus the bunch cannot be unlocked by the TUF-PL programmer. Note that locked bunches are not locked for deletion.

The lock count can be used as a semaphore for coordinating access.

Some constants of the currently running runtime system can be accessed as special attributes of the all-global variable $SYSTEM, see below on System variables and parameters

As a connector to external libraries, items of kind system are used. These are opaque objects that contain some state information to connect functions calls of the library, e.g. a file descriptor for file access. System items do not have fields, i.e. no user-defined information; if this is needed, the programme must use a bunch and put the system item in a field. Conceptually, system items are under control of the library binding code, and not intended to organize user data. Thus, there are no fiels, only attributes; their names are defined by the binding code.

Implementation remark: ??? Some attributes are stored like fields, with the colon prefixed, thus not accessible as fields.

3.2. Exact (integer and rational) Numbers

Exact Numbers are arbitrary precision rational numbers, i.e. having a integer numerator and a positive integer denominator, where integer numbers are those with the denominator 14.

Using assertions, the compiler could be given hints to optimize the code, and a runaway of numbers could be prevented.

Integral numbers

Integral numbers are arbitrary precision numbers, normally not limited to the wordsize of the system that runs the programme.

Depending on the availability of a multi precision library, some runtime systems may have only 32 or 64 bit integer numbers5. If the size is sufficient, single (or double) machine words are used for arithmetic; except some rare environments, 64 bits should be available. The system automatically uses a arbitrary precision library; nothing has and can be done by the programmer. (Including to use machine words again if the number falls significantly below the limit.) The conversion library routines will also automatically return multi-precision numbers.

Professional programming should use assertions for the ranges of numbers, so that the compiler can detect at compile time that the range of numbers is not sufficient.

Addition, subtraction and multiplication of integral numbers have integral numbers as a result, thus there are no special considerations for these operators.

The division with a single slash (/) however, only yields an integer number if the divisor is a factor of the dividend; otherwise it is a rational number. If rational numbers are not supported, a fatal error will result. For this reason, the double slash (//) is provided as integer division, see below.

Integer division in computers is not at all simple; While the results are consistent among all proposals if the arguments are both not negative, the desirable extension to negative numbers is not as trivial as might be thought and has lead to different solutions.

As a generic symbol, ÷ is used here for the integer division, used for the remainder; moreover, n is used for the dividend, d for the divisor, q for the quotient n÷d and r for the remainder n†d.

Raymond Boute 6 gives a detailed comparison on this subject and demands that the following two conditions hold7:

n = (n÷d)·d + n†d
|n†d| < |d|

He calls the first one the division rule; the second one could be called the remainder rule, which is given for completeness only, as no proposal violates it.

Within TUF-PL, it is also desirable that the integer operators are consistent with the rational numbers, which will be called the rationals rule:

n÷d = (n/d).trunc
n†d = (n/d).frac * d

A criterion that is not found in the literature is another consistency with rational numbers, which will however conflict with the division rule. For elementary fractions, we have the invariants:

n / d = (-n) / (-d)
(-n) / d = d / (-n)

So it would be expected that this holds for the integer division and the integer remainder too, which we will call the fractions rules for the division and remainder:

n ÷ d = (-n) ÷ (-d)
(-n) ÷ d = n ÷ (-d)
n † d = (-n) † (-d)
(-n) † d = n † (-d)

Boute considers three operator pairs, where the other one is defined using the division rule:

Using // and %/ for the truncation pair, /_ and %_ for the flooring pair, and /* and %* for the euclidean pair, an example is sufficient to demonstrate the problems:

        //      %/      /*      %*      /_      %_
  29/6   4       5       4       5       4       5
 -29/6  -4      -5      -5       1      -5       1
 29/-6  -4       5      -4       5      -5      -1
-29/-6   4      -5       5       1       4      -5

By definition, all satisfy the division rule, but only the truncation and flooring method fulfil the fraction rule for the division, and none follows the fraction rule for the remainder.

Nor surprisingly, the truncation division is the same as obtained from the rationals rule.

It would be possible to fulfil the fractions rule for the remainder, but at the expense of violating the division rule, with two alternatives:

n † d = (n%%d).abs
n † d =  (n/d).frac * d.abs

so that the remainder is either never negative, or has the sign of the quotient.

While the remainder that belongs to the truncation division has the sign of the dividend, there are also systems in which the remainder has the sign of the divisor, which violates both, the division rule and the fractions rule for the remainder and thus is not further considered.

Boute shows that for mathematical consistency, the euclidean division and remainder is the best choice, and Knuth's the second best choice, while the truncation division, although the most dominant one, is the least desirable selection.

However, he did not consider a very practical aspect: Which solution is the easiest to remember for the programmer, and might produce the least number of faulty programmes?

One of the — here easy to avoid — pitfalls is the to test a number to be even:

is even (x) WRONG:
    ? x %/ 2 = 1
        :> ?-
    :> ?+
is even (x):
    ? x %/ 2 = 0
        :> ?+
    :> ?-

With this in mind, the mathematical desirable is the least intuitive one, as it violates both fractions rule, even if a modified rationals rule might be obtained.

Both, the euclidean and the flooring division have the major disadvantage that not only the sign, but also the absolute value of the remainder depends on the signs of the divided and the divisor. Knuth's flooring division additionally delivers four different results for each sign combination, which is hard to recall during programming.

Another practical criterion observes that negative divisors are rather seldom, for which consultation of the language manual might be justified, but that for positive divisors the result should be fairly easy to predict. Here, again the truncation division wins, because the rule that the sign of the result (for all divisors) is that of the dividend could be kept in mind fairly easy.

This finally rules against the mathematical desirable versions, sticks to the commonly used truncated division, and leaves as alternatives:

However, to provide more than two operators is not really an effort, so TUF-PL follows the example of the few programming languages that leave the choice to programmer (and thereby keeps him alerted to check the manual), where the first three pairs follow the division rule:

As might be apparent, the choice of /* reminds that the definition for the mathematical modulus only uses the multiplication:

n = q * d + r

And the underscore hints to flooring, etc.

Note that commonly used (and perhaps the only one implemented) are // and %%.

Rational numbers

Using literals (constants) for integer numbers, rational number literals (constants) are build from integer literals, by evaluating constant expressions at compile time. Thus, an approximation for e is 27183/10000 (better use #E provided by the standard library). Note that the literal 3.1415 denotes an floating point number, although it is de facto the rational number 6283/2000.

Using attributes, the numerator and denominator as well as other values can be obtained:

The corresponding functions are:

Thus, we have

x = x.num / x.den = x.trunc + x.frac = x.sign * x.abs
x.den > 0
x.sign = (x.num).sign = (x.trunc).sign = (x.frac).sign

The integral part is rounded towards zero, i.e. truncated division as normally provided by the hardware, which conforms to every day use, in that -17/7 = -2 3/7, where the fractional part of a number binds stronger than the unary minus, i.e. - 2 3/7 = -(2 3/7) = -(2 + 3/7) = -2 -3/7.

The numerator of the fraction in the mixed representation seems to have no generally accepted name, so we use residue for it:

x.rsd = (x.frac).num
x = x.trunc + x.rsd/x.den

However, the residue of a division of two integer numbers is not always the remainder or modulus of the division; this depends on a common divisor and the kind of division, see below.

Calculating with rational numbers benefits from simplifying (cancel common factors) where the greatest common divisor (GCD) of numerator and denominator is calculated, and both are divided by it. This is done before the rational number is stored, because in TUF-PL, numbers are immutable, and there are no clear performance benefits if the reduction would be postponed. So there is always

gdc(x.rsd, x.den) = 1

If the result of an arithmetic operation yields a rational number with a denominator of 1, an integer number is provided instead. As integer numbers provide 0 for .rsd and 1 for .den, these can be used like rationals if desired.

Note that due to the immediate reduction to the smallest denominator, currency values that require at most two places in the fraction, may have .den = 0, .den = 2, .den=5', etc, a denominator less or equal to 100 and divisible by 2 or 5. E.g 54/10 is 27/5, so the denominator is 5.

Neither a scaling factor nor floating point numbers with decimal exponents are yet envisaged as part of the language. So it remains to use integers denoting cents and use constraints to forbide the use of rationals.

Some examples to illustrate the attributes:

        .num    .den    .trunc  .rsd    .frac   .sgn
 21/6    7      2        3       1       1/2    +1
-21/6   -7      2       -3      -1      -1/2    -1
 21/-6  -7      2       -3      -1      -1/2    -1
-21/-6   7      2        3       1       1/2    +1

Some TUF-PL compilers and runtime systems may not support rational numbers at all, or only as 32-Bit numbers, so that the programmer has to fall back to the usual dyad of integer and floating point numbers. See Comments and Pragmas for information how to ensure at compile time that rationals are available.

The demand for rational numbers seems to be rather low; possibly, because we are trained in problem solving using integral and floating point numbers only. Also, when using rationals, the denominators often grow very large, so it might be better to find a way to use (large) integers.

Strictly spoken, rational and integer number items are never equal — because each rational is always converted to an integer if the denominator is 1, and thus an arithmetic operation will never provide a whole rational. But as rationals can be intermixed with integers in arithmetic (as opposed to floats), comparison of accurate numbers will never surprise.

3.3. Approximate (Floating Point) Numbers

For many technical and statistical problems, approximate numbers are sufficient, as provided by floating point numbers, that are the practical equivalent of real numbers. Mathematically, they are rational numbers with rounding errors, so they are approximated numbers, in contrast to the exact integer and rational numbers. As of convenience, the term floating point number or float is used.

Literals for floating point numbers are written using the decimal point, i.e. 3.1415, not the decimal separator from the locale. The forms .5 or 7. are not valid and must be written as 0.5 or 7.0. According to the IEE 754 standard, floating point number arithmetic could produce positive and negative infinite, and two special NaN (not a number) values. These are denoted by the digraphs #+ for +∞, #- for -∞, #~ for a quiet NaN and #! for a signalling NaN.

A rational or integer number is not converted automatically to a floating point number; this could be done as follows:

All methods are equivalent and return:

The reasons for not even converting integers to floats automatically are:

Although void is treated as zero in additions, the addition of two voids is always an integer, so the following would not work unless the explicit conversion to float would return 0.0 for void:

x =: ()
y =: ()
z =: x + y          
! z = 0             \ result is integer 0
z =: ##x + y.float      
! z = 0.0           \ result is float, as ## returns 0.0 for void

While the multiplication of void with any number is a runtime error, it can be circumvented by using the conversion, as a float is allowed.

The library function is seldobmly used in constraints and assertions.

Note that integer literals are treated the same. Unless the constraint system is properly used, the compiler has no information about the contents of a variable, and cannot issue a warning:

x =: 3
y =: 10.5
print x * y             \ runtime error, as x is integer
print x.float * y       \ ok
print ##x * y           \ ok
print (float x) * y     \ ok
print (float 3) * y     \ ok
print y * float 3       \ ok
print float 3 * y       \ runtime error, equivalent to float (3 * y)

Note that it is (float x), because it is a function, and neither float(x) nor (float) x, and that function arguments are scanned greedily.

If the argument is already a floating point number, all three variants do nothing than return the item unchanged.

Some aspects function of floating point numbers are provided as attributes:

While the assertion x = x.trunc + x.frac theoretically should always hold, no such confirmation could be found specifications of the mechanisms used.

To get the smallest integer larger than the floating point number, use:

i =: r.trunc
? r > i.float
    i =+ 1

Rounding per .round is equivalent to:

round (fn):
    rv =: fn + 0.5
    :> rv.trunc

If the runtime does not provide unbounded integers, a fault item is returned if the integral part is too large. If this is undesirable, use the +BigNum feature flag to assert that the integer is large enough in any case. Otherwise, the program specification should specify the maximum expected value, and the floating point number must be checked against this value before. Adding an assertion will have the compiler checked the situation:

i =: f.trunc
! i.abs < 10^20         \ compile time error if no integers of this size

Loop to calculate the square root:

x =: a
y =: x * 1.01       \ thus y > x
?* y > x            \ monotonically falling until sqrt found
    y =: x
    x =: (x + a/x) / 2.0
print 'sqrt(' _ a _ ')=' _ y

(Note that (x+a/x)/2 = x-(x²-a)/2x, so it's monotonically falling while x²>a.)

Obviously, for an integer number, .frac is always zero.

For floating point numbers, the attributes .den, .num and .rsd are void. While for .abs and .frac floating point numbers are supplied, the attributes .trunc and .sgn return exact (integer) numbers, because it is easy to convert them to floating point, as in

! x = ##x.trunc  + x.frac
! x = (integer x) + fraction x
! x = ##x.sgn * x.abs
! x = (float sign x) * abs x

3.4. Complex numbers

Complex numbers are not yet implemented in TUF, although this should be relatively simple: Just add a new kind complex, add the corresponding code to the operators and necessary or useful functions.

Complex numbers could be also implemented as bunches, but operator overloading is not yet defined, although it could be like array and map access by control fields.

A simple solution would use pairs (2-tuples) of real numbers, and define a set of functions returing a (complex) pair, like:

complex (a) add (b):
    :> a.1 + b.1, a.2 + b.2
complex (a) sub (b)
complex (a) mul (b)
complex (a) div (b)
complex (a) re:
    :> a.1
complex (a) im:
    :> a.2
complex (a) length:
    :> math sqrt a.1*a.1 + a.2*a.2
...

Nesting them is perhaps better readable as the first given tradional syntax:

complex_add(a, complex_mul(c, d)
complex a add (complex c mul d)
complex (complex a mul b) add (complex b div c)

Of course, complex numbers could be defined using rows or maps, and the basic complex operations considered bunch operations:

a =: new complex number (1, 2)
b =: new complex number (3, 4)
a.:add b
a.:add (b.:mul c)

These considerations are merely used to give an impression of what might be done with TUF.

3.5. Strings

Strings are immutable sequences of characters. The number of characters in a string is practically unlimited, and the number of different characters (code points) is given by the Unicode standard, which is more than 1 million. As strings are character oriented, the .count attribute returns the number of characters (not the number of bytes).

String comparison for equality is possible using the equals operator (=). Lexical comparison operators for greater etc. can also be used; in this case, the numerical value of the Unicode code points are used. If one is a substring of the other one, the longer one is greater. No locale is used by operators; thus surprises like "A" = "a" cannot occur. To compare using a locale, use standard library functions (which are still to be defined).

String literals are enclosed in either double quotes (") or single quotes ('), the former transformed by message localization. There is no way to denote the encoding in TUF-PL sources; it is assumed that the encoding of the source code file is known to the compiler, and that the complier converts to UTF-8. Using ASCII characters only together with HTML entities is strongly recommended, as it is independent of the encoding.

Strings can be joined using the binary catenation operator _ and the append assignment =_, which could be seen as a shortcut for a =: a _ b. The append assignment should be preferred whenever possible, not only for readability of the program, but also as a chance for an optimizing compiler to use a string buffer.

Unlike other operators, the catenation operator accepts numbers as operands and converts them to strings using a standard format. Other representations require use of a standard library functions or carving the result by string functions, see Formatting.

Void items are just ignored; items of other kind produce a fatal error.

Performance of string allocation and copy for long strings is often not relevant. Otherwise, collect the partial stings in a row and use the standard library function (that at some time may be written in host language) to join the parts.

Each character in a string can be accessed using row key syntax, starting at 1, and returning void if out of range or zero as key; the result is a one-character-string to allow comparison to a string literal:

s =: 'abcdef'
! s[3] = 'c'
t =: 'cdefghi'
x =: s[4]
! x = t[2], x .= ''
! s[0] = ()

Note that this notation may only be used as a source, not as a target of an assignment, as strings are immutable. Relaxed attribute syntax for positive integer literals may also be implemented in future:

s =: 'abb'
! s.2 = s[3]

Negative keys are allowed and count from the end, thus s[-1] is the last character, the same as s[s.count], and it is always s.first=1 (if the string contains at least one character).

Short strings (at leasts upto 5 bytes) are normally handled as efficient as integers, thus there is no need to have items of different kind for strings and characters.

To obtain the integer equivalent of a character, called code point in Unicode, library functions are are provided, that either return the characters individually or create a row out of each character:

character code point of (string) at (position):
    returns the code point as integer,
    or void (not 0) if position is not in string
 
row of character code points of (str):
    rv =: [str.count]
    ?# i =: from 1 to str.count
        rv[i] =: character code point of str at i
    :> rv

Unicode allows the zero code point, which is supported. This does not mean that strings are meant for processing binary data.

The opposite is done by

string from character code point (int):
    If the integer 'int' is a valid Unicode code point,
    i.e. between 0 and 1,114,111,
    a string of one character is returned.
    Otherwise, it is a fatal runtime error.
 
string from character code points in (row):
    rv =: ''
    ?# i =: row.scan
        rv =_ string from character code point row[i]
    :> rv

Converting a string to a row of single character strings, when really needed, is done by:

string row from string (s):
    r =: [s.count]          \ preallocate with s.count cells
    ?# i =: from 1 to s.count
        r[i] =: s[i]        \ assign string slice to row element
    :> r

To convert numbers to strings and vice versa, some routines are in the standard library:

convert to integer (str):
convert to integer (str) base (base):
    \ if base n is zero or void, C language integer constant format is used
    \ returns fault item if string does not start with a number
convert to float (str):
    \ converts leading decimal floating point number
convert to string (n):
    \ accepts integers, rationals and floats 

A lot of formatting can be done be using string functions, e.g.:

print pad right (sqrt 2) to 8 by '_'

See Formatting for more capabilities.

ASCII strings

Strings may have the .isascii attribute, if all code points are below 127. As of the standard, this includes control characters below 32, in particular the tab character. Some library routines are more efficient if the string is indicated as ASCII.

However, absence of this attribute does not mean that it is definitely not ASCII, and for efficiency reasons, testing the .isascii attribute just gives the flag that is internally kept for the string. If a authoritative answer is required, the library function:

check if (string) is ascii

returns either the original string, or a copy, where the .isascii attribute is freshly checked.

There is no attribute to indicate what is sometimes called printable, i.e. is ASCII and has no other control characters than tab, newline and carriage return, but a library routine:

is string (str) printable:
    ? ~ str.ascii
        :> ()
    cnt =: count in string str many of '&tab;&nl;&cr;01234...'
    :> cnt ~= str.count

Raw strings

Normally, strings are sequences of Unicode characters, internally encoded using UTF-8 conformant byte sequences.

But line oriented input from files, pipes or commands may obtain byte sequences that are not valid UTF-8 encoded strings. To allow the string handling functions to trust that encoding, the .israw attribute is set, if the string has failed the UTF-8 verification. Raw strings are byte sequences with code points 0 to 255 that can be passed as a reference and written out again. If the .isascii attribute is set, only the code points 0 to 127 are used in the string; but if this flag is not set, the string may still contain only ASCII code points, although most library routines check the result accordingly.

Some string processing operators and functions may reject to use raw strings, in particular in combination with Unicode strings, because the results will normally be useless. An exception is the catenation operator, that only marks the result as Unicode conformant, if both strings are such; otherwise, the byte sequences are catenated, and the resulting string is marked raw. This was considered useful, as this operator is often used in creating debug and error messages, that should work also under abnormal conditions, even if the results may be garbled. Note that the i-th character of string s via the notation s[i] delivers a one-character string (of the same rawness), not a code point number.

There are no plans to provide slices, use the clip functions instead. Note that strings are immutable, thus it is not possible to overwrite part of a string; this might have been a real inefficiency in former times, but nowadays the simplicity of immutable strings seems more valuable. Good real world examples where other languages like PYTHON and JAVA perform better are welcome.

As a substitute, the Chunks of bytes item may be used.

Attaching encoding information to a string is considered unnecessary complicated, as the string can be converted to Unicode once the encoding is known.

Two library functions allow recoding of strings:

recode string (rawstr) to Unicode using (coding):
    Recode the byte sequence of 'rawstr' to Unicode,
    using the encoding given by 'coding',
    and return a new string accordingly.
    If 'coding' is the voidref or 'UTF-8',
    the input is checked for UTF-8 conformance;
    in this case, if the input was not marked raw,
    a copy of the reference is returned.
    The input is always treated as a raw string,
    even if it is not marked as such.
    This allows any input string to be recoded.
    If an input byte is not a valid code point
    a fault item is returned.
 
recode string (str) to raw using (coding):
    Returns a raw string derived from the string 'str',
    converting code points to byte values
    according to 'coding'.
    If 'coding' is the voidref or 'UTF-8',
    an exact copy marked raw is returned.
    If the source string is raw,
    or contains code points outside the table,
    a fault item is returned.

The coding parameter is normally a string, that selects a predefined coding; if it is unknown, a fault item is returned. The parameter may also be a row that is keyed by code points between 0 and 255 in both cases, the encoder to raw will lookup values and return keys for conversion.

Note also that the recode functions change the byte sequence, as opposed to a check that a string conforms to the UTF-8 rules, marking it as raw if not.

The encodings ASCII, UTF-8, ISO-8859-15, ISO-8859-1, WINDOWS-1252, and DOS-437 are usually available. Known encodings can be queried:

give all known string encodings:
    enumerates all available encodings giving string codes as above,
    always starts with 'ASCII', 'UTF-8' and 'ISO-8859-1'.
check encoding (enc) for string (str):
    Test if string 'str', treated as raw string,
    is a valid encoding for the code 'enc'.

Converting from ISO-8859-1 to Unicode is particularly simple and always supported, as the code points are the same.

Unfortunately, ISO-8859-15 and ISO-8859-1 have the same code points, only different graphemes at 8 places; so they cannot be detected from inspecting the bytes. The WINDOWS-1252 encoding is a superset of ISO-8859-1 (not -15), but has only 5 undefined code points. DOS-437 has no unused code points.

Note that checking a valid WINDOWS and DOS encoded string for UTF-8 encoding often tells it is not UTF-8, as the latter requires certain sequences of byte patterns corresponding to seldom used special characters, that are not common in most texts.

The following function enumerates all possible encodings for a string:

give encodings for string (str):
    ?# enc =: give all known strings encodings
        ? check encoding enc for string str
            :> enc
    :> ()

Another often used encoding is UTF-16, that uses pairs of bytes to encode most Unicode characters. It can often be detected by the byte-order mark that starts the file. It can be converted to UTF-8 using the above functions.

Using raw strings for byte processing is discouraged, use Chunks of bytes instead.

Percent-encoded UTF strings

In particular query strings returned from HTML forms use a rather simple and short-sighted encoding, where improper characters are encoded using a percent sign (%), followed by exactly two characters that are hexadecimal digits. To deal with Unicode characters, there was a proposal to use %uxxxx or similar for each Unicode code point; it was, however, rejected. Instead, a Unicode code point must be encoded in UTF-8, and each byte encoded individually using the percent method. This, however, allows to encode strings that are not UTF-8 encoded.

So to decode such strings in TUF-PL is nasty, e.g. as the function to create a (single character) string from a Unicode code point does not just map the codes from 128 to 255 to a single byte.

One way to solve this is using byte chunks, where we can put small integers in the range 0..255 as a single byte into a byte chunk.

Or, we provide a library function that converts integers from 0 to 255 in one-character raw strings. While string catenation just combines the byte strings, the result marked as raw, if one of the source strings is raw; thus, the result must finally be checked and the raw marker removed.

Due to the widespread use of this encoding, it seems best to provide a function implemented in the host language:

decode percent encoded string (str):
\ replace all occurences of %xx with single bytes,
\ and then check the string for UTF-8 encoding
\ see libraray documentation

As percent encoding might produce not-UTF-8 strings, these are marked raw if this is the case.

String input and output

The standard input and output interface is line oriented, with each line provided as a string or supplied from a string. It uses the file streams defined by the C library, and includes pipes and command execution as well as network transmissions. It is buffered in the sense that the underlying input-output system may obtain and deliver data in parts of lines or as multiple lines; buffering tries to make the transfers as efficient as possible.

Furthermore, the transfers is synchronous, in that the program waits for the next line until it is filled and supplied, or until the line just provided has sufficient buffer space to be written.

As the standard linefeed character is used as line delimiter, input and output is done in units of lines, that can be rather long strings if the file is a binary one. The linefeed character is discarded here.

If a line is either pure 7-Bit ASCII codes or valid UTF-8 encoded characters, it can be seamlessly be processed as strings. If a line is neither ASCII nor valid UTF-8, it is marked as RAW. Note that the lines may contain byte sequences that are valid UTF-8, but written in another coding. As the encoding cannot be recorded by the file system, it must be communicated otherwise.

In case a binary file that does not consist of logical lines must be processed or generated, direct (unbuffered) input and output to and from chunk of bytes may be use, which is also asynchronous, i.e. just provides the bytes that are available or can be handled by the input-output system; so reads may be truncated and writes may tell that not all bytes have been written, and the rest must be transmitted in a new system call.

The line oriented file interface as well as pipes, including shell command execution, returns strings, which may or may not be encoded in UTF-8 (including 7-bit ASCII, which is a proper subset).

This is controlled by attributes and fields for the system item that is a file descriptor; the following text needs to rewritten, as the their attribute is still used even if fields are meant.

Instead of passing all received lines as raw strings, each line is checked if it conforms to the UTF-8 coding rules, and if not, marked as raw. This delivers all UTF-8 (incl. ASCII) encoded input as standard Unicode strings. However, it does not ensure that other encodings are always marked as raw. Fortunately and by design, the chance that other non-ASCII encodings are valid UTF-8 strings is relatively small, see http://en.wikipedia.org/wiki/UTF-8#Advantages. (In UTF-8, there are no single bytes with the high-bit set, while the special characters in other encodings are often followed by an ASCII character, thus strings in other encodings seldom pass the UTF-8 test.) Because the byte sequences are unchanged, just checked for UTF-8 conformance, there is no need to have an option that returns only raw strings.

If a different encoding is known prior to reading the input, the recode function may be used to convert the supplied string to a Unicode string. For this purpose, the resp. function always treats the input string as a raw string, even if it is not marked so.

Providing an attribute telling the code name and having the recode done automatically would be nice, but is assumed to be used seldom and thus is not implemented.

To obtain legible strings even if the input is neither pure ASCII nor UTF-8 encoded, the .autoconv option is on by default. Aside from checking for UTF-16 BOM codes and recoding accordingly, any string that is not UTF-8 conformant, is converted to Unicode from ISO-8859-1. The most often encountered problem will be that if the input was ISO-8859-15 instead, the euro sign (codepoint 8264, graphem €) is shown as the currency sign (codepoint 164, graphem ¤). As the WINDOWS-1252 encoding is a superset of ISO-8859-1, the additional code points are also converted. If any of the remaining 5 code points are encountered while .autoconv is active, a fault item is returned that refers to the offending line with the field .line. Note that with auto conversion on, the input bytes are not necessarily as retrieved from the system, but may be changed due to the recoding. Also, there is no indicator to show whether the input was originally UTF-8, or was recoded.

So the general rule is:

As most often the trailing linefeed is not needed, and neither the carriage return, two additional attributes are set by default:

If the stream is written to, each string written is considered a line and thus terminated by a line feed on Unix-like systems. This is indicated by the .DoLF attribute, which is on by default and can be disabled dynamically. There are systems that terminate a line by ac carriage return symbol before the line feed; on this kind of systems, the .DoCR is initially on, and can be switched off on demand.

Note that the attributes are a language feature, so they are valid only for the stream handle concerned, and other streams writing the same file may behave differently.

3.6. Bunches (aggregates)

Rows and maps allow to aggregate items; the term bunch is used for any of these. For blocks of binary data use Chunks of bytes.

A bunch has:

A row is also known as vector, column or array (of one dimension); while a map might also be called a dictionary, an associative array or a symbol table. The short words have been choosen for brevity of expression, although vector and dictionary would have been a bit more convincing.

Map keys can be any item references; as only equality is required, the are compared by the guarded equalitiy. Integer numbers or strings are recommended, and floats will fail, as the equality comparison for floats has been banned from the language.

A Fields… is addressed in the common dot notation that may also used for attributes, i.e. bunch.field. Rows addressed with the common key notation, i.e. row[i] maps use swiveled braces instead: map{s}.

To create a row, either use factory functions or a pair of square brackets ([]) or pair of braces ({}):

row =: new row
row =: []
map =: new map
map =: {}

Allocation of memory is done automatically as required; for a row, a small amount is allocated initially, and more allocated when required. Note that a row contains only references, thus the memory is proportional to the number of cells. For a map, nothing is allocated in advance; thus, when only fields are required, a map is used normally.

Optionally the map and row factory functions can provide options, see below.

Fields are accessed by names provided as words literals in the programme text. Thus, field names cannot be generated during run time of a programme; for this purpose, map keys has to be used. (To clone bunches, library functions allow to get and set fields.)

Note that r.x and r[x] or r{x} are different: the first is the attribute with the name x, the second a row and the third a map keyed with the value of x.

Fields allow to bundle global variables in a global bunch; this makes the release at exit easier. As bunches can be used as objects, a global variable with fields is similar to a singleton object.

In particular for tuples, constant integer keys can also be used in field notation; this is an experimental feature:

t =: 2, 3, 5, 7, 9, 11, 13
! t[5] = t.5

Rows

Rows are elsewhere called vectors, columns or arrays (of one dimension): A large number of arbitrary item references can be stored and quickly accessed by an integer key (key). As the cells always contain item references, any kind of item can be used as data, and all kinds of items can be mixed. (Overlays to mix numbers, strings and other kinds are not necessry.)

There is always a smallest and a largest key used so far, they can be obtained by the attributes .first and .last. Thus there is no fixed origin of 0 or 1; negative integers can be freely used as keys and must not be determined in advance or set at creation. When the first cell is set, the integer used sets the fields .first and .last. By writing a non-void item with a key less than .first or greater than .last, the row is dynamically expanded as necessary.

All cells between .first and .last are allocated:

row =: []
row[-1000] =: -1000
row[2000] =: 3000
\ now 3000 cells have been allocated (at least)
! row.last - row.first = 3000
! row.count = 2             \ counts the non-void cells

The number of allocated cells is always .last - .first + 1; there is no attribute to calculate this number; .count is the number of non void cells. Note that the runtime normally allocates more cells (at the top) in order to avoid frequent memory allocation if a row is filled with ascending keys; there is no attribute to obtain the number of actually allocated cells.

All cells are initally void, and so are cells outside the range of .first and .last. There is no means to distinguish a cell with its initial contents from one which was overwritten with a void reference.

Thus there is no key out of range error; just void will be returned on fetch, and the row expanded on storing a non-void item, and storing a void outside is ignored. Searching the next void cell in either direction never needs to check .first or .last, as a void cell will always be found.

If there is not enough memory, a fatal runtime error occurs. If this is not acceptable, a factory function is required to allocate a fixed number of cells, which returns a fault item on failure and may also disable the dynamic expansion. Note that assertions may be used for this purpose too. (Note also that the current fault item mechanism has not yet evaluated how to catch cases like this and must be revised too.)

In contrast to maps, cells are never deleted, even if a void reference is stored. To free the memory, a new row has to be made and the item references copied; note that this does not need extra memory for the data, as just the use counts are incremented.

A few library functions might help in standard situations:

row (r) copy            \ copy just the non-void cells, no fields
row (r) close           \ full duplicate
row (r) append (rmt)    \ add at the end
row (r) shrink          \ copy without voids, new keys
row (r) from (f) upto (u)  \ copy part

To print the values of all cells including the void ones, use:

?# i =: from row.first upto row.last
    print row[i]

As often just the keys for non-void cells are required, there is a scan function to provide these:

r =: factory function to create and fill a row
?# i =: row r give keys 
    ! r[i] ~= ()            \ void cells are not provided
    print r[i]

See below on Scan guards and scan functions for peculiarities if a row is changed inside a scan loop.

If the (estimated) number of cells is known at creation, the factory function allows to allocte the desired number of cells:

r =: new row 1000

If the number of cells cannot be allocated, a fault item is returned. The use of an integer literal within the brackets (e.g. r =: [100]) is deprecated.

Future versions may allow a tuple in the row creator to create a matrix, which currently requires the matrix library.

The notion r[] as an assignment target designates the next free cell, i.e. r[r.last+1].

Note that keying with a void reference results in a run time error, and does not automatically create a bunch.

Integers used as keys must not be big, i.e. their absolute value less $SYSTEM.intmax, which are at least 32-bit integers (with .isbig or is () big not true). Rather large integers can be used; only the difference between the first and last must reflect the available memory.

Maps

In contrast to rows, maps are associative memories that can use any item not only as value, but also as a key for the cells, if it can be compared for equality with other keys; no order comparison is required. They are also called dictionaries, associative arrays or symbol tables.

While normally strings are used as keys, integers or tuples may also be used (where the integer 1 and the string 1 are quite different):

ary =: {}
ary{1} =: 1
ary{'1'} =: 2
ary{9, "hello"} =: "Hello, 9"
! 1 = ary{1}
! 3 = ary.count;

For maps, only those cells are allocated that have non-void contents; writing a void erases the cell; there is no means to find out if a cell had existed before under a given key. Thus, sparse vectors and matrices can use maps with integers or tuples as keys, as only the non-void cells are allocated.

Keys are the same if they are of same kind and equal (guarded equality). Note that equality comparison for anything other than numbers and strings compares the reference only, and may give unexpected results. In particular rounded (floating point) numbers have intentionally no comparison for equality and thus cannot be used as keys.

In contrast to rows there is no general means to enumerate the keys, thus a function is required to obtain the set of keys:

r =: map m keys
r =: row from keys map m
?# i =: row r give values
    print m{i}
\ or
rv =: map m values as row
rv =: row of values from map m
?# v =: row rv give values
    print v

As both functions give a snapshot of the key space, there are no void cells in the rows supplied.

Normally, scan functions are used to provide the keys or values:

?# k =: map m give keys
    print m{k}
\ or even
?# v =: map m give values
    print v

See below on Scans for rows, tuples and maps for peculiarities if a map is changed inside a scan loop.

Using swivelled braces for maps might look a bit baroque, but see Unified Rows and maps below for reasons not to unify rows and maps.

Often, only attributes are needed, so the choice between a map and a row is fairly arbitrary. As a map is the more general construct, because any item reference can be used, the use of maps is recommended in these cases8.

Maps currently use linked lists to store the values; this proved to be quite efficient for some thousend entries. In order to speed up access, the least recently used keys are moved to the front of the chain. While this works well, it should not be used instead of a database; when tens of thousands of items are stored, the operations can become quite slow, in particular if equal keys are rare. A hashed map from the standard libray may be used if a full database is not appropriate and a hash function can be defined.

Using a factory function, future versions might allow alternate access mechanisms, e.g. a hash map. Currently, the library hashmap has to be used for this purpose.

Note that pre-allocating a given number of elements is not possible; and setting an element if there is no space left results in a fatal runtime error. If this case has to be caught, a function is used to set a map entry:

map (m) put (item) at (key)

which returns a fault item if it fails. Note that intercepting memory shortages has not yet been carfully analysed and integrated.

Binary trees have a penalty for small number of cells, whereas the linked lists proved to be quite efficient, in particular as the last found or saved element is moved to the top; this has decreased runtime in some map-base applications by the factor three, in particular if there is a common pattern like:

? i =: map map give keys
    ? map{i} = somevalue
        map{i} =: othervalue

Users might it find error-prone and difficult that all kinds of items can be mixed as elements; this can be restricted using assertions, in particular constraints.

Scans for rows, tuples and maps

Looping through a row can be done using a standard loop:

?# i =: from r.first upto r.last
  print 'r[' _ i _ '] = ' _ r[i]

This will also return keys (keys) for which the values are void; to provide — in analogy to maps — only keys for which the value is not void, a scan function is available:

?# i =: row r give keys     
  print 'r[' _ i _ '] = ' _ r[i]

which may be written as:

row (r) give keys:
    ? $ = ()
        $ =: r.first
    ?* 
        ? $ > r.last
            $ =: ()
            :> ()
        idx =: $
        $ =+ 1
        ? r[idx] = ()
            ?^
        :> idx

and similar:

row (r) give values:
    ? $ = ()
        $ =: r.first
    ?* 
        ? $ > r.last
            $ =: ()
            :> ()
        val =: r[$]
        $ =+ 1
        ? val = ()
            ?^
        :> val

The row may be changed during the scan; if it is dynamically extended at the bottom (r.first lower), these keys will not be supplied, while extending the row at the upper side will supply them. Values supplied will be actual values.

In order to keep the row unchanged, it may be locked (and unlocked) outside the loop, or the revision number checked inside:

row (r) give keys safe:
    ? $ = ()
        $ =: {}
        $.revno =: r.Revisions
        $.idx =: r.first 
    ! r.evisions = $.revno
    ?* 
        ? $ > r.last
            $ =: ()
            :> ()
        idx =: $
        $ =+ 1
        ? r[idx] = ()
            ?^
        :> idx

As mentioned above, to obtain the set of currently used keys for a map requires a library function:

Scan functions are also available for tuples for consistency, just using an integer range scan would be sufficient:

l =: 2, 3, 5, 7, 11, 13     \ primes
?# i =: tuple l give keys  
    print i, l[i]
?# v =: tuple l give values 
    print v

For tuples and rows, a scan function is provided that just gives the values:

give (rt) values:
        key =: give rt keys
        :> r[key]           \ void gives void

Elements of a bunch are deleted by assigning void to it. Once set to void, it cannot be distinguished from an element that never has been set. If there is a need to do so, a neutral element like the number zero, the empty string, a logical false, or an empty bunch may be used.

A bunch with zero elements can be created by using the empty row literal [] or by deleting all elements:

?# i =: r.scan              \ loop over all existing non-void elements
  r[i] =: ()                \ delete each element by assigning the void reference
! r.count = 0               \ assert the row is empty now

Note that there may exist several bunches with all elements deleted; they are not collected to a reference to a single empty bunch item, so one has to compare the number of used elements (attribute .count) to zero.

From the user point of view, any key can be used, even negative ones. Those cells that have not yet been given a value (reference to an item), will return the void reference, independent of being allocated or not9.

Using the .first attribute, the numerically smallest key can be queried, and the largest key .last is the smallest key plus number of elements (attribute .count) minus one. The number of keys may be zero, in which case the smallest key is greater than the largest.

Note that a[a.last+1] refers to the next free cell, thus

add (val) to (map) at end:
    map[map.last+1] =: val
add (val) to (map) at front:
    map[map.first-1] =: val

As adding a row element to the next larger key is fairly often needed, an empty key, i.e.

a[] =: new value to be added

may be used instead of

a[a.last+1] =: new value to be added

Note that only =: is supported, as a[a.last+1] has the void value before, so that using =+ would give a fatal error. Note also, that for an empty row, the first key used is one, not row.last+1, which would be 0.

Sparse rows are not supported via integer keys, only as maps. While the bunch is automatically expanded, it is not automatically shrunk; this must be done by a library routine call, possibly using the number of non-void cells provided by the used attribute.

Note that keying the void pointer as well as using an item other than an integer or string as key is not supported and will result in a run time error.

Adding a new element to a map first checks if the key already exists and its value should be replaced, which may take a long time on large maps. By setting the .noreplace attribute, this behaviour is changed: Any setting of an element in a map with a given key map adds a new element in front, even if the key already exists, and the old entry is hidden, unless the new one is deleted. Setting an element with void deletes the current one, unhiding the previous one. Creating a key row will contain duplicate keys. So to replace a value if the .noreplace option is active, you must delete it first:

\ replace 
map.noreplace =: ?+
? map{key} ~= ()
    map{key} =: ()
map{key} =: newval

As the least recently used entries are at the top, the above is quite efficient, if the entry to be replaced is not the oldest entry.

It would be tempting to make the .noreplace option standard and delete all hidden entries once their amount exceeds a given limit (which?). Like any other garbage collection, this will take time and add unexpected delays, thus the user must call a corresponding function when appropriate:

remove duplicate entries in map (map):
    o =: ()
    ?# i =: give keys from map map ascending
        ? i = o
            map{i} =: ()
        o =: i

Note that the .count attribute gives the number of physical entries, and that there cannot be a count of visible entries, as it is just the feature that an entry is added without checking for a duplicate. One strategy might be:

- Set the theshold to 1000 entries (.count)
- if there are more entries, remove duplicates
  and check the new size.
- If the count is below 2/3 of the threshold, continue;
  else increase the threshold by e.g. 50%

Cyclic references

As rows and maps can contain references to other rows and maps, cyclic references can be created. This spoils the automatic storage deallocation, as such a chain is self-consistent, and will cause memory leaks. Thus, the standard runtime system keeps track of the number of allocated bunches, and will give an error message at normal termination if there are bunches left over. For this purpose, all allocated items are recorded, causing a small performance penalty.

Examplea are:

a =: {}
b =: {}
c =: {}
a.next =: b
b.next =: c
c.next =: a
! a.usecount = 2    \ from the variable and from c.next
a =: () 
! c.next.usecount = 1  
 
x =: []
y1 =: {}
y1.up =: x
x[1] =: y1
y2 =: {}
y2.up =: x
x[2] =: {}

Instead of vanishing, the three items are keeping each other alive, but are not reachable any longer in the programme.

One example is the tree used if the DOM model is used for XML, with backwords references to the root or next higher level. If the outer referece to the tree is cleared, the root node still has a usecount greater 0 from the references of its primary children, and thus the whole tree is left in memory.

Library functions can be used to remove cyclic references inside a tree or cyclic chain, instead of doing so manually:

clear (tree1) all cyclic references 
tree1 =: ()
clear (tree2) top cyclic references 
tree2 =: ()

Both functions mark the current bunch visited and recursively descend all references to bunches in fields and elements unless a reference to a marked node is found. In the first case all those references are set void, in the second case only if it is a reference to the top item. If after that the usecound of the top node is 1, clearing the last reference will clear this item in the usual way.

All those references cleared this way are candidates for weak or semi-references, for which the assignment =; (equals-semicolon) is reserved, but no sound specification is yet available.

Although the items itself with a usecount greater 0 are known, this is not the case for the references to the items from outside, i.e. the local and global variables. Thus, a classical garbage collector is not possible.

The debug dump () function also detects cyclic references and marks them while descending the refernces hold by bunches.

Tagged bunches

The .tag attribute of a bunch can be set by the user freely to a reference to any item, but only once, i.e. only a void can be overwritten.

It allows a kind of type-controlled programming, in particular if used with constraints and assertions, and also makes object-oriented programming simpler.

A tag can thus be any item, but commonly stings are used, in particular in a hierarchy notation, e.g. de.glaschick.tuf.lang.env. The use of the delimiter is free, but only the underscore and the point may be used in Returning bunches in scans.

Its canonical use is for constraints (see Assertions and Constraints), like in

! env: @.tag = 'de.glaschick.tuf.lang.env'
do something on (env: env):
    ...
start all:
    nb =: {}
    nb.tag =: 'de.glaschick.tuf.lang.env'
    do something on nb

Besides the use in assertions and constraints, tags are a major means to associate functions with bunches, supporting a kind of object oriented programming, see Returning bunches in scans, where the strings are more likely to be word lists e.g. env glaschick, env_glaschick, env.glaschick.

For the programmer's convenience, the tag name may be given during creation within the brackets:

na =: {'de.glaschick.tuf.lang.env'}
nr =: [de_glaschick_tuf_lang_env]

If the tag is just a word (thus the use of underscores), the string quotes may be omitted. As it makes no sense to localize the tags, do not use double quotes. Compilers may reject such an attempt. Note that numbers give the estimated size, and thus cannot be used as tags in this notation.

Sorting

The basic sorting operation is to re-arrange the elements of a row in ascending (or descending) order (inplace sort):

sort (row) ascending:
    ....

and with a user provided comparison function.

    sort (row) using `compare (a) and (b)`
...
compare (a) and (b):
    ? a < b
        :> 1
    ? a > b
        :> -1
    :> 0

When using a compare function, the sort is always ascending with respect to the return of the compare function; just invert the sign of return values to have it descending.

As only the item references are sorted in place, there are no extra memory demands, and the comparison of the keys may be the bottleneck. If a sorted copy is demanded, copy the row, then sort.

The standard library function uses insertion sort, which works fine upto 10000 elements, and is stable, i.e. does not change the order of already sorted elements (which might be important if a comparison function is used.) This sort is also used to provide a sorted scans for keys and values of maps.

To speed up sorting, the standard sort of the C library may be used, which is quicker, but not stable, and thus denoted as unstable.

Often, a table with more than one column must be sorted by the first, and the others re-arranged correspondingly. In particular to supply the keys of a map in such an order that the values — not the keys — are sorted:

give keys of map (map) ordered by value

This can be done by three ways:

The first solution works by supplying an approprate comparison function, but is relatively inefficient.

The last one can be obtained by the second one by supplying an identity tuple (new tuple has (row.count) values ()

In the above case, at first two rows are created for the keys and the values of the map, and then sorted :

idx =: []
val =: []
?# i =: scan keys map
    idx[] =: i
    val[] =: map[i]
sort idx ascending sync val 
?# i =: idx.scan
    print val[i] __ idx[i]

The corresponding insert sort reads:

sort row (row) syncing (val) ascending:
?# sp =: from row.first+1 upto row.last     
    val =: row[sp]                          \ next value 
    ?# ip =: from sp-1 downto row.first     \ go down sorted part
        ? val > row[ip]                 \ check in sorted part 
            ?>                          \ insert point found
        row[ip+1] =: row[ip]                \ propagate upwards
        val[ip+1] =: val[ip]                \ propagate upwards
    row[ip+1] =: row                        \ insert here
    val[ip+1] =: val                        \ insert here

All sort functions return the row argument to allow nested use with functions returning a row, although most such applications are already provided, and it might never be used, except when a function reference is needed. Examples are not yet known.

3.7. Chunks of bytes

To cope with arbitrary binary data, the item kind called a byte chunk is provided. As the name says, these are chunks of unstructured bytes in memory. In contrast to strings, there is no inherent interpretation, and byte chunks are mutable. Their primary purpose is to access binary files in the filesystem. Byte chunks are a variant of a bunch, i.e. support similar attributes and also allow arbitrary fields set and obtained.

Byte chunks are created, accessed and modified by standard library functions only, and references to byte chunks are used like any other item, including the memory management that keeps them until the last reference has gone.

Byte chunks are allocated in one piece and cannot grow or shrink. Standard functions allow to get or put strings, binary coded numbers, or byte chunks. As no standard serialisation is defined for maps and rows, these cannot be get from or put into byte chunks directly.

As usual, the .bytes attribute returns the number of bytes used for data. It is always greater zero, because an empty chunk cannot be created.

Chunks have one mutable attribute:

Freshly created chunks are not initialised, and to inhibit the use of stale data, a fill level is automatically set to the last byte changed. If data is put into an area not yet written to, any not yet written bytes are filled with zeroes. The fill level is available via an attribute; to keep the number of attributes low, the .count attribute name is used for this.

Function using byte chunks use the general patterns

... chunk (chunkref) at (offset) size (size) ...
... chunk (chunkref, offset, size) ...

the latter is intended to use ngle parameter with a tuple, but may not be supported.

Thus access to byte chunks always uses an offset and a size, where the offset starts with zero for the first byte. The offset may be negative, in which case it is counted from the end, thus -1 is the last byte and 0 the first byte. Whether a size that exceeds the defined data is accepted, depends on the function.

A byte chunk can be created using a standard library function, which returns a reference to the byte chunk, or a fault item if the space was not available:

bcr =: new chunk of size (n)
? bcr.isfault           \ no memory
    print bcr.msg
    bcr.status =+ 1
    :> bcr
! bcr.bytes = n
! bcr.count = 0

The standard library functions to extract data from byte chunks are:

i =: get integer from chunk bcr at 32 size 4        \ signed
i =: get counter from chunk bcr at 36 size 1        \ unsigned
r =: get float from chunk bcr at 0 size 8           \ length 4 or 8
s =: get string from chunk bcr at 77 size 16
s =: get string from chunk bcr at 77 size 99 zero ends  \ null terminated
b =: get chunk from chunk bcr at start size len

These functions return references to new items with the data copied in the internal format of the corresponding item.

A byte order must be set for integers and floats (see .byteorder above). In order to reduce the chances for unnoticed data corruption, the size when getting a number must not exceed the initialised space (as returned by .count).

Floating point numbers are regarded as IEEE 754 single (32bit) or double (64bit) format, thus only the sizes 4 and 8 are guaranteed.

Integer numbers are regarded as binary signed in two's-complement notation. To extract unsigned numbers, the word counter is used, although the same effect can be achieved by adding 256 for a single byte, 256²=65536 for two bytes, etc, after extraction. Unless arbitrary precision numbers are implemented, the size must be between 1 and 8. For a single byte, setting a byte order is not necessary.

For string extraction, the first form copies the number of bytes given into a string, and then checks for UTF-8 encoding, setting .raw if that fails. The resulting string may contain zero bytes, in particular trailing zero bytes.

Because zero-terminated strings are often used in external data, the variant with the added term zero ends copies until a zero byte is found, which is not copied as TUF-PL does not need it. If there is no zero byte, copying stops when the end as indicated by the size is reached.

If a chunk is extracted from a chunk, a new byte chunk is created with the size given, and the bytes starting at the offset given are copied. If the size given exceeds the available data, only that data is copied, but the chunk allocated for the full size; the fill level is lower then.

To replace data in a byte chunk, the standard functions are:

l =: put integer (item) into chunk (tgt) at (offset) size (len)
l =: put counter (item) into chunk (tgt) at (offset) size (len)
l =: put float (item) into chunk (tgt) at (offset) size (len)
l =: put string (item) into chunk (tgt) at (offset) size (len)
l =: put chunk (src) at (soffset) size (slen) into chunk (tgt) at (toffset) size (tlen)

These functions return the offset of the next byte after the last byte accessed, i.e. offset+len, where offset and len might have been restricted (and negative offsets converted). If this number is equal to .bytes, the chunk is full now. As mentioned below, unused space below the area written is automatically zeroed.

A byte order must be set for integers and floats as above. In order to reduce the chances for unnoticed data corruption, the size when putting a number must not exceed the space available, else it is a fatal error.

Floating point numbers are binary floating point numbers in IEEE 754 single (32bit) or double (64bit) format, thus only the sizes 4 and 8 are guaranteed.

Integer numbers are formatted as signed binary numbers in two-complements; the value must in the appropriate range (e.g. -128..127 for a single byte), or a fatal error occurs. To write unsigned numbers, the form put counter ... is used for clarity, even if this could be done by just adding 256 etc. if the number is negative. For size 1, setting a byte order is not necessary.

Strings are not truncated to avoid accidental data loss, i.e. there must be sufficient space after the offset. To determine the space needed, the .bytes attribute of the strings should be used, not the .count attribute, as the latter counts characters. If the size given for the byte chunk is larger than needed, the remaining space is filled with zeroes, so a zero-terminated string may be created as follows:

str =: 'König'
put string str to chunk xx at 32 size (str.bytes + 1)

Note that TUF-PL strings may contain zero bytes, even if UTF-8 encoded, thus the above only guarantees a zero end byte, but not as the only one.

When copying between byte chunks, the target length must allow all source bytes to be copied, i.e. data is not automatically truncated. If the source is shorter than the target space, it is filled with zero bytes. The source size is the effective source size, taking into account the end of the byte chunk and the fill level. The chunks slices may overlap; in this case, it works like an extra buffer is used.

To create a new byte chunk by catenating the contents of two others, no library function is necessary nor significantly more efficient:

catenate chunk (first) and (second):
    nbc =: new chunk length (first.byes + second.bytes)
    pos =: put chunk first at 0 size first.bytes into chunk nbc at 0 size first.bytes
    put chunk second at 0 size second.bytes into chunk nbc at pos size second.bytes

An example to write a string with a leading string length in binary:

str1 =: "Hello, world"
bcr =: new chunk size 12345
pos =: zero chunk bcr size 1+str1.bytes
pos =: put counter str1.bytes into chunk bcr at pos size 4
pos =: put string str1 to chunk bcr at pos size str1.bytes

As byte chunks are bunches, attributes may be used to track positions and to allow shorter function templates. User attributes are used:

.getpos and .putpos are updated when used.

So there are functions with omitted parts following the general pattern:

... chunk (chunkref) at (offset) size (len) ...
... chunk (chunkref) size (len) ...
... chunk (chunkref) at (offset) ...
... chunk (chunkref) ...

An example is:

bcr.getpos =: 1036
get integer from chunk bcr size 4           \ signed
get counter from chunk bcr size 1           \ followed by unsigned
bcr.getpos =: 32
get float from chunk bcr size 8
bcr.getpos =: 77
bcr.getpos =: 1024
bcr.getsize =: 32
?# i =: 0 to 8
    cr =: get chunk from chunk bcr          \ adjacent slices
    ... process chunk cr

Byte order

When treating byte sequences in byte chunks as binary words, the byte order is relevant, i.e. whether the byte with the lowest number is the least significant byte (LSB) or most significant byte (MSB). The former is often called Little endian byte order, and the latter Big endian byte order, also called network byte order because most internet protocols use it. Each byte chunk has an attribute .byteorder, which can be U for undefined, L for LSB first, and M for MSB first, and is only propagated if a chunk is created from another one. It must be set before any integer or float is extracted or written; using the host byte order as default is not supported in order to avoid unintentional byte order dependencies. The .byteorder field can be set not only with U, L and M, but also with H for host byte order and N for network byte order. Reading back, however, always gives U, L or M, so this can be used to determine the host byte order.

Only the first character is used when set, and may also be a small letter, so

bcr.byteorder =: 'host'
bcr.byteorder =: 'LSB'

are both valid and work as supposed.

Using not a string with at least one character of the above set is a fatal error.

Note that .byteorder is a reserved field (lowercase), not an attribute, as it can be set by the user. However, a field must be returned unchanged;t Thus it is an attribute???

Block IO

Byte chunks provide access to binary system files.

The functions to exchange byte chunk data with the file system are:

fd =: open chunked file reading (filename)
fd =: open chunked file overwriting (filename)
fd =: open chunked file appending (filename)
fd =: open chunked file updating (filename)
rcnt =: read chunked file (fd) to chunk (bc) at (pos) size (size)
wcnt =: write chunked file (fd) from chunk (bc) at (pos) size (size)
loc =: tell chunked file (fd) position
lco =: position chunkedfile (fd) start plus (loc)
rc =: close chunked file (fd)

The following example shows how to copy a file (with minimal error handling):

infd =: open chunked file reading "/dev/random"
outfd =: open chunked file writing ("/tmp/xx" _ process id)
bc =: new chunk of size 1024*1024
?*
    rcnt =: read chunked file infd to chunk bc at 0 size bc.bytes
    ? rcnt.isfault
        :> rcnt         \ not necessarily a fatal error...
    ? rcnt = ()
        ?>              \ break loop if end of file
    ? rcnt = 0
        ?^              \ repeat if nothing read, but not yet end of file
    wcnt =: write chunked file outfd from chunk bc at 0 size rcnt
    ? wcnt.isfault
        :> wcnt
    ? wcnt ~= rcnt
        :> new fault item with code 1 and message "Write failed." 
close chunked file infd
close chunked file outfd

The read chunked ... function reads from the file as many bytes as are possible to the chunk determined by the position and size; here, the whole chunk is used. Note that while UNIX file IO returns 0 for end-of-file, and an error for a non-blocked IO that has nothing to deliver, the above function returns void instead of a fault item for end-of-file.

To transfer the data to another file, the write chunked ... transfers all bytes from the byte chunk slice, as indicated by position and size, but may fail to write all bytes and returns then number of bytes written.

Due to the similarity of byte chunks with UNIX files, a file can be memory mapped to a byte chunk. The library functions to do this are still to be specified.

3.8. Fault items

A programme may encounter two types of exceptional situations:

fatal errors:
These are faults that cannot be repaired within the programme. Typically, these are logic flaws that were not detected during creation, or situations, where the situation is so defective that the fate of the programme can only be to die.
recoverable errors:
These are exceptional situations that are considered not normal and require special measures. This is typically the case if an operation cannot provide the results normally expected, e.g. a read error during a file read. These are faults that lead to failure, unless special recovery measures are taken.

The traditional way to deal with recoverable errors in the C programming language is to indicate error conditions by return codes. However, it was observed that:

As an alternative, the exception model was introduced. A block has the option to handle the exception, or to pass it upstream. Now programmers could no longer accidental ignore error events and continue with code that runs under false assumptions. However, while the separation of normal program flow and error handling allows the former to follow more easily, an in-depth handling needs to bring together the lexically separate statements in the block of the normal program flow with that of the exception handling. Still, the major advantage is that error conditions are not lost and can be passed to the caller easily. The details of catching and unravelling exceptions is hidden from the programmer and requires a fairly complex runtime support.

In TUF-PL, exceptional conditions do not require new syntactical support; just a new kind of item, the fault item and a slightly different treatment for assignments are necessary. Eventually, it is not so different from exceptions, only integrated differently in the language. To avoid confusion with exceptions used in other programming languages, this term is not used in TUF-PL.

Fault handling is done by having a special kind of item, the fault item, which can be returned and processed like any other item. But the fault item has the special feature that a newly created one will not be silently lost if the value is to be freed, e.g. because the variable that holds a reference gets out of scope, or the function result is discarded. Instead, a fatal error will be the result, see details below.

As fault items are not intended to be used instead of bunches, there is no distinction between fields and attributes. These field names may not be localised:

.checked    number of times processed
.errno      error code number 
.code       error code string, e.g. 'ENOENT', or void
.message    message string
.backtrace  row with backtrace information
.lineno     line number where the fault item was created
.prototype  prototype of the function that generated the fault item
.info       additional information
.when       to differentiate different sources 
.data       for partially accumulated data before the error occurred

The programmer may add more fields and modify the above ones.

To check if an item returned is a fault item, use the function is () fault, not the deprected attritube .isfault. Normally, fault processing is guarded by first calling it after a value was returned by another function.

The field '.checked is set to integer 0 initially, indicating a not yet acknowledged fault item. To mark it processed, any other value is accepted; it can be incremented by a function or a field assignment:

fault (fault) set checked:
    fault.checked =+ 1

Done each time the error has been logged or otherwise acted upon, this allows upper levels to discard fault items with a large enough number. Thus fault items could be returned after marked as serviced to signal failures upstream, just for unravelling call chains manually.

While it is easy to write a function circumvent nagging messages as unprocessed fault item, its use is very limited — it returns void instead of a fault item — and thus it is not in the standard library:

ignore fault (item):
    ? is item fault
        item.checked =+ 1
        :> ()
    :> item

The .backtrace attribute, if not void, is a row representing the function backtrace (without parameters) at the point of error. Starting with 1, each odd element is a string that contains the function name, and each even element the corresponding source code line number.

As fault handling parts of programmes are normally guarded using the function is () fault collisions are rather unlikely.

To create a new fault item, a library function is provided:

new fault item with code (code) and message (msg):

To deal with error code numbers, the standard way is to define named integer literals, e.g.:

#EPERM  = 1     \ Operation not permitted
#ENOENT = 2     \ No such file or directory
#EINTR  = 4     \ Interrupted

While possible, because the C library has to have these codes unique anyhow, the use of small strings, i.e. EPERM as error code would be preferred; if the code is just a string, or a second field with the code string (with the message text in addition) is used, is still to be determined.

A function that detects an uncommon situation returns a fault item instead of the normal result, which could easily be checked in the further programme flow to start error recovery.

As an example, take the following code fragment, printing the contents of a file:

print all from (fn):
    fd =: new stream reading file fn
    n =: 0
    ?*
       str =: line from file stream fd
       ? str = ()                       \ end of file ?
            ?>                          \ yes, terminate loop
       print n _ ': ' _ str
       n =+ 1
    :> n

If the file cannot be opened, the function new stream reading file () returns a fault item, which the function line from file stream () refuses to handle generating a fatal error.

If during reading, a read error occurs, str receives a fault item. The string catenation operator _ does not accept fault items, and causes the programme to stop with a fatal error.

To handle the fault items, the item kind could be checked; to increase readability slightly, a system attribute .isfault is defined for every item including the void item, which returns ?+, the true value, only if the item is a fault item. Thus, instead of analysing function-dependend return codes, to catch the error, the .isfault attribute has to be checked:

print all from (fn):
    fd =: new stream reading file fn
    ? fd.isfault                    \ error?
        print "could not open " _ fn _ " reason:" _ fd.msg
        fd.checked =+ 1                 \ mark processed
        :> fd                           \ tell upper level about the error
    n =: 0
    ?*
        str =: line from file stream fd
        ? str.isfault
            :> str                      \ pass read error upstream
        ? str = ()                      \ end of file ?
            ?>                          \ yes, terminate loop
        print str
        n =+ 1
    x =: drop file stream fd
    ? x.isfault
        :> x                    \ pass error upstream
    :> n

The less often encountered errors when reading and closing are simply passed upstream; only the standard open error is shown to be processed directly, and marked as processed.

Using a library function, the whole becomes much simpler:

print all from (fn):
    ?# str =: give lines from file fd
        ? str.isfault
            :> str                      \ pass read error upstream
        print str
        n =+ 1
    :> n

Of course, the calling function has more to do to sort out a fault item, unless it simply prints flt.message and flt.when.

When a fault item is created, it is marked as not processed by setting the '.checked attribute to integer zero. In order to avoid accidental loss of unprocessed fault items, the attempt to discard such an unprocessed fault item (e.g. by assigning a new value to a variable holding the last reference) results in a fatal error. Once the status is not zero, the fault item can be freed automatically like any other item.

If the second to last line would not be present, and instead be written:

file stream fd close

the void reference normally returned would be silently discarded. If something else is returned, discarding a non-void value will be a fatal error anyhow. It might be tempting to write:

x =: drop file stream fd
:> n

to get rid of any non-void return values, but it does not work for fault items, because no fault item with zero usecount is freed unless its status is not zero.

In order to help upstream error processing, any field can be set in a fault item; recommended is the use of .when:

print all from (fn):
    fd =: new stream reading file fn
    ? fd.isfault                    \ error?
        print "could not open " _ fn _ " reason:" _ fd.msg
        fd.checked =+ 1                 \ mark processed
        fd.when =: 'open'
        :> fd                           \ tell upper level about the error
    n =: 0
    ?*
        str =: line from file stream fd
        ? str.isfault                       \ pass read error upstream
            str.when =: 'read'
            :> str
        ? str = ()                      \ end of file ?
            ?>                          \ yes, terminate loop
        print str
        n =+ 1
    x =: drop file stream fd
    ? x.isfault                             \ pass error upstream
        x.when =: 'drop'
        :> x
    :> n

The scan function to read lines of a file does just this, so the compact version would be:

print all from (fn):
    ?# str =: give lines from file fn
        ? str.isfault
            error print 'Failed on ' _ str.when _ ', reason: ' str.info
            str.checked =+ 1
            :> str
        print str

Note that the scan returns a fault item if the drop (close) fails, and does not just terminate the loop.

Another help for debugging fault items is the backtrace stored in the .trace attribute whenever a fault item is created, which is row of elements alternating between function name (string) and line number (integer).

The pattern:

x =: do some function
? is x fault
    :> x

is quite often needed, but clutters the code with error processing, even if the condensed form is used:

? x.isfault; :> x

This can be avoided by using the error guarded assignment =? to indicate that instead of void, a fault item may be returned, and in the latter case the function should return the fault item immediately upstream unchanged:

x =? do some function

instead of

x =: do some function
? is x fault
    :> x
work with x

If the function returns void or a fault item, the target variable may be left out:

=? function returns void or fault

Experience so far shows that there are not so many places where the fault item is returned without having added some information, and, in particular in scans, the test is not so seldom delayed one statement.

In case that despite the arguments that follow someone insists on having the equivalent of try/catch blocks, the error guard ?? should be used:

print all from (fn):
  ??
    fd =: new stream reading file fn
    n =: 0
    ?*
        str =: line from file stream fd
        ? str = ()                      \ end of file ?
            ?>                          \ yes, terminate loop
        print str
        n =+ 1
    drop file stream fd
    :> n
  |
    error print 'Failed on ' _ str.when _ ', reason: ' ?@.info
    ?@.checked =+ 1
        :> ?^

Note that the same else notation is used as for a normal guard, as is must follow immediately to a guard and requires to pass context anyhow.

In this case, every assignment (and every function return value to be discarded) is checked for a fault item, and in case it is a fault item, execution continues in the error else block. Within the error else block, the special variable ?@ contains the reference to the fault item.

However, the above example is felt to be artificial; in most cases, the try-part of the block could be grouped in a function, and the example will become:

print all from (fn) block:
    fd =? new stream reading file fn
    n =: 0
    ?*
        str =? line from file stream fd
        ? str = ()                      \ end of file ?
            ?>                          \ yes, terminate loop
        print str
        n =+ 1
    =? drop file stream fd
    :> n
 print all from (fn):
    x =: print all from fn block
    ? x.isfault
        error print 'Failed on ' _ x.when _ ', reason: ' x.~info
        x.checked =+ 1
        :> x

A fault item is created via the library function

new fault item having message (msg) and code (code):

Note that the fault item mechanism is an effective countermeasure against a certain class of errors, cited from the close system call documentation:

Not checking the return value of close() is a common but nevertheless serious programming error. … (it) may lead to silent loss of data."

The same clearly holds for an error detected during reading the next line.

Fault item special cases

Some peculiar situations may arise:

First, if a loop repeatedly returns fault items that are just collected in a map, the program will use up all memory, as with all other endless loops filling maps, as the fault items are not discarded unprocessed yet. Runtime systems could limit the number of fault items created without marking a fault item processed in between, but this seems to be a fairly rare case.

What is the proper handling of fault items within a scan?

Setting the scan context to void and returning the error item stops the iteration and leaves a fault item in the scan variable, which could be checked after the loop block, as the last value is still assigned to the scan variable.

Levaing the scan context other than void, will try to discard the fault item

immediately result in a fatal error, as the iteration is stopped, but the return value discarded. So the fault item should be saved into the scan variable and its previous contents in the .data attribute. If the caller does not process the fault item, the next iteration round will try to replace the old fault item with a new one, which will raise an unprocessed fault item error. If a scan may return a fault item, it should assert that it is not called again with the error item in the iteration variable, as this would also prevent the above endless loop.

Signals

UNIX signals cannot be mapped to exceptions, as there are none. As fault items replace exceptions, signals are honoured using fault items.

There are two kind of signals: synchronous, often called traps, like arithmetic overflow, illegal instructions etc; and asynchronous.

As traps occur if the program is buggy, they immediately terminate the program. Some people find it comfortable to use a try/catch bracket around calculations or other statement sequences that might crash, instead of carefully considering each source of error individually. From my point of view, the recovery of such error situations is either more complicated than detecting the error in the first place, or sloppy, leading to programs with erratic behaviour. Thus, catching traps is not supported in TUF-PL. In case an application must continue in case of errors of subfunctions, the latter must be put into processes of their own, as to protect the main process as well as other children processes from interference.

There are several options how to treat (asynchronous) signals, of which the last one is used:

First, all signals could be disabled. This may be fatal unless for short times only.

Second, signals could be mapped to a fatal error with stack trace, but no means for recovery. This rather handy when a program loops and must be stopped. It is, however, a bit unusual if a program requesting input is stopped by CTRL-C and then shows a stack trace. Note that using curses or readline, this behaviour will not arise.

Third, a signal could set a global flag that (and when) it occurred. However, this must be polled; but it is still not a bad solution if properly designed, i.e. ensured that the flag is inspected often enough. But a timer might still be needed because the programme is far to complex to ensure proper polling.

Forth, a signal could call a TUF function, which is restricted as described in the Unix manual. Unfortunately, the only thing this function could do, is to set some global variables or change an item that is passed with the intercept call. After that, it can just return and have processing resumed where it was interrupted, or stop the programme prematurely.

The pair setjmp/longjmp cannot be used directly, as it would result in memory leaks of all the items hold in the variables of the abandoned stack frames. As memory management uses reference counts, no garbage collection is available that would clean up memory use after the longjmp. A similar solution might come, in that each function returns immediately with releasing all local variables so far, until a fault handler is provided in a function.

A fifth option, which is standard for several signals, changes the return value of the next function return to a fault item composed by the signal handler. The original return value is put into the .errdata field, while the backtrace information was saved in the .errbtfield. Thus, the evaluation of expressions and assignments continues as if no signal occurred. This is sensible, as the time (asynchronous) signals occur is not related to the program, so the time after such local processing is as good as the original time.

After the continued chain of expressions and assignments, either a return, a function call or a loop repeat may be next. A return just returns the fault item instead of the original value, which is saved in the .errdata field.

A loop repeat is treated like a return with a void value; i.e. the function returns immediately independent of the loop nesting. Function calls are also aborted and replaced by an immediate return of the pending fault item. When a function returns the fault item set by the signal handler, this information is cleared.

For this to work even with tight loops like:

?* 1 = 1
    \ do nothing

the loop code inspects at each round a global variable which contains either void or a pointer to the new fault item, and returns immediately this value if it is not void. The return code then returns this value if it is not void, thus working even if the function would return anyhow. If it is already beyond that code, i.e. returning anyhow, there are two possibilities:

The first option would not catch a run-away program that is just recursively calling functions without using a loop, i.e.

runaway1 (x):
    x =: 1
    runaway2 x+1
    print "never reached"
runaway2 (x):
    x =- 1
    runaway1 x

Note that

also runaway (x):
    ? (x // 2) = 0
        also runaway x+1
    |
        also runaway x-1
    print "never reached"

is not caught, because there is no loop, but

not runaway (x):
    ?* x > 0
        x = not runaway x
    never reached

will be caught, because it has a loop. However, the runaway programs will finally stop with a stack overflow anyhow; without using a loop or calling another function, this will be very quick. Unfortunately, some systems do not call the signal handler, as this would require some space on the stack, so there is nothing like a backtrace etc., unless a debugger is used.

Unless a stack overflow can be handled properly, there is no need to check during function call code for a pending fault item.

Unfortunately, this scheme is practically useless as it is, because any function can return a fault item, and either the program unexpectedly has a fault item where another one is expected, and traps into a fatal runtime error, or discards the value, which also traps in a runtime error for non-processed fault item.

So what needed is a scheme that allows a function (or block) to register for the reception of fault items; so it is similar to the try/catch mechanism.

It remains the question of how to cope with multiple signals, because signal races are a real source of unreliability. So, once the forced return of a fault item caused by a signal has been initiated, the next signal must not do the same until the programme signals the end of error handling by setting the '.checked flag as usual. However, to ensure normal functioning during fault item processing, the code that does the forced return clears the location pointing at the fault item. The signal handler uses a second location having a reference to the current fault item caused by a signal, and checks if the fault item is still unprocessed. If it is processed, a new fault item can be injected; otherwise, it is not yet specified as if a fatal error is displayed, or a queue of unprocessed signals is kept and the next signal item dequeued if the current one is marked as processed.

Finally, there is the situation that library functions use system calls like sleep or wait or some blocking IO. In these cases, the system calls return after the signal handler returns, and sets errno, which will be recorded to a fault item; thus these are safe. However, some system calls like fgets do not return if interrupted by a signal during an attempt to read; it just repeats the read. Maybe using setjmp/longjmp should be used here; the current version honours the interrupt after some input.

It is admitted that it is not easy for the programmer to sort out fault items returned; and that it needs some discipline to process only the expected fault items and pass all others; as there is no language help in this as with some exception mechanisms.

Fault items: conclusions

Thus, the net effect is nearly the same as with exceptions, but:

The disadvantage may be that the code is slower compared to the use of exception in C++, as the latter can unravel the call stack by some complex runtime routines only when needed, while fault items will do many checks for the normal program flow, and this sums up even if the checks are very quick compared to the other overhead.

Note that there are still conditions under which not a fault item is returned, because the error is considered non-recoverable, i.e. errors in the program logic than in the environment of the program. Clearly this includes the case that an unprocessed fault item is going to be destroyed. It also includes wrong item kinds, in particular providing voids instead of tangible data, the attempt to key rows with string, or to add strings to numbers.

4. Assertions and Constraints

Assertions and constraints are rather similar and start with an exclamation mark (!), basically followed by logical expressions. They allow to declare properties that are invariants during runtime of the program, and by this means may:

If all constraints and assertions are removed, the function of the program is unchanged; if they are present, additional runtime errors may occur and more efficient code generated.

As a rule of thumb, assertions are checked each time they are encountered and may generate runtime errors, while the effect of constraints depends on the facilities of the compiler, and generate only compile-time errors if the constraints are contradictory; depending on the compiler, they may be fully or partially ignored.

Assertions

An assertion is a line in the body of a function that starts with an exclamation mark. In its basic form, the rest of the line contains a boolean expression that must be true, otherwise the program stops with a fatal runtime error:

! b %% 2 = 1            \ b must be an odd integer
! x.abs < 1'000'000     \ x must be a number less than ± one million
! x >=0 & x < 1'000'000 \ zero or greater, but less than a million

Any expression yielding a boolean result may be used. Function calls are seldom used, as the purpose is a quick sanity check for program invariants, e.g.:

! x.count = y.count     \ both must have the same number of elements

Note that two assertions following immediately are equivalent to using the AND operator & to combine both expressions:

! x >= 0
! x < 1'000'000
\ same as
! (x >= 0) & (x < 1'000'000)

A artificial example to use the OR | operator is:

! x = () | x > 0

Here, x can be void or be an integer greater zero; note that void can be used in any comparison and yields false (unless compared to another void).

Problems may occur in the — rather seldom — case that a variable may contain different kinds other than void, e.g. an integer or a string:

! x < 56 | x < '56'

This assertion is only passed if x is void; in any other case a fatal runtime error will occur, as one of the two is a comparison that is not allowed.

However, in most situations, the kind of variable is predictable, and the assertion should check this too.

If a function accepts different kinds as parameters (which is perfectly allowed), the processing must be split anyhow, as further processing is specific to each kind. Then, the assertions can be placed after the branch determination:

collect argument (x):
    ? is x string
        ! x < '56'
        ....
        :>
    ? is x integer
        ! x < 56
        ...
        >:

By using a second exclamation mark (not yet implemented), a specific message can be used:

! a < 1 | "Use integer instead of " _ a.kind

The compiler will compose the message only if the assertion is violated, thus any functions may be used. Might of course violate an assertion before the primary one is composed.

Advanced compilers can support a block depending on an assertion like in:

! 0 <= x & x < 1'000'000
    x =+ 5
    do some processing with x
    ..

The compiler might deduce from flow analysis that the first check on entry is 0<=x & x < 999995', and then there is no check necessary after the increment.

Also, the compiler could optimize the code and, because the numbers fit well into 32 bit integers, use machine instructions instead of item processing subroutines.

Assertions without a depending block are checked just at the place where they are written during runtime (unless optimised).

Assertions may --not yet implemented — be written as a pair, the boolean expression first, a comma, and a message string next, in particular in libraries, where the user has the source code not at hand:

! step ~= 0, "zero step rejected"

Also not yet implemented: if no text is given, the expression source code is the text instead of the ancient Assertion violation.

To allow more complex assertions that do not crash due to item mix, on could propose to get rid of the restrictive rules for comparisons and simply make them false, if the kinds differ etc. This is not appreciated, as it introduces more complex rules. A solution could be using the more expressive constraints of the next section.

Constraints

Constraints are named declarative assertions and replace the type system used in other languages. While assertions are executable statements inside a function body, constraints bind a name to an assertion expression for a single variable. This name can be bound to a variable, which declares an implicit assertion each time the variable is changed. If the compiler can determine statically that the assertion is always true, no code will be generated.

The difference to a classic type system is not only that constraints are expressed for flexible, e.g. limiting the range of a number or the length of string, including a minimal length (which is evidently very seldom used, but shows the flexibility).

Also, the classic type system needs a mechanism to override the restrictions (normally called a cast) and even remove it completely. In this case, all checks are left to the programmer. In TUF, the checks are always generated, unless the compiler can determine statically that the constraints cannot be violated by the current assignment.

Constraints are syntactically very similar to assertions, i.e. the line begins with an exclamation mark, the constraint name (a word), a colon, and an assertion expression, where the item to be constrained is represented by the scroll symbol @:

! n16:          \ define 16 bit non-negative integer number
    @ >= 0      \ implies integer
    @ < 2^16
! integer:      \ general integer number
    @.den = 1   \ integer numbers have a 1 denominator
    @.frac = 0  \ alternate indication

Constraint names have a separate namespace.

The given constraint for an integer is just an illustration; it is clearly better to write:

! integer:
    @.kind = 'integer'

which is one of the predefined constraints.

A variable can be bound to a constraint by prepending the name and a colon or an exclamation mark:

! int32: 
    @.kind = 'integer'
    @.abs < 2^31-1
int32: y = x +1     \ the result must no exceed 32 bits 
int32! y = x +1     \ same as above

Even if the code is cluttered with colons already, we prefer it to be less invasive than the exclamation mark, which admittedly would have been the logical choice.

The constraint binding may be used at any place, in particular in the parameters of a function template:

n16: advance integer (n16: x):
    x =+ 1
    n16: y =: x / 2
    :> y

A constraint binding is valid for the whole scope of the variable, independent of the position. To restrict it to the lexically following uses would create a false sense of protection.

Note that in the above example, the binding to y is useful to detect a problem early; as the function is bound to n16 in the template, the return will check the constraint towards the value For best readability, the first occurrence is used. Duplicate constraint declarations of the same name are allowed, but different names are not allowed (even if they are equivalent, whatever this would be).

As constraints are declared outside function bodies, the expressions may not use local variables or parameters, and also no global variables; only literals (including named literals).

The constraint target is denoted by the scroll symbol (@), and can be omitted just after the colon, if nothing else could be meant.

To make constraint definitions more compact, a single line can be used, using a list separated by commas as in assertions, i.e. the following lines are equivalent to each other:

! n16: 
    @ >= 0
    @ < 2^16
! n16: @ >= 0, @ < 2^16
! n16: @ >= 0 & @ < 2^16

Inside the expressions, already defined constraints may be used:

! i32: @.abs < 2^31         \ integer implied
! n16: i32:@, @ < 2^16

A tree can be thus declared like this:

! string: @.kind = 'string'
! tree:
    tree: .lhs
    tree: .rhs
    string: .op

So if variable is marked as a tree and has the field .lhs, this refers to an item that must obey the tree constraint; and if a field .op is present, it must be a reference to a string.

In the code fragment

tree: t =: {}
string: s =: 'xx'
t.op =: s
t.lhs =: t
t.op =: 0           \ creates a constraint violation
t.lhs =: s          \ the same

the last two lines could be flagged by the compiler as a constraint violation already at compile time.

If a variable is not declared as certain type, the compiler generates code to check the constraint at runtime, if necessary:

a function with parameter (x):
    tree: t =: {}
    t.op =: x

expands to (because the field .op is constrained to a string):

a function with parameter (x):
    tree: t =: {}
    ! x.kind = 'string'
    t.op =: x

If the parameter is constrained to a string as in:

a function with parameter (string: x):
    tree:t =: {}
    t.op =: x

the check will be done upon entry of the function, i.e.:

a function with parameter (string: x):
! x.kind = 'string'
tree:t =: {}
t.op =: x

Note that constraint verification are applied only if a value is change, i.e. on the left hand side of an assignment.

Of course, if the function is private to the module, it can be checked statically that all invocations hold the constraint, and no check upon entry is necessary.

The notations:

! rowOfInts: integer: @[]
! mapOfIntRows: rowOfInts: @{}
! mapByStrings: @{string}

tell that the row has all integer values, the map only rowOfInts values, and that in mapByStrings, only strings are used as keys. Logically, the integer keys of a row could be constrained too:

! i100: @ > 0 & @ < 101
! a100: @[i100]

Note that the body of the constraint has the assertion syntax, thus one may write:

inOrStr: integer: @ | string: @

Using tagged bunches allows to explain data structures using the same tag boolean operator @=:

! node: @.tag = 'node'
    @.lhs @= @      \ instead of @.lhs.tag = @.tag
    @.rhs @= @

To allow more efficient analysis, the .tag attribute may be set only once, i.e. if it is not void, it is unchanged further on.

Just to restrict a bunch to a given tag name may be written differently, the last version is valid only in assertions and a shortcut to avoid cluttering of scrolls symbols:

@.tag = 'syz'
@ @= 'syz'
@'syz'

If the exclamation mark is in column 1, the constraint name is globally valid in the module, otherwise only in the block (and all dependent blocks) in which it occurs. Contrary to local assertions, they are valid independent of the location; putting them at the beginning is recommended.

The compiler might issue a warning that the assignment to y might violate the assertion for y in the following code:

! n16: @ >=0, @ < 2^16
! i7:  @.abs < 2^7
sample function (n16: x):
    y =: x / 2
    :> i7:y

Depending on the compiler, a compile time warning could be given if the user writes:

! int16: integer, @ > -2^16, @ < 2^16-1
advance integer (int16: x):
    :> x + 1
use it:
    advance integer 0.1
    advance integer -1
    advance integer 2^16  \ expect a warning

However, writing

advance integer 2^16-1

will often result in a fatal error message only, as the argument fits to the assertion, but the function body creates 2^16 and thus an overflow10.

Note that constraints are much more expressive than standard declaration, as the diligent programmer can state exactly the ranges of the numeric variables; a banking application would declare:

#maxSavings = 1´000´000´000  
! savings:
    @ >= 0                  -- 0 is also 0/100
    @ <= #maxSavings
    @.den = 100

telling that numbers with 12 decimal places for the integral part are needed and that they must be kept as rationals11. The compiler might use 64 bit integers and a fixed denominator of 100, or double floating point numbers and fix the fraction often enough.

If, however, the maximum above would read 10^12 or 1000´000´000´000´000 64-bit integers would be required.

Note that during calculation, more digits might be needed. As assertion violations are not errors, but programme failures, to sum up such values, either the standard built in arbitrary precision numbers may be used, and the result checked:

savings: add (savings: input) to (savings: balance):
    new = input + balance
    ? new > max Savings
        :> new fault item ....
    :> new

Or a new assertion must be used:

savings: add (savings: input) to (savings: balance):
    ! new < 2*maxSavings
        new = input + balance
        ? new > maxSavings
            :> new fault item ...
        :> new

As the compiler can determine that no overflow can occur with 64 bit integers, no overhead would be necessary.

This may seem to be unnecessary complicated, but allows to write programmes of high reliability, as the current situation tempts the programmers to ignore integer precisions. Even if the compiler is not smart enough and just uses arbitrary precision numbers, problems will be more easily found during testing.

Normally, declared assertions include the void value implicitly, as it is extensively used in the language. Global variables in particular are initialized with the void reference, thus it must be allowed. But a void reference is not the same as the number zero, most often a compiler will not optimize global variables that are numbers, although the values for NAN (not a number) for floating point numbers and the smallest number (i.e. 0x80…00) could be used for the void reference.

For strings, assertions could be used to limit the length of strings, when required:

! s7: @ .= "", @.count <= 100

To exclude the void reference, write:

! s7: @ ~= (), @ .= "", @.count <= 100

As the length of the string must always be kept additionally, a compiler might optimize s7: strings to occupy 8 bytes12.

A constraint may include other constraints:

! nonneg: integer: @ & @ >= 0

To make reading and writing easier, the : @ may be left out; just a constraint name is sufficient:

! nonneg: integer, @ >= 0

Declared assertions cannot be stacked, thus the following line is invalid:

a: b: x =: y        \ invalid

If x must obey a as well as b, a new constraint can be defined:

! ab: a, b
ab: x =: y

For the basic kinds of items, there are predefined constraints:

! void: @.kind = 'void'
! error: @.kind = 'error'
! boolean: @.kind = 'boolean'
! integer: @.kind = 'integer'
! rational: @.kind = 'rational'
! float: @.kind = 'float'
! string: @.kind = 'string'
! row: @.kind = 'row'
! map: @.kind = 'map'

The constraint name any is also supported, that denotes no constraint.

As the boolean expression may contain logical and as well as logical or operators, it is possible to describe a logical union of constrained elements.

If, e.g., an item can be an integer or a string, this can be expressed like this:

! IntOrString: @.kind = 'integer' | @.kind = 'string'

Using the above predefined constraint names, this becomes

! IntOrString: integer | string

To indicate that all elements of a row are integers is expressed as follows:

! RowOfInts:
    integer: @[]

Thus, the digraph @[ followed by ] denotes all elements of a row, and @{ } all elements of a map.

The digraph .* denotes all user defined fields that are not explicitly constrained, so we could require all fields to be constrained by:

! void: @.*

This effectively prohibits the free use of attributes, which may be very useful in some applications for high reliability. E.g., to limit a bunch to the fields .first and .other, use:

! limbun:
    any: @.first
    any: @.other
    void: @.*

The .* notation does not propagate, so we may have:

! limbunplus:
    limbun: @
    any: @.third

Note that a constraint for a specific element of a row or a map is not supported.

It is good practice to used tagged bunches in constraints, like in:

! exprNode:
    @.tag = 'exprNode'
    exprNode: @.next
    any: @.payload

To make the code easier to write and easier to browse, the digraph @' starts a string, that constraints the .tag attribute, as in:

! exprNode:
    @'exprNode'
    exprNode: @.next
    any: @.payload

Note that using localised strings as tags is considered very dangerous, as there is not yet enough experience with tags.

There is currently no field that returns the name of the constraint at run time, as constraints are conceptually treated at compile time only.

A language extension might drop the requirement for the scroll symbol @ in constraints.

Constraints may seem onerous, if an operation temporarily violates the constraints, in particular during creation of items. If this is not caused by badly defined constraints, one solution is to assign the item to an unconstrained variable, do the necessary changes, and copy the reference back to a constrained variable. This is a direct consequence of the fact that constraints are connected to variables, not to items. Using factory functions is the standard measure for problems during creation. Another solution is to lock the bunch, as constraints are not applied to locked bunches.

There is currently no means to define a constraint that a function is a scan function.

The return values of a function may be declared using constraints:

float: square root of (float: x):
integer: square of (integer! i):
!mixed: @= 'integer' | @= 'float' | @= 'rational | @= 'string'
void: print (mixed! x):

Assertions for Bunches (rows and maps)

Plain values like numbers and strings are not individualized by constraints, as they are immutable anyhow.

Bunches, however, can occur in nearly infinite variants, depending on the use of user attributes and the kind of references, if keyed. As long a bunches are processed by functions in the same module, the programmer may well keep track mentally the layout of the bunches used.

A library of functions that needs to keep context between invocations can either use small integers (handles) and a global variable with a reference to a bunch. Or it can create a bunch, return it to the caller, and expect to called again with such a bunch as argument. In this latter case, there is an inherent risk that the bunch is modified in a way not expected by the library.

One possibility is to set the .lock attribute to prevent accidental changes; however, the attributes are themselves not locked, so the bunches may still be changed.

maybe we need a mandatory constraint?

It is necessary and sufficient to use the prefix at the lexically first occurrence, but it may be used at any place, provided the same one is used at each place.

Note that a function assigned to an element of a bunch can access the bunch itself, called this in other languages, by the symbol @ (or, if enabled, this). Note the scroll is never part of a digraph, nevertheless, it is recommended to be used like any other variable name.

To declare and use a classical unsigned short integer:

! uint16:
    uint16 .= 0         \ integer: same kind as number 0
    uint16 >= 0         \ value not negative
    uint16 <= 2^16-1
 
increment (uint16:x):
    x =+ 1
    :> x

The programmer should expect that the programme fails with an assertion violation, because the compiler should silently add an assertion after the first statement to check that x still is within the constraints of uint16:.

Adding an element at the end of a row thus reads:

stack::append (val):
    @[] =: val

While variable aliases might be considered, just using variables is not a performance issue, but the use of aliases would be a source of error.

Weak and strong assertions

Assertions are very important to tell the compiler more about the items, and allow optimisations, as the above assertion tells the compiler that x is an integer with at most 4 decimal places, so the compiler can generate standard integer instructions.

There are weak and strong assertions, the latter written with two exclamation marks (!!).

Weak assertions can be disabled as runtime checks to increase efficiency, but likewise the probability of undetected errors. Strong assertions can not.

Assertions are used by the compiler to optimize code, thus, they should be used as early and frequent as possible.

If an advanced compiler can ensure that an assertion is not violated, it will not result in code, e.g.:

x =: 1
! x = 1

If a violation is detected at compile time, a warning may be issued.

If you are — unwisely — concerned with run time efficiency, you should use weak assertions and disable them after thorough testing13.

In particular, these should be placed as the first lines of a function body:

gcd (x,y):
  ! x > 0           \ x must be a positive integer
  gcd -5 y          \ may generate a compiler warning

5. Functions

5.1. Function definition templates

A function definition template has a colon (:, or the digraphs :-, :+ and :~, see below) as the last character of the line. Functions are global to the compilation unit, and thus start in column one.

The template is a sequence of words or parameters, the words separated by blank space, and the parameters are enclosed in parenthesis:

string (str) clip from (start) upto (end):

Thus, only blanks (and tabs) separating words are relevant, and counted as one blank. Also, as words must start with a letter and contain only letters and numbers, other characters except letters, digits and parenthesis plus the trailing colon are not allowed in a function template. The underscore character is not needed and thus is not considered a letter. However, small and capital letters are fully distinguished.

The templates must obey the following test using the template string:

The following templates are valid together:

anton berta ceasar dora : 
anton berta caesar :
anton berta ()     :
anton berta ()     ()   :
anton berta :
anton ()    caesar :
anton ()    ()     :
anton ()    :

See below for details on the matching process.

The different termination bigraphs are used as follows:

The bigraph :~ is used if header files (.tufh) are provided for libraries; the library is scanned for :+, and each template issued with :- instead. A comment block before is also copied to explain the function, so there is no manual maintenance of headers necessary. There is also planned a special include that only scans for exported templates, in cases where the creation of a .tufh file is not warranted.

While not absolutely necessary, it is recommended to separate parameters by words, and not only by white space, otherwise use a tuple:

copy (a) to (b):    \ good
copy (a) (b):       \ avoid: one-word function, two parameters in a row

In cases when the parameters are really exchangable, two parameter positions might be sensible:

gcd pair (x) (y):
    ? y = 0
        :> x
    :> gcd pair y x%%y          \ tail recursion should be efficient
print gcd pair 6 8

One-word only function templates without parameters are not supported due to their similarity to plain variables. It is more in the spirit of the language to uses two or more words (and not an underscore), as in:

next input:
    :> read from ...
x =: next input

One prominent example for a one-word function with one parameter is the print function to send a string (or number) to the standard (terminal) output:

print "Hello world."

The one-word functions in the standard library are:

count ()        \ deprecated, use .count
float ()
integer ()      \ deprecated, use trunc ()
join ()
max ()          \ use 'maximum of ()'
min ()          \ use 'minimum of ()'
print ()
quote ()
round ()
scan ()         \ deprecated
trunc ()

The compiler will issue a warning if a local variable name or parameter name may conflict with a one-word function.

The body of the function is defined by the following (indented) lines, until a line starts in column 1.

The parameters are local variables, which are filled with references to the items that are the actual parameters. In case the parameter refers to a bunch, the bunch can be modified, and the modifications become effective for the body of the function as well as for the caller.

A function always returns a reference to an item; this it void by default, if the last statement of the function body had been executed. Otherwise, the return operator :> must be used like in:

greatest common divisor of (x) and (y):
    ? x < y
        :> greatest common divisor of y and x
    ?* y > 0
        z =: x %%  y
        x =: y
        y =: z
    :> x

There is no restriction on the kind of return values, as these are references, but it is recommended to use the same kind (except void) in all cases, as otherwise the caller might need to check the kind before further processing. Of course, a fault item is a commonly used exceptional kind.

Any variable introduced within the body is a local variable, implicitly initialized with the void reference. Thus, within a loop, it can be checked for void to change the flow if it is already set, or not. Nevertheless, setting variables first, even with void, is good programming technique and without any performance loss.

The parameter mechanism is very efficient, as only references (pointers) are used for all items, so there is no reason to use a global variable for efficiency reasons.

Naming conventions for function templates

In particular for libraries, a certain structure in the function templates is useful. Local functions that do not conflict with other modules or libraries should benefit from the fact that the templates allow a natural wording:

calculate checksum for (fn) using (sumpgm):

In this manual, the wording and structure is used for English.

In general, the template should start with one or a few words telling the subject, i.e. an object oriented approach:

file stream (fd) flush           
row (row) sort ascending
string (str) locate (needle)

Systematically prefixing also parameters with a word describing the kind could be tempting:

string (str) locate string (needle)
string (str) times integer (ni)

However, this leads to duplicate words, if the parameters are function calls, as in:

n =: string tgt locate string string 'a' times 5
n =: string tgt locate string 'a' times 5

Also, when the constraint system is implemented, these indicaters would be superfluous.

Thus, only the leading word — often a noun — is used to indicate the domain of functions, in particular the kind of the first parameter.

If the result is boolean, the first word is is or has, and the check added after the parameter for better readability, although putting the parameter last would have its merits too:

is (x) float                
is string (str) ascii
has string (str) head (head)

Words are often used to indicate the following parameter, commonly used (although not strictly for better readability) are:

using (func)    function reference
size (int)      count
count (int)     size
length (int)    amount
by (x)          separator
from (int)      start point
upto (int)      end point forwards
downto (int)    end point backwards

If the result is of a different kind, the word as is used, in the sense of convert to:

string (str) as integer:
string (str) as float:
integer (int) as string:
float (flt) as string:
rational (rat) as string:
boolean (b) as string:

with more variants, e.g. indicating the base:

string (str) as integer base (base):
string (str) as integer else (def):

Some functions have aliases, in particular if the function has only one parameter, that can be moved to the end:

as integer string (string):         \ string (string) as integer
copy row (row):                     \ row (row) copy

If used often, the item kind word is no longer present:

as integer (string)
as float (string)
as string (number)

These might even be generic functions, where the kind of the argument is dynamically determined:

    as string (x):
        ? is (x) integer:
            :> integer (x) as string
        ? is (x) float:
            :> float (x) as string
        ...
This particular function is normally not used, and the number
just catenated with a string.

Most often, they are not generic for the sake of clarity; the less often called one can still use a word:

as integer (string):
as integer float (float):

There are two generic scan functions for rows, maps and tuples:

give keys (bunch):
give values (bunch):

The first one has reduced value, as the programmer must anyhow know whether the key is for a map or a row (tuple). The second one is useful, e.g. in the join (mixed) functions.

To avoid confusion, the words from and to are normally used for ranges, and take, draw and as etc. used for changes.

Sometimes the word as is omitted when the context is clear and it is not a conversion in a narrow sense:

command (cmd) lines
map (map) keys

A plural normally indicates a row of, i.e. instead of map (m) row of keys.

For the basic substring operations, aliases use clip as first word:

clip (src) from (first):
clip (src) from (first) length (len):
clip (src) from (first) upto (last):
clip (src) upto (last)

Using bunch functions when implemented, the tail of a template can be attached to a variable:

m =: new matrix sized 3, 5          \ tags the bunch with 'matrix'
?# tup =: m.:give keys              \ calls matrix (m) give keys
    m[tup] =: tup
m.:print                            \ instead of map (m) print

As the compiler will anyhow try to find a template that has after the first parameter the tail locate () etc. as in the following lines:

p =: str.:locate ';'
wrd =: str.:from 1 upto p-1
i =: str.:as integer

it may be enabled to find the function string () locate () and check if the item str has the kind string. Thus, the systematic forms may be preferred then.

Except the basic number scans, scan functions are marked by the word give:

?# i =: row r give keys
    ...

For formatting numbers, some of the traditional formatting functions from the standard C library (see respective documentation) are available by their library name instead adhering to the above conventions, see Formatting.

Export and import of functions

Functions templates ending in a single colon are local to the file in which they occur. To make them accessible outside, i.e. export them, use the digraph :+ instead of the single colon at the end:

convert (x) to something :+
    ...

To use a function exported from another module, just the template is given, ending in :-, in which case no body may be supplied. This could be regarded as an import declaration, and the compiler may allow them several times, if the templates are the same.

It is used to define the prototypes for library functions, without appending the source code (which is impossible for some basic library functions anyhow). There is — for clarity — no overloading (masking out) of imported library function templates. Thus, all external functions of a library (and also of the libraries used) cannot be redefined.

This means that extending an existing library by new functions might break existing code, and in such a case, a new version with a different library name shall be created, which is good engineering practice anyhow. Of course, using the object oriented mechanism of bunch functions allows to define new functions with a significantly smaller collision probability, even if it is still possible, as bunch functions just remove writing effort in composing function templates.

There is a set of functions required by the runtime system, plus a lot of commonly used functions defined in TUFPL_baselib.tufh, namely the functions in the runtime, string functions, functions for file acccess and mathematical functions. Some often used modules, e.g. crypto, hashmap, and for multi-dimensional operations are added for the collection TUFPL_stdlib.tufh. More seldomly used libraries, like readline and XML parsing, are kept separat and must be included individually.

The modularisation for TUFPL_stdlib.tufh and TUFPL_baselib.tufh may change in near future, if the structure is reorganized.

The compiler might accept :~ instead of :- for functions that are deprecated and issue a warning once the function is used. The convention to mark deprecated functions for the header generation currently is the comment starter \(~.

Note that deprecated attributes are currently determined by a list in the compiler source.

Function aliases

As the linker alias feature can only be used if the parameter list is the same, function aliases are always written as one-line-return-only bodies:

clip (s) upto (i):
    :> string (s) clip upto (i)
is character (c) in string (s):
    :> has string (s) character (c)

If ever this is a measurable perfomance problem, the compiler shall detect the case (similar to tail recursions) and generate appropriate code.

The bigraph := is nevertheless reserved for the purpose of alias declarations, whatever the syntax may be.

5.2. Function call

A function is called if its template pattern matches, parenthesis are neighther required nor used (normally):

sqrt (x):
    ...
r =: sqrt 2
move (a) to (b):
    ...
move 'aa' to s

Matching (part of) an input line is done in a straightforward scan without backtracking, i.e. alternatives are not probed.

In particular:

Template matching has highest priority in scanning an expression, but once an expression scan is started, it continues as far as possible, including function template matching:

x =: sqrt 2 + 5         \ = sqrt (2 + 5)
x =: sqrt 2 + sqrt 5    \ = sqrt (2 + sqrt 5)
y =: (sqrt 2) + sqrt 5  \ = (sqrt 2) + (sqrt 5)
y =: sqrt sqrt 16       \ = sqrt (sqrt 16)

Originally (and via a compiler option) expressions were possible only if a parameter was the last element of a template; within only terms were allowed, i.e. variables, array or map references and variables with attributes. Expressions had to be enclosed in parenthesis. But then a scan with a simple expression would be:

?# ip =: from (sp-1) downto row.first

The current solution is slightly more complex to understand, but needs far less parenthesis on average. For clarity, parameters may still be enclosed in parenthesis, as redundant parenthesis are simply ignored.

If there are templates that are the same until a parameter postion in one and a word in the other one, matching the word has precedence over using a variable with the same name; othewise, the second function would never be matched:

pr (x) nl:
    print 'not matched: ' __ x
pr nonl (x):
    print 'matched:' __ x
nonl =: 'nonl'
nl =: 'nl'
pr nonl nl      \ matches pr nonl (nl)
pr (nonl) nl    \ parenthesis needed

As shown, enclosing the parameter in parenthesis forces it an expression and thus matching the parameter position. The compiler will issue a warning message for the other line that a local variable is masked by the template matching; this message can — with the current compiler — only be avoided by renaming the local variable.

Another example for the same error message (there is math () exp ()):

sqrt =: math sqrt 1.5

5.3. Function returns

A function always returns a reference to an item; if none is provided, void is returned.

To return immediately and to supply a return value, the digraph :> is used:

give me five:
    :> 5

In particular if a function provides the return value, it could be remembered as continue with:

greatest common divisor of (x) and (y):
    ! x > 0, y > 0
    ? x < y
        :> greatest common divisor of y and x
    ? y = 0
        :> x
    :> greatest common divisor of y and (x %% y)

In contast to other statements, the expression may also be a boolean expression.

Note that the language implementations check that a function call that is not part of an assignment does not throw away values: If the returned value is not void, a fatal runtime error will occur. Of course, if the return is assigned to a variable and then replaced, the value is lost without notice.

5.4. Function references

Bunch functions should be preferred in order to dynamically select functions; however, the function prototypes must be made available to the compiler at the point of call.

To allow more flexible dynamic invocation, in particular at places where callback functions are required, references can also refer to functions.

Function references can be plain references to functions defined via templates, or refer to anonymous functions, see below.

A function reference is created by enclosing the template in backticks (acute, `):

funref =: `(compare () and ()`
bunch.sort =: `sort record (a) less (b)`
sort map x using `compare funky () with ()`
 
noop =: ``      \ function that immediately returns void

Single Words in the parameter parenthesis' are allowed and ignored, so the prototype can be copied verbatim.

There are no operators the can be used on a function reference, so the only places to use are the complete right hand side of an assignment and as a parameter in a function call. Of course, the template must match one of the templates defined.

To call a function reference, the digraph :( (colon followed by opening parenthesis) is designeted, followed by a tuple of parameters, closed by a normal parenthesis ):

x = 3 + funcref:(a, b)
? bunch.sortcall:(x, y)

If the number of parameters is different to the function called, a fatal runtime error occurs.

Instead of using the symbolic notation, which might never be implemented, library functions can be used for upto 3 parameters:

call function (func) with (parm):-
call function (func) with (parm1) and (parm2):-
call function (func) with (parm1) and (parm2) and (parm3):-

As often callbacks may be void, these library functions do nothing and return void if (func) is void.

If more parameters are required, use one parameter and a tuple:

call function funcref with 2, 3, 5, 7, 11, 13, 17
funcref:( (2, 3, 5, 7, 11, 13, 17), )       \ trailing comma required

Note that references to functions are not unique, in that each function reference creates a new item. Nevertheless, they may be compared for equality (only), and the comparison is true if both refer to the same function address and have the same number of parameters.

Note that the .count attribute is available for function references and returns the number of parameters.

As an exit function reference is immutable, it does not have fields. Thus, an exit function registration is not totally simple. Any exit function is called only once, if it is registered again, only the parameter reference is updated:

on normal exit call (funcref) with (parm):
    funcs =: get row of exit functions      \ always not void
    ?# fr =: give funcs values              \ scan skips void
        ? fr.funcref = funcref
            fr.parm =: parm
            :>      \ was a redefinition
    fr =: []
    fr.parm =: parm
    fr.funcref =: funcref
    funcs[] =: fr

If the main function returns, the functions are called in reverse order:

call exit functions:
    funcs =: get row of exit functions
    ?# i =: from funcs.last downto row.first
        fr =: funcs[i]
        ? fr ~= ()
            fr.funcref:(fr.parm)        \ call with parameter

Note that this happens only on normal exit, i.e. return from the main function, to avoid exit problems.

Maybe at some later time there will also exit functions for the premature termination.

5.5. Bunch Functions

To bind functions to bunches, possible mechanisms are:

The first case has already been explained above.

The second case uses tagged bunch functions that are common to all bunches with the same tag, similar to what is called class functions in object-orientend programming languages, while field functions are individual, not bound to a tagged bunch, and must be assigned to each newly created bunch.

Tagged bunch functions use the .tag attribute to give a bunch a (class) name.

An example not using bunch functions might read:

\ early style for setter
set attr of mybunch (this) to (val):
    ! this.tag = 'mybunch'
    this.attr =: val
\ consolidated oo-style for getter
mybunch (this) get attr:
    ! this.tag = 'mybunch'
    :> this.attr 
\ use the above templates 
    mb =: {}
    mb.tag =: 'mybunch'
    set attr to 5 for mybunch mb
    ! 5 = mybunch get attr

The object-oriente-style is the one suitable for bunch functions:

mybunch:: get attr:
    :> @.attr
mybunch:: set attr to (val):
    @.attr =: val
\ use the above
    x =: {mybunch}
    x.:set attr to 5
    ! x.:get attr = 5

The double colon :: in the function template delimits the tag from the rest of the template, which is the core template. The tag may be a sequence of words without blank space, delimited by underscores or points; this is a restriction for tags used here. Within the body, the scroll symbol (@) denotes the extra parameter which refers to the actual (this) bunch.

Calling a bunch function also uses the digraph dot-colon .: after a variable name (note .( is a field function call) or an expression. An extended template scan determines all applicable core templates and their associated prefixes. If it is only one, it is just used with the variable or expression as first parameter. Otherwise, a dynamic determination is generated like in:

x @= 'mybunch'
    mybunch x set attr to val
|? x @= 'yourbunch'
    yourbunch x set attr to val
|   
    \ assertion violation generated

Bunch functions can be grouped in a block (which is functionally the same, but requires less typing and is thus less error prone), starting with the tag in column 1 and a trailing double colon:

tag_name::
    get attr:
        :> @.attr
    set attr to (val):
        @.attr =: val
    is attr equal (val):
        :> @.attr = val
\ using it the above
x =: {tag_name}
x.:set attr to 5
! x.:is attr equal 5

As an example, calling code for the Cairo graphic library without bunch functions might read:

print header (text) using (ct):
    cairo ct set font size 15
    cairo ct select font face 'monospace' 
    cairo ct set source r 0 g 0 b 0 
    cairo ct move to x 220 y 40
    cairo ct show text text

With bunch functions, it is written as:

init:   
    ct =: cairo new context for aSurface  \ factory function
print header (text) using (ct):
    ct.:set font size 15
    ct.:select font face 'monospace' 
    ct.:set source r 0 g 0 b 0 
    ct.:move to x 220 y 40
    ct.:show text text

In this case, the templates for the library binding are e.g.:

cairo.: set font size (num):
cairo new context for (surface):

So the possible tag names are collected from the function templates; the compiler will match the core template set font size () and find out the tags supplied for these templates.

The constraint system may help to find out the tag of a variable, and then omit the above selection code.

A function created as bunch function always starts with an assertion to match the tag (unless optimised out, e.g. by constraints).

Using the constraint system via an assertion, any ambiguities could be eliminated:

print header (text) using (ct):
    ! ct @= 'org.cairographics' 
    ct.: set font size 15
    ct.: select font face 'monospace' 
    ct.: set source r 0 g 0 b 0 
    ct.: move to x 220 y 40
    ct.: show text text

Instead of writing ct.: on each line, a block can be used:

print header (text) using (ct):
    ct.:
        set font size 15
        select font face 'monospace' 
        set source r 0 g 0 b 0 
        move to x 220 y 40
        show text text

As the bunch function mechanism is not bound to a type system, the class of org.cairographics functions could be (locally) extended without requiring something like inheritance, by writing:

org.cairographics::print header (text):
    set font size 15
    select font face 'monospace' 
    set source r 0 g 0 b 0 
    move to x 220 y 40
    show text text

Of course, an own class could be used, where the bunch contains a field that refers to the cairo item:

drawing::initialize:
    @.ct =: cairo context for @.surface
drawing::print header (text):
    ct =: @.ct
    ct.:
        set font size 15
        select font face 'monospace' 
        set source r 0 g 0 b 0 
        move to x 220 y 40
        show text text

The user needs not to know the tags for a library function, as it is derived from the template declaration (and so normally shown in the library documentation).

While so far only allowed for mutable bunches that can be tagged the .: operator can be extended to look for any prefix in the template tree, and use .Kind instead of .Tag.

Then, a function definition like

string (str) as integer:
    ....

would be matched by:

i =: str.:as integer

Consequently, the definiton:

tag name::core template:
    ...

is considered as a shortcut for:

tag name (@) core template:
    ....

with the extra effect that the special this variable @ is available in the body.

There will probably no desire to use the scroll instead of this in a function template, but instead to allow a name for this in the bunch function template, using the bigraph :( instead of the opening parenthesis for the first parameter:

drawing :(this) print header (text):
    ct =: this.ct
    ct.:
        set font size 15
        select font face 'monospace' 
        set source r 0 g 0 b 0 
        move to x 220 y 40
        show text text

Once constraints are implemented, the constraint could be used to identify the function to be called:

(string:s) as integer:
    ...
! stringrow: string: @[]
main (stringrow: p):
    i =: p[1].:as integer

Constructors and destructors

Constructors and destructors are defined here only for completeness; factory functions should be used instead. In order to indicate that a tagged object with a constructor is created, the syntax to create tagged bunches is extended:

x =: {!tag}
y =: [!tag]

Alternatively, creating a bunch with a word as a tag could always require a constructor.

Constructors and destructors can be defined by using the digraphs +: and -: as function templates:

! stack:                    \ constraint declaration (optional)
    @.tag = 'stack'
    @ .= []                 \ is a row
    @.top .= 0              \ has the field top as integer
 
stack::
    +:
        @.top =: 0          \ make it an integer
    -:
        ! @.top = 0         \ just for demonstration, not sensible
    push (x):               \ the row expands automatically
        @.top =+1
        @[@.top] =: x       \ do not use @[] = @[@.count+1]
    pop:
        ? @.top = 0
            :> ()
        rv =: @[@.top]
        @.top =- 1
        :> rv
    top:
        :> @[@.top]         \ return void if exhausted
    clear:
        @.top =: 0
    copy:                   \ the copy is condensed
        ns =: []
        ns.tag =: @.tag
        ?# x =: @.scan
            ns:push @[x]
        ns.top =: @.top
        :> ns
new stack:                  \ a factory function
    :> [stack]              \ tagged as 'stack', with constructor
\ using the above:
st =: new stack
st:push 5
! st:pop = 5
st =: ()                \ destroy stack

Constructors and destructors do not have parameters and must return void, as it is discarded. If it is required to have constructor parameters or to return fault items, separate initialisation and termination function must be provided, often in conjunction with factory functions. Normally, a destructor will contain only assertions to assure that no valuable information is lost.

A small inconvenience is that the user must know whether to use a row or a map in creating a tagged bunch with bunch functions, so factory functions will often be used.

Factory functions

Factory functions must also be used if parameters would be required for constructors. They can use initialisation functions from the bunch for most of the work.

Such an example is a queue:

new queue with (n) entries:
    rv =: [queue]
    rv.:setup queue of length n
    :> rv
 queue::
    setup queue of length (n):
        @.in =: 0
        @.out =: 0
        @.filled =: 0   \ allows to use all cells
        @[0] =: 0       \ ensure first key zero
        @[n-1] =: n     \ allocate space
        ! @.count = n
    is empty:
        ? @.filled = 0
            :> ?+
        :> ?-
    is full:
        ? @.filled = @.count
            :> ?+
        :> ?-
    enqueue (x):
        ? is full       \ template with same tag has priority
            :> new fault item with code 1 and message "Queue full"
        @[@.in] =: x
        @.in =: (@.in+1) %% @.count
        @.filled =+ 1
    dequeue:
        ? is empty
            :> ()
        rv =: @[@.out]
        @.out =: (@.out-1) %% @.count
        @.filled =- 1
        :> rv
use it:
    q =: new queue of length 100            \ factory function
    ! q.:dequeue = ()
    q.:enqueue 5
    q.:enqueue 'hallo'
    ! q.:dequeue 5
    q.:enqueue 3.5

In general, factory functions should be used instead of creating a tagged bunch with the correct tag, because:

6. Standard library

6.1. File I/O

There are currently two sets of library functions:

Often, text files are sufficient; this interface consideres the file separated into lines by linefeed characters. Lines are often obtained by a scan function; or the whole file is read into a row. If lines are written, the linefeed character is automatically appended to each string that is issued over this interface. As strings can be very long (upto 2 Gigabytes), even texts without line feed characters may be processed this way. However, zero bytes are not possible; which is a limitation of the underlying streams library, not one of TUF-PL strings. Furthermore, the characters are preferrably encoded as UTF-8, which includes pure ASCII, both without zero bytes. Other encodings are possible, but not easy to handle.

When input is a terminal, input is considered a line even without a linefeed character, if the end of a line is determined by other means.

To access binary files, ???chunk must be used.

Similar to the text file interface, pipes can be used to send lines to a command or receive lines from a command, with the same condition as for files.

6.2. Network I/O

Basic I/O

The basic network interface is similar to the access of text files: Lines of text — terminated by linefeed characters — can be sent and received, either per TCP or UDP. The library handles long strings in a reasonable way and may deliver strings as lines even if there is no explicit linefeed character. As with files, every string sent will be converted to a line, usually by adding a line feed character; and on receive, partition the input into lines, in particular if the separated by line feed characters.

As strings may contain zero characters, and the used interface transparently passes zero bytes, the sent and received lines may contains such beasts.

Note that the basic I/O is suitable for a HTTP library programmed in TUF-PL, even if some string functions are system library functions for efficiency.

Forking subprocesses

Nothing special has to be done here; just the two basic library routines

process fork
process wait

are provided, functioning as in any POSIX environment. Of course, the newer more flexible system calls may be provided by the standard library.

Standard library function examples

Here, only a few functions are sketched; see the separate documentation on the TUF standard library.

6.3. Arithmetic functions for floating point numbers

math sqrt (x):
provides the square root of a number as floating point number, exact numbers are accepted and converted to float

6.4. Arithmetic functions for integers (and exact numbers)

greatest common divisor of (x) and (y):
returns the greatest number that divides both, x and y

6.5. String functions

Instead of substring, which is regarded as a legacy from one-word function names, the term clip string is used.

clip string (s) from (first) to (last):
return a copy of the string between the characters between the integer positions first and last, counted from 1; if first is greater than the length, return the empty string; if last is greater than the length, return upto the end. if last is less first, return the empty string. if first or last are less 1, raise a run time error
clip string (s) from (i) length (l):
:> clip string s from i to (i+l-1)
clip string (s) from (i):
:> clip string s from i to s.length
clip string (s) upto (i):
:> clip string s from 1 to i
split string (s) to row of characters:
returns a row, keyed by 1 upto s.length, where each element refers to a sting containing just the i-th character. Remember that short strings are stored as efficient as integers.
split string (src) by any of (delim):
split src into a map of strings using any one of the characters in delim as delimiter. If two delimiters follow directly, an empty string is assumed.
split string (src) by many of (delim):
split src into a map of strings using as many of the characters from delim as delimiters as possible. Thus no empty strings are found. Useful for scanning for white space with e.g. &tab;
enumerate string (s) as characters:
returns all characters of s as one-character strings in a loop
enumerate string (s) as integers:
returns all characters of s as integer code numbers in a loop

Because there is no single character type, character classes are better supplied by matching from a starting point (see man strspn under Unix); interfaces to obtain character tables, in particular for localized strings, may be added:

count in string (src) any of (accept) starting at (start):
returns the number of characters in the string src, starting at start, which consist only of characters from accept. If none is found or start is larger than src.length, void (not integer 0) is returned.
count in string (src) none of (reject) starting at (start):
returns the number of characters in the string src, starting at start, which consist not of characters from reject. If none is found or start is larger than src.length, void (not integer 0) is returned.

Note that if both functions are called in sequence with unchanged parameters, either both return void, in which case start is greater than the length, or one returns an integer greater zero. In other words, if start is less or equal the string length, and the first returns void, the second will not. So in the following code, the loop will always terminate, as the key is incremented by at least one each iteration.

Using these, the above word split could be written in TUF. If the source starts or ends with delimiters, an empty string14 is stored as the first resp. last element of the result. Consequently, a string with only delimiters returns a row with two empty strings. Note that the number of elements in the result string is always one more than the number of delimiter chains, thus the key is advanced with every delimiter string.

split string (src) to row using delimiters (dlms):
    res =: []
    ? lengthener < 1
        :> res
    idx =: 1
    res[idx] =: ''              \ if starting with delimiters
    start =: 1
    ?* start <= lengthener
        n =: count in string arc any of elms starting at start
        ? n ~= ()               \ delimiter chain found
            start =+ n
            idx =+ 1
            res[idx] =: ''      \ to be overwritten except at end
        n =: count in string arc none of elms starting at start
        ? n ~= ()               \ non-delimiter chain found
            start =+ n
            res[idx] =: part of string arc from start length n
    :> res

More string functions:

split string (s) by string (t):
split string (s) by reg exp (r):
match re (r) in string (s):
:> match reg exp r in s
match reg exp (r) in (s):
returns () if the match fails; otherwise returns a row with three elements: the string before, the matched string, the string after.
find in string (str) regular expression (re):
split string (str) using re (re):
returns void if the regular expression is not matched. Otherwise, returns an row with five attributes: - start: integer key of first character that matches - length: integer key of last character - before: string of the part before the match - match: string from the source that matched - length: length of the part that matched If groups were present, the bunch is a row, having pairs of keys (starting at 1) of key of character key and length of each group.
Note that the length may be zero, if the empty pattern was matched. first and last character of the group.
position of string (s) in string(t):
Find first occurrence of s in t; if found, return the key (starting with 1); if not found, return void (not the integer 0).
substitute (r) by (u) in (t) once:
? (s, l) =: match r in t t =: substring of t upto s-1   u   substring of t from s+l :> t

Returned is the number of characters

6.6. Conversion to and from numbers

integer from string (s):

6.7. Formatting

For converting numbers to strings, pattern strings called formats are often useful.

Well known and mastered by many programmers is the standard C library function printf and the like15.

The formatting function using the C standard library is:

cformat (format) number (num):

This standard C library conversion cannot handle big numbers, use integer (x) as string... or as string (x).

Due to the problem of creating (not using) variable number of argument function calls, only one number is supported, and no error checking is done.

Better us as string(num) and string manipulation functions for the desired output.

To convert an integer to hexadecimal digits, there are two possibilities:

print cformat '#02x' number num
print pad left (integer num as string using 16) to  2 by '0'

The bigraph #% operator is reserved for a TUF-PL specific format; it applies a format string to an expression or expression list:

print ' ###,###.##' #% math sin 1.0
\ instead of
print format ' ###,###.##' num math sin 1.0

The scanf function is not available; it could provide a tuple of items, but extensive analysis of the format string would be required. Futhermore, procsssing input line by line, splitting it into parts and then convertin the parts has been proved to be clear and efficient.

6.8. File functions

See list of library functions.

6.9. File execution

See list of library functions.

7. Various

System variables and parameters

The runtime system creates an all-global variable $SYSTEM (similar to $COMMON) with some preset fields:

A compiler version is not available, as different libraries may have been compiled with different compiler versions.

This variable $SYSTEM is permanently locked as the fields must not be modified by the running program.

There are two commandline options that are often used during program development:

--debug     switch debug on initally
--sysinfo   Runtime version and usage statistics (to stderr)

For convenience, the options can be set and obtained dynamically by a set of functions:

debug enable:
debug disable:
is debug enabled:
system info enable:
system info disable:

Normally, the runtime system is built such that the function

apply runtime options (parm)

is called immediately before passing control to main (parms), and providing a copy of parms with both options removed, while the original parameters are still available under $SYSTEM.parms. Thus, to nullify this feature, write:

main(parms):
    parms =: $SYSTEM.parms
    debug disable
    system info disable

Program and shell invocation

There are library functions to execute a shell command; these can be disabled if the environment variable TUFPL_NO_SHELL is set upon invocation to any string; its effect cannot be reverted. Attempts to use the functions if disabled is a fatal runtime error

The exchange of the current program image with another one (exec) is inhibited by TUFPL_NO_EXEC, which also inhibits shell invocation. Any attempt to use is a fatal runtime error.

There is no use to hide these environment variables from malicious programs; if a malevolent user can start a program once, he can do so twice.

Printing

The streams stdin, stdout and stderr are automatically opened and closed by the runtime, so that standard input-output functions can be used for console output and the like.

Because this is not very handy, a group of functions provide a shorter way and automatic conversion of numbers to strings:

To send a line to standard output, the function print (x) can be used. Strings are sent unchanged; numbers and booleans are converted to strings, all without leading or trailing blanks. The conversion uses the same format as the string operator _, it is quite helpful to compose an output line this way.

Instead of a single parameter, a list can be used:

print a, 1, a + 1

All others are rejected as a fatal runtime error, but see debug dump (x) to show any item on error output.

Using the library join (list) by (sep) , a different separator can be used:

print join a, 1, a + 1 by ';' 

If the output should be the standard error stream, use error print (x) instead.

As a debugging aid, the function debug print (x) suppresses the output unless the debug flag is set. Note however, that the function is called normally, including evaluation of argument(s), so inside heavily used loops its better to use:

? is debug enabled
    print ....

To aid debugging, any item can be made legible (sort of) and written to standard error, mostly in several lines, by:

debug dump (item)

Additionally, debug print this location prints the current source code function template and line number (within the file), and debug print stack trace a stack trace.

As file output normally adds a newline at the end, the same is true for the print variants that are convenient to provide console output. So in general, a complex line would be composed by string catenation, i.e. _ and =_, and then emitted.

This is sometimes cumbersome; the often used solution that the programmer must provide newline characters explicitly is not systematic, thus error prone and also onerous in many situations.

The functions to print without trailing linefeed have an old and new version:

and similar for the other print variants.

If a carriage return is desired, it must be written by &cr;.

Tail recursion

Mathematicians are much trained to reduce a problem to a chain of already solved problems, and thus often use recursion.

In particular, a fully recursive solution for the greatest common divisor of two numbers is:

gcd (x) and (y):
    ? y = 0             \ mathematical definition of gcd
        :> x
    :> gcd (y, x %% y)      

There is no need to do a real recursion and to save x and y when calling gcd recursively, because the values are discarded immediately on return. this situation, also called tail recursion, allows just to set the parameters anew and start over as a loop, effectively creating

gcd (x) and (y):
    ?*
        ? y = 0             \ mathematical definition of gcd
            :> x
        \ :> gcd (y, x %% y)
        x, y =: y, x %% y

Tail recursion, i.e. calling a function inside its body, is not too difficult to detect, and is presumed to be fairly easy to implement, thus the programmer might use it freely.

8. Libraries and Modules

Libraries

Libraries are essential for the use of a programming language.

Some very basic libraries for file input-output etc are part of the core language. They use the special system item, of which only a reference may be passed and copied; changes are by system functions only. Core library bindings use a set of a few (about 10 of 255) subkind codes to protect against improper used.

The remaining subkind codes are dynamically shared by other bindings. The runtime system maintains mapping of identification strings, like de.glaschick.mylib or org.cairographics.surface:

byte map_system_foreign_subcode(const char* id);

It is just a table of strings; if the string has already be used, the code byte will be returned; if not, it will be remembered and a not yet used code returned. An entry is never deleted until the final end of the program. This allows about 200 foreign libraries used by a single program to coexist, which should be sufficient. {There are 16 bits on a 32-bit system assigned to kind and subkind, using 8-bit bytes, leaving room it really this language is so prominently used that this becomes a problem.).

Modules

Modularisation in TUF-PL is — inherited from C — tightly bound to source code files; a function can be internal or visible from the outside.

To inform about functions in other modules, there are two methods:

If the filename starts with a slash or with ./, it is used as is, i.e. as absolute or in the current directory. Otherwise, i.e. starting with a letter, digit etc, a standard search path is used. The syntax ~user/... and ~/... may be used too.

No variables may be shared by different source files, i.e. modules.

In either case, the name of an already compiled library to be included by the linker may be indicated by the pragma \$, mostly used inside the files included by \^. This may not be possible on some systems; it depends on the linker calling mechanism.

In particular, the line is copied to the C source code, with \$ replaced by //$$. The script to compile and link the main function will just search for lines beginning with //$$, and supplies the rest as additional linker options. For TUF files, this may be the corresponding object file, e.g. \$ mylib.tuf.o. In particular, interfaces written in C might require extra libraries; e.g. the source file TUFPL crypt.c may not only need its object file, but also -lcrypt. Thus, the source code shall contain

//$$ -lcrypt
//$$ TUFPL_crypt.o

Of course, the header extraction program defextract has to copy both forms to the header file.

External C functions

As TUFPL compiles the source into a C (or C++) function, linking with functions written in C is possible, provided that these functions behave like a TUF-PL function. This is fairly easy; just a bouquet of auxiliary C functions must be used to obtain the content of items, create new items, etc. To allow functions to pass memory to later calls via items, a special kind of opaque or foreign item is provided, that may be freely passed and copied by TUF-PL code.

Modules

TUF-PL has been created as a precompiler for the C programming language. Thus, a module is a compilation unit that will result in a binary machine code object, that can be arranged in a library etc.

While in C, every function is exported, unless marked as local, TUF-PL treats every function as local to the compilation unit, unless marked as exported.

There are several flavours of modularisation:

Pure binary:
The main programme is compiled, and modules exist as binary machine code objects, bound together by a linker, before the main programme is executed as binary machine programme.
Purely interpreted:
The main programme is interpreted, all modules are read as source code and treated as included verbatim, except some basic library routines, that are resolved by the interpreter
Mixed:
The main programme is interpreted, and modules exist in a precompiled form.

Pure binary modules

In its simplest form, the file to be imported is just scanned for exported function templates and import pragmas. No code is generated; just the proper declarations for the C code issued. It is left to the development environment (work bench) to compile and link the binary objects accordingly.

In order to assist the work bench, the path names that are imported are printed by the precompiler, which might be scanned and used to prepare link commands.

The imported file may well be the sourcecode; while in import mode, the precompiler will just scan for function templates marked exportable and import pragmas. No detached header files are necessary, a mostly efficient enough solution for compilation modules in a project.

In case of libraries, the files imported may well contain only the the templates, thus be equivalent to header files used elsewhere. This may be a single header file collected from all the modules in the library, using TUF-PL syntax, together with the \$ pragma to indicate the path to the library, which is printed too.

If a library — including the runtime system — is not written in TUF-PL, a header file should be created because - packaging is required anyhow - the header reflects the library

Recommended extensions are .tuf for TUF-PL source code, and .tufh for header files that are not meant to be compiled to object code, in particular to provide a library interface.

Message translation

Strings in (double) quotes are translated by the runtime system, if a translation mechanism is provided; otherwise, they are used as-is.

The elementary mechanism is to set a translation map:

set string translation map (ary)

The creation of the map may be done by map literals or reading configuration files. The parameter must either be a map or void, in which case the translation is suspended.

Currently, string translation takes place if the string literal is used in the program flow. This is rather efficient, as just another item reference is used from the translation map.

However, the value of a string literal depends on the situation when it was created, i.e. assigned or otherwise used, as in the following example:

\ translation is done when the literal is dynamically encountered
arg =: "argument string"
print arg
set string translation map cv
print "argument string"
print arg

which will — somewhat unexpectedly — print:

argument string
Uebersetzte Zeichenkette
argument string

It might be less error-prone if the translation takes place when the string contents is finally used, but requires more processor time.

Instead of a translation map, an interface to GNU gettext() may be provided, possibly by allowing to register a callback for translation.

9. Object oriented programming

Introduction

There is a saying that good programs can be written in any language, but in some it is easier. Similar, object oriented programming can be done in any language, but some make it easier. The reverse is also true: bad programs can be written in any programming language, including those that only allow good programs from an object oriented design.

TUF-PL is hopefully a language that allows good programmers to write good programs, provided that the programmer does not assume that the language will automagically correct his errors.

Instead of complex rules governing many features, a simple language with a few rules, where complex issues are to be code explicitly, is deemed more secure: A single sharp knife is in several cases better than a set of specialized cutting tools.

If someone ever has written documentation for a project that uses OO language with classes and inheritance including in-depth reasoning why a certain interface is correct and works, will know that this is as complex as with a simpler language. Assuming the the OO language needs less documentation is a failure that might be the reason why software still needs frequent updates.

That strong typing is not necessary show languages like Python, and TUF-PL is a language that supports both via the extended assertions.

Furthermore, object oriented design is what it says, a design method. Thus, the major questions are:

The most advanced answer to the latter question is to use design tools and leave the mapping to a particular language to the tool, avoiding a lot of errors early. If the first question is positively answered for TUF-PL, the second is trivially true too.

For implementing a design that used the object oriented approach, none of the special features described next are really necessary. Bunches can be used to represent objects, and assertions can be used to not only write down, but also verify pre- and postconditions on the objects. It admittedly requires discipline, which is quite possible compared to the discipline and careful planning hardware engineers must obey.

As bunches provide:

the first question is assumed to be true.

Note that constraints are not applied to bunches that are locked.

How to do object oriented programming

Coding based on an object oriented design in TUF-PL might use — as shown in the libraries provided — the following design pattern:

Although the word new is not an operator, factory functions usually start with the word new (or make), followed by the object name, followed by a trailing template. The documentation processing ignores leading word new in creating the sorted key. For example:

new hashmap with (n) slots:
    this =: [n]
    this.tag =: 'hashmap'
    ...
    :> this
hashmap (this) get (key):
    ....

No conflicts arise, e.g. with

file stream (fd) put line (line)
command stream (fd) put line (line)

The template matching is quick and straight forward, and the function templates are verbose anyhow.

In traditional programming languages that use single words for function names, an underscore (or CamelCase) is often used, e.g.

file_stream_put_line(fd, line)
command_stream_put_line(fd, line)

In particular together with the traditional functional notation, this is hard to read, even if underscores and camel case is combined:

fileStream_putLine(fd, line)
commandStream_putLine(fd, line)

and it is much more readable to write:

fd = fileStream("my file");
fd.putLine(str):

which also allows to change the object, provided this does not create subtle errors.

In TUF-PL, the word new is not an operator, and factory functions must be used . To call a function logically associated with a bunch, a double colon is used:

fd =: new file stream ...
fd.:put line str

If a compiler supports this notation, a more complicated template matching is used, that starts after the first parameter (in the template) and matches the rest of the template. If the match is unique, the function is found; otherwise, for each of the leading words, a test is made, like in:

? fd @= 'file stream'
    file stream fd put line str
|
    ! fd @= 'command stream'
    command stream fd put line str

Due to the verbose way that the tail of the template may be written, the number of matches can be held low, if so desired. Morover, the dynamic function selection is in most cases not relevant for the performance.

Note that while there is no functional need to forbid both notations, i.e.

fd =: file stream reads 'my.file'
file stream fd put line str
fd.:put line str

If ???constraints are implemented, the runtime tag tests are seldomly needed, at the expense of more complex progamme writing.

In order to restrict this extension, the function can be defined using the double colon instead of the (this) parameter, and the scoll symbol in the body instead of the object reference:

hashed map:: put (str) at (key):
    i =: string key hashed %% @.count
    m =: @[i]
    m{key} =: str

and to avoid repeating the prefix:

hashed map::
    put (str) at (key):
        i =: (string key hashed) %% @.count
        m =: @[i]
        m{key} =: str
    get (key):
        i =: string key hashed %% @.count
        m =: @[i]
        :> m{key}

OO outdated section

Many object oriented languages include data abstraction, because it fits well into the the idea of objects to which the relevant operations are assigned.

Data abstraction uses functions to get information from objects and to manipulate objects. This is clearly possible in TUF-PL.

In OO languages, the functions defined for the object are often prefixed with the name of the object. This allows shorter functions names, which is desirable in all commonly used OO languages as the function name is always a single word. With the much longer and flexible function templates in TUF, the object class can be used as as part of the function template. This can be studied with the templates for the standard library.

In order to allow either the compiler or the function to check that the bunch to process is not corrupt, two very similar mechanisms can be used:

Besides these basic ones, we have:

A function template may start with a sequence of words, the tag (class name) of the function, followed by a double colon:

aClass::template to do (x) and (y) :
    \ code for the class member
your class::template to do (x) and (y) :
    \ code for the class member
de.glaschick.aClass::template to do (x) and (y) :
    \ code for the class member 

Although the tags are strings, no (single) quotes are neither required nor allowed; just sequences of words separated by points or blank space.

When setting the .tag attribute, quotes must always be used to form a string, even for a single word:

x = {}
x.tag =: 'aClass'

Note that setting a tag can only be done once, and only for rows and maps.

The tags with the primary kind names, i.e. integer, rational, float, string and boolean are reserverd and cannot be set; these are returned by the .tag attribute. Note that the .kind attribute for a map still returns map.

Thus, string functions can be defined as follows:

string::clip (str) upto (n):
\ instead of
clip (str) upto (n):
    ! str.tag =: 
use this:
    s =: 'hallo, hallo'
    print 

?????

A bunch of a certain class is created using the class name inside the branches:

row =: {'aClass'}
yourmap =: {'your class'}
map =: ['de.glaschick.aClass']
 
row.:template to do 1 and 2     \ calling a class function

The tag name inside the braces and brackets must be a string literal; thus the quotes are required. While this seems inconvenient at first, in many cases a tagged bunch is created by a factory function; so the use of literals is not so often as might be expected.

Inheritance by matching the prefix might be defined in far future.

Example

The following are the templates for the Femto Data-Base:

open fDB(pathname):
    returns an fDB handle (object) for an existing database
new fBD(pathname):
    creates a new database and returns the fDB handle (object)
 
fDB::
    get information:
        returns also the fieldname record
 
close:
    closes the associate file
 
get single record at key (key):
    if the key does not exist, void is retuned
 
give records after (key):
    Starting with key, records in ascending order are supplied.
    If exhausted, void is returned.
    If key is void, the first is returned.
 
count records between (begin) and (end):
    Returns the number of records between (begin) and (end),
    both included, if they exist.
    If (begin) is void, it is before the first.
    If (end) is void, it is the last.
 
add (newrec):
    The record with the key in (newrec) is added.
 
delete (oldrec):
 
update (newrec):
 
\ usage:
xfdb =: new fDB './new_fDB.fdb'
! xfdb.tag = 'fDB'          \ (single) quotes are required
info =: xfdb.:get information
xfdb.:close

Questions and Answers

In the following questions, the term support always means that there are special means in the language to support.

Does TUF support class variables?
No. Global variables are sufficient for this, as they are global only to the current compilation module, thus no collision with globals in other modules can occur. Checking the use of global variables within a module is not very costly.
Does TUF support class methods?
No. Class methods are considered a misuse of the class concept, in order to avoid name spaces or other means to support modules. E.g., file handling is not a class, but a module. A file object can be created, and then all the methods can be applied. Functions that do not need objects are not class functions, but functions within the module.
Does TUF-PL support inheritance?
No, not directly. However, any tagged bunch may use the attribute .super to link to a parent. But, in contrast to normal inheritance, all inherited functions must be written down using the corresponding function from .super. This makes the dependence for features of the parent visible. An this allows to simply override the parent's function, and even to correct interface modifications.
Does TUF support singletons?
No. Global variables are a simple means to share data between functions in a module. Note that singletons are normally deemed necessary if a resource is unique and its used must be synchronised. So use a global variable to save the reference to a bunch describing that resource, and have all functions use this resource. Release of this resource must be programmed explicitly for good reasons, because it is easy to forget the often complex rules that allow automatic release, and often the assumption that the rules are fail-safe is wrong.
What about constructors and destructors?
Use factory functions instead (see above)

10. Examples

More examples can be found online at http://rclab.de/TUF-Online.

Hello world

The famous example:

main (parms):+
    print "Hello, world!"

Table of squares

Early computers often used this a first test:

main (parms):+
    ?# i =: from 1 upto 100
        print i __ i*i

No multiplication is necessary, as (x+1)2 = x2 + 2x + 1, or better x2 = (x-1)2 + 2x - 1:

main (parms):+ 
    x2 =: 0
    ?# x =: from 1 upto integer from string parms[1] else 100
        x2 =+ x + x - 1
        print x __ x2

Approximate square root

Calculate the square root by Heron's method:

sqrt (a):
    ? a < 1.0
      :> 1.0 / (sqrt 1.0/a)     \ tail recursion
    x =: a
    y =: x * 1.01               \ thus y > x
    ?* y > x                    \ monotonically falling until sqrt found
       y =: x
       x =: (x + a/x) / 2.0
    :> x

Greatest common divisor

greatest common divisor of (x) and (y):
    ! x > 0, y > 0           \ implies integers
    ? x < y
        :> greatest common divisor of y and x
    ?* y > 0
        z =: x // y
        x =: y
        y =: z
    :> x

Using tail recursion:

greatest common divisor of (x) and (y):
    ! x > 0, y > 0
    ? x < y
        :> greatest common divisor of y and x
    ? y = 0
        :> x
    :> greatest common divisor of y and (x // y)

Linked List

Many scans are predefined; here comes a scan that goes through a linked list:

linked list give (first):
    ? $ =~ first        \ if initial call
    $ =: $.next         \ advance, may be void
    :> $
 
create linked list from (r):
    ? r.count = 0
        :> ()
    x =: []
    x.val =: r[r.first]
    x.next =: ()
    ?# i =: from r.first upto r.last
        n =: []
        n.val =: r[i]
        n.next =: ()
        x.next =: n
        x =: n
    :> x
r =: 3, 13, 7, 11, 5, 9
l =: create linked list from r
?# v =: traverse linked list (l)
       print v.val
\ not much more complicated if used directly:
v =: l
?* v ~= ()
    print v.val
    v =: v.next
 
\ but we may want to filter the results:
traverse linked list from (first) below (limit):
    hic = $
    ? hic = ()             \ start with first
        hic =: first
    |
        hic =: hic.next
    ?* hic ~= () & hic.val > limit
        hic =: hic.next
   $ = hic               \ done if void, also if first = ()
    :> hic
\ and use it like this
?# v =: traverse linked list from l below 10
    print v.val

11. Features not (yet) approved

This section collects language variants that have not yet been accepted, but that could be integrated seamlessly if so desired and considered useful.

Some of the following texts are not up-to-date and are to be thoroughly revised.

Memory shortage interception

Some functions might return a fault item if there is not enough memory available.

In particular, recursive function activation might result in a fatal runtime error terminating the program immediately. While theoretically a function in this case could return a fault item, it seems to be hard to catch this case on C level already.

Most probably memory is exhausted by catenating two (very) long strings; which currently is a fatal runtime error.

In any case, avoidance and interception of lack of free memory is still to be analysed and integrated thoroughly.

Anonymous functions

Anonymous functions use the same backtick notation, but start with an opening parenthesis for the — possibly empty — parameter list:

give =: `() $glovar =+ 1; :> $glovar`       \ no parameters
inverted =: `(a,b) :>compare (b) with (a)`  \ reverse parameters

No colon is needed after the closing parenthesis, as only a list of words is allowed in between.

The condensed notation is useful here, e.g.

max =: `(a,b) ? a>b : :>a;; :>b`

Multiline definitions are also possible:

max =: `(a,b)
    ? a > b
        :> a
    :> b
    `

Note that an anonymous function is just given a hidden name by the compiler; normally it saves no space or time; it is just a syntactic expansion which is most useful in combination with the condensed notation.

Normally, return value is not void; but this is not required.

Function parameter lists

Occasionally, a function has two parameters that are either exchangable or have a canoncial order, like in

gcd (x,y):
    ...

Row optimisations

As a string contains characters in the strict sense, it may not be (mis-)used to store a row of bytes; if the input is e.g. in UTF-8 code, two or more bytes may denote a single character. Use Chunks of bytes instead.

Basically, each row element is a reference to the respective item; thus, a row of small integers that would fit into a byte will be larger by a significant factor in the order of 10 than in a more tightly machine oriented language like C.

To solve this issue finally, assertions may be used to assert that each element refers e.g. to a small number. In this case, the map with uniform references can be internally optimised, in that a map of bytes is kept contiguous internally and just flagged in the map header correspondingly.

Sequences

Primarily for interactive use and for short anonymous functions, a condensed notation allows two or more statements on one line. (This feature is still not implemented.)

A semicolon (;) is treated like an end of line, with the following statement in the same block (same indentation level):

x =: 0; y =: 0
tok =: toks[ti]; ti =+ 1    

If the line starts with a question mark for a guard (if or loop), a single colon opens a dependent block like a line end:

? a = (): a =: []                   \ set empty row if not yet set
? i < 0: ?^                         \ conditional repeat loop
? i > 0: ?>                         \ conditional terminate loop
?* x.next ~= (): x =: x.next        \ find last in chain
k =: []; ?# i =: from 1 to 10: k[i] =: i    \ fill with integers
? is x fault: fault x mark checked; :> x        \ propagate fault up
? x > y: z =: x ?~ z =: y           \ determine maximum 
? x > y: z =: x; | z =: y           \ same, observe ; before |

The if-not (else) token (?~ or |) closes the current block and opens a new block; note the semicolon before the vertical bar, as otherwise it would be the boolean or-operator. The colon notation is restricted to a guard; it is not possible for any notation having a dependent block. Note also that the function template is always on one line; the body cannot be written on the same line.

Two lines are also possible if an alternative is present:

? x > y: z =: x 
| z =: y

A double semicolon is reserved for block close, but should be avoided for clarity. For complex situations, better use a function instead.

The condensed notation is useful in short anonymous functions:

max =: `(a,b) ? a > b: :>a ;| :>b`

Because the dependent statements are not limited to assignments, two conditional statements may be nested, e.g.

? x ~= (): ? x.attr = val: :> x

so that the partial evaluation rule for the and operator is not needed.

Also, an indented block can be used as in

? x ~= (): ? x.attr = val
    ... 
    :> x

Note the absence of a trailing colon in the second part. Indentation of the dependent block is checked against the indentation of the primary statement.

In order to have the feature defined, not necessarily implemented, conditional expressions may be used:

x =: ?( discriminant ?: trueval ?~ falseval ?)

to avoid writing down the variable twice, as in:

? logical expression
    x =: trueval
|
    x =: falseval

The discriminant is a boolean expression, and trueval and falseval represent any expression that would be allowed at the right hand side of an assignment. While future versions might allow the expressions be preceded by simple statements, separated by semicolons, it is not recommended to use due to its lack of clarity.

Slices

Currently, no slices are included in the language. However, to have the language ready, and check that they would fit in, here is what they would look like.

As opposed to tuples — which are an ordered collection of individual items — a slice is defined by a range of values, defined by the first and last and a rule to compute the values in between. Consequently, ranges of integer numbers are normally used.

Using the slice symbol, two fullstops, a range of integers creates a tuple:

a, b, c =: 1..3

While it would be tempting to allow

    r[1..] =: 1, 2, 3, 5, 7, 11, 13, 19
    a, b, c, d, e, f =: 1..
and leave the upper limit out, this would define a unary postfix
operator, and is thus not considered.

Strings are ordered lists of characters, the slice might be useful here too:

s =: 'The quick brown fox'
t =: s[5..9]        \ instead of: string s from 5 upto 9
! t = 'quick'

Row and map literals

Because of their dynamic nature, rows and maps are normally created at runtime and filled from data calculated or read from files. Thus, there are conceptually no row or map literals.

As tuples are like (small) rows with fixed contents, they can often be used when a row literal were desired:

primes =: 2, 3, 5, 7, 11, 13, 17, 197 

If a row is required, because fields are used or it must be changed, use the library function:

rowOfPrimes =: to row 2, 3, 5, 7, 11, 13, 17, 197 

Note the a tuple in the row generator is reserved for multi-dimensional rows, which currently are implemented using a library:

mr =: new matrix sized 12, 12   \ instead of mr =: [12, 12]

There is no equivalent for maps, because

x =: ('a', 9), ('b',"x"), ('c', 3.0)

creates a tuple of pairs, which is not automatically treated as map; a function may be used

map =: to map tuple pairs ('a', 9), ('b',"x"), ('c', 3.0)

Note that the function creates the map using the .Quickadd option; thus duplicate keys are hidden. While the values are uniqe, the key and value scans will give duplicate entries.

As there are no genuine row and map literals, named literals cannot be used to create rows or maps at compile time; only tuples are available this way.

Field change callbacks

In object-oriented languages, object fields are normally protected and changed by the tedious getter/setter object functions, which often require many lines of code even if the values are just stored, and make the code clumsy with all the function calls. As this cannot be changed later, it is necessary to do so for all field values, even if there is not yet any need to do so.

In TUF-PL, a field of a bunch can be associated with a guard function, that is called if a value is inquired or changed. This has the advantage of very clear code; because setting is just an assignment, not a function call. There is no need to precautionary use getter/setter functions in case a guard function might be required later. There is clearly a — presumably small — penalty in execution size and speed16.

Instead of

obj.set_width(5)
x = obj.get_width()

this is just a normal assignment in TUF-PL:

obj.width =: 5
x =: obj.width

To associate a getter or setter function, a special assignment is used:

x.width =< `set width of (bunch) to (new)`
x.width => `get width of (bunch)`

Using tagged bunches in an object oriented manner:

a_Class::
    get width:
        :> @.width
    set width (val):
        @.width =: val
    +:
        \ anything to be done if the bunch is created
        @.width =< `.set width ()`
        @.width => `.get width`
    -:
    \ code if resources have to be freed
 
x =: {a_Class}

Note that many cases where setter functions are used to enforce integrity can be solved with constraints.

For a setter function, the value returned is set to the attribute; for a getter function, the value returned is used instead of the true value of the attribute.

Not that comparison is done with >= and <=, so there is no collision17.

This feature is very handy for GUI interfaces, as a GUI object is just changed.

A getter or setter function can return a fault item if there is a recoverable error encountered, which is then stored in the attribute in case of the setter function, and given back to the caller in case of getter functions.

Note that e.g. in C++, the compiler may optimize a getter or setter function to inline code, in particular if just a field value is obtained or changed. But in these cases, there is no function associated, and the relative efficiency is the same, without stressing compiler optimization.

Overriding bunch element access

To allow callbacks not only for fields, but also for the element access, i.e. row[key] and map{key}, the syntax

obj.> =: `get (this) from (key)`
obj.< =: `put (this) at (key) value (val)`

will finally be the right way to define it.

However, until field callbacks are available, as intermediate solution to support special keying methods for rows and maps, the access can be overwritten by two attributes, :overrideGet and :overridePut, that can each hold a reference to a function with two resp. three parameters. The :overrideGet function is called whenever the row or map is accessed, i.e. the form row[key] or map{key} is found within an expression, and should return a value, while the :overridePut function is called if the form is the target of an assignment. Both have the bunch reference as first parameter, the key as second one, and put the value to be set as third one:

matrix (this) get at (keys):+
    i =: matrix this calculate keys
    :> this[i]

When such an override function is called, this override function is disabled until it returns and thus can access the row or map normally.

The overriding access functions should not be called otherwise to prevent circular invocation, i.e. not exported.

Both attributes are stored like a field, but with the names :OverrideGet and :OverridePut which are protected from accessed as fields due to the leading colon.

Overriding of operators

Like overriding of bunch access, such a mechanism allows to override other operations, in particular operations in expressions like add, subtract, multiply and divide, for maps and rows.

Implementation is fairly easy, if factory functions are used. A factory function for a (tagged) bunch sets a control field with a function reference, and the operator code checks if the (left hand side) operand has such a control field, and then calls the function.

The syntax to establish complex numbers is then:

new complex number (re) i (im):
    obj =: {}
    obj.tag =: 'complex number'
    obj.re =: re
    obj.im =: im
    obj.OperatorAdd =: `complex () add ()`
    obj.OperatorSub =: `complex () subtract ()`
    obj.OperatorMul =: `complex () multiply (that)`
    obj.OperatorDiv =: `complex () divide ()`
    obj.abs =: `complex () arg`
    \ no comparison 
    :> obj
complex (x) add (y):
    \ no shortcut to return y if x = 0.0, because equals comparison
    \ is not supported, although this might be a useful case
    :> new complex number x.re+y.re i x.im+y.im 

As usual, the operators combine the two arguments and return a complex number.

The problem with this solution is that attributes could be function references, but are not automatically called; unless the specification for fields and attributes defines that number attributes are not deliverd, but called as unary functions.

It seems to be fairly unefficient to call function references instead of calling directly the corresponding functions; however, this overhead is the overhead of finding the control field values. Together with the storage requirements to save the latter, the overhead is nevertheless quite small, but tests are still pending.

Overriding of string conversion

As there is a default conversion to a string for numbers (and booleans), a special attribute .ToString is justified, that, when set with a function reference, calls a function with one argument that converts the item to a string.

Regular expressions

Support for regular expressions is an optional module; using the standard string functions is often clearer.

Regular expressions are written as (non-localised) strings. All library functions which use regular expressions use a common cache to compile the regular expressions on demand.

There is no need for a special match operator symbol, as the standard functions all return a void reference if they fail the match:

?* find re '[a-z][0-9a-z]*' in s       \ returned values ignored
  do something

Most often, in case of a match, the position and length is needed and returned as a bunch with two fields:

?# x =: find re 'a[0-9]+b' in s
  ? x.start > 0 & x.length > 0
    x =: clip string s from s length l

If list assignments are implemented, the example might read:

?# b, l =: find re "a[0-9]+b" in s
  ? b > 0 & l > 0
    x =: clip string s from b length l

Enumerated integers

No language support is provided to assign a range of numbers to a list of names automatically, like the enum in C. This is assumed a legacy from early computers where it was common to assign integers as codes for smaller and quicker programs. Using (short) strings instead, there is no loss in speed or size. Moreover, such a feature would only be useful within a compilation unit, and thus there is no language feature.

If it is desired to use names instead of numerical constants, use Named Literals instead.

Combining (join) bunch elements to strings

In particular in debugging, it is often required to assemble all values (or keys) of a bunch to a string.

Serveral library functions allow this:

row (row) join values
row (row) join values by (sep)
row (row) join keys
row (row) join keys by (sep)
map (map) join values
map (map) join values by (sep)
map (map) join keys
map (map) join keys by (sep)

No separator parameter delimits by a single blank; for consistency (and for future optional parts), a void separator does the same; to catenate without separator, use the empty string.

Normally, shortcut version are used:

join (bunch) values
join (bunch) values by (sep)
join (bunch) keys
join (bunch) keys by (sep)
join (bunch)                    \ same as join (bunc) values

To print all values of a row or map, write

print join rmt

To print all values of a row in separate lines, use

print join row values by '&nl;'

Several scan functions are provided in this context:

row (row) give keys         \ ascending
row (row) give values       \ ascending keys
row (row) give values ascending keys
row (row) give values descending keys
row (row) give values ascending keys using (func)
row (row) give values descending keys using (func)
map (map) give keys                 \ undefined order, with duplicates
map (map) give keys unique          \ undefined order, without duplicates
map (map) give keys ascending       \ without duplicates
map (map) give keys descending
map (map) give keys ascending using (func)
map (map) give keys descending using (func)
map (map) give values                   \ undefined order of keys
map (map) give values unique keys   \ undefined order
map (map) give values ascending keys  
map (map) give values descending keys 
map (map) give values ascending keys using (func) 
map (map) give values descending keys using (func) 

Lazy versions that determine the kind dynamically:

give (bunch) keys                   \ order depends on bunch
give (bunch) keys ascending
give (bunch) keys descending
give (bunch) values                 \ key order depends on bunch
give (bunch) values ascending keys
give (bunch) valuse descending keys
keys of (bunch)                     \ short for 'give (bunch) keys'
values of (bunch)                   \ short for 'give (bunch) values'

Some variants may not yet be present; and not all are defined here in order to avoid cluttering the list with rarly used variants, e.g. giving row keys in descending order, that can be done normally by:

?# i =: from row.last downto row.first
    ? row[i] = ()
        ?^
    ....

or as a scan function:

row (row) give keys descending:
    ? $ = ()
        $ = row.last
    |
        $ =- 1
    ?*
        ? $ < row.first
            $ =: ()
            :> ()
        rv =: row[$]
        ? rv = ()
            $ =- 1
            ?^
    :> rv

Output to standard output (print)

Initially, there are only two print statements:

print item
print item nonl
print partial item   \ deprecated

where item could be a string or a number, and a complex line is created by using string catenation (_ and __).

Instead of the negation nonl, the forms

print item nl
print item
print nl    

would be more systematic; but printing an entire line is more often needed, as in:

print 'x=' _ x

A simple extension was to support lists with a standard delimiter:

print (list):
    ? is list list
        print as string join list by ', '
        print join list by ','
        print join map xx values by ',
 
print list

If a different delimiter is required, use the join function.

Multi-dimensional fields

As bunch values may be bunches, multi-dimensional rows or maps can be created by assigning rows or maps to the elements (known as Iliffe Vectors).

Basically, the secondary bunches are created in advance, as in

a =: []
?# i =: from 1 to 11
    b =: []
    ?# j =: from 1 to 12
        b[j] =: i*j
    a[i] =: b
! a[3][4] = 12                  \ if the compiler allows stacked keys

An example would be a row of maps, the example filling it from a file keyed in a row by length, and counting each word:

r =: []
?# w =: give lines from standard input
    x =: r[w.count]
    ? x = ()
        x =: {}
        r[w.count] =: x
    x{w} =+ 1

As map keys can be any items, they can be lists in particular:

map =: {}
map{1, 'a'} =: 1, 'a'
le =: map{1,'a'}
! le = 1, 'a'               \ if the compiler supports lists here

Multi-dimensional rows are thus rows which are keyed by lists:

row =: [5,5]                \ allocates 25 cells
?# i =: from 1 upto 5
    ? j =: from 1 upto 5
        row[i,j] = i+j

or, if lists are supported by generator scans:

?# i, j =: from 1,1 upto 5,5
    row[i,j] =: i + j
print row               \ if printing of bunches is supported

The attributes .first and .last still refer to the underlying row.

If the row is multidimensional, the attribute .dimensions has a reference to the list of maximum elements per dimension. If .origin is set, it may be a single integer or a list of integers, one for each dimension. If void, integer 1 is the default. The origin of each dimension is subtracted from each dimension in calculating the offset, then the linear cell computed with zero origin.

While it may be changed at any time, this will seldomly be useful and thus may only be supported before the first cell accessed.

Changing the number of elements per dimension normally results in heavy data movements, and may thus be rejected (or because the effort to write the necessary code was not yet justified).

Template permutations

Instead of parameters by keyword, TUF-PL defines permutation of parts of the template, including optional parts, using the + and - characters. The + starts a template part that may be exchanged with other parts, and - starts an optional part.

The function with the template:

copy + from (a) + to (b):

may be called as:

copy from x to y
copy to y from x

Or:

from (start) + upto (end) - step (step):

may be called as

from 1 upto 7
from 1 step 1 upto 7
from 1 upto 7 step 1

Parameters in optional parts are set with void when calling, i.e. are just functions that call the full template with voids as required.

In order to avoid confusion, the + and - characters must be followed by a word, not a parameter, so that the word(s) behind it are the equivalence of keywords. Also, the partial templates must be unique.

This feature is not to be expected in early compilers.

Threads

Threads — as implemented by most operation systems — split a process in two (or more) processes sharing all global data. Threads were very valuable in Microsoft's Windows systems, when neither UNIX fork nor shared memory was supported, and starting another independent process was slow.

Memory administration as used in TUF-PL is not thread safe. If items are shared by threads, incrementing and decrementing the use counter must be atomic. Also, the queue of unused items to speed up allocation must be thread-local, unless the overhead of mutexes is payed.

As both thread execute instructions (not statements) in an unpredictable order (unless synchronisation primitives are provided and properly used), the execution sequence of such a programme becomes rather unpredictable.

For example, the code snippet:

$glovar =: 1
a =: 1      
! $glovar = 1

will often work, but sometimes stop because the assertion is violated. Of course, this assumes that $glovar is changed somewhere else, and this may well be done between setting and reading the variable. It will occur rather seldom, as threads normally run for many thousands of instructions before re-scheduled, but it will occur finally, depending on the load of other processes in the computer.

It is admitted that the same thing may happen when instead of the assignment, a function is called, that may or may not change the global variable, depending on the state of the data used, etc, and this is one reason that global variables are discouraged to use, unless good reason is given do do so.

All this is thoroughly discussed in Edward A. Lee's article The Problem with Threads, published in 2006, available online at https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.html.

As strings and numbers are immutable, they do not conflict with tread safety, if the storage management is made thread safe as part of making the runtime thread safe.

Currently, access to the scan context uses a global variable which then must be replaced thread local storage (POSIX TLS). Another solution would be to add an extra invisible parameter to all function calls in the context of the right hand side of an assignment that is used in a scan loop, and defining that extra parameter if a function contains the scan context symbol. It would create a special class of scan functions.

Using only bunches as globals and the .lock and .Revisions attributes, a high degree of thread-safety can be achieved, provided that .lock and .Revisions are changed atomically.

As POSIX threads denote a function and a single parameter, starting a thread would be just a function (instead of a special digraph):

posix thread start `function using (parm)`

More functions are likely to be necessary in a separate POSIX thread support module.

Alternate scan functions

The current scheme for scan functions is simple, yet proved so far to be powerful and useable, however:

Checking that a function called in a scan assignment is finally a scan function is not primitive and not justified, because other pitfalls were more common. Also, this error is detected very early when testing.

Thread-local storage is available in C, although the performance penalty is unclear.

Alternatively, all scan functions could have an extra parameter automatically provided at the point of call and carrying the scan context. Either indirect scan functions would be no longer allowed, or this parameter would be used by any function call. Note that TUF-PL until now does not have a indirect item; in all cases so far either a bunch or a tuple (of one element) are used.

The latter may be necessary when introducing threads anyhow.

Generator functions are often shorter and clearer if yield is used, e.g. using the #> bigraph:

give Fibonacci numbers upto (n):
    a =: 0
    b =: 1
    ?* a <= limit
        t =: a
        a =: b
        b =: t + b
        #> a

or in tuple split notation:

give Fibonacci numbers upto (n):
    a, b =: 0, 1
    ?* a <= limit
        a, b =: b, a + b
        #> a

compared to the scan function:

give Fibonacci numbers upto (n):
    $ =~ 0, 1                   \ initial tuple if first call
    r =: $.1 + $.2              \ next number
    $ =: $.2, r                 \ save previous and current value
    ? r > n                     
        $ =: ()                 \ end of scan
    :> r                        \ return next number in any case

Another example:

\ generator
give primes upto (n) :
    pt =: [n]               \ table preset with void
    #> 2                    \ first prime
    p =: 3                  \ next prime
    ?*
        \ search in table
        ?* pt[p] ~= ()
            p =+ 2
        ? p > n
            ?>              \ done
        #> p                \ yield next prime found
        \ mark multiples
        ?# j =: from p*p upto n step 2*p
            pt[j] =: p              \ mark all multiples as non-prime
 
\ scan function
give primes upto (n) :
    ? $ = ()                    \ first call
        $ =: [n]                \ table preset with void
        $.idx =: 3              \ start here at next call
        :> 2                    \ suppply first prime 
    p =: $.idx                  \ continue here
    ?* $[p] ~= ()               \ skip non-primes, always finds void
        p =+ 2                  \ $ not used in scan body
    ? p > n                     \ done?
        $ =: ()                 \ yes, clear scan context
        :> ()
    \ prime found, save continue position
    $.idx =: p + 2
    \ mark multiples
    ?# j =: from p*p upto n step 2*p
        $[j] =: p              \ mark all multiples as non-prime
    :> p

However, generators in general are as difficult to implement as coroutines, because the stack frame of the generator has to be saved by the yield, which requires some fiddling in the internal structures in C, normally done be assembly instructions that are not really portable.

Using Duff's device (it is allowed even to jump into loops in C), yields could be implemented fairly portable, but the problem of saving local variables is still unsolved.

Switches

Basically, no case switches are defined. They were used in early times as an efficient multi-way branch, which is no longer necessary.

Admittedly, the statement example

? x = 1
    do something
|
    ? x = 2
        do otherthing
    | 
        ...

is not exactly a terse expression; but the else if symbol (|?) makes the code fairly readable:

? x = 1
    do something
|? x = 2
    do otherthing
|? x = 3
    ...
|
    done otherwise

Using case switches for a large number of alternatives is not very often sensible, as the code is hard to verify, in particular if the cases are not just function calls. Using a row or map with function references is often a good and efficient alternative.

In many cases, the code might be organised without else if, e.g.:

? x = 1
    :> obtain first thing
? x = 2
    :> obtain second thing
...

which is fairly good readable, and is not restricted to equality comparisons.

The major disadvantage is that the variable x has to be repeatedly written, thus a shortcut to avoid this extends the elseif symbol:

? x = 1
    ...
|? = 2
    ...
|? > 3
    ...
|               \ the catchall (otherwise) case
    ... 

The normal guard expression evaluation must be a simple comparison of a variable followed by a comparison operator, the former noted. When an else if operator |? is used and directly followed by a comparison operator, the noted variable is assumed to be prepended.

As the value of this is debatable, few compilers will support this extension.

Macroprocessor

In the early days, a macroprocessor (or preprocessor) was useful for a slim compiler and as a mechanism to extend the language. As its purpose is to provide features not defined in the language, it spoils the structure and makes it harder to verify a programme. Thus, a build-in macroprocessor is not thought as appropriate for TUF-PL.

Just in case the demand for such a beast cannot be rejected, here is the preferred specification:

All macro processing takes place once a line is parsed into tokens (words, numbers, strings, single characters and character pairs (digraphs); comment and pragma lines starting with a backslash in column 1 are not processed at all, as are trailing line comments.

Macro definitions are valid for the lines that follow, they can be redefined without error or warning; no second double backslash or nothing behind it erases the macro.

As an example, these would be the debug print macros:

\\ <\x> \\ '\x="' _ \x _ '"'
\\ <\a{\i}> \\ '\a{' _ \i _ '}="' _ \a{\i} _ '"'
\\ <\r[\i]> \\ '\r[' _ \i _ ']="' _ \r[\i] _ '"'.

Note that the increment assignment could be expressed as

\\ \x =+ \y \\ \x =: \x + \y

and the catenation with blank as a mere substitution by

\\ __ \\ _ ' ' _

Reserved words instead of symbols could be set as follows:

\\ if \\ ?
\\ else \\ |
\\ else if \\ |?
\\ while \\ ?*
\\ scan \\ ?#
\\ return \\ :>
\\ repeat \\ ?^
\\ leave  \\ ?>

Of course, these words become reserved words then and cannot be used for variables etc. any longer, and it may give confusions as the compiler might show the generated line.

Note that macros do not match text inside strings of the source line, only in the replacement. Of course, a known string could be matched verbatim and replaced:

\\ "hello, world" \\ "Hallo, Welt!"
\\ 'Have a ' _ (u) _ ' day' \\ 'Wünsche einen ' _ (u) _ ' Tag' 

Error messages refer to the replaced line.

Note that the backslash is used in the language for comments and pragmas only, so there is no need to have them in the pattern or generate them in the replacement.

Coroutines

As an alternative to callbacks and threads, coroutines (see http://en.wikipedia.org/wiki/Coroutine and in particular Simon Tatham's explanation ) are specified in a generalised message passing model, that allows bidirectional value (message) passing.

Coroutine send and receive is indicated by the digraph <> at either the left hand side of an expression (send, yield) or as a term in an expression (receive)

A function is suspended if the port in the receive context has no value, or for a port symbol in yield context the previous value has not yet been received. The compiler may thus use threads and parallel execution, but may also suspend before a receive port until the value is ready, or after a yield until the value is received, so that there are no threads necessary. (See ??? remarks).

Example:

func1:
    ?# x =: <~          \ receive
        ~> x*x          \ send 
func2 (p):
    i =: 1
    ?* i < p
        ~> i
        j =: <~
        print i _ '^2 = ' _ j
 func1 <> func2 10      \ start two coroutines

Function 1 starts and runs until the port is used in receive context; it is suspended until function 2 yields a value, which may be suspended or allowed to continue execution. Then, function 2 is suspended, the value transferred to function 1, and function 1 resumed until it yields a value, when it is suspended again and function 2 resumed, given the value supplied by function 1.

Coroutines are not available in most implementations, as a stack suspend operation is requiered. Implementing them with threads is critical, as unpredictable behaviour may occur if global variables are shared. A compiler may issue a warning, if global variables and coroutines are used in the same module.

Note that scan functions are similar to coroutines, see below in ??? functions as coroutines.

Unified Rows and maps

It would have been possible to syntactically remove the distinction between maps and rows, and allow the key to be either a string or an integer, using the bunch[idx] syntax for both, rows and maps, and creating both with the [] syntax. The item such created would be a neutral bunch, which may be used as a row or a map, depending on whether the key used is an integer or a string. (Although for a map only a comparison for equality is required, other keys than strings are rarely useful.)

If the first use would set the kind of the keys, and using the other one leading to a runtime error, the situation would be nearly the same as now, only that more runtime errors are expected because there is no syntactical distinction, and it would be more difficult to check this at compile time.

Allowing both, row and map access, to co-exist for the same instance, is possible, but has some consequences:

These are not really profound reasons, but all together let me prefer to keep the distinction.

A problem in the case of overriding the get and put functions was none. To support hash tables with strings as keys, it is tempting to have the primary bunch as a row, keyed by the hash values, each element of the row being a reference to a map, which is keyed by the strings supplied as arguments for the get and put functions. In this case, a row would be keyed by strings. This can easily solved by using a map as primary, with a field referring to the row of hash slots. As a consequence of this evaluation, overriding get and put is only valid for maps.

The desire to use swivelled braces for the condensed notation is also solved, see ???Condensed.

Shell usage

This is a rather old section that has to be thoroughly revised.

TUF-PL can be used instead of bash or dash etc, as its syntax is very terse. The major advantage is that you can use the same syntax in your shell as in your programming language.

However, there are some not so simple differences.

The tufsh shell can use any TUF-PL program as script, but, as it reads commands interactively, has loosened input syntax rules:

These relaxations apply only for terminal input, scripts have to adhere to the normal TUF-PL syntax.

In particular, file name expansion is done by the files function, which uses two versions, which use Bourne shell patterns or regular expressions:

qls 'x*.jpg'
ls 'x*.jpg' \ because the sting is a RE, the point must be backslash escaped
 
%file (files '*.jpg')

All filenames have to be quoted:

rm 'x.out'

No automatic file name expansion, you write

rm (ls '~*')

But you will then more often use rows to check the filenames:

f = ls '~*'
print f     \ check output
rm f

Postprocessing ls output to print size and filename, split without parameters splits on white space:

fl = ls '-l' '*.#x'
! fl.?: x = split @; print x[5] " " x:9 .

This gives problems with filenames with blanks; better use the built in file functions:

fl = directory list '*.#x'
! fl:? x = @; print x.size, x.filename

To allow pipes, the symbol -> is used in conjunction with the call operator:

%ls '*.jpg' -> %awk '{print NR, @0}'

Redirection is done by providing a filename as string as source or target:

ps 'faux' -> 'x.out'
'x.out' -> less

This does search for x.out in the path; this would be done by using the call operator explicitly:

%'x.out' -> less

To redirect file descriptor 2, i.e. standard error:

a.out -2> 'x.err' -> 'x.out'

To join stdout and stderr, just omit the filename:

a.out -2> -> less
a.out -> (1 2) less
a.out -> join 1 2 -> less
a.out -> ftee 2 'x.err' -> less

The equivalent to cat, i.e. write to a file, is named tac and both are provided as a built in functions, if desired for clarity:

f = 'x.out'
%ps faux -> tac f
cat f -> %mail me

If the string x.out should be send as mail, echo has to be used:

echo 'x.out' -> mail 'me'

Or to use my favourite mail test interactively:

date -> mail me

The special pipe function is activated if the source or target are executables in the path (selected by the percent sign unless in interactive mode). Otherwise, is is a common assignment:

5 -> x  ????????????????????
%mail 'me' <- %date
%mail @User <- echo 'Hi, how am I?'

Just for the interactive use, echo is also a build in function.

Instead of %ls, the build-in function dir can be used. It provides a row with information from the directory, as row attributes:

Instead of <code>%ls</code>, the build-in function <code>dir</code> can be used. It provides a row with information from the directory, as row attributes:

f.size
file size in bytes
file size in bytes
f.name
file name
file name
f.permissions
file permissions
file permissions
f.inode
inode number, if supported by filesystem
inode number, if supported by filesystem
f.change
date of last change (seconds since epoch)
date of last change (seconds since epoch)
f.created
creation date
creation date
f.accessed
last access date

Note that dir does not filename expansion.

To rename several files, the following bash script might be used:

for x in x_*.jpg
do y=@{x/#x_/y_}
   mv "@x" "@y"
done

In TUF-PL, this reads, if typed in:

?* x =: %'x_*.jpg'
  y =: sub '^x_' by 'y_' in x
  %mv x y

Both can by typed in in one line:

for x in x_*.jpg; do y_={x/#x_/y_}; mv "_x" "_y"; done
?* x=: %'x_*.jpg'; y = sub '^x_' by 'y_' in x; %mv x y .

Note that filename expansion operator is used instead of the ls-command, because shell has to provide filename expansion.

12. Implementation Remarks

The most direct approach that has found to work is to use memory addresses (pointers) for all kinds of items, having a uniform and robust, but possibly not fastest implementation. Alternatives are discussed at the end of this section.

Fields and Attributes

As the distinction between variable fields, control fields and attributes depends on the kind of item, the final processing is in the runtime.

Because attributes and control fields has precedence over variable fields, the runtime has first to look for the former, which may degrade field access.

Therefore, the compiler has a list of all attribute and control field names, and generates code to check these first, independent of kind. All other do not probe as attributes, but directly look for a field.

This also allows to mark newly introduced attributes as well as deprecated ones with compiler warnings.

Storage Management

In order to automatically free storage that is no longer used, all such items have a use counter, which, when decremented to zero, indicates that the item is no longer used and can be recycled. When a variable gets out of scope, its use count also is decremented, as to free the space of the item.

For keeping track of the use counts, two macros are defined:

#define _use_less(x) if (x) { --(x->usecount); if (x->usecount <=0) _free_item(x); }
#define _use_more(x) {if (x) ++(x->usecount);}

Thus, if a reference is copied from one variable to another, the target reference is referred to one less, and thus the use counter decremented. The use counter of the source reference, as the latter is copied, must be incremented.

A straightforward macro does this:

#define _cpy_var(tgt, src) { _use_more(src); _use_less(tgt); tgt = src; }
#define _cpy_ref(tgt, src) { _tmp = src; _use_more(_tmp); _use_less(tgt); tgt = _tmp; }

However, the macro _cpy_var can only be used if src and tgt are simple variables; if src would be a function call, it would be evaluated twice. Also, it may use the contents of tgt, thus tgt must remain valid until src is finally evaluated. Note that for the case src=tgt, first src is incremented, tgt decremented, also to avoid the intermediate deletion, i.e. to try to increment the use count of a no longer existing item. For the general case, the second form _cpy_ref must be used.

Note that if a local variable is used as a function argument, the use count is not incremented by the caller, different to the treatment of return values. As during the execution of the called function, the local variable of cannot be changed, there seems no need to increment the use count. If the called functions copies the reference to variable local to it, the use count will be incremented, but decremented if the variable gets out of scope. However, if the called function is allowed to modify the parameters, i.e. treat them as predefined local variables, the parameter's use count has to be incremented, as otherwise there would be two variables referring to an item with a usecount one to less.

As this argument is not valid for global variables, their usecount must be incremented and decremented properly. If the called function replaces the reference of that global variable, if will not be freed until the function returns, and thus its value remain valid.

The return values of functions are in a certain sense transient: They are no longer references contained in variables, thus the count of references from variables might be zero, but they cannot be freed until they are finally disposed of.

Let us assume a function just created a new item and returns its reference. This item must still have a use count of 1, as otherwise it would have been deleted. In case it is stored in a variable, this can be done by decrementing the usecount of the previous contents, and then just saving the reference in that variable, without incrementing the use count of the function value. A macro is defined for this, under the assumption that src is not an expression that might use tgt; otherwise use the second form:

#define _set_var(tgt, src) { _use_less(tgt); tgt = src; }
#define _set_ref(tgt, src) { _tmp = tgt; tgt = src; use_less(_tmp); }

Note that _cpy_var could be replaced by two lines, but this does not save any effort in code generation.

_use_more(src);
_set_var(tgt, src);

Alternatively, the transient values could have a use count of zero, and the above macro _cpy_var be used to set a variable, which would make code generation easier. See below for more on zero use count transient items.

More difficult is the handling of nested function calls, where function results are used as parameters of other function calls. This is tempting to generate, but the problem arises how to free the transient return values.

A simple and robust solution would be not to used nested function calls, but assign all intermediate values to temporary variables, and when the statement has finished, release all these temporary variables. Still, the use of _set_var instead of _cpy_var would be required, but the information whether the value is a variable or an expression is present in the compiler.

Note also that functions may be called discarding the return value; in this case, the returned transient value must be disposed of, either by using _use_less in case of return values with nonzero use counts, or a _drop macro:

#define _drop(x) { if (x) { if (x->usecount = 0) _free_item(x); } }

Using transient items with a use count of zero would allow direct function nesting and uniform handling independent of the source of the value, if extra effort for a kind of garbage collection for these transient values at the right places, which both is not simple, would be invested.

First, reducing the use count would require a different action, as it is not free when he use count is zero, but possibly entered in lists for transient values.

If all items are anyhow kept in a list, this list could be scanned for zero use count entries, and these deleted. However, not chaining the used items saves memory and processing time. Alternatively, only the zero use could be kept in a list.

However, there is no point in time where all these transient values may be freed. Let us just assume that the main program contains the line

x =: fun1(fun2 1, fun3 2)

This means, that until the return of fun1, the transient return values of fun2 and fun3 have to be kept. One solution would be to set aside the current chain before the call of a function, after return clear this chain, and restore the chain. It is not clear if this elaborate system will give a benefit at all in run time and space over the simple solution of just to avoid direct function calls.

Note that the advantage not to handle each return value individually is bought by handling it before return, and additional effort for garbage collection.

In general, using reference counts, variables are concerned. If a variable is assigned a new value, the old reference is decremented (and the object freed if zero), and the new reference incremented. So this is a reference copy, implemented by the macro _IPcpy(tgt,src). If the (old or new) reference is void, nothing has to be done. At the end of the block, all variables are decremented, so that, if they have the last reference, the item is freed.

A variable that is parameter fits into this scheme; it remains existent until the function returns and the variable is overwritten. Parameters that are constants could be marked as such, and never need to be freed; in particular, if the same constant has to be itemized again, the same reference could be used.

If a function returns a reference to an item, it must be preserved until it is assigned to a variable, it becomes transient. Thus, its usecount is not decremented; otherwise it will be deleted before it can be used. When assigned to a variable, the old reference is clearly to be decremented, but not the transient reference, as it it just to change from a transient to a permanent state once stored in a variable location. Thus, a different macro, _IPset(tgt, src) is used.

If a transient value is not made permanent via a variable, the caller must ensure its usecount is decremented once it is no longer needed, in particular, if it is discarded. So, if a function is called without assignment, its return value has, if not void, its reference count immediately decremented, and in many cases just then deleted.

This prevents to compile nested function calls directly, as the return value must be discarded when the outer function returns, and the intermediate value have to be saved in temporary variables for this purpose.

Independent of the usecount, the value returned by a function must be properly administrated by the caller of the function. If the value is assigned to a variable, and it has already the final usecount, n

The problem comes with expressions where the intermediate values, mostly return values from functions or operators, must be kept until the functions or operator for witch they are operators returns. Thus, the simple solution is to use temporary values, assign to them the return values, and destroy them once the values are used.

The current version of the macros reflect these problems. Note that up comes before down, in case the target already contains the same reference as the source, because otherwise the source may be deleted in between.

A fairly tricky solution might be to queue all values returned in a transient queue, which is cleared after an expression (that has created transient values) is done. However, as another function can have already created transients, there must be mark in the transient chain, so that the solution to used temporaries seems more efficient.

The run time might use indirect references and, on a 32-bit system, use 24 bits for integers, and use the excess numbers as references. Thus, instead of items, only integers are passed to functions, where a global map maps integers to items. Integer 0 is reserved for void.

Rows and maps

Rows are kept as continuous memory of item pointer, re-allocated if necessary. Currently, the amount added is fixed to 10, which seems rather suboptimal for large maps. But no real performance issues have been observed: Calculating 10 million primes, i.e. increasing the map a million times, in fact has a short delay while the multiples of the small primes are written into the table, but the difference between the first and the second prime is not much larger than expected.

Maps as well as user-attributes use a unidirectional linked list with the lastest used entry first etc. This dynamic rearrangement proved to be very efficient for upto 1000 entries. Insertion of new keys is still quardratic in perfomance, and processing of lists of unique words may be slow.

It has been tried to use cylclic lists and just moving the start pointer, but performance is better only in few cases with mostly unique words, and worse in most other cases.

To speed up map access, a hash table could be used, either by implementing it in TUF-PL, or by integrating it into the runtime system.

The default hash function is the length of the string, so it remains the number of slots to be set. This can be done in the same way as the estimated size of a map is given, i.e. as a number within the bunch creator. If consistency with the row creator is most important, it is the estimated number of elements. However, using the stringlength as hash function, 10 slots is almost optimal choice. Using a TUF-PL solution, the speed is increased by approx. the factor 5. Thus, the rule remains that the number is the estimated number of elements, but the standard implementation will simply switch to 10 slots if a number is given.

Tagged bunches

It might be an implementation option if bunch functions are connected to a tagged bunch during creation, so that the runtime code can search the bunch. However, the current solution seems to be sufficient.

Level 0 TUF

For bootstrapping, a restricted version of TUF, level 0 TUF, is used:

Strings

In order to avoid copying of substrings, the string could be a row of substrings, where each element holds the reference to the source string and the address plus length pair of the substring. A method can then collapse the stuff to a single new string, which will also be required if substrings of substrings are required.

The idea came up considering a text file to be opened in shared memory; then, the lines are rows of substring pointers, just excluding the newline characters. For this purpose, a string should be able to specify the memory pool to which the memory should be returned:

Named Constants

Many libraries have constants as parameters, that are (in C) defined as macros, i.e. FL_PUSH_BUTTON, providing short integers. During early times of computing, the efficiency gain might have been significant in some cases; this is truly no longer the case nowadays.

TUF-PL prefers (short) strings, so that instead of:

    xforms add box FL_UP_BOX title "Quit?"
it is preferred to use:
    xforms add box 'up' title "Quit?"
or define a named constant:
    #FL_UP_BOX='up'

Constraints can be used to restrict the strings at compile time:

! FL_UP : 'box' | 'nix' | 'nax'
xforms add box (FL_UP: flag) title (str:string):

In the basic implementation, short strings do not take up more space than integers, and comparison is quick compared to the overhead required anyhow.

Library bindings

Binding foreign libraries is usually done by providing an interface module, that provides the features of the foreign library by TUF functions. These may, but often will not be a 1:1 relation in order to better adopt the interface to the programming concepts of TUF-PL.

Practically always these binding interfaces need to use data structures different from the items used in TUF-PL, i.e. opaque memory.

For this purpose, there is a system item subkind for foreign (opaque) data. The library may well return bunches which contain normal TUF items, like numbers, strings, etc., and use one or more fields to store foreign data.

The payload for a foreign system item contains two pointers:

If the destructor pointer is not zero, it is used to call a function using this pointer, with just the data pointer as argument, before the item is finally destroyed. The function should return zero; otherwise the situation is treated as a fatal system error and the programme is aborted. There is no means to avoid having the item memory freed afterwards. Note the argument is not the item, just the data pointer.

It is strongly recommended to start the opaque data with a magic number, so that any library routine can check — with high probability — that the pointer in the item provided had been created within this library interface and could reliably be used as a trusted data structure. The preferred magic number is a pointer to a string literal within the module, which is unique within the module. The string may be org.tufpl.bindings.xxxxxx where xxxxxxx identifies the interface.

Also, a row or map can be used, tagged and locked. Note that C programs can set the lock count to maximum and thus inhibit unlocking.

Multiarch 32bit and 64bit

The compilation can be for both, 32- and 64-bit architectures; just adjust the parameter in the makefile. Except for a few places where old 32-bit interfaces to libraries are used, the resultant programmes should only differ in size and execution time.

If in a 64-bit operating system the necessary 32-bit libraries (i.e. multiarch in Debian Linux) are available, the 32-bit variant will be the default, because it requires less space and is often quicker. Note that integers use 64-bit words anyhow, so there is no performance penalty even for integer applications with larger numbers.

Normally, either the 32-bit or the 64-bit variant is installed, selected by the approriate parameter, and with one set of libraries only, which are not distinguished by name. Currently, the selection is in the tuf script, as this is copied during installation.

Alternatively, both versions can co-exist (on a suitable system); then, the additional parameter -m32 and -m64 can be used in the tuf script. In this case, both sets of libraries are present, preferrably in subdirectories TUF32 and TUF64. In this case, the tuf script checks for the subdirectories and uses the 32 version per default if no parameter is given.

Note that currently, the compilation is done by a script tuf that compiles and links a programme; compile-only requires the option -c, while compile-and-run is done by the option -r.



1as was shown 1970 by W. M. Waite in STAGE2 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98.965&rep=rep1&type=pdf.
2meaning not equal, but with small differences that often break the program.
3In early days, the programmer war trained to do arithmetic by bit operations which is no longer needed as most compilers can do a better optimisation.
4The current runtime system has currently implemented only 32-bit rationals for a prove of concept.
5In this case, the runtime must check for overflows and terminate the programme once an overflow is detected.
6Raymond T. Boute: The Euclidean Definition of the Functions div and mod. ACM TOPLAS vol.14 no.2 April 1992, pp. 127-144.
7Also among the conditions is tuf, because he extends the criteria to real numbers. The condition -c had probably been left out to allow real numbers for the remainder.
8Coincidently, in the current implementation, a map without elements needs slightly less memory than a row without elements.
9in contrast to AWK, where using the key for a map creates the cell
10Flow analysis could nevertheless give a compile time warning
11Actually, the last condition should read tuf or -c if the latter is supported by the compiler
12The first runtime uses such a construct to save space for short strings not exceeding 6 characters.
13like the lifeboats are removed once a ship is used daily
14An empty string is not void, thus it is stored in the row as any other value.
15[https://www.man7.org/linux/man-pages/man3/printf.3.html]
16If the fields are a chained list of name-value pairs, use a second list for name-function ref pairs.
17except that is is not possible to extend the language later to allow -r for equal or less