2. Lexical Elements¶
This chapter discusses the lexical structure of the ArkTS programming language and the analytical conventions.
2.1. Use of Unicode Characters¶
The ArkTS programming language uses characters of the Unicode Character set [1] as its terminal symbols. It represents text in sequences of 16-bit code units using the Unicode UTF-16 encoding.
The term Unicode code point is used in this specification only where such representation is relevant to refer the reader to Unicode Character set and UTF-16 encoding. Where such representation is irrelevant to the discussion, the generic term character is used.
2.2. Lexical Input Elements¶
The language has lexical input elements of the following types:
Tokens, and
2.3. White Spaces¶
White spaces are lexical input elements that separate tokens from one another. Whitespaces never occur within a token. White spaces include the following:
Space (U+0020),
Horizontal tabulation (U+0009),
Vertical tabulation (U+000B),
Form feed (U+000C),
No-break space (U+00A0), and
Zero-width no-break space (U+FEFF).
White spaces improve source code readability and help avoiding ambiguities. White spaces are ignored by the syntactic grammar, but can occur within a comment.
2.4. Line Separators¶
Line separators are lexical input elements that divide sequences of Unicode input characters into lines. Line separators include the following:
Newline character (U+000A or ASCII <LF>),
Carriage return character (U+000D or ASCII <CR>),
Line separator character (U+2028 or ASCII <LS>), and
Paragraph separator character (U+2029 or ASCII <PS>).
Line separators separate tokens from one another and improve source code readability. Any sequence of line separators is considered a single separator.
2.5. Tokens¶
Tokens form the vocabulary of the language. There are four classes of tokens:
Token is the only lexical input element that can act as a terminal symbol of the syntactic grammar. In the process of tokenization, the next token is always the longest sequence of characters that form a valid token. Tokens are separated by white spaces (see White Spaces). Without white spaces, tokens merge into a single token. White spaces are ignored by the syntactic grammar.
Line separators are often treated as white spaces, except where line separators have special meanings. See Semicolons for more details.
2.6. Identifiers¶
An identifier is a sequence of one or more valid Unicode characters. The Unicode grammar of identifiers is based on character properties specified by the Unicode Standard.
The first character in an identifier must be ‘$’, ‘_’, or any Unicode code point with the Unicode property ‘ID_Start’[2]. Other characters must be Unicode code points with the Unicode property, or one of the following characters:
‘$’ (\U+0024),
‘Zero-Width Non-Joiner’ (<ZWNJ>, \U+200C), or
‘Zero-Width Joiner’ (<ZWNJ>, \U+200D).
Identifier:
IdentifierStart IdentifierPart \*
;
IdentifierStart:
UnicodeIDStart
| '$'
| '_'
| '\\' EscapeSequence
;
IdentifierPart:
UnicodeIDContinue
| '$'
| <ZWNJ>
| <ZWJ>
| '\\' EscapeSequence
;
2.7. Keywords¶
Keywords are the reserved words that have permanently predefined meanings in ArkTS. Keywords are always lowercase. Keywords can be of four kinds as discussed below.
1. The following keywords are reserved in any context (hard keywords), and cannot be used as identifiers:
abstract |
else |
internal |
static |
as |
enum |
launch |
switch |
assert |
export |
let |
super |
async |
extends |
native |
this |
await |
false |
new |
throw |
break |
final |
null |
true |
case |
for |
override |
try |
class |
function |
package |
while |
const |
if |
private |
|
constructor |
implements |
protected |
|
continue |
import |
public |
|
do |
interface |
return |
2. The following words have special meaning in certain contexts (soft keywords) but are valid identifiers elsewhere:
catch |
get |
out |
throws |
declare |
in |
readonly |
type |
default |
instanceof |
rethrows |
typeof |
finally |
keyof |
set |
|
from |
of |
struct |
3. The following words cannot be used as user-defined type names but are not otherwise restricted:
boolean |
double |
number |
void |
byte |
float |
short |
|
bigint |
int |
string |
|
char |
long |
undefined |
4. The following identifiers are also treated as soft keywords reserved for the future use (or used in TS):
is |
var |
yield |
2.8. Operators and Punctuators¶
Operators are tokens that denote various actions to be performed on values: addition, subtraction, comparison, and other. The keywords instanceof and typeof also act as operators.
Punctuators are tokens that separate, complete, or otherwise organize program elements and parts: commas, semicolons, parentheses, square brackets, etc.
The following character sequences represent operators and punctuators:
&= |
== |
?? |
||||||
+ |
& |
+= |
|= |
< |
?. |
( |
) |
|
- |
| |
-= |
^= |
&& |
> |
!. |
[ |
] |
* |
^ |
*= |
<<= |
|| |
=== |
<= |
{ |
} |
/ |
>> |
/= |
>>= |
++ |
= |
>= |
, |
; |
% |
<< |
%= |
>>>= |
-- |
! |
... |
. |
: |
2.9. Literals¶
Literals are representations of certain value types.
Literal:
IntegerLiteral
| FloatLiteral
| BigIntLiteral
| BooleanLiteral
| StringLiteral
| TemplateLiteral
| NullLiteral
| UndefinedLiteral
| CharLiteral
;
See Character Literals for the experimental char literal.
2.9.1. Integer Literals¶
Integer literals represent numbers that do not have a decimal point or an exponential part. Integer literals can be written with bases 16 (hexadecimal), 10 (decimal), 8 (octal), and 2 (binary).
IntegerLiteral:
DecimalIntegerLiteral
| HexIntegerLiteral
| OctalIntegerLiteral
| BinaryIntegerLiteral
;
DecimalIntegerLiteral:
'0'
| [1-9] ('_'? [0-9])*
;
HexIntegerLiteral:
'0' [xX] ( HexDigit
| HexDigit (HexDigit | '_')* HexDigit
)
;
HexDigit:
[0-9a-fA-F]
;
OctalIntegerLiteral:
'0' [oO] ( [0-7] | [0-7] [0-7_]* [0-7] )
;
BinaryIntegerLiteral:
'0' [bB] ( [01] | [01] [01_]* [01] )
;
It is presented by the examples below:
1 153 // decimal literal
2 1_153 // decimal literal
3 0xBAD3 // hex literal
4 0xBAD_3 // hex literal
5 0o777 // octal literal
6 0b101 // binary literal
The underscore character ‘_’ after a base prefix or between successive digits can be used to denote an integer literal and improve readability. Underscore characters in such positions do not change the values of literals. However, an underscore character must not be the very first or the very last symbol of an integer literal.
Integer literals are of type int if the value can be represented by a 32-bit number; it is of type long otherwise. In variable and constant declarations, an integer literal can be implicitly converted to another integer or char type (see Type Compatibility with Initializer). In all other places an explicit cast must be used (see Cast Expressions).
2.9.2. Floating-Point Literals¶
Floating-point literals represent decimal numbers and consist of a whole-number part, a decimal point, a fraction part, an exponent, and a float type suffix:
FloatLiteral:
DecimalIntegerLiteral '.' FractionalPart? ExponentPart? FloatTypeSuffix?
| '.' FractionalPart ExponentPart? FloatTypeSuffix?
| DecimalIntegerLiteral ExponentPart FloatTypeSuffix?
;
ExponentPart:
[eE] [+-]? DecimalIntegerLiteral
;
FractionalPart:
[0-9]
| [0-9] [0-9_]* [0-9]
;
FloatTypeSuffix:
'f'
;
It is presented by the examples below:
1 3.14
2 3.14f
3 3.141_592
4 .5
5 1e10
6 1e10f
The underscore character ‘_’ after a base prefix or between successive digits can be used to denote a floating-point literal and improve readability. Underscore characters in such positions do not change the values of literals. However, an underscore character must not be the very first and the very last symbol of an integer literal.
A floating-point literal is of type float if float type suffix is present. Otherwise, it is of type double (type number is an alias to double).
A compile-time error occurs if a non-zero floating-point literal is too large for its type.
A floating-point literal in variable and constant declarations can be implicitly converted to type float (see Type Compatibility with Initializer).
2.9.3. BigInt Literals¶
BigInt literals represent integer numbers with unlimited number of digits. BigInt literals use decimal base only. A BigInt literal is a sequence of digits followed by the symbol ‘n’:
BigIntLiteral:
'0n'
| [1-9] ('_'? [0-9])* 'n'
;
It is presented by the examples below:
153n // bigint literal
1_153n // bigint literal
The underscore character ‘_’ used between successive digits can be used to denote a BigInt literal and improve readability. Underscore characters in such positions do not change the values of literals. However, an underscore character must not be the very first or the very last symbol of a BigInt literal.
BigInt literals are always of type bigint.
Strings that represent numbers or any integer values can be converted to bigint by using the built-in functions:
BigInt (other: string): bigint
BigInt (other: long): bigint
Two other static methods allow taking bitsCount lower bits of a BigInt number and return them as a result. Signed and unsigned versions are both possible:
BigInt.asIntN(bitsCount: long, bigIntToCut: bigint): bigint
BigInt.asUintN(bitsCount: long, bigIntToCut: bigint): bigint
2.9.4. Boolean Literals¶
The two Boolean literal values are represented by the keywords
true
and false
.
1 BooleanLiteral:
2 ’true’ | ’false’
3 ;
Boolean literals are of type boolean.
2.9.5. String Literals¶
String literals consist of zero or more characters enclosed between single or double quotes. A special form of string literals is template literal (see Template Literals).
String literals are of type string. Type string is a predefined reference type (see string Type).
StringLiteral:
'"' DoubleQuoteCharacter* '"'
| '\'' SingleQuoteCharacter* '\''
;
DoubleQuoteCharacter:
~["\\\r\n]
| '\\' EscapeSequence
;
SingleQuoteCharacter:
~['\\\r\n]
| '\\' EscapeSequence
;
EscapeSequence:
['"bfnrtv0\\]
| 'x' HexDigit HexDigit
| 'u' HexDigit HexDigit HexDigit HexDigit
| 'u' '{' HexDigit+ '}'
| ~[1-9xu\r\n]
;
Normally, characters in string literals represent themselves. However, certain non-graphic characters can be represented by explicit specifications or Unicode codes. Such constructs are called escape sequences.
Escape sequences can represent graphic characters within a string literal,
e.g., single quotes ‘’
’, double quotes ‘”
’, backslashes ‘\
’, and
some others.
An escape sequence always starts with the backslash character ‘\
’, followed
by one of the following characters:
”
(double quote, U+0022),
'
(neutral single quote, U+0027),
b
(backspace, U+0008),f
(form feed, U+000c),n
(linefeed, U+000a),r
(carriage return, U+000d),t
(horizontal tab, U+0009),v
(vertical tab, U+000b),\
(backslash, U+005c),x
and two hexadecimal digits (like7F
),u
and four hexadecimal digits (forming a fixed Unicode escape sequence like\u005c
),u{
and at least one hexadecimal digit followed by}
(forming a bounded Unicode escape sequence like\u{5c}
), andany single character except digits from ‘1’ to ‘9’ and characters ‘x’, ‘u’, CR, and LF.
The examples are provided below:
1 let s1 = 'Hello, world!'
2 let s2 = "Hello, world!"
3 let s3 = "\\"
4 let s4 = ""
5 let s5 = "don’t do it"
6 let s6 = 'don\'t do it'
7 let s7 = 'don\u0027t do it'
2.9.6. Template Literals¶
Multi-line string literals that can include embedded expressions are called template literals.
A template literal with an embedded expression is a template string.
A template string is not exactly a literal because its value cannot be evaluated at compile time. The evaluation of a template string is called string interpolation (see String Interpolation Expressions).
TemplateLiteral:
'`' (BacktickCharacter | embeddedExpression)* '`'
;
BacktickCharacter:
~[`\\\r\n\]
| '\\' EscapeSequence
| LineContinuation
;
See String Interpolation Expressions for the grammar of embeddedExpression.
An example of a multi-line string is provided below:
1 let sentence = `This is an example of
2 a multi-line string,
3 which should be enclosed in
4 backticks`
Template literals are of type string, which is a predefined reference type (see string Type).
2.9.7. Null Literal¶
Null literal is the only literal to denote a reference without pointing
at any entity. It is represented by the keyword null
.
NullLiteral:
'null'
;
The null literal denotes the null reference that represents the absence of a value. The null literal is, by definition, the only value of type null (see null Type). This value is valid only for types T | null (see Nullish Types).
2.9.8. Undefined Literal¶
Undefined literal is the only literal to denote a reference with a value
that is not defined. Undefined literal is the only value of type
undefined (see undefined Type). It is represented by the keyword
undefined
.
UndefinedLiteral:
'undefined'
;
2.11. Semicolons¶
Declarations and statements are usually terminated by a line separator (see Line Separators). In some cases, a semicolon must be used to separate syntax productions written in one line, or to avoid ambiguity.
1 function foo(x: number): number {
2 x++;
3 x *= x;
4 return x
5 }
6
7 let i = 1
8 i-i++ // one expression
9 i;-i++ // two expressions
2.10. Comments¶
Comment is a piece of text added in the stream to document and compliment the source code. Comments are insignificant for the syntactic grammar.
Line comments begin with the sequence of characters ‘//’ and end with the last line separator character. Any character or sequence of characters between them is allowed but ignored.
Multi-line comments begin with the sequence of characters ‘/*’ and end with the first subsequent sequence of characters ‘*/’. Any character or sequence of characters between them is allowed but ignored.
Comments cannot be nested.