<div class="toolbar" macro="toolbar +finishEditing -cancelEditing"></div>
<div class="title" macro="view title"></div>
<div class="editor" macro="edit title"></div>
<div class="editor" macro="edit text"></div>
<div class="editor" macro="edit tags">Tags: </div>
<div class="editor" macro="edit parenttitle">Parent:</div>
<div class="editorFooter"><span macro="message views.editor.tagHint"></span><span macro="tagChooser"></span></div>
<div id='navigationArea'>
	<div id='navigationCollapser' macro='collapseArea navigationPanel'></div>
	<div id='navigationPanel' refresh='content' force='true' tiddler='NavigationPanel'></div>
<div id='sidebar'>
	<div id='sidebarCollapser' macro='collapseArea sidebarContent true "Side Bar" "»«"'></div>
	<div id='sidebarContent'>
		<div id='sidebarTabs' refresh='content' force='true' tiddler='SideBarTabs'></div>
<div id='displayArea'>
	<div id='navAndSearch'>
		<div id='navToolbar'>
			<span id='navButtons' macro='navigate'></span>
			<div id='breadcrumbsBox'></div>
		<span id='search' macro='search'></span>
	<div id='pageWrapper'>
		<div id='messageArea'></div>
		<div id='tiddlerDisplay'></div>
<div class='toolbar'><span macro='toolbar startEditing deleteArticle > references <'></span></div>
<div class='title' macro='view title'></div>
<div class='articleContentContainer' macro='view text wikified'></div>

Pragmatism is when structure is disregarded because you are too lazy to build a (new) tool.

This is a wiki containing a compilation of my personal coding guidelines and best-practice patterns, collected over about 40 years as a developer.
|Baklava Code||Code with too many layers. <br>Baklava is a delicious pastry made with many paper-thin layers of phyllo dough. While thin layers are fine for a pastry, thin software layers don’t add much value, especially when you have many such layers piled on each other. Each layer has to be pushed onto your mental stack as you dive into the code. Furthermore, the layers of phyllo dough are permeable, allowing the honey to soak through. But software abstractions are best when they don’t leak. When you pile layer on top of layer in software, the layers are bound to leak.|
|Banana Banana Banana||Placeholder text indicating that documentation is in progress or yet to be completed.<br>Mostly used because FxCop complains when a public function lacks documentation.<br>/// <summary><br>/// banana banana banana<br>/// </summary><br>public void DoSomething()<br>See also: urbandictionary.com<br>See also: [[Undocumentation]]|
|Golden Hammer||The tendency to use one familiar tool or technology for every problem: "If all you have is a hammer, everything looks like a nail."|
|Happy Path||The default scenario in which everything works and no error or edge case occurs. Code that is only tested along the happy path breaks on the first unexpected input.|
|Heisenbug||A bug that disappears as soon as you try to analyze it (e.g. under a debugger), but returns in production.|
|Hydra Code ||A bug fix that creates two new bugs.|
|Jenga Code||Code that doesn't allow even small changes without breaking a whole dependency chain.|
|Megamoth||MEGA MOnolithic meTHod<br>A single method whose code spans many pages (a sign of missing architecture).|
|Mushroom Management||Keeping the team in the dark and feeding them manure: management withholds information and decisions from developers.|
|Picnic||Problem in chair, not in computer.|
|Refuctoring||The process of taking a well-designed piece of code and, through a series of small, reversible changes, making it completely unmaintainable by anyone except yourself.|
|Usage Of Internal Knowledge||Code that relies on internal implementation details of another component instead of its public contract; it breaks as soon as those internals change.|
|Voodoo Programming<br>(Programming by Accident)||A "developer" who uses trial & error (or stitches snippets from the web together) until the desired result comes out.<br>He does not understand…<br>…how and why the result was created<br>…whether his "solution" is efficient<br>…whether he used an API the right way|
|WET Code<br>(vs. DRY Code)||WET = "write everything twice" or "we enjoy typing"<br>DRY = "don't repeat yourself"|
|Yoda Conditions||Writing the constant first in a comparison, e.g. if (42 == x) instead of if (x == 42); named after Yoda's inverted sentence structure. In C-like languages this guards against accidentally typing = instead of ==.|
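A minimal sketch of the WET/DRY row above, in Python (the helper names are made up for illustration): the same trimming rule, first duplicated ("write everything twice"), then factored out.

```python
# WET: the identical normalization logic is written twice.
def clean_first_name(raw):
    value = raw.strip()
    return value if value else None

def clean_last_name(raw):
    value = raw.strip()
    return value if value else None

# DRY: one helper carries the rule, so a change happens in one place.
def clean_name(raw):
    value = raw.strip()
    return value if value else None
```

With DRY, changing the rule (e.g. also collapsing inner whitespace) is done exactly once instead of in every copy.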

!Considerations For The Best Separator Character

*Should be readable in a simple text editor: The separator glyph should not look like another commonly used glyph 
*Should be reachable from the Keyboard 
*Should be 7-Bit-ASCII 
*Should be rarely used in... 
**...common Text 
**...File Paths / URLs 
**...Code Snippets
|Char|Name|Pros|Cons|h
|"""|"""|Pipe|Well readable. Escaping is improbable (pipes never occur in paths).||
||Tab|ASCII, rarely occurs in contents|Invisible character.|
|,|Comma|US Excel standard.|Much escaping will occur.|
|;|Semicolon|European Excel standard|Much escaping might occur.|
|^|Circumflex|ASCII, rarely occurs in contents||
|°|Degree|rarely occurs in contents|Not ASCII|

!Considerations For The Best Escaping Syntax

|Char|Name|Pros|Cons|h
|`|ASCII 96|Rarely used character.||
|\ |Back slash|C standard.|Very often used in Windows paths => lots of self-escaping.|
|"Foo,Bar;Batz\Ding"|Enclose in quotes|Excel standard. Parser is easy to implement.|Very often used in regular text => lots of self-escaping.|


There is a little-known Excel feature: you can add "sep=" as the very first line to tell Excel which separator character to use. 
Of course, the header then has to move to the 2nd line. 
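A short Python sketch of this "sep=" trick (field names are illustrative): the first line announces the separator, so the real header moves down to the second line.

```python
import csv
import io

# Build a pipe-separated file whose first line tells Excel the separator.
buf = io.StringIO()
buf.write("sep=|\n")
writer = csv.writer(buf, delimiter="|")
writer.writerow(["Key", "Value", "Comment"])          # header on line 2
writer.writerow(["TempDirectory", r"C:\Temp", "no escaping needed"])
content = buf.getvalue()
```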

Separator: Pipe, Escaping: Quotes

TempDirectory|"C:\Temp"|"This pipe | is already escaped, because I'm in quotes" 
LogFile|"""C:\Log Files\Log.txt"""|"Self escaping: Quotes must be escaped by using ""such"" double quotes."

Separator: Pipe, Escaping: Back Slash

TempDirectory|C:\\Temp|This pipe \| had to be escaped by a back slash 
LogFile|"C:\\Log Files\\Log.txt"|Self escaping: back slash must be escaped by using \\

Separator: Pipe, Escaping: ASCII 96

TempDirectory|C:\Temp|This pipe `| had to be escaped by an ASCII 96 
LogFile|"C:\Log Files\Log.txt"|Self escaping: ASCII 96 must be escaped by using ``
|Backslash|sometimes "back slash"|
|Barcode|sometimes "bar code"|
|Bookmark|sometimes "book mark"|
|Database|rarely "data base"|
|--Datasource--|always "data source"|
|--Firstname--|always "first name"|
|--Frontend--|always "front end"|
|--Keyset--|always "key set"|
|Lifetime|sometimes "life time"|
|Lookup|rarely "look up"|
|Lowercase|sometimes "lower case"|
|Metadata|also "meta data"|
|Namespace|never "name space"|
|Placeholder|also "place holder"|
|Timeline|sometimes "time line"|
|Timestamp|also "time stamp"|
|--Toolbar--|always "tool bar"|
|Uppercase|sometimes "upper case"|
|--Username--|always "user name"|
|Watermark|"Water Mark" has a different meaning|
|--Webservice--|always "web service"|
|Website|also "web site"|
|Wildcard|rarely "wild card"|
|Workshop|rarely "work shop"|
!Emitting A Single Error Per Function Call

!!Simplest: Magic Return Value

Return a magic value that by convention means "something went wrong", e.g.
-1, (null) or DateMin.
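A minimal Python sketch of the magic-value convention (the helper and the format string are made up for illustration):

```python
from datetime import datetime

DATE_MIN = datetime.min  # the magic "something went wrong" value

def parse_timestamp(text):
    # Instead of raising, return the magic value; the caller must
    # know the convention and check for it.
    try:
        return datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return DATE_MIN
```

The obvious drawback: nothing forces the caller to check for the magic value.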

!!Very Simple: Return Code

* Do not return the return code as the function's return value; use an "out" (by-reference) parameter named "returnCode" instead

Public Shared Sub DoSomething(someArguments..., ByRef returnCode As Integer)
  returnCode = 0 ' Important: always explicitly initialize the ByRef returnCode to 0!
  If (fehlerAufgetreten) Then
    returnCode = -SystemReturnCode.AccessDenied ' Prefix a minus sign on errors!
    Exit Sub
  End If
  ...
End Sub

!!Sophisticated: Return Code Plus Textual Information

* Separate the message template (containing placeholders) from values (to resolve the placeholders)

Public Shared Function TrySomething(
  ByRef returnCode As Integer, ByRef statusMessageTemplate As String, ByRef statusMessageArgs As Object(),
  moreStuff As Integer, ...
) As String
  If (broken) Then
    returnCode = -1234
    statusMessageTemplate = "Value for ""{foo}"" not available, because of {bar}!"
    statusMessageArgs = {foo, bar}
    Return Nothing
  End If
  ...
End Function
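A Python sketch of the same pattern: Python has no ByRef parameters, so the return code and the message template/args travel back in a tuple instead (all names here are illustrative, not from the original):

```python
def try_something(key):
    # Keep the template (with placeholders) separate from the values,
    # so callers can log, format or localize the message later.
    available = {"timeout": 30}
    if key not in available:
        return (None, -1234,
                'Value for "{foo}" not available, because of {bar}!',
                {"foo": key, "bar": "missing key"})
    return available[key], 0, None, None

value, code, template, args = try_something("retries")
message = template.format(**args) if code < 0 else None
```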

!!Always Wrong: Throwing Exceptions

!Emitting Multiple Warnings Per Function Call

* Use a callback delegate.

Public Sub DoManyThings(Optional onEmitMessage As Action(Of Integer, String, Object()) = Nothing)
  onEmitMessage?.Invoke(0, "Processing Item {name}", {name})
  ...
End Sub
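A sketch of the same optional-callback pattern in Python; as above, the message stays split into template and args:

```python
def do_many_things(items, on_emit_message=None):
    # Report any number of warnings through the optional callback
    # without interrupting the processing loop.
    results = []
    for name in items:
        if on_emit_message is not None:
            on_emit_message(0, "Processing Item {name}", {"name": name})
        results.append(name.upper())
    return results

messages = []
do_many_things(["a", "b"],
               lambda code, tmpl, args: messages.append(tmpl.format(**args)))
```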
# Only raise "changed" events if something really changed (not just because a setter was called) => use IDs or hash codes to identify the previous/current state
## Explicitly handle the case where the incoming value and the current value are both null. Be careful when using .Equals(), because it throws an exception on null values.
# Do not handle events that you yourself have triggered (even indirectly) => use a semaphore
# If you (must) have many layers, let only the topmost one raise events => at least, do not mix events between layers
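A Python sketch of the first two rules (the class and handler list are made up for illustration): raise the changed event only when the value really differs, handle the null/null case explicitly, and guard against self-triggering with a semaphore flag.

```python
class Model:
    def __init__(self):
        self._value = None
        self._raising = False          # semaphore against self-triggering
        self.changed_handlers = []

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        # Handle "both null" explicitly instead of relying on equality.
        if self._value is None and new_value is None:
            return
        if self._value is not None and self._value == new_value:
            return                      # no real change => no event
        self._value = new_value
        if self._raising:
            return                      # we triggered this ourselves
        self._raising = True
        try:
            for handler in self.changed_handlers:
                handler(new_value)
        finally:
            self._raising = False
```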
!Creating Separate Assemblies

* Not for separating problem domains (use namespaces for that)
* Only for separating dependencies (references to other DLLs)
** e.g. tier separation, isolate dependencies to UI technologies

!Do Not Use A "Modular" Architecture

* It will create a maintenance & security hell
* It's like a step back in history, as if we disliked integrated circuits and built everything from single transistors on a breadboard


Taken from: http://www.parsifalsoft.com/gloss.html

Glossary of Parsing Terms

Accept Action
Backus-Naur Form
Binary operator
Character sets
Character Universe
Characteristic Rule
Characteristic Token
Complement of a set
Completed Rule
Context Free Grammars
Default reductions
Difference of two sets
Error Action
Event Driven Parser
Expansion Rule
Grammar Analysis
Grammar Rule
Grammar Token
LALR parsing
Lexical Scanner
Lookahead Token
LR Parsing
Marked Rule
Nonterminal Token
Null Production
Parameter Assignment
Parser Control Block
Parser Generator
Parser State Stack
Parser Value Stack
Parsing Engine
Reduce Action
Reduce-Reduce Conflict
Reducing Token
Reduction Procedure
Reduction Token
Rule Element
Semantic Action
Semantic Value
Semantically Determined Production
Set Expression
Shift Action
Shift-Reduce Conflict
Syntax Directed Parsing
Terminal Token
Unary Operator
Virtual Production

Action
    A "parser action" is one of the basic executable elements of a parsing engine. In a traditional parsing engine there are four actions: the shift action, the reduce action, the accept action, and the error action. The parsing engines which AnaGram generates use a substantially greater number of actions, in order to provide certain short cuts in the interest of greater speed and greater compactness of the tables which control the action of the parsing engine.

Accept Action
    The accept action is one of the four actions of a traditional parsing engine. The accept action is performed when the parser has succeeded in identifying the goal token for the grammar. When the parser executes the accept action, it returns to the calling program. The accept action is thus the last action of the parsing engine and occurs only once for each successful execution of the parser.

Associativity
    This is a term used to describe binary operators. It describes how you interpret a sequence of operations which all involve the same operator. Consider a - b - c, for instance. Here the convention is that we subtract b from a, and then c from the difference. We say that subtraction is left associative, since if there is a sequence of subtraction operations, the leftmost one is to be performed first. As a second example, consider exponentiation. FORTRAN uses ** as an operator to indicate exponentiation. In order to conform as closely as possible to ordinary mathematical notation, the expression a**b**c is interpreted to mean that first b is raised to the power c, and then the resultant value is used as the power to which a is raised. We say that exponentiation is right associative since the rightmost of a sequence of exponentiation operations is to be performed first. If a programming language does not allow for unparenthesized sequences of a particular operator, the operator is said to be non-associative.

    Another way to view left and right associativity is as implied parenthesization. Left associative operators are parenthesized from the left. Right associative operators are parenthesized from the right. Thus a - b - c = ((a-b) - c) and a**b**c = (a**(b**c))
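The implied parenthesization can be checked directly in Python, which also uses ** for exponentiation and follows the same conventions:

```python
# '-' is left associative: evaluated as ((10 - 4) - 3)
left = 10 - 4 - 3
# '**' is right associative: evaluated as (2 ** (3 ** 2))
right = 2 ** 3 ** 2
```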

    AnaGram offers two ways to specify associativity of binary operators. The first way is to use conventional Backus-Naur Form syntax descriptions. You would use recursion to describe both subtraction and exponentiation. Subtraction, since it is left associative, uses left recursion, and exponentiation, being right associative, uses right recursion. Thus

        difference  -> value | difference, '-', value
        exponential -> value | value, "**", exponential

    could be used to describe differences and exponents.

    The second way to specify associativity is to use an ambiguous grammar and precedence declarations. (See Chapter 9, AnaGram User's Guide.)

Backtracking
    In order to make parse tables more compact and parsers faster, it is common to use default reductions. In case of error, it is necessary to undo default reductions before diagnostics can be properly determined. In AnaGram, this undo operation is called backtracking.

Backus-Naur Form
    Backus-Naur Form, or BNF, is a conventional notation for describing context free grammars. AnaGram uses an extended notation for grammars, which, except for semantically determined productions, can be shown to be equivalent to BNF. The term "BNF" is often used colloquially to refer to a grammar specification.

    AnaGram's syntax specification language differs from BNF in the following respects:

        In conventional BNF, a symbol not enclosed in angle brackets (< >) is taken to represent itself literally. In AnaGram, literal character representations must be enclosed in single quotes and literal strings within double quotes.
        In BNF, the right side of a production is simply a list of symbols without any delimiters or punctuation. AnaGram requires that the rule elements which make up a grammar rule, or right side, be joined by commas.
        BNF makes no provision for identifying reduction procedures or their arguments. AnaGram provides both reduction procedures, introduced by an "=" at the end of the production, and named arguments, introduced by a ":" at the end of any token on the right side of the production.
        AnaGram allows virtual productions to be used freely.
        BNF is "pure" in that if you wish to define a nonterminal token called "digit" to represent the digits from zero to nine, you must provide ten explicit productions to define it. AnaGram treats the concept of "terminal token" as used in language theory as an abstraction, and interposes character set functionality between actual character input and the terminal tokens of BNF. You can define digit to be the character range '0-9', for instance, and AnaGram will determine and define precisely the minimum number of productions necessary, taking into account any other usage of the characters '0-9' in your grammar. This makes your grammar more compact, more manageable, and easier to modify.
        AnaGram allows for semantically determined productions, which provide a significant mechanism for melding semantic analysis with syntactic analysis.

Binary operator
    A binary operator is an operator that works on two operands to create a result. It is often written between its operands and is sometimes called an infix operator. In ordinary programming languages "+" and "-" are binary operators.

    Most programming languages, rather than executing arithmetic operators in simple left to right order, conform to the traditional conventions of ordinary algebra, which dictate that, except where parenthesis indicate otherwise, exponentiation is done before multiplication, multiplication before addition, and addition before comparison. One says that exponentiation is "more tightly binding" than multiplication, multiplication is more tightly binding than addition, and so on. The sense of the word here is that the operator binds together its operands into a single entity. An operand which falls between two different operators is "bound" by the more tightly binding operator. An operator that is more tightly binding than another is also said to have higher precedence.

Character sets
    One of the traditional problems with syntax directed programming is that caused by the fact that most formal language specifications have productions of the form:

        letter -> a | b | c | d ... | z

    Since the letters are not often distinguished in the grammar, this large number of essentially identical productions causes a correspondingly large number of states in the parser. This problem is often attacked by using a "lexical scanner" which simply specifies a "letter token" whose value indicates precisely which letter is intended. This works fine as long as there is nothing in the grammar which distinguishes any letter at all. If there is, and there is usually some special use of h, x, q, o, e, or even a-f, the situation gets more complicated. Often the result is to abandon the use of syntax directed programming for elemental units of the language and use a more complex lexical scanner which identifies names, numbers, key words and operators. Such lexical scanners are often built using conventional programming languages or simple pattern recognizing languages such as LEX.

    AnaGram avoids this problem by incorporating the notion of character set into its input specifications. Briefly, the way it works is the following: You specify the set of characters which make up any ostensibly terminal token. AnaGram then analyzes the overlaps among these definitions and creates a minimal set of tokens which match your specifications exactly. For instance, suppose you have the following definitions:

        letter = 'a-z' + 'A-Z'
        hex digit = '0-9' + 'a-f' + 'A-F'

    AnaGram will define a token for letter which has two productions:

        letter -> 'a-f' + 'A-F' | 'g-z' + 'G-Z'

    With this technique, you have the freedom to specify your grammar in the easiest possible manner, without the penalty of an absurdly large, slow parser.

    Character sets may be specified in terms of ranges of characters, as in the above example, as unions, denoted by "+", as intersections, denoted by "&", as differences, denoted by "-", and as complements, using "~". You may also specify a set consisting of a single character either with a literal character or with a numeric specification. If you specify a character numerically, you may use either decimal, octal or hexadecimal notation, as in C.
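This set algebra can be modeled with Python sets as a sketch; the universe here is just printable ASCII, which is an assumption for the example:

```python
# AnaGram-style character-set expressions as Python set operations.
universe = {chr(c) for c in range(32, 127)}

letter = set("abcdefghijklmnopqrstuvwxyz") | set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
hex_digit = set("0123456789abcdefABCDEF")

overlap = letter & hex_digit        # intersection, AnaGram's '&'
rest = letter - hex_digit           # difference, AnaGram's '-'
non_letter = universe - letter      # complement, AnaGram's '~'

# 'letter' splits into exactly these two disjoint tokens, as in the
# letter -> 'a-f' + 'A-F' | 'g-z' + 'G-Z' example above.
split_ok = (overlap | rest == letter) and not (overlap & rest)
```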

Character Universe
    In ordinary set theory, the "universe" is the set of all entities which may be elements of a set. It must be defined so that the complement of a set can mean something unambiguous. In AnaGram, the character universe normally comprises the ASCII characters and the IBM extended character set, that is, all characters in the range from 0 through 255. If, however, you have defined input characters outside this range, the character universe will be extended to the least character you have used and to the greatest character you have used in your grammar.

Characteristic Rule
    Each parser state is characterized by a particular set of grammar rules, and for each such rule, a marked token which is the next token expected. These rules are called the characteristic rules for the state. In the course of doing grammar analysis, AnaGram determines the characteristic rules for each parser state. After analyzing your grammar, you may inspect the State Definition Table to see the characteristic rules for any state in your parser.

Characteristic Token
    Every state in a parser, except state zero, can be characterized by the one, unique token which causes a jump to that state. That token is called the characteristic token of the state, because to get to that parser state you must have just seen precisely that token in the input. Note that several states could have the same characteristic token.

    When you have a list of states, such as is given by the parser state stack, it is equivalent to a list of characteristic tokens. This list of tokens is the list of tokens that have been recognized so far by the parser. Some of these tokens, of course, may be nonterminal tokens and may thus represent the result of reducing a sequence of previously recognized tokens.

Complement of a set
    In set theory, the complement of a set, S, is the collection of all elements of the character universe which are not elements of S. Using AnaGram's notation, the complement of S is written ~S. The complement operator has higher precedence than the difference, intersection, or union operators.

Completed Rule
    A "completed rule" is a characteristic rule whose index is pointing beyond the last rule element. In other words, it has been completely matched and will be reduced by the next input.

    If a state has more than one completed rule, the decision as to which to reduce is made based on the next input token. If there is only one completed rule in a state, it will be reduced by default unless the default reductions switch has been reset, i.e., turned off. 

Conflict
    Conflicts arise during a grammar analysis when AnaGram cannot determine how to treat a given input token. There are two sorts of conflicts: shift-reduce conflicts and reduce-reduce conflicts. Conflicts may arise either because the grammar is inherently ambiguous, or simply because the grammar analyzer cannot look far enough ahead to resolve the conflict. In the latter case, it is often possible to rewrite the grammar in such a way as to eliminate the conflict. Null productions are a common source of conflict.

    There are a number of ways to deal with conflicts. If you understand the conflict well, you may simply choose to ignore it. When AnaGram encounters a shift-reduce conflict while building parse tables it resolves it by choosing the shift action. When AnaGram encounters a reduce-reduce conflict while building parse tables, it resolves it by selecting the grammar rule which occurred first in the grammar.

    A second way to deal with conflicts is to set operator precedence parameters. If you set these parameters, AnaGram will use them preferentially to resolve conflicts. Any conflicts so resolved will be listed in the Resolved Conflicts window.

    A third way to resolve a conflict is to declare some tokens as sticky or as subgrammars. This is particularly useful for productions whose sole purpose is to skip over uninteresting input.

    The fourth way to deal with conflicts is to rewrite the grammar to eliminate them. Many people prefer this approach since it yields the highest level of confidence in the resulting program. 
Context Free Grammars
    Context free grammars are grammars wherein the definition of a grammar unit, or nonterminal token, does not depend on the context in which the nonterminal is used. That means that if you define "widget" in a certain way, that definition is valid in all contexts in which "widget" might occur. Context free grammars can be represented in Backus-Naur Form. AnaGram's syntax specification language has no facility for representing grammars which are not context free.

Default reductions
    A "default reduction" is a parser action which may be used in your parser in any state which has precisely one completed rule.

    If a given parser state has, among its characteristic rules, exactly one completed rule, it is usually faster to reduce it on any input than to check specifically for correct input before reducing it. The only time this default reduction causes trouble is in the event of a syntax error. In this situation you may get an erroneous reduction. Normally when you are parsing a file, this is inconsequential because you are not going to continue performing semantic actions in the presence of errors. But, if you are using your parser to handle real-time interactive input, you have to be able to continue semantic processing after notifying your user that he has entered erroneous input. In this case you would want default reductions to have been turned off so that productions are reduced only when there is correct input. 
Difference of two sets
    In set theory, the difference of two sets, A and B, is defined to be the set of all elements of A that are not elements of B. In an AnaGram syntax file, you represent the difference of two character sets by using the "-" operator. Thus the difference of A and B is A - B. The difference operator is left associative.

Error Action
    The error action is one of the four actions of a traditional parsing engine. The error action is performed when the parser has encountered an input token which is not admissible in the current state. The further behavior of a traditional parser is undefined.

Event Driven Parser
    An event driven parser is one in which the relationship between the host program and the parser is turned inside out. In a conventional parser, the host program calls the parser, the parser analyzes the complete input text and returns to the host program only when it has finished with the entire input.

    In an event driven parser, the parser does not read its input directly from a file or from memory. Instead, the host program, after initializing the parser, calls it once for each input token. Each time the parser is called, it updates its state appropriately, calls any reduction procedures that need to be called and finally when it needs more input, returns to the host program. The effect is that parsing can occur in parallel with other processing performed by the host program. This technique is especially useful in situations where the token stream to be parsed is developed on the fly, as when using lexical scanners, for instance.
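A minimal Python sketch of this calling convention only (not a real grammar; the class name and the use of None as an end-of-input marker are assumptions): the host initializes the parser, then pushes one token per call.

```python
class RunningSum:
    """A trivial 'parser' that just accumulates tokens, to show the
    push-style interface: feed() returns after every single token."""

    def __init__(self):
        self.total = 0
        self.done = False

    def feed(self, token):
        # Control returns to the host after each token, so parsing
        # runs in parallel with whatever else the host is doing.
        if token is None:            # end-of-input marker (assumption)
            self.done = True
        else:
            self.total += token

parser = RunningSum()
for token in [1, 2, 3, None]:        # the host drives the loop
    parser.feed(token)
```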

Expansion Rule
    In analyzing a grammar, we are often interested in the full range of input that can be expected at a certain point. The expansion of a token or state shows us all the expected input. An expansion yields a set of marked rules. Looking at the marked token shows us what input to expect.

    The set of expansion rules of a nonterminal token shows all the expected input that can occur whenever the token appears in the grammar. The set consists of all the grammar rules produced by the token, plus all the rules produced by the first token of any rule in the set. The marked tokens for the expansion rules of a token are always the first element in the rule.

    The expansion of a state consists of its characteristic rules plus the expansion rules of the marked token in each characteristic rule.

Grammar
    Traditionally, a grammar is a set of productions which taken together specify precisely a set of acceptable input sequences in terms of an abstract set of terminal tokens. The set of acceptable input sequences is often called the "language" defined by the grammar.

    In AnaGram, the term grammar also includes configuration segments, definitions of character sets, virtual productions, etc. that augment the collection of productions. A grammar is often called a "syntax", and the rules of the grammar are often called syntactic rules.

Grammar Analysis
    The major function of AnaGram is the analysis of context free grammars written in a particular variant of Backus-Naur Form.

    The analysis of a grammar proceeds in four stages. In the first, the input grammar is analyzed and a number of tables are built which describe all of the productions and components of the grammar.

    In the second stage, AnaGram analyzes all of the character sets defined in the grammar, and where necessary, defines auxiliary tokens and productions.

    In the third stage, AnaGram identifies all of the states of the parser and builds the go-to table for the parser.

    In the fourth stage, AnaGram identifies reduction tokens for each completed grammar rule in each state and checks for conflicts.

Grammar Rule
    A "grammar rule" is the right side of a production. It consists of a sequence of rule elements. Each rule element identifies some token, which can be either a terminal token or nonterminal token. It is "matched" by a corresponding sequence of tokens in the input stream to the parser. The left side of the production identifies one or more nonterminal tokens, or reduction tokens, to which the rule reduces when matched. If there is more than one reduction token, there should be a reduction procedure to make the choice.

Grammar Token
    The grammar token is the token which represents the "top level" in your grammar. Some people refer to it as the "goal" or "goal token" and others as the "start token". Whatever it is called, it is the single token which describes the complete input to your parser.

Intersection of two sets
    In set theory, the intersection of two sets, A and B, is defined to be the set of all elements of A which are also elements of B. In an AnaGram syntax file, the intersection of two character sets is represented with the "&" operator. The intersection operator has lower precedence than the complement operator, but higher precedence than the union and difference operators. The intersection operator is left associative.

LALR parsing
    LALR, or "lookahead LR" parsing, a slightly less powerful variant of LR parsing, produces much smaller parsing tables. Thus an LALR parser has very significant advantages in size over its LR counterpart. It is possible to construct grammars which can be parsed by an LR parser but cannot be parsed by an LALR parser. That is, the LALR parser will find these grammars to be ambiguous and the LR parser will not. In practice, however, such grammars seem to be rare and the ambiguities can usually be eliminated by minor revisions.

    AnaGram generates LALR parsers and, in addition, uses LALR parsing to interpret syntax files.

Lexical Scanner
    A lexical scanner is a program or procedure used as a preprocessor for a parser. It scans input characters and lumps them together into tokens. Many systems for generating parsers, most notably YACC, can scarcely be used without a lexical scanner.

    AnaGram, although it can support the use of a lexical scanner, does not need a lexical scanner for most practical problems. AnaGram avoids the need for a lexical scanner by using character sets and disregard and lexeme statements.

    If your problem does in fact need a lexical scanner, you can use AnaGram itself to write it, so you don't need to know different languages for the scanner and for the parser.

Lookahead Token
    The current input token to a parser is often called the "lookahead" token. In many situations, it is not shifted into the input buffer immediately, but acts like a catalyst to cause a number of rules to be reduced before it is eventually shifted in.

LR Parsing
    An LR(k) parser is a type of deterministic bottom-up shift-reduce parser using at most k lookahead symbols to determine its next action at any point in the parse. If (k) is omitted, then k is assumed to be 1. Discussion of the technical details of LR(k) parsing is beyond the scope of this manual. The interested reader may consult any of a variety of works covering the subject.

    AnaGram produces parsers which employ a variant of LR parsing known as LALR parsing. It is not necessary to understand the details of LR and LALR parsing theory in order to use AnaGram.

Marked Rule
    A "marked rule" is a grammar rule together with a "marked token" that indicates how much of the rule has already been matched and what input should be expected, starting with the marked token, if the remainder of the rule is to be matched. Each parser state is defined by a small number of characteristic rules which indicate what matches have already been made and what input can be expected. Note that when a state has more than one characteristic rule, any two characteristic rules with the same number of tokens to the left of the marked token match identically up to the marked token. Even if one has fewer tokens to the left than the other, its tokens match the other exactly up to the marked token.

    When marked rules are displayed in AnaGram windows, the marked token is displayed in a distinctive font which the developer can select.

Non-associative
    A binary operator, say #, is said to be non-associative if the sequence x # y # z is not permissible. If this sequence is to be interpreted as (x#y)#z, the operator is said to be left associative. If the sequence is to be interpreted as x#(y#z), the operator is said to be right associative. (See associativity.)

Nonterminal Token
    A "nonterminal token" is a token which results from reducing the sequence of tokens which match a grammar rule and replacing them with the appropriate reduction token. Nonterminal tokens are to be distinguished from terminal tokens or input tokens.

Null Production
    A "null production" is one that has no tokens on the right side whatsoever. Null productions essentially are identified by the first following input token. Null productions are extremely convenient syntactic elements when you wish to make some input optional. For example, suppose that you wish to allow an optional semicolon at some point in your grammar. You could write the following pair of productions:

        optional semicolon -> | ';'

    Note that a null production can never follow a '|'. The above could also be written on multiple lines thus:

        optional semicolon
          ->
          -> ';'

    You can always rewrite your grammar to eliminate null productions if you wish, but you usually pay a price in conciseness and clarity. Sometimes, however, it is necessary to do such a rewrite in order to avoid conflicts, to which null productions are especially prone.

    If you have a null production with no reduction procedure specified, your parser will automatically assign the value zero to the reduction token.

    Null productions can be generated by virtual productions.

Parameter Assignment
    In any rule in an AnaGram grammar, the semantic value of any rule element may be passed to a reduction procedure by means of a parameter assignment. Simply follow the rule element with a colon and a C variable name. The C variable name can then be used in the reduction procedure to refer to the semantic value of the token it is attached to. AnaGram will automatically provide necessary declarations.

    Here is an example of a rule element with a parameter assignment:

      declaration : declaration_descriptor
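    A fuller illustration, combining parameter assignments with a reduction procedure, might look like this sketch (the rule itself is invented, purely to show the notation):

        (int) hex number
          -> hex digit:d                 = d;
          -> hex number:n, hex digit:d   = 16*n + d;

    Here d and n name the semantic values of the matched tokens, and the expression after the '=' computes the value of the reduction token.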

Parser
    A parser is a program, or more likely a procedure within a program, which scans a sequence of input characters or input tokens and accumulates them in an input buffer or stack as determined by a set of productions which constitute a grammar. When the parser discovers a sequence of tokens as defined by a grammar rule, or right side of a production, it "reduces" the sequence to a single reduction token as defined by the left side of the grammar rule. This nonterminal token now replaces the tokens which matched the grammar rule and the search for matches continues. If an input token is encountered which will not yield a match for any rule, it is considered a syntax error and some kind of error recovery may be required to continue. If a match, or reduction action, yields the grammar token, sometimes called the goal token or start token, the parser deems its work complete and returns to whatever procedure may have called it.

    Tokens may have semantic values. If the input values configuration switch is on, your parser will expect semantic values to be provided by the input process along with the token identification code. If the input values switch is off, your parser will take the ascii value of the input character, that is, the actual input code, as the value of the character. When the parser reduces a production, it can call a reduction procedure or semantic action to analyze the values of the constituent tokens. This reduction procedure can then return a value which characterizes the reduced token.

Parser Control Block
    A "Parser Control Block" is a structure which contains all of the data necessary to describe the instantaneous state of an AnaGram parser.

    A programmer may add declarations to the parser control block in an AnaGram grammar by using the extend pcb statement.

Parser Generator
    A parser generator, such as AnaGram, is a program that converts a grammar, a rule-based description of the input to a program, into a conventional, procedural module called a parser. The parsers AnaGram generates are simple C modules which can be compiled on almost any platform. AnaGram parsers are also compatible with C++.

Parser State Stack
    The parser state stack is a stack maintained by an AnaGram parser and which is an integral part of the parsing process. At any point in the parse of your input stream, the parser state stack provides a summary of what has been found so far. In an AnaGram parser, the parser state stack is stored in the Parser Control Block. 

Parser Value Stack
    In parallel with the parser state stack, an AnaGram parser maintains a "value stack", each entry of which corresponds to the semantic value of the token identified at that state. Since the semantic values of different tokens might well have different data types, AnaGram provides the opportunity to define the data type for any token. AnaGram then builds a typedef statement creating a data type which is a union of all the types you have defined. AnaGram uses this data type to define the value stack.

Parsing Engine
    A parser consists of three basic components: A set of syntax tables, a set of reduction procedures and a parsing engine. The parsing engine is the body of code that interprets the parsing table, invokes input functions, and calls the reduction procedures. The Build Parser command configures a parsing engine according to the implicit requirements of the syntax specification and according to the explicit values of the configuration parameters.

    The parsing engine itself is a simple automaton, characterized by a set of states and a set of inputs. The inputs are the tokens of your grammar. Each state is represented by a list of tokens which are admissible in that state and for each token an action to perform and a parameter which further defines the action. In a traditional LALR parser, there are only four actions: the shift action, the reduce action, the accept action and the error action. AnaGram, in doing its grammar analysis, identifies a number of special cases, and creates a number of extra actions which make for faster processing, but which can be represented as combinations of these primitive actions.

    When a shift action is performed, the current state number is pushed onto the parser state stack and the new state number is determined by the current state number and the current lookahead token. Different tokens cause different new states.

    When a reduce action is performed, the length of the rule being reduced is subtracted from the parser stack index and the new state number is read from the top of the parser state stack. The reduction token for the rule being reduced is then used as an input token.

    Each state in the grammar, with the exception of state zero, has a characteristic token which must have been recognized in order to jump to that state. Therefore, the parser state stack, which is essentially a list of state numbers, can also be thought of as a list of token numbers. This is the list of tokens that have been seen so far in the parse of your input stream. Some of these tokens, of course, may be nonterminal tokens which have resulted from reducing other sequences of tokens.

Partition
    If you use character sets in your grammar, AnaGram will compute a "partition" of the character universe. This partition is a collection of non-overlapping character sets such that every one of the sets you have defined can be written as a union of partition sets. Each partition set is identified by a unique reference number called the partition set number.

    Each partition set is assigned a unique token. If one of your character sets requires more than one partition set to represent it, AnaGram will create appropriate productions and add them to your grammar so your parser can make the necessary distinctions.

Precedence
    "Precedence" is an attribute of binary operators and unary operators which can be used to resolve conflicts. Operator precedence is also referred to as binding. Suppose that # and @ are two binary operators. The question is how to interpret x # y @ z. If # has higher precedence than @, it is interpreted as (x#y)@z. On the other hand if # has lower precedence than @, it would be x#(y@z). If # and @ have the same precedence then the question must be resolved by associativity.

    Note that all operators at the same precedence level must have the same associativity.

    The situation is somewhat simpler for unary operators. If # and @ were both prefix operators, or both suffix operators, there would be no ambiguity in interpretation, since neither # @ x nor x # @ offers more than one possible interpretation. The only difficulty arises when both a prefix and a suffix operator are applied to the same operand.

    Suppose that # is a prefix operator and @ is a suffix operator. The question then is how to interpret # x @. If # has higher precedence than @, it would be interpreted as (# x) @. On the other hand, if # has lower precedence than @, it would be interpreted as # (x @). If # and @ have the same precedence then the question must be resolved by associativity.

Production
    Productions are the mechanism used to describe how complex input structures are built up out of simpler ones. Each production has a left side and a right side. The right side, or grammar rule, is a sequence of rule elements, which may represent either terminal tokens or nonterminal tokens. The left side is a list of reduction tokens. In most cases there would be only a single reduction token. Productions with more than one token on the left side are called semantically determined productions. The "->" symbol is used to separate the left side from the right side.

Reduce Action
    The reduce action is one of the four actions of a traditional parsing engine. The reduce action is performed when the parser has succeeded in matching all the elements of a grammar rule and the next input token is not erroneous. Reducing the grammar rule amounts to subtracting the length of the rule from the parser stack index, identifying the reduction token, stacking its semantic value and then doing a shift action with the reduction token as though it had been input directly.

Reduce-Reduce Conflict
    A grammar has a "reduce-reduce" conflict at some state if a single token turns out to be a reducing token for more than one completed rule.

Reducing Token
    In a state with more than one completed rule, your parser must be able to determine which one was actually found. AnaGram deals with this problem by looking at all the states the parser will branch to once each rule is reduced. The acceptable input tokens for those states are the "reducing tokens" for the completed rules in the state under investigation. If there is a single token which is a reducing token for more than one rule, then the grammar is said to have a reduce-reduce conflict at that state. If in a particular state there is both a shift action and a reduce action for the same token the grammar is said to have a shift-reduce conflict in that state.

    A reducing token is not the same as a reduction token.

Reduction Procedure
    A "reduction procedure" is a function you write which your parser executes when it has identified the grammar rule to which the reduction procedure is attached in your grammar. There are two formats for reduction procedures, depending on the size and complexity of the procedure. A reduction procedure is often referred to as a "semantic action".

Reduction Token
    A token which appears on the left side of a production is called a reduction token. It is so called because when the grammar rule on the right side of the production is matched in the input stream, your parser will "reduce" the sequence of tokens which matches the rule by replacing the sequence of tokens with the reduction token. Note that, if more than one reduction token is specified, your reduction procedure should choose the exact one. If it does not, your parser will use the leftmost syntactically correct one in the list as the default.

    A reduction token is not the same as a reducing token.

Resynchronization
    "Resynchronization" is the process of getting your parser back in step with its input after encountering a syntax error. As such, it is one method of error recovery. Of course, you would resynchronize only if it is necessary to continue after the error. There are several options available when using AnaGram. You could use the auto resynch option, which causes AnaGram to incorporate an automatic resynchronizing procedure into your parser, or you could use the error resynch option, which is similar to the technique used by YACC programmers.

Rule Element
    A grammar rule is a list of "rule elements", separated by commas. Rule elements may be token names, character sets, keywords, immediate actions, or virtual productions. When AnaGram encounters a rule element for which no token presently exists, it creates one.

Semantic Action
    A semantic action is a piece of C or C++ code that is executed when a grammar rule has been identified by a parser. Semantic actions are also called reduction procedures, since they are executed when a grammar rule is reduced to the token on the left side of the production.

Semantic Value
    Tokens, whether terminal or nonterminal, may have a semantic value. In the case of terminal tokens, this may be a value assigned by the lexical scanner or, if the parser is using direct character input, it will be the ascii value of the character itself. The values of nonterminal tokens are created by reduction procedures. As a parse progresses, token values are shifted onto the stack, so that in a reduction procedure, the values of the tokens that comprise the grammar rule that is being reduced are available for inspection.

Semantically Determined Production
    A "semantically determined production" is one which has more than one reduction token specified on the left side of the production. You would write such a production when the reduction tokens are syntactically indistinguishable. The reduction procedure may then specify which of the listed reduction tokens the grammar rule is to reduce to based on semantic considerations. If there is no reduction procedure, or the reduction procedure does not specify a reduction token, the parser will use the leftmost syntactically correct token.

Set Expression
    A set expression is an algebraic expression used to define a character set in terms of individual characters, ranges of characters, and/or other sets of characters as constructed using complements, unions, intersections, and differences.

Shift Action
    The shift action is one of the four actions of a traditional parsing engine. The shift action is performed when the input token matches one of the acceptable input tokens for the current parser state. The semantic value of the token and the current state number are stacked, the parser stack index is incremented and the state number is set to a value determined by the previous state and the input token.

Shift-Reduce Conflict
    A "shift-reduce" conflict occurs if in some state there is a single token that should be shifted, because it is legitimate input for one of the rules of the state, but should also be used to reduce some other rule because it is a reducing token for that rule.

Syntax Directed Parsing

    Syntax directed parsing, or formal parsing, is an approach to building parsers based on formal language theory. Given a suitable description of a language, called a grammar, there are algorithms which can be used to create parsers for the language automatically. In this context, the set of all possible inputs to a program may be considered to constitute a language, and the rules for formulating the input to the program constitute the grammar for the language.

    The parsers built from a grammar have the advantage that they can recognize any input that conforms to the rules, and can reject as erroneous any input that fails to conform.

    Since the program logic necessary to parse input is often extremely intricate, programs which use formal parsing are usually much more reliable than those built by hand. They are also much easier to maintain, since it is much easier to modify a grammar specification than it is to modify complex program logic.

Terminal Token
    A "terminal token" is a token which does not appear on the left side of a production. It represents, therefore, a basic unit of input to your parser. If the input to your parser consists of ascii characters, you may define terminal tokens explicitly as ascii characters or as sets of ascii characters. If you have an input procedure which produces numeric codes, you may define the terminal tokens directly in terms of these numeric codes.

Token
    Tokens are the units with which your parser works. There are two kinds of tokens: terminal tokens and nonterminal tokens. These latter are identified by the parser as sequences of tokens. The grouping of tokens into more complex tokens is governed by the grammar rules, or productions in your grammar. In your grammar, tokens may be denoted by token names, by virtual productions, by immediate actions, by explicit character representations, or by expressions which yield character sets.

    A "token" should be thought of more or less as being something like a cloakroom token. Imagine you had taken a kindergarten class to the museum and all the kids checked their coats. Each one gets a token. What is the probability of losing one? You take all of the tokens from the children, put them in a paper bag, tie it up, and check it. You get one token in return. We would say that the kids' coats represent the actual input to the system, their individual tokens are the terminal tokens, and your one token which enables you to redeem the paper bag is a nonterminal token. Nonterminal tokens are just single tokens that represent the result of compacting a sequence of more primitive tokens.

    Actually, the word "token" is commonly used to refer to a number of apparently different things. The tokens you use in a grammar are abstract. The tokens identified by a parser are concrete instances of the abstract tokens in the grammar. Furthermore, we have to identify tokens in some manner, so we have token names and token numbers with which to refer to particular tokens. As a result the word "token", in any particular context, may refer directly to either an abstract token, or a concrete instance of an abstract token, or may refer only indirectly to the token by means of a token name or token number.

    As an example, consider a C compiler which has a lexical scanner to identify input tokens. According to Kernighan and Ritchie, there are six classes of C tokens: identifiers, keywords, constants, string literals, operators, and other separators. Consider a particular token, a hexadecimal constant, 0XF00. When the lexical scanner hands this over to the parser, it has to provide an identification code which says what kind of token it is, a hexadecimal constant, as well as the "semantic value" of the token, 3840. The identification code is usually an integer value, determined by the designer of the lexical scanner, which by agreement with the designer of the parser uniquely represents a hexadecimal constant. This identification code can be thought of as an external token number.

    The grammar for the parser, on the other hand, has a token, "HEXconstant", let us say, to represent the abstract usage of hexadecimal constants in the C language. When AnaGram, or any other parser generator for that matter, analyzes the grammar, it assigns its own reference number to identify "HEXconstant". This reference number can be thought of as an internal token number to contrast it with the external token number provided by the lexical scanner. The parsers AnaGram builds contain a table, ag_tcv, which provides a translation from the external to the internal token number. Both of these token numbers, of course, differ from the semantic value of the token, in this case, 3840.

    Even when an AnaGram parser is using simple character input, the word "token" may represent several different things. Suppose a grammar has a token 'A-Z', which represents the set of upper case letters. There will be an internal token number, assigned by AnaGram, which identifies the (abstract) token in parser tables. The actual (concrete) input token to a parser will, however, be a single letter, say "Q". In this case, the ascii value of the character "Q", 81, is used both as the external token number and as the semantic value of the token.

Unary Operator
    A "unary operator" is a token which represents some sort of operation which operates on a single entity to produce a new resultant entity. It is normally represented by a symbol which is written either preceding the operand or following the operand. In the first case it is called a "prefix operator" and in the second case a "suffix operator".

Union
    The union of two sets is the set of all elements that are to be found in one or another of the two sets. In an AnaGram syntax file the union of two character sets A and B is represented using the plus sign, as in A + B. The union operator has the same precedence as the difference operator: lower than that of intersection and complement. The union operator is left associative.

Virtual Production

    Virtual productions are a special shorthand representation of grammar rules which can be used to indicate a choice of inputs. They are an important convenience, especially useful when you are first building a grammar.

    AnaGram rewrites virtual productions, so that when you look at the syntax tables in AnaGram, there will be actual productions replacing the virtual productions.

    A virtual production appears as one of the rule elements in a grammar rule, i.e. as one of the members of the list on the right side of a production.

    The simplest virtual production is the "optional" token. If x is an arbitrary token, x? can be used to indicate an optional x.

    Related virtual productions are x... and x?... where the three dots indicate repetition. x... represents an arbitrary number of occurrences of x, but at least one. x?... represents zero or more occurrences of x.
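    As a sketch of the rewriting (the production names here are invented; AnaGram's generated names differ), these forms correspond to ordinary productions like:

        x?      =>   optional x -> | x
        x...    =>   x sequence -> x | x sequence, x
        x?...   =>   optional x sequence -> | optional x sequence, x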

    The remaining virtual productions use curly or square brackets to enclose a sequence of rules. The brackets may be followed variously by nothing, a string of three dots, or a slash, to indicate the choices to be made from the rules. Note that rules may be used, not merely tokens.

    If r1 through rn are a set of grammar rules, then

       {r1 | r2 | ... | rn}  

    is a virtual production that allows a choice of exactly one of the rules. Similarly,

       {r1 | r2 | ... | rn}...  

    is a virtual production that allows a choice of one or more of the rules. And, finally,

       {r1 | r2 | ... | rn}/...  

    is a virtual production that allows a choice of one or more of the rules subject to the side condition that rules must alternate, that is, that no rule can follow itself immediately without the interposition of some other rule. This is a case that is not particularly easy to write by hand, but is quite useful in a number of contexts.

    If the above virtual productions are written with [ ] instead of { }, they all become optional. [ ] is an optional choice, [ ]... is zero or more choices, and [ ]/... is zero or more alternating choices.

    Null productions are not permitted in virtual productions in those cases where they would cause an intrinsic ambiguity.

    You may use a definition statement to assign a name to a virtual production. 


β”‚ [StructureNavigator]                   β”‚
β”‚ [LocalNavigator]      β”‚ {DetailsSite}  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€                β”‚
β”‚ {PreviewSite}         β”‚ either [Mask]  β”‚
β”‚                       β”‚ or [BigList]   β”‚
β”‚ either [QuickInfo]    β”‚                β”‚
β”‚ or [SmallList]        β”‚                β”‚


{RootEntity}  β–Ά {Sub1Entity}  β–Ά {Sub2Entity} ...


β”‚ {EntityType}                       β”‚
β”‚ ☰ List                            β”‚
β”‚ ✎ Details for "{CurrentEntityId}" β”‚
β”‚ πŸ“οΈŽ {Sub1EntityTypes} (1:N)         β”‚
β”‚ πŸ“οΈŽ {Sub2EntityTypes} (1:N)         β”‚
β”‚ πŸ“οΈŽ ...                             β”‚
β”‚ πŸ“„οΈŽ {SubEntity} (1:1)               β”‚
β”‚ πŸ“„οΈŽ ...                             β”‚


!BigList

* A full-fledged list showing the most wanted attributes of entities as columns, sortable, filterable, offering CRUD operations


!Mask

* A form containing labels and fields showing the attributes of the current entity


!QuickInfo

* A small panel containing most wanted information for an entity that cannot be shown in the cells of the lists (e.g. a photo)


!SmallList

* A list with fewer columns than the BigList - just enough to identify an entity, only for navigating result sets
* By default, filtered to show only the sub entities related to the current entity's (1:N) composition relationship
**This is called the "dependent" filter
**This filter can be deactivated by the user to show all entities of that type


!!When Do The Site's Active Panels Change?

*Only when the selected item in the LocalNavigator changes:

|Row Clicked|DetailsSite brings up...|PreviewSite brings up...|h
|πŸ”ŽοΈŽοΈŽ Search|[BigList]|[QuickInfo]|
|πŸ“„οΈŽοΈŽ Details for...|[Mask]|[SmallList] of current EntityType|
|πŸ“οΈŽοΈŽ AnySubEntityList (1:N)|(no change)|[SmallList] of SubEntity type|
|πŸ“„οΈŽ AnySubEntity (1:1)|(no change)|[QuickInfo] of SubEntity type|

!!When Do Contents Change?

*Only when the selected row in one of the lists changes:

|BigList|shows current entity|shows current entity|-|shows current entity|shows current entity|shows sub entities related to current entity|
|SmallList|(no change)|(no change)|shows current entity*|shows current entity*|(not visible)|-|

*) When the "dependent" filter is deactivated, selecting a row in the SmallList can implicitly select a different principal and so change everything up the dependency tree.

!!"Diving" Down The Dependency Structure

*Via LocalNavigator
**When double clicking any SubEntity item (or pressing enter with it having focus)
*Via SmallList
**When double clicking any row (or pressing enter with it having focus)

After diving, the state of the sites must be as follows:

**Grown to the right, showing the SubEntity
**The former SubEntity is now shown as new root entity
***The SubEntity items now show the dependents of the new root (if any)
**The "Details for..." item must be selected
**Showing the Mask
**Showing a SmallList of the new root entity type 
***The state of the "dependent" filter is not changed, and is still effective here

!1:1 Behaviour

1:1 actually means 1:0..1 - so there might be no existing record related to the principal. Therefore we need Create and Remove functions.

*After diving...
**...the LocalNavigator doesn't have a "List" item
**...the Mask has a local toolbar containing "Create..." and "Remove" Buttons
**...the Mask shows greyed out fields

!!1:1 with mutable target type (inherited types)

*After clicking "Create..."
**...a modal window with stacked buttons appears - the buttons represent the available mutations
**after clicking a button...
***...the sub entity will be created immediately (without an additional "Ok" button)
***...the LocalNavigator will rebuild itself corresponding to the chosen mutation
!Key Value Pairs

|πŸ‘ Good |{{{{}}}<br>{{{  "key1": "value1",}}}<br>{{{  "key2": "value2"}}}<br>{{{}}}}|One object<br>with key => property name, value => property value|Implies that keys are unique.|
|πŸ‘Ž Bad |{{{[}}}<br>{{{  {"key1": "value1"},}}}<br>{{{  {"key2": "value2"}}}}<br>{{{]}}}|Array of dry KVP objects|Only useful if keys are not unique.|
|πŸ‘Ž Bad |{{{[}}}<br>{{{{}}}<br>{{{  "key": "key1",}}}<br>{{{  "value": "value1"}}}<br>{{{},}}}<br>{{{{}}}<br>{{{  "key": "key2",}}}<br>{{{  "value": "value2"}}}<br>{{{}}}}<br>{{{]}}}|Array of wet KVP objects|Maximum overhead.<br>On the other hand, one can see that a dynamic dictionary was serialized and not a static class.|

"Tracing": Forget about it. There is no definitive specification for how "Logging" and "Tracing" differ (1). Always use "Logging".
"Message": Never use just "Message" as a name for variables, parameters, etc., because it is a homonym for "Log-Entry", "MessageTemplate" and "MessageText" (2).

!Anatomy Of A Log Entry

The minimum standard of many logging techniques is: Level + Id + MessageTemplate + Args:

|Part|Example|Remarks|h
|Level|4|Integer. See below.|
|Id|4711|Integer.|
|MessageTemplate|"{thing} not found on {place}."|String. Contains named placeholders (curly braces).|
|Args|["File","Disk"]|Object Array. Contains placeholder values in the same order as in the template.|

!!Ready-To-Use representations

|Representation|Example|Remarks|h
|MessageText|"File not found on Disk."|The readable message, placeholders resolved.|
|FormattedLogEntry|"[ERR]4711:File not found on Disk."|Convention to represent a complete LogEntry as string:<br>[{LevelAsAlpha3}]{Id}:{MessageText}|

!Use Different Concepts & Channels For (at least) These 3 Dimensions: "Application", "Hosting", "Business"

|Dimension|Application|Hosting|Business|h
|Target Group|Programmer|Administrator|User|
|Structure|Time > Assembly|Infrastructure > Time|Process > Time|
|Jargon|"Error", "Exception"|"Error", "Outage"|"Issues", "Failures", "Compensation"|
|Root Causes|Missed a situation at design time|Unexpected state at runtime|User input, data quality|
|Fix|Creating a unit test, applying a patch|Turning the computer off and on again.|Correction, repetition, compensation, improvisation.|
|Transport: Emitting|Not via the signature (ambient logger)|Hosting component (IIS, etc.)|Signature of the BL function|
|Transport: Storing|Log file (or DB)|Log file (or DB)|Should be DB|
|Transport: Consuming|Notepad|Notepad|"Protocol", in the application as list or PDF|


|Alpha3|.net Core Level|.net Trace Level|Color|Criteria for Application|Criteria for Hosting|Criteria for Business|h
|FAT|5 "Fatal"|1 Critical|@@color:Purple;Purple@@|Report an unhandled Exception.|Report unwanted states. Immediate fix necessary.|An issue was detected that can never be compensated or that creates subsequent errors. Higher level process(es) should be aborted.|
|ERR|4 "Error"|2 Error|@@color:red;Red@@|There's a bug in the code.|Report unwanted states. Fix can be postponed.|An issue was detected that must be compensated later. Higher level process(es) do not necessarily have to be aborted.|
|WRN|3 "Warning"|4 Warning|@@color:GoldenRod;DarkYellow@@|Design debts in the code. E.g. usage of obsolete implementations.||An issue was detected that creates limitations, but doesn't need to be compensated promptly. Higher level process(es) need not be aborted.|
|INF|2 "Info"|8 Info|(Default)|Should never be used.||Report overall results ("protocol")|
|DBG|1 "Debug"|16 Verbose|@@color:DimGray;DarkGray@@|Temporarily existing logging for bug investigation.||Report intermediate results that can be helpful for trouble shooting.|
|TRC|0 "Trace"|256 Start|@@color:DarkCyan;DarkCyan@@|Report program flow.||Report process flow.|

!The Requirements Behind These Practices

* The textual message should offer the chance to be localized
** => Therefore we separately transport a template and placeholders. The template can be localized.
* In a log file (or DB) it should be possible to search/filter for specific placeholder name+value pairs (e.g. search for messages that contain "LastName"="Kent")
** => Therefore we don't use just numbered placeholders, but named placeholders.

!Web Links

* [[What is the difference between Tracing and Logging?|https://stackoverflow.com/questions/27244807/what-is-the-difference-between-tracing-and-logging]] - here they call it "Message"
* [[Tracing vs Logging vs Monitoring: What’s the Difference?|https://www.bmc.com/blogs/monitoring-logging-tracing/]] - here they call it "Entry"

!Begin- & End-Blocks With Indentation (Code Snippet)

{{{
Imports System
Imports System.Diagnostics
Imports System.Text

Namespace Foo

  Public Class DebugWriter

    'Current nesting depth; one level = two spaces of indentation
    Private Shared _NestingLevel As Integer

    'Start tick count per nesting level, for duration measurement
    Private Shared _StartTime(255) As Integer

    Private Shared Function IndentAndFormat(prefix As Char, text As String, ParamArray args() As Object) As String
      Dim sb As New StringBuilder(250)
      sb.Append(" "c, _NestingLevel * 2)
      sb.Append(prefix).Append(" ")
      If (args.GetUpperBound(0) = -1) Then
        sb.Append(text)
      Else
        sb.AppendFormat(text, args)
      End If
      Return sb.ToString()
    End Function

    Public Shared Sub WriteBegin(text As String, ParamArray args() As Object)
      _StartTime(_NestingLevel) = Environment.TickCount And Int32.MaxValue
      Debug.WriteLine(IndentAndFormat("β–Ό"c, text, args))
      _NestingLevel += 1
    End Sub

    Public Shared Sub WriteEnd(text As String, ParamArray args() As Object)
      _NestingLevel -= 1
      Dim duration As Integer = (Environment.TickCount And Int32.MaxValue) - _StartTime(_NestingLevel)
      Dim durationTimeSpan As TimeSpan = TimeSpan.FromMilliseconds(duration)
      Debug.WriteLine(IndentAndFormat("β–²"c, text + " [" + durationTimeSpan.ToString() + "]", args))
    End Sub

    Public Shared Sub WriteParaph(text As String, ParamArray args() As Object)
      Debug.WriteLine(IndentAndFormat("♦"c, text, args))
    End Sub

  End Class

End Namespace
}}}

!Ambient Transport Carriers

These are components (or language features) that allow emitting messages outside the signature of a function.

* E.g. the .NET {{{Trace}}} class.
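The same idea exists in other ecosystems. A minimal Python sketch, using the standard {{{logging}}} module as the ambient carrier:

```python
import logging

# Python's standard logging module acts as an ambient transport carrier:
# the function below emits messages without any logging parameter in its
# signature, because the carrier is reachable as module-level state.
logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("Foo")

def compute_total(values):
    # No logger is passed in; the ambient carrier is simply available.
    log.debug("computing total of %d values", len(values))
    return sum(values)

print(compute_total([1, 2, 3]))
```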

|{{{ F4}}}|Collapse current level|
|{{{+F4}}}|Uncollapse current level|

!Essential Plugins

|Customize Toolbar||
|XML Tools||

Taken from: https://tomassetti.me/guide-parsing-algorithms-terminology/

!A Guide to Parsing: Algorithms and Terminology

We have already introduced a few parsing terms, while listing the major tools and libraries used for parsing in Java, C#, Python and JavaScript. In this article we make a more in-depth presentation of the concepts and algorithms used in parsing, so that you can get a better understanding of this fascinating world.

We have tried to be practical in this article. Our goal is to help practitioners, not to explain the full theory. We just explain what you need to know to understand and build a parser.

After the definition of parsing, the article is divided in three parts:

* The Big Picture. A section in which we describe the fundamental terms and components of a parser.
* Grammars. In this part we explain the main formats of a grammar and the most common issues in writing them.
* Parsing Algorithms. Here we discuss all the most used parsing algorithms and say what they are good for.

!Definition of Parsing

    The analysis of an input to organize the data according to the rules of a grammar

There are a few ways to define the meaning of parsing. However, the gist remains the same: parsing means finding the underlying structure of the data we are given.

{{TODO{Image missing}}}

In a way parsing can be considered the inverse of templating: identifying the structure and extracting the data. In templating we have a structure and we fill it with data instead. In the case of parsing you have to determine the model from the raw representation, while for templating you have to combine the data with the model to create the raw representation. The raw representation is usually text, but it can also be binary data.

Fundamentally, parsing is necessary because different entities need the data to be in different forms. Parsing makes it possible to transform data into a form that can be understood by a specific software. The obvious example is programs: they are written by humans, but they must be executed by computers. So humans write them in a form that they can understand, then software transforms them into a form that can be used by a computer.

However parsing might be necessary even when passing data between two software that have different needs. For instance, it is needed when you have to serialize or deserialize a class.

!The Big Picture

In this section we are going to describe the fundamental components of a parser. We are not trying to give you formal explanations, but practical ones.

!!Regular Expressions

    A sequence of characters that can be defined by a pattern

Regular expressions are often touted as the thing you should not use for parsing. This is not strictly correct, because you can use regular expressions for parsing simple input. The problem is that some programmers only know regular expressions, so they use them to try to parse everything, even things they should not. The result is usually a series of regular expressions hacked together that are very fragile.

You can use regular expressions to parse some simpler languages, but this excludes most programming languages, even the ones that look simple enough like HTML. In fact, languages that can be parsed with just regular expressions are called regular languages. There is a formal mathematical definition, but that is beyond the scope of this article.

Though one important consequence of the theory is that regular languages can be parsed or expressed also by a finite state machine. That is to say regular expressions and finite state machines are equally powerful. This is the reason why they are used to implement lexers, as we are going to see later.

A regular language can be defined by a series of regular expressions, while more complex languages need something more. A simple rule of thumb: if a grammar of a language has recursive, or nested, elements it is not a regular language. For instance, HTML can contain an arbitrary number of tags inside any tag; therefore it is not a regular language and it cannot be parsed using solely regular expressions, no matter how clever they are.
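To make this concrete, a small Python sketch: a regular expression copes fine with a flat key=value language (a regular language), but a naive pattern fails on nested tags because it cannot count nesting depth.

```python
import re

# Regular expressions are fine for flat, non-nested input such as
# key=value pairs (a regular language):
pairs = dict(re.findall(r"(\w+)=(\w+)", "host=example port=8080"))
print(pairs)  # {'host': 'example', 'port': '8080'}

# But arbitrarily nested structures defeat a fixed pattern. This naive
# regex stops at the FIRST close tag, not the matching one:
naive = re.search(r"<b>(.*?)</b>", "<b>outer <b>inner</b> text</b>")
print(naive.group(1))  # 'outer <b>inner' -- truncated, not the full element
```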

!!Regular Expressions in Grammars

The familiarity of a typical programmer with regular expressions often leads to their being used to define the grammar of a language. More precisely, their syntax is used to define the rules of a lexer or a parser. For example, the Kleene star (*) is used in a rule to indicate that a particular element can be present zero or more times.

The definition of the rule should not be confused with how the actual lexer or parser is implemented. You can implement a lexer using the regular expression engine provided by your language. However, usually the regular expressions defined in the grammar are converted to a finite-state machine to gain better performance.

!!Structure of a Parser

Having now clarified the role of regular expressions, we can look at the general structure of a parser. A complete parser is usually composed of two parts: a lexer, also known as scanner or tokenizer, and the proper parser. The parser needs the lexer because it does not work directly on the text, but on the output produced by the lexer. Not all parsers adopt this two-step scheme: some parsers do not depend on a separate lexer and combine the two steps. They are called scannerless parsers.

A lexer and a parser work in sequence: the lexer scans the input and produces the matching tokens, the parser then scans the tokens and produces the parsing result.

Let’s look at the following example and imagine that we are trying to parse an addition.

{{{
437 + 734
}}}

The lexer scans the text and finds 4, 3, 7 and then a space ( ). The job of the lexer is to recognize that the characters 437 constitute one token of type NUM. Then the lexer finds a + symbol, which corresponds to a second token of type PLUS, and lastly it finds another token of type NUM.

The parser will typically combine the tokens produced by the lexer and group them.

{{TODO{Image missing}}}

The definitions used by lexers and parsers are called rules or productions. In our example a lexer rule will specify that a sequence of digits corresponds to a token of type NUM, while a parser rule will specify that a sequence of tokens of type NUM, PLUS, NUM corresponds to a sum expression.
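The two steps can be sketched in Python. This is a toy `lex`/`parse_sum` pair built for the example above, not a real parser generator.

```python
import re

# A minimal sketch of the two steps for "437 + 734". The lexer turns
# characters into (type, text) tokens; the parser groups the tokens.
def lex(text):
    tokens = []
    for match in re.finditer(r"\s*(?:(?P<NUM>[0-9]+)|(?P<PLUS>\+))", text):
        kind = match.lastgroup
        tokens.append((kind, match.group(kind)))
    return tokens

def parse_sum(tokens):
    # Parser rule: NUM PLUS NUM -> a sum expression.
    (t1, left), (op, _), (t2, right) = tokens
    assert (t1, op, t2) == ("NUM", "PLUS", "NUM")
    return ("sum", int(left), int(right))

tokens = lex("437 + 734")
print(tokens)            # [('NUM', '437'), ('PLUS', '+'), ('NUM', '734')]
print(parse_sum(tokens)) # ('sum', 437, 734)
```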

It is now typical to find suites that can generate both a lexer and parser. In the past it was instead more common to combine two different tools: one to produce the lexer and one to produce the parser. For example, this was the case of the venerable lex & yacc couple: using lex it was possible to generate a lexer, while using yacc it was possible to generate a parser.

!!Scannerless Parsers

Scannerless parsers operate differently because they directly process the original text, instead of a list of tokens produced by a lexer. That is to say, a scannerless parser works as a lexer and a parser combined.

While it is certainly important to know, for debugging purposes, whether a particular parsing tool is a scannerless parser or not, in many cases it is not relevant to defining a grammar. That is because you usually still define patterns that group sequences of characters to create (virtual) tokens; then you combine them to obtain the final result. This is simply because it is more convenient to do so.

In other words, the grammar of a scannerless parser looks very similar to one for a tool with separate steps. Again, you should not confuse how things are defined for your convenience and how things work behind the scene.

!!Grammar

    A formal grammar is a set of rules that syntactically describes a language

There are two important parts in this definition: a grammar describes a language, but this description pertains only to the syntax of the language and not the semantics. That is to say, it defines its structure, but not its meaning. The correctness of the meaning of the input must be checked, if necessary, in some other way.

For instance, imagine that we want to define a grammar for the language shown in the paragraph Definition of Parsing.

{{{
HELLO: "Hello"
NAME: [a-zA-Z]+
greeting: HELLO NAME
}}}

!Anatomy of a Grammar

To define the elements of a grammar, let us look at an example in the most used format to describe grammars: the Backus-Naur Form (BNF). This format has many variants, including the Extended Backus-Naur Form. The Extended variant has the advantage of including a simple way to denote repetitions. Another notable variant is the Augmented Backus-Naur Form, which is mostly used to describe bidirectional communication protocols.

A typical rule in a Backus-Naur grammar looks like this:

{{{
<symbol> ::= __expression__
}}}

The <symbol> is a nonterminal, which means that it can be replaced by the group of elements on the right, __expression__. The element __expression__ could contain other nonterminal symbols or terminal ones. Terminal symbols are simply the ones that do not appear as a <symbol> anywhere in the grammar. A typical example of a terminal symbol is a string of characters, like "Hello".

A rule can also be called a production rule. Technically, it defines a transformation between the nonterminal, on the left, and the set of nonterminals and terminals, on the right.

!!Types of Grammars

There are mainly two kinds of grammars used in parsing: regular grammars and context-free grammars. Usually each kind of grammar corresponds to the same kind of language: a regular grammar defines a regular language, and so on. However, there is also a more recent kind of grammar called Parsing Expression Grammar (PEG) that is equally powerful as context-free grammars and thus defines a context-free language. The difference between the two is in the notation and the interpretation of the rules.

As we already mentioned, the two kinds of languages are in a hierarchy of complexity: regular languages are simpler than context-free languages.

A relatively easy way to distinguish the two grammars is that the __expression__ of a regular grammar, that is to say the right side of the rule, can only be one of:

* the empty string
* a single terminal symbol
* a single terminal symbol followed by a nonterminal symbol

In practice though this is harder to check, because a particular tool could allow using more terminal symbols in one definition. Then the tool itself will automatically transform this expression into an equivalent series of expressions, all belonging to one of the three mentioned cases.

So you could write an expression that is incompatible with a regular language, but it will be transformed into the proper form. In other words, the tool could provide syntactic sugar for writing grammars.

In a later paragraph we are going to discuss more at length the different kinds of grammars and their formats.

!!Lexer

    A lexer transforms a sequence of characters into a sequence of tokens

Lexers are also known as scanners or tokenizers. Lexers play a role in parsing because they transform the initial input into a form that is more manageable by the proper parser, which works at a later stage. Typically lexers are easier to write than parsers, although there are special cases when both are quite complicated, for instance in the case of C (see the lexer hack).

A very important part of the job of the lexer is dealing with whitespace. Most of the time you want the lexer to discard whitespace. That is because otherwise the parser would have to check for the presence of whitespace between every single token, which would quickly become annoying.

There are cases when you cannot do that, because whitespace is relevant to the language, like in the case of Python where it is used to identify a block of code. Even in these cases, though, usually it is the lexer that deals with the problem of distinguishing the relevant whitespace from the irrelevant one. This means that you want the lexer to understand which whitespace is relevant to parsing. For example, when parsing Python you want the lexer to check if whitespace defines indentation (relevant) or space between the keyword if and the following expression (irrelevant).

!!Where the Lexer Ends and the Parser Begins

Given that lexers are almost exclusively used in conjunction with parsers, the dividing line between the two can be blurred at times. That is because the parsing must produce a result that is useful for the particular need of a program. So there is not just one correct way of parsing something, but you care about only the one way that serves your needs.

For example, imagine that you are creating a program that must parse the logs of a server to save them in a database. For this goal, the lexer will identify the series of numbers and dots and transform them into an IPv4 token.

{{{
IPv4: [0-9]+ "." [0-9]+ "." [0-9]+ "." [0-9]+
}}}

Then the parser will analyze the sequence of tokens to determine if it is a message, a warning, etc.

What would happen instead if you were developing software that had to use the IP address to identify the country of the visitor? Probably you would want the lexer to identify the octets of the address, for later use, and make IPv4 a parser element.

{{{
DOT   : "."
OCTET : [0-9]+
}}}


This is one example of how the same information can be parsed in different ways because of different goals.
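The two designs can be sketched side by side. These are toy regex-based lexers written only to illustrate the difference in granularity.

```python
import re

# Sketch of the two designs for "127.0.0.1". In the first, the lexer
# produces a single IPv4 token; in the second, it produces OCTET and DOT
# tokens and leaves assembling the address to a parser rule.
def lex_whole(text):
    return [("IPv4", m.group()) for m in
            re.finditer(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+", text)]

def lex_octets(text):
    return [(m.lastgroup, m.group()) for m in
            re.finditer(r"(?P<OCTET>[0-9]+)|(?P<DOT>\.)", text)]

print(lex_whole("127.0.0.1"))   # one IPv4 token
print(lex_octets("127.0.0.1"))  # OCTET/DOT tokens for later parser use
```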

!!Parser

In the context of parsing, "parser" can refer both to the software that performs the whole process and also just to the proper parser that analyzes the tokens produced by the lexer. This is simply a consequence of the fact that the parser takes care of the most important and difficult part of the whole process of parsing. By most important we mean what the user cares about the most and will actually see. In fact, as we said, the lexer works as a helper to facilitate the work of the parser.

Whichever sense you pick, the output of the parser is an organized structure of the code, usually a tree. The tree can be a parse tree or an abstract syntax tree. They are both trees, but they differ in how closely they represent the actual code written and the intermediate elements defined by the parser. The line between the two can be blurry at times; we are going to see their differences a bit better in a later paragraph.

The form of a tree is chosen because it is an easy and natural way to work with the code at different levels of detail. For example, a class in C# has one body, this body is made of one statement, the block statement, which is a list of statements enclosed in a pair of curly brackets, and so on…

!!Syntactic vs Semantic Correctness

A parser is a fundamental part of a compiler, or an interpreter, but of course can be part of a variety of other software. For example, in our article Generate diagrams from C# source code using Roslyn we parsed C# files to produce diagrams.

The parser can only check the syntactic correctness of a piece of code, but the compiler can use its output also in the process of checking the semantic validity of the same piece of code.

Let us see an example of code that is syntactically correct, but semantically incorrect.

{{{
int x = 10
int sum = x + y
}}}

The problem is that one variable (y) is never defined and thus, if executed, the program will fail. However the parser has no way of knowing this, because it does not keep track of variables; it just looks at the structure of the code.

A compiler instead would typically traverse the parse tree a first time and keep a list of all the variables that are defined. Then it would traverse the parse tree a second time and check whether the variables that are used are all properly defined. In this example they are not, and it would throw an error. So that is one way the parse tree can also be used by the compiler to check the semantics.
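The two-pass check can be sketched on a toy tree. The tuple shapes (`declare`, `add`, `var`, `num`) are invented here for illustration, not any compiler's real node types.

```python
# Sketch of the two-pass semantic check described above, on a toy parse
# tree for: int x = 10; int sum = x + y
tree = [("declare", "x", ("num", 10)),
        ("declare", "sum", ("add", ("var", "x"), ("var", "y")))]

def used_variables(expr):
    kind = expr[0]
    if kind == "var":
        yield expr[1]
    elif kind == "add":
        for child in expr[1:]:
            yield from used_variables(child)

# First pass: collect defined names. Second pass: check every use.
defined = {name for (_, name, _) in tree}
undefined = [v for (_, _, expr) in tree
             for v in used_variables(expr) if v not in defined]
print(undefined)  # ['y'] -- the compiler would report an error here
```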

!!Scannerless Parser

A scannerless parser, or more rarely a lexerless parser, is a parser that performs the tokenization (i.e., the transformation of a sequence of characters into tokens) and the proper parsing in a single step. In theory having a separate lexer and parser is preferable because it allows a clearer separation of objectives and the creation of a more modular parser.

A scannerless parser is a better design for a language where a clear distinction between lexer and parser is difficult or unnecessary. An example is a parser for a markup language, where special markers are inserted in a sea of text. It can also facilitate the handling of languages where traditional lexing is difficult, like C. That is because a scannerless parser can more easily deal with complex tokenizations.

!Issues With Parsing Real Programming Languages

In theory contemporary parsing is designed to handle real programming languages, but in practice there are challenges with some of them. At least, it might be harder to parse them using normal parser generator tools.

!!Context-sensitive Parts

Parsing tools are traditionally designed to handle context-free languages, but sometimes the languages are context-sensitive. This might be the case in order to simplify the life of programmers or simply because of a bad design. I remember reading about a programmer that thought he could produce a parser for C in a week, but then he found so many corner cases that a year later he was still working on it…

A typical example of context-sensitive elements are the so-called soft keywords, i.e., strings that can be considered keywords in certain places, but otherwise can be used as identifiers.

!!Whitespace

Whitespace plays a significant role in some languages. The most well-known example is Python, where the indentation of a statement indicates if it is part of a certain block of code.

In most places whitespace is irrelevant even in Python: spaces between words or keywords do not matter. The real problem is indentation, which is used to identify blocks of code. The simplest way to deal with it is to check the indentation at the start of the line and transform it into the proper token, i.e., to create a token when the indentation changes from the previous line.

In practice, a custom function in the lexer produces INDENT and DEDENT tokens, when the indentation increases or decreases. These tokens play the role that in C-like languages is played by curly brackets: they indicate the start and end of code blocks.
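A minimal sketch of such a custom function, simplified for illustration: real tokenizers also handle multi-level dedents, tabs, and continuation lines.

```python
# Sketch of the indentation handling described above: compare each
# line's leading spaces with the previous line and emit INDENT or
# DEDENT tokens where the level changes.
def indentation_tokens(source):
    tokens, previous = [], 0
    for line in source.splitlines():
        current = len(line) - len(line.lstrip(" "))
        if current > previous:
            tokens.append("INDENT")
        elif current < previous:
            tokens.append("DEDENT")
        tokens.append(("LINE", line.strip()))
        previous = current
    return tokens

code = "if x:\n    y = 1\nz = 2"
print(indentation_tokens(code))
```

The emitted INDENT/DEDENT tokens then play the role of the curly brackets of C-like languages for the parser.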

This approach makes the lexing context-sensitive instead of context-free. This complicates parsing; you normally would not want to do it, but you are forced to in this case.

!!Multiple Syntaxes

Another common issue is dealing with the fact that a language might actually include a few different syntaxes. In other words, the same source file may contain sections of code that follow different syntaxes. In the context of parsing, the same source file effectively contains different languages. The most famous example is probably the C or C++ preprocessor, which is actually a fairly complicated language on its own and can magically appear inside any random C code.

An easier case to deal with are annotations, which are present in many contemporary programming languages. Among other things they can be used to process the code before it arrives at the compiler. They can command the annotation processor to transform the code in some way, for instance to execute a specific function before executing the annotated one. They are easier to manage because they can appear only in specific places.

!!Dangling Else

The dangling else is a common problem in parsing, linked to the if-then-else statement. Since the else clause is optional, a sequence of if statements can be ambiguous. For example this one:

{{{
if one
   then if two
       then two
else ???
}}}

It is unclear if the else belongs to the first if or the second one.

To be fair, this is for the most part a problem of language design. Most of the solutions do not really complicate parsing that much, for example requiring the use of an endif or requiring the use of blocks to delimit the if statement when it includes an else clause.

However there are also languages that offer no solution, that is to say that are designed ambiguously, for example (you guessed it) C. The conventional approach is to associate the else with the nearest if statement, which makes the parsing context-sensitive.

!!Parse Tree and Abstract Syntax Tree

There are two terms that are related and sometimes they are used interchangeably: Parse Tree and Abstract Syntax Tree (AST). Technically the parse tree could also be called Concrete Syntax Tree (CST), because it should reflect more concretely the actual syntax of the input, at least compared to the AST.

Conceptually they are very similar. They are both trees: there is a root representing the whole source code, with children nodes containing subtrees that represent smaller and smaller portions of code, until single tokens (terminals) appear in the tree.

The difference is in the level of abstraction: a parse tree might contain all the tokens which appeared in the program and possibly a set of intermediate rules. The AST instead is a polished version of the parse tree, in which only the information relevant to understanding the code is maintained. We are going to see an example of an intermediate rule in the next section.

Some information might be absent both in the AST and the parse tree. For instance, comments and grouping symbols (i.e., parentheses) are usually not represented. Things like comments are superfluous for a program and grouping symbols are implicitly defined by the structure of the tree.

!!From Parse Tree to Abstract Syntax Tree

A parse tree is a representation of the code closer to the concrete syntax. It shows many details of the implementation of the parser. For instance, usually each rule corresponds to a specific type of node. A parse tree is usually transformed into an AST by the user, possibly with some help from the parser generator. A common aid is the ability to annotate some rules in the grammar, to exclude the corresponding nodes from the generated tree. Another is an option to collapse some kinds of nodes if they have only one child.

This makes sense because the parse tree is easier to produce for the parser, since it is a direct representation of the parsing process. However the AST is simpler and easier to process by the following steps of the program. They typically include all the operations that you may want to perform on the tree: code validation, interpretation, compilation, etc…

Let us look at a simple example to show the difference between a parse tree and an AST. Let’s start with a look at an example grammar.

{{{
// lexer
PLUS: '+'
WORD_PLUS: 'plus'
NUMBER: [0-9]+

// parser
// the pipe | symbol indicates an alternative between the two
sum: NUMBER (PLUS | WORD_PLUS) NUMBER
}}}

In this grammar we can define a sum using either the symbol plus (+) or the string plus as operator. Imagine that you have to parse the following code.

{{{
10 plus 21
}}}

These could be the resulting parse tree and abstract syntax tree.

{{TODO{Image missing}}}

In the AST the indication of the specific operator has disappeared and all that remains is the operation to be performed. The specific operator is an example of an intermediate rule.
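This collapse can be sketched in Python on the "10 plus 21" example. The tuple shapes are invented here purely to illustrate the transformation.

```python
# Sketch: the parse tree for "10 plus 21" keeps every token and the
# intermediate operator rule; the AST keeps only the operation.
parse_tree = ("sum",
              ("NUMBER", "10"),
              ("operator", ("WORD_PLUS", "plus")),  # intermediate rule
              ("NUMBER", "21"))

def to_ast(node):
    # Collapse the sum rule: drop the operator node, keep the operation.
    if node[0] == "sum":
        return ("add", int(node[1][1]), int(node[3][1]))
    return node

print(to_ast(parse_tree))  # ('add', 10, 21)
```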

!!Graphical Representation of a Tree

The output of a parser is a tree, but the tree can also be represented in graphical ways. That is to allow an easier understanding for the developer. Some parsing generator tools can output a file in the DOT language, a language designed to describe graphs (a tree is a particular kind of graph). Then this file is fed to a program that can create a graphical representation starting from this textual description (e.g., Graphviz).

Let’s see a .DOT text, based on the previous sum example.

{{{
digraph sum {
     sum -> 10;
     sum -> 21;
}
}}}

The appropriate tool can create the following graphical representation.

{{TODO{Image missing}}}
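Such a textual description is straightforward to emit from a tree. A minimal Python sketch (the `to_dot` helper is a made-up name, not part of any tool):

```python
# Sketch of emitting the DOT text above from a simple tree structure.
def to_dot(name, children):
    lines = ["digraph %s {" % name]
    for child in children:
        lines.append("     %s -> %s;" % (name, child))
    lines.append("}")
    return "\n".join(lines)

print(to_dot("sum", [10, 21]))
```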

If you want to see a bit more of DOT you can read our article Language Server Protocol: A Language Server For DOT With Visual Studio Code. In that article we show how to create a Visual Studio Code plugin that can handle DOT files.


!Grammars

Grammars are a set of rules used to describe a language, so it comes naturally to study the formats of the rules. However, there are also several elements of a typical grammar that could use further attention. Some of them are due to the fact that a grammar can also be used to perform other duties, or to execute some code.

!Typical Grammar Issues

First we are going to talk about some special rules or issues you might encounter in parsing.

!!The Missing Tokens

If you read grammars you will probably encounter many in which only a few tokens are defined, and not all of them. Like in this grammar:

{{{
NAME     : [a-zA-Z]+
greeting : "Hello" NAME
}}}

Here no token for "Hello" is defined explicitly; the string literal is used directly in the parser rule. Many tools allow this shortcut for simple literal strings and create the corresponding token implicitly.

!!Left-recursive Rules

In the context of parsers, an important feature is the support for left-recursive rules. This means that a rule starts with a reference to itself. Sometimes this reference could also be indirect, that is to say it could appear in another rule referenced by the first one.

Consider for example arithmetic operations. An addition could be described as two expressions separated by the plus (+) symbol, but the operands of the additions could be other additions.

{{{
addition       : expression '+' expression
multiplication : expression '*' expression
// an expression could be an addition or a multiplication or a number
expression     : multiplication | addition | [0-9]+
}}}

In this example expression contains an indirect reference to itself via the rules addition and multiplication.

This description also matches multiple additions like 5 + 4 + 3. That is because it can be interpreted as expression (5) ('+') expression (4 + 3) (the rule addition: the first expression corresponds to the option [0-9]+, the second one is another addition). Then 4 + 3 itself can be divided into its two components: expression (4) ('+') expression (3) (the rule addition: both the first and the second expression correspond to the option [0-9]+).

The problem is that left-recursive rules may not be usable with some parser generators. The alternative is a long chain of expressions, which also takes care of the precedence of operators. A typical grammar for a parser that does not support such rules would look similar to this one:

{{{
expression     : addition
addition       : multiplication ('+' multiplication)*
multiplication : atom ('*' atom)*
atom           : [0-9]+
}}}

As you can see, the expressions are defined in the inverse order of precedence. So the parser would put the expressions with the lower precedence at the lowest level of the tree; thus they would be executed first.
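A recursive-descent parser for this chain can be sketched in Python. For brevity it evaluates directly instead of building a tree, but each function corresponds to one rule of the grammar above.

```python
import re

# Sketch of a recursive-descent parser for the grammar above. Each rule
# becomes one function; multiplication sits lower in the chain, so it
# binds tighter and is evaluated first.
def evaluate(text):
    tokens = re.findall(r"[0-9]+|[+*]", text)
    pos = [0]  # mutable cursor shared by the rule functions

    def atom():                # atom : [0-9]+
        value = int(tokens[pos[0]])
        pos[0] += 1
        return value

    def multiplication():      # multiplication : atom ('*' atom)*
        value = atom()
        while pos[0] < len(tokens) and tokens[pos[0]] == "*":
            pos[0] += 1
            value *= atom()
        return value

    def addition():            # addition : multiplication ('+' multiplication)*
        value = multiplication()
        while pos[0] < len(tokens) and tokens[pos[0]] == "+":
            pos[0] += 1
            value += multiplication()
        return value

    return addition()          # expression : addition

print(evaluate("5 + 4 + 3"))  # 12
print(evaluate("2 + 3 * 4"))  # 14 -- multiplication applied first
```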

Some parser generators support direct left-recursive rules, but not indirect ones. Notice that usually the issue is with the parsing algorithm itself, which does not support left-recursive rules. So the parser generator may transform rules written as left-recursive in the proper way to make them work with its algorithm. In this sense, left-recursive support may be (very useful) syntactic sugar.

!!How Left-recursive Rules Are Transformed

The specific way in which the rules are transformed varies from one parser generator to the other, but the logic remains the same. The expressions are divided into two groups: the ones with an operator and two operands, and the atomic ones. In our example the only atomic expression is a number ([0-9]+), but it could also be an expression between parentheses ((5 + 4)). That is because in mathematics parentheses are used to increase the precedence of an expression.

Once you have these two groups, you maintain the order of the members of the second group and reverse the order of the members of the first group. The reason is that humans reason on a first come, first served basis: it is easier to write the expressions in their order of precedence.

{{TODO{Image missing}}}

However the final form of the parsing is a tree, which operates on a different principle: you start working on the leaves and rise up, so that at the end of this process the root node contains the final result. This means that in the parse tree the atomic expressions are at the bottom, while the ones with operators appear in the inverse order to the one in which they are applied.

!!Predicates

Predicates, sometimes called syntactic or semantic predicates, are special rules that are matched only if a certain condition is met. The condition is defined with code in a programming language supported by the tool for which the grammar is written.

Their advantage is that they allow some form of context-sensitive parsing, which is sometimes unavoidable to match certain elements. For instance, they can be used to determine if a sequence of characters that defines a soft keyword is used in a position where it would be a keyword (e.g., the previous token can be followed by the keyword) or if it is a simple identifier.

The disadvantages are that they slow parsing down and they make the grammar dependent on said programming language. That is because the condition is expressed in a programming language and must be checked. And, of course, to check this condition you must execute the code in the corresponding programming language.

!!Embedded Actions

Embedded actions identify code that is executed every time the rule is matched. They have the clear disadvantage that they make the grammar harder to read, since the rules are surrounded by code. Furthermore, just like predicates, they break the separation between a grammar that describes the language and the code that manipulates the results of the parsing.

Actions are frequently used by less sophisticated parser generators as the only way to easily execute some code when a node is matched. Using these parser generators, the only alternative would be to traverse the tree and execute the proper code yourself. More advanced tools instead allow the use of the visitor pattern to execute arbitrary code when needed, and also to govern the traversing of the tree.

They can also be used to add certain tokens or change the generated tree. While ugly, this might be the only practical way to deal with complicated languages, like C, or specific issues, like whitespace in Python.

There are two main formats for a grammar: BNF (and its variants) and PEG. Many tools implement their own variants of these ideal formats. Some tools use custom formats altogether. A frequent custom format consists of a three-part grammar: options together with custom code, followed by the lexer section and finally the parser section.

Given that a BNF-like format is usually the foundation of a context-free grammar, you may also see it identified as the CFG format.
!!Backus-Naur Form and Its Variants

Backus–Naur Form (BNF) is the most successful format and was even the basis of PEG. However it is quite simple, and thus it is not often used in its basic form. A more powerful variant is typically used.

To show why these variants were necessary let’s show an example in BNF: the description of a character.

<letter>    ::= "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
<digit>     ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<character> ::= <letter> | <digit>

The symbol <letter> can be transformed into any of the letters of the English alphabet, although in our example only lowercase letters are valid. A similar process happens for <digit>, which can indicate any among the alternative digits. The first issue is that you have to list all the alternatives individually; you cannot use character classes like you do with regular expressions. This is annoying, but usually manageable, unless of course you have to list all Unicode characters.

A much harder problem is that there is no easy way to denote optional elements or repetitions. So if you want to do that, you have to rely on boolean logic and the alternative (|) symbol.

<text>      ::= <character> | <character> <text>

This rule states that <text> can be composed of a <character>, or a <character> followed by a <text>. Basically, since we cannot say "this is a sequence of characters", we have to devise a roundabout way to express the same idea using logic. This would be the parse tree for the word "dog".

{{TODO{Image missing}}}
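The recursive rule can be checked mechanically. The following sketch transcribes it literally into a recognizer, so the recursion in the rule becomes recursion in the code:

```python
# A direct transcription of the recursive BNF rule
#   <text> ::= <character> | <character> <text>
# into a recognizer.

import string

def is_character(ch):
    # <character> ::= <letter> | <digit>
    return ch in string.ascii_lowercase or ch in string.digits

def is_text(s):
    if not s:
        return False                 # BNF as written requires at least one character
    if len(s) == 1:                  # <text> ::= <character>
        return is_character(s)
    return is_character(s[0]) and is_text(s[1:])  # <character> <text>

print(is_text("dog"))   # True
print(is_text("d-g"))   # False
```

The awkwardness of expressing "a sequence" through recursion is visible here too: a simple loop would be more natural, which is exactly what the EBNF repetition operator provides.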

BNF has many other limitations: it makes it complicated to use empty strings or the symbols used by the format itself (e.g., ::=) in the grammar, and it has a verbose syntax (e.g., you have to enclose nonterminals between < and >).

!!Extended Backus-Naur Form

To solve some of these limitations Niklaus Wirth created the Extended Backus-Naur Form (EBNF), which includes some concepts from his own Wirth syntax notation.

EBNF is the most used form in contemporary parsing tools, although tools might deviate from the standard notation. EBNF has a much cleaner notation and adopts more operators to deal with concatenation or optional elements.

Let’s see how you would write the previous example in EBNF.

letter    = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" ;
digit     = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
character = letter | digit ;
text      = character , { character } ;

The text symbol is now described more easily with the help of the concatenation operator (,) and the repetition operator ({ .. }). The resulting parse tree would also be simpler. Standard EBNF still presents some problems, like the unfamiliar syntax. To overcome this issue, most parsing tools adopt a syntax inspired by regular expressions and also support additional elements like character classes.

If you need an in-depth coverage of the format you can read our article on EBNF.

!!ABNF

Augmented BNF (ABNF) is another variant of BNF, mainly developed to describe bidirectional communication protocols and formalized by the IETF in several documents (see RFC 5234 and its updates).

ABNF can be as productive as EBNF, but it has some quirks that have limited its adoption outside of internet protocols. For example, until recently the standard dictated that strings were matched in a case-insensitive way, which would have been a problem for matching identifiers in many programming languages.

ABNF has a different syntax from EBNF, for example the alternative operator is the slash (/), and sometimes it is plainly better. For instance, there is no need for a concatenation operator. It also has a few more features than standard EBNF. For example, it allows you to define numeric ranges, such as %x30-39, which is equivalent to [0-9]. The designers themselves use this to provide basic rules that act like standard character classes and that the end user can employ. An example of such a rule is ALPHA, which is equivalent to [a-zA-Z].

!!PEG

Parsing Expression Grammar (PEG) is a format presented by Bryan Ford in a 2004 paper. Technically it derives from an old formal grammar called Top-Down Parsing Language (TDPL). However, a simple way to describe it is: EBNF in the real world.

In fact it looks similar to EBNF, but also directly supports widely used features, like character ranges (character classes). Although it also has some differences that are not actually that pragmatic, like using the more formal arrow symbol (←) for assignment, instead of the more common equals symbol (=). The previous example could be written this way with PEG.

letter    ← [a-z]
digit     ← [0-9]
character ← letter / digit
text      ← character+

As you can see, this is the most obvious way a programmer would write it, with character classes and regular expression operators. The only anomalies are the alternative operator (/) and the arrow character, and in fact many implementations of PEG use the equals character.

The pragmatic approach is the foundation of the PEG formalism: it was created to describe programming languages more naturally. That is because context-free grammars have their origin in the work to describe natural languages. In theory, a CFG is a generative grammar, while a PEG is an analytic grammar.

The first is a sort of recipe to generate only the valid sentences of the language described by the grammar; it does not have to be related to the parsing algorithm. The second kind instead directly defines the structure and semantics of a parser for that language.

The practical implications of this theoretical difference are limited: PEG is more closely associated to the packrat algorithm, but that is basically it. For instance, generally PEG (packrat) does not permit left recursion. Although the algorithm itself can be modified to support it, this eliminates the linear-time parsing property. Also, PEG parsers are generally scannerless parsers.

Probably the most important difference between PEG and CFG is that the ordering of choices is meaningful in PEG, but not in CFG. If there are many possible valid ways to parse an input, a CFG will be ambiguous and the parser will usually report an error. By usually we mean that some parsers that adopt CFGs can deal with ambiguous grammars, for instance by providing all possible valid results to the developer and letting them sort it out. Instead, PEG eliminates ambiguity altogether because the first applicable choice is always chosen, thus a PEG can never be ambiguous.

The disadvantage of this approach is that you have to be more careful in listing possible alternatives, because otherwise you could have unexpected consequences. That is to say some choices could never be matched.

In the following example doge will never be matched: since dog comes first, it will be picked every time.

dog ← 'dog' / 'doge'

It is an open area of research whether PEG can describe all grammars that can be defined with CFG, but for all practical purposes it does.
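The ordered-choice behavior is easy to demonstrate with a toy implementation. In this sketch (names like `literal` and `ordered_choice` are our own, not from any library), each parsing function returns how many characters it consumed, or None on failure:

```python
# A toy demonstration of PEG's ordered choice: the choice operator commits
# to the first alternative that succeeds and never reconsiders the others.

def literal(word):
    def parse(text, pos):
        if text.startswith(word, pos):
            return pos + len(word)
        return None
    return parse

def ordered_choice(*alternatives):
    def parse(text, pos):
        for alt in alternatives:
            result = alt(text, pos)
            if result is not None:
                return result   # first match wins; later ones are never tried
        return None
    return parse

# dog ← 'dog' / 'doge'
dog_rule = ordered_choice(literal("dog"), literal("doge"))

print(dog_rule("doge", 0))  # 3 -- 'dog' matched, so 'doge' was never considered
```

Swapping the order of the two alternatives would let `doge` match, which is exactly why alternative ordering requires care in PEG.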

!Parsing Algorithms

In theory parsing is a solved problem, but it is the kind of problem that keeps being solved again and again. That is to say that there are many different algorithms, each one with strong and weak points, and they are still being improved by academics.

In this section we are not going to teach how to implement every one of the parsing algorithms, but we are going to explain their features. The goal is to let you choose with more awareness which parsing tool to use, or which algorithm to study further and implement for your custom parser.

Let’s start with a global overview of the features and strategies of all parsers.
!!Two Strategies

There are two strategies for parsing: top-down parsing and bottom-up parsing. Both terms are defined in relation to the parse tree generated by the parser. In simple terms:

    a top-down parser tries to identify the root of the parse tree first, then it moves down the subtrees, until it finds the leaves of the tree.
    a bottom-up parser instead starts from the lowest part of the tree, the leaves, and rises up until it determines the root of the tree.

Let’s see an example, starting with a parse tree.

{{TODO{Image missing}}}

The same tree would be generated in a different order by a top-down and a bottom-up parser. In the following images the numbers indicate the order in which the nodes are created.

{{TODO{Image missing}}}

Traditionally top-down parsers were easier to build, but bottom-up parsers were more powerful. Now the situation is more balanced, mostly because of advancement in top-down parsing strategies.

The concept of derivation is closely associated with the two strategies. Derivation indicates the order in which the nonterminal elements that appear in a rule, on the right, are applied to obtain the nonterminal symbol, on the left. Using the BNF terminology, it indicates how the elements that appear in __expression__ are used to obtain <symbol>. The two possibilities are leftmost derivation and rightmost derivation: in the first the leftmost nonterminal is expanded first, while in the second the rightmost one is.

A simple example: imagine that you are trying to parse the symbol result which is defined as such in the grammar.

expr_one = .. // stuff
expr_two = .. // stuff
result   = expr_one 'operator' expr_two

You can apply first the rule for symbol expr_one and then expr_two or vice versa. In the case of leftmost derivation you pick the first option, while for rightmost derivation you pick the second one.

It is important to understand that the derivation is applied depth-first or recursively. That is to say, it is applied on the starting expression then it is applied again on the intermediate result that is obtained. So, in this example, if after applying the rule corresponding to expr_one there is a new nonterminal, that one is transformed first. The nonterminal expr_two is applied only when it becomes the first nonterminal and not following the order in the original rule.

Derivation is associated with the two strategies: bottom-up parsing corresponds to rightmost derivation (applied in reverse), while top-down parsing uses leftmost derivation. Note that this has no effect on the final parse tree; it just affects the intermediate steps and the algorithm used.
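The expansion order can be simulated directly. The following sketch (a toy representation we made up, where uppercase strings stand for nonterminals) rewrites one nonterminal per step and shows how the two strategies differ only in which nonterminal gets picked:

```python
# A sketch of leftmost vs. rightmost derivation on a toy grammar.

GRAMMAR = {
    "RESULT":   ["EXPR_ONE", "operator", "EXPR_TWO"],
    "EXPR_ONE": ["one"],
    "EXPR_TWO": ["two"],
}

def derive(symbols, leftmost=True):
    steps = [list(symbols)]
    while any(s in GRAMMAR for s in symbols):
        indices = [i for i, s in enumerate(symbols) if s in GRAMMAR]
        i = indices[0] if leftmost else indices[-1]   # which nonterminal to expand
        symbols = symbols[:i] + GRAMMAR[symbols[i]] + symbols[i + 1:]
        steps.append(list(symbols))
    return steps

for step in derive(["RESULT"], leftmost=True):
    print(" ".join(step))
# RESULT
# EXPR_ONE operator EXPR_TWO
# one operator EXPR_TWO
# one operator two
```

Running it with `leftmost=False` expands EXPR_TWO first instead; as noted above, the final sequence is the same, only the intermediate steps differ.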

!!Common Elements

Parsers built with top-down and bottom-up strategies share a few elements that we can talk about.

!!!Lookahead and Backtracking

The terms lookahead and backtracking do not have a different meaning in parsing than the one they have in the wider computer science field. Lookahead indicates the number of elements, following the current one, that are taken into consideration to decide which action to take next.

A simple example: a parser might check the next token to decide which rule to apply now. When the proper rule is matched the current token is consumed, but the next one remains in the queue.

Backtracking is a general algorithmic technique: it consists of finding a solution to a complex problem by trying partial solutions and then pursuing the most promising one. If the one that is currently being tested fails, the parser backtracks (i.e., it goes back to the last position that was successfully parsed) and tries another one.

Lookahead is especially relevant to the LL, LR and LALR parsing algorithms, because parsers for languages that need just one lookahead token are easier to build and quicker to run. The number of lookahead tokens used by such algorithms is indicated between parentheses after the name of the algorithm (e.g., LL(1), LR(k)). The notation (*) indicates that the algorithm can use an unbounded number of lookahead tokens, although this might affect performance.
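Backtracking is easiest to see on a grammar where one token of lookahead is not enough. In this sketch (a toy grammar of our own), both alternatives start with the same token, so the parser tries the first one and rewinds when it fails:

```python
# Grammar: S -> 'a' 'b' | 'a' 'c'. One token of lookahead ('a') cannot
# distinguish the alternatives, so this sketch backtracks on failure.

def match_sequence(tokens, pos, sequence):
    """Try to match a fixed sequence at pos; return the new pos or None."""
    for expected in sequence:
        if pos < len(tokens) and tokens[pos] == expected:
            pos += 1
        else:
            return None          # mismatch: the caller will backtrack
    return pos

def parse_s(tokens, pos=0):
    saved = pos                  # remember where we started
    for alternative in (["a", "b"], ["a", "c"]):
        # backtrack: every alternative restarts from the saved position
        if match_sequence(tokens, saved, alternative) is not None:
            return alternative
    raise SyntaxError("no alternative matched")

print(parse_s(["a", "c"]))  # ['a', 'c'] -- matched after backtracking from 'a b'
```

An LL(2) parser would instead peek at two tokens and pick the right alternative immediately, which is why more lookahead can remove the need for backtracking.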
!!!Chart Parsers

Chart parsers are a family of parsers that can be bottom-up (e.g., CYK) or top-down (e.g., Earley). Chart parsers essentially try to avoid backtracking, which can be expensive, by using dynamic programming. Dynamic programming, or dynamic optimization, is a general method to break down a larger problem into smaller subproblems.

A common dynamic programming algorithm used by chart parsers is the Viterbi algorithm. Its goal is to find the most likely hidden states given a sequence of known events. Basically, given the tokens that we know, we try to find the most probable rules that have produced them.

The name chart parser derives from the fact that the partial results are stored in a structure called chart (usually the chart is a table). The particular technique of storing partial results is called memoization. Memoization is also used by other algorithms, unrelated to chart parsers, like packrat.

Before discussing parsing algorithms we would like to talk about the use of automatons in parsing algorithms. Automatons are a family of abstract machines, among which there is the well known Turing machine.

When it comes to parsers you might hear the term (Deterministic) Pushdown Automaton (PDA) and when you read about lexers you would hear the term Deterministic Finite Automaton (DFA). A PDA is more powerful and complex than a DFA (although still simpler than a Turing machine).

Since they define abstract machines, they are usually not directly related to an actual algorithm. Rather, they describe in a formal way the level of complexity that an algorithm must be able to deal with. If somebody says that to solve problem X you need a DFA, they mean that you need an algorithm as powerful as a DFA.

However, since DFAs are state machines, in the case of a lexer the distinction is frequently moot. That is because state machines are relatively straightforward to implement (i.e., there are ready-to-use libraries), so most of the time a DFA is implemented with a state machine. That is why we are going to briefly talk about DFAs and why they are frequently used for lexers.
!!!Lexing With a Deterministic Finite Automaton

A DFA is a (finite-)state machine, a concept with which we assume you are familiar. Basically, a state machine has many possible states and a transition function for each of them. These transition functions govern how the machine moves from one state to another in response to an event. When used for lexing, the machine is fed the input characters one at a time until it reaches an accepting state (i.e., it can build a token).

They are used for a few reasons:

    it has been proven that they recognize exactly the set of regular languages, that is to say they are exactly as powerful as regular expressions
    there are a few mathematical methods to manipulate and check their properties (e.g., whether they can parse all strings or any strings)
    they can work with an efficient online algorithm (see below)

An online algorithm is one that does not need the whole input to work. In the case of a lexer, it means that it can recognize a token as soon as its characters are seen by the algorithm. The alternative would be that it would need the whole input to identify each token.

In addition to these properties it is fairly easy to transform a set of regular expressions into a DFA, which makes it possible to input the rules in a simple way that is familiar to many developers. Then you can automatically convert them into a state machine that can work on them efficiently.
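A DFA-based lexer can be sketched with an explicit transition table. The states, character classes, and token names below are all made up for illustration; real lexer generators derive an equivalent table from your regular expressions automatically:

```python
# A sketch of lexing numbers and identifiers with an explicit DFA: states,
# a transition table, and accepting states. It works online: a token is
# emitted as soon as no further transition is possible.

def char_class(ch):
    if ch.isdigit():
        return "digit"
    if ch.isalpha() or ch == "_":
        return "letter"
    if ch in " \t\n":
        return "space"
    return "other"

# transition table: (state, character class) -> next state
TRANSITIONS = {
    ("start", "digit"): "in_number",
    ("start", "letter"): "in_name",
    ("in_number", "digit"): "in_number",
    ("in_name", "letter"): "in_name",
    ("in_name", "digit"): "in_name",
}
ACCEPTING = {"in_number": "NUMBER", "in_name": "NAME"}

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        if char_class(text[pos]) == "space":
            pos += 1
            continue
        state, start = "start", pos
        # follow transitions as long as possible (maximal munch)
        while pos < len(text) and (state, char_class(text[pos])) in TRANSITIONS:
            state = TRANSITIONS[(state, char_class(text[pos]))]
            pos += 1
        if state not in ACCEPTING:
            raise SyntaxError("unexpected character: " + text[pos])
        tokens.append((ACCEPTING[state], text[start:pos]))
    return tokens

print(lex("x1 42"))  # [('NAME', 'x1'), ('NUMBER', '42')]
```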
!!Tables of Parsing Algorithms

We provide a table to offer a summary of the main information needed to understand and implement a specific parser algorithm. You can find more implementations by reading our articles that present parsing tools and libraries for Java, C#, Python and JavaScript.

The table lists:

    a formal description, to explain the theory behind the algorithm
    a more practical explanation
    one or two implementations, usually one easier and the other a professional parser. Sometimes, though, there is no easier or professional version.

|!Algorithm|!Formal Description|!Explanation|!Example Implementation|
|CYK|An Efficient Recognition and Syntax-Analysis Algorithm for Context-Free Languages (PDF)|CKY And Earley Algorithms (PDF)|CYK Algorithm in Golang / Chart-parsers|
|Earley|An Efficient Context-Free Parsing Algorithm (PDF)|Earley Parsing Explained|Nearley|
|LL|LL(*): The Foundation of the ANTLR Parser Generator (PDF)|LL Parser on Wikipedia / LL-parser.js|ll(1) parser / ANTLR|
|LR|On the Translation of Languages from Left to Right (PDF)|Compiler Course|Jison / Bison|
|Packrat (PEG)|Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking (PDF)|Packrat Parsing: Simple, Powerful, Lazy, Linear Time / Packrat Parsing in Scala (PDF)|Companion code of the thesis / Canopy|
|Parser Combinator|Constructing natural language interpreters in a lazy functional language (PDF)|Parser combinators explained|Understanding Parser Combinators / Sprache|
|Pratt|Top Down Operator Precedence|Simple Top-Down Parsing in Python|A Pratt Parser in Python / JSLint|

To understand how a parsing algorithm works, you can also look at the syntax analytic toolkit. It is an educational parser generator that describes the steps that the generated parser takes to accomplish its objective. It implements an LL and an LR algorithm.

{{TODO{Image missing}}}

The second table shows a summary of the main features of the different parsing algorithms and what they are generally used for.

!!Top-down Algorithms

The top-down strategy is the more widespread of the two, and there are several successful algorithms applying it.
!!!LL Parser

LL (Left-to-right read of the input, Leftmost derivation) parsers are table-based parsers without backtracking, but with lookahead. Table-based means that they rely on a parsing table to decide which rule to apply. The parsing table uses nonterminals as rows and terminals as columns.

To find the correct rule to apply:

    first, the parser looks at the current token and the appropriate amount of lookahead tokens
    then it tries to apply the different rules until it finds the correct match.

The concept of an LL parser does not refer to a specific algorithm, but more to a class of parsers. They are defined in relation to grammars: that is to say, an LL parser is one that can parse an LL grammar. In turn, LL grammars are defined in relation to the number of lookahead tokens that are needed to parse them. This number is indicated between parentheses next to LL, in the form LL(k).

An LL(k) parser uses k tokens of lookahead and thus it can parse, at most, a grammar which needs k tokens of lookahead to be parsed. Effectively, the concept of an LL(k) grammar is more widely employed than the corresponding parser, which means that LL(k) grammars are used as a yardstick when comparing different algorithms. For instance, you would read that PEG parsers can handle LL(*) grammars.
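The table-driven mechanism can be sketched on a tiny LL(1) grammar. The grammar below (balanced parentheses, our own choice for illustration) needs exactly one token of lookahead to pick the production to expand:

```python
# A table-driven LL(1) recognizer for the grammar
#   S -> '(' S ')' S | <empty>
# Rows of the table are nonterminals, columns are terminals; one token of
# lookahead selects the production to push on the parse stack.

END = "$"
TABLE = {
    # (nonterminal, lookahead) -> production body to push
    ("S", "("): ["(", "S", ")", "S"],
    ("S", ")"): [],      # S -> empty
    ("S", END): [],      # S -> empty
}

def parse(tokens):
    tokens = list(tokens) + [END]
    stack = [END, "S"]          # parse stack: start symbol on top
    pos = 0
    while stack:
        top = stack.pop()
        lookahead = tokens[pos]
        if top == lookahead:            # terminal: consume the input token
            pos += 1
        elif (top, lookahead) in TABLE: # nonterminal: expand via the table
            stack.extend(reversed(TABLE[(top, lookahead)]))
        else:
            return False
    return pos == len(tokens)

print(parse("(())()"))  # True
print(parse("(()"))     # False
```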
!!!!The Value Of LL Grammars

This use of LL grammars is due both to the fact that LL parsers are widely used and to the fact that they are a bit restrictive. In fact, LL grammars do not support left-recursive rules. You can transform any left-recursive grammar into an equivalent non-recursive form, but this limitation matters for a couple of reasons: productivity and power.

The loss of productivity comes from the requirement to write the grammar in a special way, which takes time. The loss of power comes from the fact that a grammar that might need 1 token of lookahead when written with a left-recursive rule might need 2 or 3 tokens of lookahead when written in a non-recursive way. So this limitation is not merely annoying: it limits the power of the algorithm, i.e., the grammars it can be used for.

The loss of productivity can be mitigated by an algorithm that automatically transforms a left-recursive grammar into a non-recursive one. ANTLR is a tool that can do that, but, of course, if you are building your own parser, you have to do it yourself.

There are two special kinds of LL(k) grammars: LL(1) and LL(*). In the past, the first was the only kind considered practical, because it is easy to build efficient parsers for it; to the point that many computer languages were purposefully designed to be described by an LL(1) grammar. An LL(*) parser, also known as an LL-regular parser, can deal with languages using an unbounded amount of lookahead tokens.

On StackOverflow you can read a simple comparison between LL parsers and Recursive Descent parsers or one between LL parsers and LR parsers.
!!!Earley Parser

The Earley parser is a chart parser named after its inventor Jay Earley. The algorithm is usually compared to CYK, another chart parser, that is simpler, but also usually worse in performance and memory. The distinguishing feature of the Earley algorithm is that, in addition to storing partial results, it implements a prediction step to decide which rule is going to try to match next.

The Earley parser fundamentally works by dividing a rule into segments, like in the following example.

// an example grammar
HELLO    : "hello"
NAME     : [a-zA-Z]+
greeting : HELLO NAME

// an Earley parser would break up greeting like this

greeting : . HELLO NAME
greeting : HELLO . NAME
greeting : HELLO NAME .

Then, working on these segments, which can be connected at the dot (.), it tries to reach a completed state, that is to say one with the dot at the end.

The appeal of an Earley parser is that it is guaranteed to be able to parse all context-free languages, while other famous algorithms (e.g., LL, LR) can parse only a subset of them. For instance, it has no problem with left-recursive grammars. More generally, an Earley parser can also deal with nondeterministic and ambiguous grammars.

It can do that at the risk of worse performance (O(n³) in the worst case). However, it has linear time performance for normal grammars. The catch is that the set of languages parsed by the more traditional algorithms is the one we are usually interested in.

There is also a side effect of the lack of limitations: by forcing a developer to write the grammar in a certain way, the parsing can be made more efficient. That is, building an LL(1) grammar might be harder for the developer, but the parser can apply it very efficiently. With Earley you do less work, so the parser does more of it.

In short, Earley allows you to use grammars that are easier to write, but that might be suboptimal in terms of performance.
!!!!Earley Use Cases

So Earley parsers are easy to use, but the advantage, in terms of performance, in the average case might be non-existent. This makes the algorithm great for an educational environment or whenever productivity is more relevant than speed.

In the first case it is useful, for example, because most of the time the grammars your users write work just fine. The problem is that the parser will throw at them obscure and seemingly random errors. Of course the errors are not actually random, but they are due to the limitations of an algorithm that your users do not know or understand. So you are forcing the user to understand the inner workings of your parser in order to use it, which should be unnecessary.

An example of when productivity is more important than speed might be a parser generator to implement syntax highlighting, for an editor that needs to support many languages. In a similar situation, being able to support new languages quickly might be more desirable than completing the task as soon as possible.
!!!Packrat (PEG)

Packrat is often associated with the formal grammar PEG, since they were invented by the same person: Bryan Ford. Packrat was first described in his thesis: Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking. The title says almost everything that we care about: it has a linear time of execution, even though it supports backtracking.

The other reason for its efficiency is memoization: the storing of partial results during the parsing process. The drawback, and the reason why the technique was not used until recently, is the quantity of memory needed to store all the intermediate results. If the memory required exceeds what is available, the algorithm loses its linear time of execution.

Packrat also does not support left-recursive rules, a consequence of the fact that PEG always requires choosing the first option. Actually, some variants can support direct left-recursive rules, but at the cost of losing linear complexity.

Packrat parsers can use an unbounded amount of lookahead, if necessary. This influences the execution time, which in the worst case can be exponential.
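The memoization idea can be sketched as follows. This is a toy grammar and a simplified recognizer of our own, not Ford's actual algorithm: each rule's result is cached per (rule, position), so no rule is ever evaluated twice at the same position:

```python
# A sketch of the packrat idea: a recursive PEG-style recognizer whose
# results are memoized per (rule, position).

calls = {"count": 0}
memo = {}

def expr(text, pos):
    """expr <- term '+' expr / term   (memoized; returns end position or None)"""
    key = ("expr", pos)
    if key in memo:
        return memo[key]         # reuse the stored partial result
    calls["count"] += 1
    result = None
    t = term(text, pos)
    if t is not None and t < len(text) and text[t] == "+":
        rest = expr(text, t + 1)
        if rest is not None:
            result = rest
    if result is None:
        result = t               # second alternative: plain term
    memo[key] = result
    return result

def term(text, pos):
    key = ("term", pos)
    if key in memo:
        return memo[key]
    result = pos + 1 if pos < len(text) and text[pos].isdigit() else None
    memo[key] = result
    return result

print(expr("1+2+3", 0))   # 5 -- the whole input was consumed
print(calls["count"])     # 3 -- one evaluation of 'expr' per position reached
```

The cache is what keeps the total work linear even when alternatives are retried; it is also where the memory cost mentioned above comes from.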
!!!Recursive Descent Parser

A recursive descent parser is a parser that works with a set of (mutually) recursive procedures, usually one for each rule of the grammar. Thus the structure of the parser mirrors the structure of the grammar.
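The mirroring is easy to see in code. In this sketch (a classic arithmetic grammar, chosen for illustration), each rule becomes one function, and the functions call each other exactly as the rules reference each other:

```python
# A recursive descent parser for the grammar
#   expr   -> term (('+' | '-') term)*
#   term   -> factor (('*' | '/') factor)*
#   factor -> NUMBER | '(' expr ')'

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        token = peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, got {token!r}")
        pos += 1
        return token

    def expr():                      # one function per grammar rule
        value = term()
        while peek() in ("+", "-"):
            op = eat()
            value = (op, value, term())
        return value

    def term():
        value = factor()
        while peek() in ("*", "/"):
            op = eat()
            value = (op, value, factor())
        return value

    def factor():
        if peek() == "(":
            eat("(")
            value = expr()
            eat(")")
            return value
        return int(eat())

    tree = expr()
    if peek() is not None:
        raise SyntaxError("trailing input")
    return tree

print(parse(["1", "+", "2", "*", "3"]))  # ('+', 1, ('*', 2, 3))
```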

The term predictive parser is used in a few different ways: some people mean it as a synonym for top-down parser, some as a recursive descent parser that never backtracks.

The opposite of this second meaning is a recursive descent parser that does backtrack: one that finds the rule matching the input by trying each of the rules in sequence, and going back each time it fails.

Typically recursive descent parsers have problems parsing left-recursive rules, because the algorithm would end up calling the same function again and again. A possible solution to this problem is using tail recursion. Parsers that use this method are called tail recursive parsers.

Tail recursion per se is simply recursion that happens at the end of the function. However tail recursion is employed in conjunction with transformations of the grammar rules. The combination of transforming the grammar rules and putting recursion at the end of the process allows to deal with left-recursive rules.
!!!Pratt Parser

A Pratt parser is a widely unused, but much appreciated (by the few who know it) parsing algorithm defined by Vaughan Pratt in a paper called Top Down Operator Precedence. The paper itself starts with a polemic on BNF grammars, which the author wrongly argues to be the exclusive concern of parsing studies. This is one of the reasons for its lack of success. In fact, the algorithm does not rely on a grammar, but works directly on tokens, which makes it unusual to parsing experts.

The second reason is that traditional top-down parsers work great if you have a meaningful prefix that helps distinguish between different rules. For example, if you get the token FOR you are looking at a for statement. Since this essentially applies to all programming languages and their statements, it is easy to understand why the Pratt parser did not change the parsing world.

Where the Pratt algorithm shines is with expressions. In fact, the concept of precedence makes it impossible to understand the structure of the input simply by looking at the order in which the tokens are presented.

Basically, the algorithm requires you to assign a precedence value to each operator token and a couple of functions that determine what to do, according to what is on the left and right of the token. Then it uses these values and functions to bind the operations together while it traverses the input.
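A minimal sketch of that mechanism (the binding-power values below are arbitrary, and the tuple-based trees are just for display):

```python
# A sketch of a Pratt parser for binary expressions: each operator gets a
# binding power, and one loop decides, by comparing powers, whether the
# expression on the left "belongs" to the operator on the right.

BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

def parse_expression(tokens, pos=0, min_bp=0):
    left = int(tokens[pos])          # every expression starts with a number here
    pos += 1
    while pos < len(tokens) and BINDING_POWER.get(tokens[pos], 0) > min_bp:
        op = tokens[pos]
        # parse the right-hand side, allowing only operators that bind tighter
        right, pos = parse_expression(tokens, pos + 1, BINDING_POWER[op])
        left = (op, left, right)
    return left, pos

tree, _ = parse_expression("1 + 2 * 3 - 4".split())
print(tree)  # ('-', ('+', 1, ('*', 2, 3)), 4)
```

Note there is no grammar anywhere: precedence and associativity fall out of the binding powers and the comparison in the loop.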

While the Pratt algorithm has not been overly successful, it is used for parsing expressions. It was also adopted by Douglas Crockford (of JSON fame) for JSLint.
!!!Parser Combinator

A parser combinator is a higher-order function that accepts parser functions as input and returns a new parser function as output. A parser function usually means a function that accepts a string and outputs a parse tree.

Parser combinators are modular and easy to build, but they are also slower (they have O(n⁴) complexity in the worst case) and less sophisticated. They are typically adopted for easier parsing tasks or for prototyping. In a sense, the user of a parser combinator builds the parser partially by hand, relying on the hard work done by whoever created the parser combinator library.
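A minimal sketch of the idea (the combinator names `literal`, `sequence`, and `either` are our own, not from any particular library): a "parser" is a function from (text, position) to (value, new position) or None, and combinators build bigger parsers out of smaller ones.

```python
# A sketch of parser combinators: higher-order functions that take parser
# functions and return a new parser function.

def literal(word):
    def parser(text, pos):
        if text.startswith(word, pos):
            return word, pos + len(word)
        return None
    return parser

def sequence(*parsers):
    def parser(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None          # the whole sequence fails together
            value, pos = result
            values.append(value)
        return values, pos
    return parser

def either(*parsers):
    def parser(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parser

greeting = sequence(either(literal("hello"), literal("hi")), literal(" world"))
print(greeting("hi world", 0))  # (['hi', ' world'], 8)
```

The modularity is the selling point: `greeting` is an ordinary parser function, so it can itself be passed to `sequence` or `either` to build something larger.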

Generally they do not support left recursive rules, but there are more advanced implementations that do just that. See, for example, the paper Parser Combinators for Ambiguous Left-Recursive Grammars, that also manages to describe an algorithm that has polynomial time of execution.

Many contemporary implementations are called monadic parser combinators, since they rely on a structure from functional programming called a monad. Monads are a fairly complex concept that we cannot hope to explain here. However, basically a monad is able to combine functions and actions relying on a data type. The crucial feature is that the data type specifies how its different values can be combined.

The most basic example is the Maybe monad. This is a wrapper around a normal type, like an integer, that returns the value itself when the value is valid (e.g., 567), but a special value Nothing when it is not (e.g., undefined or division by zero). Thus you can avoid using a null value and unceremoniously crashing the program. Instead, the Nothing value is managed normally, like any other value would be managed.
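A rough sketch of the Maybe idea in Python (Python has no built-in Maybe; the `Maybe` class and `bind` method here are our own illustration of the pattern):

```python
# A sketch of the Maybe monad: wrap a value that may be missing, and let
# 'bind' decide how steps combine -- a missing value short-circuits the
# rest of the chain instead of crashing.

class Maybe:
    def __init__(self, value, valid=True):
        self.value = value
        self.valid = valid

    @staticmethod
    def nothing():
        return Maybe(None, valid=False)

    def bind(self, fn):
        # the data type decides the combination rule: Nothing stays Nothing
        return fn(self.value) if self.valid else self

    def __repr__(self):
        return repr(self.value) if self.valid else "Nothing"

def safe_div(a, b):
    return Maybe(a / b) if b != 0 else Maybe.nothing()

print(Maybe(10).bind(lambda x: safe_div(x, 2)).bind(lambda x: safe_div(x, 5)))  # 1.0
print(Maybe(10).bind(lambda x: safe_div(x, 0)).bind(lambda x: safe_div(x, 5)))  # Nothing
```

In a monadic parser combinator the same trick chains parsing steps: a failed step yields the equivalent of Nothing, and the remaining steps are skipped automatically.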
!!Bottom-up Algorithms

The main success of the bottom-up strategy is the family of LR parsers. The reason for their relative unpopularity is that historically they have been harder to build, although LR parsers are also more powerful than traditional LL(1) parsers. So we mostly concentrate on them, apart from a brief description of CYK parsers.

This means that we avoid talking about the more generic class of Shift-reduce Parsers, which also includes LR parsers.

We only say that shift-reduce algorithms work in two steps:

    Shift: read one token from the input, which becomes a new (momentarily isolated) node
    Reduce: once the proper rule is matched, join the resulting tree with a preceding existing subtree

Basically, the shift step reads the input until completion, while the reduce step joins the subtrees until the final parse tree is built.
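The two steps can be sketched on a toy grammar. This is a naive version: real shift-reduce parsers use tables to decide when to shift and when (and with which rule) to reduce, whereas here the hard-coded rule order stands in for that decision:

```python
# A naive sketch of the shift-reduce loop for the toy grammar
#   expr -> expr '+' NUM | NUM
# Shift pushes a token onto the stack; reduce replaces a matched rule body
# on top of the stack with its nonterminal.

RULES = [
    ("expr", ["expr", "+", "NUM"]),
    ("expr", ["NUM"]),
]

def parse(tokens):
    stack, pos, trace = [], 0, []
    while True:
        # reduce as long as some rule body sits on top of the stack
        for head, body in RULES:
            if len(stack) >= len(body) and stack[-len(body):] == body:
                del stack[-len(body):]
                stack.append(head)
                trace.append("reduce " + head)
                break
        else:
            if pos == len(tokens):
                break
            stack.append(tokens[pos])   # shift the next token
            trace.append("shift " + tokens[pos])
            pos += 1
    return stack, trace

stack, trace = parse(["NUM", "+", "NUM"])
print(stack)  # ['expr']
print(trace)  # ['shift NUM', 'reduce expr', 'shift +', 'shift NUM', 'reduce expr']
```

A successful parse ends with only the start symbol on the stack; the trace shows the alternation of shifts and reduces described above.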
!!!CYK Parser

The Cocke-Younger-Kasami (CYK) parser was formulated independently by its three authors. Its notability is due to its great worst-case performance (O(n³)), although it is hampered by comparatively bad performance in most common scenarios.

However the real disadvantage of the algorithm is that it requires the grammars to be expressed in the Chomsky normal form.

That is because the algorithm relies on the properties of this particular form to be able to split the input in half and try matching all the possibilities. Now, in theory, any context-free grammar can be transformed into a corresponding CNF, but this is seldom practical to do by hand. Imagine being annoyed by the fact that you cannot use left-recursive rules and then being asked to learn a special kind of form…

The CYK algorithm is used mostly for specific problems. For instance the membership problem: to determine if a string is compatible with a certain grammar. It can also be used in natural language processing to find the most probable parsing between many options.

For all practical purposes, if you need to parse any context-free grammar with a great worst-case performance, you want to use an Earley parser.
!!!LR Parser

LR (Left-to-right read of the input, Rightmost derivation) parsers are bottom-up parsers that can handle deterministic context-free languages in linear time, with lookahead and without backtracking. The invention of LR parsers is credited to the renowned Donald Knuth.

Traditionally they have been compared to, and competed with, LL parsers. So there is a similar analysis related to the number of lookahead tokens necessary to parse a language. An LR(k) parser can parse grammars that need k tokens of lookahead to be parsed. However LR grammars are less restrictive, and thus more powerful, than the corresponding LL grammars. For example, there is no need to exclude left-recursive rules.

Technically, LR grammars are a superset of LL grammars. One consequence of this is that in practice you need only LR(1) grammars, so the (k) is usually omitted.

They are also table-based, just like LL parsers, but they need two complicated tables. In very simple terms:

    one table tells the parser what to do depending on the current token, the state it is in and the tokens that can possibly follow it (lookahead sets)
    the other one tells the parser to which state to move next

LR parsers are powerful and have great performance, so where is the catch? The tables they need are hard to build by hand and can grow very large for normal computer languages, so they are usually used through parser generators. If you need to build a parser by hand, you would probably prefer a top-down parser.
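To show what the two tables actually look like, here is a toy table-driven LR parse loop in Python for the tiny grammar S → ( S ) | x. The state numbers come from working out the item sets by hand, so treat this as a sketch, not a template:

```python
# Hand-built SLR(1) tables for the grammar  S -> ( S )  |  S -> x
ACTION = {  # (state, lookahead token) -> what to do
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
    (1, "$"): ("accept", 0),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
    (3, ")"): ("reduce", 1), (3, "$"): ("reduce", 1),  # S -> x     (pop 1)
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", 3), (5, "$"): ("reduce", 3),  # S -> ( S ) (pop 3)
}
GOTO = {(0, "S"): 1, (2, "S"): 4}  # (state, nonterminal) -> next state

def lr_parse(tokens):
    """Return True if tokens form a valid sentence of the toy grammar."""
    stack, rest = [0], list(tokens) + ["$"]  # stack holds states only
    while True:
        action = ACTION.get((stack[-1], rest[0]))
        if action is None:
            return False  # no table entry: syntax error
        kind, count = action
        if kind == "shift":
            stack.append(count)   # push the new state
            rest.pop(0)           # consume the token
        elif kind == "reduce":
            del stack[len(stack) - count:]  # pop one state per RHS symbol
            stack.append(GOTO[(stack[-1], "S")])
        else:
            return True  # accept
```

For example, `lr_parse(["(", "x", ")"])` accepts while `lr_parse(["(", "x"])` rejects. Even for this two-rule grammar the tables take some care to derive, which is exactly why generators exist.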
Simple LR and Lookahead LR

Parser generators avoid the problem of manually creating such tables, but they do not solve the issue of the cost of generating and navigating them. So there are simpler alternatives to the canonical LR(1) parser described by Knuth. These alternatives are less powerful than the original: the Simple LR parser (SLR) and the Lookahead LR parser (LALR). In order of power we have: LR(1) > LALR(1) > SLR(1) > LR(0).

The names of the two parsers, both invented by Frank DeRemer, are somewhat misleading: one is not really that simple and the other is not the only one that uses lookahead. We can say that one is simpler, and the other relies more heavily on lookahead to make decisions.

Basically, they differ in the tables they employ: mostly they change the "what to do" part and the lookahead sets. That in turn poses different restrictions on the grammars they can parse. In other words, they use different algorithms to derive the parsing tables from the grammar.

An SLR parser is quite restrictive in practical terms and is not widely used. A LALR parser instead works for most practical grammars and is widely employed. In fact, the popular tools yacc and bison work with LALR parser tables.

Unlike LR grammars, LALR and SLR grammars are not supersets of LL grammars. The classes are not easily comparable: some grammars are covered by one class and not the other, and vice versa.
Generalized LR Parser

Generalized LR parsers (GLR) are more powerful variants of LR parsers. They were described by Bernard Lang in 1974 and implemented for the first time by Masaru Tomita in 1984. The reason for the existence of GLR is the need to parse nondeterministic and ambiguous grammars.

The power of a GLR parser is not found in its tables, which are equivalent to those of a traditional LR parser, but in the fact that it can move to different states. In practice, when there is an ambiguity it forks new parser(s) that handle that particular case. These parsers might fail at a later stage and be discarded.

The worst-case complexity of a GLR parser is the same as that of an Earley parser (O(n³)), though it may have better performance in the best case of deterministic grammars. A GLR parser is also harder to build than an Earley one.


With this great (or at least long) article we hope to have resolved most doubts about parsing terms and algorithms: what the terms mean and why to pick one algorithm over another. We have not just explained them, but also given a few pointers on common issues with parsing programming languages.

For reasons of space we could not provide a detailed explanation of all parsing algorithms. So we have also provided a few links for a deeper understanding of the algorithms: the theory behind them and an explanation of how they work. If you are specifically interested in building things like compilers and interpreters, you can read another one of our articles to find resources for creating programming languages.

If you are just interested in parsing you may want to read Parsing Techniques, a book that is as comprehensive as it is expensive. It obviously goes much more in depth than we could and it also covers less used parsing algorithms.

We would like to thank Dimitrios Kourkoulis for having revised the article and helped us improve it.

We would also like to thank Dustin Whitney for having spotted a typo.

ai = ei 
ay = ei 
ch = h 
chs = x 
ck = k 
ie = i 
j = i 
ph = f 
th = t 
y = i 
ey = ei 
z = s 

Additionally, SOUNDEX I includes a conversion to uppercase and the transformation of umlauts. Identical letters occurring multiple times in a row are reduced to a single letter ("Zimmer" becomes "ZIMER").


v = f 
d = t 
b = p 
sch = s 
g = k 

Soundex II must be applied first, because otherwise e.g. "ch" would already have been eliminated and "sch" could never occur.
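A possible Python sketch of the two phases, applying the Soundex II substitutions before the Soundex I ones and then collapsing repeated letters. The function name, the umlaut table and the exact rule ordering within each phase are my own assumptions:

```python
# Soundex II rules (must run first, so "sch" is still intact)
RULES_II = [("sch", "s"), ("v", "f"), ("d", "t"), ("b", "p"), ("g", "k")]
# Soundex I rules (multi-char patterns before their single-char parts)
RULES_I = [("chs", "x"), ("ch", "h"), ("ck", "k"),
           ("ai", "ei"), ("ay", "ei"), ("ey", "ei"), ("ie", "i"),
           ("ph", "f"), ("th", "t"), ("j", "i"), ("y", "i"), ("z", "s")]
# umlaut handling (assumed: the page only says umlauts are transformed)
UMLAUTS = [("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss")]

def normalize(name):
    s = name.lower()
    for old, new in UMLAUTS:
        s = s.replace(old, new)
    for old, new in RULES_II + RULES_I:  # II first, then I
        s = s.replace(old, new)
    out = []  # collapse runs of identical letters ("zimmer" -> "zimer")
    for ch in s:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out).upper()
```

For example, "Bach" becomes "PAH" (b→p, ch→h) and "Schmidt" becomes "SMIT" (sch→s, d→t, collapsed tt).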

> Tags are Attributes without keys, because the key can be inferred from the value.

!How To Represent Tags And Attributes In Different Containers

!!String ("serialized")

Bud Spencer|Director=Clint Eastwood|Year=1985|Oscars=(null)

!!Hash Table (aka Dictionary)

|Bud Spencer|Bud Spencer|Convention: If key and value are equal => it's a tag|
|Director|Clint Eastwood||
|Year|1985||
|Oscars|(null)|Here we use the internal null value of the programming language.|
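One way to turn the serialized form above into the hash-table form is sketched below in Python; the function name is hypothetical, and the convention "part without '=' is a tag whose key equals its value" comes from the table:

```python
def parse_entry(serialized):
    """Parse 'Tag|Key=Value|...' into a dict (tag => key equals value)."""
    result = {}
    for part in serialized.split("|"):
        if "=" in part:
            key, value = part.split("=", 1)
            # "(null)" in the wire format maps to the language's null
            result[key] = None if value == "(null)" else value
        else:
            result[part] = part  # a tag: key and value are equal
    return result
```

Calling `parse_entry("Bud Spencer|Director=Clint Eastwood|Year=1985|Oscars=(null)")` yields the dictionary shown above, with `Oscars` mapped to `None`.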


|Lookup Table For Keys|c
|0|(Tag)|We define the pseudo key name "(Tag)" for all tags.|

|Table For Tags And Attributes|c
|0|Bud Spencer||
|1|Clint Eastwood||


> Store tuples with leading and trailing separator => unique selecting via "LIKE" will be possible
> Use a semicolon as separator (like MS Outlook does)

Names=";Smitch, John;Miller, Frank;Peck, Gregory;"

SELECT * FROM Persons WHERE ZipCodes LIKE '%;12345;%'
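The same trick works in application code. The helper names below are hypothetical, but the `";value;"` containment test mirrors the `LIKE '%;…;%'` pattern above:

```python
def pack(values):
    """Join values with leading AND trailing semicolons."""
    return ";" + ";".join(values) + ";"

def contains(packed, value):
    """Exact-element test: only whole elements match, never substrings."""
    return (";" + value + ";") in packed

names = pack(["Miller, Frank", "Peck, Gregory"])
# contains(names, "Miller, Frank") matches the whole element,
# contains(names, "Frank") does not match a mere substring.
```

Without the leading and trailing separators, searching for the first or last element would need special cases, and substrings of elements could match by accident.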
!Date Values

*Year always 1999 (well distinguishable from month or day, reveals problems with 2-digit years)
**For Birthday use 1956 (it's a leap year)
*Month always 01 or counted up for sequential test data
*Day always 28 (well distinguishable from month, works with any month)

!Personal Data

|Adams|Alice|1955-01-01|Simple ASCII baseline; first in alphabetical order.|
|Äppel|Ämilia|1956-02-29|Reveals unicode, sorting and leap year problems (last name starts with an umlaut).|
|Zuckerberg|Ziggy|1958-12-31|Covers names from A-Z.|

!Best Practice: Enclose With Delimiter And Escape Null Values

||Pseudo Code|Tuple Representation|Remarks|h
|Collection contains two elements|{"Foo","Bar"}|"#Foo#Bar#"||
|Collection contains one element|{"Foo"}|"#Foo#"||
|Collection contains one empty string|{""}|"##"||
|Collection contains one null element|{null}|"#\0#"||
|Collection is empty|{}|""||
|Collection itself is null|null|"\0"||

* The hash sign is used as delimiter, to avoid looking similar to a "classic" comma or semicolon tuple.
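A Python sketch of the table above; the helper names are mine, and it assumes elements contain neither "#" nor the NUL character:

```python
NULL_MARK = "\0"  # escape for a null element / a null collection

def encode(collection):
    """Serialize per the 'enclose with delimiter' table above."""
    if collection is None:
        return NULL_MARK          # null collection -> "\0"
    if not collection:
        return ""                 # empty collection -> ""
    parts = [NULL_MARK if e is None else e for e in collection]
    return "#" + "#".join(parts) + "#"

def decode(text):
    """Inverse of encode(); every table row round-trips."""
    if text == NULL_MARK:
        return None
    if text == "":
        return []
    parts = text[1:-1].split("#")  # strip enclosing delimiters
    return [None if p == NULL_MARK else p for p in parts]
```

Every row of the table round-trips unambiguously, including the awkward cases `{""}` → `"##"` and `{null}` → `"#\0#"` that the other two representations cannot distinguish.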

!!Building Is Easy

tuple = "#" // start with the delimiter ("" for an empty collection)
For Each element
    tuple = tuple + element + "#"

!!Finding Elements Is Easy

tupleString = ";FooBar;Foo;Bar;"
found = tupleString.Contains(";Bar;") // will safely find only 3rd element

!Worse: Enclose With Braces

||Pseudo Code|Tuple Representation|Remarks|h
|Collection contains two elements|{"Foo","Bar"}|"[Foo;Bar]"||
|Collection contains one element|{"Foo"}|"[Foo]"||
|Collection contains one empty string|{""}|"[]"||
|Collection contains one null element|{null}|"[]"||
|Collection is empty|{}|""||
|Collection is null|null|-||

!!Building Is Not So Easy

tuple.Begin() // to start with "["
For Each element
    tuple.Append(element)
tuple.End() // to end with "]"

!!Finding Elements Is Not So Easy

tupleString = "[FooBar;Foo;Bar]"
found = tupleString.Contains("Bar") // will wrongly find substring of 1st element!

!Even Worse: Do Not Enclose At All ("classic" tuple style)

||Pseudo Code|Tuple Representation|Remarks|h
|Collection contains two elements|{"Foo","Bar"}|"Foo;Bar"||
|Collection contains one element|{"Foo"}|"Foo"||
|Collection contains one empty string|{""}|""|.net String.Split() returns an array with one empty string.|
|Collection contains one null element|{null}|""|.net String.Join() treats null same as empty string.|
|Collection is empty|{}|-|.net String.Split() can never return an empty array, it always returns an array with one empty string.|
|Collection is null|null|-||

The most common pattern is...

* Join elements separated by semicolon (or comma or whatever)
* No separator before the first element or after the last element ("a;b;c")
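Python's `str.join` and `str.split` behave analogously to the .NET methods noted in the table, so the ambiguities of the "classic" style can be demonstrated directly:

```python
# No leading/trailing separator in the classic style:
joined = ";".join(["Foo", "Bar"])  # 'Foo;Bar'

# Ambiguity 1: an empty collection and a collection holding one
# empty string serialize to the same text.
empty_join = ";".join([])          # '' - identical to ";".join([""])

# Ambiguity 2: splitting that text never yields an empty list.
round_trip = "".split(";")         # [''] - one empty string, never []
```

This is exactly why the "enclose with delimiter" representation above is the safer choice when empty and null collections must survive a round trip.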

Original Source: https://blog.codinghorror.com/avoiding-undocumentation/

Have you ever noticed that much of the online MSDN .NET framework help is... not helpful? Take the MSDN help for the IBindingList.AddIndex method, for example:

{{TODO{Image missing}}}

Scott Swigart calls this undocumentation, and elaborates further in his blog post:

This is an example where quacking like a duck doesn't make you a duck. This looks like documentation. It shows up in the help alongside documentation, it's indexed like documentation, but it's not documentation. It doesn't actually tell anyone anything they didn't already know.
Large swaths of the Framework are undocumented in exactly this way, and many v1.0 SDKs are, well, very undocumented.
Honestly, my problem isn't that lots of stuff is undocumented. It's that Microsoft spent time writing this undocumentation, proof-reading this undocumentation, and putting this undocumentation through the same process as the real documentation. I don't know how much time was spent undocumenting things, but I'm guessing that if you add it all up, it's a lot.
I guess on the documentation teams, there must be some law that no class, property, method, or event will show up in the help with a big, bold, "Undocumented".
Can we stop pretending? Can you just mark everything as Undocumented until you get around to writing real documentation for it? Maybe even include a "Click here to vote to have this documented." For a simple test, if it doesn't include a code example, it's not documented. Just mark it as such and move on. 

|Verb|Antonym|Not To Mix Up With|Semantic|h
|Add|Remove|--Append--|Adds an element to a collection. Do not use to concatenate strings (better: "Append").|
|AddNew|Remove||Convenience method to create a new instance of something and add it to some container.|
|Append|Truncate|--Add--|Appends content to an element.|
|Approve||--Verify, Validate--|After a successful validation, change a state or add a flag (possibly persistently).|
|Await|||Pauses code execution until a condition is fulfilled (or an interaction occured).|
|Build||--Create, Populate--|Builds an artefact or object (e.g. a string), mostly in memory or on a file system.|
|Check (Verify)||--Validate--||
|Clear||--Dispose--|Empties a collection or a container. The container itself remains.|
|Create||--Generate, Build--|A (static) factory method, creating an instance of an object. Or: A persistent operation, like creating a database or table.|
|Delete|Insert|--Remove--|Persistently destroys something, e.g. a database record or a file.|
|Determine|Specify|--Get, Locate, Select--|Returns the result of a complex (algorithmic) decision.|
|Evaluate|||Similar to "check" - maybe more complex.|
|Fetch|||Get data from a data source.|
|Generate|||Builds up something generic, more complex (compared to "Create"). The generated output is often used as input for a follow-up action. Examples: Test data, source code.|
|Normalize|||Re-formats something to fulfill a convention, e.g. removing poison chars from a string.|
|--Search--||Find|Seldom used as a verb ("find" is more popular). Sometimes used as prefix for nouns (SearchMode, SearchEngine,...).|



|Noun|Antonym|Not To Mix Up With|Semantic|Context|h
|Action|Request|--Activity--|(Initiation of a) single event or effort||
|Activity||--Action--|on-going process over a period of time||
|Arguments||--Parameters--|The runtime values that are passed as parameters to a method.||
|Choices|Options||Set of all (not yet chosen) options available.||
|Declaration||--Definition--|Introduces only the name (and maybe the type) of something. [[StackOverflow|https://stackoverflow.com/questions/1410563/what-is-the-difference-between-a-definition-and-a-declaration]]|Language syntax|
|Definition||--Declaration--|Provides values / content for something. [[StackOverflow|https://stackoverflow.com/questions/1410563/what-is-the-difference-between-a-definition-and-a-declaration]]|Language syntax|
|Expression|||A string that is to be parsed.|Language syntax|
|Principal|Dependant|--Parent--||Dependency models (e.g. data base)|
|Parent|Child|||Tree structures (mostly UI).|
|Parameters||--Arguments--|The design time variable names and types of a method.|Language syntax|
|PropertyBag|||A class with no methods (formerly called "structure"). Often flat (no tree structure) and having only primitive-type properties.||
|Signature|||The arguments and return type of a method.|Language syntax|

!Specific Problem Domains

!!File System

|Variable Name|Example|Remarks|h
|baseDirectory|"C:\Temp"|Rooted name of a directory (no trailing "\"), which is used as anchor for sub directories.|
|directory|"C:\Temp\Foo"|Rooted name of a directory (no trailing "\").|
|directoryName<br>directoryNames|"Foo"|Name of a directory (not rooted). No "\" at any position.|
|extension|".txt"|Extension of a file (including trailing ".")|
|fileContent||Complete content of a file (as blob or string). See also fileStream.|
|fileFullName<br>fileFullNames|"C:\Temp\Foo\SomeFile.txt"|Rooted path of a file.|
|fileName<br>fileNames|"SomeFile.txt"|Name of a file including extension, no leading directory.|
|fileOrFullName|"SomeFile.txt"<br>"C:\Temp\Foo\SomeFile.txt"|If both are possible.|
|fileStream||Access channel for sequentially reading a file's content.|
|fileSystemEntry|"C:\Temp"<br>"C:\Temp\SomeFile.bar"|Rooted name of a directory (no trailing "\") or a file.|
|path|"SomeFile"<br>"Temp\SomeFile.bar"<br>"C:\Temp"<br>"C:\Temp\SomeFile.bar"|Bad practice, because it can be everything and nothing. Only to be used if all options are needed.|
|pathLeftPart|"Temp"<br>"Temp\Foo"<br>"C:\Temp\Foo"|When you don't know if it's rooted (='baseDirectory') or relative (='subDirectoryName'). No leading or trailing "\".|
|pathRightPart|"Batz\SomeFile.bar"|Bad practice. A file name (optionally) including a sub directory prefix.|
|subDirectoryRelativePath|"Foo"<br>"Foo\Batz"|Name of one or more directories (not rooted). No leading or trailing "\".|
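The variable names above can be derived from one fileFullName with Python's pathlib, used here only as a stand-in for whatever file API a project actually uses:

```python
from pathlib import PureWindowsPath

file_full_name = r"C:\Temp\Foo\SomeFile.txt"
p = PureWindowsPath(file_full_name)

directory = str(p.parent)       # 'C:\\Temp\\Foo' - rooted, no trailing '\'
file_name = p.name              # 'SomeFile.txt' - no leading directory
extension = p.suffix            # '.txt' - including the leading '.'
directory_name = p.parent.name  # 'Foo' - not rooted, no '\' at any position
```

Note how each derived value matches the conventions in the table (no trailing backslash on directories, extension including the dot), which keeps string concatenation of paths predictable.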

!!String Processing

|Variable Name|Example|Remarks|h
|beginOfHello|"Hello,World" => "H" => 0|0-Based-Index of the first char of a token.|
|endOfHello|"Hello,World" => "o" => 4|0-Based-Index of the last char of a token.|
|rightOfHello|"Hello,World" => "," => 5|0-Based-Index of the first char AFTER a token.|
|lengthOfHello|"Hello,World" => 5|Number of characters of a token.|
|lookAheadLength|"FooBar@@.@@Batz-Bang." => Index of "." is @@6@@ => used as lookAheadLength.<br>IndexOf("-", 0, @@6@@) => -1|Number of characters to examine when searching a substring.<br>To avoid finding stuff behind a limiter.|
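The index-naming scheme above, computed for the table's own examples (the variable names follow the table; this is only an illustration):

```python
s = "Hello,World"
begin_of_hello = s.index("Hello")                   # 0: first char of the token
length_of_hello = len("Hello")                      # 5: number of chars
right_of_hello = begin_of_hello + length_of_hello   # 5: first char AFTER the token
end_of_hello = right_of_hello - 1                   # 4: last char of the token

s2 = "FooBar.Batz-Bang."
look_ahead_length = s2.index(".")                   # 6: position of the limiter
behind = s2.find("-", 0, look_ahead_length)         # -1: '-' lies behind the limiter
```

The "rightOf" form is often the most useful of the three, since `rightOf = begin + length` composes directly when scanning a string token by token.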

!Web Links