#*****************************************************************************
#
#   Copyright (C) 2002-2007, International Business Machines Corporation and others.
#   All Rights Reserved.
#
#*****************************************************************************
#
#  file:  regexcst.txt
#  ICU Regular Expression Parser State Table
#
#     This state table is used when reading and parsing a regular expression pattern.
#     The pattern parser uses a state machine; the data in this file define the
#     state transitions that occur for each input character.
#
#     *** This file defines the regex pattern grammar.   This is it.
#     *** The determination of what is accepted is here.
#
#     This file is processed by a perl script "regexcst.pl" to produce initialized C arrays
#     that are then built with the rule parser.
#

#
# Here is the syntax of the state definitions in this file:
#
#
#StateName:
#   input-char           n next-state           ^push-state     action
#   input-char           n next-state           ^push-state     action
#       |                |   |                      |             |
#       |                |   |                      |             |--- action to be performed by state machine
#       |                |   |                      |                  See function RegexCompile::doParseActions()
#       |                |   |                      |
#       |                |   |                      |--- Push this named state onto the state stack.
#       |                |   |                           Later, when next state is specified as "pop",
#       |                |   |                           the pushed state will become the current state.
#       |                |   |
#       |                |   |--- Transition to this state if the current input character matches the input
#       |                |        character or char class in the left hand column.  "pop" causes the next
#       |                |        state to be popped from the state stack.
#       |                |
#       |                |--- When making the state transition specified on this line, advance to the next
#       |                     character from the input only if 'n' appears here.
#       |
#       |--- Character or named character classes to test for.  If the current character being scanned
#            matches, perform the actions and go to the state specified on this line.
#            The input character is tested sequentially, in the order written.  The characters and
#            character classes tested for do not need to be mutually exclusive.  The first match wins.
#



#
#  start state, scan position is at the beginning of the pattern.
#
start:
    default                 term                                    doPatStart



#
#  term.  At a position where we can accept the start of most items in a pattern.
#
term:
    quoted                n expr-quant                              doLiteralChar
    rule_char             n expr-quant                              doLiteralChar
    '['                   n set-open                ^set-finish     doSetBegin
    '('                   n open-paren
    '.'                   n expr-quant                              doDotAny
    '^'                   n expr-quant                              doCaret
    '$'                   n expr-quant                              doDollar
    '\'                   n backslash
    '|'                   n term                                    doOrOperator
    ')'                   n pop                                     doCloseParen
    eof                     term                                    doPatFinish
    default                 errorDeath                              doRuleError



#
#   expr-quant    We've just finished scanning a term, now look for the optional
#                 trailing quantifier - *, +, ?, *?,  etc.
#
expr-quant:
    '*'                   n quant-star
    '+'                   n quant-plus
    '?'                   n quant-opt
    '{'                   n interval-open                           doIntervalInit
    '('                   n open-paren-quant
    default                 expr-cont


#
#   expr-cont     Expression, continuation.  At a point where additional terms are
#                 allowed, but not required.
#                 No Quantifiers
#
expr-cont:
    '|'                   n term                                    doOrOperator
    ')'                   n pop                                     doCloseParen
    default                 term


#
#   open-paren-quant   Special case handling for comments appearing before a quantifier,
#                        e.g.   x(?#comment )*
#                      Open parens from expr-quant come here; anything but a (?# comment
#                      branches into the normal parenthesis sequence as quickly as possible.
#
open-paren-quant:
    '?'                   n open-paren-quant2                       doSuppressComments
    default                 open-paren

open-paren-quant2:
    '#'                   n paren-comment           ^expr-quant
    default                 open-paren-extended


#
#   open-paren    We've got an open paren.  We need to scan further to
#                 determine what kind of group it is - plain (, (?:, (?>, or whatever.
#
open-paren:
    '?'                   n open-paren-extended                     doSuppressComments
    default                 term                    ^expr-quant     doOpenCaptureParen

open-paren-extended:
    ':'                   n term                    ^expr-quant     doOpenNonCaptureParen   #  (?:
    '>'                   n term                    ^expr-quant     doOpenAtomicParen       #  (?>
    '='                   n term                    ^expr-cont      doOpenLookAhead         #  (?=
    '!'                   n term                    ^expr-cont      doOpenLookAheadNeg      #  (?!
    '<'                   n open-paren-lookbehind
    '#'                   n paren-comment           ^term
    'i'                     paren-flag                              doBeginMatchMode
    'd'                     paren-flag                              doBeginMatchMode
    'm'                     paren-flag                              doBeginMatchMode
    's'                     paren-flag                              doBeginMatchMode
    'u'                     paren-flag                              doBeginMatchMode
    'w'                     paren-flag                              doBeginMatchMode
    'x'                     paren-flag                              doBeginMatchMode
    '-'                     paren-flag                              doBeginMatchMode
    '('                   n errorDeath                              doConditionalExpr
    '{'                   n errorDeath                              doPerlInline
    default                 errorDeath                              doBadOpenParenType

open-paren-lookbehind:
    '='                   n term                    ^expr-cont      doOpenLookBehind        #  (?<=
    '!'                   n term                    ^expr-cont      doOpenLookBehindNeg     #  (?
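
#
#  Illustration only, not part of the ICU sources.  The row semantics described at the top of
#  this file (rows tested in order, first match wins, 'n' advances the input, ^state pushes,
#  "pop" pops) can be sketched with a small table-driven interpreter.  Everything below is an
#  assumption made for the sketch: the Python names (TABLE, run, lit, ...), the reduced subset
#  of states and rows, the treatment of letters and digits as rule_char, and the choice to stop
#  on doPatFinish or doRuleError.  The real machine is generated by regexcst.pl and runs in C++.
#

```python
EOF = None  # sentinel standing in for the "eof" input class


def is_rule_char(ch):
    # Assumption: letters and digits stand in for the rule_char class.
    return ch is not EOF and ch.isalnum()


def is_eof(ch):
    return ch is EOF


def default(ch):
    # The "default" row matches anything, including eof.
    return True


def lit(c):
    # Matcher for a single quoted character such as '(' or '|'.
    return lambda ch: ch == c


# Each row: (matcher, advance 'n', next-state, push-state, action).
# Only a hypothetical subset of the rows above is reproduced; quantifiers,
# sets, backslash escapes and the (? forms are left out.
TABLE = {
    "start": [(default, False, "term", None, "doPatStart")],
    "term": [
        (is_rule_char, True, "expr-quant", None, "doLiteralChar"),
        (lit("("), True, "open-paren", None, None),
        (lit("|"), True, "term", None, "doOrOperator"),
        (lit(")"), True, "pop", None, "doCloseParen"),
        (is_eof, False, "term", None, "doPatFinish"),
        (default, False, "errorDeath", None, "doRuleError"),
    ],
    "expr-quant": [(default, False, "expr-cont", None, None)],
    "expr-cont": [
        (lit("|"), True, "term", None, "doOrOperator"),
        (lit(")"), True, "pop", None, "doCloseParen"),
        (default, False, "term", None, None),
    ],
    "open-paren": [(default, False, "term", "expr-quant", "doOpenCaptureParen")],
}


def run(pattern):
    """Return the list of action names fired while scanning pattern."""
    state, stack, pos, actions = "start", [], 0, []
    while True:
        ch = pattern[pos] if pos < len(pattern) else EOF
        # Rows are tested in the order written; the first match wins.
        for matches, advance, next_state, push_state, action in TABLE[state]:
            if matches(ch):
                break
        else:
            raise RuntimeError("state %r has no matching row" % state)
        if action:
            actions.append(action)
        # Assumption: these two actions end the scan in this sketch.
        if action in ("doPatFinish", "doRuleError"):
            return actions
        if push_state:
            stack.append(push_state)
        if advance:
            pos += 1
        if next_state == "pop":
            if not stack:
                # Assumption: an unmatched ')' is reported as a rule error.
                actions.append("doRuleError")
                return actions
            state = stack.pop()
        else:
            state = next_state


print(run("(a)|b"))
```

#
#  For "(a)|b" the sketch fires doPatStart, doOpenCaptureParen, doLiteralChar, doCloseParen,
#  doOrOperator, doLiteralChar, doPatFinish.  The ^expr-quant pushed by open-paren is what
#  brings the machine back to expr-quant after the closing paren, so that a quantifier could
#  follow the group.
#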