intl/icu/source/i18n/regexcst.txt

Thu, 22 Jan 2015 13:21:57 +0100

author
Michael Schloh von Bennewitz <michael@schloh.com>
date
Thu, 22 Jan 2015 13:21:57 +0100
branch
TOR_BUG_9701
changeset 15
b8a032363ba2
permissions
-rw-r--r--

Incorporate requested changes from Mozilla in review:
https://bugzilla.mozilla.org/show_bug.cgi?id=1123480#c6

michael@0 1
michael@0 2 #*****************************************************************************
michael@0 3 #
michael@0 4 # Copyright (C) 2002-2007, International Business Machines Corporation and others.
michael@0 5 # All Rights Reserved.
michael@0 6 #
michael@0 7 #*****************************************************************************
michael@0 8 #
michael@0 9 # file: regexcst.txt
michael@0 10 # ICU Regular Expression Parser State Table
michael@0 11 #
michael@0 12 # This state table is used when reading and parsing a regular expression pattern.
michael@0 13 # The pattern parser uses a state machine; the data in this file define the
michael@0 14 # state transitions that occur for each input character.
michael@0 15 #
michael@0 16 # *** This file defines the regex pattern grammar. This is it.
michael@0 17 # *** The determination of what is accepted is here.
michael@0 18 #
michael@0 19 # This file is processed by a perl script "regexcst.pl" to produce initialized C arrays
michael@0 20 # that are then built with the rule parser.
michael@0 21 #
michael@0 22
michael@0 23 #
michael@0 24 # Here is the syntax of the state definitions in this file:
michael@0 25 #
michael@0 26 #
michael@0 27 #StateName:
michael@0 28 # input-char n next-state ^push-state action
michael@0 29 # input-char n next-state ^push-state action
michael@0 30 # | | | | |
michael@0 31 # | | | | |--- action to be performed by state machine
michael@0 32 # | | | | See function RegexCompile::doParseActions()
michael@0 33 # | | | |
michael@0 34 # | | | |--- Push this named state onto the state stack.
michael@0 35 # | | | Later, when next state is specified as "pop",
michael@0 36 # | | | the pushed state will become the current state.
michael@0 37 # | | |
michael@0 38 # | | |--- Transition to this state if the current input character matches the input
michael@0 39 # | | character or char class in the left hand column. "pop" causes the next
michael@0 40 # | | state to be popped from the state stack.
michael@0 41 # | |
michael@0 42 # | |--- When making the state transition specified on this line, advance to the next
michael@0 43 # | character from the input only if 'n' appears here.
michael@0 44 # |
michael@0 45 # |--- Character or named character classes to test for. If the current character being scanned
michael@0 46 # matches, perform the actions and go to the state specified on this line.
michael@0 47 # The input character is tested sequentially, in the order written. The characters and
michael@0 48 # character classes tested for do not need to be mutually exclusive. The first match wins.
michael@0 49 #
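The row syntax described above can be illustrated with a small sketch. This is not the regexcst.pl script; `Transition` and `parse_row` are hypothetical names, and the sketch assumes the fields are positional (input class, optional `n`, next state, optional `^push-state`, optional action) and that a whitespace-separated token starting with `#` after the first field begins a trailing comment.

```python
from typing import NamedTuple, Optional


class Transition(NamedTuple):
    input_class: str           # quoted char such as "'['" or a named class such as "default"
    advance: bool              # True when the 'n' flag is present
    next_state: Optional[str]  # name of the next state, or "pop"
    push_state: Optional[str]  # state named after '^', if any
    action: Optional[str]      # action name such as "doSetBegin", if any


def parse_row(line: str) -> Transition:
    """Split one transition row into its five fields (illustrative only)."""
    tokens = line.split()
    input_class, rest = tokens[0], tokens[1:]

    # Drop a trailing comment. The first field is never treated as a comment,
    # so a quoted '#' input character survives.
    for i, tok in enumerate(rest):
        if tok.startswith("#"):
            rest = rest[:i]
            break

    # The optional 'n' flag, when present, comes right after the input class.
    advance = bool(rest) and rest[0] == "n"
    if advance:
        rest = rest[1:]

    next_state = push_state = action = None
    for tok in rest:
        if tok.startswith("^"):
            push_state = tok[1:]
        elif next_state is None:
            next_state = tok
        else:
            action = tok
    return Transition(input_class, advance, next_state, push_state, action)
```

For example, `parse_row("'[' n set-open ^set-finish doSetBegin")` yields a row that advances the input, moves to `set-open`, pushes `set-finish`, and runs `doSetBegin`.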
michael@0 50
michael@0 51
michael@0 52
michael@0 53
michael@0 54 #
michael@0 55 # start state, scan position is at the beginning of the pattern.
michael@0 56 #
michael@0 57 start:
michael@0 58 default term doPatStart
michael@0 59
michael@0 60
michael@0 61
michael@0 62
michael@0 63 #
michael@0 64 # term. At a position where we can accept the start of most items in a pattern.
michael@0 65 #
michael@0 66 term:
michael@0 67 quoted n expr-quant doLiteralChar
michael@0 68 rule_char n expr-quant doLiteralChar
michael@0 69 '[' n set-open ^set-finish doSetBegin
michael@0 70 '(' n open-paren
michael@0 71 '.' n expr-quant doDotAny
michael@0 72 '^' n expr-quant doCaret
michael@0 73 '$' n expr-quant doDollar
michael@0 74 '\' n backslash
michael@0 75 '|' n term doOrOperator
michael@0 76 ')' n pop doCloseParen
michael@0 77 eof term doPatFinish
michael@0 78 default errorDeath doRuleError
michael@0 79
michael@0 80
michael@0 81
michael@0 82 #
michael@0 83 # expr-quant We've just finished scanning a term, now look for the optional
michael@0 84 # trailing quantifier - *, +, ?, *?, etc.
michael@0 85 #
michael@0 86 expr-quant:
michael@0 87 '*' n quant-star
michael@0 88 '+' n quant-plus
michael@0 89 '?' n quant-opt
michael@0 90 '{' n interval-open doIntervalInit
michael@0 91 '(' n open-paren-quant
michael@0 92 default expr-cont
michael@0 93
michael@0 94
michael@0 95 #
michael@0 96 # expr-cont Expression, continuation. At a point where additional terms are
michael@0 97 # allowed, but not required. No Quantifiers
michael@0 98 #
michael@0 99 expr-cont:
michael@0 100 '|' n term doOrOperator
michael@0 101 ')' n pop doCloseParen
michael@0 102 default term
michael@0 103
michael@0 104
michael@0 105 #
michael@0 106 # open-paren-quant Special case handling for comments appearing before a quantifier,
michael@0 107 # e.g. x(?#comment )*
michael@0 108 # Open parens from expr-quant come here; anything but a (?# comment
michael@0 109 # branches into the normal parenthesis sequence as quickly as possible.
michael@0 110 #
michael@0 111 open-paren-quant:
michael@0 112 '?' n open-paren-quant2 doSuppressComments
michael@0 113 default open-paren
michael@0 114
michael@0 115 open-paren-quant2:
michael@0 116 '#' n paren-comment ^expr-quant
michael@0 117 default open-paren-extended
michael@0 118
michael@0 119
michael@0 120 #
michael@0 121 # open-paren We've got an open paren. We need to scan further to
michael@0 122 # determine what kind of group it opens - plain (, (?:, (?>, or whatever.
michael@0 123 #
michael@0 124 open-paren:
michael@0 125 '?' n open-paren-extended doSuppressComments
michael@0 126 default term ^expr-quant doOpenCaptureParen
michael@0 127
michael@0 128 open-paren-extended:
michael@0 129 ':' n term ^expr-quant doOpenNonCaptureParen # (?:
michael@0 130 '>' n term ^expr-quant doOpenAtomicParen # (?>
michael@0 131 '=' n term ^expr-cont doOpenLookAhead # (?=
michael@0 132 '!' n term ^expr-cont doOpenLookAheadNeg # (?!
michael@0 133 '<' n open-paren-lookbehind
michael@0 134 '#' n paren-comment ^term
michael@0 135 'i' paren-flag doBeginMatchMode
michael@0 136 'd' paren-flag doBeginMatchMode
michael@0 137 'm' paren-flag doBeginMatchMode
michael@0 138 's' paren-flag doBeginMatchMode
michael@0 139 'u' paren-flag doBeginMatchMode
michael@0 140 'w' paren-flag doBeginMatchMode
michael@0 141 'x' paren-flag doBeginMatchMode
michael@0 142 '-' paren-flag doBeginMatchMode
michael@0 143 '(' n errorDeath doConditionalExpr
michael@0 144 '{' n errorDeath doPerlInline
michael@0 145 default errorDeath doBadOpenParenType
michael@0 146
michael@0 147 open-paren-lookbehind:
michael@0 148 '=' n term ^expr-cont doOpenLookBehind # (?<=
michael@0 149 '!' n term ^expr-cont doOpenLookBehindNeg # (?<!
michael@0 150 default errorDeath doBadOpenParenType
michael@0 151
michael@0 152
michael@0 153 #
michael@0 154 # paren-comment We've got a (?# ... ) style comment. Eat pattern text till we get to the ')'
michael@0 155 #
michael@0 156 paren-comment:
michael@0 157 ')' n pop
michael@0 158 eof errorDeath doMismatchedParenErr
michael@0 159 default n paren-comment
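The push/pop mechanism used for `x(?#comment)*` can be traced with a toy driver. This is a sketch under stated assumptions, not the ICU implementation: `TABLE` is a hand-transcribed subset of the rows in this file, `matches` and `run` are hypothetical helpers, `rule_char` is approximated as "any alphanumeric character", actions are only recorded by name, and the driver stops on `doPatFinish` or on a transition to `errorDeath`.

```python
# Each row: (input class, advance flag, next state, push state, action)
TABLE = {
    "term": [
        ("rule_char", True, "expr-quant", None, "doLiteralChar"),
        ("eof", False, "term", None, "doPatFinish"),
    ],
    "expr-quant": [
        ("*", True, "quant-star", None, None),
        ("(", True, "open-paren-quant", None, None),
        ("default", False, "expr-cont", None, None),
    ],
    "expr-cont": [("default", False, "term", None, None)],
    "open-paren-quant": [("?", True, "open-paren-quant2", None, "doSuppressComments")],
    "open-paren-quant2": [("#", True, "paren-comment", "expr-quant", None)],
    "paren-comment": [
        (")", True, "pop", None, None),
        ("eof", False, "errorDeath", None, "doMismatchedParenErr"),
        ("default", True, "paren-comment", None, None),
    ],
    "quant-star": [("default", False, "expr-cont", None, "doStar")],
}


def matches(cls, ch):
    """ch is None at end of pattern; 'default' matches anything."""
    if cls == "default":
        return True
    if cls == "eof":
        return ch is None
    if cls == "rule_char":  # simplification, see lead-in
        return ch is not None and ch.isalnum()
    return ch == cls


def run(pattern):
    """Return the list of action names triggered while scanning pattern."""
    state, stack, pos, actions = "term", [], 0, []
    while True:
        ch = pattern[pos] if pos < len(pattern) else None
        # Rows are tested in the order written; the first match wins.
        for cls, advance, nxt, push, action in TABLE[state]:
            if matches(cls, ch):
                break
        else:
            raise ValueError(f"no row for {ch!r} in state {state} (subset table)")
        if action:
            actions.append(action)
        if action == "doPatFinish" or nxt == "errorDeath":
            return actions
        if push:
            stack.append(push)
        if advance:
            pos += 1
        # "pop" resumes whatever state was pushed earlier.
        state = stack.pop() if nxt == "pop" else nxt
```

In this sketch, scanning `x(?#note)*` pushes `expr-quant` when the `#` is seen, so the `)` that ends the comment pops back to a state where the following `*` is still accepted as a quantifier on `x`.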
michael@0 160
michael@0 161 #
michael@0 162 # paren-flag Scanned a (?ismx-ismx flag setting
michael@0 163 #
michael@0 164 paren-flag:
michael@0 165 'i' n paren-flag doMatchMode
michael@0 166 'd' n paren-flag doMatchMode
michael@0 167 'm' n paren-flag doMatchMode
michael@0 168 's' n paren-flag doMatchMode
michael@0 169 'u' n paren-flag doMatchMode
michael@0 170 'w' n paren-flag doMatchMode
michael@0 171 'x' n paren-flag doMatchMode
michael@0 172 '-' n paren-flag doMatchMode
michael@0 173 ')' n term doSetMatchMode
michael@0 174 ':' n term ^expr-quant doMatchModeParen
michael@0 175 default errorDeath doBadModeFlag
michael@0 176
michael@0 177
michael@0 178 #
michael@0 179 # quant-star Scanning a '*' quantifier. Need to look ahead to decide
michael@0 180 # between plain '*', '*?', '*+'
michael@0 181 #
michael@0 182 quant-star:
michael@0 183 '?' n expr-cont doNGStar # *?
michael@0 184 '+' n expr-cont doPossessiveStar # *+
michael@0 185 default expr-cont doStar
michael@0 186
michael@0 187
michael@0 188 #
michael@0 189 # quant-plus Scanning a '+' quantifier. Need to look ahead to decide
michael@0 190 # between plain '+', '+?', '++'
michael@0 191 #
michael@0 192 quant-plus:
michael@0 193 '?' n expr-cont doNGPlus # +?
michael@0 194 '+' n expr-cont doPossessivePlus # ++
michael@0 195 default expr-cont doPlus
michael@0 196
michael@0 197
michael@0 198 #
michael@0 199 # quant-opt Scanning a '?' quantifier. Need to look ahead to decide
michael@0 200 # between plain '?', '??', '?+'
michael@0 201 #
michael@0 202 quant-opt:
michael@0 203 '?' n expr-cont doNGOpt # ??
michael@0 204 '+' n expr-cont doPossessiveOpt # ?+
michael@0 205 default expr-cont doOpt # ?
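The one-character lookahead shared by quant-star, quant-plus and quant-opt can be summarized in a few lines. This is an illustrative sketch only; `classify_quantifier` is a hypothetical helper, and it simply returns the name of the action the table would select together with the position after the quantifier.

```python
def classify_quantifier(pattern, pos):
    """pattern[pos] must be '*', '+' or '?'. Return (action name, next position)."""
    names = {"*": "Star", "+": "Plus", "?": "Opt"}
    base = names[pattern[pos]]
    # Peek at the character after the quantifier, if there is one.
    following = pattern[pos + 1] if pos + 1 < len(pattern) else ""
    if following == "?":
        return "doNG" + base, pos + 2          # non-greedy: *?  +?  ??
    if following == "+":
        return "doPossessive" + base, pos + 2  # possessive: *+  ++  ?+
    return "do" + base, pos + 1                # plain greedy: *  +  ?
```

The default rows in the table carry no `n` flag, which corresponds to the last branch here: the peeked character is not consumed and is left for expr-cont.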
michael@0 206
michael@0 207
michael@0 208 #
michael@0 209 # Interval scanning a '{', the opening delimiter for an interval specification
michael@0 210 # {number} or {min, max} or {min,}
michael@0 211 #
michael@0 212 interval-open:
michael@0 213 digit_char interval-lower
michael@0 214 default errorDeath doIntervalError
michael@0 215
michael@0 216 interval-lower:
michael@0 217 digit_char n interval-lower doIntevalLowerDigit
michael@0 218 ',' n interval-upper
michael@0 219 '}' n interval-type doIntervalSame # {n}
michael@0 220 default errorDeath doIntervalError
michael@0 221
michael@0 222 interval-upper:
michael@0 223 digit_char n interval-upper doIntervalUpperDigit
michael@0 224 '}' n interval-type
michael@0 225 default errorDeath doIntervalError
michael@0 226
michael@0 227 interval-type:
michael@0 228 '?' n expr-cont doNGInterval # {n,m}?
michael@0 229 '+' n expr-cont doPossessiveInterval # {n,m}+
michael@0 230 default expr-cont doInterval # {n,m}
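The four interval states can be walked through by hand with a short sketch. It is not the ICU code: `scan_interval` is a hypothetical helper, and representing "no upper bound" as -1 and starting the lower bound at 0 are assumptions made for the illustration.

```python
def scan_interval(pattern, pos):
    """Scan an interval whose '{' is at pattern[pos].

    Returns (low, high, action name, next position); high == -1 means
    no upper bound was given (assumed convention for this sketch).
    """
    assert pattern[pos] == "{"
    pos += 1
    low, high = 0, -1
    state = "interval-open"
    while True:
        ch = pattern[pos] if pos < len(pattern) else ""
        is_digit = "0" <= ch <= "9"
        if state == "interval-open":
            if not is_digit:
                raise ValueError("doIntervalError")
            state = "interval-lower"      # no 'n' flag: the digit is read again
        elif state == "interval-lower":
            if is_digit:
                low = low * 10 + int(ch)
                pos += 1
            elif ch == ",":
                pos += 1
                state = "interval-upper"
            elif ch == "}":
                high = low                # {n} means exactly n
                pos += 1
                state = "interval-type"
            else:
                raise ValueError("doIntervalError")
        elif state == "interval-upper":
            if is_digit:
                high = max(high, 0) * 10 + int(ch)
                pos += 1
            elif ch == "}":
                pos += 1
                state = "interval-type"
            else:
                raise ValueError("doIntervalError")
        else:  # interval-type: look for a trailing '?' or '+'
            if ch == "?":
                return low, high, "doNGInterval", pos + 1
            if ch == "+":
                return low, high, "doPossessiveInterval", pos + 1
            return low, high, "doInterval", pos
```

Note that `{,3}` is rejected in interval-open, matching the table: the first character after `{` must be a digit.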
michael@0 231
michael@0 232
michael@0 233 #
michael@0 234 # backslash # Backslash. Figure out which of the \thingies we have encountered.
michael@0 235 # The low level next-char function will have preprocessed
michael@0 236 # some of them already; those won't come here.
michael@0 237 backslash:
michael@0 238 'A' n term doBackslashA
michael@0 239 'B' n term doBackslashB
michael@0 240 'b' n term doBackslashb
michael@0 241 'd' n expr-quant doBackslashd
michael@0 242 'D' n expr-quant doBackslashD
michael@0 243 'G' n term doBackslashG
michael@0 244 'N' expr-quant doNamedChar # \N{NAME} named char
michael@0 245 'p' expr-quant doProperty # \p{Lu} style property
michael@0 246 'P' expr-quant doProperty
michael@0 247 'Q' n term doEnterQuoteMode
michael@0 248 'S' n expr-quant doBackslashS
michael@0 249 's' n expr-quant doBackslashs
michael@0 250 'W' n expr-quant doBackslashW
michael@0 251 'w' n expr-quant doBackslashw
michael@0 252 'X' n expr-quant doBackslashX
michael@0 253 'Z' n term doBackslashZ
michael@0 254 'z' n term doBackslashz
michael@0 255 digit_char n expr-quant doBackRef # Will scan multiple digits
michael@0 256 eof errorDeath doEscapeError
michael@0 257 default n expr-quant doEscapedLiteralChar
michael@0 258
michael@0 259
michael@0 260
michael@0 261 #
michael@0 262 # [set expression] parsing,
michael@0 263 # All states involved in parsing set expressions have names beginning with "set-"
michael@0 264 #
michael@0 265
michael@0 266 set-open:
michael@0 267 '^' n set-open2 doSetNegate
michael@0 268 ':' set-posix doSetPosixProp
michael@0 269 default set-open2
michael@0 270
michael@0 271 set-open2:
michael@0 272 ']' n set-after-lit doSetLiteral
michael@0 273 default set-start
michael@0 274
michael@0 275 # set-posix:
michael@0 276 # scanned a '[:' If it really is a [:property:], doSetPosixProp will have
michael@0 277 # moved the scan to the closing ']'. If it wasn't a property
michael@0 278 # expression, the scan will still be at the opening ':', which should
michael@0 279 # be interpreted as a normal set expression.
michael@0 280 set-posix:
michael@0 281 ']' n pop doSetEnd
michael@0 282 ':' set-start
michael@0 283 default errorDeath doRuleError # should not be possible.
michael@0 284
michael@0 285 #
michael@0 286 # set-start after the [ and special case leading characters (^ and/or ]) but before
michael@0 287 # everything else. A '-' is literal at this point.
michael@0 288 #
michael@0 289 set-start:
michael@0 290 ']' n pop doSetEnd
michael@0 291 '[' n set-open ^set-after-set doSetBeginUnion
michael@0 292 '\' n set-escape
michael@0 293 '-' n set-start-dash
michael@0 294 '&' n set-start-amp
michael@0 295 default n set-after-lit doSetLiteral
michael@0 296
michael@0 297 # set-start-dash Turn "[--" into a syntax error.
michael@0 298 # "[-x" is good, - and x are literals.
michael@0 299 #
michael@0 300 set-start-dash:
michael@0 301 '-' errorDeath doRuleError
michael@0 302 default set-after-lit doSetAddDash
michael@0 303
michael@0 304 # set-start-amp Turn "[&&" into a syntax error.
michael@0 305 # "[&x" is good, & and x are literals.
michael@0 306 #
michael@0 307 set-start-amp:
michael@0 308 '&' errorDeath doRuleError
michael@0 309 default set-after-lit doSetAddAmp
michael@0 310
michael@0 311 #
michael@0 312 # set-after-lit The last thing scanned was a literal character within a set.
michael@0 313 # Can be followed by anything. Single '-' or '&' are
michael@0 314 # literals in this context, not operators.
michael@0 315 set-after-lit:
michael@0 316 ']' n pop doSetEnd
michael@0 317 '[' n set-open ^set-after-set doSetBeginUnion
michael@0 318 '-' n set-lit-dash
michael@0 319 '&' n set-lit-amp
michael@0 320 '\' n set-escape
michael@0 321 eof errorDeath doSetNoCloseError
michael@0 322 default n set-after-lit doSetLiteral
michael@0 323
michael@0 324 set-after-set:
michael@0 325 ']' n pop doSetEnd
michael@0 326 '[' n set-open ^set-after-set doSetBeginUnion
michael@0 327 '-' n set-set-dash
michael@0 328 '&' n set-set-amp
michael@0 329 '\' n set-escape
michael@0 330 eof errorDeath doSetNoCloseError
michael@0 331 default n set-after-lit doSetLiteral
michael@0 332
michael@0 333 set-after-range:
michael@0 334 ']' n pop doSetEnd
michael@0 335 '[' n set-open ^set-after-set doSetBeginUnion
michael@0 336 '-' n set-range-dash
michael@0 337 '&' n set-range-amp
michael@0 338 '\' n set-escape
michael@0 339 eof errorDeath doSetNoCloseError
michael@0 340 default n set-after-lit doSetLiteral
michael@0 341
michael@0 342
michael@0 343 # set-after-op
michael@0 344 # After a -- or &&
michael@0 345 # It is an error to close a set at this point.
michael@0 346 #
michael@0 347 set-after-op:
michael@0 348 '[' n set-open ^set-after-set doSetBeginUnion
michael@0 349 ']' errorDeath doSetOpError
michael@0 350 '\' n set-escape
michael@0 351 default n set-after-lit doSetLiteral
michael@0 352
michael@0 353 #
michael@0 354 # set-set-amp
michael@0 355 # Have scanned [[set]&
michael@0 356 # Could be a '&' intersection operator, if a set follows.
michael@0 357 # Could be the start of a '&&' operator.
michael@0 358 # Otherwise is a literal.
michael@0 359 set-set-amp:
michael@0 360 '[' n set-open ^set-after-set doSetBeginIntersection1
michael@0 361 '&' n set-after-op doSetIntersection2
michael@0 362 default set-after-lit doSetAddAmp
michael@0 363
michael@0 364
michael@0 365 # set-lit-amp Have scanned "[literals&"
michael@0 366 # Could be a start of "&&" operator or a literal
michael@0 367 # In [abc&[def]], the '&' is a literal
michael@0 368 #
michael@0 369 set-lit-amp:
michael@0 370 '&' n set-after-op doSetIntersection2
michael@0 371 default set-after-lit doSetAddAmp
michael@0 372
michael@0 373
michael@0 374 #
michael@0 375 # set-set-dash
michael@0 376 # Have scanned [set]-
michael@0 377 # Could be a '-' difference operator, if a [set] follows.
michael@0 378 # Could be the start of a '--' operator.
michael@0 379 # Otherwise is a literal.
michael@0 380 set-set-dash:
michael@0 381 '[' n set-open ^set-after-set doSetBeginDifference1
michael@0 382 '-' n set-after-op doSetDifference2
michael@0 383 default set-after-lit doSetAddDash
michael@0 384
michael@0 385
michael@0 386 #
michael@0 387 # set-range-dash
michael@0 388 # scanned a-b- or \w-
michael@0 389 # any set or range like item where the trailing single '-' should
michael@0 390 # be literal, not a set difference operation.
michael@0 391 # A trailing "--" is still a difference operator.
michael@0 392 set-range-dash:
michael@0 393 '-' n set-after-op doSetDifference2
michael@0 394 default set-after-lit doSetAddDash
michael@0 395
michael@0 396
michael@0 397 set-range-amp:
michael@0 398 '&' n set-after-op doSetIntersection2
michael@0 399 default set-after-lit doSetAddAmp
michael@0 400
michael@0 401
michael@0 402 # set-lit-dash
michael@0 403 # Have scanned "[literals-" Could be a range or a -- operator or a literal
michael@0 404 # In [abc-[def]], the '-' is a literal (confirmed with a Java test)
michael@0 405 # [abc-\p{xx}] the '-' is an error
michael@0 406 # [abc-] the '-' is a literal
michael@0 407 # [ab-xy] the '-' is a range
michael@0 408 #
michael@0 409 set-lit-dash:
michael@0 410 '-' n set-after-op doSetDifference2
michael@0 411 '[' set-after-lit doSetAddDash
michael@0 412 ']' set-after-lit doSetAddDash
michael@0 413 '\' n set-lit-dash-escape
michael@0 414 default n set-after-range doSetRange
michael@0 415
michael@0 416 # set-lit-dash-escape
michael@0 417 #
michael@0 418 # scanned "[literal-\"
michael@0 419 # Could be a range, if the \ introduces an escaped literal char or a named char.
michael@0 420 # Otherwise it is an error.
michael@0 421 #
michael@0 422 set-lit-dash-escape:
michael@0 423 's' errorDeath doSetOpError
michael@0 424 'S' errorDeath doSetOpError
michael@0 425 'w' errorDeath doSetOpError
michael@0 426 'W' errorDeath doSetOpError
michael@0 427 'd' errorDeath doSetOpError
michael@0 428 'D' errorDeath doSetOpError
michael@0 429 'N' set-after-range doSetNamedRange
michael@0 430 default n set-after-range doSetRange
michael@0 431
michael@0 432
michael@0 433 #
michael@0 434 # set-escape
michael@0 435 # Common back-slash escape processing within set expressions
michael@0 436 #
michael@0 437 set-escape:
michael@0 438 'p' set-after-set doSetProp
michael@0 439 'P' set-after-set doSetProp
michael@0 440 'N' set-after-lit doSetNamedChar
michael@0 441 's' n set-after-range doSetBackslash_s
michael@0 442 'S' n set-after-range doSetBackslash_S
michael@0 443 'w' n set-after-range doSetBackslash_w
michael@0 444 'W' n set-after-range doSetBackslash_W
michael@0 445 'd' n set-after-range doSetBackslash_d
michael@0 446 'D' n set-after-range doSetBackslash_D
michael@0 447 default n set-after-lit doSetLiteralEscaped
michael@0 448
michael@0 449 #
michael@0 450 # set-finish
michael@0 451 # Have just encountered the final ']' that completes a [set], and
michael@0 452 # arrived here via a pop. From here, we exit the set parsing world, and go
michael@0 453 # back to generic regular expression parsing.
michael@0 454 #
michael@0 455 set-finish:
michael@0 456 default expr-quant doSetFinish
michael@0 457
michael@0 458
michael@0 459 #
michael@0 460 # errorDeath. This state is specified as the next state whenever a syntax error
michael@0 461 # in the pattern is detected. Barring bugs, the state machine will never
michael@0 462 # actually get here, but will stop because of the action associated with the error.
michael@0 463 # But, just in case, this state asks the state machine to exit.
michael@0 464 errorDeath:
michael@0 465 default n errorDeath doExit
michael@0 466
michael@0 467
