--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/intl/icu/source/common/unicode/ushape.h Wed Dec 31 06:09:35 2014 +0100
@@ -0,0 +1,474 @@
+/*
+******************************************************************************
+*
+* Copyright (C) 2000-2012, International Business Machines
+* Corporation and others. All Rights Reserved.
+*
+******************************************************************************
+* file name: ushape.h
+* encoding: US-ASCII
+* tab size: 8 (not used)
+* indentation:4
+*
+* created on: 2000jun29
+* created by: Markus W. Scherer
+*/
+
+#ifndef __USHAPE_H__
+#define __USHAPE_H__
+
+#include "unicode/utypes.h"
+
+/**
+ * \file
+ * \brief C API: Arabic shaping
+ *
+ */
+
+/**
+ * Shape Arabic text on a character basis.
+ *
+ * <p>This function performs basic operations for "shaping" Arabic text. It is most
+ * useful for use with legacy data formats and legacy display technology
+ * (simple terminals). All operations are performed on Unicode characters.</p>
+ *
+ * <p>Text-based shaping means that some character code points in the text are
+ * replaced by others depending on the context. It transforms one kind of text
+ * into another. In comparison, modern displays for Arabic text select
+ * appropriate, context-dependent font glyphs for each text element, which means
+ * that they transform text into a glyph vector.</p>
+ *
+ * <p>Text transformations are necessary when modern display technology is not
+ * available or when text needs to be transformed to or from legacy formats that
+ * use "shaped" characters. Since the Arabic script is cursive, connecting
+ * adjacent letters to each other, computers select images for each letter based
+ * on the surrounding letters.
+ * This usually results in four images per Arabic
+ * letter: initial, middle, final, and isolated forms. In Unicode, on the other
+ * hand, letters are normally stored abstractly, and a display system is expected
+ * to select the necessary glyphs. (This makes searching and other text
+ * processing easier because the same letter has only one code.) It is possible
+ * to mimic this with text transformations because there are characters in
+ * Unicode that are rendered as letters with a specific shape
+ * (or cursive connectivity). They were included for interoperability with
+ * legacy systems and codepages, and for unsophisticated display systems.</p>
+ *
+ * <p>A second kind of text transformation is supported for Arabic digits:
+ * For compatibility with legacy codepages that only include European digits,
+ * it is possible to replace one set of digits by another, changing the
+ * character code points. These operations can be performed for either
+ * Arabic-Indic Digits (U+0660...U+0669) or Eastern (Extended) Arabic-Indic
+ * digits (U+06f0...U+06f9).</p>
+ *
+ * <p>Some replacements may result in more or fewer characters (code points).
+ * By default, this means that the destination buffer may receive text with a
+ * length different from the source length. Some legacy systems rely on the
+ * length of the text to be constant. They expect extra spaces to be added
+ * or consumed either next to the affected character or at the end of the
+ * text.</p>
+ *
+ * <p>For details about the available operations, see the description of the
+ * <code>U_SHAPE_...</code> options.</p>
+ *
+ * @param source The input text.
+ *
+ * @param sourceLength The number of UChars in <code>source</code>.
+ *
+ * @param dest The destination buffer that will receive the results of the
+ * requested operations.
+ * It may be <code>NULL</code> only if
+ * <code>destSize</code> is 0. The source and destination must not
+ * overlap.
+ *
+ * @param destSize The size (capacity) of the destination buffer in UChars.
+ * If <code>destSize</code> is 0, then no output is produced,
+ * but the necessary buffer size is returned ("preflighting").
+ *
+ * @param options This is a 32-bit set of flags that specify the operations
+ * that are performed on the input text. If no error occurs,
+ * then the result will always be written to the destination
+ * buffer.
+ *
+ * @param pErrorCode must be a valid pointer to an error code value,
+ * which must not indicate a failure before the function call.
+ *
+ * @return The number of UChars written to the destination buffer.
+ * If an error occurred, then no output was written, or it may be
+ * incomplete. If <code>U_BUFFER_OVERFLOW_ERROR</code> is set, then
+ * the return value indicates the necessary destination buffer size.
+ * @stable ICU 2.0
+ */
+U_STABLE int32_t U_EXPORT2
+u_shapeArabic(const UChar *source, int32_t sourceLength,
+              UChar *dest, int32_t destSize,
+              uint32_t options,
+              UErrorCode *pErrorCode);
+
+/**
+ * Memory option: allow the result to have a different length than the source.
+ * Affects: LamAlef options
+ * @stable ICU 2.0
+ */
+#define U_SHAPE_LENGTH_GROW_SHRINK 0
+
+/**
+ * Memory option: allow the result to have a different length than the source.
+ * Affects: LamAlef options
+ * This option is an alias to U_SHAPE_LENGTH_GROW_SHRINK
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_LAMALEF_RESIZE 0
+
+/**
+ * Memory option: the result must have the same length as the source.
+ * If more room is necessary, then try to consume spaces next to modified characters.
+ * @stable ICU 2.0
+ */
+#define U_SHAPE_LENGTH_FIXED_SPACES_NEAR 1
+
+/**
+ * Memory option: the result must have the same length as the source.
+ * If more room is necessary, then try to consume spaces next to modified characters.
+ * Affects: LamAlef options
+ * This option is an alias to U_SHAPE_LENGTH_FIXED_SPACES_NEAR
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_LAMALEF_NEAR 1
+
+/**
+ * Memory option: the result must have the same length as the source.
+ * If more room is necessary, then try to consume spaces at the end of the text.
+ * @stable ICU 2.0
+ */
+#define U_SHAPE_LENGTH_FIXED_SPACES_AT_END 2
+
+/**
+ * Memory option: the result must have the same length as the source.
+ * If more room is necessary, then try to consume spaces at the end of the text.
+ * Affects: LamAlef options
+ * This option is an alias to U_SHAPE_LENGTH_FIXED_SPACES_AT_END
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_LAMALEF_END 2
+
+/**
+ * Memory option: the result must have the same length as the source.
+ * If more room is necessary, then try to consume spaces at the beginning of the text.
+ * @stable ICU 2.0
+ */
+#define U_SHAPE_LENGTH_FIXED_SPACES_AT_BEGINNING 3
+
+/**
+ * Memory option: the result must have the same length as the source.
+ * If more room is necessary, then try to consume spaces at the beginning of the text.
+ * Affects: LamAlef options
+ * This option is an alias to U_SHAPE_LENGTH_FIXED_SPACES_AT_BEGINNING
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_LAMALEF_BEGIN 3
+
+
+/**
+ * Memory option: the result must have the same length as the source.
+ * Shaping Mode: For each LAMALEF character found, expand LAMALEF using space at end.
+ * If there is no space at end, use spaces at beginning of the buffer. If there
+ * is no space at beginning of the buffer, use the space near the character (i.e. the space
+ * after the LAMALEF character).
+ * If there are no spaces found, an error U_NO_SPACE_AVAILABLE (as defined in utypes.h)
+ * will be set in pErrorCode.
+ *
+ * Deshaping Mode: Performs the same function as U_SHAPE_LAMALEF_END.
+ * Affects: LamAlef options
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_LAMALEF_AUTO 0x10000
+
+/** Bit mask for memory options. @stable ICU 2.0 */
+#define U_SHAPE_LENGTH_MASK 0x10003 /* Changed old value 3 */
+
+
+/**
+ * Bit mask for LamAlef memory options.
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_LAMALEF_MASK 0x10003 /* updated */
+
+/** Direction indicator: the source is in logical (keyboard) order. @stable ICU 2.0 */
+#define U_SHAPE_TEXT_DIRECTION_LOGICAL 0
+
+/**
+ * Direction indicator:
+ * the source is in visual RTL order,
+ * the rightmost displayed character stored first.
+ * This option is an alias to U_SHAPE_TEXT_DIRECTION_LOGICAL
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_TEXT_DIRECTION_VISUAL_RTL 0
+
+/**
+ * Direction indicator:
+ * the source is in visual LTR order,
+ * the leftmost displayed character stored first.
+ * @stable ICU 2.0
+ */
+#define U_SHAPE_TEXT_DIRECTION_VISUAL_LTR 4
+
+/** Bit mask for direction indicators. @stable ICU 2.0 */
+#define U_SHAPE_TEXT_DIRECTION_MASK 4
+
+
+/** Letter shaping option: do not perform letter shaping. @stable ICU 2.0 */
+#define U_SHAPE_LETTERS_NOOP 0
+
+/** Letter shaping option: replace abstract letter characters by "shaped" ones.
+ * @stable ICU 2.0 */
+#define U_SHAPE_LETTERS_SHAPE 8
+
+/** Letter shaping option: replace "shaped" letter characters by abstract ones. @stable ICU 2.0 */
+#define U_SHAPE_LETTERS_UNSHAPE 0x10
+
+/**
+ * Letter shaping option: replace abstract letter characters by "shaped" ones.
+ * The only difference with U_SHAPE_LETTERS_SHAPE is that Tashkeel letters
+ * are always "shaped" into the isolated form instead of the medial form
+ * (selecting code points from the Arabic Presentation Forms-B block).
+ * @stable ICU 2.0
+ */
+#define U_SHAPE_LETTERS_SHAPE_TASHKEEL_ISOLATED 0x18
+
+
+/** Bit mask for letter shaping options. @stable ICU 2.0 */
+#define U_SHAPE_LETTERS_MASK 0x18
+
+
+/** Digit shaping option: do not perform digit shaping. @stable ICU 2.0 */
+#define U_SHAPE_DIGITS_NOOP 0
+
+/**
+ * Digit shaping option:
+ * Replace European digits (U+0030...) by Arabic-Indic digits.
+ * @stable ICU 2.0
+ */
+#define U_SHAPE_DIGITS_EN2AN 0x20
+
+/**
+ * Digit shaping option:
+ * Replace Arabic-Indic digits by European digits (U+0030...).
+ * @stable ICU 2.0
+ */
+#define U_SHAPE_DIGITS_AN2EN 0x40
+
+/**
+ * Digit shaping option:
+ * Replace European digits (U+0030...) by Arabic-Indic digits if the most recent
+ * strongly directional character is an Arabic letter
+ * (<code>u_charDirection()</code> result <code>U_RIGHT_TO_LEFT_ARABIC</code> [AL]).<br>
+ * The direction of "preceding" depends on the direction indicator option.
+ * For the first characters, the preceding strongly directional character
+ * (initial state) is assumed to be not an Arabic letter
+ * (it is <code>U_LEFT_TO_RIGHT</code> [L] or <code>U_RIGHT_TO_LEFT</code> [R]).
+ * @stable ICU 2.0
+ */
+#define U_SHAPE_DIGITS_ALEN2AN_INIT_LR 0x60
+
+/**
+ * Digit shaping option:
+ * Replace European digits (U+0030...) by Arabic-Indic digits if the most recent
+ * strongly directional character is an Arabic letter
+ * (<code>u_charDirection()</code> result <code>U_RIGHT_TO_LEFT_ARABIC</code> [AL]).<br>
+ * The direction of "preceding" depends on the direction indicator option.
+ * For the first characters, the preceding strongly directional character
+ * (initial state) is assumed to be an Arabic letter.
+ * @stable ICU 2.0
+ */
+#define U_SHAPE_DIGITS_ALEN2AN_INIT_AL 0x80
+
+/** Not a valid option value. May be replaced by a new option. @stable ICU 2.0 */
+#define U_SHAPE_DIGITS_RESERVED 0xa0
+
+/** Bit mask for digit shaping options. @stable ICU 2.0 */
+#define U_SHAPE_DIGITS_MASK 0xe0
+
+
+/** Digit type option: Use Arabic-Indic digits (U+0660...U+0669). @stable ICU 2.0 */
+#define U_SHAPE_DIGIT_TYPE_AN 0
+
+/** Digit type option: Use Eastern (Extended) Arabic-Indic digits (U+06f0...U+06f9). @stable ICU 2.0 */
+#define U_SHAPE_DIGIT_TYPE_AN_EXTENDED 0x100
+
+/** Not a valid option value. May be replaced by a new option. @stable ICU 2.0 */
+#define U_SHAPE_DIGIT_TYPE_RESERVED 0x200
+
+/** Bit mask for digit type options. @stable ICU 2.0 */
+#define U_SHAPE_DIGIT_TYPE_MASK 0x300 /* I need to change this from 0x3f00 to 0x300 */
+
+/**
+ * Tashkeel aggregation option:
+ * Replaces any combination of U+0651 with one of
+ * U+064C, U+064D, U+064E, U+064F, U+0650 with
+ * U+FC5E, U+FC5F, U+FC60, U+FC61, U+FC62, respectively.
+ * @stable ICU 3.6
+ */
+#define U_SHAPE_AGGREGATE_TASHKEEL 0x4000
+/** Tashkeel aggregation option: do not aggregate tashkeels.
+ * @stable ICU 3.6 */
+#define U_SHAPE_AGGREGATE_TASHKEEL_NOOP 0
+/** Bit mask for tashkeel aggregation. @stable ICU 3.6 */
+#define U_SHAPE_AGGREGATE_TASHKEEL_MASK 0x4000
+
+/**
+ * Presentation form option:
+ * Don't replace Arabic Presentation Forms-A and Arabic Presentation Forms-B
+ * characters with U+06xx characters, before shaping.
+ * @stable ICU 3.6
+ */
+#define U_SHAPE_PRESERVE_PRESENTATION 0x8000
+/** Presentation form option:
+ * Replace Arabic Presentation Forms-A and Arabic Presentation Forms-B with
+ * their unshaped counterparts in range U+06xx, before shaping.
+ * @stable ICU 3.6
+ */
+#define U_SHAPE_PRESERVE_PRESENTATION_NOOP 0
+/** Bit mask for preserve presentation form. @stable ICU 3.6 */
+#define U_SHAPE_PRESERVE_PRESENTATION_MASK 0x8000
+
+/* Seen Tail option */
+/**
+ * Memory option: the result must have the same length as the source.
+ * Shaping mode: The SEEN family character will expand into two characters using space near
+ * the SEEN family character (i.e. the space after the character).
+ * If there are no spaces found, an error U_NO_SPACE_AVAILABLE (as defined in utypes.h)
+ * will be set in pErrorCode.
+ *
+ * De-shaping mode: Any Seen character followed by Tail character will be
+ * replaced by one cell Seen and a space will replace the Tail.
+ * Affects: Seen options
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_SEEN_TWOCELL_NEAR 0x200000
+
+/**
+ * Bit mask for Seen memory options.
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_SEEN_MASK 0x700000
+
+/* YehHamza option */
+/**
+ * Memory option: the result must have the same length as the source.
+ * Shaping mode: The YEHHAMZA character will expand into two characters using space near it
+ * (i.e.
+ * the space after the character).
+ * If there are no spaces found, an error U_NO_SPACE_AVAILABLE (as defined in utypes.h)
+ * will be set in pErrorCode.
+ *
+ * De-shaping mode: Any Yeh (final or isolated) character followed by Hamza character will be
+ * replaced by one cell YehHamza and space will replace the Hamza.
+ * Affects: YehHamza options
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_YEHHAMZA_TWOCELL_NEAR 0x1000000
+
+
+/**
+ * Bit mask for YehHamza memory options.
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_YEHHAMZA_MASK 0x3800000
+
+/* New Tashkeel options */
+/**
+ * Memory option: the result must have the same length as the source.
+ * Shaping mode: Tashkeel characters will be replaced by spaces.
+ * Spaces will be placed at beginning of the buffer.
+ *
+ * De-shaping mode: N/A
+ * Affects: Tashkeel options
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_TASHKEEL_BEGIN 0x40000
+
+/**
+ * Memory option: the result must have the same length as the source.
+ * Shaping mode: Tashkeel characters will be replaced by spaces.
+ * Spaces will be placed at end of the buffer.
+ *
+ * De-shaping mode: N/A
+ * Affects: Tashkeel options
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_TASHKEEL_END 0x60000
+
+/**
+ * Memory option: allow the result to have a different length than the source.
+ * Shaping mode: Tashkeel characters will be removed, buffer length will shrink.
+ * De-shaping mode: N/A
+ *
+ * Affects: Tashkeel options
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_TASHKEEL_RESIZE 0x80000
+
+/**
+ * Memory option: the result must have the same length as the source.
+ * Shaping mode: Tashkeel characters will be replaced by Tatweel if they are connected to adjacent
+ * characters (i.e. shaped on Tatweel) or replaced by space if they are not connected.
+ *
+ * De-shaping mode: N/A
+ * Affects: Tashkeel options
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_TASHKEEL_REPLACE_BY_TATWEEL 0xC0000
+
+/**
+ * Bit mask for Tashkeel replacement with Space or Tatweel memory options.
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_TASHKEEL_MASK 0xE0000
+
+
+/* Space location Control options */
+/**
+ * This option affects the meaning of the BEGIN and END options. If this option is not used, the defaults
+ * for BEGIN and END are as follows:
+ * The default (for Visual LTR, Visual RTL and Logical text):
+ * 1. BEGIN always refers to the start address of physical memory.
+ * 2. END always refers to the end address of physical memory.
+ *
+ * If this option is used, it swaps the meaning of BEGIN and END only for Visual LTR text.
+ *
+ * The effect on the BEGIN and END memory options is as follows:
+ * A. BEGIN for Visual LTR text: the beginning (right side) of the visual text
+ * (corresponding to the physical memory address end for Visual LTR text; same as END in
+ * default behavior).
+ * B. BEGIN for Logical text: same as BEGIN in default behavior.
+ * C. END for Visual LTR text: the end (left side) of the visual text (corresponding
+ * to the physical memory address beginning for Visual LTR text; same as BEGIN in default behavior).
+ * D. END for Logical text: same as END in default behavior.
+ * Affects: All LamAlef BEGIN, END and AUTO options.
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_SPACES_RELATIVE_TO_TEXT_BEGIN_END 0x4000000
+
+/**
+ * Bit mask for swapping BEGIN and END for Visual LTR text
+ * @stable ICU 4.2
+ */
+#define U_SHAPE_SPACES_RELATIVE_TO_TEXT_MASK 0x4000000
+
+/**
+ * If this option is used, shaping will use the new Unicode code point for TAIL (i.e. 0xFE73).
+ * If this option is not specified (default), the old unofficial Unicode TAIL code point is used (i.e. 0x200B).
+ * De-shaping does not use this option: it always searches for both the new Unicode code point for the
+ * TAIL (i.e. 0xFE73) and the old unofficial Unicode TAIL code point (i.e. 0x200B) and de-shapes the
+ * Seen-Family letter accordingly.
+ *
+ * Shaping Mode: Only shaping.
+ * De-shaping Mode: N/A.
+ * Affects: All Seen options
+ * @stable ICU 4.8
+ */
+#define U_SHAPE_TAIL_NEW_UNICODE 0x8000000
+
+/**
+ * Bit mask for new Unicode Tail option
+ * @stable ICU 4.8
+ */
+#define U_SHAPE_TAIL_TYPE_MASK 0x8000000
+
+#endif
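
The header describes `options` as a 32-bit set of flags in which each group of options (length, direction, letters, digits, digit type, and so on) has its own bit mask. The sketch below illustrates how such a word might be composed with bitwise OR and taken apart again with the masks. It is only an illustration under stated assumptions: the `#define` values are copied from the header so the snippet compiles without ICU, and `shape_field`, `shape_with_field`, and `example_options` are hypothetical helper names that are not part of the ICU API.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from ushape.h so that this sketch builds without ICU. */
#define U_SHAPE_LENGTH_FIXED_SPACES_NEAR 1
#define U_SHAPE_LENGTH_MASK 0x10003
#define U_SHAPE_TEXT_DIRECTION_MASK 4
#define U_SHAPE_LETTERS_SHAPE 8
#define U_SHAPE_LETTERS_MASK 0x18
#define U_SHAPE_DIGITS_EN2AN 0x20
#define U_SHAPE_DIGITS_AN2EN 0x40
#define U_SHAPE_DIGITS_MASK 0xe0
#define U_SHAPE_DIGIT_TYPE_AN_EXTENDED 0x100
#define U_SHAPE_DIGIT_TYPE_MASK 0x300

/* Hypothetical helper: isolate one option group from a combined word. */
static uint32_t shape_field(uint32_t options, uint32_t mask)
{
    return options & mask;
}

/* Hypothetical helper: replace one option group, leaving the others alone. */
static uint32_t shape_with_field(uint32_t options, uint32_t mask, uint32_t value)
{
    assert((value & ~mask) == 0); /* the new value must fit inside its mask */
    return (options & ~mask) | value;
}

/* Hypothetical example word: shape letters, map European digits to
 * Eastern (Extended) Arabic-Indic digits, keep the length fixed by
 * consuming spaces near modified characters. Direction is left at 0
 * (logical order). */
static uint32_t example_options(void)
{
    return U_SHAPE_LETTERS_SHAPE
         | U_SHAPE_DIGITS_EN2AN
         | U_SHAPE_DIGIT_TYPE_AN_EXTENDED
         | U_SHAPE_LENGTH_FIXED_SPACES_NEAR;
}
```

A word built this way is what a caller would pass as the `options` argument of `u_shapeArabic`. Replacing a group through its mask, rather than OR-ing a second value on top, avoids accidentally producing a reserved combination such as `U_SHAPE_DIGITS_RESERVED`.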