--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/intl/icu/source/i18n/unicode/stsearch.h Wed Dec 31 06:09:35 2014 +0100
@@ -0,0 +1,518 @@
+/*
+**********************************************************************
+*   Copyright (C) 2001-2008 IBM and others. All rights reserved.
+**********************************************************************
+*  Date        Name        Description
+*  03/22/2000  helena      Creation.
+**********************************************************************
+*/
+
+#ifndef STSEARCH_H
+#define STSEARCH_H
+
+#include "unicode/utypes.h"
+
+/**
+ * \file
+ * \brief C++ API: Service for searching text based on RuleBasedCollator.
+ */
+
+#if !UCONFIG_NO_COLLATION && !UCONFIG_NO_BREAK_ITERATION
+
+#include "unicode/tblcoll.h"
+#include "unicode/coleitr.h"
+#include "unicode/search.h"
+
+U_NAMESPACE_BEGIN
+
+/**
+ *
+ * <tt>StringSearch</tt> is a <tt>SearchIterator</tt> that provides
+ * language-sensitive text searching based on the comparison rules defined
+ * in a {@link RuleBasedCollator} object.
+ * StringSearch ensures that language eccentricities can be
+ * handled, e.g. for the German collator, characters ß and SS will be matched
+ * if case is chosen to be ignored.
+ * See the <a href="http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm">
+ * "ICU Collation Design Document"</a> for more information.
+ * <p>
+ * The algorithm implemented is a modified form of the Boyer-Moore search.
+ * For more information on the algorithm, see
+ * <a href="http://icu-project.org/docs/papers/efficient_text_searching_in_java.html">
+ * "Efficient Text Searching in Java"</a>, published in <i>Java Report</i>
+ * in February 1999.
+ * <p>
+ * There are 2 match options for selection:<br>
+ * Let S' be the sub-string of a text string S between the offsets start and
+ * end <start, end>.
+ * <br>
+ * A pattern string P matches a text string S at the offsets <start, end>
+ * if
+ * <pre>
+ * option 1. Some canonical equivalent of P matches some canonical equivalent
+ *           of S'
+ * option 2. P matches S' and if P starts or ends with a combining mark,
+ *           there exists no non-ignorable combining mark before or after S'
+ *           in S respectively.
+ * </pre>
+ * Option 2 is the default.
+ * <p>
+ * This search has APIs similar to those of other text iteration mechanisms
+ * such as the break iterators in <tt>BreakIterator</tt>. Using these
+ * APIs, it is easy to scan through text looking for all occurrences of
+ * a given pattern. This search iterator allows changing of direction by
+ * calling a <tt>reset</tt> followed by a <tt>next</tt> or <tt>previous</tt>.
+ * Though a direction change can occur without calling <tt>reset</tt> first,
+ * this operation comes with some speed penalty.
+ * Matches found in the forward direction are the same as those found in
+ * the backward direction, in reverse order.
+ * <p>
+ * <tt>SearchIterator</tt> provides APIs to specify the starting position
+ * within the text string to be searched, e.g. <tt>setOffset</tt>,
+ * <tt>preceding</tt> and <tt>following</tt>. Since the
+ * starting position will be set as it is specified, please take note that
+ * there are some danger points at which the search may return incorrect
+ * results:
+ * <ul>
+ * <li> The midst of a substring that requires normalization.
+ * <li> If the following match is to be found, the position should not be the
+ *      second character that must be swapped with the preceding
+ *      character.
+ *      Vice versa, if the preceding match is to be found, the position to
+ *      search from should not be the first character that must be swapped
+ *      with the next one. Certain Thai and Lao characters require swapping.
+ * <li> If a following pattern match is to be found, any position within a
+ *      contracting sequence except the first will fail. Vice versa, if a
+ *      preceding pattern match is to be found, an invalid starting point
+ *      would be any character within a contracting sequence except the last.
+ * </ul>
+ * <p>
+ * A breakiterator can be used if only matches at logical breaks are desired.
+ * Using a breakiterator will only give you results that exactly match the
+ * boundaries given by the breakiterator. For instance the pattern "e" will
+ * not be found in the string "\u00e9" if a character break iterator is used.
+ * <p>
+ * Options are provided to handle overlapping matches.
+ * E.g. in English, overlapping matching produces the results 0 and 2
+ * for the pattern "abab" in the text "ababab", whereas mutually
+ * exclusive matching produces only the result 0.
+ * <p>
+ * Though collator attributes will be taken into consideration while
+ * performing matches, there are no APIs here for setting and getting the
+ * attributes. These attributes can be set by getting the collator
+ * from <tt>getCollator</tt> and using the APIs in <tt>coll.h</tt>.
+ * Lastly, to update StringSearch to the new collator attributes,
+ * reset() has to be called.
+ * <p>
+ * Restriction: <br>
+ * Currently there are no composite characters that consist of a
+ * character with combining class > 0 before a character with combining
+ * class == 0. However, if such a character exists in the future,
+ * StringSearch does not guarantee the results for option 1.
+ * <p>
+ * Consult the <tt>SearchIterator</tt> documentation for information on
+ * and examples of how to use instances of this class to implement text
+ * searching.
+ * <pre><code>
+ * UnicodeString target("The quick brown fox jumps over the lazy dog.");
+ * UnicodeString pattern("fox");
+ *
+ * UErrorCode      error = U_ZERO_ERROR;
+ * StringSearch iter(pattern, target, Locale::getUS(), NULL, error);
+ * for (int pos = iter.first(error);
+ *      pos != USEARCH_DONE;
+ *      pos = iter.next(error))
+ * {
+ *     printf("Found match at %d pos, length is %d\n", pos,
+ *                                             iter.getMatchLength());
+ * }
+ * </code></pre>
+ * <p>
+ * Note, StringSearch is not to be subclassed.
+ * </p>
+ * @see SearchIterator
+ * @see RuleBasedCollator
+ * @since ICU 2.0
+ */
+
+class U_I18N_API StringSearch : public SearchIterator
+{
+public:
+
+    // public constructors and destructors --------------------------------
+
+    /**
+     * Creating a <tt>StringSearch</tt> instance using the argument locale
+     * language rule set. A collator will be created in the process, which
+     * will be owned by this instance and will be deleted during
+     * destruction
+     * @param pattern The text for which this object will search.
+     * @param text    The text in which to search for the pattern.
+     * @param locale  A locale which defines the language-sensitive
+     *                comparison rules used to determine whether text in the
+     *                pattern and target matches.
+     * @param breakiter A <tt>BreakIterator</tt> object used to constrain
+     *                the matches that are found. Matches whose start and end
+     *                indices in the target text are not boundaries as
+     *                determined by the <tt>BreakIterator</tt> are
+     *                ignored.
+     *                If this behavior is not desired, pass <tt>NULL</tt> instead.
+     * @param status  for errors if any. If pattern or text is NULL, or if
+     *                either the length of pattern or text is 0 then a
+     *                U_ILLEGAL_ARGUMENT_ERROR is returned.
+     * @stable ICU 2.0
+     */
+    StringSearch(const UnicodeString &pattern, const UnicodeString &text,
+                 const Locale        &locale,
+                       BreakIterator *breakiter,
+                       UErrorCode    &status);
+
+    /**
+     * Creating a <tt>StringSearch</tt> instance using the argument collator
+     * language rule set. Note, user retains the ownership of this collator,
+     * it does not get destroyed during this instance's destruction.
+     * @param pattern The text for which this object will search.
+     * @param text    The text in which to search for the pattern.
+     * @param coll    A <tt>RuleBasedCollator</tt> object which defines
+     *                the language-sensitive comparison rules used to
+     *                determine whether text in the pattern and target
+     *                matches. User is responsible for the clearing of this
+     *                object.
+     * @param breakiter A <tt>BreakIterator</tt> object used to constrain
+     *                the matches that are found. Matches whose start and end
+     *                indices in the target text are not boundaries as
+     *                determined by the <tt>BreakIterator</tt> are
+     *                ignored. If this behavior is not desired,
+     *                <tt>NULL</tt> can be passed in instead.
+     * @param status  for errors if any. If either the length of pattern or
+     *                text is 0 then a U_ILLEGAL_ARGUMENT_ERROR is returned.
+     * @stable ICU 2.0
+     */
+    StringSearch(const UnicodeString     &pattern,
+                 const UnicodeString     &text,
+                       RuleBasedCollator *coll,
+                       BreakIterator     *breakiter,
+                       UErrorCode        &status);
+
+    /**
+     * Creating a <tt>StringSearch</tt> instance using the argument locale
+     * language rule set.
+     * A collator will be created in the process; it is owned by this
+     * instance and is deleted during destruction.
+     * <p>
+     * Note: No parsing of the text within the <tt>CharacterIterator</tt>
+     * will be done during searching for this version. The block of text
+     * in <tt>CharacterIterator</tt> will be used as it is.
+     * @param pattern The text for which this object will search.
+     * @param text    The text iterator in which to search for the pattern.
+     * @param locale  A locale which defines the language-sensitive
+     *                comparison rules used to determine whether text in the
+     *                pattern and target matches. User is responsible for
+     *                the clearing of this object.
+     * @param breakiter A <tt>BreakIterator</tt> object used to constrain
+     *                the matches that are found. Matches whose start and end
+     *                indices in the target text are not boundaries as
+     *                determined by the <tt>BreakIterator</tt> are
+     *                ignored. If this behavior is not desired,
+     *                <tt>NULL</tt> can be passed in instead.
+     * @param status  for errors if any. If either the length of pattern or
+     *                text is 0 then a U_ILLEGAL_ARGUMENT_ERROR is returned.
+     * @stable ICU 2.0
+     */
+    StringSearch(const UnicodeString &pattern, CharacterIterator &text,
+                 const Locale        &locale,
+                       BreakIterator *breakiter,
+                       UErrorCode    &status);
+
+    /**
+     * Creating a <tt>StringSearch</tt> instance using the argument collator
+     * language rule set. Note, user retains the ownership of this collator,
+     * it does not get destroyed during this instance's destruction.
+     * <p>
+     * Note: No parsing of the text within the <tt>CharacterIterator</tt>
+     * will be done during searching for this version. The block of text
+     * in <tt>CharacterIterator</tt> will be used as it is.
+     * @param pattern The text for which this object will search.
+     * @param text    The text in which to search for the pattern.
+     * @param coll    A <tt>RuleBasedCollator</tt> object which defines
+     *                the language-sensitive comparison rules used to
+     *                determine whether text in the pattern and target
+     *                matches. User is responsible for the clearing of this
+     *                object.
+     * @param breakiter A <tt>BreakIterator</tt> object used to constrain
+     *                the matches that are found. Matches whose start and end
+     *                indices in the target text are not boundaries as
+     *                determined by the <tt>BreakIterator</tt> are
+     *                ignored. If this behavior is not desired,
+     *                <tt>NULL</tt> can be passed in instead.
+     * @param status for errors if any. If either the length of pattern or
+     *                text is 0 then an U_ILLEGAL_ARGUMENT_ERROR is returned.
+     * @stable ICU 2.0
+     */
+    StringSearch(const UnicodeString &pattern, CharacterIterator &text,
+                       RuleBasedCollator *coll,
+                       BreakIterator     *breakiter,
+                       UErrorCode        &status);
+
+    /**
+     * Copy constructor that creates a StringSearch instance with the same
+     * behavior, and iterating over the same text.
+     * @param that StringSearch instance to be copied.
+     * @stable ICU 2.0
+     */
+    StringSearch(const StringSearch &that);
+
+    /**
+     * Destructor. Cleans up the search iterator data struct.
+     * If a collator is created in the constructor, it will be destroyed here.
+     * @stable ICU 2.0
+     */
+    virtual ~StringSearch(void);
+
+    /**
+     * Clone this object.
+     * Clones can be used concurrently in multiple threads.
+     * If an error occurs, then NULL is returned.
+     * The caller must delete the clone.
+     *
+     * @return a clone of this object
+     *
+     * @see getDynamicClassID
+     * @stable ICU 2.8
+     */
+    StringSearch *clone() const;
+
+    // operator overloading ---------------------------------------------
+
+    /**
+     * Assignment operator. Sets this iterator to have the same behavior,
+     * and iterate over the same text, as the one passed in.
+     * @param that instance to be copied.
+     * @stable ICU 2.0
+     */
+    StringSearch & operator=(const StringSearch &that);
+
+    /**
+     * Equality operator.
+     * @param that instance to be compared.
+     * @return TRUE if both instances have the same attributes,
+     *         breakiterators, collators and iterate over the same text
+     *         while looking for the same pattern.
+     * @stable ICU 2.0
+     */
+    virtual UBool operator==(const SearchIterator &that) const;
+
+    // public get and set methods ----------------------------------------
+
+    /**
+     * Sets the index to point to the given position, and clears any state
+     * that's affected.
+     * <p>
+     * This method takes the argument index and sets the position in the text
+     * string accordingly without checking if the index is pointing to a
+     * valid starting point to begin searching.
+     * @param position within the text to be set. If position lies
+     *          outside the text range for searching,
+     *          a U_INDEX_OUTOFBOUNDS_ERROR will be returned
+     * @param status for errors, if any occur
+     * @stable ICU 2.0
+     */
+    virtual void setOffset(int32_t position, UErrorCode &status);
+
+    /**
+     * Return the current index in the text being searched.
+     * If the iteration has gone past the end of the text
+     * (or past the beginning for a backwards search), USEARCH_DONE
+     * is returned.
+     * @return current index in the text being searched.
+     * @stable ICU 2.0
+     */
+    virtual int32_t getOffset(void) const;
+
+    /**
+     * Set the target text to be searched.
+     * Text iteration will then begin at the start of the text string.
+     * This method is
+     * useful if you want to re-use an iterator to search for the same
+     * pattern within a different body of text.
+     * @param text text string to be searched
+     * @param status for errors if any. If the text length is 0 then a
+     *        U_ILLEGAL_ARGUMENT_ERROR is returned.
+     * @stable ICU 2.0
+     */
+    virtual void setText(const UnicodeString &text, UErrorCode &status);
+
+    /**
+     * Set the target text to be searched.
+     * Text iteration will then begin at the start of the text string.
+     * This method is
+     * useful if you want to re-use an iterator to search for the same
+     * pattern within a different body of text.
+     * Note: No parsing of the text within the <tt>CharacterIterator</tt>
+     * will be done during searching for this version. The block of text
+     * in <tt>CharacterIterator</tt> will be used as it is.
+     * @param text text string to be searched
+     * @param status for errors if any. If the text length is 0 then a
+     *        U_ILLEGAL_ARGUMENT_ERROR is returned.
+     * @stable ICU 2.0
+     */
+    virtual void setText(CharacterIterator &text, UErrorCode &status);
+
+    /**
+     * Gets the collator used for the language rules.
+     * <p>
+     * Caller may modify but <b>must not</b> delete the <tt>RuleBasedCollator</tt>!
+     * Modifications to this collator will affect the original collator passed in to
+     * the <tt>StringSearch</tt> constructor or to setCollator, if any.
+     * @return collator used for string search
+     * @stable ICU 2.0
+     */
+    RuleBasedCollator * getCollator() const;
+
+    /**
+     * Sets the collator used for the language rules. User retains the
+     * ownership of this collator, thus the responsibility of deletion lies
+     * with the user. This method causes internal data such as Boyer-Moore
+     * shift tables to be recalculated, but the iterator's position is
+     * unchanged.
+     * @param coll    collator
+     * @param status  for errors if any
+     * @stable ICU 2.0
+     */
+    void setCollator(RuleBasedCollator *coll, UErrorCode &status);
+
+    /**
+     * Sets the pattern used for matching.
+     * Internal data like the Boyer-Moore table will be recalculated, but
+     * the iterator's position is unchanged.
+     * @param pattern search pattern to be found
+     * @param status for errors if any. If the pattern length is 0 then a
+     *               U_ILLEGAL_ARGUMENT_ERROR is returned.
+     * @stable ICU 2.0
+     */
+    void setPattern(const UnicodeString &pattern, UErrorCode &status);
+
+    /**
+     * Gets the search pattern.
+     * @return pattern used for matching
+     * @stable ICU 2.0
+     */
+    const UnicodeString & getPattern() const;
+
+    // public methods ----------------------------------------------------
+
+    /**
+     * Reset the iteration.
+     * Search will begin at the start of the text string if a forward
+     * iteration is initiated before a backwards iteration. Otherwise if
+     * a backwards iteration is initiated before a forwards iteration, the
+     * search will begin at the end of the text string.
+     * @stable ICU 2.0
+     */
+    virtual void reset();
+
+    /**
+     * Returns a copy of StringSearch with the same behavior, and
+     * iterating over the same text, as this one.
+     * Note that all data will be replicated, except for the
+     * user-specified collator and the break iterator.
+     * @return cloned object
+     * @stable ICU 2.0
+     */
+    virtual SearchIterator * safeClone(void) const;
+
+    /**
+     * ICU "poor man's RTTI", returns a UClassID for the actual class.
+     *
+     * @stable ICU 2.2
+     */
+    virtual UClassID getDynamicClassID() const;
+
+    /**
+     * ICU "poor man's RTTI", returns a UClassID for this class.
+     *
+     * @stable ICU 2.2
+     */
+    static UClassID U_EXPORT2 getStaticClassID();
+
+protected:
+
+    // protected method -------------------------------------------------
+
+    /**
+     * Search forward for matching text, starting at a given location.
+     * Clients should not call this method directly; instead they should
+     * call {@link SearchIterator#next }.
+     * <p>
+     * If a match is found, this method returns the index at which the match
+     * starts and calls {@link SearchIterator#setMatchLength } with the number
+     * of characters in the target text that make up the match. If no match
+     * is found, the method returns <tt>USEARCH_DONE</tt>.
+     * <p>
+     * The <tt>StringSearch</tt> is adjusted so that its current index
+     * (as returned by {@link #getOffset }) is the match position if one was
+     * found.
+     * If a match is not found, <tt>USEARCH_DONE</tt> will be returned and
+     * the <tt>StringSearch</tt> will be adjusted to the index USEARCH_DONE.
+     * @param position The index in the target text at which the search
+     *                 starts
+     * @param status for errors, if any occur
+     * @return The index at which the matched text in the target starts, or
+     *         USEARCH_DONE if no match was found.
+     * @stable ICU 2.0
+     */
+    virtual int32_t handleNext(int32_t position, UErrorCode &status);
+
+    /**
+     * Search backward for matching text, starting at a given location.
+     * Clients should not call this method directly; instead they should
+     * call {@link SearchIterator#previous }.
+     * <p>
+     * If a match is found, this method returns the index at which the match
+     * starts and calls {@link SearchIterator#setMatchLength } with the number
+     * of characters in the target text that make up the match. If no match
+     * is found, the method returns <tt>USEARCH_DONE</tt>.
+     * <p>
+     * The <tt>StringSearch</tt> is adjusted so that its current index
+     * (as returned by {@link #getOffset }) is the match position if one was
+     * found.
+     * If a match is not found, <tt>USEARCH_DONE</tt> will be returned and
+     * the <tt>StringSearch</tt> will be adjusted to the index USEARCH_DONE.
+     * @param position The index in the target text at which the search
+     *                 starts.
+     * @param status for errors, if any occur
+     * @return The index at which the matched text in the target starts, or
+     *         USEARCH_DONE if no match was found.
+     * @stable ICU 2.0
+     */
+    virtual int32_t handlePrev(int32_t position, UErrorCode &status);
+
+private :
+    StringSearch(); // default constructor not implemented
+
+    // private data members ----------------------------------------------
+
+    /**
+    * RuleBasedCollator, contains exactly the same UCollator * in m_strsrch_
+    * @stable ICU 2.0
+    */
+    RuleBasedCollator  m_collator_;
+    /**
+    * Pattern text
+    * @stable ICU 2.0
+    */
+    UnicodeString      m_pattern_;
+    /**
+    * String search struct data
+    * @stable ICU 2.0
+    */
+    UStringSearch     *m_strsrch_;
+
+};
+
+U_NAMESPACE_END
+
+#endif /* #if !UCONFIG_NO_COLLATION && !UCONFIG_NO_BREAK_ITERATION */
+
+#endif
+
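
The class comment in this header contrasts overlapping and mutually exclusive matching for the pattern "abab" in the text "ababab". The following is a minimal sketch of that difference only, not ICU code: it assumes plain code-unit comparison on std::string instead of collation, and the helper name findMatches is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of ICU): returns the start offsets of every
// match of `pattern` in `text`, using plain code-unit comparison rather than
// collation. When `overlap` is true the next search resumes one position
// after the previous match start; otherwise it resumes after the match end.
std::vector<std::size_t> findMatches(const std::string &text,
                                     const std::string &pattern,
                                     bool overlap)
{
    // Mirrors the header's rule that a zero-length pattern is illegal.
    assert(!pattern.empty());

    std::vector<std::size_t> offsets;
    std::size_t pos = text.find(pattern);
    while (pos != std::string::npos) {
        offsets.push_back(pos);
        const std::size_t step = overlap ? 1 : pattern.size();
        pos = text.find(pattern, pos + step);
    }
    return offsets;
}
```

With this sketch, findMatches("ababab", "abab", true) yields the offsets 0 and 2, while passing false yields only 0. In ICU itself the corresponding switch is the USEARCH_OVERLAP attribute, set through SearchIterator::setAttribute.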