* Copyright (C) 2004-2013, International Business Machines
* Corporation and others. All Rights Reserved.
*
* file name: changes.txt
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2004may06
* created by: Markus W. Scherer
*
* change log for Unicode updates

---------------------------------------------------------------------------- ***

Unicode 6.3 update

http://www.unicode.org/review/pri249/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
http://www.unicode.org/reports/tr44/tr44-11.html

*** ICU Trac

- ticket 10128: update ICU to Unicode 6.3 beta
- ticket 10168: update ICU to Unicode 6.3 final
  - C++ branches/markus/uni63 at r33552 from trunk at r33551
  - Java branches/markus/uni63 at r33550 from trunk at r33553

- ticket 10142: implement Unicode 6.3 bidi algorithm additions

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
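
The note above says that configure extracts the Unicode version from uchar.h.
A minimal Python sketch of that kind of extraction follows. It is not the
configure logic itself; it assumes the header carries the version as a quoted
U_UNICODE_VERSION macro, and the helper name is hypothetical.

```python
import re

# Assumption: uchar.h contains a line like  #define U_UNICODE_VERSION "6.3"
_VERSION_RE = re.compile(
    r'^\s*#\s*define\s+U_UNICODE_VERSION\s+"([0-9.]+)"', re.MULTILINE)


def unicode_version_from_header(header_text):
    """Return the version string from uchar.h text, or None if not found."""
    match = _VERSION_RE.search(header_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = '/* excerpt */\n#define U_UNICODE_VERSION "6.3"\n'
    print(unicode_version_from_header(sample))
```

Running a check like this before "configure" is one way to confirm that the
header edit took effect.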

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py:
  parse new file BidiBrackets.txt
  with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyAliases.txt changes
- 1 new Enumerated Property
  bpt ; Bidi_Paired_Bracket_Type
  -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
     -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
  -> uprops.cpp
  -> change ubidi.icu format version from 2.0 to 2.1
- 1 new Miscellaneous Property
  bpb ; Bidi_Paired_Bracket
  -> uchar.h & UProperty.java
  -> ppucd.h & .cpp

* PropertyValueAliases.txt changes
- 3 Bidi_Paired_Bracket_Type (bpt) values:
  bpt; c ; Close
  bpt; n ; None
  bpt; o ; Open
  -> uchar.h & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
  -> change ubidi.icu format version from 2.0 to 2.1
- 4 new Bidi_Class (bc) values:
  bc ; FSI ; First_Strong_Isolate
  bc ; LRI ; Left_To_Right_Isolate
  bc ; RLI ; Right_To_Left_Isolate
  bc ; PDI ; Pop_Directional_Isolate
  -> uchar.h & UCharacterEnums.ECharacterDirection
  -> until the bidi code gets updated,
     Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
- 3 new Word_Break (WB) values:
  WB ; HL ; Hebrew_Letter
  WB ; SQ ; Single_Quote
  WB ; DQ ; Double_Quote
  -> uchar.h & UCharacter.WordBreak
  -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2012-10-16)
  Aghb 239 Caucasian Albanian
  Mahj 314 Mahajani
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> preparseucd.py _scripts_only_in_iso15924
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
  -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
     (not strictly necessary for NOT_ENCODED scripts)

* generate normalization data files
- ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
- ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
- ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

  ~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

  ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

  # Location (--prefix) of where ICU was installed.
  set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
  # Location of the ICU source tree.
  set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)

  ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
  ~/svn.icutools/trunk/dbg/unicode/c$ make

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.3: U+2260, U+226E, U+226F
- nothing new in 6.3, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt52l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
  ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
  ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
  ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
  ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
  ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
- refresh ICU4J
  ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files

- get output from Mark's tools; look in http://www.unicode.org/Public/UCA//
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
  ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
  ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
  ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
  or
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.3: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.2 update

http://www.unicode.org/review/pri230/
http://www.unicode.org/versions/beta-6.2.0.html
http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values
http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol
http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols
http://www.unicode.org/reports/tr46/tr46-8.html IDNA
http://unicode.org/Public/idna/6.2.0/

*** ICU Trac

- ticket 9515: Unicode 6.2: final ICU update

- ticket 9514: UCA 6.2: fix UCARules.txt

- ticket 9437: update ICU to Unicode 6.2
  - C++ branches/markus/uni62 at r32050 from trunk at r32041
  - Java branches/markus/uni62 at r32068 from trunk at r32066

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py: NamesList.txt is now in UTF-8
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.
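
The file-preparation step above requires IdnaMappingTable.txt somewhere under
the data folder passed to preparseucd.py. A small pre-flight check could look
like the sketch below. The function name is hypothetical, and this is not how
preparseucd.py itself locates the file.

```python
import os


def find_in_tree(root, filename):
    """Return the sorted paths of all files named `filename` under `root`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


if __name__ == "__main__":
    # Assumption: the current directory stands in for the Unicode data folder.
    found = find_in_tree(".", "IdnaMappingTable.txt")
    print(found if found else "IdnaMappingTable.txt is missing")
```

More than one match would suggest a stale copy from an earlier Unicode version
in the tree.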

* PropertyValueAliases.txt changes
- 1 new Line_Break (lb) value:
  lb ; RI ; Regional_Indicator
  -> uchar.h & UCharacter.LineBreak
- 1 new Word_Break (WB) value:
  WB ; RI ; Regional_Indicator
  -> uchar.h & UCharacter.WordBreak
- 1 new Grapheme_Cluster_Break (GCB) value:
  GCB; RI ; Regional_Indicator
  -> uchar.h & UCharacter.GraphemeClusterBreak

* 3 new numeric values
  The new value -1, which was really supposed to be NaN but that would have required
  new UnicodeData.txt syntax, can already be represented as a "fraction" of -1/1,
  but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
    cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
    cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
  The two new values 216000 and 432000 require an addition to the encoding of numeric values.
    cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
    cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
  -> uprops.h, uchar.c & UCharacterProperty.java
  -> cucdtst.c & UCharacterTest.java

* generate normalization data files
- ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
- ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
- ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
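
The four gennorm2 invocations above share one pattern: each .nrm file is built
from nfc.txt plus zero or more additional input files. A minimal sketch that
produces the same argument lists is shown below. The function and table names
are hypothetical, and the script only builds the commands without running them.

```python
import posixpath

# Output file -> input files, copied from the gennorm2 steps listed above.
NORM_FILES = [
    ("nfc.nrm", ["nfc.txt"]),
    ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
    ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
    ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
]


def gennorm2_commands(src_data_in, unidata, gennorm2="bin/gennorm2"):
    """Build one argv list per .nrm file, mirroring the manual commands."""
    norm2_dir = posixpath.join(unidata, "norm2")
    return [
        [gennorm2, "-o", posixpath.join(src_data_in, out), "-s", norm2_dir] + inputs
        for out, inputs in NORM_FILES
    ]


if __name__ == "__main__":
    for argv in gennorm2_commands("$SRC_DATA_IN", "$UNIDATA"):
        print(" ".join(argv))
```

Keeping the file table in one place means the per-version log entries differ
only in the branch paths.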

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
* build Unicode tools using CMake+make

* generate core properties data files
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
- rebuild ICU (make install) & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.2: U+2260, U+226E, U+226F
- nothing new in 6.2, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt50l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
  ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
  ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
  ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
  ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
  ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
- refresh ICU4J
  ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- get output from Mark's tools; look in http://www.unicode.org/Public/UCA//
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
  ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
  ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
  ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
  or
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.2: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Future Unicode update

Tools simplified since the Unicode 6.1 update. See
- http://site.icu-project.org/design/props/ppucd
- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972

* Unicode version numbers
- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates

* file preparation
- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
  - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
  - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  - Check test file diffs for previously commented-out, known-failing data lines;
    probably need to keep those commented out.

* PropertyValueAliases.txt changes
- Script codes that are in ISO 15924 but not in Unicode are now listed in
  preparseucd.py, in the _scripts_only_in_iso15924 variable.
  If there are new ISO codes, then add them.
  If Unicode adds some of them, then remove them from the .py variable.

* UnicodeData.txt changes
- No more manual changes for CJK ranges for algorithmic names;
  those are now written to ppucd.txt and genprops reads them from there.
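
The maintenance rule for _scripts_only_in_iso15924 in the
PropertyValueAliases.txt notes above amounts to set arithmetic: add new ISO
codes that Unicode lacks, and drop codes once Unicode adopts them. The sketch
below illustrates only that rule. The function name and the plain-set
representation are assumptions, since the actual shape of the variable in
preparseucd.py is not shown here.

```python
def update_iso_only_scripts(iso_only, iso_codes, unicode_codes):
    """Apply the add/remove rule and report what changed.

    iso_only:      codes currently listed as ISO-only (hypothetical set)
    iso_codes:     all script codes known from ISO 15924
    unicode_codes: script codes that Unicode now defines
    Returns (new_iso_only, added, removed) as sets.
    """
    iso_only = set(iso_only)
    unicode_codes = set(unicode_codes)
    added = set(iso_codes) - unicode_codes - iso_only
    removed = iso_only & unicode_codes
    return (iso_only | added) - removed, added, removed


if __name__ == "__main__":
    # Illustrative input only; the codes are taken from this change log.
    new_set, added, removed = update_iso_only_scripts(
        {"Khoj", "Tirh"}, {"Khoj", "Tirh", "Aghb", "Latn"}, {"Latn"})
    print(sorted(new_set), sorted(added), sorted(removed))
```

Reporting the added and removed codes separately makes it easier to see which
test expectations (expectedLong, expectedShort) may need a matching update.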

* generate core properties data files (makeprops.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src

* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
- it is now generated by preparseucd.py

* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
- it is now generated by preparseucd.py
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  (can be in some subfolder)

* generate normalization data files
- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
* build Unicode tools using CMake+make

* new way to call genuca (makeuca.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src

---------------------------------------------------------------------------- ***

Unicode 6.1 update

*** ICU Trac

- ticket 8995 final update to Unicode 6.1
- ticket 8994 regenerate source/layout/CanonData.cpp

- ticket 8961 support Unicode "Age" value *names*
- ticket 8963 support multiple character name aliases & types

- ticket 8827 "update ICU to Unicode 6.1"
  - C++ branches/markus/uni61 at r30864 from trunk at r30843
  - Java branches/markus/uni61 at r30865 from trunk at r30863

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- icutools/unicode/makedefs.sh
  + also review & update other definitions in that file,
    e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
- This prepares both unidata and testdata files in respective output subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- 11 new block names:
  Arabic_Extended_A
  Arabic_Mathematical_Alphabetic_Symbols
  Chakma
  Meetei_Mayek_Extensions
  Meroitic_Cursive
  Meroitic_Hieroglyphs
  Miao
  Sharada
  Sora_Sompeng
  Sundanese_Supplement
  Takri
  -> add to uchar.h
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 1 new Joining_Group (jg) value:
  Rohingya_Yeh
  -> uchar.h & UCharacter.JoiningGroup
- 2 new Line_Break (lb) values:
  CJ=Conditional_Japanese_Starter
  HL=Hebrew_Letter
  -> uchar.h & UCharacter.LineBreak
- 7 new scripts:
  sc ; Cakm ; Chakma
  sc ; Merc ; Meroitic_Cursive
  sc ; Mero ; Meroitic_Hieroglyphs
  sc ; Plrd ; Miao
  sc ; Shrd ; Sharada
  sc ; Sora ; Sora_Sompeng
  sc ; Takr ; Takri
  -> remove these from SyntheticPropertyValueAliases.txt
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2011-06-21)
  Khoj 322 Khojki
  Tirh 326 Tirhuta
  and another one added 2011-12-09
  Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* UnicodeData.txt changes
- the last Unihan code point changes from U+9FCB to U+9FCC
  search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
  + do change gennames.c
  + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java

* DerivedBidiClass.txt changes
- 2 new default-AL blocks:
  # Arabic Extended-A: U+08A0 - U+08FF (was default-R)
  # Arabic Mathematical Alphabetic Symbols:
  # U+1EE00 - U+1EEFF (was default-R)
- 2 new default-R blocks:
  # Meroitic Hieroglyphs:
  # U+10980 - U+1099F
  # Meroitic Cursive: U+109A0 - U+109FF
  -> should be picked up by the explicit data in the file

* NameAliases.txt changes
- from
  # Each line has two fields
  # First field: Code point
  # Second field: Alias
- to
  # Each line has three fields, as described here:
  #
  # First field: Code point
  # Second field: Alias
  # Third field: Type
- Also, the file previously allowed multiple aliases but only now does it
  actually provide multiple, even multiple of the same type. For example,
  FEFF;BYTE ORDER MARK;alternate
  FEFF;BOM;abbreviation
  FEFF;ZWNBSP;abbreviation
- This breaks our gennames parser, unames.icu data structure, and API.
  Fix gennames to only pick up "correction" aliases.
  New ticket #8963 for further changes.
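
The NameAliases.txt notes above describe a three-field line format and the
interim fix of keeping only "correction" aliases. A minimal Python sketch of
that filtering is shown below. It is not the gennames code, the function names
are hypothetical, and the sample lines are illustrative input only.

```python
def parse_name_aliases(lines):
    """Yield (code_point, alias, alias_type) from NameAliases.txt-style lines."""
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            raise ValueError("expected 3 fields: %r" % line)
        yield int(fields[0], 16), fields[1], fields[2]


def correction_aliases(lines):
    """Keep only the aliases whose type is "correction"."""
    return [(cp, alias)
            for cp, alias, alias_type in parse_name_aliases(lines)
            if alias_type == "correction"]


if __name__ == "__main__":
    sample = [
        "# Each line has three fields",
        "01A2;LATIN CAPITAL LETTER GHA;correction",
        "FEFF;BYTE ORDER MARK;alternate",
        "FEFF;BOM;abbreviation",
        "FEFF;ZWNBSP;abbreviation",
    ]
    for cp, alias in correction_aliases(sample):
        print("U+%04X %s" % (cp, alias))
```

Filtering by type sidesteps the multiple-aliases-per-code-point problem until
the data structure and API changes tracked in ticket #8963 are made.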

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
* build Unicode tools (at least genpname) using CMake+make

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource

* build ICU (make install)
* build Unicode tools using CMake+make

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.1: U+2260, U+226E, U+226F
- nothing new in 6.1, no test file to update

* generate core properties data files
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt49l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
  ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
- refresh ICU4J
  ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* test ICU so far, fix test code where necessary
- temporarily ignore collation issues that look like UCA/UCD mismatches,
  until UCA data is updated

* UCA

- get output from Mark's tools; look in
  http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-.txt
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run makeuca.sh:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
  ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
  or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

- diff current /source/layout files vs. generated ones
  ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

ICU 4.8 (no Unicode update, just new script codes)

* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2010-12-21)
    Afak 439 Afaka
    Jurc 510 Jurchen
    Mroo 199 Mro, Mru
    Nshu 499 Nüshu
    Shrd 319 Sharada, Śāradā
    Sora 398 Sora Sompeng
    Takr 321 Takri, Ṭākrī, Ṭāṅkrī
    Tang 520 Tangut
    Wole 480 Woleai
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> genpname/SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.
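
The uscript.h to UScript.java find/replace pattern listed for the new script codes above can also be applied outside an editor. The following is a minimal sketch that reuses the same regular expression with Python's re module; the function name uscript_to_java and the sample enum line (USCRIPT_EXAMPLE, 999) are hypothetical and not taken from uscript.h.

```python
import re

# Same pattern as the find/replace pair noted for uscript.h -> UScript.java.
USCRIPT_ENUM = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")


def uscript_to_java(line):
    """Rewrite one C enum line into a Java int constant declaration.

    Lines that do not match the pattern are returned unchanged.
    """
    return USCRIPT_ENUM.sub(r"public static final int \1 = \2;\3", line)


# Hypothetical enum line, for illustration only.
c_line = "      USCRIPT_EXAMPLE = 999,/* Exmp */"
java_line = uscript_to_java(c_line)
```

Leading indentation and the trailing comment are carried over as-is, so the result still needs the usual manual review before it goes into com.ibm.icu.lang.UScript.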

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
- rebuild ICU & tools

* run genprops
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- copy the big-endian Unicode data files to another location,
  separate from the other data files
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
  ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
  ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
- refresh ICU4J
  ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b

* should have updated the layout engine script codes but forgot

---------------------------------------------------------------------------- ***

Unicode 6.0 update

*** related ICU Trac tickets

7264 Unicode 6.0 Update

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
- This now prepares both unidata and testdata files in respective output subfolders.

* PropertyAliases.txt changes
- new Script_Extensions property defined in the new ScriptExtensions.txt file
  but not listed in PropertyAliases.txt; reported to unicode.org;
  -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
     scx; Script_Extensions
  -> uchar.h with new UProperty section
  -> com.ibm.icu.lang.UProperty, parallel with uchar.h

* PropertyValueAliases.txt changes
- 12 new block names:
    Alchemical_Symbols
    Bamum_Supplement
    Batak
    Brahmi
    CJK_Unified_Ideographs_Extension_D
    Emoticons
    Ethiopic_Extended_A
    Kana_Supplement
    Mandaic
    Miscellaneous_Symbols_And_Pictographs
    Playing_Cards
    Transport_And_Map_Symbols
  -> add to uchar.h
  -> add to UCharacter.UnicodeBlock
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- Joining_Group (jg) values:
    Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
  -> uchar.h & UCharacter.JoiningGroup
- 3 new scripts:
    sc ; Batk ; Batak
    sc ; Brah ; Brahmi
    sc ; Mand ; Mandaic
  -> remove these from SyntheticPropertyValueAliases.txt
  -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2009-11-11..2010-07-18)
    Bass 259 Bassa Vah
    Dupl 755 Duployan shorthand
    Elba 226 Elbasan
    Gran 343 Grantha
    Kpel 436 Kpelle
    Loma 437 Loma
    Mend 438 Mende
    Merc 101 Meroitic Cursive
    Narb 106 Old North Arabian
    Nbat 159 Nabataean
    Palm 126 Palmyrene
    Sind 318 Sindhi
    Wara 262 Warang Citi
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- ISO 15924 name change
    Mero 100 Meroitic Hieroglyphs (was Meroitic)
  -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
- property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt

* UnicodeData.txt changes
- new CJK block:
    2B740;;Lo;0;L;;;;;N;;;;;
    2B81D;;Lo;0;L;;;;;N;;;;;
  -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion

* build Unicode tools using CMake+make

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.

* run genpname
- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- rebuild ICU & tools

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0: U+2260, U+226E, U+226F

* generate core properties data files
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
- run makeuca.sh so that genuca picks up the new nfc.nrm:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* implement new Script_Extensions property (provisional)
- parser & generator: genprops & uprops.icu
- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java

* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
- (one-time change)
- genbidi/gencase/genprops tools changes
- re-run makeprops.sh (see above)
- UCharacterProperty.java, UCharacterTypeIterator.java,
  UBiDiProps.java, UCaseProps.java, and several others with minor changes;
  UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt45l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
- copy the big-endian Unicode data files to another location,
  separate from the other data files
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
  ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
- refresh ICU4J
  ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* un-hardcode normalization skippable (NF*_Inert) test data
- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools

* copy updated break iterator test files
- now handled by early ucdcopy.py and
  copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
  (old instructions:
  copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
  to ~/svn.icu/trunk/src/source/test/testdata)
- they are not used in ICU4J

* UCA

- get output from Mark's tools; look in
  http://www.unicode.org/~book/incoming/mark/uca6.0.0/
  http://www.macchiato.com/unicode/utc/additional-uca-files
  http://www.unicode.org/Public/UCA/6.0.0/
  http://www.unicode.org/~mdavis/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
- update Han-implicit ranges for new CJK extensions:
  swapCJK() in ucol.cpp & ImplicitCEGenerator.java
- genuca: allow bytes 02 for U+FFFE, new merge-sort character;
  do not add it into invuca so that tailoring primary-after an ignorable works
- genuca: permit space between [variable top] bytes
- ucol.cpp: treat noncharacters like unassigned rather than ignorable
- run makeuca.sh:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
  ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
  or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
  ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
  ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

* copy the above files into /source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

---------------------------------------------------------------------------- ***

Unicode 5.2 update

*** related ICU Trac tickets

7084 Unicode 5.2

7167 verify collation bytes
7235 Java test NAME_ALIAS
7236 Java DerivedCoreProperties.txt test
7237 Java BidiTest.txt
7238 UTrie2 in core unidata
7239 test for tailoring gaps
7240 Java fix CollationMiscTest
7243 update layout engine for Unicode 5.2

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation

python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
- includes finding files regardless of version numbers,
  copying them, and performing the equivalent processing of the
  ucdstrip and ucdmerge tools on the desired set of files

* notes on changes
- PropertyAliases.txt
  moved from numeric to enumerated:
    ccc ; Canonical_Combining_Class
  new string properties:
    NFKC_CF ; NFKC_Casefold
    Name_Alias; Name_Alias
  new binary properties:
    Cased ; Cased
    CI ; Case_Ignorable
    CWCF ; Changes_When_Casefolded
    CWCM ; Changes_When_Casemapped
    CWKCF ; Changes_When_NFKC_Casefolded
    CWL ; Changes_When_Lowercased
    CWT ; Changes_When_Titlecased
    CWU ; Changes_When_Uppercased
  new CJK Unihan properties (not supported by ICU)
- PropertyValueAliases.txt
  new block names
  new scripts
  one script code change:
    sc ; Qaai ; Inherited
    ->
    sc ; Zinh ; Inherited ; Qaai
  new Line_Break (lb) value:
    lb ; CP ; Close_Parenthesis
  new Joining_Group (jg) values: Farsi_Yeh, Nya
  other new values:
    ccc; 214; ATA ; Attached_Above
- DerivedBidiClass.txt
  new default-R range: U+1E800 - U+1EFFF
- UnicodeData.txt
  all of the ISO comments are gone
  new CJK block end:
    9FC3; -> 9FCB;
  new CJK block:
    2A700;;Lo;0;L;;;;;N;;;;;
    2B734;;Lo;0;L;;;;;N;;;;;

* genpname
- run preparse.pl
  + cd \svn\icuproj\icu\trunk\source\tools\genpname
  + make sure that data.h is writable
  + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
  + preparse.pl complains with errors like the following:
    Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, line 34.
    This is because ICU 4.0 had scripts from ISO 15924 which are now
    added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
    and PropertyValueAliases.txt.
    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
       Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
  + preparse.pl complains with errors about block names missing from uchar.h; add them

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + 26 new blocks
    copy new blocks from Blocks.txt
    MS VC++ 2008 regular expression:
    find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
    replace with " UBLOCK_\3 = 172, /*[\1]*/"
  + several new script values already added in ICU 4.0 for ISO 15924 coverage
    (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
  + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
  + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
    (added to SyntheticPropertyValueAliases.txt)
- new Joining Group (JG) values: Farsi_Yeh, Nya
- new Line_Break (lb) value:
    lb ; CP ; Close_Parenthesis

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FC3 to 9FCB
  search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
  + do change gennames.c

* Compare definitions of new binary properties with what we used to use
  in algorithms, to see if the definitions changed.
- Verified that definitions for Cased and Case_Ignorable are unchanged.
  The gencase tool now parses the newly public Case_Ignorable values
  in case the definition changes in the future.
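
The Blocks.txt find/replace noted above for uchar.h can be approximated in a script. The sketch below is an assumption-laden illustration, not the procedure that was actually used: the original editor replacement inserts the fixed placeholder 172 and leaves renaming and renumbering to manual editing, whereas this version also upper-cases the block name and counts up from a caller-supplied start value. The function name blocks_to_enum is hypothetical, and the enum values it prints are illustrative only.

```python
import re

# Python spelling of the MS VC++ pattern: start..end; Block Name
BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+); ([A-Z].+)$")


def blocks_to_enum(lines, first_value):
    """Turn Blocks.txt lines into UBLOCK_ enum lines for manual review.

    The name normalization and the running counter are assumptions
    added for this sketch; they are not part of the original regex.
    """
    result = []
    value = first_value
    for line in lines:
        match = BLOCK_LINE.match(line.strip())
        if match is None:
            continue  # comments, blank lines, anything unexpected
        start, _end, name = match.groups()
        identifier = re.sub(r"[^A-Za-z0-9]+", "_", name).upper()
        result.append("    UBLOCK_%s = %d, /*[%s]*/" % (identifier, value, start))
        value += 1
    return result


# Two Unicode 5.2 block lines; the numbering is illustrative.
enum_lines = blocks_to_enum(
    ["# Blocks.txt excerpt",
     "A960..A97F; Hangul Jamo Extended-A",
     "D7B0..D7FF; Hangul Jamo Extended-B"],
    first_value=172,
)
```

The generated lines still have to be checked against the existing constants in uchar.h, since enum values there must not be reused or reordered.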

* uchar.c & uprops.h & uprops.c & genprops
- new numeric values that didn't exist in Unicode data before:
  1/7, 1/9, 1/10, 3/10, 1/16, 3/16
  the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
  therefore redesign the encoding of numeric types and values for formatVersion 6;
  design for simple numbers up to at least 144 ("one gross"),
  large values up to at least 10^20,
  and fractions with numerators -1..17 and denominators 1..16
  to cover current and expected future values
  (e.g., more Han numeric values, Meroitic twelfths)

* reimplement Hangul_Syllable_Type for new Jamo characters
- the old code assumed that all Jamo characters are in the 11xx block
- Unicode 5.2 fills holes there and adds new Jamo characters in
    A960..A97F; Hangul Jamo Extended-A
  and in
    D7B0..D7FF; Hangul Jamo Extended-B
- Hangul_Syllable_Type can be trivially derived from a subset of
  Grapheme_Cluster_Break values

* build Unicode data source code for hardcoding core data
  C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data

  ICU data make path is \svn\icuproj\icu\trunk\source\data\
  ICU root path is \svn\icuproj\icu\trunk
  Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
  Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
  Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
  Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
  Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
    Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
    Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
    Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
    Creating data file for Unicode Property Names
    Creating data file for Unicode Character Properties
    Creating data file for Unicode Case Mapping Properties
    Creating data file for Unicode BiDi/Shaping Properties
    Creating data file for Unicode Normalization
    Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
    Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
  and rebuild the common library

*** UCA

- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
  [ Begin obsolete instructions:
    Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
- generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
  on Windows:
    python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
    python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
  End obsolete instructions]
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
  not just the *_STUB.txt files
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

*** Implement Cased & Case_Ignorable properties
- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
- Problem: These properties should be disjoint, but aren't
- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
- change ucase.icu to be able to store any combination of Cased and Case_Ignorable

*** Implement Changes_When_Xyz properties
- without stored data

*** Implement Name_Alias property
- add it as another name field in unames.icu
- make it available via u_charName() and UCharNameChoice
- consider it in u_charFromName()

*** Break iterators

* Update break iterator rules to new UAX versions and new property values
* Update source/test/testdata/Test.txt files from /ucd/auxiliary

*** new BidiTest file
- review format and data
- copy BidiTest.txt to source/test/testdata
- write test code using this data
- fix ICU code where it fails the conformance test

*** Java
- generally, find and update code corresponding to C/C++
- UCharacter.UnicodeBlock constants:
  a) add an _ID integer per new block, update COUNT
  b) add a class instance per new block
     Visual Studio regex:
     find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
     replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()

- port test changes to Java

*** LayoutEngine script information

(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
  ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
  ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

  -> Eric Mader wrote in email on 20090930:
     "I think the tool has been modified to update @draft to @stable for
     older scripts and to add @draft for new scripts.
     (I worked with an intern on this last year.)
     You should check the output after you run it."

* copy the above files into /source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

  Add new default entries to the indicClassTables array in /source/layout/IndicClassTables.cpp
  and the complexTable array in /source/layoutex/ParagraphLayout.cpp.
  (This step should be automated...)

  -> Eric Mader wrote in email on 20090930:
     "This is just a matter of making sure that all the per-script tables have
     entries for any new scripts that were added.
     If any new Indic characters were added, then the class tables in
     IndicClassTables.cpp should be updated to reflect this.
     John Emmons should know how to do this if it's required."

* rebuild the layout and layoutex libraries.

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.1 update

*** related ICU Trac tickets

5696 Update to Unicode 5.1

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
    copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
    copy 5.1.0\ucd\Blocks.txt ..\unidata\
    copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
    copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
    copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
    copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
    copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
    copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
    copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
    copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
    copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
    copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
    copy 5.1.0\ucd\UnicodeData.txt ..\unidata\

    ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
    ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
    ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
    ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
    ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
    ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
    ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
    ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
    ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
    ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* genpname
- run preparse.pl
  + cd \svn\icuproj\icu\uni51\source\tools\genpname
  + make sure that data.h is writable
  + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
  + preparse.pl complains with errors like the following:
      Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, line 30.
    This is because ICU 3.8 had scripts from ISO 15924 which are now
    added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
    and PropertyValueAliases.txt.
    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
       Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
  + PropertyValueAliases.txt now explicitly contains values for boolean properties:
    N/Y, No/Yes, F/T, False/True
    -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
       It will use further values from the file if present.

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + 17 new blocks
  + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
    (removed from SyntheticPropertyValueAliases.txt)
  + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
    (added to SyntheticPropertyValueAliases.txt)
- uprops.icu (uprops.h) only provides 7 bits for script codes.
  In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now.
  No script code above 127 is assigned to any Unicode character yet,
  so ICU 4.0 uprops.icu does not store any
  script code values greater than 127.
  However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
  in a parallel bit field, and that overflows now.
  Also, future values >=128 would be incompatible anyway.
  uprops.h is modified to move around several of the bit fields
  in the properties vector words, and now uses 8 bits for the script code.
  Two other bit fields also grow to accommodate future growth:
  Block (current count: 172) grows from 8 to 9 bits,
  and Word_Break grows from 4 to 5 bits.
- renamed property Simple_Case_Folding (sfc->scf)
  + nothing to be done: handled as normal alias
- new property JSN Jamo_Short_Name
  + no new API: only contributes to the Name property
- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
- new Joining Group (JG) value: Burushashki_Yeh_Barree
- new Sentence_Break (SB) values:
    SB ; CR ; CR
    SB ; EX ; Extend
    SB ; LF ; LF
    SB ; SC ; SContinue
- new Word_Break (WB) values:
    WB ; CR ; CR
    WB ; Extend ; Extend
    WB ; LF ; LF
    WB ; MB ; MidNumLet

* Further changes in the 2008-02-29 update:
- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
  because they should not normally be invisible.
- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
- new Word_Break (WB) value: NL=Newline

* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
- Unihan range end moves from 9FBB to 9FC3
  search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
  + do change gennames.c

* build Unicode data source code for hardcoding core data
    C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data

    ICU data make path is \svn\icuproj\icu\uni51\source\data\
    ICU root path is \svn\icuproj\icu\uni51
    Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
    Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
    Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
    Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
    Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
    Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
    Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
    Creating data file for Unicode Character Properties
    Creating data file for Unicode Case Mapping Properties
    Creating data file for Unicode BiDi/Shaping Properties
    Creating data file for Unicode Normalization
    Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
    Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
  and rebuild the common library

*** Break iterators

* Update break iterator rules to new UAX versions and new property values

*** UCA

* update FractionalUCA.txt and UCARules.txt with new canonical closure

*** Test suites
- Test that APIs using Unicode property value aliases (like UnicodeSet)
  support all of the boolean values N/Y, No/Yes, F/T, False/True
  -> TestBinaryValues() tests in both cintltst and intltest

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h,
  ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
  ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

* copy the above files into /source/layout, replacing the old files.

  Add new default entries to the indicClassTables array in /source/layout/IndicClassTables.cpp
  and the complexTable array in /source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.0 update

*** related Jitterbugs

5084 RFE: Update to Unicode 5.0

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
    copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
    copy 5.0.0\ucd\Blocks.txt ..\unidata\
    copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
    copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
    copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
    copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
    copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
    copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
    copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
    copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
    copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
    copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
    copy 5.0.0\ucd\UnicodeData.txt ..\unidata\

    ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
    ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
    ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
    ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
    ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
    ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
    ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
    ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
    ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
    ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- run preparse.pl
  + make sure that data.h is writable
  + perl preparse.pl \cvs\oss\icu > out.txt

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + script values already added in ICU 3.6 because all of ISO 15924 is now covered

* build Unicode data source code for hardcoding core data
    C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data

    ICU data make path is \cvs\oss\icu\source\data\
    ICU root path is \cvs\oss\icu
    Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
    [etc.]
    Creating data file for Unicode Character Properties
    Creating data file for Unicode Case Mapping Properties
    Creating data file for Unicode BiDi/Shaping Properties
    Creating data file for Unicode Normalization
    Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
    Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"

- copy the .c source files to C:\cvs\oss\icu\source\common
  and rebuild the common library

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
  ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
  ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

* copy the above files into /source/layout, replacing the old files.

  Add new default entries to the indicClassTables array in /source/layout/IndicClassTables.cpp
  and the complexTable array in /source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.
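
*** Illustration: what the ucdstrip/ucdmerge preparation step does

The file preparation steps of this update pipe UCD files through ucdstrip and,
for EastAsianWidth.txt and LineBreak.txt, through ucdmerge. The following is a
minimal Python sketch of that idea, not the actual tools (which are small C
programs). It assumes that ucdstrip drops empty lines, comment-only lines and
trailing '#' comments, and that ucdmerge merges adjacent code points or ranges
that carry identical data. The function names and the sample lines are
hypothetical.

```python
def strip_comments(lines):
    """Assumed ucdstrip behavior: drop empty/comment-only lines and trailing comments."""
    for line in lines:
        data = line.split("#", 1)[0].rstrip()
        if data:
            yield data


def merge_ranges(lines):
    """Assumed ucdmerge behavior: merge adjacent lines that carry identical data."""
    merged = []  # each entry is [start, end, value]
    for line in lines:
        cp, value = (field.strip() for field in line.split(";", 1))
        first, _, last = cp.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if merged and merged[-1][2] == value and merged[-1][1] + 1 == start:
            merged[-1][1] = end  # extend the previous range
        else:
            merged.append([start, end, value])
    for start, end, value in merged:
        cp = f"{start:04X}" if start == end else f"{start:04X}..{end:04X}"
        yield f"{cp};{value}"


# Hypothetical sample in the style of LineBreak.txt.
sample = [
    "# sample data",
    "",
    "0041;AL # LATIN CAPITAL LETTER A",
    "0042..0043;AL",
    "0045;AL",
    "0030;NU",
]
result = list(merge_ranges(strip_comments(sample)))
```

Merging only adjacent entries keeps the output in file order, which is why
0045 stays separate from 0041..0043 in the sample: 0044 is missing between them.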

---------------------------------------------------------------------------- ***

Unicode 4.1 update

*** related Jitterbugs

4332 RFE: Update to Unicode 4.1
4157 RBBI, TR29 4.1 updates

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* add new files to the repository
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- handle new enumerated properties in sub read_uchar
- run preparse.pl

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new binary properties
  + Pattern_Syntax
  + Pattern_White_Space
- new enumerated properties
  + Grapheme_Cluster_Break
  + Sentence_Break
  + Word_Break
- new block & script & line break values

* gencase
- case-ignorable changes
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
  now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties

*** other code

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FA5 to 9FBB
  search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
  + do not modify BOCU/BOCSU code because that would change the encoding
    and break binary compatibility!
  + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
    NamePrepProfile.txt
  + ignore trietest.c: test data is arbitrary
  + ignore tstnorm.cpp: test optimization, not important
  + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
  + do change line_th.txt and word_th.txt
    by replacing hardcoded ranges with the new property values
  + do change gennames.c

    source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
    source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
    source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,

* case mappings
- compare new special casing context conditions with previous ones
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
- consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

---------------------------------------------------------------------------- ***

Unicode 4.0.1 update

*** related Jitterbugs

3170 RFE: Update to Unicode 4.0.1
3171 Add new Unicode 4.0.1 properties
3520 use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
  according to PRI #26
  http://www.unicode.org/review/resolved-pri.html#pri26
- undone again because no corrigendum in sight;
  instead modified tests to not check consistency on this for Unicode 4.0.1

* ucdterms.txt
- update from http://www.unicode.org/copyright.html
  formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
- add U_LB_INSEPARABLE due to a spelling fix
  + put short name comment only on line with new constant
    for genpname perl script parser
- new binary properties
  + STerm
  + Variation_Selector

* genpname
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
- perl script: correctly calculate the maximum number of fields per row

* uscript.h
- new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c track changes in DerivedNormalizationProps.txt
- "FNC" -> "FC_NFKC"
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
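
The quick-check format change noted for gennorm.c above can be pictured with a
small parser that accepts both spellings. This is a minimal Python sketch for
illustration only, not the gennorm.c code (which is C). The function name and
the sample lines are hypothetical, and it assumes the usual UCD layout of
"code point or range ; field ; field # comment".

```python
def parse_quick_check(line):
    """Return (start, end, form, value) for a quick-check line, else None.

    Accepts the old single-field style ("NFD_NO", "NFC_MAYBE") and the
    new two-field style ("NFD_QC; N", "NFC_QC; M").
    """
    data = line.split("#", 1)[0].strip()  # drop any trailing comment
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    first, _, last = fields[0].partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start
    if len(fields) == 2:
        # old style: normalization form and value share one field
        form, _, word = fields[1].partition("_")
        value = {"NO": "N", "MAYBE": "M"}.get(word)
        if value is None:
            return None  # some other property, e.g. "FNC"
    elif len(fields) == 3 and fields[1].endswith("_QC"):
        # new style: property name and value are separate fields
        form = fields[1][: -len("_QC")]
        value = fields[2]
    else:
        return None
    return start, end, form, value


old_style = parse_quick_check("00C0..00C5 ; NFD_NO # sample comment")
new_style = parse_quick_check("00C0..00C5 ; NFD_QC; N # sample comment")
```

Both calls yield the same tuple, so code downstream of the parser does not need
to know which version of the file it was given.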

* genprops/props2.c track changes in DerivedNumericValues.txt
- changed from 3 columns to 2, dropping the numeric type
  + assume that the type is always numeric for Han characters,
    and that only those are added in addition to what UnicodeData.txt lists

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- update test of default bidi classes according to PRI #28
  /tsutil/cucdtst/TestUnicodeData
  http://www.unicode.org/review/resolved-pri.html#pri28
- bidi tests: change exemplar character for ES depending on Unicode version
- change hardcoded expected property values where they change

*** other code

* name matching
- read UCD.html

* scripts
- use new Hrkt=Katakana_Or_Hiragana

* ZWJ & ZWNJ
- are now part of combining character sequences
- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ