intl/icu/source/extra/uconv/uconv.1.in

Wed, 31 Dec 2014 06:09:35 +0100

author
Michael Schloh von Bennewitz <michael@schloh.com>
changeset 0
6474c204b198
permissions
-rw-r--r--

Cloned upstream origin tor-browser at tor-browser-31.3.0esr-4.5-1-build1
revision ID fc1c9ff7c1b2defdbc039f12214767608f46423f for hacking purpose.

     1 .\" Hey, Emacs! This is -*-nroff-*- you know...
     2 .\"
     3 .\" uconv.1: manual page for the uconv utility.
     4 .\"
     5 .\" Copyright (C) 2000-2013 IBM, Inc. and others.
     6 .\"
     7 .\" Manual page by Yves Arrouye <yves@realnames.com>.
     8 .\"
     9 .TH UCONV 1 "2005-jul-1" "ICU MANPAGE" "ICU @VERSION@ Manual"
    10 .SH NAME
    11 .B uconv
    12 \- convert data from one encoding to another
    13 .SH SYNOPSIS
    14 .B uconv
    15 [
    16 .BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
    17 ]
    18 [
    19 .BI "\-V\fP, \fB\-\-version"
    20 ]
    21 [
    22 .BI "\-s\fP, \fB\-\-silent"
    23 ]
    24 [
    25 .BI "\-v\fP, \fB\-\-verbose"
    26 ]
    27 [
    28 .BI "\-l\fP, \fB\-\-list"
    29 |
    30 .BI "\-l\fP, \fB\-\-list\-code" " code"
    31 |
    32 .BI "\-\-default-code"
    33 |
    34 .BI "\-L\fP, \fB\-\-list\-transliterators"
    35 ]
    36 [
    37 .BI "\-\-canon"
    38 ]
    39 [
    40 .BI "\-x" " transliteration"
    41 ]
    42 [
    43 .BI "\-\-to\-callback" " callback"
    44 |
    45 .B "\-c"
    46 ]
    47 [
    48 .BI "\-\-from\-callback" " callback"
    49 |
    50 .B "\-i"
    51 ]
    52 [
    53 .BI "\-\-callback" " callback"
    54 ]
    55 [
    56 .BI "\-\-fallback"
    57 |
    58 .BI "\-\-no\-fallback"
    59 ]
    60 [
    61 .BI "\-b\fP, \fB\-\-block\-size" " size"
    62 ]
    63 [
    64 .BI "\-f\fP, \fB\-\-from\-code" " encoding"
    65 ]
    66 [
    67 .BI "\-t\fP, \fB\-\-to\-code" " encoding"
    68 ]
    69 [
    70 .BI "\-\-add\-signature"
    71 ]
    72 [
    73 .BI "\-\-remove\-signature"
    74 ]
    75 [
    76 .BI "\-o\fP, \fB\-\-output" " file"
    77 ]
    78 [
    79 .IR file .\|.\|.
    80 ]
    81 .SH DESCRIPTION
    82 .B uconv
    83 converts, or transcodes, each given
    84 .I file
    85 (or its standard input if no
    86 .I file
    87 is specified) from one
    88 .I encoding
    89 to another. 
    90 The transcoding is done using Unicode as a pivot encoding
    91 (i.e. the data are first transcoded from their original encoding to
    92 Unicode, and then from Unicode to the destination encoding).
    93 .PP
    94 If an
    95 .I encoding
    96 is not specified or is
    97 .BR - ,
    98 the default encoding is used. Thus, calling
    99 .B uconv
   100 with no
   101 .I encoding
   102 provides an easy way to validate and sanitize data files for
   103 further consumption by tools requiring data in the default encoding.
   104 .PP
   105 When calling
   106 .BR uconv ,
   107 it is possible to specify callbacks that are used to handle invalid
   108 characters in the input, or characters that cannot be transcoded to
   109 the destination encoding. Some encodings, for example, offer a default
   110 substitution character that can be used to represent the occurrence of
   111 such characters in the input. Other callbacks offer a useful visual
   112 representation of the invalid data.
   113 .PP
   114 .B uconv
   115 can also run the specified
   116 .IR transliteration
   117 on the transcoded data,
   118 in which case transliteration will happen as an intermediate step,
   119 after the data have been transcoded to Unicode.
   120 The
   121 .I transliteration
   122 can be either a list of semicolon-separated transliterator names,
   123 or an arbitrarily complex set of rules in the ICU transliteration
   124 rules format.
   125 .PP
   126 For transcoding purposes,
   127 .B uconv
   128 options are compatible with those of
   129 .BR iconv (1),
   130 making it easy to use as a drop-in replacement in scripts. It is not necessarily the case,
   131 however, that the encoding names used by
   132 .B uconv
   133 and ICU are the same as the ones used by
   134 .BR iconv (1).
   135 Also, options that provide informational data, such as the
   136 .B \-l\fP, \fB\-\-list
   137 one offered by some 
   138 .BR iconv (1)
   139 variants such as GNU's, produce data in a slightly different and
   140 easier to parse format.
   141 .SH OPTIONS
   142 .TP
   143 .BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
   144 Print help about usage and exit.
   145 .TP
   146 .BR "\-V\fP, \fB\-\-version"
   147 Print the version of
   148 .B uconv
   149 and exit.
   150 .TP
   151 .BI "\-s\fP, \fB\-\-silent"
   152 Suppress messages during execution.
   153 .TP
   154 .BI "\-v\fP, \fB\-\-verbose"
   155 Display extra informative messages during execution.
   156 .TP
   157 .BI "\-l\fP, \fB\-\-list"
   158 List all the available encodings and exit.
   159 .TP
   160 .BI "\-l\fP, \fB\-\-list\-code" " code"
   161 List only the
   162 .I code
   163 encoding and exit. If
   164 .I code
   165 is not a proper encoding, exit with an error.
   166 .TP
   167 .BI "\-\-default-code"
   168 List only the name of the default encoding and exit.
   169 .TP
   170 .BI "\-L\fP, \fB\-\-list\-transliterators"
   171 List all the available transliterators and exit.
   172 .TP
   173 .BI "\-\-canon"
   174 If used with
   175 .BI "\-l\fP, \fB\-\-list"
   176 or
   177 .BR "\-\-default-code" ,
   178 the list of encodings is produced in a format compatible with
   179 .BR convrtrs.txt (5).
   180 If used with
   181 .BR "\-L\fP, \fB\-\-list\-transliterators" ,
   182 print only one transliterator name per line.
   183 .TP
   184 .BI "\-x" " transliteration"
   185 Run the given
   186 .IR transliteration
   187 on the transcoded Unicode data,
   188 and use the transliterated data as input for the transcoding to
   189 the destination encoding.
   190 .TP
   191 .BI "\-\-to\-callback" " callback"
   192 Use
   193 .I callback
   194 to handle characters that cannot be transcoded to the destination
   195 encoding. See section
   196 .B CALLBACKS
   197 for details on valid callbacks.
   198 .TP
   199 .B "\-c"
   200 Omit invalid characters from the output.
   201 Same as
   202 .BR "\-\-to\-callback skip" .
   203 .TP
   204 .BI "\-\-from\-callback" " callback"
   205 Use
   206 .I callback
   207 to handle characters that cannot be transcoded from the original
   208 encoding. See section
   209 .B CALLBACKS
   210 for details on valid callbacks.
   211 .TP
   212 .B "\-i"
   213 Ignore invalid sequences in the input.
   214 Same as
   215 .BR "\-\-from\-callback skip" .
   216 .TP
   217 .BI "\-\-callback" " callback"
   218 Use
   219 .I callback
   220 to handle both characters that cannot be transcoded from the original
   221 encoding and characters that cannot be transcoded to the destination
   222 encoding. See section
   223 .B CALLBACKS
   224 for details on valid callbacks.
   225 .TP
   226 .BI "\-\-fallback"
   227 Use the fallback mapping when transcoding from
   228 Unicode to the destination encoding.
   229 .TP
   230 .BI "\-\-no\-fallback"
   231 Do not use the fallback mapping when transcoding from Unicode to the
   232 destination encoding.
   233 This is the default.
   234 .TP
   235 .BI "\-b\fP, \fB\-\-block\-size" " size"
   236 Read input in blocks of
   237 .I size
   238 bytes at a time. The default block size is
   239 4096.
   240 .TP
   241 .BI "\-f\fP, \fB\-\-from\-code" " encoding"
   242 Set the original encoding of the data to 
   243 .IR encoding .
   244 .TP
   245 .BI "\-t\fP, \fB\-\-to\-code" " encoding"
   246 Transcode the data to
   247 .IR encoding .
   248 .TP
   249 .BI "\-\-add\-signature"
   250 Add a U+FEFF Unicode signature character (BOM) if the output charset
   251 supports it and does not add one anyway.
   252 .TP
   253 .BI "\-\-remove\-signature"
   254 Remove a U+FEFF Unicode signature character (BOM).
   255 .TP
   256 .BI "\-o\fP, \fB\-\-output" " file"
   257 Write the transcoded data to
   258 .IR file .
   259 .SH CALLBACKS
   260 .B uconv
   261 supports specifying callbacks to handle invalid data. Callbacks can be
   262 set for both directions of transcoding: from the original encoding to
   263 Unicode, with the
   264 .BR "\-\-from\-callback"
   265 option, and from Unicode to the destination encoding, with the
   266 .BR "\-\-to\-callback"
   267 option.
   268 .PP
   269 The following is a list of valid
   270 .I callback
   271 names, along with a description of their behavior. The list of
   272 callbacks actually supported by
   273 .B uconv
   274 is displayed when it is called with
   275 .BR "\-h\fP, \fB\-\-help" .
   276 .PP
   277 .TP \w'\fBescape-unicode'u+3n
   278 .B substitute
   279 Write the encoding's substitute sequence, or the Unicode
   280 replacement character
   281 .B U+FFFD
   282 when transcoding to Unicode.
   283 .TP
   284 .B skip
   285 Ignore the invalid data.
   286 .TP
   287 .B stop
   288 Stop with an error when encountering invalid data.
   289 This is the default callback.
   290 .TP
   291 .B escape
   292 Same as
   293 .BR escape-icu .
   294 .TP
   295 .B escape-icu
   296 Replace the missing characters with a string of the format
   297 .BR %U\fIhhhh\fP
   298 for plane 0 characters, and
   299 .BR %U\fIhhhh\fP%U\fIhhhh\fP
   300 for planes 1 and above characters,
   301 where
   302 .I hhhh
   303 is the hexadecimal value of one of the UTF-16 code units representing the
   304 character. Characters from planes 1 and above are written as a pair of
   305 UTF-16 surrogate code units.
   306 .TP
   307 .B escape-java
   308 Replace the missing characters with a string of the format
   309 .BR \eu\fIhhhh\fP
   310 for plane 0 characters, and
   311 .BR \eu\fIhhhh\fP\eu\fIhhhh\fP
   312 for planes 1 and above characters,
   313 where
   314 .I hhhh
   315 is the hexadecimal value of one of the UTF-16 code units representing the
   316 character. Characters from planes 1 and above are written as a pair of
   317 UTF-16 surrogate code units.
   318 .TP
   319 .B escape-c
   320 Replace the missing characters with a string of the format
   321 .BR \eu\fIhhhh\fP
   322 for plane 0 characters, and
   323 .BR \eU\fIhhhhhhhh\fP
   324 for planes 1 and above characters,
   325 where
   326 .I hhhh
   327 and
   328 .I hhhhhhhh
   329 are the hexadecimal values of the Unicode codepoint.
   330 .TP
   331 .B escape-xml
   332 Same as
   333 .BR escape-xml-hex .
   334 .TP
   335 .B escape-xml-hex
   336 Replace the missing characters with a string of the format
   337 .BR &#x\fIhhhh\fP; ,
   338 where
   339 .I hhhh
   340 is the hexadecimal value of the Unicode codepoint.
   341 .TP
   342 .B escape-xml-dec
   343 Replace the missing characters with a string of the format
   344 .BR &#\fInnnn\fP; ,
   345 where
   346 .I nnnn
   347 is the decimal value of the Unicode codepoint.
   348 .TP
   349 .B escape-unicode
   350 Replace the missing characters with a string of the format
   351 .BR {U+\fIhhhh\fP} ,
   352 where
   353 .I hhhh
   354 is the hexadecimal value of the Unicode codepoint.
   355 That hexadecimal string is of variable length and can use from 4 to
   356 6 digits.
   357 This is the format universally used to denote a Unicode codepoint in
   358 the literature, delimited by curly braces for easy recognition of those
   359 substitutions in the output.
   360 .SH EXAMPLES
   361 Convert data from a given
   362 .I encoding
   363 to the platform encoding:
   365 .RS 4
   366 .B \fR$ \fPuconv \-f \fIencoding\fP
   367 .RE
   368 .PP
   369 Check if a
   370 .I file
   371 contains valid data for a given
   372 .IR encoding :
   374 .RS 4
   375 .B \fR$ \fPuconv \-f \fIencoding\fP \-c \fIfile\fP >/dev/null
   376 .RE
   377 .PP
   378 Convert a UTF-8
   379 .I file
   380 to a given
   381 .I encoding
   382 and ensure that the resulting text is good for any version of HTML:
   384 .RS 4
   385 .B \fR$ \fPuconv \-f utf-8 \-t \fIencoding\fP \e
   386 .br
   387 .B "    \-\-callback escape-xml-dec \fIfile\fP"
   388 .RE
   389 .PP
   390 Display the names of the Unicode code points in a UTF-8 file:
   392 .RS 4
   393 .B \fR$ \fPuconv \-f utf-8 \-x any-name \fIfile\fP
   394 .RE
   395 .PP
   396 Print the name of a Unicode code point whose value is known (\fBU+30AB\fP
   397 in this example):
   399 .RS 4
   400 .B \fR$ \fPecho '\eu30ab' | uconv \-x 'hex-any; any-name'; echo
   401 .br
   402 {KATAKANA LETTER KA}{LINE FEED}
   403 .br
   404 $ 
   405 .RE
   407 (The names are delimited by curly braces.
   408 The name of the line terminator is displayed as well.)
   409 .PP
   410 Normalize UTF-8 data using Unicode NFKC, remove all control characters,
   411 and map Katakana to Hiragana:
   413 .RS 4
   414 .B \fR$ \fPuconv \-f utf-8 \-t utf-8 \e
   415 .br
   416 .B "      \-x '::nfkc; [:Cc:] >; ::katakana-hiragana;'"
   417 .SH CAVEATS AND BUGS
   418 .B uconv
   419 reports errors as occurring at the first invalid byte
   420 encountered. This may be confusing to users of GNU
   421 .BR iconv (1),
   422 which reports errors as occurring at the first byte of an invalid
   423 sequence. For multi-byte character sets or encodings, this means that
   424 .BR uconv
   425 error positions may be at a later offset in the input stream than
   426 would be the case with GNU
   427 .BR iconv (1).
   428 .PP
   429 The reporting of error positions when a transliterator is used may be
   430 inaccurate or unavailable, in which case
   431 .BR uconv
   432 will report the offset in the output stream at which the error
   433 occurred.
   434 .SH AUTHORS
   435 Jonas Utterstroem
   436 .br
   437 Yves Arrouye
   438 .SH VERSION
   439 @VERSION@
   440 .SH COPYRIGHT
   441 Copyright (C) 2000-2005 IBM, Inc. and others.
   442 .SH SEE ALSO
   443 .BR iconv (1)
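
The escape formats listed under CALLBACKS can be illustrated with a small Python sketch. It is not ICU's implementation: the helper names `utf16_units` and `escape` are hypothetical, and the use of uppercase hexadecimal digits is an assumption, since the manual page only describes the general shape of each format.

```python
def utf16_units(cp):
    """Return the UTF-16 code units (one or two ints) for a code point."""
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    # Planes 1 and above are written as a surrogate pair.
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]


def escape(ch, style):
    """Format one character the way the named callback is described.

    Assumption: hex digits are uppercase; the man page does not say.
    """
    cp = ord(ch)
    if style == "escape-icu":
        return "".join("%%U%04X" % u for u in utf16_units(cp))
    if style == "escape-java":
        return "".join("\\u%04X" % u for u in utf16_units(cp))
    if style == "escape-c":
        return "\\u%04X" % cp if cp < 0x10000 else "\\U%08X" % cp
    if style == "escape-xml-hex":
        return "&#x%X;" % cp
    if style == "escape-xml-dec":
        return "&#%d;" % cp
    if style == "escape-unicode":
        # At least 4 hex digits, as many as 6.
        return "{U+%04X}" % cp
    raise ValueError("unknown style: " + style)


if __name__ == "__main__":
    for style in ("escape-icu", "escape-java", "escape-c",
                  "escape-xml-hex", "escape-xml-dec", "escape-unicode"):
        print(style, escape("\u30ab", style), escape("\U0001F600", style))
```

For a plane 0 character such as U+30AB, the UTF-16 based styles and `escape-c` each emit a single four-digit escape. For a character above plane 0, `escape-icu` and `escape-java` emit a surrogate pair, while `escape-c` emits one eight-digit escape.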
