.\" Hey, Emacs! This is -*-nroff-*- you know...
.\"
.\" uconv.1: manual page for the uconv utility.
.\"
.\" Copyright (C) 2000-2013 IBM, Inc. and others.
.\"
.\" Manual page by Yves Arrouye .
.\"
.TH UCONV 1 "2005-jul-1" "ICU MANPAGE" "ICU @VERSION@ Manual"
.SH NAME
.B uconv
\- convert data from one encoding to another
.SH SYNOPSIS
.B uconv
[
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
]
[
.BI "\-V\fP, \fB\-\-version"
]
[
.BI "\-s\fP, \fB\-\-silent"
]
[
.BI "\-v\fP, \fB\-\-verbose"
]
[
.BI "\-l\fP, \fB\-\-list"
|
.BI "\-l\fP, \fB\-\-list\-code" " code"
|
.BI "\-\-default-code"
|
.BI "\-L\fP, \fB\-\-list\-transliterators"
]
[
.BI "\-\-canon"
]
[
.BI "\-x" " transliteration"
]
[
.BI "\-\-to\-callback" " callback"
|
.B "\-c"
]
[
.BI "\-\-from\-callback" " callback"
|
.B "\-i"
]
[
.BI "\-\-callback" " callback"
]
[
.BI "\-\-fallback"
|
.BI "\-\-no\-fallback"
]
[
.BI "\-b\fP, \fB\-\-block\-size" " size"
]
[
.BI "\-f\fP, \fB\-\-from\-code" " encoding"
]
[
.BI "\-t\fP, \fB\-\-to\-code" " encoding"
]
[
.BI "\-\-add\-signature"
]
[
.BI "\-\-remove\-signature"
]
[
.BI "\-o\fP, \fB\-\-output" " file"
]
[
.IR file .\|.\|.
]
.SH DESCRIPTION
.B uconv
converts, or transcodes, each given
.I file
(or its standard input if no
.I file
is specified) from one
.I encoding
to another.
The transcoding is done using Unicode as a pivot encoding
(i.e. the data are first transcoded from their original encoding to
Unicode, and then from Unicode to the destination encoding).
.PP
If an
.I encoding
is not specified or is
.BR - ,
the default encoding is used. Thus, calling
.B uconv
with no
.I encoding
provides an easy way to validate and sanitize data files for
further consumption by tools requiring data in the default encoding.
.PP
When calling
.BR uconv ,
it is possible to specify callbacks that are used to handle invalid
characters in the input, or characters that cannot be transcoded to
the destination encoding. Some encodings, for example, offer a default
substitution character that can be used to represent the occurrence of
such characters in the input. Other callbacks offer a useful visual
representation of the invalid data.
.PP
.B uconv
can also run the specified
.I transliteration
on the transcoded data,
in which case transliteration will happen as an intermediate step,
after the data have been transcoded to Unicode.
The
.I transliteration
can be either a list of semicolon-separated transliterator names,
or an arbitrarily complex set of rules in the ICU transliteration
rules format.
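.PP
As an illustration only, the two-step pivot model described above can be
sketched in Python, which is not part of
.B uconv
or ICU. The function name
.I transcode
and its parameters are hypothetical, and Python's codec error handlers
are only loose analogues of the
.B uconv
callbacks; this is a sketch of the idea, not of how
.B uconv
is implemented.
.PP
.nf
```python
def transcode(data, from_code, to_code,
              from_errors="strict", to_errors="strict"):
    # Hypothetical helper, not a uconv or ICU API.
    # Step 1: decode the input bytes into Unicode text (the pivot).
    text = data.decode(from_code, errors=from_errors)
    # Step 2: encode the Unicode text into the destination encoding.
    return text.encode(to_code, errors=to_errors)


# The bytes for "caf" followed by e-acute in ISO-8859-1.
latin1 = bytes([0x63, 0x61, 0x66, 0xE9])

# Plain conversion: e-acute becomes the two-byte UTF-8 sequence C3 A9.
print(transcode(latin1, "iso-8859-1", "utf-8").hex())  # prints 636166c3a9

# ASCII cannot represent e-acute, so it is escaped on the way out as a
# decimal character reference.
print(transcode(latin1, "iso-8859-1", "ascii",
                to_errors="xmlcharrefreplace"))  # prints b'caf&#233;'
```
.fi
.PP
In this sketch, Python's
.B strict
handler plays roughly the role of the
.B stop
callback, and
.B xmlcharrefreplace
roughly that of
.BR escape-xml-dec .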
.PP
For transcoding purposes,
.B uconv
options are compatible with those of
.BR iconv (1),
which makes it easy to substitute for
.BR iconv (1)
in scripts. It is not necessarily the case,
however, that the encoding names used by
.B uconv
and ICU are the same as the ones used by
.BR iconv (1).
Also, options that provide informational data, such as the
.B \-l\fP, \fB\-\-list
one offered by some
.BR iconv (1)
variants such as GNU's, produce data in a slightly different and
easier to parse format.
.SH OPTIONS
.TP
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
Print help about usage and exit.
.TP
.BR "\-V\fP, \fB\-\-version"
Print the version of
.B uconv
and exit.
.TP
.BI "\-s\fP, \fB\-\-silent"
Suppress messages during execution.
.TP
.BI "\-v\fP, \fB\-\-verbose"
Display extra informative messages during execution.
.TP
.BI "\-l\fP, \fB\-\-list"
List all the available encodings and exit.
.TP
.BI "\-l\fP, \fB\-\-list\-code" " code"
List only the
.I code
encoding and exit. If
.I code
is not a proper encoding, exit with an error.
.TP
.BI "\-\-default-code"
List only the name of the default encoding and exit.
.TP
.BI "\-L\fP, \fB\-\-list\-transliterators"
List all the available transliterators and exit.
.TP
.BI "\-\-canon"
If used with
.BI "\-l\fP, \fB\-\-list"
or
.BR "\-\-default-code" ,
the list of encodings is produced in a format compatible with
.BR convrtrs.txt (5).
If used with
.BR "\-L\fP, \fB\-\-list\-transliterators" ,
print only one transliterator name per line.
.TP
.BI "\-x" " transliteration"
Run the given
.I transliteration
on the transcoded Unicode data,
and use the transliterated data as input for the transcoding to
the destination encoding.
.TP
.BI "\-\-to\-callback" " callback"
Use
.I callback
to handle characters that cannot be transcoded to the destination
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.B "\-c"
Omit invalid characters from the output.
Same as
.BR "\-\-to\-callback skip" .
.TP
.BI "\-\-from\-callback" " callback"
Use
.I callback
to handle characters that cannot be transcoded from the original
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.B "\-i"
Ignore invalid sequences in the input.
Same as
.BR "\-\-from\-callback skip" .
.TP
.BI "\-\-callback" " callback"
Use
.I callback
to handle both characters that cannot be transcoded from the original
encoding and characters that cannot be transcoded to the destination
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.BI "\-\-fallback"
Use the fallback mapping when transcoding from
Unicode to the destination encoding.
.TP
.BI "\-\-no\-fallback"
Do not use the fallback mapping when transcoding from Unicode to the
destination encoding.
This is the default.
.TP
.BI "\-b\fP, \fB\-\-block\-size" " size"
Read input in blocks of
.I size
bytes at a time. The default block size is
4096.
.TP
.BI "\-f\fP, \fB\-\-from\-code" " encoding"
Set the original encoding of the data to
.IR encoding .
.TP
.BI "\-t\fP, \fB\-\-to\-code" " encoding"
Transcode the data to
.IR encoding .
.TP
.BI "\-\-add\-signature"
Add a U+FEFF Unicode signature character (BOM) if the output charset
supports it and does not add one anyway.
.TP
.BI "\-\-remove\-signature"
Remove a U+FEFF Unicode signature character (BOM).
.TP
.BI "\-o\fP, \fB\-\-output" " file"
Write the transcoded data to
.IR file .
.SH CALLBACKS
.B uconv
supports specifying callbacks to handle invalid data. Callbacks can be
set for both directions of transcoding: from the original encoding to
Unicode, with the
.BR "\-\-from\-callback"
option, and from Unicode to the destination encoding, with the
.BR "\-\-to\-callback"
option.
.PP
The following is a list of valid
.I callback
names, along with a description of their behavior. The list of
callbacks actually supported by
.B uconv
is displayed when it is called with
.BR "\-h\fP, \fB\-\-help" .
.PP
.TP \w'\fBescape-unicode'u+3n
.B substitute
Write the encoding's substitute sequence, or the Unicode
replacement character
.B U+FFFD
when transcoding to Unicode.
.TP
.B skip
Ignore the invalid data.
.TP
.B stop
Stop with an error when encountering invalid data.
This is the default callback.
.TP
.B escape
Same as
.BR escape-icu .
.TP
.B escape-icu
Replace the missing characters with a string of the format
.BR %U\fIhhhh\fP
for characters in plane 0, and
.BR %U\fIhhhh\fP%U\fIhhhh\fP
for characters in planes 1 and above,
where
.I hhhh
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
.TP
.B escape-java
Replace the missing characters with a string of the format
.BR \eu\fIhhhh\fP
for characters in plane 0, and
.BR \eu\fIhhhh\fP\eu\fIhhhh\fP
for characters in planes 1 and above,
where
.I hhhh
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
.TP
.B escape-c
Replace the missing characters with a string of the format
.BR \eu\fIhhhh\fP
for characters in plane 0, and
.BR \eU\fIhhhhhhhh\fP
for characters in planes 1 and above,
where
.I hhhh
and
.I hhhhhhhh
are the hexadecimal values of the Unicode codepoint.
.TP
.B escape-xml
Same as
.BR escape-xml-hex .
.TP
.B escape-xml-hex
Replace the missing characters with a string of the format
.BR &#x\fIhhhh\fP; ,
where
.I hhhh
is the hexadecimal value of the Unicode codepoint.
.TP
.B escape-xml-dec
Replace the missing characters with a string of the format
.BR &#\fInnnn\fP; ,
where
.I nnnn
is the decimal value of the Unicode codepoint.
.TP
.B escape-unicode
Replace the missing characters with a string of the format
.BR {U+\fIhhhh\fP} ,
where
.I hhhh
is the hexadecimal value of the Unicode codepoint.
That hexadecimal string is of variable length and can use from 4 to
6 digits.
This is the notation conventionally used to denote a Unicode codepoint in
the literature, here delimited by curly braces for easy recognition of those
substitutions in the output.
.SH EXAMPLES
Convert data from a given
.I encoding
to the platform encoding:

.RS 4
.B \fR$ \fPuconv \-f \fIencoding\fP
.RE
.PP
Check if a
.I file
contains valid data for a given
.IR encoding :

.RS 4
.B \fR$ \fPuconv \-f \fIencoding\fP \-c \fIfile\fP >/dev/null
.RE
.PP
Convert a UTF-8
.I file
to a given
.I encoding
and ensure that the resulting text is good for any version of HTML:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-t \fIencoding\fP \e
.br
.B "    \-\-callback escape-xml-dec \fIfile\fP"
.RE
.PP
Display the names of the Unicode code points in a UTF-8 file:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-x any-name \fIfile\fP
.RE
.PP
Print the name of a Unicode code point whose value is known (\fBU+30AB\fP
in this example):

.RS 4
.B \fR$ \fPecho '\eu30ab' | uconv \-x 'hex-any; any-name'; echo
.br
{KATAKANA LETTER KA}{LINE FEED}
.br
$
.RE

(The names are delimited by curly braces.
The name of the line terminator is also displayed.)
.PP
Normalize UTF-8 data using Unicode NFKC, remove all control characters,
and map Katakana to Hiragana:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-t utf-8 \e
.br
.B "    \-x '::nfkc; [:Cc:] >; ::katakana-hiragana;'"
.RE
.SH CAVEATS AND BUGS
.B uconv
reports errors as occurring at the first invalid byte
encountered. This may be confusing to users of GNU
.BR iconv (1),
which reports errors as occurring at the first byte of an invalid
sequence. For multi-byte character sets or encodings, this means that
.B uconv
error positions may be at a later offset in the input stream than
would be the case with GNU
.BR iconv (1).
.PP
The reporting of error positions when a transliterator is used may be
inaccurate or unavailable, in which case
.B uconv
will report the offset in the output stream at which the error
occurred.
.SH AUTHORS
Jonas Utterstroem
.br
Yves Arrouye
.SH VERSION
@VERSION@
.SH COPYRIGHT
Copyright (C) 2000-2005 IBM, Inc. and others.
.SH SEE ALSO
.BR iconv (1)