michael@0: .\" Hey, Emacs! This is -*-nroff-*- you know...
michael@0: .\"
michael@0: .\" uconv.1: manual page for the uconv utility.
michael@0: .\"
michael@0: .\" Copyright (C) 2000-2013 IBM, Inc. and others.
michael@0: .\"
michael@0: .\" Manual page by Yves Arrouye <yves@realnames.com>.
michael@0: .\"
michael@0: .TH UCONV 1 "2005-jul-1" "ICU MANPAGE" "ICU @VERSION@ Manual"
michael@0: .SH NAME
michael@0: .B uconv
michael@0: \- convert data from one encoding to another
michael@0: .SH SYNOPSIS
michael@0: .B uconv
michael@0: [
michael@0: .BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
michael@0: ]
michael@0: [
michael@0: .BI "\-V\fP, \fB\-\-version"
michael@0: ]
michael@0: [
michael@0: .BI "\-s\fP, \fB\-\-silent"
michael@0: ]
michael@0: [
michael@0: .BI "\-v\fP, \fB\-\-verbose"
michael@0: ]
michael@0: [
michael@0: .BI "\-l\fP, \fB\-\-list"
michael@0: |
michael@0: .BI "\-l\fP, \fB\-\-list\-code" " code"
michael@0: |
michael@0: .BI "\-\-default-code"
michael@0: |
michael@0: .BI "\-L\fP, \fB\-\-list\-transliterators"
michael@0: ]
michael@0: [
michael@0: .BI "\-\-canon"
michael@0: ]
michael@0: [
michael@0: .BI "\-x" " transliteration"
michael@0: ]
michael@0: [
michael@0: .BI "\-\-to\-callback" " callback"
michael@0: |
michael@0: .B "\-c"
michael@0: ]
michael@0: [
michael@0: .BI "\-\-from\-callback" " callback"
michael@0: |
michael@0: .B "\-i"
michael@0: ]
michael@0: [
michael@0: .BI "\-\-callback" " callback"
michael@0: ]
michael@0: [
michael@0: .BI "\-\-fallback"
michael@0: |
michael@0: .BI "\-\-no\-fallback"
michael@0: ]
michael@0: [
michael@0: .BI "\-b\fP, \fB\-\-block\-size" " size"
michael@0: ]
michael@0: [
michael@0: .BI "\-f\fP, \fB\-\-from\-code" " encoding"
michael@0: ]
michael@0: [
michael@0: .BI "\-t\fP, \fB\-\-to\-code" " encoding"
michael@0: ]
michael@0: [
michael@0: .BI "\-\-add\-signature"
michael@0: ]
michael@0: [
michael@0: .BI "\-\-remove\-signature"
michael@0: ]
michael@0: [
michael@0: .BI "\-o\fP, \fB\-\-output" " file"
michael@0: ]
michael@0: [
michael@0: .IR file .\|.\|.
michael@0: ]
michael@0: .SH DESCRIPTION
michael@0: .B uconv
michael@0: converts, or transcodes, each given
michael@0: .I file
michael@0: (or its standard input if no
michael@0: .I file
michael@0: is specified) from one
michael@0: .I encoding
michael@0: to another. 
michael@0: The transcoding is done using Unicode as a pivot encoding
michael@0: (i.e. the data are first transcoded from their original encoding to
michael@0: Unicode, and then from Unicode to the destination encoding).
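.PP
The pivot through Unicode can be sketched in a few lines of Python.
This is only an illustration of the two-step idea using Python's own
codecs, not the ICU implementation; the function name
.B transcode
is hypothetical.

```python
def transcode(data: bytes, from_code: str, to_code: str) -> bytes:
    """Transcode bytes by pivoting through Unicode text."""
    text = data.decode(from_code)   # step 1: original encoding -> Unicode
    return text.encode(to_code)     # step 2: Unicode -> destination encoding


if __name__ == "__main__":
    latin1 = b"caf\xe9"             # "cafe" with e-acute, in ISO-8859-1
    print(transcode(latin1, "iso-8859-1", "utf-8"))
```

Both steps can fail independently, which is why separate callbacks exist
for each direction.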
michael@0: .PP
michael@0: If an
michael@0: .I encoding
michael@0: is not specified or is
michael@0: .BR - ,
michael@0: the default encoding is used. Thus, calling
michael@0: .B uconv
michael@0: with no
michael@0: .I encoding
michael@0: provides an easy way to validate and sanitize data files for
michael@0: further consumption by tools requiring data in the default encoding.
michael@0: .PP
michael@0: When calling
michael@0: .BR uconv ,
michael@0: it is possible to specify callbacks that are used to handle invalid
michael@0: characters in the input, or characters that cannot be transcoded to
michael@0: the destination encoding. Some encodings, for example, offer a default
michael@0: substitution character that can be used to represent the occurrence of
michael@0: such characters in the input. Other callbacks offer a useful visual
michael@0: representation of the invalid data.
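.PP
As a rough analogy only, Python's codec error handlers behave much like
some of these callbacks. The mapping suggested in the comments is an
assumption made for illustration; the handlers are Python's, not ICU's.

```python
text = "A\u20acB"  # EURO SIGN cannot be represented in ASCII


def encode_with(handler: str) -> bytes:
    """Encode the sample text to ASCII with the given Python error handler."""
    return text.encode("ascii", errors=handler)


substituted = encode_with("replace")        # like a substitution character
skipped = encode_with("ignore")             # like dropping the character
escaped = encode_with("xmlcharrefreplace")  # like a decimal XML escape

# The strict handler stops with an error at the first unconvertible character.
try:
    encode_with("strict")
    stopped = False
except UnicodeEncodeError:
    stopped = True

# In the other direction, invalid input bytes become U+FFFD.
decoded = b"A\xffB".decode("utf-8", errors="replace")
```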
michael@0: .PP
michael@0: .B uconv
michael@0: can also run the specified
michael@0: .IR transliteration
michael@0: on the transcoded data,
michael@0: in which case transliteration will happen as an intermediate step,
michael@0: after the data have been transcoded to Unicode.
michael@0: The
michael@0: .I transliteration
michael@0: can be either a list of semicolon-separated transliterator names,
michael@0: or an arbitrarily complex set of rules in the ICU transliteration
michael@0: rules format.
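.PP
The position of the transliteration step in the pipeline can be sketched
as follows. Python's standard library has no ICU transliterators, so the
sketch substitutes NFKC normalization, control character removal, and a
simplified Katakana to Hiragana shift; the helper names are hypothetical
and the Katakana mapping covers only U+30A1 to U+30F6.

```python
import unicodedata


def katakana_to_hiragana(ch: str) -> str:
    # Simplified stand-in: basic Katakana letters sit 0x60 above Hiragana.
    cp = ord(ch)
    return chr(cp - 0x60) if 0x30A1 <= cp <= 0x30F6 else ch


def convert(data: bytes, from_code: str, to_code: str) -> bytes:
    text = data.decode(from_code)                        # to Unicode
    # Intermediate step: transform the Unicode text before re-encoding.
    text = unicodedata.normalize("NFKC", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Cc")
    text = "".join(katakana_to_hiragana(c) for c in text)
    return text.encode(to_code)                          # from Unicode
```

The transformation only ever sees Unicode text, so it does not depend on
either the original or the destination encoding.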
michael@0: .PP
michael@0: For transcoding purposes,
michael@0: .B uconv
michael@0: options are compatible with those of
michael@0: .BR iconv (1),
michael@0: so it can easily replace that tool in scripts. It is not necessarily the case,
michael@0: however, that the encoding names used by
michael@0: .B uconv
michael@0: and ICU are the same as the ones used by
michael@0: .BR iconv (1).
michael@0: Also, options that provide informational data, such as the
michael@0: .B \-l\fP, \fB\-\-list
michael@0: one offered by some 
michael@0: .BR iconv (1)
michael@0: variants such as GNU's, produce data in a slightly different and
michael@0: easier to parse format.
michael@0: .SH OPTIONS
michael@0: .TP
michael@0: .BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
michael@0: Print help about usage and exit.
michael@0: .TP
michael@0: .BR "\-V\fP, \fB\-\-version"
michael@0: Print the version of
michael@0: .B uconv
michael@0: and exit.
michael@0: .TP
michael@0: .BI "\-s\fP, \fB\-\-silent"
michael@0: Suppress messages during execution.
michael@0: .TP
michael@0: .BI "\-v\fP, \fB\-\-verbose"
michael@0: Display extra informative messages during execution.
michael@0: .TP
michael@0: .BI "\-l\fP, \fB\-\-list"
michael@0: List all the available encodings and exit.
michael@0: .TP
michael@0: .BI "\-l\fP, \fB\-\-list\-code" " code"
michael@0: List only the
michael@0: .I code
michael@0: encoding and exit. If
michael@0: .I code
michael@0: is not a proper encoding, exit with an error.
michael@0: .TP
michael@0: .BI "\-\-default-code"
michael@0: List only the name of the default encoding and exit.
michael@0: .TP
michael@0: .BI "\-L\fP, \fB\-\-list\-transliterators"
michael@0: List all the available transliterators and exit.
michael@0: .TP
michael@0: .BI "\-\-canon"
michael@0: If used with
michael@0: .BI "\-l\fP, \fB\-\-list"
michael@0: or
michael@0: .BR "\-\-default-code" ,
michael@0: the list of encodings is produced in a format compatible with
michael@0: .BR convrtrs.txt (5).
michael@0: If used with
michael@0: .BR "\-L\fP, \fB\-\-list\-transliterators" ,
michael@0: print only one transliterator name per line.
michael@0: .TP
michael@0: .BI "\-x" " transliteration"
michael@0: Run the given
michael@0: .IR transliteration
michael@0: on the transcoded Unicode data,
michael@0: and use the transliterated data as input for the transcoding to
michael@0: the destination encoding.
michael@0: .TP
michael@0: .BI "\-\-to\-callback" " callback"
michael@0: Use
michael@0: .I callback
michael@0: to handle characters that cannot be transcoded to the destination
michael@0: encoding. See section
michael@0: .B CALLBACKS
michael@0: for details on valid callbacks.
michael@0: .TP
michael@0: .B "\-c"
michael@0: Omit invalid characters from the output.
michael@0: Same as
michael@0: .BR "\-\-to\-callback skip" .
michael@0: .TP
michael@0: .BI "\-\-from\-callback" " callback"
michael@0: Use
michael@0: .I callback
michael@0: to handle characters that cannot be transcoded from the original
michael@0: encoding. See section
michael@0: .B CALLBACKS
michael@0: for details on valid callbacks.
michael@0: .TP
michael@0: .B "\-i"
michael@0: Ignore invalid sequences in the input.
michael@0: Same as
michael@0: .BR "\-\-from\-callback skip" .
michael@0: .TP
michael@0: .BI "\-\-callback" " callback"
michael@0: Use
michael@0: .I callback
michael@0: to handle both characters that cannot be transcoded from the original
michael@0: encoding and characters that cannot be transcoded to the destination
michael@0: encoding. See section
michael@0: .B CALLBACKS
michael@0: for details on valid callbacks.
michael@0: .TP
michael@0: .BI "\-\-fallback"
michael@0: Use the fallback mapping when transcoding from
michael@0: Unicode to the destination encoding.
michael@0: .TP
michael@0: .BI "\-\-no\-fallback"
michael@0: Do not use the fallback mapping when transcoding from Unicode to the
michael@0: destination encoding.
michael@0: This is the default.
michael@0: .TP
michael@0: .BI "\-b\fP, \fB\-\-block\-size" " size"
michael@0: Read input in blocks of
michael@0: .I size
michael@0: bytes at a time. The default block size is
michael@0: 4096.
michael@0: .TP
michael@0: .BI "\-f\fP, \fB\-\-from\-code" " encoding"
michael@0: Set the original encoding of the data to 
michael@0: .IR encoding .
michael@0: .TP
michael@0: .BI "\-t\fP, \fB\-\-to\-code" " encoding"
michael@0: Transcode the data to
michael@0: .IR encoding .
michael@0: .TP
michael@0: .BI "\-\-add\-signature"
michael@0: Add a U+FEFF Unicode signature character (BOM) if the output charset
michael@0: supports it and does not add one anyway.
michael@0: .TP
michael@0: .BI "\-\-remove\-signature"
michael@0: Remove a U+FEFF Unicode signature character (BOM).
michael@0: .TP
michael@0: .BI "\-o\fP, \fB\-\-output" " file"
michael@0: Write the transcoded data to
michael@0: .IR file .
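.PP
To make the signature options concrete, the following Python sketch
shows what adding and removing a signature means for UTF-8 data, where
U+FEFF is encoded as the bytes EF BB BF. The function names are
hypothetical, and checking for an existing signature before adding one
is a choice made by this sketch, not documented behavior of
.BR uconv .

```python
import codecs


def add_signature(data: bytes) -> bytes:
    """Prepend the UTF-8 encoding of U+FEFF unless it is already there."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def remove_signature(data: bytes) -> bytes:
    """Strip a leading UTF-8 encoding of U+FEFF, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```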
michael@0: .SH CALLBACKS
michael@0: .B uconv
michael@0: supports specifying callbacks to handle invalid data. Callbacks can be
michael@0: set for both directions of transcoding: from the original encoding to
michael@0: Unicode, with the
michael@0: .BR "\-\-from\-callback"
michael@0: option, and from Unicode to the destination encoding, with the
michael@0: .BR "\-\-to\-callback"
michael@0: option.
michael@0: .PP
michael@0: The following is a list of valid
michael@0: .I callback
michael@0: names, along with a description of their behavior. The list of
michael@0: callbacks actually supported by
michael@0: .B uconv
michael@0: is displayed when it is called with
michael@0: .BR "\-h\fP, \fB\-\-help" .
michael@0: .PP
michael@0: .TP \w'\fBescape-unicode'u+3n
michael@0: .B substitute
michael@0: Write the encoding's substitute sequence, or the Unicode
michael@0: replacement character
michael@0: .B U+FFFD
michael@0: when transcoding to Unicode.
michael@0: .TP
michael@0: .B skip
michael@0: Ignore the invalid data.
michael@0: .TP
michael@0: .B stop
michael@0: Stop with an error when encountering invalid data.
michael@0: This is the default callback.
michael@0: .TP
michael@0: .B escape
michael@0: Same as
michael@0: .BR escape-icu .
michael@0: .TP
michael@0: .B escape-icu
michael@0: Replace the missing characters with a string of the format
michael@0: .BR %U\fIhhhh\fP
michael@0: for plane 0 characters, and
michael@0: .BR %U\fIhhhh\fP%U\fIhhhh\fP
michael@0: for characters in planes 1 and above,
michael@0: where
michael@0: .I hhhh
michael@0: is the hexadecimal value of one of the UTF-16 code units representing the
michael@0: character. Characters from planes 1 and above are written as a pair of
michael@0: UTF-16 surrogate code units.
michael@0: .TP
michael@0: .B escape-java
michael@0: Replace the missing characters with a string of the format
michael@0: .BR \eu\fIhhhh\fP
michael@0: for plane 0 characters, and
michael@0: .BR \eu\fIhhhh\fP\eu\fIhhhh\fP
michael@0: for characters in planes 1 and above,
michael@0: where
michael@0: .I hhhh
michael@0: is the hexadecimal value of one of the UTF-16 code units representing the
michael@0: character. Characters from planes 1 and above are written as a pair of
michael@0: UTF-16 surrogate code units.
michael@0: .TP
michael@0: .B escape-c
michael@0: Replace the missing characters with a string of the format
michael@0: .BR \eu\fIhhhh\fP
michael@0: for plane 0 characters, and
michael@0: .BR \eU\fIhhhhhhhh\fP
michael@0: for characters in planes 1 and above,
michael@0: where
michael@0: .I hhhh
michael@0: and
michael@0: .I hhhhhhhh
michael@0: are the hexadecimal values of the Unicode codepoint.
michael@0: .TP
michael@0: .B escape-xml
michael@0: Same as
michael@0: .BR escape-xml-hex .
michael@0: .TP
michael@0: .B escape-xml-hex
michael@0: Replace the missing characters with a string of the format
michael@0: .BR &#x\fIhhhh\fP; ,
michael@0: where
michael@0: .I hhhh
michael@0: is the hexadecimal value of the Unicode codepoint.
michael@0: .TP
michael@0: .B escape-xml-dec
michael@0: Replace the missing characters with a string of the format
michael@0: .BR &#\fInnnn\fP; ,
michael@0: where
michael@0: .I nnnn
michael@0: is the decimal value of the Unicode codepoint.
michael@0: .TP
michael@0: .B escape-unicode
michael@0: Replace the missing characters with a string of the format
michael@0: .BR {U+\fIhhhh\fP} ,
michael@0: where
michael@0: .I hhhh
michael@0: is the hexadecimal value of the Unicode codepoint.
michael@0: That hexadecimal string is of variable length and can use from 4 to
michael@0: 6 digits.
michael@0: This is the format universally used to denote a Unicode codepoint in
michael@0: the literature, delimited by curly braces for easy recognition of those
michael@0: substitutions in the output.
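.PP
The escape formats listed above can be reproduced with a short Python
sketch. It is written from the descriptions in this section rather than
from the ICU sources, so the uppercase hexadecimal digits and the
unpadded XML hexadecimal form are assumptions, and the function names
are hypothetical.

```python
from typing import List


def utf16_units(cp: int) -> List[int]:
    """Return the UTF-16 code units for a code point."""
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]


def escape(cp: int, style: str) -> str:
    """Format a code point in the named escape style."""
    if style == "escape-icu":
        return "".join("%U{:04X}".format(u) for u in utf16_units(cp))
    if style == "escape-java":
        return "".join("\\u{:04X}".format(u) for u in utf16_units(cp))
    if style == "escape-c":
        return "\\u{:04X}".format(cp) if cp < 0x10000 else "\\U{:08X}".format(cp)
    if style == "escape-xml-hex":
        return "&#x{:X};".format(cp)   # padding is an assumption
    if style == "escape-xml-dec":
        return "&#{};".format(cp)
    if style == "escape-unicode":
        return "{{U+{:04X}}}".format(cp)
    raise ValueError("unknown escape style: " + style)
```

For a code point outside plane 0, the ICU and Java styles produce two
escapes (one per surrogate), while the C, XML and Unicode styles produce
a single escape for the whole code point.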
michael@0: .SH EXAMPLES
michael@0: Convert data from a given
michael@0: .I encoding
michael@0: to the platform encoding:
michael@0: 
michael@0: .RS 4
michael@0: .B \fR$ \fPuconv \-f \fIencoding\fP
michael@0: .RE
michael@0: .PP
michael@0: Check if a
michael@0: .I file
michael@0: contains valid data for a given
michael@0: .IR encoding :
michael@0: 
michael@0: .RS 4
michael@0: .B \fR$ \fPuconv \-f \fIencoding\fP \-c \fIfile\fP >/dev/null
michael@0: .RE
michael@0: .PP
michael@0: Convert a UTF-8
michael@0: .I file
michael@0: to a given
michael@0: .I encoding
michael@0: and ensure that the resulting text is good for any version of HTML:
michael@0: 
michael@0: .RS 4
michael@0: .B \fR$ \fPuconv \-f utf-8 \-t \fIencoding\fP \e
michael@0: .br
michael@0: .B "    \-\-callback escape-xml-dec \fIfile\fP"
michael@0: .RE
michael@0: .PP
michael@0: Display the names of the Unicode code points in a UTF-8 file:
michael@0: 
michael@0: .RS 4
michael@0: .B \fR$ \fPuconv \-f utf-8 \-x any-name \fIfile\fP
michael@0: .RE
michael@0: .PP
michael@0: Print the name of a Unicode code point whose value is known (\fBU+30AB\fP
michael@0: in this example):
michael@0: 
michael@0: .RS 4
michael@0: .B \fR$ \fPecho '\eu30ab' | uconv \-x 'hex-any; any-name'; echo
michael@0: .br
michael@0: {KATAKANA LETTER KA}{LINE FEED}
michael@0: .br
michael@0: $ 
michael@0: .RE
michael@0: 
michael@0: (The names are delimited by curly braces.
michael@0: The name of the line terminator is also displayed.)
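.PP
For comparison, a similar lookup can be sketched with Python's
.B unicodedata
module. This is not how ICU's any-name transliterator is implemented,
and Python's name table has no names for control characters, so the
hypothetical
.B any_name
function below falls back to a U+hhhh label for them.

```python
import unicodedata


def any_name(text: str) -> str:
    """Replace each character with its Unicode name in curly braces."""
    return "".join(
        "{%s}" % unicodedata.name(ch, "U+%04X" % ord(ch)) for ch in text
    )


if __name__ == "__main__":
    print(any_name("\u30ab"))
```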
michael@0: .PP
michael@0: Normalize UTF-8 data using Unicode NFKC, remove all control characters,
michael@0: and map Katakana to Hiragana:
michael@0: 
michael@0: .RS 4
michael@0: .B \fR$ \fPuconv \-f utf-8 \-t utf-8 \e
michael@0: .br
michael@0: .B "      \-x '::nfkc; [:Cc:] >; ::katakana-hiragana;'"
michael@0: .RE
michael@0: .SH CAVEATS AND BUGS
michael@0: .B uconv
michael@0: reports errors as occurring at the first invalid byte
michael@0: encountered. This may be confusing to users of GNU
michael@0: .BR iconv (1),
michael@0: which reports errors as occurring at the first byte of an invalid
michael@0: sequence. For multi-byte character sets or encodings, this means that
michael@0: .BR uconv
michael@0: error positions may be at a later offset in the input stream than
michael@0: would be the case with GNU
michael@0: .BR iconv (1).
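.PP
The difference between the two offsets can be illustrated with Python's
UTF-8 decoder, which exposes both the start of the offending sequence
and the position just past the bytes it rejected. Treating these as
analogous to the two reporting conventions is an assumption of this
sketch; the function name is hypothetical.

```python
def error_offsets(data: bytes, encoding: str = "utf-8"):
    """Return (start, end) of the first undecodable span, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start, exc.end
    return None


if __name__ == "__main__":
    # 0xE3 opens a three-byte sequence, but "x" is not a continuation byte.
    print(error_offsets(b"ab\xe3\x81xyz"))
```

In this input the invalid sequence begins at offset 2, while the byte
that reveals the problem is at offset 4.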
michael@0: .PP
michael@0: The reporting of error positions when a transliterator is used may be
michael@0: inaccurate or unavailable, in which case
michael@0: .BR uconv
michael@0: will report the offset in the output stream at which the error
michael@0: occurred.
michael@0: .SH AUTHORS
michael@0: Jonas Utterstroem
michael@0: .br
michael@0: Yves Arrouye
michael@0: .SH VERSION
michael@0: @VERSION@
michael@0: .SH COPYRIGHT
michael@0: Copyright (C) 2000-2005 IBM, Inc. and others.
michael@0: .SH SEE ALSO
michael@0: .BR iconv (1)