.\"-- << kanji code converter >> ----
.\"
.\" kcc.jman
.\" Aug 24 1992
.\" mod: Nov 19 1992
.\" translated: Oct 15 1999
.\"------------------------------------------------ tonooka ------------
.\" @(#)kcc.jman 2.1 (Y.Tonooka) 9/6/93
.TH KCC L "November 19, 1992" "Y. Tonooka"
.SH NAME
kcc \- kanji code converter with automatic encoding detection
.SH SYNOPSIS
.B kcc
[
.BI \- IO chnvxz
] [
.BI \-b " bufsize"
]
.RI [ " file " "] .\|.\|."
.SH DESCRIPTION
.B kcc
is a filter that reads each
.I file
sequentially, converts its kanji encoding, and writes the result to stdout.
If no file is specified, or if
.B \-
is given as a filename, it reads from stdin.
You can specify the kanji encodings for input and output. However,
.B kcc
detects the input encoding automatically if you don't specify one.
.LP
Available kanji encodings are
\s-1JIS\s0 (7 bit and/or 8 bit), Shift
\s-1JIS\s0, \s-1EUC\s0, and \s-1DEC\s0.
Input may mix encodings, provided the mix pairs one of \s-1EUC\s0,
\s-1DEC\s0, or Shift
.SM JIS
with 7 bit
.SM JIS.
Both
.BR \s-1SI\s0 / \s-1SO\s0
and
.B \s-1ESC\s0(I
are recognized as \s-1JIS\s0 halfwidth kana designations.
.SH OPTIONS
.TP
.BI \- O
.PD 0
.TP
.BI \- IO
.IR I " specifies the input kanji encoding and " O
specifies the output kanji encoding. When no input encoding is specified,
it is detected automatically; when neither the input nor the output
encoding is specified, the output encoding is 7 bit
.SM JIS.
.PD
.IP
You can specify one of the following for the input encoding option,
.IR I .
.LP
.RS 10
.PD 0
.TP
.B e
.SM EUC
(may be mixed with 7 bit \s-1JIS\s0)
.TP
.B d
.SM DEC
(may be mixed with 7 bit \s-1JIS\s0)
.TP
.B s
Shift
.SM JIS
(may be mixed with 7 bit \s-1JIS\s0)
.TP
.BR j ", " 7 " or " k
7 bit
.SM JIS
.TP
.B 8
8 bit
.SM JIS
.PD
.RE
.IP
You can specify one of the following for the output encoding option,
.IR O .
.LP
.RS 10
.PD 0
.TP
.B e
.SM EUC
.TP
.B d
.SM DEC
.TP
.B s
Shift
.SM JIS
.TP
.BR j\fIXY " or " 7\fIXY
7 bit
.RB "\s-1JIS\s0 (using " \s-1SI\s0 / \s-1SO\s0
for
.SM JIS
kana designation)
.TP
.BI k XY
7 bit
.RB "\s-1JIS\s0 (using " \s-1ESC\s0(I
for
.SM JIS
kana designation)
.TP
.BI 8 XY
8 bit
.SM JIS
.PD
.RE
.IP
With
.I XY
in the
.I O
option, you can specify which escape sequences are used in \s-1JIS\s0 encoding.
.B BJ
is the default. The supplemental kanji designation is fixed to
.BR \s-1ESC\s0$(D .
.LP
.RS 10
.PD 0
.TP
.I X
Kanji are designated by:
.RS 5
.TP
.B B
.BR \s-1ESC\s0$B " (JIS X0208-1983)"
.TP
.B @
.BR \s-1ESC\s0$@ " (JIS X0208-1978)"
.TP
.B +
.BR \s-1ESC\s0&@\s-1ESC\s0$B " (JIS X0208-1990)"
.RE
.TP
.I Y
Alphanumerics are designated by:
.RS 5
.TP
.B B
.BR \s-1ESC\s0(B " (ASCII)"
.TP
.B J
.BR \s-1ESC\s0(J " (JIS Roman; JIS X0201)"
.TP
.B H
.BR \s-1ESC\s0(H " (Swedish; strongly deprecated)"
.PD
.RE
.RE
.TP
.B \-v
Output the result of input encoding detection to stderr.
.TP
.B \-x
Extension mode. During auto detection of the input encoding, also recognize
user-defined characters and extended character regions (the
out-of-range area of \s-1EUC\s0, undefined halfwidth kana, the control character
.SM C1
area, and the extended character region of Shift
\s-1JIS\s0). \s-1DEC\s0 and
.SM EUC
are distinguished from each other in this mode.
.TP
.B \-z
Shrink mode. Halfwidth kana (except in 7 bit
.SM JIS
) are not recognized during input encoding detection.
For files without halfwidth kana, this option makes auto detection of the
input encoding much more accurate.
.TP
.B \-h
Normally, halfwidth kana become fullwidth katakana when converted to
.SM DEC.
With this option, they become hiragana.
.TP
.B \-n
User-defined characters, extended characters, and supplemental kanji
characters are converted to a fullwidth white box, and the undefined region of
halfwidth kana is converted to a halfwidth centered dot.
.TP
.BI \-b " bufsize"
Specify the buffer size. The default is 8 kbytes.
.TP
.B \-c
Don't convert; only check the input encoding and print the result to stdout.
Unlike normal auto detection, the whole contents of the file are checked.
However, when an inconsistency of encodings is found, reading is aborted and
"data" is printed. Options other than \fB\-x\fR and \fB\-z\fR are ignored.
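.LP
As an illustration of the designations selected by
.IR XY ,
the following Python sketch lists the escape sequences found in a 7 bit
\s-1JIS\s0 byte string. It is not part of
.BR kcc .
It assumes Python's
.B iso2022_jp
codec as a stand-in, which designates kanji with
.B \s-1ESC\s0$B
and alphanumerics with
.B \s-1ESC\s0(B
(the combination written
.B BB
in the tables above, not the
.B BJ
default). The helper name
.B designations
is hypothetical, and it handles only three byte sequences.
.LP
.nf
```python
ESC = bytes([0x1B])

def designations(data):
    """Return the three byte escape sequences found in data, in order."""
    found = []
    i = 0
    while i < len(data):
        if data[i:i + 1] == ESC:
            # ESC $ <final> designates kanji, ESC ( <final> alphanumerics.
            seq = data[i:i + 3]
            found.append("ESC" + seq[1:].decode("ascii"))
            i += 3
        else:
            i += 1
    return found

# Two kanji (U+6F22, U+5B57) between runs of ASCII text.
kanji = chr(0x6F22) + chr(0x5B57)
encoded = ("abc" + kanji + "def").encode("iso2022_jp")
print(designations(encoded))
```
.fi
.LP
For this sample the sketch should report a kanji designation before the
two kanji and a return to alphanumerics after them.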
.SH EXAMPLES
.IP "\fB% kcc \-e \fIfile\fR"
The input encoding is detected automatically, and the output is in
.SM EUC
encoding.
.IP "\fB% kcc \-sj \fIfile1 file2\fR"
Two files in Shift
.SM JIS
are concatenated and converted to
.SM JIS.
.IP "\fB% \fIcommand\fB | kcc \-k+J\fR"
The output of
.I command
is converted to 7 bit
\s-1JIS\s0 (kanji in JIS X0208,
alphanumerics in JIS Roman, and halfwidth kana designated by \fB\s-1ESC\s0(I\fR).
.IP "\fB% kcc \-c \fIfile\fR"
The encoding of the contents of
.I file
is detected (no conversion).
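.LP
When
.B kcc
is driven from a script, the option letters used in the examples above
can be assembled and checked before the command is run.
The following Python sketch only builds an argument list; the function name
.B kcc_args
and its parameters are hypothetical, and it assumes that
.I XY
is given either as a full two letter pair or not at all.
.LP
.nf
```python
INPUT_CODES = set("eds7jk8")
OUTPUT_CODES = set("eds7jk8")
KANJI_FINALS = set("B@+")      # X: kanji designation
ALNUM_FINALS = set("BJH")      # Y: alphanumeric designation

def kcc_args(output, input_code="", xy="", files=()):
    """Build an argument list such as ["kcc", "-sj", "file1", "file2"]."""
    if input_code and input_code not in INPUT_CODES:
        raise ValueError("unknown input encoding: " + input_code)
    if output not in OUTPUT_CODES:
        raise ValueError("unknown output encoding: " + output)
    if xy:
        # XY only makes sense for the JIS output encodings.
        if output not in set("7jk8"):
            raise ValueError("XY applies only to JIS output")
        if len(xy) != 2 or xy[0] not in KANJI_FINALS or xy[1] not in ALNUM_FINALS:
            raise ValueError("bad XY designation: " + xy)
    return ["kcc", "-" + input_code + output + xy] + list(files)

print(kcc_args("j", input_code="s", files=["file1", "file2"]))
```
.fi
.LP
The resulting list could be passed to a process launcher such as Python's
.BR subprocess.run ,
provided
.B kcc
is installed.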
.SH BUGS
Auto detection of the input encoding works well in normal cases, but
it has the following problems.
.LP
7 bit
.SM JIS
is recognized with certainty from its escape sequences.
\s-1EUC\s0 and
.SM DEC
are treated as the same (referred to as the
.SM EUC
series). Halfwidth kana of 8 bit
.SM JIS
are the same as halfwidth kana of Shift
.SM JIS
(referred to as the Shift
.SM JIS
series). However, the
.SM EUC
series and the Shift
.SM JIS
series, which are both 8 bit encodings, share wide regions of byte values.
The real problem in auto detection is therefore telling these two apart.
.LP
Detection of the
.SM EUC
series versus the Shift
.SM JIS
series is done line by line. As soon as a line shows that the input is not Shift
.SM JIS
series, or that it is not \s-1EUC\s0 series, the encoding is determined.
When an inconsistency is found, the input is treated as "data" and
the contents of the output are not guaranteed.
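.LP
The elimination described above can be sketched in Python as follows.
This is an illustration only, not the algorithm used by
.BR kcc :
it assumes Python's
.B euc_jp
and
.B shift_jis
codecs as validity checks, ignores 7 bit
.SM JIS
escape sequences, and the names
.B decodes
and
.B classify
are hypothetical.
.LP
.nf
```python
def decodes(line, codec):
    """True if the byte string is valid in the given codec."""
    try:
        line.decode(codec)
        return True
    except UnicodeDecodeError:
        return False

def classify(lines):
    """Return "ascii", "euc", "sjis", "undetermined" or "data"."""
    may_be_euc = may_be_sjis = True
    seen_8bit = False
    for line in lines:
        if all(b < 0x80 for b in line):
            continue                      # 7 bit lines carry no evidence
        seen_8bit = True
        may_be_euc = may_be_euc and decodes(line, "euc_jp")
        may_be_sjis = may_be_sjis and decodes(line, "shift_jis")
        if not may_be_euc and not may_be_sjis:
            return "data"                 # inconsistent input
        if may_be_euc != may_be_sjis:
            # One series has been ruled out, so the other is chosen.
            return "euc" if may_be_euc else "sjis"
    return "undetermined" if seen_8bit else "ascii"

sample = chr(0x6F22) + chr(0x5B57)        # two kanji
print(classify([b"header", sample.encode("euc_jp")]))
print(classify([sample.encode("shift_jis")]))
```
.fi
.LP
Because the sketch stops at the first decisive line, a later inconsistency
goes unnoticed, which mirrors the difference between normal detection and the
.B \-c
option.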
.LP
After an 8 bit code is found, and until the choice between the
.SM EUC
series and the Shift
.SM JIS
series is settled, conversion is suspended and the input
data is held in a buffer. If the buffer fills up,
.B kcc
assumes the
.SM EUC
series and forces conversion to start. Rationale: documents containing kanji
can usually be assumed to include
.SM JIS
non-kanji or
.SM JIS
level 1 kanji, and for those characters Shift
.SM JIS
does not share a region with
.SM EUC,
so Shift \s-1JIS\s0 input can be detected with certainty.
So if the encoding can't be determined, it is very likely to be
.SM EUC.
.LP
If the input is 8 bit
.SM JIS
and its halfwidth kana always occur in runs of even length,
it will be wrongly detected as \s-1EUC\s0 kanji. Be careful.
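.LP
The ambiguity can be reproduced with two bytes.
The Python sketch below assumes the
.B shift_jis
and
.B euc_jp
codecs as stand-ins for the two readings (halfwidth kana use the same byte
values in 8 bit
.SM JIS
and Shift
.SM JIS
); the name
.B euc_plausible
is hypothetical.
.LP
.nf
```python
kana = bytes([0xB6, 0xC5])            # halfwidth KA, NA as single bytes

as_kana = kana.decode("shift_jis")    # read as two halfwidth katakana
as_euc = kana.decode("euc_jp")        # read as one unrelated kanji

def euc_plausible(data):
    """True if data is a valid EUC byte string."""
    try:
        data.decode("euc_jp")
        return True
    except UnicodeDecodeError:
        return False

print(len(as_kana), len(as_euc))
print(euc_plausible(kana), euc_plausible(kana + kana[:1]))
```
.fi
.LP
A run of even length pairs up into valid
.SM EUC
kanji, while an odd run leaves a dangling byte that rules
.SM EUC
out.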
.LP
If the input contains no halfwidth kana, use
.B \-z
and the accuracy of detection becomes much better.
This is because the shared region is then restricted to the area of
.SM JIS
level 2 kanji.
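.LP
This restriction can be checked with a short Python sketch.
It assumes that the second standard mentioned above means JIS X 0208
level 2 kanji (rows 48 to 84, with level 1 in rows 16 to 47), and it uses
Python's
.B euc_jp
and
.B shift_jis
codecs; the name
.B sjis_lead_bytes
is hypothetical.
.LP
.nf
```python
def sjis_lead_bytes(rows):
    """Shift JIS lead bytes used by the given JIS X 0208 rows."""
    leads = set()
    for row in rows:
        euc = bytes([0xA0 + row, 0xA1])    # first cell of the row, as EUC
        leads.add(euc.decode("euc_jp").encode("shift_jis")[0])
    return leads

EUC_RANGE = range(0xA1, 0xFF)              # byte values used by EUC kanji
level1 = sjis_lead_bytes(range(16, 48))    # rows 16 to 47
level2 = sjis_lead_bytes(range(48, 85))    # rows 48 to 84

shared1 = sorted(b for b in level1 if b in EUC_RANGE)
shared2 = sorted(b for b in level2 if b in EUC_RANGE)
print(shared1)
print([hex(b) for b in shared2])
```
.fi
.LP
No level 1 lead byte falls in the
.SM EUC
range, so once halfwidth kana are excluded only level 2 kanji in Shift
.SM JIS
can still be mistaken for
.SM EUC.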
.LP
The extended region of Shift
.SM JIS,
the user-defined area of \s-1EUC\s0, the control characters
.SM C1
of \s-1EUC\s0, and the undefined region of halfwidth kana of \s-1EUC\s0
are outside the range of auto detection, so detection fails
if the input contains these characters. Use the
.B \-x
option to select extension mode, or specify the input encoding explicitly.
.SH "SEE ALSO"
.BR cat (1)
.SH NOTES
Usually, user-defined characters, extended characters, and supplemental kanji
characters are each mapped to their counterparts. However, characters that are
out of range of the extended characters become FCFC in hexadecimal when
converted to Shift
.SM JIS.
Although the control character region
.SM C1
of
\s-1EUC\s0 and
.SM DEC
is preserved when converted to
.SM JIS,
it is deleted when converted to Shift
.SM JIS.
The undefined area of halfwidth kana becomes a halfwidth centered dot
when converted to Shift
.SM JIS.
Halfwidth kana become fullwidth kana when converted to
.SM DEC.
.LP
When the output is in
.SM JIS
encoding, control characters such as newline, TAB, and DEL, and white space
(halfwidth), are output in ASCII mode.
.LP
When the input encoding is detected wrongly, or the input contains characters
undefined in the expected character sets, the output is undefined.
.LP
This manual was translated by Fumitoshi UKAI <ukai@debian.or.jp>
for the Debian system, but you can use it for any purpose.