Chinese character pattern data

The pattern data contains the precise definition of which strokes are required to match a given character. This data is the core of Tibi's character recognition algorithm. The match quality is determined by the presence of all strokes with their defined directions, the correctness of stroke order, the correctness of stroke direction, and the length of unmatched redundant strokes.

The current version of the pattern data can be downloaded here (utf8).

Format Documentation

The format is designed to be compact and quickly editable.

Rules

Rules assign a pattern to a character. If the pattern is matched, the character will be presented to the user.

Character : Pattern ;

Furthermore, there are invisible rules. These are patterns that do not correspond to any character but do occur as components of more complex characters. Invisible rules are covered in more detail in the section on macros.

{ Name } : Pattern ;

Comments

Comments start with a percent sign (%) and last to the end of the line.
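
The rule and comment syntax described so far can be read with very little code. The following Python sketch is not Tibi's parser; `parse_rules` is a hypothetical helper that assumes "%" never occurs inside a pattern and that ";" only ever terminates a rule.

```python
def parse_rules(text):
    """Split pattern data into (name, invisible, pattern) tuples."""
    # Drop everything from '%' to the end of each line.
    stripped = "\n".join(line.split("%", 1)[0] for line in text.splitlines())
    rules = []
    for chunk in stripped.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        # The first colon separates the rule head from the pattern;
        # any later colons belong to locator ranges.
        head, pattern = chunk.split(":", 1)
        head = head.strip()
        # Heads of the form { Name } mark invisible rules.
        invisible = head.startswith("{") and head.endswith("}")
        name = head[1:-1].strip() if invisible else head
        rules.append((name, invisible, pattern.strip()))
    return rules
```

Splitting on the first colon only is deliberate, because locator ranges such as 3:7 also contain colons.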

Patterns

Patterns define a character, or a part of a character, and the way it is correctly written.

Characters as Patterns

Characters defined elsewhere can be used as patterns to build up a more complex character.

Orientations

Orientations are the most primitive form of pattern. The following orientations are allowed:

Pattern Description
E East
NE North east
N North
NW North west
W West
SW South west
S South
SE South east

Orientation ranges

Multiple orientations can be combined with a star "*". The resulting pattern is matched if any of the provided orientations is found.

Example of a stroke that goes right, up-right, or up:

E*NE*N
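
To make the idea concrete, here is a small Python sketch of how a stroke direction could be tested against an orientation range. How Tibi actually maps a drawn stroke to an orientation is not documented here, so the 45-degree sectors, the screen coordinates with y growing downward, and the helper names `orientation_of` and `matches_range` are all assumptions.

```python
import math

# Counter-clockwise from east, one entry per 45-degree sector.
ORIENTATIONS = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]

def orientation_of(dx, dy):
    """Quantize a direction vector to one of the eight orientations."""
    # Assumption: y grows downward (screen coordinates), so negate dy
    # to obtain the usual mathematical angle.
    angle = math.degrees(math.atan2(-dy, dx)) % 360
    # Shift by half a sector so that each orientation is centered.
    return ORIENTATIONS[int((angle + 22.5) // 45) % 8]

def matches_range(orientation_range, dx, dy):
    """Check a direction against a range such as 'E*NE*N'."""
    return orientation_of(dx, dy) in orientation_range.split("*")
```

With these assumptions, `matches_range("E*NE*N", 0, -1)` accepts an upward stroke, while a downward stroke with `dy = 1` is rejected.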

Connected orientations

If an orientation or orientation range is prefixed with a minus sign "-", there must be a stroke connecting it to the previous orientation.

A correct stroke sequence for the character 乙 is:

E-SW-SE-E-NE
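
A sequence like this can be broken into steps, each with its allowed orientations and a flag saying whether it is connected to the previous step. The Python sketch below is only an illustration: `parse_strokes` is a hypothetical helper that handles flat sequences and ignores groups, locators, and macros.

```python
import re

VALID = {"E", "NE", "N", "NW", "W", "SW", "S", "SE"}

def parse_strokes(pattern):
    """Return a list of (allowed_orientations, connected) steps."""
    steps = []
    # A token is an optional '-' followed by orientations joined by '*'.
    for token in re.findall(r"-?[A-Z*]+", pattern):
        connected = token.startswith("-")
        options = token.lstrip("-").split("*")
        unknown = [o for o in options if o not in VALID]
        if unknown:
            raise ValueError(f"unknown orientation(s): {unknown}")
        steps.append((options, connected))
    return steps
```

For the 乙 sequence above, the first step is unconnected and the four following steps are each marked as connected to their predecessor.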

Groups

Parentheses can be used to group multiple patterns together. The importance of groups will become clear in the subsequent section on locators.

A possible group definition for the strokes of the character 口 is given below. However, without locators, completely different shapes will be matched too.

(S (E-S) (E))

Locators

Locators define the relative positions of individual strokes. Checking the order of orientations alone is not sufficient; the graphical position is also needed to recognize the written character correctly.

Locators are appended directly after a pattern and have the following forms. The name is required to identify corresponding locators. Locators with the same name must have overlapping ranges. A range is defined relative to the pattern's extent. Zero is the pattern's leftmost or topmost position. Ten is its rightmost or bottommost position. If the minimum or maximum value is omitted, 0 or 10 respectively is assumed. If only a single number is given, the range contains only that one point.

[ Name xmin : xmax , ymin : ymax ]
[ Name x, y ]
[ Name ]

Example of a 口 with locators. Each stroke must pass at least through the center of the corresponding edge of the bounding box.

口 : (S[left] E[up]-S[right] E[down]) [left 0,5][up 5,0][right 10,5][down 5,10];
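
The locator forms and the overlap rule can be sketched in Python as follows. This is an illustration only, not Tibi's implementation. The helper names are hypothetical, and two points are assumptions: a bare [ Name ] is taken to cover the full range on both axes, and ranges are compared after mapping them from the 0 to 10 scale onto each pattern's bounding box. Ranges whose second component is smaller than the first are not handled.

```python
import re

def parse_range(text):
    """Turn 'min:max', ':', or a single number into a (min, max) pair."""
    text = text.strip()
    if ":" not in text:
        point = float(text)
        return (point, point)  # a single number is a one-point range
    lo, hi = text.split(":", 1)
    # Omitted bounds default to 0 and 10.
    return (float(lo) if lo.strip() else 0.0,
            float(hi) if hi.strip() else 10.0)

def parse_locator(text):
    """Parse '[name x, y]' or '[name]' into (name, x_range, y_range)."""
    match = re.fullmatch(r"\[\s*(\S+)\s*(.*?)\s*\]", text.strip())
    if match is None:
        raise ValueError(f"not a locator: {text!r}")
    name, ranges = match.groups()
    if not ranges:
        # Assumption: a bare name covers the whole extent on both axes.
        return (name, (0.0, 10.0), (0.0, 10.0))
    x_text, y_text = ranges.split(",", 1)
    return (name, parse_range(x_text), parse_range(y_text))

def to_absolute(rng, start, end):
    """Map a 0..10 range onto the span [start, end] of a bounding box."""
    scale = (end - start) / 10.0
    return (start + rng[0] * scale, start + rng[1] * scale)

def overlaps(a, b):
    """True if two (min, max) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

For instance, `parse_locator("[left 0,5]")` yields a single point at the middle of the left edge, which is the position the first stroke of 口 is checked against in the rule above.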

When the second component of the range is smaller than the first, the range must be completely contained instead of just overlapping.
We can define 中 as a 口 whose horizontal 30%-70% part must be contained in a vertical stroke's 20%-80% part.

中 : 口[x 3:7,:] S[x :,2:8];

Macros

Macros are defined as invisible rules. They are called with curly braces that enclose the macro's name and its arguments.

The macro 上下 defines an up-down relationship between its two arguments. It can be called to define 否:

否 : {上下 不 口};

A macro definition can access its arguments by number in curly braces, e.g. {1} and {2}.

{上下} : {1}[x :,8:20] {2}[x 5,-20:2];
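
One way to picture a macro call is as textual substitution of the arguments into the macro body. Whether Tibi expands macros this way is not stated here, so the Python sketch below is an assumption. `expand_macros` is a hypothetical helper that expects whitespace-separated arguments without nested braces.

```python
import re

def expand_macros(pattern, macros):
    """Replace {name arg1 arg2 ...} calls with the expanded macro body."""
    def expand_call(match):
        parts = match.group(1).split()
        if not parts or parts[0] not in macros:
            # Leave unknown names and {1}-style references untouched.
            return match.group(0)
        name, args = parts[0], parts[1:]
        # Substitute {1}, {2}, ... with the corresponding argument.
        return re.sub(r"\{(\d+)\}",
                      lambda m: args[int(m.group(1)) - 1],
                      macros[name])
    return re.sub(r"\{([^{}]+)\}", expand_call, pattern)
```

Under this reading, the call {上下 不 口} becomes 不[x :,8:20] 口[x 5,-20:2], which places 不 above 口 through the shared locator x. A call with too few arguments raises an IndexError in this sketch.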

Conclusion

The pattern definition is specified in a simple bounding-box-based language that allows the definition of high-level positional relations.