Description

libmbfl is a streamable multibyte character code filter and converter library.
pymbfl provides a way to use the libmbfl C API through the ctypes module.

Download

git clone https://github.com/moriyoshi/libmbfl

Compile

./buildconf
./configure
make
make install

Utilities

One may need to modify CONST_LIBPATH so that C programs can find libmbfl.so.
The Python code, however, determines the path to libmbfl.so dynamically.
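
As a rough illustration of dynamic lookup, the sketch below locates the library with the standard ctypes.util.find_library call. It is not necessarily how pymbfl does it; the helper name load_libmbfl and the short library name "mbfl" are assumptions made for this example.

```python
import ctypes
import ctypes.util


def load_libmbfl(name="mbfl"):
    """Return a ctypes handle for libmbfl, or None if it cannot be located.

    load_libmbfl is a hypothetical helper, not part of pymbfl.
    """
    # find_library searches the platform's usual locations and returns
    # a file name or path, or None when nothing matches.
    path = ctypes.util.find_library(name)
    if path is None:
        return None
    try:
        return ctypes.CDLL(path)
    except OSError:
        # Found by name but not loadable (wrong architecture, missing deps).
        return None


lib = load_libmbfl()
print("libmbfl loaded" if lib is not None else "libmbfl not found")
```

If the library was installed under a prefix the dynamic linker does not search, the lookup returns None and the caller has to fall back to an explicit path.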

mbfl

Runs several small tests to check that libmbfl works correctly.
This program performs the same tests as pymbfl.py.

Documentation

libmbfl has no documentation, so the basic functionality is described here.
Since it is not necessary to go into details, I'll describe only the parts of libmbfl we need.

libmbfl

libmbfl represents encodings with two types: enum mbfl_no_encoding and mbfl_encoding.
As the name suggests, the first one is just an enumeration key. It can be retrieved by name using mbfl_name2no_encoding:

MBFLAPI extern enum mbfl_no_encoding mbfl_name2no_encoding(const char *name);

If one needs the reverse conversion, there is mbfl_no_encoding2name:

MBFLAPI extern const char * mbfl_no_encoding2name(enum mbfl_no_encoding no_encoding);
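
The two prototypes above could be declared through ctypes roughly as follows. This is a minimal sketch: the helper names declare_name_lookups and round_trip are hypothetical, and the enum is assumed to have the size of a C int.

```python
import ctypes


def declare_name_lookups(lib):
    """Attach argument and return types to the two name/number lookups."""
    # enum mbfl_no_encoding is treated as a C int here (an assumption
    # that holds on common ABIs).
    lib.mbfl_name2no_encoding.argtypes = [ctypes.c_char_p]
    lib.mbfl_name2no_encoding.restype = ctypes.c_int
    lib.mbfl_no_encoding2name.argtypes = [ctypes.c_int]
    # With c_char_p as restype, ctypes returns bytes (or None for NULL).
    lib.mbfl_no_encoding2name.restype = ctypes.c_char_p
    return lib


def round_trip(lib, name):
    """Map a name to its enumeration key and back to a name."""
    number = lib.mbfl_name2no_encoding(name.encode("ascii"))
    canonical = lib.mbfl_no_encoding2name(number)
    return number, (canonical.decode("ascii") if canonical else None)
```

Here lib is expected to be a ctypes.CDLL handle for libmbfl.so. Declaring argtypes and restype up front keeps ctypes from falling back to its default int return type, which would truncate the returned pointer on 64-bit platforms.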

The second type is more complicated, being a struct with several members:

typedef struct _mbfl_encoding {
    enum mbfl_no_encoding no_encoding;
    const char *name;
    const char *mime_name;
    const char *(*aliases)[];
    const unsigned char *mblen_table;
    unsigned int flag;
} mbfl_encoding;

The first member is the enumeration key, so there is nothing more to explain here.
The next three members, name, mime_name and aliases, hold the names under which the encoding can be found.
These fields are used by various libmbfl functions:

MBFLAPI extern const mbfl_encoding * mbfl_name2encoding(const char *name);
MBFLAPI extern const mbfl_encoding ** mbfl_get_supported_encodings();
MBFLAPI extern const char * mbfl_no2preferred_mime_name(enum mbfl_no_encoding no_encoding);
MBFLAPI extern int mbfl_is_support_encoding(const char *name);
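
For reference, the mbfl_encoding struct could be mirrored in ctypes along the following lines. This is a sketch, not pymbfl's actual definition: the class name MbflEncoding and the helper alias_list are hypothetical, and the aliases array is assumed to be NULL-terminated.

```python
import ctypes


class MbflEncoding(ctypes.Structure):
    """ctypes mirror of the mbfl_encoding struct (field order matters)."""
    _fields_ = [
        ("no_encoding", ctypes.c_int),  # enum mbfl_no_encoding, assumed int-sized
        ("name", ctypes.c_char_p),
        ("mime_name", ctypes.c_char_p),
        # const char *(*aliases)[]: read here as a pointer to the first
        # element of an array of C strings.
        ("aliases", ctypes.POINTER(ctypes.c_char_p)),
        ("mblen_table", ctypes.POINTER(ctypes.c_ubyte)),
        ("flag", ctypes.c_uint),
    ]


def alias_list(encoding):
    """Collect aliases as str, stopping at the assumed NULL terminator."""
    result = []
    if encoding.aliases:  # a NULL pointer is falsy
        i = 0
        while encoding.aliases[i] is not None:
            result.append(encoding.aliases[i].decode("ascii"))
            i += 1
    return result
```

With a loaded library, setting the restype of mbfl_name2encoding to ctypes.POINTER(MbflEncoding) would let the returned pointer be dereferenced with .contents.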

The next member, mblen_table, varies between encodings; it is usually NULL, but several encodings use it, e.g. CP949.
This table is used by other libmbfl functions that are not discussed here, since they are not relevant to our task.

The last member, flag, is a set of bit flags. It stores various properties of the encoding, e.g. whether the character set is SBCS or MBCS.
It also shows whether the character set uses shift sequences, etc.
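
Decoding the flag member amounts to testing individual bits. In the sketch below the constant values are assumptions believed to match libmbfl's mbfl_consts.h and should be checked against the header actually in use; describe_flag is a hypothetical helper.

```python
# Bit values assumed from libmbfl's mbfl_consts.h; verify against your copy.
MBFL_ENCTYPE_SBCS = 0x00000001      # single-byte character set
MBFL_ENCTYPE_MBCS = 0x00000002      # multi-byte character set
MBFL_ENCTYPE_SHFTCODE = 0x00001000  # encoding uses shift sequences


def describe_flag(flag):
    """Return the properties discussed above as a dict of booleans."""
    return {
        "sbcs": bool(flag & MBFL_ENCTYPE_SBCS),
        "mbcs": bool(flag & MBFL_ENCTYPE_MBCS),
        "shift": bool(flag & MBFL_ENCTYPE_SHFTCODE),
    }


print(describe_flag(MBFL_ENCTYPE_MBCS | MBFL_ENCTYPE_SHFTCODE))
```

Other bits may be set in real flag values; this helper simply ignores them.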

libmbfl has proved almost unusable for detecting single-byte encodings (as have all detectors that don't work on language data). We warned you.

pymbfl

The pymbfl interface currently defines two public classes, used to interact with libmbfl's encodings and detection algorithms.
These classes are called (what a surprise!) pymbfl.Encoding and pymbfl.Detector respectively.

pymbfl.Encoding exposes all the information provided by the mbfl_encoding structure.
pymbfl.Detector receives an iterable object that yields encodings as strings or as pymbfl.Encoding instances.
If pymbfl.Detector doesn't receive a sequence of encodings, it tries every possible multi-byte encoding.