libmbfl is a streamable multibyte character code filter and converter library.
pymbfl provides a way to use libmbfl C API through ctypes module.
git clone https://github.com/moriyoshi/libmbfl
./buildconf ./configure make make install
One may need to modify CONST_LIBPATH to let C programs find libmbfl.so.
Python determines libmbfl.so path dynamically though.
Run several small tests to check if libmbfl works well.
This program is used to perform the same tests pymbfl.py does.
libmbfl has no documentation, so the basic functionality is described here.
Since it is not necessary to go into details, I'll describe only parts of libmbfl we need.
libmbfl supports two categories for encodings: enum mbfl_no_encoding
and mbfl_no_encoding
.
As one may see, the first one is just an enumeration key. It can be retrieved using mbfl_name2no_encoding
:
MBFLAPI extern enum mbfl_no_encoding mbfl_name2no_encoding(const char *name);
MBFLAPI extern const char * mbfl_no_encoding2name(enum mbfl_no_encoding no_encoding);
The second type is more complicated, being a struct with several members:
typedef struct _mbfl_encoding { enum mbfl_no_encoding no_encoding; const char *name; const char *mime_name; const char *(*aliases)[]; const unsigned char *mblen_table; unsigned int flag; } mbfl_encoding;
The first member corresponds to enumeration key, so there is nothing to explain here.
The three next members, name
, mime_name
and aliases
, represent names under which encoding can be found.
These fields are used in various functions, available in libmbfl.
MBFLAPI extern const mbfl_encoding * mbfl_name2encoding(const char *name); MBFLAPI extern const mbfl_encoding ** mbfl_get_supported_encodings(); MBFLAPI extern const char * mbfl_no2preferred_mime_name(enum mbfl_no_encoding no_encoding); MBFLAPI extern int mbfl_is_support_encoding(const char *name);
The next member, mblen_table
, varies between encodings; most often it is NULL, but several encodings use it, e.g. CP949.
This table is used in other libmbfl functions which are not discussed here (since they're inappropriate to our task).
The last member is flag
structure. This member stores different properties, e.g. it shows if character set is SBCS or MBCS.
It also shows if character set uses shifts, etc.
libmbfl has proved it's almost unusable if used in determination of single-byte encodings (as well as all detectors which don't work on language data). We warned you, LOL.
pymbfl interface currently defines two public classes, used to interact with libmbfl's encodings and detection argorithms.
These classes are called (what a surprize!) pymbfl.Encoding
and pymbfl.Detector
respectively.
pymbfl.Encoding
can tell all the necessary information that is provided in mbfl_encoding
structure.pymbfl.Detector
receives an iterable object that yields encodings as strings or as pymbfl.Encoding
instances.
If pymbfl.Detector
doesn't recieve encodings sequence, it tries to determine every possible multi-byte encoding.