I am in the middle of creating an NFO viewer with Qt 4 (with no KDE dependencies) in my spare time, to replace my current viewer, which is written in Python for PyGtk+ (NFO View). A Qt 4 application starts up much faster than a PyGtk+ application, which is my main motivation for creating a new NFO viewer. Technically, everything should be faster, but it may not be noticeable. Still, I prefer a native Qt 4 app over a Gtk+ application (of any kind). I am a KDE person, after all.
One problem is the archaic CP437 (also known as IBM437) encoding still used in NFOs even today. This means every NFO viewer must either run on a system that can display this code page natively, with a font that has the correct characters for the ASCII drawings to look right, or convert the text to the system's native code page. Luckily, colours are no longer part of the scheme.
Qt 4 handles all text as Unicode (QString stores UTF-16 internally), regardless of the file's actual format. It makes no attempt to work out that a file is CP437, and NFOs never have a header, as they are plain text files with a special code page. This leaves the programmer with one option: convert the CP437 characters to their Unicode equivalents. Luckily, the characters of CP437 all have a place in Unicode.
What does Qt 4 provide for text conversion? QTextCodec! Oh wait, no CP437 support? Shift-JIS yet no CP437 support! I know not of a country that uses code page 437, but I know a small group of users on the Internet use it. So come on Trolltech! Add the support already.
Instead of whining to Nokia, we can follow the QTextCodec documentation, which explains how to create our own codec, just as Trolltech implemented the built-in ones.
A Google search eventually led me to one page: OpenMoko's implementation of a QTextCodec class that converts back and forth between CP437 and Unicode. It cannot simply be copied and pasted, as it has a few dependencies. We could copy those too, but that would just be wasted lines of code for the preprocessor to look through. Also be aware of the licence at the top of the code: GPLv2. You should know what this means, and if you do not, see the GPL FAQ. Simply put, if you distribute a derivative work of this code, you must make its source available under the GPL as well. This may have changed now that Qt 4.5 is available under the LGPL, so check the current licence. Be careful if you decide to keep your implementation closed.
So, let's adapt the source code to our needs. In the original source, we can find the original class. Let's follow Qt's documentation. It says we need to have the following method declarations (direct from the documentation):
name() - Returns the official name for the encoding. If the encoding is listed in the IANA character-sets encoding file, the name should be the preferred MIME name for the encoding.
aliases() - Returns a list of alternative names for the encoding. QTextCodec provides a default implementation that returns an empty list. For example, "ISO-8859-1" has "latin1", "CP819", "IBM819", and "iso-ir-100" as aliases.
mibEnum() - Return the MIB enum for the encoding if it is listed in the IANA character-sets encoding file.
convertToUnicode() - Obviously folks...
convertFromUnicode() - ...
The IANA (Internet Assigned Numbers Authority) have created a standard file with all aliases and names for numerous code pages. If you are going to implement any of them, you should use this document for naming.
Our class is identical to OpenMoko's:
// A forward declaration is not enough to inherit from QTextCodec;
// the full class definition is needed.
#include <QTextCodec>

class QCodePage437Codec : public QTextCodec {
public:
    QCodePage437Codec();
    ~QCodePage437Codec();
    QByteArray name() const;
    QList<QByteArray> aliases() const;
    int mibEnum() const;
protected:
    QString convertToUnicode(const char *in, int length, ConverterState *state) const;
    QByteArray convertFromUnicode(const QChar *in, int length, ConverterState *state) const;
};
This is better off in a header file, but that is not required of course.
Standard C++ stuff:
QCodePage437Codec::QCodePage437Codec() {
}
QCodePage437Codec::~QCodePage437Codec() {
}
Why bother? Because what if you do need something to happen later during construction or destruction?
Here is what the IANA document says about code page 437:
Name: IBM437 [RFC1345,KXS2]
MIBenum: 2011
Source: IBM NLS RM Vol2 SE09-8002-01, March 1990
Alias: cp437
Alias: 437
Alias: csPC8CodePage437
So, here is our name() method:
QByteArray QCodePage437Codec::name() const {
    return "IBM437";
}
And our aliases() method:
QList<QByteArray> QCodePage437Codec::aliases() const {
    QList<QByteArray> list;
    list << "CP437" << "cp437" << "437" << "csPC8CodePage437";
    return list;
}
Finally, our mibEnum() method:
int QCodePage437Codec::mibEnum() const {
    return 2011;
}
So those 3 functions were easy. Nothing hard; just follow some standard documentation. Next we need the methods that actually do the work: convertFromUnicode() and convertToUnicode(). Before those will work, you will need your look-up tables. These can be taken directly from the OpenMoko source.
This may not look like a look-up table, but it is used by the conversion methods:
static const char hexchars[] = "0123456789ABCDEF";
Here is our table to convert to Unicode (it must define all 256 entries, one for each possible byte value):
static ushort const cp437ToUnicode[256] =
{ 0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007,
0x0008, 0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x000e, 0x000f,
0x0010, 0x0011, 0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017,
0x0018, 0x0019, 0x001c, 0x001b, 0x007f, 0x001d, 0x001e, 0x001f,
0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027,
0x0028, 0x0029, 0x002a, 0x002b, 0x002c, 0x002d, 0x002e, 0x002f,
0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037,
0x0038, 0x0039, 0x003a, 0x003b, 0x003c, 0x003d, 0x003e, 0x003f,
0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047,
0x0048, 0x0049, 0x004a, 0x004b, 0x004c, 0x004d, 0x004e, 0x004f,
0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057,
0x0058, 0x0059, 0x005a, 0x005b, 0x005c, 0x005d, 0x005e, 0x005f,
0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067,
0x0068, 0x0069, 0x006a, 0x006b, 0x006c, 0x006d, 0x006e, 0x006f,
0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077,
0x0078, 0x0079, 0x007a, 0x007b, 0x007c, 0x007d, 0x007e, 0x001a,
0x00c7, 0x00fc, 0x00e9, 0x00e2, 0x00e4, 0x00e0, 0x00e5, 0x00e7,
0x00ea, 0x00eb, 0x00e8, 0x00ef, 0x00ee, 0x00ec, 0x00c4, 0x00c5,
0x00c9, 0x00e6, 0x00c6, 0x00f4, 0x00f6, 0x00f2, 0x00fb, 0x00f9,
0x00ff, 0x00d6, 0x00dc, 0x00a2, 0x00a3, 0x00a5, 0x20a7, 0x0192,
0x00e1, 0x00ed, 0x00f3, 0x00fa, 0x00f1, 0x00d1, 0x00aa, 0x00ba,
0x00bf, 0x2310, 0x00ac, 0x00bd, 0x00bc, 0x00a1, 0x00ab, 0x00bb,
0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561, 0x2562, 0x2556,
0x2555, 0x2563, 0x2551, 0x2557, 0x255d, 0x255c, 0x255b, 0x2510,
0x2514, 0x2534, 0x252c, 0x251c, 0x2500, 0x253c, 0x255e, 0x255f,
0x255a, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256c, 0x2567,
0x2568, 0x2564, 0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256b,
0x256a, 0x2518, 0x250c, 0x2588, 0x2584, 0x258c, 0x2590, 0x2580,
0x03b1, 0x00df, 0x0393, 0x03c0, 0x03a3, 0x03c3, 0x03bc, 0x03c4,
0x03a6, 0x0398, 0x03a9, 0x03b4, 0x221e, 0x03c6, 0x03b5, 0x2229,
0x2261, 0x00b1, 0x2265, 0x2264, 0x2320, 0x2321, 0x00f7, 0x2248,
0x00b0, 0x2219, 0x00b7, 0x221a, 0x207f, 0x00b2, 0x25a0, 0x00a0
};
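To make the lookup concrete, here is a minimal sketch in plain C++ with no Qt dependency. It assumes a hypothetical four-entry excerpt of the table above (bytes 0xB0 to 0xB3) rather than all 256 entries, and the names kShadeExcerpt, lookupShade and appendUtf8 are made up for illustration. It also shows why the index has to be masked or cast: char may be signed, so a byte such as 0xB0 would otherwise turn into a negative index.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical excerpt of cp437ToUnicode: CP437 bytes 0xB0..0xB3 only.
static const std::uint16_t kShadeExcerpt[4] = { 0x2591, 0x2592, 0x2593, 0x2502 };

// Look up one CP437 byte in the excerpt. The cast to unsigned char plays the
// same role as the "& 0xFF" in the full codec: it keeps the index
// non-negative even where plain char is signed.
std::uint16_t lookupShade(char byte) {
    const unsigned int index = static_cast<unsigned char>(byte);
    assert(index >= 0xB0 && index <= 0xB3);  // only the excerpt is covered
    return kShadeExcerpt[index - 0xB0];
}

// Append a Basic Multilingual Plane code point to a UTF-8 string
// (surrogates are not handled in this sketch).
void appendUtf8(std::string &out, std::uint16_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

Calling appendUtf8(s, lookupShade('\xB1')) appends the three UTF-8 bytes of U+2592 (medium shade) to s. In the Qt version none of the UTF-8 step is needed, since QChar takes the looked-up value directly.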
And the Unicode to CP437 conversion look-up table (in case you want to save back to an NFO file later):
static unsigned char const cp437FromUnicode[256] =
{ 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
0x18, 0x19, 0x7f, 0x1b, 0x1a, 0x1d, 0x1e, 0x1f,
0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27,
0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f,
0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37,
0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f,
0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47,
0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f,
0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57,
0x58, 0x59, 0x5a, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f,
0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67,
0x68, 0x69, 0x6a, 0x6b, 0x6c, 0x6d, 0x6e, 0x6f,
0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77,
0x78, 0x79, 0x7a, 0x7b, 0x7c, 0x7d, 0x7e, 0x1c,
'?' , '?' , '?' , '?' , '?' , '?' , '?' , '?' ,
'?' , '?' , '?' , '?' , '?' , '?' , '?' , '?' ,
'?' , '?' , '?' , '?' , '?' , '?' , '?' , '?' ,
'?' , '?' , '?' , '?' , '?' , '?' , '?' , '?' ,
0xff, 0xad, 0x9b, 0x9c, '?' , 0x9d, '?' , 0x15,
'?' , '?' , 0xa6, 0xae, 0xaa, '?' , '?' , '?' ,
0xf8, 0xf1, 0xfd, '?' , '?' , '?' , 0x14, 0xfa,
'?' , '?' , 0xa7, 0xaf, 0xac, 0xab, '?' , 0xa8,
'?' , '?' , '?' , '?' , 0x8e, 0x8f, 0x92, 0x80,
'?' , 0x90, '?' , '?' , '?' , '?' , '?' , '?' ,
'?' , 0xa5, '?' , '?' , '?' , '?' , 0x99, '?' ,
'?' , '?' , '?' , '?' , 0x9a, '?' , '?' , 0xe1,
0x85, 0xa0, 0x83, '?' , 0x84, 0x86, 0x91, 0x87,
0x8a, 0x82, 0x88, 0x89, 0x8d, 0xa1, 0x8c, 0x8b,
'?' , 0xa4, 0x95, 0xa2, 0x93, '?' , 0x94, 0xf6,
'?' , 0x97, 0xa3, 0x96, 0x81, '?' , '?' , 0x98
};
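Notice how not all characters map to CP437: every '?' entry marks a code point with no CP437 equivalent. Notice also that this table only covers U+0000 to U+00FF, so the box-drawing and block characters, which live above U+00FF, are not in it at all.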
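The reverse table stops at U+00FF, so characters such as the box-drawing block (U+2500 onwards) have no entry in it. One possible way to cover them, sketched here in plain C++ as an alternative and not something taken from the OpenMoko source, is to build the reverse mapping by inverting the forward table. The excerpt kForwardExcerpt and the names buildReverse and encodeOrQuestionMark are hypothetical; a real version would invert all 256 entries of cp437ToUnicode.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical excerpt of the forward table: CP437 bytes 0xD8..0xDF.
static const unsigned char kExcerptBase = 0xD8;
static const std::uint16_t kForwardExcerpt[8] = {
    0x256a, 0x2518, 0x250c, 0x2588, 0x2584, 0x258c, 0x2590, 0x2580
};

// Invert the forward excerpt: Unicode code point -> CP437 byte.
std::map<std::uint16_t, unsigned char> buildReverse() {
    std::map<std::uint16_t, unsigned char> reverse;
    for (std::size_t i = 0; i < 8; ++i) {
        reverse[kForwardExcerpt[i]] = static_cast<unsigned char>(kExcerptBase + i);
    }
    // Inversion only works if every forward entry is unique.
    assert(reverse.size() == 8);
    return reverse;
}

// Return the CP437 byte for a code point, or '?' when there is no mapping,
// mirroring the placeholder used in cp437FromUnicode.
unsigned char encodeOrQuestionMark(const std::map<std::uint16_t, unsigned char> &reverse,
                                   std::uint16_t codePoint) {
    const std::map<std::uint16_t, unsigned char>::const_iterator it = reverse.find(codePoint);
    return it == reverse.end() ? static_cast<unsigned char>('?') : it->second;
}
```

With the map from buildReverse(), U+2588 (full block) maps back to byte 0xDB, while a code point outside the excerpt comes back as '?'. A std::map is used here for brevity; a sorted array with binary search would do the same job.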
Now, we need our functions to use these tables and convert. Taken directly from the original source:
QString QCodePage437Codec::convertToUnicode(const char *in, int length, ConverterState *) const {
    QString str;
    int nibble = 0;
    int value = 0;
    int digit;
    if (length >= 6 &&
        in[0] == '8' &&
        in[1] == '0' &&
        in[length-4] == 'F' &&
        in[length-3] == 'F' &&
        in[length-2] == 'F' &&
        in[length-1] == 'F') {
        // UCS-2 string embedded with a 437-encoded string
        in += 2;
        length -= 6;
        while (length-- > 0) {
            char ch = *in++;
            if (ch >= '0' && ch <= '9') {
                digit = ch - '0';
            }
            else if (ch >= 'A' && ch <= 'F') {
                digit = ch - 'A' + 10;
            }
            else if (ch >= 'a' && ch <= 'f') {
                digit = ch - 'a' + 10;
            }
            else {
                continue;
            }
            value = value * 16 + digit;
            nibble++;
            if (nibble >= 4) {
                str += QChar((ushort)value);
                nibble = 0;
                value = 0;
            }
        }
    }
    else {
        // Regular 437-encoded string
        while (length-- > 0) {
            str += QChar((unsigned int)cp437ToUnicode[*in++ & 0xFF]);
        }
    }
    return str;
}
QByteArray QCodePage437Codec::convertFromUnicode(const QChar *in, int length, ConverterState *) const {
    QByteArray result;
    unsigned int ch;
    char *out;
    bool non437 = false;
    int position;
    // Determine if the string should be encoded using UCS-2 hack
    for (position = 0; !non437 && position < length; position++) {
        ch = in[position].unicode();
        if (ch >= 0x0100) {
            non437 = true;
        }
        else if (cp437FromUnicode[ch] == '?' && ch != '?') {
            non437 = true;
        }
    }
    if (non437) {
        // There is a non-CP437 character in the string, so use UCS-2
        result.resize(length * 4 + 6);
        out = result.data();
        *out++ = '8';
        *out++ = '0';
        while (length-- > 0) {
            uint ch = in->unicode();
            ++in;
            *out++ = hexchars[(ch >> 12) & 0x0F];
            *out++ = hexchars[(ch >> 8) & 0x0F];
            *out++ = hexchars[(ch >> 4) & 0x0F];
            *out++ = hexchars[ch & 0x0F];
        }
        *out++ = 'F';
        *out++ = 'F';
        *out++ = 'F';
        *out = 'F';
        return result;
    }
    // String only contains valid CP437 code points between 0x0000 and 0x00FF
    result.resize(length);
    out = result.data();
    while (length-- > 0) {
        *out++ = (char)cp437FromUnicode[in->unicode()];
        ++in;
    }
    return result;
}
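These methods work without modification for reading NFOs. One caveat applies when saving: convertFromUnicode() treats every character at U+0100 or above as non-CP437, so text containing box-drawing characters is written out in the hexadecimal "80...FFFF" form, not as CP437 bytes. If you want to save NFO art back to disk, you will need a fuller reverse mapping.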
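The "UCS-2 hack" used by both functions is easier to see in isolation. As a rough sketch in plain C++ (the names wrapUcs2Hex, looksWrapped and unwrapUcs2Hex are invented here, and std::u16string stands in for QString, so a C++11 compiler is assumed), the fallback writes "80", then four uppercase hex digits per UTF-16 code unit, then "FFFF":

```cpp
#include <cassert>
#include <cstddef>
#include <string>

static const char kHexChars[] = "0123456789ABCDEF";

// Encode UTF-16 code units as "80" + 4 hex digits per unit + "FFFF".
std::string wrapUcs2Hex(const std::u16string &text) {
    std::string out = "80";
    for (char16_t unit : text) {
        out += kHexChars[(unit >> 12) & 0x0F];
        out += kHexChars[(unit >> 8) & 0x0F];
        out += kHexChars[(unit >> 4) & 0x0F];
        out += kHexChars[unit & 0x0F];
    }
    out += "FFFF";
    return out;
}

// True when the bytes look like the wrapped form. Ordinary text that happens
// to start with "80" and end with "FFFF" matches as well.
bool looksWrapped(const std::string &bytes) {
    return bytes.size() >= 6 &&
           bytes.compare(0, 2, "80") == 0 &&
           bytes.compare(bytes.size() - 4, 4, "FFFF") == 0;
}

// Decode the wrapped form back to UTF-16, skipping non-hex characters in the
// same way the codec does.
std::u16string unwrapUcs2Hex(const std::string &bytes) {
    assert(looksWrapped(bytes));
    std::u16string out;
    int nibbles = 0;
    unsigned int value = 0;
    // Skip the leading "80" and stop before the trailing "FFFF".
    for (std::size_t i = 2; i + 4 < bytes.size(); ++i) {
        const char ch = bytes[i];
        int digit;
        if (ch >= '0' && ch <= '9')      digit = ch - '0';
        else if (ch >= 'A' && ch <= 'F') digit = ch - 'A' + 10;
        else if (ch >= 'a' && ch <= 'f') digit = ch - 'a' + 10;
        else continue;
        value = value * 16 + digit;
        if (++nibbles == 4) {
            out += static_cast<char16_t>(value);
            nibbles = 0;
            value = 0;
        }
    }
    return out;
}
```

Here wrapUcs2Hex(u"\u2588") gives "802588FFFF", and unwrapUcs2Hex() turns that back into the single character U+2588. The comment on looksWrapped() points at a weakness of this detection: a plain CP437 file that happens to begin with "80" and end with "FFFF" would be decoded as hex.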
Now, finally, how do you use it? Well, let's say you are opening a text file.
void MainWindow::open_file() {
    QString filename = QFileDialog::getOpenFileName(this);
    if (!filename.isEmpty()) {
        load_file(filename);
    }
}
Now, let's see what load_file() does:
void MainWindow::load_file(const QString &filename) {
    QFile file(filename);
    if (!file.open(QFile::ReadOnly | QFile::Text)) {
        QMessageBox::warning(this, tr("Warning"), tr("Cannot read %1\n%2").arg(filename).arg(file.errorString()));
        return;
    }
    QTextStream stream(&file);
    // Qt registers a codec when it is constructed and takes ownership of it,
    // so create it only once and look it up by name on later calls.
    QTextCodec *codec = QTextCodec::codecForName("IBM437");
    if (!codec) {
        codec = new QCodePage437Codec;
    }
    stream.setCodec(codec);
    QApplication::setOverrideCursor(Qt::WaitCursor);
    text->setText(stream.readAll());
    QApplication::restoreOverrideCursor();
    set_current_file(filename);
    statusBar()->showMessage(tr("Successfully loaded %1").arg(filename), 2000);
}
The crucial steps are creating the codec and passing it to stream.setCodec(). Again, nothing hard. You end up with a near-perfect conversion to Unicode, ready for displaying with Qt in any context.
One problem that I have not yet solved: line spacing. It seems like it should be easy, perhaps even by specifying CSS to tighten the line spacing. Notice in the picture how lines have about 1px of space between them, which is unusual. If anyone has found a solution, please post.