Retro text encodings and cross-platform compiling

Working with old tech can take you to unexpected places and expose the quirks of modern coding systems. Case in point: old computers’ supported character sets and their impact on getting workable text when you transfer data across to a recent OS.

I came across this with my Psion Series 3a. I’ve been using it as a note-taking device while researching archive texts on the history of semiconductors. When I was done, I transferred the notes (a single Psion Word file) to a Mac and converted it using my Word2Text utility.

But the computer said, “No”.

Exploring why, I discovered that I had coded for conversion from Ascii, but in fact the Psion uses an extended character set, IBM Code Page 850, which is Ascii plus 128 extra characters above Ascii’s 7-bit range of 0-127. 850 was widely used in the DOS days, but fell out of favour with the transition to Windows: there’s no Euro symbol in 850, for example. Windows Code Page 1252 kind of replaced it, but they’re not the same. Many of the characters above 0x7F are in different locations, and while 850 has a number of graphics-oriented characters handy for marking out tables and such, 1252 focuses more on accented characters for European languages. And includes €. You can see both sets’ ‘above Ascii’ characters below.

Code Page 850’s upper area characters. Source: Wikipedia

Why mention 1252? Because it’s one of a small selection of historic encodings supported by macOS’ Foundation framework, and that’s what I’m using to code Word2Text. Ultimately, 1252, like many other character sets, was superseded by UTF-8, which is Swift’s native encoding for strings and so the best target for converted files.

Code page 1252’s upper area characters. Source: Wikipedia
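
To make the mismatch between the two tables concrete, here is a minimal sketch that decodes the same byte under each code page. The two lookup tables are hand-picked fragments assumed for illustration (two entries each, not the full 128), and `decode` is a hypothetical helper rather than anything in Word2Text.

```swift
// Hand-picked fragments of each code page's upper half, for illustration only.
// Keys are byte values; values are the Unicode scalars they represent.
let cp850Fragment: [UInt8: Unicode.Scalar] = [
    0x9C: "\u{00A3}",  // £
    0xA3: "\u{00FA}"   // ú
]
let cp1252Fragment: [UInt8: Unicode.Scalar] = [
    0x9C: "\u{0153}",  // œ
    0xA3: "\u{00A3}"   // £
]

// Hypothetical helper: look a byte up in one of the fragments above.
// Bytes in the Ascii range map to themselves; any other unlisted byte becomes "?".
func decode(_ byte: UInt8, using table: [UInt8: Unicode.Scalar]) -> String {
    if byte < 0x80 {
        return String(Unicode.Scalar(byte))
    }
    return table[byte].map { String($0) } ?? "?"
}

// Print each sample byte followed by its reading under 850 and then 1252.
let samples: [UInt8] = [0x9C, 0xA3]
for byte in samples {
    print(byte, decode(byte, using: cp850Fragment), decode(byte, using: cp1252Fragment))
}
```

The same two byte values appear in both sets, but only one of the four readings in each column is the pound sign, which is why bytes cannot simply be passed from one code page to the other.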

The reason Word2Text failed to convert my notes file was the presence of two £ symbols. In 850 it has the value 156 (0x9C); in 1252, 163 (0xA3). It doesn’t exist in Ascii, so Swift’s string converter rejected the out-of-Ascii-range value. I changed the encoding I was using from String.Encoding.ascii to String.Encoding.windowsCP1252 and, because 1252 != 850, I added a conversion function for certain characters as a test: if the text bytes contain the value 156, change it to 163.

if let text = String(bytes: textBytes, encoding: .windowsCP1252) {
    // Swift String conversion works - return the result
    return text
}

The value of textBytes is a slice of an array of bytes from the Psion Word file: ArraySlice<UInt8> in Swift terminology. This did the trick.
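
The 156-to-163 remapping step described above might look something like the following sketch. The names `cp850ToCP1252` and `remapCP850` are hypothetical, and the table covers only the pound sign; it is an assumption about the shape of such a function, not Word2Text's actual code.

```swift
// Hypothetical remapping table: Code Page 850 byte -> Code Page 1252 byte.
// Only the pound sign is listed; other upper-range characters would need entries too.
let cp850ToCP1252: [UInt8: UInt8] = [
    0x9C: 0xA3  // £: 156 in 850, 163 in 1252
]

// Return a copy of the bytes with every listed 850 value swapped for its 1252 value.
// Generic over any byte sequence, so it accepts an ArraySlice<UInt8> directly.
func remapCP850<S: Sequence>(_ bytes: S) -> [UInt8] where S.Element == UInt8 {
    return bytes.map { cp850ToCP1252[$0] ?? $0 }
}

// "£50" as it would be stored in Code Page 850.
let fileBytes: [UInt8] = [0x9C, 0x35, 0x30]
let remappedBytes = remapCP850(fileBytes[...])
print(remappedBytes)  // [163, 53, 48]
```

The remapped array can then be handed to the String initialiser in place of the raw slice. One caveat: 163 is itself a valid Code Page 850 value (ú), so a fuller table would need to remap that byte as well (to 0xFA in 1252) rather than leave it to be read as £.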

And I looked upon my work and, lo, I was pleased.

Briefly, anyway. Word2Text runs on Linux as well as macOS, and jumping over to my Raspberry Pi 500, I pulled the changes, made a build, converted a copy of the ‘troublesome’ notes file and got... nothing. Niente.

While generating a Swift String on macOS from Code Page 1252 bytes works fine, it does not on Linux, or at least not for me. In an attempt to mitigate the issue, I fell back on Foundation’s old NSString object, from Apple’s NeXTSTEP/Objective-C era. Its native format is UTF-16, and it too can convert Code Page 1252 bytes. And since String can interoperate with NSString, getting UTF-8 is easy.

This time it worked. Switching to macOS, the fallback works there too. But it still raises the question: why doesn’t String’s 1252 conversion work on Linux when it does work on macOS, and when NSString conversion works on both? Swift on Linux has full Foundation support, and a UInt8 is the same on both platforms, but while the code can turn the value 163 to a £ on macOS, it barfs when built under Linux.

Different platforms, different rules of course, but that can’t quite be the case here as NSString conversion works just as well on both.

Again, though, not without a quirk. Xcode on macOS reads the Foundation headers and recognises the constant NSWindowsCP1252StringEncoding, but the Swift compiler on Linux does not. One header missing from the standard install, perhaps? So I had to print out its value and set it manually:

#if os(macOS)
    let encoding: UInt = NSWindowsCP1252StringEncoding
#elseif os(Linux)
    let encoding: UInt = 12
#endif

That, however, defeats the whole point of having constants: not only to describe the referenced value meaningfully but to allow that value to be changed at any time without affecting anyone else’s source code. This is why I kept the macOS branch.
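
One possible way to keep a named constant on both platforms is sketched below. It assumes `String.Encoding` is available on Linux, as it was for the earlier String initialiser, and relies on `String.Encoding` wrapping the same numeric identifiers that NSString uses. This is a suggestion, not what Word2Text does.

```swift
import Foundation

// String.Encoding's rawValue is a UInt, the type NSString's initialisers expect,
// so neither a per-platform branch nor a literal is needed.
let cp1252Encoding: UInt = String.Encoding.windowsCP1252.rawValue
print(cp1252Encoding)
```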

Then string conversion just becomes:

if let tb = NSString(bytes: textBytes,
                     length: textBytes.count,
                     encoding: encoding) {
    text = String(tb)
}
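
Putting the two approaches together, a helper could try String first and only fall back on NSString when that fails. The function below is a hypothetical sketch under that assumption: `decodeText` is an invented name, and the slice is copied into an Array before the NSString call because that initialiser takes a raw pointer, which Swift will produce from an Array.

```swift
import Foundation

// Hypothetical helper combining both approaches; not Word2Text's actual code.
func decodeText(_ textBytes: ArraySlice<UInt8>,
                encoding: String.Encoding = .windowsCP1252) -> String? {
    // First choice: Swift's own String initialiser.
    if let text = String(bytes: textBytes, encoding: encoding) {
        return text
    }

    // Fallback: NSString. Copy the slice into an Array first so that Swift
    // can pass it where the initialiser expects a raw pointer.
    let bytes = Array(textBytes)
    if let tb = NSString(bytes: bytes,
                         length: bytes.count,
                         encoding: encoding.rawValue) {
        // Same NSString-to-String conversion as the snippet above.
        return String(tb)
    }

    // Neither converter could interpret the bytes.
    return nil
}

// "£50" encoded as Code Page 1252.
let sample: [UInt8] = [0xA3, 0x35, 0x30]
let decoded = decodeText(sample[...])
print(decoded ?? "conversion failed")
```

Returning an optional leaves the caller to decide what to do when both converters fail, rather than silently producing an empty string.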

All of this reveals the complexity of providing cross-platform support, and I take off my metaphorical hat to the engineers behind Swift, Rust, Go and all the other languages that pledge to offer ‘write once, compile anywhere’ platforms. It is impressive that it works at all. Coding for edge cases, especially in obscure retro-tech use-cases like mine, cannot be what they had in mind. My job is much easier than theirs.