Friday, May 1, 2009

Haskell - Encodings

The ASCII character set has withstood the test of time. However, as the use of computers and the internet becomes global, it is only natural that people speak in their native language. To this end, the Unicode character set was developed to meet this need. Since Unicode uses much more than 255 code points to represent these characters, there are many ways of mapping the much larger space of 4-bytes into a sequence of one or more bytes. The two most popular schemes for doing this are UTFs (UTF-8 and UTF-16), and Guóbiāo (GB2312 and GB18030). In particular, UTF-8 is a very organized and methodical scheme, which makes it well suited for purposes other than text encoding.

XML has become terrifyingly popular in acedemic and commercial circles, and while the overhead that it incurs is usually negligible, there are some who want to use an XML model while keeping a small memory footprint. For this, the W3C has designed a binary XML format called EXI for the transmission of XML documents in binary form. Parts of this standard are not very nice (which is out of the scope of this post), whereas some parts are quite ingenious, like the integer and date-time formats.

Here are some examples of EXI (efficient XML interchange) unsigned integers. EXI is little-endian, so the As are the low bits and the Cs are the high bits.

  • 0b0AAAAAAA (EXI encoding for 1-byte)
  • 0b1AAAAAAA,0b0BBBBBBB (EXI encoding for 2-bytes)
  • 0b1AAAAAAA,0b1BBBBBBB,0b0CCCCCCC (3-bytes)

Here are some examples of UTF (Unicode transformation format) characters. UTF-8 is big-endian, so the Cs are the low bits and the As are the high bits.

  • 0b0AAAAAAA (UTF-8 encoding for 1-byte)
  • 0b110AAAAA,0b10BBBBBB (UTF-8 encoding for 2-bytes)
  • 0b1110AAAA,0b10BBBBBB,0b10CCCCCC (3-bytes)

In the C programming language, all structs can be thought of as parsers. In the Haskell programming language, one can write parsers quite easily with the help of a library called Parsec, which stands for parser combinators. Suppose we have a Bit datatype with the constructors On and Off, and suppose we have definitions for the following functions in Haskell:

bit :: Bit -> Parser Bit
bitstring :: [Bit] -> Parser [Bit]
manyBits :: Int -> Parser [Bit]
oneBit :: Parser Bit

These parsers would allow writing versions of EXI and UTF-8 parsers without any knowledge of characters. An EXI parser can be described as follows

exi :: Parser [Bit]
exi = do
firstBit <- oneBit
restBits <- manyBits 7
case firstBit of
Off -> return restBits
On -> do
highBits <- exi
return (highBits ++ restBits)

and a UTF-8 parser can be described as

utf8continue = do
bitstring [On, Off]
manyBits 6

utf8 :: Parser [Bit]
utf8 = do bit Off
manyBits 7

<|> do bitstring [On, On, Off]
highBits <- manyBits 5
restBits <- utf8continue
return (highBits ++ restBits)

<|> do bitstring [On, On, On, Off]
highBits <- manyBits 4
rests <- sequence (replicate 2 utf8continue)
return (highBits ++ (concat rests))

<|> do bitstring [On, On, On, On, Off]
highBits <- manyBits 3
rests <- sequence (replicate 3 utf8continue)
return (highBits ++ (concat rests))

As you can see, EXI has a much more elegant implementation in Haskell. Although Parsec was designed to be used on characters and strings, similar techniques can be used for byte-parsers and bit-parsers as shown above. Since C structs can be viewed as byte-parsers, it should be possible to build Parsec style parsers for some common C structs and common file formats like JPEG or MPEG. Bringing this to its logical conclusion, most file formats could be described as some kind of bit-parser or byte-parser, so if such parsers were written in Haskell, what would they look like?

No comments:

Post a Comment