Chapter 12. Code case study: parsing a binary data format

Table of Contents

Greyscale files
Parsing a raw PGM file
Getting rid of boilerplate code
Implicit state
Obtaining and modifying the parse state
Reporting parse errors
Introducing functors
Constraints on type definitions are bad
Functors as operators
Flexible instances
Thinking more about functors
Writing a functor instance for Parse
Using functors for parsing
Rewriting our PGM parser
Conclusions
Exercises

In this chapter, we'll discuss a common task: parsing a binary file. We're going to use this for two purposes. Our first is to talk a little about parsing, but our main goal is to talk about program organisation and “boilerplate removal”.

As our task, we'll choose parsing a few different netpbm file types. The netpbm suite is an ancient and venerable collection of programs and file formats for working with bitmap images. These file formats have the dual advantages of wide use and being fairly easy, but not completely trivial, to parse. Most importantly for our convenience, netpbm files are not compressed.

Greyscale files

The name of netpbm's greyscale file format is PGM (“portable grey map”). It is actually not one format, but two; the “plain” (or “P2”) format is encoded as ASCII, while the more common “raw” (“P5”) format is mostly binary.

A file of either format starts with a header, which in turn begins with a “magic” string describing the format. For a plain file, the string is P5, and for raw, it's P2. The magic string is followed by white space, then by three numbers: the width, height, and maximum grey value of the image. These numbers are represented as ASCII decimal numbers, separated by white space.

After the maximum grey value comes the image data. In a raw file, this is a string of binary values. In a plain file, the values are represented as ASCII decimal numbers separated by white space.

A raw file can contain a sequence of images, one after the other, each with its own header. A plain file contains only one image.

Parsing a raw PGM file

For our first try at a parsing function, we'll only worry about raw PGM files. We'll write our PGM parser as a pure function. It's not responsible for obtaining the data to parse, just for the actual parsing. This is a common approach in Haskell programs. By separating the reading of the data from what we subsequently do with it, we gain flexibility in where we take the data from.

We'll use the ByteString type to store our greymap data, because it's compact.

import qualified Data.ByteString.Lazy.Char8 as L
import Data.Char (isSpace)

For our purposes, it doesn't matter whether we use a lazy or strict ByteString, so we've somewhat arbitrarily chosen the lazy kind.

The ByteString module contains many definitions that have the same names as existing Prelude definitions that are automatically imported for us. Because of this, if we try to use a name that is present in both the ByteString module and the Prelude, the compiler will complain about ambiguity. We avoid this problem by importing the module under an alias, L: every time you see a name prefixed with L., we're using the name from ByteString.

We'll use a straightforward data type to represent PGM files.

data Greymap = Greymap {
      greyWidth :: Int
    , greyHeight :: Int
    , greyMax :: Int
    , greyData :: L.ByteString
    } deriving (Eq)

Normally, a Haskell Show instance should produce a string representation that we can read back by calling read. However, for a bitmap graphics file, this would potentially produce huge text strings, for example if we were to show a photo. For this reason, we're not going to let the compiler automatically derive a Show instance for us: we'll write our own, intentionally less capable, implementation.

instance Show Greymap where
    show (Greymap w h m _) = "Greymap " ++ show w ++ "x" ++ show h ++
                             " " ++ show m

Because our Show instance intentionally avoids printing the bitmap data, there's no point in writing a Read instance, as we can't reconstruct a valid Greymap from the result of show.

Here's an obvious type for our parsing function.

parseP5 :: L.ByteString -> Maybe (Greymap, L.ByteString)

This will take a ByteString, and if the parse succeeds, it will return the parsed Greymap, along with the string that remains after parsing.

Our parsing function has to consume a little bit of its input at a time. First, we need to assure ourselves that we're really looking at a raw PGM file; then we need to parse the numbers from the remainder of the header; then we consume the bitmap data. Here's an obvious, brutish way to express this.

matchHeader :: L.ByteString -> L.ByteString -> Maybe L.ByteString

-- "nat" here is short for "natural number", not "nathan torkington"
getNat :: L.ByteString -> Maybe (Int, L.ByteString)

getBytes :: Int -> L.ByteString -> Maybe (L.ByteString, L.ByteString)

parseP5 s =
  case matchHeader (L.pack "P5") s of
    Nothing -> Nothing
    Just s1 ->
      case getNat s1 of
        Nothing -> Nothing
        Just (width, s2) ->
          case getNat (L.dropWhile isSpace s2) of
            Nothing -> Nothing
            Just (height, s3) ->
              case getNat (L.dropWhile isSpace s3) of
                Nothing -> Nothing
                Just (maxGrey, s4)
                  | maxGrey > 255 -> Nothing
                  | otherwise ->
                      case getBytes 1 s4 of
                        Nothing -> Nothing
                        Just (_, s5) ->
                          case getBytes (width * height) s5 of
                            Nothing -> Nothing
                            Just (bitmap, s6) ->
                              Just (Greymap width height maxGrey bitmap, s6)

Stylistically, this is a very “direct” piece of code, doing all of the parsing in one long staircase of case expressions. Each function that it calls returns the residual ByteString left over after it has consumed all it needs from its input string. This residual string can then be passed along to the next step. It deconstructs each result in turn, either failing if the function failed, or building up a piece of the result as it continues. The bodies of the functions that it calls aren't especially interesting, but we'll include them for completeness.

matchHeader h s
    | h `L.isPrefixOf` s = Just (L.dropWhile isSpace (L.drop (L.length h) s))
    | otherwise = Nothing

getNat s = case L.readInt s of
             Nothing -> Nothing
             Just (i, s') | i <= 0    -> Nothing
                          | otherwise -> Just (fromIntegral i, s')

getBytes n s = let n' = fromIntegral n
                   ht@(h, t) = L.splitAt n' s
               in if L.length h < n'
                  then Nothing
                  else Just ht

Getting rid of boilerplate code

Our parseP5 function is somehow unsatisfying. It marches steadily to the right of the screen, so it's clear that a slightly more complicated function would soon run out of visual real estate. And it repeats a pattern of constructing and then deconstructing Maybe values, only continuing if a particular value matches Just. All of the similar case expressions act as “boilerplate code”, busywork that obscures what we're really trying to do. In short, this function is begging for some abstraction and refactoring.

If we step back a little, we can see two patterns. First is that the functions that we're calling have similar types. Each takes a ByteString as its last argument, and returns Maybe something else. Secondly, every step in the “ladder” of our parseP5 function deconstructs a Maybe value, and either fails or passes the unwrapped result to a function.

We can quite easily write a function that captures this second pattern.

(>>?) :: Maybe a -> (a -> Maybe b) -> Maybe b
Nothing >>? _ = Nothing
Just v >>? f  = f v

The (>>?) function acts very simply: it takes a value as its left argument, and a function as its right. If the value is not Nothing, it calls the function. We have written this function as an operator so that we can use it to chain functions together.

With this chaining function in hand, we can take a second try at our parsing function.

parseP5_take2 :: L.ByteString -> Maybe (Greymap, L.ByteString)
parseP5_take2 s =
    matchHeader (L.pack "P5") s       >>?
    \s -> skipSpace ((), s)           >>?
    (getNat . snd)                    >>?
    skipSpace                         >>?
    \(width, s) ->   getNat s         >>?
    skipSpace                         >>?
    \(height, s) ->  getNat s         >>?
    \(maxGrey, s) -> getBytes 1 s     >>?
    (getBytes (width * height) . snd) >>?
    \(bitmap, s) -> Just (Greymap width height maxGrey bitmap, s)

skipSpace :: (a, L.ByteString) -> Maybe (a, L.ByteString)
skipSpace (a, s) = Just (a, L.dropWhile isSpace s)

The key to understanding this function is to think about the chaining. On the left hand side of each (>>?) is a Maybe value; on the right is a function that returns a Maybe value. Each left-and-right-sides expression is thus of type Maybe, suitable for passing to the following (>>?) expression.

The other change that we've made to improve readability is add a skipSpace function. With these changes, we've halved the number of lines of code compared to our original parsing function. By removing the boilerplate case expressions, we've made the code easier to follow.

While we warned against overuse of anonymous functions in the section called “Anonymous (lambda) functions”, we use several in our chain of functions here. Because these functions are so small, we wouldn't improve readability by giving them names.

However, we're not yet out of the woods. This code explicitly passes two-tuples around, using one element for an intermediate part of the parsed result and the other for the current residual ByteString. If we want to extend the code, for example to track the number of bytes we've consumed so that we can report the location of a parse failure, we need to modify eight different locations so that we can pass a three-tuple around.

Implicit state

While we've gotten rid of some boilerplate code, the two-tuple that we use to pass around our partial result and residual string is a serious problem: it makes our code difficult to change.

We can do something about this, though. First, let's augment the state that our parser uses.

data ParseState = ParseState {
      string :: L.ByteString
    , offset :: Int64
    } deriving (Show)

We can now track both the current residual string and the offset into the original string since we started parsing. This lets us think of parsing as a function from one ParseState to another, also returning the result of the parse.

newtype Parse a = Parse {
      runParse :: ParseState -> Either String (a, ParseState)
    }

The newtype declaration for the Parse type just acts as a safety wrapper around this function type. It allows us to ensure that we can't accidentally run a parser.

The Parse type is encoding two concepts in one. The first is the possibility of failure, reported via an error message. This we achieve using Either to represent two possible results of a parse. The second is the update of the parser state and presentation of an intermediate result, represented by the type of runParse. In other words, if a parse succeeds, it will generate both a result and a new parse state.

Let's try to define a minimal parser.

identity :: a -> Parse a
identity a = Parse (\s -> Right (a, s))

All this function has to do is take a parse state, leave it untouched, and use the function's argument as the result of the parse. We wrap this function in our Parse type, but how can we actually use this wrapped function to parse something?

The first thing we must do is peel off the Parse wrapper so that we can get at the function inside. We do this by calling runParse. We also need to construct a ParseState, then run our parsing function on that parse state. Finally, we'd like to extract the result of the parse from the final ParseState.

parse :: Parse a -> L.ByteString -> Either String a
parse f s = case runParse f (ParseState s 0) of
              Left err -> Left err
              Right (a, _) -> Right a

Because neither the identity parser nor the parse function examine the parse state at all, we don't even need to bother creating an input string in order to try this out.

ghci> :load Parse
[1 of 2] Compiling PNM              ( PNM.hs, interpreted )
[2 of 2] Compiling Parse            ( Parse.hs, interpreted )
Ok, modules loaded: PNM, Parse.
ghci> :type parse (identity 1) undefined
parse (identity 1) undefined :: (Num t) => Either String t

A parser that doesn't even inspect its input isn't very interesting, but at least we have confidence that our types are correct. Let's focus now on writing a parser that does something meaningful. We're not going to get very ambitious yet, though: all we want to do is parse a single byte.

parseByte :: Parse Word8
parseByte =
    getState ==> \st ->
    case uncons (string st) of
      Nothing -> bail "no more input"
      Just (c, s) -> let st' = st { string = s, offset = offset st + 1 }
                     in putState st' ==> \_ -> identity c

There's some unfamiliar code in use here, so let's take a deeper look. The (==>) function serves a similar purpose to our earlier (>>?) function, acting as “glue” to let us chain functions together.

(==>) :: Parse a -> (a -> Parse b) -> Parse b
x ==> f = Parse (\st -> case runParse x st of
                          Left err -> Left err
                          Right (a, st') -> runParse (f a) st')

Indeed, the types of the two functions are very similar. The body of (==>) is interesting, and ever so slightly tricky. Remember that Parse is really a function type with a wrapper. Therefore, (==>) must return a function, in a wrapper. It doesn't really “do” much: it just creates a closure to remember the values of x and f. This closure won't be unwrapped and called until we call parse. At that point, it will be called with a ParseState. It will call the Parse that is its left argument and inspect its result. If that parse failed, the closure fails too. Otherwise, it passes the result of the parse and the new ParseState to f.

This is really quite fancy and subtle stuff: we're effectively passing the ParseState down the chain of Parse values in a hidden argument. (We'll be revisiting this kind of code in a few chapters, so don't fret if that description seemed dense.)

Obtaining and modifying the parse state

Our parseByte function doesn't take the parse state as an argument. Instead, it has to call getState to get a copy of the state, and putState to replace the current state with a new one.

getState :: Parse ParseState
getState = Parse (\s -> Right (s, s))

putState :: ParseState -> Parse ()
putState s = Parse (\_ -> Right ((), s))

When reading these functions, recall that the left element of the tuple is the result of a Parse, while the right is the current ParseState. This makes it easier to follow what these functions are doing.

The getState function extracts the current parsing state, so that the caller can access the string. The putState function replaces the current parsing state with a new one. This becomes the state that will be seen by the next function in the (==>) chain.

these functions let us move explicit state handling into the bodies of only those functions that need it. Many functions don't need to know what the current state is, and so they'll never call getState or putState. This lets us write more compact code than our earlier parser, which had to pass tuples around by hand.

Even better, because we've packaged up the details of the parsing state into the ParseState type, if we want to add more information to the parsing state, all we need to do is modify the definition of ParseState, and the bodies of whatever functions need the new information. Compare this to our earlier parsing code, where we'd have had to turn every use of a two-tuple into a three-tuple, and the advantage should be clear.

Reporting parse errors

We carefully defined our Parse type to accommodate the possibility of failure. The (==>) combinator checks for a parse failure and stops parsing if it runs into a failure. But we haven't yet introduced the bail function, which we use to report a parse error.

bail :: String -> Parse a
bail err = Parse $ \s -> Left $
           "byte offset " ++ show (offset s) ++ ": " ++ err

After we call bail, (==>) will successfully pattern match on the Left constructor that it wraps the error message with, and it will not invoke the next parser in the chain. This will cause the error message to percolate back through the chain of prior callers.

Introducing functors

We're by now thoroughly familiar with the map function, which applies a function to every element of a list, returning a list of possibly a different type.

ghci> map (+1) [1,2,3]
[2,3,4]
ghci> map show [1,2,3]
["1","2","3"]
ghci> :type map show
map show :: (Show a) => [a] -> [String]

This map-like activity can be useful in other instances. For example, consider a binary tree.

data Tree a = Node (Tree a) (Tree a)
            | Leaf a
              deriving (Show)

If we want to take a tree of strings and turn it into a tree containing the lengths of those strings, we could write a function to do this.

treeLengths (Leaf s) = Leaf (length s)
treeLengths (Node l r) = Node (treeLengths l) (treeLengths r)

Now that our eyes are attuned to looking for patterns that we can turn into generally useful functions, we can see a possible case of this here.

treeMap :: (a -> b) -> Tree a -> Tree b
treeMap f (Leaf a) = Leaf (f a)
treeMap f (Node l r) = Node (treeMap f l) (treeMap f r)

As we might hope, treeLengths and treeMap length give the same results.

ghci> let tree = Node (Leaf "foo") (Node (Leaf "x") (Leaf "quux"))
ghci> treeLengths tree
Node (Leaf 3) (Node (Leaf 1) (Leaf 4))
ghci> treeMap length tree
Node (Leaf 3) (Node (Leaf 1) (Leaf 4))
ghci> treeMap (odd . length) tree
Node (Leaf True) (Node (Leaf True) (Leaf False))

Haskell provides a well-known typeclass to further generalise treeMap. This typeclass is named Functor, and it defines one function, fmap.

class Functor f where
    fmap :: (a -> b) -> f a -> f b

We can think of fmap as a kind of lifting function, as we introduced in the section called “Avoiding boilerplate with lifting”. It takes a function over ordinary values a -> b and lifts it to become a function over containers f a -> f b, where f is the container type.

If we substitute Tree for the type variable f, for example, the type of fmap is identical to the type of treeMap, and in fact we can use treeMap as the implementation of fmap over Trees.

instance Functor Tree where
    fmap = treeMap

We can also use map as the implementation of fmap for lists.

instance Functor [] where
    fmap = map

We can now use fmap over different container types.

ghci> fmap length ["foo","quux"]
[3,4]
ghci> fmap length (Node (Leaf "livingstone") (Leaf "i presume"))
Node (Leaf 11) (Leaf 9)

The Prelude defines instances of Functor for several common types. The instance for lists is provided in the Prelude, as is the instance for Maybe.

instance Functor Maybe where
    fmap _ Nothing = Nothing
    fmap f (Just x) = Just (f x)

The instance for Maybe makes it particularly clear what an fmap implementation needs to do. The implementation must have a sensible behaviour for each of a type's constructors. If a value is wrapped in Just, for example, the fmap implementation calls the function on the unwrapped value, then rewraps it in Just.

The definition of Functor imposes a few obvious restrictions on what we can do with fmap. For example, we can only make instances of Functor from types that have exactly one free type variable.

[Note]What's a free type variable?

A free type variable is a lower-case type variable, such as a, that hasn't been bound to a particular type. For example, the type Maybe a has one free type variable, but Maybe Int has none. We say that the type variable a is bound to the type Int.

We can't write an fmap implementation for Either a b or (a, b), for example, because these have two free type variables. We also can't write work with Bool or Int, as they have no free type variables.

In addition, we can't place any constraints on our type definition. What does this mean? To illustrate, let's first look at a normal data definition and its Functor instance.

data Foo a = Foo a
           
instance Functor Foo where
    fmap f (Foo a) = Foo (f a)

When we define a new type, we can add a type constraint just after the data keyword as follows.

data Eq a => Bar a = Bar a

This says that we can only put a type a into a nFoo if a is a member of the Eq typeclass. However, the constraint renders it impossible to write a Functor instance for Bar.

ghci> :load ValidFunctor
[1 of 1] Compiling Main             ( ValidFunctor.hs, interpreted )

ValidFunctor.hs:14:12:
    Could not deduce (Eq a) from the context (Functor Bar)
      arising from a use of `Bar' at ValidFunctor.hs:14:12-16
    Possible fix:
      add (Eq a) to the context of the type signature for `fmap'
    In the pattern: Bar a
    In the definition of `fmap': fmap f (Bar a) = Bar (f a)
    In the definition for method `fmap'

ValidFunctor.hs:14:21:
    Could not deduce (Eq b) from the context (Functor Bar)
      arising from a use of `Bar' at ValidFunctor.hs:14:21-29
    Possible fix:
      add (Eq b) to the context of the type signature for `fmap'
    In the expression: Bar (f a)
    In the definition of `fmap': fmap f (Bar a) = Bar (f a)
    In the definition for method `fmap'
Failed, modules loaded: none.

Constraints on type definitions are bad

Adding a constraint to a type definition is never a good idea. It has the effect of forcing you to add type constraints to every function that will operate on values of that type. Let's say that we need a stack data structure that we want to be able to query to see whether its elements obey some ordering. Here's a naive definition of the data type.

data (Ord a) => OrdStack a = Bottom
                           | Item a (OrdStack a)
                             deriving (Show)

If we want to write a function that checks the stack to see whether it is monotonic (i.e. either every element is bigger than the element below it, or every element is smaller), we'll obviously need an Ord constraint to perform the pairwise comparisons.

isMonotonic :: (Ord a) => OrdStack a -> Bool
isMonotonic (Item a rest@(Item b _))
    | compare a b `elem` [GT, LT] = isMonotonic rest
    | otherwise = False
isMonotonic _ = True

However, because we wrote the type constraint on the type definition, that constraint ends up infecting places where it isn't needed at all: we need to add the Ord constraint to push, which does not care at all about the ordering of elements on the stack.

push :: (Ord a) => a -> OrdStack a -> OrdStack a
push a s = Item a s

Try removing that Ord constraint above, and the definition of push will fail to typecheck.

This is why our attempt to write a Functor instance for Bar failed earlier: it would have required an Eq constraint to somehow get retroactively added to the signature of fmap.

Now that we've tentatively established that putting a type constraint on a type definition is a misfeature of Haskell, what's a more sensible alternative? The answer is simply to omit type constraints from type definitions, and instead place them on the functions that need them.

In this example, we can drop the Ord constraints from OrdStack and push. It needs to stay on isMonotonic, which otherwise couldn't call compare. We now have the constraints where they actually matter. This has the further benefit of making the type signatures better document the real requirements of each function.

Most Haskell container types follow this pattern. The Map type in the Data.Map module requires that its keys be ordered, but this constraint is expressed on functions like insert, where it's actually needed, and not on size, where ordering isn't used.

Functors as operators

Quite often, you'll see fmap called as an operator.

ghci> (1+) `fmap` [1,2,3] ++ [4,5,6]
[2,3,4,4,5,6]

Perhaps strangely, plain old map is almost never used in this way.

One possible reason for the stickiness of the fmap-as-operator meme is that this use lets us omit parentheses from its second argument. Fewer parentheses leads to reduced mental juggling while reading a function.

ghci> fmap (1+) ([1,2,3] ++ [4,5,6])
[2,3,4,5,6,7]

If you really want to use fmap as an operator, the Control.Applicative module contains an operator (<$>) that is an alias for fmap. The $ in its name appeals to the similarity between applying a function to its arguments (using the ($) operator) and lifting a function into a functor.

Flexible instances

You might hope that we could write a Functor instance for the type Either Int b, which has one free type variable.

instance Functor (Either Int) where
    fmap _ (Left n) = Left n
    fmap f (Right r) = Right (f r)

However, the type system of Haskell 98 cannot guarantee that checking the constraints on such an instance will terminate. A non-terminating constraint check will send a compiler into an infinite loop, so instances of this form are forbidden.

ghci> :load EitherInt
[1 of 1] Compiling Main             ( EitherInt.hs, interpreted )

EitherInt.hs:2:0:
    Illegal instance declaration for `Functor (Either Int)'
        (All instance types must be of the form (T a1 ... an)
         where a1 ... an are distinct type *variables*
         Use -XFlexibleInstances if you want to disable this.)
    In the instance declaration for `Functor (Either Int)'
Failed, modules loaded: none.

GHC has a more powerful type system than the base Haskell 98 standard. It operates in Haskell 98 compatibility mode by default, for maximal portability. We can instruct it to allow more flexible instances using a special compiler directive.

{-# LANGUAGE FlexibleInstances #-}

instance Functor (Either Int) where
    fmap _ (Left n) = Left n
    fmap f (Right r) = Right (f r)

The directive is embedded in the specially formatted LANGUAGE comment. These directives are usually referred to as “pragmas”. Pragmas are always enclosed in the special comment sequences {-#, to begin, and #-}, to end.

GHC supports many kinds of pragma. Most pragmas only have meaning at specific locations in a source file. Language pragmas, for example, are only obeyed if they are present at the beginning of a source file.

With our Functor instance in hand, let's try out fmap on Either Int.

ghci> :load EitherIntFlexible
[1 of 1] Compiling Main             ( EitherIntFlexible.hs, interpreted )
Ok, modules loaded: Main.
ghci> fmap (== "cheeseburger") (Left 1 :: Either Int String)
Left 1
ghci> fmap (== "cheeseburger") (Right "fries" :: Either Int String)
Right False

Thinking more about functors

We've made a few implicit assumptions about how functors ought to work. It's helpful to make these explicit and to think of them as rules to follow, because this lets us treat functors as uniform, well-behaved objects. We have only two rules to remember, and they're simple.

Our first rule is that a functor must preserve identity. That is, applying fmap id to a value should give us back an identical value.

ghci> fmap id (Node (Leaf "a") (Leaf "b"))
Node (Leaf "a") (Leaf "b")

Our second rule is that functors must be composable. That is, composing two uses of fmap should give the same result as one fmap with the same functions composed.

ghci> (fmap even . fmap length) (Just "twelve")
Just True
ghci> fmap (even . length) (Just "twelve")
Just True

Another way of looking at these two rules is that a functor must preserve shape. The structure of a collection should not be affected by a functor; only the values that it contains should change.

ghci> fmap odd (Just 1)
Just True
ghci> fmap odd Nothing
Nothing

If you're writing a Functor instance, it's useful to keep these rules in mind, and indeed to test them, because the compiler can't check the rules we've listed above. On the other hand, if you're simply using functors, the rules are “natural” enough that there's no need to memorise them. They just formalise a few intuitive notions of “do what I mean”.

Writing a functor instance for Parse

For the types we have surveyed so far, the behaviour we ought to expect of fmap has been obvious. This is a little less clear for Parse, due to its complexity. A reasonable guess is that the function we're fmapping should be applied to the current result of a parse, and leave the parse state untouched.

instance Functor Parse where
    fmap f p = Parse (\st -> case runParse p st of
                               Left err -> Left err
                               Right (a, st') -> let fa = identity (f a)
                                                 in runParse fa st')

Since this definition isn't especially easy to read, let's perform a few quick experiments to see if we're following our rules for functors.

First, we'll check that identity is preserved. Let's try this first on a parse that should fail: trying to parse a byte from an empty string.

ghci> parse parseByte L.empty
Loading package array-0.1.0.0 ... linking ... done.
Loading package bytestring-0.9.0.1 ... linking ... done.
Left "byte offset 0: no more input"
ghci> parse (id <$> parseByte) L.empty
Left "byte offset 0: no more input"

Good. Now for a parse that should succeed.

ghci> let input = pack "foo"
ghci> L.head input
102
ghci> parse parseByte input
Right 102
ghci> parse (id <$> parseByte) input
Right 102

By inspecting the results above, we can also see that our functor instance is obeying our second rule, that of preserving shape. Failure is preserved as failure, and success as success.

Finally, we'll ensure that composability is preserved.

ghci> parse ((chr . fromIntegral) <$> parseByte) input
Right 'f'
ghci> parse (chr <$> fromIntegral <$> parseByte) input
Right 'f'

On the basis of this brief inspection, our Functor instance appears to be well behaved.

Using functors for parsing

All of this talk about functors had a purpose: they often let us write tidy, expressive code. Recall the parseByte function that we introduced earlier. In recasting our PGM parser to use our new parser infrastructure, we'll often want to work with ASCII characters instead of Word8 values.

While we could write a parseChar function that has a similar structure to parseByte, we can now avoid this code duplication by taking advantage of the functor nature of Parse. Our functor takes the result of a parse and applies a function to it, so what we need is a function that turns a Word8 into a Char.

w2c :: Word8 -> Char
w2c = chr . fromIntegral

parseChar :: Parse Char
parseChar = w2c <$> parseByte

We can also use functors to write a compact “peek” function. This returns Nothing if we're at the end of the input string. Otherwise, it returns the next character without consuming it (i.e. it inspects, but doesn't disturb, the current parsing state).

peekByte :: Parse (Maybe Word8)
peekByte = (fmap fst . uncons . string) <$> getState

The same lifting trick that let us define parseChar lets us write a compact definition for peekChar.

peekChar :: Parse (Maybe Char)
peekChar = fmap w2c <$> peekByte

Notice that peekByte and peekChar each make two calls to fmap, one of which is disguised as (<$>). This is necessary because the type Parse (Maybe a) is a functor within a functor. We thus have to lift a function twice to “get it into” the inner function.

Finally, we'll write another generic combinator, which is the Parse analogue of the familiar takeWhile: it consumes its input while its predicate returns True.

parseWhile :: (Word8 -> Bool) -> Parse [Word8]
parseWhile p = (fmap p <$> peekByte) ==> \mp ->
               if mp == Just True
               then parseByte ==> \b ->
                    (b:) <$> parseWhile p
               else identity []

Once again, we're using functors in several places (doubled up, when necessary) to reduce the verbosity of our code. Here's a rewrite of the same function in a more direct style that does not use functors.

parseWhileVerbose p =
    peekByte ==> \mc ->
    case mc of
      Nothing -> identity []
      Just _ -> parseByte ==> \b ->
                if p b
                then parseWhileVerbose p ==> \bs ->
                     identity (b:bs)
                else identity []

The more verbose definition is likely easier to read when you are less familiar with functors. However, use of functors is sufficiently common in Haskell code that the more compact representation should become second nature (both to read and to write) fairly quickly.

[Tip]Hanging lambdas

The definitions of both parsing functions above have a visual style to them that we haven't discussed before. Each contains anonymous functions in which the parameters and >- sit at the end of a line, with the function's body following on the next line.

This style of laying out an anonymous function doesn't have an official name, so let's call it a “hanging lambda”. Its main use is to make room for more text in the body of the function. It also makes it a little visually clearer that there's a relationship between one function and the one that follows. Most often, either the first function is taking the second function as a parameter, or its result is being passed as a parameter to the second.

Rewriting our PGM parser

With our new parsing code, what does the raw PGM parsing function look like now?

parseRawPGM =
    parseWhileWith w2c (/= '\n') ==> \header -> skipSpaces ==>&
    assert (header == "P5") "invalid raw header" ==>&
    parseNat ==> \width -> skipSpaces ==>&
    parseNat ==> \height -> skipSpaces ==>&
    parseNat ==> \maxGrey ->
    parseByte ==>&
    parseBytes (width * height) ==> \bitmap ->
    identity (Greymap width height maxGrey bitmap)

This definition makes use of a few more helper functions that we present here, following a pattern that should by now be familiar.

(==>&) :: Parse a -> Parse b -> Parse b
p ==>& f = p ==> \_ -> f

skipSpaces :: Parse ()
skipSpaces = parseWhileWith w2c isSpace ==>& identity ()

assert :: Bool -> String -> Parse ()
assert True  _   = identity ()
assert False err = bail err

The (==>&) combinator chains parsers like (==>), but the right hand side ignores the result from the left. The assert function lets us check a property, and abort parsing with a useful error message if the property is False.

Notice how few of the functions that we have written make any reference to the current parsing state. Most notably, where our old parseP5 function explicitly passed two-tuples down the chain of dataflow, all of the state management in parseRawPGM is hidden from us.

Of course, we can't completely avoid inspecting and modifying the parsing state. Here's a case in point, the last of the helper functions needed by parseRawPGM.

parseBytes :: Int -> Parse L.ByteString
parseBytes n =
    getState ==> \st ->
    let n' = fromIntegral n
        (h, t) = L.splitAt n' (string st)
        st' = st { offset = offset st + L.length h, string = t }
    in putState st' ==>&
       assert (L.length h == n') "end of input" ==>&
       identity h

Conclusions

In this chapter, we started with a naive file parser, and successively refined it into code that is at once less brittle and more general. Although our initial task was parsing graphics files, the combinator-based library that we developed along the way is not tied to graphics.

Our main theme in this chapter has been abstraction. We found passing explicit state down a chain of functions to be unsatisfactory, so we abstracted this detail away. We noticed some recurring needs as we worked out our parsing code, and abstracted those into common functions. Along the way, we introduced the notion of a functor as offering a generalised way to map over a container type.

We will revisit parsing in Chapter 19, Using Parsec, to discuss Parsec, a widely used and flexible parsing library. And in Chapter 16, Monads, we will return to our theme of abstraction.

Exercises

1.

Write a parser for “plain” PGM files.

2.

In our description of “raw” PGM files, we omitted a small detail. If the “maximum grey” value in the header is less than 256, each pixel is represented by a single byte. However, it can range up to 65535, in which case each pixel will be represented by two bytes, in big endian order (most significant byte first).

Rewrite the raw PGM parser to accommodate both the single- and double-byte pixel formats.

3.

Extend your parser so that it can identify a raw or plain PGM file, and parse the appropriate file type.

Want to stay up to date? Subscribe to the comment feed for this chapter, or the entire book.

Copyright 2007 Bryan O'Sullivan, Don Stewart, and John Goerzen. This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 License. Icons by Paul Davey aka Mattahan.