Chapter 6. Writing a library: working with JSON data

Table of Contents

A whirlwind tour of the JSON language
Representing JSON data in Haskell
The anatomy of a Haskell module
Compiling Haskell source
Generating a Haskell program, and importing modules
Printing JSON data
Another look at rendering JSON
Developing Haskell code without going nuts
Beware of type inference
Pretty printing a string
Arrays and objects, and the module header
Writing a module header, and good defensive coding
Fleshing out the pretty printing library
Compact rendering
True pretty printing
Following the pretty printer
Exercises
Creating a package
Writing a package description
GHC's package manager
Setting up, building, and installing
Practical pointers, and further reading

In this chapter, we'll take a look at developing a small, but complete, Haskell library.

A whirlwind tour of the JSON language

The JSON (JavaScript Object Notation) language is a small, simple representation for structured data. Its most common use is to transfer data from a web service to a browser-based JavaScript application.

JSON supports four basic types of value: strings, numbers, booleans, and a special value named null.

"a string"
12345
true
null

The language also supports compound values. Its compound types are the array—an ordered sequence of values—and the struct—an unordered collection of name/value pairs. The values in a struct or array can be of any type.

[-3.14, true, null, "a string"]
{"numbers": [1,2,3,4,5], "useful": false}

Representing JSON data in Haskell

To work with JSON data in Haskell, we use an algebraic data type to represent the range of possible JSON types. In a text editor, create a file named SimpleJSON.hs, and insert the following contents.

data JValue = JString String
            | JNumber Double
            | JBool Bool
            | JNull
            | JObject [(String, JValue)]
            | JArray [JValue]
              deriving (Eq, Ord, Show)

Here, we associate each JSON type with a distinct constructor. Some of these constructors have parameters: if we want to construct a JSON string, we must provide a String value as an argument to the JString constructor.

To start experimenting with this code, save the file SimpleJSON.hs in your editor, switch to a ghci window, and load the file into ghci.

ghci> :load SimpleJSON
Ok, modules loaded: SimpleJSON.
ghci> JString "foo"
JString "foo"
ghci> JNumber 2.7
JNumber 2.7
ghci> :type JBool True
JBool True :: JValue

We can see how to use a constructor to take a normal Haskell value and turn it into a JSON value. To do the reverse, we use pattern matching. Here's a function that we can add to SimpleJSON.hs that will extract a string from a JSON value for us. If the JSON value actually contains a string, our function will wrap the string with the Just constructor, otherwise it will return Nothing.

getString :: JValue -> Maybe String
getString (JString s) = Just s
getString _ = Nothing

If we save the modified source file, we can reload it in ghci and try the new definition.

ghci> :reload SimpleJSON
Ok, modules loaded: SimpleJSON.
ghci> getString (JString "hello")
Just "hello"
ghci> getString (JNumber 3)
Nothing

A few more accessor functions, and we've got a small body of code to work with.

getInt (JNumber n) = Just (truncate n)
getInt _ = Nothing

getDouble (JNumber n) = Just n
getDouble _ = Nothing

getBool (JBool b) = Just b
getBool _ = Nothing

getObject (JObject o) = Just o
getObject _ = Nothing

getArray (JArray a) = Just a
getArray _ = Nothing

isNull v = v == JNull

The anatomy of a Haskell module

A Haskell source file contains a single module. A module lets us determine which namesinside the module are accessible from other modules.

Normally, a source file begins with a module declaration.

module SimpleJSON
    (
      JValue(..)
    , getString
    , getInt
    , getDouble
    , getBool
    , getObject
    , getArray
    , isNull
    ) where

The word module is reserved. It is followed by the name of the module, which must begin with a capital letter. By convention, a source file has the same base name (the component before the suffix) as the module it contains, which is why our file SimpleJSON.hs contains the module SimpleJSON.

Following the module name is a list of exports, enclosed in parentheses. The where keyword indicates that the body of the module follows.

The list of exports indicates which names in this module are visible from other modules. This lets us keep private code hidden from the outside world. The special notation (..) that follows the name JValue indicates that we are exporting both the type and all of its constructors.

It might seem strange that we can export a type's name, but not its constructors. The ability to do this is important: it lets us hide the details of a type from its users, making the type abstract. If we can't see a type's constructors, we can't pattern match against a value of that type, nor can we construct a new value of that type. Later in this chapter, we'll discuss some situations in which we might want to make a type abstract.

If we omit the list of exports (including the parentheses) from a module declaration, every name in the module will be exported: module SimpleJSON where .... To export no names at all (only infrequently useful), we write an empty list using a pair of parentheses: module SimpleJSON () where ....

Compiling Haskell source

In addition to the ghci interpreter; the GHC distribution includes an optimising Haskell compiler, named ghc. If you're already familiar with a command line compiler such as gcc or cl (the compiler component of Microsoft's Visual Studio), you'll immediately be at home with ghc.

To compile a source file, we first open a terminal or command prompt window, then invoke ghc with the name of the source file to compile.

ghc -c SimpleJSON.hs

The -c option tells ghc to only generate object code. If we were to omit the -c option, the compiler would attempt to generate a complete executable. That would in turn fail, because we haven't written a main function yet.

After ghc completes, if we list the contents of the directory, it should contain two new files: SimpleJSON.hi and SimpleJSON.o (SimpleJSON.obj on Windows). The former is an interface file, in which ghc stores information about the names exported from our module in machine-readable form. The latter is an object file, which contains the generated code.

Generating a Haskell program, and importing modules

Now that we've successfully compiled our minimal library, we'll write a tiny program to exercise it. Create the following file in your text editor, and save it as Main.hs.

module Main () where

import SimpleJSON

main = print (JObject [("foo", JNumber 1), ("bar", JBool False)])

Notice the import directive that follows the module declaration. This indicates that we want to take all of the names that are exported from the SimpleJSON module, and make them available in our module. Any import directives must appear together at the beginning of a module. We cannot scatter them throughout a source file.

Our choice of naming for the source file and function is deliberate. In order to create an executable, ghc needs a module named Main, and it must contain a function named main. The main function is the one that will be called when we run the program once we've built it.

ghc -o simple Main.hs SimpleJSON.o

This time around, we're omitting the -c option when we invoke ghc, so it will attempt to generate an executable. The process of generating an executable is called linking. As our command line suggests, ghc is perfectly able to both compile source files and link an executable in a single invocation.

We're passing ghc a new option, -o, which takes one argument: this is the name of the executable that ghc should create. Here, we've decided to name the program simple. On Windows, the program will have the suffix .exe, but on Unix variants there will not be a suffix.

Finally, we're passing the name of our new source file, Main.hs, and the object file we already compiled, SimpleJSON.o. We must explicitly list every one of our files that contains code that should end up in the executable. If we forget a source or object file, ghc will complain about “undefined symbols”.

When compiling, we can pass ghc any mixture of source and object files. If ghc notices that it has already compiled a source file into an object once, and we invoke it a second time, it will only recompile the source file if we've modified it.

Once ghc has finished compiling and linking our simple program, we can run it from the command line.

Printing JSON data

Now that we have a Haskell representation for JSON's types, we'd like to be able to take Haskell values and render them as JSON data.

There are a few ways we could go about this. Perhaps the most direct would be to write a rendering function that prints a value in JSON form. We'll quickly show how to do this, because it's about time we started interacting with the outside world. Once we're done, we'll explore some more interesting approaches.

Since this will be our first example of printing data, we'll be introducing some new functions and notation below. Rather than cover these in depth, we'll skim over them, and defer a more extended treatment until Chapter 9, I/O.

module PutJSON where

import Control.Monad (forM_)
import SimpleJSON

putJValue :: JValue -> IO ()

putJValue (JString s) = putStr (show s)
putJValue (JNumber n) = putStr (show n)
putJValue (JBool True) = putStr "true"
putJValue (JBool False) = putStr "false"
putJValue JNull = putStr "null"

The result type of putJValue is IO (), where IO indicates that the function performs I/O. The putStr function prints a string, and show returns a string representation of a Haskell value.

Clearly, printing a simple JSON value is easy.

ghci> :load PutJSON
[2 of 2] Compiling PutJSON          ( PutJSON.hs, interpreted )
Ok, modules loaded: SimpleJSON, PutJSON.
ghci> putJValue (JBool True)
trueghci> putJValue (JString "foo")
"foo"

A compound value requires a little more work, so we can ensure that it's formatted correctly.

putJValue (JObject xs) = do
    putChar '{'
    case xs of
      [] -> putStr ""
      (p:ps) -> do putPair p
                   forM_ ps $ \q -> do putStr ", "
                                       putPair q
    putChar '}'
  where putPair (k,v)   = do putStr (show k)
                             putStr ": "
                             putJValue v

We've introduced one unfamiliar piece of notation here, the do keyword. In this context, it lets us perform a series of actions in sequence.

To print a JObject value, we begin by printing an opening brace. We must then determine whether we have zero or more than zero key/value pairs to print, in order to get the formatting right. If we have no pairs to print, we print nothing[4]. Otherwise, we print the first pair, then we loop over all the pairs that follow, and print each one preceded by a comma. (The forM_ function takes a list and a function that can perform I/O, and applies the function to every element of the list.)

Another look at rendering JSON

If simply printing JSON data is both obvious and easy, why might we want to consider doing something else? For example, we could easily to modify our printing code above to output to an arbitrary file handle. However, if we wanted to compress the data somehow before writing it out, we could not as easily adapt the code to do this.

If we separate the rendering from what we do with the rendered data, we grant ourselves more flexibility. There are several Haskell libraries that handle data compression, but they all do so by providing very simple compression and decompression functions: a compression function takes an uncompressed string and returns a compressed string. If we write a function that takes JSON data and produces a rendered string, we can build a pipeline: we pass the rendered JSON into our desired compression function, and get compressed, rendered JSON back.

To render JSON data, we'll begin by assuming that we already have a generic rendering library: we'll develop its skeleton as we go. We'll call this module Prettify, so its code will go into a source file named Prettify.hs. After we're done writing our client code, we'll go back and fill in the details of the Prettify module.

Instead of rendering straight to a string, we'll make our JSON renderer work with values of a type that we'll call Doc. The renderer won't be able to see any of the internals of the Doc type: instead, it will call functions from our rendering library, which will hide the details from the client.

By basing our generic rendering library on this abstract Doc type, we can choose an implementation that is flexible and efficient.

We'll name our JSON rendering function jvalue. Rendering one of the basic JSON values is a straightforward business.

jvalue :: JValue -> Doc
jvalue (JBool True) = text "true"
jvalue (JBool False) = text "false"
jvalue JNull = text "null"
jvalue (JNumber num) = double num
jvalue (JString str) = string str

We'll write the text and double functions as part of our Prettify module.

Developing Haskell code without going nuts

Early on, as we come to grips with Haskell development, we have so many new, unfamiliar concepts to keep track of at one time that it can be a challenge to write code that compiles at all.

As we write our first substantial body of code, it's a huge help to pause every few minutes and try to compile what we've produced so far. Because Haskell is so strongly typed, if our code compiles cleanly, we're assuring ourselves that we're not wandering too far off into the programming weeds.

One useful technique for quickly developing the skeleton of a program is to write stub versions of functions. For example, we mentioned above that our text and double functions would be provided by our Prettify module. If we don't provide definitions for those functions, our attempts to “compile early, compile often” with our JSON renderer will fail, as the compiler won't know anything about those functions. To avoid this problem, we write stub functions that don't do anything.

text :: String -> Doc
text str = undefined

double :: Double -> Doc
double num = undefined

The special value undefined always typechecks, no matter where we use it, but it will cause our program to crash if we attempt to evaluate it.

ghci> :type double
double :: Double -> Doc
ghci> double 3.14
*** Exception: Prelude.undefined

By providing stub definitions for these functions, we allow ourselves to compile our code early on, even though we won't yet be able to run it. By doing this, we can take advantage of the compiler's type checker to ensure that our program is sensibly typed.

Beware of type inference

A Haskell compiler's ability to infer types is both powerful and valuable. Early on, you'll probably be faced by a strong temptation to take advantage of type inference by omitting as many type declarations as possible: let's simply make the compiler figure the whole lot out!

There is, however, a downside to skimping on explicit type information, and it has a disproportionate effect on the new Haskell programmer. While we work to gain some experience, we're initially extremely likely to write code that will fail to compile due to straightforward type errors. When we leave out type information, we give the compiler more leeway. It will infer types that are logical and consistent, but quite possibly not at all what we thought we were actually using. When what we think we're doing diverges from what the compiler infers, the error messages that result can be very difficult to interpret: not only are we doing something wrong, but the compiler's attempt to make sense of what we've tried can sometimes briefly lead us astray.

Consider an example: let's say we write a function that we think returns a String, but we don't write a type signature for it.

upcaseFirst (c:cs) = toUpper c -- forgot ":cs" here

Here, we want to upper-case the first character of a word, but we've forgotten to append the rest of the word onto the result. We think our function's type is String -> String, but the compiler will correctly infer its type as String -> Char. Let's say we then try to use this function somewhere else.

camelCase :: String -> String
camelCase xs = concat (map upcaseFirst (words xs))

When we try to compile this code or load it into ghci, we won't necessarily get an obvious error message.

ghci> :load Trouble
[1 of 1] Compiling Main             ( Trouble.hs, interpreted )

Trouble.hs:9:27:
    Couldn't match expected type `[Char]' against inferred type `Char'
      Expected type: [Char] -> [Char]
      Inferred type: [Char] -> Char
    In the first argument of `map', namely `upcaseFirst'
    In the first argument of `concat', namely
        `(map upcaseFirst (words xs))'
Failed, modules loaded: none.

Notice that the error is reported where we use the upcaseFirst function. If we're erroneously convinced that our definition and type for upcaseFirst are correct, we may end up staring at the wrong piece of code for quite a while, until enlightenment strikes.

Every time we write a type signature, we remove a degree of freedom from the type inference engine. This reduces the likelihood of divergence between our understanding of our code and the compiler's. Type declarations also act as shorthand for ourselves as readers of our own code, making it easier for us to do our own mental inference of what must be going on.

This is not to say that we need to pepper every tiny fragment of code with a type declaration. It is, however, usually good form to add a signature to every top-level function and variable in our code. It's best to start out fairly aggressive with explicit type signatures, and slowly ease back as your mental model of how type checking works becomes more accurate.

[Tip]Explicit types and undefined values

Because the special value undefined will happily typecheck no matter where we use it, it's especially important to write type signatures when we start slinging it around. If we use undefined to stub in a temporary definition for a function, and we don't write a type signature for the function, we'll potentially be able to use that stubbed definition in completely nonsensical contexts, thus planting the seed of later confusion.

Pretty printing a string

When it comes to to pretty printing a string value, our code gets a little complicated: JSON strings have some moderately involved escaping rules that we must follow.

At the highest level, a string is just a series of characters wrapped in quotes.

string :: String -> Doc
string = enclose '"' '"' . hcat . map oneChar

The enclose function simply wraps a Doc value with an opening and closing character.

enclose :: Char -> Char -> Doc -> Doc
enclose left right x = char left <> x <> char right

To do this, it uses the (<>) function from our pretty printing library, which appends two Doc values: it's the Doc equivalent of (++).

(<>) :: Doc -> Doc -> Doc
a <> b = undefined

char :: Char -> Doc
char c = undefined

Our pretty printing library also provides hcat, which concatenates multiple Doc values into one: it's the analogue of concat for lists.

hcat :: [Doc] -> Doc
hcat xs = undefined

Our string function thus applies the as yet unseen oneChar function to every character in a string, concatenates the lot, and encloses the result in quotes. What does oneChar look like?

oneChar :: Char -> Doc
oneChar c = case lookup c simpleEscapes of
              Just r -> text r
              Nothing | mustEscape c -> hexEscape c
                      | otherwise    -> char c
    where mustEscape c = c < ' ' || c == '\x7f' || c > '\xff'

simpleEscapes :: [(Char, String)]
simpleEscapes = zipWith ch "\b\n\f\r\t\\\"/" "bnfrt\\\"/"
    where ch a b = (a, '\\':[b])

The specialEscapes value is a list of pairs that we call an association list, or alist for short: it maps a few characters to simple escaped representations.

ghci> take 4 simpleEscapes
[('\b',"\\b"),('\n',"\\n"),('\f',"\\f"),('\r',"\\r")]

Our case expression attempts to see if our character has a match in this alist. If we find the match, we emit it, otherwise we might need to escape the character in a more complicated way. If so, we perform this more complicated escaping. Only if neither kind of escaping is required do we emit the plain character.

The more complicated escaping involves turning a character into the string “\u” followed by a four-character sequence of hexadecimal digits representing the numeric value of the Unicode code point.

smallHex :: Int -> Doc
smallHex x = text "\\u" <> text (replicate (4 - length h) '0') <> text h
    where h = showHex x ""

The showHex function comes from the Numeric library, and returns a hexadecimal representation of a number.

ghci> showHex 114111 ""
"1bdbf"

The replicate function is provided by the Prelude, and builds a fixed-length repeating list of its argument.

ghci> replicate 5 "foo"
["foo","foo","foo","foo","foo"]

There's a wrinkle: the four-digit encoding that smallHex provides can only represent Unicode code points up to 0xffff. This portion of the Unicode code space is called the basic multilingual plane: it contains most of the world's common characters and glyphs. However, the Unicode code space extends up to 0x10ffff: values between 0x10000 and 0x10ffff are said to inhabit one of the astral planes.

To properly represent an astral plane code point in a JSON string, we must decompose it into two surrogate characters. For our purposes, we don't need to care what a surrogate character is: we'll look at it instead as an opportunity to perform some bit-level manipulation of Haskell numbers.

astral :: Int -> Doc
astral n = smallHex (a + 0xd800) <> smallHex (b + 0xdc00)
    where a = (n `shiftR` 10) .&. 0x3ff
          b = n .&. 0x3ff

The shiftR function comes from the Data.Bits module, and shifts a number to the right. The (.&.) function, also from Data.Bits, performs a bit-level and of two values.

ghci> 0x10000 `shiftR` 4   :: Int
4096
ghci> 7 .&. 2   :: Int
2

Now that we've written smallHex and astral, we can provide a definition for hexEscape.

hexEscape :: Char -> Doc
hexEscape c | d < 0x10000 = smallHex d
            | otherwise = astral (d - 0x10000)
  where d = fromEnum c

Arrays and objects, and the module header

Compared to strings, pretty printing arrays and objects is a snap. We already know that the two are visually similar: each starts with an opening character, followed by a series of values separated with commas, followed by a closing character. Let's write a function that captures the common structure of arrays and objects.

series :: Char -> Char -> (a -> Doc) -> [a] -> Doc
series open close item = enclose open close
                       . fsep . punctuate (char ',') . map item

We'll start by interpreting this function's type. It takes an opening and closing character, then a function that knows how to pretty print a value of some unknown type a, followed by a list of values of type a, and it returns a value of type Doc.

We have already written enclose, a which wraps a Doc value in opening and closing characters. The fsep function will live in our Prettify module. It combines a series of Doc values into one, possibly wrapping lines if the output will not fit on a single line. The punctuate function will also live in our Prettify module, and we can define it in terms of functions for which we've already written stubs.

punctuate :: Doc -> [Doc] -> [Doc]
punctuate p [d]    = [d]
punctuate p (d:ds) = (d <> p) : punctuate p ds
punctuate p _      = []

With this definition of series, pretty printing an array is entirely straightforward. We add this equation to the end of the block we've already written for our jvalue function.

jvalue (JArray ary) = series '[' ']' jvalue ary

To pretty print an object, we need to do only a little more work: for each element, we have both a name and a value to deal with.

jvalue (JObject obj) = series '{' '}' field obj
    where field (name,val) = string name <> text ": " <> jvalue val

Writing a module header, and good defensive coding

The header at the beginning of our PrettyJSON.hs source file looks like this.

module PrettyJSON
    (
      jvalue
    ) where

import Numeric (showHex)
import Data.Bits (shiftR, (.&.))

import SimpleJSON (JValue(..))
import Prettify (Doc, (<>), char, double, fsep, hcat, punctuate, text,
                 compact, pretty)

We're only exporting one name from this module: jvalue, our pretty printing function. The other functions in the module are purely present to support jvalue, so there's no reason to make them visible to other modules.

Regarding imports, the Numeric and Data.Bits modules are distributed with GHC. We've already written the SimpleJSON module, and filled our Prettify module with skeletal definitions. Notice that there's no difference in the way we import standard modules from those we've written ourselves.

With each import directive, we explicitly list each of the names we want to bring into our module's namespace. This is not required: if we omit the list of names, all of the names exported from a module will be available to us. However, it's generally a good idea to write an explicit import list.

  • An explicit list makes it clear which names we're importing from where. This will make it easier for a reader to look up documentation, if they run across an unfamiliar module.

  • Occasionally, a library maintainer will remove or rename a function. If a function disappears from a third party module that we use, any resulting compilation error is likely to happen long after we've written the module. The explicit list of imported names can act as a reminder to ourselves of where we had been importing the missing name from, which will help us to pinpoint the problem more quickly.

  • It can also occur that someone will add a name to a module that is identical to a name already in our own code. If we don't use an explicit import list, we'll end up with the same name in our module twice. If we use that name, GHC will report an error due to the ambiguity. An explicit list lets us avoid the possibility of accidentally importing an unexpected new name.

This idea of using explicit imports is a guideline that usually makes sense, not a hard-and-fast rule. Occasionally, we'll need so many names from a module that listing each one becomes messy. In other cases, a module might be so widely used that that a moderately experienced Haskell programmer will probably know which names come from that module.

Fleshing out the pretty printing library

In our Prettify module, we represent our Doc type as an algebraic data type.

data Doc = Empty
         | Char Char
         | Text String
         | Line
         | Concat Doc Doc
         | Union Doc Doc
           deriving (Show)

Notice that the Doc type describes a tree: the Concat and Union constructor create an internal node from two other Doc values, while the Empty and other simple constructors build leaves.

In the header of our module, we will export the name of the type, but not any of its constructors: this will prevent modules that use the Doc type from creating and pattern matching against Doc values.

Instead, to create a Doc, a user of the Prettify module will call a function that we provide. Here are the simple construction functions.

empty :: Doc
empty = Empty

char :: Char -> Doc
char c = Char c

text :: String -> Doc
text "" = Empty
text s  = Text s

double :: Double -> Doc
double d = text (show d)

The Line constructor represents a line break. The line function creates hard line breaks, which always appear in the pretty printer's output. Sometimes we'll want a soft line break, which is only used if a line is too wide. We'll introduce a softline function shortly.

line :: Doc
line = Line

Almost as simple as the basic constructors is the (<>) function, which concatenates two Doc values.

(<>) :: Doc -> Doc -> Doc
Empty <> y = y
x <> Empty = x
x <> y = x `Concat` y

Notice that we pattern match against Empty: concatenating a Doc value with Empty on the left or right has no effect. This keeps us from bloating the tree with useless values.

ghci> text "foo" <> text "bar"
Concat (Text "foo") (Text "bar")
ghci> text "foo" <> empty
Text "foo"
ghci> empty <> text "bar"
Text "bar"
[Tip]A mathematical moment

If we briefly put on our mathematical hats, we can say that Empty is the identity under concatenation. Taking the mathematical perspective has useful practical consequences, and we'll explain why later.

Add a reference to wherever we introduce monoids.

Our hcat and fsep functions concatenate a list of Doc values into one. In the section called “Exercises”, we mentioned that we could define concatenation for lists using foldr.

concat :: [[a]] -> [a]
concat = foldr (++) []

Since (<>) is analogous to (:), and empty to [], we can see how we might write hcat and fsep as folds, too.

hcat :: [Doc] -> Doc
hcat = fold (<>)

fold :: (Doc -> Doc -> Doc) -> [Doc] -> Doc
fold f = foldr f empty

The definition of fsep depends on several other functions.

fsep :: [Doc] -> Doc
fsep = fold (</>)

(</>) :: Doc -> Doc -> Doc
x </> y = x <> softline <> y

softline :: Doc
softline = group line

These take a little explaining. The softline function should insert a newline if the current line has become too wide, or a space otherwise. How can we do this if our Doc type doesn't contain any information about rendering? Our answer is that every time we encounter a soft newline, we maintain two alternative representations of the document, using the Union constructor.

group :: Doc -> Doc
group x = flatten x `Union` x

Our flatten function replaces a Line with a space, turning two lines into one longer line.

flatten :: Doc -> Doc
flatten (x `Concat` y) = flatten x `Concat` flatten y
flatten Line           = Char ' '
flatten (x `Union` _)  = flatten x
flatten other          = other

Notice that we always call flatten on the left element of a Union: the left of each Union is always the same width as, or wider than, the right. We'll be making use of this property in our rendering functions below.

[Note]Infix use of constructors

We've used infix notation to represent the Concat and Union constructors in the final two patterns of our case expression. This has no effect on the meaning of our code. Just like using a constructor or function infix in an application, it's merely a small change for readability.

Compact rendering

We frequently need to use a representation for a piece of data that contains as few characters as possible. For example, if we're sending JSON data over a network connection, there's no sense in laying it out nicely: the software on the far end won't care whether the data is pretty or not, and the added white space needed to make the layout look good would add a lot of overhead.

For these cases, and because it's a simple piece of code to start with, we provide a bare-bones compact rendering function.

compact :: Doc -> String
compact x = transform [x]
    where transform [] = ""
          transform (d:ds) =
              case d of
                Empty        -> transform ds
                Char c       -> c : transform ds
                Text s       -> s ++ transform ds
                Line         -> '\n' : transform ds
                a `Concat` b -> transform (a:b:ds)
                _ `Union` b  -> transform (b:ds)

The compact function wraps its argument in a list, and applies the transform helper function to it. The transform function treats its argument as a stack of items to process, where the first element of the list is the top of the stack.

The transform function's (d:ds) pattern breaks the stack into its head, d, and the remainder, ds. In our case expression, the first several branches recurse on ds, consuming one item from the stack for each recursive application. The last two branches add items in front of ds: the Concat branch adds both elements to the stack, while the Union branch ignores its left element, on which we called flatten, and adds its right element to the stack.

We have now fleshed out enough of our original skeletal definitions that we can try out our compact function in ghci.

ghci> let value = jvalue . JObject $ [("f", JNumber 1), ("q", JBool True)]
ghci> :type value
value :: Doc
ghci> putStrLn (compact value)
{"f": 1.0,
"q": true
}

To better understand how the code works, let's look at a simpler example in more detail.

ghci> char 'f' <> text "oo"
Concat (Char 'f') (Text "oo")
ghci> compact (char 'f' <> text "oo")
"foo"

When we apply compact, it turns its argument into a list and applies transform.

  • The transform function receives a one-item list, which matches the (d:ds) pattern. Thus d is the value Concat (Char 'f') (Text "oo"), and ds is the empty list, [].

    Since d's constructor is Concat, the Concat pattern matches in the case expression. On the right hand side, we add Char 'f' and Text "oo" to the stack, and call transformrecursively.

    • The transform function receives a two-item list, again matching the (d:ds) pattern. The variable d is bound to Char 'f', and ds to [Text "oo"].

      The case expression matches in the Char branch. On the right hand side, we use (:) to construct a list whose head is 'f', and whose body is the result of a recursive application of transform.

      • The recursive invocation receives a one-item list. The variable d is bound to Text "oo", and ds to [].

        The case expression matches in the Text branch. On the right hand side, we use (++) to concatenate "oo" with the result of a recursive application of transform.

        • In the final invocation, transform is invoked with an empty list, and returns an empty string.

      • The result is "oo" ++ "".

    • The result is 'f' : "oo" ++ "".

True pretty printing

While our compact function is useful for machine-to-machine communication, its result is not always easy for a human to follow: there's very little information on each line. To generate more readable output, we'll write another function, pretty. Compared to compact, pretty takes one extra argument: the maximum width of a line, in columns. (We're assuming that our typeface is of fixed width.)

pretty :: Int -> Doc -> String

To be more precise, this Int parameter controls the behaviour of pretty when it encounters a softline. Only at a softline does pretty have the option of either continuing the current line or beginning a new line. Elsewhere, we must strictly follow the directives set out by the person using our pretty printing combinators.

Here's the core of our implementation

pretty width x = best 0 [x]
    where best col (d:ds) =
              case d of
                Empty        -> best col ds
                Char c       -> c :  best (col + 1) ds
                Text s       -> s ++ best (col + length s) ds
                Line         -> '\n' : best 0 ds
                a `Concat` b -> best col (a:b:ds)
                a `Union` b  -> nicest col (best col (a:ds))
                                           (best col (b:ds))
          best _ _ = ""

Our best helper function takes two arguments: the number of columns emitted so far on the current line, and the list of remaining Doc values to process.

In the simple cases, best updates the col variable in straightforward ways as it consumes the input. Even the Concat case is obvious: we push the two concatenated components onto our stack/list, and don't touch col.

The interesting case is concerned with the Union constructor. Recall that we applied flatten to the left element, and did nothing to the right. Also, remember that flatten replaces newlines with spaces. Therefore, our job is to see which (ir either) of the two layouts, the flattened one or the original, will fit into our width restriction.

          nicest col a b | (width - least) `fits` a = a
                         | otherwise                = b
                         where least = min width col

To do this, we write a small helper that determines whether a single line of a rendered Doc value will fit into a given number of columns.

fits :: Int -> String -> Bool
w `fits` _ | w < 0 = False
w `fits` ""        = True
w `fits` ('\n':_)  = True
w `fits` (c:cs)    = (w - 1) `fits` cs

Following the pretty printer

In order to understand how this code works, let's first consider a very simple Doc value.

ghci> empty </> char 'a'
Concat (Union (Char ' ') Line) (Char 'a')

We'll call pretty 2 on this value. When we first apply best, the value of col is zero. It matches the Concat case, pushes the values Union (Char ' ') Line and Char 'a' onto the stack, and applies itself recursively. In the recursive application, it matches on Union (Char ' ') Line.

At this point, we're going to ignore Haskell's usual order of evaluation. This keeps our explanation of what's going on simple, without changing the end result. We now have two subexpressions, best 0 [Char ' ', Char 'a'] and best 0 [Line, Char 'a']. The first evaluates to " a", and the second to "\na". We then substitute these into the outer expression to give nicest 0 " a" "\na".

To figure out what the result of nicest is here, we do a little substitution. The values of width and col are 0 and 2, respectively, so least is 0, and width - least is 2. We quickly evaluate 2 `fits` " a" in ghci.

ghci> 2 `fits` " a"
True

Since this evaluates to True, the result of nicest here is " a".

If we apply our pretty function to the same JSON data as earlier, we can see that it produces different output depending on the width that we give it.

ghci> putStrLn (pretty 10 value)
{"f": 1.0,
"q": true
}
ghci> putStrLn (pretty 20 value)
{"f": 1.0, "q": true
}
ghci> putStrLn (pretty 30 value)
{"f": 1.0, "q": true }

Exercises

Our current pretty printer is spartan to fit within our space constraints, but there are a number of useful improvements we can make.

1.

Write a function, fill:

fill :: Int -> Doc -> Doc

It should render the document, then add spaces so that the rendered text is the given number of columns wide. If the text is already wider than this value, it should add no spaces.

2.

Our pretty printer does not take nesting into account. Whenever we open parentheses, braces, or brackets, any lines that follow should be indented so that they are aligned with the opening character until a matching closing character is encountered.

Add support for nesting, with a controllable amount of indentation.

nest :: Int -> Doc -> Doc

Creating a package

The Haskell community has built a standard set of tools, named Cabal, that help with building, installing, and distributing software. Cabal organises software as a package, which consists of one or more libraries or executable programs.

Writing a package description

To do anything with a package, Cabal needs a description of it. This is contained in a text file whose name ends with the suffix .cabal. This file belongs in the top-level directory of your project. It has a simple format, which we'll describe below.

A Cabal package must have a name. Usually, the name of the package matches the name of the .cabal file. We'll call our package mypretty, so our file is mypretty.cabal. (Quite often, the directory that contains the mypretty.cabal file is also called mypretty.)

A package description begins with a series of global properties, which apply to every library and executable in the package.

Name:          mypretty
Version:       0.1

-- This is a comment.  It stretches to the end of the line.

Package names must be unique. If you create and install a package that has the same name as a package already present on your system, GHC will become very confused.

The global properties include a substantial amount of information that intended for human readers, not Cabal itself.

Synopsis:      My pretty printing library, with JSON support
Description:
  A simple pretty printing library that illustrates how to
  develop a Haskell library.
Author:        Real World Haskell
Maintainer:    nobody@realworldhaskell.org

As the Description field indicates, a field can span multiple lines, provided they're indented.

Also included in the global properties is license information. Most Haskell packages are licensed under the BSD license, which Cabal calls BSD3[5]. (Obviously, you're free to choose whatever license you think is appropriate.) The optional License-File field lets us specify the name of a file that contains the exact text of our package's licensing terms.

The features supported by successive versions of Cabal evolve over time, so it's wise to indicate what versions of Cabal we expect to be compatible with. The features we are describing are supported by versions 1.2 and higher of Cabal.

Cabal-Version: >= 1.2

To describe an individual library within a package, we write a library section.

library
  Exposed-Modules: Prettify
                   PrettyJSON
                   SimpleJSON
  Build-Depends:   base >= 2.0

The Exposed-Modules field contains a list of modules that should be available to users of this package. An optional field, Other-Modules, contains a list of internal modules, which are required for this library to function, but shouldn't be visible to users.

The Build-Depends field contains a comma-separated list of packages that our library requires to build. For each package, we can optionally specify the range of versions with which this library is known to work. The base package contains many of the core Haskell modules, such as the Prelude, so it's effectively always required.

[Tip]Figuring out build dependencies

We don't have to guess or do any research to establish which packages we depend on. If we try to build our package without a Build-Depends field, compilation will fail with a useful error message. Here's an example where we commented out the dependency on the base package.

$ runghc Setup build
Preprocessing library mypretty-0.1...
Building mypretty-0.1...

PrettyJSON.hs:8:7:
    Could not find module `Data.Bits':
      it is a member of package base, which is hidden

The error message makes it clear that we need to add the base package, even though base is already installed. Forcing us to be explicit about every package we need has a practical benefit: a Cabal tool named cabal-install can automatically download, build, and install a package and all of the packages it depends on.

GHC's package manager

GHC includes a simple package manager. It knows which packages are installed, and what versions of those packages are installed. A command line tool named ghc-pkg lets us work with its package databases.

We say databases because GHC distinguishes between system-wide packages, which are available to every user, and per-userpackages, which are only visible to the current user.

In practice, since most computers are single-user, the per-user database is mostly useful as a sandbox for testing, and to avoid the hassle of using administrative privileges to install packages.

The ghc-pkg command provides subcommands to address different tasks. Most of the time, we'll only need two of them. The ghc-pkg list command lets us see what packages are installed. When we want to uninstall a package, ghc-pkg unregister tells GHC that we won't be using a particular package any longer.

Setting up, building, and installing

In addition to a .cabal file, a package must contain a setup file. This allows Cabal's build process to be heavily customised, if a package needs it. The simplest setup file looks like this.

#!/usr/bin/env runhaskell
> import Distribution.Simple
> main = defaultMain

We save this file under the name Setup.lhs, to indicate that it is a literate Haskell file[6].

Once we have the .cabal and Setup.lhs files written, we have three steps left.

To instruct Cabal how to build and where to install a package, we run a simple command.

$ runghc Setup configure

This ensures that the packages we need are available, and stores settings to be used later by other Cabal commands.

If we do not provide any arguments to configure, Cabal will install our package in the system-wide package database. To install it into our home directory and our personal package database, we must provide a little more information.

$ runghc Setup configure --prefix=$HOME --user

Following the configure step, we build the package.

$ runghc Setup build

If this succeeds, we can install the package. We don't need to indicate where to install to: Cabal will use the settings we provided in the configure step.

$ runghc Setup install

Practical pointers, and further reading

GHC already bundles a pretty printing library, Text.PrettyPrint.HughesPJ. It provides the same basic API as our example, but a much richer and more useful set of pretty printing functions. We recommend using it, rather than writing your own.

The design of the HughesPJ pretty printer was introduced by John Hughes in [Hughes95]. The library was subsequently improved by Simon Peyton Jones, hence the name. Hughes's paper is long, but well worth reading for his discussion of how to design a library in Haskell.

In this chapter, our pretty printing library is based on a simpler model described by Philip Wadler in [Wadler98]. His library was extended by Daan Leijen; this version is available for download from Hackage.



[4] If you're already somewhat familiar with Haskell, you'll know that this is not an idiomatic way to do nothing. We'll introduce return () in Chapter 9, I/O.

[5] The “3” in BSD3 refers to the number of clauses in the license. An older version of the BSD license contained 4 clauses, but it is no longer used.

[6] We use literate Haskell notation so that on a Linux or Unix system, if we make the file executable, it will be treated as a script, and we can run it directly. This is probably the only time you'll see literate Haskell in this book.

Want to stay up to date? Subscribe to the comment feed for this chapter, or the entire book.

Copyright 2007 Bryan O'Sullivan, Don Stewart, and John Goerzen. This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 License. Icons by Paul Davey aka Mattahan.