PeggyでHTMLパーサー(手抜き)を書いてみた

PEGとか知らないけどPeggy - The Parser Generator for Haskellを見ながら書いてみた。
思うところは色々あるけどもう今日は終わりたいから書いておく。

{-# LANGUAGE TemplateHaskell, QuasiQuotes, FlexibleContexts #-}

import Text.Peggy

main = print . parseHtml =<< getContents

parseHtml :: String -> Either ParseError HTML
parseHtml = parseString document "HTML"

data HTML = TextNode String
          | ElemNode { name :: String, attributes :: [(String,String)], children :: [HTML] }
          | Doctype String
          | Document [HTML]
          deriving (Show)

[peggy|
document :: HTML
    = nodes { Document $1 }

nodes :: [HTML]
    = (node, [\r\n \t]*)

node :: HTML
    = '<!DOCTYPE ' [^<>]+ '>' { Doctype $1 }
    / soloTag
    / coupleTag
    / [^<>]+ { TextNode $1 }

soloTag :: HTML
    = '<' soloTagList [ \t]* (attribute, [ \t]*) [ \t]* '/'? '>' { ElemNode $1 $3 [] }

soloTagList :: String
    = 'meta' { "meta" }
    / 'link' { "link" }
    / 'img' { "img" }
    / 'input' { "input" }
    / 'hr' { "hr" }
    / 'br' { "br" }

coupleTag :: HTML
    = '<' tagName [ \t]* (attribute, [ \t]*) [ \t]* '>' [\r\n \t]* nodes [\r\n \t]* '</' tagName '>' { ElemNode $1 $3 $6 }

tagName :: String
    = [^<>\r\n \t/!]+

attribute :: (String, String)
    = attributeName '=' attributeValue

attributeName :: String
    = [^ \r\n\t&<>=]+

attributeValue :: String
    = '\"' [^\"\r\n]* '\"'
|]

上記パーザーに下のHTMLを食わせる。

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="ja">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf8">
        <title>hoge</title>
    </head>
    <body>
        <div id="header" class="mark">
            <div id="logo"><img src="logo.png"/></div>
        </div>
        <div id="body">
            <div id="content">
                <p>foobar</p>
            </div>
        </div>
        <div id="footer">
            <div id="copyright">Copyright &copy; hogehoge</div>
        </div>
    </body>
</html>

そうするとこんな結果。

Right (
  Document [
    Doctype "html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\"",
    ElemNode {name = "html", attributes = [("lang","ja")], children = [
      ElemNode {name = "head", attributes = [], children = [
        ElemNode {name = "meta", attributes = [("http-equiv","Content-Type"),("content","text/html; charset=utf8")], children = []},
        ElemNode {name = "title", attributes = [], children = [
          TextNode "hoge"
        ]}
      ]},
      ElemNode {name = "body", attributes = [], children = [
        ElemNode {name = "div", attributes = [("id","header"),("class","mark")], children = [
          ElemNode {name = "div", attributes = [("id","logo")], children = [
            ElemNode {name = "img", attributes = [("src","logo.png")], children = []}
          ]}
        ]},
        ElemNode {name = "div", attributes = [("id","body")], children = [
          ElemNode {name = "div", attributes = [("id","content")], children = [
            ElemNode {name = "p", attributes = [], children = [
              TextNode "foobar"
            ]}
          ]}
        ]},
        ElemNode {name = "div", attributes = [("id","footer")], children = [
          ElemNode {name = "div", attributes = [("id","copyright")], children = [
            TextNode "Copyright &copy; hogehoge"
          ]}
        ]}
      ]}
    ]}
  ]
)

あ、ドキュメントタイプがXHTMLだった。
もっと綺麗に書きたいなー。