Monday, August 8, 2011

Cleaning Up MOBI TOC


I ended last night with no fewer than 3 Tables of Contents in the mobi version of SPDY Book as generated by git-scribe. Today, I hope to get at least one of them in shape to serve as the TOC that is recognized by the Kindle.

To be fair, one of the generated TOCs, the NCX (Navigation Control file for XML applications), is already produced correctly by git-scribe and read correctly by the Kindle. The Kindle does not use this file when you "Go to Table of Contents". Rather, it uses it to draw the chapter markers in the progress meter at the bottom of the display.

The TOC that readers actually see is stored in a file named toc.html (the filename is declared in a separate book.opf file). Git-scribe 0.0.9 is fairly adept at generating this file, although preface material (introduction, copyright notice, acknowledgements) confuses the chapter numbering. My switch from asciidoc to a2x for generating the HTML has further confused git-scribe's toc.html generation: the chapters are already numbered, so I end up with things like "Chapter 8: 4. SPDY Push".

Happily, the a2x command produces a pretty nice TOC, albeit directly embedded in the book HTML. So I extract that TOC out of the book HTML and into a toc.html file:
def extract_toc
  content = File.read("book.html")

  File.open("book.html", 'w') do |f|
    f.write content.sub(%r|<div class="toc">.+?</dl></div>|m, '')
  end

  toc = Regexp.last_match[0].
  gsub(/href="#/, 'href="book.html#')

  File.open("toc.html", 'w') do |f|
    f.puts('<?xml version="1.0" encoding="UTF-8"?>')
    f.puts('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Table of Contents</title></head><body>')
    f.puts toc
    f.puts('</body></html>')
  end
end
The part at the beginning, slurping the entire book into memory and using a regular expression to replace the TOC with nothing, is a bit on the oogy side:
content = File.read("book.html")

File.open("book.html", 'w') do |f|
  f.write content.sub(%r|<div class="toc">.+?</dl></div>|m, '')
end
I live with this for now because, for my 150+ page SPDY Book, I am not seeing a significant performance hit. The actual contents of the TOC are now in Regexp.last_match (the match data left behind by the sub call). Since the TOC will now live in a separate file from the actual book, its fragment-only links (href="#...") need to point at book.html:
toc = Regexp.last_match[0].
  gsub(/href="#/, 'href="book.html#')
The rest is just a matter of writing to the toc.html.
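One gap in extract_toc as written: if book.html contains no `<div class="toc">` block, Regexp.last_match is nil and the `[0]` call raises NoMethodError. Here is a minimal sketch of the same extraction on an in-memory string, with the match captured explicitly. The helper name split_toc and the sample HTML are hypothetical, made up for illustration; they are not part of git-scribe.

```ruby
# Hypothetical helper (not part of git-scribe): splits a2x-style HTML into
# the book body without its embedded TOC, plus the TOC with rewritten links.
TOC_PATTERN = %r|<div class="toc">.+?</dl></div>|m

def split_toc(html, book_file = "book.html")
  match = TOC_PATTERN.match(html)
  return [html, nil] unless match   # no TOC found: leave the book untouched

  # Everything before and after the matched TOC block.
  body = match.pre_match + match.post_match
  # Point fragment-only links at the book file.
  toc  = match[0].gsub(/href="#/, "href=\"#{book_file}#")
  [body, toc]
end

# Made-up sample input standing in for a2x output.
sample = <<~HTML
  <body>
  <div class="toc"><dl><dt><a href="#chapter_intro">1. Intro</a></dt></dl></div>
  <h2 class="title" id="chapter_intro">1. Intro</h2>
  </body>
HTML

body, toc = split_toc(sample)
puts toc
```

Returning nil for the TOC lets the caller decide whether to skip writing toc.html or to fall back to some other way of building it.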

Sure, this could use some improving, but I think it is already a step in the right direction for git-scribe. Instead of scanning the entire HTML document for H2 and H3 tags, I am now using the TOC as generated by a2x directly.

Unfortunately, I am not done slurping the entire book into memory. A bit of clean-up is still necessary: removing whitespace inside LI tags (to get bullet lists to align properly) and marking up header tags in a way the Kindle handles properly:
def clean_html(file)
  content = File.read(file)
  File.open(file, 'w') do |f|
    f.write content.
      gsub(%r"<li(.*?)>\s*(.+?)\s*</li>"m, '<li\1>\2</li>').
      gsub(%r'<h([23] class="title".*?)><a (id=".+?")></a>'m, '<h\1 \2>')
  end
end
The first gsub matches across lines to remove leading and trailing whitespace inside LI tags:
# Source:
<li class="listitem">
SPDY-ize your own sites—either by writing your own SPDY parser or using one of the frameworks discussed.
</li>

# Result:
<li class="listitem">SPDY-ize your own sites—either by writing your own SPDY parser or using one of the frameworks discussed.</li>
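To see that first substitution in isolation, here is a small sketch that applies the same pattern to a made-up fragment (the sample HTML and the constant name LI_WHITESPACE are assumptions for illustration). Only the whitespace directly after the opening tag and directly before the closing tag is trimmed; line breaks in the middle of an item are left alone.

```ruby
# The first gsub from clean_html, applied to a made-up fragment.
LI_WHITESPACE = %r"<li(.*?)>\s*(.+?)\s*</li>"m

fragment = <<~HTML
  <ul>
  <li class="listitem">
  First item,
  wrapped over two lines.
  </li>
  <li>Second item</li>
  </ul>
HTML

# \1 carries the tag attributes, \2 the item text without outer whitespace.
cleaned = fragment.gsub(LI_WHITESPACE, '<li\1>\2</li>')
puts cleaned
```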
The second gsub is just a workaround for a Kindle quirk:
# Source:
<h2 class="title">
<a id="chapter_your_first_spdy_app"></a>
Chapter 2. Your First SPDY App
</h2>

# Result:
<h2 class="title" id="chapter_your_first_spdy_app">Chapter 2. Your First SPDY App</h2>
Inline links (e.g. <a href="book.html#chapter_your_first_spdy_app">) work with either format on the Kindle, but with the source version the heading formatting is lost: the H2 and H3 text displays like normal text.
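A quick sketch of the second substitution on its own, using a made-up one-line heading (the constant name HEADER_ANCHOR and both sample strings are assumptions for illustration). Note that the pattern as written only matches when the anchor immediately follows the opening heading tag, with no whitespace between them. A layout with line breaks, like the Source listing above, would be left unchanged.

```ruby
# The second gsub from clean_html: fold the empty anchor's id into the heading.
HEADER_ANCHOR = %r'<h([23] class="title".*?)><a (id=".+?")></a>'m

# Heading with the anchor directly after the opening tag: this matches.
one_line = '<h2 class="title"><a id="chapter_your_first_spdy_app"></a>' \
           'Chapter 2. Your First SPDY App</h2>'
fixed = one_line.gsub(HEADER_ANCHOR, '<h\1 \2>')
puts fixed

# Same heading with line breaks between the tags: the pattern does not match.
multi_line = "<h2 class=\"title\">\n" \
             "<a id=\"chapter_your_first_spdy_app\"></a>\n" \
             "Chapter 2. Your First SPDY App\n</h2>"
untouched = multi_line.gsub(HEADER_ANCHOR, '<h\1 \2>')
puts untouched == multi_line
```

If the generated HTML did contain whitespace between the tags, allowing `\s*` between `>` and `<a` in the pattern would be one way to cover it.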

With that, I am more or less satisfied with my mobi-formatted version of SPDY Book. There are a few tweaks that I still might like to make, but I think tomorrow I will revisit some of the hacks I needed for the PDF version of the book. Armed with what I know now, I think I can come up with a better approach.

Day #108
