Generating Pandoc Heading Identifiers In ColdFusion

Ben Nadel
Published: December 6, 2023

Over on my Feature Flags book website, I’m using my book’s Markdown content to generate the HTML for the page. I then use jSoup to inject a table of contents (TOC), which requires that I insert an identifier into each header element. And, now that I’m trying to use Pandoc to generate an EPUB (digital book) version, I need to make sure that my ColdFusion-based header identifiers match the ones that Pandoc will generate in the final EPUB.

The Pandoc documentation on “Headings and Sections” describes the algorithm that it uses to generate the heading identifiers:

  • Remove all formatting, links, etc.
  • Remove all footnotes.
  • Remove all non-alphanumeric characters, except underscores, hyphens, and periods.
  • Replace all spaces and newlines with hyphens.
  • Convert all alphabetic characters to lowercase.
  • Remove everything up to the first letter (identifiers may not begin with a number or punctuation mark).
  • If nothing is left after this, use the identifier “section”.

The Pandoc documentation also provides a set of sample headings and the identifiers that it will generate. We can use these samples to test our ColdFusion algorithm. And, of course, we’ll make ample use of Regular Expressions to solve this problem.

In the following ColdFusion code, we’re looping over the samples provided by Pandoc and asserting that our ColdFusion-generated identifier matches the expected identifier:

<cfscript>
	// These values are provided in the Pandoc documentation on Headings and Sections.
	assertions = [
		{
			heading: "Heading identifiers in HTML",
			identifier: "heading-identifiers-in-html"
		},
		{
			heading: "Maître d'hôtel",
			identifier: "maître-dhôtel"
		},
		{
			heading: "*Dogs*?--in *my* house?",
			identifier: "dogs--in-my-house"
		},
		{
			heading: "[HTML], [S5], or [RTF]?",
			identifier: "html-s5-or-rtf"
		},
		{
			heading: "3. Applications",
			identifier: "applications"
		},
		{
			heading: "33",
			identifier: "section"
		}
	];
	// Let's test the Pandoc header assertions against our ColdFusion algorithm, yay!
	for ( assertion in assertions ) {
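		identifier = generateIdentifier( assertion.heading );
		// CFML's == operator compares strings case-insensitively, so compare() is used
		// here to make sure that the lowercasing step is actually being verified.
		isMatch = ( compare( assertion.identifier, identifier ) == 0 );
		writeOutput("
			<p>
				Heading: #encodeForHtml( assertion.heading )# <br />
				Expected: #encodeForHtml( assertion.identifier )# <br />
				Received: #encodeForHtml( identifier )# <br />
				Pass: <b>#yesNoFormat( isMatch )#</b>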
			</p>
		");
	}
	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //
	/**
	* I generate a Pandoc section identifier (ie, URL anchor) from the given heading text.
	* 
	* ASSUMPTION: For this demo, I am assuming that all formatting, links, and footnotes
	* have already been removed and that we are dealing with plain-text header values.
	*/
	public string function generateIdentifier( required string heading ) {
		var identifier = heading
			.trim()
			// Convert all alphabetic characters to lowercase.
			.lcase()
			// Replace all spaces and newlines with hyphens.
			.reReplace( "\s+", "-", "all" )
			// Remove all non-alphanumeric characters, except underscores, hyphens,
			// and periods.
			.reReplace( "[^\w.-]+", "", "all" )
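			// Remove everything up to the first letter (identifiers may not begin with
			// a number or punctuation mark). Only word characters, periods, and hyphens
			// remain at this point, so stripping the leading digits, underscores,
			// periods, and hyphens keeps a leading accented letter (such as "é") intact,
			// which a "^[^a-z]+" pattern would strip.
			.reReplace( "^[\d_.-]+", "" )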
		;
		// If nothing is left after this, use the identifier section.
		if ( ! identifier.len() ) {
			return( "section" );
		}
		return( identifier );
	}
</cfscript>

As a general rule, when using Regular Expressions to solve a problem, move the “convert to lowercase” step as early in the algorithm as you can. That way, you can simplify your patterns by using [a-z] instead of [a-zA-Z], and you can use .reReplace() instead of .reReplaceNoCase(), which avoids the overhead of case-insensitive matching.
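To make that rule concrete, here is a small sketch that strips the leading non-letters from one of the Pandoc sample headings in three ways. The variable names are mine and exist only for illustration; the heading is plain ASCII, so the [a-z] range is sufficient here.

```cfml
<cfscript>
	heading = "3. Applications";

	// Without lowercasing first, the pattern has to spell out both cases ...
	mixedCase = heading.reReplace( "^[^a-zA-Z]+", "" );

	// ... or the call has to be swapped for the case-insensitive variant.
	noCase = heading.reReplaceNoCase( "^[^a-z]+", "" );

	// Lowercasing first lets a lowercase-only pattern do the same job.
	lowerFirst = heading.lcase().reReplace( "^[^a-z]+", "" );

	writeDump([ mixedCase, noCase, lowerFirst ]);
</cfscript>
```

All three calls strip the leading "3. " prefix; only the last one also yields the lowercase value that a Pandoc identifier needs.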

In this ColdFusion code, I’ve used Pandoc’s description of each step as a comment in the code so that you can see how each RegEx pattern maps to Pandoc’s intended outcome. If Regular Expressions seem like a foreign language to you, check out my video presentation on basic pattern usage. Once you start using patterns, you’ll find that they improve the quality of your developer life.

With that said, if we run this ColdFusion code, we get the following output:

[Screenshot: output of the header identifier assertions, showing that ColdFusion generated the correct values.]

As you can see, the heading identifiers generated by our ColdFusion Regular Expression replacements match the identifier assertions provided by Pandoc. At this point, I can update my Feature Flags site logic and not worry about the inter-chapter links breaking when I generate my EPUB.
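One case that the samples above don't exercise is repeated heading text. The Pandoc documentation notes that when several headings share the same text, the first keeps the identifier described above, the second gets -1 appended, the third gets -2, and so on. Here is a minimal sketch of that rule. The dedupeIdentifiers() function is hypothetical (it is not part of the demo above), and it assumes that a suffixed value such as "summary-1" never collides with another heading's natural identifier.

```cfml
<cfscript>
	/**
	* HYPOTHETICAL helper: I append -1, -2, etc. to repeated identifiers, leaving the
	* first occurrence of each identifier untouched.
	*/
	public array function dedupeIdentifiers( required array identifiers ) {
		// Tracks how many times each identifier has been seen so far.
		var counts = {};
		var results = [];
		for ( var identifier in identifiers ) {
			if ( counts.keyExists( identifier ) ) {
				results.append( identifier & "-" & counts[ identifier ] );
				counts[ identifier ] = ( counts[ identifier ] + 1 );
			} else {
				results.append( identifier );
				counts[ identifier ] = 1;
			}
		}
		return( results );
	}

	deduped = dedupeIdentifiers([ "summary", "usage", "summary", "summary" ]);
	// deduped is now: [ "summary", "usage", "summary-1", "summary-2" ]
	writeDump( deduped );
</cfscript>
```

The function takes the identifiers in document order, since the suffix depends on which occurrence comes first.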

Note: My Feature Flags site uses Flexmark to convert from Markdown to HTML in ColdFusion (during site bootstrapping and initialization), which is why the two algorithms need to be aligned. This way, I need neither to install Pandoc on my server nor to commit the generated HTML to source control.
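To give a rough sense of where the identifier logic could plug in, here is a sketch of a jSoup pass over already-rendered HTML. It is not the site's actual code. It assumes that the jSoup JAR is on the ColdFusion class path, that h1 through h6 are the elements of interest, and that the identifier generator is passed in as an argument. The injectHeadingIdentifiers() name is hypothetical.

```cfml
<cfscript>
	/**
	* HYPOTHETICAL helper: I parse the given HTML fragment, assign an id attribute to
	* each heading element using the given callback, and return the updated HTML.
	*/
	public string function injectHeadingIdentifiers(
		required string html,
		required function toIdentifier
		) {
		// ASSUMPTION: the jSoup JAR has already been loaded by the application.
		var dom = createObject( "java", "org.jsoup.Jsoup" )
			.parseBodyFragment( html )
		;
		// ASSUMPTION: only h1-h6 elements need identifiers.
		var headings = dom.select( "h1, h2, h3, h4, h5, h6" );
		for ( var i = 0 ; i < headings.size() ; i++ ) {
			var heading = headings.get( javaCast( "int", i ) );
			// The text() call returns the element's text without any nested markup,
			// which lines up with the plain-text assumption in generateIdentifier().
			heading.attr( "id", toIdentifier( heading.text() ) );
		}
		return( dom.body().html() );
	}
</cfscript>
```

With the demo function from earlier in scope, a call might look like injectHeadingIdentifiers( renderedHtml, generateIdentifier ), where renderedHtml stands in for whatever Flexmark produced. Footnote markers are not removed by text(), so headings with footnotes would need extra handling.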


Source: www.bennadel.com