rtf2html

This code is a hack. A monumental hack. It's unsophisticated and inelegant. The output is ugly and non-compliant. So far as I know, it will work on only one input file--and I had to do additional hand-editing on its output. Nevertheless, here it is.

It's difficult to find RTF parsing help online. At first glance, RTF code appears to be a nightmare. However, if you do nothing else but add newlines before the curly braces and backslashes, and then spend a few minutes studying the output, the solution will begin to make itself apparent.
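A minimal sketch of that "add newlines and stare at it" step, in case it helps. The helper name `explode_rtf` and the sample string are made up for illustration. It is deliberately naive: escaped characters such as `\{` or `\\` get split too, which is fine for eyeballing a file but not for real parsing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script): start a new line
# before every curly brace and every backslash.
sub explode_rtf {
    my ($rtf) = @_;
    $rtf =~ s/([{}\\])/\n$1/g;
    return $rtf;
}

# Made-up sample input, just to show the effect.
print explode_rtf('{\rtf1\ansi{\b Hello}\par}'), "\n";
```

Each group delimiter and each keyword then sits at the start of its own line, which makes it much easier to see which keywords a given file actually uses.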

I'm dealing with only a handful of RTF formatting keywords here. There's no attempt to interpret font information. But note that one of the many things I learned while doing this is that keywords starting with a single quote (\'85, for example) are followed by the two-digit hexadecimal code of the rendered character. That code is a byte value in the document's character set, so values above 7F are not strictly ASCII.
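The script handles only two of these escapes by hand (`\'85` and `\'bc`). A hedged sketch of a more general version follows. The helper name `hex_escapes_to_entities` is hypothetical, and it copies the script's habit of using the raw byte value as the HTML entity number, which browsers tend to tolerate for values like 133 but which is not strictly correct HTML.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script): turn every \'hh escape
# into a numeric HTML entity. Assumption: the byte value can be used
# directly as the entity number, as the script does with &#133; and &#188;.
sub hex_escapes_to_entities {
    my ($rtf) = @_;
    $rtf =~ s/\\'([0-9a-fA-F]{2})/'&#' . hex($1) . ';'/ge;
    return $rtf;
}

print hex_escapes_to_entities(q{Wait\'85 add \'bc cup}), "\n";
# prints: Wait&#133; add &#188; cup
```

Unlike the script's rule for `\'bc`, this sketch does not swallow whitespace after the escape.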

The script loads an entire file into memory at once. I can well imagine that a large enough file could exhaust available memory.
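For what it's worth, here is a sketch of reading the input into a single scalar rather than into an array that is then joined. It still holds the whole file in memory, so it does not remove the limitation; it only avoids keeping a second copy around. The helper name `slurp` and the sample text are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script): read everything from a
# filehandle into one scalar.
sub slurp {
    my ($fh) = @_;
    local $/;                 # undefined record separator: one read returns it all
    my $text = <$fh>;
    return defined $text ? $text : '';
}

# Demonstrated on an in-memory filehandle with made-up content;
# a real run would pass \*STDIN or a handle opened on the RTF file.
my $sample = "{\\rtf1 line one\\par\nline two}";
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my $text = slurp($fh);
close $fh;

print length($text), " characters read\n";
```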

    #!/usr/bin/perl

    @lines=<>;
    $text=join "\n"," ",@lines;
    $_=$text;

    ## Adding a newline before each backslash puts every keyword on its own line ##
    s/\\/\n\\/g;

    ## Ampersands need to be HTMLized ##
    s/&/&amp;/g;

    ## Replace important RTF tags with HTML tags and entities ##
    #                                                         #
    ## pard turns off any formatting characteristics.         #
    ## We're using only qc (centering), so that's all         #
    ## I'm turning off.                                       #
    s/\\qc/<center>/g;
    s/\\pard/<\/center>/g;
    s/\\par/<br>/g;
    s/\\rquote\s/&#146;/g;       # Using the right quote on both
    s/\\lquote\s/&#146;/g;
    s/\\ldblquote\s/&quot;/g;    # Using the standard HTML double quote on both
    s/\\rdblquote\s/&quot;/g;
    s/\\i\s/<i>/g;
    s/\\i0\s/<\/i>/g;
    s/\\super\s/<sup><small>/g;
    s/\\nosupersub\s/<\/small><\/sup>/g;
    s/\\ul\s/<u>/g;
    s/\\ulnone\s/<\/u>/g;
    s/\\\'85/&#133;/g;           ## ellipsis
    s/\\\'bc\s/&#188;/g;         ## 1/4

    ## Remove all remaining RTF tags ##
    s/\\\S*\s/\n/g;
    s/\n//g;

    ## Replace all remaining whitespace with a blank ##
    s/\s/ /g;

    ## Helps to make the output slightly more human-readable
    s/<br>/<br>\n/g;

    ## Clean up the curly braces ##
    s/\{.*\}//g;
    s/[{}]//g;

    print;