Command-line hacking: displaying news headlines in the manual viewer

In an earlier article in this series, about doing (what I hope are) interesting and offbeat things with Linux command-line tools, I explained how to write a shell script that retrieves a weather forecast from the BBC and formats it for display in the console using brute-force application of tools like sed. This article is somewhat similar, but takes a different approach to processing the document. The data is of the same kind as in the previous article -- it's an XML RSS document -- but here I'll explain how to format it using an XSLT stylesheet. This requires the xsltproc utility, which may not be part of a basic Linux installation but is widely available in repositories.

The BBC news feeds are longer documents than the weather reports and, to display them in a console, it's useful to use a pager that allows page-up/page-down operations. Such things already exist, of course; but in this case we need the pager to accommodate lightly-formatted text, not just plain text. The Linux manual viewer (man) already handles this kind of operation, so it seemed sensible to turn the news feed into a man page. The result looks something like this:

Prerequisites

To follow this example, you'll need curl, sed, and man (which are part of the base distribution for many Linux variants) and xsltproc (which probably isn't). On Ubuntu and similar, you can get xsltproc by running:

$ sudo apt-get install xsltproc 

Of course, you'll need Internet access to get to the news feeds.

Background

The BBC makes a number of news headline feeds available in RSS format. RSS is nothing more than XML that follows particular structural conventions. For the headline feeds, the XML looks like this:

<rss>
  <channel>
    <item>
      <title>Headline 1</title>
      <description>Summary 1</description>
    </item>
    <item>
      <title>Headline 2</title>
      <description>Summary 2</description>
    </item>
  </channel>
</rss>

It's certainly possible to hack out the relevant bits of text from this document using sed, but using a stylesheet transformation is more elegant. More to the point, perhaps, the stylesheet is comprehensible. If the BBC changes its format, it will be much easier to modify a stylesheet than a heap of regular expressions.

Retrieving the feed

The BBC news feed URLs are of the form http://feeds.bbci.co.uk/news/XXX/rss.xml. The "XXX" is a topic -- "world", "uk", "politics", etc. We can retrieve the feed easily using curl, which will write to standard output by default.
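
As a minimal sketch, the retrieval step might look like the following. The function names feed_url and fetch_feed are my own illustrative choices, not taken from the newsman script, and the -f and -L flags are additions I'm assuming are useful here rather than something the original relies on.

```shell
#!/bin/sh
# Build the feed URL for a topic such as "world", "uk" or "politics".
# (feed_url and fetch_feed are illustrative names, not part of newsman.)
feed_url() {
    printf 'http://feeds.bbci.co.uk/news/%s/rss.xml\n' "$1"
}

# Fetch the feed and write the RSS XML to standard output.
#   -s  suppress the progress meter, so only the XML reaches the pipe
#   -f  fail on an HTTP error instead of passing an error page along
#   -L  follow redirects, in case the server sends us elsewhere
fetch_feed() {
    curl -sfL "$(feed_url "$1")"
}
```

Running fetch_feed world | head should then show the first few lines of the world news feed.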

XSLT transformation

XSLT is an XML-based language for transformations of XML to another document: maybe a different kind of XML, maybe something else entirely. In this case, we'll use XSLT to transform an RSS document into a man page.

The xsltproc utility takes an XML document and a stylesheet, applies the latter to the former, and writes the output. To apply the transformation to standard input, and write the result to standard output, the invocation is simple:

$ xsltproc stylesheet.xslt -

Here is the (slightly simplified) stylesheet for this news headline example. Even if you're not a regular XSLT user, it should make reasonable sense when compared to the RSS XML sample above.

<xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/rss/channel">
.TH "BBC News" "7" "" "" "Headlines"
  <xsl:for-each select="item">

\fB<xsl:value-of select="title"/>\fR
.br
<xsl:value-of select="description"/>

</xsl:for-each>

</xsl:template>
</xsl:stylesheet>

Note: The whitespace in this stylesheet is significant. Line breaks around the literal text are written to the output exactly as given, and they are meaningful in a man page.

The stylesheet starts off by matching the channel element under the rss element in the RSS document. If this pattern is found, the output beginning .TH will be written. This is the man page header. In practice, we'll modify this header at runtime, to display the title of the particular news feed, but that's a subtlety that I won't elaborate on here.

Within the /rss/channel/ section of the XML, the stylesheet then iterates over all the item elements. For each it writes the text supplied in the stylesheet, and substitutes the contents of the title and description tags for each item. The literal text that gets output contains man-page formatting elements like \fB (bold). In this way the elements in the RSS are transformed into paragraphs in a man page.
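
To see the transformation in action without touching the network, the stylesheet can be applied to the two-item sample feed shown earlier. This is only a sketch: render_sample is an illustrative name, and the script assumes xsltproc and mktemp are installed.

```shell
#!/bin/sh
# Apply the article's stylesheet to the two-item sample feed.
# render_sample is an illustrative name; it requires xsltproc.
render_sample() {
    xslt=$(mktemp) || return 1
    # Quoted delimiter ('EOF'): the shell leaves the text untouched.
    cat > "$xslt" << 'EOF'
<xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/rss/channel">
.TH "BBC News" "7" "" "" "Headlines"
  <xsl:for-each select="item">

\fB<xsl:value-of select="title"/>\fR
.br
<xsl:value-of select="description"/>

</xsl:for-each>

</xsl:template>
</xsl:stylesheet>
EOF
    # The sample feed arrives on standard input ("-").
    xsltproc "$xslt" - << 'EOF'
<rss>
  <channel>
    <item>
      <title>Headline 1</title>
      <description>Summary 1</description>
    </item>
    <item>
      <title>Headline 2</title>
      <description>Summary 2</description>
    </item>
  </channel>
</rss>
EOF
    status=$?
    rm -f "$xslt"
    return $status
}

if command -v xsltproc >/dev/null 2>&1; then
    render_sample
else
    echo "xsltproc is not installed" >&2
fi
```

The output should contain a .TH header line followed by each headline wrapped in \fB and \fR, with its summary on the line after .br. Because this stylesheet does not ask for plain-text output, xsltproc also emits an XML declaration line at the top, which has to be filtered out before the result goes to man.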

Handling the stylesheet

xsltproc can take its document from standard input and write the result to standard output, but the stylesheet has to be provided in a file. I don't want to have to install a separate file -- I just want to put the entire program into a single shell script. So I have the script write the stylesheet to a temporary file, then delete it when the processing is done. Part of this process can involve making on-the-fly transformations to the stylesheet (although few are done in this example).

A simple way to write out the stylesheet from the script is to use a "here document":

cat << EOF > "$XSLTFILE"

...
EOF

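Everything between the cat line and the closing EOF is written to $XSLTFILE. Because the EOF delimiter is unquoted, the shell expands variables in the text as it is written, which offers one way to make on-the-fly changes to the stylesheet.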
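
A fuller sketch of the temporary-file handling might look like this. The names TITLE and write_stylesheet are hypothetical, the use of mktemp and trap is my assumption about a reasonable way to create and clean up the file, and the stylesheet is cut down to its header template to keep the example short.

```shell
#!/bin/sh
# Sketch: write the stylesheet to a temporary file and remove it on exit.
# TITLE and write_stylesheet are illustrative names, not from newsman.
TITLE="BBC News - World"

XSLTFILE=$(mktemp) || exit 1
# Delete the temporary file however the script ends.
trap 'rm -f "$XSLTFILE"' EXIT

write_stylesheet() {
    # Unquoted delimiter: $TITLE is expanded as the text is written.
    cat << EOF > "$XSLTFILE"
<xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/rss/channel">
.TH "$TITLE" "7" "" "" "Headlines"
<!-- item loop as in the full stylesheet -->
</xsl:template>
</xsl:stylesheet>
EOF
}

write_stylesheet
```

One caveat with this approach: the title is pasted into an XML file, so a title containing & or < would need escaping first.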

Using the man viewer

Linux users are familiar with running man foo to get the manual page for foo. However, the man viewer is capable of working on specific files or even, as in this case, on standard input.
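
A quick way to try this is to pipe a small hand-written page into the viewer. The function below is purely illustrative (demo_page is my own name for it); it prints the same kind of markup that the stylesheet generates.

```shell
#!/bin/sh
# Emit a tiny hand-written man page on standard output.
# demo_page is an illustrative name.
demo_page() {
    # Quoted delimiter, so the backslashes reach man unchanged.
    cat << 'EOF'
.TH "Demo" "7" "" "" "Headlines"
\fBHeadline 1\fR
.br
Summary 1
EOF
}
```

Running demo_page | man -l - should display the headline in bold with its summary on the next line. The -l option tells man to treat its argument as a file to format rather than a page name to look up, and "-" stands for standard input; this is the behaviour of the man-db version of man found on most Linux distributions.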

So, having written out the stylesheet to a temporary file, the complete command to display the news headlines is:

curl -s "$FEED" | xsltproc "$XSLTFILE" - | grep -v xml | man -l -

And that's it.

Further work

This application cries out for caching. The news headlines change only infrequently, so the utility really ought to cache the RSS documents in a handy directory, and use the cached versions if only a short time has elapsed.
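
One possible shape for such a cache is sketched below. Everything here is hypothetical: CACHE_DIR, MAX_AGE_MIN, cache_is_fresh and cached_feed are names I've made up, the 15-minute lifetime is an arbitrary choice, and the freshness test relies on the -mmin option of find, which GNU and BSD find provide but POSIX does not require.

```shell
#!/bin/sh
# Sketch of a feed cache; all names and the 15-minute lifetime are
# hypothetical choices, not something the article's script implements.
CACHE_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/newsman"
MAX_AGE_MIN=15

# Succeed if file $1 exists and was modified less than $2 minutes ago.
cache_is_fresh() {
    [ -f "$1" ] && [ -n "$(find "$1" -mmin "-$2" 2>/dev/null)" ]
}

# Print the feed for topic $1, downloading only when the cache is stale.
cached_feed() {
    file="$CACHE_DIR/$1.xml"
    if ! cache_is_fresh "$file" "$MAX_AGE_MIN"; then
        mkdir -p "$CACHE_DIR" || return 1
        # Download to a temporary name so a failed fetch keeps the old copy.
        if curl -sf "http://feeds.bbci.co.uk/news/$1/rss.xml" > "$file.tmp"; then
            mv "$file.tmp" "$file"
        else
            rm -f "$file.tmp"
        fi
    fi
    [ -f "$file" ] && cat "$file"
}
```

With this in place, cached_feed world could stand in for the curl stage of the pipeline. Falling back to a stale copy when the download fails is a design choice: it means the viewer still shows something when the network is down.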

The approach described in this article will work on any conventional RSS document, so it could easily be expanded to provide a general feed viewer. Of course, such utilities already exist -- but they're typically more complex than a few lines of shell script.

Download

If you're interested, the full source for newsman is available from my GitHub repository.