Command-line hacking: displaying a weather summary

This is the first in what I hope will become a series of articles on doing off-beat and (I hope) interesting things with standard Linux command-line tools. In this post I'll give some ideas on retrieving a weather forecast summary from the BBC's RSS feed, and formatting it nicely. The idea is that you can just type, for example, weather at the prompt, and see a daily weather summary.

The basic strategy will be to use curl to retrieve the weather forecast in RSS (XML) format, xmllint to extract the relevant parts of the document, sed to convert the text into groff format, then finally groff to format it, adding colour and emphasis. groff is highly configurable and produces nicely left- and right-justified text. It's easy to change this layout if required.

A quick-and-dirty alternative to using groff would be to use fmt, and then use sed to insert terminal escape codes to add the emphasis and colour. But this isn't very elegant.

Disclaimer: I don't claim that the way I go about solving this problem is the only way, or even the best. There are many ways to solve the same problem with shell scripting.

Prerequisites

To follow this example, you'll need curl and groff (which are part of the base distribution for many Linux variants) and xmllint (which probably isn't). On Ubuntu and similar, you can get xmllint by running:

$ sudo apt-get install libxml2-utils

Background

The basic URL for the BBC's 3-day forecast, in RSS format, is:

https://weather-broker-cdn.api.bbci.co.uk/en/forecast/rss/3day/NNNNNN

NNNNNN is a numeric code for the location. So far as I know, there is no published list of these location codes, and they don't correspond (to the best of my knowledge) to any geographic indicator. You can find the code for your location by looking at the BBC weather site and searching for the relevant name. The numeric code will appear in the URL. In the examples that follow, I'll be using the code for London, which is 2643743.

The following command will request the 3-day weather forecast for London, and dump it to standard output:

$ curl https://weather-broker-cdn.api.bbci.co.uk/en/forecast/rss/3day/2643743
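In a script it helps to build the URL from the location code, and to make curl report failures rather than silently passing an error page down the pipeline. The following is a minimal sketch under stated assumptions, not a definitive implementation: the names BBC_RSS_BASE, forecast_url and fetch_forecast are made up for illustration, and the default location is the London code used in this article.

```shell
# Base URL of the BBC 3-day forecast feed, as given above.
BBC_RSS_BASE='https://weather-broker-cdn.api.bbci.co.uk/en/forecast/rss/3day'

# Hypothetical helper: print the feed URL for a location code.
# Defaults to London (2643743) when no code is supplied.
forecast_url() {
    printf '%s/%s\n' "$BBC_RSS_BASE" "${1:-2643743}"
}

# Hypothetical helper: fetch the feed to standard output.
# -f: fail on HTTP errors, -s: no progress meter, -S: still show errors.
fetch_forecast() {
    curl -fsS "$(forecast_url "$1")"
}

# Show the URL that would be requested for London.
forecast_url 2643743
```

Calling fetch_forecast 2643743 would then produce the same document as the curl command above.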

If you examine the output, you'll see the following basic structure:

<rss ...> 
  <channel ...>
    <item>
      <title>Today: Light Cloud...</title>
      <description>Visibility: Good...</description>
      ...
    </item>
    <item>
      <title>Friday: Heavy Cloud...</title>
      <description>Visibility: Poor...</description>
      ...
    </item>
    ...

All RSS files are XML with this same layout -- only the content differs. In this case, there are three <item> elements, one for each day of the forecast. Each item has a title and a description which, to some extent, are pre-formatted for display. Depending on the application, it might be sufficient to display only the "title" element -- the description is a bit wordy. Note that the title and description are intended to be displayed as provided. They do benefit from being formatted, though, and in my example I've added some text highlighting; still, it isn't actually easy to parse these text strings. Apart from anything else, the format tends to change, and relying on a particular entry (e.g., "Visibility") being present is a bad idea.

So our task here is to parse the XML, extract the title (and perhaps description), and format the results for display in the terminal. Here's what I'm aiming for:

screenshot — *The output of the script in a Linux terminal. It's easy enough to customize the appearance and layout if required.*

Parsing the XML

It would be nice to be able to extract the relevant parts of the XML file using simple command-line utilities like grep and sed. Since the structure of the BBC's RSS is so regular, that's almost possible. We could, for example, just grep for lines that contain "title" and "description".

Unfortunately, there are other lines in the XML that contain these tags. Since they tend to occur in the same position, a quick-and-dirty hack would be to extract specific occurrences by their position in the XML. However, this isn't very robust -- and remember that we're working with a file format that is not documented, and whose layout could change. We need to ensure that, so long as we get a valid RSS document, we can display something useful. Consequently, it's necessary to do real XML parsing to extract the data.

The xmllint utility makes this easy, so long as the parsing is not too complex. In this case we just want to extract certain elements, that have well-defined positions in the XML's tree structure. This will do the trick:

curl -s [BBC URL] \
   | xmllint --xpath rss/channel/item/title\|rss/channel/item/description -

The --xpath switch takes one or more paths, separated by a vertical bar (pipe). Be careful to prevent the shell treating the pipe as an instruction to perform a pipe operation: here I've escaped it with a backslash but, of course, there are other ways.
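The extraction can be tried without touching the network. This sketch feeds xmllint a made-up, cut-down document with the same basic structure as the feed; sample_rss and extract_items are hypothetical names. It quotes the whole expression instead of escaping the bar, and it passes "-" as the file name so that xmllint reads standard input. Note that the channel has a title of its own, which the path deliberately leaves out.

```shell
# A made-up, cut-down document with the same basic structure as the feed.
sample_rss() {
    cat <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample forecast</title>
    <item>
      <title>Today: Light Cloud</title>
      <description>Visibility: Good</description>
    </item>
    <item>
      <title>Friday: Heavy Cloud</title>
      <description>Visibility: Poor</description>
    </item>
  </channel>
</rss>
EOF
}

# Hypothetical helper: keep only each item's title and description.
# Single quotes protect the "|" from the shell; "-" means "read stdin".
extract_items() {
    xmllint --xpath '/rss/channel/item/title|/rss/channel/item/description' -
}

# Run the demonstration only where xmllint is installed.
if command -v xmllint >/dev/null 2>&1; then
    sample_rss | extract_items
fi
```

The selected elements come out with their tags still attached, which is why the next stage has to strip them.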

Note:
RSS XML files are usually encoded in UTF-8 format. This is a multi-byte text encoding, whose significance will become apparent later.

Converting the XML

The first task is to strip any remaining XML tags from the text. That's easy to do with sed:

| sed -e 's/<[^>]*>//g'

sed operates on standard input, and outputs to standard output. It edits the text stream in accordance with the instructions provided on the command line. In this case, I'm using the s/search/replace/g instruction, which replaces all instances ('g' for 'global') of the search pattern with the replacement text. Here the replacement, between the last two slashes, is empty, so each match is simply deleted. The pattern is a regular expression; in this case it matches a "<", followed by any number of characters other than ">", followed by a ">".

sed's pattern matching is "greedy" by default -- that is, it will match the longest sequence of characters that it can. This is why we need the slightly odd pattern described above, and it's a format that I'll be using repeatedly in this example. We only want to match the XML tags -- we don't want to match from the start of one tag, right to the end of the next, which is what will happen if we're not careful.
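The difference is easy to see on a single made-up line. This sketch runs the greedy pattern and the bounded one side by side:

```shell
line='<title>Today: Light Cloud</title>'

# Greedy: "<.*>" runs from the first "<" to the last ">", so the whole
# line is deleted, text and all.
printf '%s\n' "$line" | sed -e 's/<.*>//g'

# Bounded: "[^>]*" cannot cross a ">", so each tag is matched separately.
printf '%s\n' "$line" | sed -e 's/<[^>]*>//g'
```

The first command prints an empty line; the second prints "Today: Light Cloud".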

xmllint won't insert any blank lines between the XML sections it extracts. The text of the BBC weather feed contains no line breaks of its own, but we will need to break the lines later so they fit nicely on the screen. But first, we need to insert our own line breaks between each day's weather forecast, forming a blank line. That is, each day's forecast will be followed by a blank line. If we don't do that, when we format the text using groff, it will want to merge all the lines together, and then format the whole chunk. To change a line break into a blank line, I'm doing this:

| tr '\n' '#' \
| sed -e 's/#/\n\n/g' \

sed won't easily match on, or convert, end-of-line marks. So, instead I'm using tr to translate these into hash (#) characters, and then sed to translate these into pairs of newlines. Of course, this won't work if the text actually contains hash characters of its own -- you'd have to pick a different symbol.
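For comparison, here is the tr-and-sed step run on two made-up lines, next to one possible alternative that needs no placeholder character: sed's G command, which appends a newline and the (here empty) hold space to each line. This is a sketch of a different technique, not the method used in the script; the function name sample is made up.

```shell
# Two made-up forecast lines, one per line.
sample() {
    printf '%s\n' 'Today: Light Cloud' 'Friday: Heavy Cloud'
}

# The tr-and-sed approach described above (the "\n" in the replacement
# is a GNU sed feature).
sample | tr '\n' '#' | sed -e 's/#/\n\n/g'

# Alternative sketch: G appends a newline plus the (empty) hold space
# to every line, which leaves a blank line after each one.
sample | sed G
```

Both commands leave one blank line after each forecast line, and the second is unaffected by hash characters in the text.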

My personal preference is to remove the Fahrenheit temperatures that the BBC feed provides, and just keep the Celsius. Your preference may differ, of course. The Fahrenheit temperatures appear in brackets: "(38°F)". To remove these:

| sed -e 's/ ([^)]*F)//g' \

Again, I know that this pattern doesn't appear anywhere else in the text, so it's safe to match it this way. Note that I'm removing the extra space that appears in front of the opening bracket but, in fact, groff will do that later, anyway.
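The effect is easiest to check on a line that contains two temperatures. This sketch uses a made-up title line and the bounded pattern "[^)]*" (rather than a greedy ".*"), so that a match cannot run from the first opening bracket to the last closing one and swallow the text in between.

```shell
# A made-up title line in the style of the feed, with two temperatures.
line='Today: Light Cloud, Minimum Temperature: 7°C (45°F) Maximum Temperature: 13°C (55°F)'

# "[^)]*" cannot cross a closing bracket, so each "(..°F)" group is
# removed separately, together with the space in front of it.
printf '%s\n' "$line" | sed -e 's/ ([^)]*F)//g'
```

This prints "Today: Light Cloud, Minimum Temperature: 7°C Maximum Temperature: 13°C".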

In my script I carry out a number of other simple transformations: shortening long words, that kind of thing. I'm not going to describe these in detail because they just use the same application of sed as the previous steps. However, there are a few things that merit explanation.

In my script, I display the names of the days (which include "Today" and "Tonight") in red, to make the sections of the output easier to distinguish. Since I'll be processing the text with groff, I want to surround these words with groff colour codes. These are a bit arcane but, essentially, we need \m[name] to set the colour, followed by \m[] to set it back to the default.

My way of making this change isn't very elegant:

| sed -e 's/Tonight\|Today\|Monday\|Tuesday\|Wednesday\|Thursday\|Friday\|Saturday\|Sunday/\\m[red]\0\\m[]/g' \

Since the day name, when present, is always the first token on a line, there ought to be a more elegant way to match it. However, not every line starts with a day name. All the other ways I found to match this text were no less ugly, so I'm keeping what I have until somebody suggests a better way.

Note that in the replacement text, \0 means "substitute the text that matched". Because the backslash character has a specific meaning to sed, when we want an actual backslash in the output, we have to write \\. It's this kind of fiddliness that makes text-processing scripts so unreadable, but there isn't really an alternative.
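One possible variant, offered only as a sketch, anchors the match to the start of the line and uses extended regular expressions to shorten the list of names. It assumes a sed that accepts -E (GNU and BSD sed both do); colour_days is a made-up name, and the two input lines are invented.

```shell
# Colour a day name red, but only at the start of a line.
# -E selects extended regular expressions, so "|" and "()" need no
# backslashes; "&" stands for the matched text, like "\0" above.
colour_days() {
    sed -E 's/^(Tonight|Today|(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day)/\\m[red]&\\m[]/'
}

printf '%s\n' 'Friday: Heavy Cloud' 'Visibility: Poor' | colour_days
```

The first line comes out as "\m[red]Friday\m[]: Heavy Cloud", and the second is unchanged. Whether this is less ugly than the original is a matter of taste.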

Incidentally, sed can take multiple operations on the same command line. They are separated by semicolons. This is quicker, but a sed command with dozens of operations on the same line is profoundly unreadable (rather than just very unreadable). So I've used sequential invocations of sed with one operation each.
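A middle course is to give a single sed process several -e options, one per line of the script, which keeps each operation readable on its own. This is a sketch on a made-up input line, and clean_line is a hypothetical name:

```shell
# One sed process, but still one operation per line: each -e adds a
# command, and they run in order on every input line.
clean_line() {
    sed -e 's/<[^>]*>//g' \
        -e 's/ ([^)]*F)//g'
}

printf '%s\n' '<title>Today: Light Cloud, 7°C (45°F)</title>' | clean_line
```

This prints "Today: Light Cloud, 7°C".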

I've used a technique similar to the above to add other kinds of emphasis. To make text bold, for example, we surround it by the groff codes \fB and \fR (select the bold font, then return to the regular "roman" font). Again, we need to be careful about escaping the backslashes when writing these codes in a script.
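As an illustration of the escaping, this sketch makes one label bold. The choice of "Wind Speed" is hypothetical, picked only to have something to match, and embolden_label is a made-up name.

```shell
# Hypothetical choice of what to emphasise: the "Wind Speed" label.
# The doubled backslashes put single backslashes into the output.
embolden_label() {
    sed -e 's/Wind Speed/\\fB&\\fR/g'
}

printf '%s\n' 'Wind Speed: 10mph' | embolden_label
```

The output is "\fBWind Speed\fR: 10mph", ready to be handed to groff.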

Formatting the text

Finally, we're (almost) ready to pass the converted text, with its embedded formatting codes, to groff. First, though, we need to add a recognizable character or character sequence to the start of each blank line. I'm using "###", but anything that we can be sure won't otherwise appear is fine.

| sed -e 's/^$/###\n/' \

The reason for doing this is that groff always seems to pad its output with a bunch of blank lines -- I haven't found a way to stop it doing that when outputting plain text. We need to strip those blank lines, but we don't want to strip the blank lines that we inserted earlier to separate the different days. Marking the "real" blank lines with a specific pattern will allow us to do that.

The groff invocation looks like this:

| groff -k -Tutf8 -man \

The Linux version of groff cannot (so far as I know) handle multi-byte input -- it assumes each character is a single byte. The BBC RSS feed does produce multi-byte characters. In particular, there is the "degree" sign, which isn't an ASCII character. The -k argument tells groff to run a pre-processor that converts multi-byte characters into groff formatting codes. Since most Linux systems use UTF-8 character encoding, and the RSS format is UTF-8, no other configuration is necessary to make this conversion.

The -Tutf8 switch allows UTF-8 output. -man is an instruction to format in a way that is suitable for a Unix man page; essentially, that means for output on a text terminal.
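To see the formatting stage in isolation, a single marked-up line can be piped straight into groff. This is a minimal sketch, assuming groff is installed and the terminal uses a UTF-8 locale; format_for_terminal is a made-up name and the input line is invented.

```shell
# Hypothetical helper: format groff-marked-up text for a UTF-8 terminal,
# using the same switches as the invocation above.
format_for_terminal() {
    groff -k -Tutf8 -man
}

# Run the demonstration only where groff is installed.
if command -v groff >/dev/null 2>&1; then
    printf '%s\n' '\m[red]Today\m[]: \fBLight Cloud\fR, Maximum Temperature: 13°C' \
        | format_for_terminal
fi
```

On a colour terminal this should show "Today" in red and "Light Cloud" in bold, surrounded by the blank-line padding that the tidying-up stage has to deal with.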

Tidying up

The final step is to remove the spare, empty lines added by groff, and then replace the "real" empty lines, which we marked using the tag "###", with real blank lines.

| sed '/^$/d' \
| sed 's/###/\n/g'
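These two steps can be tested without running groff at all. In this sketch the input is made-up text that merely imitates padded output, on the assumption that a marker survives formatting wherever a deliberate blank line is wanted; padded_sample and tidy are hypothetical names.

```shell
# Made-up text imitating groff's output: blank-line padding, plus the
# "###" marker wherever a deliberate blank line is wanted.
padded_sample() {
    printf '%s\n' '' '' 'Today: Light Cloud ###' '' 'Friday: Heavy Cloud ###' '' ''
}

# Drop every empty line, then turn each marker into a line break.
# (The "\n" in the replacement is a GNU sed feature.)
tidy() {
    sed '/^$/d' | sed 's/###/\n/g'
}

padded_sample | tidy
```

The padding disappears, and each forecast line is followed by a single blank line where its marker used to be.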

Further work

The simple script could be improved in many ways. Of course, the formatting of the output won't be to everybody's taste. It would be relatively easy to provide the script with command-line arguments so that you could select the forecast location, or change the amount of data that is displayed. The BBC also provides an RSS feed of current weather observations, rather than forecasts, and it could easily be parsed and formatted with the same approach -- and much of the same code -- I've described above.

Another, slightly useful enhancement would be to suppress the generation of terminal formatting characters if the output is not to a terminal.
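One way this might be done, sketched here under the assumption that the only formatting present is SGR escape sequences of the form ESC [ ... m, is to test standard output with [ -t 1 ] and strip the sequences when it is not a terminal. The names strip_sgr and for_terminal_only are made up for illustration.

```shell
# The ESC character, built with printf so no sed extension is needed.
esc=$(printf '\033')

# Remove SGR sequences (ESC [ ... m), which carry colour and boldness.
strip_sgr() {
    sed -e "s/${esc}\[[0-9;]*m//g"
}

# Hypothetical final pipeline stage: pass text through untouched when
# standard output is a terminal, otherwise strip the formatting.
for_terminal_only() {
    if [ -t 1 ]; then
        cat
    else
        strip_sgr
    fi
}

# Piped into cat, so standard output is not a terminal here.
printf '\033[31mToday\033[0m: Light Cloud\n' | for_terminal_only | cat
```

Because the output goes to a pipe, this prints plain "Today: Light Cloud" with no escape codes.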

Download

If you're interested, here is the complete source code for the script I've been describing in this post.