Command-line hacking: Downloading a podcast to create an audiobook
This is another article in my occasional series on doing useful and (I hope) somewhat unusual things under Linux, using scripting and command-line utilities. Today's example is, I concede, likely to be useful only to a small number of people -- perhaps only me. Nevertheless, it does illustrate some useful features of Bash scripting, including XML parsing and array handling. As always, I'm only going to outline the code here -- for the complete code, please refer to the Download section at the end.
The application
I listen to a lot of podcasts, particularly when I'm working alone in the woods I manage. There's no cellular coverage in these woods, so I can't rely on being able to stream content on demand, which I suspect is the way most people listen to this kind of material. Some proprietary smartphone podcast apps can cache some programs locally, but this requires a measure of forward planning. In any case, I'm trying to reduce my reliance on smartphones, and I'd rather just turn the podcast into a bunch of MP3 files, and play them on an offline audio player.
So this application downloads an entire podcast series (perhaps restricted to a date range), and saves each program in the series as an audio file. Each saved file will have a name that starts with the broadcast date, so programs can be played in the appropriate order.
Effectively, I'm converting a podcast stream into a local audiobook, with each individual program one 'chapter' of the book. Apps like Smart Audiobook Player handle this kind of arrangement particularly well, but I can even copy the downloaded files onto my OpenSwim headset, to listen while swimming.
Of course, to keep up to date with new programs, I'll need to be able to run the script periodically, and avoid downloading anything I already have. I sometimes apply a bitrate reduction to the downloaded stream, because saving several years' worth of podcast programs can otherwise take a lot of storage. I find that voice-only podcasts sound fine at 64 kbits/second, single channel, although I wouldn't apply such a transformation to music.
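The bitrate reduction can be applied with ffmpeg after each download. A minimal sketch -- the helper name shrink_mp3 is mine, not the script's:

```shell
# Re-encode an MP3 at 64 kbit/s, single channel, which is adequate for speech.
# Assumes ffmpeg is on the PATH; shrink_mp3 is an illustrative name.
shrink_mp3 () {
    local in="$1" out="$2"
    ffmpeg -y -loglevel error -i "$in" -ac 1 -b:a 64k "$out"
}
```

Re-encoding lossy audio a second time always costs some quality, which is why it's worth keeping this step optional.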
Constraints
First, and most importantly, my script works only with podcasts that are distributed using an RSS feed, obtainable without authentication using an HTTP request. In my experience, nearly all interesting podcasts are delivered this way, but it might require some detective work to find the source of the RSS file. Some podcast hosts make the RSS source really obvious, others do not. If you're used to listening to podcasts using Spotify, for example, you probably won't find the RSS feed there -- you'll need to hunt for it by title using a web search. Some podcast hosts will reveal an RSS feed only to paid subscribers. Moreover, some hosts have regional restrictions, or block VPNs and proxies. There isn't much a script can do about any of this: it assumes that the RSS, and the audio streams themselves, are freely accessible, and that the user can find the source, and provide its URL on the command line.
Second, my script assumes that all audio streams are in MP3 format. There's
nothing in the logic itself that creates this restriction, but I'm using
id3v2 to write tags in the downloaded files. This utility only
supports ID3 tags, which are typically found in MP3 files. It wouldn't be difficult
to extend the script to handle other file types, but all the podcasts I
listen to are in MP3 format, so I don't really have a way to test such
an extension.
Third, in general I want to be able to select a specific program from a podcast series to play; but at the end of that program, I usually want to play the next program in date order. This means that the files the script downloads must be sortable by date within an audio player, and this in turn means putting the date in the filename (and probably in the 'title' tag as well).
RSS podcast format
The first thing the script does is download the podcast's RSS feed, whose URL we pass on the command line. The RSS that defines a podcast is an XML file with the following basic structure.
<rss version="2.0">
<channel>
<title>Title of the podcast</title>
...
<item>
<title>Title of this item</title>
<description>Text description of this item</description>
<enclosure url="audio_stream_url.mp3" type="audio/mpeg" />
<pubDate>Publication date of the item in RFC2822 format</pubDate>
</item>
<item>
...
</item>
...
</channel>
</rss>
There's a header that provides a description of the podcast, followed by a number of <item> elements, one per program. For our purposes, the pieces of information we need to extract for each program are the title, the URL, and the publication date. For more robust downloading, we might also extract the program length which, in theory, all podcast hosts should provide. With that information, we could check that the downloaded stream matches the program description, which would be useful for error detection. So far, however, I haven't used the length information.
There's a lot more information in the RSS file that I haven't shown, and the challenge is to find a robust method to separate the necessary information from the unnecessary, bearing in mind that different RSS files will have different layouts. Some, for example, have no line breaks at all, so we can't just use a simple `grep` for a pattern: we'll have to parse the XML properly.
How the script works
The first job is to download the RSS, using wget (curl
also works fine). Then we'll need to parse the XML, splitting out the information
we need for each program.
My go-to tool here -- the one I nearly always use for parsing XML in a script -- is
xsltproc. This utility applies an XSLT stylesheet transformation
to an XML input, producing plain text or XML as the output.
There isn't space here to explain XSLT in detail, but I think the XSLT snippet below is reasonably self-explanatory.
<xsl:stylesheet version="1.0" ...>
  <xsl:output method="text"/>
  <xsl:template match="/rss/channel">
    <xsl:for-each select="item">
      <xsl:value-of select="pubDate"/>$DELIMITER<xsl:value-of select="title"/>$DELIMITER<xsl:value-of select="enclosure/@url"/>$DELIMITER<xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
The <item> elements we need are all grouped in the RSS within
a <channel> element, which
is a sub-element of <rss>. So xsltproc applies
the for-each template to every
item, in the order the items appear in the XML.
This template extracts the three values we need and writes them to a temporary file, each value followed by a separator, one line per item.
The purpose of this XSLT transformation is to write a temporary file that is easy for a script to loop over line-by-line -- something that isn't practicable with XML.
The separator can be any text that we can be reasonably sure won't
appear in the data itself. At present, I'm using @@@, but this is
trivially easy to change in the script, as it's a global variable.
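Because the delimiter is a shell variable, the stylesheet can be generated at run time from an unquoted heredoc, which expands $DELIMITER, and then applied with xsltproc. A sketch -- the file paths and the helper name make_stylesheet are illustrative:

```shell
DELIMITER='@@@'   # global separator; easy to change if it clashes with the data

# Write the stylesheet to "$1", expanding $DELIMITER via the unquoted heredoc.
make_stylesheet () {
    cat > "$1" <<EOF
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/rss/channel">
    <xsl:for-each select="item">
      <xsl:value-of select="pubDate"/>$DELIMITER<xsl:value-of select="title"/>$DELIMITER<xsl:value-of select="enclosure/@url"/>$DELIMITER<xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOF
}

# Typical use, assuming the RSS has already been fetched to rss.xml:
#   make_stylesheet /tmp/podcast.xsl
#   xsltproc /tmp/podcast.xsl rss.xml > "$transformed_file"
```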
The output of the stylesheet transformation looks like this:
date1@@@title1@@@URL1@@@
date2@@@title2@@@URL2@@@
...
The script reads the transformed RSS file line-by-line, which is straightforward enough:
while read -r line; do
    ...
done < "$transformed_file"
There are, of course, many other ways of reading a file using a Bash script; I'm not claiming my approach is the best -- it's just the one I'm familiar with.
For each line, we split out the date, title, and URL into shell variables. Bash does have some built-in line tokenizing support, but I've not found it very useful with multi-character separators. The separators must be multi-character in this application, because there's really no single character that we can be sure won't appear in the data. Of course, we could use single-character separators with some method of escaping them when they aren't actually separators; but I don't think that's any more convenient.
My approach to splitting out the individual data elements is to read each up to the delimiter into an array element, and then assign the array elements to named variables. The code for doing this is somewhat opaque, and I'd be interested to know if there is a more readable method.
s=$line
array=()
while [[ $s ]]; do
    array+=( "${s%%"$DELIMITER"*}" )
    s=${s#*"$DELIMITER"}
done
date=${array[0]}
title=${array[1]}
url=${array[2]}
What the code in the while loop does is to match the input line
up to the
delimiter sequence, adding the matched text to the end of the array. Then the
matching section is removed from the line, and we loop until the line
is empty. Note that this approach requires that the delimiter is really
a terminator -- there must be one at the end of each token, including
at the end of the line. That's why the XSLT transformation writes a delimiter
at the end of the line, not just between the tokens. Of course, we could
just add the delimiter to the line before splitting it,
if there were only delimiters between the tokens.
It should go without saying that the technique I'm using here only works if the file we're parsing has an exact number of tokens on each line, and they're always in the same order. Since the script writes that file itself, that's easy to arrange. However, using this parsing technique on input we don't control would require better error-checking.
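On the question of readability: one alternative is to map the multi-character delimiter onto a single character that we're confident won't appear in the data (a literal tab, say, which RSS titles shouldn't contain), and then let read do the tokenizing. This is a suggestion of mine, not the script's approach; the sample line is hypothetical:

```shell
DELIMITER='@@@'

# A line in the same format as the transformed file
line='Sun, 26 Jan 2025 12:35:00 +0000@@@Some title@@@https://example.com/ep1.mp3@@@'

# Replace every delimiter with a tab, then split on tabs; the trailing
# delimiter yields an empty final field, absorbed by the throwaway variable.
tab=$'\t'
IFS=$tab read -r date title url _ <<< "${line//"$DELIMITER"/$tab}"
```

This carries the same assumption as before -- that the stand-in character never occurs in the data -- but the splitting itself becomes a single, fairly transparent line.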
For each program in the RSS we now have the date, title, and URL. We need to form a filename, which I base on the date and the title: we need the date to ensure that the programs get played in the right order, and the title to make it easier to see what each program is about.
Because the audio player will typically play tracks in alphanumeric order of filename, we can't use the raw date from the RSS file as part of the filename. This is because RSS stipulates dates in RFC2822 format, that is:
Sun, 26 Jan 2025 12:35:00 +0000
If we used this date in the filename, we'd end up with all 'Friday' files being played first, because these would be earliest in alphanumerical order. Instead, we need to write a date in the filename that sorts correctly in date order. That is, we need the year first, then the month, then the day, then the time. We perhaps don't need the time at all, but it's conceivable that some podcasts will publish multiple programs in the same day.
Fortunately, the Linux date utility will read RFC2822 dates, and we
can output the date in any format we like, thus:
sortable_date=$(date -d "$date" +%Y-%m-%d_%H_%M)
The format we use for the date isn't important, so long as it's sortable; it's helpful if it's human-readable as well, otherwise we could just use the Unix epoch date, which is a simple integer.
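For the record, here is what that conversion produces for a sample RFC2822 date. I've pinned the output to UTC with -u so the result is reproducible; the script itself uses the local timezone:

```shell
rfc_date='Sun, 26 Jan 2025 12:35:00 +0000'   # sample date in RSS format

# Sortable, human-readable form: year first, then month, day, time
sortable_date=$(date -u -d "$rfc_date" +%Y-%m-%d_%H_%M)   # 2025-01-26_12_35

# The Unix-epoch alternative: still sortable, but not human-readable
epoch=$(date -u -d "$rfc_date" +%s)
```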
We'll form the filename from the date and the program title, but we need to be careful to remove or convert the characters that aren't legal in filenames. For example:
sanitized_title="${title//:/_}"              # Replace : with _
sanitized_title="${sanitized_title//\?/_}"   # Replace ? with _
sanitized_title="${sanitized_title//\*/@}"   # Replace * with @
sanitized_title="${sanitized_title//\"/'}"   # Replace " with '
sortable_title="${sortable_date} ${title}"
sortable_sanitized_title="${sortable_date} ${sanitized_title}"
output_file="${output_dir}/${sortable_sanitized_title}.${extension}"
So at this point we have a stream URL, and a filename under which
to store it, that will sort properly in the audio player. We now need
to download the file; I've found wget works better than
curl here, for reasons that are not clear to me.
wget -O "${output_file}" "${url}"
We only want to download the file if it doesn't already exist. We can also restrict the downloads to a particular date range, if we don't want to download ten years' worth of programs. I won't describe these tests here, because they're trivial, but they're in the full source code.
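The existence and date-range tests amount to something like the sketch below. The function and variable names are mine, not the script's; note that the sortable date format makes a plain lexicographic comparison sufficient:

```shell
from_date='2024-01-01'        # illustrative range bounds, in the same
to_date='2025-12-31_23_59'    # sortable format as the filenames

# Return success (0) only if the program should be fetched
should_download () {
    local file="$1" sortable="$2"
    [[ -e $file ]] && return 1                # already downloaded: skip
    [[ $sortable < $from_date ]] && return 1  # before the range: skip
    [[ $sortable > $to_date ]] && return 1    # after the range: skip
    return 0
}
```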
For the finishing touches, we can tag the MP3 files as we download them. It's almost imperative to write the 'title' tag, because audio players sometimes sort by title in preference to filename. We probably want to write the 'album' tag as well, because audio players typically group tracks by album, rather than assuming that all tracks in a particular directory go together.
id3v2 -t "${sortable_title}" -A "${album}" ... "${output_file}"
And, essentially, that's it. The full script has some additional features; it can, for example, apply a quality reduction to the downloaded file, so the podcast uses less storage. There's also a fair bit of error checking that I haven't shown here.
Further work
One useful addition would be to check that the downloaded stream results in a file of the correct size. If it does not, then we should not store a file at all, so we can try to download it again later. The reason this check might be necessary is that sometimes there is an outage at the podcast host that results in the delivery of an error message, rather than an MP3 file. Or, occasionally, the download will be incomplete. The script won't handle these situations very well because, once there is a file in the output directory with the right name, it won't be overwritten, even if its contents are nonsense.
Making this improvement would require parsing the 'length' fields from the RSS feed, and adding them to the temporary file generated by the XSLT transformation. The rest of the script would then read this value when parsing the temporary file and, after downloading the stream, compare the resulting file's length with the length in the RSS.
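The comparison itself would be short. A sketch -- verify_length is an illustrative name, the expected size in bytes would come from the length attribute of <enclosure>, and stat -c is the GNU coreutils form:

```shell
# Succeed only if the file's size in bytes matches the RSS 'length' value.
verify_length () {
    local file="$1" expected="$2"
    local actual
    actual=$(stat -c %s "$file") || return 1
    [ "$actual" -eq "$expected" ]
}
```

On a mismatch, the script would delete the file, so that the next run attempts the download again.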
It would also be useful to be able to handle streams other than MP3 but, so far, I haven't used any, so I haven't been motivated to write the code. You'd need a tagger for each supported type, such as AtomicParsley for MP4-type files.
Download
The full source for
rsspodfetch is available
from my GitHub repository.
