[gpodder-devel] Non-human readable directory and file names

  • From: chrism at ideareactor.com (Chris McCabe)
  • Date: Wed, 31 Oct 2007 19:33:04 +0100

Here's a quick thought:

How about creating the directory name from the URL of the feed, which 
will always exist?  For example, the podcasts from the feed:
http://www.hbo.com/apps/podcasts/podcast.xml?a=2

would all be saved in the directory (relative to the download directory):
www.hbo.com/apps/podcasts/podcast.xml?a=2/

It would end up creating a few more directory levels than are really 
necessary, but it would be guaranteed to be unique, and would make it 
easy to find the podcasts.  It would also automatically group feeds from 
the same website together.
You would still have the problem of naming each individual podcast from 
that feed, but at least half the problem is solved.
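
A rough Python sketch of that mapping, just to make the idea concrete. The
function name feed_directory is made up for illustration and is not part of
gPodder. The query string is kept as-is, although some file systems (FAT32,
for instance) do not allow "?" in names, so a real implementation would
probably need to escape it.

```python
from urllib.parse import urlsplit


def feed_directory(feed_url):
    """Turn a feed URL into a relative directory path (hypothetical helper)."""
    parts = urlsplit(feed_url)
    # Drop the scheme ("http://"), keep host and path.
    path = parts.netloc + parts.path
    # Keep the query string so that feeds differing only in it stay distinct.
    if parts.query:
        path += "?" + parts.query
    return path + "/"


print(feed_directory("http://www.hbo.com/apps/podcasts/podcast.xml?a=2"))
# -> www.hbo.com/apps/podcasts/podcast.xml?a=2/
```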

For naming the podcasts, one easy scheme would be to name it with the 
release date of the podcast, or if not available, the download date, 
with an extra number to make it unique if necessary.  For example:
2007.10.31.001.mp3

This has the advantage that the alphabetical directory listing will list 
the podcasts in order.  It has the disadvantage that you wouldn't be 
able to match podcasts to filenames with certainty without additional 
information.
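
Something like the following would produce such names. It is only a sketch:
episode_filename and its arguments are invented here, and the caller is
assumed to pass in the release date (or the download date as a fallback)
plus the set of names already used in the feed's directory.

```python
import datetime


def episode_filename(date, existing, extension=".mp3"):
    """Return the first free name of the form YYYY.MM.DD.NNN<extension>."""
    prefix = date.strftime("%Y.%m.%d")
    counter = 1
    while True:
        name = "%s.%03d%s" % (prefix, counter, extension)
        if name not in existing:
            return name
        # Name already taken: bump the counter to keep the name unique.
        counter += 1


print(episode_filename(datetime.date(2007, 10, 31), set()))
# -> 2007.10.31.001.mp3
```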


Just some thoughts.

Chris


Thomas Perl wrote:
> Hello, Jay, Ionut and Pieter!
>
> This mail is not intended to be rude or harsh, I just want to bring up
> real problems with using content from RSS files as base for file naming.
> If you can come up with a stable, sane and secure scheme for creating
> human-readable file names for all possible RSS feeds, please tell me :)
>
> On Wed, 2007-10-31 at 11:03 +0000, Jay Bradley wrote:
>   
>> I was wondering why gpodder stores the downloads in crazily named 
>> directories? I realise that it is partly to ensure unique directories
>> so there are no clashes but it means that it is impossible to browse 
>> through the podcast files manually. I know I can sync to a filesystem
>> so I do this for my mp3 player but I also normally use a soft link to
>> the podcast downloads directory for my mythtv installation as well. 
>> Currently I'm changing the device directory and syncing to my mp3
>> player and changing the device directory again to a separate directory
>> for mythtv. If the directory names were human readable then it would
>> save me a lot of hassle.
>>     
>
> I see you have read the mailing list and are aware of the alternatives
> (MP3 Player sync).
>
> Anyway, this topic has been discussed several times on this list, I
> guess it's time for a FAQ on the gPodder website.. ;)
>
> First of all, here are some relevant postings related to the topic.
> Please read through them to get an overview of what has been proposed
> and discussed already:
>
> https://lists.berlios.de/pipermail/gpodder-devel/2006-November/000283.html
> https://lists.berlios.de/pipermail/gpodder-devel/2007-June/000723.html
> https://lists.berlios.de/pipermail/gpodder-devel/2007-July/000756.html
>
> Script that tries to solve that problem:
>
> http://lists.berlios.de/pipermail/gpodder-devel/2007-August/000911.html 
>
> I'm going to describe the problem you mention a bit further...
>
> Basically, it's hard to create human-readable names because of the
> nature of RSS feeds. It's like with HTML - if browsers were going to
> reject non-standard HTML, all documents on the web would adhere to the
> standards, but thanks to such "useful" features as quirks mode, browsers
> try to fix the shortcomings of bad markup in the parser code.
>
> But the problems with RSS feeds don't lie in bad markup. Most of the
> time, fields are not set (no <title> element in <item>), fields have
> an empty value (<title> exists, but is empty) or fields are badly
> misused (just recently, we had a feed where <title> contained a
> description of the episode, a very long string).
>
> There are two options here:
>
>  a) reject any feeds that have no title, have a too long title or have 
>     some other weird properties that are not usual RSS practice
>  b) accept all feeds and try to make the best of "what we have"
>
> gPodder tries to go the "b)" route and so we have to be prepared to
> accept feeds without <title>. As you can read from the November 2006
> post above (I think one of the initial thoughts about hashed filenames),
> hashing feed and episode URLs always gives us strings that have some
> sane and stable properties:
>
>  1.) (high probability of) uniqueness
>  2.) sane length (even fixed, but at least not empty or too long)
>  3.) sane alphabet (hexadecimal, i.e. only the characters 0-9 and a-f)
>
> So, for every given URL (and _every_ feed has a URL), we have a sane
> "ID" that we can use to identify that feed.
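
The three properties listed above are easy to check in Python. This is a
minimal sketch, not gPodder's code: the helper url_id and the example.com /
example.org feed URLs are made up, and encoding the URL as UTF-8 before
hashing is an assumption.

```python
import hashlib


def url_id(url):
    """Return the MD5 hex digest of a URL (hypothetical helper)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# Hypothetical feed URLs that share the same basename.
feed_a = "http://example.com/a/index.xml"
feed_b = "http://example.org/b/index.xml"

for url in (feed_a, feed_b):
    ident = url_id(url)
    assert len(ident) == 32                      # fixed, sane length
    assert set(ident) <= set("0123456789abcdef")  # hexadecimal alphabet only
    print(ident, url)

assert url_id(feed_a) != url_id(feed_b)  # different URLs, different IDs
assert url_id(feed_a) == url_id(feed_a)  # same URL always gives the same ID
```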
>
> When depending on human-readable strings (i.e. title, etc..) we run into
> several problems:
>
>  i.) what is the directory name of feeds with "<title></title>"??
>  ii.) what is the directory name of feed A with title "radio x podcast" 
>       when there already is a feed B with title "radio x podcast"?
>  iii.) what is the directory name of a feed with a loooong title?
>  iv.) what is the directory name of a feed with Chinese characters as
>       title (off the top of my head, e.g. "???") when using FAT32
>       as file system?
>
> We might be able to create a unique filename for a podcast episode from
> the basename of its url, but is there always a unique basename for the
> podcast feed? It might be "index.xml" or "podcast.rss".
>
>   
>> I never understand why some programs add a layer of complexity which 
>> removes the user one step from their files. I believe programs should
>> be as transparent as possible to allow people to do what they like
>> with the data produced by that program.
>>     
>
> gPodder is transparent in that the user doesn't have to care about the
> directory layout, as the user can use the gPodder GUI to browse and
> listen to feeds - all feed information is displayed in the GUI.
>
> You can always determine feed and episode info for given hashes:
>
>  -> Hash (md5) the URLs in ~/.config/gpodder/channels.opml
>  -> MD5 of URL = directory name of feed
>
>  -> Open the file "index.xml" in the feed download directory
>  -> Hash (md5) the URLs in that file
>  -> MD5 of URL + extension of basename of URL = filename of episode
>
> In pseudo-code, this is something like:
>
> opml_file = $HOME + '/.config/gpodder/channels.opml'
>
> ( ... feed_url is to be obtained from opml_file ... )
> feed_directory = gpodder_download_dir + '/' + md5sum( feed_url )
> feed_index = feed_directory + '/index.xml'
>
> ( ... episode_url is to be obtained from feed_index ... )
> extension = file_extension_of( basename( episode_url ) )
> episode_name = feed_directory + '/' + md5sum ( episode_url ) + extension
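
The pseudo-code above could look roughly like this in Python. It is a sketch
under some assumptions: the function names are invented, the URLs are
encoded as UTF-8 before hashing, and any query string is dropped before the
extension is taken (the pseudo-code does not say how that case is handled).
Reading the feed URLs from channels.opml and the episode URLs from index.xml
is left out.

```python
import hashlib
import os.path
from urllib.parse import urlsplit


def md5sum(text):
    """MD5 hex digest of a string (assumes UTF-8 encoding)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def feed_directory(download_dir, feed_url):
    """Directory of a feed: download dir + MD5 of the feed URL."""
    return os.path.join(download_dir, md5sum(feed_url))


def episode_path(download_dir, feed_url, episode_url):
    """File of an episode: feed dir + MD5 of the episode URL + extension."""
    basename = os.path.basename(urlsplit(episode_url).path)
    extension = os.path.splitext(basename)[1]
    return os.path.join(feed_directory(download_dir, feed_url),
                        md5sum(episode_url) + extension)


# Hypothetical values, for illustration only.
print(episode_path("/home/user/gpodder-downloads",
                   "http://example.com/podcast.rss",
                   "http://example.com/media/episode1.mp3"))
```
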
>
>   
>> Using the non-human readable directory and filenames stops users from
>> accessing their files except through one program (gpodder) which is a
>> shame.
>>     
>
> You can use the above method to find more information (metadata) for the
> files than you can with human readable directories, including the title
> and description of episodes.
>
>   
>> I've looked through the source code but cannot find where the
>> directory names are set. I'm an okay programmer so could do this
>> myself if someone could point me in the right direction. I'd do it to
>> just my local copy if this wasn't something anyone else would be
>> interested in.
>>     
>
> Please, by all means try to do it. If it works for all RSS feeds, I
> would be very happy to merge it into gPodder, as it would be a better
> solution than what we have now. But because of the reasons I mentioned
> above, I am very skeptical whether this is possible at all.
>
> The directory name for channels is determined by the "get_filename"
> function of the class "podcastChannel" in "src/gpodder/libpodcasts.py".
> The attributes that _should_ be available when this function is called
> are "url", "title" and "description" (i.e. "self.title").
>
> The filename for an episode is determined by the "local_filename"
> function of the class "podcastItem" in "src/gpodder/libpodcasts.py".
> Only the "url" attribute is guaranteed to be available. For all other
> properties, the best possible value is extracted from the RSS feed;
> you can expect the "title" value to be somewhat identifying, but not
> unique. You also have to be aware that the "title" value _could_ be very
> long (think of a description field value that has been misplaced).
>
>   
>> I realise there may be some other reason why the names are non-human 
>> readable so if I've missed it then please could someone let me know.
>>     
>
> Apart from the practical reasons I mentioned above, there is no real
> reason why the hashes are chosen. It was a simple and straightforward
> solution to a problem for which we have not yet found a better solution.
>
> It would be quite cool if you could come up with something friendlier :)
>
> If you want, please send the modifications you make to make gPodder's
> directory structure human-readable. It will be a nice-to-have patch for
> interested people :)
>
>
> Thanks and Good Luck!
> Thomas
> _______________________________________________
> gpodder-devel mailing list
> gpodder-devel at lists.berlios.de
> https://lists.berlios.de/mailman/listinfo/gpodder-devel
>   


