[gpodder-devel] Non-human readable directory and file names

  • From: chrism at ideareactor.com (Chris McCabe)
  • Date: Wed, 31 Oct 2007 19:39:07 +0100

Sorry to comment on my own message, but now that I think of it a bit 
more, why not just use the entire URL of the actual podcast episode 
instead of the URL of the feed?  For example, the podcast downloaded from:
http://www.hbo.com/video/podcasts/billmaher/637314_dl.mp3

would be stored at the following location in the filesystem:
www.hbo.com/video/podcasts/billmaher/637314_dl.mp3

Every podcast episode must have its own URL, so the resulting path 
always exists and is always unique.
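
A minimal sketch of that mapping in Python, just to make the idea
concrete. The helper name url_to_relative_path is made up for this
example and is not part of gPodder; it only drops the URL scheme and
keeps host, path and query string:

```python
from urllib.parse import urlsplit


def url_to_relative_path(url):
    """Map a URL to a relative path by dropping the scheme.

    Hypothetical helper for illustration only, not gPodder code.
    """
    parts = urlsplit(url)
    # Host name followed by the path, e.g. "www.hbo.com" + "/video/..."
    path = parts.netloc + parts.path
    # Keep the query string so that feeds differing only in "?a=2" stay distinct
    if parts.query:
        path += "?" + parts.query
    return path


print(url_to_relative_path(
    "http://www.hbo.com/video/podcasts/billmaher/637314_dl.mp3"))
```

Note that the result is used as-is, so characters such as "?" end up in
the path. FAT32 does not allow "?" in file names, so a real
implementation would still have to escape or replace such characters.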

Chris

Chris McCabe wrote:
> Here's a quick thought:
>
> How about creating the directory name from the URL of the feed, which 
> will always exist.  So for example, the podcasts from the feed:
> http://www.hbo.com/apps/podcasts/podcast.xml?a=2
>
> would all be saved in the directory (relative to the download directory):
> www.hbo.com/apps/podcasts/podcast.xml?a=2/
>
> It would end up creating a few more directory levels than are really 
> necessary, but it would be guaranteed to be unique, and would make it 
> easy to find the podcasts.  It would also automatically group feeds 
> from the same website together.
> You would still have the problem of naming each individual podcast 
> from that feed, but at least half the problem is solved.
>
> For naming the podcasts, one easy scheme would be to name it with the 
> release date of the podcast, or if not available, the download date, 
> with an extra number to make it unique if necessary.  For example:
> 2007.10.31.001.mp3
>
> This has the advantage that the alphabetical directory listing will 
> list the podcasts in order.  It has the disadvantage that you wouldn't 
> be able to match with certainty podcasts to filenames without 
> additional information.
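
The date-based naming above could be sketched like this in Python. The
function name dated_filename and the "existing" set of names already in
use are assumptions made for this example:

```python
import datetime


def dated_filename(date, existing, extension=".mp3"):
    """Return the first unused name of the form YYYY.MM.DD.NNN<extension>.

    Hypothetical helper for illustration only; "existing" is the set of
    file names already present in the feed directory.
    """
    prefix = date.strftime("%Y.%m.%d")
    counter = 1
    while True:
        # Zero-padded counter keeps alphabetical order equal to numeric order
        name = "%s.%03d%s" % (prefix, counter, extension)
        if name not in existing:
            return name
        counter += 1


taken = {"2007.10.31.001.mp3"}
print(dated_filename(datetime.date(2007, 10, 31), taken))
```

Because the counter is zero-padded to three digits, alphabetical order
only matches release order for up to 999 episodes per day, which should
be plenty.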
>
>
> Just some thoughts.
>
> Chris
>
>
> Thomas Perl wrote:
>> Hello, Jay, Ionut and Pieter!
>>
>> This mail is not intended to be rude or harsh, I just want to bring up
>> real problems with using content from RSS files as base for file naming.
>> If you can come up with a stable, sane and secure scheme for creating
>> human-readable file names for all possible RSS feeds, please tell me :)
>>
>> On Wed, 2007-10-31 at 11:03 +0000, Jay Bradley wrote:
>>  
>>> I was wondering why gpodder stores the downloads in crazily named 
>>> directories? I realise that it is partly to ensure unique directories
>>> so there are no clashes but it means that it is impossible to browse 
>>> through the podcast files manually. I know I can sync to a filesystem
>>> so I do this for my mp3 player but I also normally use a soft link to
>>> the podcast downloads directory for my mythtv installation as well. 
>>> Currently I'm changing the device directory and syncing to my mp3
>>> player and changing the device directory again to a separate directory
>>> for mythtv. If the directory names were human readable then it would
>>> save me a lot of hassle.
>>>     
>>
>> I see you have read the mailing list and are aware of the alternatives
>> (MP3 Player sync).
>>
>> Anyway, this topic has been discussed several times on this list, I
>> guess it's time for a FAQ on the gPodder website.. ;)
>>
>> First of all, here are some relevant postings related to the topic.
>> Please read through them to get an overview of what has been proposed
>> and discussed already:
>>
>> https://lists.berlios.de/pipermail/gpodder-devel/2006-November/000283.html 
>>
>> https://lists.berlios.de/pipermail/gpodder-devel/2007-June/000723.html
>> https://lists.berlios.de/pipermail/gpodder-devel/2007-July/000756.html
>>
>> Script that tries to solve that problem:
>>
>> http://lists.berlios.de/pipermail/gpodder-devel/2007-August/000911.html
>>
>> I'm going to describe the problem you mention a bit further...
>>
>> Basically, it's hard to create human-readable names because of the
>> nature of RSS feeds. It's like with HTML - if browsers were going to
>> reject non-standard HTML, all documents on the web would adhere to the
>> standards, but thanks to such "useful" features as quirks mode, browsers
>> try to fix the shortcomings of bad markup in the parser code.
>>
>> But the problem with RSS feeds doesn't lie in bad markup. Most of the
>> time, fields are not set (no <title> element in <item>), fields have
>> an empty value (<title> exists, but is empty) or fields are used in
>> very stupid ways (just recently, we had a feed where <title> contained
>> a description of the episode, a very long string).
>>
>> There are two options here:
>>
>>  a) reject any feeds that have no title, have a too long title or have
>>     some other weird properties that are not usual RSS practice
>>  b) accept all feeds and try to make the best of "what we have"
>>
>> gPodder tries to go the "b)" route and so we have to be prepared to
>> accept feeds without <title>. As you can read from the November 2006
>> post above (I think one of the initial thoughts about hashed filenames),
>> hashing feed and episode URLs always gives us strings that have some
>> sane and stable properties:
>>
>>  1.) (high probability of) uniqueness
>>  2.) sane length (even fixed, but at least not empty or too long)
>>  3.) sane alphabet (hexadecimal, i.e. only the characters 0-9 and a-f)
>>
>> So, for every given URL (and _every_ feed has a URL), we have a sane
>> "ID" that we can use to identify that feed.
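
The three properties listed above are easy to check with a few lines of
Python. This is only a sketch: the name url_id is made up, and encoding
the URL as UTF-8 before hashing is an assumption of this example, not
necessarily what gPodder does:

```python
import hashlib


def url_id(url):
    """Return the MD5 hex digest of a URL (hypothetical helper)."""
    # Assumption: the URL is encoded as UTF-8 before hashing
    return hashlib.md5(url.encode("utf-8")).hexdigest()


feed_a = url_id("http://www.hbo.com/apps/podcasts/podcast.xml?a=2")
feed_b = url_id("http://www.hbo.com/apps/podcasts/podcast.xml?a=3")

print(len(feed_a), len(feed_b))            # fixed length, never empty
print(set(feed_a) <= set("0123456789abcdef"))  # hexadecimal alphabet only
print(feed_a != feed_b)                    # different URLs, different IDs
```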
>>
>> When depending on human-readable strings (i.e. title, etc..) we run into
>> several problems:
>>
>>  i.) what is the directory name of feeds with "<title></title>"??
>>  ii.) what is the directory name of feed A with title "radio x podcast"
>>       when there already is a feed B with title "radio x podcast"?
>>  iii.) what is the directory name of a feed with a loooong title?
>>  iv.) what is the directory name of a feed with Chinese characters as
>>       title (off the top of my head, e.g. "???") when using FAT32 as
>>       the file system?
>>
>> We might be able to create a unique filename for a podcast episode from
>> the basename of its URL, but is there always a unique basename for the
>> podcast feed? It might be "index.xml" or "podcast.rss".
>>
>>  
>>> I never understand why some programs add a layer of complexity which 
>>> removes the user one step from their files. I believe programs should
>>> be as transparent as possible to allow people to do what they like
>>> with the data produced by that program.
>>>     
>>
>> gPodder is transparent in that the user doesn't have to care about the
>> directory layout, as the user can use the gPodder GUI to browse and
>> listen to feeds - all feed information is displayed in the GUI.
>>
>> You can always determine feed and episode info for given hashes:
>>
>>  -> Hash (md5) the URLs in ~/.config/gpodder/channels.opml
>>  -> MD5 of URL = directory name of feed
>>
>>  -> Open the file "index.xml" in the feed download directory
>>  -> Hash (md5) the URLs in that file
>>  -> MD5 of URL + extension of basename of URL = filename of episode
>>
>> In pseudo-code, this is something like:
>>
>> opml_file = $HOME + '/.config/gpodder/channels.opml'
>>
>> ( ... feed_url is to be obtained from opml_file ... )
>> feed_directory = gpodder_download_dir + '/' + md5sum( feed_url )
>> feed_index = feed_directory + '/index.xml'
>>
>> ( ... episode_url is to be obtained from feed_index ... )
>> extension = file_extension_of( basename( episode_url ) )
>> episode_name = feed_directory + '/' + md5sum ( episode_url ) + extension
>>
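
For reference, the pseudo-code above translates roughly into the
following Python. This is a sketch under assumptions, not gPodder's
actual code: the function names are made up, the URLs are assumed to
have been read from channels.opml and index.xml already, the URL is
encoded as UTF-8 before hashing, and the extension is taken from the
path component of the episode URL (ignoring any query string):

```python
import hashlib
import os.path
import posixpath
from urllib.parse import urlsplit


def md5sum(text):
    """MD5 hex digest of a string (UTF-8 encoding is an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def feed_directory(download_dir, feed_url):
    """Directory of a feed: download directory + MD5 of the feed URL."""
    return os.path.join(download_dir, md5sum(feed_url))


def episode_filename(download_dir, feed_url, episode_url):
    """File of an episode: feed directory + MD5 of episode URL + extension."""
    # Extension of the basename of the URL path, e.g. ".mp3" (may be empty)
    basename = posixpath.basename(urlsplit(episode_url).path)
    extension = posixpath.splitext(basename)[1]
    return os.path.join(feed_directory(download_dir, feed_url),
                        md5sum(episode_url) + extension)


# "/tmp/gpodder-downloads" is a placeholder download directory
print(episode_filename(
    "/tmp/gpodder-downloads",
    "http://www.hbo.com/apps/podcasts/podcast.xml?a=2",
    "http://www.hbo.com/video/podcasts/billmaher/637314_dl.mp3"))
```

To go the other way (from a hashed name back to a feed or episode), hash
every known URL this way and compare the result with the directory or
file name.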
>>  
>>> Using the non-human readable directory and filenames stops users from
>>> accessing their files except through one program (gpodder) which is a
>>> shame.
>>>     
>>
>> You can use the above method to find more information (metadata) for the
>> files than you can with human readable directories, including the title
>> and description of episodes.
>>
>>  
>>> I've looked through the source code but cannot find where the
>>> directory names are set. I'm an okay programmer so could do this
>>> myself if someone could point me in the right direction. I'd do it to
>>> just my local copy if this wasn't something anyone else would be
>>> interested in.
>>>     
>>
>> Please, by all means try to do it. If it works for all RSS feeds, I
>> would be very happy to merge it into gPodder, as it would be a better
>> solution than what we have now. But because of the reasons I mentioned
>> above, I am very skeptical that this is possible at all.
>>
>> The directory name for channels is determined by the "get_filename"
>> function of the class "podcastChannel" in "src/gpodder/libpodcasts.py".
>> The attributes that _should_ be available when this function is called
>> are "url", "title" and "description" (i.e. "self.title").
>>
>> The filename for an episode is determined by the "local_filename"
>> function of the class "podcastItem" in "src/gpodder/libpodcasts.py".
>> Only the "url" attribute is guaranteed to be available. For all other
>> properties, the best possible value is extracted from the RSS feed, so
>> you can expect the "title" value to be somewhat identifying, but not
>> unique. You also have to be aware that the "title" value _could_ be very
>> long (think of a description field value that has been misplaced).
>>
>>  
>>> I realise there may be some other reason why the names are non-human 
>>> readable so if I've missed it then please could someone let me know.
>>>     
>>
>> Apart from the practical reasons I mentioned above, there is no real
>> reason why the hashes are chosen. It was a simple and straightforward
>> solution to a problem for which we have not yet found a better solution.
>>
>> It would be quite cool if you could come up with something friendlier :)
>>
>> If you want, please send the modifications you make to make gPodder's
>> directory structure human-readable. It will be a nice-to-have patch for
>> interested people :)
>>
>>
>> Thanks and Good Luck!
>> Thomas
>> _______________________________________________
>> gpodder-devel mailing list
>> gpodder-devel at lists.berlios.de
>> https://lists.berlios.de/mailman/listinfo/gpodder-devel
>>   
>
>


