[dokuwiki] PHP Script to help migrate to UTF-8 file/directory names

  • From: Daniel Dupriest <kououken@xxxxxxxxx>
  • To: dokuwiki@xxxxxxxxxxxxx
  • Date: Fri, 15 Oct 2010 08:49:25 +0900

Howdy DokuWiki folks!

I'm writing the mailing list to ask for help in developing a PHP script to
migrate URL-encoded wiki page names to the new filename encodings offered by
the latest release candidate (such as UTF-8!). I have used DokuWiki for years,
primarily because of the fantastic simplicity of its folder hierarchy and
flat-file storage, and the knowledge that all the data is safely stored in
plain text files. The one gripe I had, and a small one at that, was the
URL-encoded file and directory names.

To give you a little background, I am currently using Dokuwiki to run a
bilingual knowledge-base for staff at an international university in Japan.
The wiki is primarily Japanese-based, with Japanese page names/namespaces,
and English translations and definitions inside each page.

I was ecstatic to find that the latest release candidate, "Lazy Sunday",
implements filename encoding! While I understand that URL-encoding filenames
and directories is a safe way to ensure that the wiki will work across a
broad range of file systems and operating systems, in my case the
URL-encoding has resulted in extremely long and unintelligible filenames.
For example:

Pagename:
新たな社会的ニーズに対応した学生支援プログラム

Becomes:
%E6%96%B0%E3%81%9F%E3%81%AA%E7%A4%BE%E4%BC%9A%E7%9A%84%E3%83%8B%E3%83%BC%E3%82%BA%E3%81%AB%E5%AF%BE%E5%BF%9C%E3%81%97%E3%81%9F%E5%AD%A6%E7%94%9F%E6%94%AF%E6%8F%B4%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%A0.txt

...which seems to exceed some Samba share filename length limit, making the
file inaccessible over the network, and which I'm sure causes other problems
I haven't run into yet.

I have tried to find Linux CLI tools to convert these names to UTF-8, but so
far I haven't found any. I believe the most commonly used tool is "convmv",
but it doesn't seem to handle percent encoding. I also tried to tweak some
existing PHP code for recursive file renaming (http://bit.ly/aD7V5w), but
after thinking about all the uppercase/lowercase and full-width/half-width
page names, some with Japanese punctuation mixed in, I realized I don't know
enough about how DokuWiki normalizes its page links, nor do I have the
confidence that any PHP script I run will correctly convert all the names.

I'd like to know if anyone familiar enough with DokuWiki's ins and outs
could create such a script, or whether something similar is already in the
works. Perhaps a helper script like the one used when DokuWiki did the
utf8update? Also, if there is, in fact, a good CLI tool that can do this on
Linux, please let me know!

This is not a mission-critical issue for me, but if there is an automatic
way to convert these thousands of URL-encoded filenames into neat Japanese
text, it would be a real help!

Best regards,

Daniel  ( ̄ー ̄)b
