To Dmitry Katsubo, WC Jones, Christopher Smith, and any interested multilingual DokuWiki users:

This is a report of my experimentation with writing a script to convert URL-encoded filenames into UTF-8-encoded ones, for users who would like to use that option, which has become available in recent DokuWiki releases.

To quickly restate the issue: until the UTF-8 filename encoding feature became available, DokuWiki users working with non-Roman scripts such as Cyrillic, Korean, or in my case Japanese, had to work with file and directory names that look like "%E6%96%B0%E3%81%9F%E3%81%AA%E7%A4%BE%E4%BC%9A%E7%9A%84%E3%83%8B%E3%83%BC%E3%82%BA%E3%81%AB%E5%AF%BE%E5%BF%9C%E3%81%97%E3%81%9F%E5%AD%A6%E7%94%9F%E6%94%AF%E6%8F%B4%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%A0.txt". Yuck. With UTF-8 encoding and a UTF-8-enabled server, this becomes 新たな社会的ニーズに対応した学生支援プログラム, which is not only legible but also fits into zip archives, CD-R filesystems, and other places.

I got some good advice from Christopher Smith:

> You might try the following http://www.dokuwiki.org/tips:convert_to_utf8
>
> The bash + php works. I can't guarantee DokuWiki will like the results. I
> believe it should for the following reasons:
>
> 1. utf8 page name comes into dokuwiki
> 2. dokuwiki strips any character that shouldn't be used
> 3. if necessary dokuwiki encodes the filename
>
> The aim is to reverse an earlier encoding process at 3. Since this is a
> straight encoding from utf8 all we should need to do is decode the result.
> Any "bad" characters in the original utf8 page name were removed at step 2,
> so won't have been included in the encoded result.

After some testing, it turns out that he is absolutely right. DokuWiki does all the hard work of cleaning filenames before they are URL-encoded, so converting them back is actually an easy matter.
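As a quick sanity check of that reasoning: since the encoding was a straight rawurlencode of an already-cleaned name, a plain decode really is the whole job. My script calls php's rawurldecode() for this, but the same decode can be done in pure bash. This is just a sketch; it assumes bash's printf %b and names with no literal backslashes (DokuWiki's cleaned page names never contain any):

```shell
#!/bin/bash
# Percent-decode a name without php: rewrite every "%XX" escape as
# "\xXX" and let bash's printf %b expand the byte escapes.
# (Assumes the name contains no literal backslashes.)
urldecode() { printf '%b' "${1//\%/\\x}"; }

urldecode '%E6%96%B0%E3%81%9F%E3%81%AA'; echo   # prints: 新たな
urldecode 'plain_ascii.txt'; echo               # ASCII names pass through unchanged
```

A name that contains no "%XX" sequences comes back byte-for-byte identical, which is exactly the "Skipped" test the conversion script relies on.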
I wrote a VERY inefficient and slow bash script to do the conversion, and while it does get the job done, I'm sure anyone more familiar with bash and php could improve on it greatly. I have used it to convert several hundred filenames and dozens of nested directories (namespaces) to UTF-8, and so far the results have been great. At first I was worried about how plugins would handle the change, but I guess since DokuWiki has core support for UTF-8-encoded files now it's a non-issue, as they all run fine for me (ditaa, doodle, graphviz, etc.). I do not have extensive media on my wiki, so that is one area I still have not thoroughly tested, and I'm not sure whether this script disturbs things like "created on" dates. But for those brave souls who want to go all-unicode...

Here is the procedure I used to convert to UTF-8:

1) BACKUP EVERYTHING :)

2) Change the fnencode "Method for encoding non-ASCII filenames" setting on the configuration page to "utf-8" and save. All wiki links pointing to non-ASCII pages will temporarily break, now that they point to the non-existent UTF-8 filenames.

3) From the Linux command line, run the url2utf8 script from inside the "/data/pages", "/data/meta" and "/data/attic" directories. You will need to run the script several times if you have URL-encoded namespaces (thanks to my poor programming ability). For example, the script must be run 4x to reach a file nested 3 encoded directories deep. All files will show "Skipped" once they have been converted. This may take some time depending on how large or old the wiki is.

4) Run "php indexer.php -c" from the DokuWiki "/bin" directory to rebuild the index.

5) Test everything.

6) Enjoy your fully UTF-8-enabled wiki!

----

#!/bin/bash
#
# This script finds all normal files in the current directory
# and subdirectories, converting all url-encoded filenames
# to utf-8 encoded names. It then attempts to do the same
# for all directories.
#
# Since there is no logical recursive directory checking, in
# order to reach files and directories nested inside url-
# encoded directories, it must be run multiple times (until
# it returns "Skipped" for all files). I know. It's awfully
# inefficient, but it's safer than trusting my loop logic.
#
############################################################

# First find and convert normal files only with '-type f'
for file in $(find . -type f)
do
    # Pass the name as an argument so characters in it can't
    # break out of the php one-liner
    decoded=$(php -r 'echo rawurldecode($argv[1]);' "$file")
    if [ "$file" == "$decoded" ] ; then
        echo "Skipped $file"
    else
        echo "Renaming $file => $decoded"
        if ! mv -- "$file" "$decoded"; then
            echo "Error renaming $file"
        fi
    fi
done

# Then find and convert directories with '-type d'
for file in $(find . -type d)
do
    decoded=$(php -r 'echo rawurldecode($argv[1]);' "$file")
    if [ "$file" == "$decoded" ] ; then
        echo "Skipped $file"
    else
        echo "Renaming $file => $decoded"
        if ! mv -- "$file" "$decoded"; then
            echo "Error renaming $file"
        fi
    fi
done

----

Ideally, I think the functionality of this script could be included in some kind of helper PHP script and presented in a more end-user-friendly way (not to mention programmed much more intelligently). This would let long-time DokuWiki users running non-English wikis migrate their existing URL-encoded file system to a UTF-8 one. If they want to, of course.

Feedback, ideas, and pointing out of gaping holes in my code/logic would be much appreciated!

Best regards,

-- Daniel ( ̄ー ̄)b

-- 
DokuWiki mailing list - more info at
http://www.dokuwiki.org/mailinglist
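P.S. One idea for getting rid of the multiple-runs requirement: the repeated passes are only needed because renaming a directory changes the paths of everything inside it. "find -depth" lists a directory's contents before the directory itself, so if you decode only the last path component, everything can be done in a single pass. The following is an untested sketch, not the script I actually ran on my wiki (it uses a pure-bash decode instead of php, and builds a tiny throwaway demo tree so it can be tried safely):

```shell
#!/bin/bash
# Sketch of a single-pass variant. 'find -depth' visits a directory's
# contents before the directory itself, so files are renamed while
# their (still-encoded) parent path is still valid, and the parent
# directory is renamed last. Decoding only the basename is what makes
# this safe.

# urldecode without php: rewrite "%XX" as "\xXX" and let bash's
# printf %b expand the byte escapes (assumes no literal backslashes,
# which DokuWiki's cleaned page names never contain).
urldecode() { printf '%b' "${1//\%/\\x}"; }

# Throwaway demo tree: an encoded namespace with an encoded page inside.
demo=$(mktemp -d)
cd "$demo"
mkdir '%E6%96%B0'                              # will become 新
echo 'content' > '%E6%96%B0/%E3%81%9F.txt'     # will become 新/た.txt

find . -depth -print0 | while IFS= read -r -d '' path; do
    dir=${path%/*}
    base=${path##*/}
    decoded=$(urldecode "$base")
    if [ "$base" = "$decoded" ]; then
        echo "Skipped $path"
    else
        echo "Renaming $path => $dir/$decoded"
        mv -- "$path" "$dir/$decoded"
    fi
done

ls 新    # the nested file was reached in a single pass
```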