[dokuwiki] Re: PHP Script to help migrate to UTF-8 file/directory names

  • From: Daniel Dupriest <kououken@xxxxxxxxx>
  • To: dokuwiki@xxxxxxxxxxxxx
  • Date: Wed, 1 Dec 2010 23:05:41 +0900

To Dmitry Katsubo, WC Jones, Christopher Smith, and any interested
multilingual Dokuwiki users:

This is a report on my experiments with writing a script that
converts url-encoded filenames to UTF-8 encoded ones, for users who
would like to use the utf-8 filename option that has become available
in recent Dokuwiki releases.

To quickly restate the issue: until the utf-8 filename encoding
feature became available, Dokuwiki users working with non-Roman
scripts such as Cyrillic, Korean, or in my case Japanese, had to
work with file and directory names that look like
"%E6%96%B0%E3%81%9F%E3%81%AA%E7%A4%BE%E4%BC%9A%E7%9A%84%E3%83%8B%E3%83%BC%E3%82%BA%E3%81%AB%E5%AF%BE%E5%BF%9C%E3%81%97%E3%81%9F%E5%AD%A6%E7%94%9F%E6%94%AF%E6%8F%B4%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%A0.txt".
Yuck. With UTF-8 encoding, and a utf-8 enabled server, this becomes
新たな社会的ニーズに対応した学生支援プログラム.txt, which is not only
legible, but also fits into zip archives, CDR filesystems and other
places.
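
PHP's rawurldecode() performs exactly this conversion, which you can
check from the shell (I've shortened the name above to keep the
example readable):

  $ php -r 'echo rawurldecode($argv[1]), "\n";' '%E6%96%B0%E3%81%9F%E3%81%AA.txt'
  新たな.txt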

I got some good advice from Christopher Smith:

> You might try the following http://www.dokuwiki.org/tips:convert_to_utf8
>
> The bash + php works.  I can't guarantee DokuWiki will like the results.  I 
> believe it should for the following reasons:
>
> 1. utf8 page name comes into dokuwiki
> 2. dokuwiki strips any character that shouldn't be used
> 3. if necessary dokuwiki encodes the filename
>
> The aim is to reverse an earlier encoding process at 3.  Since this is a 
> straight encoding from utf8 all we should need to do is decode the result.  
> Any "bad" characters in the original utf8 page name were removed at step 2, 
> so won't have been included in the encoded result.
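
To make his point 3 concrete: assuming the on-disk names were produced
by something equivalent to PHP's rawurlencode() (the example name
above looks exactly like its output), rawurldecode() is its exact
inverse:

  $ php -r 'echo rawurlencode($argv[1]), "\n";' '新たな.txt'
  %E6%96%B0%E3%81%9F%E3%81%AA.txt
  $ php -r 'echo rawurldecode(rawurlencode($argv[1])), "\n";' '新たな.txt'
  新たな.txt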

After some testing, it turns out that he is absolutely right. Dokuwiki
does all the hard work of cleaning filenames before they're
url-encoded, so converting them back is actually easy. I wrote a VERY
inefficient and slow bash script to do the conversion, and while it
does get the job done, I'm sure anyone more familiar with bash and PHP
could improve on it greatly.

I have used it to convert several hundred filenames and dozens of
nested directories (namespaces) to UTF-8, and so far the results have
been great. At first I was worried about how plugins would handle the
change, but since Dokuwiki now has core support for UTF-8 encoded
files it seems to be a non-issue; they all run fine for me (ditaa,
doodle, graphviz, etc.). I do not have extensive media on my wiki, so
that is one area I have not tested thoroughly, and I'm not sure
whether this script disturbs things like "created on" dates. But for
those brave souls who want to go all-unicode...

Here is the procedure I used to convert to UTF-8:

1) BACKUP EVERYTHING :)

2) Change the fnencode "Method for encoding non-ASCII filenames"
setting on the configuration page to "utf-8" and save. All wiki links
pointing to non-ASCII pages will temporarily break, because they now
point to utf-8 filenames that don't exist yet.

3) From the Linux command line, run the url2utf8 script from inside
the "/data/pages", "/data/meta" and "/data/attic" directories (see the
example session after this list). You will need to run the script
several times if you have url-encoded namespaces (thanks to my poor
programming ability). For example, the script must be run four times
to reach a file nested three encoded directories deep. All files will
show "Skipped" once they have been converted. This may take some time
depending on how large or old the wiki is.

3) Run "php indexer.php -c" from the Dokuwiki "/bin" directory to
rebuild the index.

5) Test everything.

6) Enjoy your fully utf-8 enabled wiki!
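
For concreteness, the whole session looks roughly like this. The wiki
root /var/www/dokuwiki and the script location ~/url2utf8 are just
placeholders for your own paths:

  cd /var/www/dokuwiki

  # Sanity-check that saving the config page wrote the setting
  # (local overrides normally live in conf/local.php):
  grep fnencode conf/local.php
  # expected: $conf['fnencode'] = 'utf-8';

  # Convert each data directory; four passes are enough for
  # files nested three encoded namespaces deep.
  for d in data/pages data/meta data/attic; do
          ( cd "$d" && for pass in 1 2 3 4; do bash ~/url2utf8; done )
  done

  # Rebuild the search index.
  cd bin && php indexer.php -c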

----

#!/bin/bash
#
# This script finds all normal files in the current directory
# and subdirectories, converting all url-encoded filenames
# to utf-8 encoded names. It then attempts to do the same
# for all directories.
#
# Since there is no logical recursive directory-checking, in
# order to reach files and directories nested inside url-
# encoded directories, it must be run multiple times (until
# it returns "Skipped" for all files). I know. It's awfully
# inefficient, but it's safer than trusting my loop logic.
#
############################################################

# First find and convert normal files only with '-type f'.
# -print0 with 'read -d' keeps names containing spaces or
# other odd characters in one piece.

find . -type f -print0 | while IFS= read -r -d '' file
do
        # Pass the name as an argument rather than splicing it
        # into the PHP source, so quotes in names can't break it.
        decoded=$(php -r 'echo rawurldecode($argv[1]);' "$file")
        if [ "$file" == "$decoded" ] ; then
                echo "Skipped $file"
        else
                echo "Renaming $file => $decoded"
                if ! mv -- "$file" "$decoded"; then
                        echo "Error renaming $file"
                fi
        fi
done

# Then find and convert directories with '-type d'

find . -type d -print0 | while IFS= read -r -d '' file
do
        decoded=$(php -r 'echo rawurldecode($argv[1]);' "$file")
        if [ "$file" == "$decoded" ] ; then
                echo "Skipped $file"
        else
                echo "Renaming $file => $decoded"
                if ! mv -- "$file" "$decoded"; then
                        echo "Error renaming $file"
                fi
        fi
done

----

Ideally, I think the functionality of this script could be included in
some kind of helper php script, and presented in a more end
user-friendly way (not to mention programmed much more intelligently).
This would let long-time Dokuwiki users running non-English wikis
migrate their existing url-encoded file system to a utf-8 one. If they
want to, of course.
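
Since I'm inviting improvements anyway, here is one untested idea for
the "programmed much more intelligently" part: 'find -depth' lists a
directory's contents before the directory itself, and decoding only
the last path component means a rename never invalidates paths still
waiting in the pipe, so nested namespaces should convert in a single
pass. Treat it as a sketch, not something I've run on a real wiki:

----

#!/bin/bash
# One-pass variant (UNTESTED sketch). Thanks to -depth, every
# file and subdirectory is renamed before its parent directory,
# and only the basename is decoded, so a single run should
# handle arbitrarily nested url-encoded namespaces.

find . -depth -print0 | while IFS= read -r -d '' path
do
        dir=$(dirname -- "$path")
        base=$(basename -- "$path")
        decoded=$(php -r 'echo rawurldecode($argv[1]);' "$base")
        if [ "$base" != "$decoded" ] ; then
                echo "Renaming $path => $dir/$decoded"
                if ! mv -- "$path" "$dir/$decoded"; then
                        echo "Error renaming $path"
                fi
        fi
done

----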

Feedback, ideas and pointing out of gaping holes in my code/logic
would be much appreciated!

Best regards,
--
Daniel  ( ̄ー ̄)b
--
DokuWiki mailing list - more info at
http://www.dokuwiki.org/mailinglist
