[dokuwiki] Re: PHP Script to help migrate to UTF-8 file/directory names

  • From: Dmitry Katsubo <dma_k@xxxxxxx>
  • To: dokuwiki@xxxxxxxxxxxxx
  • Date: Sat, 04 Dec 2010 12:02:58 +0100

On 01.12.2010 15:05, Daniel Dupriest wrote:
> To Dmitry Katsubo, WC Jones, Christopher Smith, and any interested
> multilingual Dokuwiki users:
> 
> This is a report of my experimentation with writing a script to
> convert url-encoded filenames into UTF-8-encoded ones, for users who
> would like to use the option that has become available in recent
> Dokuwiki releases.
> 
> To quickly restate the issue: until the utf-8 filename encoding
> feature became available, Dokuwiki users working with non-roman
> scripts and languages such as Cyrillic, Korean or, in my case,
> Japanese, had to work with file and directory names that look like
> "%E6%96%B0%E3%81%9F%E3%81%AA%E7%A4%BE%E4%BC%9A%E7%9A%84%E3%83%8B%E3%83%BC%E3%82%BA%E3%81%AB%E5%AF%BE%E5%BF%9C%E3%81%97%E3%81%9F%E5%AD%A6%E7%94%9F%E6%94%AF%E6%8F%B4%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%A0.txt".
> Yuck. With UTF-8 encoding, and a utf-8 enabled server, this becomes
> 新たな社会的ニーズに対応した学生支援プログラム, which is not only legible, but also fits
> into zip archives, CD-R filesystems and other places.
> 
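> As a quick illustration of what the conversion below actually does (it
> leans entirely on PHP's rawurldecode(); the page title here is just the
> example from above):
> 
> <?php
> // Build the %XX-encoded form of a UTF-8 page name, then decode it back.
> $utf8    = '新たな社会的ニーズに対応した学生支援プログラム';
> $encoded = rawurlencode($utf8) . '.txt';
> echo $encoded, "\n";               // %E6%96%B0%E3%81%9F...%E3%83%A0.txt
> echo rawurldecode($encoded), "\n"; // the readable Japanese name + ".txt"
> 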
> I got some good advice from Christopher Smith:
> 
>> You might try the following http://www.dokuwiki.org/tips:convert_to_utf8
>>
>> The bash + php works.  I can't guarantee DokuWiki will like the
>> results.  I believe it should for the following reasons:
>>
>> 1. utf8 page name comes into dokuwiki
>> 2. dokuwiki strips any character that shouldn't be used
>> 3. if necessary dokuwiki encodes the filename
>>
>> The aim is to reverse an earlier encoding process at 3.  Since this
>> is a straight encoding from utf8 all we should need to do is decode
>> the result.  Any "bad" characters in the original utf8 page name
>> were removed at step 2, so won't have been included in the encoded
>> result.
> 
> After some testing, it turns out that he is absolutely right. Dokuwiki
> does all the hard work of cleaning filenames before they're
> url-encoded, so converting them back is actually an easy matter. I
> wrote a VERY inefficient and slow bash script to do the conversion,
> and while it does get the job done, I'm sure anyone more familiar with
> bash and php could improve on it greatly.
> 
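> One property worth spelling out, since it is what makes re-running the
> script harmless: rawurldecode() leaves a name without %XX sequences
> untouched, and that is exactly what the "Skipped" check below relies
> on. A tiny demonstration with made-up names:
> 
> <?php
> // Names that were never encoded, or were decoded on an earlier run,
> // come back unchanged and are therefore skipped.
> var_dump(rawurldecode('start') === 'start');            // bool(true)
> var_dump(rawurldecode('新たな社会') === '新たな社会');    // bool(true)
> // An encoded name does change, so it gets renamed exactly once.
> var_dump(rawurldecode('%E6%96%B0') === '%E6%96%B0');     // bool(false)
> // (A never-encoded name containing a literal %XX sequence would be a
> // false positive, but DokuWiki's name cleaning should not produce one.)
> 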
> I have used it to convert several hundred filenames and dozens of
> nested directories (namespaces) to UTF-8, and so far the results have
> been great. At first I was worried about how plugins would handle the
> change, but I guess since Dokuwiki has core support for UTF-8 encoded
> files now it's a non-issue, as they all run fine for me (ditaa,
> doodle, graphviz, etc). I do not have extensive media on my wiki, so
> that is one area I still have not thoroughly tested, and I'm not sure
> if this script disturbs things like "created on" dates, but for those
> brave souls who want to go all-unicode...
> 
> Here is the procedure I used to convert to UTF-8:
> 
> 1) BACKUP EVERYTHING :)
> 
> 2) Change the fnencode "Method for encoding non-ASCII filenames"
> setting on the configuration page to "utf-8" and save (see the note
> after this list for the equivalent conf/local.php change). All wiki
> links pointing to non-ASCII pages will temporarily break, since they
> now point to utf-8 filenames that do not exist yet.
> 
> 3) From the linux command line, run the url2utf8 script from inside
> the "/data/pages", "/data/meta" and "/data/attic" directories. You
> will need to run the script several times if you have url-encoded
> namespaces (thanks to my poor programming ability). For example, the
> script must be run 4x to reach a file nested 3 encoded directories
> deep. All files will show "Skipped" once they have been converted.
> This may take some time depending on how large or old the wiki is.
> 
> 4) Run "php indexer.php -c" from the Dokuwiki "/bin" directory to
> rebuild the index.
> 
> 5) Test everything.
> 
> 6) Enjoy your fully utf-8 enabled wiki!
> 
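> A note on step 2: if I remember correctly, the same fnencode change can
> also be made directly in conf/local.php instead of through the admin
> page, along these lines (I have not tested this route myself):
> 
> <?php
> // conf/local.php -- store non-ASCII page and media names as raw UTF-8
> // rather than url-encoding them.
> $conf['fnencode'] = 'utf-8';
> 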
> ----
> 
> #!/bin/bash
> #
> # This script finds all normal files in the current directory
> # and subdirectories, converting all url-encoded filenames
> # to utf-8 encoded names. It then attempts to do the same
> # for all directories.
> #
> # Since there is no logical recursive directory-checking, in
> # order to reach files and directories nested inside url-
> # encoded directories, it must be run multiple times (until
> # it returns "Skipped" for all files). I know. It's awfully
> # inefficient, but it's safer than trusting my loop logic.
> #
> ############################################################
> 
> # First find and convert normal files only with '-type f'
> 
> for file in $(find . -name "*" -type f)
> do
>         decoded=$(php -r "echo rawurldecode('$file');")
>         if [ "$file" == "$decoded" ] ; then
>                 echo "Skipped $file"
>         else
>                 echo "Renaming $file => $decoded"
>                 if ! mv "$file" "$decoded"; then
>                         echo "Error renaming $file"
>                 fi
>         fi
> done
> 
> # Then find and convert directories with '-type d'
> 
> for file in $(find . -name "*" -type d)
> do
>         decoded=$(php -r "echo rawurldecode('$file');")
>         if [ "$file" == "$decoded" ] ; then
>                 echo "Skipped $file"
>         else
>                 echo "Renaming $file => $decoded"
>                 if ! mv "$file" "$decoded"; then
>                         echo "Error renaming $file"
>                 fi
>         fi
> done
> 
> ----
> 
> Ideally, I think the functionality of this script could be included in
> some kind of helper php script, and presented in a more end-user-
> friendly way (not to mention programmed much more intelligently). This
> would let long-time Dokuwiki users running non-English wikis migrate
> their existing url-encoded file systems to utf-8 ones. If they want
> to, of course.
> 
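> To sketch what I have in mind (completely untested, no DokuWiki
> integration, error handling kept to a minimum; the depth-first walk is
> just one way to avoid renaming a directory before its contents):
> 
> <?php
> // url2utf8.php -- rough sketch of a one-pass migration helper.
> // Usage: php url2utf8.php /path/to/dokuwiki/data/pages
> if ($argc < 2) {
>     fwrite(STDERR, "Usage: php url2utf8.php <directory>\n");
>     exit(1);
> }
> 
> $walker = new RecursiveIteratorIterator(
>     new RecursiveDirectoryIterator(rtrim($argv[1], '/'),
>                                    FilesystemIterator::SKIP_DOTS),
>     RecursiveIteratorIterator::CHILD_FIRST  // children before parent dirs
> );
> 
> $renamed = 0;
> foreach ($walker as $entry) {
>     $decoded = rawurldecode($entry->getFilename());
>     if ($decoded === $entry->getFilename()) {
>         continue;                           // nothing encoded in this name
>     }
>     $target = $entry->getPath() . '/' . $decoded;
>     if (file_exists($target) || !rename($entry->getPathname(), $target)) {
>         fwrite(STDERR, "Could not rename {$entry->getPathname()}\n");
>         continue;
>     }
>     $renamed++;
> }
> echo "$renamed entries renamed.\n";
> 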
> Feedback, ideas and pointing out of gaping holes in my code/logic
> would be much appreciated!
> 
> Best regards,
> --
> Daniel  ( ̄ー ̄)b
> --

Hey Daniel! Great work. The steps worked smoothly for me, and at first
glance everything is OK. I have been waiting for this moment for a long
time... I do expect some plugins, like pagemove, to cause trouble, but
hopefully the plugin authors will revise their code before long.

Also, after step (1) I removed all *.indexed files, because some of
them had ended up with strange names:

find data/meta/ -iname '*.indexed' -exec rm '{}' \;

I am attaching a Perl script which does the renaming in one go. Just
navigate to the dokuwiki/data directory and run it:

pushd /some/dokuwiki/data && ~/convert_names_to_utf8.pl

-- 
With best regards,
Dmitry
#!/usr/bin/perl
#
# Renames url-encoded DokuWiki files and directories to their UTF-8
# names in a single pass. finddepth() visits entries depth-first
# (children before their parent directories), so renaming a directory
# never invalidates paths that are still waiting to be processed.

use strict;
use warnings;

use URI::Escape;
use File::Find;

# Directories to process, relative to dokuwiki/data, unless others are
# given on the command line.
my @dirs = (scalar(@ARGV) == 0 ? qw(attic media meta pages) : @ARGV);

my ($total_files, $total_dirs) = (0, 0);

finddepth(sub {
        my $name = uri_unescape($_);

        if ($name ne $_) {
                if (-d $_) {
                        $total_dirs++;
                } else {
                        $total_files++;
                }
                rename $_, $name or die "Failed to rename $_: $!\n";
        }
}, @dirs);

print "$total_files files and $total_dirs dirs renamed.\n";
