[dokuwiki] Re: PHP Script to help migrate to UTF-8 file/directory names
- From: Dmitry Katsubo <dma_k@xxxxxxx>
- To: dokuwiki@xxxxxxxxxxxxx
- Date: Sat, 04 Dec 2010 12:02:58 +0100
On 01.12.2010 15:05, Daniel Dupriest wrote:
> To Dmitry Katsubo, WC Jones, Christopher Smith, and any interested
> multilingual Dokuwiki users:
>
> This is a report of my experimentation with writing a script to
> convert url-encoded filenames into UTF-8 encoded for users who would
> like to use that option which has become available in recent Dokuwiki
> releases.
>
> To quickly restate the issue, until the utf-8 filename encoding
> feature became available, Dokuwiki users working with non-roman
> languages such as cyrillic, Korean, or in my case, Japanese, had to
> work with file and directory names that look like
> "%E6%96%B0%E3%81%9F%E3%81%AA%E7%A4%BE%E4%BC%9A%E7%9A%84%E3%83%8B%E3%83%BC%E3%82%BA%E3%81%AB%E5%AF%BE%E5%BF%9C%E3%81%97%E3%81%9F%E5%AD%A6%E7%94%9F%E6%94%AF%E6%8F%B4%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%A0.txt".
> Yuck. With UTF-8 encoding, and a utf-8 enabled server, this becomes
> 新たな社会的ニーズに対応した学生支援プログラム, which not only is legible, but also fits into
> zip archives, CDR filesystems and other places.
>
> I got some good advice from Christopher Smith:
>
>> You might try the following http://www.dokuwiki.org/tips:convert_to_utf8
>>
>> The bash + php works. I can't guarantee DokuWiki will like the
>> results. I believe it should for the following reasons:
>>
>> 1. utf8 page name comes into dokuwiki
>> 2. dokuwiki strips any character that shouldn't be used
>> 3. if necessary dokuwiki encodes the filename
>>
>> The aim is to reverse an earlier encoding process at 3. Since this
>> is a straight encoding from utf8 all we should need to do is decode
>> the result. Any "bad" characters in the original utf8 page name
>> were removed at step 2, so won't have been included in the encoded
>> result.
>
> After some testing, it turns out that he is absolutely right. Dokuwiki
> does all the hard work of cleaning filenames before they're
> url-encoded, so converting them back is actually an easy matter. I
> wrote a VERY inefficient and slow bash script to do the conversion,
> and while it does get the job done I'm sure anyone more familiar with
> bash and php could improve on it greatly.
>
> I have used it to convert several hundred filenames and dozens of
> nested directories (namespaces) to UTF-8, and so far the results have
> been great. At first I was worried about how plugins would handle the
> change, but I guess since Dokuwiki has core support for UTF-8 encoded
> files now it's a non-issue, as they all run fine for me (ditaa,
> doodle, graphviz, etc). I do not have extensive media on my wiki, so
> that is one area I still have not thoroughly tested, and I'm not sure
> if this script disturbs things like "created on" dates, but for those
> brave souls who want to go all-unicode...
>
> Here is the procedure I used to convert to UTF-8:
>
> 1) BACKUP EVERYTHING :)
>
> 2) Change the fnencode "Method for encoding non-ASCII filenames"
> setting on the configuration page to "utf-8" and save. All wiki links
> pointing to non-ASCII pages will temporarily break now that they point
> to the non-existent utf-8 filenames.
>
> 3) From the Linux command line, run the url2utf8 script from inside
> the "/data/pages", "/data/meta" and "/data/attic" directories. You
> will need to run the script several times if you have url-encoded
> namespaces (thanks to my poor programming ability). For example, the
> script must be run 4x to reach a file nested 3 encoded directories
> deep. All files will show "Skipped" once they have been converted.
> This may take some time depending on how large or old the wiki is.
>
> 4) Run "php indexer.php -c" from the Dokuwiki "/bin" directory to
> rebuild the index.
>
> 5) Test everything.
>
> 6) Enjoy your fully utf-8 enabled wiki!
>
> ----
>
> #!/bin/bash
> #
> # This script finds all normal files in the current directory
> # and subdirectories, converting all url-encoded filenames
> # to utf-8 encoded names. It then attempts to do the same
> # for all directories.
> #
> # Since there is no logical recursive directory-checking, in
> # order to reach files and directories nested inside url-
> # encoded directories, it must be run multiple times (until
> # it returns "Skipped" for all files). I know. It's awfully
> # inefficient, but it's safer than trusting my loop logic.
> #
> ############################################################
>
> # First find and convert normal files only with '-type f'
>
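> # Collect the list first (bash 4+), one path per line, so names
> # containing spaces are not split into separate words.
> mapfile -t files < <(find . -type f)
> for file in "${files[@]}"
> do
>     # Pass the name as an argument instead of pasting it into the PHP code.
>     decoded=$(php -r 'echo rawurldecode($argv[1]);' -- "$file")
>     if [ "$file" == "$decoded" ] ; then
>         echo "Skipped $file"
>     else
>         echo "Renaming $file => $decoded"
>         if ! mv "$file" "$decoded"; then
>             echo "Error renaming $file"
>         fi
>     fi
> done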
>
> # Then find and convert directories with '-type d'
>
> # Same approach for directories: collect first, then rename.
> mapfile -t dirs < <(find . -type d)
> for file in "${dirs[@]}"
> do
>     decoded=$(php -r 'echo rawurldecode($argv[1]);' -- "$file")
>     if [ "$file" == "$decoded" ] ; then
>         echo "Skipped $file"
>     else
>         echo "Renaming $file => $decoded"
>         if ! mv "$file" "$decoded"; then
>             echo "Error renaming $file"
>         fi
>     fi
> done
>
> ----
>
> Ideally, I think the functionality of this script could be included in
> some kind of helper php script, and presented in a more end
> user-friendly way (not to mention programmed much more intelligently).
> This would let long-time Dokuwiki users running non-English wikis
> migrate their existing url-encoded file system to a utf-8 one. If they
> want to of course.
>
> Feedback, ideas and pointing out of gaping holes in my code/logic
> would be much appreciated!
>
> Best regards,
> --
> Daniel ( ̄ー ̄)b
> --
Hey Daniel! Great work. The steps worked smoothly for me, and at first
glance everything is OK. I have been waiting for this moment for a long
time... I do expect some plugins like pagemove to cause some
trouble, but hopefully plugin authors will revise their code in a while.
Also, after step (1) I removed all *.indexed files, because some of
them got strange names:
find data/meta/ -iname '*.indexed' -exec rm '{}' \;
I attach the Perl script, which does the renaming in one go. Just
navigate to dokuwiki/data directory and run it:
pushd /some/dokuwiki/data && ~/convert_names_to_utf8.pl
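To check afterwards whether anything was missed, something like the following bash sketch may help. It assumes a find that supports character classes in -name patterns; list_encoded is a name made up for this illustration, not something that ships with DokuWiki.

```shell
#!/bin/bash
# list_encoded is a hypothetical helper name used only for this sketch.
# It prints every file or directory under the given roots whose own name
# still contains a %XX escape (a percent sign followed by two hex digits).
list_encoded() {
    find "$@" -name '*%[0-9A-Fa-f][0-9A-Fa-f]*' -print
}
```

Run from the data directory as `list_encoded attic media meta pages`. If it prints nothing, no url-encoded names are left under those directories.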
--
With best regards,
Dmitry
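#!/usr/bin/perl
use strict;
use warnings;
use URI::Escape;
use File::Find;

# Directories to process, relative to dokuwiki/data, unless given as arguments.
my @dirs = (scalar(@ARGV) == 0 ? qw(attic media meta pages) : @ARGV);
my ($total_files, $total_dirs) = (0, 0);

# finddepth visits a directory's contents before the directory itself,
# so nested encoded names are all renamed in a single pass.
finddepth(sub {
    my $name = uri_unescape($_);
    if ($name ne $_) {
        if (-d $_) {
            $total_dirs++;
        } else {
            $total_files++;
        }
        # $! holds the system error message for a failed rename.
        rename $_, $name or die "Failed to rename $_: $!\n";
    }
}, @dirs);

print "$total_files files and $total_dirs dirs renamed.\n";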
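For anyone who wants to see what would change before running either script, here is a minimal dry-run sketch in bash. It assumes bash and a find that supports -depth and -print0. The names urldecode and preview_renames are made up for this illustration, and the decoder assumes that every % in a name starts a valid %XX escape and that names contain no backslashes.

```shell
#!/bin/bash
# Dry-run sketch: print the renames a depth-first pass would perform,
# without touching anything on disk.
# urldecode and preview_renames are illustrative names, not DokuWiki tools.

# Decode %XX escapes in one name by turning each % into \x and letting
# printf '%b' expand the resulting \xHH sequences into raw bytes.
urldecode() {
    printf '%b' "${1//%/\\x}"
}

# Walk the given roots and report every entry whose own name would change.
preview_renames() {
    local path dir base decoded
    # -depth lists a directory's contents before the directory itself,
    # so a real rename in this order never invalidates a pending path.
    find "$@" -depth -print0 | while IFS= read -r -d '' path; do
        dir=$(dirname -- "$path")
        base=$(basename -- "$path")
        decoded=$(urldecode "$base")
        if [ "$base" != "$decoded" ]; then
            echo "would rename: $path => $dir/$decoded"
        fi
    done
}
```

Called as `preview_renames attic media meta pages` from the data directory, it only prints the planned renames. Only the last path component is decoded on each line, which mirrors how the Perl script renames one entry at a time inside its parent directory.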