On 01.12.2010 15:05, Daniel Dupriest wrote:
> To Dmitry Katsubo, WC Jones, Christopher Smith, and any interested
> multilingual Dokuwiki users:
>
> This is a report of my experimentation with writing a script to
> convert url-encoded filenames into UTF-8 encoded ones, for users who
> would like to use that option, which has become available in recent
> Dokuwiki releases.
>
> To quickly restate the issue: until the utf-8 filename encoding
> feature became available, Dokuwiki users working with non-Roman
> scripts such as Cyrillic, Korean or, in my case, Japanese, had to
> work with file and directory names that look like
> "%E6%96%B0%E3%81%9F%E3%81%AA%E7%A4%BE%E4%BC%9A%E7%9A%84%E3%83%8B%E3%83%BC%E3%82%BA%E3%81%AB%E5%AF%BE%E5%BF%9C%E3%81%97%E3%81%9F%E5%AD%A6%E7%94%9F%E6%94%AF%E6%8F%B4%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%A0.txt".
> Yuck. With UTF-8 encoding, and a utf-8 enabled server, this becomes
> 新たな社会的ニーズに対応した学生支援プログラム, which is not only legible
> but also fits into zip archives, CD-R filesystems and other places.
>
> I got some good advice from Christopher Smith:
>
>> You might try the following http://www.dokuwiki.org/tips:convert_to_utf8
>>
>> The bash + php works. I can't guarantee DokuWiki will like the
>> results. I believe it should, for the following reasons:
>>
>> 1. utf8 page name comes into dokuwiki
>> 2. dokuwiki strips any character that shouldn't be used
>> 3. if necessary dokuwiki encodes the filename
>>
>> The aim is to reverse an earlier encoding process at 3. Since this
>> is a straight encoding from utf8, all we should need to do is
>> decode the result. Any "bad" characters in the original utf8 page
>> name were removed at step 2, so won't have been included in the
>> encoded result.
>
> After some testing, it turns out that he is absolutely right.
> Dokuwiki does all the hard work of cleaning filenames before they're
> url-encoded, so converting them back is actually an easy matter. I
> wrote a VERY inefficient and slow bash script to do the conversion,
> and while it does get the job done, I'm sure anyone more familiar
> with bash and php could improve on it greatly.
>
> I have used it to convert several hundred filenames and dozens of
> nested directories (namespaces) to UTF-8, and so far the results
> have been great. At first I was worried about how plugins would
> handle the change, but since Dokuwiki now has core support for UTF-8
> encoded files it seems to be a non-issue; they all run fine for me
> (ditaa, doodle, graphviz, etc.). I do not have extensive media on my
> wiki, so that is one area I still have not thoroughly tested, and
> I'm not sure whether this script disturbs things like "created on"
> dates, but for those brave souls who want to go all-unicode...
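>
> Before changing anything, you can sanity-check the decoding step by
> hand. The one-liner below simply applies PHP's rawurldecode() (the
> inverse of the encoding Dokuwiki performs at step 3 of Christopher's
> list) to the example filename from the beginning of this mail; it
> touches nothing on disk, so it is safe to try anywhere:
>
> $ php -r "echo rawurldecode('%E6%96%B0%E3%81%9F%E3%81%AA%E7%A4%BE%E4%BC%9A%E7%9A%84%E3%83%8B%E3%83%BC%E3%82%BA%E3%81%AB%E5%AF%BE%E5%BF%9C%E3%81%97%E3%81%9F%E5%AD%A6%E7%94%9F%E6%94%AF%E6%8F%B4%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%A0.txt');"
> 新たな社会的ニーズに対応した学生支援プログラム.txt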
>
> Here is the procedure I used to convert to UTF-8:
>
> 1) BACKUP EVERYTHING :)
>
> 2) Change the fnencode "Method for encoding non-ASCII filenames"
> setting on the configuration page to "utf-8" and save. All wiki
> links pointing to non-ASCII pages will temporarily break, now that
> they point to the non-existent utf-8 filenames.
>
> 3) From the Linux command line, run the url2utf8 script from inside
> the "/data/pages", "/data/meta" and "/data/attic" directories. You
> will need to run the script several times if you have url-encoded
> namespaces (thanks to my poor programming ability); for example, the
> script must be run 4x to reach a file nested 3 encoded directories
> deep. All files will show "Skipped" once they have been converted.
> This may take some time depending on how large or old the wiki is.
> (A small wrapper that automates the repetition is sketched after the
> script below.)
>
> 4) Run "php indexer.php -c" from the Dokuwiki "/bin" directory to
> rebuild the index.
>
> 5) Test everything.
>
> 6) Enjoy your fully utf-8 enabled wiki!
>
> ----
>
> #!/bin/bash
> #
> # This script finds all normal files in the current directory
> # and subdirectories, converting all url-encoded filenames
> # to utf-8 encoded names. It then attempts to do the same
> # for all directories.
> #
> # Since there is no logical recursive directory-checking, in
> # order to reach files and directories nested inside url-
> # encoded directories, it must be run multiple times (until
> # it returns "Skipped" for all files). I know, it's awfully
> # inefficient, but it's safer than trusting my loop logic.
> #
> # NB: the unquoted $(find ...) relies on Dokuwiki's cleaned
> # names never containing whitespace (spaces become underscores).
> #
> ############################################################
>
> # First find and convert normal files only, with '-type f'
>
> for file in $(find . -type f)
> do
>     decoded=$(php -r "echo rawurldecode('$file');")
>     if [ "$file" == "$decoded" ] ; then
>         echo "Skipped $file"
>     else
>         echo "Renaming $file => $decoded"
>         if ! mv "$file" "$decoded"; then
>             echo "Error renaming $file"
>         fi
>     fi
> done
>
> # Then find and convert directories, with '-type d'
>
> for file in $(find . -type d)
> do
>     decoded=$(php -r "echo rawurldecode('$file');")
>     if [ "$file" == "$decoded" ] ; then
>         echo "Skipped $file"
>     else
>         echo "Renaming $file => $decoded"
>         if ! mv "$file" "$decoded"; then
>             echo "Error renaming $file"
>         fi
>     fi
> done
>
> ----
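>
> And here is the wrapper mentioned in step 3. It is only a sketch: it
> assumes the script above has been saved as ~/url2utf8 and made
> executable, and that the wiki uses the default data layout, so
> adjust the path as needed. It re-runs the script in each directory
> until a full pass renames nothing (grep exits non-zero once a pass
> produces no "Renaming" lines, which ends the loop):
>
> cd /path/to/dokuwiki/data    # adjust to your installation
> for d in pages meta attic
> do
>     ( cd "$d" || exit
>       # repeat until a pass performs no renames
>       while ~/url2utf8 | grep '^Renaming'
>       do
>           :
>       done )
> done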
>
> Ideally, I think the functionality of this script could be included
> in some kind of helper php script, and presented in a more
> end-user-friendly way (not to mention programmed much more
> intelligently). This would let long-time Dokuwiki users running
> non-English wikis migrate their existing url-encoded file system to
> a utf-8 one. If they want to, of course.
>
> Feedback, ideas and pointing out of gaping holes in my code/logic
> would be much appreciated!
>
> Best regards,
> --
> Daniel ( ̄ー ̄)b

Hey Daniel!

Great work. The steps worked smoothly for me, and at first glance
everything is OK. I have been waiting for this moment for a long
time... I do expect some plugins, such as pagemove, to cause some
trouble, but hopefully plugin authors will revise their code in a
while.

Also, after step (1) I removed all *.indexed files, because some of
them had got strange names:

find data/meta/ -iname '*.indexed' -exec rm '{}' \;

I attach a Perl script which does the renaming in one go. Just
navigate to the dokuwiki/data directory and run it:

pushd /some/dokuwiki/data && ~/convert_names_to_utf8.pl

--
With best regards,
Dmitry

#!/usr/bin/perl
#
# Rename url-encoded files and directories to their utf-8 names.
# Run from inside the dokuwiki/data directory.

use strict;
use warnings;
use URI::Escape;
use File::Find;

# Default to the standard data subdirectories unless some are
# given on the command line.
my @dirs = (scalar(@ARGV) == 0 ? qw(attic media meta pages) : @ARGV);

my ($total_files, $total_dirs) = (0, 0);

# finddepth() visits a directory's contents before the directory
# itself, and File::Find chdir()s into each directory, so $_ is
# just the basename and every rename is local.
finddepth(sub {
    my $name = uri_unescape($_);
    if ($name ne $_) {
        if (-d $_) {
            $total_dirs++;
        } else {
            $total_files++;
        }
        rename $_, $name
            or die "Failed to rename $_: $!\n";
    }
}, @dirs);

print "$total_files files and $total_dirs dirs renamed.\n";
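
P.S. Note that the script uses finddepth() rather than find(): a
directory's contents are renamed before the directory itself, so a
single pass copes with arbitrarily nested encoded namespaces. As a
final check you can search for anything that is still encoded; this
assumes a cleaned Dokuwiki name never contains a literal percent
sign, which should be the case with the default cleaning rules:

find pages meta attic media -name '*%*'

If that prints nothing, the conversion is complete.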