[dokuwiki] Dokuwiki scalability - NFS use

  • From: Yann <yann.hamon@xxxxxxxxx>
  • To: dokuwiki@xxxxxxxxxxxxx
  • Date: Sun, 4 May 2008 20:09:07 +0100

After being kicked off our provider's SAN share for too heavy usage, then
buying our own NFS server and trying to put dokuwiki on it, we understood why
we got kicked out :)
It seems that dokuwiki does many, many file accesses, which probably go
unnoticed with fast local disks, but which are a complete killer when you try
to put dokuwiki on NFS.

After some investigation (sigh :D) we found a couple of issues which could
be improved. But let's start with the hardware: dokuwiki runs on a dual
quad-core Xeon at 1.6GHz with 8GB of RAM; the NFS server is a dual 2.8GHz
hyperthreaded box with 4GB of RAM and 10k SCSI disks in software RAID 10.
Right now, if I put dokuwiki's data on the NFS server, it holds for, well, a
good 3 minutes before the load reaches 20 :P

So, investigation: we found the following function:

/**
 * Return a list of available and existing page revisions from the attic
 *
 * @author Andreas Gohr <andi@xxxxxxxxxxxxxx>
 * @see    getRevisions()
 */
function getRevisionsFromAttic($id,$sorted=true){
  $revd = dirname(wikiFN($id,'foo'));
  $revs = array();
  $clid = cleanID($id);
  if(strrpos($clid,':')) $clid = substr($clid,strrpos($clid,':')+1); //remove path
  $clid = utf8_encodeFN($clid);

  if (is_dir($revd) && $dh = opendir($revd)) {
    while (($file = readdir($dh)) !== false) {
      if (is_dir($revd.'/'.$file)) continue;
      if (preg_match('/^'.$clid.'\.(\d+)\.txt(\.gz)?$/',$file,$match)){
        $revs[]=$match[1];
      }
    }
    closedir($dh);
  }
  if($sorted) rsort($revs);
  return $revs;
}

If I am not mistaken, this function does a readdir() of the attic directory,
runs a preg_match() on every file to see whether it looks like a revision of
$id, and returns the list of revisions for $id. Two things:

yann@dongo:/srv/www/fr/doc.ubuntu-fr.org/htdocs/data/attic$ ls -l | wc -l
46426

That's on a 3-year-old wiki :) And also:

yann@dongo:/srv/www/fr/doc.ubuntu-fr.org/htdocs$ grep -R getRevisionsFromAttic bin/ conf/ inc/ lib/
inc/changelog.php: * @see    getRevisionsFromAttic()
inc/changelog.php:  $revs = array_merge($revs,getRevisionsFromAttic($id,false));
inc/changelog.php:function getRevisionsFromAttic($id,$sorted=true){

This means the function is called only once, in changelog.php, in the
function getRevisions($id, $first, $num, $chunk_size=8192), right at the
end:

  $revs = array_merge($revs,getRevisionsFromAttic($id,false));
  $revs = array_unique($revs);

So what happens exactly: the changes to a particular page are stored in
data/meta/, in a <pagename>.changes file, which is a split version of the old
changes.log. Dokuwiki parses that file to find out the latest changes that
happened to the page. But dokuwiki *also* scans attic/ for existing revision
files, possibly older ones, and merges those revisions with the ones found in
the changelog.
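
For reference, if I read inc/changelog.php correctly, every line of a
.changes file starts with the revision timestamp, followed by tab-separated
fields (IP, change type, page ID, user, summary). So the revisions of a page
can be obtained from the changelog alone, roughly like this (a quick untested
sketch, not the code dokuwiki actually uses; the path at the end is just an
example):

/**
 * Sketch: collect the revision timestamps listed in a per-page changelog,
 * without ever touching attic/.
 */
function changelogRevisions($changesFile){
  $revs = array();
  if(!is_readable($changesFile)) return $revs;
  $lines = file($changesFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
  foreach($lines as $line){
    $fields = explode("\t", $line);
    if(isset($fields[0]) && ctype_digit($fields[0])){
      $revs[] = (int) $fields[0];   // first field is the revision timestamp
    }
  }
  rsort($revs);                     // newest revision first
  return $revs;
}

// e.g. $revs = changelogRevisions('data/meta/somepage.changes');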

My suggestion: should we just get rid of these 2 lines? If someone deleted a
revision from the changelog, it could be intentional, yet the revision would
still be displayed because the file is still in attic/ :) As this is also the
only call to that function, I'd suggest we either get rid of it entirely, or
keep it somewhere to rebuild the .changes files, but call it only from the
admin panel... Right now, for me, this function is doing 45000 getattr()
calls, and as many regexp checks :)
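
Concretely, the end of getRevisions() would just shrink to something like
this (only a sketch to show what I mean):

  // end of getRevisions(), with the attic merge dropped:
  //
  //   $revs = array_merge($revs,getRevisionsFromAttic($id,false));   <- removed
  //   $revs = array_unique($revs);                                   <- removed
  return $revs;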


Second point, the following function:


/**
 * returns an array of full paths to all metafiles of a given ID
 *
 * @author Esther Brunner <esther@xxxxxxxxxxxxx>
 */
function metaFiles($id){
  $name = noNS($id);
  $dir = metaFN(getNS($id),'');
  $files = array();

  $dh = @opendir($dir);
  if(!$dh) return $files;
  while(($file = readdir($dh)) !== false){
    if(strpos($file,$name.'.') === 0 && !is_dir($dir.$file))
      $files[] = $dir.$file;
  }
  closedir($dh);

  return $files;
}

If I understand it right, it returns an array containing all the metafiles
for a specific page. To do this it reads the whole data/meta/ directory,
checks every entry against the name we are looking for, and adds the matches
to the array it then returns. Comments:

yann@dongo:/srv/www/fr/doc.ubuntu-fr.org/htdocs/data/meta$ ls -l | wc -l
6203

Suggestion: having a quick look at meta/, it seems that for a given ID you
get 3 different files: a .meta, a .changes and a .indexed. So why not just
check whether these 3 files exist and return them in an array? Seems simple;
maybe I am missing something, so don't be too harsh if that's the case :P
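
Something along these lines maybe (untested sketch; metaFN() already builds
the path for a given extension, and I'm assuming no plugin creates metafiles
with other extensions):

/**
 * Sketch of a metaFiles() without the readdir() - untested, and it assumes
 * .changes, .meta and .indexed are the only metafile extensions in use.
 */
function metaFiles($id){
  $files = array();
  foreach(array('.changes', '.meta', '.indexed') as $ext){
    $file = metaFN($id, $ext);        // full path for this extension
    if(file_exists($file)) $files[] = $file;
  }
  return $files;
}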

I will continue to look for improvements - but I think that any readdir()
over attic/, pages/ or meta/ should be got rid of, as it means linear
complexity and therefore no scalability as your wiki grows :(


Thanks!

Yann
