[overture] Re: P++ code scaling on a large processor count

  • From: Bill Henshaw <henshaw@xxxxxxxx>
  • To: overture@xxxxxxxxxxxxx
  • Date: Fri, 01 May 2009 09:26:22 -0700

Dear Slava,

  Unfortunately there has been no work on P++ for some years (8 years?).
Since I have not been brave enough to change P++, in Overture I have
instead created my own parallel copy functions, which can be found in
ParallelUtility.h. I do not trust any P++ distributed operation except for
updateGhostBoundaries. Even for updateGhostBoundaries I have not checked
the scaling past 128 processors, but so far it looks OK.

  As you note, even simple operations like X=0. in P++ use communication
even when none is needed. I believe this communication was turned on
just to be safe while P++ was being developed.
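
For illustration, the kind of statement in question looks like this (the
array name and sizes are hypothetical):

   realArray X(64,64);   // P++ distributed array
   X = 0.;               // purely local work, yet P++ may still communicate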

  In Overture I basically avoid all P++ distributed operations and
instead operate on the serial arrays. Accessing the serial array
is quite easy (see examples in Overture/GridGenerator):

   intSerialArray maskLocal; getLocalArrayWithGhostBoundaries(mask,maskLocal);

I suggest using the above Overture routine to get the local array rather
than the member function mask.getLocalArrayWithGhostBoundaries(). A++ is
quite robust, but if I need top performance I will write C loops instead of
A++ array operations (see primer/pwave.C for an example). I then explicitly
call updateGhostBoundaries when I know it is needed; a sketch of this
pattern follows.
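
For concreteness, here is a minimal sketch of that pattern (the array names,
sizes, and loop body are illustrative, not taken from primer/pwave.C):

   realArray u(64,64);                          // P++ distributed array
   realSerialArray uLocal;
   getLocalArrayWithGhostBoundaries(u,uLocal);  // Overture routine shown above

   // plain C loops over the serial-array bounds instead of A++ array
   // operations; this touches only local memory and does no communication:
   for( int i2=uLocal.getBase(1); i2<=uLocal.getBound(1); i2++ )
     for( int i1=uLocal.getBase(0); i1<=uLocal.getBound(0); i1++ )
       uLocal(i1,i2)=0.;                        // e.g. the X=0. operation

   // communicate explicitly, and only when neighboring processors
   // actually need the updated values:
   u.updateGhostBoundaries();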

As a note, the latest version of A++/P++ allows arrays with more than
2^31 (approx. 2 billion) entries.

...Bill

Viacheslav Merkin wrote:
Hi,

I have been running my MHD code, which uses P++, on Kraken, the new Cray XT5 machine at NICS. I am getting really poor scaling beyond 128 processors. I could give the actual performance numbers, but the bottom line is that I can definitely see that it is the P++ array operations that slow the code down. Surprisingly, even operations that should not require any communication at all (for instance, X = 0., where X is a distributed array) take a lot of time. I can see this by turning off all P++ functions and then turning them back on one by one and observing how each affects the wall-clock time per time step. To try to remedy the problem, I grabbed Bill's copy function from ParallelUtility.C and used it to do any assignment operations like the one above. Doing so eliminates the problem completely.

Is this the behavior one expects from P++ when going to large core counts, or is it perhaps indicative of a corrupted P++ installation on the machine? I heard that P++ had a hard-wired limit on the number of processors, so I checked it in our installation; it is 1024, so that should not be the problem. If it really is a problem with the P++ code itself, how is this problem solved in parallel Overture?

Thanks very much in advance,
Slava Merkin
---------------------------------------------------------------
Viacheslav Merkin
---------------------------------------------------------------
Senior Research Associate
Astronomy Department and
Center for Integrated Space Weather Modeling
Boston University

e-mail: vgm at bu.edu
phone: (617) 358-3441
fax: (617) 358-3242
---------------------------------------------------------------
