[overture] Re: P++ code scaling on a large processor count

  • From: Bill Henshaw <henshaw@xxxxxxxx>
  • To: overture@xxxxxxxxxxxxx
  • Date: Fri, 01 May 2009 12:02:26 -0700

Thanks for the info. If we need a better updateGhostBoundaries
then it can be written based on the routines I have. Of course,
for your small problem there are probably more ghost points
than interior points, so I am not surprised that it is slow :)
I would aim for local sizes of more like 10^5 - 10^6 points.
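To put rough numbers on it (taking your approximate local size at
face value): a 13x16x14 interior box is about 2,900 points, but
padded with 4-deep ghost layers on every face it becomes 21x24x22,
or about 11,100 points, so nearly three quarters of each local
array is ghost points. At that ratio the ghost exchange will
dominate.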

...Bill

Viacheslav Merkin wrote:
Dear Bill,

Getting local chunks and operating on them is exactly what we are doing for the most part. But I have been using P++ for just a few operations, like assigning external boundary conditions, and, as it turns out, they don't scale well. Most of those can be treated using your copy functions, so thank you very much for writing them. Unfortunately, updateGhostBoundaries does not seem to scale well beyond 128 processors. For a relatively small problem, 100x60x120, distributed among 512 processors (local sizes are approx. 13x16x14), with an 8th-order scheme (ghost boundaries are 4 cells deep), applying updateGhostBoundaries to ~15 global arrays slows the code down by a factor of 2. I may be getting hit to some extent by the suboptimal grid size for this processor count, but the slowdown still seems bothersome to me. We will do more work to quantify the scaling of updateGhostBoundaries, but we might need to develop a replacement based on your copy function.
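For the quantification, we have in mind something like the following rough harness (just a sketch; it assumes the arrays are P++ realArrays with updateGhostBoundaries called as a member, that u and numArrays name our collection of global arrays, and that MPI is already initialized by P++):

   double t0 = MPI_Wtime();
   for( int n=0; n<numArrays; n++ )     // the ~15 global arrays
       u[n].updateGhostBoundaries();
   double myTime = MPI_Wtime()-t0, maxTime;
   // the slowest process sets the pace, so reduce with MPI_MAX
   MPI_Allreduce(&myTime,&maxTime,1,MPI_DOUBLE,MPI_MAX,MPI_COMM_WORLD);

Plotting maxTime per step at 128, 256 and 512 cores should show where the scaling breaks down.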

Thanks very much again for your response,
Slava




---------------------------------------------------------------
Viacheslav Merkin
---------------------------------------------------------------
Senior Research Associate
Astronomy Department and
Center for Integrated Space Weather Modeling
Boston University

e-mail: vgm at bu.edu
phone: (617) 358-3441
fax: (617) 358-3242
---------------------------------------------------------------





On May 1, 2009, at 12:26 PM, Bill Henshaw wrote:

Dear Slava,

 Unfortunately there has been no work on P++ for some years (8 years?).
Since I have not been brave enough to change P++, in Overture I have
instead created my own parallel copy functions, found in ParallelUtility.h.
I do not trust any P++ distributed operation except for updateGhostBoundaries,
and even there I have not looked at the scaling past 128 processors,
but so far it looks OK.

 As you note, even simple operations like X=0. in P++ use communication
even when none is needed. I think this communication was turned on
just to be safe while P++ was being developed.

 In Overture I basically avoid all P++ distributed operations and
instead operate on the serial arrays. Accessing the serial array
is quite easy (see examples in Overture/GridGenerator):

intSerialArray maskLocal;
getLocalArrayWithGhostBoundaries(mask, maskLocal);  // maskLocal now references this process's piece of mask, ghost points included

I suggest using the above Overture routine to get the local array,
rather than mask.getLocalArrayWithGhostBoundaries(). A++ is quite
robust, but if I need top performance I will write C loops instead
of A++ array operations (see primer/pwave.C for an example). I then
explicitly call updateGhostBoundaries when I know it is needed.
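Put together, the pattern looks roughly like this (a sketch only; u is
assumed to be a distributed realArray, getBase/getBound and the
operator() indexing are the usual A++ accessors, and the bounds here
sweep the whole local box, ghost points included, so restrict them if
you only want the interior):

realSerialArray uLocal;
getLocalArrayWithGhostBoundaries(u, uLocal);   // local piece of the distributed array u
// plain C-style loops instead of A++ array operations:
for( int i3=uLocal.getBase(2); i3<=uLocal.getBound(2); i3++ )
  for( int i2=uLocal.getBase(1); i2<=uLocal.getBound(1); i2++ )
    for( int i1=uLocal.getBase(0); i1<=uLocal.getBound(0); i1++ )
      uLocal(i1,i2,i3)=0.;
u.updateGhostBoundaries();   // explicit exchange, only when actually needed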

As a note, the latest version of A++/P++ allows arrays with more than
2^31 (approx. 2 billion) entries.

...Bill

Viacheslav Merkin wrote:
Hi,
I have been running my MHD code, which uses P++, on kraken, the new Cray XT5 machine at NICS, and I am getting really poor scaling beyond 128 processors. I could give the actual performance numbers, but the bottom line is that I can definitely see that it is the P++ array operations that slow the code down. Surprisingly, even operations that should not require any communication at all (for instance, X = 0., where X is a distributed array) take a lot of time. I can see this by turning off all P++ functions and then turning them back on one by one and watching how they affect the wall-clock time per time step.

To try to remedy the problem, I have grabbed Bill's copy function from ParallelUtility.C and used it for assignment operations like the one above. Doing so eliminates the problem completely. Is this the behavior one expects from P++ when going to large core counts, or is it perhaps indicative of a corrupted P++ installation on our machine? I heard that P++ had a hardwired limit on the number of processors, so I checked our installation: it's 1024, so that should not be the problem. If it is really a problem with the P++ code itself, how is it solved in the parallel Overture?
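In essence the fix replaces the distributed assignment with work on the local serial array (a sketch of the idea rather than our exact copy-function call; it assumes X is a distributed realArray and uses the getLocalArrayWithGhostBoundaries routine from ParallelUtility.h):

realSerialArray xLocal;
getLocalArrayWithGhostBoundaries(X, xLocal);  // this process's piece of X
xLocal = 0.;                                  // pure serial A++ assignment, no communication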
Thanks very much in advance,
Slava Merkin
---------------------------------------------------------------
Viacheslav Merkin
---------------------------------------------------------------
Senior Research Associate
Astronomy Department and
Center for Integrated Space Weather Modeling
Boston University
e-mail: vgm at bu.edu
phone: (617) 358-3441
fax: (617) 358-3242
---------------------------------------------------------------





