[overture] Re: P++ code scaling on a large processor count

  • From: Bill Henshaw <henshaw@xxxxxxxx>
  • To: overture@xxxxxxxxxxxxx
  • Date: Fri, 01 May 2009 12:02:26 -0700

Thanks for the info. If we need a better updateGhostBoundaries
then it can be written based on the routines I have. Of course,
for your small problem there are probably more ghost points
than interior points, so I am not surprised that it is slow :)
I would aim for local sizes of more like 10^5 - 10^6 points.
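To put rough numbers on it (taking your approximate local size at
face value): a 13x16x14 interior box is about 2,900 points, but
padded with 4-deep ghost layers on every face it becomes 21x24x22,
or about 11,100 points, so nearly three quarters of each local
array is ghost points. At that ratio the ghost exchange will
dominate.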

...Bill

Viacheslav Merkin wrote:
Dear Bill,

Getting local chunks and operating on them is exactly what we are doing for the most part. But I have been using P++ for just a few operations, like assigning external boundary conditions, and, as it turns out, they don't scale well. Most of those can be treated using your copy functions, so thank you very much for writing them. Unfortunately, updateGhostBoundaries does not seem to scale well beyond 128 processors. For a relatively small problem, 100x60x120, distributed among 512 processors (local sizes are approx. 13x16x14), with an 8th-order scheme (ghost boundaries are 4 cells deep), applying updateGhostBoundaries to ~15 global arrays slows the code down by a factor of 2. I may be getting hit to some extent by the suboptimal grid size for this processor count, but the slowdown still seems bothersome to me. We will do more work to quantify the scaling of updateGhostBoundaries, but we might need to develop a replacement based on your copy function.
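For the quantification, we have in mind something like the following rough harness (just a sketch; it assumes the arrays are P++ realArrays with updateGhostBoundaries called as a member, that u and numArrays name our collection of global arrays, and that MPI is already initialized by P++):

   double t0 = MPI_Wtime();
   for( int n=0; n<numArrays; n++ )     // the ~15 global arrays
       u[n].updateGhostBoundaries();
   double myTime = MPI_Wtime()-t0, maxTime;
   // the slowest process sets the pace, so reduce with MPI_MAX
   MPI_Allreduce(&myTime,&maxTime,1,MPI_DOUBLE,MPI_MAX,MPI_COMM_WORLD);

Plotting maxTime per step at 128, 256 and 512 cores should show where the scaling breaks down.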

Thanks very much again for your response,
Slava




---------------------------------------------------------------
Viacheslav Merkin
---------------------------------------------------------------
Senior Research Associate
Astronomy Department and
Center for Integrated Space Weather Modeling
Boston University

e-mail: vgm at bu.edu
phone: (617) 358-3441
fax: (617) 358-3242
---------------------------------------------------------------





On May 1, 2009, at 12:26 PM, Bill Henshaw wrote:

Dear Slava,

 Unfortunately there has been no work on P++ for some years (8 years?).
Since I have not been brave enough to change P++, in Overture I have
instead created my own parallel copy functions, found in ParallelUtility.h.
I do not trust any P++ distributed operation except for updateGhostBoundaries,
and even there I have not looked at the scaling past 128 processors,
but so far it looks OK.

 As you note, even simple operations like X=0. in P++ use communication
even when none is needed. I think this communication was turned on
just to be safe while P++ was being developed.

 In Overture I basically avoid all P++ distributed operations and
instead operate on the serial arrays. Accessing the serial array
is quite easy (see examples in Overture/GridGenerator):

intSerialArray maskLocal;
getLocalArrayWithGhostBoundaries(mask, maskLocal);  // maskLocal now references this process's piece of mask, ghost points included

I suggest using the above Overture routine to get the local array,
rather than mask.getLocalArrayWithGhostBoundaries(). A++ is quite
robust, but if I need top performance I will write C loops instead
of A++ array operations (see primer/pwave.C for an example). I then
explicitly call updateGhostBoundaries when I know it is needed.
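Put together, the pattern looks roughly like this (a sketch only; u is
assumed to be a distributed realArray, getBase/getBound and the
operator() indexing are the usual A++ accessors, and the bounds here
sweep the whole local box, ghost points included, so restrict them if
you only want the interior):

realSerialArray uLocal;
getLocalArrayWithGhostBoundaries(u, uLocal);   // local piece of the distributed array u
// plain C-style loops instead of A++ array operations:
for( int i3=uLocal.getBase(2); i3<=uLocal.getBound(2); i3++ )
  for( int i2=uLocal.getBase(1); i2<=uLocal.getBound(1); i2++ )
    for( int i1=uLocal.getBase(0); i1<=uLocal.getBound(0); i1++ )
      uLocal(i1,i2,i3)=0.;
u.updateGhostBoundaries();   // explicit exchange, only when actually needed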

As a note, the latest version of A++/P++ allows arrays with more than
2^31 (approx. 2 billion) entries.

...Bill

Viacheslav Merkin wrote:
Hi,
I have been running my MHD code, which uses P++, on kraken, the new Cray XT5 machine at NICS, and I am getting really poor scaling beyond 128 processors. I could give the actual performance numbers, but the bottom line is that I can definitely see that it is the P++ array operations that slow the code down. Surprisingly, even operations that should not require any communication at all (for instance, X = 0., where X is a distributed array) take a lot of time. I can see this by turning off all P++ functions and then turning them back on one by one and watching how they affect the wall-clock time per time step.

To try to remedy the problem, I have grabbed Bill's copy function from ParallelUtility.C and used it for assignment operations like the one above. Doing so eliminates the problem completely. Is this the behavior one expects from P++ when going to large core counts, or is it perhaps indicative of a corrupted P++ installation on our machine? I heard that P++ had a hardwired limit on the number of processors, so I checked our installation: it's 1024, so that should not be the problem. If it is really a problem with the P++ code itself, how is it solved in the parallel Overture?
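In essence the fix replaces the distributed assignment with work on the local serial array (a sketch of the idea rather than our exact copy-function call; it assumes X is a distributed realArray and uses the getLocalArrayWithGhostBoundaries routine from ParallelUtility.h):

realSerialArray xLocal;
getLocalArrayWithGhostBoundaries(X, xLocal);  // this process's piece of X
xLocal = 0.;                                  // pure serial A++ assignment, no communication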
Thanks very much in advance,
Slava Merkin
---------------------------------------------------------------
Viacheslav Merkin
---------------------------------------------------------------
Senior Research Associate
Astronomy Department and
Center for Integrated Space Weather Modeling
Boston University
e-mail: vgm at bu.edu
phone: (617) 358-3441
fax: (617) 358-3242
---------------------------------------------------------------





