Dear Slava,

Unfortunately there has been no work on P++ for some years (eight years?). Since I have not been brave enough to change P++, in Overture I have instead created my own parallel copy functions, found in ParallelUtility.h. I do not trust any P++ distributed operation except for updateGhostBoundaries. Even for updateGhostBoundaries I have not looked at the scaling past 128 processors, but so far it looks OK.

As you note, even simple "X = 0." operations in P++ use communication even when it is not needed. I think this communication was turned on just to be safe while P++ was being developed. In Overture I basically avoid all P++ distributed operations and instead operate on the serial arrays. Accessing the serial array is quite easy (see examples in Overture/GridGenerator):

    intSerialArray maskLocal;
    getLocalArrayWithGhostBoundaries(mask, maskLocal);

I suggest using the above Overture routine to get the local array rather than mask.getLocalArrayWithGhostBoundaries().

A++ is quite robust, but if I need top performance I will write C loops instead of A++ array operations (see primer/pwave.C for an example). I then explicitly call updateGhostBoundaries when I know it is needed.

As a note, the latest version of A++/P++ allows arrays with more than 2^31 (approx. 2 billion) entries.

...Bill

Viacheslav Merkin wrote:
Hi,

I have been running my MHD code, which uses P++, on Kraken, the new Cray XT5 machine at NICS. I am getting really poor scaling results beyond 128 processors. I could give the actual performance numbers, but the bottom line is that I can definitely see that it is the P++ array operations that slow the code down. Surprisingly, even operations that should not require any communication at all (for instance, X = 0., where X is a distributed array) take much time. I can see that by turning off all P++ functions and then turning them on one by one and seeing how they affect the wall-clock time per time step. To try to remedy the problem, I have grabbed Bill's copy function from ParallelUtility.C and used it to do any assignment operations like the one above. Doing so eliminates the problem completely.

Is this behavior one expects from P++ when going to large core counts, or is it, perhaps, indicative of a corrupted P++ installation that we've implemented on the machine? I heard that P++ had a hardwired limit on the number of processors, so I checked it in our installation and it's 1024, so that should not be a problem. If it is really a problem with the P++ code itself, how is this problem solved in the parallel Overture?

Thanks very much in advance,
Slava Merkin

---------------------------------------------------------------
Viacheslav Merkin
Senior Research Associate
Astronomy Department and Center for Integrated Space Weather Modeling
Boston University
e-mail: vgm at bu.edu
phone: (617) 358-3441
fax: (617) 358-3242
---------------------------------------------------------------
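The pattern Bill describes above — fetch the local serial array once, operate on it with plain C loops, and exchange ghost cells explicitly only when needed — can be sketched without Overture installed. The container and functions below are hypothetical stand-ins written for illustration; real Overture code would instead use intSerialArray, getLocalArrayWithGhostBoundaries(mask, maskLocal), and an explicit updateGhostBoundaries() call.

```cpp
#include <vector>

// Hypothetical stand-in for the local (serial) portion of a distributed
// 2-D array, with ng ghost cells on each side. In Overture this role is
// played by intSerialArray / realSerialArray obtained via
// getLocalArrayWithGhostBoundaries(mask, maskLocal).
struct LocalArray {
    int nx, ny, ng;             // interior dimensions and ghost width
    std::vector<double> data;   // (nx+2*ng) x (ny+2*ng), row-major
    LocalArray(int nx_, int ny_, int ng_)
        : nx(nx_), ny(ny_), ng(ng_),
          data((nx_ + 2 * ng_) * (ny_ + 2 * ng_), 1.0) {}
    // i,j index the interior (0..nx-1, 0..ny-1); ghosts sit at -ng..-1
    // and nx..nx+ng-1.
    double& at(int i, int j) {
        return data[(j + ng) * (nx + 2 * ng) + (i + ng)];
    }
};

// Plain C-style loop over the local interior, in the spirit of "X = 0."
// done by hand: purely local work, no communication, unlike a P++
// distributed assignment that may synchronize across processors.
void zeroInterior(LocalArray& a) {
    for (int j = 0; j < a.ny; ++j)
        for (int i = 0; i < a.nx; ++i)
            a.at(i, j) = 0.0;
}
```

After a sweep of such local loops, the ghost cells are stale; only at that point would the real code call updateGhostBoundaries once, which is the one P++ distributed operation Bill reports trusting.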