[SI-LIST] Re: timing analysis

  • From: "Mike Williams" <mike.williams@xxxxxxxxxxxxxxxxxx>
  • To: <si-list@xxxxxxxxxxxxx>
  • Date: Fri, 27 Mar 2009 13:06:56 -0400

Hi Mort,

 

I spent most of my career in clock and timing engineering... supercomputers
and mainframes in the early days, and "slumming it" on PCs and workstations
after that:) The consults ranged from front-end pre-emptive tolerance
management (i.e. designing the clock distribution/reception architecture
and the state communication geometry to meet the need), mid-project
distribution design-validation (did they have a valid belief system? did
they implement to their beliefs if it was valid?) and the fun part,
late-design-stage rescue consults (why do we have clear timing faults when
the tolerancing simulates/measures out ok?). That has led to about 30 years
of exposure to how various design teams/cultures think about this sort
of thing. I consider that what you have is more a philosophical problem than
a technical problem. It's about your "belief system". 

 

The definition of "what is right" is fluid (though "what is wrong" is carved
in stone:). I don't think I can give you an answer to where you should be on
the continuum from "conservative" to "aggressive", but I can try to give you
a sense for how I've seen other teams and individuals think about the
problem. 

 

The mainframe and supercomputer "clock guys" I worked with through the 80's
would definitely be at the very conservative end of that spectrum. Part of
what explains that was the nature of the product... supercomputers and
mainframes were expected to not fail randomly. Every copy of what you built
had to reliably synch on every cycle of its operation during its service
life. Contributing to the conservatism, the guys doing the clocks on those
systems tended to be tremendously well-rounded and expert not just at the
electrical stuff (SI) but at actively designing to manage tolerances. A 100%
lost art today, IMO. So... they were conservative. But they also had a lot
of experience in working sticky clock situations (e.g. what special design
and validation methods do you need at a stall/no-stall boundary in a
multiphase pipe?). I can recall at least two situations on big systems where
relaxing specific assumptions would result in a more tractable simulation
pass and the way both situations were resolved was something I can't really
imagine having occurred after ~1994 or so. Much consideration of the local
governing dynamics of the tolerancing took place... lots of interesting
pictures were drawn in notebooks and even more interesting thought models
constructed... resolving days later in one case and, I think, a few weeks
later in the other. The assumptions were relaxed, but only after being very
thoroughly and expertly considered. 

 

Coming at the problem from the other end of system space in the 90's were
the PC guys. There was no culture of precision in their past (go ahead and
look for any evidence of clocking sophistication in 486s, where they all
came from). The terms I used to apply to PC and workstation guys back then
(when they weren't in the room) were "swashbucklers" and "butchers". Utterly
NO sense of how to do clocking, and they tended to be far too willing to
make big changes in their belief system about tolerancing on the basis of a
2-minute conversation (which means they had no belief system to start with).
One workstation team I met with (I believe a gentleman from that team visits
here quite regularly) had NO... as in NONE... accommodation of jitter in
their tolerance budget. The discussion there went from oops... to we should
factor that in then... to OMG, nothing is shippable if we factor that in!...
to my favorite recurring "idea" for how to weasel a tolerance budget into an
apparently shippable condition: "Let's RMS it". (My advice on RMS is that it
has no role in the avoidance of real pathologies on clocks... my 2 cents). 
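To make concrete why "Let's RMS it" is so seductive, here's a minimal sketch (the contributor names and values are hypothetical, not from any real budget) comparing a straight worst-case stack against a root-sum-square stack of the same contributors:

```python
import math

# Hypothetical per-edge timing tolerance contributors, in ns.
# Illustrative values only, not from any real budget.
contributors = {
    "driver skew": 0.30,
    "trace mismatch": 0.15,
    "receiver threshold": 0.20,
    "jitter": 0.25,
}

# Straight worst-case stack: every contributor hits its limit at once.
worst_case = sum(contributors.values())  # 0.90 ns

# "Let's RMS it": root-sum-square, which implicitly assumes every
# contributor is an independent random variable -- rarely true of
# systematic terms like fixed skew.
rss = math.sqrt(sum(v * v for v in contributors.values()))  # ~0.46 ns

print(f"worst-case stack: {worst_case:.2f} ns")
print(f"RSS stack:        {rss:.2f} ns")
```

The RSS number is roughly half the worst-case number here, which is exactly why it keeps reappearing whenever a budget won't close; the trouble is that RSS is only defensible when the contributors really are independent random variables.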

 

Personally, I've ended up suggesting in good faith that clients move both
toward greater conservatism (grow reliability) and toward more aggressive
relaxation of a worst-case assumption (grow performance). When these long
philosophical discussions take place, I attempt in my reasoning to take
into consideration all the knowledge we have... the mission of the
system... how many do they intend to make (only ex-supercomputer guys think
"1" is a possible answer to this:)? How deep and wide is the distribution
network, and are the things we're worried about static, dynamic or hybrid
tolerances (you stack them differently depending on the answer)? Is the
effect of the assumption we're thinking of relaxing systemic or local? Are
there any "double taxing" effects in the things that are "too big"? Can
this team actually do this? etc. 
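The static/dynamic distinction above is the crux of how the terms get stacked. Here's a minimal sketch of one common convention (illustrative numbers only, and the static/dynamic split is itself a modeling judgment, not something the device tells you):

```python
import math

def stack(static_terms, dynamic_terms):
    """Hybrid tolerance stack, in ns.

    Static (systematic) contributors are treated as fully correlated
    and add linearly; dynamic (random) contributors are treated as
    independent and are root-sum-squared."""
    return sum(static_terms) + math.sqrt(sum(t * t for t in dynamic_terms))

# Illustrative values: fixed mismatch/offset terms vs. jitter-like terms.
static = [0.20, 0.10]
dynamic = [0.15, 0.25]
print(f"hybrid stack: {stack(static, dynamic):.2f} ns")
```

Misclassifying a systematic term as dynamic quietly shrinks the budget, which is one way a tolerance analysis can "simulate out ok" while the hardware faults.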

 

When there was a rich set of choices from which you could build your
distribution network, I worked to develop good relationships with the
semiconductor manufacturers that made the high-precision distribution
solutions. Sometimes it was possible to get a real feel for a
device/family's personality far beyond the story on the data sheet... what
direction did certain tolerances move under certain conditions... what was
the "lawyer factor"... what are the "I've never ever seen XYZ" factors from
the characterization guys... etc. And actually, derating factors were
provided by some like the Moto guys who I got to know very well. Back then,
clock distribution and reception schemes were universally unique
clean-sheet-of-paper designs and you had lots of choices about how to color
them in. I think this created an environment for a skillful practitioner to
work the statistics in an informed way away from pure worst case. 

 

Something else we had back on the huge systems was absolute control not only
of every path through the CDN, but also every state-to-state path, and given
design tools then, there were not nearly as many critical paths as there are
today. Even though the systems were huge, the problems were tractable. On a
modern system, there's been a LOT of simplification for the clock guy... You
have 3 orders of magnitude fewer board-level loads, usually only a single
level of distribution instead of 6-10 with 2 or more parallel phases...
BUT... today you have no idea of what's happening in the critical paths.
They live inside devices you can't see into. What you do know is that modern
design tools have made nearly every path a critical path. You can't work
with the same kind of knowledge or total system visibility now that you
could then. Reasoning out what direction a Moto "111" will swing under
likely operating conditions was just a lot easier than predicting how some
been-on-the-market-for-a-month double-PLL will behave when its Jack the
Ripper personality comes out... the one that only shows when an uncommon
and hitherto-unseen memory-write mode launches an image current underneath
it. It's different now. You have
to trust that the behaviors you see in validation are all that are in there.
"Skill" can't be applied in the same ways and IMO isn't as useful to the
modern challenge.  

 

At the end of my first decade with clocks, I would have confidently answered
your question that a maximally conservative POV is always required, though
careful and informed relaxation from worst case can still be tolerated
(ROFL... clock joke!). At this point, approaching three decades, I've
encountered this question/scenario more than a few times. I've seen really
careful teams/cultures still get burned being careful because of an
undocumented Jack the Ripper effect in a complex device, and I've seen guys
who shouldn't be allowed to design flashlights hit it out of the park being
more optimistic than Mary Poppins. What I think explains some of that is
that challenging design situations draw in more proficient designers, and
the designs that don't probably have a LOT of undocumented margin. For
example.. take the first two generations of a certain well-known
microprocessor class (not saying which since they have more lawyers inside)
that in my opinion, could NOT be implemented with legally timed designs that
used the devices available at the time, using even a very experienced
supercomputer-grade attention to detail. Yet... they worked, and I think
that is because ALL of the risk margin on timing got pulled inside the
processor to be available for those designers. You seemingly didn't have to
actually legally time them. 

 

If you want to do something to crystallize your belief system, I would
suggest getting out of the sim environment asap (simulation believers
comprised a solid 15 years of rescue consults for me:) and building up a
test program that gives you real data on what happens when you stress some
assumption in a controlled way, and ultimately provides you with a sense of
the pertinent governing dynamics for the kind of systems your team makes.
There's probably no time or budget to do that today, but that is still my
suggestion. That is when you can start relaxing certain worst case
contributors to your overall tolerance budget in an informed way. Oh yeah...
and when you first have that confidence and start to apply it, that is when
you're most likely to screw up REALLY big:) Good luck with things. 

 

Mike Williams

 

 

_____________________///ASA\\\_____________________

Mike Williams
President
Principal Product Designer - M1 OT Family
ASA

+1.413.596.5354
mike.williams@xxxxxxxxxxxxxxxxxx
M1 Oscilloscope Tools:     www.m1ot.com
ASA:     www.amherst-systems.com

_____________________________________________________________



 
>From: "Harwood, Morton (GE Infra, Aviation, US)" Morton.Harwood@xxxxxx
>To: <si-list@xxxxxxxxxxxxx>
>Date: Fri, 20 Mar 2009 15:08:10 -0400
>
>What I don't really see, however, is how I can legitimately change the
>numbers "on paper" to show other than 85 MHz as the worst-case limit.
>This is what a certain group of people are asking me for, claiming that
>we need to be more "aggressive" -- perhaps by taking an RMS
>(root-mean-square) sum of delays or something of that nature -- instead
>of subtracting from the 10 ns path a straight sum of delays like I am
>doing.  Another idea is to reduce the delays by some percentage to be
>less than worst-case.  (I counter with the argument that even though I
>describe my timing analysis methodology as "worst-case", it actually
>doesn't even consider the effects of noise -- such as from the power
>distribution network or crosstalk.  So I could make my timing analysis
>even more pessimistic by adding some noise margin to the data sheet
>values for vinl and vinh.  And regarding an RMS sum of timing delays, I
>think that's a fudge, not a valid method.)
>
>Well anyway I'm wondering what alternative timing analysis methodologies
>other people use, or how other people deal with this situation at their
>companies.
>
>Thanks very much if you have comments.
>
>Mort



