[SI-LIST] Re: Fibre channel interconnect margins

  • From: <marcus_mueller2@xxxxxxxxxxx>
  • To: <si-list@xxxxxxxxxxxxx>
  • Date: Thu, 6 Jul 2006 14:56:03 +0200

> I've not done this sort of thing, but I think it's a lot more
> than that. Don't forget, these are random processes (or at
> least, the mathematics we use to describe them are).
>
> It isn't sufficient to run the experiment just long enough
> for (on average) one error to occur, and then if the number
> of observed errors is either 0 or 1, claim victory and go
> home.  The BER could be greater than your limit and yet the
> particular interval you chose had only 0 or 1.  Or you might observe
> 2 or 3 errors in that particular interval even though the BER
> was what you thought it should be.

Andy,

here are a few numbers that support this. Let's assume that a DUT is
performing with an error ratio of 1e-12, and we run a BER test for 1e12
bits. Our expectation is to observe one error in those 1e12 bits, but as
you say errors are distributed randomly; so sometimes we'll see no
errors at all, and sometimes more than just a single error.

"Sometimes" is a bit vague though; we'd all like to see some hard
numbers there. And they are surprisingly easy to get: if errors are
distributed randomly, we can just use the standard equations for binary
random processes (coin flips were the example used in my basic
statistics class). For a large number of observed bits and few errors,
that's the Poisson distribution, and we get the following results (for
1e12 compared bits, error ratio 1e-12):
  0 errors in 1e12 bits: measured BER   0.0 with probability p=0.3679
  1 error  in 1e12 bits: measured BER 1e-12 with probability p=0.3679
  2 errors in 1e12 bits: measured BER 2e-12 with probability p=0.1839
  3 errors in 1e12 bits: measured BER 3e-12 with probability p=0.0613
  4 errors in 1e12 bits: measured BER 4e-12 with probability p=0.0153
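
For anyone who wants to reproduce these numbers, here is a minimal Python sketch. It assumes independent, randomly distributed errors, so that the error count in a run follows a Poisson distribution whose mean is the number of compared bits times the error ratio. The function and variable names are only illustrative.

```python
import math

def poisson_pmf(k, mean):
    """Probability of observing exactly k errors when `mean` errors are expected."""
    return math.exp(-mean) * mean**k / math.factorial(k)

bits_compared = 1e12   # length of the BER test in bits
true_ber = 1e-12       # assumed true error ratio of the DUT
expected_errors = bits_compared * true_ber  # about 1 error on average

for k in range(5):
    p = poisson_pmf(k, expected_errors)
    measured_ber = k / bits_compared
    print(f"{k} errors: measured BER {measured_ber:.0e}, p={p:.4f}")
```

Changing bits_compared or true_ber shows how quickly the spread changes once the expected error count moves away from 1.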

Remember that our example device had a BER of exactly 1e-12, so it's a
bad device by definition; but still there's a 37% probability of
measuring it at BER=0. So if we use this test in production and our
pass/fail limit is BER<1e-12, we run a 37% risk of shipping a bad
device. For our 1e-12 limit, the 1e-12 device is the "best case" bad
device, since most failing devices will have a higher error ratio. For
our measurement however this device is the worst case, because we know
we need to measure longer for lower error ratios.

> Personally, I'd want to run the experiment over at least 10
> times as long an interval, to have enough confidence that my
> measurement is even close to the statistical average.

Since you mention confidence, it would be great if we could express our
degree of confidence in a measurement with a number, too. In statistics,
such a number is called a confidence level. Let's go back to our
example; we can say that if a 1e-12 device passes with 37% probability,
a device with a higher error ratio passes with a lower probability. And
from there, we can say that after 1e12 error free bits, the probability
that _any_ device has BER<1e-12 is about 63%. In other words, a device
has BER<1e-12 on the 63% confidence level after 1e12 error free bits.

Ok, more numbers. This time we look at the "BER<1e-12 confidence level"
as a function of error free compared bits. We do this by calculating
the probability of observing 0 errors in m bits at BER=1e-12. The
confidence level for BER<1e-12 is then simply one minus that probability.
  0 errors in  1e12 bits @ BER=1e-12: p=0.3679  -> CL(BER<1e-12)=63.2%
  0 errors in  2e12 bits @ BER=1e-12: p=0.1353  -> CL(BER<1e-12)=86.5%
  0 errors in  3e12 bits @ BER=1e-12: p=0.0498  -> CL(BER<1e-12)=95.0%
  0 errors in  5e12 bits @ BER=1e-12: p=0.0067  -> CL(BER<1e-12)=99.3%
  0 errors in 10e12 bits @ BER=1e-12: p=4.54e-5 -> CL(BER<1e-12) almost 100%
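
The same calculation as a minimal Python sketch, again assuming Poisson-distributed errors: the probability of zero errors in m bits at the limit BER is exp(-m * BER), and the confidence level is one minus that. The function name confidence_level is just something I made up for this illustration.

```python
import math

def confidence_level(error_free_bits, ber_limit):
    """Confidence level for BER < ber_limit after a run with zero observed errors."""
    p_zero_errors = math.exp(-error_free_bits * ber_limit)
    return 1.0 - p_zero_errors

for m in (1e12, 2e12, 3e12, 5e12, 10e12):
    cl = confidence_level(m, 1e-12)
    print(f"0 errors in {m:.0e} bits: CL(BER<1e-12) = {100 * cl:.3f}%")
```

Note that this covers only the zero-error case; a run with one or more observed errors needs the corresponding cumulative Poisson terms.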

So "at least 10 times as long" might be a bit on the pessimistic side,
but that depends on how tolerant your company is regarding shipments of
bad devices. By the way, these numbers scale with the target BER: just
replace the "12" exponents in the table above with "10" or "15" or
whatever.

> Which is why you are totally correct when you said "a low BER
> means a LOT of testing."  An even bigger LOT than you thought
> it was.  Some experiments just aren't practical, as written.
> They need to be re-written (if you are clever enough to know
> how) to make the errors happen faster.

How long a low BER test takes depends on the data rate, the target BER,
and the required confidence level. Assuming that a 95% confidence level
is good enough, you need about 3e12 bits for BER<1e-12; that's about 5
minutes at 10GBit/s. How practical such a test is is a different story
however: in a high-volume ATE environment where you shoot for test
times in the 2-10 seconds range, 5 min is simply impossible. But if
you're hand-tweaking every device for hours anyway, why not spend the
extra 5 minutes to be sure.

At 1GBit/s and 1e-13, we're talking about 8 hours and 20 minutes, so
that's an over-night test. It's doable, at least in an R&D setting, but
only if you are able to keep your environment constant for that long.
That usually means a temperature/climate chamber, not only for the DUT
but the entire measurement setup (test equipment may drift, too).
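
The test times quoted here follow from inverting the zero-error confidence level: the required number of error free bits is -ln(1 - CL) / BER, and the test time is that divided by the data rate. A minimal Python sketch, with function names that are only illustrative and the same Poisson assumption as before:

```python
import math

def bits_required(ber_limit, confidence):
    """Error free bits needed to claim BER < ber_limit at the given confidence level."""
    return -math.log(1.0 - confidence) / ber_limit

def duration_seconds(ber_limit, confidence, bit_rate):
    """Time needed to transmit the required bits at bit_rate (bits per second)."""
    return bits_required(ber_limit, confidence) / bit_rate

# The two cases discussed in the text, both at a 95% confidence level
t_fast = duration_seconds(1e-12, 0.95, 10e9)   # 10 Gbit/s, BER < 1e-12
t_slow = duration_seconds(1e-13, 0.95, 1e9)    # 1 Gbit/s, BER < 1e-13

print(f"10 Gbit/s, 1e-12: {t_fast / 60:.1f} minutes")
print(f" 1 Gbit/s, 1e-13: {t_slow / 3600:.1f} hours")
```

The factor -ln(0.05) is just under 3, which is where the "about 3e12 bits" rule of thumb for 95% comes from.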

Directly verifying BER<1e-15? Forget it, it's simply not practical. But
there will always be specs that ask for error ratios even lower (I've
seen one proposal for 1e-18). These specs are driven by perfectly valid
considerations (that many links in that many installed systems in that
amount of time, with that many total errors that we are able to
tolerate). But as you say, we need different test procedures for a BER
that low.

There are a couple of approaches out there: raising SNR on an optical
link, BERT scan (bathtub curve) extrapolations, RJ/DJ separation on
scopes and TIAs with a dual-Dirac model, etc. But all of them make
assumptions that may or may not hold for a given device. They all seem
to come out on the high side though (one reason for this is the
unbounded noise assumption), so at least most devices that are tested
with them should work better than advertised.

Regards,
Marcus

---
Marcus Mueller
Agilent Technologies