[argyllcms] bin/average: averaging and possible outlier elimination for three or more .ti3 sets?

  • From: Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx>
  • To: argyllcms@xxxxxxxxxxxxx
  • Date: Sat, 29 Aug 2009 18:28:21 +0800

Hi folks

I have a fairly large (2310 patch) sample chart set I've scanned in with
my i1Pro. The chart set was run off an offset litho press, so I have
quite a few copies. I've scanned in a few copies, and I'm seeing
significant differences between copies when I use the `verify' tool to
compare the .ti3 data sets. E.g.:

$ verify CHART.sampleA.ti3 CHART.sampleB.ti3 
Verify results:
  Total errors:     peak = 79.868037, avg = 0.971652
  Worst 10% errors: peak = 79.868037, avg = 5.476613
  Best  90% errors: peak = 1.193214, avg = 0.471101

The severity of a few of the errors suggests possible misreads. They're
strip charts I'm reading with an i1Pro by hand (sigh), so operator
error (mine) isn't unlikely. It could also be quirks of the printing
process and/or the newsprint media, since the charts are on off-white,
partly recycled newsprint printed on a press that does adaptive
stochastic dithering.

Here's an example error spike:

5: 83.070051 1.346918 2.895409 <=> 82.505987 1.360226 2.779120  de 0.576080
6: 82.110832 1.536077 2.992189 <=> 81.607727 1.437567 2.733480  de 0.574238
7: 40.886825 4.316704 18.048722 <=> 82.063784 1.539669 2.906388  de 43.960711   
 **** Huge error spike ****
8: 82.523678 1.538664 3.054387 <=> 81.639391 1.422285 2.777460  de 0.933914
9: 82.579391 1.491124 3.066842 <=> 82.226908 1.541316 3.019219  de 0.359209
10: 82.347015 1.442016 3.176074 <=> 80.000400 1.401537 2.851284  de 2.369330

Another:

182: 78.976711 -2.232244 58.087355 <=> 77.775705 -2.120755 56.639930  de 1.884114
183: 80.095716 -2.763189 55.135360 <=> 79.346721 -2.745549 54.804620  de 0.818959
184: 37.428620 -5.000581 0.024627 <=> 78.810882 -2.002915 60.115930  de 73.023573
185: 79.154031 -1.844991 61.359333 <=> 78.829979 -2.012077 60.145829  de 1.267091
186: 78.984785 -1.665758 61.827127 <=> 78.670603 -1.793216 61.269881  de 0.652287

... etc.

I know I can average .ti3 sample sets using bin/average. However, it
only seems to accept a pair of .ti3 inputs at a time, and averaging
consecutively isn't going to produce an ideal result, since each
pairwise step halves the weight of everything accumulated so far and
the later files end up dominating. Is there some good way to average
more than two sample sets that I'm missing, or should I just be doing:

   average a b x
   average x c x2; mv x2 x
   average x d x2; mv x2 x
   average x e x2; mv x2 x

... etc., i.e. averaging each new set into the accumulated sample set?
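
To make the problem concrete, here's a tiny standalone C sketch (not
ArgyllCMS code; I'm assuming `average' weights its two inputs 50/50)
of the effective weight each input file ends up with under that
cascade:

   #include <stdio.h>

   /* Effective per-file weight after cascading equal-weight pairwise
    * averages: x = avg(a,b); x = avg(x,c); x = avg(x,d); ...
    * Assumes `average' combines its two inputs 50/50. */
   int main(void) {
       enum { N = 5 };             /* number of input .ti3 files */
       double w[N];
       int i, j;

       w[0] = w[1] = 0.5;          /* the first pair each get 1/2 */
       for (i = 2; i < N; i++) {
           for (j = 0; j < i; j++)
               w[j] *= 0.5;        /* everything so far is halved... */
           w[i] = 0.5;             /* ...and the new file gets 1/2 */
       }
       for (i = 0; i < N; i++)
           printf("file %c: weight %g\n", 'a' + i, w[i]);
       return 0;                   /* a,b: 0.0625  c: 0.125  d: 0.25  e: 0.5 */
   }

So with five files the last one counts eight times as much as the first
two, which is why I'd prefer a true n-way average.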

Also, is there any good built-in way to eliminate outliers in the .ti3
files, or will I need to roll my own? For that matter, is it wise to do
outlier elimination at all?

I'd rather not just replace possible outliers in the .ti3 files with
samples taken from another .ti3, as that'd bias the average toward one
particular sample reading.

Any suggestions?

If there's no existing method I'm missing, and if what I want to do
actually seems like a good idea to the folks here, I'm thinking of
seeing if I can extend `average.c' to handle more than two input files.
I'd try to add basic outlier elimination for when it has three or more
inputs, with the outlier elimination threshold shrinking as the number
of input files grows. At this point I'm thinking that any sample more
than three (maybe even two) standard deviations from the mean is
probably a reasonable candidate for outlier elimination.
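
Roughly the per-patch logic I have in mind, as an untested standalone
sketch (plain C, not derived from `average.c'; the .ti3 parsing is
omitted, and the readings in main() are made up to resemble the patch
184 misread above). One wrinkle, if my algebra is right: with N sets, a
single wild reading can sit at most sqrt(N-1) pooled standard
deviations from the mean, so for three or four sets the cutoff has to
be well under 2 sigma to fire at all:

   #include <math.h>
   #include <stdio.h>

   #define NSETS 4    /* number of .ti3 files being combined */

   typedef struct { double L, a, b; } lab_t;

   /* Average one patch across NSETS readings, discarding any reading
    * further than k standard deviations from the mean (distance is
    * plain Euclidean Lab distance, i.e. CIE76 delta E), then
    * re-averaging the survivors. */
   static lab_t average_patch(const lab_t r[NSETS], double k) {
       lab_t mean = { 0, 0, 0 }, out = { 0, 0, 0 };
       double d[NSETS], var = 0.0;
       int i, kept = 0;

       for (i = 0; i < NSETS; i++) {
           mean.L += r[i].L / NSETS;
           mean.a += r[i].a / NSETS;
           mean.b += r[i].b / NSETS;
       }
       for (i = 0; i < NSETS; i++) {
           double dL = r[i].L - mean.L, da = r[i].a - mean.a,
                  db = r[i].b - mean.b;
           d[i] = sqrt(dL * dL + da * da + db * db);
           var += d[i] * d[i] / NSETS;
       }
       for (i = 0; i < NSETS; i++) {
           if (d[i] <= k * sqrt(var)) {   /* keep readings within k sigma */
               out.L += r[i].L; out.a += r[i].a; out.b += r[i].b;
               kept++;
           }
       }
       if (kept == 0)          /* degenerate case: fall back to plain mean */
           return mean;
       out.L /= kept; out.a /= kept; out.b /= kept;
       return out;
   }

   int main(void) {
       /* Made-up readings loosely modelled on patch 184 above: three
        * plausible values plus one gross misread. */
       lab_t r[NSETS] = {
           { 78.81, -2.00, 60.12 }, { 79.35, -2.75, 54.80 },
           { 78.98, -2.23, 58.09 }, { 37.43, -5.00,  0.02 },
       };
       lab_t m = average_patch(r, 1.5);   /* 1.5 sigma: drops the misread */
       printf("averaged: %.4f %.4f %.4f\n", m.L, m.a, m.b);
       return 0;
   }

With those numbers and k = 1.5 the misread lands at about 1.7 sigma and
is dropped; at k = 2 it would survive, because it inflates the very
deviation it's measured against.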


Also: I've run into an odd issue when averaging the data sets. The
output produced by `average' has more sets than the inputs do: both
inputs have 2130, but the output has 2364. I'm a bit puzzled about why,
given that both inputs were read using `chartread' from charts printed
using the same .ps file, and both had the same .ti1 and .ti2. Is that
expected behaviour?

--
Craig Ringer

