[argyllcms] Re: bin/average: averaging and possible outlier elimination for three or more .ti3 sets?

  • From: "Alastair M. Robinson" <profiling@xxxxxxxxxxxxxxxxxxxxxxx>
  • To: argyllcms@xxxxxxxxxxxxx
  • Date: Sat, 29 Aug 2009 12:25:08 +0100

Hi :)

Craig Ringer wrote:

> Here's an example error spike:
>
> 5: 83.070051 1.346918 2.895409 <=> 82.505987 1.360226 2.779120  de 0.576080
> 6: 82.110832 1.536077 2.992189 <=> 81.607727 1.437567 2.733480  de 0.574238
> 7: 40.886825 4.316704 18.048722 <=> 82.063784 1.539669 2.906388  de 43.960711
>    **** Huge error spike ****

Very likely to be a reading glitch - we saw similar problems when Robert from the Gutenprint project tried out his i1Pro and GPLin. I've been meaning since then to put together something myself to help with outlier elimination, but never got that far.
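For reference, the "de" column in that output is just the CIE76 colour difference, i.e. the Euclidean distance between the two Lab readings - patch 5's 0.576080 falls straight out of the Lab values shown. In C:

#include <math.h>

/* CIE76 delta E: Euclidean distance between two Lab values.
 * This is what the "de" column in the output above reports. */
static double delta_e76(const double lab1[3], const double lab2[3])
{
	double dL = lab1[0] - lab2[0];
	double da = lab1[1] - lab2[1];
	double db = lab1[2] - lab2[2];
	return sqrt(dL * dL + da * da + db * db);
}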

> I know I can average .ti3 sample sets using bin/average. However, it
> only seems to accept a pair of .ti3 inputs at a time, and averaging
> consecutively isn't going to produce an ideal result.

No, I guess average is intended to account for process variation, rather than spikes due to chart misreads.

> Also, is there any good built-in way to eliminate outliers in the .ti3
> files, or will I need to roll my own? For that matter, is it wise to do
> outlier elimination at all?

While I've not used an i1 myself, I've heard it said that it's not just wise but vital. Having said that, you don't want to eliminate process variation - just misread patches.

> If there's no existing method I'm missing, and if what I want to do
> actually seems like a good idea to the folks here, I'm thinking of
> seeing if I can extend `average.c' to handle more than two input files.

I think that would be extremely useful. What I was planning, but never got around to, was either to extend average or to create an analogous utility that takes three or more files and finds the median rather than the mean.
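A minimal sketch of the core of that, assuming the readings for one channel of one patch have already been pulled out of the N input files (the layout and names here are just for illustration):

#include <stdlib.h>
#include <string.h>

static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;
	return (x > y) - (x < y);
}

/* Median of n readings of one channel of one patch.
 * For even n, average the two middle values. */
static double median(const double *vals, int n)
{
	double *tmp = malloc(n * sizeof(double));
	double m;
	memcpy(tmp, vals, n * sizeof(double));
	qsort(tmp, n, sizeof(double), cmp_double);
	m = (n & 1) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
	free(tmp);
	return m;
}

With three input files the median automatically discards a single glitched reading per channel, with no threshold to tune.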

> I'd try to add basic outlier elimination for when it has three or more
> inputs, with the outlier elimination threshold shrinking as the number
> of input files grows. At this point I'm thinking that any sample more
> than three (maybe even two) standard deviations from the mean is
> probably a reasonable candidate for outlier elimination.

So where you have an error spike you'd prefer to cull that sample from all files rather than pick one to use?
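Either way, the sigma test itself is simple enough. A sketch, with the threshold and data layout as placeholders:

#include <math.h>

/* Flag readings more than nsigma standard deviations from the mean
 * of the n readings for one channel of one patch; outlier[i] is set
 * to 1 for readings that should be culled. */
static void flag_outliers(const double *vals, int n, double nsigma, int *outlier)
{
	double mean = 0.0, var = 0.0;
	int i;

	for (i = 0; i < n; i++)
		mean += vals[i];
	mean /= n;
	for (i = 0; i < n; i++)
		var += (vals[i] - mean) * (vals[i] - mean);
	var /= n;	/* population variance */
	for (i = 0; i < n; i++)
		outlier[i] = fabs(vals[i] - mean) > nsigma * sqrt(var);
}

One caveat: by Samuelson's inequality no reading can lie further than sqrt(n-1) population standard deviations from the mean, so with only three files even a two sigma threshold can never fire - which is another argument for the median.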

Another thing that might work if you don't have enough samples to use a median is to fit an RSPL to each .ti3 file (which is really easy, thanks to Argyll's libraries), then compare each data point against the value interpolated from the RSPL, and see which set of data has the best fit at that point.
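Roughly like this - interp_lab() is just a hypothetical stand-in for building an rspl from one file's data and looking it up at a device coordinate (see rspl/rspl.h in the Argyll source for the real interface):

#include <math.h>

#define NFILES 3	/* illustration only */

/* Hypothetical stand-in: return in lab[] the value that the surface
 * fitted to file f predicts at device coordinates dev[]. In Argyll
 * this would wrap the rspl creation, fit and interpolation calls. */
extern void interp_lab(int f, const double *dev, double lab[3]);

/* For one patch, pick the file whose measured Lab value agrees best
 * (smallest delta E) with its own fitted surface at that point. */
static int best_fit_file(const double *dev, double lab[NFILES][3])
{
	int f, best = 0;
	double bestde = 1e9;

	for (f = 0; f < NFILES; f++) {
		double fit[3], de;
		interp_lab(f, dev, fit);
		de = sqrt((lab[f][0] - fit[0]) * (lab[f][0] - fit[0])
		        + (lab[f][1] - fit[1]) * (lab[f][1] - fit[1])
		        + (lab[f][2] - fit[2]) * (lab[f][2] - fit[2]));
		if (de < bestde) {
			bestde = de;
			best = f;
		}
	}
	return best;
}

You'd then keep that file's reading for the patch, or flag the others as suspect.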

Hope this gives you food for thought!

All the best,
--
Alastair M. Robinson
