Re: [PATCH] Implement timekeeping for rumprun/hw (x86)

  • From: Martin Lucina <martin@xxxxxxxxxx>
  • To: rumpkernel-users@xxxxxxxxxxxxx
  • Date: Wed, 1 Jul 2015 15:20:34 +0200

On Wednesday, 01.07.2015 at 00:11, Antti Kantee wrote:

Well, that's a lot of code, apparently a good deal of which comes
from having to convert RTC into seconds :(

You can thank IBM for that :-)
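
For illustration only (this is not the code from the patch), the RTC-to-seconds conversion boils down to decoding BCD registers and counting days. The sketch below assumes the RTC reports BCD values in 24-hour mode, keeps UTC, and that the caller has already assembled a four-digit year of 1970 or later. All function names here are hypothetical.

```c
#include <stdint.h>

/* Decode one packed-BCD byte as read from the RTC, e.g. 0x59 -> 59. */
static unsigned bcd_to_bin(uint8_t bcd)
{
    return (bcd >> 4) * 10u + (bcd & 0x0fu);
}

static int is_leap_year(unsigned year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/* Days between 1970-01-01 and the given Gregorian date (year >= 1970). */
static uint64_t days_since_epoch(unsigned year, unsigned month, unsigned day)
{
    /* Days before the first of each month in a non-leap year. */
    static const unsigned cumdays[12] = {
        0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334
    };
    uint64_t y = year - 1;                       /* complete years elapsed */
    uint64_t leaps = y / 4 - y / 100 + y / 400;  /* leap days in those years */
    /* 719162 is the number of days from 0001-01-01 to 1970-01-01. */
    uint64_t days = y * 365 + leaps - 719162;

    days += cumdays[month - 1] + (day - 1);
    if (month > 2 && is_leap_year(year))
        days += 1;
    return days;
}

/* Seconds since the Unix epoch for an already-decoded UTC date and time. */
static uint64_t rtc_to_epoch(unsigned year, unsigned month, unsigned day,
                             unsigned hour, unsigned min, unsigned sec)
{
    return days_since_epoch(year, month, day) * 86400ULL
        + hour * 3600ULL + min * 60ULL + sec;
}
```

A real driver additionally has to wait for the RTC update-in-progress flag to clear and handle the century byte, neither of which is shown here.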

Does using rdtsc really work as a basis for timekeeping? Doesn't
the calibration go off when the clock rate changes?

On older processors/laptop systems the TSC does indeed change frequency
when the clock rate changes. On some even older "broken" processors, it
even does things like halt in idle. In the kernels I've looked at, all of
them contain a maze of code to deal with this and not use the TSC if it is

However, all Intel processors since Nehalem (introduced 2007, manufactured
2008) have an invariant TSC which is completely fine for our purposes:

I can put in a check for the invariant TSC using CPUID; the question is if
that fails should we just refuse to boot or warn and try anyway? For
example, the system I tested on is a 2005-era Pentium M which does not have
an invariant TSC, but as long as you run it on AC all is fine.
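
A minimal sketch of such a check, assuming a GCC or Clang toolchain that provides <cpuid.h>. The invariant TSC flag is bit 8 of EDX in CPUID leaf 0x80000007. The function and macro names are hypothetical, and whether to warn or refuse to boot on failure is left to the caller.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang CPUID helpers, x86 only */
#endif

#define CPUID_LEAF_APM            0x80000007u  /* Advanced Power Management */
#define CPUID_APM_EDX_INVAR_TSC   (1u << 8)    /* invariant TSC flag */

/* Pure bit test, kept separate so it can be checked without real hardware. */
static bool edx_has_invariant_tsc(unsigned edx)
{
    return (edx & CPUID_APM_EDX_INVAR_TSC) != 0;
}

/* Returns true only if the CPU advertises an invariant TSC. */
static bool cpu_has_invariant_tsc(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;

    /* __get_cpuid() returns 0 if the requested leaf is not supported. */
    if (!__get_cpuid(CPUID_LEAF_APM, &eax, &ebx, &ecx, &edx))
        return false;
    return edx_has_invariant_tsc(edx);
#else
    return false;    /* not an x86 build: no TSC to speak of */
#endif
}
```

On a CPU like the Pentium M mentioned above, this would be expected to return false, at which point the platform code could print a warning and continue.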

There's a bit more to using the TSC when SMP is involved, but that is not
something I want to even think about now :-)

Besides, rdtsc is not available on a 486, which I understood was one of
your targets.

I changed my mind, for two reasons. First, I don't have a clear idea of how
to implement a monotonic clock without a TSC. Second, I'm not trying to
build a general solution that will run on any PC-compatible system since
the dawn of the 80386. TSC is available on any Pentium-class processor, and
new embedded offerings from Intel such as Quark are also Pentium-class.

Hence, I don't think it's worth both the work and extra complexity involved.
It's also not a regression, since the current code uses TSC. Having said
that, if someone comes along and wants a massive deployment of rumprun on
486-class CPUs, I'm open to consulting offers :-)

I don't understand the fascination with the 100ms calibration delay.
Why is 99ms not a good value? Or 10ms? or 1ms? I'd assume 100ms
is a value that someone picked out of a hat back when clock rates
were around 8MHz and minimizing it simply didn't matter since
computers booted for minutes anyway.

That particular algorithm is based on what NetBSD does and happens to be
the simplest option which is why I used it.
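
To make the arithmetic concrete, here is a rough sketch (not the patch code) of what such a calibration computes once the TSC delta across a known delay has been measured. It assumes the delay is timed by the PIT and that the result is kept as a 32.32 fixed-point nanoseconds-per-tick multiplier. The names are hypothetical.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Estimate the TSC frequency in Hz from the number of TSC ticks
 * observed across a delay of delay_ms milliseconds.
 */
static uint64_t tsc_freq_from_delta(uint64_t tsc_delta, unsigned delay_ms)
{
    return tsc_delta * 1000u / delay_ms;
}

/*
 * Derive a 32.32 fixed-point multiplier such that
 * nanoseconds = (tsc_ticks * mult) >> 32.
 * NSEC_PER_SEC << 32 is about 4.3e18 and still fits in 64 bits.
 */
static uint64_t tsc_mult_from_freq(uint64_t tsc_hz)
{
    return (NSEC_PER_SEC << 32) / tsc_hz;
}
```

For example, 100,000,000 ticks counted over a 100ms delay gives a 1GHz TSC and a multiplier of exactly 1.0 (that is, 1 << 32). A shorter delay yields fewer ticks for the same frequency, so any fixed error in timing the delay becomes a larger fraction of the measurement.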

Linux is even more paranoid and takes longer calibrating the TSC, see here:

The best algorithm I've found so far would appear to be the OpenSolaris
code at:

Unfortunately that is also all hand-coded assembly and CDDL licensed so not
usable for us. If someone wants to do a clean-room implementation of that
code for rumprun, be my guest.

In any case, does it really matter how long actual bare metal takes to
boot? There are way longer delays all over the place once you start
enumerating devices, etc.

What does matter is unikernels on KVM, and there the delay should (if I
understand it correctly) go away entirely once I implement KVM pvclock
since I can just grab the TSC multiplier from that interface and not bother
with the TSC calibration at all.

I don't understand why you need assembly to do multiplication.

I need to operate on the intermediate product which may be larger than 64
bits. It is much easier to reason about what happens if you write it in
assembly, and it also allows use of a single mulq instruction on x86-64.
Doing the latter in C would depend on GCC-specific 128-bit types.
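
For comparison, a portable C version is possible by splitting the operands into 32-bit halves so that no partial product exceeds 64 bits. This is only a sketch of that alternative, not what the patch does. It assumes the multiplier is in 32.32 fixed-point format, and the function name is hypothetical.

```c
#include <stdint.h>

/*
 * Compute (a * m) >> 32 without a 128-bit type, where m is a 32.32
 * fixed-point multiplier. With a = ah*2^32 + al and m = mh*2^32 + ml:
 *
 *   a * m = ah*mh*2^64 + (ah*ml + al*mh)*2^32 + al*ml
 *
 * so shifting right by 32 gives the sum below. The result is truncated
 * to 64 bits, as it would be when taking the low word of the shifted
 * 128-bit product.
 */
static uint64_t mul64_shift32(uint64_t a, uint64_t m)
{
    uint64_t ah = a >> 32, al = a & 0xffffffffu;
    uint64_t mh = m >> 32, ml = m & 0xffffffffu;

    return ((ah * mh) << 32)   /* high x high, scaled back up by 2^32 */
        + ah * ml              /* cross terms already at the right scale */
        + al * mh
        + ((al * ml) >> 32);   /* low x low contributes only its top half */
}
```

Four multiplications instead of a single mulq is the price of portability here, which is one way to see the appeal of the assembly version on x86-64.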

Critically examine need for critical sections.

Good point. bmk_cpu_clock_now() should have cli()/sti() around it, or is
there some other mechanism you'd like me to use?

I'd just get rid of HZ, it serves no purpose.

You mean replace TIMER_HZ / HZ with TIMER_HZ / 100? Sure.

bmk_cpu_block() is wrong. Just because a timer interrupt fired
doesn't mean another interrupt didn't. Seems rather painful doing
tickless with i8254...

A correct but wasteful solution would be to just always return back into
schedule() after the hlt(). It'll be inefficient for long sleeps, but will
work fine. Any better ideas much appreciated!

Regarding the i8254, I used it because, despite being limited, it is easy to
program and well documented.

The alternative would be to use the APIC timer which is available on
Pentium and newer, however that is much more complex to set up (since you
have to enable the APIC instead of the i8259 legacy PICs) and requires
calibration against the PIT since it runs at the CPU bus clock frequency.
KVM does not tell us anything about the APIC so we'd be stuck with the
initial boot delay even there :-/

One nice thing about the APIC timer is that on fairly new processors (Sandy
Bridge and newer, 2011 vintage) it supports a TSC-deadline timer which is
exactly what we need; it fires an IRQ once the TSC passes a certain
deadline. So that may be worth implementing where supported as (guessing)
it should exhibit much lower overhead in virtualized scenarios.

No need to expose everything that the original clock_subr.h exposes.

uint64_t dt_year? Well that's not going to suffer from y2k issues
anytime soon. Why is it unsigned anyway? Does counting start from
-bigbang or what? ;)

That was all lifted rather hurriedly from NetBSD, I'll clean it up a bit
more :-)

I'm not entirely happy about the MD/MI split of the code, perhaps that
could be improved. Antti?

Can you elaborate?

What is the intended division between bmk_platform_X() and bmk_cpu_X()?

Is it ok to just call e.g. bmk_cpu_block() from bmk_platform_block() as I'm
doing it now? Similarly for bmk_platform_clock_epochoffset() just calling
