Re: [PATCH] Implement timekeeping for rumprun/hw (x86)

  • From: Antti Kantee <pooka@xxxxxx>
  • To: rumpkernel-users@xxxxxxxxxxxxx
  • Date: Wed, 01 Jul 2015 14:24:49 +0000

On 01/07/15 13:20, Martin Lucina wrote:

Does using rdtsc really work as a basis for timekeeping? Doesn't
the calibration go off when the clock rate changes?

On older processors/laptop systems the TSC does indeed change frequency
when the clock rate changes. On some even older "broken" processors, it
even does things like halt in idle. All the kernels I've looked at contain
a maze of code to detect this and to avoid the TSC when it is broken.

So some "real" operating systems use tsc as a basis for clock? ok.

However, all Intel processors since Nehalem (introduced 2007, manufactured
2008) have an invariant TSC, which is completely fine for our purposes:

http://marc.info/?l=xen-devel&m=128475199115727&w=2

Ok. Too bad my newest hardware is Core 2 :/

I can put in a check for the invariant TSC using CPUID; the question is,
if that check fails, should we refuse to boot, or warn and try anyway? For
example, the system I tested on is a 2005-era Pentium M which does not have
an invariant TSC, but as long as you run it on AC power, all is fine.
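
For reference, the check boils down to one CPUID leaf; a minimal sketch
in C (the x86_cpuid() wrapper is illustrative, not existing rumprun code):

#include <stdint.h>

/* illustrative cpuid wrapper, not existing rumprun code */
static inline void
x86_cpuid(uint32_t leaf, uint32_t *eax, uint32_t *ebx,
        uint32_t *ecx, uint32_t *edx)
{

        __asm__ __volatile__("cpuid"
            : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
            : "a"(leaf), "c"(0));
}

/* the invariant TSC is advertised in CPUID.80000007H:EDX[8] */
static int
tsc_is_invariant(void)
{
        uint32_t eax, ebx, ecx, edx;

        x86_cpuid(0x80000000U, &eax, &ebx, &ecx, &edx);
        if (eax < 0x80000007U)
                return 0;               /* extended leaf not available */
        x86_cpuid(0x80000007U, &eax, &ebx, &ecx, &edx);
        return (edx & (1U << 8)) != 0;
}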

I don't know what precisely to do, but it would be good to make sure the user sees the error, yet does not have to go edit code if they want to run regardless. Maybe we need some sort of "make error a warning" flag to rumprun?

There's a bit more to using the TSC when SMP is involved, but that is not
something I want to even think about now :-)

I don't think we'll be doing SMP anytime soon. Ever since I implemented & measured locks_up.c (ca. 2010), my belief -- and computing is of course mostly about religion -- has been that the Solaris/IRIX/etc. 90s/00s-style massive in-kernel SMP permeating everything is the wrong architecture. Just run more instances of the kernel drivers. Completely coincidentally, running more driver instances is very cheap with rump kernels ...

However, I am concerned about the case where the host has SMP. Is the TSC always sufficiently virtualized?

Besides, rdtsc is not available on a 486, which I understood was one of
your targets.

I changed my mind, for two reasons. First, I don't have a clear idea of how
to implement a monotonic clock without a TSC. Second, I'm not trying to
build a general solution that will run on any PC-compatible system since
the dawn of the 80386. TSC is available on any Pentium-class processor, and
new embedded offerings from Intel such as Quark are also Pentium-class.

Ok, pointing it out since 486 support was *your* target ;)

I don't understand the fascination with the 100ms calibration delay.
Why is 99ms not a good value? Or 10ms? Or 1ms? I'd assume 100ms
is a value that someone picked out of a hat back when clock rates
were around 8MHz and minimizing it simply didn't matter since
computers took minutes to boot anyway.

That particular algorithm is based on what NetBSD does and happens to be
the simplest option, which is why I used it.

Did you test what the results are like with smaller calibration delays?
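
For concreteness, the shape of such a calibration; i8254_delay() here is an assumed busy-wait on the PIT, not taken from the patch:

#include <stdint.h>

static inline uint64_t
rdtsc(void)
{
        uint32_t lo, hi;

        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
}

/* count TSC ticks across a PIT-timed delay; delay_ms is the 100ms knob */
static uint64_t
tsc_calibrate_hz(unsigned delay_ms)
{
        uint64_t t0, t1;

        t0 = rdtsc();
        i8254_delay(delay_ms * 1000);   /* assumed: busy-wait, microseconds */
        t1 = rdtsc();
        return (t1 - t0) * 1000 / delay_ms;
}

Note that the relative error of the result scales as (delay timing error)/delay, so shrinking the delay proportionally amplifies the PIT's quantization and any interrupt taken during the loop; presumably that is how largish constants like 100ms survive.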

I don't understand why you need assembly to do multiplication.

I need to operate on the intermediate product, which may be larger than 64
bits. It is much easier to reason about what happens if you write it in
assembly, and it also allows the use of a single mulq instruction on x86-64.
Doing the latter in C would depend on GCC-specific 128-bit types.

Well it's definitely not easier for *me* to reason about it if *you* write it in assembly ;)

Perhaps offer a C fallback there? It also serves to document what is actually going on.
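
Something like the following would pair the mulq path with C fallbacks; the helper name and the >>32 scaling convention are assumptions, not taken from the patch:

#include <stdint.h>

/* return (delta * mult) >> 32, keeping the full 128-bit intermediate */
static inline uint64_t
mul64_shift32(uint64_t delta, uint64_t mult)
{
#if defined(__x86_64__)
        uint64_t lo, hi;

        /* single mulq: RDX:RAX = delta * mult */
        __asm__("mulq %3" : "=a"(lo), "=d"(hi) : "a"(delta), "rm"(mult));
        return (hi << 32) | (lo >> 32);
#elif defined(__SIZEOF_INT128__)
        /* compiler-provided 128-bit type (GCC/clang) */
        return (uint64_t)(((unsigned __int128)delta * mult) >> 32);
#else
        /* plain C: 32x32->64 partial products */
        uint32_t dl = (uint32_t)delta, dh = (uint32_t)(delta >> 32);
        uint32_t ml = (uint32_t)mult, mh = (uint32_t)(mult >> 32);

        return ((uint64_t)dl * ml >> 32) + (uint64_t)dl * mh +
            (uint64_t)dh * ml + ((uint64_t)dh * mh << 32);
#endif
}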

Critically examine need for critical sections.

Good point. bmk_cpu_clock_now() should have cli()/sti() around it, or is
there some other mechanism you'd like me to use?

I don't know exactly what, since I didn't think about it carefully. Just randomly sprinkling cli/sti is usually the wrong thing, but cli/sti would be the mechanism of critical sectionizing, yes.
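
In that spirit, saving and restoring the interrupt flag avoids accidentally enabling interrupts in a context that had them off; a sketch, with all names invented for illustration:

static inline unsigned long
intr_save_disable(void)
{
        unsigned long flags;

        __asm__ __volatile__("pushf; pop %0; cli" : "=r"(flags) :: "memory");
        return flags;
}

static inline void
intr_restore(unsigned long flags)
{

        __asm__ __volatile__("push %0; popf" :: "r"(flags) : "memory", "cc");
}

bmk_time_t
bmk_cpu_clock_now(void)
{
        unsigned long flags;
        bmk_time_t now;

        flags = intr_save_disable();
        now = clock_sample();           /* reads the shared timekeeping state */
        intr_restore(flags);
        return now;
}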

I'd just get rid of HZ, it serves no purpose.

You mean replace TIMER_HZ / HZ with TIMER_HZ / 100? Sure.

Why do you need the /100? Can't you just run the clock at TIMER_HZ for calibration? Logically, you'd get an almost HZ times more accurate result that way too, and could easily drop the delay (but FIIK what really happens).
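
For what it's worth, reprogramming channel 0 is only a few port writes, so running it at a different rate during calibration is cheap; a sketch, with outb() as an assumed port-I/O wrapper:

#include <stdint.h>

#define TIMER_HZ        1193182         /* i8254 input clock, fixed */
#define PIT_CH0         0x40
#define PIT_CMD         0x43

static void
i8254_init(unsigned hz)
{
        uint16_t divisor = TIMER_HZ / hz;

        outb(PIT_CMD, 0x34);            /* ch0, lobyte/hibyte, mode 2 */
        outb(PIT_CH0, divisor & 0xff);
        outb(PIT_CH0, divisor >> 8);
}

For full TIMER_HZ granularity one could even latch and read the down-counter directly during calibration instead of counting interrupts.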

bmk_cpu_block() is wrong. Just because a timer interrupt fired
doesn't mean another interrupt didn't. Seems rather painful doing
tickless with i8254...

A correct but wasteful solution would be to just always return to
schedule() after the hlt(). It'll be inefficient for long sleeps, but will
work fine. Any better ideas much appreciated!

Do we need a really good solution there? I assume that KVM-clock will also solve this for the virtualization case, where it matters most. I can't imagine that going back to the scheduler costs *that* many more cycles, since you're waking up already anyway.
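
The wasteful-but-correct variant is at least pleasantly small; a sketch, signature assumed:

void
bmk_cpu_block(bmk_time_t until)
{

        (void)until;            /* the timer is already armed by the caller */

        /*
         * Wait for any interrupt, without assuming it was the timer:
         * sti takes effect only after the following instruction, so
         * "sti; hlt" cannot race with the wakeup interrupt. The caller
         * re-checks the deadline, so a spurious wakeup is merely
         * inefficient, never incorrect.
         */
        __asm__ __volatile__("sti; hlt; cli" ::: "memory");
}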

Regarding the i8254: I used it because, despite being limited, it is easy
to program and well documented.

I'm not saying it was the worst choice, just making an observation that it looks painful. (and I do sort of remember now why I used rdtsc as the "clock" ;)

What is the intended division between bmk_platform_X() and bmk_cpu_X()?

Is it ok to just call e.g. bmk_cpu_block() from bmk_platform_block() as I'm
doing now? Similarly for bmk_platform_clock_epochoffset() just calling
bmk_cpu_clock_epochoffset().

Well, it's a bit poor since bmk means many things due to how things evolved. Should fix that some day.

Ideally, the hw platform would be like xen, i.e. all except select symbols are hidden when rumprun.o is created (which should probably be called hw.o). (*)

Anyway, bmk_platform is intended to be upcalls from libbmk_core to the platform. bmk_platform_cpu is intended to be the same, except that it signals an implementation which is always architecture-dependent. bmk_cpu signals architecture-dependence *within* hw. The only rule, really, is that you should only use things starting with bmk_platform as an upcall from the generic libbmk_core.
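
So on hw, the forwarding you describe is exactly the intent; in sketch form (bodies illustrative):

/* upcall from the generic libbmk_core ... */
void
bmk_platform_block(bmk_time_t until)
{

        /* ... forwards to the architecture-dependent part within hw */
        bmk_cpu_block(until);
}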

So, I guess when hw symbols are hidden, the bmk_ can be dropped from there, and the confusion will magically disappear (yes, yes, yes ... ??!?)

*) ok there is one not-yet-examined case. It might be worthwhile to allow for exceptions so that applications could make calls directly into the "kernel" bypassing the syscall layer. But need to imagine a usable mechanism for that, and since it doesn't currently work anyway, no sweat about it.

- antti

p.s. thanks for the braindump and urls, they're a useful resource
