x86 High Precision Event Timers support

Hello,

As far as I understand (which is not very far; please do point out any inaccuracies), there is an effort in the x86 world to replace the legacy x86 timer infrastructure:

o The PIT (Programmable Interval Timer) such as Intel's 8253 and 8254

formatting link
formatting link

o The RTC (Real-Time Clock)

o The local APIC timer (I didn't find much information on this timer.)

o The ACPI timer, also known as the PM clock (Any pointers?)

Microsoft provides a rationale for the new infrastructure:

formatting link

Intel provides a spec:

formatting link

As far as I understand, the HPET hardware is provided by the southbridge chipset (for example, Intel's ICH5)?

(Would the VIA VT82C686B provide an HPET block?)

My understanding is that the BIOS is supposed to map the HPET registers into memory and advertise them through an ACPI table at boot time? If the BIOS does not initialize the HPET hardware, the OS remains unaware that it is available.

formatting link
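For the curious, here is a rough C sketch of the ACPI description table involved, as I read the spec; the struct and field names are my own paraphrase, so check the actual documents before relying on the layout. The OS finds the table in the RSDT/XSDT by its "HPET" signature:

#include <stdint.h>

/* ACPI Generic Address Structure */
struct acpi_gas {
    uint8_t  address_space_id;     /* 0 = system memory */
    uint8_t  register_bit_width;
    uint8_t  register_bit_offset;
    uint8_t  reserved;
    uint64_t address;              /* physical base the BIOS mapped */
} __attribute__((packed));

/* The "HPET" table: standard ACPI header, then the HPET specifics */
struct acpi_hpet_table {
    char     signature[4];         /* "HPET" */
    uint32_t length;
    uint8_t  revision;
    uint8_t  checksum;
    char     oem_id[6];
    char     oem_table_id[8];
    uint32_t oem_revision;
    uint32_t creator_id;
    uint32_t creator_revision;
    uint32_t event_timer_block_id; /* vendor ID, revision, comparator count */
    struct acpi_gas base_address;  /* typically 0xFED00000 */
    uint8_t  hpet_number;
    uint16_t minimum_tick;         /* minimum periodic tick, counter units */
    uint8_t  page_protection;
} __attribute__((packed));

If the BIOS never builds this table, the timer block may still be sitting in the chipset, but the OS has no architected way to discover it.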

Is there, somewhere, a list of hardware with HPET support?

Are there implementations that support more than 3 comparators?

Regards.

Reply to
Spoon

In comp.os.linux.development.system Spoon wrote in part:

You forgot the venerable and still extremely precise RDTSC instruction, available since the original Pentium, which reads the CPU's cycle counter. Typical overhead is about 30 clocks, versus an interrupt latency of at least 100 clocks.

Accuracy still depends on the clock generator. AFAIK, nanosleep(), gettimeofday() and friends use RDTSC to interpolate between other clocks (the APIC timer preferred over the PIT).
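For reference, reading the counter from C is a one-liner of inline assembly under GCC; this is a generic sketch, not any particular library's implementation:

#include <stdint.h>

/* RDTSC leaves the 64-bit time-stamp counter in EDX:EAX. Note that
   RDTSC is not a serializing instruction, so when timing very short
   sequences you may want a CPUID fence in front of it. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}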

-- Robert

Reply to
Robert Redelmeier

RDTSC is nice as long as you stay away from Geode processors, which seem to enter SMM in more or less unpredictable ways. Any processor that dynamically changes its clock frequency in various power-saving modes will also cause problems.

The CPU clock frequency is quite temperature-dependent. Unless you can check the time at least once a day against some reliable source, such as the CMOS clock, NTP, or a GPS clock, quite significant cumulative errors will occur.

Paul

Reply to
Paul Keinanen

Recent Intel CPUs run the RDTSC cycle counter at a fixed frequency, regardless of temporary reductions in core frequency. Eventually, I suppose, AMD will do the right thing too.

--
Mvh./Regards,    Niels Jørgen Kruse,    Vanløse, Denmark
Reply to
Niels Jørgen Kruse

Which reminds me of Rich Brunner's excellent article:

formatting link

I'm playing with the hrtimers infrastructure:

formatting link

I *think* they use HPET, if they find it.

formatting link

I'm also wondering: Are there x86-based systems where a card equipped with several PITs (e.g. ADLINK's PCI-8554) is a necessity?

formatting link

Reply to
Spoon

It would be nice if they got around to supporting a high-resolution timing interface that doesn't require a syscall, works in an SMP environment, and supports virtual timing as well as real wall-clock timing. It's a known technique and has been around for decades.
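One known shape for that is a time structure the kernel keeps in a page mapped read-only into every process and updates under a sequence counter, so readers never trap into the kernel. A minimal sketch, with invented field names and the memory barriers elided:

#include <stdint.h>

struct time_page {
    volatile uint32_t seq;  /* even = stable, odd = update in progress */
    volatile uint64_t ns;   /* e.g. nanoseconds since boot */
};

/* Retry until we read a consistent snapshot; no syscall involved.
   Real code needs read barriers between the loads. */
static uint64_t read_time(const struct time_page *tp)
{
    uint32_t s;
    uint64_t ns;
    do {
        s  = tp->seq;
        ns = tp->ns;
    } while ((s & 1) || s != tp->seq);
    return ns;
}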

Also, Intel and AMD need to think about how these things virtualize before they put them in, rather than five years after the fact. But that's only important if Intel and AMD think virtualization is an important part of their business strategy.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.
Reply to
Joe Seigh

But since an OS or library that provides timing services cannot rely on running on a processor where the RDTSC frequency is fixed, this won't simplify any such OSes or libraries until, at some point, it becomes practical to ignore older processors.

--
David Hopwood

Reply to
David Hopwood

In comp.os.linux.development.system David Hopwood wrote in part:

This depends very much on the software quality requirements. Not everything is a big system that will be used for critical purposes. Everything is a compromise: RDTSC is very fast and usually good; OS calls are almost always accurate, but slower and usually less precise.

Horses for courses.

-- Robert

Reply to
Robert Redelmeier

Unless the OS makes up for it. If the OS fixes these things up in the other cases (hard; I've tried it), then not having to do so on some systems is a bonus.

Casper

Reply to
Casper H.S. Dik

Currently, Mac OS X can assume that. Granted, Marklar was started before there were fixed-frequency RDTSC processors, so there may be some workaround still in there.

--
Mvh./Regards,    Niels Jørgen Kruse,    Vanløse, Denmark
Reply to
Niels Jørgen Kruse

There are two main problems here:

a) The TSC might not run at a fixed frequency, but an OS can know when the changes happen and still use it to provide a fast return value. It needs a user-level library routine that simply takes the current TSC count, multiplies by the current scale factor (producing a triple-width result), shifts down by the current shift value, and adds the current base count (a concrete sketch follows point b below). The total time for this operation is not much more than that of the RDTSC opcode itself, which can easily take 20-30 cycles on some CPUs.

Intuitively, you would like to either reset the TSC count or store the current value and subtract it out before the multiplication, but the subtraction can instead be included in the base value to be added in after the scaling multiplication.

The OS must of course update the base value and the scale factor each time the TSC frequency changes, but as long as there is only a small number (two?) of base frequencies to support, the needed scale factors can be calculated up front, and you might even get away with just a shift if the slow frequency is a binary fraction of the high one.

b) On a multi-CPU/multi-core system, it is quite possible for the TSC counts to get out of sync, and this is a much harder problem to fix while still delivering sub-microsecond precision and latency.

Windows punts by using the best available external counter, which might fall back all the way to the horrible 1.19 MHz timer chip/RAM-refresh counter designed into the original 1981 model PC. :-(
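To make point a) concrete, here is one way the user-level conversion could look in C, using GCC's unsigned __int128 for the triple-width product. The (scale, shift, base) names and layout are mine, not any real OS's ABI, and a real version must read the triple consistently (e.g. under a sequence lock) across frequency changes:

#include <stdint.h>

struct tsc_conv {
    uint32_t scale;  /* fixed-point ns per tick: (10^9 << shift) / f_tsc */
    uint32_t shift;  /* how far to shift the wide product back down */
    uint64_t base;   /* absorbs the epoch offset, as described above */
};

static inline uint64_t tsc_to_ns(uint64_t tsc, const struct tsc_conv *c)
{
    /* 64x32 multiply giving a 96-bit ("triple-width") product,
       then shift down and add the base count. */
    unsigned __int128 prod = (unsigned __int128)tsc * c->scale;
    return (uint64_t)(prod >> c->shift) + c->base;
}

The OS rewrites scale and base at each frequency change; everything else stays in user space.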

Terje

--
- 
"almost all programming can be viewed as an exercise in caching"
Reply to
Terje Mathisen

The number of frequencies can be higher, actually; a typical AMD CPU can only do smallish frequency steps, and that makes for quite a few frequencies (four or five on typical systems around here).

Yeah, I didn't do multi-CPU/multi-core; the cores of a multi-core Opteron will all need to run at the same frequency (though I'm not sure whether setting the voltage/frequency of one core affects the other at the same time, or whether these actions need to be done in lockstep); multi-socket adds additional challenges.

Ugh.

Casper

--
Expressed in this posting are my opinions.  They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.
Reply to
Casper H.S. Dik

Which one of the problems is that?

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.
Reply to
Joe Seigh

On a Core Duo, the OS X call "mach_absolute_time()" takes ~132 clocks. With 3 RDTSCs and the triple-width scaling, I suppose that about fits.

If the implementation is the general one that doesn't rely on a fixed frequency, that could explain why the result is scaled to nanosecond resolution. (A companion call to mach_absolute_time() provides a fraction for scaling, so if you want nanosecond resolution you end up doing a superfluous scaling.) If a fixed frequency were assumed, the raw resolution could have been used in the result, saving a scaling operation.
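The companion call is mach_timebase_info(), which hands back the numer/denom fraction for converting mach_absolute_time() ticks to nanoseconds. A minimal usage sketch; on a fixed-rate timebase the fraction can be 1/1, making the multiply/divide exactly the superfluous scaling mentioned above:

#include <mach/mach_time.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);

    uint64_t t0 = mach_absolute_time();
    uint64_t t1 = mach_absolute_time();

    /* raw ticks * numer / denom = nanoseconds */
    uint64_t ns = (t1 - t0) * tb.numer / tb.denom;
    printf("call-to-call delta: %llu ns (timebase %u/%u)\n",
           (unsigned long long)ns, tb.numer, tb.denom);
    return 0;
}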

If Intel could have spared an extra pin, they could have added a proper timebase register incrementing asynchronously on an external timebase signal. At a modest frequency like 33 MHz, there should be no problem distributing a timebase signal to multiple CPUs.

--
Mvh./Regards,    Niels Jørgen Kruse,    Vanløse, Denmark
Reply to
Niels Jørgen Kruse

We found that the 10 MHz timebase used for this purpose on some SPARC processors is actually not fast enough; one tick is perhaps several hundred CPU clock cycles, which makes using it for precise accounting difficult.

Casper

--
Expressed in this posting are my opinions.  They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.
Reply to
Casper H.S. Dik

The front-side bus clock should be sufficiently synchronous and identical on all CPUs used within one box, at least nowadays. Xeons have a "real" front-side bus, and Opterons have a common HyperTransport clock base (200 MHz). This frequency scales with processor performance, so it should not be far off. That's a resolution of 10-15 cycles on current CPUs, less than the RDTSC instruction itself takes.

BTW, on clock skew: note that for all practical purposes, the only requirement for a distributed timer is that no other signal propagates faster than the timer signal does.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
Reply to
Bernd Paysan

And another pin to synchronize (reset) all the counters?

I think the problem is that the TSC has two definitions: 1) the number of clock ticks, and 2) the absolute time that has passed. Unfortunately, the TSC is a system-level counter.

What I would really want is four different counters: two for each thread, and two for the system. When the OS starts a new thread, the counters for that thread would be loaded. Of each pair, one counter counts clock ticks (so if the processor clock changes, the counter rate changes, which is good for getting (somewhat) consistent execution times), and the other follows real-world execution time (wall-clock time). The latter, IMO, doesn't need to be completely accurate; say 100 MHz (10 ns resolution).

It also would be nice if there were compare registers (as on MIPS), so that external hardware wasn't needed for timeslicing.

- Tim

NOT speaking for Unisys.

Reply to
Tim McCaffrey

On the Dual-Opteron 270 system we have, the two cores in the same socket always have the same voltage and the same frequency, but the other two in the other socket can be at a different speed.

We have seen some instability on that system, maybe related to speed-changing (the system sometimes crashed when the load (and thus the speed) changed, and this went away when we used a kernel that does not change speeds).

Followups set to comp.arch.

- anton

--
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Reply to
Anton Ertl

The second, as in (b), was my intention. Sorry if I was unclear!

Terje

--
- 
"almost all programming can be viewed as an exercise in caching"
Reply to
Terje Mathisen

Well, I assume you're using something like NTP to keep them "in sync". You can't actually keep them in absolute sync, only within a certain accuracy, with a given precision or certainty. You cannot use separate clocks for synchronization the way you can use a single clock, unless you accept that synchronizing with multiple clocks will occasionally fail and allow erroneous results.

Is the "problem" that you can't use multiple clocks to synchronize with, or is it something else?

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.
Reply to
Joe Seigh
