Based on Roland's comments and from reading the source, I have a possible fix. I'm posting the attached patch _not_ for submission but _only_ for comment. For one thing it's based on 2.6.18.5 and for another it hasn't had much testing yet. I wanted to get it out here for comment, though, in case anyone can see where I might have gone wrong. Comments, criticism and (especially!) testing enthusiastically requested. >From my notes, this patch: Replaces the utime, stime and sched_time fields in signal_struct with shared_utime, shared_stime and shared_schedtime, respectively. It also adds it_sched_expires to the signal struct. Each place that loops through all threads in a thread group to sum task->utime and/or task->stime now loads the value from task->signal->shared_[us]time. This includes compat_sys_times(), do_task_stat(), do_getitimer(), sys_times() and k_getrusage(). Certain routines that used task->signal->[us]time now use the shared fields instead, which may change their semantics slightly. These include fill_prstatus() (in fs/binfmt_elf.c), do_task_stat() (in fs/proc/array.c), wait_task_zombie() and do_notify_parent(). The shared fields are updated at each tick, in update_cpu_clock() (shared_schedtime), account_user_time() (shared_utime) and account_system_time() (shared_stime). Each of these functions updates the task-private field followed by the shared version in the signal structure if one is present. Note that if different threads of the same process are being run by different CPUs at the tick, there may be serious cache contention here. Finally, kernel/posix-cpu-timers.c has changed quite dramatically. First, run_posix_cpu_timers() decides whether a timer has expired by consulting the it_*_expires and shared_* fields in the signal struct. The check_process_timers() routine bases its computations on the new shared fields, removing two loops through the threads. "Rebalancing" is no longer required, the process_timer_rebalance() routine as disappeared entirely and the arm_timer() routine merely fills p->signal->it_*_expires from timer->it.cpu.expires.*. The cpu_clock_sample_group_locked() loses its summing loops, consulting the shared fields instead. Finally, set_process_cpu_timer() sets tsk->signal->it_*_expires directly rather than calling the deleted rebalance routine. There are still a few open questions. In particular, it's possible that cache contention on the tick update of the shared fields could mean that the current scheme is not entirely sufficient. Further, the semantics of the status-returning routines fill_prstatus(), do_task_stat(), wait_task_zombie() and do_notify_parent() may no longer follow standards. For that matter, ITIMER_PROF handling may be broken entirely, although a brief test seems to show that it's working fine. Stats: fs/binfmt_elf.c | 18 +-- fs/proc/array.c | 6 - include/linux/sched.h | 10 +- kernel/exit.c | 13 -- kernel/fork.c | 25 +---- kernel/itimer.c | 18 --- kernel/posix-cpu-timers.c | 224 ++++++++++------------------------------------ kernel/sched.c | 16 +++ kernel/signal.c | 6 - kernel/sys.c | 17 --- 10 files changed, 105 insertions(+), 248 deletions(-) -- Frank Mayhar Google, Inc.