linux-kernel.vger.kernel.org archive mirror
* Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
       [not found] <bug-9906-10286@http.bugzilla.kernel.org/>
@ 2008-02-07  0:50 ` Andrew Morton
  2008-02-07  0:58   ` Frank Mayhar
  0 siblings, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2008-02-07  0:50 UTC (permalink / raw)
  To: fmayhar
  Cc: bugme-daemon, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Roland McGrath, Jakub Jelinek

On Wed,  6 Feb 2008 16:33:20 -0800 (PST)
bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=9906
> 
>            Summary: Weird hang with NPTL and SIGPROF.
>            Product: Process Management
>            Version: 2.5
>      KernelVersion: 2.6.24-rc4
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Scheduler
>         AssignedTo: mingo@elte.hu
>         ReportedBy: fmayhar@google.com
> 
> 
> Latest working kernel version: None
> Earliest failing kernel version: 2.6.18
> Distribution: Ubuntu
> Hardware Environment: Any
> Problem Description:
> I have a testcase that demonstrates a strange hang of the latest kernel
> (as well as previous ones).  In the process of investigating the NPTL,
> we wrote a test that just creates a bunch of threads, then does a
> barrier wait to synchronize them all, after which everybody exits.
> That's all it does.
> 
> This works fine under most circumstances.  Unfortunately, we also want
> to do profiling, so we catch SIGPROF and turn on ITIMER_PROF.  In this
> case, at somewhere between 4000 and 4500 threads, and using the NPTL,
> the system hangs.  It's not a hard hang: interrupts are still working
> and clocks are ticking, but nothing is making progress.  It becomes
> noticeable when the softlockup_tick() warning goes off after the
> watchdog has been starved long enough.
> 
> Sometimes the system recovers and gets going again.  Other times it
> doesn't.  I've examined the state of things several times with kdb and
> there's certainly nothing obvious going on.  Something, perhaps having
> to do with the scheduler, is certainly getting into a bad state, but I
> haven't yet been able to figure out what that is.  I've even run it with
> KFT and have seen nothing obvious there, either, except for the fact
> that when it hangs it becomes obvious that it stops making progress and
> it begins to fill up with smp_apic_timer_interrupt() and do_softirq()
> entries.  I've also seen smp_apic_timer_interrupt() appear twice or more
> on the stack, as if the previous run(s) didn't finish before the next
> tick happened.
> 
> Steps to reproduce:
> 
> I'll attach a testcase shortly.
> 
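
A minimal sketch of the setup the report describes (a do-nothing SIGPROF
handler plus ITIMER_PROF firing at 100 Hz), reduced from the testcase
attached later in this thread; the handler shape and the busy loop here
are illustrative only:

	#include <signal.h>
	#include <string.h>
	#include <sys/time.h>

	static void prof_handler(int sig)
	{
		(void)sig;	/* profiling tick: do nothing */
	}

	int main(void)
	{
		struct sigaction sa;
		struct itimerval timer;
		volatile unsigned long spin = 0;

		memset(&sa, 0, sizeof(sa));
		sa.sa_handler = prof_handler;
		sa.sa_flags = SA_RESTART;
		sigemptyset(&sa.sa_mask);
		sigaction(SIGPROF, &sa, NULL);

		/* 10 ms interval: ~100 SIGPROFs per second of CPU time. */
		timer.it_interval.tv_sec = 0;
		timer.it_interval.tv_usec = 1000000 / 100;
		timer.it_value = timer.it_interval;
		setitimer(ITIMER_PROF, &timer, NULL);

		/* Burn CPU so the CPU-time-based timer actually fires. */
		for (;;)
			spin++;
	}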

It's probably better to handle this one via email, so please send that
testcase via reply-to-all to this email, thanks.



* Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
  2008-02-07  0:50 ` [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF Andrew Morton
@ 2008-02-07  0:58   ` Frank Mayhar
  2008-02-07  2:57     ` Parag Warudkar
  0 siblings, 1 reply; 51+ messages in thread
From: Frank Mayhar @ 2008-02-07  0:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: bugme-daemon, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Roland McGrath, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 395 bytes --]

On Wed, 2008-02-06 at 16:50 -0800, Andrew Morton wrote:
> It's probably better to handle this one via email, so please send that
> testcase via reply-to-all to this email, thanks.

Testcase attached.

Build with
        gcc -D_GNU_SOURCE -c hangc-2.c -o hangc-2.o
        gcc -lpthread -o hangc-2 hangc-2.o

Run with
        hangc-2 4500 4500
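
(With modern GNU toolchains the library should follow the objects on the
link line; an equivalent one-step build, assuming GNU gcc, is
        gcc -D_GNU_SOURCE -pthread -o hangc-2 hangc-2.c
where -pthread sets the right preprocessor and linker flags in one go.)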

-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.

[-- Attachment #2: hangc-2.c --]
[-- Type: text/x-csrc, Size: 2664 bytes --]
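
/*
 * hangc-2.c: spawn THREADS worker threads, each with a small stack,
 * that all block on one pthread barrier while a 100 Hz ITIMER_PROF
 * timer delivers SIGPROF to the process.  The second argument
 * (BARRIER-OPS-PER-THREAD) is parsed into g_args.n but never used.
 */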


#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/time.h>
#include <unistd.h>
#include <ucontext.h>
#include <string.h>
#include <errno.h>

/*
 * Thread body for the NPTL check: record the thread's idea of its PID.
 * Under LinuxThreads each thread is a separate process with its own
 * PID; under NPTL every thread reports the same PID.
 */
static void *
LinuxThreadTestRoutine(
	void *pid)
{
	pid_t *pid_ptr = (pid_t *) (pid);

	*pid_ptr = getpid();
	return NULL;
}

int
id_runningNPTL(void)
{
	int cc;
	pthread_t thread;
	pid_t child_pid;

	/* pthread_* calls return an error number rather than setting
	   errno, so report failures with strerror(cc), not perror(). */
	cc = pthread_create(&thread, NULL, &LinuxThreadTestRoutine, &child_pid);
	if (cc != 0) {
		fprintf(stderr, "pthread_create: %s\n", strerror(cc));
		exit(1);
	}
	cc = pthread_join(thread, NULL);
	if (cc != 0) {
		fprintf(stderr, "pthread_join: %s\n", strerror(cc));
		exit(1);
	}
	int is_linux_threads = (child_pid != getpid());

	return !is_linux_threads;
}



char const *
id_threads_package_string(void)
{
	return id_runningNPTL()? "NPTL" : "LinuxThreads";
}


typedef struct shared_args_t {
	unsigned n;
	pthread_barrier_t barrier;
} shared_args_t;

shared_args_t g_args;

/*
 * SIGPROF handler: deliberately a no-op.  It saves and restores errno,
 * as an async-signal handler should; the static dummy store just gives
 * the handler a body.
 */
void prof_handler(int sig, siginfo_t *foo, void *signal_ucontext)
{
	static int stk = 0;
	int saved_errno = errno;

	stk = 0;
	errno = saved_errno;
}

/* Worker thread: block on the shared barrier, then return. */
void *
nop1_inner(
	void *varg)
{
	int cc;
	cc = pthread_barrier_wait(&g_args.barrier);
	if ((cc != 0) && (cc != PTHREAD_BARRIER_SERIAL_THREAD)) {
		fprintf(stderr, "pthread_barrier_wait: %s\n", strerror(cc));
		exit(1);
	}
	return NULL;
}

#define MAXTHREADS  (1024 * 1024)
pthread_t threads[MAXTHREADS];

int
main(
	int argc,
	char **argv)
{
	pthread_attr_t attr;
	unsigned nthreads;
	unsigned nops;
	unsigned i;
	int cc;
	struct sigaction sa;
	struct itimerval timer;

	/* Install the SIGPROF handler before arming the profiling timer. */
	sa.sa_sigaction = prof_handler;
	sa.sa_flags = SA_RESTART | SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGPROF, &sa, NULL);

	/* 10 ms profiling interval: roughly 100 SIGPROFs per second of
	   consumed CPU time. */
	timer.it_interval.tv_sec = 0;
	timer.it_interval.tv_usec = 1000000 / 100;
	timer.it_value = timer.it_interval;
	setitimer(ITIMER_PROF, &timer, 0);
	cc = pthread_attr_init(&attr);
	if (cc != 0) {
		fprintf(stderr, "pthread_attr_init: %s\n", strerror(cc));
		exit(1);
	}
	/* Small 16 KB stacks so thousands of threads fit in memory. */
	cc = pthread_attr_setstacksize(&attr, 16 * 1024);
	if (cc != 0) {
		fprintf(stderr, "pthread_attr_setstacksize: %s\n", strerror(cc));
		exit(1);
	}
	if (argc != 3) {
		fputs("Usage: hangc THREADS BARRIER-OPS-PER-THREAD\n", stderr);
		exit(1);
	}

	nthreads = strtoul(argv[1], NULL, 0);
	if (nthreads > MAXTHREADS) {
		/* Not an errno-style failure, so don't use perror() here. */
		fputs("internal error: static allocation too small for THREADS arg\n",
		      stderr);
		exit(1);
	}
	nops = strtoul(argv[2], NULL, 0);

	/* The barrier is sized for the workers plus main, although main
	   below never actually waits on it. */
	cc = pthread_barrier_init(&g_args.barrier, NULL, nthreads + 1);
	if (cc != 0) {
		fprintf(stderr, "pthread_barrier_init: %s\n", strerror(cc));
		exit(1);
	}

	g_args.n = nops;

	for (i = 0; i < nthreads; ++i) {
		cc = pthread_create(&threads[i], &attr, nop1_inner, NULL);
		if (cc != 0) {
			fprintf(stderr, "pthread_create: %s\n", strerror(cc));
			exit(1);
		}
	}

	printf("threads: %s\n", id_threads_package_string());
	/* Exit without waiting at the barrier: this tears down the whole
	   process while the workers are still blocked in
	   pthread_barrier_wait(). */
	exit(0);
}


* Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
  2008-02-07  0:58   ` Frank Mayhar
@ 2008-02-07  2:57     ` Parag Warudkar
  2008-02-07 15:22       ` Alejandro Riveira Fernández
  0 siblings, 1 reply; 51+ messages in thread
From: Parag Warudkar @ 2008-02-07  2:57 UTC (permalink / raw)
  To: Frank Mayhar
  Cc: Andrew Morton, bugme-daemon, linux-kernel, Ingo Molnar,
	Thomas Gleixner, Roland McGrath, Jakub Jelinek



On Wed, 6 Feb 2008, Frank Mayhar wrote:

> On Wed, 2008-02-06 at 16:50 -0800, Andrew Morton wrote:
> > It's probably better to handle this one via email, so please send that
> > testcase via reply-to-all to this email, thanks.
> 
> Testcase attached.
> 
> Build with
>         gcc -D_GNU_SOURCE -c hangc-2.c -o hangc-2.o
>         gcc -lpthread -o hangc-2 hangc-2.o
> 
> Run with
>         hangc-2 4500 4500

FWIW this is not reproducible on 2.6.24/x86/CentOS-5.1.  (I tried running
it nearly 1500 times in a loop.)  Assuming that many tries are sufficient
to reproduce this bug, there seems to be something specific to the
environment/architecture/configuration that's necessary to trigger it.

It might be helpful to provide full details like glibc version, compiler 
version, .config etc.

Parag


* Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
  2008-02-07  2:57     ` Parag Warudkar
@ 2008-02-07 15:22       ` Alejandro Riveira Fernández
  2008-02-07 15:53         ` Parag Warudkar
  0 siblings, 1 reply; 51+ messages in thread
From: Alejandro Riveira Fernández @ 2008-02-07 15:22 UTC (permalink / raw)
  To: parag.warudkar
  Cc: Frank Mayhar, Andrew Morton, bugme-daemon, linux-kernel,
	Ingo Molnar, Thomas Gleixner, Roland McGrath, Jakub Jelinek

On Wed, 6 Feb 2008 21:57:38 -0500 (EST)
Parag Warudkar <parag.warudkar@gmail.com> wrote:

> 
> 
> On Wed, 6 Feb 2008, Frank Mayhar wrote:
> 
> > On Wed, 2008-02-06 at 16:50 -0800, Andrew Morton wrote:
> > > It's probably better to handle this one via email, so please send that
> > > testcase via reply-to-all to this email, thanks.
> > 
> > Testcase attached.
> > 
> > Build with
> >         gcc -D_GNU_SOURCE -c hangc-2.c -o hangc-2.o
> >         gcc -lpthread -o hangc-2 hangc-2.o
> > 
> > Run with
> >         hangc-2 4500 4500
> 
> FWIW this is not reproducible on 2.6.24/x86/CentOS-5.1.  (I tried running 
> it nearly 1500 times in a loop.)  Assuming that many tries are sufficient 
> to reproduce this bug, there seems to be something specific to the 
> environment/architecture/configuration that's necessary to trigger it. 
> 
> It might be helpful to provide full details like glibc version, compiler 
> version, .config etc.

I can reproduce it on my Ubuntu 7.10 box with kernel 2.6.24.


Note that I build with CC=gcc-4.2.

gcc --version

gcc-4.2 (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4)
 
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.
 
Linux Varda 2.6.24 #2 SMP PREEMPT Fri Jan 25 01:05:47 CET 2008 x86_64 GNU/Linux
 
Gnu C                  4.1.3
Gnu make               3.81
binutils               2.18
util-linux             2.13
mount                  2.13
module-init-tools      3.3-pre2
e2fsprogs              1.40.2
jfsutils               1.1.11
reiserfsprogs          3.6.19
pcmciautils            014
PPP                    2.4.4
Linux C Library        2.6.1
Dynamic linker (ldd)   2.6.1
Procps                 3.2.7
Net-tools              1.60
Kbd                    [opcion...][archivo
Console-tools          0.2.3
Sh-utils               5.97
udev                   113
wireless-tools         29
Modules Loaded         af_packet binfmt_misc rfcomm l2cap bluetooth ipv6 powernow_k8 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand freq_table cpufreq_conservative nf_conntrack_ftp nf_conntrack_irc xt_tcpudp ipt_ULOG xt_limit xt_state iptable_filter nf_conntrack_ipv4 nf_conntrack ip_tables x_tables kvm_amd kvm w83627ehf hwmon_vid lp snd_hda_intel arc4 ecb blkcipher cryptomgr crypto_algapi snd_pcm_oss snd_mixer_oss snd_pcm snd_mpu401 snd_mpu401_uart snd_seq_dummy rt2500pci rt2x00pci rt2x00lib snd_seq_oss rfkill snd_seq_midi input_polldev snd_rawmidi crc_itu_t snd_seq_midi_event snd_seq mac80211 usbhid snd_timer snd_seq_device cfg80211 usblp ff_memless snd eeprom_93cx6 nvidia i2c_ali1535 i2c_ali15x3 evdev snd_page_alloc sr_mod cdrom button soundcore uli526x 8250_pnp 8250 serial_core i2c_core k8temp hwmon parport_pc parport pata_acpi pcspkr rtc floppy sg ata_generic ehci_hcd r8169 ohci_hcd usbcore unix thermal processor fan fuse

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.24-rc8
# Thu Jan 24 19:29:41 2008
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
# CONFIG_QUICKLIST is not set
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_SUPPORTS_OPROFILE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
# CONFIG_KTIME_SCALAR is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=16
# CONFIG_CGROUPS is not set
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_FAIR_USER_SCHED=y
# CONFIG_FAIR_CGROUP_SCHED is not set
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_BLK_DEV_BSG=y
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=m
CONFIG_IOSCHED_CFQ=m
CONFIG_DEFAULT_AS=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="anticipatory"
CONFIG_PREEMPT_NOTIFIERS=y

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_X86_VSMP is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
CONFIG_MK8=y
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_HPET_TIMER=y
CONFIG_GART_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_NR_CPUS=2
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_INTEL is not set
CONFIG_X86_MCE_AMD=y
# CONFIG_MICROCODE is not set
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
# CONFIG_NUMA is not set
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_RESOURCES_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MTRR=y
# CONFIG_SECCOMP is not set
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
# CONFIG_KEXEC is not set
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x200000
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_ALIGN=0x200000
# CONFIG_HOTPLUG_CPU is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y

#
# Power management options
#
CONFIG_PM=y
# CONFIG_PM_LEGACY is not set
# CONFIG_PM_DEBUG is not set
CONFIG_SUSPEND_SMP_POSSIBLE=y
# CONFIG_SUSPEND is not set
CONFIG_HIBERNATION_SMP_POSSIBLE=y
# CONFIG_HIBERNATION is not set
CONFIG_ACPI=y
# CONFIG_ACPI_PROCFS is not set
# CONFIG_ACPI_PROCFS_POWER is not set
CONFIG_ACPI_SYSFS_POWER=y
# CONFIG_ACPI_PROC_EVENT is not set
# CONFIG_ACPI_AC is not set
# CONFIG_ACPI_BATTERY is not set
CONFIG_ACPI_BUTTON=m
# CONFIG_ACPI_VIDEO is not set
CONFIG_ACPI_FAN=m
# CONFIG_ACPI_DOCK is not set
CONFIG_ACPI_PROCESSOR=m
CONFIG_ACPI_THERMAL=m
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_TOSHIBA is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
# CONFIG_ACPI_CONTAINER is not set
# CONFIG_ACPI_SBS is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=m
# CONFIG_CPU_FREQ_DEBUG is not set
CONFIG_CPU_FREQ_STAT=m
CONFIG_CPU_FREQ_STAT_DETAILS=y
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=m
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m

#
# CPUFreq processor drivers
#
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_X86_POWERNOW_K8=m
CONFIG_X86_POWERNOW_K8_ACPI=y
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
CONFIG_X86_ACPI_CPUFREQ_PROC_INTF=y
# CONFIG_X86_SPEEDSTEP_LIB is not set
# CONFIG_CPU_IDLE is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
# CONFIG_DMAR is not set
CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
# CONFIG_PCI_LEGACY is not set
# CONFIG_PCI_DEBUG is not set
CONFIG_HT_IRQ=y
CONFIG_ISA_DMA_API=y
CONFIG_K8_NB=y
# CONFIG_PCCARD is not set
# CONFIG_HOTPLUG_PCI is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=m
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y

#
# Networking
#
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=m
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=m
CONFIG_XFRM=y
CONFIG_XFRM_USER=m
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
CONFIG_NET_KEY=m
# CONFIG_NET_KEY_MIGRATE is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
# CONFIG_IP_ROUTE_VERBOSE is not set
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_IP_MROUTE is not set
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=m
CONFIG_INET_XFRM_MODE_TRANSPORT=m
CONFIG_INET_XFRM_MODE_TUNNEL=m
CONFIG_INET_XFRM_MODE_BEET=m
CONFIG_INET_LRO=y
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
# CONFIG_IP_VS is not set
CONFIG_IPV6=m
CONFIG_IPV6_PRIVACY=y
CONFIG_IPV6_ROUTER_PREF=y
# CONFIG_IPV6_ROUTE_INFO is not set
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
# CONFIG_IPV6_MIP6 is not set
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
CONFIG_IPV6_SIT=m
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_NETLABEL is not set
# CONFIG_NETWORK_SECMARK is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
# CONFIG_BRIDGE_NETFILTER is not set

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
CONFIG_NF_CONNTRACK_ENABLED=m
CONFIG_NF_CONNTRACK=m
CONFIG_NF_CT_ACCT=y
CONFIG_NF_CONNTRACK_MARK=y
# CONFIG_NF_CONNTRACK_EVENTS is not set
CONFIG_NF_CT_PROTO_SCTP=m
# CONFIG_NF_CT_PROTO_UDPLITE is not set
# CONFIG_NF_CONNTRACK_AMANDA is not set
CONFIG_NF_CONNTRACK_FTP=m
# CONFIG_NF_CONNTRACK_H323 is not set
CONFIG_NF_CONNTRACK_IRC=m
# CONFIG_NF_CONNTRACK_NETBIOS_NS is not set
# CONFIG_NF_CONNTRACK_PPTP is not set
# CONFIG_NF_CONNTRACK_SANE is not set
CONFIG_NF_CONNTRACK_SIP=m
CONFIG_NF_CONNTRACK_TFTP=m
# CONFIG_NF_CT_NETLINK is not set
CONFIG_NETFILTER_XTABLES=m
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
CONFIG_NETFILTER_XT_TARGET_CONNMARK=m
CONFIG_NETFILTER_XT_TARGET_DSCP=m
CONFIG_NETFILTER_XT_TARGET_MARK=m
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=m
CONFIG_NETFILTER_XT_TARGET_NFLOG=m
CONFIG_NETFILTER_XT_TARGET_NOTRACK=m
CONFIG_NETFILTER_XT_TARGET_TRACE=m
CONFIG_NETFILTER_XT_TARGET_TCPMSS=m
CONFIG_NETFILTER_XT_MATCH_COMMENT=m
CONFIG_NETFILTER_XT_MATCH_CONNBYTES=m
CONFIG_NETFILTER_XT_MATCH_CONNLIMIT=m
CONFIG_NETFILTER_XT_MATCH_CONNMARK=m
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=m
CONFIG_NETFILTER_XT_MATCH_DCCP=m
CONFIG_NETFILTER_XT_MATCH_DSCP=m
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_NETFILTER_XT_MATCH_HELPER=m
CONFIG_NETFILTER_XT_MATCH_LENGTH=m
CONFIG_NETFILTER_XT_MATCH_LIMIT=m
CONFIG_NETFILTER_XT_MATCH_MAC=m
CONFIG_NETFILTER_XT_MATCH_MARK=m
CONFIG_NETFILTER_XT_MATCH_POLICY=m
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=m
CONFIG_NETFILTER_XT_MATCH_QUOTA=m
CONFIG_NETFILTER_XT_MATCH_REALM=m
CONFIG_NETFILTER_XT_MATCH_SCTP=m
CONFIG_NETFILTER_XT_MATCH_STATE=m
CONFIG_NETFILTER_XT_MATCH_STATISTIC=m
CONFIG_NETFILTER_XT_MATCH_STRING=m
CONFIG_NETFILTER_XT_MATCH_TCPMSS=m
CONFIG_NETFILTER_XT_MATCH_TIME=m
CONFIG_NETFILTER_XT_MATCH_U32=m
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=m

#
# IP: Netfilter Configuration
#
CONFIG_NF_CONNTRACK_IPV4=m
CONFIG_NF_CONNTRACK_PROC_COMPAT=y
# CONFIG_IP_NF_QUEUE is not set
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_MATCH_IPRANGE=m
CONFIG_IP_NF_MATCH_TOS=m
CONFIG_IP_NF_MATCH_RECENT=m
CONFIG_IP_NF_MATCH_ECN=m
CONFIG_IP_NF_MATCH_AH=m
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_MATCH_OWNER=m
CONFIG_IP_NF_MATCH_ADDRTYPE=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
CONFIG_IP_NF_TARGET_LOG=m
CONFIG_IP_NF_TARGET_ULOG=m
CONFIG_NF_NAT=m
CONFIG_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=m
CONFIG_IP_NF_TARGET_REDIRECT=m
CONFIG_IP_NF_TARGET_NETMAP=m
CONFIG_IP_NF_TARGET_SAME=m
CONFIG_NF_NAT_SNMP_BASIC=m
CONFIG_NF_NAT_FTP=m
CONFIG_NF_NAT_IRC=m
CONFIG_NF_NAT_TFTP=m
# CONFIG_NF_NAT_AMANDA is not set
# CONFIG_NF_NAT_PPTP is not set
# CONFIG_NF_NAT_H323 is not set
CONFIG_NF_NAT_SIP=m
CONFIG_IP_NF_MANGLE=m
CONFIG_IP_NF_TARGET_TOS=m
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_TTL=m
CONFIG_IP_NF_TARGET_CLUSTERIP=m
CONFIG_IP_NF_RAW=m
CONFIG_IP_NF_ARPTABLES=m
CONFIG_IP_NF_ARPFILTER=m
CONFIG_IP_NF_ARP_MANGLE=m

#
# IPv6: Netfilter Configuration (EXPERIMENTAL)
#
CONFIG_NF_CONNTRACK_IPV6=m
CONFIG_IP6_NF_QUEUE=m
CONFIG_IP6_NF_IPTABLES=m
CONFIG_IP6_NF_MATCH_RT=m
CONFIG_IP6_NF_MATCH_OPTS=m
CONFIG_IP6_NF_MATCH_FRAG=m
CONFIG_IP6_NF_MATCH_HL=m
CONFIG_IP6_NF_MATCH_OWNER=m
CONFIG_IP6_NF_MATCH_IPV6HEADER=m
CONFIG_IP6_NF_MATCH_AH=m
CONFIG_IP6_NF_MATCH_MH=m
CONFIG_IP6_NF_MATCH_EUI64=m
CONFIG_IP6_NF_FILTER=m
CONFIG_IP6_NF_TARGET_LOG=m
CONFIG_IP6_NF_TARGET_REJECT=m
CONFIG_IP6_NF_MANGLE=m
CONFIG_IP6_NF_TARGET_HL=m
CONFIG_IP6_NF_RAW=m

#
# Bridge: Netfilter Configuration
#
# CONFIG_BRIDGE_NF_EBTABLES is not set
# CONFIG_IP_DCCP is not set
CONFIG_IP_SCTP=m
# CONFIG_SCTP_DBG_MSG is not set
# CONFIG_SCTP_DBG_OBJCNT is not set
# CONFIG_SCTP_HMAC_NONE is not set
# CONFIG_SCTP_HMAC_SHA1 is not set
CONFIG_SCTP_HMAC_MD5=y
# CONFIG_TIPC is not set
CONFIG_ATM=m
CONFIG_ATM_CLIP=m
CONFIG_ATM_CLIP_NO_ICMP=y
CONFIG_ATM_LANE=m
CONFIG_ATM_MPOA=m
CONFIG_ATM_BR2684=m
CONFIG_ATM_BR2684_IPFILTER=y
CONFIG_BRIDGE=m
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
CONFIG_LLC=m
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
CONFIG_NET_SCH_CBQ=m
CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_HFSC=m
CONFIG_NET_SCH_ATM=m
CONFIG_NET_SCH_PRIO=m
CONFIG_NET_SCH_RR=m
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_SFQ=m
CONFIG_NET_SCH_TEQL=m
CONFIG_NET_SCH_TBF=m
CONFIG_NET_SCH_GRED=m
CONFIG_NET_SCH_DSMARK=m
CONFIG_NET_SCH_NETEM=m
CONFIG_NET_SCH_INGRESS=m

#
# Classification
#
CONFIG_NET_CLS=y
CONFIG_NET_CLS_BASIC=m
CONFIG_NET_CLS_TCINDEX=m
CONFIG_NET_CLS_ROUTE4=m
CONFIG_NET_CLS_ROUTE=y
CONFIG_NET_CLS_FW=m
CONFIG_NET_CLS_U32=m
CONFIG_CLS_U32_PERF=y
CONFIG_CLS_U32_MARK=y
CONFIG_NET_CLS_RSVP=m
CONFIG_NET_CLS_RSVP6=m
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
CONFIG_NET_EMATCH_CMP=m
CONFIG_NET_EMATCH_NBYTE=m
CONFIG_NET_EMATCH_U32=m
CONFIG_NET_EMATCH_META=m
CONFIG_NET_EMATCH_TEXT=m
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_POLICE=m
CONFIG_NET_ACT_GACT=m
CONFIG_GACT_PROB=y
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_ACT_IPT=m
CONFIG_NET_ACT_NAT=m
CONFIG_NET_ACT_PEDIT=m
CONFIG_NET_ACT_SIMP=m
# CONFIG_NET_CLS_POLICE is not set
CONFIG_NET_CLS_IND=y
CONFIG_NET_SCH_FIFO=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_HAMRADIO is not set
# CONFIG_IRDA is not set
CONFIG_BT=m
CONFIG_BT_L2CAP=m
CONFIG_BT_SCO=m
CONFIG_BT_RFCOMM=m
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=m
CONFIG_BT_BNEP_MC_FILTER=y
CONFIG_BT_BNEP_PROTO_FILTER=y
CONFIG_BT_HIDP=m

#
# Bluetooth device drivers
#
CONFIG_BT_HCIUSB=m
CONFIG_BT_HCIUSB_SCO=y
CONFIG_BT_HCIUART=m
CONFIG_BT_HCIUART_H4=y
CONFIG_BT_HCIUART_BCSP=y
# CONFIG_BT_HCIUART_LL is not set
CONFIG_BT_HCIBCM203X=m
CONFIG_BT_HCIBPA10X=m
CONFIG_BT_HCIBFUSB=m
CONFIG_BT_HCIVHCI=m
# CONFIG_AF_RXRPC is not set
CONFIG_FIB_RULES=y

#
# Wireless
#
CONFIG_CFG80211=m
CONFIG_NL80211=y
CONFIG_WIRELESS_EXT=y
CONFIG_MAC80211=m
CONFIG_MAC80211_RCSIMPLE=y
CONFIG_MAC80211_LEDS=y
# CONFIG_MAC80211_DEBUGFS is not set
# CONFIG_MAC80211_DEBUG is not set
CONFIG_IEEE80211=y
# CONFIG_IEEE80211_DEBUG is not set
CONFIG_IEEE80211_CRYPT_WEP=m
CONFIG_IEEE80211_CRYPT_CCMP=m
CONFIG_IEEE80211_CRYPT_TKIP=m
CONFIG_IEEE80211_SOFTMAC=m
# CONFIG_IEEE80211_SOFTMAC_DEBUG is not set
CONFIG_RFKILL=m
CONFIG_RFKILL_INPUT=m
CONFIG_RFKILL_LEDS=y
CONFIG_NET_9P=m
CONFIG_NET_9P_FD=m
# CONFIG_NET_9P_DEBUG is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_SYS_HYPERVISOR is not set
CONFIG_CONNECTOR=m
# CONFIG_MTD is not set
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
CONFIG_PARPORT_SERIAL=m
CONFIG_PARPORT_PC_FIFO=y
CONFIG_PARPORT_PC_SUPERIO=y
# CONFIG_PARPORT_GSC is not set
# CONFIG_PARPORT_AX88796 is not set
CONFIG_PARPORT_1284=y
CONFIG_PNP=y
# CONFIG_PNP_DEBUG is not set

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_FD=m
# CONFIG_PARIDE is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_CRYPTOLOOP=m
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=8192
CONFIG_BLK_DEV_RAM_BLOCKSIZE=1024
CONFIG_CDROM_PKTCDVD=m
CONFIG_CDROM_PKTCDVD_BUFFERS=8
CONFIG_CDROM_PKTCDVD_WCACHE=y
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_MISC_DEVICES is not set
CONFIG_EEPROM_93CX6=m
# CONFIG_IDE is not set

#
# SCSI device support
#
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
# CONFIG_SCSI_TGT is not set
# CONFIG_SCSI_NETLINK is not set
# CONFIG_SCSI_PROC_FS is not set

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=m
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=m
# CONFIG_CHR_DEV_SCH is not set

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
# CONFIG_SCSI_MULTI_LUN is not set
# CONFIG_SCSI_CONSTANTS is not set
# CONFIG_SCSI_LOGGING is not set
# CONFIG_SCSI_SCAN_ASYNC is not set
CONFIG_SCSI_WAIT_SCAN=m

#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set
# CONFIG_SCSI_SRP_ATTRS is not set
# CONFIG_SCSI_LOWLEVEL is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_ACPI=y
CONFIG_SATA_AHCI=y
# CONFIG_SATA_SVW is not set
# CONFIG_ATA_PIIX is not set
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SX4 is not set
# CONFIG_SATA_SIL is not set
CONFIG_SATA_SIL24=m
# CONFIG_SATA_SIS is not set
CONFIG_SATA_ULI=m
CONFIG_SATA_VIA=m
# CONFIG_SATA_VITESSE is not set
# CONFIG_SATA_INIC162X is not set
CONFIG_PATA_ACPI=m
CONFIG_PATA_ALI=y
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
CONFIG_ATA_GENERIC=m
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_IT8213 is not set
CONFIG_PATA_JMICRON=m
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_OLDPIIX is not set
CONFIG_PATA_NETCELL=m
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RZ1000 is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
CONFIG_PATA_VIA=m
# CONFIG_PATA_WINBOND is not set
# CONFIG_MD is not set
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#
CONFIG_FIREWIRE=m
CONFIG_FIREWIRE_OHCI=m
CONFIG_FIREWIRE_SBP2=m
# CONFIG_IEEE1394 is not set
# CONFIG_I2O is not set
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_NETDEVICES_MULTIQUEUE=y
# CONFIG_IFB is not set
CONFIG_DUMMY=m
CONFIG_BONDING=m
# CONFIG_MACVLAN is not set
# CONFIG_EQUALIZER is not set
CONFIG_TUN=m
CONFIG_VETH=m
# CONFIG_NET_SB1000 is not set
# CONFIG_ARCNET is not set
CONFIG_PHYLIB=m

#
# MII PHY device drivers
#
CONFIG_MARVELL_PHY=m
CONFIG_DAVICOM_PHY=m
CONFIG_QSEMI_PHY=m
CONFIG_LXT_PHY=m
CONFIG_CICADA_PHY=m
CONFIG_VITESSE_PHY=m
CONFIG_SMSC_PHY=m
CONFIG_BROADCOM_PHY=m
CONFIG_ICPLUS_PHY=m
CONFIG_FIXED_PHY=m
CONFIG_FIXED_MII_10_FDX=y
CONFIG_FIXED_MII_100_FDX=y
CONFIG_FIXED_MII_1000_FDX=y
CONFIG_FIXED_MII_AMNT=1
# CONFIG_MDIO_BITBANG is not set
CONFIG_NET_ETHERNET=y
CONFIG_MII=m
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NET_VENDOR_3COM is not set
CONFIG_NET_TULIP=y
# CONFIG_DE2104X is not set
# CONFIG_TULIP is not set
# CONFIG_DE4X5 is not set
# CONFIG_WINBOND_840 is not set
# CONFIG_DM9102 is not set
CONFIG_ULI526X=m
# CONFIG_HP100 is not set
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
# CONFIG_NET_PCI is not set
# CONFIG_B44 is not set
# CONFIG_NET_POCKET is not set
CONFIG_NETDEV_1000=y
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
# CONFIG_E1000 is not set
# CONFIG_E1000E is not set
# CONFIG_IP1000 is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_R8169=m
CONFIG_R8169_NAPI=y
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
# CONFIG_SK98LIN is not set
# CONFIG_VIA_VELOCITY is not set
# CONFIG_TIGON3 is not set
# CONFIG_BNX2 is not set
# CONFIG_QLA3XXX is not set
# CONFIG_ATL1 is not set
# CONFIG_NETDEV_10000 is not set
# CONFIG_TR is not set

#
# Wireless LAN
#
# CONFIG_WLAN_PRE80211 is not set
CONFIG_WLAN_80211=y
# CONFIG_IPW2100 is not set
# CONFIG_IPW2200 is not set
# CONFIG_LIBERTAS is not set
# CONFIG_AIRO is not set
# CONFIG_HERMES is not set
# CONFIG_ATMEL is not set
# CONFIG_PRISM54 is not set
CONFIG_USB_ZD1201=m
CONFIG_RTL8187=m
# CONFIG_ADM8211 is not set
# CONFIG_P54_COMMON is not set
# CONFIG_IWLWIFI is not set
# CONFIG_HOSTAP is not set
# CONFIG_BCM43XX is not set
# CONFIG_B43 is not set
# CONFIG_B43LEGACY is not set
CONFIG_ZD1211RW=m
# CONFIG_ZD1211RW_DEBUG is not set
CONFIG_RT2X00=m
CONFIG_RT2X00_LIB=m
CONFIG_RT2X00_LIB_PCI=m
CONFIG_RT2X00_LIB_USB=m
CONFIG_RT2X00_LIB_FIRMWARE=y
CONFIG_RT2X00_LIB_RFKILL=y
# CONFIG_RT2400PCI is not set
CONFIG_RT2500PCI=m
CONFIG_RT2500PCI_RFKILL=y
# CONFIG_RT61PCI is not set
CONFIG_RT2500USB=m
CONFIG_RT73USB=m
# CONFIG_RT2X00_DEBUG is not set

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET is not set
# CONFIG_WAN is not set
# CONFIG_ATM_DRIVERS is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_PLIP is not set
CONFIG_PPP=m
CONFIG_PPP_MULTILINK=y
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=m
CONFIG_PPP_SYNC_TTY=m
CONFIG_PPP_DEFLATE=m
CONFIG_PPP_BSDCOMP=m
CONFIG_PPP_MPPE=m
CONFIG_PPPOE=m
CONFIG_PPPOATM=m
# CONFIG_PPPOL2TP is not set
# CONFIG_SLIP is not set
CONFIG_SLHC=m
# CONFIG_NET_FC is not set
# CONFIG_SHAPER is not set
CONFIG_NETCONSOLE=m
CONFIG_NETCONSOLE_DYNAMIC=y
CONFIG_NETPOLL=y
# CONFIG_NETPOLL_TRAP is not set
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_ISDN is not set
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=m
CONFIG_INPUT_POLLDEV=m

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_JOYDEV=m
CONFIG_INPUT_EVDEV=m
CONFIG_INPUT_EVBUG=m

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
CONFIG_KEYBOARD_XTKBD=m
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_VSXXXAA is not set
CONFIG_INPUT_JOYSTICK=y
CONFIG_JOYSTICK_ANALOG=m
CONFIG_JOYSTICK_A3D=m
CONFIG_JOYSTICK_ADI=m
CONFIG_JOYSTICK_COBRA=m
CONFIG_JOYSTICK_GF2K=m
CONFIG_JOYSTICK_GRIP=m
CONFIG_JOYSTICK_GRIP_MP=m
CONFIG_JOYSTICK_GUILLEMOT=m
CONFIG_JOYSTICK_INTERACT=m
CONFIG_JOYSTICK_SIDEWINDER=m
CONFIG_JOYSTICK_TMDC=m
CONFIG_JOYSTICK_IFORCE=m
CONFIG_JOYSTICK_IFORCE_USB=y
CONFIG_JOYSTICK_IFORCE_232=y
CONFIG_JOYSTICK_WARRIOR=m
CONFIG_JOYSTICK_MAGELLAN=m
CONFIG_JOYSTICK_SPACEORB=m
CONFIG_JOYSTICK_SPACEBALL=m
CONFIG_JOYSTICK_STINGER=m
CONFIG_JOYSTICK_TWIDJOY=m
CONFIG_JOYSTICK_DB9=m
CONFIG_JOYSTICK_GAMECON=m
CONFIG_JOYSTICK_TURBOGRAFX=m
CONFIG_JOYSTICK_JOYDUMP=m
CONFIG_JOYSTICK_XPAD=m
CONFIG_JOYSTICK_XPAD_FF=y
CONFIG_JOYSTICK_XPAD_LEDS=y
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
CONFIG_INPUT_PCSPKR=m
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
CONFIG_INPUT_UINPUT=m

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=m
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PARKBD is not set
CONFIG_SERIO_PCIPS2=m
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=m
CONFIG_GAMEPORT=m
CONFIG_GAMEPORT_NS558=m
# CONFIG_GAMEPORT_L4 is not set
CONFIG_GAMEPORT_EMU10K1=m
# CONFIG_GAMEPORT_FM801 is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
# CONFIG_SERIAL_NONSTANDARD is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=m
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=m
CONFIG_SERIAL_8250_PNP=m
CONFIG_SERIAL_8250_NR_UARTS=4
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
# CONFIG_SERIAL_8250_DETECT_IRQ is not set
CONFIG_SERIAL_8250_RSA=y

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=m
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_PRINTER=m
# CONFIG_LP_CONSOLE is not set
# CONFIG_PPDEV is not set
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=m
CONFIG_HW_RANDOM_INTEL=m
CONFIG_HW_RANDOM_AMD=m
CONFIG_NVRAM=m
CONFIG_RTC=m
CONFIG_GEN_RTC=m
CONFIG_GEN_RTC_X=y
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_MWAVE is not set
# CONFIG_PC8736x_GPIO is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
CONFIG_HPET_RTC_IRQ=y
CONFIG_HPET_MMAP=y
CONFIG_HANGCHECK_TIMER=m
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=m
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_CHARDEV=m

#
# I2C Algorithms
#
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_ALGOPCF=m
CONFIG_I2C_ALGOPCA=m

#
# I2C Hardware Bus support
#
CONFIG_I2C_ALI1535=m
CONFIG_I2C_ALI1563=m
CONFIG_I2C_ALI15X3=m
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_I810 is not set
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
CONFIG_I2C_OCORES=m
CONFIG_I2C_PARPORT=m
CONFIG_I2C_PARPORT_LIGHT=m
# CONFIG_I2C_PROSAVAGE is not set
# CONFIG_I2C_SAVAGE4 is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_STUB is not set
# CONFIG_I2C_TINY_USB is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set
# CONFIG_I2C_VOODOO3 is not set

#
# Miscellaneous I2C Chip support
#
# CONFIG_SENSORS_DS1337 is not set
# CONFIG_SENSORS_DS1374 is not set
# CONFIG_DS1682 is not set
# CONFIG_SENSORS_EEPROM is not set
# CONFIG_SENSORS_PCF8574 is not set
# CONFIG_SENSORS_PCA9539 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_SENSORS_MAX6875 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set

#
# SPI support
#
# CONFIG_SPI is not set
# CONFIG_SPI_MASTER is not set
# CONFIG_W1 is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_BATTERY_DS2760 is not set
CONFIG_HWMON=m
CONFIG_HWMON_VID=m
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7470 is not set
CONFIG_SENSORS_K8TEMP=m
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHER is not set
# CONFIG_SENSORS_FSCPOS is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_CORETEMP is not set
CONFIG_SENSORS_IT87=m
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83627HF is not set
CONFIG_SENSORS_W83627EHF=m
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_HWMON_DEBUG_CHIP is not set
# CONFIG_WATCHDOG is not set

#
# Sonics Silicon Backplane
#
CONFIG_SSB_POSSIBLE=y
# CONFIG_SSB is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_SM501 is not set

#
# Multimedia devices
#
# CONFIG_VIDEO_DEV is not set
# CONFIG_DVB_CORE is not set
# CONFIG_DAB is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
# CONFIG_AGP_INTEL is not set
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_VIA is not set
CONFIG_DRM=m
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
# CONFIG_DRM_RADEON is not set
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
CONFIG_VGASTATE=m
CONFIG_VIDEO_OUTPUT_CONTROL=m
CONFIG_FB=m
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_DDC=m
CONFIG_FB_CFB_FILLRECT=m
CONFIG_FB_CFB_COPYAREA=m
CONFIG_FB_CFB_IMAGEBLIT=m
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
# CONFIG_FB_SYS_FILLRECT is not set
# CONFIG_FB_SYS_COPYAREA is not set
# CONFIG_FB_SYS_IMAGEBLIT is not set
# CONFIG_FB_SYS_FOPS is not set
CONFIG_FB_DEFERRED_IO=y
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
CONFIG_FB_BACKLIGHT=y
CONFIG_FB_MODE_HELPERS=y
# CONFIG_FB_TILEBLITTING is not set

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_VGA16 is not set
CONFIG_FB_UVESA=m
# CONFIG_FB_HECUBA is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
CONFIG_FB_NVIDIA=m
CONFIG_FB_NVIDIA_I2C=y
# CONFIG_FB_NVIDIA_DEBUG is not set
CONFIG_FB_NVIDIA_BACKLIGHT=y
# CONFIG_FB_RIVA is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_INTEL is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_VIRTUAL is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_LCD_CLASS_DEVICE=m
CONFIG_BACKLIGHT_CLASS_DEVICE=m
# CONFIG_BACKLIGHT_CORGI is not set
# CONFIG_BACKLIGHT_PROGEAR is not set

#
# Display device support
#
# CONFIG_DISPLAY_SUPPORT is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
# CONFIG_VGACON_SOFT_SCROLLBACK is not set
CONFIG_VIDEO_SELECT=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=m
# CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY is not set
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
CONFIG_FONTS=y
# CONFIG_FONT_8x8 is not set
CONFIG_FONT_8x16=y
# CONFIG_FONT_6x11 is not set
# CONFIG_FONT_7x14 is not set
# CONFIG_FONT_PEARL_8x8 is not set
# CONFIG_FONT_ACORN_8x8 is not set
# CONFIG_FONT_MINI_4x6 is not set
# CONFIG_FONT_SUN8x16 is not set
# CONFIG_FONT_SUN12x22 is not set
# CONFIG_FONT_10x18 is not set
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
CONFIG_LOGO_LINUX_VGA16=y
CONFIG_LOGO_LINUX_CLUT224=y

#
# Sound
#
CONFIG_SOUND=m

#
# Advanced Linux Sound Architecture
#
CONFIG_SND=m
CONFIG_SND_TIMER=m
CONFIG_SND_PCM=m
CONFIG_SND_HWDEP=m
CONFIG_SND_RAWMIDI=m
CONFIG_SND_SEQUENCER=m
CONFIG_SND_SEQ_DUMMY=m
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=m
CONFIG_SND_PCM_OSS=m
CONFIG_SND_PCM_OSS_PLUGINS=y
CONFIG_SND_SEQUENCER_OSS=y
CONFIG_SND_RTCTIMER=m
CONFIG_SND_SEQ_RTCTIMER_DEFAULT=y
# CONFIG_SND_DYNAMIC_MINORS is not set
CONFIG_SND_SUPPORT_OLD_API=y
# CONFIG_SND_VERBOSE_PROCFS is not set
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set

#
# Generic devices
#
CONFIG_SND_MPU401_UART=m
CONFIG_SND_OPL3_LIB=m
CONFIG_SND_AC97_CODEC=m
CONFIG_SND_DUMMY=m
CONFIG_SND_VIRMIDI=m
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_MTS64 is not set
# CONFIG_SND_SERIAL_U16550 is not set
CONFIG_SND_MPU401=m
# CONFIG_SND_PORTMAN2X4 is not set
CONFIG_SND_SB_COMMON=m
CONFIG_SND_SB16_DSP=m

#
# PCI devices
#
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
CONFIG_SND_ALI5451=m
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
CONFIG_SND_CA0106=m
CONFIG_SND_CMIPCI=m
CONFIG_SND_CS4281=m
CONFIG_SND_CS46XX=m
CONFIG_SND_CS46XX_NEW_DSP=y
CONFIG_SND_CS5530=m
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
CONFIG_SND_EMU10K1=m
CONFIG_SND_EMU10K1X=m
CONFIG_SND_ENS1370=m
CONFIG_SND_ENS1371=m
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
CONFIG_SND_HDA_INTEL=m
# CONFIG_SND_HDA_HWDEP is not set
CONFIG_SND_HDA_CODEC_REALTEK=y
# CONFIG_SND_HDA_CODEC_ANALOG is not set
# CONFIG_SND_HDA_CODEC_SIGMATEL is not set
# CONFIG_SND_HDA_CODEC_VIA is not set
# CONFIG_SND_HDA_CODEC_ATIHDMI is not set
# CONFIG_SND_HDA_CODEC_CONEXANT is not set
# CONFIG_SND_HDA_CODEC_CMEDIA is not set
# CONFIG_SND_HDA_CODEC_SI3054 is not set
CONFIG_SND_HDA_GENERIC=y
CONFIG_SND_HDA_POWER_SAVE=y
CONFIG_SND_HDA_POWER_SAVE_DEFAULT=0
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
CONFIG_SND_INTEL8X0=m
CONFIG_SND_INTEL8X0M=m
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set
# CONFIG_SND_AC97_POWER_SAVE is not set

#
# USB devices
#
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_USX2Y is not set
# CONFIG_SND_USB_CAIAQ is not set

#
# System on Chip audio support
#
# CONFIG_SND_SOC is not set

#
# SoC Audio support for SuperH
#

#
# Open Sound System
#
# CONFIG_SOUND_PRIME is not set
CONFIG_AC97_BUS=m
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
# CONFIG_HID_DEBUG is not set
CONFIG_HIDRAW=y

#
# USB Input Devices
#
CONFIG_USB_HID=m
# CONFIG_USB_HIDINPUT_POWERBOOK is not set
CONFIG_HID_FF=y
CONFIG_HID_PID=y
CONFIG_LOGITECH_FF=y
# CONFIG_PANTHERLORD_FF is not set
CONFIG_THRUSTMASTER_FF=y
CONFIG_ZEROPLUS_FF=y
CONFIG_USB_HIDDEV=y

#
# USB HID Boot Protocol drivers
#
CONFIG_USB_KBD=m
CONFIG_USB_MOUSE=m
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=m
# CONFIG_USB_DEBUG is not set

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
CONFIG_USB_DEVICE_CLASS=y
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_PERSIST is not set
# CONFIG_USB_OTG is not set

#
# USB Host Controller Drivers
#
CONFIG_USB_EHCI_HCD=m
CONFIG_USB_EHCI_SPLIT_ISO=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
# CONFIG_USB_ISP116X_HCD is not set
CONFIG_USB_OHCI_HCD=m
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
# CONFIG_USB_UHCI_HCD is not set
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set

#
# USB Device Class drivers
#
CONFIG_USB_ACM=m
CONFIG_USB_PRINTER=m

#
# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support'
#

#
# may also be needed; see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_DPCM is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_KARMA is not set
CONFIG_USB_LIBUSUAL=y

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
CONFIG_USB_MON=y

#
# USB port drivers
#
# CONFIG_USB_USS720 is not set

#
# USB Serial Converter support
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_AUERSWALD is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_BERRY_CHARGE is not set
CONFIG_USB_LED=m
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_PHIDGET is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
CONFIG_USB_LD=m
CONFIG_USB_TRANCEVIBRATOR=m
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set

#
# USB DSL modem support
#
CONFIG_USB_ATM=m
CONFIG_USB_SPEEDTOUCH=m
# CONFIG_USB_CXACRU is not set
# CONFIG_USB_UEAGLEATM is not set
# CONFIG_USB_XUSBATM is not set

#
# USB Gadget Support
#
# CONFIG_USB_GADGET is not set
# CONFIG_MMC is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=m

#
# LED drivers
#

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
CONFIG_LEDS_TRIGGER_TIMER=m
CONFIG_LEDS_TRIGGER_HEARTBEAT=m
# CONFIG_INFINIBAND is not set
# CONFIG_EDAC is not set
CONFIG_RTC_LIB=m
CONFIG_RTC_CLASS=m

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
CONFIG_RTC_INTF_DEV_UIE_EMUL=y
CONFIG_RTC_DRV_TEST=m

#
# I2C RTC drivers
#
CONFIG_RTC_DRV_DS1307=m
# CONFIG_RTC_DRV_DS1374 is not set
CONFIG_RTC_DRV_DS1672=m
CONFIG_RTC_DRV_MAX6900=m
CONFIG_RTC_DRV_RS5C372=m
CONFIG_RTC_DRV_ISL1208=m
CONFIG_RTC_DRV_X1205=m
CONFIG_RTC_DRV_PCF8563=m
CONFIG_RTC_DRV_PCF8583=m
# CONFIG_RTC_DRV_M41T80 is not set

#
# SPI RTC drivers
#

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=m
CONFIG_RTC_DRV_DS1553=m
# CONFIG_RTC_DRV_STK17TA8 is not set
CONFIG_RTC_DRV_DS1742=m
CONFIG_RTC_DRV_M48T86=m
# CONFIG_RTC_DRV_M48T59 is not set
CONFIG_RTC_DRV_V3020=m

#
# on-CPU RTC drivers
#
# CONFIG_DMADEVICES is not set
# CONFIG_AUXDISPLAY is not set
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
# CONFIG_KVM_INTEL is not set
CONFIG_KVM_AMD=m

#
# Userspace I/O
#
# CONFIG_UIO is not set

#
# Firmware Drivers
#
# CONFIG_EDD is not set
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_DMIID=y

#
# File systems
#
CONFIG_EXT2_FS=y
# CONFIG_EXT2_FS_XATTR is not set
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=m
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
# CONFIG_EXT4DEV_FS is not set
CONFIG_JBD=m
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
CONFIG_JFS_FS=y
CONFIG_JFS_POSIX_ACL=y
CONFIG_JFS_SECURITY=y
# CONFIG_JFS_DEBUG is not set
CONFIG_JFS_STATISTICS=y
CONFIG_FS_POSIX_ACL=y
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_QUOTA is not set
# CONFIG_DNOTIFY is not set
# CONFIG_AUTOFS_FS is not set
# CONFIG_AUTOFS4_FS is not set
CONFIG_FUSE_FS=m

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=m
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=m
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=850
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
CONFIG_NTFS_FS=m
# CONFIG_NTFS_DEBUG is not set
CONFIG_NTFS_RW=y

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set
CONFIG_CONFIGFS_FS=m

#
# Miscellaneous filesystems
#
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
CONFIG_ECRYPT_FS=m
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
CONFIG_CRAMFS=y
# CONFIG_VXFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
# CONFIG_NFS_FS is not set
# CONFIG_NFSD is not set
# CONFIG_SMB_FS is not set
CONFIG_CIFS=m
CONFIG_CIFS_STATS=y
# CONFIG_CIFS_STATS2 is not set
CONFIG_CIFS_WEAK_PW_HASH=y
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y
# CONFIG_CIFS_DEBUG2 is not set
CONFIG_CIFS_EXPERIMENTAL=y
CONFIG_CIFS_UPCALL=y
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
CONFIG_9P_FS=m

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_OSF_PARTITION is not set
# CONFIG_AMIGA_PARTITION is not set
# CONFIG_ATARI_PARTITION is not set
# CONFIG_MAC_PARTITION is not set
CONFIG_MSDOS_PARTITION=y
# CONFIG_BSD_DISKLABEL is not set
# CONFIG_MINIX_SUBPARTITION is not set
# CONFIG_SOLARIS_X86_PARTITION is not set
# CONFIG_UNIXWARE_DISKLABEL is not set
CONFIG_LDM_PARTITION=y
CONFIG_LDM_DEBUG=y
# CONFIG_SGI_PARTITION is not set
# CONFIG_ULTRIX_PARTITION is not set
# CONFIG_SUN_PARTITION is not set
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="cp850"
# CONFIG_NLS_CODEPAGE_437 is not set
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
CONFIG_NLS_CODEPAGE_850=m
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
CONFIG_NLS_CODEPAGE_860=m
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
CONFIG_NLS_ISO8859_1=m
# CONFIG_NLS_ISO8859_2 is not set
CONFIG_NLS_ISO8859_3=m
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
CONFIG_NLS_ISO8859_14=m
CONFIG_NLS_ISO8859_15=m
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
CONFIG_NLS_UTF8=m
# CONFIG_DLM is not set
# CONFIG_INSTRUMENTATION is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_PRINTK_TIME=y
CONFIG_ENABLE_WARN_DEPRECATED=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_MAGIC_SYSRQ=y
# CONFIG_UNUSED_SYMBOLS is not set
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SHIRQ=y
CONFIG_DETECT_SOFTLOCKUP=y
# CONFIG_SCHED_DEBUG is not set
# CONFIG_SCHEDSTATS is not set
# CONFIG_TIMER_STATS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_FRAME_POINTER is not set
# CONFIG_FORCED_INLINING is not set
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_FAULT_INJECTION is not set
# CONFIG_SAMPLES is not set
CONFIG_EARLY_PRINTK=y
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_RODATA is not set
# CONFIG_IOMMU_DEBUG is not set

#
# Security options
#
CONFIG_KEYS=y
CONFIG_KEYS_DEBUG_PROC_KEYS=y
CONFIG_SECURITY=y
CONFIG_SECURITY_NETWORK=y
CONFIG_SECURITY_NETWORK_XFRM=y
CONFIG_SECURITY_CAPABILITIES=y
CONFIG_SECURITY_FILE_CAPABILITIES=y
CONFIG_CRYPTO=y
CONFIG_CRYPTO_ALGAPI=m
CONFIG_CRYPTO_ABLKCIPHER=m
CONFIG_CRYPTO_AEAD=m
CONFIG_CRYPTO_BLKCIPHER=m
CONFIG_CRYPTO_HASH=m
CONFIG_CRYPTO_MANAGER=m
CONFIG_CRYPTO_HMAC=m
CONFIG_CRYPTO_XCBC=m
CONFIG_CRYPTO_NULL=m
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=m
CONFIG_CRYPTO_SHA1=m
CONFIG_CRYPTO_SHA256=m
CONFIG_CRYPTO_SHA512=m
CONFIG_CRYPTO_WP512=m
CONFIG_CRYPTO_TGR192=m
CONFIG_CRYPTO_GF128MUL=m
CONFIG_CRYPTO_ECB=m
CONFIG_CRYPTO_CBC=m
CONFIG_CRYPTO_PCBC=m
CONFIG_CRYPTO_LRW=m
CONFIG_CRYPTO_XTS=m
CONFIG_CRYPTO_CRYPTD=m
CONFIG_CRYPTO_DES=m
CONFIG_CRYPTO_FCRYPT=m
CONFIG_CRYPTO_BLOWFISH=m
CONFIG_CRYPTO_TWOFISH=m
CONFIG_CRYPTO_TWOFISH_COMMON=m
CONFIG_CRYPTO_TWOFISH_X86_64=m
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_AES=m
CONFIG_CRYPTO_AES_X86_64=m
CONFIG_CRYPTO_CAST5=m
CONFIG_CRYPTO_CAST6=m
CONFIG_CRYPTO_TEA=m
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_KHAZAD=m
CONFIG_CRYPTO_ANUBIS=m
CONFIG_CRYPTO_SEED=m
CONFIG_CRYPTO_DEFLATE=m
CONFIG_CRYPTO_MICHAEL_MIC=m
CONFIG_CRYPTO_CRC32C=m
CONFIG_CRYPTO_CAMELLIA=m
# CONFIG_CRYPTO_TEST is not set
CONFIG_CRYPTO_AUTHENC=m
# CONFIG_CRYPTO_HW is not set

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_CRC_CCITT=m
CONFIG_CRC16=m
CONFIG_CRC_ITU_T=m
CONFIG_CRC32=y
CONFIG_CRC7=m
CONFIG_LIBCRC32C=m
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=m
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_PLIST=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
  2008-02-07 15:22       ` Alejandro Riveira Fernández
@ 2008-02-07 15:53         ` Parag Warudkar
  2008-02-07 15:56           ` Parag Warudkar
  2008-02-07 17:36           ` Frank Mayhar
  0 siblings, 2 replies; 51+ messages in thread
From: Parag Warudkar @ 2008-02-07 15:53 UTC (permalink / raw)
  To: Alejandro Riveira Fernández
  Cc: parag.warudkar, Frank Mayhar, Andrew Morton, bugme-daemon,
	linux-kernel, Ingo Molnar, Thomas Gleixner, Roland McGrath,
	Jakub Jelinek

[-- Attachment #1: Type: TEXT/PLAIN, Size: 618 bytes --]



On Thu, 7 Feb 2008, Alejandro Riveira Fernández wrote:

> gcc --version
> 
> gcc-4.2 (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4)
>  
> If some fields are empty or look unusual you may have an old version.
> Compare to the current minimal requirements in Documentation/Changes.
>  
> Linux Varda 2.6.24 #2 SMP PREEMPT Fri Jan 25 01:05:47 CET 2008 x86_64 GNU/Linux
>
So x86+SMP+GnuC-4.1.2+Glibc-2.5 = Not reproducible.

   x86_64+SMP+PREEMPT+GnuC-4.1.2+Glibc-2.5 = Reproducible.	

Not sure what the original reporter's $ARCH was.

So the next thing worth trying might be to disable PREEMPT and see if 
that cures it.
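
That is, rebuild with preemption switched off; in .config terms that
would be something like:

  # CONFIG_PREEMPT is not set
  CONFIG_PREEMPT_NONE=y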

Parag 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
  2008-02-07 15:56           ` Parag Warudkar
@ 2008-02-07 15:54             ` Alejandro Riveira Fernández
  2008-02-07 16:01               ` Parag Warudkar
  0 siblings, 1 reply; 51+ messages in thread
From: Alejandro Riveira Fernández @ 2008-02-07 15:54 UTC (permalink / raw)
  To: parag.warudkar
  Cc: Frank Mayhar, Andrew Morton, bugme-daemon, linux-kernel,
	Ingo Molnar, Thomas Gleixner, Roland McGrath, Jakub Jelinek

El Thu, 7 Feb 2008 10:56:16 -0500 (EST)
Parag Warudkar <parag.warudkar@gmail.com> escribió:

> 
> 
> On Thu, 7 Feb 2008, Parag Warudkar wrote:
> 
> >    x86_64+SMP+PREEMPT+GnuC-4.1.2+Glibc-2.5 = Reproducible.	
> > 
> That should of course be   
>  x86_64+SMP+PREEMPT+GnuC-4.1.3+Glibc-2.6.1 = Reproducible.	
> 
From my previous mail:

Note that I use CC=gcc-4.2 

 $gcc-4.2 --version

 gcc-4.2 (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
  2008-02-07 15:53         ` Parag Warudkar
@ 2008-02-07 15:56           ` Parag Warudkar
  2008-02-07 15:54             ` Alejandro Riveira Fernández
  2008-02-07 17:36           ` Frank Mayhar
  1 sibling, 1 reply; 51+ messages in thread
From: Parag Warudkar @ 2008-02-07 15:56 UTC (permalink / raw)
  To: Alejandro Riveira Fernández
  Cc: Frank Mayhar, Andrew Morton, bugme-daemon, linux-kernel,
	Ingo Molnar, Thomas Gleixner, Roland McGrath, Jakub Jelinek



On Thu, 7 Feb 2008, Parag Warudkar wrote:

>    x86_64+SMP+PREEMPT+GnuC-4.1.2+Glibc-2.5 = Reproducible.	
> 
That should of course be   
 x86_64+SMP+PREEMPT+GnuC-4.1.3+Glibc-2.6.1 = Reproducible.	


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
  2008-02-07 15:54             ` Alejandro Riveira Fernández
@ 2008-02-07 16:01               ` Parag Warudkar
  2008-02-07 16:53                 ` Parag Warudkar
  0 siblings, 1 reply; 51+ messages in thread
From: Parag Warudkar @ 2008-02-07 16:01 UTC (permalink / raw)
  To: Alejandro Riveira Fernández
  Cc: parag.warudkar, Frank Mayhar, Andrew Morton, bugme-daemon,
	linux-kernel, Ingo Molnar, Thomas Gleixner, Roland McGrath,
	Jakub Jelinek

[-- Attachment #1: Type: TEXT/PLAIN, Size: 604 bytes --]



On Thu, 7 Feb 2008, Alejandro Riveira Fernández wrote:

> El Thu, 7 Feb 2008 10:56:16 -0500 (EST)
> Parag Warudkar <parag.warudkar@gmail.com> escribió:
> 
> > 
> > 
> > On Thu, 7 Feb 2008, Parag Warudkar wrote:
> > 
> > >    x86_64+SMP+PREEMPT+GnuC-4.1.2+Glibc-2.5 = Reproducible.	
> > > 
> > That should of course be   
> >  x86_64+SMP+PREEMPT+GnuC-4.1.3+Glibc-2.6.1 = Reproducible.	
> > 
> From my previous mail 
> 
> Note that I use CC=gcc-4.2 
> 
>  $gcc-4.2 --version
> 
>  gcc-4.2 (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4)
> 
Yep. I will enable PREEMPT and see if it reproduces for me.

Thanks
Parag 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
  2008-02-07 16:01               ` Parag Warudkar
@ 2008-02-07 16:53                 ` Parag Warudkar
  2008-02-29 19:55                   ` Frank Mayhar
  0 siblings, 1 reply; 51+ messages in thread
From: Parag Warudkar @ 2008-02-07 16:53 UTC (permalink / raw)
  To: Parag Warudkar
  Cc: Alejandro Riveira Fernández, Frank Mayhar, Andrew Morton,
	bugme-daemon, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Roland McGrath, Jakub Jelinek



On Thu, 7 Feb 2008, Parag Warudkar wrote:
> Yep. I will enable PREEMPT and see if it reproduces for me.

Not reproducible with PREEMPT either. 

Parag

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
  2008-02-07 15:53         ` Parag Warudkar
  2008-02-07 15:56           ` Parag Warudkar
@ 2008-02-07 17:36           ` Frank Mayhar
  1 sibling, 0 replies; 51+ messages in thread
From: Frank Mayhar @ 2008-02-07 17:36 UTC (permalink / raw)
  To: parag.warudkar
  Cc: Alejandro Riveira Fernández, Andrew Morton, bugme-daemon,
	linux-kernel, Ingo Molnar, Thomas Gleixner, Roland McGrath,
	Jakub Jelinek

On Thu, 2008-02-07 at 10:53 -0500, Parag Warudkar wrote:
> 
> On Thu, 7 Feb 2008, Alejandro Riveira Fernández wrote:
> 
> > gcc --version
> > 
> > gcc-4.2 (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4)
> >  
> > If some fields are empty or look unusual you may have an old version.
> > Compare to the current minimal requirements in Documentation/Changes.
> >  
> > Linux Varda 2.6.24 #2 SMP PREEMPT Fri Jan 25 01:05:47 CET 2008 x86_64 GNU/Linux
> >
> So x86+SMP+GnuC-4.1.2+Glibc-2.5 = Not reproducible.
> 
>    x86_64+SMP+PREEMPT+GnuC-4.1.2+Glibc-2.5 = Reproducible.	
> 
> Not sure what the original reporter's $ARCH was.

Several, among which were i686+SMP+GnuC-4.0.3+Glibc-2.3.6.  No PREEMPT.
Linux 2.6.18, 2.6.21 and 2.6.24-rc4.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
  2008-02-07 16:53                 ` Parag Warudkar
@ 2008-02-29 19:55                   ` Frank Mayhar
  2008-03-04  7:00                     ` Roland McGrath
  0 siblings, 1 reply; 51+ messages in thread
From: Frank Mayhar @ 2008-02-29 19:55 UTC (permalink / raw)
  To: parag.warudkar
  Cc: Alejandro Riveira Fernández, Andrew Morton, bugme-daemon,
	linux-kernel, Ingo Molnar, Thomas Gleixner, Roland McGrath,
	Jakub Jelinek

On Thu, 2008-02-07 at 11:53 -0500, Parag Warudkar wrote:
> On Thu, 7 Feb 2008, Parag Warudkar wrote:
> > Yep. I will enable PREEMPT and see if it reproduces for me.
> 
> Not reproducible with PREEMPT either. 

Okay, here's an analysis of the problem and a potential solution.  I
mentioned this in the bug itself but I'll repeat it here:

A couple of us here have been investigating this thing and have
concluded that the problem lies in the implementation of
run_posix_cpu_timers() and specifically in the quadratic nature of the
implementation.  It calls check_process_timers() to sum the
utime/stime/sched_time (in 2.6.18.5, under another name in 2.6.24+) of
all threads in the thread group.  This means that runtime there grows
with the number of threads.  It can go through the list _again_ if and
when it decides to rebalance expiry times.
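
For reference, the per-tick summing loop in question (as it appears in
2.6.18's check_process_timers()) looks like this:

	t = tsk;
	do {
		utime = cputime_add(utime, t->utime);
		stime = cputime_add(stime, t->stime);
		sched_time += t->sched_time;
		t = next_thread(t);
	} while (t != tsk);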

After thinking through it, it seems clear that the critical number of
threads is the one at which run_posix_cpu_timers() takes as long as or
longer than a tick to get its work done.  The system makes progress to
that point but after that everything goes to hell as it gets further and
further behind.  This explains all the symptoms we've seen, including
seeing run_posix_cpu_timers() at the top of a bunch of profiling stats
(I saw it get more than a third of overall processing time on a bunch of
tests, even where the system _didn't_ hang!).  It explains the fact that
things get slow right before they go to hell and it explains why under
certain conditions the system can recover (if the threads have started
exiting by the time it hangs, for example).
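
To put rough, purely illustrative numbers on that: at HZ=1000 a tick is
1 ms, so if each thread costs on the order of 250 ns in that summing
loop (a cache miss or two per thread), then

	4000 threads * 250 ns/thread = 1 ms = one full tick,

and past that point the handler can never catch up.  That is in the
same ballpark as the hang I've seen at 4500 (or fewer) threads.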

I've come up with a potential fix for the problem.  It does two things.
First, rather than summing the utime/stime/sched_time at each interrupt,
it accumulates those times in new task_struct fields on the group leader
as the time is charged; at interrupt it then just consults those fields.
This avoids both repeatedly blowing the cache and a loop across all the
threads.

Second, if there are more than 1000 threads in the process (as noted in
task->signal->live), it just punts all of the processing to a workqueue.

With these changes I've gone from a hang at 4500 (or fewer) threads to
running out of resources at more than 32000 threads on a single-CPU box.
When I've finished testing I'll polish the patch a bit and submit it to
the LKML but I thought you guys might want to know the state of things.

Oh, and one more note:  This bug is also dependent on HZ, since it
matters how long a tick is.  I've been running with HZ=1000.  A faster
machine or one with HZ=100 would potentially need to generate a _lot_
more threads to see the hang.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
  2008-02-29 19:55                   ` Frank Mayhar
@ 2008-03-04  7:00                     ` Roland McGrath
  2008-03-04 19:52                       ` Frank Mayhar
  0 siblings, 1 reply; 51+ messages in thread
From: Roland McGrath @ 2008-03-04  7:00 UTC (permalink / raw)
  To: Frank Mayhar
  Cc: parag.warudkar, Alejandro Riveira Fernández, Andrew Morton,
	bugme-daemon, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Jakub Jelinek

Thanks for the detailed explanation and for bringing this to my attention.

This is a problem we knew about when I first implemented posix-cpu-timers
and process-wide SIGPROF/SIGVTALRM.  I'm a little surprised it took this
long to become a problem in practice.  I originally expected to have to
revisit it sooner than this, but I certainly haven't thought about it for
quite some time.  I'd guess that HZ=1000 becoming common is what did it.

The obvious implementation for the process-wide clocks is to have the
tick interrupt increment shared utime/stime/sched_time fields in
signal_struct as well as the private task_struct fields.  The all-threads
totals accumulate in the signal_struct fields, which would be atomic_t.
It's then trivial for the timer expiry checks to compare against those
totals.
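
Concretely, the shape I have in mind is something like this (the field
names here are illustrative, not from any existing patch):

	struct signal_struct {
		...
		atomic_t group_utime;	/* all-threads user ticks */
		atomic_t group_stime;	/* all-threads system ticks */
	};

	/* in account_user_time(), alongside the per-task p->utime update */
	atomic_add(cputime_to_jiffies(cputime), &p->signal->group_utime);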

The concern I had about this was multiple CPUs competing for the
signal_struct fields.  (That is, several CPUs all running threads in the
same process.)  If the ticks on each CPU are even close to synchronized,
then every single time all those CPUs will do an atomic_add on the same
word.  I'm not any kind of expert on SMP and cache effects, but I know
this is bad.  However bad it is, it's that bad all the time and however
few threads (down to 2) it's that bad for that many CPUs.

The implementation we have instead is obviously dismal for large numbers
of threads.  I always figured we'd replace that with something based on
more sophisticated thinking about the CPU-clash issue.  

I don't entirely follow your description of your patch.  It sounds like it
should be two patches, though.  The second of those patches (workqueue)
sounds like it could be an appropriate generic cleanup, or like it could
be a complication that might be unnecessary if we get a really good
solution to the main issue.

As for the first patch, I'm not sure whether I understand what you said.
Can you elaborate?  Or just post the unfinished patch as illustration,
marking it as not for submission until you've finished.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
  2008-03-04  7:00                     ` Roland McGrath
@ 2008-03-04 19:52                       ` Frank Mayhar
  2008-03-05  4:08                         ` Roland McGrath
  0 siblings, 1 reply; 51+ messages in thread
From: Frank Mayhar @ 2008-03-04 19:52 UTC (permalink / raw)
  To: Roland McGrath
  Cc: parag.warudkar, Alejandro Riveira Fernández, Andrew Morton,
	linux-kernel, Ingo Molnar, Thomas Gleixner, Jakub Jelinek

I put this on the patch, but I'm emailing it as well.

On Mon, 2008-03-03 at 23:00 -0800, Roland McGrath wrote:
> Thanks for the detailed explanation and for bringing this to my attention.

You're quite welcome.

> This is a problem we knew about when I first implemented posix-cpu-timers
> and process-wide SIGPROF/SIGVTALRM.  I'm a little surprised it took this
> long to become a problem in practice.  I originally expected to have to
> revisit it sooner than this, but I certainly haven't thought about it for
> quite some time.  I'd guess that HZ=1000 becoming common is what did it.

Well, the iron is getting bigger, too, so it's beginning to be feasible
to run _lots_ of threads.

> The obvious implementation for the process-wide clocks is to have the
> tick interrupt increment shared utime/stime/sched_time fields in
> signal_struct as well as the private task_struct fields.  The all-threads
> totals accumulate in the signal_struct fields, which would be atomic_t.
> It's then trivial for the timer expiry checks to compare against those
> totals.
> 
> The concern I had about this was multiple CPUs competing for the
> signal_struct fields.  (That is, several CPUs all running threads in the
> same process.)  If the ticks on each CPU are even close to synchronized,
> then every single time all those CPUs will do an atomic_add on the same
> word.  I'm not any kind of expert on SMP and cache effects, but I know
> this is bad.  However bad it is, it's that bad all the time and however
> few threads (down to 2) it's that bad for that many CPUs.
> 
> The implementation we have instead is obviously dismal for large numbers
> of threads.  I always figured we'd replace that with something based on
> more sophisticated thinking about the CPU-clash issue.  
> 
> I don't entirely follow your description of your patch.  It sounds like it
> should be two patches, though.  The second of those patches (workqueue)
> sounds like it could be an appropriate generic cleanup, or like it could
> be a complication that might be unnecessary if we get a really good
> solution to main issue.  
> 
> The first patch I'm not sure whether I understand what you said or not.
> Can you elaborate?  Or just post the unfinished patch as illustration,
> marking it as not for submission until you've finished.

My first patch did essentially what you outlined above, incrementing
shared utime/stime/sched_time fields, except that they were in the
task_struct of the group leader rather than in the signal_struct.  It's
not clear to me exactly how the signal_struct is shared, whether it is
shared among all threads or if each has its own version.

So each timer routine had something like:

	/* If we're part of a thread group, add our time to the leader. */
	if (p->group_leader != NULL)
		p->group_leader->threads_sched_time += tmp;

and check_process_timers() had

	/* Times for the whole thread group are held by the group leader. */
	utime = cputime_add(utime, tsk->group_leader->threads_utime);
	stime = cputime_add(stime, tsk->group_leader->threads_stime);
	sched_time += tsk->group_leader->threads_sched_time;

Of course, this alone is insufficient.  It speeds things up a tiny bit
but not nearly enough.

The other issue has to do with the rest of the processing in
run_posix_cpu_timers(), walking the timer lists and walking the whole
thread group (again) to rebalance expiry times.  My second patch moved
all that work to a workqueue, but only if there were more than 100
threads in the process.  This basically papered over the problem by
moving the processing out of interrupt and into a kernel thread.  It's
still insufficient, though, because it takes just as long and will get
backed up just as badly on large numbers of threads.  This was made
clear in a test I ran yesterday where I generated some 200,000 threads.
The work queue was unreasonably large, as you might expect.

I am looking for a way to do everything that needs to be done in fewer
operations, but unfortunately I'm not familiar enough with the
SIGPROF/SIGVTALRM semantics or with the details of the Linux
implementation to know where it is safe to consolidate things.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
  2008-03-04 19:52                       ` Frank Mayhar
@ 2008-03-05  4:08                         ` Roland McGrath
  2008-03-06 19:04                           ` Frank Mayhar
  2008-03-07 23:26                           ` [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF Frank Mayhar
  0 siblings, 2 replies; 51+ messages in thread
From: Roland McGrath @ 2008-03-05  4:08 UTC (permalink / raw)
  To: Frank Mayhar
  Cc: parag.warudkar, Alejandro Riveira Fernández, Andrew Morton,
	linux-kernel, Ingo Molnar, Thomas Gleixner, Jakub Jelinek

> My first patch did essentially what you outlined above, incrementing
> shared utime/stime/sched_time fields, except that they were in the
> task_struct of the group leader rather than in the signal_struct.  It's
> not clear to me exactly how the signal_struct is shared, whether it is
> shared among all threads or if each has its own version.

There is a 1:1 correspondence between "shares signal_struct" and "member of
same thread group".  signal_struct is the right place for such new fields.
Don't be confused by the existing fields utime, stime, gtime, and
sum_sched_runtime.  All of those are accumulators only touched when a
non-leader thread dies (in __exit_signal), and governed by the siglock.
Their only purpose now is to represent the threads that are dead and gone
when calculating the cumulative total for the whole thread group.  If you
were to provide cumulative totals that are updated on every tick, then
these old fields would not be needed.

> So each timer routine had something like:
> 
> 	/* If we're part of a thread group, add our time to the leader. */
> 	if (p->group_leader != NULL)
> 		p->group_leader->threads_sched_time += tmp;

The task_struct.group_leader field is never NULL.  Every thread is a member
of some thread group.  The degenerate case is that it's the only member of
the group; then p->group_leader == p.

> and check_process_timers() had
> 
> 	/* Times for the whole thread group are held by the group leader. */
> 	utime = cputime_add(utime, tsk->group_leader->threads_utime);
> 	stime = cputime_add(stime, tsk->group_leader->threads_stime);
> 	sched_time += tsk->group_leader->threads_sched_time;
> 
> Of course, this alone is insufficient.  It speeds things up a tiny bit
> but not nearly enough.

It sounds like you sped up only one of the sampling loops.  Having a
cumulative total already on hand means cpu_clock_sample_group can also
become simple and cheap, as can the analogues in do_getitimer and
k_getrusage.  These are what's used in clock_gettime and in the timer
manipulation calls, and in getitimer and getrusage.  That's all just gravy.

The real benefit of having a cumulative total is for the basic logic of
run_posix_cpu_timers (check_process_timers) and the timer expiry setup.  
It sounds like you didn't take advantage of the new fields for that.

When a cumulative total is on hand in the tick handler, then there is no
need at all to derive per-thread expiry times from group-wide CPU timers
("rebalance") either there or when arming the timer in the first place.
All of that complexity can just disappear from the implementation.

check_process_timers can look just like check_thread_timers, but
consulting the shared fields instead of the per-thread ones for both the
clock accumulators and the timers' expiry times.  Likewise, arm_timer
only has to set signal->it_*_expires; process_timer_rebalance goes away.

If you do all that then the time spent in run_posix_cpu_timers should
not be affected at all by the number of threads.  The only "walking the
timer lists" that happens is popping the expired timers off the head of
the lists that are kept in ascending order of expiry time.  For each
flavor of timer, there are n+1 steps in the "walk" for n timers that
have expired.  So even now, no costs here should scale with the number of
timers, just with the number of timers that all expire at the same time.
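
In sketch form, using 2.6.18's types (this shows the shape of that walk,
not literal kernel code):

	while (!list_empty(timers)) {
		struct cpu_timer_list *t = list_entry(timers->next,
				struct cpu_timer_list, entry);
		if (cputime_lt(ptime, t->expires.cpu))
			break;	/* sorted list: the rest are still pending */
		list_del_init(&t->entry);
		/* ... queue t to be fired ... */
	}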

Back for a moment to the status quo and your second patch.

> The other issue has to do with the rest of the processing in
> run_posix_cpu_timers(), walking the timer lists and walking the whole
> thread group (again) to rebalance expiry times.  My second patch moved
> all that work to a workqueue, but only if there were more than 100
> threads in the process.  This basically papered over the problem by
> moving the processing out of interrupt and into a kernel thread.  It's
> still insufficient, though, because it takes just as long and will get
> backed up just as badly on large numbers of threads.  This was made
> clear in a test I ran yesterday where I generated some 200,000 threads.
> The work queue was unreasonably large, as you might expect.

What I would expect is that there be at most one item in the queue for
each process (thread group).  If you have 200000 threads in one process,
you still only need one iteration of check_process_timers to run.  If it
hasn't run by the time more threads in the same group get more ticks,
then all that matters is that it indeed runs once reasonably soon (for
an overall effect of not much less often than once per tick interval).
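
That is, something like the sketch below, where the pending flag, the
work item, and the workqueue are all assumed new additions, not existing
code (2.6.18-era workqueue API, where work functions take a void *):

	/* tick path: schedule at most one deferred check per thread group */
	if (!test_and_set_bit(0, &sig->cpu_timer_pending))
		queue_work(posix_cpu_wq, &sig->cpu_timer_work);

	/* work function: clear the flag first, then check the group once */
	static void cpu_timer_workfn(void *data)
	{
		struct signal_struct *sig = data;

		clear_bit(0, &sig->cpu_timer_pending);
		/* ... run check_process_timers() once for this group ... */
	}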

> I am looking for a way to do everything that needs to be done in fewer
> operations, but unfortunately I'm not familiar enough with the
> SIGPROF/SIGVTALRM semantics or with the details of the Linux
> implementation to know where it is safe to consolidate things.

I can help you with all of that.  What I'll need from you is careful
performance analysis of all the effects of any changes we consider.

The simplifications I described above will obviously greatly improve
your test case (many threads and with some process timers expiring
pretty frequently).  We need to consider and analyze the other kinds of
cases too.  That is, cases with a few threads (not many more than the
number of CPUs); cases where no timer is close to expiring very often.
The most common cases, from one-thread cases to one-million thread
cases, are when no timers are going off before next Tuesday (if any are
set at all).  Then run_posix_cpu_timers always bails out early, and none
of the costs you've seen become relevant at all.  Any change to what the
timer interrupt handler does on every tick affects those cases too.

As I mentioned in my last message, my concern about this originally was
with the SMP cache/lock effects of multiple CPUs touching the same
memory in signal_struct on every tick (which presumably all tend to
happen simultaneously on all the CPUs).  I'd insist that we have
measurements and analysis as thorough as possible of the effects of
introducing that frequent/synchronized sharing, before endorsing such
changes.  I have a couple of guesses as to what might be reasonable ways
to mitigate that.  But it needs a lot of measurement and wise opinion on
the low-level performance effects of each proposal.


Thanks,
Roland


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
  2008-03-05  4:08                         ` Roland McGrath
@ 2008-03-06 19:04                           ` Frank Mayhar
  2008-03-11  7:50                             ` posix-cpu-timers revamp Roland McGrath
  2008-03-07 23:26                           ` [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF Frank Mayhar
  1 sibling, 1 reply; 51+ messages in thread
From: Frank Mayhar @ 2008-03-06 19:04 UTC (permalink / raw)
  To: Roland McGrath
  Cc: parag.warudkar, Alejandro Riveira Fernández, Andrew Morton,
	linux-kernel, Ingo Molnar, Thomas Gleixner, Jakub Jelinek

On Tue, 2008-03-04 at 20:08 -0800, Roland McGrath wrote:
> check_process_timers can look just like check_thread_timers, but
> consulting the shared fields instead of the per-thread ones for both the
> clock accumulators and the timers' expiry times.  Likewise, arm_timer
> only has to set signal->it_*_expires; process_timer_rebalance goes away.

Okay, my understanding of this is still evolving, so please (please!)
correct me when I get it wrong.  I take what you're saying to mean that,
first, run_posix_cpu_timers() only needs to be run once per thread
group.  It _sounds_ like it should be checking the shared fields rather
than the per-task fields for timer expiration (in fact, the more I think
about it the more sure I am that that's the case).

The old process_timer_rebalance() routine was intended to distribute the
remaining ticks across all the threads, so that the per-task fields
would cause run_posix_cpu_timers() to run at the appropriate time.  With
it checking the shared fields instead, this is no longer necessary.

Since the shared fields are getting all the ticks, this will work for
per-thread timers as well.

The arm_timer() routine, instead of calling process_timer_rebalance(),
should just directly set signal->it_*_expires to the expiration time,
e.g.:
			switch (CPUCLOCK_WHICH(timer->it_clock)) {
			default:
				BUG();
			case CPUCLOCK_VIRT:
				if (!cputime_eq(p->signal->it_virt_expires,
						cputime_zero) &&
				    cputime_lt(p->signal->it_virt_expires,
					       timer->it.cpu.expires.cpu))
					break;
				p->signal->it_virt_expires = timer->it.cpu.expires.cpu;
				goto rebalance;
			case CPUCLOCK_PROF:
				if (!cputime_eq(p->signal->it_prof_expires,
						cputime_zero) &&
				    cputime_lt(p->signal->it_prof_expires,
					       timer->it.cpu.expires.cpu))
					break;
				i = p->signal->rlim[RLIMIT_CPU].rlim_cur;
				if (i != RLIM_INFINITY &&
				    i <= cputime_to_secs(timer->it.cpu.expires.cpu))
					break;
				p->signal->it_prof_expires = timer->it.cpu.expires.cpu;
				goto rebalance;
			case CPUCLOCK_SCHED:
				p->signal->it_sched_expires = timer->it.cpu.expires.sched;
				break;
			}


> If you do all that then the time spent in run_posix_cpu_timers should
> not be affected at all by the number of threads.  The only "walking the
> timer lists" that happens is popping the expired timers off the head of
> the lists that are kept in ascending order of expiry time.  For each
> flavor of timer, there are n+1 steps in the "walk" for n timers that
> have expired.  So already no costs here should scale with the number of
> timers, just the with the number of timers that all expire at the same time.

It's still probably worthwhile to defer processing to a workqueue
thread, though, just because it's still a lot to do at interrupt.  I'll
probably end up trying it both ways.

One thing that's still unclear to me is, if there were only one run of
run_posix_cpu_timers() per thread group per tick, how would per-thread
timers be serviced?

> The simplifications I described above will obviously greatly improve
> your test case (many threads and with some process timers expiring
> pretty frequently).  We need to consider and analyze the other kinds of
> cases too.  That is, cases with a few threads (not many more than the
> number of CPUs); cases where no timer is close to expiring very often.
> The most common cases, from one-thread cases to one-million thread
> cases, are when no timers are going off before next Tuesday (if any are
> set at all).  Then run_posix_cpu_timers always bails out early, and none
> of the costs you've seen become relevant at all.  Any change to what the
> timer interrupt handler does on every tick affects those cases too.

These are all on the roadmap, and in fact the null case should already
be covered. :-)

> As I mentioned in my last message, my concern about this originally was
> with the SMP cache/lock effects of multiple CPUs touching the same
> memory in signal_struct on every tick (which presumably all tend to
> happen simultaneously on all the CPUs).  I'd insist that we have
> measurements and analysis as thorough as possible of the effects of
> introducing that frequent/synchronized sharing, before endorsing such
> changes.  I have a couple of guesses as to what might be reasonable ways
> to mitigate that.  But it needs a lot of measurement and wise opinion on
> the low-level performance effects of each proposal.

I've given this some thought.  It seems clear that there's going to be
some performance penalty when multiple CPUs are competing trying to
update the same field at the tick.  It would be much better if there
were cacheline-aligned per-cpu fields associated with either the task or
the signal structure; that way each CPU could update its own field
without competing with the others.  Processing in run_posix_cpu_timers
would still be constant, although slightly higher for having to consult
multiple fields instead of just one.  Not one per thread, though, just
one per CPU, a much smaller and fixed number.
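
Roughly like this, say (layout and names purely illustrative):

	/* one set of counters per CPU, each on its own cacheline */
	struct group_cputime {
		cputime_t utime;
		cputime_t stime;
		unsigned long long sched_time;
	} ____cacheline_aligned_in_smp;

	/* in signal_struct: */
	struct group_cputime percpu_time[NR_CPUS];

	/* at the tick, each CPU touches only its own entry */
	struct group_cputime *gc =
		&p->signal->percpu_time[smp_processor_id()];
	gc->utime = cputime_add(gc->utime, cputime);

The reader (run_posix_cpu_timers() or the workqueue) would then sum the
NR_CPUS entries, a fixed cost independent of the thread count.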
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
  2008-03-05  4:08                         ` Roland McGrath
  2008-03-06 19:04                           ` Frank Mayhar
@ 2008-03-07 23:26                           ` Frank Mayhar
  2008-03-08  0:01                             ` Frank Mayhar
  1 sibling, 1 reply; 51+ messages in thread
From: Frank Mayhar @ 2008-03-07 23:26 UTC (permalink / raw)
  To: Roland McGrath
  Cc: parag.warudkar, Alejandro Riveira Fernández, Andrew Morton,
	linux-kernel, Ingo Molnar, Thomas Gleixner, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 3388 bytes --]

Based on Roland's comments and from reading the source, I have a
possible fix.  I'm posting the attached patch _not_ for submission but
_only_ for comment.  For one thing it's based on 2.6.18.5 and for
another it hasn't had much testing yet.  I wanted to get it out here for
comment, though, in case anyone can see where I might have gone wrong.
Comments, criticism and (especially!) testing enthusiastically
requested.

From my notes, this patch:

	Replaces the utime, stime and sched_time fields in signal_struct with
	shared_utime, shared_stime and shared_schedtime, respectively.  It
	also adds it_sched_expires to the signal struct.

	Each place that loops through all threads in a thread group to sum
	task->utime and/or task->stime now loads the value from
	task->signal->shared_[us]time.  This includes compat_sys_times(),
	do_task_stat(), do_getitimer(), sys_times() and k_getrusage().

	Certain routines that used task->signal->[us]time now use the shared
	fields instead, which may change their semantics slightly.  These
	include fill_prstatus() (in fs/binfmt_elf.c), do_task_stat() (in
	fs/proc/array.c), wait_task_zombie() and do_notify_parent().

	The shared fields are updated at each tick, in update_cpu_clock()
	(shared_schedtime), account_user_time() (shared_utime) and
	account_system_time() (shared_stime).  Each of these functions updates
	the task-private field followed by the shared version in the signal
	structure if one is present.  Note that if different threads of the
	same process are being run by different CPUs at the tick, there may
	be serious cache contention here.

	Finally, kernel/posix-cpu-timers.c has changed quite dramatically.
	First, run_posix_cpu_timers() decides whether a timer has expired by
	consulting the it_*_expires and shared_* fields in the signal struct.
	The check_process_timers() routine bases its computations on the new
	shared fields, removing two loops through the threads.  "Rebalancing"
	is no longer required: the process_timer_rebalance() routine has
	disappeared entirely and the arm_timer() routine merely fills
	p->signal->it_*_expires from timer->it.cpu.expires.*.  The
	cpu_clock_sample_group_locked() routine loses its summing loops, consulting
	the shared fields instead.  Finally, set_process_cpu_timer() sets
	tsk->signal->it_*_expires directly rather than calling the deleted
	rebalance routine.

	There are still a few open questions.  In particular, it's possible
	that cache contention on the tick update of the shared fields could
	mean that the current scheme is not entirely sufficient.  Further,
	the semantics of the status-returning routines fill_prstatus(),
	do_task_stat(), wait_task_zombie() and do_notify_parent() may no longer
	follow standards.  For that matter, ITIMER_PROF handling may be broken
	entirely, although a brief test seems to show that it's working fine.

Stats:
 fs/binfmt_elf.c           |   18 +--
 fs/proc/array.c           |    6 -
 include/linux/sched.h     |   10 +-
 kernel/exit.c             |   13 --
 kernel/fork.c             |   25 +----
 kernel/itimer.c           |   18 ---
 kernel/posix-cpu-timers.c |  224 ++++++++++------------------------------------
 kernel/sched.c            |   16 +++
 kernel/signal.c           |    6 -
 kernel/sys.c              |   17 ---
 10 files changed, 105 insertions(+), 248 deletions(-)
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.

[-- Attachment #2: posix-timers.patch --]
[-- Type: text/x-patch, Size: 21997 bytes --]

diff -rup /home/fmayhar/Static/linux-2.6.18.5/fs/binfmt_elf.c linux-2.6.18.5/fs/binfmt_elf.c
--- /home/fmayhar/Static/linux-2.6.18.5/fs/binfmt_elf.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/fs/binfmt_elf.c	2008-03-06 15:05:34.000000000 -0800
@@ -1302,17 +1302,17 @@ static void fill_prstatus(struct elf_prs
 	if (thread_group_leader(p)) {
 		/*
 		 * This is the record for the group leader.  Add in the
-		 * cumulative times of previous dead threads.  This total
-		 * won't include the time of each live thread whose state
-		 * is included in the core dump.  The final total reported
-		 * to our parent process when it calls wait4 will include
-		 * those sums as well as the little bit more time it takes
-		 * this and each other thread to finish dying after the
-		 * core dump synchronization phase.
+		 * cumulative times of all threads.  This total includes
+		 * the time of each live thread whose state is included in
+		 * the core dump.  The final total reported to our parent
+		 * process when it calls wait4 will include those sums as
+		 * well as the little bit more time it takes this and each
+		 * other thread to finish dying after the core dump
+		 * synchronization phase.
 		 */
-		cputime_to_timeval(cputime_add(p->utime, p->signal->utime),
+		cputime_to_timeval(p->signal->shared_utime,
 				   &prstatus->pr_utime);
-		cputime_to_timeval(cputime_add(p->stime, p->signal->stime),
+		cputime_to_timeval(p->signal->shared_stime,
 				   &prstatus->pr_stime);
 	} else {
 		cputime_to_timeval(p->utime, &prstatus->pr_utime);
diff -rup /home/fmayhar/Static/linux-2.6.18.5/fs/proc/array.c linux-2.6.18.5/fs/proc/array.c
--- /home/fmayhar/Static/linux-2.6.18.5/fs/proc/array.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/fs/proc/array.c	2008-03-06 17:43:19.000000000 -0800
@@ -359,8 +359,6 @@ static int do_task_stat(struct task_stru
 			do {
 				min_flt += t->min_flt;
 				maj_flt += t->maj_flt;
-				utime = cputime_add(utime, t->utime);
-				stime = cputime_add(stime, t->stime);
 				t = next_thread(t);
 			} while (t != task);
 		}
@@ -382,8 +380,8 @@ static int do_task_stat(struct task_stru
 		if (whole) {
 			min_flt += task->signal->min_flt;
 			maj_flt += task->signal->maj_flt;
-			utime = cputime_add(utime, task->signal->utime);
-			stime = cputime_add(stime, task->signal->stime);
+			utime = task->signal->shared_utime;
+			stime = task->signal->shared_stime;
 		}
 	}
 	ppid = pid_alive(task) ? task->group_leader->real_parent->tgid : 0;
diff -rup /home/fmayhar/Static/linux-2.6.18.5/include/linux/sched.h linux-2.6.18.5/include/linux/sched.h
--- /home/fmayhar/Static/linux-2.6.18.5/include/linux/sched.h	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/include/linux/sched.h	2008-03-07 13:45:35.000000000 -0800
@@ -413,6 +413,7 @@ struct signal_struct {
 	/* ITIMER_PROF and ITIMER_VIRTUAL timers for the process */
 	cputime_t it_prof_expires, it_virt_expires;
 	cputime_t it_prof_incr, it_virt_incr;
+	unsigned long long it_sched_expires;
 
 	/* job control IDs */
 	pid_t pgrp;
@@ -429,17 +430,22 @@ struct signal_struct {
 	 * Live threads maintain their own counters and add to these
 	 * in __exit_signal, except for the group leader.
 	 */
-	cputime_t utime, stime, cutime, cstime;
+	cputime_t cutime, cstime;
 	unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
 	unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
 
 	/*
+	 * Cumulative CPU time for all threads in the group including the
+	 * group leader.
+	 */
+	cputime_t shared_utime, shared_stime;
+	/*
 	 * Cumulative ns of scheduled CPU time for dead threads in the
 	 * group, not including a zombie group leader.  (This only differs
 	 * from jiffies_to_ns(utime + stime) if sched_clock uses something
 	 * other than jiffies.)
 	 */
-	unsigned long long sched_time;
+	unsigned long long shared_schedtime;
 
 	/*
 	 * We don't bother to synchronize most readers of this at all,
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/exit.c linux-2.6.18.5/kernel/exit.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/exit.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/exit.c	2008-03-06 15:48:37.000000000 -0800
@@ -103,13 +103,10 @@ static void __exit_signal(struct task_st
 		 * We won't ever get here for the group leader, since it
 		 * will have been the last reference on the signal_struct.
 		 */
-		sig->utime = cputime_add(sig->utime, tsk->utime);
-		sig->stime = cputime_add(sig->stime, tsk->stime);
 		sig->min_flt += tsk->min_flt;
 		sig->maj_flt += tsk->maj_flt;
 		sig->nvcsw += tsk->nvcsw;
 		sig->nivcsw += tsk->nivcsw;
-		sig->sched_time += tsk->sched_time;
 		sig = NULL; /* Marker for below. */
 	}
 
@@ -1165,14 +1162,12 @@ static int wait_task_zombie(struct task_
 		sig = p->signal;
 		psig->cutime =
 			cputime_add(psig->cutime,
-			cputime_add(p->utime,
-			cputime_add(sig->utime,
-				    sig->cutime)));
+			cputime_add(sig->shared_utime,
+				    sig->cutime));
 		psig->cstime =
 			cputime_add(psig->cstime,
-			cputime_add(p->stime,
-			cputime_add(sig->stime,
-				    sig->cstime)));
+			cputime_add(sig->shared_stime,
+				    sig->cstime));
 		psig->cmin_flt +=
 			p->min_flt + sig->min_flt + sig->cmin_flt;
 		psig->cmaj_flt +=
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/fork.c linux-2.6.18.5/kernel/fork.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/fork.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/fork.c	2008-03-07 14:23:22.000000000 -0800
@@ -855,14 +855,18 @@ static inline int copy_signal(unsigned l
 	sig->it_virt_incr = cputime_zero;
 	sig->it_prof_expires = cputime_zero;
 	sig->it_prof_incr = cputime_zero;
+ 	sig->it_sched_expires = 0;
 
 	sig->leader = 0;	/* session leadership doesn't inherit */
 	sig->tty_old_pgrp = 0;
 
-	sig->utime = sig->stime = sig->cutime = sig->cstime = cputime_zero;
+	sig->shared_utime = cputime_zero;
+	sig->shared_stime = cputime_zero;
+	sig->shared_schedtime = 0;
+
+	sig->cutime = sig->cstime = cputime_zero;
 	sig->nvcsw = sig->nivcsw = sig->cnvcsw = sig->cnivcsw = 0;
 	sig->min_flt = sig->maj_flt = sig->cmin_flt = sig->cmaj_flt = 0;
-	sig->sched_time = 0;
 	INIT_LIST_HEAD(&sig->cpu_timers[0]);
 	INIT_LIST_HEAD(&sig->cpu_timers[1]);
 	INIT_LIST_HEAD(&sig->cpu_timers[2]);
@@ -877,7 +881,7 @@ static inline int copy_signal(unsigned l
 		 * New sole thread in the process gets an expiry time
 		 * of the whole CPU time limit.
 		 */
-		tsk->it_prof_expires =
+		tsk->signal->it_prof_expires =
 			secs_to_cputime(sig->rlim[RLIMIT_CPU].rlim_cur);
 	}
 	acct_init_pacct(&sig->pacct);
@@ -1204,21 +1208,6 @@ static struct task_struct *copy_process(
 	if (clone_flags & CLONE_THREAD) {
 		p->group_leader = current->group_leader;
 		list_add_tail_rcu(&p->thread_group, &p->group_leader->thread_group);
-
-		if (!cputime_eq(current->signal->it_virt_expires,
-				cputime_zero) ||
-		    !cputime_eq(current->signal->it_prof_expires,
-				cputime_zero) ||
-		    current->signal->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY ||
-		    !list_empty(&current->signal->cpu_timers[0]) ||
-		    !list_empty(&current->signal->cpu_timers[1]) ||
-		    !list_empty(&current->signal->cpu_timers[2])) {
-			/*
-			 * Have child wake up on its first tick to check
-			 * for process CPU timers.
-			 */
-			p->it_prof_expires = jiffies_to_cputime(1);
-		}
 	}
 
 	/*
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/itimer.c linux-2.6.18.5/kernel/itimer.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/itimer.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/itimer.c	2008-03-06 15:49:17.000000000 -0800
@@ -61,12 +61,7 @@ int do_getitimer(int which, struct itime
 		cval = tsk->signal->it_virt_expires;
 		cinterval = tsk->signal->it_virt_incr;
 		if (!cputime_eq(cval, cputime_zero)) {
-			struct task_struct *t = tsk;
-			cputime_t utime = tsk->signal->utime;
-			do {
-				utime = cputime_add(utime, t->utime);
-				t = next_thread(t);
-			} while (t != tsk);
+			cputime_t utime = tsk->signal->shared_utime;
 			if (cputime_le(cval, utime)) { /* about to fire */
 				cval = jiffies_to_cputime(1);
 			} else {
@@ -84,15 +79,8 @@ int do_getitimer(int which, struct itime
 		cval = tsk->signal->it_prof_expires;
 		cinterval = tsk->signal->it_prof_incr;
 		if (!cputime_eq(cval, cputime_zero)) {
-			struct task_struct *t = tsk;
-			cputime_t ptime = cputime_add(tsk->signal->utime,
-						      tsk->signal->stime);
-			do {
-				ptime = cputime_add(ptime,
-						    cputime_add(t->utime,
-								t->stime));
-				t = next_thread(t);
-			} while (t != tsk);
+			cputime_t ptime = cputime_add(tsk->signal->shared_utime,
+						      tsk->signal->shared_stime);
 			if (cputime_le(cval, ptime)) { /* about to fire */
 				cval = jiffies_to_cputime(1);
 			} else {
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/posix-cpu-timers.c linux-2.6.18.5/kernel/posix-cpu-timers.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/posix-cpu-timers.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/posix-cpu-timers.c	2008-03-07 12:57:23.000000000 -0800
@@ -164,6 +164,15 @@ static inline unsigned long long sched_n
 	return (p == current) ? current_sched_time(p) : p->sched_time;
 }
 
+static inline cputime_t prof_shared_ticks(struct task_struct *p)
+{
+	return cputime_add(p->signal->shared_utime, p->signal->shared_stime);
+}
+static inline cputime_t virt_shared_ticks(struct task_struct *p)
+{
+	return p->signal->shared_utime;
+}
+
 int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec *tp)
 {
 	int error = check_clock(which_clock);
@@ -227,31 +236,17 @@ static int cpu_clock_sample_group_locked
 					 struct task_struct *p,
 					 union cpu_time_count *cpu)
 {
-	struct task_struct *t = p;
  	switch (clock_idx) {
 	default:
 		return -EINVAL;
 	case CPUCLOCK_PROF:
-		cpu->cpu = cputime_add(p->signal->utime, p->signal->stime);
-		do {
-			cpu->cpu = cputime_add(cpu->cpu, prof_ticks(t));
-			t = next_thread(t);
-		} while (t != p);
+		cpu->cpu = cputime_add(p->signal->shared_utime, p->signal->shared_stime);
 		break;
 	case CPUCLOCK_VIRT:
-		cpu->cpu = p->signal->utime;
-		do {
-			cpu->cpu = cputime_add(cpu->cpu, virt_ticks(t));
-			t = next_thread(t);
-		} while (t != p);
+		cpu->cpu = p->signal->shared_utime;
 		break;
 	case CPUCLOCK_SCHED:
-		cpu->sched = p->signal->sched_time;
-		/* Add in each other live thread.  */
-		while ((t = next_thread(t)) != p) {
-			cpu->sched += t->sched_time;
-		}
-		cpu->sched += sched_ns(p);
+		cpu->sched = p->signal->shared_schedtime;
 		break;
 	}
 	return 0;
@@ -468,79 +463,9 @@ void posix_cpu_timers_exit(struct task_s
 void posix_cpu_timers_exit_group(struct task_struct *tsk)
 {
 	cleanup_timers(tsk->signal->cpu_timers,
-		       cputime_add(tsk->utime, tsk->signal->utime),
-		       cputime_add(tsk->stime, tsk->signal->stime),
-		       tsk->sched_time + tsk->signal->sched_time);
-}
-
-
-/*
- * Set the expiry times of all the threads in the process so one of them
- * will go off before the process cumulative expiry total is reached.
- */
-static void process_timer_rebalance(struct task_struct *p,
-				    unsigned int clock_idx,
-				    union cpu_time_count expires,
-				    union cpu_time_count val)
-{
-	cputime_t ticks, left;
-	unsigned long long ns, nsleft;
- 	struct task_struct *t = p;
-	unsigned int nthreads = atomic_read(&p->signal->live);
-
-	if (!nthreads)
-		return;
-
-	switch (clock_idx) {
-	default:
-		BUG();
-		break;
-	case CPUCLOCK_PROF:
-		left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
-				       nthreads);
-		do {
-			if (likely(!(t->flags & PF_EXITING))) {
-				ticks = cputime_add(prof_ticks(t), left);
-				if (cputime_eq(t->it_prof_expires,
-					       cputime_zero) ||
-				    cputime_gt(t->it_prof_expires, ticks)) {
-					t->it_prof_expires = ticks;
-				}
-			}
-			t = next_thread(t);
-		} while (t != p);
-		break;
-	case CPUCLOCK_VIRT:
-		left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
-				       nthreads);
-		do {
-			if (likely(!(t->flags & PF_EXITING))) {
-				ticks = cputime_add(virt_ticks(t), left);
-				if (cputime_eq(t->it_virt_expires,
-					       cputime_zero) ||
-				    cputime_gt(t->it_virt_expires, ticks)) {
-					t->it_virt_expires = ticks;
-				}
-			}
-			t = next_thread(t);
-		} while (t != p);
-		break;
-	case CPUCLOCK_SCHED:
-		nsleft = expires.sched - val.sched;
-		do_div(nsleft, nthreads);
-		nsleft = max_t(unsigned long long, nsleft, 1);
-		do {
-			if (likely(!(t->flags & PF_EXITING))) {
-				ns = t->sched_time + nsleft;
-				if (t->it_sched_expires == 0 ||
-				    t->it_sched_expires > ns) {
-					t->it_sched_expires = ns;
-				}
-			}
-			t = next_thread(t);
-		} while (t != p);
-		break;
-	}
+		       tsk->signal->shared_utime,
+		       tsk->signal->shared_stime,
+		       tsk->signal->shared_schedtime);
 }
 
 static void clear_dead_task(struct k_itimer *timer, union cpu_time_count now)
@@ -637,7 +562,8 @@ static void arm_timer(struct k_itimer *t
 				    cputime_lt(p->signal->it_virt_expires,
 					       timer->it.cpu.expires.cpu))
 					break;
-				goto rebalance;
+				p->signal->it_virt_expires = timer->it.cpu.expires.cpu;
+				break;
 			case CPUCLOCK_PROF:
 				if (!cputime_eq(p->signal->it_prof_expires,
 						cputime_zero) &&
@@ -648,13 +574,10 @@ static void arm_timer(struct k_itimer *t
 				if (i != RLIM_INFINITY &&
 				    i <= cputime_to_secs(timer->it.cpu.expires.cpu))
 					break;
-				goto rebalance;
+				p->signal->it_prof_expires = timer->it.cpu.expires.cpu;
+				break;
 			case CPUCLOCK_SCHED:
-			rebalance:
-				process_timer_rebalance(
-					timer->it.cpu.task,
-					CPUCLOCK_WHICH(timer->it_clock),
-					timer->it.cpu.expires, now);
+				p->signal->it_sched_expires = timer->it.cpu.expires.sched;
 				break;
 			}
 		}
@@ -1020,7 +943,6 @@ static void check_process_timers(struct 
 	struct signal_struct *const sig = tsk->signal;
 	cputime_t utime, stime, ptime, virt_expires, prof_expires;
 	unsigned long long sched_time, sched_expires;
-	struct task_struct *t;
 	struct list_head *timers = sig->cpu_timers;
 
 	/*
@@ -1037,16 +959,10 @@ static void check_process_timers(struct 
 	/*
 	 * Collect the current process totals.
 	 */
-	utime = sig->utime;
-	stime = sig->stime;
-	sched_time = sig->sched_time;
-	t = tsk;
-	do {
-		utime = cputime_add(utime, t->utime);
-		stime = cputime_add(stime, t->stime);
-		sched_time += t->sched_time;
-		t = next_thread(t);
-	} while (t != tsk);
+	utime = sig->shared_utime;
+	stime = sig->shared_stime;
+	sched_time = sig->shared_schedtime;
+
 	ptime = cputime_add(utime, stime);
 
 	maxfire = 20;
@@ -1156,60 +1072,18 @@ static void check_process_timers(struct 
 		}
 	}
 
-	if (!cputime_eq(prof_expires, cputime_zero) ||
-	    !cputime_eq(virt_expires, cputime_zero) ||
-	    sched_expires != 0) {
-		/*
-		 * Rebalance the threads' expiry times for the remaining
-		 * process CPU timers.
-		 */
-
-		cputime_t prof_left, virt_left, ticks;
-		unsigned long long sched_left, sched;
-		const unsigned int nthreads = atomic_read(&sig->live);
-
-		if (!nthreads)
-			return;
-
-		prof_left = cputime_sub(prof_expires, utime);
-		prof_left = cputime_sub(prof_left, stime);
-		prof_left = cputime_div_non_zero(prof_left, nthreads);
-		virt_left = cputime_sub(virt_expires, utime);
-		virt_left = cputime_div_non_zero(virt_left, nthreads);
-		if (sched_expires) {
-			sched_left = sched_expires - sched_time;
-			do_div(sched_left, nthreads);
-			sched_left = max_t(unsigned long long, sched_left, 1);
-		} else {
-			sched_left = 0;
-		}
-		t = tsk;
-		do {
-			if (unlikely(t->flags & PF_EXITING))
-				continue;
-
-			ticks = cputime_add(cputime_add(t->utime, t->stime),
-					    prof_left);
-			if (!cputime_eq(prof_expires, cputime_zero) &&
-			    (cputime_eq(t->it_prof_expires, cputime_zero) ||
-			     cputime_gt(t->it_prof_expires, ticks))) {
-				t->it_prof_expires = ticks;
-			}
-
-			ticks = cputime_add(t->utime, virt_left);
-			if (!cputime_eq(virt_expires, cputime_zero) &&
-			    (cputime_eq(t->it_virt_expires, cputime_zero) ||
-			     cputime_gt(t->it_virt_expires, ticks))) {
-				t->it_virt_expires = ticks;
-			}
-
-			sched = t->sched_time + sched_left;
-			if (sched_expires && (t->it_sched_expires == 0 ||
-					      t->it_sched_expires > sched)) {
-				t->it_sched_expires = sched;
-			}
-		} while ((t = next_thread(t)) != tsk);
-	}
+	if (!cputime_eq(prof_expires, cputime_zero) &&
+	    (cputime_eq(sig->it_prof_expires, cputime_zero) ||
+	     cputime_gt(sig->it_prof_expires, prof_expires)))
+		sig->it_prof_expires = prof_expires;
+	if (!cputime_eq(virt_expires, cputime_zero) &&
+	    (cputime_eq(sig->it_virt_expires, cputime_zero) ||
+	     cputime_gt(sig->it_virt_expires, virt_expires)))
+		sig->it_virt_expires = virt_expires;
+	if (sched_expires != 0 &&
+	    (sig->it_sched_expires == 0 ||
+	     sig->it_sched_expires > sched_expires))
+		sig->it_sched_expires = sched_expires;
 }
 
 /*
@@ -1289,13 +1163,16 @@ void run_posix_cpu_timers(struct task_st
 
 	BUG_ON(!irqs_disabled());
 
+	if (!tsk->signal)
+		return;
+
 #define UNEXPIRED(clock) \
-		(cputime_eq(tsk->it_##clock##_expires, cputime_zero) || \
-		 cputime_lt(clock##_ticks(tsk), tsk->it_##clock##_expires))
+		(cputime_eq(tsk->signal->it_##clock##_expires, cputime_zero) || \
+		 cputime_lt(clock##_shared_ticks(tsk), tsk->signal->it_##clock##_expires))
 
 	if (UNEXPIRED(prof) && UNEXPIRED(virt) &&
-	    (tsk->it_sched_expires == 0 ||
-	     tsk->sched_time < tsk->it_sched_expires))
+	    (tsk->signal->it_sched_expires == 0 ||
+	     tsk->signal->shared_schedtime < tsk->signal->it_sched_expires))
 		return;
 
 #undef	UNEXPIRED
@@ -1398,13 +1275,14 @@ void set_process_cpu_timer(struct task_s
 	    cputime_ge(list_entry(head->next,
 				  struct cpu_timer_list, entry)->expires.cpu,
 		       *newval)) {
-		/*
-		 * Rejigger each thread's expiry time so that one will
-		 * notice before we hit the process-cumulative expiry time.
-		 */
-		union cpu_time_count expires = { .sched = 0 };
-		expires.cpu = *newval;
-		process_timer_rebalance(tsk, clock_idx, expires, now);
+		switch (clock_idx) {
+		case CPUCLOCK_PROF:
+			tsk->signal->it_prof_expires = *newval;
+			break;
+		case CPUCLOCK_VIRT:
+			tsk->signal->it_virt_expires = *newval;
+			break;
+		}
 	}
 }
 
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/sched.c linux-2.6.18.5/kernel/sched.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/sched.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/sched.c	2008-03-07 09:36:43.000000000 -0800
@@ -2901,7 +2901,13 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 static inline void
 update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now)
 {
-	p->sched_time += now - max(p->timestamp, rq->timestamp_last_tick);
+	unsigned long long tmp;
+
+	tmp = now - max(p->timestamp, rq->timestamp_last_tick);
+	p->sched_time += tmp;
+	/* Add our time to the shared field. */
+	if (p->signal)
+		p->signal->shared_schedtime += tmp;
 }
 
 /*
@@ -2955,6 +2961,10 @@ void account_user_time(struct task_struc
 
 	p->utime = cputime_add(p->utime, cputime);
 
+	/* Add our time to the shared field. */
+	if (p->signal)
+		p->signal->shared_utime = cputime_add(p->signal->shared_utime, cputime);
+
 	/* Add user time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
 	if (TASK_NICE(p) > 0)
@@ -2978,6 +2988,10 @@ void account_system_time(struct task_str
 
 	p->stime = cputime_add(p->stime, cputime);
 
+	/* Add our time to the shared field. */
+	if (p->signal)
+		p->signal->shared_stime = cputime_add(p->signal->shared_stime, cputime);
+
 	/* Add system time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
 	if (hardirq_count() - hardirq_offset)
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/signal.c linux-2.6.18.5/kernel/signal.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/signal.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/signal.c	2008-03-06 15:50:31.000000000 -0800
@@ -1447,10 +1447,8 @@ void do_notify_parent(struct task_struct
 	info.si_uid = tsk->uid;
 
 	/* FIXME: find out whether or not this is supposed to be c*time. */
-	info.si_utime = cputime_to_jiffies(cputime_add(tsk->utime,
-						       tsk->signal->utime));
-	info.si_stime = cputime_to_jiffies(cputime_add(tsk->stime,
-						       tsk->signal->stime));
+	info.si_utime = cputime_to_jiffies(tsk->signal->shared_utime);
+	info.si_stime = cputime_to_jiffies(tsk->signal->shared_stime);
 
 	info.si_status = tsk->exit_code & 0x7f;
 	if (tsk->exit_code & 0x80)
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/sys.c linux-2.6.18.5/kernel/sys.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/sys.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/sys.c	2008-03-06 15:51:37.000000000 -0800
@@ -1207,18 +1207,11 @@ asmlinkage long sys_times(struct tms __u
 	if (tbuf) {
 		struct tms tmp;
 		struct task_struct *tsk = current;
-		struct task_struct *t;
 		cputime_t utime, stime, cutime, cstime;
 
 		spin_lock_irq(&tsk->sighand->siglock);
-		utime = tsk->signal->utime;
-		stime = tsk->signal->stime;
-		t = tsk;
-		do {
-			utime = cputime_add(utime, t->utime);
-			stime = cputime_add(stime, t->stime);
-			t = next_thread(t);
-		} while (t != tsk);
+		utime = tsk->signal->shared_utime;
+		stime = tsk->signal->shared_stime;
 
 		cutime = tsk->signal->cutime;
 		cstime = tsk->signal->cstime;
@@ -1910,16 +1903,14 @@ static void k_getrusage(struct task_stru
 				break;
 
 		case RUSAGE_SELF:
-			utime = cputime_add(utime, p->signal->utime);
-			stime = cputime_add(stime, p->signal->stime);
+			utime = cputime_add(utime, p->signal->shared_utime);
+			stime = cputime_add(stime, p->signal->shared_stime);
 			r->ru_nvcsw += p->signal->nvcsw;
 			r->ru_nivcsw += p->signal->nivcsw;
 			r->ru_minflt += p->signal->min_flt;
 			r->ru_majflt += p->signal->maj_flt;
 			t = p;
 			do {
-				utime = cputime_add(utime, t->utime);
-				stime = cputime_add(stime, t->stime);
 				r->ru_nvcsw += t->nvcsw;
 				r->ru_nivcsw += t->nivcsw;
 				r->ru_minflt += t->min_flt;

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
  2008-03-07 23:26                           ` [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF Frank Mayhar
@ 2008-03-08  0:01                             ` Frank Mayhar
  0 siblings, 0 replies; 51+ messages in thread
From: Frank Mayhar @ 2008-03-08  0:01 UTC (permalink / raw)
  To: Roland McGrath
  Cc: parag.warudkar, Alejandro Riveira Fernández, Andrew Morton,
	linux-kernel, Ingo Molnar, Thomas Gleixner, Jakub Jelinek

On Fri, 2008-03-07 at 15:26 -0800, Frank Mayhar wrote:
> Based on Roland's comments and from reading the source, I have a
> possible fix.  I'm posting the attached patch _not_ for submission but
> _only_ for comment.  For one thing it's based on 2.6.18.5 and for
> another it hasn't had much testing yet.  I wanted to get it out here for
> comment, though, in case anyone can see where I might have gone wrong.
> Comments, criticism and (especially!) testing enthusiastically
> requested.

The previous email was missing one small part of the patch, reproduced
below.  Remain calm.
----------------------------------BEGIN------------------------------------
diff -urp /home/fmayhar/Static/linux-2.6.18.5/kernel/compat.c linux-2.6.18.5/kernel/compat.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/compat.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/compat.c	2008-03-06 17:26:21.000000000 -0800
@@ -161,18 +161,11 @@ asmlinkage long compat_sys_times(struct 
 	if (tbuf) {
 		struct compat_tms tmp;
 		struct task_struct *tsk = current;
-		struct task_struct *t;
 		cputime_t utime, stime, cutime, cstime;
 
 		read_lock(&tasklist_lock);
-		utime = tsk->signal->utime;
-		stime = tsk->signal->stime;
-		t = tsk;
-		do {
-			utime = cputime_add(utime, t->utime);
-			stime = cputime_add(stime, t->stime);
-			t = next_thread(t);
-		} while (t != tsk);
+		utime = tsk->signal->shared_utime;
+		stime = tsk->signal->shared_stime;
 
 		/*
 		 * While we have tasklist_lock read-locked, no dying thread
-----------------------------------END-------------------------------------
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* posix-cpu-timers revamp
  2008-03-06 19:04                           ` Frank Mayhar
@ 2008-03-11  7:50                             ` Roland McGrath
  2008-03-11 21:05                               ` Frank Mayhar
  0 siblings, 1 reply; 51+ messages in thread
From: Roland McGrath @ 2008-03-11  7:50 UTC (permalink / raw)
  To: Frank Mayhar; +Cc: linux-kernel

[I changed the subject and trimmed the CC list, as this is now quite far
away from the "some mysterious NPTL problem" subject.  If anyone else
wanted to be individually CC'd, you can add them back in followups.]

> correct me when I get it wrong.  I take what you're saying to mean that,
> first, run_posix_cpu_timers() only needs to be run once per thread group.

Not quite.  check_process_timers only needs to be run once per thread
group (per interesting tick).

> It _sounds_ like it should be checking the shared fields rather than the
> per-task fields for timer expiration (in fact, the more I think about it
> the more sure I am that that's the case).

run_posix_cpu_timers does two things: thread CPU timers and process CPU
timers.  The thread CPU timers track the thread CPU clocks, which are
what the per-thread fields in task_struct count.  check_thread_timers
finds what thread CPU timers have fired.  The task_struct.it_*_expires
fields are set when there are thread CPU timers set on those clocks.

The process CPU timers track the process CPU clocks.  Each process CPU
clock (virt, prof, sched) is just the sum of the corresponding thread
CPU clock across all threads in the group.  In the original code, these
clocks are never maintained in any storage as such, but sampled by
summing all the thread clocks when a current value is needed.
check_process_timers finds what process CPU timers have fired.  The
signal_struct.it_*_expires fields are set when there are process CPU
timers set on those clocks.
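
To make that concrete, here is a rough sketch of that summing sample
for the prof clock, along the lines of what the original
cpu_clock_sample_group_locked() does (an illustration, not the exact
code; the caller holds the locks that keep the thread list stable):

	struct task_struct *t = p;
	cputime_t sample;

	/* Dead-thread totals banked in signal_struct, plus the
	 * counters of every live thread in the group. */
	sample = cputime_add(p->signal->utime, p->signal->stime);
	do {
		sample = cputime_add(sample,
				     cputime_add(t->utime, t->stime));
		t = next_thread(t);
	} while (t != p);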

The "rebalance" stuff also sets the task_struct.it_*_expires fields of
all the threads in the group when there are process CPU timers.  So each
of these fields is really the minimum of the expiration time of the
earliest thread CPU timer and the "balanced" sample-and-update time
computed from the earliest process CPU timer's expiration time.

> The old process_timer_rebalance() routine was intended to distribute the
> remaining ticks across all the threads, so that the per-task fields
> would cause run_posix_cpu_timers() to run at the appropriate time.  With
> it checking the shared fields this becomes no longer necessary.

This is true of check_process_timers.

> Since the shared fields are getting all the ticks, this will work for
> per-thread timers as well.

I do not follow your logic at all here.  The signal_struct fields being
proposed track each process CPU clock's value.  The thread CPU timers
react to the thread CPU clocks, not the process CPU clocks.

> The arm_timers() routine, instead of calling posix_timer_rebalance(),
> should just directly set signal->it_*_expires to the expiration time,

Correct.

> It's still probably worthwhile to defer processing to a workqueue
> thread, though, just because it's still a lot to do at interrupt.  I'll
> probably end up trying it both ways.

I think the natural place for it is run_timer_softirq.  This is where
natural time timers run.  The posix-cpu-timers processing is analogous
to the posix-timers processing, which runs via timer callbacks from
here.  That roughly means doing it right after the interrupt normally,
and falling back to something similar to a workqueue when the load is
too heavy.  In the common case, it doesn't entail any context switches,
as workqueues always do.

The issue with either the workqueue or the softirq approach is that it
means the call will sometimes (softirq) or always (workqueue) be made by
another task.  Currently we always have current taking samples of its
own clocks and firing timers set on them.  (Because of this you can't
actually use softirq in the simple fashion, i.e. moving the call to
run_posix_cpu_timers.  It would only be guaranteed to run once per CPU
and wouldn't know which task it was supposed to look at.  You'd have to
keep a per-CPU list of tasks pending consideration, i.e. a workqueue.)

I can't tell you off hand about serious complications of doing this work
from another task rather than by current on current.  I think I might
have had some in mind when I did the original implementation, but I just
don't remember any more.  It seems like a potential can of worms.  My
first inclination is to do every other cleanup we like first before
touching this question.

Also note that the bulk of the work (and everything that's not O(1))
has to be done with interrupts disabled anyway.  That's necessary to
take siglock.  That lock both protects signal_struct, and it protects
the task_struct.cpu_timers lists.  (You can do a cheap and lossy test
on current->signal->it_*_expires without taking the lock, for the
nothing-fires fast path.)
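
A minimal sketch of that fast path, using the signal_struct expiry
fields from your patch (if nothing is armed, all three fields are
zero; an unlocked read is lossy but safe, since at worst we take the
slow path needlessly or notice a just-armed timer one tick late):

	if (cputime_eq(current->signal->it_prof_expires, cputime_zero) &&
	    cputime_eq(current->signal->it_virt_expires, cputime_zero) &&
	    current->signal->it_sched_expires == 0)
		return;	/* no process CPU timer can fire */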

> One thing that's still unclear to me is, if there were only one run of
> run_posix_cpu_timers() per thread group per tick, how would per-thread
> timers be serviced?

What I said is only actually necessary once per thread group is the work
that check_process_timers does.  In the current style of code where
there is a loop through all the threads anyway, then you could in fact
weave in the check_thread_timers work there too and then all that would
only need to be done once per thread group per tick.  (But I don't think
that's what I suggested last time.)

> I've given this some thought.  It seems clear that there's going to be
> some performance penalty when multiple CPUs are competing trying to
> update the same field at the tick.

Indeed.  That's why I said I would not endorse any patch that doesn't
address this up front, and show concrete measurements about this
overhead.  (To a first approximation, the overhead on every tick for
every task in the system is always more important than the complications
for tasks that are using any CPU timers, including ITIMER_PROF and
ITIMER_VIRTUAL.)

> It would be much better if there were cacheline-aligned per-cpu fields
> associated with either the task or the signal structure; that way each
> CPU could update its own field without competing with the others.
> Processing in run_posix_cpu_timers would still be constant, although
> slightly higher for having to consult multiple fields instead of just
> one.  Not one per thread, though, just one per CPU, a much smaller and
> fixed number.

Exactly this is the first idea I had about this.  (I considered this in
the original implementation and decided for the first crack to err on
the side of no new code or data structures in the paths taken by every
thread, with the new hair only affecting processes that actually use any
process CPU timers.)

But this is not without its own issues.  Currently on my configuration
(64-bit) the utime, stime, sum_sched_runtime fields (which now only
accumulate the contributions to process CPU clocks of threads that are
already dead) take 24 bytes.  Throw in gtime (which is analogous in its
bookkeeping, but has no POSIX clock/timer interface to it) and call it
32 bytes.  That's out of 904 bytes total in signal_struct.

On some common configurations, SMP_CACHE_BYTES is 128 and NR_CPUS is 64.
So the obvious static addition to signal_struct would bloat it by 8192
bytes (i.e. 904 to 9096, or more than 10x), of which 96*NR_CPUS (3/4)
would be wasted even when you really have NR_CPUS running.  That is way
beyond the acceptable size (it's too big to even work right with the
kernel allocators), even if it weren't mostly wasted space.

This leads to some obvious follow-on ideas.  With some fancy
footwork, you could use a pointer to a separately allocated chunk of
only num_possible_cpus() * SMP_CACHE_BYTES.  You needn't allocate it
at all until the first timer is set on that process CPU clock.  That
makes the bloat smaller, and limits it to the processes that actually
need to check on each tick.
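
Concretely, the shape could be something like this (all names made up
for illustration; one cacheline per possible CPU, allocated only when
the first process CPU timer is armed):

	/* Hypothetical per-CPU accounting slot, one cacheline each. */
	struct shared_clock {
		cputime_t utime, stime;
		unsigned long long sched_time;
	} ____cacheline_aligned;

	/* In signal_struct: struct shared_clock *percpu_clocks;
	 * NULL until a process CPU timer is first set, then: */
	sig->percpu_clocks = kzalloc(num_possible_cpus() *
				     sizeof(struct shared_clock),
				     GFP_KERNEL);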

With all of this properly encapsulated in a struct and some helper
functions, it would be simple to conditionalize the implementation.
For uniprocessor kernels, clearly it would be preferable just to use
the existing signal_struct fields.  For NR_CPUS=2, it might be
reasonable to use the static aligned fields bloating signal_struct
(or perhaps for NR_CPUS * SMP_CACHE_BYTES < some threshold), etc.

An alternative notion is to have single shared fields per clock in
signal_struct but add to them only at context switch.  If the threads
in the process don't all yield at the same time, then maybe that
works out ok for cache contention.  It's not a well-developed idea.

This all adds up to me thinking there is no simple answer.  I think
we need to consider several alternatives, and get real measurements
of their overheads in various workloads, etc.  I am hesitant to pick
any "simple" changes to put in the kernel before we have examined the
real trade-offs fully.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-03-11  7:50                             ` posix-cpu-timers revamp Roland McGrath
@ 2008-03-11 21:05                               ` Frank Mayhar
  2008-03-11 21:35                                 ` Roland McGrath
  0 siblings, 1 reply; 51+ messages in thread
From: Frank Mayhar @ 2008-03-11 21:05 UTC (permalink / raw)
  To: Roland McGrath; +Cc: linux-kernel

On Tue, 2008-03-11 at 00:50 -0700, Roland McGrath wrote:
> > correct me when I get it wrong.  I take what you're saying to mean that,
> > first, run_posix_cpu_timers() only needs to be run once per thread group.
> Not quite.  check_process_timers only needs to be run once per thread
> group (per interesting tick).

Where "interesting tick" means "tick in which a process timer has
expired," correct?

> > It _sounds_ like it should be checking the shared fields rather than the
> > per-task fields for timer expiration (in fact, the more I think about it
> > the more sure I am that that's the case).
> run_posix_cpu_timers does two things: thread CPU timers and process CPU
> timers.  The thread CPU timers track the thread CPU clocks, which are
> what the per-thread fields in task_struct count.  check_thread_timers
> finds what thread CPU timers have fired.  The task_struct.it_*_expires
> fields are set when there are thread CPU timers set on those clocks.
> 
> The process CPU timers track the process CPU clocks.  Each process CPU
> clock (virt, prof, sched) is just the sum of the corresponding thread
> CPU clock across all threads in the group.  In the original code, these
> clocks are never maintained in any storage as such, but sampled by
> summing all the thread clocks when a current value is needed.
> check_process_timers finds what process CPU timers have fired.  The
> signal_struct.it_*_expires fields are set when there are process CPU
> timers set on those clocks.

And my changes introduce these clocks as separate fields in the signal
struct, updated at the tick.

> > Since the shared fields are getting all the ticks, this will work for
> > per-thread timers as well.
> 
> I do not follow your logic at all here.  The signal_struct fields being
> proposed track each process CPU clock's value.  The thread CPU timers
> react to the thread CPU clocks, not the process CPU clocks.

Okay, I hadn't been clear on the distinction between process-wide and
thread-only timers.  So, really, run_posix_cpu_timers() needs to check
both sets, the versions in the signal struct for the process-wide timers
and the versions in the task struct for the thread-only timers.

> > It's still probably worthwhile to defer processing to a workqueue
> > thread, though, just because it's still a lot to do at interrupt.  I'll
> > probably end up trying it both ways.

I'm going to table this for now.  Based on my preliminary performance
results, the changes I've made mean that using a workqueue or softirq is
not necessary.  In the profiles of a couple of testcases I've run,
run_posix_cpu_timers() didn't show up at all, whereas before my change
it was right at the top with ~35% of the time.

I'll hang on to your notes, though, for future reference.

> > One thing that's still unclear to me is, if there were only one run of
> > run_posix_cpu_timers() per thread group per tick, how would per-thread
> > timers be serviced?
> What I said is only actually necessary once per thread group is the work
> that check_process_timers does.  In the current style of code where
> there is a loop through all the threads anyway, then you could in fact
> weave in the check_thread_timers work there too and then all that would
> only need to be done once per thread group per tick.  (But I don't think
> that's what I suggested last time.)

So, check_process_timers() checks for and handles any expired timers for
the currently-running process, whereas check_thread_timers() checks for
and handles any expired timers for the currently-running thread.  Is
that correct?

And, since these timers are only counting CPU time, if a thread is never
running at the tick (since that's how we account time in the first
place) any timers it might have will never expire.  Sorry to repeat the
obvious but sometimes it's better to state things very explicitly.

At each tick a process-wide timer may have expired.  Also, at each tick
a thread-only timer may have expired.  Or, of course, both.  So we need
to detect both events and fire the appropriate timer in the appropriate
context.

I think my current code (i.e. the patch I published for comment a few
days ago) does this, with one exception:  If a thread-only timer expires
it _won't_ be detected when run_posix_cpu_timers() runs, since I'm only
checking the process-wide timers.  This implies that I need to do twice
as many checks up front.  I'll think about how to minimize that, though.

> > I've given this some thought.  It seems clear that there's going to be
> > some performance penalty when multiple CPUs are competing trying to
> > update the same field at the tick.
> Indeed.  That's why I said I would not endorse any patch that doesn't
> address this up front, and show concrete measurements about this
> overhead.  (To a first approximation, the overhead on every tick for
> every task in the system is always more important than the complications
> for tasks that are using any CPU timers, including ITIMER_PROF and
> ITIMER_VIRTUAL.)

I've actually gotten a little bit of insight into this.  I don't think
that a straight set of shared fields is sufficient except in UP (and
possibly dual-CPU) environments.  I was able to run a reasonable test on
both a four-core Opteron system and a sixteen-core Opteron system.
(That's two dual-core CPUs and four four-core CPUs, no sixteen-core
Opteron chips here. :-)  They had 1024K and 512K cache respectively.  I
didn't have a chance to actually time the runs but I observed that the
sixteen-core run took substantially longer than the four-core run.
Something like double the time.  While some part of this is likely
attributable to the smaller cache, the testcase was small enough that
that shouldn't have been a real issue.  I'm pretty confident that it was
cache conflict among the sixteen cores that did the damage.

> > It would be much better if there were cacheline-aligned per-cpu fields
> > associated with either the task or the signal structure; that way each
> > CPU could update its own field without competing with the others.
> > Processing in run_posix_cpu_timers would still be constant, although
> > slightly higher for having to consult multiple fields instead of just
> > one.  Not one per thread, though, just one per CPU, a much smaller and
> > fixed number.
> Exactly this is the first idea I had about this.  (I considered this in
> the original implementation and decided for the first crack to err on
> the side of no new code or data structures in the paths taken by every
> thread, with the new hair only affecting processes that actually use any
> process CPU timers.)
> This leads to some obvious follow-on ideas.  With some fancy
> footwork, you could use a pointer to a separately allocated chunk of
> only num_possible_cpus() * SMP_CACHE_BYTES.  You needn't allocate it
> at all until the first timer is set on that process CPU clock.  That
> makes the bloat smaller, and limits it to the processes that actually
> need to check on each tick.

I'm currently working on an implementation that uses the alloc_percpu()
mechanism and a separate structure.  I'm encapsulating access to the
fields in shared_xxx_sum() inline functions, which could have different
implementations for UP, dual-CPU and generic SMP kernels.  Each tick
does something like this:

	if (p->signal->shared_times) { /* Set if timer running. */
		cpu = get_cpu();
		shared_times = per_cpu_ptr(p->signal->shared_times, cpu);
		shared_times->shared_stime = cputime_add(shared_times->shared_stime, cputime);
		put_cpu_no_resched();
	}

(Where "shared_times" is the structure encapsulating the shared fields.)
This adds overhead to sum the per-CPU values but means that multiple
CPUs updating at the same tick won't be competing for the cache line and
killing performance.

> An alternative notion is to have single shared fields per clock in
> signal_struct but add to them only at context switch.  If the threads
> in the process don't all yield at the same time, then maybe that
> works out ok for cache contention.  It's not a well-developed idea.

I'll keep this in mind.

> This all adds up to me thinking there is no simple answer.  I think
> we need to consider several alternatives, and get real measurements
> of their overheads in various workloads, etc.  I am hesitant to pick
> any "simple" changes to put in the kernel before we have examined the
> real trade-offs fully.

I personally think that the most promising approach is the one outlined
above (without considering the context-switch scheme for the moment).
It saves as much space as possible, doesn't penalize processes that
aren't using the posix timers and avoids cache contention between CPUs.
I'll go ahead and implement it and try to generate some numbers.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-03-11 21:05                               ` Frank Mayhar
@ 2008-03-11 21:35                                 ` Roland McGrath
  2008-03-14  0:37                                   ` Frank Mayhar
  0 siblings, 1 reply; 51+ messages in thread
From: Roland McGrath @ 2008-03-11 21:35 UTC (permalink / raw)
  To: Frank Mayhar; +Cc: linux-kernel

> > Not quite.  check_process_timers only needs to be run once per thread
> > group (per interesting tick).
> 
> Where "interesting tick" means "tick in which a process timer has
> expired," correct?

Or might have expired, in the current implementation style.  Correct.

> > The process CPU timers track the process CPU clocks.  [...]
> 
> And my changes introduce these clocks as separate fields in the signal
> struct, updated at the tick.

Correct.

> Okay, I hadn't been clear on the distinction between process-wide and
> thread-only timers.  So, really, run_posix_cpu_timers() needs to check
> both sets, the versions in the signal struct for the process-wide timers
> and the versions in the task struct for the thread-only timers.

Correct.

> I'm going to table this for now.  [...]

Agreed.

> So, check_process_timers() checks for and handles any expired timers for
> the currently-running process, whereas check_thread_timers() checks for
> and handles any expired timers for the currently-running thread.  Is
> that correct?

Correct.

> And, since these timers are only counting CPU time, if a thread is never
> running at the tick (since that's how we account time in the first
> place) any timers it might have will never expire.  

Correct.

> At each tick a process-wide timer may have expired.  Also, at each tick
> a thread-only timer may have expired.  Or, of course, both.  So we need
> to detect both events and fire the appropriate timer in the appropriate
> context.

Correct.

> [...]  I'm pretty confident that it was
> cache conflict among the sixteen cores that did the damage.

I'm not surprised by this result.  (I do want to see much more detailed
performance analysis before we decide on a final change.)

> I'm currently working on an implementation that uses the alloc_percpu()
> mechanism and a separate structure.  I'm encapsulating access to the
> fields in shared_xxx_sum() inline functions, which could have different
> implementations for UP, dual-CPU and generic SMP kernels.  

That is exactly what I had in mind.  (I hadn't noticed alloc_percpu, and it
has one more level of indirection than I'd planned.  But that wastes less
space when num_possible_cpus() is far greater than num_online_cpus(), and I
imagine it's vastly superior for NUMA.)

Don't forget do_[gs]etitimer and k_getrusage can use this too.
(Though maybe no reason to bother in k_getrusage since it has
to loop to sum the non-time counters anyway.)

> I personally think that the most promising approach is the one outlined
> above (without considering the context-switch scheme for the moment).

I tend to agree.  It's the only plan I've thought through in detail.
But my remarks stand, about thorough analysis of performance impacts
of options we can think of.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-03-11 21:35                                 ` Roland McGrath
@ 2008-03-14  0:37                                   ` Frank Mayhar
  2008-03-21  7:18                                     ` Roland McGrath
  0 siblings, 1 reply; 51+ messages in thread
From: Frank Mayhar @ 2008-03-14  0:37 UTC (permalink / raw)
  To: Roland McGrath; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2967 bytes --]

After the recent conversation with Roland and after more testing, I have
another patch for review (although _not_ for submission, as again it's
against 2.6.18.5).  This patch breaks the shared utime/stime/sched_time
fields out into their own structure which is allocated as needed via
alloc_percpu().  This avoids cache thrashing when running lots of
threads on lots of CPUs.

Please take a look and let me know what you think.  In the meantime I'll
be working on a similar patch to 2.6-head that has optimizations for
uniprocessor and two-CPU operation, to avoid the overhead of the percpu
functions when they are unneeded.

This patch:

	Replaces the utime, stime and sched_time fields in signal_struct with
	the shared_times structure, which is cacheline aligned and allocated
	when needed using the alloc_percpu() mechanism.  There is one copy of
	this structure per running CPU when it is being used.

	Each place that loops through all threads in a thread group to sum
	task->utime and/or task->stime now use the shared_*_sum() inline
	functions defined in sched.h to sum the per-CPU structures.  This
	includes compat_sys_times(), do_task_stat(), do_getitimer(),
	sys_times() and k_getrusage().

	Certain routines that used task->signal->[us]time now use the
	shared_*_sum() functions instead, which may (but hopefully will not)
	change their semantics slightly.  These include fill_prstatus() (in
	fs/binfmt_elf.c), do_task_stat() (in fs/proc/array.c),
	wait_task_zombie() and do_notify_parent().

	At each tick, update_cpu_clock(), account_user_time() and
	account_system_time() update the relevant field of the shared_times
	structure using a pointer obtained using per_cpu_ptr, with the effect
	that these functions do not compete with one another for the cacheline.
	Each of these functions updates the task-private field followed by the
	shared_times version if one is present.

	Finally, kernel/posix-cpu-timers.c has changed quite dramatically.
	First, run_posix_cpu_timers() decides whether a timer has expired by
	consulting the it_*_expires fields in the task struct of the running
	thread and the shared_*_sum() functions that cover the entire process.
	The check_process_timers() routine bases its computations on the
	shared structure, removing two loops through the threads. "Rebalancing"
	is no longer required, the process_timer_rebalance() routine has
	disappeared entirely and the arm_timer() routine merely fills
	p->signal->it_*_expires from timer->it.cpu.expires.*.  The
	cpu_clock_sample_group_locked() loses its summing loops, using
	the shared structure instead.  Finally, set_process_cpu_timer() sets
	tsk->signal->it_*_expires directly rather than calling the deleted
	rebalance routine.

	The only remaining open question is whether these changes break the
	semantics of the status-returning routines fill_prstatus(),
	do_task_stat(), wait_task_zombie() and do_notify_parent().
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.

[-- Attachment #2: posix-timers.patch --]
[-- Type: text/x-patch, Size: 26324 bytes --]

diff -rup /home/fmayhar/Static/linux-2.6.18.5/fs/binfmt_elf.c linux-2.6.18.5/fs/binfmt_elf.c
--- /home/fmayhar/Static/linux-2.6.18.5/fs/binfmt_elf.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/fs/binfmt_elf.c	2008-03-10 17:56:41.000000000 -0700
@@ -1302,17 +1302,17 @@ static void fill_prstatus(struct elf_prs
 	if (thread_group_leader(p)) {
 		/*
 		 * This is the record for the group leader.  Add in the
-		 * cumulative times of previous dead threads.  This total
-		 * won't include the time of each live thread whose state
-		 * is included in the core dump.  The final total reported
-		 * to our parent process when it calls wait4 will include
-		 * those sums as well as the little bit more time it takes
-		 * this and each other thread to finish dying after the
-		 * core dump synchronization phase.
+		 * cumulative times of all threads.  This total includes
+		 * the time of each live thread whose state is included in
+		 * the core dump.  The final total reported to our parent
+		 * process when it calls wait4 will include those sums as
+		 * well as the little bit more time it takes this and each
+		 * other thread to finish dying after the core dump
+		 * synchronization phase.
 		 */
-		cputime_to_timeval(cputime_add(p->utime, p->signal->utime),
+		cputime_to_timeval(shared_utime_sum(p->signal),
 				   &prstatus->pr_utime);
-		cputime_to_timeval(cputime_add(p->stime, p->signal->stime),
+		cputime_to_timeval(shared_stime_sum(p->signal),
 				   &prstatus->pr_stime);
 	} else {
 		cputime_to_timeval(p->utime, &prstatus->pr_utime);
diff -rup /home/fmayhar/Static/linux-2.6.18.5/fs/proc/array.c linux-2.6.18.5/fs/proc/array.c
--- /home/fmayhar/Static/linux-2.6.18.5/fs/proc/array.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/fs/proc/array.c	2008-03-12 12:42:41.000000000 -0700
@@ -359,8 +359,6 @@ static int do_task_stat(struct task_stru
 			do {
 				min_flt += t->min_flt;
 				maj_flt += t->maj_flt;
-				utime = cputime_add(utime, t->utime);
-				stime = cputime_add(stime, t->stime);
 				t = next_thread(t);
 			} while (t != task);
 		}
@@ -382,8 +380,8 @@ static int do_task_stat(struct task_stru
 		if (whole) {
 			min_flt += task->signal->min_flt;
 			maj_flt += task->signal->maj_flt;
-			utime = cputime_add(utime, task->signal->utime);
-			stime = cputime_add(stime, task->signal->stime);
+			utime = shared_utime_sum(task->signal);
+			stime = shared_stime_sum(task->signal);
 		}
 	}
 	ppid = pid_alive(task) ? task->group_leader->real_parent->tgid : 0;
diff -rup /home/fmayhar/Static/linux-2.6.18.5/include/linux/sched.h linux-2.6.18.5/include/linux/sched.h
--- /home/fmayhar/Static/linux-2.6.18.5/include/linux/sched.h	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/include/linux/sched.h	2008-03-13 16:19:43.000000000 -0700
@@ -369,6 +369,12 @@ struct pacct_struct {
 	unsigned long		ac_minflt, ac_majflt;
 };
 
+struct sharedtimes_struct {
+	cputime_t shared_utime;
+	cputime_t shared_stime;
+	unsigned long long shared_schedtime;
+} ____cacheline_aligned;
+
 /*
  * NOTE! "signal_struct" does not have it's own
  * locking, because a shared signal_struct always
@@ -413,6 +419,7 @@ struct signal_struct {
 	/* ITIMER_PROF and ITIMER_VIRTUAL timers for the process */
 	cputime_t it_prof_expires, it_virt_expires;
 	cputime_t it_prof_incr, it_virt_incr;
+	unsigned long long it_sched_expires;
 
 	/* job control IDs */
 	pid_t pgrp;
@@ -429,17 +436,11 @@ struct signal_struct {
 	 * Live threads maintain their own counters and add to these
 	 * in __exit_signal, except for the group leader.
 	 */
-	cputime_t utime, stime, cutime, cstime;
+	cputime_t cutime, cstime;
 	unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
 	unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
 
-	/*
-	 * Cumulative ns of scheduled CPU time for dead threads in the
-	 * group, not including a zombie group leader.  (This only differs
-	 * from jiffies_to_ns(utime + stime) if sched_clock uses something
-	 * other than jiffies.)
-	 */
-	unsigned long long sched_time;
+	struct sharedtimes_struct *shared_times; /* Per CPU. */
 
 	/*
 	 * We don't bother to synchronize most readers of this at all,
@@ -482,6 +483,50 @@ struct signal_struct {
 #define SIGNAL_STOP_CONTINUED	0x00000004 /* SIGCONT since WCONTINUED reap */
 #define SIGNAL_GROUP_EXIT	0x00000008 /* group exit in progress */
 
+static inline cputime_t shared_utime_sum(struct signal_struct *sig)
+{
+	int i;
+	struct sharedtimes_struct *shared_times;
+	cputime_t utime = cputime_zero;
+
+	if (sig->shared_times) {
+		for_each_online_cpu(i) {
+			shared_times = per_cpu_ptr(sig->shared_times, i);
+			utime = cputime_add(utime, shared_times->shared_utime);
+		}
+	}
+	return(utime);
+}
+
+static inline cputime_t shared_stime_sum(struct signal_struct *sig)
+{
+	int i;
+	struct sharedtimes_struct *shared_times;
+	cputime_t stime = cputime_zero;
+
+	if (sig->shared_times) {
+		for_each_online_cpu(i) {
+			shared_times = per_cpu_ptr(sig->shared_times, i);
+			stime = cputime_add(stime, shared_times->shared_stime);
+		}
+	}
+	return(stime);
+}
+
+static inline unsigned long long shared_schedtime_sum(struct signal_struct *sig)
+{
+	int i;
+	struct sharedtimes_struct *shared_times;
+	unsigned long long sched_time = 0;
+
+	if (sig->shared_times) {
+		for_each_online_cpu(i) {
+			shared_times = per_cpu_ptr(sig->shared_times, i);
+			sched_time += shared_times->shared_schedtime;
+		}
+	}
+	return(sched_time);
+}
 
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/compat.c linux-2.6.18.5/kernel/compat.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/compat.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/compat.c	2008-03-10 17:58:26.000000000 -0700
@@ -161,18 +161,11 @@ asmlinkage long compat_sys_times(struct 
 	if (tbuf) {
 		struct compat_tms tmp;
 		struct task_struct *tsk = current;
-		struct task_struct *t;
 		cputime_t utime, stime, cutime, cstime;
 
 		read_lock(&tasklist_lock);
-		utime = tsk->signal->utime;
-		stime = tsk->signal->stime;
-		t = tsk;
-		do {
-			utime = cputime_add(utime, t->utime);
-			stime = cputime_add(stime, t->stime);
-			t = next_thread(t);
-		} while (t != tsk);
+		utime = shared_utime_sum(tsk->signal);
+		stime = shared_stime_sum(tsk->signal);
 
 		/*
 		 * While we have tasklist_lock read-locked, no dying thread
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/exit.c linux-2.6.18.5/kernel/exit.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/exit.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/exit.c	2008-03-13 16:20:25.000000000 -0700
@@ -103,13 +103,10 @@ static void __exit_signal(struct task_st
 		 * We won't ever get here for the group leader, since it
 		 * will have been the last reference on the signal_struct.
 		 */
-		sig->utime = cputime_add(sig->utime, tsk->utime);
-		sig->stime = cputime_add(sig->stime, tsk->stime);
 		sig->min_flt += tsk->min_flt;
 		sig->maj_flt += tsk->maj_flt;
 		sig->nvcsw += tsk->nvcsw;
 		sig->nivcsw += tsk->nivcsw;
-		sig->sched_time += tsk->sched_time;
 		sig = NULL; /* Marker for below. */
 	}
 
@@ -1165,14 +1162,12 @@ static int wait_task_zombie(struct task_
 		sig = p->signal;
 		psig->cutime =
 			cputime_add(psig->cutime,
-			cputime_add(p->utime,
-			cputime_add(sig->utime,
-				    sig->cutime)));
+			cputime_add(shared_utime_sum(p->signal),
+				    sig->cutime));
 		psig->cstime =
 			cputime_add(psig->cstime,
-			cputime_add(p->stime,
-			cputime_add(sig->stime,
-				    sig->cstime)));
+			cputime_add(shared_stime_sum(p->signal),
+				    sig->cstime));
 		psig->cmin_flt +=
 			p->min_flt + sig->min_flt + sig->cmin_flt;
 		psig->cmaj_flt +=
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/fork.c linux-2.6.18.5/kernel/fork.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/fork.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/fork.c	2008-03-13 16:20:15.000000000 -0700
@@ -855,14 +855,16 @@ static inline int copy_signal(unsigned l
 	sig->it_virt_incr = cputime_zero;
 	sig->it_prof_expires = cputime_zero;
 	sig->it_prof_incr = cputime_zero;
+ 	sig->it_sched_expires = 0;
 
 	sig->leader = 0;	/* session leadership doesn't inherit */
 	sig->tty_old_pgrp = 0;
 
-	sig->utime = sig->stime = sig->cutime = sig->cstime = cputime_zero;
+	sig->shared_times = NULL;
+
+	sig->cutime = sig->cstime = cputime_zero;
 	sig->nvcsw = sig->nivcsw = sig->cnvcsw = sig->cnivcsw = 0;
 	sig->min_flt = sig->maj_flt = sig->cmin_flt = sig->cmaj_flt = 0;
-	sig->sched_time = 0;
 	INIT_LIST_HEAD(&sig->cpu_timers[0]);
 	INIT_LIST_HEAD(&sig->cpu_timers[1]);
 	INIT_LIST_HEAD(&sig->cpu_timers[2]);
@@ -889,6 +891,8 @@ void __cleanup_signal(struct signal_stru
 {
 	exit_thread_group_keys(sig);
 	taskstats_tgid_free(sig);
+	if (sig->shared_times)
+		free_percpu(sig->shared_times);
 	kmem_cache_free(signal_cachep, sig);
 }
 
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/itimer.c linux-2.6.18.5/kernel/itimer.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/itimer.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/itimer.c	2008-03-12 11:51:11.000000000 -0700
@@ -61,12 +61,7 @@ int do_getitimer(int which, struct itime
 		cval = tsk->signal->it_virt_expires;
 		cinterval = tsk->signal->it_virt_incr;
 		if (!cputime_eq(cval, cputime_zero)) {
-			struct task_struct *t = tsk;
-			cputime_t utime = tsk->signal->utime;
-			do {
-				utime = cputime_add(utime, t->utime);
-				t = next_thread(t);
-			} while (t != tsk);
+			cputime_t utime = shared_utime_sum(tsk->signal);
 			if (cputime_le(cval, utime)) { /* about to fire */
 				cval = jiffies_to_cputime(1);
 			} else {
@@ -84,15 +79,8 @@ int do_getitimer(int which, struct itime
 		cval = tsk->signal->it_prof_expires;
 		cinterval = tsk->signal->it_prof_incr;
 		if (!cputime_eq(cval, cputime_zero)) {
-			struct task_struct *t = tsk;
-			cputime_t ptime = cputime_add(tsk->signal->utime,
-						      tsk->signal->stime);
-			do {
-				ptime = cputime_add(ptime,
-						    cputime_add(t->utime,
-								t->stime));
-				t = next_thread(t);
-			} while (t != tsk);
+			cputime_t ptime = cputime_add(shared_utime_sum(tsk->signal),
+						      shared_stime_sum(tsk->signal));
 			if (cputime_le(cval, ptime)) { /* about to fire */
 				cval = jiffies_to_cputime(1);
 			} else {
@@ -255,6 +243,15 @@ again:
 		}
 		tsk->signal->it_virt_expires = nval;
 		tsk->signal->it_virt_incr = ninterval;
+		/*
+		 * If he's setting the timer for the first time, we need to
+		 * allocate the shared area.  It's freed when the process
+		 * exits.
+		 */
+		if (!cputime_eq(nval, cputime_zero) &&
+		    tsk->signal->shared_times == NULL)
+			tsk->signal->shared_times =
+			    	alloc_percpu(struct sharedtimes_struct);
 		spin_unlock_irq(&tsk->sighand->siglock);
 		read_unlock(&tasklist_lock);
 		if (ovalue) {
@@ -279,6 +276,15 @@ again:
 		}
 		tsk->signal->it_prof_expires = nval;
 		tsk->signal->it_prof_incr = ninterval;
+		/*
+		 * If he's setting the timer for the first time, we need to
+		 * allocate the shared area.  It's freed when the process
+		 * exits.
+		 */
+		if (!cputime_eq(nval, cputime_zero) &&
+		    tsk->signal->shared_times == NULL)
+			tsk->signal->shared_times =
+			    	alloc_percpu(struct sharedtimes_struct);
 		spin_unlock_irq(&tsk->sighand->siglock);
 		read_unlock(&tasklist_lock);
 		if (ovalue) {
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/posix-cpu-timers.c linux-2.6.18.5/kernel/posix-cpu-timers.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/posix-cpu-timers.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/posix-cpu-timers.c	2008-03-13 12:55:29.000000000 -0700
@@ -164,6 +164,15 @@ static inline unsigned long long sched_n
 	return (p == current) ? current_sched_time(p) : p->sched_time;
 }
 
+static inline cputime_t prof_shared_ticks(struct task_struct *p)
+{
+	return cputime_add(shared_utime_sum(p->signal), shared_stime_sum(p->signal));
+}
+static inline cputime_t virt_shared_ticks(struct task_struct *p)
+{
+	return shared_utime_sum(p->signal);
+}
+
 int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec *tp)
 {
 	int error = check_clock(which_clock);
@@ -227,31 +236,17 @@ static int cpu_clock_sample_group_locked
 					 struct task_struct *p,
 					 union cpu_time_count *cpu)
 {
-	struct task_struct *t = p;
  	switch (clock_idx) {
 	default:
 		return -EINVAL;
 	case CPUCLOCK_PROF:
-		cpu->cpu = cputime_add(p->signal->utime, p->signal->stime);
-		do {
-			cpu->cpu = cputime_add(cpu->cpu, prof_ticks(t));
-			t = next_thread(t);
-		} while (t != p);
+		cpu->cpu = cputime_add(shared_utime_sum(p->signal), shared_stime_sum(p->signal));
 		break;
 	case CPUCLOCK_VIRT:
-		cpu->cpu = p->signal->utime;
-		do {
-			cpu->cpu = cputime_add(cpu->cpu, virt_ticks(t));
-			t = next_thread(t);
-		} while (t != p);
+		cpu->cpu = shared_utime_sum(p->signal);
 		break;
 	case CPUCLOCK_SCHED:
-		cpu->sched = p->signal->sched_time;
-		/* Add in each other live thread.  */
-		while ((t = next_thread(t)) != p) {
-			cpu->sched += t->sched_time;
-		}
-		cpu->sched += sched_ns(p);
+		cpu->sched = shared_schedtime_sum(p->signal);
 		break;
 	}
 	return 0;
@@ -468,79 +463,9 @@ void posix_cpu_timers_exit(struct task_s
 void posix_cpu_timers_exit_group(struct task_struct *tsk)
 {
 	cleanup_timers(tsk->signal->cpu_timers,
-		       cputime_add(tsk->utime, tsk->signal->utime),
-		       cputime_add(tsk->stime, tsk->signal->stime),
-		       tsk->sched_time + tsk->signal->sched_time);
-}
-
-
-/*
- * Set the expiry times of all the threads in the process so one of them
- * will go off before the process cumulative expiry total is reached.
- */
-static void process_timer_rebalance(struct task_struct *p,
-				    unsigned int clock_idx,
-				    union cpu_time_count expires,
-				    union cpu_time_count val)
-{
-	cputime_t ticks, left;
-	unsigned long long ns, nsleft;
- 	struct task_struct *t = p;
-	unsigned int nthreads = atomic_read(&p->signal->live);
-
-	if (!nthreads)
-		return;
-
-	switch (clock_idx) {
-	default:
-		BUG();
-		break;
-	case CPUCLOCK_PROF:
-		left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
-				       nthreads);
-		do {
-			if (likely(!(t->flags & PF_EXITING))) {
-				ticks = cputime_add(prof_ticks(t), left);
-				if (cputime_eq(t->it_prof_expires,
-					       cputime_zero) ||
-				    cputime_gt(t->it_prof_expires, ticks)) {
-					t->it_prof_expires = ticks;
-				}
-			}
-			t = next_thread(t);
-		} while (t != p);
-		break;
-	case CPUCLOCK_VIRT:
-		left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
-				       nthreads);
-		do {
-			if (likely(!(t->flags & PF_EXITING))) {
-				ticks = cputime_add(virt_ticks(t), left);
-				if (cputime_eq(t->it_virt_expires,
-					       cputime_zero) ||
-				    cputime_gt(t->it_virt_expires, ticks)) {
-					t->it_virt_expires = ticks;
-				}
-			}
-			t = next_thread(t);
-		} while (t != p);
-		break;
-	case CPUCLOCK_SCHED:
-		nsleft = expires.sched - val.sched;
-		do_div(nsleft, nthreads);
-		nsleft = max_t(unsigned long long, nsleft, 1);
-		do {
-			if (likely(!(t->flags & PF_EXITING))) {
-				ns = t->sched_time + nsleft;
-				if (t->it_sched_expires == 0 ||
-				    t->it_sched_expires > ns) {
-					t->it_sched_expires = ns;
-				}
-			}
-			t = next_thread(t);
-		} while (t != p);
-		break;
-	}
+		       shared_utime_sum(tsk->signal),
+		       shared_stime_sum(tsk->signal),
+		       shared_schedtime_sum(tsk->signal));
 }
 
 static void clear_dead_task(struct k_itimer *timer, union cpu_time_count now)
@@ -637,7 +562,8 @@ static void arm_timer(struct k_itimer *t
 				    cputime_lt(p->signal->it_virt_expires,
 					       timer->it.cpu.expires.cpu))
 					break;
-				goto rebalance;
+				p->signal->it_virt_expires = timer->it.cpu.expires.cpu;
+				break;
 			case CPUCLOCK_PROF:
 				if (!cputime_eq(p->signal->it_prof_expires,
 						cputime_zero) &&
@@ -648,13 +574,10 @@ static void arm_timer(struct k_itimer *t
 				if (i != RLIM_INFINITY &&
 				    i <= cputime_to_secs(timer->it.cpu.expires.cpu))
 					break;
-				goto rebalance;
+				p->signal->it_prof_expires = timer->it.cpu.expires.cpu;
+				break;
 			case CPUCLOCK_SCHED:
-			rebalance:
-				process_timer_rebalance(
-					timer->it.cpu.task,
-					CPUCLOCK_WHICH(timer->it_clock),
-					timer->it.cpu.expires, now);
+				p->signal->it_sched_expires = timer->it.cpu.expires.sched;
 				break;
 			}
 		}
@@ -1018,9 +941,8 @@ static void check_process_timers(struct 
 {
 	int maxfire;
 	struct signal_struct *const sig = tsk->signal;
-	cputime_t utime, stime, ptime, virt_expires, prof_expires;
+	cputime_t utime, ptime, virt_expires, prof_expires;
 	unsigned long long sched_time, sched_expires;
-	struct task_struct *t;
 	struct list_head *timers = sig->cpu_timers;
 
 	/*
@@ -1037,17 +959,9 @@ static void check_process_timers(struct 
 	/*
 	 * Collect the current process totals.
 	 */
-	utime = sig->utime;
-	stime = sig->stime;
-	sched_time = sig->sched_time;
-	t = tsk;
-	do {
-		utime = cputime_add(utime, t->utime);
-		stime = cputime_add(stime, t->stime);
-		sched_time += t->sched_time;
-		t = next_thread(t);
-	} while (t != tsk);
-	ptime = cputime_add(utime, stime);
+	utime = shared_utime_sum(sig);
+	ptime = cputime_add(utime, shared_stime_sum(sig));
+	sched_time = shared_schedtime_sum(sig);
 
 	maxfire = 20;
 	prof_expires = cputime_zero;
@@ -1156,60 +1070,18 @@ static void check_process_timers(struct 
 		}
 	}
 
-	if (!cputime_eq(prof_expires, cputime_zero) ||
-	    !cputime_eq(virt_expires, cputime_zero) ||
-	    sched_expires != 0) {
-		/*
-		 * Rebalance the threads' expiry times for the remaining
-		 * process CPU timers.
-		 */
-
-		cputime_t prof_left, virt_left, ticks;
-		unsigned long long sched_left, sched;
-		const unsigned int nthreads = atomic_read(&sig->live);
-
-		if (!nthreads)
-			return;
-
-		prof_left = cputime_sub(prof_expires, utime);
-		prof_left = cputime_sub(prof_left, stime);
-		prof_left = cputime_div_non_zero(prof_left, nthreads);
-		virt_left = cputime_sub(virt_expires, utime);
-		virt_left = cputime_div_non_zero(virt_left, nthreads);
-		if (sched_expires) {
-			sched_left = sched_expires - sched_time;
-			do_div(sched_left, nthreads);
-			sched_left = max_t(unsigned long long, sched_left, 1);
-		} else {
-			sched_left = 0;
-		}
-		t = tsk;
-		do {
-			if (unlikely(t->flags & PF_EXITING))
-				continue;
-
-			ticks = cputime_add(cputime_add(t->utime, t->stime),
-					    prof_left);
-			if (!cputime_eq(prof_expires, cputime_zero) &&
-			    (cputime_eq(t->it_prof_expires, cputime_zero) ||
-			     cputime_gt(t->it_prof_expires, ticks))) {
-				t->it_prof_expires = ticks;
-			}
-
-			ticks = cputime_add(t->utime, virt_left);
-			if (!cputime_eq(virt_expires, cputime_zero) &&
-			    (cputime_eq(t->it_virt_expires, cputime_zero) ||
-			     cputime_gt(t->it_virt_expires, ticks))) {
-				t->it_virt_expires = ticks;
-			}
-
-			sched = t->sched_time + sched_left;
-			if (sched_expires && (t->it_sched_expires == 0 ||
-					      t->it_sched_expires > sched)) {
-				t->it_sched_expires = sched;
-			}
-		} while ((t = next_thread(t)) != tsk);
-	}
+	if (!cputime_eq(prof_expires, cputime_zero) &&
+	    (cputime_eq(sig->it_prof_expires, cputime_zero) ||
+	     cputime_gt(sig->it_prof_expires, prof_expires)))
+		sig->it_prof_expires = prof_expires;
+	if (!cputime_eq(virt_expires, cputime_zero) &&
+	    (cputime_eq(sig->it_virt_expires, cputime_zero) ||
+	     cputime_gt(sig->it_virt_expires, virt_expires)))
+		sig->it_virt_expires = virt_expires;
+	if (sched_expires != 0 &&
+	    (sig->it_sched_expires == 0 ||
+	     sig->it_sched_expires > sched_expires))
+		sig->it_sched_expires = sched_expires;
 }
 
 /*
@@ -1289,13 +1161,29 @@ void run_posix_cpu_timers(struct task_st
 
 	BUG_ON(!irqs_disabled());
 
+	if (!tsk->signal)
+		return;
+
 #define UNEXPIRED(clock) \
-		(cputime_eq(tsk->it_##clock##_expires, cputime_zero) || \
-		 cputime_lt(clock##_ticks(tsk), tsk->it_##clock##_expires))
+		(cputime_eq(tsk->signal->it_##clock##_expires, cputime_zero) || \
+		 cputime_lt(clock##_shared_ticks(tsk), tsk->signal->it_##clock##_expires))
 
-	if (UNEXPIRED(prof) && UNEXPIRED(virt) &&
+	/*
+	 * If neither the running thread nor the process-wide timer has
+	 * expired, do nothing.
+	 */
+	if ((cputime_eq(tsk->it_prof_expires, cputime_zero) ||
+	     cputime_lt(prof_ticks(tsk), tsk->it_prof_expires)) &&
+	    (cputime_eq(tsk->it_virt_expires, cputime_zero) ||
+	     cputime_lt(virt_ticks(tsk), tsk->it_virt_expires)) &&
 	    (tsk->it_sched_expires == 0 ||
-	     tsk->sched_time < tsk->it_sched_expires))
+	     tsk->sched_time < tsk->it_sched_expires) &&
+	    (cputime_eq(tsk->signal->it_prof_expires, cputime_zero) ||
+	     cputime_lt(prof_shared_ticks(tsk), tsk->signal->it_prof_expires)) &&
+	    (cputime_eq(tsk->signal->it_virt_expires, cputime_zero) ||
+	     cputime_lt(virt_shared_ticks(tsk), tsk->signal->it_virt_expires)) &&
+	    (tsk->signal->it_sched_expires == 0 ||
+	     shared_schedtime_sum(tsk->signal) < tsk->signal->it_sched_expires))
 		return;
 
 #undef	UNEXPIRED
@@ -1398,13 +1286,14 @@ void set_process_cpu_timer(struct task_s
 	    cputime_ge(list_entry(head->next,
 				  struct cpu_timer_list, entry)->expires.cpu,
 		       *newval)) {
-		/*
-		 * Rejigger each thread's expiry time so that one will
-		 * notice before we hit the process-cumulative expiry time.
-		 */
-		union cpu_time_count expires = { .sched = 0 };
-		expires.cpu = *newval;
-		process_timer_rebalance(tsk, clock_idx, expires, now);
+		switch (clock_idx) {
+		case CPUCLOCK_PROF:
+			tsk->signal->it_prof_expires = *newval;
+			break;
+		case CPUCLOCK_VIRT:
+			tsk->signal->it_virt_expires = *newval;
+			break;
+		}
 	}
 }
 
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/sched.c linux-2.6.18.5/kernel/sched.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/sched.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/sched.c	2008-03-12 13:02:10.000000000 -0700
@@ -2901,7 +2901,21 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 static inline void
 update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now)
 {
-	p->sched_time += now - max(p->timestamp, rq->timestamp_last_tick);
+	unsigned long long tmp;
+
+	tmp = now - max(p->timestamp, rq->timestamp_last_tick);
+	p->sched_time += tmp;
+	/* Add our time to the shared field. */
+	if (p->signal && p->signal->shared_times) {
+		int cpu;
+		struct sharedtimes_struct *shared_times;
+
+		cpu = get_cpu();
+		shared_times = per_cpu_ptr(p->signal->shared_times, cpu);
+		shared_times->shared_schedtime += tmp;
+		put_cpu_no_resched();
+		/*p->signal->shared_schedtime += tmp;*/
+	}
 }
 
 /*
@@ -2955,6 +2969,20 @@ void account_user_time(struct task_struc
 
 	p->utime = cputime_add(p->utime, cputime);
 
+	/* Add our time to the shared field. */
+	if (p->signal && p->signal->shared_times) {
+		int cpu;
+		struct sharedtimes_struct *shared_times;
+
+		cpu = get_cpu();
+		shared_times = per_cpu_ptr(p->signal->shared_times, cpu);
+		shared_times->shared_utime = cputime_add(shared_times->shared_utime, cputime);
+		put_cpu_no_resched();
+	}
+/*
+	if (p->signal)
+		p->signal->shared_utime = cputime_add(p->signal->shared_utime, cputime);
+*/
 	/* Add user time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
 	if (TASK_NICE(p) > 0)
@@ -2978,6 +3006,20 @@ void account_system_time(struct task_str
 
 	p->stime = cputime_add(p->stime, cputime);
 
+	/* Add our time to the shared field. */
+	if (p->signal && p->signal->shared_times) {
+		int cpu;
+		struct sharedtimes_struct *shared_times;
+
+		cpu = get_cpu();
+		shared_times = per_cpu_ptr(p->signal->shared_times, cpu);
+		shared_times->shared_stime = cputime_add(shared_times->shared_stime, cputime);
+		put_cpu_no_resched();
+	}
+/*
+	if (p->signal)
+		p->signal->shared_stime = cputime_add(p->signal->shared_stime, cputime);
+*/
 	/* Add system time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
 	if (hardirq_count() - hardirq_offset)
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/signal.c linux-2.6.18.5/kernel/signal.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/signal.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/signal.c	2008-03-10 17:57:36.000000000 -0700
@@ -1447,10 +1447,8 @@ void do_notify_parent(struct task_struct
 	info.si_uid = tsk->uid;
 
 	/* FIXME: find out whether or not this is supposed to be c*time. */
-	info.si_utime = cputime_to_jiffies(cputime_add(tsk->utime,
-						       tsk->signal->utime));
-	info.si_stime = cputime_to_jiffies(cputime_add(tsk->stime,
-						       tsk->signal->stime));
+	info.si_utime = cputime_to_jiffies(shared_utime_sum(tsk->signal));
+	info.si_stime = cputime_to_jiffies(shared_stime_sum(tsk->signal));
 
 	info.si_status = tsk->exit_code & 0x7f;
 	if (tsk->exit_code & 0x80)
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/sys.c linux-2.6.18.5/kernel/sys.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/sys.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/sys.c	2008-03-10 16:55:26.000000000 -0700
@@ -1207,18 +1207,11 @@ asmlinkage long sys_times(struct tms __u
 	if (tbuf) {
 		struct tms tmp;
 		struct task_struct *tsk = current;
-		struct task_struct *t;
 		cputime_t utime, stime, cutime, cstime;
 
 		spin_lock_irq(&tsk->sighand->siglock);
-		utime = tsk->signal->utime;
-		stime = tsk->signal->stime;
-		t = tsk;
-		do {
-			utime = cputime_add(utime, t->utime);
-			stime = cputime_add(stime, t->stime);
-			t = next_thread(t);
-		} while (t != tsk);
+		utime = shared_utime_sum(tsk->signal);
+		stime = shared_stime_sum(tsk->signal);
 
 		cutime = tsk->signal->cutime;
 		cstime = tsk->signal->cstime;
@@ -1910,16 +1903,14 @@ static void k_getrusage(struct task_stru
 				break;
 
 		case RUSAGE_SELF:
-			utime = cputime_add(utime, p->signal->utime);
-			stime = cputime_add(stime, p->signal->stime);
+			utime = cputime_add(utime, shared_utime_sum(p->signal));
+			stime = cputime_add(stime, shared_stime_sum(p->signal));
 			r->ru_nvcsw += p->signal->nvcsw;
 			r->ru_nivcsw += p->signal->nivcsw;
 			r->ru_minflt += p->signal->min_flt;
 			r->ru_majflt += p->signal->maj_flt;
 			t = p;
 			do {
-				utime = cputime_add(utime, t->utime);
-				stime = cputime_add(stime, t->stime);
 				r->ru_nvcsw += t->nvcsw;
 				r->ru_nivcsw += t->nivcsw;
 				r->ru_minflt += t->min_flt;

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-03-14  0:37                                   ` Frank Mayhar
@ 2008-03-21  7:18                                     ` Roland McGrath
  2008-03-21 17:57                                       ` Frank Mayhar
  2008-03-21 20:40                                       ` posix-cpu-timers revamp Frank Mayhar
  0 siblings, 2 replies; 51+ messages in thread
From: Roland McGrath @ 2008-03-21  7:18 UTC (permalink / raw)
  To: Frank Mayhar; +Cc: linux-kernel

Sorry for the delay.

> Please take a look and let me know what you think.  In the meantime I'll
> be working on a similar patch to 2.6-head that has optimizations for
> uniprocessor and two-CPU operation, to avoid the overhead of the percpu
> functions when they are unneeded.

My mention of a 2-CPU special case was just an off-hand idea.  I don't
really have any idea if that would be optimal given the tradeoff of
increasing signal_struct size.  The performance needs to be analyzed.

> 	disappeared entirely and the arm_timer() routine merely fills
> 	p->signal->it_*_expires from timer->it.cpu.expires.*.  The
> 	cpu_clock_sample_group_locked() loses its summing loops, using the
> 	shared structure instead.  Finally, set_process_cpu_timer() sets
> 	tsk->signal->it_*_expires directly rather than calling the deleted
> 	rebalance routine.

I think I misled you about the use of the it_*_expires fields, sorry.
The task_struct.it_*_expires fields are used solely as a cache of the
head of cpu_timers[].  Despite the poor choice of the same name, the
signal_struct.it_*_expires fields serve a different purpose.  For an
analogous cache of the soonest timer to expire, you need to add new
fields.  The signal_struct.it_{prof,virt}_{expires,incr} fields hold
the setitimer settings for ITIMER_{PROF,VTALRM}.  You can't change
those in arm_timer.  For a quick cache you need a new field that is
the sooner of it_foo_expires or the head cpu_timers[foo] expiry time.
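
Schematically, the field roles just described look like this (a rough
annotated sketch showing only the relevant fields, not the real layout):

	struct task_struct {
		/* Cache of the head of this thread's cpu_timers[] lists. */
		cputime_t it_prof_expires;
		cputime_t it_virt_expires;
	};

	struct signal_struct {
		/* The setitimer() settings for ITIMER_{PROF,VTALRM}:
		 * it_value and it_interval respectively -- not a cache. */
		cputime_t it_prof_expires, it_virt_expires;
		cputime_t it_prof_incr, it_virt_incr;
	};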

The shared_utime_sum et al names are somewhat oblique to anyone who
hasn't just been hacking on exactly this thing like you and I have.
Things like thread_group_*time make more sense.

There are now several places where you call both shared_utime_sum and
shared_stime_sum.  It looks simple because they're nicely encapsulated.
But now you have two loops through all CPUs, and three loops in
check_process_timers.

I think what we want instead is this:

	struct task_cputime
	{
		cputime_t utime;
		cputime_t stime;
		unsigned long long schedtime;
	};

Use one in task_struct to replace the utime, stime, and sum_sched_runtime
fields, and another to replace it_*_expires.  Use a single inline function
thread_group_cputime() that fills a sum struct task_cputime using a single
loop.  For the places only one or two of the sums is actually used, the
compiler should optimize away the extra summing from the loop.
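
Something like this, roughly (an untested sketch; "cputime_percpu" is a
placeholder for whatever the percpu field in signal_struct ends up being
called):

	static inline void thread_group_cputime(struct signal_struct *sig,
						struct task_cputime *times)
	{
		struct task_cputime *cpu;
		int i;

		times->utime = cputime_zero;
		times->stime = cputime_zero;
		times->schedtime = 0;
		/* One pass over the percpu copies sums all three fields. */
		for_each_online_cpu(i) {
			cpu = per_cpu_ptr(sig->cputime_percpu, i);
			times->utime = cputime_add(times->utime, cpu->utime);
			times->stime = cputime_add(times->stime, cpu->stime);
			times->schedtime += cpu->schedtime;
		}
	}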

Don't use __cacheline_aligned on this struct type itself, because most of
the uses don't need that.  When using alloc_percpu, you can rely on it to
take care of those needs--that's what it's for.  If you implement a
variant that uses a flat array, you can use a wrapper struct with
__cacheline_aligned for that.
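
That is, if the flat-array variant is ever tried, the wrapper would be
something like (hypothetical, just to illustrate):

	struct task_cputime_percpu {
		struct task_cputime cputime;
	} __cacheline_aligned;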


Thanks,
Roland

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-03-21  7:18                                     ` Roland McGrath
@ 2008-03-21 17:57                                       ` Frank Mayhar
  2008-03-22 21:58                                         ` Roland McGrath
  2008-03-21 20:40                                       ` posix-cpu-timers revamp Frank Mayhar
  1 sibling, 1 reply; 51+ messages in thread
From: Frank Mayhar @ 2008-03-21 17:57 UTC (permalink / raw)
  To: Roland McGrath; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7526 bytes --]

On Fri, 2008-03-21 at 00:18 -0700, Roland McGrath wrote:
> > Please take a look and let me know what you think.  In the meantime I'll
> > be working on a similar patch to 2.6-head that has optimizations for
> > uniprocessor and two-CPU operation, to avoid the overhead of the percpu
> > functions when they are unneeded.
> My mention of a 2-CPU special case was just an off-hand idea.  I don't
> really have any idea if that would be optimal given the tradeoff of
> increasing signal_struct size.  The performance needs to be analyzed.

I would really like to just ignore the 2-cpu scenario and just have two
versions, the UP version and the n-way SMP version.  It would make life,
and maintenance, simpler.

> > 	disappeared entirely and the arm_timer() routine merely fills
> > 	p->signal->it_*_expires from timer->it.cpu.expires.*.  The
> > 	cpu_clock_sample_group_locked() loses its summing loops, using the
> > 	shared structure instead.  Finally, set_process_cpu_timer() sets
> > 	tsk->signal->it_*_expires directly rather than calling the deleted
> > 	rebalance routine.
> I think I misled you about the use of the it_*_expires fields, sorry.
> The task_struct.it_*_expires fields are used solely as a cache of the
> head of cpu_timers[].  Despite the poor choice of the same name, the
> signal_struct.it_*_expires fields serve a different purpose.  For an
> analogous cache of the soonest timer to expire, you need to add new
> fields.  The signal_struct.it_{prof,virt}_{expires,incr} fields hold
> the setitimer settings for ITIMER_{PROF,VTALRM}.  You can't change
> those in arm_timer.  For a quick cache you need a new field that is
> the sooner of it_foo_expires or the head cpu_timers[foo] expiry time.

Okay, I'll go back over this and make sure I got it right.  It's
interesting, though, that my current patch (written without this
particular bit of knowledge) actually performs no differently from the
existing mechanism.

From my handy four-core AMD64 test system running 2.6.18.5, the old
kernel gets:

./nohangc-3 1300 200000
Interval timer off.
Threads:                1300
Max prime:              200000
Elapsed:                95.421s
Execution:              User 356.001s, System 0.029s, Total 356.030s
Context switches:       vol 1319, invol 7402

./hangc-3 1300 200000
Interval timer set to 0.010 sec.
Threads:                1300
Max prime:              200000
Elapsed:                131.457s
Execution:              User 435.037s, System 59.495s, Total 494.532s
Context switches:       vol 1464, invol 10123
Ticks:                  22612, tics/sec 45.724, secs/tic 0.022

(More than 1300 threads hangs the old kernel with this test.)

With my patch it gets:

./nohangc-3 1300 200000
Interval timer off.
Threads:                1300
Max prime:              200000
Elapsed:                94.097s
Execution:              User 366.000s, System 0.052s, Total 366.052s
Context switches:       vol 1336, invol 28928

./hangc-3 1300 200000
Interval timer set to 0.010 sec.
Threads:                1300
Max prime:              200000
Elapsed:                93.583s
Execution:              User 366.117s, System 0.047s, Total 366.164s
Context switches:       vol 1323, invol 28875
Ticks:                  12131, tics/sec 33.130, secs/tic 0.030

Also see below.

> The shared_utime_sum et al names are somewhat oblique to anyone who
> hasn't just been hacking on exactly this thing like you and I have.
> Things like thread_group_*time make more sense.

In the latest cut I've named them "process_*" but "thread_group" makes
more sense.

> There are now several places where you call both shared_utime_sum and
> shared_stime_sum.  It looks simple because they're nicely encapsulated.
> But now you have two loops through all CPUs, and three loops in
> check_process_timers.

Good point, although so far it's been undetectable in my performance
testing.  (I can't say that it will stay that way down the road a bit,
when we have systems with large numbers of cores.)

> I think what we want instead is this:
> 
> 	struct task_cputime
> 	{
> 		cputime_t utime;
> 		cputime_t stime;
> 		unsigned long long schedtime;
> 	};
> 
> Use one in task_struct to replace the utime, stime, and sum_sched_runtime
> fields, and another to replace it_*_expires.  Use a single inline function
> thread_group_cputime() that fills a sum struct task_cputime using a single
> loop.  For the places only one or two of the sums is actually used, the
> compiler should optimize away the extra summing from the loop.

Excellent idea!  This method hadn't occurred to me since I was looking
at it from the viewpoint of the existing structure and keeping the
fields separated, but this makes more sense.

> Don't use __cacheline_aligned on this struct type itself, because most of
> the uses don't need that.  When using alloc_percpu, you can rely on it to
> take care of those needs--that's what it's for.  If you implement a
> variant that uses a flat array, you can use a wrapper struct with
> __cacheline_aligned for that.

Yeah, I had caught that one.

FYI, I've attached the latest version of the 2.6.18 patch; you might
want to take a look as it has changed a bit.  I generated some numbers
as well (from a new README):

	Testing was performed using a heavily-modified version of the test
	that originally showed the problem.  The test sets ITIMER_PROF (if
	not run with "nohang" in the name of the executable) and catches
	the SIGPROF signal (in any event), then starts some number of threads,
	each of which computes the prime numbers up to a given maximum (this
	function was lifted from the "cpu" benchmark of sysbench version
	0.4.8).  It takes as parameters the number of threads to create and
	the maximum value for the prime number calculation.  It starts the
	threads, calls pthread_barrier_wait() to wait for them to complete and
	rendezvous, then joins the threads.  It uses gettimeofday() to get
	the time and getrusage() to get resource usage before and after the
	threads run and reports the number of threads, the difference in
	elapsed time, user and system CPU time and in the number of voluntary
	and involuntary context switches, and the total number of SIGPROF
	signals received (this will be zero if the test is run as "nohang").

	On a four-core AMD64 system (two dual-core AMD64s), for 1300 threads
	(more than that hung the kernel) and a max prime of 120,000, the old
	kernel averaged roughly 70s elapsed, with about 240s user cpu and 35s
	system cpu, with the profile timer ticking about every 0.02s.  The new
	kernel averaged roughly 45s elapsed, with about 181s user cpu and .04
	system CPU and with the profile timer ticking about every .01s.
	
	On a sixteen-core system (four quad-core AMD64s), for 1300 threads as
	above but with a max prime of 300,000, the old kernel averaged roughly
	65s elapsed, with about 600s user cpu and 91s system cpu, with the
	profile timer ticking about every 0.02s.  The new kernel averaged
	roughly 70s elapsed, with about 239s user cpu and 35s system cpu,
	and with the profile timer ticking about every 0.02s.

	On the same sixteen-core system, 100,000 threads with a max prime of
	100,000 ran in roughly 975s elapsed, with about 5,538s user cpu and
	751s system cpu, with the profile timer ticking about every 0.025s.

	In summary, the performance of the kernel with the fix is comparable to
	the performance without it, with the advantage that many threads will
	no longer hang the system.
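
For reference, the hang variant of the test boils down to roughly this
shape (a minimal sketch of what the README describes; the gettimeofday()
and getrusage() reporting and the sysbench-derived prime loop are elided
or simplified, and it needs -pthread to build):

	#include <pthread.h>
	#include <signal.h>
	#include <stdlib.h>
	#include <sys/time.h>

	static pthread_barrier_t barrier;
	static volatile sig_atomic_t profs;	/* SIGPROF ticks seen */

	static void on_prof(int sig)
	{
		(void)sig;
		profs++;
	}

	static void *worker(void *arg)
	{
		long max = (long)arg, n, d;

		/* Stand-in for the prime-number computation. */
		for (n = 3; n < max; n += 2)
			for (d = 3; d * d <= n; d += 2)
				if (n % d == 0)
					break;
		pthread_barrier_wait(&barrier);	/* rendezvous */
		return NULL;
	}

	int main(int argc, char **argv)
	{
		int i, nthreads = atoi(argv[1]);
		long maxprime = atol(argv[2]);
		/* 0.010s profiling tick, as in the "hangc" runs above. */
		struct itimerval it = { { 0, 10000 }, { 0, 10000 } };
		pthread_t *tids = calloc(nthreads, sizeof(*tids));

		signal(SIGPROF, on_prof);
		setitimer(ITIMER_PROF, &it, NULL);
		pthread_barrier_init(&barrier, NULL, nthreads);
		for (i = 0; i < nthreads; i++)
			pthread_create(&tids[i], NULL, worker,
				       (void *)maxprime);
		for (i = 0; i < nthreads; i++)
			pthread_join(tids[i], NULL);
		return 0;
	}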

The patch is attached.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.

[-- Attachment #2: itimer-hang-2.patch --]
[-- Type: text/x-patch, Size: 23931 bytes --]

diff -rup /home/fmayhar/Static/linux-2.6.18.5/include/linux/sched.h linux-2.6.18.5/include/linux/sched.h
--- /home/fmayhar/Static/linux-2.6.18.5/include/linux/sched.h	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/include/linux/sched.h	2008-03-20 11:51:24.000000000 -0700
@@ -370,6 +370,18 @@ struct pacct_struct {
 };
 
 /*
+ * This structure contains the versions of utime, stime and sched_time
+ * that are shared across all threads within a process.  It's only used for
+ * interval timers and is allocated via alloc_percpu() in the signal
+ * structure when such a timer is set up.  It is freed at process exit.
+ */
+struct process_times_percpu_struct {
+	cputime_t utime;
+	cputime_t stime;
+	unsigned long long sched_time;
+};
+
+/*
  * NOTE! "signal_struct" does not have it's own
  * locking, because a shared signal_struct always
  * implies a shared sighand_struct, so locking
@@ -414,6 +426,9 @@ struct signal_struct {
 	cputime_t it_prof_expires, it_virt_expires;
 	cputime_t it_prof_incr, it_virt_incr;
 
+	/* Scheduling timer for the process */
+	unsigned long long it_sched_expires;
+
 	/* job control IDs */
 	pid_t pgrp;
 	pid_t tty_old_pgrp;
@@ -441,6 +456,9 @@ struct signal_struct {
 	 */
 	unsigned long long sched_time;
 
+	/* Process-wide times for POSIX interval timing.  Per CPU. */
+	struct process_times_percpu_struct *process_times_percpu;
+
 	/*
 	 * We don't bother to synchronize most readers of this at all,
 	 * because there is no reader checking a limit that actually needs
@@ -1472,6 +1490,112 @@ static inline int lock_need_resched(spin
 	return 0;
 }
 
+/*
+ * Allocate the process_times_percpu_struct appropriately and fill in the current
+ * values of the fields.  Called from do_setitimer() when setting an interval
+ * timer (ITIMER_PROF or ITIMER_VIRTUAL).  Assumes interrupts are enabled when
+ * it's called.  Note that there is no corresponding deallocation done from
+ * do_setitimer(); the structure is freed at process exit.
+ */
+static inline int process_times_percpu_alloc(struct task_struct *tsk)
+{
+	struct signal_struct *sig = tsk->signal;
+	struct process_times_percpu_struct *process_times_percpu;
+	struct task_struct *t;
+	cputime_t utime, stime;
+	unsigned long long sched_time;
+
+	/*
+	 * If we don't already have a process_times_percpu_struct, allocate
+	 * one and fill it in with the accumulated times.
+	 */
+	if (sig->process_times_percpu)
+		return(0);
+	process_times_percpu = alloc_percpu(struct process_times_percpu_struct);
+	if (process_times_percpu == NULL)
+		return -ENOMEM;
+	read_lock(&tasklist_lock);
+	spin_lock_irq(&tsk->sighand->siglock);
+	if (sig->process_times_percpu) {
+		spin_unlock_irq(&tsk->sighand->siglock);
+		read_unlock(&tasklist_lock);
+		free_percpu(process_times_percpu);
+		return(0);
+	}
+	sig->process_times_percpu = process_times_percpu;
+	utime = sig->utime;
+	stime = sig->stime;
+	sched_time = sig->sched_time;
+	t = tsk;
+	do {
+		utime = cputime_add(utime, t->utime);
+		stime = cputime_add(stime, t->stime);
+		sched_time += t->sched_time;
+	} while_each_thread(tsk, t);
+	process_times_percpu = per_cpu_ptr(sig->process_times_percpu, get_cpu());
+	process_times_percpu->utime = utime;
+	process_times_percpu->stime = stime;
+	process_times_percpu->sched_time = sched_time;
+	put_cpu_no_resched();
+	spin_unlock_irq(&tsk->sighand->siglock);
+	read_unlock(&tasklist_lock);
+	return(0);
+}
+
+/*
+ * Sum the utime field across all running CPUs.
+ */
+static inline cputime_t process_utime_sum(struct signal_struct *sig)
+{
+	int i;
+	struct process_times_percpu_struct *process_times_percpu;
+	cputime_t utime = cputime_zero;
+
+	if (sig->process_times_percpu) {
+		for_each_online_cpu(i) {
+			process_times_percpu = per_cpu_ptr(sig->process_times_percpu, i);
+			utime = cputime_add(utime, process_times_percpu->utime);
+		}
+	}
+	return(utime);
+}
+
+/*
+ * Sum the stime field across all running CPUs.
+ */
+static inline cputime_t process_stime_sum(struct signal_struct *sig)
+{
+	int i;
+	struct process_times_percpu_struct *process_times_percpu;
+	cputime_t stime = cputime_zero;
+
+	if (sig->process_times_percpu) {
+		for_each_online_cpu(i) {
+			process_times_percpu = per_cpu_ptr(sig->process_times_percpu, i);
+			stime = cputime_add(stime, process_times_percpu->stime);
+		}
+	}
+	return(stime);
+}
+
+/*
+ * Sum the sched_time field across all running CPUs.
+ */
+static inline unsigned long long process_schedtime_sum(struct signal_struct *sig)
+{
+	int i;
+	struct process_times_percpu_struct *process_times_percpu;
+	unsigned long long sched_time = 0;
+
+	if (sig->process_times_percpu) {
+		for_each_online_cpu(i) {
+			process_times_percpu = per_cpu_ptr(sig->process_times_percpu, i);
+			sched_time += process_times_percpu->sched_time;
+		}
+	}
+	return(sched_time);
+}
+
 /* Reevaluate whether the task has signals pending delivery.
    This is required every time the blocked sigset_t changes.
    callers must hold sighand->siglock.  */
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/compat.c linux-2.6.18.5/kernel/compat.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/compat.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/compat.c	2008-03-20 11:50:09.000000000 -0700
@@ -161,18 +161,28 @@ asmlinkage long compat_sys_times(struct 
 	if (tbuf) {
 		struct compat_tms tmp;
 		struct task_struct *tsk = current;
-		struct task_struct *t;
 		cputime_t utime, stime, cutime, cstime;
 
 		read_lock(&tasklist_lock);
-		utime = tsk->signal->utime;
-		stime = tsk->signal->stime;
-		t = tsk;
-		do {
-			utime = cputime_add(utime, t->utime);
-			stime = cputime_add(stime, t->stime);
-			t = next_thread(t);
-		} while (t != tsk);
+		/*
+		 * If a POSIX interval timer is running use the process-wide
+		 * fields, else fall back to brute force.
+		 */
+		if (tsk->signal->process_times_percpu) {
+			utime = process_utime_sum(tsk->signal);
+			stime = process_stime_sum(tsk->signal);
+		}
+		else {
+			struct task_struct *t;
+
+			utime = tsk->signal->utime;
+			stime = tsk->signal->stime;
+			t = tsk;
+			do {
+				utime = cputime_add(utime, t->utime);
+				stime = cputime_add(stime, t->stime);
+			} while_each_thread(tsk, t);
+		}
 
 		/*
 		 * While we have tasklist_lock read-locked, no dying thread
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/fork.c linux-2.6.18.5/kernel/fork.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/fork.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/fork.c	2008-03-20 11:50:09.000000000 -0700
@@ -855,10 +855,13 @@ static inline int copy_signal(unsigned l
 	sig->it_virt_incr = cputime_zero;
 	sig->it_prof_expires = cputime_zero;
 	sig->it_prof_incr = cputime_zero;
+ 	sig->it_sched_expires = 0;
 
 	sig->leader = 0;	/* session leadership doesn't inherit */
 	sig->tty_old_pgrp = 0;
 
+	sig->process_times_percpu = NULL;
+
 	sig->utime = sig->stime = sig->cutime = sig->cstime = cputime_zero;
 	sig->nvcsw = sig->nivcsw = sig->cnvcsw = sig->cnivcsw = 0;
 	sig->min_flt = sig->maj_flt = sig->cmin_flt = sig->cmaj_flt = 0;
@@ -889,6 +892,8 @@ void __cleanup_signal(struct signal_stru
 {
 	exit_thread_group_keys(sig);
 	taskstats_tgid_free(sig);
+	if (sig->process_times_percpu)
+		free_percpu(sig->process_times_percpu);
 	kmem_cache_free(signal_cachep, sig);
 }
 
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/itimer.c linux-2.6.18.5/kernel/itimer.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/itimer.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/itimer.c	2008-03-20 11:50:08.000000000 -0700
@@ -61,12 +61,7 @@ int do_getitimer(int which, struct itime
 		cval = tsk->signal->it_virt_expires;
 		cinterval = tsk->signal->it_virt_incr;
 		if (!cputime_eq(cval, cputime_zero)) {
-			struct task_struct *t = tsk;
-			cputime_t utime = tsk->signal->utime;
-			do {
-				utime = cputime_add(utime, t->utime);
-				t = next_thread(t);
-			} while (t != tsk);
+			cputime_t utime = process_utime_sum(tsk->signal);
 			if (cputime_le(cval, utime)) { /* about to fire */
 				cval = jiffies_to_cputime(1);
 			} else {
@@ -84,15 +79,8 @@ int do_getitimer(int which, struct itime
 		cval = tsk->signal->it_prof_expires;
 		cinterval = tsk->signal->it_prof_incr;
 		if (!cputime_eq(cval, cputime_zero)) {
-			struct task_struct *t = tsk;
-			cputime_t ptime = cputime_add(tsk->signal->utime,
-						      tsk->signal->stime);
-			do {
-				ptime = cputime_add(ptime,
-						    cputime_add(t->utime,
-								t->stime));
-				t = next_thread(t);
-			} while (t != tsk);
+			cputime_t ptime = cputime_add(process_utime_sum(tsk->signal),
+						      process_stime_sum(tsk->signal));
 			if (cputime_le(cval, ptime)) { /* about to fire */
 				cval = jiffies_to_cputime(1);
 			} else {
@@ -241,6 +229,18 @@ again:
 	case ITIMER_VIRTUAL:
 		nval = timeval_to_cputime(&value->it_value);
 		ninterval = timeval_to_cputime(&value->it_interval);
+		/*
+		 * If he's setting the timer for the first time, we need to
+		 * allocate the shared area.  It's freed when the process
+		 * exits.
+		 */
+		if (!cputime_eq(nval, cputime_zero) &&
+		    tsk->signal->process_times_percpu == NULL) {
+			int err;
+
+			if ((err = process_times_percpu_alloc(tsk)) < 0)
+				return(err);
+		}
 		read_lock(&tasklist_lock);
 		spin_lock_irq(&tsk->sighand->siglock);
 		cval = tsk->signal->it_virt_expires;
@@ -265,6 +265,18 @@ again:
 	case ITIMER_PROF:
 		nval = timeval_to_cputime(&value->it_value);
 		ninterval = timeval_to_cputime(&value->it_interval);
+		/*
+		 * If he's setting the timer for the first time, we need to
+		 * allocate the shared area.  It's freed when the process
+		 * exits.
+		 */
+		if (!cputime_eq(nval, cputime_zero) &&
+		    tsk->signal->process_times_percpu == NULL) {
+			int err;
+
+			if ((err = process_times_percpu_alloc(tsk)) < 0)
+				return(err);
+		}
 		read_lock(&tasklist_lock);
 		spin_lock_irq(&tsk->sighand->siglock);
 		cval = tsk->signal->it_prof_expires;
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/posix-cpu-timers.c linux-2.6.18.5/kernel/posix-cpu-timers.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/posix-cpu-timers.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/posix-cpu-timers.c	2008-03-19 16:58:03.000000000 -0700
@@ -164,6 +164,15 @@ static inline unsigned long long sched_n
 	return (p == current) ? current_sched_time(p) : p->sched_time;
 }
 
+static inline cputime_t prof_shared_ticks(struct task_struct *p)
+{
+	return cputime_add(process_utime_sum(p->signal), process_stime_sum(p->signal));
+}
+static inline cputime_t virt_shared_ticks(struct task_struct *p)
+{
+	return process_utime_sum(p->signal);
+}
+
 int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec *tp)
 {
 	int error = check_clock(which_clock);
@@ -227,31 +236,17 @@ static int cpu_clock_sample_group_locked
 					 struct task_struct *p,
 					 union cpu_time_count *cpu)
 {
-	struct task_struct *t = p;
  	switch (clock_idx) {
 	default:
 		return -EINVAL;
 	case CPUCLOCK_PROF:
-		cpu->cpu = cputime_add(p->signal->utime, p->signal->stime);
-		do {
-			cpu->cpu = cputime_add(cpu->cpu, prof_ticks(t));
-			t = next_thread(t);
-		} while (t != p);
+		cpu->cpu = cputime_add(process_utime_sum(p->signal), process_stime_sum(p->signal));
 		break;
 	case CPUCLOCK_VIRT:
-		cpu->cpu = p->signal->utime;
-		do {
-			cpu->cpu = cputime_add(cpu->cpu, virt_ticks(t));
-			t = next_thread(t);
-		} while (t != p);
+		cpu->cpu = process_utime_sum(p->signal);
 		break;
 	case CPUCLOCK_SCHED:
-		cpu->sched = p->signal->sched_time;
-		/* Add in each other live thread.  */
-		while ((t = next_thread(t)) != p) {
-			cpu->sched += t->sched_time;
-		}
-		cpu->sched += sched_ns(p);
+		cpu->sched = process_schedtime_sum(p->signal);
 		break;
 	}
 	return 0;
@@ -468,79 +463,9 @@ void posix_cpu_timers_exit(struct task_s
 void posix_cpu_timers_exit_group(struct task_struct *tsk)
 {
 	cleanup_timers(tsk->signal->cpu_timers,
-		       cputime_add(tsk->utime, tsk->signal->utime),
-		       cputime_add(tsk->stime, tsk->signal->stime),
-		       tsk->sched_time + tsk->signal->sched_time);
-}
-
-
-/*
- * Set the expiry times of all the threads in the process so one of them
- * will go off before the process cumulative expiry total is reached.
- */
-static void process_timer_rebalance(struct task_struct *p,
-				    unsigned int clock_idx,
-				    union cpu_time_count expires,
-				    union cpu_time_count val)
-{
-	cputime_t ticks, left;
-	unsigned long long ns, nsleft;
- 	struct task_struct *t = p;
-	unsigned int nthreads = atomic_read(&p->signal->live);
-
-	if (!nthreads)
-		return;
-
-	switch (clock_idx) {
-	default:
-		BUG();
-		break;
-	case CPUCLOCK_PROF:
-		left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
-				       nthreads);
-		do {
-			if (likely(!(t->flags & PF_EXITING))) {
-				ticks = cputime_add(prof_ticks(t), left);
-				if (cputime_eq(t->it_prof_expires,
-					       cputime_zero) ||
-				    cputime_gt(t->it_prof_expires, ticks)) {
-					t->it_prof_expires = ticks;
-				}
-			}
-			t = next_thread(t);
-		} while (t != p);
-		break;
-	case CPUCLOCK_VIRT:
-		left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
-				       nthreads);
-		do {
-			if (likely(!(t->flags & PF_EXITING))) {
-				ticks = cputime_add(virt_ticks(t), left);
-				if (cputime_eq(t->it_virt_expires,
-					       cputime_zero) ||
-				    cputime_gt(t->it_virt_expires, ticks)) {
-					t->it_virt_expires = ticks;
-				}
-			}
-			t = next_thread(t);
-		} while (t != p);
-		break;
-	case CPUCLOCK_SCHED:
-		nsleft = expires.sched - val.sched;
-		do_div(nsleft, nthreads);
-		nsleft = max_t(unsigned long long, nsleft, 1);
-		do {
-			if (likely(!(t->flags & PF_EXITING))) {
-				ns = t->sched_time + nsleft;
-				if (t->it_sched_expires == 0 ||
-				    t->it_sched_expires > ns) {
-					t->it_sched_expires = ns;
-				}
-			}
-			t = next_thread(t);
-		} while (t != p);
-		break;
-	}
+		       process_utime_sum(tsk->signal),
+		       process_stime_sum(tsk->signal),
+		       process_schedtime_sum(tsk->signal));
 }
 
 static void clear_dead_task(struct k_itimer *timer, union cpu_time_count now)
@@ -637,7 +562,8 @@ static void arm_timer(struct k_itimer *t
 				    cputime_lt(p->signal->it_virt_expires,
 					       timer->it.cpu.expires.cpu))
 					break;
-				goto rebalance;
+				p->signal->it_virt_expires = timer->it.cpu.expires.cpu;
+				break;
 			case CPUCLOCK_PROF:
 				if (!cputime_eq(p->signal->it_prof_expires,
 						cputime_zero) &&
@@ -648,13 +574,10 @@ static void arm_timer(struct k_itimer *t
 				if (i != RLIM_INFINITY &&
 				    i <= cputime_to_secs(timer->it.cpu.expires.cpu))
 					break;
-				goto rebalance;
+				p->signal->it_prof_expires = timer->it.cpu.expires.cpu;
+				break;
 			case CPUCLOCK_SCHED:
-			rebalance:
-				process_timer_rebalance(
-					timer->it.cpu.task,
-					CPUCLOCK_WHICH(timer->it_clock),
-					timer->it.cpu.expires, now);
+				p->signal->it_sched_expires = timer->it.cpu.expires.sched;
 				break;
 			}
 		}
@@ -1018,9 +941,8 @@ static void check_process_timers(struct 
 {
 	int maxfire;
 	struct signal_struct *const sig = tsk->signal;
-	cputime_t utime, stime, ptime, virt_expires, prof_expires;
+	cputime_t utime, ptime, virt_expires, prof_expires;
 	unsigned long long sched_time, sched_expires;
-	struct task_struct *t;
 	struct list_head *timers = sig->cpu_timers;
 
 	/*
@@ -1037,17 +959,9 @@ static void check_process_timers(struct 
 	/*
 	 * Collect the current process totals.
 	 */
-	utime = sig->utime;
-	stime = sig->stime;
-	sched_time = sig->sched_time;
-	t = tsk;
-	do {
-		utime = cputime_add(utime, t->utime);
-		stime = cputime_add(stime, t->stime);
-		sched_time += t->sched_time;
-		t = next_thread(t);
-	} while (t != tsk);
-	ptime = cputime_add(utime, stime);
+	utime = process_utime_sum(sig);
+	ptime = cputime_add(utime, process_stime_sum(sig));
+	sched_time = process_schedtime_sum(sig);
 
 	maxfire = 20;
 	prof_expires = cputime_zero;
@@ -1156,60 +1070,18 @@ static void check_process_timers(struct 
 		}
 	}
 
-	if (!cputime_eq(prof_expires, cputime_zero) ||
-	    !cputime_eq(virt_expires, cputime_zero) ||
-	    sched_expires != 0) {
-		/*
-		 * Rebalance the threads' expiry times for the remaining
-		 * process CPU timers.
-		 */
-
-		cputime_t prof_left, virt_left, ticks;
-		unsigned long long sched_left, sched;
-		const unsigned int nthreads = atomic_read(&sig->live);
-
-		if (!nthreads)
-			return;
-
-		prof_left = cputime_sub(prof_expires, utime);
-		prof_left = cputime_sub(prof_left, stime);
-		prof_left = cputime_div_non_zero(prof_left, nthreads);
-		virt_left = cputime_sub(virt_expires, utime);
-		virt_left = cputime_div_non_zero(virt_left, nthreads);
-		if (sched_expires) {
-			sched_left = sched_expires - sched_time;
-			do_div(sched_left, nthreads);
-			sched_left = max_t(unsigned long long, sched_left, 1);
-		} else {
-			sched_left = 0;
-		}
-		t = tsk;
-		do {
-			if (unlikely(t->flags & PF_EXITING))
-				continue;
-
-			ticks = cputime_add(cputime_add(t->utime, t->stime),
-					    prof_left);
-			if (!cputime_eq(prof_expires, cputime_zero) &&
-			    (cputime_eq(t->it_prof_expires, cputime_zero) ||
-			     cputime_gt(t->it_prof_expires, ticks))) {
-				t->it_prof_expires = ticks;
-			}
-
-			ticks = cputime_add(t->utime, virt_left);
-			if (!cputime_eq(virt_expires, cputime_zero) &&
-			    (cputime_eq(t->it_virt_expires, cputime_zero) ||
-			     cputime_gt(t->it_virt_expires, ticks))) {
-				t->it_virt_expires = ticks;
-			}
-
-			sched = t->sched_time + sched_left;
-			if (sched_expires && (t->it_sched_expires == 0 ||
-					      t->it_sched_expires > sched)) {
-				t->it_sched_expires = sched;
-			}
-		} while ((t = next_thread(t)) != tsk);
-	}
+	if (!cputime_eq(prof_expires, cputime_zero) &&
+	    (cputime_eq(sig->it_prof_expires, cputime_zero) ||
+	     cputime_gt(sig->it_prof_expires, prof_expires)))
+		sig->it_prof_expires = prof_expires;
+	if (!cputime_eq(virt_expires, cputime_zero) &&
+	    (cputime_eq(sig->it_virt_expires, cputime_zero) ||
+	     cputime_gt(sig->it_virt_expires, virt_expires)))
+		sig->it_virt_expires = virt_expires;
+	if (sched_expires != 0 &&
+	    (sig->it_sched_expires == 0 ||
+	     sig->it_sched_expires > sched_expires))
+		sig->it_sched_expires = sched_expires;
 }
 
 /*
@@ -1289,17 +1161,27 @@ void run_posix_cpu_timers(struct task_st
 
 	BUG_ON(!irqs_disabled());
 
-#define UNEXPIRED(clock) \
-		(cputime_eq(tsk->it_##clock##_expires, cputime_zero) || \
-		 cputime_lt(clock##_ticks(tsk), tsk->it_##clock##_expires))
+	if (!tsk->signal)
+		return;
 
-	if (UNEXPIRED(prof) && UNEXPIRED(virt) &&
+	/*
+	 * If neither the running thread nor the process-wide timer has
+	 * expired, do nothing.
+	 */
+	if ((cputime_eq(tsk->it_prof_expires, cputime_zero) ||
+	     cputime_lt(prof_ticks(tsk), tsk->it_prof_expires)) &&
+	    (cputime_eq(tsk->it_virt_expires, cputime_zero) ||
+	     cputime_lt(virt_ticks(tsk), tsk->it_virt_expires)) &&
 	    (tsk->it_sched_expires == 0 ||
-	     tsk->sched_time < tsk->it_sched_expires))
+	     tsk->sched_time < tsk->it_sched_expires) &&
+	    (cputime_eq(tsk->signal->it_prof_expires, cputime_zero) ||
+	     cputime_lt(prof_shared_ticks(tsk), tsk->signal->it_prof_expires)) &&
+	    (cputime_eq(tsk->signal->it_virt_expires, cputime_zero) ||
+	     cputime_lt(virt_shared_ticks(tsk), tsk->signal->it_virt_expires)) &&
+	    (tsk->signal->it_sched_expires == 0 ||
+	     process_schedtime_sum(tsk->signal) < tsk->signal->it_sched_expires))
 		return;
 
-#undef	UNEXPIRED
-
 	/*
 	 * Double-check with locks held.
 	 */
@@ -1398,13 +1280,14 @@ void set_process_cpu_timer(struct task_s
 	    cputime_ge(list_entry(head->next,
 				  struct cpu_timer_list, entry)->expires.cpu,
 		       *newval)) {
-		/*
-		 * Rejigger each thread's expiry time so that one will
-		 * notice before we hit the process-cumulative expiry time.
-		 */
-		union cpu_time_count expires = { .sched = 0 };
-		expires.cpu = *newval;
-		process_timer_rebalance(tsk, clock_idx, expires, now);
+		switch (clock_idx) {
+		case CPUCLOCK_PROF:
+			tsk->signal->it_prof_expires = *newval;
+			break;
+		case CPUCLOCK_VIRT:
+			tsk->signal->it_virt_expires = *newval;
+			break;
+		}
 	}
 }
 
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/sched.c linux-2.6.18.5/kernel/sched.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/sched.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/sched.c	2008-03-20 11:51:38.000000000 -0700
@@ -2901,7 +2901,20 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 static inline void
 update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now)
 {
-	p->sched_time += now - max(p->timestamp, rq->timestamp_last_tick);
+	unsigned long long tmp;
+
+	tmp = now - max(p->timestamp, rq->timestamp_last_tick);
+	p->sched_time += tmp;
+	/* Add our time to the shared field. */
+	if (p->signal && p->signal->process_times_percpu) {
+		int cpu;
+		struct process_times_percpu_struct *process_times_percpu;
+
+		cpu = get_cpu();
+		process_times_percpu = per_cpu_ptr(p->signal->process_times_percpu, cpu);
+		process_times_percpu->sched_time += tmp;
+		put_cpu_no_resched();
+	}
 }
 
 /*
@@ -2955,6 +2968,17 @@ void account_user_time(struct task_struc
 
 	p->utime = cputime_add(p->utime, cputime);
 
+	/* Add our time to the shared field. */
+	if (p->signal && p->signal->process_times_percpu) {
+		int cpu;
+		struct process_times_percpu_struct *process_times_percpu;
+
+		cpu = get_cpu();
+		process_times_percpu = per_cpu_ptr(p->signal->process_times_percpu, cpu);
+		process_times_percpu->utime =
+			cputime_add(process_times_percpu->utime, cputime);
+		put_cpu_no_resched();
+	}
 	/* Add user time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
 	if (TASK_NICE(p) > 0)
@@ -2978,6 +3002,17 @@ void account_system_time(struct task_str
 
 	p->stime = cputime_add(p->stime, cputime);
 
+	/* Add our time to the shared field. */
+	if (p->signal && p->signal->process_times_percpu) {
+		int cpu;
+		struct process_times_percpu_struct *process_times_percpu;
+
+		cpu = get_cpu();
+		process_times_percpu = per_cpu_ptr(p->signal->process_times_percpu, cpu);
+		process_times_percpu->stime =
+			cputime_add(process_times_percpu->stime, cputime);
+		put_cpu_no_resched();
+	}
 	/* Add system time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
 	if (hardirq_count() - hardirq_offset)
diff -rup /home/fmayhar/Static/linux-2.6.18.5/kernel/sys.c linux-2.6.18.5/kernel/sys.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/sys.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/sys.c	2008-03-20 11:46:42.000000000 -0700
@@ -1207,19 +1207,28 @@ asmlinkage long sys_times(struct tms __u
 	if (tbuf) {
 		struct tms tmp;
 		struct task_struct *tsk = current;
-		struct task_struct *t;
 		cputime_t utime, stime, cutime, cstime;
 
 		spin_lock_irq(&tsk->sighand->siglock);
-		utime = tsk->signal->utime;
-		stime = tsk->signal->stime;
-		t = tsk;
-		do {
-			utime = cputime_add(utime, t->utime);
-			stime = cputime_add(stime, t->stime);
-			t = next_thread(t);
-		} while (t != tsk);
+		/*
+		 * If a POSIX interval timer is running use the process-wide
+		 * fields, else fall back to brute force.
+		 */
+		if (tsk->signal->process_times_percpu) {
+			utime = process_utime_sum(tsk->signal);
+			stime = process_stime_sum(tsk->signal);
+		}
+		else {
+			struct task_struct *t;
 
+			utime = tsk->signal->utime;
+			stime = tsk->signal->stime;
+			t = tsk;
+			do {
+				utime = cputime_add(utime, t->utime);
+				stime = cputime_add(stime, t->stime);
+			} while_each_thread(tsk, t);
+		}
 		cutime = tsk->signal->cutime;
 		cstime = tsk->signal->cstime;
 		spin_unlock_irq(&tsk->sighand->siglock);
@@ -1924,8 +1933,7 @@ static void k_getrusage(struct task_stru
 				r->ru_nivcsw += t->nivcsw;
 				r->ru_minflt += t->min_flt;
 				r->ru_majflt += t->maj_flt;
-				t = next_thread(t);
-			} while (t != p);
+			} while_each_thread(p, t);
 			break;
 
 		default:

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-03-21  7:18                                     ` Roland McGrath
  2008-03-21 17:57                                       ` Frank Mayhar
@ 2008-03-21 20:40                                       ` Frank Mayhar
  1 sibling, 0 replies; 51+ messages in thread
From: Frank Mayhar @ 2008-03-21 20:40 UTC (permalink / raw)
  To: Roland McGrath; +Cc: linux-kernel

On Fri, 2008-03-21 at 00:18 -0700, Roland McGrath wrote:
> I think I misled you about the use of the it_*_expires fields, sorry.
> The task_struct.it_*_expires fields are used solely as a cache of the
> head of cpu_timers[].  Despite the poor choice of the same name, the
> signal_struct.it_*_expires fields serve a different purpose.  For an
> analogous cache of the soonest timer to expire, you need to add new
> fields.  The signal_struct.it_{prof,virt}_{expires,incr} fields hold
> the setitimer settings for ITIMER_{PROF,VTALRM}.  You can't change
> those in arm_timer.  For a quick cache you need a new field that is
> the sooner of it_foo_expires or the head cpu_timers[foo] expiry time.

Actually, after looking at the code again and thinking about it a bit,
it appears that the signal_struct.it_*_incr field holds the actual
interval as set by setitimer.  Initially the it_*_expires field holds
the expiration time as set by setitimer, but after the timer fires the
first time that value becomes <firing time>+it_*_incr.  In other words,
the timer first fires at the value set by setitimer(), but from then on
it fires at the time of the previous firing plus the value in it_*_incr.
This time is stored in signal_struct.it_*_expires.
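
That is, the firing path behaves roughly like this (a sketch of the
logic as I read it, not the literal code):

	if (cputime_ge(ptime, sig->it_prof_expires)) {
		/* Timer fires; SIGPROF goes to the group here. */
		if (cputime_eq(sig->it_prof_incr, cputime_zero))
			sig->it_prof_expires = cputime_zero;	/* one-shot */
		else
			sig->it_prof_expires =
				cputime_add(ptime, sig->it_prof_incr);
	}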

I guess I could be wrong about this, but it appears to be what the code
is doing.  If my analysis is correct, I really don't need a new field,
since the old fields work just fine.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-03-21 17:57                                       ` Frank Mayhar
@ 2008-03-22 21:58                                         ` Roland McGrath
  2008-03-24 17:34                                           ` Frank Mayhar
  2008-03-28  0:52                                           ` [PATCH 2.6.25-rc6] Fix itimer/many thread hang Frank Mayhar
  0 siblings, 2 replies; 51+ messages in thread
From: Roland McGrath @ 2008-03-22 21:58 UTC (permalink / raw)
  To: Frank Mayhar; +Cc: linux-kernel

> I would really like to just ignore the 2-cpu scenario and just have two
> versions, the UP version and the n-way SMP version.  It would make life,
> and maintenance, simpler.

Like I've said, it's only something to investigate for best performance.
If the conditional code is encapsulated well, it will be simple to add
another variant later and experiment with it.

> Okay, I'll go back over this and make sure I got it right.  It's
> interesting, though, that my current patch (written without this
> particular bit of knowledge) actually performs no differently from the
> existing mechanism.

Except for correctness in scenarios other than the one you are testing. :-)

> 	Testing was performed using a heavily-modified version of the test
> 	that originally showed the problem.  The test sets ITIMER_PROF (if
> 	not run with "nohang" in the name of the executable) [...]

There are several important scenarios you did not test.
Analysis of combinations of all these variables is needed.
1. Tests with a few threads, like as many threads as CPUs or only 2x as many.
2. Tests with a process CPU timer set for a long expiration time.
   i.e. a timer set, but that never goes off in your entire run.
   (This is what a non-infinity RLIMIT_CPU limit does.)
   With the old code, a long enough timer and a small enough number
   of threads will never trigger a "rebalance".

> Actually, after looking at the code again and thinking about it a bit,
> it appears that the signal_struct.it_*_incr field holds the actual
> interval as set by setitimer.  Initially the it_*_expires field holds
> the expiration time as set by setitimer, but after the timer fires the
> first time that value becomes <firing time>+it_*_incr.  In other words,
> the first time it fires at the value set by setitimer() but from then on
> it fires at a time indicated by whatever the time was the last time the
> timer fired plus the value in it_*_incr.  This time is stored in
> signal_struct.it_*_expires.

That's correct.  The it_*_expires fields store itimerval.it_value (the
current timer) and the it_*_incr fields store itimerval.it_interval (the
timer reload setting).

> I guess I could be wrong about this, but it appears to be what the code
> is doing.  If my analysis is correct, I really don't need a new field,
> since the old fields work just fine.

The analysis above is correct but your conclusion here is wrong.
The current value of an itimer is a user feature, not just a piece
of internal bookkeeping.

getitimer returns in it_value the amount of time until the itimer
fires, regardless of whether or not it will reload after it fires or
with what value it will be reloaded.  In a setitimer call, the
it_value sets the time at which the itimer must fire, regardless of
the reload setting in it_interval.  Consider the case where
it_interval={0,0}; it_value is still meaningful.  
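
A plain userland illustration (using only the documented API):

	struct itimerval set = {
		.it_value    = { 5, 0 },	/* fire after 5s of CPU time */
		.it_interval = { 0, 0 },	/* one-shot: no reload */
	};
	struct itimerval cur;

	setitimer(ITIMER_PROF, &set, NULL);
	/* ... consume some CPU time ... */
	getitimer(ITIMER_PROF, &cur);
	/* cur.it_value is the CPU time still remaining before the
	 * one-shot fires; it must count down even though it_interval
	 * is {0,0}. */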

Your code causes any timer_settime or timer_delete call on a process
CPU timer or any setrlimit call on RLIMIT_CPU to suddenly change the
itimer setting just as if the user had made some setitimer call that
was never made or intended.  That's wrong.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-03-22 21:58                                         ` Roland McGrath
@ 2008-03-24 17:34                                           ` Frank Mayhar
  2008-03-24 22:43                                             ` Frank Mayhar
  2008-03-31  5:44                                             ` Roland McGrath
  2008-03-28  0:52                                           ` [PATCH 2.6.25-rc6] Fix itimer/many thread hang Frank Mayhar
  1 sibling, 2 replies; 51+ messages in thread
From: Frank Mayhar @ 2008-03-24 17:34 UTC (permalink / raw)
  To: Roland McGrath; +Cc: linux-kernel

On Sat, 2008-03-22 at 14:58 -0700, Roland McGrath wrote:
> > I would really like to just ignore the 2-cpu scenario and just have two
> > versions, the UP version and the n-way SMP version.  It would make life,
> > and maintenance, simpler.
> Like I've said, it's only something to investigate for best performance.
> If the conditional code is encapsulated well, it will be simple to add
> another variant later and experiment with it.

Well, if it's acceptable, for a first cut (and the patch I'll submit),
I'll handle the UP and SMP cases, encapsulating them in sched.h in such
a way as to make them invisible (as much as possible) to the rest of
the code.

> There are several important scenarios you did not test.
> Analysis of combinations of all these variables is needed.
> 1. Tests with a few threads, like as many threads as CPUs or only 2x as many.

I've actually done this, although I didn't find the numbers particularly
interesting.  I'll do it again and keep the numbers, though.

> 2. Tests with a process CPU timer set for a long expiration time.
>    i.e. a timer set, but that never goes off in your entire run.
>    (This is what a non-infinity RLIMIT_CPU limit does.)
>    With the old code, a long enough timer and a small enough number
>    of threads will never trigger a "rebalance".

I'll do this at some point.

> > I guess I could be wrong about this, but it appears to be what the code
> > is doing.  If my analysis is correct, I really don't need a new field,
> > since the old fields work just fine.
> 
> The analysis above is correct but your conclusion here is wrong.
> The current value of an itimer is a user feature, not just a piece
> of internal bookkeeping.

After looking at the code again, I now understand what you're talking
about.  You overloaded it_*_expires to support both the POSIX interval
timers and RLIMIT_CPU.  So the way I have things, setting one can stomp
the other.

> Your code causes any timer_settime or timer_delete call on a process
> CPU timer or any setrlimit call on RLIMIT_CPU to suddenly change the
> itimer setting just as if the user had made some setitimer call that
> was never made or intended.  That's wrong.

Right, because the original effect was to only set the it_*_expires on
each individual task struct, leaving the one in the signal struct alone.

Might it be cleaner to handle the RLIMIT_CPU stuff separately, rather
than rolling it into the itimer handling?
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-03-24 17:34                                           ` Frank Mayhar
@ 2008-03-24 22:43                                             ` Frank Mayhar
  2008-03-31  5:44                                             ` Roland McGrath
  1 sibling, 0 replies; 51+ messages in thread
From: Frank Mayhar @ 2008-03-24 22:43 UTC (permalink / raw)
  To: Roland McGrath; +Cc: linux-kernel

On Mon, 2008-03-24 at 10:34 -0700, Frank Mayhar wrote:
> On Sat, 2008-03-22 at 14:58 -0700, Roland McGrath wrote:
> > The analysis above is correct but your conclusion here is wrong.
> > The current value of an itimer is a user feature, not just a piece
> > of internal bookkeeping.
> 
> After looking at the code again, I now understand what you're talking
> about.  You overloaded it_*_expires to support both the POSIX interval
> timers and RLIMIT_CPU.  So the way I have things, setting one can stomp
> the other.
> 
> > Your code causes any timer_settime or timer_delete call on a process
> > CPU timer or any setrlimit call on RLIMIT_CPU to suddenly change the
> > itimer setting just as if the user had made some setitimer call that
> > was never made or intended.  That's wrong.
> 
> Right, because the original effect was to only set the it_*_expires on
> each individual task struct, leaving the one in the signal struct alone.
> 
> Might it be cleaner to handle the RLIMIT_CPU stuff separately, rather
> than rolling it into the itimer handling?

Okay, my proposed fix for this is to introduce a new field in
signal_struct, rlim_expires, a cputime_t.  Everywhere that the
RLIMIT_CPU code formerly set it_prof_expires it will now set
rlim_expires and in run_posix_cpu_timers() I'll check it against the
thread group prof_time.
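
Roughly (a sketch of the intended check; thread_group_cputime() here is
assumed to fill in the group totals as discussed):

	struct thread_group_cputime times;

	thread_group_cputime(&times, tsk->signal);
	if (!cputime_eq(tsk->signal->rlim_expires, cputime_zero) &&
	    cputime_ge(cputime_add(times.utime, times.stime),
		       tsk->signal->rlim_expires)) {
		/* RLIMIT_CPU reached: SIGXCPU (or SIGKILL at the hard
		 * limit), as the existing rlimit handling does. */
	}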

I believe that will solve the problem, if I understand this
correctly.  If I don't, I trust that you will set me straight. :-)
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 2.6.25-rc6] Fix itimer/many thread hang.
  2008-03-22 21:58                                         ` Roland McGrath
  2008-03-24 17:34                                           ` Frank Mayhar
@ 2008-03-28  0:52                                           ` Frank Mayhar
  2008-03-28 10:28                                             ` Ingo Molnar
  2008-03-28 22:46                                             ` [PATCH 2.6.25-rc7 resubmit] " Frank Mayhar
  1 sibling, 2 replies; 51+ messages in thread
From: Frank Mayhar @ 2008-03-28  0:52 UTC (permalink / raw)
  To: Roland McGrath; +Cc: linux-kernel

This is my official first cut at a patch that will fix bug 9906, "Weird
hang with NPTL and SIGPROF."  The problem is that run_posix_cpu_timers()
repeatedly walks the entire thread group every time it runs, which happens
in interrupt context.  With heavy load and lots of threads, this can take
longer than the tick, at which point the kernel stops doing anything but
servicing clock ticks and the occasional interrupt.  Many thanks to
Roland McGrath for his help in my attempt to understand his code.

The change adds a new structure to the signal_struct,
thread_group_cputime.  On an SMP kernel, this is allocated as a percpu
structure when needed (from do_setitimer()) using the alloc_percpu()
mechanism.  It is manipulated via a set of inline functions and macros
defined in sched.h: thread_group_times_init(),
thread_group_times_free(), thread_group_times_alloc(),
thread_group_update() (the macro) and thread_group_cputime().  The
thread_group_update macro is used to update a single field of the
thread_group_cputime structure when needed; the thread_group_cputime()
function sums the fields for each cpu into a passed structure.

In the uniprocessor case, the thread_group_cputime structure becomes a
simple substructure of the signal_struct; allocation and freeing go away,
and updating and "summing" become simple assignments.

In addition to fixing the hang, this change removes the overloading of
it_prof_expires for RLIMIT_CPU handling, replacing it with a new field,
rlim_expires, which is checked instead.  This makes the code simpler and
more straightforward.

I've made some decisions in this work that could have gone in different
directions and I'm certainly happy to entertain comments and criticisms.
Performance with this fix is at least as good as before and in a few
cases is slightly improved, possibly due to the reduced tick overhead.

Signed-off-by:  Frank Mayhar <fmayhar@google.com>

 include/linux/sched.h     |  172 ++++++++++++++++++++++++++++
 kernel/compat.c           |   31 ++++--
 kernel/fork.c             |   22 +---
 kernel/itimer.c           |   40 ++++---
 kernel/posix-cpu-timers.c |  271 +++++++++++++--------------------------------
 kernel/sched.c            |    4 +
 kernel/sched_fair.c       |    2 +
 kernel/sched_rt.c         |    2 +
 kernel/sys.c              |   41 ++++---
 security/selinux/hooks.c  |    4 +-
 10 files changed, 333 insertions(+), 256 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fed07d0..8d1b19d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -424,6 +424,18 @@ struct pacct_struct {
 };
 
 /*
+ * This structure contains the versions of utime, stime and sum_exec_runtime
+ * that are shared across threads within a process.  It's only used for
+ * interval timers and is allocated via alloc_percpu() in the signal
+ * structure when such a timer is set up.
+ */
+struct thread_group_cputime {
+	cputime_t utime;
+	cputime_t stime;
+	unsigned long long sum_exec_runtime;
+};
+
+/*
  * NOTE! "signal_struct" does not have it's own
  * locking, because a shared signal_struct always
  * implies a shared sighand_struct, so locking
@@ -468,6 +480,12 @@ struct signal_struct {
 	cputime_t it_prof_expires, it_virt_expires;
 	cputime_t it_prof_incr, it_virt_incr;
 
+	/* Scheduling timer for the process */
+	unsigned long long it_sched_expires;
+
+	/* RLIMIT_CPU timer for the process */
+	cputime_t rlim_expires;
+
 	/* job control IDs */
 
 	/*
@@ -492,6 +510,13 @@ struct signal_struct {
 
 	struct tty_struct *tty; /* NULL if no tty */
 
+	/* Process-wide times for POSIX interval timing.  Per CPU. */
+#ifdef CONFIG_SMP
+	struct thread_group_cputime *thread_group_times;
+#else
+	struct thread_group_cputime thread_group_times;
+#endif
+
 	/*
 	 * Cumulative resource counters for dead threads in the group,
 	 * and for reaped dead child processes forked by this group.
@@ -1978,6 +2003,153 @@ static inline int spin_needbreak(spinlock_t *lock)
 #endif
 }
 
+#define thread_group_runtime_add(a, b) ((a) + (b))
+
+#ifdef CONFIG_SMP
+
+static inline void thread_group_times_init(struct signal_struct *sig)
+{
+	sig->thread_group_times = NULL;
+}
+
+static inline void thread_group_times_free(struct signal_struct *sig)
+{
+	if (sig->thread_group_times)
+		free_percpu(sig->thread_group_times);
+}
+
+/*
+ * Allocate the thread_group_cputime struct appropriately and fill in the current
+ * values of the fields.  Called from do_setitimer() when setting an interval
+ * timer (ITIMER_PROF or ITIMER_VIRTUAL).  Assumes interrupts are enabled when
+ * it's called.  Note that there is no corresponding deallocation done from
+ * do_setitimer(); the structure is freed at process exit.
+ */
+static inline int thread_group_times_alloc(struct task_struct *tsk)
+{
+	struct signal_struct *sig = tsk->signal;
+	struct thread_group_cputime *thread_group_times;
+	struct task_struct *t;
+	cputime_t utime, stime;
+	unsigned long long sum_exec_runtime;
+
+	/*
+	 * If we don't already have a thread_group_cputime struct, allocate
+	 * one and fill it in with the accumulated times.
+	 */
+	if (sig->thread_group_times)
+		return 0;
+	thread_group_times = alloc_percpu(struct thread_group_cputime);
+	if (thread_group_times == NULL)
+		return -ENOMEM;
+	read_lock(&tasklist_lock);
+	spin_lock_irq(&tsk->sighand->siglock);
+	if (sig->thread_group_times) {
+		spin_unlock_irq(&tsk->sighand->siglock);
+		read_unlock(&tasklist_lock);
+		free_percpu(thread_group_times);
+		return 0;
+	}
+	sig->thread_group_times = thread_group_times;
+	utime = sig->utime;
+	stime = sig->stime;
+	sum_exec_runtime = tsk->se.sum_exec_runtime;
+	t = tsk;
+	do {
+		utime = cputime_add(utime, t->utime);
+		stime = cputime_add(stime, t->stime);
+		sum_exec_runtime += t->se.sum_exec_runtime;
+	} while_each_thread(tsk, t);
+	thread_group_times = per_cpu_ptr(sig->thread_group_times, get_cpu());
+	thread_group_times->utime = utime;
+	thread_group_times->stime = stime;
+	thread_group_times->sum_exec_runtime = sum_exec_runtime;
+	put_cpu_no_resched();
+	spin_unlock_irq(&tsk->sighand->siglock);
+	read_unlock(&tasklist_lock);
+	return 0;
+}
+
+#define thread_group_update(sig, field, val, op) ({ \
+	if (sig && sig->thread_group_times) {				\
+		int cpu;						\
+		struct thread_group_cputime *thread_group_times;	\
+									\
+		cpu = get_cpu();					\
+		thread_group_times =					\
+			per_cpu_ptr(sig->thread_group_times, cpu);	\
+		thread_group_times->field =				\
+			op(thread_group_times->field, val);		\
+		put_cpu_no_resched();					\
+	}								\
+})
+
+/*
+ * Sum the time fields across all running CPUs.
+ */
+static inline int thread_group_cputime(struct thread_group_cputime *thread_group_times,
+	struct signal_struct *sig)
+{
+	int i;
+	struct thread_group_cputime *tg_times;
+	cputime_t utime = cputime_zero;
+	cputime_t stime = cputime_zero;
+	unsigned long long sum_exec_runtime = 0;
+
+	if (!sig->thread_group_times)
+		return(0);
+	for_each_online_cpu(i) {
+		tg_times = per_cpu_ptr(sig->thread_group_times, i);
+		utime = cputime_add(utime, tg_times->utime);
+		stime = cputime_add(stime, tg_times->stime);
+		sum_exec_runtime += tg_times->sum_exec_runtime;
+	}
+	thread_group_times->utime = utime;
+	thread_group_times->stime = stime;
+	thread_group_times->sum_exec_runtime = sum_exec_runtime;
+	return(1);
+}
+
+#else /* CONFIG_SMP */
+
+static inline void thread_group_times_init(struct signal_struct *sig)
+{
+}
+
+static inline void thread_group_times_free(struct signal_struct *sig)
+{
+}
+
+/*
+ * Allocate the thread_group_cputime struct appropriately and fill in the current
+ * values of the fields.  Called from do_setitimer() when setting an interval
+ * timer (ITIMER_PROF or ITIMER_VIRTUAL).  Assumes interrupts are enabled when
+ * it's called.  Note that there is no corresponding deallocation done from
+ * do_setitimer(); the structure is freed at process exit.
+ */
+static inline int thread_group_times_alloc(struct task_struct *tsk)
+{
+	return 0;
+}
+
+#define thread_group_update(sig, field, val, op) ({ \
+	if (sig)							\
+		sig->thread_group_times.field =				\
+			op(sig->thread_group_times.field, val);		\
+})
+
+/*
+ * Sum the time fields across all running CPUs.
+ */
+static inline int thread_group_cputime(struct thread_group_cputime *thread_group_times,
+	struct signal_struct *sig)
+{
+	*thread_group_times = sig->thread_group_times;
+	return(1);
+}
+
+#endif /* CONFIG_SMP */
+
 /*
  * Reevaluate whether the task has signals pending delivery.
  * Wake the task if so.
diff --git a/kernel/compat.c b/kernel/compat.c
index 5f0e201..5c80f32 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -153,6 +153,8 @@ asmlinkage long compat_sys_setitimer(int which,
 
 asmlinkage long compat_sys_times(struct compat_tms __user *tbuf)
 {
+	struct thread_group_cputime thread_group_times;
+
 	/*
 	 *	In the SMP world we might just be unlucky and have one of
 	 *	the times increment as we use it. Since the value is an
@@ -162,18 +164,28 @@ asmlinkage long compat_sys_times(struct compat_tms __user *tbuf)
 	if (tbuf) {
 		struct compat_tms tmp;
 		struct task_struct *tsk = current;
-		struct task_struct *t;
 		cputime_t utime, stime, cutime, cstime;
 
 		read_lock(&tasklist_lock);
-		utime = tsk->signal->utime;
-		stime = tsk->signal->stime;
-		t = tsk;
-		do {
-			utime = cputime_add(utime, t->utime);
-			stime = cputime_add(stime, t->stime);
-			t = next_thread(t);
-		} while (t != tsk);
+		/*
+		 * If a POSIX interval timer is running use the process-wide
+		 * fields, else fall back to brute force.
+		 */
+		if (thread_group_cputime(&thread_group_times, tsk->signal)) {
+			utime = thread_group_times.utime;
+			stime = thread_group_times.stime;
+		}
+		else {
+			struct task_struct *t;
+
+			utime = tsk->signal->utime;
+			stime = tsk->signal->stime;
+			t = tsk;
+			do {
+				utime = cputime_add(utime, t->utime);
+				stime = cputime_add(stime, t->stime);
+			} while_each_thread(tsk, t);
+		}
 
 		/*
 		 * While we have tasklist_lock read-locked, no dying thread
@@ -1081,4 +1093,3 @@ compat_sys_sysinfo(struct compat_sysinfo __user *info)
 
 	return 0;
 }
-
diff --git a/kernel/fork.c b/kernel/fork.c
index dd249c3..e05d224 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -914,10 +914,14 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->it_virt_incr = cputime_zero;
 	sig->it_prof_expires = cputime_zero;
 	sig->it_prof_incr = cputime_zero;
+	sig->it_sched_expires = 0;
+	sig->rlim_expires = cputime_zero;
 
 	sig->leader = 0;	/* session leadership doesn't inherit */
 	sig->tty_old_pgrp = NULL;
 
+	thread_group_times_init(sig);
+
 	sig->utime = sig->stime = sig->cutime = sig->cstime = cputime_zero;
 	sig->gtime = cputime_zero;
 	sig->cgtime = cputime_zero;
@@ -939,7 +943,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 		 * New sole thread in the process gets an expiry time
 		 * of the whole CPU time limit.
 		 */
-		tsk->it_prof_expires =
+		sig->rlim_expires =
 			secs_to_cputime(sig->rlim[RLIMIT_CPU].rlim_cur);
 	}
 	acct_init_pacct(&sig->pacct);
@@ -952,6 +956,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 void __cleanup_signal(struct signal_struct *sig)
 {
 	exit_thread_group_keys(sig);
+	thread_group_times_free(sig);
 	kmem_cache_free(signal_cachep, sig);
 }
 
@@ -1311,21 +1316,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	if (clone_flags & CLONE_THREAD) {
 		p->group_leader = current->group_leader;
 		list_add_tail_rcu(&p->thread_group, &p->group_leader->thread_group);
-
-		if (!cputime_eq(current->signal->it_virt_expires,
-				cputime_zero) ||
-		    !cputime_eq(current->signal->it_prof_expires,
-				cputime_zero) ||
-		    current->signal->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY ||
-		    !list_empty(&current->signal->cpu_timers[0]) ||
-		    !list_empty(&current->signal->cpu_timers[1]) ||
-		    !list_empty(&current->signal->cpu_timers[2])) {
-			/*
-			 * Have child wake up on its first tick to check
-			 * for process CPU timers.
-			 */
-			p->it_prof_expires = jiffies_to_cputime(1);
-		}
 	}
 
 	if (likely(p->pid)) {
diff --git a/kernel/itimer.c b/kernel/itimer.c
index ab98274..8310db2 100644
--- a/kernel/itimer.c
+++ b/kernel/itimer.c
@@ -60,12 +60,11 @@ int do_getitimer(int which, struct itimerval *value)
 		cval = tsk->signal->it_virt_expires;
 		cinterval = tsk->signal->it_virt_incr;
 		if (!cputime_eq(cval, cputime_zero)) {
-			struct task_struct *t = tsk;
-			cputime_t utime = tsk->signal->utime;
-			do {
-				utime = cputime_add(utime, t->utime);
-				t = next_thread(t);
-			} while (t != tsk);
+			struct thread_group_cputime thread_group_times;
+			cputime_t utime;
+
+			(void)thread_group_cputime(&thread_group_times, tsk->signal);
+			utime = thread_group_times.utime;
 			if (cputime_le(cval, utime)) { /* about to fire */
 				cval = jiffies_to_cputime(1);
 			} else {
@@ -83,15 +82,12 @@ int do_getitimer(int which, struct itimerval *value)
 		cval = tsk->signal->it_prof_expires;
 		cinterval = tsk->signal->it_prof_incr;
 		if (!cputime_eq(cval, cputime_zero)) {
-			struct task_struct *t = tsk;
-			cputime_t ptime = cputime_add(tsk->signal->utime,
-						      tsk->signal->stime);
-			do {
-				ptime = cputime_add(ptime,
-						    cputime_add(t->utime,
-								t->stime));
-				t = next_thread(t);
-			} while (t != tsk);
+			struct thread_group_cputime thread_group_times;
+			cputime_t ptime;
+
+			(void)thread_group_cputime(&thread_group_times, tsk->signal);
+			ptime = cputime_add(thread_group_times.utime,
+					    thread_group_times.stime);
 			if (cputime_le(cval, ptime)) { /* about to fire */
 				cval = jiffies_to_cputime(1);
 			} else {
@@ -185,6 +181,13 @@ again:
 	case ITIMER_VIRTUAL:
 		nval = timeval_to_cputime(&value->it_value);
 		ninterval = timeval_to_cputime(&value->it_interval);
+		/*
+		 * If he's setting the timer for the first time, we need to
+		 * allocate the percpu area.  It's freed when the process
+		 * exits.
+		 */
+		if (!cputime_eq(nval, cputime_zero))
+			thread_group_times_alloc(tsk);
 		read_lock(&tasklist_lock);
 		spin_lock_irq(&tsk->sighand->siglock);
 		cval = tsk->signal->it_virt_expires;
@@ -209,6 +212,13 @@ again:
 	case ITIMER_PROF:
 		nval = timeval_to_cputime(&value->it_value);
 		ninterval = timeval_to_cputime(&value->it_interval);
+		/*
+		 * If he's setting the timer for the first time, we need to
+		 * allocate the percpu area.  It's freed when the process
+		 * exits.
+		 */
+		if (!cputime_eq(nval, cputime_zero))
+			thread_group_times_alloc(tsk);
 		read_lock(&tasklist_lock);
 		spin_lock_irq(&tsk->sighand->siglock);
 		cval = tsk->signal->it_prof_expires;
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 2eae91f..53a4486 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -227,31 +227,20 @@ static int cpu_clock_sample_group_locked(unsigned int clock_idx,
 					 struct task_struct *p,
 					 union cpu_time_count *cpu)
 {
-	struct task_struct *t = p;
- 	switch (clock_idx) {
+	struct thread_group_cputime thread_group_times;
+
+	(void)thread_group_cputime(&thread_group_times, p->signal);
+	switch (clock_idx) {
 	default:
 		return -EINVAL;
 	case CPUCLOCK_PROF:
-		cpu->cpu = cputime_add(p->signal->utime, p->signal->stime);
-		do {
-			cpu->cpu = cputime_add(cpu->cpu, prof_ticks(t));
-			t = next_thread(t);
-		} while (t != p);
+		cpu->cpu = cputime_add(thread_group_times.utime, thread_group_times.stime);
 		break;
 	case CPUCLOCK_VIRT:
-		cpu->cpu = p->signal->utime;
-		do {
-			cpu->cpu = cputime_add(cpu->cpu, virt_ticks(t));
-			t = next_thread(t);
-		} while (t != p);
+		cpu->cpu = thread_group_times.utime;
 		break;
 	case CPUCLOCK_SCHED:
-		cpu->sched = p->signal->sum_sched_runtime;
-		/* Add in each other live thread.  */
-		while ((t = next_thread(t)) != p) {
-			cpu->sched += t->se.sum_exec_runtime;
-		}
-		cpu->sched += sched_ns(p);
+		cpu->sched = thread_group_times.sum_exec_runtime;
 		break;
 	}
 	return 0;
@@ -472,80 +461,13 @@ void posix_cpu_timers_exit(struct task_struct *tsk)
 }
 void posix_cpu_timers_exit_group(struct task_struct *tsk)
 {
-	cleanup_timers(tsk->signal->cpu_timers,
-		       cputime_add(tsk->utime, tsk->signal->utime),
-		       cputime_add(tsk->stime, tsk->signal->stime),
-		     tsk->se.sum_exec_runtime + tsk->signal->sum_sched_runtime);
-}
+	struct thread_group_cputime thread_group_times;
 
-
-/*
- * Set the expiry times of all the threads in the process so one of them
- * will go off before the process cumulative expiry total is reached.
- */
-static void process_timer_rebalance(struct task_struct *p,
-				    unsigned int clock_idx,
-				    union cpu_time_count expires,
-				    union cpu_time_count val)
-{
-	cputime_t ticks, left;
-	unsigned long long ns, nsleft;
- 	struct task_struct *t = p;
-	unsigned int nthreads = atomic_read(&p->signal->live);
-
-	if (!nthreads)
-		return;
-
-	switch (clock_idx) {
-	default:
-		BUG();
-		break;
-	case CPUCLOCK_PROF:
-		left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
-				       nthreads);
-		do {
-			if (likely(!(t->flags & PF_EXITING))) {
-				ticks = cputime_add(prof_ticks(t), left);
-				if (cputime_eq(t->it_prof_expires,
-					       cputime_zero) ||
-				    cputime_gt(t->it_prof_expires, ticks)) {
-					t->it_prof_expires = ticks;
-				}
-			}
-			t = next_thread(t);
-		} while (t != p);
-		break;
-	case CPUCLOCK_VIRT:
-		left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
-				       nthreads);
-		do {
-			if (likely(!(t->flags & PF_EXITING))) {
-				ticks = cputime_add(virt_ticks(t), left);
-				if (cputime_eq(t->it_virt_expires,
-					       cputime_zero) ||
-				    cputime_gt(t->it_virt_expires, ticks)) {
-					t->it_virt_expires = ticks;
-				}
-			}
-			t = next_thread(t);
-		} while (t != p);
-		break;
-	case CPUCLOCK_SCHED:
-		nsleft = expires.sched - val.sched;
-		do_div(nsleft, nthreads);
-		nsleft = max_t(unsigned long long, nsleft, 1);
-		do {
-			if (likely(!(t->flags & PF_EXITING))) {
-				ns = t->se.sum_exec_runtime + nsleft;
-				if (t->it_sched_expires == 0 ||
-				    t->it_sched_expires > ns) {
-					t->it_sched_expires = ns;
-				}
-			}
-			t = next_thread(t);
-		} while (t != p);
-		break;
-	}
+	(void)thread_group_cputime(&thread_group_times, tsk->signal);
+	cleanup_timers(tsk->signal->cpu_timers,
+		       thread_group_times.utime,
+		       thread_group_times.stime,
+		       thread_group_times.sum_exec_runtime);
 }
 
 static void clear_dead_task(struct k_itimer *timer, union cpu_time_count now)
@@ -642,24 +564,18 @@ static void arm_timer(struct k_itimer *timer, union cpu_time_count now)
 				    cputime_lt(p->signal->it_virt_expires,
 					       timer->it.cpu.expires.cpu))
 					break;
-				goto rebalance;
+				p->signal->it_virt_expires = timer->it.cpu.expires.cpu;
+				break;
 			case CPUCLOCK_PROF:
 				if (!cputime_eq(p->signal->it_prof_expires,
 						cputime_zero) &&
 				    cputime_lt(p->signal->it_prof_expires,
 					       timer->it.cpu.expires.cpu))
 					break;
-				i = p->signal->rlim[RLIMIT_CPU].rlim_cur;
-				if (i != RLIM_INFINITY &&
-				    i <= cputime_to_secs(timer->it.cpu.expires.cpu))
-					break;
-				goto rebalance;
+				p->signal->it_prof_expires = timer->it.cpu.expires.cpu;
+				break;
 			case CPUCLOCK_SCHED:
-			rebalance:
-				process_timer_rebalance(
-					timer->it.cpu.task,
-					CPUCLOCK_WHICH(timer->it_clock),
-					timer->it.cpu.expires, now);
+				p->signal->it_sched_expires = timer->it.cpu.expires.sched;
 				break;
 			}
 		}
@@ -1053,10 +969,10 @@ static void check_process_timers(struct task_struct *tsk,
 {
 	int maxfire;
 	struct signal_struct *const sig = tsk->signal;
-	cputime_t utime, stime, ptime, virt_expires, prof_expires;
+	cputime_t utime, ptime, virt_expires, prof_expires;
 	unsigned long long sum_sched_runtime, sched_expires;
-	struct task_struct *t;
 	struct list_head *timers = sig->cpu_timers;
+	struct thread_group_cputime thread_group_times;
 
 	/*
 	 * Don't sample the current process CPU clocks if there are no timers.
@@ -1072,17 +988,10 @@ static void check_process_timers(struct task_struct *tsk,
 	/*
 	 * Collect the current process totals.
 	 */
-	utime = sig->utime;
-	stime = sig->stime;
-	sum_sched_runtime = sig->sum_sched_runtime;
-	t = tsk;
-	do {
-		utime = cputime_add(utime, t->utime);
-		stime = cputime_add(stime, t->stime);
-		sum_sched_runtime += t->se.sum_exec_runtime;
-		t = next_thread(t);
-	} while (t != tsk);
-	ptime = cputime_add(utime, stime);
+	(void)thread_group_cputime(&thread_group_times, sig);
+	utime = thread_group_times.utime;
+	ptime = cputime_add(utime, thread_group_times.stime);
+	sum_sched_runtime = thread_group_times.sum_exec_runtime;
 
 	maxfire = 20;
 	prof_expires = cputime_zero;
@@ -1185,66 +1094,24 @@ static void check_process_timers(struct task_struct *tsk,
 			}
 		}
 		x = secs_to_cputime(sig->rlim[RLIMIT_CPU].rlim_cur);
-		if (cputime_eq(prof_expires, cputime_zero) ||
-		    cputime_lt(x, prof_expires)) {
-			prof_expires = x;
+		if (cputime_eq(sig->rlim_expires, cputime_zero) ||
+		    cputime_lt(x, sig->rlim_expires)) {
+			sig->rlim_expires = x;
 		}
 	}
 
-	if (!cputime_eq(prof_expires, cputime_zero) ||
-	    !cputime_eq(virt_expires, cputime_zero) ||
-	    sched_expires != 0) {
-		/*
-		 * Rebalance the threads' expiry times for the remaining
-		 * process CPU timers.
-		 */
-
-		cputime_t prof_left, virt_left, ticks;
-		unsigned long long sched_left, sched;
-		const unsigned int nthreads = atomic_read(&sig->live);
-
-		if (!nthreads)
-			return;
-
-		prof_left = cputime_sub(prof_expires, utime);
-		prof_left = cputime_sub(prof_left, stime);
-		prof_left = cputime_div_non_zero(prof_left, nthreads);
-		virt_left = cputime_sub(virt_expires, utime);
-		virt_left = cputime_div_non_zero(virt_left, nthreads);
-		if (sched_expires) {
-			sched_left = sched_expires - sum_sched_runtime;
-			do_div(sched_left, nthreads);
-			sched_left = max_t(unsigned long long, sched_left, 1);
-		} else {
-			sched_left = 0;
-		}
-		t = tsk;
-		do {
-			if (unlikely(t->flags & PF_EXITING))
-				continue;
-
-			ticks = cputime_add(cputime_add(t->utime, t->stime),
-					    prof_left);
-			if (!cputime_eq(prof_expires, cputime_zero) &&
-			    (cputime_eq(t->it_prof_expires, cputime_zero) ||
-			     cputime_gt(t->it_prof_expires, ticks))) {
-				t->it_prof_expires = ticks;
-			}
-
-			ticks = cputime_add(t->utime, virt_left);
-			if (!cputime_eq(virt_expires, cputime_zero) &&
-			    (cputime_eq(t->it_virt_expires, cputime_zero) ||
-			     cputime_gt(t->it_virt_expires, ticks))) {
-				t->it_virt_expires = ticks;
-			}
-
-			sched = t->se.sum_exec_runtime + sched_left;
-			if (sched_expires && (t->it_sched_expires == 0 ||
-					      t->it_sched_expires > sched)) {
-				t->it_sched_expires = sched;
-			}
-		} while ((t = next_thread(t)) != tsk);
-	}
+	if (!cputime_eq(prof_expires, cputime_zero) &&
+	    (cputime_eq(sig->it_prof_expires, cputime_zero) ||
+	     cputime_gt(sig->it_prof_expires, prof_expires)))
+		sig->it_prof_expires = prof_expires;
+	if (!cputime_eq(virt_expires, cputime_zero) &&
+	    (cputime_eq(sig->it_virt_expires, cputime_zero) ||
+	     cputime_gt(sig->it_virt_expires, virt_expires)))
+		sig->it_virt_expires = virt_expires;
+	if (sched_expires != 0 &&
+	    (sig->it_sched_expires == 0 ||
+	     sig->it_sched_expires > sched_expires))
+		sig->it_sched_expires = sched_expires;
 }
 
 /*
@@ -1321,19 +1188,40 @@ void run_posix_cpu_timers(struct task_struct *tsk)
 {
 	LIST_HEAD(firing);
 	struct k_itimer *timer, *next;
+	struct thread_group_cputime thread_group_times;
+	cputime_t tg_virt, tg_prof;
+	unsigned long long tg_exec_runtime;
 
 	BUG_ON(!irqs_disabled());
 
-#define UNEXPIRED(clock) \
-		(cputime_eq(tsk->it_##clock##_expires, cputime_zero) || \
-		 cputime_lt(clock##_ticks(tsk), tsk->it_##clock##_expires))
+#define UNEXPIRED(p, prof, virt, sched) \
+	((cputime_eq((p)->it_prof_expires, cputime_zero) ||	\
+	 cputime_lt((prof), (p)->it_prof_expires)) &&		\
+	(cputime_eq((p)->it_virt_expires, cputime_zero) ||	\
+	 cputime_lt((virt), (p)->it_virt_expires)) &&		\
+	((p)->it_sched_expires == 0 || (sched) < (p)->it_sched_expires))
 
-	if (UNEXPIRED(prof) && UNEXPIRED(virt) &&
-	    (tsk->it_sched_expires == 0 ||
-	     tsk->se.sum_exec_runtime < tsk->it_sched_expires))
-		return;
+	/*
+	 * If there are no expired thread timers, no expired thread group
+	 * timers and no expired RLIMIT_CPU timer, just return.
+	 */
+	if (UNEXPIRED(tsk, prof_ticks(tsk),
+	    virt_ticks(tsk), tsk->se.sum_exec_runtime)) {
+		if (unlikely(tsk->signal == NULL))
+			return;
+		if (!thread_group_cputime(&thread_group_times, tsk->signal))
+			return;
+		tg_virt = thread_group_times.utime;
+		tg_prof = cputime_add(thread_group_times.utime,
+		    thread_group_times.stime);
+		tg_exec_runtime = thread_group_times.sum_exec_runtime;
+		if ((tsk->signal->rlim[RLIMIT_CPU].rlim_cur == RLIM_INFINITY ||
+		     cputime_lt(tg_prof, tsk->signal->rlim_expires)) &&
+		    UNEXPIRED(tsk->signal, tg_virt, tg_prof, tg_exec_runtime))
+			return;
+	}
 
-#undef	UNEXPIRED
+#undef UNEXPIRED
 
 	/*
 	 * Double-check with locks held.
@@ -1414,14 +1302,6 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
 		if (cputime_eq(*newval, cputime_zero))
 			return;
 		*newval = cputime_add(*newval, now.cpu);
-
-		/*
-		 * If the RLIMIT_CPU timer will expire before the
-		 * ITIMER_PROF timer, we have nothing else to do.
-		 */
-		if (tsk->signal->rlim[RLIMIT_CPU].rlim_cur
-		    < cputime_to_secs(*newval))
-			return;
 	}
 
 	/*
@@ -1433,13 +1313,14 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
 	    cputime_ge(list_first_entry(head,
 				  struct cpu_timer_list, entry)->expires.cpu,
 		       *newval)) {
-		/*
-		 * Rejigger each thread's expiry time so that one will
-		 * notice before we hit the process-cumulative expiry time.
-		 */
-		union cpu_time_count expires = { .sched = 0 };
-		expires.cpu = *newval;
-		process_timer_rebalance(tsk, clock_idx, expires, now);
+		switch (clock_idx) {
+		case CPUCLOCK_PROF:
+			tsk->signal->it_prof_expires = *newval;
+			break;
+		case CPUCLOCK_VIRT:
+			tsk->signal->it_virt_expires = *newval;
+			break;
+		}
 	}
 }
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 28c73f0..1ff1a32 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3594,6 +3594,7 @@ void account_user_time(struct task_struct *p, cputime_t cputime)
 	cputime64_t tmp;
 
 	p->utime = cputime_add(p->utime, cputime);
+	thread_group_update(p->signal, utime, cputime, cputime_add);
 
 	/* Add user time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
@@ -3616,6 +3617,7 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime)
 	tmp = cputime_to_cputime64(cputime);
 
 	p->utime = cputime_add(p->utime, cputime);
+	thread_group_update(p->signal, utime, cputime, cputime_add);
 	p->gtime = cputime_add(p->gtime, cputime);
 
 	cpustat->user = cputime64_add(cpustat->user, tmp);
@@ -3649,6 +3651,7 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 		return account_guest_time(p, cputime);
 
 	p->stime = cputime_add(p->stime, cputime);
+	thread_group_update(p->signal, stime, cputime, cputime_add);
 
 	/* Add system time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
@@ -3690,6 +3693,7 @@ void account_steal_time(struct task_struct *p, cputime_t steal)
 
 	if (p == rq->idle) {
 		p->stime = cputime_add(p->stime, steal);
+		thread_group_update(p->signal, stime, steal, cputime_add);
 		if (atomic_read(&rq->nr_iowait) > 0)
 			cpustat->iowait = cputime64_add(cpustat->iowait, tmp);
 		else
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 86a9337..6f7d5d2 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -353,6 +353,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		struct task_struct *curtask = task_of(curr);
 
 		cpuacct_charge(curtask, delta_exec);
+		thread_group_update(curtask->signal, sum_exec_runtime,
+			delta_exec, thread_group_runtime_add);
 	}
 }
 
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 0a6d2e5..7a2cc40 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -256,6 +256,8 @@ static void update_curr_rt(struct rq *rq)
 	schedstat_set(curr->se.exec_max, max(curr->se.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
+	thread_group_update(curr->signal, sum_exec_runtime,
+		delta_exec, thread_group_runtime_add);
 	curr->se.exec_start = rq->clock;
 	cpuacct_charge(curr, delta_exec);
 
diff --git a/kernel/sys.c b/kernel/sys.c
index a626116..baa3130 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -864,6 +864,8 @@ asmlinkage long sys_setfsgid(gid_t gid)
 
 asmlinkage long sys_times(struct tms __user * tbuf)
 {
+	struct thread_group_cputime thread_group_times;
+
 	/*
 	 *	In the SMP world we might just be unlucky and have one of
 	 *	the times increment as we use it. Since the value is an
@@ -873,19 +875,28 @@ asmlinkage long sys_times(struct tms __user * tbuf)
 	if (tbuf) {
 		struct tms tmp;
 		struct task_struct *tsk = current;
-		struct task_struct *t;
 		cputime_t utime, stime, cutime, cstime;
 
 		spin_lock_irq(&tsk->sighand->siglock);
-		utime = tsk->signal->utime;
-		stime = tsk->signal->stime;
-		t = tsk;
-		do {
-			utime = cputime_add(utime, t->utime);
-			stime = cputime_add(stime, t->stime);
-			t = next_thread(t);
-		} while (t != tsk);
+		/*
+		 * If a POSIX interval timer is running use the process-wide
+		 * fields, else fall back to brute force.
+		 */
+		if (thread_group_cputime(&thread_group_times, tsk->signal)) {
+			utime = thread_group_times.utime;
+			stime = thread_group_times.stime;
+		} else {
+			struct task_struct *t;
 
+			utime = tsk->signal->utime;
+			stime = tsk->signal->stime;
+			t = tsk;
+			do {
+				utime = cputime_add(utime, t->utime);
+				stime = cputime_add(stime, t->stime);
+			} while_each_thread(tsk, t);
+		}
 		cutime = tsk->signal->cutime;
 		cstime = tsk->signal->cstime;
 		spin_unlock_irq(&tsk->sighand->siglock);
@@ -1444,7 +1455,7 @@ asmlinkage long sys_old_getrlimit(unsigned int resource, struct rlimit __user *r
 asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim)
 {
 	struct rlimit new_rlim, *old_rlim;
-	unsigned long it_prof_secs;
+	unsigned long rlim_secs;
 	int retval;
 
 	if (resource >= RLIM_NLIMITS)
@@ -1490,15 +1501,11 @@ asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim)
 	if (new_rlim.rlim_cur == RLIM_INFINITY)
 		goto out;
 
-	it_prof_secs = cputime_to_secs(current->signal->it_prof_expires);
-	if (it_prof_secs == 0 || new_rlim.rlim_cur <= it_prof_secs) {
-		unsigned long rlim_cur = new_rlim.rlim_cur;
-		cputime_t cputime;
-
-		cputime = secs_to_cputime(rlim_cur);
+	rlim_secs = cputime_to_secs(current->signal->rlim_expires);
+	if (rlim_secs == 0 || new_rlim.rlim_cur <= rlim_secs) {
 		read_lock(&tasklist_lock);
 		spin_lock_irq(&current->sighand->siglock);
-		set_process_cpu_timer(current, CPUCLOCK_PROF, &cputime, NULL);
+		current->signal->rlim_expires = secs_to_cputime(new_rlim.rlim_cur);
 		spin_unlock_irq(&current->sighand->siglock);
 		read_unlock(&tasklist_lock);
 	}
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 41a049f..62fed13 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2201,7 +2201,7 @@ static void selinux_bprm_post_apply_creds(struct linux_binprm *bprm)
 			 * This will cause RLIMIT_CPU calculations
 			 * to be refigured.
 			 */
-			current->it_prof_expires = jiffies_to_cputime(1);
+			current->signal->rlim_expires = jiffies_to_cputime(1);
 		}
 	}
 
@@ -5624,5 +5624,3 @@ int selinux_disable(void)
 	return 0;
 }
 #endif
-
-

-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH 2.6.25-rc6] Fix itimer/many thread hang.
  2008-03-28  0:52                                           ` [PATCH 2.6.25-rc6] Fix itimer/many thread hang Frank Mayhar
@ 2008-03-28 10:28                                             ` Ingo Molnar
  2008-03-28 22:46                                             ` [PATCH 2.6.25-rc7 resubmit] " Frank Mayhar
  1 sibling, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2008-03-28 10:28 UTC (permalink / raw)
  To: Frank Mayhar
  Cc: Roland McGrath, linux-kernel, Peter Zijlstra, Thomas Gleixner


* Frank Mayhar <fmayhar@google.com> wrote:

> +static inline void thread_group_times_init(struct signal_struct *sig)
> +{
> +	sig->thread_group_times = NULL;
> +}
> +
> +static inline void thread_group_times_free(struct signal_struct *sig)
> +{
> +	if (sig->thread_group_times)
> +		free_percpu(sig->thread_group_times);
> +}
> +
> +/*
> + * Allocate the thread_group_cputime struct appropriately and fill in the current
> + * values of the fields.  Called from do_setitimer() when setting an interval
> + * timer (ITIMER_PROF or ITIMER_VIRTUAL).  Assumes interrupts are enabled when
> + * it's called.  Note that there is no corresponding deallocation done from
> + * do_setitimer(); the structure is freed at process exit.
> + */
> +static inline int thread_group_times_alloc(struct task_struct *tsk)
> +{
> +	struct signal_struct *sig = tsk->signal;
> +	struct thread_group_cputime *thread_group_times;
> +	struct task_struct *t;
> +	cputime_t utime, stime;
> +	unsigned long long sum_exec_runtime;
> +
> +	/*
> +	 * If we don't already have a thread_group_cputime struct, allocate
> +	 * one and fill it in with the accumulated times.
> +	 */
> +	if (sig->thread_group_times)
> +		return 0;
> +	thread_group_times = alloc_percpu(struct thread_group_cputime);
> +	if (thread_group_times == NULL)
> +		return -ENOMEM;
> +	read_lock(&tasklist_lock);
> +	spin_lock_irq(&tsk->sighand->siglock);
> +	if (sig->thread_group_times) {
> +		spin_unlock_irq(&tsk->sighand->siglock);
> +		read_unlock(&tasklist_lock);
> +		free_percpu(thread_group_times);
> +		return 0;
> +	}
> +	sig->thread_group_times = thread_group_times;
> +	utime = sig->utime;
> +	stime = sig->stime;
> +	sum_exec_runtime = tsk->se.sum_exec_runtime;
> +	t = tsk;
> +	do {
> +		utime = cputime_add(utime, t->utime);
> +		stime = cputime_add(stime, t->stime);
> +		sum_exec_runtime += t->se.sum_exec_runtime;
> +	} while_each_thread(tsk, t);
> +	thread_group_times = per_cpu_ptr(sig->thread_group_times, get_cpu());
> +	thread_group_times->utime = utime;
> +	thread_group_times->stime = stime;
> +	thread_group_times->sum_exec_runtime = sum_exec_runtime;
> +	put_cpu_no_resched();
> +	spin_unlock_irq(&tsk->sighand->siglock);
> +	read_unlock(&tasklist_lock);
> +	return 0;
> +}

please dont put such a huge inline into a header.

> +
> +#define thread_group_update(sig, field, val, op) ({ \
> +	if (sig && sig->thread_group_times) {				\
> +		int cpu;						\
> +		struct thread_group_cputime *thread_group_times;	\
> +									\
> +		cpu = get_cpu();					\
> +		thread_group_times =					\
> +			per_cpu_ptr(sig->thread_group_times, cpu);	\
> +		thread_group_times->field =				\
> +			op(thread_group_times->field, val);		\
> +		put_cpu_no_resched();					\
> +	}								\
> +})

nor use any macros that includes code.
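
i.e. use a small real function instead - an untested sketch, one
helper per field, naming up to you:

static inline void thread_group_add_utime(struct signal_struct *sig,
					  cputime_t val)
{
	struct thread_group_cputime *tg;

	if (!sig || !sig->thread_group_times)
		return;
	tg = per_cpu_ptr(sig->thread_group_times, get_cpu());
	tg->utime = cputime_add(tg->utime, val);
	put_cpu_no_resched();
}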

> +static inline int thread_group_cputime(struct thread_group_cputime *thread_group_times,
> +	struct signal_struct *sig)

ditto, line length as well.

> +	if (!sig->thread_group_times)
> +		return(0);

return 0 is the proper form - please run your patch through 
scripts/checkpatch.pl.

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 2.6.25-rc7 resubmit] Fix itimer/many thread hang.
  2008-03-28  0:52                                           ` [PATCH 2.6.25-rc6] Fix itimer/many thread hang Frank Mayhar
  2008-03-28 10:28                                             ` Ingo Molnar
@ 2008-03-28 22:46                                             ` Frank Mayhar
  2008-04-01 18:45                                               ` Andrew Morton
  1 sibling, 1 reply; 51+ messages in thread
From: Frank Mayhar @ 2008-03-28 22:46 UTC (permalink / raw)
  To: Roland McGrath; +Cc: linux-kernel

This is my second cut at a patch that will fix bug 9906, "Weird hang
with NPTL and SIGPROF."  Thanks to Ingo Molnar who sent me feedback
regarding the first cut.

Again, the problem is that run_posix_cpu_timers() repeatedly walks the
entire thread group every time it runs, which is at interrupt.  With
heavy load and lots of threads, this can take longer than the tick, at
which point the kernel stops doing anything but servicing clock ticks
and the occasional interrupt.  Many thanks to Roland McGrath for his
help in my attempt to understand his code.
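
To illustrate the arithmetic, here is a userspace toy, not kernel code;
all names below are made up:

/* tick_model.c -- toy model of the per-tick sampling cost.
 *
 * Old scheme: every tick, run_posix_cpu_timers() walks one slot per
 * thread.  New scheme: each tick touches only the current CPU's slot,
 * and a full sample sums one slot per CPU.
 */
#include <stdio.h>

#define NTHREADS 4500	/* a large thread count */
#define NCPUS	 4

static unsigned long long per_thread[NTHREADS];
static unsigned long long per_cpu[NCPUS];

static unsigned long long sum(const unsigned long long *v, int n)
{
	unsigned long long total = 0;
	int i;

	for (i = 0; i < n; i++)
		total += v[i];
	return total;
}

int main(void)
{
	per_thread[0] = per_cpu[0] = 1;	/* charge one tick of CPU time */

	/* Same answer, but 4500 loads per sample vs. 4. */
	printf("old: %llu (%d slots)  new: %llu (%d slots)\n",
	       sum(per_thread, NTHREADS), NTHREADS,
	       sum(per_cpu, NCPUS), NCPUS);
	return 0;
}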

The change adds a new structure to the signal_struct,
thread_group_cputime.  On an SMP kernel, this is allocated as a percpu
structure when needed (from do_setitimer()) using the alloc_percpu()
mechanism.  It is manipulated via a set of functions defined in sched.c
and sched.h.  These new functions are

      * thread_group_times_free(), inline function to free via
        free_percpu() (SMP) or kfree (UP) the thread_group_cputime
        structure.
      * thread_group_times_alloc(), external function to allocate the
        thread_group_cputime structure when needed.
      * thread_group_update(), inline function to update a field of the
        thread_group_cputime structure; generally called at interrupt
        time from the tick handlers.  It depends on the "offsetof()"
        macro to know which field to update and on compiler optimization
        to remove the unused code paths in each case.
      * thread_group_cputime(), inline function that sums the time
        fields for all running CPUs (SMP) or snapshots the time fields
        (UP) into a passed structure; the sketch after this list models
        the idea in userspace.
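
The core idea, modeled in plain userspace C (a sketch only: the array
below stands in for the alloc_percpu()/per_cpu_ptr() machinery and
every name in it is hypothetical):

/* percpu_model.c -- toy model of the thread_group_cputime scheme. */
#include <stdio.h>
#include <stdlib.h>

#define NCPUS 4

struct thread_group_cputime {
	unsigned long utime;
	unsigned long stime;
	unsigned long long sum_exec_runtime;
};

/* Stand-in for alloc_percpu(struct thread_group_cputime). */
static struct thread_group_cputime *tg_alloc(void)
{
	return calloc(NCPUS, sizeof(struct thread_group_cputime));
}

/*
 * Stand-in for thread_group_update(): the tick handler running on
 * CPU 'cpu' charges time to that CPU's slot only -- no locks and no
 * walk of the thread group.
 */
static void tg_charge_utime(struct thread_group_cputime *tg, int cpu,
			    unsigned long ticks)
{
	tg[cpu].utime += ticks;
}

/* Stand-in for thread_group_cputime(): sum the per-CPU slots. */
static void tg_sample(const struct thread_group_cputime *tg,
		      struct thread_group_cputime *out)
{
	int i;

	out->utime = 0;
	out->stime = 0;
	out->sum_exec_runtime = 0;
	for (i = 0; i < NCPUS; i++) {
		out->utime += tg[i].utime;
		out->stime += tg[i].stime;
		out->sum_exec_runtime += tg[i].sum_exec_runtime;
	}
}

int main(void)
{
	struct thread_group_cputime *tg = tg_alloc(), total;

	if (!tg)
		return 1;
	tg_charge_utime(tg, 0, 3);
	tg_charge_utime(tg, 2, 5);
	tg_sample(tg, &total);
	printf("utime=%lu\n", total.utime);	/* utime=8 */
	free(tg);
	return 0;
}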

I've changed the uniprocessor case to retain the dynamic allocation of
the thread_group_cputime structure as needed; this makes the code
somewhat more consistent between SMP and UP and retains the feature of
reducing overhead for processes that don't use interval timers.

In addition to fixing the hang, this change removes the overloading of
it_prof_expires for RLIMIT_CPU handling, replacing it with a new field,
rlim_expires, which is checked instead.  This makes the code simpler and
more straightforward.
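
A toy model of the difference (illustration only; the two-field struct
below is hypothetical and stands in for the much larger signal_struct):

/* rlim_split.c -- why RLIMIT_CPU gets its own expiry field. */
#include <stdio.h>

struct sig_model {
	unsigned long it_prof_expires;	/* ITIMER_PROF deadline only */
	unsigned long rlim_expires;	/* RLIMIT_CPU deadline only */
};

int main(void)
{
	struct sig_model s = { 0, 0 };

	s.rlim_expires = 100;	/* RLIMIT_CPU set to 100 units */
	s.it_prof_expires = 5;	/* ITIMER_PROF armed later; with one
				 * shared field this would have
				 * clobbered the rlimit deadline above */
	printf("prof=%lu rlim=%lu\n", s.it_prof_expires, s.rlim_expires);
	return 0;
}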

The kernel/posix-cpu-timers.c file has changed pretty drastically, with
it no longer using the per-task times to know when to check for timer
expiration.  Instead, it consecutively checks the per-task timers and
then the per-process timers for expiration, consulting the individual
expiration fields (including the new RLIMIT_CPU expiration field) which
are now logically separate.  Rather than performing "rebalancing,"
functions now do simple assignments, and all loops through the thread
group have gone away, replaced with calls to thread_group_cputime().
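
A userspace sketch of the fast path (hypothetical names; it mirrors the
UNEXPIRED() test in the patch below):

/* expiry_check.c -- toy model of the run_posix_cpu_timers() fast
 * path: check the per-task deadlines, then the per-process ones, and
 * bail out early when nothing has expired.
 */
#include <stdbool.h>
#include <stdio.h>

struct deadlines {
	unsigned long prof_expires;	/* 0 == timer not armed */
	unsigned long virt_expires;
	unsigned long long sched_expires;
};

/* An armed deadline in the future, or no deadline at all, counts as
 * "not expired".
 */
static bool unexpired(const struct deadlines *d, unsigned long prof,
		      unsigned long virt, unsigned long long sched)
{
	return (d->prof_expires == 0 || prof < d->prof_expires) &&
	       (d->virt_expires == 0 || virt < d->virt_expires) &&
	       (d->sched_expires == 0 || sched < d->sched_expires);
}

int main(void)
{
	struct deadlines task = { 100, 0, 0 };	/* per-task timers */
	struct deadlines group = { 0, 50, 0 };	/* per-process timers */

	/* per-task clocks at 10/10/10, group clocks at 60/60/60 */
	if (unexpired(&task, 10, 10, 10) && unexpired(&group, 60, 60, 60))
		puts("nothing to do");
	else
		puts("walk the timer lists");	/* group virt timer fired */
	return 0;
}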

Elsewhere, do_getitimer(), compat_sys_times() and sys_times() now use
thread_group_cputime() to get the times if a POSIX interval timer is in
use, providing a faster path in that case.

This version moves the thread_group_times_alloc() routine to sched.c,
changes the thread_group_update() macro to an inline function, shortens
a few things and cleans up the sched.h changes a bit.

Again, performance with the fix is at least as good as before and in a
few cases is slightly improved, possibly due to the reduced tick
overhead.

Finally, the patch has been retargeted to 2.6.25-rc7 instead of -rc6.

Signed-off-by:  Frank Mayhar <fmayhar@google.com>

 include/linux/sched.h     |  117 +++++++++++++++++++
 kernel/compat.c           |   30 ++++--
 kernel/fork.c             |   22 +---
 kernel/itimer.c           |   40 ++++---
 kernel/posix-cpu-timers.c |  276 +++++++++++++--------------------------------
 kernel/sched.c            |   73 ++++++++++++
 kernel/sched_fair.c       |    3 +
 kernel/sched_rt.c         |    3 +
 kernel/sys.c              |   44 +++++---
 security/selinux/hooks.c  |    4 +-
 10 files changed, 354 insertions(+), 258 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6a1e7af..f6052ac 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -424,6 +424,18 @@ struct pacct_struct {
 };
 
 /*
+ * This structure contains the versions of utime, stime and sum_exec_runtime
+ * that are shared across threads within a process.  It's only used for
+ * interval timers and is allocated via alloc_percpu() in the signal
+ * structure when such a timer is set up.
+ */
+struct thread_group_cputime {
+	cputime_t utime;		/* User time. */
+	cputime_t stime;		/* System time. */
+	unsigned long long sum_exec_runtime; /* Scheduler time. */
+};
+
+/*
  * NOTE! "signal_struct" does not have it's own
  * locking, because a shared signal_struct always
  * implies a shared sighand_struct, so locking
@@ -468,6 +480,12 @@ struct signal_struct {
 	cputime_t it_prof_expires, it_virt_expires;
 	cputime_t it_prof_incr, it_virt_incr;
 
+	/* Scheduling timer for the process */
+	unsigned long long it_sched_expires;
+
+	/* RLIMIT_CPU timer for the process */
+	cputime_t rlim_expires;
+
 	/* job control IDs */
 
 	/*
@@ -492,6 +510,9 @@ struct signal_struct {
 
 	struct tty_struct *tty; /* NULL if no tty */
 
+	/* Process-wide times for POSIX interval timing.  Per CPU. */
+	struct thread_group_cputime *thread_group_times;
+
 	/*
 	 * Cumulative resource counters for dead threads in the group,
 	 * and for reaped dead child processes forked by this group.
@@ -1984,6 +2005,102 @@ static inline int spin_needbreak(spinlock_t *lock)
 #endif
 }
 
+#ifdef CONFIG_SMP
+
+static inline void thread_group_times_free(
+	struct thread_group_cputime *tg_times)
+{
+	free_percpu(tg_times);
+}
+
+/*
+ * Sum the time fields across all running CPUs.
+ */
+static inline void thread_group_cputime(
+	struct thread_group_cputime *tg_times,
+	struct signal_struct *sig)
+{
+	int i, cpu;
+	struct thread_group_cputime *tg;
+
+	/*
+	 * Get the values for the current CPU separately so we don't get
+	 * preempted, then sum all the rest.  Remember the starting CPU:
+	 * calling smp_processor_id() again once preemption is re-enabled
+	 * would be racy.
+	 */
+	cpu = get_cpu();
+	tg = per_cpu_ptr(sig->thread_group_times, cpu);
+	*tg_times = *tg;
+	put_cpu_no_resched();
+	for_each_online_cpu(i) {
+		if (i == cpu)
+			continue;
+		tg = per_cpu_ptr(sig->thread_group_times, i);
+		tg_times->utime = cputime_add(tg_times->utime, tg->utime);
+		tg_times->stime = cputime_add(tg_times->stime, tg->stime);
+		tg_times->sum_exec_runtime += tg->sum_exec_runtime;
+	}
+}
+
+#else /* CONFIG_SMP */
+
+static inline void thread_group_times_free(
+	struct thread_group_cputime *tg_times)
+{
+	kfree(tg_times);
+}
+
+/*
+ * Snapshot the time fields.
+ */
+static inline void thread_group_cputime(
+	struct thread_group_cputime *tg_times,
+	struct signal_struct *sig)
+{
+	*tg_times = *sig->thread_group_times;
+}
+
+#endif /* CONFIG_SMP */
+
+/*
+ * Update one of the fields in the thread_group_cputime structure.  This is
+ * passed the offset of the field to be updated (acquired via the "offsetof"
+ * macro) and uses that to determine the actual field.
+ */
+static inline void thread_group_update(struct signal_struct *sig,
+	const int fieldoff, void *val)
+{
+	cputime_t cputime;
+	unsigned long long sum_exec_runtime;
+	struct thread_group_cputime *tg_times;
+
+	if (!sig || !sig->thread_group_times)
+		return;
+#ifdef CONFIG_SMP
+	tg_times = per_cpu_ptr(sig->thread_group_times, get_cpu());
+#else
+	tg_times = sig->thread_group_times;
+#endif
+	switch (fieldoff) {
+	case offsetof(struct thread_group_cputime, utime):
+		cputime = *(cputime_t *)val;
+		tg_times->utime = cputime_add(tg_times->utime, cputime);
+		break;
+	case offsetof(struct thread_group_cputime, stime):
+		cputime = *(cputime_t *)val;
+		tg_times->stime = cputime_add(tg_times->stime, cputime);
+		break;
+	case offsetof(struct thread_group_cputime, sum_exec_runtime):
+		sum_exec_runtime = *(unsigned long long *)val;
+		tg_times->sum_exec_runtime += sum_exec_runtime;
+		break;
+	}
+#ifdef CONFIG_SMP
+	put_cpu_no_resched();
+#endif
+}
+
+/* The thread_group_cputime allocator. */
+extern int thread_group_times_alloc(struct task_struct *);
+
 /*
  * Reevaluate whether the task has signals pending delivery.
  * Wake the task if so.
diff --git a/kernel/compat.c b/kernel/compat.c
index 5f0e201..06a7e3a 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -162,18 +162,29 @@ asmlinkage long compat_sys_times(struct compat_tms __user *tbuf)
 	if (tbuf) {
 		struct compat_tms tmp;
 		struct task_struct *tsk = current;
-		struct task_struct *t;
 		cputime_t utime, stime, cutime, cstime;
+		struct thread_group_cputime thread_group_times;
 
 		read_lock(&tasklist_lock);
-		utime = tsk->signal->utime;
-		stime = tsk->signal->stime;
-		t = tsk;
-		do {
-			utime = cputime_add(utime, t->utime);
-			stime = cputime_add(stime, t->stime);
-			t = next_thread(t);
-		} while (t != tsk);
+		/*
+		 * If a POSIX interval timer is running use the process-wide
+		 * fields, else fall back to brute force.
+		 */
+		if (tsk->signal->thread_group_times) {
+			thread_group_cputime(&thread_group_times, tsk->signal);
+			utime = thread_group_times.utime;
+			stime = thread_group_times.stime;
+		} else {
+			struct task_struct *t;
+
+			utime = tsk->signal->utime;
+			stime = tsk->signal->stime;
+			t = tsk;
+			do {
+				utime = cputime_add(utime, t->utime);
+				stime = cputime_add(stime, t->stime);
+			} while_each_thread(tsk, t);
+		}
 
 		/*
 		 * While we have tasklist_lock read-locked, no dying thread
@@ -1081,4 +1092,3 @@ compat_sys_sysinfo(struct compat_sysinfo __user *info)
 
 	return 0;
 }
-
diff --git a/kernel/fork.c b/kernel/fork.c
index dd249c3..d4f6282 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -914,10 +914,14 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->it_virt_incr = cputime_zero;
 	sig->it_prof_expires = cputime_zero;
 	sig->it_prof_incr = cputime_zero;
+	sig->it_sched_expires = 0;
+	sig->rlim_expires = cputime_zero;
 
 	sig->leader = 0;	/* session leadership doesn't inherit */
 	sig->tty_old_pgrp = NULL;
 
+	sig->thread_group_times = NULL;
+
 	sig->utime = sig->stime = sig->cutime = sig->cstime = cputime_zero;
 	sig->gtime = cputime_zero;
 	sig->cgtime = cputime_zero;
@@ -939,7 +943,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 		 * New sole thread in the process gets an expiry time
 		 * of the whole CPU time limit.
 		 */
-		tsk->it_prof_expires =
+		sig->rlim_expires =
 			secs_to_cputime(sig->rlim[RLIMIT_CPU].rlim_cur);
 	}
 	acct_init_pacct(&sig->pacct);
@@ -952,6 +956,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 void __cleanup_signal(struct signal_struct *sig)
 {
 	exit_thread_group_keys(sig);
+	thread_group_times_free(sig->thread_group_times);
 	kmem_cache_free(signal_cachep, sig);
 }
 
@@ -1311,21 +1316,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	if (clone_flags & CLONE_THREAD) {
 		p->group_leader = current->group_leader;
 		list_add_tail_rcu(&p->thread_group, &p->group_leader->thread_group);
-
-		if (!cputime_eq(current->signal->it_virt_expires,
-				cputime_zero) ||
-		    !cputime_eq(current->signal->it_prof_expires,
-				cputime_zero) ||
-		    current->signal->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY ||
-		    !list_empty(&current->signal->cpu_timers[0]) ||
-		    !list_empty(&current->signal->cpu_timers[1]) ||
-		    !list_empty(&current->signal->cpu_timers[2])) {
-			/*
-			 * Have child wake up on its first tick to check
-			 * for process CPU timers.
-			 */
-			p->it_prof_expires = jiffies_to_cputime(1);
-		}
 	}
 
 	if (likely(p->pid)) {
diff --git a/kernel/itimer.c b/kernel/itimer.c
index ab98274..7c5b416 100644
--- a/kernel/itimer.c
+++ b/kernel/itimer.c
@@ -60,12 +60,11 @@ int do_getitimer(int which, struct itimerval *value)
 		cval = tsk->signal->it_virt_expires;
 		cinterval = tsk->signal->it_virt_incr;
 		if (!cputime_eq(cval, cputime_zero)) {
-			struct task_struct *t = tsk;
-			cputime_t utime = tsk->signal->utime;
-			do {
-				utime = cputime_add(utime, t->utime);
-				t = next_thread(t);
-			} while (t != tsk);
+			struct thread_group_cputime thread_group_times;
+			cputime_t utime;
+
+			thread_group_cputime(&thread_group_times, tsk->signal);
+			utime = thread_group_times.utime;
 			if (cputime_le(cval, utime)) { /* about to fire */
 				cval = jiffies_to_cputime(1);
 			} else {
@@ -83,15 +82,12 @@ int do_getitimer(int which, struct itimerval *value)
 		cval = tsk->signal->it_prof_expires;
 		cinterval = tsk->signal->it_prof_incr;
 		if (!cputime_eq(cval, cputime_zero)) {
-			struct task_struct *t = tsk;
-			cputime_t ptime = cputime_add(tsk->signal->utime,
-						      tsk->signal->stime);
-			do {
-				ptime = cputime_add(ptime,
-						    cputime_add(t->utime,
-								t->stime));
-				t = next_thread(t);
-			} while (t != tsk);
+			struct thread_group_cputime thread_group_times;
+			cputime_t ptime;
+
+			thread_group_cputime(&thread_group_times, tsk->signal);
+			ptime = cputime_add(thread_group_times.utime,
+					    thread_group_times.stime);
 			if (cputime_le(cval, ptime)) { /* about to fire */
 				cval = jiffies_to_cputime(1);
 			} else {
@@ -185,6 +181,13 @@ again:
 	case ITIMER_VIRTUAL:
 		nval = timeval_to_cputime(&value->it_value);
 		ninterval = timeval_to_cputime(&value->it_interval);
+		/*
+		 * If he's setting the timer for the first time, we need to
+		 * allocate the percpu area.  It's freed when the process
+		 * exits.
+		 */
+		if (!cputime_eq(nval, cputime_zero))
+			thread_group_times_alloc(tsk);
 		read_lock(&tasklist_lock);
 		spin_lock_irq(&tsk->sighand->siglock);
 		cval = tsk->signal->it_virt_expires;
@@ -209,6 +212,13 @@ again:
 	case ITIMER_PROF:
 		nval = timeval_to_cputime(&value->it_value);
 		ninterval = timeval_to_cputime(&value->it_interval);
+		/*
+		 * If he's setting the timer for the first time, we need to
+		 * allocate the percpu area.  It's freed when the process
+		 * exits.
+		 */
+		if (!cputime_eq(nval, cputime_zero))
+			thread_group_times_alloc(tsk);
 		read_lock(&tasklist_lock);
 		spin_lock_irq(&tsk->sighand->siglock);
 		cval = tsk->signal->it_prof_expires;
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 2eae91f..309a7c4 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -227,31 +227,21 @@ static int cpu_clock_sample_group_locked(unsigned int clock_idx,
 					 struct task_struct *p,
 					 union cpu_time_count *cpu)
 {
-	struct task_struct *t = p;
- 	switch (clock_idx) {
+	struct thread_group_cputime thread_group_times;
+
+	thread_group_cputime(&thread_group_times, p->signal);
+	switch (clock_idx) {
 	default:
 		return -EINVAL;
 	case CPUCLOCK_PROF:
-		cpu->cpu = cputime_add(p->signal->utime, p->signal->stime);
-		do {
-			cpu->cpu = cputime_add(cpu->cpu, prof_ticks(t));
-			t = next_thread(t);
-		} while (t != p);
+		cpu->cpu = cputime_add(thread_group_times.utime,
+			thread_group_times.stime);
 		break;
 	case CPUCLOCK_VIRT:
-		cpu->cpu = p->signal->utime;
-		do {
-			cpu->cpu = cputime_add(cpu->cpu, virt_ticks(t));
-			t = next_thread(t);
-		} while (t != p);
+		cpu->cpu = thread_group_times.utime;
 		break;
 	case CPUCLOCK_SCHED:
-		cpu->sched = p->signal->sum_sched_runtime;
-		/* Add in each other live thread.  */
-		while ((t = next_thread(t)) != p) {
-			cpu->sched += t->se.sum_exec_runtime;
-		}
-		cpu->sched += sched_ns(p);
+		cpu->sched = thread_group_times.sum_exec_runtime;
 		break;
 	}
 	return 0;
@@ -472,80 +462,13 @@ void posix_cpu_timers_exit(struct task_struct *tsk)
 }
 void posix_cpu_timers_exit_group(struct task_struct *tsk)
 {
-	cleanup_timers(tsk->signal->cpu_timers,
-		       cputime_add(tsk->utime, tsk->signal->utime),
-		       cputime_add(tsk->stime, tsk->signal->stime),
-		     tsk->se.sum_exec_runtime + tsk->signal->sum_sched_runtime);
-}
+	struct thread_group_cputime thread_group_times;
 
-
-/*
- * Set the expiry times of all the threads in the process so one of them
- * will go off before the process cumulative expiry total is reached.
- */
-static void process_timer_rebalance(struct task_struct *p,
-				    unsigned int clock_idx,
-				    union cpu_time_count expires,
-				    union cpu_time_count val)
-{
-	cputime_t ticks, left;
-	unsigned long long ns, nsleft;
- 	struct task_struct *t = p;
-	unsigned int nthreads = atomic_read(&p->signal->live);
-
-	if (!nthreads)
-		return;
-
-	switch (clock_idx) {
-	default:
-		BUG();
-		break;
-	case CPUCLOCK_PROF:
-		left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
-				       nthreads);
-		do {
-			if (likely(!(t->flags & PF_EXITING))) {
-				ticks = cputime_add(prof_ticks(t), left);
-				if (cputime_eq(t->it_prof_expires,
-					       cputime_zero) ||
-				    cputime_gt(t->it_prof_expires, ticks)) {
-					t->it_prof_expires = ticks;
-				}
-			}
-			t = next_thread(t);
-		} while (t != p);
-		break;
-	case CPUCLOCK_VIRT:
-		left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
-				       nthreads);
-		do {
-			if (likely(!(t->flags & PF_EXITING))) {
-				ticks = cputime_add(virt_ticks(t), left);
-				if (cputime_eq(t->it_virt_expires,
-					       cputime_zero) ||
-				    cputime_gt(t->it_virt_expires, ticks)) {
-					t->it_virt_expires = ticks;
-				}
-			}
-			t = next_thread(t);
-		} while (t != p);
-		break;
-	case CPUCLOCK_SCHED:
-		nsleft = expires.sched - val.sched;
-		do_div(nsleft, nthreads);
-		nsleft = max_t(unsigned long long, nsleft, 1);
-		do {
-			if (likely(!(t->flags & PF_EXITING))) {
-				ns = t->se.sum_exec_runtime + nsleft;
-				if (t->it_sched_expires == 0 ||
-				    t->it_sched_expires > ns) {
-					t->it_sched_expires = ns;
-				}
-			}
-			t = next_thread(t);
-		} while (t != p);
-		break;
-	}
+	thread_group_cputime(&thread_group_times, tsk->signal);
+	cleanup_timers(tsk->signal->cpu_timers,
+		       thread_group_times.utime,
+		       thread_group_times.stime,
+		       thread_group_times.sum_exec_runtime);
 }
 
 static void clear_dead_task(struct k_itimer *timer, union cpu_time_count now)
@@ -572,7 +495,6 @@ static void arm_timer(struct k_itimer *timer, union cpu_time_count now)
 	struct list_head *head, *listpos;
 	struct cpu_timer_list *const nt = &timer->it.cpu;
 	struct cpu_timer_list *next;
-	unsigned long i;
 
 	head = (CPUCLOCK_PERTHREAD(timer->it_clock) ?
 		p->cpu_timers : p->signal->cpu_timers);
@@ -642,24 +564,21 @@ static void arm_timer(struct k_itimer *timer, union cpu_time_count now)
 				    cputime_lt(p->signal->it_virt_expires,
 					       timer->it.cpu.expires.cpu))
 					break;
-				goto rebalance;
+				p->signal->it_virt_expires =
+					timer->it.cpu.expires.cpu;
+				break;
 			case CPUCLOCK_PROF:
 				if (!cputime_eq(p->signal->it_prof_expires,
 						cputime_zero) &&
 				    cputime_lt(p->signal->it_prof_expires,
 					       timer->it.cpu.expires.cpu))
 					break;
-				i = p->signal->rlim[RLIMIT_CPU].rlim_cur;
-				if (i != RLIM_INFINITY &&
-				    i <= cputime_to_secs(timer->it.cpu.expires.cpu))
-					break;
-				goto rebalance;
+				p->signal->it_prof_expires =
+					timer->it.cpu.expires.cpu;
+				break;
 			case CPUCLOCK_SCHED:
-			rebalance:
-				process_timer_rebalance(
-					timer->it.cpu.task,
-					CPUCLOCK_WHICH(timer->it_clock),
-					timer->it.cpu.expires, now);
+				p->signal->it_sched_expires =
+					timer->it.cpu.expires.sched;
 				break;
 			}
 		}
@@ -1053,10 +972,10 @@ static void check_process_timers(struct task_struct *tsk,
 {
 	int maxfire;
 	struct signal_struct *const sig = tsk->signal;
-	cputime_t utime, stime, ptime, virt_expires, prof_expires;
+	cputime_t utime, ptime, virt_expires, prof_expires;
 	unsigned long long sum_sched_runtime, sched_expires;
-	struct task_struct *t;
 	struct list_head *timers = sig->cpu_timers;
+	struct thread_group_cputime thread_group_times;
 
 	/*
 	 * Don't sample the current process CPU clocks if there are no timers.
@@ -1072,17 +991,10 @@ static void check_process_timers(struct task_struct *tsk,
 	/*
 	 * Collect the current process totals.
 	 */
-	utime = sig->utime;
-	stime = sig->stime;
-	sum_sched_runtime = sig->sum_sched_runtime;
-	t = tsk;
-	do {
-		utime = cputime_add(utime, t->utime);
-		stime = cputime_add(stime, t->stime);
-		sum_sched_runtime += t->se.sum_exec_runtime;
-		t = next_thread(t);
-	} while (t != tsk);
-	ptime = cputime_add(utime, stime);
+	thread_group_cputime(&thread_group_times, sig);
+	utime = thread_group_times.utime;
+	ptime = cputime_add(utime, thread_group_times.stime);
+	sum_sched_runtime = thread_group_times.sum_exec_runtime;
 
 	maxfire = 20;
 	prof_expires = cputime_zero;
@@ -1185,66 +1097,24 @@ static void check_process_timers(struct task_struct *tsk,
 			}
 		}
 		x = secs_to_cputime(sig->rlim[RLIMIT_CPU].rlim_cur);
-		if (cputime_eq(prof_expires, cputime_zero) ||
-		    cputime_lt(x, prof_expires)) {
-			prof_expires = x;
+		if (cputime_eq(sig->rlim_expires, cputime_zero) ||
+		    cputime_lt(x, sig->rlim_expires)) {
+			sig->rlim_expires = x;
 		}
 	}
 
-	if (!cputime_eq(prof_expires, cputime_zero) ||
-	    !cputime_eq(virt_expires, cputime_zero) ||
-	    sched_expires != 0) {
-		/*
-		 * Rebalance the threads' expiry times for the remaining
-		 * process CPU timers.
-		 */
-
-		cputime_t prof_left, virt_left, ticks;
-		unsigned long long sched_left, sched;
-		const unsigned int nthreads = atomic_read(&sig->live);
-
-		if (!nthreads)
-			return;
-
-		prof_left = cputime_sub(prof_expires, utime);
-		prof_left = cputime_sub(prof_left, stime);
-		prof_left = cputime_div_non_zero(prof_left, nthreads);
-		virt_left = cputime_sub(virt_expires, utime);
-		virt_left = cputime_div_non_zero(virt_left, nthreads);
-		if (sched_expires) {
-			sched_left = sched_expires - sum_sched_runtime;
-			do_div(sched_left, nthreads);
-			sched_left = max_t(unsigned long long, sched_left, 1);
-		} else {
-			sched_left = 0;
-		}
-		t = tsk;
-		do {
-			if (unlikely(t->flags & PF_EXITING))
-				continue;
-
-			ticks = cputime_add(cputime_add(t->utime, t->stime),
-					    prof_left);
-			if (!cputime_eq(prof_expires, cputime_zero) &&
-			    (cputime_eq(t->it_prof_expires, cputime_zero) ||
-			     cputime_gt(t->it_prof_expires, ticks))) {
-				t->it_prof_expires = ticks;
-			}
-
-			ticks = cputime_add(t->utime, virt_left);
-			if (!cputime_eq(virt_expires, cputime_zero) &&
-			    (cputime_eq(t->it_virt_expires, cputime_zero) ||
-			     cputime_gt(t->it_virt_expires, ticks))) {
-				t->it_virt_expires = ticks;
-			}
-
-			sched = t->se.sum_exec_runtime + sched_left;
-			if (sched_expires && (t->it_sched_expires == 0 ||
-					      t->it_sched_expires > sched)) {
-				t->it_sched_expires = sched;
-			}
-		} while ((t = next_thread(t)) != tsk);
-	}
+	if (!cputime_eq(prof_expires, cputime_zero) &&
+	    (cputime_eq(sig->it_prof_expires, cputime_zero) ||
+	     cputime_gt(sig->it_prof_expires, prof_expires)))
+		sig->it_prof_expires = prof_expires;
+	if (!cputime_eq(virt_expires, cputime_zero) &&
+	    (cputime_eq(sig->it_virt_expires, cputime_zero) ||
+	     cputime_gt(sig->it_virt_expires, virt_expires)))
+		sig->it_virt_expires = virt_expires;
+	if (sched_expires != 0 &&
+	    (sig->it_sched_expires == 0 ||
+	     sig->it_sched_expires > sched_expires))
+		sig->it_sched_expires = sched_expires;
 }
 
 /*
@@ -1321,19 +1191,40 @@ void run_posix_cpu_timers(struct task_struct *tsk)
 {
 	LIST_HEAD(firing);
 	struct k_itimer *timer, *next;
+	struct thread_group_cputime tg_times;
+	cputime_t tg_virt, tg_prof;
+	unsigned long long tg_exec_runtime;
 
 	BUG_ON(!irqs_disabled());
 
-#define UNEXPIRED(clock) \
-		(cputime_eq(tsk->it_##clock##_expires, cputime_zero) || \
-		 cputime_lt(clock##_ticks(tsk), tsk->it_##clock##_expires))
+#define UNEXPIRED(p, prof, virt, sched) \
+	((cputime_eq((p)->it_prof_expires, cputime_zero) ||	\
+	 cputime_lt((prof), (p)->it_prof_expires)) &&		\
+	(cputime_eq((p)->it_virt_expires, cputime_zero) ||	\
+	 cputime_lt((virt), (p)->it_virt_expires)) &&		\
+	((p)->it_sched_expires == 0 || (sched) < (p)->it_sched_expires))
 
-	if (UNEXPIRED(prof) && UNEXPIRED(virt) &&
-	    (tsk->it_sched_expires == 0 ||
-	     tsk->se.sum_exec_runtime < tsk->it_sched_expires))
-		return;
+	/*
+	 * If there are no expired thread timers, no expired thread group
+	 * timers and no expired RLIMIT_CPU timer, just return.
+	 */
+	if (UNEXPIRED(tsk, prof_ticks(tsk),
+	    virt_ticks(tsk), tsk->se.sum_exec_runtime)) {
+		if (unlikely(tsk->signal == NULL))
+			return;
+		if (!tsk->signal->thread_group_times)
+			return;
+		thread_group_cputime(&tg_times, tsk->signal);
+		tg_prof = cputime_add(tg_times.utime, tg_times.stime);
+		tg_virt = tg_times.utime;
+		tg_exec_runtime = tg_times.sum_exec_runtime;
+		if ((tsk->signal->rlim[RLIMIT_CPU].rlim_cur == RLIM_INFINITY ||
+		     cputime_lt(tg_prof, tsk->signal->rlim_expires)) &&
+		    UNEXPIRED(tsk->signal, tg_virt, tg_prof, tg_exec_runtime))
+			return;
+	}
 
-#undef	UNEXPIRED
+#undef UNEXPIRED
 
 	/*
 	 * Double-check with locks held.
@@ -1414,14 +1305,6 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
 		if (cputime_eq(*newval, cputime_zero))
 			return;
 		*newval = cputime_add(*newval, now.cpu);
-
-		/*
-		 * If the RLIMIT_CPU timer will expire before the
-		 * ITIMER_PROF timer, we have nothing else to do.
-		 */
-		if (tsk->signal->rlim[RLIMIT_CPU].rlim_cur
-		    < cputime_to_secs(*newval))
-			return;
 	}
 
 	/*
@@ -1433,13 +1316,14 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
 	    cputime_ge(list_first_entry(head,
 				  struct cpu_timer_list, entry)->expires.cpu,
 		       *newval)) {
-		/*
-		 * Rejigger each thread's expiry time so that one will
-		 * notice before we hit the process-cumulative expiry time.
-		 */
-		union cpu_time_count expires = { .sched = 0 };
-		expires.cpu = *newval;
-		process_timer_rebalance(tsk, clock_idx, expires, now);
+		switch (clock_idx) {
+		case CPUCLOCK_PROF:
+			tsk->signal->it_prof_expires = *newval;
+			break;
+		case CPUCLOCK_VIRT:
+			tsk->signal->it_virt_expires = *newval;
+			break;
+		}
 	}
 }
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 8dcdec6..81b61eb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3637,6 +3637,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime)
 	cputime64_t tmp;
 
 	p->utime = cputime_add(p->utime, cputime);
+	thread_group_update(p->signal,
+		offsetof(struct thread_group_cputime, utime),
+		(void *)&cputime);
 
 	/* Add user time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
@@ -3659,6 +3662,9 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime)
 	tmp = cputime_to_cputime64(cputime);
 
 	p->utime = cputime_add(p->utime, cputime);
+	thread_group_update(p->signal,
+		offsetof(struct thread_group_cputime, utime),
+		(void *)&cputime);
 	p->gtime = cputime_add(p->gtime, cputime);
 
 	cpustat->user = cputime64_add(cpustat->user, tmp);
@@ -3692,6 +3698,9 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 		return account_guest_time(p, cputime);
 
 	p->stime = cputime_add(p->stime, cputime);
+	thread_group_update(p->signal,
+		offsetof(struct thread_group_cputime, stime),
+		(void *)&cputime);
 
 	/* Add system time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
@@ -3733,6 +3742,9 @@ void account_steal_time(struct task_struct *p, cputime_t steal)
 
 	if (p == rq->idle) {
 		p->stime = cputime_add(p->stime, steal);
+		thread_group_update(p->signal,
+			offsetof(struct thread_group_cputime, stime),
+			(void *)&steal);
 		if (atomic_read(&rq->nr_iowait) > 0)
 			cpustat->iowait = cputime64_add(cpustat->iowait, tmp);
 		else
@@ -8138,3 +8150,64 @@ struct cgroup_subsys cpuacct_subsys = {
 	.subsys_id = cpuacct_subsys_id,
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
+
+/*
+ * Allocate the thread_group_cputime struct appropriately and fill in the
+ * current values of the fields.  Called from do_setitimer() when setting an
+ * interval timer (ITIMER_PROF or ITIMER_VIRTUAL).  Assumes interrupts are
+ * enabled when it's called.  Note that there is no corresponding deallocation
+ * done from do_setitimer(); the structure is freed at process exit.
+ */
+int thread_group_times_alloc(struct task_struct *tsk)
+{
+	struct signal_struct *sig = tsk->signal;
+	struct thread_group_cputime *thread_group_times;
+	struct task_struct *t;
+	cputime_t utime, stime;
+	unsigned long long sum_exec_runtime;
+
+	/*
+	 * If we don't already have a thread_group_cputime struct, allocate
+	 * one and fill it in with the accumulated times.
+	 */
+	if (sig->thread_group_times)
+		return 0;
+#ifdef CONFIG_SMP
+	thread_group_times = alloc_percpu(struct thread_group_cputime);
+#else
+	thread_group_times =
+		kmalloc(sizeof(struct thread_group_cputime), GFP_KERNEL);
+#endif
+	if (thread_group_times == NULL)
+		return -ENOMEM;
+	read_lock(&tasklist_lock);
+	spin_lock_irq(&tsk->sighand->siglock);
+	if (sig->thread_group_times) {
+		spin_unlock_irq(&tsk->sighand->siglock);
+		read_unlock(&tasklist_lock);
+		thread_group_times_free(thread_group_times);
+		return 0;
+	}
+	sig->thread_group_times = thread_group_times;
+	utime = sig->utime;
+	stime = sig->stime;
+	sum_exec_runtime = sig->sum_sched_runtime;
+	t = tsk;
+	do {
+		utime = cputime_add(utime, t->utime);
+		stime = cputime_add(stime, t->stime);
+		sum_exec_runtime += t->se.sum_exec_runtime;
+	} while_each_thread(tsk, t);
+#ifdef CONFIG_SMP
+	thread_group_times = per_cpu_ptr(sig->thread_group_times, get_cpu());
+#endif
+	thread_group_times->utime = utime;
+	thread_group_times->stime = stime;
+	thread_group_times->sum_exec_runtime = sum_exec_runtime;
+#ifdef CONFIG_SMP
+	put_cpu_no_resched();
+#endif
+	spin_unlock_irq(&tsk->sighand->siglock);
+	read_unlock(&tasklist_lock);
+	return 0;
+}
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 86a9337..fc5e269 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -353,6 +353,9 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		struct task_struct *curtask = task_of(curr);
 
 		cpuacct_charge(curtask, delta_exec);
+		thread_group_update(curtask->signal,
+			offsetof(struct thread_group_cputime, sum_exec_runtime),
+			(void *)&delta_exec);
 	}
 }
 
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 0a6d2e5..ea48a92 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -256,6 +256,9 @@ static void update_curr_rt(struct rq *rq)
 	schedstat_set(curr->se.exec_max, max(curr->se.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
+	thread_group_update(curr->signal,
+		offsetof(struct thread_group_cputime, sum_exec_runtime),
+		(void *)&delta_exec);
 	curr->se.exec_start = rq->clock;
 	cpuacct_charge(curr, delta_exec);
 
diff --git a/kernel/sys.c b/kernel/sys.c
index a626116..ce70226 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -864,6 +864,8 @@ asmlinkage long sys_setfsgid(gid_t gid)
 
 asmlinkage long sys_times(struct tms __user * tbuf)
 {
+	struct thread_group_cputime thread_group_times;
+
 	/*
 	 *	In the SMP world we might just be unlucky and have one of
 	 *	the times increment as we use it. Since the value is an
@@ -873,19 +875,28 @@ asmlinkage long sys_times(struct tms __user * tbuf)
 	if (tbuf) {
 		struct tms tmp;
 		struct task_struct *tsk = current;
-		struct task_struct *t;
 		cputime_t utime, stime, cutime, cstime;
 
 		spin_lock_irq(&tsk->sighand->siglock);
-		utime = tsk->signal->utime;
-		stime = tsk->signal->stime;
-		t = tsk;
-		do {
-			utime = cputime_add(utime, t->utime);
-			stime = cputime_add(stime, t->stime);
-			t = next_thread(t);
-		} while (t != tsk);
-
+		/*
+		 * If a POSIX interval timer is running use the process-wide
+		 * fields, else fall back to brute force.
+		 */
+		if (tsk->signal->thread_group_times) {
+			thread_group_cputime(&thread_group_times, tsk->signal);
+			utime = thread_group_times.utime;
+			stime = thread_group_times.stime;
+		} else {
+			struct task_struct *t;
+
+			utime = tsk->signal->utime;
+			stime = tsk->signal->stime;
+			t = tsk;
+			do {
+				utime = cputime_add(utime, t->utime);
+				stime = cputime_add(stime, t->stime);
+			} while_each_thread(tsk, t);
+		}
 		cutime = tsk->signal->cutime;
 		cstime = tsk->signal->cstime;
 		spin_unlock_irq(&tsk->sighand->siglock);
@@ -1444,7 +1455,7 @@ asmlinkage long sys_old_getrlimit(unsigned int resource, struct rlimit __user *r
 asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim)
 {
 	struct rlimit new_rlim, *old_rlim;
-	unsigned long it_prof_secs;
+	unsigned long rlim_secs;
 	int retval;
 
 	if (resource >= RLIM_NLIMITS)
@@ -1490,15 +1501,12 @@ asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim)
 	if (new_rlim.rlim_cur == RLIM_INFINITY)
 		goto out;
 
-	it_prof_secs = cputime_to_secs(current->signal->it_prof_expires);
-	if (it_prof_secs == 0 || new_rlim.rlim_cur <= it_prof_secs) {
-		unsigned long rlim_cur = new_rlim.rlim_cur;
-		cputime_t cputime;
-
-		cputime = secs_to_cputime(rlim_cur);
+	rlim_secs = cputime_to_secs(current->signal->rlim_expires);
+	if (rlim_secs == 0 || new_rlim.rlim_cur <= rlim_secs) {
 		read_lock(&tasklist_lock);
 		spin_lock_irq(&current->sighand->siglock);
-		set_process_cpu_timer(current, CPUCLOCK_PROF, &cputime, NULL);
+		current->signal->rlim_expires =
+			secs_to_cputime(new_rlim.rlim_cur);
 		spin_unlock_irq(&current->sighand->siglock);
 		read_unlock(&tasklist_lock);
 	}
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 41a049f..62fed13 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2201,7 +2201,7 @@ static void selinux_bprm_post_apply_creds(struct linux_binprm *bprm)
 			 * This will cause RLIMIT_CPU calculations
 			 * to be refigured.
 			 */
-			current->it_prof_expires = jiffies_to_cputime(1);
+			current->signal->rlim_expires = jiffies_to_cputime(1);
 		}
 	}

-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-03-24 17:34                                           ` Frank Mayhar
  2008-03-24 22:43                                             ` Frank Mayhar
@ 2008-03-31  5:44                                             ` Roland McGrath
  2008-03-31 20:24                                               ` Frank Mayhar
  1 sibling, 1 reply; 51+ messages in thread
From: Roland McGrath @ 2008-03-31  5:44 UTC (permalink / raw)
  To: Frank Mayhar; +Cc: linux-kernel

> Well, if it's acceptable, for a first cut (and the patch I'll submit),
> I'll handle the UP and SMP cases, encapsulating them in sched.h in such
> a way as to make it invisible (as much as is possible) to the rest of
> the code.

That's fine.

> After looking at the code again, I now understand what you're talking
> about.  You overloaded it_*_expires to support both the POSIX interval
> timers and RLIMIT_CPU.  So the way I have things, setting one can stomp
> the other.

For clarity, please never mention that identifier without indicating which
struct you're talking about.  signal_struct.it_*_expires has never been
overloaded.  signal_struct.it_prof_expires is the ITIMER_PROF setting;
signal_struct.it_virt_expires is the ITIMER_VIRTUAL setting; there is no
signal_struct.it_sched_expires field.  task_struct.it_*_expires has never
been overloaded.  task_struct.it_prof_expires is the next value of
(task_struct.utime + task_struct.stime) at which run_posix_cpu_timers()
needs to check for work to do.

> Might it be cleaner to handle the RLIMIT_CPU stuff separately, rather
> than rolling it into the itimer handling?

It is not "rolled into the itimer handling".  
run_posix_cpu_timers() handles three separate features:

1. ITIMER_VIRTUAL, ITIMER_PROF itimers (setitimer/getitimer)
2. POSIX timers CPU timers (timer_* calls)
3. CPU time rlimits (RLIMIT_CPU for process-wide, RLIMIT_RTTIME for each thread)

The poorly-named task_struct.it_*_expires fields serve a single purpose:
to optimize run_posix_cpu_timers().  task_struct.it_prof_expires is the
minimum of the values at which any of those three features need attention.
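
To make the optimization concrete, the fast-path test that this caching
enables looks roughly like this (condensed, not verbatim from the tree):

	/* return early unless some cached expiration has been reached */
	if ((cputime_eq(tsk->it_prof_expires, cputime_zero) ||
	     cputime_lt(cputime_add(tsk->utime, tsk->stime),
			tsk->it_prof_expires)) &&
	    (cputime_eq(tsk->it_virt_expires, cputime_zero) ||
	     cputime_lt(tsk->utime, tsk->it_virt_expires)) &&
	    (tsk->it_sched_expires == 0 ||
	     tsk->se.sum_exec_runtime < tsk->it_sched_expires))
		return;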

> Okay, my proposed fix for this is to introduce a new field in
> signal_struct, rlim_expires, a cputime_t.  Everywhere that the
> RLIMIT_CPU code formerly set it_prof_expires it will now set
> rlim_expires and in run_posix_cpu_timers() I'll check it against the
> thread group prof_time.

I don't see the point of adding this field at all.  It serves solely to
cache the secs_to_cputime calculation on the RLIMIT_CPU rlim_cur value,
which is just a division.

> I've changed the uniprocessor case to retain the dynamic allocation of
> the thread_group_cputime structure as needed; this makes the code
> somewhat more consistent between SMP and UP and retains the feature of
> reducing overhead for processes that don't use interval timers.

This does not make sense.  There is no need for any new state at all in
the UP case, just reorganizing what is already there.

The existing signal_struct fields utime, stime, and sum_sched_runtime are
no longer needed.  These accumulate the times of dead threads in the group
(see __exit_signal in exit.c) solely so cpu_clock_sample_group can add
them in.  Keeping both those old fields and the dynamically allocated
per-CPU counters is wrong.  You will count double all the threads that
died since the struct was allocated (i.e. since the first timer was set).

For the UP case, just replace these with a single struct thread_group_cputime
in signal_struct, and increment it directly on every tick.  __exit_signal
never touches it.

For the SMP case, you need a bit of complication.  When there are no
expirations (none of the three features in use on a process-wide CPU
clock) or only one live thread, then you don't need to allocate the
per-CPU counters.  But you need one or the other kind of state as soon
as a thread dies while others live, or there are multiple threads while
any process-wide expiration is set.  There are several options for how
to reconcile the dead-threads tracking with the started-on-demand
per-CPU counters.

You did not follow what I had in mind for abstracting the code.
Here is what I think will make it easiest to work through all these
angles without rewriting much of the code for each variant.

Define:

	struct task_cputime {
		cputime_t utime;
		cputime_t stime;
		unsigned long long sched_runtime;
	};

and then a second type to use in signal_struct.  
You can clean up task_struct by replacing its it_*_expires fields with one:
	
	struct task_cputime cputime_expires;

(That is, overload cputime_expires.stime for the utime+stime expiration
time.  Even with that kludge, I think it's cleaner to use this struct
for all these places.)

In signal_struct there are no conditionals, it's just:

	struct thread_group_cputime cputime;

The variants provide struct thread_group_cputime and the functions to go
with it.  However many sorts we need can be different big #if blocks
keeping all the related code together.

The UP version just does:

	struct thread_group_cputime {
		struct task_cputime totals;
	};

	static inline void thread_group_cputime(struct signal_struct *sig,
						struct task_cputime *cputime)
	{
		*cputime = sig->cputime;
	}

	static inline void account_group_user_time(struct task_struct *task,
						   cputime_t cputime)
	{
		struct task_cputime *times = &task->signal->cputime.totals;
		times->utime = cputime_add(times->utime, cputime);
	}

	static inline void account_group_system_time(struct task_struct *task,
						     cputime_t cputime)
	{
		struct task_cputime *times = &task->signal->cputime.totals;
		times->stime = cputime_add(times->stime, cputime);
	}

	static inline void account_group_exec_runtime(struct task_struct *task,
						      unsigned long long ns)
	{
		struct task_cputime *times = &task->signal->cputime.totals;
		times->sched_runtime += ns;
	}

Finally, you could consider adding another field to signal_struct:

	struct task_cputime cputime_expires;

This would be a cache, for each of the three process CPU clocks, of the
earliest expiration from any of the three features.  Each of setitimer,
timer_settime, setrlimit, and implicit timer/itimer reloading, would
recalculate the minimum of the head of the cpu_timers list, the itimer
(it_*_expires), and the rlimit.  The reason to do this is that the
common case in run_posix_cpu_timers() stays almost as cheap as it is now.
It also makes a nice parallel with the per-thread expiration cache.
i.e.:

	static int task_cputime_expired(const struct task_cputime *sample,
					const struct task_cputime *expires)
	{
		if (!cputime_eq(expires->utime, cputime_zero) &&
		    cputime_ge(sample->utime, expires->utime))
			return 1;
		if (!cputime_eq(expires->stime, cputime_zero) &&
		    cputime_ge(cputime_add(sample->utime, sample->stime),
			       expires->stime))
			return 1;
		if (expires->sched_runtime != 0 &&
		    sample->sched_runtime >= expires->sched_runtime)
			return 1;
		return 0;
	}

	...

		struct signal_struct *sig = task->signal;
		struct task_cputime task_sample = {
			.utime = task->utime,
			.stime = task->stime,
			.sched_runtime = task->se.sum_exec_runtime
		};
		struct task_cputime group_sample;
		thread_group_cputime(sig, &group_sample);

		if (!task_cputime_expired(&task_sample,
					  &task->cputime_expires) &&
		    !task_cputime_expired(&group_sample,
					  &sig->cputime_expires))
			return 0;

	...
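
For the recalculation side, a sketch of what one of those call sites might
do for the PROF clock (helper name invented; caller assumed to hold
siglock, and the cpu_timers lists are kept sorted by expiry):

	static void recalc_prof_expires(struct signal_struct *sig)
	{
		cputime_t expires = sig->it_prof_expires;  /* ITIMER_PROF */
		struct list_head *head = &sig->cpu_timers[CPUCLOCK_PROF];

		/* earliest armed process-wide CPUCLOCK_PROF timer, if any */
		if (!list_empty(head)) {
			struct cpu_timer_list *t =
				list_first_entry(head, struct cpu_timer_list,
						 entry);
			if (cputime_eq(expires, cputime_zero) ||
			    cputime_lt(t->expires.cpu, expires))
				expires = t->expires.cpu;
		}

		/* RLIMIT_CPU, if finite */
		if (sig->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY) {
			cputime_t rl =
				secs_to_cputime(sig->rlim[RLIMIT_CPU].rlim_cur);
			if (cputime_eq(expires, cputime_zero) ||
			    cputime_lt(rl, expires))
				expires = rl;
		}

		/* the stime slot doubles as the PROF expiration (see above) */
		sig->cputime_expires.stime = expires;
	}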



Thanks,
Roland

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-03-31  5:44                                             ` Roland McGrath
@ 2008-03-31 20:24                                               ` Frank Mayhar
  2008-04-02  2:07                                                 ` Roland McGrath
  0 siblings, 1 reply; 51+ messages in thread
From: Frank Mayhar @ 2008-03-31 20:24 UTC (permalink / raw)
  To: Roland McGrath; +Cc: linux-kernel

Roland, I'm very much having to read between the lines of what you've
written.  And, obviously, getting it wrong at least half the time. :-)

So you've cleared part of my understanding with your latest email.
Here's what I've gotten from it:

        struct task_cputime {
        	cputime_t utime;		/* User time. */
        	cputime_t stime;		/* System time. */
        	unsigned long long sched_runtime; /* Scheduler time. */
        };
        
This is for both SMP and UP, defined before signal_struct in sched.h
(since that structure refers to this one).  Following that:

        struct thread_group_cputime;

Which is a forward reference to the real definition later in the file.
The inline functions depend on signal_struct and task_struct, so they
have to come after:

        #ifdef CONFIG_SMP

        struct thread_group_cputime {
        	struct task_cputime *totals;
        };

        < ... inline functions ... >

        #else /* CONFIG_SMP */

        struct thread_group_cputime {
        	struct task_cputime totals;
        };

        < ... inline functions ... >

        #endif

The SMP version is percpu, the UP version is just a substructure.  In
signal_struct itself, delete utime & stime, add
        struct thread_group_cputime cputime;

The inline functions include the ones you defined for UP plus equivalent
ones for SMP.  The SMP inlines check the percpu pointer
(sig->cputime.totals) and don't update if it's NULL.  One small
correction to one of your inlines, in thread_group_cputime:
                *cputime = sig->cputime;
should be
                *cputime = sig->cputime.totals;

A representative inline for SMP is:

        static inline void account_group_system_time(struct task_struct *task,
        					      cputime_t cputime)
        {
        	struct signal_struct *sig = task->signal;
        	struct task_cputime *times;
        
        	if (!sig->cputime.totals)
        		return;
        	times = per_cpu_ptr(sig->cputime.totals, get_cpu());
        	times->stime = cputime_add(times->stime, cputime);
        	put_cpu_no_resched();
        }
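
The matching SMP read side might look like this (a sketch, not from the
patch; it assumes the caller keeps sig and the percpu pointer alive):

        static inline void thread_group_cputime(struct signal_struct *sig,
        					struct task_cputime *cputime)
        {
        	int i;

        	cputime->utime = cputime_zero;
        	cputime->stime = cputime_zero;
        	cputime->sum_exec_runtime = 0;
        	/* no percpu area yet: caller falls back to summing threads */
        	if (!sig->cputime.totals)
        		return;
        	for_each_possible_cpu(i) {
        		struct task_cputime *times;

        		times = per_cpu_ptr(sig->cputime.totals, i);
        		cputime->utime = cputime_add(cputime->utime,
        					     times->utime);
        		cputime->stime = cputime_add(cputime->stime,
        					     times->stime);
        		cputime->sum_exec_runtime += times->sum_exec_runtime;
        	}
        }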
        
To deal with the need for bookkeeping with multiple threads in the SMP
case (where there isn't a per-cpu structure until it's needed), I'll
allocate the per-cpu structure in __exit_signal() where the relevant
fields are updated.  I'll also allocate it where I do now, in
do_setitimer(), when needed.  The allocation will be a "return 0" for UP
and a call to "thread_group_times_alloc_smp()" (which lives in sched.c)
for SMP.
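
A minimal sketch of that allocator under the new layout (assuming the
caller can sleep; the race handling mirrors the earlier patch):

        int thread_group_times_alloc_smp(struct task_struct *tsk)
        {
        	struct task_cputime *totals;

        	totals = alloc_percpu(struct task_cputime);
        	if (!totals)
        		return -ENOMEM;
        	spin_lock_irq(&tsk->sighand->siglock);
        	if (tsk->signal->cputime.totals)
        		free_percpu(totals);	/* raced; keep the winner's */
        	else
        		tsk->signal->cputime.totals = totals;
        	spin_unlock_irq(&tsk->sighand->siglock);
        	return 0;
        }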

I'll also optimize run_posix_cpu_timers() as you suggest, and eliminate
rlim_expires.

Expect a new patch fairly soon.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 2.6.25-rc7 resubmit] Fix itimer/many thread hang.
  2008-03-28 22:46                                             ` [PATCH 2.6.25-rc7 resubmit] " Frank Mayhar
@ 2008-04-01 18:45                                               ` Andrew Morton
  2008-04-01 21:46                                                 ` Frank Mayhar
  0 siblings, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2008-04-01 18:45 UTC (permalink / raw)
  To: Frank Mayhar; +Cc: roland, linux-kernel

On Fri, 28 Mar 2008 15:46:40 -0700
Frank Mayhar <fmayhar@google.com> wrote:

>  asmlinkage long sys_times(struct tms __user * tbuf)
>  {
> +	struct thread_group_cputime thread_group_times;
> +
>  	/*
>  	 *	In the SMP world we might just be unlucky and have one of
>  	 *	the times increment as we use it. Since the value is an
> @@ -873,19 +875,28 @@ asmlinkage long sys_times(struct tms __user * tbuf)
>  	if (tbuf) {
>  		struct tms tmp;
>  		struct task_struct *tsk = current;
> -		struct task_struct *t;
>  		cputime_t utime, stime, cutime, cstime;
>  
>  		spin_lock_irq(&tsk->sighand->siglock);
> -		utime = tsk->signal->utime;
> -		stime = tsk->signal->stime;
> -		t = tsk;
> -		do {
> -			utime = cputime_add(utime, t->utime);
> -			stime = cputime_add(stime, t->stime);
> -			t = next_thread(t);
> -		} while (t != tsk);
> -
> +		/*
> +		 * If a POSIX interval timer is running use the process-wide
> +		 * fields, else fall back to brute force.
> +		 */
> +		if (sig->thread_group_times) {

kernel/sys.c: In function 'sys_times':
kernel/sys.c:885: error: 'sig' undeclared (first use in this function)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 2.6.25-rc7 resubmit] Fix itimer/many thread hang.
  2008-04-01 18:45                                               ` Andrew Morton
@ 2008-04-01 21:46                                                 ` Frank Mayhar
  0 siblings, 0 replies; 51+ messages in thread
From: Frank Mayhar @ 2008-04-01 21:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: roland, linux-kernel

On Tue, 2008-04-01 at 11:45 -0700, Andrew Morton wrote:
> On Fri, 28 Mar 2008 15:46:40 -0700
> Frank Mayhar <fmayhar@google.com> wrote:
> 
> >  asmlinkage long sys_times(struct tms __user * tbuf)
> >  {
> > +	struct thread_group_cputime thread_group_times;
> > +
> >  	/*
> >  	 *	In the SMP world we might just be unlucky and have one of
> >  	 *	the times increment as we use it. Since the value is an
> > @@ -873,19 +875,28 @@ asmlinkage long sys_times(struct tms __user * tbuf)
> >  	if (tbuf) {
> >  		struct tms tmp;
> >  		struct task_struct *tsk = current;
> > -		struct task_struct *t;
> >  		cputime_t utime, stime, cutime, cstime;
> >  
> >  		spin_lock_irq(&tsk->sighand->siglock);
> > -		utime = tsk->signal->utime;
> > -		stime = tsk->signal->stime;
> > -		t = tsk;
> > -		do {
> > -			utime = cputime_add(utime, t->utime);
> > -			stime = cputime_add(stime, t->stime);
> > -			t = next_thread(t);
> > -		} while (t != tsk);
> > -
> > +		/*
> > +		 * If a POSIX interval timer is running use the process-wide
> > +		 * fields, else fall back to brute force.
> > +		 */
> > +		if (sig->thread_group_times) {
> 
> kernel/sys.c: In function 'sys_times':
> kernel/sys.c:885: error: 'sig' undeclared (first use in this function)

Thanks.  As I said privately, I don't know how this snuck in but it's
certainly time to blow away my build tree and reapply the next patch
from scratch.

Speaking of which, expect that next patch in a day or two.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-03-31 20:24                                               ` Frank Mayhar
@ 2008-04-02  2:07                                                 ` Roland McGrath
  2008-04-02 16:34                                                   ` Frank Mayhar
                                                                     ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Roland McGrath @ 2008-04-02  2:07 UTC (permalink / raw)
  To: Frank Mayhar; +Cc: linux-kernel

> Roland, I'm very much having to read between the lines of what you've
> written.  And, obviously, getting it wrong at least half the time. :-)

I'm sorry I wasn't able to communicate more clearly.  Please ask me to
elaborate whenever it would help.  I hope the experience will inspire
you to add some good comments to the code.  If the comments come to
explain everything you now wish had been brought to your attention
from the start, we should be in very good shape.

>         	unsigned long long sched_runtime; /* Scheduler time. */

This is what I'd called it originally (based on the sched_clock() name).
Now it's an aggregate of task_struct.se.sum_exec_runtime, so perhaps
sum_exec_runtime is the right name to use.

> (since that structure refers to this one).  Following that:
> 
>         struct thread_group_cputime;
> 
> Which is a forward reference to the real definition later in the file.

There is no need for that.  This struct is used directly in signal_struct
and so has to be defined before it anyway.  (And when a struct is used
only as the type of a pointer in another type declaration, no forward
declaration is required.)

> The SMP version is percpu, the UP version is just a substructure.  In
> signal_struct itself, delete utime & stime, add
>         struct thread_group_cputime cputime;

Correct, and also delete sum_sched_runtime.

> The inline functions include the ones you defined for UP plus equivalent
> ones for SMP.

Right.  And if we were to try different variants based on NR_CPUS or
whatever, that would entail just a new struct thread_group_cputime
definition and new versions of these functions, touching nothing else.

> A representative inline for SMP is:

Looks correct.

> To deal with the need for bookkeeping with multiple threads in the SMP
> case (where there isn't a per-cpu structure until it's needed), I'll
> allocate the per-cpu structure in __exit_signal() where the relevant
> fields are updated.  I'll also allocate it where I do now, in
> do_setitimer(), when needed.  The allocation will be a "return 0" for UP
> and a call to "thread_group_times_alloc_smp()" (which lives in sched.c)
> for SMP.

By do_setitimer, you mean set_process_cpu_timer and posix_cpu_timer_set.

In the last message I said "there are several options" for dealing with
the dead-threads issue on SMP and did not go into it.  This is one option.
It is attractive in being as lazy as possible.  But I think it is a
prohibitively dubious proposition to do any allocation in __exit_signal.
That is part of the hot path for recovering from OOM situations and all
sorts of troubles; it's also expected to be reliable, so I am not
comfortable with the bookkeeping dropping bits in extreme conditions
(i.e. when you cannot possibly succeed in the allocation and have to
proceed without it or ultimately wedge the system).

The first thing to do is move the existing summation of utime, stime, and
sum_exec_runtime in __exit_signal into an inline thread_group_cputime_exit.
That abstracts it into the set of inlines that can vary for the different
flavors of storage model.  For UP, it does nothing.

1. One set of options keeps static storage for dead-threads accounting, as we
   have now.  i.e., utime, stime, sum_sched_runtime in signal_struct are
   replaced by a "struct task_cputime dead_totals;" in thread_group_cputime.

   a. thread_group_cputime_exit adds into dead_totals as now, unconditionally.
      In thread_group_cputime_exit, if there is a percpu pointer allocated,
      then we subtract current's values from the percpu counts.
      In posix_cpu_clock_get, if there is no percpu pointer allocated,
      we sum dead_totals plus iterate over live threads, as done now;
      if a percpu pointer is allocated, we sum percpu counters plus dead_totals.

   b. thread_group_cputime_exit adds into dead_totals unless there is a
      percpu pointer allocated.  When we first allocate the percpu
      counters, we fill some chosen CPU's values with dead_totals.
      posix_cpu_clock_get uses just the percpu counters if allocated.

2. No separate storage for dead-threads totals.

   a. Allocate percpu storage in thread_group_cputime_exit.
      I think this is ruled out for the reasons I gave above.

   b. Allocate percpu storage in copy_signal CLONE_THREAD case:

	if (clone_flags & CLONE_THREAD) {
		ret = thread_group_cputime_clone_thread(current, tsk);
		if (likely(!ret)) {
			atomic_inc(&current->signal->count);
			atomic_inc(&current->signal->live);
		}
		return ret;
	}

      That is a safe place to fail with -ENOMEM.  This pays the storage
      cost of the percpu array up front in every multithreaded program; but
      it still avoids that cost in single-threaded programs, and avoids the
      other pressures of enlarging signal_struct itself.  It has the
      benefit of making clock samples (clock_gettime on process CPU clocks)
      in a process not using any CPU timers go from O(n) in live threads to
      O(n) in num_possible_cpus().

There are more variations possible.
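
For 2b, the helper named in that snippet could be as simple as this sketch
(illustrative body, not final code; alloc_percpu returns zeroed counters):

	static inline int thread_group_cputime_clone_thread(struct task_struct *curr,
							    struct task_struct *tsk)
	{
		if (curr->signal->cputime.totals)
			return 0;	/* an earlier clone already paid the cost */
		curr->signal->cputime.totals = alloc_percpu(struct task_cputime);
		return curr->signal->cputime.totals ? 0 : -ENOMEM;
	}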


Thanks,
Roland

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-04-02  2:07                                                 ` Roland McGrath
@ 2008-04-02 16:34                                                   ` Frank Mayhar
  2008-04-02 17:42                                                   ` Frank Mayhar
  2008-04-02 18:42                                                   ` Frank Mayhar
  2 siblings, 0 replies; 51+ messages in thread
From: Frank Mayhar @ 2008-04-02 16:34 UTC (permalink / raw)
  To: Roland McGrath; +Cc: linux-kernel

On Tue, 2008-04-01 at 19:07 -0700, Roland McGrath wrote:
> I'm sorry I wasn't able to communicate more clearly.  Please ask me to
> elaborate whenever it would help.  I hope the experience will inspire
> you to add some good comments to the code.  If the comments come to
> explain everything you now wish had been brought to your attention
> from the start, we should be in very good shape.

At least it's beginning to make sense.  Rather than worrying overmuch
about commentary, though, I'll think about writing an explanation for
the Documentation directory.  The implementation is pretty scattered so
there's no obvious single place for good commentary; a separate document
might be easier.

> > (since that structure refers to this one).  Following that:
> > 
> >         struct thread_group_cputime;
> > 
> > Which is a forward reference to the real definition later in the file.
> 
> There is no need for that.  This struct is used directly in signal_struct
> and so has to be defined before it anyway.  (And when a struct is used
> only as the type of a pointer in another type declaration, no forward
> declaration is required.)

Yeah, I remembered this finally.  The structures are declared before
signal_struct in all cases, with the inlines appearing (much) later in
the file.

Here are the declarations (starting at around line 425; the inlines
start at around line 2010 or so):

        struct task_cputime {
        	cputime_t utime;			/* User time. */
        	cputime_t stime;			/* System time. */
        	unsigned long long sum_exec_runtime;	/* Scheduler time. */
        };
        /* Alternate field names when used to cache expirations. */
        #define prof_exp	stime
        #define virt_exp	utime
        #define sched_exp	sum_exec_runtime
        
        #ifdef CONFIG_SMP
        struct thread_group_cputime {
        	struct task_cputime *totals;
        };
        #else
        struct thread_group_cputime {
        	struct task_cputime totals;
        };
        #endif

The #defines there are for when I refer to the task_cputime fields as
expiration values, to make it a little more self-explanatory (and less
confusing).  I can remove them if it violates a Linux kernel coding
style rule.
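
For example, an expiration-cache update then reads naturally (illustrative
snippet only; new_expires stands for whatever value is being armed):

        	if (cputime_eq(sig->cputime_expires.prof_exp, cputime_zero) ||
        	    cputime_lt(new_expires, sig->cputime_expires.prof_exp))
        		sig->cputime_expires.prof_exp = new_expires;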

> > To deal with the need for bookkeeping with multiple threads in the SMP
> > case (where there isn't a per-cpu structure until it's needed), I'll
> > allocate the per-cpu structure in __exit_signal() where the relevant
> > fields are updated.  I'll also allocate it where I do now, in
> > do_setitimer(), when needed.  The allocation will be a "return 0" for UP
> > and a call to "thread_group_times_alloc_smp()" (which lives in sched.c)
> > for SMP.
> 
> By do_setitimer, you mean set_process_cpu_timer and posix_cpu_timer_set.

Um, right.  I think.  I'll check.

> In the last message I said "there are several options" for dealing with
> the dead-threads issue on SMP and did not go into it.  This is one option.
> It is attractive in being as lazy as possible.  But I think it is a
> prohibitively dubious proposition to do any allocation in __exit_signal.
> That is part of the hot path for recovering from OOM situations and all
> sorts of troubles; it's also expected to be reliable so that I am not
> comfortable with the bookkeeping dropping bits in extreme conditions
> (i.e. when you cannot possibly succeed in the allocation and have to
> proceed without it or ultimately wedge the system).

Yeah, that bothered me, too.  It's obviously impossible to do the
allocation in __exit_signal() itself (since the caller,
release_task(), is holding tasklist_lock) and doing it in release_task()
is problematic for the reasons you give.

> The first thing to do is move the existing summation of utime, stime, and
> sum_exec_runtime in __exit_signal into an inline thread_group_cputime_exit.
> That abstracts it into the set of inlines that can vary for the different
> flavors of storage model.  For UP, it does nothing.
> 
> 1. One set of options keeps static storage for dead-threads accounting, as we
>    have now.  i.e., utime, stime, sum_sched_runtime in signal_struct are
>    replaced by a "struct task_cputime dead_totals;" in thread_group_cputime.
> 
>    a. thread_group_cputime_exit adds into dead_totals as now, unconditionally.
>       In thread_group_cputime_exit, if there is a percpu pointer allocated,
>       then we subtract current's values from the perpcu counts.
>       In posix_cpu_clock_get, if there is no percpu pointer allocated,
>       we sum dead_totals plus iterate over live threads, as done now;
>       if a percpu pointer is allocated, we sum percpu counters plus dead_totals.

This seems kind of hacky to me, particularly the bit about subtracting
values from the percpu counts.

>    b. thread_group_cputime_exit adds into dead_totals unless there is a
>       percpu pointer allocated.  When we first allocate the percpu
>       counters, we fill some chosen CPU's values with dead_totals.
>       posix_cpu_clock_get uses just the percpu counters if allocated.

This is a little better and is along the lines I was thinking at one
point, but it still has the dead_totals field which becomes redundant
when the percpu structure is allocated.  I would rather eliminate that
field entirely, especially since we're growing signal_struct in other
ways.

> 2. No separate storage for dead-threads totals.
> 
>    a. Allocate percpu storage in thread_group_cputime_exit.
>       I think this is ruled out for the reasons I gave above.

Right.

>    b. Allocate percpu storage in copy_signal CLONE_THREAD case:
> 
> 	if (clone_flags & CLONE_THREAD) {
> 		ret = thread_group_cputime_clone_thread(current, tsk);
> 		if (likely(!ret)) {
> 			atomic_inc(&current->signal->count);
> 			atomic_inc(&current->signal->live);
> 		}
> 		return ret;
> 	}
> 
>       That is a safe place to fail with -ENOMEM.  This pays the storage
>       cost of the percpu array up front in every multithreaded program; but
>       it still avoids that cost in single-threaded programs, and avoids the
>       other pressures of enlarging signal_struct itself.  It has the
>       benefit of making clock samples (clock_gettime on process CPU clocks)
>       in a process not using any CPU timers go from O(n) in live threads to
>       O(n) in num_possible_cpus().

I had also thought hard about doing this.  On the one hand, it makes
dealing with thread groups a little simpler, since I don't have to
special-case a thread group with a single thread anywhere (i.e. I don't
deal with thread_group_empty(tsk) being true anywhere).  On the other,
it does cause every multithreaded process to pay the cost of the percpu
array, as you say.  Adding the cputime_expires cache to signal_struct
and checking that in run_posix_cpu_timers() (as you suggested last time)
gets a little of that cost back, though, for multithreaded processes
that don't use any of the timers.

Any solution that avoids using a dead_totals field, in addition to
partly offsetting the growth of the signal_struct, also has the minor
advantage that it requires no handling at all in __exit_signal() and
makes that function a couple of lines shorter.

> There are more variations possible.

But I don't think I'm going to worry about them.  I would like the
solution to be as simple and clear as possible and I think the last one
above qualifies.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-04-02  2:07                                                 ` Roland McGrath
  2008-04-02 16:34                                                   ` Frank Mayhar
@ 2008-04-02 17:42                                                   ` Frank Mayhar
  2008-04-02 19:48                                                     ` Roland McGrath
  2008-04-02 18:42                                                   ` Frank Mayhar
  2 siblings, 1 reply; 51+ messages in thread
From: Frank Mayhar @ 2008-04-02 17:42 UTC (permalink / raw)
  To: Roland McGrath; +Cc: linux-kernel

On Tue, 2008-04-01 at 19:07 -0700, Roland McGrath wrote:
> The first thing to do is move the existing summation of utime, stime, and
> sum_exec_runtime in __exit_signal into an inline thread_group_cputime_exit.
> That abstracts it into the set of inlines that can vary for the different
> flavors of storage model.  For UP, it does nothing.

One quick note:  this inline isn't needed for the 2b solution (allocate
percpu storage in copy_signal CLONE_THREAD case), since if there's more
than one thread there'll always be a percpu area and if there's only one
thread the summation code won't be entered.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-04-02  2:07                                                 ` Roland McGrath
  2008-04-02 16:34                                                   ` Frank Mayhar
  2008-04-02 17:42                                                   ` Frank Mayhar
@ 2008-04-02 18:42                                                   ` Frank Mayhar
  2 siblings, 0 replies; 51+ messages in thread
From: Frank Mayhar @ 2008-04-02 18:42 UTC (permalink / raw)
  To: Roland McGrath; +Cc: linux-kernel

On Tue, 2008-04-01 at 19:07 -0700, Roland McGrath wrote:
> > To deal with the need for bookkeeping with multiple threads in the SMP
> > case (where there isn't a per-cpu structure until it's needed), I'll
> > allocate the per-cpu structure in __exit_signal() where the relevant
> > fields are updated.  I'll also allocate it where I do now, in
> > do_setitimer(), when needed.  The allocation will be a "return 0" for UP
> > and a call to "thread_group_times_alloc_smp()" (which lives in sched.c)
> > for SMP.
> 
> By do_setitimer, you mean set_process_cpu_timer and posix_cpu_timer_set.

And another quick note:  It appears that with the "allocate percpu
storage in copy_signal CLONE_THREAD case" mechanism, I don't need to
worry about allocating it anywhere else.  If I need it (which is only in
the case of multiple threads and an interval timer) then I'll have it
because it was allocated with the second thread.  So I just eliminate
the allocation in do_setitimer() entirely.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-04-02 17:42                                                   ` Frank Mayhar
@ 2008-04-02 19:48                                                     ` Roland McGrath
  2008-04-02 20:34                                                       ` Frank Mayhar
  0 siblings, 1 reply; 51+ messages in thread
From: Roland McGrath @ 2008-04-02 19:48 UTC (permalink / raw)
  To: Frank Mayhar; +Cc: linux-kernel

> One quick note:  this inline isn't needed for the 2b solution (allocate
> percpu storage in copy_signal CLONE_THREAD case), since if there's more
> than one thread there'll always be a percpu area and if there's only one
> thread the summation code won't be entered.

That's true.  I still think it's a good idea to have it, even if it winds
up being empty in the variants we really use.  The principle is that the
new set of types/functions could be used to implement exactly what we have
now.  In fact, it's usually best to do a series of small patches that start
with introducing the abstraction while not changing anything.

> And another quick note:  It appears that with the "allocate percpu
> storage in copy_signal CLONE_THREAD case" mechanism, I don't need to
> worry about allocating it anywhere else.  If I need it (which is only in
> the case of multiple threads and an interval timer) then I'll have it
> because it was allocated with the second thread.  

That's correct.

> So I just eliminate the allocation in do_setitimer() entirely.

Again, I'd leave the call to the inline that would do it.
For this implementation plan, its body is:
	BUG_ON(!task->signal->cputime.totals && !thread_group_empty(task));
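
Spelled out as the inline itself, that would be something like (a sketch;
the name is illustrative):

	static inline void thread_group_cputime_alloc(struct task_struct *task)
	{
		/* 2b allocates at clone time, so this must already exist */
		BUG_ON(!task->signal->cputime.totals &&
		       !thread_group_empty(task));
	}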


Thanks,
Roland

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-04-02 19:48                                                     ` Roland McGrath
@ 2008-04-02 20:34                                                       ` Frank Mayhar
  2008-04-02 21:42                                                         ` Frank Mayhar
  2008-04-04 23:17                                                         ` Roland McGrath
  0 siblings, 2 replies; 51+ messages in thread
From: Frank Mayhar @ 2008-04-02 20:34 UTC (permalink / raw)
  To: Roland McGrath; +Cc: linux-kernel

On Wed, 2008-04-02 at 12:48 -0700, Roland McGrath wrote:
> > One quick note:  this inline isn't needed for the 2b solution (allocate
> > percpu storage in copy_signal CLONE_THREAD case), since if there's more
> > than one thread there'll always be a percpu area and if there's only one
> > thread the summation code won't be entered.
> 
> That's true.  I still think it's a good idea to have it, even if it winds
> up being empty in the variants we really use.  The principle is that the
> new set of types/functions could be used to implement exactly what we have
> now.  In fact, it's usually best to do a series of small patches that start
> with introducing the abstraction while not changing anything.

Ah, okay.  Well, except that the whole point of this exercise is to fix
that hang. :-)  But yeah, I understand.

> > And another quick note:  It appears that with the "allocate percpu
> > storage in copy_signal CLONE_THREAD case" mechanism, I don't need to
> > worry about allocating it anywhere else.  If I need it (which is only in
> > the case of multiple threads and an interval timer) then I'll have it
> > because it was allocated with the second thread.  
> 
> That's correct.
> 
> > So I just eliminate the allocation in do_setitimer() entirely.
> 
> Again, I'd leave the call to the inline that would do it.
> For this implementation plan, its body is:
> 	BUG_ON(!task->signal->cputime.totals && !thread_group_empty(task));

BTW I did look at allocating it in posix_cpu_timer_set() and
set_process_cpu_timer() but the first at least is doing stuff with locks
held.  I'll keep looking at it, though.

One little gotcha we just ran into, though:  When checking
tsk->signal->(anything) in run_posix_cpu_timers(), we have to hold
tasklist_lock to avoid a race with release_task().  This is going to
make even the null case always cost more than before.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-04-02 20:34                                                       ` Frank Mayhar
@ 2008-04-02 21:42                                                         ` Frank Mayhar
  2008-04-04  0:53                                                           ` Frank Mayhar
  2008-04-04 23:17                                                         ` Roland McGrath
  1 sibling, 1 reply; 51+ messages in thread
From: Frank Mayhar @ 2008-04-02 21:42 UTC (permalink / raw)
  To: Roland McGrath; +Cc: linux-kernel

On Wed, 2008-04-02 at 13:34 -0700, Frank Mayhar wrote:
> One little gotcha we just ran into, though:  When checking
> tsk->signal->(anything) in run_posix_cpu_timers(), we have to hold
> tasklist_lock to avoid a race with release_task().  This is going to
> make even the null case always cost more than before.

This race, by the way, is because we're dereferencing task->signal at
interrupt once per tick.  We ran into a case where a process was going
through release_task() and being torn down on one CPU while running a
timer tick on another.  Under load.  It's not a very likely race but
with sufficient time or load it's pretty much inevitable.

My thought is to move thread_group_cputime from the signal structure to
hanging directly off the task structure.  It would be shared in the same
way as the signal structure is now but would be deallocated with the
task structure rather than the signal structure.  This should mean that
I could avoid getting tasklist_lock under most conditions.

Thoughts?
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-04-02 21:42                                                         ` Frank Mayhar
@ 2008-04-04  0:53                                                           ` Frank Mayhar
  0 siblings, 0 replies; 51+ messages in thread
From: Frank Mayhar @ 2008-04-04  0:53 UTC (permalink / raw)
  To: Roland McGrath; +Cc: linux-kernel

On Wed, 2008-04-02 at 14:42 -0700, Frank Mayhar wrote:
> On Wed, 2008-04-02 at 13:34 -0700, Frank Mayhar wrote:
> > One little gotcha we just ran into, though:  When checking
> > tsk->signal->(anything) in run_posix_cpu_timers(), we have to hold
> > tasklist_lock to avoid a race with release_task().  This is going to
> > make even the null case always cost more than before.
> 
> This race, by the way, is because we're dereferencing task->signal at
> interrupt once per tick.  We ran into a case where a process was going
> through release_task() and being torn down on one CPU while running a
> timer tick on another.  Under load.  It's not a very likely race but
> with sufficient time or load it's pretty much inevitable.
> 
> My thought is to move thread_group_cputime from the signal structure to
> hanging directly off the task structure.  It would be shared in the same
> way as the signal structure is now but would be deallocated with the
> task structure rather than the signal structure.  This should mean that
> I could avoid getting tasklist_lock under most conditions.

Okay, having run face-first into this race and having every combination
of spinlock serialization fail for me, I've done a variation of the
above scheme.

For the local environment, I solved the problem by moving the percpu
structure out of the signal structure entirely and by making it
refcounted.  It is allocated as before, but now in two parts, a normal
structure with an atomic refcount that has a pointer to the percpu
structure.  The signal structure doesn't point to it any longer, but
each task_struct in the thread group does, and each of these references
is counted.  New threads will also get a reference (at the top of
copy_signal()) and be counted.  All access goes through the task
structure.  References are removed in __put_task_struct() when the task
itself is destroyed; when the last reference goes away, the structures
are freed.
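
A sketch of that arrangement (all names invented for illustration):

        struct thread_group_cputime_ref {
        	atomic_t count;			/* one reference per task */
        	struct task_cputime *totals;	/* the percpu counters */
        };

        static inline void
        thread_group_cputime_put(struct thread_group_cputime_ref *ref)
        {
        	if (atomic_dec_and_test(&ref->count)) {
        		free_percpu(ref->totals);
        		kfree(ref);
        	}
        }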

This eliminates the races with signal_struct being freed and has the
nice effect that there is a bit less overhead in places like
account_group_user_time() and friends.  In run_posix_cpu_timers(),
though, I have to pick up the tasklist_lock early (and therefore in
every case) because it's still dereferencing tsk->signal in the early
comparison.

I'm thinking about moving all of the itimer stuff (i.e. the
cputime_expires structures) into the refcounted structure as well, thus
avoiding the signal_struct entirely so we don't need the tasklist_lock
in the fast path.  I don't know how any of this will affect the UP case,
though.  I'll have to continue to think about it and I'm sure you have
something to say as well.  (And if anyone else wants to chime in,
they're welcome.)
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-04-02 20:34                                                       ` Frank Mayhar
  2008-04-02 21:42                                                         ` Frank Mayhar
@ 2008-04-04 23:17                                                         ` Roland McGrath
  2008-04-06  5:26                                                           ` Frank Mayhar
  1 sibling, 1 reply; 51+ messages in thread
From: Roland McGrath @ 2008-04-04 23:17 UTC (permalink / raw)
  To: Frank Mayhar; +Cc: linux-kernel

> BTW I did look at allocating it in posix_cpu_timer_set() and
> set_process_cpu_timer() but the first at least is doing stuff with locks
> held.  I'll keep looking at it, though.

Yeah, it's a little sticky.  It would be simple enough to arrange things to
do the allocation when needed before taking locks, but would not make the
code wonderfully self-contained.  I wouldn't worry about it unless/until we
conclude for other reasons that this really is the best way to go.  We seem
now to be leaning towards allocating at clone time anyway.

> One little gotcha we just ran into, though:  When checking
> tsk->signal->(anything) in run_posix_cpu_timers(), we have to hold
> tasklist_lock to avoid a race with release_task().  This is going to
> make even the null case always cost more than before.

This is reminiscent of something that came up once before.  I think it
was the same issue of what happens on a tick while the thread is in the
middle of exiting and racing with release_task.  See commit
72ab373a5688a78cbdaf3bf96012e597d5399bb7, commit
3de463c7d9d58f8cf3395268230cb20a4c15bffa and related history (some of
the further history is pre-GIT).

> For the local environment, I solved the problem by moving the percpu
> structure out of the signal structure entirely and by making it
> refcounted.  

This is a big can of worms that we really don't need.  Complicating the
data structure handling this way is really not warranted at all just to
address this race.  You'll just create another version of the same race
with a different pointer, and then solve it some simple way that you
could have just used to solve the existing problem.  If you don't have
some independent (and very compelling) reasons to reorganize the data
structures, nix nix nix.

We can make posix_cpu_timers_exit() or earlier in the exit/reap path
tweak any state we need to ensure that this problem won't come up.
But, off hand, I don't think we need any new state.

Probably the right fix is to make the fast-path check do:

	rcu_read_lock();
	signal = rcu_dereference(current->signal);
	if (unlikely(!signal) || !fastpath_process_timer_check(signal)) {
		rcu_read_unlock();
		return;
	}
	sighand = lock_task_sighand(current, &flags);
	rcu_read_unlock();
	if (likely(sighand)) {
		slowpath_process_timer_work(signal);
		unlock_task_sighand(current, &flags);
	}

Another approach that is probably fine too is just to do:

	if (unlikely(current->exit_state))
		return;

We can never get to the __exit_signal code that causes the race if we
are not already late in exit.  The former seems a little preferable
because the added fast-path cost is the same (or perhaps even more
cache-friendly), but it fires timers even at the very last tick during
exit, and only loses the ideal behavior (of always firing if the timer
expiration is ever crossed) at the last possible instant.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-04-04 23:17                                                         ` Roland McGrath
@ 2008-04-06  5:26                                                           ` Frank Mayhar
  2008-04-07 20:08                                                             ` Roland McGrath
  0 siblings, 1 reply; 51+ messages in thread
From: Frank Mayhar @ 2008-04-06  5:26 UTC (permalink / raw)
  To: Roland McGrath; +Cc: Frank Mayhar, linux-kernel

On Sat, 2008-04-05 at 11:26 -0700, Roland McGrath wrote:
> > BTW I did look at allocating it in posix_cpu_timer_set() and
> > set_process_cpu_timer() but the first at least is doing stuff with locks
> > held.  I'll keep looking at it, though.
> 
> Yeah, it's a little sticky.  It would be simple enough to arrange things to
> do the allocation when needed before taking locks, but would not make the
> code wonderfully self-contained.  I wouldn't worry about it unless/until we
> conclude for other reasons that this really is the best way to go.  We seem
> now to be leaning towards allocating at clone time anyway.

Well, if we allocate at clone time then some things do get a bit
simpler.  (Hmm, for some reason I was thinking that we still allocate in
do_setitimer() as well, but it doesn't look like it.)

> > One little gotcha we just ran into, though:  When checking
> > tsk->signal->(anything) in run_posix_cpu_timers(), we have to hold
> > tasklist_lock to avoid a race with release_task().  This is going to
> > make even the null case always cost more than before.
> 
> This is reminiscent of something that came up once before.  I think it
> was the same issue of what happens on a tick while the thread is in the
> middle of exiting and racing with release_task.  See commit
> 72ab373a5688a78cbdaf3bf96012e597d5399bb7, commit
> 3de463c7d9d58f8cf3395268230cb20a4c15bffa and related history (some of
> the further history is pre-GIT).

Yeah, I checked that out.  The one difference here is that that was a
race between do_exit() (actually release_task()/__exit_signal()) and
run_posix_cpu_timers().  While this race was the same in that respect,
there was also a race between all of the timer-tick routines that call
any of account_group_user_time(), account_group_system_time() or
account_group_exec_runtime() and __exit_signal().  This is because those
functions all dereference tsk->signal.

> > For the local environment, I solved the problem by moving the percpu
> > structure out of the signal structure entirely and by making it
> > refcounted.  
> 
> This is a big can of worms that we really don't need.  Complicating the
> data structure handling this way is really not warranted at all just to
> address this race.  You'll just create another version of the same race
> with a different pointer, and then solve it some simple way that you
> could have just used to solve the existing problem.  If you don't have
> some independent (and very compelling) reasons to reorganize the data
> structures, nix nix nix.

Erm, well, this isn't reorganizing the data structures per se, since
these are new data structures.  Regardless, the way I did this was by
making the refcounted structure follow the lifetime of the task
structure.  It's instantiated when needed and at that point each task
structure in the thread group gets a reference, which is released when
that task structure is destroyed.  The last release destroys the data
structure.

The upshot of this is that none of the timer routines dereference
tsk->signal, so the races go away, no locking needed.  From my
perspective this was the simplest solution, since lock dependency
ordering is _really_ a can of worms.

The solution was really very simple, in fact.

> We can make posix_cpu_timers_exit() or earlier in the exit/reap path
> tweak any state we need to ensure that this problem won't come up.
> But, off hand, I don't think we need any new state.

That's certainly a possibility, but it'll still require more code in the
timer routines to check whatever state there is.

> Probably the right fix is to make the fast-path check do:
> 
> 	rcu_read_lock();
> 	signal = rcu_dereference(current->signal);
> 	if (unlikely(!signal) || !fastpath_process_timer_check(signal)) {
> 		rcu_read_unlock();
> 		return;
> 	}
> 	sighand = lock_task_sighand(current, &flags);
> 	rcu_read_unlock();
> 	if (likely(sighand)) {
> 		slowpath_process_timer_work(signal);
> 		unlock_task_sighand(current, &flags);
> 	}
> 
> Another approach that is probably fine too is just to do:
> 
> 	if (unlikely(current->exit_state))
> 		return;
> 
> We can never get to the __exit_signal code that causes the race if we
> are not already late in exit.  The former seems a little preferable
> because the added fast-path cost is the same (or perhaps even more
> cache-friendly), but it fires timers even at the very last tick during
> exit, and only loses the ideal behavior (of always firing if the timer
> expiration is ever crossed) at the last possible instant.

I'll have to look at the code some more and think about this.  For the
other timer routines I would think that the second approach would be
better, but I still think that best of all is to not have to check
anything at all (except of course the presence of the structure pointer
itself).

Regarding the second approach, without locking wouldn't that still be
racy?  Couldn't exit_state change (and therefore __exit_signal() run)
between the check and the dereference?
-- 
Frank Mayhar frank@exit.com     http://www.exit.com/
Exit Consulting                 http://www.gpsclock.com/
                                http://www.exit.com/blog/frank/
                                http://www.zazzle.com/fmayhar*

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-04-06  5:26                                                           ` Frank Mayhar
@ 2008-04-07 20:08                                                             ` Roland McGrath
  2008-04-07 21:31                                                               ` Frank Mayhar
  2008-04-08 21:27                                                               ` Frank Mayhar
  0 siblings, 2 replies; 51+ messages in thread
From: Roland McGrath @ 2008-04-07 20:08 UTC (permalink / raw)
  To: frank; +Cc: Frank Mayhar, linux-kernel

> Yeah, I checked that out.  The one difference here is that that was a
> race between do_exit() (actually release_task()/__exit_signal()) and
> run_posix_cpu_timers().  While this race was the same in that respect,
> there was also a race between all of the timer-tick routines that call
> any of account_group_user_time(), account_group_system_time() or
> account_group_exec_runtime() and __exit_signal().  This is because those
> functions all dereference tsk->signal.

The essence that matters is the same: something that current does with its
own ->signal on a tick vs something that the release_task path does.

> Erm, well, this isn't reorganizing the data structures per se, since
> these are new data structures.

Tomato, tomato.  You're adding new data structures and lifetime rules to
replace data that was described in a different data structure before, yet
your new data's meaningful semantic lifetime exactly matches that of
signal_struct.  You could as well make everything release_task cleans up be
done in __put_task_struct instead, but that would not be a good idea
either.  You've added a word to task_struct (100000 words per 100000-thread
process, vs one word).  It's just not warranted.

> The upshot of this is that none of the timer routines dereference
> tsk->signal, so the races go away, no locking needed.  From my
> perspective this was the simplest solution, since lock dependency
> ordering is _really_ a can of worms.

With the perspective of tunnel vision to just your one test case, adding
something entirely new considering only that case always seems simplest.
That's not how we keep the entire system from getting the wrong kinds of
complexity.

> Regarding the second approach, without locking wouldn't that still be
> racy?  Couldn't exit_state change (and therefore __exit_signal() run)
> between the check and the dereference?

No.  current->exit_state can go from zero to nonzero only by current
running code in the do_exit path.  current does not progress on that
path while current is inside one update_process_times call.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-04-07 20:08                                                             ` Roland McGrath
@ 2008-04-07 21:31                                                               ` Frank Mayhar
  2008-04-07 22:02                                                                 ` Roland McGrath
  2008-04-08 21:27                                                               ` Frank Mayhar
  1 sibling, 1 reply; 51+ messages in thread
From: Frank Mayhar @ 2008-04-07 21:31 UTC (permalink / raw)
  To: Roland McGrath; +Cc: frank, linux-kernel

On Mon, 2008-04-07 at 13:08 -0700, Roland McGrath wrote: 
> > Yeah, I checked that out.  The one difference here is that that was a
> > race between do_exit() (actually release_task()/__exit_signal()) and
> > run_posix_cpu_timers().  While this race was the same in that respect,
> > there was also a race between all of the timer-tick routines that call
> > any of account_group_user_time(), account_group_system_time() or
> > account_group_exec_runtime() and __exit_signal().  This is because those
> > functions all dereference tsk->signal.
> 
> The essence that matters is the same: something that current does with its
> own ->signal on a tick vs something that the release_task path does.

True.  But see below.

> > Erm, well, this isn't reorganizing the data structures per se, since
> > these are new data structures.
> Tomato, tomato.  You're adding new data structures and lifetime rules to
> replace data that was described in a different data structure before, yet
> your new data's meaningful semantic lifetime exactly matches that of
> signal_struct.

This is one thing that has been unclear.  The relationship of
signal_struct to task_struct is, as far as I can tell, an unwritten one.
Certainly the interrupt routines are adjusting values that live only
inside task_struct and (with the exception of run_posix_cpu_timers())
leave signal_struct carefully alone.

>   You could as well make everything release_task cleans up be
> done in __put_task_struct instead, but that would not be a good idea
> either.  You've added a word to task_struct (100000 words per 100000-thread
> process, vs one word).  It's just not warranted.

While true, that's not the only reason to do it.  The tradeoff here is
between performance (i.e. having to do checks before dereferencing
tsk->signal) versus space.  It's really a judgment call.  (Although
adding 100Kwords does have a bit of weight.)

> > The upshot of this is that none of the timer routines dereference
> > tsk->signal, so the races go away, no locking needed.  From my
> > perspective this was the simplest solution, since lock dependency
> > ordering is _really_ a can of worms.
> 
> With the perspective of tunnel vision to just your one test case, adding
> something entirely new considering only that case always seems simplest.

Well, yes.  And not just "seems."

> That's not how we keep the entire system from getting the wrong kinds of
> complexity.

This isn't exactly how I would state it but yes, this is generally true
as well.  The problem is that knowing exactly what is "the wrong kinds"
relies on knowledge possessed by only a few.  Prying that knowledge out
of you guys can be a chore. :-)

> > Regarding the second approach, without locking wouldn't that still be
> > racy?  Couldn't exit_state change (and therefore __exit_signal() run)
> > between the check and the dereference?
> 
> No.  current->exit_state can go from zero to nonzero only by current
> running code in the do_exit path.  current does not progress on that
> path while current is inside one update_process_times call.

Well, okay, this is the vital bit of data that puts everything above
into perspective.  Had I known this, I would not have made the change I
did.

I guess the key bit of knowledge is that a "task" is really a scheduling
unit, right?  And, really, from the scheduler's perspective, "task" is
the same as "thread."  The only thing that makes a set of threads into a
multithreaded process is that they share a signal struct (well, and
their memory map, of course).  So a "task" can execute on at most one
cpu at any given time, never on two cpus at once.  Therefore if a
"task" is executing and is interrupted, the value of "current" at the
interrupt will be that task, which is entirely suspended for the
duration of the interrupt.
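
(In clone() terms, a minimal userspace sketch of that understanding;
spawn_thread() is an illustrative name here, not a real API:)

	/* Illustrative only: the sharing flags are what make a "thread". */
	#define _GNU_SOURCE
	#include <sched.h>

	static int spawn_thread(int (*fn)(void *), void *stack_top, void *arg)
	{
		/* CLONE_THREAD shares the signal_struct (and requires
		 * CLONE_SIGHAND, which in turn requires CLONE_VM), so the
		 * kernel sees just another task in the same process. */
		return clone(fn, stack_top,
			     CLONE_VM | CLONE_SIGHAND | CLONE_THREAD, arg);
	}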

Is this correct?  (This is not just for this fix, but for my general
understanding of Linux scheduling.)

Unfortunately, these things are often implicit in the code but as far as
I know aren't written down anywhere.  This whole exercise has been for
me a process of becoming really familiar with the internals of the Linux
kernel for the first time.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-04-07 21:31                                                               ` Frank Mayhar
@ 2008-04-07 22:02                                                                 ` Roland McGrath
  0 siblings, 0 replies; 51+ messages in thread
From: Roland McGrath @ 2008-04-07 22:02 UTC (permalink / raw)
  To: Frank Mayhar; +Cc: frank, linux-kernel

> This is one thing that has been unclear.  The relationship of
> signal_struct to task_struct is, as far as I can tell, an unwritten one.

It's written no less than most of them. ;-)

> While true, that's not the only reason to do it.  The tradeoff here is
> between performance (i.e. having to do checks before dereferencing
> tsk->signal) versus space.  It's really a judgment call.  (Although
> adding 100Kwords does have a bit of weight.)

No, the performance idea there is a myth.  You're talking about one test
and branch-not-taken for a word you're already loading into a register
right there anyway (if testing ->signal).  It's maybe two cycles that were
most likely already idle in a load stall.  The cache effects alone of
pushing parts of task_struct a word further away probably swamp it.
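
(That is, the whole guard amounts to:)

	struct signal_struct *sig = tsk->signal;  /* a load you do anyway */
	if (unlikely(!sig))	/* one test, branch not taken normally */
		return;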

> This isn't exactly how I would state it but yes, this is generally true
> as well.  The problem is that knowing exactly what is "the wrong kinds"
> relies on knowledge possessed by only a few.  Prying that knowledge out
> of you guys can be a chore. :-)

But when it comes out, it's flying at high velocity straight into your face!
Surely that's helpful.

> I guess the key bit of knowledge is that a "task" is really a scheduling
> unit, right?  And, really, from the scheduler's perspective, "task" is
> the same as "thread."  

Yes (and from the general Unix-lingo perspective too).  

> The only thing that makes a set of threads into a multithreaded process
> is that they share a signal struct (well, and their memory map, of
> course).  

There are several other things that are implicitly required to be shared
when signal_struct is shared, too.  But approximately, yes.  Had I been in
charge of the world, task_struct would be 'struct thread' and signal_struct
would be 'struct process'.  (Cue left-field flames from the peanut gallery
about what the words mean and Linux exceptionalism.)

> So a "task" can only be executed on a single cpu at any time, it can't be
> executed on more than one cpu at a time.  Therefore if a "task" is
> executing and is interrupted, the value of "current" at the interrupt
> will be that task, which is entirely suspended for the duration of the
> interrupt.

Correct.

> Unfortunately, these things are often implicit in the code but as far as
> I know aren't written down anywhere.  This whole exercise has been for
> me a process of becoming really familiar with the internals of the Linux
> kernel for the first time.

Everything I know I learned from reading the source.  So I sympathize
with the sense of starting out lost without bearings, but I may be a
little hard-hearted about anyone wanting more than their eyeballs and
full-text searching to find their own bootstraps and pull (in my day,
it was uphill both ways, and all that).


Thanks,
Roland

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-04-07 20:08                                                             ` Roland McGrath
  2008-04-07 21:31                                                               ` Frank Mayhar
@ 2008-04-08 21:27                                                               ` Frank Mayhar
  2008-04-08 21:52                                                                 ` Frank Mayhar
  2008-04-08 22:49                                                                 ` Roland McGrath
  1 sibling, 2 replies; 51+ messages in thread
From: Frank Mayhar @ 2008-04-08 21:27 UTC (permalink / raw)
  To: Roland McGrath; +Cc: frank, linux-kernel

On Mon, 2008-04-07 at 13:08 -0700, Roland McGrath wrote:
> > Regarding the second approach, without locking wouldn't that still be
> > racy?  Couldn't exit_state change (and therefore __exit_signal() run)
> > between the check and the dereference?
> No.  current->exit_state can go from zero to nonzero only by current
> running code in the do_exit path.  current does not progress on that
> path while current is inside one update_process_times call.

Okay.  One of the paths to the update code is through update_curr() in
sched_fair.c, which (in my tree) calls account_group_exec_runtime() to
update the sum_exec_runtime field:

	delta_exec = (unsigned long)(now - curr->exec_start);

	__update_curr(cfs_rq, curr, delta_exec);
	curr->exec_start = now;

	if (entity_is_task(curr)) {
		struct task_struct *curtask = task_of(curr);

		cpuacct_charge(curtask, delta_exec);
		account_group_exec_runtime(curtask, delta_exec);
	}

To make sure that I understand what's going on, I put an invariant at
the beginning of account_group_exec_runtime():

static inline void account_group_exec_runtime(struct task_struct *tsk,
					       unsigned long long ns)
{
	struct signal_struct *sig = tsk->signal;
	struct task_cputime *times;

	BUG_ON(tsk != current);
	if (unlikely(tsk->exit_state))
		return;
	if (!sig->cputime.totals)
		return;
	times = per_cpu_ptr(sig->cputime.totals, get_cpu());
	times->sum_exec_runtime += ns;
	put_cpu_no_resched();
}

And, you guessed it, the invariant gets violated.  Apparently the passed
task_struct isn't the same as "current" at this point.

Any ideas?  Am I checking the wrong thing?  If we're really not updating
current then the task we are updating could very easily be running
through __exit_signal() on another CPU.  (And while I wait for your
response I will of course continue to try to figure this out.)
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-04-08 21:27                                                               ` Frank Mayhar
@ 2008-04-08 21:52                                                                 ` Frank Mayhar
  2008-04-08 22:49                                                                 ` Roland McGrath
  1 sibling, 0 replies; 51+ messages in thread
From: Frank Mayhar @ 2008-04-08 21:52 UTC (permalink / raw)
  To: Roland McGrath; +Cc: frank, linux-kernel

On Tue, 2008-04-08 at 14:27 -0700, Frank Mayhar wrote:
> And, you guessed it, the invariant gets violated.  Apparently the passed
> task_struct isn't the same as "current" at this point.
> 
> Any ideas?  Am I checking the wrong thing?  If we're really not updating
> current then the task we are updating could very easily be running
> through __exit_signal() on another CPU.  (And while I wait for your
> response I will of course continue to try to figure this out.)

Found the exception.  do_fork() violates the invariant when it's
cranking up a new process.  Hmmm.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-04-08 21:27                                                               ` Frank Mayhar
  2008-04-08 21:52                                                                 ` Frank Mayhar
@ 2008-04-08 22:49                                                                 ` Roland McGrath
  2008-04-09 16:29                                                                   ` Frank Mayhar
  1 sibling, 1 reply; 51+ messages in thread
From: Roland McGrath @ 2008-04-08 22:49 UTC (permalink / raw)
  To: Frank Mayhar; +Cc: linux-kernel

My explanation about the constraints on exit_state was specifically about
the context of update_process_times(), which is part of the path for a
clock tick interrupting current.

> And, you guessed it, the invariant gets violated.  Apparently the passed
> task_struct isn't the same as "current" at this point.

The scheduler code has gotten a lot more complex since I first implemented
posix-cpu-timers, and I've never been any expert on the scheduler at all.
But I'm moderately sure those paths are all involved in context
switch, where the task of interest is about to be on a CPU or just was
on one.  I doubt those are places where the task in question could be
simultaneously executing in exit_notify() on another CPU.  But we'd need
to ask the scheduler experts to be sure we know what we're talking about
there.

> Found the exception.  do_fork() violates the invariant when it's
> cranking up a new process.  Hmmm.

I haven't figured out what actual code path this refers to.


This sort of concern is among the reasons that checking ->signal was the
course I found wise to suggest to begin with.  We can figure out what the
constraints on ->exit_state are in all the places by understanding every
corner of the scheduler.  We can measure whether it winds up being in a
cooler cache line than ->signal and a net loss to add the load, or has
superior performance as you seem to think.  Or we can just test the
constraint that matters, whether the pointer we loaded was in fact null,
and rely on RCU to make it not matter if there is a race after that load.
It doesn't matter whether tsk is current or not, it only matters that we
have the pointer and that we're using some CPU array slot or other that
no one else is using simultaneously.

    static inline void account_group_exec_runtime(struct task_struct *tsk,
						  unsigned long long runtime)
    {
	    struct signal_struct *sig;
	    struct task_cputime *times;

	    rcu_read_lock();
	    sig = rcu_dereference(tsk->signal);
	    if (likely(sig) && sig->cputime.totals) {
		    times = per_cpu_ptr(sig->cputime.totals, get_cpu());
		    times->sum_exec_runtime += runtime;
		    put_cpu_no_resched();
	    }
	    rcu_read_unlock();
    }
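
(The other half of that bargain is the teardown side: ->signal must be
cleared with proper publication order and the struct freed only after a
grace period.  Schematically -- free_signal_rcu and the embedded
rcu_head are hypothetical here, not existing code:)

	/* Schematic release path, pairing with the reader above. */
	rcu_assign_pointer(tsk->signal, NULL);
	...
	call_rcu(&sig->rcu, free_signal_rcu);	/* freed after a grace period */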


Thanks,
Roland

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: posix-cpu-timers revamp
  2008-04-08 22:49                                                                 ` Roland McGrath
@ 2008-04-09 16:29                                                                   ` Frank Mayhar
  0 siblings, 0 replies; 51+ messages in thread
From: Frank Mayhar @ 2008-04-09 16:29 UTC (permalink / raw)
  To: Roland McGrath; +Cc: linux-kernel

On Tue, 2008-04-08 at 15:49 -0700, Roland McGrath wrote:
> My explanation about the constraints on exit_state was specifically about
> the context of update_process_times(), which is part of the path for a
> clock tick interrupting current.

Understood.

> > And, you guessed it, the invariant gets violated.  Apparently the passed
> > task_struct isn't the same as "current" at this point.
> 
> The scheduler code has gotten a lot more complex since I first implemented
> posix-cpu-timers, and I've never been any expert on the scheduler at all.
> But I'm moderately sure all those things are all involved in context
> switch where the task of interest is about to be on a CPU or just was on a
> CPU.  I doubt those are places where the task in question could be
> simultaneously executing in exit_notify() on another CPU.  But we'd need
> to ask the scheduler experts to be sure we know what we're talking about
> there.

This was my conclusion as well.  Certainly the path through do_fork()
(elucidated below) doesn't allow the task in question even to be
executing, much less on a different CPU, but all these routines with
"struct task_struct *" parameters make me nervous.  Which is why I
inserted the invariant check in the first place.

> > Found the exception.  do_fork() violates the invariant when it's
> > cranking up a new process.  Hmmm.
> 
> I haven't figured out what actual code path this refers to.

do_fork=>wake_up_new_task=>task_new_fair=>
	enqueue_task_fair=>enqueue_entity=>update_curr
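
(Spelling out where the invariant trips, as I read it -- abridged from
kernel/sched.c, comments mine:)

	/* The parent executes this on behalf of the newly forked child,
	 * so down in update_curr() task_of(curr) is the child while
	 * current is the parent.  Nothing races; the child just hasn't
	 * run yet. */
	void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
	{
		...
		p->sched_class->task_new(rq, p);	/* p is the child */
		...
	}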

> This sort of concern is among the reasons that checking ->signal was the
> course I found wise to suggest to begin with.  We can figure out what the
> constraints on ->exit_state are in all the places by understanding every
> corner of the scheduler.

Well, as much as I would like to take the time to do that, I do have a
_real_ job here. :-)

>   We can measure whether it winds up being in a
> cooler cache line than ->signal and a net loss to add the load, or has
> superior performance as you seem to think.  Or we can just test the
> constraint that matters, whether the pointer we loaded was in fact null,
> and rely on RCU to make it not matter if there is a race after that load.
> It doesn't matter whether tsk is current or not, it only matters that we
> have the pointer and that we're using some CPU array slot or other that
> noone else is using simultaneously.
> 
>     static inline void account_group_exec_runtime(struct task_struct *tsk,
> 						  unsigned long long runtime)
>     {
> 	    struct signal_struct *sig;
> 	    struct task_cputime *times;
> 
> 	    rcu_read_lock();
> 	    sig = rcu_dereference(tsk->signal);
> 	    if (likely(sig) && sig->cputime.totals) {
> 		    times = per_cpu_ptr(sig->cputime.totals, get_cpu());
> 		    times->sum_exec_runtime += runtime;
> 		    put_cpu_no_resched();
> 	    }
> 	    rcu_read_unlock();
>     }

Yeah, agreed.  Of course, I was hoping (in vain, apparently) to avoid
this level of overhead here.  And I suspect I'll really have to do it in
each of these routines.  But I suppose it can't be helped.
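
(Concretely, the user-time variant would come out like this -- my
adaptation of your snippet, untested, using the task_cputime fields
from my patch:)

	static inline void account_group_user_time(struct task_struct *tsk,
						   cputime_t cputime)
	{
		struct signal_struct *sig;
		struct task_cputime *times;

		rcu_read_lock();
		sig = rcu_dereference(tsk->signal);
		if (likely(sig) && sig->cputime.totals) {
			times = per_cpu_ptr(sig->cputime.totals, get_cpu());
			times->utime = cputime_add(times->utime, cputime);
			put_cpu_no_resched();
		}
		rcu_read_unlock();
	}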

Even with a thorough understanding of the scheduler(s) and code based on
that understanding, we would still not (necessarily) be protected from
future changes that violate the assumptions we make on that basis.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2008-04-09 16:30 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <bug-9906-10286@http.bugzilla.kernel.org/>
2008-02-07  0:50 ` [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF Andrew Morton
2008-02-07  0:58   ` Frank Mayhar
2008-02-07  2:57     ` Parag Warudkar
2008-02-07 15:22       ` Alejandro Riveira Fernández
2008-02-07 15:53         ` Parag Warudkar
2008-02-07 15:56           ` Parag Warudkar
2008-02-07 15:54             ` Alejandro Riveira Fernández
2008-02-07 16:01               ` Parag Warudkar
2008-02-07 16:53                 ` Parag Warudkar
2008-02-29 19:55                   ` Frank Mayhar
2008-03-04  7:00                     ` Roland McGrath
2008-03-04 19:52                       ` Frank Mayhar
2008-03-05  4:08                         ` Roland McGrath
2008-03-06 19:04                           ` Frank Mayhar
2008-03-11  7:50                             ` posix-cpu-timers revamp Roland McGrath
2008-03-11 21:05                               ` Frank Mayhar
2008-03-11 21:35                                 ` Roland McGrath
2008-03-14  0:37                                   ` Frank Mayhar
2008-03-21  7:18                                     ` Roland McGrath
2008-03-21 17:57                                       ` Frank Mayhar
2008-03-22 21:58                                         ` Roland McGrath
2008-03-24 17:34                                           ` Frank Mayhar
2008-03-24 22:43                                             ` Frank Mayhar
2008-03-31  5:44                                             ` Roland McGrath
2008-03-31 20:24                                               ` Frank Mayhar
2008-04-02  2:07                                                 ` Roland McGrath
2008-04-02 16:34                                                   ` Frank Mayhar
2008-04-02 17:42                                                   ` Frank Mayhar
2008-04-02 19:48                                                     ` Roland McGrath
2008-04-02 20:34                                                       ` Frank Mayhar
2008-04-02 21:42                                                         ` Frank Mayhar
2008-04-04  0:53                                                           ` Frank Mayhar
2008-04-04 23:17                                                         ` Roland McGrath
2008-04-06  5:26                                                           ` Frank Mayhar
2008-04-07 20:08                                                             ` Roland McGrath
2008-04-07 21:31                                                               ` Frank Mayhar
2008-04-07 22:02                                                                 ` Roland McGrath
2008-04-08 21:27                                                               ` Frank Mayhar
2008-04-08 21:52                                                                 ` Frank Mayhar
2008-04-08 22:49                                                                 ` Roland McGrath
2008-04-09 16:29                                                                   ` Frank Mayhar
2008-04-02 18:42                                                   ` Frank Mayhar
2008-03-28  0:52                                           ` [PATCH 2.6.25-rc6] Fix itimer/many thread hang Frank Mayhar
2008-03-28 10:28                                             ` Ingo Molnar
2008-03-28 22:46                                             ` [PATCH 2.6.25-rc7 resubmit] " Frank Mayhar
2008-04-01 18:45                                               ` Andrew Morton
2008-04-01 21:46                                                 ` Frank Mayhar
2008-03-21 20:40                                       ` posix-cpu-timers revamp Frank Mayhar
2008-03-07 23:26                           ` [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF Frank Mayhar
2008-03-08  0:01                             ` Frank Mayhar
2008-02-07 17:36           ` Frank Mayhar

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).