* [ANNOUNCE] 3.0.4-rt13
@ 2011-09-10  9:12 Thomas Gleixner
  2011-09-10 14:53 ` Madovsky
                   ` (3 more replies)
  0 siblings, 4 replies; 52+ messages in thread
From: Thomas Gleixner @ 2011-09-10  9:12 UTC (permalink / raw)
  To: LKML; +Cc: linux-rt-users

Dear RT Folks,

I'm pleased to announce the 3.0.4-rt13 release.

Changes versus 3.0.2-rt11

  * Migrate disable cure (Mike, Peter)

  * ARM smpboot fix
  
  * Printk debug (Peter)

  * Ftrace: workaround the waitqueue issue

Patch against 3.0.4 can be found here:

  https://tglx.de/~tglx/rt/patch-3.0.4-rt13.patch.gz

The split quilt queue is available at:

  https://tglx.de/~tglx/rt/patches-3.0.4-rt13.tar.gz

For those who don't have 3.0.4 around:

  git://tesla.tglx.de/git/linux-2.6-tip rt/3.0
  
  https://tglx.de/~tglx/rt/patch-3.0.4.gz

Thanks,

	tglx


* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-10  9:12 [ANNOUNCE] 3.0.4-rt13 Thomas Gleixner
@ 2011-09-10 14:53 ` Madovsky
  2011-09-10 17:27 ` Rolando Martins
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 52+ messages in thread
From: Madovsky @ 2011-09-10 14:53 UTC (permalink / raw)
  To: Thomas Gleixner, LKML

Good job, Thomas,
thank you, I will try it soon.

Franck

----- Original Message ----- 
From: "Thomas Gleixner" <tglx@linutronix.de>
To: "LKML" <linux-kernel@vger.kernel.org>
Cc: "linux-rt-users" <linux-rt-users@vger.kernel.org>
Sent: Saturday, September 10, 2011 5:12 AM
Subject: [ANNOUNCE] 3.0.4-rt13


> Dear RT Folks,
>
> I'm pleased to announce the 3.0.4-rt13 release.
>
> Changes versus 3.0.2-rt11
>
>  * Migrate disable cure (Mike, Peter)
>
>  * ARM smpboot fix
>
>  * Printk debug (Peter)
>
>  * Ftrace: workaround the waitqueue issue
>
> Patch against 3.0.4 can be found here:
>
>  https://tglx.de/~tglx/rt/patch-3.0.4-rt13.patch.gz
>
> The split quilt queue is available at:
>
>  https://tglx.de/~tglx/rt/patches-3.0.4-rt13.tar.gz
>
> For those who don't have 3.0.4 around:
>
>  git://tesla.tglx.de/git/linux-2.6-tip rt/3.0
>
>  https://tglx.de/~tglx/rt/patch-3.0.4.gz
>
> Thanks,
>
> tglx
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html 



* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-10  9:12 [ANNOUNCE] 3.0.4-rt13 Thomas Gleixner
  2011-09-10 14:53 ` Madovsky
@ 2011-09-10 17:27 ` Rolando Martins
  2011-09-11 10:35   ` Mike Galbraith
  2011-09-11 18:14 ` Mike Galbraith
  3 siblings, 0 replies; 52+ messages in thread
From: Rolando Martins @ 2011-09-10 17:27 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, linux-rt-users, Peter Zijlstra

Hi,
recent work from Peter fixed the crashes caused by sched_rt_group (cgroups).
Nevertheless, its use still makes the system sluggish.
To test this, it's just a matter of mounting the "cpu" cgroup and then:
echo 100 > cpu.rt_period_us
echo 100 > cpu.rt_runtime_us
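
For completeness, the same recipe as a small C program (the /dev/cgroup
mount point and the mkdir/mount handling are illustrative assumptions,
not part of the original recipe; run it as root):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/stat.h>

/* Write a single value into a cgroup control file. */
static void write_val(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fprintf(f, "%s\n", val) < 0) {
		perror(path);
		exit(1);
	}
	fclose(f);
}

int main(void)
{
	/* mount -t cgroup -o cpu none /dev/cgroup */
	if (mkdir("/dev/cgroup", 0755) && errno != EEXIST) {
		perror("mkdir");
		return 1;
	}
	if (mount("none", "/dev/cgroup", "cgroup", 0, "cpu")) {
		perror("mount");
		return 1;
	}
	/* echo 100 > cpu.rt_period_us; echo 100 > cpu.rt_runtime_us */
	write_val("/dev/cgroup/cpu.rt_period_us", "100");
	write_val("/dev/cgroup/cpu.rt_runtime_us", "100");
	return 0;
}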

Thanks,
Rolando

On Sat, Sep 10, 2011 at 10:12 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> Dear RT Folks,
>
> I'm pleased to announce the 3.0.4-rt13 release.
>
> Changes versus 3.0.2-rt11
>
>  * Migrate disable cure (Mike, Peter)
>
>  * ARM smpboot fix
>
>  * Printk debug (Peter)
>
>  * Ftrace: workaround the waitqueue issue
>
> Patch against 3.0.4 can be found here:
>
>  https://tglx.de/~tglx/rt/patch-3.0.4-rt13.patch.gz
>
> The split quilt queue is available at:
>
>  https://tglx.de/~tglx/rt/patches-3.0.4-rt13.tar.gz
>
> For those who don't have 3.0.4 around:
>
>  git://tesla.tglx.de/git/linux-2.6-tip rt/3.0
>
>  https://tglx.de/~tglx/rt/patch-3.0.4.gz
>
> Thanks,
>
>        tglx
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-10  9:12 [ANNOUNCE] 3.0.4-rt13 Thomas Gleixner
@ 2011-09-11 10:35   ` Mike Galbraith
  2011-09-10 17:27 ` Rolando Martins
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2011-09-11 10:35 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra; +Cc: LKML, linux-rt-users

On Sat, 2011-09-10 at 11:12 +0200, Thomas Gleixner wrote: 
> Dear RT Folks,
> 
> I'm pleased to announce the 3.0.4-rt13 release.
> 
> Changes versus 3.0.2-rt11
> 
>   * Migrate disable cure (Mike, Peter)

The warning triggers.

[  134.105241] ------------[ cut here ]------------
[  134.105249] WARNING: at kernel/sched.c:6146 migrate_disable+0x1ae/0x1f0()
[  134.105250] Hardware name: MS-7502
[  134.105252] Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd nfsd lockd nfs_acl auth_rpcgss sunrpc parport_pc parport bridge ipv6 stp cpufreq_conservative cpufreq_ondemand cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode mperf nls_iso8859_1 nls_cp437 vfat fat fuse ext3 jbd dm_mod snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep kvm_intel kvm snd_pcm usb_storage usb_libusual uas snd_timer snd sr_mod cdrom firewire_ohci sg firewire_core e1000e soundcore snd_page_alloc crc_itu_t i2c_i801 button ext4 mbcache jbd2 crc16 usbhid hid uhci_hcd sd_mod ehci_hcd usbcore rtc_cmos ahci libahci libata scsi_mod fan processor thermal
[  134.105291] Pid: 6224, comm: lmsched Not tainted 3.0.4-rt13 #2040
[  134.105293] Call Trace:
[  134.105298]  [<ffffffff8103e25a>] warn_slowpath_common+0x7a/0xb0
[  134.105300]  [<ffffffff8103e2a5>] warn_slowpath_null+0x15/0x20
[  134.105303]  [<ffffffff810360de>] migrate_disable+0x1ae/0x1f0
[  134.105307]  [<ffffffff8104f3e2>] do_sigtimedwait+0x62/0x1c0
[  134.105310]  [<ffffffff8104f5c4>] sys_rt_sigtimedwait+0x84/0xd0
[  134.105314] ------------[ cut here ]------------
[  134.105324] WARNING: at kernel/sched.c:6146 migrate_disable+0x1ae/0x1f0()
[  134.105329]  [<ffffffff8105a215>] ? sys_timer_settime+0x185/0x220
[  134.105331] Hardware name: MS-7502
[  134.105334] Modules linked in: [<ffffffff8130983b>] system_call_fastpath+0x16/0x1b
[  134.105341]  snd_pcm_oss
[  134.105343] ---[ end trace 0000000000000002 ]---
[  134.105346]  snd_mixer_oss
[  134.105348] ------------[ cut here ]------------
[  134.105351]  snd_seq
[  134.105354] WARNING: at kernel/sched.c:6205 migrate_enable+0x200/0x280()
[  134.105357]  snd_seq_device
[  134.105359] Hardware name: MS-7502
[  134.105363] ------------[ cut here ]------------
[  134.105369]  edd
[  134.105370] Modules linked in:
[  134.105376] WARNING: at kernel/sched.c:6205 migrate_enable+0x200/0x280()
[  134.105384]  nfsd snd_pcm_oss
[  134.105387] Hardware name: MS-7502
[  134.105391]  lockd snd_mixer_oss
[  134.105393] Modules linked in: nfs_acl snd_seq snd_pcm_oss auth_rpcgss snd_seq_device snd_mixer_oss sunrpc edd snd_seq parport_pc nfsd snd_seq_device parport lockd edd bridge nfs_acl nfsd ipv6 auth_rpcgss lockd stp sunrpc nfs_acl cpufreq_conservative parport_pc auth_rpcgss cpufreq_ondemand parport sunrpc cpufreq_userspace bridge parport_pc cpufreq_powersave ipv6 parport acpi_cpufreq stp bridge microcode cpufreq_conservative ipv6 mperf cpufreq_ondemand stp nls_iso8859_1 cpufreq_userspace cpufreq_conservative nls_cp437 cpufreq_powersave cpufreq_ondemand vfat acpi_cpufreq cpufreq_userspace fat microcode cpufreq_powersave fuse mperf acpi_cpufreq ext3 nls_iso8859_1 microcode jbd nls_cp437 mperf dm_mod vfat nls_iso8859_1 snd_hda_codec_realtek fat nls_cp437 snd_hda_intel fuse vfat snd_hda_codec ext3 fat snd_hwdep jbd fuse kvm_intel dm_mod ext3 kvm snd_hda_codec_realtek jbd snd_pcm snd_hda_intel dm_mod usb_storage snd_hda_codec snd_hda_codec_realtek usb_libusual snd_hwdep snd_hda_intel uas kvm_intel snd_hda_codec snd_timer kvm snd_hwdep snd snd_pcm kvm_intel sr_mod usb_storage kvm cdrom usb_libusual snd_pcm firewire_ohci uas usb_storage sg snd_timer usb_libusual firewire_core snd uas e1000e sr_mod snd_timer soundcore cdrom snd snd_page_alloc firewire_ohci sr_mod crc_itu_t sg cdrom i2c_i801 firewire_core firewire_ohci button e1000e sg ext4 soundcore firewire_core mbcache snd_page_alloc e1000e jbd2 crc_itu_t soundcore crc16 i2c_i801 snd_page_alloc usbhid button crc_itu_t hid ext4 i2c_i801 uhci_hcd mbcache button sd_mod jbd2 ext4 ehci_hcd crc16 mbcache usbcore usbhid jbd2 rtc_cmos hid crc16 ahci uhci_hcd usbhid libahci sd_mod hid libata ehci_hcd uhci_hcd scsi_mod usbcore sd_mod fan rtc_cmos ehci_hcd processor ahci usbcore thermal libahci rtc_cmos
[  134.105548]  libata ahciPid: 6220, comm: lmsched Tainted: G        W   3.0.4-rt13 #2040
[  134.105553]  scsi_mod libahciCall Trace:
[  134.105559]  fan libata processor scsi_mod thermal fan [<ffffffff8103e25a>] warn_slowpath_common+0x7a/0xb0
[  134.105566] 
[  134.105568]  processorPid: 6224, comm: lmsched Tainted: G        W   3.0.4-rt13 #2040
[  134.105577]  thermal [<ffffffff8103e2a5>] warn_slowpath_null+0x15/0x20
[  134.105582] Call Trace:
[  134.105584] 
[  134.105588] Pid: 6222, comm: lmsched Tainted: G        W   3.0.4-rt13 #2040
[  134.105596]  [<ffffffff810360de>] migrate_disable+0x1ae/0x1f0
[  134.105599]  [<ffffffff8103e25a>] warn_slowpath_common+0x7a/0xb0
[  134.105606] Call Trace:
[  134.105611]  [<ffffffff8104f3e2>] do_sigtimedwait+0x62/0x1c0
[  134.105614]  [<ffffffff8103e2a5>] warn_slowpath_null+0x15/0x20
[  134.105622]  [<ffffffff8103e25a>] warn_slowpath_common+0x7a/0xb0
[  134.105631]  [<ffffffff8104f5c4>] sys_rt_sigtimedwait+0x84/0xd0
[  134.105635]  [<ffffffff81035eb0>] migrate_enable+0x200/0x280
[  134.105642]  [<ffffffff8103e2a5>] warn_slowpath_null+0x15/0x20
[  134.105650]  [<ffffffff81307547>] ? preempt_schedule_irq+0x37/0x50
[  134.105655]  [<ffffffff8104f4df>] do_sigtimedwait+0x15f/0x1c0
[  134.105664]  [<ffffffff81035eb0>] migrate_enable+0x200/0x280
[  134.105672]  [<ffffffff81309346>] ? retint_kernel+0x26/0x30
[  134.105676]  [<ffffffff8104f5c4>] sys_rt_sigtimedwait+0x84/0xd0
[  134.105683]  [<ffffffff8104f4df>] do_sigtimedwait+0x15f/0x1c0
[  134.105691]  [<ffffffff8130983b>] system_call_fastpath+0x16/0x1b
[  134.105694]  [<ffffffff8105a215>] ? sys_timer_settime+0x185/0x220
[  134.105703]  [<ffffffff8104f5c4>] sys_rt_sigtimedwait+0x84/0xd0
[  134.105710] ---[ end trace 0000000000000003 ]---
[  134.105714]  [<ffffffff8130983b>] system_call_fastpath+0x16/0x1b
[  134.105721]  [<ffffffff8105a215>] ? sys_timer_settime+0x185/0x220
[  134.105726] ------------[ cut here ]------------
[  134.105729] ---[ end trace 0000000000000004 ]---
[  134.105735] WARNING: at kernel/sched.c:6205 migrate_enable+0x200/0x280()
[  134.105739]  [<ffffffff8130983b>] system_call_fastpath+0x16/0x1b
[  134.105742] Hardware name: MS-7502
[  134.105744] ---[ end trace 0000000000000005 ]---
[  134.105746] Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd nfsd lockd nfs_acl auth_rpcgss sunrpc parport_pc parport bridge ipv6 stp cpufreq_conservative cpufreq_ondemand cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode mperf nls_iso8859_1 nls_cp437 vfat fat fuse ext3 jbd dm_mod snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep kvm_intel kvm snd_pcm usb_storage usb_libusual uas snd_timer snd sr_mod cdrom firewire_ohci sg firewire_core e1000e soundcore snd_page_alloc crc_itu_t i2c_i801 button ext4 mbcache jbd2 crc16 usbhid hid uhci_hcd sd_mod ehci_hcd usbcore rtc_cmos ahci libahci libata scsi_mod fan processor thermal
[  134.105769] Pid: 6220, comm: lmsched Tainted: G        W   3.0.4-rt13 #2040
[  134.105771] Call Trace:
[  134.105773]  [<ffffffff8103e25a>] warn_slowpath_common+0x7a/0xb0
[  134.105775]  [<ffffffff8103e2a5>] warn_slowpath_null+0x15/0x20
[  134.105777]  [<ffffffff81035eb0>] migrate_enable+0x200/0x280
[  134.105779]  [<ffffffff8104f4df>] do_sigtimedwait+0x15f/0x1c0
[  134.105782]  [<ffffffff8104f5c4>] sys_rt_sigtimedwait+0x84/0xd0
[  134.105784]  [<ffffffff81307547>] ? preempt_schedule_irq+0x37/0x50
[  134.105786]  [<ffffffff81309346>] ? retint_kernel+0x26/0x30
[  134.105789]  [<ffffffff8130983b>] system_call_fastpath+0x16/0x1b
[  134.105791] ---[ end trace 0000000000000006 ]---

(gdb) list *do_sigtimedwait+0x62
0xffffffff8104f3e2 is in do_sigtimedwait (kernel/signal.c:2628).
2623             * Invert the set of allowed signals to get those we want to block.
2624             */
2625            sigdelsetmask(&mask, sigmask(SIGKILL) | sigmask(SIGSTOP));
2626            signotset(&mask);
2627
2628            spin_lock_irq(&tsk->sighand->siglock);
2629            sig = dequeue_signal(tsk, &mask, info);
2630            if (!sig && timeout) {
2631                    /*
2632                     * None ready, temporarily unblock those we're interested
(gdb) list *do_sigtimedwait+0x15f
0xffffffff8104f4df is in do_sigtimedwait (kernel/signal.c:2642).
2637                    tsk->real_blocked = tsk->blocked;
2638                    sigandsets(&tsk->blocked, &tsk->blocked, &mask);
2639                    recalc_sigpending();
2640                    spin_unlock_irq(&tsk->sighand->siglock);
2641
2642                    timeout = schedule_timeout_interruptible(timeout);
2643
2644                    spin_lock_irq(&tsk->sighand->siglock);
2645                    __set_task_blocked(tsk, &tsk->real_blocked);
2646                    siginitset(&tsk->real_blocked, 0);





* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-11 10:35   ` Mike Galbraith
  (?)
@ 2011-09-11 17:01   ` Mike Galbraith
  2011-09-12  7:24     ` Thomas Gleixner
  -1 siblings, 1 reply; 52+ messages in thread
From: Mike Galbraith @ 2011-09-11 17:01 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Peter Zijlstra, LKML, linux-rt-users

On Sun, 2011-09-11 at 12:35 +0200, Mike Galbraith wrote:
> On Sat, 2011-09-10 at 11:12 +0200, Thomas Gleixner wrote: 
> > Dear RT Folks,
> > 
> > I'm pleased to announce the 3.0.4-rt13 release.
> > 
> > Changes versus 3.0.2-rt11
> > 
> >   * Migrate disable cure (Mike, Peter)
> 
> The warning triggers.

Seems in_atomic() is not symmetric across the migrate_disable()/migrate_enable() pair.  This does not gripe.

---
 include/linux/sched.h |    3 ---
 kernel/sched.c        |   15 ++-------------
 2 files changed, 2 insertions(+), 16 deletions(-)

Index: linux-3.0-tip/kernel/sched.c
===================================================================
--- linux-3.0-tip.orig/kernel/sched.c
+++ linux-3.0-tip/kernel/sched.c
@@ -6317,16 +6317,10 @@ void migrate_disable(void)
 	struct rq *rq;
 
 	if (in_atomic()) {
-#ifdef CONFIG_SCHED_DEBUG
-		p->migrate_disable_atomic++;
-#endif
+		p->migrate_disable++;
 		return;
 	}
 
-#ifdef CONFIG_SCHED_DEBUG
-	WARN_ON_ONCE(p->migrate_disable_atomic);
-#endif
-
 	preempt_disable();
 	if (p->migrate_disable) {
 		p->migrate_disable++;
@@ -6376,15 +6370,10 @@ void migrate_enable(void)
 	struct rq *rq;
 
 	if (in_atomic()) {
-#ifdef CONFIG_SCHED_DEBUG
-		p->migrate_disable_atomic--;
-#endif
+		p->migrate_disable--;
 		return;
 	}
 
-#ifdef CONFIG_SCHED_DEBUG
-	WARN_ON_ONCE(p->migrate_disable_atomic);
-#endif
 	WARN_ON_ONCE(p->migrate_disable <= 0);
 
 	preempt_disable();
Index: linux-3.0-tip/include/linux/sched.h
===================================================================
--- linux-3.0-tip.orig/include/linux/sched.h
+++ linux-3.0-tip/include/linux/sched.h
@@ -1262,9 +1262,6 @@ struct task_struct {
 	unsigned int policy;
 #ifdef CONFIG_PREEMPT_RT_FULL
 	int migrate_disable;
-#ifdef CONFIG_SCHED_DEBUG
-	int migrate_disable_atomic;
-#endif
 #endif
 	cpumask_t cpus_allowed;
 




* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-10  9:12 [ANNOUNCE] 3.0.4-rt13 Thomas Gleixner
                   ` (2 preceding siblings ...)
  2011-09-11 10:35   ` Mike Galbraith
@ 2011-09-11 18:14 ` Mike Galbraith
  2011-09-12  7:33   ` Thomas Gleixner
  3 siblings, 1 reply; 52+ messages in thread
From: Mike Galbraith @ 2011-09-11 18:14 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, linux-rt-users

I'm very definitely missing sirq threads from the wakeup latency POV.

(Other things are muddying the water, e.g. rcu boost, which, when wired up
and selected, always rams boosted threads through the roof instead of to the
configured boost prio.. etc etc, but this definitely improves my latency
woes a lot)

This is a giant step backward from "let's improve abysmal throughput",
so I'm wondering if anyone has better ideas.

WRT below: "fixes" are dinky, this is not...

sched, rt, sirq: resurrect sirq threads for RT_FULL

Not-signed-off-by: Mike Galbraith <efault@gmx.de>
---
 include/linux/interrupt.h |   46 +++++++
 kernel/irq/Kconfig        |    7 +
 kernel/sched.c            |    4 
 kernel/softirq.c          |  268 ++++++++++++++++++++++++++++------------------
 4 files changed, 219 insertions(+), 106 deletions(-)

Index: linux-3.0-tip/include/linux/interrupt.h
===================================================================
--- linux-3.0-tip.orig/include/linux/interrupt.h
+++ linux-3.0-tip/include/linux/interrupt.h
@@ -423,6 +423,9 @@ enum
 	NR_SOFTIRQS
 };
 
+/* Update when adding new softirqs. */
+#define SOFTIRQ_MASK_ALL 0x3ff
+
 /* map softirq index to softirq name. update 'softirq_to_name' in
  * kernel/softirq.c when adding a new softirq.
  */
@@ -438,10 +441,16 @@ struct softirq_action
 };
 
 #ifndef CONFIG_PREEMPT_RT_FULL
+#define NR_SOFTIRQ_THREADS 1
 asmlinkage void do_softirq(void);
 asmlinkage void __do_softirq(void);
 static inline void thread_do_softirq(void) { do_softirq(); }
 #else
+#ifdef CONFIG_SIRQ_FORCED_THREADING
+#define NR_SOFTIRQ_THREADS NR_SOFTIRQS
+#else
+#define NR_SOFTIRQ_THREADS 1
+#endif
 extern void thread_do_softirq(void);
 #endif
 
@@ -467,12 +476,43 @@ extern void softirq_check_pending_idle(v
  */
 DECLARE_PER_CPU(struct list_head [NR_SOFTIRQS], softirq_work_list);
 
-DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
+struct softirqdata {
+	int			mask;
+	struct task_struct	*tsk;
+};
+
+DECLARE_PER_CPU(struct softirqdata [NR_SOFTIRQ_THREADS], ksoftirqd);
+
+static inline bool this_cpu_ksoftirqd(struct task_struct *p)
+{
+	int i;
+
+	for (i = 0; i < NR_SOFTIRQ_THREADS; i++) {
+		if (p == __get_cpu_var(ksoftirqd)[i].tsk)
+			return true;
+	}
 
-static inline struct task_struct *this_cpu_ksoftirqd(void)
+	return false;
+}
+
+#ifdef CONFIG_PREEMPT_RT_FULL
+static inline int task_sirq_mask(struct task_struct *p)
+{
+	int i;
+
+	for (i = 0; i < NR_SOFTIRQ_THREADS; i++) {
+		if (p == __get_cpu_var(ksoftirqd)[i].tsk)
+			return __get_cpu_var(ksoftirqd)[i].mask;
+	}
+
+	return SOFTIRQ_MASK_ALL;
+}
+#else
+static inline int task_sirq_mask(struct task_struct *p)
 {
-	return this_cpu_read(ksoftirqd);
+	return SOFTIRQ_MASK_ALL;
 }
+#endif
 
 /* Try to send a softirq to a remote cpu.  If this cannot be done, the
  * work will be queued to the local cpu.
Index: linux-3.0-tip/kernel/sched.c
===================================================================
--- linux-3.0-tip.orig/kernel/sched.c
+++ linux-3.0-tip/kernel/sched.c
@@ -2079,7 +2079,7 @@ void account_system_vtime(struct task_st
 	 */
 	if (hardirq_count())
 		__this_cpu_add(cpu_hardirq_time, delta);
-	else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
+	else if (in_serving_softirq() && !this_cpu_ksoftirqd(curr))
 		__this_cpu_add(cpu_softirq_time, delta);
 
 	irq_time_write_end();
@@ -4098,7 +4098,7 @@ static void irqtime_account_process_tick
 		cpustat->irq = cputime64_add(cpustat->irq, tmp);
 	} else if (irqtime_account_si_update()) {
 		cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
-	} else if (this_cpu_ksoftirqd() == p) {
+	} else if (this_cpu_ksoftirqd(p)) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
Index: linux-3.0-tip/kernel/softirq.c
===================================================================
--- linux-3.0-tip.orig/kernel/softirq.c
+++ linux-3.0-tip/kernel/softirq.c
@@ -55,13 +55,31 @@ EXPORT_SYMBOL(irq_stat);
 
 static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
 
-DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
+DEFINE_PER_CPU(struct softirqdata[NR_SOFTIRQ_THREADS], ksoftirqd);
 
 char *softirq_to_name[NR_SOFTIRQS] = {
 	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
 	"TASKLET", "SCHED", "HRTIMER", "RCU"
 };
 
+static const char *softirq_to_thread_name [] =
+{
+#ifdef CONFIG_SIRQ_FORCED_THREADING
+	[HI_SOFTIRQ]		= "sirq-high",
+	[SCHED_SOFTIRQ]		= "sirq-sched",
+	[TIMER_SOFTIRQ]		= "sirq-timer",
+	[NET_TX_SOFTIRQ]	= "sirq-net-tx",
+	[NET_RX_SOFTIRQ]	= "sirq-net-rx",
+	[BLOCK_SOFTIRQ]		= "sirq-block",
+	[BLOCK_IOPOLL_SOFTIRQ]	= "sirq-block-iopoll",
+	[TASKLET_SOFTIRQ]	= "sirq-tasklet",
+	[HRTIMER_SOFTIRQ]	= "sirq-hrtimer",
+	[RCU_SOFTIRQ]		= "sirq-rcu",
+#else
+	[HI_SOFTIRQ]		= "ksoftirqd",
+#endif
+};
+
 #ifdef CONFIG_NO_HZ
 # ifdef CONFIG_PREEMPT_RT_FULL
 /*
@@ -77,32 +95,39 @@ char *softirq_to_name[NR_SOFTIRQS] = {
 void softirq_check_pending_idle(void)
 {
 	static int rate_limit;
-	u32 warnpending = 0, pending = local_softirq_pending();
+	u32 warnpending = 0, pending = local_softirq_pending(), mask;
+	int curr = 0;
 
 	if (rate_limit >= 10)
 		return;
 
-	if (pending) {
-		struct task_struct *tsk;
+	while (pending) {
+		mask =  __get_cpu_var(ksoftirqd)[curr].mask;
 
-		tsk = __get_cpu_var(ksoftirqd);
-		/*
-		 * The wakeup code in rtmutex.c wakes up the task
-		 * _before_ it sets pi_blocked_on to NULL under
-		 * tsk->pi_lock. So we need to check for both: state
-		 * and pi_blocked_on.
-		 */
-		raw_spin_lock(&tsk->pi_lock);
+		if (pending & mask) {
+			struct task_struct *tsk;
+
+			tsk = __get_cpu_var(ksoftirqd)[curr].tsk;
+			/*
+			 * The wakeup code in rtmutex.c wakes up the task
+			 * _before_ it sets pi_blocked_on to NULL under
+			 * tsk->pi_lock. So we need to check for both: state
+			 * and pi_blocked_on.
+			 */
+			raw_spin_lock(&tsk->pi_lock);
 
-		if (!tsk->pi_blocked_on && !(tsk->state == TASK_RUNNING))
-			warnpending = 1;
+			if (!tsk->pi_blocked_on && !(tsk->state == TASK_RUNNING))
+				warnpending |= pending & mask;
 
-		raw_spin_unlock(&tsk->pi_lock);
+			raw_spin_unlock(&tsk->pi_lock);
+			pending &= ~mask;
+		}
+		curr++;
 	}
 
 	if (warnpending) {
 		printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
-		       pending);
+		       warnpending);
 		rate_limit++;
 	}
 }
@@ -131,11 +156,17 @@ void softirq_check_pending_idle(void)
  */
 static void wakeup_softirqd(void)
 {
-	/* Interrupts are disabled: no need to stop preemption */
-	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
+	struct task_struct *tsk;
+	u32 pending = local_softirq_pending(), mask, i;
 
-	if (tsk && tsk->state != TASK_RUNNING)
-		wake_up_process(tsk);
+	for (i = 0; pending && i < NR_SOFTIRQ_THREADS; i++) {
+		mask = __get_cpu_var(ksoftirqd)[i].mask;
+		if (!(pending & mask))
+			continue;
+		tsk = __get_cpu_var(ksoftirqd)[i].tsk;
+		if (tsk && tsk->state != TASK_RUNNING)
+			wake_up_process(tsk);
+	}
 }
 
 static void handle_pending_softirqs(u32 pending, int cpu)
@@ -378,16 +409,19 @@ static inline void ksoftirqd_clr_sched_p
 #else /* !PREEMPT_RT_FULL */
 
 /*
- * On RT we serialize softirq execution with a cpu local lock
+ * On RT we serialize softirq execution with cpu local locks
  */
-static DEFINE_LOCAL_IRQ_LOCK(local_softirq_lock);
-static DEFINE_PER_CPU(struct task_struct *, local_softirq_runner);
+static DEFINE_PER_CPU(struct local_irq_lock, local_softirq_lock[NR_SOFTIRQ_THREADS]);
+static DEFINE_PER_CPU(struct task_struct *, local_softirq_runner[NR_SOFTIRQ_THREADS]);
 
-static void __do_softirq(void);
+static void __do_softirq(u32 mask);
 
 void __init softirq_early_init(void)
 {
-	local_irq_lock_init(local_softirq_lock);
+	int i;
+
+	for (i = 0; i < NR_SOFTIRQ_THREADS; i++)
+		local_irq_lock_init(local_softirq_lock[i]);
 }
 
 void local_bh_disable(void)
@@ -399,20 +433,32 @@ EXPORT_SYMBOL(local_bh_disable);
 
 void local_bh_enable(void)
 {
+	u32 mask = SOFTIRQ_MASK_ALL, i;
+
 	if (WARN_ON(current->softirq_nestcnt == 0))
-		return;
+		goto out;
 
-	if ((current->softirq_nestcnt == 1) &&
-	    local_softirq_pending() &&
-	    local_trylock(local_softirq_lock)) {
+	if (current->softirq_nestcnt != 1)
+		goto out;
+
+	for (i = 0; i < NR_SOFTIRQ_THREADS; i++) {
+		if (NR_SOFTIRQ_THREADS > 1)
+			mask = 1 << i;
+		if (!(local_softirq_pending() & mask))
+			continue;
+		if (!local_trylock(local_softirq_lock[i]))
+			continue;
 
 		local_irq_disable();
-		if (local_softirq_pending())
-			__do_softirq();
-		local_unlock(local_softirq_lock);
+		if (local_softirq_pending() & mask)
+			__do_softirq(local_softirq_pending() & mask);
+		local_unlock(local_softirq_lock[i]);
 		local_irq_enable();
 		WARN_ON(current->softirq_nestcnt != 1);
 	}
+
+out:
+	wakeup_softirqd();
 	current->softirq_nestcnt--;
 	migrate_enable();
 }
@@ -427,17 +473,22 @@ EXPORT_SYMBOL(local_bh_enable_ip);
 /* For tracing */
 int notrace __in_softirq(void)
 {
-	if (__get_cpu_var(local_softirq_lock).owner == current)
-		return __get_cpu_var(local_softirq_lock).nestcnt;
+	int i;
+
+	for (i = 0; i < NR_SOFTIRQ_THREADS; i++) {
+		if (__get_cpu_var(local_softirq_lock)[i].owner == current)
+			return __get_cpu_var(local_softirq_lock)[i].nestcnt;
+	}
 	return 0;
 }
 
 int in_serving_softirq(void)
 {
-	int res;
+	int res = 0, i;
 
 	preempt_disable();
-	res = __get_cpu_var(local_softirq_runner) == current;
+	for (i = 0; i < NR_SOFTIRQ_THREADS && !res; i++)
+		res = __get_cpu_var(local_softirq_runner)[i] == current;
 	preempt_enable();
 	return res;
 }
@@ -446,34 +497,36 @@ int in_serving_softirq(void)
  * Called with bh and local interrupts disabled. For full RT cpu must
  * be pinned.
  */
-static void __do_softirq(void)
+static void __do_softirq(u32 mask)
 {
 	u32 pending = local_softirq_pending();
-	int cpu = smp_processor_id();
+	int cpu = smp_processor_id(), i = 0;
 
 	current->softirq_nestcnt++;
 
-	/* Reset the pending bitmask before enabling irqs */
-	set_softirq_pending(0);
+	/* Reset the pending bit[s] before enabling irqs */
+	set_softirq_pending(pending & ~mask);
 
-	__get_cpu_var(local_softirq_runner) = current;
+	/* If threaded, find which sirq we're processing */
+	while (NR_SOFTIRQ_THREADS > 1 && !(mask & (1 << i)))
+		i++;
 
-	lockdep_softirq_enter();
+	__get_cpu_var(local_softirq_runner)[i] = current;
 
-	handle_pending_softirqs(pending, cpu);
+	lockdep_softirq_enter();
 
-	pending = local_softirq_pending();
-	if (pending)
-		wakeup_softirqd();
+	handle_pending_softirqs(mask, cpu);
 
 	lockdep_softirq_exit();
-	__get_cpu_var(local_softirq_runner) = NULL;
+	__get_cpu_var(local_softirq_runner)[i] = NULL;
 
 	current->softirq_nestcnt--;
 }
 
 static int __thread_do_softirq(int cpu)
 {
+	u32 mask, my_mask, i;
+
 	/*
 	 * Prevent the current cpu from going offline.
 	 * pin_current_cpu() can reenable preemption and block on the
@@ -491,19 +544,27 @@ static int __thread_do_softirq(int cpu)
 		unpin_current_cpu();
 		return -1;
 	}
-	preempt_enable();
-	local_lock(local_softirq_lock);
-	local_irq_disable();
-	/*
-	 * We cannot switch stacks on RT as we want to be able to
-	 * schedule!
-	 */
-	if (local_softirq_pending())
-		__do_softirq();
-	local_unlock(local_softirq_lock);
+
+	mask = my_mask = task_sirq_mask(current);
+
+	for (i = 0; my_mask && i < NR_SOFTIRQ_THREADS; i++) {
+		if (NR_SOFTIRQ_THREADS > 1) {
+			mask = 1 << i;
+			my_mask &= ~mask;
+			if (!(local_softirq_pending() & mask))
+				continue;
+		}
+		preempt_enable();
+		local_lock(local_softirq_lock[i]);
+		local_irq_disable();
+		if (local_softirq_pending() & mask)
+			__do_softirq(local_softirq_pending() & mask);
+		local_unlock(local_softirq_lock[i]);
+		preempt_disable();
+		local_irq_enable();
+	}
 	unpin_current_cpu();
-	preempt_disable();
-	local_irq_enable();
+
 	return 0;
 }
 
@@ -512,11 +573,11 @@ static int __thread_do_softirq(int cpu)
  */
 void thread_do_softirq(void)
 {
-	if (!in_serving_softirq()) {
-		preempt_disable();
+	preempt_disable();
+	if (!in_serving_softirq())
 		__thread_do_softirq(-1);
-		preempt_enable();
-	}
+	wakeup_softirqd();
+	preempt_enable();
 }
 
 static int ksoftirqd_do_softirq(int cpu)
@@ -563,28 +624,15 @@ void irq_enter(void)
 	__irq_enter();
 }
 
-#ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED
 static inline void invoke_softirq(void)
 {
 #ifndef CONFIG_PREEMPT_RT_FULL
 	if (!force_irqthreads)
+#ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED
 		__do_softirq();
-	else {
-		__local_bh_disable((unsigned long)__builtin_return_address(0),
-				SOFTIRQ_OFFSET);
-		wakeup_softirqd();
-		__local_bh_enable(SOFTIRQ_OFFSET);
-	}
-#else
-	wakeup_softirqd();
-#endif
-}
 #else
-static inline void invoke_softirq(void)
-{
-#ifndef CONFIG_PREEMPT_RT_FULL
-	if (!force_irqthreads)
 		do_softirq();
+#endif
 	else {
 		__local_bh_disable((unsigned long)__builtin_return_address(0),
 				SOFTIRQ_OFFSET);
@@ -595,7 +643,6 @@ static inline void invoke_softirq(void)
 	wakeup_softirqd();
 #endif
 }
-#endif
 
 /*
  * Exit an interrupt context. Process softirqs if needed and possible:
@@ -1000,18 +1047,20 @@ void __init softirq_init(void)
 
 static int run_ksoftirqd(void * __bind_cpu)
 {
+	u32 mask = task_sirq_mask(current);
+
 	ksoftirqd_set_sched_params();
 
 	set_current_state(TASK_INTERRUPTIBLE);
 
 	while (!kthread_should_stop()) {
 		preempt_disable();
-		if (!local_softirq_pending())
+		if (!(local_softirq_pending() & mask))
 			schedule_preempt_disabled();
 
 		__set_current_state(TASK_RUNNING);
 
-		while (local_softirq_pending()) {
+		while (local_softirq_pending() & mask) {
 			if (ksoftirqd_do_softirq((long) __bind_cpu))
 				goto wait_to_die;
 			__preempt_enable_no_resched();
@@ -1101,45 +1150,62 @@ static int __cpuinit cpu_callback(struct
 				  unsigned long action,
 				  void *hcpu)
 {
-	int hotcpu = (unsigned long)hcpu;
+	int hotcpu = (unsigned long)hcpu, i;
 	struct task_struct *p;
 
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		p = kthread_create_on_node(run_ksoftirqd,
+		for (i = 0; i < NR_SOFTIRQ_THREADS; i++) {
+			per_cpu(ksoftirqd, hotcpu)[i].mask = SOFTIRQ_MASK_ALL;
+			per_cpu(ksoftirqd, hotcpu)[i].tsk = NULL;
+		}
+		for (i = 0; i < NR_SOFTIRQ_THREADS; i++) {
+			p = kthread_create_on_node(run_ksoftirqd,
 					   hcpu,
 					   cpu_to_node(hotcpu),
-					   "ksoftirqd/%d", hotcpu);
-		if (IS_ERR(p)) {
-			printk("ksoftirqd for %i failed\n", hotcpu);
-			return notifier_from_errno(PTR_ERR(p));
+					   "%s/%d", softirq_to_thread_name[i], hotcpu);
+			if (IS_ERR(p)) {
+				printk(KERN_ERR "%s/%d failed\n",
+					   softirq_to_thread_name[i], hotcpu);
+				return notifier_from_errno(PTR_ERR(p));
+			}
+			kthread_bind(p, hotcpu);
+			per_cpu(ksoftirqd, hotcpu)[i].tsk = p;
+			if (NR_SOFTIRQ_THREADS > 1)
+				per_cpu(ksoftirqd, hotcpu)[i].mask = 1 << i;
 		}
-		kthread_bind(p, hotcpu);
-  		per_cpu(ksoftirqd, hotcpu) = p;
- 		break;
+		break;
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
-		wake_up_process(per_cpu(ksoftirqd, hotcpu));
+		for (i = 0; i < NR_SOFTIRQ_THREADS; i++)
+			wake_up_process(per_cpu(ksoftirqd, hotcpu)[i].tsk);
 		break;
 #ifdef CONFIG_HOTPLUG_CPU
 	case CPU_UP_CANCELED:
-	case CPU_UP_CANCELED_FROZEN:
-		if (!per_cpu(ksoftirqd, hotcpu))
-			break;
-		/* Unbind so it can run.  Fall thru. */
-		kthread_bind(per_cpu(ksoftirqd, hotcpu),
-			     cpumask_any(cpu_online_mask));
+	case CPU_UP_CANCELED_FROZEN: {
+		for (i = 0; i < NR_SOFTIRQ_THREADS; i++) {
+			p = per_cpu(ksoftirqd, hotcpu)[i].tsk;
+			if (!p)
+				continue;
+			/* Unbind so it can run. */
+			kthread_bind(p, cpumask_any(cpu_online_mask));
+		}
+	}
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN: {
 		static const struct sched_param param = {
 			.sched_priority = MAX_RT_PRIO-1
 		};
 
-		p = per_cpu(ksoftirqd, hotcpu);
-		per_cpu(ksoftirqd, hotcpu) = NULL;
-		sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
-		kthread_stop(p);
+		for (i = 0; i < NR_SOFTIRQ_THREADS; i++) {
+			p = per_cpu(ksoftirqd, hotcpu)[i].tsk;
+			per_cpu(ksoftirqd, hotcpu)[i].tsk = NULL;
+			if (!p)
+				continue;
+			sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
+			kthread_stop(p);
+		}
 		takeover_tasklets(hotcpu);
 		break;
 	}
Index: linux-3.0-tip/kernel/irq/Kconfig
===================================================================
--- linux-3.0-tip.orig/kernel/irq/Kconfig
+++ linux-3.0-tip/kernel/irq/Kconfig
@@ -60,6 +60,13 @@ config IRQ_DOMAIN
 config IRQ_FORCED_THREADING
        bool
 
+# Support forced sirq threading
+config SIRQ_FORCED_THREADING
+       bool "Forced Soft IRQ threading"
+       depends on PREEMPT_RT_FULL
+	help
+	  Split ksoftirqd into per SOFTIRQ threads
+
 config SPARSE_IRQ
 	bool "Support sparse irq numbering"
 	depends on HAVE_SPARSE_IRQ




* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-11 17:01   ` Mike Galbraith
@ 2011-09-12  7:24     ` Thomas Gleixner
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2011-09-12  7:24 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, LKML, linux-rt-users

On Sun, 11 Sep 2011, Mike Galbraith wrote:

> On Sun, 2011-09-11 at 12:35 +0200, Mike Galbraith wrote:
> > On Sat, 2011-09-10 at 11:12 +0200, Thomas Gleixner wrote: 
> > > Dear RT Folks,
> > > 
> > > I'm pleased to announce the 3.0.4-rt13 release.
> > > 
> > > Changes versus 3.0.2-rt11
> > > 
> > >   * Migrate disable cure (Mike, Peter)
> > 
> > The warning triggers.
> 
> Seems in_atomic() is not pair inclusive.  This does not gripe.

I'd rather like to figure out why it is not symmetric. Just working
around it makes me nervous. I'll have a look.

Thanks,

	tglx

 


* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-11 18:14 ` Mike Galbraith
@ 2011-09-12  7:33   ` Thomas Gleixner
  2011-09-12  8:05     ` Mike Galbraith
  0 siblings, 1 reply; 52+ messages in thread
From: Thomas Gleixner @ 2011-09-12  7:33 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, linux-rt-users

On Sun, 11 Sep 2011, Mike Galbraith wrote:

> I'm very definitely missing sirq threads from the wakeup latency POV.
> 
> (Other things are muddying the water, eg. rcu boost, if wired up and
> selected always ramming boosted threads through the roof instead of
> configured boost prio.. etc etc, but this definitely improves my latency
> woes a lot)
> 
> This is a giant step backward from "let's improve abysmal throughput",
> so I'm wondering if anyone has better ideas.

One of the problems we have is the signal-based timers (posix-timer,
itimer). We really want to move the penalty for those into the context
of the thread/process to which those timers belong. The trick is to
just note the expiry of a timer and wake up the target, which then has
to deal with the real work in its own context and on its own account.
That's rather simple for thread-bound signals, but has a lot of
implications with process-wide ones. Though it should be doable, and
I'd rather see that solved than hacking around with the split softirqs.
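
In user-space terms the split looks roughly like this (an illustrative
analogue with made-up names, not the proposed kernel design): the timer
side only records the event and wakes the target; everything expensive
runs in the target's own context, on its own account.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int expired;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

/* "Timer" side: cheap.  Note the expiry, wake the target, done. */
static void *timer_side(void *arg)
{
	sleep(1);			/* stand-in for the timer firing */
	pthread_mutex_lock(&lock);
	expired = 1;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Target side: the real work is charged to the target itself. */
static void *target_side(void *arg)
{
	pthread_mutex_lock(&lock);
	while (!expired)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	printf("expiry noted, doing the expensive part ourselves\n");
	return NULL;
}

int main(void)
{
	pthread_t t, w;

	pthread_create(&t, NULL, timer_side, NULL);
	pthread_create(&w, NULL, target_side, NULL);
	pthread_join(t, NULL);
	pthread_join(w, NULL);
	return 0;
}

(Build with gcc -pthread.)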
 
> WRT below: "fixes" are dinky, this is not...
> 
> sched, rt, sirq: resurrect sirq threads for RT_FULL
> 
> Not-signed-off-by: Mike Galbraith <efault@gmx.de>

Not-that-delighted: tglx


* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-12  7:33   ` Thomas Gleixner
@ 2011-09-12  8:05     ` Mike Galbraith
  2011-09-12  8:43       ` Mike Galbraith
  0 siblings, 1 reply; 52+ messages in thread
From: Mike Galbraith @ 2011-09-12  8:05 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, linux-rt-users

On Mon, 2011-09-12 at 09:33 +0200, Thomas Gleixner wrote:
> On Sun, 11 Sep 2011, Mike Galbraith wrote:
> 
> > I'm very definitely missing sirq threads from the wakeup latency POV.
> > 
> > (Other things are muddying the water, eg. rcu boost, if wired up and
> > selected always ramming boosted threads through the roof instead of
> > configured boost prio.. etc etc, but this definitely improves my latency
> > woes a lot)
> > 
> > This is a giant step backward from "let's improve abysmal throughput",
> > so I'm wondering if anyone has better ideas.
> 
> One of the problems we have is the signal-based timers (posix-timer,
> itimer).

That's the biggest part of my jitter troubles.

>  We really want to move the penalty for those into the context
> of the thread/process to which those timers belong. The trick is to
> just note the expiry of a timer and wake up the target, which then
> has to deal with the real work in its own context and on its own
> account. That's rather simple for thread-bound signals, but has a lot
> of implications with process-wide ones. Though it should be doable,
> and I'd rather see that solved than hacking around with the split
> softirqs.

That definitely sounds like a better idea.. for someone who thoroughly
understands signals.
 
> > WRT below: "fixes" are dinky, this is not...
> > 
> > sched, rt, sirq: resurrect sirq threads for RT_FULL
> > 
> > Not-signed-off-by: Mike Galbraith <efault@gmx.de>
> 
> Not-that-delighted: tglx

Ditto.

	-Mike



* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-12  8:05     ` Mike Galbraith
@ 2011-09-12  8:43       ` Mike Galbraith
  0 siblings, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2011-09-12  8:43 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, linux-rt-users

Splitting ksoftirqd was somewhat useful.  Seems somebody is firing off
tasklets heftily; at least a sirq-tasklet thread is always what's current
when the throttle kicks in and saves the day.

(tip-rt13 UV100 box)

[   40.701221] ioatdma 0000:00:16.7: enabling device (0000 -> 0002)
[   40.701239] ioatdma 0000:00:16.7: PCI INT D -> GSI 46 (level, low) -> IRQ 46
[   40.701281] ioatdma 0000:00:16.7: setting latency timer to 64
[   40.701696] ioatdma 0000:00:16.7: irq 128 for MSI/MSI-X
[   40.951009] RT: throttling CPU25
[   40.951013] RT: Danger!  (sirq-tasklet/25) is potential runaway.
[   40.951016] sched: RT throttling activated
[   40.951019] RT: throttling CPU26
[   40.951026] RT: Danger!  (sirq-tasklet/26) is potential runaway.
[   40.960010] RT: throttling CPU13
[   40.960013] RT: Danger!  (sirq-tasklet/13) is potential runaway.
[   41.000014] RT: throttling CPU13
[   41.000014] RT: throttling CPU25
[   41.000014] RT: throttling CPU26
[   41.656925] Adding 1959924k swap on /dev/sda2.  Priority:-1 extents:1 across:1959924k 
[   41.950011] RT: throttling CPU25
[   41.950013] RT: Danger!  (sirq-tasklet/25) is potential runaway.
[   41.950024] RT: throttling CPU30
[   41.950031] RT: Danger!  (sirq-tasklet/30) is potential runaway.
[   41.951010] RT: throttling CPU21
[   41.951014] RT: Danger!  (sirq-tasklet/21) is potential runaway.
[   42.000014] RT: throttling CPU13




* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-11 10:35   ` Mike Galbraith
  (?)
  (?)
@ 2011-09-12  8:59   ` Peter Zijlstra
  2011-09-12  9:05     ` Mike Galbraith
  2011-09-12 13:52     ` Mike Galbraith
  -1 siblings, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-12  8:59 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Thomas Gleixner, LKML, linux-rt-users

On Sun, 2011-09-11 at 12:35 +0200, Mike Galbraith wrote:
> (gdb) list *do_sigtimedwait+0x62
> 0xffffffff8104f3e2 is in do_sigtimedwait (kernel/signal.c:2628).
> 2623             * Invert the set of allowed signals to get those we want to block.
> 2624             */
> 2625            sigdelsetmask(&mask, sigmask(SIGKILL) | sigmask(SIGSTOP));
> 2626            signotset(&mask);
> 2627
> 2628            spin_lock_irq(&tsk->sighand->siglock);
> 2629            sig = dequeue_signal(tsk, &mask, info);
> 2630            if (!sig && timeout) {
> 2631                    /*
> 2632                     * None ready, temporarily unblock those we're interested
> (gdb) list *do_sigtimedwait+0x15f
> 0xffffffff8104f4df is in do_sigtimedwait (kernel/signal.c:2642).
> 2637                    tsk->real_blocked = tsk->blocked;
> 2638                    sigandsets(&tsk->blocked, &tsk->blocked, &mask);
> 2639                    recalc_sigpending();
> 2640                    spin_unlock_irq(&tsk->sighand->siglock);
> 2641
> 2642                    timeout = schedule_timeout_interruptible(timeout);
> 2643
> 2644                    spin_lock_irq(&tsk->sighand->siglock);
> 2645                    __set_task_blocked(tsk, &tsk->real_blocked);
> 2646                    siginitset(&tsk->real_blocked, 0);
> 


Right, so what Thomas says.. Now admittedly I haven't had my morning
juice yet, but staring at that function I can't see why that warning
would trigger at all.

I'm going to try and reproduce, but Thomas is already saying he can't,
so I'm not too confident.

If you can easily trigger this, could you add some trace_printk() to
migrate_disable/enable that prints both counters etc., so we can see wtf
happens?


* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-12  8:59   ` Peter Zijlstra
@ 2011-09-12  9:05     ` Mike Galbraith
  2011-09-12 13:52     ` Mike Galbraith
  1 sibling, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2011-09-12  9:05 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, linux-rt-users

On Mon, 2011-09-12 at 10:59 +0200, Peter Zijlstra wrote:
> On Sun, 2011-09-11 at 12:35 +0200, Mike Galbraith wrote:
> > (gdb) list *do_sigtimedwait+0x62
> > 0xffffffff8104f3e2 is in do_sigtimedwait (kernel/signal.c:2628).
> > 2623             * Invert the set of allowed signals to get those we want to block.
> > 2624             */
> > 2625            sigdelsetmask(&mask, sigmask(SIGKILL) | sigmask(SIGSTOP));
> > 2626            signotset(&mask);
> > 2627
> > 2628            spin_lock_irq(&tsk->sighand->siglock);
> > 2629            sig = dequeue_signal(tsk, &mask, info);
> > 2630            if (!sig && timeout) {
> > 2631                    /*
> > 2632                     * None ready, temporarily unblock those we're interested
> > (gdb) list *do_sigtimedwait+0x15f
> > 0xffffffff8104f4df is in do_sigtimedwait (kernel/signal.c:2642).
> > 2637                    tsk->real_blocked = tsk->blocked;
> > 2638                    sigandsets(&tsk->blocked, &tsk->blocked, &mask);
> > 2639                    recalc_sigpending();
> > 2640                    spin_unlock_irq(&tsk->sighand->siglock);
> > 2641
> > 2642                    timeout = schedule_timeout_interruptible(timeout);
> > 2643
> > 2644                    spin_lock_irq(&tsk->sighand->siglock);
> > 2645                    __set_task_blocked(tsk, &tsk->real_blocked);
> > 2646                    siginitset(&tsk->real_blocked, 0);
> > 
> 
> 
> Right, so what Thomas says.. Now admittedly I haven't had my morning
> juice yet, but staring at that function I can't see why that warning
> would trigger at all.
> 
> I'm going to try and reproduce, but Thomas is already saying he can't,
> so I'm not too confident.
> 
> If you can easily trigger this, could you add some trace_printk() to
> migrate_disable/enable that prints both counters etc., so we can see wtf
> happens?

It was highly repeatable with the rt executive model running on 3 CPUs, will
poke.

	-Mike




* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-11 10:35   ` Mike Galbraith
                     ` (2 preceding siblings ...)
  (?)
@ 2011-09-12 10:04   ` Peter Zijlstra
  2011-09-12 11:33     ` Mike Galbraith
  -1 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-12 10:04 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Thomas Gleixner, LKML, linux-rt-users

On Sun, 2011-09-11 at 12:35 +0200, Mike Galbraith wrote:
> [  134.105291] Pid: 6224, comm: lmsched Not tainted 3.0.4-rt13 #2040

What's this lmsched thing you're running and where does one get it? I
can't seem to make it go bang using kernel compiles ;-)


* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-12 10:04   ` [ANNOUNCE] 3.0.4-rt13 Peter Zijlstra
@ 2011-09-12 11:33     ` Mike Galbraith
  0 siblings, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2011-09-12 11:33 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, linux-rt-users

On Mon, 2011-09-12 at 12:04 +0200, Peter Zijlstra wrote:
> On Sun, 2011-09-11 at 12:35 +0200, Mike Galbraith wrote:
> > [  134.105291] Pid: 6224, comm: lmsched Not tainted 3.0.4-rt13 #2040
> 
> What's this lmsched thing you're running and where does one get it? I
> can't seem to make it go bang using kernel compiles ;-)

It's a simple model of a realtime simulation executive.. and
non-releasable despite containing no rocket science.  It's a
high-priority timer thread synchronizing lower-priority workers via ipc.
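
Shape-wise, something like the sketch below (an illustrative stand-in
with made-up names; the SCHED_FIFO priority setup and the real ipc
plumbing are omitted, and this is not the actual lmsched): a periodic
executive releases workers that block in sigtimedwait().

#define _GNU_SOURCE
#include <pthread.h>
#include <signal.h>
#include <time.h>

#define FRAME_SIG	(SIGRTMIN + 1)
#define NR_WORKERS	3
#define NR_FRAMES	100

static pthread_t workers[NR_WORKERS];

/* Worker: block in sigtimedwait() until released for this frame. */
static void *worker(void *arg)
{
	sigset_t set;
	struct timespec timeout = { .tv_sec = 1, .tv_nsec = 0 };

	sigemptyset(&set);
	sigaddset(&set, FRAME_SIG);
	for (;;) {
		if (sigtimedwait(&set, NULL, &timeout) < 0)
			continue;	/* timed out, wait again */
		/* ... per-frame work would run here ... */
	}
	return NULL;
}

int main(void)
{
	/* 10ms frame; the real thing would also set rt priorities. */
	struct timespec frame = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
	sigset_t set;
	int i, f;

	/* Block FRAME_SIG so workers can collect it synchronously. */
	sigemptyset(&set);
	sigaddset(&set, FRAME_SIG);
	pthread_sigmask(SIG_BLOCK, &set, NULL);

	for (i = 0; i < NR_WORKERS; i++)
		pthread_create(&workers[i], NULL, worker, NULL);

	/* The "executive": release every worker once per frame. */
	for (f = 0; f < NR_FRAMES; f++) {
		clock_nanosleep(CLOCK_MONOTONIC, 0, &frame, NULL);
		for (i = 0; i < NR_WORKERS; i++)
			pthread_kill(workers[i], FRAME_SIG);
	}
	return 0;	/* process exit takes the workers with it */
}

(Build with gcc -pthread; older glibc also needs -lrt for clock_nanosleep.)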

	-Mike



* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-12  8:59   ` Peter Zijlstra
  2011-09-12  9:05     ` Mike Galbraith
@ 2011-09-12 13:52     ` Mike Galbraith
  2011-09-12 14:53       ` Mike Galbraith
  1 sibling, 1 reply; 52+ messages in thread
From: Mike Galbraith @ 2011-09-12 13:52 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, linux-rt-users

On Mon, 2011-09-12 at 10:59 +0200, Peter Zijlstra wrote:
> On Sun, 2011-09-11 at 12:35 +0200, Mike Galbraith wrote:
> > (gdb) list *do_sigtimedwait+0x62
> > 0xffffffff8104f3e2 is in do_sigtimedwait (kernel/signal.c:2628).
> > 2623             * Invert the set of allowed signals to get those we want to block.
> > 2624             */
> > 2625            sigdelsetmask(&mask, sigmask(SIGKILL) | sigmask(SIGSTOP));
> > 2626            signotset(&mask);
> > 2627
> > 2628            spin_lock_irq(&tsk->sighand->siglock);
> > 2629            sig = dequeue_signal(tsk, &mask, info);
> > 2630            if (!sig && timeout) {
> > 2631                    /*
> > 2632                     * None ready, temporarily unblock those we're interested
> > (gdb) list *do_sigtimedwait+0x15f
> > 0xffffffff8104f4df is in do_sigtimedwait (kernel/signal.c:2642).
> > 2637                    tsk->real_blocked = tsk->blocked;
> > 2638                    sigandsets(&tsk->blocked, &tsk->blocked, &mask);
> > 2639                    recalc_sigpending();
> > 2640                    spin_unlock_irq(&tsk->sighand->siglock);
> > 2641
> > 2642                    timeout = schedule_timeout_interruptible(timeout);
> > 2643
> > 2644                    spin_lock_irq(&tsk->sighand->siglock);
> > 2645                    __set_task_blocked(tsk, &tsk->real_blocked);
> > 2646                    siginitset(&tsk->real_blocked, 0);
> > 
> 
> 
> Right, so what Thomas says.. Now admittedly I haven't had my morning
> juice yet, but staring at that function I can't see why that warning
> would trigger at all.
> 
> I'm going to try and reproduce, but Thomas is already saying he can't,
> so I'm not too confident.
> 
> If you can easily trigger this, could you add some trace_printk() to
> migrate_disable/enable that prints both counters etc., so we can see wtf
> happens?

trace_printk as we enter migrate_enable/disable like so for both.

        int in_atomic = in_atomic();

        trace_printk("migrate_disable: in_atomic:%d p->migrate_disable_atomic:%d p->migrate_disable:%d\n",
                                in_atomic, p->migrate_disable_atomic, p->migrate_disable);

        if (in_atomic) {
#ifdef CONFIG_SCHED_DEBUG
                /* remember we were entered from atomic context */
                p->migrate_disable_atomic++;
#endif
                return;
        }

#ifdef CONFIG_SCHED_DEBUG
        /* an atomic-context disable was left unbalanced */
        if (WARN_ON_ONCE(p->migrate_disable_atomic))
                tracing_stop();         /* freeze the trace buffer for inspection */
#endif

We migrate_disable() with in_atomic() == false, but migrate_enable()
with in_atomic() == true.  Burp.

36717            <...>-6266  [002]   242.543129: sys_semop <-system_call_fastpath
36718            <...>-6266  [002]   242.543129: sys_semtimedop <-sys_semop
36719            <...>-6266  [002]   242.543131: ipc_lock_check <-sys_semtimedop
36720            <...>-6266  [002]   242.543131: ipc_lock <-ipc_lock_check
36721            <...>-6266  [002]   242.543132: __rcu_read_lock <-ipc_lock
36722            <...>-6266  [002]   242.543133: migrate_disable <-ipc_lock
36723            <...>-6266  [002]   242.543134: migrate_disable: migrate_disable: in_atomic:0 p->migrate_disable_atomic:0 p->migrate_disable:0
36724            <...>-6266  [002]   242.543134: pin_current_cpu <-migrate_disable
36725            <...>-6266  [002]   242.543134: _raw_spin_lock_irqsave <-migrate_disable
36726            <...>-6266  [002]   242.543135: _raw_spin_unlock_irqrestore <-migrate_disable
36727            <...>-6266  [002]   242.543135: rt_spin_lock <-ipc_lock
36728            <...>-6266  [002]   242.543136: ipcperms <-sys_semtimedop
36729            <...>-6266  [002]   242.543137: ns_capable <-ipcperms
36730            <...>-6266  [002]   242.543138: cap_capable <-ns_capable
36731            <...>-6266  [002]   242.543138: pid_vnr <-sys_semtimedop
36732            <...>-6266  [002]   242.543139: try_atomic_semop <-sys_semtimedop
36733            <...>-6266  [002]   242.543140: do_smart_update <-sys_semtimedop
36734            <...>-6266  [002]   242.543140: update_queue <-do_smart_update
36735            <...>-6266  [002]   242.543141: try_atomic_semop <-update_queue
36736            <...>-6266  [002]   242.543142: update_queue <-do_smart_update
36737            <...>-6266  [002]   242.543142: try_atomic_semop <-update_queue
36738            <...>-6266  [002]   242.543143: update_queue <-do_smart_update
36739            <...>-6266  [002]   242.543143: try_atomic_semop <-update_queue
36740            <...>-6266  [002]   242.543144: get_seconds <-do_smart_update
36741            <...>-6266  [002]   242.543144: rt_spin_unlock <-sys_semtimedop
36742            <...>-6266  [002]   242.543144: migrate_enable <-sys_semtimedop
36743            <...>-6266  [002]   242.543145: migrate_enable: migrate_enable: in_atomic:1 p->migrate_disable_atomic:0 p->migrate_disable:1



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-12 13:52     ` Mike Galbraith
@ 2011-09-12 14:53       ` Mike Galbraith
  2011-09-13 13:36         ` Peter Zijlstra
  2011-09-13 15:08         ` Peter Zijlstra
  0 siblings, 2 replies; 52+ messages in thread
From: Mike Galbraith @ 2011-09-12 14:53 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, linux-rt-users

On Mon, 2011-09-12 at 15:52 +0200, Mike Galbraith wrote:

> 36717            <...>-6266  [002]   242.543129: sys_semop <-system_call_fastpath
> 36718            <...>-6266  [002]   242.543129: sys_semtimedop <-sys_semop
> 36719            <...>-6266  [002]   242.543131: ipc_lock_check <-sys_semtimedop
> 36720            <...>-6266  [002]   242.543131: ipc_lock <-ipc_lock_check
> 36721            <...>-6266  [002]   242.543132: __rcu_read_lock <-ipc_lock
> 36722            <...>-6266  [002]   242.543133: migrate_disable <-ipc_lock
> 36723            <...>-6266  [002]   242.543134: migrate_disable: migrate_disable: in_atomic:0 p->migrate_disable_atomic:0 p->migrate_disable:0
> 36724            <...>-6266  [002]   242.543134: pin_current_cpu <-migrate_disable
> 36725            <...>-6266  [002]   242.543134: _raw_spin_lock_irqsave <-migrate_disable
> 36726            <...>-6266  [002]   242.543135: _raw_spin_unlock_irqrestore <-migrate_disable
> 36727            <...>-6266  [002]   242.543135: rt_spin_lock <-ipc_lock
> 36728            <...>-6266  [002]   242.543136: ipcperms <-sys_semtimedop
> 36729            <...>-6266  [002]   242.543137: ns_capable <-ipcperms
> 36730            <...>-6266  [002]   242.543138: cap_capable <-ns_capable
> 36731            <...>-6266  [002]   242.543138: pid_vnr <-sys_semtimedop
> 36732            <...>-6266  [002]   242.543139: try_atomic_semop <-sys_semtimedop
> 36733            <...>-6266  [002]   242.543140: do_smart_update <-sys_semtimedop
> 36734            <...>-6266  [002]   242.543140: update_queue <-do_smart_update
> 36735            <...>-6266  [002]   242.543141: try_atomic_semop <-update_queue
> 36736            <...>-6266  [002]   242.543142: update_queue <-do_smart_update
> 36737            <...>-6266  [002]   242.543142: try_atomic_semop <-update_queue
> 36738            <...>-6266  [002]   242.543143: update_queue <-do_smart_update
> 36739            <...>-6266  [002]   242.543143: try_atomic_semop <-update_queue
> 36740            <...>-6266  [002]   242.543144: get_seconds <-do_smart_update
> 36741            <...>-6266  [002]   242.543144: rt_spin_unlock <-sys_semtimedop
> 36742            <...>-6266  [002]   242.543144: migrate_enable <-sys_semtimedop
> 36743            <...>-6266  [002]   242.543145: migrate_enable: migrate_enable: in_atomic:1 p->migrate_disable_atomic:0 p->migrate_disable:1

Hm.  Seems this is a home-grown non-preemptive wakeup in the making.
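
That is, roughly (a sketch of the call flow pieced together from the
trace and the sem code, not actual kernel code; the preempt_disable()
is the one in wake_up_sem_queue_prepare()):

	sem_lock(sma);                  /* rt_spin_lock(): migrate_disable(), in_atomic() == 0 */
	...
	wake_up_sem_queue_prepare();    /* preempt_disable(), q->status = IN_WAKEUP */
	sem_unlock(sma);                /* rt_spin_unlock(): migrate_enable() sees in_atomic() == 1 */
	wake_up_sem_queue_do();         /* the wakeups, then preempt_enable() */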

	-Mike



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-12 14:53       ` Mike Galbraith
@ 2011-09-13 13:36         ` Peter Zijlstra
  2011-09-13 15:17           ` Mike Galbraith
  2011-09-13 15:08         ` Peter Zijlstra
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-13 13:36 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Thomas Gleixner, LKML, linux-rt-users

On Mon, 2011-09-12 at 16:53 +0200, Mike Galbraith wrote:
> On Mon, 2011-09-12 at 15:52 +0200, Mike Galbraith wrote:
> 
> > 36717            <...>-6266  [002]   242.543129: sys_semop <-system_call_fastpath
> > 36718            <...>-6266  [002]   242.543129: sys_semtimedop <-sys_semop
> > 36719            <...>-6266  [002]   242.543131: ipc_lock_check <-sys_semtimedop
> > 36720            <...>-6266  [002]   242.543131: ipc_lock <-ipc_lock_check
> > 36721            <...>-6266  [002]   242.543132: __rcu_read_lock <-ipc_lock
> > 36722            <...>-6266  [002]   242.543133: migrate_disable <-ipc_lock
> > 36723            <...>-6266  [002]   242.543134: migrate_disable: migrate_disable: in_atomic:0 p->migrate_disable_atomic:0 p->migrate_disable:0
> > 36724            <...>-6266  [002]   242.543134: pin_current_cpu <-migrate_disable
> > 36725            <...>-6266  [002]   242.543134: _raw_spin_lock_irqsave <-migrate_disable
> > 36726            <...>-6266  [002]   242.543135: _raw_spin_unlock_irqrestore <-migrate_disable
> > 36727            <...>-6266  [002]   242.543135: rt_spin_lock <-ipc_lock
> > 36728            <...>-6266  [002]   242.543136: ipcperms <-sys_semtimedop
> > 36729            <...>-6266  [002]   242.543137: ns_capable <-ipcperms
> > 36730            <...>-6266  [002]   242.543138: cap_capable <-ns_capable
> > 36731            <...>-6266  [002]   242.543138: pid_vnr <-sys_semtimedop
> > 36732            <...>-6266  [002]   242.543139: try_atomic_semop <-sys_semtimedop
> > 36733            <...>-6266  [002]   242.543140: do_smart_update <-sys_semtimedop
> > 36734            <...>-6266  [002]   242.543140: update_queue <-do_smart_update
> > 36735            <...>-6266  [002]   242.543141: try_atomic_semop <-update_queue
> > 36736            <...>-6266  [002]   242.543142: update_queue <-do_smart_update
> > 36737            <...>-6266  [002]   242.543142: try_atomic_semop <-update_queue
> > 36738            <...>-6266  [002]   242.543143: update_queue <-do_smart_update
> > 36739            <...>-6266  [002]   242.543143: try_atomic_semop <-update_queue
> > 36740            <...>-6266  [002]   242.543144: get_seconds <-do_smart_update
> > 36741            <...>-6266  [002]   242.543144: rt_spin_unlock <-sys_semtimedop
> > 36742            <...>-6266  [002]   242.543144: migrate_enable <-sys_semtimedop
> > 36743            <...>-6266  [002]   242.543145: migrate_enable: migrate_enable: in_atomic:1 p->migrate_disable_atomic:0 p->migrate_disable:1
> 
> Hm.  Seems this is a home-grown non-preemptive wakeup in the making.

Does the below cure things? It breaks !rt builds, but we can cure that if it works..

---
 include/linux/sem.h |    2 ++
 ipc/sem.c           |   20 ++++----------------
 2 files changed, 6 insertions(+), 16 deletions(-)

Index: linux-rt/include/linux/sem.h
===================================================================
--- linux-rt.orig/include/linux/sem.h
+++ linux-rt/include/linux/sem.h
@@ -80,6 +80,7 @@ struct  seminfo {
 #include <asm/atomic.h>
 #include <linux/rcupdate.h>
 #include <linux/cache.h>
+#include <linux/wait.h>
 
 struct task_struct;
 
@@ -114,6 +115,7 @@ struct sem_queue {
 	struct sembuf		*sops;	 /* array of pending operations */
 	int			nsops;	 /* number of operations */
 	int			alter;   /* does the operation alter the array? */
+	wait_queue_head_t	wait;
 };
 
 /* Each task has a list of undo requests. They are executed automatically
Index: linux-rt/ipc/sem.c
===================================================================
--- linux-rt.orig/ipc/sem.c
+++ linux-rt/ipc/sem.c
@@ -415,13 +415,6 @@ static int try_atomic_semop (struct sem_
 static void wake_up_sem_queue_prepare(struct list_head *pt,
 				struct sem_queue *q, int error)
 {
-	if (list_empty(pt)) {
-		/*
-		 * Hold preempt off so that we don't get preempted and have the
-		 * wakee busy-wait until we're scheduled back on.
-		 */
-		preempt_disable();
-	}
 	q->status = IN_WAKEUP;
 	q->pid = error;
 
@@ -450,7 +443,7 @@ static void wake_up_sem_queue_do(struct
 		q->status = q->pid;
 	}
 	if (did_something)
-		preempt_enable();
+		wake_up_all(&q->wait);
 }
 
 static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
@@ -1275,15 +1268,9 @@ static struct sem_undo *find_alloc_undo(
  */
 static int get_queue_result(struct sem_queue *q)
 {
-	int error;
-
-	error = q->status;
-	while (unlikely(error == IN_WAKEUP)) {
-		cpu_relax();
-		error = q->status;
-	}
+	wait_event(q->wait, ACCESS_ONCE(q->status) != IN_WAKEUP);
 
-	return error;
+	return q->status;
 }
 
 
@@ -1432,6 +1419,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid,
 
 	queue.status = -EINTR;
 	queue.sleeper = current;
+	init_waitqueue_head(&queue.wait);
 	current->state = TASK_INTERRUPTIBLE;
 	sem_unlock(sma);
 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-12 14:53       ` Mike Galbraith
  2011-09-13 13:36         ` Peter Zijlstra
@ 2011-09-13 15:08         ` Peter Zijlstra
  2011-09-13 15:28           ` Mike Galbraith
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-13 15:08 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Thomas Gleixner, LKML, linux-rt-users

On Tue, 2011-09-13 at 15:36 +0200, Peter Zijlstra wrote:
> Does the below cure things? It breaks !rt builds, but we can cure that
> if it works..

n/m, it's broken like hell..

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-13 13:36         ` Peter Zijlstra
@ 2011-09-13 15:17           ` Mike Galbraith
  0 siblings, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2011-09-13 15:17 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, linux-rt-users

On Tue, 2011-09-13 at 15:36 +0200, Peter Zijlstra wrote:
> On Mon, 2011-09-12 at 16:53 +0200, Mike Galbraith wrote:
> > On Mon, 2011-09-12 at 15:52 +0200, Mike Galbraith wrote:
> > 
> > > 36717            <...>-6266  [002]   242.543129: sys_semop <-system_call_fastpath
> > > 36718            <...>-6266  [002]   242.543129: sys_semtimedop <-sys_semop
> > > 36719            <...>-6266  [002]   242.543131: ipc_lock_check <-sys_semtimedop
> > > 36720            <...>-6266  [002]   242.543131: ipc_lock <-ipc_lock_check
> > > 36721            <...>-6266  [002]   242.543132: __rcu_read_lock <-ipc_lock
> > > 36722            <...>-6266  [002]   242.543133: migrate_disable <-ipc_lock
> > > 36723            <...>-6266  [002]   242.543134: migrate_disable: migrate_disable: in_atomic:0 p->migrate_disable_atomic:0 p->migrate_disable:0
> > > 36724            <...>-6266  [002]   242.543134: pin_current_cpu <-migrate_disable
> > > 36725            <...>-6266  [002]   242.543134: _raw_spin_lock_irqsave <-migrate_disable
> > > 36726            <...>-6266  [002]   242.543135: _raw_spin_unlock_irqrestore <-migrate_disable
> > > 36727            <...>-6266  [002]   242.543135: rt_spin_lock <-ipc_lock
> > > 36728            <...>-6266  [002]   242.543136: ipcperms <-sys_semtimedop
> > > 36729            <...>-6266  [002]   242.543137: ns_capable <-ipcperms
> > > 36730            <...>-6266  [002]   242.543138: cap_capable <-ns_capable
> > > 36731            <...>-6266  [002]   242.543138: pid_vnr <-sys_semtimedop
> > > 36732            <...>-6266  [002]   242.543139: try_atomic_semop <-sys_semtimedop
> > > 36733            <...>-6266  [002]   242.543140: do_smart_update <-sys_semtimedop
> > > 36734            <...>-6266  [002]   242.543140: update_queue <-do_smart_update
> > > 36735            <...>-6266  [002]   242.543141: try_atomic_semop <-update_queue
> > > 36736            <...>-6266  [002]   242.543142: update_queue <-do_smart_update
> > > 36737            <...>-6266  [002]   242.543142: try_atomic_semop <-update_queue
> > > 36738            <...>-6266  [002]   242.543143: update_queue <-do_smart_update
> > > 36739            <...>-6266  [002]   242.543143: try_atomic_semop <-update_queue
> > > 36740            <...>-6266  [002]   242.543144: get_seconds <-do_smart_update
> > > 36741            <...>-6266  [002]   242.543144: rt_spin_unlock <-sys_semtimedop
> > > 36742            <...>-6266  [002]   242.543144: migrate_enable <-sys_semtimedop
> > > 36743            <...>-6266  [002]   242.543145: migrate_enable: migrate_enable: in_atomic:1 p->migrate_disable_atomic:0 p->migrate_disable:1
> > 
> > Hm.  Seems this is a home-grown non-preemptive wakeup in the making.
> 
> Does the below cure things? It breaks !rt builds, but we can cure that if it works..

Warning is gone, ship it ;-)

[  216.115993] BUG: soft lockup - CPU#1 stuck for 23s! [lmsched:6247]
[  216.115996] Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd nfsd lockd nfs_acl auth_rpcgss sunrpc parport_pc parport bridge ipv6 stp cpufreq_conservative cpufreq_ondemand cpufreq_userspace cpufreq_powersave microcode acpi_cpufreq mperf nls_iso8859_1 nls_cp437 vfat fat fuse ext3 jbd dm_mod snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep kvm_intel usb_storage snd_pcm snd_timer kvm sr_mod cdrom usb_libusual uas sg e1000e snd firewire_ohci firewire_core i2c_i801 soundcore snd_page_alloc crc_itu_t button ext4 mbcache jbd2 crc16 usbhid hid uhci_hcd sd_mod ehci_hcd usbcore rtc_cmos ahci libahci libata scsi_mod fan processor thermal
[  216.116000] CPU 1 
[  216.116000] Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd nfsd lockd nfs_acl auth_rpcgss sunrpc parport_pc parport bridge ipv6 stp cpufreq_conservative cpufreq_ondemand cpufreq_userspace cpufreq_powersave microcode acpi_cpufreq mperf nls_iso8859_1 nls_cp437 vfat fat fuse ext3 jbd dm_mod snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep kvm_intel usb_storage snd_pcm snd_timer kvm sr_mod cdrom usb_libusual uas sg e1000e snd firewire_ohci firewire_core i2c_i801 soundcore snd_page_alloc crc_itu_t button ext4 mbcache jbd2 crc16 usbhid hid uhci_hcd sd_mod ehci_hcd usbcore rtc_cmos ahci libahci libata scsi_mod fan processor thermal
[  216.116000] 
[  216.116000] Pid: 6247, comm: lmsched Not tainted 3.0.4-rt13 #2049 MEDIONPC MS-7502/MS-7502
[  216.116000] RIP: 0010:[<ffffffff81359b62>]  [<ffffffff81359b62>] _raw_spin_lock+0x22/0x30
[  216.116000] RSP: 0018:ffff8801f1a97b98  EFLAGS: 00000293
[  216.116000] RAX: 0000000000006900 RBX: ffffffff81036ec0 RCX: 0000000000000001
[  216.116000] RDX: ffff8801ed3d2300 RSI: 0000000000000282 RDI: ffff8801f1a97f70
[  216.116000] RBP: ffff8801f1a97b98 R08: ffff88022fcdf128 R09: 0000000000000040
[  216.116000] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8135b02e
[  216.116000] R13: 0000000000000001 R14: 0000000000000001 R15: ffff8801ed3d2300
[  216.116000] FS:  00007f5989dde720(0000) GS:ffff88022fc80000(0000) knlGS:0000000000000000
[  216.116000] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  216.116000] CR2: 00007f0f32fc2000 CR3: 0000000222212000 CR4: 00000000000006e0
[  216.116000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  216.116000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  216.116000] Process lmsched (pid: 6247, threadinfo ffff8801f1a96000, task ffff8801ed3d2300)
[  216.116000] Stack:
[  216.116000]  ffff8801f1a97c48 ffffffff81359353 ffff88022693a540 ffff88022fcdf040
[  216.116000]  ffff88022693a540 0000000000000005 ffff88022fcdf040 ffff88022693a540
[  216.116000]  ffff8801f1a97bf8 ffffffff81036024 0000000000000001 ffff88022693a540
[  216.116000] Call Trace:
[  216.116000]  [<ffffffff81359353>] rt_spin_lock_slowlock+0x33/0x1b0
[  216.116000]  [<ffffffff81036024>] ? check_preempt_curr+0x84/0xa0
[  216.116000]  [<ffffffff8103d84f>] ? ttwu_do_wakeup+0x5f/0x130
[  216.116000]  [<ffffffff81359836>] rt_spin_lock+0x26/0x30
[  216.116000]  [<ffffffff8103eea9>] __wake_up+0x39/0x70
[  216.116000]  [<ffffffff81174d70>] wake_up_sem_queue_do+0x60/0x70
[  216.116000]  [<ffffffff8117611d>] sys_semtimedop+0x36d/0xb40
[  216.116000]  [<ffffffff8106dfdc>] ? __hrtimer_start_range_ns+0x15c/0x2c0
[  216.116000]  [<ffffffff810964fb>] ? rcu_read_unlock_special+0x1bb/0x1f0
[  216.116000]  [<ffffffff8103db42>] ? migrate_enable+0x192/0x280
[  216.116000]  [<ffffffff8103dd8d>] ? migrate_disable+0x15d/0x1f0
[  216.116000]  [<ffffffff8103db42>] ? migrate_enable+0x192/0x280
[  216.116000]  [<ffffffff8105c0eb>] ? do_sigtimedwait+0xab/0x1d0
[  216.116000]  [<ffffffff8105c299>] ? sys_rt_sigtimedwait+0x89/0xe0
[  216.116000]  [<ffffffff81068005>] ? sys_timer_settime+0x185/0x220
[  216.116000]  [<ffffffff81176a00>] copy_semundo+0xf0/0x100
[  216.116000]  [<ffffffff8135a72b>] system_call_fastpath+0x16/0x1b
[  216.116000] Code: 00 00 00 00 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 65 48 8b 04 25 c8 95 00 00 ff 80 44 e0 ff ff b8 00 01 00 00 f0 66 0f c1 07 
[  216.116000]  e0 74 06 f3 90 8a 07 eb f6 c9 c3 66 90 55 48 89 e5 66 66 66 
[  216.116000] Call Trace:
[  216.116000]  [<ffffffff81359353>] rt_spin_lock_slowlock+0x33/0x1b0
[  216.116000]  [<ffffffff81036024>] ? check_preempt_curr+0x84/0xa0
[  216.116000]  [<ffffffff8103d84f>] ? ttwu_do_wakeup+0x5f/0x130
[  216.116000]  [<ffffffff81359836>] rt_spin_lock+0x26/0x30
[  216.116000]  [<ffffffff8103eea9>] __wake_up+0x39/0x70
[  216.116000]  [<ffffffff81174d70>] wake_up_sem_queue_do+0x60/0x70
[  216.116000]  [<ffffffff8117611d>] sys_semtimedop+0x36d/0xb40
[  216.116000]  [<ffffffff8106dfdc>] ? __hrtimer_start_range_ns+0x15c/0x2c0
[  216.116000]  [<ffffffff810964fb>] ? rcu_read_unlock_special+0x1bb/0x1f0
[  216.116000]  [<ffffffff8103db42>] ? migrate_enable+0x192/0x280
[  216.116000]  [<ffffffff8103dd8d>] ? migrate_disable+0x15d/0x1f0
[  216.116000]  [<ffffffff8103db42>] ? migrate_enable+0x192/0x280
[  216.116000]  [<ffffffff8105c0eb>] ? do_sigtimedwait+0xab/0x1d0
[  216.116000]  [<ffffffff8105c299>] ? sys_rt_sigtimedwait+0x89/0xe0
[  216.116000]  [<ffffffff81068005>] ? sys_timer_settime+0x185/0x220
[  216.116000]  [<ffffffff81176a00>] copy_semundo+0xf0/0x100
[  216.116000]  [<ffffffff8135a72b>] system_call_fastpath+0x16/0x1b
[  216.116000] Kernel panic - not syncing: softlockup: hung tasks
[  216.116000] Pid: 6247, comm: lmsched Not tainted 3.0.4-rt13 #2049
[  216.116000] Call Trace:
[  216.116000]  <IRQ>  [<ffffffff81357080>] panic+0xa0/0x1a8
[  216.116000]  [<ffffffff8108fd63>] watchdog_timer_fn+0x183/0x190
[  216.116000]  [<ffffffff8106d683>] __run_hrtimer+0x73/0x240
[  216.116000]  [<ffffffff8108fbe0>] ? __touch_watchdog+0x30/0x30
[  216.116000]  [<ffffffff8106e2f4>] hrtimer_interrupt+0x174/0x340
[  216.116000]  [<ffffffff8135bb29>] smp_apic_timer_interrupt+0x69/0x99
[  216.116000]  [<ffffffff8135b033>] apic_timer_interrupt+0x13/0x20
[  216.116000]  <EOI>  [<ffffffff81036ec0>] ? update_curr_rt+0x180/0x230
[  216.116000]  [<ffffffff81359b62>] ? _raw_spin_lock+0x22/0x30
[  216.116000]  [<ffffffff81038bf0>] ? enqueue_task_rt+0x120/0x2e0
[  216.116000]  [<ffffffff81359353>] rt_spin_lock_slowlock+0x33/0x1b0
[  216.116000]  [<ffffffff81036024>] ? check_preempt_curr+0x84/0xa0
[  216.116000]  [<ffffffff8103d84f>] ? ttwu_do_wakeup+0x5f/0x130
[  216.116000]  [<ffffffff81359836>] rt_spin_lock+0x26/0x30
[  216.116000]  [<ffffffff8103eea9>] __wake_up+0x39/0x70
[  216.116000]  [<ffffffff81174d70>] wake_up_sem_queue_do+0x60/0x70
[  216.116000]  [<ffffffff8117611d>] sys_semtimedop+0x36d/0xb40
[  216.116000]  [<ffffffff8106dfdc>] ? __hrtimer_start_range_ns+0x15c/0x2c0
[  216.116000]  [<ffffffff810964fb>] ? rcu_read_unlock_special+0x1bb/0x1f0
[  216.116000]  [<ffffffff8103db42>] ? migrate_enable+0x192/0x280
[  216.116000]  [<ffffffff8103dd8d>] ? migrate_disable+0x15d/0x1f0
[  216.116000]  [<ffffffff8103db42>] ? migrate_enable+0x192/0x280
[  216.116000]  [<ffffffff8105c0eb>] ? do_sigtimedwait+0xab/0x1d0
[  216.116000]  [<ffffffff8105c299>] ? sys_rt_sigtimedwait+0x89/0xe0
[  216.116000]  [<ffffffff81068005>] ? sys_timer_settime+0x185/0x220
[  216.116000]  [<ffffffff81176a00>] copy_semundo+0xf0/0x100
[  216.116000]  [<ffffffff8135a72b>] system_call_fastpath+0x16/0x1b



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-13 15:08         ` Peter Zijlstra
@ 2011-09-13 15:28           ` Mike Galbraith
  2011-09-13 16:13             ` Peter Zijlstra
  2011-09-14  9:57             ` [PATCH -rt] ipc/sem: Rework semaphore wakeups Peter Zijlstra
  0 siblings, 2 replies; 52+ messages in thread
From: Mike Galbraith @ 2011-09-13 15:28 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, linux-rt-users

On Tue, 2011-09-13 at 17:08 +0200, Peter Zijlstra wrote:
> On Tue, 2011-09-13 at 15:36 +0200, Peter Zijlstra wrote:
> > Does the below cure things? It breaks !rt builds, but we can cure that
> > if it works..
> 
> n/m, it's broken like hell..

Since this seems to be a legitimate need for a non-preemptive wakeup,
how about we [re]create one to get rid of the abuse?  (OTOH, that would
add a branch everywhere.. been there, done that a couple of times,
always ended up tossing it.)

	-Mike


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [ANNOUNCE] 3.0.4-rt13
  2011-09-13 15:28           ` Mike Galbraith
@ 2011-09-13 16:13             ` Peter Zijlstra
  2011-09-21 10:17               ` rt14: strace -> migrate_disable_atomic imbalance Mike Galbraith
  2011-09-14  9:57             ` [PATCH -rt] ipc/sem: Rework semaphore wakeups Peter Zijlstra
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-13 16:13 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Thomas Gleixner, LKML, linux-rt-users

On Tue, 2011-09-13 at 17:28 +0200, Mike Galbraith wrote:
> On Tue, 2011-09-13 at 17:08 +0200, Peter Zijlstra wrote:
> > On Tue, 2011-09-13 at 15:36 +0200, Peter Zijlstra wrote:
> > > Does the below cure things? It breaks !rt builds, but we can cure that
> > > if it works..
> > 
> > n/m, it's broken like hell..
> 
> Since this seems to be a legitimate need for a non-preemptive wakeup,
> how about we [re]create one to get rid of the abuse?  (OTOH, that would
> add a branch everywhere.. been there, done that a couple of times,
> always ended up tossing it.)

Nah, we absolutely need to kill that preempt_disable/enable stuff in
there; it covers an O(n^2) loop, it's complete crap.

Just need to wrap my head around that code..
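
For the record, the shape of that loop, going by the trace above: a
rough sketch, not the actual code.  do_smart_update() calls
update_queue() for every semaphore the operation touched, and
update_queue() rescans the whole pending list, restarting whenever a
completed operation altered the array:

	do_smart_update():
		for each semaphore touched:
			update_queue():
			restart:
				for each pending sem_queue:
					try_atomic_semop();
					if it succeeded and altered the array:
						wake_up_sem_queue_prepare();
						goto restart;

All of that runs under sem_lock, and once the first wakeup has been
prepared, with preemption off as well.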

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH -rt] ipc/sem: Rework semaphore wakeups
  2011-09-13 15:28           ` Mike Galbraith
  2011-09-13 16:13             ` Peter Zijlstra
@ 2011-09-14  9:57             ` Peter Zijlstra
  2011-09-14 13:02               ` Mike Galbraith
  2011-09-14 18:48               ` Manfred Spraul
  1 sibling, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-14  9:57 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Thomas Gleixner, LKML, linux-rt-users, Manfred Spraul

Subject: ipc/sem: Rework semaphore wakeups
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue Sep 13 15:09:40 CEST 2011

Current sysv sems have a weird ass wakeup scheme that involves keeping
preemption disabled over a potential O(n^2) loop and busy waiting on
that on other CPUs.

Kill this and simply wake the task directly from under the sem_lock.

This was discovered by a migrate_disable() debug feature that
disallows:

  spin_lock();
  preempt_disable();
  spin_unlock()
  preempt_enable();

Cc: Manfred Spraul <manfred@colorfullife.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 ipc/sem.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: linux-rt/ipc/sem.c
===================================================================
--- linux-rt.orig/ipc/sem.c
+++ linux-rt/ipc/sem.c
@@ -415,6 +415,13 @@ static int try_atomic_semop (struct sem_
 static void wake_up_sem_queue_prepare(struct list_head *pt,
 				struct sem_queue *q, int error)
 {
+#ifdef CONFIG_PREEMPT_RT_BASE
+	struct task_struct *p = q->sleeper;
+	get_task_struct(p);
+	q->status = error;
+	wake_up_process(p);
+	put_task_struct(p);
+#else
 	if (list_empty(pt)) {
 		/*
 		 * Hold preempt off so that we don't get preempted and have the
@@ -426,6 +433,7 @@ static void wake_up_sem_queue_prepare(st
 	q->pid = error;
 
 	list_add_tail(&q->simple_list, pt);
+#endif
 }
 
 /**
@@ -439,6 +447,7 @@ static void wake_up_sem_queue_prepare(st
  */
 static void wake_up_sem_queue_do(struct list_head *pt)
 {
+#ifndef CONFIG_PREEMPT_RT_BASE
 	struct sem_queue *q, *t;
 	int did_something;
 
@@ -451,6 +460,7 @@ static void wake_up_sem_queue_do(struct
 	}
 	if (did_something)
 		preempt_enable();
+#endif
 }
 
 static void unlink_queue(struct sem_array *sma, struct sem_queue *q)


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -rt] ipc/sem: Rework semaphore wakeups
  2011-09-14  9:57             ` [PATCH -rt] ipc/sem: Rework semaphore wakeups Peter Zijlstra
@ 2011-09-14 13:02               ` Mike Galbraith
  2011-09-14 18:48               ` Manfred Spraul
  1 sibling, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2011-09-14 13:02 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, linux-rt-users, Manfred Spraul

All better.

On Wed, 2011-09-14 at 11:57 +0200, Peter Zijlstra wrote:
> Subject: ipc/sem: Rework semaphore wakeups
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Tue Sep 13 15:09:40 CEST 2011
> 
> Current sysv sems have a weird ass wakeup scheme that involves keeping
> preemption disabled over a potential O(n^2) loop and busy waiting on
> that on other CPUs.
> 
> Kill this and simply wake the task directly from under the sem_lock.
> 
> This was discovered by a migrate_disable() debug feature that
> disallows:
> 
>   spin_lock();
>   preempt_disable();
>   spin_unlock()
>   preempt_enable();
> 
> Cc: Manfred Spraul <manfred@colorfullife.com>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Reported-by: Mike Galbraith <efault@gmx.de>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  ipc/sem.c |   10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> Index: linux-rt/ipc/sem.c
> ===================================================================
> --- linux-rt.orig/ipc/sem.c
> +++ linux-rt/ipc/sem.c
> @@ -415,6 +415,13 @@ static int try_atomic_semop (struct sem_
>  static void wake_up_sem_queue_prepare(struct list_head *pt,
>  				struct sem_queue *q, int error)
>  {
> +#ifdef CONFIG_PREEMPT_RT_BASE
> +	struct task_struct *p = q->sleeper;
> +	get_task_struct(p);
> +	q->status = error;
> +	wake_up_process(p);
> +	put_task_struct(p);
> +#else
>  	if (list_empty(pt)) {
>  		/*
>  		 * Hold preempt off so that we don't get preempted and have the
> @@ -426,6 +433,7 @@ static void wake_up_sem_queue_prepare(st
>  	q->pid = error;
>  
>  	list_add_tail(&q->simple_list, pt);
> +#endif
>  }
>  
>  /**
> @@ -439,6 +447,7 @@ static void wake_up_sem_queue_prepare(st
>   */
>  static void wake_up_sem_queue_do(struct list_head *pt)
>  {
> +#ifndef CONFIG_PREEMPT_RT_BASE
>  	struct sem_queue *q, *t;
>  	int did_something;
>  
> @@ -451,6 +460,7 @@ static void wake_up_sem_queue_do(struct
>  	}
>  	if (did_something)
>  		preempt_enable();
> +#endif
>  }
>  
>  static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
> 



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -rt] ipc/sem: Rework semaphore wakeups
  2011-09-14  9:57             ` [PATCH -rt] ipc/sem: Rework semaphore wakeups Peter Zijlstra
  2011-09-14 13:02               ` Mike Galbraith
@ 2011-09-14 18:48               ` Manfred Spraul
  2011-09-14 19:23                 ` Peter Zijlstra
  1 sibling, 1 reply; 52+ messages in thread
From: Manfred Spraul @ 2011-09-14 18:48 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, Thomas Gleixner, LKML, linux-rt-users

[-- Attachment #1: Type: text/plain, Size: 1223 bytes --]

On 09/14/2011 11:57 AM, Peter Zijlstra wrote:
> Subject: ipc/sem: Rework semaphore wakeups
> From: Peter Zijlstra<a.p.zijlstra@chello.nl>
> Date: Tue Sep 13 15:09:40 CEST 2011
>
> Current sysv sems have a weird ass wakeup scheme that involves keeping
> preemption disabled over a potential O(n^2) loop and busy waiting on
> that on other CPUs.
Have you checked that the patch improves the latency?
Note that the busy-wait only happens if there is a simultaneous timeout
of a semtimedop() and a true wakeup.

The code does:

     spin_lock()
     preempt_disable();
     usually_very_simple_but_worstcase_O_2
     spin_unlock()
     usually_very_simple_but_worstcase_O_1
     preempt_enable();

with your change, it becomes:

     spin_lock()
     usually_very_simple_but_worstcase_O_2
     usually_very_simple_but_worstcase_O_1
     spin_unlock()

The complex ops remain unchanged, they are still under a lock.

What about removing the preempt_disable?
It's only there to cover a rare race on uniprocessor preempt systems.
(a task is woken up simultaneously due to timeout of semtimedop() and a 
true wakeup)
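
Spelled out, the race is roughly this (a sketch condensing
wake_up_sem_queue_prepare()/_do() into one waker):

	q->status = IN_WAKEUP;		/* prepare, under sem_lock */
	...
	wake_up_process(q->sleeper);	/* do */
	smp_wmb();
	q->status = q->pid;		/* publish the real result */

A sleeper whose semtimedop() timeout fired wakes up on its own, sees
IN_WAKEUP in get_queue_result() and busy-waits.  On a UP preempt kernel
it can preempt the waker between the first and the last store and spin
until the waker is scheduled again (unbounded if the sleeper has the
higher RT priority), which is what the preempt_disable() prevents.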

Then fix that race - something like the attached patch [obviously
buggy - see the FIXME]

--
     Manfred

[-- Attachment #2: patch-rt-sem --]
[-- Type: text/plain, Size: 1204 bytes --]

diff --git a/ipc/sem.c b/ipc/sem.c
index add93d2..96aef6d 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -416,11 +416,13 @@ static void wake_up_sem_queue_prepare(struct list_head *pt,
 				struct sem_queue *q, int error)
 {
 	if (list_empty(pt)) {
+#ifndef CONFIG_PREEMPT_RT_BASE
 		/*
 		 * Hold preempt off so that we don't get preempted and have the
 		 * wakee busy-wait until we're scheduled back on.
 		 */
 		preempt_disable();
+#endif
 	}
 	q->status = IN_WAKEUP;
 	q->pid = error;
@@ -449,8 +451,11 @@ static void wake_up_sem_queue_do(struct list_head *pt)
 		smp_wmb();
 		q->status = q->pid;
 	}
-	if (did_something)
+	if (did_something) {
+#ifndef CONFIG_PREEMPT_RT_BASE
 		preempt_enable();
+#endif
+	}
 }
 
 static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
@@ -1280,6 +1284,13 @@ static int get_queue_result(struct sem_queue *q)
 	error = q->status;
 	while (unlikely(error == IN_WAKEUP)) {
 		cpu_relax();
+#ifdef CONFIG_PREEMPT_RT_BASE
+		/* FIXME: obviously broken if called with the semaphore
+		 * spinlock held; sched_yield() should only be called if
+		 * get_queue_result() is called outside of the semaphore lock.
+		 */
+		sched_yield();
+#endif
 		error = q->status;
 	}
 

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH -rt] ipc/sem: Rework semaphore wakeups
  2011-09-14 18:48               ` Manfred Spraul
@ 2011-09-14 19:23                 ` Peter Zijlstra
  2011-09-15 17:04                   ` Manfred Spraul
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-14 19:23 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Mike Galbraith, Thomas Gleixner, LKML, linux-rt-users

On Wed, 2011-09-14 at 20:48 +0200, Manfred Spraul wrote:
> On 09/14/2011 11:57 AM, Peter Zijlstra wrote:
> > Subject: ipc/sem: Rework semaphore wakeups
> > From: Peter Zijlstra<a.p.zijlstra@chello.nl>
> > Date: Tue Sep 13 15:09:40 CEST 2011
> >
> > Current sysv sems have a weird ass wakeup scheme that involves keeping
> > preemption disabled over a potential O(n^2) loop and busy waiting on
> > that on other CPUs.
> Have you checked that the patch improves the latency?
> Note that  the busy wait only happens if there is a simultaneous timeout 
> of a semtimedop() and a true wakeup.
> 
> The code does:
> 
>      spin_lock()
>      preempt_disable();
>      usually_very_simple_but_worstcase_O_2
>      spin_unlock()
>      usually_very_simple_but_worstcase_O_1
>      preempt_enable();
> 
> with your change, it becomes:
> 
>      spin_lock()
>      usually_very_simple_but_worstcase_O_2
>      usually_very_simple_but_worstcase_O_1
>      spin_unlock()
> 
> The complex ops remain unchanged, they are still under a lock.

preemptible lock (aka pi-mutex) on -rt, so no weird latencies.

> What about removing the preempt_disable?
> It's only there to cover a rare race on uniprocessor preempt systems.
> (a task is woken up simultaneously due to timeout of semtimedop() and a 
> true wakeup)
> 
> Then fix the that race - something like the attached patch [obviously 
> buggy - see the fixme]

sched_yield() is always a bug, as it is here.  It's a live-lock if the
woken task is of higher priority than the waking task.  A higher-prio
FIFO task calling sched_yield() in a loop is just that, a loop, starving
the lower-prio waker.

If you've got enough medium-prio tasks around to occupy all the other
CPUs, you've got indefinite priority inversion, so even on SMP it's a
problem.

But yeah, it's not the prettiest of solutions, but it works.. see that
other patch with the wake-list stuff for something that ought to work
for both -rt and mainline (except of course it doesn't actually work).

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -rt] ipc/sem: Rework semaphore wakeups
  2011-09-14 19:23                 ` Peter Zijlstra
@ 2011-09-15 17:04                   ` Manfred Spraul
  0 siblings, 0 replies; 52+ messages in thread
From: Manfred Spraul @ 2011-09-15 17:04 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, Thomas Gleixner, LKML, linux-rt-users

On 09/14/2011 09:23 PM, Peter Zijlstra wrote:
> On Wed, 2011-09-14 at 20:48 +0200, Manfred Spraul wrote:
>> The code does:
>>
>>       spin_lock()
>>       preempt_disable();
>>       usually_very_simple_but_worstcase_O_2
>>       spin_unlock()
>>       usually_very_simple_but_worstcase_O_1
>>       preempt_enable();
>>
>> with your change, it becomes:
>>
>>       spin_lock()
>>       usually_very_simple_but_worstcase_O_2
>>       usually_very_simple_but_worstcase_O_1
>>       spin_unlock()
>>
>> The complex ops remain unchanged, they are still under a lock.
> preemptible lock (aka pi-mutex) on -rt, so no weird latencies.
But the change means that more operations are under spin_lock().
Actually, for a large SMP system with a simple semaphore operation, the
wake_up_process() takes longer than the semaphore operation itself.
And for some databases, contention on the spin_lock() is an issue.


>> What about removing the preempt_disable?
>> It's only there to cover a rare race on uniprocessor preempt systems.
>> (a task is woken up simultaneously due to timeout of semtimedop() and a
>> true wakeup)
>>
>> Then fix the that race - something like the attached patch [obviously
>> buggy - see the fixme]
> sched_yield() is always a bug, as is it here. Its an life-lock if the
> woken task is of higher priority than the waking task. A higher prio
> FIFO task calling sched_yield() in a loop is just that, a loop, starving
> the lower prio waker.
>
> If you've got enough medium prio tasks around to occupy all other cpus,
> you're got indefinite priority inversion, so even on smp its a problem.
>
> But yeah its not the prettiest of solutions but it works.. see that
> other patch with the wake-list stuff for something that ought to work
> for both rt and mainline (except of course it doesn't actually work).
Wake lists are definitely the better approach.
[let's continue in that thread]

--
     Manfred

^ permalink raw reply	[flat|nested] 52+ messages in thread

* rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-13 16:13             ` Peter Zijlstra
@ 2011-09-21 10:17               ` Mike Galbraith
  2011-09-21 17:01                 ` Peter Zijlstra
                                   ` (5 more replies)
  0 siblings, 6 replies; 52+ messages in thread
From: Mike Galbraith @ 2011-09-21 10:17 UTC (permalink / raw)
  To: linux-rt-users; +Cc: Thomas Gleixner, Peter Zijlstra, LKML


[  144.212272] ------------[ cut here ]------------
[  144.212280] WARNING: at kernel/sched.c:6152 migrate_disable+0x1b6/0x200()
[  144.212282] Hardware name: MS-7502
[  144.212283] Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd nfsd lockd parport_pc parport nfs_acl auth_rpcgss sunrpc bridge ipv6 stp cpufreq_conservative microcode cpufreq_ondemand cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf nls_iso8859_1 nls_cp437 vfat fat fuse ext3 jbd dm_mod usbmouse usb_storage usbhid snd_hda_codec_realtek usb_libusual uas sr_mod cdrom hid snd_hda_intel e1000e snd_hda_codec kvm_intel snd_hwdep sg snd_pcm kvm i2c_i801 snd_timer snd firewire_ohci firewire_core soundcore snd_page_alloc crc_itu_t button ext4 mbcache jbd2 crc16 uhci_hcd sd_mod ehci_hcd usbcore rtc_cmos ahci libahci libata scsi_mod fan processor thermal
[  144.212317] Pid: 6215, comm: strace Not tainted 3.0.4-rt14 #2052
[  144.212319] Call Trace:
[  144.212323]  [<ffffffff8104662f>] warn_slowpath_common+0x7f/0xc0
[  144.212326]  [<ffffffff8104668a>] warn_slowpath_null+0x1a/0x20
[  144.212328]  [<ffffffff8103f606>] migrate_disable+0x1b6/0x200
[  144.212331]  [<ffffffff8105a2a8>] ptrace_stop+0x128/0x240
[  144.212334]  [<ffffffff81057b9b>] ? recalc_sigpending+0x1b/0x50
[  144.212337]  [<ffffffff8105b6f1>] get_signal_to_deliver+0x211/0x530
[  144.212340]  [<ffffffff81001835>] do_signal+0x75/0x7a0
[  144.212342]  [<ffffffff8105ae68>] ? kill_pid_info+0x58/0x80
[  144.212344]  [<ffffffff8105c34c>] ? sys_kill+0xac/0x1e0
[  144.212347]  [<ffffffff81001fe5>] do_notify_resume+0x65/0x80
[  144.212350]  [<ffffffff8135978b>] int_signal+0x12/0x17
[  144.212352] ---[ end trace 0000000000000002 ]---
[  144.212354] ------------[ cut here ]------------
[  144.212356] WARNING: at kernel/sched.c:6211 migrate_enable+0x1f8/0x270()
[  144.212357] Hardware name: MS-7502
[  144.212358] Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd nfsd lockd parport_pc parport nfs_acl auth_rpcgss sunrpc bridge ipv6 stp cpufreq_conservative microcode cpufreq_ondemand cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf nls_iso8859_1 nls_cp437 vfat fat fuse ext3 jbd dm_mod usbmouse usb_storage usbhid snd_hda_codec_realtek usb_libusual uas sr_mod cdrom hid snd_hda_intel e1000e snd_hda_codec kvm_intel snd_hwdep sg snd_pcm kvm i2c_i801 snd_timer snd firewire_ohci firewire_core soundcore snd_page_alloc crc_itu_t button ext4 mbcache jbd2 crc16 uhci_hcd sd_mod ehci_hcd usbcore rtc_cmos ahci libahci libata scsi_mod fan processor thermal
[  144.212381] Pid: 6215, comm: strace Tainted: G        W   3.0.4-rt14 #2052
[  144.212382] Call Trace:
[  144.212384]  [<ffffffff8104662f>] warn_slowpath_common+0x7f/0xc0
[  144.212387]  [<ffffffff8104668a>] warn_slowpath_null+0x1a/0x20
[  144.212389]  [<ffffffff8103f3d8>] migrate_enable+0x1f8/0x270
[  144.212391]  [<ffffffff8105b856>] get_signal_to_deliver+0x376/0x530
[  144.212394]  [<ffffffff81001835>] do_signal+0x75/0x7a0
[  144.212396]  [<ffffffff8105ae68>] ? kill_pid_info+0x58/0x80
[  144.212398]  [<ffffffff8105c34c>] ? sys_kill+0xac/0x1e0
[  144.212401]  [<ffffffff81001fe5>] do_notify_resume+0x65/0x80
[  144.212403]  [<ffffffff8135978b>] int_signal+0x12/0x17
[  144.212405] ---[ end trace 0000000000000003 ]---



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-21 10:17               ` rt14: strace -> migrate_disable_atomic imbalance Mike Galbraith
@ 2011-09-21 17:01                 ` Peter Zijlstra
  2011-09-21 18:50                   ` Peter Zijlstra
                                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-21 17:01 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov, Miklos Szeredi

On Wed, 2011-09-21 at 12:17 +0200, Mike Galbraith wrote:
> [  144.212272] ------------[ cut here ]------------
> [  144.212280] WARNING: at kernel/sched.c:6152 migrate_disable+0x1b6/0x200()
> [  144.212282] Hardware name: MS-7502
> [  144.212283] Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd nfsd lockd parport_pc parport nfs_acl auth_rpcgss sunrpc bridge ipv6 stp cpufreq_conservative microcode cpufreq_ondemand cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf nls_iso8859_1 nls_cp437 vfat fat fuse ext3 jbd dm_mod usbmouse usb_storage usbhid snd_hda_codec_realtek usb_libusual uas sr_mod cdrom hid snd_hda_intel e1000e snd_hda_codec kvm_intel snd_hwdep sg snd_pcm kvm i2c_i801 snd_timer snd firewire_ohci firewire_core soundcore snd_page_alloc crc_itu_t button ext4 mbcache jbd2 crc16 uhci_hcd sd_mod ehci_hcd usbcore rtc_cmos ahci libahci libata scsi_mod fan processor thermal
> [  144.212317] Pid: 6215, comm: strace Not tainted 3.0.4-rt14 #2052
> [  144.212319] Call Trace:
> [  144.212323]  [<ffffffff8104662f>] warn_slowpath_common+0x7f/0xc0
> [  144.212326]  [<ffffffff8104668a>] warn_slowpath_null+0x1a/0x20
> [  144.212328]  [<ffffffff8103f606>] migrate_disable+0x1b6/0x200
> [  144.212331]  [<ffffffff8105a2a8>] ptrace_stop+0x128/0x240
> [  144.212334]  [<ffffffff81057b9b>] ? recalc_sigpending+0x1b/0x50
> [  144.212337]  [<ffffffff8105b6f1>] get_signal_to_deliver+0x211/0x530
> [  144.212340]  [<ffffffff81001835>] do_signal+0x75/0x7a0
> [  144.212342]  [<ffffffff8105ae68>] ? kill_pid_info+0x58/0x80
> [  144.212344]  [<ffffffff8105c34c>] ? sys_kill+0xac/0x1e0
> [  144.212347]  [<ffffffff81001fe5>] do_notify_resume+0x65/0x80
> [  144.212350]  [<ffffffff8135978b>] int_signal+0x12/0x17
> [  144.212352] ---[ end trace 0000000000000002 ]---


Right, that's because of 
53da1d9456fe7f87a920a78fdbdcf1225d197cb7, I think we simply want a full
revert of that for -rt.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-21 10:17               ` rt14: strace -> migrate_disable_atomic imbalance Mike Galbraith
@ 2011-09-21 18:50                   ` Peter Zijlstra
                                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-21 18:50 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Wed, 2011-09-21 at 19:01 +0200, Peter Zijlstra wrote:
> On Wed, 2011-09-21 at 12:17 +0200, Mike Galbraith wrote:
> > [  144.212272] ------------[ cut here ]------------
> > [  144.212280] WARNING: at kernel/sched.c:6152 migrate_disable+0x1b6/0x200()
> > [  144.212282] Hardware name: MS-7502
> > [  144.212283] Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd nfsd lockd parport_pc parport nfs_acl auth_rpcgss sunrpc bridge ipv6 stp cpufreq_conservative microcode cpufreq_ondemand cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf nls_iso8859_1 nls_cp437 vfat fat fuse ext3 jbd dm_mod usbmouse usb_storage usbhid snd_hda_codec_realtek usb_libusual uas sr_mod cdrom hid snd_hda_intel e1000e snd_hda_codec kvm_intel snd_hwdep sg snd_pcm kvm i2c_i801 snd_timer snd firewire_ohci firewire_core soundcore snd_page_alloc crc_itu_t button ext4 mbcache jbd2 crc16 uhci_hcd sd_mod ehci_hcd usbcore rtc_cmos ahci libahci libata scsi_mod fan processor thermal
> > [  144.212317] Pid: 6215, comm: strace Not tainted 3.0.4-rt14 #2052
> > [  144.212319] Call Trace:
> > [  144.212323]  [<ffffffff8104662f>] warn_slowpath_common+0x7f/0xc0
> > [  144.212326]  [<ffffffff8104668a>] warn_slowpath_null+0x1a/0x20
> > [  144.212328]  [<ffffffff8103f606>] migrate_disable+0x1b6/0x200
> > [  144.212331]  [<ffffffff8105a2a8>] ptrace_stop+0x128/0x240
> > [  144.212334]  [<ffffffff81057b9b>] ? recalc_sigpending+0x1b/0x50
> > [  144.212337]  [<ffffffff8105b6f1>] get_signal_to_deliver+0x211/0x530
> > [  144.212340]  [<ffffffff81001835>] do_signal+0x75/0x7a0
> > [  144.212342]  [<ffffffff8105ae68>] ? kill_pid_info+0x58/0x80
> > [  144.212344]  [<ffffffff8105c34c>] ? sys_kill+0xac/0x1e0
> > [  144.212347]  [<ffffffff81001fe5>] do_notify_resume+0x65/0x80
> > [  144.212350]  [<ffffffff8135978b>] int_signal+0x12/0x17
> > [  144.212352] ---[ end trace 0000000000000002 ]---
> 
> 
> Right, that's because of 
> 53da1d9456fe7f87a920a78fdbdcf1225d197cb7, I think we simply want a full
> revert of that for -rt.

This also made me stare at the trainwreck called wait_task_inactive().
How about something like the below?  It survives a boot and a simple
strace.

I'm not particularly keen on always enabling preempt notifiers, but
seeing that pretty much world+dog already has them enabled...

Also, less LOC is always better, right ;-)
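
For reference, the notifier interface this reuses, as a minimal sketch
of a user (the my_* names are placeholders):

static void my_sched_in(struct preempt_notifier *n, int cpu)
{
	/* current was just scheduled back in on @cpu */
}

static void my_sched_out(struct preempt_notifier *n, struct task_struct *next)
{
	/* current is being scheduled out in favour of @next */
}

static struct preempt_ops my_preempt_ops = {
	.sched_in	= my_sched_in,
	.sched_out	= my_sched_out,
};

static void example(void)
{
	struct preempt_notifier notifier;

	preempt_notifier_init(&notifier, &my_preempt_ops);
	preempt_notifier_register(&notifier);	/* hooks current's context switches */
	/* ... every switch of current now fires the ops ... */
	preempt_notifier_unregister(&notifier);
}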

---
 arch/ia64/kvm/Kconfig    |    1 -
 arch/powerpc/kvm/Kconfig |    1 -
 arch/s390/kvm/Kconfig    |    1 -
 arch/tile/kvm/Kconfig    |    1 -
 arch/x86/kvm/Kconfig     |    1 -
 include/linux/kvm_host.h |    2 -
 include/linux/preempt.h  |    4 -
 include/linux/sched.h    |    2 -
 init/Kconfig             |    3 -
 kernel/sched.c           |  163 ++++++++++++++++++----------------------------
 10 files changed, 64 insertions(+), 115 deletions(-)

diff --git a/arch/ia64/kvm/Kconfig b/arch/ia64/kvm/Kconfig
index 9806e55..02b36ca 100644
--- a/arch/ia64/kvm/Kconfig
+++ b/arch/ia64/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
 	depends on HAVE_KVM && MODULES && EXPERIMENTAL
 	# for device assignment:
 	depends on PCI
-	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
 	select KVM_APIC_ARCHITECTURE
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 78133de..0bcd5a8 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -18,7 +18,6 @@ if VIRTUALIZATION
 
 config KVM
 	bool
-	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 
 config KVM_BOOK3S_HANDLER
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index a216341..7ff8d54 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -19,7 +19,6 @@ config KVM
 	def_tristate y
 	prompt "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM && EXPERIMENTAL
-	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 	---help---
 	  Support hosting paravirtualized guest machines using the SIE
diff --git a/arch/tile/kvm/Kconfig b/arch/tile/kvm/Kconfig
index 669fcdb..6a936d1 100644
--- a/arch/tile/kvm/Kconfig
+++ b/arch/tile/kvm/Kconfig
@@ -19,7 +19,6 @@ if VIRTUALIZATION
 config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM && MODULES && EXPERIMENTAL
-	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 	---help---
 	  Support hosting paravirtualized guest machines.
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ff5790d..d82150a 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -24,7 +24,6 @@ config KVM
 	depends on PCI
 	# for TASKSTATS/TASK_DELAY_ACCT:
 	depends on NET
-	select PREEMPT_NOTIFIERS
 	select MMU_NOTIFIER
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index eabb21a..a9343b8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -111,9 +111,7 @@ enum {
 
 struct kvm_vcpu {
 	struct kvm *kvm;
-#ifdef CONFIG_PREEMPT_NOTIFIERS
 	struct preempt_notifier preempt_notifier;
-#endif
 	int cpu;
 	int vcpu_id;
 	int srcu_idx;
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 58969b2..7ca8968 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -101,8 +101,6 @@ do { \
 
 #endif /* CONFIG_PREEMPT_COUNT */
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-
 struct preempt_notifier;
 
 /**
@@ -147,6 +145,4 @@ static inline void preempt_notifier_init(struct preempt_notifier *notifier,
 	notifier->ops = ops;
 }
 
-#endif
-
 #endif /* __LINUX_PREEMPT_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e54c890..64fc7c7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1236,10 +1236,8 @@ struct task_struct {
 	struct sched_entity se;
 	struct sched_rt_entity rt;
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
 	struct hlist_head preempt_notifiers;
-#endif
 
 	/*
 	 * fpu_counter contains the number of consecutive context switches
diff --git a/init/Kconfig b/init/Kconfig
index d19b3a7..c1c411c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1403,9 +1403,6 @@ config STOP_MACHINE
 
 source "block/Kconfig"
 
-config PREEMPT_NOTIFIERS
-	bool
-
 config PADATA
 	depends on SMP
 	bool
diff --git a/kernel/sched.c b/kernel/sched.c
index db143fd..b38ab2e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2387,6 +2387,38 @@ struct migration_arg {
 
 static int migration_cpu_stop(void *data);
 
+struct wait_task_inactive_blocked {
+	struct preempt_notifier notifier;
+	struct task_struct *waiter;
+};
+
+static void wait_task_inactive_sched_in(struct preempt_notifier *n, int cpu)
+{
+	/* Dummy, could be called when preempted before sleeping */
+}
+
+static void wait_task_inactive_sched_out(struct preempt_notifier *n,
+		struct task_struct *next)
+{
+	struct task_struct *p;
+	struct wait_task_inactive_blocked *blocked = 
+		container_of(n, struct wait_task_inactive_blocked, notifier);
+
+	if (current->on_rq) /* we're not inactive yet */
+		return;
+
+	hlist_del(&n->link);
+
+	p = ACCESS_ONCE(blocked->waiter);
+	blocked->waiter = NULL;
+	wake_up_process(p);
+}
+
+static struct preempt_ops wait_task_inactive_ops = {
+	.sched_in = wait_task_inactive_sched_in,
+	.sched_out = wait_task_inactive_sched_out,
+};
+
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
@@ -2405,93 +2437,45 @@ static int migration_cpu_stop(void *data);
  */
 unsigned long wait_task_inactive(struct task_struct *p, long match_state)
 {
+	unsigned long ncsw = 0;
 	unsigned long flags;
-	int running, on_rq;
-	unsigned long ncsw;
 	struct rq *rq;
 
-	for (;;) {
-		/*
-		 * We do the initial early heuristics without holding
-		 * any task-queue locks at all. We'll only try to get
-		 * the runqueue lock when things look like they will
-		 * work out!
-		 */
-		rq = task_rq(p);
-
-		/*
-		 * If the task is actively running on another CPU
-		 * still, just relax and busy-wait without holding
-		 * any locks.
-		 *
-		 * NOTE! Since we don't hold any locks, it's not
-		 * even sure that "rq" stays as the right runqueue!
-		 * But we don't care, since "task_running()" will
-		 * return false if the runqueue has changed and p
-		 * is actually now running somewhere else!
-		 */
-		while (task_running(rq, p)) {
-			if (match_state && unlikely(p->state != match_state))
-				return 0;
-			cpu_relax();
-		}
-
-		/*
-		 * Ok, time to look more closely! We need the rq
-		 * lock now, to be *sure*. If we're wrong, we'll
-		 * just go back and repeat.
-		 */
-		rq = task_rq_lock(p, &flags);
-		trace_sched_wait_task(p);
-		running = task_running(rq, p);
-		on_rq = p->on_rq;
-		ncsw = 0;
-		if (!match_state || p->state == match_state)
-			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
-		task_rq_unlock(rq, p, &flags);
-
-		/*
-		 * If it changed from the expected state, bail out now.
-		 */
-		if (unlikely(!ncsw))
-			break;
+	struct wait_task_inactive_blocked blocked = {
+		.notifier = {
+			.ops = &wait_task_inactive_ops,
+		},
+		.waiter = current,
+	};
 
-		/*
-		 * Was it really running after all now that we
-		 * checked with the proper locks actually held?
-		 *
-		 * Oops. Go back and try again..
-		 */
-		if (unlikely(running)) {
-			cpu_relax();
-			continue;
-		}
+	rq = task_rq_lock(p, &flags);
+	if (!task_running(rq, p))
+		goto done;
 
-		/*
-		 * It's not enough that it's not actively running,
-		 * it must be off the runqueue _entirely_, and not
-		 * preempted!
-		 *
-		 * So if it was still runnable (but just not actively
-		 * running right now), it's preempted, and we should
-		 * yield - it could be a while.
-		 */
-		if (unlikely(on_rq)) {
-			ktime_t to = ktime_set(0, NSEC_PER_SEC/HZ);
+	if (match_state && unlikely(p->state != match_state))
+		goto unlock;
 
-			set_current_state(TASK_UNINTERRUPTIBLE);
-			schedule_hrtimeout(&to, HRTIMER_MODE_REL);
-			continue;
-		}
+	hlist_add_head(&blocked.notifier.link, &p->preempt_notifiers);
+	task_rq_unlock(rq, p, &flags);
 
-		/*
-		 * Ahh, all good. It wasn't running, and it wasn't
-		 * runnable, which means that it will never become
-		 * running in the future either. We're all done!
-		 */
-		break;
+	for (;;) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (!blocked.waiter)
+			break;
+		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 
+	/*
+	 * Serializes against the completion of the previously observed context
+	 * switch.
+	 */
+	rq = task_rq_lock(p, &flags);
+done:
+	if (!match_state || p->state == match_state)
+		ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
+unlock:
+	task_rq_unlock(rq, p, &flags);
 	return ncsw;
 }
 
@@ -2967,10 +2951,7 @@ static void __sched_fork(struct task_struct *p)
 #endif
 
 	INIT_LIST_HEAD(&p->rt.run_list);
-
-#ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
-#endif
 }
 
 /*
@@ -3084,8 +3065,6 @@ void wake_up_new_task(struct task_struct *p)
 	task_rq_unlock(rq, p, &flags);
 }
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-
 /**
  * preempt_notifier_register - tell me when current is being preempted & rescheduled
  * @notifier: notifier struct to register
@@ -3122,26 +3101,12 @@ fire_sched_out_preempt_notifiers(struct task_struct *curr,
 				 struct task_struct *next)
 {
 	struct preempt_notifier *notifier;
-	struct hlist_node *node;
+	struct hlist_node *node, *n;
 
-	hlist_for_each_entry(notifier, node, &curr->preempt_notifiers, link)
+	hlist_for_each_entry_safe(notifier, node, n, &curr->preempt_notifiers, link)
 		notifier->ops->sched_out(notifier, next);
 }
 
-#else /* !CONFIG_PREEMPT_NOTIFIERS */
-
-static void fire_sched_in_preempt_notifiers(struct task_struct *curr)
-{
-}
-
-static void
-fire_sched_out_preempt_notifiers(struct task_struct *curr,
-				 struct task_struct *next)
-{
-}
-
-#endif /* CONFIG_PREEMPT_NOTIFIERS */
-
 /**
  * prepare_task_switch - prepare to switch tasks
  * @rq: the runqueue preparing to switch


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-21 18:50                   ` Peter Zijlstra
@ 2011-09-22  4:46                   ` Mike Galbraith
  2011-09-22  6:31                     ` Peter Zijlstra
  -1 siblings, 1 reply; 52+ messages in thread
From: Mike Galbraith @ 2011-09-22  4:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Wed, 2011-09-21 at 20:50 +0200, Peter Zijlstra wrote:
> On Wed, 2011-09-21 at 19:01 +0200, Peter Zijlstra wrote:
> > On Wed, 2011-09-21 at 12:17 +0200, Mike Galbraith wrote:
> > > [  144.212272] ------------[ cut here ]------------
> > > [  144.212280] WARNING: at kernel/sched.c:6152 migrate_disable+0x1b6/0x200()
> > > [  144.212282] Hardware name: MS-7502
> > > [  144.212283] Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd nfsd lockd parport_pc parport nfs_acl auth_rpcgss sunrpc bridge ipv6 stp cpufreq_conservative microcode cpufreq_ondemand cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf nls_iso8859_1 nls_cp437 vfat fat fuse ext3 jbd dm_mod usbmouse usb_storage usbhid snd_hda_codec_realtek usb_libusual uas sr_mod cdrom hid snd_hda_intel e1000e snd_hda_codec kvm_intel snd_hwdep sg snd_pcm kvm i2c_i801 snd_timer snd firewire_ohci firewire_core soundcore snd_page_alloc crc_itu_t button ext4 mbcache jbd2 crc16 uhci_hcd sd_mod ehci_hcd usbcore rtc_cmos ahci libahci libata scsi_mod fan processor thermal
> > > [  144.212317] Pid: 6215, comm: strace Not tainted 3.0.4-rt14 #2052
> > > [  144.212319] Call Trace:
> > > [  144.212323]  [<ffffffff8104662f>] warn_slowpath_common+0x7f/0xc0
> > > [  144.212326]  [<ffffffff8104668a>] warn_slowpath_null+0x1a/0x20
> > > [  144.212328]  [<ffffffff8103f606>] migrate_disable+0x1b6/0x200
> > > [  144.212331]  [<ffffffff8105a2a8>] ptrace_stop+0x128/0x240
> > > [  144.212334]  [<ffffffff81057b9b>] ? recalc_sigpending+0x1b/0x50
> > > [  144.212337]  [<ffffffff8105b6f1>] get_signal_to_deliver+0x211/0x530
> > > [  144.212340]  [<ffffffff81001835>] do_signal+0x75/0x7a0
> > > [  144.212342]  [<ffffffff8105ae68>] ? kill_pid_info+0x58/0x80
> > > [  144.212344]  [<ffffffff8105c34c>] ? sys_kill+0xac/0x1e0
> > > [  144.212347]  [<ffffffff81001fe5>] do_notify_resume+0x65/0x80
> > > [  144.212350]  [<ffffffff8135978b>] int_signal+0x12/0x17
> > > [  144.212352] ---[ end trace 0000000000000002 ]---
> > 
> > 
> > Right, that's because of 
> > 53da1d9456fe7f87a920a78fdbdcf1225d197cb7, I think we simply want a full
> > revert of that for -rt.
> 
> This also made me stare at the trainwreck called wait_task_inactive(),
> how about something like the below, it survives a boot and simple
> strace.

There's a missing hunklet, but...

@@ -8325,9 +8290,7 @@ void __init sched_init(void)
 
 	set_load_weight(&init_task);
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&init_task.preempt_notifiers);
-#endif
 
 #ifdef CONFIG_SMP
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);

..perturbation (100% userspace hog) measurement proggy and jitter
measurement proggy pinned to the same cpu make 100% repeatable boom.

Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
Pid: 6226, comm: pert Not tainted 3.0.4-rt14 #2053
Call Trace:
 <NMI>  [<ffffffff81355f00>] panic+0xa0/0x1a8
 [<ffffffff8108fe47>] watchdog_overflow_callback+0xe7/0xf0
 [<ffffffff810c1c7c>] __perf_event_overflow+0x9c/0x250
 [<ffffffff810c2734>] perf_event_overflow+0x14/0x20
 [<ffffffff81014c7c>] intel_pmu_handle_irq+0x21c/0x440
 [<ffffffff81010fb9>] perf_event_nmi_handler+0x39/0xc0
 [<ffffffff8106f42c>] notifier_call_chain+0x4c/0x70
 [<ffffffff8106fa6a>] __atomic_notifier_call_chain+0x4a/0x70
 [<ffffffff8106faa6>] atomic_notifier_call_chain+0x16/0x20
 [<ffffffff8106fc2e>] notify_die+0x2e/0x30
 [<ffffffff81002c8a>] do_nmi+0xaa/0x240
 [<ffffffff813592ea>] nmi+0x1a/0x20
 <<EOE>> <0>Rebooting in 60 seconds..[    0.000000]



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-22  4:46                   ` Mike Galbraith
@ 2011-09-22  6:31                     ` Peter Zijlstra
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-22  6:31 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Thu, 2011-09-22 at 06:46 +0200, Mike Galbraith wrote:
> 
> ..perturbation (100% userspace hog) measurement proggy and jitter
> measurement proggy pinned to the same cpu make 100% repeatable boom.
> 
> 
Uh ow... :-) I'd better go have a look then..

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-21 10:17               ` rt14: strace -> migrate_disable_atomic imbalance Mike Galbraith
  2011-09-21 17:01                 ` Peter Zijlstra
  2011-09-21 18:50                   ` Peter Zijlstra
@ 2011-09-22  8:38                 ` Peter Zijlstra
  2011-09-22 10:00                   ` Peter Zijlstra
                                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-22  8:38 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Wed, 2011-09-21 at 20:50 +0200, Peter Zijlstra wrote:
> +static void wait_task_inactive_sched_out(struct preempt_notifier *n,
> +               struct task_struct *next)
> +{
> +       struct task_struct *p;
> +       struct wait_task_inactive_blocked *blocked = 
> +               container_of(n, struct wait_task_inactive_blocked, notifier);
> +
> +       if (current->on_rq) /* we're not inactive yet */
> +               return;
> +
> +       hlist_del(&n->link);
> +
> +       p = ACCESS_ONCE(blocked->waiter);
> +       blocked->waiter = NULL;
> +       wake_up_process(p);
> +} 

Trying a wakeup from there isn't going to actually ever work, of course..
Duh!

^ permalink raw reply	[flat|nested] 52+ messages in thread
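
A sketch of why that wakeup can never work: the call chain below is
reconstructed from kernel/sched.c of this era and abbreviated, so treat
it as illustrative rather than exact.

	/*
	 * schedule()
	 *   raw_spin_lock_irq(&rq->lock)	<- rq->lock held, IRQs off
	 *   prepare_task_switch(rq, prev, next)
	 *     fire_sched_out_preempt_notifiers(prev, next)
	 *       wait_task_inactive_sched_out(n, next)
	 *         wake_up_process(p)
	 *           try_to_wake_up(p, ...)	<- wants p->pi_lock and a
	 *					   rq->lock itself; doing a
	 *					   wakeup from the middle of
	 *					   a context switch, under
	 *					   rq->lock, cannot work
	 */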

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-21 10:17               ` rt14: strace -> migrate_disable_atomic imbalance Mike Galbraith
@ 2011-09-22 10:00                   ` Peter Zijlstra
  2011-09-21 18:50                   ` Peter Zijlstra
                                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-22 10:00 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Thu, 2011-09-22 at 10:38 +0200, Peter Zijlstra wrote:
> On Wed, 2011-09-21 at 20:50 +0200, Peter Zijlstra wrote:
> > +static void wait_task_inactive_sched_out(struct preempt_notifier *n,
> > +               struct task_struct *next)
> > +{
> > +       struct task_struct *p;
> > +       struct wait_task_inactive_blocked *blocked = 
> > +               container_of(n, struct wait_task_inactive_blocked, notifier);
> > +
> > +       if (current->on_rq) /* we're not inactive yet */
> > +               return;
> > +
> > +       hlist_del(&n->link);
> > +
> > +       p = ACCESS_ONCE(blocked->waiter);
> > +       blocked->waiter = NULL;
> > +       wake_up_process(p);
> > +} 
> 
> Trying a wakeup from there isn't going to actually ever work, of course..
> Duh!

OK, this one seems to be better.. But it's quite vile, not sure I
actually like it anymore.

---
 arch/ia64/kvm/Kconfig    |    1 
 arch/powerpc/kvm/Kconfig |    1 
 arch/s390/kvm/Kconfig    |    1 
 arch/tile/kvm/Kconfig    |    1 
 arch/x86/kvm/Kconfig     |    1 
 include/linux/kvm_host.h |    2 
 include/linux/preempt.h  |    4 -
 include/linux/sched.h    |    2 
 init/Kconfig             |    3 
 kernel/sched.c           |  188 +++++++++++++++++++++--------------------------
 10 files changed, 85 insertions(+), 119 deletions(-)
Index: linux-2.6/arch/ia64/kvm/Kconfig
===================================================================
--- linux-2.6.orig/arch/ia64/kvm/Kconfig
+++ linux-2.6/arch/ia64/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
 	depends on HAVE_KVM && MODULES && EXPERIMENTAL
 	# for device assignment:
 	depends on PCI
-	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
 	select KVM_APIC_ARCHITECTURE
Index: linux-2.6/arch/powerpc/kvm/Kconfig
===================================================================
--- linux-2.6.orig/arch/powerpc/kvm/Kconfig
+++ linux-2.6/arch/powerpc/kvm/Kconfig
@@ -18,7 +18,6 @@ if VIRTUALIZATION
 
 config KVM
 	bool
-	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 
 config KVM_BOOK3S_HANDLER
Index: linux-2.6/arch/s390/kvm/Kconfig
===================================================================
--- linux-2.6.orig/arch/s390/kvm/Kconfig
+++ linux-2.6/arch/s390/kvm/Kconfig
@@ -19,7 +19,6 @@ config KVM
 	def_tristate y
 	prompt "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM && EXPERIMENTAL
-	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 	---help---
 	  Support hosting paravirtualized guest machines using the SIE
Index: linux-2.6/arch/tile/kvm/Kconfig
===================================================================
--- linux-2.6.orig/arch/tile/kvm/Kconfig
+++ linux-2.6/arch/tile/kvm/Kconfig
@@ -19,7 +19,6 @@ if VIRTUALIZATION
 config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM && MODULES && EXPERIMENTAL
-	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 	---help---
 	  Support hosting paravirtualized guest machines.
Index: linux-2.6/arch/x86/kvm/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/kvm/Kconfig
+++ linux-2.6/arch/x86/kvm/Kconfig
@@ -24,7 +24,6 @@ config KVM
 	depends on PCI
 	# for TASKSTATS/TASK_DELAY_ACCT:
 	depends on NET
-	select PREEMPT_NOTIFIERS
 	select MMU_NOTIFIER
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
Index: linux-2.6/include/linux/kvm_host.h
===================================================================
--- linux-2.6.orig/include/linux/kvm_host.h
+++ linux-2.6/include/linux/kvm_host.h
@@ -111,9 +111,7 @@ enum {
 
 struct kvm_vcpu {
 	struct kvm *kvm;
-#ifdef CONFIG_PREEMPT_NOTIFIERS
 	struct preempt_notifier preempt_notifier;
-#endif
 	int cpu;
 	int vcpu_id;
 	int srcu_idx;
Index: linux-2.6/include/linux/preempt.h
===================================================================
--- linux-2.6.orig/include/linux/preempt.h
+++ linux-2.6/include/linux/preempt.h
@@ -101,8 +101,6 @@ do { \
 
 #endif /* CONFIG_PREEMPT_COUNT */
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-
 struct preempt_notifier;
 
 /**
@@ -147,6 +145,4 @@ static inline void preempt_notifier_init
 	notifier->ops = ops;
 }
 
-#endif
-
 #endif /* __LINUX_PREEMPT_H */
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1236,10 +1236,8 @@ struct task_struct {
 	struct sched_entity se;
 	struct sched_rt_entity rt;
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
 	struct hlist_head preempt_notifiers;
-#endif
 
 	/*
 	 * fpu_counter contains the number of consecutive context switches
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -1403,9 +1403,6 @@ config STOP_MACHINE
 
 source "block/Kconfig"
 
-config PREEMPT_NOTIFIERS
-	bool
-
 config PADATA
 	depends on SMP
 	bool
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2387,6 +2387,57 @@ struct migration_arg {
 
 static int migration_cpu_stop(void *data);
 
+struct wait_task_inactive_blocked {
+	struct preempt_notifier notifier;
+	struct task_struct *waiter;
+};
+
+static void
+preempt_ops_sched_out_nop(struct preempt_notifier *n, struct task_struct *next)
+{
+}
+
+static void wait_task_inactive_sched_in(struct preempt_notifier *n, int cpu)
+{
+	struct task_struct *p;
+	struct wait_task_inactive_blocked *blocked =
+		container_of(n, struct wait_task_inactive_blocked, notifier);
+
+	hlist_del(&n->link);
+
+	p = ACCESS_ONCE(blocked->waiter);
+	blocked->waiter = NULL;
+	wake_up_process(p);
+}
+
+static struct preempt_ops wait_task_inactive_ops_post = {
+	.sched_in = wait_task_inactive_sched_in,
+	.sched_out = preempt_ops_sched_out_nop,
+};
+
+static void preempt_ops_sched_in_nop(struct preempt_notifier *n, int cpu)
+{
+}
+
+static void
+wait_task_inactive_sched_out(struct preempt_notifier *n, struct task_struct *next)
+{
+	struct wait_task_inactive_blocked *blocked =
+		container_of(n, struct wait_task_inactive_blocked, notifier);
+
+	if (current->on_rq) /* we're not inactive yet */
+		return;
+
+	hlist_del(&n->link);
+	blocked->notifier.ops = &wait_task_inactive_ops_post;
+	hlist_add_head(&n->link, &next->preempt_notifiers);
+}
+
+static struct preempt_ops wait_task_inactive_ops_pre = {
+	.sched_in = preempt_ops_sched_in_nop,
+	.sched_out = wait_task_inactive_sched_out,
+};
+
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
@@ -2405,93 +2456,45 @@ static int migration_cpu_stop(void *data
  */
 unsigned long wait_task_inactive(struct task_struct *p, long match_state)
 {
+	unsigned long ncsw = 0;
 	unsigned long flags;
-	int running, on_rq;
-	unsigned long ncsw;
 	struct rq *rq;
 
-	for (;;) {
-		/*
-		 * We do the initial early heuristics without holding
-		 * any task-queue locks at all. We'll only try to get
-		 * the runqueue lock when things look like they will
-		 * work out!
-		 */
-		rq = task_rq(p);
-
-		/*
-		 * If the task is actively running on another CPU
-		 * still, just relax and busy-wait without holding
-		 * any locks.
-		 *
-		 * NOTE! Since we don't hold any locks, it's not
-		 * even sure that "rq" stays as the right runqueue!
-		 * But we don't care, since "task_running()" will
-		 * return false if the runqueue has changed and p
-		 * is actually now running somewhere else!
-		 */
-		while (task_running(rq, p)) {
-			if (match_state && unlikely(p->state != match_state))
-				return 0;
-			cpu_relax();
-		}
-
-		/*
-		 * Ok, time to look more closely! We need the rq
-		 * lock now, to be *sure*. If we're wrong, we'll
-		 * just go back and repeat.
-		 */
-		rq = task_rq_lock(p, &flags);
-		trace_sched_wait_task(p);
-		running = task_running(rq, p);
-		on_rq = p->on_rq;
-		ncsw = 0;
-		if (!match_state || p->state == match_state)
-			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
-		task_rq_unlock(rq, p, &flags);
-
-		/*
-		 * If it changed from the expected state, bail out now.
-		 */
-		if (unlikely(!ncsw))
-			break;
+	struct wait_task_inactive_blocked blocked = {
+		.notifier = {
+			.ops = &wait_task_inactive_ops_pre,
+		},
+		.waiter = current,
+	};
 
-		/*
-		 * Was it really running after all now that we
-		 * checked with the proper locks actually held?
-		 *
-		 * Oops. Go back and try again..
-		 */
-		if (unlikely(running)) {
-			cpu_relax();
-			continue;
-		}
+	/* if we don't match the expected state, bail */
+	if (match_state && unlikely(p->state != match_state))
+		return 0;
 
-		/*
-		 * It's not enough that it's not actively running,
-		 * it must be off the runqueue _entirely_, and not
-		 * preempted!
-		 *
-		 * So if it was still runnable (but just not actively
-		 * running right now), it's preempted, and we should
-		 * yield - it could be a while.
-		 */
-		if (unlikely(on_rq)) {
-			ktime_t to = ktime_set(0, NSEC_PER_SEC/HZ);
+	rq = task_rq_lock(p, &flags);
+	if (!p->on_rq) /* we're already blocked */
+		goto done;
 
-			set_current_state(TASK_UNINTERRUPTIBLE);
-			schedule_hrtimeout(&to, HRTIMER_MODE_REL);
-			continue;
-		}
+	hlist_add_head(&blocked.notifier.link, &p->preempt_notifiers);
+	task_rq_unlock(rq, p, &flags);
 
-		/*
-		 * Ahh, all good. It wasn't running, and it wasn't
-		 * runnable, which means that it will never become
-		 * running in the future either. We're all done!
-		 */
-		break;
+	for (;;) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (!blocked.waiter)
+			break;
+		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 
+	/*
+	 * Serializes against the completion of the previously observed context
+	 * switch.
+	 */
+	rq = task_rq_lock(p, &flags);
+done:
+	if (!match_state || p->state == match_state)
+		ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
+	task_rq_unlock(rq, p, &flags);
 	return ncsw;
 }
 
@@ -2967,10 +2970,7 @@ static void __sched_fork(struct task_str
 #endif
 
 	INIT_LIST_HEAD(&p->rt.run_list);
-
-#ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
-#endif
 }
 
 /*
@@ -3084,8 +3084,6 @@ void wake_up_new_task(struct task_struct
 	task_rq_unlock(rq, p, &flags);
 }
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-
 /**
  * preempt_notifier_register - tell me when current is being preempted & rescheduled
  * @notifier: notifier struct to register
@@ -3111,9 +3109,9 @@ EXPORT_SYMBOL_GPL(preempt_notifier_unreg
 static void fire_sched_in_preempt_notifiers(struct task_struct *curr)
 {
 	struct preempt_notifier *notifier;
-	struct hlist_node *node;
+	struct hlist_node *node, *n;
 
-	hlist_for_each_entry(notifier, node, &curr->preempt_notifiers, link)
+	hlist_for_each_entry_safe(notifier, node, n, &curr->preempt_notifiers, link)
 		notifier->ops->sched_in(notifier, raw_smp_processor_id());
 }
 
@@ -3122,26 +3120,12 @@ fire_sched_out_preempt_notifiers(struct 
 				 struct task_struct *next)
 {
 	struct preempt_notifier *notifier;
-	struct hlist_node *node;
+	struct hlist_node *node, *n;
 
-	hlist_for_each_entry(notifier, node, &curr->preempt_notifiers, link)
+	hlist_for_each_entry_safe(notifier, node, n, &curr->preempt_notifiers, link)
 		notifier->ops->sched_out(notifier, next);
 }
 
-#else /* !CONFIG_PREEMPT_NOTIFIERS */
-
-static void fire_sched_in_preempt_notifiers(struct task_struct *curr)
-{
-}
-
-static void
-fire_sched_out_preempt_notifiers(struct task_struct *curr,
-				 struct task_struct *next)
-{
-}
-
-#endif /* CONFIG_PREEMPT_NOTIFIERS */
-
 /**
  * prepare_task_switch - prepare to switch tasks
  * @rq: the runqueue preparing to switch
@@ -8312,9 +8296,7 @@ void __init sched_init(void)
 
 	set_load_weight(&init_task);
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&init_task.preempt_notifiers);
-#endif
 
 #ifdef CONFIG_SMP
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);


^ permalink raw reply	[flat|nested] 52+ messages in thread
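
What the ops_pre/ops_post pair in this version is doing, spelled out as
a timeline (illustrative only, not part of the diff): sched_out fires
under rq->lock with interrupts off, so the notifier cannot wake the
waiter there; instead it rides along to whichever task runs next and
performs the wakeup from that task's sched_in, which runs unlocked.

	/*
	 *   waiter:  hooks blocked.notifier (ops _pre) onto p and
	 *            sleeps in TASK_UNINTERRUPTIBLE
	 *
	 *   p:       blocks for real (schedules out with !p->on_rq)
	 *              -> sched_out, rq->lock held, IRQs off:
	 *                   unhook the notifier from p,
	 *                   switch its ops to _post,
	 *                   hook it onto 'next'
	 *
	 *   next:    completes the context switch
	 *              -> sched_in, no rq->lock, IRQs on:
	 *                   unhook the notifier from next,
	 *                   clear blocked->waiter,
	 *                   wake_up_process(waiter)
	 */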

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-21 10:17               ` rt14: strace -> migrate_disable_atomic imbalance Mike Galbraith
                                   ` (3 preceding siblings ...)
  2011-09-22 10:00                   ` Peter Zijlstra
@ 2011-09-22 11:31                 ` Peter Zijlstra
  2011-09-22 11:46                   ` Peter Zijlstra
  5 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-22 11:31 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Thu, 2011-09-22 at 12:00 +0200, Peter Zijlstra wrote:
> OK, this one seems to be better.. But it's quite vile, not sure I
> actually like it anymore.
> 
There's also a small matter of it actually being slower than what we had
when tested with: strace hackbench > /dev/null 2>&1.

curses.. even adding a busy wait on p->on_cpu in front.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-21 10:17               ` rt14: strace -> migrate_disable_atomic imbalance Mike Galbraith
@ 2011-09-22 11:46                   ` Peter Zijlstra
  2011-09-21 18:50                   ` Peter Zijlstra
                                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-22 11:46 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Thu, 2011-09-22 at 13:31 +0200, Peter Zijlstra wrote:
> On Thu, 2011-09-22 at 12:00 +0200, Peter Zijlstra wrote:
> > OK, this one seems to be better.. But it's quite vile, not sure I
> > actually like it anymore.
> > 
> There's also a small matter of it actually being slower than what we had
> when tested with: strace hackbench > /dev/null 2>&1.
> 
> curses.. even adding a busy wait on p->on_cpu in front.

Gah, it's one of those benchmarks where if you rebuild+reboot the numbers
change. Latest patch here..

---
Subject: sched: Rewrite wait_task_inactive()
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Wed Sep 21 21:34:05 CEST 2011

Instead of using a combination of busy-waits and sleeps to poll a
task's state, ensure we get a notification and properly block.

This makes preempt notifiers unconditional, but since pretty much
everybody already has those enabled in their config it shouldn't be a
big deal (I hope).

The notification is somewhat horrid since the sched_out notifier is
called with rq->lock held and interrupts disabled, so we can't
actually do the wakeup from there. However, the sched_in notifier is
called without holding rq->lock and with interrupts enabled.

So what we do is register a preempt notifier on the task we want;
that notifier will move itself to the next task passed into the
sched_out function, and the sched_in notification of that next task
will do the actual work... most horrid.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/ia64/kvm/Kconfig    |    1 
 arch/powerpc/kvm/Kconfig |    1 
 arch/s390/kvm/Kconfig    |    1 
 arch/tile/kvm/Kconfig    |    1 
 arch/x86/kvm/Kconfig     |    1 
 include/linux/kvm_host.h |    2 
 include/linux/preempt.h  |    4 
 include/linux/sched.h    |    2 
 init/Kconfig             |    3 
 kernel/sched.c           |  191 +++++++++++++++++++++--------------------------
 10 files changed, 86 insertions(+), 121 deletions(-)
Index: linux-2.6/arch/ia64/kvm/Kconfig
===================================================================
--- linux-2.6.orig/arch/ia64/kvm/Kconfig
+++ linux-2.6/arch/ia64/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
 	depends on HAVE_KVM && MODULES && EXPERIMENTAL
 	# for device assignment:
 	depends on PCI
-	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
 	select KVM_APIC_ARCHITECTURE
Index: linux-2.6/arch/powerpc/kvm/Kconfig
===================================================================
--- linux-2.6.orig/arch/powerpc/kvm/Kconfig
+++ linux-2.6/arch/powerpc/kvm/Kconfig
@@ -18,7 +18,6 @@ if VIRTUALIZATION
 
 config KVM
 	bool
-	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 
 config KVM_BOOK3S_HANDLER
Index: linux-2.6/arch/s390/kvm/Kconfig
===================================================================
--- linux-2.6.orig/arch/s390/kvm/Kconfig
+++ linux-2.6/arch/s390/kvm/Kconfig
@@ -19,7 +19,6 @@ config KVM
 	def_tristate y
 	prompt "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM && EXPERIMENTAL
-	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 	---help---
 	  Support hosting paravirtualized guest machines using the SIE
Index: linux-2.6/arch/tile/kvm/Kconfig
===================================================================
--- linux-2.6.orig/arch/tile/kvm/Kconfig
+++ linux-2.6/arch/tile/kvm/Kconfig
@@ -19,7 +19,6 @@ if VIRTUALIZATION
 config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM && MODULES && EXPERIMENTAL
-	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 	---help---
 	  Support hosting paravirtualized guest machines.
Index: linux-2.6/arch/x86/kvm/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/kvm/Kconfig
+++ linux-2.6/arch/x86/kvm/Kconfig
@@ -24,7 +24,6 @@ config KVM
 	depends on PCI
 	# for TASKSTATS/TASK_DELAY_ACCT:
 	depends on NET
-	select PREEMPT_NOTIFIERS
 	select MMU_NOTIFIER
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
Index: linux-2.6/include/linux/kvm_host.h
===================================================================
--- linux-2.6.orig/include/linux/kvm_host.h
+++ linux-2.6/include/linux/kvm_host.h
@@ -111,9 +111,7 @@ enum {
 
 struct kvm_vcpu {
 	struct kvm *kvm;
-#ifdef CONFIG_PREEMPT_NOTIFIERS
 	struct preempt_notifier preempt_notifier;
-#endif
 	int cpu;
 	int vcpu_id;
 	int srcu_idx;
Index: linux-2.6/include/linux/preempt.h
===================================================================
--- linux-2.6.orig/include/linux/preempt.h
+++ linux-2.6/include/linux/preempt.h
@@ -101,8 +101,6 @@ do { \
 
 #endif /* CONFIG_PREEMPT_COUNT */
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-
 struct preempt_notifier;
 
 /**
@@ -147,6 +145,4 @@ static inline void preempt_notifier_init
 	notifier->ops = ops;
 }
 
-#endif
-
 #endif /* __LINUX_PREEMPT_H */
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1236,10 +1236,8 @@ struct task_struct {
 	struct sched_entity se;
 	struct sched_rt_entity rt;
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
 	struct hlist_head preempt_notifiers;
-#endif
 
 	/*
 	 * fpu_counter contains the number of consecutive context switches
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -1403,9 +1403,6 @@ config STOP_MACHINE
 
 source "block/Kconfig"
 
-config PREEMPT_NOTIFIERS
-	bool
-
 config PADATA
 	depends on SMP
 	bool
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2387,6 +2387,54 @@ struct migration_arg {
 
 static int migration_cpu_stop(void *data);
 
+struct wait_task_inactive_blocked {
+	struct preempt_notifier notifier;
+	struct task_struct *waiter;
+};
+
+static void wait_task_inactive_sched_in(struct preempt_notifier *n, int cpu)
+{
+	struct task_struct *p;
+	struct wait_task_inactive_blocked *blocked =
+		container_of(n, struct wait_task_inactive_blocked, notifier);
+
+	hlist_del(&n->link);
+
+	p = ACCESS_ONCE(blocked->waiter);
+	blocked->waiter = NULL;
+	wake_up_process(p);
+}
+
+static void
+preempt_ops_sched_out_nop(struct preempt_notifier *n, struct task_struct *next)
+{
+}
+
+static struct preempt_ops wait_task_inactive_ops_post = {
+	.sched_in = wait_task_inactive_sched_in,
+	.sched_out = preempt_ops_sched_out_nop,
+};
+
+static void preempt_ops_sched_in_nop(struct preempt_notifier *n, int cpu)
+{
+}
+
+static void
+wait_task_inactive_sched_out(struct preempt_notifier *n, struct task_struct *next)
+{
+	if (current->on_rq) /* we're not inactive yet */
+		return;
+
+	hlist_del(&n->link);
+	n->ops = &wait_task_inactive_ops_post;
+	hlist_add_head(&n->link, &next->preempt_notifiers);
+}
+
+static struct preempt_ops wait_task_inactive_ops_pre = {
+	.sched_in = preempt_ops_sched_in_nop,
+	.sched_out = wait_task_inactive_sched_out,
+};
+
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
@@ -2405,93 +2453,49 @@ static int migration_cpu_stop(void *data
  */
 unsigned long wait_task_inactive(struct task_struct *p, long match_state)
 {
+	unsigned long ncsw = 0;
 	unsigned long flags;
-	int running, on_rq;
-	unsigned long ncsw;
 	struct rq *rq;
 
-	for (;;) {
-		/*
-		 * We do the initial early heuristics without holding
-		 * any task-queue locks at all. We'll only try to get
-		 * the runqueue lock when things look like they will
-		 * work out!
-		 */
-		rq = task_rq(p);
-
-		/*
-		 * If the task is actively running on another CPU
-		 * still, just relax and busy-wait without holding
-		 * any locks.
-		 *
-		 * NOTE! Since we don't hold any locks, it's not
-		 * even sure that "rq" stays as the right runqueue!
-		 * But we don't care, since "task_running()" will
-		 * return false if the runqueue has changed and p
-		 * is actually now running somewhere else!
-		 */
-		while (task_running(rq, p)) {
-			if (match_state && unlikely(p->state != match_state))
-				return 0;
-			cpu_relax();
-		}
-
-		/*
-		 * Ok, time to look more closely! We need the rq
-		 * lock now, to be *sure*. If we're wrong, we'll
-		 * just go back and repeat.
-		 */
-		rq = task_rq_lock(p, &flags);
-		trace_sched_wait_task(p);
-		running = task_running(rq, p);
-		on_rq = p->on_rq;
-		ncsw = 0;
-		if (!match_state || p->state == match_state)
-			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
-		task_rq_unlock(rq, p, &flags);
-
-		/*
-		 * If it changed from the expected state, bail out now.
-		 */
-		if (unlikely(!ncsw))
-			break;
+	struct wait_task_inactive_blocked blocked = {
+		.notifier = {
+			.ops = &wait_task_inactive_ops_pre,
+		},
+		.waiter = current,
+	};
 
-		/*
-		 * Was it really running after all now that we
-		 * checked with the proper locks actually held?
-		 *
-		 * Oops. Go back and try again..
-		 */
-		if (unlikely(running)) {
-			cpu_relax();
-			continue;
-		}
+	/*
+	 * Busy wait for the task to stop running, the caller promised it
+	 * wouldn't be long.
+	 */
+	while (p->on_cpu) {
+		/* if we don't match the expected state, bail */
+		if (match_state && unlikely(p->state != match_state))
+			return 0;
+		cpu_relax();
+	}
 
-		/*
-		 * It's not enough that it's not actively running,
-		 * it must be off the runqueue _entirely_, and not
-		 * preempted!
-		 *
-		 * So if it was still runnable (but just not actively
-		 * running right now), it's preempted, and we should
-		 * yield - it could be a while.
-		 */
-		if (unlikely(on_rq)) {
-			ktime_t to = ktime_set(0, NSEC_PER_SEC/HZ);
+	rq = task_rq_lock(p, &flags);
+	trace_sched_wait_task(p);
+	if (!p->on_rq) /* we're already blocked */
+		goto done;
 
-			set_current_state(TASK_UNINTERRUPTIBLE);
-			schedule_hrtimeout(&to, HRTIMER_MODE_REL);
-			continue;
-		}
+	hlist_add_head(&blocked.notifier.link, &p->preempt_notifiers);
+	task_rq_unlock(rq, p, &flags);
 
-		/*
-		 * Ahh, all good. It wasn't running, and it wasn't
-		 * runnable, which means that it will never become
-		 * running in the future either. We're all done!
-		 */
-		break;
+	for (;;) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (!blocked.waiter)
+			break;
+		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 
+	rq = task_rq_lock(p, &flags);
+done:
+	if (!match_state || p->state == match_state)
+		ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
+	task_rq_unlock(rq, p, &flags);
 	return ncsw;
 }
 
@@ -2519,9 +2523,7 @@ void kick_process(struct task_struct *p)
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(kick_process);
-#endif /* CONFIG_SMP */
 
-#ifdef CONFIG_SMP
 /*
  * ->cpus_allowed is protected by both rq->lock and p->pi_lock
  */
@@ -2967,10 +2969,7 @@ static void __sched_fork(struct task_str
 #endif
 
 	INIT_LIST_HEAD(&p->rt.run_list);
-
-#ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
-#endif
 }
 
 /*
@@ -3084,8 +3083,6 @@ void wake_up_new_task(struct task_struct
 	task_rq_unlock(rq, p, &flags);
 }
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-
 /**
  * preempt_notifier_register - tell me when current is being preempted & rescheduled
  * @notifier: notifier struct to register
@@ -3111,9 +3108,9 @@ EXPORT_SYMBOL_GPL(preempt_notifier_unreg
 static void fire_sched_in_preempt_notifiers(struct task_struct *curr)
 {
 	struct preempt_notifier *notifier;
-	struct hlist_node *node;
+	struct hlist_node *node, *n;
 
-	hlist_for_each_entry(notifier, node, &curr->preempt_notifiers, link)
+	hlist_for_each_entry_safe(notifier, node, n, &curr->preempt_notifiers, link)
 		notifier->ops->sched_in(notifier, raw_smp_processor_id());
 }
 
@@ -3122,26 +3119,12 @@ fire_sched_out_preempt_notifiers(struct 
 				 struct task_struct *next)
 {
 	struct preempt_notifier *notifier;
-	struct hlist_node *node;
+	struct hlist_node *node, *n;
 
-	hlist_for_each_entry(notifier, node, &curr->preempt_notifiers, link)
+	hlist_for_each_entry_safe(notifier, node, n, &curr->preempt_notifiers, link)
 		notifier->ops->sched_out(notifier, next);
 }
 
-#else /* !CONFIG_PREEMPT_NOTIFIERS */
-
-static void fire_sched_in_preempt_notifiers(struct task_struct *curr)
-{
-}
-
-static void
-fire_sched_out_preempt_notifiers(struct task_struct *curr,
-				 struct task_struct *next)
-{
-}
-
-#endif /* CONFIG_PREEMPT_NOTIFIERS */
-
 /**
  * prepare_task_switch - prepare to switch tasks
  * @rq: the runqueue preparing to switch
@@ -8312,9 +8295,7 @@ void __init sched_init(void)
 
 	set_load_weight(&init_task);
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&init_task.preempt_notifiers);
-#endif
 
 #ifdef CONFIG_SMP
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);


^ permalink raw reply	[flat|nested] 52+ messages in thread
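
Since the patch makes preempt notifiers unconditional, a short sketch
of the consumer API may help readers who have not met it; the Kconfig
hunks above show KVM as its main in-tree user. Everything below apart
from the preempt_notifier_*() calls and the preempt_ops callbacks is
made up for illustration:

	#include <linux/kernel.h>	/* container_of() */
	#include <linux/preempt.h>
	#include <linux/sched.h>

	struct my_watcher {
		struct preempt_notifier pn;	/* embedded; recovered via container_of() */
		int last_cpu;
	};

	static void my_sched_in(struct preempt_notifier *pn, int cpu)
	{
		struct my_watcher *w = container_of(pn, struct my_watcher, pn);

		/* current was just scheduled back in on 'cpu' */
		w->last_cpu = cpu;
	}

	static void my_sched_out(struct preempt_notifier *pn,
				 struct task_struct *next)
	{
		/*
		 * current is being scheduled out in favour of 'next';
		 * runs under rq->lock with interrupts disabled, so no
		 * sleeping and no wakeups from here.
		 */
	}

	static struct preempt_ops my_ops = {
		.sched_in	= my_sched_in,
		.sched_out	= my_sched_out,
	};

	/* called by the task that wants notifications about itself: */
	void my_start_watching(struct my_watcher *w)
	{
		preempt_notifier_init(&w->pn, &my_ops);
		preempt_notifier_register(&w->pn);	/* attaches to current */
		/* preempt_notifier_unregister(&w->pn) detaches again */
	}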

* Re: rt14: strace ->  migrate_disable_atomic imbalance
@ 2011-09-22 11:46                   ` Peter Zijlstra
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-22 11:46 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Thu, 2011-09-22 at 13:31 +0200, Peter Zijlstra wrote:
> On Thu, 2011-09-22 at 12:00 +0200, Peter Zijlstra wrote:
> > OK, this one seems to be better.. But its quite vile, not sure I
> > actually like it anymore.
> > 
> There's also a small matter of it actually being slower than what we had
> when tested with: strace hackbench > /dev/null 2>&1.
> 
> curses.. even adding a busy wait on p->on_cpu in front.

Gah, its one of those benchmarks where if you rebuild+reboot the numbers
change. Latest patch here..

---
Subject: sched: Rewrite wait_task_inactive()
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Wed Sep 21 21:34:05 CEST 2011

Instead of using a combination of busy-waits and sleeps to poll a
task's state, ensure we get a notification and properly block.

This makes preempt notifiers unconditional, but since pretty much
everybody already has those enabled in their config it shouldn't be a
big deal (I hope).

The notification is somewhat horrid since the sched_out notifier is
called with rq->lock held and interrupts disabled, so we can't
actually do the wakeup from there. However the sched_in notifier is
called without holding rq->lock and with interrupts enable.

So what we do is register a preempt notifier on the task we want;
that notifier will move itself to the next task passed into the
sched_out function, and the sched_in notification of that next task
will do the actual work... most horrid.
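
In context-switch order, the hand-off works roughly like this (a sketch
of the mechanism only; the real implementation is in the hunks below):

	/*
	 * waiter:    register notifier n on p, then sleep
	 * p -> next: p's sched_out sees !current->on_rq, moves n to next
	 * next in:   next's sched_in fires, deletes n, wakes the waiter
	 */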

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/ia64/kvm/Kconfig    |    1 
 arch/powerpc/kvm/Kconfig |    1 
 arch/s390/kvm/Kconfig    |    1 
 arch/tile/kvm/Kconfig    |    1 
 arch/x86/kvm/Kconfig     |    1 
 include/linux/kvm_host.h |    2 
 include/linux/preempt.h  |    4 
 include/linux/sched.h    |    2 
 init/Kconfig             |    3 
 kernel/sched.c           |  191 +++++++++++++++++++++--------------------------
 10 files changed, 86 insertions(+), 121 deletions(-)
Index: linux-2.6/arch/ia64/kvm/Kconfig
===================================================================
--- linux-2.6.orig/arch/ia64/kvm/Kconfig
+++ linux-2.6/arch/ia64/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
 	depends on HAVE_KVM && MODULES && EXPERIMENTAL
 	# for device assignment:
 	depends on PCI
-	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
 	select KVM_APIC_ARCHITECTURE
Index: linux-2.6/arch/powerpc/kvm/Kconfig
===================================================================
--- linux-2.6.orig/arch/powerpc/kvm/Kconfig
+++ linux-2.6/arch/powerpc/kvm/Kconfig
@@ -18,7 +18,6 @@ if VIRTUALIZATION
 
 config KVM
 	bool
-	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 
 config KVM_BOOK3S_HANDLER
Index: linux-2.6/arch/s390/kvm/Kconfig
===================================================================
--- linux-2.6.orig/arch/s390/kvm/Kconfig
+++ linux-2.6/arch/s390/kvm/Kconfig
@@ -19,7 +19,6 @@ config KVM
 	def_tristate y
 	prompt "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM && EXPERIMENTAL
-	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 	---help---
 	  Support hosting paravirtualized guest machines using the SIE
Index: linux-2.6/arch/tile/kvm/Kconfig
===================================================================
--- linux-2.6.orig/arch/tile/kvm/Kconfig
+++ linux-2.6/arch/tile/kvm/Kconfig
@@ -19,7 +19,6 @@ if VIRTUALIZATION
 config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM && MODULES && EXPERIMENTAL
-	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 	---help---
 	  Support hosting paravirtualized guest machines.
Index: linux-2.6/arch/x86/kvm/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/kvm/Kconfig
+++ linux-2.6/arch/x86/kvm/Kconfig
@@ -24,7 +24,6 @@ config KVM
 	depends on PCI
 	# for TASKSTATS/TASK_DELAY_ACCT:
 	depends on NET
-	select PREEMPT_NOTIFIERS
 	select MMU_NOTIFIER
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
Index: linux-2.6/include/linux/kvm_host.h
===================================================================
--- linux-2.6.orig/include/linux/kvm_host.h
+++ linux-2.6/include/linux/kvm_host.h
@@ -111,9 +111,7 @@ enum {
 
 struct kvm_vcpu {
 	struct kvm *kvm;
-#ifdef CONFIG_PREEMPT_NOTIFIERS
 	struct preempt_notifier preempt_notifier;
-#endif
 	int cpu;
 	int vcpu_id;
 	int srcu_idx;
Index: linux-2.6/include/linux/preempt.h
===================================================================
--- linux-2.6.orig/include/linux/preempt.h
+++ linux-2.6/include/linux/preempt.h
@@ -101,8 +101,6 @@ do { \
 
 #endif /* CONFIG_PREEMPT_COUNT */
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-
 struct preempt_notifier;
 
 /**
@@ -147,6 +145,4 @@ static inline void preempt_notifier_init
 	notifier->ops = ops;
 }
 
-#endif
-
 #endif /* __LINUX_PREEMPT_H */
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1236,10 +1236,8 @@ struct task_struct {
 	struct sched_entity se;
 	struct sched_rt_entity rt;
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
 	struct hlist_head preempt_notifiers;
-#endif
 
 	/*
 	 * fpu_counter contains the number of consecutive context switches
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -1403,9 +1403,6 @@ config STOP_MACHINE
 
 source "block/Kconfig"
 
-config PREEMPT_NOTIFIERS
-	bool
-
 config PADATA
 	depends on SMP
 	bool
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2387,6 +2387,54 @@ struct migration_arg {
 
 static int migration_cpu_stop(void *data);
 
+struct wait_task_inactive_blocked {
+	struct preempt_notifier notifier;
+	struct task_struct *waiter;
+};
+
+static void wait_task_inactive_sched_in(struct preempt_notifier *n, int cpu)
+{
+	struct task_struct *p;
+	struct wait_task_inactive_blocked *blocked =
+		container_of(n, struct wait_task_inactive_blocked, notifier);
+
+	hlist_del(&n->link);
+
+	p = ACCESS_ONCE(blocked->waiter);
+	blocked->waiter = NULL;
+	wake_up_process(p);
+}
+
+static void
+preempt_ops_sched_out_nop(struct preempt_notifier *n, struct task_struct *next)
+{
+}
+
+static struct preempt_ops wait_task_inactive_ops_post = {
+	.sched_in = wait_task_inactive_sched_in,
+	.sched_out = preempt_ops_sched_out_nop,
+};
+
+static void preempt_ops_sched_in_nop(struct preempt_notifier *n, int cpu)
+{
+}
+
+static void
+wait_task_inactive_sched_out(struct preempt_notifier *n, struct task_struct *next)
+{
+	if (current->on_rq) /* we're not inactive yet */
+		return;
+
+	hlist_del(&n->link);
+	n->ops = &wait_task_inactive_ops_post;
+	hlist_add_head(&n->link, &next->preempt_notifiers);
+}
+
+static struct preempt_ops wait_task_inactive_ops_pre = {
+	.sched_in = preempt_ops_sched_in_nop,
+	.sched_out = wait_task_inactive_sched_out,
+};
+
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
@@ -2405,93 +2453,49 @@ static int migration_cpu_stop(void *data
  */
 unsigned long wait_task_inactive(struct task_struct *p, long match_state)
 {
+	unsigned long ncsw = 0;
 	unsigned long flags;
-	int running, on_rq;
-	unsigned long ncsw;
 	struct rq *rq;
 
-	for (;;) {
-		/*
-		 * We do the initial early heuristics without holding
-		 * any task-queue locks at all. We'll only try to get
-		 * the runqueue lock when things look like they will
-		 * work out!
-		 */
-		rq = task_rq(p);
-
-		/*
-		 * If the task is actively running on another CPU
-		 * still, just relax and busy-wait without holding
-		 * any locks.
-		 *
-		 * NOTE! Since we don't hold any locks, it's not
-		 * even sure that "rq" stays as the right runqueue!
-		 * But we don't care, since "task_running()" will
-		 * return false if the runqueue has changed and p
-		 * is actually now running somewhere else!
-		 */
-		while (task_running(rq, p)) {
-			if (match_state && unlikely(p->state != match_state))
-				return 0;
-			cpu_relax();
-		}
-
-		/*
-		 * Ok, time to look more closely! We need the rq
-		 * lock now, to be *sure*. If we're wrong, we'll
-		 * just go back and repeat.
-		 */
-		rq = task_rq_lock(p, &flags);
-		trace_sched_wait_task(p);
-		running = task_running(rq, p);
-		on_rq = p->on_rq;
-		ncsw = 0;
-		if (!match_state || p->state == match_state)
-			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
-		task_rq_unlock(rq, p, &flags);
-
-		/*
-		 * If it changed from the expected state, bail out now.
-		 */
-		if (unlikely(!ncsw))
-			break;
+	struct wait_task_inactive_blocked blocked = {
+		.notifier = {
+			.ops = &wait_task_inactive_ops_pre,
+		},
+		.waiter = current,
+	};
 
-		/*
-		 * Was it really running after all now that we
-		 * checked with the proper locks actually held?
-		 *
-		 * Oops. Go back and try again..
-		 */
-		if (unlikely(running)) {
-			cpu_relax();
-			continue;
-		}
+	/*
+	 * Busy wait for the task to stop running, the caller promised it
+	 * wouldn't be long.
+	 */
+	while (p->on_cpu) {
+		/* if we don't match the expected state, bail */
+		if (match_state && unlikely(p->state != match_state))
+			return 0;
+		cpu_relax();
+	}
 
-		/*
-		 * It's not enough that it's not actively running,
-		 * it must be off the runqueue _entirely_, and not
-		 * preempted!
-		 *
-		 * So if it was still runnable (but just not actively
-		 * running right now), it's preempted, and we should
-		 * yield - it could be a while.
-		 */
-		if (unlikely(on_rq)) {
-			ktime_t to = ktime_set(0, NSEC_PER_SEC/HZ);
+	rq = task_rq_lock(p, &flags);
+	trace_sched_wait_task(p);
+	if (!p->on_rq) /* we're already blocked */
+		goto done;
 
-			set_current_state(TASK_UNINTERRUPTIBLE);
-			schedule_hrtimeout(&to, HRTIMER_MODE_REL);
-			continue;
-		}
+	hlist_add_head(&blocked.notifier.link, &p->preempt_notifiers);
+	task_rq_unlock(rq, p, &flags);
 
-		/*
-		 * Ahh, all good. It wasn't running, and it wasn't
-		 * runnable, which means that it will never become
-		 * running in the future either. We're all done!
-		 */
-		break;
+	for (;;) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (!blocked.waiter)
+			break;
+		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 
+	rq = task_rq_lock(p, &flags);
+done:
+	if (!match_state || p->state == match_state)
+		ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
+	task_rq_unlock(rq, p, &flags);
 	return ncsw;
 }
 
@@ -2519,9 +2523,7 @@ void kick_process(struct task_struct *p)
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(kick_process);
-#endif /* CONFIG_SMP */
 
-#ifdef CONFIG_SMP
 /*
  * ->cpus_allowed is protected by both rq->lock and p->pi_lock
  */
@@ -2967,10 +2969,7 @@ static void __sched_fork(struct task_str
 #endif
 
 	INIT_LIST_HEAD(&p->rt.run_list);
-
-#ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
-#endif
 }
 
 /*
@@ -3084,8 +3083,6 @@ void wake_up_new_task(struct task_struct
 	task_rq_unlock(rq, p, &flags);
 }
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-
 /**
  * preempt_notifier_register - tell me when current is being preempted & rescheduled
  * @notifier: notifier struct to register
@@ -3111,9 +3108,9 @@ EXPORT_SYMBOL_GPL(preempt_notifier_unreg
 static void fire_sched_in_preempt_notifiers(struct task_struct *curr)
 {
 	struct preempt_notifier *notifier;
-	struct hlist_node *node;
+	struct hlist_node *node, *n;
 
-	hlist_for_each_entry(notifier, node, &curr->preempt_notifiers, link)
+	hlist_for_each_entry_safe(notifier, node, n, &curr->preempt_notifiers, link)
 		notifier->ops->sched_in(notifier, raw_smp_processor_id());
 }
 
@@ -3122,26 +3119,12 @@ fire_sched_out_preempt_notifiers(struct 
 				 struct task_struct *next)
 {
 	struct preempt_notifier *notifier;
-	struct hlist_node *node;
+	struct hlist_node *node, *n;
 
-	hlist_for_each_entry(notifier, node, &curr->preempt_notifiers, link)
+	hlist_for_each_entry_safe(notifier, node, n, &curr->preempt_notifiers, link)
 		notifier->ops->sched_out(notifier, next);
 }
 
-#else /* !CONFIG_PREEMPT_NOTIFIERS */
-
-static void fire_sched_in_preempt_notifiers(struct task_struct *curr)
-{
-}
-
-static void
-fire_sched_out_preempt_notifiers(struct task_struct *curr,
-				 struct task_struct *next)
-{
-}
-
-#endif /* CONFIG_PREEMPT_NOTIFIERS */
-
 /**
  * prepare_task_switch - prepare to switch tasks
  * @rq: the runqueue preparing to switch
@@ -8312,9 +8295,7 @@ void __init sched_init(void)
 
 	set_load_weight(&init_task);
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&init_task.preempt_notifiers);
-#endif
 
 #ifdef CONFIG_SMP
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-22 10:00                   ` Peter Zijlstra
  (?)
@ 2011-09-22 11:55                   ` Mike Galbraith
  2011-09-22 12:09                     ` Peter Zijlstra
  -1 siblings, 1 reply; 52+ messages in thread
From: Mike Galbraith @ 2011-09-22 11:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Thu, 2011-09-22 at 12:00 +0200, Peter Zijlstra wrote:

> OK, this one seems to be better.. But its quite vile, not sure I
> actually like it anymore.

Well, seemed to work, but I see there's a v3 now.

	-Mike


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-22 11:55                   ` Mike Galbraith
@ 2011-09-22 12:09                     ` Peter Zijlstra
  2011-09-22 13:42                       ` Mike Galbraith
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-22 12:09 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Thu, 2011-09-22 at 13:55 +0200, Mike Galbraith wrote:
> On Thu, 2011-09-22 at 12:00 +0200, Peter Zijlstra wrote:
> 
> > OK, this one seems to be better.. But its quite vile, not sure I
> > actually like it anymore.
> 
> Well, seemed to work, but I see there's a v3 now.

Yeah, just posted it for completeness; not sure it's actually going
anywhere since it's slower than the current code (although it's hard to
say with the results changing from reboot to reboot), and it's still
quite ugly...



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-22 12:09                     ` Peter Zijlstra
@ 2011-09-22 13:42                       ` Mike Galbraith
  2011-09-22 14:05                         ` Mike Galbraith
  2011-09-22 14:34                         ` Peter Zijlstra
  0 siblings, 2 replies; 52+ messages in thread
From: Mike Galbraith @ 2011-09-22 13:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Thu, 2011-09-22 at 14:09 +0200, Peter Zijlstra wrote:
> On Thu, 2011-09-22 at 13:55 +0200, Mike Galbraith wrote:
> > On Thu, 2011-09-22 at 12:00 +0200, Peter Zijlstra wrote:
> > 
> > > OK, this one seems to be better.. But its quite vile, not sure I
> > > actually like it anymore.
> > 
> > Well, seemed to work, but I see there's a v3 now.
> 
> Yeah, just posted it for completeness, not sure its actually going
> anywhere since its slower than the current code (although its hard to
> say with the results changing from reboot to reboot), and its still
> quite ugly..

Hm. Stracing this proglet will soon leave it stuck forever unless the
timer is left running.  Virgin rt14 does the same though...

strace ./jitter -c 3 -p 99 -f 1000 -t 10 -r

rt_sigtimedwait([], NULL, NULL, 8)      = 64
timer_settime(0x1, TIMER_ABSTIME, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
timer_settime(0x1, TIMER_ABSTIME, {it_interval={0, 0}, it_value={1316698141, 166759038}}, NULL) = 0
rt_sigtimedwait([], NULL, NULL, 8)      = 64
timer_settime(0x1, TIMER_ABSTIME, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
timer_settime(0x1, TIMER_ABSTIME, {it_interval={0, 0}, it_value={1316698141, 167822701}}, NULL) = 0
rt_sigtimedwait([], NULL, NULL, 8)      = 64
timer_settime(0x1, TIMER_ABSTIME, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
timer_settime(0x1, TIMER_ABSTIME, {it_interval={0, 0}, it_value={1316698141, 168887375}}, NULL) = 0
--- SIGRT_32 (Real-time signal 30) @ 0 (0) ---
rt_sigreturn(0x40)                      = 0
rt_sigtimedwait([], NULL, NULL, 8^C <unfinished ...>

#define _GNU_SOURCE

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>
#include <values.h>
#include <sched.h>
#include <signal.h>
#include <time.h>
#include <cpuset.h>
#include <getopt.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/mman.h>

/* compile with gcc -O  jitter.c  -o jitter -lrt -lm */

#define NSEC_PER_SEC 1000000000ULL
#define USEC_PER_SEC 1000000ULL
#define NSEC_PER_USEC 1000ULL


int frequency = 1000;
int period;
int tolerance = 5;
int delay = 1;
int samples = 1000;
int cpu;
int priority = 1;
int reset_timer;

double *deltas;
double *deviants;

sigset_t mysigset;

char *usage = "Usage: -c <cpu> -d <delay> -f <freq(Hz)> -p <prio> -t <tolerance(us)>\n";

void parse_options(int argc, char **argv)
{
	int ch;
	extern char *optarg;
	extern int optind;

	while ((ch = getopt(argc, argv, "c:d:f:p:rt:")) != EOF) {
		switch (ch) {
			case 'c':
				if (sscanf(optarg, "%d", &cpu) != 1 ||
						cpu < 0) {
					fprintf(stderr,"Invalid cpu.\n");
					exit(EXIT_FAILURE);
		 		}
			break;
			case 'd':
				if (sscanf(optarg, "%d", &delay) != 1 ||
						delay <= 0) {
					fprintf(stderr,"Invalid delay.\n");
					exit(EXIT_FAILURE);
		 		}
			break;
			case 'f':
				if (sscanf(optarg, "%d", &frequency) != 1 ||
						frequency <= 0) {
					fprintf(stderr,"Invalid frequency.\n");
					exit(EXIT_FAILURE);
		 		}
			break;
			case 'r':
				reset_timer = 1;
			break;
			case 'p':
				if (sscanf(optarg, "%d", &priority) != 1 ||
						priority < 1 || priority > 99) {
					fprintf(stderr,"Invalid priority.\n");
					exit(EXIT_FAILURE);
		 		}
			break;
			case 't':
				if (sscanf(optarg, "%d", &tolerance) != 1 ||
						tolerance < 0) {
					fprintf(stderr,"Invalid tolerance.\n");
					exit(EXIT_FAILURE);
		 		}
			break;
			default:
				fprintf(stderr, "%s", usage);
				exit(EXIT_FAILURE);
			break;
		}
	}

	samples = frequency * delay;
	period = NSEC_PER_SEC / frequency;
}

long delta(struct timespec *now, struct timespec *then)
{
	long delta = now->tv_sec * NSEC_PER_SEC + now->tv_nsec;

	delta -= then->tv_sec * NSEC_PER_SEC + then->tv_nsec; 

	return delta;
}

void signal_handler(int signo)
{
}

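/*
 * Install a no-op handler for SIGRTMAX and create a CLOCK_REALTIME
 * timer that delivers it; the signal is actually consumed via sigwait().
 */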
void init_timer(timer_t *timer_id)
{
	struct sigaction sa;
	struct sigevent se;

	memset(&sa, 0, sizeof(sa));
	sa.sa_flags = SA_RESTART|SA_SIGINFO;
	sa.sa_handler = signal_handler;
	sigemptyset(&sa.sa_mask);
	sigaddset(&sa.sa_mask, SIGCHLD);

	memset(&se, 0, sizeof(se));
	se.sigev_notify = SIGEV_SIGNAL;
	se.sigev_signo = SIGRTMAX;
	se.sigev_value.sival_int = 0;

	if (sigaction(SIGRTMAX, &sa, 0) < 0) {
		perror("sigaction");
		exit(EXIT_FAILURE);
	}

	if (timer_create(CLOCK_REALTIME, &se, timer_id) < 0) {
		perror("timer_create");
		exit(EXIT_FAILURE);
	}

	sigemptyset(&mysigset);
	sigaddset(&mysigset,SIGRTMAX);
}

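/*
 * Arm timer_id to fire 'period' ns from now (period == 0 disarms) and
 * record the target expiry in *target; with -r the timer is one-shot,
 * otherwise it re-fires every 'period' ns by itself.
 */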
void set_timer(timer_t timer_id, int period, struct timespec *target)
{
	struct itimerspec ts;

	clock_gettime(CLOCK_REALTIME, target);

	if (period) {
		target->tv_nsec += period;

		if (target->tv_nsec >= NSEC_PER_SEC) {
			target->tv_sec++;
			target->tv_nsec -= NSEC_PER_SEC;
		}
	
		ts.it_value = *target;
		ts.it_interval.tv_sec = 0;
		ts.it_interval.tv_nsec = reset_timer ? 0 : period;
	} else
		memset(&ts, 0, sizeof(struct itimerspec));

	if (timer_settime(timer_id, TIMER_ABSTIME, &ts, NULL) < 0) {
		perror("timer_settime");
		exit (EXIT_FAILURE);
	}
}

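/*
 * Drop to SCHED_OTHER while crunching and printing the numbers, then
 * restore SCHED_FIFO for the next measurement interval.
 */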
void print_stats(void)
{
	struct sched_param sp;
	double min = MAXDOUBLE, max = -MAXDOUBLE;
	double sum = 0.0, delta, mean, sd;
	double tol = (double)tolerance;
	int i, deviant = 0;

	sp.sched_priority = 0;
	if (sched_setscheduler(0, SCHED_OTHER, &sp) == -1) {
		perror("sched_setscheduler");
		exit(EXIT_FAILURE);
	}

	for (i = 0; i < samples; i++) {
		deltas[i] /= (double)NSEC_PER_USEC;
		if (deltas[i] < min)
			min = deltas[i];
		if (deltas[i] > max)
			max = deltas[i];
		if (fabs(deltas[i]) > tol) {
			deviants[deviant] = deltas[i];
			deviant++;
		}
		sum += deltas[i];
	}
	mean = sum / (double)samples;

	/* calculate standard deviation */
	sum = 0.0;
	for (i = 0; i < samples; i++) {
		delta = deltas[i] - mean;
		sum += delta*delta;
	}
	sum /= (double)samples;
	sd = sqrt(sum);

	printf("jitter:%7.2f\tmin: %9.2f max: %9.2f mean: %9.2f stddev: %7.2f\n",
		max - min, min, max, mean, sd);

	if (!deviant)
		goto out;

	min = MAXDOUBLE;
	max = -MAXDOUBLE;
	sum = 0.0;

	for (i = 0; i < deviant; i++) {
		if (deviants[i] < min)
			min = deviants[i];
		if (deviants[i] > max)
			max = deviants[i];
		sum += deviants[i];
	}
	mean = sum / (double)deviant;

	/* calculate standard deviation */
	sum = 0.0;
	for (i = 0; i < deviant; i++) {
		delta = deviants[i] - mean;
		sum += delta*delta;
	}
	sum /= (double)deviant;
	sd = sqrt(sum);
	printf("%d > %d us hits\tmin: %9.2f max: %9.2f mean: %9.2f stddev: %7.2f\n\n",
		deviant, tolerance, min, max, mean, sd);

out:
	fflush(stdout);
	sp.sched_priority = priority;
	if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
		perror("sched_setscheduler");
		exit(EXIT_FAILURE);
	}
}


int quit;

void exit_handler(int signo)
{
	quit = 1;
}

int main(int argc, char **argv)
{
	timer_t timer_id;
	struct sched_param sp;
	cpu_set_t cpuset;
	struct timespec now, then;
	int i = 0, sig = 0;

	parse_options(argc, argv);
	signal(SIGINT, exit_handler);
	signal(SIGTERM, exit_handler);

	if (mlockall(MCL_CURRENT|MCL_FUTURE) < 0) {
		perror("mlockall");
		exit(EXIT_FAILURE);
	}

	if (!(deltas = malloc(samples * sizeof(double)))) {
		perror("malloc deltas");
		exit(EXIT_FAILURE);
	} else if (!(deviants = malloc(samples * sizeof(double)))) {
		perror("malloc deviants");
		exit(EXIT_FAILURE);
	}

	CPU_ZERO(&cpuset);
	CPU_SET(cpu, &cpuset);

	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) == -1) {
		perror("setaffinity");
		exit(EXIT_FAILURE);
	}

	sp.sched_priority = priority;
	if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
		perror("sched_setscheduler");
		exit(EXIT_FAILURE);
	}

	printf("CPU%d priority: %d timer freq: %d Hz tolerance: %d usecs, stats interval: %d %s\n\n",
		cpu, sp.sched_priority, frequency, tolerance,  delay, "secs");

	init_timer(&timer_id);
	set_timer(timer_id, period, &then);

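	/* -r mode: one-shot timer, disarmed and re-armed around every sample */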
	while (!quit && reset_timer) {
		sigwait(&mysigset,&sig);
		set_timer(timer_id, 0, &now);
		deltas[i] = (double)delta(&now, &then);

		if (++i >= samples) {
			print_stats();
			i = 0;
		}

		set_timer(timer_id, period, &then);
	}

	clock_gettime(CLOCK_REALTIME, &then);

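	/* periodic mode: timer free-runs; jitter is measured against 'period' */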
	while (!quit && !reset_timer) {
		sigwait(&mysigset,&sig);
		clock_gettime(CLOCK_REALTIME, &now);
		deltas[i] = (double)delta(&now, &then) - period;

		if (++i >= samples) {
			set_timer(timer_id, 0, &then);
			print_stats();
			i = 0;
			set_timer(timer_id, period, &then);
		}

		clock_gettime(CLOCK_REALTIME, &then);
	}

	set_timer(timer_id, 0, &now);

	exit(EXIT_SUCCESS);
}



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-22 13:42                       ` Mike Galbraith
@ 2011-09-22 14:05                         ` Mike Galbraith
  2011-09-22 15:20                           ` Peter Zijlstra
  2011-09-22 14:34                         ` Peter Zijlstra
  1 sibling, 1 reply; 52+ messages in thread
From: Mike Galbraith @ 2011-09-22 14:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Thu, 2011-09-22 at 15:42 +0200, Mike Galbraith wrote:
> On Thu, 2011-09-22 at 14:09 +0200, Peter Zijlstra wrote:
> > On Thu, 2011-09-22 at 13:55 +0200, Mike Galbraith wrote:
> > > On Thu, 2011-09-22 at 12:00 +0200, Peter Zijlstra wrote:
> > > 
> > > > OK, this one seems to be better.. But its quite vile, not sure I
> > > > actually like it anymore.
> > > 
> > > Well, seemed to work, but I see there's a v3 now.
> > 
> > Yeah, just posted it for completeness, not sure its actually going
> > anywhere since its slower than the current code (although its hard to
> > say with the results changing from reboot to reboot), and its still
> > quite ugly..
> 
> Hm. Stracing this proglet will soon leave it stuck forever unless the
> timer is left running.  Virgin rt14 does the same though...
> 
> strace ./jitter -c 3 -p 99 -f 1000 -t 10 -r
> 
> rt_sigtimedwait([], NULL, NULL, 8)      = 64
> timer_settime(0x1, TIMER_ABSTIME, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
> timer_settime(0x1, TIMER_ABSTIME, {it_interval={0, 0}, it_value={1316698141, 166759038}}, NULL) = 0
> rt_sigtimedwait([], NULL, NULL, 8)      = 64
> timer_settime(0x1, TIMER_ABSTIME, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
> timer_settime(0x1, TIMER_ABSTIME, {it_interval={0, 0}, it_value={1316698141, 167822701}}, NULL) = 0
> rt_sigtimedwait([], NULL, NULL, 8)      = 64
> timer_settime(0x1, TIMER_ABSTIME, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
> timer_settime(0x1, TIMER_ABSTIME, {it_interval={0, 0}, it_value={1316698141, 168887375}}, NULL) = 0
> --- SIGRT_32 (Real-time signal 30) @ 0 (0) ---
> rt_sigreturn(0x40)                      = 0
> rt_sigtimedwait([], NULL, NULL, 8^C <unfinished ...>

I thought it was RT specific, but it's not after all: a 3.0.4 distro
desktop (preempt) kernel did the same after a bit.

	-Mike


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-22 13:42                       ` Mike Galbraith
  2011-09-22 14:05                         ` Mike Galbraith
@ 2011-09-22 14:34                         ` Peter Zijlstra
  2011-09-22 14:38                           ` Mike Galbraith
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-22 14:34 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Thu, 2011-09-22 at 15:42 +0200, Mike Galbraith wrote:
> #include <cpuset.h>

I don't seem to have that; where does that come from?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-22 14:34                         ` Peter Zijlstra
@ 2011-09-22 14:38                           ` Mike Galbraith
  2011-09-22 14:41                             ` Mike Galbraith
  2011-09-22 14:41                             ` Peter Zijlstra
  0 siblings, 2 replies; 52+ messages in thread
From: Mike Galbraith @ 2011-09-22 14:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Thu, 2011-09-22 at 16:34 +0200, Peter Zijlstra wrote:
> On Thu, 2011-09-22 at 15:42 +0200, Mike Galbraith wrote:
> > #include <cpuset.h>
> 
> I don't seem to have that, where does that come from?

libcpuset-devel.

	-Mike


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-22 14:38                           ` Mike Galbraith
@ 2011-09-22 14:41                             ` Mike Galbraith
  2011-09-22 14:41                             ` Peter Zijlstra
  1 sibling, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2011-09-22 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Thu, 2011-09-22 at 16:38 +0200, Mike Galbraith wrote:
> On Thu, 2011-09-22 at 16:34 +0200, Peter Zijlstra wrote:
> > On Thu, 2011-09-22 at 15:42 +0200, Mike Galbraith wrote:
> > > #include <cpuset.h>
> > 
> > I don't seem to have that, where does that come from?
> 
> libcpuset-devel.

But you don't need it.  That's a leftover from a version that could move
itself into a shielded cpuset.

	-Mike


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-22 14:38                           ` Mike Galbraith
  2011-09-22 14:41                             ` Mike Galbraith
@ 2011-09-22 14:41                             ` Peter Zijlstra
  2011-09-22 14:46                                 ` Mike Galbraith
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-22 14:41 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Thu, 2011-09-22 at 16:38 +0200, Mike Galbraith wrote:
> On Thu, 2011-09-22 at 16:34 +0200, Peter Zijlstra wrote:
> > On Thu, 2011-09-22 at 15:42 +0200, Mike Galbraith wrote:
> > > #include <cpuset.h>
> > 
> > I don't seem to have that, where does that come from?
> 
> libcpuset-devel.

Not in any distro near me. I'm assuming it's this stuff:

  ftp://oss.sgi.com/projects/cpusets/download/libcpuset.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-22 14:41                             ` Peter Zijlstra
@ 2011-09-22 14:46                                 ` Mike Galbraith
  0 siblings, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2011-09-22 14:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Thu, 2011-09-22 at 16:41 +0200, Peter Zijlstra wrote:
> On Thu, 2011-09-22 at 16:38 +0200, Mike Galbraith wrote:
> > On Thu, 2011-09-22 at 16:34 +0200, Peter Zijlstra wrote:
> > > On Thu, 2011-09-22 at 15:42 +0200, Mike Galbraith wrote:
> > > > #include <cpuset.h>
> > > 
> > > I don't seem to have that, where does that come from?
> > 
> > libcpuset-devel.
> 
> Not in any distro near me. I'm assuming its this stuff:

(move closer to Nürnberg;)

>   ftp://oss.sgi.com/projects/cpusets/download/libcpuset.html

Yeah, that should be it.

	-Mike


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-22 11:46                   ` Peter Zijlstra
  (?)
@ 2011-09-22 14:52                   ` Oleg Nesterov
  2011-09-22 15:13                     ` Peter Zijlstra
  -1 siblings, 1 reply; 52+ messages in thread
From: Oleg Nesterov @ 2011-09-22 14:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, linux-rt-users, Thomas Gleixner, LKML,
	Miklos Szeredi, mingo

On 09/22, Peter Zijlstra wrote:
>
> +static void wait_task_inactive_sched_in(struct preempt_notifier *n, int cpu)
> +{
> +	struct task_struct *p;
> +	struct wait_task_inactive_blocked *blocked =
> +		container_of(n, struct wait_task_inactive_blocked, notifier);
> +
> +	hlist_del(&n->link);
> +
> +	p = ACCESS_ONCE(blocked->waiter);
> +	blocked->waiter = NULL;
> +	wake_up_process(p);
> +}
> ...
> +static void
> +wait_task_inactive_sched_out(struct preempt_notifier *n, struct task_struct *next)
> +{
> +	if (current->on_rq) /* we're not inactive yet */
> +		return;
> +
> +	hlist_del(&n->link);
> +	n->ops = &wait_task_inactive_ops_post;
> +	hlist_add_head(&n->link, &next->preempt_notifiers);
> +}

Tricky ;) Yes, the first ->sched_out() is not enough.

>  unsigned long wait_task_inactive(struct task_struct *p, long match_state)
>  {
> ...
> +	rq = task_rq_lock(p, &flags);
> +	trace_sched_wait_task(p);
> +	if (!p->on_rq) /* we're already blocked */
> +		goto done;

This doesn't look right. schedule() clears ->on_rq long before
__switch_to() etc.

And it seems that we check ->on_cpu above; this is not UP-friendly.

>
> -			set_current_state(TASK_UNINTERRUPTIBLE);
> -			schedule_hrtimeout(&to, HRTIMER_MODE_REL);
> -			continue;
> -		}
> +	hlist_add_head(&blocked.notifier.link, &p->preempt_notifiers);
> +	task_rq_unlock(rq, p, &flags);

I thought about reimplementing wait_task_inactive() too, but afaics there
is a problem: why can't we race with p doing preempt_notifier_register()?
I guess preempt_notifier_register() needs rq->lock too.

Oleg.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-22 14:52                   ` Oleg Nesterov
@ 2011-09-22 15:13                     ` Peter Zijlstra
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-22 15:13 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mike Galbraith, linux-rt-users, Thomas Gleixner, LKML,
	Miklos Szeredi, mingo

On Thu, 2011-09-22 at 16:52 +0200, Oleg Nesterov wrote:
> On 09/22, Peter Zijlstra wrote:
> >
> > +static void wait_task_inactive_sched_in(struct preempt_notifier *n, int cpu)
> > +{
> > +	struct task_struct *p;
> > +	struct wait_task_inactive_blocked *blocked =
> > +		container_of(n, struct wait_task_inactive_blocked, notifier);
> > +
> > +	hlist_del(&n->link);
> > +
> > +	p = ACCESS_ONCE(blocked->waiter);
> > +	blocked->waiter = NULL;
> > +	wake_up_process(p);
> > +}
> > ...
> > +static void
> > +wait_task_inactive_sched_out(struct preempt_notifier *n, struct task_struct *next)
> > +{
> > +	if (current->on_rq) /* we're not inactive yet */
> > +		return;
> > +
> > +	hlist_del(&n->link);
> > +	n->ops = &wait_task_inactive_ops_post;
> > +	hlist_add_head(&n->link, &next->preempt_notifiers);
> > +}
> 
> Tricky ;) Yes, the first ->sched_out() is not enough.

Not enough isn't the problem; it's run with rq->lock held and irqs
disabled, so you simply cannot do ttwu() from there.

If we could, the subsequent task_rq_lock() in wait_task_inactive() would
be enough to serialize against the still in-flight context switch.

One of the problems with doing it from the next task's sched_in
notifier is that next can be idle, and then we do an A -> idle -> B
switch, which is of course sub-optimal.

> >  unsigned long wait_task_inactive(struct task_struct *p, long match_state)
> >  {
> > ...
> > +	rq = task_rq_lock(p, &flags);
> > +	trace_sched_wait_task(p);
> > +	if (!p->on_rq) /* we're already blocked */
> > +		goto done;
> 
> This doesn't look right. schedule() clears ->on_rq a long before
> __switch_to/etc.

Oh, bugger, yes, it's before we can drop the rq for idle balance and
nonsense like that. (!p->on_rq && !p->on_cpu) should suffice, I think.
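
Something like this, completely untested:

	rq = task_rq_lock(p, &flags);
	trace_sched_wait_task(p);
	/* the context switch has fully completed, nothing to wait for */
	if (!p->on_rq && !p->on_cpu)
		goto done;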

> And it seems that we check ->on_cpu above, this is not UP friendly.

True, but it's what the old code did... and I was seeing performance
suckage compared to the unpatched kernel (not that the p->on_cpu busy
wait fixed it)...

> >
> > -			set_current_state(TASK_UNINTERRUPTIBLE);
> > -			schedule_hrtimeout(&to, HRTIMER_MODE_REL);
> > -			continue;
> > -		}
> > +	hlist_add_head(&blocked.notifier.link, &p->preempt_notifiers);
> > +	task_rq_unlock(rq, p, &flags);
> 
> I thought about reimplementing wait_task_inactive() too, but afaics there
> is a problem: why we can't race with p doing register_preempt_notifier() ?
> I guess register_ needs rq->lock too.

We can actually, now you mention it... ->pi_lock would be sufficient and
less expensive to acquire.
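
Something like the below, equally untested; preempt_notifier_register()
would then have to take ->pi_lock as well:

	raw_spin_lock_irqsave(&p->pi_lock, flags);
	hlist_add_head(&blocked.notifier.link, &p->preempt_notifiers);
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);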


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: rt14: strace ->  migrate_disable_atomic imbalance
  2011-09-22 14:05                         ` Mike Galbraith
@ 2011-09-22 15:20                           ` Peter Zijlstra
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2011-09-22 15:20 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-rt-users, Thomas Gleixner, LKML, Oleg Nesterov,
	Miklos Szeredi, mingo

On Thu, 2011-09-22 at 16:05 +0200, Mike Galbraith wrote:
> > strace ./jitter -c 3 -p 99 -f 1000 -t 10 -r
> > 

> I thought it was RT specific, but it's not after all, a 3.0.4 distro
> desktop (preempt) kernel did the same after a bit. 

I can't reproduce on a recent -tip; could be my machine, could be -tip.
Let me try more.

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2011-09-22 15:20 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-10  9:12 [ANNOUNCE] 3.0.4-rt13 Thomas Gleixner
2011-09-10 14:53 ` Madovsky
2011-09-10 17:27 ` Rolando Martins
2011-09-11 10:35 ` Mike Galbraith
2011-09-11 10:35   ` Mike Galbraith
2011-09-11 17:01   ` Mike Galbraith
2011-09-12  7:24     ` Thomas Gleixner
2011-09-12  8:59   ` Peter Zijlstra
2011-09-12  9:05     ` Mike Galbraith
2011-09-12 13:52     ` Mike Galbraith
2011-09-12 14:53       ` Mike Galbraith
2011-09-13 13:36         ` Peter Zijlstra
2011-09-13 15:17           ` Mike Galbraith
2011-09-13 15:08         ` Peter Zijlstra
2011-09-13 15:28           ` Mike Galbraith
2011-09-13 16:13             ` Peter Zijlstra
2011-09-21 10:17               ` rt14: strace -> migrate_disable_atomic imbalance Mike Galbraith
2011-09-21 17:01                 ` Peter Zijlstra
2011-09-21 18:50                 ` Peter Zijlstra
2011-09-21 18:50                   ` Peter Zijlstra
2011-09-22  4:46                   ` Mike Galbraith
2011-09-22  6:31                     ` Peter Zijlstra
2011-09-22  8:38                 ` Peter Zijlstra
2011-09-22 10:00                 ` Peter Zijlstra
2011-09-22 10:00                   ` Peter Zijlstra
2011-09-22 11:55                   ` Mike Galbraith
2011-09-22 12:09                     ` Peter Zijlstra
2011-09-22 13:42                       ` Mike Galbraith
2011-09-22 14:05                         ` Mike Galbraith
2011-09-22 15:20                           ` Peter Zijlstra
2011-09-22 14:34                         ` Peter Zijlstra
2011-09-22 14:38                           ` Mike Galbraith
2011-09-22 14:41                             ` Mike Galbraith
2011-09-22 14:41                             ` Peter Zijlstra
2011-09-22 14:46                               ` Mike Galbraith
2011-09-22 11:31                 ` Peter Zijlstra
2011-09-22 11:46                 ` Peter Zijlstra
2011-09-22 11:46                   ` Peter Zijlstra
2011-09-22 14:52                   ` Oleg Nesterov
2011-09-22 15:13                     ` Peter Zijlstra
2011-09-14  9:57             ` [PATCH -rt] ipc/sem: Rework semaphore wakeups Peter Zijlstra
2011-09-14 13:02               ` Mike Galbraith
2011-09-14 18:48               ` Manfred Spraul
2011-09-14 19:23                 ` Peter Zijlstra
2011-09-15 17:04                   ` Manfred Spraul
2011-09-12 10:04   ` [ANNOUNCE] 3.0.4-rt13 Peter Zijlstra
2011-09-12 11:33     ` Mike Galbraith
2011-09-11 18:14 ` Mike Galbraith
2011-09-12  7:33   ` Thomas Gleixner
2011-09-12  8:05     ` Mike Galbraith
2011-09-12  8:43       ` Mike Galbraith
