LKML Archive on lore.kernel.org
* [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
@ 2019-03-26  9:34 Juri Lelli
  2019-03-28 10:17 ` Sebastian Andrzej Siewior
  2019-04-19  8:56 ` Juri Lelli
  0 siblings, 2 replies; 26+ messages in thread
From: Juri Lelli @ 2019-03-26  9:34 UTC (permalink / raw)
  To: linux-rt-users
  Cc: LKML, Thomas Gleixner, Sebastian Andrzej Siewior,
	Daniel Bristot de Oliveira, Clark Williams

Hi,

Running this reproducer on a 4.19.25-rt16 kernel (with lock debugging
turned on) produces the warning below.

 --->8---
 # dd if=/dev/zero of=fsfreezetest count=999999
 # mkfs -t xfs -q ./fsfreezetest
 # mkdir testmount
 # mount -t xfs -o loop ./fsfreezetest ./testmount
 # for I in `seq 10`; do fsfreeze -f ./testmount; sleep 1; fsfreeze -u ./testmount; done
 --->8---

 ------------[ cut here ]------------
 DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current)
 WARNING: CPU: 10 PID: 1226 at kernel/locking/rtmutex-debug.c:145 debug_rt_mutex_unlock+0x9b/0xb0
 Modules linked in: xfs [...]
 CPU: 10 PID: 1226 Comm: fsfreeze Not tainted 4.19.25-rt16 #2
 Hardware name: LENOVO 30B6S2F900/1030, BIOS S01KT61A 09/28/2018
 RIP: 0010:debug_rt_mutex_unlock+0x9b/0xb0
 Code: e8 aa af 3c 00 4c 8b 04 24 85 c0 74 a9 8b 05 3c 9c a6 02 85 c0 75 9f 48 c7 c6 b8 b4 2d 98 48 c7 c7 9b d2 2b 98 e8 d9 e5 f8 ff <0f> 0b 4c 8b 04 24 eb 84 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 c3
 RSP: 0018:ffffa7efa60cbdd0 EFLAGS: 00010086
 RAX: 0000000000000000 RBX: ffff991b72813920 RCX: 0000000000000000
 RDX: 0000000000000007 RSI: ffffffff98318de2 RDI: 00000000ffffffff
 RBP: 0000000000000246 R08: 0000000000000000 R09: 0000000000024200
 R10: 0000000000000000 R11: 0000000000000000 R12: ffffa7efa60cbe08
 R13: ffffa7efa60cbe18 R14: ffff991b72813478 R15: ffffffff9730718d
 FS:  00007f19baf6a540(0000) GS:ffff991b9fb00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f19bae87040 CR3: 000000103c6ee002 CR4: 00000000001606e0
 Call Trace:
  rt_mutex_slowunlock+0x24/0x70
  __rt_mutex_unlock+0x45/0x80
  percpu_up_write+0x4b/0x60
  thaw_super_locked+0xdb/0x110
  do_vfs_ioctl+0x647/0x6f0
  ksys_ioctl+0x60/0x90
  __x64_sys_ioctl+0x16/0x20
  do_syscall_64+0x60/0x1f0
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 RIP: 0033:0x7f19bae9704b
 Code: 0f 1e fa 48 8b 05 3d be 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d be 0c 00 f7 d8 64 89 01 48
 RSP: 002b:00007ffc6d275358 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
 RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f19bae9704b
 RDX: 0000000000000000 RSI: 00000000c0045878 RDI: 0000000000000003
 RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffc6d275755
 R13: 00007ffc6d275500 R14: 0000000000000000 R15: 0000000000000000
 irq event stamp: 8002
 hardirqs last  enabled at (8001): [<ffffffff97a25981>] _raw_spin_unlock_irqrestore+0x81/0x90
 hardirqs last disabled at (8002): [<ffffffff97a25aa0>] _raw_spin_lock_irqsave+0x20/0x60
 softirqs last  enabled at (0): [<ffffffff970c04ad>] copy_process.part.36+0x89d/0x2170
 softirqs last disabled at (0): [<0000000000000000>]           (null)
 ---[ end trace 0000000000000002 ]---

AFAIU, this is a legit warning, since

 fsfreeze -f ./testmount grabs rt_mutexes embedded into
 sb->s_writers.rw_sem[SB_FREEZE_LEVELS] (rt-rwsem) as part of executing
 sb_wait_write() (for each FREEZE_LEVEL) in freeze_super().

 We then return to userspace.

 fsfreeze -u ./testmount unlocks the rt_mutexes while doing
 sb_freeze_unlock() in thaw_super_locked(). This is a different process
 from the one that did the freeze above.

I noticed that a very similar problem was fixed (for !rt rwsem) by
5a817641f68a ("locking/percpu-rwsem: Annotate rwsem ownership transfer
by setting RWSEM_OWNER_UNKNOWN"). However, RT has of course to deal with
PI, so I wonder if there is an easy fix for this problem.

Suggestions?

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-03-26  9:34 [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16) Juri Lelli
@ 2019-03-28 10:17 ` Sebastian Andrzej Siewior
  2019-04-19  8:56 ` Juri Lelli
  1 sibling, 0 replies; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-03-28 10:17 UTC (permalink / raw)
  To: Juri Lelli, Peter Zijlstra
  Cc: linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams

On 2019-03-26 10:34:21 [+0100], Juri Lelli wrote:
> Hi,
Hi,

…
>  # for I in `seq 10`; do fsfreeze -f ./testmount; sleep 1; fsfreeze -u ./testmount; done
> 
>  ------------[ cut here ]------------
>  DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current)
>  WARNING: CPU: 10 PID: 1226 at kernel/locking/rtmutex-debug.c:145 debug_rt_mutex_unlock+0x9b/0xb0
>  Modules linked in: xfs [...]
>  CPU: 10 PID: 1226 Comm: fsfreeze Not tainted 4.19.25-rt16 #2
>  Hardware name: LENOVO 30B6S2F900/1030, BIOS S01KT61A 09/28/2018
>  RIP: 0010:debug_rt_mutex_unlock+0x9b/0xb0
>   __rt_mutex_unlock+0x45/0x80
>   percpu_up_write+0x4b/0x60
>   thaw_super_locked+0xdb/0x110
> AFAIU, this is a legit warning, since
> 
>  fsfreeze -f ./testmount grabs rt_mutexes embedded into
>  sb->s_writers.rw_sem[SB_FREEZE_LEVELS] (rt-rwsem) as part of executing
>  sb_wait_write() (for each FREEZE_LEVEL) in freeze_super().
> 
>  We then return to userspace.
> 
>  fsfreeze -u ./testmount unlocks the rt_mutexes while doing
>  sb_freeze_unlock() in thaw_super_locked(). This is a different process
>  w.r.t. the one that did the freeze above.
> 
> I noticed that a very similar problem was fixed (for !rt rwsem) by
> 5a817641f68a ("locking/percpu-rwsem: Annotate rwsem ownership transfer
> by setting RWSEM_OWNER_UNKNOWN"). However, RT has of course to deal with
> PI, so I wonder if there is an easy fix for this problem.
> 
> Suggestions?

So we return to userland with an acquired rtmutex. And lockdep doesn't
complain because lockdep_sb_freeze_release() /
lockdep_sb_freeze_acquire() tell it that everything is okay.
I have no idea, PeterZ? The rwsem is not ownerless afaik.

> Thanks,
> 
> - Juri

Sebastian

* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-03-26  9:34 [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16) Juri Lelli
  2019-03-28 10:17 ` Sebastian Andrzej Siewior
@ 2019-04-19  8:56 ` Juri Lelli
  2019-04-30 12:51   ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 26+ messages in thread
From: Juri Lelli @ 2019-04-19  8:56 UTC (permalink / raw)
  To: linux-rt-users
  Cc: LKML, Thomas Gleixner, Sebastian Andrzej Siewior,
	Daniel Bristot de Oliveira, Clark Williams

On 26/03/19 10:34, Juri Lelli wrote:
> Hi,
> 
> Running this reproducer on a 4.19.25-rt16 kernel (with lock debugging
> turned on) produces warning below.

And I now think this might lead to an actual crash.

I got the splat below while running the xfstests suite [1] on 4.19.31-rt18.
The generic/390 test seems able to reliably reproduce it:
https://github.com/kdave/xfstests/blob/master/tests/generic/390

--->8---
[  275.798877] run fstests generic/390 at 2019-04-19 08:31:44
[  276.538394] XFS (loop1): Unmounting Filesystem
[  277.309404] ------------[ cut here ]------------
[  277.309406] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current)
[  277.309421] WARNING: CPU: 4 PID: 2012 at kernel/locking/rtmutex-debug.c:145 debug_rt_mutex_unlock+0x47/0x50
[  277.309422] Modules linked in: loop xfs ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables sunrpc intel_rapl sb_edac intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore intel_rapl_perf iwlwifi wmi_bmof intel_wmi_thunderbolt pcspkr joydev cfg80211 btusb btrtl btbcm i2c_i801 btintel snd_hda_codec_hdmi snd_hda_codec_realtek bluetooth iTCO_wdt snd_hda_codec_generic rtsx_usb_ms iTCO_vendor_support memstick mei_wdt lpc_ich ecdh_generic rfkill snd_hda_intel
[  277.309467]  snd_hda_codec mei_me mei snd_hda_core snd_hwdep snd_pcm snd_timer snd soundcore pcc_cpufreq nouveau crc32c_intel video drm_kms_helper rtsx_usb_sdmmc igb ttm mmc_core dca i2c_algo_bit drm e1000e rtsx_usb ata_generic pata_acpi r8169 realtek mxm_wmi wmi
[  277.309491] CPU: 4 PID: 2012 Comm: xfs_io Not tainted 4.19.31-rt18 #1
[  277.309492] Hardware name: LENOVO 30B6S2F900/1030, BIOS S01KT61A 09/28/2018
[  277.309495] RIP: 0010:debug_rt_mutex_unlock+0x47/0x50
[  277.309497] Code: c2 75 01 c3 e8 ca b8 3c 00 85 c0 74 f6 8b 05 80 bc a6 02 85 c0 75 ec 48 c7 c6 70 ba 2d 82 48 c7 c7 7b d8 2b 82 e8 5d e6 f8 ff <0f> 0b c3 66 0f 1f 44 00 00 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f
[  277.309498] RSP: 0018:ffffb07dea287dd8 EFLAGS: 00010086
[  277.309500] RAX: 0000000000000000 RBX: ffff906d43c33920 RCX: 0000000000000000
[  277.309501] RDX: 0000000000000007 RSI: ffffffff8231935a RDI: 00000000ffffffff
[  277.309502] RBP: 0000000000000246 R08: 0000000000000000 R09: 0000000000024200
[  277.309503] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb07dea287e08
[  277.309504] R13: ffffb07dea287e18 R14: ffff906d43c33478 R15: ffffffff8130757d
[  277.309506] FS:  00007f586dd06880(0000) GS:ffff906d5fb00000(0000) knlGS:0000000000000000
[  277.309507] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  277.309509] CR2: 000055aee375c088 CR3: 0000000844b90005 CR4: 00000000001606e0
[  277.309510] Call Trace:
[  277.309516]  rt_mutex_slowunlock+0x24/0x70
[  277.309521]  __rt_mutex_unlock+0x45/0x80
[  277.309527]  percpu_up_write+0x1f/0x30
[  277.309533]  thaw_super_locked+0xdb/0x110
[  277.309538]  do_vfs_ioctl+0x647/0x6f0
[  277.309544]  ksys_ioctl+0x60/0x90
[  277.309547]  __x64_sys_ioctl+0x16/0x20
[  277.309552]  do_syscall_64+0x60/0x1f0
[  277.309556]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  277.309558] RIP: 0033:0x7f586e22804b
[  277.309560] Code: 0f 1e fa 48 8b 05 3d be 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d be 0c 00 f7 d8 64 89 01 48
[  277.309561] RSP: 002b:00007ffd7b70b378 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  277.309563] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f586e22804b
[  277.309564] RDX: 00007ffd7b70b384 RSI: ffffffffc0045878 RDI: 0000000000000003
[  277.309565] RBP: 000055aee375be90 R08: 00007f586e2f4c60 R09: 000055aee375c080
[  277.309567] R10: 000055aee3759010 R11: 0000000000000246 R12: 0000000000000001
[  277.309568] R13: 000055aee375be70 R14: 000055aee375be70 R15: 000055aee375bd20
[  277.309575] irq event stamp: 24666
[  277.309578] hardirqs last  enabled at (24665): [<ffffffff81a27201>] _raw_spin_unlock_irqrestore+0x81/0x90
[  277.309581] hardirqs last disabled at (24666): [<ffffffff81a27320>] _raw_spin_lock_irqsave+0x20/0x60
[  277.309586] softirqs last  enabled at (0): [<ffffffff810c04ed>] copy_process.part.36+0x89d/0x2170
[  277.309588] softirqs last disabled at (0): [<0000000000000000>]           (null)
[  277.309589] ---[ end trace 0000000000000002 ]---
[  277.654909] XFS (loop1): Mounting V5 Filesystem
[  277.665152] XFS (loop1): Ending clean mount
[  279.932112] BUG: unable to handle kernel NULL pointer dereference at 000000000000008e
[  279.932114] PGD 0 P4D 0
[  279.932118] Oops: 0000 [#1] PREEMPT SMP PTI
[  279.932122] CPU: 10 PID: 2262 Comm: sh Tainted: G        W         4.19.31-rt18 #1
[  279.932123] Hardware name: LENOVO 30B6S2F900/1030, BIOS S01KT61A 09/28/2018
[  279.932133] RIP: 0010:rt_mutex_enqueue_pi+0x35/0x70
[  279.932135] Code: 00 0a 00 00 ba 01 00 00 00 45 31 c9 48 89 f1 eb 0d 85 c9 78 40 48 8d 48 08 31 d2 49 89 c1 48 8b 01 48 85 c0 74 0f 41 8b 48 60 <3b> 48 48 7d e2 48 8d 48 10 eb e6 49 8d 78 18 4d 89 48 18 49 c7 40
[  279.932136] RSP: 0018:ffffb07dea3cfaf8 EFLAGS: 00010002
[  279.932138] RAX: 0000000000000046 RBX: ffff9075450e5380 RCX: 0000000000000082
[  279.932139] RDX: 0000000000000001 RSI: ffff90753f658a00 RDI: ffff90753f658000
[  279.932140] RBP: ffffb07dea3cfba0 R08: ffffb07dea3cfba0 R09: ffffb07dea83bd30
[  279.932141] R10: 0000000000000000 R11: 0000000000000000 R12: ffff90753f658000
[  279.932142] R13: ffff90753f6589b8 R14: ffffb07dea3cfba0 R15: ffff90754b235590
[  279.932144] FS:  0000000000000000(0000) GS:ffff90755fb00000(0000) knlGS:0000000000000000
[  279.932145] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  279.932146] CR2: 000000000000008e CR3: 0000001045326006 CR4: 00000000001606e0
[  279.932147] Call Trace:
[  279.932154]  task_blocks_on_rt_mutex+0x1e2/0x240
[  279.932164]  rt_mutex_slowlock_locked+0xb8/0x300
[  279.932170]  rt_mutex_slowlock+0x78/0xd0
[  279.932178]  __down_write_common+0x24/0x170
[  279.932185]  __vma_adjust+0x124/0x970
[  279.932192]  __split_vma+0xf4/0x1a0
[  279.932196]  do_munmap+0xf7/0x430
[  279.932200]  mmap_region+0xb2/0x650
[  279.932208]  ? selinux_file_mprotect+0x140/0x140
[  279.932212]  do_mmap+0x38e/0x510
[  279.932218]  vm_mmap_pgoff+0xd5/0x130
[  279.932225]  ksys_mmap_pgoff+0x1b8/0x270
[  279.932231]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  279.932235]  do_syscall_64+0x60/0x1f0
[  279.932240]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  279.932243] RIP: 0033:0x7f79b2d6dd87
[  279.932244] Code: 54 41 89 d4 55 48 89 fd 53 4c 89 cb 48 85 ff 74 4a 49 89 d9 45 89 f8 45 89 f2 44 89 e2 4c 89 ee 48 89 ef b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 61 5b 5d 41 5c 41 5d 41 5e 41 5f c3 66 0f 1f
[  279.932246] RSP: 002b:00007ffcc293a098 EFLAGS: 00000206 ORIG_RAX: 0000000000000009
[  279.932248] RAX: ffffffffffffffda RBX: 0000000000029000 RCX: 00007f79b2d6dd87
[  279.932249] RDX: 0000000000000003 RSI: 0000000000005000 RDI: 00007f79b2d40000
[  279.932250] RBP: 00007f79b2d40000 R08: 0000000000000003 R09: 0000000000029000
[  279.932251] R10: 0000000000000812 R11: 0000000000000206 R12: 0000000000000003
[  279.932252] R13: 0000000000005000 R14: 0000000000000812 R15: 0000000000000003
[  279.932259] Modules linked in: loop xfs ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables sunrpc intel_rapl sb_edac intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore intel_rapl_perf iwlwifi wmi_bmof intel_wmi_thunderbolt pcspkr joydev cfg80211 btusb btrtl btbcm i2c_i801 btintel snd_hda_codec_hdmi snd_hda_codec_realtek bluetooth iTCO_wdt snd_hda_codec_generic rtsx_usb_ms iTCO_vendor_support memstick mei_wdt lpc_ich ecdh_generic rfkill snd_hda_intel
[  279.932306]  snd_hda_codec mei_me mei snd_hda_core snd_hwdep snd_pcm snd_timer snd soundcore pcc_cpufreq nouveau crc32c_intel video drm_kms_helper rtsx_usb_sdmmc igb ttm mmc_core dca i2c_algo_bit drm e1000e rtsx_usb ata_generic pata_acpi r8169 realtek mxm_wmi wmi
[  279.932328] CR2: 000000000000008e
[  280.273956] ---[ end trace 0000000000000003 ]---
[  280.273959] RIP: 0010:rt_mutex_enqueue_pi+0x35/0x70
[  280.273960] Code: 00 0a 00 00 ba 01 00 00 00 45 31 c9 48 89 f1 eb 0d 85 c9 78 40 48 8d 48 08 31 d2 49 89 c1 48 8b 01 48 85 c0 74 0f 41 8b 48 60 <3b> 48 48 7d e2 48 8d 48 10 eb e6 49 8d 78 18 4d 89 48 18 49 c7 40
[  280.273961] RSP: 0018:ffffb07dea3cfaf8 EFLAGS: 00010002
[  280.273963] RAX: 0000000000000046 RBX: ffff9075450e5380 RCX: 0000000000000082
[  280.273964] RDX: 0000000000000001 RSI: ffff90753f658a00 RDI: ffff90753f658000
[  280.273965] RBP: ffffb07dea3cfba0 R08: ffffb07dea3cfba0 R09: ffffb07dea83bd30
[  280.273966] R10: 0000000000000000 R11: 0000000000000000 R12: ffff90753f658000
[  280.273967] R13: ffff90753f6589b8 R14: ffffb07dea3cfba0 R15: ffff90754b235590
[  280.273968] FS:  0000000000000000(0000) GS:ffff90755fb00000(0000) knlGS:0000000000000000
[  280.273970] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  280.273971] CR2: 000000000000008e CR3: 0000001045326006 CR4: 00000000001606e0
[  280.273972] ------------[ cut here ]------------
--->8---

Since, AFAIU, generic/390 performs a multi-threaded freeze/unfreeze loop, the
warning and the actual crash look related to me.

Best,

- Juri

1 - https://github.com/kdave/xfstests

* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-04-19  8:56 ` Juri Lelli
@ 2019-04-30 12:51   ` Sebastian Andrzej Siewior
  2019-04-30 13:28     ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-04-30 12:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli

On 2019-04-19 10:56:27 [+0200], Juri Lelli wrote:
> On 26/03/19 10:34, Juri Lelli wrote:
> > Hi,
> > 
> > Running this reproducer on a 4.19.25-rt16 kernel (with lock debugging
> > turned on) produces warning below.
> 
> And I now think this might lead to an actual crash.

Peter, could you please take a look at the thread:
  https://lkml.kernel.org/r/20190419085627.GI4742@localhost.localdomain

I assumed that returning to userland with acquired locks is something we
did not want…

Sebastian

* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-04-30 12:51   ` Sebastian Andrzej Siewior
@ 2019-04-30 13:28     ` Peter Zijlstra
  2019-04-30 13:45       ` Sebastian Andrzej Siewior
                         ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-04-30 13:28 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli,
	Oleg Nesterov, jack

On Tue, Apr 30, 2019 at 02:51:31PM +0200, Sebastian Andrzej Siewior wrote:
> On 2019-04-19 10:56:27 [+0200], Juri Lelli wrote:
> > On 26/03/19 10:34, Juri Lelli wrote:
> > > Hi,
> > > 
> > > Running this reproducer on a 4.19.25-rt16 kernel (with lock debugging
> > > turned on) produces warning below.
> > 
> > And I now think this might lead to an actual crash.
> 
> Peter, could you please take a look at the thread:
>   https://lkml.kernel.org/r/20190419085627.GI4742@localhost.localdomain
> 
> I assumed that returning to userland with acquired locks is something we
> did not want…

Yeah, but AFAIK fs freezing code has a history of doing exactly that..
This is just the latest incarnation here.

So the immediate problem here is that the task doing thaw isn't the same
that did freeze, right? The thing is, I'm not seeing how that isn't a
problem with upstream either.

The freeze code seems to do: percpu_down_write() for the various states,
and then frobs lockdep state.

Thaw then does the reverse, frobs lockdep and then does: percpu_up_write().

percpu_down_write() directly relies on down_write(), and
percpu_up_write() on up_write(). And note how __up_write() has:

	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);

So why isn't this same code coming unstuck in mainline?

* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-04-30 13:28     ` Peter Zijlstra
@ 2019-04-30 13:45       ` Sebastian Andrzej Siewior
  2019-04-30 14:01         ` Peter Zijlstra
  2019-04-30 14:15       ` Oleg Nesterov
  2019-05-01 17:09       ` Peter Zijlstra
  2 siblings, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-04-30 13:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli,
	Oleg Nesterov, jack

On 2019-04-30 15:28:11 [+0200], Peter Zijlstra wrote:
> On Tue, Apr 30, 2019 at 02:51:31PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2019-04-19 10:56:27 [+0200], Juri Lelli wrote:
> > > On 26/03/19 10:34, Juri Lelli wrote:
> > > > Hi,
> > > > 
> > > > Running this reproducer on a 4.19.25-rt16 kernel (with lock debugging
> > > > turned on) produces warning below.
> > > 
> > > And I now think this might lead to an actual crash.
> > 
> > Peter, could you please take a look at the thread:
> >   https://lkml.kernel.org/r/20190419085627.GI4742@localhost.localdomain
> > 
> > I assumed that returning to userland with acquired locks is something we
> > did not want…
> 
> Yeah, but AFAIK fs freezing code has a history of doing exactly that..
> This is just the latest incarnation here.
> 
> So the immediate problem here is that the task doing thaw isn't the same
> that did freeze, right? The thing is, I'm not seeing how that isn't a
> problem with upstream either.
> 
> The freeze code seems to do: percpu_down_write() for the various states,
> and then frobs lockdep state.
> 
> Thaw then does the reverse, frobs lockdep and then does: percpu_up_write().
> 
> percpu_down_write() directly relies on down_write(), and
> percpu_up_write() on up_write(). And note how __up_write() has:
> 
> 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
> 
> So why isn't this same code coming unstuck in mainline?

I have to re-route most of these questions to Juri Lelli.
Lockdep has these gems:
	lockdep_sb_freeze_release() / lockdep_sb_freeze_acquire()

Sebastian

* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-04-30 13:45       ` Sebastian Andrzej Siewior
@ 2019-04-30 14:01         ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-04-30 14:01 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli,
	Oleg Nesterov, jack

On Tue, Apr 30, 2019 at 03:45:48PM +0200, Sebastian Andrzej Siewior wrote:
> On 2019-04-30 15:28:11 [+0200], Peter Zijlstra wrote:
> > On Tue, Apr 30, 2019 at 02:51:31PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2019-04-19 10:56:27 [+0200], Juri Lelli wrote:
> > > > On 26/03/19 10:34, Juri Lelli wrote:
> > > > > Hi,
> > > > > 
> > > > > Running this reproducer on a 4.19.25-rt16 kernel (with lock debugging
> > > > > turned on) produces warning below.
> > > > 
> > > > And I now think this might lead to an actual crash.
> > > 
> > > Peter, could you please take a look at the thread:
> > >   https://lkml.kernel.org/r/20190419085627.GI4742@localhost.localdomain
> > > 
> > > I assumed that returning to userland with acquired locks is something we
> > > did not want…
> > 
> > Yeah, but AFAIK fs freezing code has a history of doing exactly that..
> > This is just the latest incarnation here.
> > 
> > So the immediate problem here is that the task doing thaw isn't the same
> > that did freeze, right? The thing is, I'm not seeing how that isn't a
> > problem with upstream either.
> > 
> > The freeze code seems to do: percpu_down_write() for the various states,
> > and then frobs lockdep state.
> > 
> > Thaw then does the reverse, frobs lockdep and then does: percpu_up_write().
> > 
> > percpu_down_write() directly relies on down_write(), and
> > percpu_up_write() on up_write(). And note how __up_write() has:
> > 
> > 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
> > 
> > So why isn't this same code coming unstuck in mainline?
> 
> I have to re-route most of this questions to Juri Lelli.
> Lockdep has these gems:
> 	lockdep_sb_freeze_release() / lockdep_sb_freeze_acquire()

Yeah, saw those, but irrespective of them, the rwsem code (not
percpu_rwsem) should complain about freeze and thaw not being the same
process.

Anyway; it's Oleg and Jan who put this together. I simply don't see how
upstream is correct here.

* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-04-30 13:28     ` Peter Zijlstra
  2019-04-30 13:45       ` Sebastian Andrzej Siewior
@ 2019-04-30 14:15       ` Oleg Nesterov
  2019-04-30 14:29         ` Peter Zijlstra
  2019-04-30 14:42         ` Oleg Nesterov
  2019-05-01 17:09       ` Peter Zijlstra
  2 siblings, 2 replies; 26+ messages in thread
From: Oleg Nesterov @ 2019-04-30 14:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli, jack

Sorry, I don't understand...

On 04/30, Peter Zijlstra wrote:
>
> Thaw then does the reverse, frobs lockdep

Yes, in particular it does

	lockdep_sb_freeze_acquire()
		percpu_rwsem_acquire()
			sem->rw_sem.owner = current;

	
> and then does: percpu_up_write().
>
> percpu_up_write() on up_write(). And note how __up_write() has:
>
> 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);

and everything looks correct, sem->owner == current by the time
thaw_super_locked() does percpu_up_write/up_write.

Oleg.


* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-04-30 14:15       ` Oleg Nesterov
@ 2019-04-30 14:29         ` Peter Zijlstra
  2019-04-30 14:42         ` Oleg Nesterov
  1 sibling, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-04-30 14:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sebastian Andrzej Siewior, linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli, jack

On Tue, Apr 30, 2019 at 04:15:01PM +0200, Oleg Nesterov wrote:
> Sorry, I don't understand...

So the problem is that on -RT rwsem uses PI mutexes, and the below just
cannot work.

Also see:

  https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/include/linux/rwsem_rt.h?h=linux-5.0.y-rt
  https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/kernel/locking/rwsem-rt.c?h=linux-5.0.y-rt

> On 04/30, Peter Zijlstra wrote:
> >
> > Thaw then does the reverse, frobs lockdep
> 
> Yes, in particular it does
> 
> 	lockdep_sb_freeze_acquire()
> 		percpu_rwsem_acquire()
> 			sem->rw_sem.owner = current;
> 

Oh ARGGH, I missed that. This is horrible :-(


* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-04-30 14:15       ` Oleg Nesterov
  2019-04-30 14:29         ` Peter Zijlstra
@ 2019-04-30 14:42         ` Oleg Nesterov
  2019-04-30 14:44           ` Peter Zijlstra
  1 sibling, 1 reply; 26+ messages in thread
From: Oleg Nesterov @ 2019-04-30 14:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli, jack

I have cloned linux-rt-devel.git

If I understand correctly, on RT rw_semaphore is actually defined in rwsem_rt.h,
so percpu_rwsem_acquire() should probably do

	sem->rw_sem.rtmutex.owner = current;

?


On 04/30, Oleg Nesterov wrote:
>
> Sorry, I don't understand...
> 
> On 04/30, Peter Zijlstra wrote:
> >
> > Thaw then does the reverse, frobs lockdep
> 
> Yes, in particular it does
> 
> 	lockdep_sb_freeze_acquire()
> 		percpu_rwsem_acquire()
> 			sem->rw_sem.owner = current;
> 
> 	
> > and then does: percpu_up_write().
> >
> > percpu_up_write() on up_write(). And note how __up_write() has:
> >
> > 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
> 
> and everything looks correct, sem->owner == current by the time
> thaw_super_locked() does percpu_up_write/up_write.
> 
> Oleg.


* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-04-30 14:42         ` Oleg Nesterov
@ 2019-04-30 14:44           ` Peter Zijlstra
  2019-04-30 14:53             ` Oleg Nesterov
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2019-04-30 14:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sebastian Andrzej Siewior, linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli, jack

On Tue, Apr 30, 2019 at 04:42:53PM +0200, Oleg Nesterov wrote:
> I have cloned linux-rt-devel.git
> 
> If I understand correctly, in rt rw_semaphore is actually defined in rwsem_rt.h
> so percpu_rwsem_acquire() should probably do
> 
> 	sem->rw_sem.rtmutex.owner = current;

That'll screw the PI chain (if there is one), right?

* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-04-30 14:44           ` Peter Zijlstra
@ 2019-04-30 14:53             ` Oleg Nesterov
  0 siblings, 0 replies; 26+ messages in thread
From: Oleg Nesterov @ 2019-04-30 14:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli, jack

On 04/30, Peter Zijlstra wrote:
>
> On Tue, Apr 30, 2019 at 04:42:53PM +0200, Oleg Nesterov wrote:
> > I have cloned linux-rt-devel.git
> >
> > If I understand correctly, in rt rw_semaphore is actually defined in rwsem_rt.h
> > so percpu_rwsem_acquire() should probably do
> >
> > 	sem->rw_sem.rtmutex.owner = current;
>
> That'll screw the PI chain (if there is one), right?

Yes, I had already realized this can't work after I glanced at the code, thanks.

Oleg.


* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-04-30 13:28     ` Peter Zijlstra
  2019-04-30 13:45       ` Sebastian Andrzej Siewior
  2019-04-30 14:15       ` Oleg Nesterov
@ 2019-05-01 17:09       ` Peter Zijlstra
  2019-05-01 17:26         ` Waiman Long
  2019-05-02 10:09         ` Oleg Nesterov
  2 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-05-01 17:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli,
	Oleg Nesterov, jack, Waiman Long, Davidlohr Bueso

On Tue, Apr 30, 2019 at 03:28:11PM +0200, Peter Zijlstra wrote:

> Yeah, but AFAIK fs freezing code has a history of doing exactly that..
> This is just the latest incarnation here.
> 
> So the immediate problem here is that the task doing thaw isn't the same
> that did freeze, right? The thing is, I'm not seeing how that isn't a
> problem with upstream either.
> 
> The freeze code seems to do: percpu_down_write() for the various states,
> and then frobs lockdep state.
> 
> Thaw then does the reverse, frobs lockdep and then does: percpu_up_write().
> 
> percpu_down_write() directly relies on down_write(), and
> percpu_up_write() on up_write(). And note how __up_write() has:
> 
> 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
> 
> So why isn't this same code coming unstuck in mainline?

Anyway; I cobbled together the below. Oleg, could you have a look, I'm
sure I messed it up.

---
 fs/super.c                    | 31 ++----------------
 include/linux/fs.h            |  4 +--
 include/linux/percpu-rwsem.h  | 25 ++++++--------
 kernel/locking/percpu-rwsem.c | 76 ++++++++++++++++++++++++++++++++-----------
 4 files changed, 71 insertions(+), 65 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 583a0124bc39..bf9c54d05edb 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1629,30 +1629,7 @@ EXPORT_SYMBOL(__sb_start_write);
  */
 static void sb_wait_write(struct super_block *sb, int level)
 {
-	percpu_down_write(sb->s_writers.rw_sem + level-1);
-}
-
-/*
- * We are going to return to userspace and forget about these locks, the
- * ownership goes to the caller of thaw_super() which does unlock().
- */
-static void lockdep_sb_freeze_release(struct super_block *sb)
-{
-	int level;
-
-	for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
-		percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
-}
-
-/*
- * Tell lockdep we are holding these locks before we call ->unfreeze_fs(sb).
- */
-static void lockdep_sb_freeze_acquire(struct super_block *sb)
-{
-	int level;
-
-	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
-		percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
+	percpu_down_write_non_owner(sb->s_writers.rw_sem + level-1);
 }
 
 static void sb_freeze_unlock(struct super_block *sb)
@@ -1660,7 +1637,7 @@ static void sb_freeze_unlock(struct super_block *sb)
 	int level;
 
 	for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
-		percpu_up_write(sb->s_writers.rw_sem + level);
+		percpu_up_write_non_owner(sb->s_writers.rw_sem + level);
 }
 
 /**
@@ -1753,7 +1730,6 @@ int freeze_super(struct super_block *sb)
 	 * when frozen is set to SB_FREEZE_COMPLETE, and for thaw_super().
 	 */
 	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
-	lockdep_sb_freeze_release(sb);
 	up_write(&sb->s_umount);
 	return 0;
 }
@@ -1779,14 +1755,11 @@ static int thaw_super_locked(struct super_block *sb)
 		goto out;
 	}
 
-	lockdep_sb_freeze_acquire(sb);
-
 	if (sb->s_op->unfreeze_fs) {
 		error = sb->s_op->unfreeze_fs(sb);
 		if (error) {
 			printk(KERN_ERR
 				"VFS:Filesystem thaw failed\n");
-			lockdep_sb_freeze_release(sb);
 			up_write(&sb->s_umount);
 			return error;
 		}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dd28e7679089..3b61740b90d7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1557,9 +1557,9 @@ void __sb_end_write(struct super_block *sb, int level);
 int __sb_start_write(struct super_block *sb, int level, bool wait);
 
 #define __sb_writers_acquired(sb, lev)	\
-	percpu_rwsem_acquire(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
+	percpu_rwsem_acquire(&(sb)->s_writers.rw_sem[(lev)-1], _THIS_IP_)
 #define __sb_writers_release(sb, lev)	\
-	percpu_rwsem_release(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
+	percpu_rwsem_release(&(sb)->s_writers.rw_sem[(lev)-1], _THIS_IP_)
 
 /**
  * sb_end_write - drop write access to a superblock
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 03cb4b6f842e..e0c02d0f82a6 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -5,15 +5,15 @@
 #include <linux/atomic.h>
 #include <linux/rwsem.h>
 #include <linux/percpu.h>
-#include <linux/rcuwait.h>
+#include <linux/wait.h>
 #include <linux/rcu_sync.h>
 #include <linux/lockdep.h>
 
 struct percpu_rw_semaphore {
 	struct rcu_sync		rss;
 	unsigned int __percpu	*read_count;
-	struct rw_semaphore	rw_sem; /* slowpath */
-	struct rcuwait          writer; /* blocked writer */
+	struct rw_semaphore	rw_sem;
+	wait_queue_head_t	writer;
 	int			readers_block;
 };
 
@@ -23,7 +23,7 @@ static struct percpu_rw_semaphore name = {				\
 	.rss = __RCU_SYNC_INITIALIZER(name.rss, RCU_SCHED_SYNC),	\
 	.read_count = &__percpu_rwsem_rc_##name,			\
 	.rw_sem = __RWSEM_INITIALIZER(name.rw_sem),			\
-	.writer = __RCUWAIT_INITIALIZER(name.writer),			\
+	.writer = __WAIT_QUEUE_HEAD_INITIALIZER(name.writer),		\
 }
 
 extern int __percpu_down_read(struct percpu_rw_semaphore *, int);
@@ -95,6 +95,9 @@ static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
 extern void percpu_down_write(struct percpu_rw_semaphore *);
 extern void percpu_up_write(struct percpu_rw_semaphore *);
 
+extern void percpu_down_write_non_owner(struct percpu_rw_semaphore *);
+extern void percpu_up_write_non_owner(struct percpu_rw_semaphore *);
+
 extern int __percpu_init_rwsem(struct percpu_rw_semaphore *,
 				const char *, struct lock_class_key *);
 
@@ -112,23 +115,15 @@ extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
 	lockdep_assert_held(&(sem)->rw_sem)
 
 static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
-					bool read, unsigned long ip)
+					unsigned long ip)
 {
 	lock_release(&sem->rw_sem.dep_map, 1, ip);
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-	if (!read)
-		sem->rw_sem.owner = RWSEM_OWNER_UNKNOWN;
-#endif
 }
 
 static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem,
-					bool read, unsigned long ip)
+					unsigned long ip)
 {
-	lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip);
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-	if (!read)
-		sem->rw_sem.owner = current;
-#endif
+	lock_acquire(&sem->rw_sem.dep_map, 0, 1, 1, 1, NULL, ip);
 }
 
 #endif
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index f17dad99eec8..a51fd2a9ee90 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -1,6 +1,7 @@
 #include <linux/atomic.h>
 #include <linux/rwsem.h>
 #include <linux/percpu.h>
+#include <linux/wait.h>
 #include <linux/lockdep.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/rcupdate.h>
@@ -19,7 +20,7 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *sem,
 	/* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */
 	rcu_sync_init(&sem->rss, RCU_SCHED_SYNC);
 	__init_rwsem(&sem->rw_sem, name, rwsem_key);
-	rcuwait_init(&sem->writer);
+	init_waitqueue_head(&sem->writer);
 	sem->readers_block = 0;
 	return 0;
 }
@@ -40,6 +41,40 @@ void percpu_free_rwsem(struct percpu_rw_semaphore *sem)
 }
 EXPORT_SYMBOL_GPL(percpu_free_rwsem);
 
+static void readers_block(struct percpu_rw_semaphore *sem)
+{
+	wait_event_cmd(sem->writer, !sem->readers_block,
+		       __up_read(&sem->rw_sem), __down_read(&sem->rw_sem));
+}
+
+static void block_readers(struct percpu_rw_semaphore *sem)
+{
+	wait_event_exclusive_cmd(sem->writer, !sem->readers_block,
+				 __up_write(&sem->rw_sem),
+				 __down_write(&sem->rw_sem));
+	/*
+	 * Notify new readers to block; up until now, and thus throughout the
+	 * longish rcu_sync_enter() above, new readers could still come in.
+	 */
+	WRITE_ONCE(sem->readers_block, 1);
+}
+
+static void unblock_readers(struct percpu_rw_semaphore *sem)
+{
+	/*
+	 * Signal the writer is done, no fast path yet.
+	 *
+	 * One reason that we cannot just immediately flip to readers_fast is
+	 * that new readers might fail to see the results of this writer's
+	 * critical section.
+	 *
+	 * Therefore we force it through the slow path which guarantees an
+	 * acquire and thereby guarantees the critical section's consistency.
+	 */
+	smp_store_release(&sem->readers_block, 0);
+	wake_up(&sem->writer);
+}
+
 int __percpu_down_read(struct percpu_rw_semaphore *sem, int try)
 {
 	/*
@@ -85,6 +120,9 @@ int __percpu_down_read(struct percpu_rw_semaphore *sem, int try)
 	 * Avoid lockdep for the down/up_read() we already have them.
 	 */
 	__down_read(&sem->rw_sem);
+
+	readers_block(sem);
+
 	this_cpu_inc(*sem->read_count);
 	__up_read(&sem->rw_sem);
 
@@ -104,7 +142,7 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem)
 	__this_cpu_dec(*sem->read_count);
 
 	/* Prod writer to recheck readers_active */
-	rcuwait_wake_up(&sem->writer);
+	wake_up(&sem->writer);
 }
 EXPORT_SYMBOL_GPL(__percpu_up_read);
 
@@ -146,11 +184,7 @@ void percpu_down_write(struct percpu_rw_semaphore *sem)
 
 	down_write(&sem->rw_sem);
 
-	/*
-	 * Notify new readers to block; up until now, and thus throughout the
-	 * longish rcu_sync_enter() above, new readers could still come in.
-	 */
-	WRITE_ONCE(sem->readers_block, 1);
+	block_readers(sem);
 
 	smp_mb(); /* D matches A */
 
@@ -161,23 +195,13 @@ void percpu_down_write(struct percpu_rw_semaphore *sem)
 	 */
 
 	/* Wait for all now active readers to complete. */
-	rcuwait_wait_event(&sem->writer, readers_active_check(sem));
+	wait_event(sem->writer, readers_active_check(sem));
 }
 EXPORT_SYMBOL_GPL(percpu_down_write);
 
 void percpu_up_write(struct percpu_rw_semaphore *sem)
 {
-	/*
-	 * Signal the writer is done, no fast path yet.
-	 *
-	 * One reason that we cannot just immediately flip to readers_fast is
-	 * that new readers might fail to see the results of this writer's
-	 * critical section.
-	 *
-	 * Therefore we force it through the slow path which guarantees an
-	 * acquire and thereby guarantees the critical section's consistency.
-	 */
-	smp_store_release(&sem->readers_block, 0);
+	unblock_readers(sem);
 
 	/*
 	 * Release the write lock, this will allow readers back in the game.
@@ -191,4 +215,18 @@ void percpu_up_write(struct percpu_rw_semaphore *sem)
 	 */
 	rcu_sync_exit(&sem->rss);
 }
 EXPORT_SYMBOL_GPL(percpu_up_write);
+
+void percpu_down_write_non_owner(struct percpu_rw_semaphore *sem)
+{
+	percpu_down_write(sem);
+	up_write(&sem->rw_sem);
+}
+EXPORT_SYMBOL_GPL(percpu_down_write_non_owner);
+
+void percpu_up_write_non_owner(struct percpu_rw_semaphore *sem)
+{
+	down_write(&sem->rw_sem);
+	percpu_up_write(sem);
+}
+EXPORT_SYMBOL_GPL(percpu_up_write_non_owner);


* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-05-01 17:09       ` Peter Zijlstra
@ 2019-05-01 17:26         ` Waiman Long
  2019-05-01 18:54           ` Peter Zijlstra
  2019-05-02 10:09         ` Oleg Nesterov
  1 sibling, 1 reply; 26+ messages in thread
From: Waiman Long @ 2019-05-01 17:26 UTC (permalink / raw)
  To: Peter Zijlstra, Sebastian Andrzej Siewior
  Cc: linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli,
	Oleg Nesterov, jack, Davidlohr Bueso

On 5/1/19 1:09 PM, Peter Zijlstra wrote:
> On Tue, Apr 30, 2019 at 03:28:11PM +0200, Peter Zijlstra wrote:
>
>> Yeah, but AFAIK fs freezing code has a history of doing exactly that..
>> This is just the latest incarnation here.
>>
>> So the immediate problem here is that the task doing thaw isn't the same
>> that did freeze, right? The thing is, I'm not seeing how that isn't a
>> problem with upstream either.
>>
>> The freeze code seems to do: percpu_down_write() for the various states,
>> and then frobs lockdep state.
>>
>> Thaw then does the reverse, frobs lockdep and then does: percpu_up_write().
>>
>> percpu_down_write() directly relies on down_write(), and
>> percpu_up_write() on up_write(). And note how __up_write() has:
>>
>> 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
>>
>> So why isn't this same code coming unstuck in mainline?

That code is just in the tip tree; it is not in mainline yet. I do
realize it can be a problem, so I have modified it to the following in
my part2 patchset.

static inline void __up_write(struct rw_semaphore *sem)
{
        long tmp;

        /*
         * sem->owner may differ from current if the ownership is transferred
         * to an anonymous writer by setting the RWSEM_NONSPINNABLE bits.
         */
        DEBUG_RWSEMS_WARN_ON((sem->owner != current) &&
                            !((long)sem->owner & RWSEM_NONSPINNABLE), sem);

Maybe I should break this part out and have it merged into tip as well.

Cheers,
Longman



* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-05-01 17:26         ` Waiman Long
@ 2019-05-01 18:54           ` Peter Zijlstra
  2019-05-01 19:22             ` Davidlohr Bueso
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2019-05-01 18:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Sebastian Andrzej Siewior, linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli,
	Oleg Nesterov, jack, Davidlohr Bueso

On Wed, May 01, 2019 at 01:26:08PM -0400, Waiman Long wrote:
> On 5/1/19 1:09 PM, Peter Zijlstra wrote:
> > On Tue, Apr 30, 2019 at 03:28:11PM +0200, Peter Zijlstra wrote:
> >
> >> Yeah, but AFAIK fs freezing code has a history of doing exactly that..
> >> This is just the latest incarnation here.
> >>
> >> So the immediate problem here is that the task doing thaw isn't the same
> >> that did freeze, right? The thing is, I'm not seeing how that isn't a
> >> problem with upstream either.
> >>
> >> The freeze code seems to do: percpu_down_write() for the various states,
> >> and then frobs lockdep state.
> >>
> >> Thaw then does the reverse, frobs lockdep and then does: percpu_up_write().
> >>
> >> percpu_down_write() directly relies on down_write(), and
> >> percpu_up_write() on up_write(). And note how __up_write() has:
> >>
> >> 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
> >>
> >> So why isn't this same code coming unstuck in mainline?
> 
> That code is just in the tip tree; it is not in mainline yet. I do
> realize it can be a problem, so I have modified it to the following in
> my part2 patchset.

Nah, the percpu_rwsem abuse by the freezer is atrocious; we really
should not encourage that. Also, it completely wrecks -RT.

Hence the proposed patch.


* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-05-01 18:54           ` Peter Zijlstra
@ 2019-05-01 19:22             ` Davidlohr Bueso
  2019-05-01 19:25               ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Davidlohr Bueso @ 2019-05-01 19:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Sebastian Andrzej Siewior, linux-rt-users, LKML,
	Thomas Gleixner, Daniel Bristot de Oliveira, Clark Williams,
	Juri Lelli, Oleg Nesterov, jack

On Wed, 01 May 2019, Peter Zijlstra wrote:

>Nah, the percpu_rwsem abuse by the freezer is atrocious, we really
>should not encourage that. Also, it completely wrecks -RT.
>
>Hence the proposed patch.

Is this patch (and removing rcuwait) only intended for rt?

Thanks,
Davidlohr


* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-05-01 19:22             ` Davidlohr Bueso
@ 2019-05-01 19:25               ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-05-01 19:25 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Waiman Long, Sebastian Andrzej Siewior, linux-rt-users, LKML,
	Thomas Gleixner, Daniel Bristot de Oliveira, Clark Williams,
	Juri Lelli, Oleg Nesterov, jack

On Wed, May 01, 2019 at 12:22:34PM -0700, Davidlohr Bueso wrote:
> On Wed, 01 May 2019, Peter Zijlstra wrote:
> 
> > Nah, the percpu_rwsem abuse by the freezer is atrocious, we really
> > should not encourage that. Also, it completely wrecks -RT.
> > 
> > Hence the proposed patch.
> 
> Is this patch (and removing rcuwait) only intended for rt?

Seeing how we want to have -rt upstream, and how horrible the current
fixup is, I'm tempted to say that, barring a better solution, we'd want
this upstream.


* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-05-01 17:09       ` Peter Zijlstra
  2019-05-01 17:26         ` Waiman Long
@ 2019-05-02 10:09         ` Oleg Nesterov
  2019-05-02 11:42           ` Oleg Nesterov
  2019-05-03 14:16           ` Peter Zijlstra
  1 sibling, 2 replies; 26+ messages in thread
From: Oleg Nesterov @ 2019-05-02 10:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli, jack,
	Waiman Long, Davidlohr Bueso

On 05/01, Peter Zijlstra wrote:
>
> Anyway; I cobbled together the below. Oleg, could you have a look, I'm
> sure I messed it up.

Oh, I will need to read this carefully. But at first glance I do not see
any hole...

> +static void readers_block(struct percpu_rw_semaphore *sem)
> +{
> +	wait_event_cmd(sem->writer, !sem->readers_block,
> +		       __up_read(&sem->rw_sem), __down_read(&sem->rw_sem));
> +}
> +
> +static void block_readers(struct percpu_rw_semaphore *sem)
> +{
> +	wait_event_exclusive_cmd(sem->writer, !sem->readers_block,
> +				 __up_write(&sem->rw_sem),
> +				 __down_write(&sem->rw_sem));
> +	/*
> +	 * Notify new readers to block; up until now, and thus throughout the
> +	 * longish rcu_sync_enter() above, new readers could still come in.
> +	 */
> +	WRITE_ONCE(sem->readers_block, 1);
> +}

So IIUC, despite its name, block_readers() also serializes the writers;
->rw_sem can be dropped by percpu_down_write_non_owner(), so a new writer can
take this lock.

And note that the caller of readers_block() does down_read(), the caller of
block_readers() does down_write(). So perhaps it makes sense to shift these
down_read/write into the helpers above and rename them,

	void xxx_down_read(struct percpu_rw_semaphore *sem)
	{
		__down_read(&sem->rw_sem);

		wait_event_cmd(sem->writer, !sem->readers_block,
		       __up_read(&sem->rw_sem), __down_read(&sem->rw_sem));
	}

	void xxx_down_write(struct percpu_rw_semaphore *sem)
	{
		down_write(&sem->rw_sem);

		wait_event_exclusive_cmd(sem->writer, !sem->readers_block,
					 __up_write(&sem->rw_sem),
					 __down_write(&sem->rw_sem));
		/*
		 * Notify new readers to block; up until now, and thus throughout the
		 * longish rcu_sync_enter() above, new readers could still come in.
		 */
		WRITE_ONCE(sem->readers_block, 1);
	}

to make this logic more clear? Or even

	bool ck_read(struct percpu_rw_semaphore *sem)
	{
		__down_read(&sem->rw_sem);
		if (!sem->readers_block)
			return true;
		__up_read(&sem->rw_sem);
		return false;
	}

	bool ck_write(struct percpu_rw_semaphore *sem)
	{
		down_write(&sem->rw_sem);
		if (!sem->readers_block)
			return true;
		up_write(&sem->rw_sem);
		return false;
	}

Then percpu_down_read/write can simply do wait_event(ck_read(sem)) and
wait_event_exclusive(ck_write(sem)) respectively.

But this is all cosmetic; it seems that we can remove ->rw_sem altogether,
but I am not sure...

Oleg.



* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-05-02 10:09         ` Oleg Nesterov
@ 2019-05-02 11:42           ` Oleg Nesterov
  2019-05-03 14:50             ` Peter Zijlstra
  2019-05-03 14:16           ` Peter Zijlstra
  1 sibling, 1 reply; 26+ messages in thread
From: Oleg Nesterov @ 2019-05-02 11:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli, jack,
	Waiman Long, Davidlohr Bueso

On 05/02, Oleg Nesterov wrote:
>
> But this all is cosmetic, it seems that we can remove ->rw_sem altogether
> but I am not sure...

I mean, afaics percpu_down_read() can just do

	wait_event(readers_block == 0);

in the slow path, while percpu_down_write()

	wait_event_exclusive(xchg(readers_block, 1) == 0);

we do not really need ->rw_sem if we rely on wait_queue_head.


But in fact, either way it seems that we are going to implement another simple
"non owner" read/write lock, so perhaps we should do this explicitly?

Oleg.



* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-05-02 10:09         ` Oleg Nesterov
  2019-05-02 11:42           ` Oleg Nesterov
@ 2019-05-03 14:16           ` Peter Zijlstra
  2019-05-03 15:37             ` Oleg Nesterov
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2019-05-03 14:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sebastian Andrzej Siewior, linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli, jack,
	Waiman Long, Davidlohr Bueso

On Thu, May 02, 2019 at 12:09:32PM +0200, Oleg Nesterov wrote:
> On 05/01, Peter Zijlstra wrote:
> >
> > Anyway; I cobbled together the below. Oleg, could you have a look, I'm
> > sure I messed it up.
> 
> Oh, I will need to read this carefully. But at first glance I do not see
> any hole...
> 
> > +static void readers_block(struct percpu_rw_semaphore *sem)
> > +{
> > +	wait_event_cmd(sem->writer, !sem->readers_block,
> > +		       __up_read(&sem->rw_sem), __down_read(&sem->rw_sem));
> > +}
> > +
> > +static void block_readers(struct percpu_rw_semaphore *sem)
> > +{
> > +	wait_event_exclusive_cmd(sem->writer, !sem->readers_block,
> > +				 __up_write(&sem->rw_sem),
> > +				 __down_write(&sem->rw_sem));
> > +	/*
> > +	 * Notify new readers to block; up until now, and thus throughout the
> > +	 * longish rcu_sync_enter() above, new readers could still come in.
> > +	 */
> > +	WRITE_ONCE(sem->readers_block, 1);
> > +}
> 
> So IIUC, despite its name, block_readers() also serializes the writers;
> ->rw_sem can be dropped by percpu_down_write_non_owner(), so a new writer can
> take this lock.

I don't think block_readers() is sufficient to serialize writers;
suppose two concurrent callers when !->readers_block. Without ->rw_sem
that case would not serialize.

> But this all is cosmetic, it seems that we can remove ->rw_sem altogether
> but I am not sure...

Only if we introduce something like ->wait_lock to serialize things.


* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-05-02 11:42           ` Oleg Nesterov
@ 2019-05-03 14:50             ` Peter Zijlstra
  2019-05-03 15:25               ` Peter Zijlstra
  2019-05-06 16:50               ` Oleg Nesterov
  0 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-05-03 14:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sebastian Andrzej Siewior, linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli, jack,
	Waiman Long, Davidlohr Bueso

On Thu, May 02, 2019 at 01:42:59PM +0200, Oleg Nesterov wrote:
> On 05/02, Oleg Nesterov wrote:
> >
> > But this all is cosmetic, it seems that we can remove ->rw_sem altogether
> > but I am not sure...
> 
> I mean, afaics percpu_down_read() can just do
> 
> 	wait_event(readers_block == 0);
> 
> in the slow path, while percpu_down_write()
> 
> 	wait_event_exclusive(xchg(readers_block, 1) == 0);

Cute... yes, that might work.

Although at that point I think we might want to switch readers_block to
atomic_t or something.

> we do not really need ->rw_sem if we rely on wait_queue_head.
> 
> 
> But in fact, either way it seems that we are going to implement another simple
> "non owner" read/write lock, so perhaps we should do this explicitly?

I'd rather not; people might start using it, and then -RT ends up with
another problem.

I mean, percpu-rwsem isn't used in situations where PI really matters,
but rwsem is.

So how about something like so then?

--- a/fs/super.c
+++ b/fs/super.c
@@ -1629,30 +1629,7 @@ EXPORT_SYMBOL(__sb_start_write);
  */
 static void sb_wait_write(struct super_block *sb, int level)
 {
-	percpu_down_write(sb->s_writers.rw_sem + level-1);
-}
-
-/*
- * We are going to return to userspace and forget about these locks, the
- * ownership goes to the caller of thaw_super() which does unlock().
- */
-static void lockdep_sb_freeze_release(struct super_block *sb)
-{
-	int level;
-
-	for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
-		percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
-}
-
-/*
- * Tell lockdep we are holding these locks before we call ->unfreeze_fs(sb).
- */
-static void lockdep_sb_freeze_acquire(struct super_block *sb)
-{
-	int level;
-
-	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
-		percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
+	percpu_down_write_non_owner(sb->s_writers.rw_sem + level-1);
 }
 
 static void sb_freeze_unlock(struct super_block *sb)
@@ -1660,7 +1637,7 @@ static void sb_freeze_unlock(struct supe
 	int level;
 
 	for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
-		percpu_up_write(sb->s_writers.rw_sem + level);
+		percpu_up_write_non_owner(sb->s_writers.rw_sem + level);
 }
 
 /**
@@ -1753,7 +1730,6 @@ int freeze_super(struct super_block *sb)
 	 * when frozen is set to SB_FREEZE_COMPLETE, and for thaw_super().
 	 */
 	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
-	lockdep_sb_freeze_release(sb);
 	up_write(&sb->s_umount);
 	return 0;
 }
@@ -1779,14 +1755,11 @@ static int thaw_super_locked(struct supe
 		goto out;
 	}
 
-	lockdep_sb_freeze_acquire(sb);
-
 	if (sb->s_op->unfreeze_fs) {
 		error = sb->s_op->unfreeze_fs(sb);
 		if (error) {
 			printk(KERN_ERR
 				"VFS:Filesystem thaw failed\n");
-			lockdep_sb_freeze_release(sb);
 			up_write(&sb->s_umount);
 			return error;
 		}
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1557,9 +1557,9 @@ void __sb_end_write(struct super_block *
 int __sb_start_write(struct super_block *sb, int level, bool wait);
 
 #define __sb_writers_acquired(sb, lev)	\
-	percpu_rwsem_acquire(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
+	percpu_rwsem_acquire(&(sb)->s_writers.rw_sem[(lev)-1], _THIS_IP_)
 #define __sb_writers_release(sb, lev)	\
-	percpu_rwsem_release(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
+	percpu_rwsem_release(&(sb)->s_writers.rw_sem[(lev)-1], _THIS_IP_)
 
 /**
  * sb_end_write - drop write access to a superblock
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -5,25 +5,34 @@
 #include <linux/atomic.h>
 #include <linux/rwsem.h>
 #include <linux/percpu.h>
-#include <linux/rcuwait.h>
+#include <linux/wait.h>
 #include <linux/rcu_sync.h>
 #include <linux/lockdep.h>
 
 struct percpu_rw_semaphore {
 	struct rcu_sync		rss;
 	unsigned int __percpu	*read_count;
-	struct rw_semaphore	rw_sem; /* slowpath */
-	struct rcuwait          writer; /* blocked writer */
-	int			readers_block;
+	wait_queue_head_t	waiters;
+	atomic_t		block;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	dep_map;
+#endif
 };
 
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+#define __PERCPU_RWSEM_DEP_MAP_INIT(lockname)	.dep_map = { .name = #lockname },
+#else
+#define __PERCPU_RWSEM_DEP_MAP_INIT(lockname)
+#endif
+
 #define DEFINE_STATIC_PERCPU_RWSEM(name)				\
 static DEFINE_PER_CPU(unsigned int, __percpu_rwsem_rc_##name);		\
 static struct percpu_rw_semaphore name = {				\
 	.rss = __RCU_SYNC_INITIALIZER(name.rss, RCU_SCHED_SYNC),	\
 	.read_count = &__percpu_rwsem_rc_##name,			\
-	.rw_sem = __RWSEM_INITIALIZER(name.rw_sem),			\
-	.writer = __RCUWAIT_INITIALIZER(name.writer),			\
+	.waiters = __WAIT_QUEUE_HEAD_INITIALIZER(name.waiters),		\
+	.block = ATOMIC_INIT(0),					\
+	__PERCPU_RWSEM_DEP_MAP_INIT(name)				\
 }
 
 extern int __percpu_down_read(struct percpu_rw_semaphore *, int);
@@ -33,7 +42,7 @@ static inline void percpu_down_read(stru
 {
 	might_sleep();
 
-	rwsem_acquire_read(&sem->rw_sem.dep_map, 0, 0, _RET_IP_);
+	rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
 
 	preempt_disable();
 	/*
@@ -72,7 +81,7 @@ static inline int percpu_down_read_trylo
 	 */
 
 	if (ret)
-		rwsem_acquire_read(&sem->rw_sem.dep_map, 0, 1, _RET_IP_);
+		rwsem_acquire_read(&sem->dep_map, 0, 1, _RET_IP_);
 
 	return ret;
 }
@@ -89,12 +98,15 @@ static inline void percpu_up_read(struct
 		__percpu_up_read(sem); /* Unconditional memory barrier */
 	preempt_enable();
 
-	rwsem_release(&sem->rw_sem.dep_map, 1, _RET_IP_);
+	rwsem_release(&sem->dep_map, 1, _RET_IP_);
 }
 
 extern void percpu_down_write(struct percpu_rw_semaphore *);
 extern void percpu_up_write(struct percpu_rw_semaphore *);
 
+extern void percpu_down_write_non_owner(struct percpu_rw_semaphore *);
+extern void percpu_up_write_non_owner(struct percpu_rw_semaphore *);
+
 extern int __percpu_init_rwsem(struct percpu_rw_semaphore *,
 				const char *, struct lock_class_key *);
 
@@ -112,23 +124,15 @@ extern void percpu_free_rwsem(struct per
 	lockdep_assert_held(&(sem)->rw_sem)
 
 static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
-					bool read, unsigned long ip)
+					unsigned long ip)
 {
-	lock_release(&sem->rw_sem.dep_map, 1, ip);
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-	if (!read)
-		sem->rw_sem.owner = RWSEM_OWNER_UNKNOWN;
-#endif
+	lock_release(&sem->dep_map, 1, ip);
 }
 
 static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem,
-					bool read, unsigned long ip)
+					unsigned long ip)
 {
-	lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip);
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-	if (!read)
-		sem->rw_sem.owner = current;
-#endif
+	lock_acquire(&sem->dep_map, 0, 1, 1, 1, NULL, ip);
 }
 
 #endif
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -395,6 +395,9 @@ do {										\
 	__wait_event_exclusive_cmd(wq_head, condition, cmd1, cmd2);		\
 } while (0)
 
+#define wait_event_exclusive(wq_head, condition)				\
+	wait_event_exclusive_cmd(wq_head, condition, ,)
+
 #define __wait_event_cmd(wq_head, condition, cmd1, cmd2)			\
 	(void)___wait_event(wq_head, condition, TASK_UNINTERRUPTIBLE, 0, 0,	\
 			    cmd1; schedule(); cmd2)
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -1,6 +1,7 @@
 #include <linux/atomic.h>
 #include <linux/rwsem.h>
 #include <linux/percpu.h>
+#include <linux/wait.h>
 #include <linux/lockdep.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/rcupdate.h>
@@ -16,11 +17,13 @@ int __percpu_init_rwsem(struct percpu_rw
 	if (unlikely(!sem->read_count))
 		return -ENOMEM;
 
-	/* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */
 	rcu_sync_init(&sem->rss, RCU_SCHED_SYNC);
-	__init_rwsem(&sem->rw_sem, name, rwsem_key);
-	rcuwait_init(&sem->writer);
-	sem->readers_block = 0;
+	init_waitqueue_head(&sem->waiters);
+	atomic_set(&sem->block, 0);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	debug_check_no_locks_freed((void *)sem, sizeof(*sem));
+	lockdep_init_map(&sem->dep_map, name, key, 0);
+#endif
 	return 0;
 }
 EXPORT_SYMBOL_GPL(__percpu_init_rwsem);
@@ -63,7 +66,7 @@ int __percpu_down_read(struct percpu_rw_
 	 * If !readers_block the critical section starts here, matched by the
 	 * release in percpu_up_write().
 	 */
-	if (likely(!smp_load_acquire(&sem->readers_block)))
+	if (likely(!atomic_read_acquire(&sem->block)))
 		return 1;
 
 	/*
@@ -80,14 +83,8 @@ int __percpu_down_read(struct percpu_rw_
 	 * and reschedule on the preempt_enable() in percpu_down_read().
 	 */
 	preempt_enable_no_resched();
-
-	/*
-	 * Avoid lockdep for the down/up_read() we already have them.
-	 */
-	__down_read(&sem->rw_sem);
+	wait_event(sem->waiters, !atomic_read(&sem->block));
 	this_cpu_inc(*sem->read_count);
-	__up_read(&sem->rw_sem);
-
 	preempt_disable();
 	return 1;
 }
@@ -104,7 +101,7 @@ void __percpu_up_read(struct percpu_rw_s
 	__this_cpu_dec(*sem->read_count);
 
 	/* Prod writer to recheck readers_active */
-	rcuwait_wake_up(&sem->writer);
+	wake_up(&sem->waiters);
 }
 EXPORT_SYMBOL_GPL(__percpu_up_read);
 
@@ -139,18 +136,22 @@ static bool readers_active_check(struct
 	return true;
 }
 
+static inline bool acquire_block(struct percpu_rw_semaphore *sem)
+{
+	if (atomic_read(&sem->block))
+		return false;
+
+	return atomic_xchg(&sem->block, 1) == 0;
+}
+
 void percpu_down_write(struct percpu_rw_semaphore *sem)
 {
+	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
+
 	/* Notify readers to take the slow path. */
 	rcu_sync_enter(&sem->rss);
 
-	down_write(&sem->rw_sem);
-
-	/*
-	 * Notify new readers to block; up until now, and thus throughout the
-	 * longish rcu_sync_enter() above, new readers could still come in.
-	 */
-	WRITE_ONCE(sem->readers_block, 1);
+	wait_event_exclusive(sem->waiters, acquire_block(sem));
 
 	smp_mb(); /* D matches A */
 
@@ -161,7 +162,7 @@ void percpu_down_write(struct percpu_rw_
 	 */
 
 	/* Wait for all now active readers to complete. */
-	rcuwait_wait_event(&sem->writer, readers_active_check(sem));
+	wait_event(sem->waiters, readers_active_check(sem));
 }
 EXPORT_SYMBOL_GPL(percpu_down_write);
 
@@ -177,12 +178,8 @@ void percpu_up_write(struct percpu_rw_se
 	 * Therefore we force it through the slow path which guarantees an
 	 * acquire and thereby guarantees the critical section's consistency.
 	 */
-	smp_store_release(&sem->readers_block, 0);
-
-	/*
-	 * Release the write lock, this will allow readers back in the game.
-	 */
-	up_write(&sem->rw_sem);
+	atomic_set_release(&sem->block, 0);
+	wake_up(&sem->waiters);
 
 	/*
 	 * Once this completes (at least one RCU-sched grace period hence) the
@@ -190,5 +187,21 @@ void percpu_up_write(struct percpu_rw_se
 	 * exclusive write lock because it's counting.
 	 */
 	rcu_sync_exit(&sem->rss);
+
+	rwsem_release(&sem->dep_map, 1, _RET_IP_);
 }
 EXPORT_SYMBOL_GPL(percpu_up_write);
+
+void percpu_down_write_non_owner(struct percpu_rw_semaphore *sem)
+{
+	percpu_down_write(sem);
+	rwsem_release(&sem->dep_map, 1, _RET_IP_);
+}
+EXPORT_SYMBOL_GPL(percpu_down_write_non_owner);
+
+void percpu_up_write_non_owner(struct percpu_rw_semaphore *sem)
+{
+	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
+	percpu_up_write(sem);
+}
+EXPORT_SYMBOL_GPL(percpu_up_write_non_owner);
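For readers following along, the block-flag protocol in the patch above can be sketched as a minimal userspace model with C11 atomics. This is an illustrative sketch only: the struct and function names are invented, wait_event()/wake_up() are elided, and the per-CPU read counter is collapsed into a single atomic. Note how the writer's xchg is a full barrier (which is why the separate smp_mb() becomes removable) and the writer release store pairs with the reader's acquire load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of the patched percpu-rwsem; names mirror the patch. */
struct pcpu_rwsem_model {
	atomic_int block;      /* 1 while a writer holds the lock */
	atomic_int read_count; /* stand-in for the per-CPU counters */
};

/* Writer: claim the block flag; atomic_exchange is a full barrier. */
static bool model_try_down_write(struct pcpu_rwsem_model *sem)
{
	if (atomic_load(&sem->block))
		return false; /* another writer holds it; would wait_event() */
	return atomic_exchange(&sem->block, 1) == 0;
}

/* Writer release: release store, matching the reader's acquire load. */
static void model_up_write(struct pcpu_rwsem_model *sem)
{
	atomic_store_explicit(&sem->block, 0, memory_order_release);
	/* wake_up(&sem->waiters) would go here */
}

/* Reader slow path: check the flag, then account ourselves. */
static bool model_try_down_read(struct pcpu_rwsem_model *sem)
{
	if (atomic_load_explicit(&sem->block, memory_order_acquire))
		return false; /* would wait_event() */
	atomic_fetch_add(&sem->read_count, 1);
	return true;
}

/* Demo: a writer excludes a second writer; release lets a reader in. */
static bool model_demo(void)
{
	struct pcpu_rwsem_model sem = {0};

	if (!model_try_down_write(&sem))
		return false;             /* first writer gets the flag */
	if (model_try_down_write(&sem))
		return false;             /* second writer must be refused */
	model_up_write(&sem);
	return model_try_down_read(&sem); /* reader succeeds after release */
}
```

Single-threaded, so it models only the state transitions, not the concurrency; Oleg's race later in the thread shows why the ordering between the reader's check and its increment matters.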


* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-05-03 14:50             ` Peter Zijlstra
@ 2019-05-03 15:25               ` Peter Zijlstra
  2019-05-06 16:50               ` Oleg Nesterov
  1 sibling, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-05-03 15:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sebastian Andrzej Siewior, linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli, jack,
	Waiman Long, Davidlohr Bueso

On Fri, May 03, 2019 at 04:50:59PM +0200, Peter Zijlstra wrote:
> So how about something like so then?

> --- a/kernel/locking/percpu-rwsem.c
> +++ b/kernel/locking/percpu-rwsem.c

> @@ -63,7 +66,7 @@ int __percpu_down_read(struct percpu_rw_
>  	 * If !readers_block the critical section starts here, matched by the
>  	 * release in percpu_up_write().
>  	 */
> -	if (likely(!smp_load_acquire(&sem->readers_block)))
> +	if (likely(!atomic_read_acquire(&sem->block)))
>  		return 1;
>  
>  	/*
> @@ -80,14 +83,8 @@ int __percpu_down_read(struct percpu_rw_
>  	 * and reschedule on the preempt_enable() in percpu_down_read().
>  	 */
>  	preempt_enable_no_resched();
> -
> -	/*
> -	 * Avoid lockdep for the down/up_read() we already have them.
> -	 */
> -	__down_read(&sem->rw_sem);
> +	wait_event(sem->waiters, !atomic_read(&sem->block));

That should be:

	wait_event(sem->waiters, !atomic_read_acquire(&sem->block));

I suppose.

>  	this_cpu_inc(*sem->read_count);
> -	__up_read(&sem->rw_sem);
> -
>  	preempt_disable();
>  	return 1;
>  }
> @@ -104,7 +101,7 @@ void __percpu_up_read(struct percpu_rw_s
>  	__this_cpu_dec(*sem->read_count);
>  
>  	/* Prod writer to recheck readers_active */
> -	rcuwait_wake_up(&sem->writer);
> +	wake_up(&sem->waiters);
>  }
>  EXPORT_SYMBOL_GPL(__percpu_up_read);
>  
> @@ -139,18 +136,22 @@ static bool readers_active_check(struct
>  	return true;
>  }
>  
> +static inline bool acquire_block(struct percpu_rw_semaphore *sem)
> +{
> +	if (atomic_read(&sem->block))
> +		return false;
> +
> +	return atomic_xchg(&sem->block, 1) == 0;
> +}
> +
>  void percpu_down_write(struct percpu_rw_semaphore *sem)
>  {
> +	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
> +
>  	/* Notify readers to take the slow path. */
>  	rcu_sync_enter(&sem->rss);
>  
> -	down_write(&sem->rw_sem);
> -
> -	/*
> -	 * Notify new readers to block; up until now, and thus throughout the
> -	 * longish rcu_sync_enter() above, new readers could still come in.
> -	 */
> -	WRITE_ONCE(sem->readers_block, 1);
> +	wait_event_exclusive(sem->waiters, acquire_block(sem));
>  
>  	smp_mb(); /* D matches A */

And we can remove that smp_mb() and rely on the atomic_xchg() from
acquire_block().

>  
> @@ -161,7 +162,7 @@ void percpu_down_write(struct percpu_rw_
>  	 */
>  
>  	/* Wait for all now active readers to complete. */
> -	rcuwait_wait_event(&sem->writer, readers_active_check(sem));
> +	wait_event(sem->waiters, readers_active_check(sem));
>  }
>  EXPORT_SYMBOL_GPL(percpu_down_write);
>  
> @@ -177,12 +178,8 @@ void percpu_up_write(struct percpu_rw_se
>  	 * Therefore we force it through the slow path which guarantees an
>  	 * acquire and thereby guarantees the critical section's consistency.
>  	 */
> -	smp_store_release(&sem->readers_block, 0);
> -
> -	/*
> -	 * Release the write lock, this will allow readers back in the game.
> -	 */
> -	up_write(&sem->rw_sem);
> +	atomic_set_release(&sem->block, 0);
> +	wake_up(&sem->waiters);
>  
>  	/*
>  	 * Once this completes (at least one RCU-sched grace period hence) the
> @@ -190,5 +187,21 @@ void percpu_up_write(struct percpu_rw_se
>  	 * exclusive write lock because it's counting.
>  	 */
>  	rcu_sync_exit(&sem->rss);
> +
> +	rwsem_release(&sem->dep_map, 1, _RET_IP_);
>  }
>  EXPORT_SYMBOL_GPL(percpu_up_write);


* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-05-03 14:16           ` Peter Zijlstra
@ 2019-05-03 15:37             ` Oleg Nesterov
  2019-05-03 15:46               ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Oleg Nesterov @ 2019-05-03 15:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli, jack,
	Waiman Long, Davidlohr Bueso

On 05/03, Peter Zijlstra wrote:
>
> On Thu, May 02, 2019 at 12:09:32PM +0200, Oleg Nesterov wrote:
>
> > > +static void readers_block(struct percpu_rw_semaphore *sem)
> > > +{
> > > +	wait_event_cmd(sem->writer, !sem->readers_block,
> > > +		       __up_read(&sem->rw_sem), __down_read(&sem->rw_sem));
> > > +}
> > > +
> > > +static void block_readers(struct percpu_rw_semaphore *sem)
> > > +{
> > > +	wait_event_exclusive_cmd(sem->writer, !sem->readers_block,
> > > +				 __up_write(&sem->rw_sem),
> > > +				 __down_write(&sem->rw_sem));
> > > +	/*
> > > +	 * Notify new readers to block; up until now, and thus throughout the
> > > +	 * longish rcu_sync_enter() above, new readers could still come in.
> > > +	 */
> > > +	WRITE_ONCE(sem->readers_block, 1);
> > > +}
> >
> > So iiuc, despite its name block_readers() also serializes the writers; ->rw_sem
> > can be dropped by down_write_non_owner(), so the new writer can take this lock.
>
> I don't think block_readers() is sufficient to serialize writers;
> suppose two concurrent callers when !->readers_block. Without ->rwsem
> that case would not serialize.

Of course. I meant that the next writer can enter block_readers() if
up_non_owner() drops ->rw_sem, but it will block in wait_event(!readers_block).

(And if we change this code to use wait_event(xchg(readers_block) == 0) we
 can remove rw_sem altogether).

The main problem is that this is sub-optimal. We can have a lot of readers
sleeping in __down_read() when percpu_down_write() succeeds, then after
percpu_down_write_non_owner() does up_write() they all will be woken just
to hang in readers_block(). Plus the new readers will need to pass the
lock-check-unlock-schedule path.

Peter, just in case... I see another patch from you but I need to run away
till Monday.

Oleg.



* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-05-03 15:37             ` Oleg Nesterov
@ 2019-05-03 15:46               ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-05-03 15:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sebastian Andrzej Siewior, linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli, jack,
	Waiman Long, Davidlohr Bueso

On Fri, May 03, 2019 at 05:37:48PM +0200, Oleg Nesterov wrote:
> (And if we change this code to use wait_event(xchg(readers_block) == 0) we
>  can remove rw_sem altogether).

That patch you just saw and didn't look at did just that.

> The main problem is that this is sub-optimal. We can have a lot of readers
> sleeping in __down_read() when percpu_down_write() succeeds, then after
> percpu_down_write_non_owner() does up_write() they all will be woken just
> to hang in readers_block(). Plus the new readers will need to pass the
> lock-check-unlock-schedule path.

Yes, that's gone. Still, write side locking on percpu-rwsem _should_ be
relatively rare and is certainly not a fast path.

> Peter, just in case... I see another patch from you but I need to run away
> till Monday.

n/p, enjoy the weekend!


* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-05-03 14:50             ` Peter Zijlstra
  2019-05-03 15:25               ` Peter Zijlstra
@ 2019-05-06 16:50               ` Oleg Nesterov
  2019-06-19  9:50                 ` Peter Zijlstra
  1 sibling, 1 reply; 26+ messages in thread
From: Oleg Nesterov @ 2019-05-06 16:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli, jack,
	Waiman Long, Davidlohr Bueso

On 05/03, Peter Zijlstra wrote:
>
> -static void lockdep_sb_freeze_release(struct super_block *sb)
> -{
> -	int level;
> -
> -	for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
> -		percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
> -}
> -
> -/*
> - * Tell lockdep we are holding these locks before we call ->unfreeze_fs(sb).
> - */
> -static void lockdep_sb_freeze_acquire(struct super_block *sb)
> -{
> -	int level;
> -
> -	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
> -		percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
> +	percpu_down_write_non_owner(sb->s_writers.rw_sem + level-1);
>  }

I'd suggest not changing fs/super.c, keeping these helpers, and not even introducing
xxx_write_non_owner().

freeze_super() takes other locks and calls sync_filesystem() and freeze_fs();
lockdep should know that this task holds the SB_FREEZE_XXX locks for writing.


> @@ -80,14 +83,8 @@ int __percpu_down_read(struct percpu_rw_
>  	 * and reschedule on the preempt_enable() in percpu_down_read().
>  	 */
>  	preempt_enable_no_resched();
> -
> -	/*
> -	 * Avoid lockdep for the down/up_read() we already have them.
> -	 */
> -	__down_read(&sem->rw_sem);
> +	wait_event(sem->waiters, !atomic_read(&sem->block));
>  	this_cpu_inc(*sem->read_count);

Argh, this looks racy :/

Suppose that sem->block == 0 when wait_event() is called, iow the writer released
the lock.

Now suppose that this __percpu_down_read() races with another percpu_down_write().
The new writer can set sem->block == 1 and call readers_active_check() in between,
after wait_event() and before this_cpu_inc(*sem->read_count).

Oleg.



* Re: [RT WARNING] DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current) with fsfreeze (4.19.25-rt16)
  2019-05-06 16:50               ` Oleg Nesterov
@ 2019-06-19  9:50                 ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-06-19  9:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sebastian Andrzej Siewior, linux-rt-users, LKML, Thomas Gleixner,
	Daniel Bristot de Oliveira, Clark Williams, Juri Lelli, jack,
	Waiman Long, Davidlohr Bueso


Sorry, I seem to have missed this email.

On Mon, May 06, 2019 at 06:50:09PM +0200, Oleg Nesterov wrote:
> On 05/03, Peter Zijlstra wrote:
> >
> > -static void lockdep_sb_freeze_release(struct super_block *sb)
> > -{
> > -	int level;
> > -
> > -	for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
> > -		percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
> > -}
> > -
> > -/*
> > - * Tell lockdep we are holding these locks before we call ->unfreeze_fs(sb).
> > - */
> > -static void lockdep_sb_freeze_acquire(struct super_block *sb)
> > -{
> > -	int level;
> > -
> > -	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
> > -		percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
> > +	percpu_down_write_non_owner(sb->s_writers.rw_sem + level-1);
> >  }
> 
> I'd suggest not changing fs/super.c, keeping these helpers, and not even introducing
> xxx_write_non_owner().
> 
> freeze_super() takes other locks and calls sync_filesystem() and freeze_fs();
> lockdep should know that this task holds the SB_FREEZE_XXX locks for writing.

Bah, I so hate these games. But OK, I suppose.

> > @@ -80,14 +83,8 @@ int __percpu_down_read(struct percpu_rw_
> >  	 * and reschedule on the preempt_enable() in percpu_down_read().
> >  	 */
> >  	preempt_enable_no_resched();
> > -
> > -	/*
> > -	 * Avoid lockdep for the down/up_read() we already have them.
> > -	 */
> > -	__down_read(&sem->rw_sem);
> > +	wait_event(sem->waiters, !atomic_read(&sem->block));
> >  	this_cpu_inc(*sem->read_count);
> 
> Argh, this looks racy :/
> 
> Suppose that sem->block == 0 when wait_event() is called, iow the writer released
> the lock.
> 
> Now suppose that this __percpu_down_read() races with another percpu_down_write().
> The new writer can set sem->block == 1 and call readers_active_check() in between,
> after wait_event() and before this_cpu_inc(*sem->read_count).


CPU0			CPU1			CPU2

percpu_up_write()
  sem->block = 0;

			__percpu_down_read()
			  wait_event(, !sem->block);

						percpu_down_write()
						  wait_event_exclusive(, xchg(sem->block,1)==0);
						  readers_active_check()

			  this_cpu_inc();

			  *whoopsy* reader while write owned.



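The interleaving above can be replayed as a deterministic single-threaded trace against a toy model (illustrative names only; C11 atomics stand in for the kernel primitives, and the three CPUs' steps are executed sequentially in the racy order):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Shared state, as in the patch: block flag plus summed read_count. */
static atomic_int block;
static atomic_int read_count;

/*
 * Replay of the CPU0/CPU1/CPU2 interleaving from the diagram.
 * Returns true if the bad final state is reached: a reader accounted
 * while the new writer believes it has exclusive access.
 */
static bool race_trace(void)
{
	atomic_store(&block, 1);               /* writer owns the lock */
	atomic_store(&read_count, 0);

	/* CPU0: percpu_up_write() */
	atomic_store_explicit(&block, 0, memory_order_release);

	/* CPU1: wait_event(, !block) observes block == 0, stops waiting */
	bool cpu1_saw_unblocked =
		!atomic_load_explicit(&block, memory_order_acquire);

	/* CPU2: percpu_down_write(): xchg wins, readers_active_check() sees 0 */
	bool cpu2_owns = atomic_exchange(&block, 1) == 0;
	bool readers_quiescent = atomic_load(&read_count) == 0;

	/* CPU1: this_cpu_inc(*sem->read_count) -- too late */
	if (cpu1_saw_unblocked)
		atomic_fetch_add(&read_count, 1);

	/* whoopsy: reader active while CPU2 holds the write lock */
	return cpu2_owns && readers_quiescent && atomic_load(&read_count) > 0;
}
```

The trace reaches the broken state on every run, since nothing between CPU1's check and its increment re-validates the flag.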
I suppose we can 'patch' that by checking blocking again after we've
incremented, something like the below.

But looking at percpu_down_write() we have two wait_event*() on the same
queue back to back, which is 'odd' at best. Let me ponder that a little
more.


---

--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -61,6 +61,7 @@ int __percpu_down_read(struct percpu_rw_
 	 * writer missed them.
 	 */
 
+again:
 	smp_mb(); /* A matches D */
 
 	/*
@@ -87,7 +88,13 @@ int __percpu_down_read(struct percpu_rw_
 	wait_event(sem->waiters, !atomic_read_acquire(&sem->block));
 	this_cpu_inc(*sem->read_count);
 	preempt_disable();
-	return 1;
+
+	/*
+	 * percpu_down_write() could've set ->block right after we've seen it
+	 * 0 but missed our this_cpu_inc(), which is exactly the condition we
+	 * get called for from percpu_down_read().
+	 */
+	goto again;
 }
 EXPORT_SYMBOL_GPL(__percpu_down_read);
 



