On 16/04/2018 04:40 PM, Jan Kara wrote:
> On Mon 16-04-18 15:25:50, Guillaume Morin wrote:
>> Fwiw, there have been already reports of similar soft lockups in
>> fsnotify() on 4.14: https://lkml.org/lkml/2018/3/2/1038
>>
>> We have also noticed similar softlockups with 4.14.22 here.
> 
> Yeah.
>  
>> On 16 Apr 13:54, Pavlos Parissis wrote:
>>>
>>> Hi all,
>>>

[..snip..]

>>> [373782.361064] watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [kube-apiserver:24261]
>>> [373782.378225] Modules linked in: binfmt_misc sctp_diag sctp dccp_diag dccp tcp_diag udp_diag
>>> inet_diag unix_diag cfg80211 rfkill dell_rbu 8021q garp mrp xfs libcrc32c loop x86_pkg_temp_thermal
>>> intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
>>> pcbc aesni_intel vfat fat crypto_simd glue_helper cryptd intel_cstate intel_rapl_perf iTCO_wdt ses
>>> iTCO_vendor_support mxm_wmi ipmi_si dcdbas enclosure mei_me pcspkr ipmi_devintf lpc_ich sg mei
>>> ipmi_msghandler mfd_core shpchp wmi acpi_power_meter netconsole nfsd auth_rpcgss nfs_acl lockd grace
>>> sunrpc ip_tables ext4 mbcache jbd2 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
>>> fb_sys_fops sd_mod ttm crc32c_intel ahci libahci mlx5_core drm mlxfw mpt3sas ptp libata raid_class
>>> pps_core scsi_transport_sas
>>> [373782.516807]  dm_mirror dm_region_hash dm_log dm_mod dax
>>> [373782.531739] CPU: 24 PID: 24261 Comm: kube-apiserver Not tainted 4.14.32-1.el7.x86_64 #1
>>> [373782.549848] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.4.3 01/17/2017
>>> [373782.567486] task: ffff882f66d28000 task.stack: ffffc9002120c000
>>> [373782.583441] RIP: 0010:fsnotify+0x197/0x510
>>> [373782.597319] RSP: 0018:ffffc9002120fdb8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff10
>>> [373782.615308] RAX: 0000000000000000 RBX: ffff882f9ec65c20 RCX: 0000000000000002
>>> [373782.632950] RDX: 0000000000028700 RSI: 0000000000000002 RDI: ffffffff8269a4e0
>>> [373782.650616] RBP: ffffc9002120fe98 R08: 0000000000000000 R09: 0000000000000000
>>> [373782.668287] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>>> [373782.685918] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>> [373782.703302] FS:  000000c42009f090(0000) GS:ffff882fbf900000(0000) knlGS:0000000000000000
>>> [373782.721887] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [373782.737741] CR2: 00007f82b6539244 CR3: 0000002f3de2a005 CR4: 00000000003606e0
>>> [373782.755247] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [373782.772722] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [373782.790043] Call Trace:
>>> [373782.802041]  vfs_write+0x151/0x1b0
>>> [373782.815081]  ? syscall_trace_enter+0x1cd/0x2b0
>>> [373782.829175]  SyS_write+0x55/0xc0
>>> [373782.841870]  do_syscall_64+0x79/0x1b0
>>> [373782.855073]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> 
> Can you please run RIP through ./scripts/faddr2line to see where exactly
> are we looping? I expect the loop iterating over marks to notify but better
> be sure.
> 

I am very new to this. I tried:
 ../repo/Linux/linux/scripts/faddr2line ./vmlinuz-4.14.32-1.el7.x86_64 \
     0010:fsnotify+0x197/0x510
readelf: Error: Not an ELF file - it has the wrong magic bytes at the start
size: ./vmlinuz-4.14.32-1.el7.x86_64: Warning: Ignoring section flag IMAGE_SCN_MEM_NOT_PAGED in section .bss
nm: ./vmlinuz-4.14.32-1.el7.x86_64: Warning: Ignoring section flag IMAGE_SCN_MEM_NOT_PAGED in section .bss
nm: ./vmlinuz-4.14.32-1.el7.x86_64: no symbols
size: ./vmlinuz-4.14.32-1.el7.x86_64: Warning: Ignoring section flag IMAGE_SCN_MEM_NOT_PAGED in section .bss
nm: ./vmlinuz-4.14.32-1.el7.x86_64: Warning: Ignoring section flag IMAGE_SCN_MEM_NOT_PAGED in section .bss
nm: ./vmlinuz-4.14.32-1.el7.x86_64: no symbols
no match for 0010:fsnotify+0x197/0x510

Obviously, I am doing something very wrong.
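
For reference, a rough sketch of how the lookup is probably meant to be
invoked. The readelf error suggests that vmlinuz (the compressed boot image)
is not an ELF file; faddr2line expects the uncompressed vmlinux built with
debug info, and a bare func+offset/size argument without the "0010:" segment
prefix that the oops prints. The ./vmlinux path, the FADDR2LINE and VMLINUX
variables and the rip_to_func_offset helper are assumptions for illustration
only:

```shell
#!/bin/sh
# Strip the segment selector ("0010:") that the oops prints in front of the
# symbol; faddr2line takes a plain func+offset/size argument.
rip_to_func_offset() {
    printf '%s\n' "${1#*:}"
}

# Hypothetical locations: point these at the real kernel tree and at the
# uncompressed, unstripped vmlinux (not the vmlinuz boot image).
FADDR2LINE=${FADDR2LINE:-../repo/Linux/linux/scripts/faddr2line}
VMLINUX=${VMLINUX:-./vmlinux}

addr=$(rip_to_func_offset "0010:fsnotify+0x197/0x510")

if [ -x "$FADDR2LINE" ] && [ -f "$VMLINUX" ]; then
    "$FADDR2LINE" "$VMLINUX" "$addr"
else
    # Nothing to run against; just show the command that would be used.
    echo "would run: $FADDR2LINE $VMLINUX $addr"
fi
```

Whether the result points into the loop over marks still depends on the
vmlinux matching the running 4.14.32-1.el7 build exactly.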

> How easily can you hit this?

Very easily, I only need to wait 1-2 days for a crash to occur.

> Are you able to run debug kernels

Well, I was under the impression that I already do, since I have:
  grep -E 'DEBUG_KERNEL|DEBUG_INFO' /boot/config-4.14.32-1.el7.x86_64
  CONFIG_DEBUG_INFO=y
  # CONFIG_DEBUG_INFO_REDUCED is not set
  # CONFIG_DEBUG_INFO_SPLIT is not set
  # CONFIG_DEBUG_INFO_DWARF4 is not set
  CONFIG_DEBUG_KERNEL=y

Do you think that my kernel doesn't produce a proper crash dump?
I have a production cluster where I can run any kernel we need, so if I need
to compile again with different settings I can certainly do that.

> / inspect
> crash dumps when the issue occurs?

I can't do that as the server isn't responsive and I can only power cycle it.

> Also testing with the latest mainline
> kernel (4.16) would be welcome whether this isn't just an issue with the
> backport of fsnotify fixes from Miklos.

I can try the kernel-ml-4.16.2 from elrepo (we use CentOS 7).

Thanks a lot for your reply.
Pavlos Parissis