On 17/04/2018 02:12 PM, Jan Kara wrote:
> On Tue 17-04-18 01:31:24, Pavlos Parissis wrote:
>> On 16/04/2018 04:40 PM, Jan Kara wrote:
>
>>> How easily can you hit this?
>>
>> Very easily, I only need to wait 1-2 days for a crash to occur.
>
> I wouldn't call that very easily but opinions may differ :). Anyway it's
> good (at least for debugging) that it's reproducible.
>

Unfortunately, I can't reproduce it, so waiting 1-2 days is the only option I have.

>>> Are you able to run debug kernels
>>
>> Well, I was under the impression I do as I have:
>> grep -E 'DEBUG_KERNEL|DEBUG_INFO' /boot/config-4.14.32-1.el7.x86_64
>> CONFIG_DEBUG_INFO=y
>> # CONFIG_DEBUG_INFO_REDUCED is not set
>> # CONFIG_DEBUG_INFO_SPLIT is not set
>> # CONFIG_DEBUG_INFO_DWARF4 is not set
>> CONFIG_DEBUG_KERNEL=y
>>
>> Do you think that my kernel doesn't produce a proper crash dump?
>> I have a production cluster where I can run any kernel we need, so if I need
>> to compile again with different settings I can certainly do that.
>
> OK, good. So please try running 4.16 as you mention below to verify whether
> this is just a -stable regression or also a problem in the current upstream
> kernel. Based on your results with 4.16 I'll prepare a debug patch for you to
> apply on top of 4.14.32 so that we can debug this further.
>
>>> / inspect
>>> crash dumps when the issue occurs?
>>
>> I can't do that as the server isn't responsive and I can only power cycle it.
>
> Well, kernel crash dumps work in that situation as well - when the kernel
> panics, it will kexec into a new kernel and dump memory of the old kernel
> to disk. It can then be investigated with the 'crash' utility. But
> obviously you don't have this set up and don't have experience with this so
> let's go via a standard 'debug patch' route.
>
>>> Also testing with the latest mainline
>>> kernel (4.16) would be welcome whether this isn't just an issue with the
>>> backport of fsnotify fixes from Miklos.
>>
>> I can try the kernel-ml-4.16.2 from elrepo (we use CentOS 7).
>
> Yes, that would be good.
>

I have a production server running 4.16.2 and no kernel crash dumps yet.
Let's wait another day before we say anything.

Cheers,
Pavlos
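
P.S. For reference, the kexec-based crash dump setup mentioned above only works if memory was reserved for a capture kernel at boot (the crashkernel= parameter) and a capture kernel has been loaded. A minimal sketch for checking both on a running box might look like the following. The helper name has_crashkernel is made up for illustration; /proc/cmdline and /sys/kernel/kexec_crash_loaded are the standard kernel interfaces, but whether the sysfs file is present depends on the kernel having kexec support built in.

```shell
#!/bin/sh
# has_crashkernel (hypothetical helper): succeeds if the kernel command
# line passed as $1 contains a crashkernel= memory reservation.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Read the command line the running kernel was booted with.
cmdline=$(cat /proc/cmdline 2>/dev/null || true)

if has_crashkernel "$cmdline"; then
    echo "crashkernel= reservation present"
else
    echo "no crashkernel= reservation, a capture kernel cannot be loaded"
fi

# This file reads 1 when a capture kernel is loaded and ready.
if [ -r /sys/kernel/kexec_crash_loaded ]; then
    echo "kexec_crash_loaded: $(cat /sys/kernel/kexec_crash_loaded)"
fi
```

If both checks pass, a panic should leave a vmcore behind that the 'crash' utility can open together with a matching vmlinux that has debug info.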