From: Changwei Ge <chge@linux.alibaba.com>
To: ocfs2-devel@oss.oracle.com
Subject: [Ocfs2-devel] NFS clients crash OCFS2 nodes (general protection fault: 0000 [#1] SMP PTI)
Date: Fri, 22 May 2020 09:35:04 +0800	[thread overview]
Message-ID: <fbf35187-d5bd-55f2-ad51-5ecbc0b9b7de@linux.alibaba.com> (raw)
In-Reply-To: <fe1c8848-f25b-193e-d488-dedfb37d45b7@qsdf.org>



On 5/22/20 1:02 AM, Pierre Dinh-van wrote:
> On 5/20/20 4:52 AM, Changwei Ge wrote:
>> Hi Pierre,
> 
> Hi Changwei,
> 
>> You have indeed provided detailed information on this problem, and the
> research looks very nice.
> 
> Thank you
> 
>> On 5/19/20 6:48 PM, Pierre Dinh-van wrote:
>>>
>> Hi,
>>
>> I've been experiencing quite reproducible crashes on my OCFS2 cluster
>> over the last few weeks, and I think I've found part of the problem.
>> But before I file a bug report (I'm not sure where), I thought I could
>> post my research here in case someone understands it better than I do.
>>
>> When it happens, the kernel gives me a 'general protection fault:
>> 0000 [#1] SMP PTI' or 'BUG: unable to handle page fault for address:'
>> message, all local accesses to the OCFS2 volumes hang forever, and the
>> load gets so high that only a hard reset "solves" the problem. It
>> happened the same way on both nodes (always on the active node serving
>> NFS). It happens whether both nodes are up or only one is active.
>>
>>> This makes me a little puzzled: do you mean you crashed the kernel
>> manually, or did ocfs2 crash the kernel itself? From the backtrace you
>> posted, I assume that you crashed the kernel manually to migrate the
>> NFS export IP to the other standby node, right?
>>
> I'm not crashing the kernel manually (I wouldn't know how to). Ocfs2
> crashes the kernel as soon as NFS requests come in, once the node has
> started serving NFS.
> 
> Maybe I'm using the term "crash the kernel" the wrong way. Actually,
> only the OCFS2 part of the system stops responding after this crash; I
> can still log in with SSH and try to do things.
> 
> 
>> First, a few words about my setup :
> 
>> In fact, I am not very experienced with this kind of setup where ocfs2
> serves as the NFS/Samba backend filesystem, but the use case seems fine
> to me. But considering that you have hundreds of clients connected to a
> single NFS/v4 server, do you hit a performance limitation? This has
> nothing to do with the problem, just my personal curiosity.
> 
> Most of the NFS clients are Linux desktop clients, some MacOSX too. The
> accesses are not really I/O intensive, so I don't have performance
> issues from having only one active node. Actually, the first performance
> problem we had was because of o2cb speaking over a 1Gb ethernet link,
> which increased lstat() times a lot compared to the XFS we had before.
> The cluster setup was mostly designed to give us an easy failover
> system, and OCFS2 gave the impression of reducing the risk of a
> corrupted filesystem (and allowed samba to be load balanced too).

For a load-balancing use case, it might have an effect on ocfs2
performance. IMO, it works well for the high-availability scenario.

> 
> 
>> 4) nodeB's load is getting high
>>
>>> I suppose the reason is that ocfs2 can't successfully acquire the
>> spin_lock of a system inode (an ocfs2-internal mechanism for storing
>> its metadata, not visible to end users). As you know, the locking path
>> won't stop trying until it gets the lock, so it consumes a lot of CPU
>> cycles; that is probably why you observed a high system load.
> 
> 
> So a possible path would be something like this? (I don't know much
> about the internals of NFS locking and POSIX locking, so it's a guess):
> 
> NFS client has a lock -> server is rebooted and forgets about the locks
> -> NFS client requests the lock again (maybe ignoring the grace time or
> racing with the nfsd restart) -> nfsd asks the FS about an unknown lock
> and that crashes the OCFS2 module somehow.
> 

Not really; the lock I mentioned is neither a POSIX lock nor part of the
NFSv4 lock service. It's a kernel-internal synchronization mechanism (a
spinlock).
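
To make the distinction concrete, below is a rough, generic sketch (not
ocfs2 code; the names are made up for illustration) of the kind of
kernel spinlock I mean. A task that tries to take a spinlock busy-waits
on the CPU until the holder releases it, which is why a lock that is
never released shows up as extreme system load rather than as a
sleeping process:

/*
 * Generic illustration only, not ocfs2 code: kernel spinlock callers
 * busy-wait (spin) on the CPU until the lock becomes free.
 */
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(example_lock);   /* hypothetical lock */
static int shared_counter;

static void bump_counter(void)
{
        spin_lock(&example_lock);       /* spins, burning CPU, until acquired */
        shared_counter++;
        spin_unlock(&example_lock);     /* if a holder never gets here, every
                                         * other CPU calling bump_counter()
                                         * spins forever */
}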

> 
>> Most of the time, when the IP jumps to the other node, I get some of
>> this messages :
>>
>>
>> [? 501.707566] (nfsd,2133,14):ocfs2_test_inode_bit:2841 ERROR: unable to
>> get alloc inode in slot 65535
>>
>>> This is very strange. I think ocfs2 passes the invalid slot number
>> on to a subsequent consumer function, which makes it use a dangling
>> pointer to a fake/illegal system inode. In other words, this is why
>> ocfs2 can't acquire the spin_lock of the system inode.
>>
>>
>>> Can you try to fix the logic in `ocfs2_get_suballoc_slot_bit` that
>> judges whether the slot number is valid, and then run your case once
>> more?
>>
>>> Feel free to ask my help if you need. :-)
>>
> I've never looked at the OCFS2 code and I'm not familiar with filesystem
> internals. I actually wrote to ocfs2-devel because ocfs2-users is not
> letting my mails through :).
> 
> For testing the case, I will try to do something on the standby node
> with the original ocfs2 volumes as soon as I'm sure it's safe for the
> production server. But as for checking the logic in
> ocfs2_get_suballoc_slot_bit, I've looked at the code and don't really
> understand it.
> 

Not a problem. I will see if I can allocate some bandwidth to fix the
error-judgement logic.
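
For reference, the kind of check I have in mind looks roughly like the
sketch below. This is only an illustration based on my reading of
fs/ocfs2/suballoc.c, not a tested patch; the helper name
check_suballoc_slot is made up, and the exact fields and macros should
be verified against the tree:

/*
 * Sketch only: reject an on-disk suballoc slot that is OCFS2_INVALID_SLOT
 * (which, if I remember the headers correctly, truncates to the 65535 you
 * see in your logs) or out of range, before anyone tries to look up that
 * slot's alloc inode and ends up dereferencing a bogus system inode.
 */
static int check_suballoc_slot(struct ocfs2_super *osb,
                               struct ocfs2_dinode *di)
{
        u16 slot = le16_to_cpu(di->i_suballoc_slot);

        if (slot == (u16)OCFS2_INVALID_SLOT || slot >= osb->max_slots) {
                mlog(ML_ERROR, "inode %llu has invalid suballoc slot %u\n",
                     (unsigned long long)le64_to_cpu(di->i_blkno), slot);
                return -EINVAL; /* callers such as ocfs2_test_inode_bit()
                                 * should then bail out cleanly */
        }
        return 0;
}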

Thanks,
Changwei

> 
>>
>> [? 501.716660] (nfsd,2133,14):ocfs2_test_inode_bit:2867 ERROR: status
>> = -22
>>
>>> Why this happens, and whether it is another issue, needs further investigation.
>>
>> [? 501.726585] (nfsd,2133,6):ocfs2_test_inode_bit:2841 ERROR: unable to
>> get alloc inode in slot 65535
>> [? 501.735579] (nfsd,2133,6):ocfs2_test_inode_bit:2867 ERROR: status = -22
>>
>> But it also happens when the node is not crashing, so I think it's
>> another problem.
>>
>> Last night, I saved some of the kernel output while the servers were
>> crashing one after the other like in a ping-pong game:
>>
>>
>> -----8< on nodeB? 8<-----
>>
>> [? 502.070431] nfsd: nfsv4 idmapping failing: has idmapd not been started?
>> [? 502.475747] general protection fault: 0000 [#1] SMP PTI
>> [? 502.481027] CPU: 6 PID: 2104 Comm: nfsd Not tainted
>> 5.5.0-0.bpo.2-amd64 #1 Debian 5.5.17-1~bpo10+1
>> []...
>> I also recorded some NFS traffic with tcpdump while the crashes were
>> occurring, but I haven't had time to analyse it yet. I can provide the
>> captures if it helps (but I might have to clean them up first to avoid
>> some data leaks :)
>>
>> After this happened a few times, I had one case where one of the few
>> NFSv4 clients was involved, and as a workaround I migrated the NFSv4
>> share to XFS since I thought it could be NFSv4-related. Now that I have
>> looked at the kernel dumps, I think it is NFSv3-related as well.
>>
>> Is it a kernel bug or am I missing something ?
>>
>>> Probably an ocfs2-internal error-handling bug. :-(
> 
> I'll check if I can reproduce it in a safe testing environment.
> 
> Is there anything I should also do while reproducing it to get a better
> understanding of the failure path? I have never done kernel debugging
> before.
> 
> 
>>
>> For now, I will switch back to XFS with a failover setup for my
>> cluster, because I cannot afford this kind of crash on a central file
>> server.
>>
>>
>>
>>> Interesting, do you mean you can use XFS as an NFSv4 backend
>> filesystem that is formatted on a shared SCSI LUN with more than one
>> host attached, without filesystem corruption?
>>
>>
> I switched back to a single-active-node setup. The second node is in
> standby and doesn't touch the XFS volumes until the first node is no
> longer using them.
> 
> It should be safe for now.
> 
> 
> Cheers
> 
> 
> Pierre
> 
> 
> 


Thread overview: 8+ messages
2020-05-19 10:48 [Ocfs2-devel] NFS clients crash OCFS2 nodes (general protection fault: 0000 [#1] SMP PTI) Pierre Dinh-van
2020-05-20  2:52 ` Changwei Ge
2020-05-21 17:02   ` Pierre Dinh-van
2020-05-22  1:35     ` Changwei Ge [this message]
2020-05-20 22:01 ` herbert.van.den.bergh at oracle.com
2020-06-02 10:10   ` Pierre Dinh-van
2020-06-02 15:56     ` herbert.van.den.bergh at oracle.com
2020-06-12  1:36       ` Junxiao Bi
