From: Indivar Nair
Date: Thu, 11 Jul 2019 12:19:21 +0530
Subject: rpc.statd dies because of pacemaker monitoring
To: linux-nfs@vger.kernel.org

Hi ...,

I have a 2-node Pacemaker cluster built using CentOS 7.6.1810. It serves files using NFS and Samba.

Every 15 - 20 minutes, the rpc.statd service fails, and the whole NFS service is restarted. After investigation, it was found that the service fails after a few rounds of monitoring by Pacemaker.
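One way to confirm that correlation is to log the rpc.statd PID next to a timestamp while the cluster is running and compare it with the Pacemaker log. This is only a minimal sketch, not part of the cluster configuration; it assumes pgrep is available and uses /tmp/statd-watch.log purely as an example path:

---------------------------------------------------------------------------------------------------------------------------------------
#!/bin/sh
# Record the rpc.statd PID every 5 seconds; an empty or changed PID marks the
# moment statd died and was restarted by the resource agent.
while true; do
    printf '%s statd_pid=%s\n' "$(date '+%F %T')" "$(pgrep -x rpc.statd | tr '\n' ' ')"
    sleep 5
done >> /tmp/statd-watch.log
---------------------------------------------------------------------------------------------------------------------------------------

Comparing those timestamps with the 20-second monitor interval shows whether statd exits right after a monitor run.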
Pacemaker's monitoring script runs the following commands to check whether all the services are running -
---------------------------------------------------------------------------------------------------------------------------------------
rpcinfo > /dev/null 2>&1
rpcinfo -t localhost 100005 > /dev/null 2>&1
nfs_exec status nfs-idmapd > $fn 2>&1
rpcinfo -t localhost 100024 > /dev/null 2>&1
---------------------------------------------------------------------------------------------------------------------------------------
The script is scheduled to check every 20 seconds.

This is the message we get in the logs -
-------------------------------------------------------------------------------------------------------------------------------------
Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: check_default: access by 127.0.0.1 ALLOWED
Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: Received NULL request from 127.0.0.1
Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: check_default: access by 127.0.0.1 ALLOWED (cached)
Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: Received NULL request from 127.0.0.1
Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: check_default: access by 127.0.0.1 ALLOWED (cached)
Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: Received NULL request from 127.0.0.1
-------------------------------------------------------------------------------------------------------------------------------------

After 10 seconds, we get this message -
-------------------------------------------------------------------------------------------------------------------------------------
Jul 09 07:34:09 virat-nd01 nfsserver(virat-nfs-daemon)[54087]: ERROR: rpc-statd is not running
-------------------------------------------------------------------------------------------------------------------------------------
Once we get this error, the NFS service is automatically restarted.

The "ERROR: rpc-statd is not running" message comes from Pacemaker's monitoring script. I have pasted that part of the script below.

I disabled monitoring, and everything has been working fine since then. But I can't keep the cluster monitoring disabled forever.

Kindly help.

Regards,


Indivar Nair

Part of the Pacemaker script that does the monitoring (/usr/lib/ocf/resource.d/heartbeat/nfsserver)
=======================================================================
nfsserver_systemd_monitor()
{
    local threads_num
    local rc
    local fn

    ocf_log debug "Status: rpcbind"
    rpcinfo > /dev/null 2>&1
    rc=$?
    if [ "$rc" -ne "0" ]; then
        ocf_exit_reason "rpcbind is not running"
        return $OCF_NOT_RUNNING
    fi

    ocf_log debug "Status: nfs-mountd"
    rpcinfo -t localhost 100005 > /dev/null 2>&1
    rc=$?
    if [ "$rc" -ne "0" ]; then
        ocf_exit_reason "nfs-mountd is not running"
        return $OCF_NOT_RUNNING
    fi

    ocf_log debug "Status: nfs-idmapd"
    fn=`mktemp`
    nfs_exec status nfs-idmapd > $fn 2>&1
    rc=$?
    ocf_log debug "$(cat $fn)"
    rm -f $fn
    if [ "$rc" -ne "0" ]; then
        ocf_exit_reason "nfs-idmapd is not running"
        return $OCF_NOT_RUNNING
    fi

    ocf_log debug "Status: rpc-statd"
    rpcinfo -t localhost 100024 > /dev/null 2>&1
    rc=$?
    if [ "$rc" -ne "0" ]; then
        ocf_exit_reason "rpc-statd is not running"
        return $OCF_NOT_RUNNING
    fi

    nfs_exec is-active nfs-server
    rc=$?

    # Now systemctl is-active can't detect the failure of kernel process like nfsd.
    # So, if the return value of systemctl is-active is 0, check the threads number
    # to make sure the process is running really.
    # /proc/fs/nfsd/threads has the numbers of the nfsd threads.
    if [ $rc -eq 0 ]; then
        threads_num=`cat /proc/fs/nfsd/threads 2>/dev/null`
        if [ $? -eq 0 ]; then
            if [ $threads_num -gt 0 ]; then
                return $OCF_SUCCESS
            else
                return 3
            fi
        else
            return $OCF_ERR_GENERIC
        fi
    fi

    return $rc
}
=======================================================================
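The probe that fails in this function is the rpcinfo call for RPC program 100024 (status, i.e. rpc.statd). A minimal way to exercise just that probe outside Pacemaker is a loop like the one below; the 20-second interval and the log path are assumptions for illustration, not values taken from the cluster configuration:

---------------------------------------------------------------------------------------------------------------------------------------
#!/bin/sh
# Send the same NULL call to rpc-statd that nfsserver_systemd_monitor() sends,
# and log a timestamp whenever the probe fails.
while true; do
    if ! rpcinfo -t localhost 100024 > /dev/null 2>&1; then
        printf '%s rpc-statd probe FAILED\n' "$(date '+%F %T')"
    fi
    sleep 20
done >> /tmp/statd-probe.log
---------------------------------------------------------------------------------------------------------------------------------------

If rpc.statd also dies with only this loop running (cluster monitoring still disabled), the NULL probes themselves are the trigger; if it stays up, the problem lies elsewhere in the agent's monitor action.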