From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751855AbdGQOq6 convert rfc822-to-8bit (ORCPT <rfc822;w@1wt.eu>);
        Mon, 17 Jul 2017 10:46:58 -0400
Received: from mga09.intel.com ([134.134.136.24]:52131 "EHLO mga09.intel.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1751320AbdGQOq5 (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Mon, 17 Jul 2017 10:46:57 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.40,374,1496127600"; 
   d="scan'208";a="1196345615"
From: "Liang, Kan" <kan.liang@intel.com>
To: Thomas Gleixner <tglx@linutronix.de>
CC: Don Zickus <dzickus@redhat.com>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        "mingo@kernel.org" <mingo@kernel.org>,
        "akpm@linux-foundation.org" <akpm@linux-foundation.org>,
        "babu.moger@oracle.com" <babu.moger@oracle.com>,
        "atomlin@redhat.com" <atomlin@redhat.com>,
        "prarit@redhat.com" <prarit@redhat.com>,
        "torvalds@linux-foundation.org" <torvalds@linux-foundation.org>,
        "peterz@infradead.org" <peterz@infradead.org>,
        "eranian@google.com" <eranian@google.com>,
        "acme@redhat.com" <acme@redhat.com>,
        "ak@linux.intel.com" <ak@linux.intel.com>,
        "stable@vger.kernel.org" <stable@vger.kernel.org>
Subject: RE: [PATCH V2] kernel/watchdog: fix spurious hard lockups
Thread-Topic: [PATCH V2] kernel/watchdog: fix spurious hard lockups
Thread-Index: AQHS6pyX93nZMscYGUu+rlSjZMkkxKIvVm6AgAErMwCAARD/gIAAjbWAgABZxoCABJ2VAIABkHmAgB6zMiD//+HwAIAA0r+Q//+RnoCAAIguAA==
Date: Mon, 17 Jul 2017 14:46:53 +0000
Message-ID: <37D7C6CF3E00A74B8858931C1DB2F0775371D9AE@SHSMSX103.ccr.corp.intel.com>
References: <20170621144118.5939-1-kan.liang@intel.com>
 <alpine.DEB.2.20.1706212235550.2152@nanos>
 <20170622154450.2lua7fdmigcixldw@redhat.com>
 <alpine.DEB.2.20.1706230952190.2647@nanos>
 <20170623162907.l6inpxgztwwkeaoi@redhat.com>
 <alpine.DEB.2.20.1706232348020.2234@nanos>
 <20170626201927.3ak7fk3yvdzbb4ay@redhat.com>
 <20170627201249.ll34ecwhpme3vh2u@redhat.com>
 <37D7C6CF3E00A74B8858931C1DB2F0775371D43E@SHSMSX103.ccr.corp.intel.com>
 <alpine.DEB.2.20.1707170912370.2185@nanos>
 <37D7C6CF3E00A74B8858931C1DB2F0775371D8AA@SHSMSX103.ccr.corp.intel.com>
 <alpine.DEB.2.20.1707171503400.2185@nanos>
In-Reply-To: <alpine.DEB.2.20.1707171503400.2185@nanos>
Accept-Language: zh-CN, en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiMzA2YmU2YzAtMzU0OC00ZjY5LWJiMTItMWFmMDZlNDg3MzExIiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX0lDIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE2LjUuOS4zIiwiVHJ1c3RlZExhYmVsSGFzaCI6InlaSU45NjdYYkYwS1dJNEFqTkNkR0tJZ1c4QmJhSXZocnpvMitrK3JWM0E9In0=
x-ctpclassification: CTP_IC
dlp-product: dlpe-windows
dlp-version: 10.0.102.7
dlp-reaction: no-action
x-originating-ip: [10.239.127.40]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8BIT
MIME-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


> On Mon, 17 Jul 2017, Liang, Kan wrote:
> > > That doesn't make sense. What's the exact test procedure?
> >
> > I don't know the exact test procedure. The test case is from our customer.
> > I only know that the test case makes calls into the x11 libs.
> 
> Sigh. This starts to be silly. You test something and have no idea what it does?

As I said, the test case is from our customer. They only share binaries with us.
Actually, it is more accurate to call it a test suite. It includes dozens of small tests.
I just reproduced the issue and tested all three patches in our lab,
then reported the results here immediately, as requested.
So I know little about the test case for now. 
I will share more when I learn more.
Sorry about that.

> 
> > > > According to our test, only patch 3 works well.
> > > > The other two patches will hang the system eventually.
> 
> Hang the system eventually? Does that mean that the system stops working
> and the watchdog does not catch the problem?


Right, the system stops working and the watchdog does not catch the problem.

> 
> > > > BTW: We set 1 to watchdog_thresh when we did the test.
> > > > It's believed that can speed up the failure.
> > >
> > > Believe is not really a technical measure....
> > >
> >
> > 1 is a valid value for watchdog_thresh.
> > It was set through the standard proc interface.
> > /proc/sys/kernel/watchdog_thresh
> > It should not impacts the final test result.
> 
> I know that 1 is a valid value and I know how that can be set. Still, it does not
> help if you believe that setting the threshold to 1 can speed up the failure.
> Either you know it for sure or not. You can believe in god or whatever, but
> here we talk about facts.

I have not personally compared 1 against the default of 10 for this
test case.
Before we got the test case from the customer, we developed a separate
micro test which reproduces a similar issue.
For that micro test, setting watchdog_thresh to 1 speeds up the failure.
(BTW: all three patches fix the issue reproduced by that micro test.)

If you think it is worthwhile to test with 10 as well, I can run the comparison.

Thanks,
Kan