From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752653AbaEVXtD (ORCPT ); Thu, 22 May 2014 19:49:03 -0400
Received: from mail-qg0-f47.google.com ([209.85.192.47]:60639
	"EHLO mail-qg0-f47.google.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1751430AbaEVXtB (ORCPT );
	Thu, 22 May 2014 19:49:01 -0400
MIME-Version: 1.0
In-Reply-To: <1400233673-11477-1-git-send-email-vbabka@suse.cz>
References: <1399904111-23520-1-git-send-email-vbabka@suse.cz>
	<1400233673-11477-1-git-send-email-vbabka@suse.cz>
Date: Thu, 22 May 2014 16:49:00 -0700
Message-ID:
Subject: Re: [PATCH v2] mm, compaction: properly signal and act upon lock and need_sched() contention
From: Kevin Hilman
To: Vlastimil Babka
Cc: Joonsoo Kim, Andrew Morton, David Rientjes, Hugh Dickins,
	Greg Thelen, LKML, linux-mm@kvack.org, Minchan Kim, Mel Gorman,
	Bartlomiej Zolnierkiewicz, Michal Nazarewicz, Christoph Lameter,
	Rik van Riel, Olof Johansson, Stephen Warren
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, May 16, 2014 at 2:47 AM, Vlastimil Babka wrote:
> Compaction uses compact_checklock_irqsave() function to periodically check for
> lock contention and need_resched() to either abort async compaction, or to
> free the lock, schedule and retake the lock. When aborting, cc->contended is
> set to signal the contended state to the caller. Two problems have been
> identified in this mechanism.

This patch (or a later version) has hit next-20140522 (in the form of
commit 645ceea9331bfd851bc21eea456dda27862a10f4) and, according to my
bisect, appears to be the culprit of several boot failures on ARM
platforms.

Unfortunately, there isn't much useful in the logs of the boot
failures/hangs since they mostly silently hang.
However, on one platform (Marvell Armada 370 Mirabox), it reports a
failure to allocate memory, and the RCU stall detection kicks in:

[ 1.298234] xhci_hcd 0000:02:00.0: xHCI Host Controller
[ 1.303485] xhci_hcd 0000:02:00.0: new USB bus registered, assigned bus number 1
[ 1.310966] xhci_hcd 0000:02:00.0: Couldn't initialize memory
[ 22.245395] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, t=2102 jiffies, g=-282, c=-283, q=16)
[ 22.255886] INFO: Stall ended before state dump start
[ 48.095396] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]

Reverting this commit makes them all happy again.

Kevin