From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1752569AbdJ3OB5 (ORCPT <rfc822;w@1wt.eu>);
        Mon, 30 Oct 2017 10:01:57 -0400
Received: from smtp.codeaurora.org ([198.145.29.96]:35686 "EHLO
        smtp.codeaurora.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751677AbdJ3OBz (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Mon, 30 Oct 2017 10:01:55 -0400
DMARC-Filter: OpenDMARC Filter v1.3.2 smtp.codeaurora.org 10A40605A4
Authentication-Results: pdx-caf-mail.web.codeaurora.org; dmarc=none (p=none dis=none) header.from=codeaurora.org
Authentication-Results: pdx-caf-mail.web.codeaurora.org; spf=none smtp.mailfrom=tbaicar@codeaurora.org
Subject: Re: [ghes_copy_tofrom_phys] BUG: sleeping function called from
 invalid context at mm/page_alloc.c:4150
To: Borislav Petkov <bp@suse.de>, Fengguang Wu <fengguang.wu@intel.com>
Cc: Huang Ying <ying.huang@intel.com>,
        Chen Gong <gong.chen@linux.intel.com>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        Will Deacon <will.deacon@arm.com>,
        "Rafael J. Wysocki" <rjw@rjwysocki.net>
References: <CA+55aFxSJGeN=2X-uX-on1Uq2Nb8+v1aiMDz5H1+tKW_N5Q+6g@mail.gmail.com>
 <20171029225155.qcum5i75awrt5tzm@wfg-t540p.sh.intel.com>
 <20171029231835.3725fnd5yehlmqob@wfg-t540p.sh.intel.com>
 <20171030110511.scfrdtlnf5lbdhu5@pd.tnic>
From: Tyler Baicar <tbaicar@codeaurora.org>
Message-ID: <2d40fa2f-0b88-a466-fc67-26653f5f72e8@codeaurora.org>
Date: Mon, 30 Oct 2017 10:01:52 -0400
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101
 Thunderbird/52.4.0
MIME-Version: 1.0
In-Reply-To: <20171030110511.scfrdtlnf5lbdhu5@pd.tnic>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Content-Language: en-US
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 10/30/2017 7:05 AM, Borislav Petkov wrote:
> On Mon, Oct 30, 2017 at 12:18:35AM +0100, Fengguang Wu wrote:
>> CC related developers for the BUG in v4.14-rc6.
>>
>> On Sun, Oct 29, 2017 at 11:51:55PM +0100, Fengguang Wu wrote:
>>> Hi Linus,
>>>
>>> Up to now we see the below boot error/warnings when testing v4.14-rc6.
>>>
>>> They hit the RC release mainly due to various imperfections in 0day's
>>> auto bisection. So I manually list them here and CC the likely easy to
>>> debug ones to the corresponding maintainers in the followup emails.
>>>
>>> boot_successes: 4700
>>> boot_failures: 247
>>>
>>> BUG:kernel_hang_in_test_stage: 152
>>> BUG:kernel_reboot-without-warning_in_test_stage: 10
>>> BUG:sleeping_function_called_from_invalid_context_at_kernel/locking/mutex.c: 1
>>> BUG:sleeping_function_called_from_invalid_context_at_kernel/locking/rwsem.c: 3
>>> BUG:sleeping_function_called_from_invalid_context_at_mm/page_alloc.c: 21
>> Here is the dmesg fragment:
>>
>> [   47.597981] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x26d34d96462, max_idle_ns: 440795289520 ns
>> [   48.626601] clocksource: Switched to clocksource tsc
>> [   49.273620] ERST: Error Record Serialization Table (ERST) support is initialized.
>> [   49.290288] pstore: using zlib compression
>> [   49.299588] pstore: Registered erst as persistent store backend
>> [   49.311408] BUG: sleeping function called from invalid context at mm/page_alloc.c:4150
>> [   49.312031] in_atomic(): 1, irqs_disabled(): 1, pid: 1, name: swapper/0
>> [   49.312031] CPU: 37 PID: 1 Comm: swapper/0 Not tainted 4.14.0-rc6 #1
>> [   49.312031] Hardware name: Intel Corporation S2600WP/S2600WP, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
>> [   49.312031] Call Trace:
>> [   49.312031]  dump_stack+0x63/0x86
>> [   49.312031]  ___might_sleep+0xf1/0x110
>> [   49.312031]  __might_sleep+0x4a/0x80
>> [   49.312031]  __alloc_pages_nodemask+0x14e/0x270
>> [   49.312031]  alloc_page_interleave+0x17/0x80
>> [   49.312031]  alloc_pages_current+0xc8/0xe0
>> [   49.312031]  __get_free_pages+0xe/0x40
>> [   49.312031]  pte_alloc_one_kernel+0x15/0x20
>> [   49.312031]  __pte_alloc_kernel+0x1d/0x100
>> [   49.312031]  ioremap_page_range+0x330/0x3a0
>> [   49.312031]  ghes_copy_tofrom_phys+0x182/0x2b0
>> [   49.312031]  ghes_read_estatus+0x76/0x140
>> [   49.312031]  ghes_proc+0x1c/0x130
>> [   49.312031]  ghes_probe+0x157/0x430
>> [   49.312031]  platform_drv_probe+0x3b/0xa0
>> [   49.312031]  driver_probe_device+0x29c/0x450
>> [   49.312031]  __driver_attach+0xdf/0xf0
>> [   49.312031]  ? driver_probe_device+0x450/0x450
>> [   49.312031]  bus_for_each_dev+0x60/0xa0
>> [   49.312031]  driver_attach+0x1e/0x20
>> [   49.312031]  bus_add_driver+0x170/0x260
>> [   49.312031]  ? set_debug_rodata+0x17/0x17
>> [   49.312031]  driver_register+0x60/0xe0
>> [   49.312031]  __platform_driver_register+0x36/0x40
>> [   49.312031]  ghes_init+0x10f/0x199
>> [   49.312031]  ? bert_init+0x215/0x215
>> [   49.312031]  do_one_initcall+0x43/0x170
>> [   49.312031]  ? set_debug_rodata+0x17/0x17
>> [   49.312031]  kernel_init_freeable+0x198/0x220
>> [   49.312031]  ? rest_init+0xd0/0xd0
>> [   49.312031]  kernel_init+0xe/0x101
>> [   49.312031]  ret_from_fork+0x25/0x30
>> [   49.670116] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.
>> [   49.691436] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
>> [   49.729954] 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
>> [   49.767235] Non-volatile memory driver v1.3
>> [   49.778363] Linux agpgart interface v0.103
> Looks like Tyler broke it:
>
> 77b246b32b2c ("acpi: apei: check for pending errors when probing GHES entries")
>
> and it went into 4.13 and -stable.
>
> Tyler, why is it so important to do the polling immediately upon
> registration? Can't we wait until the polling does it?
This is less important for polling sources than for interrupt sources, since
polling sources are checked regularly and shouldn't be used for fatal error
scenarios. For interrupt-driven sources, there could already be a fatal error
pending, so we should handle it immediately. It's also possible that the
interrupt was cleared because it fired before GHES probing, in which case the
error would never get serviced. An example of that would be interrupts handled
through the GED driver.
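
To make the missed-interrupt case concrete, here is a minimal userspace
sketch. It is not the GHES code: every name in it (error_source,
probe_source, deliver_irq, poll_tick, simulate) is hypothetical and only
models the behaviour described above, assuming an error that was logged
before the driver probed and whose interrupt has already been cleared.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration only; not the kernel's GHES API. */
enum notify_type { NOTIFY_POLLED, NOTIFY_INTERRUPT };

struct error_source {
    enum notify_type type;
    bool error_pending;  /* firmware has already logged an error */
    bool irq_latched;    /* an interrupt is still waiting to be delivered */
    int handled;         /* number of errors serviced so far */
};

/* Service a pending error, if there is one. */
static void service_source(struct error_source *src)
{
    if (src->error_pending) {
        src->error_pending = false;
        src->handled++;
    }
}

/* Probe the source; optionally check for errors that are already pending. */
static void probe_source(struct error_source *src, bool check_pending)
{
    if (check_pending)
        service_source(src);
}

/* The interrupt handler only runs if the interrupt is still latched. */
static void deliver_irq(struct error_source *src)
{
    if (src->type == NOTIFY_INTERRUPT && src->irq_latched) {
        src->irq_latched = false;
        service_source(src);
    }
}

/* Polled sources are checked on every timer tick regardless of interrupts. */
static void poll_tick(struct error_source *src)
{
    if (src->type == NOTIFY_POLLED)
        service_source(src);
}

/* Error raised before probe, interrupt already cleared; returns errors handled. */
static int simulate(enum notify_type type, bool check_pending)
{
    struct error_source src = {
        .type = type,
        .error_pending = true,
        .irq_latched = false,
        .handled = 0,
    };

    probe_source(&src, check_pending);
    deliver_irq(&src);
    poll_tick(&src);
    return src.handled;
}
```

In this model, a polled source picks the error up on its next tick whether
or not the probe checks, while an interrupt source only ever services it if
the probe checks for pending errors.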

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.