All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hush Bensen <hush.bensen@gmail.com>
To: Minchan Kim <minchan@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>,
	Dave Jones <davej@redhat.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org, Rik van Riel <riel@redhat.com>,
	Michal Hocko <mhocko@suse.cz>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Hillf Danton <dhillf@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: hugepage related lockdep trace.
Date: Tue, 23 Jul 2013 01:24:17 -0600	[thread overview]
Message-ID: <51EE2FA1.9080504@gmail.com> (raw)
In-Reply-To: <20130719001303.GB23354@blaptop>

On 07/18/2013 06:13 PM, Minchan Kim wrote:
> On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>>
>>> Ccing people get_maintainer says.
>>>
>>> On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
>>>> [128095.470960] =================================
>>>> [128095.471315] [ INFO: inconsistent lock state ]
>>>> [128095.471660] 3.11.0-rc1+ #9 Not tainted
>>>> [128095.472156] ---------------------------------
>>>> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
>>>> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
>>>> [128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
>>>> [128095.475128] {RECLAIM_FS-ON-W} state was registered at:
>>>> [128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
>>>> [128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
>>>> [128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
>>>> [128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
>>>> [128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
>>>> [128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
>>>> [128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
>>>> [128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
>>>> [128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
>>>> [128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
>>>> [128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
>>>> [128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
>>>> [128095.480138]   [<c15b927c>] error_code+0x6c/0x74
>>>> [128095.480138] irq event stamp: 3136917
>>>> [128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
>>>> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
>>>> [128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
>>>> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
>>>> [128095.480138]
>>>> other info that might help us debug this:
>>>> [128095.480138]  Possible unsafe locking scenario:
>>>>
>>>> [128095.480138]        CPU0
>>>> [128095.480138]        ----
>>>> [128095.480138]   lock(&mapping->i_mmap_mutex);
>>>> [128095.480138]   <Interrupt>
>>>> [128095.480138]     lock(&mapping->i_mmap_mutex);
>>>> [128095.480138]
>>>>   *** DEADLOCK ***
>>>>
>>>> [128095.480138] no locks held by kswapd0/49.
>>>> [128095.480138]
>>>> stack backtrace:
>>>> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
>>>> [128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
>>>> [128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
>>>> [128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
>>>> [128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
>>>> [128095.480138] Call Trace:
>>>> [128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
>>>> [128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
>>>> [128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
>>>> [128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
>>>> [128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
>>>> [128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
>>>> [128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
>>>> [128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
>>>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>>>> [128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
>>>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>>>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>>>> [128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
>>>> [128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
>>>> [128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
>>>> [128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
>>>> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
>>>> [128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
>>>> [128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
>>>> [128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
>>>> [128095.480138]  [<c112e531>] kswapd+0x517/0xa75
>>>> [128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
>>>> [128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
>>>> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
>>>> [128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
>>>> [128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63
>>> IMHO, it's a false positive because i_mmap_mutex was held by kswapd
>>> while one in the middle of fault path could be never on kswapd context.
>>>
>>> It seems lockdep for reclaim-over-fs isn't enough smart to identify
>>> between background and direct reclaim.
>>>
>>> Wait for other's opinion.
>> Is that reasoning correct ?. We may not deadlock because hugetlb pages
>> cannot be reclaimed. So the fault path in hugetlb won't end up
>> reclaiming pages from same inode. But the report is correct right ?
>>
>>
>> Looking at the hugetlb code we have in huge_pmd_share
>>
>> out:
>> 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
>> 	mutex_unlock(&mapping->i_mmap_mutex);
>> 	return pte;
>>
>> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
>> that pmd_alloc can result in a reclaim which can call shrink_page_list ?
> True. Sorry for that I didn't review the code carefully and I was very paranoid
> in reclaim-over-fs due to internal works. :(

Could you explain more about reclaim-over-fs stuff?

>
>> Something like  ?
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 83aff0a..2cb1be3 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>>   		put_page(virt_to_page(spte));
>>   	spin_unlock(&mm->page_table_lock);
>>   out:
>> -	pte = (pte_t *)pmd_alloc(mm, pud, addr);
>>   	mutex_unlock(&mapping->i_mmap_mutex);
>> +	pte = (pte_t *)pmd_alloc(mm, pud, addr);
>>   	return pte;
> I am blind on hugetlb but not sure it doesn't break eb48c071.
> Michal?
>
>
>>   }
>>   
>> -aneesh
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/


WARNING: multiple messages have this Message-ID (diff)
From: Hush Bensen <hush.bensen@gmail.com>
To: Minchan Kim <minchan@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>,
	Dave Jones <davej@redhat.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org, Rik van Riel <riel@redhat.com>,
	Michal Hocko <mhocko@suse.cz>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Hillf Danton <dhillf@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: hugepage related lockdep trace.
Date: Tue, 23 Jul 2013 01:24:17 -0600	[thread overview]
Message-ID: <51EE2FA1.9080504@gmail.com> (raw)
In-Reply-To: <20130719001303.GB23354@blaptop>

On 07/18/2013 06:13 PM, Minchan Kim wrote:
> On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>>
>>> Ccing people get_maintainer says.
>>>
>>> On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
>>>> [128095.470960] =================================
>>>> [128095.471315] [ INFO: inconsistent lock state ]
>>>> [128095.471660] 3.11.0-rc1+ #9 Not tainted
>>>> [128095.472156] ---------------------------------
>>>> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
>>>> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
>>>> [128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
>>>> [128095.475128] {RECLAIM_FS-ON-W} state was registered at:
>>>> [128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
>>>> [128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
>>>> [128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
>>>> [128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
>>>> [128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
>>>> [128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
>>>> [128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
>>>> [128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
>>>> [128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
>>>> [128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
>>>> [128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
>>>> [128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
>>>> [128095.480138]   [<c15b927c>] error_code+0x6c/0x74
>>>> [128095.480138] irq event stamp: 3136917
>>>> [128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
>>>> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
>>>> [128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
>>>> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
>>>> [128095.480138]
>>>> other info that might help us debug this:
>>>> [128095.480138]  Possible unsafe locking scenario:
>>>>
>>>> [128095.480138]        CPU0
>>>> [128095.480138]        ----
>>>> [128095.480138]   lock(&mapping->i_mmap_mutex);
>>>> [128095.480138]   <Interrupt>
>>>> [128095.480138]     lock(&mapping->i_mmap_mutex);
>>>> [128095.480138]
>>>>   *** DEADLOCK ***
>>>>
>>>> [128095.480138] no locks held by kswapd0/49.
>>>> [128095.480138]
>>>> stack backtrace:
>>>> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
>>>> [128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
>>>> [128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
>>>> [128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
>>>> [128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
>>>> [128095.480138] Call Trace:
>>>> [128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
>>>> [128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
>>>> [128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
>>>> [128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
>>>> [128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
>>>> [128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
>>>> [128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
>>>> [128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
>>>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>>>> [128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
>>>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>>>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>>>> [128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
>>>> [128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
>>>> [128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
>>>> [128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
>>>> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
>>>> [128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
>>>> [128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
>>>> [128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
>>>> [128095.480138]  [<c112e531>] kswapd+0x517/0xa75
>>>> [128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
>>>> [128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
>>>> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
>>>> [128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
>>>> [128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63
>>> IMHO, it's a false positive because i_mmap_mutex was held by kswapd
>>> while one in the middle of fault path could be never on kswapd context.
>>>
>>> It seems lockdep for reclaim-over-fs isn't enough smart to identify
>>> between background and direct reclaim.
>>>
>>> Wait for other's opinion.
>> Is that reasoning correct ?. We may not deadlock because hugetlb pages
>> cannot be reclaimed. So the fault path in hugetlb won't end up
>> reclaiming pages from same inode. But the report is correct right ?
>>
>>
>> Looking at the hugetlb code we have in huge_pmd_share
>>
>> out:
>> 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
>> 	mutex_unlock(&mapping->i_mmap_mutex);
>> 	return pte;
>>
>> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
>> that pmd_alloc can result in a reclaim which can call shrink_page_list ?
> True. Sorry for that I didn't review the code carefully and I was very paranoid
> in reclaim-over-fs due to internal works. :(

Could you explain more about reclaim-over-fs stuff?

>
>> Something like  ?
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 83aff0a..2cb1be3 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>>   		put_page(virt_to_page(spte));
>>   	spin_unlock(&mm->page_table_lock);
>>   out:
>> -	pte = (pte_t *)pmd_alloc(mm, pud, addr);
>>   	mutex_unlock(&mapping->i_mmap_mutex);
>> +	pte = (pte_t *)pmd_alloc(mm, pud, addr);
>>   	return pte;
> I am blind on hugetlb but not sure it doesn't break eb48c071.
> Michal?
>
>
>>   }
>>   
>> -aneesh
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2013-07-23  7:24 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-07-17 15:32 hugepage related lockdep trace Dave Jones
2013-07-17 15:32 ` Dave Jones
2013-07-18  0:09 ` Minchan Kim
2013-07-18  0:09   ` Minchan Kim
2013-07-18 17:42   ` Aneesh Kumar K.V
2013-07-18 17:42     ` Aneesh Kumar K.V
2013-07-19  0:13     ` Minchan Kim
2013-07-19  0:13       ` Minchan Kim
2013-07-23  7:24       ` Hush Bensen [this message]
2013-07-23  7:24         ` Hush Bensen
2013-07-24  2:57         ` Minchan Kim
2013-07-24  2:57           ` Minchan Kim
2013-07-23 14:01       ` Michal Hocko
2013-07-23 14:01         ` Michal Hocko
2013-07-24  2:44         ` Minchan Kim
2013-07-24  2:44           ` Minchan Kim
2013-07-25 13:30           ` Michal Hocko
2013-07-25 13:30             ` Michal Hocko
2013-07-29  8:24             ` Minchan Kim
2013-07-29  8:24               ` Minchan Kim
2013-07-29 14:53               ` Michal Hocko
2013-07-29 14:53                 ` Michal Hocko
2013-07-29 15:20                 ` Peter Zijlstra
2013-07-29 15:20                   ` Peter Zijlstra
2013-07-30 14:29                   ` Michal Hocko
2013-07-30 14:29                     ` Michal Hocko
2013-07-30 14:46                     ` [PATCH] hugetlb: fix lockdep splat caused by pmd sharing Michal Hocko
2013-07-30 14:46                       ` Michal Hocko
2013-07-30 14:58                       ` Peter Zijlstra
2013-07-30 14:58                         ` Peter Zijlstra
2013-07-30 15:23                         ` Michal Hocko
2013-07-30 15:23                           ` Michal Hocko
2013-07-30 23:35                           ` Minchan Kim
2013-07-30 23:35                             ` Minchan Kim
2013-07-30 23:37                             ` Dave Jones
2013-07-30 23:37                               ` Dave Jones
2013-07-19  2:08     ` hugepage related lockdep trace Hillf Danton
2013-07-19  2:08       ` Hillf Danton
2013-07-19  3:20       ` Aneesh Kumar K.V
2013-07-19  3:20         ` Aneesh Kumar K.V

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=51EE2FA1.9080504@gmail.com \
    --to=hush.bensen@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.vnet.ibm.com \
    --cc=davej@redhat.com \
    --cc=dhillf@gmail.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.cz \
    --cc=minchan@kernel.org \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.