* [PATCH v2] mm/hugetlb: Defer freeing of huge pages if in non-task context
@ 2019-12-17  1:25 Waiman Long
  2019-12-17  9:31 ` Michal Hocko
  0 siblings, 1 reply; 9+ messages in thread
From: Waiman Long @ 2019-12-17  1:25 UTC (permalink / raw)
  To: Mike Kravetz, Andrew Morton
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Davidlohr Bueso,
	Andi Kleen, Michal Hocko, Aneesh Kumar K.V, Waiman Long

The following lockdep splat was observed when a certain hugetlbfs test
was run:

[  612.388273] ================================
[  612.411273] WARNING: inconsistent lock state
[  612.432273] 4.18.0-159.el8.x86_64+debug #1 Tainted: G        W --------- -  -
[  612.469273] --------------------------------
[  612.489273] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[  612.517273] swapper/30/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
[  612.541273] ffffffff9acdc038 (hugetlb_lock){+.?.}, at: free_huge_page+0x36f/0xaa0
[  612.576273] {SOFTIRQ-ON-W} state was registered at:
[  612.598273]   lock_acquire+0x14f/0x3b0
[  612.616273]   _raw_spin_lock+0x30/0x70
[  612.634273]   __nr_hugepages_store_common+0x11b/0xb30
[  612.657273]   hugetlb_sysctl_handler_common+0x209/0x2d0
[  612.681273]   proc_sys_call_handler+0x37f/0x450
[  612.703273]   vfs_write+0x157/0x460
[  612.719273]   ksys_write+0xb8/0x170
[  612.736273]   do_syscall_64+0xa5/0x4d0
[  612.753273]   entry_SYSCALL_64_after_hwframe+0x6a/0xdf
[  612.777273] irq event stamp: 691296
[  612.794273] hardirqs last  enabled at (691296): [<ffffffff99bb034b>] _raw_spin_unlock_irqrestore+0x4b/0x60
[  612.839273] hardirqs last disabled at (691295): [<ffffffff99bb0ad2>] _raw_spin_lock_irqsave+0x22/0x81
[  612.882273] softirqs last  enabled at (691284): [<ffffffff97ff0c63>] irq_enter+0xc3/0xe0
[  612.922273] softirqs last disabled at (691285): [<ffffffff97ff0ebe>] irq_exit+0x23e/0x2b0
[  612.962273]
[  612.962273] other info that might help us debug this:
[  612.993273]  Possible unsafe locking scenario:
[  612.993273]
[  613.020273]        CPU0
[  613.031273]        ----
[  613.042273]   lock(hugetlb_lock);
[  613.057273]   <Interrupt>
[  613.069273]     lock(hugetlb_lock);
[  613.085273]
[  613.085273]  *** DEADLOCK ***
      :
[  613.245273] Call Trace:
[  613.256273]  <IRQ>
[  613.265273]  dump_stack+0x9a/0xf0
[  613.281273]  mark_lock+0xd0c/0x12f0
[  613.297273]  ? print_shortest_lock_dependencies+0x80/0x80
[  613.322273]  ? sched_clock_cpu+0x18/0x1e0
[  613.341273]  __lock_acquire+0x146b/0x48c0
[  613.360273]  ? trace_hardirqs_on+0x10/0x10
[  613.379273]  ? trace_hardirqs_on_caller+0x27b/0x580
[  613.401273]  lock_acquire+0x14f/0x3b0
[  613.419273]  ? free_huge_page+0x36f/0xaa0
[  613.440273]  _raw_spin_lock+0x30/0x70
[  613.458273]  ? free_huge_page+0x36f/0xaa0
[  613.477273]  free_huge_page+0x36f/0xaa0
[  613.495273]  bio_check_pages_dirty+0x2fc/0x5c0
[  613.516273]  clone_endio+0x17f/0x670 [dm_mod]
[  613.536273]  ? disable_discard+0x90/0x90 [dm_mod]
[  613.558273]  ? bio_endio+0x4ba/0x930
[  613.575273]  ? blk_account_io_completion+0x400/0x530
[  613.598273]  blk_update_request+0x276/0xe50
[  613.617273]  scsi_end_request+0x7b/0x6a0
[  613.636273]  ? lock_downgrade+0x6f0/0x6f0
[  613.654273]  scsi_io_completion+0x1c6/0x1570
[  613.674273]  ? sd_completed_bytes+0x3a0/0x3a0 [sd_mod]
[  613.698273]  ? scsi_mq_requeue_cmd+0xc0/0xc0
[  613.718273]  blk_done_softirq+0x22e/0x350
[  613.737273]  ? blk_softirq_cpu_dead+0x230/0x230
[  613.758273]  __do_softirq+0x23d/0xad8
[  613.776273]  irq_exit+0x23e/0x2b0
[  613.792273]  do_IRQ+0x11a/0x200
[  613.806273]  common_interrupt+0xf/0xf
[  613.823273]  </IRQ>

Both the hugetlb_lock and the subpool lock can be acquired in
free_huge_page(). One way to solve the problem is to make both locks
irq-safe. Another alternative is to defer the freeing to a workqueue job.

This patch implements the deferred freeing by adding a
free_hpage_workfn() work function to do the actual freeing. The
free_huge_page() call in a non-task context saves the page to be freed
in the hpage_freelist linked list in a lockless manner.

The generic workqueue is used to process the work, but a dedicated
workqueue can be used instead if it is desirable to have the huge page
freed ASAP.
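
As an illustrative sketch only (the workqueue name, the WQ_HIGHPRI flag and
the initcall below are assumptions for illustration, not part of this patch),
a dedicated workqueue variant in mm/hugetlb.c could look roughly like:

/*
 * Hypothetical sketch: a dedicated workqueue instead of the system
 * workqueue, so deferred frees are not queued behind unrelated work.
 */
static struct workqueue_struct *hugetlb_free_wq;

static int __init hugetlb_free_wq_init(void)
{
	hugetlb_free_wq = alloc_workqueue("hugetlb_free", WQ_HIGHPRI, 0);
	return hugetlb_free_wq ? 0 : -ENOMEM;
}
subsys_initcall(hugetlb_free_wq_init);

/*
 * free_huge_page() would then call
 *	queue_work(hugetlb_free_wq, &free_hpage_work);
 * instead of schedule_work(&free_hpage_work).
 */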

 [v2: Add more comments & remove an unneeded racing check]

Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/hugetlb.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 69 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac65bb5e38ac..1744481bb9a1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1136,7 +1136,7 @@ static inline void ClearPageHugeTemporary(struct page *page)
 	page[2].mapping = NULL;
 }
 
-void free_huge_page(struct page *page)
+static void __free_huge_page(struct page *page)
 {
 	/*
 	 * Can't pass hstate in here because it is called from the
@@ -1199,6 +1199,74 @@ void free_huge_page(struct page *page)
 	spin_unlock(&hugetlb_lock);
 }
 
+/*
+ * As free_huge_page() can be called from a non-task context, we have
+ * to defer the actual freeing in a workqueue to prevent potential
+ * hugetlb_lock deadlock.
+ *
+ * free_hpage_workfn() locklessly retrieves the linked list of pages to
+ * be freed and frees them one-by-one. As the page->mapping pointer is
+ * going to be cleared in __free_huge_page() anyway, it is reused as the
+ * next pointer of a singly linked list of huge pages to be freed.
+ */
+#define NEXT_PENDING	((struct page *)-1)
+static struct page *hpage_freelist;
+
+static void free_hpage_workfn(struct work_struct *work)
+{
+	struct page *curr, *next;
+	int cnt = 0;
+
+	do {
+		curr = xchg(&hpage_freelist, NULL);
+		if (!curr)
+			break;
+
+		while (curr) {
+			next = (struct page *)READ_ONCE(curr->mapping);
+			if (next == NEXT_PENDING) {
+				cpu_relax();
+				continue;
+			}
+			__free_huge_page(curr);
+			curr = next;
+			cnt++;
+		}
+	} while (!READ_ONCE(hpage_freelist));
+
+	if (!cnt)
+		return;
+	pr_debug("HugeTLB: free_hpage_workfn() frees %d huge page(s)\n", cnt);
+}
+static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
+
+void free_huge_page(struct page *page)
+{
+	/*
+	 * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
+	 */
+	if (!in_task()) {
+		struct page *next;
+
+		/*
+		 * Since we cannot atomically update both page->mapping
+		 * and hpage_freelist without lock, a handshake using an
+		 * intermediate NEXT_PENDING value in mapping is used to
+		 * make sure that the reader will see both values correctly.
+		 * The initial write of NEXT_PENDING is before the page
+		 * is visible to a concurrent reader, so WRITE_ONCE()
+		 * isn't needed.
+		 */
+		page->mapping = (struct address_space *)NEXT_PENDING;
+		next = xchg(&hpage_freelist, page);
+		WRITE_ONCE(page->mapping, (struct address_space *)next);
+		schedule_work(&free_hpage_work);
+		return;
+	}
+
+	__free_huge_page(page);
+}
+
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
 	INIT_LIST_HEAD(&page->lru);
-- 
2.18.1


* Re: [PATCH v2] mm/hugetlb: Defer freeing of huge pages if in non-task context
  2019-12-17  1:25 [PATCH v2] mm/hugetlb: Defer freeing of huge pages if in non-task context Waiman Long
@ 2019-12-17  9:31 ` Michal Hocko
  2019-12-17 10:50   ` Kirill Tkhai
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Michal Hocko @ 2019-12-17  9:31 UTC (permalink / raw)
  To: Waiman Long
  Cc: Mike Kravetz, Andrew Morton, linux-kernel, linux-mm,
	Matthew Wilcox, Davidlohr Bueso, Andi Kleen, Aneesh Kumar K.V

On Mon 16-12-19 20:25:08, Waiman Long wrote:
[...]
> Both the hugetlb_lock and the subpool lock can be acquired in
> free_huge_page(). One way to solve the problem is to make both locks
> irq-safe.

Please document why we do not take this quite natural path and instead
have to come up with a more elaborate way. I believe the primary
motivation is that some operations under those locks are quite
expensive. Please add that to the changelog and ideally to the code as
well. We probably want to fix those anyway and then this would be a
temporary workaround.

> Another alternative is to defer the freeing to a workqueue job.
> 
> This patch implements the deferred freeing by adding a
> free_hpage_workfn() work function to do the actual freeing. The
> free_huge_page() call in a non-task context saves the page to be freed
> in the hpage_freelist linked list in a lockless manner.

Do we need to overcomplicate this (presumably) rare event with a lockless
algorithm? Why can't we use a dedicated spin lock for the linked
list manipulation? This should really be trivial code without the
additional burden of all the lockless subtleties.
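
For illustration, a spinlock-based variant along these lines could look
roughly like the sketch below, in the context of mm/hugetlb.c with
__free_huge_page() as in the patch. It reuses the names from the patch, but
the code itself is only an assumption of the idea, not something posted in
this thread:

/* Singly linked via page->mapping, protected by an irq-safe spinlock. */
static struct page *hpage_freelist;
static DEFINE_SPINLOCK(hpage_freelist_lock);

static void free_hpage_workfn(struct work_struct *work)
{
	unsigned long flags;
	struct page *page;

	spin_lock_irqsave(&hpage_freelist_lock, flags);
	while ((page = hpage_freelist)) {
		hpage_freelist = (struct page *)page->mapping;
		/* Drop the list lock before taking hugetlb_lock in the free path. */
		spin_unlock_irqrestore(&hpage_freelist_lock, flags);
		__free_huge_page(page);
		spin_lock_irqsave(&hpage_freelist_lock, flags);
	}
	spin_unlock_irqrestore(&hpage_freelist_lock, flags);
}
static DECLARE_WORK(free_hpage_work, free_hpage_workfn);

void free_huge_page(struct page *page)
{
	if (!in_task()) {
		unsigned long flags;

		spin_lock_irqsave(&hpage_freelist_lock, flags);
		page->mapping = (struct address_space *)hpage_freelist;
		hpage_freelist = page;
		spin_unlock_irqrestore(&hpage_freelist_lock, flags);
		schedule_work(&free_hpage_work);
		return;
	}
	__free_huge_page(page);
}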

> +	pr_debug("HugeTLB: free_hpage_workfn() frees %d huge page(s)\n", cnt);

Why do we need the debugging message here?
-- 
Michal Hocko
SUSE Labs

* Re: Re: [PATCH v2] mm/hugetlb: Defer freeing of huge pages if in non-task context
  2019-12-17  9:31 ` Michal Hocko
@ 2019-12-17 10:50   ` Kirill Tkhai
  2019-12-17 14:00     ` Waiman Long
  2019-12-17 14:06   ` Waiman Long
  2019-12-17 18:33   ` Mike Kravetz
  2 siblings, 1 reply; 9+ messages in thread
From: Kirill Tkhai @ 2019-12-17 10:50 UTC (permalink / raw)
  To: Michal Hocko, Waiman Long
  Cc: Mike Kravetz, Andrew Morton, linux-kernel, linux-mm,
	Matthew Wilcox, Davidlohr Bueso, Andi Kleen, Aneesh Kumar K.V

On 17.12.2019 12:31, Michal Hocko wrote:
> On Mon 16-12-19 20:25:08, Waiman Long wrote:
> [...]
>> Both the hugetlb_lock and the subpool lock can be acquired in
>> free_huge_page(). One way to solve the problem is to make both locks
>> irq-safe.
> 
> Please document why we do not take this, quite natural path and instead
> we have to come up with an elaborate way instead. I believe the primary
> motivation is that some operations under those locks are quite
> expensive. Please add that to the changelog and ideally to the code as
> well. We probably want to fix those anyway and then this would be a
> temporary workaround.
> 
>> Another alternative is to defer the freeing to a workqueue job.
>>
>> This patch implements the deferred freeing by adding a
>> free_hpage_workfn() work function to do the actual freeing. The
>> free_huge_page() call in a non-task context saves the page to be freed
>> in the hpage_freelist linked list in a lockless manner.
> 
> Do we need to over complicate this (presumably) rare event by a lockless
> algorithm? Why cannot we use a dedicated spin lock for for the linked
> list manipulation? This should be really a trivial code without an
> additional burden of all the lockless subtleties.

Why not llist_add()/llist_del_all() ?
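
For readers unfamiliar with the llist API, a minimal generic sketch of the
pattern being suggested (all names below are made up for illustration and are
not from the patch):

#include <linux/llist.h>
#include <linux/workqueue.h>
#include <linux/slab.h>

struct deferred_item {
	struct llist_node lnode;
	int payload;
};

static LLIST_HEAD(deferred_list);

static void deferred_workfn(struct work_struct *work)
{
	/* llist_del_all() atomically detaches the whole chain. */
	struct llist_node *node = llist_del_all(&deferred_list);
	struct deferred_item *item, *tmp;

	llist_for_each_entry_safe(item, tmp, node, lnode)
		kfree(item);
}
static DECLARE_WORK(deferred_work, deferred_workfn);

/* Producer side, callable from IRQ/softirq context. */
static void defer_free(struct deferred_item *item)
{
	/* llist_add() returns true if the list was empty before the add. */
	if (llist_add(&item->lnode, &deferred_list))
		schedule_work(&deferred_work);
}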

* Re: [PATCH v2] mm/hugetlb: Defer freeing of huge pages if in non-task context
  2019-12-17 10:50   ` Kirill Tkhai
@ 2019-12-17 14:00     ` Waiman Long
  2019-12-17 14:13       ` Kirill Tkhai
  0 siblings, 1 reply; 9+ messages in thread
From: Waiman Long @ 2019-12-17 14:00 UTC (permalink / raw)
  To: Kirill Tkhai, Michal Hocko
  Cc: Mike Kravetz, Andrew Morton, linux-kernel, linux-mm,
	Matthew Wilcox, Davidlohr Bueso, Andi Kleen, Aneesh Kumar K.V

On 12/17/19 5:50 AM, Kirill Tkhai wrote:
> On 17.12.2019 12:31, Michal Hocko wrote:
>> On Mon 16-12-19 20:25:08, Waiman Long wrote:
>> [...]
>>> Both the hugetlb_lock and the subpool lock can be acquired in
>>> free_huge_page(). One way to solve the problem is to make both locks
>>> irq-safe.
>> Please document why we do not take this, quite natural path and instead
>> we have to come up with an elaborate way instead. I believe the primary
>> motivation is that some operations under those locks are quite
>> expensive. Please add that to the changelog and ideally to the code as
>> well. We probably want to fix those anyway and then this would be a
>> temporary workaround.
>>
>>> Another alternative is to defer the freeing to a workqueue job.
>>>
>>> This patch implements the deferred freeing by adding a
>>> free_hpage_workfn() work function to do the actual freeing. The
>>> free_huge_page() call in a non-task context saves the page to be freed
>>> in the hpage_freelist linked list in a lockless manner.
>> Do we need to over complicate this (presumably) rare event by a lockless
>> algorithm? Why cannot we use a dedicated spin lock for for the linked
>> list manipulation? This should be really a trivial code without an
>> additional burden of all the lockless subtleties.
> Why not llist_add()/llist_del_all() ?
>
The llist_add() and llist_del_all() are just simple helpers. Because
this lockless case involves synchronization of two variables, the llist
helpers do not directly apply here, so the rest of the llist API cannot
be used. It would look awkward if the code were only partially converted
to use the helpers. If we convert to using a lock as suggested by Michal,
the helpers would be overkill as xchg() will not be needed.

Cheers,
Longman


* Re: [PATCH v2] mm/hugetlb: Defer freeing of huge pages if in non-task context
  2019-12-17  9:31 ` Michal Hocko
  2019-12-17 10:50   ` Kirill Tkhai
@ 2019-12-17 14:06   ` Waiman Long
  2019-12-17 14:59     ` Michal Hocko
  2019-12-17 18:33   ` Mike Kravetz
  2 siblings, 1 reply; 9+ messages in thread
From: Waiman Long @ 2019-12-17 14:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Andrew Morton, linux-kernel, linux-mm,
	Matthew Wilcox, Davidlohr Bueso, Andi Kleen, Aneesh Kumar K.V

On 12/17/19 4:31 AM, Michal Hocko wrote:
> On Mon 16-12-19 20:25:08, Waiman Long wrote:
> [...]
>> Both the hugetlb_lock and the subpool lock can be acquired in
>> free_huge_page(). One way to solve the problem is to make both locks
>> irq-safe.
> Please document why we do not take this, quite natural path and instead
> we have to come up with an elaborate way instead. I believe the primary
> motivation is that some operations under those locks are quite
> expensive. Please add that to the changelog and ideally to the code as
> well. We probably want to fix those anyway and then this would be a
> temporary workaround.
>
Fair enough, I will include data from Mike about some use cases where
hugetlb_lock will be held for a long time.


>> Another alternative is to defer the freeing to a workqueue job.
>>
>> This patch implements the deferred freeing by adding a
>> free_hpage_workfn() work function to do the actual freeing. The
>> free_huge_page() call in a non-task context saves the page to be freed
>> in the hpage_freelist linked list in a lockless manner.
> Do we need to over complicate this (presumably) rare event by a lockless
> algorithm? Why cannot we use a dedicated spin lock for for the linked
> list manipulation? This should be really a trivial code without an
> additional burden of all the lockless subtleties.

Right, I can use an irq-safe raw spinlock instead. I am fine doing that.


>
>> +	pr_debug("HugeTLB: free_hpage_workfn() frees %d huge page(s)\n", cnt);
> Why do we need the debugging message here?

It is there just to verify that the workfn is properly activated and
frees the huge page. This message won't be printed by default. I can
remove it if you guys don't really want a debug statement here.

Cheers,
Longman


* Re: [PATCH v2] mm/hugetlb: Defer freeing of huge pages if in non-task context
  2019-12-17 14:00     ` Waiman Long
@ 2019-12-17 14:13       ` Kirill Tkhai
  2019-12-17 14:27         ` Waiman Long
  0 siblings, 1 reply; 9+ messages in thread
From: Kirill Tkhai @ 2019-12-17 14:13 UTC (permalink / raw)
  To: Waiman Long, Michal Hocko
  Cc: Mike Kravetz, Andrew Morton, linux-kernel, linux-mm,
	Matthew Wilcox, Davidlohr Bueso, Andi Kleen, Aneesh Kumar K.V

On 17.12.2019 17:00, Waiman Long wrote:
> On 12/17/19 5:50 AM, Kirill Tkhai wrote:
>> On 17.12.2019 12:31, Michal Hocko wrote:
>>> On Mon 16-12-19 20:25:08, Waiman Long wrote:
>>> [...]
>>>> Both the hugetlb_lock and the subpool lock can be acquired in
>>>> free_huge_page(). One way to solve the problem is to make both locks
>>>> irq-safe.
>>> Please document why we do not take this, quite natural path and instead
>>> we have to come up with an elaborate way instead. I believe the primary
>>> motivation is that some operations under those locks are quite
>>> expensive. Please add that to the changelog and ideally to the code as
>>> well. We probably want to fix those anyway and then this would be a
>>> temporary workaround.
>>>
>>>> Another alternative is to defer the freeing to a workqueue job.
>>>>
>>>> This patch implements the deferred freeing by adding a
>>>> free_hpage_workfn() work function to do the actual freeing. The
>>>> free_huge_page() call in a non-task context saves the page to be freed
>>>> in the hpage_freelist linked list in a lockless manner.
>>> Do we need to over complicate this (presumably) rare event by a lockless
>>> algorithm? Why cannot we use a dedicated spin lock for for the linked
>>> list manipulation? This should be really a trivial code without an
>>> additional burden of all the lockless subtleties.
>> Why not llist_add()/llist_del_all() ?
>>
> The llist_add() and llist_del_all() are just simple helpers. Because
> this lockless case involve synchronization of two variables, the llist
> helpers do not directly apply here. So the rests cannot be used. It will
> look awkward it is partially converted to use the helpers. If we convert
> to use a lock as suggested by Michal, using the helpers will be an
> overkill as xchg() will not be needed.

I don't understand you. What are the two variables?

Why can't you simply do the below?

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac65bb5e38ac..e8ec753f3d92 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1136,7 +1136,7 @@ static inline void ClearPageHugeTemporary(struct page *page)
 	page[2].mapping = NULL;
 }
 
-void free_huge_page(struct page *page)
+static void __free_huge_page(struct page *page)
 {
 	/*
 	 * Can't pass hstate in here because it is called from the
@@ -1199,6 +1199,35 @@ void free_huge_page(struct page *page)
 	spin_unlock(&hugetlb_lock);
 }
 
+static struct llist_head hpage_freelist = LLIST_HEAD_INIT;
+
+static void free_hpage_workfn(struct work_struct *work)
+{
+	struct llist_node *node;
+	struct page *page;
+
+	node = llist_del_all(&hpage_freelist);
+
+	while (node) {
+		page = container_of(node, struct page, mapping);
+		node = node->next;
+		__free_huge_page(page);
+	}
+}
+
+static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
+
+void free_huge_page(struct page *page)
+{
+	if (!in_task()) {
+		if (llist_add((struct llist_node *)&page->mapping, &hpage_freelist))
+			schedule_work(&free_hpage_work);
+		return;
+	}
+
+	__free_huge_page(page);
+}
+
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
 	INIT_LIST_HEAD(&page->lru);


* Re: [PATCH v2] mm/hugetlb: Defer freeing of huge pages if in non-task context
  2019-12-17 14:13       ` Kirill Tkhai
@ 2019-12-17 14:27         ` Waiman Long
  0 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2019-12-17 14:27 UTC (permalink / raw)
  To: Kirill Tkhai, Michal Hocko
  Cc: Mike Kravetz, Andrew Morton, linux-kernel, linux-mm,
	Matthew Wilcox, Davidlohr Bueso, Andi Kleen, Aneesh Kumar K.V

On 12/17/19 9:13 AM, Kirill Tkhai wrote:
> On 17.12.2019 17:00, Waiman Long wrote:
>> On 12/17/19 5:50 AM, Kirill Tkhai wrote:
>>> On 17.12.2019 12:31, Michal Hocko wrote:
>>>> On Mon 16-12-19 20:25:08, Waiman Long wrote:
>>>> [...]
>>>>> Both the hugetlb_lock and the subpool lock can be acquired in
>>>>> free_huge_page(). One way to solve the problem is to make both locks
>>>>> irq-safe.
>>>> Please document why we do not take this, quite natural path and instead
>>>> we have to come up with an elaborate way instead. I believe the primary
>>>> motivation is that some operations under those locks are quite
>>>> expensive. Please add that to the changelog and ideally to the code as
>>>> well. We probably want to fix those anyway and then this would be a
>>>> temporary workaround.
>>>>
>>>>> Another alternative is to defer the freeing to a workqueue job.
>>>>>
>>>>> This patch implements the deferred freeing by adding a
>>>>> free_hpage_workfn() work function to do the actual freeing. The
>>>>> free_huge_page() call in a non-task context saves the page to be freed
>>>>> in the hpage_freelist linked list in a lockless manner.
>>>> Do we need to over complicate this (presumably) rare event by a lockless
>>>> algorithm? Why cannot we use a dedicated spin lock for for the linked
>>>> list manipulation? This should be really a trivial code without an
>>>> additional burden of all the lockless subtleties.
>>> Why not llist_add()/llist_del_all() ?
>>>
>> The llist_add() and llist_del_all() are just simple helpers. Because
>> this lockless case involve synchronization of two variables, the llist
>> helpers do not directly apply here. So the rests cannot be used. It will
>> look awkward it is partially converted to use the helpers. If we convert
>> to use a lock as suggested by Michal, using the helpers will be an
>> overkill as xchg() will not be needed.
> I don't understand you. What are two variables?
>
> Why can't you simply do the below?
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ac65bb5e38ac..e8ec753f3d92 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1136,7 +1136,7 @@ static inline void ClearPageHugeTemporary(struct page *page)
>  	page[2].mapping = NULL;
>  }
>  
> -void free_huge_page(struct page *page)
> +static void __free_huge_page(struct page *page)
>  {
>  	/*
>  	 * Can't pass hstate in here because it is called from the
> @@ -1199,6 +1199,35 @@ void free_huge_page(struct page *page)
>  	spin_unlock(&hugetlb_lock);
>  }
>  
> +static struct llist_head hpage_freelist = LLIST_HEAD_INIT;
> +
> +static void free_hpage_workfn(struct work_struct *work)
> +{
> +	struct llist_node *node;
> +	struct page *page;
> +
> +	node = llist_del_all(&hpage_freelist);
> +
> +	while (node) {
> +		page = container_of(node, struct page, mapping);
> +		node = node->next;
> +		__free_huge_page(page);
> +	}
> +}
> +
> +static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
> +
> +void free_huge_page(struct page *page)
> +{
> +	if (!in_task()) {
> +		if (llist_add((struct llist_node *)&page->mapping, &hpage_freelist))
> +			schedule_work(&free_hpage_work);
> +		return;
> +	}
> +
> +	__free_huge_page(page);
> +}
> +
>  static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>  {
>  	INIT_LIST_HEAD(&page->lru);
>
You are right. That should work. I was not aware of llist before, so
I hadn't fully grasped its capabilities. Thanks for the suggestion.

Cheers,
Longman


* Re: [PATCH v2] mm/hugetlb: Defer freeing of huge pages if in non-task context
  2019-12-17 14:06   ` Waiman Long
@ 2019-12-17 14:59     ` Michal Hocko
  0 siblings, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2019-12-17 14:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: Mike Kravetz, Andrew Morton, linux-kernel, linux-mm,
	Matthew Wilcox, Davidlohr Bueso, Andi Kleen, Aneesh Kumar K.V

On Tue 17-12-19 09:06:31, Waiman Long wrote:
> On 12/17/19 4:31 AM, Michal Hocko wrote:
> > On Mon 16-12-19 20:25:08, Waiman Long wrote:
[...]
> >> +	pr_debug("HugeTLB: free_hpage_workfn() frees %d huge page(s)\n", cnt);
> > Why do we need the debugging message here?
> 
> It is there just to verify that the workfn is properly activated and
> frees the huge page. This message won't be printed by default. I can
> remove it if you guys don't really want a debug statement here.

Yes, drop it please. We are not adding debugging messages unless they
are really actionable. If this is a sign of a bug, then put a WARN_ONCE or
something like that there. But with simple code like this it doesn't really
seem suitable IMHO.

Thanks!
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v2] mm/hugetlb: Defer freeing of huge pages if in non-task context
  2019-12-17  9:31 ` Michal Hocko
  2019-12-17 10:50   ` Kirill Tkhai
  2019-12-17 14:06   ` Waiman Long
@ 2019-12-17 18:33   ` Mike Kravetz
  2 siblings, 0 replies; 9+ messages in thread
From: Mike Kravetz @ 2019-12-17 18:33 UTC (permalink / raw)
  To: Michal Hocko, Waiman Long
  Cc: Andrew Morton, linux-kernel, linux-mm, Matthew Wilcox,
	Davidlohr Bueso, Andi Kleen, Aneesh Kumar K.V

On 12/17/19 1:31 AM, Michal Hocko wrote:
> On Mon 16-12-19 20:25:08, Waiman Long wrote:
> [...]
>> Both the hugetlb_lock and the subpool lock can be acquired in
>> free_huge_page(). One way to solve the problem is to make both locks
>> irq-safe.
> 
> Please document why we do not take this, quite natural path and instead
> we have to come up with an elaborate way instead. I believe the primary
> motivation is that some operations under those locks are quite
> expensive. Please add that to the changelog and ideally to the code as
> well. We probably want to fix those anyway and then this would be a
> temporary workaround.

We may have talked in the past about how hugetlbfs locking is not ideal.
However, there are many things in hugetlbfs that are not ideal. :(  The
thought was to avoid making changes unless they showed up as real problems.
Looks like the locking is now a real issue.

I'll start work on restructuring at least this part of the locking.  But,
let's move forward with the deferred freeing approach until that is ready.
-- 
Mike Kravetz
