[v6,09/16] mm/hugetlb: Defer freeing of HugeTLB pages

Message ID 20201124095259.58755-10-songmuchun@bytedance.com
State New, archived
Series
  • Free some vmemmap pages of hugetlb page

Commit Message

Muchun Song Nov. 24, 2020, 9:52 a.m. UTC
In the subsequent patch, we will allocate the vmemmap pages when freeing
HugeTLB pages. But update_and_free_page() can be called from a non-task
context (while holding hugetlb_lock), so defer the actual freeing to a
workqueue to avoid having to allocate the vmemmap pages with GFP_ATOMIC.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c         | 96 ++++++++++++++++++++++++++++++++++++++++++++++------
 mm/hugetlb_vmemmap.c |  5 ---
 mm/hugetlb_vmemmap.h | 10 ++++++
 3 files changed, 95 insertions(+), 16 deletions(-)

Comments

Michal Hocko Nov. 24, 2020, 11:51 a.m. UTC | #1
On Tue 24-11-20 17:52:52, Muchun Song wrote:
> In the subsequent patch, we will allocate the vmemmap pages when freeing
> HugeTLB pages. But update_and_free_page() can be called from a non-task
> context (while holding hugetlb_lock), so defer the actual freeing to a
> workqueue to avoid having to allocate the vmemmap pages with GFP_ATOMIC.

This has been brought up earlier without any satisfying answer. Do we
really have to bother with the freeing from the pool and reconstructing the
vmemmap page tables? Do existing usecases really require such a dynamic
behavior? In other words, wouldn't it be much simpler to allow the use of
hugetlb pages with sparse vmemmaps only for the boot time reservations
and never allow them to be freed back to the allocator. This is pretty
restrictive, no question about that, but it would drop quite some code
AFAICS and the resulting series would be much easier to review really
carefully. Additional enhancements can be done on top with specifics
about usecases which require more flexibility.

> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/hugetlb.c         | 96 ++++++++++++++++++++++++++++++++++++++++++++++------
>  mm/hugetlb_vmemmap.c |  5 ---
>  mm/hugetlb_vmemmap.h | 10 ++++++
>  3 files changed, 95 insertions(+), 16 deletions(-)
Muchun Song Nov. 24, 2020, 12:45 p.m. UTC | #2
On Tue, Nov 24, 2020 at 7:51 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 24-11-20 17:52:52, Muchun Song wrote:
> > In the subsequent patch, we will allocate the vmemmap pages when freeing
> > HugeTLB pages. But update_and_free_page() can be called from a non-task
> > context (while holding hugetlb_lock), so defer the actual freeing to a
> > workqueue to avoid having to allocate the vmemmap pages with GFP_ATOMIC.
>
> This has been brought up earlier without any satisfying answer. Do we
> really have to bother with the freeing from the pool and reconstructing the
> vmemmap page tables? Do existing usecases really require such a dynamic
> behavior? In other words, wouldn't it be much simpler to allow to use

If someone wants to free a HugeTLB page, there is no way to do that if we
do not allow this behavior. When do we need this? On our servers, we
allocate a lot of HugeTLB pages for SPDK or virtualization. Sometimes we
want to debug an issue and need to apt install some debug tools, but if
the host has little free memory, the install can fail for lack of memory.
In that case, we can free some HugeTLB pages back to buddy in order to
continue debugging. So we may need this.

> hugetlb pages with sparse vmemmaps only for the boot time reservations
> and never allow them to be freed back to the allocator. This is pretty
> restrictive, no question about that, but it would drop quite some code

Yeah, if we do not allow freeing HugeTLB pages back to buddy, we can
drop some code. But I think it would only drop this patch and the next
one, which is not a lot. And if we drop this patch, we would need to add
other code to do the boot time reservations and to disallow freeing
HugeTLB pages. So why not support freeing now?

> AFAICS and the resulting series would be much easier to review really
> carefully. Additional enhancements can be done on top with specifics
> about usecases which require more flexibility.

The code that allocates vmemmap pages for a HugeTLB page is very
similar to the code that frees them; the two operations are opposites.
I think that anyone who can understand the freeing path will also find
the allocating path easy to follow. If you look closely at this patch,
I believe it will be easy for you.

>
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  mm/hugetlb.c         | 96 ++++++++++++++++++++++++++++++++++++++++++++++------
> >  mm/hugetlb_vmemmap.c |  5 ---
> >  mm/hugetlb_vmemmap.h | 10 ++++++
> >  3 files changed, 95 insertions(+), 16 deletions(-)
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 24, 2020, 1:14 p.m. UTC | #3
On Tue 24-11-20 20:45:30, Muchun Song wrote:
> On Tue, Nov 24, 2020 at 7:51 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 24-11-20 17:52:52, Muchun Song wrote:
> > > In the subsequent patch, we will allocate the vmemmap pages when freeing
> > > HugeTLB pages. But update_and_free_page() can be called from a non-task
> > > context (while holding hugetlb_lock), so defer the actual freeing to a
> > > workqueue to avoid having to allocate the vmemmap pages with GFP_ATOMIC.
> >
> > This has been brought up earlier without any satisfying answer. Do we
> > really have to bother with the freeing from the pool and reconstructing the
> > vmemmap page tables? Do existing usecases really require such a dynamic
> > behavior? In other words, wouldn't it be much simpler to allow to use
> 
> If someone wants to free a HugeTLB page, there is no way to do that if we
> do not allow this behavior.

Right. The question is how much that matters for the _initial_ feature
submission. Is this restriction so important that it would render it
unusable?

> When do we need this? On our servers, we
> allocate a lot of HugeTLB pages for SPDK or virtualization. Sometimes
> we want to debug an issue and need to apt install some debug tools, but
> if the host has little free memory, the install can fail for lack of
> memory. In that case, we can free some HugeTLB pages back to buddy in
> order to continue debugging. So we may need this.

Or maybe you can still allocate hugetlb pages for debugging at runtime
and free those when you need to.

> > hugetlb pages with sparse vmemmaps only for the boot time reservations
> > and never allow them to be freed back to the allocator. This is pretty
> > restrictive, no question about that, but it would drop quite some code
> 
> Yeah, if we do not allow freeing HugeTLB pages back to buddy, we can
> drop some code. But I think it would only drop this patch and the next
> one, which is not a lot. And if we drop this patch, we would need to add
> other code to do the boot time reservations and to disallow freeing
> HugeTLB pages.

You need a per-hugetlb-page flag to note the sparse vmemmap anyway, so
the freeing path should be a trivial check for that flag. Special casing
for the early boot reservation shouldn't be that hard either. But I
haven't checked closely.

> So why not support freeing now?

Because it adds some non-trivial challenges which would be better dealt
with on top of a stable, tested, feature-limited implementation. The
most obvious one is the problem of vmemmap allocation when freeing a
hugetlb page. Others, like the vmemmap manipulation, are quite some code
but hold no surprises. Btw. that should be implemented in vmemmap proper
and be ready for other potential users. But this is a minor detail.

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9662b5535f3a..41056b4230f1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1221,7 +1221,7 @@  static void destroy_compound_gigantic_page(struct page *page,
 	__ClearPageHead(page);
 }
 
-static void free_gigantic_page(struct page *page, unsigned int order)
+static void __free_gigantic_page(struct page *page, unsigned int order)
 {
 	/*
 	 * If the page isn't allocated using the cma allocator,
@@ -1288,20 +1288,100 @@  static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 {
 	return NULL;
 }
-static inline void free_gigantic_page(struct page *page, unsigned int order) { }
+static inline void __free_gigantic_page(struct page *page,
+					unsigned int order) { }
 static inline void destroy_compound_gigantic_page(struct page *page,
 						unsigned int order) { }
 #endif
 
-static void update_and_free_page(struct hstate *h, struct page *page)
+static void __free_hugepage(struct hstate *h, struct page *page);
+
+/*
+ * As update_and_free_page() may be called from a non-task context (while
+ * holding hugetlb_lock), we can defer the actual freeing to a workqueue to
+ * avoid using GFP_ATOMIC to allocate a lot of vmemmap pages.
+ *
+ * update_hpage_vmemmap_workfn() locklessly retrieves the linked list of
+ * pages to be freed and frees them one-by-one. As the page->mapping pointer
+ * is going to be cleared in update_hpage_vmemmap_workfn() anyway, it is
+ * reused as the llist_node structure of a lockless linked list of huge
+ * pages to be freed.
+ */
+static LLIST_HEAD(hpage_update_freelist);
+
+static void update_hpage_vmemmap_workfn(struct work_struct *work)
 {
-	int i;
+	struct llist_node *node;
+	struct page *page;
+
+	node = llist_del_all(&hpage_update_freelist);
+
+	while (node) {
+		page = container_of((struct address_space **)node,
+				     struct page, mapping);
+		node = node->next;
+		page->mapping = NULL;
+		__free_hugepage(page_hstate(page), page);
 
+		cond_resched();
+	}
+}
+static DECLARE_WORK(hpage_update_work, update_hpage_vmemmap_workfn);
+
+static inline void __update_and_free_page(struct hstate *h, struct page *page)
+{
+	/* No need to allocate vmemmap pages */
+	if (!free_vmemmap_pages_per_hpage(h)) {
+		__free_hugepage(h, page);
+		return;
+	}
+
+	/*
+	 * Defer freeing to avoid using GFP_ATOMIC to allocate vmemmap
+	 * pages.
+	 *
+	 * Only call schedule_work() if hpage_update_freelist is previously
+	 * empty. Otherwise, schedule_work() has already been called but the workfn
+	 * hasn't retrieved the list yet.
+	 */
+	if (llist_add((struct llist_node *)&page->mapping,
+		      &hpage_update_freelist))
+		schedule_work(&hpage_update_work);
+}
+
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+static inline void free_gigantic_page(struct hstate *h, struct page *page)
+{
+	__free_gigantic_page(page, huge_page_order(h));
+}
+#else
+static inline void free_gigantic_page(struct hstate *h, struct page *page)
+{
+	/*
+	 * Temporarily drop the hugetlb_lock, because
+	 * we might block in __free_gigantic_page().
+	 */
+	spin_unlock(&hugetlb_lock);
+	__free_gigantic_page(page, huge_page_order(h));
+	spin_lock(&hugetlb_lock);
+}
+#endif
+
+static void update_and_free_page(struct hstate *h, struct page *page)
+{
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
 
 	h->nr_huge_pages--;
 	h->nr_huge_pages_node[page_to_nid(page)]--;
+
+	__update_and_free_page(h, page);
+}
+
+static void __free_hugepage(struct hstate *h, struct page *page)
+{
+	int i;
+
 	for (i = 0; i < pages_per_huge_page(h); i++) {
 		page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
 				1 << PG_referenced | 1 << PG_dirty |
@@ -1313,14 +1393,8 @@  static void update_and_free_page(struct hstate *h, struct page *page)
 	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
 	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
-		/*
-		 * Temporarily drop the hugetlb_lock, because
-		 * we might block in free_gigantic_page().
-		 */
-		spin_unlock(&hugetlb_lock);
 		destroy_compound_gigantic_page(page, huge_page_order(h));
-		free_gigantic_page(page, huge_page_order(h));
-		spin_lock(&hugetlb_lock);
+		free_gigantic_page(h, page);
 	} else {
 		__free_pages(page, huge_page_order(h));
 	}
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 1576f69bd1d3..f6ba288966d4 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -124,11 +124,6 @@ 
 	(__boundary - 1 < (end) - 1) ? __boundary : (end);		 \
 })
 
-static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
-{
-	return h->nr_free_vmemmap_pages;
-}
-
 static inline unsigned int vmemmap_pages_per_hpage(struct hstate *h)
 {
 	return free_vmemmap_pages_per_hpage(h) + RESERVE_VMEMMAP_NR;
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 67113b67495f..293897b9f1d8 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -13,6 +13,11 @@ 
 #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
 void __init hugetlb_vmemmap_init(struct hstate *h);
 void free_huge_page_vmemmap(struct hstate *h, struct page *head);
+
+static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
+{
+	return h->nr_free_vmemmap_pages;
+}
 #else
 static inline void hugetlb_vmemmap_init(struct hstate *h)
 {
@@ -21,5 +26,10 @@  static inline void hugetlb_vmemmap_init(struct hstate *h)
 static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 }
+
+static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
+{
+	return 0;
+}
 #endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
 #endif /* _LINUX_HUGETLB_VMEMMAP_H */