linux-kernel.vger.kernel.org archive mirror
* [PATCH v5 2/6] ksm: support unsharing zero pages placed by KSM
@ 2022-12-30  1:13 yang.yang29
  2023-01-18 14:10 ` David Hildenbrand
From: yang.yang29 @ 2022-12-30  1:13 UTC (permalink / raw)
  To: akpm
  Cc: david, imbrenda, jiang.xuexin, linux-kernel, linux-mm,
	ran.xiaokai, xu.xin.sc, xu.xin16, yang.yang29

From: xu xin <xu.xin16@zte.com.cn>

use_zero_pages may be very useful, not just because of cache colouring
as described in the documentation, but also because it can accelerate
merging when there are plenty of empty pages (pages full of zeros), since
the time spent on page-by-page comparisons (unstable_tree_search_insert)
is saved.
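
For reference, the feature is controlled by the standard KSM sysfs knobs.
A minimal userspace sketch of enabling it (illustrative only, not part of
this patch; error handling is omitted and root privileges are assumed):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a value to a sysfs knob; errors are silently ignored here. */
static void write_knob(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd >= 0) {
		write(fd, val, strlen(val));
		close(fd);
	}
}

int main(void)
{
	write_knob("/sys/kernel/mm/ksm/use_zero_pages", "1");
	write_knob("/sys/kernel/mm/ksm/run", "1");
	return 0;
}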

But when use_zero_pages is enabled, madvise(addr, len, MADV_UNMERGEABLE) and
the other ways of triggering unsharing (like writing 2 to
/sys/kernel/mm/ksm/run) will *not* actually unshare the shared zeropages
placed by KSM, which contradicts the MADV_UNMERGEABLE documentation. Because
these KSM-placed zero pages are outside KSM's control, the related counts of
KSM pages also don't reveal how many zero pages were placed by KSM. These
special zero pages differ from ordinarily mapped zero pages, because pages
mapped into MADV_UNMERGEABLE areas are expected to be complete, unshared pages.
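
To make the expected semantics concrete, here is a minimal userspace sketch
of the sequence in question (illustrative only, not part of this patch;
error handling and the wait for ksmd to actually scan the area are omitted):

#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64 * 4096;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	memset(buf, 0, len);                    /* plenty of empty (zero) pages */
	madvise(buf, len, MADV_MERGEABLE);      /* hand the area to KSM */
	/*
	 * With use_zero_pages enabled, ksmd may map these pages to the shared
	 * zeropage rather than to a KSM page.
	 */
	madvise(buf, len, MADV_UNMERGEABLE);    /* expected to undo everything KSM
						 * placed, including zeropages */
	munmap(buf, len);
	return 0;
}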

To avoid blindly unsharing all shared zero pages in the applicable VMAs, the
patch introduces a dedicated flag, ZERO_PAGE_FLAG, to mark the rmap_items of
those shared zero pages, and guarantees that these rmap_items are not freed
as long as the corresponding zero pages have not been written to. That way,
only the *KSM-placed* zero pages are unshared.

The patch does not degrade the performance of use_zero_pages, as it doesn't
change how the feature merges empty pages.

Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reported-by: David Hildenbrand <david@redhat.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn>
Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
---
 mm/ksm.c | 141 +++++++++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 111 insertions(+), 30 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 5b0a7343ff4a..652c088f9786 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -214,6 +214,7 @@ struct ksm_rmap_item {
 #define SEQNR_MASK	0x0ff	/* low bits of unstable tree seqnr */
 #define UNSTABLE_FLAG	0x100	/* is a node of the unstable tree */
 #define STABLE_FLAG	0x200	/* is listed from the stable tree */
+#define ZERO_PAGE_FLAG 0x400 /* is zero page placed by KSM */

 /* The stable and unstable tree heads */
 static struct rb_root one_stable_tree[1] = { RB_ROOT };
@@ -420,6 +421,11 @@ static inline bool ksm_test_exit(struct mm_struct *mm)
 	return atomic_read(&mm->mm_users) == 0;
 }

+enum break_ksm_pmd_entry_return_flag {
+	HAVE_KSM_PAGE = 1,
+	HAVE_ZERO_PAGE
+};
+
 static int break_ksm_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next,
 			struct mm_walk *walk)
 {
@@ -427,6 +433,7 @@ static int break_ksm_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long nex
 	spinlock_t *ptl;
 	pte_t *pte;
 	int ret;
+	bool is_zero_page = false;

 	if (pmd_leaf(*pmd) || !pmd_present(*pmd))
 		return 0;
@@ -434,6 +441,8 @@ static int break_ksm_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long nex
 	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	if (pte_present(*pte)) {
 		page = vm_normal_page(walk->vma, addr, *pte);
+		if (!page)
+			is_zero_page = is_zero_pfn(pte_pfn(*pte));
 	} else if (!pte_none(*pte)) {
 		swp_entry_t entry = pte_to_swp_entry(*pte);

@@ -444,7 +453,14 @@ static int break_ksm_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long nex
 		if (is_migration_entry(entry))
 			page = pfn_swap_entry_to_page(entry);
 	}
-	ret = page && PageKsm(page);
+
+	if (page && PageKsm(page))
+		ret = HAVE_KSM_PAGE;
+	else if (is_zero_page)
+		ret = HAVE_ZERO_PAGE;
+	else
+		ret = 0;
+
 	pte_unmap_unlock(pte, ptl);
 	return ret;
 }
@@ -466,19 +482,22 @@ static const struct mm_walk_ops break_ksm_ops = {
  * of the process that owns 'vma'.  We also do not want to enforce
  * protection keys here anyway.
  */
-static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
+static int break_ksm(struct vm_area_struct *vma, unsigned long addr,
+				     bool unshare_zero_page)
 {
 	vm_fault_t ret = 0;

 	do {
-		int ksm_page;
+		int walk_result;

 		cond_resched();
-		ksm_page = walk_page_range_vma(vma, addr, addr + 1,
+		walk_result = walk_page_range_vma(vma, addr, addr + 1,
 					       &break_ksm_ops, NULL);
-		if (WARN_ON_ONCE(ksm_page < 0))
-			return ksm_page;
-		if (!ksm_page)
+		if (WARN_ON_ONCE(walk_result < 0))
+			return walk_result;
+		if (!walk_result)
+			return 0;
+		if (walk_result == HAVE_ZERO_PAGE && !unshare_zero_page)
 			return 0;
 		ret = handle_mm_fault(vma, addr,
 				      FAULT_FLAG_UNSHARE | FAULT_FLAG_REMOTE,
@@ -539,7 +558,7 @@ static void break_cow(struct ksm_rmap_item *rmap_item)
 	mmap_read_lock(mm);
 	vma = find_mergeable_vma(mm, addr);
 	if (vma)
-		break_ksm(vma, addr);
+		break_ksm(vma, addr, false);
 	mmap_read_unlock(mm);
 }

@@ -764,6 +783,30 @@ static struct page *get_ksm_page(struct ksm_stable_node *stable_node,
 	return NULL;
 }

+/*
+ * Clean the rmap_item's ZERO_PAGE_FLAG.
+ * This function is called when unsharing or writing to zero pages.
+ */
+static inline void clean_rmap_item_zero_flag(struct ksm_rmap_item *rmap_item)
+{
+	if (rmap_item->address & ZERO_PAGE_FLAG)
+		rmap_item->address &= PAGE_MASK;
+}
+
+/* Only called when rmap_item is going to be freed */
+static inline void unshare_zero_pages(struct ksm_rmap_item *rmap_item)
+{
+	struct vm_area_struct *vma;
+
+	if (rmap_item->address & ZERO_PAGE_FLAG) {
+		vma = vma_lookup(rmap_item->mm, rmap_item->address);
+		if (vma && !ksm_test_exit(rmap_item->mm))
+			break_ksm(vma, rmap_item->address, true);
+	}
+	/* Put at last. */
+	clean_rmap_item_zero_flag(rmap_item);
+}
+
 /*
  * Removing rmap_item from stable or unstable tree.
  * This function will clean the information from the stable/unstable tree.
@@ -824,6 +867,7 @@ static void remove_trailing_rmap_items(struct ksm_rmap_item **rmap_list)
 		struct ksm_rmap_item *rmap_item = *rmap_list;
 		*rmap_list = rmap_item->rmap_list;
 		remove_rmap_item_from_tree(rmap_item);
+		unshare_zero_pages(rmap_item);
 		free_rmap_item(rmap_item);
 	}
 }
@@ -853,7 +897,7 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma,
 		if (signal_pending(current))
 			err = -ERESTARTSYS;
 		else
-			err = break_ksm(vma, addr);
+			err = break_ksm(vma, addr, false);
 	}
 	return err;
 }
@@ -2044,6 +2088,39 @@ static void stable_tree_append(struct ksm_rmap_item *rmap_item,
 	rmap_item->mm->ksm_merging_pages++;
 }

+static int try_to_merge_with_kernel_zero_page(struct ksm_rmap_item *rmap_item,
+									struct page *page)
+{
+	struct mm_struct *mm = rmap_item->mm;
+	int err = 0;
+
+	/*
+	 * The rmap_item should not have ZERO_PAGE_FLAG set here: on one hand,
+	 * get_next_rmap_item() doesn't return zero pages' rmap_items; on the
+	 * other hand, even if the zero page was written to and became an
+	 * anonymous page again, the rmap_item has already been cleaned after
+	 * stable_tree_search().
+	 */
+	if (!WARN_ON_ONCE(rmap_item->address & ZERO_PAGE_FLAG)) {
+		struct vm_area_struct *vma;
+
+		mmap_read_lock(mm);
+		vma = find_mergeable_vma(mm, rmap_item->address);
+		if (vma) {
+			err = try_to_merge_one_page(vma, page,
+						ZERO_PAGE(rmap_item->address));
+			if (!err)
+				rmap_item->address |= ZERO_PAGE_FLAG;
+		} else {
+			/* If the vma is out of date, we do not need to continue. */
+			err = 0;
+		}
+		mmap_read_unlock(mm);
+	}
+
+	return err;
+}
+
 /*
  * cmp_and_merge_page - first see if page can be merged into the stable tree;
  * if not, compare checksum to previous and if it's the same, see if page can
@@ -2055,7 +2132,6 @@ static void stable_tree_append(struct ksm_rmap_item *rmap_item,
  */
 static void cmp_and_merge_page(struct page *page, struct ksm_rmap_item *rmap_item)
 {
-	struct mm_struct *mm = rmap_item->mm;
 	struct ksm_rmap_item *tree_rmap_item;
 	struct page *tree_page = NULL;
 	struct ksm_stable_node *stable_node;
@@ -2092,6 +2168,7 @@ static void cmp_and_merge_page(struct page *page, struct ksm_rmap_item *rmap_ite
 	}

 	remove_rmap_item_from_tree(rmap_item);
+	clean_rmap_item_zero_flag(rmap_item);

 	if (kpage) {
 		if (PTR_ERR(kpage) == -EBUSY)
@@ -2128,29 +2205,16 @@ static void cmp_and_merge_page(struct page *page, struct ksm_rmap_item *rmap_ite
 	 * Same checksum as an empty page. We attempt to merge it with the
 	 * appropriate zero page if the user enabled this via sysfs.
 	 */
-	if (ksm_use_zero_pages && (checksum == zero_checksum)) {
-		struct vm_area_struct *vma;
-
-		mmap_read_lock(mm);
-		vma = find_mergeable_vma(mm, rmap_item->address);
-		if (vma) {
-			err = try_to_merge_one_page(vma, page,
-					ZERO_PAGE(rmap_item->address));
-		} else {
+	if (ksm_use_zero_pages) {
+		if (checksum == zero_checksum)
 			/*
-			 * If the vma is out of date, we do not need to
-			 * continue.
+			 * In case of failure, the page was not really empty, so we
+			 * need to continue. Otherwise we're done.
 			 */
-			err = 0;
-		}
-		mmap_read_unlock(mm);
-		/*
-		 * In case of failure, the page was not really empty, so we
-		 * need to continue. Otherwise we're done.
-		 */
-		if (!err)
-			return;
+			if (!try_to_merge_with_kernel_zero_page(rmap_item, page))
+				return;
 	}
+
 	tree_rmap_item =
 		unstable_tree_search_insert(rmap_item, page, &tree_page);
 	if (tree_rmap_item) {
@@ -2226,6 +2290,7 @@ static struct ksm_rmap_item *try_to_get_old_rmap_item(unsigned long addr,
 		*rmap_list = rmap_item->rmap_list;
 		/* Running here indicates it's vma has been UNMERGEABLE */
 		remove_rmap_item_from_tree(rmap_item);
+		unshare_zero_pages(rmap_item);
 		free_rmap_item(rmap_item);
 	}

@@ -2350,6 +2415,22 @@ static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
 			}
 			if (is_zone_device_page(*page))
 				goto next_page;
+			if (is_zero_pfn(page_to_pfn(*page))) {
+				/*
+				 * To monitor KSM zero pages, which become non-anonymous,
+				 * we have to save each zero page's rmap_item by having
+				 * try_to_get_old_rmap_item() walk ksm_scan.rmap_list;
+				 * otherwise their rmap_items will be freed by the next
+				 * turn of get_next_rmap_item(). That function frees all
+				 * "skipped" rmap_items because it treats their areas
+				 * as UNMERGEABLE.
+				 */
+				rmap_item = try_to_get_old_rmap_item(ksm_scan.address,
+									ksm_scan.rmap_list);
+				if (rmap_item && (rmap_item->address & ZERO_PAGE_FLAG))
+					ksm_scan.rmap_list = &rmap_item->rmap_list;
+				goto next_page;
+			}
 			if (PageAnon(*page)) {
 				flush_anon_page(vma, *page, ksm_scan.address);
 				flush_dcache_page(*page);
-- 
2.15.2


* Re: [PATCH v5 2/6] ksm: support unsharing zero pages placed by KSM
  2022-12-30  1:13 [PATCH v5 2/6] ksm: support unsharing zero pages placed by KSM yang.yang29
@ 2023-01-18 14:10 ` David Hildenbrand
  2023-02-04  6:18   ` yang.yang29
From: David Hildenbrand @ 2023-01-18 14:10 UTC (permalink / raw)
  To: yang.yang29, akpm
  Cc: imbrenda, jiang.xuexin, linux-kernel, linux-mm, ran.xiaokai,
	xu.xin.sc, xu.xin16

On 30.12.22 02:13, yang.yang29@zte.com.cn wrote:
> From: xu xin <xu.xin16@zte.com.cn>

[...]

> 
>   /* The stable and unstable tree heads */
>   static struct rb_root one_stable_tree[1] = { RB_ROOT };
> @@ -420,6 +421,11 @@ static inline bool ksm_test_exit(struct mm_struct *mm)
>   	return atomic_read(&mm->mm_users) == 0;
>   }
> 
> +enum break_ksm_pmd_entry_return_flag {
> +	HAVE_KSM_PAGE = 1,
> +	HAVE_ZERO_PAGE
> +};

Why use flags if both conditions are mutually exclusive?

[...]

> @@ -853,7 +897,7 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma,
>   		if (signal_pending(current))
>   			err = -ERESTARTSYS;
>   		else
> -			err = break_ksm(vma, addr);
> +			err = break_ksm(vma, addr, false);
>   	}

MADV_UNMERGEABLE -> unmerge_ksm_pages() will never unshare the shared 
zeropage? I thought the patch description mentions that that is one of 
the goals?

-- 
Thanks,

David / dhildenb



* Re: [PATCH v5 2/6] ksm: support unsharing zero pages placed by KSM
  2023-01-18 14:10 ` David Hildenbrand
@ 2023-02-04  6:18   ` yang.yang29
  2023-02-13 12:44     ` David Hildenbrand
From: yang.yang29 @ 2023-02-04  6:18 UTC (permalink / raw)
  To: david
  Cc: akpm, imbrenda, jiang.xuexin, linux-kernel, linux-mm,
	ran.xiaokai, xu.xin.sc, xu.xin16

> Why use flags if both conditions are mutually exclusive?

Just to make the return value of break_ksm_pmd_entry() more expressive and
understandable, because break_ksm_pmd_entry() has three possible return
values (0, 1, 2).

> MADV_UNMERGEABLE -> unmerge_ksm_pages() will never unshare the shared 
> zeropage? I thought the patch description mentions that that is one of 
> the goals?

No, MADV_UNMERGEABLE will trigger KSM to unshare the shared zeropages in the
context of "get_next_rmap_item() -> unshare_zero_pages()", but not directly in
the context of "madvise() -> unmerge_ksm_pages()". The reason for this is to
avoid adding long delays to madvise() calls from unsharing zero pages there.
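
Roughly, the two paths under discussion look like this (simplified sketch
based on this patch, not verbatim kernel code):

	madvise(MADV_UNMERGEABLE)
	  -> unmerge_ksm_pages()
	       -> break_ksm(vma, addr, false)    /* KSM-placed zeropages stay mapped */

	ksmd, on a later scan:
	  get_next_rmap_item() / scan_get_next_rmap_item()
	    -> try_to_get_old_rmap_item() / remove_trailing_rmap_items()
	         -> unshare_zero_pages(rmap_item)
	              -> break_ksm(vma, rmap_item->address, true)  /* zeropage unshared */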


* Re: [PATCH v5 2/6] ksm: support unsharing zero pages placed by KSM
  2023-02-04  6:18   ` yang.yang29
@ 2023-02-13 12:44     ` David Hildenbrand
From: David Hildenbrand @ 2023-02-13 12:44 UTC (permalink / raw)
  To: yang.yang29
  Cc: akpm, imbrenda, jiang.xuexin, linux-kernel, linux-mm,
	ran.xiaokai, xu.xin.sc, xu.xin16

On 04.02.23 07:18, yang.yang29@zte.com.cn wrote:

[sorry, was on vacation last week]

>> Why use flags if both conditions are mutually exclusive?
> 
> Just to make the return value of break_ksm_pmd_entry() more expressive and
> understandable, because break_ksm_pmd_entry() has three possible return
> values (0, 1, 2).

It adds confusion. Just simplify it please.

> 
>> MADV_UNMERGEABLE -> unmerge_ksm_pages() will never unshare the shared
>> zeropage? I thought the patch description mentions that that is one of
>> the goals?
> 
> No, MADV_UNMERGEABLE will trigger KSM to unshare the shared zeropages in the
> context of "get_next_rmap_item() -> unshare_zero_pages()", but not directly in
> the context of "madvise() -> unmerge_ksm_pages()". The reason for this is to
> avoid adding long delays to madvise() calls from unsharing zero pages there.
> 

Why do we care and make this case special?

-- 
Thanks,

David / dhildenb



* Re: [PATCH v5 2/6] ksm: support unsharing zero pages placed by KSM
  2023-03-22  2:02 [PATCH] rps: process the skb directly if rps cpu not changed Yunsheng Lin
@ 2023-03-22  7:21 ` xu xin
From: xu xin @ 2023-03-22  7:21 UTC (permalink / raw)
  To: linyunsheng, kuba
  Cc: davem, edumazet, jiang.xuexin, linux-kernel, netdev, xu.xin16,
	yang.yang29, zhang.yunkai

On 2023/3/21 20:12, yang.yang29@zte.com.cn wrote:
>> From: xu xin <xu.xin16@zte.com.cn>
>> 
>> In the RPS procedure of NAPI receiving, regardless of whether the
>> rps-calculated CPU of the skb equals the currently processing CPU, RPS
>> always uses enqueue_to_backlog to enqueue the skb to the per-cpu backlog,
>> which triggers a new NET_RX softirq.
>
>Does bypassing the backlog cause out-of-order problems for packet handling?
>It seems that currently RPS/RFS will ensure in-order delivery, such as:
>https://elixir.bootlin.com/linux/v6.3-rc3/source/net/core/dev.c#L4485
>
>Also, since this is an optimization, it should target the net-next branch:
>[PATCH net-next] rps: process the skb directly if rps cpu not changed
>

Well, I thought the patch wouldn't break the effort RFS makes to avoid
"Out of Order" packets. But thanks for your reminder; having rethought it,
bypassing the backlog from "netif_receive_skb_list" will mislead RFS's judgment
of whether all previous packets for the flow have been dequeued: RFS would think
all packets have been dealt with, while they are actually still in the skb
lists. Fortunately, bypassing the backlog from "netif_receive_skb" for a single
skb is okay and won't cause OOO packets, because every skb is processed serially
by RPS and sent to the protocol stack as soon as possible.

If I'm correct, the code as follows can fix this.

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5666,8 +5666,9 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
        if (static_branch_unlikely(&rps_needed)) {
                struct rps_dev_flow voidflow, *rflow = &voidflow;
                int cpu = get_rps_cpu(skb->dev, skb, &rflow);
+               int current_cpu = smp_processor_id();
 
-               if (cpu >= 0) {
+               if (cpu >= 0 && cpu != current_cpu) {
                        ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
                        rcu_read_unlock();
                        return ret;
@@ -5699,11 +5700,15 @@ void netif_receive_skb_list_internal(struct list_head *head)
                list_for_each_entry_safe(skb, next, head, list) {
                        struct rps_dev_flow voidflow, *rflow = &voidflow;
                        int cpu = get_rps_cpu(skb->dev, skb, &rflow);
+                       int current_cpu = smp_processor_id();
 
                        if (cpu >= 0) {
                                /* Will be handled, remove from list */
                                skb_list_del_init(skb);
-                               enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+                               if (cpu != current_cpu)
+                                       enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+                               else
+                                       __netif_receive_skb(skb);
                        }
                }


Thanks.

>> 
>> Actually, it's not necessary to enqueue the skb to the backlog when the
>> rps-calculated CPU id equals the current processing CPU; we can call
>> __netif_receive_skb or __netif_receive_skb_list to process the skb directly.
>> The benefit is that it reduces the number of NET_RX softirqs and the
>> processing delay of the skb.
>> 
>> The measured results show the patch brings a 50% reduction in NET_RX softirqs.
>> The test was done in a QEMU environment with a two-core CPU using iperf3:
>> taskset 01 iperf3 -c 192.168.2.250 -t 3 -u -R;
>> taskset 02 iperf3 -c 192.168.2.250 -t 3 -u -R;
>> 
>> Previous RPS:
>>                 CPU0       CPU1
>> NET_RX:         45         0       (before iperf3 testing)
>> NET_RX:         1095       241     (after iperf3 testing)
>> 
>> Patched RPS:
>>                 CPU0       CPU1
>> NET_RX:         28         4       (before iperf3 testing)
>> NET_RX:         573        32      (after iperf3 testing)
>
>Sincerely.
>Xu Xin


* Re: [PATCH v5 2/6] ksm: support unsharing zero pages placed by KSM
  2023-03-11  5:37 xu xin
@ 2023-03-13 12:14 ` David Hildenbrand
From: David Hildenbrand @ 2023-03-13 12:14 UTC (permalink / raw)
  To: xu xin
  Cc: akpm, imbrenda, jiang.xuexin, linux-kernel, linux-mm,
	ran.xiaokai, xu.xin16, yang.yang29

On 11.03.23 06:37, xu xin wrote:
> [sorry to reply so late, on vacation too, and my mailing system has some kind of problem]
> 
>> [sorry, was on vacation last week]
> 
>>> Why use flags if both conditions are mutually exclusive?
>>
>> Just to make the return value of break_ksm_pmd_entry() more expressive and
>> understandable, because break_ksm_pmd_entry() has three possible return
>> values (0, 1, 2).
> 
>> It adds confusion. Just simplify it please.
> 
> So I think it's good to add an enum value of 0 listed here, as suggested
> by Claudio Imbrenda.
> 

Please keep it simple.

>>
>>> MADV_UNMERGEABLE -> unmerge_ksm_pages() will never unshare the shared
>>> zeropage? I thought the patch description mentions that that is one of
>>> the goals?
>>
>> No, MADV_UNMERGEABLE will trigger KSM to unshare the shared zeropages in the
>> context of "get_next_rmap_item() -> unshare_zero_pages()", but not directly in
>> the context of "madvise() -> unmerge_ksm_pages()". The reason for this is to
>> avoid adding long delays to madvise() calls from unsharing zero pages there.
>>
> 
>> Why do we care and make this case special?
> 
> Yeah, the code seems a bit special, but it is the least bad option: unsharing
> zero pages is complex and CPU-consuming, because checking whether the page we
> get was actually placed by KSM is not an easy thing to do in the context of
> unmerge_ksm_pages.
> 
> In experiments, unsharing zero pages in the context of unmerge_ksm_pages made
> the user's madvise() call take five times as long as with the approach in the
> current patch.

Who exactly cares and why?

> 
> So let's leave it as it is now. I will add a (short) explanation of when and why the new
> unshare_zero_page flag should be used.

I vote to keep it as simple as possible in the initial version.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v5 2/6] ksm: support unsharing zero pages placed by KSM
@ 2023-03-11  5:37 xu xin
  2023-03-13 12:14 ` David Hildenbrand
From: xu xin @ 2023-03-11  5:37 UTC (permalink / raw)
  To: david
  Cc: akpm, imbrenda, jiang.xuexin, linux-kernel, linux-mm,
	ran.xiaokai, xu.xin.sc, xu.xin16, yang.yang29

[sorry to reply so late, on vacation too, and my mailing system has some kind of problem]

>[sorry, was on vacation last week]

>> Why use flags if both conditions are mutually exclusive?
> 
> Just to make the return value of break_ksm_pmd_entry() more expressive and
> understandable, because break_ksm_pmd_entry() has three possible return
> values (0, 1, 2).

> It adds confusion. Just simplify it please.

So I think it's good to add an enum value of 0 listed here, as suggested
by Claudio Imbrenda.
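
Something along these lines, just as a sketch of the idea (the name
NO_SPECIAL_PAGE is only a placeholder; the final naming is of course open
for discussion):

enum break_ksm_pmd_entry_return {
	NO_SPECIAL_PAGE = 0,	/* nothing of interest is mapped at this address */
	HAVE_KSM_PAGE,		/* a KSM page is mapped here */
	HAVE_ZERO_PAGE,		/* the shared zeropage placed by KSM is mapped here */
};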

> 
>> MADV_UNMERGEABLE -> unmerge_ksm_pages() will never unshare the shared
>> zeropage? I thought the patch description mentions that that is one of
>> the goals?
> 
> No, MADV_UNMERGEABLE will trigger KSM to unshare the shared zeropages in the
> context of "get_next_rmap_item() -> unshare_zero_pages()", but not directly in
> the context of "madvise() -> unmerge_ksm_pages()". The reason for this is to
> avoid adding long delays to madvise() calls from unsharing zero pages there.
> 

>Why do we care and make this case special?

Yeah, the code seems a bit special, but it is the least bad option: unsharing
zero pages is complex and CPU-consuming, because checking whether the page we
get was actually placed by KSM is not an easy thing to do in the context of
unmerge_ksm_pages.

In experiments, unsharing zero pages in the context of unmerge_ksm_pages made
the user's madvise() call take five times as long as with the approach in the
current patch.

So let's leave it as it is now. I will add a (short) explanation of when and why the new
unshare_zero_page flag should be used.
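
For example, something along these lines above break_ksm() (just a draft of
the wording, to be refined):

/*
 * break_ksm() - unmerge the KSM page mapped at @addr in @vma.
 * @unshare_zero_page: also unshare the shared zeropage if KSM placed it
 *	there (use_zero_pages).  The madvise()/unmerge_ksm_pages() path
 *	passes false to avoid adding long delays to madvise(); such
 *	zeropages are unshared later, from the ksmd scan path, via
 *	unshare_zero_pages().
 */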

Sincerely.
Xu Xin

