Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* [RFC PATCH 0/4] mm: pagewalk: Rework callback return values and optionally skip the pte level
@ 2019-10-10 13:40 Thomas Hellström (VMware)
  2019-10-10 13:40 ` [RFC PATCH 1/4] mm: Have the mempolicy pagewalk to avoid positive callback return codes Thomas Hellström (VMware)
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-10 13:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, torvalds, kirill; +Cc: Thomas Hellström

This series converts all users of pagewalk positive callback return values
to use negative values instead, so that the positive values are free for
pagewalk control. Then the return value PAGE_WALK_CONTINUE is introduced.
That value is intended for callbacks to indicate that they've handled the
entry and it's not necessary to split and go to a lower level.
Initially this is implemented only for pmd_entry(), but could (should?)
at some point be used also for pud_entry().

Finally the mapping_dirty_helpers pagewalk is modified to use the new value
indicating that it has processed a read-only huge pmd entry and don't want
it to be split and handled at the pte level.

Note: This series still needs some significant testing and another re-audit
to verify that there are no more pagewalk users relying on positive return
values.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [RFC PATCH 1/4] mm: Have the mempolicy pagewalk to avoid positive callback return codes
  2019-10-10 13:40 [RFC PATCH 0/4] mm: pagewalk: Rework callback return values and optionally skip the pte level Thomas Hellström (VMware)
@ 2019-10-10 13:40 ` Thomas Hellström (VMware)
  2019-10-11 13:08   ` Kirill A. Shutemov
  2019-10-10 13:40 ` [RFC PATCH 2/4] fs: task_mmu: Have the pagewalk " Thomas Hellström (VMware)
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 6+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-10 13:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, torvalds, kirill; +Cc: Thomas Hellstrom

From: Linus Torvalds <torvalds@linux-foundation.org>

The pagewalk code is being reworked to have positive callback return codes
do walk control. Avoid using positive return codes: "1" is replaced by
"-EBUSY".

Co-developed-by: Thomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
---
 mm/mempolicy.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ae967bcf954..df34c7498c27 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -482,8 +482,8 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
  *
  * queue_pages_pte_range() has three possible return values:
  * 0 - pages are placed on the right node or queued successfully.
- * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- *     specified.
+ * -EBUSY - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
+ *          specified.
  * -EIO - only MPOL_MF_STRICT was specified and an existing page was already
  *        on a node that does not follow the policy.
  */
@@ -503,7 +503,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	if (ptl) {
 		ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
 		if (ret != 2)
-			return ret;
+			return (ret == 1) ? -EBUSY : ret;
 	}
 	/* THP was split, fall through to pte walk */
 
@@ -546,7 +546,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	cond_resched();
 
 	if (has_unmovable)
-		return 1;
+		return -EBUSY;
 
 	return addr != end ? -EIO : 0;
 }
@@ -669,9 +669,9 @@ static const struct mm_walk_ops queue_pages_walk_ops = {
  * passed via @private.
  *
  * queue_pages_range() has three possible return values:
- * 1 - there is unmovable page, but MPOL_MF_MOVE* & MPOL_MF_STRICT were
- *     specified.
  * 0 - queue pages successfully or no misplaced page.
+ * -EBUSY - there is unmovable page, but MPOL_MF_MOVE* & MPOL_MF_STRICT were
+ *	    specified.
  * -EIO - there is misplaced page and only MPOL_MF_STRICT was specified.
  */
 static int
@@ -1285,7 +1285,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	ret = queue_pages_range(mm, start, end, nmask,
 			  flags | MPOL_MF_INVERT, &pagelist);
 
-	if (ret < 0) {
+	if (ret < 0 && ret != -EBUSY) {
 		err = -EIO;
 		goto up_out;
 	}
@@ -1303,7 +1303,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 				putback_movable_pages(&pagelist);
 		}
 
-		if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
+		if ((ret < 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
 			err = -EIO;
 	} else
 		putback_movable_pages(&pagelist);
-- 
2.21.0



^ permalink raw reply	[flat|nested] 6+ messages in thread

* [RFC PATCH 2/4] fs: task_mmu: Have the pagewalk avoid positive callback return codes
  2019-10-10 13:40 [RFC PATCH 0/4] mm: pagewalk: Rework callback return values and optionally skip the pte level Thomas Hellström (VMware)
  2019-10-10 13:40 ` [RFC PATCH 1/4] mm: Have the mempolicy pagewalk to avoid positive callback return codes Thomas Hellström (VMware)
@ 2019-10-10 13:40 ` " Thomas Hellström (VMware)
  2019-10-10 13:40 ` [RFC PATCH 3/4] mm: pagewalk: Disallow user positive callback return values and use them for walk control Thomas Hellström (VMware)
  2019-10-10 13:40 ` [RFC PATCH 4/4] mm: mapping_dirty_helpers: Handle huge pmds correctly Thomas Hellström (VMware)
  3 siblings, 0 replies; 6+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-10 13:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, torvalds, kirill; +Cc: Thomas Hellstrom

From: Thomas Hellstrom <thellstrom@vmware.com>

The pagewalk code is being reworked to have positive callback return codes
mean "walk control". Avoid using positive return codes: "1" is replaced by
"-ENOBUFS".

Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
---
 fs/proc/task_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 9442631fd4af..ef11969d9ba1 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1265,7 +1265,7 @@ struct pagemapread {
 #define PM_SWAP			BIT_ULL(62)
 #define PM_PRESENT		BIT_ULL(63)
 
-#define PM_END_OF_BUFFER    1
+#define PM_END_OF_BUFFER    (-ENOBUFS)
 
 static inline pagemap_entry_t make_pme(u64 frame, u64 flags)
 {
-- 
2.21.0



^ permalink raw reply	[flat|nested] 6+ messages in thread

* [RFC PATCH 3/4] mm: pagewalk: Disallow user positive callback return values and use them for walk control
  2019-10-10 13:40 [RFC PATCH 0/4] mm: pagewalk: Rework callback return values and optionally skip the pte level Thomas Hellström (VMware)
  2019-10-10 13:40 ` [RFC PATCH 1/4] mm: Have the mempolicy pagewalk to avoid positive callback return codes Thomas Hellström (VMware)
  2019-10-10 13:40 ` [RFC PATCH 2/4] fs: task_mmu: Have the pagewalk " Thomas Hellström (VMware)
@ 2019-10-10 13:40 ` Thomas Hellström (VMware)
  2019-10-10 13:40 ` [RFC PATCH 4/4] mm: mapping_dirty_helpers: Handle huge pmds correctly Thomas Hellström (VMware)
  3 siblings, 0 replies; 6+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-10 13:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, torvalds, kirill; +Cc: Thomas Hellstrom

From: Linus Torvalds <torvalds@linux-foundation.org>

When we have both a pmd_entry() and a pte_entry() callback, in some
siutaions it is desirable not to traverse the pte level.
Reserve positive callback return values for walk control and define a
return value PAGE_WALK_CONTINUE that means skip lower level traversal
and continue the walk.

Co-developed-by: Thomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
---
 include/linux/pagewalk.h | 6 ++++++
 mm/pagewalk.c            | 9 ++++++---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 6ec82e92c87f..d9e5d1927315 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -4,6 +4,12 @@
 
 #include <linux/mm.h>
 
+/*
+ * pmd_entry() Return code meaning skip to next entry.
+ * Don't look for lower levels
+ */
+#define PAGE_WALK_CONTINUE 1
+
 struct mm_walk;
 
 /**
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index ea0b9e606ad1..d2483d432fda 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -52,8 +52,12 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 		 */
 		if (ops->pmd_entry)
 			err = ops->pmd_entry(pmd, addr, next, walk);
-		if (err)
+		if (err < 0)
 			break;
+		if (err == PAGE_WALK_CONTINUE) {
+			err = 0;
+			continue;
+		}
 
 		/*
 		 * Check this here so we only break down trans_huge
@@ -291,8 +295,7 @@ static int __walk_page_range(unsigned long start, unsigned long end,
  *
  *  - 0  : succeeded to handle the current entry, and if you don't reach the
  *         end address yet, continue to walk.
- *  - >0 : succeeded to handle the current entry, and return to the caller
- *         with caller specific value.
+ *  - >0 : Reserved for walk control. Use only PAGE_WALK_XX values.
  *  - <0 : failed to handle the current entry, and return to the caller
  *         with error code.
  *
-- 
2.21.0



^ permalink raw reply	[flat|nested] 6+ messages in thread

* [RFC PATCH 4/4] mm: mapping_dirty_helpers: Handle huge pmds correctly
  2019-10-10 13:40 [RFC PATCH 0/4] mm: pagewalk: Rework callback return values and optionally skip the pte level Thomas Hellström (VMware)
                   ` (2 preceding siblings ...)
  2019-10-10 13:40 ` [RFC PATCH 3/4] mm: pagewalk: Disallow user positive callback return values and use them for walk control Thomas Hellström (VMware)
@ 2019-10-10 13:40 ` Thomas Hellström (VMware)
  3 siblings, 0 replies; 6+ messages in thread
From: Thomas Hellström (VMware) @ 2019-10-10 13:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, torvalds, kirill; +Cc: Thomas Hellstrom

From: Thomas Hellstrom <thellstrom@vmware.com>

We always do dirty tracking on the PTE level. This means that any huge
pmds we encounter should be read-only and not dirty: We can just skip
those. Write-enabled huge pmds should not exist. They should have been
split when made write-enabled. Warn and attempt to split them.

Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
---
 mm/mapping_dirty_helpers.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index 799b9154b48f..f61bb9de1530 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -115,11 +115,18 @@ static int clean_record_pte(pte_t *pte, unsigned long addr,
 static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
 			      struct mm_walk *walk)
 {
-	/* Dirty-tracking should be handled on the pte level */
 	pmd_t pmdval = pmd_read_atomic(pmd);
 
-	if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
-		WARN_ON(pmd_write(pmdval) || pmd_dirty(pmdval));
+	/*
+	 * Dirty-tracking should be handled on the pte level, and write-
+	 * enabled huge PMDS should never have been created. Warn on those.
+	 * Read-only huge PMDS can't be dirty so we just skip them.
+	 */
+	if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval)) {
+		if (WARN_ON(pmd_write(pmdval) || pmd_dirty(pmdval)))
+			return 0;
+		return PAGE_WALK_CONTINUE;
+	}
 
 	return 0;
 }
-- 
2.21.0



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH 1/4] mm: Have the mempolicy pagewalk to avoid positive callback return codes
  2019-10-10 13:40 ` [RFC PATCH 1/4] mm: Have the mempolicy pagewalk to avoid positive callback return codes Thomas Hellström (VMware)
@ 2019-10-11 13:08   ` Kirill A. Shutemov
  0 siblings, 0 replies; 6+ messages in thread
From: Kirill A. Shutemov @ 2019-10-11 13:08 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: linux-kernel, linux-mm, torvalds, Thomas Hellstrom

On Thu, Oct 10, 2019 at 03:40:55PM +0200, Thomas Hellström (VMware) wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> 
> The pagewalk code is being reworked to have positive callback return codes
> do walk control. Avoid using positive return codes: "1" is replaced by
> "-EBUSY".
> 
> Co-developed-by: Thomas Hellstrom <thellstrom@vmware.com>
> Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
> ---
>  mm/mempolicy.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4ae967bcf954..df34c7498c27 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -482,8 +482,8 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
>   *
>   * queue_pages_pte_range() has three possible return values:
>   * 0 - pages are placed on the right node or queued successfully.
> - * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
> - *     specified.
> + * -EBUSY - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
> + *          specified.
>   * -EIO - only MPOL_MF_STRICT was specified and an existing page was already
>   *        on a node that does not follow the policy.
>   */
> @@ -503,7 +503,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
>  	if (ptl) {
>  		ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
>  		if (ret != 2)
> -			return ret;
> +			return (ret == 1) ? -EBUSY : ret;

It would be cleaner to propagate the error code logic to queue_pages_pmd()
too: 0 - placed, 1 - split, -EBUSY - unmovable, ...

>  	}
>  	/* THP was split, fall through to pte walk */
>  
> @@ -546,7 +546,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
>  	cond_resched();
>  
>  	if (has_unmovable)
> -		return 1;
> +		return -EBUSY;
>  
>  	return addr != end ? -EIO : 0;
>  }
> @@ -669,9 +669,9 @@ static const struct mm_walk_ops queue_pages_walk_ops = {
>   * passed via @private.
>   *
>   * queue_pages_range() has three possible return values:
> - * 1 - there is unmovable page, but MPOL_MF_MOVE* & MPOL_MF_STRICT were
> - *     specified.
>   * 0 - queue pages successfully or no misplaced page.
> + * -EBUSY - there is unmovable page, but MPOL_MF_MOVE* & MPOL_MF_STRICT were
> + *	    specified.
>   * -EIO - there is misplaced page and only MPOL_MF_STRICT was specified.
>   */
>  static int
> @@ -1285,7 +1285,7 @@ static long do_mbind(unsigned long start, unsigned long len,
>  	ret = queue_pages_range(mm, start, end, nmask,
>  			  flags | MPOL_MF_INVERT, &pagelist);
>  
> -	if (ret < 0) {
> +	if (ret < 0 && ret != -EBUSY) {
>  		err = -EIO;
>  		goto up_out;
>  	}
> @@ -1303,7 +1303,7 @@ static long do_mbind(unsigned long start, unsigned long len,
>  				putback_movable_pages(&pagelist);
>  		}
>  
> -		if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
> +		if ((ret < 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
>  			err = -EIO;
>  	} else
>  		putback_movable_pages(&pagelist);
> -- 
> 2.21.0
> 

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, back to index

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-10 13:40 [RFC PATCH 0/4] mm: pagewalk: Rework callback return values and optionally skip the pte level Thomas Hellström (VMware)
2019-10-10 13:40 ` [RFC PATCH 1/4] mm: Have the mempolicy pagewalk to avoid positive callback return codes Thomas Hellström (VMware)
2019-10-11 13:08   ` Kirill A. Shutemov
2019-10-10 13:40 ` [RFC PATCH 2/4] fs: task_mmu: Have the pagewalk " Thomas Hellström (VMware)
2019-10-10 13:40 ` [RFC PATCH 3/4] mm: pagewalk: Disallow user positive callback return values and use them for walk control Thomas Hellström (VMware)
2019-10-10 13:40 ` [RFC PATCH 4/4] mm: mapping_dirty_helpers: Handle huge pmds correctly Thomas Hellström (VMware)

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org linux-mm@archiver.kernel.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/ public-inbox