linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 0/4] A few cleanup and fixup patches for migration
@ 2022-05-30 11:30 Miaohe Lin
  2022-05-30 11:30 ` [PATCH v4 1/4] mm: reduce the rcu lock duration Miaohe Lin
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Miaohe Lin @ 2022-05-30 11:30 UTC (permalink / raw)
  To: akpm, naoya.horiguchi
  Cc: peterx, apopple, ying.huang, osalvador, mike.kravetz, songmuchun,
	hch, dhowells, cl, linux-mm, linux-kernel, linmiaohe

Hi everyone,
This series contains a few patches to remove the unneeded lock page and
PageMovable check and to reduce the rcu lock duration. It also fixes a
potential pte_unmap on a not mapped pte. More details can be found in the
respective changelogs. Thanks!

---
v4:
  collect Reviewed-by tag of Oscar.
  fix build error reported by kernel test robot.
v3:
[PATCH 1/4] mm/migration: reduce the rcu lock duration
  add the analysis that the original race condition isn't possible now
  also reduce the rcu lock duration in kernel_migrate_pages
[PATCH 2/4] mm/migration: remove unneeded lock page and PageMovable check
  let free_pages_prepare() clear PG_isolated for us
[PATCH 3/4] mm/migration: return errno when isolate_huge_page failed
  rename isolate_huge_page to isolate_hugetlb
[PATCH 4/4] mm/migration: fix potential pte_unmap on an not mapped pte
  collect Reviewed-by tag
v2:
  collect Reviewed-by tag
  make isolate_huge_page consistent with isolate_lru_page
  add hugetlbfs variant of hugetlb_migration_entry_wait
v1:
  rebase [1] on mainline.

[1] https://lore.kernel.org/lkml/20220304093409.25829-2-linmiaohe@huawei.com/T/
---

Miaohe Lin (4):
  mm: reduce the rcu lock duration
  mm/migration: remove unneeded lock page and PageMovable check
  mm/migration: return errno when isolate_huge_page failed
  mm/migration: fix potential pte_unmap on an not mapped pte

 include/linux/hugetlb.h |  6 +++---
 include/linux/swapops.h | 12 ++++++++----
 mm/gup.c                |  2 +-
 mm/hugetlb.c            | 15 +++++++--------
 mm/memory-failure.c     |  2 +-
 mm/memory_hotplug.c     |  2 +-
 mm/mempolicy.c          |  5 ++---
 mm/migrate.c            | 42 +++++++++++++++++++++++++----------------
 8 files changed, 49 insertions(+), 37 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v4 1/4] mm: reduce the rcu lock duration
  2022-05-30 11:30 [PATCH v4 0/4] A few cleanup and fixup patches for migration Miaohe Lin
@ 2022-05-30 11:30 ` Miaohe Lin
  2022-05-31  6:06   ` Ying Huang
                     ` (2 more replies)
  2022-05-30 11:30 ` [PATCH v4 2/4] mm/migration: remove unneeded lock page and PageMovable check Miaohe Lin
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 20+ messages in thread
From: Miaohe Lin @ 2022-05-30 11:30 UTC (permalink / raw)
  To: akpm, naoya.horiguchi
  Cc: peterx, apopple, ying.huang, osalvador, mike.kravetz, songmuchun,
	hch, dhowells, cl, linux-mm, linux-kernel, linmiaohe

Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
extends the period of the rcu_read_lock until after the permissions checks
are done to prevent the task pointed to from changing from under us. But
since the task_struct refcount is also taken at that time, the reference to
the task is guaranteed to be stable. So it's unnecessary to extend the
period of the rcu_read_lock. Release the rcu lock after the task refcount
is successfully grabbed to reduce the rcu holding time.

Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
---
 mm/mempolicy.c | 3 +--
 mm/migrate.c   | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0b4ba3ee810e..2dad094177bf 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1609,6 +1609,7 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
 		goto out;
 	}
 	get_task_struct(task);
+	rcu_read_unlock();
 
 	err = -EINVAL;
 
@@ -1617,11 +1618,9 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
 	 * Use the regular "ptrace_may_access()" checks.
 	 */
 	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
-		rcu_read_unlock();
 		err = -EPERM;
 		goto out_put;
 	}
-	rcu_read_unlock();
 
 	task_nodes = cpuset_mems_allowed(task);
 	/* Is the user allowed to access the target nodes? */
diff --git a/mm/migrate.c b/mm/migrate.c
index e51588e95f57..e88ebb88fa6f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1902,17 +1902,16 @@ static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
 		return ERR_PTR(-ESRCH);
 	}
 	get_task_struct(task);
+	rcu_read_unlock();
 
 	/*
 	 * Check if this process has the right to modify the specified
 	 * process. Use the regular "ptrace_may_access()" checks.
 	 */
 	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
-		rcu_read_unlock();
 		mm = ERR_PTR(-EPERM);
 		goto out;
 	}
-	rcu_read_unlock();
 
 	mm = ERR_PTR(security_task_movememory(task));
 	if (IS_ERR(mm))
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v4 2/4] mm/migration: remove unneeded lock page and PageMovable check
  2022-05-30 11:30 [PATCH v4 0/4] A few cleanup and fixup patches for migration Miaohe Lin
  2022-05-30 11:30 ` [PATCH v4 1/4] mm: reduce the rcu lock duration Miaohe Lin
@ 2022-05-30 11:30 ` Miaohe Lin
  2022-05-31 12:50   ` David Hildenbrand
  2022-05-30 11:30 ` [PATCH v4 3/4] mm/migration: return errno when isolate_huge_page failed Miaohe Lin
  2022-05-30 11:30 ` [PATCH v4 4/4] mm/migration: fix potential pte_unmap on an not mapped pte Miaohe Lin
  3 siblings, 1 reply; 20+ messages in thread
From: Miaohe Lin @ 2022-05-30 11:30 UTC (permalink / raw)
  To: akpm, naoya.horiguchi
  Cc: peterx, apopple, ying.huang, osalvador, mike.kravetz, songmuchun,
	hch, dhowells, cl, linux-mm, linux-kernel, linmiaohe

When a non-lru movable page is freed from under us, __ClearPageMovable must
already have been done.  So we can remove the unneeded lock page and
PageMovable check here. Also free_pages_prepare() will clear PG_isolated
for us, so we can further remove ClearPageIsolated as suggested by David.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
---
 mm/migrate.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index e88ebb88fa6f..337336115e43 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1090,15 +1090,10 @@ static int unmap_and_move(new_page_t get_new_page,
 		return -ENOSYS;
 
 	if (page_count(page) == 1) {
-		/* page was freed from under us. So we are done. */
+		/* Page was freed from under us. So we are done. */
 		ClearPageActive(page);
 		ClearPageUnevictable(page);
-		if (unlikely(__PageMovable(page))) {
-			lock_page(page);
-			if (!PageMovable(page))
-				ClearPageIsolated(page);
-			unlock_page(page);
-		}
+		/* free_pages_prepare() will clear PG_isolated. */
 		goto out;
 	}
 
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v4 3/4] mm/migration: return errno when isolate_huge_page failed
  2022-05-30 11:30 [PATCH v4 0/4] A few cleanup and fixup patches for migration Miaohe Lin
  2022-05-30 11:30 ` [PATCH v4 1/4] mm: reduce the rcu lock duration Miaohe Lin
  2022-05-30 11:30 ` [PATCH v4 2/4] mm/migration: remove unneeded lock page and PageMovable check Miaohe Lin
@ 2022-05-30 11:30 ` Miaohe Lin
  2022-05-30 11:30 ` [PATCH v4 4/4] mm/migration: fix potential pte_unmap on an not mapped pte Miaohe Lin
  3 siblings, 0 replies; 20+ messages in thread
From: Miaohe Lin @ 2022-05-30 11:30 UTC (permalink / raw)
  To: akpm, naoya.horiguchi
  Cc: peterx, apopple, ying.huang, osalvador, mike.kravetz, songmuchun,
	hch, dhowells, cl, linux-mm, linux-kernel, linmiaohe

We might fail to isolate a huge page due to e.g. the page being under
migration, which clears HPageMigratable. We should return an errno in this
case rather than always returning 1, which could confuse the user, i.e. the
caller might think all of the memory was migrated while the hugetlb page was
left behind. Make the prototype of isolate_huge_page consistent with
isolate_lru_page as suggested by Huang Ying, and rename isolate_huge_page
to isolate_hugetlb as suggested by Muchun to improve the readability.

Fixes: e8db67eb0ded ("mm: migrate: move_pages() supports thp migration")
Suggested-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reported-by: kernel test robot <lkp@intel.com> (build error)
---
 include/linux/hugetlb.h |  6 +++---
 mm/gup.c                |  2 +-
 mm/hugetlb.c            | 11 +++++------
 mm/memory-failure.c     |  2 +-
 mm/memory_hotplug.c     |  2 +-
 mm/mempolicy.c          |  2 +-
 mm/migrate.c            |  7 ++++---
 7 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e4cff27d1198..756b66ff025e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -170,7 +170,7 @@ bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						vm_flags_t vm_flags);
 long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
 						long freed);
-bool isolate_huge_page(struct page *page, struct list_head *list);
+int isolate_hugetlb(struct page *page, struct list_head *list);
 int get_hwpoison_huge_page(struct page *page, bool *hugetlb);
 int get_huge_page_for_hwpoison(unsigned long pfn, int flags);
 void putback_active_hugepage(struct page *page);
@@ -376,9 +376,9 @@ static inline pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
 	return NULL;
 }
 
-static inline bool isolate_huge_page(struct page *page, struct list_head *list)
+static inline int isolate_hugetlb(struct page *page, struct list_head *list)
 {
-	return false;
+	return -EBUSY;
 }
 
 static inline int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
diff --git a/mm/gup.c b/mm/gup.c
index 551264407624..3899dcb288a6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1898,7 +1898,7 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
 		 * Try to move out any movable page before pinning the range.
 		 */
 		if (folio_test_hugetlb(folio)) {
-			if (!isolate_huge_page(&folio->page,
+			if (isolate_hugetlb(&folio->page,
 						&movable_page_list))
 				isolation_error_count++;
 			continue;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7c468ac1d069..65ab215b39ee 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2765,8 +2765,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 		 * Fail with -EBUSY if not possible.
 		 */
 		spin_unlock_irq(&hugetlb_lock);
-		if (!isolate_huge_page(old_page, list))
-			ret = -EBUSY;
+		ret = isolate_hugetlb(old_page, list);
 		spin_lock_irq(&hugetlb_lock);
 		goto free_new;
 	} else if (!HPageFreed(old_page)) {
@@ -2842,7 +2841,7 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
 	if (hstate_is_gigantic(h))
 		return -ENOMEM;
 
-	if (page_count(head) && isolate_huge_page(head, list))
+	if (page_count(head) && !isolate_hugetlb(head, list))
 		ret = 0;
 	else if (!page_count(head))
 		ret = alloc_and_dissolve_huge_page(h, head, list);
@@ -6952,15 +6951,15 @@ follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int fla
 	return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
 }
 
-bool isolate_huge_page(struct page *page, struct list_head *list)
+int isolate_hugetlb(struct page *page, struct list_head *list)
 {
-	bool ret = true;
+	int ret = 0;
 
 	spin_lock_irq(&hugetlb_lock);
 	if (!PageHeadHuge(page) ||
 	    !HPageMigratable(page) ||
 	    !get_page_unless_zero(page)) {
-		ret = false;
+		ret = -EBUSY;
 		goto unlock;
 	}
 	ClearHPageMigratable(page);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b85661cbdc4a..5deb1b1cb2fd 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2166,7 +2166,7 @@ static bool isolate_page(struct page *page, struct list_head *pagelist)
 	bool lru = PageLRU(page);
 
 	if (PageHuge(page)) {
-		isolated = isolate_huge_page(page, pagelist);
+		isolated = !isolate_hugetlb(page, pagelist);
 	} else {
 		if (lru)
 			isolated = !isolate_lru_page(page);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1213d0c67a53..649a50ed90f3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1643,7 +1643,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 
 		if (PageHuge(page)) {
 			pfn = page_to_pfn(head) + compound_nr(head) - 1;
-			isolate_huge_page(head, &source);
+			isolate_hugetlb(head, &source);
 			continue;
 		} else if (PageTransHuge(page))
 			pfn = page_to_pfn(head) + thp_nr_pages(page) - 1;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2dad094177bf..f96d55131689 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -602,7 +602,7 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
 	if (flags & (MPOL_MF_MOVE_ALL) ||
 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
-		if (!isolate_huge_page(page, qp->pagelist) &&
+		if (isolate_hugetlb(page, qp->pagelist) &&
 			(flags & MPOL_MF_STRICT))
 			/*
 			 * Failed to isolate page but allow migrating pages
diff --git a/mm/migrate.c b/mm/migrate.c
index 337336115e43..a14c6d5ceafc 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -133,7 +133,7 @@ static void putback_movable_page(struct page *page)
  *
  * This function shall be used whenever the isolated pageset has been
  * built from lru, balloon, hugetlbfs page. See isolate_migratepages_range()
- * and isolate_huge_page().
+ * and isolate_hugetlb().
  */
 void putback_movable_pages(struct list_head *l)
 {
@@ -1627,8 +1627,9 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 
 	if (PageHuge(page)) {
 		if (PageHead(page)) {
-			isolate_huge_page(page, pagelist);
-			err = 1;
+			err = isolate_hugetlb(page, pagelist);
+			if (!err)
+				err = 1;
 		}
 	} else {
 		struct page *head;
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v4 4/4] mm/migration: fix potential pte_unmap on an not mapped pte
  2022-05-30 11:30 [PATCH v4 0/4] A few cleanup and fixup patches for migration Miaohe Lin
                   ` (2 preceding siblings ...)
  2022-05-30 11:30 ` [PATCH v4 3/4] mm/migration: return errno when isolate_huge_page failed Miaohe Lin
@ 2022-05-30 11:30 ` Miaohe Lin
  3 siblings, 0 replies; 20+ messages in thread
From: Miaohe Lin @ 2022-05-30 11:30 UTC (permalink / raw)
  To: akpm, naoya.horiguchi
  Cc: peterx, apopple, ying.huang, osalvador, mike.kravetz, songmuchun,
	hch, dhowells, cl, linux-mm, linux-kernel, linmiaohe

__migration_entry_wait and migration_entry_wait_on_locked assume the pte is
always mapped by the caller. But this is not the case when they are called
from migration_entry_wait_huge and follow_huge_pmd. Add a hugetlbfs variant
that calls hugetlb_migration_entry_wait(ptep == NULL) to fix this issue.

Fixes: 30dad30922cc ("mm: migration: add migrate_entry_wait_huge()")
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 include/linux/swapops.h | 12 ++++++++----
 mm/hugetlb.c            |  4 ++--
 mm/migrate.c            | 23 +++++++++++++++++++----
 3 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index f24775b41880..bb7afd03a324 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -244,8 +244,10 @@ extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 					spinlock_t *ptl);
 extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 					unsigned long address);
-extern void migration_entry_wait_huge(struct vm_area_struct *vma,
-		struct mm_struct *mm, pte_t *pte);
+#ifdef CONFIG_HUGETLB_PAGE
+extern void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl);
+extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
+#endif
 #else
 static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
 {
@@ -271,8 +273,10 @@ static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 					spinlock_t *ptl) { }
 static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 					 unsigned long address) { }
-static inline void migration_entry_wait_huge(struct vm_area_struct *vma,
-		struct mm_struct *mm, pte_t *pte) { }
+#ifdef CONFIG_HUGETLB_PAGE
+static inline void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { }
+static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
+#endif
 static inline int is_writable_migration_entry(swp_entry_t entry)
 {
 	return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 65ab215b39ee..c092157b896a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5694,7 +5694,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
-			migration_entry_wait_huge(vma, mm, ptep);
+			migration_entry_wait_huge(vma, ptep);
 			return 0;
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON_LARGE |
@@ -6919,7 +6919,7 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	} else {
 		if (is_hugetlb_entry_migration(pte)) {
 			spin_unlock(ptl);
-			__migration_entry_wait(mm, (pte_t *)pmd, ptl);
+			__migration_entry_wait_huge((pte_t *)pmd, ptl);
 			goto retry;
 		}
 		/*
diff --git a/mm/migrate.c b/mm/migrate.c
index a14c6d5ceafc..3359a6e169bb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -315,13 +315,28 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 	__migration_entry_wait(mm, ptep, ptl);
 }
 
-void migration_entry_wait_huge(struct vm_area_struct *vma,
-		struct mm_struct *mm, pte_t *pte)
+#ifdef CONFIG_HUGETLB_PAGE
+void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl)
 {
-	spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), mm, pte);
-	__migration_entry_wait(mm, pte, ptl);
+	pte_t pte;
+
+	spin_lock(ptl);
+	pte = huge_ptep_get(ptep);
+
+	if (unlikely(!is_hugetlb_entry_migration(pte)))
+		spin_unlock(ptl);
+	else
+		migration_entry_wait_on_locked(pte_to_swp_entry(pte), NULL, ptl);
 }
 
+void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte)
+{
+	spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, pte);
+
+	__migration_entry_wait_huge(pte, ptl);
+}
+#endif
+
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
 {
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/4] mm: reduce the rcu lock duration
  2022-05-30 11:30 ` [PATCH v4 1/4] mm: reduce the rcu lock duration Miaohe Lin
@ 2022-05-31  6:06   ` Ying Huang
  2022-05-31  9:01     ` Miaohe Lin
  2022-05-31 12:50   ` David Hildenbrand
  2022-05-31 12:58   ` Matthew Wilcox
  2 siblings, 1 reply; 20+ messages in thread
From: Ying Huang @ 2022-05-31  6:06 UTC (permalink / raw)
  To: Miaohe Lin, akpm, naoya.horiguchi
  Cc: peterx, apopple, osalvador, mike.kravetz, songmuchun, hch,
	dhowells, cl, linux-mm, linux-kernel, Eric W. Biederman

On Mon, 2022-05-30 at 19:30 +0800, Miaohe Lin wrote:
> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
> extends the period of the rcu_read_lock until after the permissions checks
> are done to prevent the task pointed to from changing from under us. But
> the task_struct refcount is also taken at that time, the reference to task
> is guaranteed to be stable. So it's unnecessary to extend the period of
> the rcu_read_lock. Release the rcu lock after task refcount is successfully
> grabbed to reduce the rcu holding time.

Sorry for the late reply, I have been busy with something else recently.

I have just read the whole thread of the original patch discussion. 
During discussion, in

https://lore.kernel.org/lkml/alpine.DEB.2.00.1202241131400.3726@router.home/

a patch that is the same as yours was proposed.  Then, in the following
message, Eric thought that the rcu read lock should not be released until
the permission is checked,

https://lore.kernel.org/lkml/87sjhzun47.fsf@xmission.com/

"
At the moment I suspect the permissions checks are not safe unless
performed under both rcu_read_lock and task_lock to ensure that
the task<->mm association does not change on us while we are
working.  Even with that the cred can change under us but at least
we know the cred will be valid until rcu_read_unlock happens.
"

So the rcu lock duration was extended in the following message.

https://lore.kernel.org/lkml/alpine.DEB.2.00.1202271238450.32410@router.home/

But, after some thought, I don't think the extended rcu read lock adds much
value, because the permission may still change after the permission check.
There's not much difference.

So, I have no objection to the patch itself.  But you should add more
information in the patch description about why the RCU protected region was
extended and why we can reduce it.

Best Regards,
Huang, Ying

> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Christoph Lameter <cl@linux.com>
> ---
>  mm/mempolicy.c | 3 +--
>  mm/migrate.c   | 3 +--
>  2 files changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 0b4ba3ee810e..2dad094177bf 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1609,6 +1609,7 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
>  		goto out;
>  	}
>  	get_task_struct(task);
> +	rcu_read_unlock();
>  
>  	err = -EINVAL;
>  
> @@ -1617,11 +1618,9 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
>  	 * Use the regular "ptrace_may_access()" checks.
>  	 */
>  	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
> -		rcu_read_unlock();
>  		err = -EPERM;
>  		goto out_put;
>  	}
> -	rcu_read_unlock();
>  
>  	task_nodes = cpuset_mems_allowed(task);
>  	/* Is the user allowed to access the target nodes? */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e51588e95f57..e88ebb88fa6f 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1902,17 +1902,16 @@ static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
>  		return ERR_PTR(-ESRCH);
>  	}
>  	get_task_struct(task);
> +	rcu_read_unlock();
>  
>  	/*
>  	 * Check if this process has the right to modify the specified
>  	 * process. Use the regular "ptrace_may_access()" checks.
>  	 */
>  	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
> -		rcu_read_unlock();
>  		mm = ERR_PTR(-EPERM);
>  		goto out;
>  	}
> -	rcu_read_unlock();
>  
>  	mm = ERR_PTR(security_task_movememory(task));
>  	if (IS_ERR(mm))



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/4] mm: reduce the rcu lock duration
  2022-05-31  6:06   ` Ying Huang
@ 2022-05-31  9:01     ` Miaohe Lin
  2022-05-31 16:09       ` Eric W. Biederman
  0 siblings, 1 reply; 20+ messages in thread
From: Miaohe Lin @ 2022-05-31  9:01 UTC (permalink / raw)
  To: Ying Huang
  Cc: peterx, apopple, osalvador, mike.kravetz, songmuchun, hch,
	dhowells, cl, linux-mm, linux-kernel, Eric W. Biederman, akpm,
	naoya.horiguchi

On 2022/5/31 14:06, Ying Huang wrote:
> On Mon, 2022-05-30 at 19:30 +0800, Miaohe Lin wrote:
>> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
>> extends the period of the rcu_read_lock until after the permissions checks
>> are done to prevent the task pointed to from changing from under us. But
>> the task_struct refcount is also taken at that time, the reference to task
>> is guaranteed to be stable. So it's unnecessary to extend the period of
>> the rcu_read_lock. Release the rcu lock after task refcount is successfully
>> grabbed to reduce the rcu holding time.
> 
> Sorry for late reply, I am busy on something else recently.

That's all right. Many thanks for your hard work. :)

> 
> I have just read the whole thread of the original patch discussion. 
> During discussion, in
> 
> https://lore.kernel.org/lkml/alpine.DEB.2.00.1202241131400.3726@router.home/
> 
> a patch that is same as your one is proposed.  Then in the following
> message, Eric think that the rcu read lock should be released until
> permission is checked,
> 
> https://lore.kernel.org/lkml/87sjhzun47.fsf@xmission.com/
> 
> "
> At the moment I suspect the permissions checks are not safe unless
> performed under both rcu_read_lock and task_lock to ensure that
> the task<->mm association does not change on us while we are
> working.  Even with that the cred can change under us but at least
> we know the cred will be valid until rcu_read_unlock happens.
> "
> 
> So the rcu lock duration is enlarged in the following message.
> 
> https://lore.kernel.org/lkml/alpine.DEB.2.00.1202271238450.32410@router.home/
> 
> But, after some thought, I don't think extended rcu read lock adds much
> value.  Because after permission checking the permission may still be
> changed.  There's no much difference.
> 
> So, I have no objection to the patch itself.  But you should add more
> information in patch description about why the RCU proected region is
> extended and why we can reduce it.

Does the patch description below make sense to you?

"
Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
extends the period of the rcu_read_lock until after the permissions checks
are done because it suspects the permissions checks are not safe unless
performed under both rcu_read_lock and task_lock to ensure the task<->mm
association does not change on us while we are working [1]. But the
extended rcu read lock does not add much value, because the permission may
still change after the permission check. There's not much difference, so
it's unnecessary to extend the period of the rcu_read_lock. Release the rcu
lock after the task refcount is successfully grabbed to reduce the rcu
holding time.

[1] https://lore.kernel.org/lkml/87sjhzun47.fsf@xmission.com/
"

> 
> Best Regards,
> Huang, Ying

Thanks again!

> 
>> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> Reviewed-by: Oscar Salvador <osalvador@suse.de>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: David Howells <dhowells@redhat.com>
>> Cc: Christoph Lameter <cl@linux.com>
>> ---
>>  mm/mempolicy.c | 3 +--
>>  mm/migrate.c   | 3 +--
>>  2 files changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index 0b4ba3ee810e..2dad094177bf 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -1609,6 +1609,7 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
>>  		goto out;
>>  	}
>>  	get_task_struct(task);
>> +	rcu_read_unlock();
>>  
>>  	err = -EINVAL;
>>  
>> @@ -1617,11 +1618,9 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
>>  	 * Use the regular "ptrace_may_access()" checks.
>>  	 */
>>  	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
>> -		rcu_read_unlock();
>>  		err = -EPERM;
>>  		goto out_put;
>>  	}
>> -	rcu_read_unlock();
>>  
>>  	task_nodes = cpuset_mems_allowed(task);
>>  	/* Is the user allowed to access the target nodes? */
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index e51588e95f57..e88ebb88fa6f 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1902,17 +1902,16 @@ static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
>>  		return ERR_PTR(-ESRCH);
>>  	}
>>  	get_task_struct(task);
>> +	rcu_read_unlock();
>>  
>>  	/*
>>  	 * Check if this process has the right to modify the specified
>>  	 * process. Use the regular "ptrace_may_access()" checks.
>>  	 */
>>  	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
>> -		rcu_read_unlock();
>>  		mm = ERR_PTR(-EPERM);
>>  		goto out;
>>  	}
>> -	rcu_read_unlock();
>>  
>>  	mm = ERR_PTR(security_task_movememory(task));
>>  	if (IS_ERR(mm))
> 
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/4] mm: reduce the rcu lock duration
  2022-05-30 11:30 ` [PATCH v4 1/4] mm: reduce the rcu lock duration Miaohe Lin
  2022-05-31  6:06   ` Ying Huang
@ 2022-05-31 12:50   ` David Hildenbrand
  2022-05-31 12:58   ` Matthew Wilcox
  2 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-05-31 12:50 UTC (permalink / raw)
  To: Miaohe Lin, akpm, naoya.horiguchi
  Cc: peterx, apopple, ying.huang, osalvador, mike.kravetz, songmuchun,
	hch, dhowells, cl, linux-mm, linux-kernel

On 30.05.22 13:30, Miaohe Lin wrote:
> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
> extends the period of the rcu_read_lock until after the permissions checks
> are done to prevent the task pointed to from changing from under us. But
> the task_struct refcount is also taken at that time, the reference to task
> is guaranteed to be stable. So it's unnecessary to extend the period of
> the rcu_read_lock. Release the rcu lock after task refcount is successfully
> grabbed to reduce the rcu holding time.
> 
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Christoph Lameter <cl@linux.com>


Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 2/4] mm/migration: remove unneeded lock page and PageMovable check
  2022-05-30 11:30 ` [PATCH v4 2/4] mm/migration: remove unneeded lock page and PageMovable check Miaohe Lin
@ 2022-05-31 12:50   ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-05-31 12:50 UTC (permalink / raw)
  To: Miaohe Lin, akpm, naoya.horiguchi
  Cc: peterx, apopple, ying.huang, osalvador, mike.kravetz, songmuchun,
	hch, dhowells, cl, linux-mm, linux-kernel

On 30.05.22 13:30, Miaohe Lin wrote:
> When non-lru movable page was freed from under us, __ClearPageMovable must
> have been done.  So we can remove unneeded lock page and PageMovable check
> here. Also free_pages_prepare() will clear PG_isolated for us, so we can
> further remove ClearPageIsolated as suggested by David.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> ---
>  mm/migrate.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e88ebb88fa6f..337336115e43 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1090,15 +1090,10 @@ static int unmap_and_move(new_page_t get_new_page,
>  		return -ENOSYS;
>  
>  	if (page_count(page) == 1) {
> -		/* page was freed from under us. So we are done. */
> +		/* Page was freed from under us. So we are done. */
>  		ClearPageActive(page);
>  		ClearPageUnevictable(page);
> -		if (unlikely(__PageMovable(page))) {
> -			lock_page(page);
> -			if (!PageMovable(page))
> -				ClearPageIsolated(page);
> -			unlock_page(page);
> -		}
> +		/* free_pages_prepare() will clear PG_isolated. */
>  		goto out;
>  	}
>  

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/4] mm: reduce the rcu lock duration
  2022-05-30 11:30 ` [PATCH v4 1/4] mm: reduce the rcu lock duration Miaohe Lin
  2022-05-31  6:06   ` Ying Huang
  2022-05-31 12:50   ` David Hildenbrand
@ 2022-05-31 12:58   ` Matthew Wilcox
  2022-05-31 13:05     ` Matthew Wilcox
  2 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2022-05-31 12:58 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: akpm, naoya.horiguchi, peterx, apopple, ying.huang, osalvador,
	mike.kravetz, songmuchun, hch, dhowells, cl, linux-mm,
	linux-kernel

On Mon, May 30, 2022 at 07:30:13PM +0800, Miaohe Lin wrote:
> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
> extends the period of the rcu_read_lock until after the permissions checks
> are done to prevent the task pointed to from changing from under us. But
> the task_struct refcount is also taken at that time, the reference to task
> is guaranteed to be stable. So it's unnecessary to extend the period of
> the rcu_read_lock. Release the rcu lock after task refcount is successfully
> grabbed to reduce the rcu holding time.

But why bother?  You know the RCU read lock isn't a "real" lock, right?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/4] mm: reduce the rcu lock duration
  2022-05-31 12:58   ` Matthew Wilcox
@ 2022-05-31 13:05     ` Matthew Wilcox
  2022-06-01  3:21       ` Miaohe Lin
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2022-05-31 13:05 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: akpm, naoya.horiguchi, peterx, apopple, ying.huang, osalvador,
	mike.kravetz, songmuchun, hch, dhowells, cl, linux-mm,
	linux-kernel

On Tue, May 31, 2022 at 01:58:31PM +0100, Matthew Wilcox wrote:
> On Mon, May 30, 2022 at 07:30:13PM +0800, Miaohe Lin wrote:
> > Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
> > extends the period of the rcu_read_lock until after the permissions checks
> > are done to prevent the task pointed to from changing from under us. But
> > the task_struct refcount is also taken at that time, the reference to task
> > is guaranteed to be stable. So it's unnecessary to extend the period of
> > the rcu_read_lock. Release the rcu lock after task refcount is successfully
> > grabbed to reduce the rcu holding time.
> 
> But why bother?  You know the RCU read lock isn't a "real" lock, right?

Looking over this code some more, I think this may harm performance.
ptrace_may_access() itself takes the rcu_read_lock().  So we currently
have:

rcu_read_lock()
rcu_read_lock();
rcu_read_unlock();
rcu_read_unlock();

In at least one RCU configuration, rcu_read_lock() maps to
preempt_disable().  Nested preempt_disable() just bumps a counter, while
that counter reaching zero incurs some actual work.  So nested
rcu_read_lock() can be better than sequential lock/unlock/lock/unlock.

This needs far better justification.
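
As a rough sketch of the two shapes being compared (assuming a
configuration where rcu_read_lock() maps to preempt_disable() and the
preempt count is maintained; this is not code from the thread):

static bool nested_shape(struct task_struct *task)
{
	bool ok;

	rcu_read_lock();	/* preempt count 0 -> 1 */
	ok = ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS);
				/* the rcu_read_lock()/unlock() pair inside
				 * only bumps the count to 2 and back */
	rcu_read_unlock();	/* the count reaches 0 once, here */
	return ok;
}

static bool sequential_shape(struct task_struct *task)
{
	bool ok;

	rcu_read_lock();
	get_task_struct(task);
	rcu_read_unlock();	/* the count reaches 0 a first time... */

	ok = ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS);
				/* ...and a second time inside here */
	put_task_struct(task);
	return ok;
}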

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/4] mm: reduce the rcu lock duration
  2022-05-31  9:01     ` Miaohe Lin
@ 2022-05-31 16:09       ` Eric W. Biederman
  2022-06-01  6:33         ` Miaohe Lin
  0 siblings, 1 reply; 20+ messages in thread
From: Eric W. Biederman @ 2022-05-31 16:09 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Ying Huang, peterx, apopple, osalvador, mike.kravetz, songmuchun,
	hch, dhowells, cl, linux-mm, linux-kernel, akpm, naoya.horiguchi

Miaohe Lin <linmiaohe@huawei.com> writes:

> On 2022/5/31 14:06, Ying Huang wrote:
>> On Mon, 2022-05-30 at 19:30 +0800, Miaohe Lin wrote:
>>> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
>>> extends the period of the rcu_read_lock until after the permissions checks
>>> are done to prevent the task pointed to from changing from under us. But
>>> the task_struct refcount is also taken at that time, the reference to task
>>> is guaranteed to be stable. So it's unnecessary to extend the period of
>>> the rcu_read_lock. Release the rcu lock after task refcount is successfully
>>> grabbed to reduce the rcu holding time.
>> 
>> Sorry for late reply, I am busy on something else recently.
>
> That's all right. Many thanks for your hard work. :)
>
>> 
>> I have just read the whole thread of the original patch discussion. 
>> During discussion, in
>> 
>> https://lore.kernel.org/lkml/alpine.DEB.2.00.1202241131400.3726@router.home/
>> 
>> a patch that is same as your one is proposed.  Then in the following
>> message, Eric think that the rcu read lock should be released until
>> permission is checked,
>> 
>> https://lore.kernel.org/lkml/87sjhzun47.fsf@xmission.com/
>> 
>> "
>> At the moment I suspect the permissions checks are not safe unless
>> performed under both rcu_read_lock and task_lock to ensure that
>> the task<->mm association does not change on us while we are
>> working.  Even with that the cred can change under us but at least
>> we know the cred will be valid until rcu_read_unlock happens.
>> "
>> 
>> So the rcu lock duration is enlarged in the following message.
>> 
>> https://lore.kernel.org/lkml/alpine.DEB.2.00.1202271238450.32410@router.home/
>> 
>> But, after some thought, I don't think extended rcu read lock adds much
>> value.  Because after permission checking the permission may still be
>> changed.  There's no much difference.
>> 
>> So, I have no objection to the patch itself.  But you should add more
>> information in patch description about why the RCU proected region is
>> extended and why we can reduce it.
>
> Does below patch description makes sense for you?
>
> "
> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
> extends the period of the rcu_read_lock until after the permissions checks
> are done because it suspects the permissions checks are not safe unless
> performed under both rcu_read_lock and task_lock to ensure the task<->mm
> association does not change on us while we are working [1]. But extended
> rcu read lock does not add much value. Because after permission checking
> the permission may still be changed. There's no much difference. So it's
> unnecessary to extend the period of the rcu_read_lock. Release the rcu
> lock after task refcount is successfully grabbed to reduce the rcu holding
> time.
>
> [1] https://lore.kernel.org/lkml/87sjhzun47.fsf@xmission.com/
> "

It doesn't make sense to me.

I don't see any sleeping functions called from find_mm_struct or
kernel_migrate_pages in the area of the code protected by get_task_struct.
So at a very basic level I see a justification for dirtying a cache line
twice with get_task_struct and put_task_struct to reduce rcu_read_lock
hold times.

I would contend that a reasonable cleanup based upon the current state
of the code would be to extend the rcu_read_lock over get_task_mm so
that a reference to task_struct does not need to be taken.  That has
the potential to reduce contention and reduce lock hold times.
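
As a rough sketch (not code from the thread, and assuming, per the
argument below, that ptrace_may_access(), security_task_movememory() and
cpuset_mems_allowed() are all safe to call under rcu_read_lock()), that
shape of find_mm_struct() might look something like:

static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
{
	struct task_struct *task;
	struct mm_struct *mm;

	rcu_read_lock();
	task = pid ? find_task_by_vpid(pid) : current;
	if (!task) {
		rcu_read_unlock();
		return ERR_PTR(-ESRCH);
	}

	/* No get_task_struct(): rcu keeps the task_struct from being freed. */
	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
		mm = ERR_PTR(-EPERM);
		goto out;
	}

	mm = ERR_PTR(security_task_movememory(task));
	if (IS_ERR(mm))
		goto out;

	*mem_nodes = cpuset_mems_allowed(task);
	mm = get_task_mm(task);		/* takes its own reference on the mm */
out:
	rcu_read_unlock();
	return mm;
}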


The code is missing a big fat comment with the assertion that it is ok
if the permission checks are racy because the race is small, and the
worst case thing that happens is the page is migrated to another
numa node.


Given that get_task_mm takes task_lock the cost of dirtying the
cache line is already being paid.  Perhaps not extending task_lock hold
times a little bit is justified, but I haven't seen that case made.

This seems like code that is called rarely enough that it would be better
for it to be correct, and not need big fat comments explaining why it
doesn't matter that the code is deliberately buggy.


In short it does not make sense to me to justify a patch for performance
reasons when it appears that extending the rcu_read_lock hold time and
not touching the task reference count would stop dirtying a cache line and
likely have more impact.

Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/4] mm: reduce the rcu lock duration
  2022-05-31 13:05     ` Matthew Wilcox
@ 2022-06-01  3:21       ` Miaohe Lin
  0 siblings, 0 replies; 20+ messages in thread
From: Miaohe Lin @ 2022-06-01  3:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, naoya.horiguchi, peterx, apopple, ying.huang, osalvador,
	mike.kravetz, songmuchun, hch, dhowells, cl, linux-mm,
	linux-kernel

On 2022/5/31 21:05, Matthew Wilcox wrote:
> On Tue, May 31, 2022 at 01:58:31PM +0100, Matthew Wilcox wrote:
>> On Mon, May 30, 2022 at 07:30:13PM +0800, Miaohe Lin wrote:
>>> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
>>> extends the period of the rcu_read_lock until after the permissions checks
>>> are done to prevent the task pointed to from changing from under us. But
>>> the task_struct refcount is also taken at that time, the reference to task
>>> is guaranteed to be stable. So it's unnecessary to extend the period of
>>> the rcu_read_lock. Release the rcu lock after task refcount is successfully
>>> grabbed to reduce the rcu holding time.
>>
>> But why bother?  You know the RCU read lock isn't a "real" lock, right?
> 
> Looking over this code some more, I think this may harm performance.
> ptrace_may_access() itself takes the rcu_read_lock().  So we currently
> have:
> 
> rcu_read_lock()
> rcu_read_lock();
> rcu_read_unlock();
> rcu_read_unlock();

More precisely, we currently have:

rcu_read_lock()
task_lock()
rcu_read_lock();
rcu_read_unlock();
task_unlock()
rcu_read_unlock();

> 
> In at least one RCU configuration, rcu_read_lock() maps to
> preempt_disable().  Nested preempt_disable() just bump a counter, while
> that counter reaching zero incurs some actual work.  So nested
> rcu_read_lock() can be better than sequential lock/unlock/lock/unlock.

In this case, I agree with you.
But when task_lock is heavily contended, it might take a long time. So in
that case, I think it's better to do the sequential rcu lock+unlock to avoid
a long rcu lock duration. Or am I missing something?

> 
> This needs far better justification.

Thanks!

> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/4] mm: reduce the rcu lock duration
  2022-05-31 16:09       ` Eric W. Biederman
@ 2022-06-01  6:33         ` Miaohe Lin
  2022-06-01 14:37           ` Eric W. Biederman
  0 siblings, 1 reply; 20+ messages in thread
From: Miaohe Lin @ 2022-06-01  6:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ying Huang, peterx, apopple, osalvador, mike.kravetz, songmuchun,
	hch, dhowells, cl, linux-mm, linux-kernel, akpm, naoya.horiguchi

On 2022/6/1 0:09, Eric W. Biederman wrote:
> Miaohe Lin <linmiaohe@huawei.com> writes:
snip
>>
>> "
>> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
>> extends the period of the rcu_read_lock until after the permissions checks
>> are done because it suspects the permissions checks are not safe unless
>> performed under both rcu_read_lock and task_lock to ensure the task<->mm
>> association does not change on us while we are working [1]. But extended
>> rcu read lock does not add much value. Because after permission checking
>> the permission may still be changed. There's no much difference. So it's
>> unnecessary to extend the period of the rcu_read_lock. Release the rcu
>> lock after task refcount is successfully grabbed to reduce the rcu holding
>> time.
>>
>> [1] https://lore.kernel.org/lkml/87sjhzun47.fsf@xmission.com/
>> "
> 
> It doesn't make sense to me.
> 
> I don't see any sleeping functions called from find_mm_struct or
> kernel_migrate_pages in the area kernel_migrate_pages in the area of the
> code protected by get_task_struct.  So at a very basic level I see a
> justification for dirtying a cache line twice with get_task_struct and
> put_task_struct to reduce rcu_read_lock hold times.
> 
> I would contend that a reasonable cleanup based up on the current state
> of the code would be to extend the rcu_read_lock over get_task_mm so

If so, security_task_movememory will be called inside the rcu lock. It might
call sleeping functions, e.g. smack_log(). I think that's not a good idea.

> that a reference to task_struct does not need to be taken.  That has
> the potential to reduce contention and reduce lock hold times.
> 
> 
> The code is missing a big fat comment with the assertion that it is ok
> if the permission checks are racy because the race is small, and the
> worst case thing that happens is the page is migrated to another
> numa node.
> 
> 
> Given that the get_mm_task takes task_lock the cost of dirtying the
> cache line is already being paid.  Perhaps not extending task_lock hold
> times a little bit is justified, but I haven't seen that case made.
> 
> This seems like code that is called little enough it would be better for
> it to be correct, and not need big fat comments explaining why it
> doesn't matter that they code is deliberately buggy.
> 

Agree. Big fat comments will make the code hard to follow.

> 
> In short it does not make sense to me to justify a patch for performance
> reasons when it appears that extending the rcu_read_lock hold time and
> not touch the task reference count would stop dirtying a cache line and
> likely have more impact.

IMHO, the incremented task refcount should make the code work correctly. And
extending the rcu_read_lock over get_task_mm will break things because
sleeping functions might be called while holding the rcu lock.

Does the patch itself make sense to you? Should I rephrase the commit log
further? I'm afraid I didn't get your point correctly.

> 
> Eric

Thanks!

> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/4] mm: reduce the rcu lock duration
  2022-06-01  6:33         ` Miaohe Lin
@ 2022-06-01 14:37           ` Eric W. Biederman
  2022-06-02  9:22             ` Miaohe Lin
  0 siblings, 1 reply; 20+ messages in thread
From: Eric W. Biederman @ 2022-06-01 14:37 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Ying Huang, peterx, apopple, osalvador, mike.kravetz, songmuchun,
	hch, dhowells, cl, linux-mm, linux-kernel, akpm, naoya.horiguchi

Miaohe Lin <linmiaohe@huawei.com> writes:

> On 2022/6/1 0:09, Eric W. Biederman wrote:
>> Miaohe Lin <linmiaohe@huawei.com> writes:
> snip
>>>
>>> "
>>> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
>>> extends the period of the rcu_read_lock until after the permissions checks
>>> are done because it suspects the permissions checks are not safe unless
>>> performed under both rcu_read_lock and task_lock to ensure the task<->mm
>>> association does not change on us while we are working [1]. But extended
>>> rcu read lock does not add much value. Because after permission checking
>>> the permission may still be changed. There's no much difference. So it's
>>> unnecessary to extend the period of the rcu_read_lock. Release the rcu
>>> lock after task refcount is successfully grabbed to reduce the rcu holding
>>> time.
>>>
>>> [1] https://lore.kernel.org/lkml/87sjhzun47.fsf@xmission.com/
>>> "
>> 
>> It doesn't make sense to me.
>> 
>> I don't see any sleeping functions called from find_mm_struct or
>> kernel_migrate_pages in the area kernel_migrate_pages in the area of the
>> code protected by get_task_struct.  So at a very basic level I see a
>> justification for dirtying a cache line twice with get_task_struct and
>> put_task_struct to reduce rcu_read_lock hold times.
>> 
>> I would contend that a reasonable cleanup based up on the current state
>> of the code would be to extend the rcu_read_lock over get_task_mm so
>
> If so, security_task_movememory will be called inside rcu lock. It might
> call sleeping functions, e.g. smack_log(). I think it's not a good
> idea.

In general the security functions are not allowed to sleep.
The audit mechanism typically preallocates memory so it does
not need to sleep.  From a quick skim through the code I don't
see smack_log sleeping.

Certainly the security hooks are not supposed to be inserted into the
code in a way that prevents reasonable implementation decisions.

Which is to say, if there is a good (non-security-hook) reason for
supporting sleeping, deal with it.  Otherwise the security hook has a
bug and needs to get fixed/removed.

>> that a reference to task_struct does not need to be taken.  That has
>> the potential to reduce contention and reduce lock hold times.
>> 
>> 
>> The code is missing a big fat comment with the assertion that it is ok
>> if the permission checks are racy because the race is small, and the
>> worst case thing that happens is the page is migrated to another
>> numa node.
>> 
>> 
>> Given that the get_mm_task takes task_lock the cost of dirtying the
>> cache line is already being paid.  Perhaps not extending task_lock hold
>> times a little bit is justified, but I haven't seen that case made.
>> 
>> This seems like code that is called little enough it would be better for
>> it to be correct, and not need big fat comments explaining why it
>> doesn't matter that they code is deliberately buggy.
>> 
>
> Agree. A big fat comments will make code hard to follow.

No.

The code is impossible to follow currently.

The code either requires a comment pointing out that it is deliberately
racy, or the code needs to be fixed.

Clever and subtle code always requires a comment, if for no other
reason than to alert the reader that something atypical is going on.

>> In short it does not make sense to me to justify a patch for performance
>> reasons when it appears that extending the rcu_read_lock hold time and
>> not touch the task reference count would stop dirtying a cache line and
>> likely have more impact.
>
> IMHO, incremented task refcount should make code works correctly. And extending
> the rcu_read_lock over get_task_mm will break the things because sleeping functions
> might be called while holding rcu lock.

Which sleeping functions?

I can't find any.  In particular smack_task_movememory calls
smk_curacc_on_task which is the same function called by
security_task_getpgid.  security_task_getpgid is called
under rcu_read_lock.  So smack won't sleep.
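
Roughly, from security/smack/smack_lsm.c, the two hooks funnel into the
same non-sleeping helper (the exact access-mode arguments here are from
memory, so treat them as approximate):

static int smack_task_movememory(struct task_struct *p)
{
	return smk_curacc_on_task(p, MAY_WRITE);
}

static int smack_task_getpgid(struct task_struct *p)
{
	return smk_curacc_on_task(p, MAY_READ);
}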

> Does the patch itself makes sense for you? Should I rephase the commit log further?
> I'm afraid I didn't get your point correctly.

The patch makes no sense to me because I don't see it doing anything
worth doing.

get/put_task_struct both dirty a cache line and are expensive, especially
when contended.  Dirtying a cache line that is contended is pretty much
the most expensive native cpu operation.  In pathological scenarios
I have seen it take up to 1s.  Realistically, in a cache cold scenario
(which is not as bad as a contended scenario) you are looking at 100ns
or more just to execute get_task_struct/put_task_struct.  That is the
kind of cost it would be nice to avoid altogether (even if the
pathological case never comes up).
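
Roughly what the two primitives boil down to (paraphrased from the
5.18-era headers, so treat the details as approximate):

/* include/linux/sched/task.h, roughly: */
static inline struct task_struct *get_task_struct(struct task_struct *t)
{
	refcount_inc(&t->usage);	/* atomic RMW on a field other CPUs
					 * may also be touching: the cache
					 * line can bounce between CPUs */
	return t;
}

/* whereas with !CONFIG_PREEMPT_RCU, rcu_read_lock() is essentially: */
static inline void rcu_read_lock_sketch(void)
{
	preempt_disable();		/* a local counter bump; no shared
					 * cache line is dirtied */
}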

So I see two pieces of code where we could use the cheap operation
rcu_read_lock/rcu_read_unlock to remove the expensive operation
get_task_struct/put_task_struct.  Instead I see people removing
rcu_read_lock/rcu_read_unlock.

That makes no sense, especially as your implied reason for making this
change is to make the code perform better.  Improving performance is the
reason for making critical sections smaller, isn't it?

Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/4] mm: reduce the rcu lock duration
  2022-06-01 14:37           ` Eric W. Biederman
@ 2022-06-02  9:22             ` Miaohe Lin
  2022-06-03 16:28               ` Eric W. Biederman
  0 siblings, 1 reply; 20+ messages in thread
From: Miaohe Lin @ 2022-06-02  9:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ying Huang, peterx, apopple, osalvador, mike.kravetz, songmuchun,
	hch, dhowells, cl, linux-mm, linux-kernel, akpm, naoya.horiguchi

On 2022/6/1 22:37, Eric W. Biederman wrote:
> Miaohe Lin <linmiaohe@huawei.com> writes:
> 
>> On 2022/6/1 0:09, Eric W. Biederman wrote:
>>> Miaohe Lin <linmiaohe@huawei.com> writes:
>> snip
>>>>
>>>> "
>>>> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
>>>> extends the period of the rcu_read_lock until after the permissions checks
>>>> are done because it suspects the permissions checks are not safe unless
>>>> performed under both rcu_read_lock and task_lock to ensure the task<->mm
>>>> association does not change on us while we are working [1]. But extended
>>>> rcu read lock does not add much value. Because after permission checking
>>>> the permission may still be changed. There's no much difference. So it's
>>>> unnecessary to extend the period of the rcu_read_lock. Release the rcu
>>>> lock after task refcount is successfully grabbed to reduce the rcu holding
>>>> time.
>>>>
>>>> [1] https://lore.kernel.org/lkml/87sjhzun47.fsf@xmission.com/
>>>> "
>>>
>>> It doesn't make sense to me.
>>>
>>> I don't see any sleeping functions called from find_mm_struct or
>>> kernel_migrate_pages in the area kernel_migrate_pages in the area of the
>>> code protected by get_task_struct.  So at a very basic level I see a
>>> justification for dirtying a cache line twice with get_task_struct and
>>> put_task_struct to reduce rcu_read_lock hold times.
>>>
>>> I would contend that a reasonable cleanup based up on the current state
>>> of the code would be to extend the rcu_read_lock over get_task_mm so
>>
>> If so, security_task_movememory will be called inside rcu lock. It might
>> call sleeping functions, e.g. smack_log(). I think it's not a good
>> idea.
> 
> In general the security functions are not allowed to sleep.
> The audit mechanism typically preallocates memory so it does
> not need to sleep.  From a quick skim through the code I don't
> see smack_log sleeping.
> 
> Certainly the security hooks are not supposed to be inserted into the
> code that they prevent reasonable implementation decisions.
> 
> Which is to say if there is a good (non-security hook reason) for
> supporting sleeping deal with it.  Otherwise the security hooks has a
> bug and needs to get fixed/removed.

I see. Many thanks for explanation.

> 
>>> that a reference to task_struct does not need to be taken.  That has
>>> the potential to reduce contention and reduce lock hold times.
>>>
>>>
>>> The code is missing a big fat comment with the assertion that it is ok
>>> if the permission checks are racy because the race is small, and the
>>> worst case thing that happens is the page is migrated to another
>>> numa node.
>>>
>>>
>>> Given that the get_mm_task takes task_lock the cost of dirtying the
>>> cache line is already being paid.  Perhaps not extending task_lock hold
>>> times a little bit is justified, but I haven't seen that case made.
>>>
>>> This seems like code that is called little enough it would be better for
>>> it to be correct, and not need big fat comments explaining why it
>>> doesn't matter that they code is deliberately buggy.
>>>
>>
>> Agree. A big fat comments will make code hard to follow.
> 
> No.
> 
> The code is impossible to follow currently.
> 
> The code either requires a comment pointing out that it is deliberately
> racy, or the code needs to be fixed.
> 
> Clever and subtle code always requires a comment if for no other
> reason then to alert the reader that something a typical is going on.

Yes, clever and subtle code requires a comment, but other code might not.

> 
>>> In short it does not make sense to me to justify a patch for performance
>>> reasons when it appears that extending the rcu_read_lock hold time and
>>> not touch the task reference count would stop dirtying a cache line and
>>> likely have more impact.
>>
>> IMHO, incremented task refcount should make code works correctly. And extending
>> the rcu_read_lock over get_task_mm will break the things because sleeping functions
>> might be called while holding rcu lock.
> 
> Which sleeping functions?
> 
> I can't find any.  In particular smack_task_movememory calls
> smk_curacc_on_task which is the same function called by
> security_task_getpgid.  security_task_getpgid is called
> under rcu_read_lock.  So smack won't sleep.

Sorry, I didn't take a close look at the smack_log code, so I thought it could sleep.

> 
>> Does the patch itself makes sense for you? Should I rephase the commit log further?
>> I'm afraid I didn't get your point correctly.
> 
> The patch makes no sense to me because I don't see it doing anything
> worth doing.
> 
> get/put_task_struct both dirty a cache line and are expensive especially
> when contended.  Dirtying a cache line that is contended is the pretty
> much the most expensive native cpu operation.  In pathological scenarios
> I have seen it take up to 1s.  Realistically in a cache cold scenario
> (which is not as bad as a contended scenario) you are looking at 100ns
> or more just to execute get_task_struct/put_task_struct.  That is the
> kind of cost it would be nice to avoid all together (even if the
> pathological case never comes up).
> 
> So I see two pieces of code where we could use the cheap operation
> rcu_read_lock/rcu_read_unlock to remove the expensive operation
> get_task_struct/put_task_struct.  Instead I see people removing
> rcu_read_lock/rcu_read_unlock.
> 
> That makes no sense.  Especially as your implied reason for making this
> change is to make the code have better performance.  Improving
> performance is the reason for making critical sections smaller isn't it?
> 

I think you're right. We should extend the rcu_read_lock over get_task_mm so we can
remove the expensive get_task_struct/put_task_struct operations and thus avoid the possible
cacheline penalty. But is the extended rcu lock enough to ensure the task reference
is stable when calling the security check functions and performing the cpuset node validation?
Or is this race just OK and can it be left alone with a comment?
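(To make the trade-off concrete, here is a rough sketch of the two shapes being
compared -- illustrative only, not the actual patch; the error paths and the
individual permission checks are elided:)

	/* Pattern A: pin the task with a refcount and drop rcu early. */
	rcu_read_lock();
	task = pid ? find_task_by_vpid(pid) : current;
	get_task_struct(task);		/* dirties task's cacheline */
	rcu_read_unlock();
	/* ... ptrace/cpuset/security checks ... */
	mm = get_task_mm(task);
	put_task_struct(task);		/* dirties it again */

	/* Pattern B: keep rcu_read_lock held instead; no task refcount needed. */
	rcu_read_lock();
	task = pid ? find_task_by_vpid(pid) : current;
	/* ... ptrace/cpuset/security checks (must not sleep under rcu) ... */
	mm = get_task_mm(task);		/* mmget() inside pins the mm */
	rcu_read_unlock();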

> Eric

Many thanks for your comment and suggestion!

> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/4] mm: reduce the rcu lock duration
  2022-06-02  9:22             ` Miaohe Lin
@ 2022-06-03 16:28               ` Eric W. Biederman
  2022-06-07  9:19                 ` Miaohe Lin
  0 siblings, 1 reply; 20+ messages in thread
From: Eric W. Biederman @ 2022-06-03 16:28 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Ying Huang, peterx, apopple, osalvador, mike.kravetz, songmuchun,
	hch, dhowells, cl, linux-mm, linux-kernel, akpm, naoya.horiguchi

Miaohe Lin <linmiaohe@huawei.com> writes:

> On 2022/6/1 22:37, Eric W. Biederman wrote:
>> Miaohe Lin <linmiaohe@huawei.com> writes:
>> 
>>> On 2022/6/1 0:09, Eric W. Biederman wrote:
>>>> Miaohe Lin <linmiaohe@huawei.com> writes:
>>> snip
>>>>>
>>>>> "
>>>>> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
>>>>> extends the period of the rcu_read_lock until after the permissions checks
>>>>> are done because it suspects the permissions checks are not safe unless
>>>>> performed under both rcu_read_lock and task_lock to ensure the task<->mm
>>>>> association does not change on us while we are working [1]. But the extended
>>>>> rcu read lock does not add much value, because even after the permission check
>>>>> the permission may still change. There's not much difference. So it's
>>>>> unnecessary to extend the period of the rcu_read_lock. Release the rcu
>>>>> lock after task refcount is successfully grabbed to reduce the rcu holding
>>>>> time.
>>>>>
>>>>> [1] https://lore.kernel.org/lkml/87sjhzun47.fsf@xmission.com/
>>>>> "
>>>>
>>>> It doesn't make sense to me.
>>>>
>>>> I don't see any sleeping functions called from find_mm_struct or
>>>> kernel_migrate_pages in the area of the
>>>> code protected by get_task_struct.  So at a very basic level I see a
>>>> justification for dirtying a cache line twice with get_task_struct and
>>>> put_task_struct to reduce rcu_read_lock hold times.
>>>>
>>>> I would contend that a reasonable cleanup based upon the current state
>>>> of the code would be to extend the rcu_read_lock over get_task_mm so
>>>
>>> If so, security_task_movememory will be called inside rcu lock. It might
>>> call sleeping functions, e.g. smack_log(). I think it's not a good
>>> idea.
>> 
>> In general the security functions are not allowed to sleep.
>> The audit mechanism typically preallocates memory so it does
>> not need to sleep.  From a quick skim through the code I don't
>> see smack_log sleeping.
>> 
>> Certainly the security hooks are not supposed to be inserted into the
>> code in such a way that they prevent reasonable implementation decisions.
>> 
>> Which is to say, if there is a good (non-security-hook) reason for
>> supporting sleeping, deal with it.  Otherwise the security hook has a
>> bug and needs to be fixed/removed.
>
> I see. Many thanks for explanation.
>
>> 
>>>> that a reference to task_struct does not need to be taken.  That has
>>>> the potential to reduce contention and reduce lock hold times.
>>>>
>>>>
>>>> The code is missing a big fat comment with the assertion that it is ok
>>>> if the permission checks are racy because the race is small, and the
>>>> worst case thing that happens is the page is migrated to another
>>>> numa node.
>>>>
>>>>
>>>> Given that get_task_mm takes task_lock, the cost of dirtying the
>>>> cache line is already being paid.  Perhaps not extending task_lock hold
>>>> times a little bit is justified, but I haven't seen that case made.
>>>>
>>>> This seems like code that is called little enough it would be better for
>>>> it to be correct, and not need big fat comments explaining why it
>>>> doesn't matter that the code is deliberately buggy.
>>>>
>>>
>>> Agree. A big fat comment will make the code hard to follow.
>> 
>> No.
>> 
>> The code is impossible to follow currently.
>> 
>> The code either requires a comment pointing out that it is deliberately
>> racy, or the code needs to be fixed.
>> 
>> Clever and subtle code always requires a comment if for no other
>> reason than to alert the reader that something atypical is going on.
>
> Yes, clever and subtle code requires a comment, but other code might not.
>
>> 
>>>> In short it does not make sense to me to justify a patch for performance
>>>> reasons when it appears that extending the rcu_read_lock hold time and
>>>> not touching the task reference count would stop dirtying a cache line and
>>>> likely have more impact.
>>>
>>> IMHO, the incremented task refcount should make the code work correctly. And extending
>>> the rcu_read_lock over get_task_mm will break things because sleeping functions
>>> might be called while holding the rcu lock.
>> 
>> Which sleeping functions?
>> 
>> I can't find any.  In particular smack_task_movememory calls
>> smk_curacc_on_task which is the same function called by
>> security_task_getpgid.  security_task_getpgid is called
>> under rcu_read_lock.  So smack won't sleep.
>
> Sorry, I didn't take a close look at smack_log code. So I thought it could sleep.
>
>> 
>>> Does the patch itself make sense to you? Should I rephrase the commit log further?
>>> I'm afraid I didn't get your point correctly.
>> 
>> The patch makes no sense to me because I don't see it doing anything
>> worth doing.
>> 
>> get/put_task_struct both dirty a cache line and are expensive, especially
>> when contended.  Dirtying a cache line that is contended is pretty
>> much the most expensive native cpu operation.  In pathological scenarios
>> I have seen it take up to 1s.  Realistically in a cache cold scenario
>> (which is not as bad as a contended scenario) you are looking at 100ns
>> or more just to execute get_task_struct/put_task_struct.  That is the
>> kind of cost it would be nice to avoid all together (even if the
>> pathological case never comes up).
>> 
>> So I see two pieces of code where we could use the cheap operation
>> rcu_read_lock/rcu_read_unlock to remove the expensive operation
>> get_task_struct/put_task_struct.  Instead I see people removing
>> rcu_read_lock/rcu_read_unlock.
>> 
>> That makes no sense.  Especially as your implied reason for making this
>> change is to make the code have better performance.  Improving
>> performance is the reason for making critical sections smaller isn't it?
>> 
>
> I think you're right. We should extend the rcu_read_lock over get_task_mm so we can
> remove the expensive get_task_struct/put_task_struct operations and thus avoid the possible
> cacheline penalty. But is the extended rcu lock enough to ensure the task reference
> is stable when calling the security check functions and performing the cpuset node validation?
> Or is this race just OK and can it be left alone with a comment?

Extending the rcu_read_lock is just about removing the expense of
get_task_struct/put_task_struct.  It can be handled separately from
other issues at play in this code.


The rcu_read_lock guarantees that task does not go away.  The
rcu_read_lock does not guarantee that task->mm does not change.
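(For reference, a simplified sketch of get_task_mm(), paraphrased from
kernel/fork.c with the kthread/io-worker details trimmed -- it is task_lock(),
not rcu, that makes reading task->mm and taking the mmget() reference atomic
with respect to exec/exit:)

	struct mm_struct *get_task_mm(struct task_struct *task)
	{
		struct mm_struct *mm;

		task_lock(task);
		mm = task->mm;
		if (mm && !(task->flags & PF_KTHREAD))
			mmget(mm);	/* take a reference while task->mm is stable */
		else
			mm = NULL;
		task_unlock(task);
		return mm;
	}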



A separate but related issue is whether the task_lock in get_task_mm
should be extended to cover the security checks so that everything checks
against the same mm.

Possibly the code could be refactored or reordered so that everything
is guaranteed to check against the same mm.


If the checks are not made to guarantee they are all checking against
the same mm, the code needs a comment to say that there is a tiny race.
The comment should say we don't care about the tiny race because
the worst that can happen is that a page is migrated to a different
numa node.  And so we don't care.


Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/4] mm: reduce the rcu lock duration
  2022-06-03 16:28               ` Eric W. Biederman
@ 2022-06-07  9:19                 ` Miaohe Lin
  2022-06-18  0:23                   ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Miaohe Lin @ 2022-06-07  9:19 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ying Huang, peterx, apopple, osalvador, mike.kravetz, songmuchun,
	hch, dhowells, cl, linux-mm, linux-kernel, akpm, naoya.horiguchi

On 2022/6/4 0:28, Eric W. Biederman wrote:
> Miaohe Lin <linmiaohe@huawei.com> writes:
> 
>> On 2022/6/1 22:37, Eric W. Biederman wrote:
>>> Miaohe Lin <linmiaohe@huawei.com> writes:
>>>
>>>> On 2022/6/1 0:09, Eric W. Biederman wrote:
>>>>> Miaohe Lin <linmiaohe@huawei.com> writes:
>>>> snip
>>>>>>
>>>>>> "
>>>>>> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
>>>>>> extends the period of the rcu_read_lock until after the permissions checks
>>>>>> are done because it suspects the permissions checks are not safe unless
>>>>>> performed under both rcu_read_lock and task_lock to ensure the task<->mm
>>>>>> association does not change on us while we are working [1]. But the extended
>>>>>> rcu read lock does not add much value, because even after the permission check
>>>>>> the permission may still change. There's not much difference. So it's
>>>>>> unnecessary to extend the period of the rcu_read_lock. Release the rcu
>>>>>> lock after task refcount is successfully grabbed to reduce the rcu holding
>>>>>> time.
>>>>>>
>>>>>> [1] https://lore.kernel.org/lkml/87sjhzun47.fsf@xmission.com/
>>>>>> "
>>>>>
>>>>> It doesn't make sense to me.
>>>>>
>>>>> I don't see any sleeping functions called from find_mm_struct or
>>>>> kernel_migrate_pages in the area of the
>>>>> code protected by get_task_struct.  So at a very basic level I see a
>>>>> justification for dirtying a cache line twice with get_task_struct and
>>>>> put_task_struct to reduce rcu_read_lock hold times.
>>>>>
>>>>> I would contend that a reasonable cleanup based upon the current state
>>>>> of the code would be to extend the rcu_read_lock over get_task_mm so
>>>>
>>>> If so, security_task_movememory will be called inside rcu lock. It might
>>>> call sleeping functions, e.g. smack_log(). I think it's not a good
>>>> idea.
>>>
>>> In general the security functions are not allowed to sleep.
>>> The audit mechanism typically preallocates memory so it does
>>> not need to sleep.  From a quick skim through the code I don't
>>> see smack_log sleeping.
>>>
>>> Certainly the security hooks are not supposed to be inserted into the
>>> code in such a way that they prevent reasonable implementation decisions.
>>>
>>> Which is to say, if there is a good (non-security-hook) reason for
>>> supporting sleeping, deal with it.  Otherwise the security hook has a
>>> bug and needs to be fixed/removed.
>>
>> I see. Many thanks for explanation.
>>
>>>
>>>>> that a reference to task_struct does not need to be taken.  That has
>>>>> the potential to reduce contention and reduce lock hold times.
>>>>>
>>>>>
>>>>> The code is missing a big fat comment with the assertion that it is ok
>>>>> if the permission checks are racy because the race is small, and the
>>>>> worst case thing that happens is the page is migrated to another
>>>>> numa node.
>>>>>
>>>>>
>>>>> Given that get_task_mm takes task_lock, the cost of dirtying the
>>>>> cache line is already being paid.  Perhaps not extending task_lock hold
>>>>> times a little bit is justified, but I haven't seen that case made.
>>>>>
>>>>> This seems like code that is called little enough it would be better for
>>>>> it to be correct, and not need big fat comments explaining why it
>>>>> doesn't matter that the code is deliberately buggy.
>>>>>
>>>>
>>>> Agree. A big fat comment will make the code hard to follow.
>>>
>>> No.
>>>
>>> The code is impossible to follow currently.
>>>
>>> The code either requires a comment pointing out that it is deliberately
>>> racy, or the code needs to be fixed.
>>>
>>> Clever and subtle code always requires a comment if for no other
>>> reason than to alert the reader that something atypical is going on.
>>
>> Yes, clever and subtle code requires a comment, but other code might not.
>>
>>>
>>>>> In short it does not make sense to me to justify a patch for performance
>>>>> reasons when it appears that extending the rcu_read_lock hold time and
>>>>> not touching the task reference count would stop dirtying a cache line and
>>>>> likely have more impact.
>>>>
>>>> IMHO, the incremented task refcount should make the code work correctly. And extending
>>>> the rcu_read_lock over get_task_mm will break things because sleeping functions
>>>> might be called while holding the rcu lock.
>>>
>>> Which sleeping functions?
>>>
>>> I can't find any.  In particular smack_task_movememory calls
>>> smk_curacc_on_task which is the same function called by
>>> security_task_getpgid.  security_task_getpgid is called
>>> under rcu_read_lock.  So smack won't sleep.
>>
>> Sorry, I didn't take a close look at smack_log code. So I thought it could sleep.
>>
>>>
>>>> Does the patch itself make sense to you? Should I rephrase the commit log further?
>>>> I'm afraid I didn't get your point correctly.
>>>
>>> The patch makes no sense to me because I don't see it doing anything
>>> worth doing.
>>>
>>> get/put_task_struct both dirty a cache line and are expensive, especially
>>> when contended.  Dirtying a cache line that is contended is pretty
>>> much the most expensive native cpu operation.  In pathological scenarios
>>> I have seen it take up to 1s.  Realistically in a cache cold scenario
>>> (which is not as bad as a contended scenario) you are looking at 100ns
>>> or more just to execute get_task_struct/put_task_struct.  That is the
>>> kind of cost it would be nice to avoid all together (even if the
>>> pathological case never comes up).
>>>
>>> So I see two pieces of code where we could use the cheap operation
>>> rcu_read_lock/rcu_read_unlock to remove the expensive operation
>>> get_task_struct/put_task_struct.  Instead I see people removing
>>> rcu_read_lock/rcu_read_unlock.
>>>
>>> That makes no sense.  Especially as your implied reason for making this
>>> change is to make the code have better performance.  Improving
>>> performance is the reason for making critical sections smaller isn't it?
>>>
>>
>> I think you're right. We should extend the rcu_read_lock over get_task_mm so we can
>> remove the expensive get_task_struct/put_task_struct operations and thus avoid the possible
>> cacheline penalty. But is the extended rcu lock enough to ensure the task reference
>> is stable when calling the security check functions and performing the cpuset node validation?
>> Or is this race just OK and can it be left alone with a comment?
> 
> Extending the rcu_read_lock is just about removing the expense of
> get_task_struct/put_task_struct.  It can be handled separately from
> other issues at play in this code.
> 
> 
> The rcu_read_lock guarantees that task does not go away.  The
> rcu_read_lock does not guarantee that task->mm does not change.
> 

It seems task_lock is needed in this case.

> 
> 
> A separate but related issue is whether the task_lock in get_task_mm
> should be extended to cover the security checks so that everything checks
> against the same mm.
> 
> Possibly the code could be refactored or reordered so that everything
> is guaranteed to check against the same mm.

IMHO, it might be overkill to do this.

> 
> 
> If the checks are not made to guarantee they are all checking against
> the same mm, the code needs a comment to say that there is a tiny race.
> The comment should say we don't care about the tiny race because
> the worst that can happen is that a page is migrated to a different
> numa node.  And so we don't care.
> 
> 

I tend to go with this one. To make sure, is the below code change what you suggest?

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3c64f3cdac4b..943f58eff1fc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1612,42 +1612,43 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
        rcu_read_lock();
        task = pid ? find_task_by_vpid(pid) : current;
        if (!task) {
-               rcu_read_unlock();
                err = -ESRCH;
-               goto out;
+               goto out_unlock;
        }
-       get_task_struct(task);
-       rcu_read_unlock();

        err = -EINVAL;

        /*
         * Check if this process has the right to modify the specified process.
         * Use the regular "ptrace_may_access()" checks.
+        * The below checks are not made to guarantee they are all checking
+        * against the same mm. But we don't care about such a tiny race because
+        * the worst that can happen is that a page is migrated to a different
+        * numa node. And so we don't care.
         */
        if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
                err = -EPERM;
-               goto out_put;
+               goto out_unlock;
        }

        task_nodes = cpuset_mems_allowed(task);
        /* Is the user allowed to access the target nodes? */
        if (!nodes_subset(*new, task_nodes) && !capable(CAP_SYS_NICE)) {
                err = -EPERM;
-               goto out_put;
+               goto out_unlock;
        }

        task_nodes = cpuset_mems_allowed(current);
        nodes_and(*new, *new, task_nodes);
        if (nodes_empty(*new))
-               goto out_put;
+               goto out_unlock;

        err = security_task_movememory(task);
        if (err)
-               goto out_put;
+               goto out_unlock;

        mm = get_task_mm(task);
-       put_task_struct(task);
+       rcu_read_unlock();

        if (!mm) {
                err = -EINVAL;
@@ -1663,8 +1664,8 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,

        return err;

-out_put:
-       put_task_struct(task);
+out_unlock:
+       rcu_read_unlock();
        goto out;

 }

> Eric

Many thanks for your valuable suggestion!

> .
> 


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/4] mm: reduce the rcu lock duration
  2022-06-07  9:19                 ` Miaohe Lin
@ 2022-06-18  0:23                   ` Andrew Morton
  2022-06-18  2:49                     ` Miaohe Lin
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2022-06-18  0:23 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Eric W. Biederman, Ying Huang, peterx, apopple, osalvador,
	mike.kravetz, songmuchun, hch, dhowells, cl, linux-mm,
	linux-kernel, naoya.horiguchi

On Tue, 7 Jun 2022 17:19:53 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:

> > 
> > 
> > If the checks are not made to guarantee they are all checking against
> > the same mm, the code needs a comment to say that there is a tiny race.
> > The comment should say we don't care about the tiny race because
> > the worst that can happen is that a page is migrated to a different
> > numa node.  And so we don't care.
> > 
> > 
> 
> I tend to go with this one. To make sure, is the below code change what you suggest?

Eric went quiet.

As I understand it, any changes arising from this discussion can be
done as a followup patch.  So I'm planning on moving this 4-patch
series into the non-rebasing mm-stable branch late next week.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/4] mm: reduce the rcu lock duration
  2022-06-18  0:23                   ` Andrew Morton
@ 2022-06-18  2:49                     ` Miaohe Lin
  0 siblings, 0 replies; 20+ messages in thread
From: Miaohe Lin @ 2022-06-18  2:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric W. Biederman, Ying Huang, peterx, apopple, osalvador,
	mike.kravetz, songmuchun, hch, dhowells, cl, linux-mm,
	linux-kernel, naoya.horiguchi

On 2022/6/18 8:23, Andrew Morton wrote:
> On Tue, 7 Jun 2022 17:19:53 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:
> 
>>>
>>>
>>> If the checks are not made to guarantee they are all checking against
>>> the same mm, the code needs a comment to say that there is a tiny race.
>>> The comment should say we don't care about the tiny race because
>>> the worst that can happen is that a page is migrated to a different
>>> numa node.  And so we don't care.
>>>
>>>
>>
>> I tend to go with this one. To make sure, is the below code change what you suggest?
> 
> Eric went quiet.
> 
> As I understand it, any changes arising from this discussion can be
> done as a followup patch.  So I'm planning on moving this 4-patch
> series into the non-rebasing mm-stable branch late next week.

I will send a followup patch when discussion is done. Thanks.

> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-06-18  2:49 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-30 11:30 [PATCH v4 0/4] A few cleanup and fixup patches for migration Miaohe Lin
2022-05-30 11:30 ` [PATCH v4 1/4] mm: reduce the rcu lock duration Miaohe Lin
2022-05-31  6:06   ` Ying Huang
2022-05-31  9:01     ` Miaohe Lin
2022-05-31 16:09       ` Eric W. Biederman
2022-06-01  6:33         ` Miaohe Lin
2022-06-01 14:37           ` Eric W. Biederman
2022-06-02  9:22             ` Miaohe Lin
2022-06-03 16:28               ` Eric W. Biederman
2022-06-07  9:19                 ` Miaohe Lin
2022-06-18  0:23                   ` Andrew Morton
2022-06-18  2:49                     ` Miaohe Lin
2022-05-31 12:50   ` David Hildenbrand
2022-05-31 12:58   ` Matthew Wilcox
2022-05-31 13:05     ` Matthew Wilcox
2022-06-01  3:21       ` Miaohe Lin
2022-05-30 11:30 ` [PATCH v4 2/4] mm/migration: remove unneeded lock page and PageMovable check Miaohe Lin
2022-05-31 12:50   ` David Hildenbrand
2022-05-30 11:30 ` [PATCH v4 3/4] mm/migration: return errno when isolate_huge_page failed Miaohe Lin
2022-05-30 11:30 ` [PATCH v4 4/4] mm/migration: fix potential pte_unmap on an not mapped pte Miaohe Lin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).