* [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages
@ 2019-09-04 13:53 Konstantin Khlebnikov
  2019-09-04 13:53 ` [PATCH v1 1/7] mm/memcontrol: move locking page out of mem_cgroup_move_account Konstantin Khlebnikov
                   ` (8 more replies)
  0 siblings, 9 replies; 12+ messages in thread
From: Konstantin Khlebnikov @ 2019-09-04 13:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel, cgroups
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner

Currently mlock keeps pages charged to the cgroup where they were
originally accounted. This way one container can affect another if
they share file cache. A typical case is writing (downloading) a file
in one container and then locking it in another. After that the first
container cannot get rid of the cache. A removed cgroup also stays
pinned by these mlocked pages.

This patchset implements recharging such pages to the cgroup of the
mlock user.

There are three cases:
* recharging at first mlock
* recharging at munlock to any remaining mlock
* recharging at 'culling' in reclaimer to any existing mlock

To keep things simple, recharging ignores the memory limit. After that,
memory usage could temporarily be higher than the limit, but the cgroup
will reclaim memory later or trigger oom, which is a valid outcome when
somebody mlocks too much.
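
For reference, a minimal userspace sketch of the sharing scenario above
(the file path is made up; the file is assumed to be already written by
another container, so its pages sit in that container's page cache):

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	int main(void)
	{
		const char *path = "/shared/cache/blob";  /* written elsewhere */
		struct stat st;
		void *p;
		int fd;

		fd = open(path, O_RDONLY);
		if (fd < 0 || fstat(fd, &st))
			return EXIT_FAILURE;

		p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return EXIT_FAILURE;

		/* First mlock: with this series the page cache pages are
		 * recharged to this task's memory cgroup. */
		if (mlock(p, st.st_size))
			perror("mlock");

		pause();	/* keep the pages locked (illustration only) */
		return EXIT_SUCCESS;
	}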

---

Konstantin Khlebnikov (7):
      mm/memcontrol: move locking page out of mem_cgroup_move_account
      mm/memcontrol: add mem_cgroup_recharge
      mm/mlock: add vma argument for mlock_vma_page()
      mm/mlock: recharge memory accounting to first mlock user
      mm/mlock: recharge memory accounting to second mlock user at munlock
      mm/vmscan: allow changing page memory cgroup during reclaim
      mm/mlock: recharge mlocked pages at culling by vmscan


 Documentation/admin-guide/cgroup-v1/memory.rst |    5 +
 include/linux/memcontrol.h                     |    9 ++
 include/linux/rmap.h                           |    3 -
 mm/gup.c                                       |    2 
 mm/huge_memory.c                               |    4 -
 mm/internal.h                                  |    6 +
 mm/ksm.c                                       |    2 
 mm/memcontrol.c                                |  104 ++++++++++++++++--------
 mm/migrate.c                                   |    2 
 mm/mlock.c                                     |   14 +++
 mm/rmap.c                                      |    5 +
 mm/vmscan.c                                    |   17 ++--
 12 files changed, 121 insertions(+), 52 deletions(-)

--
Signature


* [PATCH v1 1/7] mm/memcontrol: move locking page out of mem_cgroup_move_account
  2019-09-04 13:53 [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages Konstantin Khlebnikov
@ 2019-09-04 13:53 ` Konstantin Khlebnikov
  2019-09-04 13:53 ` [PATCH v1 2/7] mm/memcontrol: add mem_cgroup_recharge Konstantin Khlebnikov
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Konstantin Khlebnikov @ 2019-09-04 13:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel, cgroups
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner

Required for calling mem_cgroup_move_account() for an already locked page.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 mm/memcontrol.c |   64 +++++++++++++++++++++++++++----------------------------
 1 file changed, 31 insertions(+), 33 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9ec5e12486a7..40ddc233e973 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5135,7 +5135,8 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
  * @from: mem_cgroup which the page is moved from.
  * @to:	mem_cgroup which the page is moved to. @from != @to.
  *
- * The caller must make sure the page is not on LRU (isolate_page() is useful.)
+ * The caller must lock the page and make sure it is not on LRU
+ * (isolate_page() is useful.)
  *
  * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
  * from old cgroup.
@@ -5147,24 +5148,15 @@ static int mem_cgroup_move_account(struct page *page,
 {
 	unsigned long flags;
 	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
-	int ret;
 	bool anon;
 
 	VM_BUG_ON(from == to);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 	VM_BUG_ON(compound && !PageTransHuge(page));
 
-	/*
-	 * Prevent mem_cgroup_migrate() from looking at
-	 * page->mem_cgroup of its source page while we change it.
-	 */
-	ret = -EBUSY;
-	if (!trylock_page(page))
-		goto out;
-
-	ret = -EINVAL;
 	if (page->mem_cgroup != from)
-		goto out_unlock;
+		return -EINVAL;
 
 	anon = PageAnon(page);
 
@@ -5204,18 +5196,14 @@ static int mem_cgroup_move_account(struct page *page,
 	page->mem_cgroup = to;
 	spin_unlock_irqrestore(&from->move_lock, flags);
 
-	ret = 0;
-
 	local_irq_disable();
 	mem_cgroup_charge_statistics(to, page, compound, nr_pages);
 	memcg_check_events(to, page);
 	mem_cgroup_charge_statistics(from, page, compound, -nr_pages);
 	memcg_check_events(from, page);
 	local_irq_enable();
-out_unlock:
-	unlock_page(page);
-out:
-	return ret;
+
+	return 0;
 }
 
 /**
@@ -5535,36 +5523,42 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	struct vm_area_struct *vma = walk->vma;
 	pte_t *pte;
 	spinlock_t *ptl;
-	enum mc_target_type target_type;
 	union mc_target target;
 	struct page *page;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
+		bool device = false;
+
 		if (mc.precharge < HPAGE_PMD_NR) {
 			spin_unlock(ptl);
 			return 0;
 		}
-		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
-		if (target_type == MC_TARGET_PAGE) {
-			page = target.page;
-			if (!isolate_lru_page(page)) {
-				if (!mem_cgroup_move_account(page, true,
-							     mc.from, mc.to)) {
-					mc.precharge -= HPAGE_PMD_NR;
-					mc.moved_charge += HPAGE_PMD_NR;
-				}
-				putback_lru_page(page);
-			}
-			put_page(page);
-		} else if (target_type == MC_TARGET_DEVICE) {
+
+		switch (get_mctgt_type_thp(vma, addr, *pmd, &target)) {
+		case MC_TARGET_DEVICE:
+			device = true;
+			/* fall through */
+		case MC_TARGET_PAGE:
 			page = target.page;
+			if (!trylock_page(page))
+				goto put_huge;
+			if (!device && isolate_lru_page(page))
+				goto unlock_huge;
 			if (!mem_cgroup_move_account(page, true,
 						     mc.from, mc.to)) {
 				mc.precharge -= HPAGE_PMD_NR;
 				mc.moved_charge += HPAGE_PMD_NR;
 			}
+			if (!device)
+				putback_lru_page(page);
+unlock_huge:
+			unlock_page(page);
+put_huge:
 			put_page(page);
+			break;
+		default:
+			break;
 		}
 		spin_unlock(ptl);
 		return 0;
@@ -5596,8 +5590,10 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 			 */
 			if (PageTransCompound(page))
 				goto put;
-			if (!device && isolate_lru_page(page))
+			if (!trylock_page(page))
 				goto put;
+			if (!device && isolate_lru_page(page))
+				goto unlock;
 			if (!mem_cgroup_move_account(page, false,
 						mc.from, mc.to)) {
 				mc.precharge--;
@@ -5606,6 +5602,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 			}
 			if (!device)
 				putback_lru_page(page);
+unlock:
+			unlock_page(page);
 put:			/* get_mctgt_type() gets the page */
 			put_page(page);
 			break;



* [PATCH v1 2/7] mm/memcontrol: add mem_cgroup_recharge
  2019-09-04 13:53 [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages Konstantin Khlebnikov
  2019-09-04 13:53 ` [PATCH v1 1/7] mm/memcontrol: move locking page out of mem_cgroup_move_account Konstantin Khlebnikov
@ 2019-09-04 13:53 ` Konstantin Khlebnikov
  2019-09-04 13:53 ` [PATCH v1 3/7] mm/mlock: add vma argument for mlock_vma_page() Konstantin Khlebnikov
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Konstantin Khlebnikov @ 2019-09-04 13:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel, cgroups
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner

This function tries to move a page into another cgroup.
The caller must lock the page and isolate it from the LRU.
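
A sketch of the intended calling pattern (a fragment only, mirroring the
mlock_vma_page() hunk added later in this series; the page is assumed to
be already locked by the caller):

	if (!isolate_lru_page(page)) {
		/*
		 * Reclaim cannot be entered here (e.g. under pte lock),
		 * so the charge to the new memcg is forced and may
		 * temporarily exceed its limit.
		 */
		mem_cgroup_try_recharge(page, vma->vm_mm,
					GFP_NOWAIT | __GFP_NOFAIL);
		putback_lru_page(page);
	}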

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 include/linux/memcontrol.h |    9 +++++++++
 mm/memcontrol.c            |   40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2cd4359cb38c..d94950584f60 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -352,6 +352,8 @@ void mem_cgroup_uncharge(struct page *page);
 void mem_cgroup_uncharge_list(struct list_head *page_list);
 
 void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
+int mem_cgroup_try_recharge(struct page *page, struct mm_struct *mm,
+			    gfp_t gfp_mask);
 
 static struct mem_cgroup_per_node *
 mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
@@ -857,6 +859,13 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
 {
 }
 
+static inline int mem_cgroup_try_recharge(struct page *page,
+					  struct mm_struct *mm,
+					  gfp_t gfp_mask)
+{
+	return 0;
+}
+
 static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
 				struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 40ddc233e973..953a0bbb9f43 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6507,6 +6507,46 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 	local_irq_restore(flags);
 }
 
+/*
+ * mem_cgroup_try_recharge - try to recharge page to mm's memcg.
+ *
+ * Page must be locked and isolated.
+ */
+int mem_cgroup_try_recharge(struct page *page, struct mm_struct *mm,
+			    gfp_t gfp_mask)
+{
+	struct mem_cgroup *from, *to;
+	int nr_pages;
+	int err = 0;
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+
+	if (mem_cgroup_disabled())
+		return 0;
+
+	from = page->mem_cgroup;
+	to = get_mem_cgroup_from_mm(mm);
+
+	if (likely(from == to) || !from)
+		goto out;
+
+	nr_pages = hpage_nr_pages(page);
+	err = try_charge(to, gfp_mask, nr_pages);
+	if (err)
+		goto out;
+
+	err = mem_cgroup_move_account(page, nr_pages > 1, from, to);
+	if (err)
+		cancel_charge(to, nr_pages);
+	else
+		cancel_charge(from, nr_pages);
+out:
+	css_put(&to->css);
+
+	return err;
+}
+
 DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
 EXPORT_SYMBOL(memcg_sockets_enabled_key);
 



* [PATCH v1 3/7] mm/mlock: add vma argument for mlock_vma_page()
  2019-09-04 13:53 [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages Konstantin Khlebnikov
  2019-09-04 13:53 ` [PATCH v1 1/7] mm/memcontrol: move locking page out of mem_cgroup_move_account Konstantin Khlebnikov
  2019-09-04 13:53 ` [PATCH v1 2/7] mm/memcontrol: add mem_cgroup_recharge Konstantin Khlebnikov
@ 2019-09-04 13:53 ` Konstantin Khlebnikov
  2019-09-04 13:53 ` [PATCH v1 4/7] mm/mlock: recharge memory accounting to first mlock user Konstantin Khlebnikov
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Konstantin Khlebnikov @ 2019-09-04 13:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel, cgroups
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner

This will be used for recharging memory cgroup accounting.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 mm/gup.c         |    2 +-
 mm/huge_memory.c |    4 ++--
 mm/internal.h    |    4 ++--
 mm/ksm.c         |    2 +-
 mm/migrate.c     |    2 +-
 mm/mlock.c       |    2 +-
 mm/rmap.c        |    2 +-
 7 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37bac..f0accc229266 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -306,7 +306,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 			 * know the page is still mapped, we don't even
 			 * need to check for file-cache page truncation.
 			 */
-			mlock_vma_page(page);
+			mlock_vma_page(vma, page);
 			unlock_page(page);
 		}
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de1f15969e27..157faa231e26 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1513,7 +1513,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 			goto skip_mlock;
 		lru_add_drain();
 		if (page->mapping && !PageDoubleMap(page))
-			mlock_vma_page(page);
+			mlock_vma_page(vma, page);
 		unlock_page(page);
 	}
 skip_mlock:
@@ -3009,7 +3009,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 		page_add_file_rmap(new, true);
 	set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
 	if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new))
-		mlock_vma_page(new);
+		mlock_vma_page(vma, new);
 	update_mmu_cache_pmd(vma, address, pvmw->pmd);
 }
 #endif
diff --git a/mm/internal.h b/mm/internal.h
index e32390802fd3..9f91992ef281 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -305,7 +305,7 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
 /*
  * must be called with vma's mmap_sem held for read or write, and page locked.
  */
-extern void mlock_vma_page(struct page *page);
+extern void mlock_vma_page(struct vm_area_struct *vma, struct page *page);
 extern unsigned int munlock_vma_page(struct page *page);
 
 /*
@@ -364,7 +364,7 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 
 #else /* !CONFIG_MMU */
 static inline void clear_page_mlock(struct page *page) { }
-static inline void mlock_vma_page(struct page *page) { }
+static inline void mlock_vma_page(struct vm_area_struct *, struct page *) { }
 static inline void mlock_migrate_page(struct page *new, struct page *old) { }
 
 #endif /* !CONFIG_MMU */
diff --git a/mm/ksm.c b/mm/ksm.c
index 3dc4346411e4..cb5705d6f26c 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1274,7 +1274,7 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 		if (!PageMlocked(kpage)) {
 			unlock_page(page);
 			lock_page(kpage);
-			mlock_vma_page(kpage);
+			mlock_vma_page(vma, kpage);
 			page = kpage;		/* for final unlock */
 		}
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index a42858d8e00b..1f6151cb7310 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -269,7 +269,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 				page_add_file_rmap(new, false);
 		}
 		if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
-			mlock_vma_page(new);
+			mlock_vma_page(vma, new);
 
 		if (PageTransHuge(page) && PageMlocked(page))
 			clear_page_mlock(page);
diff --git a/mm/mlock.c b/mm/mlock.c
index a90099da4fb4..73d477aaa411 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -85,7 +85,7 @@ void clear_page_mlock(struct page *page)
  * Mark page as mlocked if not already.
  * If page on LRU, isolate and putback to move to unevictable list.
  */
-void mlock_vma_page(struct page *page)
+void mlock_vma_page(struct vm_area_struct *vma, struct page *page)
 {
 	/* Serialize with page migration */
 	BUG_ON(!PageLocked(page));
diff --git a/mm/rmap.c b/mm/rmap.c
index 003377e24232..de88f4897c1d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1410,7 +1410,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 					 * Holding pte lock, we do *not* need
 					 * mmap_sem here
 					 */
-					mlock_vma_page(page);
+					mlock_vma_page(vma, page);
 				}
 				ret = false;
 				page_vma_mapped_walk_done(&pvmw);



* [PATCH v1 4/7] mm/mlock: recharge memory accounting to first mlock user
  2019-09-04 13:53 [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages Konstantin Khlebnikov
                   ` (2 preceding siblings ...)
  2019-09-04 13:53 ` [PATCH v1 3/7] mm/mlock: add vma argument for mlock_vma_page() Konstantin Khlebnikov
@ 2019-09-04 13:53 ` Konstantin Khlebnikov
  2019-09-04 13:53 ` [PATCH v1 5/7] mm/mlock: recharge memory accounting to second mlock user at munlock Konstantin Khlebnikov
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Konstantin Khlebnikov @ 2019-09-04 13:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel, cgroups
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner

Currently mlock keeps pages charged to the cgroup where they were
originally accounted. This way one container can affect another if
they share file cache. A typical case is writing (downloading) a file
in one container and then locking it in another. After that the first
container cannot get rid of the file cache.

This patch recharges the accounting to the cgroup which owns the mm of
the mlocking vma. Recharging happens at the first mlock, when
PageMlocked is set.

Mlock moves pages onto the unevictable LRU under the pte lock, so at
this point we cannot call the reclaimer. To keep things simple, just
charge by force. After that, memory usage could temporarily be higher
than the limit, but the cgroup will reclaim memory later or trigger
oom, which is a valid outcome when somebody mlocks too much.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 Documentation/admin-guide/cgroup-v1/memory.rst |    5 +++++
 mm/mlock.c                                     |    9 ++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 41bdc038dad9..4c79e5a9153b 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -220,6 +220,11 @@ the cgroup that brought it in -- this will happen on memory pressure).
 But see section 8.2: when moving a task to another cgroup, its pages may
 be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.
 
+Locking pages in memory with mlock() or mmap(MAP_LOCKED) recharges them to
+the current memory cgroup. This recharge ignores the memory limit, so usage
+could temporarily exceed the limit. After that any allocation will reclaim
+memory down to the limit or trigger oom if the mlocked size does not fit.
+
 Exception: If CONFIG_MEMCG_SWAP is not used.
 When you do swapoff and make swapped-out pages of shmem(tmpfs) to
 be backed into memory in force, charges for pages are accounted against the
diff --git a/mm/mlock.c b/mm/mlock.c
index 73d477aaa411..68f068711203 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -97,8 +97,15 @@ void mlock_vma_page(struct vm_area_struct *vma, struct page *page)
 		mod_zone_page_state(page_zone(page), NR_MLOCK,
 				    hpage_nr_pages(page));
 		count_vm_event(UNEVICTABLE_PGMLOCKED);
-		if (!isolate_lru_page(page))
+		if (!isolate_lru_page(page)) {
+			/*
+			 * Force memory recharge to mlock user. Cannot
+			 * reclaim memory because called under pte lock.
+			 */
+			mem_cgroup_try_recharge(page, vma->vm_mm,
+						GFP_NOWAIT | __GFP_NOFAIL);
 			putback_lru_page(page);
+		}
 	}
 }
 



* [PATCH v1 5/7] mm/mlock: recharge memory accounting to second mlock user at munlock
  2019-09-04 13:53 [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages Konstantin Khlebnikov
                   ` (3 preceding siblings ...)
  2019-09-04 13:53 ` [PATCH v1 4/7] mm/mlock: recharge memory accounting to first mlock user Konstantin Khlebnikov
@ 2019-09-04 13:53 ` Konstantin Khlebnikov
  2019-09-04 13:53 ` [PATCH v1 6/7] mm/vmscan: allow changing page memory cgroup during reclaim Konstantin Khlebnikov
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Konstantin Khlebnikov @ 2019-09-04 13:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel, cgroups
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner

Munlock isolates the page from the LRU and then looks for another mlock
vma. Thus we can recharge the page to the second mlock user without
isolating it again.

This patch adds an 'isolated' argument to mlock_vma_page() and passes
this flag through try_to_unmap() as TTU_LRU_ISOLATED.
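
An illustrative userspace sketch of the handoff this patch implements
(the file name is made up and the file is assumed to hold at least 64KiB
of page cache; fork() only keeps the sketch short, in practice the two
lockers would live in different memory cgroups):

	#include <fcntl.h>
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 16 * 4096;
		int fd = open("/shared/cache/index", O_RDONLY);
		void *p;

		if (fd < 0)
			return EXIT_FAILURE;
		p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return EXIT_FAILURE;

		if (fork() == 0) {	/* second mlock user */
			mlock(p, len);
			sleep(60);	/* keeps the range locked */
			_exit(0);
		}

		mlock(p, len);		/* first mlock user gets charged */
		sleep(1);
		munlock(p, len);	/* pages stay mlocked by the child;
					 * with this patch they are recharged
					 * to the remaining mlock user */
		wait(NULL);
		return EXIT_SUCCESS;
	}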

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 include/linux/rmap.h |    3 ++-
 mm/gup.c             |    2 +-
 mm/huge_memory.c     |    4 ++--
 mm/internal.h        |    6 ++++--
 mm/ksm.c             |    2 +-
 mm/migrate.c         |    2 +-
 mm/mlock.c           |   21 ++++++++++++---------
 mm/rmap.c            |    5 +++--
 8 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 988d176472df..4552716ac3da 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -98,7 +98,8 @@ enum ttu_flags {
 					 * do a final flush if necessary */
 	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
 					 * caller holds it */
-	TTU_SPLIT_FREEZE	= 0x100,		/* freeze pte under splitting thp */
+	TTU_SPLIT_FREEZE	= 0x100, /* freeze pte under splitting thp */
+	TTU_LRU_ISOLATED	= 0x200, /* caller isolated page from LRU */
 };
 
 #ifdef CONFIG_MMU
diff --git a/mm/gup.c b/mm/gup.c
index f0accc229266..e0784e9022fe 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -306,7 +306,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 			 * know the page is still mapped, we don't even
 			 * need to check for file-cache page truncation.
 			 */
-			mlock_vma_page(vma, page);
+			mlock_vma_page(vma, page, false);
 			unlock_page(page);
 		}
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 157faa231e26..7822997d765c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1513,7 +1513,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 			goto skip_mlock;
 		lru_add_drain();
 		if (page->mapping && !PageDoubleMap(page))
-			mlock_vma_page(vma, page);
+			mlock_vma_page(vma, page, false);
 		unlock_page(page);
 	}
 skip_mlock:
@@ -3009,7 +3009,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 		page_add_file_rmap(new, true);
 	set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
 	if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new))
-		mlock_vma_page(vma, new);
+		mlock_vma_page(vma, new, false);
 	update_mmu_cache_pmd(vma, address, pvmw->pmd);
 }
 #endif
diff --git a/mm/internal.h b/mm/internal.h
index 9f91992ef281..1639fb581496 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -305,7 +305,8 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
 /*
  * must be called with vma's mmap_sem held for read or write, and page locked.
  */
-extern void mlock_vma_page(struct vm_area_struct *vma, struct page *page);
+extern void mlock_vma_page(struct vm_area_struct *vma, struct page *page,
+			   bool isolated);
 extern unsigned int munlock_vma_page(struct page *page);
 
 /*
@@ -364,7 +365,8 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 
 #else /* !CONFIG_MMU */
 static inline void clear_page_mlock(struct page *page) { }
-static inline void mlock_vma_page(struct vm_area_struct *, struct page *) { }
+static inline void mlock_vma_page(struct vm_area_struct *vma,
+				  struct page *page, bool isolated) { }
 static inline void mlock_migrate_page(struct page *new, struct page *old) { }
 
 #endif /* !CONFIG_MMU */
diff --git a/mm/ksm.c b/mm/ksm.c
index cb5705d6f26c..bf2a748c5e64 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1274,7 +1274,7 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 		if (!PageMlocked(kpage)) {
 			unlock_page(page);
 			lock_page(kpage);
-			mlock_vma_page(vma, kpage);
+			mlock_vma_page(vma, kpage, false);
 			page = kpage;		/* for final unlock */
 		}
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index 1f6151cb7310..c13256ddc063 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -269,7 +269,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 				page_add_file_rmap(new, false);
 		}
 		if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
-			mlock_vma_page(vma, new);
+			mlock_vma_page(vma, new, false);
 
 		if (PageTransHuge(page) && PageMlocked(page))
 			clear_page_mlock(page);
diff --git a/mm/mlock.c b/mm/mlock.c
index 68f068711203..07a2ab4d6a6c 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -85,7 +85,8 @@ void clear_page_mlock(struct page *page)
  * Mark page as mlocked if not already.
  * If page on LRU, isolate and putback to move to unevictable list.
  */
-void mlock_vma_page(struct vm_area_struct *vma, struct page *page)
+void mlock_vma_page(struct vm_area_struct *vma, struct page *page,
+		    bool isolated)
 {
 	/* Serialize with page migration */
 	BUG_ON(!PageLocked(page));
@@ -97,15 +98,17 @@ void mlock_vma_page(struct vm_area_struct *vma, struct page *page)
 		mod_zone_page_state(page_zone(page), NR_MLOCK,
 				    hpage_nr_pages(page));
 		count_vm_event(UNEVICTABLE_PGMLOCKED);
-		if (!isolate_lru_page(page)) {
-			/*
-			 * Force memory recharge to mlock user. Cannot
-			 * reclaim memory because called under pte lock.
-			 */
-			mem_cgroup_try_recharge(page, vma->vm_mm,
-						GFP_NOWAIT | __GFP_NOFAIL);
+
+		if (!isolated && isolate_lru_page(page))
+			return;
+		/*
+		 * Force memory recharge to mlock user.
+		 * Cannot reclaim memory because called under pte lock.
+		 */
+		mem_cgroup_try_recharge(page, vma->vm_mm,
+					GFP_NOWAIT | __GFP_NOFAIL);
+		if (!isolated)
 			putback_lru_page(page);
-		}
 	}
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index de88f4897c1d..0b21b27f3519 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1410,7 +1410,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 					 * Holding pte lock, we do *not* need
 					 * mmap_sem here
 					 */
-					mlock_vma_page(vma, page);
+					mlock_vma_page(vma, page,
+						!!(flags & TTU_LRU_ISOLATED));
 				}
 				ret = false;
 				page_vma_mapped_walk_done(&pvmw);
@@ -1752,7 +1753,7 @@ void try_to_munlock(struct page *page)
 {
 	struct rmap_walk_control rwc = {
 		.rmap_one = try_to_unmap_one,
-		.arg = (void *)TTU_MUNLOCK,
+		.arg = (void *)(TTU_MUNLOCK | TTU_LRU_ISOLATED),
 		.done = page_not_mapped,
 		.anon_lock = page_lock_anon_vma_read,
 



* [PATCH v1 6/7] mm/vmscan: allow changing page memory cgroup during reclaim
  2019-09-04 13:53 [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages Konstantin Khlebnikov
                   ` (4 preceding siblings ...)
  2019-09-04 13:53 ` [PATCH v1 5/7] mm/mlock: recharge memory accounting to second mlock user at munlock Konstantin Khlebnikov
@ 2019-09-04 13:53 ` Konstantin Khlebnikov
  2019-09-04 13:53 ` [PATCH v1 7/7] mm/mlock: recharge mlocked pages at culling by vmscan Konstantin Khlebnikov
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Konstantin Khlebnikov @ 2019-09-04 13:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel, cgroups
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner

All LRU lists in one NUMA node are protected by one spin-lock, and
right now move_pages_to_lru() re-evaluates the lruvec for each page.
This allows changing a page's memory cgroup while the page is isolated
by the reclaimer, but nobody uses that for now. This patch makes the
feature explicit and passes a pgdat rather than a lruvec pointer into
move_pages_to_lru().

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 mm/vmscan.c |   14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a6c5d0b28321..bf7a05e8a717 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1873,15 +1873,15 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
  * The downside is that we have to touch page->_refcount against each page.
  * But we had to alter page->flags anyway.
  *
- * Returns the number of pages moved to the given lruvec.
+ * Returns the number of pages moved to LRU lists.
  */
 
-static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
+static unsigned noinline_for_stack move_pages_to_lru(struct pglist_data *pgdat,
 						     struct list_head *list)
 {
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
+	struct lruvec *lruvec;
 	struct page *page;
 	enum lru_list lru;
 
@@ -1895,6 +1895,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			spin_lock_irq(&pgdat->lru_lock);
 			continue;
 		}
+
+		/* Re-evaluate lru: isolated page could be moved */
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		SetPageLRU(page);
@@ -2005,7 +2007,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
 	reclaim_stat->recent_rotated[1] += stat.nr_activate[1];
 
-	move_pages_to_lru(lruvec, &page_list);
+	move_pages_to_lru(pgdat, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 
@@ -2128,8 +2130,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	 */
 	reclaim_stat->recent_rotated[file] += nr_rotated;
 
-	nr_activate = move_pages_to_lru(lruvec, &l_active);
-	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
+	nr_activate = move_pages_to_lru(pgdat, &l_active);
+	nr_deactivate = move_pages_to_lru(pgdat, &l_inactive);
 	/* Keep all free pages in l_active list */
 	list_splice(&l_inactive, &l_active);
 



* [PATCH v1 7/7] mm/mlock: recharge mlocked pages at culling by vmscan
  2019-09-04 13:53 [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages Konstantin Khlebnikov
                   ` (5 preceding siblings ...)
  2019-09-04 13:53 ` [PATCH v1 6/7] mm/vmscan: allow changing page memory cgroup during reclaim Konstantin Khlebnikov
@ 2019-09-04 13:53 ` Konstantin Khlebnikov
  2019-09-04 14:37 ` [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages Michal Hocko
  2019-09-04 23:13 ` Roman Gushchin
  8 siblings, 0 replies; 12+ messages in thread
From: Konstantin Khlebnikov @ 2019-09-04 13:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel, cgroups
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner

If mlock cannot catch a page on the LRU then it isn't moved onto the
unevictable LRU right away. Such pages are 'culled' by the reclaimer
and moved onto the unevictable LRU later. It seems pages locked with
MLOCK_ONFAULT always go through this path.

The reclaimer calls try_to_unmap() for already isolated pages, thus on
this path we can freely recharge a page to the owner of any mlock vma
found in the rmap.

This patch passes the flag TTU_LRU_ISOLATED into try_to_unmap() so that
mlock_vma_page() can recharge such pages.
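
An illustrative sketch of the MLOCK_ONFAULT case mentioned above (the
file name is made up; mlock2() needs glibc 2.27+, otherwise
syscall(__NR_mlock2, ...) can be used instead):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 64 * 4096;
		int fd = open("/shared/cache/index", O_RDONLY);
		volatile char sum = 0;
		size_t off;
		char *p;

		if (fd < 0)
			return EXIT_FAILURE;
		p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return EXIT_FAILURE;

		/* Lock pages as they are faulted in. */
		if (mlock2(p, len, MLOCK_ONFAULT))
			return EXIT_FAILURE;

		/* Fault the pages in under the lock; they typically reach
		 * the unevictable list only when vmscan culls them, i.e.
		 * the path patched here. */
		for (off = 0; off < len; off += 4096)
			sum += p[off];

		pause();	/* keep the pages locked (illustration only) */
		return EXIT_SUCCESS;
	}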

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 mm/vmscan.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bf7a05e8a717..2060f254dd6b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1345,7 +1345,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page)) {
-			enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH;
+			enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH |
+					       TTU_LRU_ISOLATED;
 
 			if (unlikely(PageTransHuge(page)))
 				flags |= TTU_SPLIT_HUGE_PMD;



* Re: [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages
  2019-09-04 13:53 [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages Konstantin Khlebnikov
                   ` (6 preceding siblings ...)
  2019-09-04 13:53 ` [PATCH v1 7/7] mm/mlock: recharge mlocked pages at culling by vmscan Konstantin Khlebnikov
@ 2019-09-04 14:37 ` Michal Hocko
  2019-09-05  7:38   ` Konstantin Khlebnikov
  2019-09-04 23:13 ` Roman Gushchin
  8 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2019-09-04 14:37 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, cgroups, Roman Gushchin, Johannes Weiner

On Wed 04-09-19 16:53:08, Konstantin Khlebnikov wrote:
> Currently mlock keeps pages in cgroups where they were accounted.
> This way one container could affect another if they share file cache.
> Typical case is writing (downloading) file in one container and then
> locking in another. After that first container cannot get rid of cache.
> Also removed cgroup stays pinned by these mlocked pages.
> 
> This patchset implements recharging pages to cgroup of mlock user.
> 
> There are three cases:
> * recharging at first mlock
> * recharging at munlock to any remaining mlock
> * recharging at 'culling' in reclaimer to any existing mlock
> 
> To keep things simple recharging ignores memory limit. After that memory
> usage temporary could be higher than limit but cgroup will reclaim memory
> later or trigger oom, which is valid outcome when somebody mlock too much.

I assume that this is mlock specific because the pagecache which has the
same problem is reclaimable and the problem tends to resolve itself
after some time.

Anyway, how big of a problem is this really? A lingering memcg is
certainly not nice but pages are usually not mlocked forever. Or is
this a way to protect from a hostile actor?

> Konstantin Khlebnikov (7):
>       mm/memcontrol: move locking page out of mem_cgroup_move_account
>       mm/memcontrol: add mem_cgroup_recharge
>       mm/mlock: add vma argument for mlock_vma_page()
>       mm/mlock: recharge memory accounting to first mlock user
>       mm/mlock: recharge memory accounting to second mlock user at munlock
>       mm/vmscan: allow changing page memory cgroup during reclaim
>       mm/mlock: recharge mlocked pages at culling by vmscan
> 
> 
>  Documentation/admin-guide/cgroup-v1/memory.rst |    5 +
>  include/linux/memcontrol.h                     |    9 ++
>  include/linux/rmap.h                           |    3 -
>  mm/gup.c                                       |    2 
>  mm/huge_memory.c                               |    4 -
>  mm/internal.h                                  |    6 +
>  mm/ksm.c                                       |    2 
>  mm/memcontrol.c                                |  104 ++++++++++++++++--------
>  mm/migrate.c                                   |    2 
>  mm/mlock.c                                     |   14 +++
>  mm/rmap.c                                      |    5 +
>  mm/vmscan.c                                    |   17 ++--
>  12 files changed, 121 insertions(+), 52 deletions(-)
> 
> --
> Signature

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages
  2019-09-04 13:53 [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages Konstantin Khlebnikov
                   ` (7 preceding siblings ...)
  2019-09-04 14:37 ` [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages Michal Hocko
@ 2019-09-04 23:13 ` Roman Gushchin
  8 siblings, 0 replies; 12+ messages in thread
From: Roman Gushchin @ 2019-09-04 23:13 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, cgroups, Michal Hocko, Johannes Weiner

On Wed, Sep 04, 2019 at 04:53:08PM +0300, Konstantin Khlebnikov wrote:
> Currently mlock keeps pages in cgroups where they were accounted.
> This way one container could affect another if they share file cache.
> Typical case is writing (downloading) file in one container and then
> locking in another. After that first container cannot get rid of cache.

Yeah, it's a valid problem, and it's not about mlocked pages only,
the same thing is true for generic pagecache. The only difference is that
in theory memory pressure should fix everything. But in reality
pagecache used by the second container can be very hot, so the first
one can't really get rid of it.
In other words, there is no way to pass a pagecache page between cgroups
without evicting it and re-reading from a storage, which is sub-optimal
in many cases.

We thought about a new madvise(), which would uncharge the pagecache but
set a new page flag meaning something like "whoever first starts using
the page should be charged for it". But it never materialized in a patchset.

> Also removed cgroup stays pinned by these mlocked pages.

Tbh, I don't think it's a big issue here, unless there is a huge number
of 1-page-sized mlock areas, but this seems unlikely.

> 
> This patchset implements recharging pages to cgroup of mlock user.
> 
> There are three cases:
> * recharging at first mlock
> * recharging at munlock to any remaining mlock
> * recharging at 'culling' in reclaimer to any existing mlock
> 
> To keep things simple recharging ignores memory limit. After that memory
> usage temporary could be higher than limit but cgroup will reclaim memory
> later or trigger oom, which is valid outcome when somebody mlock too much.

OOM is a concern here. If quitting an application causes an immediate OOM
in another cgroup, that's not so good. Ideally it should work like
memory.high, forcing all threads in the second cgroup into direct reclaim.

Thanks!


* Re: [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages
  2019-09-04 14:37 ` [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages Michal Hocko
@ 2019-09-05  7:38   ` Konstantin Khlebnikov
  2019-09-05 23:11     ` Roman Gushchin
  0 siblings, 1 reply; 12+ messages in thread
From: Konstantin Khlebnikov @ 2019-09-05  7:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, cgroups, Roman Gushchin, Johannes Weiner

On 04/09/2019 17.37, Michal Hocko wrote:
> On Wed 04-09-19 16:53:08, Konstantin Khlebnikov wrote:
>> Currently mlock keeps pages in cgroups where they were accounted.
>> This way one container could affect another if they share file cache.
>> Typical case is writing (downloading) file in one container and then
>> locking in another. After that first container cannot get rid of cache.
>> Also removed cgroup stays pinned by these mlocked pages.
>>
>> This patchset implements recharging pages to cgroup of mlock user.
>>
>> There are three cases:
>> * recharging at first mlock
>> * recharging at munlock to any remaining mlock
>> * recharging at 'culling' in reclaimer to any existing mlock
>>
>> To keep things simple recharging ignores memory limit. After that memory
>> usage temporary could be higher than limit but cgroup will reclaim memory
>> later or trigger oom, which is valid outcome when somebody mlock too much.
> 
> I assume that this is mlock specific because the pagecache which has the
> same problem is reclaimable and the problem tends to resolve itself
> after some time.
> 
> Anyway, how big of a problem this really is? A lingering memcg is
> certainly not nice but pages are usually not mlocked for ever. Or is
> this a way to protect from an hostile actor?

We're using mlock mostly to avoid non-deterministic cache behaviour.
For example, some of our applications mlock index structures of databases
to limit the count of major faults in the worst case.

Surprisingly, mlock also fixates the unwanted effects of unpredictable
cache sharing.

So it seems to make sense to make mlock behaviour simple and completely
deterministic, because this isn't a cheap operation and needs careful
resource planning.



On 05/09/2019 02.13, Roman Gushchin wrote:
 > On Wed, Sep 04, 2019 at 04:53:08PM +0300, Konstantin Khlebnikov wrote:
 >> Currently mlock keeps pages in cgroups where they were accounted.
 >> This way one container could affect another if they share file cache.
 >> Typical case is writing (downloading) file in one container and then
 >> locking in another. After that first container cannot get rid of cache.
 >
 > Yeah, it's a valid problem, and it's not about mlocked pages only,
 > the same thing is true for generic pagecache. The only difference is that
 > in theory memory pressure should fix everything. But in reality
 > pagecache used by the second container can be very hot, so the first
 > once can't really get rid of it.
 > In other words, there is no way to pass a pagecache page between cgroups
 > without evicting it and re-reading from a storage, which is sub-optimal
 > in many cases.
 >
 > We thought about new madvise(), which will uncharge pagecache but set
 > a new page flag, which will mean something like "whoever first starts using
 > the page, should be charged for it". But it never materialized in a patchset.

I've implemented something similar in the OpenVZ kernel - "shadow" LRU sets
for abandoned cache which automatically change ownership at first activation.

I'm thinking about an fadvise() or fcntl() for moving cache into the current
memory cgroup. This should give enough control to solve all our problems.

 >
 >> Also removed cgroup stays pinned by these mlocked pages.
 >
 > Tbh, I don't think it's a big issue here. If only there is a huge number
 > of 1-page sized mlock areas, but this seems to be unlikely.

Yep, not such a big problem; tmpfs generates many more issues in this area.

 >
 >>
 >> This patchset implements recharging pages to cgroup of mlock user.
 >>
 >> There are three cases:
 >> * recharging at first mlock
 >> * recharging at munlock to any remaining mlock
 >> * recharging at 'culling' in reclaimer to any existing mlock
 >>
 >> To keep things simple recharging ignores memory limit. After that memory
 >> usage temporary could be higher than limit but cgroup will reclaim memory
 >> later or trigger oom, which is valid outcome when somebody mlock too much.
 >
 > OOM is a concern here. If quitting an application will cause an immediate OOM
 > in an other cgroup, that's not so good. Ideally it should work like
 > memory.high, forcing all threads in the second cgroup into direct reclaim.
 >

Mlock requires careful resource planning. Since sharing has always been
non-deterministic, each user should be ready to take on all locked memory.

It's hard to inject direct reclaim into another thread for sure.
All we could do is start background reclaim in a kernel thread.
Doing this in the task which calls munlock() is not fair.

At mlock it's possible to force direct reclaim for memory usage over the
high limit:
https://lore.kernel.org/linux-mm/156431697805.3170.6377599347542228221.stgit@buzz/

> 
>> Konstantin Khlebnikov (7):
>>        mm/memcontrol: move locking page out of mem_cgroup_move_account
>>        mm/memcontrol: add mem_cgroup_recharge
>>        mm/mlock: add vma argument for mlock_vma_page()
>>        mm/mlock: recharge memory accounting to first mlock user
>>        mm/mlock: recharge memory accounting to second mlock user at munlock
>>        mm/vmscan: allow changing page memory cgroup during reclaim
>>        mm/mlock: recharge mlocked pages at culling by vmscan
>>
>>
>>   Documentation/admin-guide/cgroup-v1/memory.rst |    5 +
>>   include/linux/memcontrol.h                     |    9 ++
>>   include/linux/rmap.h                           |    3 -
>>   mm/gup.c                                       |    2
>>   mm/huge_memory.c                               |    4 -
>>   mm/internal.h                                  |    6 +
>>   mm/ksm.c                                       |    2
>>   mm/memcontrol.c                                |  104 ++++++++++++++++--------
>>   mm/migrate.c                                   |    2
>>   mm/mlock.c                                     |   14 +++
>>   mm/rmap.c                                      |    5 +
>>   mm/vmscan.c                                    |   17 ++--
>>   12 files changed, 121 insertions(+), 52 deletions(-)
>>
>> --
>> Signature
> 


* Re: [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages
  2019-09-05  7:38   ` Konstantin Khlebnikov
@ 2019-09-05 23:11     ` Roman Gushchin
  0 siblings, 0 replies; 12+ messages in thread
From: Roman Gushchin @ 2019-09-05 23:11 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Michal Hocko, linux-mm, linux-kernel, cgroups, Johannes Weiner

On Thu, Sep 05, 2019 at 10:38:16AM +0300, Konstantin Khlebnikov wrote:
> On 04/09/2019 17.37, Michal Hocko wrote:
> > On Wed 04-09-19 16:53:08, Konstantin Khlebnikov wrote:
> > > Currently mlock keeps pages in cgroups where they were accounted.
> > > This way one container could affect another if they share file cache.
> > > Typical case is writing (downloading) file in one container and then
> > > locking in another. After that first container cannot get rid of cache.
> > > Also removed cgroup stays pinned by these mlocked pages.
> > > 
> > > This patchset implements recharging pages to cgroup of mlock user.
> > > 
> > > There are three cases:
> > > * recharging at first mlock
> > > * recharging at munlock to any remaining mlock
> > > * recharging at 'culling' in reclaimer to any existing mlock
> > > 
> > > To keep things simple recharging ignores memory limit. After that memory
> > > usage temporary could be higher than limit but cgroup will reclaim memory
> > > later or trigger oom, which is valid outcome when somebody mlock too much.
> > 
> > I assume that this is mlock specific because the pagecache which has the
> > same problem is reclaimable and the problem tends to resolve itself
> > after some time.
> > 
> > Anyway, how big of a problem this really is? A lingering memcg is
> > certainly not nice but pages are usually not mlocked for ever. Or is
> > this a way to protect from an hostile actor?
> 
> We're using mlock mostly to avoid non-deterministic behaviour in cache.
> For example some of our applications mlock index structures in databases
> to limit count of major faults in worst case.
> 
> Surprisingly mlock fixates unwanted effects of non-predictable cache sharing.
> 
> So, it seems makes sense to make mlock behaviour simple and completely
> deterministic because this isn't cheap operation and needs careful
> resource planning.
>

Totally agree.

> 
> 
> On 05/09/2019 02.13, Roman Gushchin wrote:> On Wed, Sep 04, 2019 at 04:53:08PM +0300, Konstantin Khlebnikov wrote:
> >> Currently mlock keeps pages in cgroups where they were accounted.
> >> This way one container could affect another if they share file cache.
> >> Typical case is writing (downloading) file in one container and then
> >> locking in another. After that first container cannot get rid of cache.
> >
> > Yeah, it's a valid problem, and it's not about mlocked pages only,
> > the same thing is true for generic pagecache. The only difference is that
> > in theory memory pressure should fix everything. But in reality
> > pagecache used by the second container can be very hot, so the first
> > once can't really get rid of it.
> > In other words, there is no way to pass a pagecache page between cgroups
> > without evicting it and re-reading from a storage, which is sub-optimal
> > in many cases.
> >
> > We thought about new madvise(), which will uncharge pagecache but set
> > a new page flag, which will mean something like "whoever first starts using
> > the page, should be charged for it". But it never materialized in a patchset.
> 
> I've implemented something similar in OpenVZ kernel - "shadow" LRU sets for
> abandoned cache which automatically changes ownership at first activation.
> 
> I'm thinking about fadvise() or fcntl() for moving cache into current memory cgroup.
> This should give enough control to solve all our problems.

Idk, it feels a bit fragile: because only one cgroup can own a page, there
should be strong coordination, otherwise cgroups may just spend a non-trivial
amount of cpu time stealing pages back and forth.

I'm not strictly against such an fadvise() though, it can be useful in certain
cases. But in general the abandoning semantics makes more sense to me.
If a cgroup doesn't plan to use the page anymore, it marks it in a special way,
so that the next one who wants to use it pays the whole price. So it works
exactly as if the page had been evicted, but more efficiently.

Thanks!

