* [PATCH 1/3] fork/exec: cleanup mm initialization
@ 2014-06-19 9:07 Vladimir Davydov
From: Vladimir Davydov @ 2014-06-19 9:07 UTC (permalink / raw)
To: akpm; +Cc: oleg, rientjes, cl, linux-mm, linux-kernel
mm initialization on fork/exec is spread all over the place, which makes
the code look inconsistent.
We have mm_init(), which is supposed to init/nullify mm's internals, but
it doesn't init all the fields it should:
- on fork, ->mmap, ->mm_rb, ->vmacache_seqnum, ->map_count,
mm_cpumask and ->locked_vm are zeroed in dup_mmap();
- on fork, ->pmd_huge_pte is zeroed in dup_mm(), immediately before
calling mm_init();
- the ->cpu_vm_mask_var pointer is initialized by mm_init_cpumask(),
which is called before mm_init() on both fork and exec;
- ->context is initialized by init_new_context(), which is called
after mm_init() on both fork and exec.
Let's consolidate all the initializations in mm_init() to make the code
look cleaner.
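The failure handling this consolidation enables follows the classic goto-unwind pattern: each allocation step gets a matching fail_* label, so a later failure frees exactly what earlier steps allocated. A minimal userspace sketch, with stand-in types and helpers rather than the kernel API (free_mm() is omitted here because the mm is caller-owned in this sketch):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for struct mm_struct; not the kernel definition. */
struct mm {
	void *pgd;
	int context_ready;
};

static int mm_alloc_pgd(struct mm *mm)
{
	mm->pgd = malloc(64);
	return mm->pgd ? 0 : -1;
}

static void mm_free_pgd(struct mm *mm)
{
	free(mm->pgd);
	mm->pgd = NULL;
}

/* force_ctx_fail lets callers simulate init_new_context() failing;
 * it is a knob for this sketch, not a kernel parameter. */
static int init_new_context(struct mm *mm, int force_ctx_fail)
{
	if (force_ctx_fail)
		return -1;
	mm->context_ready = 1;
	return 0;
}

struct mm *mm_init(struct mm *mm, int force_ctx_fail)
{
	memset(mm, 0, sizeof(*mm));	/* nullify all fields up front */

	if (mm_alloc_pgd(mm))
		goto fail_nopgd;
	if (init_new_context(mm, force_ctx_fail))
		goto fail_nocontext;
	return mm;

fail_nocontext:
	mm_free_pgd(mm);	/* undo only what already succeeded */
fail_nopgd:
	return NULL;
}
```

The point of the pattern is that the unwind labels run in reverse order of the setup steps, so every partial-success state has exactly one cleanup path.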
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
fs/exec.c | 4 ----
include/linux/mm_types.h | 1 +
kernel/fork.c | 47 ++++++++++++++++++++--------------------------
3 files changed, 21 insertions(+), 31 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index a3d33fe592d6..2ef2751f5a8d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -368,10 +368,6 @@ static int bprm_mm_init(struct linux_binprm *bprm)
if (!mm)
goto err;
- err = init_new_context(current, mm);
- if (err)
- goto err;
-
err = __bprm_mm_init(bprm);
if (err)
goto err;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 96c5750e3110..21bff4be4379 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -461,6 +461,7 @@ static inline void mm_init_cpumask(struct mm_struct *mm)
#ifdef CONFIG_CPUMASK_OFFSTACK
mm->cpu_vm_mask_var = &mm->cpumask_allocation;
#endif
+ cpumask_clear(mm->cpu_vm_mask_var);
}
/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/kernel/fork.c b/kernel/fork.c
index d2799d1fc952..01f0d0c56cb9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -365,12 +365,6 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
*/
down_write_nested(&mm->mmap_sem, SINGLE_DEPTH_NESTING);
- mm->locked_vm = 0;
- mm->mmap = NULL;
- mm->vmacache_seqnum = 0;
- mm->map_count = 0;
- cpumask_clear(mm_cpumask(mm));
- mm->mm_rb = RB_ROOT;
rb_link = &mm->mm_rb.rb_node;
rb_parent = NULL;
pprev = &mm->mmap;
@@ -529,17 +523,27 @@ static void mm_init_aio(struct mm_struct *mm)
static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
{
+ mm->mmap = NULL;
+ mm->mm_rb = RB_ROOT;
+ mm->vmacache_seqnum = 0;
atomic_set(&mm->mm_users, 1);
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
INIT_LIST_HEAD(&mm->mmlist);
mm->core_state = NULL;
atomic_long_set(&mm->nr_ptes, 0);
+ mm->map_count = 0;
+ mm->locked_vm = 0;
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
+ mm_init_cpumask(mm);
mm_init_aio(mm);
mm_init_owner(mm, p);
+ mmu_notifier_mm_init(mm);
clear_tlb_flush_pending(mm);
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
+ mm->pmd_huge_pte = NULL;
+#endif
if (current->mm) {
mm->flags = current->mm->flags & MMF_INIT_MASK;
@@ -549,11 +553,17 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
mm->def_flags = 0;
}
- if (likely(!mm_alloc_pgd(mm))) {
- mmu_notifier_mm_init(mm);
- return mm;
- }
+ if (mm_alloc_pgd(mm))
+ goto fail_nopgd;
+
+ if (init_new_context(p, mm))
+ goto fail_nocontext;
+ return mm;
+
+fail_nocontext:
+ mm_free_pgd(mm);
+fail_nopgd:
free_mm(mm);
return NULL;
}
@@ -587,7 +597,6 @@ struct mm_struct *mm_alloc(void)
return NULL;
memset(mm, 0, sizeof(*mm));
- mm_init_cpumask(mm);
return mm_init(mm, current);
}
@@ -819,17 +828,10 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
goto fail_nomem;
memcpy(mm, oldmm, sizeof(*mm));
- mm_init_cpumask(mm);
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
- mm->pmd_huge_pte = NULL;
-#endif
if (!mm_init(mm, tsk))
goto fail_nomem;
- if (init_new_context(tsk, mm))
- goto fail_nocontext;
-
dup_mm_exe_file(oldmm, mm);
err = dup_mmap(mm, oldmm);
@@ -851,15 +853,6 @@ free_pt:
fail_nomem:
return NULL;
-
-fail_nocontext:
- /*
- * If init_new_context() failed, we cannot use mmput() to free the mm
- * because it calls destroy_context()
- */
- mm_free_pgd(mm);
- free_mm(mm);
- return NULL;
}
static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
--
1.7.10.4
* [PATCH 2/3] fork: reset mm->pinned_vm
From: Vladimir Davydov @ 2014-06-19 9:07 UTC (permalink / raw)
To: akpm; +Cc: oleg, rientjes, cl, linux-mm, linux-kernel
mm->pinned_vm counts pages of the mm's address space that were
permanently pinned in memory by raising their reference count. The
counter was introduced by commit bc3e53f682d9 ("mm: distinguish
between mlocked and pinned pages"); before that, locked_vm had been
used for such pages.
Obviously, we should reset the counter on fork if !CLONE_VM, just like
we do with locked_vm, but currently we don't. Let's fix it.
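A sketch of the failure mode being fixed: dup_mm() starts from a raw memcpy() of the parent's mm, so any counter that mm_init() does not explicitly reset is silently inherited by the child. The field names below mirror the kernel's, but the structs and helpers are stand-ins, not kernel API:

```c
#include <assert.h>
#include <string.h>

struct mm {
	unsigned long locked_vm;	/* mlock()ed pages */
	unsigned long pinned_vm;	/* pinned (get_user_pages-style) pages */
};

/* reset_pinned toggles the one-line fix from this patch */
static void mm_init(struct mm *mm, int reset_pinned)
{
	mm->locked_vm = 0;
	if (reset_pinned)
		mm->pinned_vm = 0;
}

/* Child's pinned_vm after a simulated fork without CLONE_VM. */
unsigned long child_pinned_vm(const struct mm *parent, int with_fix)
{
	struct mm child;

	memcpy(&child, parent, sizeof(child));	/* as in dup_mm() */
	mm_init(&child, with_fix);
	return child.pinned_vm;
}
```

Without the reset, the child starts life claiming the parent's pinned pages; with it, the counter starts at zero like locked_vm.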
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
kernel/fork.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/fork.c b/kernel/fork.c
index 01f0d0c56cb9..938707e108db 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -534,6 +534,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
atomic_long_set(&mm->nr_ptes, 0);
mm->map_count = 0;
mm->locked_vm = 0;
+ mm->pinned_vm = 0;
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
mm_init_cpumask(mm);
--
1.7.10.4
* [PATCH 3/3] fork: copy mm's vm usage counters under mmap_sem
From: Vladimir Davydov @ 2014-06-19 9:07 UTC (permalink / raw)
To: akpm; +Cc: oleg, rientjes, cl, linux-mm, linux-kernel
If a forking process has a thread concurrently calling mmap() or
munmap() (silly, but still possible), the child process may end up
with some of its mm's vm usage counters (total_vm and friends)
corrupted, because currently they are copied from oldmm without
holding any locks (via the memcpy() in dup_mm()).
This patch moves the counters' initialization to dup_mmap(), which
runs under oldmm->mmap_sem, eliminating any possibility of such a
race.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
kernel/fork.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/kernel/fork.c b/kernel/fork.c
index 938707e108db..5002b1188554 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -365,6 +365,11 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
*/
down_write_nested(&mm->mmap_sem, SINGLE_DEPTH_NESTING);
+ mm->total_vm = oldmm->total_vm;
+ mm->shared_vm = oldmm->shared_vm;
+ mm->exec_vm = oldmm->exec_vm;
+ mm->stack_vm = oldmm->stack_vm;
+
rb_link = &mm->mm_rb.rb_node;
rb_parent = NULL;
pprev = &mm->mmap;
--
1.7.10.4
* Re: [PATCH 1/3] fork/exec: cleanup mm initialization
From: Oleg Nesterov @ 2014-06-19 18:15 UTC (permalink / raw)
To: Vladimir Davydov, Andi Kleen; +Cc: akpm, rientjes, cl, linux-mm, linux-kernel
On 06/19, Vladimir Davydov wrote:
>
> mm initialization on fork/exec is spread all over the place, which makes
> the code look inconsistent.
>
> We have mm_init(), which is supposed to init/nullify mm's internals, but
> it doesn't init all the fields it should:
>
> - on fork ->mmap,mm_rb,vmacache_seqnum,map_count,mm_cpumask,locked_vm
> are zeroed in dup_mmap();
>
> - on fork ->pmd_huge_pte is zeroed in dup_mm(), immediately before
> calling mm_init();
>
> - ->cpu_vm_mask_var ptr is initialized by mm_init_cpumask(), which is
> called before mm_init() on both fork and exec;
>
> - ->context is initialized by init_new_context(), which is called after
> mm_init() on both fork and exec;
>
> Let's consolidate all the initializations in mm_init() to make the code
> look cleaner.
Yes, agreed. Afaics, the patch is fine (2 and 3 too).
This is off-topic, but why init_new_context() copies ldt even on exec?
* Re: [PATCH 2/3] fork: reset mm->pinned_vm
From: Andrew Morton @ 2014-06-19 20:58 UTC (permalink / raw)
To: Vladimir Davydov; +Cc: oleg, rientjes, cl, linux-mm, linux-kernel
On Thu, 19 Jun 2014 13:07:47 +0400 Vladimir Davydov <vdavydov@parallels.com> wrote:
> mm->pinned_vm counts pages of mm's address space that were permanently
> pinned in memory by increasing their reference counter. The counter was
> introduced by commit bc3e53f682d9 ("mm: distinguish between mlocked and
> pinned pages"), while before it locked_vm had been used for such pages.
>
> Obviously, we should reset the counter on fork if !CLONE_VM, just like
> we do with locked_vm, but currently we don't. Let's fix it.
>
> ...
>
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -534,6 +534,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
> atomic_long_set(&mm->nr_ptes, 0);
> mm->map_count = 0;
> mm->locked_vm = 0;
> + mm->pinned_vm = 0;
> memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
> spin_lock_init(&mm->page_table_lock);
> mm_init_cpumask(mm);
What are the runtime effects of this? I think it is only
"/proc/pid/status:VmPin is screwed up", because we don't use
pinned_vm in rlimit checks. Yes?
* Re: [PATCH 2/3] fork: reset mm->pinned_vm
From: Vladimir Davydov @ 2014-06-20 7:38 UTC (permalink / raw)
To: Andrew Morton; +Cc: oleg, rientjes, cl, linux-mm, linux-kernel
On Thu, Jun 19, 2014 at 01:58:20PM -0700, Andrew Morton wrote:
> On Thu, 19 Jun 2014 13:07:47 +0400 Vladimir Davydov <vdavydov@parallels.com> wrote:
>
> > ...
>
> What are the runtime effects of this? I think it is only
> "/proc/pid/status:VmPin is screwed up", because we don't use vm_pinned
> in rlimit checks. Yes?
Hmm, ib_umem_get() [infiniband] and perf_mmap() still check pinned_vm
against RLIMIT_MEMLOCK. That's a leftover from the times when pinned
pages were accounted under locked_vm, but today it looks wrong. It
isn't clear to me how we should deal with it.
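The check pattern being referred to roughly looks like the following sketch. These are userspace stand-ins, not the kernel sources: memlock_rlimit_bytes is a hypothetical parameter standing in for rlimit(RLIMIT_MEMLOCK), and the 4K PAGE_SHIFT is an assumption:

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assumed 4K pages */

struct mm {
	unsigned long pinned_vm;
};

/* Reject a new pin when pinned_vm would exceed the MEMLOCK rlimit
 * (expressed in pages). These pages are pinned, not mlock()ed, which
 * is exactly the inconsistency the reply above points out. */
int try_account_pinned(struct mm *mm, unsigned long npages,
		       unsigned long memlock_rlimit_bytes)
{
	unsigned long lock_limit = memlock_rlimit_bytes >> PAGE_SHIFT;

	if (mm->pinned_vm + npages > lock_limit)
		return -1;	/* would exceed RLIMIT_MEMLOCK */
	mm->pinned_vm += npages;
	return 0;
}
```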
And BTW, we still have some drivers accounting pinned pages under
mm->locked_vm - this is what commit bc3e53f682d9 was fighting against.
It's infiniband/usnic and vfio.
Thanks.