* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler. [not found] ` <20140708170355.GA2469@gmail.com> @ 2015-10-11 19:03 ` David Woodhouse 2015-10-12 17:41 ` Jerome Glisse 2015-11-20 15:45 ` David Woodhouse 0 siblings, 2 replies; 7+ messages in thread From: David Woodhouse @ 2015-10-11 19:03 UTC (permalink / raw) To: Jerome Glisse, joro Cc: peterz, SCheung, linux-mm, ldunning, hpa, aarcange, jakumar, mgorman, jweiner, sgutti, riel, Bridgman, John, jhubbard, mhairgrove, cabuschardt, dpoole, Cornwall, Jay, Lewycky, Andrew, linux-kernel, iommu, arvindg, Deucher, Alexander, akpm, torvalds [-- Attachment #1: Type: text/plain, Size: 3029 bytes --] On Tue, 2014-07-08 at 13:03 -0400, Jerome Glisse wrote: > > Now regarding the device side, if we were to clean up inside the file release > callback then we would be broken in the face of fork. Imagine the following: > - process A opens the device file and mirrors its address space (hmm or kfd) > through a device file. > - process A forks itself (child is B) while having the device file open. > - process A quits > > Now the file release will not be called until child B exits, which might take > forever. Thus we would be leaking memory. As we already pointed out, we cannot > free the resources from the mmu_notifier ->release callback. So if your library just registers a pthread_atfork() handler to close the file descriptor in the child, that problem goes away? Like any other "if we continue to hold file descriptors open when we should close them, resources don't get freed" problem? Perhaps we could even handle that automatically in the kernel, with something like an O_CLOFORK flag on the file descriptor. Although that's a little horrid. You've argued that the amdkfd code is special and not just a device driver. I'm not going to contradict you there, but now we *are* going to see device drivers doing this kind of thing. 
And we definitely *don't* want to be calling back into device driver code from the mmu_notifier's .release() function. I think amdkfd absolutely is *not* a useful precedent for normal device drivers, and we *don't* want to follow this model in the general case. As we try to put together a generic API for device access to processes' address space, I definitely think we want to stick with the model that we take a reference on the mm, and we *keep* it until the device driver unbinds from the mm (because its file descriptor is closed, or whatever). Perhaps you can keep a back door into the AMD IOMMU code to continue to do what you're doing, or perhaps the explicit management of off-cpu tasks that is being posited will give you a sane cleanup path that *doesn't* involve the IOMMU's mmu_notifier calling back to you. But either way, I *really* don't want this to be the way it works for device drivers. > One hacky way to do it would be to schedule some delayed job from the > ->release callback, but then again we would have no way to properly > synchronize ourselves with other mm destruction code, i.e. the delayed job > could run concurrently with other mm destruction code and interfere > badly. With the RCU-based free of the struct containing the notifier, your 'schedule some delayed job' is basically what we have now, isn't it? I note that you also have your *own* notifier which does other things, and has to cope with being called before or after the IOMMU's notifier. Seriously, we don't want device drivers having to do this. We really need to keep it simple. -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5691 bytes --] ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler. 2015-10-11 19:03 ` [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler David Woodhouse @ 2015-10-12 17:41 ` Jerome Glisse 2015-11-20 15:45 ` David Woodhouse 1 sibling, 0 replies; 7+ messages in thread From: Jerome Glisse @ 2015-10-12 17:41 UTC (permalink / raw) To: David Woodhouse Cc: joro, peterz, SCheung, linux-mm, ldunning, hpa, aarcange, jakumar, mgorman, jweiner, sgutti, riel, Bridgman, John, jhubbard, mhairgrove, cabuschardt, dpoole, Cornwall, Jay, Lewycky, Andrew, linux-kernel, iommu, arvindg, Deucher, Alexander, akpm, torvalds Note that I am no longer actively pushing this patch series, but I believe the solution it provides is needed in one form or another, so I still think discussion on this is useful; see below for my answer. On Sun, Oct 11, 2015 at 08:03:29PM +0100, David Woodhouse wrote: > On Tue, 2014-07-08 at 13:03 -0400, Jerome Glisse wrote: > > > > Now regarding the device side, if we were to clean up inside the file release > > callback then we would be broken in the face of fork. Imagine the following: > > - process A opens the device file and mirrors its address space (hmm or kfd) > > through a device file. > > - process A forks itself (child is B) while having the device file open. > > - process A quits > > > > Now the file release will not be called until child B exits, which might take > > forever. Thus we would be leaking memory. As we already pointed out, we cannot > > free the resources from the mmu_notifier ->release callback. > > So if your library just registers a pthread_atfork() handler to close > the file descriptor in the child, that problem goes away? Like any > other "if we continue to hold file descriptors open when we should > close them, resources don't get freed" problem? I was just pointing out the existing device driver usage pattern where user space opens a device file and does ioctls on it without necessarily caring about the mm struct. 
The new use case, where the device actually runs threads against a specific process mm, is different and requires proper synchronization, as the file lifetime differs from the process lifetime in many cases when fork is involved. > > Perhaps we could even handle that automatically in the kernel, with > something like an O_CLOFORK flag on the file descriptor. Although > that's a little horrid. > > You've argued that the amdkfd code is special and not just a device > driver. I'm not going to contradict you there, but now we *are* going > to see device drivers doing this kind of thing. And we definitely > *don't* want to be calling back into device driver code from the > mmu_notifier's .release() function. Well, that's the current solution: call back into the device driver from the mmu_notifier release() callback. Since the changes to mmu_notifier, this is a workable solution (thanks to mmu_notifier_unregister_no_release()). > > I think amdkfd absolutely is *not* a useful precedent for normal device > drivers, and we *don't* want to follow this model in the general case. > > As we try to put together a generic API for device access to processes' > address space, I definitely think we want to stick with the model that > we take a reference on the mm, and we *keep* it until the device driver > unbinds from the mm (because its file descriptor is closed, or > whatever). Well, I think that when a process is killed (for whatever reason) we do want to kill all device threads at the same time. For instance, we do not want to have zombie GPU threads that can keep running indefinitely. This is why the mmu_notifier.release() callback is kind of the right place, as it will be called once all threads using a given mm are killed. The exit_mm() or do_exit() functions are also places from which we could add a callback to let the device know that it must kill all threads related to a given mm. 
> > Perhaps you can keep a back door into the AMD IOMMU code to continue to > do what you're doing, or perhaps the explicit management of off-cpu > tasks that is being posited will give you a sane cleanup path that > *doesn't* involve the IOMMU's mmu_notifier calling back to you. But > either way, I *really* don't want this to be the way it works for > device drivers. So at kernel summit we are supposedly going to have a discussion about device threads and scheduling, and I think device thread lifetime belongs in that discussion too. My opinion is that device threads must be killed when a process quits. > > One hacky way to do it would be to schedule some delayed job from the > > ->release callback, but then again we would have no way to properly > > synchronize ourselves with other mm destruction code, i.e. the delayed job > > could run concurrently with other mm destruction code and interfere > > badly. > > With the RCU-based free of the struct containing the notifier, your > 'schedule some delayed job' is basically what we have now, isn't it? > > I note that you also have your *own* notifier which does other things, > and has to cope with being called before or after the IOMMU's notifier. > > Seriously, we don't want device drivers having to do this. We really > need to keep it simple. So right now in HMM what happens is that the device driver gets a callback as a result of mmu_notifier.release() and must then schedule, through whatever means (a workqueue or a kernel thread), a call to the unregister function. Basically I am hiding the issue inside the device driver until we can agree on a common proper way to do this. Cheers, Jérôme ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler. 2015-10-11 19:03 ` [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler David Woodhouse 2015-10-12 17:41 ` Jerome Glisse @ 2015-11-20 15:45 ` David Woodhouse 1 sibling, 0 replies; 7+ messages in thread From: David Woodhouse @ 2015-11-20 15:45 UTC (permalink / raw) To: Jerome Glisse, joro, Tang, CQ Cc: peterz, linux-kernel, linux-mm, hpa, aarcange, jakumar, ldunning, mgorman, jweiner, sgutti, riel, Bridgman, John, jhubbard, mhairgrove, cabuschardt, dpoole, Cornwall, Jay, Lewycky, Andrew, SCheung, iommu, arvindg, Deucher, Alexander, akpm, torvalds [-- Attachment #1: Type: text/plain, Size: 1530 bytes --] On Sun, 2015-10-11 at 20:03 +0100, David Woodhouse wrote: > As we try to put together a generic API for device access to processes' > address space, I definitely think we want to stick with the model that > we take a reference on the mm, and we *keep* it until the device driver > unbinds from the mm (because its file descriptor is closed, or > whatever). I've found another problem with this. In some use cases, we mmap() the device file descriptor which is responsible for the PASID binding. And in that case we end up with a recursive refcount. When the process exits, its file descriptors are closed... but the underlying struct file remains open because it's still referenced from the mmap'd VMA. That VMA remains alive because it's still part of the MM. And the MM remains alive because the PASID binding still holds a refcount on it, because the device's struct file didn't get closed yet... because it's still mmap'd... because the MM is still alive... So I suspect that even for the relatively simple case where the lifetime of the PASID can be bound to a file descriptor (unlike with amdkfd), we probably still want to explicitly manage its lifetime as an 'off-cpu task' and explicitly kill it when the process dies. I'm still not keen on doing that implicitly through the mm_release. 
I think that way lies a lot of subtle bugs. -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5691 bytes --] ^ permalink raw reply [flat|nested] 7+ messages in thread
* mm preparatory patches for HMM and IOMMUv2 @ 2014-06-28 2:00 Jérôme Glisse 2014-06-28 2:00 ` [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler Jérôme Glisse 0 siblings, 1 reply; 7+ messages in thread From: Jérôme Glisse @ 2014-06-28 2:00 UTC (permalink / raw) To: akpm, linux-mm, linux-kernel Cc: mgorman, hpa, peterz, aarcange, riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar, Subhash Gutti, Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan, John Hubbard, Sherry Cheung, Duncan Poole, Oded Gabbay, Alexander Deucher, Andrew Lewycky Andrew, so here is a set of mm patches that make some groundwork modifications to core mm code. They apply on top of today's linux-next and they pass checkpatch.pl with flying colors (except patch 4, but I did not want to be a nazi about the 80 char line limit). Patch 1 is the mmput notifier call chain we discussed with AMD. Patches 2, 3 and 4 are so far only useful to HMM, but I am discussing with AMD and I believe they will be useful to them too (in the context of IOMMUv2). Patch 2 allows differentiating a page unmap done for vmscan reasons from one done for poisoning. Patch 3 associates each mmu_notifier call with an event type, allowing different code paths inside the mmu_notifier callback depending on what is currently happening to the cpu page table. There is no functional change; it just adds a new argument to the various mmu_notifier calls and callbacks. Patch 4 passes along the vma in which the range invalidation is happening. There are a few functional changes in places where mmu_notifier_range_invalidate_start/end used [0, -1] as the range; now those places call the notifier once for each vma. This might prove to add unwanted overhead, hence why I did it as a separate patch. I did not include the core hmm patch, but I intend to send a v4 next week, so I really would like to see those included for the next release. As usual, comments welcome. Cheers, Jérôme Glisse ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler. 2014-06-28 2:00 mm preparatory patches for HMM and IOMMUv2 Jérôme Glisse @ 2014-06-28 2:00 ` Jérôme Glisse 2014-06-30 3:49 ` John Hubbard 2014-06-30 15:37 ` Joerg Roedel 0 siblings, 2 replies; 7+ messages in thread From: Jérôme Glisse @ 2014-06-28 2:00 UTC (permalink / raw) To: akpm, linux-mm, linux-kernel Cc: mgorman, hpa, peterz, aarcange, riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar, Subhash Gutti, Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan, John Hubbard, Sherry Cheung, Duncan Poole, Oded Gabbay, Alexander Deucher, Andrew Lewycky, Jérôme Glisse From: Jérôme Glisse <jglisse@redhat.com> Several subsystem require a callback when a mm struct is being destroy so that they can cleanup there respective per mm struct. Instead of having each subsystem add its callback to mmput use a notifier chain to call each of the subsystem. This will allow new subsystem to register callback even if they are module. There should be no contention on the rw semaphore protecting the call chain and the impact on the code path should be low and burried in the noise. Note that this patch also move the call to cleanup functions after exit_mmap so that new call back can assume that mmu_notifier_release have already been call. This does not impact existing cleanup functions as they do not rely on anything that exit_mmap is freeing. Also moved khugepaged_exit to exit_mmap so that ordering is preserved for that function. 
Signed-off-by: Jérôme Glisse <jglisse@redhat.com> --- fs/aio.c | 29 ++++++++++++++++++++++------- include/linux/aio.h | 2 -- include/linux/ksm.h | 11 ----------- include/linux/sched.h | 5 +++++ include/linux/uprobes.h | 1 - kernel/events/uprobes.c | 19 ++++++++++++++++--- kernel/fork.c | 22 ++++++++++++++++++---- mm/ksm.c | 26 +++++++++++++++++++++----- mm/mmap.c | 3 +++ 9 files changed, 85 insertions(+), 33 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index c1d8c48..1d06e92 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -40,6 +40,7 @@ #include <linux/ramfs.h> #include <linux/percpu-refcount.h> #include <linux/mount.h> +#include <linux/notifier.h> #include <asm/kmap_types.h> #include <asm/uaccess.h> @@ -774,20 +775,22 @@ ssize_t wait_on_sync_kiocb(struct kiocb *req) EXPORT_SYMBOL(wait_on_sync_kiocb); /* - * exit_aio: called when the last user of mm goes away. At this point, there is + * aio_exit: called when the last user of mm goes away. At this point, there is * no way for any new requests to be submited or any of the io_* syscalls to be * called on the context. * * There may be outstanding kiocbs, but free_ioctx() will explicitly wait on * them. */ -void exit_aio(struct mm_struct *mm) +static int aio_exit(struct notifier_block *nb, + unsigned long action, void *data) { + struct mm_struct *mm = data; struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table); int i; if (!table) - return; + return 0; for (i = 0; i < table->nr; ++i) { struct kioctx *ctx = table->table[i]; @@ -796,10 +799,10 @@ void exit_aio(struct mm_struct *mm) continue; /* * We don't need to bother with munmap() here - exit_mmap(mm) - * is coming and it'll unmap everything. And we simply can't, - * this is not necessarily our ->mm. - * Since kill_ioctx() uses non-zero ->mmap_size as indicator - * that it needs to unmap the area, just set it to 0. + * have already been call and everything is unmap by now. 
But + * to be safe set ->mmap_size to 0 since aio_free_ring() uses + * non-zero ->mmap_size as indicator that it needs to unmap the + * area. */ ctx->mmap_size = 0; kill_ioctx(mm, ctx, NULL); @@ -807,6 +810,7 @@ void exit_aio(struct mm_struct *mm) RCU_INIT_POINTER(mm->ioctx_table, NULL); kfree(table); + return 0; } static void put_reqs_available(struct kioctx *ctx, unsigned nr) @@ -1629,3 +1633,14 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id, } return ret; } + +static struct notifier_block aio_mmput_nb = { + .notifier_call = aio_exit, + .priority = 1, +}; + +static int __init aio_init(void) +{ + return mmput_register_notifier(&aio_mmput_nb); +} +subsys_initcall(aio_init); diff --git a/include/linux/aio.h b/include/linux/aio.h index d9c92da..6308fac 100644 --- a/include/linux/aio.h +++ b/include/linux/aio.h @@ -73,7 +73,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp) extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb); extern void aio_complete(struct kiocb *iocb, long res, long res2); struct mm_struct; -extern void exit_aio(struct mm_struct *mm); extern long do_io_submit(aio_context_t ctx_id, long nr, struct iocb __user *__user *iocbpp, bool compat); void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel); @@ -81,7 +80,6 @@ void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel); static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; } static inline void aio_complete(struct kiocb *iocb, long res, long res2) { } struct mm_struct; -static inline void exit_aio(struct mm_struct *mm) { } static inline long do_io_submit(aio_context_t ctx_id, long nr, struct iocb __user * __user *iocbpp, bool compat) { return 0; } diff --git a/include/linux/ksm.h b/include/linux/ksm.h index 3be6bb1..84c184f 100644 --- a/include/linux/ksm.h +++ b/include/linux/ksm.h @@ -20,7 +20,6 @@ struct mem_cgroup; int ksm_madvise(struct vm_area_struct *vma, unsigned long start, unsigned long end, int advice, 
unsigned long *vm_flags); int __ksm_enter(struct mm_struct *mm); -void __ksm_exit(struct mm_struct *mm); static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) { @@ -29,12 +28,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) return 0; } -static inline void ksm_exit(struct mm_struct *mm) -{ - if (test_bit(MMF_VM_MERGEABLE, &mm->flags)) - __ksm_exit(mm); -} - /* * A KSM page is one of those write-protected "shared pages" or "merged pages" * which KSM maps into multiple mms, wherever identical anonymous page content @@ -83,10 +76,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) return 0; } -static inline void ksm_exit(struct mm_struct *mm) -{ -} - static inline int PageKsm(struct page *page) { return 0; diff --git a/include/linux/sched.h b/include/linux/sched.h index 322d4fc..428b3cf 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2384,6 +2384,11 @@ static inline void mmdrop(struct mm_struct * mm) __mmdrop(mm); } +/* mmput call list of notifier and subsystem/module can register + * new one through this call. 
+ */ +extern int mmput_register_notifier(struct notifier_block *nb); +extern int mmput_unregister_notifier(struct notifier_block *nb); /* mmput gets rid of the mappings and all user-space */ extern void mmput(struct mm_struct *); /* Grab a reference to a task's mm, if it is not already going away */ diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 4f844c6..44e7267 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -120,7 +120,6 @@ extern int uprobe_pre_sstep_notifier(struct pt_regs *regs); extern void uprobe_notify_resume(struct pt_regs *regs); extern bool uprobe_deny_signal(void); extern bool arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs); -extern void uprobe_clear_state(struct mm_struct *mm); extern int arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long addr); extern int arch_uprobe_pre_xol(struct arch_uprobe *aup, struct pt_regs *regs); extern int arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs); diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 46b7c31..32b04dc 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -37,6 +37,7 @@ #include <linux/percpu-rwsem.h> #include <linux/task_work.h> #include <linux/shmem_fs.h> +#include <linux/notifier.h> #include <linux/uprobes.h> @@ -1220,16 +1221,19 @@ static struct xol_area *get_xol_area(void) /* * uprobe_clear_state - Free the area allocated for slots. 
*/ -void uprobe_clear_state(struct mm_struct *mm) +static int uprobe_clear_state(struct notifier_block *nb, + unsigned long action, void *data) { + struct mm_struct *mm = data; struct xol_area *area = mm->uprobes_state.xol_area; if (!area) - return; + return 0; put_page(area->page); kfree(area->bitmap); kfree(area); + return 0; } void uprobe_start_dup_mmap(void) @@ -1979,9 +1983,14 @@ static struct notifier_block uprobe_exception_nb = { .priority = INT_MAX-1, /* notified after kprobes, kgdb */ }; +static struct notifier_block uprobe_mmput_nb = { + .notifier_call = uprobe_clear_state, + .priority = 0, +}; + static int __init init_uprobes(void) { - int i; + int i, err; for (i = 0; i < UPROBES_HASH_SZ; i++) mutex_init(&uprobes_mmap_mutex[i]); @@ -1989,6 +1998,10 @@ static int __init init_uprobes(void) if (percpu_init_rwsem(&dup_mmap_sem)) return -ENOMEM; + err = mmput_register_notifier(&uprobe_mmput_nb); + if (err) + return err; + return register_die_notifier(&uprobe_exception_nb); } __initcall(init_uprobes); diff --git a/kernel/fork.c b/kernel/fork.c index dd8864f..b448509 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -87,6 +87,8 @@ #define CREATE_TRACE_POINTS #include <trace/events/task.h> +static BLOCKING_NOTIFIER_HEAD(mmput_notifier); + /* * Protected counters by write_lock_irq(&tasklist_lock) */ @@ -623,6 +625,21 @@ void __mmdrop(struct mm_struct *mm) EXPORT_SYMBOL_GPL(__mmdrop); /* + * Register a notifier that will be call by mmput + */ +int mmput_register_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_register(&mmput_notifier, nb); +} +EXPORT_SYMBOL_GPL(mmput_register_notifier); + +int mmput_unregister_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_unregister(&mmput_notifier, nb); +} +EXPORT_SYMBOL_GPL(mmput_unregister_notifier); + +/* * Decrement the use count and release all resources for an mm. 
*/ void mmput(struct mm_struct *mm) @@ -630,11 +647,8 @@ void mmput(struct mm_struct *mm) might_sleep(); if (atomic_dec_and_test(&mm->mm_users)) { - uprobe_clear_state(mm); - exit_aio(mm); - ksm_exit(mm); - khugepaged_exit(mm); /* must run before exit_mmap */ exit_mmap(mm); + blocking_notifier_call_chain(&mmput_notifier, 0, mm); set_mm_exe_file(mm, NULL); if (!list_empty(&mm->mmlist)) { spin_lock(&mmlist_lock); diff --git a/mm/ksm.c b/mm/ksm.c index 346ddc9..cb1e976 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -37,6 +37,7 @@ #include <linux/freezer.h> #include <linux/oom.h> #include <linux/numa.h> +#include <linux/notifier.h> #include <asm/tlbflush.h> #include "internal.h" @@ -1586,7 +1587,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page) ksm_scan.mm_slot = slot; spin_unlock(&ksm_mmlist_lock); /* - * Although we tested list_empty() above, a racing __ksm_exit + * Although we tested list_empty() above, a racing ksm_exit * of the last mm on the list may have removed it since then. */ if (slot == &ksm_mm_head) @@ -1658,9 +1659,9 @@ next_mm: /* * We've completed a full scan of all vmas, holding mmap_sem * throughout, and found no VM_MERGEABLE: so do the same as - * __ksm_exit does to remove this mm from all our lists now. - * This applies either when cleaning up after __ksm_exit - * (but beware: we can reach here even before __ksm_exit), + * ksm_exit does to remove this mm from all our lists now. + * This applies either when cleaning up after ksm_exit + * (but beware: we can reach here even before ksm_exit), * or when all VM_MERGEABLE areas have been unmapped (and * mmap_sem then protects against race with MADV_MERGEABLE). 
*/ @@ -1821,11 +1822,16 @@ int __ksm_enter(struct mm_struct *mm) return 0; } -void __ksm_exit(struct mm_struct *mm) +static int ksm_exit(struct notifier_block *nb, + unsigned long action, void *data) { + struct mm_struct *mm = data; struct mm_slot *mm_slot; int easy_to_free = 0; + if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) + return 0; + /* * This process is exiting: if it's straightforward (as is the * case when ksmd was never running), free mm_slot immediately. @@ -1857,6 +1863,7 @@ void __ksm_exit(struct mm_struct *mm) down_write(&mm->mmap_sem); up_write(&mm->mmap_sem); } + return 0; } struct page *ksm_might_need_to_copy(struct page *page, @@ -2305,11 +2312,20 @@ static struct attribute_group ksm_attr_group = { }; #endif /* CONFIG_SYSFS */ +static struct notifier_block ksm_mmput_nb = { + .notifier_call = ksm_exit, + .priority = 2, +}; + static int __init ksm_init(void) { struct task_struct *ksm_thread; int err; + err = mmput_register_notifier(&ksm_mmput_nb); + if (err) + return err; + err = ksm_slab_init(); if (err) goto out; diff --git a/mm/mmap.c b/mm/mmap.c index 61aec93..b684a21 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2775,6 +2775,9 @@ void exit_mmap(struct mm_struct *mm) struct vm_area_struct *vma; unsigned long nr_accounted = 0; + /* Important to call this first. */ + khugepaged_exit(mm); + /* mm's last user has gone, and its about to be pulled down */ mmu_notifier_release(mm); -- 1.9.0 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler. 2014-06-28 2:00 ` [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler Jérôme Glisse @ 2014-06-30 3:49 ` John Hubbard 2014-06-30 15:07 ` Jerome Glisse 2014-06-30 15:37 ` Joerg Roedel 1 sibling, 1 reply; 7+ messages in thread From: John Hubbard @ 2014-06-30 3:49 UTC (permalink / raw) To: Jérôme Glisse Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange, riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar, Subhash Gutti, Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan, Sherry Cheung, Duncan Poole, Oded Gabbay, Alexander Deucher, Andrew Lewycky, Jérôme Glisse [-- Attachment #1: Type: text/plain, Size: 14760 bytes --] On Fri, 27 Jun 2014, Jérôme Glisse wrote: > From: Jérôme Glisse <jglisse@redhat.com> > > Several subsystem require a callback when a mm struct is being destroy > so that they can cleanup there respective per mm struct. Instead of > having each subsystem add its callback to mmput use a notifier chain > to call each of the subsystem. > > This will allow new subsystem to register callback even if they are > module. There should be no contention on the rw semaphore protecting > the call chain and the impact on the code path should be low and > burried in the noise. > > Note that this patch also move the call to cleanup functions after > exit_mmap so that new call back can assume that mmu_notifier_release > have already been call. This does not impact existing cleanup functions > as they do not rely on anything that exit_mmap is freeing. Also moved > khugepaged_exit to exit_mmap so that ordering is preserved for that > function. 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com> > --- > fs/aio.c | 29 ++++++++++++++++++++++------- > include/linux/aio.h | 2 -- > include/linux/ksm.h | 11 ----------- > include/linux/sched.h | 5 +++++ > include/linux/uprobes.h | 1 - > kernel/events/uprobes.c | 19 ++++++++++++++++--- > kernel/fork.c | 22 ++++++++++++++++++---- > mm/ksm.c | 26 +++++++++++++++++++++----- > mm/mmap.c | 3 +++ > 9 files changed, 85 insertions(+), 33 deletions(-) > > diff --git a/fs/aio.c b/fs/aio.c > index c1d8c48..1d06e92 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -40,6 +40,7 @@ > #include <linux/ramfs.h> > #include <linux/percpu-refcount.h> > #include <linux/mount.h> > +#include <linux/notifier.h> > > #include <asm/kmap_types.h> > #include <asm/uaccess.h> > @@ -774,20 +775,22 @@ ssize_t wait_on_sync_kiocb(struct kiocb *req) > EXPORT_SYMBOL(wait_on_sync_kiocb); > > /* > - * exit_aio: called when the last user of mm goes away. At this point, there is > + * aio_exit: called when the last user of mm goes away. At this point, there is > * no way for any new requests to be submited or any of the io_* syscalls to be > * called on the context. > * > * There may be outstanding kiocbs, but free_ioctx() will explicitly wait on > * them. > */ > -void exit_aio(struct mm_struct *mm) > +static int aio_exit(struct notifier_block *nb, > + unsigned long action, void *data) > { > + struct mm_struct *mm = data; > struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table); > int i; > > if (!table) > - return; > + return 0; > > for (i = 0; i < table->nr; ++i) { > struct kioctx *ctx = table->table[i]; > @@ -796,10 +799,10 @@ void exit_aio(struct mm_struct *mm) > continue; > /* > * We don't need to bother with munmap() here - exit_mmap(mm) > - * is coming and it'll unmap everything. And we simply can't, > - * this is not necessarily our ->mm. > - * Since kill_ioctx() uses non-zero ->mmap_size as indicator > - * that it needs to unmap the area, just set it to 0. 
> + * have already been call and everything is unmap by now. But > + * to be safe set ->mmap_size to 0 since aio_free_ring() uses > + * non-zero ->mmap_size as indicator that it needs to unmap the > + * area. > */ Actually, I think the original part of the comment about kill_ioctx was accurate, but the new reference to aio_free_ring looks like a typo (?). I'd write the entire comment as follows (I've dropped the leading whitespace, for email): /* * We don't need to bother with munmap() here - exit_mmap(mm) * has already been called and everything is unmapped by now. * But to be safe, set ->mmap_size to 0 since kill_ioctx() uses a * non-zero >mmap_size as an indicator that it needs to unmap the * area. */ > ctx->mmap_size = 0; > kill_ioctx(mm, ctx, NULL); > @@ -807,6 +810,7 @@ void exit_aio(struct mm_struct *mm) > > RCU_INIT_POINTER(mm->ioctx_table, NULL); > kfree(table); > + return 0; > } > > static void put_reqs_available(struct kioctx *ctx, unsigned nr) > @@ -1629,3 +1633,14 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id, > } > return ret; > } > + > +static struct notifier_block aio_mmput_nb = { > + .notifier_call = aio_exit, > + .priority = 1, > +}; > + > +static int __init aio_init(void) > +{ > + return mmput_register_notifier(&aio_mmput_nb); > +} > +subsys_initcall(aio_init); > diff --git a/include/linux/aio.h b/include/linux/aio.h > index d9c92da..6308fac 100644 > --- a/include/linux/aio.h > +++ b/include/linux/aio.h > @@ -73,7 +73,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp) > extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb); > extern void aio_complete(struct kiocb *iocb, long res, long res2); > struct mm_struct; > -extern void exit_aio(struct mm_struct *mm); > extern long do_io_submit(aio_context_t ctx_id, long nr, > struct iocb __user *__user *iocbpp, bool compat); > void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel); > @@ -81,7 +80,6 @@ void kiocb_set_cancel_fn(struct kiocb *req, 
kiocb_cancel_fn *cancel); > static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; } > static inline void aio_complete(struct kiocb *iocb, long res, long res2) { } > struct mm_struct; > -static inline void exit_aio(struct mm_struct *mm) { } > static inline long do_io_submit(aio_context_t ctx_id, long nr, > struct iocb __user * __user *iocbpp, > bool compat) { return 0; } > diff --git a/include/linux/ksm.h b/include/linux/ksm.h > index 3be6bb1..84c184f 100644 > --- a/include/linux/ksm.h > +++ b/include/linux/ksm.h > @@ -20,7 +20,6 @@ struct mem_cgroup; > int ksm_madvise(struct vm_area_struct *vma, unsigned long start, > unsigned long end, int advice, unsigned long *vm_flags); > int __ksm_enter(struct mm_struct *mm); > -void __ksm_exit(struct mm_struct *mm); > > static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) > { > @@ -29,12 +28,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) > return 0; > } > > -static inline void ksm_exit(struct mm_struct *mm) > -{ > - if (test_bit(MMF_VM_MERGEABLE, &mm->flags)) > - __ksm_exit(mm); > -} > - > /* > * A KSM page is one of those write-protected "shared pages" or "merged pages" > * which KSM maps into multiple mms, wherever identical anonymous page content > @@ -83,10 +76,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) > return 0; > } > > -static inline void ksm_exit(struct mm_struct *mm) > -{ > -} > - > static inline int PageKsm(struct page *page) > { > return 0; > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 322d4fc..428b3cf 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -2384,6 +2384,11 @@ static inline void mmdrop(struct mm_struct * mm) > __mmdrop(mm); > } > > +/* mmput call list of notifier and subsystem/module can register > + * new one through this call. 
> + */ > +extern int mmput_register_notifier(struct notifier_block *nb); > +extern int mmput_unregister_notifier(struct notifier_block *nb); > /* mmput gets rid of the mappings and all user-space */ > extern void mmput(struct mm_struct *); > /* Grab a reference to a task's mm, if it is not already going away */ > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > index 4f844c6..44e7267 100644 > --- a/include/linux/uprobes.h > +++ b/include/linux/uprobes.h > @@ -120,7 +120,6 @@ extern int uprobe_pre_sstep_notifier(struct pt_regs *regs); > extern void uprobe_notify_resume(struct pt_regs *regs); > extern bool uprobe_deny_signal(void); > extern bool arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs); > -extern void uprobe_clear_state(struct mm_struct *mm); > extern int arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long addr); > extern int arch_uprobe_pre_xol(struct arch_uprobe *aup, struct pt_regs *regs); > extern int arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs); > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > index 46b7c31..32b04dc 100644 > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -37,6 +37,7 @@ > #include <linux/percpu-rwsem.h> > #include <linux/task_work.h> > #include <linux/shmem_fs.h> > +#include <linux/notifier.h> > > #include <linux/uprobes.h> > > @@ -1220,16 +1221,19 @@ static struct xol_area *get_xol_area(void) > /* > * uprobe_clear_state - Free the area allocated for slots. 
> */ > -void uprobe_clear_state(struct mm_struct *mm) > +static int uprobe_clear_state(struct notifier_block *nb, > + unsigned long action, void *data) > { > + struct mm_struct *mm = data; > struct xol_area *area = mm->uprobes_state.xol_area; > > if (!area) > - return; > + return 0; > > put_page(area->page); > kfree(area->bitmap); > kfree(area); > + return 0; > } > > void uprobe_start_dup_mmap(void) > @@ -1979,9 +1983,14 @@ static struct notifier_block uprobe_exception_nb = { > .priority = INT_MAX-1, /* notified after kprobes, kgdb */ > }; > > +static struct notifier_block uprobe_mmput_nb = { > + .notifier_call = uprobe_clear_state, > + .priority = 0, > +}; > + > static int __init init_uprobes(void) > { > - int i; > + int i, err; > > for (i = 0; i < UPROBES_HASH_SZ; i++) > mutex_init(&uprobes_mmap_mutex[i]); > @@ -1989,6 +1998,10 @@ static int __init init_uprobes(void) > if (percpu_init_rwsem(&dup_mmap_sem)) > return -ENOMEM; > > + err = mmput_register_notifier(&uprobe_mmput_nb); > + if (err) > + return err; > + > return register_die_notifier(&uprobe_exception_nb); > } > __initcall(init_uprobes); > diff --git a/kernel/fork.c b/kernel/fork.c > index dd8864f..b448509 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -87,6 +87,8 @@ > #define CREATE_TRACE_POINTS > #include <trace/events/task.h> > > +static BLOCKING_NOTIFIER_HEAD(mmput_notifier); > + > /* > * Protected counters by write_lock_irq(&tasklist_lock) > */ > @@ -623,6 +625,21 @@ void __mmdrop(struct mm_struct *mm) > EXPORT_SYMBOL_GPL(__mmdrop); > > /* > + * Register a notifier that will be call by mmput > + */ > +int mmput_register_notifier(struct notifier_block *nb) > +{ > + return blocking_notifier_chain_register(&mmput_notifier, nb); > +} > +EXPORT_SYMBOL_GPL(mmput_register_notifier); > + > +int mmput_unregister_notifier(struct notifier_block *nb) > +{ > + return blocking_notifier_chain_unregister(&mmput_notifier, nb); > +} > +EXPORT_SYMBOL_GPL(mmput_unregister_notifier); > + > +/* > * Decrement the 
use count and release all resources for an mm. > */ > void mmput(struct mm_struct *mm) > @@ -630,11 +647,8 @@ void mmput(struct mm_struct *mm) > might_sleep(); > > if (atomic_dec_and_test(&mm->mm_users)) { > - uprobe_clear_state(mm); > - exit_aio(mm); > - ksm_exit(mm); > - khugepaged_exit(mm); /* must run before exit_mmap */ > exit_mmap(mm); > + blocking_notifier_call_chain(&mmput_notifier, 0, mm); > set_mm_exe_file(mm, NULL); > if (!list_empty(&mm->mmlist)) { > spin_lock(&mmlist_lock); > diff --git a/mm/ksm.c b/mm/ksm.c > index 346ddc9..cb1e976 100644 > --- a/mm/ksm.c > +++ b/mm/ksm.c > @@ -37,6 +37,7 @@ > #include <linux/freezer.h> > #include <linux/oom.h> > #include <linux/numa.h> > +#include <linux/notifier.h> > > #include <asm/tlbflush.h> > #include "internal.h" > @@ -1586,7 +1587,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page) > ksm_scan.mm_slot = slot; > spin_unlock(&ksm_mmlist_lock); > /* > - * Although we tested list_empty() above, a racing __ksm_exit > + * Although we tested list_empty() above, a racing ksm_exit > * of the last mm on the list may have removed it since then. > */ > if (slot == &ksm_mm_head) > @@ -1658,9 +1659,9 @@ next_mm: > /* > * We've completed a full scan of all vmas, holding mmap_sem > * throughout, and found no VM_MERGEABLE: so do the same as > - * __ksm_exit does to remove this mm from all our lists now. > - * This applies either when cleaning up after __ksm_exit > - * (but beware: we can reach here even before __ksm_exit), > + * ksm_exit does to remove this mm from all our lists now. > + * This applies either when cleaning up after ksm_exit > + * (but beware: we can reach here even before ksm_exit), > * or when all VM_MERGEABLE areas have been unmapped (and > * mmap_sem then protects against race with MADV_MERGEABLE). 
> */ > @@ -1821,11 +1822,16 @@ int __ksm_enter(struct mm_struct *mm) > return 0; > } > > -void __ksm_exit(struct mm_struct *mm) > +static int ksm_exit(struct notifier_block *nb, > + unsigned long action, void *data) > { > + struct mm_struct *mm = data; > struct mm_slot *mm_slot; > int easy_to_free = 0; > > + if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) > + return 0; > + > /* > * This process is exiting: if it's straightforward (as is the > * case when ksmd was never running), free mm_slot immediately. > @@ -1857,6 +1863,7 @@ void __ksm_exit(struct mm_struct *mm) > down_write(&mm->mmap_sem); > up_write(&mm->mmap_sem); > } > + return 0; > } > > struct page *ksm_might_need_to_copy(struct page *page, > @@ -2305,11 +2312,20 @@ static struct attribute_group ksm_attr_group = { > }; > #endif /* CONFIG_SYSFS */ > > +static struct notifier_block ksm_mmput_nb = { > + .notifier_call = ksm_exit, > + .priority = 2, > +}; > + > static int __init ksm_init(void) > { > struct task_struct *ksm_thread; > int err; > > + err = mmput_register_notifier(&ksm_mmput_nb); > + if (err) > + return err; > + In order to be perfectly consistent with this routine's existing code, you would want to write: if (err) goto out; ...but it does the same thing as your code. It' just a consistency thing. > err = ksm_slab_init(); > if (err) > goto out; > diff --git a/mm/mmap.c b/mm/mmap.c > index 61aec93..b684a21 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -2775,6 +2775,9 @@ void exit_mmap(struct mm_struct *mm) > struct vm_area_struct *vma; > unsigned long nr_accounted = 0; > > + /* Important to call this first. */ > + khugepaged_exit(mm); > + > /* mm's last user has gone, and its about to be pulled down */ > mmu_notifier_release(mm); > > -- > 1.9.0 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > Above points are extremely minor, so: Reviewed-by: John Hubbard <jhubbard@nvidia.com> thanks, John H. ^ permalink raw reply [flat|nested] 7+ messages in thread
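The mechanism under review above -- a single blocking notifier chain fired from mmput(), with each subsystem's exit handler registered at a priority -- can be sketched in plain userspace C. This is a simulation of the pattern only, not kernel code: the `nb`/`chain_register` names are invented for illustration, but the ordering rule mirrors the kernel's notifier chains, where higher-priority entries are called first (so ksm at priority 2 runs before aio at 1 and uprobes at 0 in the patch).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simulated notifier block -- same shape as the kernel's struct
 * notifier_block, but in plain userspace C. */
struct nb {
	int (*call)(struct nb *self, unsigned long action, void *data);
	int priority;
	struct nb *next;
};

static struct nb *chain;	/* head of the sorted chain */

/* Insert sorted by descending priority, mirroring what the kernel's
 * notifier_chain_register() does: higher priority runs first. */
static void chain_register(struct nb *n)
{
	struct nb **p = &chain;

	while (*p && (*p)->priority >= n->priority)
		p = &(*p)->next;
	n->next = *p;
	*p = n;
}

/* Walk the chain in order, like blocking_notifier_call_chain(). */
static void chain_call(unsigned long action, void *data)
{
	struct nb *n;

	for (n = chain; n; n = n->next)
		n->call(n, action, data);
}

/* Record the order the simulated subsystem exit handlers fire in. */
static char order[32];

static int ksm_cb(struct nb *n, unsigned long a, void *d)
{ strcat(order, "ksm "); return 0; }
static int aio_cb(struct nb *n, unsigned long a, void *d)
{ strcat(order, "aio "); return 0; }
static int uprobe_cb(struct nb *n, unsigned long a, void *d)
{ strcat(order, "uprobe"); return 0; }

static struct nb ksm_nb    = { ksm_cb,    2, NULL };	/* like ksm_mmput_nb */
static struct nb aio_nb    = { aio_cb,    1, NULL };	/* like aio_mmput_nb */
static struct nb uprobe_nb = { uprobe_cb, 0, NULL };	/* like uprobe_mmput_nb */

const char *demo_mmput_order(void)
{
	order[0] = '\0';
	chain = NULL;
	/* Register in an arbitrary order; priority alone decides call order. */
	chain_register(&uprobe_nb);
	chain_register(&ksm_nb);
	chain_register(&aio_nb);
	chain_call(0, NULL);	/* what mmput() would do after exit_mmap() */
	return order;
}
```

Registration order does not matter, only priority does -- which is what lets independent subsystems (and, per the commit message, eventually modules) hook mmput without caring when they were loaded.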
* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler. 2014-06-30 3:49 ` John Hubbard @ 2014-06-30 15:07 ` Jerome Glisse 0 siblings, 0 replies; 7+ messages in thread From: Jerome Glisse @ 2014-06-30 15:07 UTC (permalink / raw) To: John Hubbard Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange, riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar, Subhash Gutti, Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan, Sherry Cheung, Duncan Poole, Oded Gabbay, Alexander Deucher, Andrew Lewycky, Jérôme Glisse On Sun, Jun 29, 2014 at 08:49:16PM -0700, John Hubbard wrote: > On Fri, 27 Jun 2014, Jérôme Glisse wrote: > > > From: Jérôme Glisse <jglisse@redhat.com> > > > > Several subsystem require a callback when a mm struct is being destroy > > so that they can cleanup there respective per mm struct. Instead of > > having each subsystem add its callback to mmput use a notifier chain > > to call each of the subsystem. > > > > This will allow new subsystem to register callback even if they are > > module. There should be no contention on the rw semaphore protecting > > the call chain and the impact on the code path should be low and > > burried in the noise. > > > > Note that this patch also move the call to cleanup functions after > > exit_mmap so that new call back can assume that mmu_notifier_release > > have already been call. This does not impact existing cleanup functions > > as they do not rely on anything that exit_mmap is freeing. Also moved > > khugepaged_exit to exit_mmap so that ordering is preserved for that > > function. 
> > > > Signed-off-by: Jérôme Glisse <jglisse@redhat.com> > > --- > > fs/aio.c | 29 ++++++++++++++++++++++------- > > include/linux/aio.h | 2 -- > > include/linux/ksm.h | 11 ----------- > > include/linux/sched.h | 5 +++++ > > include/linux/uprobes.h | 1 - > > kernel/events/uprobes.c | 19 ++++++++++++++++--- > > kernel/fork.c | 22 ++++++++++++++++++---- > > mm/ksm.c | 26 +++++++++++++++++++++----- > > mm/mmap.c | 3 +++ > > 9 files changed, 85 insertions(+), 33 deletions(-) > > > > diff --git a/fs/aio.c b/fs/aio.c > > index c1d8c48..1d06e92 100644 > > --- a/fs/aio.c > > +++ b/fs/aio.c > > @@ -40,6 +40,7 @@ > > #include <linux/ramfs.h> > > #include <linux/percpu-refcount.h> > > #include <linux/mount.h> > > +#include <linux/notifier.h> > > > > #include <asm/kmap_types.h> > > #include <asm/uaccess.h> > > @@ -774,20 +775,22 @@ ssize_t wait_on_sync_kiocb(struct kiocb *req) > > EXPORT_SYMBOL(wait_on_sync_kiocb); > > > > /* > > - * exit_aio: called when the last user of mm goes away. At this point, there is > > + * aio_exit: called when the last user of mm goes away. At this point, there is > > * no way for any new requests to be submited or any of the io_* syscalls to be > > * called on the context. > > * > > * There may be outstanding kiocbs, but free_ioctx() will explicitly wait on > > * them. > > */ > > -void exit_aio(struct mm_struct *mm) > > +static int aio_exit(struct notifier_block *nb, > > + unsigned long action, void *data) > > { > > + struct mm_struct *mm = data; > > struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table); > > int i; > > > > if (!table) > > - return; > > + return 0; > > > > for (i = 0; i < table->nr; ++i) { > > struct kioctx *ctx = table->table[i]; > > @@ -796,10 +799,10 @@ void exit_aio(struct mm_struct *mm) > > continue; > > /* > > * We don't need to bother with munmap() here - exit_mmap(mm) > > - * is coming and it'll unmap everything. And we simply can't, > > - * this is not necessarily our ->mm. 
> > - * Since kill_ioctx() uses non-zero ->mmap_size as indicator > > - * that it needs to unmap the area, just set it to 0. > > + * have already been call and everything is unmap by now. But > > + * to be safe set ->mmap_size to 0 since aio_free_ring() uses > > + * non-zero ->mmap_size as indicator that it needs to unmap the > > + * area. > > */ > > Actually, I think the original part of the comment about kill_ioctx > was accurate, but the new reference to aio_free_ring looks like a typo > (?). I'd write the entire comment as follows (I've dropped the leading > whitespace, for email): > > /* > * We don't need to bother with munmap() here - exit_mmap(mm) > * has already been called and everything is unmapped by now. > * But to be safe, set ->mmap_size to 0 since kill_ioctx() uses a > * non-zero ->mmap_size as an indicator that it needs to unmap the > * area. > */ > This is a rebase issue: the code changed and I updated the code but not the comment. > > > ctx->mmap_size = 0; > > kill_ioctx(mm, ctx, NULL); > > @@ -807,6 +810,7 @@ void exit_aio(struct mm_struct *mm) > > > > RCU_INIT_POINTER(mm->ioctx_table, NULL); > > kfree(table); > > + return 0; > > } > > > > static void put_reqs_available(struct kioctx *ctx, unsigned nr) > > @@ -1629,3 +1633,14 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id, > > } > > return ret; > > } > > + > > +static struct notifier_block aio_mmput_nb = { > > + .notifier_call = aio_exit, > > + .priority = 1, > > +}; > > + > > +static int __init aio_init(void) > > +{ > > + return mmput_register_notifier(&aio_mmput_nb); > > +} > > +subsys_initcall(aio_init); > > diff --git a/include/linux/aio.h b/include/linux/aio.h > > index d9c92da..6308fac 100644 > > --- a/include/linux/aio.h > > +++ b/include/linux/aio.h > > @@ -73,7 +73,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp) > > extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb); > > extern void aio_complete(struct kiocb *iocb, long res, long res2); > > 
struct mm_struct; > > -extern void exit_aio(struct mm_struct *mm); > > extern long do_io_submit(aio_context_t ctx_id, long nr, > > struct iocb __user *__user *iocbpp, bool compat); > > void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel); > > @@ -81,7 +80,6 @@ void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel); > > static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; } > > static inline void aio_complete(struct kiocb *iocb, long res, long res2) { } > > struct mm_struct; > > -static inline void exit_aio(struct mm_struct *mm) { } > > static inline long do_io_submit(aio_context_t ctx_id, long nr, > > struct iocb __user * __user *iocbpp, > > bool compat) { return 0; } > > diff --git a/include/linux/ksm.h b/include/linux/ksm.h > > index 3be6bb1..84c184f 100644 > > --- a/include/linux/ksm.h > > +++ b/include/linux/ksm.h > > @@ -20,7 +20,6 @@ struct mem_cgroup; > > int ksm_madvise(struct vm_area_struct *vma, unsigned long start, > > unsigned long end, int advice, unsigned long *vm_flags); > > int __ksm_enter(struct mm_struct *mm); > > -void __ksm_exit(struct mm_struct *mm); > > > > static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) > > { > > @@ -29,12 +28,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) > > return 0; > > } > > > > -static inline void ksm_exit(struct mm_struct *mm) > > -{ > > - if (test_bit(MMF_VM_MERGEABLE, &mm->flags)) > > - __ksm_exit(mm); > > -} > > - > > /* > > * A KSM page is one of those write-protected "shared pages" or "merged pages" > > * which KSM maps into multiple mms, wherever identical anonymous page content > > @@ -83,10 +76,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) > > return 0; > > } > > > > -static inline void ksm_exit(struct mm_struct *mm) > > -{ > > -} > > - > > static inline int PageKsm(struct page *page) > > { > > return 0; > > diff --git a/include/linux/sched.h b/include/linux/sched.h > > 
index 322d4fc..428b3cf 100644 > > --- a/include/linux/sched.h > > +++ b/include/linux/sched.h > > @@ -2384,6 +2384,11 @@ static inline void mmdrop(struct mm_struct * mm) > > __mmdrop(mm); > > } > > > > +/* mmput call list of notifier and subsystem/module can register > > + * new one through this call. > > + */ > > +extern int mmput_register_notifier(struct notifier_block *nb); > > +extern int mmput_unregister_notifier(struct notifier_block *nb); > > /* mmput gets rid of the mappings and all user-space */ > > extern void mmput(struct mm_struct *); > > /* Grab a reference to a task's mm, if it is not already going away */ > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > > index 4f844c6..44e7267 100644 > > --- a/include/linux/uprobes.h > > +++ b/include/linux/uprobes.h > > @@ -120,7 +120,6 @@ extern int uprobe_pre_sstep_notifier(struct pt_regs *regs); > > extern void uprobe_notify_resume(struct pt_regs *regs); > > extern bool uprobe_deny_signal(void); > > extern bool arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs); > > -extern void uprobe_clear_state(struct mm_struct *mm); > > extern int arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long addr); > > extern int arch_uprobe_pre_xol(struct arch_uprobe *aup, struct pt_regs *regs); > > extern int arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs); > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > index 46b7c31..32b04dc 100644 > > --- a/kernel/events/uprobes.c > > +++ b/kernel/events/uprobes.c > > @@ -37,6 +37,7 @@ > > #include <linux/percpu-rwsem.h> > > #include <linux/task_work.h> > > #include <linux/shmem_fs.h> > > +#include <linux/notifier.h> > > > > #include <linux/uprobes.h> > > > > @@ -1220,16 +1221,19 @@ static struct xol_area *get_xol_area(void) > > /* > > * uprobe_clear_state - Free the area allocated for slots. 
> > */ > > -void uprobe_clear_state(struct mm_struct *mm) > > +static int uprobe_clear_state(struct notifier_block *nb, > > + unsigned long action, void *data) > > { > > + struct mm_struct *mm = data; > > struct xol_area *area = mm->uprobes_state.xol_area; > > > > if (!area) > > - return; > > + return 0; > > > > put_page(area->page); > > kfree(area->bitmap); > > kfree(area); > > + return 0; > > } > > > > void uprobe_start_dup_mmap(void) > > @@ -1979,9 +1983,14 @@ static struct notifier_block uprobe_exception_nb = { > > .priority = INT_MAX-1, /* notified after kprobes, kgdb */ > > }; > > > > +static struct notifier_block uprobe_mmput_nb = { > > + .notifier_call = uprobe_clear_state, > > + .priority = 0, > > +}; > > + > > static int __init init_uprobes(void) > > { > > - int i; > > + int i, err; > > > > for (i = 0; i < UPROBES_HASH_SZ; i++) > > mutex_init(&uprobes_mmap_mutex[i]); > > @@ -1989,6 +1998,10 @@ static int __init init_uprobes(void) > > if (percpu_init_rwsem(&dup_mmap_sem)) > > return -ENOMEM; > > > > + err = mmput_register_notifier(&uprobe_mmput_nb); > > + if (err) > > + return err; > > + > > return register_die_notifier(&uprobe_exception_nb); > > } > > __initcall(init_uprobes); > > diff --git a/kernel/fork.c b/kernel/fork.c > > index dd8864f..b448509 100644 > > --- a/kernel/fork.c > > +++ b/kernel/fork.c > > @@ -87,6 +87,8 @@ > > #define CREATE_TRACE_POINTS > > #include <trace/events/task.h> > > > > +static BLOCKING_NOTIFIER_HEAD(mmput_notifier); > > + > > /* > > * Protected counters by write_lock_irq(&tasklist_lock) > > */ > > @@ -623,6 +625,21 @@ void __mmdrop(struct mm_struct *mm) > > EXPORT_SYMBOL_GPL(__mmdrop); > > > > /* > > + * Register a notifier that will be call by mmput > > + */ > > +int mmput_register_notifier(struct notifier_block *nb) > > +{ > > + return blocking_notifier_chain_register(&mmput_notifier, nb); > > +} > > +EXPORT_SYMBOL_GPL(mmput_register_notifier); > > + > > +int mmput_unregister_notifier(struct notifier_block *nb) > > +{ > > + 
return blocking_notifier_chain_unregister(&mmput_notifier, nb); > > +} > > +EXPORT_SYMBOL_GPL(mmput_unregister_notifier); > > + > > +/* > > * Decrement the use count and release all resources for an mm. > > */ > > void mmput(struct mm_struct *mm) > > @@ -630,11 +647,8 @@ void mmput(struct mm_struct *mm) > > might_sleep(); > > > > if (atomic_dec_and_test(&mm->mm_users)) { > > - uprobe_clear_state(mm); > > - exit_aio(mm); > > - ksm_exit(mm); > > - khugepaged_exit(mm); /* must run before exit_mmap */ > > exit_mmap(mm); > > + blocking_notifier_call_chain(&mmput_notifier, 0, mm); > > set_mm_exe_file(mm, NULL); > > if (!list_empty(&mm->mmlist)) { > > spin_lock(&mmlist_lock); > > diff --git a/mm/ksm.c b/mm/ksm.c > > index 346ddc9..cb1e976 100644 > > --- a/mm/ksm.c > > +++ b/mm/ksm.c > > @@ -37,6 +37,7 @@ > > #include <linux/freezer.h> > > #include <linux/oom.h> > > #include <linux/numa.h> > > +#include <linux/notifier.h> > > > > #include <asm/tlbflush.h> > > #include "internal.h" > > @@ -1586,7 +1587,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page) > > ksm_scan.mm_slot = slot; > > spin_unlock(&ksm_mmlist_lock); > > /* > > - * Although we tested list_empty() above, a racing __ksm_exit > > + * Although we tested list_empty() above, a racing ksm_exit > > * of the last mm on the list may have removed it since then. > > */ > > if (slot == &ksm_mm_head) > > @@ -1658,9 +1659,9 @@ next_mm: > > /* > > * We've completed a full scan of all vmas, holding mmap_sem > > * throughout, and found no VM_MERGEABLE: so do the same as > > - * __ksm_exit does to remove this mm from all our lists now. > > - * This applies either when cleaning up after __ksm_exit > > - * (but beware: we can reach here even before __ksm_exit), > > + * ksm_exit does to remove this mm from all our lists now. 
> > + * This applies either when cleaning up after ksm_exit > > + * (but beware: we can reach here even before ksm_exit), > > * or when all VM_MERGEABLE areas have been unmapped (and > > * mmap_sem then protects against race with MADV_MERGEABLE). > > */ > > @@ -1821,11 +1822,16 @@ int __ksm_enter(struct mm_struct *mm) > > return 0; > > } > > > > -void __ksm_exit(struct mm_struct *mm) > > +static int ksm_exit(struct notifier_block *nb, > > + unsigned long action, void *data) > > { > > + struct mm_struct *mm = data; > > struct mm_slot *mm_slot; > > int easy_to_free = 0; > > > > + if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) > > + return 0; > > + > > /* > > * This process is exiting: if it's straightforward (as is the > > * case when ksmd was never running), free mm_slot immediately. > > @@ -1857,6 +1863,7 @@ void __ksm_exit(struct mm_struct *mm) > > down_write(&mm->mmap_sem); > > up_write(&mm->mmap_sem); > > } > > + return 0; > > } > > > > struct page *ksm_might_need_to_copy(struct page *page, > > @@ -2305,11 +2312,20 @@ static struct attribute_group ksm_attr_group = { > > }; > > #endif /* CONFIG_SYSFS */ > > > > +static struct notifier_block ksm_mmput_nb = { > > + .notifier_call = ksm_exit, > > + .priority = 2, > > +}; > > + > > static int __init ksm_init(void) > > { > > struct task_struct *ksm_thread; > > int err; > > > > + err = mmput_register_notifier(&ksm_mmput_nb); > > + if (err) > > + return err; > > + > > In order to be perfectly consistent with this routine's existing code, you > would want to write: > > if (err) > goto out; > > ...but it does the same thing as your code. It' just a consistency thing. > > > err = ksm_slab_init(); > > if (err) > > goto out; > > diff --git a/mm/mmap.c b/mm/mmap.c > > index 61aec93..b684a21 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -2775,6 +2775,9 @@ void exit_mmap(struct mm_struct *mm) > > struct vm_area_struct *vma; > > unsigned long nr_accounted = 0; > > > > + /* Important to call this first. 
*/ > > + khugepaged_exit(mm); > > + > > /* mm's last user has gone, and its about to be pulled down */ > > mmu_notifier_release(mm); > > > > -- > > 1.9.0 > > > > -- > > To unsubscribe, send a message with 'unsubscribe linux-mm' in > > the body to majordomo@kvack.org. For more info on Linux MM, > > see: http://www.linux-mm.org/ . > > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > > > > Above points are extremely minor, so: > > Reviewed-by: John Hubbard <jhubbard@nvidia.com> I will respin nonetheless with fixed comment. > > thanks, > John H. ^ permalink raw reply [flat|nested] 7+ messages in thread
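As an aside, the fork scenario that motivates this cleanup ordering has a userspace-side mitigation, suggested by David Woodhouse elsewhere in this thread: the library can register a pthread_atfork() child handler that closes the device file descriptor, so a long-lived child no longer pins the parent's mirror resources. A minimal sketch, with /dev/null standing in for the device node and the names (dev_fd, demo_atfork_close) invented for illustration:

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int dev_fd = -1;	/* stand-in for the mirrored-device fd */

/* Child-side atfork handler: drop our copy of the fd right after fork(),
 * so only the parent keeps the "device" (and thus the mm mirror) alive. */
static void close_dev_in_child(void)
{
	if (dev_fd >= 0) {
		close(dev_fd);
		dev_fd = -1;
	}
}

/* Returns 0 if the child's copy of the fd was closed by the handler while
 * the parent's copy survived, -1 otherwise. */
int demo_atfork_close(void)
{
	pid_t pid;
	int status;

	dev_fd = open("/dev/null", O_RDWR);	/* pretend: the device file */
	if (dev_fd < 0)
		return -1;
	if (pthread_atfork(NULL, NULL, close_dev_in_child) != 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* The handler already ran in the child: fd must be gone. */
		_exit(dev_fd == -1 ? 0 : 1);
	}
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (dev_fd < 0)	/* parent's copy must still be open */
		return -1;
	close(dev_fd);
	dev_fd = -1;
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

This only closes the descriptor; whether the kernel side can then tear down per-mm state promptly is exactly the question the mmput notifier chain is trying to answer.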
* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler. 2014-06-28 2:00 ` [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler Jérôme Glisse 2014-06-30 3:49 ` John Hubbard @ 2014-06-30 15:37 ` Joerg Roedel 1 sibling, 0 replies; 7+ messages in thread From: Joerg Roedel @ 2014-06-30 15:37 UTC (permalink / raw) To: Jérôme Glisse Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange, riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar, Subhash Gutti, Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan, John Hubbard, Sherry Cheung, Duncan Poole, Oded Gabbay, Alexander Deucher, Andrew Lewycky, Jérôme Glisse On Fri, Jun 27, 2014 at 10:00:19PM -0400, Jérôme Glisse wrote: > Note that this patch also move the call to cleanup functions after > exit_mmap so that new call back can assume that mmu_notifier_release > have already been call. This does not impact existing cleanup functions > as they do not rely on anything that exit_mmap is freeing. Also moved > khugepaged_exit to exit_mmap so that ordering is preserved for that > function. What this patch does is duplicate the functionality of the mmu_notifier_release call-back. Why is it needed? Joerg ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2015-11-20 15:45 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <20140701110018.GH26537@8bytes.org> [not found] ` <20140701193343.GB3322@gmail.com> [not found] ` <20140701210620.GL26537@8bytes.org> [not found] ` <20140701213208.GC3322@gmail.com> [not found] ` <20140703183024.GA3306@gmail.com> [not found] ` <20140703231541.GR26537@8bytes.org> [not found] ` <019CCE693E457142B37B791721487FD918085329@storexdag01.amd.com> [not found] ` <20140707101158.GD1958@8bytes.org> [not found] ` <1404729783.31606.1.camel@tlv-gabbay-ws.amd.com> [not found] ` <20140708080059.GF1958@8bytes.org> [not found] ` <20140708170355.GA2469@gmail.com> 2015-10-11 19:03 ` [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler David Woodhouse 2015-10-12 17:41 ` Jerome Glisse 2015-11-20 15:45 ` David Woodhouse 2014-06-28 2:00 mm preparatory patches for HMM and IOMMUv2 Jérôme Glisse 2014-06-28 2:00 ` [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler Jérôme Glisse 2014-06-30 3:49 ` John Hubbard 2014-06-30 15:07 ` Jerome Glisse 2014-06-30 15:37 ` Joerg Roedel
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).