* [PATCH v2 1/2] mm: reorganize internal_get_user_pages_fast() [not found] <0-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com> @ 2020-10-30 14:46 ` Jason Gunthorpe 2020-10-30 16:29 ` Jan Kara ` (2 more replies) 2020-11-02 22:19 ` [PATCH v2 0/2] Add a seqcount between gup_fast and copy_page_range() Ahmed S. Darwish [not found] ` <2-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com> 2 siblings, 3 replies; 28+ messages in thread From: Jason Gunthorpe @ 2020-10-30 14:46 UTC (permalink / raw) To: linux-kernel, Peter Xu, Linus Torvalds Cc: Andrea Arcangeli, Andrew Morton, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai, Linux-MM, Michal Hocko, Oleg Nesterov The next patch in this series makes the lockless flow a little more complex, so move the entire block into a new function and remove a level of indention. Tidy a bit of cruft: - addr is always the same as start, so use start - Use the modern check_add_overflow() for computing end = start + len - nr_pinned/pages << PAGE_SHIFT needs the LHS to be unsigned long to avoid shift overflow, make the variables unsigned long to avoid coding casts in both places. nr_pinned was missing its cast - The handling of ret and nr_pinned can be streamlined a bit No functional change. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> --- mm/gup.c | 99 ++++++++++++++++++++++++++++++-------------------------- 1 file changed, 54 insertions(+), 45 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 102877ed77a4b4..150cc962c99201 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2671,13 +2671,43 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages, return ret; } -static int internal_get_user_pages_fast(unsigned long start, int nr_pages, +static unsigned long lockless_pages_from_mm(unsigned long start, + unsigned long end, + unsigned int gup_flags, + struct page **pages) +{ + unsigned long flags; + int nr_pinned = 0; + + if (!IS_ENABLED(CONFIG_HAVE_FAST_GUP) || + !gup_fast_permitted(start, end)) + return 0; + + /* + * Disable interrupts. The nested form is used, in order to allow full, + * general purpose use of this routine. + * + * With interrupts disabled, we block page table pages from being freed + * from under us. See struct mmu_table_batch comments in + * include/asm-generic/tlb.h for more details. + * + * We do not adopt an rcu_read_lock() here as we also want to block IPIs + * that come from THPs splitting. + */ + local_irq_save(flags); + gup_pgd_range(start, end, gup_flags, pages, &nr_pinned); + local_irq_restore(flags); + return nr_pinned; +} + +static int internal_get_user_pages_fast(unsigned long start, + unsigned long nr_pages, unsigned int gup_flags, struct page **pages) { - unsigned long addr, len, end; - unsigned long flags; - int nr_pinned = 0, ret = 0; + unsigned long len, end; + unsigned long nr_pinned; + int ret; if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_FORCE | FOLL_PIN | FOLL_GET | @@ -2691,54 +2721,33 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages, might_lock_read(¤t->mm->mmap_lock); start = untagged_addr(start) & PAGE_MASK; - addr = start; - len = (unsigned long) nr_pages << PAGE_SHIFT; - end = start + len; - - if (end <= start) + len = nr_pages << PAGE_SHIFT; + if (check_add_overflow(start, len, &end)) return 0; if (unlikely(!access_ok((void __user *)start, len))) return -EFAULT; - /* - * Disable interrupts. The nested form is used, in order to allow - * full, general purpose use of this routine. - * - * With interrupts disabled, we block page table pages from being - * freed from under us. See struct mmu_table_batch comments in - * include/asm-generic/tlb.h for more details. - * - * We do not adopt an rcu_read_lock(.) here as we also want to - * block IPIs that come from THPs splitting. - */ - if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && gup_fast_permitted(start, end)) { - unsigned long fast_flags = gup_flags; - - local_irq_save(flags); - gup_pgd_range(addr, end, fast_flags, pages, &nr_pinned); - local_irq_restore(flags); - ret = nr_pinned; - } + nr_pinned = lockless_pages_from_mm(start, end, gup_flags, pages); + if (nr_pinned == nr_pages || gup_flags & FOLL_FAST_ONLY) + return nr_pinned; - if (nr_pinned < nr_pages && !(gup_flags & FOLL_FAST_ONLY)) { - /* Try to get the remaining pages with get_user_pages */ - start += nr_pinned << PAGE_SHIFT; - pages += nr_pinned; - - ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned, - gup_flags, pages); - - /* Have to be a bit careful with return values */ - if (nr_pinned > 0) { - if (ret < 0) - ret = nr_pinned; - else - ret += nr_pinned; - } + /* Slow path: try to get the remaining pages with get_user_pages */ + start += nr_pinned << PAGE_SHIFT; + pages += nr_pinned; + ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned, gup_flags, + pages); + if (ret < 0) { + /* + * The caller has to unpin the pages we already pinned so + * returning -errno is not an option + */ + if (nr_pinned) + return nr_pinned; + return ret; } - - return ret; + return ret + nr_pinned; } + /** * get_user_pages_fast_only() - pin user pages in memory * @start: starting user address -- 2.28.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH v2 1/2] mm: reorganize internal_get_user_pages_fast() 2020-10-30 14:46 ` [PATCH v2 1/2] mm: reorganize internal_get_user_pages_fast() Jason Gunthorpe @ 2020-10-30 16:29 ` Jan Kara 2020-10-30 21:31 ` John Hubbard 2020-10-30 22:36 ` Peter Xu 2 siblings, 0 replies; 28+ messages in thread From: Jan Kara @ 2020-10-30 16:29 UTC (permalink / raw) To: Jason Gunthorpe Cc: linux-kernel, Peter Xu, Linus Torvalds, Andrea Arcangeli, Andrew Morton, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai, Linux-MM, Michal Hocko, Oleg Nesterov On Fri 30-10-20 11:46:20, Jason Gunthorpe wrote: > The next patch in this series makes the lockless flow a little more > complex, so move the entire block into a new function and remove a level > of indention. Tidy a bit of cruft: > > - addr is always the same as start, so use start > > - Use the modern check_add_overflow() for computing end = start + len > > - nr_pinned/pages << PAGE_SHIFT needs the LHS to be unsigned long to > avoid shift overflow, make the variables unsigned long to avoid coding > casts in both places. nr_pinned was missing its cast > > - The handling of ret and nr_pinned can be streamlined a bit > > No functional change. > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Looks good to me. You can add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > --- > mm/gup.c | 99 ++++++++++++++++++++++++++++++-------------------------- > 1 file changed, 54 insertions(+), 45 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index 102877ed77a4b4..150cc962c99201 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2671,13 +2671,43 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages, > return ret; > } > > -static int internal_get_user_pages_fast(unsigned long start, int nr_pages, > +static unsigned long lockless_pages_from_mm(unsigned long start, > + unsigned long end, > + unsigned int gup_flags, > + struct page **pages) > +{ > + unsigned long flags; > + int nr_pinned = 0; > + > + if (!IS_ENABLED(CONFIG_HAVE_FAST_GUP) || > + !gup_fast_permitted(start, end)) > + return 0; > + > + /* > + * Disable interrupts. The nested form is used, in order to allow full, > + * general purpose use of this routine. > + * > + * With interrupts disabled, we block page table pages from being freed > + * from under us. See struct mmu_table_batch comments in > + * include/asm-generic/tlb.h for more details. > + * > + * We do not adopt an rcu_read_lock() here as we also want to block IPIs > + * that come from THPs splitting. > + */ > + local_irq_save(flags); > + gup_pgd_range(start, end, gup_flags, pages, &nr_pinned); > + local_irq_restore(flags); > + return nr_pinned; > +} > + > +static int internal_get_user_pages_fast(unsigned long start, > + unsigned long nr_pages, > unsigned int gup_flags, > struct page **pages) > { > - unsigned long addr, len, end; > - unsigned long flags; > - int nr_pinned = 0, ret = 0; > + unsigned long len, end; > + unsigned long nr_pinned; > + int ret; > > if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | > FOLL_FORCE | FOLL_PIN | FOLL_GET | > @@ -2691,54 +2721,33 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages, > might_lock_read(¤t->mm->mmap_lock); > > start = untagged_addr(start) & PAGE_MASK; > - addr = start; > - len = (unsigned long) nr_pages << PAGE_SHIFT; > - end = start + len; > - > - if (end <= start) > + len = nr_pages << PAGE_SHIFT; > + if (check_add_overflow(start, len, &end)) > return 0; > if (unlikely(!access_ok((void __user *)start, len))) > return -EFAULT; > > - /* > - * Disable interrupts. The nested form is used, in order to allow > - * full, general purpose use of this routine. > - * > - * With interrupts disabled, we block page table pages from being > - * freed from under us. See struct mmu_table_batch comments in > - * include/asm-generic/tlb.h for more details. > - * > - * We do not adopt an rcu_read_lock(.) here as we also want to > - * block IPIs that come from THPs splitting. > - */ > - if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && gup_fast_permitted(start, end)) { > - unsigned long fast_flags = gup_flags; > - > - local_irq_save(flags); > - gup_pgd_range(addr, end, fast_flags, pages, &nr_pinned); > - local_irq_restore(flags); > - ret = nr_pinned; > - } > + nr_pinned = lockless_pages_from_mm(start, end, gup_flags, pages); > + if (nr_pinned == nr_pages || gup_flags & FOLL_FAST_ONLY) > + return nr_pinned; > > - if (nr_pinned < nr_pages && !(gup_flags & FOLL_FAST_ONLY)) { > - /* Try to get the remaining pages with get_user_pages */ > - start += nr_pinned << PAGE_SHIFT; > - pages += nr_pinned; > - > - ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned, > - gup_flags, pages); > - > - /* Have to be a bit careful with return values */ > - if (nr_pinned > 0) { > - if (ret < 0) > - ret = nr_pinned; > - else > - ret += nr_pinned; > - } > + /* Slow path: try to get the remaining pages with get_user_pages */ > + start += nr_pinned << PAGE_SHIFT; > + pages += nr_pinned; > + ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned, gup_flags, > + pages); > + if (ret < 0) { > + /* > + * The caller has to unpin the pages we already pinned so > + * returning -errno is not an option > + */ > + if (nr_pinned) > + return nr_pinned; > + return ret; > } > - > - return ret; > + return ret + nr_pinned; > } > + > /** > * get_user_pages_fast_only() - pin user pages in memory > * @start: starting user address > -- > 2.28.0 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 1/2] mm: reorganize internal_get_user_pages_fast() 2020-10-30 14:46 ` [PATCH v2 1/2] mm: reorganize internal_get_user_pages_fast() Jason Gunthorpe 2020-10-30 16:29 ` Jan Kara @ 2020-10-30 21:31 ` John Hubbard 2020-10-30 22:36 ` Peter Xu 2 siblings, 0 replies; 28+ messages in thread From: John Hubbard @ 2020-10-30 21:31 UTC (permalink / raw) To: Jason Gunthorpe, linux-kernel, Peter Xu, Linus Torvalds Cc: Andrea Arcangeli, Andrew Morton, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, Kirill Shutemov, Kirill Tkhai, Linux-MM, Michal Hocko, Oleg Nesterov On 10/30/20 7:46 AM, Jason Gunthorpe wrote: > The next patch in this series makes the lockless flow a little more > complex, so move the entire block into a new function and remove a level > of indention. Tidy a bit of cruft: > > - addr is always the same as start, so use start > > - Use the modern check_add_overflow() for computing end = start + len > > - nr_pinned/pages << PAGE_SHIFT needs the LHS to be unsigned long to > avoid shift overflow, make the variables unsigned long to avoid coding > casts in both places. nr_pinned was missing its cast > > - The handling of ret and nr_pinned can be streamlined a bit > > No functional change. > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> > --- > mm/gup.c | 99 ++++++++++++++++++++++++++++++-------------------------- > 1 file changed, 54 insertions(+), 45 deletions(-) Everything still looks correct. Reviewed-by: John Hubbard <jhubbard@nvidia.com> thanks, -- John Hubbard NVIDIA > > diff --git a/mm/gup.c b/mm/gup.c > index 102877ed77a4b4..150cc962c99201 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2671,13 +2671,43 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages, > return ret; > } > > -static int internal_get_user_pages_fast(unsigned long start, int nr_pages, > +static unsigned long lockless_pages_from_mm(unsigned long start, > + unsigned long end, > + unsigned int gup_flags, > + struct page **pages) > +{ > + unsigned long flags; > + int nr_pinned = 0; > + > + if (!IS_ENABLED(CONFIG_HAVE_FAST_GUP) || > + !gup_fast_permitted(start, end)) > + return 0; > + > + /* > + * Disable interrupts. The nested form is used, in order to allow full, > + * general purpose use of this routine. > + * > + * With interrupts disabled, we block page table pages from being freed > + * from under us. See struct mmu_table_batch comments in > + * include/asm-generic/tlb.h for more details. > + * > + * We do not adopt an rcu_read_lock() here as we also want to block IPIs > + * that come from THPs splitting. > + */ > + local_irq_save(flags); > + gup_pgd_range(start, end, gup_flags, pages, &nr_pinned); > + local_irq_restore(flags); > + return nr_pinned; > +} > + > +static int internal_get_user_pages_fast(unsigned long start, > + unsigned long nr_pages, > unsigned int gup_flags, > struct page **pages) > { > - unsigned long addr, len, end; > - unsigned long flags; > - int nr_pinned = 0, ret = 0; > + unsigned long len, end; > + unsigned long nr_pinned; > + int ret; > > if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | > FOLL_FORCE | FOLL_PIN | FOLL_GET | > @@ -2691,54 +2721,33 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages, > might_lock_read(¤t->mm->mmap_lock); > > start = untagged_addr(start) & PAGE_MASK; > - addr = start; > - len = (unsigned long) nr_pages << PAGE_SHIFT; > - end = start + len; > - > - if (end <= start) > + len = nr_pages << PAGE_SHIFT; > + if (check_add_overflow(start, len, &end)) > return 0; > if (unlikely(!access_ok((void __user *)start, len))) > return -EFAULT; > > - /* > - * Disable interrupts. The nested form is used, in order to allow > - * full, general purpose use of this routine. > - * > - * With interrupts disabled, we block page table pages from being > - * freed from under us. See struct mmu_table_batch comments in > - * include/asm-generic/tlb.h for more details. > - * > - * We do not adopt an rcu_read_lock(.) here as we also want to > - * block IPIs that come from THPs splitting. > - */ > - if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && gup_fast_permitted(start, end)) { > - unsigned long fast_flags = gup_flags; > - > - local_irq_save(flags); > - gup_pgd_range(addr, end, fast_flags, pages, &nr_pinned); > - local_irq_restore(flags); > - ret = nr_pinned; > - } > + nr_pinned = lockless_pages_from_mm(start, end, gup_flags, pages); > + if (nr_pinned == nr_pages || gup_flags & FOLL_FAST_ONLY) > + return nr_pinned; > > - if (nr_pinned < nr_pages && !(gup_flags & FOLL_FAST_ONLY)) { > - /* Try to get the remaining pages with get_user_pages */ > - start += nr_pinned << PAGE_SHIFT; > - pages += nr_pinned; > - > - ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned, > - gup_flags, pages); > - > - /* Have to be a bit careful with return values */ > - if (nr_pinned > 0) { > - if (ret < 0) > - ret = nr_pinned; > - else > - ret += nr_pinned; > - } > + /* Slow path: try to get the remaining pages with get_user_pages */ > + start += nr_pinned << PAGE_SHIFT; > + pages += nr_pinned; > + ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned, gup_flags, > + pages); > + if (ret < 0) { > + /* > + * The caller has to unpin the pages we already pinned so > + * returning -errno is not an option > + */ > + if (nr_pinned) > + return nr_pinned; > + return ret; > } > - > - return ret; > + return ret + nr_pinned; > } > + > /** > * get_user_pages_fast_only() - pin user pages in memory > * @start: starting user address > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 1/2] mm: reorganize internal_get_user_pages_fast() 2020-10-30 14:46 ` [PATCH v2 1/2] mm: reorganize internal_get_user_pages_fast() Jason Gunthorpe 2020-10-30 16:29 ` Jan Kara 2020-10-30 21:31 ` John Hubbard @ 2020-10-30 22:36 ` Peter Xu 2 siblings, 0 replies; 28+ messages in thread From: Peter Xu @ 2020-10-30 22:36 UTC (permalink / raw) To: Jason Gunthorpe Cc: linux-kernel, Linus Torvalds, Andrea Arcangeli, Andrew Morton, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai, Linux-MM, Michal Hocko, Oleg Nesterov On Fri, Oct 30, 2020 at 11:46:20AM -0300, Jason Gunthorpe wrote: > The next patch in this series makes the lockless flow a little more > complex, so move the entire block into a new function and remove a level > of indention. Tidy a bit of cruft: > > - addr is always the same as start, so use start > > - Use the modern check_add_overflow() for computing end = start + len > > - nr_pinned/pages << PAGE_SHIFT needs the LHS to be unsigned long to > avoid shift overflow, make the variables unsigned long to avoid coding > casts in both places. nr_pinned was missing its cast > > - The handling of ret and nr_pinned can be streamlined a bit > > No functional change. > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Sorry for a very late reply (due to other distractions): Reviewed-by: Peter Xu <peterx@redhat.com> Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 0/2] Add a seqcount between gup_fast and copy_page_range() [not found] <0-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com> 2020-10-30 14:46 ` [PATCH v2 1/2] mm: reorganize internal_get_user_pages_fast() Jason Gunthorpe @ 2020-11-02 22:19 ` Ahmed S. Darwish 2020-11-02 22:39 ` Linus Torvalds [not found] ` <2-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com> 2 siblings, 1 reply; 28+ messages in thread From: Ahmed S. Darwish @ 2020-11-02 22:19 UTC (permalink / raw) To: Jason Gunthorpe Cc: linux-kernel, Peter Xu, Linus Torvalds, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner, Sebastian Siewior Hello Jason, Thanks for keeping me in the loop on this. I've also added the locking maintainers in Cc. IMHO there are some seqlock.h API violations in this series, and they should have the final say on this. On Fri, Oct 30, 2020 at 11:46:19AM -0300, Jason Gunthorpe wrote: > > As discussed and suggested by Linus use a seqcount to close the small race > between gup_fast and copy_page_range(). > > Unfortunately the good suggestion to just use write_seqcount_begin() blows > up lockdep immediately due to the (new?) requirement that the write side > of seqcount be in a preempt disabled region. Disabling preemption for seqcount_t write-side critical sections was never a new requirement. It has always been this way, for the reasons explained at Documentation/locking/seqlock.rst, "Introduction" section. The recent seqcount_t changes did not mandate any new rules. This was done explicitly, and on-purpose, not to break any of the *large* set of existing seqcount_t call sites. It added multiple lockdep asserts though, to catch a number of (already) buggy users, and they were fixed beforehand. It seems you have a special case here, so I'll continue discussing this at patch #2 where the code resides. Just wanted to answer the "(new?)" part above. > For this application it does > not seem like a good idea, nor is it necessary as we don't spin on retry. > This is solved by being the first place to use raw_write_seqcount_t_begin() > Regardless of this series write side preemptibility requirements, the "_write_seqcount_*t*_()" interfaces are internal to seqlock.h and should _never_ be used outside of it. For plain seqcount_t, raw_write_seqcount_begin() is equivalent to raw_write_seqcount_*t*_begin() anyway, and should already satisfy your needs. /me jumps to patch #2 now... Thanks, -- Ahmed S. Darwish Linutronix GmbH ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 0/2] Add a seqcount between gup_fast and copy_page_range() 2020-11-02 22:19 ` [PATCH v2 0/2] Add a seqcount between gup_fast and copy_page_range() Ahmed S. Darwish @ 2020-11-02 22:39 ` Linus Torvalds 2020-11-02 23:18 ` Ahmed S. Darwish 0 siblings, 1 reply; 28+ messages in thread From: Linus Torvalds @ 2020-11-02 22:39 UTC (permalink / raw) To: Ahmed S. Darwish Cc: Jason Gunthorpe, Linux Kernel Mailing List, Peter Xu, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner, Sebastian Siewior On Mon, Nov 2, 2020 at 2:19 PM Ahmed S. Darwish <a.darwish@linutronix.de> wrote: > > Disabling preemption for seqcount_t write-side critical sections was > never a new requirement. It has always been this way, for the reasons > explained at Documentation/locking/seqlock.rst, "Introduction" section. Note that that is only true if you spin on the reading side (either of the two kinds of spinning: (a) spinning to wait for it to become even, or (b) repeating if they don't match) Which this code doesn't do, it just fails. I'm not sure how to perhaps document that. Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 0/2] Add a seqcount between gup_fast and copy_page_range() 2020-11-02 22:39 ` Linus Torvalds @ 2020-11-02 23:18 ` Ahmed S. Darwish 0 siblings, 0 replies; 28+ messages in thread From: Ahmed S. Darwish @ 2020-11-02 23:18 UTC (permalink / raw) To: Linus Torvalds Cc: Jason Gunthorpe, Linux Kernel Mailing List, Peter Xu, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner, Sebastian Siewior On Mon, Nov 02, 2020 at 02:39:49PM -0800, Linus Torvalds wrote: > On Mon, Nov 2, 2020 at 2:19 PM Ahmed S. Darwish <a.darwish@linutronix.de> wrote: > > > > Disabling preemption for seqcount_t write-side critical sections was > > never a new requirement. It has always been this way, for the reasons > > explained at Documentation/locking/seqlock.rst, "Introduction" section. > > Note that that is only true if you spin on the reading side (either of > the two kinds of spinning: (a) spinning to wait for it to become even, > or (b) repeating if they don't match) > > Which this code doesn't do, it just fails. > > I'm not sure how to perhaps document that. > Sure, and this is one of the reasons the lockdep non-preemptibility check is only added to the non-raw variants of the seqcount write APIs. Presumably, users of the raw_*() part of the API know what they're doing, and they don't need to read seqlock.rst :) (I'm in progress of replying to patch #2, which touches a bit on this and other points).. > Linus Thanks, -- Ahmed S. Darwish Linutronix GmbH ^ permalink raw reply [flat|nested] 28+ messages in thread
[parent not found: <2-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com>]
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork [not found] ` <2-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com> @ 2020-10-30 16:51 ` Jan Kara [not found] ` <20201030170226.GF2620339@nvidia.com> 2020-10-30 22:52 ` Peter Xu 2020-11-02 23:58 ` Ahmed S. Darwish 2 siblings, 1 reply; 28+ messages in thread From: Jan Kara @ 2020-10-30 16:51 UTC (permalink / raw) To: Jason Gunthorpe Cc: linux-kernel, Peter Xu, Linus Torvalds, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov On Fri 30-10-20 11:46:21, Jason Gunthorpe wrote: > Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during > fork() for ptes") pages under a FOLL_PIN will not be write protected > during COW for fork. This means that pages returned from > pin_user_pages(FOLL_WRITE) should not become write protected while the pin > is active. > > However, there is a small race where get_user_pages_fast(FOLL_PIN) can > establish a FOLL_PIN at the same time copy_present_page() is write > protecting it: > > CPU 0 CPU 1 > get_user_pages_fast() > internal_get_user_pages_fast() > copy_page_range() > pte_alloc_map_lock() > copy_present_page() > atomic_read(has_pinned) == 0 > page_maybe_dma_pinned() == false > atomic_set(has_pinned, 1); > gup_pgd_range() > gup_pte_range() > pte_t pte = gup_get_pte(ptep) > pte_access_permitted(pte) > try_grab_compound_head() > pte = pte_wrprotect(pte) > set_pte_at(); > pte_unmap_unlock() > // GUP now returns with a write protected page > > The first attempt to resolve this by using the write protect caused > problems (and was missing a barrrier), see commit f3c64eda3e50 ("mm: avoid > early COW write protect games during fork()") > > Instead wrap copy_p4d_range() with the write side of a seqcount and check > the read side around gup_pgd_range(). If there is a collision then > get_user_pages_fast() fails and falls back to slow GUP. > > Slow GUP is safe against this race because copy_page_range() is only > called while holding the exclusive side of the mmap_lock on the src > mm_struct. > > Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()") > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com > Reviewed-by: John Hubbard <jhubbard@nvidia.com> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Looks good to me. Just one nit below. With that fixed feel free to add: Reviewed-by: Jan Kara <jack@suse.cz> > @@ -446,6 +447,12 @@ struct mm_struct { > */ > atomic_t has_pinned; > > + /** > + * @write_protect_seq: Odd when any thread is write protecting > + * pages in this mm, for instance during fork(). > + */ > + seqcount_t write_protect_seq; > + So this comment isn't quite true. We can be writeprotecting pages due to many other reasons and not touch write_protect_seq. E.g. for shared mappings or due to explicit mprotect() calls. So the write_protect_seq protection has to be about something more than pure write protection. One requirement certainly is that the VMA has to be is_cow_mapping(). What about mprotect(2) calls? I guess the application would have only itself to blame so we don't care? Anyway my point is just that the comment should tell more what this is about. I'd even go as far as making it "page table copying during fork in progress". Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 28+ messages in thread
[parent not found: <20201030170226.GF2620339@nvidia.com>]
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork [not found] ` <20201030170226.GF2620339@nvidia.com> @ 2020-11-02 8:31 ` Jan Kara 0 siblings, 0 replies; 28+ messages in thread From: Jan Kara @ 2020-11-02 8:31 UTC (permalink / raw) To: Jason Gunthorpe Cc: Jan Kara, linux-kernel, Peter Xu, Linus Torvalds, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov On Fri 30-10-20 14:02:26, Jason Gunthorpe wrote: > On Fri, Oct 30, 2020 at 05:51:05PM +0100, Jan Kara wrote: > > > @@ -446,6 +447,12 @@ struct mm_struct { > > > */ > > > atomic_t has_pinned; > > > > > > + /** > > > + * @write_protect_seq: Odd when any thread is write protecting > > > + * pages in this mm, for instance during fork(). > > > + */ > > > + seqcount_t write_protect_seq; > > > + > > > > So this comment isn't quite true. We can be writeprotecting pages due to > > many other reasons and not touch write_protect_seq. E.g. for shared > > mappings or due to explicit mprotect() calls. So the write_protect_seq > > protection has to be about something more than pure write protection. One > > requirement certainly is that the VMA has to be is_cow_mapping(). What > > about mprotect(2) calls? I guess the application would have only itself to > > blame so we don't care? > > Yes, that sounds right, How about > > /** > * @write_protect_seq: Locked when any thread is write protecting > * pages for COW in this mm, for instance during page table copying ^^^ maybe I'd write a bit more explicitly "... write protecting pages mapped by this mm to enforce later COW, ..." > * for fork(). > */ > > mprotect and shared mappings cause faults on write access not COW? Correct. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork [not found] ` <2-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com> 2020-10-30 16:51 ` [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork Jan Kara @ 2020-10-30 22:52 ` Peter Xu [not found] ` <20201030235121.GQ2620339@nvidia.com> 2020-11-02 23:58 ` Ahmed S. Darwish 2 siblings, 1 reply; 28+ messages in thread From: Peter Xu @ 2020-10-30 22:52 UTC (permalink / raw) To: Jason Gunthorpe Cc: linux-kernel, Linus Torvalds, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov Hi, Jason, I think majorly the patch looks good to me, but I have a few pure questions majorly not directly related to the patch itself, but around the contexts. Since I _feel_ like there'll be a new version to update the comments below, maybe I can still ask aloud... Please bare with me. :) On Fri, Oct 30, 2020 at 11:46:21AM -0300, Jason Gunthorpe wrote: > Slow GUP is safe against this race because copy_page_range() is only > called while holding the exclusive side of the mmap_lock on the src > mm_struct. Pure question: I understand that this patch requires this, but... Could anyone remind me why read lock of mmap_sem is not enough for fork() before this one? > > Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()") > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com > Reviewed-by: John Hubbard <jhubbard@nvidia.com> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> > --- > arch/x86/kernel/tboot.c | 1 + > drivers/firmware/efi/efi.c | 1 + > include/linux/mm_types.h | 7 +++++++ > kernel/fork.c | 1 + > mm/gup.c | 19 +++++++++++++++++++ > mm/init-mm.c | 1 + > mm/memory.c | 10 +++++++++- > 7 files changed, 39 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c > index 992fb1415c0f1f..6a2f542d9588a4 100644 > --- a/arch/x86/kernel/tboot.c > +++ b/arch/x86/kernel/tboot.c > @@ -93,6 +93,7 @@ static struct mm_struct tboot_mm = { > .pgd = swapper_pg_dir, > .mm_users = ATOMIC_INIT(2), > .mm_count = ATOMIC_INIT(1), > + .write_protect_seq = SEQCNT_ZERO(tboot_mm.write_protect_seq), > MMAP_LOCK_INITIALIZER(init_mm) > .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock), > .mmlist = LIST_HEAD_INIT(init_mm.mmlist), > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c > index 5e5480a0a32d7d..2520f6e05f4d44 100644 > --- a/drivers/firmware/efi/efi.c > +++ b/drivers/firmware/efi/efi.c > @@ -57,6 +57,7 @@ struct mm_struct efi_mm = { > .mm_rb = RB_ROOT, > .mm_users = ATOMIC_INIT(2), > .mm_count = ATOMIC_INIT(1), > + .write_protect_seq = SEQCNT_ZERO(efi_mm.write_protect_seq), > MMAP_LOCK_INITIALIZER(efi_mm) > .page_table_lock = __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock), > .mmlist = LIST_HEAD_INIT(efi_mm.mmlist), Another pure question: I'm just curious how you find all the statically definied mm_structs, and to make sure all of them are covered (just in case un-initialized seqcount could fail strangely). Actually I'm thinking whether we should have one place to keep all the init vars for all the statically definied mm_structs, so we don't need to find them everytime, but only change that one place. > diff --git a/mm/memory.c b/mm/memory.c > index c48f8df6e50268..294c2c3c4fe00d 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) > mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, > 0, src_vma, src_mm, addr, end); > mmu_notifier_invalidate_range_start(&range); > + /* > + * The read side doesn't spin, it goes to the mmap_lock, so the > + * raw version is used to avoid disabling preemption here > + */ > + mmap_assert_write_locked(src_mm); > + raw_write_seqcount_t_begin(&src_mm->write_protect_seq); Would raw_write_seqcount_begin() be better here? My understanding is that we used raw_write_seqcount_t_begin() because we're with spin lock so assuming we disabled preemption already. However I'm thinking whether raw_write_seqcount_begin() would be even better to guarantee that. I have no idea of how the rt kernel merging topic, but if rt kernel merged into mainline then IIUC preemption is allowed here (since pgtable spin lock should be rt_spin_lock, not raw spin locks). An even further pure question on __seqcount_preemptible() (feel free to ignore this question!): I saw that __seqcount_preemptible() seems to have been constantly defined as "return false". Not sure what happened there.. Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 28+ messages in thread
[parent not found: <20201030235121.GQ2620339@nvidia.com>]
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork [not found] ` <20201030235121.GQ2620339@nvidia.com> @ 2020-10-31 15:26 ` Peter Xu 2020-11-03 0:33 ` Ahmed S. Darwish 2020-11-03 0:17 ` Ahmed S. Darwish 1 sibling, 1 reply; 28+ messages in thread From: Peter Xu @ 2020-10-31 15:26 UTC (permalink / raw) To: Jason Gunthorpe Cc: Ahmed S. Darwish, linux-kernel, Linus Torvalds, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov On Fri, Oct 30, 2020 at 08:51:21PM -0300, Jason Gunthorpe wrote: > > Another pure question: I'm just curious how you find all the statically > > definied mm_structs, and to make sure all of them are covered (just in case > > un-initialized seqcount could fail strangely). > > I searched for all MMAP_LOCK_INITIALIZER() places and assumed that > Michel got them all when he added it :) Hmm, I should have noticed that before I ask.. :) > > > Actually I'm thinking whether we should have one place to keep all the init > > vars for all the statically definied mm_structs, so we don't need to find them > > everytime, but only change that one place. > > I was thinking that as well, most of the places are all the same Yes, we can work on top. > > > > diff --git a/mm/memory.c b/mm/memory.c > > > index c48f8df6e50268..294c2c3c4fe00d 100644 > > > +++ b/mm/memory.c > > > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) > > > mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, > > > 0, src_vma, src_mm, addr, end); > > > mmu_notifier_invalidate_range_start(&range); > > > + /* > > > + * The read side doesn't spin, it goes to the mmap_lock, so the > > > + * raw version is used to avoid disabling preemption here > > > + */ > > > + mmap_assert_write_locked(src_mm); > > > + raw_write_seqcount_t_begin(&src_mm->write_protect_seq); > > > > Would raw_write_seqcount_begin() be better here? > > Hum.. > > I felt no because it had the preempt stuff added into it, however it > would work - __seqcount_lock_preemptible() == false for the seqcount_t > case (see below) > > Looking more closely, maybe the right API to pick is > write_seqcount_t_begin() and write_seqcount_t_end() ?? > > However, no idea what the intention of the '*_seqcount_t_*' family is > - it only seems to be used to implement the seqlock.. > > Lets add Amhed, perhaps he can give some guidance (see next section)? IMHO we shouldn't directly use these helpers since they seem to only be used by lock-associated versions of seqcount types. But yeah, Amhed would be the best one to answer... > > > My understanding is that we used raw_write_seqcount_t_begin() because we're > > with spin lock so assuming we disabled preemption already. > > Here we rely on the exclusive mmap_lock, not a spinlock. This ensures > only one write side is running concurrently as required by seqcount. So imho here we have these things to consider during one thread updating the seqcount_t: 0. Concurrent read is perfectly welcomed, for sure. 1. Concurrent writes on seqcount_t: mm sem protects it. 2. Preempted write (if possible, maybe on RT?): I think it's also protected by mm sem, so looks ok too to me. 3. Preempted/interrupted read on seqcount_t. Seems to be the one discussed below. Looks safe to me now with below explanation. However... > > The concern about preemption disable is that it wasn't held for fork() > before, and we don't need it.. I understand preemption disable regions > must be short or the RT people will not be happy, holding one across > all of copy_page_range() sounds bad. > > Ahmed explained in commit 8117ab508f the reason the seqcount_t write > side has preemption disabled is because it can livelock RT kernels if > the read side is spinning after preempting the write side. eg look at > how __read_seqcount_begin() is implemented: > > while ((seq = __seqcount_sequence(s)) & 1) \ > cpu_relax(); \ > > However, in this patch, we don't spin on the read side. ... Shall we document this explicitly (if this patch still needs a repost)? Seems not straightforward since that seems not the usual way to use seqcount, not sure whether I'm the only one that feels this way, though. > > If the read side collides with a writer it immediately goes to the > mmap_lock, which is sleeping, and so it will sort itself out properly, > even if it was preempted. > > > An even further pure question on __seqcount_preemptible() (feel free to ignore > > this question!): I saw that __seqcount_preemptible() seems to have been > > constantly defined as "return false". Not sure what happened there.. > > The new code has a range of seqcount_t types see > Documentation/locking/seqlock.rst 'Sequence counters with associated > locks' > > It uses _Generic to do a bit of meta-programming and creates a compile > time table of lock properties: > > SEQCOUNT_LOCKNAME(raw_spinlock, raw_spinlock_t, false, s->lock, raw_spin, raw_spin_lock(s->lock)) > SEQCOUNT_LOCKNAME(spinlock, spinlock_t, __SEQ_RT, s->lock, spin, spin_lock(s->lock)) > SEQCOUNT_LOCKNAME(rwlock, rwlock_t, __SEQ_RT, s->lock, read, read_lock(s->lock)) > SEQCOUNT_LOCKNAME(mutex, struct mutex, true, s->lock, mutex, mutex_lock(s->lock)) > SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mutex, ww_mutex_lock(s->lock, NULL)) > > As well as as default set of properties for normal seqcount_t. The > __seqcount_preemptible() is selected by the _Generic for seqcount_t: > > #define __seqprop(s, prop) _Generic(*(s), \ > seqcount_t: __seqprop_##prop((void *)(s)), \ > > And it says preemption must be disabled before using the lock: > > static inline void __seqprop_assert(const seqcount_t *s) > { > lockdep_assert_preemption_disabled(); > } > > And thus no need to have an automatic disable preemption: > > static inline bool __seqprop_preemptible(const seqcount_t *s) > { > return false; > } > > Other lock subtypes are different, eg the codegen for mutex will use > lockdep_assert_held(s->lock) for _assert and true for _preemptible() > > So if we map the 'write begin' entry points: > > write_seqcount_begin - Enforces preemption off > raw_write_seqcount_begin - Auto disable preemption if required (false) > raw_write_seqcount_t_begin - No preemption stuff > write_seqcount_t_begin - No preemption stuff Thanks for listing these details. As a summary, I think I'm convinced maybe we can have this work without disable preemtion. It's just that some more comment might be even better. The other thing is, considering this use of seqcount seems to be quite special as explained below, I'm just not sure whether this would confuse lockdep or kcsan, etc., if we decide to use write_seqcount_t_begin(). Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork 2020-10-31 15:26 ` Peter Xu @ 2020-11-03 0:33 ` Ahmed S. Darwish 0 siblings, 0 replies; 28+ messages in thread From: Ahmed S. Darwish @ 2020-11-03 0:33 UTC (permalink / raw) To: Peter Xu Cc: Jason Gunthorpe, linux-kernel, Linus Torvalds, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner, Sebastian Siewior On Sat, Oct 31, 2020 at 11:26:05AM -0400, Peter Xu wrote: ... > Shall we document this explicitly (if this patch still needs a repost)? Yes, this patch series needs a v3 :) > Seems not straightforward since that seems not the usual way to use seqcount, > not sure whether I'm the only one that feels this way, though. Yes, this usage is correct but not common. I've proposed a more explicit comment above the write section code, in my reply to patch #2. ... > The other thing is, considering this use of seqcount seems to be quite special > as explained below, I'm just not sure whether this would confuse lockdep or > kcsan, etc., if we decide to use write_seqcount_t_begin(). > Lockdep won't be confused as it's not used in the raw_*() variant of the seqcount APIs. AFAIK KCSAN also has some margin to protect itself from this: see seqlock.h KCSAN_SEQLOCK_REGION_MAX. Thanks, > Peter Xu Ahmed Darwish Linutronix GmbH ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork [not found] ` <20201030235121.GQ2620339@nvidia.com> 2020-10-31 15:26 ` Peter Xu @ 2020-11-03 0:17 ` Ahmed S. Darwish [not found] ` <20201103002532.GL2620339@nvidia.com> 2020-11-03 17:03 ` [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork Peter Xu 1 sibling, 2 replies; 28+ messages in thread From: Ahmed S. Darwish @ 2020-11-03 0:17 UTC (permalink / raw) To: Jason Gunthorpe Cc: Peter Xu, linux-kernel, Linus Torvalds, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner, Sebastian Siewior On Fri, Oct 30, 2020 at 08:51:21PM -0300, Jason Gunthorpe wrote: > On Fri, Oct 30, 2020 at 06:52:50PM -0400, Peter Xu wrote: ... > > > > diff --git a/mm/memory.c b/mm/memory.c > > > index c48f8df6e50268..294c2c3c4fe00d 100644 > > > +++ b/mm/memory.c > > > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) > > > mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, > > > 0, src_vma, src_mm, addr, end); > > > mmu_notifier_invalidate_range_start(&range); > > > + /* > > > + * The read side doesn't spin, it goes to the mmap_lock, so the > > > + * raw version is used to avoid disabling preemption here > > > + */ > > > + mmap_assert_write_locked(src_mm); > > > + raw_write_seqcount_t_begin(&src_mm->write_protect_seq); > > > > Would raw_write_seqcount_begin() be better here? > > Hum.. > > I felt no because it had the preempt stuff added into it, however it > would work - __seqcount_lock_preemptible() == false for the seqcount_t > case (see below) > > Looking more closely, maybe the right API to pick is > write_seqcount_t_begin() and write_seqcount_t_end() ?? > No, that's not the right API: it is also internal to seqlock.h. Please stick with the official exported API: raw_write_seqcount_begin(). It should satisfy your needs, and the raw_*() variant is created exactly for contexts wishing to avoid the lockdep checks (e.g. NMI handlers cannot invoke lockdep, etc.) > However, no idea what the intention of the '*_seqcount_t_*' family is > - it only seems to be used to implement the seqlock.. > Exactly. '*_seqcount_t_*' is a seqlock.h implementation detail, and it has _zero_ relevance to what is discussed in this thread actually. ... > Ahmed explained in commit 8117ab508f the reason the seqcount_t write > side has preemption disabled is because it can livelock RT kernels if > the read side is spinning after preempting the write side. eg look at > how __read_seqcount_begin() is implemented: > > while ((seq = __seqcount_sequence(s)) & 1) \ > cpu_relax(); \ > > However, in this patch, we don't spin on the read side. > > If the read side collides with a writer it immediately goes to the > mmap_lock, which is sleeping, and so it will sort itself out properly, > even if it was preempted. > Correct. Thanks, -- Ahmed Darwish Linutronix GmbH ^ permalink raw reply [flat|nested] 28+ messages in thread
[parent not found: <20201103002532.GL2620339@nvidia.com>]
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork [not found] ` <20201103002532.GL2620339@nvidia.com> @ 2020-11-03 0:41 ` Ahmed S. Darwish 2020-11-03 2:20 ` John Hubbard 0 siblings, 1 reply; 28+ messages in thread From: Ahmed S. Darwish @ 2020-11-03 0:41 UTC (permalink / raw) To: Jason Gunthorpe Cc: Peter Xu, linux-kernel, Linus Torvalds, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner, Sebastian Siewior On Mon, Nov 02, 2020 at 08:25:32PM -0400, Jason Gunthorpe wrote: > On Tue, Nov 03, 2020 at 01:17:12AM +0100, Ahmed S. Darwish wrote: > > > Please stick with the official exported API: raw_write_seqcount_begin(). > > How did you know this was 'offical exported API' ?? > All the official exported seqlock.h APIs are marked with verbose kernel-doc annotations on top. The rest are internal... > > Exactly. '*_seqcount_t_*' is a seqlock.h implementation detail, and it > > has _zero_ relevance to what is discussed in this thread actually. > > Add some leading __'s to them? > It's a bit more complicated than just adding some "__" prefixes, due to the massive compile-time polymorphism done through _Generic(). The '*_seqcount_t_*' format was the best we could come up with to distinguish (again, for internal seqlock.h code) between macros taking all seqcount_LOCKNAME_t types, and macros/functions taking only plain seqcount_t. Thanks, -- Ahmed S. Darwish Linutronix GmbH ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork 2020-11-03 0:41 ` Ahmed S. Darwish @ 2020-11-03 2:20 ` John Hubbard 2020-11-03 6:52 ` Ahmed S. Darwish 0 siblings, 1 reply; 28+ messages in thread From: John Hubbard @ 2020-11-03 2:20 UTC (permalink / raw) To: Ahmed S. Darwish, Jason Gunthorpe Cc: Peter Xu, linux-kernel, Linus Torvalds, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner, Sebastian Siewior On 11/2/20 4:41 PM, Ahmed S. Darwish wrote: > On Mon, Nov 02, 2020 at 08:25:32PM -0400, Jason Gunthorpe wrote: >> On Tue, Nov 03, 2020 at 01:17:12AM +0100, Ahmed S. Darwish wrote: >> >>> Please stick with the official exported API: raw_write_seqcount_begin(). >> >> How did you know this was 'offical exported API' ?? >> > > All the official exported seqlock.h APIs are marked with verbose > kernel-doc annotations on top. The rest are internal... > OK, but no one here was able to deduce that, probably because there is not enough consistency throughout the kernel to be able to assume such things--even though your seqlock project is internally consistent. It's just not *quite* enough communication. I think if we added the following it would be very nice: a) Short comments to the "unofficial and internal" routines, identifying them as such, and b) Short comments to the "official API for general use", also identifying those as such. c) A comment about what makes "raw" actually raw, for seqlock. Since I'm proposing new work, I'll also offer to help, perhaps by putting together a small patch to get it kicked off, if you approve of the idea. thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork 2020-11-03 2:20 ` John Hubbard @ 2020-11-03 6:52 ` Ahmed S. Darwish 2020-11-03 17:40 ` Linus Torvalds 0 siblings, 1 reply; 28+ messages in thread From: Ahmed S. Darwish @ 2020-11-03 6:52 UTC (permalink / raw) To: John Hubbard Cc: Jason Gunthorpe, Peter Xu, linux-kernel, Linus Torvalds, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner, Sebastian Siewior On Mon, Nov 02, 2020 at 06:20:45PM -0800, John Hubbard wrote: > On 11/2/20 4:41 PM, Ahmed S. Darwish wrote: > > On Mon, Nov 02, 2020 at 08:25:32PM -0400, Jason Gunthorpe wrote: > > > On Tue, Nov 03, 2020 at 01:17:12AM +0100, Ahmed S. Darwish wrote: > > > > > > > Please stick with the official exported API: raw_write_seqcount_begin(). > > > > > > How did you know this was 'offical exported API' ?? > > > > > > > All the official exported seqlock.h APIs are marked with verbose > > kernel-doc annotations on top. The rest are internal... > > > > OK, but no one here was able to deduce that, probably because there is not > enough consistency throughout the kernel to be able to assume such things--even > though your seqlock project is internally consistent. It's just not *quite* > enough communication. > > I think if we added the following it would be very nice: > The problem is, I've already documented seqlock.h to death.... There are more comments than code in there, and there is "seqlock.rst" under Documentation/ to further describe the big picture. There comes a point where you decide what level of documentation to add, and what level to skip. Because in the end, you don't want to confuse "Joe, the general driver developer" with too much details that's not relevant to their task at hand. (I work in the Embedded domain, and I've seen so much ugly code from embedded drivers/SoC developers already, sorry) See for example my reply to Linus, where any talk about the lockdep-free and barrier-free parts of the API was explicitly not mentioned in seqlock.rst. This was done on purpose: 1) you want to keep the generic case simple, but the special case do-able, 2) you want to encourage people to use the standard entry/exit points as much as possible. > a) Short comments to the "unofficial and internal" routines, identifying them as > such, and > > b) Short comments to the "official API for general use", also identifying > those as such. > I really think the already added kernel-doc is sufficient... See for example __read_seqcount_begin() and __read_seqcount_retry(). They begin with "__", but they are semi-external seqlock.h API that are used by VFS to avoid barriers. And these APIs are then polymorphised according to the write serialization lock type, and so on. So the most consistent way for seqlock.h was to use kernel-doc as *the* marker for exported functions. This is not unique to seqlock.h by the way. The same pattern is heavily used by the DRM folks. Yes, of course, we can add even more comments to seqlock.h, but then, I honestly think it would be too much that maybe people will just skip reading the whole thing altogether... > c) A comment about what makes "raw" actually raw, for seqlock. > That's already documented. What more can really be written than what's in seqlock.h below?? /** * raw_read_seqcount_begin() - begin a seqcount_t read section w/o lockdep /** * raw_seqcount_begin() - begin a seqcount_t read critical section w/o * lockdep and w/o counter stabilization /** * raw_write_seqcount_begin() - start a seqcount_t write section w/o lockdep /** * raw_write_seqcount_end() - end a seqcount_t write section w/o lockdep > > Since I'm proposing new work, I'll also offer to help, perhaps by putting together > a small patch to get it kicked off, if you approve of the idea. > Patches are always welcome of course, and please put me in Cc. I don't approve or deny anything though, that's the locking maintainers job :) Kind regards, > John Hubbard > NVIDIA -- Ahmed S. Darwish Linutronix GmbH ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork 2020-11-03 6:52 ` Ahmed S. Darwish @ 2020-11-03 17:40 ` Linus Torvalds 2020-11-04 1:32 ` Ahmed S. Darwish 0 siblings, 1 reply; 28+ messages in thread From: Linus Torvalds @ 2020-11-03 17:40 UTC (permalink / raw) To: Ahmed S. Darwish Cc: John Hubbard, Jason Gunthorpe, Peter Xu, Linux Kernel Mailing List, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner, Sebastian Siewior On Mon, Nov 2, 2020 at 10:52 PM Ahmed S. Darwish <a.darwish@linutronix.de> wrote: > > The problem is, I've already documented seqlock.h to death.... There are > more comments than code in there, and there is "seqlock.rst" under > Documentation/ to further describe the big picture. Well, honestly, I think the correct thing to do is to get rid of the *_seqcount_t_*() functions entirely. They add nothing but confusion, and they are entirely misnamed. That's not the pattern we use for "internal use only" functions, and they are *very* confusing. They have other issues too: like raw_write_seqcount_end() not being usable on its own when preemptibility isn't an issue like here. You basically _have_ to use raw_write_seqcount_t_end(), because otherwise it tries to re-enable preemption that was never there. Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork 2020-11-03 17:40 ` Linus Torvalds @ 2020-11-04 1:32 ` Ahmed S. Darwish 2020-11-04 2:01 ` John Hubbard 0 siblings, 1 reply; 28+ messages in thread From: Ahmed S. Darwish @ 2020-11-04 1:32 UTC (permalink / raw) To: Linus Torvalds Cc: John Hubbard, Jason Gunthorpe, Peter Xu, Linux Kernel Mailing List, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner, Sebastian Siewior On Tue, Nov 03, 2020 at 09:40:22AM -0800, Linus Torvalds wrote: > On Mon, Nov 2, 2020 at 10:52 PM Ahmed S. Darwish > <a.darwish@linutronix.de> wrote: > > > > The problem is, I've already documented seqlock.h to death.... There are > > more comments than code in there, and there is "seqlock.rst" under > > Documentation/ to further describe the big picture. > > Well, honestly, I think the correct thing to do is to get rid of the > *_seqcount_t_*() functions entirely. > > They add nothing but confusion, and they are entirely misnamed. That's > not the pattern we use for "internal use only" functions, and they are > *very* confusing. > I see. Would the enclosed patch #1 be OK? It basically uses the "__do_" prefix instead, with some rationale. > > They have other issues too: like raw_write_seqcount_end() not being > usable on its own when preemptibility isn't an issue like here. You > basically _have_ to use raw_write_seqcount_t_end(), because otherwise > it tries to re-enable preemption that was never there. > Hmmm, raw_write_seqcount_{begin,end}() *never* disable/enable preemption for plain seqcount_t. This is why I kept recommending those for this patch series instead of internal raw_write_seqcount_*t*_{begin,end}(). But..... given that multiple people made the exact same remark by now, I guess that's due to: #define raw_write_seqcount_begin(s) \ do { \ if (__seqcount_lock_preemptible(s)) \ preempt_disable(); \ \ ... \ } while (0); #define raw_write_seqcount_end(s) \ do { \ ... \ \ if (__seqcount_lock_preemptible(s)) \ preempt_enable(); \ } while (0); The tricky part is that __seqcount_lock_preemptible() is always false for plain "seqcount_t". With that data type, the _Generic() selection makes it resolve to __seqprop_preemptible(), which just returns false. Originally, __seqcount_lock_preemptible() was called: __seqcount_associated_lock_exists_and_is_preemptible() but it got transformed to its current short form in the process of some pre-mainline refactorings. Looking at it now after all the dust has settled, maybe the verbose form was much more clear. Please see the enclosed patch #2... Would that be OK too? (I will submit these two patches in their own thread after some common ground is reached.) Patches ------- ====> ====> patch #1: ====> Subject: [RFC][PATCH 1/2] seqlock: Use __do_ prefix instead of non-standed _seqcount_t_ marker The use of "*_seqcount_t_*" as a marker to denote internal seqlock.h functions taking only plain seqcount_t instead of the whole seqcount_LOCKNAME_t family is confusing users, as it's also not the standard kernel pattern for denoting header file internal functions. Use the __do_ prefix instead. Note, a plain "__" prefix is not used since seqlock.h already uses it for some of its exported functions; e.g. __read_seqcount_begin() and __read_seqcount_retry(). Reported-by: Jason Gunthorpe <jgg@nvidia.com> Reported-by: John Hubbard <jhubbard@nvidia.com> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lkml.kernel.org/r/CAHk-=wgB8nyOQufpn0o6a5BpJCJPnXvH+kRxApujhsgG+7qAwQ@mail.gmail.com Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> --- include/linux/seqlock.h | 62 ++++++++++++++++++++--------------------- 1 file changed, 31 insertions(+), 31 deletions(-) diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index cbfc78b92b65..5de043841d33 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -425,9 +425,9 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu * Return: true if a read section retry is required, else false */ #define __read_seqcount_retry(s, start) \ - __read_seqcount_t_retry(__seqcount_ptr(s), start) + __do___read_seqcount_retry(__seqcount_ptr(s), start) -static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start) +static inline int __do___read_seqcount_retry(const seqcount_t *s, unsigned start) { kcsan_atomic_next(0); return unlikely(READ_ONCE(s->sequence) != start); @@ -445,12 +445,12 @@ static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start) * Return: true if a read section retry is required, else false */ #define read_seqcount_retry(s, start) \ - read_seqcount_t_retry(__seqcount_ptr(s), start) + __do_read_seqcount_retry(__seqcount_ptr(s), start) -static inline int read_seqcount_t_retry(const seqcount_t *s, unsigned start) +static inline int __do_read_seqcount_retry(const seqcount_t *s, unsigned start) { smp_rmb(); - return __read_seqcount_t_retry(s, start); + return __do___read_seqcount_retry(s, start); } /** @@ -462,10 +462,10 @@ do { \ if (__seqcount_lock_preemptible(s)) \ preempt_disable(); \ \ - raw_write_seqcount_t_begin(__seqcount_ptr(s)); \ + __do_raw_write_seqcount_begin(__seqcount_ptr(s)); \ } while (0) -static inline void raw_write_seqcount_t_begin(seqcount_t *s) +static inline void __do_raw_write_seqcount_begin(seqcount_t *s) { kcsan_nestable_atomic_begin(); s->sequence++; @@ -478,13 +478,13 @@ static inline void raw_write_seqcount_t_begin(seqcount_t *s) */ #define raw_write_seqcount_end(s) \ do { \ - raw_write_seqcount_t_end(__seqcount_ptr(s)); \ + __do_raw_write_seqcount_end(__seqcount_ptr(s)); \ \ if (__seqcount_lock_preemptible(s)) \ preempt_enable(); \ } while (0) -static inline void raw_write_seqcount_t_end(seqcount_t *s) +static inline void __do_raw_write_seqcount_end(seqcount_t *s) { smp_wmb(); s->sequence++; @@ -506,12 +506,12 @@ do { \ if (__seqcount_lock_preemptible(s)) \ preempt_disable(); \ \ - write_seqcount_t_begin_nested(__seqcount_ptr(s), subclass); \ + __do_write_seqcount_begin_nested(__seqcount_ptr(s), subclass); \ } while (0) -static inline void write_seqcount_t_begin_nested(seqcount_t *s, int subclass) +static inline void __do_write_seqcount_begin_nested(seqcount_t *s, int subclass) { - raw_write_seqcount_t_begin(s); + __do_raw_write_seqcount_begin(s); seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_); } @@ -533,12 +533,12 @@ do { \ if (__seqcount_lock_preemptible(s)) \ preempt_disable(); \ \ - write_seqcount_t_begin(__seqcount_ptr(s)); \ + __do_write_seqcount_begin(__seqcount_ptr(s)); \ } while (0) -static inline void write_seqcount_t_begin(seqcount_t *s) +static inline void __do_write_seqcount_begin(seqcount_t *s) { - write_seqcount_t_begin_nested(s, 0); + __do_write_seqcount_begin_nested(s, 0); } /** @@ -549,16 +549,16 @@ static inline void write_seqcount_t_begin(seqcount_t *s) */ #define write_seqcount_end(s) \ do { \ - write_seqcount_t_end(__seqcount_ptr(s)); \ + __do_write_seqcount_end(__seqcount_ptr(s)); \ \ if (__seqcount_lock_preemptible(s)) \ preempt_enable(); \ } while (0) -static inline void write_seqcount_t_end(seqcount_t *s) +static inline void __do_write_seqcount_end(seqcount_t *s) { seqcount_release(&s->dep_map, _RET_IP_); - raw_write_seqcount_t_end(s); + __do_raw_write_seqcount_end(s); } /** @@ -603,9 +603,9 @@ static inline void write_seqcount_t_end(seqcount_t *s) * } */ #define raw_write_seqcount_barrier(s) \ - raw_write_seqcount_t_barrier(__seqcount_ptr(s)) + __do_raw_write_seqcount_barrier(__seqcount_ptr(s)) -static inline void raw_write_seqcount_t_barrier(seqcount_t *s) +static inline void __do_raw_write_seqcount_barrier(seqcount_t *s) { kcsan_nestable_atomic_begin(); s->sequence++; @@ -623,9 +623,9 @@ static inline void raw_write_seqcount_t_barrier(seqcount_t *s) * will complete successfully and see data older than this. */ #define write_seqcount_invalidate(s) \ - write_seqcount_t_invalidate(__seqcount_ptr(s)) + __do_write_seqcount_invalidate(__seqcount_ptr(s)) -static inline void write_seqcount_t_invalidate(seqcount_t *s) +static inline void __do_write_seqcount_invalidate(seqcount_t *s) { smp_wmb(); kcsan_nestable_atomic_begin(); @@ -865,7 +865,7 @@ static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start) } /* - * For all seqlock_t write side functions, use write_seqcount_*t*_begin() + * For all seqlock_t write side functions, use __do_write_seqcount_begin() * instead of the generic write_seqcount_begin(). This way, no redundant * lockdep_assert_held() checks are added. */ @@ -886,7 +886,7 @@ static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start) static inline void write_seqlock(seqlock_t *sl) { spin_lock(&sl->lock); - write_seqcount_t_begin(&sl->seqcount.seqcount); + __do_write_seqcount_begin(&sl->seqcount.seqcount); } /** @@ -898,7 +898,7 @@ static inline void write_seqlock(seqlock_t *sl) */ static inline void write_sequnlock(seqlock_t *sl) { - write_seqcount_t_end(&sl->seqcount.seqcount); + __do_write_seqcount_end(&sl->seqcount.seqcount); spin_unlock(&sl->lock); } @@ -912,7 +912,7 @@ static inline void write_sequnlock(seqlock_t *sl) static inline void write_seqlock_bh(seqlock_t *sl) { spin_lock_bh(&sl->lock); - write_seqcount_t_begin(&sl->seqcount.seqcount); + __do_write_seqcount_begin(&sl->seqcount.seqcount); } /** @@ -925,7 +925,7 @@ static inline void write_seqlock_bh(seqlock_t *sl) */ static inline void write_sequnlock_bh(seqlock_t *sl) { - write_seqcount_t_end(&sl->seqcount.seqcount); + __do_write_seqcount_end(&sl->seqcount.seqcount); spin_unlock_bh(&sl->lock); } @@ -939,7 +939,7 @@ static inline void write_sequnlock_bh(seqlock_t *sl) static inline void write_seqlock_irq(seqlock_t *sl) { spin_lock_irq(&sl->lock); - write_seqcount_t_begin(&sl->seqcount.seqcount); + __do_write_seqcount_begin(&sl->seqcount.seqcount); } /** @@ -951,7 +951,7 @@ static inline void write_seqlock_irq(seqlock_t *sl) */ static inline void write_sequnlock_irq(seqlock_t *sl) { - write_seqcount_t_end(&sl->seqcount.seqcount); + __do_write_seqcount_end(&sl->seqcount.seqcount); spin_unlock_irq(&sl->lock); } @@ -960,7 +960,7 @@ static inline unsigned long __write_seqlock_irqsave(seqlock_t *sl) unsigned long flags; spin_lock_irqsave(&sl->lock, flags); - write_seqcount_t_begin(&sl->seqcount.seqcount); + __do_write_seqcount_begin(&sl->seqcount.seqcount); return flags; } @@ -989,7 +989,7 @@ static inline unsigned long __write_seqlock_irqsave(seqlock_t *sl) static inline void write_sequnlock_irqrestore(seqlock_t *sl, unsigned long flags) { - write_seqcount_t_end(&sl->seqcount.seqcount); + __do_write_seqcount_end(&sl->seqcount.seqcount); spin_unlock_irqrestore(&sl->lock, flags); } ====> ====> patch #2: ====> Subject: [PATCH 2/2] seqlock: seqcount_LOCKAME_t: Use more verbose macro names As evidenced by multiple discussions over LKML so far, it's not clear that __seqcount_lock_preemptible() is always false for plain seqcount_t. For that data type, the _Generic() selection resolves to __seqprop_preemptible(), which just returns false. Use __seqcount_associated_lock_exists_and_is_preemptible() instead, which hints that "preemptibility" is for the associated write serialization lock (if any), not for the seqcount itself. Similarly, rename __seqcount_assert_lock_held() to __seqcount_assert_associated_lock_held(). Link: https://lkml.kernel.org/r/CAHk-=wgB8nyOQufpn0o6a5BpJCJPnXvH+kRxApujhsgG+7qAwQ@mail.gmail.com Link: https://lkml.kernel.org/r/20201030235121.GQ2620339@nvidia.com Link: https://lkml.kernel.org/r/20201103170327.GJ20600@xz-x1 Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> --- include/linux/seqlock.h | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index 5de043841d33..eb1e5a822e44 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -307,10 +307,10 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu __seqprop_case((s), mutex, prop), \ __seqprop_case((s), ww_mutex, prop)) -#define __seqcount_ptr(s) __seqprop(s, ptr) -#define __seqcount_sequence(s) __seqprop(s, sequence) -#define __seqcount_lock_preemptible(s) __seqprop(s, preemptible) -#define __seqcount_assert_lock_held(s) __seqprop(s, assert) +#define __seqcount_ptr(s) __seqprop(s, ptr) +#define __seqcount_sequence(s) __seqprop(s, sequence) +#define __seqcount_associated_lock_exists_and_is_preemptible(s) __seqprop(s, preemptible) +#define __seqcount_assert_associated_lock_held(s) __seqprop(s, assert) /** * __read_seqcount_begin() - begin a seqcount_t read section w/o barrier @@ -459,7 +459,7 @@ static inline int __do_read_seqcount_retry(const seqcount_t *s, unsigned start) */ #define raw_write_seqcount_begin(s) \ do { \ - if (__seqcount_lock_preemptible(s)) \ + if (__seqcount_associated_lock_exists_and_is_preemptible(s)) \ preempt_disable(); \ \ __do_raw_write_seqcount_begin(__seqcount_ptr(s)); \ @@ -480,7 +480,7 @@ static inline void __do_raw_write_seqcount_begin(seqcount_t *s) do { \ __do_raw_write_seqcount_end(__seqcount_ptr(s)); \ \ - if (__seqcount_lock_preemptible(s)) \ + if (__seqcount_associated_lock_exists_and_is_preemptible(s)) \ preempt_enable(); \ } while (0) @@ -501,9 +501,9 @@ static inline void __do_raw_write_seqcount_end(seqcount_t *s) */ #define write_seqcount_begin_nested(s, subclass) \ do { \ - __seqcount_assert_lock_held(s); \ + __seqcount_assert_associated_lock_held(s); \ \ - if (__seqcount_lock_preemptible(s)) \ + if (__seqcount_associated_lock_exists_and_is_preemptible(s)) \ preempt_disable(); \ \ __do_write_seqcount_begin_nested(__seqcount_ptr(s), subclass); \ @@ -528,9 +528,9 @@ static inline void __do_write_seqcount_begin_nested(seqcount_t *s, int subclass) */ #define write_seqcount_begin(s) \ do { \ - __seqcount_assert_lock_held(s); \ + __seqcount_assert_associated_lock_held(s); \ \ - if (__seqcount_lock_preemptible(s)) \ + if (__seqcount_associated_lock_exists_and_is_preemptible(s)) \ preempt_disable(); \ \ __do_write_seqcount_begin(__seqcount_ptr(s)); \ @@ -551,7 +551,7 @@ static inline void __do_write_seqcount_begin(seqcount_t *s) do { \ __do_write_seqcount_end(__seqcount_ptr(s)); \ \ - if (__seqcount_lock_preemptible(s)) \ + if (__seqcount_associated_lock_exists_and_is_preemptible(s)) \ preempt_enable(); \ } while (0) > Linus Thanks, -- Ahmed S. Darwish Linutronix GmbH ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork 2020-11-04 1:32 ` Ahmed S. Darwish @ 2020-11-04 2:01 ` John Hubbard 2020-11-04 3:17 ` Ahmed S. Darwish 0 siblings, 1 reply; 28+ messages in thread From: John Hubbard @ 2020-11-04 2:01 UTC (permalink / raw) To: Ahmed S. Darwish, Linus Torvalds Cc: Jason Gunthorpe, Peter Xu, Linux Kernel Mailing List, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner, Sebastian Siewior On 11/3/20 5:32 PM, Ahmed S. Darwish wrote: > On Tue, Nov 03, 2020 at 09:40:22AM -0800, Linus Torvalds wrote: >> On Mon, Nov 2, 2020 at 10:52 PM Ahmed S. Darwish >> <a.darwish@linutronix.de> wrote: ... > > ====> > ====> patch #1: > ====> > > Subject: [RFC][PATCH 1/2] seqlock: Use __do_ prefix instead of non-standed > _seqcount_t_ marker > > The use of "*_seqcount_t_*" as a marker to denote internal seqlock.h > functions taking only plain seqcount_t instead of the whole > seqcount_LOCKNAME_t family is confusing users, as it's also not the > standard kernel pattern for denoting header file internal functions. > > Use the __do_ prefix instead. > > Note, a plain "__" prefix is not used since seqlock.h already uses it > for some of its exported functions; e.g. __read_seqcount_begin() and > __read_seqcount_retry(). > > Reported-by: Jason Gunthorpe <jgg@nvidia.com> > Reported-by: John Hubbard <jhubbard@nvidia.com> > Reported-by: Linus Torvalds <torvalds@linux-foundation.org> > Link: https://lkml.kernel.org/r/CAHk-=wgB8nyOQufpn0o6a5BpJCJPnXvH+kRxApujhsgG+7qAwQ@mail.gmail.com > Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> > --- > include/linux/seqlock.h | 62 ++++++++++++++++++++--------------------- > 1 file changed, 31 insertions(+), 31 deletions(-) > > diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h > index cbfc78b92b65..5de043841d33 100644 > --- a/include/linux/seqlock.h > +++ b/include/linux/seqlock.h > @@ -425,9 +425,9 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu > * Return: true if a read section retry is required, else false > */ > #define __read_seqcount_retry(s, start) \ > - __read_seqcount_t_retry(__seqcount_ptr(s), start) > + __do___read_seqcount_retry(__seqcount_ptr(s), start) Looking better. "do_" is clearly an internal function name prefix, so that's good. A nit: while various numbers of leading underscores are sometimes used, it's a lot less common to use, say, 3 consecutive underscores (as above) *within* the name. And I don't think you need it for uniqueness, at least from a quick look around here. In fact, if you actually need 3 vs 2 vs 1 underscore within the name, I'm going to claim that that's too far afield, and the naming should be re-revisited. :) So why not just: __do_read_seqcount_retry() ? ...or, if you feeling bold: do_read_seqcount_retry() ...thus taking further advantage of the "do" convention, in order to get rid of some more underscores. And similarly for other __do___*() functions. But again, either way, I think "do" is helping a *lot* here (as is getting rid of the _t_ idea). thanks, -- John Hubbard NVIDIA > > -static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start) > +static inline int __do___read_seqcount_retry(const seqcount_t *s, unsigned start) > { > kcsan_atomic_next(0); > return unlikely(READ_ONCE(s->sequence) != start); > @@ -445,12 +445,12 @@ static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start) > * Return: true if a read section retry is required, else false > */ > #define read_seqcount_retry(s, start) \ > - read_seqcount_t_retry(__seqcount_ptr(s), start) > + __do_read_seqcount_retry(__seqcount_ptr(s), start) > > -static inline int read_seqcount_t_retry(const seqcount_t *s, unsigned start) > +static inline int __do_read_seqcount_retry(const seqcount_t *s, unsigned start) > { > smp_rmb(); > - return __read_seqcount_t_retry(s, start); > + return __do___read_seqcount_retry(s, start); > } > > /** > @@ -462,10 +462,10 @@ do { \ > if (__seqcount_lock_preemptible(s)) \ > preempt_disable(); \ > \ > - raw_write_seqcount_t_begin(__seqcount_ptr(s)); \ > + __do_raw_write_seqcount_begin(__seqcount_ptr(s)); \ > } while (0) > > -static inline void raw_write_seqcount_t_begin(seqcount_t *s) > +static inline void __do_raw_write_seqcount_begin(seqcount_t *s) > { > kcsan_nestable_atomic_begin(); > s->sequence++; > @@ -478,13 +478,13 @@ static inline void raw_write_seqcount_t_begin(seqcount_t *s) > */ > #define raw_write_seqcount_end(s) \ > do { \ > - raw_write_seqcount_t_end(__seqcount_ptr(s)); \ > + __do_raw_write_seqcount_end(__seqcount_ptr(s)); \ > \ > if (__seqcount_lock_preemptible(s)) \ > preempt_enable(); \ > } while (0) > > -static inline void raw_write_seqcount_t_end(seqcount_t *s) > +static inline void __do_raw_write_seqcount_end(seqcount_t *s) > { > smp_wmb(); > s->sequence++; > @@ -506,12 +506,12 @@ do { \ > if (__seqcount_lock_preemptible(s)) \ > preempt_disable(); \ > \ > - write_seqcount_t_begin_nested(__seqcount_ptr(s), subclass); \ > + __do_write_seqcount_begin_nested(__seqcount_ptr(s), subclass); \ > } while (0) > > -static inline void write_seqcount_t_begin_nested(seqcount_t *s, int subclass) > +static inline void __do_write_seqcount_begin_nested(seqcount_t *s, int subclass) > { > - raw_write_seqcount_t_begin(s); > + __do_raw_write_seqcount_begin(s); > seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_); > } > > @@ -533,12 +533,12 @@ do { \ > if (__seqcount_lock_preemptible(s)) \ > preempt_disable(); \ > \ > - write_seqcount_t_begin(__seqcount_ptr(s)); \ > + __do_write_seqcount_begin(__seqcount_ptr(s)); \ > } while (0) > > -static inline void write_seqcount_t_begin(seqcount_t *s) > +static inline void __do_write_seqcount_begin(seqcount_t *s) > { > - write_seqcount_t_begin_nested(s, 0); > + __do_write_seqcount_begin_nested(s, 0); > } > > /** > @@ -549,16 +549,16 @@ static inline void write_seqcount_t_begin(seqcount_t *s) > */ > #define write_seqcount_end(s) \ > do { \ > - write_seqcount_t_end(__seqcount_ptr(s)); \ > + __do_write_seqcount_end(__seqcount_ptr(s)); \ > \ > if (__seqcount_lock_preemptible(s)) \ > preempt_enable(); \ > } while (0) > > -static inline void write_seqcount_t_end(seqcount_t *s) > +static inline void __do_write_seqcount_end(seqcount_t *s) > { > seqcount_release(&s->dep_map, _RET_IP_); > - raw_write_seqcount_t_end(s); > + __do_raw_write_seqcount_end(s); > } > > /** > @@ -603,9 +603,9 @@ static inline void write_seqcount_t_end(seqcount_t *s) > * } > */ > #define raw_write_seqcount_barrier(s) \ > - raw_write_seqcount_t_barrier(__seqcount_ptr(s)) > + __do_raw_write_seqcount_barrier(__seqcount_ptr(s)) > > -static inline void raw_write_seqcount_t_barrier(seqcount_t *s) > +static inline void __do_raw_write_seqcount_barrier(seqcount_t *s) > { > kcsan_nestable_atomic_begin(); > s->sequence++; > @@ -623,9 +623,9 @@ static inline void raw_write_seqcount_t_barrier(seqcount_t *s) > * will complete successfully and see data older than this. > */ > #define write_seqcount_invalidate(s) \ > - write_seqcount_t_invalidate(__seqcount_ptr(s)) > + __do_write_seqcount_invalidate(__seqcount_ptr(s)) > > -static inline void write_seqcount_t_invalidate(seqcount_t *s) > +static inline void __do_write_seqcount_invalidate(seqcount_t *s) > { > smp_wmb(); > kcsan_nestable_atomic_begin(); > @@ -865,7 +865,7 @@ static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start) > } > > /* > - * For all seqlock_t write side functions, use write_seqcount_*t*_begin() > + * For all seqlock_t write side functions, use __do_write_seqcount_begin() > * instead of the generic write_seqcount_begin(). This way, no redundant > * lockdep_assert_held() checks are added. > */ > @@ -886,7 +886,7 @@ static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start) > static inline void write_seqlock(seqlock_t *sl) > { > spin_lock(&sl->lock); > - write_seqcount_t_begin(&sl->seqcount.seqcount); > + __do_write_seqcount_begin(&sl->seqcount.seqcount); > } > > /** > @@ -898,7 +898,7 @@ static inline void write_seqlock(seqlock_t *sl) > */ > static inline void write_sequnlock(seqlock_t *sl) > { > - write_seqcount_t_end(&sl->seqcount.seqcount); > + __do_write_seqcount_end(&sl->seqcount.seqcount); > spin_unlock(&sl->lock); > } > > @@ -912,7 +912,7 @@ static inline void write_sequnlock(seqlock_t *sl) > static inline void write_seqlock_bh(seqlock_t *sl) > { > spin_lock_bh(&sl->lock); > - write_seqcount_t_begin(&sl->seqcount.seqcount); > + __do_write_seqcount_begin(&sl->seqcount.seqcount); > } > > /** > @@ -925,7 +925,7 @@ static inline void write_seqlock_bh(seqlock_t *sl) > */ > static inline void write_sequnlock_bh(seqlock_t *sl) > { > - write_seqcount_t_end(&sl->seqcount.seqcount); > + __do_write_seqcount_end(&sl->seqcount.seqcount); > spin_unlock_bh(&sl->lock); > } > > @@ -939,7 +939,7 @@ static inline void write_sequnlock_bh(seqlock_t *sl) > static inline void write_seqlock_irq(seqlock_t *sl) > { > spin_lock_irq(&sl->lock); > - write_seqcount_t_begin(&sl->seqcount.seqcount); > + __do_write_seqcount_begin(&sl->seqcount.seqcount); > } > > /** > @@ -951,7 +951,7 @@ static inline void write_seqlock_irq(seqlock_t *sl) > */ > static inline void write_sequnlock_irq(seqlock_t *sl) > { > - write_seqcount_t_end(&sl->seqcount.seqcount); > + __do_write_seqcount_end(&sl->seqcount.seqcount); > spin_unlock_irq(&sl->lock); > } > > @@ -960,7 +960,7 @@ static inline unsigned long __write_seqlock_irqsave(seqlock_t *sl) > unsigned long flags; > > spin_lock_irqsave(&sl->lock, flags); > - write_seqcount_t_begin(&sl->seqcount.seqcount); > + __do_write_seqcount_begin(&sl->seqcount.seqcount); > return flags; > } > > @@ -989,7 +989,7 @@ static inline unsigned long __write_seqlock_irqsave(seqlock_t *sl) > static inline void > write_sequnlock_irqrestore(seqlock_t *sl, unsigned long flags) > { > - write_seqcount_t_end(&sl->seqcount.seqcount); > + __do_write_seqcount_end(&sl->seqcount.seqcount); > spin_unlock_irqrestore(&sl->lock, flags); > } > > ====> > ====> patch #2: > ====> > > Subject: [PATCH 2/2] seqlock: seqcount_LOCKAME_t: Use more verbose macro names > > As evidenced by multiple discussions over LKML so far, it's not clear > that __seqcount_lock_preemptible() is always false for plain seqcount_t. > For that data type, the _Generic() selection resolves to > __seqprop_preemptible(), which just returns false. > > Use __seqcount_associated_lock_exists_and_is_preemptible() instead, > which hints that "preemptibility" is for the associated write > serialization lock (if any), not for the seqcount itself. > > Similarly, rename __seqcount_assert_lock_held() to > __seqcount_assert_associated_lock_held(). > > Link: https://lkml.kernel.org/r/CAHk-=wgB8nyOQufpn0o6a5BpJCJPnXvH+kRxApujhsgG+7qAwQ@mail.gmail.com > Link: https://lkml.kernel.org/r/20201030235121.GQ2620339@nvidia.com > Link: https://lkml.kernel.org/r/20201103170327.GJ20600@xz-x1 > Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> > --- > include/linux/seqlock.h | 22 +++++++++++----------- > 1 file changed, 11 insertions(+), 11 deletions(-) > > diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h > index 5de043841d33..eb1e5a822e44 100644 > --- a/include/linux/seqlock.h > +++ b/include/linux/seqlock.h > @@ -307,10 +307,10 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu > __seqprop_case((s), mutex, prop), \ > __seqprop_case((s), ww_mutex, prop)) > > -#define __seqcount_ptr(s) __seqprop(s, ptr) > -#define __seqcount_sequence(s) __seqprop(s, sequence) > -#define __seqcount_lock_preemptible(s) __seqprop(s, preemptible) > -#define __seqcount_assert_lock_held(s) __seqprop(s, assert) > +#define __seqcount_ptr(s) __seqprop(s, ptr) > +#define __seqcount_sequence(s) __seqprop(s, sequence) > +#define __seqcount_associated_lock_exists_and_is_preemptible(s) __seqprop(s, preemptible) > +#define __seqcount_assert_associated_lock_held(s) __seqprop(s, assert) > > /** > * __read_seqcount_begin() - begin a seqcount_t read section w/o barrier > @@ -459,7 +459,7 @@ static inline int __do_read_seqcount_retry(const seqcount_t *s, unsigned start) > */ > #define raw_write_seqcount_begin(s) \ > do { \ > - if (__seqcount_lock_preemptible(s)) \ > + if (__seqcount_associated_lock_exists_and_is_preemptible(s)) \ > preempt_disable(); \ > \ > __do_raw_write_seqcount_begin(__seqcount_ptr(s)); \ > @@ -480,7 +480,7 @@ static inline void __do_raw_write_seqcount_begin(seqcount_t *s) > do { \ > __do_raw_write_seqcount_end(__seqcount_ptr(s)); \ > \ > - if (__seqcount_lock_preemptible(s)) \ > + if (__seqcount_associated_lock_exists_and_is_preemptible(s)) \ > preempt_enable(); \ > } while (0) > > @@ -501,9 +501,9 @@ static inline void __do_raw_write_seqcount_end(seqcount_t *s) > */ > #define write_seqcount_begin_nested(s, subclass) \ > do { \ > - __seqcount_assert_lock_held(s); \ > + __seqcount_assert_associated_lock_held(s); \ > \ > - if (__seqcount_lock_preemptible(s)) \ > + if (__seqcount_associated_lock_exists_and_is_preemptible(s)) \ > preempt_disable(); \ > \ > __do_write_seqcount_begin_nested(__seqcount_ptr(s), subclass); \ > @@ -528,9 +528,9 @@ static inline void __do_write_seqcount_begin_nested(seqcount_t *s, int subclass) > */ > #define write_seqcount_begin(s) \ > do { \ > - __seqcount_assert_lock_held(s); \ > + __seqcount_assert_associated_lock_held(s); \ > \ > - if (__seqcount_lock_preemptible(s)) \ > + if (__seqcount_associated_lock_exists_and_is_preemptible(s)) \ > preempt_disable(); \ > \ > __do_write_seqcount_begin(__seqcount_ptr(s)); \ > @@ -551,7 +551,7 @@ static inline void __do_write_seqcount_begin(seqcount_t *s) > do { \ > __do_write_seqcount_end(__seqcount_ptr(s)); \ > \ > - if (__seqcount_lock_preemptible(s)) \ > + if (__seqcount_associated_lock_exists_and_is_preemptible(s)) \ > preempt_enable(); \ > } while (0) > >> Linus > > Thanks, > > -- > Ahmed S. Darwish > Linutronix GmbH > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork 2020-11-04 2:01 ` John Hubbard @ 2020-11-04 3:17 ` Ahmed S. Darwish 2020-11-04 18:38 ` Linus Torvalds 2020-11-10 11:53 ` [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork Peter Zijlstra 0 siblings, 2 replies; 28+ messages in thread From: Ahmed S. Darwish @ 2020-11-04 3:17 UTC (permalink / raw) To: John Hubbard Cc: Linus Torvalds, Jason Gunthorpe, Peter Xu, Linux Kernel Mailing List, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner, Sebastian Siewior On Tue, Nov 03, 2020 at 06:01:30PM -0800, John Hubbard wrote: > On 11/3/20 5:32 PM, Ahmed S. Darwish wrote: ... > > #define __read_seqcount_retry(s, start) \ > > - __read_seqcount_t_retry(__seqcount_ptr(s), start) > > + __do___read_seqcount_retry(__seqcount_ptr(s), start) > ... > A nit: while various numbers of leading underscores are sometimes used, it's a lot > less common to use, say, 3 consecutive underscores (as above) *within* the name. And > I don't think you need it for uniqueness, at least from a quick look around here. > ... > But again, either way, I think "do" is helping a *lot* here (as is getting rid > of the _t_ idea). The three underscores are needed because there's a do_ version for read_seqcount_retry(), and another for __read_seqcount_retry(). Similarly for {__,}read_seqcount_begin(). You want to be very careful with this, and never mistaknely mix the two, because it affects some VFS hot paths. Nonetheless, as you mentioned in the later (dropped) part of your message, I think do_ is better than __do_, so the final result will be: do___read_seqcount_retry() do_read_seqcount_retry() do_raw_write_seqcount_begin() do_raw_write_seqcount_end() do_write_seqcount_begin() ... and so on. I'll wait for some further feedback on the two patches (possibly from Linus or PeterZ), then send a mini patch series. (This shouldn't block a v3 of Jason's mm patch series though, as it will be using the external seqlock.h APIs anyway...). Thanks, -- Ahmed S. Darwish Linutronix GmbH ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork 2020-11-04 3:17 ` Ahmed S. Darwish @ 2020-11-04 18:38 ` Linus Torvalds 2020-11-04 19:54 ` Ahmed S. Darwish ` (2 more replies) 2020-11-10 11:53 ` [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork Peter Zijlstra 1 sibling, 3 replies; 28+ messages in thread From: Linus Torvalds @ 2020-11-04 18:38 UTC (permalink / raw) To: Ahmed S. Darwish Cc: John Hubbard, Jason Gunthorpe, Peter Xu, Linux Kernel Mailing List, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner, Sebastian Siewior On Tue, Nov 3, 2020 at 7:17 PM Ahmed S. Darwish <a.darwish@linutronix.de> wrote: > > Nonetheless, as you mentioned in the later (dropped) part of your > message, I think do_ is better than __do_, so the final result will be: > > do___read_seqcount_retry() > do_read_seqcount_retry() > do_raw_write_seqcount_begin() > do_raw_write_seqcount_end() > do_write_seqcount_begin() > ... > > and so on. Looks reasonable to me. And can you add a few comments to the magic type macros, so that it's a lot more obvious what the end result was. I clearly wasn't able to follow all the _Generic() cases from the seqcount_t to the final end result. It's a really odd combination of subtle _GENERIC() macro and token pasting to get from zeqcount_t to "false" in __seqcount_lock_preemptible(). I can see it when I really look, but when looking at the actual use, it's very non-obvious indeed. Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork 2020-11-04 18:38 ` Linus Torvalds @ 2020-11-04 19:54 ` Ahmed S. Darwish 2020-12-09 18:38 ` [tip: locking/core] seqlock: Prefix internal seqcount_t-only macros with a "do_" tip-bot2 for Ahmed S. Darwish 2020-12-09 18:38 ` [tip: locking/core] seqlock: kernel-doc: Specify when preemption is automatically altered tip-bot2 for Ahmed S. Darwish 2 siblings, 0 replies; 28+ messages in thread From: Ahmed S. Darwish @ 2020-11-04 19:54 UTC (permalink / raw) To: Linus Torvalds Cc: John Hubbard, Jason Gunthorpe, Peter Xu, Linux Kernel Mailing List, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner, Sebastian Siewior On Wed, Nov 04, 2020 at 10:38:57AM -0800, Linus Torvalds wrote: ... > > Looks reasonable to me. > > And can you add a few comments to the magic type macros, so that it's > a lot more obvious what the end result was. ... > > I can see it when I really look, but when looking at the actual use, > it's very non-obvious indeed. > ACK, will do. > Linus Thanks, -- Ahmed S. Darwish Linutronix GmbH ^ permalink raw reply [flat|nested] 28+ messages in thread
* [tip: locking/core] seqlock: Prefix internal seqcount_t-only macros with a "do_" 2020-11-04 18:38 ` Linus Torvalds 2020-11-04 19:54 ` Ahmed S. Darwish @ 2020-12-09 18:38 ` tip-bot2 for Ahmed S. Darwish 2020-12-09 18:38 ` [tip: locking/core] seqlock: kernel-doc: Specify when preemption is automatically altered tip-bot2 for Ahmed S. Darwish 2 siblings, 0 replies; 28+ messages in thread From: tip-bot2 for Ahmed S. Darwish @ 2020-12-09 18:38 UTC (permalink / raw) To: linux-tip-commits Cc: Ahmed S. Darwish, Peter Zijlstra (Intel), x86, linux-kernel The following commit has been merged into the locking/core branch of tip: Commit-ID: 66bcfcdf89d00f2409f4b5da0f8c20c08318dc72 Gitweb: https://git.kernel.org/tip/66bcfcdf89d00f2409f4b5da0f8c20c08318dc72 Author: Ahmed S. Darwish <a.darwish@linutronix.de> AuthorDate: Sun, 06 Dec 2020 17:21:42 +01:00 Committer: Peter Zijlstra <peterz@infradead.org> CommitterDate: Wed, 09 Dec 2020 17:08:49 +01:00 seqlock: Prefix internal seqcount_t-only macros with a "do_" When the seqcount_LOCKNAME_t group of data types were introduced, two classes of seqlock.h sequence counter macros were added: - An external public API which can either take a plain seqcount_t or any of the seqcount_LOCKNAME_t variants. - An internal API which takes only a plain seqcount_t. To distinguish between the two groups, the "*_seqcount_t_*" pattern was used for the latter. This confused a number of mm/ call-site developers, and Linus also commented that it was not a standard practice for marking seqlock.h internal APIs. Distinguish the latter group of macros by prefixing a "do_". Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/CAHk-=wikhGExmprXgaW+MVXG1zsGpztBbVwOb23vetk41EtTBQ@mail.gmail.com --- include/linux/seqlock.h | 66 ++++++++++++++++++++-------------------- 1 file changed, 33 insertions(+), 33 deletions(-) diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index d89134c..235cbc6 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -425,9 +425,9 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu * Return: true if a read section retry is required, else false */ #define __read_seqcount_retry(s, start) \ - __read_seqcount_t_retry(seqprop_ptr(s), start) + do___read_seqcount_retry(seqprop_ptr(s), start) -static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start) +static inline int do___read_seqcount_retry(const seqcount_t *s, unsigned start) { kcsan_atomic_next(0); return unlikely(READ_ONCE(s->sequence) != start); @@ -445,12 +445,12 @@ static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start) * Return: true if a read section retry is required, else false */ #define read_seqcount_retry(s, start) \ - read_seqcount_t_retry(seqprop_ptr(s), start) + do_read_seqcount_retry(seqprop_ptr(s), start) -static inline int read_seqcount_t_retry(const seqcount_t *s, unsigned start) +static inline int do_read_seqcount_retry(const seqcount_t *s, unsigned start) { smp_rmb(); - return __read_seqcount_t_retry(s, start); + return do___read_seqcount_retry(s, start); } /** @@ -462,10 +462,10 @@ do { \ if (seqprop_preemptible(s)) \ preempt_disable(); \ \ - raw_write_seqcount_t_begin(seqprop_ptr(s)); \ + do_raw_write_seqcount_begin(seqprop_ptr(s)); \ } while (0) -static inline void raw_write_seqcount_t_begin(seqcount_t *s) +static inline void do_raw_write_seqcount_begin(seqcount_t *s) { kcsan_nestable_atomic_begin(); s->sequence++; @@ -478,13 +478,13 @@ static inline void raw_write_seqcount_t_begin(seqcount_t *s) */ #define raw_write_seqcount_end(s) \ do { \ - raw_write_seqcount_t_end(seqprop_ptr(s)); \ + do_raw_write_seqcount_end(seqprop_ptr(s)); \ \ if (seqprop_preemptible(s)) \ preempt_enable(); \ } while (0) -static inline void raw_write_seqcount_t_end(seqcount_t *s) +static inline void do_raw_write_seqcount_end(seqcount_t *s) { smp_wmb(); s->sequence++; @@ -506,12 +506,12 @@ do { \ if (seqprop_preemptible(s)) \ preempt_disable(); \ \ - write_seqcount_t_begin_nested(seqprop_ptr(s), subclass); \ + do_write_seqcount_begin_nested(seqprop_ptr(s), subclass); \ } while (0) -static inline void write_seqcount_t_begin_nested(seqcount_t *s, int subclass) +static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass) { - raw_write_seqcount_t_begin(s); + do_raw_write_seqcount_begin(s); seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_); } @@ -533,12 +533,12 @@ do { \ if (seqprop_preemptible(s)) \ preempt_disable(); \ \ - write_seqcount_t_begin(seqprop_ptr(s)); \ + do_write_seqcount_begin(seqprop_ptr(s)); \ } while (0) -static inline void write_seqcount_t_begin(seqcount_t *s) +static inline void do_write_seqcount_begin(seqcount_t *s) { - write_seqcount_t_begin_nested(s, 0); + do_write_seqcount_begin_nested(s, 0); } /** @@ -549,16 +549,16 @@ static inline void write_seqcount_t_begin(seqcount_t *s) */ #define write_seqcount_end(s) \ do { \ - write_seqcount_t_end(seqprop_ptr(s)); \ + do_write_seqcount_end(seqprop_ptr(s)); \ \ if (seqprop_preemptible(s)) \ preempt_enable(); \ } while (0) -static inline void write_seqcount_t_end(seqcount_t *s) +static inline void do_write_seqcount_end(seqcount_t *s) { seqcount_release(&s->dep_map, _RET_IP_); - raw_write_seqcount_t_end(s); + do_raw_write_seqcount_end(s); } /** @@ -603,9 +603,9 @@ static inline void write_seqcount_t_end(seqcount_t *s) * } */ #define raw_write_seqcount_barrier(s) \ - raw_write_seqcount_t_barrier(seqprop_ptr(s)) + do_raw_write_seqcount_barrier(seqprop_ptr(s)) -static inline void raw_write_seqcount_t_barrier(seqcount_t *s) +static inline void do_raw_write_seqcount_barrier(seqcount_t *s) { kcsan_nestable_atomic_begin(); s->sequence++; @@ -623,9 +623,9 @@ static inline void raw_write_seqcount_t_barrier(seqcount_t *s) * will complete successfully and see data older than this. */ #define write_seqcount_invalidate(s) \ - write_seqcount_t_invalidate(seqprop_ptr(s)) + do_write_seqcount_invalidate(seqprop_ptr(s)) -static inline void write_seqcount_t_invalidate(seqcount_t *s) +static inline void do_write_seqcount_invalidate(seqcount_t *s) { smp_wmb(); kcsan_nestable_atomic_begin(); @@ -865,9 +865,9 @@ static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start) } /* - * For all seqlock_t write side functions, use write_seqcount_*t*_begin() - * instead of the generic write_seqcount_begin(). This way, no redundant - * lockdep_assert_held() checks are added. + * For all seqlock_t write side functions, use the the internal + * do_write_seqcount_begin() instead of generic write_seqcount_begin(). + * This way, no redundant lockdep_assert_held() checks are added. */ /** @@ -886,7 +886,7 @@ static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start) static inline void write_seqlock(seqlock_t *sl) { spin_lock(&sl->lock); - write_seqcount_t_begin(&sl->seqcount.seqcount); + do_write_seqcount_begin(&sl->seqcount.seqcount); } /** @@ -898,7 +898,7 @@ static inline void write_seqlock(seqlock_t *sl) */ static inline void write_sequnlock(seqlock_t *sl) { - write_seqcount_t_end(&sl->seqcount.seqcount); + do_write_seqcount_end(&sl->seqcount.seqcount); spin_unlock(&sl->lock); } @@ -912,7 +912,7 @@ static inline void write_sequnlock(seqlock_t *sl) static inline void write_seqlock_bh(seqlock_t *sl) { spin_lock_bh(&sl->lock); - write_seqcount_t_begin(&sl->seqcount.seqcount); + do_write_seqcount_begin(&sl->seqcount.seqcount); } /** @@ -925,7 +925,7 @@ static inline void write_seqlock_bh(seqlock_t *sl) */ static inline void write_sequnlock_bh(seqlock_t *sl) { - write_seqcount_t_end(&sl->seqcount.seqcount); + do_write_seqcount_end(&sl->seqcount.seqcount); spin_unlock_bh(&sl->lock); } @@ -939,7 +939,7 @@ static inline void write_sequnlock_bh(seqlock_t *sl) static inline void write_seqlock_irq(seqlock_t *sl) { spin_lock_irq(&sl->lock); - write_seqcount_t_begin(&sl->seqcount.seqcount); + do_write_seqcount_begin(&sl->seqcount.seqcount); } /** @@ -951,7 +951,7 @@ static inline void write_seqlock_irq(seqlock_t *sl) */ static inline void write_sequnlock_irq(seqlock_t *sl) { - write_seqcount_t_end(&sl->seqcount.seqcount); + do_write_seqcount_end(&sl->seqcount.seqcount); spin_unlock_irq(&sl->lock); } @@ -960,7 +960,7 @@ static inline unsigned long __write_seqlock_irqsave(seqlock_t *sl) unsigned long flags; spin_lock_irqsave(&sl->lock, flags); - write_seqcount_t_begin(&sl->seqcount.seqcount); + do_write_seqcount_begin(&sl->seqcount.seqcount); return flags; } @@ -989,7 +989,7 @@ static inline unsigned long __write_seqlock_irqsave(seqlock_t *sl) static inline void write_sequnlock_irqrestore(seqlock_t *sl, unsigned long flags) { - write_seqcount_t_end(&sl->seqcount.seqcount); + do_write_seqcount_end(&sl->seqcount.seqcount); spin_unlock_irqrestore(&sl->lock, flags); } ^ permalink raw reply related [flat|nested] 28+ messages in thread
* [tip: locking/core] seqlock: kernel-doc: Specify when preemption is automatically altered 2020-11-04 18:38 ` Linus Torvalds 2020-11-04 19:54 ` Ahmed S. Darwish 2020-12-09 18:38 ` [tip: locking/core] seqlock: Prefix internal seqcount_t-only macros with a "do_" tip-bot2 for Ahmed S. Darwish @ 2020-12-09 18:38 ` tip-bot2 for Ahmed S. Darwish 2 siblings, 0 replies; 28+ messages in thread From: tip-bot2 for Ahmed S. Darwish @ 2020-12-09 18:38 UTC (permalink / raw) To: linux-tip-commits Cc: Ahmed S. Darwish, Peter Zijlstra (Intel), x86, linux-kernel The following commit has been merged into the locking/core branch of tip: Commit-ID: cb262935a166bdef0ccfe6e2adffa00c0f2d038a Gitweb: https://git.kernel.org/tip/cb262935a166bdef0ccfe6e2adffa00c0f2d038a Author: Ahmed S. Darwish <a.darwish@linutronix.de> AuthorDate: Sun, 06 Dec 2020 17:21:43 +01:00 Committer: Peter Zijlstra <peterz@infradead.org> CommitterDate: Wed, 09 Dec 2020 17:08:49 +01:00 seqlock: kernel-doc: Specify when preemption is automatically altered The kernel-doc annotations for sequence counters write side functions are incomplete: they do not specify when preemption is automatically disabled and re-enabled. This has confused a number of call-site developers. Fix it. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/CAHk-=wikhGExmprXgaW+MVXG1zsGpztBbVwOb23vetk41EtTBQ@mail.gmail.com --- include/linux/seqlock.h | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index 235cbc6..2f7bb92 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -456,6 +456,8 @@ static inline int do_read_seqcount_retry(const seqcount_t *s, unsigned start) /** * raw_write_seqcount_begin() - start a seqcount_t write section w/o lockdep * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * + * Context: check write_seqcount_begin() */ #define raw_write_seqcount_begin(s) \ do { \ @@ -475,6 +477,8 @@ static inline void do_raw_write_seqcount_begin(seqcount_t *s) /** * raw_write_seqcount_end() - end a seqcount_t write section w/o lockdep * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * + * Context: check write_seqcount_end() */ #define raw_write_seqcount_end(s) \ do { \ @@ -498,6 +502,7 @@ static inline void do_raw_write_seqcount_end(seqcount_t *s) * @subclass: lockdep nesting level * * See Documentation/locking/lockdep-design.rst + * Context: check write_seqcount_begin() */ #define write_seqcount_begin_nested(s, subclass) \ do { \ @@ -519,11 +524,10 @@ static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass) * write_seqcount_begin() - start a seqcount_t write side critical section * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants * - * write_seqcount_begin opens a write side critical section of the given - * seqcount_t. - * - * Context: seqcount_t write side critical sections must be serialized and - * non-preemptible. If readers can be invoked from hardirq or softirq + * Context: sequence counter write side sections must be serialized and + * non-preemptible. Preemption will be automatically disabled if and + * only if the seqcount write serialization lock is associated, and + * preemptible. If readers can be invoked from hardirq or softirq * context, interrupts or bottom halves must be respectively disabled. */ #define write_seqcount_begin(s) \ @@ -545,7 +549,8 @@ static inline void do_write_seqcount_begin(seqcount_t *s) * write_seqcount_end() - end a seqcount_t write side critical section * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants * - * The write section must've been opened with write_seqcount_begin(). + * Context: Preemption will be automatically re-enabled if and only if + * the seqcount write serialization lock is associated, and preemptible. */ #define write_seqcount_end(s) \ do { \ ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork 2020-11-04 3:17 ` Ahmed S. Darwish 2020-11-04 18:38 ` Linus Torvalds @ 2020-11-10 11:53 ` Peter Zijlstra 2020-12-03 10:35 ` [tip: locking/core] seqlock: Rename __seqprop() users tip-bot2 for Peter Zijlstra 1 sibling, 1 reply; 28+ messages in thread From: Peter Zijlstra @ 2020-11-10 11:53 UTC (permalink / raw) To: Ahmed S. Darwish Cc: John Hubbard, Linus Torvalds, Jason Gunthorpe, Peter Xu, Linux Kernel Mailing List, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov, Ingo Molnar, Will Deacon, Thomas Gleixner, Sebastian Siewior On Wed, Nov 04, 2020 at 04:17:11AM +0100, Ahmed S. Darwish wrote: > On Tue, Nov 03, 2020 at 06:01:30PM -0800, John Hubbard wrote: > > On 11/3/20 5:32 PM, Ahmed S. Darwish wrote: > ... > > > #define __read_seqcount_retry(s, start) \ > > > - __read_seqcount_t_retry(__seqcount_ptr(s), start) > > > + __do___read_seqcount_retry(__seqcount_ptr(s), start) > > > ... > > A nit: while various numbers of leading underscores are sometimes used, it's a lot > > less common to use, say, 3 consecutive underscores (as above) *within* the name. And > > I don't think you need it for uniqueness, at least from a quick look around here. > > > ... > > But again, either way, I think "do" is helping a *lot* here (as is getting rid > > of the _t_ idea). > > The three underscores are needed because there's a do_ version for > read_seqcount_retry(), and another for __read_seqcount_retry(). > > Similarly for {__,}read_seqcount_begin(). You want to be very careful > with this, and never mistaknely mix the two, because it affects some VFS > hot paths. > > Nonetheless, as you mentioned in the later (dropped) part of your > message, I think do_ is better than __do_, so the final result will be: > > do___read_seqcount_retry() > do_read_seqcount_retry() > do_raw_write_seqcount_begin() > do_raw_write_seqcount_end() > do_write_seqcount_begin() > ... > > and so on. > > I'll wait for some further feedback on the two patches (possibly from > Linus or PeterZ), then send a mini patch series. I'm Ok with that. The change I have issues with is: -#define __seqcount_lock_preemptible(s) __seqprop(s, preemptible) +#define __seqcount_associated_lock_exists_and_is_preemptible(s) __seqprop(s, preemptible) That's just _realllllllly_ long. Would something like the below make it easier to follow? seqprop_preemptible() is 'obviously' related to __seqprop_preemptible() without having to follow through the _Generic token pasting maze. --- diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index 8d8552474c64..576594add921 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -307,10 +307,10 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu __seqprop_case((s), mutex, prop), \ __seqprop_case((s), ww_mutex, prop)) -#define __seqcount_ptr(s) __seqprop(s, ptr) -#define __seqcount_sequence(s) __seqprop(s, sequence) -#define __seqcount_lock_preemptible(s) __seqprop(s, preemptible) -#define __seqcount_assert_lock_held(s) __seqprop(s, assert) +#define seqprop_ptr(s) __seqprop(s, ptr) +#define seqprop_sequence(s) __seqprop(s, sequence) +#define seqprop_preemptible(s) __seqprop(s, preemptible) +#define seqprop_assert(s) __seqprop(s, assert) /** * __read_seqcount_begin() - begin a seqcount_t read section w/o barrier @@ -330,7 +330,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu ({ \ unsigned __seq; \ \ - while ((__seq = __seqcount_sequence(s)) & 1) \ + while ((__seq = seqprop_sequence(s)) & 1) \ cpu_relax(); \ \ kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX); \ @@ -359,7 +359,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu */ #define read_seqcount_begin(s) \ ({ \ - seqcount_lockdep_reader_access(__seqcount_ptr(s)); \ + seqcount_lockdep_reader_access(seqprop_ptr(s)); \ raw_read_seqcount_begin(s); \ }) @@ -376,7 +376,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu */ #define raw_read_seqcount(s) \ ({ \ - unsigned __seq = __seqcount_sequence(s); \ + unsigned __seq = seqprop_sequence(s); \ \ smp_rmb(); \ kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX); \ @@ -425,7 +425,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu * Return: true if a read section retry is required, else false */ #define __read_seqcount_retry(s, start) \ - __read_seqcount_t_retry(__seqcount_ptr(s), start) + __read_seqcount_t_retry(seqprop_ptr(s), start) static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start) { @@ -445,7 +445,7 @@ static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start) * Return: true if a read section retry is required, else false */ #define read_seqcount_retry(s, start) \ - read_seqcount_t_retry(__seqcount_ptr(s), start) + read_seqcount_t_retry(seqprop_ptr(s), start) static inline int read_seqcount_t_retry(const seqcount_t *s, unsigned start) { @@ -459,10 +459,10 @@ static inline int read_seqcount_t_retry(const seqcount_t *s, unsigned start) */ #define raw_write_seqcount_begin(s) \ do { \ - if (__seqcount_lock_preemptible(s)) \ + if (seqprop_preemptible(s)) \ preempt_disable(); \ \ - raw_write_seqcount_t_begin(__seqcount_ptr(s)); \ + raw_write_seqcount_t_begin(seqprop_ptr(s)); \ } while (0) static inline void raw_write_seqcount_t_begin(seqcount_t *s) @@ -478,9 +478,9 @@ static inline void raw_write_seqcount_t_begin(seqcount_t *s) */ #define raw_write_seqcount_end(s) \ do { \ - raw_write_seqcount_t_end(__seqcount_ptr(s)); \ + raw_write_seqcount_t_end(seqprop_ptr(s)); \ \ - if (__seqcount_lock_preemptible(s)) \ + if (seqprop_preemptible(s)) \ preempt_enable(); \ } while (0) @@ -501,12 +501,12 @@ static inline void raw_write_seqcount_t_end(seqcount_t *s) */ #define write_seqcount_begin_nested(s, subclass) \ do { \ - __seqcount_assert_lock_held(s); \ + seqprop_assert(s); \ \ - if (__seqcount_lock_preemptible(s)) \ + if (seqprop_preemptible(s)) \ preempt_disable(); \ \ - write_seqcount_t_begin_nested(__seqcount_ptr(s), subclass); \ + write_seqcount_t_begin_nested(seqprop_ptr(s), subclass); \ } while (0) static inline void write_seqcount_t_begin_nested(seqcount_t *s, int subclass) @@ -528,12 +528,12 @@ static inline void write_seqcount_t_begin_nested(seqcount_t *s, int subclass) */ #define write_seqcount_begin(s) \ do { \ - __seqcount_assert_lock_held(s); \ + seqprop_assert(s); \ \ - if (__seqcount_lock_preemptible(s)) \ + if (seqprop_preemptible(s)) \ preempt_disable(); \ \ - write_seqcount_t_begin(__seqcount_ptr(s)); \ + write_seqcount_t_begin(seqprop_ptr(s)); \ } while (0) static inline void write_seqcount_t_begin(seqcount_t *s) @@ -549,9 +549,9 @@ static inline void write_seqcount_t_begin(seqcount_t *s) */ #define write_seqcount_end(s) \ do { \ - write_seqcount_t_end(__seqcount_ptr(s)); \ + write_seqcount_t_end(seqprop_ptr(s)); \ \ - if (__seqcount_lock_preemptible(s)) \ + if (seqprop_preemptible(s)) \ preempt_enable(); \ } while (0) @@ -603,7 +603,7 @@ static inline void write_seqcount_t_end(seqcount_t *s) * } */ #define raw_write_seqcount_barrier(s) \ - raw_write_seqcount_t_barrier(__seqcount_ptr(s)) + raw_write_seqcount_t_barrier(seqprop_ptr(s)) static inline void raw_write_seqcount_t_barrier(seqcount_t *s) { @@ -623,7 +623,7 @@ static inline void raw_write_seqcount_t_barrier(seqcount_t *s) * will complete successfully and see data older than this. */ #define write_seqcount_invalidate(s) \ - write_seqcount_t_invalidate(__seqcount_ptr(s)) + write_seqcount_t_invalidate(seqprop_ptr(s)) static inline void write_seqcount_t_invalidate(seqcount_t *s) { ^ permalink raw reply related [flat|nested] 28+ messages in thread
* [tip: locking/core] seqlock: Rename __seqprop() users 2020-11-10 11:53 ` [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork Peter Zijlstra @ 2020-12-03 10:35 ` tip-bot2 for Peter Zijlstra 0 siblings, 0 replies; 28+ messages in thread From: tip-bot2 for Peter Zijlstra @ 2020-12-03 10:35 UTC (permalink / raw) To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), x86, linux-kernel The following commit has been merged into the locking/core branch of tip: Commit-ID: ab440b2c604b60fe90885270fcfeb5c3dd5d6fae Gitweb: https://git.kernel.org/tip/ab440b2c604b60fe90885270fcfeb5c3dd5d6fae Author: Peter Zijlstra <peterz@infradead.org> AuthorDate: Tue, 10 Nov 2020 13:44:17 +01:00 Committer: Peter Zijlstra <peterz@infradead.org> CommitterDate: Thu, 03 Dec 2020 11:20:51 +01:00 seqlock: Rename __seqprop() users More consistent naming should make it easier to untangle the _Generic token pasting maze called __seqprop(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201110115358.GE2594@hirez.programming.kicks-ass.net --- include/linux/seqlock.h | 46 ++++++++++++++++++++-------------------- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index 8d85524..d89134c 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -307,10 +307,10 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu __seqprop_case((s), mutex, prop), \ __seqprop_case((s), ww_mutex, prop)) -#define __seqcount_ptr(s) __seqprop(s, ptr) -#define __seqcount_sequence(s) __seqprop(s, sequence) -#define __seqcount_lock_preemptible(s) __seqprop(s, preemptible) -#define __seqcount_assert_lock_held(s) __seqprop(s, assert) +#define seqprop_ptr(s) __seqprop(s, ptr) +#define seqprop_sequence(s) __seqprop(s, sequence) +#define seqprop_preemptible(s) __seqprop(s, preemptible) +#define seqprop_assert(s) __seqprop(s, assert) /** * __read_seqcount_begin() - begin a seqcount_t read section w/o barrier @@ -330,7 +330,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu ({ \ unsigned __seq; \ \ - while ((__seq = __seqcount_sequence(s)) & 1) \ + while ((__seq = seqprop_sequence(s)) & 1) \ cpu_relax(); \ \ kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX); \ @@ -359,7 +359,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu */ #define read_seqcount_begin(s) \ ({ \ - seqcount_lockdep_reader_access(__seqcount_ptr(s)); \ + seqcount_lockdep_reader_access(seqprop_ptr(s)); \ raw_read_seqcount_begin(s); \ }) @@ -376,7 +376,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu */ #define raw_read_seqcount(s) \ ({ \ - unsigned __seq = __seqcount_sequence(s); \ + unsigned __seq = seqprop_sequence(s); \ \ smp_rmb(); \ kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX); \ @@ -425,7 +425,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu * Return: true if a read section retry is required, else false */ #define __read_seqcount_retry(s, start) \ - __read_seqcount_t_retry(__seqcount_ptr(s), start) + __read_seqcount_t_retry(seqprop_ptr(s), start) static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start) { @@ -445,7 +445,7 @@ static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start) * Return: true if a read section retry is required, else false */ #define read_seqcount_retry(s, start) \ - read_seqcount_t_retry(__seqcount_ptr(s), start) + read_seqcount_t_retry(seqprop_ptr(s), start) static inline int read_seqcount_t_retry(const seqcount_t *s, unsigned start) { @@ -459,10 +459,10 @@ static inline int read_seqcount_t_retry(const seqcount_t *s, unsigned start) */ #define raw_write_seqcount_begin(s) \ do { \ - if (__seqcount_lock_preemptible(s)) \ + if (seqprop_preemptible(s)) \ preempt_disable(); \ \ - raw_write_seqcount_t_begin(__seqcount_ptr(s)); \ + raw_write_seqcount_t_begin(seqprop_ptr(s)); \ } while (0) static inline void raw_write_seqcount_t_begin(seqcount_t *s) @@ -478,9 +478,9 @@ static inline void raw_write_seqcount_t_begin(seqcount_t *s) */ #define raw_write_seqcount_end(s) \ do { \ - raw_write_seqcount_t_end(__seqcount_ptr(s)); \ + raw_write_seqcount_t_end(seqprop_ptr(s)); \ \ - if (__seqcount_lock_preemptible(s)) \ + if (seqprop_preemptible(s)) \ preempt_enable(); \ } while (0) @@ -501,12 +501,12 @@ static inline void raw_write_seqcount_t_end(seqcount_t *s) */ #define write_seqcount_begin_nested(s, subclass) \ do { \ - __seqcount_assert_lock_held(s); \ + seqprop_assert(s); \ \ - if (__seqcount_lock_preemptible(s)) \ + if (seqprop_preemptible(s)) \ preempt_disable(); \ \ - write_seqcount_t_begin_nested(__seqcount_ptr(s), subclass); \ + write_seqcount_t_begin_nested(seqprop_ptr(s), subclass); \ } while (0) static inline void write_seqcount_t_begin_nested(seqcount_t *s, int subclass) @@ -528,12 +528,12 @@ static inline void write_seqcount_t_begin_nested(seqcount_t *s, int subclass) */ #define write_seqcount_begin(s) \ do { \ - __seqcount_assert_lock_held(s); \ + seqprop_assert(s); \ \ - if (__seqcount_lock_preemptible(s)) \ + if (seqprop_preemptible(s)) \ preempt_disable(); \ \ - write_seqcount_t_begin(__seqcount_ptr(s)); \ + write_seqcount_t_begin(seqprop_ptr(s)); \ } while (0) static inline void write_seqcount_t_begin(seqcount_t *s) @@ -549,9 +549,9 @@ static inline void write_seqcount_t_begin(seqcount_t *s) */ #define write_seqcount_end(s) \ do { \ - write_seqcount_t_end(__seqcount_ptr(s)); \ + write_seqcount_t_end(seqprop_ptr(s)); \ \ - if (__seqcount_lock_preemptible(s)) \ + if (seqprop_preemptible(s)) \ preempt_enable(); \ } while (0) @@ -603,7 +603,7 @@ static inline void write_seqcount_t_end(seqcount_t *s) * } */ #define raw_write_seqcount_barrier(s) \ - raw_write_seqcount_t_barrier(__seqcount_ptr(s)) + raw_write_seqcount_t_barrier(seqprop_ptr(s)) static inline void raw_write_seqcount_t_barrier(seqcount_t *s) { @@ -623,7 +623,7 @@ static inline void raw_write_seqcount_t_barrier(seqcount_t *s) * will complete successfully and see data older than this. */ #define write_seqcount_invalidate(s) \ - write_seqcount_t_invalidate(__seqcount_ptr(s)) + write_seqcount_t_invalidate(seqprop_ptr(s)) static inline void write_seqcount_t_invalidate(seqcount_t *s) { ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork 2020-11-03 0:17 ` Ahmed S. Darwish [not found] ` <20201103002532.GL2620339@nvidia.com> @ 2020-11-03 17:03 ` Peter Xu 1 sibling, 0 replies; 28+ messages in thread From: Peter Xu @ 2020-11-03 17:03 UTC (permalink / raw) To: Ahmed S. Darwish Cc: Jason Gunthorpe, linux-kernel, Linus Torvalds, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner, Sebastian Siewior On Tue, Nov 03, 2020 at 01:17:12AM +0100, Ahmed S. Darwish wrote: > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > index c48f8df6e50268..294c2c3c4fe00d 100644 > > > > +++ b/mm/memory.c > > > > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) > > > > mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, > > > > 0, src_vma, src_mm, addr, end); > > > > mmu_notifier_invalidate_range_start(&range); > > > > + /* > > > > + * The read side doesn't spin, it goes to the mmap_lock, so the > > > > + * raw version is used to avoid disabling preemption here > > > > + */ > > > > + mmap_assert_write_locked(src_mm); > > > > + raw_write_seqcount_t_begin(&src_mm->write_protect_seq); > > > > > > Would raw_write_seqcount_begin() be better here? > > > > Hum.. > > > > I felt no because it had the preempt stuff added into it, however it > > would work - __seqcount_lock_preemptible() == false for the seqcount_t > > case (see below) > > > > Looking more closely, maybe the right API to pick is > > write_seqcount_t_begin() and write_seqcount_t_end() ?? > > > > No, that's not the right API: it is also internal to seqlock.h. > > Please stick with the official exported API: raw_write_seqcount_begin(). > > It should satisfy your needs, and the raw_*() variant is created exactly > for contexts wishing to avoid the lockdep checks (e.g. NMI handlers > cannot invoke lockdep, etc.) Ahmed, Jason, feel free to correct me - but I feel like what Jason wanted here is indeed the version that does not require disabling of preemption, a.k.a., write_seqcount_t_begin() and write_seqcount_t_end(), since it's preempt-safe if the read side does not retry. Not sure whether there's no "*_t_*" version of it. Another idea is that maybe we can use the raw_write_seqcount_begin() version, instead of in copy_page_range() but move it to copy_pte_range(). That would not affect normal Linux on preemption I think, since when reach pte level we should have disabled preemption already after all (by taking the pgtable spin lock). But again there could be extra overhead since we'll need to take the write seqcount very often (rather than once per fork(), so maybe there's some perf influence), also that means it'll be an extra/real disable_preempt() for the future RT code if it'll land some day (since again rt_spin_lock should not need to disable preemption, iiuc, which is used here). So seems it's still better to do in copy_page_range() as Jason proposed. Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork [not found] ` <2-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com> 2020-10-30 16:51 ` [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork Jan Kara 2020-10-30 22:52 ` Peter Xu @ 2020-11-02 23:58 ` Ahmed S. Darwish 2 siblings, 0 replies; 28+ messages in thread From: Ahmed S. Darwish @ 2020-11-02 23:58 UTC (permalink / raw) To: Jason Gunthorpe Cc: linux-kernel, Peter Xu, Linus Torvalds, Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner, Sebastian Siewior On Fri, Oct 30, 2020 at 11:46:21AM -0300, Jason Gunthorpe wrote: ... > diff --git a/mm/memory.c b/mm/memory.c > index c48f8df6e50268..294c2c3c4fe00d 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) > mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, > 0, src_vma, src_mm, addr, end); > mmu_notifier_invalidate_range_start(&range); > + /* > + * The read side doesn't spin, it goes to the mmap_lock, so the > + * raw version is used to avoid disabling preemption here > + */ > + mmap_assert_write_locked(src_mm); > + raw_write_seqcount_t_begin(&src_mm->write_protect_seq); > } > Please, s/raw_write_seqcount_t_begin()/raw_write_seqcount_begin()/g. For plain seqcount_t, it's the same, while still respecting the seqlock.h API boundaries. Let's make the comment also a bit more clear (IMHO, "lockdep" needs to be mentioned somewhere): /* * Disabling preemption is not needed for the write side, as * the read side doesn't spin, but goes to the mmap_lock. * * Use the raw variant of the seqcount_t write API to avoid * lockdep complaining about preemptibility. */ mmap_assert_write_locked(src_mm); raw_write_seqcount_t_begin(&src_mm->write_protect_seq); > ret = 0; > @@ -1187,8 +1193,10 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) > } > } while (dst_pgd++, src_pgd++, addr = next, addr != end); > > - if (is_cow) > + if (is_cow) { > + raw_write_seqcount_t_end(&src_mm->write_protect_seq); ditto. s/raw_write_seqcount_t_end()/raw_write_seqcount_end()/g > mmu_notifier_invalidate_range_end(&range); > + } > return ret; > } > Thanks, -- Ahmed S. Darwish Linutronix GmbH ^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2020-12-09 18:42 UTC | newest] Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <0-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com> 2020-10-30 14:46 ` [PATCH v2 1/2] mm: reorganize internal_get_user_pages_fast() Jason Gunthorpe 2020-10-30 16:29 ` Jan Kara 2020-10-30 21:31 ` John Hubbard 2020-10-30 22:36 ` Peter Xu 2020-11-02 22:19 ` [PATCH v2 0/2] Add a seqcount between gup_fast and copy_page_range() Ahmed S. Darwish 2020-11-02 22:39 ` Linus Torvalds 2020-11-02 23:18 ` Ahmed S. Darwish [not found] ` <2-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com> 2020-10-30 16:51 ` [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork Jan Kara [not found] ` <20201030170226.GF2620339@nvidia.com> 2020-11-02 8:31 ` Jan Kara 2020-10-30 22:52 ` Peter Xu [not found] ` <20201030235121.GQ2620339@nvidia.com> 2020-10-31 15:26 ` Peter Xu 2020-11-03 0:33 ` Ahmed S. Darwish 2020-11-03 0:17 ` Ahmed S. Darwish [not found] ` <20201103002532.GL2620339@nvidia.com> 2020-11-03 0:41 ` Ahmed S. Darwish 2020-11-03 2:20 ` John Hubbard 2020-11-03 6:52 ` Ahmed S. Darwish 2020-11-03 17:40 ` Linus Torvalds 2020-11-04 1:32 ` Ahmed S. Darwish 2020-11-04 2:01 ` John Hubbard 2020-11-04 3:17 ` Ahmed S. Darwish 2020-11-04 18:38 ` Linus Torvalds 2020-11-04 19:54 ` Ahmed S. Darwish 2020-12-09 18:38 ` [tip: locking/core] seqlock: Prefix internal seqcount_t-only macros with a "do_" tip-bot2 for Ahmed S. Darwish 2020-12-09 18:38 ` [tip: locking/core] seqlock: kernel-doc: Specify when preemption is automatically altered tip-bot2 for Ahmed S. Darwish 2020-11-10 11:53 ` [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork Peter Zijlstra 2020-12-03 10:35 ` [tip: locking/core] seqlock: Rename __seqprop() users tip-bot2 for Peter Zijlstra 2020-11-03 17:03 ` [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork Peter Xu 2020-11-02 23:58 ` Ahmed S. Darwish
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).