[-next] lib: disable KCSAN for XArray

Message ID 20200304031551.1326-1-cai@lca.pw
State New

Commit Message

Qian Cai March 4, 2020, 3:15 a.m. UTC
Functions like xas_find_marked(), xas_set_mark(), and xas_clear_mark()
can run concurrently and trigger data-race reports, but each operates
only on a single bit, so the races are pretty much harmless. For example,

 write to 0xffffa0020ee705c8 of 8 bytes by task 39718 on cpu 0:
  xas_set_mark+0x8e/0x190
  arch___test_and_set_bit at arch/x86/include/asm/bitops.h:152
  (inlined by) __test_and_set_bit at include/asm-generic/bitops/instrumented-non-atomic.h:72
  (inlined by) node_set_mark at lib/xarray.c:93
  (inlined by) xas_set_mark at lib/xarray.c:879
  __test_set_page_writeback+0x5de/0x8c0
  iomap_writepage_map+0x8c6/0xf90
  iomap_do_writepage+0x12b/0x450
  write_cache_pages+0x523/0xb20
  iomap_writepages+0x47/0x80
  xfs_vm_writepages+0xc7/0x100 [xfs]
  do_writepages+0x5e/0x130
  __filemap_fdatawrite_range+0x19e/0x1f0
  file_write_and_wait_range+0xc0/0x100
  xfs_file_fsync+0xeb/0x450 [xfs]
  vfs_fsync_range+0x71/0x110
  __x64_sys_msync+0x210/0x2a0
  do_syscall_64+0x91/0xb05
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 read to 0xffffa0020ee705c8 of 8 bytes by task 39717 on cpu 5:
  xas_find_marked+0xe9/0x750
  xas_find_chunk at include/linux/xarray.h:1625
  (inlined by) xas_find_marked at lib/xarray.c:1198
  find_get_pages_range_tag+0x1bf/0xa90
  pagevec_lookup_range_tag+0x46/0x70
  __filemap_fdatawait_range+0xbb/0x270
  file_write_and_wait_range+0xe0/0x100
  xfs_file_fsync+0xeb/0x450 [xfs]
  vfs_fsync_range+0x71/0x110
  __x64_sys_msync+0x210/0x2a0
  do_syscall_64+0x91/0xb05
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Qian Cai <cai@lca.pw>
---
 lib/Makefile | 5 +++++
 1 file changed, 5 insertions(+)
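The diff body is not reproduced on this patchwork page. Given the "lib/Makefile | 5 +++++" diffstat, the change is presumably the usual per-object sanitizer opt-out; a sketch of what those lines plausibly look like (an assumption, not the actual patch):

```make
# lib/Makefile (sketch, not the actual diff): ask the build system not
# to instrument xarray.o with KCSAN, mirroring the existing
# KASAN/UBSAN-style KCSAN_SANITIZE opt-out convention.
KCSAN_SANITIZE_xarray.o := n
```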

Comments

Matthew Wilcox March 4, 2020, 3:33 a.m. UTC | #1
On Tue, Mar 03, 2020 at 10:15:51PM -0500, Qian Cai wrote:
> Functions like xas_find_marked(), xas_set_mark(), and xas_clear_mark()
> could happen concurrently result in data races, but those operate only
> on a single bit that are pretty much harmless. For example,

Those aren't data races.  The writes are protected by a spinlock and the
reads by the RCU read lock.  If the tool can't handle RCU protection,
it's not going to be much use.
Qian Cai March 4, 2020, 3:55 a.m. UTC | #2
> On Mar 3, 2020, at 10:33 PM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Tue, Mar 03, 2020 at 10:15:51PM -0500, Qian Cai wrote:
>> Functions like xas_find_marked(), xas_set_mark(), and xas_clear_mark()
>> could happen concurrently result in data races, but those operate only
>> on a single bit that are pretty much harmless. For example,
> 
> Those aren't data races.  The writes are protected by a spinlock and the
> reads by the RCU read lock.  If the tool can't handle RCU protection,
> it's not going to be much use.
> 

Maybe the commit log is a bit confusing if you did not look at the example
closely enough. It is actually one read and one write that result in a data
race, where one spin_lock() and one rcu_read_lock() can't prevent that from
happening. We also have,

[19522.548668][T39646] BUG: KCSAN: data-race in xas_clear_mark / xas_find_marked 
[19522.583776][T39646]  
[19522.593816][T39646] write to 0xffff9ffb56c5c238 of 8 bytes by interrupt on cpu 16: 
[19522.628560][T39646]  xas_clear_mark+0x8e/0x1a0 
[19522.648993][T39646]  __xa_clear_mark+0xe7/0x110 
[19522.669367][T39646]  test_clear_page_writeback+0x3e9/0x712 
[19522.694638][T39646]  end_page_writeback+0xaa/0x2b0 
[19522.716850][T39646]  iomap_finish_ioend+0x213/0x3b0 
[19522.740070][T39646]  iomap_writepage_end_bio+0x41/0x50 
[19522.763835][T39646]  bio_endio+0x297/0x560 
[19522.782608][T39646]  dec_pending+0x218/0x430 [dm_mod] 
[19522.805389][T39646]  clone_endio+0xe4/0x2c0 [dm_mod] 
[19522.828014][T39646]  bio_endio+0x297/0x560 
[19522.846681][T39646]  blk_update_request+0x201/0x920 
[19522.868929][T39646]  scsi_end_request+0x6b/0x4b0 
[19522.889924][T39646]  scsi_io_completion+0xb7/0x7e0 
[19522.911744][T39646]  scsi_finish_command+0x1ed/0x2a0 
[19522.934411][T39646]  scsi_softirq_done+0x1c9/0x1d0 
[19522.956357][T39646]  blk_done_softirq+0x181/0x1d0 
[19522.977796][T39646]  __do_softirq+0xd9/0x57c 
[19522.997300][T39646]  irq_exit+0xa2/0xc0 
[19523.015469][T39646]  do_IRQ+0x87/0x180 
[19523.032848][T39646]  ret_from_intr+0x0/0x42 
[19523.054251][T39646]  delay_tsc+0x46/0x80 
[19523.074662][T39646]  __const_udelay+0x3c/0x40 
[19523.095161][T39646]  __udelay+0x10/0x20 
[19523.113321][T39646]  kcsan_setup_watchpoint+0x1ec/0x3a0 
[19523.137486][T39646]  __tsan_read8+0xf1/0x110 
[19523.156973][T39646]  xas_find_marked+0xe9/0x750 
[19523.177267][T39646]  find_get_pages_range_tag+0x1bf/0xa90 
[19523.201945][T39646]  pagevec_lookup_range_tag+0x46/0x70 
[19523.226242][T39646]  __filemap_fdatawait_range+0xbb/0x270 
[19523.250977][T39646]  file_write_and_wait_range+0xe0/0x100 
[19523.276077][T39646]  xfs_file_fsync+0xeb/0x450 [xfs] 
[19523.298789][T39646]  vfs_fsync_range+0x71/0x110 
[19523.320550][T39646]  xfs_file_buffered_aio_write+0x6cf/0x790 [xfs] 
[19523.350421][T39646]  xfs_file_write_iter+0x232/0x2d0 [xfs] 
[19523.375536][T39646]  do_iter_readv_writev+0x321/0x400 
[19523.398811][T39646]  do_iter_write+0xdf/0x2b0 
[19523.418758][T39646]  vfs_writev+0xe6/0x170 
[19523.437516][T39646]  do_writev+0xa8/0x140 
[19523.455844][T39646]  __x64_sys_writev+0x4e/0x60 
[19523.476515][T39646]  do_syscall_64+0x91/0xb05 
[19523.496441][T39646]  entry_SYSCALL_64_after_hwframe+0x49/0xbe 
[19523.522946][T39646]  
[19523.533125][T39646] read to 0xffff9ffb56c5c238 of 8 bytes by task 39646 on cpu 16: 
[19523.570758][T39646]  xas_find_marked+0xe9/0x750 
[19523.594276][T39646]  find_get_pages_range_tag+0x1bf/0xa90 
[19523.618877][T39646]  pagevec_lookup_range_tag+0x46/0x70 
[19523.642674][T39646]  __filemap_fdatawait_range+0xbb/0x270 
[19523.667295][T39646]  file_write_and_wait_range+0xe0/0x100 
[19523.692394][T39646]  xfs_file_fsync+0xeb/0x450 [xfs] 
[19523.715149][T39646]  vfs_fsync_range+0x71/0x110 
[19523.736239][T39646]  xfs_file_buffered_aio_write+0x6cf/0x790 [xfs] 
[19523.765345][T39646]  xfs_file_write_iter+0x232/0x2d0 [xfs] 
[19523.791398][T39646]  do_iter_readv_writev+0x321/0x400 
[19523.814709][T39646]  do_iter_write+0xdf/0x2b0 
[19523.834576][T39646]  vfs_writev+0xe6/0x170 
[19523.853486][T39646]  do_writev+0xa8/0x140 
[19523.871952][T39646]  __x64_sys_writev+0x4e/0x60 
[19523.893148][T39646]  do_syscall_64+0x91/0xb05 
[19523.914075][T39646]  entry_SYSCALL_64_after_hwframe+0x49/0xbe 


[19648.209937][T39683] BUG: KCSAN: data-race in find_get_pages_range_tag / xas_set_mark 
[19648.248321][T39683]  
[19648.258683][T39683] write to 0xffffa001c3340238 of 8 bytes by task 39682 on cpu 25: 
[19648.295245][T39683]  xas_set_mark+0x8e/0x190 
[19648.315514][T39683]  __test_set_page_writeback+0x5de/0x8c0 
[19648.341697][T39683]  iomap_writepage_map+0x8c6/0xf90 
[19648.364916][T39683]  iomap_do_writepage+0x12b/0x450 
[19648.388367][T39683]  write_cache_pages+0x523/0xb20 
[19648.410232][T39683]  iomap_writepages+0x47/0x80 
[19648.431404][T39683]  xfs_vm_writepages+0xc7/0x100 [xfs] 
[19648.455333][T39683]  do_writepages+0x5e/0x130 
[19648.476105][T39683]  __filemap_fdatawrite_range+0x19e/0x1f0 
[19648.502048][T39683]  file_write_and_wait_range+0xc0/0x100 
[19648.527175][T39683]  xfs_file_fsync+0xeb/0x450 [xfs] 
[19648.549886][T39683]  vfs_fsync_range+0x71/0x110 
[19648.570508][T39683]  __x64_sys_msync+0x210/0x2a0 
[19648.591742][T39683]  do_syscall_64+0x91/0xb05 
[19648.612093][T39683]  entry_SYSCALL_64_after_hwframe+0x49/0xbe 
[19648.638410][T39683]  
[19648.648486][T39683] read to 0xffffa001c3340238 of 8 bytes by task 39683 on cpu 26: 
[19648.684048][T39683]  find_get_pages_range_tag+0x549/0xa90 
xas_for_each_marked() -> xas_find_marked()
[19648.710117][T39683]  pagevec_lookup_range_tag+0x46/0x70 
[19648.737531][T39683]  __filemap_fdatawait_range+0xbb/0x270 
[19648.762541][T39683]  file_write_and_wait_range+0xe0/0x100 
[19648.787436][T39683]  xfs_file_fsync+0xeb/0x450 [xfs] 
[19648.810136][T39683]  vfs_fsync_range+0x71/0x110 
[19648.830827][T39683]  __x64_sys_msync+0x210/0x2a0 
[19648.852022][T39683]  do_syscall_64+0x91/0xb05 
[19648.871962][T39683]  entry_SYSCALL_64_after_hwframe+0x49/0xbe 
[19648.899496][T39683]  
[19648.909496][T39683] 1 lock held by doio/39683: 
[19648.930880][T39683]  #0: ffffffffaf286cc0 (rcu_read_lock){....}, at: find_get_pages_range_tag+0x10f/0xa90 
[19648.976219][T39683] irq event stamp: 2463763 
[19648.996565][T39683] hardirqs last  enabled at (2463763): [<ffffffffade03ec2>] trace_hardirqs_on_thunk+0x1a/0x1c 
[19649.043090][T39683] hardirqs last disabled at (2463761): [<ffffffffaec002e7>] __do_softirq+0x2e7/0x57c 
[19649.087414][T39683] softirqs last  enabled at (2463762): [<ffffffffaec0034c>] __do_softirq+0x34c/0x57c 
[19649.129838][T39683] softirqs last disabled at (2463635): [<ffffffffadec69f2>] irq_exit+0xa2/0xc0
Matthew Wilcox March 4, 2020, 4 a.m. UTC | #3
On Tue, Mar 03, 2020 at 10:55:02PM -0500, Qian Cai wrote:
> 
> 
> > On Mar 3, 2020, at 10:33 PM, Matthew Wilcox <willy@infradead.org> wrote:
> > 
> > On Tue, Mar 03, 2020 at 10:15:51PM -0500, Qian Cai wrote:
> >> Functions like xas_find_marked(), xas_set_mark(), and xas_clear_mark()
> >> could happen concurrently result in data races, but those operate only
> >> on a single bit that are pretty much harmless. For example,
> > 
> > Those aren't data races.  The writes are protected by a spinlock and the
> > reads by the RCU read lock.  If the tool can't handle RCU protection,
> > it's not going to be much use.
> > 
> 
> Maybe the commit log is a bit confusing if you did not look at the example closely
> enough. It is actually one read and one write result in data races where one spin_lock()
> and one rcu_read_lock()  can’t prevent that from happening. We also have,

Yes.  All these examples have the reader protected by the RCU read lock
and the writer protected by the spinlock.  Does the tool need some kind
of extra annotation here?

> [19522.548668][T39646] BUG: KCSAN: data-race in xas_clear_mark / xas_find_marked 
> [19522.583776][T39646]  
> [19522.593816][T39646] write to 0xffff9ffb56c5c238 of 8 bytes by interrupt on cpu 16: 
> [19522.628560][T39646]  xas_clear_mark+0x8e/0x1a0 
> [19522.648993][T39646]  __xa_clear_mark+0xe7/0x110 
> [19522.669367][T39646]  test_clear_page_writeback+0x3e9/0x712 
> [19522.694638][T39646]  end_page_writeback+0xaa/0x2b0 

Spinlock taken here ^^^

> [19523.533125][T39646] read to 0xffff9ffb56c5c238 of 8 bytes by task 39646 on cpu 16: 
> [19523.570758][T39646]  xas_find_marked+0xe9/0x750 
> [19523.594276][T39646]  find_get_pages_range_tag+0x1bf/0xa90 

RCU read lock taken here ^^^
Paul E. McKenney March 4, 2020, 4:05 a.m. UTC | #4
On Tue, Mar 03, 2020 at 07:33:29PM -0800, Matthew Wilcox wrote:
> On Tue, Mar 03, 2020 at 10:15:51PM -0500, Qian Cai wrote:
> > Functions like xas_find_marked(), xas_set_mark(), and xas_clear_mark()
> > could happen concurrently result in data races, but those operate only
> > on a single bit that are pretty much harmless. For example,
> 
> Those aren't data races.  The writes are protected by a spinlock and the
> reads by the RCU read lock.  If the tool can't handle RCU protection,
> it's not going to be much use.

Would KCSAN's ASSERT_EXCLUSIVE_BITS() help here?

If not, you lost me on this one.  RCU readers don't exclude lock-based
writers.

RCU readers -do- exclude pre-insertion initialization on the one hand,
and those post-removal accesses that follow a grace period, but only
if that grace period starts after the removal.  In addition, the
accesses due to rcu_dereference(), rcu_assign_pointer(), and similar
are guaranteed to work even if they are concurrent.

Or am I missing something subtle here?

That said, you are permitted to define "data race" for your subsystem
by choosing KCSAN settings, up to and including keeping KCSAN out
entirely.

							Thanx, Paul
Matthew Wilcox March 4, 2020, 4:33 a.m. UTC | #5
On Tue, Mar 03, 2020 at 08:05:15PM -0800, Paul E. McKenney wrote:
> On Tue, Mar 03, 2020 at 07:33:29PM -0800, Matthew Wilcox wrote:
> > On Tue, Mar 03, 2020 at 10:15:51PM -0500, Qian Cai wrote:
> > > Functions like xas_find_marked(), xas_set_mark(), and xas_clear_mark()
> > > could happen concurrently result in data races, but those operate only
> > > on a single bit that are pretty much harmless. For example,
> > 
> > Those aren't data races.  The writes are protected by a spinlock and the
> > reads by the RCU read lock.  If the tool can't handle RCU protection,
> > it's not going to be much use.
> 
> Would KCSAN's ASSERT_EXCLUSIVE_BITS() help here?

I'm quite lost in the sea of new macros that have been added to help
with KCSAN.  It doesn't help that they're in -next somewhere that I
can't find, and not in mainline yet.  Is there documentation somewhere?

> RCU readers -do- exclude pre-insertion initialization on the one hand,
> and those post-removal accesses that follow a grace period, but only
> if that grace period starts after the removal.  In addition, the
> accesses due to rcu_dereference(), rcu_assign_pointer(), and similar
> are guaranteed to work even if they are concurrent.
> 
> Or am I missing something subtle here?

I probably am.  An XArray is composed of a tree of xa_nodes:

struct xa_node {
        unsigned char   shift;          /* Bits remaining in each slot */
        unsigned char   offset;         /* Slot offset in parent */
        unsigned char   count;          /* Total entry count */
        unsigned char   nr_values;      /* Value entry count */
        struct xa_node __rcu *parent;   /* NULL at top of tree */
        struct xarray   *array;         /* The array we belong to */
        union {
                struct list_head private_list;  /* For tree user */
                struct rcu_head rcu_head;       /* Used when freeing node */
        };
        void __rcu      *slots[XA_CHUNK_SIZE];
        union {
                unsigned long   tags[XA_MAX_MARKS][XA_MARK_LONGS];
                unsigned long   marks[XA_MAX_MARKS][XA_MARK_LONGS];
        };
};

'shift' is initialised before the node is inserted into the tree.
Ditto 'offset'.  'count' and 'nr_values' should only be touched with the
xa_lock held.  'parent' might be modified with the lock held and an RCU
reader expecting to see either the previous or new value.  'array' should
not change once the node is inserted.  private_list is, I believe, only
modified with the lock held.  'slots' may be modified with the xa_lock
held, and simultaneously read by an RCU reader.  Ditto 'tags'/'marks'.

The RCU readers are prepared for what they see to be inconsistent --
a fact of life when dealing with RCU!  So in a sense, yes, there is a
race there.  But it's a known, accepted race, and that acceptance is
indicated by the fact that the RCU lock is held.  Does there need to be
more annotation here?  Or is there an un-noticed bug that the tool is
legitimately pointing out?
Paul E. McKenney March 4, 2020, 2:10 p.m. UTC | #6
On Tue, Mar 03, 2020 at 08:33:56PM -0800, Matthew Wilcox wrote:
> On Tue, Mar 03, 2020 at 08:05:15PM -0800, Paul E. McKenney wrote:
> > On Tue, Mar 03, 2020 at 07:33:29PM -0800, Matthew Wilcox wrote:
> > > On Tue, Mar 03, 2020 at 10:15:51PM -0500, Qian Cai wrote:
> > > > Functions like xas_find_marked(), xas_set_mark(), and xas_clear_mark()
> > > > could happen concurrently result in data races, but those operate only
> > > > on a single bit that are pretty much harmless. For example,
> > > 
> > > Those aren't data races.  The writes are protected by a spinlock and the
> > > reads by the RCU read lock.  If the tool can't handle RCU protection,
> > > it's not going to be much use.
> > 
> > Would KCSAN's ASSERT_EXCLUSIVE_BITS() help here?
> 
> I'm quite lost in the sea of new macros that have been added to help
> with KCSAN.  It doesn't help that they're in -next somewhere that I
> can't find, and not in mainline yet.  Is there documentation somewhere?

Yes, there is documentation.  In -next somewhere.  :-/

Early days, apologies for the construction!

> > RCU readers -do- exclude pre-insertion initialization on the one hand,
> > and those post-removal accesses that follow a grace period, but only
> > if that grace period starts after the removal.  In addition, the
> > accesses due to rcu_dereference(), rcu_assign_pointer(), and similar
> > are guaranteed to work even if they are concurrent.
> > 
> > Or am I missing something subtle here?
> 
> I probably am.  An XArray is composed of a tree of xa_nodes:
> 
> struct xa_node {
>         unsigned char   shift;          /* Bits remaining in each slot */
>         unsigned char   offset;         /* Slot offset in parent */
>         unsigned char   count;          /* Total entry count */
>         unsigned char   nr_values;      /* Value entry count */
>         struct xa_node __rcu *parent;   /* NULL at top of tree */
>         struct xarray   *array;         /* The array we belong to */
>         union {
>                 struct list_head private_list;  /* For tree user */
>                 struct rcu_head rcu_head;       /* Used when freeing node */
>         };
>         void __rcu      *slots[XA_CHUNK_SIZE];
>         union {
>                 unsigned long   tags[XA_MAX_MARKS][XA_MARK_LONGS];
>                 unsigned long   marks[XA_MAX_MARKS][XA_MARK_LONGS];
>         };
> };
> 
> 'shift' is initialised before the node is inserted into the tree.
> Ditto 'offset'.

Very good, then both ->shift and ->offset can be accessed using normal
C-language loads and stores even by the most strict definition of data race.

>                  'count' and 'nr_values' should only be touched with the
> xa_lock held.  'parent' might be modified with the lock held and an RCU
> reader expecting to see either the previous or new value.  'array' should
> not change once the node is inserted.  private_list is, I believe, only
> modified with the lock held.  'slots' may be modified with the xa_lock
> held, and simultaneously read by an RCU reader.  Ditto 'tags'/'marks'.

If ->count and ->nr_values are never accessed by readers, they can also
use plain C-language loads and stores.

KCSAN expects that accesses to the ->parent field should be marked.
But if ->parent is always accessed via things like rcu_dereference()
and rcu_assign_pointer() (guessing based on the __rcu), then KCSAN
won't complain.

The ->array can be accessed using plain C-language loads and stores.

If ->private_list is never accessed without holding the lock, then
plain C-language loads and stores work for it without KCSAN complaints.

The situation with ->slots is the same as that for ->parent.

KCSAN expects accesses to the ->tags[] and ->marks[] arrays to be marked.
However, the default configuration of KCSAN asks only that the reads
be marked.  (Within RCU, I instead configure KCSAN so that it asks that
both be marked, but it is of course your choice within your code.)

> The RCU readers are prepared for what they see to be inconsistent --
> a fact of life when dealing with RCU!  So in a sense, yes, there is a
> race there.  But it's a known, accepted race, and that acceptance is
> indicated by the fact that the RCU lock is held.  Does there need to be
> more annotation here?  Or is there an un-noticed bug that the tool is
> legitimately pointing out?

The answer to both questions is "maybe", depending on the code using
the values read.  Yes, it would be nice if KCSAN could figure out the
difference, but there are limits to what a tool can do.  And things
are sometimes non-obvious, for example:

	switch (foo) {
	case 1:  do_something_1();  break;
	case 3:  do_something_3();  break;
	case 7:  do_something_7();  break;
	case 19: do_something_19(); break;
	case 23: do_something_23(); break;
	}

Only one access to "foo", so all is well, right?

Sadly, wrong.  Compilers can create jump tables, and will often emit two
loads from "foo", one to check against the table size, and the other to
index the table.

Other potential traps may be found in https://lwn.net/Articles/793253/
("Who's afraid of a big bad optimizing compiler?").

One approach is to use READ_ONCE() on the reads in the RCU read-side
critical section that are subject to concurrent update.  Another is to
use the data_race() macro (as in data_race(foo) in the switch statement
above) to tell KCSAN that you have analyzed the compiler's response.
These first two can be mixed, if you like.  And yet another is the patch
proposed by Qian if you want KCSAN to get out of your code altogether.

Within RCU, I mark the accesses rather aggressively.  RCU is quite
concurrent, and the implied documentation is very worthwhile.

Your mileage may vary, of course.  For one thing, your actuarial
statistics are quite likely significantly more favorable than are mine.
Not that mine are at all bad, particularly by the standards of a century
or two ago.  ;-)

							Thanx, Paul
Marco Elver March 4, 2020, 4:40 p.m. UTC | #7
On Wed, 4 Mar 2020 at 15:10, Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Tue, Mar 03, 2020 at 08:33:56PM -0800, Matthew Wilcox wrote:
> > On Tue, Mar 03, 2020 at 08:05:15PM -0800, Paul E. McKenney wrote:
> > > On Tue, Mar 03, 2020 at 07:33:29PM -0800, Matthew Wilcox wrote:
> > > > On Tue, Mar 03, 2020 at 10:15:51PM -0500, Qian Cai wrote:
> > > > > Functions like xas_find_marked(), xas_set_mark(), and xas_clear_mark()
> > > > > could happen concurrently result in data races, but those operate only
> > > > > on a single bit that are pretty much harmless. For example,

I currently do not see those as justification to blacklist the whole
file. Wouldn't __no_kcsan be better? That is, in case there is no
other solution that emerges from the remainder of the discussion here.

> > > > Those aren't data races.  The writes are protected by a spinlock and the
> > > > reads by the RCU read lock.  If the tool can't handle RCU protection,
> > > > it's not going to be much use.
> > >
> > > Would KCSAN's ASSERT_EXCLUSIVE_BITS() help here?
> >
> > I'm quite lost in the sea of new macros that have been added to help
> > with KCSAN.  It doesn't help that they're in -next somewhere that I
> > can't find, and not in mainline yet.  Is there documentation somewhere?
>
> Yes, there is documentation.  In -next somewhere.  :-/
>
> Early days, apologies for the construction!

I just sent a patch updating documentation, just to make sure we have
something up-to-date we can refer to in the meantime.

A preview of generated documentation for ASSERT_EXCLUSIVE_BITS (and
others) is here:
https://htmlpreview.github.io/?https://github.com/google/ktsan/blob/kcsan-kerneldoc/output/dev-tools/kcsan.html#race-detection-beyond-data-races

Hope that helps somewhat.


> > > RCU readers -do- exclude pre-insertion initialization on the one hand,
> > > and those post-removal accesses that follow a grace period, but only
> > > if that grace period starts after the removal.  In addition, the
> > > accesses due to rcu_dereference(), rcu_assign_pointer(), and similar
> > > are guaranteed to work even if they are concurrent.
> > >
> > > Or am I missing something subtle here?
> >
> > I probably am.  An XArray is composed of a tree of xa_nodes:
> >
> > struct xa_node {
> >         unsigned char   shift;          /* Bits remaining in each slot */
> >         unsigned char   offset;         /* Slot offset in parent */
> >         unsigned char   count;          /* Total entry count */
> >         unsigned char   nr_values;      /* Value entry count */
> >         struct xa_node __rcu *parent;   /* NULL at top of tree */
> >         struct xarray   *array;         /* The array we belong to */
> >         union {
> >                 struct list_head private_list;  /* For tree user */
> >                 struct rcu_head rcu_head;       /* Used when freeing node */
> >         };
> >         void __rcu      *slots[XA_CHUNK_SIZE];
> >         union {
> >                 unsigned long   tags[XA_MAX_MARKS][XA_MARK_LONGS];
> >                 unsigned long   marks[XA_MAX_MARKS][XA_MARK_LONGS];
> >         };
> > };
> >
> > 'shift' is initialised before the node is inserted into the tree.
> > Ditto 'offset'.
>
> Very good, then both ->shift and ->offset can be accessed using normal
> C-language loads and stores even by most strict definition of data race.
>
> >                  'count' and 'nr_values' should only be touched with the
> > xa_lock held.  'parent' might be modified with the lock held and an RCU
> > reader expecting to see either the previous or new value.  'array' should
> > not change once the node is inserted.  private_list is, I believe, only
> > modified with the lock held.  'slots' may be modified with the xa_lock
> > held, and simultaneously read by an RCU reader.  Ditto 'tags'/'marks'.
>
> If ->count and ->nr_values are never accessed by readers, they can also
> use plain C-language loads and stores.
>
> KCSAN expects that accesses to the ->parent field should be marked.
> But if ->parent is always accessed via things like rcu_dereference()
> and rcu_assign_pointer() (guessing based on the __rcu), then KCSAN
> won't complain.
>
> The ->array can be accessed using plain C-language loads and stores.
>
> If ->private_list is never accessed without holding the lock, then
> plain C-language loads and stores work for it without KCSAN complaints.
>
> The situation with ->slots is the same as that for ->parent.
>
> KCSAN expects accesses to the ->tags[] and ->marks[] arrays to be marked.
> However, the default configuration of KCSAN asks only that the reads
> be marked.  (Within RCU, I instead configure KCSAN so that it asks that
> both be marked, but it is of course your choice within your code.)
>
> > The RCU readers are prepared for what they see to be inconsistent --
> > a fact of life when dealing with RCU!  So in a sense, yes, there is a
> > race there.  But it's a known, accepted race, and that acceptance is
> > indicated by the fact that the RCU lock is held.  Does there need to be
> > more annotation here?  Or is there an un-noticed bug that the tool is
> > legitimately pointing out?
>
> The answer to both questions is "maybe", depending on the code using
> the values read.  Yes, it would be nice if KCSAN could figure out the
> difference, but there are limits to what a tool can do.  And things
> are sometimes no-obvious, for example:
>
>         switch (foo) {
>         case 1:  do_something_1();  break;
>         case 3:  do_something_3();  break;
>         case 7:  do_something_7();  break;
>         case 19: do_something_19(); break;
>         case 23: do_something_23(); break;
>         }
>
> Only one access to "foo", so all is well, right?
>
> Sadly, wrong.  Compilers can create jump tables, and will often emit two
> loads from "foo", one to check against the table size, and the other to
> index the table.
>
> Other potential traps may be found in https://lwn.net/Articles/793253/
> ("Who's afraid of a big bad optimizing compiler?").
>
> One approach is to use READ_ONCE() on the reads in the RCU read-side
> critical section that are subject to concurrent update.  Another is to
> use the data_race() macro (as in data_race(foo) in the switch statement
> above) to tell KCSAN that you have analyzed the compiler's response.
> These first two can be mixed, if you like.  And yet another is the patch
> proposed by Qian if you want KCSAN to get out of your code altogether.
>
> Within RCU, I mark the accesses rather aggressively.  RCU is quite
> concurrent, and the implied documentation is very worthwhile.
>
> Your mileage may vary, of course.  For one thing, your actuarial
> statistics are quite likley significantly more favorable than are mine.
> Not that mine are at all bad, particularly by the standards of a century
> or two ago.  ;-)
>
>                                                         Thanx, Paul
Qian Cai March 4, 2020, 5:10 p.m. UTC | #8
On Wed, 2020-03-04 at 17:40 +0100, Marco Elver wrote:
> On Wed, 4 Mar 2020 at 15:10, Paul E. McKenney <paulmck@kernel.org> wrote:
> > 
> > On Tue, Mar 03, 2020 at 08:33:56PM -0800, Matthew Wilcox wrote:
> > > On Tue, Mar 03, 2020 at 08:05:15PM -0800, Paul E. McKenney wrote:
> > > > On Tue, Mar 03, 2020 at 07:33:29PM -0800, Matthew Wilcox wrote:
> > > > > On Tue, Mar 03, 2020 at 10:15:51PM -0500, Qian Cai wrote:
> > > > > > Functions like xas_find_marked(), xas_set_mark(), and xas_clear_mark()
> > > > > > could happen concurrently result in data races, but those operate only
> > > > > > on a single bit that are pretty much harmless. For example,
> 
> I currently do not see those as justification to blacklist the whole
> file. Wouldn't __no_kcsan be better? That is, in case there is no
> other solution that emerges from the remainder of the discussion here.

I suppose it is up to Matthew. Currently, I can see several functions that
may need __no_kcsan,

xa_get_mark(), xas_find_marked(), xas_find_chunk() etc.

My worry is that there could be many of those functions operating on marks
(which are single bits) in XArray that could end up needing the same treatment.

So far, my testing is thin on the filesystem side, where XArray is pretty much
used for page caches, so the reports I got from KCSAN runs do not necessarily
tell the whole story. Once I update my KCSAN runs to include things like
xfstests, we could see quite a bit of churn later if we decide to go with the
__no_kcsan route.
Paul E. McKenney March 4, 2020, 5:14 p.m. UTC | #9
On Wed, Mar 04, 2020 at 12:10:58PM -0500, Qian Cai wrote:
> On Wed, 2020-03-04 at 17:40 +0100, Marco Elver wrote:
> > On Wed, 4 Mar 2020 at 15:10, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > 
> > > On Tue, Mar 03, 2020 at 08:33:56PM -0800, Matthew Wilcox wrote:
> > > > On Tue, Mar 03, 2020 at 08:05:15PM -0800, Paul E. McKenney wrote:
> > > > > On Tue, Mar 03, 2020 at 07:33:29PM -0800, Matthew Wilcox wrote:
> > > > > > On Tue, Mar 03, 2020 at 10:15:51PM -0500, Qian Cai wrote:
> > > > > > > Functions like xas_find_marked(), xas_set_mark(), and xas_clear_mark()
> > > > > > > could happen concurrently result in data races, but those operate only
> > > > > > > on a single bit that are pretty much harmless. For example,
> > 
> > I currently do not see those as justification to blacklist the whole
> > file. Wouldn't __no_kcsan be better? That is, in case there is no
> > other solution that emerges from the remainder of the discussion here.
> 
> I suppose it is up to Matthew. Currently, I can see there are several functions
> may need __no_kcsan,
> 
> xa_get_mark(), xas_find_marked(), xas_find_chunk() etc.
> 
> My worry was that there could be many of those functions operating on marks
> (which is a single-bit) in xarray that could end up needing the same treatment.
> 
> So far, my testing is thin on filesystem side where xarray is pretty much used
> for page caches, so the reports I got from KCSAN runs does not necessary tell
> the whole story. Once I updated my KCSAN runs to include things like xfstests,
> it could add quite a few churns later if we decided to go with the __no_kcsan
> route.

Agreed, the filesystems developers and maintainers will decide how they
would like to approach this.

							Thanx, Paul
Matthew Wilcox March 5, 2020, 3:18 p.m. UTC | #10
On Wed, Mar 04, 2020 at 06:10:21AM -0800, Paul E. McKenney wrote:
> On Tue, Mar 03, 2020 at 08:33:56PM -0800, Matthew Wilcox wrote:
> > On Tue, Mar 03, 2020 at 08:05:15PM -0800, Paul E. McKenney wrote:
> > > On Tue, Mar 03, 2020 at 07:33:29PM -0800, Matthew Wilcox wrote:
> > > > On Tue, Mar 03, 2020 at 10:15:51PM -0500, Qian Cai wrote:
> > > > > Functions like xas_find_marked(), xas_set_mark(), and xas_clear_mark()
> > > > > could happen concurrently result in data races, but those operate only
> > > > > on a single bit that are pretty much harmless. For example,
> > > > 
> > > > Those aren't data races.  The writes are protected by a spinlock and the
> > > > reads by the RCU read lock.  If the tool can't handle RCU protection,
> > > > it's not going to be much use.
> > > 
> > > Would KCSAN's ASSERT_EXCLUSIVE_BITS() help here?
> > 
> > I'm quite lost in the sea of new macros that have been added to help
> > with KCSAN.  It doesn't help that they're in -next somewhere that I
> > can't find, and not in mainline yet.  Is there documentation somewhere?
> 
> Yes, there is documentation.  In -next somewhere.  :-/

Looking at the documentation link that Marco kindly provided, I don't
think ASSERT_EXCLUSIVE_BITS is the correct annotation to make.  The bits
in question are being accessed in parallel, it's just that we're willing
to accept a certain degree of inaccuracy.

> > struct xa_node {
> >         unsigned char   shift;          /* Bits remaining in each slot */
> >         unsigned char   offset;         /* Slot offset in parent */
> >         unsigned char   count;          /* Total entry count */
> >         unsigned char   nr_values;      /* Value entry count */
> >         struct xa_node __rcu *parent;   /* NULL at top of tree */
> >         struct xarray   *array;         /* The array we belong to */
> >         union {
> >                 struct list_head private_list;  /* For tree user */
> >                 struct rcu_head rcu_head;       /* Used when freeing node */
> >         };
> >         void __rcu      *slots[XA_CHUNK_SIZE];
> >         union {
> >                 unsigned long   tags[XA_MAX_MARKS][XA_MARK_LONGS];
> >                 unsigned long   marks[XA_MAX_MARKS][XA_MARK_LONGS];
> >         };
> > };
> > 
> > 'slots' may be modified with the xa_lock
> > held, and simultaneously read by an RCU reader.  Ditto 'tags'/'marks'.
> 
> KCSAN expects that accesses to the ->parent field should be marked.
> But if ->parent is always accessed via things like rcu_dereference()
> and rcu_assign_pointer() (guessing based on the __rcu), then KCSAN
> won't complain.
[...]
> The situation with ->slots is the same as that for ->parent.
> 
> KCSAN expects accesses to the ->tags[] and ->marks[] arrays to be marked.
> However, the default configuration of KCSAN asks only that the reads
> be marked.  (Within RCU, I instead configure KCSAN so that it asks that
> both be marked, but it is of course your choice within your code.)
> 
> > The RCU readers are prepared for what they see to be inconsistent --
> > a fact of life when dealing with RCU!  So in a sense, yes, there is a
> > race there.  But it's a known, accepted race, and that acceptance is
> > indicated by the fact that the RCU lock is held.  Does there need to be
> > more annotation here?  Or is there an un-noticed bug that the tool is
> > legitimately pointing out?
> 
> The answer to both questions is "maybe", depending on the code using
> the values read.  Yes, it would be nice if KCSAN could figure out the
> difference, but there are limits to what a tool can do.  And things
> are sometimes non-obvious, for example:

I require the following properties from this array of bits:

 - If a bit was set before and after the modification, it must be seen to
   be set.
 - If a bit was clear before and after the modification, it must be seen to
   be clear.
 - If a bit is modified, it may be seen as set or clear.
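
These three properties are exactly what a single aligned machine-word load
guarantees with one writer: any snapshot the reader can observe is either the
old or the new value of the word, so every bit that agrees before and after
the modification is stable. A hypothetical userspace sketch (not kernel code)
of that invariant:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch: with a single aligned word load racing against one
 * writer, the reader observes either the 'before' or the 'after' value of
 * the word.  Check that every bit which is equal in 'before' and 'after'
 * is also equal in the observed snapshot.
 */
static bool stable_bits_preserved(unsigned long before, unsigned long after,
				  unsigned long snapshot)
{
	unsigned long stable = ~(before ^ after);	/* bits untouched by the writer */

	return (snapshot & stable) == (before & stable);
}
```

The only way to lose this guarantee is a plain load that the compiler splits
or re-issues, which is precisely what READ_ONCE() rules out.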

I have found three locations where we use the ->marks array:

1.
                        unsigned long data = *addr & (~0UL << offset);
                        if (data)
                                return __ffs(data);

2.
        return find_next_bit(addr, XA_CHUNK_SIZE, offset);
3.
        return test_bit(offset, node_marks(node, mark));

The modifications -- all done with the spinlock held -- use the non-atomic
bitops:
        return __test_and_set_bit(offset, node_marks(node, mark));
        return __test_and_clear_bit(offset, node_marks(node, mark));
        bitmap_fill(node_marks(node, mark), XA_CHUNK_SIZE);
(that last one doesn't really count -- it's done prior to placing the node
in the tree)

The first read seems straightforward; I can place a READ_ONCE around
*addr.  The second & third reads are rather less straightforward.
find_next_bit() and test_bit() are common code and use plain loads today.
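
For read #1, the change could look like the following userspace sketch.
READ_ONCE() is modelled with a volatile cast, __ffs() with GCC's
__builtin_ctzl(), XA_CHUNK_SIZE is shrunk to one word, and the function name
is made up for illustration:

```c
#include <assert.h>
#include <limits.h>

/* Userspace stand-in for the kernel's READ_ONCE() from <linux/compiler.h>. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

#define XA_CHUNK_SIZE (sizeof(unsigned long) * CHAR_BIT)	/* one word, for the sketch */

/*
 * Sketch of read #1 with the plain dereference replaced by READ_ONCE():
 * the word is snapshotted exactly once, so the compiler cannot tear or
 * re-issue the load.
 */
static unsigned long find_marked_in_word(const unsigned long *addr,
					 unsigned long offset)
{
	unsigned long data = READ_ONCE(*addr) & (~0UL << offset);

	if (data)
		return (unsigned long)__builtin_ctzl(data);	/* __ffs() equivalent */
	return XA_CHUNK_SIZE;	/* nothing marked at or after 'offset' */
}
```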

> Your mileage may vary, of course.  For one thing, your actuarial
> statistics are quite likely significantly more favorable than are mine.
> Not that mine are at all bad, particularly by the standards of a century
> or two ago.  ;-)

I don't know ... my preferred form of exercise takes me into the
harm's way of cars, so while I've significantly lowered my risk of
dying of cardiovascular disease, I'm much more likely to be killed by
an inattentive human being!  Bring on our robot cars ...
Paul E. McKenney March 5, 2020, 9:39 p.m. UTC | #11
On Thu, Mar 05, 2020 at 07:18:31AM -0800, Matthew Wilcox wrote:
> On Wed, Mar 04, 2020 at 06:10:21AM -0800, Paul E. McKenney wrote:
> > On Tue, Mar 03, 2020 at 08:33:56PM -0800, Matthew Wilcox wrote:
> > > On Tue, Mar 03, 2020 at 08:05:15PM -0800, Paul E. McKenney wrote:
> > > > On Tue, Mar 03, 2020 at 07:33:29PM -0800, Matthew Wilcox wrote:
> > > > > On Tue, Mar 03, 2020 at 10:15:51PM -0500, Qian Cai wrote:
> > > > > > Functions like xas_find_marked(), xas_set_mark(), and xas_clear_mark()
> > > > > > can run concurrently and result in data races, but they operate only
> > > > > > on a single bit, so those races are pretty much harmless. For example,
> > > > > 
> > > > > Those aren't data races.  The writes are protected by a spinlock and the
> > > > > reads by the RCU read lock.  If the tool can't handle RCU protection,
> > > > > it's not going to be much use.
> > > > 
> > > > Would KCSAN's ASSERT_EXCLUSIVE_BITS() help here?
> > > 
> > > I'm quite lost in the sea of new macros that have been added to help
> > > with KCSAN.  It doesn't help that they're in -next somewhere that I
> > > can't find, and not in mainline yet.  Is there documentation somewhere?
> > 
> > Yes, there is documentation.  In -next somewhere.  :-/
> 
> Looking at the documentation link that Marco kindly provided, I don't
> think ASSERT_EXCLUSIVE_BITS is the correct annotation to make.  The bits
> in question are being accessed in parallel, it's just that we're willing
> to accept a certain degree of inaccuracy.
> 
> > > struct xa_node {
> > >         unsigned char   shift;          /* Bits remaining in each slot */
> > >         unsigned char   offset;         /* Slot offset in parent */
> > >         unsigned char   count;          /* Total entry count */
> > >         unsigned char   nr_values;      /* Value entry count */
> > >         struct xa_node __rcu *parent;   /* NULL at top of tree */
> > >         struct xarray   *array;         /* The array we belong to */
> > >         union {
> > >                 struct list_head private_list;  /* For tree user */
> > >                 struct rcu_head rcu_head;       /* Used when freeing node */
> > >         };
> > >         void __rcu      *slots[XA_CHUNK_SIZE];
> > >         union {
> > >                 unsigned long   tags[XA_MAX_MARKS][XA_MARK_LONGS];
> > >                 unsigned long   marks[XA_MAX_MARKS][XA_MARK_LONGS];
> > >         };
> > > };
> > > 
> > > 'slots' may be modified with the xa_lock
> > > held, and simultaneously read by an RCU reader.  Ditto 'tags'/'marks'.
> > 
> > KCSAN expects that accesses to the ->parent field should be marked.
> > But if ->parent is always accessed via things like rcu_dereference()
> > and rcu_assign_pointer() (guessing based on the __rcu), then KCSAN
> > won't complain.
> [...]
> > The situation with ->slots is the same as that for ->parent.
> > 
> > KCSAN expects accesses to the ->tags[] and ->marks[] arrays to be marked.
> > However, the default configuration of KCSAN asks only that the reads
> > be marked.  (Within RCU, I instead configure KCSAN so that it asks that
> > both be marked, but it is of course your choice within your code.)
> > 
> > > The RCU readers are prepared for what they see to be inconsistent --
> > > a fact of life when dealing with RCU!  So in a sense, yes, there is a
> > > race there.  But it's a known, accepted race, and that acceptance is
> > > indicated by the fact that the RCU lock is held.  Does there need to be
> > > more annotation here?  Or is there an un-noticed bug that the tool is
> > > legitimately pointing out?
> > 
> > The answer to both questions is "maybe", depending on the code using
> > the values read.  Yes, it would be nice if KCSAN could figure out the
> > difference, but there are limits to what a tool can do.  And things
> > are sometimes non-obvious, for example:
> 
> I require the following properties from this array of bits:
> 
>  - If a bit was set before and after the modification, it must be seen to
>    be set.
>  - If a bit was clear before and after the modification, it must be seen to
>    be clear.
>  - If a bit is modified, it may be seen as set or clear.
> 
> I have found three locations where we use the ->marks array:
> 
> 1.
>                         unsigned long data = *addr & (~0UL << offset);
>                         if (data)
>                                 return __ffs(data);
> 
> 2.
>         return find_next_bit(addr, XA_CHUNK_SIZE, offset);
> 3.
>         return test_bit(offset, node_marks(node, mark));
> 
> The modifications -- all done with the spinlock held -- use the non-atomic
> bitops:
>         return __test_and_set_bit(offset, node_marks(node, mark));
>         return __test_and_clear_bit(offset, node_marks(node, mark));
>         bitmap_fill(node_marks(node, mark), XA_CHUNK_SIZE);
> (that last one doesn't really count -- it's done prior to placing the node
> in the tree)
> 
> The first read seems straightforward; I can place a READ_ONCE around
> *addr.  The second & third reads are rather less straightforward.
> find_next_bit() and test_bit() are common code and use plain loads today.

Yes, those last two are a bit annoying, aren't they?  I guess the first
thing would be placing READ_ONCE() inside them, and if that results in
regressions, have an alternative API for concurrent access?

> > Your mileage may vary, of course.  For one thing, your actuarial
> > statistics are quite likely significantly more favorable than are mine.
> > Not that mine are at all bad, particularly by the standards of a century
> > or two ago.  ;-)
> 
> I don't know ... my preferred form of exercise takes me into the
> harm's way of cars, so while I've significantly lowered my risk of
> dying of cardiovascular disease, I'm much more likely to be killed by
> an inattentive human being!  Bring on our robot cars ...

I did my 40 years using that mode of transport.  But my reflexes are
no longer up to the task.  So I have no advice for you on that one!
Other than to question your trust in robot cars whose code was produced
by human developers!  ;-)

							Thanx, Paul
Marco Elver March 6, 2020, 1:38 p.m. UTC | #12
On Thu, 5 Mar 2020 at 22:39, Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Thu, Mar 05, 2020 at 07:18:31AM -0800, Matthew Wilcox wrote:
> > On Wed, Mar 04, 2020 at 06:10:21AM -0800, Paul E. McKenney wrote:
> > > On Tue, Mar 03, 2020 at 08:33:56PM -0800, Matthew Wilcox wrote:
> > > > On Tue, Mar 03, 2020 at 08:05:15PM -0800, Paul E. McKenney wrote:
> > > > > On Tue, Mar 03, 2020 at 07:33:29PM -0800, Matthew Wilcox wrote:
> > > > > > On Tue, Mar 03, 2020 at 10:15:51PM -0500, Qian Cai wrote:
> > > > > > > Functions like xas_find_marked(), xas_set_mark(), and xas_clear_mark()
> > > > > > > can run concurrently and result in data races, but they operate only
> > > > > > > on a single bit, so those races are pretty much harmless. For example,
> > > > > >
> > > > > > Those aren't data races.  The writes are protected by a spinlock and the
> > > > > > reads by the RCU read lock.  If the tool can't handle RCU protection,
> > > > > > it's not going to be much use.
> > > > >
> > > > > Would KCSAN's ASSERT_EXCLUSIVE_BITS() help here?
> > > >
> > > > I'm quite lost in the sea of new macros that have been added to help
> > > > with KCSAN.  It doesn't help that they're in -next somewhere that I
> > > > can't find, and not in mainline yet.  Is there documentation somewhere?
> > >
> > > Yes, there is documentation.  In -next somewhere.  :-/
> >
> > Looking at the documentation link that Marco kindly provided, I don't
> > think ASSERT_EXCLUSIVE_BITS is the correct annotation to make.  The bits
> > in question are being accessed in parallel, it's just that we're willing
> > to accept a certain degree of inaccuracy.
> >
> > > > struct xa_node {
> > > >         unsigned char   shift;          /* Bits remaining in each slot */
> > > >         unsigned char   offset;         /* Slot offset in parent */
> > > >         unsigned char   count;          /* Total entry count */
> > > >         unsigned char   nr_values;      /* Value entry count */
> > > >         struct xa_node __rcu *parent;   /* NULL at top of tree */
> > > >         struct xarray   *array;         /* The array we belong to */
> > > >         union {
> > > >                 struct list_head private_list;  /* For tree user */
> > > >                 struct rcu_head rcu_head;       /* Used when freeing node */
> > > >         };
> > > >         void __rcu      *slots[XA_CHUNK_SIZE];
> > > >         union {
> > > >                 unsigned long   tags[XA_MAX_MARKS][XA_MARK_LONGS];
> > > >                 unsigned long   marks[XA_MAX_MARKS][XA_MARK_LONGS];
> > > >         };
> > > > };
> > > >
> > > > 'slots' may be modified with the xa_lock
> > > > held, and simultaneously read by an RCU reader.  Ditto 'tags'/'marks'.
> > >
> > > KCSAN expects that accesses to the ->parent field should be marked.
> > > But if ->parent is always accessed via things like rcu_dereference()
> > > and rcu_assign_pointer() (guessing based on the __rcu), then KCSAN
> > > won't complain.
> > [...]
> > > The situation with ->slots is the same as that for ->parent.
> > >
> > > KCSAN expects accesses to the ->tags[] and ->marks[] arrays to be marked.
> > > However, the default configuration of KCSAN asks only that the reads
> > > be marked.  (Within RCU, I instead configure KCSAN so that it asks that
> > > both be marked, but it is of course your choice within your code.)
> > >
> > > > The RCU readers are prepared for what they see to be inconsistent --
> > > > a fact of life when dealing with RCU!  So in a sense, yes, there is a
> > > > race there.  But it's a known, accepted race, and that acceptance is
> > > > indicated by the fact that the RCU lock is held.  Does there need to be
> > > > more annotation here?  Or is there an un-noticed bug that the tool is
> > > > legitimately pointing out?
> > >
> > > The answer to both questions is "maybe", depending on the code using
> > > the values read.  Yes, it would be nice if KCSAN could figure out the
> > > difference, but there are limits to what a tool can do.  And things
> > > are sometimes non-obvious, for example:
> >
> > I require the following properties from this array of bits:
> >
> >  - If a bit was set before and after the modification, it must be seen to
> >    be set.
> >  - If a bit was clear before and after the modification, it must be seen to
> >    be clear.
> >  - If a bit is modified, it may be seen as set or clear.
> >
> > I have found three locations where we use the ->marks array:
> >
> > 1.
> >                         unsigned long data = *addr & (~0UL << offset);
> >                         if (data)
> >                                 return __ffs(data);
> >
> > 2.
> >         return find_next_bit(addr, XA_CHUNK_SIZE, offset);
> > 3.
> >         return test_bit(offset, node_marks(node, mark));
> >
> > The modifications -- all done with the spinlock held -- use the non-atomic
> > bitops:
> >         return __test_and_set_bit(offset, node_marks(node, mark));
> >         return __test_and_clear_bit(offset, node_marks(node, mark));
> >         bitmap_fill(node_marks(node, mark), XA_CHUNK_SIZE);
> > (that last one doesn't really count -- it's done prior to placing the node
> > in the tree)
> >
> > The first read seems straightforward; I can place a READ_ONCE around
> > *addr.  The second & third reads are rather less straightforward.
> > find_next_bit() and test_bit() are common code and use plain loads today.
>
> Yes, those last two are a bit annoying, aren't they?  I guess the first
> thing would be placing READ_ONCE() inside them, and if that results in
> regressions, have an alternative API for concurrent access?

FWIW test_bit() is an "atomic" bitop (per atomic_bitops.txt), and
KCSAN treats it as such. On x86 arch_test_bit() is not instrumented,
and then in asm-generic/bitops/instrumented-non-atomic.h test_bit() is
instrumented with instrument_atomic_read(). So on x86, things should
already be fine for test_bit(). Not sure about other architectures.
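
The layering described here can be sketched in userspace as follows.
instrument_atomic_read() is reduced to a counting stub and arch_test_bit() to
the generic bit test, so all names are stand-ins for the real kernel ones:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Stub: the real instrument_atomic_read() tells KCSAN "this is a marked
 * atomic read"; here we just count how often it fires.
 */
static int atomic_read_instrumentations;
static void instrument_atomic_read(const volatile void *v, size_t size)
{
	(void)v;
	(void)size;
	atomic_read_instrumentations++;
}

/* Stand-in for the uninstrumented arch_test_bit(). */
static bool arch_test_bit(long nr, const volatile unsigned long *addr)
{
	return 1UL & (addr[nr / BITS_PER_LONG] >> (nr & (BITS_PER_LONG - 1)));
}

/*
 * Mirrors asm-generic/bitops/instrumented-non-atomic.h: instrument the
 * access first, then call the arch implementation.
 */
static bool test_bit(long nr, const volatile unsigned long *addr)
{
	instrument_atomic_read(addr + nr / BITS_PER_LONG, sizeof(long));
	return arch_test_bit(nr, addr);
}
```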

Thanks,
-- Marco



> > > Your mileage may vary, of course.  For one thing, your actuarial
> > > statistics are quite likely significantly more favorable than are mine.
> > > Not that mine are at all bad, particularly by the standards of a century
> > > or two ago.  ;-)
> >
> > I don't know ... my preferred form of exercise takes me into the
> > harm's way of cars, so while I've significantly lowered my risk of
> > dying of cardiovascular disease, I'm much more likely to be killed by
> > an inattentive human being!  Bring on our robot cars ...
>
> I did my 40 years using that mode of transport.  But my reflexes are
> no longer up to the task.  So I have no advice for you on that one!
> Other than to question your trust in robot cars whose code was produced
> by human developers!  ;-)
>
>                                                         Thanx, Paul
Matthew Wilcox March 6, 2020, 4:53 p.m. UTC | #13
On Fri, Mar 06, 2020 at 02:38:39PM +0100, Marco Elver wrote:
> On Thu, 5 Mar 2020 at 22:39, Paul E. McKenney <paulmck@kernel.org> wrote:
> > On Thu, Mar 05, 2020 at 07:18:31AM -0800, Matthew Wilcox wrote:
> > > I have found three locations where we use the ->marks array:
> > >
> > > 1.
> > >                         unsigned long data = *addr & (~0UL << offset);
> > >                         if (data)
> > >                                 return __ffs(data);
> > >
> > > 2.
> > >         return find_next_bit(addr, XA_CHUNK_SIZE, offset);
> > > 3.
> > >         return test_bit(offset, node_marks(node, mark));
> > >
> > > The modifications -- all done with the spinlock held -- use the non-atomic
> > > bitops:
> > >         return __test_and_set_bit(offset, node_marks(node, mark));
> > >         return __test_and_clear_bit(offset, node_marks(node, mark));
> > >         bitmap_fill(node_marks(node, mark), XA_CHUNK_SIZE);
> > > (that last one doesn't really count -- it's done prior to placing the node
> > > in the tree)
> > >
> > > The first read seems straightforward; I can place a READ_ONCE around
> > > *addr.  The second & third reads are rather less straightforward.
> > > find_next_bit() and test_bit() are common code and use plain loads today.
> >
> > Yes, those last two are a bit annoying, aren't they?  I guess the first
> > thing would be placing READ_ONCE() inside them, and if that results in
> > regressions, have an alternative API for concurrent access?
> 
> FWIW test_bit() is an "atomic" bitop (per atomic_bitops.txt), and
> KCSAN treats it as such. On x86 arch_test_bit() is not instrumented,
> and then in asm-generic/bitops/instrumented-non-atomic.h test_bit() is
> instrumented with instrument_atomic_read(). So on x86, things should
> already be fine for test_bit(). Not sure about other architectures.

Hum.  It may well be documented as atomic, but is it?  Here's the
generic implementation:

static inline int test_bit(int nr, const volatile unsigned long *addr)
{
        return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
}

arch_test_bit is only used by the instrumented variants:

$ git grep arch_test_bit include
include/asm-generic/bitops/instrumented-non-atomic.h:   return arch_test_bit(nr, addr);

As far as I can tell, the generic version is what's used on x86.  Does
the 'volatile' qualifier save us here?

find_next_bit() doesn't have the 'volatile' qualifier, so may still be
a problem?
Paul E. McKenney March 6, 2020, 5:03 p.m. UTC | #14
On Fri, Mar 06, 2020 at 08:53:00AM -0800, Matthew Wilcox wrote:
> On Fri, Mar 06, 2020 at 02:38:39PM +0100, Marco Elver wrote:
> > On Thu, 5 Mar 2020 at 22:39, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > On Thu, Mar 05, 2020 at 07:18:31AM -0800, Matthew Wilcox wrote:
> > > > I have found three locations where we use the ->marks array:
> > > >
> > > > 1.
> > > >                         unsigned long data = *addr & (~0UL << offset);
> > > >                         if (data)
> > > >                                 return __ffs(data);
> > > >
> > > > 2.
> > > >         return find_next_bit(addr, XA_CHUNK_SIZE, offset);
> > > > 3.
> > > >         return test_bit(offset, node_marks(node, mark));
> > > >
> > > > The modifications -- all done with the spinlock held -- use the non-atomic
> > > > bitops:
> > > >         return __test_and_set_bit(offset, node_marks(node, mark));
> > > >         return __test_and_clear_bit(offset, node_marks(node, mark));
> > > >         bitmap_fill(node_marks(node, mark), XA_CHUNK_SIZE);
> > > > (that last one doesn't really count -- it's done prior to placing the node
> > > > in the tree)
> > > >
> > > > The first read seems straightforward; I can place a READ_ONCE around
> > > > *addr.  The second & third reads are rather less straightforward.
> > > > find_next_bit() and test_bit() are common code and use plain loads today.
> > >
> > > Yes, those last two are a bit annoying, aren't they?  I guess the first
> > > thing would be placing READ_ONCE() inside them, and if that results in
> > > regressions, have an alternative API for concurrent access?
> > 
> > FWIW test_bit() is an "atomic" bitop (per atomic_bitops.txt), and
> > KCSAN treats it as such. On x86 arch_test_bit() is not instrumented,
> > and then in asm-generic/bitops/instrumented-non-atomic.h test_bit() is
> > instrumented with instrument_atomic_read(). So on x86, things should
> > already be fine for test_bit(). Not sure about other architectures.
> 
> Hum.  It may well be documented as atomic, but is it?  Here's the
> generic implementation:
> 
> static inline int test_bit(int nr, const volatile unsigned long *addr)
> {
>         return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
> }
> 
> arch_test_bit is only used by the instrumented variants:
> 
> $ git grep arch_test_bit include
> include/asm-generic/bitops/instrumented-non-atomic.h:   return arch_test_bit(nr, addr);
> 
> As far as I can tell, the generic version is what's used on x86.  Does
> the 'volatile' qualifier save us here?
> 
> find_next_bit() doesn't have the 'volatile' qualifier, so may still be
> a problem?

One approach would be to add the needed READ_ONCE().

Another, if someone is crazy enough to do the work, would be to verify
that the code output is as if there was a READ_ONCE().

Thoughts?

							Thanx, Paul
Marco Elver March 6, 2020, 8:04 p.m. UTC | #15
On Fri, 6 Mar 2020 at 18:03, Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Mar 06, 2020 at 08:53:00AM -0800, Matthew Wilcox wrote:
> > On Fri, Mar 06, 2020 at 02:38:39PM +0100, Marco Elver wrote:
> > > On Thu, 5 Mar 2020 at 22:39, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > On Thu, Mar 05, 2020 at 07:18:31AM -0800, Matthew Wilcox wrote:
> > > > > I have found three locations where we use the ->marks array:
> > > > >
> > > > > 1.
> > > > >                         unsigned long data = *addr & (~0UL << offset);
> > > > >                         if (data)
> > > > >                                 return __ffs(data);
> > > > >
> > > > > 2.
> > > > >         return find_next_bit(addr, XA_CHUNK_SIZE, offset);
> > > > > 3.
> > > > >         return test_bit(offset, node_marks(node, mark));
> > > > >
> > > > > The modifications -- all done with the spinlock held -- use the non-atomic
> > > > > bitops:
> > > > >         return __test_and_set_bit(offset, node_marks(node, mark));
> > > > >         return __test_and_clear_bit(offset, node_marks(node, mark));
> > > > >         bitmap_fill(node_marks(node, mark), XA_CHUNK_SIZE);
> > > > > (that last one doesn't really count -- it's done prior to placing the node
> > > > > in the tree)
> > > > >
> > > > > The first read seems straightforward; I can place a READ_ONCE around
> > > > > *addr.  The second & third reads are rather less straightforward.
> > > > > find_next_bit() and test_bit() are common code and use plain loads today.
> > > >
> > > > Yes, those last two are a bit annoying, aren't they?  I guess the first
> > > > thing would be placing READ_ONCE() inside them, and if that results in
> > > > regressions, have an alternative API for concurrent access?
> > >
> > > FWIW test_bit() is an "atomic" bitop (per atomic_bitops.txt), and
> > > KCSAN treats it as such. On x86 arch_test_bit() is not instrumented,
> > > and then in asm-generic/bitops/instrumented-non-atomic.h test_bit() is
> > > instrumented with instrument_atomic_read(). So on x86, things should
> > > already be fine for test_bit(). Not sure about other architectures.
> >
> > Hum.  It may well be documented as atomic, but is it?  Here's the
> > generic implementation:
> >
> > static inline int test_bit(int nr, const volatile unsigned long *addr)
> > {
> >         return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
> > }
> >
> > arch_test_bit is only used by the instrumented variants:
> >
> > $ git grep arch_test_bit include
> > include/asm-generic/bitops/instrumented-non-atomic.h:   return arch_test_bit(nr, addr);
> >
> > As far as I can tell, the generic version is what's used on x86.  Does
> > the 'volatile' qualifier save us here?

x86 uses its own implementation of test_bit(), which we had to address
early on, otherwise we would still have tons of false reports due to
test_bit() usage.

$ grep -E -A3 'test_bit|instrumented' arch/x86/include/asm/bitops.h
static __no_kcsan_or_inline bool constant_test_bit(long nr, const volatile unsigned long *addr)
{
        /*
         * Because this is a plain access, we need to disable KCSAN here to
         * avoid double instrumentation via instrumented bitops.
         */
        return ((1UL << (nr & (BITS_PER_LONG-1))) &
                (addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
--
static __always_inline bool variable_test_bit(long nr, volatile const unsigned long *addr)
--
#define arch_test_bit(nr, addr)                 \
        (__builtin_constant_p((nr))             \
         ? constant_test_bit((nr), (addr))      \
         : variable_test_bit((nr), (addr)))
--
#include <asm-generic/bitops/instrumented-atomic.h>
#include <asm-generic/bitops/instrumented-non-atomic.h>
#include <asm-generic/bitops/instrumented-lock.h>


For the asm-generic variant, the cast to volatile should have the same
effect as READ_ONCE today (except maybe on Alpha?).  We would still
need to use READ_ONCE() in asm-generic's test_bit() though, to avoid
KCSAN false positives on other architectures. The code-gen should be
the same. I can try to send a patch and see if that's ok to do.
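
Such a patch would amount to something like this userspace sketch, where
READ_ONCE() is modelled with a volatile cast and BIT_WORD()/BITS_PER_LONG are
written out locally, so treat the helper names as assumptions:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's READ_ONCE(). */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BIT_WORD(nr)  ((nr) / BITS_PER_LONG)

/*
 * asm-generic test_bit() with the word access made an explicitly marked
 * (READ_ONCE) load instead of relying only on the volatile-qualified
 * parameter; the generated code should be the same, but KCSAN then treats
 * it as a deliberate concurrent read on every architecture.
 */
static bool test_bit_once(unsigned int nr, const unsigned long *addr)
{
	return 1UL & (READ_ONCE(addr[BIT_WORD(nr)]) >> (nr & (BITS_PER_LONG - 1)));
}
```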

> > find_next_bit() doesn't have the 'volatile' qualifier, so may still be
> > a problem?
>
> One approach would be to add the needed READ_ONCE().
>
> Another, if someone is crazy enough to do the work, would be to verify
> that the code output is as if there was a READ_ONCE().
>
> Thoughts?

find_next_bit() is difficult. The code is definitely not the same with
READ_ONCE(); there are 8 more instructions (x86-64). For now, the only
thing we can do, if we're fine with data-racy behaviour in
find_next_bit() without changing it, is to use it with
'data_race(find_next_bit(...))'. Not great though. :-/
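
A hedged sketch of that interim option: data_race() is modelled as the
identity (the real macro, in -next's <linux/compiler.h>, additionally tells
KCSAN to ignore the access), and a minimal single-word stand-in replaces the
real find_next_bit():

```c
#include <assert.h>
#include <limits.h>

/*
 * Userspace stand-in for the kernel's data_race() macro: in the kernel it
 * evaluates the expression while telling KCSAN not to report the access;
 * here it is just the expression.
 */
#define data_race(expr) (expr)

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Minimal single-word stand-in for lib/find_bit.c's find_next_bit(). */
static unsigned long find_next_bit_1w(const unsigned long *addr,
				      unsigned long size, unsigned long offset)
{
	unsigned long word;

	if (offset >= size)
		return size;
	word = *addr & (~0UL << offset);
	return word ? (unsigned long)__builtin_ctzl(word) : size;
}

/* Hypothetical call site wrapped in data_race(), as suggested above. */
static unsigned long next_marked(const unsigned long *marks, unsigned long offset)
{
	return data_race(find_next_bit_1w(marks, BITS_PER_LONG, offset));
}
```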

Thanks,
-- Marco

Patch
diff mbox series

diff --git a/lib/Makefile b/lib/Makefile
index 523dfe2063e2..989e702c275b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -28,6 +28,11 @@  endif
 # Used by KCSAN while enabled, avoid recursion.
 KCSAN_SANITIZE_random32.o := n
 
+# This produces frequent data race reports: most of them are due to races on
+# the same word but accesses to a single bit of that word. Re-enable KCSAN
+# for this when we have more consensus on what to do about them.
+KCSAN_SANITIZE_xarray.o := n
+
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o timerqueue.o xarray.o \
 	 idr.o extable.o sha1.o irq_regs.o argv_split.o \