All of lore.kernel.org
 help / color / mirror / Atom feed
* Data races in ext4
@ 2014-04-08 20:58 Pedro Fonseca
  2014-04-09 21:33 ` Theodore Ts'o
  0 siblings, 1 reply; 5+ messages in thread
From: Pedro Fonseca @ 2014-04-08 20:58 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, adilger.kernel

Hi,

I've encountered several data races while running custom test cases on 
ext4 (kernel 3.13.5), inside a QEMU based VM, but I'm not sure whether 
all of these data races are harmful.

Bellow I'm listing the data races summary, including the variable name, 
IP addresses, function names and source code files/line numbers. In 
addition, the pastbin links include snippets of the code at those 
locations and also include examples of racing pairs of instructions 
(which can be useful when there are more than two instructions racing). 
Several of the races reported affect either the function 
generic_fillattr() or the function ext4_do_update_inode(), so I'm 
grouping them bellow to simplify the analysis. Feel free to ask for more 
information in case it's needed.

I was hoping someone could have a look at these data races and let me 
know if they're going to be fixed.

Thanks,
Pedro


** Data races:
> Variable: mapping->nrpages Addresses: c107d9f3 c110c8d7
> c107d9f3 add_to_page_cache_locked /linux-3.13.5/mm/filemap.c:487
> c110c8d7 ext4_writepages /linux-3.13.5/fs/ext4/inode.c:2413
>
> Variable: raw_inode->i_size_lo  Addresses: c110abf0 c110abdb
> c110abf0 ext4_isize_set /linux-3.13.5/fs/ext4/ext4.h:2395
> c110abdb ext4_isize /linux-3.13.5/fs/ext4/ext4.h:2387
>
> Variable: mapping->writeback_index      Addresses: c108320f
> c108320f write_cache_pages /linux-3.13.5/mm/page-writeback.c:2012
>
> Variable: ei->i_datasync_tid    Addresses: c110adb3
> c110adb3 ext4_update_inode_fsync_trans 
> /linux-3.13.5/fs/ext4/ext4_jbd2.h:384
>
> Variable: ei->i_sync_tid        Addresses: c112ae80 c110ae6f
> c112ae80 ext4_update_inode_fsync_trans 
> /linux-3.13.5/fs/ext4/ext4_jbd2.h:382
> c110ae6f ext4_update_inode_fsync_trans 
> /linux-3.13.5/fs/ext4/ext4_jbd2.h:382
>
> Variable: kaddr/link    Addresses: c10ade1a c12d40eb
> c10ade1a 
> generic_readlinkat()->page_follow_link_light()->page_getlink()->nd_terminate_link 
> /linux-3.13.5/include/linux/namei.h:98
> c12d40eb generic_readlinkat()->vfs_readlink()->strlen 
> /linux-3.13.5/arch/x86/lib/string_32.c:168
>
> Variable: journal->j_running_transaction        Addresses: c1138ca1 
> c1136b02 c1136ca9
> c1136ca9 jbd2_get_transaction /linux-3.13.5/fs/jbd2/transaction.c:103
> c1138ca1 jbd2_journal_commit_transaction 
> /linux-3.13.5/fs/jbd2/commit.c:539
> c1136b02 start_this_handle /linux-3.13.5/fs/jbd2/transaction.c:280
>
> Variable: inode->i_state        Addresses: c10c035e c10c03cb c10c0352 
> c10c093a c10bffe2
> c10c0352 __writeback_single_inode /linux-3.13.5/fs/fs-writeback.c:473
> c10bffe2 inode_sync_complete /linux-3.13.5/fs/fs-writeback.c:217
> c10c035e __writeback_single_inode /linux-3.13.5/fs/fs-writeback.c:475
> c10c093a writeback_sb_inodes /linux-3.13.5/fs/fs-writeback.c:666
> c10c03cb __mark_inode_dirty /linux-3.13.5/fs/fs-writeback.c:1149
>
> Variable: raw_inode->i_size_high        Addresses: c110abf3 c110abde
> c110abde ext4_isize /linux-3.13.5/fs/ext4/ext4.h:2387
> c110abf3 ext4_isize_set /linux-3.13.5/fs/ext4/ext4.h:2396
http://pastebin.com/gSGUMC31


** Races involving/related to the function generic_fillattr():
> Variable: inode->i_blocks Addresses: c10aa04b c10aa891 c10a9f72
> c10a9f72 __inode_sub_bytes /linux-3.13.5/fs/stat.c:471
> c10aa891 __inode_add_bytes /linux-3.13.5/fs/stat.c:451
> c10aa04b generic_fillattr /linux-3.13.5/fs/stat.c:35
>
> Variable: inode->i_ctime        Addresses: c10aa032 c110b231
> c10aa032 generic_fillattr /linux-3.13.5/fs/stat.c:33
> c110b231 ext4_current_time /linux-3.13.5/fs/ext4/inode.c:3769
>
> Variable: inode->i_blocks       Addresses: c10a9f75 c110ab50 c10aa048 
> c10aa894
> c10a9f75 __inode_sub_bytes /linux-3.13.5/fs/stat.c:471
> c110ab50 ext4_inode_blocks_set /linux-3.13.5/fs/ext4/inode.c:4238
> c10aa048 generic_fillattr /linux-3.13.5/fs/stat.c:35
> c10aa894 __inode_add_bytes /linux-3.13.5/fs/stat.c:451
>
> Variable: inode->i_ctime        Addresses: c10b79e9 c110b22b c10aa02f
> c110b22b ext4_current_time /linux-3.13.5/fs/ext4/inode.c:3769
> c10b79e9 update_time /linux-3.13.5/fs/inode.c:1510
> c10aa02f generic_fillattr /linux-3.13.5/fs/stat.c:33
>
> Variable: inode->i_atime        Addresses: c10b7f62 c10b79a9 c10aa017 
> c10b7fcb
> c10b7f62 timespec_compare /linux-3.13.5/include/linux/time.h:35
> c10b7fcb timespec_equal /linux-3.13.5/include/linux/time.h:25
> c10aa017 generic_fillattr /linux-3.13.5/fs/stat.c:31
> c10b79a9 update_time /linux-3.13.5/fs/inode.c:1506
>
> Variable: inode->i_mtime        Addresses: c10b79f9 c10aa023 c110b237
> c110b237 ext4_truncate /linux-3.13.5/fs/ext4/inode.c:3769
> c10b79f9 update_time /linux-3.13.5/fs/inode.c:1512
> c10aa023 generic_fillattr /linux-3.13.5/fs/stat.c:32
http://pastebin.com/6cHQsZJk


** Races involving/related to the function ext4_do_update_inode():
> Variable: inode->i_ctime    Addresses: c110aac1 c10b79ec
> c110aac1 ext4_encode_extra_time /linux-3.13.5/fs/ext4/ext4.h:729
> c10b79ec update_time /linux-3.13.5/fs/inode.c:1510
>
> Variable: i_crtime  Addresses: c110ab4a
> c110ab4a ext4_encode_extra_time /linux-3.13.5/fs/ext4/ext4.h:729
>
> Variable: ei->i_disksize        Addresses: c110c813 c110ac06
> c110c813 ext4_wb_update_i_disksize /linux-3.13.5/fs/ext4/ext4.h:2476
> c110ac06 ext4_do_update_inode /linux-3.13.5/fs/ext4/inode.c:4342
>
> Variable: raw_inode->i_gid_high Addresses: c110aa6a
> c110aa6a ext4_do_update_inode /linux-3.13.5/fs/ext4/inode.c:4311
>
> Variable: raw_inode->i_dtime    Addresses: c110adf2
> c110adf2 ext4_do_update_inode /linux-3.13.5/fs/ext4/inode.c:4331
>
> Variable: i_mtime       Addresses: c110aac7
> c110aac7 ext4_encode_extra_time /linux-3.13.5/fs/ext4/ext4.h:729
>
> Variable: i_mtime       Addresses: c110aad0
> c110aad0 ext4_do_update_inode /linux-3.13.5/fs/ext4/inode.c:4325
>
> Variable: raw_inode->i_file_acl_high    Addresses: c110abbf
> c110abbf ext4_do_update_inode /linux-3.13.5/fs/ext4/inode.c:4336
>
> Variable: raw_inode->i_flags   Addresses: c110adf8
> c110adf8 ext4_do_update_inode /linux-3.13.5/fs/ext4/inode.c:4332
>
> Variable: raw_inode->i_file_acl_lo      Addresses: c110abc6
> c110abc6 ext4_do_update_inode /linux-3.13.5/fs/ext4/inode.c:4337
>
> Variable: raw_inode->i_blocks_lo        Addresses: c110ab5e
> c110ab5e ext4_inode_blocks_set /linux-3.13.5/fs/ext4/inode.c:4246
>
> Variable: raw_inode->i_links_count      Addresses: c110aaa6
> c110aaa6 ext4_do_update_inode /linux-3.13.5/fs/ext4/inode.c:4322
>
> Variable: raw_inode->i_mode     Addresses: c110aa52 c110aa3a c110abc9
> c110aa3a ext4_do_update_inode /linux-3.13.5/fs/ext4/inode.c:4297
> c110aa52 ext4_do_update_inode /linux-3.13.5/fs/ext4/inode.c:4301
> c110abc9 ext4_isize /linux-3.13.5/fs/ext4/ext4.h:2386
>
> Variable: EXT4_I(inode)->i_disksize     Addresses: c110c810 c110c147 
> c110abcb c110c7d2
> c110c810 ext4_wb_update_i_disksize /linux-3.13.5/fs/ext4/ext4.h:2476
> c110abcb ext4_do_update_inode /linux-3.13.5/fs/ext4/inode.c:4338
> c110c7d2 mpage_map_and_submit_extent /linux-3.13.5/fs/ext4/inode.c:2236
> c110c147 ext4_da_write_end /linux-3.13.5/fs/ext4/inode.c:2758
http://pastebin.com/Sqhf1b92





^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Data races in ext4
  2014-04-08 20:58 Data races in ext4 Pedro Fonseca
@ 2014-04-09 21:33 ` Theodore Ts'o
  2014-04-10 15:40   ` Pedro Fonseca
  0 siblings, 1 reply; 5+ messages in thread
From: Theodore Ts'o @ 2014-04-09 21:33 UTC (permalink / raw)
  To: Pedro Fonseca; +Cc: linux-ext4, adilger.kernel

On Tue, Apr 08, 2014 at 10:58:46PM +0200, Pedro Fonseca wrote:
> Bellow I'm listing the data races summary, including the variable name, IP
> addresses, function names and source code files/line numbers. In addition,
> the pastbin links include snippets of the code at those locations and also
> include examples of racing pairs of instructions (which can be useful when
> there are more than two instructions racing). Several of the races reported
> affect either the function generic_fillattr() or the function
> ext4_do_update_inode(), so I'm grouping them bellow to simplify the
> analysis. Feel free to ask for more information in case it's needed.

The ones relating to ext4_do_update_inode() and generic_fillattr()
look fine to me.  Basically the first may happen if there are two
CPU's simultaneously trying to update the on-disk data structure for
an inode from the in-memory data structure, and that's not a problem
since the in-memory data structure is always authoratative.  The
second happens if someone tries to stat(2) an inode while some inode
field is getting updated.  Most of the inode field updates will be for
things like mtime and atime updates, or a chown or chmod, where a
single field is getting updated, and that's not a problem since there
is no guarantee which version of the inode information you'll get when
you call stat(2) while the inode is getting modified out from under
you.  It is possible that the userspace might see the i_blocks and
i_size be out of sync, but I really can't bring myself to care about
that.


As for the rest, some are obviously false positives.  For example, if
you take a look at:

Variable: journal->j_running_transaction	Addresses: c1138ca1 c1136b02 c1136ca9
c1136ca9 jbd2_get_transaction /linux-3.13.5/fs/jbd2/transaction.c:103
c1138ca1 jbd2_journal_commit_transaction /linux-3.13.5/fs/jbd2/commit.c:539
c1136b02 start_this_handle /linux-3.13.5/fs/jbd2/transaction.c:280

It's obvious from the code path that start_this_handle() is extremely
careful to always revalidate j_running_transaction after either (a)
taking a read lock on the j_state_lock for the first time, or (b)
after dropping the read_lock and grabbing a write_lock on
j_state_lock, and if things have changed, it loops and tries again.


There are a couple that require some closer examination; for some of
them, for example the two reported cases of ext4_isize_set() vs
ext4_isize(), having the stack trace would be really helpful.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Data races in ext4
  2014-04-09 21:33 ` Theodore Ts'o
@ 2014-04-10 15:40   ` Pedro Fonseca
  2014-04-10 22:09     ` Theodore Ts'o
  0 siblings, 1 reply; 5+ messages in thread
From: Pedro Fonseca @ 2014-04-10 15:40 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, adilger.kernel

On 04/09/2014 11:33 PM, Theodore Ts'o wrote:
> The ones relating to ext4_do_update_inode() and generic_fillattr()
> look fine to me.  Basically the first may happen if there are two
> CPU's simultaneously trying to update the on-disk data structure for
> an inode from the in-memory data structure, and that's not a problem
> since the in-memory data structure is always authoratative.  The
> second happens if someone tries to stat(2) an inode while some inode
> field is getting updated.  Most of the inode field updates will be for
> things like mtime and atime updates, or a chown or chmod, where a
> single field is getting updated, and that's not a problem since there
> is no guarantee which version of the inode information you'll get when
> you call stat(2) while the inode is getting modified out from under
> you.  It is possible that the userspace might see the i_blocks and
> i_size be out of sync, but I really can't bring myself to care about
> that.

Regarding the generic_fillattr() function, apart from the inconsistency 
between i_blocks/i_size, it looks like it can also cause wrong values to 
be returned because several of those fields are 64 bits. This means that 
each assignment gets converted into multiple instructions (in 32-bit 
architectures), allowing other instructions to interleave in between. 
For example, this is the assembly code I get when compiling 
generic_fillattr():
>     stat->atime = inode->i_atime;
> c10aa017:   8b 48 3c                mov    0x3c(%eax),%ecx
> c10aa01a:   8b 58 40                mov    0x40(%eax),%ebx
> c10aa01d:   89 4a 28                mov    %ecx,0x28(%edx)
> c10aa020:   89 5a 2c                mov    %ebx,0x2c(%edx)
>     stat->mtime = inode->i_mtime;
> c10aa023:   8b 48 44                mov    0x44(%eax),%ecx
> c10aa026:   8b 58 48                mov    0x48(%eax),%ebx
> c10aa029:   89 4a 30                mov    %ecx,0x30(%edx)
> c10aa02c:   89 5a 34                mov    %ebx,0x34(%edx)
>     stat->ctime = inode->i_ctime;
> c10aa02f:   8b 48 4c                mov    0x4c(%eax),%ecx
> c10aa032:   8b 58 50                mov    0x50(%eax),%ebx
> c10aa035:   89 4a 38                mov    %ecx,0x38(%edx)
> c10aa038:   89 5a 3c                mov    %ebx,0x3c(%edx)
...
>     stat->blocks = inode->i_blocks;
> c10aa048:   8b 58 60                mov    0x60(%eax),%ebx
> c10aa04b:   8b 48 5c                mov    0x5c(%eax),%ecx
> c10aa04e:   89 5a 48                mov    %ebx,0x48(%edx)
> c10aa051:   89 4a 44                mov    %ecx,0x44(%edx)


I still don't understand why the races in ext4_do_update_inode() won't 
cause problems. Similarly to generic_fillattr(), in 
ext4_do_update_inode() wrong values can end up being written to the 
inode fields. Is there some mechanism to recover/ignore these values 
that I'm missing?


> As for the rest, some are obviously false positives.  For example, if
> you take a look at:
>
> Variable: journal->j_running_transaction	Addresses: c1138ca1 c1136b02 c1136ca9
> c1136ca9 jbd2_get_transaction /linux-3.13.5/fs/jbd2/transaction.c:103
> c1138ca1 jbd2_journal_commit_transaction /linux-3.13.5/fs/jbd2/commit.c:539
> c1136b02 start_this_handle /linux-3.13.5/fs/jbd2/transaction.c:280
>
> It's obvious from the code path that start_this_handle() is extremely
> careful to always revalidate j_running_transaction after either (a)
> taking a read lock on the j_state_lock for the first time, or (b)
> after dropping the read_lock and grabbing a write_lock on
> j_state_lock, and if things have changed, it loops and tries again.
You're right, start_this_handle() revalidates the value read so I also 
don't see harm in this race.


> There are a couple that require some closer examination; for some of
> them, for example the two reported cases of ext4_isize_set() vs
> ext4_isize(), having the stack trace would be really helpful.

Those two pairs of functions were actually also called from the 
ext4_do_update_inode() function:
> 4338     if (ei->i_disksize != ext4_isize(raw_inode)) {
> 4339         ext4_isize_set(raw_inode, ei->i_disksize);
> 4340         need_datasync = 1;
> 4341     }

So the top part of the stack trace is:
   ext4_mark_inode_dirty() -> ext4_mark_iloc_dirty() -> 
ext4_do_update_inode() -> ext4_isize()
   ext4_mark_inode_dirty() -> ext4_mark_iloc_dirty() -> 
ext4_do_update_inode() -> ext4_isize_set()





^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Data races in ext4
  2014-04-10 15:40   ` Pedro Fonseca
@ 2014-04-10 22:09     ` Theodore Ts'o
  2014-04-11 13:08       ` Pedro Fonseca
  0 siblings, 1 reply; 5+ messages in thread
From: Theodore Ts'o @ 2014-04-10 22:09 UTC (permalink / raw)
  To: Pedro Fonseca; +Cc: linux-ext4, adilger.kernel

Thinking about this some more, you're right.  We should probably use a
spinlock to regulate access to the raw inode in
ext4_do_update_inode().  The main reason why we haven't noticed that
most of these fields don't change very often, and so in practice it's
not really problem.  The one that _is_ critical is i_disksize /
raw_inode->i_size, and that's supposed to be protected by i_data_sem.
(The reason why that one showed up in your analysis is because we did
screw up the locking in mpage_map_and_submit_extent.)

> Regarding the generic_fillattr() function, apart from the inconsistency
> between i_blocks/i_size, it looks like it can also cause wrong values to be
> returned because several of those fields are 64 bits. 

Well, it's actually a 32-bit seconds and 32-bit nanoseconds field.  So
we could theoretically get a torn value, but if the ns field is from a
previous update, the worst that could happen under those circumstances
is that two quick successive stat() commands could potentially result
in end up with an apparent "time moving backwards" in the nanoseconds
field.  If someone was constantly calling utimes(2) to modify the
mtime/atime at high rate, that might cause a more serious wrong value,
but that's an argument you should take up with the generic VFS folks
as to whether it's important enough that they would want to care about
it.

> >As for the rest, some are obviously false positives.  For example, if
> >you take a look at:
> >
> >Variable: journal->j_running_transaction	Addresses: c1138ca1 c1136b02 c1136ca9
> >c1136ca9 jbd2_get_transaction /linux-3.13.5/fs/jbd2/transaction.c:103
> >c1138ca1 jbd2_journal_commit_transaction /linux-3.13.5/fs/jbd2/commit.c:539
> >c1136b02 start_this_handle /linux-3.13.5/fs/jbd2/transaction.c:280
> >
> >It's obvious from the code path that start_this_handle() is extremely
> >careful to always revalidate j_running_transaction after either (a)
> >taking a read lock on the j_state_lock for the first time, or (b)
> >after dropping the read_lock and grabbing a write_lock on
> >j_state_lock, and if things have changed, it loops and tries again.
> You're right, start_this_handle() revalidates the value read so I also don't
> see harm in this race.

Well, I suppose technically it's a race, but in it was quite
deliberate, because the goal was to minimize pressure on the
j_state_lock.

> >There are a couple that require some closer examination; for some of
> >them, for example the two reported cases of ext4_isize_set() vs
> >ext4_isize(), having the stack trace would be really helpful.
> 
> Those two pairs of functions were actually also called from the
> ext4_do_update_inode() function:

OK, and I think this is ultimately caused by a screwup in how
mpage_map_and_submit_extent() which I mentioned earlier.

Thanks for your work,

					- Ted

P.S.  What did you use to generate these reports?  Is it something
that can be easily replicated by others?

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Data races in ext4
  2014-04-10 22:09     ` Theodore Ts'o
@ 2014-04-11 13:08       ` Pedro Fonseca
  0 siblings, 0 replies; 5+ messages in thread
From: Pedro Fonseca @ 2014-04-11 13:08 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, adilger.kernel


>> Regarding the generic_fillattr() function, apart from the inconsistency
>> between i_blocks/i_size, it looks like it can also cause wrong values to be
>> returned because several of those fields are 64 bits.
> Well, it's actually a 32-bit seconds and 32-bit nanoseconds field.  So
> we could theoretically get a torn value, but if the ns field is from a
> previous update, the worst that could happen under those circumstances
> is that two quick successive stat() commands could potentially result
> in end up with an apparent "time moving backwards" in the nanoseconds
> field.  If someone was constantly calling utimes(2) to modify the
> mtime/atime at high rate, that might cause a more serious wrong value,
> but that's an argument you should take up with the generic VFS folks
> as to whether it's important enough that they would want to care about
> it.

Note that the races in stat() also involve the "inode->i_blocks" field, 
which is also 64-bits. So, as far as I understand, this field too can 
get wrapped, although it should only affect very large files. (On the 
other hand, the field inode->i_size seems to be protected.)

I'll follow your suggestion and let the VFS developers know about this.


> P.S.  What did you use to generate these reports?  Is it something
> that can be easily replicated by others?

We're currently developing an experimental tool to test the kernel for 
concurrency bugs. It's not yet publicly available, but we hope it will 
become available at some point in the future.

Thanks,
Pedro

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2014-04-11 13:08 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-08 20:58 Data races in ext4 Pedro Fonseca
2014-04-09 21:33 ` Theodore Ts'o
2014-04-10 15:40   ` Pedro Fonseca
2014-04-10 22:09     ` Theodore Ts'o
2014-04-11 13:08       ` Pedro Fonseca

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.