linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
@ 2015-10-05 15:22 Theodore Ts'o
  2015-10-05 15:58 ` Linus Torvalds
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Theodore Ts'o @ 2015-10-05 15:22 UTC (permalink / raw)
  To: Dave Hansen, Andrew Morton, Linus Torvalds; +Cc: linux-ext4, linux-kernel

I've been tracking down a test failure in xfstests generic/208 in the
data_journal configuration, which I've been testing using:

	gce-xfstests -c data_journal -C 20 generic/208

I've bisected it down to commit 998ef75ddb: "fs: do not prefault
sys_write() user buffer pages".  I've confirmed that 4.3-rc2 fails as
detailed below, but with 998ef75ddb reverted, the problem goes away.

The generic/208 test tries to run the test program
aio-dio-invalidate-failure[1] 20 times.

On a successful pass, the test runs without incident.  On a failure,
the syslog gets flooded with messages which look like this:

Oct  5 08:46:40 xfstests-201510050844 kernel: JBD2: Spotted dirty metadata buffer (dev = dm-0, blocknr = 33797). There's a risk of filesystem corruption in case of system crash.

... and eventually, almost always before successful 5 test runs, and
often before even a single successful test run, we end up triggering
a BUG_ON[2].

Before commit 998ef75ddb, if we need to prefault in the page, we do so
before we attempt the copy.  After this commit, we attempt the copy
and if it fails because pagefaults have been turned off, we call
write_end(), the unlock the page, prefault in the pages, and then
retry the commit.

What I think is going on is that when we do attempt the copy, we end
up marking the page dirty before we notice that we need to page fault
in the page, which ends up triggering the warning that jbd2
buffer_head that is supposed to be journaled has been marked dirty
without calling ext4_handle_dirty_metadata() --- which is handled by
ext4_journalled_write_end(), but which is now happening out of order
given this commit.

Is it possible that we can change iov_iter_copy_from_user_atomic(), to
check for the error case before it marks the page dirty?  Or can we
create a light-weight function which checks to see if the page needs
to be faulted in which is lighter weight than
iov_iter_fault_in_readable?

Thanks,

						- Ted


[1] https://git.kernel.org/cgit/fs/xfs/xfstests-dev.git/tree/src/aio-dio-regress/aio-dio-invalidate-failure.c

[2] ------------[ cut here ]------------
kernel BUG at /usr/projects/linux/ext4/fs/jbd2/commit.c:1030!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC 
CPU: 1 PID: 9842 Comm: jbd2/dm-0-8 Not tainted 4.2.0-ext4-09906-g998ef75 #92
Hardware name: Google Google, BIOS Google 01/01/2011
task: ffff8802132a73c0 ti: ffff8800ade8c000 task.ti: ffff8800ade8c000
RIP: 0010:[<ffffffff81269145>]  [<ffffffff81269145>] jbd2_journal_commit_transaction+0xfbc/0x1592
RSP: 0018:ffff8800ade8fcc0  EFLAGS: 00010202
RAX: 0000000000a20003 RBX: ffff8800b8a5a888 RCX: ffff8800b8a5a088
RDX: 0000000000000001 RSI: ffff8800ade8fc78 RDI: ffff880200f63c30
RBP: ffff8800ade8fe30 R08: 0000013f3a2c21fa R09: 0000000000000002
R10: ffff8800ade8fbf0 R11: 0000000000000774 R12: ffff8800b66b21a0
R13: ffff8800b8a5a088 R14: ffff880200f63800 R15: ffff8800b8b9d800
FS:  0000000000000000(0000) GS:ffff88021df00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9005ff2050 CR3: 000000020fea6000 CR4: 00000000001406e0
Stack:
ffff8800b4462488 000001375a588074 ffff8800b8b9d8ac ffff8802132a73c0
0000001000000024 ffff8800ada9b000 ffff8802132a73c0 ffff8800b8b9d850
00000000ffffffff ffff880000000784 0000000000000000 ffff8802102ab288
Call Trace:
[<ffffffff8126d26f>] ? kjournald2+0xb6/0x1e5
[<ffffffff8126d26f>] ? kjournald2+0xb6/0x1e5
[<ffffffff810f7ba6>] ? __wake_up_common+0x71/0x71
[<ffffffff8126d1b9>] ? commit_timeout+0xa/0xa
[<ffffffff810e1f8c>] ? kthread+0xc6/0xce
[<ffffffff810e1ec6>] ? __kthread_parkme+0x5a/0x5a
[<ffffffff8168ac5f>] ? ret_from_fork+0x3f/0x70
[<ffffffff810e1ec6>] ? __kthread_parkme+0x5a/0x5a
Code: 8b 03 a9 00 00 40 00 74 1b 4c 89 fe 4c 89 e7 45 31 ed e8 19 13 00 00 41 f6 06 02 74 1d f0 80 63 02 bf eb 16 48 8b 03 a8 02 74 02 <0f> 0b 45 31 ed 49 83 7c 24 30 00 41 0f 94 c5 4c 89 e7 e8 d5 eb 
RIP  [<ffffffff81269145>] jbd2_journal_commit_transaction+0xfbc/0x1592
RSP <ffff8800ade8fcc0>
---[ end trace 2c7d9ab15164cf1c ]---

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-05 15:22 [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal Theodore Ts'o
@ 2015-10-05 15:58 ` Linus Torvalds
  2015-10-05 16:23   ` Dave Hansen
  2015-10-05 16:03 ` Dave Hansen
  2015-10-05 18:04 ` Dave Hansen
  2 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2015-10-05 15:58 UTC (permalink / raw)
  To: Theodore Ts'o, Dave Hansen, Andrew Morton, linux-ext4,
	Linux Kernel Mailing List

On Mon, Oct 5, 2015 at 4:22 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>
> What I think is going on is that when we do attempt the copy, we end
> up marking the page dirty before we notice that we need to page fault
> in the page, which ends up triggering the warning that jbd2
> buffer_head that is supposed to be journaled has been marked dirty
> without calling ext4_handle_dirty_metadata() --- which is handled by
> ext4_journalled_write_end(), but which is now happening out of order
> given this commit.

Hmm. I suspect that we'll just need to revert that commit for now.

It does smell like jbd2 might be a bit too fragile here, and you might
be able to trigger the same issue by having some random race condition
where the user unmaps the memory in another thread in between the
iov_iter_fault_in_readable() and the actual
iov_iter_copy_from_user_atomic() call later. So I think that this
commit may not be buggy per se, as much as just exposing a problem in
jbd2, but I don't think that is something we can really fix at this
point in the release schedule.

Dave, comments?

> Is it possible that we can change iov_iter_copy_from_user_atomic(), to
> check for the error case before it marks the page dirty?

iov_iter_copy_from_user_atomic() doesn't mark anything dirty, it just
does the copy afaik. The dirtying is up to the write_begin/write_end
logic.

>  Or can we
> create a light-weight function which checks to see if the page needs
> to be faulted in which is lighter weight than
> iov_iter_fault_in_readable?

I'm not actually sure why Dave finds that function to be expensive
as-is. It should be a very cheap thing to do if it's already mapped,
and if the area isn't mapped it does need to be faulted in later
anyway, so it's not like you're really doing any extra work.

Dave, mind sharing the micro-benchmark or perhaps even just a kernel
profile of it? How is that "iov_iter_fault_in_readable()" so
noticeable? It really shouldn't be a big deal.

               Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-05 15:22 [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal Theodore Ts'o
  2015-10-05 15:58 ` Linus Torvalds
@ 2015-10-05 16:03 ` Dave Hansen
  2015-10-05 18:04 ` Dave Hansen
  2 siblings, 0 replies; 23+ messages in thread
From: Dave Hansen @ 2015-10-05 16:03 UTC (permalink / raw)
  To: Theodore Ts'o, Andrew Morton, Linus Torvalds, linux-ext4,
	linux-kernel

On 10/05/2015 08:22 AM, Theodore Ts'o wrote:
...
> I've bisected it down to commit 998ef75ddb: "fs: do not prefault
> sys_write() user buffer pages".  I've confirmed that 4.3-rc2 fails as
> detailed below, but with 998ef75ddb reverted, the problem goes away.
...
> Before commit 998ef75ddb, if we need to prefault in the page, we do so
> before we attempt the copy.  After this commit, we attempt the copy
> and if it fails because pagefaults have been turned off, we call
> write_end(), the unlock the page, prefault in the pages, and then
> retry the commit.

That's nasty.  Thanks for the bug report!

I'll go see if I can reproduce this.

> What I think is going on is that when we do attempt the copy, we end
> up marking the page dirty before we notice that we need to page fault
> in the page, which ends up triggering the warning that jbd2
> buffer_head that is supposed to be journaled has been marked dirty
> without calling ext4_handle_dirty_metadata() --- which is handled by
> ext4_journalled_write_end(), but which is now happening out of order
> given this commit.
> 
> Is it possible that we can change iov_iter_copy_from_user_atomic(), to
> check for the error case before it marks the page dirty?  Or can we
> create a light-weight function which checks to see if the page needs
> to be faulted in which is lighter weight than
> iov_iter_fault_in_readable?

Maybe I'm not following the macro magic, but I don't see where
iov_iter_copy_from_user_atomic() is setting 'page' dirty.  It'll set the
dirty bit in the PTEs of course, but I don't see it touching 'page'
except to kmap() it.

I do see some ->write_end() implementations doing set_page_dirty(), though.

Could we have just been confused and dirtied a page under ->write_end()
when we had copied=0 and it wasn't _really_ dirtied?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-05 15:58 ` Linus Torvalds
@ 2015-10-05 16:23   ` Dave Hansen
  2015-10-05 20:22     ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Hansen @ 2015-10-05 16:23 UTC (permalink / raw)
  To: Linus Torvalds, Theodore Ts'o, Andrew Morton, linux-ext4,
	Linux Kernel Mailing List, H. Peter Anvin

On 10/05/2015 08:58 AM, Linus Torvalds wrote:
...
> Dave, mind sharing the micro-benchmark or perhaps even just a kernel
> profile of it? How is that "iov_iter_fault_in_readable()" so
> noticeable? It really shouldn't be a big deal.

The micro was just plugging this test:

	https://www.sr71.net/~dave/intel/write1byte.c

In to will-it-scale:

	https://github.com/antonblanchard/will-it-scale

iov_iter_fault_in_readable() shows up as the third-most expensive kernel
function in a profile:

>      7.45%  write1byte_proc  [kernel.kallsyms]     [k] copy_user_enhanced_fast_string 
>      6.51%  write1byte_proc  [kernel.kallsyms]     [k] unlock_page                    
>      6.04%  write1byte_proc  [kernel.kallsyms]     [k] iov_iter_fault_in_readable     
>      5.23%  write1byte_proc  libc-2.20.so          [.] __GI___libc_write              
>      4.86%  write1byte_proc  [kernel.kallsyms]     [k] entry_SYSCALL_64               
>      4.48%  write1byte_proc  [kernel.kallsyms]     [k] iov_iter_copy_from_user_atomic 
>      3.94%  write1byte_proc  [kernel.kallsyms]     [k] generic_perform_write          
>      3.74%  write1byte_proc  [kernel.kallsyms]     [k] mutex_lock                     
>      3.59%  write1byte_proc  [kernel.kallsyms]     [k] entry_SYSCALL_64_after_swapgs  
>      3.55%  write1byte_proc  [kernel.kallsyms]     [k] find_get_entry                 
>      3.53%  write1byte_proc  [kernel.kallsyms]     [k] vfs_write                      
>      3.17%  write1byte_proc  [kernel.kallsyms]     [k] find_lock_entry                
>      3.17%  write1byte_proc  [kernel.kallsyms]     [k] put_page                       

The disassembly points at the stac/clac pair being the culprits inside
the function (copy/paste from 'perf top' disassebly here):

...
>        │      stac
>  24.57 │      mov    (%rcx),%sil
>  15.70 │      clac
>  28.77 │      test   %eax,%eax
>   2.15 │      mov    %sil,-0x1(%rbp)
>   8.93 │    ↓ jne    66
>   2.31 │      movslq %edx,%rdx

One thing I've been noticing on Skylake is that barriers (implicit and
explicit) are showing up more in profiles.  What we're seeing here
probably isn't actually stac/clac overhead, but the cost of finishing
some other operations that are outstanding before we can proceed through
here.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-05 15:22 [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal Theodore Ts'o
  2015-10-05 15:58 ` Linus Torvalds
  2015-10-05 16:03 ` Dave Hansen
@ 2015-10-05 18:04 ` Dave Hansen
  2015-10-07  3:34   ` Theodore Ts'o
  2 siblings, 1 reply; 23+ messages in thread
From: Dave Hansen @ 2015-10-05 18:04 UTC (permalink / raw)
  To: Theodore Ts'o, Andrew Morton, Linus Torvalds, linux-ext4,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 5072 bytes --]

I managed to catch the condition in an ftrace.  Full spew is below.

We can see that the iov_iter_copy_from_user_atomic() "failed" and ended
up with a copied=0 which we can see in the ext4_journalled_write_end()
tracepoint as "copied 0".

So we're in this code with copied=0 and len=4096:

> static int ext4_journalled_write_end(struct file *file, ... unsigned copied, ...
> {
...

>         else {
>                 if (copied < len) {
>                         if (!PageUptodate(page))
>                                 copied = 0;
>                         page_zero_new_buffers(page, from+copied, to);
>                 }
> 
>                 ret = ext4_walk_page_buffers(handle, page_buffers(page), from,
>                                              to, &partial, write_end_fn);
>                 if (!partial)
>                         SetPageUptodate(page);
>         }


The warning comes out of ext4_walk_page_buffers() and the dirty state
comes from page_zero_new_buffers(). That seems a _bit_ goofy that the
filesystem is marking the page dirty and then so shortly warning about it.

If we either come in here with copied=0 or force it (when
!PageUptodate()) we are essentially telling the caller that its write
must be repeated. We have no data in the buffer to preserve and we are
not leaking any data we are not marking it Uptodate.

The very lightly tested attached patch gets rid of the "Spotted dirty
metadata buffer" messages that I've been getting testing like this:

	mount -o data=journal /dev/sda1 /mnt/sda1
	TEST_DIR=/mnt/sda1 TEST_DEV=/dev/sda1 ./tests/generic/208

It _seems_ like a relatively sane thing to do.

>>  2)  aio-dio-3204  |               |              ext4_journalled_write_end() {
>>  2)  aio-dio-3204  |               |                /* ext4_journalled_write_end: dev 8,1 ino 12 pos 0 len 4096 copied 0 */
>>  2)  aio-dio-3204  |               |                page_zero_new_buffers() {
>>  2)  aio-dio-3204  |               |                  mark_buffer_dirty() {
>>  2)  aio-dio-3204  |               |                    /* block_dirty_buffer: 8,1 sector=34356 size=4096 */
>>  2)  aio-dio-3204  |   0.025 us    |                    page_mapping();
>>  2)  aio-dio-3204  |               |                    __set_page_dirty.constprop.46() {
>>  2)  aio-dio-3204  |   0.026 us    |                      _raw_spin_lock_irqsave();
>>  2)  aio-dio-3204  |               |                      account_page_dirtied() {
>>  2)  aio-dio-3204  |               |                        /* writeback_dirty_page: bdi 8:0: ino=12 index=0 */
>>  2)  aio-dio-3204  |               |                        __inc_zone_page_state() {
>>  2)  aio-dio-3204  |   0.030 us    |                          __inc_zone_state();
>>  2)  aio-dio-3204  |   0.211 us    |                        }
>>  2)  aio-dio-3204  |               |                        __inc_zone_page_state() {
>>  2)  aio-dio-3204  |   0.024 us    |                          __inc_zone_state();
>>  2)  aio-dio-3204  |   0.211 us    |                        }
>>  2)  aio-dio-3204  |   1.085 us    |                      }
>>  2)  aio-dio-3204  |   0.033 us    |                      _raw_spin_unlock_irqrestore();
>>  2)  aio-dio-3204  |   1.727 us    |                    }
>>  2)  aio-dio-3204  |               |                    __mark_inode_dirty() {
>>  2)  aio-dio-3204  |               |                      /* writeback_mark_inode_dirty: bdi 8:0: ino=12 state=I_DIRTY_SYNC|I_DIRTY_DATASYNC|I_DIRTY_PAGES flags=I_DIRTY_PAGES */
>>  2)  aio-dio-3204  |   0.207 us    |                    }
>>  2)  aio-dio-3204  |   2.600 us    |                  }
>>  2)  aio-dio-3204  |   2.878 us    |                }
>>  2)  aio-dio-3204  |               |                ext4_walk_page_buffers() {
>>  2)  aio-dio-3204  |               |                  write_end_fn() {
>>  2)  aio-dio-3204  |               |                    __ext4_handle_dirty_metadata() {
>>  2)  aio-dio-3204  |   0.024 us    |                      _cond_resched();
>>  2)  aio-dio-3204  |               |                      jbd2_journal_dirty_metadata() {
>>  2)  aio-dio-3204  |   0.025 us    |                        _raw_spin_lock();
>>  2)  aio-dio-3204  |               |                        __jbd2_journal_file_buffer() {
>>  2)  aio-dio-3204  |               |                          warn_dirty_buffer.isra.10() {
>>  2)  aio-dio-3204  |               |                            bdevname() {
>>  2)  aio-dio-3204  | + 15.280 us   |                              disk_name();
>>  2)  aio-dio-3204  | + 15.661 us   |                            }
>>  2)  aio-dio-3204  |               |                            /* ^A4JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 34356). There's a risk of filesystem corruption in case of system crash. */
>>  2)  aio-dio-3204  |               |                            bdevname() {
>>  2)  aio-dio-3204  |   0.080 us    |                              disk_name();
>>  2)  aio-dio-3204  |   0.298 us    |                            }


[-- Attachment #2: ext4-regression-2015-10-05-0.patch --]
[-- Type: text/x-patch, Size: 981 bytes --]

diff -puN fs/ext4/inode.c~debug-dont-prefault-on-write fs/ext4/inode.c
--- a/fs/ext4/inode.c~debug-dont-prefault-on-write	2015-10-05 10:12:35.286153137 -0700
+++ b/fs/ext4/inode.c	2015-10-05 10:50:37.372253050 -0700
@@ -1204,10 +1204,20 @@ static int ext4_journalled_write_end(str
 		copied = ext4_write_inline_data_end(inode, pos, len,
 						    copied, page);
 	else {
+		/*
+		 * With a short write (copied < len) we have potentially
+		 * valuable data in 'page'.  But, when the page is made Uptodate
+		 * this data will be overwritten.  Setting copied=0 will tell
+		 * the upper layers to repeat the write to 'page'.
+		 *
+		 * Only bother zeroing out buffers when we have _actually_
+		 * written data.
+		 */
 		if (copied < len) {
 			if (!PageUptodate(page))
 				copied = 0;
-			page_zero_new_buffers(page, from+copied, to);
+			if (copied)
+				page_zero_new_buffers(page, from+copied, to);
 		}
 
 		ret = ext4_walk_page_buffers(handle, page_buffers(page), from,
_

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-05 16:23   ` Dave Hansen
@ 2015-10-05 20:22     ` Linus Torvalds
  2015-10-05 20:48       ` Dave Hansen
  2015-10-05 20:49       ` H. Peter Anvin
  0 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2015-10-05 20:22 UTC (permalink / raw)
  To: Dave Hansen, Peter Anvin
  Cc: Theodore Ts'o, Andrew Morton, linux-ext4, Linux Kernel Mailing List

On Mon, Oct 5, 2015 at 5:23 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> One thing I've been noticing on Skylake is that barriers (implicit and
> explicit) are showing up more in profiles.

Ahh, you're on skylake?

It's entirely possible that the issue is that the whole
"stac/mov/clac" is much more expensive because skylake actually ends
up supporting those AC instructions. That would make sense.

We could probably do them outside the loop, rather than tightly around
the actual move instructions. Peter (hpa), is there some sane
interface to try to do that?

>  What we're seeing here
> probably isn't actually stac/clac overhead, but the cost of finishing
> some other operations that are outstanding before we can proceed through
> here.

I suspect it actually _is_ stac/clac overhead. It might well be that
clac/stac ends up serializing loads some way. Last I heard, they were
reasonably cheap but certainly not free - and when we're talking about
something that just loops over bringing the line into cache, it might
be relatively expensive.

How did you do the profile? Use "-e cycles:pp" to get the precise
profile information, which should actually attribute the cost to the
instruction that really causes it.

              Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-05 20:22     ` Linus Torvalds
@ 2015-10-05 20:48       ` Dave Hansen
  2015-10-05 21:18         ` Linus Torvalds
  2015-10-05 20:49       ` H. Peter Anvin
  1 sibling, 1 reply; 23+ messages in thread
From: Dave Hansen @ 2015-10-05 20:48 UTC (permalink / raw)
  To: Linus Torvalds, Peter Anvin
  Cc: Theodore Ts'o, Andrew Morton, linux-ext4, Linux Kernel Mailing List

On 10/05/2015 01:22 PM, Linus Torvalds wrote:
> On Mon, Oct 5, 2015 at 5:23 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
>> One thing I've been noticing on Skylake is that barriers (implicit and
>> explicit) are showing up more in profiles.
> 
> Ahh, you're on skylake?

Yup.

> It's entirely possible that the issue is that the whole
> "stac/mov/clac" is much more expensive because skylake actually ends
> up supporting those AC instructions. That would make sense.
> 
> We could probably do them outside the loop, rather than tightly around
> the actual move instructions. Peter (hpa), is there some sane
> interface to try to do that?

iov_iter_fault_in_readable() is just going and touching a single word in
the page so that it is faulted in, or a pair of words if it manages to
cross a page boundary (which isn't happening here).  I'm not sure
there's a loop to move them out of here (for the prefaulting part).

We could theoretically expand the stac/clac to be around the pair of
__get_user()s in fault_in_pages_readable() but that would only help the
case where we are crossing a page boundary.

Although I was probably wrong about the source of the overhead, the
point still remains that the prefaulting is eating cycles for no
practical benefit.

>>  What we're seeing here
>> probably isn't actually stac/clac overhead, but the cost of finishing
>> some other operations that are outstanding before we can proceed through
>> here.
> 
> I suspect it actually _is_ stac/clac overhead. It might well be that
> clac/stac ends up serializing loads some way. Last I heard, they were
> reasonably cheap but certainly not free - and when we're talking about
> something that just loops over bringing the line into cache, it might
> be relatively expensive.
> 
> How did you do the profile? Use "-e cycles:pp" to get the precise
> profile information, which should actually attribute the cost to the
> instruction that really causes it.

It reduced the skid a bit.

Plain (no -e"):
>        │      stac
>  24.57 │      mov    (%rcx),%sil
>  15.70 │      clac
>  28.77 │      test   %eax,%eax
>   2.15 │      mov    %sil,-0x1(%rbp)
>   8.93 │    ↓ jne    66
>   2.31 │      movslq %edx,%rdx

With "-e cycles:pp":
>        │      sub    $0x8,%rsp
>  24.57 │      stac
>  15.49 │      mov    (%rcx),%sil
>  29.06 │      clac
>   2.24 │      test   %eax,%eax
>   8.77 │      mov    %sil,-0x1(%rbp)
>   2.22 │    ↓ jne    66
>        │      movslq %edx,%rdx


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-05 20:22     ` Linus Torvalds
  2015-10-05 20:48       ` Dave Hansen
@ 2015-10-05 20:49       ` H. Peter Anvin
  2015-10-06  7:56         ` Ingo Molnar
  1 sibling, 1 reply; 23+ messages in thread
From: H. Peter Anvin @ 2015-10-05 20:49 UTC (permalink / raw)
  To: Linus Torvalds, Dave Hansen
  Cc: Theodore Ts'o, Andrew Morton, linux-ext4, Linux Kernel Mailing List

On October 5, 2015 1:22:44 PM PDT, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>We could probably do them outside the loop, rather than tightly around
>the actual move instructions. Peter (hpa), is there some sane
>interface to try to do that?

There are... in particular the get_user_try/catch mechanism, used among other things by the signal handing, as well as the copy functions (I don't have the original code in front of me so I don't have the context which would be appropriate.)


-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-05 20:48       ` Dave Hansen
@ 2015-10-05 21:18         ` Linus Torvalds
  2015-10-05 21:55           ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2015-10-05 21:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Anvin, Theodore Ts'o, Andrew Morton, linux-ext4,
	Linux Kernel Mailing List

On Mon, Oct 5, 2015 at 9:48 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> Although I was probably wrong about the source of the overhead, the
> point still remains that the prefaulting is eating cycles for no
> practical benefit.

Yeah, no, I'm not disagreeing with that part, I'm just more of a "at
this point in the rc series we are probably better off reverting".

Your ext4 patch may well fix the issue, and be the right thing to do
(_regardless_ of the revert, in fact - while it might make the revert
unnecessary, it might also be a good idea even if we do revert).

The subtlety of this just worries me, and the reason I'd still be
inclined to revert is simply "it's been that way a long time, the safe
thing is to go back and take this slow".

> With "-e cycles:pp":
>>        │      sub    $0x8,%rsp
>>  24.57 │      stac
>>  15.49 │      mov    (%rcx),%sil
>>  29.06 │      clac
>>   2.24 │      test   %eax,%eax
>>   8.77 │      mov    %sil,-0x1(%rbp)
>>   2.22 │    ↓ jne    66
>>        │      movslq %edx,%rdx

Ok, so it really is the stac/clac that is the bulk of the cost. Hmm.

You're right that the loop there will only be executed once for your
case, so moving the stac/clac outside probably doesn't help. It
*might* still make a difference just for microarchitectural reasons
(ie they may cause more trouble just because they are close to an
instruction that depends on them), but it's questionable.

It is a bit worrisome to see that those things are so expensive. Right
now almost all user accesses will cause *lots* of clac/stac stuff.

I originally asked Intel to do SMAP using a segment prefix, but that
was not to be..

              Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-05 21:18         ` Linus Torvalds
@ 2015-10-05 21:55           ` Linus Torvalds
  2015-10-05 23:33             ` Dave Hansen
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2015-10-05 21:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Anvin, Theodore Ts'o, Andrew Morton, linux-ext4,
	Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 2555 bytes --]

On Mon, Oct 5, 2015 at 10:18 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Your ext4 patch may well fix the issue, and be the right thing to do
> (_regardless_ of the revert, in fact - while it might make the revert
> unnecessary, it might also be a good idea even if we do revert).

Thinking a bit more about your patch, I actually am getting more and
more convinced that it's the wrong thing to do.

Why?

Because the whole "Setting copied=0 will tell the upper layers to
repeat the write" just seems a nasty layering violation, where the
low-level filesystem uses a magic return code to introduce a special
case at the upper layers. But the upper layers are actually already
*aware* of the special case, and in fact have a comment about it.

So I think that the whole "setting copied to 0" would actually make a
lot more sense in the *caller*. Just do it in generic_perform_write()
instead. Then all the special cases and the restarting is all
together.

What do you guys think? This basically simplifies the low-level
filesystem rules, and says:

 - the filesystem will only ever see a partial "->write_end()" for the
case where the page was up-to-date, so that there is no issue with
"oops, we now have part of the page that may not have been written at
all"

 - if the page wasn't up-to-date before, ->write_end() will either be
everything we said we'd do in ->write_begin(), or it will be nothing
at all.

Hmm? This would seem to keep the special cases at the right layer, and
actually allow low-level filesystems to simplify things (ie the
special "copied = 0" special case in ext4 goes away.

The ext4 side still worries me, though. You made that
"page_zero_new_buffers()" conditional on "copied" being non-zero, but
I'm not convinced it can be conditional. Even if we retry, that retry
may end up failing (for example, because the source isn't mapped, so
we return -EFAULT rather than re-doing the write), but we have those
new buffers that got allocated in write_begin(), and now nobody has
ever written any data to them at all, so they have random stale
contents.

So I do think this needs more thought. Or at least somebody should
explain to me better why it's all ok.

I'm attaching the "copied = 0" special case thing at the
generic_perform_write() level as a patch for comments. But for now I
still think that reverting would seem to be the safer thing (which
still possibly leaves things buggy with a racy unmap, but at least
it's the old bug that we've never hit in practice).

Dave? Ted? Comments?

               Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 925 bytes --]

 mm/filemap.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 72940fb38666..e8d01936817a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2493,6 +2493,20 @@ again:
 		pagefault_enable();
 		flush_dcache_page(page);
 
+		/*
+		 * If we didn't successfully copy all the data from user space,
+		 * and the target page is not up-to-date, we will have to prefault
+		 * the source. And if the page wasn't up-to-date before the write,
+		 * the "write_end()" may need to *make* it up-to-date, and thus
+		 * overwrite our partial copy.
+		 *
+		 * So for that case, thow away the whole thing and force a full
+		 * restart (see comment above, and iov_iter_fault_in_readable()
+		 * below).
+		 */
+		if (copied < bytes && !PageUptodate(page))
+			copied = 0;
+
 		status = a_ops->write_end(file, mapping, pos, bytes, copied,
 						page, fsdata);
 		if (unlikely(status < 0))

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-05 21:55           ` Linus Torvalds
@ 2015-10-05 23:33             ` Dave Hansen
  2015-10-06  9:01               ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Hansen @ 2015-10-05 23:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Anvin, Theodore Ts'o, Andrew Morton, linux-ext4,
	Linux Kernel Mailing List

> On Mon, Oct 5, 2015 at 10:18 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Your ext4 patch may well fix the issue, and be the right thing to do
>> (_regardless_ of the revert, in fact - while it might make the revert
>> unnecessary, it might also be a good idea even if we do revert).
> 
> Thinking a bit more about your patch, I actually am getting more and
> more convinced that it's the wrong thing to do.
> 
> Why?
> 
> Because the whole "Setting copied=0 will tell the upper layers to
> repeat the write" just seems a nasty layering violation, where the
> low-level filesystem uses a magic return code to introduce a special
> case at the upper layers. But the upper layers are actually already
> *aware* of the special case, and in fact have a comment about it.

It's nasty, but it's widespread.  I'm fairly sure some version of that
code shows up in all these places:

	ext4_journalled_write_end()
	generic_write_end()->block_write_end()
	ocfs2_write_end_inline()
	ocfs2_write_end_nolock()
	logfs_write_end()
	ext3_journalled_write_end()
	btrfs_copy_from_user()
	reiserfs_write_end()

> The ext4 side still worries me, though. You made that
> "page_zero_new_buffers()" conditional on "copied" being non-zero, but
> I'm not convinced it can be conditional. Even if we retry, that retry
> may end up failing (for example, because the source isn't mapped, so
> we return -EFAULT rather than re-doing the write), but we have those
> new buffers that got allocated in write_begin(), and now nobody has
> ever written any data to them at all, so they have random stale
> contents.

Yeah, I guess they're not uptodate and nobody is on the hook to make
them uptodate.

> So I do think this needs more thought. Or at least somebody should
> explain to me better why it's all ok.
> 
> I'm attaching the "copied = 0" special case thing at the
> generic_perform_write() level as a patch for comments. But for now I
> still think that reverting would seem to be the safer thing (which
> still possibly leaves things buggy with a racy unmap, but at least
> it's the old bug that we've never hit in practice).

FWIW, I'm not objecting at all to a revert.  It's obviously the safe
thing to do.

> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2493,6 +2493,20 @@ again:
>  		pagefault_enable();
>  		flush_dcache_page(page);
>  
> +		/*
> +		 * If we didn't successfully copy all the data from user space,
> +		 * and the target page is not up-to-date, we will have to prefault
> +		 * the source. And if the page wasn't up-to-date before the write,
> +		 * the "write_end()" may need to *make* it up-to-date, and thus
> +		 * overwrite our partial copy.
> +		 *
> +		 * So for that case, thow away the whole thing and force a full
> +		 * restart (see comment above, and iov_iter_fault_in_readable()
> +		 * below).
> +		 */
> +		if (copied < bytes && !PageUptodate(page))
> +			copied = 0;
> +
>  		status = a_ops->write_end(file, mapping, pos, bytes, copied,
>  						page, fsdata);

Did you mean that as a cleanup, or to fix the regression?

Since the page isn't faulted in yet, iov_iter_copy_from_user_atomic()
had already set copied=0, so that has no effect on the ext4 code being
called and it spits out the same warning Ted originally reported.

Ted, do we really just need to tell ext4 that this dirtying operation is
OK?  Or is there really a risk of filesystem corruption that we're not
seeing?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-05 20:49       ` H. Peter Anvin
@ 2015-10-06  7:56         ` Ingo Molnar
  2015-10-06  9:10           ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2015-10-06  7:56 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Dave Hansen, Theodore Ts'o, Andrew Morton,
	linux-ext4, Linux Kernel Mailing List


* H. Peter Anvin <hpa@zytor.com> wrote:

> On October 5, 2015 1:22:44 PM PDT, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > We could probably do them outside the loop, rather than tightly around the 
> > actual move instructions. Peter (hpa), is there some sane interface to try to 
> > do that?
> 
> There are... in particular the get_user_try/catch mechanism, used among other 
> things by the signal handing, as well as the copy functions (I don't have the 
> original code in front of me so I don't have the context which would be 
> appropriate.)

Yes, but note that those interfaces are x86 only at the moment, so they'd have to 
be factored out and generalized before we can use it in generic code.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-05 23:33             ` Dave Hansen
@ 2015-10-06  9:01               ` Linus Torvalds
  0 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2015-10-06  9:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Anvin, Theodore Ts'o, Andrew Morton, linux-ext4,
	Linux Kernel Mailing List

On Tue, Oct 6, 2015 at 12:33 AM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> Did you mean that as a cleanup, or to fix the regression?

Purely as a cleanup to try to avoid the (already existing) special
case in at least ext4 - and possibly others.

So that patch was meant just for discussion - it's not really fixing
any existing bugs, and I didn't actually keep it live in my tree.

I'm planning on just doing the revert for now, but I'll wait a bit to
see how this thread pans out first.

> Since the page isn't faulted in yet, iov_iter_copy_from_user_atomic()
> had already set copied=0

Not necessarily. For your case that only does one-byte writes, yes.
But in the generic case you may well have a page-crossing source area
in user space, and get a partial success from the user copy. It's that
partial success case (when the rest of the missing data isn't
necessarily already up-to-date) that I'd like to have low-level
filesystems not have to worry about.

           Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-06  7:56         ` Ingo Molnar
@ 2015-10-06  9:10           ` Linus Torvalds
  2015-10-06  9:27             ` Ingo Molnar
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2015-10-06  9:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Dave Hansen, Theodore Ts'o, Andrew Morton,
	linux-ext4, Linux Kernel Mailing List

On Tue, Oct 6, 2015 at 8:56 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> Yes, but note that those interfaces are x86 only at the moment, so they'd have to
> be factored out and generalized before we can use it in generic code.

ARM64 these days (as a part of ARM8.1) has "Privileged Access Never",
which is their name for SMAP. They do it somewhat similarly with an
instruction to clear/set the PAN bit in the pstate register.

So we really should strive to make this support generic, because by
now it's not x86-specific, and x86 and ARM64 together aren't exactly
some odd special case..

I do in fact wonder if we should aim (eventually) for the rule that
the "__" versions of the user access functions should not do the
SMAP/PAN thing, since they have to be explicitly checked for pointer
being valid anyway. And just make the rule be that since you have to
check for the pointer being valid, you might as well also have to do
the SMAP/PAN thing too.

We really should try get rid of _all_ uses of the "__" versions unless
they are very locally and obviously checked with access_ok(). We've
had way too many cases where people thought they were clever, and
weren't really.

                Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-06  9:10           ` Linus Torvalds
@ 2015-10-06  9:27             ` Ingo Molnar
  2015-10-06 13:29               ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2015-10-06  9:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Dave Hansen, Theodore Ts'o, Andrew Morton,
	linux-ext4, Linux Kernel Mailing List


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, Oct 6, 2015 at 8:56 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > Yes, but note that those interfaces are x86 only at the moment, so they'd have 
> > to be factored out and generalized before we can use it in generic code.
> 
> ARM64 these days (as a part of ARM8.1) has "Privileged Access Never", which is 
> their name for SMAP. They do it somewhat similarly with an instruction to 
> clear/set the PAN bit in the pstate register.
> 
> So we really should strive to make this support generic, because by now it's not 
> x86-specific, and x86 and ARM64 together aren't exactly some odd special case..

Absolutely.

> I do in fact wonder if we should aim (eventually) for the rule that the "__" 
> versions of the user access functions should not do the SMAP/PAN thing, since 
> they have to be explicitly checked for pointer being valid anyway. And just make 
> the rule be that since you have to check for the pointer being valid, you might 
> as well also have to do the SMAP/PAN thing too.
> 
> We really should try get rid of _all_ uses of the "__" versions unless they are 
> very locally and obviously checked with access_ok(). We've had way too many 
> cases where people thought they were clever, and weren't really.

That's a good idea.

The logistics worries me a bit: it looks like a major undertaking, considering the 
widespread use of these APIs in 1400+ call sites:

  triton:~/tip> git grep -E '__get_user|__put_user' | grep -vE '\.h:|arch/' | wc -l
  1086
  triton:~/tip> git grep -E '__get_user|__put_user' arch/x86/ arch/arm64/ | wc -l
  354

... affecting 93 files:

  triton:~/tip> git grep -lE '__get_user|__put_user' | grep -vE '\.h:|arch/' | wc -l
  70
  triton:~/tip> git grep -lE '__get_user|__put_user' arch/x86/ arch/arm64/ | wc -l
  23

(Assuming my grep-fu is strong enough.)

Which is probably more than a hundred commits until we get rid of all 
double-underscore uses?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-06  9:27             ` Ingo Molnar
@ 2015-10-06 13:29               ` Linus Torvalds
  2015-10-06 13:42                 ` Ingo Molnar
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2015-10-06 13:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Dave Hansen, Theodore Ts'o, Andrew Morton,
	linux-ext4, Linux Kernel Mailing List

On Tue, Oct 6, 2015 at 10:27 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
>>
>> We really should try get rid of _all_ uses of the "__" versions unless they are
>> very locally and obviously checked with access_ok(). We've had way too many
>> cases where people thought they were clever, and weren't really.
>
> That's a good idea.
>
> The logistics worries me a bit: it looks like a major undertaking, considering the
> widespread use of these APIs in 1400+ call sites:

Well, quite frankly, I think I'd be ok with just a mass conversion of
the "__" functions to non-underscore ones.

>From past experience, I don't think we have anything that really
cares. The one exception is probably the signal stack handling,
because it really uses multiple individual accesses, and so it is much
more noticeable.

And there should be *no* meaningful difference between the underscore
version and the non-underscore one, unless somebody does something
really odd and questionable (ie depends on a kernel pointer - which
doesn't even work on all architectures!).

So I really think we could do a mass conversion of everything that
isn't under "arch/" (and obviously asm-generic/uaccess.h) in just one
single go.

I obviously wouldn't take that into 4.3, but I really don't think this
would merit splitting up into multiple patches either.

Then, one by one, we might convert back to __get/put_user() when we've
(a) added the SMAP/PAN infrastructure (b) verified that there's an
access_ok() _right_there_ and (c) actually verified that it's
performance-critical.

I see drivers doing __get/put_user(), and it just makes me go "no".
Not only are drivers likely to get it wrong, I don't believe the extra
couple of cycles is going to matter compared to the cost of the
hardware access itself. And if the access_ok() isn't local and
obviously in the *only* place that can possibly lead to that code,
then the code shouldn't use the underscore versions.

                Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-06 13:29               ` Linus Torvalds
@ 2015-10-06 13:42                 ` Ingo Molnar
  0 siblings, 0 replies; 23+ messages in thread
From: Ingo Molnar @ 2015-10-06 13:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Dave Hansen, Theodore Ts'o, Andrew Morton,
	linux-ext4, Linux Kernel Mailing List


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, Oct 6, 2015 at 10:27 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> >>
> >> We really should try get rid of _all_ uses of the "__" versions unless they are
> >> very locally and obviously checked with access_ok(). We've had way too many
> >> cases where people thought they were clever, and weren't really.
> >
> > That's a good idea.
> >
> > The logistics worries me a bit: it looks like a major undertaking, considering the
> > widespread use of these APIs in 1400+ call sites:
> 
> Well, quite frankly, I think I'd be ok with just a mass conversion of the "__" 
> functions to non-underscore ones.
> 
> From past experience, I don't think we have anything that really cares. The one 
> exception is probably the signal stack handling, because it really uses multiple 
> individual accesses, and so it is much more noticeable.
> 
> And there should be *no* meaningful difference between the underscore version 
> and the non-underscore one, unless somebody does something really odd and 
> questionable (ie depends on a kernel pointer - which doesn't even work on all 
> architectures!).
> 
> So I really think we could do a mass conversion of everything that isn't under 
> "arch/" (and obviously asm-generic/uaccess.h) in just one single go.
> 
> I obviously wouldn't take that into 4.3, but I really don't think this would 
> merit splitting up into multiple patches either.
> 
> Then, one by one, we might convert back to __get/put_user() when we've (a) added 
> the SMAP/PAN infrastructure (b) verified that there's an access_ok() 
> _right_there_ and (c) actually verified that it's performance-critical.
> 
> I see drivers doing __get/put_user(), and it just makes me go "no". Not only are 
> drivers likely to get it wrong, I don't believe the extra couple of cycles is 
> going to matter compared to the cost of the hardware access itself. And if the 
> access_ok() isn't local and obviously in the *only* place that can possibly lead 
> to that code, then the code shouldn't use the underscore versions.

Great, fully agreed and will implement it this way!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-05 18:04 ` Dave Hansen
@ 2015-10-07  3:34   ` Theodore Ts'o
  2015-10-07  7:32     ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Theodore Ts'o @ 2015-10-07  3:34 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Andrew Morton, Linus Torvalds, linux-ext4, linux-kernel

On Mon, Oct 05, 2015 at 11:04:35AM -0700, Dave Hansen wrote:
> 
> The warning comes out of ext4_walk_page_buffers() and the dirty state
> comes from page_zero_new_buffers(). That seems a _bit_ goofy that the
> filesystem is marking the page dirty and then so shortly warning about it.

Yes, this is a bug in ext4 --- and in fact in ext3, which apparently
we've lived with for *years*.  The problem is that when we are
journalling data buffers, we can't use page_zero_new_buffers(),
because instead of calling mark_buffer_dirty(bh), we need to call
ext4_handle_dirty_metadata(bh).  This will call mark_buffer_dirty(bh)
if journalling is not enabled, or if journalling is enabled, it will
call jbd2_journal_dirty_metadata(handle,bh).

Apprently it is extremely rare that (copied < len) --- especially when
mm/filemap.c was doing a prefault.  :-)

So your patch looks good, but in addition to that, if copied is > 0
and less than len, we shouldn't be calling page_zero_new_buffers().
We're going to need our own version of it that doesn't call
mark_buffer_dirty().

So if Linus wants to revert 998ef75ddb patch, we can do that, but I'm
also happy applying your patch as a way of preventing the failure.
We'll need to do more work to make ext4_journalled_write_end(), but
that's a bigger change which I'd rather not do at this point in the
development cycle.

Thanks again for taking a closer look at things.  I'm currently
running a full soak test to make sure your patch to
ext4_journalled_write_end() doesn't introduce any other problems, but
I'm quite confident it should be fine.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-07  3:34   ` Theodore Ts'o
@ 2015-10-07  7:32     ` Linus Torvalds
  2015-10-07 15:43       ` Theodore Ts'o
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2015-10-07  7:32 UTC (permalink / raw)
  To: Theodore Ts'o, Dave Hansen, Andrew Morton, Linus Torvalds,
	linux-ext4, Linux Kernel Mailing List

On Wed, Oct 7, 2015 at 4:34 AM, Theodore Ts'o <tytso@mit.edu> wrote:
>
> So your patch looks good, but in addition to that, if copied is > 0
> and less than len, we shouldn't be calling page_zero_new_buffers().
> We're going to need our own version of it that doesn't call
> mark_buffer_dirty().

Well, even with copied == 0, isn't it wrong not to zero the data and
mark it dirty, though?

At least for traditional non-prealloc filesystems, write_begin() will
have allocated the blocks on disk, and depending on the size of the
write may not have zeroed anything, or marked anything dirty. So the
disk layout may be set up, and the blocks *need* to be written back to
disk with real data at some later time.

I'm not sure that that is how write_begin() works for ext4, but the
fact that you do the page_zero_new_buffers() does imply that it's an
issue.

And none of *those* requirements change just because "copied" would be
zero. If you avoid zeroing the buffers and marking them dirty, nothing
will ever initialize them on disk, andn if the prefault then later
fails during retry, no later write will happen either. So now
eventually later, a read() can see stale data from disk.

Now, again, I didn't actually follow the ext4 code, so this may not be
an issue on ext4 due to the journaling perhaps never writing back the
actual block allocations either, so the above may be a "only happens
on old traditional unix filesystems" kind of scenario. But it worries
me.

That's why II'll just revert the change for now as a "ok, the change
itself isn't buggy, but it exposes long-time bugs in at least ext4,
and let's take this slow and give the ext4 people time to make sure
they have that case fixed".

> So if Linus wants to revert 998ef75ddb patch, we can do that, but I'm
> also happy applying your patch as a way of preventing the failure.

I do think this is an ext4 bug, and you'll need to do something *like*
that patch. Maybe Dave's patch is good as-is. It's the "I think you
need to do more" that I worry about. Not at -rc4 time. Not with a core
filesystem like ext4. Let's not hurry this too much.

                Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
  2015-10-07  7:32     ` Linus Torvalds
@ 2015-10-07 15:43       ` Theodore Ts'o
  2015-10-09  4:01         ` [PATCH] ext4: use private version of page_zero_new_buffers() for data=journal mode Theodore Ts'o
  0 siblings, 1 reply; 23+ messages in thread
From: Theodore Ts'o @ 2015-10-07 15:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Andrew Morton, linux-ext4, Linux Kernel Mailing List

On Wed, Oct 07, 2015 at 08:32:16AM +0100, Linus Torvalds wrote:
> And none of *those* requirements change just because "copied" would be
> zero. If you avoid zeroing the buffers and marking them dirty, nothing
> will ever initialize them on disk, andn if the prefault then later
> fails during retry, no later write will happen either. So now
> eventually later, a read() can see stale data from disk.

Shoot.  You're right, we could end up allowing a stale data to be
exposed.  If we knew the caller of write_end() was guaranteed to
retry, we could skip the jbd2_journal_stop() call and keep the handle
open, which would prevent the transaction from closing.  But if the
write gets abandoned, then the transaction would never close, and
things would grind to a halt.

> I do think this is an ext4 bug, and you'll need to do something *like*
> that patch. Maybe Dave's patch is good as-is. It's the "I think you
> need to do more" that I worry about. Not at -rc4 time. Not with a core
> filesystem like ext4. Let's not hurry this too much.

Agreed, I know what to do, and and the change is not something I'd
want to get in -rc4.  I'll target a fix for the next merge window.

    	    	   	       	      	- Ted

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH] ext4: use private version of page_zero_new_buffers() for data=journal mode
  2015-10-07 15:43       ` Theodore Ts'o
@ 2015-10-09  4:01         ` Theodore Ts'o
  2015-10-13  6:06           ` Leonid V. Fedorenchik
  2015-10-15 11:17           ` Jan Kara
  0 siblings, 2 replies; 23+ messages in thread
From: Theodore Ts'o @ 2015-10-09  4:01 UTC (permalink / raw)
  To: Ext4 Developers List
  Cc: Linux Kernel Developers List, dave.hansen, torvalds, akpm,
	Theodore Ts'o

If there is a error while copying data from userspace into the page
cache during a write(2) system call, in data=journal mode, in
ext4_journalled_write_end() were using page_zero_new_buffers() from
fs/buffer.c.  Unfortunately, this sets the buffer dirty flag, which is
no good if journalling is enabled.  This is a long-standing bug that
goes back for years and years in ext3, but a combination of (a)
data=journal not being very common, (b) in many case it only results
in a warning message. and (c) only very rarely causes the kernel hang,
means that we only really noticed this as a problem when commit
998ef75ddb caused this failure to happen frequently enough to cause
generic/208 to fail when run in data=journal mode.

The fix is to have our own version of this function that doesn't call
mark_dirty_buffer(), since we will end up calling
ext4_handle_dirty_metadata() on the buffer head(s) in questions very
shortly afterwards in ext4_journalled_write_end().

Thanks to Dave Hansen and Linus Torvalds for helping to identify the
root cause of the problem.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/ext4/inode.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ae52e32..0a589bb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1181,6 +1181,38 @@ errout:
 	return ret ? ret : copied;
 }
 
+/*
+ * This is a private version of page_zero_new_buffers() which doesn't
+ * set the buffer to be dirty, since in data=journalled mode we need
+ * to call ext4_handle_dirty_metadata() instad.
+ */
+static void zero_new_buffers(struct page *page, unsigned from, unsigned to)
+{
+	unsigned int block_start = 0, block_end;
+	struct buffer_head *head, *bh;
+
+	bh = head = page_buffers(page);
+	do {
+		block_end = block_start + bh->b_size;
+		if (buffer_new(bh)) {
+			if (block_end > from && block_start < to) {
+				if (!PageUptodate(page)) {
+					unsigned start, size;
+
+					start = max(from, block_start);
+					size = min(to, block_end) - start;
+
+					zero_user(page, start, size);
+					set_buffer_uptodate(bh);
+				}
+				clear_buffer_new(bh);
+			}
+		}
+		block_start = block_end;
+		bh = bh->b_this_page;
+	} while (bh != head);
+}
+
 static int ext4_journalled_write_end(struct file *file,
 				     struct address_space *mapping,
 				     loff_t pos, unsigned len, unsigned copied,
@@ -1207,7 +1239,7 @@ static int ext4_journalled_write_end(struct file *file,
 		if (copied < len) {
 			if (!PageUptodate(page))
 				copied = 0;
-			page_zero_new_buffers(page, from+copied, to);
+			zero_new_buffers(page, from+copied, to);
 		}
 
 		ret = ext4_walk_page_buffers(handle, page_buffers(page), from,
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] ext4: use private version of page_zero_new_buffers() for data=journal mode
  2015-10-09  4:01         ` [PATCH] ext4: use private version of page_zero_new_buffers() for data=journal mode Theodore Ts'o
@ 2015-10-13  6:06           ` Leonid V. Fedorenchik
  2015-10-15 11:17           ` Jan Kara
  1 sibling, 0 replies; 23+ messages in thread
From: Leonid V. Fedorenchik @ 2015-10-13  6:06 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Ext4 Developers List, Linux Kernel Developers List, dave.hansen,
	torvalds, akpm

On Fri,  9 Oct 2015 00:01:09 -0400
Theodore Ts'o <tytso@mit.edu> wrote:

> If there is a error while copying data from userspace into the page
> cache during a write(2) system call, in data=journal mode, in
> ext4_journalled_write_end() were using page_zero_new_buffers() from
> fs/buffer.c.  Unfortunately, this sets the buffer dirty flag, which is
> no good if journalling is enabled.  This is a long-standing bug that
> goes back for years and years in ext3, but a combination of (a)
> data=journal not being very common, (b) in many case it only results
> in a warning message. and (c) only very rarely causes the kernel hang,
> means that we only really noticed this as a problem when commit
> 998ef75ddb caused this failure to happen frequently enough to cause
> generic/208 to fail when run in data=journal mode.
> 
> The fix is to have our own version of this function that doesn't call
> mark_dirty_buffer(), since we will end up calling
> ext4_handle_dirty_metadata() on the buffer head(s) in questions very
> shortly afterwards in ext4_journalled_write_end().
> 
> Thanks to Dave Hansen and Linus Torvalds for helping to identify the
> root cause of the problem.
> 
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> ---
>  fs/ext4/inode.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index ae52e32..0a589bb 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1181,6 +1181,38 @@ errout:
>  	return ret ? ret : copied;
>  }
>  
> +/*
> + * This is a private version of page_zero_new_buffers() which doesn't
> + * set the buffer to be dirty, since in data=journalled mode we need
> + * to call ext4_handle_dirty_metadata() instad.
Small typo: s/instad/instead/

> + */
> +static void zero_new_buffers(struct page *page, unsigned from, unsigned to)
> +{
> +	unsigned int block_start = 0, block_end;
> +	struct buffer_head *head, *bh;
> +
> +	bh = head = page_buffers(page);
> +	do {
> +		block_end = block_start + bh->b_size;
> +		if (buffer_new(bh)) {
> +			if (block_end > from && block_start < to) {
> +				if (!PageUptodate(page)) {
> +					unsigned start, size;
> +
> +					start = max(from, block_start);
> +					size = min(to, block_end) - start;
> +
> +					zero_user(page, start, size);
> +					set_buffer_uptodate(bh);
> +				}
> +				clear_buffer_new(bh);
> +			}
> +		}
> +		block_start = block_end;
> +		bh = bh->b_this_page;
> +	} while (bh != head);
> +}
> +
>  static int ext4_journalled_write_end(struct file *file,
>  				     struct address_space *mapping,
>  				     loff_t pos, unsigned len, unsigned copied,
> @@ -1207,7 +1239,7 @@ static int ext4_journalled_write_end(struct file *file,
>  		if (copied < len) {
>  			if (!PageUptodate(page))
>  				copied = 0;
> -			page_zero_new_buffers(page, from+copied, to);
> +			zero_new_buffers(page, from+copied, to);
>  		}
>  
>  		ret = ext4_walk_page_buffers(handle, page_buffers(page), from,



-- 

Best regards,
Leonid Fedorenchik

Software Engineer
Paragon Software Group
Skype: leonid.fedorenchik
http://www.paragon-software.com


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] ext4: use private version of page_zero_new_buffers() for data=journal mode
  2015-10-09  4:01         ` [PATCH] ext4: use private version of page_zero_new_buffers() for data=journal mode Theodore Ts'o
  2015-10-13  6:06           ` Leonid V. Fedorenchik
@ 2015-10-15 11:17           ` Jan Kara
  1 sibling, 0 replies; 23+ messages in thread
From: Jan Kara @ 2015-10-15 11:17 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Ext4 Developers List, Linux Kernel Developers List, dave.hansen,
	torvalds, akpm

On Fri 09-10-15 00:01:09, Ted Tso wrote:
> If there is a error while copying data from userspace into the page
> cache during a write(2) system call, in data=journal mode, in
> ext4_journalled_write_end() were using page_zero_new_buffers() from
> fs/buffer.c.  Unfortunately, this sets the buffer dirty flag, which is
> no good if journalling is enabled.  This is a long-standing bug that
> goes back for years and years in ext3, but a combination of (a)
> data=journal not being very common, (b) in many case it only results
> in a warning message. and (c) only very rarely causes the kernel hang,
> means that we only really noticed this as a problem when commit
> 998ef75ddb caused this failure to happen frequently enough to cause
> generic/208 to fail when run in data=journal mode.
> 
> The fix is to have our own version of this function that doesn't call
> mark_dirty_buffer(), since we will end up calling
> ext4_handle_dirty_metadata() on the buffer head(s) in questions very
> shortly afterwards in ext4_journalled_write_end().
> 
> Thanks to Dave Hansen and Linus Torvalds for helping to identify the
> root cause of the problem.
> 
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.com>

								Honza
> ---
>  fs/ext4/inode.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index ae52e32..0a589bb 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1181,6 +1181,38 @@ errout:
>  	return ret ? ret : copied;
>  }
>  
> +/*
> + * This is a private version of page_zero_new_buffers() which doesn't
> + * set the buffer to be dirty, since in data=journalled mode we need
> + * to call ext4_handle_dirty_metadata() instad.
> + */
> +static void zero_new_buffers(struct page *page, unsigned from, unsigned to)
> +{
> +	unsigned int block_start = 0, block_end;
> +	struct buffer_head *head, *bh;
> +
> +	bh = head = page_buffers(page);
> +	do {
> +		block_end = block_start + bh->b_size;
> +		if (buffer_new(bh)) {
> +			if (block_end > from && block_start < to) {
> +				if (!PageUptodate(page)) {
> +					unsigned start, size;
> +
> +					start = max(from, block_start);
> +					size = min(to, block_end) - start;
> +
> +					zero_user(page, start, size);
> +					set_buffer_uptodate(bh);
> +				}
> +				clear_buffer_new(bh);
> +			}
> +		}
> +		block_start = block_end;
> +		bh = bh->b_this_page;
> +	} while (bh != head);
> +}
> +
>  static int ext4_journalled_write_end(struct file *file,
>  				     struct address_space *mapping,
>  				     loff_t pos, unsigned len, unsigned copied,
> @@ -1207,7 +1239,7 @@ static int ext4_journalled_write_end(struct file *file,
>  		if (copied < len) {
>  			if (!PageUptodate(page))
>  				copied = 0;
> -			page_zero_new_buffers(page, from+copied, to);
> +			zero_new_buffers(page, from+copied, to);
>  		}
>  
>  		ret = ext4_walk_page_buffers(handle, page_buffers(page), from,
> -- 
> 2.5.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2015-10-15 11:17 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-05 15:22 [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal Theodore Ts'o
2015-10-05 15:58 ` Linus Torvalds
2015-10-05 16:23   ` Dave Hansen
2015-10-05 20:22     ` Linus Torvalds
2015-10-05 20:48       ` Dave Hansen
2015-10-05 21:18         ` Linus Torvalds
2015-10-05 21:55           ` Linus Torvalds
2015-10-05 23:33             ` Dave Hansen
2015-10-06  9:01               ` Linus Torvalds
2015-10-05 20:49       ` H. Peter Anvin
2015-10-06  7:56         ` Ingo Molnar
2015-10-06  9:10           ` Linus Torvalds
2015-10-06  9:27             ` Ingo Molnar
2015-10-06 13:29               ` Linus Torvalds
2015-10-06 13:42                 ` Ingo Molnar
2015-10-05 16:03 ` Dave Hansen
2015-10-05 18:04 ` Dave Hansen
2015-10-07  3:34   ` Theodore Ts'o
2015-10-07  7:32     ` Linus Torvalds
2015-10-07 15:43       ` Theodore Ts'o
2015-10-09  4:01         ` [PATCH] ext4: use private version of page_zero_new_buffers() for data=journal mode Theodore Ts'o
2015-10-13  6:06           ` Leonid V. Fedorenchik
2015-10-15 11:17           ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).