* Re: ext3 IO latency measurements
[not found] ` <ck0D8-5Ua-11@gated-at.bofh.it>
@ 2009-03-26 18:06 ` Bodo Eggert
[not found] ` <ck1fN-6Yp-25@gated-at.bofh.it>
1 sibling, 0 replies; 6+ messages in thread
From: Bodo Eggert @ 2009-03-26 18:06 UTC
To: Linus Torvalds, Ingo Molnar, Jan Kara, Andrew Morton, Alan Cox,
Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
David Rees, Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
Roland McGrath, Theodore Tso

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> brains, and we'd better change the default - and if some distro really
> _thinks_ about it, and decides that they really want old-fashioned atime,
> let them do that".

That's something a 2.7 kernel series might do - collect all sane defaults
you wished to set all the time.
[parent not found: <ck1fN-6Yp-25@gated-at.bofh.it>]
[parent not found: <ck1zb-7o2-29@gated-at.bofh.it>]
[parent not found: <ck22i-7Zy-25@gated-at.bofh.it>]
* Re: [PATCH 1/2] Add a strictatime mount option
[not found] ` <ck22i-7Zy-25@gated-at.bofh.it>
@ 2009-03-27 19:13 ` Bodo Eggert
0 siblings, 0 replies; 6+ messages in thread
From: Bodo Eggert @ 2009-03-27 19:13 UTC
To: Matthew Garrett, Linus Torvalds, Theodore Tso, Ingo Molnar, Jan Kara,
Andrew Morton, Alan Cox, Arjan van de Ven, Peter Zijlstra, Nick Piggin,
Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List,
Oleg Nesterov, Roland McGrath

Matthew Garrett <mjg@redhat.com> wrote:

> Add support for explicitly requesting full atime updates. This makes it
> possible for kernels to default to relatime but still allow userspace to
> override it.

Maybe the *atime options should be consolidated into a single atime=foo
option.
[parent not found: <cjZ4h-3jp-21@gated-at.bofh.it>]
[parent not found: <cjZQH-4AG-21@gated-at.bofh.it>]
[parent not found: <ck0a6-51w-39@gated-at.bofh.it>]
[parent not found: <ck0D7-5Ua-9@gated-at.bofh.it>]
[parent not found: <ck1IK-7zR-19@gated-at.bofh.it>]
* Re: [PATCH] Allow relatime to update atime once a day
[not found] ` <ck1IK-7zR-19@gated-at.bofh.it>
@ 2009-03-27 19:34 ` Bodo Eggert
2009-03-27 19:58 ` Bodo Eggert
1 sibling, 0 replies; 6+ messages in thread
From: Bodo Eggert @ 2009-03-27 19:34 UTC
To: Matthew Garrett, Linus Torvalds, Andrew Morton, Frans Pop, mingo, tytso,
jack, alan, arjan, a.p.zijlstra, npiggin, jens.axboe, drees76, jesper,
linux-kernel, oleg, roland, willy, vaurora

Matthew Garrett <mjg@redhat.com> wrote:

> --- a/fs/inode.c

> +
> +	if (!relatime_need_update(mnt, inode, now))
> +		goto out;
> +
->	if (timespec_equal(&inode->i_atime, &now))
->		goto out;

timespec_equal is now duplicate, because:

> +static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
..
> +	if (timespec_compare(&inode->i_mtime, &inode->i_atime) >= 0)
> +		return 1;
* Re: [PATCH] Allow relatime to update atime once a day
[not found] ` <ck1IK-7zR-19@gated-at.bofh.it>
2009-03-27 19:34 ` [PATCH] Allow relatime to update atime once a day Bodo Eggert
@ 2009-03-27 19:58 ` Bodo Eggert
1 sibling, 0 replies; 6+ messages in thread
From: Bodo Eggert @ 2009-03-27 19:58 UTC
To: Matthew Garrett, Linus Torvalds, Andrew Morton, Frans Pop, mingo, tytso,
jack, alan, arjan, a.p.zijlstra, npiggin, jens.axboe, drees76, jesper,
linux-kernel, oleg, roland, willy, vaurora

Matthew Garrett <mjg@redhat.com> wrote:

> diff --git a/fs/inode.c b/fs/inode.c
> index 0487ddb..057c92b 100644
> --- a/fs/inode.c

>	now = current_fs_time(inode->i_sb);
> +
> +	if (!relatime_need_update(mnt, inode, now))
> +		goto out;
> +
>	if (timespec_equal(&inode->i_atime, &now))
>		goto out;

Forget what I just said, I should rather read than assume.

But I'm wondering if inlining this once-used function would be a good thing,
since relatime is supposed to be a common option? Otherwise, I'd pull the
flags check out and avoid the function call.
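
For readers following the patch, the decision made by relatime_need_update() can be sketched in user space. This is only a sketch under stated assumptions, not the kernel code: the mtime comparison is the part quoted above and the one-day limit comes from the patch subject, while the ctime comparison, the shell function name and its argument order are assumptions made for illustration.

```sh
#!/bin/sh
# Sketch of the relatime rule: should a read update atime?
# Usage: relatime_need_update ATIME MTIME CTIME NOW   (seconds since the epoch)
# Returns 0 (true) when an update is needed, 1 otherwise.
relatime_need_update() {
    atime=$1; mtime=$2; ctime=$3; now=$4

    # File modified at or after the last recorded access (quoted in the patch).
    if [ "$mtime" -ge "$atime" ]; then return 0; fi

    # Inode changed at or after the last recorded access (assumed, not quoted).
    if [ "$ctime" -ge "$atime" ]; then return 0; fi

    # Recorded atime is at least a day old (the "once a day" in the subject).
    if [ $((now - atime)) -ge 86400 ]; then return 0; fi

    return 1
}

# Example: atime is newer than mtime/ctime and only 1000 seconds old.
if relatime_need_update 1000 900 900 2000; then
    echo "update atime"
else
    echo "skip update"
fi
```

With the sample values the function reports that no update is needed. Moving NOW a full day past ATIME, or setting MTIME equal to ATIME, flips the result.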
* Re: Linux 2.6.29
@ 2009-03-25 21:51 Theodore Tso
2009-03-25 23:21 ` Linus Torvalds
0 siblings, 1 reply; 6+ messages in thread
From: Theodore Tso @ 2009-03-25 21:51 UTC
To: Linus Torvalds
Cc: Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
Jesper Krogh, Linux Kernel Mailing List
On Wed, Mar 25, 2009 at 01:45:43PM -0700, Linus Torvalds wrote:
> > The third potential solution we can try doing is to make some tuning
> > adjustments to the VM so that we start pushing out these data blocks
> > much more aggressively out to the disk.
>
> Yes. but at least one problem is, as mentioned, that when the VM calls
> writepage[s]() to start async writeback, many filesystems do seem to just
> _block_ on it.
Um, no, ext3 shouldn't block on writepage(). Since it doesn't do
delayed allocation, it should always be able to push out a dirty page
to the disk.
- Ted
* Re: Linux 2.6.29
2009-03-25 21:51 Linux 2.6.29 Theodore Tso
@ 2009-03-25 23:21 ` Linus Torvalds
2009-03-25 23:50 ` Jan Kara
0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2009-03-25 23:21 UTC
To: Theodore Tso
Cc: Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
Jesper Krogh, Linux Kernel Mailing List

On Wed, 25 Mar 2009, Theodore Tso wrote:
>
> Um, no, ext3 shouldn't block on writepage(). Since it doesn't do
> delayed allocation, it should always be able to push out a dirty page
> to the disk.

Umm. Maybe I'm mis-reading something, but they seem to all synchronize
with the journal with "ext3_journal_start/stop".

Which will at a minimum wait for 'j_barrier_count == 0' and 't_state !=
T_LOCKED'. Along with making sure that there are enough transaction
buffers.

Do I understand _why_ ext3 does that? Hell no. The code makes no sense to
me. But I don't think I'm wrong.

Look at the sane case (data=ordered): it still does

	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
	...
	err = ext3_journal_stop(handle);

around all the IO starting. Never mind that the IO shouldn't be needing
any journal activity at all afaik in any common case.

Yes, yes, it may need to allocate backing store (a page that was dirtied
by mmap), and I'm sure that's the reason for it all, but the point is,
most of the time there should be no journal activity at all, yet it looks
very much like a simple writepage() will synchronize with a full journal
and wait for the journal to get space.

No?

So tell me again how the VM can rely on the filesystem not blocking at
random points.

		Linus
* Re: Linux 2.6.29
2009-03-25 23:21 ` Linus Torvalds
@ 2009-03-25 23:50 ` Jan Kara
2009-03-26 9:06 ` ext3 IO latency measurements (was: Linux 2.6.29) Ingo Molnar
0 siblings, 1 reply; 6+ messages in thread
From: Jan Kara @ 2009-03-25 23:50 UTC
To: Linus Torvalds
Cc: Theodore Tso, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
Jesper Krogh, Linux Kernel Mailing List

On Wed 25-03-09 16:21:56, Linus Torvalds wrote:
> On Wed, 25 Mar 2009, Theodore Tso wrote:
> >
> > Um, no, ext3 shouldn't block on writepage(). Since it doesn't do
> > delayed allocation, it should always be able to push out a dirty page
> > to the disk.
>
> Umm. Maybe I'm mis-reading something, but they seem to all synchronize
> with the journal with "ext3_journal_start/stop".
>
> Which will at a minimum wait for 'j_barrier_count == 0' and 't_state !=
> T_LOCKED'. Along with making sure that there are enough transaction
> buffers.
>
> Do I understand _why_ ext3 does that? Hell no. The code makes no sense to
> me. But I don't think I'm wrong.
>
> Look at the sane case (data=ordered): it still does
>
>	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
>	...
>	err = ext3_journal_stop(handle);
>
> around all the IO starting. Never mind that the IO shouldn't be needing
> any journal activity at all afaik in any common case.
>
> Yes, yes, it may need to allocate backing store (a page that was dirtied
> by mmap), and I'm sure that's the reason for it all, but the point is,
> most of the time there should be no journal activity at all, yet it looks
> very much like a simple writepage() will synchronize with a full journal
> and wait for the journal to get space.
>
> No?

Yes, you got it right. Furthermore, in ordered mode we need to attach
buffers to the running transaction if they aren't there (but to check
whether they are, we need to pin the running transaction, and we are
basically where we started.. damn).

But maybe there's a way out of it. We don't have to guarantee that data
written via mmap are on disk when "the transaction running when somebody
decided to call writepage" commits (in case no block allocation happens),
so we could just submit those buffers for IO and not attach them to the
transaction...

> So tell me again how the VM can rely on the filesystem not blocking at
> random points.

I can write a patch to make writepage() in the non-"mmapped creation"
case non-blocking on journal. But I'll also have to find out whether it
really helps something. But it's probably worth trying...

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* ext3 IO latency measurements (was: Linux 2.6.29)
2009-03-25 23:50 ` Jan Kara
@ 2009-03-26 9:06 ` Ingo Molnar
2009-03-26 11:37 ` Theodore Tso
0 siblings, 1 reply; 6+ messages in thread
From: Ingo Molnar @ 2009-03-26 9:06 UTC
To: Jan Kara
Cc: Linus Torvalds, Theodore Tso, Andrew Morton, Alan Cox,
Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
David Rees, Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
Roland McGrath

* Jan Kara <jack@suse.cz> wrote:

> > So tell me again how the VM can rely on the filesystem not
> > blocking at random points.
>
> I can write a patch to make writepage() in the non-"mmapped
> creation" case non-blocking on journal. But I'll also have to find
> out whether it really helps something. But it's probably worth
> trying...

_all_ the problems i ever had with ext3 were 'collateral damage'
type of things: simple writes (sometimes even reads) getting
serialized on some large [but reasonable] dirtying activity
elsewhere - even if the system was still well within its
hard-dirty-limit threshold.

So it sure sounds like an area worth improving, and it's not that
hard to reproduce either. Take a system with enough RAM but only a
single disk, and do this in a kernel tree:

  sync
  echo 3 > /proc/sys/vm/drop_caches

  while :; do
    date
    make mrproper 2>/dev/null >/dev/null
    make defconfig 2>/dev/null >/dev/null
    make -j32 bzImage 2>/dev/null >/dev/null
  done &

Plain old kernel build, no distcc and no icecream. Wait a few
minutes for the system to reach equilibrium. There's no tweaking
anywhere, kernel, distro and filesystem defaults used everywhere:

 aldebaran:/home/mingo/linux/linux> ./compile-test
 Thu Mar 26 10:33:03 CET 2009
 Thu Mar 26 10:35:24 CET 2009
 Thu Mar 26 10:36:48 CET 2009
 Thu Mar 26 10:38:54 CET 2009
 Thu Mar 26 10:41:22 CET 2009
 Thu Mar 26 10:43:41 CET 2009
 Thu Mar 26 10:46:02 CET 2009
 Thu Mar 26 10:48:28 CET 2009

And try to use the system while this workload is going on. Use Vim
to edit files in this kernel tree. Use plain _cat_ - and i hit
delays all the time - and it's not the CPU scheduler but all IO
related.

I have such an ext3 based system where i can do such tests and where
i dont mind crashes and data corruption either, so if you send me
experimental patches against latest -git i can try them immediately.
The system has 16 CPUs, 12GB of RAM and a single disk.

Btw., i had this test going on that box while i wrote some simple
scripts in Vim - and it was a horrible experience. The worst wait
was well above one minute - Vim just hung there indefinitely. Not
even Ctrl-Z was possible. I captured one such wait, it was hanging
right here:

 aldebaran:~/linux/linux> cat /proc/3742/stack
 [<ffffffff8034790a>] log_wait_commit+0xbd/0x110
 [<ffffffff803430b2>] journal_stop+0x1df/0x20d
 [<ffffffff8034421f>] journal_force_commit+0x28/0x2d
 [<ffffffff80331c69>] ext3_force_commit+0x2b/0x2d
 [<ffffffff80328b56>] ext3_write_inode+0x3e/0x44
 [<ffffffff802ebb9d>] __sync_single_inode+0xc1/0x2ad
 [<ffffffff802ebed6>] __writeback_single_inode+0x14d/0x15a
 [<ffffffff802ebf0c>] sync_inode+0x29/0x34
 [<ffffffff80327453>] ext3_sync_file+0xa7/0xb4
 [<ffffffff802ef17d>] vfs_fsync+0x78/0xaf
 [<ffffffff802ef1eb>] do_fsync+0x37/0x4d
 [<ffffffff802ef228>] sys_fsync+0x10/0x14
 [<ffffffff8020bd1b>] system_call_fastpath+0x16/0x1b
 [<ffffffffffffffff>] 0xffffffffffffffff

It took about 120 seconds for it to recover.

And it's not just sys_fsync(). The script i wrote tests file read
latencies. I have created 1000 files with the same size (all copies
of kernel/sched.c ;-), and tested their cache-cold plain-cat
performance via:

  for ((i=0;i<1000;i++)); do
    printf "file #%4d, plain reading it took: " $i
    /usr/bin/time -f "%e seconds."  cat $i >/dev/null
  done

I.e. plain, supposedly high-prio reads. The result is very common
hiccups in read latencies:

file # 579 (253560 bytes), reading it took: 0.08 seconds.
file # 580 (253560 bytes), reading it took: 0.05 seconds.
file # 581 (253560 bytes), reading it took: 0.01 seconds.
file # 582 (253560 bytes), reading it took: 0.01 seconds.
file # 583 (253560 bytes), reading it took: 4.61 seconds.
file # 584 (253560 bytes), reading it took: 1.29 seconds.
file # 585 (253560 bytes), reading it took: 3.01 seconds.
file # 586 (253560 bytes), reading it took: 7.74 seconds.
file # 587 (253560 bytes), reading it took: 3.22 seconds.
file # 588 (253560 bytes), reading it took: 0.05 seconds.
file # 589 (253560 bytes), reading it took: 0.36 seconds.
file # 590 (253560 bytes), reading it took: 7.39 seconds.
file # 591 (253560 bytes), reading it took: 7.58 seconds.
file # 592 (253560 bytes), reading it took: 7.90 seconds.
file # 593 (253560 bytes), reading it took: 8.78 seconds.
file # 594 (253560 bytes), reading it took: 8.01 seconds.
file # 595 (253560 bytes), reading it took: 7.47 seconds.
file # 596 (253560 bytes), reading it took: 11.52 seconds.
file # 597 (253560 bytes), reading it took: 10.33 seconds.
file # 598 (253560 bytes), reading it took: 8.56 seconds.
file # 599 (253560 bytes), reading it took: 7.58 seconds.

The system's RAM is ridiculously under-utilized, 96.1% is free, only
3.9% is utilized:

              total       used       free     shared    buffers     cached
 Mem:      12318192     476732   11841460          0      48324     142936
 -/+ buffers/cache:     285472   12032720
 Swap:      4096564          0    4096564

Dirty data in /proc/meminfo fluctuates between 0.4% and 1.6% of
total RAM. (the script removes the freshly built kernel object
files, so the workload is pretty steady.)

The peak of 1.6% looks like this:

 Dirty:            118376 kB
 Dirty:            143784 kB
 Dirty:            161756 kB
 Dirty:            185084 kB
 Dirty:            210524 kB
 Dirty:            213348 kB
 Dirty:            200124 kB
 Dirty:            122152 kB
 Dirty:            121508 kB
 Dirty:            121512 kB

(1 second snapshots)

So the problems are all around the place and they are absolutely,
trivially reproducible. And this is how a default ext3 based distro
and the default upstream kernel will present itself to new Linux
users and developers. It's not a pretty experience.

Oh, and while at it - also a job control complaint. I tried to
Ctrl-C the above script:

file # 858 (253560 bytes), reading it took: 0.06 seconds.
file # 859 (253560 bytes), reading it took: 0.02 seconds.
file # 860 (253560 bytes), reading it took: 5.53 seconds.
file # 861 (253560 bytes), reading it took: 3.70 seconds.
file # 862 (253560 bytes), reading it took: 0.88 seconds.
file # 863 (253560 bytes), reading it took: 0.04 seconds.
file # 864 (253560 bytes), reading it took: ^C0.69 seconds.
file # 865 (253560 bytes), reading it took: ^C0.49 seconds.
file # 866 (253560 bytes), reading it took: ^C0.01 seconds.
file # 867 (253560 bytes), reading it took: ^C0.02 seconds.
file # 868 (253560 bytes), reading it took: ^C^C0.01 seconds.
file # 869 (253560 bytes), reading it took: ^C^C0.04 seconds.
file # 870 (253560 bytes), reading it took: ^C^C^C0.03 seconds.
file # 871 (253560 bytes), reading it took: ^C0.02 seconds.
file # 872 (253560 bytes), reading it took: ^C^C0.02 seconds.
file # 873 (253560 bytes), reading it took: ^C^C^C^Caldebaran:~/linux/linux/test-files/src>

I had to hit Ctrl-C numerous times before Bash would honor it. This
too is a very common thing on large SMP systems. I'm willing to test
patches until all these problems are fixed. Any takers?

	Ingo
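
Output in the format above lends itself to a quick summary of how many reads were slow. The following is a minimal sketch, assuming the elapsed time is always the second-to-last field of each line; the function name count_slow and the one-second threshold are illustrative choices, not part of the original test.

```sh
#!/bin/sh
# count_slow THRESHOLD: read timing lines on stdin and print how many reads
# took longer than THRESHOLD seconds. The elapsed time is expected to be
# the second-to-last field, as in "... reading it took: 4.61 seconds."
count_slow() {
    awk -v limit="$1" '
        NF >= 2 && ($(NF - 1) + 0) > (limit + 0) { slow++ }
        END { print slow + 0 }
    '
}

# Example input copied from the measurements above.
count_slow 1 <<'EOF'
file # 582 (253560 bytes), reading it took: 0.01 seconds.
file # 583 (253560 bytes), reading it took: 4.61 seconds.
file # 584 (253560 bytes), reading it took: 1.29 seconds.
file # 588 (253560 bytes), reading it took: 0.05 seconds.
EOF
```

Piping the full output of the read test into count_slow gives a single number that can be compared between kernels or mount options.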
* Re: ext3 IO latency measurements (was: Linux 2.6.29)
2009-03-26 9:06 ` ext3 IO latency measurements (was: Linux 2.6.29) Ingo Molnar
@ 2009-03-26 11:37 ` Theodore Tso
2009-03-26 14:03 ` Ingo Molnar
0 siblings, 1 reply; 6+ messages in thread
From: Theodore Tso @ 2009-03-26 11:37 UTC
To: Ingo Molnar
Cc: Jan Kara, Linus Torvalds, Andrew Morton, Alan Cox, Arjan van de Ven,
Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
Roland McGrath

Ingo,

Interesting. I wonder if the problem is the journal is cycling fast
enough that it is checkpointing all the time. If so, it could be that
a bigger-sized journal might help. Can you try this as an experiment?
Mount the filesystem using ext4, with the mount option nodelalloc.
With a filesystem formatted as ext3, and with delayed allocation
disabled, it should behave mostly the same as ext3; try and make sure
you're still seeing the same problems.

Then could you grab /proc/fs/jbd2/<dev>:8/history and
/proc/fs/jbd2/<dev>:8/info while running your test workload?

Also, can you send me the output of "dumpe2fs -h /dev/sdXX | grep Journal"?

> Oh, and while at it - also a job control complaint. I tried to
> Ctrl-C the above script:
>
> I had to hit Ctrl-C numerous times before Bash would honor it. This
> too is a very common thing on large SMP systems.

Well, the script you sent runs the compile in the background. It did:

>   while :; do
>     date
>     make mrproper 2>/dev/null >/dev/null
>     make defconfig 2>/dev/null >/dev/null
>     make -j32 bzImage 2>/dev/null >/dev/null
>   done &
         ^^

So there would have been nothing to ^C; I assume you were running this
with a variant that didn't have the ampersand, which would have run
the whole shell pipeline in a detached background process?

In any case, the workaround for this is to ^Z the script, and then
"kill %" it.

I'm pretty sure this is actually a bash problem. When you send a
Ctrl-C, it sends a SIGINT to all of the members of the tty's
foreground process group. Under some circumstances, bash sets the
signal handler for SIGINT to be SIG_IGN. I haven't looked at this
super closely (it would require diving into the bash sources), but you
can see it if you attach an strace to the bash shell driving a script
such as

#!/bin/bash

while /bin/true; do
    date
    sleep 60
done &

If you do a "ps axo pid,ppid,pgrp,args", you'll see that the bash and
the sleep 60 have the same process group. If you emulate hitting ^C
by sending a SIGINT to pid of the shell, you'll see that it ignores
it. Sleep also seems to be ignoring the SIGINT when run in the
background; but it does honor SIGINT in the foreground --- I didn't
have time to dig into that.

In any case, bash appears to SIG_IGN the INT signal if there is a child
process running, and only takes the ^C if bash itself is actually
"running" the shell script. For example, if you run the command
"date;sleep 10;date;sleep 10;date", the ^C only interrupts the sleep
command. It doesn't stop the series of commands which bash is
running.

					- Ted
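
One part of the behaviour described above can be checked directly: a background command started from a non-interactive shell ignores SIGINT. POSIX specifies that, when job control is disabled, commands in an asynchronous list inherit SIGINT and SIGQUIT set to ignored, which is a likely explanation for the background sleep case. The sketch below assumes a POSIX sh run as a script (no job control); the function name and the one-second delays are illustrative.

```sh
#!/bin/sh
# In a non-interactive shell (job control off), a command started with "&"
# inherits SIGINT set to "ignored". This function checks that by sending
# SIGINT and then SIGTERM to a background sleep and reporting which signal
# actually ended it.
background_sigint_demo() {
    sleep 30 >/dev/null 2>&1 &
    pid=$!

    sleep 1                              # give the child time to start
    kill -INT "$pid" 2>/dev/null || true # what ^C would deliver
    sleep 1
    kill -TERM "$pid" 2>/dev/null || true

    status=0
    wait "$pid" || status=$?
    # 128 + signal number: 130 means SIGINT ended it, 143 means SIGTERM did.
    if [ "$status" -eq 143 ]; then
        echo "SIGINT ignored"
    else
        echo "SIGINT honored (exit status $status)"
    fi
}

background_sigint_demo
```

This only covers the background case. It says nothing about how bash itself treats ^C while it waits for a foreground child, which is the other half of the report.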
* Re: ext3 IO latency measurements (was: Linux 2.6.29)
2009-03-26 11:37 ` Theodore Tso
@ 2009-03-26 14:03 ` Ingo Molnar
2009-03-26 14:47 ` Theodore Tso
0 siblings, 1 reply; 6+ messages in thread
From: Ingo Molnar @ 2009-03-26 14:03 UTC
To: Theodore Tso, Jan Kara, Linus Torvalds, Andrew Morton, Alan Cox,
Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe,
David Rees, Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
Roland McGrath

* Theodore Tso <tytso@mit.edu> wrote:

> Ingo,
>
> Interesting. I wonder if the problem is the journal is cycling
> fast enough that it is checkpointing all the time. If so, it
> could be that a bigger-sized journal might help. Can you try this
> as an experiment? Mount the filesystem using ext4, with the mount
> option nodelalloc. With a filesystem formatted as ext3, and with
> delayed allocation disabled, it should behave mostly the same as
> ext3; try and make sure you're still seeing the same problems.
>
> Then could you grab /proc/fs/jbd2/<dev>:8/history and
> /proc/fs/jbd2/<dev>:8/info while running your test workload?

i tried it:

  /dev/sda2 on /home type ext4 (rw,nodelalloc)

I still see similarly bad latencies in Vim:

 aldebaran:~> cat /proc/10227/stack
 [<ffffffff80370cad>] jbd2_log_wait_commit+0xbd/0x110
 [<ffffffff8036bc70>] jbd2_journal_stop+0x1f3/0x221
 [<ffffffff8036ccb0>] jbd2_journal_force_commit+0x28/0x2c
 [<ffffffff80352660>] ext4_force_commit+0x2e/0x34
 [<ffffffff80346682>] ext4_write_inode+0x3e/0x44
 [<ffffffff802eb941>] __sync_single_inode+0xc1/0x2ad
 [<ffffffff802ebc7a>] __writeback_single_inode+0x14d/0x15a
 [<ffffffff802ebcb0>] sync_inode+0x29/0x34
 [<ffffffff80343e16>] ext4_sync_file+0xf6/0x138
 [<ffffffff802eef21>] vfs_fsync+0x78/0xaf
 [<ffffffff802eef8f>] do_fsync+0x37/0x4d
 [<ffffffff802eefcc>] sys_fsync+0x10/0x14
 [<ffffffff8020bd1b>] system_call_fastpath+0x16/0x1b
 [<ffffffffffffffff>] 0xffffffffffffffff

Vim is still almost unusable during this workload - even if i dont
write out the source file just use it interactively to edit it.

The read-test is somewhat better. There are occasional blips of 4-5
seconds:

file # 928 (253560 bytes), reading it took: 0.76 seconds.
file # 929 (253560 bytes), reading it took: 3.98 seconds.
file # 930 (253560 bytes), reading it took: 3.45 seconds.
file # 931 (253560 bytes), reading it took: 0.04 seconds.

I have also written a 'vim open' test which does vim -c q, i.e. it
just opens a source file and closes it without writing the file.

That too takes a lot of time:

file # 0 (253560 bytes), Vim-opening it took: 2.04 seconds.
file # 1 (253560 bytes), Vim-opening it took: 2.39 seconds.
file # 2 (253560 bytes), Vim-opening it took: 2.03 seconds.
file # 3 (253560 bytes), Vim-opening it took: 2.81 seconds.
file # 4 (253560 bytes), Vim-opening it took: 2.11 seconds.
file # 5 (253560 bytes), Vim-opening it took: 2.44 seconds.
file # 6 (253560 bytes), Vim-opening it took: 2.04 seconds.
file # 7 (253560 bytes), Vim-opening it took: 3.59 seconds.
file # 8 (253560 bytes), Vim-opening it took: 2.06 seconds.
file # 9 (253560 bytes), Vim-opening it took: 3.26 seconds.
file # 10 (253560 bytes), Vim-opening it took: 2.04 seconds.
file # 11 (253560 bytes), Vim-opening it took: 2.38 seconds.
file # 12 (253560 bytes), Vim-opening it took: 2.04 seconds.
file # 13 (253560 bytes), Vim-opening it took: 3.05 seconds.

Here's a few snapshots of Vim waiting spots:

 aldebaran:~> cat /proc/$(ps aux | grep -m 1 'vim -c' | cut -d' ' -f5)/stack
 [<ffffffff8036c1ae>] do_get_write_access+0x22b/0x452
 [<ffffffff8036c3fc>] jbd2_journal_get_write_access+0x27/0x38
 [<ffffffff8035aa8c>] __ext4_journal_get_write_access+0x51/0x59
 [<ffffffff80346f30>] ext4_reserve_inode_write+0x3d/0x79
 [<ffffffff80346f9f>] ext4_mark_inode_dirty+0x33/0x187
 [<ffffffff8034724e>] ext4_dirty_inode+0x6a/0x9f
 [<ffffffff802ec4eb>] __mark_inode_dirty+0x38/0x199
 [<ffffffff802e27f5>] touch_atime+0xf6/0x101
 [<ffffffff8029cc83>] do_generic_file_read+0x37c/0x3c7
 [<ffffffff8029d770>] generic_file_aio_read+0x15b/0x197
 [<ffffffff802d0816>] do_sync_read+0xec/0x132
 [<ffffffff802d11de>] vfs_read+0xb0/0x139
 [<ffffffff802d1335>] sys_read+0x4c/0x74
 [<ffffffff8020bd1b>] system_call_fastpath+0x16/0x1b
 [<ffffffffffffffff>] 0xffffffffffffffff

 aldebaran:~> cat /proc/$(ps aux | grep -m 1 'vim -c' | cut -d' ' -f5)/stack
 [<ffffffff8029c0ed>] sync_page+0x41/0x45
 [<ffffffff8029c274>] wait_on_page_bit+0x73/0x7a
 [<ffffffff802a5a76>] truncate_inode_pages_range+0x2f6/0x37b
 [<ffffffff802a5b0d>] truncate_inode_pages+0x12/0x15
 [<ffffffff8034b97b>] ext4_delete_inode+0x6a/0x25f
 [<ffffffff802e378e>] generic_delete_inode+0xe7/0x174
 [<ffffffff802e382f>] generic_drop_inode+0x14/0x1d
 [<ffffffff802e2866>] iput+0x66/0x6a
 [<ffffffff802db889>] do_unlinkat+0x107/0x15d
 [<ffffffff802db8f5>] sys_unlink+0x16/0x18
 [<ffffffff8020bd1b>] system_call_fastpath+0x16/0x1b
 [<ffffffffffffffff>] 0xffffffffffffffff

 aldebaran:~> cat /proc/$(ps aux | grep -m 1 'vim -c' | cut -d' ' -f5)/stack
 [<ffffffff8036c1ae>] do_get_write_access+0x22b/0x452
 [<ffffffff8036c3fc>] jbd2_journal_get_write_access+0x27/0x38
 [<ffffffff8035aa8c>] __ext4_journal_get_write_access+0x51/0x59
 [<ffffffff80346f30>] ext4_reserve_inode_write+0x3d/0x79
 [<ffffffff80346f9f>] ext4_mark_inode_dirty+0x33/0x187
 [<ffffffff8034724e>] ext4_dirty_inode+0x6a/0x9f
 [<ffffffff802ec4eb>] __mark_inode_dirty+0x38/0x199
 [<ffffffff802e27f5>] touch_atime+0xf6/0x101
 [<ffffffff8029cc83>] do_generic_file_read+0x37c/0x3c7
 [<ffffffff8029d770>] generic_file_aio_read+0x15b/0x197
 [<ffffffff802d0816>] do_sync_read+0xec/0x132
 [<ffffffff802d11de>] vfs_read+0xb0/0x139
 [<ffffffff802d1335>] sys_read+0x4c/0x74
 [<ffffffff8020bd1b>] system_call_fastpath+0x16/0x1b
 [<ffffffffffffffff>] 0xffffffffffffffff

 aldebaran:~> cat /proc/$(ps aux | grep -m 1 'vim -c' | cut -d' ' -f5)/stack
 [<ffffffff8036c1ae>] do_get_write_access+0x22b/0x452
 [<ffffffff8036c3fc>] jbd2_journal_get_write_access+0x27/0x38
 [<ffffffff8035aa8c>] __ext4_journal_get_write_access+0x51/0x59
 [<ffffffff80346f30>] ext4_reserve_inode_write+0x3d/0x79
 [<ffffffff80346f9f>] ext4_mark_inode_dirty+0x33/0x187
 [<ffffffff8034724e>] ext4_dirty_inode+0x6a/0x9f
 [<ffffffff802ec4eb>] __mark_inode_dirty+0x38/0x199
 [<ffffffff802e27f5>] touch_atime+0xf6/0x101
 [<ffffffff8029cc83>] do_generic_file_read+0x37c/0x3c7
 [<ffffffff8029d770>] generic_file_aio_read+0x15b/0x197
 [<ffffffff802d0816>] do_sync_read+0xec/0x132
 [<ffffffff802d11de>] vfs_read+0xb0/0x139
 [<ffffffff802d1335>] sys_read+0x4c/0x74
 [<ffffffff8020bd1b>] system_call_fastpath+0x16/0x1b
 [<ffffffffffffffff>] 0xffffffffffffffff

 aldebaran:~> cat /proc/$(ps aux | grep -m 1 'vim -c' | cut -d' ' -f5)/stack
 [<ffffffff8036c1ae>] do_get_write_access+0x22b/0x452
 [<ffffffff8036c3fc>] jbd2_journal_get_write_access+0x27/0x38
 [<ffffffff8035aa8c>] __ext4_journal_get_write_access+0x51/0x59
 [<ffffffff80346f30>] ext4_reserve_inode_write+0x3d/0x79
 [<ffffffff80346f9f>] ext4_mark_inode_dirty+0x33/0x187
 [<ffffffff8034724e>] ext4_dirty_inode+0x6a/0x9f
 [<ffffffff802ec4eb>] __mark_inode_dirty+0x38/0x199
 [<ffffffff802e27f5>] touch_atime+0xf6/0x101
 [<ffffffff8029cc83>] do_generic_file_read+0x37c/0x3c7
 [<ffffffff8029d770>] generic_file_aio_read+0x15b/0x197
 [<ffffffff802d0816>] do_sync_read+0xec/0x132
 [<ffffffff802d11de>] vfs_read+0xb0/0x139
 [<ffffffff802d1335>] sys_read+0x4c/0x74
 [<ffffffff8020bd1b>] system_call_fastpath+0x16/0x1b
 [<ffffffffffffffff>] 0xffffffffffffffff

That's in good part atime update latencies. We still appear to
default to atime enabled in ext4.

That's stupid - only around 0.01% of all Linux systems relies on
atime - and even those who rely on it would be well served by
relatime. Why arent the relatime patches upstream? Why isnt it the
default? They have been submitted several times.

Atime in its current mandatory do-a-write-for-every-read form is a
stupid relic and we have been paying the fool's tax for it in the
past 10 years.

	Ingo
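
To see which atime behaviour a given mount option string asks for, a small classifier can help. This is a sketch under the assumption discussed in this thread, namely that a mount with neither noatime nor relatime gets full atime updates; the function name atime_mode is hypothetical.

```sh
#!/bin/sh
# atime_mode OPTIONS: classify a comma-separated mount option string.
# Prints "noatime", "relatime", or "atime" when neither option is present
# (at the time of this thread, that meant full atime updates).
atime_mode() {
    case ",$1," in
        *,noatime,*)  echo "noatime" ;;
        *,relatime,*) echo "relatime" ;;
        *)            echo "atime" ;;
    esac
}

atime_mode "rw,nodelalloc"   # the /home mount shown above
atime_mode "rw,relatime"
```

The option strings could be taken from the fourth field of /etc/fstab or /proc/mounts. The sketch deliberately leaves that parsing out.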
* Re: ext3 IO latency measurements (was: Linux 2.6.29)
2009-03-26 14:03 ` Ingo Molnar
@ 2009-03-26 14:47 ` Theodore Tso
2009-03-26 16:20 ` Linus Torvalds
0 siblings, 1 reply; 6+ messages in thread
From: Theodore Tso @ 2009-03-26 14:47 UTC
To: Ingo Molnar
Cc: Jan Kara, Linus Torvalds, Andrew Morton, Alan Cox, Arjan van de Ven,
Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
Roland McGrath

On Thu, Mar 26, 2009 at 03:03:12PM +0100, Ingo Molnar wrote:
> That's in good part atime update latencies. We still appear to
> default to atime enabled in ext4.
>
> That's stupid - only around 0.01% of all Linux systems relies on
> atime - and even those who rely on it would be well served by
> relatime. Why arent the relatime patches upstream? Why isnt it the
> default? They have been submitted several times.

The relatime patches are upstream. Both noatime and relatime are
handled at the VFS layer, not at the per-filesystem level. The reason
why it isn't the default is because of a desire for POSIX compliance,
I suspect. Most distributions are putting relatime into /etc/fstab by
default, but we haven't changed the mount option.

It wouldn't be hard to add an "atime" option to turn on atime updates,
and make either "noatime" or "relatime" the default. This is a simple
patch to fs/namespace.c.

> Atime in its current mandatory do-a-write-for-every-read form is a
> stupid relic and we have been paying the fool's tax for it in the
> past 10 years.

No argument here. I use noatime, myself. It actually saves a lot
more than relatime, and unless you are using mutt with local Maildir
delivery, relatime isn't really that helpful, and the benefit of
noatime is roughly double that of relatime vs normal atime update, in
my measurements:

http://thunk.org/tytso/blog/2009/03/01/ssds-journaling-and-noatimerelatime/

					- Ted
* Re: ext3 IO latency measurements (was: Linux 2.6.29)
2009-03-26 14:47 ` Theodore Tso
@ 2009-03-26 16:20 ` Linus Torvalds
2009-03-26 17:07 ` Theodore Tso
0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2009-03-26 16:20 UTC
To: Theodore Tso
Cc: Ingo Molnar, Jan Kara, Andrew Morton, Alan Cox, Arjan van de Ven,
Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov,
Roland McGrath

On Thu, 26 Mar 2009, Theodore Tso wrote:
>
> Most distributions are putting relatime into /etc/fstab by
> default, but we haven't changed the mount option.

I don't think this is true. Fedora certainly does not. Not in F10, not in
F11.

And quite frankly, even if you then _manually_ put 'relatime' in
/etc/fstab, the default Fedora install will totally ignore it. Why?
Because it mounts the root partition while using initrd, and totally
ignores /etc/fstab.

In other words, not only do distributions not do it, but you can't even do
it by hand afterwards the sane way in the most common distro!

There really is reason for the kernel to just say "user space has sh*t for
brains, and we'd better change the default - and if some distro really
_thinks_ about it, and decides that they really want old-fashioned atime,
let them do that".

Because right now, I do not believe for a moment that any distro that
defaults to "atime" has spent lots of effort thinking about it. Quite the
reverse. They probably default to "atime" because they spent no time AT
ALL thinking about it.

> It wouldn't be hard to add an "atime" option to turn on atime updates,
> and make either "noatime" or "relatime" the default. This is a simple
> patch to fs/namespace.c

Yes. I think we have to.

> No argument here. I use noatime, myself. It actually saves a lot
> more than relatime, and unless you are using mutt with local Maildir
> delivery, relatime isn't really that helpful, and the benefit of
> noatime is roughly double that of relatime vs normal atime update, in
> my measurements:
>
> http://thunk.org/tytso/blog/2009/03/01/ssds-journaling-and-noatimerelatime/

I do agree that "noatime" is better, but with "relatime" you at least are
likely to not break anything. A program has to be _really_ odd to care
about the "relatime" vs "atime" behavior.

		Linus
* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 16:20 ` Linus Torvalds
@ 2009-03-26 17:07   ` Theodore Tso
  2009-03-26 17:16     ` Linus Torvalds
  0 siblings, 1 reply; 6+ messages in thread
From: Theodore Tso @ 2009-03-26 17:07 UTC
To: Linus Torvalds
Cc: Ingo Molnar, Jan Kara, Andrew Morton, Alan Cox, Arjan van de Ven,
    Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh,
    Linux Kernel Mailing List, Oleg Nesterov, Roland McGrath

On Thu, Mar 26, 2009 at 09:20:14AM -0700, Linus Torvalds wrote:
>
>
> On Thu, 26 Mar 2009, Theodore Tso wrote:
> >
> > Most distributions are putting relatime into /etc/fstab by
> > default, but we haven't changed the mount option.
>
> I don't think this is true. Fedora certainly does not. Not in F10, not in
> F11.

Ubuntu does. I thought Fedora had, but I stand corrected.

> And quite frankly, even if you then _manually_ put 'relatime' in
> /etc/fstab, the default Fedora install will totally ignore it. Why?
> Because it mounts the root partition while using initrd, and totally
> ignores /etc/fstab.

You can, actually, but it requires hacking /boot/grub/menu.lst. The
boot command option "rootflags=noatime" should do it, if their initrd
scripts are at all sane (and they honor rootfstype, so they probably
do also honor rootflags).

The question is whether we can make Fedora 11 and OpenSUSE do the
right thing now that this has become a highly visible discussion. I'm
actually fairly optimistic on this front. (Maybe some distro folks
will care to chime in on whether upcoming releases of F11 and OpenSUSE
can be changed to DTRT?)

Actually, given where F11 is on its release schedule, I suspect it
would be *easier* for them to make a change to default boot options in
grub's menu.lst than it would be to backport a kernel patch, since they
will be releasing their beta release within the week, and their final
development freeze is in less than two weeks.

						- Ted
* Re: ext3 IO latency measurements (was: Linux 2.6.29)
  2009-03-26 17:07 ` Theodore Tso
@ 2009-03-26 17:16   ` Linus Torvalds
  2009-03-26 17:49     ` [PATCH 1/2] Add a strictatime mount option Matthew Garrett
  0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2009-03-26 17:16 UTC
To: Theodore Tso
Cc: Ingo Molnar, Jan Kara, Andrew Morton, Alan Cox, Arjan van de Ven,
    Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh,
    Linux Kernel Mailing List, Oleg Nesterov, Roland McGrath

On Thu, 26 Mar 2009, Theodore Tso wrote:
>
> You can, actually, but it requires hacking /boot/grub/menu.lst. The
> boot command option "rootflags=noatime" should do it, if their initrd
> scripts are at all sane (and they honor rootfstype, so they probably
> do also honor rootflags).

Not when I tried it. It just causes the initrd to be mounted noatime, and
then the real root filesystem gets mounted atime again.

Maybe I screwed up. But I don't think so.

> The question is whether we can make Fedora 11 and OpenSUSE do the
> right thing now that this has become a highly visible discussion. I'm
> actually fairly optimistic on this front. (Maybe some distro folks
> will care to chime in on whether upcoming releases of F11 and OpenSUSE
> can be changed to DTRT?)

And what's the argument for not doing it in the kernel?

The fact is, "atime" by default is just wrong.

		Linus
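One way to check which atime mode the real root filesystem ended up with after boot is to look at the options field of its entry in /proc/mounts. The helper below is only a sketch of that check: it assumes the kernel lists "noatime" or "relatime" in that field when the corresponding mode is active and lists neither for full atime updates, and the function names `has_opt` and `atime_mode_from_opts` are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if the comma-separated list opts contains exactly name. */
static int has_opt(const char *opts, const char *name)
{
	size_t n = strlen(name);
	const char *p = opts;

	while (*p) {
		const char *end = strchr(p, ',');
		size_t len = end ? (size_t)(end - p) : strlen(p);

		/* Compare whole tokens so "nodiratime" never matches "noatime". */
		if (len == n && strncmp(p, name, n) == 0)
			return 1;
		if (!end)
			break;
		p = end + 1;
	}
	return 0;
}

/* Classify a mount option string such as "rw,relatime,errors=remount-ro". */
const char *atime_mode_from_opts(const char *opts)
{
	if (has_opt(opts, "noatime"))
		return "noatime";
	if (has_opt(opts, "relatime"))
		return "relatime";
	return "atime";	/* assumed: no atime option listed means full atime */
}
```

Feeding it the fourth field of the line for the real root device would show whether a boot option such as rootflags=noatime actually took effect on that filesystem rather than only on the initrd.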
* [PATCH 1/2] Add a strictatime mount option
  2009-03-26 17:16 ` Linus Torvalds
@ 2009-03-26 17:49   ` Matthew Garrett
  2009-03-26 18:52     ` Alan Cox
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Garrett @ 2009-03-26 17:49 UTC
To: Linus Torvalds
Cc: Theodore Tso, Ingo Molnar, Jan Kara, Andrew Morton, Alan Cox,
    Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
    Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov, Roland McGrath

Add support for explicitly requesting full atime updates. This makes it
possible for kernels to default to relatime but still allow userspace to
override it.

Signed-off-by: Matthew Garrett <mjg@redhat.com>
---
 fs/namespace.c        |    6 +++++-
 include/linux/fs.h    |    1 +
 include/linux/mount.h |    1 +
 3 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 06f8e63..d0659ec 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -780,6 +780,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 		{ MNT_NOATIME, ",noatime" },
 		{ MNT_NODIRATIME, ",nodiratime" },
 		{ MNT_RELATIME, ",relatime" },
+		{ MNT_STRICTATIME, ",strictatime" },
 		{ 0, NULL }
 	};
 	const struct proc_fs_info *fs_infop;
@@ -1932,11 +1933,14 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
 		mnt_flags |= MNT_NODIRATIME;
 	if (flags & MS_RELATIME)
 		mnt_flags |= MNT_RELATIME;
+	if (flags & MS_STRICTATIME)
+		mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
 	if (flags & MS_RDONLY)
 		mnt_flags |= MNT_READONLY;
 
 	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
-		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT);
+		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
+		   MS_STRICTATIME);
 
 	/* ... and get the mountpoint */
 	retval = kern_path(dir_name, LOOKUP_FOLLOW, &path);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 92734c0..5bc81c4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -141,6 +141,7 @@ struct inodes_stat_t {
 #define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
 #define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
 #define MS_I_VERSION (1<<23) /* Update inode I_version field */
+#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
 #define MS_ACTIVE (1<<30)
 #define MS_NOUSER (1<<31)
 
diff --git a/include/linux/mount.h b/include/linux/mount.h
index cab2a85..51f55f9 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -27,6 +27,7 @@ struct mnt_namespace;
 #define MNT_NODIRATIME 0x10
 #define MNT_RELATIME 0x20
 #define MNT_READONLY 0x40 /* does the user want this to be r/o? */
+#define MNT_STRICTATIME 0x80
 
 #define MNT_SHRINKABLE 0x100
 #define MNT_IMBALANCED_WRITE_COUNT 0x200 /* just for debugging */

-- 
Matthew Garrett | mjg59@srcf.ucam.org
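The do_mount() hunk above can be read as a small pure function from mount(2) flags to per-mount flags. The following user-space sketch mirrors only the atime-related tests from that hunk, in the same order. `MS_RELATIME` and `MS_STRICTATIME` are copied from the patch; the remaining constant values are assumed from the 2.6.29-era headers, and the function name and its `default_mnt_flags` parameter are hypothetical. The parameter stands in for whatever default a kernel might pre-set before these tests run, which this patch on its own does not do.

```c
/* mount(2) flags: MS_RELATIME and MS_STRICTATIME as in the patch,
 * the other two values assumed from include/linux/fs.h of that era. */
#define MS_NOATIME	1024
#define MS_NODIRATIME	2048
#define MS_RELATIME	(1 << 21)
#define MS_STRICTATIME	(1 << 24)

/* Per-mount flags: MNT_NODIRATIME and MNT_RELATIME as in the patch context,
 * MNT_NOATIME assumed from include/linux/mount.h of that era. */
#define MNT_NOATIME	0x08
#define MNT_NODIRATIME	0x10
#define MNT_RELATIME	0x20

/*
 * Hypothetical helper: apply the atime-related tests from do_mount()
 * to a starting set of per-mount flags and return the result.
 */
int translate_atime_flags(unsigned long flags, int default_mnt_flags)
{
	int mnt_flags = default_mnt_flags;

	if (flags & MS_NOATIME)
		mnt_flags |= MNT_NOATIME;
	if (flags & MS_NODIRATIME)
		mnt_flags |= MNT_NODIRATIME;
	if (flags & MS_RELATIME)
		mnt_flags |= MNT_RELATIME;
	/* Tested last, so strictatime wins over noatime and relatime. */
	if (flags & MS_STRICTATIME)
		mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);

	return mnt_flags;
}
```

With a default of 0 the strictatime test changes nothing unless noatime or relatime was passed in the same call; with a default of `MNT_RELATIME` it is what lets userspace get full atime updates back, which matches the stated purpose of the patch.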
* Re: [PATCH 1/2] Add a strictatime mount option
  2009-03-26 17:49 ` [PATCH 1/2] Add a strictatime mount option Matthew Garrett
@ 2009-03-26 18:52   ` Alan Cox
  0 siblings, 0 replies; 6+ messages in thread
From: Alan Cox @ 2009-03-26 18:52 UTC
To: Matthew Garrett
Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Jan Kara, Andrew Morton,
    Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees,
    Jesper Krogh, Linux Kernel Mailing List, Oleg Nesterov, Roland McGrath

On Thu, 26 Mar 2009 17:49:56 +0000
Matthew Garrett <mjg@redhat.com> wrote:

> Add support for explicitly requesting full atime updates. This makes it
> possible for kernels to default to relatime but still allow userspace to
> override it.
>
> Signed-off-by: Matthew Garrett <mjg@redhat.com>

NAK this is unnecessary complication from a broken ABI change that isn't
safe to make anyway.