* Lazytime feature bugs
@ 2015-02-18 13:19 Jan Kara
  2015-02-24 18:31 ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2015-02-18 13:19 UTC (permalink / raw)
  To: tytso; +Cc: linux-fsdevel, Alexander Viro, Linus Torvalds

  Hello,

  I had a look at what got merged as a lazytime series from Ted and I have
found a couple of issues (I think I've pointed them out in my emails back
in November but admittedly things were somewhat confused back then since
Ted was submitting new versions pretty fast). Anyway here are the issues
I've found in the merged code:

1) An inode that gets periodically dirtied with I_DIRTY_PAGES, cleaned, and
   dirtied again will never have its updated timestamps written out due to
   age, since inode->dirtied_when gets reset on each redirtying with
   I_DIRTY_PAGES.
2) The code won't maintain the time ordering of the b_dirty_time list by
   inode->dirtied_when - this happens because requeue_inode() moves the
   inode to the head of the b_dirty_time list, but the inodes in the b_io
   list from which we move are no longer ordered by dirtied_when (due to
   that list being combined from several lists and also because we sort the
   list by superblock). As a result the terminating logic in
   move_expired_inodes() may end the scan of the b_dirty_time list too
   early.
3) This is mostly cosmetic currently but is a potential landmine for the
   future: if you dirty an inode with I_DIRTY_PAGES | I_DIRTY_TIME, it gets
   filed on the b_dirty_time list instead of the b_dirty list.
4) Another mostly cosmetic issue: move_expired_inodes() should use
   work->for_sync instead of work->reason == WB_REASON_SYNC (work->reason is
   there only for tracing purposes).
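
To illustrate 1): the redirtying path in __mark_inode_dirty() restarts the
age clock no matter why the inode got dirty. Roughly (a paraphrased sketch
of the current fs/fs-writeback.c, not a verbatim quote):

	if (!was_dirty) {
		/* the age-based expiry clock restarts here, even when the
		 * new dirty reason is just I_DIRTY_PAGES */
		inode->dirtied_when = jiffies;
		list_move(&inode->i_wb_list, dirtytime ?
			  &bdi->wb.b_dirty_time : &bdi->wb.b_dirty);
	}

So an inode bouncing between clean and I_DIRTY_PAGES never looks old enough
for its timestamp-only dirtiness to expire.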

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: Lazytime feature bugs
  2015-02-18 13:19 Lazytime feature bugs Jan Kara
@ 2015-02-24 18:31 ` Jan Kara
  2015-02-24 18:58   ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2015-02-24 18:31 UTC (permalink / raw)
  To: tytso; +Cc: linux-fsdevel, Alexander Viro, Linus Torvalds

  Hello,

  any reaction here?

								Honza

On Wed 18-02-15 14:19:31, Jan Kara wrote:
>   I had a look at what got merged as a lazytime series from Ted and I have
> found a couple of issues (I think I've pointed them out in my emails back
> in November but admittedly things were somewhat confused back then since
> Ted was submitting new versions pretty fast). Anyway here are the issues
> I've found in the merged code:
> 
> 1) An inode that gets periodically dirtied with I_DIRTY_PAGES, cleaned, and
>    dirtied again will never have its updated timestamps written out due to
>    age, since inode->dirtied_when gets reset on each redirtying with
>    I_DIRTY_PAGES.
> 2) The code won't maintain the time ordering of the b_dirty_time list by
>    inode->dirtied_when - this happens because requeue_inode() moves the
>    inode to the head of the b_dirty_time list, but the inodes in the b_io
>    list from which we move are no longer ordered by dirtied_when (due to
>    that list being combined from several lists and also because we sort the
>    list by superblock). As a result the terminating logic in
>    move_expired_inodes() may end the scan of the b_dirty_time list too
>    early.
> 3) This is mostly cosmetic currently but is a potential landmine for the
>    future: if you dirty an inode with I_DIRTY_PAGES | I_DIRTY_TIME, it gets
>    filed on the b_dirty_time list instead of the b_dirty list.
> 4) Another mostly cosmetic issue: move_expired_inodes() should use
>    work->for_sync instead of work->reason == WB_REASON_SYNC (work->reason is
>    there only for tracing purposes).
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: Lazytime feature bugs
  2015-02-24 18:31 ` Jan Kara
@ 2015-02-24 18:58   ` Linus Torvalds
  2015-02-25 14:52     ` Theodore Ts'o
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2015-02-24 18:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: Theodore Ts'o, linux-fsdevel, Alexander Viro

On Tue, Feb 24, 2015 at 10:31 AM, Jan Kara <jack@suse.cz> wrote:
>
>   any reaction here?

Ted? Al? Should we just revert or at least disable the 'lazytime' option?

                      Linus


* Re: Lazytime feature bugs
  2015-02-24 18:58   ` Linus Torvalds
@ 2015-02-25 14:52     ` Theodore Ts'o
  2015-02-25 16:25       ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Theodore Ts'o @ 2015-02-25 14:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jan Kara, linux-fsdevel, Alexander Viro

On Tue, Feb 24, 2015 at 10:58:33AM -0800, Linus Torvalds wrote:
> On Tue, Feb 24, 2015 at 10:31 AM, Jan Kara <jack@suse.cz> wrote:
> >
> >   any reaction here?
> 
> Ted? Al? Should we just revert or at least disable the 'lazytime' option?

Sorry, I've been recovering from conference travel.  Lazytime is not
enabled by default, and except for ext4 it can't be enabled without
changes to the mount command.  I'll take a look at it in the next
couple of days, and in the worst case disabling it in fs/ext4/super.c
is relatively simple.

							- Ted


* Re: Lazytime feature bugs
  2015-02-25 14:52     ` Theodore Ts'o
@ 2015-02-25 16:25       ` Jan Kara
  2015-02-26  4:33         ` Theodore Ts'o
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2015-02-25 16:25 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Linus Torvalds, Jan Kara, linux-fsdevel, Alexander Viro

On Wed 25-02-15 09:52:58, Ted Tso wrote:
> On Tue, Feb 24, 2015 at 10:58:33AM -0800, Linus Torvalds wrote:
> > On Tue, Feb 24, 2015 at 10:31 AM, Jan Kara <jack@suse.cz> wrote:
> > >
> > >   any reaction here?
> > 
> > Ted? Al? Should we just revert or at least disable the 'lazytime' option?
> 
> Sorry, I've been recovering from conference travel.  Lazytime is not
> enabled by default, and except for ext4 it can't be enabled without
> changes to the mount command.  I'll take a look at it in the next
> couple of days, and in the worst case disabling it in fs/ext4/super.c
> is relatively simple.
  Yeah, that sounds reasonable. I've been thinking about how to fix those
time ordering issues and sadly it isn't trivial. We'll likely need a
timestamp in the inode (although a coarse one is enough) remembering when
the inode was last written (i_dirtied_when needs to remember when the dirty
*data* were created, so we cannot use it as I originally thought). And
we'll need to sort inodes back into the list of inodes with dirty
timestamps. It could be done in O(length of list + number of inodes added)
if we are careful, but it will be non-trivial.

Alternatively we could decide that once per X hours we just walk the whole
list of inodes with unwritten timestamps and write all inodes there.
That's easy to implement but the downside is that we may generate peaks in
inode writeback instead of slowly trickling inodes with dirty timestamps to
disk.
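
Roughly, the periodic variant could be as simple as a self-rearming delayed
work item that kicks the flusher for every bdi with a non-empty b_dirty_time
list (a minimal sketch; dirtytime_walk_work and the X hour interval are made
up for illustration):

	static void dirtytime_walk(struct work_struct *w)
	{
		struct backing_dev_info *bdi;

		rcu_read_lock();
		list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
			if (!list_empty(&bdi->wb.b_dirty_time))
				bdi_wakeup_thread(bdi);
		rcu_read_unlock();
		schedule_delayed_work(&dirtytime_walk_work, X * 60 * 60 * HZ);
	}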

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: Lazytime feature bugs
  2015-02-25 16:25       ` Jan Kara
@ 2015-02-26  4:33         ` Theodore Ts'o
  2015-02-26  8:34           ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Theodore Ts'o @ 2015-02-26  4:33 UTC (permalink / raw)
  To: Jan Kara; +Cc: Linus Torvalds, linux-fsdevel, Alexander Viro

On Wed, Feb 25, 2015 at 05:25:06PM +0100, Jan Kara wrote:
> >   Yeah, that sounds reasonable. I've been thinking about how to fix those
> > time ordering issues and sadly it isn't trivial. We'll likely need a
> > timestamp in the inode (although a coarse one is enough) remembering when
> > the inode was last written (i_dirtied_when needs to remember when the dirty
> > *data* were created, so we cannot use it as I originally thought). And
> > we'll need to sort inodes back into the list of inodes with dirty
> > timestamps. It could be done in O(length of list + number of inodes added)
> > if we are careful, but it will be non-trivial.

Well, the bottom line with the two major problems you've listed
(ignoring the cosmetic issues) is that some of the inodes with dirty
timestamps might not get written out until the inodes get ejected
from memory or the file system is unmounted.  This isn't exactly a
disaster; it's not going to cause data loss, or cause the system to
become unstable, no?

   1) An inode that gets periodically dirtied with I_DIRTY_PAGES, cleaned,
   and dirtied again will never have its updated timestamps written out due
   to age, since inode->dirtied_when gets reset on each redirtying with
   I_DIRTY_PAGES.

If we maintain an i_dirtied_when and a i_dirtied_time_when field, all
we need to do is to check if i_dirtied_time_when is older than 24
hours while we are processing writeback for inodes on b_io (not just
when we are processing inodes on b_dirty_time), and update the
timestamps if necessary.
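
In code, that check could look something like this (an untested sketch;
i_dirtied_time_when is the proposed new field, and 24 hours is the cutoff
discussed above):

	/* while walking b_io: expire timestamps that went unwritten too long */
	if ((inode->i_state & I_DIRTY_TIME) &&
	    time_after(jiffies, inode->i_dirtied_time_when + 24 * 60 * 60 * HZ)) {
		inode->i_state &= ~I_DIRTY_TIME;
		inode->i_state |= I_DIRTY_SYNC;	/* forces the timestamps out */
	}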

   2) The code won't maintain the time ordering of the b_dirty_time list by
   inode->dirtied_when - this happens because requeue_inode() moves the
   inode to the head of the b_dirty_time list, but the inodes in the b_io
   list from which we move are no longer ordered by dirtied_when (due to
   that list being combined from several lists and also because we sort the
   list by superblock). As a result the terminating logic in
   move_expired_inodes() may end the scan of the b_dirty_time list too early.

To solve this problem, we need to make sure that the inode is inserted
into the list sorted by i_dirtied_time_when (and then
move_expired_inodes can just terminate checking on i_dirtied_time_when
instead of i_dirtied_when when we are scanning the b_dirty_time list).

If we don't want this overhead, we can do the following, at
the cost of a bit less precision about when we write out timestamps:

a) When checking to see if we need to write back timestamps while
processing inodes on b_io, we check not only i_dirty_time_when, but
we also check to see if mtime is older than a day.  If so, we force
out the timestamps.  This means we could potentially push out
timestamps earlier than we should, but in the steady state, the
timestamps will only be updated once a day.

b) When we move an inode from b_io to b_dirty_time, we set
i_dirty_time_when to "now".  Because of (a) we know that mtime will be
at most stale by one day.  If we don't dirty the inode's pages for the
next 24 hours, at that point the timestamps will be written out.
Hence, in the worst case the dirty timestamps might be stale on disk
by a maximum of two days.
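
In requeue_inode() terms, (b) would be roughly (untested sketch):

	else if (inode->i_state & I_DIRTY_TIME) {
		/* (b): restart the staleness clock while the inode parks here */
		inode->i_dirty_time_when = jiffies;
		list_move(&inode->i_wb_list, &wb->b_dirty_time);
	}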

Yes, we're playing loosey-goosey with exactly when the dirtied inodes
will get written out to disk, but the whole point of lazytime is to
relax when timestamps get updated on disk in exchange for better
performance.  This just relaxes the timing a bit more.

But exactly when the timestamps get written back is not something I
view as particularly critical.  Even if we don't fix both of these
issues in 4.0, I don't think it's the end of the world, and I don't
think we need to yank support for the lazytime mount option in ext4.

Jan, what do you think?

     	  					- Ted
					


* Re: Lazytime feature bugs
  2015-02-26  4:33         ` Theodore Ts'o
@ 2015-02-26  8:34           ` Jan Kara
  2015-02-26 13:45             ` Theodore Ts'o
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2015-02-26  8:34 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, Linus Torvalds, linux-fsdevel, Alexander Viro

On Wed 25-02-15 23:33:04, Ted Tso wrote:
> On Wed, Feb 25, 2015 at 05:25:06PM +0100, Jan Kara wrote:
> >   Yeah, that sounds reasonable. I've been thinking about how to fix those
> > time ordering issues and sadly it isn't trivial. We'll likely need a
> > timestamp in the inode (although a coarse one is enough) remembering when
> > the inode was last written (i_dirtied_when needs to remember when the dirty
> > *data* were created, so we cannot use it as I originally thought). And
> > we'll need to sort inodes back into the list of inodes with dirty
> > timestamps. It could be done in O(length of list + number of inodes added)
> > if we are careful, but it will be non-trivial.
> 
> Well, the bottom line with the two major problems you've listed
> (ignoring the cosmetic issues) is that some of the inodes with dirty
> timestamps might not get written out until the inodes get ejected
> from memory or the file system is unmounted.  This isn't exactly a
> disaster; it's not going to cause data loss, or cause the system to
> become unstable, no?
  Yes, it's not too serious, since the problem (timestamps much older than
the data in the file) will be visible only in case of a crash (originally I
thought the problems would be bigger, but later I realized they won't be).
 
>    1) An inode that gets periodically dirtied with I_DIRTY_PAGES, cleaned,
>    and dirtied again will never have its updated timestamps written out due
>    to age, since inode->dirtied_when gets reset on each redirtying with
>    I_DIRTY_PAGES.
> 
> If we maintain an i_dirtied_when and a i_dirtied_time_when field, all
> we need to do is to check if i_dirtied_time_when is older than 24
> hours while we are processing writeback for inodes on b_io (not just
> when we are processing inodes on b_dirty_time), and update the
> timestamps if necessary.
> 
>    2) The code won't maintain the time ordering of the b_dirty_time list by
>    inode->dirtied_when - this happens because requeue_inode() moves the
>    inode to the head of the b_dirty_time list, but the inodes in the b_io
>    list from which we move are no longer ordered by dirtied_when (due to
>    that list being combined from several lists and also because we sort the
>    list by superblock). As a result the terminating logic in
>    move_expired_inodes() may end the scan of the b_dirty_time list too early.
> 
> To solve this problem, we need to make sure that the inode is inserted
> into the list sorted by i_dirtied_time_when (and then
> move_expired_inodes can just terminate checking on i_dirtied_time_when
> instead of i_dirtied_when when we are scanning the b_dirty_time list).
> 
> If we don't want this overhead, we can do the following, at
> the cost of a bit less precision about when we write out timestamps:
> 
> a) When checking to see if we need to write back timestamps while
> processing inodes on b_io, we check not only i_dirty_time_when, but
> we also check to see if mtime is older than a day.  If so, we force
> out the timestamps.  This means we could potentially push out
> timestamps earlier than we should, but in the steady state, the
> timestamps will only be updated once a day.
> 
> b) When we move an inode from b_io to b_dirty_time, we set
> i_dirty_time_when to "now".  Because of (a) we know that mtime will be
> at most stale by one day.  If we don't dirty the inode's pages for the
> next 24 hours, at that point the timestamps will be written out.
> Hence, in the worst case the dirty timestamps might be stale on disk
> by a maximum of two days.
  I don't think you can reset i_dirty_time_when to "now" in b) because in
that case continually dirtied files won't ever have timestamps written (you
are completely losing track of when you last wrote timestamps to disk).
But you could set *i_dirtied_when* to "now" and maintain the b_dirty_time
list ordered by i_dirtied_when the same way as the other writeback lists.
That should work. Actually I don't think there's a need to look at mtime at
all in that case. The logic would be as follows:
1) i_dirty_time_when tracks when the inode timestamps first changed.
2) i_dirtied_when tracks when the inode was added to the b_dirty or
   b_dirty_time list (gets reset when the list changes).
3) When queueing inodes for writeback in queue_io(), we also queue inodes
   from the b_dirty_time list with i_dirtied_when older than 12 hours.
4) When processing inodes on the b_io list we also check whether
   i_dirty_time_when is older than 12 hours. If so, we write out the inode.

Rule 3) guarantees that an inode gets to the b_io list at least once per 12
hours. Rule 4) guarantees that if an inode gets to the b_io list and its
timestamps are older than 12 hours, we write them out. So together we have
a guarantee that an inode will be written out at least once per 24 hours.
And the logic looks reasonably simple (but please spell out these rules in
some comment in the code so that we don't forget...).
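
For illustration, rule 3) would keep the shape queue_io() already has (if I
read the merged code right), just with the b_dirty_time cutoff driven by
i_dirtied_when being more than 12 hours old - a rough sketch:

	static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work)
	{
		int moved;

		assert_spin_locked(&wb->list_lock);
		list_splice_init(&wb->b_more_io, &wb->b_io);
		moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, 0, work);
		/* rule 3): also pull in timestamp-only inodes older than 12h */
		moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io,
					     EXPIRE_DIRTY_ATIME, work);
		trace_writeback_queue_io(wb, work, moved);
	}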

BTW, I prefer to really keep the timestamp updates within 24 hours. IMHO
it's easier for humans and it's not like there's any performance difference
in possibly writing inodes every 12 hours instead of every 24 hours.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: Lazytime feature bugs
  2015-02-26  8:34           ` Jan Kara
@ 2015-02-26 13:45             ` Theodore Ts'o
  2015-02-26 14:38               ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Theodore Ts'o @ 2015-02-26 13:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: Linus Torvalds, linux-fsdevel, Alexander Viro

On Thu, Feb 26, 2015 at 09:34:55AM +0100, Jan Kara wrote:
>   I don't think you can reset i_dirty_time_when to "now" in b) because in
> that case continually dirtied files won't ever have timestamps written (you
> > are completely losing track of when you last wrote timestamps to disk).

In the case of continually dirtied files, in my proposal we would be
checking to see if i_dirty_time_when is too old when we scan files in
b_io, which means that we would discover files with stale timestamps
on disk, and they would get written out.  This is the same as (4) in
your proposal:


> 4) When processing inodes on the b_io list we also check whether
>    i_dirty_time_when is older than 12 hours. If so, we write out the inode.

But your mechanism would work as well; I preferred to keep
i_dirtied_when to only mean when the pages were dirtied, but the
important thing is that we have to track when the timestamps were
dirtied and when the pages were dirtied as separate things.

> BTW, I prefer to really keep the timestamp updates within 24 hours. IMHO
> it's easier for humans and it's not like there's any performance difference
> in possibly writing inodes every 12 hours instead of every 24 hours.

I agree, writing inodes every 12 hours so that the worst case staleness
is 24 hours is fair enough.  If we really cared we could make the
duration tunable, but I agree that whether we use 12 or 24 hours, with
the worst case being 24 or 48 hours, is probably not going to result
in a measurable performance difference.

Cheers,

					- Ted


* Re: Lazytime feature bugs
  2015-02-26 13:45             ` Theodore Ts'o
@ 2015-02-26 14:38               ` Jan Kara
  2015-02-26 19:27                 ` Theodore Ts'o
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2015-02-26 14:38 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, Linus Torvalds, linux-fsdevel, Alexander Viro

On Thu 26-02-15 08:45:53, Ted Tso wrote:
> On Thu, Feb 26, 2015 at 09:34:55AM +0100, Jan Kara wrote:
> >   I don't think you can reset i_dirty_time_when to "now" in b) because in
> > that case continually dirtied files won't ever have timestamps written (you
> > are completely losing track of when you last wrote timestamps to disk).
> 
> In the case of continually dirtied files, in my proposal we would be
> checking to see if i_dirty_time_when is too old when we scan files in
> b_io, which means that we would discover files with stale timestamps
> on disk, and they would get written out.
  But you wrote you'll reset i_dirty_time_when to "now" when queueing into
b_dirty_time. Thus frequently dirtied inodes will just travel back and
forth between the b_dirty and b_dirty_time lists and i_dirty_time_when will
be continuously reset. And you never end up writing it. That's why I
suggested to reset i_dirtied_when instead (since it would be unused in the
b_dirty_time list anyway).

> This is the same as (4) in your proposal:
> 
> 
> > 4) When processing inodes on the b_io list we also check whether
> >    i_dirty_time_when is older than 12 hours. If so, we write out the inode.
> 
> But your mechanism would work as well; I preferred to keep
> i_dirtied_when to only mean when the pages were dirtied, but the
> important thing is that we have to track when the timestamps were
> dirtied and when the pages were dirtied as separate things.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: Lazytime feature bugs
  2015-02-26 14:38               ` Jan Kara
@ 2015-02-26 19:27                 ` Theodore Ts'o
  2015-03-02  8:29                   ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Theodore Ts'o @ 2015-02-26 19:27 UTC (permalink / raw)
  To: Jan Kara; +Cc: Linus Torvalds, linux-fsdevel, Alexander Viro

On Thu, Feb 26, 2015 at 03:38:28PM +0100, Jan Kara wrote:
>   But you wrote you'll reset i_dirty_time_when to "now" when queueing into
> b_dirty_time. Thus frequently dirtied inodes will just travel back and
> forth between the b_dirty and b_dirty_time lists and i_dirty_time_when will
> be continuously reset. And you never end up writing it. That's why I
> suggested to reset i_dirtied_when instead (since it would be unused in the
> b_dirty_time list anyway).

Inodes will never get transferred onto the b_dirty_time list if they
also have dirty pages.  So if the inode is constantly getting dirtied,
the pages *will* get written out, and when they do, we'll also check
to see if i_dirty_time_when is too old, and also update the
timestamps.

Basically the I_DIRTY_TIME flag is the "weakest" of all of the dirty
flags.  If the inode is dirty, it takes precedence, and writing out
the inode will include updating the timestamps.  If the inode's pages
are dirty, it takes precedence over I_DIRTY_TIME, so we worry about
getting the pages written out before we even consider
whether or not the inode should go on the b_dirty_time list.
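
In code terms that precedence boils down to one line when filing the inode
in __mark_inode_dirty() (sketch):

	/* dirty pages win: the inode goes to b_dirty, not b_dirty_time */
	if (flags & I_DIRTY_PAGES)
		dirtytime = 0;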

So I don't see how inodes would be shuttling back and forth between
b_io and b_dirty_time.

	       	   	 	   - Ted


* Re: Lazytime feature bugs
  2015-02-26 19:27                 ` Theodore Ts'o
@ 2015-03-02  8:29                   ` Jan Kara
  2015-03-07  5:34                     ` [PATCH] fs: make sure the timestamps for lazytime inodes eventually get written Theodore Ts'o
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2015-03-02  8:29 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, Linus Torvalds, linux-fsdevel, Alexander Viro

On Thu 26-02-15 14:27:35, Ted Tso wrote:
> On Thu, Feb 26, 2015 at 03:38:28PM +0100, Jan Kara wrote:
> >   But you wrote you'll reset i_dirty_time_when to "now" when queueing into
> > b_dirty_time. Thus frequently dirtied inodes will just travel back and
> > forth between the b_dirty and b_dirty_time lists and i_dirty_time_when will
> > be continuously reset. And you never end up writing it. That's why I
> > suggested to reset i_dirtied_when instead (since it would be unused in the
> > b_dirty_time list anyway).
> 
> Inodes will never get transferred onto the b_dirty_time list if they
> also have dirty pages.  So if the inode is constantly getting dirtied,
> the pages *will* get written out, and when they do, we'll also check
> to see if i_dirty_time_when is too old, and also update the
> timestamps.
> 
> Basically the I_DIRTY_TIME flag is the "weakest" of all of the dirty
> flags.  If the inode is dirty, it takes precedence, and writing out
> the inode will include updating the timestamps.  If the inode's pages
> are dirty, it takes precedence over I_DIRTY_TIME, so we worry about
> getting the pages written out before we even consider
> whether or not the inode should go on the b_dirty_time list.
> 
> So I don't see how inodes would be shuttling back and forth between
> b_io and b_dirty_time.
  So maybe I'm missing something, but I still don't see what prevents this
scenario:
 1) write to file happens -> inode timestamps updated, inode filed to b_dirty,
    i_dirtied_when = now, i_dirty_time_when = now
 2) writeback for file happens, all pages written but not timestamps since
    they are fresh -> inode gets filed to b_dirty_time list, if we use your
    algorithm, i_dirty_time_when = now.
 3) write to file happens -> inode filed to b_dirty, i_dirtied_when = now.
 4) goto 2)

Now the above loop can repeat indefinitely and you never end up writing the
inode, because the timestamps are updated on each write to the file and
i_dirty_time_when gets updated when you move the inode to the b_dirty_time
list.

Again, if there's a fault in the above logic or I misunderstood something
in your proposal, I'm happy to be corrected.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* [PATCH] fs: make sure the timestamps for lazytime inodes eventually get written
  2015-03-02  8:29                   ` Jan Kara
@ 2015-03-07  5:34                     ` Theodore Ts'o
  2015-03-08 10:06                       ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Theodore Ts'o @ 2015-03-07  5:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: Linus Torvalds, linux-fsdevel, Alexander Viro

Hi Jan,

I believe the following should address all of the issues that you
raised.  Could you take a look and let me know what you think?

Thanks!!

						- Ted

commit e42f32d0ed65059e5cb689cc0ac28099a729ba9a
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Sat Mar 7 00:22:37 2015 -0500

    fs: make sure the timestamps for lazytime inodes eventually get written
    
    Jan Kara pointed out that if there is an inode which is constantly
    getting dirtied with I_DIRTY_PAGES, its updated timestamps will never
    be written since inode->dirtied_when is constantly getting updated.
    We fix this by adding an extra field to the inode, dirtied_time_when,
    so inodes with a stale dirtytime can get detected and handled.
    
    Also add a sysctl tunable, dirtytime_expire_seconds so we can properly
    debug this code and make sure it all works.
    
    Finally, if we have a dirtytime inode caused by an atime update, and
    there is no write activity on the file system, we need to have a
    secondary system to make sure these inodes get written out.  We do
    this by setting up a second delayed work structure which wakes up the
    CPU much more rarely compared to writeback_expire_centisecs.
    
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e907052..260d7e7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -53,6 +53,9 @@ struct wb_writeback_work {
 	struct completion *done;	/* set if the caller waits */
 };
 
+/* 12 hours in seconds */
+unsigned int dirtytime_expire_interval = 12 * 60 * 60;
+
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -275,8 +278,8 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 
 	if ((flags & EXPIRE_DIRTY_ATIME) == 0)
 		older_than_this = work->older_than_this;
-	else if ((work->reason == WB_REASON_SYNC) == 0) {
-		expire_time = jiffies - (HZ * 86400);
+	else if (!work->for_sync) {
+		expire_time = jiffies - (dirtytime_expire_interval * HZ);
 		older_than_this = &expire_time;
 	}
 	while (!list_empty(delaying_queue)) {
@@ -458,6 +461,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 		 */
 		redirty_tail(inode, wb);
 	} else if (inode->i_state & I_DIRTY_TIME) {
+		inode->dirtied_when = jiffies;
 		list_move(&inode->i_wb_list, &wb->b_dirty_time);
 	} else {
 		/* The inode is clean. Remove from writeback lists. */
@@ -741,6 +745,13 @@ static long writeback_sb_inodes(struct super_block *sb,
 		spin_lock(&inode->i_lock);
 		if (!(inode->i_state & I_DIRTY_ALL))
 			wrote++;
+		if ((inode->i_state & I_DIRTY_TIME) &&
+		    ((start_time - inode->dirtied_time_when) >
+		     (dirtytime_expire_interval * HZ))) {
+			inode->i_state &= ~I_DIRTY_TIME;
+			inode->i_state |= I_DIRTY_SYNC;
+			trace_writeback_lazytime(inode);
+		}
 		requeue_inode(inode, wb, &wbc);
 		inode_sync_complete(inode);
 		spin_unlock(&inode->i_lock);
@@ -1131,6 +1142,56 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
 	rcu_read_unlock();
 }
 
+/*
+ * Wake up bdi's periodically to make sure dirtytime inodes get
+ * written back periodically.  We deliberately do *not* check the
+ * b_dirtytime list in wb_has_dirty_io(), since this would cause the
+ * kernel to be constantly waking up once there are any dirtytime
+ * inodes on the system.  So instead we define a separate delayed work
+ * function which gets called much more rarely.
+ *
+ * If there is any other write activity going on in the file system,
+ * this function won't be necessary.  But if the only thing that has
+ * happened on the file system is a dirtytime inode caused by an atime
+ * update, we need this infrastructure below to make sure that inode
+ * eventually gets pushed out to disk.
+ */
+static void wakeup_dirtytime_writeback(struct work_struct *w);
+static DECLARE_DELAYED_WORK(dirtytime_work, wakeup_dirtytime_writeback);
+
+static void wakeup_dirtytime_writeback(struct work_struct *w)
+{
+	struct backing_dev_info *bdi;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
+		if (list_empty(&bdi->wb.b_dirty_time))
+			continue;
+		bdi_wakeup_thread(bdi);
+	}
+	rcu_read_unlock();
+	schedule_delayed_work(&dirtytime_work, dirtytime_expire_interval * HZ);
+}
+
+static int __init start_dirtytime_writeback(void)
+{
+	schedule_delayed_work(&dirtytime_work, dirtytime_expire_interval * HZ);
+	return 0;
+}
+__initcall(start_dirtytime_writeback);
+
+int dirtytime_interval_handler(struct ctl_table *table, int write,
+			       void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+
+	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (ret == 0 && write)
+		mod_delayed_work(system_wq, &dirtytime_work, 0);
+	return ret;
+}
+
+
 static noinline void block_dump___mark_inode_dirty(struct inode *inode)
 {
 	if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
@@ -1269,6 +1330,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 			}
 
 			inode->dirtied_when = jiffies;
+			if (dirtytime)
+				inode->dirtied_time_when = jiffies;
+			if (flags & I_DIRTY_PAGES)
+				dirtytime = 0;
 			list_move(&inode->i_wb_list, dirtytime ?
 				  &bdi->wb.b_dirty_time : &bdi->wb.b_dirty);
 			spin_unlock(&bdi->wb.list_lock);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b4d71b5..f4131e8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -604,6 +604,7 @@ struct inode {
 	struct mutex		i_mutex;
 
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
+	unsigned long		dirtied_time_when;
 
 	struct hlist_node	i_hash;
 	struct list_head	i_wb_list;	/* backing dev IO list */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 0004833..b2dd371e 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -130,6 +130,7 @@ extern int vm_dirty_ratio;
 extern unsigned long vm_dirty_bytes;
 extern unsigned int dirty_writeback_interval;
 extern unsigned int dirty_expire_interval;
+extern unsigned int dirtytime_expire_interval;
 extern int vm_highmem_is_dirtyable;
 extern int block_dump;
 extern int laptop_mode;
@@ -146,6 +147,8 @@ extern int dirty_ratio_handler(struct ctl_table *table, int write,
 extern int dirty_bytes_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
+int dirtytime_interval_handler(struct ctl_table *table, int write,
+			       void __user *buffer, size_t *lenp, loff_t *ppos);
 
 struct ctl_table;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 88ea2d6..ce410bb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1228,6 +1228,14 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 	{
+		.procname	= "dirtytime_expire_seconds",
+		.data		= &dirtytime_expire_interval,
+		.maxlen		= sizeof(dirty_expire_interval),
+		.mode		= 0644,
+		.proc_handler	= dirtytime_interval_handler,
+		.extra1		= &zero,
+	},
+	{
 		.procname       = "nr_pdflush_threads",
 		.mode           = 0444 /* read-only */,
 		.proc_handler   = pdflush_proc_obsolete,


* Re: [PATCH] fs: make sure the timestamps for lazytime inodes eventually get written
  2015-03-07  5:34                     ` [PATCH] fs: make sure the timestamps for lazytime inodes eventually get written Theodore Ts'o
@ 2015-03-08 10:06                       ` Jan Kara
  2015-03-08 19:06                         ` Theodore Ts'o
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2015-03-08 10:06 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, Linus Torvalds, linux-fsdevel, Alexander Viro

  Hi Ted,

On Sat 07-03-15 00:34:08, Ted Tso wrote:
> I believe the following should address all of the issues that you
> raised.  Could you take a look and let me know what you think?
> 
> Thanks!!
  Thanks for the patch. A few comments below:

> commit e42f32d0ed65059e5cb689cc0ac28099a729ba9a
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Sat Mar 7 00:22:37 2015 -0500
> 
>     fs: make sure the timestamps for lazytime inodes eventually get written
>     
>     Jan Kara pointed out that if there is an inode which is constantly
>     getting dirtied with I_DIRTY_PAGES, its updated timestamps will never
>     be written since inode->dirtied_when is constantly getting updated.
>     We fix this by adding an extra field to the inode, dirtied_time_when,
>     so inodes with a stale dirtytime can get detected and handled.
>     
>     Also add a sysctl tunable, dirtytime_expire_seconds so we can properly
>     debug this code and make sure it all works.
>     
>     Finally, if we have a dirtytime inode caused by an atime update, and
>     there is no write activity on the file system, we need to have a
>     secondary system to make sure these inodes get written out.  We do
>     this by setting up a second delayed work structure which wakes up the
>     CPU much more rarely compared to writeback_expire_centisecs.
>     
>     Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index e907052..260d7e7 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -53,6 +53,9 @@ struct wb_writeback_work {
>  	struct completion *done;	/* set if the caller waits */
>  };
>  
> +/* 12 hours in seconds */
> +unsigned int dirtytime_expire_interval = 12 * 60 * 60;
> +
>  /**
>   * writeback_in_progress - determine whether there is writeback in progress
>   * @bdi: the device's backing_dev_info structure.
> @@ -275,8 +278,8 @@ static int move_expired_inodes(struct list_head *delaying_queue,
>  
>  	if ((flags & EXPIRE_DIRTY_ATIME) == 0)
>  		older_than_this = work->older_than_this;
> -	else if ((work->reason == WB_REASON_SYNC) == 0) {
> -		expire_time = jiffies - (HZ * 86400);
> +	else if (!work->for_sync) {
> +		expire_time = jiffies - (dirtytime_expire_interval * HZ);
>  		older_than_this = &expire_time;
>  	}
>  	while (!list_empty(delaying_queue)) {
  This hunk should be a separate patch since it's completely unrelated.


> @@ -458,6 +461,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
>  		 */
>  		redirty_tail(inode, wb);
>  	} else if (inode->i_state & I_DIRTY_TIME) {
> +		inode->dirtied_when = jiffies;
>  		list_move(&inode->i_wb_list, &wb->b_dirty_time);
>  	} else {
>  		/* The inode is clean. Remove from writeback lists. */
> @@ -741,6 +745,13 @@ static long writeback_sb_inodes(struct super_block *sb,
>  		spin_lock(&inode->i_lock);
>  		if (!(inode->i_state & I_DIRTY_ALL))
>  			wrote++;
> +		if ((inode->i_state & I_DIRTY_TIME) &&
> +		    ((start_time - inode->dirtied_time_when) >
> +		     (dirtytime_expire_interval * HZ))) {
> +			inode->i_state &= ~I_DIRTY_TIME;
> +			inode->i_state |= I_DIRTY_SYNC;
> +			trace_writeback_lazytime(inode);
> +		}
  Hum, why is this here? A more logical place for it would IMO be in
__writeback_single_inode() where we modify inode state. Also we would then
immediately end up writing the inode instead of just queueing it to a
different writeback queue.

>  		requeue_inode(inode, wb, &wbc);
>  		inode_sync_complete(inode);
>  		spin_unlock(&inode->i_lock);
> @@ -1131,6 +1142,56 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
>  	rcu_read_unlock();
>  }
>  
> +/*
> + * Wake up bdi's periodically to make sure dirtytime inodes get
> + * written back periodically.  We deliberately do *not* check the
> + * b_dirtytime list in wb_has_dirty_io(), since this would cause the
> + * kernel to be constantly waking up once there are any dirtytime
> + * inodes on the system.  So instead we define a separate delayed work
> + * function which gets called much more rarely.
> + *
> + * If there is any other write activity going on in the file system,
> + * this function won't be necessary.  But if the only thing that has
> + * happened on the file system is a dirtytime inode caused by an atime
> + * update, we need this infrastructure below to make sure that inode
> + * eventually gets pushed out to disk.
> + */
> +static void wakeup_dirtytime_writeback(struct work_struct *w);
> +static DECLARE_DELAYED_WORK(dirtytime_work, wakeup_dirtytime_writeback);
> +
> +static void wakeup_dirtytime_writeback(struct work_struct *w)
> +{
> +	struct backing_dev_info *bdi;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
> +		if (list_empty(&bdi->wb.b_dirty_time))
> +			continue;
> +		bdi_wakeup_thread(bdi);
> +	}
> +	rcu_read_unlock();
> +	schedule_delayed_work(&dirtytime_work, dirtytime_expire_interval * HZ);
> +}
> +
> +static int __init start_dirtytime_writeback(void)
> +{
> +	schedule_delayed_work(&dirtytime_work, dirtytime_expire_interval * HZ);
> +	return 0;
> +}
> +__initcall(start_dirtytime_writeback);
> +
> +int dirtytime_interval_handler(struct ctl_table *table, int write,
> +			       void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	int ret;
> +
> +	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> +	if (ret == 0 && write)
> +		mod_delayed_work(system_wq, &dirtytime_work, 0);
> +	return ret;
> +}
> +
> +
>  static noinline void block_dump___mark_inode_dirty(struct inode *inode)
>  {
>  	if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
> @@ -1269,6 +1330,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>  			}
>  
>  			inode->dirtied_when = jiffies;
> +			if (dirtytime)
> +				inode->dirtied_time_when = jiffies;
> +			if (flags & I_DIRTY_PAGES)
> +				dirtytime = 0;
>  			list_move(&inode->i_wb_list, dirtytime ?
>  				  &bdi->wb.b_dirty_time : &bdi->wb.b_dirty);
>  			spin_unlock(&bdi->wb.list_lock);
  I guess this would be more readable as:
			if (dirtytime)
				inode->dirtied_time_when = jiffies;
			if (inode->i_state & (I_DIRTY_INODE | I_DIRTY_PAGES))
				list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
			else {
				list_move(&inode->i_wb_list,
					  &bdi->wb.b_dirty_time);
			}
  Since that will clearly express that the inode needs to end up in the
list which corresponds to the current inode state. Also, preferably the
change in the condition deciding in which list the inode ends up should be
split into a separate patch, since that's a problem unrelated to the issue
described in the changelog.

Otherwise the patch looks good.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH] fs: make sure the timestamps for lazytime inodes eventually get written
  2015-03-08 10:06                       ` Jan Kara
@ 2015-03-08 19:06                         ` Theodore Ts'o
  0 siblings, 0 replies; 14+ messages in thread
From: Theodore Ts'o @ 2015-03-08 19:06 UTC (permalink / raw)
  To: Jan Kara; +Cc: Linus Torvalds, linux-fsdevel, Alexander Viro

On Sun, Mar 08, 2015 at 11:06:50AM +0100, Jan Kara wrote:
> > @@ -275,8 +278,8 @@ static int move_expired_inodes(struct list_head *delaying_queue,
> >  
> >  	if ((flags & EXPIRE_DIRTY_ATIME) == 0)
> >  		older_than_this = work->older_than_this;
> > -	else if ((work->reason == WB_REASON_SYNC) == 0) {
> > -		expire_time = jiffies - (HZ * 86400);
> > +	else if (!work->for_sync) {
> > +		expire_time = jiffies - (dirtytime_expire_interval * HZ);
> >  		older_than_this = &expire_time;
> >  	}
> >  	while (!list_empty(delaying_queue)) {
>   This hunk should be a separate patch since it's completely unrelated.

Along with all of the other changes that relate to adding a sysctl
tunable?  Sure, I can do that.

BTW, I know that originally we talked about not needing the tunable,
but in my experience it **really** helps with testing in the future.  If
we ever want to try to create an automated test suite, it really helps
to have the tunable.
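
For instance, a test could do something like
"echo 60 > /proc/sys/vm/dirtytime_expire_seconds" and then watch the
timestamps expire within a minute instead of half a day (writing the
sysctl kicks dirtytime_work immediately).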

> > @@ -741,6 +745,13 @@ static long writeback_sb_inodes(struct super_block *sb,
> >  		spin_lock(&inode->i_lock);
> >  		if (!(inode->i_state & I_DIRTY_ALL))
> >  			wrote++;
> > +		if ((inode->i_state & I_DIRTY_TIME) &&
> > +		    ((start_time - inode->dirtied_time_when) >
> > +		     (dirtytime_expire_interval * HZ))) {
> > +			inode->i_state &= ~I_DIRTY_TIME;
> > +			inode->i_state |= I_DIRTY_SYNC;
> > +			trace_writeback_lazytime(inode);
> > +		}
>   Hum, why is this here? A more logical place for it would IMO be in
> __writeback_single_inode() where we modify inode state. Also we would then
> immediately end up writing the inode instead of just queueing it to a
> different writeback queue.

Good point, it would be much better to put it there.  I'll move it in
the next version of the patch.

> > @@ -1269,6 +1330,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
> >  			}
> >  
> >  			inode->dirtied_when = jiffies;
> > +			if (dirtytime)
> > +				inode->dirtied_time_when = jiffies;
> > +			if (flags & I_DIRTY_PAGES)
> > +				dirtytime = 0;
> >  			list_move(&inode->i_wb_list, dirtytime ?
> >  				  &bdi->wb.b_dirty_time : &bdi->wb.b_dirty);
> >  			spin_unlock(&bdi->wb.list_lock);
>   I guess this would be more readable as:
> 			if (dirtytime)
> 				inode->dirtied_time_when = jiffies;
> 			if (inode->i_state & (I_DIRTY_INODE | I_DIRTY_PAGES))
> 				list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
> 			else {
> 				list_move(&inode->i_wb_list,
> 					  &bdi->wb.b_dirty_time);
> 			}
>   Since that will clearly express that the inode needs to end up in the
> list which corresponds to the current inode state. Also, preferably the
> change in the condition deciding in which list the inode ends up should be
> split into a separate patch, since that's a problem unrelated to the issue
> described in the changelog.

Agreed, I'll change this and resend.

						- Ted

