* more gc experiences
@ 2015-10-04 12:49 Marc Lehmann
  2015-10-05  7:25 ` more gc / gc script refinements Marc Lehmann
  0 siblings, 1 reply; 7+ messages in thread
From: Marc Lehmann @ 2015-10-04 12:49 UTC (permalink / raw)
  To: linux-f2fs-devel

Still working on my nearly full volume.

After stopping at ~180GB free due to low performance, I reconfigured the GC
so it runs more often and let it run for less than two hours.

   echo   500 >gc_min_sleep_time
   echo   500 >gc_max_sleep_time
   echo   500 >gc_no_gc_sleep_time

status before/after:
http://ue.tst.eu/dc80b74b69bbb431f51d731d8b075324.txt
http://ue.tst.eu/821be8e5c227653a90596c1503b84567.txt

During GC I had a steady 45MB/s read + 45MB/s write. The number of Dirty
segments didn't decrease very much, but that is likely due to the structure
of the data:

Apart from fs metadata, I also did an rsync without -H, followed by an rsync
with -H. The latter replaces the physical file copies that the first rsync
created with hardlinks, if they were hardlinked in the source. The source had
a moderate amount of hardlinked files.

Since this probably created holes nicely spread all over the data, it is
expected that the GC has to copy a lot of data (at -s64). So overall, during
this time, the GC seemed to work quite well.

After that, I "stopped" the GC again and started the rsync, which then
proceeded to copy another 110GB (ending up at 70GB free), at which point the
performance again became unusably slow.

I then decided to give F2FS_IOC_GARBAGE_COLLECT a try with the following
script, and found some issues:

http://ue.tst.eu/9723851c87bb35e5899534123a5af497.txt
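
(The actual script is linked above; for illustration only, this C sketch
shows roughly the kind of loop I mean, timing each call. The ioctl number is
an assumption taken from fs/f2fs/f2fs.h of this era rather than from a
public header, so check it against your kernel:)

   #include <stdio.h>
   #include <fcntl.h>
   #include <time.h>
   #include <sys/ioctl.h>

   #define F2FS_IOCTL_MAGIC         0xf5
   #define F2FS_IOC_GARBAGE_COLLECT _IO(F2FS_IOCTL_MAGIC, 6) /* assumed */

   int main(int argc, char *argv[])
   {
      /* any open fd on the f2fs mount will do, e.g. the mountpoint itself */
      int fd = argc > 1 ? open(argv[1], O_RDONLY) : -1;
      if (fd < 0)
         return 1;

      for (;;) {
         struct timespec a, b;

         clock_gettime(CLOCK_MONOTONIC, &a);
         if (ioctl(fd, F2FS_IOC_GARBAGE_COLLECT) < 0)
            break; /* e.g. nothing left to collect, or not supported */
         clock_gettime(CLOCK_MONOTONIC, &b);

         printf("gc pass: %.6fs\n",
                b.tv_sec - a.tv_sec + (b.tv_nsec - a.tv_nsec) / 1e9);
      }

      return 0;
   }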

The first problem is that the ioctl seems to return instantly and
successfully while the background garbage collection thread runs. Or at
least, that's my theory: the symptom was that most of the time calls took
1-4 seconds, but regularly (presumably when the kernel GC runs), the call
returned within microseconds.

This causes unnecessarily high CPU usage - I think the call should just wait
for the GC lock in that case, or (less preferably) somehow signal this third
condition so the user code can do something else.

Which brings us to the next problem - calling the GC ioctl in a loop quickly
generated 23GB of dirty pages, which then more or less locked up the box: no
login was possible for 6 minutes after I killed the GC script, and no NFS
operations took place.

While this is a shortcoming of Linux in general, it highlights the principal
problem of not having any rate control in f2fs's GC - basically, the user has
to guess when the GC is done and when the next round can start, which is, in
general, impossible, as only the fs knows the real I/O load. In other words,
here, again, the script would have to contain a magic delay, just like
gc_min_sleep_time, after each round.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: more gc / gc script refinements
  2015-10-04 12:49 more gc experiences Marc Lehmann
@ 2015-10-05  7:25 ` Marc Lehmann
  2015-10-05 15:02   ` Chao Yu
  0 siblings, 1 reply; 7+ messages in thread
From: Marc Lehmann @ 2015-10-05  7:25 UTC (permalink / raw)
  To: linux-f2fs-devel

After I successfully filled the disk, it was time to see how f2fs recovers
from this bad situation, by deleting a lot of files and filling it again.

To ease the load on the GC, but still present a bit of a challenge, I
deleted the first 12000 files out of every 80000 files (in directory order),
in the hope that this carves out comparatively big chunks.

I started with "Dirty: 30k" and "Free: 45k" and ended up with
"Dirty: 216k" and "Free: 968k", which to me seems to indicate it kind of
worked, although I am not sure how contiguous this free space really is
(I originally hoped this would be in the form of mostly free sections).

Then I worked on my GC script. Since the box became mostly unusable by just
calling the GC, I first tried this refinement:

http://ue.tst.eu/38809274b56fe9b161492f09b5411071.txt

(Used like "script </mountpoint" btw.)

In other words, only call the GC if there is less than 2GB of dirty pages.
For thresholds lower than 2GB, the GC often didn't run at all for 10-20
seconds.
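
(Again for illustration only - not the linked script, but a rough,
self-contained C equivalent of the idea: the 2GB limit is checked against
the "Dirty:" line of /proc/meminfo, and the ioctl define is the same
assumption as before:)

   #include <stdio.h>
   #include <unistd.h>
   #include <sys/ioctl.h>

   #define F2FS_IOCTL_MAGIC         0xf5
   #define F2FS_IOC_GARBAGE_COLLECT _IO(F2FS_IOCTL_MAGIC, 6) /* assumed */

   #define DIRTY_LIMIT_KB (2L * 1024 * 1024) /* 2GB */

   /* current "Dirty:" value from /proc/meminfo, in kB, or -1 on error */
   static long dirty_kb(void)
   {
      char line[128];
      long kb = -1;
      FILE *f = fopen("/proc/meminfo", "r");

      if (!f)
         return -1;

      while (fgets(line, sizeof line, f))
         if (sscanf(line, "Dirty: %ld", &kb) == 1)
            break;

      fclose(f);
      return kb;
   }

   int main(void)
   {
      int fd = 0; /* stdin is the mountpoint, as in "script </mountpoint" */

      for (;;) {
         if (dirty_kb() >= DIRTY_LIMIT_KB) {
            sleep(1); /* let writeback drain before the next round */
            continue;
         }

         if (ioctl(fd, F2FS_IOC_GARBAGE_COLLECT) < 0)
            break; /* nothing left to collect, or not supported */
      }

      return 0;
   }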

This helped a lot, but the box was still noticeably sluggish, and I realised
why the current GC I/O implementation is wrong - the purpose of the page
cache is (among other uses, such as being the normal way to do I/O) to cache
recently-requested data in the hope that it will be reused.

However, in the case of the GC, unless the data was in the cache before, the
chances that this data will be needed later are just as low as for the rest
of the device, and in general much lower than for the data that was in the
cache before f2fs evicted it.

Moreover, a lot of stress is put on the page cache because the f2fs GC
treats GCed data as normal data, leaving it in the cache and leaving it up
to the kernel to write out the pages.

What the GC should do is minimize its impact on the rest of the system, by
immediately flushing the data out and expiring the pages.

To improve the situation somewhat I decided to experiment with fdatasync on
the block device and/or a directory handle, but ended up calling syncfs on
the f2fs filesystem after every GC call, because fdatasync etc. seemed to be
equivalent to syncfs anyway:

http://ue.tst.eu/325b6ba70b1abe814dc6a5cb6c02730e.txt
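
(Relative to the earlier sketch, the only change is one extra call after
each successful GC pass; syncfs() needs _GNU_SOURCE with glibc:)

      if (ioctl(fd, F2FS_IOC_GARBAGE_COLLECT) < 0)
         break;
      syncfs(fd); /* flush out what the GC pass dirtied right away */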

The effect of syncfs was to make I/O a lot more "chunky" - first everything
was read, then everything was written (dstat output, btw; this is at
1-second intervals as always, but I never mentioned that before - sorry):

http://ue.tst.eu/9a552a4f41a4863133d3eceb90f1ec87.txt

Without it, reads and writes happen "at the same time" (when sampled at
1-second intervals).

This increased the average throughput considerably, from around 45MB/s
read+write to 66MB/s. Whether this actually sped up the GC at all I don't
know, because syncfs of course forces a sync on the whole fs, with its own
overhead.

So while this is a rather heavy-handed approach, the major result was that
the amount of dirty pages was notably reduced (it never reached 1GB), and
the box was much more usable during this time.

Right now, after about 9 hours, I am at "Dirty: 44k", and will start
writing to the device soon.

In any case, f2fs seems to hold up quite nicely near disk-full conditions,
and recovers nicely as well.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: more gc / gc script refinements
  2015-10-05  7:25 ` more gc / gc script refinements Marc Lehmann
@ 2015-10-05 15:02   ` Chao Yu
  2015-10-05 23:16     ` Jaegeuk Kim
  0 siblings, 1 reply; 7+ messages in thread
From: Chao Yu @ 2015-10-05 15:02 UTC (permalink / raw)
  To: 'Marc Lehmann', linux-f2fs-devel

> -----Original Message-----
> From: Marc Lehmann [mailto:schmorp@schmorp.de]
> Sent: Monday, October 05, 2015 3:26 PM
> To: linux-f2fs-devel@lists.sourceforge.net; linux-f2fs-devel@lists.sourceforge.net
> Subject: Re: [f2fs-dev] more gc / gc script refinements
> 
> After I successfully filled the disk, it was time to see hope f2fs recovers
> from this bad situation, by deleting a lot of files and filling it again.
> 
> To ease the load on the gc, but still present a bit of a challenge, I
> deleted the first 12000 files out of every 80000 files (directory order), in the hope
> that this carves out comparatively big chunks.
> 
> I started with "Dirty: 30k" and "Free: 45k" and ended up with
> "Dirty: 216k" and "Free: 968k", which to me seems to indicate it kind of
> worked, although I am not suire how contiguous this free space really is
> (I oriignally hoped this would be in the form of mostly free sections).
> 
> Then I worked on my GC script. Since the box became mostly unusable by just
> calling the GC, I first tried this refinement:
> 
> http://ue.tst.eu/38809274b56fe9b161492f09b5411071.txt
> 
> (Used like "script </mountpoint" btw.)
> 
> Or in other words, only call the GC if there is less than 2GB of dirty
> pages. For lower values than 2GB the GC often didn't run at all for 10-20
> seconds.
> 
> This helped a lot, but the box was still noticably sluggish, and I
> realised why the current GC I/O implementation is wrong - the purpose of
> the cache is (among other uses, such as being the normal way to do I/O) to
> cache recently-requested data in the hope that it will be reused.
> 
> However, in the case of the GC, unless the data was in the cache before,
> chances that this data is required later are just as low as for the rest of
> the device, and in general, much lower then the data that was in the cache
> before f2fs evicted it.
> 
> Moreso, a lot of stress is put on the page cache because of the f2fs gc
> treating it as normal data and leaving it in the cache and up to the
> kernel to write out the pages.

IMO, the reasons for this behavior are: a) keeping GCed pages in the cache
in the hope of further hits; b) since kworker flushes all pages belonging to
one inode together, we expect that an inode's pages cached across multiple
background GC passes can be merged and then flushed to contiguous block
addresses, which can improve read performance afterwards.

> 
> What the GC should do is minimize the impact of the GC on the rest of the
> system, by immediately flushing the data out and expiring the pages.

I think f2fs should support a more flexible method of triggering GC, e.g. by
supporting sync/async GC ioctl commands:
1) synchronous GC: all GCed pages are persistent on the device after the
ioctl returns successfully.
2) asynchronous GC: we don't guarantee that all GCed pages will be
persistent after the ioctl returns successfully.

In your scenario, I think the GC flow can easily be controlled with:

while (n) {
	ioctl gc with sync mode
}
syncfs or ioctl write_checkpoint
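
In C that would look roughly like below - assuming the interface from those
patches, i.e. a __u32 sync flag passed to F2FS_IOC_GARBAGE_COLLECT plus a
new F2FS_IOC_WRITE_CHECKPOINT ioctl; the encodings here are only what the
patches propose and may still change:

   #include <sys/ioctl.h>
   #include <linux/types.h>

   #define F2FS_IOCTL_MAGIC          0xf5
   #define F2FS_IOC_GARBAGE_COLLECT  _IOW(F2FS_IOCTL_MAGIC, 6, __u32)
   #define F2FS_IOC_WRITE_CHECKPOINT _IO(F2FS_IOCTL_MAGIC, 7)

   /* run up to n synchronous GC passes on fd, then persist the result */
   static int gc_rounds(int fd, int n)
   {
      __u32 sync = 1; /* 1 = synchronous (foreground) GC, as proposed */

      while (n-- > 0)
         if (ioctl(fd, F2FS_IOC_GARBAGE_COLLECT, &sync) < 0)
            break; /* e.g. no victim section left */

      /* either syncfs(fd) or the checkpoint ioctl */
      return ioctl(fd, F2FS_IOC_WRITE_CHECKPOINT);
   }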

I wrote and sent patches supporting synchronous GC and triggering a
checkpoint by ioctl; I hope they can be helpful once we get Jaegeuk's Ack.

Thanks,

> 
> To improve the situaiton somewhat I decided to experiment with fdatasync
> on the block device and/or a directory handle, but ended up calling syncfs
> on the f2fs fs after every gc call, because fdatasync etc. seemed to be
> the equivalent of syncfs anyway:
> 
> http://ue.tst.eu/325b6ba70b1abe814dc6a5cb6c02730e.txt
> 
> The effect of syncfs was to make I/O a lot more "chunky" - first everything
> was read, then everything was written (dstat output, btw., this is 1 second
> intervals as always, but I never mentioned it - sorry):
> 
> http://ue.tst.eu/9a552a4f41a4863133d3eceb90f1ec87.txt
> 
> Without it, read and write happen "at the same time" (when sampled with 1
> second intervals).
> 
> This increased the average throughput considerably, from around 45MB
> read+write/s to 66MB/s. Whether this actually increased the GC process at
> all I don't know, because syncfs of course forces a sync on the fs, with
> its own overhead.
> 
> So while this is a rather heavy-handed approach, the major result was that
> the amount of dirty pages is notably reduced (it never reaches 1GB), and
> the box is much more usable during this time.
> 
> Right now, after about 9 hours, I am at "Dirty: 44k", and will start
> writing to the device soon.
> 
> In any case, it seems f2fs seems to hold up quite nicely near disk full
> conditions, and does recover nicely as well.
> 
> --
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


* Re: more gc / gc script refinements
  2015-10-05 15:02   ` Chao Yu
@ 2015-10-05 23:16     ` Jaegeuk Kim
  2015-10-06 16:41       ` Chao Yu
  0 siblings, 1 reply; 7+ messages in thread
From: Jaegeuk Kim @ 2015-10-05 23:16 UTC (permalink / raw)
  To: Chao Yu; +Cc: 'Marc Lehmann', linux-f2fs-devel

Thanks Chao,

On Mon, Oct 05, 2015 at 11:02:42PM +0800, Chao Yu wrote:
> > -----Original Message-----
> > From: Marc Lehmann [mailto:schmorp@schmorp.de]
> > Sent: Monday, October 05, 2015 3:26 PM
> > To: linux-f2fs-devel@lists.sourceforge.net; linux-f2fs-devel@lists.sourceforge.net
> > Subject: Re: [f2fs-dev] more gc / gc script refinements
> > 
> > After I successfully filled the disk, it was time to see hope f2fs recovers
> > from this bad situation, by deleting a lot of files and filling it again.
> > 
> > To ease the load on the gc, but still present a bit of a challenge, I
> > deleted the first 12000 files out of every 80000 files (directory order), in the hope
> > that this carves out comparatively big chunks.
> > 
> > I started with "Dirty: 30k" and "Free: 45k" and ended up with
> > "Dirty: 216k" and "Free: 968k", which to me seems to indicate it kind of
> > worked, although I am not suire how contiguous this free space really is
> > (I oriignally hoped this would be in the form of mostly free sections).
> > 
> > Then I worked on my GC script. Since the box became mostly unusable by just
> > calling the GC, I first tried this refinement:
> > 
> > http://ue.tst.eu/38809274b56fe9b161492f09b5411071.txt
> > 
> > (Used like "script </mountpoint" btw.)
> > 
> > Or in other words, only call the GC if there is less than 2GB of dirty
> > pages. For lower values than 2GB the GC often didn't run at all for 10-20
> > seconds.
> > 
> > This helped a lot, but the box was still noticably sluggish, and I
> > realised why the current GC I/O implementation is wrong - the purpose of
> > the cache is (among other uses, such as being the normal way to do I/O) to
> > cache recently-requested data in the hope that it will be reused.
> > 
> > However, in the case of the GC, unless the data was in the cache before,
> > chances that this data is required later are just as low as for the rest of
> > the device, and in general, much lower then the data that was in the cache
> > before f2fs evicted it.
> > 
> > Moreso, a lot of stress is put on the page cache because of the f2fs gc
> > treating it as normal data and leaving it in the cache and up to the
> > kernel to write out the pages.
> 
> IMO, the reason of the behavior is a) keeping gced pages in cache as we look
> forward further hits; b) as we know, kworker will flush all pages belong to
> one inode together, we expect that inode's pages cached from multiple background
> gc can be merged, and then be flushed with continuous block address, which
> can improve the read performance afterward.

Agreed to this.

> > 
> > What the GC should do is minimize the impact of the GC on the rest of the
> > system, by immediately flushing the data out and expiring the pages.
> 
> I think f2fs should support us more flexible method of triggering gc, mostly
> like supporting sync/async gc ioctl command.
> 1) synchronous gc: all gced pages should be persistent in device after ioctl
> returns successfully.
> 2) asynchronous gc: we don't guarantee all gced pages will be persistent
> after ioctl returns successfully.

Yeah, agreed.

I wrote some other patches on top of Chao's patches.
We can do "background_gc=sync" and "/sys/fs/f2fs/<dev>/cp_interval".
Currently, I'm not quite convinced that we need to flush user data
periodically. At least, I expect cp_interval could enhance the user
experience quite well.

> 
> In your scenario, I think gc flow can easily be controlled with:
> 
> while (n) {
> 	ioctl gc with sync mode
> }
> syncfs or ioctl write_checkpoint
> 
> I wrote and sent the patches for supporting synchronous gc and supporting
> triggering checkpoint by ioctl, I hope that can be helpful once we get
> Jaegeuk's Ack.

Thanks for the patches, :)

> 
> Thanks,
> 
> > 
> > To improve the situaiton somewhat I decided to experiment with fdatasync
> > on the block device and/or a directory handle, but ended up calling syncfs
> > on the f2fs fs after every gc call, because fdatasync etc. seemed to be
> > the equivalent of syncfs anyway:
> > 
> > http://ue.tst.eu/325b6ba70b1abe814dc6a5cb6c02730e.txt
> > 
> > The effect of syncfs was to make I/O a lot more "chunky" - first everything
> > was read, then everything was written (dstat output, btw., this is 1 second
> > intervals as always, but I never mentioned it - sorry):
> > 
> > http://ue.tst.eu/9a552a4f41a4863133d3eceb90f1ec87.txt
> > 
> > Without it, read and write happen "at the same time" (when sampled with 1
> > second intervals).
> > 
> > This increased the average throughput considerably, from around 45MB
> > read+write/s to 66MB/s. Whether this actually increased the GC process at
> > all I don't know, because syncfs of course forces a sync on the fs, with
> > its own overhead.
> > 
> > So while this is a rather heavy-handed approach, the major result was that
> > the amount of dirty pages is notably reduced (it never reaches 1GB), and
> > the box is much more usable during this time.
> > 
> > Right now, after about 9 hours, I am at "Dirty: 44k", and will start
> > writing to the device soon.
> > 
> > In any case, it seems f2fs seems to hold up quite nicely near disk full
> > conditions, and does recover nicely as well.
> > 
> > --
> >                 The choice of a       Deliantra, the free code+content MORPG
> >       -----==-     _GNU_              http://www.deliantra.net
> >       ----==-- _       generation
> >       ---==---(_)__  __ ____  __      Marc Lehmann
> >       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
> >       -=====/_/_//_/\_,_/ /_/\_\
> > 
> > ------------------------------------------------------------------------------
> > _______________________________________________
> > Linux-f2fs-devel mailing list
> > Linux-f2fs-devel@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


* Re: more gc / gc script refinements
  2015-10-05 23:16     ` Jaegeuk Kim
@ 2015-10-06 16:41       ` Chao Yu
  2015-10-06 23:44         ` Jaegeuk Kim
  0 siblings, 1 reply; 7+ messages in thread
From: Chao Yu @ 2015-10-06 16:41 UTC (permalink / raw)
  To: 'Jaegeuk Kim'; +Cc: 'Marc Lehmann', linux-f2fs-devel

Hi Jaegeuk,

> -----Original Message-----
> From: Jaegeuk Kim [mailto:jaegeuk@kernel.org]
> Sent: Tuesday, October 06, 2015 7:17 AM
> To: Chao Yu
> Cc: 'Marc Lehmann'; linux-f2fs-devel@lists.sourceforge.net
> Subject: Re: [f2fs-dev] more gc / gc script refinements
> 
> Thanks Chao,
> 
> On Mon, Oct 05, 2015 at 11:02:42PM +0800, Chao Yu wrote:
> > > -----Original Message-----
> > > From: Marc Lehmann [mailto:schmorp@schmorp.de]
> > > Sent: Monday, October 05, 2015 3:26 PM
> > > To: linux-f2fs-devel@lists.sourceforge.net; linux-f2fs-devel@lists.sourceforge.net
> > > Subject: Re: [f2fs-dev] more gc / gc script refinements
> > >
> > > After I successfully filled the disk, it was time to see hope f2fs recovers
> > > from this bad situation, by deleting a lot of files and filling it again.
> > >
> > > To ease the load on the gc, but still present a bit of a challenge, I
> > > deleted the first 12000 files out of every 80000 files (directory order), in the hope
> > > that this carves out comparatively big chunks.
> > >
> > > I started with "Dirty: 30k" and "Free: 45k" and ended up with
> > > "Dirty: 216k" and "Free: 968k", which to me seems to indicate it kind of
> > > worked, although I am not suire how contiguous this free space really is
> > > (I oriignally hoped this would be in the form of mostly free sections).
> > >
> > > Then I worked on my GC script. Since the box became mostly unusable by just
> > > calling the GC, I first tried this refinement:
> > >
> > > http://ue.tst.eu/38809274b56fe9b161492f09b5411071.txt
> > >
> > > (Used like "script </mountpoint" btw.)
> > >
> > > Or in other words, only call the GC if there is less than 2GB of dirty
> > > pages. For lower values than 2GB the GC often didn't run at all for 10-20
> > > seconds.
> > >
> > > This helped a lot, but the box was still noticably sluggish, and I
> > > realised why the current GC I/O implementation is wrong - the purpose of
> > > the cache is (among other uses, such as being the normal way to do I/O) to
> > > cache recently-requested data in the hope that it will be reused.
> > >
> > > However, in the case of the GC, unless the data was in the cache before,
> > > chances that this data is required later are just as low as for the rest of
> > > the device, and in general, much lower then the data that was in the cache
> > > before f2fs evicted it.
> > >
> > > Moreso, a lot of stress is put on the page cache because of the f2fs gc
> > > treating it as normal data and leaving it in the cache and up to the
> > > kernel to write out the pages.
> >
> > IMO, the reason of the behavior is a) keeping gced pages in cache as we look
> > forward further hits; b) as we know, kworker will flush all pages belong to
> > one inode together, we expect that inode's pages cached from multiple background
> > gc can be merged, and then be flushed with continuous block address, which
> > can improve the read performance afterward.
> 
> Agreed to this.
> 
> > >
> > > What the GC should do is minimize the impact of the GC on the rest of the
> > > system, by immediately flushing the data out and expiring the pages.
> >
> > I think f2fs should support us more flexible method of triggering gc, mostly
> > like supporting sync/async gc ioctl command.
> > 1) synchronous gc: all gced pages should be persistent in device after ioctl
> > returns successfully.
> > 2) asynchronous gc: we don't guarantee all gced pages will be persistent
> > after ioctl returns successfully.
> 
> Yeah, agreed.
> 
> I wrote some other patches on top of Chao's patches.
> We can do "background_gc=sync" and "/sys/fs/f2fs/dev/cp_interval".

Nice! That usage is easier and more convenient than the ioctl approach.
I think it can really be helpful for Marc.

> Currently, I'm not quite convincing that we need to flush use data periodically.

IMHO, I prefer to support a data flush functionality in f2fs, for these
reasons:
a) In order to keep the consistency and integrity of user data, kworker
flushing and periodic checkpoints are hard to control and coordinate,
because kworker and the checkpoint can't be aware of each other. On the
contrary, a configurable internal periodic data flush + checkpoint provides
a more flexible way to keep user data persistent (e.g. configure the period
as n seconds, so at worst we lose only the user data written in the last n
seconds).
b) The user gets more options to choose from:
 - periodic data flush + checkpoint gives us better integrity but worse
performance, since within one period we can merge less dirty data for
writeback;
 - kworker flush + periodic checkpoint gives us better performance, as dirty
pages can be merged well, but worse user data integrity, as data that was
not flushed before a checkpoint will be unrecoverable forever.
With this functionality, f2fs becomes more configurable and should be
suitable for more workloads (especially those with a strong demand for user
data integrity).

So I hope we can add this feature to our pending list. What do you think?

Thanks,

> At least, I expect cp_interval could enhance user experiences quite well.
> 
> >
> > In your scenario, I think gc flow can easily be controlled with:
> >
> > while (n) {
> > 	ioctl gc with sync mode
> > }
> > syncfs or ioctl write_checkpoint
> >
> > I wrote and sent the patches for supporting synchronous gc and supporting
> > triggering checkpoint by ioctl, I hope that can be helpful once we get
> > Jaegeuk's Ack.
> 
> Thanks for the patches, :)

Thank you for the quick response! :)

Thanks,

> 
> >
> > Thanks,
> >
> > >
> > > To improve the situaiton somewhat I decided to experiment with fdatasync
> > > on the block device and/or a directory handle, but ended up calling syncfs
> > > on the f2fs fs after every gc call, because fdatasync etc. seemed to be
> > > the equivalent of syncfs anyway:
> > >
> > > http://ue.tst.eu/325b6ba70b1abe814dc6a5cb6c02730e.txt
> > >
> > > The effect of syncfs was to make I/O a lot more "chunky" - first everything
> > > was read, then everything was written (dstat output, btw., this is 1 second
> > > intervals as always, but I never mentioned it - sorry):
> > >
> > > http://ue.tst.eu/9a552a4f41a4863133d3eceb90f1ec87.txt
> > >
> > > Without it, read and write happen "at the same time" (when sampled with 1
> > > second intervals).
> > >
> > > This increased the average throughput considerably, from around 45MB
> > > read+write/s to 66MB/s. Whether this actually increased the GC process at
> > > all I don't know, because syncfs of course forces a sync on the fs, with
> > > its own overhead.
> > >
> > > So while this is a rather heavy-handed approach, the major result was that
> > > the amount of dirty pages is notably reduced (it never reaches 1GB), and
> > > the box is much more usable during this time.
> > >
> > > Right now, after about 9 hours, I am at "Dirty: 44k", and will start
> > > writing to the device soon.
> > >
> > > In any case, it seems f2fs seems to hold up quite nicely near disk full
> > > conditions, and does recover nicely as well.
> > >
> > > --
> > >                 The choice of a       Deliantra, the free code+content MORPG
> > >       -----==-     _GNU_              http://www.deliantra.net
> > >       ----==-- _       generation
> > >       ---==---(_)__  __ ____  __      Marc Lehmann
> > >       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
> > >       -=====/_/_//_/\_,_/ /_/\_\
> > >
> > > ------------------------------------------------------------------------------
> > > _______________________________________________
> > > Linux-f2fs-devel mailing list
> > > Linux-f2fs-devel@lists.sourceforge.net
> > > https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> >
> > ------------------------------------------------------------------------------
> > _______________________________________________
> > Linux-f2fs-devel mailing list
> > Linux-f2fs-devel@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


* Re: more gc / gc script refinements
  2015-10-06 16:41       ` Chao Yu
@ 2015-10-06 23:44         ` Jaegeuk Kim
  2015-10-07 12:32           ` Chao Yu
  0 siblings, 1 reply; 7+ messages in thread
From: Jaegeuk Kim @ 2015-10-06 23:44 UTC (permalink / raw)
  To: Chao Yu; +Cc: 'Marc Lehmann', linux-f2fs-devel

On Wed, Oct 07, 2015 at 12:41:45AM +0800, Chao Yu wrote:
> Hi Jaegeuk,
> 
> > -----Original Message-----
> > From: Jaegeuk Kim [mailto:jaegeuk@kernel.org]
> > Sent: Tuesday, October 06, 2015 7:17 AM
> > To: Chao Yu
> > Cc: 'Marc Lehmann'; linux-f2fs-devel@lists.sourceforge.net
> > Subject: Re: [f2fs-dev] more gc / gc script refinements
> > 
> > Thanks Chao,
> > 
> > On Mon, Oct 05, 2015 at 11:02:42PM +0800, Chao Yu wrote:
> > > > -----Original Message-----
> > > > From: Marc Lehmann [mailto:schmorp@schmorp.de]
> > > > Sent: Monday, October 05, 2015 3:26 PM
> > > > To: linux-f2fs-devel@lists.sourceforge.net; linux-f2fs-devel@lists.sourceforge.net
> > > > Subject: Re: [f2fs-dev] more gc / gc script refinements
> > > >
> > > > After I successfully filled the disk, it was time to see hope f2fs recovers
> > > > from this bad situation, by deleting a lot of files and filling it again.
> > > >
> > > > To ease the load on the gc, but still present a bit of a challenge, I
> > > > deleted the first 12000 files out of every 80000 files (directory order), in the hope
> > > > that this carves out comparatively big chunks.
> > > >
> > > > I started with "Dirty: 30k" and "Free: 45k" and ended up with
> > > > "Dirty: 216k" and "Free: 968k", which to me seems to indicate it kind of
> > > > worked, although I am not suire how contiguous this free space really is
> > > > (I oriignally hoped this would be in the form of mostly free sections).
> > > >
> > > > Then I worked on my GC script. Since the box became mostly unusable by just
> > > > calling the GC, I first tried this refinement:
> > > >
> > > > http://ue.tst.eu/38809274b56fe9b161492f09b5411071.txt
> > > >
> > > > (Used like "script </mountpoint" btw.)
> > > >
> > > > Or in other words, only call the GC if there is less than 2GB of dirty
> > > > pages. For lower values than 2GB the GC often didn't run at all for 10-20
> > > > seconds.
> > > >
> > > > This helped a lot, but the box was still noticably sluggish, and I
> > > > realised why the current GC I/O implementation is wrong - the purpose of
> > > > the cache is (among other uses, such as being the normal way to do I/O) to
> > > > cache recently-requested data in the hope that it will be reused.
> > > >
> > > > However, in the case of the GC, unless the data was in the cache before,
> > > > chances that this data is required later are just as low as for the rest of
> > > > the device, and in general, much lower then the data that was in the cache
> > > > before f2fs evicted it.
> > > >
> > > > Moreso, a lot of stress is put on the page cache because of the f2fs gc
> > > > treating it as normal data and leaving it in the cache and up to the
> > > > kernel to write out the pages.
> > >
> > > IMO, the reason of the behavior is a) keeping gced pages in cache as we look
> > > forward further hits; b) as we know, kworker will flush all pages belong to
> > > one inode together, we expect that inode's pages cached from multiple background
> > > gc can be merged, and then be flushed with continuous block address, which
> > > can improve the read performance afterward.
> > 
> > Agreed to this.
> > 
> > > >
> > > > What the GC should do is minimize the impact of the GC on the rest of the
> > > > system, by immediately flushing the data out and expiring the pages.
> > >
> > > I think f2fs should support us more flexible method of triggering gc, mostly
> > > like supporting sync/async gc ioctl command.
> > > 1) synchronous gc: all gced pages should be persistent in device after ioctl
> > > returns successfully.
> > > 2) asynchronous gc: we don't guarantee all gced pages will be persistent
> > > after ioctl returns successfully.
> > 
> > Yeah, agreed.
> > 
> > I wrote some other patches on top of Chao's patches.
> > We can do "background_gc=sync" and "/sys/fs/f2fs/dev/cp_interval".
> 
> Nice! The usage is more easy and convenient than the way of ioctl.
> I think that can really be helpful for Marc.
> 
> > Currently, I'm not quite convincing that we need to flush use data periodically.
> 
> IMHO, I prefer to support data flush functionality in f2fs, the reason is that:
> a) In order to keep the consistency and integrity of user data, kworker flush
> and periodical checkpoint are hard to be controlled and cooperated because
> kworker and checkpoint can't be aware of each other, on the contrary, configurable
> inner periodical data flush + checkpoint can supply more flexible way to keep
> persistent of user data (e.g. config period time as n second, so at least we will
> lost the user data in recent n second).
> b) User can choose more options in log level:
>  - periodical data flush + checkpoint supply us with better integrity but worse
> performance since in the period we can merge less dirty data for writebacking.
>  - kworker flush + periodical checkpoint supply us with better performance as
> dirty pages can merged well, but worse integrity of user data as data didn't
> flushed before cp will be unrecoverable forever.
> With this functionality, f2fs will become more configurable, and it should be
> suitable for more workload (especially in strong demand on user data integrity).
> 
> So I hope we can add this feature in our pending list. How do you think?

I guess you can understand my concern, which is about having a kind of
redundant flusher in both the VFS and f2fs. We already have several sysfs
entries to control system-wide dirty pages and IO throttling strategies
through the VFS.
I agree that, if we can control our flush timings, we can do something more
for better IO behavior. However, if users want more stable data, they should
first take a look at the system-wide configuration, not at something special
to f2fs independently.

Of course, I have no objection to adding this feature to our pending list.
Oh, it would be good to make a wiki page, linked from [1], to describe
pending items too.

[1] https://en.wikipedia.org/wiki/F2FS

Thanks,

> 
> Thanks,
> 
> > At least, I expect cp_interval could enhance user experiences quite well.
> > 
> > >
> > > In your scenario, I think gc flow can easily be controlled with:
> > >
> > > while (n) {
> > > 	ioctl gc with sync mode
> > > }
> > > syncfs or ioctl write_checkpoint
> > >
> > > I wrote and sent the patches for supporting synchronous gc and supporting
> > > triggering checkpoint by ioctl, I hope that can be helpful once we get
> > > Jaegeuk's Ack.
> > 
> > Thanks for the patches, :)
> 
> Thank you for the quick response! :)
> 
> Thanks,
> 
> > 
> > >
> > > Thanks,
> > >
> > > >
> > > > To improve the situaiton somewhat I decided to experiment with fdatasync
> > > > on the block device and/or a directory handle, but ended up calling syncfs
> > > > on the f2fs fs after every gc call, because fdatasync etc. seemed to be
> > > > the equivalent of syncfs anyway:
> > > >
> > > > http://ue.tst.eu/325b6ba70b1abe814dc6a5cb6c02730e.txt
> > > >
> > > > The effect of syncfs was to make I/O a lot more "chunky" - first everything
> > > > was read, then everything was written (dstat output, btw., this is 1 second
> > > > intervals as always, but I never mentioned it - sorry):
> > > >
> > > > http://ue.tst.eu/9a552a4f41a4863133d3eceb90f1ec87.txt
> > > >
> > > > Without it, read and write happen "at the same time" (when sampled with 1
> > > > second intervals).
> > > >
> > > > This increased the average throughput considerably, from around 45MB
> > > > read+write/s to 66MB/s. Whether this actually increased the GC process at
> > > > all I don't know, because syncfs of course forces a sync on the fs, with
> > > > its own overhead.
> > > >
> > > > So while this is a rather heavy-handed approach, the major result was that
> > > > the amount of dirty pages is notably reduced (it never reaches 1GB), and
> > > > the box is much more usable during this time.
> > > >
> > > > Right now, after about 9 hours, I am at "Dirty: 44k", and will start
> > > > writing to the device soon.
> > > >
> > > > In any case, it seems f2fs seems to hold up quite nicely near disk full
> > > > conditions, and does recover nicely as well.
> > > >
> > > > --
> > > >                 The choice of a       Deliantra, the free code+content MORPG
> > > >       -----==-     _GNU_              http://www.deliantra.net
> > > >       ----==-- _       generation
> > > >       ---==---(_)__  __ ____  __      Marc Lehmann
> > > >       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
> > > >       -=====/_/_//_/\_,_/ /_/\_\
> > > >
> > > > ------------------------------------------------------------------------------
> > > > _______________________________________________
> > > > Linux-f2fs-devel mailing list
> > > > Linux-f2fs-devel@lists.sourceforge.net
> > > > https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> > >
> > > ------------------------------------------------------------------------------
> > > _______________________________________________
> > > Linux-f2fs-devel mailing list
> > > Linux-f2fs-devel@lists.sourceforge.net
> > > https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


* Re: more gc / gc script refinements
  2015-10-06 23:44         ` Jaegeuk Kim
@ 2015-10-07 12:32           ` Chao Yu
  0 siblings, 0 replies; 7+ messages in thread
From: Chao Yu @ 2015-10-07 12:32 UTC (permalink / raw)
  To: 'Jaegeuk Kim'; +Cc: 'Marc Lehmann', linux-f2fs-devel

> -----Original Message-----
> From: Jaegeuk Kim [mailto:jaegeuk@kernel.org]
> Sent: Wednesday, October 07, 2015 7:45 AM
> To: Chao Yu
> Cc: 'Marc Lehmann'; linux-f2fs-devel@lists.sourceforge.net
> Subject: Re: [f2fs-dev] more gc / gc script refinements
> 
> On Wed, Oct 07, 2015 at 12:41:45AM +0800, Chao Yu wrote:
> > Hi Jaegeuk,
> >
> > > -----Original Message-----
> > > From: Jaegeuk Kim [mailto:jaegeuk@kernel.org]
> > > Sent: Tuesday, October 06, 2015 7:17 AM
> > > To: Chao Yu
> > > Cc: 'Marc Lehmann'; linux-f2fs-devel@lists.sourceforge.net
> > > Subject: Re: [f2fs-dev] more gc / gc script refinements
> > >
> > > Thanks Chao,
> > >
> > > On Mon, Oct 05, 2015 at 11:02:42PM +0800, Chao Yu wrote:
> > > > > -----Original Message-----
> > > > > From: Marc Lehmann [mailto:schmorp@schmorp.de]
> > > > > Sent: Monday, October 05, 2015 3:26 PM
> > > > > To: linux-f2fs-devel@lists.sourceforge.net; linux-f2fs-devel@lists.sourceforge.net
> > > > > Subject: Re: [f2fs-dev] more gc / gc script refinements
> > > > >
> > > > > After I successfully filled the disk, it was time to see hope f2fs recovers
> > > > > from this bad situation, by deleting a lot of files and filling it again.
> > > > >
> > > > > To ease the load on the gc, but still present a bit of a challenge, I
> > > > > deleted the first 12000 files out of every 80000 files (directory order), in the hope
> > > > > that this carves out comparatively big chunks.
> > > > >
> > > > > I started with "Dirty: 30k" and "Free: 45k" and ended up with
> > > > > "Dirty: 216k" and "Free: 968k", which to me seems to indicate it kind of
> > > > > worked, although I am not suire how contiguous this free space really is
> > > > > (I oriignally hoped this would be in the form of mostly free sections).
> > > > >
> > > > > Then I worked on my GC script. Since the box became mostly unusable by just
> > > > > calling the GC, I first tried this refinement:
> > > > >
> > > > > http://ue.tst.eu/38809274b56fe9b161492f09b5411071.txt
> > > > >
> > > > > (Used like "script </mountpoint" btw.)
> > > > >
> > > > > Or in other words, only call the GC if there is less than 2GB of dirty
> > > > > pages. For lower values than 2GB the GC often didn't run at all for 10-20
> > > > > seconds.
> > > > >
> > > > > This helped a lot, but the box was still noticably sluggish, and I
> > > > > realised why the current GC I/O implementation is wrong - the purpose of
> > > > > the cache is (among other uses, such as being the normal way to do I/O) to
> > > > > cache recently-requested data in the hope that it will be reused.
> > > > >
> > > > > However, in the case of the GC, unless the data was in the cache before,
> > > > > chances that this data is required later are just as low as for the rest of
> > > > > the device, and in general, much lower then the data that was in the cache
> > > > > before f2fs evicted it.
> > > > >
> > > > > Moreso, a lot of stress is put on the page cache because of the f2fs gc
> > > > > treating it as normal data and leaving it in the cache and up to the
> > > > > kernel to write out the pages.
> > > >
> > > > IMO, the reason of the behavior is a) keeping gced pages in cache as we look
> > > > forward further hits; b) as we know, kworker will flush all pages belong to
> > > > one inode together, we expect that inode's pages cached from multiple background
> > > > gc can be merged, and then be flushed with continuous block address, which
> > > > can improve the read performance afterward.
> > >
> > > Agreed to this.
> > >
> > > > >
> > > > > What the GC should do is minimize the impact of the GC on the rest of the
> > > > > system, by immediately flushing the data out and expiring the pages.
> > > >
> > > > I think f2fs should support us more flexible method of triggering gc, mostly
> > > > like supporting sync/async gc ioctl command.
> > > > 1) synchronous gc: all gced pages should be persistent in device after ioctl
> > > > returns successfully.
> > > > 2) asynchronous gc: we don't guarantee all gced pages will be persistent
> > > > after ioctl returns successfully.
> > >
> > > Yeah, agreed.
> > >
> > > I wrote some other patches on top of Chao's patches.
> > > We can do "background_gc=sync" and "/sys/fs/f2fs/dev/cp_interval".
> >
> > Nice! The usage is more easy and convenient than the way of ioctl.
> > I think that can really be helpful for Marc.
> >
> > > Currently, I'm not quite convincing that we need to flush use data periodically.
> >
> > IMHO, I prefer to support data flush functionality in f2fs, the reason is that:
> > a) In order to keep the consistency and integrity of user data, kworker flush
> > and periodical checkpoint are hard to be controlled and cooperated because
> > kworker and checkpoint can't be aware of each other, on the contrary, configurable
> > inner periodical data flush + checkpoint can supply more flexible way to keep
> > persistent of user data (e.g. config period time as n second, so at least we will
> > lost the user data in recent n second).
> > b) User can choose more options in log level:
> >  - periodical data flush + checkpoint supply us with better integrity but worse
> > performance since in the period we can merge less dirty data for writebacking.
> >  - kworker flush + periodical checkpoint supply us with better performance as
> > dirty pages can merged well, but worse integrity of user data as data didn't
> > flushed before cp will be unrecoverable forever.
> > With this functionality, f2fs will become more configurable, and it should be
> > suitable for more workload (especially in strong demand on user data integrity).
> >
> > So I hope we can add this feature in our pending list. How do you think?
> 
> I guess you can understand my concern which is a kind of redundunt flushers
> between VFS and f2fs. We already have several sysfs entries to control
> system-wide dirty pages and throttling IO strategies through VFS.

I can understand that.

> I agreed that, if we can control our flush timings, we can do something more
> for better IO behaviors. However, if users want more stable data, they need to
> first take a look at system-wide configurations, not something special in f2fs
> independently.

Yeah, but as we know, those are global configurations; each setting we make
at the VM level can impact other filesystems, so this approach has its
limits.

In brief, what I think is that it's not a bad thing to arm f2fs with
different weapons, like other filesystems (e.g. ext4): once we can't destroy
the enemy with an AK47, we can try the M16. :)

> 
> Of course, I have no objection to add this feature in our pending list.
> Oh, it would be good to make a wiki linked with [1] to describe pending
> items too.

Ah, thanks for reminding me of that! :) I have added this feature to the
planned list at link [1].

Thanks,

> 
> [1] https://en.wikipedia.org/wiki/F2FS
> 
> Thanks,
> 
> >
> > Thanks,
> >
> > > At least, I expect cp_interval could enhance user experiences quite well.
> > >
> > > >
> > > > In your scenario, I think gc flow can easily be controlled with:
> > > >
> > > > while (n) {
> > > > 	ioctl gc with sync mode
> > > > }
> > > > syncfs or ioctl write_checkpoint
> > > >
> > > > I wrote and sent the patches for supporting synchronous gc and supporting
> > > > triggering checkpoint by ioctl, I hope that can be helpful once we get
> > > > Jaegeuk's Ack.
> > >
> > > Thanks for the patches, :)
> >
> > Thank you for the quick response! :)
> >
> > Thanks,
> >
> > >
> > > >
> > > > Thanks,
> > > >
> > > > >
> > > > > To improve the situaiton somewhat I decided to experiment with fdatasync
> > > > > on the block device and/or a directory handle, but ended up calling syncfs
> > > > > on the f2fs fs after every gc call, because fdatasync etc. seemed to be
> > > > > the equivalent of syncfs anyway:
> > > > >
> > > > > http://ue.tst.eu/325b6ba70b1abe814dc6a5cb6c02730e.txt
> > > > >
> > > > > The effect of syncfs was to make I/O a lot more "chunky" - first everything
> > > > > was read, then everything was written (dstat output, btw., this is 1 second
> > > > > intervals as always, but I never mentioned it - sorry):
> > > > >
> > > > > http://ue.tst.eu/9a552a4f41a4863133d3eceb90f1ec87.txt
> > > > >
> > > > > Without it, read and write happen "at the same time" (when sampled with 1
> > > > > second intervals).
> > > > >
> > > > > This increased the average throughput considerably, from around 45MB
> > > > > read+write/s to 66MB/s. Whether this actually increased the GC process at
> > > > > all I don't know, because syncfs of course forces a sync on the fs, with
> > > > > its own overhead.
> > > > >
> > > > > So while this is a rather heavy-handed approach, the major result was that
> > > > > the amount of dirty pages is notably reduced (it never reaches 1GB), and
> > > > > the box is much more usable during this time.
> > > > >
> > > > > Right now, after about 9 hours, I am at "Dirty: 44k", and will start
> > > > > writing to the device soon.
> > > > >
> > > > > In any case, it seems f2fs seems to hold up quite nicely near disk full
> > > > > conditions, and does recover nicely as well.
> > > > >
> > > > > --
> > > > >                 The choice of a       Deliantra, the free code+content MORPG
> > > > >       -----==-     _GNU_              http://www.deliantra.net
> > > > >       ----==-- _       generation
> > > > >       ---==---(_)__  __ ____  __      Marc Lehmann
> > > > >       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
> > > > >       -=====/_/_//_/\_,_/ /_/\_\
> > > > >
> > > > > ------------------------------------------------------------------------------
> > > > > _______________________________________________
> > > > > Linux-f2fs-devel mailing list
> > > > > Linux-f2fs-devel@lists.sourceforge.net
> > > > > https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> > > >
> > > > ------------------------------------------------------------------------------
> > > > _______________________________________________
> > > > Linux-f2fs-devel mailing list
> > > > Linux-f2fs-devel@lists.sourceforge.net
> > > > https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Thread overview (7 messages):
2015-10-04 12:49 more gc experiences Marc Lehmann
2015-10-05  7:25 ` more gc / gc script refinements Marc Lehmann
2015-10-05 15:02   ` Chao Yu
2015-10-05 23:16     ` Jaegeuk Kim
2015-10-06 16:41       ` Chao Yu
2015-10-06 23:44         ` Jaegeuk Kim
2015-10-07 12:32           ` Chao Yu
