* general stability of f2fs?
@ 2015-08-08 20:50 Marc Lehmann
  2015-08-10 20:31 ` Jaegeuk Kim
  0 siblings, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-08-08 20:50 UTC (permalink / raw)
  To: linux-f2fs-devel

Hi!

I did some more experiments, and wonder about the general stability of f2fs.
I have not managed to keep an f2fs filesystem working for longer than a
few days.

For example, a few days ago I created an 8TB volume and copied 2TB of data to
it, which worked until I hit the (very low...) 32k limit on the number of
subdirectories.

I moved some directories into a single subdirectory, and continued.
Everything seemed fine.

Today I ran fsck.f2fs on the fs, which found 4 inodes with wrong link counts
(generally higher than fsck counted). It asked me whether to fix this, which
I did.

I then did another fsck run, and was greeted with tens of thousands of
errors:

http://ue.tst.eu/f692bac9abbe4e910787adee18ec52be.txt

Mounting made the box unusable for multiple minutes, probably due to the
sheer number of backtraces:

http://ue.tst.eu/6243cc344a943d95a20907ecbc37061f.txt

The data is toast (which is fine, I am still only experimenting), but this,
the weird write behaviour, and the fact that you don't get signalled on ENOSPC
make me wonder what the general status of f2fs is.

It *seems* to have been in actual use for a number of years now, and I would
expect small hiccups and problems, so backups would be advised, but this level
of brokenness (I only tested Linux 3.18.14 and 4.1.4) is not something I would
expect from a fs that has been in development for so long.

So I wonder what the general stability expectation for f2fs is - is it just
meant to be an experimental fs not used for any data, or am I just unlucky
and hit so many disastrous bugs by chance?

(It's really too bad, as it's the only fs in Linux that has stable write
performance on SMR drives at this time.)

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: general stability of f2fs?
  2015-08-08 20:50 general stability of f2fs? Marc Lehmann
@ 2015-08-10 20:31 ` Jaegeuk Kim
  2015-08-10 20:53   ` Marc Lehmann
  2015-09-20 23:59   ` finally testing with SMR drives Marc Lehmann
  0 siblings, 2 replies; 74+ messages in thread
From: Jaegeuk Kim @ 2015-08-10 20:31 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

Hi Marc,

I'm very interested in trying f2fs on SMR drives too.
I also think that several characteristics of SMR drives are very similar to
those of flash drives.

So far, f2fs has performed well on embedded systems like smartphones.
For server environments, however, I haven't been able to test f2fs very
intensively. The major uncovered code areas would be:
- over 4TB storage space case
- inline_dentry mount option; I'm still working on extent_cache for v4.3 too
- various sizes of section and zone
- tmpfile and rename2 interfaces

In your logs, I suspect some fsck.f2fs bugs in the large-storage case.
In order to confirm that, could you use the latest f2fs-tools from:
 http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs-tools.git
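
Roughly, assuming the usual autotools build of f2fs-tools (the clone URL is
just the cgit URL above in clone form):

   git clone https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git
   cd f2fs-tools
   ./autogen.sh && ./configure && make
   make install   # or run mkfs.f2fs/fsck.f2fs straight from the build tree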

And, if possible, could you share your experience when you don't fill up the
partition to 100%? If there is no problem there, we can nicely focus on ENOSPC only.

Thanks,


------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: general stability of f2fs?
  2015-08-10 20:31 ` Jaegeuk Kim
@ 2015-08-10 20:53   ` Marc Lehmann
  2015-08-10 21:58     ` Jaegeuk Kim
  2015-09-20 23:59   ` finally testing with SMR drives Marc Lehmann
  1 sibling, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-08-10 20:53 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Mon, Aug 10, 2015 at 01:31:06PM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> I'm very interested in trying f2fs on SMR drives too.
> I also think that several characteristics of SMR drives are very similar to
> those of flash drives.

Indeed, but of course there isn't an exact match for any characteristic.
Also, in the end, drive-managed SMR drives will suck somewhat with any
filesystem (note that nilfs performs very badly, even though it should be
better than anything else until the drive is completely full).

Now, looking at the characteristics of f2fs, it could be a good match for
any rotational media, too, since it writes linearly and can defragment. At
least for desktop or similar loads (where files usually aren't randomly
written, but mostly replaced and rarely appended).

The only crucial ability it would need to have is to be able to free large
chunks for rewriting, which should be in f2fs as well.

So at this time, what I apparently need is mkfs.f2fs -s128 instead of -s7.
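
For reference, assuming the default 2MB f2fs segment size, -s is the number of
segments per section, so:

   # -s7   ->   7 segments x 2MB =  14MB sections
   # -s128 -> 128 segments x 2MB = 256MB sections
   mkfs.f2fs -s128 /dev/sdX       # /dev/sdX is only a placeholder here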

Unfortunately, I probably can't run these tests immediately, and they do
take some days to run, but hopefully I can repeat my experiments next week.

> - over 4TB storage space case

fsck limits could well have been the issue for my first big filesystem,
but not the second (which was only 128G in size to be able to utilize it
within a reasonable time).

> - inline_dentry mount option; I'm still working on extent_cache for v4.3 too

I only enabled mount options other than noatime for the 128G filesystem,
so they might well have caused the trouble with it.

Another thing that will seriously hamper adoption for these drives is the
32000 limit on hardlinks - I am hard-pressed to find any large file tree
here that doesn't have places with 40000 subdirs somewhere, but I guess
on a 32GB phone flash storage, this was less of a concern.

In any case, if f2fs turns out to be workable, it will become the fs of
choice for me for my archival uses, and maybe even more, and I then have
to somehow cope with that limit.

> In your logs, I suspect some fsck.f2fs bugs in a large storage case.
> In order to confirm that, could you use the latest f2fs-tools from:
>  http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs-tools.git

Will do so.

Is there a repository for out-of-tree module builds of f2fs? It seems
kernels 3.17.x to 4.1 (at least) have a kernel bug making reads from these SMR
drives unstable (https://bugzilla.kernel.org/show_bug.cgi?id=93581), so I
will have to test with a relatively old kernel or play too many tricks.

And I suspect from glancing over patches (and mount options) that there
have been quite a few improvements in f2fs since the 3.16 days.

> And, if possible, could you share some experiences when you didn't fill up the
> partition to 100%? If there is no problem, we can nicely focus on ENOSPC only.

My experience was that f2fs wrote at nearly the maximum I/O speed of the drives.
In fact, I couldn't saturate the bandwidth except when writing small files,
because the 8-drive source raid using xfs was not able to read files quickly
enough.

After writing an initial tree of >2TB, directory reading and mass stat seemed
to be considerably slower. I don't know if that is something that
balancing can fix (or improve), but I am not overly concerned about that,
as the difference to e.g. xfs is not that big (roughly a factor of two),
and these operations are too slow for me on any device anyway, so I usually put
a dm-cache in front of such storage devices.

I don't think that I have more useful data to report - if I used 14MB
sections, performance would predictably suck, so the real test is still
outstanding. Stay tuned, and thanks for your reply!

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: general stability of f2fs?
  2015-08-10 20:53   ` Marc Lehmann
@ 2015-08-10 21:58     ` Jaegeuk Kim
  2015-08-13  0:26       ` Marc Lehmann
  0 siblings, 1 reply; 74+ messages in thread
From: Jaegeuk Kim @ 2015-08-10 21:58 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Mon, Aug 10, 2015 at 10:53:32PM +0200, Marc Lehmann wrote:
> On Mon, Aug 10, 2015 at 01:31:06PM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > I'm very interested in trying f2fs on SMR drives too.
> > I also think that several characteristics of SMR drives are very similar with 
> > flash drives.
> 
> Indeed, but of course there isn't an exact match for any characteristic.
> Also, in the end, drive-managed SMR drives will suck somewhat with any
> > filesystem (note that nilfs performs very badly, even though it should be
> better than anything else till the drive is completely full).

IMO, it's similar to flash drives too. Indeed, I believe host-managed SMR/flash
drives are likely to show much better performance than drive-managed ones.
However, I think there are many HW constraints inside the storage that make it
hard to move forward to them easily.

> Now, looking at the characteristics of f2fs, it could be a good match for
> any rotational media, too, since it writes linearly and can defragment. At
> least for desktop or similar loads (where files usually aren't randomly
> written, but mostly replaced and rarely appended).

Possible, but not much different from other filesystems. :)

> The only crucial ability it would need to have is to be able to free large
> chunks for rewriting, which should be in f2fs as well.
> 
> So at this time, what I apparently need is mkfs.f2fs -s128 instead of -s7.

I wrote a patch to fix the documentation. Sorry about that.

> Unfortunately, I probably can't make these tests immediately, and they do
> take some days to run, but hopefully I can repeat my experiments next week.
> 
> > - over 4TB storage space case
> 
> fsck limits could well have been the issue for my first big filesystem,
> but not the second (which was only 128G in size to be able to utilize it
> within a reasonable time).
> 
> > - inline_dentry mount option; I'm still working on extent_cache for v4.3 too
> 
> I only enabled mount options other than noatime for the 128G filesystem,
> so it might well have caused the trouble with it.

Okay, so I think it'd be good to start with:
 - noatime,inline_xattr,inline_data,flush_merge,extent_cache.

And you can control defragmentation through
 /sys/fs/f2fs/[DEV]/gc_[min|max|no]_sleep_time
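
For example (with /dev/sdX and the sysfs directory name as placeholders; the
sleep times are in milliseconds, and smaller values make the background GC run
more often):

   mount -t f2fs -o noatime,inline_xattr,inline_data,flush_merge,extent_cache /dev/sdX /mnt
   echo 100 > /sys/fs/f2fs/sdX/gc_min_sleep_time
   echo 500 > /sys/fs/f2fs/sdX/gc_max_sleep_time
   echo 800 > /sys/fs/f2fs/sdX/gc_no_gc_sleep_time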

> Another thing that will seriously hamper adoption of these drives is the
> 32000 limit on hardlinks - I am hard pressed to find any large file tree
> here that doesn't have places with 40000 subdirs somewhere, but I guess
> on a 32GB phone flash storage, this was less of a concern.

At a glance, it'll be no problem to increase it to 64k.
Let me check again.

> In any case, if f2fs turns out to be workable, it will become the fs of
> choice for me for my archival uses, and maybe even more, and I then have
> to somehow cope with that limit.
> 
> > In your logs, I suspect some fsck.f2fs bugs in a large storage case.
> > In order to confirm that, could you use the latest f2fs-tools from:
> >  http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs-tools.git
> 
> Will do so.
> 
> Is there a repository for out-of-tree module builds for f2fs? It seems
> kernels 3.17.x to 4.1 (at least) have a kernel bug making reads to these SMR
> drives unstable (https://bugzilla.kernel.org/show_bug.cgi?id=93581), so I
> will have to test with a relatively old kernel or play too many tricks.

What kernel version do you prefer? I've been maintaining f2fs for v3.10 mainly.

http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs.git/log/?h=linux-3.10

Thanks,

> And I suspect from glancing over patches (And mount options) that there
> have been quite some improvements in f2fs since 3.16 days.
> 
> > And, if possible, could you share some experiences when you didn't fill up the
> > partition to 100%? If there is no problem, we can nicely focus on ENOSPC only.
> 
> My experience was that f2fs wrote at nearly maximum I/O speed of the drives.
> In fact, I couldn't saturate the bandwidth except when writing small files,
> because the 8 drive source raid using xfs was not able to read files quickly
> enough. After writing an initial tree of >2TB
> 
> Directory reading and mass stat seemed to be considerably slower and take
> more time directly afterwards. I don't know if that is something that
> balancing can fix (or improve), but I am not overly concerned about that,
> as the difference to e.g. xfs is not that big (roughly a factor of two),
> and these operations are too slow for me on any device, so I usually put a
> dm-cache in front of such storage devices.
> 
> I don't think that I have more useful data to report - if I used 14MB
> sections, performance would predictably suck, so the real test is still
> outstanding. Stay tuned, and thanks for your reply!
> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: general stability of f2fs?
  2015-08-10 21:58     ` Jaegeuk Kim
@ 2015-08-13  0:26       ` Marc Lehmann
  2015-08-14 23:07         ` Jaegeuk Kim
  0 siblings, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-08-13  0:26 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Mon, Aug 10, 2015 at 02:58:06PM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> IMO, it's similar to flash drives too. Indeed, I believe host-managed SMR/flash
> drives are likely to show much better performance than drive-managed ones.

If I had one, its performance would be abysmal, as filesystem (and
indeed, driver) support for that is still far away... :)

> However, I think there are many HW constraints inside the storage not to move
> forward to it easily.

Exactly :)

> > Now, looking at the characteristics of f2fs, it could be a good match for
> > any rotational media, too, since it writes linearly and can defragment. At
> > least for desktop or similar loads (where files usually aren't randomly
> > written, but mostly replaced and rarely appended).
> 
> Possible, but not much different from other filesystems. :)

Hmm, I would strongly disagree - most other filesystems cannot defragment
effectively. For example, xfs_fsr is unstable under load and only
defragments files, but greatly increases external fragmentation over time.
Similarly for e4defrag. Most other filesystems do not even have a way to
defragment.

Files that have been defragmented never move again on other filesystems. This
can be true for f2fs as well, but as far as I can see, if formatted with e.g.
-s128, the external fragments will be 256MB in size, which is far more
acceptable than the millions of 4-100kB fragments on some of my xfs
filesystems.

If I didn't copy my filesystems every 1.5 years or so, they would be
horribly degraded. It's very common to read directories with many medium
to small files at 10-20MB/s on an old xfs filesystem, but at 80MB/s on a
new one with exactly the same contents.

I don't think f2fs will intelligently defragment and re-layout directories
anytime soon, either, but at least internal and external fragmentation
are being managed.

> Okay, so I think it'd be good to start with:
>  - noatime,inline_xattr,inline_data,flush_merge,extent_cache.

I still haven't found the right kernel for my main server, but I did some
preliminary experiments today, with 3.19.8-ckt5 (an ubuntu kernel).

After formatting a 128G partition with "mkfs.f2fs -o1 -s128 -t0", I got
this after mounting (the kernel complained about extent_cache missing in my
kernel version):

   Filesystem                Size  Used Avail Use% Mounted on
   /dev/mapper/vg_test-test  128G   53G   75G  42% /mnt

which gives me another question - on an 8TB disk, 5% overprovisioning is
400GB, which sounds a bit wasteful. Even 1% (80GB) sounds a bit much,
especially as I am prepared to wait for defragmentation, if defragmentation
works well. And lastly, the 53GB already used on a freshly formatted 128GB
partition looks way too conservative.

I immediately configured the fs with these values:

   echo 500 >gc_max_sleep_time
   echo 100 >gc_min_sleep_time
   echo 800 >gc_no_gc_sleep_time
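   # (all three written in /sys/fs/f2fs/<device>/, the directory you mentioned)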

Anyway, I wrote to it until the disk was 99% utilized according to
/sys/kernel/debug/f2fs/status, at which point write speed crawled down to 1-2MB/s.

I deleted some "random" files till utilisation was at 38%, then waited
until there was no disk I/O (disk went into standby, which indicates that
it has flushed its internal transaction log as well).

When I then tried to write a file, the writer (rsync) stopped after ~4kb, and
the filesystem started reading at <2MB/s and writing at <2MB/s for a few
minutes. Since I didn't intend this to be a thorough test (I was looking mainly
for a kernel that worked well with the hardware and drives), I didn't make
detailed notes, but basically, "LFS:" increased exactly with the writing
speed.

I then stopped writing, after which the fs wrote (but did not read) a bit
longer at this speed, then became idle, and the disk went into standby again.

The next day, I mounted it, and now I will take notes. Initial status was:

   http://ue.tst.eu/e2ea137a6b87fd0e43446b286a3d1b19.txt

The disk woke up and started reading and writing at <1MB/s:

   http://ue.tst.eu/a9dd48428b7b454f52590efeea636a27.txt

At some point, you can see that the disk stopped reading; that's when I
killed rsync. rsync also transfers over the net, and as you can see, it
didn't manage to transfer anything. The read I/O is probably due to rsync
reading the filetree info.

A status snapshot after killing rsync looks like this:

   http://ue.tst.eu/211fc87b0b43270e4b2ee0261d251818.txt

The disk did no other I/O afterwards and went into standby again.

I repeated the experiment a few minutes later with similar
results, with these differences:

1. There was absolutely no read I/O (maybe all inodes were still in the
   cache, but that would be surprising as rsync probably didn't read all
   of them in the previous run).

2. The disk didn't stay idle this time, but instead kept steadily writing
   at ~1MB/s.

Status output at the end:

   http://ue.tst.eu/cbb4774b2f8e44ae68e635be5a414d1d.txt

Status output a bit later, disk still writing:

   http://ue.tst.eu/9fbdfe1e9051a65c1417bea7192ea182.txt

Much later, disk idle:

   http://ue.tst.eu/78a1614d867bfbfa115485e5fcf1a1a8.txt

At this point, my main problem is that I have no clue what is causing the
slow writes. Obviously the garbage collector doesn't think anything needs
to be done, so it shouldn't be IPU writes either, and even if it is,
I don't know what the ipu_policy values mean.

I tried the same with ipu_policy=8 and min_ipu_util=100, and separately
also with gc_idle=1, with seemingly no difference.

Here is what I expect should happen:

When I write to a new disk, or append to a still-free-enough disk, writing
happens linearly (by which I mean appending to multiple of its logs
linearly, which is not optimal, but should be fine). This clearly happens,
and near perfectly so.

When the disk is near full, bad things might happen; there might be delays
while some small areas are being garbage collected.

When I delete files, the disk should start garbage collecting at around
50MB/s read + 50MB/s write. If combined with writing, I should be able to
write at roughly 30MB/s while the garbage collector is cleaning up.

I would expect the gc to do its work by selecting a 256MB section, reading
everything it needs to, writing this data linearly to some log, possibly
followed by some random updates and a flush or some such, and thus achieving
about 50MB/s cleaning throughput. This clearly doesn't seem to happen,
possibly because the gc thinks nothing needs to be done.

I would expect the gc to do its work when the disk is idle, at least if
needed, so that after coming back after a while, I can write at nearly full
speed again. This also doesn't happen - maybe the gc runs, but writing to
the disk is impossible even after it has quieted down.

> > Another thing that will seriously hamper adoption of these drives is the
> > 32000 limit on hardlinks - I am hard pressed to find any large file tree
> > here that doesn't have places with 40000 subdirs somewhere, but I guess
> > on a 32GB phone flash storage, this was less of a concern.
> 
> At a glance, it'll be no problem to increase it to 64k.
> Let me check again.

I thought more like 2**31 or so links, but it so happens that all my
testcases (by pure chance) have between 57k and 64k links, so thanks a
lot for that.

If you are reluctant, look at other filesystems: extX thought 16 bits were
enough, btrfs thought 16 bits were enough - even reiserfs thought 16 bits
were enough. Lots of filesystems thought 16 bits were enough, but all modern
incarnations of them use 31- or 32-bit link counts these days.

It's kind of rare to have 8+TB of storage where you are fine with 2**16
subdirectories everywhere.

> What kernel version do you prefer? I've been maintaining f2fs for v3.10 mainly.
> 
> http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs.git/log/?h=linux-3.10

I have a hard time finding kernels that work with these SMR drives. So
far, only the 3.18.x and 3.19.x series work for me. The 3.17 and
3.16 kernels fail for various reasons, and the 4.1.x kernels still fail
miserably with these drives.

So, at this point, it needs to be either 3.18 or 3.19 for me. It seems
3.19 has everything but the extent_cache, which probably shouldn't make
such a big difference. Are there any big bugs in 3.18/3.19 which I would
have to look out for? Storage size isn't an issue right now, because I can
reproduce the performance characteristics just fine on a 128G partition.

I mainly asked because I thought newer kernel versions might have
important bugfixes.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: general stability of f2fs?
  2015-08-13  0:26       ` Marc Lehmann
@ 2015-08-14 23:07         ` Jaegeuk Kim
  0 siblings, 0 replies; 74+ messages in thread
From: Jaegeuk Kim @ 2015-08-14 23:07 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Thu, Aug 13, 2015 at 02:26:41AM +0200, Marc Lehmann wrote:

Okay, let me jump into the original issues.

> I still haven't found the right kernel for my main server, but I did some
> preliminary experiments today, with 3.19.8-ckt5 (an ubuntu kernel).

I backported the latest f2fs into 3.19 here.

http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs.git/log/?h=linux-3.19

You can build f2fs by linking (or copying) the following f2fs source files
into your base Ubuntu 3.19.8-ckt5 tree (a rough sketch follows the list):

- fs/f2fs/*
- include/linux/f2fs_fs.h
- include/trace/events/f2fs.h
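
A minimal sketch of that, with the two paths as placeholders (and assuming the
ubuntu tree is already configured with CONFIG_F2FS_FS):

   KSRC=~/ubuntu-3.19.8-ckt5            # base kernel source
   F2FS=~/f2fs-backport                 # checkout of the linux-3.19 branch above
   cp $F2FS/fs/f2fs/*                    $KSRC/fs/f2fs/
   cp $F2FS/include/linux/f2fs_fs.h      $KSRC/include/linux/
   cp $F2FS/include/trace/events/f2fs.h  $KSRC/include/trace/events/
   make -C $KSRC fs/f2fs/               # rebuild just the f2fs objects/module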

> After formatting a 128G partition with "mkfs.f2fs -o1 -s128 -t0", I got
> this after mounting (kernel complained about missing extent_cache in my
> kernel version):
> 
>    Filesystem                Size  Used Avail Use% Mounted on
>    /dev/mapper/vg_test-test  128G   53G   75G  42% /mnt
> 
> which gives me another question - on an 8TB disk, 5% overprovision is
> 400GB, which sounds a bit wasteful. Even 1% (80GB) sounds a bit much,
> especially as I am prepared to wait for defragmentation, if defragmentation
> works well. And lastly, the 53GB used on a 128GB partition looks way too
> conservative.

Right, so I wrote a patch to resolve this issue.

http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs-tools.git

You can find the patch there; it sets the best overprovision ratio automatically.

  mkfs.f2fs: set overprovision size more precisely

> 
> I immediately configured the fs with these values:
> 
>    echo 500 >gc_max_sleep_time
>    echo 100 >gc_min_sleep_time
>    echo 800 >gc_no_gc_sleep_time
> 
> Anyways, I wrote to it until the disk was 99% utilized according to
> /sys/kernel/debug/f2fs/status, at which write speed crawled down to 1-2MB/s.
> 
> I deleted some "random" files till utilisation was at 38%, then waited
> until there was no disk I/O (disk went into standby, which indicates that
> it has flushed its internal transaction log as well).
> 
> When I then tried to write a file, the writer (rsync) stopped after ~4kb, and
> the filesystem started reading at <2MB/s and writing at <2MB/s for a few
> minutes. Since I didn't intend this to test very well (I was looking mainly
> for a kernel that worked well with the hardware and drives), I didn't make
> detailed notes, but basically, "LFS:" increased exactly with the writing
> speed.
> 
> I then stopped writing, after which the fs wrote (but did not read) a bit
> longer at this speed, then became idle, disk went into standby again.
> 
> The next day, I mounted it, and now I will take notes. Initial status was:
> 
>    http://ue.tst.eu/e2ea137a6b87fd0e43446b286a3d1b19.txt
> 
> The disk woke up and started reading and writing at <1MB/s:
> 
>    http://ue.tst.eu/a9dd48428b7b454f52590efeea636a27.txt
> 
> At some point, you can see that the disk stopped reading, that's when I
> killed rsync. rsync also transfers over the net, and as you can see, it
> didn't manage to transfer anything. The read I/O is probably due to rsync
> reading the filetree info.
> 
> A status snapshot after killing rsync looks like this:
> 
>    http://ue.tst.eu/211fc87b0b43270e4b2ee0261d251818.txt

Here, the key clue is the number of CP calls, which increased enormously.
So I ran a test that filled the partition up with data and took a look at
what happened in the last minutes.
In my case, I saw that a lot of checkpoints were triggered by
f2fs_gc even though there was no garbage to collect.
I suspect that's the exact corner case where the performance goes down
dramatically.

In order to resolve that issue, I made a patch:
  f2fs: skip checkpoint if there is no dirty and prefree segments
Note that the backported f2fs should have this patch too.

So, as a first step, could you check this patch with your workloads?

> The disk did no other I/O afterwards and went into standby again.
> 
> I repeated the experiment a few minutes later with similar
> results, with these differences:
> 
> 1. There was absolutely no read I/O (maybe all inodes were still in the
>    cache, but that would be surprising as rsync probably didn't read all
>    of them in the previous run).
> 
> 2. The disk didn't stay idle this time, but instead kept steadily writing
> at ~1MB/s.
> 
> Status output at the end:
> 
>    http://ue.tst.eu/cbb4774b2f8e44ae68e635be5a414d1d.txt
> 
> Status output a bit later, disk still writing:
> 
>    http://ue.tst.eu/9fbdfe1e9051a65c1417bea7192ea182.txt
> 
> Much later, disk idle:
> 
>    http://ue.tst.eu/78a1614d867bfbfa115485e5fcf1a1a8.txt
> 
> At this point, my main problem is that I have no clue what is causing the
> slow writes. Obviously the garbage collector doesn't think anything needs
> to be done, it shouldn't be IPU writes either then, and even if they are,
> I don't know what the ipu_policy's mean.
> 
> I tried the same with ipu_policy=8 and min_ipu_util=100, also separately
> also gc_idle=1, with seemingly no difference.
> 
> Here is what I expect should happen:
> 
> When I write to a new disk, or append to a still-free-enough disk, writing
> happens linearly (with that I mean appending to multiple of its logs
> linearly, which is not optimal, but should be fine). This clearly happens,
> and near perfectly so.
> 
> When the disk is near-full, bad things might happen, delays might be there
> when some small areas are being garbage collected.
> 
> When I delete files, the disk should start garbage collecting at around
> 50mb/s read + 50mb/s write. If combined with writing, I should be able to
> write at roughly 30MB/s while the garbage collector is cleaning up.

At that moment, I actually suspect the garbage collector has no sections to clean
up. If you use a big section on a small partition, the deleted regions
are likely to be laid across the currently active sections. In that case,
even if there are many dirty segments, the garbage collector can't select them as
victims at all.
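
Rough numbers for the 128GB test case, assuming the default 2MB segment size:

   # section size     : 128 segments x 2MB = 256MB
   # sections in total: 128GB / 256MB      = ~512
   # so the deleted regions easily end up inside the currently active
   # sections, which cannot be selected as GC victims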

> I would expect the gc to do its work by selecting a 256MB section, reading
> everything it needs to, write this data linearly to some log possibly
> followed by some random update and a flush or somesuch, and thus achieve
> about 50MB/s cleaning throughput. This clearly doesn't seem to happen,
> possibly because the gc thinks nothing needs to be done.
> 
> I would expect the gc to do its work when the disk is idle, at least if
> need to, so after coming back after a while, I can write at nearly full
> speed again. This also doesn't happen - maybe the gc runs, but writing to
> the disk is impossible even after it has quieted down.
> 
> > > Another thing that will seriously hamper adoption of these drives is the
> > > 32000 limit on hardlinks - I am hard pressed to find any large file tree
> > here that doesn't have places with 40000 subdirs somewhere, but I guess
> > > on a 32GB phone flash storage, this was less of a concern.
> > 
> > At a glance, it'll be no problem to increase it to 64k.
> > Let me check again.
> 
> I thought more like 2**31 or so links, but it so happens that all my
> testcases (by pure chance) have between 57k and 64k links, so thanks a
> lot for that.
> 
> If you are reluctant, look at other filesystems. extX thought 16 bit is
> enough. btrfs thought 16 bit is enough - even reiserfs thought 16 bit is
> enough. Lots of filesystems thought 16 bits is enough, but all modern
> incarnations of them do 31 or 32 bit link counts these days.

Oh, yes. The f2fs_inode's link count is a 32-bit field, so it would be
good to set F2FS_LINK_MAX to 0xffffffff.

> It's kind of rare to have 8+TB of storage where you are fine with 2**16
> subdirectories everywhere.
> 
> > What kernel version do you prefer? I've been maintaining f2fs for v3.10 mainly.
> > 
> > http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs.git/log/?h=linux-3.10
> 
> I have a hard time finding kernels that work with these SMR drives. So
> far, only the 3.18.x and the 3.19.x series works for me. the 3.17 and
> 3.16 kernels fail for various reasons, and the 4.1.x kernels still fail
> miserably with these drives.
> 
> So, at this point, it needs to be either 3.18 or 3.19 for me. It seems
> 3.19 has everything but the extent_cache, which probably shouldn't make
> such a big difference. Are there any big bugs in 3.18/3.19 which I would
> have to look out for? Storage size isn't an issue right now, because I can
> reproduce the performance characteristics just fine on a 128G partition.
> 
> I mainly asked because I thought newer kernel versions might have
> important bugfixes.
> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: finally testing with SMR drives
  2015-08-10 20:31 ` Jaegeuk Kim
  2015-08-10 20:53   ` Marc Lehmann
@ 2015-09-20 23:59   ` Marc Lehmann
  2015-09-21  8:17     ` SMR drive test 1; 512GB partition; very slow + unfixable corruption Marc Lehmann
  1 sibling, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-09-20 23:59 UTC (permalink / raw)
  To: linux-f2fs-devel

On Mon, Aug 10, 2015 at 01:31:06PM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> I'm very interested in trying f2fs on SMR drives too.

Sorry that it took me so long; I am currently conducting initial tests, and
will hopefully be able to report soon, for real this time.

The kernel bug regarding SMR drives has multiplied (the gory details
are in https://bugzilla.kernel.org/show_bug.cgi?id=93581; apparently a
silent data corruption error has emerged, although I don't think I was
affected by it). In short, I will test with "stock" 3.18.20 and/or 4.2.0
(with max_sectors_kb=512). I also have the current git f2fs-tools up and
running.

I'll do the small tests with 4.2.0 and hopefully also the big ones (depending on
when I can reboot the boxes).

In the meantime, can you answer me one question?

How can I effectively disable IPU?

I currently try this:

   echo 8 >ipu_policy
   echo 100 >min_ipu_util
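   # (both written in /sys/fs/f2fs/<device>/, next to the gc_*_sleep_time knobs)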

Can you verify that this would suppress at least "most" IPU updates? If
not, is there a better way to suppress them? Thanks!

I really want to see the garbage collector freeing big chunks on its own,
and would rather wait for it to do its work than risk IPU writes, as the
latter will effectively trigger a similar garbage collect on the drive
(less efficient, but with more cache).

> - over 4TB storage space case

I currently do tests with 512GB, and will do the full device size later.

> - inline_dentry mount option; I'm still working on extent_cache for v4.3 too

While inline_dentry will be nice to have, I can live with it being
disabled, but will test it anyway. Likewise for extent_cache.

> - various sizes of section and zone
> - tmpfile, and rename2 interfaces

I wasn't even aware of the renameat2 syscall, thanks for indirectly
pointing it out to me :)

>  http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs-tools.git

Up and running, thanks for pointing the URL out to me, I overlooked it in
the manpage :/

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 1; 512GB partition; very slow + unfixable corruption
  2015-09-20 23:59   ` finally testing with SMR drives Marc Lehmann
@ 2015-09-21  8:17     ` Marc Lehmann
  2015-09-21  8:19       ` Marc Lehmann
  0 siblings, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-09-21  8:17 UTC (permalink / raw)
  To: linux-f2fs-devel

On Mon, Sep 21, 2015 at 01:59:01AM +0200, Marc Lehmann <schmorp@schmorp.de> wrote:
> Sorry that it took me so long, I am currently conducting initial tests, and
> will hopefully be able to report soon, for real this time.

Ok, here is my first test result. It's primarily concerned with GC and
near-full conditions, because that is fastest to test. The test was done
on a 4.2.0 kernel and current git f2fs tools.

Summary: not good - write performance went down to 20kb/s at the 10GB free
mark, sync took hours to complete, the filesystem was corrupt afterwards
and fsck failed to repair it.

I created a 512GB partition (-s 128 -o 1), mounted it
(-onoatime,inline_xattr,inline_data,flush_merge,extent_cache, note: no
inline_dentry) and started writing files to it (again, via rsync). Every
few minutes, a simple script deleted every 80th file, to create dirty
blocks. This test didn't test write performance, but it was adequate (the
filesystem kept up with it).

I paused rsync multiple times to check delete speed - the find -type f
command I used to generate the list was rather slow (it took multiple
minutes to list ~50000 files), which is not completely surprising, and
still manageable for me.

At around 50% utilization I paused the rsync and delete to see if there was
any gc or other activity. Indeed, every 30 seconds or so there was a
~100MB read and write, and no other activity.

I continued writing. At the 10GB free mark (df -h), write speed became
rather slow (~1MB/s), and a short time later (9.8GB) I paused rsync+delete
again. The "Dirty:" value was around 11000 at the time.

From then on performance became rather abysmal - the speed went down to a
steady 20kb/s (sic!).

After a while I started "sync", which hung for almost 2 hours, during which
the disk was mostly being written at ~20kb/s, with occasional faster writes
(~40-100MB/s) for a few seconds.

The faster write periods coincided mostly with activity in the "Balancing
F2FS Async" section of the status file.

Here is the status file from when the write speed became slow:

http://ue.tst.eu/12cf94978b9f47013f5f3b5712692ed5.txt

And here is the status file maybe half an hour later:

http://ue.tst.eu/144d36137371905a43d9a100f2f6b65c.txt

I can't really explain the abysmal speed - it doesn't happen with other
filesystems, so it's unlikely to be a hardware issue, but the only way I
can imagine this speed being explained is by f2fs scattering random
small writes all over the disk. The disk can do about 5-15 fully random
writes per second, but should be able to buffer >20GB of random writes before
this would happen.

The reason why I am so infatuated with disk-full conditions is that they will
happen sooner or later, and while a slowdown to 1MB/s might be ok when the
disk is nearly full, the filesystem absolutely must recover once there
is more free space and it has had some time to reorganise.

Another issue is that in one of my applications (backup), I reserve 10GB
of space for transaction storage used only temporarily, and the rest for
long-term storage. With f2fs, it seems this has to be at least 25GB to
avoid the performance drop (which effectively takes down the disk for
hours). This is a bit painful for two reasons: 1) f2fs already sets aside
a lot of storage - even with the minimum amount of reserved space (1%),
this boils down to 80GB, which is a lot; and 2) in this test, only 5GB were
reserved, but performance dropped when df -h still showed 10GB of free space.

Now my observations on recovery after this condition:

After sync returned, I more or less regained control of the disk, and
started thinning out files again. This was rather slow at first (but the
disk was reading and writing at 1-50MB/s - I assume the GC was at work).

After about 20 minutes, the utilization went down from 97% to 96%:

http://ue.tst.eu/74dd57f9b0fe2657a1518af71de0ce38.txt

At this point I noticed "find" spewing a large number of "No such file or
directory" messages for files.

The command I used to delete was:

   find /mnt -type f | awk '0 == NR % 80' | xargs -d\\n rm -v

And I don't see how find can ever complain about "No such file or
directory", even when there are concurrent deletes, because find should
not revisit the same file multiple times, so by the time it gets deleted,
find should be done with it.

At this point I stopped the find/rm - the disk then only showed large
reads and writes with a few seconds of pause between them. I then ran
the find command manually, and fair enough, find gave thousands of "No such
file or directory" messages like these:

   find: `/mnt/ebook-export/eng/Pyrotools.txt': No such file or directory

And indeed, the filesystem is completely corrupted at this point, with
lots of directory entries that cannot be stat'ed.

   root@shag:~# echo /mnt/ebook-export/eng/Pyrotools*
   /mnt/ebook-export/eng/Pyrotools.txt
   root@shag:~# ls -ld /mnt/ebook-export/eng/Pyrotools*
   ls: cannot access /mnt/ebook-export/eng/Pyrotools.txt: No such file or directory

Since you warned me about the inline_dentry/extent_cache options, I
will re-run this test tomorrow with noinline_dentry,noextent_cache (not
documented, if they even exist - but inline_dentry seems to be on by
default?).

For completeness, I ran fsck.f2fs, which gave me a lot of these:

   [ASSERT] (fsck_chk_inode_blk: 525)  --> ino: 0xc4e9 has i_blocks: 0000009e, but has 1 blocks
   [ASSERT] (fsck_chk_inode_blk: 391)  --> [0xc79a] needs more i_links=0x1
   [ASSERT] (fsck_chk_inode_blk: 525)  --> ino: 0xc79a has i_blocks: 0000005c, but has 1 blocks
   [ASSERT] (fsck_chk_inode_blk: 391)  --> [0xc845] needs more i_links=0x1
   [ASSERT] (fsck_chk_inode_blk: 525)  --> ino: 0xc845 has i_blocks: 000002d5, but has 1 blocks
   [ASSERT] (sanity_check_nid: 261)  --> Duplicated node blk. nid[0x34fa5][0x7fe07b3]

   [ASSERT] (fsck_chk_inode_blk: 391)  --> [0xccdc] needs more i_links=0x1
   [ASSERT] (fsck_chk_inode_blk: 525)  --> ino: 0xccdc has i_blocks: 00000063, but has 1 blocks
   [ASSERT] (fsck_chk_inode_blk: 391)  --> [0xcebc] needs more i_links=0x1
   [ASSERT] (fsck_chk_inode_blk: 525)  --> ino: 0xcebc has i_blocks: 000000b0, but has 1 blocks
   [ASSERT] (fsck_chk_inode_blk: 391)  --> [0xcf12] needs more i_links=0x1
   [ASSERT] (fsck_chk_inode_blk: 525)  --> ino: 0xcf12 has i_blocks: 00001b18, but has 1 blocks

I then tried fsck.f2fs -a, which completed without much output, almost
instantly (what does it do?). I then tried fsck.f2fs -f, which seemed to
do something:

   [ASSERT] (fsck_chk_inode_blk: 391)  --> [0x5c524] needs more i_links=0x1
   [FIX] (fsck_chk_inode_blk: 398)  --> File: 0x5c524 i_links= 0x1 -> 0x2
   [ASSERT] (fsck_chk_inode_blk: 525)  --> ino: 0x5c524 has i_blocks: 00000019, but has 1 blocks
   [FIX] (fsck_chk_inode_blk: 530)  --> [0x5c524] i_blocks=0x00000019 -> 0x1
   [ASSERT] (fsck_chk_inode_blk: 391)  --> [0x671ba] needs more i_links=0x1
   [FIX] (fsck_chk_inode_blk: 398)  --> File: 0x671ba i_links= 0x1 -> 0x2

   ...

   [FIX] (fsck_chk_inode_blk: 530)  --> [0x1a7bf] i_blocks=0x000000ca -> 0x1
   [ASSERT] (IS_VALID_BLK_ADDR: 344)  --> block addr [0x0]

   [ASSERT] (sanity_check_nid: 212)  --> blkaddres is not valid. [0x0]
   [FIX] (__chk_dentries: 779)  --> Unlink [0x1a7d8] - E B Jones.epub len[0x33], type[0x1]
   [ASSERT] (IS_VALID_BLK_ADDR: 344)  --> block addr [0x0]

   ...

   NID[0x679e2] is unreachable
   NID[0x679e3] is unreachable
   NID[0x6bc52] is unreachable
   NID[0x6bc53] is unreachable
   NID[0x6bc54] is unreachable
   [FSCK] Unreachable nat entries                        [Fail] [0x2727]
   [FSCK] SIT valid block bitmap checking                [Fail]
   [FSCK] Hard link checking for regular file            [Ok..] [0x0]
   [FSCK] valid_block_count matching with CP             [Fail] [0x6a6bc8a]
   [FSCK] valid_node_count matcing with CP (de lookup)   [Fail] [0x6808d]
   [FSCK] valid_node_count matcing with CP (nat lookup)  [Ok..] [0x6a7b4]
   [FSCK] valid_inode_count matched with CP              [Fail] [0x55bb8]
   [FSCK] free segment_count matched with CP             [Ok..] [0x8f5d]
   [FSCK] next block offset is free                      [Ok..]
   [FSCK] fixing SIT types
   [FIX] (check_sit_types:1056)  --> Wrong segment type [0x3fc6a] 3 -> 4
   [FIX] (check_sit_types:1056)  --> Wrong segment type [0x3fc6b] 3 -> 4
   [FSCK] other corrupted bugs                           [Fail]

This doesn't look good to me. However, the filesystem was mountable without
errors afterwards, but find showed similar errors, so fsck.f2fs did not
result in a working filesystem either.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 1; 512GB partition; very slow + unfixable corruption
  2015-09-21  8:17     ` SMR drive test 1; 512GB partition; very slow + unfixable corruption Marc Lehmann
@ 2015-09-21  8:19       ` Marc Lehmann
  2015-09-21  9:58         ` SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning Marc Lehmann
  0 siblings, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-09-21  8:19 UTC (permalink / raw)
  To: linux-f2fs-devel

On Mon, Sep 21, 2015 at 10:17:48AM +0200, Marc Lehmann <schmorp@schmorp.de> wrote:
> (-onoatime,inline_xattr,inline_data,flush_merge,extent_cache, note: no

Correction: I copied the wrong line from my log; the mount options were:

   mount -o inline_data,inline_dentry,flush_merge,extent_cache /dev/vg_test/test2 /mnt

So with inline_dentry.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-21  8:19       ` Marc Lehmann
@ 2015-09-21  9:58         ` Marc Lehmann
  2015-09-22 20:22           ` SMR drive test 3: full 8TB partition, mount problems, fsck error after delete Marc Lehmann
  2015-09-23  1:12           ` SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning Jaegeuk Kim
  0 siblings, 2 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-09-21  9:58 UTC (permalink / raw)
  To: linux-f2fs-devel

Second test - we're getting there:

Summary: looks much better, no obvious corruption (but fsck still gives
tens of thousands of [FIX] messages), performance somewhat as expected,
but a 138GB partition can only store 71.5GB of data (avg filesize 2.2MB)
and f2fs doesn't seem to do visible background GC.

For this test, I changed a bunch of parameters:

    1. partition size

       128GiB instead of 512GiB (not ideal, but I wanted this test to be
       quick)

    2. mkfs options

        mkfs.f2fs -lTEST -o5 -s128 -t0 -a0 # change: -o5 -a0

    3. mount options

        mount -t f2fs -onoatime,flush_merge,active_logs=2,no_heap
        # change: no inline_* options, no extent_cache, but no_heap + active_logs=2

First of all, the discrepancy between utilization in the status file, du
and df is quite large:

    Filesystem                Size  Used Avail Use% Mounted on
    /dev/mapper/vg_test-test  128G  106G   22G  84% /mnt

    # du -skc /mnt
    51674268        /mnt
    51674268        total

    Utilization: 67% (13168028 valid blocks)

So ~52GB of files take up ~106GB of the partition, which is 84% of the
total size, yet it's only utilized by 67%.

Second, and subjectively, the filesystem was much more responsive during
the test - find almost instantly gives some output, instead of having to
wait for half a minute, and find|rm is much faster as well. find also
reads data at ~2MB/s, while in the previous test it was 0.7MB/s (which
can be good or bad, but it looks good).

At 6.7GB free (df: 95%, status: 91%, du: 70/128GiB) I paused rsync. The disk
then did some heavy read/write for a short while, and the Dirty: count
reduced:

http://ue.tst.eu/d61a7017786dc6ebf5be2f7e2d2006d7.txt

I continued, and the disk afterwards did almost the same amount of reading
as it was writing, with short intermittent write-only periods of a few
seconds each. Rsync itself was noticeably slower, so I guess f2fs finally
ran out of space and was garbage collecting.

This is exactly the behaviour I expected of f2fs, but this is the first
time I actually saw it.

Pausing didn't result in any activity.

At 6.3GB free, disk write speed went down to 1MB/s with intermittent
phases of 100MB/s write only, or 50MB/s read + 50MB/s write (but rsync was
transferring about 100kb/s at this point only, so no real progress was
made).

After about 10 minutes I paused rsync again, still at 6.3GB free (df
reporting 96% in use, status 91% and du 52% (71.5GB)).

I must admit I don't understand these ratios - df vs. status can easily
be explained by overprovisioning, but the fact that a 138GB (128GiB)
partition can only hold 72GB of data with very few small files is not
looking good to me:

    # df -H /mnt
    Filesystem                Size  Used Avail Use% Mounted on
    /dev/mapper/vg_test-test  138G  130G  6.3G  96% /mnt
    # du -skc /mnt
    71572620        /mnt

I wonder what this means, too:

    MAIN: 65152(OverProv:27009 Resv:26624)

Surely this doesn't mean that 27009 of 65152 segments are for
overprovisioning? That would explain the bad values for du, but then, I
did specify -o5, not -o45 or so.
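
Back-of-the-envelope, assuming those are the default 2MB segments:

   echo $(( 27009 * 2 ))   # => 54018 (MB), i.e. roughly 53GiB of the 128GiB partition

which would be in the same ballpark as the du vs. df gap above.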

status at that point was:

    http://ue.tst.eu/f869dfb6ac7b4d52966e8eb012b81d2a.txt

Anyway, I did more thinning to regain free space by deleting every 10th
file. That went reasonably slowly; the disk was constantly reading + writing at
high speed, so I guess it was busy garbage collecting, as it should be.

status after deleting, with completely idle disk:

    http://ue.tst.eu/1831202bc94d9cd521cfcefc938d2095.txt

    /dev/mapper/vg_test-test  138G  123G   15G  90% /mnt

I waited a few minutes, but there was no further activity. I then unpaused
the rsync, which proceeded with good speed again.

At 11GB free, rsync effectively stopped, and the disk went into ~1MB/s write
mode again. Pausing rsync didn't cause I/O to stop this time; it continued
for a few minutes.

I waited for 2 minutes with no disk I/O, unpaused rsync, and the disk
immediately went into 1MB/s write mode again, with rsync not really
getting any data through though.

It's as if f2fs only tries to clean up when there is write data. I would
expect a highly fragmented f2fs to be very busy garbage collecting, but
apparently not - it just idles, and when a program wants to write, it
fails to perform. Maybe I need to give it more time than two minutes, but
then I wouldn't see a point in delaying the garbage collection if it has to
be done anyway.

In any case, with no progress possible, I deleted more files again, this time
every 5th file, which went reasonably fast.

status after delete:

    http://ue.tst.eu/fb3287adf4cc109c88b89f6120c9e4a6.txt

    /dev/mapper/vg_test-test  138G  114G   23G  84% /mnt

rsync writing was reasonably fast down to 18GB, when rsync stopped making
much progress (<100kb/s), but the disk wasn't in "1MB/s mode" but instead in
40MB/s read+write, which looks reasonable to me, as the disk was probably
quite fragmented at this point:

    http://ue.tst.eu/fb3287adf4cc109c88b89f6120c9e4a6.txt

However, when pausing rsync, f2fs immediately ceased doing anything again,
so even though there is clearly a need for clean-up activities, f2fs doesn't
do them.

To state this more clearly: my expectation is that when f2fs runs out of
immediately usable space for writing, it should do GC. That means that
when rsync is very slow and the disk is very fragmented, even when I pause
rsync, f2fs should GC at full speed until it has a reasonable amount of
usable free space again. Instead, it apparently just sits idle until some
program generates write data.

At this point, I unmounted the filesystem and "fsck.f2fs -f"'ed it. The
report looked good:

    [FSCK] Unreachable nat entries                        [Ok..] [0x0]
    [FSCK] SIT valid block bitmap checking                [Ok..]
    [FSCK] Hard link checking for regular file            [Ok..] [0x0]
    [FSCK] valid_block_count matching with CP             [Ok..] [0xe8b623]
    [FSCK] valid_node_count matcing with CP (de lookup)   [Ok..] [0xa58a]
    [FSCK] valid_node_count matcing with CP (nat lookup)  [Ok..] [0xa58a]
    [FSCK] valid_inode_count matched with CP              [Ok..] [0x7800]
    [FSCK] free segment_count matched with CP             [Ok..] [0x8a17]
    [FSCK] next block offset is free                      [Ok..]
    [FSCK] fixing SIT types

However, there were about 30000 messages like these:

    [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdf6] 0 -> 1
    [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdf7] 0 -> 1
    [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdf8] 0 -> 1
    [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdf9] 0 -> 1
    [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdfa] 0 -> 1
    [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdfb] 0 -> 1
    [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdfc] 0 -> 1
    [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdfd] 0 -> 1
    [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdfe] 0 -> 1
    [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdff] 0 -> 1
    [FSCK] other corrupted bugs                           [Ok..]

That's not promising - why does it think it needs to fix anything?

I mounted the partition again. Listing the files was very fast. I deleted all
the files and ran rsync for a while. It seems the partition completely
recovered. This is the empty state btw.:

   Filesystem                Size  Used Avail Use% Mounted on
   /dev/mapper/vg_test-test  138G   57G   80G  42% /mnt

So, all the pathological behaviour is gone (no 20kb/s write speed blocking
the disk for hours and, more importantly, no obvious filesystem corruption,
although the fsck messages need an explanation).

Moreover, the behaviour, while still confusing (weird du vs. df, no background
activity), at least seems to be in line with what I expect - fragmentation
kills performance, but f2fs seems capable of recovering.

So here is my wishlist:

1. The overprovisioning values seem to be completely out of this world. I'm
prepared to give up maybe 50GB of my 8TB disk for this, but not more.

2. even though ~40% of space is not used by file data, f2fs still becomes
extremely slow. this can't be right.

3. why does f2fs sit idle on a highly fragmented filesystem, why does it not
do background garbage collect at maximum I/O speed, so the filesystem is
ready when the next writes come?

Greetings, and good night :)

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* SMR drive test 3: full 8TB partition, mount problems, fsck error after delete
  2015-09-21  9:58         ` SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning Marc Lehmann
@ 2015-09-22 20:22           ` Marc Lehmann
  2015-09-22 23:08             ` Jaegeuk Kim
  2015-09-23  1:12           ` SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning Jaegeuk Kim
  1 sibling, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-09-22 20:22 UTC (permalink / raw)
  To: linux-f2fs-devel

Third test, using the full device, on linux 4.2.1

   mkfs.f2fs -l COLD1 -o1 -a0 -d1 -s128 /dev/mapper/xmnt-cold1
   mount -tf2fs -onoatime,flush_merge,active_logs=2,no_heap /dev/mapper/xmnt-cold1 /cold1

Unfortunately, the mount failed. The kernel showed that a high-order
allocation could not be satisfied:

   mount: page allocation failure: order:7, mode:0x40d0
   ...
   F2FS-fs (dm-18): Failed to initialize F2FS segment manager
   (http://data.plan9.de/f2fs-mount-failure.txt)

I think this memory management is a real problem - the server was booted
about 20 minutes earlier and had 23GB free ram (used for cache). I was able
to mount it by dropping the page cache, but clearly this shouldn't be
necessary.

After this, df showed 185GB in use, which is more like 3%, not 1% - again
overprovisioning seems to be out of bounds.

I started copying files with tar|tar, after 10GB, I restarted, which started
to overwrite the existing 10GB files.

Unfortunately, this time the GC kicked in every 10-20 seconds, slowing down
writing times. I don't know what triggered it this time, but I am quite sure
at less than 1% utilisation it shouldn't feel the need to gc while the disk
is busy writing.

After 90GB were written, I decided to simulate a disk problem by deleting
the device (to avoid any corruption issues the disk itself might have):

   echo 1 >/sys/block/sde/device/delete

After rescanning the device, I used fsck.f2fs on it, and it failed quickly:

   Info: superblock features = 0 : 
   Info: superblock encrypt level = 0, salt = 00000000000000000000000000000000
   Info: total FS sectors = 15628050432 (7630884 MB)
   Info: CKPT version = 2
   [ASSERT] (restore_node_summary: 688) ret >= 0
   [Exit 255] 

Re-running it with -f failed differently, but also quickly:

   Info: superblock features = 0 : 
   Info: superblock encrypt level = 0, salt = 00000000000000000000000000000000
   Info: total FS sectors = 15628050432 (7630884 MB)
   Info: CKPT version = 2
   [ASSERT] (get_current_sit_page: 803) ret >= 0
   [Exit 255] 

I'll reformat and try without any simulated problems.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 3: full 8TB partition, mount problems, fsck error after delete
  2015-09-22 20:22           ` SMR drive test 3: full 8TB partition, mount problems, fsck error after delete Marc Lehmann
@ 2015-09-22 23:08             ` Jaegeuk Kim
  2015-09-23  3:50               ` Marc Lehmann
  0 siblings, 1 reply; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-22 23:08 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

Thank you for the test.

On Tue, Sep 22, 2015 at 10:22:02PM +0200, Marc Lehmann wrote:
> Third test, using the full device, on linux 4.2.1
> 
>    mkfs.f2fs -l COLD1 -o1 -a0 -d1 -s128 /dev/mapper/xmnt-cold1

Could you check without -o1? I merged a patch into f2fs-tools to calculate
the best overprovision ratio at runtime.
Originally, even if you set a specific overprovision ratio, mkfs.f2fs
recalculates the real space anyway.
For example, if you set 1%, we need to reserve 100 sections to do cleaning in
the worst case. That's why you cannot see the reserved area as just 1% of the
total space.
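
Just as an illustration (this is only a sketch of the idea, not the exact
mkfs.f2fs code), with -s128 the cleaning reserve alone is already large:

   /* sketch only: size of the cleaning reserve from the example above
    * (about 100 sections at -o1), assuming 2MB f2fs segments and -s128 */
   #include <stdio.h>

   int main(void)
   {
       unsigned reserved_secs = 100;               /* cleaning reserve at 1% */
       unsigned segs_per_sec  = 128;               /* -s128 */
       unsigned long long seg_bytes = 2ULL << 20;  /* 2MB per segment */
       unsigned long long bytes =
           (unsigned long long)reserved_secs * segs_per_sec * seg_bytes;

       printf("cleaning reserve: ~%llu GiB\n", bytes >> 30);
       return 0;
   }

That alone is roughly 25GiB before any metadata, which is why a large -s
value inflates the reserved space far beyond the nominal 1%.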

>    mount -tf2fs -onoatime,flush_merge,active_logs=2,no_heap /dev/mapper/xmnt-cold1 /cold1
> 
> Unfortunately, the mount failed. The kernel showed that a high-order
> allocation could not be satisfied:
> 
>    mount: page allocation failure: order:7, mode:0x40d0
>    ...
>    F2FS-fs (dm-18): Failed to initialize F2FS segment manager
>    (http://data.plan9.de/f2fs-mount-failure.txt)

I think the below patch should resolve this issue.

> 
> I think this memory management is a real problem - the server was booted
> about 20 minutes earlier and had 23GB free ram (used for cache). I was able
> to mount it by dropping the page cache, but clearly this shouldn't be
> necessary.
> 
> After this, df showed 185GB in use, which is more like 3%, not 1% - again
> overprovisioning seems to be out of bounds.

Actually, 185GB should include FS metadata as well as reserved or overprovision
space. It would be good to check the on-disk layout with fsck.f2fs.

> 
> I started copying files with tar|tar, after 10GB, I restarted, which started
> to overwrite the existing 10GB files.
> 
> Unfortunately, this time the GC kicked in every 10-20 seconds, slowing down
> writing times. I don't know what triggered it this time, but I am quite sure
> at less than 1% utilisation it shouldn't feel the need to gc while the disk
> is busy writing.
> 
> After 90GB were written, I decided to simulate a disk problem by deleting
> the device (to avoid any corruption issues the disk itself might have):
> 
>    echo 1 >/sys/block/sde/device/delete
> 
> After rescanning the device, I used fsck.f2fs on it, and it failed quickly:
> 
>    Info: superblock features = 0 : 
>    Info: superblock encrypt level = 0, salt = 00000000000000000000000000000000
>    Info: total FS sectors = 15628050432 (7630884 MB)
>    Info: CKPT version = 2
>    [ASSERT] (restore_node_summary: 688) ret >= 0
>    [Exit 255] 
> 
> Re-running it with -f failed differently, but also quickly:
> 
>    Info: superblock features = 0 : 
>    Info: superblock encrypt level = 0, salt = 00000000000000000000000000000000
>    Info: total FS sectors = 15628050432 (7630884 MB)
>    Info: CKPT version = 2
>    [ASSERT] (get_current_sit_page: 803) ret >= 0
>    [Exit 255] 

Actually, this doesn't report an f2fs inconsistency.
Instead, these two errors come from lseek64() and read() failures in
dev_read() in lib/libf2fs_io.c.

Maybe ENOMEM? Can you check the errno of this function?
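
Something along these lines would tell us (only a sketch; the exact code in
lib/libf2fs_io.c differs a bit, and it assumes the device fd is config.fd
and the file's existing includes, as in the current f2fs-tools tree):

   /* print errno when the lseek64()/read() in dev_read() fail, so we can
    * tell ENOMEM apart from a real I/O error on the device */
   #include <errno.h>
   #include <stdio.h>
   #include <string.h>

   int dev_read(void *buf, __u64 offset, size_t len)
   {
       if (lseek64(config.fd, (off64_t)offset, SEEK_SET) < 0) {
           fprintf(stderr, "dev_read: lseek64 to %llu: %s\n",
                   (unsigned long long)offset, strerror(errno));
           return -1;
       }
       if (read(config.fd, buf, len) < 0) {
           fprintf(stderr, "dev_read: read of %zu bytes: %s\n",
                   len, strerror(errno));
           return -1;
       }
       return 0;
   }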

Thanks,

> 
> I'll reformat and try without any simulated problems.
> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

From d495b00a2f04c0ec5e6c6d95c9e66bdba45b174c Mon Sep 17 00:00:00 2001
From: Jaegeuk Kim <jaegeuk@kernel.org>
Date: Tue, 22 Sep 2015 13:50:47 -0700
Subject: [PATCH] f2fs: use vmalloc to handle -ENOMEM error

This patch introduces f2fs_kvmalloc to avoid -ENOMEM during mount.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
---
 fs/f2fs/f2fs.h    | 11 +++++++++++
 fs/f2fs/segment.c |  9 ++++-----
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 79c38ad..553529d 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -19,6 +19,7 @@
 #include <linux/magic.h>
 #include <linux/kobject.h>
 #include <linux/sched.h>
+#include <linux/vmalloc.h>
 #include <linux/bio.h>
 
 #ifdef CONFIG_F2FS_CHECK_FS
@@ -1579,6 +1580,16 @@ static inline bool f2fs_may_extent_tree(struct inode *inode)
 	return S_ISREG(mode);
 }
 
+static inline void *f2fs_kvmalloc(size_t size, gfp_t flags)
+{
+	void *ret;
+
+	ret = kmalloc(size, flags | __GFP_NOWARN);
+	if (!ret)
+		ret = __vmalloc(size, flags, PAGE_KERNEL);
+	return ret;
+}
+
 #define get_inode_mode(i) \
 	((is_inode_flag_set(F2FS_I(i), FI_ACL_MODE)) ? \
 	 (F2FS_I(i)->i_acl_mode) : ((i)->i_mode))
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 78e6d06..13567ad 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -14,7 +14,6 @@
 #include <linux/blkdev.h>
 #include <linux/prefetch.h>
 #include <linux/kthread.h>
-#include <linux/vmalloc.h>
 #include <linux/swap.h>
 
 #include "f2fs.h"
@@ -2028,12 +2027,12 @@ static int build_free_segmap(struct f2fs_sb_info *sbi)
 	SM_I(sbi)->free_info = free_i;
 
 	bitmap_size = f2fs_bitmap_size(MAIN_SEGS(sbi));
-	free_i->free_segmap = kmalloc(bitmap_size, GFP_KERNEL);
+	free_i->free_segmap = f2fs_kvmalloc(bitmap_size, GFP_KERNEL);
 	if (!free_i->free_segmap)
 		return -ENOMEM;
 
 	sec_bitmap_size = f2fs_bitmap_size(MAIN_SECS(sbi));
-	free_i->free_secmap = kmalloc(sec_bitmap_size, GFP_KERNEL);
+	free_i->free_secmap = f2fs_kvmalloc(sec_bitmap_size, GFP_KERNEL);
 	if (!free_i->free_secmap)
 		return -ENOMEM;
 
@@ -2348,8 +2347,8 @@ static void destroy_free_segmap(struct f2fs_sb_info *sbi)
 	if (!free_i)
 		return;
 	SM_I(sbi)->free_info = NULL;
-	kfree(free_i->free_segmap);
-	kfree(free_i->free_secmap);
+	kvfree(free_i->free_segmap);
+	kvfree(free_i->free_secmap);
 	kfree(free_i);
 }
 
-- 
2.1.1


------------------------------------------------------------------------------

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-21  9:58         ` SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning Marc Lehmann
  2015-09-22 20:22           ` SMR drive test 3: full 8TB partition, mount problems, fsck error after delete Marc Lehmann
@ 2015-09-23  1:12           ` Jaegeuk Kim
  2015-09-23  4:15             ` Marc Lehmann
  1 sibling, 1 reply; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-23  1:12 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Mon, Sep 21, 2015 at 11:58:06AM +0200, Marc Lehmann wrote:
> Second test - we're getting there:
> 
> Summary: looks much better, no obvious corruption (but fsck still gives
> tens of thousands of [FIX] messages), performance somewhat as expected,
> but a 138GB partition can only store 71.5GB of data (avg filesize 2.2MB)
> and f2fs doesn't seem to do visible background GC.
> 
> For this test, changed a bunch of parameters:
> 
>     1. partition size
> 
>        128GiB instead of 512GiB (not ideal, but I wanted this test to be
>        quick)
> 
>     2. mkfs options
> 
>         mkfs.f2fs -lTEST -o5 -s128 -t0 -a0 # change: -o5 -a0

Please, check without -o5.

> 
>     3. mount options
> 
>         mount -t f2fs -onoatime,flush_merge,active_logs=2,no_heap
>         # change: no inline_* options, no extent_cache, but no_heap + active_logs=2

Hmm. Is it necessary to reduce the number of active_logs? Only two logs would
increase the GC overheads significantly.
And, you can use inline_data in v4.2.
In v4.3, I expect extent_cache will be stable and usable.

> 
> First of all, the discrepancy between utilization in the status file, du
> and df is quite large:
> 
>     Filesystem                Size  Used Avail Use% Mounted on
>     /dev/mapper/vg_test-test  128G  106G   22G  84% /mnt
> 
>     # du -skc /mnt
>     51674268        /mnt
>     51674268        total
> 
>     Utilization: 67% (13168028 valid blocks)

Ok. I could retrieve the on-disk layout from the below log.
In the log, the overprovision area is set as about 54GB.
However, when I tried to do mkfs.f2fs with the same options, I got about 18GB.
Could you share the mkfs.f2fs messages and fsck.f2fs -d3 as well?

> 
> So ~52GB of files take up ~106GB of the partition, which is 84% of the
> total size, yet it's only utilized by 67%.
> 
> Second, and subjectively, the filesystem was much more responsive during
> the test- find almost instantly give ssome output, instead of having to
> wait for half a minute, and find|rm is much faster as well. find also
> reads data at ~2mb/s, while in the previous test, it was 0.7MB/s (which
> can be good or bad, but it looks good).
> 
> At 6.7GB free (df: 95%, status: 91%, du: 70/128GiB) I paused rsync. The disk
> then did some heavy read/write for a short while, and the Dirty: count
> reduced:
> 
> http://ue.tst.eu/d61a7017786dc6ebf5be2f7e2d2006d7.txt
> 
> I continued, and the disk afterwards did almost the same amount of reading
> as it was writing, with short intzermittent write-only periods for a fe
> seconds each. Rsync itself was noticably slower, so I guess f2fs finally
> ran out of space and did garbage collect.
> 
> This is exactly the behaviour I did expect of f2fs, but this is the first
> time I actually saw it.
> 
> Pausing didn't result in any activity.
> 
> At 6.3GB free, disk write speed went down to 1MB/s with intermittent
> phases of 100MB/s write only, or 50MB/s read + 50MB/s write (but rsync was
> transferring about 100kb/s at this point only, so no real progress was
> made).
> 
> After about 10 minutes I paused rsync again, still at 6.3GB free (df
> reporting 96% in use, status 91% and du 52% (71.5GB))
> 
> I must admit I don't understand these ratios - df vs. status can easily
> be explained by overprovisioning, but the fact that a 138GB (128GiB)
> partition can only hold 72GB of data with very few small files is not
> looking good to me:
> 
>     # df -H /mnt
>     Filesystem                Size  Used Avail Use% Mounted on
>     /dev/mapper/vg_test-test  138G  130G  6.3G  96% /mnt
>     # du -skc /mnt
>     71572620        /mnt
> 
> I wonder what this means, too:
> 
>     MAIN: 65152(OverProv:27009 Resv:26624)

Yeah, that's the hint that the overprovision area abnormally occupies 54GB.
I think there is something wrong with your mkfs.f2fs when calculating the
reserved space. We need to take a look at the mkfs.f2fs log.

> 
> Surely this doesn't mean that 27009 of 65152 segments are for
> overprovisioning? That would explain the bad values for due, but then, I
> did specify -o5, not -o45 or so.
> 
> status at that point was:
> 
>     http://ue.tst.eu/f869dfb6ac7b4d52966e8eb012b81d2a.txt
> 
> Anyways, I did more thinning to regain free space by deleting every 10th
> file. That went reasonably slow, the disk was contantly reading + writing at
> high speed, so I guess it was busy garbage colelcting, as it should.
> 
> status after deleting, with completely idle disk:
> 
>     http://ue.tst.eu/1831202bc94d9cd521cfcefc938d2095.txt
> 
>     /dev/mapper/vg_test-test  138G  123G   15G  90% /mnt
> 
> I waited a few minutes, but there was no further activity. I then unpaused
> the rsync, which proceeded with good speed again.
> 
> At 11GB free, rsync effectively stopped, and the disk went to ~1MB/s wrtite
> mode aagin. Pausing rsync didn't cause I/O to stop this time, it continued
> for a few minutes.
> 
> I waited for 2 minutes with no disk I/O, unpaused rsync, and the disk
> immediately went into 1MB/s write mode againh, with rsync not really
> getting any data through though.
> 
> It's as if f2fs only tried to clean up when there is write data. I would
> expect a highly fragmented f2fs to be very busy garbage collecting, but
> apparently, not so, it just idles, and when a program wants to write,
> fails to perform. Maybe I need to give it more time than two minutes, but
> then, I wouldn't see a point in delaying to garbage collect if it has to
> be done anyways.
> 
> In any case, no progress possible, I deleted more files again, this time
> every 5th file, which went reasonably fast,
> 
> status after delete:
> 
>     http://ue.tst.eu/fb3287adf4cc109c88b89f6120c9e4a6.txt
> 
>     /dev/mapper/vg_test-test  138G  114G   23G  84% /mnt
> 
> rsync writing was reasonably fast down to 18GB, when rsync stopped making
> much progress (<100kb/s), but the disk wasn't in "1MB/s mode" but instead in
> 40MB/s read+write, which looks reasonable to me, as the disk was probably
> quite fragmented at this point:
> 
>     http://ue.tst.eu/fb3287adf4cc109c88b89f6120c9e4a6.txt
> 
> However, when pausing rsync, f2fs immediately ceased doing anything again,
> so even though there is clearly a need for clean-up activities, f2fs doesn't
> do them.

It seems that the reason f2fs didn't do GC is that all the sections had
already been traversed by background GC. In order to reset that, it needs to
trigger a checkpoint, but it couldn't meet the condition in the background.

How about calling "sync" before leaving the system idle?
Or, you can try decreasing the number in /sys/fs/f2fs/xxx/reclaim_segments to
256 or 512?
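
If you want to script that per filesystem, syncfs(2) on the mountpoint forces
the checkpoint for just that mount (only a sketch; a plain "sync" from the
shell has the same effect, just system-wide):

   /* force a checkpoint on one f2fs mount via syncfs(2), so background GC
    * gets a fresh set of victim sections to look at */
   #define _GNU_SOURCE
   #include <fcntl.h>
   #include <stdio.h>
   #include <unistd.h>

   int main(int argc, char **argv)
   {
       int fd = open(argc > 1 ? argv[1] : "/mnt", O_RDONLY | O_DIRECTORY);

       if (fd < 0) {
           perror("open");
           return 1;
       }
       if (syncfs(fd) < 0)
           perror("syncfs");
       close(fd);
       return 0;
   }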

> 
> To state this more clearly: My expectation is that when f2fs runs out of
> immediately usable space for writing, it should do GC. That means that
> when rsync is very slow and the disk is very fragmented, even when I pause
> rsync, f2fs should GC at full speed until it has a reasonable amount of
> usable free space again. Instead, it apparently just sits idle until some
> program generates write data.
> 
> At this point, I unmounted the filesystem and "fsck.f2fs -f"'ed it. The
> report looked good:
> 
>     [FSCK] Unreachable nat entries                        [Ok..] [0x0]
>     [FSCK] SIT valid block bitmap checking                [Ok..]
>     [FSCK] Hard link checking for regular file            [Ok..] [0x0]
>     [FSCK] valid_block_count matching with CP             [Ok..] [0xe8b623]
>     [FSCK] valid_node_count matcing with CP (de lookup)   [Ok..] [0xa58a]
>     [FSCK] valid_node_count matcing with CP (nat lookup)  [Ok..] [0xa58a]
>     [FSCK] valid_inode_count matched with CP              [Ok..] [0x7800]
>     [FSCK] free segment_count matched with CP             [Ok..] [0x8a17]
>     [FSCK] next block offset is free                      [Ok..]
>     [FSCK] fixing SIT types
> 
> However, there were about 30000 messages like these:
> 
>     [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdf6] 0 -> 1
>     [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdf7] 0 -> 1
>     [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdf8] 0 -> 1
>     [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdf9] 0 -> 1
>     [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdfa] 0 -> 1
>     [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdfb] 0 -> 1
>     [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdfc] 0 -> 1
>     [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdfd] 0 -> 1
>     [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdfe] 0 -> 1
>     [FIX] (check_sit_types:1056)  --> Wrong segment type [0xfdff] 0 -> 1
>     [FSCK] other corrupted bugs                           [Ok..]
> 
> That's not promising, why does it think it needs to fix anything?

I need to take a look at how fsck.f2fs handles the case where there are two
active logs. Anyway, this doesn't break the core FS consistency, so you can
ignore them.

> 
> I mounted the partition again. Listing the files was very fast. I deleted all
> the files and ran rsync for a while. It seems the partition completely
> recovered. This is the empty state btw.:
> 
>    Filesystem                Size  Used Avail Use% Mounted on
>    /dev/mapper/vg_test-test  138G   57G   80G  42% /mnt
> 
> So, all the pathological behaviour is gone (no 20kb/s write speed blocking
> the disk for hours, more importantly, no obvious filesystem corruption,
> although the fsck messages need explanation).
> 
> Moreover, the behaviour, while still confusing (weird du vs. df, no background
> activity), at least seems to be in line with what I expect - fragmentation
> kills performance, but f2fs seems capable of recovering.
> 
> So here is my wishlist:
> 
> 1. the overprovisioning values seem to be completely out of this world. I'm
> prepared to give up maybe 50GB of my 8TB disk for this, but not more.

Maybe it needs to be compared with other filesystems' *available* space,
since many of them hide additional FS metadata initially.

> 
> 2. even though ~40% of space is not used by file data, f2fs still becomes
> extremely slow. this can't be right.

I think it was due to the wrong overprovision space.
We need to check that number first.

> 
> 3. why does f2fs sit idle on a highly fragmented filesystem, why does it not
> do background garbage collect at maximum I/O speed, so the filesystem is
> ready when the next writes come?

I suspect the section size is too large compared to the whole partition
size; there are only 509 sections. Each GC selects a victim in units of one
section, and background GC does not select previously visited ones again.
So I think GC quickly traverses all the sections and then goes to bed, since
there are no new victims. A checkpoint, i.e. "sync", resets the whole history
and makes background GC conduct its job again.

Thank you, :)

> 
> Greetings, and good night :)
> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 3: full 8TB partition, mount problems, fsck error after delete
  2015-09-22 23:08             ` Jaegeuk Kim
@ 2015-09-23  3:50               ` Marc Lehmann
  0 siblings, 0 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-09-23  3:50 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Tue, Sep 22, 2015 at 04:08:50PM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> Could you check without -o1, since I merged a patch to calculate the best
> overprovision at runtime in f2fs-tools.

I assume I have it, as git pull didn't give me any updates.

> For example, if you set 1%, we need to reserve 100 sections to do cleaning at
> the worse case. That's why you cannot see the reserved area as just 1% over
> total space.

Ok, I tried with -o1, -o5, and no -o switch at all:

   switch df -h "Used"
   -o1    126GiB
   -o5    384GiB
   ""     126GiB

So indeed, the manpage (which says -o5 is the default) doesn't match the behaviour.

With -s1 instead of -s128, I get:

   75GiB

100 sections at -s128 would be 25G, so I wonder what the remaining 101GiB
are (or the remaining 75GiB).

Don't get me wrong, a "default" ext4 gives me a lot less initial space,
but that's why I don't use ext4. XFS (which has a lot more on-disk data
structures) gives me 100GB more space, which is not something to be
trifled with. If f2fs absolutely needs this space, it has to be, but at
the moment, it feels excessive.

I'm also OK with having to wait for GC when the disk is almost completely
full. The issue I have at this point is that f2fs reserves a LOT of space,
and long before the disk even gets near full, it basically stops
working. That's the point where I would have to wait for the GC, but f2fs
just seems to sit idle.

> >    F2FS-fs (dm-18): Failed to initialize F2FS segment manager
> >    (http://data.plan9.de/f2fs-mount-failure.txt)
> 
> I think the below patch should resolve this issue.

Sounds cool!

> > After this, df showed 185GB in use, which is more like 3%, not 1% - again
> > overprovisioning seems to be out of bounds.
> 
> Actually, 185GB should include FS metadata as well as reserved or overprovision
> space. It would be good to check the on-disk layout by fsck.f2fs.

That's a lot of metadata for an empty filesystem.

> >    [ASSERT] (get_current_sit_page: 803) ret >= 0
> >    [Exit 255] 
> 
> Actually, this doesn't report f2fs inconsistency.
> Instead, these two errors are from lseek64() and read() failures in dev_read():
> lib/libf2fs_io.c.
> 
> Maybe ENOMEM? Can you check the errno of this function?

That's very strange, if the kernel fails, I would expect some dmesg
output, but the fs was mountable before and after.

Unfortunately, I already went onwards with the next test.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-23  1:12           ` SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning Jaegeuk Kim
@ 2015-09-23  4:15             ` Marc Lehmann
  2015-09-23  6:00               ` Marc Lehmann
                                 ` (2 more replies)
  0 siblings, 3 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-09-23  4:15 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Tue, Sep 22, 2015 at 06:12:39PM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> Hmm. Is it necessary to reduce the number of active_logs?

I don't know, the documentation isn't very forthcoming with details :)

In any case, this is just for testing. My rationale was that multiple logs
probably mean that there are multiple sequential write zones. Reducing those
to only two logs would help the disk. Probably. Maybe.

> increase the GC overheads significantly.

Can you elaborate? I do get a speed improvement with only two logs, but of
course, GC time is an important factor, so maybe more logs would be a
necessary trade-off.

> And, you can use inline_data in v4.2.

I think I did - the documentation says inline_data is the default.

> >     Filesystem                Size  Used Avail Use% Mounted on
> >     /dev/mapper/vg_test-test  128G  106G   22G  84% /mnt
> > 
> >     # du -skc /mnt
> >     51674268        /mnt
> >     51674268        total
> > 
> >     Utilization: 67% (13168028 valid blocks)
> 
> Ok. I could retrieve the on-disk layout from the below log.
> In the log, the overprovision area is set as about 54GB.
> However, when I tried to do mkfs.f2fs with the same options, I got about 18GB.
> Could you share the mkfs.f2fs messages and fsck.f2fs -d3 as well?

When I re-ran the mkfs.f2fs, I got:

   Filesystem                Size  Used Avail Use% Mounted on
   /dev/mapper/vg_test-test  138G   20G  118G  14% /mnt

I didn't note down the overhead in my test; the df I had was from when the
disk was filled, so it possibly changed(?) at runtime.

(I tried Debian's mkfs.f2fs, but it gave identical results).

I'll redo the 128GiB test and see if I can get similar results.

> > However, when pausing rsync, f2fs immediatelly ceased doing anything again,
> > so even though clearly there is a need for clean up activities, f2fs doesn't
> > do them.
> 
> It seems that why f2fs didn't do gc was that all the sections were traversed
> by background gc. In order to reset that, it needs to trigger checkpoint, but 
> it couldn't meet the condition in background.
> 
> How about calling "sync" before leaving the system as idle?
> Or, you can check decreasing the number in /sys/fs/f2fs/xxx/reclaim_segments to
> 256 or 512?

Will try next time. I distinctly remember that sync didn't do anything to
pre-free and free, though.

> > 1. the overprovisioning values seems to be completely out of this world. I'm
> > prepared top give up maybe 50GB of my 8TB disk for this, but not more.
> 
> Maybe, it needs to check with other filesystems' *available* spaces.
> Since, many of them hide additional FS metadata initially.

I habitually compare free space between filesystems. While f2fs is better
than ext4 with default settings (and even with some tuning), ext4 is well
known to have excessive preallocated metadata requirements.

As mentioned in my other mail, XFS for example has 100GB more free
space than f2fs on the full 8TB device, and from memory I expect other
filesystems without fixed inode numbers (practically all of them) to be
similar.

> > 3. why does f2fs sit idle on a highly fragmented filesystem, why does it not
> > do background garbage collect at maximum I/O speed, so the filesystem is
> > ready when the next writes come?
> 
> I suspect the section size is too large comparing to the whole partition size,
> which number is only 509. Each GC selects a victim in a unit of section and
> background GC would not select again for the previously visited ones.
> So, I think GC is easy to traverse whole sections, and go to bed since there
> is no new victims. So, I think checkpoint, "sync", resets whole history and
> makes background GC conduct its job again.

The large section size is of course the whole point of the exercise, as
hopefully this causes the GC to do larger sequential writes. It's clear
that this is not a perfect match for these SMR drives, but the goal is to
have acceptable performance, faster than a few megabytes/s. And indeed,
when the GC runs, it gets quite good I/O performance in my test (deleting
every nth file makes comparatively small holes, so the GC has to copy most
of the section).

Now, the other thing is that the GC, when it triggers, isn't very
aggressive - when I saw it, it was doing something every 10-15 seconds,
with the system being idle, when it should be more or less completely busy.

I am aware that "idle" is a difficult to impossible condition to detect -
maybe this could be made more tunable (I tried to play around with the
gc_*_time values, but, probably due to lack of documentation, I didn't get
very far and couldn't correlate the behaviour I saw with the settings I
made).

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-23  4:15             ` Marc Lehmann
@ 2015-09-23  6:00               ` Marc Lehmann
  2015-09-23  8:55                 ` Chao Yu
  2015-09-23 22:08                 ` Jaegeuk Kim
  2015-09-23  6:06               ` Marc Lehmann
  2015-09-23 21:29               ` Jaegeuk Kim
  2 siblings, 2 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-09-23  6:00 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Wed, Sep 23, 2015 at 06:15:24AM +0200, Marc Lehmann <schmorp@schmorp.de> wrote:
> > However, when I tried to do mkfs.f2fs with the same options, I got about 18GB.
> > Could you share the mkfs.f2fs messages and fsck.f2fs -d3 as well?
> 
> When I re-ran the mkfs.f2fs, I got:

I get the feeling I did something idiotic, but for the life of me, I don't
know what. I see the mkfs.f2fs in my test log, I see it in my command
history, but I can't reproduce it.

So let's disregard this and go to the next test - I redid the 128G partition
test, with 6 active logs, no -o and -s64:

   mkfs.f2fs -lTEST -s64 -t0 -a0

This allowed me to arrive at this state, at which rsync stopped making
progress:

   root@shag:/sys/fs/f2fs/dm-1# df -H /mnt
   Filesystem                Size  Used Avail Use% Mounted on
   /dev/mapper/vg_test-test  138G  137G  803k 100% /mnt

This would be about perfect (I even got ENOSPC for the first
time!). However, when I do my "delete every nth file":

   /dev/mapper/vg_test-test  138G  135G  1.8G  99% /mnt

The disk still sits mostly idle. I did verify that "sync" indeed reduces
Pre-Free to 0, and I do see some activity every ~30s now, though:

http://ue.tst.eu/ac1ec447de214edc4e007623da2dda72.txt (see the dsk/sde
columns).

If I start writing, I guess I trigger the foreground gc:

http://ue.tst.eu/1dfbac9166552a95551855000d820ce9.txt

The first few lines there are some background gc activity (I guess), then I
started an rsync to write data - net/total shows the data rsync transfers.
After that, there is constant ~40MB/s read/write activity, but very little
actual write data gets to the disk (rsync makes progress at <100kb/s).

At some point I stop rsync (the single line with 0/0 for sde read and
write, after the second header), followed by sync a second later. Sync
does its job, and then there is no activity for a bit, until I start
rsync again, which immediately triggers the 40/40 mode, and makes little
progress.

So little to no gc activity, even though the filesystem really needs some
GC activity at this point.

If I play around with gc_* like this:

   echo 1 >gc_idle
   echo 1000 >gc_max_sleep_time
   echo 5000 >gc_no_gc_sleep_time

Then I get a lot more activity:

http://ue.tst.eu/f05ee3ff52dc7814ee8352cc2d67f364.txt

But still, as you can see, a lot of the time the disk and the cpu are
idle.

In any case, I think I am getting somewhere - until now all my tests ended
in an unusable filesystem sooner or later; this is the first one which shows
mostly expected behaviour.

Maybe the -s128 (or -s256) with which I did my previous tests is problematic?
Maybe the active_logs=2 caused problems (but I only used this option recently)?

And the previous problems can be explained by using inline_dentry and/or
extent_cache.

Anyway, this behaviour is what I would expect, mostly.

Now, I could go with -s64 (128MB sections still span 4-7 zones with this
disk). Or maybe something uneven, such as -s90, if that doesn't cause
problems.

Also, if it were possible to tune the gc to be more aggressive when idle
(and mostly off if the disk is free), and possibly, if the loss of space
by metadata could be reduced, I'd risk f2fs in production in one system
here.

Greetings,

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-23  4:15             ` Marc Lehmann
  2015-09-23  6:00               ` Marc Lehmann
@ 2015-09-23  6:06               ` Marc Lehmann
  2015-09-23  9:10                 ` Chao Yu
  2015-09-23 21:29               ` Jaegeuk Kim
  2 siblings, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-09-23  6:06 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

> by metadata could be reduced, I'd risk f2fs in production in one system
> here.

Oh, and please, I beg you, consider increasing the hardlink limit to >16
bit - look at other filesystems: many filesystems thought they could get
away with 16 bit (ext*, xfs, ...) but all of them nowadays support 31 bit
or more for the hardlink count :) Merely 18 bits would probably suffice :)

While 65535 will just work at the moment for me (my largest directory has
~62000 subdirectories, and I can halve this with some extra work), it's
guaranteed to fail sooner or later.

Thanks for listening (even if you decide against it :).

Greetings,

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-23  6:00               ` Marc Lehmann
@ 2015-09-23  8:55                 ` Chao Yu
  2015-09-23 23:30                   ` Marc Lehmann
  2015-09-23 22:08                 ` Jaegeuk Kim
  1 sibling, 1 reply; 74+ messages in thread
From: Chao Yu @ 2015-09-23  8:55 UTC (permalink / raw)
  To: 'Marc Lehmann', 'Jaegeuk Kim'; +Cc: linux-f2fs-devel

Hi Marc,

> -----Original Message-----
> From: Marc Lehmann [mailto:schmorp@schmorp.de]
> Sent: Wednesday, September 23, 2015 2:01 PM
> To: Jaegeuk Kim
> Cc: linux-f2fs-devel@lists.sourceforge.net
> Subject: Re: [f2fs-dev] SMR drive test 2; 128GB partition; no obvious corruption, much more
> sane behaviour, weird overprovisioning
> 
> On Wed, Sep 23, 2015 at 06:15:24AM +0200, Marc Lehmann <schmorp@schmorp.de> wrote:
> > > However, when I tried to do mkfs.f2fs with the same options, I got about 18GB.
> > > Could you share the mkfs.f2fs messages and fsck.f2fs -d3 as well?
> >
> > When I re-ran the mkfs.f2fs, I got:
> 
> I get the feeling I did something idiotic, but for the life of it, I don't
> know what. I see the mkfs.f2fs in my test log, I see it in my command
> history, but for the life of it, I can't reproduce it.
> 
> So let's disregard this and go to the next test - I redid the 128G partition
> test, with 6 active logs, no -o and -s64:
> 
>    mkfs.f2fs -lTEST -s64 -t0 -a0
> 
> This allowed me to arrive at this state, at which rsync stopped making
> progress:
> 
>    root@shag:/sys/fs/f2fs/dm-1# df -H /mnt
>    Filesystem                Size  Used Avail Use% Mounted on
>    /dev/mapper/vg_test-test  138G  137G  803k 100% /mnt
> 
> This would be about perfect (I even got ENOSPC for the first
> time!). However, when I do my "delete every nth file":
> 
>    /dev/mapper/vg_test-test  138G  135G  1.8G  99% /mnt
> 
> The disk still sits mostly idle. I did verify that "sync" indeed reduces
> Pre-Free to 0, and I do see some activity every ~30s now, though:
> 
> http://ue.tst.eu/ac1ec447de214edc4e007623da2dda72.txt (see the dsk/sde
> columns).
> 
> If I start writing, I guess I trigger the foreground gc:
> 
> http://ue.tst.eu/1dfbac9166552a95551855000d820ce9.txt
> 
> The first few lines there are some background gc activity (I guess), then I
> started an rsync to write data - net/total shows the data rsync transfers.
> After that, there is constant ~40mb read/write activity, but very little
> actual write data gets to the disk (rsync makes progress at <100kb/s).
> 
> At some point I stop rsync (the single line with 0/0 for sde read
> write, after the second header), followed by sync a second later. Sync
> does its job, and then there is no activity for a bit, until I start
> rsync again, which immediately triggers the 40/40 mode, and makes little
> progress.
> 
> So little to no gc activity, even though the filesystem really needs some
> GC activity at this point.
> 
> If I play around with gc_* like this:
> 
>    echo 1 >gc_idle
>    echo 1000 >gc_max_sleep_time
>    echo 5000 >gc_no_gc_sleep_time

One thing I notice is that gc_min_sleep_time is not set in your script,
so in some conditions GC may still sleep for gc_min_sleep_time (30
seconds by default) instead of the gc_max_sleep_time we expect.

So setting gc_min_sleep_time and gc_max_sleep_time as a pair is a better way
of controlling the GC sleep time.

> 
> Then I get a lot more activity:
> 
> http://ue.tst.eu/f05ee3ff52dc7814ee8352cc2d67f364.txt
> 
> But still, as you can see, a lot of the time the disk and the cpu are
> idle.
> 
> In any case, I think I am getting somewhere - until now all my tests ended in
> an unusable filesystem sooner or later; this is the first one which shows mostly
> expected behaviour.
> 
> Maybe -s128 (or -s256) with which I did my previous tests are problematic?
> Maybe the active_logs=2 caused problems (but I only used this option recently)?
> 
> And the previous problems can be explained by using inline_dentry and/or
> extent_cache.
> 
> Anyway, this behaviour is what I would expect, mostly.
> 
> Now, I could go with -s64 (128MB segments still span 4-7 zones with this
> disk). Or maybe something uneven, such as -s90, if that doesn't cause
> problems.
> 
> Also, if it were possible to tune the gc to be more aggressive when idle

In the 4.3-rc1 kernel, we have added a new ioctl to trigger GC in batches;
maybe we can use it as one option.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?
id=c1c1b58359d45e1a9f236ce5a40d50720c07c70e
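
A minimal sketch of calling it from userspace (the ioctl number below is
assumed from fs/f2fs/f2fs.h as of 4.3-rc1, so please double check it against
the kernel you actually run):

   /* trigger one GC pass on an f2fs mount; sync = 1 asks for a
    * synchronous (foreground) pass, 0 for a background one */
   #include <fcntl.h>
   #include <stdio.h>
   #include <sys/ioctl.h>
   #include <unistd.h>
   #include <linux/types.h>

   #define F2FS_IOCTL_MAGIC         0xf5
   #define F2FS_IOC_GARBAGE_COLLECT _IOW(F2FS_IOCTL_MAGIC, 6, __u32)

   int main(int argc, char **argv)
   {
       __u32 sync = 1;
       int fd = open(argc > 1 ? argv[1] : "/mnt", O_RDONLY);

       if (fd < 0) {
           perror("open");
           return 1;
       }
       if (ioctl(fd, F2FS_IOC_GARBAGE_COLLECT, &sync) < 0)
           perror("F2FS_IOC_GARBAGE_COLLECT");
       close(fd);
       return 0;
   }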

Thanks,

> (and mostly off if the disk is free), and possibly, if the loss of space
> by metadata could be reduced, I'd risk f2fs in production in one system
> here.
> 
> Greetings,
> 
> --
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\
> 
> ------------------------------------------------------------------------------
> Monitor Your Dynamic Infrastructure at Any Scale With Datadog!
> Get real-time metrics from all of your servers, apps and tools
> in one place.
> SourceForge users - Click here to start your Free Trial of Datadog now!
> http://pubads.g.doubleclick.net/gampad/clk?id=241902991&iu=/4140
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-23  6:06               ` Marc Lehmann
@ 2015-09-23  9:10                 ` Chao Yu
  2015-09-23 21:30                   ` Jaegeuk Kim
  2015-09-23 23:11                   ` Marc Lehmann
  0 siblings, 2 replies; 74+ messages in thread
From: Chao Yu @ 2015-09-23  9:10 UTC (permalink / raw)
  To: 'Marc Lehmann', 'Jaegeuk Kim'; +Cc: linux-f2fs-devel

Hi Marc,

The max hardlink number was increased to 0xffffffff by Jaegeuk in the
4.3-rc1 kernel; we can use it directly through a backport.

From a6db67f06fd9f6b1ddb11bcf4d7e8e8a86908d01 Mon Sep 17 00:00:00 2001
From: Jaegeuk Kim <jaegeuk@kernel.org>
Date: Mon, 10 Aug 2015 15:01:12 -0700
Subject: [PATCH] f2fs: increase the number of max hard links

This patch increases the number of maximum hard links for one file.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
---
 fs/f2fs/f2fs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 23bfc0c..8308488 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -321,7 +321,7 @@ enum {
 					 */
 };
 
-#define F2FS_LINK_MAX		32000	/* maximum link count per file */
+#define F2FS_LINK_MAX	0xffffffff	/* maximum link count per file */
 
 #define MAX_DIR_RA_PAGES	4	/* maximum ra pages of dir */
 
-- 
2.5.2



------------------------------------------------------------------------------

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-23  4:15             ` Marc Lehmann
  2015-09-23  6:00               ` Marc Lehmann
  2015-09-23  6:06               ` Marc Lehmann
@ 2015-09-23 21:29               ` Jaegeuk Kim
  2015-09-23 23:24                 ` Marc Lehmann
  2 siblings, 1 reply; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-23 21:29 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Wed, Sep 23, 2015 at 06:15:24AM +0200, Marc Lehmann wrote:
> On Tue, Sep 22, 2015 at 06:12:39PM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > Hmm. Is it necessary to reduce the number of active_logs?
> 
> I don't know, the documentation isn't very forthcoming with details :)
> 
> In any case, this is just for testing. My rationale was that multiple logs
> probably mean that there are multiple sequential write zones. Reducing those
> to only two logs would help the disk. Probably. Maybe.
> 
> > increase the GC overheads significantly.
> 
> Can you elaborate? I do get a speed improvement with only two logs, but of
> course, GC time is an important factor, so maybe more logs would be a
> necessary trade-off.

This will help you to understand more precisely.

https://www.usenix.org/system/files/conference/fast15/fast15-paper-lee.pdf

One GC pass needs to move all the valid blocks inside a section, so if the
section size is too large, every GC is likely to show very long latency.
In addition, we need more overprovision space too.

And, if the number of logs is small, GC can suffer because hot and cold
data blocks, which have somewhat different temporal locality, end up mixed
together.

Of course, these numbers highly depend on storage speed and workloads, so
it needs to be tuned up.
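
Just to put rough numbers on the GC latency (illustrative only; the 40MB/s
figure is simply the mixed read+write rate seen earlier in this thread, not
anything inherent to f2fs):

   /* back-of-the-envelope: data one GC pass may have to copy for a given
    * -s value, assuming 2MB segments and a mostly-valid victim section */
   #include <stdio.h>

   int main(void)
   {
       unsigned segs_per_sec[] = { 64, 128, 256 };   /* -s64, -s128, -s256 */
       double copy_rate = 40.0;                      /* assumed MB/s moved */

       for (int i = 0; i < 3; i++) {
           double section_mb = segs_per_sec[i] * 2.0;
           printf("-s%-3u: section %4.0f MB, worst-case pass ~%3.0f s\n",
                  segs_per_sec[i], section_mb, section_mb / copy_rate);
       }
       return 0;
   }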

Thanks,

> 
> > And, you can use inline_data in v4.2.
> 
> I think I did - the documentation says inline_data is the default.
> 
> > >     Filesystem                Size  Used Avail Use% Mounted on
> > >     /dev/mapper/vg_test-test  128G  106G   22G  84% /mnt
> > > 
> > >     # du -skc /mnt
> > >     51674268        /mnt
> > >     51674268        total
> > > 
> > >     Utilization: 67% (13168028 valid blocks)
> > 
> > Ok. I could retrieve the on-disk layout from the below log.
> > In the log, the overprovision area is set as about 54GB.
> > However, when I tried to do mkfs.f2fs with the same options, I got about 18GB.
> > Could you share the mkfs.f2fs messages and fsck.f2fs -d3 as well?
> 
> When I re-ran the mkfs.f2fs, I got:
> 
>    Filesystem                Size  Used Avail Use% Mounted on
>    /dev/mapper/vg_test-test  138G   20G  118G  14% /mnt
> 
> I didn't note down the overhead in my test, the df I had was when the disk
> was filled, so it possibly changed(?) at runtime?
> 
> (I tried debians mkfs.f2fs, but it gave identical results).
> 
> I'll redo the 128GiB test and see if I can get similar results.
> 
> > > However, when pausing rsync, f2fs immediatelly ceased doing anything again,
> > > so even though clearly there is a need for clean up activities, f2fs doesn't
> > > do them.
> > 
> > It seems that why f2fs didn't do gc was that all the sections were traversed
> > by background gc. In order to reset that, it needs to trigger checkpoint, but 
> > it couldn't meet the condition in background.
> > 
> > How about calling "sync" before leaving the system as idle?
> > Or, you can check decreasing the number in /sys/fs/f2fs/xxx/reclaim_segments to
> > 256 or 512?
> 
> Will try next time. I distinctly remember that sync didn't do anything to
> pre-free and free, though.
> 
> > > 1. the overprovisioning values seems to be completely out of this world. I'm
> > > prepared top give up maybe 50GB of my 8TB disk for this, but not more.
> > 
> > Maybe, it needs to check with other filesystems' *available* spaces.
> > Since, many of them hide additional FS metadata initially.
> 
> I habitually compare free space between filesystems. While f2fs is better
> than ext4 with default settings (and even with some tuning), ext4 is quite
> known to have excessive preallocated metadata requirements.
> 
> As mentioned in my other mail, XFS for example has 100GB more free
> space than f2fs on the full 8TB device, and from memory I expect other
> filesystems without fixed inode numbers (practically all of them) to be
> similar.
> 
> > > 3. why does f2fs sit idle on a highly fragmented filesystem, why does it not
> > > do background garbage collect at maximum I/O speed, so the filesystem is
> > > ready when the next writes come?
> > 
> > I suspect the section size is too large comparing to the whole partition size,
> > which number is only 509. Each GC selects a victim in a unit of section and
> > background GC would not select again for the previously visited ones.
> > So, I think GC is easy to traverse whole sections, and go to bed since there
> > is no new victims. So, I think checkpoint, "sync", resets whole history and
> > makes background GC conduct its job again.
> 
> The large section size is of course the whole point of the exercise, as
> hopefully this causes the GC to do larger sequential writes. It's clear
> that this is not a perfect match for these SMR drives, but the goal is to
> have acceptable performance, faster than a few megabytes/s. And indeed,
> when the GC runs, it gets quite good I/O performance in my test (deleting
> every nth file makes comparatively small holes, so the GC has to copy most
> of the section).
> 
> Now, the other thing is that the GC, when it triggers, isn't very
> aggressive - when I saw it, it was doing something every 10-15 seconds,
> with the system being idle, when it should be more or less completely busy.
> 
> I am aware that "idle" is a difficult to impossible condition to detect
> - maybe this could be made more tunable (I tried to play around with the
> gc_*_time values, but probably due to lack of documentation. I didn't get
> very far, and couldn't correlate the behaviour I saw with the settings I
> made).
> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-23  9:10                 ` Chao Yu
@ 2015-09-23 21:30                   ` Jaegeuk Kim
  2015-09-23 23:11                   ` Marc Lehmann
  1 sibling, 0 replies; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-23 21:30 UTC (permalink / raw)
  To: Chao Yu; +Cc: 'Marc Lehmann', linux-f2fs-devel

Thanks Chao.

That's right.

On Wed, Sep 23, 2015 at 05:10:21PM +0800, Chao Yu wrote:
> Hi Marc,
> 
> The max hardlink number was increased to 0xffffffff by Jaegeuk in 4.3 rc1
> Kernel, we can use it directly through backport.
> 
> >From a6db67f06fd9f6b1ddb11bcf4d7e8e8a86908d01 Mon Sep 17 00:00:00 2001
> From: Jaegeuk Kim <jaegeuk@kernel.org>
> Date: Mon, 10 Aug 2015 15:01:12 -0700
> Subject: [PATCH] f2fs: increase the number of max hard links
> 
> This patch increases the number of maximum hard links for one file.
> 
> Reviewed-by: Chao Yu <chao2.yu@samsung.com>
> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
> ---
>  fs/f2fs/f2fs.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index 23bfc0c..8308488 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -321,7 +321,7 @@ enum {
>  					 */
>  };
>  
> -#define F2FS_LINK_MAX		32000	/* maximum link count per file */
> +#define F2FS_LINK_MAX	0xffffffff	/* maximum link count per file */
>  
>  #define MAX_DIR_RA_PAGES	4	/* maximum ra pages of dir */
>  
> -- 
> 2.5.2

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* sync/umount hang on 3.18.21, 1.4TB gone after crash
@ 2015-09-23 21:58 Marc Lehmann
  2015-09-23 23:11 ` write performance difference 3.18.21/4.2.1 Marc Lehmann
  2015-09-24 18:50 ` sync/umount hang on 3.18.21, 1.4TB gone after crash Jaegeuk Kim
  0 siblings, 2 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-09-23 21:58 UTC (permalink / raw)
  To: linux-f2fs-devel

Hi!

I moved one of the SMR disks to another box with a 3.18.21 kernel.

I formatted and mounted like this:

   /opt/f2fs-tools/sbin/mkfs.f2fs -lTEST -s90 -t0 -a0 /dev/vg_test/test
   mount -t f2fs -onoatime,flush_merge,no_heap /dev/vg_test/test /mnt

I then copied (tar | tar) 2.1TB of data to the disk, which took about 6
hours, which is about the read speed of this data set (so the speed was very
good).

When I came back after ~10 hours, I found a number of hung task messages
in syslog, and when I entered sync, sync was consuming 100% system time.

I took a snapshot of /sys/kernel/debug/f2fs/status before sync, and the
values were "frozen", i.e. they didn't change.

I was able to read from the mounted filesystem normally, and I was able to
read and write the block device itself, so the disk is responsive.

After ~1h in this state, I tried to umount, which made the filesystem
mountpoint go away, but umount hangs, and /sys/kernel/debug/f2fs/status still
doesn't change.

This is the output of /sys/kernel/debug/f2fs/status:

http://ue.tst.eu/d88ce0e21a7ca0fb74b1ecadfa475df0.txt

I then deleted the device, but the echo 1 >/sys/block/sde/device/delete was
also hanging.

Here are /proc/.../stack outputs of sync, umount and bash(echo):

   sync:
   [<ffffffffffffffff>] 0xffffffffffffffff

   umount:
   [<ffffffff8139ba03>] call_rwsem_down_write_failed+0x13/0x20
   [<ffffffff811e7ee6>] deactivate_super+0x46/0x70
   [<ffffffff81204733>] cleanup_mnt+0x43/0x90
   [<ffffffff812047d2>] __cleanup_mnt+0x12/0x20
   [<ffffffff8108e8a4>] task_work_run+0xc4/0xe0
   [<ffffffff81012fa7>] do_notify_resume+0x97/0xb0
   [<ffffffff8178896f>] int_signal+0x12/0x17
   [<ffffffffffffffff>] 0xffffffffffffffff

   bash (delete):
   [<ffffffff810d8917>] msleep+0x37/0x50
   [<ffffffff8135d686>] __blk_drain_queue+0xa6/0x1a0
   [<ffffffff8135da05>] blk_cleanup_queue+0x1b5/0x1c0
   [<ffffffff8152082a>] __scsi_remove_device+0x5a/0xe0
   [<ffffffff815208d6>] scsi_remove_device+0x26/0x40
   [<ffffffff81520917>] sdev_store_delete+0x27/0x30
   [<ffffffff814bf748>] dev_attr_store+0x18/0x30
   [<ffffffff8125bc4d>] sysfs_kf_write+0x3d/0x50
   [<ffffffff8125b154>] kernfs_fop_write+0xe4/0x160
   [<ffffffff811e51a7>] vfs_write+0xb7/0x1f0
   [<ffffffff811e5c26>] SyS_write+0x46/0xb0
   [<ffffffff817886cd>] system_call_fastpath+0x16/0x1b
   [<ffffffffffffffff>] 0xffffffffffffffff

After a forced reboot, I ran fsck and got this, which looks good except
for the "Wrong segment type" message, which hopefully is harmless.

http://ue.tst.eu/4c750d2301a581cb07249d607aa0e6d0.txt

After mounting, status was this (and was changing):

http://ue.tst.eu/6462606ac3aa85bde0d6674365c86318.txt

Note that 1.4TB of data are missing(!)

This large amount of missing data was certainly unexpected. I assume f2fs
stopped checkpointing earlier, and only after a checkpoint is the data
safe, but being able to write 1.4TB of data without it ever reaching the
disk is very unexpected behaviour for a filesystem (which normally loses
about half a minute of data at most).

Minor question: since the disk actually has 4K physical sectors, and fsck
says sector size = 512, is there a way to teach f2fs that the physical
sector size is actually 4k, or does this not matter because f2fs will do
page-sized writes anyway?
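
(For reference, a quick way to see what the kernel itself reports - a small
sketch, assuming the underlying disk is still /dev/sde as above:

   cat /sys/block/sde/queue/logical_block_size    # 512 here
   cat /sys/block/sde/queue/physical_block_size   # 4096 on a 4K-sector drive
   blockdev --getss --getpbsz /dev/sde            # the same two values via util-linux
)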

In any case, any insights would be appreciated. I will attempt to upgrade
this box to linux 4.2.1 to see if that helps, but 3.18.x is the only
kernel known to work with SMR drives without any issues.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-23  6:00               ` Marc Lehmann
  2015-09-23  8:55                 ` Chao Yu
@ 2015-09-23 22:08                 ` Jaegeuk Kim
  2015-09-23 23:39                   ` Marc Lehmann
  1 sibling, 1 reply; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-23 22:08 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Wed, Sep 23, 2015 at 08:00:37AM +0200, Marc Lehmann wrote:
> On Wed, Sep 23, 2015 at 06:15:24AM +0200, Marc Lehmann <schmorp@schmorp.de> wrote:
> > > However, when I tried to do mkfs.f2fs with the same options, I got about 18GB.
> > > Could you share the mkfs.f2fs messages and fsck.f2fs -d3 as well?
> > 
> > When I re-ran the mkfs.f2fs, I got:
> 
> I get the feeling I did something idiotic, but for the life of it, I don't
> know what. I see the mkfs.f2fs in my test log, I see it in my command
> history, but for the life of it, I can't reproduce it.
> 
> So let's disregard this and go to the next test - I redid the 128G partition
> test, with 6 active logs, no -o and -s64:
> 
>    mkfs.f2fs -lTEST -s64 -t0 -a0
> 
> This allowed me to arrive at this state, at which rsync stopped making
> progress:
> 
>    root@shag:/sys/fs/f2fs/dm-1# df -H /mnt
>    Filesystem                Size  Used Avail Use% Mounted on
>    /dev/mapper/vg_test-test  138G  137G  803k 100% /mnt

Could you please share /sys/kernel/debug/f2fs/status?

> 
> This would be about perfect (I even got ENOSPC for the first
> time!). However, when I do my "delete every nth file":
> 
>    /dev/mapper/vg_test-test  138G  135G  1.8G  99% /mnt
> 
> The disk still sits mostly idle. I did verify that "sync" indeed reduces
> Pre-Free to 0, and I do see some activity every ~30s now, though:
> 
> http://ue.tst.eu/ac1ec447de214edc4e007623da2dda72.txt (see the dsk/sde
> columns).
> 
> If I start writing, I guess I trigger the foreground gc:
> 
> http://ue.tst.eu/1dfbac9166552a95551855000d820ce9.txt
> 
> The first few lines there are some background gc activity (I guess), then I
> started an rsync to write data - net/total shows the data rsync transfers.
> After that, there is constant ~40mb read/write activity, but very little
> actual write data gets to the disk (rsync makes progress at <100kb/s).
> 
> At some point I stop rsync (the single line with 0/0 for sde read/write,
> after the second header), followed by sync a second later. Sync does its
> job, and then there is no activity for a bit, until I start rsync again,
> which immediately triggers the 40/40 mode, and makes little progress.
> 
> So little to no gc activity, even though the filesystem really needs some
> GC activity at this point.
> 
> If I play around with gc_* like this:
> 
>    echo 1 >gc_idle
>    echo 1000 >gc_max_sleep_time
>    echo 5000 >gc_no_gc_sleep_time

As Chao mentioned, if the system is idle, f2fs starts to do GC with
gc_min_sleep_time.

> 
> Then I get a lot more activity:
> 
> http://ue.tst.eu/f05ee3ff52dc7814ee8352cc2d67f364.txt
> 
> But still, as you can see, a lot of the time the disk and the cpu are
> idle.
> 
> In any case, I think I am getting somewhere - until now all my tests ended in
> an unusable filesystem sooner or later; this is the first one which shows
> mostly expected behaviour.
> 
> Maybe -s128 (or -s256) with which I did my previous tests are problematic?
> Maybe the active_logs=2 caused problems (but I only used this option recently)?
> 
> And the previous problems can be explained by using inline_dentry and/or
> extent_cache.
> 
> Anyway, this behaviour is what I would expect, mostly.
> 
> Now, I could go with -s64 (128MB sections still span 4-7 zones with this
> disk). Or maybe something uneven, such as -s90, if that doesn't cause
> problems.
> 
> Also, if it were possible to tune the gc to be more aggressive when idle
> (and mostly off if the disk is free), and possibly, if the loss of space
> by metadata could be reduced, I'd risk f2fs in production in one system
> here.

When I did mkfs.f2fs on 128GB, I got the following numbers.

option                 overprovision area         reserved area
-o5 -s128              9094                       6144
-o5 -s64               6179                       3072
-o5 -s1                3309                       48
-o1 -s128              27009                      26624
-o1 -s64               13831                      13312
-o1 -s1                858                        208

-s1    (ovp:1%)        858                        208
-s64   (ovp:1%)        13831                      13312
-s128  (ovp:1%)        27009                      26624

So, I'm convinced that your initial test set "-o1 -s128" was an unlucky
trial. :)
Anyway, I've found a bug in the case without -o: "-s64" should select
a different overprovision ratio instead of "1%".

With the below patch, I could get:

-s1    (ovp:1%)        858                        208
-s64   (ovp:4%)        6172                       3712
-s128  (ovp:6%)        8721                       5120

From 6e2b58dcaffc2d88291e07fa1f99773eca04a58f Mon Sep 17 00:00:00 2001
From: Jaegeuk Kim <jaegeuk@kernel.org>
Date: Wed, 23 Sep 2015 14:59:30 -0700
Subject: [PATCH] mkfs.f2fs: fix wrong ovp space calculation on large section

If a section consists of multiple segments, we should change the equation
to apply it on reserved space.

On 128GB,

option                 overprovision area         reserved area
-o5 -s128              9094                       6144
-o5 -s64               6179                       3072
-o5 -s1                3309                       48
-o1 -s128              27009                      26624
-o1 -s64               13831                      13312
-o1 -s1                858                        208

-s1                    858                        208
-s64   *               13831                      13312
-s128  *               27009                      26624
: * should be wrong.

After patch,

-s1    (ovp:1%)        858                        208
-s64   (ovp:4%)        6172                       3712
-s128  (ovp:6%)        8721                       5120

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
---
 mkfs/f2fs_format.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mkfs/f2fs_format.c b/mkfs/f2fs_format.c
index 21e74fe..2d4ab09 100644
--- a/mkfs/f2fs_format.c
+++ b/mkfs/f2fs_format.c
@@ -171,7 +171,8 @@ static u_int32_t get_best_overprovision(void)
 	}
 
 	for (; candidate <= end; candidate += diff) {
-		reserved = 2 * (100 / candidate + 1) + 6;
+		reserved = (2 * (100 / candidate + 1) + 6) *
+						get_sb(segs_per_sec);
 		ovp = (get_sb(segment_count_main) - reserved) * candidate / 100;
 		space = get_sb(segment_count_main) - reserved - ovp;
 		if (max_space < space) {
-- 
2.1.1
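
(As a quick cross-check of the patched formula against the tables above - a
worked example, using C integer division:

   -s64,  4% chosen:  reserved = (2 * (100/4 + 1) + 6) * 64  = 58 * 64  = 3712
   -s128, 6% chosen:  reserved = (2 * (100/6 + 1) + 6) * 128 = 40 * 128 = 5120

both of which match the reserved-area column in the "After patch" table.)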


------------------------------------------------------------------------------

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/4.2.1
  2015-09-23 21:58 sync/umount hang on 3.18.21, 1.4TB gone after crash Marc Lehmann
@ 2015-09-23 23:11 ` Marc Lehmann
  2015-09-24 18:28   ` Jaegeuk Kim
  2015-09-24 18:50 ` sync/umount hang on 3.18.21, 1.4TB gone after crash Jaegeuk Kim
  1 sibling, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-09-23 23:11 UTC (permalink / raw)
  To: linux-f2fs-devel

I tried twice with the same options, on linux 3.18.21 and 4.2.1, and write
performance is very different on these kernels (I rebooted every time in
between).

On 3.18.21, the source raid is hardly able to keep up with the write
speed; f2fs basically writes at more or less full disk speed, for extended
periods of time. As mentioned earlier, average write speed was 103MB/s
over the whole 2.1TB, with f2fs being idle a lot of that time, so actual
write speed would be higher.

On 4.2.1, with the same mkfs+mount options, performance is 5-10 times lower, i.e.
<<100MB/s (and more like 20MB/s for extended stretches):

http://ue.tst.eu/43a8f7ae96ac770dc45ac8f1e3b0c479.txt
(see the dsk/sde columns)

First there is a "good" stretch with >100MB/s, then it starts to degrade.
Another difference is the frequent read activity, which is mostly absent on
3.18.21.

The log starts after only 20GB had been written. Here is the status output
at 25GB:

http://ue.tst.eu/734f7883107ee4dabff77602db92310b.txt

Any idea why 4.2.1 performs so much worse than 3.18.21?

(Note, if that is the difference between "filesystem works" (4.2.1?) and
"filesystem doesn't work" (3.18.21) then that might be it - still, that
would mean f2fs performs worse than traditional filesystems on these
disks).

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-23  9:10                 ` Chao Yu
  2015-09-23 21:30                   ` Jaegeuk Kim
@ 2015-09-23 23:11                   ` Marc Lehmann
  1 sibling, 0 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-09-23 23:11 UTC (permalink / raw)
  To: Chao Yu; +Cc: 'Jaegeuk Kim', linux-f2fs-devel

On Wed, Sep 23, 2015 at 05:10:21PM +0800, Chao Yu <chao2.yu@samsung.com> wrote:
> The max hardlink number was increased to 0xffffffff by Jaegeuk in the 4.3-rc1
> kernel; we can use it directly through a backport.

That's absolutely wonderful news, thanks a lot!

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-23 21:29               ` Jaegeuk Kim
@ 2015-09-23 23:24                 ` Marc Lehmann
  2015-09-24 17:51                   ` Jaegeuk Kim
  0 siblings, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-09-23 23:24 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Wed, Sep 23, 2015 at 02:29:31PM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > Can you elaborate? I do get a speed improvement with only two logs, but of
> > course, GC time is an important factor, so maybe more logs would be a
> > necessary trade-off.
> 
> This will help you to understand more precisely.

Thanks, will read more thoroughly, but that means I probably do want two logs.
Regarding your elaboration:

> One GC needs to move whole valid blocks inside a section, so if the section
> size is too large, every GC is likely to show very long latency.
> > In addition, we need more overprovision space too.

That wouldn't increase the overhead in general though, because the
overhead depends on how much space is free in each section.

> And, if the number of logs is small, GC can suffer from moving hot and cold
> data blocks which represents somewhat temporal locality.

I am somewhat skeptical of this for (one of) my usages (archival),
because there is absolutely no way to know in advance what is hot and what
is cold. Example: a file might be deleted, but there is no way in advance
to know which one it will be. The only thing I know is that files never get
modified after being written once (but often replaced). In another of my
usages, files do get modified, but there is no way to know in advance
which ones it will be, and they will only ever be modified once (after initial
allocation).

So I am very suspicious of both static and dynamic attempts to separate
data into hot/cold. You can't know from file extensions, and you can't
know from past modification history.

The only applicability of hot/cold I can see is filesystem metadata and
directories (files get moved/renamed/added), and afaics, f2fs already does
that.

> Of course, these numbers highly depend on storage speed and workloads, so
> it needs to be tuned up.

From your original comment, I assumed that the gc somehow needs more logs
to be more efficient for some internal reason, but it seems it is mostly
a matter of section size (which I want to have "unreasonably" large),
which means potentially a lot of valid data has to be moved, and of hot/cold
data, which I am very skeptical about.

(I think hot/cold works absolutely splendidly for normal desktop uses and
most forms of /home, though).

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-23  8:55                 ` Chao Yu
@ 2015-09-23 23:30                   ` Marc Lehmann
  2015-09-23 23:43                     ` Marc Lehmann
  2015-09-25  8:05                     ` Chao Yu
  0 siblings, 2 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-09-23 23:30 UTC (permalink / raw)
  To: Chao Yu; +Cc: 'Jaegeuk Kim', linux-f2fs-devel

On Wed, Sep 23, 2015 at 04:55:57PM +0800, Chao Yu <chao2.yu@samsung.com> wrote:
> >    echo 1 >gc_idle
> >    echo 1000 >gc_max_sleep_time
> >    echo 5000 >gc_no_gc_sleep_time
> 
> One thing I note is that gc_min_sleep_time is not set in your script,
> so in some condition gc may still do the sleep with gc_min_sleep_time (30
> seconds by default) instead of gc_max_sleep_time which we expect.

Ah, sorry, I actually set gc_min_sleep_time to 100, but forgot to include
it.

> In the 4.3-rc1 kernel, we have added a new ioctl to trigger gc in batches; maybe
> we can use it as one option.

Yes, such an ioctl could be useful to me, although I do not intend to have
background gc off.

I assume that the ioctl will block for the time it runs, and I can ask it
to do up to 16 batches in one go (by default)? That sounds indeed very
useful to have.

What is "one batch" in terms of gc, one section?

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-23 22:08                 ` Jaegeuk Kim
@ 2015-09-23 23:39                   ` Marc Lehmann
  2015-09-24 17:27                     ` Jaegeuk Kim
  0 siblings, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-09-23 23:39 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Wed, Sep 23, 2015 at 03:08:23PM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> >    root@shag:/sys/fs/f2fs/dm-1# df -H /mnt
> >    Filesystem                Size  Used Avail Use% Mounted on
> >    /dev/mapper/vg_test-test  138G  137G  803k 100% /mnt
> 
> Could you please share /sys/kernel/debug/f2fs/status?

Uh, sorry, I planned to, but forgot, probably because I thought the result
was so good it didn't need any checking :)

> So, I'm convinced that your inital test set "-o1 -s128", which was an unlucky
> trial. :)

hmm... since the point is to simulate a full 8TB partition, having large
overprovision/reserved space AND a large section size might actually have been
a good test, as it would simulate the 8TB case better, which would also have
larger overprovisioning and a larger section size.

In the end, I might settle with -s64, and currently do tests with -s90.

I was just scared that overprovisioning might turn out to be extremely large
with 8TB.

I have since then dropped -o from all my mkfs.f2fs invocations, seeing
that the resulting filesystem does not actually have 5% overprovisioning.

> Subject: [PATCH] mkfs.f2fs: fix wrong ovp space calculation on large section

Hmm, the latest change in
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git is from
August 10 - do I need to select a branch (I am not good with git)?

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-23 23:30                   ` Marc Lehmann
@ 2015-09-23 23:43                     ` Marc Lehmann
  2015-09-24 17:21                       ` Jaegeuk Kim
  2015-09-25  8:05                     ` Chao Yu
  1 sibling, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-09-23 23:43 UTC (permalink / raw)
  To: Chao Yu; +Cc: 'Jaegeuk Kim', linux-f2fs-devel

On Thu, Sep 24, 2015 at 01:30:22AM +0200, Marc Lehmann <schmorp@schmorp.de> wrote:
> > One thing I note is that gc_min_sleep_time is not set in your script,
> > so in some condition gc may still do the sleep with gc_min_sleep_time (30
> > seconds by default) instead of gc_max_sleep_time which we expect.
> 
> Ah, sorry, I actually set gc_min_sleep_time to 100, but forgot to include
> it.

Sorry, that sounded confusing - I set it to 100 in previous tests, and forgot
to include it, so it was running with 30000. When experimenting, I actually
do get the gc to do more frequent operations now.

Is there any obvious harm in setting it to a very low value (such as 100 or 10)?

I assume all it does is leave less of a time buffer between the last operation
and the gc starting. When I write in batches, or when I know the fs will be
idle, there shouldn't be any harm, performance-wise, in letting it work all
the time.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-23 23:43                     ` Marc Lehmann
@ 2015-09-24 17:21                       ` Jaegeuk Kim
  2015-09-25  8:28                         ` Chao Yu
  0 siblings, 1 reply; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-24 17:21 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Thu, Sep 24, 2015 at 01:43:24AM +0200, Marc Lehmann wrote:
> On Thu, Sep 24, 2015 at 01:30:22AM +0200, Marc Lehmann <schmorp@schmorp.de> wrote:
> > > One thing I note is that gc_min_sleep_time is not set in your script,
> > > so in some condition gc may still do the sleep with gc_min_sleep_time (30
> > > seconds by default) instead of gc_max_sleep_time which we expect.
> > 
> > Ah, sorry, I actually set gc_min_sleep_time to 100, but forgot to include
> > it.
> 
> Sorry, that sounded confusing - I set it to 100 in previous tests, and forgot
> to include it, so it was running with 30000. When experimenting, I actually
> do get the gc to do more frequent operations now.
> 
> Is there any obvious harm setting it to a very low value (such as 100 or 10)?
> 
> I assume all it does is have less time buffer between the last operation
> and the gc starting. When I write in batches, or when I know the fs will be
> idle, there shouldn't be any harm, performance wise, of letting it work all
> the time.

Yeah, I don't think it matters much with very small time periods, since the timer
is set after background GC is done.
But we use msecs_to_jiffies(), so I'd hope not to use something like 10 ms, since
each background GC reads victim blocks into the page cache and then just
sets them dirty.
That means that, after a while, we hope the flusher will write them all to disk
and we finally get a free section.
So, IMO, we need to give some time slots to the flusher as well.

For example, if write bandwidth is 30MB/s and the section size is 128MB, it takes
about 4 secs to write one section.
So, how about setting
 - gc_min_time to 1~2 secs,
 - gc_max_time to 3~4 secs,
 - gc_idle_time to 10 secs,
 - reclaim_segments to 64 (sync when 1 section becomes prefree)
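
For reference, applied to the sysfs knobs used earlier in this thread, that
would look roughly like this (a sketch only; values are in milliseconds, and
it assumes "gc_min_time"/"gc_max_time"/"gc_idle_time" above map onto the
gc_min_sleep_time, gc_max_sleep_time and gc_no_gc_sleep_time entries under
/sys/fs/f2fs/<device>, e.g. dm-1 in the earlier logs):

   cd /sys/fs/f2fs/dm-1
   echo 2000  > gc_min_sleep_time     # 1~2 secs
   echo 4000  > gc_max_sleep_time     # 3~4 secs
   echo 10000 > gc_no_gc_sleep_time   # ~10 secs when there is nothing to do
   echo 64    > reclaim_segments      # with -s64, sync once one section is prefree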

Thanks,

> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-23 23:39                   ` Marc Lehmann
@ 2015-09-24 17:27                     ` Jaegeuk Kim
  2015-09-25  5:42                       ` Marc Lehmann
  0 siblings, 1 reply; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-24 17:27 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Thu, Sep 24, 2015 at 01:39:38AM +0200, Marc Lehmann wrote:
> On Wed, Sep 23, 2015 at 03:08:23PM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > >    root@shag:/sys/fs/f2fs/dm-1# df -H /mnt
> > >    Filesystem                Size  Used Avail Use% Mounted on
> > >    /dev/mapper/vg_test-test  138G  137G  803k 100% /mnt
> > 
> > Could you please share /sys/kernel/debug/f2fs/status?
> 
> Uh, sorry, I planned to, but forgot, probably because I thought the result
> was so good it didn't need any checking :)
> 
> > So, I'm convinced that your inital test set "-o1 -s128", which was an unlucky
> > trial. :)
> 
> hmm... since the point is to simulate a full 8TB partition, having large
> overprovision/reserved space AND large section size might actually have been
> a good test, as it would simulate the TB case better, which would also have
> larger overprovisioning and the larger section size.
> 
> In the end, I might settle with -s64, and currently do tests with -s90.

Got it. But why -s90? :)

> I was just scared that overprovisioning might turn out ot be extremely large
> with 8TB.
> 
> I have since then dropped -o from all my mkfs.f2fs invocations, seeing
> that the resulting filesystem does not actually have 5% overprovisioning.
> 
> > Subject: [PATCH] mkfs.f2fs: fix wrong ovp space calculation on large section
> 
> Hmm, the latest change in
> git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git is from
> august 10 - do I need to select a branch (I am not good with git)?

I just pushed the patches to master branch in f2fs-tools.git.
Could you pull them and check them?

I added one more patch to avoid the harmless sit_type fixes you reported previously.

And, for the 8TB case, let me check again. It seems that we need to handle
overprovision ratios under 1% (e.g., 0.5%).

Thanks,

> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-23 23:24                 ` Marc Lehmann
@ 2015-09-24 17:51                   ` Jaegeuk Kim
  0 siblings, 0 replies; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-24 17:51 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Thu, Sep 24, 2015 at 01:24:14AM +0200, Marc Lehmann wrote:
> On Wed, Sep 23, 2015 at 02:29:31PM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > > Can you elaborate? I do get a speed improvement with only two logs, but of
> > > course, GC time is an important factor, so maybe more logs would be a
> > > necessary trade-off.
> > 
> > This will help you to understand more precisely.
> 
> Thanks, will read more thoroughly, but that means I probably do want two logs.
> Regarding your elaboration:
> 
> > One GC needs to move whole valid blocks inside a section, so if the section
> > size is too large, every GC is likely to show very long latency.
> > In addion, we need more overprovision space too.
> 
> That wouldn't increase the overhead in general though, because the
> overhead depends on how much space is free in each section.

Surely, it depends on workloads.

> 
> > And, if the number of logs is small, GC can suffer from moving hot and cold
> > data blocks which represents somewhat temporal locality.
> 
> I am somewhat skeptical of this for (on of my) my usage(s) (archival),
> because there is absolutely no way to know in advance what is hot and what
> is cold. Example: a file might be deleted, but there is no way in advance
> to know which it will be. The only thing I know is that files never get
> modified after written once (but often replaced). In another of of my
> usages, files do get modified, but there is no way to know in advance
> which it will be, and they will only ever be modified once (after initial
> allocation).
> 
> So I am very suspicious of both static and dynamic attempts to seperate
> data into hot/cold. You can't know from file extensions, and you can't
> know from past modification history.

Yes, regarding user data, we cannot actually determine the hotness of every
piece of data.

> The only applicability of hot/cold I can see is filesystem metadata and
> directories (files get moved/renamed/added), and afaics, f2fs already does
> that.

It does, all the time. But what I'm curious about is the effect of splitting
directories and files explicitly. If we use two logs, f2fs only splits metadata
from data.
But, if we use at least 4 logs, it also splits each of metadata and data according
to their origin, directory or user file.

For example, if I can represent blocks like:

 D : dentry block
 U : user block
 I : directory inode
 F : file inode,
 O : obsolete

1) in 2 logs, each section can consist of
  DDUUUUUDDUUUUU     IFFFFIFFFFFF

2) in 4 logs,
   DDDD     UUUUUUUUUUU     II        FFFFFFFFFF

Then, if we rename files or delete files,
1) in 2 logs,
  OOUUUUUODUUUUDD    IOOOOIFFOOFI

2) in 4 logs,
  OOODDD      OOOOOUUUOUUU      OOIII      OOOOFFOOFFFF

So, I expect, we can reduce the number of valid blocks if we use 4 logs.

Surely, if workloads produce mostly a huge number of data blocks, I think two
logs are enough. Using more logs would not show a big impact.

Thanks,

> 
> > Of course, these numbers highly depend on storage speed and workloads, so
> > it needs to be tuned up.
> 
> From your original comment, I assumed that the gc somehow needs more logs
> to be more efficient for some internal reason, but it seems since it is
> mostly a matter of section size (which I want to have "unreasonably" large),
> which means potentially a lot of valid data has to be moved, and hot/cold
> data, which I am very skeptical about.
> 
> (I think hot/cold works absolutely splendid for normal desktop uses and
> most forms of /home, though).
> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/4.2.1
  2015-09-23 23:11 ` write performance difference 3.18.21/4.2.1 Marc Lehmann
@ 2015-09-24 18:28   ` Jaegeuk Kim
  2015-09-24 23:20     ` Marc Lehmann
  2015-09-25  6:50     ` Marc Lehmann
  0 siblings, 2 replies; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-24 18:28 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Thu, Sep 24, 2015 at 01:11:03AM +0200, Marc Lehmann wrote:
> I tried twice with the same options, on linux 3.18.21 and 4.2.1, and write
> performance is very different on these kernels (I rebooted every time in
> between).
> 
> On 3.18.21, the source raid is hardly able to keep up with the write
> speed, f2fs basically writes at more or less full disk speed, for extended
> periods of time. As mentiopned earlier, average write speed was 103MB/s
> over the whole 2.1TB, with f2fs being idle a lot of that time, so actual
> write speed would be higher.
> 
> On 4.2.1, same mkfs+mount options, performance is 5-10 times less, i.e.
> <<100MB/s (and more like 20MB/s for extendesd stretches):
> 
> http://ue.tst.eu/43a8f7ae96ac770dc45ac8f1e3b0c479.txt
> (see the dsk/sde columns)
> 
> First there is a "good" stretch with >100MB/s, then it starts to degrade.
> Another difference is the frequent read activity, which is mostly absent on
> 3.18.21.
> 
> The log starts after only 20GB had been written. Here is the status output
> at 25GB:
> 
> http://ue.tst.eu/734f7883107ee4dabff77602db92310b.txt

It seems about 700MB were moved by background GC.

> 
> Any idea why 4.2.1 performs so much worse than 3.18.21?

One thing that we can try is to run the latest f2fs source in v3.18.
This branch supports f2fs for v3.18.

http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs.git/log/?h=linux-3.18

Could you check out this branch and then copy the following files into your
v3.18 kernel?

- fs/f2fs/*
- include/linux/f2fs_fs.h
- include/trace/events/f2fs.h
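
For example, a minimal sketch of the copy step (paths are illustrative; it
assumes a vanilla 3.18.21 tree under /usr/src/linux-3.18.21 and a full kernel
rebuild afterwards, so that module and headers stay in sync):

   git clone -b linux-3.18 git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git
   cp -a f2fs/fs/f2fs/. /usr/src/linux-3.18.21/fs/f2fs/
   cp f2fs/include/linux/f2fs_fs.h /usr/src/linux-3.18.21/include/linux/
   cp f2fs/include/trace/events/f2fs.h /usr/src/linux-3.18.21/include/trace/events/
   # then rebuild and install the kernel as usual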

And, if possible, could you share the status output of both of v3.18 and v4.2?

Thanks,

> 
> (Note, if that is the difference between "filesystem works" (4.2.1?) and
> "filesystem doesn't work" (3.18.21) then that might be it - still, that
> would mean f2fs performs worse than traditional filesystems on these
> disks).
> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: sync/umount hang on 3.18.21, 1.4TB gone after crash
  2015-09-23 21:58 sync/umount hang on 3.18.21, 1.4TB gone after crash Marc Lehmann
  2015-09-23 23:11 ` write performance difference 3.18.21/4.2.1 Marc Lehmann
@ 2015-09-24 18:50 ` Jaegeuk Kim
  2015-09-25  6:00   ` Marc Lehmann
  2015-09-25  9:13   ` Chao Yu
  1 sibling, 2 replies; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-24 18:50 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Wed, Sep 23, 2015 at 11:58:51PM +0200, Marc Lehmann wrote:
> Hi!
> 
> I moved one of the SMR disks to another box with a 3.18.21 kernel.
> 
> I formatted and mounted like this:
> 
>    /opt/f2fs-tools/sbin/mkfs.f2fs -lTEST -s90 -t0 -a0 /dev/vg_test/test
>    mount -t f2fs -onoatime,flush_merge,no_heap /dev/vg_test/test /mnt
> 
> I then copied (tar | tar) 2.1TB of data to the disk, which took about 6
> hours, which is about the read speed of this data set (so the speed was very
> good).
> 
> When I came back after ~10 hours, I found a number of hung task messages
> in syslog, and when I entered sync, sync was consuming 100% system time.

Hmm, at this time, it would be good to check what process is stuck through
sysrq.

> I took a snapshot of /sys/kernel/debug/f2fs/status before sync, and the
> values arfe "frozen", i.e. they didn't change.
> 
> I was able to read from the mounted filesystem normally, and I was able to
> read and write the block device itself, so the disk is responsive.
> 
> After ~1h in this state, I tried to umount, which made the filesystem
> mountpoint go away, but umount hangs, and /sys/kernel/debug/f2fs/status still
> doesn't change.
> 
> This is the output of /sys/kernel/debug/f2fs/status:
> 
> http://ue.tst.eu/d88ce0e21a7ca0fb74b1ecadfa475df0.txt
> 
> I then deleted the device, but the echo 1 >/sys/block/sde/device/delete was
> also hanging.
> 
> Here are /proc/.../stack outputs of sync, umount and bash(echo):
> 
>    sync:
>    [<ffffffffffffffff>] 0xffffffffffffffff
> 
>    umount:
>    [<ffffffff8139ba03>] call_rwsem_down_write_failed+0x13/0x20
>    [<ffffffff811e7ee6>] deactivate_super+0x46/0x70
>    [<ffffffff81204733>] cleanup_mnt+0x43/0x90
>    [<ffffffff812047d2>] __cleanup_mnt+0x12/0x20
>    [<ffffffff8108e8a4>] task_work_run+0xc4/0xe0
>    [<ffffffff81012fa7>] do_notify_resume+0x97/0xb0
>    [<ffffffff8178896f>] int_signal+0x12/0x17
>    [<ffffffffffffffff>] 0xffffffffffffffff
> 
>    bash (delete):
>    [<ffffffff810d8917>] msleep+0x37/0x50
>    [<ffffffff8135d686>] __blk_drain_queue+0xa6/0x1a0
>    [<ffffffff8135da05>] blk_cleanup_queue+0x1b5/0x1c0
>    [<ffffffff8152082a>] __scsi_remove_device+0x5a/0xe0
>    [<ffffffff815208d6>] scsi_remove_device+0x26/0x40
>    [<ffffffff81520917>] sdev_store_delete+0x27/0x30
>    [<ffffffff814bf748>] dev_attr_store+0x18/0x30
>    [<ffffffff8125bc4d>] sysfs_kf_write+0x3d/0x50
>    [<ffffffff8125b154>] kernfs_fop_write+0xe4/0x160
>    [<ffffffff811e51a7>] vfs_write+0xb7/0x1f0
>    [<ffffffff811e5c26>] SyS_write+0x46/0xb0
>    [<ffffffff817886cd>] system_call_fastpath+0x16/0x1b
>    [<ffffffffffffffff>] 0xffffffffffffffff
> 
> After a forced reboot, I did a fsck, and got this, which looks good except
> for the "Wrong segment type" message, which hopefully is harmless.
> 
> http://ue.tst.eu/4c750d2301a581cb07249d607aa0e6d0.txt
> 
> After mounting, status was this (and was changing):
> 
> http://ue.tst.eu/6462606ac3aa85bde0d6674365c86318.txt
> 
> Note that 1.4TB of data are missing(!)
> 
> This large amount of missing data was certainly unexpected. I assume f2fs
> stopped checkpointing earlier, and only after a checkpoint the data is
> safe, but being able to write 1.4TB of data without it ever reaching the
> disk is very unexpected behaviour for a filesystem (which normally loses
> about half a minute of data at most).

It seems there was no fsync after sync at all. That's why f2fs recovered back to
the latest checkpoint. Anyway, I'm thinking that it's worth adding some kind of
periodic checkpoint.

> 
> Minor question, since the disk actually has 4K physical sectors, and fsck
> says sector size = 512, is there a way to teach f2fs that the physical
> sector size is actually 4k, or does this not matter because f2fs will do
> page-sized writes anyways?

No problem. The fsck just reads the sector size from the block device, which
is used when calculating the total partition size.

> 
> In any case, any insights would be appreciated. I will attwmpt to upgrade
> this box to linux 4.2.1 to see if that helps, but 3.18.x is the onl
> kernel known to work with smr drives without any issues.
> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/4.2.1
  2015-09-24 18:28   ` Jaegeuk Kim
@ 2015-09-24 23:20     ` Marc Lehmann
  2015-09-24 23:27       ` Marc Lehmann
  2015-09-25  6:50     ` Marc Lehmann
  1 sibling, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-09-24 23:20 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Thu, Sep 24, 2015 at 11:28:36AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > The log starts after only 20GB had been written. Here is the status output
> > at 25GB:
> > 
> > http://ue.tst.eu/734f7883107ee4dabff77602db92310b.txt
> 
> It seems about 700MB were moved by background GC.

And to quantify "slower", copying 2.1TiB took 18h, at 35.6MB/s. "sync" was
routinely taking minutes to execute.

As a side note, a "du" on a cold cache pulled in data from the drive at
~26MB/s, which is quite impressive. Given the number of files, that means
it is pulling in ~4.4kb per stat. It also means it stats at 5869 files/s,
which is very good! I hadn't expected this from a flash file system on
rotational media.
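
(A quick added cross-check of those numbers: 26 MB/s at ~4.4 kB per stat works
out to roughly 5,900 stats per second, which lines up with the 5869 files/s
figure.)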

> > Any idea why 4.2.1 performs so much worse than 3.8.21?
> 
> One thing that we can try is to run the latest f2fs source in v3.18.
> This branch supports f2fs for v3.18.

Will do!

> And, if possible, could you share the status output of both of v3.18 and v4.2?

Will do, this is the output for 4.2.1, after copying and having it sit
idle for a few hours.

http://ue.tst.eu/90522477a480ae7a3b71647269b31ff7.txt

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/4.2.1
  2015-09-24 23:20     ` Marc Lehmann
@ 2015-09-24 23:27       ` Marc Lehmann
  0 siblings, 0 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-09-24 23:27 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Fri, Sep 25, 2015 at 01:20:25AM +0200, Marc Lehmann <schmorp@schmorp.de> wrote:
> On Thu, Sep 24, 2015 at 11:28:36AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > > The log starts after only 20GB had been written. Here is the status output
> > > at 25GB:
> > > 
> > > http://ue.tst.eu/734f7883107ee4dabff77602db92310b.txt
> > 
> > It seems about 700MB were moved by background GC.

And before rebooting I did another sync, which took 130s, using 100% system
time. This is the status output afterward:

http://ue.tst.eu/6d0398c4184fdbef059e90a6bf454241.txt

Is it normal that f2fs takes this long for a sync? What happens when,
during shutdown, there is a timeout during umount and systemd shuts down
anyway, or the fs doesn't get unmounted - will there be data loss because
of rolling back?

I am a bit concerned, because the filesystem was sitting idle for hours,
and I would expect a crash at that point not to lose any data by reverting
to a checkpoint.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-24 17:27                     ` Jaegeuk Kim
@ 2015-09-25  5:42                       ` Marc Lehmann
  2015-09-25 17:45                         ` Jaegeuk Kim
  0 siblings, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-09-25  5:42 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Thu, Sep 24, 2015 at 10:27:49AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > In the end, I might settle with -s64, and currently do tests with -s90.
> 
> Got it. But why -s90? :)

Heh :) It's a nothing-special number between 64 and 128, that's all.

> I just pushed the patches to master branch in f2fs-tools.git.
> Could you pull them and check them?

Got them, last patch was the "check sit types" change.

> I added one more patch to avoid harmless sit_type fixes previously you reported.
> 
> And, for the 8TB case, let me check again. It seems that we need to handle under
> 1% overprovision ratio. (e.g., 0.5%)

That might make me potentially very happy. But my main concern at the
moment is stability - even when you have a backup, restoring 8TB will take
days, and backups are never up to date.

It would be nice to be able to control it more from the user side though.

For example, I have not yet reached 0.0% free with f2fs. That's fine, I don't
plan to, but I need to know at which percentage I should stop, which is
something I can only really find out with experiments.

And just filling these 8TB disks takes days, so the question is whether I can
simulate near-full behaviour with smaller partitions.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: sync/umount hang on 3.18.21, 1.4TB gone after crash
  2015-09-24 18:50 ` sync/umount hang on 3.18.21, 1.4TB gone after crash Jaegeuk Kim
@ 2015-09-25  6:00   ` Marc Lehmann
  2015-09-25  6:01     ` Marc Lehmann
  2015-09-25 18:42     ` Jaegeuk Kim
  2015-09-25  9:13   ` Chao Yu
  1 sibling, 2 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-09-25  6:00 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Thu, Sep 24, 2015 at 11:50:23AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > When I came back after ~10 hours, I found a number of hung task messages
> > in syslog, and when I entered sync, sync was consuming 100% system time.
> 
> Hmm, at this time, it would be good to check what process is stuck through
> sysrq.

It was only intermittent, but here they are. The first one is almost
certainly the sync that I originally didn't have a backtrace for; the
second one came up frequently during the f2fs test.

   INFO: task sync:10577 blocked for more than 120 seconds.
         Tainted: G        W  OE   4.2.1-040201-generic #201509211431
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   sync            D ffff88082ec964c0     0 10577  10549 0x00000000
    ffff88000210fdc8 0000000000000082 ffff88062ef2a940 ffff88010337e040
    0000000000000246 ffff880002110000 ffff8806294915f8 ffff8805c939b800
    ffff88000210fe54 ffffffff8121a910 ffff88000210fde8 ffffffff817a5a37
   Call Trace:
    [<ffffffff8121a910>] ? SyS_tee+0x360/0x360
    [<ffffffff817a5a37>] schedule+0x37/0x80
    [<ffffffff81211f09>] wb_wait_for_completion+0x49/0x80
    [<ffffffff810b6f90>] ? prepare_to_wait_event+0xf0/0xf0
    [<ffffffff81213134>] sync_inodes_sb+0x94/0x1b0
    [<ffffffff8121a910>] ? SyS_tee+0x360/0x360
    [<ffffffff8121a925>] sync_inodes_one_sb+0x15/0x20
    [<ffffffff811ed1b9>] iterate_supers+0xb9/0x110
    [<ffffffff8121ac65>] sys_sync+0x35/0x90
    [<ffffffff817a9272>] entry_SYSCALL_64_fastpath+0x16/0x75

   INFO: task watchdog/1:14743 blocked for more than 120 seconds.
         Tainted: P           OE  3.18.21-031821-generic #201509020527
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   watchdog/1      D ffff88082ec93300     0 14743      2 0x00000000
    ffff8801a2383c48 0000000000000046 ffff880273a50000 0000000000013300
    ffff8801a2383fd8 0000000000013300 ffff8802e642a800 ffff880273a50000
    0000000000001000 ffffffff81c23d80 ffffffff81c23d84 ffff880273a50000
   Call Trace:
    [<ffffffff817847f9>] schedule_preempt_disabled+0x29/0x70
    [<ffffffff81786435>] __mutex_lock_slowpath+0x95/0x100
    [<ffffffff810a8ac9>] ? enqueue_entity+0x289/0xb20
    [<ffffffff817864c3>] mutex_lock+0x23/0x37
    [<ffffffff81029823>] x86_pmu_event_init+0x343/0x430
    [<ffffffff811680db>] perf_init_event+0xcb/0x130
    [<ffffffff811684d8>] perf_event_alloc+0x398/0x440
    [<ffffffff810a8431>] ? put_prev_entity+0x31/0x3f0
    [<ffffffff811249b0>] ? restart_watchdog_hrtimer+0x60/0x60
    [<ffffffff81169156>] perf_event_create_kernel_counter+0x26/0x100
    [<ffffffff8112477d>] watchdog_nmi_enable+0xcd/0x170
    [<ffffffff81124865>] watchdog_enable+0x45/0xa0
    [<ffffffff81093f09>] smpboot_thread_fn+0xb9/0x1a0
    [<ffffffff8108ff9c>] ? __kthread_parkme+0x4c/0x80
    [<ffffffff81093e50>] ? SyS_setgroups+0x180/0x180
    [<ffffffff81090219>] kthread+0xc9/0xe0
    [<ffffffff81090150>] ? kthread_create_on_node+0x180/0x180
    [<ffffffff81788618>] ret_from_fork+0x58/0x90
    [<ffffffff81090150>] ? kthread_create_on_node+0x180/0x180

The watchdog might or might not be unrelated, but it is either a 4.2.1
thing (new kernel) or f2fs related. I only had them during the f2fs test,
and often, not before or after.

(I don't know what that kernel thread does, but the system was somewhat
sluggish during the test, and other, unrelated services were negatively
affected).

> It seems there was no fsync after sync at all. That's why f2fs recovered back to
> the latest checkpoint. Anyway, I'm thinking that it's worth to add a kind of
> periodic checkpoints.

Well, would it sync more often if this problem hadn't occurred? Most
filesystems (or rather, the filesystems I use: btrfs, xfs, ext* and zfs)
seem to have their own regular commit interval, or otherwise commit
frequently if it is cheap enough.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: sync/umount hang on 3.18.21, 1.4TB gone after crash
  2015-09-25  6:00   ` Marc Lehmann
@ 2015-09-25  6:01     ` Marc Lehmann
  2015-09-25 18:42     ` Jaegeuk Kim
  1 sibling, 0 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-09-25  6:01 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Fri, Sep 25, 2015 at 08:00:19AM +0200, Marc Lehmann <schmorp@schmorp.de> wrote:
> On Thu, Sep 24, 2015 at 11:50:23AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > > When I came back after ~10 hours, I found a number of hung task messages
> > > in syslog, and when I entered sync, sync was consuming 100% system time.
> > 
> > Hmm, at this time, it would be good to check what process is stuck through
> > sysrq.
> 
> It was only intermittently, but here they are.

I meant "here are backtraces from the stuck process", from syslog, not via
sysrq of course.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/4.2.1
  2015-09-24 18:28   ` Jaegeuk Kim
  2015-09-24 23:20     ` Marc Lehmann
@ 2015-09-25  6:50     ` Marc Lehmann
  2015-09-25  9:47       ` Chao Yu
  2015-09-25 18:26       ` Jaegeuk Kim
  1 sibling, 2 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-09-25  6:50 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Thu, Sep 24, 2015 at 11:28:36AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> One thing that we can try is to run the latest f2fs source in v3.18.
> This branch supports f2fs for v3.18.

Ok, please bear with me, the last time I built my own kernel was during
the 2.4 timeframe, and this is an Ubuntu kernel. What I did is this:

   git clone -b linux-3.18 git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git
   cd f2fs/fs/f2fs
   rsync -avPR include/linux/f2fs_fs.h include/trace/events/f2fs.h /usr/src/linux-headers-3.18.21-031821/.
   make -C /lib/modules/3.18.21-031821-generic/build/ M=$PWD modules modules_install

I then rmmod'ed f2fs, insmod'ed the resulting module, and tried to mount my
existing f2fs fs for a quick test, but got a null pointer exception on "mount":

http://ue.tst.eu/e4628dcee97324e580da1bafad938052.txt

Probably caused by me not building a full kernel, but recreating how Ubuntu
builds its kernels on a Debian system isn't something I look forward to.

> For example, if I can represent blocks like:
[number of logs discussion]

Thanks for this explanation - two logs doesn't look so bad from a
locality viewpoint (not a big issue for flash, but a big issue for
rotational devices - I also realised I can't use dmcache, as dmcache, even
in writethrough mode, writes back all data after an unclean shutdown,
which would positively kill the disk).

Since whatever speed difference I saw with two logs wasn't big, you
completely sold me on 6 logs, or 4 (especially if it speeds up the gc,
which I haven't tested much yet). Two logs was merely a test anyway (the
same with no_heap; I don't know what it does, but I thought it was worth
a try, as having metadata + data closer together is better than having them
at opposite ends of the log or so).

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-23 23:30                   ` Marc Lehmann
  2015-09-23 23:43                     ` Marc Lehmann
@ 2015-09-25  8:05                     ` Chao Yu
  2015-09-26  3:42                       ` Marc Lehmann
  1 sibling, 1 reply; 74+ messages in thread
From: Chao Yu @ 2015-09-25  8:05 UTC (permalink / raw)
  To: 'Marc Lehmann'; +Cc: 'Jaegeuk Kim', linux-f2fs-devel

> -----Original Message-----
> From: Marc Lehmann [mailto:schmorp@schmorp.de]
> Sent: Thursday, September 24, 2015 7:30 AM
> To: Chao Yu
> Cc: 'Jaegeuk Kim'; linux-f2fs-devel@lists.sourceforge.net
> Subject: Re: [f2fs-dev] SMR drive test 2; 128GB partition; no obvious corruption, much more
> sane behaviour, weird overprovisioning
> 
> On Wed, Sep 23, 2015 at 04:55:57PM +0800, Chao Yu <chao2.yu@samsung.com> wrote:
> > >    echo 1 >gc_idle
> > >    echo 1000 >gc_max_sleep_time
> > >    echo 5000 >gc_no_gc_sleep_time
> >
> > One thing I note is that gc_min_sleep_time is not be set in your script,
> > so in some condition gc may still do the sleep with gc_min_sleep_time (30
> > seconds by default) instead of gc_max_sleep_time which we expect.
> 
> Ah, sorry, I actually set gc_min_sleep_time to 100, but forgot to include
> it.
> 
> > In 4.3 rc1 kernel, we have add a new ioctl to trigger in batches gc, maybe
> > we can use it as one option.
> 
> Yes, such an ioctl could be useful to me, although I do not intend to have
> background gc off.
> 
> I assume that the ioctl will block for the time it runs, and I can ask it
> to do up to 16 batches in one go (by default)? That sounds indeed very

Actually, we should set the 'count' parameter to indicate how many times we
want to do gc in one batch - at most 16 times in a loop for each ioctl
invocation:
        ioctl(fd, F2FS_IOC_GC, &count);
After the ioctl returns successfully, the 'count' parameter will contain the
number of gc passes that were actually performed.
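
As a rough, self-contained sketch of that calling convention (this is not
code from the thread; the ioctl name, magic and request number below are
assumptions based on the description above and should really be taken from
the kernel's f2fs header):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/types.h>

    #ifndef F2FS_IOC_GC
    #define F2FS_IOCTL_MAGIC 0xf5                         /* assumed magic */
    #define F2FS_IOC_GC _IOW(F2FS_IOCTL_MAGIC, 6, __u32)  /* assumed number */
    #endif

    int main(int argc, char **argv)
    {
        __u32 count = 16;   /* request up to 16 gc passes in one batch */
        int fd;

        if (argc != 2)
            return 1;
        /* any file or directory on the f2fs mount will do */
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
            return 1;
        /* presumably blocks until the whole batch has run */
        if (ioctl(fd, F2FS_IOC_GC, &count) < 0) {
            perror("F2FS_IOC_GC");
            close(fd);
            return 1;
        }
        printf("gc passes actually performed: %u\n", (unsigned)count);
        close(fd);
        return 0;
    }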

> useful to have.
> 
> What is "one batch" in terms of gc, one section?

One batch means a certain number of gc passes executing serially.

We have foreground/background modes in the gc procedure:
1) In foreground gc mode, it will try to gc several sections until there
are enough free sections;
2) In background gc mode, it will try to gc one section.
So we will not know how many sections will be freed in one batch, because
it depends on a) which mode is used (the gc mode is chosen dynamically
depending on the current number of free sections and the amount of dirty
data) and b) whether a victim exists or not.

Thanks,

> 
> --
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\


------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-24 17:21                       ` Jaegeuk Kim
@ 2015-09-25  8:28                         ` Chao Yu
  0 siblings, 0 replies; 74+ messages in thread
From: Chao Yu @ 2015-09-25  8:28 UTC (permalink / raw)
  To: 'Jaegeuk Kim', 'Marc Lehmann'; +Cc: linux-f2fs-devel

> -----Original Message-----
> From: Jaegeuk Kim [mailto:jaegeuk@kernel.org]
> Sent: Friday, September 25, 2015 1:21 AM
> To: Marc Lehmann
> Cc: Chao Yu; linux-f2fs-devel@lists.sourceforge.net
> Subject: Re: [f2fs-dev] SMR drive test 2; 128GB partition; no obvious corruption, much more
> sane behaviour, weird overprovisioning
> 
> On Thu, Sep 24, 2015 at 01:43:24AM +0200, Marc Lehmann wrote:
> > On Thu, Sep 24, 2015 at 01:30:22AM +0200, Marc Lehmann <schmorp@schmorp.de> wrote:
> > > > One thing I note is that gc_min_sleep_time is not be set in your script,
> > > > so in some condition gc may still do the sleep with gc_min_sleep_time (30
> > > > seconds by default) instead of gc_max_sleep_time which we expect.
> > >
> > > Ah, sorry, I actually set gc_min_sleep_time to 100, but forgot to include
> > > it.
> >
> > Sorry, that sounded confusing - I set it to 100 in previous tests, and forgot
> > to include it, so it was running with 30000. When experimenting, I actually
> > do get the gc to do more frequent operations now.
> >
> > Is there any obvious harm setting it to a very low value (such as 100 or 10)?
> >
> > I assume all it does is have less time buffer between the last operation
> > and the gc starting. When I write in batches, or when I know the fs will be
> > idle, there shouldn't be any harm, performance wise, of letting it work all
> > the time.
> 
> Yeah, I don't think it does matter with very small time periods, since the timer
> is set after background GC is done.
> But, we use msecs_to_jiffies(), so hope not to use something like 10 ms, since
> each backgroudn GC conducts reading victim blocks into page cache and then just
> sets them as dirty.
> That indicates, after a while, we hope flusher will write them all to disk and
> finally we got a free section.
> So, IMO, we need to give some time slots to flusher as well.
> 
> For example, if write bandwidth is 30MB/s and section size is 128MB, it needs
> about 4secs to write one section.

It's better for us to consider the VM dirty data flush policy as well.
IIRC, Fengguang did the optimization work on writeback: if the dirty ratio
(or dirty bytes) is not high, the VM flushes data fairly slowly, but as the
dirty ratio increases, the VM flushes data more aggressively. If we want to
use a large share of the maximum bandwidth, the values of the following
interfaces could be considered when tuning them together with the gc policy
of f2fs (see the sketch after the list):

/proc/sys/vm/
	dirty_background_bytes
	dirty_background_ratio
	dirty_expire_centisecs
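
A minimal sketch of poking those knobs from a tiny helper (the thresholds
below are made-up examples, not values recommended anywhere in this thread;
a plain shell redirect into the same /proc files achieves the same thing):

    /* sketch: make the flusher push dirty data out earlier, so background
     * gc gets its free sections sooner; both values are illustrative only */
    #include <stdio.h>

    static int write_knob(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");

        if (!f) {
            perror(path);
            return -1;
        }
        fprintf(f, "%s\n", value);
        return fclose(f);
    }

    int main(void)
    {
        /* start background writeback once ~64MB of dirty data piles up */
        write_knob("/proc/sys/vm/dirty_background_bytes", "67108864");
        /* treat dirty pages as expired after 5 seconds */
        write_knob("/proc/sys/vm/dirty_expire_centisecs", "500");
        return 0;
    }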

Thanks,

> So, how about setting
>  - gc_min_time to 1~2 secs,
>  - gc_max_time to 3~4 secs,
>  - gc_idle_time to 10 secs,
>  - reclaim_segments to 64 (sync when 1 section becomes prefree)
> 
> Thanks,
> 
> >
> > --
> >                 The choice of a       Deliantra, the free code+content MORPG
> >       -----==-     _GNU_              http://www.deliantra.net
> >       ----==-- _       generation
> >       ---==---(_)__  __ ____  __      Marc Lehmann
> >       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
> >       -=====/_/_//_/\_,_/ /_/\_\


------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: sync/umount hang on 3.18.21, 1.4TB gone after crash
  2015-09-24 18:50 ` sync/umount hang on 3.18.21, 1.4TB gone after crash Jaegeuk Kim
  2015-09-25  6:00   ` Marc Lehmann
@ 2015-09-25  9:13   ` Chao Yu
  2015-09-25 18:30     ` Jaegeuk Kim
  1 sibling, 1 reply; 74+ messages in thread
From: Chao Yu @ 2015-09-25  9:13 UTC (permalink / raw)
  To: 'Jaegeuk Kim', 'Marc Lehmann'; +Cc: linux-f2fs-devel

Hi Jaegeuk,

> -----Original Message-----
> From: Jaegeuk Kim [mailto:jaegeuk@kernel.org]
> Sent: Friday, September 25, 2015 2:50 AM
> To: Marc Lehmann
> Cc: linux-f2fs-devel@lists.sourceforge.net
> Subject: Re: [f2fs-dev] sync/umount hang on 3.18.21, 1.4TB gone after crash
> 
> On Wed, Sep 23, 2015 at 11:58:51PM +0200, Marc Lehmann wrote:
> > Hi!
> >
> > I moved one of the SMR disks to another box with a 3.18.21 kernel.
> >
> > I formatted and mounted like this:
> >
> >    /opt/f2fs-tools/sbin/mkfs.f2fs -lTEST -s90 -t0 -a0 /dev/vg_test/test
> >    mount -t f2fs -onoatime,flush_merge,no_heap /dev/vg_test/test /mnt
> >
> > I then copied (tar | tar) 2.1TB of data to the disk, which took about 6
> > hours, which is about the read speed of this data set (so the speed was very
> > good).
> >
> > When I came back after ~10 hours, I found a number of hung task messages
> > in syslog, and when I entered sync, sync was consuming 100% system time.
> 
> Hmm, at this time, it would be good to check what process is stuck through
> sysrq.
> 
> > I took a snapshot of /sys/kernel/debug/f2fs/status before sync, and the
> > values arfe "frozen", i.e. they didn't change.
> >
> > I was able to read from the mounted filesystem normally, and I was able to
> > read and write the block device itself, so the disk is responsive.
> >
> > After ~1h in this state, I tried to umount, which made the filesystem
> > mountpoint go away, but umount hangs, and /sys/kernel/debug/f2fs/status still
> > doesn't change.
> >
> > This is the output of /sys/kernel/debug/f2fs/status:
> >
> > http://ue.tst.eu/d88ce0e21a7ca0fb74b1ecadfa475df0.txt
> >
> > I then deleted the device, but the echo 1 >/sys/block/sde/device/delete was
> > also hanging.
> >
> > Here are /proc/.../stack outputs of sync, umount and bash(echo):
> >
> >    sync:
> >    [<ffffffffffffffff>] 0xffffffffffffffff
> >
> >    umount:
> >    [<ffffffff8139ba03>] call_rwsem_down_write_failed+0x13/0x20
> >    [<ffffffff811e7ee6>] deactivate_super+0x46/0x70
> >    [<ffffffff81204733>] cleanup_mnt+0x43/0x90
> >    [<ffffffff812047d2>] __cleanup_mnt+0x12/0x20
> >    [<ffffffff8108e8a4>] task_work_run+0xc4/0xe0
> >    [<ffffffff81012fa7>] do_notify_resume+0x97/0xb0
> >    [<ffffffff8178896f>] int_signal+0x12/0x17
> >    [<ffffffffffffffff>] 0xffffffffffffffff
> >
> >    bash (delete):
> >    [<ffffffff810d8917>] msleep+0x37/0x50
> >    [<ffffffff8135d686>] __blk_drain_queue+0xa6/0x1a0
> >    [<ffffffff8135da05>] blk_cleanup_queue+0x1b5/0x1c0
> >    [<ffffffff8152082a>] __scsi_remove_device+0x5a/0xe0
> >    [<ffffffff815208d6>] scsi_remove_device+0x26/0x40
> >    [<ffffffff81520917>] sdev_store_delete+0x27/0x30
> >    [<ffffffff814bf748>] dev_attr_store+0x18/0x30
> >    [<ffffffff8125bc4d>] sysfs_kf_write+0x3d/0x50
> >    [<ffffffff8125b154>] kernfs_fop_write+0xe4/0x160
> >    [<ffffffff811e51a7>] vfs_write+0xb7/0x1f0
> >    [<ffffffff811e5c26>] SyS_write+0x46/0xb0
> >    [<ffffffff817886cd>] system_call_fastpath+0x16/0x1b
> >    [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > After a forced reboot, I did a fsck, and got this, which looks good except
> > for the "Wrong segment type" message, which hopefully is harmless.
> >
> > http://ue.tst.eu/4c750d2301a581cb07249d607aa0e6d0.txt
> >
> > After mounting, status was this (and was changing):
> >
> > http://ue.tst.eu/6462606ac3aa85bde0d6674365c86318.txt
> >
> > Note that 1.4TB of data are missing(!)
> >
> > This large amount of missing data was certainly unexpected. I assume f2fs
> > stopped checkpointing earlier, and only after a checkpoint the data is
> > safe, but being able to write 1.4TB of data without it ever reaching the
> > disk is very unexpected behaviour for a filesystem (which normally loses
> > about half a minute of data at most).
> 
> It seems there was no fsync after sync at all. That's why f2fs recovered back to
> the latest checkpoint. Anyway, I'm thinking that it's worth to add a kind of
> periodic checkpoints.

Agree, I have that in my mind for long time, since Yunlei said that they
may lost all data of new generated photos after an abnormal poweroff, I
wrote the below patch, but I have not much time to test and tuned up with
it.

I hope if you have time, we can discuss the implementation of periodic cp.
Maybe in another thread. :)

From c81c03fb69612350b12a14bccc07a1fd95cf606b Mon Sep 17 00:00:00 2001
From: Chao Yu <chao2.yu@samsung.com>
Date: Wed, 5 Aug 2015 22:58:54 +0800
Subject: [PATCH] f2fs: support background data flush

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
---
 fs/f2fs/data.c  | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/f2fs/f2fs.h  |  15 +++++++++
 fs/f2fs/inode.c |  16 +++++++++
 fs/f2fs/namei.c |   7 ++++
 fs/f2fs/super.c |  50 ++++++++++++++++++++++++++--
 5 files changed, 186 insertions(+), 2 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index a82abe9..39b6339 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -20,6 +20,8 @@
 #include <linux/prefetch.h>
 #include <linux/uio.h>
 #include <linux/cleancache.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
 
 #include "f2fs.h"
 #include "node.h"
@@ -27,6 +29,104 @@
 #include "trace.h"
 #include <trace/events/f2fs.h>
 
+static void f2fs_do_data_flush(struct f2fs_sb_info *sbi)
+{
+	struct list_head *inode_list = &sbi->inode_list;
+	struct f2fs_inode_info *fi, *tmp;
+	struct inode *inode;
+	unsigned int number;
+
+	spin_lock(&sbi->inode_lock);
+	number = sbi->inode_num;
+	list_for_each_entry_safe(fi, tmp, inode_list, i_flush) {
+
+		if (number-- == 0)
+			break;
+
+		inode = &fi->vfs_inode;
+
+		/*
+		 * If the inode is in evicting path, we will fail to igrab
+		 * inode since I_WILL_FREE or I_FREEING should be set in
+		 * inode, so after grab valid inode, it's safe to flush
+		 * dirty page after unlock inode_lock.
+		 */
+		inode = igrab(inode);
+		if (!inode)
+			continue;
+
+		spin_unlock(&sbi->inode_lock);
+
+		if (!get_dirty_pages(inode))
+			goto next;
+
+		filemap_flush(inode->i_mapping);
+next:
+		iput(inode);
+		spin_lock(&sbi->inode_lock);
+	}
+	spin_unlock(&sbi->inode_lock);
+}
+
+static int f2fs_data_flush_thread(void *data)
+{
+	struct f2fs_sb_info *sbi = data;
+	wait_queue_head_t *wq = &sbi->dflush_wait_queue;
+	struct cp_control cpc;
+	unsigned long wait_time;
+
+	wait_time = sbi->wait_time;
+
+	do {
+		if (try_to_freeze())
+			continue;
+		else
+			wait_event_interruptible_timeout(*wq,
+						kthread_should_stop(),
+						msecs_to_jiffies(wait_time));
+		if (kthread_should_stop())
+			break;
+
+		if (sbi->sb->s_writers.frozen >= SB_FREEZE_WRITE)
+			continue;
+
+		mutex_lock(&sbi->gc_mutex);
+
+		f2fs_do_data_flush(sbi);
+
+		cpc.reason = __get_cp_reason(sbi);
+		write_checkpoint(sbi, &cpc);
+
+		mutex_unlock(&sbi->gc_mutex);
+
+	} while (!kthread_should_stop());
+	return 0;
+}
+
+int start_data_flush_thread(struct f2fs_sb_info *sbi)
+{
+	dev_t dev = sbi->sb->s_bdev->bd_dev;
+	int err = 0;
+
+	init_waitqueue_head(&sbi->dflush_wait_queue);
+	sbi->data_flush_thread = kthread_run(f2fs_data_flush_thread, sbi,
+			"f2fs_flush-%u:%u", MAJOR(dev), MINOR(dev));
+	if (IS_ERR(sbi->data_flush_thread)) {
+		err = PTR_ERR(sbi->data_flush_thread);
+		sbi->data_flush_thread = NULL;
+	}
+
+	return err;
+}
+
+void stop_data_flush_thread(struct f2fs_sb_info *sbi)
+{
+	if (!sbi->data_flush_thread)
+		return;
+	kthread_stop(sbi->data_flush_thread);
+	sbi->data_flush_thread = NULL;
+}
+
 static void f2fs_read_end_io(struct bio *bio)
 {
 	struct bio_vec *bvec;
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index f1a90ff..b6790c9 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -52,6 +52,7 @@
 #define F2FS_MOUNT_NOBARRIER		0x00000800
 #define F2FS_MOUNT_FASTBOOT		0x00001000
 #define F2FS_MOUNT_EXTENT_CACHE		0x00002000
+#define F2FS_MOUNT_DATA_FLUSH		0X00004000
 
 #define clear_opt(sbi, option)	(sbi->mount_opt.opt &= ~F2FS_MOUNT_##option)
 #define set_opt(sbi, option)	(sbi->mount_opt.opt |= F2FS_MOUNT_##option)
@@ -322,6 +323,8 @@ enum {
 					 */
 };
 
+#define DEF_DATA_FLUSH_DELAY_TIME	5000	/* delay time of data flush */
+
 #define F2FS_LINK_MAX	0xffffffff	/* maximum link count per file */
 
 #define MAX_DIR_RA_PAGES	4	/* maximum ra pages of dir */
@@ -436,6 +439,8 @@ struct f2fs_inode_info {
 
 	struct extent_tree *extent_tree;	/* cached extent_tree entry */
 
+	struct list_head i_flush;	/* link in inode_list of sbi */
+
 #ifdef CONFIG_F2FS_FS_ENCRYPTION
 	/* Encryption params */
 	struct f2fs_crypt_info *i_crypt_info;
@@ -808,6 +813,14 @@ struct f2fs_sb_info {
 	struct list_head s_list;
 	struct mutex umount_mutex;
 	unsigned int shrinker_run_no;
+
+	/* For data flush support */
+	struct task_struct *data_flush_thread;	/* data flush task */
+	wait_queue_head_t dflush_wait_queue;	/* data flush wait queue */
+	unsigned long wait_time;		/* wait time for flushing */
+	struct list_head inode_list;		/* link all inmem inode */
+	spinlock_t inode_lock;			/* protect inode list */
+	unsigned int inode_num;			/* inode number in inode_list */
 };
 
 /*
@@ -1780,6 +1793,8 @@ void destroy_checkpoint_caches(void);
 /*
  * data.c
  */
+int start_data_flush_thread(struct f2fs_sb_info *);
+void stop_data_flush_thread(struct f2fs_sb_info *);
 void f2fs_submit_merged_bio(struct f2fs_sb_info *, enum page_type, int);
 int f2fs_submit_page_bio(struct f2fs_io_info *);
 void f2fs_submit_page_mbio(struct f2fs_io_info *);
diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
index 35aae65..6bf22ad 100644
--- a/fs/f2fs/inode.c
+++ b/fs/f2fs/inode.c
@@ -158,6 +158,13 @@ static int do_read_inode(struct inode *inode)
 	stat_inc_inline_inode(inode);
 	stat_inc_inline_dir(inode);
 
+	if (S_ISREG(inode->i_mode) || S_ISLNK(inode->i_mode)) {
+		spin_lock(&sbi->inode_lock);
+		list_add_tail(&fi->i_flush, &sbi->inode_list);
+		sbi->inode_num++;
+		spin_unlock(&sbi->inode_lock);
+	}
+
 	return 0;
 }
 
@@ -335,6 +342,15 @@ void f2fs_evict_inode(struct inode *inode)
 
 	f2fs_destroy_extent_tree(inode);
 
+	if (S_ISREG(inode->i_mode) || S_ISLNK(inode->i_mode)) {
+		spin_lock(&sbi->inode_lock);
+		if (!list_empty(&fi->i_flush)) {
+			list_del(&fi->i_flush);
+			sbi->inode_num--;
+		}
+		spin_unlock(&sbi->inode_lock);
+	}
+
 	if (inode->i_nlink || is_bad_inode(inode))
 		goto no_delete;
 
diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
index a680bf3..f639e96 100644
--- a/fs/f2fs/namei.c
+++ b/fs/f2fs/namei.c
@@ -71,6 +71,13 @@ static struct inode *f2fs_new_inode(struct inode *dir, umode_t mode)
 	stat_inc_inline_inode(inode);
 	stat_inc_inline_dir(inode);
 
+	if (S_ISREG(inode->i_mode) || S_ISLNK(inode->i_mode)) {
+		spin_lock(&sbi->inode_lock);
+		list_add_tail(&F2FS_I(inode)->i_flush, &sbi->inode_list);
+		sbi->inode_num++;
+		spin_unlock(&sbi->inode_lock);
+	}
+
 	trace_f2fs_new_inode(inode, 0);
 	mark_inode_dirty(inode);
 	return inode;
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index f794781..286cdb4 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -67,6 +67,7 @@ enum {
 	Opt_extent_cache,
 	Opt_noextent_cache,
 	Opt_noinline_data,
+	Opt_data_flush,
 	Opt_err,
 };
 
@@ -91,6 +92,7 @@ static match_table_t f2fs_tokens = {
 	{Opt_extent_cache, "extent_cache"},
 	{Opt_noextent_cache, "noextent_cache"},
 	{Opt_noinline_data, "noinline_data"},
+	{Opt_data_flush, "data_flush"},
 	{Opt_err, NULL},
 };
 
@@ -215,6 +217,7 @@ F2FS_RW_ATTR(SM_INFO, f2fs_sm_info, min_fsync_blocks, min_fsync_blocks);
 F2FS_RW_ATTR(NM_INFO, f2fs_nm_info, ram_thresh, ram_thresh);
 F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, max_victim_search, max_victim_search);
 F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, dir_level, dir_level);
+F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, wait_time, wait_time);
 
 #define ATTR_LIST(name) (&f2fs_attr_##name.attr)
 static struct attribute *f2fs_attrs[] = {
@@ -231,6 +234,7 @@ static struct attribute *f2fs_attrs[] = {
 	ATTR_LIST(max_victim_search),
 	ATTR_LIST(dir_level),
 	ATTR_LIST(ram_thresh),
+	ATTR_LIST(wait_time),
 	NULL,
 };
 
@@ -397,6 +401,9 @@ static int parse_options(struct super_block *sb, char *options)
 		case Opt_noinline_data:
 			clear_opt(sbi, INLINE_DATA);
 			break;
+		case Opt_data_flush:
+			set_opt(sbi, DATA_FLUSH);
+			break;
 		default:
 			f2fs_msg(sb, KERN_ERR,
 				"Unrecognized mount option \"%s\" or missing value",
@@ -434,6 +441,8 @@ static struct inode *f2fs_alloc_inode(struct super_block *sb)
 	/* Will be used by directory only */
 	fi->i_dir_level = F2FS_SB(sb)->dir_level;
 
+	INIT_LIST_HEAD(&fi->i_flush);
+
 #ifdef CONFIG_F2FS_FS_ENCRYPTION
 	fi->i_crypt_info = NULL;
 #endif
@@ -514,6 +523,8 @@ static void f2fs_put_super(struct super_block *sb)
 	}
 	kobject_del(&sbi->s_kobj);
 
+	stop_data_flush_thread(sbi);
+
 	stop_gc_thread(sbi);
 
 	/* prevent remaining shrinker jobs */
@@ -742,6 +753,8 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data)
 	int err, active_logs;
 	bool need_restart_gc = false;
 	bool need_stop_gc = false;
+	bool need_restart_df = false;
+	bool need_stop_df = false;
 
 	sync_filesystem(sb);
 
@@ -785,6 +798,19 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data)
 		need_stop_gc = true;
 	}
 
+	if ((*flags & MS_RDONLY) || !test_opt(sbi, DATA_FLUSH)) {
+		if (sbi->data_flush_thread) {
+			stop_data_flush_thread(sbi);
+			f2fs_sync_fs(sb, 1);
+			need_restart_df = true;
+		}
+	} else if (!sbi->data_flush_thread) {
+		err = start_data_flush_thread(sbi);
+		if (err)
+			goto restore_gc;
+		need_stop_df = true;
+	}
+
 	/*
 	 * We stop issue flush thread if FS is mounted as RO
 	 * or if flush_merge is not passed in mount option.
@@ -794,13 +820,21 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data)
 	} else if (!SM_I(sbi)->cmd_control_info) {
 		err = create_flush_cmd_control(sbi);
 		if (err)
-			goto restore_gc;
+			goto restore_df;
 	}
 skip:
 	/* Update the POSIXACL Flag */
 	 sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
 		(test_opt(sbi, POSIX_ACL) ? MS_POSIXACL : 0);
 	return 0;
+restore_df:
+	if (need_restart_df) {
+		if (start_data_flush_thread(sbi))
+			f2fs_msg(sbi->sb, KERN_WARNING,
+				"background data flush thread has stopped");
+	} else if (need_stop_df) {
+		stop_data_flush_thread(sbi);
+	}
 restore_gc:
 	if (need_restart_gc) {
 		if (start_gc_thread(sbi))
@@ -1216,6 +1250,11 @@ try_onemore:
 	INIT_LIST_HEAD(&sbi->dir_inode_list);
 	spin_lock_init(&sbi->dir_inode_lock);
 
+	sbi->wait_time = DEF_DATA_FLUSH_DELAY_TIME;
+	INIT_LIST_HEAD(&sbi->inode_list);
+	spin_lock_init(&sbi->inode_lock);
+	sbi->inode_num = 0;
+
 	init_extent_cache_info(sbi);
 
 	init_ino_entry_info(sbi);
@@ -1324,6 +1363,12 @@ try_onemore:
 		if (err)
 			goto free_kobj;
 	}
+
+	if (test_opt(sbi, DATA_FLUSH) && !f2fs_readonly(sb)) {
+		err = start_data_flush_thread(sbi);
+		if (err)
+			goto stop_gc;
+	}
 	kfree(options);
 
 	/* recover broken superblock */
@@ -1333,7 +1378,8 @@ try_onemore:
 	}
 
 	return 0;
-
+stop_gc:
+	stop_gc_thread(sbi);
 free_kobj:
 	kobject_del(&sbi->s_kobj);
 free_proc:
-- 
2.4.2



------------------------------------------------------------------------------

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/4.2.1
  2015-09-25  6:50     ` Marc Lehmann
@ 2015-09-25  9:47       ` Chao Yu
  2015-09-25 18:20         ` Jaegeuk Kim
  2015-09-26  3:22         ` Marc Lehmann
  2015-09-25 18:26       ` Jaegeuk Kim
  1 sibling, 2 replies; 74+ messages in thread
From: Chao Yu @ 2015-09-25  9:47 UTC (permalink / raw)
  To: 'Marc Lehmann', 'Jaegeuk Kim'; +Cc: linux-f2fs-devel

Hi Marc, Jaegeuk,

> -----Original Message-----
> From: Marc Lehmann [mailto:schmorp@schmorp.de]
> Sent: Friday, September 25, 2015 2:51 PM
> To: Jaegeuk Kim
> Cc: linux-f2fs-devel@lists.sourceforge.net
> Subject: Re: [f2fs-dev] write performance difference 3.18.21/4.2.1
> 
> On Thu, Sep 24, 2015 at 11:28:36AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > One thing that we can try is to run the latest f2fs source in v3.18.
> > This branch supports f2fs for v3.18.
> 
> Ok, please bear with me, the last time I built my own kernel was during
> the 2.4 timeframe, and this is a ubuntu kernel. What I did is this:
> 
>    git clone -b linux-3.18 git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git
>    cd f2fs/fs/f2fs
>    rsync -avPR include/linux/f2fs_fs.h include/trace/events/f2fs.h
> /usr/src/linux-headers-3.18.21-031821/.
>    make -C /lib/modules/3.18.21-031821-generic/build/ M=$PWD modules modules_install
> 
> I then rmmod f2fs/insmod the resulting module, and tried to mount my
> existing f2fs fs for a quick test, but got a null ptr exception on "mount":
> 
> http://ue.tst.eu/e4628dcee97324e580da1bafad938052.txt

This is my fault, sorry about introducing this oops. :(

Please revert the commit 7c5e466755ff ("f2fs: readahead cp payload
pages when mount") since in this commit we try to access invalid
SIT_I(sbi)->sit_base_addr which should be inited later.

Thanks,

> 
> Probably caused me not building a full kernel, but recreating how ubuntu
> build their kernels on a debian system isn't something I look forward to.
> 
> > For example, if I can represent blocks like:
> [number of logs discussion]
> 
> Thanks for this explanation - two logs doesn't look so bad, from a
> locality viewpoint (not a big issue for flash, but a big issue for
> rotational devices - I also realised I can't use dmcache as dmcache, even
> in writethrough mode, writes back all data after an unclean shutdown,
> which would positively kill the disk).
> 
> Since whatever speed difference I saw with two logs wasn't big, you
> completely sold me on 6 logs, or 4 (especially if it seepds up the gc,
> which I haven't much tested yet). Two logs was merely a test anyway (the
> same with no_heap, I don't know what it does, but I thought it is worth
> a try, as metadata + data nearer together is better than having them at
> opposite ends of the log or so).
> 
> --
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-25  5:42                       ` Marc Lehmann
@ 2015-09-25 17:45                         ` Jaegeuk Kim
  2015-09-26  3:32                           ` Marc Lehmann
  0 siblings, 1 reply; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-25 17:45 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Fri, Sep 25, 2015 at 07:42:25AM +0200, Marc Lehmann wrote:
> On Thu, Sep 24, 2015 at 10:27:49AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > > In the end, I might settle with -s64, and currently do tests with -s90.
> > 
> > Got it. But why -s90? :)
> 
> He :) It's a nothing-special number between 64 and 128, that's all.

Oh, then, I don't think that is a good magic number.
It seems that you decided to use -s64, so it'd be better to keep that
setting so any perf results can be compared.

> > I just pushed the patches to master branch in f2fs-tools.git.
> > Could you pull them and check them?
> 
> Got them, last patch was the "check sit types" change.
> 
> > I added one more patch to avoid harmless sit_type fixes previously you reported.
> > 
> > And, for the 8TB case, let me check again. It seems that we need to handle under
> > 1% overprovision ratio. (e.g., 0.5%)
> 
> That might make me potentially very happy. But my main concern at the
> moment is stability - even when you have a backup, restoring 8TB will take
> days, and backups are never uptodate.
> 
> It would be nice to be able to control it more from the user side though.
> 
> For example, I have not yet reached 0.0% free with f2fs. That's fine, I don't
> plan9 to, but I need to know at which percentage should I stop, which is
> something I can only really find out with experiments.
> 
> And just filling these 8TB disks takes days, so the question is, can I
> simulate near-full behaviour with smaller partitions.

Why not? :)
I think the behavior should be the same. And it'd be good to set small
sections in order to see it more clearly.

Anyway, I wrote a patch to consider under 1% for large partitions.

 section  ovp ratio  ovp size

For 8TB,
 -s1    : 0.07%   -> 10GB
 -s32   : 0.39%   -> 65GB
 -s64   : 0.55%   -> 92GB
 -s128  : 0.78%   -> 132GB

For 128GB,
 -s1    : 0.55%   -> 1.4GB
 -s32   : 3.14%   -> 8GB
 -s64   : 4.45%   -> 12GB
 -s128  : 6.32%   -> 17GB

Let me test this patch for a while, and then push it into our git tree.

Thanks,

From 2cdb04b52f202e931e370564396366d44bd4d1e2 Mon Sep 17 00:00:00 2001
From: Jaegeuk Kim <jaegeuk@kernel.org>
Date: Fri, 25 Sep 2015 09:31:04 -0700
Subject: [PATCH] mkfs.f2fs: support <1% overprovision ratio

Big partition sizes need under 1% overprovision space to acquire more usable space.

    section  ovp ratio  ovp size
For 8TB,
    -s1    : 0.07%     -> 10GB
    -s32   : 0.39%     -> 65GB
    -s64   : 0.55%     -> 92GB
    -s128  : 0.78%     -> 132GB

For 128GB,
    -s1    : 0.55%     -> 1.4GB
    -s32   : 3.14%     -> 8GB
    -s64   : 4.45%     -> 12GB
    -s128  : 6.32%     -> 17GB

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
---
 include/f2fs_fs.h       |  2 +-
 mkfs/f2fs_format.c      | 12 ++++++------
 mkfs/f2fs_format_main.c |  2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/f2fs_fs.h b/include/f2fs_fs.h
index 38a774c..359deec 100644
--- a/include/f2fs_fs.h
+++ b/include/f2fs_fs.h
@@ -225,7 +225,7 @@ enum f2fs_config_func {
 struct f2fs_configuration {
 	u_int32_t sector_size;
 	u_int32_t reserved_segments;
-	u_int32_t overprovision;
+	double overprovision;
 	u_int32_t cur_seg[6];
 	u_int32_t segs_per_sec;
 	u_int32_t secs_per_zone;
diff --git a/mkfs/f2fs_format.c b/mkfs/f2fs_format.c
index 2d4ab09..176bdea 100644
--- a/mkfs/f2fs_format.c
+++ b/mkfs/f2fs_format.c
@@ -155,19 +155,19 @@ static void configure_extension_list(void)
 	free(config.extension_list);
 }
 
-static u_int32_t get_best_overprovision(void)
+static double get_best_overprovision(void)
 {
-	u_int32_t reserved, ovp, candidate, end, diff, space;
-	u_int32_t max_ovp = 0, max_space = 0;
+	double reserved, ovp, candidate, end, diff, space;
+	double max_ovp = 0, max_space = 0;
 
 	if (get_sb(segment_count_main) < 256) {
 		candidate = 10;
 		end = 95;
 		diff = 5;
 	} else {
-		candidate = 1;
+		candidate = 0.01;
 		end = 10;
-		diff = 1;
+		diff = 0.01;
 	}
 
 	for (; candidate <= end; candidate += diff) {
@@ -533,7 +533,7 @@ static int f2fs_write_check_point_pack(void)
 	set_cp(overprov_segment_count, get_cp(overprov_segment_count) +
 			get_cp(rsvd_segment_count));
 
-	MSG(0, "Info: Overprovision ratio = %u%%\n", config.overprovision);
+	MSG(0, "Info: Overprovision ratio = %.3lf%%\n", config.overprovision);
 	MSG(0, "Info: Overprovision segments = %u (GC reserved = %u)\n",
 					get_cp(overprov_segment_count),
 					config.reserved_segments);
diff --git a/mkfs/f2fs_format_main.c b/mkfs/f2fs_format_main.c
index fc612d8..2ea809c 100644
--- a/mkfs/f2fs_format_main.c
+++ b/mkfs/f2fs_format_main.c
@@ -99,7 +99,7 @@ static void f2fs_parse_options(int argc, char *argv[])
 			config.vol_label = optarg;
 			break;
 		case 'o':
-			config.overprovision = atoi(optarg);
+			config.overprovision = atof(optarg);
 			break;
 		case 'O':
 			parse_feature(strdup(optarg));
-- 
2.1.1


------------------------------------------------------------------------------

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/4.2.1
  2015-09-25  9:47       ` Chao Yu
@ 2015-09-25 18:20         ` Jaegeuk Kim
  2015-09-26  3:22         ` Marc Lehmann
  1 sibling, 0 replies; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-25 18:20 UTC (permalink / raw)
  To: Chao Yu; +Cc: 'Marc Lehmann', linux-f2fs-devel

On Fri, Sep 25, 2015 at 05:47:12PM +0800, Chao Yu wrote:
> Hi Marc, Jaegeuk,
> 
> > -----Original Message-----
> > From: Marc Lehmann [mailto:schmorp@schmorp.de]
> > Sent: Friday, September 25, 2015 2:51 PM
> > To: Jaegeuk Kim
> > Cc: linux-f2fs-devel@lists.sourceforge.net
> > Subject: Re: [f2fs-dev] write performance difference 3.18.21/4.2.1
> > 
> > On Thu, Sep 24, 2015 at 11:28:36AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > > One thing that we can try is to run the latest f2fs source in v3.18.
> > > This branch supports f2fs for v3.18.
> > 
> > Ok, please bear with me, the last time I built my own kernel was during
> > the 2.4 timeframe, and this is a ubuntu kernel. What I did is this:
> > 
> >    git clone -b linux-3.18 git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git
> >    cd f2fs/fs/f2fs
> >    rsync -avPR include/linux/f2fs_fs.h include/trace/events/f2fs.h
> > /usr/src/linux-headers-3.18.21-031821/.
> >    make -C /lib/modules/3.18.21-031821-generic/build/ M=$PWD modules modules_install
> > 
> > I then rmmod f2fs/insmod the resulting module, and tried to mount my
> > existing f2fs fs for a quick test, but got a null ptr exception on "mount":
> > 
> > http://ue.tst.eu/e4628dcee97324e580da1bafad938052.txt
> 
> This is my fault, sorry about introducing this oops. :(
> 
> Please revert the commit 7c5e466755ff ("f2fs: readahead cp payload
> pages when mount") since in this commit we try to access invalid
> SIT_I(sbi)->sit_base_addr which should be inited later.

Oops, I'll just remove this patch, which hasn't gone too far yet.

Thanks,

> 
> Thanks,
> 
> > 
> > Probably caused me not building a full kernel, but recreating how ubuntu
> > build their kernels on a debian system isn't something I look forward to.
> > 
> > > For example, if I can represent blocks like:
> > [number of logs discussion]
> > 
> > Thanks for this explanation - two logs doesn't look so bad, from a
> > locality viewpoint (not a big issue for flash, but a big issue for
> > rotational devices - I also realised I can't use dmcache as dmcache, even
> > in writethrough mode, writes back all data after an unclean shutdown,
> > which would positively kill the disk).
> > 
> > Since whatever speed difference I saw with two logs wasn't big, you
> > completely sold me on 6 logs, or 4 (especially if it seepds up the gc,
> > which I haven't much tested yet). Two logs was merely a test anyway (the
> > same with no_heap, I don't know what it does, but I thought it is worth
> > a try, as metadata + data nearer together is better than having them at
> > opposite ends of the log or so).
> > 
> > --
> >                 The choice of a       Deliantra, the free code+content MORPG
> >       -----==-     _GNU_              http://www.deliantra.net
> >       ----==-- _       generation
> >       ---==---(_)__  __ ____  __      Marc Lehmann
> >       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
> >       -=====/_/_//_/\_,_/ /_/\_\
> > 
> > ------------------------------------------------------------------------------
> > _______________________________________________
> > Linux-f2fs-devel mailing list
> > Linux-f2fs-devel@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/4.2.1
  2015-09-25  6:50     ` Marc Lehmann
  2015-09-25  9:47       ` Chao Yu
@ 2015-09-25 18:26       ` Jaegeuk Kim
  1 sibling, 0 replies; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-25 18:26 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Fri, Sep 25, 2015 at 08:50:57AM +0200, Marc Lehmann wrote:
> On Thu, Sep 24, 2015 at 11:28:36AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > One thing that we can try is to run the latest f2fs source in v3.18.
> > This branch supports f2fs for v3.18.
> 
> Ok, please bear with me, the last time I built my own kernel was during
> the 2.4 timeframe, and this is a ubuntu kernel. What I did is this:
> 
>    git clone -b linux-3.18 git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git
>    cd f2fs/fs/f2fs
>    rsync -avPR include/linux/f2fs_fs.h include/trace/events/f2fs.h /usr/src/linux-headers-3.18.21-031821/.
>    make -C /lib/modules/3.18.21-031821-generic/build/ M=$PWD modules modules_install
> 
> I then rmmod f2fs/insmod the resulting module, and tried to mount my
> existing f2fs fs for a quick test, but got a null ptr exception on "mount":
> 
> http://ue.tst.eu/e4628dcee97324e580da1bafad938052.txt
> 
> Probably caused me not building a full kernel, but recreating how ubuntu
> build their kernels on a debian system isn't something I look forward to.

Please pull the v3.18 branch again. I rebased it. :-(

> 
> > For example, if I can represent blocks like:
> [number of logs discussion]
> 
> Thanks for this explanation - two logs doesn't look so bad, from a
> locality viewpoint (not a big issue for flash, but a big issue for
> rotational devices - I also realised I can't use dmcache as dmcache, even
> in writethrough mode, writes back all data after an unclean shutdown,
> which would positively kill the disk).
> 
> Since whatever speed difference I saw with two logs wasn't big, you
> completely sold me on 6 logs, or 4 (especially if it seepds up the gc,
> which I haven't much tested yet). Two logs was merely a test anyway (the
> same with no_heap, I don't know what it does, but I thought it is worth
> a try, as metadata + data nearer together is better than having them at
> opposite ends of the log or so).

If the section size is pretty large, no_heap would be enough. The original
intention was to provide more contiguous space for data only, so that a big
file could have a large extent instead of being split up by its metadata.

> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: sync/umount hang on 3.18.21, 1.4TB gone after crash
  2015-09-25  9:13   ` Chao Yu
@ 2015-09-25 18:30     ` Jaegeuk Kim
  0 siblings, 0 replies; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-25 18:30 UTC (permalink / raw)
  To: Chao Yu; +Cc: 'Marc Lehmann', linux-f2fs-devel

Hi Chao,

[snip]

> > It seems there was no fsync after sync at all. That's why f2fs recovered back to
> > the latest checkpoint. Anyway, I'm thinking that it's worth to add a kind of
> > periodic checkpoints.
> 
> Agree, I have that in my mind for long time, since Yunlei said that they
> may lost all data of new generated photos after an abnormal poweroff, I
> wrote the below patch, but I have not much time to test and tuned up with
> it.
> 
> I hope if you have time, we can discuss the implementation of periodic cp.
> Maybe in another thread. :)

Sure. Actually, my thought is that we can use our gc thread and the existing
VFS inode lists.
Let's take some time to think about this.

Thanks,

> 
> >From c81c03fb69612350b12a14bccc07a1fd95cf606b Mon Sep 17 00:00:00 2001
> From: Chao Yu <chao2.yu@samsung.com>
> Date: Wed, 5 Aug 2015 22:58:54 +0800
> Subject: [PATCH] f2fs: support background data flush
> 
> Signed-off-by: Chao Yu <chao2.yu@samsung.com>
> ---
>  fs/f2fs/data.c  | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/f2fs/f2fs.h  |  15 +++++++++
>  fs/f2fs/inode.c |  16 +++++++++
>  fs/f2fs/namei.c |   7 ++++
>  fs/f2fs/super.c |  50 ++++++++++++++++++++++++++--
>  5 files changed, 186 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index a82abe9..39b6339 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -20,6 +20,8 @@
>  #include <linux/prefetch.h>
>  #include <linux/uio.h>
>  #include <linux/cleancache.h>
> +#include <linux/kthread.h>
> +#include <linux/freezer.h>
>  
>  #include "f2fs.h"
>  #include "node.h"
> @@ -27,6 +29,104 @@
>  #include "trace.h"
>  #include <trace/events/f2fs.h>
>  
> +static void f2fs_do_data_flush(struct f2fs_sb_info *sbi)
> +{
> +	struct list_head *inode_list = &sbi->inode_list;
> +	struct f2fs_inode_info *fi, *tmp;
> +	struct inode *inode;
> +	unsigned int number;
> +
> +	spin_lock(&sbi->inode_lock);
> +	number = sbi->inode_num;
> +	list_for_each_entry_safe(fi, tmp, inode_list, i_flush) {
> +
> +		if (number-- == 0)
> +			break;
> +
> +		inode = &fi->vfs_inode;
> +
> +		/*
> +		 * If the inode is in evicting path, we will fail to igrab
> +		 * inode since I_WILL_FREE or I_FREEING should be set in
> +		 * inode, so after grab valid inode, it's safe to flush
> +		 * dirty page after unlock inode_lock.
> +		 */
> +		inode = igrab(inode);
> +		if (!inode)
> +			continue;
> +
> +		spin_unlock(&sbi->inode_lock);
> +
> +		if (!get_dirty_pages(inode))
> +			goto next;
> +
> +		filemap_flush(inode->i_mapping);
> +next:
> +		iput(inode);
> +		spin_lock(&sbi->inode_lock);
> +	}
> +	spin_unlock(&sbi->inode_lock);
> +}
> +
> +static int f2fs_data_flush_thread(void *data)
> +{
> +	struct f2fs_sb_info *sbi = data;
> +	wait_queue_head_t *wq = &sbi->dflush_wait_queue;
> +	struct cp_control cpc;
> +	unsigned long wait_time;
> +
> +	wait_time = sbi->wait_time;
> +
> +	do {
> +		if (try_to_freeze())
> +			continue;
> +		else
> +			wait_event_interruptible_timeout(*wq,
> +						kthread_should_stop(),
> +						msecs_to_jiffies(wait_time));
> +		if (kthread_should_stop())
> +			break;
> +
> +		if (sbi->sb->s_writers.frozen >= SB_FREEZE_WRITE)
> +			continue;
> +
> +		mutex_lock(&sbi->gc_mutex);
> +
> +		f2fs_do_data_flush(sbi);
> +
> +		cpc.reason = __get_cp_reason(sbi);
> +		write_checkpoint(sbi, &cpc);
> +
> +		mutex_unlock(&sbi->gc_mutex);
> +
> +	} while (!kthread_should_stop());
> +	return 0;
> +}
> +
> +int start_data_flush_thread(struct f2fs_sb_info *sbi)
> +{
> +	dev_t dev = sbi->sb->s_bdev->bd_dev;
> +	int err = 0;
> +
> +	init_waitqueue_head(&sbi->dflush_wait_queue);
> +	sbi->data_flush_thread = kthread_run(f2fs_data_flush_thread, sbi,
> +			"f2fs_flush-%u:%u", MAJOR(dev), MINOR(dev));
> +	if (IS_ERR(sbi->data_flush_thread)) {
> +		err = PTR_ERR(sbi->data_flush_thread);
> +		sbi->data_flush_thread = NULL;
> +	}
> +
> +	return err;
> +}
> +
> +void stop_data_flush_thread(struct f2fs_sb_info *sbi)
> +{
> +	if (!sbi->data_flush_thread)
> +		return;
> +	kthread_stop(sbi->data_flush_thread);
> +	sbi->data_flush_thread = NULL;
> +}
> +
>  static void f2fs_read_end_io(struct bio *bio)
>  {
>  	struct bio_vec *bvec;
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index f1a90ff..b6790c9 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -52,6 +52,7 @@
>  #define F2FS_MOUNT_NOBARRIER		0x00000800
>  #define F2FS_MOUNT_FASTBOOT		0x00001000
>  #define F2FS_MOUNT_EXTENT_CACHE		0x00002000
> +#define F2FS_MOUNT_DATA_FLUSH		0X00004000
>  
>  #define clear_opt(sbi, option)	(sbi->mount_opt.opt &= ~F2FS_MOUNT_##option)
>  #define set_opt(sbi, option)	(sbi->mount_opt.opt |= F2FS_MOUNT_##option)
> @@ -322,6 +323,8 @@ enum {
>  					 */
>  };
>  
> +#define DEF_DATA_FLUSH_DELAY_TIME	5000	/* delay time of data flush */
> +
>  #define F2FS_LINK_MAX	0xffffffff	/* maximum link count per file */
>  
>  #define MAX_DIR_RA_PAGES	4	/* maximum ra pages of dir */
> @@ -436,6 +439,8 @@ struct f2fs_inode_info {
>  
>  	struct extent_tree *extent_tree;	/* cached extent_tree entry */
>  
> +	struct list_head i_flush;	/* link in inode_list of sbi */
> +
>  #ifdef CONFIG_F2FS_FS_ENCRYPTION
>  	/* Encryption params */
>  	struct f2fs_crypt_info *i_crypt_info;
> @@ -808,6 +813,14 @@ struct f2fs_sb_info {
>  	struct list_head s_list;
>  	struct mutex umount_mutex;
>  	unsigned int shrinker_run_no;
> +
> +	/* For data flush support */
> +	struct task_struct *data_flush_thread;	/* data flush task */
> +	wait_queue_head_t dflush_wait_queue;	/* data flush wait queue */
> +	unsigned long wait_time;		/* wait time for flushing */
> +	struct list_head inode_list;		/* link all inmem inode */
> +	spinlock_t inode_lock;			/* protect inode list */
> +	unsigned int inode_num;			/* inode number in inode_list */
>  };
>  
>  /*
> @@ -1780,6 +1793,8 @@ void destroy_checkpoint_caches(void);
>  /*
>   * data.c
>   */
> +int start_data_flush_thread(struct f2fs_sb_info *);
> +void stop_data_flush_thread(struct f2fs_sb_info *);
>  void f2fs_submit_merged_bio(struct f2fs_sb_info *, enum page_type, int);
>  int f2fs_submit_page_bio(struct f2fs_io_info *);
>  void f2fs_submit_page_mbio(struct f2fs_io_info *);
> diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
> index 35aae65..6bf22ad 100644
> --- a/fs/f2fs/inode.c
> +++ b/fs/f2fs/inode.c
> @@ -158,6 +158,13 @@ static int do_read_inode(struct inode *inode)
>  	stat_inc_inline_inode(inode);
>  	stat_inc_inline_dir(inode);
>  
> +	if (S_ISREG(inode->i_mode) || S_ISLNK(inode->i_mode)) {
> +		spin_lock(&sbi->inode_lock);
> +		list_add_tail(&fi->i_flush, &sbi->inode_list);
> +		sbi->inode_num++;
> +		spin_unlock(&sbi->inode_lock);
> +	}
> +
>  	return 0;
>  }
>  
> @@ -335,6 +342,15 @@ void f2fs_evict_inode(struct inode *inode)
>  
>  	f2fs_destroy_extent_tree(inode);
>  
> +	if (S_ISREG(inode->i_mode) || S_ISLNK(inode->i_mode)) {
> +		spin_lock(&sbi->inode_lock);
> +		if (!list_empty(&fi->i_flush)) {
> +			list_del(&fi->i_flush);
> +			sbi->inode_num--;
> +		}
> +		spin_unlock(&sbi->inode_lock);
> +	}
> +
>  	if (inode->i_nlink || is_bad_inode(inode))
>  		goto no_delete;
>  
> diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
> index a680bf3..f639e96 100644
> --- a/fs/f2fs/namei.c
> +++ b/fs/f2fs/namei.c
> @@ -71,6 +71,13 @@ static struct inode *f2fs_new_inode(struct inode *dir, umode_t mode)
>  	stat_inc_inline_inode(inode);
>  	stat_inc_inline_dir(inode);
>  
> +	if (S_ISREG(inode->i_mode) || S_ISLNK(inode->i_mode)) {
> +		spin_lock(&sbi->inode_lock);
> +		list_add_tail(&F2FS_I(inode)->i_flush, &sbi->inode_list);
> +		sbi->inode_num++;
> +		spin_unlock(&sbi->inode_lock);
> +	}
> +
>  	trace_f2fs_new_inode(inode, 0);
>  	mark_inode_dirty(inode);
>  	return inode;
> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> index f794781..286cdb4 100644
> --- a/fs/f2fs/super.c
> +++ b/fs/f2fs/super.c
> @@ -67,6 +67,7 @@ enum {
>  	Opt_extent_cache,
>  	Opt_noextent_cache,
>  	Opt_noinline_data,
> +	Opt_data_flush,
>  	Opt_err,
>  };
>  
> @@ -91,6 +92,7 @@ static match_table_t f2fs_tokens = {
>  	{Opt_extent_cache, "extent_cache"},
>  	{Opt_noextent_cache, "noextent_cache"},
>  	{Opt_noinline_data, "noinline_data"},
> +	{Opt_data_flush, "data_flush"},
>  	{Opt_err, NULL},
>  };
>  
> @@ -215,6 +217,7 @@ F2FS_RW_ATTR(SM_INFO, f2fs_sm_info, min_fsync_blocks, min_fsync_blocks);
>  F2FS_RW_ATTR(NM_INFO, f2fs_nm_info, ram_thresh, ram_thresh);
>  F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, max_victim_search, max_victim_search);
>  F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, dir_level, dir_level);
> +F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, wait_time, wait_time);
>  
>  #define ATTR_LIST(name) (&f2fs_attr_##name.attr)
>  static struct attribute *f2fs_attrs[] = {
> @@ -231,6 +234,7 @@ static struct attribute *f2fs_attrs[] = {
>  	ATTR_LIST(max_victim_search),
>  	ATTR_LIST(dir_level),
>  	ATTR_LIST(ram_thresh),
> +	ATTR_LIST(wait_time),
>  	NULL,
>  };
>  
> @@ -397,6 +401,9 @@ static int parse_options(struct super_block *sb, char *options)
>  		case Opt_noinline_data:
>  			clear_opt(sbi, INLINE_DATA);
>  			break;
> +		case Opt_data_flush:
> +			set_opt(sbi, DATA_FLUSH);
> +			break;
>  		default:
>  			f2fs_msg(sb, KERN_ERR,
>  				"Unrecognized mount option \"%s\" or missing value",
> @@ -434,6 +441,8 @@ static struct inode *f2fs_alloc_inode(struct super_block *sb)
>  	/* Will be used by directory only */
>  	fi->i_dir_level = F2FS_SB(sb)->dir_level;
>  
> +	INIT_LIST_HEAD(&fi->i_flush);
> +
>  #ifdef CONFIG_F2FS_FS_ENCRYPTION
>  	fi->i_crypt_info = NULL;
>  #endif
> @@ -514,6 +523,8 @@ static void f2fs_put_super(struct super_block *sb)
>  	}
>  	kobject_del(&sbi->s_kobj);
>  
> +	stop_data_flush_thread(sbi);
> +
>  	stop_gc_thread(sbi);
>  
>  	/* prevent remaining shrinker jobs */
> @@ -742,6 +753,8 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data)
>  	int err, active_logs;
>  	bool need_restart_gc = false;
>  	bool need_stop_gc = false;
> +	bool need_restart_df = false;
> +	bool need_stop_df = false;
>  
>  	sync_filesystem(sb);
>  
> @@ -785,6 +798,19 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data)
>  		need_stop_gc = true;
>  	}
>  
> +	if ((*flags & MS_RDONLY) || !test_opt(sbi, DATA_FLUSH)) {
> +		if (sbi->data_flush_thread) {
> +			stop_data_flush_thread(sbi);
> +			f2fs_sync_fs(sb, 1);
> +			need_restart_df = true;
> +		}
> +	} else if (!sbi->data_flush_thread) {
> +		err = start_data_flush_thread(sbi);
> +		if (err)
> +			goto restore_gc;
> +		need_stop_df = true;
> +	}
> +
>  	/*
>  	 * We stop issue flush thread if FS is mounted as RO
>  	 * or if flush_merge is not passed in mount option.
> @@ -794,13 +820,21 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data)
>  	} else if (!SM_I(sbi)->cmd_control_info) {
>  		err = create_flush_cmd_control(sbi);
>  		if (err)
> -			goto restore_gc;
> +			goto restore_df;
>  	}
>  skip:
>  	/* Update the POSIXACL Flag */
>  	 sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
>  		(test_opt(sbi, POSIX_ACL) ? MS_POSIXACL : 0);
>  	return 0;
> +restore_df:
> +	if (need_restart_df) {
> +		if (start_data_flush_thread(sbi))
> +			f2fs_msg(sbi->sb, KERN_WARNING,
> +				"background data flush thread has stopped");
> +	} else if (need_stop_df) {
> +		stop_data_flush_thread(sbi);
> +	}
>  restore_gc:
>  	if (need_restart_gc) {
>  		if (start_gc_thread(sbi))
> @@ -1216,6 +1250,11 @@ try_onemore:
>  	INIT_LIST_HEAD(&sbi->dir_inode_list);
>  	spin_lock_init(&sbi->dir_inode_lock);
>  
> +	sbi->wait_time = DEF_DATA_FLUSH_DELAY_TIME;
> +	INIT_LIST_HEAD(&sbi->inode_list);
> +	spin_lock_init(&sbi->inode_lock);
> +	sbi->inode_num = 0;
> +
>  	init_extent_cache_info(sbi);
>  
>  	init_ino_entry_info(sbi);
> @@ -1324,6 +1363,12 @@ try_onemore:
>  		if (err)
>  			goto free_kobj;
>  	}
> +
> +	if (test_opt(sbi, DATA_FLUSH) && !f2fs_readonly(sb)) {
> +		err = start_data_flush_thread(sbi);
> +		if (err)
> +			goto stop_gc;
> +	}
>  	kfree(options);
>  
>  	/* recover broken superblock */
> @@ -1333,7 +1378,8 @@ try_onemore:
>  	}
>  
>  	return 0;
> -
> +stop_gc:
> +	stop_gc_thread(sbi);
>  free_kobj:
>  	kobject_del(&sbi->s_kobj);
>  free_proc:
> -- 
> 2.4.2

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: sync/umount hang on 3.18.21, 1.4TB gone after crash
  2015-09-25  6:00   ` Marc Lehmann
  2015-09-25  6:01     ` Marc Lehmann
@ 2015-09-25 18:42     ` Jaegeuk Kim
  2015-09-26  3:08       ` Marc Lehmann
  1 sibling, 1 reply; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-25 18:42 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Fri, Sep 25, 2015 at 08:00:19AM +0200, Marc Lehmann wrote:
> On Thu, Sep 24, 2015 at 11:50:23AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > > When I came back after ~10 hours, I found a number of hung task messages
> > > in syslog, and when I entered sync, sync was consuming 100% system time.
> > 
> > Hmm, at this time, it would be good to check what process is stuck through
> > sysrq.
> 
> It was only intermittently, but here they are. The first one is almost
> certainly the sync that I originally didn't have a backtrace for, the
> second one is one that came up frequently during the f2fs test.
> 
>    INFO: task sync:10577 blocked for more than 120 seconds.
>          Tainted: G        W  OE   4.2.1-040201-generic #201509211431
>    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>    sync            D ffff88082ec964c0     0 10577  10549 0x00000000
>     ffff88000210fdc8 0000000000000082 ffff88062ef2a940 ffff88010337e040
>     0000000000000246 ffff880002110000 ffff8806294915f8 ffff8805c939b800
>     ffff88000210fe54 ffffffff8121a910 ffff88000210fde8 ffffffff817a5a37
>    Call Trace:
>     [<ffffffff8121a910>] ? SyS_tee+0x360/0x360
>     [<ffffffff817a5a37>] schedule+0x37/0x80
>     [<ffffffff81211f09>] wb_wait_for_completion+0x49/0x80
>     [<ffffffff810b6f90>] ? prepare_to_wait_event+0xf0/0xf0
>     [<ffffffff81213134>] sync_inodes_sb+0x94/0x1b0
>     [<ffffffff8121a910>] ? SyS_tee+0x360/0x360
>     [<ffffffff8121a925>] sync_inodes_one_sb+0x15/0x20
>     [<ffffffff811ed1b9>] iterate_supers+0xb9/0x110
>     [<ffffffff8121ac65>] sys_sync+0x35/0x90
>     [<ffffffff817a9272>] entry_SYSCALL_64_fastpath+0x16/0x75
> 
>    INFO: task watchdog/1:14743 blocked for more than 120 seconds.
>          Tainted: P           OE  3.18.21-031821-generic #201509020527
>    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>    watchdog/1      D ffff88082ec93300     0 14743      2 0x00000000
>     ffff8801a2383c48 0000000000000046 ffff880273a50000 0000000000013300
>     ffff8801a2383fd8 0000000000013300 ffff8802e642a800 ffff880273a50000
>     0000000000001000 ffffffff81c23d80 ffffffff81c23d84 ffff880273a50000
>    Call Trace:
>     [<ffffffff817847f9>] schedule_preempt_disabled+0x29/0x70
>     [<ffffffff81786435>] __mutex_lock_slowpath+0x95/0x100
>     [<ffffffff810a8ac9>] ? enqueue_entity+0x289/0xb20
>     [<ffffffff817864c3>] mutex_lock+0x23/0x37
>     [<ffffffff81029823>] x86_pmu_event_init+0x343/0x430
>     [<ffffffff811680db>] perf_init_event+0xcb/0x130
>     [<ffffffff811684d8>] perf_event_alloc+0x398/0x440
>     [<ffffffff810a8431>] ? put_prev_entity+0x31/0x3f0
>     [<ffffffff811249b0>] ? restart_watchdog_hrtimer+0x60/0x60
>     [<ffffffff81169156>] perf_event_create_kernel_counter+0x26/0x100
>     [<ffffffff8112477d>] watchdog_nmi_enable+0xcd/0x170
>     [<ffffffff81124865>] watchdog_enable+0x45/0xa0
>     [<ffffffff81093f09>] smpboot_thread_fn+0xb9/0x1a0
>     [<ffffffff8108ff9c>] ? __kthread_parkme+0x4c/0x80
>     [<ffffffff81093e50>] ? SyS_setgroups+0x180/0x180
>     [<ffffffff81090219>] kthread+0xc9/0xe0
>     [<ffffffff81090150>] ? kthread_create_on_node+0x180/0x180
>     [<ffffffff81788618>] ret_from_fork+0x58/0x90
>     [<ffffffff81090150>] ? kthread_create_on_node+0x180/0x180
> 
> The watchdog might or might not be unrelated, but it is either a 4.2.1
> thing (new kernel) or f2fs related. I only had them during the f2fs test,
> and often, not before or after.
> 
> (I don't know what that kernel thread does, but the system was somewhat
> sluggish during the test, and other, unrelated servcies, were negatively
> affected).
> 
> > It seems there was no fsync after sync at all. That's why f2fs recovered back to
> > the latest checkpoint. Anyway, I'm thinking that it's worth to add a kind of
> > periodic checkpoints.
> 
> Well, would it sync more often if this problem hadn't occured? Most
> filesystems (or rather, the filesystems I use, btrfs, xfs, ext* and zfs)
> seem to have their own regular commit interval, or otherwise commit
> frequently if it is cheap enough.

AFAIK, *commit* there means syncing metadata, not userdata, doesn't it?
So, even if you saw no data loss, the filesystem doesn't guarantee that all
the data was completely recovered, since sync or fsync was not called for
that file.

I think you need to tune the system-wide flusher-related parameters
mentioned by Chao for your workloads.
And we should expect periodic checkpoints to be able to recover the
previously flushed data.

Thanks,

> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: sync/umount hang on 3.18.21, 1.4TB gone after crash
  2015-09-25 18:42     ` Jaegeuk Kim
@ 2015-09-26  3:08       ` Marc Lehmann
  2015-09-26  7:27         ` Jaegeuk Kim
  0 siblings, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-09-26  3:08 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Fri, Sep 25, 2015 at 11:42:02AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> AFAIK, there-in *commit* means syncing metadata, not userdata. Doesn't it?

In general, no, they commit userdata, even if not necessarily at the same
time. ext* for example has three modes, and the default commits userdata
before the corresponding metadata (data=ordered).

But even when you relax this (data=writeback), a few minutes after a file is
written, both userdata and metadata are there (usually after 30s). Data that
was just being written is generally mixed, but that's an easy to handle
trade-off.

(and then there is data=journal, which should get perfectly ordered
behaviour for both, at high cost, and flushoncommit).

Early (linux) versions of XFS were more like a brutal version of writeback
- files recently written before a crash were frequently filled with zero
bytes (something I haven't seen with irix, which frequently crashed :).
But they somehow made it work - I was a frequent victim of zero-filled
files, but for many years it didn't happen for me. So while I don't know
if it's a guarantee, in practice, file data is there together with the
metadata, and usually within the writeback period configured in the kernel
(+ whatever time it takes to write out the data, which can be substantial,
but can also be limited in /proc/sys/vm, especially dirty_bytes and
dirty_background_bytes).

Note also that filesystems often special case write + rename over old
file, and various other cases, to give a good user experience.

So for traditional filesystems, /proc/sys/vm/dirty_bytes + dirty_expire
gives a pretty good way of defining a) how much data is lost and b) within
which timeframe. Such filesystems also have their own setting for metadata
commit, but they are generally within the timeframe of a few seconds to
half a minute.

It does not have the nice "exact version of a point in time" qualities you
can get from log-based file system, but they give quite nice guarantees in
practice - if a file was half-written, it does not have its full length
but corrupted data inside for example.

For things like database files, this could be an issue, as indeed you
don't control the order of things written, but programs *know* about this
problem and fsync accordingly (and the kernel has extra support for these
things, as in sync_page_range and so on).

So, in general, filesystems only commit metadata, but the kernel commits
userdata on its own, and as extra feature, "good" filesystems such
as xfs or ext* have extra logic to commit userdata before committing
corresponding metadata (or after).

Note also that with most journal-based filesystems, commit just forces the
issue, both metadata and userdata usually hit the disk much earlier.

In addition, the issue at hand is f2fs losing metadata, not userdata,
as all data had been written to the device hours before the crash. The
userdata was all there, but the filesystem forgot how to access it.

> So, even if you saw no data loss, filesystem doesn't guarantee all the data were
> completely recovered, since sync or fsync was not called for that file.

No, but I feel fairly confident that a file written over a minute ago
on a box that is sitting idle for a minute is still there after a crash,
barring hardware faults.

Now, I am not necessarily criticizing f2fs here, after all, the problem at
hand is f2fs _hanging_, which is clearly a bug. I don't know how well f2fs
performs with this bug fixed, regarding data loss.

Also, f2fs is a different beast - syncs can take a veeery long time on f2fs
compared to xfs or ext4, and maybe that is due to the design of f2fs (I
suppose so, but you can correct me). In which case it might not be such a
good idea to commit every 30s. Maybe my performance problem was because
f2fs committed every 30s.

> I think you need to tune the system-wide parameters related to flusher mentioned
> by Chao for your workloads.

I already do configure these extensively, according to my workload. On the
box I did my recent tests:

   vm.dirty_ratio = 80
   vm.dirty_background_ratio = 4
   vm.dirty_writeback_centisecs = 100
   vm.dirty_expire_centisecs = 100

These are pretty aggressive. The reason is that the box has 32GB of ram, and
with default values it is not uncommon to get 10-20gb of dirty data before a
writeback, which then more or less freezes everything and can take a long
time. So the above values don't wait long to write userdata, and make sure a
process generating lots of dirty blocks can't freeze the system.

Specifically, in the case of tar writing files, tar will start blocking after
only ~1.3GB of dirty data.

That means with a "conventional" filesystem, I lose at most 1.3GB of data
+ less than 30s, on a crash.

> And, we need to expect periodic checkpoints are able to recover the previously
> flushed data.

Yes, I would consider this a must, however, again, I can accept if f2fs
needs much higher "commit" intervals than other filesystems (say, 10
minutes), if that is needed to make it performant.

But some form of fixed timeframe is needed, I think, whether it's seconds
or minutes.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/4.2.1
  2015-09-25  9:47       ` Chao Yu
  2015-09-25 18:20         ` Jaegeuk Kim
@ 2015-09-26  3:22         ` Marc Lehmann
  2015-09-26  5:25           ` write performance difference 3.18.21/git f2fs Marc Lehmann
  2015-09-26  7:48           ` write performance difference 3.18.21/4.2.1 Jaegeuk Kim
  1 sibling, 2 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-09-26  3:22 UTC (permalink / raw)
  To: linux-f2fs-devel

On Fri, Sep 25, 2015 at 05:47:12PM +0800, Chao Yu <chao2.yu@samsung.com> wrote:
> Please revert the commit 7c5e466755ff ("f2fs: readahead cp payload
> pages when mount") since in this commit we try to access invalid
> SIT_I(sbi)->sit_base_addr which should be inited later.

Wow, you are fast. To make it short, the new module loads and mounts. Since
systemd failed to clear the dmcache again, I need to wait a few hours for it
to write back before testing. On the plus side, this gives a fairly high
chance of fragmented memory, so I can test the code that avoids oom on mount
as well :)

> > Since whatever speed difference I saw with two logs wasn't big, you
> > completely sold me on 6 logs, or 4 (especially if it speeds up the gc,
> > which I haven't much tested yet). Two logs was merely a test anyway (the
> > same with no_heap, I don't know what it does, but I thought it is worth
> > a try, as metadata + data nearer together is better than having them at
> > opposite ends of the log or so).
> 
> If the section size is pretty large, no_heap would be enough. The original
> intention was to provide more contiguous space for data only so that a big
> file could have a large extent instead of splitting by its metadata.

Great, so no_heap it is.

Also, I was thinking a bit more on the active_logs issue.

The problem with SMR drives and too many logs is not just locality,
but the fact that appending data, unlike with flash, requires a
read-modify-write cycle. Likewise, I am pretty sure the disk can't keep
6 open write fragments in memory - maybe it can only keep one, so every
metadata write might cause a RMW cycle again, because it's not big enough
to fill a full zone (17-30MB).

So, hmm, well, separating the metadata that frequently changes
(directories) from the rest is necessary for the GC to not have to copy
almost all data blocks, but otherwise, it's nice if everything else clumps
together.

(likewise, stat information probably changes a lot more often than file
data, e.g. chown -R user . will change stat data regardless of whether the
files already belong to a user, and it would be nice if that means the
data blocks can be kept untouched. Similarly, renames).

What would you recommend for this case?

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-25 17:45                         ` Jaegeuk Kim
@ 2015-09-26  3:32                           ` Marc Lehmann
  2015-09-26  7:36                             ` Jaegeuk Kim
  0 siblings, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-09-26  3:32 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Fri, Sep 25, 2015 at 10:45:46AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > He :) It's a nothing-special number between 64 and 128, that's all.
> 
> Oh, then, I don't think that is a good magic number.

Care to share why? :)

> It seems that you decided to use -s64, so it'd better to keep it to address
> any perf results.

Is there anything especially good about powers of two? Or do you just want to
reduce the number of changed variables?

If yes, should I do the 3.18.21 test with -s90 (as the 3.18.21 and 4.2.1
tests before), or with -s64?

> > And just filling these 8TB disks takes days, so the question is, can I
> > simulate near-full behaviour with smaller partitions.
> 
> Why not? :)
> I think the behavior should be same. And, it'd good to set small sections
> in order to see it more clearly.

The section size is a critical parameter for these drives. Also, the data
mix is the same for 8TB and smaller partitions (in these tests, which were
meant to be the first round of tests only anyway).

So a smaller section size compared to the full partition test, I think,
would result in very different behaviour. Likewise, if a small partition
has comparatively more (or absolutely less) overprovision (and/or reserved
space), this again might cause different behaviour.

At least to me, it's not obvious what a good comparable overprovision ratio
is to test full device behaviour on a smaller partition.

Also, section sizes vary by a factor of two over the device, so what might
work fine with -s64 in the middle of the disk, might work badly at the end.

Likewise, since the files don't get larger, the GC might do a much better
job at -s64 than at -s128 (almost certainly, actually).

As a thought experiment, what happens when I use -s8 or a similar small size?
If the GC writes linearly, there won't be too many RMW cycles. But is that
guaranteed even with an aging filesystem?

If yes, then the best -s number might be 1. Because all I rely on is
mostly linear batched large writes, not so much large batched reads.

That is, unfortunately, not something I can easily test.

> Let me test this patch for a while, and then push into our git.

Thanks, will do so, then.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-25  8:05                     ` Chao Yu
@ 2015-09-26  3:42                       ` Marc Lehmann
  0 siblings, 0 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-09-26  3:42 UTC (permalink / raw)
  To: linux-f2fs-devel

On Fri, Sep 25, 2015 at 04:05:48PM +0800, Chao Yu <chao2.yu@samsung.com> wrote:
> Actually, we should set the value of 'count' parameter to indicate how many
> times we want to do gc in one batch, at most 16 times in a loop for each
> ioctl invoking:
>         ioctl(fd, F2FS_IOC_GC, &count);
> After ioctl returned successfully, 'count' parameter will contain the count
> of gces we did actually.

Ah, so this way, I could even find out when to stop.

> One batch means a certain number of gces executing serially.

Thanks for the explanation - well, I guess there is no harm in setting
count to 1 and calling it repeatedly, as GC operations should generally be
slow enough so many repeated calls will be ok.
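
For reference, roughly what I have in mind - a minimal userspace sketch. The
ioctl request number and the exact in/out semantics of 'count' are assumptions
taken from the description above; the real definitions should come from the
f2fs header of the kernel actually in use:

   /* gcloop.c - minimal sketch: trigger f2fs GC one batch at a time until
    * it reports no more progress.  Build with: cc -o gcloop gcloop.c */
   #include <stdio.h>
   #include <fcntl.h>
   #include <unistd.h>
   #include <sys/ioctl.h>
   #include <linux/ioctl.h>
   #include <linux/types.h>

   #ifndef F2FS_IOC_GC
   #define F2FS_IOCTL_MAGIC 0xf5                          /* assumed */
   #define F2FS_IOC_GC _IOWR(F2FS_IOCTL_MAGIC, 6, __u32)  /* assumed */
   #endif

   int main(int argc, char **argv)
   {
       int fd;

       if (argc < 2) {
           fprintf(stderr, "usage: %s <file-or-dir-on-the-f2fs-mount>\n", argv[0]);
           return 1;
       }

       fd = open(argv[1], O_RDONLY);
       if (fd < 0) {
           perror("open");
           return 1;
       }

       for (;;) {
           __u32 count = 1;   /* ask for one gc per call, as discussed */

           if (ioctl(fd, F2FS_IOC_GC, &count) < 0) {
               perror("F2FS_IOC_GC");   /* e.g. nothing left to clean */
               break;
           }
           printf("did %u gc pass(es)\n", count);
           if (!count)
               break;                   /* no progress, stop */
       }

       close(fd);
       return 0;
   }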

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* write performance difference 3.18.21/git f2fs
  2015-09-26  3:22         ` Marc Lehmann
@ 2015-09-26  5:25           ` Marc Lehmann
  2015-09-26  5:57             ` Marc Lehmann
  2015-09-26  7:52             ` Jaegeuk Kim
  2015-09-26  7:48           ` write performance difference 3.18.21/4.2.1 Jaegeuk Kim
  1 sibling, 2 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-09-26  5:25 UTC (permalink / raw)
  To: linux-f2fs-devel

Ok, before I tried the f2fs git I made another short test with the
original 3.18.21 f2fs, and it was as fast as before. Then I used the
faulty f2fs module, which forced a reboot.

Now I started to redo the 3.18.21 test + git f2fs, with the same parameters
(specifically, -s90), and while it didn't start out to be as slow as 4.2.1,
it's similarly slow.

After 218GiB, I stopped the test, giving me an average of 50MiB/s.

Here is typical dstat output (again, dsk/sde):

http://ue.tst.eu/7a40644b3432e2932bdd8c1f6b6fc32d.txt

So less read behaviour than with 4.2.1, but also very slow writes.

That means the performance drop moves with f2fs, not the kernel version.

This is the resulting status:

http://ue.tst.eu/6d94e9bfad48a433bbc6f7daeaf5eb38.txt

Just for fun I'll start doing a -s64 run.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/git f2fs
  2015-09-26  5:25           ` write performance difference 3.18.21/git f2fs Marc Lehmann
@ 2015-09-26  5:57             ` Marc Lehmann
  2015-09-26  7:52             ` Jaegeuk Kim
  1 sibling, 0 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-09-26  5:57 UTC (permalink / raw)
  To: linux-f2fs-devel

On Sat, Sep 26, 2015 at 07:25:51AM +0200, Marc Lehmann <schmorp@schmorp.de> wrote:
> Just for fun I'll start doing a -s64 run.

Same thing with -s64.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: sync/umount hang on 3.18.21, 1.4TB gone after crash
  2015-09-26  3:08       ` Marc Lehmann
@ 2015-09-26  7:27         ` Jaegeuk Kim
  0 siblings, 0 replies; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-26  7:27 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Sat, Sep 26, 2015 at 05:08:33AM +0200, Marc Lehmann wrote:
> On Fri, Sep 25, 2015 at 11:42:02AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > AFAIK, there-in *commit* means syncing metadata, not userdata. Doesn't it?
> 
> In general, no, they commit userdata, even if not necessarily at the same
> time. ext* for example has three modes, and the default commits userdata
> before the corresponding metadata (data=ordered).

Well, when I take a look at other filesystems, filemap_flush is called in some
specific cases such as release_file, rename, and transaction stuffs.

> 
> But even when you relax this (data=writeback), a few minutes after a file is
> written, both userdata and metadata are there (usually after 30s). Data that
> was just being written is generally mixed, but that's an easy to handle
> trade-off.

I think that should be done by flusher not by filesystem.

> (and then there is data=journal, which should get perfectly ordered
> behaviour for both, at high cost, and flushoncommit).
> 
> Early (linux) versions of XFS were more like a brutal version of writeback
> - files recently written before a crash were frequently filled with zero
> bytes (something I haven't seen with irix, which frequently crashed :).
> But they somehow made it work - I was a frequent victim of zero-filled
> files, but for many years it didn't happen for me. So while I don't know
> if it's a guarantee, in practice, file data is there together with the
> metadata, and usually within the writeback period configured in the kernel
> (+ whatever time it takes to write out the data, which can be substantial,
> but can also be limited in /proc/sys/vm, especially dirty_bytes and
> dirty_background_bytes).

I think that's why btrfs/xfs/ext4 call filemap_flush at release_file, which
means data blocks are flushed when all open files are closed.
Then, xfs and ext4 support metadata journalling, so recent changes also could be
recovered.
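
Just to illustrate the pattern (a simplified sketch only, not the actual
ext4/btrfs code; the conditions checked here are assumptions):

   #include <linux/fs.h>
   #include <linux/atomic.h>

   /* flush dirty data pages when the last writer closes the file */
   static int example_release_file(struct inode *inode, struct file *filp)
   {
       /* assumed condition: opened for write, and last open writer */
       if ((filp->f_mode & FMODE_WRITE) &&
           atomic_read(&inode->i_writecount) == 1)
               filemap_flush(inode->i_mapping);   /* start data writeback */
       return 0;
   }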

> Note also that filesystems often special case write + rename over old
> file, and various other cases, to give a good user experience.

The filemap_flush includes also rename case too.

> So for traditional filesystems, /proc/sys/vm/dirty_bytes + dirty_expire
> gives a pretty good way of defining a) how much data is lost and b) within
> which timeframe. Such filesystems also have their own setting for metadata
> commit, but they are generally within the timeframe of a few seconds to
> half a minute.
> 
> It does not have the nice "exact version of a point in time" qualities you
> can get from log-based file system, but they give quite nice guarantees in
> practice - if a file was half-written, it does not have its full length
> but corrupted data inside for example.

Indeed, xfs and ext4 have a metadata journalling which gives a good user
experience in terms of sudden power-offs.
For now, f2fs recovers only fsynced files after power-cut, so that's
somewhat different weak point. But, later I think f2fs is also able to support
that too.

> For things like database files, this could be an issue, as indeed you
> don't control the order of things written, but programs *know* about this
> problem and fsync accordingly (and the kernel has extra support for these
> things, as in sync_page_range and so on).
> 
> So, in general, filesystems only commit metadata, but the kernel commits
> userdata on its own, and as extra feature, "good" filesystems such
> as xfs or ext* have extra logic to commit userdata before committing
> corresponding metadata (or after).

Okay.

> Note also that with most journal-based filesystems, commit just forces the
> issue, both metadata and userdata usually hit the disk much earlier.
> 
> In addition, the issue at hand is f2fs losing metadata, not userdata,
> as all data had been written to the device hours before the crash. The
> userdata was all there, but the filesystem forgot how to access it.

Yeah, so I think it would be better to do periodic checkpoint, and filemap_flush
for release_file and rename stuffs.

> 
> > So, even if you saw no data loss, filesystem doesn't guarantee all the data were
> > completely recovered, since sync or fsync was not called for that file.
> 
> No, but I feel fairly confident that a file written over a minute ago
> on a box that is sitting idle for a minute is still there after a crash,
> barring hardware faults.
> 
> Now, I am not necessarily criticizing f2fs here, after all, the problem at
> hand is f2fs _hanging_, which is clearly a bug. I don't know how well f2fs
> performs with this bug fixed, regarding data loss.
> 
> Also, f2fs is a different beast - syncs can take a veeery long time on f2fs
> compared to xfs or ext4, and maybe that is due to the design of f2fs (I
> suppose so, but you can correct me). In which case it might not be such a
> good idea to commit every 30s. Maybe my performance problem was because
> f2fs committed every 30s.

Normally the checkpointing time is not so high. So, maybe there is something
else going on, like flushing data, dentries, or a huge number of prefree entries.
If possible, it needs to take a look at f2fs stat before sync.

> 
> > I think you need to tune the system-wide parameters related to flusher mentioned
> > by Chao for your workloads.
> 
> I already do configure these extensively, according to my workload. On the
> box I did my recent tests:
> 
>    vm.dirty_ratio = 80
>    vm.dirty_background_ratio = 4
>    vm.dirty_writeback_centisecs = 100
>    vm.dirty_expire_centisecs = 100
> 
> These are pretty aggressive. The reason is that the box has 32GB of ram, and
> with default values it is not uncommon to get 10-20gb of dirty data before a
> writeback, which then more or less freezes everything and can take a long
> time. So the above values don't wait long to write userdata, and make sure a
> process generating lots of dirty blocks can't freeze the system.
> 
> Specifically, in the case of tar writing files, tar will start blocking after
> only ~1.3GB of dirty data.
> 
> That means with a "conventional" filesystem, I lose at most 1.3GB of data
> + less than 30s, on a crash.
> 
> > And, we need to expect periodic checkpoints are able to recover the previously
> > flushed data.
> 
> Yes, I would consider this a must, however, again, I can accept if f2fs
> needs much higher "commit" intervals than other filesystems (say, 10
> minutes), if that is needed to make it performant.
> 
> But some form of fixed timeframe is needed, I think, whether it's seconds
> or minutes.

Will consider that.

Thanks,

> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-26  3:32                           ` Marc Lehmann
@ 2015-09-26  7:36                             ` Jaegeuk Kim
  2015-09-26 13:53                               ` Marc Lehmann
  0 siblings, 1 reply; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-26  7:36 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Sat, Sep 26, 2015 at 05:32:53AM +0200, Marc Lehmann wrote:
> On Fri, Sep 25, 2015 at 10:45:46AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > > He :) It's a nothing-special number between 64 and 128, that's all.
> > 
> > Oh, then, I don't think that is a good magic number.
> 
> Care to share why? :)

Mostly, in the flash storages, it is multiple 2MB normally. :)

> 
> > It seems that you decided to use -s64, so it'd better to keep it to address
> > any perf results.
> 
> Is there anything especially good about powers of two? Or do you just want to
> reduce the number of changed variables?

IMO, likewise flash storages, it needs to investigate the raw device
characteristics.

I think this can be used for SMR too.

https://github.com/bradfa/flashbench

I think there might be some hints for section size at first and performance
variation as well.

> If yes, should I do the 3.18.21 test with -s90 (as the 3.18.21 and 4.2.1
> tests before), or with -s64?
> 
> > > And just filling these 8TB disks takes days, so the question is, can I
> > > simulate near-full behaviour with smaller partitions.
> > 
> > Why not? :)
> > I think the behavior should be same. And, it'd good to set small sections
> > in order to see it more clearly.
> 
> The section size is a critical parameter for these drives. Also, the data
> mix is the same for 8TB and smaller partitions (in these tests, which were
> meant to be the first round of tests only anyway).
> 
> So a smaller section size compared to the full partition test, I think,
> would result in very different behaviour. Likewise, if a small partition
> has comparatively more (or absolutely less) overprovision (and/or reserved
> space), this again might cause different behaviour.
> 
> At least to me, it's not obvious what a good comparable overprovision ratio
> is to test full device behaviour on a smaller partition.
> 
> Also, section sizes vary by a factor of two over the device, so what might
> work fine with -s64 in the middle of the disk, might work badly at the end.
> 
> Likewise, since the files don't get larger, the GC might do a much better
> job at -s64 than at -s128 (almost certainly, actually).
> 
> As a thought experiment, what happens when I use -s8 or a similar small size?
> If the GC writes linearly, there won't be too many RMW cycles. But is that
> guaranteed even with an aging filesystem?
> 
> If yes, then the best -s number might be 1. Because all I rely on is
> mostly linear batched large writes, not so much large batched reads.
> 
> That is, unfortunately, not something I can easily test.
> 
> > Let me test this patch for a while, and then push into our git.
> 
> Thanks, will do so, then.
> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/4.2.1
  2015-09-26  3:22         ` Marc Lehmann
  2015-09-26  5:25           ` write performance difference 3.18.21/git f2fs Marc Lehmann
@ 2015-09-26  7:48           ` Jaegeuk Kim
  1 sibling, 0 replies; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-26  7:48 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Sat, Sep 26, 2015 at 05:22:18AM +0200, Marc Lehmann wrote:
> On Fri, Sep 25, 2015 at 05:47:12PM +0800, Chao Yu <chao2.yu@samsung.com> wrote:
> > Please revert the commit 7c5e466755ff ("f2fs: readahead cp payload
> > pages when mount") since in this commit we try to access invalid
> > SIT_I(sbi)->sit_base_addr which should be inited later.
> 
> Wow, you are fast. To make it short, the new module loads and mounts. Since
> systemd failed to clear the dmcache again, I need to wait a few hours for it
> to write back before testing. On the plus side, this gives a fairly high
> chance of fragmented memory, so I can test the code that avoids oom on mount
> as well :)
> 
> > > Since whatever speed difference I saw with two logs wasn't big, you
> > > completely sold me on 6 logs, or 4 (especially if it speeds up the gc,
> > > which I haven't much tested yet). Two logs was merely a test anyway (the
> > > same with no_heap, I don't know what it does, but I thought it is worth
> > > a try, as metadata + data nearer together is better than having them at
> > > opposite ends of the log or so).
> > 
> > If the section size is pretty large, no_heap would be enough. The original
> > intention was to provide more contiguous space for data only so that a big
> > file could have a large extent instead of splitting by its metadata.
> 
> Great, so no_heap it is.
> 
> Also, I was thinking a bit more on the active_logs issue.
> 
> The problem with SMR drives and too many logs is not just locality,
> but the fact that appending data, unlike with flash, requires a
> read-modify-write cycle. Likewise, I am pretty sure the disk can't keep
> 6 open write fragments in memory - maybe it can only keep one, so every
> metadata write might cause a RMW cycle again, because it's not big enough
> to fill a full zone (17-30MB).
> 
> So, hmm, well, separating the metadata that frequently changes
> (directories) from the rest is necessary for the GC to not have to copy
> almost all data blocks, but otherwise, it's nice if everything else clumps
> together.
> 
> (likewise, stat information probably changes a lot more often than file
> data, e.g. chown -R user . will change stat data regardless of whether the
> files already belong to a user, and it would be nice if that means the
> data blocks can be kept untouched. Similarly, renames).
> 
> What would you recommend for this case?

Hmm, from the device side, IMO, it's not a big concern for the number of open
zones, since f2fs normally tries to merge data and node IOs separately in order
to submit a big IO at once.
So, in my sense, it is not a big deal to use more logs.

> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/git f2fs
  2015-09-26  5:25           ` write performance difference 3.18.21/git f2fs Marc Lehmann
  2015-09-26  5:57             ` Marc Lehmann
@ 2015-09-26  7:52             ` Jaegeuk Kim
  2015-09-26 13:59               ` Marc Lehmann
  1 sibling, 1 reply; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-26  7:52 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Sat, Sep 26, 2015 at 07:25:51AM +0200, Marc Lehmann wrote:
> Ok, before I tried the f2fs git I made another short test with the
> original 3.18.21 f2fs, and it was as fast as before. Then I used the
> faulty f2fs module, which forced a reboot.
> 
> Now I started to redo the 3.18.21 test + git f2fs, with the same parameters
> (specifically, -s90), and while it didn't start out to be as slow as 4.2.1,
> it's similarly slow.
> 
> After 218GiB, I stopped the test, giving me an average of 50MiB/s.
> 
> Here is typical dstat output (again, dsk/sde):
> 
> http://ue.tst.eu/7a40644b3432e2932bdd8c1f6b6fc32d.txt
> 
> So less read behaviour than with 4.2.1, but also very slow writes.
> 
> That means the performance drop moves with f2fs, not the kernel version.
> 
> This is the resulting status:
> 
> http://ue.tst.eu/6d94e9bfad48a433bbc6f7daeaf5eb38.txt
> 
> Just for fun I'll start doing a -s64 run.

Okay, so before finding bad commits, if possible, can you get block traces?
It would be good to see block_rq_complete and block_bio_complete tracepoints.

Thanks,

> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-26  7:36                             ` Jaegeuk Kim
@ 2015-09-26 13:53                               ` Marc Lehmann
  2015-09-28 18:33                                 ` Jaegeuk Kim
  0 siblings, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-09-26 13:53 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Sat, Sep 26, 2015 at 12:36:55AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > Care to share why? :)
> 
> Mostly, in the flash storages, it is multiple 2MB normally. :)

Well, any value of -s gives me a multiple of 2MB, no? :)

> > Is there anything especially good about powers of two? Or do you just want to
> > reduce the number of changed variables?
> 
> IMO, likewise flash storages, it needs to investigate the raw device
> characteristics.

Keep in mind that I don't use it for flash, but smr drives.

We already know the raw device characteristics, basically, the zones are
between 15 and 40 or so MB in size (on the seagate 8tb drive), and they
likely don't have "even" sizes at all.

It's also by far not easy to benchmark these things, the disks can
buffer up to 25GB of random writes (and then might need several hours of
cleanup). Failing a linear write incurs a 0.6-1.6s penalty, to be paid
much later. It's a shame that none of the drive companies actually release
any usable info on their drives.

These guys made a hole into the disk and devised a lot of benchmarks to
find out the characteristics of these drives.

https://www.usenix.org/system/files/conference/fast15/fast15-paper-aghayev.pdf

So, the strategy for a fs would be to write linearly, most of the time,
without any gaps. f2fs (at least in 3.18.x) manages to do that very
nicely, which is why I really try to get it working.

But for writing once, any value of -s would probably suffice. There are
two problems when the disk gets full:

a) IPU (in-place update) writes, which the drive can't really do, so GC
might be cheaper.
b) reuse of sections - if sections are reasonably large, then when one gets
freed and reused, it is large enough to guarantee large linear writes again.

b) is the reason behind me trying large values of -s.

Since I know that f2fs is the only fs that I tested that can have a sustained
write performance on these drives that is near the physical drive
characteristics, all that needs to be done is to see how f2fs performs after
it starts gc'ing.

That's why I am so interested in disk full conditions - writing the disk
linearly once is easy, I can just write a tar to the device. Ensuring that
writes are large linear after deleting and cleaning up is harder.

nilfs is a good example - it should fit smr drives perfectly, until they
are nearly full, after which nilfs still matches smr drives perfectly,
but waiting for 8TB to be shuffled around to delete some files can take
days.  More surprising is that nilfs phenomenally fails with these drives,
performance-wise, for reasons I haven't investigated (my guess is that
nilfs leaves gaps).

> I think this can be used for SMR too.

You can run any blockdevice operation on these drives, but the results
from flashbench will be close to meaningless for them. For example, you
can't distinguish betwene a nonaligned write causing a read-modify write
from an aligned large write, or a partial write, by access time, as they
will probably all have similar access times.

> I think there might be some hints for section size at first and performance
> variation as well.

I think you confuse these drives with flash drives - while they share some
characteristics, they are completely unlike flash. There is no translation
layer, there is no need for wear leveling, zones have widely varying
sizes, appending can be expensive or cheap, depending on the write size.

What these drives need is primarily large linear writes without gaps, and
secondarily any optimisations for rotational media apply. (And for that, f2fs
performs unexpectedly well, given it wasn't meant for rotational media).

Now, if f2fs can be made to (mostly) work bug-free, but with the
characteristics of 3.18.21, and the gc can ensure that reasonably big
areas spanning multiple zones will be reused, then f2fs will be the _ONLY_ fs
able to take care of drive managed smr disks efficiently.

Specifically, these filesystems do NOT work well with these drives:

nilfs, zfs, btrfs, ext4, xfs

And modifications for these filesystems are either far away in the
future, or not targeted at drive-managed disks (ext4 already has some
modifications, but they are clearly not very suitable for actual drives,
as they assume these drives have a fast area near the start of the disk, which
isn't the case). But these disks are not uncommon (seagate is shipping by
the millions), and will stay with us for quite a while.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/git f2fs
  2015-09-26  7:52             ` Jaegeuk Kim
@ 2015-09-26 13:59               ` Marc Lehmann
  2015-09-28 17:59                 ` Jaegeuk Kim
  0 siblings, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-09-26 13:59 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Sat, Sep 26, 2015 at 12:52:53AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > Just for fun I'll start doing a -s64 run.

(which had the same result).

> Okay, so before finding bad commits, if possible, can you get block traces?

If you can teach me how to, sure!

In the meantime, maybe what happened is that f2fs leaves some gaps (e.g.
for alignment) when writing now, and didn't in 3.18? That would somehow
explain what I see.

Except that with the 3.18 code, there is virtually no read access while
writing to the disk for hours. With f2fs git, there is a small 16kb read
every half a minute or so, and after a while, a lot of "reading a few
mb/s, while writing a few mb/s".

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/git f2fs
  2015-09-26 13:59               ` Marc Lehmann
@ 2015-09-28 17:59                 ` Jaegeuk Kim
  2015-09-29 11:02                   ` Marc Lehmann
  0 siblings, 1 reply; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-28 17:59 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Sat, Sep 26, 2015 at 03:59:57PM +0200, Marc Lehmann wrote:
> On Sat, Sep 26, 2015 at 12:52:53AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > > Just for fun I'll start doing a -s64 run.
> 
> (which had the same result).
> 
> > Okay, so before finding bad commits, if possible, can you get block traces?
> 
> If you can teach me how to, sure!
> 
> In the meantime, maybe what happened is that f2fs leaves some gaps (e.g.
> for alignment) when writing now, and didn't in 3.18? That would somehow
> explain what I see.
> 
> Except that with the 3.18 code, there is virtually no read access while
> writing to the disk for hours. With f2fs git, there is a small 16kb read
> every half a minute or so, and after a while, a lot of "reading a few
> mb/s, while writing a few mb/s".

In order to verify this also, could you retrieve the following logs?

# echo 1 > /sys/kernel/debug/tracing/tracing_on
# echo 1 > /sys/kernel/debug/tracing/events/f2fs/f2fs_submit_read_bio/enable
# echo 1 > /sys/kernel/debug/tracing/events/f2fs/f2fs_submit_write_bio/enable
# cat /sys/kernel/debug/tracing/trace_pipe

Thanks,

> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-26 13:53                               ` Marc Lehmann
@ 2015-09-28 18:33                                 ` Jaegeuk Kim
  2015-09-29  7:36                                   ` Marc Lehmann
  0 siblings, 1 reply; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-28 18:33 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Sat, Sep 26, 2015 at 03:53:53PM +0200, Marc Lehmann wrote:
> On Sat, Sep 26, 2015 at 12:36:55AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > > Care to share why? :)
> > 
> > Mostly, in the flash storages, it is multiple 2MB normally. :)
> 
> Well, any value of -s gives me a multiple of 2MB, no? :)
> 
> > > Is there anything especially good about powers of two? Or do you just want to
> > > reduce the number of changed variables?
> > 
> > IMO, likewise flash storages, it needs to investigate the raw device
> > characteristics.
> 
> Keep in mind that I don't use it for flash, but smr drives.
> 
> We already know the raw device characteristics, basically, the zones are
> between 15 and 40 or so MB in size (on the seagate 8tb drive), and they
> likely don't have "even" sizes at all.
> 
> It's also by far not easy to benchmark these things, the disks can
> buffer up to 25GB of random writes (and then might need several hours of
> cleanup). Failing a linear write incurs a 0.6-1.6s penalty, to be paid
> much later. It's a shame that none of the drive companies actually release
> any usable info on their drives.
> 
> These guys made a hole into the disk and devised a lot of benchmarks to
> find out the characteristics of these drives.
> 
> https://www.usenix.org/system/files/conference/fast15/fast15-paper-aghayev.pdf
> 
> So, the strategy for a fs would be to write linearly, most of the time,
> without any gaps. f2fs (at least in 3.18.x) manages to do that very
> nicely, which is why I really try to get it working.
> 
> But for writing once, any value of -s would probably suffice. There are
> two problems when the disk gets full:
> 
> a) IPU (in-place update) writes, which the drive can't really do, so GC
> might be cheaper.
> b) reuse of sections - if sections are reasonably large, then when one gets
> freed and reused, it is large enough to guarantee large linear writes again.
> 
> b) is the reason behind me trying large values of -s.

Hmm. It seems that SMR has 20~25GB cache to absorb random writes with a big
block map. Then, it uses a static allocation, which is a kind of very early
stage of FTL shapes though.
Comparing to flash, it seems that SMR degrades the performance significantly
due to internal cleaning overhead, so I could understand that it needs to
control IO patterns very carefully.

So, how about testing -s20, which seems reasonable to me?

+ direct IO can break the alignment too.

> Since I know that f2fs is the only fs that I tested that can have a sustained
> write performance on these drives that is near the physical drive
> characteristics, all that needs to be done is to see how f2fs performs after
> it starts gc'ing.
> 
> That's why I am so interested in disk full conditions - writing the disk
> linearly once is easy, I can just write a tar to the device. Ensuring that
> writes are large linear after deleting and cleaning up is harder.
> 
> nilfs is a good example - it should fit smr drives perfectly, until they
> are nearly full, after which nilfs still matches smr drives perfectly,
> but waiting for 8TB to be shuffled around to delete some files can take
> days.  More surprising is that nilfs phenomenally fails with these drives,
> performance-wise, for reasons I haven't investigated (my guess is that
> nilfs leaves gaps).
> 
> > I think this can be used for SMR too.
> 
> You can run any blockdevice operation on these drives, but the results
> from flashbench will be close to meaningless for them. For example, you
> can't distinguish between a nonaligned write causing a read-modify-write
> from an aligned large write, or a partial write, by access time, as they
> will probably all have similar access times.
> 
> > I think there might be some hints for section size at first and performance
> > variation as well.
> 
> I think you confuse these drives with flash drives - while they share some
> characteristics, they are completely unlike flash. There is no translation
> layer, there is no need for wear leveling, zones have widely varying
> sizes, appending can be expensive or cheap, depending on the write size.
> 
> What these drives need is primarily large linear writes without gaps, and
> secondarily any optimisations for rotational media apply. (And for that, f2fs
> performs unexpectedly well, given it wasn't meant for rotational media).
> 
> Now, if f2fs can be made to (mostly) work bug-free, but with the
> characteristics of 3.18.21, and the gc can ensure that reasonably big
> areas spanning multiple zones will be reused, then f2fs will be the _ONLY_ fs
> able to take care of drive managed smr disks efficiently.

Hmm. The f2fs has been deployed on smartphones for a couple of years so far.
The main stuffs here would be about tuning it with SMR drives.
It's the time for me to take a look at pretty big partitions. :)

Oh, anyway, have you tried just -s1 for fun?

Thanks,

> 
> Specifically, these filesystems do NOT work well with these drives:
> 
> nilfs, zfs, btrfs, ext4, xfs
> 
> And modifications for these filesystems are either far away in the
> future, or not targetted at drive managed disks (ext4 already has some
> modifications, but they are clearly not very suitable for actual drives,
> assuming these drives have a fast area near the start of the disk, which
> isn't the case). But these disks are not uncommon (seagate is shipping by
> the millions), and will stay with us for quite a while.
> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning
  2015-09-28 18:33                                 ` Jaegeuk Kim
@ 2015-09-29  7:36                                   ` Marc Lehmann
  0 siblings, 0 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-09-29  7:36 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Mon, Sep 28, 2015 at 11:33:52AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> Hmm. It seems that SMR has 20~25GB cache to absorb random writes with a big
> block map. Then, it uses a static allocation, which is a kind of very early
> stage of FTL shapes though.

Yes, very sucky. For my previous tests, though, the cache is essentially
irrelevant, and only makes it harder to diagnose problems (it is very
helpful under light load, though).

> Comparing to flash, it seems that SMR degrades the performance significantly
> due to internal cleaning overhead, so I could understand that it needs to
> control IO patterns very carefully.

Yes, basically every write that ends (in time) before the zone boundary
requires RMW. Even writes that cross the zone boundary might require RMW as
the disk can probably only overwrite the zone partially once before having to
rewrite it fully again.

Since basically every write ends within a zone, the only way to keep
performance is to have very large sequential writes crossing multiple
zones, in multiple chunks, quick enough so the disk doesn't consider the
write as finished. Large meaning 100MB+.

> So, how about testing -s20, which seems reasonable to me?

I can test with -s20, but I fail to see why that is reasonable: -s20 means
40MB, which isn't even as large as a single large zone, so it spells disaster
in my book, basically causing a RMW cycle for every single section.
(Hopefully I just don't understand f2fs well enough).

In any case, if -s20 is reasonable, then I would assume -s1 would also be
reasonable, as both cause sections to be no larger than a zone.

> > characteristics of 3.18.21, and the gc can ensure that reasonably big
> > areas spanning multiple zones will be reused, then f2fs will be the _ONLY_ fs
> > able to take care of drive managed smr disks efficiently.
> 
> Hmm. The f2fs has been deployed on smartphones for a couple of years so far.
> The main stuffs here would be about tuning it with SMR drives.

Well, I don't want to sound too negative, and honestly, now that I gathered
more experience with f2fs I do start to consider it for a lot more than
originally anticipated (I will try to replace ext4 with it for a database
partition on an ssd, and I do think f2fs might be a possible replacement for
traditional fs's on rotational media as well).

However, it's clearly far from stable - the amount of data corruption I got
with documented options was enormous, and the bug that causes sync to hang
and freeze the fs in 3.18.21 is a serious show-stopper.

You would expect that it doesn't work fine, out of the box, with SMR
drives, but the reality is that all my early tests showed that f2fs works
fine (compared to other filesystems even stellar!) on SMR drives, but
isn't stable itself, independent of the drive technology. Only the later
kernels fail to perform with SMR drives, and that might or might not be
fixable.

> It's the time for me to take a look at pretty big partitions. :)

I also have no issues when large partitions pose a problem for f2fs - I
am confident that this can be fixed. Can't wait to use it for some 40TB
partitions and see how it performs in practise :)

In fact, I think f2fs + dmcache (with google modifications) + traditional
rotational drives might deliver absolutely superior performance to XFS,
which is my current workhorse for such partitions.

(One would hope fsck times could be improved for this, though they are not
particularly bad at this time).

> Oh, anyway, have you tried just -s1 for fun?

Will also try and see how it performs with the first hundred GB or so.

Then I will get the traces.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/git f2fs
  2015-09-28 17:59                 ` Jaegeuk Kim
@ 2015-09-29 11:02                   ` Marc Lehmann
  2015-09-29 23:13                     ` Jaegeuk Kim
  0 siblings, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-09-29 11:02 UTC (permalink / raw)
  To: linux-f2fs-devel

On Mon, Sep 28, 2015 at 10:59:44AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> In order to verify this also, could you retrieve the following logs?

First thing, the allocation-failure-on-mount is still in the backported 3.18
f2fs module. If it's supposed to be gone in that version, it's not working:

http://ue.tst.eu/a1bc4796012bd7191ab2ada566d4cd22.txt

And here are traces and descriptions. The traces all start directly after
mount, my test script is http://data.plan9.de/f2fstest

(event tracing is cool btw., thanks for showing me :)

################ -s1, f2fs git ##############################################

   /opt/f2fs-tools/sbin/mkfs.f2fs -lTEST -s1 -t0 -a0 /dev/vg_test/test
   mount -t f2fs -onoatime,flush_merge,no_heap /dev/vg_test/test /mnt

For the first ~120GB, performance was solid (100MB/s+), but much worse than
stock 3.18.21 (with -s64!).

3.18.21 regularly reached >190MB/s (at least near the beginning of the
disk), then was idle in between writes, as the source wasn't fast enough to
keep up. With the backport, tar was almost never idle, and if it was, then
not for long, so it could just keep up. (Just keeping up with the
read speed of a 6-disk raid is very good, but I know f2fs can do much
better :)

At the 122GB mark, it started to slow down, being consistently <100MB/s

At 127GB, it was <<20MB/s, and I stopped.

Most of the time, the test was write-I/O-bound.

http://data.plan9.de/f2fs.s1.trace.xz

################ -s64, f2fs 3.18.21 #########################################

As contrast I then did a test with the original f2fs module, and -s64.
Throughput was up to 202MB/s, almost continuously. At the 100GB mark, it
slowed down to maybe 170MB/s peak, which might well be the speed of the
platters.

I stopped at 217GB.

I have a 12GB mbuffer between the read-tar and the write-tar, configured to
write minimum bursts of ~120MB. At no time was the buffer filled at >2%,
while with the -s1, f2fs git case, it was basically always >2%.

The trace includes a few minutes after tar was stopped.

http://data.plan9.de/f2fs.s64.3.18.trace.xz

################ -s64, f2fs git #############################################

The direct equivalent of the previous test, but with f2fs git.

Almost from the very beginning, it was often write-bound, but could still
keep up.

At around 70GB, it mostly stopped being able to keep up, and the read
tar overtook the write tar. At 139GB, performance degraded to <2MB/s. I
stopped at 147GB.

So mostly, behaviour was the same as with -s1, except it took longer to
slow down.

http://data.plan9.de/f2fs.s64.trace.xz

################ -s20, f2fs git #############################################

By special request, here is the test with -s20.

Surprisingly, this stopped being able to cope at the 40GB mark, but I didn't
wait very long after the previous test, maybe that influenced it. I stopped
at 63GB.

http://data.plan9.de/f2fs.s20.trace.xz

#############################################################################

I hope to find time to look at these traces myself later this day.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/git f2fs
  2015-09-29 11:02                   ` Marc Lehmann
@ 2015-09-29 23:13                     ` Jaegeuk Kim
  2015-09-30  9:02                       ` Chao Yu
  2015-10-01 12:11                       ` Marc Lehmann
  0 siblings, 2 replies; 74+ messages in thread
From: Jaegeuk Kim @ 2015-09-29 23:13 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Tue, Sep 29, 2015 at 01:02:04PM +0200, Marc Lehmann wrote:
> On Mon, Sep 28, 2015 at 10:59:44AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > In order to verify this also, could you retrieve the following logs?
> 
> First thing, the allocation-failure-on-mount is still in the backported 3.18
> f2fs module. If it's supposed to be gone in that version, it's not working:
> 
> http://ue.tst.eu/a1bc4796012bd7191ab2ada566d4cd22.txt

Oops, it seems I missed some other allocations too.
Could you pull the v3.18 again? I changed the patch.

> 
> And here are traces and descriptions. The traces all start directly after
> mount, my test script is http://data.plan9.de/f2fstest
> 
> (event tracing is cool btw., thanks for showing me :)

Thank you for the below traces.
I have two suspicious things from the traces:

1. several inline_data conversions of existing files
Could you test -o noinline_data and share its trace too?

2. running out of free node ids along with its shrinker
The latest f2fs registers a shrinker to release cached free ids which will
be used for inode numbers when creating files. If there is no id, it needs
to read some meta blocks to fill up the empty lists.

I think its threshold is not tuned well enough for this use case.
Let me think about this in more detail.
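
(To be concrete, something like the sketch below is what I have in mind;
the mount options besides noinline_data and the trace output path are just
taken or assumed from your earlier -s1 run:)

   mount -t f2fs -onoatime,flush_merge,no_heap,noinline_data \
         /dev/vg_test/test /mnt
   echo 1 > /sys/kernel/debug/tracing/events/f2fs/enable
   cat /sys/kernel/debug/tracing/trace_pipe > /tmp/f2fs.noinline.trace &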

Thanks,

> 
> ################ -s1, f2fs git ##############################################
> 
>    /opt/f2fs-tools/sbin/mkfs.f2fs -lTEST -s1 -t0 -a0 /dev/vg_test/test
>    mount -t f2fs -onoatime,flush_merge,no_heap /dev/vg_test/test /mnt
> 
> For the first ~120GB, performance was solid (100MB/s+), but much worse than
> stock 3.18.21 (with -s64!).
> 
> 3.18.21 regularly reached >190MB/s (at least near the beginning of the
> disk) and was then idle in between writes, as the source wasn't fast enough
> to keep up. With the backport, tar was almost never idle, and if it was,
> then not for long, so it could just barely keep up. (Just keeping up with
> the read speed of a 6-disk RAID is very good, but I know f2fs can do much
> better :)
> 
> At the 122GB mark, it started to slow down, being consistently <100MB/s
> 
> At 127GB, it was <<20MB/s, and I stopped.
> 
> Most of the time, the test was write-I/O-bound.
> 
> http://data.plan9.de/f2fs.s1.trace.xz
> 
> ################ -s64, f2fs 3.18.21 #########################################
> 
> As a contrast, I then did a test with the original f2fs module and -s64.
> Throughput was up to 202MB/s, almost continuously. At the 100GB mark, it
> slowed down to maybe 170MB/s peak, which might well be the speed of the
> platters.
> 
> I stopped at 217GB.
> 
> I have a 12GB mbuffer between the read-tar and the write-tar, configured to
> write minimum bursts of ~120MB. At no time was the buffer filled at >2%,
> while with the -s1, f2fs git case, it was basically always >2%.
> 
> The trace includes a few minutes after tar was stopped.
> 
> http://data.plan9.de/f2fs.s64.3.18.trace.xz
> 
> ################ -s64, f2fs git #############################################
> 
> The direct equivalent of the previous test, but with f2fs git.
> 
> Almost from the very beginning, it was often write-bound, but could still
> keep up.
> 
> At around 70GB, it mostly stopped being able to keep up, and the read
> tar overtook the write tar. At 139GB, performance degraded to <2MB/s. I
> stopped at 147GB.
> 
> So mostly, behaviour was the same as with -s1, except it took longer to
> slow down.
> 
> http://data.plan9.de/f2fs.s64.trace.xz
> 
> ################ -s20, f2fs git #############################################
> 
> By special request, here is the test with -s20.
> 
> Surprisingly, this stopped being able to cope at the 40GB mark, but I didn't
> wait very long after the previous test, maybe that influenced it. I stopped
> at 63GB.
> 
> http://data.plan9.de/f2fs.s20.trace.xz
> 
> #############################################################################
> 
> I hope to find time to look at these traces myself later this day.
> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/git f2fs
  2015-09-29 23:13                     ` Jaegeuk Kim
@ 2015-09-30  9:02                       ` Chao Yu
  2015-10-01 12:11                       ` Marc Lehmann
  1 sibling, 0 replies; 74+ messages in thread
From: Chao Yu @ 2015-09-30  9:02 UTC (permalink / raw)
  To: 'Jaegeuk Kim', 'Marc Lehmann'; +Cc: linux-f2fs-devel

> -----Original Message-----
> From: Jaegeuk Kim [mailto:jaegeuk@kernel.org]
> Sent: Wednesday, September 30, 2015 7:13 AM
> To: Marc Lehmann
> Cc: linux-f2fs-devel@lists.sourceforge.net
> Subject: Re: [f2fs-dev] write performance difference 3.18.21/git f2fs
> 
> On Tue, Sep 29, 2015 at 01:02:04PM +0200, Marc Lehmann wrote:
> > On Mon, Sep 28, 2015 at 10:59:44AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > > In order to verify this also, could you retrieve the following logs?
> >
> > First thing, the allocation-failure-on-mount is still in the backported 3.18
> > f2fs module. If it's supposed to be gone in that version, it's not working:
> >
> > http://ue.tst.eu/a1bc4796012bd7191ab2ada566d4cd22.txt
> 
> Oops, it seems I missed some other allocations too.
> Could you pull the v3.18 again? I changed the patch.
> 
> >
> > And here are traces and descriptions. The traces all start directly after
> > mount, my test script is http://data.plan9.de/f2fstest
> >
> > (event tracing is cool btw., thanks for showing me :)
> 
> Thank you for the below traces.
> I have two suspicious things from the traces where:
> 
> 1. several inline_data conversion of existing files

Good catch, :)

The evidence is that the following trace message appears repeatedly,
thousands of times, spread across the whole trace log, and it should only
be output by the inline conversion here:
"f2fs_submit_write_bio: dev = (252,9), WRITE_SYNC(P), DATA, sector = xxx, size = 4096"

Not only does it delay the tar by about 50 ms each time, since inline
conversion is completely synchronous, but the WRITE_SYNC requests also break
merging between IO requests, both in the f2fs bio cache and in the block
layer's io scheduler, resulting in the SMR drive falling into RMW-mode io.

So I guess in this scenario noinline_data may be a good choice.
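
(A quick way to check this against the posted trace - count those 4KB
WRITE_SYNC data writes; the file name is the one linked in the thread, and
the grep pattern is an assumption based on the message format above:)

   xzcat f2fs.s64.trace.xz | grep -c 'WRITE_SYNC(P), DATA.*size = 4096'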

> Could you test -o noinline_data and share its trace too?
> 
> 2. running out of free node ids along with its shrinker
> The latest f2fs registers a shrinker to release cached free ids which will
> be used for inode numbers when creating files. If there is no id, it needs
> to read some meta blocks to fill up the empty lists.

Right. IMO, compared with flash devices, rotational devices have worse
random read performance because of their expensive seek cost, so the flow
that builds the free nid cache should be optimized as well. What I'm
thinking is that maybe in SMR_MODE we can check whether our free nid count
is below a threshold value, and if so, readahead more (32?) NAT blocks with
low-priority READA instead of READ_SYNC; this would avoid excessive seeking
and the long latency caused by synchronous cache building in alloc_nid.

Thanks,

> 
> I think its threshold does not consider enough for this use cases.
> Let me think about this in more details.
> 
> Thanks,
> 
> >
> > ################ -s1, f2fs git ##############################################
> >
> >    /opt/f2fs-tools/sbin/mkfs.f2fs -lTEST -s1 -t0 -a0 /dev/vg_test/test
> >    mount -t f2fs -onoatime,flush_merge,no_heap /dev/vg_test/test /mnt
> >
> > For the first ~120GB, performance was solid (100MB/s+), but much worse than
> > stock 3.18.21 (with -s64!).
> >
> > 3.18.21 regularly reached >190MB/s (at least near the beginning of the
> > disk) and was then idle in between writes, as the source wasn't fast enough
> > to keep up. With the backport, tar was almost never idle, and if it was,
> > then not for long, so it could just barely keep up. (Just keeping up with
> > the read speed of a 6-disk RAID is very good, but I know f2fs can do much
> > better :)
> >
> > At the 122GB mark, it started to slow down, being consistently <100MB/s
> >
> > At 127GB, it was <<20MB/s, and I stopped.
> >
> > Most of the time, the test was write-I/O-bound.
> >
> > http://data.plan9.de/f2fs.s1.trace.xz
> >
> > ################ -s64, f2fs 3.18.21 #########################################
> >
> > As a contrast, I then did a test with the original f2fs module and -s64.
> > Throughput was up to 202MB/s, almost continuously. At the 100GB mark, it
> > slowed down to maybe 170MB/s peak, which might well be the speed of the
> > platters.
> >
> > I stopped at 217GB.
> >
> > I have a 12GB mbuffer between the read-tar and the write-tar, configured to
> > write minimum bursts of ~120MB. At no time was the buffer filled at >2%,
> > while with the -s1, f2fs git case, it was basically always >2%.
> >
> > The trace includes a few minutes after tar was stopped.
> >
> > http://data.plan9.de/f2fs.s64.3.18.trace.xz
> >
> > ################ -s64, f2fs git #############################################
> >
> > The direct equivalent of the previous test, but with f2fs git.
> >
> > Almost from the very beginning, it was often write-bound, but could still
> > keep up.
> >
> > At around 70GB, it mostly stopped being able to keep up, and the read
> > tar overtook the write tar. At 139GB, performance degraded to <2MB/s. I
> > stopped at 147GB.
> >
> > So mostly, behaviour was the same as with -s1, except it took longer to
> > slow down.
> >
> > http://data.plan9.de/f2fs.s64.trace.xz
> >
> > ################ -s20, f2fs git #############################################
> >
> > By special request, here is the test with -s20.
> >
> > Surprisingly, this stopped being able to cope at the 40GB mark, but I didn't
> > wait very long after the previous test, maybe that influenced it. I stopped
> > at 63GB.
> >
> > http://data.plan9.de/f2fs.s20.trace.xz
> >
> > #############################################################################
> >
> > I hope to find time to look at these traces myself later this day.
> >
> > --
> >                 The choice of a       Deliantra, the free code+content MORPG
> >       -----==-     _GNU_              http://www.deliantra.net
> >       ----==-- _       generation
> >       ---==---(_)__  __ ____  __      Marc Lehmann
> >       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
> >       -=====/_/_//_/\_,_/ /_/\_\
> >
> > ------------------------------------------------------------------------------
> > _______________________________________________
> > Linux-f2fs-devel mailing list
> > Linux-f2fs-devel@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/git f2fs
  2015-09-29 23:13                     ` Jaegeuk Kim
  2015-09-30  9:02                       ` Chao Yu
@ 2015-10-01 12:11                       ` Marc Lehmann
  2015-10-01 18:51                         ` Marc Lehmann
  1 sibling, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-10-01 12:11 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Tue, Sep 29, 2015 at 04:13:10PM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > First thing, the allocation-failure-on-mount is still in the backported 3.18
> > f2fs module. If it's supposed to be gone in that version, it's not working:
> > 
> > http://ue.tst.eu/a1bc4796012bd7191ab2ada566d4cd22.txt
> 
> Oops, it seems I missed some other allocations too.
> Could you pull the v3.18 again? I changed the patch.

Hmm, I don't see any relevant change in the log, but I will use the newest
version.

> > (event tracing is cool btw., thanks for showing me :)
> 
> Thank you for the below traces.
> I have two suspicious things from the traces where:
> 
> 1. several inline_data conversion of existing files
> Could you test -o noinline_data and share its trace too?

WOW, THAT HELPED A LOT. While the peak throughput seems quite a bit lower
than stock 3.18 (seems, because this might be caused by less peaky, smoother
writes - I am only looking at it with my eyes), it could keep up with the
reading side easily, and performance hasn't degraded so far. The test
continues, but here is the trace for the first 368GB:

http://data.plan9.de/f2fs.s64.noinline.trace.xz

I had a cursory look at the previous traces, and didn't see any obvious
problem (if anything, git f2fs seems to leave fewer gaps). However, both
versions do leave quite a few gaps.

I wanted to look at reordering, which might be a problem, too, but didn't
get to that.

Obviously, inline_data was the culprit for the especially bad performance.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/git f2fs
  2015-10-01 12:11                       ` Marc Lehmann
@ 2015-10-01 18:51                         ` Marc Lehmann
  2015-10-02  8:53                           ` 100% system time hang with git f2fs Marc Lehmann
  2015-10-02 16:46                           ` write performance difference 3.18.21/git f2fs Jaegeuk Kim
  0 siblings, 2 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-10-01 18:51 UTC (permalink / raw)
  To: linux-f2fs-devel

On Thu, Oct 01, 2015 at 02:11:20PM +0200, Marc Lehmann <schmorp@schmorp.de> wrote:
> WOW, THAT HELPED A LOT. While the peak throughput seems quite a bit lower

Ok, for completeness, here is the full log and a description of what was
going on.

http://data.plan9.de/f2fs.s64.noinline.full.trace.xz

status at the end + some idle time
http://ue.tst.eu/d16cf98c72fe9ecbac178ded47a21396.txt

It was faster than the reader till roughly the 1.2TB mark, after which it
went through longish episodes of being <<50MB/s (for example, around
481842.363964), and also periods of ~20kb/s, due to many small WRITE_SYNCs
in a row (e.g. at 482329.101222 and 490189.681438,
http://ue.tst.eu/cc94978eafc736422437a4ab35862c12.txt). The small
WRITE_SYNCs did not always result in this behaviour by the disk, though.

After that, it was generally write-I/O bound.

Also, the gc seemed to have kicked in at around that time, which is kind
of counterproductive. I increased the gc_* values in /sys, but don't know
if that had any effect.

Most importantly, f2fs always recovered and had periods of much faster
writes (>= 120MB/s), so it's not the case that f2fs somehow saturates the
internal cache and then becomes slow forever.

Overall, the throughput was 83MB/s, which is 20% worse than stock 3.18, but
still way beyond what any other filesystem could do.

Also, writing 1TB in a single session, with somewhat reduced speed
afterwards, would be enough for my purposes, i.e. I can live with that
(still, gigabit speeds would be nice of course, as that is the data rate I
often deal with).

Notwithstanding any other improvements you might implement, f2fs has now
officially become my choice for SMR drives; the only remaining thing needed
is to convince me of its stability - it seems getting a kernel with truly
stable f2fs is still a bit of a game of chance, but I guess confidence will
come with more tests and actually using it in production, which I will do
soon.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: 100% system time hang with git f2fs
  2015-10-01 18:51                         ` Marc Lehmann
@ 2015-10-02  8:53                           ` Marc Lehmann
  2015-10-02 16:51                             ` Jaegeuk Kim
  2015-10-02 16:46                           ` write performance difference 3.18.21/git f2fs Jaegeuk Kim
  1 sibling, 1 reply; 74+ messages in thread
From: Marc Lehmann @ 2015-10-02  8:53 UTC (permalink / raw)
  To: linux-f2fs-devel

On Thu, Oct 01, 2015 at 08:51:24PM +0200, Marc Lehmann <schmorp@schmorp.de> wrote:
> Ok, for completeness, here is the full log and a description of what was
> going on.

Ok, so I did a fsck, which took one hour; that is not very good, but I
don't run fsck very often. It didn't find any problems (everything OK).

However, I have a freeze. When I mount the volume, start a du on it, and after
a while do:

   echo 3 >/proc/sys/vm/drop_caches

Then this process hangs with 100% sys time. /proc/../stack gives no usable
backtrace.
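
(Condensed, the sequence that triggers it is roughly this sketch; mount
options as in my earlier tests, and the sleep just stands in for "after a
while":)

   mount -t f2fs -onoatime,flush_merge,no_heap /dev/vg_test/test /mnt
   du -sh /mnt &
   sleep 600
   echo 3 >/proc/sys/vm/drop_caches   # this is the step that hangs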

umount on the f2fs volume also hangs:

   [<ffffffff8139ba03>] call_rwsem_down_write_failed+0x13/0x20
   [<ffffffff8118086d>] unregister_shrinker+0x1d/0x70
   [<ffffffff811e7911>] deactivate_locked_super+0x41/0x60
   [<ffffffff811e7eee>] deactivate_super+0x4e/0x70
   [<ffffffff81204733>] cleanup_mnt+0x43/0x90
   [<ffffffff812047d2>] __cleanup_mnt+0x12/0x20
   [<ffffffff8108e8a4>] task_work_run+0xc4/0xe0
   [<ffffffff81012fa7>] do_notify_resume+0x97/0xb0
   [<ffffffff8178896f>] int_signal+0x12/0x17
   [<ffffffffffffffff>] 0xffffffffffffffff

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: write performance difference 3.18.21/git f2fs
  2015-10-01 18:51                         ` Marc Lehmann
  2015-10-02  8:53                           ` 100% system time hang with git f2fs Marc Lehmann
@ 2015-10-02 16:46                           ` Jaegeuk Kim
  2015-10-04  9:40                             ` near disk full performance (full 8TB) Marc Lehmann
  1 sibling, 1 reply; 74+ messages in thread
From: Jaegeuk Kim @ 2015-10-02 16:46 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Thu, Oct 01, 2015 at 08:51:24PM +0200, Marc Lehmann wrote:
> On Thu, Oct 01, 2015 at 02:11:20PM +0200, Marc Lehmann <schmorp@schmorp.de> wrote:
> > WOW, THAT HELPED A LOT. While the peak throughput seems quite a bit lower
> 
> Ok, for completeness, here is the full log and a description of what was
> going on.
> 
> http://data.plan9.de/f2fs.s64.noinline.full.trace.xz

Now I can see much cleaner patterns.

> status at the end + some idle time
> http://ue.tst.eu/d16cf98c72fe9ecbac178ded47a21396.txt
> 
> It was faster than the reader till roughtly the 1.2TB mark, after
> which it acquired longish episodes of being <<50MB/s (for example,
> around 481842.363964), and also periods of ~20kb/s, due to many small
> WRITE_SYNC's in a row (e.g. at 482329.101222 and 490189.681438,
> http://ue.tst.eu/cc94978eafc736422437a4ab35862c12.txt). The small
> WRITE_SYNCs did not always result in this behaviour by the disk, though.

Hmm, this is because of FS metadata flushes in the background.
I pushed one patch; can you get it through the v3.18 branch?

> After that, it was generally write-I/O bound.
> 
> Also, the gc seemed to have kicked in at around that time, which is kind
> of counterproductive. I increased the gc_* values in /sys, but don't know
> if that had any effect.
> 
> Most importantly, f2fs always recovered and had periods of much faster
> writes (>= 120MB/s), so it's not the case that f2fs somehow saturates the
> internal cache and then becomes slow forever.
> 
> Overall, the throughput was 83MB/s, which is 20% worse than stock 3.18, but
> still way beyond what any other filesystem could do.

Cool.

> Also, writing 1TB in a single session, with somewhat reduced speed
> afterwards, would be enough for my purposes, i.e. I can live with that
> (still, gigabit speeds would be nice of course, as that is the data rate I
> often deal with).
> 
> Notwithstanding any other improvements you might implement, f2fs has now
> officially become my choice for SMR drives, the only remaining thing
> needed is to convince me of its stability - it seems getting a kernel
> with truly stable f2fs is a bit of a game of chance still, but I guess
> confidence will come with more tests and actualy using it in production,
> which I will do soon.
> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: 100% system time hang with git f2fs
  2015-10-02  8:53                           ` 100% system time hang with git f2fs Marc Lehmann
@ 2015-10-02 16:51                             ` Jaegeuk Kim
  2015-10-03  6:29                               ` Marc Lehmann
  0 siblings, 1 reply; 74+ messages in thread
From: Jaegeuk Kim @ 2015-10-02 16:51 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-f2fs-devel

On Fri, Oct 02, 2015 at 10:53:40AM +0200, Marc Lehmann wrote:
> On Thu, Oct 01, 2015 at 08:51:24PM +0200, Marc Lehmann <schmorp@schmorp.de> wrote:
> > Ok, for completeness, here is the full log and a description of what was
> > going on.
> 
> Ok, so I did a fsck, which took one hour, which is not very good, but I
> don't use fsck very often. It didn't find any problems (everything Ok).
> 
> However, I have a freeze. When I mount the volume, start a du on it, and after
> a while do:

What was your scenario?
Did you delete anything on the device beforehand, or just mount and umount?

>    echo 3 >/proc/sys/vm/drop_caches
> 
> Then this process hangs with 100% sys time. /proc/../stack gives no usable
> backtrace.

This should be the shrinker stuck on a mutex. I suspect a deadlock.
Can you do the following if you hit this again?

# echo l > /proc/sysrq-trigger
# echo w > /proc/sysrq-trigger
# dmesg

Thanks,

> 
> umount on the f2fs volume also hangs:
> 
>    [<ffffffff8139ba03>] call_rwsem_down_write_failed+0x13/0x20
>    [<ffffffff8118086d>] unregister_shrinker+0x1d/0x70
>    [<ffffffff811e7911>] deactivate_locked_super+0x41/0x60
>    [<ffffffff811e7eee>] deactivate_super+0x4e/0x70
>    [<ffffffff81204733>] cleanup_mnt+0x43/0x90
>    [<ffffffff812047d2>] __cleanup_mnt+0x12/0x20
>    [<ffffffff8108e8a4>] task_work_run+0xc4/0xe0
>    [<ffffffff81012fa7>] do_notify_resume+0x97/0xb0
>    [<ffffffff8178896f>] int_signal+0x12/0x17
>    [<ffffffffffffffff>] 0xffffffffffffffff
> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: 100% system time hang with git f2fs
  2015-10-02 16:51                             ` Jaegeuk Kim
@ 2015-10-03  6:29                               ` Marc Lehmann
  0 siblings, 0 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-10-03  6:29 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Fri, Oct 02, 2015 at 09:51:24AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> > However, I have a freeze. When I mount the volume, start a du on it, and after
> > a while do:
> 
> How was your scenario?
> Did you delete device before, or just plain mount and umount?

After writing >2TB to it, I let it idle/gc for a while, then unmounted. Then
I did a fsck and mount, and let it idle for the night. The next day, I did a
du for a speed test, realised I hadn't dropped the caches, interrupted the
du, and then tried drop_caches, with the result I described.

> This should be shrinker is stuck on mutex. I suspect a deadlock.
> Can you do this, if you meet this again?

Will do. Btw, I already mentioned that /proc/*/stack didn't give a useful
backtrace. Since the process was apparently still running, I thought I would
run cat /.../stack in a loop, and indeed, after a minute, I got one
backtrace out of it (for the process doing the echo):

http://ue.tst.eu/2562aec1c51d4bad3d7b8b83380956a7.txt

Maybe that is already useful? It indeed mentions the shrinker.

(I did this after my mail, so didn't include it).
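
(The loop was essentially the following; <pid> is a placeholder for the pid
of the process doing the echo:)

   while sleep 1; do cat /proc/<pid>/stack; done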

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: near disk full performance (full 8TB)
  2015-10-02 16:46                           ` write performance difference 3.18.21/git f2fs Jaegeuk Kim
@ 2015-10-04  9:40                             ` Marc Lehmann
  0 siblings, 0 replies; 74+ messages in thread
From: Marc Lehmann @ 2015-10-04  9:40 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On Fri, Oct 02, 2015 at 09:46:45AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> Hmm, this is because of FS metadata flushes in background.
> I pushed one patch, can you get it through v3.18 branch?

I continued to write the same ~2TB data set to the disk, with the same
kernel module, giving me 85MB/s and 66MB/s throughput, respectively. This
was due to extended periods where write performance was around 30-40MB/s,
followed by shorter periods where it was >100MB/s. After using
"disable_ext_identify", this *seemed* to improve somewhat. I did this on
the theory that the zones near the end of the device (assuming f2fs would
roughly fill from the beginning) get larger, causing more pressure on
the disk (which has only 128MB of RAM to combine writes), but the result
wasn't conclusive either way.

For the fourth set, which wouldn't fully fit, I chose PDFs (larger average
file size), and used the new kernel, either of which might have helped.

I configured f2fs like this:

   echo 16 >ipu_policy
   echo 100 >min_ipu_util
   echo 100000 >reclaim_segments
   echo 1 >gc_idle
   echo   500 >gc_min_sleep_time
   echo 90000 >gc_max_sleep_time
   echo 30000 >gc_no_gc_sleep_time
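
(For anyone following along: these knobs live under /sys/fs/f2fs/<device>/,
and the gc_*_sleep_time values are in milliseconds; the other annotations
below are my reading of the f2fs documentation and should be treated as
assumptions, not something verified against this kernel:)

   echo 16     >ipu_policy           # in-place-update policy selector
   echo 100    >min_ipu_util         # utilization threshold (%) for IPU
   echo 100000 >reclaim_segments     # prefree-segment reclaim threshold
   echo 1      >gc_idle              # idle GC victim policy (1 = cost-benefit)
   echo 500    >gc_min_sleep_time    # background GC minimum sleep, in ms
   echo 90000  >gc_max_sleep_time    # background GC maximum sleep, in ms
   echo 30000  >gc_no_gc_sleep_time  # sleep when there is nothing to GC, in ms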

Performance was ok'ish (as during the whole test) till about 200GB were
left out of 8.1 (metric) TB.

I started to make a trace around the 197GB mark:

http://data.plan9.de/f2fs.near_full.xz

status at beginning: http://ue.tst.eu/a4fc2a2522f3e372c7e92255cad1f3c3.txt
rsync was writing at this point, and I think you can see GC activity.

At 173851.953639, I ^S'ed rsync, which, due to -v, would cause it to pause
after a file. There was regular (probably GC) activity, but most of the
time the disk was again idle, something I simply wouldn't expect from the
GC config (see next mail).

status after pausing rsync: http://ue.tst.eu/26c170d7d9f946d60926a5cdca814bbe.txt

I unpaused rsync at 174004.438302
status before unpausing rsync: http://ue.tst.eu/cc22fafa0efcb1cadae5a3849dff873b.txt

At 174186.324000, speed went down to ~2MB/s, and looking at the traces, it
seems f2fs is writing random 2MB segments, which would explain the speed.

I stopped at this point and started to prepare this mail. I could see
constant but very low activity afterwards, roughly every 30s.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2015-10-04  9:40 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-23 21:58 sync/umount hang on 3.18.21, 1.4TB gone after crash Marc Lehmann
2015-09-23 23:11 ` write performance difference 3.18.21/4.2.1 Marc Lehmann
2015-09-24 18:28   ` Jaegeuk Kim
2015-09-24 23:20     ` Marc Lehmann
2015-09-24 23:27       ` Marc Lehmann
2015-09-25  6:50     ` Marc Lehmann
2015-09-25  9:47       ` Chao Yu
2015-09-25 18:20         ` Jaegeuk Kim
2015-09-26  3:22         ` Marc Lehmann
2015-09-26  5:25           ` write performance difference 3.18.21/git f2fs Marc Lehmann
2015-09-26  5:57             ` Marc Lehmann
2015-09-26  7:52             ` Jaegeuk Kim
2015-09-26 13:59               ` Marc Lehmann
2015-09-28 17:59                 ` Jaegeuk Kim
2015-09-29 11:02                   ` Marc Lehmann
2015-09-29 23:13                     ` Jaegeuk Kim
2015-09-30  9:02                       ` Chao Yu
2015-10-01 12:11                       ` Marc Lehmann
2015-10-01 18:51                         ` Marc Lehmann
2015-10-02  8:53                           ` 100% system time hang with git f2fs Marc Lehmann
2015-10-02 16:51                             ` Jaegeuk Kim
2015-10-03  6:29                               ` Marc Lehmann
2015-10-02 16:46                           ` write performance difference 3.18.21/git f2fs Jaegeuk Kim
2015-10-04  9:40                             ` near disk full performance (full 8TB) Marc Lehmann
2015-09-26  7:48           ` write performance difference 3.18.21/4.2.1 Jaegeuk Kim
2015-09-25 18:26       ` Jaegeuk Kim
2015-09-24 18:50 ` sync/umount hang on 3.18.21, 1.4TB gone after crash Jaegeuk Kim
2015-09-25  6:00   ` Marc Lehmann
2015-09-25  6:01     ` Marc Lehmann
2015-09-25 18:42     ` Jaegeuk Kim
2015-09-26  3:08       ` Marc Lehmann
2015-09-26  7:27         ` Jaegeuk Kim
2015-09-25  9:13   ` Chao Yu
2015-09-25 18:30     ` Jaegeuk Kim
  -- strict thread matches above, loose matches on Subject: below --
2015-08-08 20:50 general stability of f2fs? Marc Lehmann
2015-08-10 20:31 ` Jaegeuk Kim
2015-08-10 20:53   ` Marc Lehmann
2015-08-10 21:58     ` Jaegeuk Kim
2015-08-13  0:26       ` Marc Lehmann
2015-08-14 23:07         ` Jaegeuk Kim
2015-09-20 23:59   ` finally testing with SMR drives Marc Lehmann
2015-09-21  8:17     ` SMR drive test 1; 512GB partition; very slow + unfixable corruption Marc Lehmann
2015-09-21  8:19       ` Marc Lehmann
2015-09-21  9:58         ` SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning Marc Lehmann
2015-09-22 20:22           ` SMR drive test 3: full 8TB partition, mount problems, fsck error after delete Marc Lehmann
2015-09-22 23:08             ` Jaegeuk Kim
2015-09-23  3:50               ` Marc Lehmann
2015-09-23  1:12           ` SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning Jaegeuk Kim
2015-09-23  4:15             ` Marc Lehmann
2015-09-23  6:00               ` Marc Lehmann
2015-09-23  8:55                 ` Chao Yu
2015-09-23 23:30                   ` Marc Lehmann
2015-09-23 23:43                     ` Marc Lehmann
2015-09-24 17:21                       ` Jaegeuk Kim
2015-09-25  8:28                         ` Chao Yu
2015-09-25  8:05                     ` Chao Yu
2015-09-26  3:42                       ` Marc Lehmann
2015-09-23 22:08                 ` Jaegeuk Kim
2015-09-23 23:39                   ` Marc Lehmann
2015-09-24 17:27                     ` Jaegeuk Kim
2015-09-25  5:42                       ` Marc Lehmann
2015-09-25 17:45                         ` Jaegeuk Kim
2015-09-26  3:32                           ` Marc Lehmann
2015-09-26  7:36                             ` Jaegeuk Kim
2015-09-26 13:53                               ` Marc Lehmann
2015-09-28 18:33                                 ` Jaegeuk Kim
2015-09-29  7:36                                   ` Marc Lehmann
2015-09-23  6:06               ` Marc Lehmann
2015-09-23  9:10                 ` Chao Yu
2015-09-23 21:30                   ` Jaegeuk Kim
2015-09-23 23:11                   ` Marc Lehmann
2015-09-23 21:29               ` Jaegeuk Kim
2015-09-23 23:24                 ` Marc Lehmann
2015-09-24 17:51                   ` Jaegeuk Kim
