linux-bcachefs.vger.kernel.org archive mirror
* More eager discard behaviour
@ 2021-11-06 17:11 Chris Webb
  2021-11-06 19:37 ` Kent Overstreet
  0 siblings, 1 reply; 5+ messages in thread
From: Chris Webb @ 2021-11-06 17:11 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

Discards issued to a loopback device punch holes in the underlying files, so I
thought they'd be an easy way to check (and maybe ktest) filesystem discard
behaviour. Here, I make a 1GB filesystem then repeatedly create and delete a
400MB file in it:

  # truncate -s 1G /tmp/fs
  # losetup /dev/loop0 /tmp/fs
  # bcachefs format -q --discard /dev/loop0
  initializing new filesystem
  going read-write
  mounted with opts: (null)
  # mkdir -p /tmp/mnt
  # mount -t bcachefs -o discard /dev/loop0 /tmp/mnt
  # while true; do
  >   sync && sleep 1 && du -h /tmp/fs
  >   dd if=/dev/zero of=/tmp/mnt/file bs=1M count=400 status=none
  >   sync && sleep 1 && du -h /tmp/fs
  >   rm /tmp/mnt/file
  > done
  1.7M  /tmp/fs
  404M  /tmp/fs
  403M  /tmp/fs
  806M  /tmp/fs
  806M  /tmp/fs
  992M  /tmp/fs
  993M  /tmp/fs
  992M  /tmp/fs
  992M  /tmp/fs
  992M  /tmp/fs
  [...]

Although bcachefs does issue discards (double-checked with printk in
discard_one_bucket), it only does so when the allocator thread wakes to
reclaim buckets once the entire block device is in use, so the practical
behaviour is that the whole device is kept full to the brim despite the
filesystem never being over 40% capacity. (With count=50, you can get the
same effect with an fs that never goes over 5% capacity.)

(Happy to roll the above into a ktest if it's useful, e.g. that capacity
never goes above x% with repeated deletes?)

The equivalent test with ext4 shows discard doing the expected thing:

  # truncate -s 1G /tmp/fs
  # losetup /dev/loop0 /tmp/fs
  # mkfs.ext4 -q /dev/loop0
  # mkdir -p /tmp/mnt
  # mount -t ext4 -o discard /dev/loop0 /tmp/mnt
  # while true; do
  >   sync && sleep 1 && du -h /tmp/fs
  >   dd if=/dev/zero of=/tmp/mnt/file bs=1M count=400 status=none
  >   sync && sleep 1 && du -h /tmp/fs
  >   rm /tmp/mnt/file
  > done
  33M /tmp/fs
  433M  /tmp/fs
  33M /tmp/fs
  433M  /tmp/fs
  33M /tmp/fs
  433M  /tmp/fs
  33M /tmp/fs
  433M  /tmp/fs
  33M /tmp/fs
  [...]

SSDs are happier TRIMmed, but discard is also invaluable for filesystems on
thin provisioning systems like dm-thin. (virtio-block can pass discards up
from guest to host, so this is a common VM configuration.)

How practical would it be either to more-greedily wake the allocator thread
and reclaim buckets, or to detect buckets available to discard earlier in
their lifetime?

Best wishes,

Chris.

* Re: More eager discard behaviour
  2021-11-06 17:11 More eager discard behaviour Chris Webb
@ 2021-11-06 19:37 ` Kent Overstreet
  2021-11-06 21:36   ` Chris Webb
  0 siblings, 1 reply; 5+ messages in thread
From: Kent Overstreet @ 2021-11-06 19:37 UTC (permalink / raw)
  To: Chris Webb; +Cc: linux-bcachefs

On Sat, Nov 06, 2021 at 05:11:56PM +0000, Chris Webb wrote:
> SSDs are happier TRIMmed, but discard is also invaluable for filesystems on
> thin provisioning systems like dm-thin. (virtio-block can pass discards up
> from guest to host, so this is a common VM configuration.)
> 
> How practical would it be either to more-greedily wake the allocator thread
> and reclaim buckets, or to detect buckets available to discard earlier in
> their lifetime?

It's on the todo list, it'll be part of the grand allocator rework I have
planned - there's a lot that needs to be done.

We've got to rework the state transitions, and differentiate between buckets
with cached data, buckets that need a journal commit before we can write to them
again, buckets that need discarding, and buckets that are ready for use - and we
need to start storing all that persistently, as well as build some new data
structures to get rid of the scanning over every bucket we currently have to do
in various places.
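
Roughly, the states we'll want to distinguish are something like this (just a
sketch of the idea - the names are made up, not actual code):

enum bucket_state_sketch {
	BUCKET_cached,			/* holds only cached data, evictable */
	BUCKET_need_journal_commit,	/* awaiting a journal commit before reuse */
	BUCKET_need_discard,		/* empty, but still needs a discard */
	BUCKET_free,			/* ready to be allocated */
};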

Step one is going to be creating a persistent LRU. It'll be another btree,
sorted by the order in which we want to reuse buckets - a combination of the
amount of live data in the bucket and the time since it was last read.

If you're interested in tackling this, I can sketch it out for you :)

* Re: More eager discard behaviour
  2021-11-06 19:37 ` Kent Overstreet
@ 2021-11-06 21:36   ` Chris Webb
  2021-11-07 14:59     ` Kent Overstreet
  0 siblings, 1 reply; 5+ messages in thread
From: Chris Webb @ 2021-11-06 21:36 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

Kent Overstreet <kent.overstreet@gmail.com> writes:

> On Sat, Nov 06, 2021 at 05:11:56PM +0000, Chris Webb wrote:
> > How practical would it be either to more-greedily wake the allocator thread
> > and reclaim buckets, or to detect buckets available to discard earlier in
> > their lifetime?
> 
> It's on the todo list, it'll be part of the grand allocator rework I have
> planned - there's a lot that needs to be done.

I read your discussion in the status update and the related write-up on the
accounting code. It makes a lot of sense to me, given that the tried-and-tested
btree layer is there to be used.

> We've got to rework the state transitions, and differentiate between buckets
> with cached data, buckets that need a journal commit before we can write to them
> again, buckets that need discarding, and buckets that are ready for use - and we
> need to start storing all that persistently, as well as build some new data
> structures to get rid of the scanning over every bucket we currently have to do
> in various places.
> 
> Step one is going to be creating a persistent LRU. It'll be another btree,
> sorted by the order in which we want to reuse buckets - a combination of the
> amount of live data in the bucket and the time since it was last read.
> 
> If you're interested in tackling this, I can sketch it out for you :)

I'm definitely interested in helping out if I can, although there's a slight
risk of biting off more than I can chew: I probably understand even less
about the design and structure of your fs at the moment than you think!

I don't want to be a slow-moving obstacle on your critical path or more of a
nuisance than a help, but equally, getting stuck in is clearly the best way
to figure out how it fits together and become more useful. So happy to give
anything a go that you think is reasonable for a newcomer to take a crack
at. :) 

Best wishes,

Chris.

* Re: More eager discard behaviour
  2021-11-06 21:36   ` Chris Webb
@ 2021-11-07 14:59     ` Kent Overstreet
  2021-11-08 21:16       ` Chris Webb
  0 siblings, 1 reply; 5+ messages in thread
From: Kent Overstreet @ 2021-11-07 14:59 UTC (permalink / raw)
  To: Chris Webb; +Cc: linux-bcachefs

On Sat, Nov 06, 2021 at 09:36:50PM +0000, Chris Webb wrote:
> I'm definitely interested in helping out if I can, although there's a slight
> risk of biting off more than I can chew: I probably understand even less
> about the design and structure of your fs at the moment than you think!
> 
> I don't want to be a slow-moving obstacle on your critical path or more of a
> nuisance than a help, but equally, getting stuck in is clearly the best way
> to figure out how it fits together and become more useful. So happy to give
> anything a go that you think is reasonable for a newcomer to take a crack
> at. :) 

This won't be critical path for a while - I need to focus on bugs before I can
start anything new (and I've been chasing some real doozies the past few days; I
stumbled across multiple... interesting... bugs while trying to get xfstests
generic/547 to pass). And this might be one of the easier places to get started:
code that just interfaces with the btree tends to be fairly clean (vs. the code
that interfaces with the Linux VFS, which is where I'm at now).

So, the LRU: we don't do our LRU with lists, we do it with a clock. We've got
persistent clocks that track the amount of IO that has been done in a
filesystem, broken out into reads and writes, and for each bucket we track when
it was last read from or written to.
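
Conceptually, that looks something like this (not the real structs, just to
illustrate the idea - these names are made up):

/* conceptual sketch only, not actual bcachefs definitions */
struct io_clock_sketch {
	u64	now;		/* advances with sectors of IO done, not wall time */
};

struct bucket_sketch {
	u64	io_time[2];	/* clock reading at the last READ/WRITE of this bucket */
};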

So in the btree, you've got struct bch_alloc/bch_alloc_v2/bch_alloc_v3 now -
that's the value type for alloc keys, which describe buckets - you'll see
they've got read_time and write_time fields (stored as varints).
bch2_bucket_io_time_reset() updates the read or write time for a given bucket.

Currently the LRU is implemented by find_reclaimable_buckets_lru(), which scans
the in-memory array of buckets and builds up a heap. There's some logic in there
that won't be relevant for the persistent LRU; this code also prefers buckets
which don't need a journal commit yet, and that's going to be moved elsewhere.

The relevant bit is the first path in bucket_sort_key(), for buckets that have a
nonzero amount of data (i.e. cached data in them) - there's some arithmetic in
there to combine the last read time with the amount of data currently used.

The new persistent LRU will be roughly the same idea - we'll create a new btree,
call it BTREE_ID_lru, where keys in that btree point back to buckets (and alloc
keys will also need a new field pointing to the LRU key, when it exists), and
the first key in the LRU btree will point to the bucket we want to reuse next.
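
So the two btrees would point at each other roughly like this (purely
illustrative - the real key layout is still to be decided):

/* illustrative sketch only, not actual on-disk formats */
struct lru_key_sketch {
	u64	lru_idx;	/* position in BTREE_ID_lru; smallest is reused first */
	u64	bucket;		/* the bucket this LRU entry refers to */
};

struct alloc_key_sketch {
	u64	read_time;	/* as in bch_alloc_v2/v3 today */
	u64	write_time;
	u64	lru_idx;	/* new field: where our LRU entry currently lives */
};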

We'll add a transactional trigger for alloc keys - bch2_trans_mark_alloc(),
following the patterns for other triggers in buckets.c (at some point we need to
split up buckets.c into multiple files, it's gotten to be a bit of a mishmash -
but that's where the btree triggers code lives for now).

The trigger will take an old and a new alloc key, and when the LRU index is
changing it'll just delete the old LRU key and create a new one. Note that
multiple buckets could end up with the same LRU index but we can't store their
keys in the LRU btree at the same slot - so when that happens we'll have to use
the next empty slot.
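
In toy form, ignoring the btree and transaction machinery entirely, the
trigger's job is just this (made-up names, with a flat array standing in for
the LRU btree):

/* toy model: lru[] stands in for BTREE_ID_lru, 0 means an empty slot */
static u64 lru[1024];

static void lru_move_sketch(u64 bucket, u64 old_idx, u64 new_idx)
{
	if (old_idx < 1024 && lru[old_idx] == bucket)
		lru[old_idx] = 0;		/* delete the old LRU key */

	while (new_idx < 1024 && lru[new_idx])
		new_idx++;			/* collision: take the next empty slot */

	if (new_idx < 1024)
		lru[new_idx] = bucket;		/* create the new LRU key */
}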

We need to decide how the LRU index should be calculated. The current code
calculates "time since this bucket was last read from", i.e. now -
g->io_time[READ], and then it scales that by the amount of live data in the
bucket - given two buckets that were both read from at the same time, we want to
prefer the bucket that has more live data in it.

But "time since this bucket was last read from" isn't a stable value, it changes
as we do IO. Ignoring the part where we scale by bucket_sectors_used for the
moment, we'd want the index in the LRU btree to be just the bucket's read_time
field - that would get us LRU behaviour.

So how do we incorporate bucket_sectors_used into the calculation? read_time *
sectors_used isn't right - it scales the LRU index up or down correctly as
sectors_used changes, but that means if we've got a full bucket that hasn't been
read from in a very long time, we'll prefer to keep that over a bucket that was
_just_ read from and is only half full.

I think this code will do roughly what we want:

/* ratio between 0 (completely full bucket) and 1 (empty bucket) */
float bucket_fragmentation = (bucket_size - bucket_sectors_used) / (float) bucket_size;

u64 lru_idx = read_time - (now - read_time) * bucket_fragmentation;

Can't do floating point in the kernel so the actual arithmetic will be
different, and we'll have to guard against underflow and overflow. And this
calculation still isn't stable - we'll end up with a different number if we do
it at different times, but that's ok as long as we store the LRU idx we
calculated
in the alloc key.
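
For example, sticking to 64-bit integer arithmetic, it might look roughly like
this (again just a sketch, with made-up names and no claim that this is the
final formula):

/* integer version of the sketch above; assumes read_time <= now */
static u64 lru_idx_sketch(u64 now, u64 read_time,
			  u64 bucket_size, u64 sectors_used)
{
	u64 frag_sectors = bucket_size - sectors_used;	/* 0 = full bucket */
	u64 age = now - read_time;
	/* penalty ~= age * (frag_sectors / bucket_size), without floats */
	u64 penalty = (age / bucket_size) * frag_sectors +
		      (age % bucket_size) * frag_sectors / bucket_size;

	/* guard against underflow: clamp to 0 rather than wrapping */
	return read_time > penalty ? read_time - penalty : 0;
}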

Let me know if you get stuck :)

* Re: More eager discard behaviour
  2021-11-07 14:59     ` Kent Overstreet
@ 2021-11-08 21:16       ` Chris Webb
  0 siblings, 0 replies; 5+ messages in thread
From: Chris Webb @ 2021-11-08 21:16 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

Kent Overstreet <kent.overstreet@gmail.com> wrote:

> Let me know if you get stuck :)

Great, and thanks for the detailed explanation and guidance. I'll give it a go!

Plenty for me to get to grips with there... :)

Best wishes,

Chris.
