* More eager discard behaviour
@ 2021-11-06 17:11 Chris Webb
  2021-11-06 19:37 ` Kent Overstreet
  0 siblings, 1 reply; 5+ messages in thread
From: Chris Webb @ 2021-11-06 17:11 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs

Discards issued to a loopback device punch holes in the underlying files,
so I thought they'd be an easy way to check (and maybe ktest) filesystem
discard behaviour. Here, I make a 1GB filesystem then repeatedly create
and delete a 400MB file in it:

# truncate -s 1G /tmp/fs
# losetup /dev/loop0 /tmp/fs
# bcachefs format -q --discard /dev/loop0
initializing new filesystem
going read-write
mounted with opts: (null)
# mkdir -p /tmp/mnt
# mount -t bcachefs -o discard /dev/loop0 /tmp/mnt
# while true; do
>   sync && sleep 1 && du -h /tmp/fs
>   dd if=/dev/zero of=/tmp/mnt/file bs=1M count=400 status=none
>   sync && sleep 1 && du -h /tmp/fs
>   rm /tmp/mnt/file
> done
1.7M    /tmp/fs
404M    /tmp/fs
403M    /tmp/fs
806M    /tmp/fs
806M    /tmp/fs
992M    /tmp/fs
993M    /tmp/fs
992M    /tmp/fs
992M    /tmp/fs
992M    /tmp/fs
[...]

Although bcachefs does issue discards (double-checked with a printk in
discard_one_bucket), it only does so when the allocator thread wakes to
reclaim buckets once the entire block device is in use, so the practical
behaviour is that the whole device is kept full to the brim despite the
filesystem never being over 40% capacity. (With count=50, you can get the
same effect with an fs that never goes over 5% capacity.)

(Happy to roll the above into a ktest if it's useful, e.g. asserting that
capacity never goes above x% with repeated deletes?)
The equivalent test with ext4 shows discard doing the expected thing:

# truncate -s 1G /tmp/fs
# losetup /dev/loop0 /tmp/fs
# mkfs.ext4 -q /dev/loop0
# mkdir -p /tmp/mnt
# mount -t ext4 -o discard /dev/loop0 /tmp/mnt
# while true; do
>   sync && sleep 1 && du -h /tmp/fs
>   dd if=/dev/zero of=/tmp/mnt/file bs=1M count=400 status=none
>   sync && sleep 1 && du -h /tmp/fs
>   rm /tmp/mnt/file
> done
33M     /tmp/fs
433M    /tmp/fs
33M     /tmp/fs
433M    /tmp/fs
33M     /tmp/fs
433M    /tmp/fs
33M     /tmp/fs
433M    /tmp/fs
33M     /tmp/fs
[...]

SSDs are happier TRIMmed, but discard is also invaluable for filesystems
on thin-provisioning systems like dm-thin. (virtio-blk can pass discards
up from guest to host, so this is a common VM configuration.)

How practical would it be either to wake the allocator thread and reclaim
buckets more greedily, or to detect buckets available for discard earlier
in their lifetime?

Best wishes,

Chris.

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: More eager discard behaviour
  2021-11-06 17:11 More eager discard behaviour Chris Webb
@ 2021-11-06 19:37 ` Kent Overstreet
  2021-11-06 21:36   ` Chris Webb
  0 siblings, 1 reply; 5+ messages in thread
From: Kent Overstreet @ 2021-11-06 19:37 UTC (permalink / raw)
To: Chris Webb; +Cc: linux-bcachefs

On Sat, Nov 06, 2021 at 05:11:56PM +0000, Chris Webb wrote:
> SSDs are happier TRIMmed, but discard is also invaluable for filesystems on
> thin provisioning systems like dm-thin. (virtio-block can pass discards up
> from guest to host, so this is a common VM configuration.)
>
> How practical would it be either to more-greedily wake the allocator thread
> and reclaim buckets, or to detect buckets available to discard earlier in
> their lifetime?

It's on the todo list; it'll be part of the grand allocator rework I have
planned - there's a lot that needs to be done.

We've got to rework the state transitions and differentiate between
buckets with cached data, buckets that need a journal commit before we can
write to them again, buckets that need discarding, and buckets that are
ready for use - and we need to start storing all of that persistently, as
well as build some new data structures to get rid of the scanning over
every bucket we currently have to do in various places.

Step one is going to be creating a persistent LRU. It'll be another btree,
sorted by the order in which we want to reuse buckets - a combination of
the amount of live data in the bucket and the time since it was last read.

If you're interested in tackling this, I can sketch it out for you :)
* Re: More eager discard behaviour
  2021-11-06 19:37 ` Kent Overstreet
@ 2021-11-06 21:36   ` Chris Webb
  2021-11-07 14:59     ` Kent Overstreet
  0 siblings, 1 reply; 5+ messages in thread
From: Chris Webb @ 2021-11-06 21:36 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs

Kent Overstreet <kent.overstreet@gmail.com> writes:
> On Sat, Nov 06, 2021 at 05:11:56PM +0000, Chris Webb wrote:
> > How practical would it be either to more-greedily wake the allocator thread
> > and reclaim buckets, or to detect buckets available to discard earlier in
> > their lifetime?
>
> It's on the todo list, it'll be part of the grand allocator rework I have
> planned - there's a lot that needs to be done.

I read your discussion in the status update and in the related write-up on
the accounting code. It makes a lot of sense to me, given the tried and
tested btree layer is there to be used.

> We've got to rework the state transitions, and differentiate between buckets
> with cached data, buckets that need a journal commit before we can write to them
> again, buckets that need discarding, and buckets that are ready for use - and we
> need to start storing all that persistently, as well as build some new data
> structures to get rid of the scanning over every bucket we currently have to do
> in various places.
>
> Step one is going to be creating a persistent LRU. It'll be another btree,
> sorted by the order in which we want to reuse buckets - a combination of the
> amount of live data in the bucket and the time since it was last read.
>
> If you're interested in tackling this, I can sketch it out for you :)

I'm definitely interested in helping out if I can, although there's a
slight risk of biting off more than I can chew: I probably understand even
less about the design and structure of your fs at the moment than you
think!

I don't want to be a slow-moving obstacle on your critical path or more of
a nuisance than a help, but equally, getting stuck in is clearly the best
way to figure out how it fits together and become more useful. So I'm
happy to give anything a go that you think is reasonable for a newcomer to
take a crack at. :)

Best wishes,

Chris.
* Re: More eager discard behaviour
  2021-11-06 21:36   ` Chris Webb
@ 2021-11-07 14:59     ` Kent Overstreet
  2021-11-08 21:16       ` Chris Webb
  0 siblings, 1 reply; 5+ messages in thread
From: Kent Overstreet @ 2021-11-07 14:59 UTC (permalink / raw)
To: Chris Webb; +Cc: linux-bcachefs

On Sat, Nov 06, 2021 at 09:36:50PM +0000, Chris Webb wrote:
> I'm definitely interested in helping out if I can, although there's a slight
> risk of biting off more than I can chew: I probably understand even less
> about the design and structure of your fs at the moment than you think!
>
> I don't want to be a slow-moving obstacle on your critical path or more of a
> nuisance than a help, but equally, getting stuck in is clearly the best way
> to figure out how it fits together and become more useful. So happy to give
> anything a go that you think is reasonable for a newcomer to take a crack
> at. :)

This won't be critical path for a while - I need to focus on bugs before I
can start anything new (and I've been chasing some real doozies the past
few days; I stumbled across multiple... interesting... bugs while trying
to get xfstest generic/547 to pass). And this might be one of the easier
places to get started: code that just interfaces with the btree tends to
be fairly clean (vs. the code that interfaces with the Linux VFS, which is
where I'm at now).

So, the LRU: we don't do our LRU with lists, we do it with a clock. We've
got persistent clocks that track the amount of IO that has been done in a
filesystem, broken out as reads and writes, and we track for each bucket
when it was last read or written to.

So in the btree, you've got struct bch_alloc/bch_alloc_v2/bch_alloc_v3
now - that's the value type for alloc keys, which describe buckets - and
you'll see they've got read_time and write_time fields (stored as
varints). bch2_bucket_io_time_reset() updates the read or write time for a
given bucket.
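[Editor's note: a minimal user-space sketch of the IO-clock idea described
above. This is not the bcachefs implementation - all the type and function
names here are made up for illustration - but it shows how "least recently
used" falls out of comparing clock stamps that advance with IO rather than
with wall time.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a persistent IO clock: it advances by the
 * amount of IO done, not by wall time. */
struct io_clock {
	uint64_t now;		/* total sectors of IO seen so far */
};

struct bucket {
	uint64_t read_time;	/* io_clock value at last read */
};

/* Advance the clock as IO completes. */
static void io_clock_advance(struct io_clock *clock, uint64_t sectors)
{
	clock->now += sectors;
}

/* Stamp a bucket when it's read - analogous in spirit to what
 * bch2_bucket_io_time_reset() does for read_time/write_time. */
static void bucket_read(struct io_clock *clock, struct bucket *b,
			uint64_t sectors)
{
	io_clock_advance(clock, sectors);
	b->read_time = clock->now;
}

/* A lower read_time means the bucket was read longer ago (in IO terms),
 * so under pure LRU it's the better reclaim candidate. */
static int bucket_lru_before(const struct bucket *a, const struct bucket *b)
{
	return a->read_time < b->read_time;
}
```

The point of the clock is that "age" is measured in sectors of IO, so an
idle filesystem doesn't age its buckets at all.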
Currently the LRU is implemented by find_reclaimable_buckets_lru(), which
scans the in-memory array of buckets and builds up a heap. There's some
logic in there that won't be relevant for the persistent LRU; that code
also prefers buckets which don't yet need a journal commit, and that part
is going to move elsewhere. The relevant bit is the first path in
bucket_sort_key(), for buckets that have a nonzero amount of data (i.e.
cached data in them) - there's some arithmetic in there to combine the
last read time with the amount of data currently used.

The new persistent LRU will be roughly the same idea: we'll create a new
btree, call it BTREE_ID_lru, where keys in that btree point back to
buckets (and alloc keys will also need a new field pointing to the LRU
key, when it exists), and the first key in the LRU btree will point to the
bucket we want to reuse next.

We'll add a transactional trigger for alloc keys - bch2_trans_mark_alloc()
- following the patterns for the other triggers in buckets.c (at some
point we need to split buckets.c up into multiple files, it's gotten to be
a bit of a mishmash - but that's where the btree trigger code lives for
now). The trigger will take an old and a new alloc key, and when the LRU
index changes it'll just delete the old LRU key and create a new one. Note
that multiple buckets could end up with the same LRU index, but we can't
store their keys in the LRU btree at the same slot - so when that happens
we'll have to use the next empty slot.

We need to decide how the LRU index should be calculated. The current code
calculates "time since this bucket was last read from", i.e.
now - g->io_time[READ], and then scales that by the amount of live data in
the bucket - given two buckets that were both read from at the same time,
we want to prefer keeping the bucket that has more live data in it. But
"time since this bucket was last read from" isn't a stable value: it
changes as we do IO.
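[Editor's note: the "same LRU index, use the next empty slot" rule above
can be sketched against a toy fixed-size table standing in for the LRU
btree. The names and the flat-array structure are illustrative only - the
real thing would be btree insertions inside a transaction.]

```c
#include <assert.h>
#include <stdint.h>

#define LRU_SLOTS 64
#define NO_BUCKET UINT64_MAX

/* Toy stand-in for BTREE_ID_lru: maps LRU index -> bucket number, with
 * empty slots marked by NO_BUCKET. */
struct lru_table {
	uint64_t bucket_at[LRU_SLOTS];
};

static void lru_init(struct lru_table *t)
{
	for (int i = 0; i < LRU_SLOTS; i++)
		t->bucket_at[i] = NO_BUCKET;
}

/* Insert at the computed index, or failing that at the next empty slot -
 * mirroring the rule that two buckets can't share one LRU key. Returns
 * the slot actually used (which the alloc key would need to record so the
 * trigger can later delete it), or -1 if no slot was free. */
static int lru_insert(struct lru_table *t, uint64_t idx, uint64_t bucket)
{
	for (uint64_t i = idx; i < LRU_SLOTS; i++) {
		if (t->bucket_at[i] == NO_BUCKET) {
			t->bucket_at[i] = bucket;
			return (int) i;
		}
	}
	return -1;
}
```

Storing the slot actually used back in the alloc key is what keeps the
delete-old/create-new trigger cheap: no scanning is needed to find the
stale LRU entry.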
Ignoring the part where we scale by bucket_sectors_used for the moment,
we'd want the index in the LRU btree to be just the bucket's read_time
field - that would get us pure LRU behaviour.

So how do we incorporate bucket_sectors_used into the calculation?
read_time * sectors_used isn't right - it scales the LRU index up or down
as sectors_used changes, but it means that if we've got a full bucket that
hasn't been read from in a very long time, we'll prefer to keep it over a
bucket that was _just_ read from and is only half full. I think this code
will do roughly what we want:

	/*
	 * Ratio of free space to live data: 0 for a completely full
	 * bucket, growing without bound as the bucket empties:
	 */
	float bucket_fragmentation =
		(float) (bucket_size - bucket_sectors_used) / bucket_sectors_used;

	u64 lru_idx = read_time - (now - read_time) * bucket_fragmentation;

We can't do floating point in the kernel, so the actual arithmetic will be
different, and we'll have to guard against underflow and overflow. And
this calculation still isn't stable - we'll end up with a different number
if we compute it at different times - but that's ok as long as we store
the LRU idx we calculated in the alloc key.

Let me know if you get stuck :)
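[Editor's note: an integer-only version of the arithmetic above, with the
overflow/underflow guards the message mentions. This is a sketch, not the
actual bcachefs code; the function name and the 6-bit fixed-point
precision are arbitrary choices made for illustration.]

```c
#include <assert.h>
#include <stdint.h>

#define FRAG_SHIFT 6	/* arbitrary fixed-point precision */

/*
 * Integer sketch of: read_time - (now - read_time) * fragmentation,
 * where fragmentation = (bucket_size - sectors_used) / sectors_used.
 * A smaller result means "reclaim sooner". sectors_used must be nonzero
 * (empty buckets wouldn't take this path).
 */
static uint64_t bucket_lru_idx(uint64_t read_time, uint64_t now,
			       uint64_t bucket_size, uint64_t sectors_used)
{
	uint64_t age = now - read_time;
	uint64_t frag_fp = ((bucket_size - sectors_used) << FRAG_SHIFT) /
			   sectors_used;
	uint64_t penalty;

	/* Guard against overflow in the multiply: */
	if (frag_fp && age > UINT64_MAX / frag_fp)
		penalty = UINT64_MAX >> FRAG_SHIFT;
	else
		penalty = (age * frag_fp) >> FRAG_SHIFT;

	/* Guard against underflow: saturate at 0 (reclaim first). */
	return penalty < read_time ? read_time - penalty : 0;
}
```

With this shape, a completely full bucket (frag_fp == 0) keeps its plain
read_time as its index, while emptier buckets are pushed toward 0 - the
front of the LRU btree - in proportion to both their fragmentation and
their IO-clock age.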
* Re: More eager discard behaviour
  2021-11-07 14:59     ` Kent Overstreet
@ 2021-11-08 21:16       ` Chris Webb
  0 siblings, 0 replies; 5+ messages in thread
From: Chris Webb @ 2021-11-08 21:16 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs

Kent Overstreet <kent.overstreet@gmail.com> wrote:
> Let me know if you get stuck :)

Great, and thanks for the detailed explanation and guidance. I'll give it
a go! Plenty for me to get to grips with there... :)

Best wishes,

Chris.
end of thread, other threads:[~2021-11-08 21:16 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-06 17:11 More eager discard behaviour Chris Webb
2021-11-06 19:37 ` Kent Overstreet
2021-11-06 21:36   ` Chris Webb
2021-11-07 14:59     ` Kent Overstreet
2021-11-08 21:16       ` Chris Webb