Date: Sun, 7 Nov 2021 09:59:07 -0500
From: Kent Overstreet
To: Chris Webb
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: More eager discard behaviour
In-Reply-To: <20211106213649.GN11670@arachsys.com>

On Sat, Nov 06, 2021 at 09:36:50PM +0000, Chris Webb wrote:
> I'm definitely interested in helping out if I can, although there's a slight
> risk of biting off more than I can chew: I probably understand even less
> about the design and structure of your fs at the moment than you think!
>
> I don't want to be a slow-moving obstacle on your critical path or more of a
> nuisance than a help, but equally, getting stuck in is clearly the best way
> to figure out how it fits together and become more useful. So happy to give
> anything a go that you think is reasonable for a newcomer to take a crack
> at. :)

This won't be critical path for a while - I need to focus on bugs before I
can start anything new (and I've been chasing some real doozies the past few
days; I stumbled across multiple... interesting... bugs while trying to get
xfstest generic/547 to pass). And this might be one of the easier places to
get started: stuff that just interfaces with the btree tends to be fairly
clean (vs. the code that interfaces with the Linux VFS, which is where I'm
at now).

So, the LRU: we don't do our LRU with lists, we do it with a clock. We've
got persistent clocks that track the amount of IO that has been done in a
filesystem, broken out into reads and writes, and for each bucket we track
when it was last read from or written to.

So in the btree, you've got struct bch_alloc/bch_alloc_v2/bch_alloc_v3 now -
that's the value type for alloc keys, which describe buckets - you'll see
they've got read_time and write_time fields (stored as varints).
bch2_bucket_io_time_reset() updates the read or write time for a given
bucket.

Currently the LRU is implemented by find_reclaimable_buckets_lru(), which
scans the in-memory array of buckets and builds up a heap. There's some
logic in there that won't be relevant for the persistent LRU: this code also
prefers buckets which don't need a journal commit yet, and that's going to
be moved elsewhere. The relevant bit is the first path in bucket_sort_key(),
for buckets that have a nonzero amount of data in them (i.e. cached data) -
there's some arithmetic in there to combine the last read time with the
amount of data currently used.

The new persistent LRU will be roughly the same idea - we'll create a new
btree, call it BTREE_ID_lru, where keys in that btree point back to buckets
(and alloc keys will also need a new field pointing to the LRU key, when it
exists), and the first key in the LRU btree will point to the bucket we want
to reuse next.

We'll add a transactional trigger for alloc keys - bch2_trans_mark_alloc(),
following the patterns for other triggers in buckets.c (at some point we
need to split up buckets.c into multiple files, it's gotten to be a bit of a
mishmash - but that's where the btree triggers code lives for now). The
trigger will take an old and a new alloc key, and when the LRU index is
changing it'll just delete the old LRU key and create a new one. Note that
multiple buckets could end up with the same LRU index, but we can't store
their keys in the LRU btree at the same slot - so when that happens we'll
have to use the next empty slot.
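To make that concrete, here's a rough sketch of the shape of it - none of
this is real bcachefs code; the array standing in for BTREE_ID_lru, the
struct layouts, and the helper names are all made up for illustration, and
the plain C types are just so it stands alone:

	#include <stdint.h>

	#define LRU_SLOTS	1024
	#define NO_BUCKET	0

	/* Toy stand-in for BTREE_ID_lru: slot (LRU index) -> bucket, 0 = empty */
	static uint64_t lru_btree[LRU_SLOTS];

	/* Alloc keys grow a back-pointer to their entry in the LRU btree: */
	struct alloc_key {
		uint64_t read_time;	/* IO clock at last read */
		uint64_t write_time;	/* IO clock at last write */
		uint64_t bucket;	/* bucket this alloc key describes */
		uint64_t lru_idx;	/* slot of our key in the LRU btree, 0 = none */
	};

	/*
	 * Multiple buckets can compute the same LRU index, but each slot
	 * holds one key - probe forward for the next empty slot:
	 */
	static uint64_t lru_find_empty_slot(uint64_t idx)
	{
		while (idx < LRU_SLOTS && lru_btree[idx] != NO_BUCKET)
			idx++;
		return idx;
	}

	/*
	 * Trigger logic in the spirit of the proposed bch2_trans_mark_alloc()
	 * hook: given the old and new versions of an alloc key, keep the LRU
	 * btree in sync.
	 */
	static void alloc_trigger(const struct alloc_key *old, struct alloc_key *new)
	{
		if (old && old->lru_idx)
			lru_btree[old->lru_idx] = NO_BUCKET;	/* delete old LRU key */

		if (new && new->lru_idx) {
			new->lru_idx = lru_find_empty_slot(new->lru_idx);
			if (new->lru_idx < LRU_SLOTS)
				lru_btree[new->lru_idx] = new->bucket; /* create new one */
		}
	}

	/* The allocator reuses the bucket pointed at by the first live LRU key: */
	static uint64_t lru_peek_front(void)
	{
		for (uint64_t i = 1; i < LRU_SLOTS; i++)
			if (lru_btree[i] != NO_BUCKET)
				return lru_btree[i];
		return NO_BUCKET;
	}

The real thing would of course be btree updates inside a transaction rather
than array stores, but the delete-old/probe-for-empty-slot/insert-new shape
is the same.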
We need to decide how the LRU index should be calculated. The current code
calculates "time since this bucket was last read from", i.e. now -
g->io_time[READ], and then it scales that by the amount of live data in the
bucket - given two buckets that were both read from at the same time, we
want to prefer the bucket that has more live data in it.

But "time since this bucket was last read from" isn't a stable value - it
changes as we do IO. Ignoring the part where we scale by bucket_sectors_used
for the moment, we'd want the index in the LRU btree to be just the bucket's
read_time field - that would get us LRU behaviour.

So how do we incorporate bucket_sectors_used into the calculation?
read_time * sectors_used isn't right - it scales the LRU index up or down
correctly as sectors_used changes, but it means that if we've got a full
bucket that hasn't been read from in a very long time, we'll prefer to keep
it over a bucket that was _just_ read from and is only half full. I think
this code will do roughly what we want:

	/* ratio between 0 (completely full bucket) and 1 (empty bucket): */
	float bucket_fragmentation = (float) (bucket_size - bucket_sectors_used) / bucket_size;
	u64 lru_idx = read_time - (now - read_time) * bucket_fragmentation;

We can't do floating point in the kernel, so the actual arithmetic will be
different, and we'll have to guard against underflow and overflow - there's
a rough integer sketch at the end of this mail. And this calculation still
isn't stable - we'll end up with a different number if we do the calculation
at different times - but that's ok as long as we store the LRU idx we
calculated in the alloc key.

Let me know if you get stuck :)
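And here's roughly what the integer version of that calculation might look
like - purely a sketch, with hypothetical names and plain C types so it
stands alone; the real kernel version would use helpers like div_u64()
rather than open-coded 64-bit division:

	#include <stdint.h>

	/*
	 * Integer version of the LRU index calculation - a sketch only, not
	 * settled arithmetic. Fragmentation stays scaled by bucket_size
	 * instead of being a 0..1 float, and the result saturates at zero so
	 * old, mostly-empty buckets sort to the front of the LRU btree.
	 */
	static uint64_t lru_idx(uint64_t now, uint64_t read_time,
				uint64_t bucket_size, uint64_t bucket_sectors_used)
	{
		uint64_t age  = now - read_time;	/* IO clock ticks since last read */
		uint64_t free = bucket_size - bucket_sectors_used;
		uint64_t penalty;

		/*
		 * Overflow guard: if age * free would overflow, the bucket is
		 * both ancient and fragmented - evict it first.
		 */
		if (free && age > UINT64_MAX / free)
			return 0;

		penalty = age * free / bucket_size;

		/* Underflow guard: saturate rather than wrap. */
		return read_time > penalty ? read_time - penalty : 0;
	}

Full buckets (free == 0) get lru_idx == read_time - pure LRU order - and
emptier buckets get pushed progressively earlier in the btree, which is the
eviction order we want.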