Date: Wed, 6 Oct 2021 04:21:06 +0100
From: Matthew Wilcox <willy@infradead.org>
To: Kent Overstreet
Cc: linux-mm@kvack.org, Minchan Kim, Nitin Gupta, Sergey Senozhatsky, Johannes Weiner
Subject: Re: pageless memory & zsmalloc

On Tue, Oct 05, 2021 at 07:00:31PM -0400, Kent Overstreet wrote:
> On Tue, Oct 05, 2021 at 10:28:23PM +0100, Matthew Wilcox wrote:
> > I'm still not convinced of the need for allocator + allocatee words.
> > But I don't think we need to resolve that point of disagreement in
> > order to make progress towards the things we do agree on.
>
> It's not so much that I disagree, I just don't see how your one-word idea is
> possible and you haven't outlined it :) Can you sketch it out for us?

Sure!
Assuming we want to support allocating page cache memory from slab (a
proposition I'm not yet sold on, but I'm willing to believe we may end up
wanting to do it), let's suppose slab allocates an order-3 block to
allocate from. It allocates a struct slab, then sets each of the 8 pages'
compound_head entries to point to it. When we allocate a single page from
this slab, we allocate the folio that this page will point to, then change
that page's compound_head to point to this folio. We need to stash the
slab's compound_head entry for this page in the struct folio so we know
where to free the page back to when we free it. Yes, this is going to
require a bit of a specialist interface, but we're already talking about
adding specialist interfaces for allocating memory anyway, so that doesn't
concern me.

> > > And in my mind, compound_head is allocator state, not allocatee state,
> > > and it's always been used for getting to the head page, which... this
> > > is not, so using it this way, as slick as it is... eh, not sure that's
> > > quite what we want to do.
> >
> > In my mind, in the future where all memory descriptors are dynamically
> > allocated, when we allocate an order-3 page, we initialise the
> > 'allocatee state' of each of the 8 consecutive pages to point to the
> > memory descriptor that we just allocated. We probably also encode the
> > type of the memory descriptor in the allocatee state (I think we've
> > probably got about 5 bits for that).
>
> Yep, I've been envisioning using the low bits of the pointer to the
> allocatee state as a type tag. Where does compound_order go, though?

Ah! I think it can go in page->flags on 64-bit and _last_cpupid on 32-bit.
It's only perhaps 4 bits on 32-bit and 5 on 64-bit (128MB and 8TB seem like
reasonable limits on 32-bit and 64-bit respectively: 128MB is 2^15 4kB
pages, so orders 0-15 fit in 4 bits, and 8TB is 2^31 pages, so orders up
to 31 fit in 5 bits).

> > The lock state has to be in the memory descriptor. It can't be in the
> > individual page. So I think all memory descriptors need to start with
> > a flags word.
> > Memory compaction could refrain from locking pages if the memory
> > descriptor is of the wrong type, of course.
>
> Memory compaction inherently has to switch on the allocatee type, so if
> the page is of a type that can't be migrated, it would make sense to just
> not bother with locking it. On the other hand, the type isn't stable
> without first locking the page.

I think it's stable once you get a refcount on the page; you don't need a
lock to prevent freeing.

> There's another synchronization thing we have to work out: with the lock
> behind a pointer, we can race with the page, and the allocatee state,
> being freed. Which implies that we're always going to have to RCU-free
> allocatee state, and that after chasing the pointer and taking the lock
> we'll have to check that the page is still a member of that allocatee
> state.

Right, it's just like looking things up in the page cache. Get the
pointer, inc the refcount if not zero, check the pointer still matches.
I think I mentioned SLAB_TYPESAFE_BY_RCU somewhere.

> This race is something that we'll have to handle every place we deref
> the allocatee state pointer - and in many cases we won't want to lock
> the allocatee state, so page->ref will also have to be part of this
> common page allocatee state.

I think we can avoid it for slab. But maybe not if we want to support
compaction? I need to think about that some more.

> > Eventually, I think lock_page() disappears in favour of folio_lock().
> > That doesn't quite work for compaction, but maybe we could do something
> > like this ...
>
> Question is, do other types of pages besides just folios need
> lock_page() and get_page()? If so, maybe folio_lock() doesn't make sense
> at all and we should just have functions that operate on your (expanded,
> to include a refcount) pgflags_t.

I think folio_lock() makes sense for, eg, filesystems.
We should have a complete API that operates on folios, rather than
punching through the layers into the folio's implementation details.
For some parts of the core MM, something like pgflags_lock() makes more
sense. Probably. I reserve the right to change my mind on this one ...
David Hildenbrand put the idea into me earlier today and it's not
entirely settled down yet.