* pageless memory & zsmalloc
@ 2021-10-05 17:51 Matthew Wilcox
  2021-10-05 20:13 ` Kent Overstreet
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread

From: Matthew Wilcox @ 2021-10-05 17:51 UTC (permalink / raw)
To: linux-mm
Cc: Minchan Kim, Nitin Gupta, Sergey Senozhatsky, Kent Overstreet, Johannes Weiner

We're trying to tidy up the mess in struct page, and as part of removing
slab from struct page, zsmalloc came on my radar because it's using some
of slab's fields.  The eventual endgame is to get struct page down to a
single word which points to the "memory descriptor" (ie the current
zspage).

zsmalloc, like vmalloc, allocates order-0 pages.  Unlike vmalloc,
zsmalloc allows compaction.  Currently (from the file):

 * Usage of struct page fields:
 *	page->private: points to zspage
 *	page->freelist(index): links together all component pages of a zspage
 *		For the huge page, this is always 0, so we use this field
 *		to store handle.
 *	page->units: first object offset in a subpage of zspage
 *
 * Usage of struct page flags:
 *	PG_private: identifies the first component page
 *	PG_owner_priv_1: identifies the huge component page

This isn't quite everything.  For compaction, zsmalloc also uses
page->mapping (set in __SetPageMovable()), PG_lock (to sync with
compaction) and page->_refcount (compaction gets a refcount on the page).

Since zsmalloc is so well-contained, I propose we completely stop
using struct page in it, as we intend to do for the rest of the users
of struct page.  That is, the _only_ element of struct page we use is
compound_head and it points to struct zspage.

That means every single page allocated by zsmalloc is PageTail().  Also it
means that when isolate_movable_page() calls trylock_page(), it redirects
to the zspage.  That means struct zspage must now have page flags as its
first element.  Also, zspage->_refcount and zspage->mapping must match
their locations in struct page.  That's something that we'll get cleaned
up eventually, but for now, we're relying on offsetof() assertions.

The good news is that trylock_zspage() no longer needs to walk the
list of pages, calling trylock_page() on each of them.

Anyway, is there a good test suite for zsmalloc()?  Particularly something
that would exercise its interactions with compaction / migration?
I don't have any code written yet.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: pageless memory & zsmalloc
  2021-10-05 17:51 pageless memory & zsmalloc Matthew Wilcox
@ 2021-10-05 20:13 ` Kent Overstreet
  2021-10-05 21:28   ` Matthew Wilcox
  2021-10-07 15:03 ` Vlastimil Babka
  2021-10-08 20:43 ` Minchan Kim
  2 siblings, 1 reply; 8+ messages in thread

From: Kent Overstreet @ 2021-10-05 20:13 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Minchan Kim, Nitin Gupta, Sergey Senozhatsky, Johannes Weiner

On Tue, Oct 05, 2021 at 06:51:32PM +0100, Matthew Wilcox wrote:
> We're trying to tidy up the mess in struct page, and as part of removing
> slab from struct page, zsmalloc came on my radar because it's using some
> of slab's fields.  The eventual endgame is to get struct page down to a
> single word which points to the "memory descriptor" (ie the current
> zspage).
>
> zsmalloc, like vmalloc, allocates order-0 pages.  Unlike vmalloc,
> zsmalloc allows compaction.  Currently (from the file):
>
>  * Usage of struct page fields:
>  *	page->private: points to zspage
>  *	page->freelist(index): links together all component pages of a zspage
>  *		For the huge page, this is always 0, so we use this field
>  *		to store handle.
>  *	page->units: first object offset in a subpage of zspage
>  *
>  * Usage of struct page flags:
>  *	PG_private: identifies the first component page
>  *	PG_owner_priv_1: identifies the huge component page
>
> This isn't quite everything.  For compaction, zsmalloc also uses
> page->mapping (set in __SetPageMovable()), PG_lock (to sync with
> compaction) and page->_refcount (compaction gets a refcount on the page).
>
> Since zsmalloc is so well-contained, I propose we completely stop
> using struct page in it, as we intend to do for the rest of the users
> of struct page.  That is, the _only_ element of struct page we use is
> compound_head and it points to struct zspage.
>
> That means every single page allocated by zsmalloc is PageTail().  Also it
> means that when isolate_movable_page() calls trylock_page(), it redirects
> to the zspage.  That means struct zspage must now have page flags as its
> first element.  Also, zspage->_refcount and zspage->mapping must match
> their locations in struct page.  That's something that we'll get cleaned
> up eventually, but for now, we're relying on offsetof() assertions.
>
> The good news is that trylock_zspage() no longer needs to walk the
> list of pages, calling trylock_page() on each of them.
>
> Anyway, is there a good test suite for zsmalloc()?  Particularly something
> that would exercise its interactions with compaction / migration?
> I don't have any code written yet.

This is some deep hackery.

So to restate - and making sure I understand correctly - the reason for doing
it this way is that the compaction code calls lock_page(); using compound_head
(instead of page->private) for the pointer to the zspage means that
compound_head() will return the pointer to the zspage, and since lock_page()
uses compound_head(), the compaction code, when it calls lock_page(), will
actually be taking the lock in struct zspage.

So on the one hand, getting struct page down to two words means that we're
going to be moving the page lock bit into those external structs (maybe we
_should_ be doing some kind of inheritance thing, for the allocatee
interface?) - so it is cool that that all lines up.

But long term, we're going to need two words in struct page, not just one: We
need to store the order of the allocation, and a type tagged pointer, and I
don't think we can realistically cram compound_order into the same word as the
type tagged pointer.  Plus, transparently using slab for larger allocations -
that today use the buddy allocator interface - means slab can't be using the
allocatee pointer field.

And in my mind, compound_head is allocator state, not allocatee state, and
it's always been used for getting to the head page, which... this is not, so
using it this way, as slick as it is... eh, not sure that's quite what we want
to do.

Q: should lock_page() be calling compound_head() at all?  If the goal of the
type system stuff that we're doing is to move the compound_head() calls up to
where we're dealing with tail pages and never again after that, then that call
in lock_page() should go - and we'll need to figure out something different
for the migration code to take the lock in the allocatee-private structure.

I've been kind of feeling that since folios are furthest along, separately
allocating them and figuring out how to make that all work might be the
logical next step.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: pageless memory & zsmalloc
  2021-10-05 20:13 ` Kent Overstreet
@ 2021-10-05 21:28   ` Matthew Wilcox
  2021-10-05 23:00     ` Kent Overstreet
  0 siblings, 1 reply; 8+ messages in thread

From: Matthew Wilcox @ 2021-10-05 21:28 UTC (permalink / raw)
To: Kent Overstreet
Cc: linux-mm, Minchan Kim, Nitin Gupta, Sergey Senozhatsky, Johannes Weiner

On Tue, Oct 05, 2021 at 04:13:12PM -0400, Kent Overstreet wrote:
> On Tue, Oct 05, 2021 at 06:51:32PM +0100, Matthew Wilcox wrote:
> > We're trying to tidy up the mess in struct page, and as part of removing
> > slab from struct page, zsmalloc came on my radar because it's using some
> > of slab's fields.  The eventual endgame is to get struct page down to a
> > single word which points to the "memory descriptor" (ie the current
> > zspage).
> >
> > zsmalloc, like vmalloc, allocates order-0 pages.  Unlike vmalloc,
> > zsmalloc allows compaction.  Currently (from the file):
> >
> >  * Usage of struct page fields:
> >  *	page->private: points to zspage
> >  *	page->freelist(index): links together all component pages of a zspage
> >  *		For the huge page, this is always 0, so we use this field
> >  *		to store handle.
> >  *	page->units: first object offset in a subpage of zspage
> >  *
> >  * Usage of struct page flags:
> >  *	PG_private: identifies the first component page
> >  *	PG_owner_priv_1: identifies the huge component page
> >
> > This isn't quite everything.  For compaction, zsmalloc also uses
> > page->mapping (set in __SetPageMovable()), PG_lock (to sync with
> > compaction) and page->_refcount (compaction gets a refcount on the page).
> >
> > Since zsmalloc is so well-contained, I propose we completely stop
> > using struct page in it, as we intend to do for the rest of the users
> > of struct page.  That is, the _only_ element of struct page we use is
> > compound_head and it points to struct zspage.
> >
> > That means every single page allocated by zsmalloc is PageTail().  Also it
> > means that when isolate_movable_page() calls trylock_page(), it redirects
> > to the zspage.  That means struct zspage must now have page flags as its
> > first element.  Also, zspage->_refcount and zspage->mapping must match
> > their locations in struct page.  That's something that we'll get cleaned
> > up eventually, but for now, we're relying on offsetof() assertions.
> >
> > The good news is that trylock_zspage() no longer needs to walk the
> > list of pages, calling trylock_page() on each of them.
> >
> > Anyway, is there a good test suite for zsmalloc()?  Particularly something
> > that would exercise its interactions with compaction / migration?
> > I don't have any code written yet.
>
> This is some deep hackery.

... thank you?  ;-)  Actually, it's indicative that we need to work
through what we're doing in a bit more detail.

> So to restate - and making sure I understand correctly - the reason for doing
> it this way is that the compaction code calls lock_page(); using
> compound_head (instead of page->private) for the pointer to the zspage means
> that compound_head() will return the pointer to the zspage, and since
> lock_page() uses compound_head(), the compaction code, when it calls
> lock_page(), will actually be taking the lock in struct zspage.

Yes.  Just like when we call lock_page() on a THP or hugetlbfs tail
page, we actually take the lock in the head page.

> So on the one hand, getting struct page down to two words means that we're
> going to be moving the page lock bit into those external structs (maybe we
> _should_ be doing some kind of inheritance thing, for the allocatee
> interface?) - so it is cool that that all lines up.
>
> But long term, we're going to need two words in struct page, not just one: We
> need to store the order of the allocation, and a type tagged pointer, and I
> don't think we can realistically cram compound_order into the same word as
> the type tagged pointer.  Plus, transparently using slab for larger
> allocations - that today use the buddy allocator interface - means slab can't
> be using the allocatee pointer field.

I'm still not convinced of the need for allocator + allocatee words.
But I don't think we need to resolve that point of disagreement in
order to make progress towards the things we do agree on.

> And in my mind, compound_head is allocator state, not allocatee state, and
> it's always been used for getting to the head page, which... this is not, so
> using it this way, as slick as it is... eh, not sure that's quite what we
> want to do.

In my mind, in the future where all memory descriptors are dynamically
allocated, when we allocate an order-3 page, we initialise the
'allocatee state' of each of the 8 consecutive pages to point to the
memory descriptor that we just allocated.  We probably also encode the
type of the memory descriptor in the allocatee state (I think we've
probably got about 5 bits for that).

The lock state has to be in the memory descriptor.  It can't be in the
individual page.  So I think all memory descriptors need to start with
a flags word.  Memory compaction could refrain from locking pages if
the memory descriptor is of the wrong type, of course.

> Q: should lock_page() be calling compound_head() at all?  If the goal of the
> type system stuff that we're doing is to move the compound_head() calls up to
> where we're dealing with tail pages and never again after that, then that
> call in lock_page() should go - and we'll need to figure out something
> different for the migration code to take the lock in the allocatee-private
> structure.

Eventually, I think lock_page() disappears in favour of folio_lock().
That doesn't quite work for compaction, but maybe we could do something
like this ...

typedef struct {
	unsigned long f;
} pgflags_t;

void pgflags_lock(pgflags_t *);

struct folio {
	pgflags_t flags;
	...
};

static inline void folio_lock(struct folio *folio)
{
	pgflags_lock(&folio->flags);
}

That way, compaction doesn't need to know what kind of memory descriptor
this page belongs to.  Something similar could happen for user-mappable
memory.  eg:

struct mmapable {
	atomic_t _refcount;
	atomic_t _mapcount;
	struct address_space *mapping;
};

struct folio {
	pgflags_t flags;
	struct mmapable m;
	struct list_head lru;
	pgoff_t index;
	unsigned long private;
	...
};

None of this can happen before we move struct slab out of struct page,
because we can't mess with the alignment of freelist+counters, and
struct mmapable is 3 words on 32-bit and 2 on 64-bit.  So I'm trying to
break off pieces that can be done to get us a little bit closer.

> I've been kind of feeling that since folios are furthest along, separately
> allocating them and figuring out how to make that all work might be the
> logical next step.

I'd rather not put anything else on top of the folio work until some of
it -- any of it -- is merged.  Fortunately, our greatest bottleneck is
reviewer bandwidth, and it's actually parallelisable; the people who
need to approve of slab are independent of anon, file and zsmalloc
reviewers.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: pageless memory & zsmalloc
  2021-10-05 21:28   ` Matthew Wilcox
@ 2021-10-05 23:00     ` Kent Overstreet
  2021-10-06  3:21       ` Matthew Wilcox
  0 siblings, 1 reply; 8+ messages in thread

From: Kent Overstreet @ 2021-10-05 23:00 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Minchan Kim, Nitin Gupta, Sergey Senozhatsky, Johannes Weiner

On Tue, Oct 05, 2021 at 10:28:23PM +0100, Matthew Wilcox wrote:
> I'm still not convinced of the need for allocator + allocatee words.
> But I don't think we need to resolve that point of disagreement in
> order to make progress towards the things we do agree on.

It's not so much that I disagree, I just don't see how your one-word idea is
possible and you haven't outlined it :) Can you sketch it out for us?

> > And in my mind, compound_head is allocator state, not allocatee state, and
> > it's always been used for getting to the head page, which... this is not,
> > so using it this way, as slick as it is... eh, not sure that's quite what
> > we want to do.
>
> In my mind, in the future where all memory descriptors are dynamically
> allocated, when we allocate an order-3 page, we initialise the
> 'allocatee state' of each of the 8 consecutive pages to point to the
> memory descriptor that we just allocated.  We probably also encode the
> type of the memory descriptor in the allocatee state (I think we've
> probably got about 5 bits for that).

Yep, I've been envisioning using the low bits of the pointer to the allocatee
state as a type tag.  Where does compound_order go, though?

> The lock state has to be in the memory descriptor.  It can't be in the
> individual page.  So I think all memory descriptors need to start with
> a flags word.  Memory compaction could refrain from locking pages if
> the memory descriptor is of the wrong type, of course.

Memory compaction inherently has to switch on the allocatee type, so if the
page is of a type that can't be migrated, it would make sense to just not
bother with locking it.  On the other hand, the type isn't stable without
first locking the page.

There's another synchronization thing we have to work out: with the lock
behind a pointer, we can race with the page, and the allocatee state, being
freed.  Which implies that we're always going to have to RCU-free allocatee
state, and that after chasing the pointer and taking the lock we'll have to
check that the page is still a member of that allocatee state.

This race is something that we'll have to handle every place we deref the
allocatee state pointer - and in many cases we won't want to lock the
allocatee state, so page->ref will also have to be part of this common page
allocatee state.

> Eventually, I think lock_page() disappears in favour of folio_lock().
> That doesn't quite work for compaction, but maybe we could do something
> like this ...

Question is, do other types of pages besides just folios need lock_page() and
get_page()?  If so, maybe folio_lock() doesn't make sense at all and we should
just have functions that operate on your (expanded, to include a refcount)
pgflags_t.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: pageless memory & zsmalloc
  2021-10-05 23:00     ` Kent Overstreet
@ 2021-10-06  3:21       ` Matthew Wilcox
  0 siblings, 0 replies; 8+ messages in thread

From: Matthew Wilcox @ 2021-10-06 3:21 UTC (permalink / raw)
To: Kent Overstreet
Cc: linux-mm, Minchan Kim, Nitin Gupta, Sergey Senozhatsky, Johannes Weiner

On Tue, Oct 05, 2021 at 07:00:31PM -0400, Kent Overstreet wrote:
> On Tue, Oct 05, 2021 at 10:28:23PM +0100, Matthew Wilcox wrote:
> > I'm still not convinced of the need for allocator + allocatee words.
> > But I don't think we need to resolve that point of disagreement in
> > order to make progress towards the things we do agree on.
>
> It's not so much that I disagree, I just don't see how your one-word idea is
> possible and you haven't outlined it :) Can you sketch it out for us?

Sure!  Assuming we want to support allocating page cache memory from
slab (a proposition I'm not yet sold on, but I'm willing to believe
that we end up wanting to do that), let's suppose slab allocates an
order-3 slab to allocate from.  It allocates a struct slab, then sets
each of the 8 pages' compound_head entries to point to it.

When we allocate a single page from this slab, we allocate the folio
that this page will point to, then change that page's compound_head to
point to this folio.  We need to stash the slab compound_head entry
from this page in the struct folio so we know where to free it back to
when we free it.

Yes, this is going to require a bit of a specialist interface, but
we're already talking about adding specialist interfaces for allocating
memory anyway, so that doesn't concern me.

> > > And in my mind, compound_head is allocator state, not allocatee state,
> > > and it's always been used for getting to the head page, which... this is
> > > not, so using it this way, as slick as it is... eh, not sure that's
> > > quite what we want to do.
> >
> > In my mind, in the future where all memory descriptors are dynamically
> > allocated, when we allocate an order-3 page, we initialise the
> > 'allocatee state' of each of the 8 consecutive pages to point to the
> > memory descriptor that we just allocated.  We probably also encode the
> > type of the memory descriptor in the allocatee state (I think we've
> > probably got about 5 bits for that).
>
> Yep, I've been envisioning using the low bits of the pointer to the allocatee
> state as a type tag.  Where does compound_order go, though?

Ah!  I think it can go in page->flags on 64-bit and _last_cpupid on
32-bit.  It's only perhaps 4 bits on 32-bit and 5 on 64-bit (128MB and
8TB seem like reasonable limits on 32 and 64 bit respectively).

> > The lock state has to be in the memory descriptor.  It can't be in the
> > individual page.  So I think all memory descriptors need to start with
> > a flags word.  Memory compaction could refrain from locking pages if
> > the memory descriptor is of the wrong type, of course.
>
> Memory compaction inherently has to switch on the allocatee type, so if the
> page is of a type that can't be migrated, it would make sense to just not
> bother with locking it.  On the other hand, the type isn't stable without
> first locking the page.

I think it's stable once you get a refcount on the page; you don't need
a lock to prevent freeing.

> There's another synchronization thing we have to work out: with the lock
> behind a pointer, we can race with the page, and the allocatee state, being
> freed.  Which implies that we're always going to have to RCU-free allocatee
> state, and that after chasing the pointer and taking the lock we'll have to
> check that the page is still a member of that allocatee state.

Right, it's just like looking things up in the page cache.  Get the
pointer, inc the refcount if not zero, check the pointer still matches.
I think I mentioned SLAB_TYPESAFE_BY_RCU somewhere.
> This race is something that we'll have to handle every place we deref the
> allocatee state pointer - and in many cases we won't want to lock the
> allocatee state, so page->ref will also have to be part of this common page
> allocatee state.

I think we can avoid it for slab.  But maybe not if we want to support
compaction?  I need to think about that some more.

> > Eventually, I think lock_page() disappears in favour of folio_lock().
> > That doesn't quite work for compaction, but maybe we could do something
> > like this ...
>
> Question is, do other types of pages besides just folios need lock_page()
> and get_page()?  If so, maybe folio_lock() doesn't make sense at all and we
> should just have functions that operate on your (expanded, to include a
> refcount) pgflags_t.

I think folio_lock() makes sense for, eg, filesystems.  We should have
a complete API that operates on folios instead of punching through the
layers into the implementation detail of the folio.

For some parts of the core MM, something like pgflags_lock() makes more
sense.  Probably.  I reserve the right to change my mind on this one ...
David Hildenbrand put the idea into me earlier today and it's not
entirely settled down yet.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: pageless memory & zsmalloc
  2021-10-05 17:51 pageless memory & zsmalloc Matthew Wilcox
  2021-10-05 20:13 ` Kent Overstreet
@ 2021-10-07 15:03 ` Vlastimil Babka
  2021-10-07 18:11   ` Matthew Wilcox
  2021-10-08 20:43 ` Minchan Kim
  2 siblings, 1 reply; 8+ messages in thread

From: Vlastimil Babka @ 2021-10-07 15:03 UTC (permalink / raw)
To: Matthew Wilcox, linux-mm
Cc: Minchan Kim, Nitin Gupta, Sergey Senozhatsky, Kent Overstreet, Johannes Weiner

On 10/5/21 19:51, Matthew Wilcox wrote:
> We're trying to tidy up the mess in struct page, and as part of removing
> slab from struct page, zsmalloc came on my radar because it's using some
> of slab's fields.  The eventual endgame is to get struct page down to a
> single word which points to the "memory descriptor" (ie the current
> zspage).
>
> zsmalloc, like vmalloc, allocates order-0 pages.  Unlike vmalloc,
> zsmalloc allows compaction.  Currently (from the file):
>
>  * Usage of struct page fields:
>  *	page->private: points to zspage
>  *	page->freelist(index): links together all component pages of a zspage
>  *		For the huge page, this is always 0, so we use this field
>  *		to store handle.
>  *	page->units: first object offset in a subpage of zspage
>  *
>  * Usage of struct page flags:
>  *	PG_private: identifies the first component page
>  *	PG_owner_priv_1: identifies the huge component page
>
> This isn't quite everything.  For compaction, zsmalloc also uses
> page->mapping (set in __SetPageMovable()), PG_lock (to sync with
> compaction) and page->_refcount (compaction gets a refcount on the page).
>
> Since zsmalloc is so well-contained, I propose we completely stop
> using struct page in it, as we intend to do for the rest of the users
> of struct page.  That is, the _only_ element of struct page we use is
> compound_head and it points to struct zspage.
>
> That means every single page allocated by zsmalloc is PageTail().  Also it

I would be worried there is code, i.e. some pfn scanner, that will see a
PageTail, look up its compound_head() and order, and use it to skip over
the rest of the tail pages.  That would fail spectacularly if
compound_head() pointed somewhere other than to a struct page in the same
memmap array.

> means that when isolate_movable_page() calls trylock_page(), it redirects
> to the zspage.  That means struct zspage must now have page flags as its
> first element.  Also, zspage->_refcount and zspage->mapping must match
> their locations in struct page.  That's something that we'll get cleaned
> up eventually, but for now, we're relying on offsetof() assertions.
>
> The good news is that trylock_zspage() no longer needs to walk the
> list of pages, calling trylock_page() on each of them.
>
> Anyway, is there a good test suite for zsmalloc()?  Particularly something
> that would exercise its interactions with compaction / migration?
> I don't have any code written yet.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: pageless memory & zsmalloc
  2021-10-07 15:03 ` Vlastimil Babka
@ 2021-10-07 18:11   ` Matthew Wilcox
  0 siblings, 0 replies; 8+ messages in thread

From: Matthew Wilcox @ 2021-10-07 18:11 UTC (permalink / raw)
To: Vlastimil Babka
Cc: linux-mm, Minchan Kim, Nitin Gupta, Sergey Senozhatsky, Kent Overstreet, Johannes Weiner

On Thu, Oct 07, 2021 at 05:03:12PM +0200, Vlastimil Babka wrote:
> On 10/5/21 19:51, Matthew Wilcox wrote:
> > We're trying to tidy up the mess in struct page, and as part of removing
> > slab from struct page, zsmalloc came on my radar because it's using some
> > of slab's fields.  The eventual endgame is to get struct page down to a
> > single word which points to the "memory descriptor" (ie the current
> > zspage).
> >
> > zsmalloc, like vmalloc, allocates order-0 pages.  Unlike vmalloc,
> > zsmalloc allows compaction.  Currently (from the file):
> >
> >  * Usage of struct page fields:
> >  *	page->private: points to zspage
> >  *	page->freelist(index): links together all component pages of a zspage
> >  *		For the huge page, this is always 0, so we use this field
> >  *		to store handle.
> >  *	page->units: first object offset in a subpage of zspage
> >  *
> >  * Usage of struct page flags:
> >  *	PG_private: identifies the first component page
> >  *	PG_owner_priv_1: identifies the huge component page
> >
> > This isn't quite everything.  For compaction, zsmalloc also uses
> > page->mapping (set in __SetPageMovable()), PG_lock (to sync with
> > compaction) and page->_refcount (compaction gets a refcount on the page).
> >
> > Since zsmalloc is so well-contained, I propose we completely stop
> > using struct page in it, as we intend to do for the rest of the users
> > of struct page.  That is, the _only_ element of struct page we use is
> > compound_head and it points to struct zspage.
> >
> > That means every single page allocated by zsmalloc is PageTail().  Also it
>
> I would be worried there is code, i.e. some pfn scanner, that will see a
> PageTail, look up its compound_head() and order, and use it to skip over
> the rest of the tail pages.  That would fail spectacularly if
> compound_head() pointed somewhere other than to a struct page in the same
> memmap array.

Yes, that's definitely a concern.  What does work is the pfn scanner
doing

	pfn |= (1 << page_order(page)) - 1;

(because page_order() of a zsmalloc page is 0, so this is a noop).
It's something that will need to be audited before we do this.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: pageless memory & zsmalloc
  2021-10-05 17:51 pageless memory & zsmalloc Matthew Wilcox
  2021-10-05 20:13 ` Kent Overstreet
  2021-10-07 15:03 ` Vlastimil Babka
@ 2021-10-08 20:43 ` Minchan Kim
  2 siblings, 0 replies; 8+ messages in thread

From: Minchan Kim @ 2021-10-08 20:43 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, Nitin Gupta, Sergey Senozhatsky, Kent Overstreet, Johannes Weiner

[-- Attachment #1: Type: text/plain, Size: 2853 bytes --]

On Tue, Oct 05, 2021 at 06:51:32PM +0100, Matthew Wilcox wrote:
> We're trying to tidy up the mess in struct page, and as part of removing
> slab from struct page, zsmalloc came on my radar because it's using some
> of slab's fields.  The eventual endgame is to get struct page down to a
> single word which points to the "memory descriptor" (ie the current
> zspage).
>
> zsmalloc, like vmalloc, allocates order-0 pages.  Unlike vmalloc,
> zsmalloc allows compaction.  Currently (from the file):
>
>  * Usage of struct page fields:
>  *	page->private: points to zspage
>  *	page->freelist(index): links together all component pages of a zspage
>  *		For the huge page, this is always 0, so we use this field
>  *		to store handle.
>  *	page->units: first object offset in a subpage of zspage
>  *
>  * Usage of struct page flags:
>  *	PG_private: identifies the first component page
>  *	PG_owner_priv_1: identifies the huge component page
>
> This isn't quite everything.  For compaction, zsmalloc also uses
> page->mapping (set in __SetPageMovable()), PG_lock (to sync with
> compaction) and page->_refcount (compaction gets a refcount on the page).
>
> Since zsmalloc is so well-contained, I propose we completely stop
> using struct page in it, as we intend to do for the rest of the users
> of struct page.  That is, the _only_ element of struct page we use is
> compound_head and it points to struct zspage.

Then, do you mean zsmalloc couldn't use page.lru to link tail pages from
the head page?  IOW, does zspage need to have a subpage list or array?

> That means every single page allocated by zsmalloc is PageTail().  Also it
> means that when isolate_movable_page() calls trylock_page(), it redirects
> to the zspage.  That means struct zspage must now have page flags as its
> first element.  Also, zspage->_refcount and zspage->mapping must match
> their locations in struct page.  That's something that we'll get cleaned
> up eventually, but for now, we're relying on offsetof() assertions.
>
> The good news is that trylock_zspage() no longer needs to walk the
> list of pages, calling trylock_page() on each of them.

Sounds good if we could remove the mess.

> Anyway, is there a good test suite for zsmalloc()?  Particularly something
> that would exercise its interactions with compaction / migration?
> I don't have any code written yet.

This is my toy to put stress on those paths.  I ran it on KVM with 8
cores and 8GB RAM.  [attached memhog.c]

#!/bin/bash
swapoff -a
echo 1 > /sys/block/zram0/reset
echo 6g > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon /dev/zram0

for comp_ratio in $(seq 10 10 100)
do
	./memhog -m 1g -c $comp_ratio &
done

while true
do
	echo 1 > /sys/block/zram0/compact &
	echo 1 > /proc/sys/vm/compact_memory &
	sleep 2
done

[-- Attachment #2: memhog.c --]
[-- Type: text/x-csrc, Size: 5779 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <getopt.h>
#include <sys/mman.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

#define CHUNK_SIZE	(20UL<<20)

#ifndef PAGE_SIZE
#define PAGE_SIZE	(4096)
#endif

/*
 * For native build, use aarch64-linux-android-clang -pie -o memhog memhog.c
 */
void usage(char *exe)
{
	fprintf(stderr,
		"Usage: %s [options] size[k|m|g]\n"
		" -c|--compress ratio  fill memory with ratio compressible data\n"
		" -m|--memory SIZE     allocate memory in SIZE byte chunks\n"
		" -M|--mlock           mlock() the memory\n"
		" -s|--sleep SEC       sleep SEC seconds during repeat cycle\n"
		" -r|--repeat N        repeat read/write N times\n"
		" -h|--help            show this message\n", exe);
	exit(1);
}

static const struct option opts[] = {
	{ "compress",	1, NULL, 'c' },
	{ "memory",	1, NULL, 'm' },
	{ "mlock",	0, NULL, 'M' },
	{ "sleep",	1, NULL, 's' },
	{ "repeat",	1, NULL, 'r' },
	{ "help",	0, NULL, 'h' },
	{ NULL,		0, NULL, 0 }
};

unsigned long long memparse(const char *ptr, char **retptr)
{
	char *endptr;
	unsigned long long ret = strtoull(ptr, &endptr, 0);

	switch (*endptr) {
	case 'G':
	case 'g':
		ret <<= 10;
	case 'M':
	case 'm':
		ret <<= 10;
	case 'K':
	case 'k':
		ret <<= 10;
		endptr++;
	default:
		break;
	}

	if (retptr)
		*retptr = endptr;

	return ret;
}

void allocate_mem(unsigned long long size, void *alloc_ptr[], int *len)
{
	int i;
	void *ptr;
	unsigned long nr_chunk = size / CHUNK_SIZE;
	int allocated = 0;

	for (i = 0; i < nr_chunk; i++) {
		ptr = mmap(NULL, CHUNK_SIZE, PROT_READ|PROT_WRITE,
				MAP_PRIVATE|MAP_ANON, 0, 0);
		if (ptr == MAP_FAILED) {
			printf("fail to allocate %d\n", i);
			break;
		}
		alloc_ptr[allocated++] = ptr;
	}
	*len = allocated;
}

void free_mem(void *alloc_ptr[], int len)
{
	int i;

	for (i = 0; i < len; i++)
		munmap(alloc_ptr[i], CHUNK_SIZE);
}

void fill_mem(void *ptr, void *rand_page, long int comp_ratio)
{
	int i;
	static int nr_page = CHUNK_SIZE / PAGE_SIZE;
	int zero_size = PAGE_SIZE * comp_ratio / 100;

	for (i = 0; i < nr_page; i++, ptr += PAGE_SIZE) {
		memset(ptr, 0, zero_size);
		memcpy(ptr + zero_size, rand_page, PAGE_SIZE - zero_size);
	}
}

int fill_chunk(void *alloc_ptr[], int len, long int comp_ratio)
{
	int i, ret;
	char rand_buf[PAGE_SIZE];
	int fd = open("/dev/urandom", O_RDONLY);

	if (fd < 0) {
		perror("Fail to open /dev/urandom\n");
		return 1;
	}

	ret = read(fd, rand_buf, PAGE_SIZE);
	if (ret != PAGE_SIZE) {
		perror("Fail to read /dev/urandom\n");
		return 1;
	}

	for (i = 0; i < len; i++)
		fill_mem(alloc_ptr[i], rand_buf, comp_ratio);

	close(fd);
	return 0;
}

int main(int argc, char *argv[])
{
	char buf[256] = {0,};
	unsigned long long opt_mem = 100 << 20;
	long int opt_sleep = 1;
	long int opt_reps = 10000;
	long int opt_comp_ratio = 30;
	unsigned long opt_mlock = 0;
	long int loops;
	int pid = getpid();
	int err, c, count;

	while ((c = getopt_long(argc, argv, "m:Ms:r:c:", opts, NULL)) != -1) {
		switch (c) {
		case 'c':
			opt_comp_ratio = strtol(optarg, NULL, 10);
			break;
		case 'm':
			opt_mem = memparse(optarg, NULL);
			break;
		case 'M':
			opt_mlock = 1;
			break;
		case 's':
			opt_sleep = strtol(optarg, NULL, 10);
			break;
		case 'r':
			opt_reps = strtol(optarg, NULL, 10);
			break;
		case 'h':
			usage(argv[0]);
			break;
		default:
			usage(argv[0]);
		}
	}

	if (opt_mem < CHUNK_SIZE) {
		printf("memory size should be greater than %lu\n", CHUNK_SIZE);
		return 1;
	}

	/* Disable LMK/OOM killer */
	sprintf(buf, "echo -1000 > /proc/%d/oom_score_adj\n", pid);
	if (WEXITSTATUS(system(buf))) {
		fprintf(stderr, "fail to disable OOM. Maybe you need root permission\n");
		return 1;
	}

	if (opt_mlock) {
		err = mlockall(MCL_CURRENT|MCL_FUTURE|MCL_ONFAULT);
		if (err) {
			perror("Fail to mlockall\n");
			return err;
		}
	}

	printf("%llu MB allocated\n", opt_mem >> 20);
	printf("%lu loop\n", opt_reps);
	printf("%lu sleep\n", opt_sleep);
	printf("%lu comp_ratio\n", opt_comp_ratio);

	count = 0;
	loops = opt_reps;
	while (loops) {
		/* 20M * 4096 = 80G is enough */
		void *alloc_ptr[PAGE_SIZE];
		int len;

		count++;
retry:
		allocate_mem(opt_mem, alloc_ptr, &len);
		if (len == 0) {
			/*
			 * If we couldn't allocate any memory, let's try again
			 * after a while
			 */
			sleep(1);
			goto retry;
		}

		if (fill_chunk(alloc_ptr, len, opt_comp_ratio)) {
			printf("Fail to fill chunk\n");
			return 1;
		}

		if (opt_sleep == -1) {
			while (1) {
				printf("Forever sleep, Bye\n");
				sleep(100000);
			}
		}

		sleep(opt_sleep);
		free_mem(alloc_ptr, len);
		if (loops != -1)
			loops--;
		printf("[%d] Pass %d\n", pid, count);
	}
	return 0;
}

^ permalink raw reply	[flat|nested] 8+ messages in thread
end of thread, other threads:[~2021-10-08 20:43 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-05 17:51 pageless memory & zsmalloc Matthew Wilcox
2021-10-05 20:13 ` Kent Overstreet
2021-10-05 21:28   ` Matthew Wilcox
2021-10-05 23:00     ` Kent Overstreet
2021-10-06  3:21       ` Matthew Wilcox
2021-10-07 15:03 ` Vlastimil Babka
2021-10-07 18:11   ` Matthew Wilcox
2021-10-08 20:43 ` Minchan Kim