From: Hannes Reinecke <firstname.lastname@example.org>
To: Matthew Wilcox <email@example.com>
Cc: Naohiro Aota <firstname.lastname@example.org>,
Andrew Morton <email@example.com>
Subject: Re: Project idea: Swap to zoned block devices
Date: Tue, 15 Oct 2019 17:22:34 +0200
Message-ID: <firstname.lastname@example.org>
On 10/15/19 5:09 PM, Matthew Wilcox wrote:
> On Tue, Oct 15, 2019 at 03:48:47PM +0200, Hannes Reinecke wrote:
>> On 10/15/19 1:35 PM, Matthew Wilcox wrote:
>>> On Tue, Oct 15, 2019 at 01:38:27PM +0900, Naohiro Aota wrote:
>>>> A zoned block device consists of a number of zones. Zones are
>>>> either conventional and accepting random writes or sequential and
>>>> requiring that writes be issued in LBA order from each zone write
>>>> pointer position. Because of this write restriction, zoned block
>>>> devices are not suitable as swap devices. Disallow swapon on them.
>>> That's unfortunate. I wonder what it would take to make the swap code be
>>> suitable for zoned devices. It might even perform better on conventional
>>> drives since swapout would be a large linear write. Swapin would be a
>>> fragmented, seeky set of reads, but this would seem like an excellent
>>> university project.
>> The main problem I'm seeing is the eviction of pages from swap.
>> While swapin is easy (as you can do random access on reads), evicting
>> pages from cache becomes extremely tricky as you can only delete entire
>> zones. So how do we mark pages within zones as being stale?
>> Or can we modify the swapin code to always swap in an entire zone and
>> discard it immediately?
> I thought zones were too big to swap in all at once? What's a typical
> zone size these days? (the answer looks very different if a zone is 1MB
> or if it's 1GB)
Currently things have settled at 256MB, might be increased for ZNS.
But GB would be the upper limit I'd assume.
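
To put those zone sizes in terms of swap slots, here is a small Python sketch. It assumes a 4 KiB page size, which is not stated in this thread, and the helper name pages_per_zone is made up for illustration.

```python
# Rough arithmetic: how many swap slots fit in one zone.
PAGE_SIZE = 4096  # assumption: 4 KiB pages; not stated in the thread
MiB = 1 << 20


def pages_per_zone(zone_bytes, page_size=PAGE_SIZE):
    """Number of whole pages that fit in a zone of zone_bytes bytes."""
    return zone_bytes // page_size


# The three sizes mentioned in the discussion: 1MB, 256MB and 1GB.
for label, size in (("1MB", 1 * MiB), ("256MB", 256 * MiB), ("1GB", 1024 * MiB)):
    print(f"{label:>6} zone: {pages_per_zone(size)} pages")
```

Under that assumption, swapping in an entire 256MB zone would mean reading tens of thousands of pages at once.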
> Fundamentally an allocated anonymous page has 5 states:
> A: In memory, not written to swap (allocated)
> B: In memory, dirty, not written to swap (app modifies page)
> C: In memory, clean, written to swap (kernel decides to write it)
> D: Not in memory, written to swap (kernel decides to reuse the memory)
> E: In memory, clean, written to swap (app faults it back in for read)
> We currently have a sixth state which is a page that has previously been
> written to swap but has been redirtied by the app. It will be written
> back to the allocated location the next time it's targeted for writeout.
> That would have to change; since we can't do random writes, pages would
> transition from states D or E back to B. Swapping out a page that has
> previously been swapped will now mean appending to the tail of the swap,
> not writing in place.
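
The states and transitions described above could be modelled roughly as follows. This is only a sketch: the event names (modify, writeout, reclaim, fault_read) are hypothetical, and the C -> B and E -> D transitions are assumptions that the text above does not spell out. There is no sixth "redirtied in place" state; a modified page goes back to B and its old swap slot becomes stale.

```python
from enum import Enum, auto


class State(Enum):
    A = auto()  # in memory, not written to swap
    B = auto()  # in memory, dirty, not written to swap
    C = auto()  # in memory, clean, written to swap
    D = auto()  # not in memory, written to swap
    E = auto()  # in memory, clean, written to swap (faulted back in)


# (state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.A, "modify"): State.B,      # app modifies page
    (State.B, "writeout"): State.C,    # kernel appends page to swap
    (State.C, "reclaim"): State.D,     # kernel reuses the memory
    (State.D, "fault_read"): State.E,  # app faults it back in for read
    (State.E, "reclaim"): State.D,     # assumption: clean copy can be dropped
    (State.C, "modify"): State.B,      # assumption: same rule as for E
    (State.D, "modify"): State.B,      # no in-place rewrite on a zoned device
    (State.E, "modify"): State.B,
}

HAS_SWAP_COPY = {State.C, State.D, State.E}


def step(state, event):
    """Return (next_state, slot_went_stale) for one event."""
    try:
        nxt = TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}")
    # Going back to B abandons the old on-disk copy; the next writeout
    # appends at the tail of the swap instead of rewriting that slot.
    stale = state in HAS_SWAP_COPY and nxt is State.B
    return nxt, stale
```

The stale flag is the hook where the swap code would record that a slot in some zone is no longer in use.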
> So the swap code will now need to keep track of which pages are still
> in use in storage and will need to be relocated once we decide to reuse
> the zone. Not an insurmountable task, but not entirely trivial.
Precisely my worries.
However, clearing stuff is _really_ fast (you just have to reset the
pointer which is kept in NVRAM of the device). Which might help a bit.
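
To make the bookkeeping concrete, here is a toy Python model of an append-only swap area: each zone has a write pointer, the swap code tracks which slots are still live, and reclaiming a zone relocates the live pages before resetting the pointer. All class and method names are hypothetical, and this is an in-memory illustration under those assumptions, not kernel code.

```python
class Zone:
    def __init__(self, capacity):
        self.capacity = capacity  # slots (pages) per zone
        self.write_pointer = 0    # next slot that may be written
        self.live = {}            # offset -> page_id for slots still in use

    def is_full(self):
        return self.write_pointer >= self.capacity

    def append(self, page_id):
        """Sequential write: only the slot at the write pointer is writable."""
        if self.is_full():
            raise RuntimeError("zone full")
        offset = self.write_pointer
        self.live[offset] = page_id
        self.write_pointer += 1
        return offset

    def reset(self):
        """Cheap in this model: rewind the pointer, forget the contents."""
        self.write_pointer = 0
        self.live.clear()


class ZonedSwap:
    def __init__(self, num_zones, zone_capacity):
        self.zones = [Zone(zone_capacity) for _ in range(num_zones)]
        self.location = {}  # page_id -> (zone_index, offset)

    def _pick_zone(self, exclude=None):
        for i, zone in enumerate(self.zones):
            if i != exclude and not zone.is_full():
                return i
        raise RuntimeError("no free slot; reclaim a zone first")

    def swap_out(self, page_id, exclude=None):
        """Append the page; any previous copy becomes stale."""
        zi = self._pick_zone(exclude)
        old = self.location.get(page_id)
        if old is not None:
            old_zone, old_off = old
            del self.zones[old_zone].live[old_off]
        off = self.zones[zi].append(page_id)
        self.location[page_id] = (zi, off)
        return zi, off

    def reclaim_zone(self, victim):
        """Relocate pages still in use, then reset the victim zone."""
        for page_id in list(self.zones[victim].live.values()):
            self.swap_out(page_id, exclude=victim)
        self.zones[victim].reset()


swap = ZonedSwap(num_zones=2, zone_capacity=4)
for page in (1, 2, 3, 4):
    swap.swap_out(page)   # fills zone 0
swap.swap_out(1)          # page 1 redirtied: appended to zone 1, old slot stale
swap.reclaim_zone(0)      # pages 2, 3, 4 move to zone 1; zone 0 is reset
```

The reset is the cheap part; the cost sits in copying the pages that are still live out of the victim zone first.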
> There'd be some other gunk to deal with around handling badblocks.
> Those are currently stored in page 1, so adding new ones would be
> a rewrite of that block.
Bah. Can't we make that optional?
We really only need badblocks when writing to crappy media (or NV-DIMM
:-). Zoned devices _will_ have proper error recovery in place, so the
only time where badblocks might be used is when the device is
essentially dead ;-)
Dr. Hannes Reinecke Teamlead Storage & Networking
email@example.com +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer
Thread overview: 11+ messages
2019-10-15 4:38 [PATCH] mm, swap: disallow swapon() on zoned block devices Naohiro Aota
2019-10-15 7:57 ` Christoph Hellwig
2019-10-15 8:58 ` [PATCH v2] " Naohiro Aota
2019-10-15 9:06 ` Christoph Hellwig
2019-10-15 20:43 ` Andrew Morton
2019-10-15 11:35 ` Project idea: Swap to " Matthew Wilcox
2019-10-15 13:27 ` Theodore Y. Ts'o
2019-10-15 13:48 ` Hannes Reinecke
2019-10-15 14:50 ` Christopher Lameter
2019-10-15 15:09 ` Matthew Wilcox
2019-10-15 15:22 ` Hannes Reinecke [this message]