archive mirror
 help / color / mirror / Atom feed
From: Hannes Reinecke <>
To: Matthew Wilcox <>
Cc: Naohiro Aota <>,,,,
	Andrew Morton <>
Subject: Re: Project idea: Swap to zoned block devices
Date: Tue, 15 Oct 2019 17:22:34 +0200	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On 10/15/19 5:09 PM, Matthew Wilcox wrote:
> On Tue, Oct 15, 2019 at 03:48:47PM +0200, Hannes Reinecke wrote:
>> On 10/15/19 1:35 PM, Matthew Wilcox wrote:
>>> On Tue, Oct 15, 2019 at 01:38:27PM +0900, Naohiro Aota wrote:
>>>> A zoned block device consists of a number of zones. Zones are
>>>> either conventional and accepting random writes or sequential and
>>>> requiring that writes be issued in LBA order from each zone write
>>>> pointer position. For the write restriction, zoned block devices are
>>>> not suitable for a swap device. Disallow swapon on them.
>>> That's unfortunate.  I wonder what it would take to make the swap code be
>>> suitable for zoned devices.  It might even perform better on conventional
>>> drives since swapout would be a large linear write.  Swapin would be a
>>> fragmented, seeky set of reads, but this would seem like an excellent
>>> university project.
>> The main problem I'm seeing is the eviction of pages from swap.
>> While swapin is easy (as you can do random access on reads), evict pages
>> from cache becomes extremely tricky as you can only delete entire zones.
>> So how to we mark pages within zones as being stale?
>> Or can we modify the swapin code to always swap in an entire zone and
>> discard it immediately?
> I thought zones were too big to swap in all at once?  What's a typical
> zone size these days?  (the answer looks very different if a zone is 1MB
> or if it's 1GB)
Currently things have settled at 256MB, might be increased for ZNS.
But GB would be the upper limit I'd assume.

> Fundamentally an allocated anonymous page has 5 states:
> A: In memory, not written to swap (allocated)
> B: In memory, dirty, not written to swap (app modifies page)
> C: In memory, clean, written to swap (kernel decides to write it)
> D: Not in memory, written to swap (kernel decides to reuse the memory)
> E: In memory, clean, written to swap (app faults it back in for read)
> We currently have a sixth state which is a page that has previously been
> written to swap but has been redirtied by the app.  It will be written
> back to the allocated location the next time it's targetted for writeout.
> That would have to change; since we can't do random writes, pages would
> transition from states D or E back to B.  Swapping out a page that has
> previously been swapped will now mean appending to the tail of the swap,
> not writing in place.
> So the swap code will now need to keep track of which pages are still
> in use in storage and will need to be relocated once we decide to reuse
> the zone.  Not an insurmountable task, but not entirely trivial.
Precisely my worries.
However, clearing stuff is _really_ fast (you just have to reset the
pointer which is kept in NVRAM of the device). Which might help a bit.

> There'd be some other gunk to deal with around handling badblocks.
> Those are currently stored in page 1, so adding new ones would be
> a rewrite of that block.
Bah. Can't we make that optional?
We really only need badblocks when writing to crappy media (or NV-DIMM
:-). Zoned devices _will_ have proper error recovery in place, so the
only time where badblocks might be used is when the device is
essentially dead ;-)


Dr. Hannes Reinecke		      Teamlead Storage & Networking			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer

      reply	other threads:[~2019-10-15 15:22 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-15  4:38 [PATCH] mm, swap: disallow swapon() on zoned block devices Naohiro Aota
2019-10-15  7:57 ` Christoph Hellwig
2019-10-15  8:58 ` [PATCH v2] " Naohiro Aota
2019-10-15  9:06   ` Christoph Hellwig
2019-10-15 20:43     ` Andrew Morton
2019-10-15 11:35 ` Project idea: Swap to " Matthew Wilcox
2019-10-15 13:27   ` Theodore Y. Ts'o
2019-10-15 13:48   ` Hannes Reinecke
2019-10-15 14:50     ` Christopher Lameter
2019-10-15 15:09     ` Matthew Wilcox
2019-10-15 15:22       ` Hannes Reinecke [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).