linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* unconditional discard calls in the swap code
@ 2009-10-30  6:51 Christoph Hellwig
  2009-10-30 17:26 ` Hugh Dickins
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2009-10-30  6:51 UTC (permalink / raw)
  To: hugh.dickins; +Cc: linux-mm

Hi Hugh,

since 6a6ba83175c029c7820765bae44692266b29e67a the swap code
unconditionally calls blkdev_issue_discard when swap clusters get freed.
So far this was harmless because only the mtd driver has discard support
wired up and it's pretty fast there (entirely done in-kernel).

We're now adding support for real UNAP/TRIM support for SCSI arrays and
SSDs, and so far all the real life ones we've dealt with have too many
performance issues to just issue the discard requests on the fly.
Because of that unconditionally enabling this code is a bad idea, it
really needs an option to disable it or even better just leave it
disabled by default for now with an option to enable it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: unconditional discard calls in the swap code
  2009-10-30  6:51 unconditional discard calls in the swap code Christoph Hellwig
@ 2009-10-30 17:26 ` Hugh Dickins
  2009-11-18 17:12   ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Hugh Dickins @ 2009-10-30 17:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Matthew Wilcox, linux-mm, linux-scsi

Hi Christoph,

(I've added Ccs, hoping for more expertise than we have in linux-mm.)

On Fri, 30 Oct 2009, Christoph Hellwig wrote:
> 
> since 6a6ba83175c029c7820765bae44692266b29e67a the swap code
> unconditionally calls blkdev_issue_discard when swap clusters get freed.
> So far this was harmless because only the mtd driver has discard support
> wired up and it's pretty fast there (entirely done in-kernel).
> 
> We're now adding support for real UNMAP/TRIM support for SCSI arrays and
> SSDs, and so far all the real life ones we've dealt with have too many
> performance issues to just issue the discard requests on the fly.
> Because of that unconditionally enabling this code is a bad idea, it
> really needs an option to disable it or even better just leave it
> disabled by default for now with an option to enable it.

Thanks for the info.

Yes, in practice TRIM seems a huge disappointment: is there a device on
which it is really implemented, and not more trouble than it's worth?

I'd been waiting for OCZ to get a Vertex 1.4* firmware out of Beta
before looking at swap discard again; but even then, the Linux ATA
support is still up in the air, so far as I know.

You don't mention swap's discard of the whole partition (or all
extents of the swapfile) at swapon time: do you believe that usage
is okay to retain?  Is it likely on some devices to take so long,
that I ought to make it asynchronous?

Assuming that initial swap discard is good, I wonder whether just
to revert the discard of swap clusters for now: until such time as
we find devices (other than mtd) that can implement it efficiently.

If we do retain the discard of swap clusters, under something more
than an #if 0, any ideas for what I should make it conditional upon?

Something near /sys/block/sda/queue/rotational (nicely rw these days)
seems appropriate: any chance of a /sys/block/sda/queue/discard_is_useful?
I think I'd prefer that to a new option to swapon.

Or is there a sensible measurement I could make in swapfile.c: for
example, does discard of a range complete faster than write of the
same range?  (But my guess is that those devices we'd want to avoid
discard on, would give erratic answers to any such test; never mind
the noise of what other I/Os are concurrent to the same device.)

Something I should almost certainly revert: at one stage I made the
non-rotational case spread its swapping evenly over the partition,
in case the device's wear-levelling was inadequate (localized).

But now I think it's better to ignore that possibility, and anchor
swapping to the start of the partition just as in the rotational case:
in the rotational case it's done to minimize seeking, in the non-
rotational case it would be to minimize encroaching upon that
initially discarded total extent.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: unconditional discard calls in the swap code
  2009-10-30 17:26 ` Hugh Dickins
@ 2009-11-18 17:12   ` Christoph Hellwig
  2009-11-30 17:22     ` [PATCH] mm: don't discard unused swap slots by default Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2009-11-18 17:12 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Christoph Hellwig, Jens Axboe, Matthew Wilcox, linux-mm, linux-scsi

On Fri, Oct 30, 2009 at 05:26:18PM +0000, Hugh Dickins wrote:
> Yes, in practice TRIM seems a huge disappointment: is there a device on
> which it is really implemented, and not more trouble than it's worth?
> 
> I'd been waiting for OCZ to get a Vertex 1.4* firmware out of Beta
> before looking at swap discard again; but even then, the Linux ATA
> support is still up in the air, so far as I know.

I've tied it up now for libata, and testing with the releases OCZ 1.4
firmware.  Haven't tested anything else yet except for my own
implementations of TRIM and WRITE SAME in qemu which are a lot faster
than real hardware.

> 
> You don't mention swap's discard of the whole partition (or all
> extents of the swapfile) at swapon time: do you believe that usage
> is okay to retain?  Is it likely on some devices to take so long,
> that I ought to make it asynchronous?

The use on swapon seems fine - we've also added support to discard
on mkfs which is generally fast enough - the existing implementations
seem to have mostly constant overhead, the more blocks your discard,
the better.

> Assuming that initial swap discard is good, I wonder whether just
> to revert the discard of swap clusters for now: until such time as
> we find devices (other than mtd) that can implement it efficiently.
> 
> If we do retain the discard of swap clusters, under something more
> than an #if 0, any ideas for what I should make it conditional upon?

add a sysctl / sysfs tunable for it?  For all filesystems we now have
patches pending to require and -o discard option to use it, which will
be quite nessecary for 2.6.33 where all the block layer / scsi layer /
libata support will fall into place.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] mm: don't discard unused swap slots by default
  2009-11-18 17:12   ` Christoph Hellwig
@ 2009-11-30 17:22     ` Christoph Hellwig
  2009-11-30 18:28       ` Hugh Dickins
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2009-11-30 17:22 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Jens Axboe, Matthew Wilcox, linux-mm, linux-scsi

Current TRIM/UNMAP/etc implementation are slow enough that discarding
small chunk during run time is a bad idea.  So only discard the whole
swap space on swapon by default, but require the admin to enable it
for run-time discards using the new vm.discard_swapspace sysctl.


Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h	2009-11-27 11:50:47.319003920 +0100
+++ linux-2.6/include/linux/swap.h	2009-11-27 11:51:55.617286868 +0100
@@ -247,6 +247,7 @@ extern unsigned long mem_cgroup_shrink_n
 extern int __isolate_lru_page(struct page *page, int mode, int file);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
+extern int vm_discard_swapspace;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c	2009-11-27 11:49:02.935254088 +0100
+++ linux-2.6/kernel/sysctl.c	2009-11-27 11:53:10.333006621 +0100
@@ -1163,6 +1163,16 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "discard_swapspace",
+		.data		= &vm_discard_swapspace,
+		.maxlen		= sizeof(vm_discard_swapspace),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
 #ifdef CONFIG_HUGETLB_PAGE
 	 {
 		.procname	= "nr_hugepages",
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c	2009-11-27 11:53:19.449254088 +0100
+++ linux-2.6/mm/swapfile.c	2009-11-27 11:54:07.883255931 +0100
@@ -41,6 +41,7 @@ long nr_swap_pages;
 long total_swap_pages;
 static int swap_overflow;
 static int least_priority;
+int vm_discard_swapspace;
 
 static const char Bad_file[] = "Bad swap file entry ";
 static const char Unused_file[] = "Unused swap file entry ";
@@ -1978,7 +1979,7 @@ SYSCALL_DEFINE2(swapon, const char __use
 			p->flags |= SWP_SOLIDSTATE;
 			p->cluster_next = 1 + (random32() % p->highest_bit);
 		}
-		if (discard_swap(p) == 0)
+		if (discard_swap(p) == 0 && vm_discard_swapspace)
 			p->flags |= SWP_DISCARDABLE;
 	}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm: don't discard unused swap slots by default
  2009-11-30 17:22     ` [PATCH] mm: don't discard unused swap slots by default Christoph Hellwig
@ 2009-11-30 18:28       ` Hugh Dickins
  2009-11-30 18:58         ` Martin K. Petersen
  2009-12-31  0:33         ` Hugh Dickins
  0 siblings, 2 replies; 7+ messages in thread
From: Hugh Dickins @ 2009-11-30 18:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Matthew Wilcox, linux-mm, linux-scsi

On Mon, 30 Nov 2009, Christoph Hellwig wrote:

> Current TRIM/UNMAP/etc implementation are slow enough that discarding
> small chunk during run time is a bad idea.  So only discard the whole
> swap space on swapon by default, but require the admin to enable it
> for run-time discards using the new vm.discard_swapspace sysctl.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Thanks: you having suggested it, I guess it's no coincidence that
this looks a little like what I'm currently experimenting with, on
2.6.32-rc8-mm1 which contains your mods, on Vertex with 1.4 firmware.

There's several variables (not the least my own idiocy), and too soon
for me to say anything definite; but the impression I'm getting from
numbers so far is that (on that SSD anyway) the dubious discards from
SWP_DISCARDABLE are actually beneficial - more so than the initial
discard of the whole partition.

Each SWP_DISCARDABLE discard is of a 1MB range (if 4kB pagesize, and
if swap partition - if swapping to fragmented regular file, they would
often be of less, so indeed less efficient).

Please could you send me, on or offlist, the tests you have which show
them to be worth suppressing?  I do prefer to avoid tunables if we can;
and although this sysctl you suggested is the easiest way, it doesn't
seem the correct way.

Something in /sys/block/<device>/queue/ would be more correct: but
perhaps more trouble than it's worth; and so very specific to swap
(and whatever mm/swapfile.c happens to be doing in any release)
that it wouldn't belong very well there either.

You mentioned an "-o discard" mount option before: so I think what
we ought to be doing is an option to swapon.  But you can imagine
that I'd prefer to avoid that too, if we can work this out without it.

If I could see how bad these SWP_DISCARDABLE discards are for
myself, I might for the moment prefer just to cut out that code,
until we can be more intelligent about it (instead of fixing visible
sysctl/sysfs/swapon options which limit to the current implementation).

Thanks,
Hugh

> 
> Index: linux-2.6/include/linux/swap.h
> ===================================================================
> --- linux-2.6.orig/include/linux/swap.h	2009-11-27 11:50:47.319003920 +0100
> +++ linux-2.6/include/linux/swap.h	2009-11-27 11:51:55.617286868 +0100
> @@ -247,6 +247,7 @@ extern unsigned long mem_cgroup_shrink_n
>  extern int __isolate_lru_page(struct page *page, int mode, int file);
>  extern unsigned long shrink_all_memory(unsigned long nr_pages);
>  extern int vm_swappiness;
> +extern int vm_discard_swapspace;
>  extern int remove_mapping(struct address_space *mapping, struct page *page);
>  extern long vm_total_pages;
>  
> Index: linux-2.6/kernel/sysctl.c
> ===================================================================
> --- linux-2.6.orig/kernel/sysctl.c	2009-11-27 11:49:02.935254088 +0100
> +++ linux-2.6/kernel/sysctl.c	2009-11-27 11:53:10.333006621 +0100
> @@ -1163,6 +1163,16 @@ static struct ctl_table vm_table[] = {
>  		.extra1		= &zero,
>  		.extra2		= &one_hundred,
>  	},
> +	{
> +		.ctl_name	= CTL_UNNUMBERED,
> +		.procname	= "discard_swapspace",
> +		.data		= &vm_discard_swapspace,
> +		.maxlen		= sizeof(vm_discard_swapspace),
> +		.mode		= 0644,
> +		.proc_handler	= &proc_dointvec_minmax,
> +		.extra1		= &zero,
> +		.extra2		= &one,
> +	},
>  #ifdef CONFIG_HUGETLB_PAGE
>  	 {
>  		.procname	= "nr_hugepages",
> Index: linux-2.6/mm/swapfile.c
> ===================================================================
> --- linux-2.6.orig/mm/swapfile.c	2009-11-27 11:53:19.449254088 +0100
> +++ linux-2.6/mm/swapfile.c	2009-11-27 11:54:07.883255931 +0100
> @@ -41,6 +41,7 @@ long nr_swap_pages;
>  long total_swap_pages;
>  static int swap_overflow;
>  static int least_priority;
> +int vm_discard_swapspace;
>  
>  static const char Bad_file[] = "Bad swap file entry ";
>  static const char Unused_file[] = "Unused swap file entry ";
> @@ -1978,7 +1979,7 @@ SYSCALL_DEFINE2(swapon, const char __use
>  			p->flags |= SWP_SOLIDSTATE;
>  			p->cluster_next = 1 + (random32() % p->highest_bit);
>  		}
> -		if (discard_swap(p) == 0)
> +		if (discard_swap(p) == 0 && vm_discard_swapspace)
>  			p->flags |= SWP_DISCARDABLE;
>  	}
>  

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm: don't discard unused swap slots by default
  2009-11-30 18:28       ` Hugh Dickins
@ 2009-11-30 18:58         ` Martin K. Petersen
  2009-12-31  0:33         ` Hugh Dickins
  1 sibling, 0 replies; 7+ messages in thread
From: Martin K. Petersen @ 2009-11-30 18:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Christoph Hellwig, Jens Axboe, Matthew Wilcox, linux-mm, linux-scsi

>>>>> "Hugh" == Hugh Dickins <hugh.dickins@tiscali.co.uk> writes:

Hugh> You mentioned an "-o discard" mount option before: so I think what
Hugh> we ought to be doing is an option to swapon.  But you can imagine
Hugh> that I'd prefer to avoid that too, if we can work this out without
Hugh> it.

The main problem we have is that the devices currently supporting TRIM
are doing a piss poor job at it.

We have pretty good vendor guarantees that discards are going to be
essentially free on SCSI-class hardware.  But in the ATA space things
are currently being driven by early adopters / tweakers that care more
about benchmarketing and feature checklists.  Whether things actually
work as intended is mostly irrelevant.

I think we'll need to give things a little bit of time for decent ATA
TRIM implementations to materialize.  And then we can switch to an
"assume it works, blacklist bad eggs" approach.  Until then I think we
need to make discard an explicit opt-in feature.

-- 
Martin K. Petersen	Oracle Linux Engineering

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm: don't discard unused swap slots by default
  2009-11-30 18:28       ` Hugh Dickins
  2009-11-30 18:58         ` Martin K. Petersen
@ 2009-12-31  0:33         ` Hugh Dickins
  1 sibling, 0 replies; 7+ messages in thread
From: Hugh Dickins @ 2009-12-31  0:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Matthew Wilcox, Martin K. Petersen, linux-mm, linux-scsi

On Mon, 30 Nov 2009, Hugh Dickins wrote:
> On Mon, 30 Nov 2009, Christoph Hellwig wrote:
> 
> > Current TRIM/UNMAP/etc implementation are slow enough that discarding
> > small chunk during run time is a bad idea.  So only discard the whole
> > swap space on swapon by default, but require the admin to enable it
> > for run-time discards using the new vm.discard_swapspace sysctl.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Thanks: you having suggested it, I guess it's no coincidence that
> this looks a little like what I'm currently experimenting with, on
> 2.6.32-rc8-mm1 which contains your mods, on Vertex with 1.4 firmware.

I continued those experiments, on the OCZ Vertex, and then on
the Intel SSDSA2M080G2 which appeared at my door a few weeks ago.

> 
> There's several variables (not the least my own idiocy), and too soon
> for me to say anything definite; but the impression I'm getting from
> numbers so far is that (on that SSD anyway) the dubious discards from
> SWP_DISCARDABLE are actually beneficial - more so than the initial
> discard of the whole partition.

That initial impression was borne out by further testing on the Vertex,
which continued to benefit significantly (but not dramatically) from
SWP_DISCARDABLE discards, more so than from swapon's initial discard.

The Intel showed altogether less benefit from any swap discards, but
more benefit from swapon's initial discard (odd since I'd expect its
effects soon to get wiped out) than from ongoing SWP_DISCARDABLEs.

I didn't observe strong reason to tune out the SWP_DISCARDABLEs;
though it may well be that I'd have done better to invest some time
in looking for other swap speedups, than bother with discard in the
first place.

Martin mentioned "We have pretty good vendor guarantees that discards
are going to be essentially free on SCSI-class hardware": can anyone
suggest other ATA SSDs implementing discard that I could check?

I was surprised by what came through most strongly from this testing,
not an effect of initial or ongoing discards at all.

I mentioned earlier that I intended to remove how SWP_SOLIDSTATE
rotates around the swap area (whereas rotationals anchor to the start
of the area).  I put that in for low-end flash, on which wear-levelling
might be very localized; but I'd come to see that it would be a bad
idea on discard-capable SSDs, since they would tend to appear never
less than 1MB away from full (after the first pass around the area).

So while experimenting with tuning out the initial or ongoing discards,
I also experimented with tuning out the initial randomization and the
continued rotation.

But on both the Vertex and the Intel, the randomization and rotation
actually came through as much more consistently beneficial than the
discards: definitely behaviour to be retained, even though I'm
clueless why.

> 
> Each SWP_DISCARDABLE discard is of a 1MB range (if 4kB pagesize, and
> if swap partition - if swapping to fragmented regular file, they would
> often be of less, so indeed less efficient).
> 
> Please could you send me, on or offlist, the tests you have which show
> them to be worth suppressing?

You didn't respond, so perhaps the problem was a worry rather than
a demonstrated issue.  I can see that discarding filesystems appear
liable to be calling discard even on inappropriately small extents,
which swap avoids doing (unless swapping to a regular file which is
too fragmented to be sensible anyway).

> I do prefer to avoid tunables if we can;
> and although this sysctl you suggested is the easiest way, it doesn't
> seem the correct way.
> 
> Something in /sys/block/<device>/queue/ would be more correct: but
> perhaps more trouble than it's worth; and so very specific to swap
> (and whatever mm/swapfile.c happens to be doing in any release)
> that it wouldn't belong very well there either.
> 
> You mentioned an "-o discard" mount option before: so I think what
> we ought to be doing is an option to swapon.  But you can imagine
> that I'd prefer to avoid that too, if we can work this out without it.
> 
> If I could see how bad these SWP_DISCARDABLE discards are for
> myself, I might for the moment prefer just to cut out that code,
> until we can be more intelligent about it (instead of fixing visible
> sysctl/sysfs/swapon options which limit to the current implementation).

Yes, I still have these reservations, so haven't pushed forward
your patch, nor any such alternative.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2009-12-31  0:33 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-10-30  6:51 unconditional discard calls in the swap code Christoph Hellwig
2009-10-30 17:26 ` Hugh Dickins
2009-11-18 17:12   ` Christoph Hellwig
2009-11-30 17:22     ` [PATCH] mm: don't discard unused swap slots by default Christoph Hellwig
2009-11-30 18:28       ` Hugh Dickins
2009-11-30 18:58         ` Martin K. Petersen
2009-12-31  0:33         ` Hugh Dickins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).