* [PATCH] swap: send callback when swap slot is freed
@ 2009-08-12 14:37 ` Nitin Gupta
  0 siblings, 0 replies; 208+ messages in thread
From: Nitin Gupta @ 2009-08-12 14:37 UTC (permalink / raw)
  To: mingo; +Cc: linux-kernel

Currently, we have a "swap discard" mechanism which sends a discard bio
request when we find a free cluster during scan_swap_map(). This discard
request can come a long time after the swap slots were actually freed.

This delay is a serious problem when (compressed) RAM [1] is used as a swap
device. So, this change adds a callback which is invoked as soon as a swap
slot becomes free. For the above case of swapping over a compressed RAM
device, this is very useful since we can immediately free the memory
allocated for the swap page.

This callback does not replace swap discard support. It is called with
swap_lock held, so it is meant to trigger actions that finish quickly.
Swap discard, on the other hand, is an I/O request and can be used for
longer-running actions.

Links:
[1] http://code.google.com/p/compcache/

Signed-off-by: Nitin Gupta <ngupta@vflare.org>
---

 include/linux/swap.h |    5 +++++
 mm/swapfile.c        |   16 ++++++++++++++++
 2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7c15334..4cbe3c4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -8,6 +8,7 @@
 #include <linux/memcontrol.h>
 #include <linux/sched.h>
 #include <linux/node.h>
+#include <linux/blkdev.h>
 
 #include <asm/atomic.h>
 #include <asm/page.h>
@@ -20,6 +21,8 @@ struct bio;
 #define SWAP_FLAG_PRIO_MASK	0x7fff
 #define SWAP_FLAG_PRIO_SHIFT	0
 
+typedef void (swap_free_notify_fn) (struct block_device *, unsigned long);
+
 static inline int current_is_kswapd(void)
 {
 	return current->flags & PF_KSWAPD;
@@ -155,6 +158,7 @@ struct swap_info_struct {
 	unsigned int max;
 	unsigned int inuse_pages;
 	unsigned int old_block_size;
+	swap_free_notify_fn *swap_free_notify_fn;
 };
 
 struct swap_list_t {
@@ -295,6 +299,7 @@ extern sector_t swapdev_block(int, pgoff_t);
 extern struct swap_info_struct *get_swap_info_struct(unsigned);
 extern int reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
+extern void set_swap_free_notify(unsigned, swap_free_notify_fn *);
 struct backing_dev_info;
 
 /* linux/mm/thrash.c */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8ffdc0d..aa95fc7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -552,6 +552,20 @@ out:
 	return NULL;
 }
 
+/*
+ * Sets callback for event when swap_map[offset] == 0
+ * i.e. page at this swap offset is no longer used.
+ */
+void set_swap_free_notify(unsigned type, swap_free_notify_fn *notify_fn)
+{
+	struct swap_info_struct *sis;
+	sis = get_swap_info_struct(type);
+	BUG_ON(!sis);
+	sis->swap_free_notify_fn = notify_fn;
+	return;
+}
+EXPORT_SYMBOL(set_swap_free_notify);
+
 static int swap_entry_free(struct swap_info_struct *p,
 			   swp_entry_t ent, int cache)
 {
@@ -583,6 +597,8 @@ static int swap_entry_free(struct swap_info_struct *p,
 			swap_list.next = p - swap_info;
 		nr_swap_pages++;
 		p->inuse_pages--;
+		if (p->swap_free_notify_fn)
+			p->swap_free_notify_fn(p->bdev, offset);
 	}
 	if (!swap_count(count))
 		mem_cgroup_uncharge_swap(ent);
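
For illustration, a minimal sketch of how a driver such as ramzswap might
consume this hook; the driver-side names (ramzswap_slot_free, rzs_free_page,
ramzswap_register_notify and the struct layout) are hypothetical and not
part of this patch:

/* Illustrative driver side only -- hypothetical names, not in this patch. */
static void ramzswap_slot_free(struct block_device *bdev, unsigned long offset)
{
	/* Runs under swap_lock: must be fast and must not sleep. */
	struct ramzswap *rzs = bdev->bd_disk->private_data;

	rzs_free_page(rzs, offset);	/* drop the compressed copy */
}

static void ramzswap_register_notify(unsigned type)
{
	/* At swapon time; pass NULL again at swapoff/module unload so
	 * no stale pointer is left behind in swap_info. */
	set_swap_free_notify(type, ramzswap_slot_free);
}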


* Re: [PATCH] swap: send callback when swap slot is freed
  2009-08-12 14:37 ` Nitin Gupta
@ 2009-08-12 18:33 ` Peter Zijlstra
  -1 siblings, 0 replies; 208+ messages in thread
From: Peter Zijlstra @ 2009-08-12 18:33 UTC (permalink / raw)
  To: ngupta; +Cc: mingo, linux-kernel

On Wed, 2009-08-12 at 20:07 +0530, Nitin Gupta wrote:
> Currently, we have "swap discard" mechanism which sends a discard bio request
> when we find a free cluster during scan_swap_map(). This callback can come a
> long time after swap slots are actually freed.
> 
> This delay in callback is a great problem when (compressed) RAM [1] is used
> as a swap device. So, this change adds a callback which is called as
> soon as a swap slot becomes free. For above mentioned case of swapping
> over compressed RAM device, this is very useful since we can immediately
> free memory allocated for this swap page.
> 
> This callback does not replace swap discard support. It is called with
> swap_lock held, so it is meant to trigger action that finishes quickly.
> However, swap discard is an I/O request and can be used for taking longer
> actions.

I'd suggest using a notifier list for this. The interface just begs to
go belly up once there are multiple consumers.

Also, EXPORT_SYMBOL_GPL please ;-)
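
Roughly, the notifier-list shape being suggested could look like the sketch
below -- purely illustrative, assuming an atomic notifier chain so it can
still be called under swap_lock; the swap_free_args payload and function
names are made up for the example:

#include <linux/notifier.h>

struct swap_free_args {			/* illustrative payload */
	struct block_device *bdev;
	unsigned long offset;
};

static ATOMIC_NOTIFIER_HEAD(swap_free_notifier_list);

int register_swap_free_notifier(struct notifier_block *nb)
{
	return atomic_notifier_chain_register(&swap_free_notifier_list, nb);
}
EXPORT_SYMBOL_GPL(register_swap_free_notifier);

int unregister_swap_free_notifier(struct notifier_block *nb)
{
	return atomic_notifier_chain_unregister(&swap_free_notifier_list, nb);
}
EXPORT_SYMBOL_GPL(unregister_swap_free_notifier);

/* Then, in swap_entry_free(), instead of a single function pointer: */
static void notify_swap_slot_free(struct swap_info_struct *p,
				  unsigned long offset)
{
	struct swap_free_args args = { .bdev = p->bdev, .offset = offset };

	atomic_notifier_call_chain(&swap_free_notifier_list, 0, &args);
}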


* Re: [PATCH] swap: send callback when swap slot is freed
  2009-08-12 14:37 ` Nitin Gupta
@ 2009-08-12 22:13 ` Andrew Morton
  -1 siblings, 0 replies; 208+ messages in thread
From: Andrew Morton @ 2009-08-12 22:13 UTC (permalink / raw)
  To: ngupta; +Cc: mingo, linux-kernel, Hugh Dickins

On Wed, 12 Aug 2009 20:07:43 +0530
Nitin Gupta <ngupta@vflare.org> wrote:

> Currently, we have "swap discard" mechanism which sends a discard bio request
> when we find a free cluster during scan_swap_map(). This callback can come a
> long time after swap slots are actually freed.
> 
> This delay in callback is a great problem when (compressed) RAM [1] is used
> as a swap device. So, this change adds a callback which is called as
> soon as a swap slot becomes free. For above mentioned case of swapping
> over compressed RAM device, this is very useful since we can immediately
> free memory allocated for this swap page.
> 
> This callback does not replace swap discard support. It is called with
> swap_lock held, so it is meant to trigger action that finishes quickly.
> However, swap discard is an I/O request and can be used for taking longer
> actions.

It would be better if we could arrange for discard_swap() to be called
synchronously so we don't have to add a second parallel implementation
of a pretty similar thing.

Running the callback under spinlock is regrettable - it rather limits
what that handler can do.


> 
>  include/linux/swap.h |    5 +++++
>  mm/swapfile.c        |   16 ++++++++++++++++
>  2 files changed, 21 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 7c15334..4cbe3c4 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -8,6 +8,7 @@
>  #include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/node.h>
> +#include <linux/blkdev.h>
>  
>  #include <asm/atomic.h>
>  #include <asm/page.h>
> @@ -20,6 +21,8 @@ struct bio;
>  #define SWAP_FLAG_PRIO_MASK	0x7fff
>  #define SWAP_FLAG_PRIO_SHIFT	0
>  
> +typedef void (swap_free_notify_fn) (struct block_device *, unsigned long);
> +
>  static inline int current_is_kswapd(void)
>  {
>  	return current->flags & PF_KSWAPD;
> @@ -155,6 +158,7 @@ struct swap_info_struct {
>  	unsigned int max;
>  	unsigned int inuse_pages;
>  	unsigned int old_block_size;
> +	swap_free_notify_fn *swap_free_notify_fn;
>  };
>  
>  struct swap_list_t {
> @@ -295,6 +299,7 @@ extern sector_t swapdev_block(int, pgoff_t);
>  extern struct swap_info_struct *get_swap_info_struct(unsigned);
>  extern int reuse_swap_page(struct page *);
>  extern int try_to_free_swap(struct page *);
> +extern void set_swap_free_notify(unsigned, swap_free_notify_fn *);
>  struct backing_dev_info;
>  
>  /* linux/mm/thrash.c */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 8ffdc0d..aa95fc7 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -552,6 +552,20 @@ out:
>  	return NULL;
>  }
>  
> +/*
> + * Sets callback for event when swap_map[offset] == 0
> + * i.e. page at this swap offset is no longer used.
> + */
> +void set_swap_free_notify(unsigned type, swap_free_notify_fn *notify_fn)
> +{
> +	struct swap_info_struct *sis;
> +	sis = get_swap_info_struct(type);
> +	BUG_ON(!sis);
> +	sis->swap_free_notify_fn = notify_fn;
> +	return;
> +}
> +EXPORT_SYMBOL(set_swap_free_notify);

hm, well, spose so.  There's no provision here for multiple watchers,
the function is racy, it makes no provision for telling the caller what
the old value was, and the kernel will crash horridly if the module which
implements ->swap_free_notify_fn() gets rmmod'ed and forgets to do
set_swap_free_notify(..., NULL).

But I guess we can live with those things.
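
A slightly more defensive setter along those lines might, for example,
refuse double registration and hand the old value back so a module can
clear it before unloading -- a sketch only, not what the patch does:

swap_free_notify_fn *set_swap_free_notify(unsigned type,
					  swap_free_notify_fn *notify_fn)
{
	struct swap_info_struct *sis = get_swap_info_struct(type);
	swap_free_notify_fn *old;

	BUG_ON(!sis);
	spin_lock(&swap_lock);
	old = sis->swap_free_notify_fn;
	BUG_ON(old && notify_fn);	/* only one watcher at a time */
	sis->swap_free_notify_fn = notify_fn;
	spin_unlock(&swap_lock);
	return old;
}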

>  static int swap_entry_free(struct swap_info_struct *p,
>  			   swp_entry_t ent, int cache)
>  {
> @@ -583,6 +597,8 @@ static int swap_entry_free(struct swap_info_struct *p,
>  			swap_list.next = p - swap_info;
>  		nr_swap_pages++;
>  		p->inuse_pages--;
> +		if (p->swap_free_notify_fn)
> +			p->swap_free_notify_fn(p->bdev, offset);
>  	}
>  	if (!swap_count(count))
>  		mem_cgroup_uncharge_swap(ent);


* Re: [PATCH] swap: send callback when swap slot is freed
  2009-08-12 14:37 ` Nitin Gupta
@ 2009-08-12 22:48   ` Hugh Dickins
  -1 siblings, 0 replies; 208+ messages in thread
From: Hugh Dickins @ 2009-08-12 22:48 UTC (permalink / raw)
  To: Nitin Gupta
  Cc: Matthew Wilcox, Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm

On Wed, 12 Aug 2009, Nitin Gupta wrote:

> Currently, we have "swap discard" mechanism which sends a discard bio request
> when we find a free cluster during scan_swap_map(). This callback can come a
> long time after swap slots are actually freed.
> 
> This delay in callback is a great problem when (compressed) RAM [1] is used
> as a swap device. So, this change adds a callback which is called as
> soon as a swap slot becomes free. For above mentioned case of swapping
> over compressed RAM device, this is very useful since we can immediately
> free memory allocated for this swap page.
> 
> This callback does not replace swap discard support. It is called with
> swap_lock held, so it is meant to trigger action that finishes quickly.
> However, swap discard is an I/O request and can be used for taking longer
> actions.
> 
> Links:
> [1] http://code.google.com/p/compcache/

Please keep this with compcache for the moment (it has no other users).

I don't share Peter's view that it should be using a more general
notifier interface (but I certainly agree with his EXPORT_SYMBOL_GPL).
There had better not be others hooking in here at the same time (a BUG_ON
could check that): in fact I don't even want you hooking in here where
swap_lock is held.  Glancing at compcache, I don't see you violating the
lock hierarchy by that, but it is a worry.

The interface to set the notifier: you currently key it by swap type,
but that would be better done by bdev, wouldn't it?  With a search for
the right slot.  There's nowhere else in ramzswap.c where you rely on
swp_entry_t and page_private(page); let's keep such details out of
compcache.
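
Keying it by bdev could look roughly like the sketch below (illustrative
only; the name, return codes and locking details are made up, based on the
swap_info[]/nr_swapfiles layout in mm/swapfile.c):

/* Sketch: register by block device instead of by swap type. */
int set_swap_free_notify_bdev(struct block_device *bdev,
			      swap_free_notify_fn *notify_fn)
{
	unsigned int i;

	spin_lock(&swap_lock);
	for (i = 0; i < nr_swapfiles; i++) {
		struct swap_info_struct *sis = swap_info + i;

		if ((sis->flags & SWP_USED) && sis->bdev == bdev) {
			sis->swap_free_notify_fn = notify_fn;
			spin_unlock(&swap_lock);
			return 0;
		}
	}
	spin_unlock(&swap_lock);
	return -ENODEV;
}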

But fundamentally, though I can see how this cutdown communication
path is useful to compcache, I'd much rather deal with it by the more
general discard route if we can.  (I'm one of those still puzzled by
the way swap is mixed up with block device in compcache: probably
because I never found time to pay attention when you explained.)

You're right to question the utility of the current swap discard
placement.  That code is almost a year old, written from a position
of great ignorance, yet only now do we appear to be on the threshold
of having an SSD which really supports TRIM (ah, the Linux ATA TRIM
support seems to have gone missing now, but perhaps it's been
waiting for a reality to check against too - Willy?).

I won't be surprised if we find that we need to move swap discard
support much closer to swap_free (though I know from trying before
that it's much messier there): in which case, even if we decided to
keep your hotline to compcache (to avoid allocating bios etc.), it
would be better placed alongside.

Hugh


* Re: [PATCH] swap: send callback when swap slot is freed
  2009-08-12 22:48   ` Hugh Dickins
@ 2009-08-13  2:30     ` Nitin Gupta
  -1 siblings, 0 replies; 208+ messages in thread
From: Nitin Gupta @ 2009-08-13  2:30 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Matthew Wilcox, Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm

On 08/13/2009 04:18 AM, Hugh Dickins wrote:
> On Wed, 12 Aug 2009, Nitin Gupta wrote:
>
>> Currently, we have "swap discard" mechanism which sends a discard bio request
>> when we find a free cluster during scan_swap_map(). This callback can come a
>> long time after swap slots are actually freed.
>>
>> This delay in callback is a great problem when (compressed) RAM [1] is used
>> as a swap device. So, this change adds a callback which is called as
>> soon as a swap slot becomes free. For above mentioned case of swapping
>> over compressed RAM device, this is very useful since we can immediately
>> free memory allocated for this swap page.
>>
>> This callback does not replace swap discard support. It is called with
>> swap_lock held, so it is meant to trigger action that finishes quickly.
>> However, swap discard is an I/O request and can be used for taking longer
>> actions.
>>
>> Links:
>> [1] http://code.google.com/p/compcache/
>
> Please keep this with compcache for the moment (it has no other users).
>
> I don't share Peter's view that it should be using a more general
> notifier interface (but I certainly agree with his EXPORT_SYMBOL_GPL).

Considering that the callback is made under swap_lock, we should not
have an array of callbacks to run. But what if this callback finds other
users too? I think we should leave it in its current state until it finds
more users, and probably add a BUG() to make sure the callback is not
already set.

I will make it EXPORT_SYMBOL_GPL.

> There better not be others hooking in here at the same time (a BUG_ON
> could check that): in fact I don't even want you hooking in here where
> swap_lock is held.  Glancing at compcache, I don't see you violating
> lock hierarchy by that, but it is a worry.
>

I tried an approach that releases swap_lock and 'lazily' makes the
callback, but it turned out to be pretty messy. So, I think just adding
a note that the callback is made under swap_lock is the better option.


> The interface to set the notifier, you currently have it by swap type:
> that would better be by bdev, wouldn't it?  with a search for the right
> slot.  There's nowhere else in ramzswap.c that you rely on swp_entry_t
> and page_private(page), let's keep such details out of compcache.
>

Use of bdev instead of swap_entry_t looks better. I will make this change.


> But fundamentally, though I can see how this cutdown communication
> path is useful to compcache, I'd much rather deal with it by the more
> general discard route if we can.  (I'm one of those still puzzled by
> the way swap is mixed up with block device in compcache: probably
> because I never found time to pay attention when you explained.)
>

I tried this too -- making a discard bio request as soon as a swap slot
becomes free (I can send details if you want). However, I could not get
it to work. Also, allocating a bio just to issue a discard I/O request
looks like a complete artifact in the compcache case.


> You're right to question the utility of the current swap discard
> placement.  That code is almost a year old, written from a position
> of great ignorance, yet only now do we appear to be on the threshold
> of having an SSD which really supports TRIM (ah, the Linux ATA TRIM
> support seems to have gone missing now, but perhaps it's been
> waiting for a reality to check against too - Willy?).
>

> I won't be surprised if we find that we need to move swap discard
> support much closer to swap_free (though I know from trying before
> that it's much messier there): in which case, even if we decided to
> keep your hotline to compcache (to avoid allocating bios etc.), it
> would be better placed alongside.
>

This new callback and discard can actually co-exist: use the callback to
trigger small actions and discard for longer actions. Depending on the
use case, you might need both, or either one of them.


I am not very sure how willing you are to accept this patch, but let me
send another revision incorporating all of your suggestions.


Thanks for looking into this.
Nitin


* Re: [PATCH] swap: send callback when swap slot is freed
  2009-08-12 22:48   ` Hugh Dickins
@ 2009-08-13  2:41     ` Nitin Gupta
  -1 siblings, 0 replies; 208+ messages in thread
From: Nitin Gupta @ 2009-08-13  2:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Matthew Wilcox, Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm

On 08/13/2009 04:18 AM, Hugh Dickins wrote:
> On Wed, 12 Aug 2009, Nitin Gupta wrote:
>
>> Currently, we have "swap discard" mechanism which sends a discard bio request
>> when we find a free cluster during scan_swap_map(). This callback can come a
>> long time after swap slots are actually freed.
>>
>> This delay in callback is a great problem when (compressed) RAM [1] is used
>> as a swap device. So, this change adds a callback which is called as
>> soon as a swap slot becomes free. For above mentioned case of swapping
>> over compressed RAM device, this is very useful since we can immediately
>> free memory allocated for this swap page.
>>
>> This callback does not replace swap discard support. It is called with
>> swap_lock held, so it is meant to trigger action that finishes quickly.
>> However, swap discard is an I/O request and can be used for taking longer
>> actions.
>>
>> Links:
>> [1] http://code.google.com/p/compcache/
>
> Please keep this with compcache for the moment (it has no other users).
>

Oh, I missed this one.

This small patch can be considered a first step toward merging compcache
into mainline :)  Actually, it requires callbacks for swapon and swapoff
too, but those, I think, should be done in separate patches.

BTW, last time compcache was not accepted due to a lack of performance
numbers. Now the project has a lot more data for various cases:
http://code.google.com/p/compcache/wiki/Performance
We still need to collect data for worst-case behaviors and such...


Thanks,
Nitin


* compcache as a pre-swap area (was: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13  2:41     ` Nitin Gupta
@ 2009-08-13  5:05       ` Al Boldi
  -1 siblings, 0 replies; 208+ messages in thread
From: Al Boldi @ 2009-08-13  5:05 UTC (permalink / raw)
  To: ngupta, Hugh Dickins
  Cc: Matthew Wilcox, Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm

Nitin Gupta wrote:
> BTW, last time compcache was not accepted due to lack of performance
> numbers. Now the project has lot more data for various cases:
> http://code.google.com/p/compcache/wiki/Performance
> Still need to collect data for worst-case behaviors and such...

I checked the link, and it looks like you are positioning compcache as a swap 
replacement.  If so, then repositioning it as a compressed pre-swap area 
working together with normal swap-space, if available, may yield a much more 
powerful system.


Thanks!

--
Al



* Re: [PATCH] swap: send callback when swap slot is freed
  2009-08-13  2:30     ` Nitin Gupta
@ 2009-08-13  6:53       ` Peter Zijlstra
  -1 siblings, 0 replies; 208+ messages in thread
From: Peter Zijlstra @ 2009-08-13  6:53 UTC (permalink / raw)
  To: ngupta; +Cc: Hugh Dickins, Matthew Wilcox, Ingo Molnar, linux-kernel, linux-mm

On Thu, 2009-08-13 at 08:00 +0530, Nitin Gupta wrote:
> > I don't share Peter's view that it should be using a more general
> > notifier interface (but I certainly agree with his EXPORT_SYMBOL_GPL).
> 
> Considering that the callback is made under swap_lock, we should not 
> have an array of callbacks to do. But what if this callback finds other 
> users too? I think we should leave it in its current state till it finds 
> more users and probably add BUG() to make sure callback is not already set.
> 
> I will make it EXPORT_SYMBOL_GPL.

If it's such a tightly coupled system, then why is compcache a module?


* Re: [PATCH] swap: send callback when swap slot is freed
  2009-08-13  6:53       ` Peter Zijlstra
@ 2009-08-13 14:44         ` Nitin Gupta
  -1 siblings, 0 replies; 208+ messages in thread
From: Nitin Gupta @ 2009-08-13 14:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Matthew Wilcox, Ingo Molnar, linux-kernel, linux-mm

(resending in plain text)

On 08/13/2009 12:23 PM, Peter Zijlstra wrote:
> On Thu, 2009-08-13 at 08:00 +0530, Nitin Gupta wrote:
>>> I don't share Peter's view that it should be using a more general
>>> notifier interface (but I certainly agree with his EXPORT_SYMBOL_GPL).
>> Considering that the callback is made under swap_lock, we should not
>> have an array of callbacks to do. But what if this callback finds other
>> users too? I think we should leave it in its current state till it finds
>> more users and probably add BUG() to make sure callback is not already set.
>>
>> I will make it EXPORT_SYMBOL_GPL.
>
> If its such a tightly coupled system, then why is compcache a module?
>

Keeping everything as separate kernel modules has been the goal of this
project. However, this callback is the only thing which I could not do
without this small patch.

Nitin


* Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-12 22:48   ` Hugh Dickins
@ 2009-08-13 15:13     ` Matthew Wilcox
  -1 siblings, 0 replies; 208+ messages in thread
From: Matthew Wilcox @ 2009-08-13 15:13 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nitin Gupta, Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm,
	linux-scsi, linux-ide

On Wed, Aug 12, 2009 at 11:48:27PM +0100, Hugh Dickins wrote:
> But fundamentally, though I can see how this cutdown communication
> path is useful to compcache, I'd much rather deal with it by the more
> general discard route if we can.  (I'm one of those still puzzled by
> the way swap is mixed up with block device in compcache: probably
> because I never found time to pay attention when you explained.)
> 
> You're right to question the utility of the current swap discard
> placement.  That code is almost a year old, written from a position
> of great ignorance, yet only now do we appear to be on the threshold
> of having an SSD which really supports TRIM (ah, the Linux ATA TRIM
> support seems to have gone missing now, but perhaps it's been
> waiting for a reality to check against too - Willy?).

I am indeed waiting for hardware with TRIM support to appear on my
desk before resubmitting the TRIM code.  It'd also be nice to be able to
get some performance numbers.

> I won't be surprised if we find that we need to move swap discard
> support much closer to swap_free (though I know from trying before
> that it's much messier there): in which case, even if we decided to
> keep your hotline to compcache (to avoid allocating bios etc.), it
> would be better placed alongside.

It turns out there are a lot of tradeoffs involved with discard, and
they're different between TRIM and UNMAP.

Let's start with UNMAP.  This SCSI command is used by giant arrays.
They want to do Thin Provisioning, so allocate physical storage to virtual
LUNs on demand, and want to deallocate it when they get an UNMAP command.
They allocate storage in large chunks (hundreds of kilobytes at a time).
They only care about discards that enable them to free an entire chunk.
The vast majority of users *do not care* about these arrays, because
they don't have one, and will never be able to afford one.  We should
ignore the desires of these vendors when designing our software.

Solid State Drives are introducing an ATA command called TRIM.  SSDs
generally have an internal mapping layer, and due to their low, low seek
penalty, will happily remap blocks anywhere on the flash.  They want
to know when a block isn't in use any more, so they don't have to copy
it around when they want to erase the chunk of storage that it's on.
The unfortunate thing about the TRIM command is that it's not NCQ, so
all NCQ commands have to finish, then we can send the TRIM command and
wait for it to finish, then we can send NCQ commands again.

So TRIM isn't free, and there's a better way for the drive to find
out that the contents of a block no longer matter -- write some new
data to it.  So if we just swapped a page in, and we're going to swap
something else back out again soon, just write it to the same location
instead of to a fresh location.  You've saved a command, and you've
saved the drive some work, plus you've allowed other users to continue
accessing the drive in the meantime.

I am planning a complete overhaul of the discard work.  Users can send
down discard requests as frequently as they like.  The block layer will
cache them, and invalidate them if writes come through.  Periodically,
the block layer will send down a TRIM or an UNMAP (depending on the
underlying device) and get rid of the blocks that have remained unwanted
in the interim.

Thoughts on that are welcome.
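
To make the accumulate-and-invalidate idea concrete, here is a toy sketch
of the bookkeeping only -- not the proposed block layer implementation; the
periodic flush that would actually issue the TRIM or UNMAP is left out, and
all names are made up for the example:

/* Toy model: remember discarded ranges, forget them if a write overlaps. */
struct pending_discard {
	struct list_head list;
	sector_t start;
	sector_t nr_sects;
};

static LIST_HEAD(pending_discards);
static DEFINE_SPINLOCK(pending_lock);

static void discard_cache_add(sector_t start, sector_t nr_sects)
{
	struct pending_discard *pd = kmalloc(sizeof(*pd), GFP_NOIO);

	if (!pd)
		return;			/* dropping a discard is always safe */
	pd->start = start;
	pd->nr_sects = nr_sects;
	spin_lock(&pending_lock);
	list_add_tail(&pd->list, &pending_discards);
	spin_unlock(&pending_lock);
}

static void discard_cache_invalidate(sector_t start, sector_t nr_sects)
{
	struct pending_discard *pd, *tmp;

	spin_lock(&pending_lock);
	list_for_each_entry_safe(pd, tmp, &pending_discards, list) {
		/* A write makes the overlapping range live data again. */
		if (pd->start < start + nr_sects &&
		    start < pd->start + pd->nr_sects) {
			list_del(&pd->list);
			kfree(pd);
		}
	}
	spin_unlock(&pending_lock);
}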


* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 15:13     ` Matthew Wilcox
@ 2009-08-13 15:17       ` david
  -1 siblings, 0 replies; 208+ messages in thread
From: david @ 2009-08-13 15:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Nitin Gupta, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm, linux-scsi, linux-ide

On Thu, 13 Aug 2009, Matthew Wilcox wrote:

> So TRIM isn't free, and there's a better way for the drive to find
> out that the contents of a block no longer matter -- write some new
> data to it.  So if we just swapped a page in, and we're going to swap
> something else back out again soon, just write it to the same location
> instead of to a fresh location.  You've saved a command, and you've
> saved the drive some work, plus you've allowed other users to continue
> accessing the drive in the meantime.

On the other hand, if you then end up swapping the page you read in back
out again and haven't dirtied it, you now have to actually write it, as
opposed to just throwing it away (knowing that you already have a copy
of it stored on the swap device).

David Lang


* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 15:17       ` david
@ 2009-08-13 15:26         ` Matthew Wilcox
  -1 siblings, 0 replies; 208+ messages in thread
From: Matthew Wilcox @ 2009-08-13 15:26 UTC (permalink / raw)
  To: david
  Cc: Hugh Dickins, Nitin Gupta, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm, linux-scsi, linux-ide

On Thu, Aug 13, 2009 at 08:17:34AM -0700, david@lang.hm wrote:
> On Thu, 13 Aug 2009, Matthew Wilcox wrote:
>
>> So TRIM isn't free, and there's a better way for the drive to find
>> out that the contents of a block no longer matter -- write some new
>> data to it.  So if we just swapped a page in, and we're going to swap
>> something else back out again soon, just write it to the same location
>> instead of to a fresh location.  You've saved a command, and you've
>> saved the drive some work, plus you've allowed other users to continue
>> accessing the drive in the meantime.
>
> on the other hand, if you then end up swapping the page you read in out  
> again and haven't dirtied it, you now have to actually write it as 
> opposed to just throwing it away (knowing that you already have a copy of 
> it stored on the swap device)

This is true, but at the point where you call discard, you will also
have to write it again.  My point was about delaying calls to discard,
rather than moving the point, or changing the circumstances under which
one calls discard.


* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 15:13     ` Matthew Wilcox
@ 2009-08-13 15:43       ` James Bottomley
  -1 siblings, 0 replies; 208+ messages in thread
From: James Bottomley @ 2009-08-13 15:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Nitin Gupta, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm, linux-scsi, linux-ide

On Thu, 2009-08-13 at 08:13 -0700, Matthew Wilcox wrote:
> On Wed, Aug 12, 2009 at 11:48:27PM +0100, Hugh Dickins wrote:
> > But fundamentally, though I can see how this cutdown communication
> > path is useful to compcache, I'd much rather deal with it by the more
> > general discard route if we can.  (I'm one of those still puzzled by
> > the way swap is mixed up with block device in compcache: probably
> > because I never found time to pay attention when you explained.)
> > 
> > You're right to question the utility of the current swap discard
> > placement.  That code is almost a year old, written from a position
> > of great ignorance, yet only now do we appear to be on the threshold
> > of having an SSD which really supports TRIM (ah, the Linux ATA TRIM
> > support seems to have gone missing now, but perhaps it's been
> > waiting for a reality to check against too - Willy?).
> 
> I am indeed waiting for hardware with TRIM support to appear on my
> desk before resubmitting the TRIM code.  It'd also be nice to be able to
> get some performance numbers.
> 
> > I won't be surprised if we find that we need to move swap discard
> > support much closer to swap_free (though I know from trying before
> > that it's much messier there): in which case, even if we decided to
> > keep your hotline to compcache (to avoid allocating bios etc.), it
> > would be better placed alongside.
> 
> It turns out there are a lot of tradeoffs involved with discard, and
> they're different between TRIM and UNMAP.
> 
> Let's start with UNMAP.  This SCSI command is used by giant arrays.
> They want to do Thin Provisioning, so allocate physical storage to virtual
> LUNs on demand, and want to deallocate it when they get an UNMAP command.
> They allocate storage in large chunks (hundreds of kilobytes at a time).
> They only care about discards that enable them to free an entire chunk.
> The vast majority of users *do not care* about these arrays, because
> they don't have one, and will never be able to afford one.  We should
> ignore the desires of these vendors when designing our software.

Fundamentally, unmap, trim and write_same do similar things, so
realistically they all map to discard in Linux.

Ignoring the desires of the enterprise isn't an option, since they are a
good base for us.  However, they really do need to step up with a useful
patch set for discussion that does what they want, so in the interim I'm
happy with any proposal that doesn't actively damage what the enterprise
wants to do with trim/write_same.

> Solid State Drives are introducing an ATA command called TRIM.  SSDs
> generally have an intenal mapping layer, and due to their low, low seek
> penalty, will happily remap blocks anywhere on the flash.  They want
> to know when a block isn't in use any more, so they don't have to copy
> it around when they want to erase the chunk of storage that it's on.
> The unfortunate thing about the TRIM command is that it's not NCQ, so
> all NCQ commands have to finish, then we can send the TRIM command and
> wait for it to finish, then we can send NCQ commands again.

That's a bit of a silly protocol oversight ... I assume there's no way
it can be corrected?

> So TRIM isn't free, and there's a better way for the drive to find
> out that the contents of a block no longer matter -- write some new
> data to it.  So if we just swapped a page in, and we're going to swap
> something else back out again soon, just write it to the same location
> instead of to a fresh location.  You've saved a command, and you've
> saved the drive some work, plus you've allowed other users to continue
> accessing the drive in the meantime.
> 
> I am planning a complete overhaul of the discard work.  Users can send
> down discard requests as frequently as they like.  The block layer will
> cache them, and invalidate them if writes come through.  Periodically,
> the block layer will send down a TRIM or an UNMAP (depending on the
> underlying device) and get rid of the blocks that have remained unwanted
> in the interim.
> 
> Thoughts on that are welcome.

What you're basically planning is discard accumulation ... it's
certainly closer to what the enterprise is looking for, so no objections
from me.
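
As a rough illustration of the accumulation idea, here is a user-space
toy (every name below is invented; the real thing would live in the
block layer and would also need locking, merging of adjacent ranges
and a bound on how long requests may sit cached):

#include <stdio.h>
#include <stdbool.h>

#define MAX_PENDING 64

struct pending_discard {
	unsigned long long start;	/* first sector */
	unsigned long long len;		/* number of sectors */
	bool valid;
};

static struct pending_discard cache[MAX_PENDING];

/* record a discard instead of sending it immediately */
static void discard_cache_add(unsigned long long start, unsigned long long len)
{
	for (int i = 0; i < MAX_PENDING; i++) {
		if (!cache[i].valid) {
			cache[i] = (struct pending_discard){ start, len, true };
			return;
		}
	}
	/* cache full: a real implementation might flush here */
}

/* a write invalidates any cached discard it overlaps */
static void discard_cache_write(unsigned long long start, unsigned long long len)
{
	for (int i = 0; i < MAX_PENDING; i++)
		if (cache[i].valid &&
		    start < cache[i].start + cache[i].len &&
		    cache[i].start < start + len)
			cache[i].valid = false;
}

/* periodic flush: issue whatever survived as one TRIM/UNMAP pass */
static void discard_cache_flush(void)
{
	for (int i = 0; i < MAX_PENDING; i++)
		if (cache[i].valid) {
			printf("discard sectors %llu..%llu\n", cache[i].start,
			       cache[i].start + cache[i].len - 1);
			cache[i].valid = false;
		}
}

int main(void)
{
	discard_cache_add(1000, 8);	/* e.g. a freed swap slot */
	discard_cache_add(2048, 256);	/* e.g. a deleted file extent */
	discard_cache_write(2100, 16);	/* block reused before the flush */
	discard_cache_flush();		/* only 1000..1007 is still issued */
	return 0;
}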

James

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 15:13     ` Matthew Wilcox
@ 2009-08-13 16:13       ` Nitin Gupta
  -1 siblings, 0 replies; 208+ messages in thread
From: Nitin Gupta @ 2009-08-13 16:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Ingo Molnar, Peter Zijlstra, linux-kernel,
	linux-mm, linux-scsi, linux-ide

On 08/13/2009 08:43 PM, Matthew Wilcox wrote:

>
> I am planning a complete overhaul of the discard work.  Users can send
> down discard requests as frequently as they like.  The block layer will
> cache them, and invalidate them if writes come through.  Periodically,
> the block layer will send down a TRIM or an UNMAP (depending on the
> underlying device) and get rid of the blocks that have remained unwanted
> in the interim.
>

This batching of discard requests is still sub-optimal for compcache.
The optimal solution in this case is to get a callback *as soon as* a
swap slot becomes free, and that is exactly what this patch does.
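
For illustration, here is a user-space toy of what the immediate
notification buys an in-memory compressed swap device (all names and
structures below are invented for this sketch; this is not the actual
ramzswap code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_SLOTS 16

static void *compressed[NR_SLOTS];	/* per-slot compressed data */
static size_t clen[NR_SLOTS];

/* "swap write": keep a (pretend) compressed copy of the page */
static void store_slot(unsigned long offset, const void *data, size_t len)
{
	free(compressed[offset]);
	compressed[offset] = malloc(len);
	memcpy(compressed[offset], data, len);
	clen[offset] = len;
}

/*
 * Called the moment the swap slot is freed.  Without such a hook the
 * compressed copy sits around as stale data until the slot happens to
 * be overwritten by some later swap-out.
 */
static void slot_free_notify(unsigned long offset)
{
	free(compressed[offset]);
	compressed[offset] = NULL;
	clen[offset] = 0;
}

int main(void)
{
	char page[64] = "pretend this is a compressed page";

	store_slot(3, page, sizeof(page));
	printf("slot 3 holds %zu bytes\n", clen[3]);

	slot_free_notify(3);	/* slot freed -> memory given back at once */
	printf("slot 3 holds %zu bytes\n", clen[3]);
	return 0;
}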

I see that it will be difficult to accept this patch since compcache
seems to be the only user for now. However, this little addition makes a
*big* difference for the project: currently, a lot of memory is wasted
storing stale data.

I will be posting compcache patches for review in the next merge window.
So, maybe this patch can be included now as the first step? A revised
patch which addresses the issues raised during the first review is
ready -- I will post it soon.

Thanks,
Nitin

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 15:13     ` Matthew Wilcox
@ 2009-08-13 16:26       ` Markus Trippelsdorf
  -1 siblings, 0 replies; 208+ messages in thread
From: Markus Trippelsdorf @ 2009-08-13 16:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Nitin Gupta, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm, linux-scsi, linux-ide

On Thu, Aug 13, 2009 at 08:13:12AM -0700, Matthew Wilcox wrote:
> On Wed, Aug 12, 2009 at 11:48:27PM +0100, Hugh Dickins wrote:
> > But fundamentally, though I can see how this cutdown communication
> > path is useful to compcache, I'd much rather deal with it by the more
> > general discard route if we can.  (I'm one of those still puzzled by
> > the way swap is mixed up with block device in compcache: probably
> > because I never found time to pay attention when you explained.)
> > 
> > You're right to question the utility of the current swap discard
> > placement.  That code is almost a year old, written from a position
> > of great ignorance, yet only now do we appear to be on the threshold
> > of having an SSD which really supports TRIM (ah, the Linux ATA TRIM
> > support seems to have gone missing now, but perhaps it's been
> > waiting for a reality to check against too - Willy?).
> 
> I am indeed waiting for hardware with TRIM support to appear on my
> desk before resubmitting the TRIM code.  It'd also be nice to be able to
> get some performance numbers.
> 

OCZ just released a new firmware with full TRIM support for their Vertex
SSDs. 

> > I won't be surprised if we find that we need to move swap discard
> > support much closer to swap_free (though I know from trying before
> > that it's much messier there): in which case, even if we decided to
> > keep your hotline to compcache (to avoid allocating bios etc.), it
> > would be better placed alongside.
> 
> 
> Solid State Drives are introducing an ATA command called TRIM.  SSDs
> generally have an internal mapping layer, and due to their low, low seek
> penalty, will happily remap blocks anywhere on the flash.  They want
> to know when a block isn't in use any more, so they don't have to copy
> it around when they want to erase the chunk of storage that it's on.
> The unfortunate thing about the TRIM command is that it's not NCQ, so
> all NCQ commands have to finish, then we can send the TRIM command and
> wait for it to finish, then we can send NCQ commands again.
> 
> So TRIM isn't free, and there's a better way for the drive to find
> out that the contents of a block no longer matter -- write some new
> data to it.  So if we just swapped a page in, and we're going to swap
> something else back out again soon, just write it to the same location
> instead of to a fresh location.  You've saved a command, and you've
> saved the drive some work, plus you've allowed other users to continue
> accessing the drive in the meantime.
> 
> I am planning a complete overhaul of the discard work.  Users can send
> down discard requests as frequently as they like.  The block layer will
> cache them, and invalidate them if writes come through.  Periodically,
> the block layer will send down a TRIM or an UNMAP (depending on the
> underlying device) and get rid of the blocks that have remained unwanted
> in the interim.

That is a very good idea. I tested your original TRIM implementation on
my Vertex yesterday and it was awful ;-). The SSD needs hundreds of
milliseconds to digest a single TRIM command, and since your
implementation sends a TRIM for each extent of each deleted file, the
whole system becomes unusable after a short while.
An optimal solution would be to consolidate the discard requests, bundle
them and send them to the drive as infrequently as possible.

-- 
Markus

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 16:26       ` Markus Trippelsdorf
@ 2009-08-13 16:33         ` david
  -1 siblings, 0 replies; 208+ messages in thread
From: david @ 2009-08-13 16:33 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Matthew Wilcox, Hugh Dickins, Nitin Gupta, Ingo Molnar,
	Peter Zijlstra, linux-kernel, linux-mm, linux-scsi, linux-ide

On Thu, 13 Aug 2009, Markus Trippelsdorf wrote:

> On Thu, Aug 13, 2009 at 08:13:12AM -0700, Matthew Wilcox wrote:
>> I am planning a complete overhaul of the discard work.  Users can send
>> down discard requests as frequently as they like.  The block layer will
>> cache them, and invalidate them if writes come through.  Periodically,
>> the block layer will send down a TRIM or an UNMAP (depending on the
>> underlying device) and get rid of the blocks that have remained unwanted
>> in the interim.
>
> That is a very good idea. I've tested your original TRIM implementation on
> my Vertex yesterday and it was awful ;-). The SSD needs hundreds of
> milliseconds to digest a single TRIM command. And since your implementation
> sends a TRIM for each extent of each deleted file, the whole system is
> unusable after a short while.
> An optimal solution would be to consolidate the discard requests, bundle
> them and send them to the drive as infrequent as possible.

or queue them up and send them when the drive is idle (you would need to
keep track to make sure the space isn't re-used in the meantime)

as an example, at any point where you would consider spinning down a
drive, you don't hurt performance by sending the accumulated trim
commands.

David Lang

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 15:13     ` Matthew Wilcox
@ 2009-08-13 17:19       ` Hugh Dickins
  -1 siblings, 0 replies; 208+ messages in thread
From: Hugh Dickins @ 2009-08-13 17:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nitin Gupta, Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm,
	linux-scsi, linux-ide

On Thu, 13 Aug 2009, Matthew Wilcox wrote:
> 
> So TRIM isn't free, and there's a better way for the drive to find
> out that the contents of a block no longer matter -- write some new
> data to it.  So if we just swapped a page in, and we're going to swap
> something else back out again soon, just write it to the same location
> instead of to a fresh location.  You've saved a command, and you've
> saved the drive some work, plus you've allowed other users to continue
> accessing the drive in the meantime.
> 
> I am planning a complete overhaul of the discard work.  Users can send
> down discard requests as frequently as they like.  The block layer will
> cache them, and invalidate them if writes come through.  Periodically,
> the block layer will send down a TRIM or an UNMAP (depending on the
> underlying device) and get rid of the blocks that have remained unwanted
> in the interim.

Very interesting report, thanks a lot for it.  Certainly your
good point about writes should dictate some change at the swap end.

I have assumed all along (even from just a block layer perspective)
that discard would entail more overhead than I really want, just to
say "forget about it": I never expected that discarding a page at a
time would be a sensible way to proceed.

So at present swap tends to be discarding a 1MB range at a time.
And even if we have to move the point of discard much closer to
freeing swap, it would still be trying for such amounts - when
a process is exiting, even given the accumulation you propose,
I would not want to be trying to allocate lots of bios to pass
the info down to you.

So it looks as if we'd be duplicating work.
And won't filesystems be discarding extents too?

Hugh

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: compcache as a pre-swap area (was: [PATCH] swap: send callback  when swap slot is freed)
  2009-08-13  5:05       ` Al Boldi
@ 2009-08-13 17:31         ` Nitin Gupta
  -1 siblings, 0 replies; 208+ messages in thread
From: Nitin Gupta @ 2009-08-13 17:31 UTC (permalink / raw)
  To: Al Boldi
  Cc: Hugh Dickins, Matthew Wilcox, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm

On Thu, Aug 13, 2009 at 10:35 AM, Al Boldi<a1426z@gawab.com> wrote:
> Nitin Gupta wrote:
>> BTW, last time compcache was not accepted due to lack of performance
>> numbers. Now the project has lot more data for various cases:
>> http://code.google.com/p/compcache/wiki/Performance
>> Still need to collect data for worst-case behaviors and such...
>
> I checked the link, and it looks like you are positioning compcache as a swap
> replacement.  If so, then repositioning it as a compressed pre-swap area
> working together with normal swap-space, if available, may yield a much more
> powerful system.
>
>

compcache is not really a swap replacement. It's just another swap
device that compresses data and stores it in memory itself. You can
have disk-based swaps along with ramzswap (the name of the block
device).

Nitin

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: [PATCH] swap: send callback when swap slot is freed
  2009-08-13  2:30     ` Nitin Gupta
@ 2009-08-13 17:45       ` Hugh Dickins
  -1 siblings, 0 replies; 208+ messages in thread
From: Hugh Dickins @ 2009-08-13 17:45 UTC (permalink / raw)
  To: Nitin Gupta
  Cc: Matthew Wilcox, Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm

On Thu, 13 Aug 2009, Nitin Gupta wrote:
> On 08/13/2009 04:18 AM, Hugh Dickins wrote:
> 
> > But fundamentally, though I can see how this cutdown communication
> > path is useful to compcache, I'd much rather deal with it by the more
> > general discard route if we can.
> 
> I tried this too -- make discard bio request as soon as a swap slot becomes
> free (I can send details if you want). However, I could not get it to work.

I'll send you an updated version of what I experimented with eight
months ago: but changes in the swap_map count handling since then
mean that it might need some subtle adjustments - I'll need to go
over it carefully and retest before sending it to you.

(But that won't be a waste of my time: I shall soon need to try
that experiment again myself, and I do need to examine those
intervening swap_map count changes more closely.)

> Also, allocating bio to issue discard I/O request looks like a complete
> artifact in compcache case.

Yes, I do understand that feeling.

Hugh

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 15:13     ` Matthew Wilcox
@ 2009-08-13 18:08       ` Douglas Gilbert
  -1 siblings, 0 replies; 208+ messages in thread
From: Douglas Gilbert @ 2009-08-13 18:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Nitin Gupta, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm, linux-scsi, linux-ide

Matthew Wilcox wrote:
> On Wed, Aug 12, 2009 at 11:48:27PM +0100, Hugh Dickins wrote:
>> But fundamentally, though I can see how this cutdown communication
>> path is useful to compcache, I'd much rather deal with it by the more
>> general discard route if we can.  (I'm one of those still puzzled by
>> the way swap is mixed up with block device in compcache: probably
>> because I never found time to pay attention when you explained.)
>>
>> You're right to question the utility of the current swap discard
>> placement.  That code is almost a year old, written from a position
>> of great ignorance, yet only now do we appear to be on the threshold
>> of having an SSD which really supports TRIM (ah, the Linux ATA TRIM
>> support seems to have gone missing now, but perhaps it's been
>> waiting for a reality to check against too - Willy?).
> 
> I am indeed waiting for hardware with TRIM support to appear on my
> desk before resubmitting the TRIM code.  It'd also be nice to be able to
> get some performance numbers.
> 
>> I won't be surprised if we find that we need to move swap discard
>> support much closer to swap_free (though I know from trying before
>> that it's much messier there): in which case, even if we decided to
>> keep your hotline to compcache (to avoid allocating bios etc.), it
>> would be better placed alongside.
> 
> It turns out there are a lot of tradeoffs involved with discard, and
> they're different between TRIM and UNMAP.
> 
> Let's start with UNMAP.  This SCSI command is used by giant arrays.
> They want to do Thin Provisioning, so allocate physical storage to virtual
> LUNs on demand, and want to deallocate it when they get an UNMAP command.
> They allocate storage in large chunks (hundreds of kilobytes at a time).
> They only care about discards that enable them to free an entire chunk.
> The vast majority of users *do not care* about these arrays, because
> they don't have one, and will never be able to afford one.  We should
> ignore the desires of these vendors when designing our software.

Yes, the SCSI UNMAP command has high end uses with
maximum and optimal values specified in the Block Limits
VPD page. There is nothing stopping the SCSI UNMAP command
trimming a single logical block.

Being pedantic again, there is no ATA TRIM command; there is a
DATA SET MANAGEMENT command with a "Trim" bit, a count field
(2 bytes, permitting up to 65536 512-byte blocks to be trimmed)
and an LBA field which is reserved (?? d2015r1a.pdf).
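
As a sketch of how ranges might be packed into the Trim data payload --
assuming 8-byte entries holding a 48-bit LBA plus a 16-bit sector
count, which is my reading of the draft and should be checked against
the final spec:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* pack one (lba, nsectors) range into an 8-byte entry */
static void pack_trim_entry(uint8_t entry[8], uint64_t lba, uint16_t nsectors)
{
	int i;

	for (i = 0; i < 6; i++)			/* 48-bit LBA, little-endian */
		entry[i] = (lba >> (8 * i)) & 0xff;
	entry[6] = nsectors & 0xff;		/* 16-bit range length */
	entry[7] = (nsectors >> 8) & 0xff;
}

int main(void)
{
	uint8_t block[512];			/* one payload data block */
	int i;

	memset(block, 0, sizeof(block));	/* unused entries stay zero */
	pack_trim_entry(&block[0], 123456, 2048);
	pack_trim_entry(&block[8], 987654, 8);

	printf("first entry:");
	for (i = 0; i < 8; i++)
		printf(" %02x", block[i]);
	printf("\n");
	return 0;
}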

As I noted a week ago we will need to revisit the SATL code
in libata if SAT-2 in its current form gets approved. Discard
support may be another reason to visit that SATL code. Since
we have many SATA devices being viewed as "SCSI" due to
libata's SATL, mapping the SCSI UNMAP command ** to one or more
ATA DATA SET MANAGEMENT commands may be helpful. IMO that
would be simpler than upper layers needing to worry about
using the SCSI ATA PASS-THROUGH commands to get Trim
functionality.


** and the SCSI WRITE SAME commands with their Unmap bits set

Doug Gilbert

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 16:33         ` david
  (?)
@ 2009-08-13 18:15           ` Greg Freemyer
  -1 siblings, 0 replies; 208+ messages in thread
From: Greg Freemyer @ 2009-08-13 18:15 UTC (permalink / raw)
  To: david
  Cc: Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Thu, Aug 13, 2009 at 12:33 PM, <david@lang.hm> wrote:
> On Thu, 13 Aug 2009, Markus Trippelsdorf wrote:
>
>> On Thu, Aug 13, 2009 at 08:13:12AM -0700, Matthew Wilcox wrote:
>>>
>>> I am planning a complete overhaul of the discard work.  Users can send
>>> down discard requests as frequently as they like.  The block layer will
>>> cache them, and invalidate them if writes come through.  Periodically,
>>> the block layer will send down a TRIM or an UNMAP (depending on the
>>> underlying device) and get rid of the blocks that have remained unwanted
>>> in the interim.
>>
>> That is a very good idea. I've tested your original TRIM implementation on
>> my Vertex yesterday and it was awful ;-). The SSD needs hundreds of
>> milliseconds to digest a single TRIM command. And since your
>> implementation
>> sends a TRIM for each extent of each deleted file, the whole system is
>> unusable after a short while.
>> An optimal solution would be to consolidate the discard requests, bundle
>> them and send them to the drive as infrequent as possible.
>
> or queue them up and send them when the drive is idle (you would need to
> keep track to make sure the space isn't re-used)
>
> as an example, if you would consider spinning down a drive you don't hurt
> performance by sending accumulated trim commands.
>
> David Lang

An alternate approach is for the block layer to maintain its own bitmap
of used/unused sectors or blocks. Unmap commands from the filesystem
just cause the bitmap to be updated; they have no other effect.

(Big unknown: where will the bitmap live between reboots?  Require DM
volumes so we can have a dedicated bitmap volume in the mix to store
the bitmap? Maybe on mount the filesystem has to be scanned to
initially populate the bitmap?  Other options?)

Assuming we have a persistent bitmap in place, have a background
scanner that kicks in when the cpu / disk is idle.  It just
continuously scans the bitmap looking for contiguous blocks of unused
sectors.  Each time it finds one, it sends the largest possible unmap
down the block stack and eventually to the device.
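
A minimal user-space sketch of that scan loop (the bitmap and all names
here are invented; the real state would of course live in the block
layer):

#include <stdio.h>
#include <stdbool.h>

#define NR_BLOCKS 32

/* true = in use, false = free and eligible for unmap/trim */
static bool used[NR_BLOCKS] = {
	[0] = true, [1] = true, [5] = true, [6] = true,
	[20] = true, [21] = true, [22] = true,
};

/* stand-in for sending an UNMAP/TRIM down the block stack */
static void issue_discard(unsigned long start, unsigned long count)
{
	printf("discard blocks %lu..%lu\n", start, start + count - 1);
}

static void scan_and_discard(void)
{
	unsigned long i = 0, start;

	while (i < NR_BLOCKS) {
		if (used[i]) {
			i++;
			continue;
		}
		start = i;			/* start of a free run */
		while (i < NR_BLOCKS && !used[i])
			i++;			/* extend it as far as possible */
		issue_discard(start, i - start);
	}
}

int main(void)
{
	scan_and_discard();	/* run only while the disk is otherwise idle */
	return 0;
}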

When normal cpu / disk activity kicks in, this process goes to sleep.

That way much of the smarts are concentrated in the block layer, not
in the filesystem code.  And it is being done when the disk is
otherwise idle, so you don't have the ncq interference.

Even laptop users should have enough idle cpu available to manage
this.  Enterprise would get the large discards it wants, and
unmentioned in the previous discussion, mdraid gets the large discards
it also wants.

i.e. if an mdraid raid5/raid6 volume is built of SSDs, it will only be
able to discard a full stripe at a time; otherwise the P = D1 ^ D2
parity logic is lost.
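
For example, an incoming discard would have to be clipped to whole
stripes before mdraid could pass it through -- a rough sketch with
made-up geometry:

#include <stdio.h>

int main(void)
{
	unsigned long chunk = 128;		/* 64 KiB chunks, in sectors */
	unsigned long data_disks = 3;		/* 4-disk raid5 */
	unsigned long stripe = chunk * data_disks;

	unsigned long start = 1000, len = 2000;	/* incoming discard */

	/* round the start up and the end down to stripe boundaries */
	unsigned long first = (start + stripe - 1) / stripe * stripe;
	unsigned long last = (start + len) / stripe * stripe;

	if (last > first)
		printf("can discard sectors %lu..%lu (whole stripes only)\n",
		       first, last - 1);
	else
		printf("range does not cover a whole stripe; skip it\n");
	return 0;
}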

Another benefit of the above is that the code should be extremely safe
and testable.

Greg
-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
Preservation and Forensic processing of Exchange Repositories White Paper -
<http://www.norcrossgroup.com/forms/whitepapers/tng_whitepaper_fpe.html>

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 15:43       ` James Bottomley
@ 2009-08-13 18:22         ` Ric Wheeler
  -1 siblings, 0 replies; 208+ messages in thread
From: Ric Wheeler @ 2009-08-13 18:22 UTC (permalink / raw)
  To: James Bottomley
  Cc: Matthew Wilcox, Hugh Dickins, Nitin Gupta, Ingo Molnar,
	Peter Zijlstra, linux-kernel, linux-mm, linux-scsi, linux-ide

On 08/13/2009 11:43 AM, James Bottomley wrote:
> On Thu, 2009-08-13 at 08:13 -0700, Matthew Wilcox wrote:
>    
>> On Wed, Aug 12, 2009 at 11:48:27PM +0100, Hugh Dickins wrote:
>>      
>>> But fundamentally, though I can see how this cutdown communication
>>> path is useful to compcache, I'd much rather deal with it by the more
>>> general discard route if we can.  (I'm one of those still puzzled by
>>> the way swap is mixed up with block device in compcache: probably
>>> because I never found time to pay attention when you explained.)
>>>
>>> You're right to question the utility of the current swap discard
>>> placement.  That code is almost a year old, written from a position
>>> of great ignorance, yet only now do we appear to be on the threshold
>>> of having an SSD which really supports TRIM (ah, the Linux ATA TRIM
>>> support seems to have gone missing now, but perhaps it's been
>>> waiting for a reality to check against too - Willy?).
>>>        
>> I am indeed waiting for hardware with TRIM support to appear on my
>> desk before resubmitting the TRIM code.  It'd also be nice to be able to
>> get some performance numbers.
>>
>>      
>>> I won't be surprised if we find that we need to move swap discard
>>> support much closer to swap_free (though I know from trying before
>>> that it's much messier there): in which case, even if we decided to
>>> keep your hotline to compcache (to avoid allocating bios etc.), it
>>> would be better placed alongside.
>>>        
>> It turns out there are a lot of tradeoffs involved with discard, and
>> they're different between TRIM and UNMAP.
>>
>> Let's start with UNMAP.  This SCSI command is used by giant arrays.
>> They want to do Thin Provisioning, so allocate physical storage to virtual
>> LUNs on demand, and want to deallocate it when they get an UNMAP command.
>> They allocate storage in large chunks (hundreds of kilobytes at a time).
>> They only care about discards that enable them to free an entire chunk.
>> The vast majority of users *do not care* about these arrays, because
>> they don't have one, and will never be able to afford one.  We should
>> ignore the desires of these vendors when designing our software.
>>      
>
> Fundamentally, unmap, trim and write_same do similar things, so
> realistically they all map to discard in linux.
>
> Ignoring the desires of the enterprise isn't an option, since they are a
> good base for us.  However, they really do need to step up with a useful
> patch set for discussion that does what they want, so in the interim I'm
> happy with any proposal that doesn't actively damage what the enterprise
> wants to do with trim/write_same.
>    

I definitely agree - the UNMAP support and the needs of array users are
a critical part of the solution.

I would also dispute the contention that this is irrelevant to most
users - even those of us who don't personally use arrays almost always
use them indirectly, since major banks, airlines, etc. all use them to
store our data :-)

>    
>> Solid State Drives are introducing an ATA command called TRIM.  SSDs
>> generally have an internal mapping layer, and due to their low, low seek
>> penalty, will happily remap blocks anywhere on the flash.  They want
>> to know when a block isn't in use any more, so they don't have to copy
>> it around when they want to erase the chunk of storage that it's on.
>> The unfortunate thing about the TRIM command is that it's not NCQ, so
>> all NCQ commands have to finish, then we can send the TRIM command and
>> wait for it to finish, then we can send NCQ commands again.
>>      
>
> That's a bit of a silly protocol oversight ... I assume there's no way
> it can be corrected?
>
>    
>> So TRIM isn't free, and there's a better way for the drive to find
>> out that the contents of a block no longer matter -- write some new
>> data to it.  So if we just swapped a page in, and we're going to swap
>> something else back out again soon, just write it to the same location
>> instead of to a fresh location.  You've saved a command, and you've
>> saved the drive some work, plus you've allowed other users to continue
>> accessing the drive in the meantime.
>>
>> I am planning a complete overhaul of the discard work.  Users can send
>> down discard requests as frequently as they like.  The block layer will
>> cache them, and invalidate them if writes come through.  Periodically,
>> the block layer will send down a TRIM or an UNMAP (depending on the
>> underlying device) and get rid of the blocks that have remained unwanted
>> in the interim.
>>
>> Thoughts on that are welcome.
>>      
>
> What you're basically planning is discard accumulation ... it's
> certainly closer to what the enterprise is looking for, so no objections
> from me.
>
> James
>
>    

This sounds like a good approach to me as well. I think that both the
TRIM and UNMAP use cases will benefit from coalescing these discard
requests.

Ric


^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 18:15           ` Greg Freemyer
@ 2009-08-13 19:18             ` James Bottomley
  -1 siblings, 0 replies; 208+ messages in thread
From: James Bottomley @ 2009-08-13 19:18 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: david, Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins,
	Nitin Gupta, Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm,
	linux-scsi, linux-ide, Linux RAID

On Thu, 2009-08-13 at 14:15 -0400, Greg Freemyer wrote:
> On Thu, Aug 13, 2009 at 12:33 PM, <david@lang.hm> wrote:
> > On Thu, 13 Aug 2009, Markus Trippelsdorf wrote:
> >
> >> On Thu, Aug 13, 2009 at 08:13:12AM -0700, Matthew Wilcox wrote:
> >>>
> >>> I am planning a complete overhaul of the discard work.  Users can send
> >>> down discard requests as frequently as they like.  The block layer will
> >>> cache them, and invalidate them if writes come through.  Periodically,
> >>> the block layer will send down a TRIM or an UNMAP (depending on the
> >>> underlying device) and get rid of the blocks that have remained unwanted
> >>> in the interim.
> >>
> >> That is a very good idea. I've tested your original TRIM implementation on
> >> my Vertex yesterday and it was awful ;-). The SSD needs hundreds of
> >> milliseconds to digest a single TRIM command. And since your
> >> implementation
> >> sends a TRIM for each extent of each deleted file, the whole system is
> >> unusable after a short while.
> >> An optimal solution would be to consolidate the discard requests, bundle
> >> them and send them to the drive as infrequent as possible.
> >
> > or queue them up and send them when the drive is idle (you would need to
> > keep track to make sure the space isn't re-used)
> >
> > as an example, if you would consider spinning down a drive you don't hurt
> > performance by sending accumulated trim commands.
> >
> > David Lang
> 
> An alternate approach is the block layer maintain its own bitmap of
> used unused sectors / blocks. Unmap commands from the filesystem just
> cause the bitmap to be updated.  No other effect.
> 
> (Big unknown: Where will the bitmap live between reboots?  Require DM
> volumes so we can have a dedicated bitmap volume in the mix to store
> the bitmap to? Maybe on mount, the filesystem has to be scanned to
> initially populate the bitmap?   Other options?)

I wouldn't really have it live anywhere.  Discard is best effort; it's
not required for fs integrity.  As long as we don't discard an in-use
block, we're free to do anything else (including forgetting to discard,
or rediscarding an already-discarded block).

It is theoretically possible to run all of this from user space using
the fs mappings, a bit like a defrag command.

One other option would just be to scan on mount, discard everything
empty and redo on next mount ... this might be just the thing for
laptops.

> Assuming we have a persistent bitmap in place, have a background
> scanner that kicks in when the cpu / disk is idle.  It just
> continuously scans the bitmap looking for contiguous blocks of unused
> sectors.  Each time it finds one, it sends the largest possible unmap
> down the block stack and eventually to the device.
> 
> When normal cpu / disk activity kicks in, this process goes to sleep.
> 
> That way much of the smarts are concentrated in the block layer, not
> in the filesystem code.  And it is being done when the disk is
> otherwise idle, so you don't have the ncq interference.
> 
> Even laptop users should have enough idle cpu available to manage
> this.  Enterprise would get the large discards it wants, and
> unmentioned in the previous discussion, mdraid gets the large discards
> it also wants.
> 
> ie. If a mdraid raid5/raid6 volume is built of SSDs, it will only be
> able to discard a full stripe at a time. Otherwise the P=D1 ^ D2 logic
> is lost.
> 
> Another benefit of the above is the code should be extremely safe and testable.

Actually, I think that if we go in-kernel, the discard might be better
tied into the block plugging mechanism.  The trigger might be: no
outstanding commands and the queue plugged; keep it plugged and begin
discarding.

James



^ permalink raw reply	[flat|nested] 208+ messages in thread
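
James's point that this could in principle be driven from user space, a bit
like a defrag pass, can be illustrated with a short sketch.  It assumes a
kernel that exposes the BLKDISCARD block-device ioctl; how the list of free
extents is obtained from the filesystem's own mappings is deliberately left
as a placeholder, since that is exactly the part under discussion.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* BLKDISCARD */
#include <unistd.h>

struct extent { uint64_t start, len; };	/* byte ranges known to be free */

/*
 * Discard a list of free extents on an already-open block device.
 * Where the extents come from (the filesystem's free-space map,
 * gathered offline or at mount time) is deliberately left out.
 */
static int discard_free_extents(int fd, const struct extent *ext, int n)
{
	for (int i = 0; i < n; i++) {
		uint64_t range[2] = { ext[i].start, ext[i].len };

		if (ioctl(fd, BLKDISCARD, &range) != 0) {
			perror("BLKDISCARD");
			return -1;
		}
	}
	return 0;
}

int main(int argc, char **argv)
{
	/* Example only: one made-up free extent at 1 MiB, 4 MiB long. */
	struct extent ext[] = { { 1ULL << 20, 4ULL << 20 } };
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	discard_free_extents(fd, ext, 1);
	close(fd);
	return 0;
}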

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 19:18             ` James Bottomley
  (?)
@ 2009-08-13 20:31               ` Richard Sharpe
  -1 siblings, 0 replies; 208+ messages in thread
From: Richard Sharpe @ 2009-08-13 20:31 UTC (permalink / raw)
  To: James Bottomley
  Cc: Greg Freemyer, david, Markus Trippelsdorf, Matthew Wilcox,
	Hugh Dickins, Nitin Gupta, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm, linux-scsi, linux-ide, Linux RAID

On Thu, Aug 13, 2009 at 12:18 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> Actually, I think, if we go in-kernel, the discard might be better tied
> into the block plugging mechanism.  The real test might be no
> outstanding commands and queue plugged, keep plugged and begin
> discarding.

I am very interested in this topic, as I have implemented UNMAP
support in SCST and scst_local.c and ib_srp.c for one SSD vendor, as
well as the block layer changes to have it work correctly (they were
minor changes and based on Matthew's original TRIM or UNMAP patch from
long ago). I believe that the performance was acceptable for them (I
will have to check).

I am also working on other, non-SSD, devices that are in a lower price
range than the large storage arrays where both DISCARD/UNMAP (and
WRITE SAME) would be useful in Linux. It also seems that Microsoft
supports TRIM in Windows 7 if you switch it on, although that really
only implies we should implement UNMAP support in our firmware and
hook it up to existing mechanisms.

I have logged internal enhancement bugs in bugzilla asking for both
TRIM and UNMAP/WRITE SAME support.  Although one environment is
iSCSI in userland, and can thus be dealt with without support in the
Linux kernel, there are use cases where DISCARD/UNMAP support in the
Linux kernel would be useful.

I would be very willing to make the firmware changes needed in our
device to support UNMAP/WRITE SAME and to test changes to the Linux
kernel to support same.

I will go through this thread in more detail when I get back from my
trip to Australia, but if there are any GIT trees around with nascent
support in them I would love to know about them, as it will help my
internal efforts to get UNMAP/WRITE SAME support implemented as well.

--
Regards,
Richard Sharpe

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 18:15           ` Greg Freemyer
@ 2009-08-13 20:44             ` david
  -1 siblings, 0 replies; 208+ messages in thread
From: david @ 2009-08-13 20:44 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Thu, 13 Aug 2009, Greg Freemyer wrote:

> On Thu, Aug 13, 2009 at 12:33 PM, <david@lang.hm> wrote:
>> On Thu, 13 Aug 2009, Markus Trippelsdorf wrote:
>>
>>> On Thu, Aug 13, 2009 at 08:13:12AM -0700, Matthew Wilcox wrote:
>>>>
>>>> I am planning a complete overhaul of the discard work.  Users can send
>>>> down discard requests as frequently as they like.  The block layer will
>>>> cache them, and invalidate them if writes come through.  Periodically,
>>>> the block layer will send down a TRIM or an UNMAP (depending on the
>>>> underlying device) and get rid of the blocks that have remained unwanted
>>>> in the interim.
>>>
>>> That is a very good idea. I've tested your original TRIM implementation on
>>> my Vertex yesterday and it was awful ;-). The SSD needs hundreds of
>>> milliseconds to digest a single TRIM command. And since your
>>> implementation
>>> sends a TRIM for each extent of each deleted file, the whole system is
>>> unusable after a short while.
>>> An optimal solution would be to consolidate the discard requests, bundle
>>> them and send them to the drive as infrequent as possible.
>>
>> or queue them up and send them when the drive is idle (you would need to
>> keep track to make sure the space isn't re-used)
>>
>> as an example, if you would consider spinning down a drive you don't hurt
>> performance by sending accumulated trim commands.
>>
>> David Lang
>
> An alternate approach is the block layer maintain its own bitmap of
> used unused sectors / blocks. Unmap commands from the filesystem just
> cause the bitmap to be updated.  No other effect.

how does the block layer know what blocks are unused by the filesystem?

or would it be a case of the filesystem generating discard/trim requests
to the block layer so that it can maintain its bitmap, and then the block
layer generating the requests to the drive below it?

David Lang

> (Big unknown: Where will the bitmap live between reboots?  Require DM
> volumes so we can have a dedicated bitmap volume in the mix to store
> the bitmap to? Maybe on mount, the filesystem has to be scanned to
> initially populate the bitmap?   Other options?)
>
> Assuming we have a persistent bitmap in place, have a background
> scanner that kicks in when the cpu / disk is idle.  It just
> continuously scans the bitmap looking for contiguous blocks of unused
> sectors.  Each time it finds one, it sends the largest possible unmap
> down the block stack and eventually to the device.
>
> When normal cpu / disk activity kicks in, this process goes to sleep.
>
> That way much of the smarts are concentrated in the block layer, not
> in the filesystem code.  And it is being done when the disk is
> otherwise idle, so you don't have the ncq interference.
>
> Even laptop users should have enough idle cpu available to manage
> this.  Enterprise would get the large discards it wants, and
> unmentioned in the previous discussion, mdraid gets the large discards
> it also wants.
>
> ie. If a mdraid raid5/raid6 volume is built of SSDs, it will only be
> able to discard a full stripe at a time. Otherwise the P=D1 ^ D2 logic
> is lost.
>
> Another benefit of the above is the code should be extremely safe and testable.
>
> Greg
>

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 20:44             ` david
@ 2009-08-13 20:54               ` Bryan Donlan
  -1 siblings, 0 replies; 208+ messages in thread
From: Bryan Donlan @ 2009-08-13 20:54 UTC (permalink / raw)
  To: david
  Cc: Greg Freemyer, Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins,
	Nitin Gupta, Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm,
	linux-scsi, linux-ide, Linux RAID

On Thu, Aug 13, 2009 at 4:44 PM, <david@lang.hm> wrote:
> On Thu, 13 Aug 2009, Greg Freemyer wrote:
>
>> On Thu, Aug 13, 2009 at 12:33 PM, <david@lang.hm> wrote:
>>>
>>> On Thu, 13 Aug 2009, Markus Trippelsdorf wrote:
>>>
>>>> On Thu, Aug 13, 2009 at 08:13:12AM -0700, Matthew Wilcox wrote:
>>>>>
>>>>> I am planning a complete overhaul of the discard work.  Users can send
>>>>> down discard requests as frequently as they like.  The block layer will
>>>>> cache them, and invalidate them if writes come through.  Periodically,
>>>>> the block layer will send down a TRIM or an UNMAP (depending on the
>>>>> underlying device) and get rid of the blocks that have remained
>>>>> unwanted
>>>>> in the interim.
>>>>
>>>> That is a very good idea. I've tested your original TRIM implementation
>>>> on
>>>> my Vertex yesterday and it was awful ;-). The SSD needs hundreds of
>>>> milliseconds to digest a single TRIM command. And since your
>>>> implementation
>>>> sends a TRIM for each extent of each deleted file, the whole system is
>>>> unusable after a short while.
>>>> An optimal solution would be to consolidate the discard requests, bundle
>>>> them and send them to the drive as infrequent as possible.
>>>
>>> or queue them up and send them when the drive is idle (you would need to
>>> keep track to make sure the space isn't re-used)
>>>
>>> as an example, if you would consider spinning down a drive you don't hurt
>>> performance by sending accumulated trim commands.
>>>
>>> David Lang
>>
>> An alternate approach is the block layer maintain its own bitmap of
>> used unused sectors / blocks. Unmap commands from the filesystem just
>> cause the bitmap to be updated.  No other effect.
>
> how does the block layer know what blocks are unused by the filesystem?
>
> or would it be a case of the filesystem generating discard/trim requests to
> the block layer so that it can maintain it's bitmap, and then the block
> layer generating the requests to the drive below it?

Perhaps an interface (an ioctl, etc.) could be added to ask a filesystem to
discard all unused blocks in a certain range? (That is, have the
filesystem validate the request under any necessary locks before
passing it to the block I/O layer.)


^ permalink raw reply	[flat|nested] 208+ messages in thread
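
For what the proposed interface might look like from the caller's side, here
is a sketch.  The ioctl number and structure (FS_DISCARD_UNUSED and
struct fs_discard_range) are invented purely for illustration; no such kernel
interface is implied to exist.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <unistd.h>

struct fs_discard_range {
	uint64_t start;		/* byte offset into the filesystem */
	uint64_t len;		/* length of the region to scan */
};

/* Hypothetical request: "discard whatever is unused inside this range". */
#define FS_DISCARD_UNUSED	_IOW('Z', 0x01, struct fs_discard_range)

int main(int argc, char **argv)
{
	struct fs_discard_range r = { 0, ~0ULL };	/* whole filesystem */
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <mount-point>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* The filesystem would validate the range under its own locks
	 * and pass only genuinely free extents down to the block layer. */
	if (ioctl(fd, FS_DISCARD_UNUSED, &r) != 0)
		perror("FS_DISCARD_UNUSED (hypothetical)");
	close(fd);
	return 0;
}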

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 20:44             ` david
@ 2009-08-13 21:28               ` Greg Freemyer
  -1 siblings, 0 replies; 208+ messages in thread
From: Greg Freemyer @ 2009-08-13 21:28 UTC (permalink / raw)
  To: david
  Cc: Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Thu, Aug 13, 2009 at 4:44 PM, <david@lang.hm> wrote:
> On Thu, 13 Aug 2009, Greg Freemyer wrote:
>
>> On Thu, Aug 13, 2009 at 12:33 PM, <david@lang.hm> wrote:
>>>
>>> On Thu, 13 Aug 2009, Markus Trippelsdorf wrote:
>>>
>>>> On Thu, Aug 13, 2009 at 08:13:12AM -0700, Matthew Wilcox wrote:
>>>>>
>>>>> I am planning a complete overhaul of the discard work.  Users can send
>>>>> down discard requests as frequently as they like.  The block layer will
>>>>> cache them, and invalidate them if writes come through.  Periodically,
>>>>> the block layer will send down a TRIM or an UNMAP (depending on the
>>>>> underlying device) and get rid of the blocks that have remained
>>>>> unwanted
>>>>> in the interim.
>>>>
>>>> That is a very good idea. I've tested your original TRIM implementation
>>>> on
>>>> my Vertex yesterday and it was awful ;-). The SSD needs hundreds of
>>>> milliseconds to digest a single TRIM command. And since your
>>>> implementation
>>>> sends a TRIM for each extent of each deleted file, the whole system is
>>>> unusable after a short while.
>>>> An optimal solution would be to consolidate the discard requests, bundle
>>>> them and send them to the drive as infrequent as possible.
>>>
>>> or queue them up and send them when the drive is idle (you would need to
>>> keep track to make sure the space isn't re-used)
>>>
>>> as an example, if you would consider spinning down a drive you don't hurt
>>> performance by sending accumulated trim commands.
>>>
>>> David Lang
>>
>> An alternate approach is the block layer maintain its own bitmap of
>> used unused sectors / blocks. Unmap commands from the filesystem just
>> cause the bitmap to be updated.  No other effect.
>
> how does the block layer know what blocks are unused by the filesystem?
>
> or would it be a case of the filesystem generating discard/trim requests to
> the block layer so that it can maintain it's bitmap, and then the block
> layer generating the requests to the drive below it?
>
> David Lang

Yes, my thought was that the block layer would consume the discard/trim
requests from the filesystem in real time to maintain the bitmap, then
at some later point in time, when the system has extra resources, it
would generate the calls down to the lower layers and eventually the
drive.

I highlight the lower layers because mdraid is also going to have to
be in the mix if raid5/6 is in use, i.e. at a minimum it will have to
adjust the block range to align with the stripe boundaries.

Greg


^ permalink raw reply	[flat|nested] 208+ messages in thread
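
The bitmap approach sketched in this sub-thread (filesystem discards set
bits, writes clear them, and an idle-time scanner turns long runs of set
bits into a few large discards, rounded to full raid5/6 stripes) can be
modelled in userspace C as follows.  The block and stripe sizes, the names
and the printed output are illustrative only.

/*
 * Userspace model of the bitmap idea: filesystem discards set bits,
 * writes clear them, and an idle-time scanner turns the largest
 * contiguous runs of set bits into a few big, stripe-aligned discards.
 */
#include <stdio.h>
#include <stdint.h>

#define NR_BLOCKS	1024	/* device size in "blocks" for the model */
#define STRIPE_BLOCKS	64	/* mdraid full-stripe width, as an example */

static uint8_t free_map[NR_BLOCKS / 8];	/* 1 bit per block: 1 = discardable */

static void set_bit_range(uint64_t start, uint64_t len, int val)
{
	for (uint64_t b = start; b < start + len; b++) {
		if (val)
			free_map[b / 8] |= 1u << (b % 8);
		else
			free_map[b / 8] &= ~(1u << (b % 8));
	}
}

static int test_bit(uint64_t b)
{
	return free_map[b / 8] & (1u << (b % 8));
}

/* Filesystem tells us a range is free; a later write takes it back. */
static void fs_discard(uint64_t start, uint64_t len) { set_bit_range(start, len, 1); }
static void fs_write(uint64_t start, uint64_t len)   { set_bit_range(start, len, 0); }

/*
 * Idle-time scanner: find runs of unused blocks and emit one discard per
 * run, rounded inwards to full-stripe boundaries so a raid5/6 layer never
 * has to discard a partial stripe (keeping P = D1 ^ D2 intact).
 */
static void scan_and_discard(void)
{
	uint64_t b = 0;

	while (b < NR_BLOCKS) {
		if (!test_bit(b)) {
			b++;
			continue;
		}
		uint64_t run_start = b;

		while (b < NR_BLOCKS && test_bit(b))
			b++;
		/* round the run inwards to stripe boundaries */
		uint64_t start = (run_start + STRIPE_BLOCKS - 1)
					/ STRIPE_BLOCKS * STRIPE_BLOCKS;
		uint64_t end   = b / STRIPE_BLOCKS * STRIPE_BLOCKS;

		if (end > start)
			printf("discard blocks %llu+%llu\n",
			       (unsigned long long)start,
			       (unsigned long long)(end - start));
	}
}

int main(void)
{
	fs_discard(0, 600);	/* a large free region accumulates ... */
	fs_write(100, 10);	/* ... some of it gets reused ... */
	scan_and_discard();	/* ... and only full, aligned stripes go down */
	return 0;
}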

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 21:28               ` Greg Freemyer
  (?)
@ 2009-08-13 22:20                 ` Richard Sharpe
  -1 siblings, 0 replies; 208+ messages in thread
From: Richard Sharpe @ 2009-08-13 22:20 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: david, Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins,
	Nitin Gupta, Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm,
	linux-scsi, linux-ide, Linux RAID

On Thu, Aug 13, 2009 at 2:28 PM, Greg Freemyer<greg.freemyer@gmail.com> wrote:
> On Thu, Aug 13, 2009 at 4:44 PM, <david@lang.hm> wrote:
>> On Thu, 13 Aug 2009, Greg Freemyer wrote:
>>
>>> On Thu, Aug 13, 2009 at 12:33 PM, <david@lang.hm> wrote:
>>>>
>>>> On Thu, 13 Aug 2009, Markus Trippelsdorf wrote:
>>>>
>>>>> On Thu, Aug 13, 2009 at 08:13:12AM -0700, Matthew Wilcox wrote:
>>>>>>
>>>>>> I am planning a complete overhaul of the discard work.  Users can send
>>>>>> down discard requests as frequently as they like.  The block layer will
>>>>>> cache them, and invalidate them if writes come through.  Periodically,
>>>>>> the block layer will send down a TRIM or an UNMAP (depending on the
>>>>>> underlying device) and get rid of the blocks that have remained
>>>>>> unwanted
>>>>>> in the interim.
>>>>>
>>>>> That is a very good idea. I've tested your original TRIM implementation
>>>>> on
>>>>> my Vertex yesterday and it was awful ;-). The SSD needs hundreds of
>>>>> milliseconds to digest a single TRIM command. And since your
>>>>> implementation
>>>>> sends a TRIM for each extent of each deleted file, the whole system is
>>>>> unusable after a short while.
>>>>> An optimal solution would be to consolidate the discard requests, bundle
>>>>> them and send them to the drive as infrequent as possible.
>>>>
>>>> or queue them up and send them when the drive is idle (you would need to
>>>> keep track to make sure the space isn't re-used)
>>>>
>>>> as an example, if you would consider spinning down a drive you don't hurt
>>>> performance by sending accumulated trim commands.
>>>>
>>>> David Lang
>>>
>>> An alternate approach is the block layer maintain its own bitmap of
>>> used unused sectors / blocks. Unmap commands from the filesystem just
>>> cause the bitmap to be updated.  No other effect.
>>
>> how does the block layer know what blocks are unused by the filesystem?
>>
>> or would it be a case of the filesystem generating discard/trim requests to
>> the block layer so that it can maintain it's bitmap, and then the block
>> layer generating the requests to the drive below it?
>>
>> David Lang
>
> Yes, my thought.was that block layer would consume the discard/trim
> requests from the filesystem in realtime to maintain the bitmap, then
> at some later point in time when the system has extra resources it
> would generate the calls down to the lower layers and eventually the
> drive.

Why should the block layer be forced to maintain something that is
probably of use for only a limited number of cases? For example, the
devices I work on already maintain their own mapping of HOST-visible
LBAs to underlying storage, and I suspect that most such devices do.
So, you are duplicating something that we already do, and there is no
way that I am aware of to synchronise the two.

All we really need, I believe, is for the UNMAP requests to come down
to us with writes barriered until we respond; it is a relatively
cheap operation, although writes that are already in the cache and
uncommitted to disk present some issues if an UNMAP request comes down
for recently written blocks.

> I highlight the lower layers because mdraid is also going to have to
> be in the mix if raid5/6 is in use.  ie. At a minimum it will have to
> adjust the block range to align with the stripe boundaries.
>
> Greg
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Regards,
Richard Sharpe

^ permalink raw reply	[flat|nested] 208+ messages in thread
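
The ordering issue Richard raises, an UNMAP arriving for LBAs that still
have dirty, uncommitted writes in the device cache, can be shown with a toy
model.  Writes wholly inside the unmapped range can simply be dropped, since
their data no longer matters, while writes straddling the boundary have to
be committed first.  Everything below is invented for illustration and does
not describe any vendor's firmware.

#include <stdio.h>
#include <stdint.h>

struct cached_write { uint64_t lba, len; int dirty; };

static struct cached_write cache[] = {
	{ 100, 8, 1 },		/* fully inside the UNMAP below: droppable  */
	{ 124, 8, 1 },		/* straddles the end of it: must be flushed */
	{ 400, 8, 1 },		/* unrelated: untouched                     */
};

static void handle_unmap(uint64_t start, uint64_t len)
{
	uint64_t end = start + len;

	/* New writes are assumed to be held off until this returns. */
	for (size_t i = 0; i < sizeof(cache) / sizeof(cache[0]); i++) {
		struct cached_write *w = &cache[i];
		uint64_t wend = w->lba + w->len;

		if (!w->dirty || wend <= start || w->lba >= end)
			continue;
		if (w->lba >= start && wend <= end) {
			printf("drop cached write %llu+%llu (data no longer matters)\n",
			       (unsigned long long)w->lba, (unsigned long long)w->len);
			w->dirty = 0;
		} else {
			printf("flush cached write %llu+%llu before unmapping\n",
			       (unsigned long long)w->lba, (unsigned long long)w->len);
			w->dirty = 0;	/* committed to media in the model */
		}
	}
	printf("unmap LBAs %llu+%llu\n",
	       (unsigned long long)start, (unsigned long long)len);
}

int main(void)
{
	handle_unmap(96, 32);	/* covers LBAs 96..127 */
	return 0;
}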

* Re: compcache as a pre-swap area (was: [PATCH] swap: send callback  when swap slot is freed)
  2009-08-13 17:31         ` Nitin Gupta
@ 2009-08-14  4:02           ` Al Boldi
  -1 siblings, 0 replies; 208+ messages in thread
From: Al Boldi @ 2009-08-14  4:02 UTC (permalink / raw)
  To: Nitin Gupta
  Cc: Hugh Dickins, Matthew Wilcox, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm

Nitin Gupta wrote:
> compcache is not really a swap replacement. It's just another swap device
> that compresses data and stores it in memory itself. You can have disk-based
> swaps along with ramzswap (the name of the block device).

So once compcache fills up, it will start to age its contents into normal 
swap?


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: compcache as a pre-swap area
  2009-08-14  4:02           ` Al Boldi
@ 2009-08-14  4:53             ` Nitin Gupta
  -1 siblings, 0 replies; 208+ messages in thread
From: Nitin Gupta @ 2009-08-14  4:53 UTC (permalink / raw)
  To: Al Boldi
  Cc: Hugh Dickins, Matthew Wilcox, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm

On 08/14/2009 09:32 AM, Al Boldi wrote:
> Nitin Gupta wrote:
>> compcache is not really a swap replacement. It's just another swap device
>> that compresses data and stores it in memory itself. You can have disk-based
>> swaps along with ramzswap (the name of the block device).
>
> So once compcache fills up, it will start to age its contents into normal
> swap?
>

This is desirable but not yet implemented. For now, if 'backing swap' is used,
compcache will forward incompressible pages to the backing swap device. If
compcache fills up, the kernel will simply send further swap-outs to the swap
device that comes next in priority.
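
A hedged sketch of that forwarding decision; this is not the real ramzswap
code, and the threshold, the toy "compressor" and the helper names here are
invented for illustration:

/*
 * Try to compress a page; if it does not shrink enough and a backing swap
 * device is configured, forward the uncompressed page there rather than
 * spending RAM on it.
 */
#include <stdio.h>
#include <stddef.h>

#define PAGE_SIZE          4096
#define GOOD_COMPRESS_MAX  (PAGE_SIZE / 2)   /* keep only pages that at least halve */

/* Stand-in for a real compressor: pretend the compressed size is the number
 * of nonzero bytes in the page. */
static size_t toy_compress(const unsigned char *page)
{
    size_t n = 0, i;

    for (i = 0; i < PAGE_SIZE; i++)
        if (page[i])
            n++;
    return n ? n : 1;
}

static void store_in_memory_pool(size_t clen) { printf("kept in RAM (%zu bytes)\n", clen); }
static void forward_to_backing_swap(void)     { printf("forwarded to backing swap\n"); }

static void store_page(const unsigned char *page, int have_backing_swap)
{
    size_t clen = toy_compress(page);

    if (clen > GOOD_COMPRESS_MAX && have_backing_swap)
        forward_to_backing_swap();      /* incompressible: don't waste RAM on it */
    else
        store_in_memory_pool(clen);
}

int main(void)
{
    static unsigned char zeroes[PAGE_SIZE];        /* compresses very well */
    static unsigned char noisy[PAGE_SIZE];
    size_t i;

    for (i = 0; i < PAGE_SIZE; i++)
        noisy[i] = (unsigned char)(i * 167u + 13u);   /* mostly nonzero, looks random */

    store_page(zeroes, 1);   /* -> kept in RAM */
    store_page(noisy, 1);    /* -> forwarded to backing swap */
    return 0;
}

When the in-memory pool itself is full, the swap core's existing priority
order takes over, as described above.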

Nitin

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: compcache as a pre-swap area
  2009-08-14  4:53             ` Nitin Gupta
@ 2009-08-14 15:49               ` Al Boldi
  -1 siblings, 0 replies; 208+ messages in thread
From: Al Boldi @ 2009-08-14 15:49 UTC (permalink / raw)
  To: ngupta
  Cc: Hugh Dickins, Matthew Wilcox, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm

Nitin Gupta wrote:
> On 08/14/2009 09:32 AM, Al Boldi wrote:
> > So once compcache fills up, it will start to age its contents into normal
> > swap?
>
> This is desirable but not yet implemented. For now, if 'backing swap' is
> used, compcache will forward incompressible pages to the backing swap
> device. If compcache fills up, the kernel will simply send further swap-outs
> to the swap device that comes next in priority.

Ok, this sounds acceptable for now.

The important thing now is to improve performance to a level comparable to a
system with normal SSD swap.  Do you have such a comparison?

Another interesting benchmark would be to use compcache in a maximized
configuration, i.e. on a system with 1024 MB of RAM assign 960 MB to compcache
and leave 64 MB for the system, and then see how it performs.  This may easily
pinpoint any bottlenecks compcache has, if any.

Also, a link to the latest patch against .30 would be helpful.


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
       [not found]                   ` <46b8a8850908131758s781b07f6v2729483c0e50ae7a@mail.gmail.com>
  2009-08-14 21:33                       ` Greg Freemyer
@ 2009-08-14 21:33                     ` Greg Freemyer
  2009-08-14 21:33                     ` Greg Freemyer
  2009-08-14 21:33                     ` Greg Freemyer
  3 siblings, 0 replies; 208+ messages in thread
From: Greg Freemyer @ 2009-08-14 21:33 UTC (permalink / raw)
  To: Richard Sharpe, david, Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins

This inadvertently went just to me, replying to all:

On Thu, Aug 13, 2009 at 8:58 PM, Richard
Sharpe<realrichardsharpe@gmail.com> wrote:
> On Thu, Aug 13, 2009 at 5:19 PM, Greg Freemyer<greg.freemyer@gmail.com> wrote:
>> On Thu, Aug 13, 2009 at 6:20 PM, Richard
>> Sharpe<realrichardsharpe@gmail.com> wrote:
>>> On Thu, Aug 13, 2009 at 2:28 PM, Greg Freemyer<greg.freemyer@gmail.com> wrote:
>>>> On Thu, Aug 13, 2009 at 4:44 PM, <david@lang.hm> wrote:
>>>>> On Thu, 13 Aug 2009, Greg Freemyer wrote:
>>>>>
>>>>>> On Thu, Aug 13, 2009 at 12:33 PM, <david@lang.hm> wrote:
>>>>>>>
>>>>>>> On Thu, 13 Aug 2009, Markus Trippelsdorf wrote:
>>>>>>>
>>>>>>>> On Thu, Aug 13, 2009 at 08:13:12AM -0700, Matthew Wilcox wrote:
>>>>>>>>>
>>>>>>>>> I am planning a complete overhaul of the discard work.  Users can send
>>>>>>>>> down discard requests as frequently as they like.  The block layer will
>>>>>>>>> cache them, and invalidate them if writes come through.  Periodically,
>>>>>>>>> the block layer will send down a TRIM or an UNMAP (depending on the
>>>>>>>>> underlying device) and get rid of the blocks that have remained
>>>>>>>>> unwanted
>>>>>>>>> in the interim.
>>>>>>>>
>>>>>>>> That is a very good idea. I've tested your original TRIM implementation
>>>>>>>> on
>>>>>>>> my Vertex yesterday and it was awful ;-). The SSD needs hundreds of
>>>>>>>> milliseconds to digest a single TRIM command. And since your
>>>>>>>> implementation
>>>>>>>> sends a TRIM for each extent of each deleted file, the whole system is
>>>>>>>> unusable after a short while.
>>>>>>>> An optimal solution would be to consolidate the discard requests, bundle
>>>>>>>> them and send them to the drive as infrequent as possible.
>>>>>>>
>>>>>>> or queue them up and send them when the drive is idle (you would need to
>>>>>>> keep track to make sure the space isn't re-used)
>>>>>>>
>>>>>>> as an example, if you would consider spinning down a drive you don't hurt
>>>>>>> performance by sending accumulated trim commands.
>>>>>>>
>>>>>>> David Lang
>>>>>>
>>>>>> An alternate approach is the block layer maintain its own bitmap of
>>>>>> used unused sectors / blocks. Unmap commands from the filesystem just
>>>>>> cause the bitmap to be updated.  No other effect.
>>>>>
>>>>> how does the block layer know what blocks are unused by the filesystem?
>>>>>
>>>>> or would it be a case of the filesystem generating discard/trim requests to
>>>>> the block layer so that it can maintain its bitmap, and then the block
>>>>> layer generating the requests to the drive below it?
>>>>>
>>>>> David Lang
>>>>
>>>> Yes, my thought was that the block layer would consume the discard/trim
>>>> requests from the filesystem in realtime to maintain the bitmap, then
>>>> at some later point in time when the system has extra resources it
>>>> would generate the calls down to the lower layers and eventually the
>>>> drive.
>>>
>>> Why should the block layer be forced to maintain something that is
>>> probably of use for only a limited number of cases? For example, the
>>> devices I work on already maintain their own mapping of HOST-visible
>>> LBAs to underlying storage, and I suspect that most such devices do.
>>> So, you are duplicating something that we already do, and there is no
>>> way that I am aware of to synchronise the two.
>>>
>>> All we really need, I believe is for the UNMAP requests to come down
>>> to us with writes barriered until we respond, and it is a relatively
>>> cheap operation, although writes that are already in the cache and
>>> uncommitted to disk present some issues if an UNMAP request comes down
>>> for recently written blocks.
>>>
>>
>> Richard,
>>
>> Quoting the original email I saw in this thread:
>>
>>>
>>>The unfortunate thing about the TRIM command is that it's not NCQ, so
>>>all NCQ commands have to finish, then we can send the TRIM command and
>>>wait for it to finish, then we can send NCQ commands again.
>>>
>>>So TRIM isn't free, and there's a better way for the drive to find
>>>out that the contents of a block no longer matter -- write some new
>>>data to it.  So if we just swapped a page in, and we're going to swap
>>>something else back out again soon, just write it to the same location
>>>instead of to a fresh location.  You've saved a command, and you've
>>>saved the drive some work, plus you've allowed other users to continue
>>>accessing the drive in the meantime.
>>>
>>>I am planning a complete overhaul of the discard work.  Users can send
>>>down discard requests as frequently as they like.  The block layer will
>>>cache them, and invalidate them if writes come through.  Periodically,
>>>the block layer will send down a TRIM or an UNMAP (depending on the
>>>underlying device) and get rid of the blocks that have remained unwanted
>>>in the interim.
>>>
>>>Thoughts on that are welcome.
>>>>
>>
>> My thought was that a bitmap was a better solution than a cache of
>> discard commands.
>>
>> One of the biggest reasons is that a bitmap can coalesce the unused
>> areas into much larger discard ranges than a queue that will only have
>> a limited number of discards to coalesce.
>
> OK, I misunderstood. For the work I did with an SSD company the UNMAP
> requests were coming down as 1024 LBA DISCARDs/UNMAPs. If someone
> deleted a multi-GB file that results in thousands of DISCARDS coming
> down, which is a problem.

I think the ext4 implementation is sending down discards far smaller
than 1024 sectors.  Ted Ts'o posted a few months ago that he ran a test
and saw a massive number of them being sent from ext4 to the block
layer.  The rest of the stack was not in place, so he did not know the
real performance impact.

> However, I wonder if we cannot make do with merging in the block
> layer, especially with XFS or Ext4.

That's the cache-and-coalesce approach, right?  Just a personal thing,
but we run things like defrag in the background during off hours.

It seems to me that unmap is not all that different, so why do we need
to do it in close time proximity to the deletes?  With a bitmap, we
have total timing control over when the unmaps are forwarded down to
the device.  I like that timing control much better than a
cache-and-coalesce approach.

>> And both enterprise SCSI and mdraid want larger discard ranges.
>
> I also would like large discard ranges ... metadata updates in the
> platform I am thinking of are transactional, and I would like to
> reduce the number of transactions pushed through the metadata journal.
>
> --
> Regards,
> Richard Sharpe

Greg

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support
  2009-08-14 21:33                       ` Greg Freemyer
@ 2009-08-14 21:56                         ` Roland Dreier
  -1 siblings, 0 replies; 208+ messages in thread
From: Roland Dreier @ 2009-08-14 21:56 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: Richard Sharpe, david, Markus Trippelsdorf, Matthew Wilcox,
	Hugh Dickins, Nitin Gupta, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm, linux-scsi, linux-ide, Linux RAID


 > It seems to me that unmap is not all that different, so why do we need
 > to do it in close time proximity to the deletes?  With a bitmap, we
 > have total timing control over when the unmaps are forwarded down to
 > the device.  I like that timing control much better than a
 > cache-and-coalesce approach.

The trouble I see with a bitmap is the amount of memory it consumes.  It
seems that discards must be tracked at a granularity no coarser than 4KB
(and possibly even 512-byte sectors).  But even at 4KB, a 32 TB volume
(just 16 * 2TB disks, or even lower end with thin provisioning) requires
1 GB of bitmap memory, which is a lot just to store, let alone walk over.
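
Spelling that estimate out, assuming one bit per tracked block and power-of-two
units:

\[
\frac{32\ \mathrm{TB}}{4\ \mathrm{KB}} = \frac{2^{45}}{2^{12}} = 2^{33}\ \text{blocks}
\;\Longrightarrow\;
\frac{2^{33}\ \text{bits}}{8} = 2^{30}\ \text{bytes} = 1\ \mathrm{GiB\ of\ bitmap},
\]

and tracking at 512-byte granularity would multiply that by eight, to 8 GiB.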

 - R.

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 19:18             ` James Bottomley
@ 2009-08-14 22:03               ` Mark Lord
  -1 siblings, 0 replies; 208+ messages in thread
From: Mark Lord @ 2009-08-14 22:03 UTC (permalink / raw)
  To: James Bottomley
  Cc: Greg Freemyer, david, Markus Trippelsdorf, Matthew Wilcox,
	Hugh Dickins, Nitin Gupta, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm, linux-scsi, linux-ide, Linux RAID

James Bottomley wrote:
> On Thu, 2009-08-13 at 14:15 -0400, Greg Freemyer wrote:
>> On Thu, Aug 13, 2009 at 12:33 PM, <david@lang.hm> wrote:
>>> On Thu, 13 Aug 2009, Markus Trippelsdorf wrote:
>>>
>>>> On Thu, Aug 13, 2009 at 08:13:12AM -0700, Matthew Wilcox wrote:
>>>>> I am planning a complete overhaul of the discard work.  Users can send
>>>>> down discard requests as frequently as they like.  The block layer will
>>>>> cache them, and invalidate them if writes come through.  Periodically,
>>>>> the block layer will send down a TRIM or an UNMAP (depending on the
>>>>> underlying device) and get rid of the blocks that have remained unwanted
>>>>> in the interim.
>>>> That is a very good idea. I've tested your original TRIM implementation on
>>>> my Vertex yesterday and it was awful ;-). The SSD needs hundreds of
>>>> milliseconds to digest a single TRIM command. And since your
>>>> implementation
>>>> sends a TRIM for each extent of each deleted file, the whole system is
>>>> unusable after a short while.
>>>> An optimal solution would be to consolidate the discard requests, bundle
>>>> them and send them to the drive as infrequent as possible.
>>> or queue them up and send them when the drive is idle (you would need to
>>> keep track to make sure the space isn't re-used)
>>>
>>> as an example, if you would consider spinning down a drive you don't hurt
>>> performance by sending accumulated trim commands.
>>>
>>> David Lang
>> An alternate approach is the block layer maintain its own bitmap of
>> used unused sectors / blocks. Unmap commands from the filesystem just
>> cause the bitmap to be updated.  No other effect.
>>
>> (Big unknown: Where will the bitmap live between reboots?  Require DM
>> volumes so we can have a dedicated bitmap volume in the mix to store
>> the bitmap to? Maybe on mount, the filesystem has to be scanned to
>> initially populate the bitmap?   Other options?)
> 
> I wouldn't really have it live anywhere.  Discard is best effort; it's
> not required for fs integrity.  As long as we don't discard an in-use
> block we're free to do anything else (including forget to discard,
> rediscard a discarded block etc).
> 
> It is theoretically possible to run all of this from user space using
> the fs mappings, a bit like a defrag command.
..

Already a work-in-progress -- see my wiper.sh script on the hdparm page
at sourceforge.  Trimming 50+GB of free space on a 120GB Vertex
(over 100 million sectors) takes a *single* TRIM command,
and completes in only a couple of seconds.
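
A rough sketch of why a single TRIM can cover that much space, as I understand
the ATA DATA SET MANAGEMENT (TRIM) payload: a list of 8-byte entries, each
holding a 48-bit starting LBA plus a 16-bit sector count (max 65535), with 64
entries per 512-byte payload block and multiple payload blocks per command.
Treat the details below as a from-memory illustration, not a reference:

#include <stdint.h>
#include <stdio.h>

#define ENTRIES_PER_BLOCK     64            /* 512 bytes / 8 bytes per entry */
#define MAX_SECTORS_PER_ENTRY 0xFFFFull     /* 16-bit count field */

/* Pack one (lba, count) pair into the 8-byte TRIM range entry. */
static uint64_t trim_entry(uint64_t lba, uint64_t nsectors)
{
    return (lba & 0xFFFFFFFFFFFFull) | (nsectors << 48);
}

/* Fill payload[] with entries covering [lba, lba + nsectors), splitting long
 * ranges into 65535-sector chunks; returns the number of entries used. */
static int pack_range(uint64_t *payload, int max_entries,
                      uint64_t lba, uint64_t nsectors)
{
    int used = 0;

    while (nsectors && used < max_entries) {
        uint64_t chunk = nsectors > MAX_SECTORS_PER_ENTRY ?
                         MAX_SECTORS_PER_ENTRY : nsectors;
        payload[used++] = trim_entry(lba, chunk);
        lba += chunk;
        nsectors -= chunk;
    }
    return used;
}

int main(void)
{
    uint64_t payload[ENTRIES_PER_BLOCK] = { 0 };   /* one 512-byte payload block */
    uint64_t free_sectors = 100000000ull;          /* ~50 GB in 512-byte sectors */
    uint64_t per_block = (uint64_t)ENTRIES_PER_BLOCK * MAX_SECTORS_PER_ENTRY;
    int n = pack_range(payload, ENTRIES_PER_BLOCK, 1000000, free_sectors);

    printf("one payload block: %d entries, up to %llu sectors\n",
           n, (unsigned long long)per_block);
    printf("%llu sectors fit in about %llu payload blocks of one command\n",
           (unsigned long long)free_sectors,
           (unsigned long long)((free_sectors + per_block - 1) / per_block));
    return 0;
}

So a couple of dozen 512-byte payload blocks, sent as the data of one command,
are enough to describe all of that free space.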

Cheers

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-13 20:54               ` Bryan Donlan
@ 2009-08-14 22:10                 ` Mark Lord
  -1 siblings, 0 replies; 208+ messages in thread
From: Mark Lord @ 2009-08-14 22:10 UTC (permalink / raw)
  To: Bryan Donlan
  Cc: david, Greg Freemyer, Markus Trippelsdorf, Matthew Wilcox,
	Hugh Dickins, Nitin Gupta, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm, linux-scsi, linux-ide, Linux RAID

Bryan Donlan wrote:
..
> Perhaps an interface (ioctl, etc) can be added to ask a filesystem to
> discard all unused blocks in a certain range? (That is, have the
> filesystem validate the request under any necessary locks before
> passing it to the block IO layer)
..

While possibly TRIM-specific, this approach has the lowest overhead
and probably the greatest gain-for-pain ratio.

But it may not be as nice for enterprise (?).

On the Indilinx-based SSDs (e.g. OCZ Vertex), TRIM seems to trigger an
internal garbage-collection/erase cycle.  As such, the drive really prefers
a few LARGE trim lists, rather than many smaller ones.

Here's some information that a vendor has observed from the Win7 use of TRIM:

> TRIM command is sent:
> -	About 2/3 of partition is filled up, when file is deleted.
>         (I am not talking about send file to trash bin.)
> -	In the above case, when trash bin gets emptied.
> -	In the above case, when partition is deleted.
> 
> TRIM command is not sent:-	
> -	When file is moved to trash bin
> -	When partition is formatted. (Both quick and full format)
> -	When empty partition is deleted
> -	When file is deleted while there is big remaining free space
..

His words, not mine.  But the idea seems to be to batch them in large chunks.
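
(For reference, a minimal sketch of what one of those batched trim lists
looks like on the wire.  This illustrates the general ATA DSM/TRIM payload
layout, not hdparm/wiper.sh code: each 512-byte data block holds up to 64
little-endian 8-byte entries, the low 48 bits giving the starting LBA and
the high 16 bits a sector count of up to 65535.)

#include <stdint.h>
#include <string.h>

#define TRIM_ENTRIES_PER_BLOCK  64
#define TRIM_MAX_RANGE          0xffffULL   /* sectors per 8-byte entry */

/* Pack one contiguous range into a 512-byte DSM/TRIM data block. */
int pack_trim_block(uint8_t buf[512], uint64_t lba, uint64_t nsectors)
{
        int i = 0;

        memset(buf, 0, 512);
        while (nsectors && i < TRIM_ENTRIES_PER_BLOCK) {
                uint64_t chunk = nsectors > TRIM_MAX_RANGE ?
                                        TRIM_MAX_RANGE : nsectors;
                uint64_t entry = (chunk << 48) | (lba & 0xffffffffffffULL);

                memcpy(buf + 8 * i, &entry, sizeof(entry)); /* LE host assumed */
                lba += chunk;
                nsectors -= chunk;
                i++;
        }
        return i;       /* entries used; caller loops while nsectors remain */
}

One such block already describes roughly 2GB of space, and the command can
carry several data blocks, which is how tens of GB of free space can go out
in a single TRIM.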

My wiper.sh "trim script" is packaged with the latest hdparm (currently 9.24)
on sourceforge, for those who want to try this stuff for real.  No special
kernel support is required to use it.

Cheers

Mark

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support
  2009-08-14 21:56                         ` Roland Dreier
@ 2009-08-14 22:10                           ` Greg Freemyer
  -1 siblings, 0 replies; 208+ messages in thread
From: Greg Freemyer @ 2009-08-14 22:10 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Richard Sharpe, david, Markus Trippelsdorf, Matthew Wilcox,
	Hugh Dickins, Nitin Gupta, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm, linux-scsi, linux-ide, Linux RAID

On Fri, Aug 14, 2009 at 5:56 PM, Roland Dreier<rdreier@cisco.com> wrote:
>
>  > It seems to me that unmap is not all that different, why do we need to
>  > do it even close in time proximity to the deletes?  With a bitmap, we
>  > have total timing control of when the unmaps are forwarded down to the
>  > device.  I like that timing control much better than a cache and
>  > coalesce approach.
>
> The trouble I see with a bitmap is the amount of memory it consumes.  It
> seems that discards must be tracked on no bigger than 4KB sectors (and
> possibly even 512 byte sectors).  But even with 4KB, then, say, a 32 TB
> volume (just 16 * 2TB disks, or even lower end with thin provisioning)
> requires 1 GB of bitmap memory.  Which is a lot just to store, let alone
> walk over etc.

Have the filesystem guys created any efficient extent tree tracking solutions?

I mean a 16TB filesystem obviously has to track its free space somehow
without requiring 1GB of RAM.  Can that logic be leveraged in the
block layer to track free space?  That obviously assumes it's not too
CPU intensive to do so.

If a leaf in the extent-tracking tree becomes big enough, it could
even be sent down from the block layer and that leaf deleted.  I.e. if
a leaf of the tree grows to represent X contiguous blocks, then a
discard could be sent down to the device and the leaf representing
those free blocks deleted.

The new topology info about block devices might be able to help
optimize the minimum size of a coalesced discard.
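
To make the extent idea concrete, here is a minimal user-space sketch (not
an existing kernel interface; the array and threshold are invented for
illustration): coalesce freed ranges into extents and only push a discard
down once an extent crosses a size threshold.  A real implementation would
presumably use an rbtree keyed by start sector rather than a flat array.
For comparison, the bitmap mentioned above really does cost about 1GB for
32TB at 4KB granularity: 2^33 blocks at one bit each.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_EXTENTS             1024
#define DISCARD_THRESHOLD       2048    /* sectors: 1MiB at 512 bytes */

struct free_extent {
        uint64_t start;         /* sector number */
        uint64_t len;           /* length in sectors */
        bool     used;
};

static struct free_extent tree[MAX_EXTENTS];    /* stand-in for an rbtree */

static void issue_discard(uint64_t start, uint64_t len)
{
        /* placeholder: would become a discard bio / TRIM / UNMAP */
        printf("discard %llu+%llu\n",
               (unsigned long long)start, (unsigned long long)len);
}

/* Record a freed range, merging with one adjacent extent when possible. */
void note_freed(uint64_t start, uint64_t len)
{
        int i, slot = -1;

        for (i = 0; i < MAX_EXTENTS; i++) {
                if (!tree[i].used) {
                        if (slot < 0)
                                slot = i;
                        continue;
                }
                if (tree[i].start + tree[i].len == start) {
                        tree[i].len += len;             /* extend forward */
                } else if (start + len == tree[i].start) {
                        tree[i].start = start;          /* extend backward */
                        tree[i].len += len;
                } else {
                        continue;
                }
                if (tree[i].len >= DISCARD_THRESHOLD) {
                        issue_discard(tree[i].start, tree[i].len);
                        tree[i].used = false;           /* "delete the leaf" */
                }
                return;
        }
        if (slot >= 0) {                /* no neighbour found: new extent */
                tree[slot].start = start;
                tree[slot].len = len;
                tree[slot].used = true;
        }
}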

Greg
--
Greg Freemyer

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-14 22:03               ` Mark Lord
@ 2009-08-14 22:54                 ` Greg Freemyer
  -1 siblings, 0 replies; 208+ messages in thread
From: Greg Freemyer @ 2009-08-14 22:54 UTC (permalink / raw)
  To: Mark Lord
  Cc: James Bottomley, david, Markus Trippelsdorf, Matthew Wilcox,
	Hugh Dickins, Nitin Gupta, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm, linux-scsi, linux-ide, Linux RAID

On Fri, Aug 14, 2009 at 6:03 PM, Mark Lord<liml@rtr.ca> wrote:
> James Bottomley wrote:
>>
>> On Thu, 2009-08-13 at 14:15 -0400, Greg Freemyer wrote:
>>>
>>> On Thu, Aug 13, 2009 at 12:33 PM, <david@lang.hm> wrote:
>>>>
>>>> On Thu, 13 Aug 2009, Markus Trippelsdorf wrote:
>>>>
>>>>> On Thu, Aug 13, 2009 at 08:13:12AM -0700, Matthew Wilcox wrote:
>>>>>>
>>>>>> I am planning a complete overhaul of the discard work.  Users can send
>>>>>> down discard requests as frequently as they like.  The block layer
>>>>>> will
>>>>>> cache them, and invalidate them if writes come through.  Periodically,
>>>>>> the block layer will send down a TRIM or an UNMAP (depending on the
>>>>>> underlying device) and get rid of the blocks that have remained
>>>>>> unwanted
>>>>>> in the interim.
>>>>>
>>>>> That is a very good idea. I've tested your original TRIM implementation
>>>>> on
>>>>> my Vertex yesterday and it was awful ;-). The SSD needs hundreds of
>>>>> milliseconds to digest a single TRIM command. And since your
>>>>> implementation
>>>>> sends a TRIM for each extent of each deleted file, the whole system is
>>>>> unusable after a short while.
>>>>> An optimal solution would be to consolidate the discard requests,
>>>>> bundle
>>>>> them and send them to the drive as infrequent as possible.
>>>>
>>>> or queue them up and send them when the drive is idle (you would need to
>>>> keep track to make sure the space isn't re-used)
>>>>
>>>> as an example, if you would consider spinning down a drive you don't
>>>> hurt
>>>> performance by sending accumulated trim commands.
>>>>
>>>> David Lang
>>>
>>> An alternate approach is the block layer maintain its own bitmap of
>>> used unused sectors / blocks. Unmap commands from the filesystem just
>>> cause the bitmap to be updated.  No other effect.
>>>
>>> (Big unknown: Where will the bitmap live between reboots?  Require DM
>>> volumes so we can have a dedicated bitmap volume in the mix to store
>>> the bitmap to? Maybe on mount, the filesystem has to be scanned to
>>> initially populate the bitmap?   Other options?)
>>
>> I wouldn't really have it live anywhere.  Discard is best effort; it's
>> not required for fs integrity.  As long as we don't discard an in-use
>> block we're free to do anything else (including forget to discard,
>> rediscard a discarded block etc).
>>
>> It is theoretically possible to run all of this from user space using
>> the fs mappings, a bit like a defrag command.
>
> ..
>
> Already a work-in-progress -- see my wiper.sh script on the hdparm page
> at sourceforge.  Trimming 50+GB of free space on a 120GB Vertex
> (over 100 million sectors) takes a *single* TRIM command,
> and completes in only a couple of seconds.
>
> Cheers
>
Mark,

What filesystems does your script support?  Running a tool like this
in the middle of the night makes a lot of sense to me, even from the
perspective of many / most enterprise users.

How do you prevent a race where a block becomes used between userspace
asking for the free-space status and sending the discard request?

ps: I tried to pull wiper.sh straight from sourceforge, but I'm
getting some crazy page asking all sorts of questions and not letting
me bypass it.  I hope sourceforge is broken.  The other option is they
meant to do this. :(

Greg
-- 
Greg Freemyer

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-14 22:10                 ` Mark Lord
@ 2009-08-14 23:21                   ` Chris Worley
  -1 siblings, 0 replies; 208+ messages in thread
From: Chris Worley @ 2009-08-14 23:21 UTC (permalink / raw)
  To: Mark Lord
  Cc: Bryan Donlan, david, Greg Freemyer, Markus Trippelsdorf,
	Matthew Wilcox, Hugh Dickins, Nitin Gupta, Ingo Molnar,
	Peter Zijlstra, linux-kernel, linux-mm, linux-scsi, linux-ide,
	Linux RAID

On Fri, Aug 14, 2009 at 4:10 PM, Mark Lord<liml@rtr.ca> wrote:
> Bryan Donlan wrote:
> ..
>>
>> Perhaps an interface (ioctl, etc) can be added to ask a filesystem to
>> discard all unused blocks in a certain range? (That is, have the
>> filesystem validate the request under any necessary locks before
>> passing it to the block IO layer)
>
> ..
>
> While possibly TRIM-specific, this approach has the lowest overhead
> and probably the greatest gain-for-pain ratio.
>
> But it may not be as nice for enterprise (?).
>
> On the Indilinx-based SSDs (eg. OCZ Vertex), TRIM seems to trigger an
> internal garbage-collection/erase cycle.  As such, the drive really prefers
> a few LARGE trim lists, rather than many smaller ones.
>
> Here's some information that a vendor has observed from the Win7 use of
> TRIM:
>
>> TRIM command is sent:
>> -       About 2/3 of partition is filled up, when file is deleted.
>>        (I am not talking about send file to trash bin.)
>> -       In the above case, when trash bin gets emptied.
>> -       In the above case, when partition is deleted.
>>
>> TRIM command is not sent:-
>> -       When file is moved to trash bin
>> -       When partition is formatted. (Both quick and full format)
>> -       When empty partition is deleted
>> -       When file is deleted while there is big remaining free space
>
> ..
>
> His words, not mine.  But the idea seems to be to batch them in large
> chunks.

Sooner is better than waiting to coalesce.  The longer an LBA is
inactive, the better for any management scheme.  If you wait until
it's reused, you might as well forgo the advantages of TRIM/UNMAP.  If
the controller wants to coalesce, let it coalesce.

Chris
>
> My wiper.sh "trim script" is packaged with the latest hdparm (currently
> 9.24)
> on sourceforge, for those who want to try this stuff for real.  No special
> kernel support is required to use it.
>
> Cheers
>
> Mark

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-14 23:21                   ` Chris Worley
@ 2009-08-14 23:45                     ` Matthew Wilcox
  -1 siblings, 0 replies; 208+ messages in thread
From: Matthew Wilcox @ 2009-08-14 23:45 UTC (permalink / raw)
  To: Chris Worley
  Cc: Mark Lord, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Fri, Aug 14, 2009 at 05:21:32PM -0600, Chris Worley wrote:
> Sooner is better than waiting to coalesce.  The longer an LBA is
> inactive, the better for any management scheme.  If you wait until
> it's reused, you might as well forgo the advantages of TRIM/UNMAP.  If
> a the controller wants to coalesce, let it coalesce.

I'm sorry, you're wrong.  There is a tradeoff point, and it's different
for each drive model.  Sending down a steady stream of tiny TRIMs is
going to give terrible performance.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-14 23:45                     ` Matthew Wilcox
@ 2009-08-15  0:19                       ` Chris Worley
  -1 siblings, 0 replies; 208+ messages in thread
From: Chris Worley @ 2009-08-15  0:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mark Lord, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Fri, Aug 14, 2009 at 5:45 PM, Matthew Wilcox<matthew@wil.cx> wrote:
> On Fri, Aug 14, 2009 at 05:21:32PM -0600, Chris Worley wrote:
>> Sooner is better than waiting to coalesce.  The longer an LBA is
>> inactive, the better for any management scheme.  If you wait until
>> it's reused, you might as well forgo the advantages of TRIM/UNMAP.  If
>> a the controller wants to coalesce, let it coalesce.
>
> I'm sorry, you're wrong.  There is a tradeoff point, and it's different
> for each drive model.  Sending down a steady stream of tiny TRIMs is
> going to give terrible performance.

Sounds like you might be using junk for a device?

For junk, a little coalescing may be warranted... like in the I/O
scheduler, but no more than 100usecs wait before posting, or else you
affect high-performing devices too.

Chris
>
> --
> Matthew Wilcox                          Intel Open Source Technology Centre
> "Bill, look, we understand that you're interested in selling us this
> operating system, but compare it to ours.  We can't possibly take such
> a retrograde step."
>

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-15  0:19                       ` Chris Worley
@ 2009-08-15  0:30                         ` Greg Freemyer
  -1 siblings, 0 replies; 208+ messages in thread
From: Greg Freemyer @ 2009-08-15  0:30 UTC (permalink / raw)
  To: Chris Worley
  Cc: Matthew Wilcox, Mark Lord, Bryan Donlan, david,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Fri, Aug 14, 2009 at 8:19 PM, Chris Worley<worleys@gmail.com> wrote:
> On Fri, Aug 14, 2009 at 5:45 PM, Matthew Wilcox<matthew@wil.cx> wrote:
>> On Fri, Aug 14, 2009 at 05:21:32PM -0600, Chris Worley wrote:
>>> Sooner is better than waiting to coalesce.  The longer an LBA is
>>> inactive, the better for any management scheme.  If you wait until
>>> it's reused, you might as well forgo the advantages of TRIM/UNMAP.  If
>>> a the controller wants to coalesce, let it coalesce.
>>
>> I'm sorry, you're wrong.  There is a tradeoff point, and it's different
>> for each drive model.  Sending down a steady stream of tiny TRIMs is
>> going to give terrible performance.
>
> Sounds like you might be using junk for a device?
>
> For junk, a little coalescing may be warranted... like in the I/O
> schedular, but no more than 100usecs wait before posting, or then you
> effect high performing devices too.
>
> Chris

Why?

AIUI, on every write a high performing device allocates a new erase
block from its free lists, writes to it, and puts the now unused erase
block on the free list.  That erase block becomes available for reuse
some milliseconds later.

As long as the SSD has enough free erase blocks to work with I see no
disadvantage in delaying a discard by minutes, hours or days in most
cases.  The exception is when the filesystem is almost full and the
SSD is short of erase blocks to work with.

In that case it will want to get as many free erase blocks as it can
as fast as it can get them.
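
A sketch of that policy, purely illustrative (the structure and the
watermark are invented here; a real drive keeps this logic in its
firmware): defer processing queued discards while the pool of pre-erased
blocks is comfortable, and treat them as urgent only once it drops below
a low watermark.

#include <stdbool.h>
#include <stddef.h>

struct ssd_state {
        size_t total_erase_blocks;
        size_t free_erase_blocks;       /* pre-erased, ready for new writes */
        size_t queued_discards;         /* trimmed ranges not yet reclaimed */
};

/* Hypothetical policy: is it urgent to process queued discards now? */
bool must_reclaim_now(const struct ssd_state *s)
{
        size_t low_watermark = s->total_erase_blocks / 20;  /* ~5%, made up */

        return s->free_erase_blocks < low_watermark && s->queued_discards > 0;
}

Under a scheme like that, a discard delayed by minutes or hours costs
nothing as long as free erase blocks stay plentiful.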

Greg

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-15  0:30                         ` Greg Freemyer
  (?)
@ 2009-08-15  0:38                           ` Chris Worley
  -1 siblings, 0 replies; 208+ messages in thread
From: Chris Worley @ 2009-08-15  0:38 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: Matthew Wilcox, Mark Lord, Bryan Donlan, david,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Fri, Aug 14, 2009 at 6:30 PM, Greg Freemyer<greg.freemyer@gmail.com> wrote:
> On Fri, Aug 14, 2009 at 8:19 PM, Chris Worley<worleys@gmail.com> wrote:
>> On Fri, Aug 14, 2009 at 5:45 PM, Matthew Wilcox<matthew@wil.cx> wrote:
>>> On Fri, Aug 14, 2009 at 05:21:32PM -0600, Chris Worley wrote:
>>>> Sooner is better than waiting to coalesce.  The longer an LBA is
>>>> inactive, the better for any management scheme.  If you wait until
>>>> it's reused, you might as well forgo the advantages of TRIM/UNMAP.  If
>>>> a the controller wants to coalesce, let it coalesce.
>>>
>>> I'm sorry, you're wrong.  There is a tradeoff point, and it's different
>>> for each drive model.  Sending down a steady stream of tiny TRIMs is
>>> going to give terrible performance.
>>
>> Sounds like you might be using junk for a device?
>>
>> For junk, a little coalescing may be warranted... like in the I/O
>> schedular, but no more than 100usecs wait before posting, or then you
>> effect high performing devices too.
>>
>> Chris
>
> Why?
>
> AIUI, on every write a high performing device allocates a new erase
> block from its free lists, writes to it, and puts the now unused erase
> block on the free list.

So erase blocks are 512 bytes (if I write 512 bytes, an erase block is
now freed)?  Not true.

>  That erase block becomes available for reuse
> some milliseconds later.
>
> As long as the SSD has enough free erase blocks to work with I see no
> disadvantage in delaying a discard by minutes, hours or days in most
> cases.  The exception is when the filesystem is almost full and the
> SSD is short of erase blocks to work with.

That "exception..." is another good reason why.

>
> In that case it will want to get as many free erase blocks as it can
> as fast as it can get them.

Exactly.

Chris
>
> Greg
>

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-15  0:38                           ` Chris Worley
  (?)
@ 2009-08-15  1:55                             ` Greg Freemyer
  -1 siblings, 0 replies; 208+ messages in thread
From: Greg Freemyer @ 2009-08-15  1:55 UTC (permalink / raw)
  To: Chris Worley
  Cc: Matthew Wilcox, Mark Lord, Bryan Donlan, david,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Fri, Aug 14, 2009 at 8:38 PM, Chris Worley<worleys@gmail.com> wrote:
> On Fri, Aug 14, 2009 at 6:30 PM, Greg Freemyer<greg.freemyer@gmail.com> wrote:
>> On Fri, Aug 14, 2009 at 8:19 PM, Chris Worley<worleys@gmail.com> wrote:
>>> On Fri, Aug 14, 2009 at 5:45 PM, Matthew Wilcox<matthew@wil.cx> wrote:
>>>> On Fri, Aug 14, 2009 at 05:21:32PM -0600, Chris Worley wrote:
>>>>> Sooner is better than waiting to coalesce.  The longer an LBA is
>>>>> inactive, the better for any management scheme.  If you wait until
>>>>> it's reused, you might as well forgo the advantages of TRIM/UNMAP.  If
>>>>> a the controller wants to coalesce, let it coalesce.
>>>>
>>>> I'm sorry, you're wrong.  There is a tradeoff point, and it's different
>>>> for each drive model.  Sending down a steady stream of tiny TRIMs is
>>>> going to give terrible performance.
>>>
>>> Sounds like you might be using junk for a device?
>>>
>>> For junk, a little coalescing may be warranted... like in the I/O
>>> schedular, but no more than 100usecs wait before posting, or then you
>>> effect high performing devices too.
>>>
>>> Chris
>>
>> Why?
>>
>> AIUI, on every write a high performing device allocates a new erase
>> block from its free lists, writes to it, and puts the now unused erase
>> block on the free list.
>
> So erase blocks are 512 bytes (if I write 512 bytes, an erase block is
> now freed)?  Not true.

Seriously, how do you know?  Are you under NDA?

The white paper I read about typical SSD design described a partial
erase-block write as:

Internal logic/micro-controller performs:

Read erase block, modify erase block, allocate new erase block, write
new erase block, free the now-unused old erase block; the old erase
block is added to a hardware erase queue that performs the actual erase
in the background at the relatively slow speed of multiple milliseconds.

The purpose of the trim/discard command is to allow the SSD to keep
enough free erase blocks ready to go that writes don't have to
stall while they wait for an erase block to pop out of the erase queue.
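
As a sketch of that write path (following the white-paper description
above, not any particular vendor's firmware; the erase-block size, the
toy pool and the helpers are all invented for illustration):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EB_SIZE         (128 * 1024)    /* assumed erase-block size */
#define NR_BLOCKS       16              /* toy pool */

struct erase_block {
        uint8_t data[EB_SIZE];
};

static struct erase_block pool[NR_BLOCKS];
static unsigned int next_free;

/* Toy allocator: ignores wear levelling and in-use tracking. */
static struct erase_block *alloc_erased_block(void)
{
        return &pool[next_free++ % NR_BLOCKS];
}

static void queue_for_background_erase(struct erase_block *eb)
{
        /* a real drive erases this asynchronously, over milliseconds */
        memset(eb->data, 0xff, EB_SIZE);
}

/*
 * Partial write, per the description above: copy the old erase block
 * into a pre-erased one, apply the modification, retire the old block.
 */
struct erase_block *rewrite_partial(struct erase_block *old, size_t off,
                                    const void *buf, size_t len)
{
        struct erase_block *new = alloc_erased_block();

        memcpy(new->data, old->data, EB_SIZE);  /* read */
        memcpy(new->data + off, buf, len);      /* modify */
        queue_for_background_erase(old);        /* old block joins erase queue */
        return new;                             /* LBAs now map to the new block */
}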

Greg

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: compcache as a pre-swap area
  2009-08-14 15:49               ` Al Boldi
@ 2009-08-15 11:00                 ` Al Boldi
  -1 siblings, 0 replies; 208+ messages in thread
From: Al Boldi @ 2009-08-15 11:00 UTC (permalink / raw)
  To: ngupta
  Cc: Hugh Dickins, Matthew Wilcox, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm

Al Boldi wrote:
> Nitin Gupta wrote:
> > On 08/14/2009 09:32 AM, Al Boldi wrote:
> > > So once compcache fills up, it will start to age its contents into
> > > normal swap?
> >
> > This is desirable but not yet implemented. For now, if 'backing swap' is
> > used, compcache will forward incompressible pages to the backing swap
> > device. If compcache fills up, kernel will simply send further swap-outs
> > to swap device which comes next in priority.
>
> Ok, this sounds acceptable for now.
>
> The important thing now is to improve performance to a level comparable to
> a system with normal ssd-swap.  Do you have such a comparisson?
>
> Another interresting benchmark would be to use compcache in a maximized
> configuration, ie. on a system w/ 1024KB Ram assign 960KB for compcache and
> leave 64KB for the system, and then see how it performs.  This may easily
> pinpoint any bottlenecks compcache has, if any.

I am wondering, is it possible to run a system in 64KB?

Ok, make that MB instead.


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-15  0:19                       ` Chris Worley
@ 2009-08-15 12:59                         ` James Bottomley
  -1 siblings, 0 replies; 208+ messages in thread
From: James Bottomley @ 2009-08-15 12:59 UTC (permalink / raw)
  To: Chris Worley
  Cc: Matthew Wilcox, Mark Lord, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Fri, 2009-08-14 at 18:19 -0600, Chris Worley wrote:
> On Fri, Aug 14, 2009 at 5:45 PM, Matthew Wilcox<matthew@wil.cx> wrote:
> > On Fri, Aug 14, 2009 at 05:21:32PM -0600, Chris Worley wrote:
> >> Sooner is better than waiting to coalesce.  The longer an LBA is
> >> inactive, the better for any management scheme.  If you wait until
> >> it's reused, you might as well forgo the advantages of TRIM/UNMAP.  If
> >> a the controller wants to coalesce, let it coalesce.
> >
> > I'm sorry, you're wrong.  There is a tradeoff point, and it's different
> > for each drive model.  Sending down a steady stream of tiny TRIMs is
> > going to give terrible performance.
> 
> Sounds like you might be using junk for a device?
> 
> For junk, a little coalescing may be warranted... like in the I/O
> schedular, but no more than 100usecs wait before posting, or then you
> effect high performing devices too.

Um, I think you missed the original point in all of this at the
beginning of the thread:  On ATA, TRIM commands cannot be tagged.  This
means you have to drain the outstanding NCQ commands (stalling the
device) before you can send a TRIM.   If we do this for every discard,
the performance impact will be pretty devastating, hence the need to
coalesce.  It's nothing really to do with device characteristics, it's
an ATA protocol problem.

James



^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-14 22:54                 ` Greg Freemyer
@ 2009-08-15 13:12                   ` Mark Lord
  -1 siblings, 0 replies; 208+ messages in thread
From: Mark Lord @ 2009-08-15 13:12 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: James Bottomley, david, Markus Trippelsdorf, Matthew Wilcox,
	Hugh Dickins, Nitin Gupta, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm, linux-scsi, linux-ide, Linux RAID

Greg Freemyer wrote:
>
> What filesystems does your script support?  Running a tool like this
> in the middle of the night makes a lot of sense to me even from the
> perspective of many / most enterprise users.
..

It is designed to work on any *mounted* filesystem that supports
the fallocate() system call.  It uses fallocate() to reserve the
free space in a temporary file without any I/O, and then FIEMAP/FIBMAP
to get the block lists from the fallocated file, and then SGIO/ATA_16:TRIM
to discard the space, before deleting the fallocated file.
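
(A rough sketch of that sequence, with minimal error handling; wiper.sh itself
is a shell script, the temporary path below is made up, and the actual
SG_IO/ATA_16 TRIM step is stubbed out here:)

/* Sketch only: reserve the free space, find where it lives, then trim it. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

static void issue_trim(uint64_t phys_byte, uint64_t len)
{
	/* placeholder for the SG_IO/ATA_16 TRIM step */
	printf("would TRIM %llu + %llu bytes\n",
	       (unsigned long long)phys_byte, (unsigned long long)len);
}

int main(void)
{
	const char *tmp = "/mnt/target/.wiper-tmp";	/* hypothetical path */
	uint64_t reserve = 1ULL << 30;			/* grab 1 GiB of free space */
	unsigned int i, next = 64;			/* 64 extents per FIEMAP call (simplified) */
	struct fiemap *fm;
	int fd;

	fd = open(tmp, O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (fd < 0 || fallocate(fd, 0, 0, reserve) != 0)
		return 1;
	fsync(fd);					/* force the extents onto disk */

	fm = calloc(1, sizeof(*fm) + next * sizeof(struct fiemap_extent));
	fm->fm_start = 0;
	fm->fm_length = reserve;
	fm->fm_flags = FIEMAP_FLAG_SYNC;
	fm->fm_extent_count = next;
	if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0)		/* ask the fs for the physical extents */
		for (i = 0; i < fm->fm_mapped_extents; i++)
			issue_trim(fm->fm_extents[i].fe_physical,
				   fm->fm_extents[i].fe_length);

	close(fd);
	unlink(tmp);					/* hand the space back */
	free(fm);
	return 0;
}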

Tested by me on ext4 and xfs.  btrfs has a bug that prevents the fallocate
from succeeding at present, but CM says they're trying to fix that.

It will also work on *unmounted* ext2/ext3/ext4 filesystems,
using dumpe2fs to get the free lists, and on xfs using xfs_db there.

HFS(+) support is coming as well.

Not currently compatible with LVM 1/2, or with some distros that use
imaginary device names in /proc/mounts --> I'm working on those issues.


> ps: I tried to pull wiper.sh straight from sourceforge, but I'm
> getting some crazy page asking all sorts of questions and not letting
> me bypass it.  I hope sourceforge is broken.  The other option is they
> meant to do this. :(
..

That's weird.  It should just be a simple click/download,
though you will need to also upgrade hdparm to the latest version.

Cheers

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-15  0:38                           ` Chris Worley
@ 2009-08-15 13:20                             ` Mark Lord
  -1 siblings, 0 replies; 208+ messages in thread
From: Mark Lord @ 2009-08-15 13:20 UTC (permalink / raw)
  To: Chris Worley
  Cc: Greg Freemyer, Matthew Wilcox, Bryan Donlan, david,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

Chris Worley wrote:
..
> So erase blocks are 512 bytes (if I write 512 bytes, an erase block is
> now freed)?  Not true.
..

No, erase blocks are typically 512 KILO-bytes, or 1024 sectors.
Logical write blocks are only 512 bytes, but most drives out there
now actually use 4096 bytes as the native internal write size.

Lots of issues there.

The only existing "in the wild" TRIM-capable SSDs today all incur
large overheads from TRIM --> they seem to run a garbage-collection
and erase cycle for each TRIM command, typically taking 100s of milliseconds
regardless of the amount being trimmed.

So it makes sense to gather small TRIMs into single larger TRIMs.

But I think, even better, is to just not bother with the bookkeeping,
and instead have the filesystem periodically just issue a TRIM for all
free blocks within a block group, cycling through the block groups
one by one over time.
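
(A minimal sketch of that shape from user space, using the FITRIM ioctl that
Linux later grew for exactly this kind of filesystem-driven batched discard --
it did not exist when this thread was written:)

/* Sketch, assuming the (later) FITRIM interface: the filesystem walks its
 * own free-space map and issues batched discards for the free extents. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FITRIM, struct fstrim_range */

int trim_whole_fs(const char *mountpoint)
{
	struct fstrim_range range;
	int fd = open(mountpoint, O_RDONLY);

	if (fd < 0)
		return -1;
	memset(&range, 0, sizeof(range));
	range.len = UINT64_MAX;		/* whole filesystem */
	range.minlen = 0;		/* let the fs pick a sensible minimum */
	if (ioctl(fd, FITRIM, &range) < 0) {	/* one call, many free extents */
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

Run from cron once a day, or whenever the box is idle, that approximates the
"cycle through the block groups over time" idea without per-delete bookkeeping.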

That's how I'd like it to work on my own machine here.
Server/enterprise users very likely want something different.

Pluggable architecture, anyone?  :)

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-15 12:59                         ` James Bottomley
@ 2009-08-15 13:22                           ` Mark Lord
  -1 siblings, 0 replies; 208+ messages in thread
From: Mark Lord @ 2009-08-15 13:22 UTC (permalink / raw)
  To: James Bottomley
  Cc: Chris Worley, Matthew Wilcox, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

James Bottomley wrote:
>
> This means you have to drain the outstanding NCQ commands (stalling the
> device) before you can send a TRIM.   If we do this for every discard,
> the performance impact will be pretty devastating, hence the need to
> coalesce.  It's nothing really to do with device characteristics, it's
> an ATA protocol problem.
..

I don't think that's really much of an issue -- we already have to do
that for cache-flushes whenever barriers are enabled.  Yes it costs,
but not too much.

The current problem is that the only existing SSDs in the wild with TRIM
take hundreds of milliseconds per TRIM, mostly regardless of the amount being
TRIMmed.  Sure, some TRIMs take only 10-20ms, and very large ones (millions
of sectors) can take 1-2 seconds, but most are in the 100ms range.

Cheers

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-15 13:22                           ` Mark Lord
@ 2009-08-15 13:55                             ` James Bottomley
  -1 siblings, 0 replies; 208+ messages in thread
From: James Bottomley @ 2009-08-15 13:55 UTC (permalink / raw)
  To: Mark Lord
  Cc: Chris Worley, Matthew Wilcox, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Sat, 2009-08-15 at 09:22 -0400, Mark Lord wrote:
> James Bottomley wrote:
> >
> > This means you have to drain the outstanding NCQ commands (stalling the
> > device) before you can send a TRIM.   If we do this for every discard,
> > the performance impact will be pretty devastating, hence the need to
> > coalesce.  It's nothing really to do with device characteristics, it's
> > an ATA protocol problem.
> ..
> 
> I don't think that's really much of an issue -- we already have to do
> that for cache-flushes whenever barriers are enabled.  Yes it costs,
> but not too much.

That's not really what the enterprise is saying about flush barriers.
True, not all the performance problems are NCQ queue drain, but for a
steady workload they are significant.

James

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-15 13:55                             ` James Bottomley
@ 2009-08-15 17:39                               ` jim owens
  -1 siblings, 0 replies; 208+ messages in thread
From: jim owens @ 2009-08-15 17:39 UTC (permalink / raw)
  To: James Bottomley
  Cc: Mark Lord, Chris Worley, Matthew Wilcox, Bryan Donlan, david,
	Greg Freemyer, Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins,
	Nitin Gupta, Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm,
	linux-scsi, linux-ide, Linux RAID

James Bottomley wrote:
> 
> That's not really what the enterprise is saying about flush barriers.
> True, not all the performance problems are NCQ queue drain, but for a
> steady workload they are significant.

OK, we now know that SSDs designed only to the letter of the ATA
spec will suck doing discards if we send them down as we are
doing today.

Having finally caught up with this thread, I'm going to add some
comments that James already knows but were not stated that some
of the others apparently don't know :

- The current filesystem/blockdev behavior with discard TRIM was
   argued and added quickly because this design was what the
   Intel SSD architect told us was "the right thing" in Sept 08.

- In the same workshop, Linus said "I'm tired of hardware
   vendors telling me to fix it because they are cheap and lazy",
   or something close to that, my memory gets bit-errors :)

- We decided not to track and coalesce the discards in the block
   or filesystem layer because of the high memory/performance cost.
   There is no cheap way to do this; all of the space management
   in filesystems is accepting some cost for some user benefit.

- Many people who live in filesystems (like me) are unconvinced
   that discard to SSD or an array will help in real world use,
   but the current discard design didn't seem to hurt us either.

***begin rant***

I have not seen any analysis of the benefit and cost to the
end user of the TRIM or array UNMAP.  We now see that TRIM
as implemented by some (all?) SSDs will come at high cost.
The cost is all borne by the host.  Do we get any benefit, or
is it all for the device vendor?  And when we subtract the cost
from the benefit, does the user actually benefit and how?

I'm tired of working around shit storage products and broken
device protocols from the "T" committees.  I suggest we just
add a "white list" of devices that handle the discard fast
and without us needing NCQ queue drain.  Then only send TRIM
to devices that are on the white list and throw the others
away in the block device layer.
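
(Purely to illustrate the shape of that idea -- the table, the model strings
and the helper below are hypothetical, not an existing kernel interface:)

#include <linux/string.h>
#include <linux/types.h>

/* Hypothetical: models known to complete TRIM quickly, without painful stalls. */
static const char * const discard_whitelist[] = {
	"EXAMPLE-FAST-SSD-1",		/* made-up model strings */
	"EXAMPLE-FAST-SSD-2",
	NULL,
};

/* Hypothetical helper: true if this model should receive discards at all. */
static bool device_discard_whitelisted(const char *model)
{
	int i;

	for (i = 0; discard_whitelist[i]; i++)
		if (strstr(model, discard_whitelist[i]))
			return true;
	return false;			/* not listed: drop the discard instead */
}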

I do enterprise systems and the cost of RAM in those systems
is awful.  And the databases and applications are always big
memory pigs.  Our customers always complain about the kernel
using too much memory and they will go ballistic if we take
1GB from their 512GB system unless we can really show them
significant benefit in their production.  And so far all
we have is "this is all good stuff" from array vendors.
[and yes, our hardware guys always give me the most pain]

If continuous discard is going to be a PITA for us, then
I say don't do it.  Just let a user-space tool do it when
the admin wants.  IMO it is no different than defragmenting,
where my experience with a continuous kernel defragmenter
was that it made a great sales gimmick, but in real production
most people saw no benefit and some had to shut it off
because it actually hurt them.  It is all about workload.

jim

P.S. Matthew, that SSD architect told me personally
that the trim of each 512 byte block before rewrite
will be a performance benefit, so if Intel SSDs are
not on the white list, please slap him for me.

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-15 13:55                             ` James Bottomley
@ 2009-08-16 14:05                               ` Alan Cox
  -1 siblings, 0 replies; 208+ messages in thread
From: Alan Cox @ 2009-08-16 14:05 UTC (permalink / raw)
  To: James Bottomley
  Cc: Mark Lord, Chris Worley, Matthew Wilcox, Bryan Donlan, david,
	Greg Freemyer, Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins,
	Nitin Gupta, Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm,
	linux-scsi, linux-ide, Linux RAID

On Sat, 15 Aug 2009 08:55:17 -0500
James Bottomley <James.Bottomley@suse.de> wrote:

> On Sat, 2009-08-15 at 09:22 -0400, Mark Lord wrote:
> > James Bottomley wrote:
> > >
> > > This means you have to drain the outstanding NCQ commands (stalling the
> > > device) before you can send a TRIM.   If we do this for every discard,
> > > the performance impact will be pretty devastating, hence the need to
> > > coalesce.  It's nothing really to do with device characteristics, it's
> > > an ATA protocol problem.
> > ..
> > 
> > I don't think that's really much of an issue -- we already have to do
> > that for cache-flushes whenever barriers are enabled.  Yes it costs,
> > but not too much.
> 
> That's not really what the enterprise is saying about flush barriers.
> True, not all the performance problems are NCQ queue drain, but for a
> steady workload they are significant.

Flush barriers are a nightmare for more than the enterprise. Your drive
basically goes for a hike for a bit, which trashes interactivity as well.
If the device can't do trim and the like without a drain I don't see much
point doing it at all, except maybe to wait for idle devices and run a
filesystem-managed background 'strimmer' thread to just weed out now-idle
blocks that have stayed idle - e.g. by adding an inode of all the deleted
untrimmed blocks and giving it an irregular empty?

Alan

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 14:05                               ` Alan Cox
@ 2009-08-16 14:16                                 ` Mark Lord
  -1 siblings, 0 replies; 208+ messages in thread
From: Mark Lord @ 2009-08-16 14:16 UTC (permalink / raw)
  To: Alan Cox
  Cc: James Bottomley, Chris Worley, Matthew Wilcox, Bryan Donlan,
	david, Greg Freemyer, Markus Trippelsdorf, Matthew Wilcox,
	Hugh Dickins, Nitin Gupta, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm, linux-scsi, linux-ide, Linux RAID

Alan Cox wrote:
>
> Flush barriers are nightmare for more than enterprise. You drive
> basically goes for a hike for a bit which trashes interactivity as well.
> If the device can't do trim and the like without a drain I don't see much
> point doing it at all, except maybe to wait for idle devices and run a
> filesystem managed background 'strimmer' thread to just weed out now idle
> blocks that have stayed idle - eg by adding an inode of all the deleted
> untrimmed blocks and giving it an irregular empty ?
..

Agreed.  And I believe Matthew also said something similar already.
TRIM for the current (only!) SSDs needs to be a "batched, once in a while"
operation, rather than something done continuously.

And ideally, "once in a while" might be once a day, or once we have more
than a significant percentage of the drive capacity ready for a TRIM.

It needs to batch a lot of stuff into a single TRIM,
and not do it very often at all.

Cheers

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 14:05                               ` Alan Cox
@ 2009-08-16 15:34                                 ` Arjan van de Ven
  -1 siblings, 0 replies; 208+ messages in thread
From: Arjan van de Ven @ 2009-08-16 15:34 UTC (permalink / raw)
  To: Alan Cox
  Cc: James Bottomley, Mark Lord, Chris Worley, Matthew Wilcox,
	Bryan Donlan, david, Greg Freemyer, Markus Trippelsdorf,
	Matthew Wilcox, Hugh Dickins, Nitin Gupta, Ingo Molnar,
	Peter Zijlstra, linux-kernel, linux-mm, linux-scsi, linux-ide,
	Linux RAID

On Sun, 16 Aug 2009 15:05:30 +0100
Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> On Sat, 15 Aug 2009 08:55:17 -0500
> James Bottomley <James.Bottomley@suse.de> wrote:
> 
> > On Sat, 2009-08-15 at 09:22 -0400, Mark Lord wrote:
> > > James Bottomley wrote:
> > > >
> > > > This means you have to drain the outstanding NCQ commands
> > > > (stalling the device) before you can send a TRIM.   If we do
> > > > this for every discard, the performance impact will be pretty
> > > > devastating, hence the need to coalesce.  It's nothing really
> > > > to do with device characteristics, it's an ATA protocol problem.
> > > ..
> > > 
> > > I don't think that's really much of an issue -- we already have
> > > to do that for cache-flushes whenever barriers are enabled.  Yes
> > > it costs, but not too much.
> > 
> > That's not really what the enterprise is saying about flush
> > barriers. True, not all the performance problems are NCQ queue
> > drain, but for a steady workload they are significant.
> 
> Flush barriers are nightmare for more than enterprise. You drive
> basically goes for a hike for a bit which trashes interactivity as
> well. If the device can't do trim and the like without a drain I
> don't see much point doing it at all, except maybe to wait for idle
> devices and run a filesystem managed background 'strimmer' thread to
> just weed out now idle blocks that have stayed idle - eg by adding an
> inode of all the deleted untrimmed blocks and giving it an irregular
> empty ?
> 

trim is mostly for SSDs though, and those tend not to have the "goes
for a hike" behavior as much......

I wonder if it's worse to batch stuff up, because then the trim itself
gets bigger and might take longer.....



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 15:34                                 ` Arjan van de Ven
@ 2009-08-16 15:44                                   ` Theodore Tso
  -1 siblings, 0 replies; 208+ messages in thread
From: Theodore Tso @ 2009-08-16 15:44 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Alan Cox, James Bottomley, Mark Lord, Chris Worley,
	Matthew Wilcox, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Sun, Aug 16, 2009 at 08:34:34AM -0700, Arjan van de Ven wrote:
> trim is mostly for ssd's though, and those tend to not have the "goes
> for a hike" behavior as much......

Mark Lord has claimed that the currently shipping SSDs take "hundreds
of milliseconds" for a TRIM command.  Compared to the usual latency
of an SSD, that could very well be considered "takes a coffee break"
behaviour; maybe not "goes for a hike", but enough that you wouldn't
want to be doing one all the time.

The story that I've heard which worries me is that those of us who
were silly enough to spend $400 and $800 on the first
generation X25-M drives may never get TRIM support, and that TRIM
support might only be offered on the second generation X25-M drives.
I certainly _hope_ that is not true, but in any case, I don't have any
TRIM capable drives at the moment, so it's not something which I'm set
up to test....

					- Ted

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 15:34                                 ` Arjan van de Ven
@ 2009-08-16 15:52                                   ` James Bottomley
  -1 siblings, 0 replies; 208+ messages in thread
From: James Bottomley @ 2009-08-16 15:52 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Alan Cox, Mark Lord, Chris Worley, Matthew Wilcox, Bryan Donlan,
	david, Greg Freemyer, Markus Trippelsdorf, Matthew Wilcox,
	Hugh Dickins, Nitin Gupta, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm, linux-scsi, linux-ide, Linux RAID

On Sun, 2009-08-16 at 08:34 -0700, Arjan van de Ven wrote:
> On Sun, 16 Aug 2009 15:05:30 +0100
> Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> 
> > On Sat, 15 Aug 2009 08:55:17 -0500
> > James Bottomley <James.Bottomley@suse.de> wrote:
> > 
> > > On Sat, 2009-08-15 at 09:22 -0400, Mark Lord wrote:
> > > > James Bottomley wrote:
> > > > >
> > > > > This means you have to drain the outstanding NCQ commands
> > > > > (stalling the device) before you can send a TRIM.   If we do
> > > > > this for every discard, the performance impact will be pretty
> > > > > devastating, hence the need to coalesce.  It's nothing really
> > > > > to do with device characteristics, it's an ATA protocol problem.
> > > > ..
> > > > 
> > > > I don't think that's really much of an issue -- we already have
> > > > to do that for cache-flushes whenever barriers are enabled.  Yes
> > > > it costs, but not too much.
> > > 
> > > That's not really what the enterprise is saying about flush
> > > barriers. True, not all the performance problems are NCQ queue
> > > drain, but for a steady workload they are significant.
> > 
> > Flush barriers are nightmare for more than enterprise. You drive
> > basically goes for a hike for a bit which trashes interactivity as
> > well. If the device can't do trim and the like without a drain I
> > don't see much point doing it at all, except maybe to wait for idle
> > devices and run a filesystem managed background 'strimmer' thread to
> > just weed out now idle blocks that have stayed idle - eg by adding an
> > inode of all the deleted untrimmed blocks and giving it an irregular
> > empty ? 
> 
> trim is mostly for ssd's though, and those tend to not have the "goes
> for a hike" behavior as much......

Well, yes and no ... a lot of SSDs don't actually implement NCQ, so the
impact to them will be less ... although I think enterprise class SSDs
do implement NCQ.

> I wonder if it's worse to batch stuff up, because then the trim itself
> gets bigger and might take longer.....

So this is where we're getting into the realms of speculation.  There
really are only a couple of people out there with TRIM-implementing
SSDs, so that's not really enough to make any judgement.

However, the enterprise has been doing UNMAP for a while, so we can draw
inferences from them since the SSD FTL will operate similarly.  For
them, UNMAP is the same cost in terms of time regardless of the number
of extents.  The reason is that it's moving the blocks from the global
in use list to the global free list.  Part of the problem is that this
involves locking and quiescing, so UNMAP ends up being quite expensive
to the array but constant in terms of cost (hence they want as few
unmaps for as many sectors as possible).

For SSDs, the FTL has to have a separate operation: erase.  Now, one
could see the correct implementation simply moving the sectors from the
in-use list to the to-be-cleaned list and still doing the cleaning in the
background: that would be constant cost (but, again, likely expensive).
Of course, if SSD vendors decided to erase on the spot when seeing TRIM,
this wouldn't be true ...

James


^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 15:52                                   ` James Bottomley
  (?)
@ 2009-08-16 16:32                                     ` Mark Lord
  -1 siblings, 0 replies; 208+ messages in thread
From: Mark Lord @ 2009-08-16 16:32 UTC (permalink / raw)
  To: James Bottomley
  Cc: Arjan van de Ven, Alan Cox, Chris Worley, Matthew Wilcox,
	Bryan Donlan, david, Greg Freemyer, Markus Trippelsdorf,
	Matthew Wilcox, Hugh Dickins, Nitin Gupta, Ingo Molnar,
	Peter Zijlstra, linux-kernel, linux-mm, linux-scsi, linux-ide,
	Linux RAID

James Bottomley wrote:
>
> For SSDs, the FTL has to have a separate operation: erase.  Now, one
> could see the correct implementation simply moving the sectors from the
> in-use list to the to be cleaned list and still do the cleaning in the
> background: that would be constant cost (but, again, likely expensive).
> Of course, if SSD vendors decided to erase on the spot when seeing TRIM,
> this wouldn't be true ...
..

The SSDs based upon the Indilinx Barefoot controller appear to do
the erase on the spot, along with a fair amount of garbage collection.
The overhead does vary by size of the TRIM operation (number of sectors
and extents), but even a single-sector TRIM has very high overhead.

Samsung also now has SSDs at retail with TRIM.
I don't have one of those here.

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 15:52                                   ` James Bottomley
@ 2009-08-16 16:59                                     ` Christoph Hellwig
  -1 siblings, 0 replies; 208+ messages in thread
From: Christoph Hellwig @ 2009-08-16 16:59 UTC (permalink / raw)
  To: James Bottomley
  Cc: Arjan van de Ven, Alan Cox, Mark Lord, Chris Worley,
	Matthew Wilcox, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Sun, Aug 16, 2009 at 10:52:07AM -0500, James Bottomley wrote:
> However, the enterprise has been doing UNMAP for a while, so we can draw
> inferences from them since the SSD FTL will operate similarly.  For
> them, UNMAP is the same cost in terms of time regardless of the number
> of extents.  The reason is that it's moving the blocks from the global
> in use list to the global free list.  Part of the problem is that this
> involves locking and quiescing, so UNMAP ends up being quite expensive
> to the array but constant in terms of cost (hence they want as few
> unmaps for as many sectors as possible).

How are they doing the unmaps?  Using something similar to Mark's wiper
script and using SG_IO?  Because right now we do not actually implement
UNMAP support in the kernel.  I'd really love to test the XFS batched
discard support with a real UNMAP implementation.


^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-15 17:39                               ` jim owens
@ 2009-08-16 17:08                                 ` Robert Hancock
  -1 siblings, 0 replies; 208+ messages in thread
From: Robert Hancock @ 2009-08-16 17:08 UTC (permalink / raw)
  To: jim owens
  Cc: James Bottomley, Mark Lord, Chris Worley, Matthew Wilcox,
	Bryan Donlan, david, Greg Freemyer, Markus Trippelsdorf,
	Matthew Wilcox, Hugh Dickins, Nitin Gupta, Ingo Molnar,
	Peter Zijlstra, linux-kernel, linux-mm, linux-scsi, linux-ide,
	Linux RAID

On 08/15/2009 11:39 AM, jim owens wrote:
> ***begin rant***
>
> I have not seen any analysis of the benefit and cost to the
> end user of the TRIM or array UNMAP. We now see that TRIM
> as implemented by some (all?) SSDs will come at high cost.
> The cost is all borne by the host. Do we get any benefit, or
> is it all for the device vendor? And when we subtract the cost
> from the benefit, does the user actually benefit and how?
>
> I'm tired of working around shit storage products and broken
> device protocols from the "T" committees. I suggest we just
> add a "white list" of devices that handle the discard fast
> and without us needing NCQ queue drain. Then only send TRIM
> to devices that are on the white list and throw the others
> away in the block device layer.

They all will require NCQ queue drain. It's an inherent requirement of 
the protocol that you can't overlap NCQ and non-NCQ commands, and the 
trim command is not NCQ.

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 15:44                                   ` Theodore Tso
  (?)
@ 2009-08-16 17:28                                   ` Mark Lord
  -1 siblings, 0 replies; 208+ messages in thread
From: Mark Lord @ 2009-08-16 17:28 UTC (permalink / raw)
  To: Theodore Tso, Arjan van de Ven, Alan Cox, James Bottomley,
	Mark Lord, Chris

Theodore Tso wrote:
..
> Mark Lord has claimed that the currently shipping SSD's take "hundreds
> of milliseconds" for a TRIM, command.
..

Here's some data to support that claim.

First, here is a series of TRIM commands for single extents
of varying lengths.

The measurements include the printk() timestamp, plus I had libata itself
use rdtsc() before/after each TRIM.  This is with a T7400 CPU booted
using maxcpus=1, and locked at 2.16GHz using "performance" CPU policy.

The first set of data is from individual single-extent TRIMs,
with a "sleep 1 ; sync" between each successive TRIM:

Beginning TRIM operations..
Trimming 1 free extents encompassing 656 sectors (0 MB)
Trimming 1 free extents encompassing 30 sectors (0 MB)
Trimming 1 free extents encompassing 194 sectors (0 MB)
Trimming 1 free extents encompassing 42 sectors (0 MB)
Trimming 1 free extents encompassing 1574 sectors (1 MB)
Trimming 1 free extents encompassing 612 sectors (0 MB)
Trimming 1 free extents encompassing 862 sectors (0 MB)
Trimming 1 free extents encompassing 1344 sectors (1 MB)
Trimming 1 free extents encompassing 822 sectors (0 MB)
Trimming 1 free extents encompassing 672 sectors (0 MB)
Trimming 1 free extents encompassing 226 sectors (0 MB)
Trimming 1 free extents encompassing 860 sectors (0 MB)
Trimming 1 free extents encompassing 638 sectors (0 MB)
Trimming 1 free extents encompassing 1020 sectors (0 MB)
Trimming 1 free extents encompassing 12286 sectors (6 MB)
Trimming 1 free extents encompassing 1964 sectors (1 MB)
Done.
[ 1083.768460] ata_qc_issue: ATA_CMD_DSM starting
[ 1083.768672] trim_completed: ATA_CMD_DSM took 438841 cycles
[ 1084.794304] ata_qc_issue: ATA_CMD_DSM starting
[ 1084.794469] trim_completed: ATA_CMD_DSM took 338065 cycles
[ 1085.823605] ata_qc_issue: ATA_CMD_DSM starting
[ 1085.823791] trim_completed: ATA_CMD_DSM took 382317 cycles
[ 1086.852989] ata_qc_issue: ATA_CMD_DSM starting
[ 1086.853166] trim_completed: ATA_CMD_DSM took 352248 cycles
[ 1087.882825] ata_qc_issue: ATA_CMD_DSM starting
[ 1087.883127] trim_completed: ATA_CMD_DSM took 624546 cycles
[ 1088.915833] ata_qc_issue: ATA_CMD_DSM starting
[ 1088.916056] trim_completed: ATA_CMD_DSM took 455299 cycles
[ 1089.941946] ata_qc_issue: ATA_CMD_DSM starting
[ 1089.942181] trim_completed: ATA_CMD_DSM took 485615 cycles
[ 1090.968793] ata_qc_issue: ATA_CMD_DSM starting
[ 1090.969062] trim_completed: ATA_CMD_DSM took 562042 cycles
[ 1091.994441] ata_qc_issue: ATA_CMD_DSM starting
[ 1091.994672] trim_completed: ATA_CMD_DSM took 479219 cycles
[ 1093.023576] ata_qc_issue: ATA_CMD_DSM starting
[ 1093.023799] trim_completed: ATA_CMD_DSM took 463398 cycles
[ 1094.053545] ata_qc_issue: ATA_CMD_DSM starting
[ 1094.053731] trim_completed: ATA_CMD_DSM took 385229 cycles
[ 1095.083131] ata_qc_issue: ATA_CMD_DSM starting
[ 1095.083356] trim_completed: ATA_CMD_DSM took 458328 cycles
[ 1096.113146] ata_qc_issue: ATA_CMD_DSM starting
[ 1096.113356] trim_completed: ATA_CMD_DSM took 423670 cycles
[ 1097.144211] ata_qc_issue: ATA_CMD_DSM starting
[ 1097.144464] trim_completed: ATA_CMD_DSM took 524706 cycles
[ 1098.174457] ata_qc_issue: ATA_CMD_DSM starting
[ 1098.175619] trim_completed: ATA_CMD_DSM took 2491138 cycles
[ 1099.209218] ata_qc_issue: ATA_CMD_DSM starting
[ 1099.209539] trim_completed: ATA_CMD_DSM took 674752 cycles

Those TRIMs look fine, in the single millisecond range.
But.. the "sleep 1" hides some drive firmware evils..
Here is exactly the same run again, but without the "sleep 1":

Beginning TRIM operations..
Trimming 1 free extents encompassing 656 sectors (0 MB)
Trimming 1 free extents encompassing 30 sectors (0 MB)
Trimming 1 free extents encompassing 194 sectors (0 MB)
Trimming 1 free extents encompassing 42 sectors (0 MB)
Trimming 1 free extents encompassing 1574 sectors (1 MB)
Trimming 1 free extents encompassing 612 sectors (0 MB)
Trimming 1 free extents encompassing 862 sectors (0 MB)
Trimming 1 free extents encompassing 1344 sectors (1 MB)
Trimming 1 free extents encompassing 822 sectors (0 MB)
Trimming 1 free extents encompassing 672 sectors (0 MB)
Trimming 1 free extents encompassing 226 sectors (0 MB)
Trimming 1 free extents encompassing 860 sectors (0 MB)
Trimming 1 free extents encompassing 638 sectors (0 MB)
Trimming 1 free extents encompassing 1020 sectors (0 MB)
Trimming 1 free extents encompassing 12286 sectors (6 MB)
Trimming 1 free extents encompassing 1964 sectors (1 MB)
Done.
[ 1258.206379] ata_qc_issue: ATA_CMD_DSM starting
[ 1258.206587] trim_completed: ATA_CMD_DSM took 426088 cycles
[ 1258.254513] ata_qc_issue: ATA_CMD_DSM starting
[ 1258.366141] trim_completed: ATA_CMD_DSM took 241231523 cycles
[ 1258.411749] ata_qc_issue: ATA_CMD_DSM starting
[ 1258.524047] trim_completed: ATA_CMD_DSM took 242676590 cycles
[ 1258.600184] ata_qc_issue: ATA_CMD_DSM starting
[ 1258.711766] trim_completed: ATA_CMD_DSM took 241136519 cycles
[ 1258.813515] ata_qc_issue: ATA_CMD_DSM starting
[ 1258.910599] trim_completed: ATA_CMD_DSM took 209803152 cycles
[ 1259.027253] ata_qc_issue: ATA_CMD_DSM starting
[ 1259.108916] trim_completed: ATA_CMD_DSM took 176473453 cycles
[ 1259.239549] ata_qc_issue: ATA_CMD_DSM starting
[ 1259.306640] trim_completed: ATA_CMD_DSM took 144968694 cycles
[ 1259.452978] ata_qc_issue: ATA_CMD_DSM starting
[ 1259.505017] trim_completed: ATA_CMD_DSM took 112440172 cycles
[ 1259.552393] ata_qc_issue: ATA_CMD_DSM starting
[ 1259.664739] trim_completed: ATA_CMD_DSM took 242778861 cycles
[ 1259.775724] ata_qc_issue: ATA_CMD_DSM starting
[ 1259.861318] trim_completed: ATA_CMD_DSM took 184955732 cycles
[ 1259.989289] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.059963] trim_completed: ATA_CMD_DSM took 152713730 cycles
[ 1260.211066] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.257474] trim_completed: ATA_CMD_DSM took 100279998 cycles
[ 1260.306277] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.417770] trim_completed: ATA_CMD_DSM took 240932835 cycles
[ 1260.464049] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.575418] trim_completed: ATA_CMD_DSM took 240673134 cycles
[ 1260.650624] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.763510] trim_completed: ATA_CMD_DSM took 243952865 cycles
[ 1260.810454] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.921433] trim_completed: ATA_CMD_DSM took 239832996 cycles

As you can see, we're now into the 100 millisecond range
for successive TRIM-followed-by-TRIM commands.

Those are all for single extents.  I will follow-up with a small
amount of similar data for TRIMs with multiple extents.

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 15:44                                   ` Theodore Tso
                                                     ` (2 preceding siblings ...)
  (?)
@ 2009-08-16 17:28                                   ` Mark Lord
  -1 siblings, 0 replies; 208+ messages in thread
From: Mark Lord @ 2009-08-16 17:28 UTC (permalink / raw)
  To: Theodore Tso, Arjan van de Ven, Alan Cox, James Bottomley,
	Mark Lord, Chris

Theodore Tso wrote:
..
> Mark Lord has claimed that the currently shipping SSD's take "hundreds
> of milliseconds" for a TRIM, command.
..

Here's some data to support that claim.

First, here are a series of TRIM commands for single-extents
of varying lengths.

The measures include the printk() timestamp, plus I had libata itself
use rdtsc() before/after each TRIM.  This is with a T7400 CPU booted
using maxcpus=1, and locked at 2.16GHz using "performance" CPU policy.

The first set of data, is from individual single-extent TRIMs,
with a "sleep 1 ; sync" between each successive TRIM:

Beginning TRIM operations..
Trimming 1 free extents encompassing 656 sectors (0 MB)
Trimming 1 free extents encompassing 30 sectors (0 MB)
Trimming 1 free extents encompassing 194 sectors (0 MB)
Trimming 1 free extents encompassing 42 sectors (0 MB)
Trimming 1 free extents encompassing 1574 sectors (1 MB)
Trimming 1 free extents encompassing 612 sectors (0 MB)
Trimming 1 free extents encompassing 862 sectors (0 MB)
Trimming 1 free extents encompassing 1344 sectors (1 MB)
Trimming 1 free extents encompassing 822 sectors (0 MB)
Trimming 1 free extents encompassing 672 sectors (0 MB)
Trimming 1 free extents encompassing 226 sectors (0 MB)
Trimming 1 free extents encompassing 860 sectors (0 MB)
Trimming 1 free extents encompassing 638 sectors (0 MB)
Trimming 1 free extents encompassing 1020 sectors (0 MB)
Trimming 1 free extents encompassing 12286 sectors (6 MB)
Trimming 1 free extents encompassing 1964 sectors (1 MB)
Done.
[ 1083.768460] ata_qc_issue: ATA_CMD_DSM starting
[ 1083.768672] trim_completed: ATA_CMD_DSM took 438841 cycles
[ 1084.794304] ata_qc_issue: ATA_CMD_DSM starting
[ 1084.794469] trim_completed: ATA_CMD_DSM took 338065 cycles
[ 1085.823605] ata_qc_issue: ATA_CMD_DSM starting
[ 1085.823791] trim_completed: ATA_CMD_DSM took 382317 cycles
[ 1086.852989] ata_qc_issue: ATA_CMD_DSM starting
[ 1086.853166] trim_completed: ATA_CMD_DSM took 352248 cycles
[ 1087.882825] ata_qc_issue: ATA_CMD_DSM starting
[ 1087.883127] trim_completed: ATA_CMD_DSM took 624546 cycles
[ 1088.915833] ata_qc_issue: ATA_CMD_DSM starting
[ 1088.916056] trim_completed: ATA_CMD_DSM took 455299 cycles
[ 1089.941946] ata_qc_issue: ATA_CMD_DSM starting
[ 1089.942181] trim_completed: ATA_CMD_DSM took 485615 cycles
[ 1090.968793] ata_qc_issue: ATA_CMD_DSM starting
[ 1090.969062] trim_completed: ATA_CMD_DSM took 562042 cycles
[ 1091.994441] ata_qc_issue: ATA_CMD_DSM starting
[ 1091.994672] trim_completed: ATA_CMD_DSM took 479219 cycles
[ 1093.023576] ata_qc_issue: ATA_CMD_DSM starting
[ 1093.023799] trim_completed: ATA_CMD_DSM took 463398 cycles
[ 1094.053545] ata_qc_issue: ATA_CMD_DSM starting
[ 1094.053731] trim_completed: ATA_CMD_DSM took 385229 cycles
[ 1095.083131] ata_qc_issue: ATA_CMD_DSM starting
[ 1095.083356] trim_completed: ATA_CMD_DSM took 458328 cycles
[ 1096.113146] ata_qc_issue: ATA_CMD_DSM starting
[ 1096.113356] trim_completed: ATA_CMD_DSM took 423670 cycles
[ 1097.144211] ata_qc_issue: ATA_CMD_DSM starting
[ 1097.144464] trim_completed: ATA_CMD_DSM took 524706 cycles
[ 1098.174457] ata_qc_issue: ATA_CMD_DSM starting
[ 1098.175619] trim_completed: ATA_CMD_DSM took 2491138 cycles
[ 1099.209218] ata_qc_issue: ATA_CMD_DSM starting
[ 1099.209539] trim_completed: ATA_CMD_DSM took 674752 cycles

Those TRIMs look fine, in the single millisecond range.
But.. the "sleep 1" hides some drive firmware evils..
Here is exactly the same run again, but without the "sleep 1":

Beginning TRIM operations..
Trimming 1 free extents encompassing 656 sectors (0 MB)
Trimming 1 free extents encompassing 30 sectors (0 MB)
Trimming 1 free extents encompassing 194 sectors (0 MB)
Trimming 1 free extents encompassing 42 sectors (0 MB)
Trimming 1 free extents encompassing 1574 sectors (1 MB)
Trimming 1 free extents encompassing 612 sectors (0 MB)
Trimming 1 free extents encompassing 862 sectors (0 MB)
Trimming 1 free extents encompassing 1344 sectors (1 MB)
Trimming 1 free extents encompassing 822 sectors (0 MB)
Trimming 1 free extents encompassing 672 sectors (0 MB)
Trimming 1 free extents encompassing 226 sectors (0 MB)
Trimming 1 free extents encompassing 860 sectors (0 MB)
Trimming 1 free extents encompassing 638 sectors (0 MB)
Trimming 1 free extents encompassing 1020 sectors (0 MB)
Trimming 1 free extents encompassing 12286 sectors (6 MB)
Trimming 1 free extents encompassing 1964 sectors (1 MB)
Done.
[ 1258.206379] ata_qc_issue: ATA_CMD_DSM starting
[ 1258.206587] trim_completed: ATA_CMD_DSM took 426088 cycles
[ 1258.254513] ata_qc_issue: ATA_CMD_DSM starting
[ 1258.366141] trim_completed: ATA_CMD_DSM took 241231523 cycles
[ 1258.411749] ata_qc_issue: ATA_CMD_DSM starting
[ 1258.524047] trim_completed: ATA_CMD_DSM took 242676590 cycles
[ 1258.600184] ata_qc_issue: ATA_CMD_DSM starting
[ 1258.711766] trim_completed: ATA_CMD_DSM took 241136519 cycles
[ 1258.813515] ata_qc_issue: ATA_CMD_DSM starting
[ 1258.910599] trim_completed: ATA_CMD_DSM took 209803152 cycles
[ 1259.027253] ata_qc_issue: ATA_CMD_DSM starting
[ 1259.108916] trim_completed: ATA_CMD_DSM took 176473453 cycles
[ 1259.239549] ata_qc_issue: ATA_CMD_DSM starting
[ 1259.306640] trim_completed: ATA_CMD_DSM took 144968694 cycles
[ 1259.452978] ata_qc_issue: ATA_CMD_DSM starting
[ 1259.505017] trim_completed: ATA_CMD_DSM took 112440172 cycles
[ 1259.552393] ata_qc_issue: ATA_CMD_DSM starting
[ 1259.664739] trim_completed: ATA_CMD_DSM took 242778861 cycles
[ 1259.775724] ata_qc_issue: ATA_CMD_DSM starting
[ 1259.861318] trim_completed: ATA_CMD_DSM took 184955732 cycles
[ 1259.989289] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.059963] trim_completed: ATA_CMD_DSM took 152713730 cycles
[ 1260.211066] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.257474] trim_completed: ATA_CMD_DSM took 100279998 cycles
[ 1260.306277] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.417770] trim_completed: ATA_CMD_DSM took 240932835 cycles
[ 1260.464049] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.575418] trim_completed: ATA_CMD_DSM took 240673134 cycles
[ 1260.650624] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.763510] trim_completed: ATA_CMD_DSM took 243952865 cycles
[ 1260.810454] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.921433] trim_completed: ATA_CMD_DSM took 239832996 cycles

As you can see, we're now into the 100 millisecond range
for successive TRIM-followed-by-TRIM commands.

Those are all for single extents.  I will follow-up with a small
amount of similar data for TRIMs with multiple extents.

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 15:44                                   ` Theodore Tso
@ 2009-08-16 17:28                                     ` Mark Lord
  -1 siblings, 0 replies; 208+ messages in thread
From: Mark Lord @ 2009-08-16 17:28 UTC (permalink / raw)
  To: Theodore Tso, Arjan van de Ven, Alan Cox, James Bottomley,
	Mark Lord, Chris Worley, Matthew Wilcox, Bryan Donlan, david,
	Greg Freemyer, Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins,
	Nitin Gupta, Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm,
	linux-scsi, linux-ide, Linux RAID

Theodore Tso wrote:
..
> Mark Lord has claimed that the currently shipping SSD's take "hundreds
> of milliseconds" for a TRIM, command.
..

Here's some data to support that claim.

First, here is a series of TRIM commands for single extents
of varying lengths.

The measurements include the printk() timestamp, plus I had libata itself
use rdtsc() before/after each TRIM.  This is with a T7400 CPU booted
with maxcpus=1 and locked at 2.16GHz using the "performance" CPU policy.
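
As a rough illustration of the timing method (sketch only -- read_tsc()
and fake_trim() below are made-up stand-ins, not the actual libata
instrumentation): the TSC is read at issue and again at completion, and
the difference is reported in cycles.

/*
 * Sketch only, not the real libata patch: bracket an operation with
 * rdtsc and report the delta in cycles.  fake_trim() stands in for
 * issuing ATA_CMD_DSM and waiting for it to complete.
 */
#include <stdio.h>
#include <unistd.h>

static inline unsigned long long read_tsc(void)
{
        unsigned int lo, hi;

        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
}

static void fake_trim(void)
{
        usleep(1000);                   /* pretend the command took ~1 ms */
}

int main(void)
{
        unsigned long long before = read_tsc();

        fake_trim();
        printf("ATA_CMD_DSM took %llu cycles\n", read_tsc() - before);
        return 0;
}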

The first set of data is from individual single-extent TRIMs,
with a "sleep 1 ; sync" between each successive TRIM:

Beginning TRIM operations..
Trimming 1 free extents encompassing 656 sectors (0 MB)
Trimming 1 free extents encompassing 30 sectors (0 MB)
Trimming 1 free extents encompassing 194 sectors (0 MB)
Trimming 1 free extents encompassing 42 sectors (0 MB)
Trimming 1 free extents encompassing 1574 sectors (1 MB)
Trimming 1 free extents encompassing 612 sectors (0 MB)
Trimming 1 free extents encompassing 862 sectors (0 MB)
Trimming 1 free extents encompassing 1344 sectors (1 MB)
Trimming 1 free extents encompassing 822 sectors (0 MB)
Trimming 1 free extents encompassing 672 sectors (0 MB)
Trimming 1 free extents encompassing 226 sectors (0 MB)
Trimming 1 free extents encompassing 860 sectors (0 MB)
Trimming 1 free extents encompassing 638 sectors (0 MB)
Trimming 1 free extents encompassing 1020 sectors (0 MB)
Trimming 1 free extents encompassing 12286 sectors (6 MB)
Trimming 1 free extents encompassing 1964 sectors (1 MB)
Done.
[ 1083.768460] ata_qc_issue: ATA_CMD_DSM starting
[ 1083.768672] trim_completed: ATA_CMD_DSM took 438841 cycles
[ 1084.794304] ata_qc_issue: ATA_CMD_DSM starting
[ 1084.794469] trim_completed: ATA_CMD_DSM took 338065 cycles
[ 1085.823605] ata_qc_issue: ATA_CMD_DSM starting
[ 1085.823791] trim_completed: ATA_CMD_DSM took 382317 cycles
[ 1086.852989] ata_qc_issue: ATA_CMD_DSM starting
[ 1086.853166] trim_completed: ATA_CMD_DSM took 352248 cycles
[ 1087.882825] ata_qc_issue: ATA_CMD_DSM starting
[ 1087.883127] trim_completed: ATA_CMD_DSM took 624546 cycles
[ 1088.915833] ata_qc_issue: ATA_CMD_DSM starting
[ 1088.916056] trim_completed: ATA_CMD_DSM took 455299 cycles
[ 1089.941946] ata_qc_issue: ATA_CMD_DSM starting
[ 1089.942181] trim_completed: ATA_CMD_DSM took 485615 cycles
[ 1090.968793] ata_qc_issue: ATA_CMD_DSM starting
[ 1090.969062] trim_completed: ATA_CMD_DSM took 562042 cycles
[ 1091.994441] ata_qc_issue: ATA_CMD_DSM starting
[ 1091.994672] trim_completed: ATA_CMD_DSM took 479219 cycles
[ 1093.023576] ata_qc_issue: ATA_CMD_DSM starting
[ 1093.023799] trim_completed: ATA_CMD_DSM took 463398 cycles
[ 1094.053545] ata_qc_issue: ATA_CMD_DSM starting
[ 1094.053731] trim_completed: ATA_CMD_DSM took 385229 cycles
[ 1095.083131] ata_qc_issue: ATA_CMD_DSM starting
[ 1095.083356] trim_completed: ATA_CMD_DSM took 458328 cycles
[ 1096.113146] ata_qc_issue: ATA_CMD_DSM starting
[ 1096.113356] trim_completed: ATA_CMD_DSM took 423670 cycles
[ 1097.144211] ata_qc_issue: ATA_CMD_DSM starting
[ 1097.144464] trim_completed: ATA_CMD_DSM took 524706 cycles
[ 1098.174457] ata_qc_issue: ATA_CMD_DSM starting
[ 1098.175619] trim_completed: ATA_CMD_DSM took 2491138 cycles
[ 1099.209218] ata_qc_issue: ATA_CMD_DSM starting
[ 1099.209539] trim_completed: ATA_CMD_DSM took 674752 cycles

Those TRIMs look fine, in the single millisecond range.
But.. the "sleep 1" hides some drive firmware evils..
Here is exactly the same run again, but without the "sleep 1":

Beginning TRIM operations..
Trimming 1 free extents encompassing 656 sectors (0 MB)
Trimming 1 free extents encompassing 30 sectors (0 MB)
Trimming 1 free extents encompassing 194 sectors (0 MB)
Trimming 1 free extents encompassing 42 sectors (0 MB)
Trimming 1 free extents encompassing 1574 sectors (1 MB)
Trimming 1 free extents encompassing 612 sectors (0 MB)
Trimming 1 free extents encompassing 862 sectors (0 MB)
Trimming 1 free extents encompassing 1344 sectors (1 MB)
Trimming 1 free extents encompassing 822 sectors (0 MB)
Trimming 1 free extents encompassing 672 sectors (0 MB)
Trimming 1 free extents encompassing 226 sectors (0 MB)
Trimming 1 free extents encompassing 860 sectors (0 MB)
Trimming 1 free extents encompassing 638 sectors (0 MB)
Trimming 1 free extents encompassing 1020 sectors (0 MB)
Trimming 1 free extents encompassing 12286 sectors (6 MB)
Trimming 1 free extents encompassing 1964 sectors (1 MB)
Done.
[ 1258.206379] ata_qc_issue: ATA_CMD_DSM starting
[ 1258.206587] trim_completed: ATA_CMD_DSM took 426088 cycles
[ 1258.254513] ata_qc_issue: ATA_CMD_DSM starting
[ 1258.366141] trim_completed: ATA_CMD_DSM took 241231523 cycles
[ 1258.411749] ata_qc_issue: ATA_CMD_DSM starting
[ 1258.524047] trim_completed: ATA_CMD_DSM took 242676590 cycles
[ 1258.600184] ata_qc_issue: ATA_CMD_DSM starting
[ 1258.711766] trim_completed: ATA_CMD_DSM took 241136519 cycles
[ 1258.813515] ata_qc_issue: ATA_CMD_DSM starting
[ 1258.910599] trim_completed: ATA_CMD_DSM took 209803152 cycles
[ 1259.027253] ata_qc_issue: ATA_CMD_DSM starting
[ 1259.108916] trim_completed: ATA_CMD_DSM took 176473453 cycles
[ 1259.239549] ata_qc_issue: ATA_CMD_DSM starting
[ 1259.306640] trim_completed: ATA_CMD_DSM took 144968694 cycles
[ 1259.452978] ata_qc_issue: ATA_CMD_DSM starting
[ 1259.505017] trim_completed: ATA_CMD_DSM took 112440172 cycles
[ 1259.552393] ata_qc_issue: ATA_CMD_DSM starting
[ 1259.664739] trim_completed: ATA_CMD_DSM took 242778861 cycles
[ 1259.775724] ata_qc_issue: ATA_CMD_DSM starting
[ 1259.861318] trim_completed: ATA_CMD_DSM took 184955732 cycles
[ 1259.989289] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.059963] trim_completed: ATA_CMD_DSM took 152713730 cycles
[ 1260.211066] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.257474] trim_completed: ATA_CMD_DSM took 100279998 cycles
[ 1260.306277] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.417770] trim_completed: ATA_CMD_DSM took 240932835 cycles
[ 1260.464049] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.575418] trim_completed: ATA_CMD_DSM took 240673134 cycles
[ 1260.650624] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.763510] trim_completed: ATA_CMD_DSM took 243952865 cycles
[ 1260.810454] ata_qc_issue: ATA_CMD_DSM starting
[ 1260.921433] trim_completed: ATA_CMD_DSM took 239832996 cycles

As you can see, we're now into the 100 millisecond range
for successive TRIM-followed-by-TRIM commands.
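
(Rough conversion, taking the TSC to tick at the 2.16GHz clock rate
noted above: ~440,000 cycles / 2.16e9 Hz is about 0.2 ms for the
isolated TRIMs, while ~240,000,000 cycles / 2.16e9 Hz is about 111 ms
for the back-to-back ones.)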

Those are all for single extents.  I will follow up with a small
amount of similar data for TRIMs with multiple extents.

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
@ 2009-08-16 17:37                                       ` Mark Lord
  0 siblings, 0 replies; 208+ messages in thread
From: Mark Lord @ 2009-08-16 17:37 UTC (permalink / raw)
  To: Theodore Tso, Arjan van de Ven, Alan Cox, James Bottomley,
	Mark Lord, Chris Worley, Matthew Wilcox, Bryan Donlan, david,
	Greg Freemyer, Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins,
	Nitin Gupta, Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm,
	linux-scsi, linux-ide, Linux RAID

Mark Lord wrote:
..
> As you can see, we're now into the 100 millisecond range
> for successive TRIM-followed-by-TRIM commands.
> 
> Those are all for single extents.  I will follow up with a small
> amount of similar data for TRIMs with multiple extents.
..

Here are the exact same TRIM ranges, but issued with *two* extents
per TRIM command, and again *without* the "sleep 1" between them:

Beginning TRIM operations..
Trimming 2 free extents encompassing 686 sectors (0 MB)
Trimming 2 free extents encompassing 236 sectors (0 MB)
Trimming 2 free extents encompassing 2186 sectors (1 MB)
Trimming 2 free extents encompassing 2206 sectors (1 MB)
Trimming 2 free extents encompassing 1494 sectors (1 MB)
Trimming 2 free extents encompassing 1086 sectors (1 MB)
Trimming 2 free extents encompassing 1658 sectors (1 MB)
Trimming 2 free extents encompassing 14250 sectors (7 MB)
Done.
[ 1528.761626] ata_qc_issue: ATA_CMD_DSM starting
[ 1528.761825] trim_completed: ATA_CMD_DSM took 419952 cycles
[ 1528.807158] ata_qc_issue: ATA_CMD_DSM starting
[ 1528.919035] trim_completed: ATA_CMD_DSM took 241772908 cycles
[ 1528.956048] ata_qc_issue: ATA_CMD_DSM starting
[ 1529.068536] trim_completed: ATA_CMD_DSM took 243085505 cycles
[ 1529.156661] ata_qc_issue: ATA_CMD_DSM starting
[ 1529.266377] trim_completed: ATA_CMD_DSM took 237098927 cycles
[ 1529.367212] ata_qc_issue: ATA_CMD_DSM starting
[ 1529.464676] trim_completed: ATA_CMD_DSM took 210619370 cycles
[ 1529.518619] ata_qc_issue: ATA_CMD_DSM starting
[ 1529.630444] trim_completed: ATA_CMD_DSM took 241654712 cycles
[ 1529.739335] ata_qc_issue: ATA_CMD_DSM starting
[ 1529.829826] trim_completed: ATA_CMD_DSM took 195545233 cycles
[ 1529.958442] ata_qc_issue: ATA_CMD_DSM starting
[ 1530.028356] trim_completed: ATA_CMD_DSM took 151077251 cycles

Next, with *four* extents per TRIM:

Beginning TRIM operations..
Trimming 4 free extents encompassing 922 sectors (0 MB)
Trimming 4 free extents encompassing 4392 sectors (2 MB)
Trimming 4 free extents encompassing 2580 sectors (1 MB)
Trimming 4 free extents encompassing 15908 sectors (8 MB)
Done.
[ 1728.923119] ata_qc_issue: ATA_CMD_DSM starting
[ 1728.923343] trim_completed: ATA_CMD_DSM took 460590 cycles
[ 1728.975082] ata_qc_issue: ATA_CMD_DSM starting
[ 1729.087266] trim_completed: ATA_CMD_DSM took 242429200 cycles
[ 1729.170167] ata_qc_issue: ATA_CMD_DSM starting
[ 1729.282718] trim_completed: ATA_CMD_DSM took 243229428 cycles
[ 1729.382328] ata_qc_issue: ATA_CMD_DSM starting
[ 1729.481364] trim_completed: ATA_CMD_DSM took 214012942 cycles

And with *eight* extents per TRIM:

Beginning TRIM operations..
Trimming 8 free extents encompassing 5314 sectors (3 MB)
Trimming 8 free extents encompassing 18488 sectors (9 MB)
Done.
[ 1788.289669] ata_qc_issue: ATA_CMD_DSM starting
[ 1788.290247] trim_completed: ATA_CMD_DSM took 1228539 cycles
[ 1788.327223] ata_qc_issue: ATA_CMD_DSM starting
[ 1788.440490] trim_completed: ATA_CMD_DSM took 244773243 cycles

And finally, with everything in a single TRIM:

Beginning TRIM operations..
Trimming 16 free extents encompassing 23802 sectors (12 MB)
Done.
[ 1841.561147] ata_qc_issue: ATA_CMD_DSM starting
[ 1841.563217] trim_completed: ATA_CMD_DSM took 4458480 cycles

Notice how the first TRIM of each group above shows an artificially
short completion time, because the firmware seems to return "done"
before it's really done.  Subsequent TRIMs seem to have to wait
for the previous one to really complete, and thus give more reliable
timing data for our purposes.
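
(Rough arithmetic, again taking the 2.16GHz TSC rate as given: each
back-to-back TRIM above costs roughly 240,000,000 cycles, about 111 ms,
whether it carries 1, 2, 4, or 8 extents, so batching all 16 extents
into a single TRIM pays that cost once rather than sixteen times.)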

Cheers

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 16:32                                     ` Mark Lord
@ 2009-08-16 18:07                                       ` James Bottomley
  -1 siblings, 0 replies; 208+ messages in thread
From: James Bottomley @ 2009-08-16 18:07 UTC (permalink / raw)
  To: Mark Lord
  Cc: Arjan van de Ven, Alan Cox, Chris Worley, Matthew Wilcox,
	Bryan Donlan, david, Greg Freemyer, Markus Trippelsdorf,
	Matthew Wilcox, Hugh Dickins, Nitin Gupta, Ingo Molnar,
	Peter Zijlstra, linux-kernel, linux-mm, linux-scsi, linux-ide,
	Linux RAID

On Sun, 2009-08-16 at 12:32 -0400, Mark Lord wrote:
> James Bottomley wrote:
> >
> > For SSDs, the FTL has to have a separate operation: erase.  Now, one
> > could see the correct implementation simply moving the sectors from the
> > in-use list to the to be cleaned list and still do the cleaning in the
> > background: that would be constant cost (but, again, likely expensive).
> > Of course, if SSD vendors decided to erase on the spot when seeing TRIM,
> > this wouldn't be true ...
> ..
> 
> The SSDs based upon the Indilinx Barefoot controller appear to do
> the erase on the spot, along with a fair amount of garbage collection.

Groan.  I'm with Jim on this one:  If trim is going to cost us in terms
of current fs performance, it's likely not worth it.  The whole point of
a TRIM/UNMAP is that we're just passing hints about storage use.  If the
drives make us pay the penalty of acting on the hints as we pass them
in, we may as well improve performance just by not hinting.  Or at
least, hinting in real time is detrimental.

So I think we've iterated to the conclusion that it has to be a user
space process which tries to identify idle periods and begin trimming.

> The overhead does vary by size of the TRIM operation (number of sectors
> and extents), but even a single-sector TRIM has very high overhead.

So it's something like X + nY (n == number of sectors).  If X is large,
it still argues for batching ... it's just that there's likely an upper
bound on the batch size beyond which the benefit is no longer worth the
cost.

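To put very rough numbers on that model (X and Y below are illustrative
guesses, with X loosely matching the ~110 ms follow-on TRIM times seen
earlier in this thread; nothing here is measured):

/*
 * Toy model of the X + nY cost: batching k commands into one saves
 * roughly (k - 1) * X when the fixed per-command cost X dominates.
 */
#include <stdio.h>

static double trim_cost_ms(unsigned int commands, unsigned long sectors)
{
	const double X = 110.0;		/* assumed fixed cost per TRIM, ms */
	const double Y = 0.0001;	/* assumed per-sector cost, ms     */

	return commands * X + sectors * Y;
}

int main(void)
{
	/* the 23802 sectors from the 16-extent example, batched vs. not */
	printf("16 separate TRIMs: %.1f ms\n", trim_cost_ms(16, 23802));
	printf(" 1 batched TRIM  : %.1f ms\n", trim_cost_ms(1, 23802));
	return 0;
}
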
> Samsung also now has SSDs at retail with TRIM.
> I don't have one of those here.

Heh, OS writers not having access to the devices is about par for the
current course.

James



^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 18:07                                       ` James Bottomley
  (?)
@ 2009-08-16 18:19                                         ` Mark Lord
  -1 siblings, 0 replies; 208+ messages in thread
From: Mark Lord @ 2009-08-16 18:19 UTC (permalink / raw)
  To: James Bottomley
  Cc: Arjan van de Ven, Alan Cox, Chris Worley, Matthew Wilcox,
	Bryan Donlan, david, Greg Freemyer, Markus Trippelsdorf,
	Matthew Wilcox, Hugh Dickins, Nitin Gupta, Ingo Molnar,
	Peter Zijlstra, linux-kernel, linux-mm, linux-scsi, linux-ide,
	Linux RAID

James Bottomley wrote:
>
> Heh, OS writers not having access to the devices is about par for the
> current course.
..

Pity the Linux Foundation doesn't simply step in and supply hardware
to us for new tech like this.  Cheap for them, expensive for folks like me.

Cheers

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 18:19                                         ` Mark Lord
@ 2009-08-16 18:24                                           ` James Bottomley
  -1 siblings, 0 replies; 208+ messages in thread
From: James Bottomley @ 2009-08-16 18:24 UTC (permalink / raw)
  To: Mark Lord
  Cc: Arjan van de Ven, Alan Cox, Chris Worley, Matthew Wilcox,
	Bryan Donlan, david, Greg Freemyer, Markus Trippelsdorf,
	Matthew Wilcox, Hugh Dickins, Nitin Gupta, Ingo Molnar,
	Peter Zijlstra, linux-kernel, linux-mm, linux-scsi, linux-ide,
	Linux RAID

On Sun, 2009-08-16 at 14:19 -0400, Mark Lord wrote:
> James Bottomley wrote:
> >
> > Heh, OS writers not having access to the devices is about par for the
> > current course.
> ..
> 
> Pity the Linux Foundation doesn't simply step in and supply hardware
> to us for new tech like this.  Cheap for them, expensive for folks like me.

Um, to give a developer a selection of manufacturers' SSDs at retail
prices, you're talking several thousand dollars  ... in these lean
times, that would be two or three developers not getting travel
sponsorship per chosen SSD recipient.  It's not a worthwhile tradeoff.

The best the LF can likely do is try to explain to the manufacturers
that handing out samples at linux conferences (like plumbers) is in
their own interests.  It can also manage the handout if necessary
through its HW lending library.

James


^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 15:34                                 ` Arjan van de Ven
@ 2009-08-16 19:29                                   ` Alan Cox
  -1 siblings, 0 replies; 208+ messages in thread
From: Alan Cox @ 2009-08-16 19:29 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: James Bottomley, Mark Lord, Chris Worley, Matthew Wilcox,
	Bryan Donlan, david, Greg Freemyer, Markus Trippelsdorf,
	Matthew Wilcox, Hugh Dickins, Nitin Gupta, Ingo Molnar,
	Peter Zijlstra, linux-kernel, linux-mm, linux-scsi, linux-ide,
	Linux RAID

> trim is mostly for ssd's though, and those tend to not have the "goes
> for a hike" behavior as much......

Bench one.

> I wonder if it's worse to batch stuff up, because then the trim itself
> gets bigger and might take longer.....

They seem to implement a sort of async single threaded trim, which can
only have one outstanding trim at a time.

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support
  2009-08-16 15:52                                   ` James Bottomley
@ 2009-08-16 21:50                                     ` Roland Dreier
  -1 siblings, 0 replies; 208+ messages in thread
From: Roland Dreier @ 2009-08-16 21:50 UTC (permalink / raw)
  To: James Bottomley
  Cc: Arjan van de Ven, Alan Cox, Mark Lord, Chris Worley,
	Matthew Wilcox, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID


 > Well, yes and no ... a lot of SSDs don't actually implement NCQ, so the
 > impact to them will be less ... although I think enterprise class SSDs
 > do implement NCQ.

Really?  Which SSDs don't implement NCQ?

It seems that one couldn't keep the flash busy doing only one command at a time.

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support
  2009-08-16 21:50                                     ` Roland Dreier
  (?)
@ 2009-08-16 22:06                                       ` Jeff Garzik
  -1 siblings, 0 replies; 208+ messages in thread
From: Jeff Garzik @ 2009-08-16 22:06 UTC (permalink / raw)
  To: Roland Dreier
  Cc: James Bottomley, Arjan van de Ven, Alan Cox, Mark Lord,
	Chris Worley, Matthew Wilcox, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On 08/16/2009 05:50 PM, Roland Dreier wrote:
>
>   >  Well, yes and no ... a lot of SSDs don't actually implement NCQ, so the
>   >  impact to them will be less ... although I think enterprise class SSDs
>   >  do implement NCQ.
>
> Really?  Which SSDs don't implement NCQ?

ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata3.00: ATA-8: G.SKILL 128GB SSD, 02.10104, max UDMA/100
ata3.00: 250445824 sectors, multi 0: LBA
ata3.00: configured for UDMA/100

for one...

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support
  2009-08-16 21:50                                     ` Roland Dreier
@ 2009-08-16 22:13                                       ` Theodore Tso
  -1 siblings, 0 replies; 208+ messages in thread
From: Theodore Tso @ 2009-08-16 22:13 UTC (permalink / raw)
  To: Roland Dreier
  Cc: James Bottomley, Arjan van de Ven, Alan Cox, Mark Lord,
	Chris Worley, Matthew Wilcox, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Sun, Aug 16, 2009 at 02:50:40PM -0700, Roland Dreier wrote:
> 
>  > Well, yes and no ... a lot of SSDs don't actually implement NCQ, so the
>  > impact to them will be less ... although I think enterprise class SSDs
>  > do implement NCQ.
> 
> Really?  Which SSDs don't implement NCQ?

The Intel X25-M was the first SSD to implement NCQ.  The OCZ Core V2
advertised NCQ with a queue depth of 1, but even that was buggy, so
Linux has a black list for that SSD:

      http://markmail.org/message/jvjpmcdqjwrmyl4w

As far as I know, all of the SSD's using the crappy JMicron JMF602
controllers don't support NCQ in any real way, which includes
most of the reasonably priced SSD's up until the first half of this year.
(The OCZ Summit SSD, which uses the Indilinx controller, is an exception
to this statement, but it's more expensive than the JMF602 based
SSD's, although less expensive than the Intel SSD.)

JMicron is trying to seek redemption with their JMF612 controllers,
which were announced at the end of May of this year, and those
controllers do support NCQ, and are claimed not to have the
disastrous small write latency problem of their '602 brethren.  I'm
not aware of any products using the JMF612 yet, though.  (According to
reports the '612 controllers weren't going to hit mass production
until July, so hopefully later this fall we'll start seeing products
using the new JMicron controller.)

      	  				- Ted

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support
  2009-08-16 22:13                                       ` Theodore Tso
  (?)
  (?)
@ 2009-08-16 22:51                                       ` Mark Lord
  -1 siblings, 0 replies; 208+ messages in thread
From: Mark Lord @ 2009-08-16 22:51 UTC (permalink / raw)
  To: Theodore Tso, Roland Dreier, James Bottomley, Arjan van de Ven, Alan Cox

Theodore Tso wrote:
> On Sun, Aug 16, 2009 at 02:50:40PM -0700, Roland Dreier wrote:
>>  > Well, yes and no ... a lot of SSDs don't actually implement NCQ, so the
>>  > impact to them will be less ... although I think enterprise class SSDs
>>  > do implement NCQ.
>>
>> Really?  Which SSDs don't implement NCQ?
> 
> The Intel X25-M was the first SSD to implement NCQ.  The OCZ Core V2
> advertised NCQ with a queue depth of 1, but even that was buggy, so
> Linux has a black list for that SSD:
> 
>       http://markmail.org/message/jvjpmcdqjwrmyl4w
> 
> As far as I know, all of the SSD's using the crappy JMicron JMF602
> controllers don't support NCQ in any real way, which includes
> most of the reasonably priced SSD's up until the first half of this year.
> (The OCZ Summit SSD, which uses the Indilinx controller, is an exception
> to this statement, but it's more expensive than the JMF602 based
> SSD's, although less expensive than the Intel SSD.)
> 
> JMicron is trying to seek redemption with their JMF612 controllers,
> which were announced at the end of May of this year, and those
> controllers do support NCQ, and are claimed not to have the
> disastrous small write latency problem of their '602 brethren.  I'm
> not aware of any products using the JMF612 yet, though.  (According to
> reports the '612 controllers weren't going to hit mass production
> until July, so hopefully later this fall we'll start seeing products
> using the new JMicron controller.)
..

Great summary, Ted.

To add:  SSDs based upon the Indilinx controller don't appear to scale
beyond an NCQ depth of about 4 or so, whereas Intel SSDs continue to
improve with increased queue depth up to 31 on Linux.

Or at least that's what I recall from reading various SSD benchmarks
a few weeks ago here. 

Cheers

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-15 13:20                             ` Mark Lord
@ 2009-08-16 22:52                               ` Chris Worley
  -1 siblings, 0 replies; 208+ messages in thread
From: Chris Worley @ 2009-08-16 22:52 UTC (permalink / raw)
  To: Mark Lord
  Cc: Greg Freemyer, Matthew Wilcox, Bryan Donlan, david,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Sat, Aug 15, 2009 at 7:20 AM, Mark Lord<liml@rtr.ca> wrote:
> Chris Worley wrote:
> ..
>>
>> So erase blocks are 512 bytes (if I write 512 bytes, an erase block is
>> now freed)?  Not true.
>
> ..
>
> No, erase blocks are typically 512 KILO-bytes, or 1024 sectors.
> Logical write blocks are only 512 bytes,

That was my point.

The OS should not make assumptions about the size.

> but most drives out there
> now actually use 4096 bytes as the native internal write size.
>
> Lots of issues there.
>
> The only existing "in the wild" TRIM-capable SSDs today all incur
> large overheads from TRIM


SSD's yes, SSS no.

> --> they seem to run a garbage-collection
> and erase cycle for each TRIM command, typically taking 100s of milliseconds
> regardless of the amount being trimmed.

The OS should not assume a dumb algorithm on the part of the drive.

>
> So it makes sense to gather small TRIMs into single larger TRIMs.

If Linux is only to support slow legacy SAS/SATA.

>
> But I think, even better, is to just not bother with the bookkeeping,
> and instead have the filesystem periodically just issue a TRIM for all
> free blocks within a block group, cycling through the block groups
> one by one over time.
>
> That's how I'd like it to work on my own machine here.
> Server/enterprise users very likely want something different.

Yes.  I didn't realize this was a laptop-only fix.

Thanks,

Chris
>
> Pluggable architecture, anyone?  :)
>

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 19:29                                   ` Alan Cox
@ 2009-08-16 23:05                                     ` John Robinson
  -1 siblings, 0 replies; 208+ messages in thread
From: John Robinson @ 2009-08-16 23:05 UTC (permalink / raw)
  To: Alan Cox
  Cc: Arjan van de Ven, James Bottomley, Mark Lord, Chris Worley,
	Matthew Wilcox, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On 16/08/2009 20:29, Alan Cox wrote:
>> trim is mostly for ssd's though, and those tend to not have the "goes
>> for a hike" behavior as much......
> 
> Bench one.
> 
>> I wonder if it's worse to batch stuff up, because then the trim itself
>> gets bigger and might take longer.....
> 
> They seem to implement a sort of async single threaded trim, which can
> only have one outstanding trim at a time.

I'm slightly out of my depth here, but: if a single TRIM is issued, 
which apparently returns quickly, can one then revert to issuing 
ordinary commands like reads and writes and have them complete as 
quickly as they normally do, or does any following command have to wait 
until the trim completes? This could be useful if it turned out we don't
stall these devices as long as we don't issue more than one TRIM every
few seconds; we could keep a TRIM coalesce queue to (say) 5 seconds
long (or at least, a configurable small number of seconds).

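A minimal sketch of that time-bounded coalescing idea, just to make it
concrete. None of these names exist in the kernel; issue_trim() is a
hypothetical helper, and locking, the timer wiring and merging of
adjacent extents are all left out.

/*
 * Accumulate freed extents and flush them as a single TRIM once the
 * oldest queued extent is TRIM_COALESCE_SECS old.
 */
#include <linux/types.h>
#include <linux/list.h>
#include <linux/jiffies.h>

#define TRIM_COALESCE_SECS	5		/* configurable window */

struct trim_extent {
	struct list_head list;
	sector_t start;
	sector_t len;
	unsigned long queued_at;		/* jiffies when queued */
};

static LIST_HEAD(trim_queue);

void issue_trim(struct list_head *extents);	/* hypothetical: one TRIM */

/* Called when the filesystem frees an extent. */
static void queue_trim(struct trim_extent *ext)
{
	ext->queued_at = jiffies;
	list_add_tail(&ext->list, &trim_queue);
}

/* Called periodically; flushes everything once the oldest entry ages out. */
static void maybe_flush_trims(void)
{
	struct trim_extent *oldest;

	if (list_empty(&trim_queue))
		return;

	oldest = list_first_entry(&trim_queue, struct trim_extent, list);
	if (time_after(jiffies, oldest->queued_at + TRIM_COALESCE_SECS * HZ))
		issue_trim(&trim_queue);
}
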
Cheers,

John.

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 22:52                               ` Chris Worley
@ 2009-08-17  2:03                                 ` Mark Lord
  -1 siblings, 0 replies; 208+ messages in thread
From: Mark Lord @ 2009-08-17  2:03 UTC (permalink / raw)
  To: Chris Worley
  Cc: Greg Freemyer, Matthew Wilcox, Bryan Donlan, david,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

Chris Worley wrote:
..
> The OS should not assume a dumb algorithm on the part of the drive.
..

Welcome to the Real World.

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 23:05                                     ` John Robinson
@ 2009-08-17  2:05                                       ` Mark Lord
  -1 siblings, 0 replies; 208+ messages in thread
From: Mark Lord @ 2009-08-17  2:05 UTC (permalink / raw)
  To: John Robinson
  Cc: Alan Cox, Arjan van de Ven, James Bottomley, Chris Worley,
	Matthew Wilcox, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

John Robinson wrote:
..
> I'm slightly out of my depth here, but: if a single TRIM is issued, 
> which apparently returns quickly, can one then revert to issuing 
> ordinary commands like reads and writes and have them complete as 
> quickly as they normally do, or does any following command have to wait 
> until the trim completes? This could be useful if it turned out we won't 
> stall these devices as long as we don't issue more than one TRIM every 
> few seconds; we could keep a TRIM coalesce queue down to being (say) 5 
> seconds long (or at least, a configurable small number of seconds).
..

I have not attempted to instrument that, but I suspect that any
command after the TRIM has to wait.  Don't know for sure until
somebody measures it though.

One thing I do know, is that Matthew's first cut of TRIM support
means it takes half an hour to do "rm -r" on a kernel source tree.

-ml

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: [PATCH] swap: send callback when swap slot is freed
  2009-08-12 14:37 ` Nitin Gupta
                   ` (3 preceding siblings ...)
  (?)
@ 2009-08-17  2:55 ` KAMEZAWA Hiroyuki
  2009-08-17  5:08   ` Nitin Gupta
  2009-08-22  7:34   ` Nai Xia
  -1 siblings, 2 replies; 208+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-17  2:55 UTC (permalink / raw)
  To: ngupta; +Cc: mingo, linux-kernel

On Wed, 12 Aug 2009 20:07:43 +0530
Nitin Gupta <ngupta@vflare.org> wrote:

> Currently, we have "swap discard" mechanism which sends a discard bio request
> when we find a free cluster during scan_swap_map(). This callback can come a
> long time after swap slots are actually freed.
> 
> This delay in callback is a great problem when (compressed) RAM [1] is used
> as a swap device. So, this change adds a callback which is called as
> soon as a swap slot becomes free. For above mentioned case of swapping
> over compressed RAM device, this is very useful since we can immediately
> free memory allocated for this swap page.
> 
> This callback does not replace swap discard support. It is called with
> swap_lock held, so it is meant to trigger action that finishes quickly.
> However, swap discard is an I/O request and can be used for taking longer
> actions.
> 
> Links:
> [1] http://code.google.com/p/compcache/
> 

Hmm, do you really need to notify at *every* swap free?
Is no batching necessary?

Thanks,
-Kame

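For illustration, the kind of batching being asked about might look
roughly like the sketch below. Nothing in it is part of the posted
patch: the batch size, the batched callback type and the helper names
are all made up.

/*
 * Collect freed swap offsets and hand them to the backend in one
 * call instead of once per slot.
 */
#include <stddef.h>

#define SWAP_NOTIFY_BATCH	64	/* assumed batch size */

typedef void (*swap_free_notify_batch_fn)(void *dev,
					   const unsigned long *offsets,
					   size_t count);

struct swap_notify_batch {
	unsigned long offsets[SWAP_NOTIFY_BATCH];
	size_t count;
	void *dev;			/* e.g. the swap block_device */
	swap_free_notify_batch_fn flush;
};

/* Queue one freed slot; notify the backend only when the buffer fills. */
static void swap_notify_free(struct swap_notify_batch *b, unsigned long offset)
{
	b->offsets[b->count++] = offset;
	if (b->count == SWAP_NOTIFY_BATCH) {
		b->flush(b->dev, b->offsets, b->count);
		b->count = 0;
	}
}
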
> Signed-off-by: Nitin Gupta <ngupta@vflare.org>
> ---
> 
>  include/linux/swap.h |    5 +++++
>  mm/swapfile.c        |   16 ++++++++++++++++
>  2 files changed, 21 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 7c15334..4cbe3c4 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -8,6 +8,7 @@
>  #include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/node.h>
> +#include <linux/blkdev.h>
>  
>  #include <asm/atomic.h>
>  #include <asm/page.h>
> @@ -20,6 +21,8 @@ struct bio;
>  #define SWAP_FLAG_PRIO_MASK	0x7fff
>  #define SWAP_FLAG_PRIO_SHIFT	0
>  
> +typedef void (swap_free_notify_fn) (struct block_device *, unsigned long);
> +
>  static inline int current_is_kswapd(void)
>  {
>  	return current->flags & PF_KSWAPD;
> @@ -155,6 +158,7 @@ struct swap_info_struct {
>  	unsigned int max;
>  	unsigned int inuse_pages;
>  	unsigned int old_block_size;
> +	swap_free_notify_fn *swap_free_notify_fn;
>  };
>  
>  struct swap_list_t {
> @@ -295,6 +299,7 @@ extern sector_t swapdev_block(int, pgoff_t);
>  extern struct swap_info_struct *get_swap_info_struct(unsigned);
>  extern int reuse_swap_page(struct page *);
>  extern int try_to_free_swap(struct page *);
> +extern void set_swap_free_notify(unsigned, swap_free_notify_fn *);
>  struct backing_dev_info;
>  
>  /* linux/mm/thrash.c */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 8ffdc0d..aa95fc7 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -552,6 +552,20 @@ out:
>  	return NULL;
>  }
>  
> +/*
> + * Sets callback for event when swap_map[offset] == 0
> + * i.e. page at this swap offset is no longer used.
> + */
> +void set_swap_free_notify(unsigned type, swap_free_notify_fn *notify_fn)
> +{
> +	struct swap_info_struct *sis;
> +	sis = get_swap_info_struct(type);
> +	BUG_ON(!sis);
> +	sis->swap_free_notify_fn = notify_fn;
> +	return;
> +}
> +EXPORT_SYMBOL(set_swap_free_notify);
> +
>  static int swap_entry_free(struct swap_info_struct *p,
>  			   swp_entry_t ent, int cache)
>  {
> @@ -583,6 +597,8 @@ static int swap_entry_free(struct swap_info_struct *p,
>  			swap_list.next = p - swap_info;
>  		nr_swap_pages++;
>  		p->inuse_pages--;
> +		if (p->swap_free_notify_fn)
> +			p->swap_free_notify_fn(p->bdev, offset);
>  	}
>  	if (!swap_count(count))
>  		mem_cgroup_uncharge_swap(ent);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 16:59                                     ` Christoph Hellwig
@ 2009-08-17  4:24                                       ` Douglas Gilbert
  -1 siblings, 0 replies; 208+ messages in thread
From: Douglas Gilbert @ 2009-08-17  4:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, Arjan van de Ven, Alan Cox, Mark Lord,
	Chris Worley, Matthew Wilcox, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

Christoph Hellwig wrote:
> On Sun, Aug 16, 2009 at 10:52:07AM -0500, James Bottomley wrote:
>> However, the enterprise has been doing UNMAP for a while, so we can draw
>> inferences from them since the SSD FTL will operate similarly.  For
>> them, UNMAP is the same cost in terms of time regardless of the number
>> of extents.  The reason is that it's moving the blocks from the global
>> in use list to the global free list.  Part of the problem is that this
>> involves locking and quiescing, so UNMAP ends up being quite expensive
>> to the array but constant in terms of cost (hence they want as few
>> unmaps for as many sectors as possible).
> 
> How are they doing the unmaps?  Using something similar to Mark's wiper
> script and using SG_IO?  Because right now we do not actually implement
> UNMAP support in the kernel.  I'd really love to test the XFS batched
> discard support with a real UNMAP implementation.

The sg3_utils version 1.28 beta at http://sg.danny.cz/sg/
has a new sg_unmap utility and the previous release
included sg_write_same with Unmap bit support.
sg_readcap has been updated to show the TPE and TPRZ bits.

There is a new SCSI GET LBA STATUS command coming
(approved at the last t10 meeting, awaiting the next
SBC-3 draft). That will show the mapped/unmapped
status of logical blocks in a range of LBAs. I can
add a utility for that as well.

Doug Gilbert


^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: [PATCH] swap: send callback when swap slot is freed
  2009-08-17  2:55 ` [PATCH] swap: send callback when swap slot is freed KAMEZAWA Hiroyuki
@ 2009-08-17  5:08   ` Nitin Gupta
  2009-08-17  5:11     ` KAMEZAWA Hiroyuki
  2009-08-22  7:34   ` Nai Xia
  1 sibling, 1 reply; 208+ messages in thread
From: Nitin Gupta @ 2009-08-17  5:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: mingo, linux-kernel

On 08/17/2009 08:25 AM, KAMEZAWA Hiroyuki wrote:
> On Wed, 12 Aug 2009 20:07:43 +0530
> Nitin Gupta<ngupta@vflare.org>  wrote:
>
>> Currently, we have "swap discard" mechanism which sends a discard bio request
>> when we find a free cluster during scan_swap_map(). This callback can come a
>> long time after swap slots are actually freed.
>>
>> This delay in callback is a great problem when (compressed) RAM [1] is used
>> as a swap device. So, this change adds a callback which is called as
>> soon as a swap slot becomes free. For above mentioned case of swapping
>> over compressed RAM device, this is very useful since we can immediately
>> free memory allocated for this swap page.
>>
>> This callback does not replace swap discard support. It is called with
>> swap_lock held, so it is meant to trigger action that finishes quickly.
>> However, swap discard is an I/O request and can be used for taking longer
>> actions.
>>
>> Links:
>> [1] http://code.google.com/p/compcache/
>>
>
> Hmm, do you really need notify at *every* swap free ?
> No batching is necessary ?
>

We need a notify for every swap free, and no batching is desired, at least for 
the compcache case.

Thanks,
Nitin

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: [PATCH] swap: send callback when swap slot is freed
  2009-08-17  5:08   ` Nitin Gupta
@ 2009-08-17  5:11     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 208+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-17  5:11 UTC (permalink / raw)
  To: ngupta; +Cc: mingo, linux-kernel

On Mon, 17 Aug 2009 10:38:28 +0530
Nitin Gupta <ngupta@vflare.org> wrote:

> On 08/17/2009 08:25 AM, KAMEZAWA Hiroyuki wrote:
> > On Wed, 12 Aug 2009 20:07:43 +0530
> > Nitin Gupta<ngupta@vflare.org>  wrote:
> >
> >> Currently, we have "swap discard" mechanism which sends a discard bio request
> >> when we find a free cluster during scan_swap_map(). This callback can come a
> >> long time after swap slots are actually freed.
> >>
> >> This delay in callback is a great problem when (compressed) RAM [1] is used
> >> as a swap device. So, this change adds a callback which is called as
> >> soon as a swap slot becomes free. For above mentioned case of swapping
> >> over compressed RAM device, this is very useful since we can immediately
> >> free memory allocated for this swap page.
> >>
> >> This callback does not replace swap discard support. It is called with
> >> swap_lock held, so it is meant to trigger action that finishes quickly.
> >> However, swap discard is an I/O request and can be used for taking longer
> >> actions.
> >>
> >> Links:
> >> [1] http://code.google.com/p/compcache/
> >>
> >
> > Hmm, do you really need notify at *every* swap free ?
> > No batching is necessary ?
> >
> 
> We need notify for every swap free and no batching is desired, at least for 
> compcache case.
> 
Okay.. I'll read the discussions in linux-mm.

Thank you,
-Kame


^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 16:59                                     ` Christoph Hellwig
@ 2009-08-17 13:56                                       ` James Bottomley
  -1 siblings, 0 replies; 208+ messages in thread
From: James Bottomley @ 2009-08-17 13:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Arjan van de Ven, Alan Cox, Mark Lord, Chris Worley,
	Matthew Wilcox, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Sun, 2009-08-16 at 12:59 -0400, Christoph Hellwig wrote:
> On Sun, Aug 16, 2009 at 10:52:07AM -0500, James Bottomley wrote:
> > However, the enterprise has been doing UNMAP for a while, so we can draw
> > inferences from them since the SSD FTL will operate similarly.  For
> > them, UNMAP is the same cost in terms of time regardless of the number
> > of extents.  The reason is that it's moving the blocks from the global
> > in use list to the global free list.  Part of the problem is that this
> > involves locking and quiescing, so UNMAP ends up being quite expensive
> > to the array but constant in terms of cost (hence they want as few
> > unmaps for as many sectors as possible).
> 
> How are they doing the unmaps?  Using something similar to Mark's wiper
> script and using SG_IO?  Because right now we do not actually implement
> UNMAP support in the kernel.  I'd really love to test the XFS batched
> discard support with a real UNMAP implementation.

You mean how is the array vendor testing their implementation?  Using
SG_IO ... without any filesystem, I believe.

The testing was initially done to see if the initial maximal discard
proposal from LSF09 was a viable approach (which it wasn't, given the
time taken to UNMAP).

James



^ permalink raw reply	[flat|nested] 208+ messages in thread


* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-17 13:56                                       ` James Bottomley
@ 2009-08-17 14:10                                         ` Matthew Wilcox
  -1 siblings, 0 replies; 208+ messages in thread
From: Matthew Wilcox @ 2009-08-17 14:10 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Hellwig, Arjan van de Ven, Alan Cox, Mark Lord,
	Chris Worley, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Mon, Aug 17, 2009 at 08:56:12AM -0500, James Bottomley wrote:
> The testing was initially done to see if the initial maximal discard
> proposal from LSF09 was a viable approach (which it wasn't given the
> time taken to UNMAP).

It would be nice if that feedback could be made public instead of it
leaking out in dribs and drabs like this.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."


^ permalink raw reply	[flat|nested] 208+ messages in thread


* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 17:37                                       ` Mark Lord
@ 2009-08-17 16:30                                         ` Bill Davidsen
  -1 siblings, 0 replies; 208+ messages in thread
From: Bill Davidsen @ 2009-08-17 16:30 UTC (permalink / raw)
  To: Mark Lord
  Cc: Theodore Tso, Arjan van de Ven, Alan Cox, James Bottomley,
	Chris Worley, Matthew Wilcox, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

Mark Lord wrote:
> Mark Lord wrote:
> ..
>> As you can see, we're now into the 100 millisecond range
>> for successive TRIM-followed-by-TRIM commands.
>>
>> Those are all for single extents.  I will follow-up with a small
>> amount of similar data for TRIMs with multiple extents.
> ..
>
> Here's the exact same TRIM ranges, but issued with *two* extents
> per TRIM command, and again *without* the "sleep 1" between them:
>
> Beginning TRIM operations..
> Trimming 2 free extents encompassing 686 sectors (0 MB)
> Trimming 2 free extents encompassing 236 sectors (0 MB)
> Trimming 2 free extents encompassing 2186 sectors (1 MB)
> Trimming 2 free extents encompassing 2206 sectors (1 MB)
> Trimming 2 free extents encompassing 1494 sectors (1 MB)
> Trimming 2 free extents encompassing 1086 sectors (1 MB)
> Trimming 2 free extents encompassing 1658 sectors (1 MB)
> Trimming 2 free extents encompassing 14250 sectors (7 MB)
> Done.
> [ 1528.761626] ata_qc_issue: ATA_CMD_DSM starting
> [ 1528.761825] trim_completed: ATA_CMD_DSM took 419952 cycles
> [ 1528.807158] ata_qc_issue: ATA_CMD_DSM starting
> [ 1528.919035] trim_completed: ATA_CMD_DSM took 241772908 cycles
> [ 1528.956048] ata_qc_issue: ATA_CMD_DSM starting
> [ 1529.068536] trim_completed: ATA_CMD_DSM took 243085505 cycles
> [ 1529.156661] ata_qc_issue: ATA_CMD_DSM starting
> [ 1529.266377] trim_completed: ATA_CMD_DSM took 237098927 cycles
> [ 1529.367212] ata_qc_issue: ATA_CMD_DSM starting
> [ 1529.464676] trim_completed: ATA_CMD_DSM took 210619370 cycles
> [ 1529.518619] ata_qc_issue: ATA_CMD_DSM starting
> [ 1529.630444] trim_completed: ATA_CMD_DSM took 241654712 cycles
> [ 1529.739335] ata_qc_issue: ATA_CMD_DSM starting
> [ 1529.829826] trim_completed: ATA_CMD_DSM took 195545233 cycles
> [ 1529.958442] ata_qc_issue: ATA_CMD_DSM starting
> [ 1530.028356] trim_completed: ATA_CMD_DSM took 151077251 cycles
>
> Next, with *four* extents per TRIM:
>
> Beginning TRIM operations..
> Trimming 4 free extents encompassing 922 sectors (0 MB)
> Trimming 4 free extents encompassing 4392 sectors (2 MB)
> Trimming 4 free extents encompassing 2580 sectors (1 MB)
> Trimming 4 free extents encompassing 15908 sectors (8 MB)
> Done.
> [ 1728.923119] ata_qc_issue: ATA_CMD_DSM starting
> [ 1728.923343] trim_completed: ATA_CMD_DSM took 460590 cycles
> [ 1728.975082] ata_qc_issue: ATA_CMD_DSM starting
> [ 1729.087266] trim_completed: ATA_CMD_DSM took 242429200 cycles
> [ 1729.170167] ata_qc_issue: ATA_CMD_DSM starting
> [ 1729.282718] trim_completed: ATA_CMD_DSM took 243229428 cycles
> [ 1729.382328] ata_qc_issue: ATA_CMD_DSM starting
> [ 1729.481364] trim_completed: ATA_CMD_DSM took 214012942 cycles
>
> And with *eight* extents per TRIM:
> Beginning TRIM operations..
> Trimming 8 free extents encompassing 5314 sectors (3 MB)
> Trimming 8 free extents encompassing 18488 sectors (9 MB)
> Done.
> [ 1788.289669] ata_qc_issue: ATA_CMD_DSM starting
> [ 1788.290247] trim_completed: ATA_CMD_DSM took 1228539 cycles
> [ 1788.327223] ata_qc_issue: ATA_CMD_DSM starting
> [ 1788.440490] trim_completed: ATA_CMD_DSM took 244773243 cycles
>
> And finally, with everything in a single TRIM:
>
> Beginning TRIM operations..
> Trimming 16 free extents encompassing 23802 sectors (12 MB)
> Done.
> [ 1841.561147] ata_qc_issue: ATA_CMD_DSM starting
> [ 1841.563217] trim_completed: ATA_CMD_DSM took 4458480 cycles
>
> Notice how the first TRIM of each group above shows an artificially
> short completion time, because the firmware seems to return "done"
> before it's really done.  Subsequent TRIMs seem to have to wait
> for the previous one to really complete, and thus give more reliable
> timing data for our purposes.

I assume that it really is artificial, rather than the device genuinely 
being ready for another operation (other than another TRIM). I lack the 
hardware, but the test would be to time a plain read, a trim followed by a 
read, and two trims followed by a read. Just my thought that the TRIM in 
progress may only block the next TRIM, rather than other operations.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc

"You are disgraced professional losers. And by the way, give us our money back."
    - Representative Earl Pomeroy,  Democrat of North Dakota
on the A.I.G. executives who were paid bonuses  after a federal bailout.
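As a rough sanity check on the trim_completed cycle counts quoted above:
assuming they are raw TSC cycles on a CPU clocked somewhere around 2.4 GHz (an
assumption; the clock rate is not stated in the thread), ~242 million cycles
works out to roughly 100 ms, which matches the "100 millisecond range" Mark
mentions. A trivial sketch of the conversion:

#include <stdio.h>
#include <stdint.h>

/* Convert a raw cycle count to milliseconds for an assumed CPU clock rate. */
static double cycles_to_ms(uint64_t cycles, double cpu_hz)
{
	return (double)cycles / cpu_hz * 1000.0;
}

int main(void)
{
	/* e.g. the 241772908-cycle TRIM above, at an assumed 2.4 GHz clock */
	printf("%.1f ms\n", cycles_to_ms(241772908ULL, 2.4e9));	/* ~100.7 ms */
	return 0;
}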



^ permalink raw reply	[flat|nested] 208+ messages in thread


* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-16 18:24                                           ` James Bottomley
@ 2009-08-17 16:37                                             ` Bill Davidsen
  -1 siblings, 0 replies; 208+ messages in thread
From: Bill Davidsen @ 2009-08-17 16:37 UTC (permalink / raw)
  To: James Bottomley
  Cc: Mark Lord, Arjan van de Ven, Alan Cox, Chris Worley,
	Matthew Wilcox, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

James Bottomley wrote:
> On Sun, 2009-08-16 at 14:19 -0400, Mark Lord wrote:
>   
>> James Bottomley wrote:
>>     
>>> Heh, OS writers not having access to the devices is about par for the
>>> current course.
>>>       
>> ..
>>
>> Pity the Linux Foundation doesn't simply step in and supply hardware
>> to us for new tech like this.  Cheap for them, expensive for folks like me.
>>     
>
> Um, to give a developer a selection of manufacturers' SSDs at retail
> prices, you're talking several thousand dollars  ... in these lean
> times, that would be two or three developers not getting travel
> sponsorship per chosen SSD recipient.  It's not a worthwhile tradeoff.
>
> The best the LF can likely do is try to explain to the manufacturers
> that handing out samples at linux conferences (like plumbers) is in
> their own interests.  It can also manage the handout if necessary
> through its HW lending library.
>   

Or install the hardware on a machine and give people access to the 
machine in time slots. That is faster than FedEx-ing the hardware, and it is 
relatively fast to reinstall the OS from scratch. Testing of this type 
doesn't need huge bandwidth.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc

"You are disgraced professional losers. And by the way, give us our money back."
    - Representative Earl Pomeroy,  Democrat of North Dakota
on the A.I.G. executives who were paid bonuses  after a federal bailout.



^ permalink raw reply	[flat|nested] 208+ messages in thread


* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-17 16:30                                         ` Bill Davidsen
@ 2009-08-17 16:56                                           ` jim owens
  -1 siblings, 0 replies; 208+ messages in thread
From: jim owens @ 2009-08-17 16:56 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Mark Lord, Theodore Tso, Arjan van de Ven, Alan Cox,
	James Bottomley, Chris Worley, Matthew Wilcox, Bryan Donlan,
	david, Greg Freemyer, Markus Trippelsdorf, Matthew Wilcox,
	Hugh Dickins, Nitin Gupta, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm, linux-scsi, linux-ide, Linux RAID

Bill Davidsen wrote:
> 
> I assume that it really is artificial, rather than the device really 
> being ready for another operation (other than another TRIM). I lack the 
> hardware, but the test would be the time to complete a read, trim and 
> read, and two trim and read operations. Just my thought that the TRIM in 
> progress may only block the next TRIM, rather than other operations.

I don't know his test sequence, but READ is not the likely command
before and after TRIM unless we are talking about TRIM being issued
only in delayed host garbage collection.  Filesystems send WRITEs
during delete.

jim


^ permalink raw reply	[flat|nested] 208+ messages in thread


* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-17 16:37                                             ` Bill Davidsen
@ 2009-08-17 17:08                                               ` Greg Freemyer
  -1 siblings, 0 replies; 208+ messages in thread
From: Greg Freemyer @ 2009-08-17 17:08 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: James Bottomley, Mark Lord, Arjan van de Ven, Alan Cox,
	Chris Worley, Matthew Wilcox, Bryan Donlan, david,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

All,

Seems like the high-level wrap-up of all this is:

There are hopes that highly efficient SSDs will appear on the market
that can leverage a passthru, non-coalescing discard feature, and that
a whitelist should be created to allow those SSDs to see discards
intermixed with the rest of the data I/O.

For the other known cases:

SSDs that meet the ata-8 spec, but don't exceed it
Enterprise SCSI
mdraid with SSD storage used to build raid5 / raid6 arrays

Non-coalescing is believed detrimental, but a regular flushing of the
unused blocks/sectors via a tool like the one Mark Lord has written should
be acceptable.

Mark, I don't believe your tool really addresses the mdraid situation;
do you agree?  I.e., since you're bypassing most of the block stack,
mdraid has no way of snooping on / adjusting the discards you are
sending out.

Thus the 2 solutions that have been worked on already seem to address
the needs of everything but mdraid.

Also, there has been no discussion of dm-based volumes (i.e. LVM2-based volumes).

For mdraid or dm it seems we need to enhance Mark's script to pass the
trim commands through the full block stack.  Mark, please cmiiw

Greg


^ permalink raw reply	[flat|nested] 208+ messages in thread


* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-17 16:56                                           ` jim owens
@ 2009-08-17 17:14                                             ` Bill Davidsen
  -1 siblings, 0 replies; 208+ messages in thread
From: Bill Davidsen @ 2009-08-17 17:14 UTC (permalink / raw)
  To: jim owens
  Cc: Mark Lord, Theodore Tso, Arjan van de Ven, Alan Cox,
	James Bottomley, Chris Worley, Matthew Wilcox, Bryan Donlan,
	david, Greg Freemyer, Markus Trippelsdorf, Matthew Wilcox,
	Hugh Dickins, Nitin Gupta, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm, linux-scsi, linux-ide, Linux RAID

jim owens wrote:
> Bill Davidsen wrote:
>>
>> I assume that it really is artificial, rather than the device really 
>> being ready for another operation (other than another TRIM). I lack 
>> the hardware, but the test would be the time to complete a read, trim 
>> and read, and two trim and read operations. Just my thought that the 
>> TRIM in progress may only block the next TRIM, rather than other 
>> operations.
>
> I don't know his test sequence but READ is not the likely command
> before and after TRIM unless we are talking about TRIM being issued
> only in delayed host garbage collection.  Filesystems send WRITES
> during delete.

My idea is to test using a command which will definitely not need to 
prepare the media before completion, hence a read. If TRIM doesn't block 
reads, then NCQ may allow reads to take place. Because of buffering, slow 
reads hurt more than slow writes in terms of user perception.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc

"You are disgraced professional losers. And by the way, give us our money back."
    - Representative Earl Pomeroy,  Democrat of North Dakota
on the A.I.G. executives who were paid bonuses  after a federal bailout.



^ permalink raw reply	[flat|nested] 208+ messages in thread


* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-17 17:08                                               ` Greg Freemyer
@ 2009-08-17 17:19                                                 ` James Bottomley
  -1 siblings, 0 replies; 208+ messages in thread
From: James Bottomley @ 2009-08-17 17:19 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: Bill Davidsen, Mark Lord, Arjan van de Ven, Alan Cox,
	Chris Worley, Matthew Wilcox, Bryan Donlan, david,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Mon, 2009-08-17 at 13:08 -0400, Greg Freemyer wrote:
> All,
> 
> Seems like the high-level wrap-up of all this is:
> 
> There are hopes that highly efficient SSDs will appear on the market
> that can leverage a passthru non-coalescing discard feature.  And that
> a whitelist should be created to allow those SSDs to see discards
> intermixed with the rest of the data i/o.

That's not my conclusion.  Mine was that the NCQ drain would still be
detrimental to interleaved trim even if the drive could do it for zero
cost.

> For the other known cases:
> 
> SSDs that meet the ata-8 spec, but don't exceed it
> Enterprise SCSI

No, SCSI will do WRITE_SAME/UNMAP as currently drafted in SBC3

> mdraid with SSD storage used to build raid5 / raid6 arrays
> 
> Non-coalescing is believed detrimental,

It is?  Why?

>  but a regular flushing of the
> unused blocks/sectors via a tool like Mark Lord has written should be
> acceptable.
> 
> Mark, I don't believe your tool really addresses the mdraid situation,
> do you agree.  ie. Since your bypassing most of the block stack,
> mdraid has no way of snooping on / adjusting the discards you are
> sending out.
> 
> Thus the 2 solutions that have been worked on already seem to address
> the needs of everything but mdraid.

I count three: Mark Lord's script via SG_IO, hch's enhanced script via
XFS_TRIM, and willy's current inline discard, which he's considering
adding coalescing to.

James

> Also, there has been no discussion of dm based volumes.  (ie LVM2 based volumes)
> 
> For mdraid or dm it seems we need to enhance Mark's script to pass the
> trim commands through the full block stack.  Mark, please cmiiw
> 
> Greg



^ permalink raw reply	[flat|nested] 208+ messages in thread


* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-17 17:14                                             ` Bill Davidsen
@ 2009-08-17 17:37                                               ` jim owens
  -1 siblings, 0 replies; 208+ messages in thread
From: jim owens @ 2009-08-17 17:37 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Mark Lord, Theodore Tso, Arjan van de Ven, Alan Cox,
	James Bottomley, Chris Worley, Matthew Wilcox, Bryan Donlan,
	david, Greg Freemyer, Markus Trippelsdorf, Matthew Wilcox,
	Hugh Dickins, Nitin Gupta, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-mm, linux-scsi, linux-ide, Linux RAID

Bill Davidsen wrote:
> jim owens wrote:
>> Bill Davidsen wrote:
>>>
>>> I assume that it really is artificial, rather than the device really 
>>> being ready for another operation (other than another TRIM). I lack 
>>> the hardware, but the test would be the time to complete a read, trim 
>>> and read, and two trim and read operations. Just my thought that the 
>>> TRIM in progress may only block the next TRIM, rather than other 
>>> operations.
>>
>> I don't know his test sequence but READ is not the likely command
>> before and after TRIM unless we are talking about TRIM being issued
>> only in delayed host garbage collection.  Filesystems send WRITES
>> during delete.
> 
> My idea is to test using a command which will definitely not need to 
> prepare the media before completion, thus read. If TRIM doesn't block 
> reads, then NCQ may allow reads to take place. Because of buffering slow 
> reads hurt more than slow writes in terms of user perception.
> 

The filesystem must send at least one unbuffered synchronous write
before it can send a trim for those blocks, so that the drive does not
release the blocks until we are certain they will not be needed again.

AKA the metadata consistency problem for crash recovery.

So a non-delayed trim must at least be preceded by a write, but you
are correct that reads could come after the trim if the filesystem
does not have a multi-stage delete that requires a second synchronous
write, or if the trim can be held until all filesystem writes occur.

How hard it will be to "send the trim last" will be different for
each filesystem, and some developers are already working on that.

jim

^ permalink raw reply	[flat|nested] 208+ messages in thread


* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-17 17:19                                                 ` James Bottomley
@ 2009-08-17 18:16                                                   ` Ric Wheeler
  -1 siblings, 0 replies; 208+ messages in thread
From: Ric Wheeler @ 2009-08-17 18:16 UTC (permalink / raw)
  To: James Bottomley
  Cc: Greg Freemyer, Bill Davidsen, Mark Lord, Arjan van de Ven,
	Alan Cox, Chris Worley, Matthew Wilcox, Bryan Donlan, david,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID


Chiming in here a bit late, but coalescing requests is also a good way 
to prevent read-modify-write cycles.

Specifically, if I remember the concern correctly, for WRITE_SAME 
with the unmap bit set, when the I/O is not evenly aligned on the "erase 
chunk" (whatever they call it) boundary, the device can be forced to do a 
read-modify-write (of zeroes) at the end or beginning of that region.

For a disk array, WRITE_SAME with the unmap bit, when done cleanly on an 
aligned boundary, can be handled entirely in the array's cache. The 
read-modify-write can generate several reads to the back-end disks, which 
are significantly slower....

ric
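
To make the alignment point concrete, a minimal sketch of the check a device
(or a stack issuing WRITE_SAME with the unmap bit) is effectively doing; the
"erase chunk" size and the helper name are assumptions for illustration, not
values from SBC-3:

#include <stdint.h>
#include <stdbool.h>

/* Does an unmap range risk a read-modify-write?  A partial chunk at either
 * end forces the device to preserve (read, partially zero, rewrite) the
 * rest of that chunk.  chunk_sectors is the device's internal allocation
 * unit, assumed known. */
static bool unmap_may_rmw(uint64_t start, uint64_t nr_sectors,
			  uint64_t chunk_sectors)
{
	return (start % chunk_sectors) != 0 ||
	       ((start + nr_sectors) % chunk_sectors) != 0;
}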


^ permalink raw reply	[flat|nested] 208+ messages in thread


* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-17 17:19                                                 ` James Bottomley
@ 2009-08-17 18:21                                                   ` Greg Freemyer
  -1 siblings, 0 replies; 208+ messages in thread
From: Greg Freemyer @ 2009-08-17 18:21 UTC (permalink / raw)
  To: James Bottomley
  Cc: Bill Davidsen, Mark Lord, Arjan van de Ven, Alan Cox,
	Chris Worley, Matthew Wilcox, Bryan Donlan, david,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Mon, Aug 17, 2009 at 1:19 PM, James Bottomley<James.Bottomley@suse.de> wrote:
> On Mon, 2009-08-17 at 13:08 -0400, Greg Freemyer wrote:
>> All,
>>
>> Seems like the high-level wrap-up of all this is:
>>
>> There are hopes that highly efficient SSDs will appear on the market
>> that can leverage a passthru non-coalescing discard feature.  And that
>> a whitelist should be created to allow those SSDs to see discards
>> intermixed with the rest of the data i/o.
>
> That's not my conclusion.  Mine was the NCQ drain would still be
> detrimental to interleaved trim even if the drive could do it for zero
> cost.

Maybe I misunderstood Jim Owens's previous comment that designing for
devices that only meet the spec was not his / Linus's preference.

Instead, they want a whitelist of drives that support trim / NCQ
without having to drain the queue.

I just re-read his post and he did not explicitly say that, so maybe
I'm misrepresenting it.

>> For the other known cases:
>>
>> SSDs that meet the ata-8 spec, but don't exceed it
>> Enterprise SCSI
>
> No, SCSI will do WRITE_SAME/UNMAP as currently drafted in SBC3
>
>> mdraid with SSD storage used to build raid5 / raid6 arrays
>>
>> Non-coalescing is believed detrimental,
>
> It is?  Why?

For the only compliant SSD in the wild, Mark has shown it to be true
via testing.

For Enterprise SCSI, I thought you said a coalescing solution is
preferred.  (I took that to mean non-coalescing is detrimental.  Not
true?)

For mdraid, if the trims are not coalesced, mdraid will have to either
ignore them or coalesce them itself. Having them come in bigger
discard ranges is clearly better (i.e. at least the size of a stripe,
so it can adjust the start / end sectors to a stripe boundary).

>>  but a regular flushing of the
>> unused blocks/sectors via a tool like Mark Lord has written should be
>> acceptable.
>>
>> Mark, I don't believe your tool really addresses the mdraid situation,
>> do you agree.  ie. Since your bypassing most of the block stack,
>> mdraid has no way of snooping on / adjusting the discards you are
>> sending out.
>>
>> Thus the 2 solutions that have been worked on already seem to address
>> the needs of everything but mdraid.
>
> I count three:  Mark Lord script via SG_IO.  hch enhanced script via
> XFS_TRIM and willy current discard inline which he's considering
> coalescing for.

I missed XFS_TRIM somehow.  What benefit does XFS_TRIM provide at a
high level?  Is it part of the real-time file-delete path, or an
after-the-fact scanner?

> James

Greg
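
A minimal sketch of the stripe-boundary adjustment described above for mdraid:
shrink a discard range inward so that only whole stripes are discarded, and
skip ranges that do not cover even one full stripe. The helper and its
plain-integer interface are illustrative assumptions, not existing md code:

#include <stdint.h>

/* Round a discard range inward to full-stripe boundaries (all values in
 * sectors).  Returns 1 and fills out_start/out_nr if at least one whole
 * stripe remains, 0 otherwise. */
static int clamp_discard_to_stripes(uint64_t start, uint64_t nr_sectors,
				    uint64_t stripe_sectors,
				    uint64_t *out_start, uint64_t *out_nr)
{
	uint64_t first = (start + stripe_sectors - 1) / stripe_sectors;	/* round start up */
	uint64_t last  = (start + nr_sectors) / stripe_sectors;		/* round end down */

	if (last <= first)
		return 0;

	*out_start = first * stripe_sectors;
	*out_nr    = (last - first) * stripe_sectors;
	return 1;
}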

^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-17 14:10                                         ` Matthew Wilcox
@ 2009-08-17 19:12                                           ` Christoph Hellwig
  -1 siblings, 0 replies; 208+ messages in thread
From: Christoph Hellwig @ 2009-08-17 19:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: James Bottomley, Christoph Hellwig, Arjan van de Ven, Alan Cox,
	Mark Lord, Chris Worley, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Mon, Aug 17, 2009 at 08:10:38AM -0600, Matthew Wilcox wrote:
> On Mon, Aug 17, 2009 at 08:56:12AM -0500, James Bottomley wrote:
> > The testing was initially done to see if the initial maximal discard
> > proposal from LSF09 was a viable approach (which it wasn't given the
> > time taken to UNMAP).
> 
> It would be nice if that feedback could be made public instead of it
> leaking out in dribs and drabs like this.

Yeah, I don't remember hearing about anything like this either.


^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-17 18:21                                                   ` Greg Freemyer
@ 2009-08-17 19:18                                                     ` James Bottomley
  -1 siblings, 0 replies; 208+ messages in thread
From: James Bottomley @ 2009-08-17 19:18 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: Bill Davidsen, Mark Lord, Arjan van de Ven, Alan Cox,
	Chris Worley, Matthew Wilcox, Bryan Donlan, david,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Mon, 2009-08-17 at 14:21 -0400, Greg Freemyer wrote:
> On Mon, Aug 17, 2009 at 1:19 PM, James Bottomley<James.Bottomley@suse.de> wrote:
> > On Mon, 2009-08-17 at 13:08 -0400, Greg Freemyer wrote:
> >> All,
> >>
> >> Seems like the high-level wrap-up of all this is:
> >>
> >> There are hopes that highly efficient SSDs will appear on the market
> >> that can leverage a passthru non-coalescing discard feature.  And that
> >> a whitelist should be created to allow those SSDs to see discards
> >> intermixed with the rest of the data i/o.
> >
> > That's not my conclusion.  Mine was the NCQ drain would still be
> > detrimental to interleaved trim even if the drive could do it for zero
> > cost.
> 
> Maybe I misunderstood Jim Owens' previous comment that designing for
> devices that only meet the spec was not his / Linus's preference.
> 
> Instead, they want a whitelist of drives that support trim / NCQ
> without having to drain the queue.

There's no way to do this.  The spec explicitly requires that you not
overlap tagged and untagged commands.  The reason is fairly obvious:
you wouldn't be able to separate the completions.

> I just re-read his post and he did not explicitly say that, so maybe
> I'm mis-representing it.
> 
> >> For the other known cases:
> >>
> >> SSDs that meet the ata-8 spec, but don't exceed it
> >> Enterprise SCSI
> >
> > No, SCSI will do WRITE_SAME/UNMAP as currently drafted in SBC3
> >
> >> mdraid with SSD storage used to build raid5 / raid6 arrays
> >>
> >> Non-coalescing is believed detrimental,
> >
> > It is?  Why?
> 
> For the only compliant SSD in the wild, Mark has shown it to be true
> via testing.

He only said larger trims take longer.  As I said previously, if it's an
X+nY relationship (a fixed per-command cost X plus a per-range cost Y, so
one trim carrying n ranges costs X+nY rather than the n(X+Y) that n
separate trims would cost), then we still benefit from accumulation up to
some value of n.

> For Enterprise SCSI, I thought you said a coalescing solution is
> preferred.  (I took that to mean non-coalescing is detrimental.  Not
> true?)

I'm trying to persuade the array vendors to speak for themselves, but it
seems that UNMAP takes time.  Of course, in SCSI, this is a taggable
command so we don't have the drain overhead ... but then we can't do
anything that would produce an undetermined state based on out of order
tag execution either.

> For mdraid, if the trims are not coalesced, mdraid will have to either
> ignore them or coalesce them itself.  Having them come in bigger
> discard ranges is clearly better (i.e. at least the size of a stripe,
> so it can adjust the start / end sectors to a stripe boundary).

If we did discard accumulation in-kernel (a big if), it would likely be
at the request level; thus md and dm would automatically inherit it.
dm/md are a problem for a userspace accumulation solution, though
(although I suspect the request elevator can fix that).
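
(Sketch of the idea only: "struct pending_discard" and the
issue_pending_discard() helper are assumed names, not existing
block-layer code.)

struct pending_discard {
	sector_t start;
	sector_t len;			/* 0 means nothing pending */
};

static void discard_accumulate(struct pending_discard *pd,
			       sector_t start, sector_t len)
{
	if (pd->len && pd->start + pd->len == start) {
		pd->len += len;		/* contiguous: just extend */
		return;
	}
	if (pd->len)
		issue_pending_discard(pd->start, pd->len);	/* assumed helper */
	pd->start = start;		/* start a new pending range */
	pd->len = len;
}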

> >>  but a regular flushing of the
> >> unused blocks/sectors via a tool like Mark Lord has written should be
> >> acceptable.
> >>
> >> Mark, I don't believe your tool really addresses the mdraid situation,
> do you agree?  i.e. Since you're bypassing most of the block stack,
> >> mdraid has no way of snooping on / adjusting the discards you are
> >> sending out.
> >>
> >> Thus the 2 solutions that have been worked on already seem to address
> >> the needs of everything but mdraid.
> >
> > I count three:  Mark Lord script via SG_IO.  hch enhanced script via
> > XFS_TRIM and willy current discard inline which he's considering
> > coalescing for.
> 
> I missed XFS_TRIM somehow.  What benefit does XFS_TRIM provide at a
> high level?  Is it part of the real-time file-delete path, or an
> after-the-fact scanner?

It guarantees that trim does not overlap allocations and writes on a
running system, so it gives us safety of execution.

James



^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-17 19:12                                           ` Christoph Hellwig
@ 2009-08-17 19:24                                             ` James Bottomley
  -1 siblings, 0 replies; 208+ messages in thread
From: James Bottomley @ 2009-08-17 19:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, Arjan van de Ven, Alan Cox, Mark Lord,
	Chris Worley, Bryan Donlan, david, Greg Freemyer,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Mon, 2009-08-17 at 15:12 -0400, Christoph Hellwig wrote:
> On Mon, Aug 17, 2009 at 08:10:38AM -0600, Matthew Wilcox wrote:
> > On Mon, Aug 17, 2009 at 08:56:12AM -0500, James Bottomley wrote:
> > > The testing was initially done to see if the initial maximal discard
> > > proposal from LSF09 was a viable approach (which it wasn't given the
> > > time taken to UNMAP).
> > 
> > It would be nice if that feedback could be made public instead of it
> > leaking out in dribs and drabs like this.
> 
> Yeah, I don't remember hearing about anything like this either.

Well, you were both in #storage when it got discussed on 4 May.  However,
I'm hopeful that the array vendors will make a more direct statement of
their capabilities shortly.

James


^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-17 19:18                                                     ` James Bottomley
@ 2009-08-17 20:19                                                       ` Mark Lord
  -1 siblings, 0 replies; 208+ messages in thread
From: Mark Lord @ 2009-08-17 20:19 UTC (permalink / raw)
  To: James Bottomley
  Cc: Greg Freemyer, Bill Davidsen, Arjan van de Ven, Alan Cox,
	Chris Worley, Matthew Wilcox, Bryan Donlan, david,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

James Bottomley wrote:
> On Mon, 2009-08-17 at 14:21 -0400, Greg Freemyer wrote:
>> On Mon, Aug 17, 2009 at 1:19 PM, James Bottomley<James.Bottomley@suse.de> wrote:
>>> On Mon, 2009-08-17 at 13:08 -0400, Greg Freemyer wrote:
..
>>>> Non-coalescing is believed detrimental,
>>> It is?  Why?
>> For the only compliant SSD in the wild, Mark has shown it to be true
>> via testing.
> 
> He only said larger trims take longer.  As I said previously, if it's an
> X+nY relationship, then we still benefit from accumulation up to some
> value of n.
..

Err, what I said was, "rm -rf /usr/src/linux" takes over half an hour
with uncoalesced TRIM, and only a scant few seconds in total *with*
coalesced TRIM.



^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-17 20:19                                                       ` Mark Lord
@ 2009-08-17 20:28                                                         ` James Bottomley
  -1 siblings, 0 replies; 208+ messages in thread
From: James Bottomley @ 2009-08-17 20:28 UTC (permalink / raw)
  To: Mark Lord
  Cc: Greg Freemyer, Bill Davidsen, Arjan van de Ven, Alan Cox,
	Chris Worley, Matthew Wilcox, Bryan Donlan, david,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

On Mon, 2009-08-17 at 16:19 -0400, Mark Lord wrote:
> James Bottomley wrote:
> > On Mon, 2009-08-17 at 14:21 -0400, Greg Freemyer wrote:
> >> On Mon, Aug 17, 2009 at 1:19 PM, James Bottomley<James.Bottomley@suse.de> wrote:
> >>> On Mon, 2009-08-17 at 13:08 -0400, Greg Freemyer wrote:
> ..
> >>>> Non-coalescing is believed detrimental,
> >>> It is?  Why?
> >> For the only compliant SSD in the wild, Mark has shown it to be true
> >> via testing.
> > 
> > He only said larger trims take longer.  As I said previously, if it's an
> > X+nY relationship, then we still benefit from accumulation up to some
> > value of n.
> ..
> 
> Err, what I said was, "rm -rf /usr/src/linux" takes over half an hour
> with uncoalesced TRIM, and only a scant few seconds in total *with*
> coalesced TRIM.

Yes, sorry, missed the Non- when I read that sentence.

James


^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)
  2009-08-17 17:08                                               ` Greg Freemyer
@ 2009-08-17 20:28                                                 ` Mark Lord
  -1 siblings, 0 replies; 208+ messages in thread
From: Mark Lord @ 2009-08-17 20:28 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: Bill Davidsen, James Bottomley, Arjan van de Ven, Alan Cox,
	Chris Worley, Matthew Wilcox, Bryan Donlan, david,
	Markus Trippelsdorf, Matthew Wilcox, Hugh Dickins, Nitin Gupta,
	Ingo Molnar, Peter Zijlstra, linux-kernel, linux-mm, linux-scsi,
	linux-ide, Linux RAID

Greg Freemyer wrote:
..
> Mark, I don't believe your tool really addresses the mdraid situation,
> do you agree?  i.e. Since you're bypassing most of the block stack,
> mdraid has no way of snooping on / adjusting the discards you are
> sending out.
..

Taking care of mounted RAID / LVM filesystems requires in-kernel TRIM
support, possibly exported via an ioctl().

Taking care of unmounted RAID / LVM filesystems is possible in userland,
but would also benefit from in-kernel support, where layouts are defined
and known better than in userland.

XFS_TRIM was an idea that Christoph floated as a concept for examination.

I think something along those lines would be best, but perhaps with an
interface at the VFS layer.  Something that permits a userland tool
to work like this (below) might be nearly ideal:

/* sketch: GET_NUMBER_OF_BLOCK_GROUPS and TRIM_ALL_FREE_EXTENTS_OF_GROUP
 * are proposed ioctls here, not existing ones */
int main(void) {
	int fd = open(filesystem_device, O_RDONLY);	/* the fs block device */
	while (1) {
		int g, ngroups = ioctl(fd, GET_NUMBER_OF_BLOCK_GROUPS);
		for (g = 0; g < ngroups; ++g) {
			ioctl(fd, TRIM_ALL_FREE_EXTENTS_OF_GROUP, g);
		}
		sleep(3600);
	}
}

Not all filesystems have a "block group" or "allocation group" structure,
but I suspect that it's an easy mapping in most cases.

With this scheme, the kernel is absolved of the need to track/coalesce
TRIM requests entirely.

Something like that, perhaps.


^ permalink raw reply	[flat|nested] 208+ messages in thread

* Re: [PATCH] swap: send callback when swap slot is freed
  2009-08-17  2:55 ` [PATCH] swap: send callback when swap slot is freed KAMEZAWA Hiroyuki
  2009-08-17  5:08   ` Nitin Gupta
@ 2009-08-22  7:34   ` Nai Xia
  1 sibling, 0 replies; 208+ messages in thread
From: Nai Xia @ 2009-08-22  7:34 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: ngupta, mingo, linux-kernel

On Mon, Aug 17, 2009 at 10:55 AM, KAMEZAWA
Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 12 Aug 2009 20:07:43 +0530
> Nitin Gupta <ngupta@vflare.org> wrote:
>
>> Currently, we have "swap discard" mechanism which sends a discard bio request
>> when we find a free cluster during scan_swap_map(). This callback can come a
>> long time after swap slots are actually freed.
>>
>> This delay in callback is a great problem when (compressed) RAM [1] is used
>> as a swap device. So, this change adds a callback which is called as
>> soon as a swap slot becomes free. For above mentioned case of swapping
>> over compressed RAM device, this is very useful since we can immediately
>> free memory allocated for this swap page.
>>
>> This callback does not replace swap discard support. It is called with
>> swap_lock held, so it is meant to trigger action that finishes quickly.
>> However, swap discard is an I/O request and can be used for taking longer
>> actions.
>>
>> Links:
>> [1] http://code.google.com/p/compcache/
>>
>
> Hmm, do you really need notify at *every* swap free ?
> No batching is necessary ?

Compcache is a block device: it passively accepts swap pages, compresses
them, and stores them in compressed form.  It has little information about
when the reclaiming of unused slots should be batched.

A normal swap device does not need to care much about unused slots
because the storage is always there, but for compcache the storage is
allocated in RAM, so it needs to be freed as soon as possible; otherwise
the compression loses its meaning.

Assume the compression ratio is 50% and there are 8 stale compressed
pages in compcache (i.e. 16 user pages have already been swapped back in).
Those 8 page frames could instead take 8 more swap pages, which compress
into 4 frames, leaving 4 free; those 4 could take 4 more swap pages
-> 2 compressed pages -> 2 more swap pages -> 1 compressed page
-> 1 more swap page -> half a compressed page -> 1 more swap page
-> another half a compressed page.
That is about 8+4+2+1+1 = 16 wasted pages, still not counting the
overhead of compressed-page management.

That means if you batch N slots (even if you have the batching
information), you waste the corresponding number of pages for as long as
the callback is delayed.  While swapping is already happening, we really
do not want to hold on to pages that are wasted for sure.
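
(Sketch only, for illustration: how a compressed-RAM backend might use the
proposed set_swap_free_notify() hook so the memory backing a freed slot is
released immediately.  The ramzswap_* names, the private_data lookup and
swap_type are assumptions, not actual compcache code.)

static void ramzswap_slot_free_notify(struct block_device *bdev,
				      unsigned long offset)
{
	/* assume the driver keeps its per-device state in
	 * gendisk->private_data */
	struct ramzswap *rz = bdev->bd_disk->private_data;

	ramzswap_free_page(rz, offset);	/* drop the compressed copy now */
}

/* and, once at swapon time, something like:
 *	set_swap_free_notify(swap_type, ramzswap_slot_free_notify);
 */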


Thanks,
Nai

>
> Thanks,
> -Kame
>
>> Signed-off-by: Nitin Gupta <ngupta@vflare.org>
>> ---
>>
>>  include/linux/swap.h |    5 +++++
>>  mm/swapfile.c        |   16 ++++++++++++++++
>>  2 files changed, 21 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 7c15334..4cbe3c4 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -8,6 +8,7 @@
>>  #include <linux/memcontrol.h>
>>  #include <linux/sched.h>
>>  #include <linux/node.h>
>> +#include <linux/blkdev.h>
>>
>>  #include <asm/atomic.h>
>>  #include <asm/page.h>
>> @@ -20,6 +21,8 @@ struct bio;
>>  #define SWAP_FLAG_PRIO_MASK  0x7fff
>>  #define SWAP_FLAG_PRIO_SHIFT 0
>>
>> +typedef void (swap_free_notify_fn) (struct block_device *, unsigned long);
>> +
>>  static inline int current_is_kswapd(void)
>>  {
>>       return current->flags & PF_KSWAPD;
>> @@ -155,6 +158,7 @@ struct swap_info_struct {
>>       unsigned int max;
>>       unsigned int inuse_pages;
>>       unsigned int old_block_size;
>> +     swap_free_notify_fn *swap_free_notify_fn;
>>  };
>>
>>  struct swap_list_t {
>> @@ -295,6 +299,7 @@ extern sector_t swapdev_block(int, pgoff_t);
>>  extern struct swap_info_struct *get_swap_info_struct(unsigned);
>>  extern int reuse_swap_page(struct page *);
>>  extern int try_to_free_swap(struct page *);
>> +extern void set_swap_free_notify(unsigned, swap_free_notify_fn *);
>>  struct backing_dev_info;
>>
>>  /* linux/mm/thrash.c */
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 8ffdc0d..aa95fc7 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -552,6 +552,20 @@ out:
>>       return NULL;
>>  }
>>
>> +/*
>> + * Sets callback for event when swap_map[offset] == 0
>> + * i.e. page at this swap offset is no longer used.
>> + */
>> +void set_swap_free_notify(unsigned type, swap_free_notify_fn *notify_fn)
>> +{
>> +     struct swap_info_struct *sis;
>> +     sis = get_swap_info_struct(type);
>> +     BUG_ON(!sis);
>> +     sis->swap_free_notify_fn = notify_fn;
>> +     return;
>> +}
>> +EXPORT_SYMBOL(set_swap_free_notify);
>> +
>>  static int swap_entry_free(struct swap_info_struct *p,
>>                          swp_entry_t ent, int cache)
>>  {
>> @@ -583,6 +597,8 @@ static int swap_entry_free(struct swap_info_struct *p,
>>                       swap_list.next = p - swap_info;
>>               nr_swap_pages++;
>>               p->inuse_pages--;
>> +             if (p->swap_free_notify_fn)
>> +                     p->swap_free_notify_fn(p->bdev, offset);
>>       }
>>       if (!swap_count(count))
>>               mem_cgroup_uncharge_swap(ent);
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 208+ messages in thread

* [PATCH] swap: send callback when swap slot is freed
@ 2009-08-12 14:37 Nitin Gupta
  0 siblings, 0 replies; 208+ messages in thread
From: Nitin Gupta @ 2009-08-12 14:37 UTC (permalink / raw)
  To: mingo; +Cc: linux-kernel

Currently, we have "swap discard" mechanism which sends a discard bio request
when we find a free cluster during scan_swap_map(). This callback can come a
long time after swap slots are actually freed.

This delay in callback is a great problem when (compressed) RAM [1] is used
as a swap device. So, this change adds a callback which is called as
soon as a swap slot becomes free. For above mentioned case of swapping
over compressed RAM device, this is very useful since we can immediately
free memory allocated for this swap page.

This callback does not replace swap discard support. It is called with
swap_lock held, so it is meant to trigger action that finishes quickly.
However, swap discard is an I/O request and can be used for taking longer
actions.

Links:
[1] http://code.google.com/p/compcache/

Signed-off-by: Nitin Gupta <ngupta@vflare.org>
---

 include/linux/swap.h |    5 +++++
 mm/swapfile.c        |   16 ++++++++++++++++
 2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7c15334..4cbe3c4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -8,6 +8,7 @@
 #include <linux/memcontrol.h>
 #include <linux/sched.h>
 #include <linux/node.h>
+#include <linux/blkdev.h>
 
 #include <asm/atomic.h>
 #include <asm/page.h>
@@ -20,6 +21,8 @@ struct bio;
 #define SWAP_FLAG_PRIO_MASK	0x7fff
 #define SWAP_FLAG_PRIO_SHIFT	0
 
+typedef void (swap_free_notify_fn) (struct block_device *, unsigned long);
+
 static inline int current_is_kswapd(void)
 {
 	return current->flags & PF_KSWAPD;
@@ -155,6 +158,7 @@ struct swap_info_struct {
 	unsigned int max;
 	unsigned int inuse_pages;
 	unsigned int old_block_size;
+	swap_free_notify_fn *swap_free_notify_fn;
 };
 
 struct swap_list_t {
@@ -295,6 +299,7 @@ extern sector_t swapdev_block(int, pgoff_t);
 extern struct swap_info_struct *get_swap_info_struct(unsigned);
 extern int reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
+extern void set_swap_free_notify(unsigned, swap_free_notify_fn *);
 struct backing_dev_info;
 
 /* linux/mm/thrash.c */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8ffdc0d..aa95fc7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -552,6 +552,20 @@ out:
 	return NULL;
 }
 
+/*
+ * Sets callback for event when swap_map[offset] == 0
+ * i.e. page at this swap offset is no longer used.
+ */
+void set_swap_free_notify(unsigned type, swap_free_notify_fn *notify_fn)
+{
+	struct swap_info_struct *sis;
+	sis = get_swap_info_struct(type);
+	BUG_ON(!sis);
+	sis->swap_free_notify_fn = notify_fn;
+	return;
+}
+EXPORT_SYMBOL(set_swap_free_notify);
+
 static int swap_entry_free(struct swap_info_struct *p,
 			   swp_entry_t ent, int cache)
 {
@@ -583,6 +597,8 @@ static int swap_entry_free(struct swap_info_struct *p,
 			swap_list.next = p - swap_info;
 		nr_swap_pages++;
 		p->inuse_pages--;
+		if (p->swap_free_notify_fn)
+			p->swap_free_notify_fn(p->bdev, offset);
 	}
 	if (!swap_count(count))
 		mem_cgroup_uncharge_swap(ent);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 208+ messages in thread

end of thread, other threads:[~2009-08-22  7:34 UTC | newest]

Thread overview: 208+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-12 14:37 [PATCH] swap: send callback when swap slot is freed Nitin Gupta
2009-08-12 14:37 ` Nitin Gupta
2009-08-12 18:33 ` Peter Zijlstra
2009-08-12 22:13 ` Andrew Morton
2009-08-12 22:48 ` Hugh Dickins
2009-08-12 22:48   ` Hugh Dickins
2009-08-13  2:30   ` Nitin Gupta
2009-08-13  2:30     ` Nitin Gupta
2009-08-13  6:53     ` Peter Zijlstra
2009-08-13  6:53       ` Peter Zijlstra
2009-08-13 14:44       ` Nitin Gupta
2009-08-13 14:44         ` Nitin Gupta
2009-08-13 17:45     ` Hugh Dickins
2009-08-13 17:45       ` Hugh Dickins
2009-08-13  2:41   ` Nitin Gupta
2009-08-13  2:41     ` Nitin Gupta
2009-08-13  5:05     ` compcache as a pre-swap area (was: [PATCH] swap: send callback when swap slot is freed) Al Boldi
2009-08-13  5:05       ` Al Boldi
2009-08-13 17:31       ` Nitin Gupta
2009-08-13 17:31         ` Nitin Gupta
2009-08-14  4:02         ` Al Boldi
2009-08-14  4:02           ` Al Boldi
2009-08-14  4:53           ` compcache as a pre-swap area Nitin Gupta
2009-08-14  4:53             ` Nitin Gupta
2009-08-14 15:49             ` Al Boldi
2009-08-14 15:49               ` Al Boldi
2009-08-15 11:00               ` Al Boldi
2009-08-15 11:00                 ` Al Boldi
2009-08-13 15:13   ` Discard support (was Re: [PATCH] swap: send callback when swap slot is freed) Matthew Wilcox
2009-08-13 15:13     ` Matthew Wilcox
2009-08-13 15:17     ` david
2009-08-13 15:17       ` david
2009-08-13 15:26       ` Matthew Wilcox
2009-08-13 15:26         ` Matthew Wilcox
2009-08-13 15:43     ` James Bottomley
2009-08-13 15:43       ` James Bottomley
2009-08-13 18:22       ` Ric Wheeler
2009-08-13 18:22         ` Ric Wheeler
2009-08-13 16:13     ` Nitin Gupta
2009-08-13 16:13       ` Nitin Gupta
2009-08-13 16:26     ` Markus Trippelsdorf
2009-08-13 16:26       ` Markus Trippelsdorf
2009-08-13 16:33       ` david
2009-08-13 16:33         ` david
2009-08-13 18:15         ` Greg Freemyer
2009-08-13 18:15           ` Greg Freemyer
2009-08-13 18:15           ` Greg Freemyer
2009-08-13 19:18           ` James Bottomley
2009-08-13 19:18             ` James Bottomley
2009-08-13 20:31             ` Richard Sharpe
2009-08-13 20:31               ` Richard Sharpe
2009-08-13 20:31               ` Richard Sharpe
2009-08-14 22:03             ` Mark Lord
2009-08-14 22:03               ` Mark Lord
2009-08-14 22:54               ` Greg Freemyer
2009-08-14 22:54                 ` Greg Freemyer
2009-08-15 13:12                 ` Mark Lord
2009-08-15 13:12                   ` Mark Lord
2009-08-13 20:44           ` david
2009-08-13 20:44             ` david
2009-08-13 20:54             ` Bryan Donlan
2009-08-13 20:54               ` Bryan Donlan
2009-08-14 22:10               ` Mark Lord
2009-08-14 22:10                 ` Mark Lord
2009-08-14 23:21                 ` Chris Worley
2009-08-14 23:21                   ` Chris Worley
2009-08-14 23:45                   ` Matthew Wilcox
2009-08-14 23:45                     ` Matthew Wilcox
2009-08-15  0:19                     ` Chris Worley
2009-08-15  0:19                       ` Chris Worley
2009-08-15  0:30                       ` Greg Freemyer
2009-08-15  0:30                         ` Greg Freemyer
2009-08-15  0:38                         ` Chris Worley
2009-08-15  0:38                           ` Chris Worley
2009-08-15  0:38                           ` Chris Worley
2009-08-15  1:55                           ` Greg Freemyer
2009-08-15  1:55                             ` Greg Freemyer
2009-08-15  1:55                             ` Greg Freemyer
2009-08-15 13:20                           ` Mark Lord
2009-08-15 13:20                             ` Mark Lord
2009-08-16 22:52                             ` Chris Worley
2009-08-16 22:52                               ` Chris Worley
2009-08-17  2:03                               ` Mark Lord
2009-08-17  2:03                                 ` Mark Lord
2009-08-15 12:59                       ` James Bottomley
2009-08-15 12:59                         ` James Bottomley
2009-08-15 13:22                         ` Mark Lord
2009-08-15 13:22                           ` Mark Lord
2009-08-15 13:55                           ` James Bottomley
2009-08-15 13:55                             ` James Bottomley
2009-08-15 17:39                             ` jim owens
2009-08-15 17:39                               ` jim owens
2009-08-16 17:08                               ` Robert Hancock
2009-08-16 17:08                                 ` Robert Hancock
2009-08-16 14:05                             ` Alan Cox
2009-08-16 14:05                               ` Alan Cox
2009-08-16 14:16                               ` Mark Lord
2009-08-16 14:16                                 ` Mark Lord
2009-08-16 15:34                               ` Arjan van de Ven
2009-08-16 15:34                                 ` Arjan van de Ven
2009-08-16 15:44                                 ` Theodore Tso
2009-08-16 15:44                                   ` Theodore Tso
2009-08-16 17:28                                   ` Mark Lord
2009-08-16 17:28                                   ` Mark Lord
2009-08-16 17:28                                     ` Mark Lord
2009-08-16 17:37                                     ` Mark Lord
2009-08-16 17:37                                     ` Mark Lord
2009-08-16 17:37                                       ` Mark Lord
2009-08-16 17:37                                       ` Mark Lord
2009-08-17 16:30                                       ` Bill Davidsen
2009-08-17 16:30                                         ` Bill Davidsen
2009-08-17 16:56                                         ` jim owens
2009-08-17 16:56                                           ` jim owens
2009-08-17 17:14                                           ` Bill Davidsen
2009-08-17 17:14                                             ` Bill Davidsen
2009-08-17 17:37                                             ` jim owens
2009-08-17 17:37                                               ` jim owens
2009-08-16 17:37                                     ` Mark Lord
2009-08-16 17:37                                     ` Mark Lord
2009-08-16 17:28                                   ` Mark Lord
2009-08-16 17:28                                   ` Mark Lord
2009-08-16 15:52                                 ` James Bottomley
2009-08-16 15:52                                   ` James Bottomley
2009-08-16 16:32                                   ` Mark Lord
2009-08-16 16:32                                     ` Mark Lord
2009-08-16 16:32                                     ` Mark Lord
2009-08-16 18:07                                     ` James Bottomley
2009-08-16 18:07                                       ` James Bottomley
2009-08-16 18:19                                       ` Mark Lord
2009-08-16 18:19                                         ` Mark Lord
2009-08-16 18:19                                         ` Mark Lord
2009-08-16 18:24                                         ` James Bottomley
2009-08-16 18:24                                           ` James Bottomley
2009-08-17 16:37                                           ` Bill Davidsen
2009-08-17 16:37                                             ` Bill Davidsen
2009-08-17 16:37                                             ` Bill Davidsen
2009-08-17 17:08                                             ` Greg Freemyer
2009-08-17 17:08                                               ` Greg Freemyer
2009-08-17 17:19                                               ` James Bottomley
2009-08-17 17:19                                                 ` James Bottomley
2009-08-17 18:16                                                 ` Ric Wheeler
2009-08-17 18:16                                                   ` Ric Wheeler
2009-08-17 18:21                                                 ` Greg Freemyer
2009-08-17 18:21                                                   ` Greg Freemyer
2009-08-17 18:21                                                   ` Greg Freemyer
2009-08-17 19:18                                                   ` James Bottomley
2009-08-17 19:18                                                     ` James Bottomley
2009-08-17 20:19                                                     ` Mark Lord
2009-08-17 20:19                                                       ` Mark Lord
2009-08-17 20:28                                                       ` James Bottomley
2009-08-17 20:28                                                         ` James Bottomley
2009-08-17 20:28                                               ` Mark Lord
2009-08-17 20:28                                                 ` Mark Lord
2009-08-16 16:59                                   ` Christoph Hellwig
2009-08-16 16:59                                     ` Christoph Hellwig
2009-08-17  4:24                                     ` Douglas Gilbert
2009-08-17  4:24                                       ` Douglas Gilbert
2009-08-17 13:56                                     ` James Bottomley
2009-08-17 13:56                                       ` James Bottomley
2009-08-17 14:10                                       ` Matthew Wilcox
2009-08-17 14:10                                         ` Matthew Wilcox
2009-08-17 19:12                                         ` Christoph Hellwig
2009-08-17 19:12                                           ` Christoph Hellwig
2009-08-17 19:24                                           ` James Bottomley
2009-08-17 19:24                                             ` James Bottomley
2009-08-16 21:50                                   ` Discard support Roland Dreier
2009-08-16 21:50                                     ` Roland Dreier
2009-08-16 22:06                                     ` Jeff Garzik
2009-08-16 22:06                                       ` Jeff Garzik
2009-08-16 22:06                                       ` Jeff Garzik
2009-08-16 22:13                                     ` Theodore Tso
2009-08-16 22:13                                       ` Theodore Tso
2009-08-16 22:51                                       ` Mark Lord
2009-08-16 22:51                                         ` Mark Lord
2009-08-16 22:51                                       ` Mark Lord
2009-08-16 22:51                                       ` Mark Lord
2009-08-16 22:51                                       ` Mark Lord
2009-08-16 19:29                                 ` Discard support (was Re: [PATCH] swap: send callback when swap slot is freed) Alan Cox
2009-08-16 19:29                                   ` Alan Cox
2009-08-16 23:05                                   ` John Robinson
2009-08-16 23:05                                     ` John Robinson
2009-08-17  2:05                                     ` Mark Lord
2009-08-17  2:05                                       ` Mark Lord
2009-08-13 21:28             ` Greg Freemyer
2009-08-13 21:28               ` Greg Freemyer
2009-08-13 22:20               ` Richard Sharpe
2009-08-13 22:20                 ` Richard Sharpe
2009-08-13 22:20                 ` Richard Sharpe
2009-08-14  0:19                 ` Greg Freemyer
2009-08-14  0:19                   ` Greg Freemyer
     [not found]                   ` <46b8a8850908131758s781b07f6v2729483c0e50ae7a@mail.gmail.com>
2009-08-14 21:33                     ` Greg Freemyer
2009-08-14 21:33                       ` Greg Freemyer
2009-08-14 21:56                       ` Discard support Roland Dreier
2009-08-14 21:56                         ` Roland Dreier
2009-08-14 22:10                         ` Greg Freemyer
2009-08-14 22:10                           ` Greg Freemyer
2009-08-14 21:33                     ` Discard support (was Re: [PATCH] swap: send callback when swap slot is freed) Greg Freemyer
2009-08-14 21:33                     ` Greg Freemyer
2009-08-14 21:33                     ` Greg Freemyer
2009-08-13 17:19     ` Hugh Dickins
2009-08-13 17:19       ` Hugh Dickins
2009-08-13 18:08     ` Douglas Gilbert
2009-08-13 18:08       ` Douglas Gilbert
2009-08-17  2:55 ` [PATCH] swap: send callback when swap slot is freed KAMEZAWA Hiroyuki
2009-08-17  5:08   ` Nitin Gupta
2009-08-17  5:11     ` KAMEZAWA Hiroyuki
2009-08-22  7:34   ` Nai Xia
  -- strict thread matches above, loose matches on Subject: below --
2009-08-12 14:37 Nitin Gupta

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.