* Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-11 17:18 Amit Kale
  2013-01-11 22:36 ` Marcin Slusarz
                   ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Amit Kale @ 2013-01-11 17:18 UTC (permalink / raw)
  To: LKML

Greetings,

STEC is happy to announce the hosting of our EnhanceIO SSD caching software on github.
We would like to invite kernel hackers to try it. We would appreciate your valuable feedback to help us improve it to the standards of Linux kernel source code. We hope to eventually submit it for possible inclusion in the Linux kernel.

Repository location -  https://github.com/stec-inc/EnhanceIO
License - GPL
Source - Derived from the source base of the EnhanceIO product
Current state - Alpha.
Ongoing work - Code cleanup, testing, more documentation.

Do try it. If you face problems, file bugs at github or write to me.

The first section of the README.txt file in this repository introduces EnhanceIO and is reproduced below.

----------------
EnhanceIO driver is based on EnhanceIO SSD caching software product developed by STEC Inc. EnhanceIO was derived from Facebook's open source Flashcache project. EnhanceIO uses SSDs as cache devices for traditional rotating hard disk drives (referred to as source volumes throughout this document).

EnhanceIO can work with any block device, be it an entire physical disk, an individual disk partition,  a RAIDed DAS device, a SAN volume, a device mapper volume or a software RAID (md) device.

The source volume to SSD mapping is a set-associative mapping based on the source volume sector number, with a default set size (aka associativity) of 512 blocks and a default block size of 4 KB.  Partial cache blocks are not used.
The default value of 4 KB is chosen because it is the common I/O block size of most storage systems.  With these default values, each cache set is 2 MB (512 * 4 KB).  Therefore, a 400 GB SSD will have a little less than 200,000 cache sets, because a little space is used for storing the metadata on the SSD.
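As a rough illustration of the mapping arithmetic (this is only a sketch with invented names, not the actual driver code):

    #define CACHE_BLOCK_BYTES  4096ULL    /* default cache block size */
    #define SET_ASSOCIATIVITY  512ULL     /* cache blocks per set     */
    #define SECTOR_BYTES       512ULL     /* size of a disk sector    */

    /* Which cache set a given source-volume sector falls into. */
    static unsigned long long sector_to_cache_set(unsigned long long sector,
                                                  unsigned long long nr_sets)
    {
            unsigned long long block = (sector * SECTOR_BYTES) / CACHE_BLOCK_BYTES;

            /* set-associative: the block may occupy any of the 512
             * slots within this one set */
            return block % nr_sets;
    }

With the defaults, each set covers 512 * 4 KB = 2 MB of SSD, so a 400 GB SSD yields a little under 400 GB / 2 MB = 200,000 sets once the on-SSD metadata is subtracted.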

EnhanceIO supports three caching modes: read-only, write-through, and write-back, and three cache replacement policies: random, FIFO, and LRU.

Read-only caching mode causes EnhanceIO to direct write IO requests only to the HDD. Read IO requests are issued to the HDD and the data read from the HDD is stored on the SSD. Subsequent read requests for the same blocks are served from the SSD, which reduces their latency substantially.

In Write-through mode, reads are handled similarly to Read-only mode.
Write-through mode causes EnhanceIO to write application data to both HDD and SSD. Subsequent reads of the same data benefit because they can be served from the SSD.

Write-back mode improves write latency by writing application data only to the SSD. This data, referred to as dirty data, is later copied to the HDD asynchronously. Reads are handled similarly to the Read-only and Write-through modes.
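Purely as an illustration of how the three modes route a write (pseudocode-style C with invented helper names, not the actual driver):

    #include <linux/bio.h>

    enum cache_mode { MODE_READ_ONLY, MODE_WRITE_THROUGH, MODE_WRITE_BACK };

    void submit_to_hdd(struct bio *bio);   /* placeholder, not a real API */
    void submit_to_ssd(struct bio *bio);   /* placeholder, not a real API */
    void mark_dirty(struct bio *bio);      /* placeholder, not a real API */

    static void handle_write(enum cache_mode mode, struct bio *bio)
    {
            switch (mode) {
            case MODE_READ_ONLY:
                    /* writes bypass the SSD (any stale cached copy would
                     * also have to be invalidated) */
                    submit_to_hdd(bio);
                    break;
            case MODE_WRITE_THROUGH:
                    /* data goes to both devices */
                    submit_to_hdd(bio);
                    submit_to_ssd(bio);
                    break;
            case MODE_WRITE_BACK:
                    /* SSD only; the block is marked dirty and copied to
                     * the HDD later, asynchronously */
                    submit_to_ssd(bio);
                    mark_dirty(bio);
                    break;
            }
    }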
----------------

Look forward to hearing from you.
Thanks.
--
Amit Kale

PROPRIETARY-CONFIDENTIAL INFORMATION INCLUDED



This electronic transmission, and any documents attached hereto, may contain confidential, proprietary and/or legally privileged information. The information is intended only for use by the recipient named above. If you received this electronic message in error, please notify the sender and delete the electronic message. Any disclosure, copying, distribution, or use of the contents of information received in error is strictly prohibited, and violators will be pursued legally.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-11 17:18 Announcement: STEC EnhanceIO SSD caching software for Linux kernel Amit Kale
@ 2013-01-11 22:36 ` Marcin Slusarz
  2013-01-14 21:46 ` Mike Snitzer
  2013-01-30 12:36 ` Pavel Machek
  2 siblings, 0 replies; 54+ messages in thread
From: Marcin Slusarz @ 2013-01-11 22:36 UTC (permalink / raw)
  To: Amit Kale; +Cc: LKML

On Sat, Jan 12, 2013 at 01:18:37AM +0800, Amit Kale wrote:
> Greetings,
> 
> STEC is happy to announce hosting of our EnhanceIO SSD caching software on github.
> We would like to invite kernel hackers to try it. We'll appreciate your
> valuable feedback to help us improve it to the standards of Linux kernel
> source code. We hope to eventually submit it for a possible inclusion in
> Linux kernel.

If you are serious about inclusion in Linux kernel sources, you may want to
act on these comments:
- kernel API wrappers should be removed
- your ioctl won't work with 32-bit userspace on a 64-bit kernel (differences
  in size and padding of struct cache_rec_short fields; a sketch of a
  compat-safe layout follows after this list)
- volatiles (probably) should go, see Documentation/volatile-considered-harmful.txt
- printk wrappers can be easily replaced by pr_debug/pr_info/etc
- the code should be more or less checkpatch.pl clean
- this is buggy: VERIFY(spin_is_locked((sl))); - spin_is_locked always returns
  0 on !CONFIG_SMP && !CONFIG_DEBUG_SPINLOCK
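
To illustrate the 32-/64-bit ioctl point above (the field names below are invented for the example, not the real struct cache_rec_short):

    #include <linux/types.h>

    /* A layout like this changes between 32- and 64-bit userspace,
     * because 'unsigned long' is 4 vs 8 bytes and the compiler adds
     * different implicit padding:
     */
    struct cache_rec_bad {
            char            cr_name[32];
            unsigned long   cr_size;        /* 4 bytes on i386, 8 on x86_64 */
            int             cr_mode;        /* trailing padding on 64-bit only */
    };

    /* Fixed-width types plus explicit padding keep the ABI identical
     * for both, so no compat_ioctl translation is needed:
     */
    struct cache_rec_fixed {
            char    cr_name[32];
            __u64   cr_size;
            __u32   cr_mode;
            __u32   cr_pad;
    };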

> (...)
> PROPRIETARY-CONFIDENTIAL INFORMATION INCLUDED
> 
> This electronic transmission, and any documents attached hereto, may contain
> confidential, proprietary and/or legally privileged information. The
> information is intended only for use by the recipient named above. If you
> received this electronic message in error, please notify the sender and delete
> the electronic message. Any disclosure, copying, distribution, or use of the
> contents of information received in error is strictly prohibited, and
> violators will be pursued legally.

FYI, most people are annoyed when they see that kind of threat on public
mailing lists...

Marcin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-11 17:18 Announcement: STEC EnhanceIO SSD caching software for Linux kernel Amit Kale
  2013-01-11 22:36 ` Marcin Slusarz
@ 2013-01-14 21:46 ` Mike Snitzer
  2013-01-15 13:19   ` Amit Kale
  2013-01-30 12:36 ` Pavel Machek
  2 siblings, 1 reply; 54+ messages in thread
From: Mike Snitzer @ 2013-01-14 21:46 UTC (permalink / raw)
  To: Amit Kale; +Cc: LKML, dm-devel

Hi Amit,

On Fri, Jan 11, 2013 at 12:18 PM, Amit Kale <akale@stec-inc.com> wrote:
> Greetings,
>
> STEC is happy to announce hosting of our EnhanceIO SSD caching software on github.
> We would like to invite kernel hackers to try it. We'll appreciate your valuable feedback to help us improve it to the standards of Linux kernel source code. We hope to eventually submit it for a possible inclusion in Linux kernel.

The github code you've referenced is in a strange place; it is
obviously in a bit of flux.

> Repository location -  https://github.com/stec-inc/EnhanceIO
> License - GPL
> Source - Derived from the source base of EnhanceIO product Current state - Alpha.
> Ongoing work - Code cleanup, testing, more documentation.
>
> Do try it. If you face problems, file bugs at github or write to me.
>
> First section of the README.txt file in this repository introduces EnhanceIO and is as follows
>
> ----------------
> EnhanceIO driver is based on EnhanceIO SSD caching software product developed by STEC Inc. EnhanceIO was derived from Facebook's open source Flashcache project. EnhanceIO uses SSDs as cache devices for traditional rotating hard disk drives (referred to as source volumes throughout this document).

Earlier versions of EnhanceIO made use of Device Mapper (and your
github code still has artifacts from that historic DM dependency, e.g.
eio_map still returns DM_MAPIO_SUBMITTED).

As a DM target, EnhanceIO still implemented its own bio splitting
rather than just using the DM core's bio splitting; now you've decided
to move away from DM entirely.  Any reason why?

Joe Thornber published the new DM cache target on dm-devel a month ago:
https://www.redhat.com/archives/dm-devel/2012-December/msg00029.html

( I've also kept a functional git repo with that code, and additional
fixes, in the 'dm-devel-cache' branch of my github repo:
git://github.com/snitm/linux.git )

It would be unfortunate if Joe's publishing of the dm-cache codebase
somehow motivated STEC's switch away from DM (despite EnhanceIO's DM
roots given it was based on FB's flashcache which also uses DM).

DM really does offer a compelling foundation for stacking storage
drivers in complementary ways (e.g. we envision dm-cache being stacked
in conjunction with dm-thinp).  So a DM-based caching layer has been
of real interest to Red Hat's DM team.

Given dm-cache's clean design and modular cache replacement policy
interface we were hopeful that any existing limitations in dm-cache
could be resolved through further work with the greater community
(STEC included).  Instead, in addition to bcache, with EnhanceIO we
have more fragmentation for a block caching layer (a layer which has
been sorely overdue in upstream Linux).

Hopefully upstream Linux will get this caching feature before its
utility is no longer needed.  The DM team welcomes review of dm-cache
from STEC and the greater community.  We're carrying on with dm-cache
review/fixes for hopeful upstream inclusion as soon as v3.9.

Mike

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-14 21:46 ` Mike Snitzer
@ 2013-01-15 13:19   ` Amit Kale
  2013-01-16 10:45     ` [dm-devel] " thornber
  0 siblings, 1 reply; 54+ messages in thread
From: Amit Kale @ 2013-01-15 13:19 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: LKML, dm-devel

Hi Mike,

> The github code you've referenced is in a strange place; it is
> obviously in a bit of flux.


Git URLs for accessing the repository are:

git clone https://github.com/stec-inc/EnhanceIO.git
git clone git://github.com/stec-inc/EnhanceIO.git

> 
> > Repository location -  https://github.com/stec-inc/EnhanceIO

> > ----------------
> > EnhanceIO driver is based on EnhanceIO SSD caching software product
> developed by STEC Inc. EnhanceIO was derived from Facebook's open
> source Flashcache project. EnhanceIO uses SSDs as cache devices for
> traditional rotating hard disk drives (referred to as source volumes
> throughout this document).
> 
> Earlier versions of EnhanceIO made use of Device Mapper (and your
> github code still has artifacts from that historic DM dependency, e.g.
> eio_map still returns DM_MAPIO_SUBMITTED).

This is correct. The first version of our product was based on DM.

> 
> As a DM target, EnhanceIO still implemented its own bio splitting
> rather than just use the DM core's bio splitting, now you've decided to
> move away from DM entirely.  Any reason why?

1. The EnhanceIO product was always designed as a "transparent" cache, meaning the cached device path is identical to the original device path. To make this fit into the device mapper scheme, we needed a bunch of init and udev scripts to replace the old device node with a new dm device node. The difficulty of that architecture was the principal reason for moving away from DM. Our transparent cache architecture has been a big winner with enterprise customers, enabling easy deployments.

2. EnhanceIO is now fully transparent, so applications can continue running while a cache is created or deleted. This is a significant improvement that helps enterprise users reduce downtime.

3. DM overhead is minimal compared to the CPU cycles spent in a cache block lookup. Since we weren't using DM's splitting anyway, that overhead was removed by moving away from DM.

4. We can now create a cache for an entire HDD containing partitions. All the partitions will be cached automatically. The user always has the option to cache partitions individually, if required.

5. We have designed our writeback architecture from scratch. Coalescing/bunching together of metadata writes and cleanup is much improved after redesigning the EnhanceIO-SSD interface. The DM interface would have been too restrictive for this. EnhanceIO uses set-level locking, which improves IO parallelism, particularly for writeback (a rough sketch of the idea follows below).
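
A rough sketch of the set-level locking idea (illustrative only, with invented names; the real driver sizes and allocates these structures differently):

    #include <linux/spinlock.h>
    #include <linux/types.h>

    #define NR_SETS 200000                  /* e.g. ~200k sets on a 400 GB SSD */

    struct cache_set {
            spinlock_t lock;
            /* ... per-set metadata: dirty count, LRU/FIFO state, ... */
    };

    static struct cache_set sets[NR_SETS];  /* a real driver would allocate this */

    static void io_to_set(sector_t sector)
    {
            /* 512-byte sectors -> 4 KB blocks -> owning set */
            struct cache_set *set = &sets[(sector >> 3) % NR_SETS];
            unsigned long flags;

            spin_lock_irqsave(&set->lock, flags);
            /* lookup / allocate / clean blocks within this set only,
             * so IOs hitting different sets never contend */
            spin_unlock_irqrestore(&set->lock, flags);
    }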

> 
> Joe Thornber published the new DM cache target on dm-devel a month ago:
> https://www.redhat.com/archives/dm-devel/2012-December/msg00029.html

Thanks for this link. Will review and get back to you.

> 
> ( I've also kept a functional git repo with that code, and additional
> fixes, in the 'dm-devel-cache' branch of my github repo:
> git://github.com/snitm/linux.git )
> 
> It would be unfortunate if Joe's publishing of the dm-cache codebase
> somehow motivated STEC's switch away from DM (despite EnhanceIO's DM
> roots given it was based on FB's flashcache which also uses DM).

Not at all! We had been working on a fully transparent cache architecture for a long time.

> DM really does offer a compelling foundation for stacking storage
> drivers in complementary ways (e.g. we envision dm-cache being stacked
> in conjunction with dm-thinp).  So a DM-based caching layer has been of
> real interest to Red Hat's DM team.

IMHO caching does not fit the DM architecture. DM is best suited for RAID, which requires similar or even access to all component devices. Caching requires skewed or uneven access to the SSD and HDD.

Regards.
-Amit


> 
> Given dm-cache's clean design and modular cache replacement policy
> interface we were hopeful that any existing limitations in dm-cache
> could be resolved through further work with the greater community (STEC
> included).  Instead, in addition to bcache, with EnhanceIO we have more
> fragmentation for a block caching layer (a layer which has been sorely
> overdue in upstream Linux).
> 
> Hopefully upstream Linux will get this caching feature before its
> utility is no longer needed.  The DM team welcomes review of dm-cache
> from STEC and the greater community.  We're carrying on with dm-cache
> review/fixes for hopeful upstream inclusion as soon as v3.9.
> 
> Mike

PROPRIETARY-CONFIDENTIAL INFORMATION INCLUDED



This electronic transmission, and any documents attached hereto, may contain confidential, proprietary and/or legally privileged information. The information is intended only for use by the recipient named above. If you received this electronic message in error, please notify the sender and delete the electronic message. Any disclosure, copying, distribution, or use of the contents of information received in error is strictly prohibited, and violators will be pursued legally.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-15 13:19   ` Amit Kale
@ 2013-01-16 10:45     ` thornber
  2013-01-16 12:15       ` thornber
                         ` (3 more replies)
  0 siblings, 4 replies; 54+ messages in thread
From: thornber @ 2013-01-16 10:45 UTC (permalink / raw)
  To: device-mapper development; +Cc: Mike Snitzer, LKML

Hi Amit,

I'll look through EnhanceIO this week.

There are several cache solutions out there; bcache, my dm-cache and
EnhanceIO seem to be the favourites.  I suspect none of them are
without drawbacks, so I'd like to see if we can maybe work together.

I think the first thing we need to do is make it easy to compare the
performance of these impls.

I'll create a branch in my github tree with all three caches in.  So
it's easy to build a kernel with them.  (Mike's already combined
dm-cache and bcache and done some preliminary testing).

We've got some small test scenarios in our test suite that we run [1].
They certainly flatter dm-cache since it was developed using these.
It would be really nice if you could describe and provide scripts for
your test scenarios.  I'll integrate them with the test suite, and
then I can have some confidence that I'm seeing EnhanceIO in its best
light.

The 'transparent' cache issue is a valid one, but to be honest a bit
orthogonal to caching.  Integrating dm more closely with the block layer
such that a dm stack can replace any device has been discussed for
years and I know Alasdair has done some preliminary design work on
this.  Perhaps we can use your requirement to bump up the priority on
this work.

On Tue, Jan 15, 2013 at 09:19:10PM +0800, Amit Kale wrote:
> 5. We have designed our writeback architecture from
> scratch. Coalescing/bunching together of metadata writes and cleanup
> is much improved after redesigning of the EnhanceIO-SSD
> interface. The DM interface would have been too restrictive for
> this. EnhanceIO uses set level locking, which improves parallelism
> of IO, particularly for writeback.

I sympathise with this; dm-cache would also like to see a higher level
view of the io, rather than being given the ios to remap one by one.
Let's start by working out how much of a benefit you've gained from
this and then go from there.

> PROPRIETARY-CONFIDENTIAL INFORMATION INCLUDED
> 
> This electronic transmission, and any documents attached hereto, may
> contain confidential, proprietary and/or legally privileged
> information. The information is intended only for use by the
> recipient named above. If you received this electronic message in
> error, please notify the sender and delete the electronic
> message. Any disclosure, copying, distribution, or use of the
> contents of information received in error is strictly prohibited,
> and violators will be pursued legally.

Please do not use this signature when sending to dm-devel.  If there's
proprietary information in the email you need to tell people up front
so they can choose not to read it.

- Joe


  [1] https://github.com/jthornber/thinp-test-suite/tree/master/tests/cache

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-16 10:45     ` [dm-devel] " thornber
@ 2013-01-16 12:15       ` thornber
  2013-01-16 16:58       ` thornber
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 54+ messages in thread
From: thornber @ 2013-01-16 12:15 UTC (permalink / raw)
  To: device-mapper development, Mike Snitzer, LKML

On Wed, Jan 16, 2013 at 10:45:47AM +0000, thornber@redhat.com wrote:
> Hi Amit,
> 
> I'll look through EnhanceIO this week.

I just ran the code through sparse and it throws up a lot of warnings.
Most of these are trivial: functions that should be declared static.  But
some are more concerning, like the 'different address spaces' ones.
If you're not sure how to fix the 'context imbalance' ones ping me and
I'll write a patch for you.

On another note, I see linux_os.h and os.h, which contain things like:

#define SPIN_LOCK_INIT                  spin_lock_init
#define SPIN_LOCK_IRQSAVE(l, f)         spin_lock_irqsave(l, f)
#define SPIN_UNLOCK_IRQRESTORE(l, f)    spin_unlock_irqrestore(l, f)
#define SPIN_LOCK_IRQSAVE_FLAGS(l)      do { long unsigned int f; spin_lock_irqsave(l, f); *(l##_flags) = f; }\
 while (0)
#define SPIN_UNLOCK_IRQRESTORE_FLAGS(l) do { long unsigned int f = *(l##_flags); spin_unlock_irqrestore(l, f);\
 } while (0)

You won't get the code upstream if it has an OS abstraction layer like
this.  Other people have tried.
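
For reference, the upstream expectation is simply to open-code the kernel primitives at the call sites, e.g. (the struct below is just a stand-in for whatever spinlock the driver keeps):

    #include <linux/spinlock.h>

    struct eio_cache_stub { spinlock_t lock; /* ... */ };

    static void example(struct eio_cache_stub *cache)
    {
            unsigned long flags;

            spin_lock_irqsave(&cache->lock, flags);
            /* critical section */
            spin_unlock_irqrestore(&cache->lock, flags);
    }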

- Joe


drivers/block/enhanceio/eio_ioctl.c:50:52: warning: incorrect type in argument 2 (different address spaces)
drivers/block/enhanceio/eio_ioctl.c:50:52:    expected void const [noderef] <asn:1>*from
drivers/block/enhanceio/eio_ioctl.c:50:52:    got struct cache_rec_short [usertype] *<noident>
drivers/block/enhanceio/eio_ioctl.c:70:52: warning: incorrect type in argument 2 (different address spaces)
drivers/block/enhanceio/eio_ioctl.c:70:52:    expected void const [noderef] <asn:1>*from
drivers/block/enhanceio/eio_ioctl.c:70:52:    got struct cache_rec_short [usertype] *<noident>
drivers/block/enhanceio/eio_ioctl.c:86:52: warning: incorrect type in argument 2 (different address spaces)
drivers/block/enhanceio/eio_ioctl.c:86:52:    expected void const [noderef] <asn:1>*from
drivers/block/enhanceio/eio_ioctl.c:86:52:    got struct cache_rec_short [usertype] *<noident>
drivers/block/enhanceio/eio_ioctl.c:99:43: warning: incorrect type in argument 1 (different address spaces)
drivers/block/enhanceio/eio_ioctl.c:99:43:    expected void [noderef] <asn:1>*dst
drivers/block/enhanceio/eio_ioctl.c:99:43:    got unsigned long long [usertype] *<noident>
drivers/block/enhanceio/eio_ioctl.c:118:52: warning: incorrect type in argument 2 (different address spaces)
drivers/block/enhanceio/eio_ioctl.c:118:52:    expected void const [noderef] <asn:1>*from
drivers/block/enhanceio/eio_ioctl.c:118:52:    got struct cache_rec_short [usertype] *<noident>
drivers/block/enhanceio/eio_ioctl.c:134:52: warning: incorrect type in argument 2 (different address spaces)
drivers/block/enhanceio/eio_ioctl.c:134:52:    expected void const [noderef] <asn:1>*from
drivers/block/enhanceio/eio_ioctl.c:134:52:    got struct cache_rec_short [usertype] *<noident>
  CC      drivers/block/enhanceio/eio_ioctl.o
drivers/block/enhanceio/eio_conf.c:47:16: warning: symbol 'cache_list_head' was not declared. Should it be static?
drivers/block/enhanceio/eio_conf.c:48:20: warning: symbol '_kcached_wq' was not declared. Should it be static?
drivers/block/enhanceio/eio_conf.c:50:19: warning: symbol '_job_cache' was not declared. Should it be static?
drivers/block/enhanceio/eio_conf.c:51:19: warning: symbol '_io_cache' was not declared. Should it be static?
drivers/block/enhanceio/eio_conf.c:52:11: warning: symbol '_job_pool' was not declared. Should it be static?
drivers/block/enhanceio/eio_conf.c:53:11: warning: symbol '_io_pool' was not declared. Should it be static?
drivers/block/enhanceio/eio_conf.c:55:10: warning: symbol 'nr_cache_jobs' was not declared. Should it be static?
drivers/block/enhanceio/eio_conf.c:59:1: warning: symbol 'ssd_rm_list' was not declared. Should it be static?
drivers/block/enhanceio/eio_conf.c:60:5: warning: symbol 'ssd_rm_list_not_empty' was not declared. Should it be static?
drivers/block/enhanceio/eio_conf.c:61:12: warning: symbol 'ssd_rm_list_lock' was not declared. Should it be static?
drivers/block/enhanceio/eio_conf.c:63:22: warning: symbol 'eio_control' was not declared. Should it be static?
drivers/block/enhanceio/eio_conf.c:65:5: warning: symbol 'eio_force_warm_boot' was not declared. Should it be static?
drivers/block/enhanceio/eio_conf.c:2101:1: warning: symbol 'eio_status_info' was not declared. Should it be static?
drivers/block/enhanceio/eio_conf.c:2446:1: warning: symbol 'eio_init' was not declared. Should it be static?
drivers/block/enhanceio/eio_conf.c:2494:1: warning: symbol 'eio_exit' was not declared. Should it be static?
  CC      drivers/block/enhanceio/eio_conf.o
drivers/block/enhanceio/eio_main.c:3157:53: warning: Using plain integer as NULL pointer
drivers/block/enhanceio/eio_main.c:1095:34: warning: Using plain integer as NULL pointer
drivers/block/enhanceio/eio_main.c:1392:33: warning: Using plain integer as NULL pointer
drivers/block/enhanceio/eio_main.c:141:1: warning: symbol 'eio_io_async_pages' was not declared. Should it be static?
drivers/block/enhanceio/eio_main.c:171:1: warning: symbol 'eio_io_async_bvec' was not declared. Should it be static?
drivers/block/enhanceio/eio_main.c:275:1: warning: symbol 'eio_disk_io_callback' was not declared. Should it be static?
drivers/block/enhanceio/eio_main.c:359:1: warning: symbol 'eio_io_callback' was not declared. Should it be static?
drivers/block/enhanceio/eio_main.c:3103:16: warning: symbol 'setup_bio_vecs' was not declared. Should it be static?
  CHECK   drivers/block/enhanceio/eio_mem.c
drivers/block/enhanceio/eio_main.c:1399:9: warning: context imbalance in 'eio_enq_mdupdate' - different lock contexts for basic block
drivers/block/enhanceio/eio_policy.c:25:1: warning: symbol 'eio_policy_list' was not declared. Should it be static?
  CC      drivers/block/enhanceio/eio_policy.o
  CHECK   drivers/block/enhanceio/eio_setlru.c
drivers/block/enhanceio/eio_procfs.c:56:1: warning: symbol 'eio_zerostats_sysctl' was not declared. Should it be static?
drivers/block/enhanceio/eio_procfs.c:120:1: warning: symbol 'eio_mem_limit_pct_sysctl' was not declared. Should it be static?
drivers/block/enhanceio/eio_procfs.c:164:1: warning: symbol 'eio_error_inject_sysctl' was not declared. Should it be static?
drivers/block/enhanceio/eio_procfs.c:186:1: warning: symbol 'eio_clean_sysctl' was not declared. Should it be static?
drivers/block/enhanceio/eio_procfs.c:263:1: warning: symbol 'eio_dirty_high_threshold_sysctl' was not declared. Should it be static?
drivers/block/enhanceio/eio_procfs.c:338:1: warning: symbol 'eio_dirty_low_threshold_sysctl' was not declared. Should it be static?
drivers/block/enhanceio/eio_procfs.c:419:1: warning: symbol 'eio_dirty_set_high_threshold_sysctl' was not declared. Should it be static?
drivers/block/enhanceio/eio_procfs.c:497:1: warning: symbol 'eio_dirty_set_low_threshold_sysctl' was not declared. Should it be static?
drivers/block/enhanceio/eio_procfs.c:582:1: warning: symbol 'eio_autoclean_threshold_sysctl' was not declared. Should it be static?
drivers/block/enhanceio/eio_procfs.c:651:1: warning: symbol 'eio_time_based_clean_interval_sysctl' was not declared. Should it be static?
drivers/block/enhanceio/eio_procfs.c:734:1: warning: symbol 'eio_control_sysctl' was not declared. Should it be static?
drivers/block/enhanceio/eio_procfs.c:1276:11: warning: symbol 'invalidate_spin_lock_flags' was not declared. Should it be static?
  CC      drivers/block/enhanceio/eio_procfs.o
  CC      drivers/block/enhanceio/eio_setlru.o
  CHECK   drivers/block/enhanceio/eio_subr.c
drivers/block/enhanceio/eio_subr.c:34:11: warning: symbol '_job_lock_flags' was not declared. Should it be static?
drivers/block/enhanceio/eio_subr.c:40:1: warning: symbol '_io_jobs' was not declared. Should it be static?
drivers/block/enhanceio/eio_subr.c:41:1: warning: symbol '_disk_read_jobs' was not declared. Should it be static?
drivers/block/enhanceio/eio_subr.c:74:20: warning: symbol 'eio_pop' was not declared. Should it be static?
drivers/block/enhanceio/eio_subr.c:92:1: warning: symbol 'eio_push' was not declared. Should it be static?
drivers/block/enhanceio/eio_subr.c:110:1: warning: symbol 'eio_push_io' was not declared. Should it be static?
drivers/block/enhanceio/eio_subr.c:216:1: warning: symbol 'eio_sync_endio' was not declared. Should it be static?
  CC      drivers/block/enhanceio/eio_subr.o
  CHECK   drivers/block/enhanceio/eio_ttc.c
  CHECK   drivers/block/enhanceio/eio_fifo.c
drivers/block/enhanceio/eio_ttc.c:89:24: warning: non-ANSI function declaration of function 'eio_create_misc_device'
drivers/block/enhanceio/eio_ttc.c:95:24: warning: non-ANSI function declaration of function 'eio_delete_misc_device'
drivers/block/enhanceio/eio_ttc.c:34:25: warning: symbol 'eio_ttc_lock' was not declared. Should it be static?
drivers/block/enhanceio/eio_ttc.c:37:5: warning: symbol 'eio_reboot_notified' was not declared. Should it be static?
drivers/block/enhanceio/eio_ttc.c:520:39: warning: incorrect type in argument 2 (different address spaces)
drivers/block/enhanceio/eio_ttc.c:520:39:    expected void const [noderef] <asn:1>*from
drivers/block/enhanceio/eio_ttc.c:520:39:    got struct cache_list [usertype] *<noident>
drivers/block/enhanceio/eio_ttc.c:550:27: warning: incorrect type in argument 1 (different address spaces)
drivers/block/enhanceio/eio_ttc.c:550:27:    expected void [noderef] <asn:1>*dst
drivers/block/enhanceio/eio_ttc.c:550:27:    got char *<noident>
drivers/block/enhanceio/eio_ttc.c:556:27: warning: incorrect type in argument 1 (different address spaces)
drivers/block/enhanceio/eio_ttc.c:556:27:    expected void [noderef] <asn:1>*dst
drivers/block/enhanceio/eio_ttc.c:556:27:    got struct cache_list [usertype] *<noident>
drivers/block/enhanceio/eio_ttc.c:642:6: warning: symbol 'eio_dec_count' was not declared. Should it be static?
drivers/block/enhanceio/eio_ttc.c:663:6: warning: symbol 'eio_endio' was not declared. Should it be static?
drivers/block/enhanceio/eio_ttc.c:675:5: warning: symbol 'eio_dispatch_io_pages' was not declared. Should it be static?
drivers/block/enhanceio/eio_ttc.c:737:5: warning: symbol 'eio_dispatch_io' was not declared. Should it be static?
drivers/block/enhanceio/eio_ttc.c:796:5: warning: symbol 'eio_async_io' was not declared. Should it be static?
drivers/block/enhanceio/eio_ttc.c:843:5: warning: symbol 'eio_sync_io' was not declared. Should it be static?
  CC      drivers/block/enhanceio/eio_ttc.o
drivers/block/enhanceio/eio_fifo.c:55:26: warning: symbol 'eio_fifo_ops' was not declared. Should it be static?
  CC      drivers/block/enhanceio/eio_fifo.o
  CHECK   drivers/block/enhanceio/eio_lru.c
drivers/block/enhanceio/eio_lru.c:59:16: warning: symbol 'eio_lru' was not declared. Should it be static?
drivers/block/enhanceio/eio_lru.c:67:26: warning: symbol 'eio_lru_ops' was not declared. Should it be static?
  CC      drivers/block/enhanceio/eio_lru.o
  LD      drivers/block/enhanceio/enhanceio.o

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-16 10:45     ` [dm-devel] " thornber
  2013-01-16 12:15       ` thornber
@ 2013-01-16 16:58       ` thornber
  2013-01-17  9:52       ` Amit Kale
  2013-01-18 14:43       ` thornber
  3 siblings, 0 replies; 54+ messages in thread
From: thornber @ 2013-01-16 16:58 UTC (permalink / raw)
  To: device-mapper development, Mike Snitzer, LKML

On Wed, Jan 16, 2013 at 10:45:47AM +0000, thornber@redhat.com wrote:
> I think the first thing we need to do is make it easy to compare the
> performance of these impls.

I've added EnhanceIO support to my cache tests [1].

I've run it through one of the benchmarks and got some curious results.

The benchmark runs with a 2G origin and 256M of SSD and does the
following.

	a) format device
	b) clone the linux git tree into it
	c) checkout 5 different tags

So it's only a microbenchmark, but probably a scenario of interest to
developers like us.  It uses a lot of cpu and has a working set size
of around 1G.

Running on SSD (no cache involved, we're just establishing a
baseline) takes ~140 seconds.

Running on spindle (again, no cache involved) takes 261 seconds.

Running on dm-cache with mq policy takes 241 seconds (I told you it
was a tough scenario).

Running on EnhanceIO in wb mode (I presume this is the fastest?) takes
361 seconds.  Considerably slower than the spindle alone.

In addition I often run tests with an SSD cache on an SSD origin.
This gives me a good idea of the overhead of the target.  In this
configuration dm-cache takes 161 seconds.  20 seconds of overhead
which I consider a lot and am working to cut down.  EnhanceIO in this
configuration takes 309 seconds, or 169 seconds of overhead.

Obviously different caches are going to perform differently under
different workloads.  But I think people will be upset if adding
expensive SSD to their spindle device slows things down.

Can you describe scenarios where eio performs well please?


- Joe


  [1] https://github.com/jthornber/thinp-test-suite/commit/730448e1f068d23a2ca54aad1fed76a4e8bd6dbb


^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-16 10:45     ` [dm-devel] " thornber
  2013-01-16 12:15       ` thornber
  2013-01-16 16:58       ` thornber
@ 2013-01-17  9:52       ` Amit Kale
  2013-01-17 11:39         ` Kent Overstreet
  2013-01-17 13:26           ` thornber-H+wXaHxf7aLQT0dZR+AlfA
  2013-01-18 14:43       ` thornber
  3 siblings, 2 replies; 54+ messages in thread
From: Amit Kale @ 2013-01-17  9:52 UTC (permalink / raw)
  To: thornber, device-mapper development, kent.overstreet
  Cc: Mike Snitzer, LKML, linux-bcache

Hi Joe, Kent,

[Adding Kent as well since bcache is mentioned below as one of the contenders for being integrated into mainline kernel.]

My understanding is that these three caching solutions are all built from the following principal blocks (sketched as an interface after this list).
1. A cache block lookup - This refers to finding out whether a block was cached or not and the location on SSD, if it was.
2. Block replacement policy - This refers to the algorithm for replacing a block when a new free block can't be found.
3. IO handling - This is about issuing IO requests to SSD and HDD.
4. Dirty data clean-up algorithm (for write-back only) - The dirty data clean-up algorithm decides when to write a dirty block in an SSD to its original location on HDD and executes the copy. 
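
Purely to illustrate those blocks as an interface (invented names, not taken from any of the three implementations):

    struct cache_block;     /* opaque: one cached block */

    struct caching_ops {
            /* 1. lookup: is this source block cached, and where on the SSD? */
            struct cache_block *(*lookup)(void *ctx, unsigned long long blkno);
            /* 2. replacement policy: pick a victim when no free block exists */
            struct cache_block *(*pick_victim)(void *ctx);
            /* 3. IO handling: issue the actual requests to SSD and/or HDD */
            void (*submit_io)(void *ctx, struct cache_block *cb, int rw);
            /* 4. write-back only: choose dirty blocks and copy them to HDD */
            void (*clean_dirty)(void *ctx);
    };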

When comparing the three solutions we need to consider these aspects.
1. User interface - This consists of commands used by users for creating, deleting, editing properties and recovering from error conditions.
2. Software interface - Where it interfaces to Linux kernel and applications.
3. Availability - What's the downtime when adding, deleting caches, making changes to cache configuration, conversion between cache modes, recovering after a crash, recovering from an error condition.
4. Security - Security holes, if any.
5. Portability - Which HDDs, SSDs, partitions, other block devices it works with.
6. Persistence of cache configuration - Once created does the cache configuration stay persistent across reboots. How are changes in device sequence or numbering handled.
7. Persistence of cached data - Does cached data remain across reboots/crashes/intermittent failures. Is the "sticky"ness of data configurable.
8. SSD life - Projected SSD life. Does the caching solution cause too much of write amplification leading to an early SSD failure.
9. Performance - Throughput is generally most important. Latency is also one more performance comparison point. Performance under different load classes can be measured.
10. ACID properties - Atomicity, Consistency, Isolation, Durability. Does the caching solution have these typical transactional database or filesystem properties? This includes avoiding the torn-page problem in crash and failure scenarios.
11. Error conditions - Handling power failures, intermittent and permanent device failures.
12. Configuration parameters for tuning according to applications.

We'll soon document EnhanceIO's behavior in the context of these aspects. We would appreciate it if dm-cache and bcache were documented similarly.

When comparing performance, there are three levels at which it can be measured:
1. Architectural elements
1.1. Throughput for 100% cache hit case (in absence of dirty data clean-up)
1.2. Throughput for 0% cache hit case (in absence of dirty data clean-up)
1.3. Dirty data clean-up rate (in absence of IO)
2. Performance of architectural elements combined
2.1. Varying mix of read/write, sustained performance.
3. Application level testing - The more real-life like benchmark we work with, the better it is.

Thanks.
-Amit

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of thornber@redhat.com
> Sent: Wednesday, January 16, 2013 4:16 PM
> To: device-mapper development
> Cc: Mike Snitzer; LKML
> Subject: Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching
> software for Linux kernel
> 
> Hi Amit,
> 
> I'll look through EnhanceIO this week.
> 
> There are several cache solutions out there; bcache, my dm-cache and
> EnhanceIO seeming to be the favourites.  In suspect none of them are
> without drawbacks, so I'd like to see if we can maybe work together.
> 
> I think the first thing we need to do is make it easy to compare the
> performance of these impls.
> 
> I'll create a branch in my github tree with all three caches in.  So
> it's easy to build a kernel with them.  (Mike's already combined dm-
> cache and bcache and done some preliminary testing).
> 
> We've got some small test scenarios in our test suite that we run [1].
> They certainly flatter dm-cache since it was developed using these.
> It would be really nice if you could describe and provide scripts for
> your test scenarios.  I'll integrate them with the test suite, and then
> I can have some confidence that I'm seeing EnhanceIO in its best light.
> 
> The 'transparent' cache issue is a valid one, but to be honest a bit
> orthogonal to cache.  Integrating dm more closely with the block layer
> such that a dm stack can replace any device has been discussed for
> years and I know Alasdair has done some preliminary design work on
> this.  Perhaps we can use your requirement to bump up the priority on
> this work.
> 
> On Tue, Jan 15, 2013 at 09:19:10PM +0800, Amit Kale wrote:
> > 5. We have designed our writeback architecture from scratch.
> > Coalescing/bunching together of metadata writes and cleanup is much
> > improved after redesigning of the EnhanceIO-SSD interface. The DM
> > interface would have been too restrictive for this. EnhanceIO uses
> set
> > level locking, which improves parallelism of IO, particularly for
> > writeback.
> 
> I sympathise with this; dm-cache would also like to see a higher level
> view of the io, rather than being given the ios to remap one by one.
> Let's start by working out how much of a benefit you've gained from
> this and then go from there.
> 
> > PROPRIETARY-CONFIDENTIAL INFORMATION INCLUDED
> >
> > This electronic transmission, and any documents attached hereto, may
> > contain confidential, proprietary and/or legally privileged
> > information. The information is intended only for use by the
> recipient
> > named above. If you received this electronic message in error, please
> > notify the sender and delete the electronic message. Any disclosure,
> > copying, distribution, or use of the contents of information received
> > in error is strictly prohibited, and violators will be pursued
> > legally.
> 
> Please do not use this signature when sending to dm-devel.  If there's
> proprietary information in the email you need to tell people up front
> so they can choose not to read it.
> 
> - Joe
> 
> 
>   [1] https://github.com/jthornber/thinp-test-
> suite/tree/master/tests/cache
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> in the body of a message to majordomo@vger.kernel.org More majordomo
> info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

PROPRIETARY-CONFIDENTIAL INFORMATION INCLUDED



This electronic transmission, and any documents attached hereto, may contain confidential, proprietary and/or legally privileged information. The information is intended only for use by the recipient named above. If you received this electronic message in error, please notify the sender and delete the electronic message. Any disclosure, copying, distribution, or use of the contents of information received in error is strictly prohibited, and violators will be pursued legally.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-17  9:52       ` Amit Kale
@ 2013-01-17 11:39         ` Kent Overstreet
  2013-01-17 17:17             ` Amit Kale
  2013-01-24 23:45             ` Kent Overstreet
  2013-01-17 13:26           ` thornber-H+wXaHxf7aLQT0dZR+AlfA
  1 sibling, 2 replies; 54+ messages in thread
From: Kent Overstreet @ 2013-01-17 11:39 UTC (permalink / raw)
  To: Amit Kale
  Cc: thornber, device-mapper development, kent.overstreet,
	Mike Snitzer, LKML, linux-bcache

Suppose I could fill out the bcache version...

On Thu, Jan 17, 2013 at 05:52:00PM +0800, Amit Kale wrote:
> Hi Joe, Kent,
> 
> [Adding Kent as well since bcache is mentioned below as one of the contenders for being integrated into mainline kernel.]
> 
> My understanding is that these three caching solutions all have three principle blocks.
> 1. A cache block lookup - This refers to finding out whether a block was cached or not and the location on SSD, if it was.
> 2. Block replacement policy - This refers to the algorithm for replacing a block when a new free block can't be found.
> 3. IO handling - This is about issuing IO requests to SSD and HDD.
> 4. Dirty data clean-up algorithm (for write-back only) - The dirty data clean-up algorithm decides when to write a dirty block in an SSD to its original location on HDD and executes the copy. 
> 
> When comparing the three solutions we need to consider these aspects.
> 1. User interface - This consists of commands used by users for creating, deleting, editing properties and recovering from error conditions.
> 2. Software interface - Where it interfaces to Linux kernel and applications.

Both done with sysfs, at least for now.

> 3. Availability - What's the downtime when adding, deleting caches, making changes to cache configuration, conversion between cache modes, recovering after a crash, recovering from an error condition.

All of that is done at runtime, without any interruption. bcache doesn't
distinguish between clean and unclean shutdown, which is nice because it
means the recovery code gets tested. Registering a cache device takes on
the order of half a second, for a large (half terabyte) cache.

> 4. Security - Security holes, if any.

Hope there aren't any!

> 5. Portability - Which HDDs, SSDs, partitions, other block devices it works with.

Any block device.

> 6. Persistence of cache configuration - Once created does the cache configuration stay persistent across reboots. How are changes in device sequence or numbering handled.

Persistent. Device nodes are not stable across reboots, same as say scsi
devices if they get probed in a different order. It does persist a label
in the backing device superblock which can be used to implement stable
device nodes.

> 7. Persistence of cached data - Does cached data remain across reboots/crashes/intermittent failures. Is the "sticky"ness of data configurable.

Persists across reboots. Can't be switched off, though it could be if
there was any demand.

> 8. SSD life - Projected SSD life. Does the caching solution cause too much of write amplification leading to an early SSD failure.

With LRU, there's only so much you can do to work around the SSD's FTL,
though bcache does try; allocation is done in terms of buckets, which
are on the order of a megabyte (configured when you format the cache
device). Buckets are written to sequentially, then rewritten later all
at once (and it'll issue a discard before rewriting a bucket if you flip
that on; it's not on by default because TRIM = slow).

Bcache also implements fifo cache replacement, and with that write
amplification should never be an issue.
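
A toy sketch of the bucket allocation described above (invented names, not bcache's actual allocator):

    #define BUCKET_BYTES    (1ULL << 20)    /* ~1 MB, chosen at format time */

    struct bucket {
            unsigned long long start;       /* byte offset of the bucket on the SSD */
            unsigned long long fill;        /* bytes already written into it        */
    };

    struct bucket *open_new_bucket(void);   /* hypothetical helper; may issue a
                                               discard for the reused bucket */

    /* Always append into the currently open bucket; when it is full,
     * switch to a fresh one.  The SSD therefore sees large sequential
     * writes and whole-bucket overwrites, which its FTL handles well.
     */
    static unsigned long long alloc_bytes(struct bucket **bp,
                                          unsigned long long bytes)
    {
            if ((*bp)->fill + bytes > BUCKET_BYTES)
                    *bp = open_new_bucket();
            (*bp)->fill += bytes;
            return (*bp)->start + (*bp)->fill - bytes;
    }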

> 9. Performance - Throughput is generally most important. Latency is also one more performance comparison point. Performance under different load classes can be measured.
> 10. ACID properties - Atomicity, Concurrency, Idempotent, Durability. Does the caching solution have these typical transactional database or filesystem properties. This includes avoiding torn-page problem amongst crash and failure scenarios.

Yes.

> 11. Error conditions - Handling power failures, intermittent and permanent device failures.

Power failures and device failures yes, intermittent failures are not
explicitly handled.

> 12. Configuration parameters for tuning according to applications.

Lots. The most important one is probably sequential bypass - you don't
typically want to cache your big sequential IO, because rotating disks
do fine at that. So bcache detects sequential IO and bypasses it with a
configurable threshold.

There's also stuff for bypassing more data if the SSD is overloaded - if
you're caching many disks with a single SSD, you don't want the SSD to
be the bottleneck. So it tracks latency to the SSD and cranks down the
sequential bypass threshold if it gets too high.
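
A toy illustration of that bypass heuristic (not bcache's actual code; names invented):

    struct seq_state {
            unsigned long long next_sector; /* where the previous IO ended */
            unsigned long long run_bytes;   /* length of the current run   */
    };

    static int should_bypass_cache(struct seq_state *s,
                                   unsigned long long sector,
                                   unsigned long long bytes,
                                   unsigned long long cutoff_bytes)
    {
            if (sector == s->next_sector)
                    s->run_bytes += bytes;  /* continues a sequential run */
            else
                    s->run_bytes = bytes;   /* a new run starts here      */
            s->next_sector = sector + bytes / 512;

            /* long sequential runs go straight to the spindle */
            return s->run_bytes > cutoff_bytes;
    }

    /* cutoff_bytes would in turn be lowered whenever the measured SSD
     * latency climbs above a target, so an overloaded SSD sheds load. */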

> We'll soon document EnhanceIO behavior in context of these aspects. We'll appreciate if dm-cache and bcache is also documented.
> 
> When comparing performance there are three levels at which it can be measured
> 1. Architectural elements
> 1.1. Throughput for 100% cache hit case (in absence of dirty data clean-up)

North of a million iops.

> 1.2. Throughput for 0% cache hit case (in absence of dirty data clean-up)

Also relevant is whether you're adding the data to the cache. I'm sure
bcache is slightly slower than the raw backing device here, but if it's
noticeable it's a bug (I haven't benchmarked that specifically in ages).

> 1.3. Dirty data clean-up rate (in absence of IO)

Background writeback is done by scanning the btree in the background for
dirty data, and then writing it out in lba order - so the writes are as
sequential as they're going to get. It's fast.

> 2. Performance of architectural elements combined
> 2.1. Varying mix of read/write, sustained performance.

Random write performance is definitely important, as there you've got to
keep an index up to date on stable storage (if you want to handle
unclean shutdown, anyway). Making that fast is non-trivial. Bcache is
about as efficient as you're going to get w.r.t. metadata writes,
though.

> 3. Application level testing - The more real-life like benchmark we work with, the better it is.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-17 13:26           ` thornber-H+wXaHxf7aLQT0dZR+AlfA
  0 siblings, 0 replies; 54+ messages in thread
From: thornber @ 2013-01-17 13:26 UTC (permalink / raw)
  To: Amit Kale
  Cc: device-mapper development, kent.overstreet, Mike Snitzer, LKML,
	linux-bcache

On Thu, Jan 17, 2013 at 05:52:00PM +0800, Amit Kale wrote:
> Hi Joe, Kent,
> 
> [Adding Kent as well since bcache is mentioned below as one of the contenders for being integrated into mainline kernel.]
> 
> My understanding is that these three caching solutions all have three principle blocks.

Let me try and explain how dm-cache works.

> 1. A cache block lookup - This refers to finding out whether a block was cached or not and the location on SSD, if it was.

Of course we have this, but it's part of the policy plug-in.  I've
done this because the policy nearly always needs to do some
bookkeeping (eg, update a hit count when accessed).

> 2. Block replacement policy - This refers to the algorithm for replacing a block when a new free block can't be found.

I think there's more than just this.  These are the tasks that I hand
over to the policy:

  a) _Which_ blocks should be promoted to the cache.  This seems to be
     the key decision in terms of performance.  Blindly trying to
     promote every io or even just every write will lead to some very
     bad performance in certain situations.

     The mq policy uses a multiqueue (effectively a partially sorted
     lru list) to keep track of candidate block hit counts.  When
     candidates get enough hits they're promoted.  The promotion
     threshold is periodically recalculated by looking at the hit
     counts for the blocks already in the cache (a toy sketch of this
     idea follows after the end of this list).

     The hit counts should degrade over time (for some definition of
     time; eg. io volume).  I've experimented with this, but not yet
     come up with a satisfactory method.

     I read through EnhanceIO yesterday, and think this is where
     you're lacking.

  b) When should a block be promoted.  If you're swamped with io, then
     adding copy io is probably not a good idea.  Current dm-cache
     just has a configurable threshold for the promotion/demotion io
     volume.  If you or Kent have some ideas for how to approximate
     the bandwidth of the devices I'd really like to hear about it.

  c) Which blocks should be demoted?

     This is the bit that people commonly think of when they say
     'caching algorithm'.  Examples are lru, arc, etc.  Such
     descriptions are fine when describing a cache where elements
     _have_ to be promoted before they can be accessed, for example a
     cpu memory cache.  But we should be aware that 'lru' for example
     really doesn't tell us much in the context of our policies.

     The mq policy uses a blend of lru and lfu for eviction, it seems
     to work well.
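
As a toy sketch of that promotion idea (not the real mq policy code, just its shape):

    struct candidate {
            unsigned int hit_count;         /* bumped on every miss for this block */
    };

    static int should_promote(struct candidate *c, unsigned int promote_threshold)
    {
            c->hit_count++;
            return c->hit_count >= promote_threshold;
    }

    /* promote_threshold is recalculated periodically from the hit counts
     * of blocks already resident in the cache, and the per-candidate
     * counts should decay over time (e.g. with io volume). */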

A couple of other things I should mention; dm-cache uses a large block
size compared to eio.  eg, 64k - 1m.  This is a mixed blessing;

 - our copy io is more efficient (we don't have to worry about
   batching migrations together so much.  Something eio is careful to
   do).

 - we have fewer blocks to hold stats about, so can keep more info per
   block in the same amount of memory.

 - We trigger more copying.  For example, if an incoming write triggers
   a promotion from the origin to the cache and the io covers a whole
   block, we can avoid any copy from the origin to the cache.  With a
   bigger block size this optimisation happens less frequently.

 - We waste SSD space.  eg, a 4k hotspot could trigger a whole block
   to be moved to the cache.


We do not keep the dirty state of cache blocks up to date on the
metadata device.  Instead we have a 'mounted' flag that's set in the
metadata when opened.  When a clean shutdown occurs (eg, dmsetup
suspend my-cache) the dirty bits are written out and the mounted flag
cleared.  On a crash the mounted flag will still be set on reopen and
all dirty flags degrade to 'dirty'.  Correct me if I'm wrong, but I
think eio is holding io completion until the dirty bits have been
committed to disk?
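
A schematic version of that scheme (not the real metadata code; names invented):

    #include <linux/string.h>

    struct cache_metadata {
            int             mounted;        /* set on open, cleared on clean shutdown */
            unsigned char   *dirty_bitmap;  /* only trusted after a clean shutdown    */
    };

    static void cache_open(struct cache_metadata *md, unsigned long nr_blocks)
    {
            if (md->mounted)
                    /* crash: the on-disk dirty bits are stale, so degrade every
                     * cache block to dirty (nr_blocks assumed to be a multiple
                     * of 8 for brevity) */
                    memset(md->dirty_bitmap, 0xff, nr_blocks / 8);
            md->mounted = 1;
            /* ... write md back to the metadata device ... */
    }

    static void cache_clean_shutdown(struct cache_metadata *md)
    {
            /* flush the genuine dirty bits first, then clear the flag */
            md->mounted = 0;
            /* ... write md back to the metadata device ... */
    }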

I really view dm-cache as a slow-moving hotspot optimiser.  Whereas I
think eio and bcache are much more of a hierarchical storage approach,
where writes go through the cache if possible?

> 3. IO handling - This is about issuing IO requests to SSD and HDD.

  I get most of this for free via dm and kcopyd.  I'm really keen to
  see how bcache does; it's more invasive of the block layer, so I'm
  expecting it to show far better performance than dm-cache.

> 4. Dirty data clean-up algorithm (for write-back only) - The dirty
>   data clean-up algorithm decides when to write a dirty block in an
>   SSD to its original location on HDD and executes the copy.

  Yep.

> When comparing the three solutions we need to consider these aspects.

> 1. User interface - This consists of commands used by users for
>   creating, deleting, editing properties and recovering from error
>   conditions.

  I was impressed how easy eio was to use yesterday when I was playing
  with it.  Well done.

  Driving dm-cache through dmsetup isn't much more of a hassle
  though.  Though we've decided to pass policy-specific params on the
  target line, and tweak via a dm message (again simple via dmsetup).
  I don't think this is as simple as exposing them through something
  like sysfs, but it is more in keeping with the device-mapper way.

> 2. Software interface - Where it interfaces to Linux kernel and applications.

  See above.

> 3. Availability - What's the downtime when adding, deleting caches,
>   making changes to cache configuration, conversion between cache
>   modes, recovering after a crash, recovering from an error condition.

  Normal dm suspend, alter table, resume cycle.  The LVM tools do this
  all the time.

> 4. Security - Security holes, if any.

  Well I saw the comment in your code describing the security flaw you
  think you've got.  I hope we don't have any, I'd like to understand
  your case more.

> 5. Portability - Which HDDs, SSDs, partitions, other block devices it works with.

  I think we all work with any block device.  But eio and bcache can
  overlay any device node, not just a dm one.  As mentioned in earlier
  email I really think this is a dm issue, not specific to dm-cache.

> 6. Persistence of cache configuration - Once created does the cache
>   configuration stay persistent across reboots. How are changes in
>   device sequence or numbering handled.

  We've gone for no persistence of policy parameters.  Instead
  everything is handed into the kernel when the target is setup.  This
  decision was made by the LVM team who wanted to store this
  information themselves (we certainly shouldn't store it in two
  places at once).  I don't feel strongly either way, and could
  persist the policy params v. easily (eg, 1 days work).

  One thing I do provide is a 'hint' array for the policy to use and
  persist.  The policy specifies how much data it would like to store
  per cache block, and then writes it on clean shutdown (hence 'hint',
  it has to cope without this, possibly with temporarily degraded
  performance).  The mq policy uses the hints to store hit counts.

> 7. Persistence of cached data - Does cached data remain across
>   reboots/crashes/intermittent failures. Is the "sticky"ness of data
>   configurable.

  Surely this is a given?  A cache would be trivial to write if it
  didn't need to be crash proof.

> 8. SSD life - Projected SSD life. Does the caching solution cause
>   too much of write amplification leading to an early SSD failure.

  No, I decided years ago that life was too short to start optimising
  for specific block devices.  By the time you get it right the
  hardware characteristics will have moved on.  Doesn't the firmware
  on SSDs try and even out io wear these days?

  That said I think we evenly use the SSD.  Except for the superblock
  on the metadata device.

> 9. Performance - Throughput is generally most important. Latency is
>   also one more performance comparison point. Performance under
>   different load classes can be measured.

  I think latency is more important than throughput.  Spindles are
  pretty good at throughput.  In fact the mq policy tries to spot when
  we're doing large linear ios and stops hit counting; best leave this
  stuff on the spindle.

> 10. ACID properties - Atomicity, Concurrency, Idempotent,
>   Durability. Does the caching solution have these typical
>   transactional database or filesystem properties. This includes
>   avoiding torn-page problem amongst crash and failure scenarios.

  Could you expand on the torn-page issue please?

> 11. Error conditions - Handling power failures, intermittent and permanent device failures.

  I think the area where dm-cache is currently lacking is intermittent
  failures.  For example if a cache read fails we just pass that error
  up, whereas eio sees if the block is clean and if so tries to read
  off the origin.  I'm not sure which behaviour is correct; I like to
  know about disk failure early.

> 12. Configuration parameters for tuning according to applications.

  Discussed above.

> We'll soon document EnhanceIO behavior in context of these
>   aspects. We'll appreciate if dm-cache and bcache is also documented.

  I hope the above helps.  Please ask away if you're unsure about
  something.

> When comparing performance there are three levels at which it can be measured

Developing these caches is tedious.  Test runs take time, and really
slow the dev cycle down.  So I suspect we've all been using
microbenchmarks that run in a few minutes.

Let's get our pool of microbenchmarks together, then work on some
application level ones (we're happy to put some time into developing
these).

- Joe

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-17 17:17             ` Amit Kale
  0 siblings, 0 replies; 54+ messages in thread
From: Amit Kale @ 2013-01-17 17:17 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: thornber, device-mapper development, kent.overstreet,
	Mike Snitzer, LKML, linux-bcache

Thanks for the prompt reply.
 
> Suppose I could fill out the bcache version...
> 
> On Thu, Jan 17, 2013 at 05:52:00PM +0800, Amit Kale wrote:
> > Hi Joe, Kent,
> >
> > [Adding Kent as well since bcache is mentioned below as one of the
> > contenders for being integrated into mainline kernel.]
> >
> > My understanding is that these three caching solutions all have three
> principle blocks.
> > 1. A cache block lookup - This refers to finding out whether a block
> was cached or not and the location on SSD, if it was.
> > 2. Block replacement policy - This refers to the algorithm for
> replacing a block when a new free block can't be found.
> > 3. IO handling - This is about issuing IO requests to SSD and HDD.
> > 4. Dirty data clean-up algorithm (for write-back only) - The dirty
> data clean-up algorithm decides when to write a dirty block in an SSD
> to its original location on HDD and executes the copy.
> >
> > When comparing the three solutions we need to consider these aspects.
> > 1. User interface - This consists of commands used by users for
> creating, deleting, editing properties and recovering from error
> conditions.
> > 2. Software interface - Where it interfaces to Linux kernel and
> applications.
> 
> Both done with sysfs, at least for now.

sysfs is the user interface. Bcache creates a new block device, so it interfaces to the Linux kernel at the block device layer. The HDD and SSD interfaces would presumably be via submit_bio (please correct me if this is wrong).

> 
> > 3. Availability - What's the downtime when adding, deleting caches,
> making changes to cache configuration, conversion between cache modes,
> recovering after a crash, recovering from an error condition.
> 
> All of that is done at runtime, without any interruption. bcache
> doesn't distinguish between clean and unclean shutdown, which is nice
> because it means the recovery code gets tested. Registering a cache
> device takes on the order of half a second, for a large (half terabyte)
> cache.

Since a new device is created, you need to bring down applications the first time a cache is created; from then on it stays online. Similarly, applications need to be brought down when deleting a cache. Fstab changes etc. also need to be made. My guess is all this requires some effort and understanding from a system administrator. Does fstab work without any manual editing if it contains labels instead of device paths?

> 
> > 4. Security - Security holes, if any.
> 
> Hope there aren't any!

All three caches can be operated only as root, so as long as there are no bugs there is no need to worry about security loopholes.

> 
> > 5. Portability - Which HDDs, SSDs, partitions, other block devices it
> works with.
> 
> Any block device.
> 
> > 6. Persistence of cache configuration - Once created does the cache
> configuration stay persistent across reboots. How are changes in device
> sequence or numbering handled.
> 
> Persistent. Device nodes are not stable across reboots, same as say
> scsi devices if they get probed in a different order. It does persist a
> label in the backing device superblock which can be used to implement
> stable device nodes.

Can this be embedded in a udev script so that the configuration becomes persistent regardless of probing order? What happens if either the SSD or the HDD is absent when a system comes up? Does it work with iSCSI HDDs? iSCSI HDDs can be tricky during shutdown, specifically if the iSCSI device goes offline before the cache saves its metadata.

> > 7. Persistence of cached data - Does cached data remain across
> reboots/crashes/intermittent failures. Is the "sticky"ness of data
> configurable.
> 
> Persists across reboots. Can't be switched off, though it could be if
> there was any demand.

Believe me, enterprise customers do require a cache to be non-persistent. This comes from a fear that the HDD and SSD may go out of sync after a shutdown and before a reboot, primarily in environments with a large number of HDDs accessed through a complicated iSCSI-based setup, perhaps with software RAID.


> > 8. SSD life - Projected SSD life. Does the caching solution cause too
> much of write amplification leading to an early SSD failure.
> 
> With LRU, there's only so much you can do to work around the SSD's FTL,
> though bcache does try; allocation is done in terms of buckets, which
> are on the order of a megabyte (configured when you format the cache
> device). Buckets are written to sequentially, then rewritten later all
> at once (and it'll issue a discard before rewriting a bucket if you
> flip it on, it's not on by default because TRIM = slow).
> 
> Bcache also implements fifo cache replacement, and with that write
> amplification should never be an issue.

Most SSDs contain a fairly sophisticated FTL that does wear-leveling. Wear-leveling only helps by evenly balancing over-writes across the entire SSD. Do you have statistics on how many SSD writes are generated per block read from or written to the HDD? Metadata writes should be done only for the affected sectors, or else they contribute to additional SSD-internal writes. There is also a long-running debate on whether writing a single sector is more beneficial than writing the whole block containing that sector.
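
To put rough numbers on the sector-versus-block question, here is a small back-of-the-envelope sketch (my own illustration; the 4 KiB data block and 512-byte metadata sector are assumptions, not EnhanceIO's actual layout):

#include <stdio.h>

/* Host-side write amplification for one cached write:
 * (application data + cache metadata written) / application data. */
static double write_amp(double data_bytes, double metadata_bytes)
{
	return (data_bytes + metadata_bytes) / data_bytes;
}

int main(void)
{
	double blk = 4096.0;

	/* update only the affected 512-byte metadata sector: 1.125 */
	printf("per-sector metadata update: %.3f\n", write_amp(blk, 512.0));

	/* rewrite a whole 4 KiB metadata block instead: 2.000 */
	printf("per-block metadata update:  %.3f\n", write_amp(blk, 4096.0));
	return 0;
}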

> 
> > 9. Performance - Throughput is generally most important. Latency is
> also one more performance comparison point. Performance under different
> load classes can be measured.
> > 10. ACID properties - Atomicity, Concurrency, Idempotent, Durability.
> Does the caching solution have these typical transactional database or
> filesystem properties. This includes avoiding torn-page problem amongst
> crash and failure scenarios.
> 
> Yes.
> 
> > 11. Error conditions - Handling power failures, intermittent and
> permanent device failures.
> 
> Power failures and device failures yes, intermittent failures are not
> explicitly handled.

The IO completion guarantee offered under intermittent failures should be as good as that of the HDD alone.

> 
> > 12. Configuration parameters for tuning according to applications.
> 
> Lots. The most important one is probably sequential bypass - you don't
> typically want to cache your big sequential IO, because rotating disks
> do fine at that. So bcache detects sequential IO and bypasses it with a
> configurable threshold.
> 
> There's also stuff for bypassing more data if the SSD is overloaded -
> if you're caching many disks with a single SSD, you don't want the SSD
> to be the bottleneck. So it tracks latency to the SSD and cranks down
> the sequential bypass threshold if it gets too high.

That's interesting. I'll definitely want to read this part of the source code.
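
The mechanism Kent describes can be sketched as a simple feedback loop (a userspace illustration with assumed names, units and step sizes, not bcache's actual code):

#include <stdint.h>

struct bypass_ctl {
	uint64_t target_lat_us;   /* acceptable SSD latency */
	uint64_t cutoff_kb;       /* sequential IO above this size bypasses the SSD */
	uint64_t min_kb, max_kb;  /* clamp for the cutoff */
};

/* Fed a smoothed SSD latency sample: if the SSD is slower than wanted,
 * lower the cutoff so more sequential IO goes straight to the spindle;
 * if it has headroom, raise the cutoff again. */
void bypass_update(struct bypass_ctl *c, uint64_t ssd_lat_us)
{
	if (ssd_lat_us > c->target_lat_us && c->cutoff_kb > c->min_kb)
		c->cutoff_kb /= 2;
	else if (ssd_lat_us < c->target_lat_us / 2 && c->cutoff_kb < c->max_kb)
		c->cutoff_kb *= 2;
}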

> 
> > We'll soon document EnhanceIO behavior in context of these aspects.
> We'll appreciate if dm-cache and bcache is also documented.
> >
> > When comparing performance there are three levels at which it can be
> > measured 1. Architectural elements 1.1. Throughput for 100% cache hit
> > case (in absence of dirty data clean-up)
> 
> North of a million iops.
> 
> > 1.2. Throughput for 0% cache hit case (in absence of dirty data
> > clean-up)
> 
> Also relevant whether you're adding the data to the cache. I'm sure
> bcache is slightly slower than the raw backing device here, but if it's
> noticable it's a bug (I haven't benchmarked that specifically in ages).
> 
> > 1.3. Dirty data clean-up rate (in absence of IO)
> 
> Background writeback is done by scanning the btree in the background
> for dirty data, and then writing it out in lba order - so the writes
> are as sequential as they're going to get. It's fast.

Great.

Thanks.
-Amit
> 
> > 2. Performance of architectural elements combined 2.1. Varying mix of
> > read/write, sustained performance.
> 
> Random write performance is definitely important, as there you've got
> to keep an index up to date on stable storage (if you want to handle
> unclean shutdown, anyways). Making that fast is non trivial. Bcache is
> about as efficient as you're going to get w.r.t. metadata writes,
> though.
> 
> > 3. Application level testing - The more real-life like benchmark we
> work with, the better it is.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-17 13:26           ` thornber-H+wXaHxf7aLQT0dZR+AlfA
  (?)
@ 2013-01-17 17:53           ` Amit Kale
  2013-01-17 18:36               ` Jason Warr
  2013-01-17 18:50               ` thornber-H+wXaHxf7aLQT0dZR+AlfA
  -1 siblings, 2 replies; 54+ messages in thread
From: Amit Kale @ 2013-01-17 17:53 UTC (permalink / raw)
  To: thornber
  Cc: device-mapper development, kent.overstreet, Mike Snitzer, LKML,
	linux-bcache

> 
> On Thu, Jan 17, 2013 at 05:52:00PM +0800, Amit Kale wrote:
> > Hi Joe, Kent,
> >
> > [Adding Kent as well since bcache is mentioned below as one of the
> > contenders for being integrated into mainline kernel.]
> >
> > My understanding is that these three caching solutions all have three
> principle blocks.
> 
> Let me try and explain how dm-cache works.
> 
> > 1. A cache block lookup - This refers to finding out whether a block
> was cached or not and the location on SSD, if it was.
> 
> Of course we have this, but it's part of the policy plug-in.  I've done
> this because the policy nearly always needs to do some book keeping
> (eg, update a hit count when accessed).
> 
> > 2. Block replacement policy - This refers to the algorithm for
> replacing a block when a new free block can't be found.
> 
> I think there's more than just this.  These are the tasks that I hand
> over to the policy:
> 
>   a) _Which_ blocks should be promoted to the cache.  This seems to be
>      the key decision in terms of performance.  Blindly trying to
>      promote every io or even just every write will lead to some very
>      bad performance in certain situations.
> 
>      The mq policy uses a multiqueue (effectively a partially sorted
>      lru list) to keep track of candidate block hit counts.  When
>      candidates get enough hits they're promoted.  The promotion
>      threshold his periodically recalculated by looking at the hit
>      counts for the blocks already in the cache.

A multi-queue algorithm typically carries a significant metadata overhead. What percentage overhead does that imply here?

> 
>      The hit counts should degrade over time (for some definition of
>      time; eg. io volume).  I've experimented with this, but not yet
>      come up with a satisfactory method.
> 
>      I read through EnhanceIO yesterday, and think this is where
>      you're lacking.

We have an LRU policy at the cache-set level. The effectiveness of the LRU policy depends on the average time between accesses to a block in the working dataset. If that interval is short enough that a block is usually "hit" before it's chucked out, LRU works better than other policies.
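
For readers unfamiliar with set-level LRU, a minimal userspace sketch of the idea follows (an illustration only, not EnhanceIO's actual data structures; the 16-bit index representation echoes the per-block fields discussed later in the thread):

#include <stdint.h>

#define SET_SIZE 512                /* blocks per cache set in this sketch */
#define NIL ((uint16_t)0xffff)

/* The LRU chain is kept with 16-bit indices within the set rather than
 * pointers, so the per-block overhead stays at a few bytes. */
struct cache_set {
	uint16_t prev[SET_SIZE];
	uint16_t next[SET_SIZE];
	uint16_t mru;               /* most recently hit block in the set */
	uint16_t lru;               /* eviction candidate */
};

static void lru_unlink(struct cache_set *s, uint16_t b)
{
	if (s->prev[b] != NIL) s->next[s->prev[b]] = s->next[b];
	else                   s->mru = s->next[b];
	if (s->next[b] != NIL) s->prev[s->next[b]] = s->prev[b];
	else                   s->lru = s->prev[b];
}

/* On every hit the block moves to the MRU end of its set's chain. */
void lru_touch(struct cache_set *s, uint16_t b)
{
	lru_unlink(s, b);
	s->prev[b] = NIL;
	s->next[b] = s->mru;
	if (s->mru != NIL) s->prev[s->mru] = b;
	s->mru = b;
	if (s->lru == NIL) s->lru = b;
}

/* When a new block must come in, the coldest block in the set goes. */
uint16_t lru_evict(struct cache_set *s)
{
	uint16_t victim = s->lru;

	if (victim != NIL) lru_unlink(s, victim);
	return victim;
}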

> 
>   b) When should a block be promoted.  If you're swamped with io, then
>      adding copy io is probably not a good idea.  Current dm-cache
>      just has a configurable threshold for the promotion/demotion io
>      volume.  If you or Kent have some ideas for how to approximate
>      the bandwidth of the devices I'd really like to hear about it.
> 
>   c) Which blocks should be demoted?
> 
>      This is the bit that people commonly think of when they say
>      'caching algorithm'.  Examples are lru, arc, etc.  Such
>      descriptions are fine when describing a cache where elements
>      _have_ to be promoted before they can be accessed, for example a
>      cpu memory cache.  But we should be aware that 'lru' for example
>      really doesn't tell us much in the context of our policies.
> 
>      The mq policy uses a blend of lru and lfu for eviction, it seems
>      to work well.
> 
> A couple of other things I should mention; dm-cache uses a large block
> size compared to eio.  eg, 64k - 1m.  This is a mixed blessing;

Yes. We had a lot of internal debate on the block size. For now we have restricted it to 2k, 4k and 8k. We found that larger block sizes result in too much internal fragmentation, in spite of the significant reduction in metadata size. 8k is adequate for Oracle and MySQL.

> 
>  - our copy io is more efficient (we don't have to worry about
>    batching migrations together so much.  Something eio is careful to
>    do).
> 
>  - we have fewer blocks to hold stats about, so can keep more info per
>    block in the same amount of memory.
> 
>  - We trigger more copying.  For example if an incoming write triggers
>    a promotion from the origin to the cache, and the io covers a block
>    we can avoid any copy from the origin to cache.  With a bigger
>    block size this optmisation happens less frequently.
> 
>  - We waste SSD space.  eg, a 4k hotspot could trigger a whole block
>    to be moved to the cache.
> 
> 
> We do not keep the dirty state of cache blocks up to date on the
> metadata device.  Instead we have a 'mounted' flag that's set in the
> metadata when opened.  When a clean shutdown occurs (eg, dmsetup
> suspend my-cache) the dirty bits are written out and the mounted flag
> cleared.  On a crash the mounted flag will still be set on reopen and
> all dirty flags degrade to 'dirty'.  

I'm not sure I understand this. Is there a guarantee that once an IO is reported as "done" to the upstream layer (filesystem/database/application), it is persistent? Persistence should be guaranteed even if there is an OS crash immediately after the status is reported, and it should cover the entire IO range. The next time the application reads the data, it should get the updated data, not stale data.

> Correct me if I'm wrong, but I
> think eio is holding io completion until the dirty bits have been
> committed to disk?

That's correct. In addition to this, we try to batch metadata updates if multiple IOs occur in the same cache set.
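
A minimal userspace model of that "hold completion until the set's metadata is committed" behaviour (illustrative names only, not the actual eio code):

#include <stddef.h>

struct io {
	struct io *next;
	void (*complete)(struct io *);   /* only called once metadata is durable */
};

struct set_md {
	struct io *pending;              /* IOs waiting on this set's metadata write */
	int flush_in_flight;
};

/* Queue the IO instead of completing it; one metadata write can then
 * cover every IO that touched the same cache set. */
void md_defer_completion(struct set_md *md, struct io *io)
{
	io->next = md->pending;
	md->pending = io;
	if (!md->flush_in_flight) {
		md->flush_in_flight = 1;
		/* the batched metadata write for the set is issued here */
	}
}

/* Runs when the batched metadata write reaches stable media. */
void md_write_done(struct set_md *md)
{
	struct io *io = md->pending;

	md->pending = NULL;
	md->flush_in_flight = 0;
	while (io) {
		struct io *next = io->next;

		io->complete(io);        /* now safe to report "done" upstream */
		io = next;
	}
}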

> 
> I really view dm-cache as a slow moving hotspot optimiser.  Whereas I
> think eio and bcache are much more of a heirarchical storage approach,
> where writes go through the cache if possible?

Generally speaking, yes. EIO has dirty-data limits to avoid the situation where too much of the SSD is used for storing dirty data, which would reduce the effectiveness of the cache for reads.

> 
> > 3. IO handling - This is about issuing IO requests to SSD and HDD.
> 
>   I get most of this for free via dm and kcopyd.  I'm really keen to
>   see how bcache does; it's more invasive of the block layer, so I'm
>   expecting it to show far better performance than dm-cache.
> 
> > 4. Dirty data clean-up algorithm (for write-back only) - The dirty
>   data clean-up algorithm decides when to write a dirty block in an
>   SSD to its original location on HDD and executes the copy.
> 
>   Yep.
> 
> > When comparing the three solutions we need to consider these aspects.
> 
> > 1. User interface - This consists of commands used by users for
>   creating, deleting, editing properties and recovering from error
>   conditions.
> 
>   I was impressed how easy eio was to use yesterday when I was playing
>   with it.  Well done.
> 
>   Driving dm-cache through dm-setup isn't much more of a hassle
>   though.  Though we've decided to pass policy specific params on the
>   target line, and tweak via a dm message (again simple via dmsetup).
>   I don't think this is as simple as exposing them through something
>   like sysfs, but it is more in keeping with the device-mapper way.

You have the benefit of using the well-known dm interface.

> 
> > 2. Software interface - Where it interfaces to Linux kernel and
> applications.
> 
>   See above.
> 
> > 3. Availability - What's the downtime when adding, deleting caches,
>   making changes to cache configuration, conversion between cache
>   modes, recovering after a crash, recovering from an error condition.
> 
>   Normal dm suspend, alter table, resume cycle.  The LVM tools do this
>   all the time.

Cache creation and deletion will require stopping applications, unmounting filesystems, and then remounting and restarting the applications. In addition, a sysadmin will need to update fstab entries. Do fstab entries work automatically if they use labels instead of full device paths?

Same with changes to cache configuration.

> 
> > 4. Security - Security holes, if any.
> 
>   Well I saw the comment in your code describing the security flaw you
>   think you've got.  I hope we don't have any, I'd like to understand
>   your case more.

Could you elaborate on which comment you are referring to? Since all three caching solutions allow access only to the root user, my belief is that there are no security holes. I have listed it here because it's an important consideration for enterprise users.

> 
> > 5. Portability - Which HDDs, SSDs, partitions, other block devices it
> works with.
> 
>   I think we all work with any block device.  But eio and bcache can
>   overlay any device node, not just a dm one.  As mentioned in earlier
>   email I really think this is a dm issue, not specific to dm-cache.

DM was never meant to be cascaded. So it's ok for DM.

We recommend that our customers use RAID for the SSD when running write-back, because an SSD failure then leads to catastrophic data loss (the dirty data). We support using an md device as an SSD. There are some issues with md devices in the code published on github; I'll follow up with a code fix next week.

> 
> > 6. Persistence of cache configuration - Once created does the cache
>   configuration stay persistent across reboots. How are changes in
>   device sequence or numbering handled.
> 
>   We've gone for no persistence of policy parameters.  Instead
>   everything is handed into the kernel when the target is setup.  This
>   decision was made by the LVM team who wanted to store this
>   information themselves (we certainly shouldn't store it in two
>   places at once).  I don't feel strongly either way, and could
>   persist the policy params v. easily (eg, 1 days work).

Storing persistence information in a single place makes sense. 
> 
>   One thing I do provide is a 'hint' array for the policy to use and
>   persist.  The policy specifies how much data it would like to store
>   per cache block, and then writes it on clean shutdown (hence 'hint',
>   it has to cope without this, possibly with temporarily degraded
>   performance).  The mq policy uses the hints to store hit counts.
> 
> > 7. Persistence of cached data - Does cached data remain across
>   reboots/crashes/intermittent failures. Is the "sticky"ness of data
>   configurable.
> 
>   Surely this is a given?  A cache would be trivial to write if it
>   didn't need to be crash proof.

There has to be a way to make it either persistent or volatile, depending on how users want it. Enterprise users are sometimes paranoid about the HDD and SSD going out of sync between a system shutdown and the next boot. This is typically for large, complicated iSCSI-based shared HDD setups.

> 
> > 8. SSD life - Projected SSD life. Does the caching solution cause
>   too much of write amplification leading to an early SSD failure.
> 
>   No, I decided years ago that life was too short to start optimising
>   for specific block devices.  By the time you get it right the
>   hardware characteristics will have moved on.  Doesn't the firmware
>   on SSDs try and even out io wear these days?

That's correct. We don't have to worry about wear leveling. All of the competent SSDs around do that.

What I wanted to bring up was how many SSD writes a single cache read or write results in. Write-back mode is particularly taxing on SSDs in this respect.

>   That said I think we evenly use the SSD.  Except for the superblock
>   on the metadata device.
> 
> > 9. Performance - Throughput is generally most important. Latency is
>   also one more performance comparison point. Performance under
>   different load classes can be measured.
> 
>   I think latency is more important than throughput.  Spindles are
>   pretty good at throughput.  In fact the mq policy tries to spot when
>   we're doing large linear ios and stops hit counting; best leave this
>   stuff on the spindle.

I disagree. Latency is taken care of automatically when the number of application threads rises.

> 
> > 10. ACID properties - Atomicity, Concurrency, Idempotent,
>   Durability. Does the caching solution have these typical
>   transactional database or filesystem properties. This includes
>   avoiding torn-page problem amongst crash and failure scenarios.
> 
>   Could you expand on the torn-page issue please?

Databases run into a torn-page error when an IO is found to be only partially written when it was supposed to be fully written. This is particularly important when the IO was reported as "done". The original flashcache code we started with over a year ago showed the torn-page problem in extremely rare crashes in write-back mode. Our present code contains specific design elements to avoid it.
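
As background, one generic way to avoid torn pages is out-of-place ("shadow") writing: the new copy is made durable first and only then is the mapping flipped, so a crash exposes either the whole old block or the whole new one. A minimal userspace sketch of that general technique (my own illustration with assumed slot offsets, not EnhanceIO's actual design; it also assumes the one-byte mapping update is atomic on the device):

#include <sys/types.h>
#include <unistd.h>

#define BLK 4096

/* Two data slots per logical block plus a one-byte record naming the
 * currently valid slot.  The idle slot is written and synced before the
 * record flips, so a reader after a crash never sees a mix of old and
 * new data. */
int shadow_write(int fd, off_t slot0, off_t slot1, off_t map_off,
		 const char new_data[BLK])
{
	char cur, next;
	off_t target;

	if (pread(fd, &cur, 1, map_off) != 1)
		return -1;
	target = (cur == 0) ? slot1 : slot0;        /* write the idle slot */

	if (pwrite(fd, new_data, BLK, target) != BLK || fsync(fd) != 0)
		return -1;                          /* data made durable first */

	next = (cur == 0) ? 1 : 0;
	if (pwrite(fd, &next, 1, map_off) != 1 || fsync(fd) != 0)
		return -1;                          /* then flip the mapping */
	return 0;
}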

> 
> > 11. Error conditions - Handling power failures, intermittent and
> permanent device failures.
> 
>   I think the area where dm-cache is currently lacking is intermittent
>   failures.  For example if a cache read fails we just pass that error
>   up, whereas eio sees if the block is clean and if so tries to read
>   off the origin.  I'm not sure which behaviour is correct; I like to
>   know about disk failure early.

Our read-only and write-through modes guarantee that no IO errors are introduced regardless of the state the SSD is in, so not retrying an IO error doesn't cause any future problems. The worst case is a performance hit when an SSD returns IO errors or goes completely bad.
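
A minimal sketch of that fallback path, with assumed callback names (illustrative only):

#include <stdint.h>

struct cache_dev {
	int  (*is_cached)(void *ctx, uint64_t blk);
	int  (*ssd_read)(void *ctx, uint64_t blk, void *buf);
	int  (*hdd_read)(void *ctx, uint64_t blk, void *buf);
	void *ctx;
};

/* In read-only / write-through modes every cached block also exists,
 * clean, on the HDD, so an SSD error can always fall back to the origin
 * and the caller never sees an error the HDD alone would not return. */
int cache_read(struct cache_dev *c, uint64_t blk, void *buf)
{
	if (c->is_cached(c->ctx, blk) &&
	    c->ssd_read(c->ctx, blk, buf) == 0)
		return 0;

	return c->hdd_read(c->ctx, blk, buf);
}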

It's a different story for write-back. We advise our customers to use RAID for the SSD when using write-back, as explained above.

> 
> > 12. Configuration parameters for tuning according to applications.
> 
>   Discussed above.
> 
> > We'll soon document EnhanceIO behavior in context of these
>   aspects. We'll appreciate if dm-cache and bcache is also documented.
> 
>   I hope the above helps.  Please ask away if you're unsure about
>   something.
> 
> > When comparing performance there are three levels at which it can be
> > measured
> 
> Developing these caches is tedious.  Test runs take time, and really
> slow the dev cycle down.  So I suspect we've all been using
> microbenchmarks that run in a few minutes.
> 
> Let's get our pool of microbenchmarks together, then work on some
> application level ones (we're happy to put some time into developing
> these).

We run micro-benchmarks all the time. There are free database benchmarks, so we can try those. Running a full-fledged Oracle-based benchmark takes hours, so I am not sure whether I can post that kind of comparison. I'll try to do the best I can.

Thanks.
-Amit

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-17 18:36               ` Jason Warr
  0 siblings, 0 replies; 54+ messages in thread
From: Jason Warr @ 2013-01-17 18:36 UTC (permalink / raw)
  To: Amit Kale
  Cc: thornber, device-mapper development, kent.overstreet,
	Mike Snitzer, LKML, linux-bcache


On 01/17/2013 11:53 AM, Amit Kale wrote:
>>> 9. Performance - Throughput is generally most important. Latency is
>> >   also one more performance comparison point. Performance under
>> >   different load classes can be measured.
>> > 
>> >   I think latency is more important than throughput.  Spindles are
>> >   pretty good at throughput.  In fact the mq policy tries to spot when
>> >   we're doing large linear ios and stops hit counting; best leave this
>> >   stuff on the spindle.
> I disagree. Latency is taken care of automatically when the number of application threads rises.
>

Can you explain what you mean by that in a little more detail?

As an enterprise-level user I see both as important overall.  However,
the biggest driving factor in wanting a cache device in front of any
sort of target in my use cases is to hide latency as the number of
threads reading and writing to the backing device goes up.  So for me
the cache is basically a tiering stage, where your ability to keep
dirty blocks on it is determined by the specific use case.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-17 18:50               ` thornber-H+wXaHxf7aLQT0dZR+AlfA
  0 siblings, 0 replies; 54+ messages in thread
From: thornber @ 2013-01-17 18:50 UTC (permalink / raw)
  To: Amit Kale
  Cc: device-mapper development, kent.overstreet, Mike Snitzer, LKML,
	linux-bcache

On Fri, Jan 18, 2013 at 01:53:11AM +0800, Amit Kale wrote:
> > 
> > On Thu, Jan 17, 2013 at 05:52:00PM +0800, Amit Kale wrote:
> >      The mq policy uses a multiqueue (effectively a partially sorted
> >      lru list) to keep track of candidate block hit counts.  When
> >      candidates get enough hits they're promoted.  The promotion
> >      threshold his periodically recalculated by looking at the hit
> >      counts for the blocks already in the cache.
> 
> Multi-queue algorithm typically results in a significant metadata
> overhead. How much percentage overhead does that imply?

It is a drawback; at the moment we have a list head, hit count and
some flags per block.  I can compress this; it's on my todo list.
Looking at the code I see you have doubly linked list fields per block
too, albeit 16-bit ones.  We use much bigger blocks than you, so I'm
happy to get the benefit of the extra space.
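
For concreteness, a rough model of the two per-block layouts being compared (field names and sizes are my assumptions, not the actual dm-cache or eio structures):

#include <stdint.h>

struct list_head { struct list_head *next, *prev; };

/* mq-policy style: a pointer-based list plus counters, roughly 24 bytes
 * per (large, 64k-1m) cache block on a 64-bit machine. */
struct mq_block {
	struct list_head list;      /* 16 bytes */
	uint32_t hit_count;
	uint32_t flags;
};

/* eio style: 16-bit indices within the set instead of pointers, a few
 * bytes per (small, e.g. 4k) cache block. */
struct eio_block {
	uint16_t lru_prev;
	uint16_t lru_next;
	uint8_t  state;
};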

> >      I read through EnhanceIO yesterday, and think this is where
> >      you're lacking.
> 
> We have an LRU policy at a cache set level. Effectiveness of the LRU
> policy depends on the average duration of a block in a working
> dataset. If the average duration is small enough so a block is most
> of the times "hit" before it's chucked out, LRU works better than
> any other policies.

Yes, in some situations lru is best, in others lfu is best.  That's
why people try and blend in something like arc.  Now my real point was
although you're using lru to choose what to evict, you're not using
anything to choose what to put _in_ the cache, or have I got this
totally wrong?

> > A couple of other things I should mention; dm-cache uses a large block
> > size compared to eio.  eg, 64k - 1m.  This is a mixed blessing;
> 
> Yes. We had a lot of debate internally on the block size. For now we
> have restricted to 2k, 4k and 8k. We found that larger block sizes
> result in too much of internal fragmentation, in-spite of a
> significant reduction in metadata size. 8k is adequate for Oracle
> and mysql.

Right, you need to describe these scenarios so you can show off eio in
the best light.

> > We do not keep the dirty state of cache blocks up to date on the
> > metadata device.  Instead we have a 'mounted' flag that's set in the
> > metadata when opened.  When a clean shutdown occurs (eg, dmsetup
> > suspend my-cache) the dirty bits are written out and the mounted flag
> > cleared.  On a crash the mounted flag will still be set on reopen and
> > all dirty flags degrade to 'dirty'.  
> 

> Not sure I understand this. Is there a guarantee that once an IO is
> reported as "done" to upstream layer
> (filesystem/database/application), it is persistent. The persistence
> should be guaranteed even if there is an OS crash immediately after
> status is reported. Persistence should be guaranteed for the entire
> IO range. The next time the application tries to read it, it should
> get updated data, not stale data.

Yes, we're careful to persist all changes in the mapping before
completing io.  However the dirty bits are just used to ascertain what
blocks need writing back to the origin.  In the event of a crash it's
safe to assume they all do.  dm-cache is a slow moving cache, change
of dirty status occurs far, far more frequently than change of
mapping.  So avoiding these updates is a big win.
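
A compact userspace model of the clean-shutdown scheme described above (names and the flat dirty array are assumptions; the real dm-cache metadata lives in a btree on its own device):

#include <stdint.h>
#include <string.h>

#define NBLOCKS 1024

struct cache_md {
	uint8_t mounted;            /* set while the cache is open */
	uint8_t dirty[NBLOCKS];     /* only trustworthy after a clean shutdown */
};

/* On open: if the flag is still set the last run crashed, so the
 * persisted dirty bits can't be trusted and every block degrades to
 * dirty (it will all be written back to the origin). */
void cache_open(struct cache_md *md)
{
	if (md->mounted)
		memset(md->dirty, 1, sizeof(md->dirty));
	md->mounted = 1;
	/* the metadata commit would be issued here */
}

/* On clean shutdown the in-core dirty bits are written out and the
 * flag cleared, so the next open can trust them. */
void cache_clean_shutdown(struct cache_md *md,
			  const uint8_t live_dirty[NBLOCKS])
{
	memcpy(md->dirty, live_dirty, sizeof(md->dirty));
	md->mounted = 0;
	/* the metadata commit would be issued here */
}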

> > Correct me if I'm wrong, but I
> > think eio is holding io completion until the dirty bits have been
> > committed to disk?
> 
> That's correct. In addition to this, we try to batch metadata updates if multiple IOs occur in the same cache set.

Yes, I batch updates too.

> > > 3. Availability - What's the downtime when adding, deleting caches,
> >   making changes to cache configuration, conversion between cache
> >   modes, recovering after a crash, recovering from an error condition.
> > 
> >   Normal dm suspend, alter table, resume cycle.  The LVM tools do this
> >   all the time.
> 
> Cache creation and deletion will require stopping applications,
> unmounting filesystems and then remounting and starting the
> applications. A sysad in addition to this will require updating
> fstab entries. Do fstab entries work automatically in case they use
> labels instead of full device paths.

The common case will be someone using a volume manager like LVM, so
the device nodes are already dm ones.  In this case there's no need
for unmounting or stopping applications.  Changing the stack of dm
targets around on a live system is a key feature.  For example this is
how we implement the pvmove functionality.

> >   Well I saw the comment in your code describing the security flaw you
> >   think you've got.  I hope we don't have any, I'd like to understand
> >   your case more.
> 
> Could you elaborate on which comment you are referring to?

Top of eio_main.c

 * 5) Fix a security hole : A malicious process with 'ro' access to a
 * file can potentially corrupt file data. This can be fixed by
 * copying the data on a cache read miss.

> > > 5. Portability - Which HDDs, SSDs, partitions, other block devices it
> > works with.
> > 
> >   I think we all work with any block device.  But eio and bcache can
> >   overlay any device node, not just a dm one.  As mentioned in earlier
> >   email I really think this is a dm issue, not specific to dm-cache.
> 
> DM was never meant to be cascaded. So it's ok for DM.

Not sure what you mean here?  I wrote dm specifically with stacking
scenarios in mind.

> > > 7. Persistence of cached data - Does cached data remain across
> >   reboots/crashes/intermittent failures. Is the "sticky"ness of data
> >   configurable.
> > 
> >   Surely this is a given?  A cache would be trivial to write if it
> >   didn't need to be crash proof.
> 
> There has to be a way to make it either persistent or volatile
> depending on how users want it. Enterprise users are sometimes
> paranoid about HDD and SSD going out of sync after a system shutdown
> and before a bootup. This is typically for large complicated iSCSI
> based shared HDD setups.

Well, in those cases enterprise users can just use dm-cache in
writethrough mode and throw the cache away when they finish.  Writing
our metadata is not the bottleneck (the copying for migrations is),
and it's definitely worth keeping so there are up-to-date hit counts
for the policy to work from after a reboot.

> That's correct. We don't have to worry about wear leveling. All of the competent SSDs around do that.
> 

> What I wanted to bring up was how many SSD writes does a cache
> read/write result. Write back cache mode is specifically taxing on
> SSDs in this aspect.

No more than read/writes to a plain SSD.  Are you getting hit by extra
io because you persist dirty flags?

> Databases run into torn-page error when an IO is found to be only
> partially written when it was supposed to be fully written. This is
> particularly important when an IO was reported to be "done". The
> original flashcache code we started with over an year ago showed
> torn-page problem in extremely rare crashes with writeback mode. Our
> present code contains specific design elements to avoid it.

We get this for free in core dm.

- Joe

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-17 18:50               ` thornber-H+wXaHxf7aLQT0dZR+AlfA
  0 siblings, 0 replies; 54+ messages in thread
From: thornber-H+wXaHxf7aLQT0dZR+AlfA @ 2013-01-17 18:50 UTC (permalink / raw)
  To: Amit Kale
  Cc: device-mapper development,
	kent.overstreet-Re5JQEeQqe8AvxtiuMwx3w, Mike Snitzer, LKML,
	linux-bcache-u79uwXL29TY76Z2rM5mHXA

On Fri, Jan 18, 2013 at 01:53:11AM +0800, Amit Kale wrote:
> > 
> > On Thu, Jan 17, 2013 at 05:52:00PM +0800, Amit Kale wrote:
> >      The mq policy uses a multiqueue (effectively a partially sorted
> >      lru list) to keep track of candidate block hit counts.  When
> >      candidates get enough hits they're promoted.  The promotion
> >      threshold his periodically recalculated by looking at the hit
> >      counts for the blocks already in the cache.
> 
> Multi-queue algorithm typically results in a significant metadata
> overhead. How much percentage overhead does that imply?

It is a drawback, at the moment we have a list head, hit count and
some flags per block.  I can compress this, it's on my todo list.
Looking at the code I see you have doubly linked list fields per block
too, albeit 16 bit ones.  We use much bigger blocks than you, so I'm
happy to get the benefit of the extra space.

> >      I read through EnhanceIO yesterday, and think this is where
> >      you're lacking.
> 
> We have an LRU policy at a cache set level. Effectiveness of the LRU
> policy depends on the average duration of a block in a working
> dataset. If the average duration is small enough so a block is most
> of the times "hit" before it's chucked out, LRU works better than
> any other policies.

Yes, in some situations lru is best, in others lfu is best.  That's
why people try and blend in something like arc.  Now my real point was
although you're using lru to choose what to evict, you're not using
anything to choose what to put _in_ the cache, or have I got this
totally wrong?

> > A couple of other things I should mention; dm-cache uses a large block
> > size compared to eio.  eg, 64k - 1m.  This is a mixed blessing;
> 
> Yes. We had a lot of debate internally on the block size. For now we
> have restricted to 2k, 4k and 8k. We found that larger block sizes
> result in too much of internal fragmentation, in-spite of a
> significant reduction in metadata size. 8k is adequate for Oracle
> and mysql.

Right, you need to describe these scenarios so you can show off eio in
the best light.

> > We do not keep the dirty state of cache blocks up to date on the
> > metadata device.  Instead we have a 'mounted' flag that's set in the
> > metadata when opened.  When a clean shutdown occurs (eg, dmsetup
> > suspend my-cache) the dirty bits are written out and the mounted flag
> > cleared.  On a crash the mounted flag will still be set on reopen and
> > all dirty flags degrade to 'dirty'.  
> 

> Not sure I understand this. Is there a guarantee that once an IO is
> reported as "done" to upstream layer
> (filesystem/database/application), it is persistent. The persistence
> should be guaranteed even if there is an OS crash immediately after
> status is reported. Persistence should be guaranteed for the entire
> IO range. The next time the application tries to read it, it should
> get updated data, not stale data.

Yes, we're careful to persist all changes in the mapping before
completing io.  However the dirty bits are just used to ascertain what
blocks need writing back to the origin.  In the event of a crash it's
safe to assume they all do.  dm-cache is a slow moving cache, change
of dirty status occurs far, far more frequently than change of
mapping.  So avoiding these updates is a big win.

> > Correct me if I'm wrong, but I
> > think eio is holding io completion until the dirty bits have been
> > committed to disk?
> 
> That's correct. In addition to this, we try to batch metadata updates if multiple IOs occur in the same cache set.

y, I batch updates too.

> > > 3. Availability - What's the downtime when adding, deleting caches,
> >   making changes to cache configuration, conversion between cache
> >   modes, recovering after a crash, recovering from an error condition.
> > 
> >   Normal dm suspend, alter table, resume cycle.  The LVM tools do this
> >   all the time.
> 
> Cache creation and deletion will require stopping applications,
> unmounting filesystems and then remounting and starting the
> applications. A sysad in addition to this will require updating
> fstab entries. Do fstab entries work automatically in case they use
> labels instead of full device paths.

The common case will be someone using a volume manager like LVM, so
the device nodes are already dm ones.  In this case there's no need
for unmounting or stopping applications.  Changing the stack of dm
targets around on a live system is a key feature.  For example this is
how we implement the pvmove functionality.

> >   Well I saw the comment in your code describing the security flaw you
> >   think you've got.  I hope we don't have any, I'd like to understand
> >   your case more.
> 
> Could you elaborate on which comment you are referring to?

Top of eio_main.c

 * 5) Fix a security hole : A malicious process with 'ro' access to a
 * file can potentially corrupt file data. This can be fixed by
 * copying the data on a cache read miss.

> > > 5. Portability - Which HDDs, SSDs, partitions, other block devices it
> > works with.
> > 
> >   I think we all work with any block device.  But eio and bcache can
> >   overlay any device node, not just a dm one.  As mentioned in earlier
> >   email I really think this is a dm issue, not specific to dm-cache.
> 
> DM was never meant to be cascaded. So it's ok for DM.

Not sure what you mean here?  I wrote dm specifically with stacking
scenarios in mind.

> > > 7. Persistence of cached data - Does cached data remain across
> >   reboots/crashes/intermittent failures. Is the "sticky"ness of data
> >   configurable.
> > 
> >   Surely this is a given?  A cache would be trivial to write if it
> >   didn't need to be crash proof.
> 
> There has to be a way to make it either persistent or volatile
> depending on how users want it. Enterprise users are sometimes
> paranoid about HDD and SSD going out of sync after a system shutdown
> and before a bootup. This is typically for large complicated iSCSI
> based shared HDD setups.

Well, in those cases enterprise users can just use dm-cache in
writethrough mode and throw the cache away when they're finished.
Writing our metadata is not the bottleneck (the copy for migrations
is), and it's definitely worth keeping so there are up-to-date hit
counts for the policy to work from after a reboot.

> That's correct. We don't have to worry about wear leveling. All of the competent SSDs around do that.
> 

> What I wanted to bring up was how many SSD writes a cache read/write
> results in. Write-back cache mode is especially taxing on SSDs in
> this respect.

No more than read/writes to a plain SSD.  Are you getting hit by extra
io because you persist dirty flags?

> Databases run into a torn-page error when an IO is found to be only
> partially written when it was supposed to be fully written. This is
> particularly important when an IO was reported to be "done". The
> original flashcache code we started with over a year ago showed the
> torn-page problem in extremely rare crashes in writeback mode. Our
> present code contains specific design elements to avoid it.

We get this for free in core dm.

- Joe

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-17 18:50               ` thornber-H+wXaHxf7aLQT0dZR+AlfA
@ 2013-01-18  7:03                 ` Amit Kale
  -1 siblings, 0 replies; 54+ messages in thread
From: Amit Kale @ 2013-01-18  7:03 UTC (permalink / raw)
  To: thornber
  Cc: device-mapper development, kent.overstreet, Mike Snitzer, LKML,
	linux-bcache

> > >      The mq policy uses a multiqueue (effectively a partially sorted
> > >      lru list) to keep track of candidate block hit counts.  When
> > >      candidates get enough hits they're promoted.  The promotion
> > >      threshold his periodically recalculated by looking at the hit
> > >      counts for the blocks already in the cache.
> >
> > Multi-queue algorithm typically results in a significant metadata
> > overhead. How much percentage overhead does that imply?
> 
> It is a drawback, at the moment we have a list head, hit count and some
> flags per block.  I can compress this, it's on my todo list.
> Looking at the code I see you have doubly linked list fields per block
> too, albeit 16 bit ones.  We use much bigger blocks than you, so I'm
> happy to get the benefit of the extra space.
> 
> > >      I read through EnhanceIO yesterday, and think this is where
> > >      you're lacking.
> >
> > We have an LRU policy at a cache set level. Effectiveness of the LRU
> > policy depends on the average duration of a block in a working
> > dataset. If the average duration is small enough so a block is most of
> > the times "hit" before it's chucked out, LRU works better than any
> > other policies.
> 
> Yes, in some situations lru is best, in others lfu is best.  That's why
> people try and blend in something like arc.  Now my real point was
> although you're using lru to choose what to evict, you're not using
> anything to choose what to put _in_ the cache, or have I got this
> totally wrong?

We simply insert any read or written block into the cache (subject to availability and controlled limits).
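
For contrast with this always-insert approach, here is a tiny hypothetical
sketch of the hit-count admission filter described for the mq policy
earlier in the thread; the threshold logic is illustrative, not dm-cache's
actual policy code.

/* Hypothetical admission filter: a block is only promoted to the SSD once
 * it has accumulated enough hits while tracked in a pre-cache queue.  An
 * always-insert policy is the degenerate case threshold == 1. */
#include <stdbool.h>
#include <stdint.h>

struct candidate {
        uint64_t block;         /* origin block number */
        unsigned int hits;      /* accesses seen while not yet cached */
};

static bool should_promote(struct candidate *c, unsigned int threshold)
{
        c->hits++;
        return c->hits >= threshold;
}

The threshold would then be recalculated periodically from the hit counts
of blocks already in the cache, as described earlier in the thread.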

> 
> > > A couple of other things I should mention; dm-cache uses a large
> > > block size compared to eio.  eg, 64k - 1m.  This is a mixed
> > > blessing;
> >
> > Yes. We had a lot of debate internally on the block size. For now we
> > have restricted to 2k, 4k and 8k. We found that larger block sizes
> > result in too much of internal fragmentation, in-spite of a
> > significant reduction in metadata size. 8k is adequate for Oracle and
> > mysql.
> 
> Right, you need to describe these scenarios so you can show off eio in
> the best light.
> 
> > > We do not keep the dirty state of cache blocks up to date on the
> > > metadata device.  Instead we have a 'mounted' flag that's set in the
> > > metadata when opened.  When a clean shutdown occurs (eg, dmsetup
> > > suspend my-cache) the dirty bits are written out and the mounted
> > > flag cleared.  On a crash the mounted flag will still be set on
> > > reopen and all dirty flags degrade to 'dirty'.
> >
> 
> > Not sure I understand this. Is there a guarantee that once an IO is
> > reported as "done" to upstream layer
> > (filesystem/database/application), it is persistent. The persistence
> > should be guaranteed even if there is an OS crash immediately after
> > status is reported. Persistence should be guaranteed for the entire IO
> > range. The next time the application tries to read it, it should get
> > updated data, not stale data.
> 
> Yes, we're careful to persist all changes in the mapping before
> completing io.  However the dirty bits are just used to ascertain what
> blocks need writing back to the origin.  In the event of a crash it's
> safe to assume they all do.  dm-cache is a slow moving cache, change of
> dirty status occurs far, far more frequently than change of mapping.
> So avoiding these updates is a big win.

That's great.


> 
> > > Correct me if I'm wrong, but I
> > > think eio is holding io completion until the dirty bits have been
> > > committed to disk?
> >
> > That's correct. In addition to this, we try to batch metadata updates
> if multiple IOs occur in the same cache set.
> 
> y, I batch updates too.
> 
> > > > 3. Availability - What's the downtime when adding, deleting
> > > > caches,
> > >   making changes to cache configuration, conversion between cache
> > >   modes, recovering after a crash, recovering from an error
> condition.
> > >
> > >   Normal dm suspend, alter table, resume cycle.  The LVM tools do
> this
> > >   all the time.
> >
> > Cache creation and deletion will require stopping applications,
> > unmounting filesystems and then remounting and starting the
> > applications. A sysad in addition to this will require updating fstab
> > entries. Do fstab entries work automatically in case they use labels
> > instead of full device paths.
> 
> The common case will be someone using a volume manager like LVM, so the
> device nodes are already dm ones.  In this case there's no need for
> unmounting or stopping applications.  Changing the stack of dm targets
> around on a live system is a key feature.  For example this is how we
> implement the pvmove functionality.
> 
> > >   Well I saw the comment in your code describing the security flaw you
> > >   think you've got.  I hope we don't have any, I'd like to understand
> > >   your case more.
> >
> > Could you elaborate on which comment you are referring to?
> 
> Top of eio_main.c
> 
>  * 5) Fix a security hole : A malicious process with 'ro' access to a
>  * file can potentially corrupt file data. This can be fixed by
>  * copying the data on a cache read miss.

That's a stale comment that slipped through our cleanup. We will remove it.

It's still possible for an ordinary user to "consume" a significant portion of a cache by perpetually reading all the data they have permission to read. As of now, caches don't have per-user controls.
-Amit

> 
> > > > 5. Portability - Which HDDs, SSDs, partitions, other block devices it
> > > works with.
> > >
> > >   I think we all work with any block device.  But eio and bcache can
> > >   overlay any device node, not just a dm one.  As mentioned in earlier
> > >   email I really think this is a dm issue, not specific to dm-cache.
> >
> > DM was never meant to be cascaded. So it's ok for DM.
> 
> Not sure what you mean here?  I wrote dm specifically with stacking
> scenarios in mind.

DM can't use a device containing partitions, by design. It works on individual partitions, though.

> 
> > > > 7. Persistence of cached data - Does cached data remain across
> > >   reboots/crashes/intermittent failures. Is the "sticky"ness of data
> > >   configurable.
> > >
> > >   Surely this is a given?  A cache would be trivial to write if it
> > >   didn't need to be crash proof.
> >
> > There has to be a way to make it either persistent or volatile
> > depending on how users want it. Enterprise users are sometimes
> > paranoid about HDD and SSD going out of sync after a system shutdown
> > and before a bootup. This is typically for large complicated iSCSI
> > based shared HDD setups.
> 
> Well in those Enterprise users can just use dm-cache in writethrough
> mode and throw it away when they finish.  Writing our metadata is not
> the bottle neck (copy for migrations is), and it's definitely worth
> keeping so there are up to date hit counts for the policy to work off
> after reboot.

Agreed. However, there are arguments both ways; the need to start afresh is valid, although infrequent.

> 
> > That's correct. We don't have to worry about wear leveling. All of
> > the competent SSDs around do that.
> >
> 
> > What I wanted to bring up was how many SSD writes does a cache
> > read/write result. Write back cache mode is specifically taxing on
> > SSDs in this aspect.
> 
> No more than read/writes to a plain SSD.  Are you getting hit by extra
> io because you persist dirty flags?

It's the price users pay for metadata updates. Our three caching modes incur different levels of SSD writes: read-only < write-through < write-back. Users can weigh the benefits against SSD life and choose accordingly.
-Amit



^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-18  9:08                 ` Amit Kale
  0 siblings, 0 replies; 54+ messages in thread
From: Amit Kale @ 2013-01-18  9:08 UTC (permalink / raw)
  To: Jason Warr
  Cc: thornber, device-mapper development, kent.overstreet,
	Mike Snitzer, LKML, linux-bcache

> From: Jason Warr [mailto:jason@warr.net]
> On 01/17/2013 11:53 AM, Amit Kale wrote:
> >>> 9. Performance - Throughput is generally most important. Latency is
> >> >   also one more performance comparison point. Performance under
> >> >   different load classes can be measured.
> >> >
> >> >   I think latency is more important than throughput.  Spindles are
> >> >   pretty good at throughput.  In fact the mq policy tries to spot when
> >> >   we're doing large linear ios and stops hit counting; best leave this
> >> >   stuff on the spindle.
> > I disagree. Latency is taken care of automatically when the number of
> > application threads rises.
> >
> 
> Can you explain what you mean by that in a little more detail?

Let's say the latency of a block device is 10ms for 4kB requests. With single-threaded IO, the throughput will be 4kB/10ms = 400kB/s. If the device is capable of more throughput, multithreaded IO will generate more: with 2 threads the throughput will be roughly 800kB/s. We can keep increasing the number of threads, giving an approximately linear increase in throughput until it saturates at the device's maximum capacity, perhaps at 8MB/s. Increasing the number of threads beyond this will not increase throughput.
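
The arithmetic above, written out as a small program; the 10ms latency,
4kB request size and 8MB/s ceiling are the illustrative numbers from the
paragraph, and real devices saturate less cleanly than this.

/* Worked version of the example above: per-thread throughput is
 * block_size / latency, and total throughput scales roughly linearly
 * with the thread count until the device's limit is reached. */
#include <stdio.h>

int main(void)
{
        const double block_kb = 4.0;            /* 4 kB requests */
        const double latency_s = 0.010;         /* 10 ms per request */
        const double device_max_kbps = 8192.0;  /* assumed device ceiling, ~8 MB/s */
        const double per_thread_kbps = block_kb / latency_s;    /* 400 kB/s */

        for (int threads = 1; threads <= 32; threads *= 2) {
                double tput = per_thread_kbps * threads;

                if (tput > device_max_kbps)
                        tput = device_max_kbps;
                printf("%2d threads: ~%.0f kB/s\n", threads, tput);
        }
        return 0;
}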

This is a simplistic computation; throughput, latency and the number of threads are related in a more complex way. Latency is still important, but throughput is more important.

The way all this matters for SSD caching is that a cache will typically show higher latency than the base SSD, even at a 100% hit ratio. It may still be possible to reach the maximum throughput of the base SSD by using more threads. Let's say an SSD shows 450MB/s with 4 threads; a cache may show 440MB/s with 8 threads.

A practical difficulty in measuring latency is that the latency seen by an application is the sum of the device latency and the time spent in the request queue (and in the caching layer, when present). Increasing the number of threads shows a latency increase, but only because requests stay in the request queue longer. Latency measurement in a multithreaded environment is very challenging; measurement of throughput is fairly straightforward.

> 
> As an enterprise level user I see both as important overall.  However,
> the biggest driving factor in wanting a cache device in front of any
> sort of target in my use cases is to hide latency as the number of
> threads reading and writing to the backing device go up.  So for me the
> cache is basically a tier stage where your ability to keep dirty blocks
> on it is determined by the specific use case.

SSD caching will help in this case, since an SSD's latency remains almost constant regardless of the location of the data, whereas HDD latency for sequential versus random IO can vary by a factor of 5 or much more.

Throughput with caching could even be 100 times the HDD throughput for multithreaded non-sequential IO.
-Amit



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-16 10:45     ` [dm-devel] " thornber
                         ` (2 preceding siblings ...)
  2013-01-17  9:52       ` Amit Kale
@ 2013-01-18 14:43       ` thornber
  3 siblings, 0 replies; 54+ messages in thread
From: thornber @ 2013-01-18 14:43 UTC (permalink / raw)
  To: device-mapper development, Mike Snitzer, LKML

On Wed, Jan 16, 2013 at 10:45:47AM +0000, thornber@redhat.com wrote:
> I'll create a branch in my github tree with all three caches in.  So
> it's easy to build a kernel with them.  (Mike's already combined
> dm-cache and bcache and done some preliminary testing).

git://github.com/jthornber/linux-2.6.git

branch 'all-caches'

I managed to reproduce the hang with bcache that Mike described
before.  However, I can see that it was running the test more quickly
than eio at that point.

- Joe

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-18  9:08                 ` Amit Kale
  (?)
@ 2013-01-18 15:56                 ` Jason Warr
  2013-01-18 16:11                     ` thornber-H+wXaHxf7aLQT0dZR+AlfA
  2013-01-18 16:12                     ` Amit Kale
  -1 siblings, 2 replies; 54+ messages in thread
From: Jason Warr @ 2013-01-18 15:56 UTC (permalink / raw)
  To: Amit Kale
  Cc: thornber, device-mapper development, kent.overstreet,
	Mike Snitzer, LKML, linux-bcache


On 01/18/2013 03:08 AM, Amit Kale wrote:
>> > Can you explain what you mean by that in a little more detail?
> Let's say latency of a block device is 10ms for 4kB requests. With single threaded IO, the throughput will be 4kB/10ms = 400kB/s. If the device is capable of more throughput, a multithreaded IO will generate more throughput. So with 2 threads the throughput will be roughly 800kB/s. We can keep increasing the number of threads resulting in an approximately linear throughput. It'll saturate at the maximum capacity the device has. So it could saturate at perhaps at 8MB/s. Increasing the number of threads beyond this will not increase throughput.
>
> This is a simplistic computation. Throughput, latency and number of threads are related in a more complex relationship. Latency is still important, but throughput is more important.
>
> The way all this matters for SSD caching is, caching will typically show a higher latency compared to the base SSD, even for a 100% hit ratio. It may be possible to reach the maximum throughput achievable with the base SSD using a high number of threads. Let's say an SSD shows 450MB/s with 4 threads. A cache may show 440MB/s with 8 threads.
>
> A practical difficulty in measuring latency is that the latency seen by an application is a sum of the device latency plus the time spent in request queue (and caching layer, when present). Increasing number of threads shows latency increase, although it's only because the requests stay in request queue for a longer duration. Latency measurement in a multithreaded environment is very challenging. Measurement of throughput is fairly straightforward.
>
>> > 
>> > As an enterprise level user I see both as important overall.  However,
>> > the biggest driving factor in wanting a cache device in front of any
>> > sort of target in my use cases is to hide latency as the number of
>> > threads reading and writing to the backing device go up.  So for me the
>> > cache is basically a tier stage where your ability to keep dirty blocks
>> > on it is determined by the specific use case.
> SSD caching will help in this case since SSD's latency remains almost constant regardless of location of data. HDD latency for sequential and random IO could vary by a factor of 5 or even much more.
>
> Throughput with caching could even be 100 times the HDD throughput when using multiple threaded non-sequential IO.
> -Amit

Thank you for the explanation.  In context your reasoning makes more
sense to me.

If I am understanding you correctly, when you refer to throughput you're
speaking more in terms of IOPS than what most people would think of as
just bit rate.

I would expect a small increase in minimum and average latency when
adding another layer that the blocks have to traverse.  If my minimum
and average increase by 20% on most of my workloads, that is very
acceptable as long as there is a decrease in the 95th and 99th
percentile maximums.  I would hope that the absolute maximum would
decrease as well, but that is going to be much harder to achieve.
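
Since the discussion keeps returning to tail latency, here is a small
hypothetical sketch of how a benchmark harness might report the
percentiles mentioned above; it is not taken from any of the three caching
projects.

/* Compute p95/p99 latencies from a set of samples, the kind of summary a
 * benchmark harness would report alongside min/avg/max. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
        double x = *(const double *)a, y = *(const double *)b;

        return (x > y) - (x < y);
}

/* Nearest-rank style percentile over an already sorted array. */
static double percentile(const double *sorted, size_t n, double pct)
{
        size_t idx = (size_t)(pct / 100.0 * (n - 1));

        return sorted[idx];
}

int main(void)
{
        double samples[] = { 1.2, 0.9, 3.4, 0.8, 18.0, 1.1, 2.2, 7.5, 1.0, 1.3 };
        size_t n = sizeof(samples) / sizeof(samples[0]);

        qsort(samples, n, sizeof(double), cmp_double);
        printf("p95 = %.1f ms, p99 = %.1f ms, max = %.1f ms\n",
               percentile(samples, n, 95), percentile(samples, n, 99),
               samples[n - 1]);
        return 0;
}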

If I can help test and benchmark all three of these solutions, please
ask.  I have a lot of hardware resources available to me, and perhaps I
can add value from an outsider's perspective.

Jason

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-18 16:11                     ` thornber-H+wXaHxf7aLQT0dZR+AlfA
  0 siblings, 0 replies; 54+ messages in thread
From: thornber @ 2013-01-18 16:11 UTC (permalink / raw)
  To: Jason Warr
  Cc: Amit Kale, device-mapper development, kent.overstreet,
	Mike Snitzer, LKML, linux-bcache

On Fri, Jan 18, 2013 at 09:56:19AM -0600, Jason Warr wrote:
> If I can help test and benchmark all three of these solutions please
> ask.  I have allot of hardware resources available to me and perhaps I
> can add value from an outsiders perspective.

We'd love your help.  Perhaps you could devise a test that represents
how you'd use it?

- Joe

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-18 16:12                     ` Amit Kale
  0 siblings, 0 replies; 54+ messages in thread
From: Amit Kale @ 2013-01-18 16:12 UTC (permalink / raw)
  To: Jason Warr
  Cc: thornber, device-mapper development, kent.overstreet,
	Mike Snitzer, LKML, linux-bcache


> -----Original Message-----
> From: Jason Warr [mailto:jason@warr.net]
> Sent: Friday, January 18, 2013 9:26 PM
> To: Amit Kale
> Cc: thornber@redhat.com; device-mapper development;
> kent.overstreet@gmail.com; Mike Snitzer; LKML; linux-
> bcache@vger.kernel.org
> Subject: Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching
> software for Linux kernel
> 
> 
> On 01/18/2013 03:08 AM, Amit Kale wrote:
> >> > Can you explain what you mean by that in a little more detail?
> > Let's say latency of a block device is 10ms for 4kB requests. With
> single threaded IO, the throughput will be 4kB/10ms = 400kB/s. If the
> device is capable of more throughput, a multithreaded IO will generate
> more throughput. So with 2 threads the throughput will be roughly
> 800kB/s. We can keep increasing the number of threads resulting in an
> approximately linear throughput. It'll saturate at the maximum capacity
> the device has. So it could saturate at perhaps at 8MB/s. Increasing
> the number of threads beyond this will not increase throughput.
> >
> > This is a simplistic computation. Throughput, latency and number of
> threads are related in a more complex relationship. Latency is still
> important, but throughput is more important.
> >
> > The way all this matters for SSD caching is, caching will typically
> show a higher latency compared to the base SSD, even for a 100% hit
> ratio. It may be possible to reach the maximum throughput achievable
> with the base SSD using a high number of threads. Let's say an SSD
> shows 450MB/s with 4 threads. A cache may show 440MB/s with 8 threads.
> >
> > A practical difficulty in measuring latency is that the latency seen
> by an application is a sum of the device latency plus the time spent in
> request queue (and caching layer, when present). Increasing number of
> threads shows latency increase, although it's only because the requests
> stay in request queue for a longer duration. Latency measurement in a
> multithreaded environment is very challenging. Measurement of
> throughput is fairly straightforward.
> >
> >> >
> >> > As an enterprise level user I see both as important overall.
> >> > However, the biggest driving factor in wanting a cache device in
> >> > front of any sort of target in my use cases is to hide latency as
> >> > the number of threads reading and writing to the backing device go
> >> > up.  So for me the cache is basically a tier stage where your
> >> > ability to keep dirty blocks on it is determined by the specific
> use case.
> > SSD caching will help in this case since SSD's latency remains almost
> constant regardless of location of data. HDD latency for sequential and
> random IO could vary by a factor of 5 or even much more.
> >
> > Throughput with caching could even be 100 times the HDD throughput
> when using multiple threaded non-sequential IO.
> > -Amit
> 
> Thank you for the explanation.  In context your reasoning makes more
> sense to me.
> 
> If I am understanding you correctly when you refer to throughput your
> speaking more in terms of IOPS than what most people would think of as
> referencing only bit rate.
> 
> I would expect a small increase in minimum and average latency when
> adding in another layer that the blocks have to traverse.  If my
> minimum and average increase by 20% on most of my workloads, that is
> very acceptable as long as there is a decrease in 95th and 99th
> percentile maximums.  I would hope that absolute maximum would decrease
> as well but that is going to be much harder to achieve.
> 
> If I can help test and benchmark all three of these solutions please
> ask.  I have allot of hardware resources available to me and perhaps I
> can add value from an outsiders perspective.

That'll be great. I have so far marked EIO's status as alpha. It will require a little more functionality testing before performance testing, perhaps in a week or so.

-Amit


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-18 16:11                     ` thornber-H+wXaHxf7aLQT0dZR+AlfA
@ 2013-01-18 16:45                       ` Jason Warr
  -1 siblings, 0 replies; 54+ messages in thread
From: Jason Warr @ 2013-01-18 16:45 UTC (permalink / raw)
  To: Amit Kale, device-mapper development, kent.overstreet,
	Mike Snitzer, LKML, linux-bcache


On 01/18/2013 10:11 AM, thornber@redhat.com wrote:
> On Fri, Jan 18, 2013 at 09:56:19AM -0600, Jason Warr wrote:
>> If I can help test and benchmark all three of these solutions please
>> ask.  I have allot of hardware resources available to me and perhaps I
>> can add value from an outsiders perspective.
> We'd love your help.  Perhaps you could devise a test that represents
> how you'd use it?
>
> - Joe

As much as I dislike Oracle, that is one of my primary applications.  I
am attempting to get one of my customers to set up an Oracle instance
that is modular, in that I can move the storage around to fit a
particular hardware setup and have a consistent benchmark that they use
in the real world to gauge performance.  One of them is a debit card
transaction clearing entity running multi-TB databases, so latency
REALLY matters there.  Hopefully I'll have a couple of them set up
within a week.  At that point I may need help in getting the proper
kernel trees and patch sets munged into a working kernel.  That seems to
be the spot where I fall over most of the time.

Unfortunately I probably cannot share this specific setup, but it is
likely that I can derive a version from it that can be opened up.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-18 17:42                         ` thornber-H+wXaHxf7aLQT0dZR+AlfA
  0 siblings, 0 replies; 54+ messages in thread
From: thornber @ 2013-01-18 17:42 UTC (permalink / raw)
  To: device-mapper development
  Cc: Amit Kale, kent.overstreet, Mike Snitzer, LKML, linux-bcache

On Fri, Jan 18, 2013 at 10:45:03AM -0600, Jason Warr wrote:
> As much as I dislike Oracle that is one of my primary applications.  I
> am attempting to get one of my customers to setup an Oracle instance
> that is modular in that I can move the storage around to fit a
> particular hardware setup and have a consistent benchmark that they use
> in the real world to gauge performance.  One of them is a debit card
> transaction clearing entity on multi-TB databases so latency REALLY
> matters there.  Hopefully I'll have a couple of them setup within a
> week.  At that point I may need help in getting the proper kernel trees
> and patch sets munged into a working kernel.  That seems to be the spot
> where I fall over most of the time.
> 
> Unfortunately I probably could not share this specific setup but it is
> likely that I can derive a version from it that can be opened.

That would be perfect.  Please ask for any help you need.

- Joe

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-18 17:44                         ` Amit Kale
  0 siblings, 0 replies; 54+ messages in thread
From: Amit Kale @ 2013-01-18 17:44 UTC (permalink / raw)
  To: Jason Warr, device-mapper development, kent.overstreet,
	Mike Snitzer, LKML, linux-bcache

> -----Original Message-----
> From: Jason Warr [mailto:jason@warr.net]
> Sent: Friday, January 18, 2013 10:15 PM
> To: Amit Kale; device-mapper development; kent.overstreet@gmail.com;
> Mike Snitzer; LKML; linux-bcache@vger.kernel.org
> Subject: Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching
> software for Linux kernel
> 
> 
> On 01/18/2013 10:11 AM, thornber@redhat.com wrote:
> > On Fri, Jan 18, 2013 at 09:56:19AM -0600, Jason Warr wrote:
> >> If I can help test and benchmark all three of these solutions please
> >> ask.  I have allot of hardware resources available to me and perhaps
> >> I can add value from an outsiders perspective.
> > We'd love your help.  Perhaps you could devise a test that represents
> > how you'd use it?
> >
> > - Joe
> > --
> > To unsubscribe from this list: send the line "unsubscribe
> > linux-bcache" in the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> As much as I dislike Oracle that is one of my primary applications.  I
> am attempting to get one of my customers to setup an Oracle instance
> that is modular in that I can move the storage around to fit a
> particular hardware setup and have a consistent benchmark that they use
> in the real world to gauge performance.  One of them is a debit card
> transaction clearing entity on multi-TB databases so latency REALLY
> matters there.  

I am curious as to how SSD latency matters so much to the overall transaction times.

We do a lot of performance measurement using SQL database benchmarks. Transaction times vary a lot depending on the location of the data, the complexity of the transaction, etc. Typically TPM (transactions per minute) is the primary figure of interest for TPC-C.

> Hopefully I'll have a couple of them setup within a
> week.  At that point I may need help in getting the proper kernel trees
> and patch sets munged into a working kernel.  That seems to be the spot
> where I fall over most of the time.
> 
> Unfortunately I probably could not share this specific setup but it is
> likely that I can derive a version from it that can be opened.

That'll be good. I'll check with our testing team whether they can run TPC-C comparisons for these three caching solutions.

-Amit


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-18 18:36                           ` Jason Warr
  0 siblings, 0 replies; 54+ messages in thread
From: Jason Warr @ 2013-01-18 18:36 UTC (permalink / raw)
  To: Amit Kale
  Cc: device-mapper development, kent.overstreet, Mike Snitzer, LKML,
	linux-bcache


On 01/18/2013 11:44 AM, Amit Kale wrote:
>> As much as I dislike Oracle that is one of my primary applications.  I
>> > am attempting to get one of my customers to setup an Oracle instance
>> > that is modular in that I can move the storage around to fit a
>> > particular hardware setup and have a consistent benchmark that they use
>> > in the real world to gauge performance.  One of them is a debit card
>> > transaction clearing entity on multi-TB databases so latency REALLY
>> > matters there.  
> I am curious as to how SSD latency matters so much in the overall transaction times.
>
> We do a lot of performance measurements using SQL database benchmarks. Transaction times vary a lot depending on location of data, complexity of the transaction etc. Typically TPM (transactions per minute) is of primary interest for TPC-C.
>

It's not specifically SSD latency.  It's I/O transaction latency that
matters.  This particular application is very sensitive to that because
it is literally someone standing at a POS terminal swiping a
debit/credit card.  You only have a couple of seconds after the PIN is
entered for the transaction to go through your network and application
server, authorize against a DB, and get back to the POS.

The entire I/O stack on the DB is only a small time-slice of that round
trip.  Your 99th percentile needs to be under 20ms on the DB storage
side.  If your worst-case DB I/O goes beyond 300ms it is considered an
outage because the POS transaction fails.  So it obviously takes a lot
of planning and optimization work on the DB itself to get a good
tablespace layout to even get into the realm where you can have latency
that predictable with multi-million dollar FC storage frames.
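
Just to sketch what I mean by those two budgets (illustrative only --
latencies_ms is assumed to be a list of per-I/O completion latencies in
milliseconds, pulled from something like an fio latency log, and the
function name and sample values are made up):

def check_latency_budget(latencies_ms, p99_limit_ms=20.0, outage_ms=300.0):
    # p99 has to stay under the budget; a single I/O past the outage
    # threshold counts as a failed POS transaction.
    xs = sorted(latencies_ms)
    p99 = xs[min(len(xs) - 1, int(0.99 * len(xs)))]
    worst = xs[-1]
    return {"p99_ms": p99, "worst_ms": worst,
            "p99_ok": p99 <= p99_limit_ms, "outage": worst > outage_ms}

# A mostly-fast trace with one 350ms outlier passes the p99 check
# but still counts as an outage:
print(check_latency_budget([5.0] * 990 + [15.0] * 9 + [350.0]))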

One of my goals is to be able to offer this level of I/O service on
commodity hardware.  Simplify the scope of hardware, reduce the number
of points of failure, make the systems more portable, reduce or
eliminate dependence on any specific vendor below the application, and
save money.  Not to mention reduce the number of fingers that can point
away from themselves saying it is someone else's problem to find fault.

A lot of the pieces are already out there.  A good block caching target
is one of the missing pieces to help fill the ever-growing canyon
between non-block-device system performance and storage.  What they have
done with L2ARC and SLOG in ZFS/Solaris is good, but it has some serious
shortcomings in other areas that DM/MD/LVM handle extremely well.

I appreciate all of the brilliant work all of you guys do, and hopefully
I can contribute a little bit of usefulness to this effort.

Thank you,

Jason

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [LSF/MM TOPIC] Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-18 18:36                           ` Jason Warr
  (?)
@ 2013-01-18 21:25                           ` Darrick J. Wong
  2013-01-18 21:37                               ` Mike Snitzer
  -1 siblings, 1 reply; 54+ messages in thread
From: Darrick J. Wong @ 2013-01-18 21:25 UTC (permalink / raw)
  To: device-mapper development
  Cc: Amit Kale, linux-bcache, kent.overstreet, Mike Snitzer, LKML, lsf-pc

Since Joe is putting together a testing tree to compare the three caching
things, what do you all think of having a(nother) session about ssd caching at
this year's LSFMM Summit?

[Apologies for hijacking the thread.]
[Adding lsf-pc to the cc list.]

--D

On Fri, Jan 18, 2013 at 12:36:42PM -0600, Jason Warr wrote:
> 
> On 01/18/2013 11:44 AM, Amit Kale wrote:
> >> As much as I dislike Oracle that is one of my primary applications.  I
> >> > am attempting to get one of my customers to setup an Oracle instance
> >> > that is modular in that I can move the storage around to fit a
> >> > particular hardware setup and have a consistent benchmark that they use
> >> > in the real world to gauge performance.  One of them is a debit card
> >> > transaction clearing entity on multi-TB databases so latency REALLY
> >> > matters there.  
> > I am curious as to how SSD latency matters so much in the overall transaction times.
> >
> > We do a lot of performance measurements using SQL database benchmarks. Transaction times vary a lot depending on location of data, complexity of the transaction etc. Typically TPM (transactions per minute) is of primary interest for TPC-C.
> >
> 
> It's not specifically SSD latency.  It's I/O transaction latency that
> matters.  This particular application is very sensitive to that because
> it is literally someone standing at a POS terminal swiping a
> debit/credit card.  You only have a couple of seconds after the PIN is
> entered for the transaction to go through your network and application
> server, authorize against a DB, and get back to the POS.
> 
> The entire I/O stack on the DB is only a small time-slice of that round
> trip.  Your 99th percentile needs to be under 20ms on the DB storage
> side.  If your worst-case DB I/O goes beyond 300ms it is considered an
> outage because the POS transaction fails.  So it obviously takes a lot
> of planning and optimization work on the DB itself to get a good
> tablespace layout to even get into the realm where you can have latency
> that predictable with multi-million dollar FC storage frames.
> 
> One of my goals is to be able to offer this level of I/O service on
> commodity hardware.  Simplify the scope of hardware, reduce the number
> of points of failure, make the systems more portable, reduce or
> eliminate dependence on any specific vendor below the application, and
> save money.  Not to mention reduce the number of fingers that can point
> away from themselves saying it is someone else's problem to find fault.
> 
> A lot of the pieces are already out there.  A good block caching target
> is one of the missing pieces to help fill the ever-growing canyon
> between non-block-device system performance and storage.  What they have
> done with L2ARC and SLOG in ZFS/Solaris is good, but it has some serious
> shortcomings in other areas that DM/MD/LVM handle extremely well.
> 
> I appreciate all of the brilliant work all of you guys do, and hopefully
> I can contribute a little bit of usefulness to this effort.
> 
> Thank you,
> 
> Jason
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM TOPIC] Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-18 21:37                               ` Mike Snitzer
  0 siblings, 0 replies; 54+ messages in thread
From: Mike Snitzer @ 2013-01-18 21:37 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: device-mapper development, Amit Kale, linux-bcache,
	kent.overstreet, LKML, lsf-pc, Joe Thornber

On Fri, Jan 18 2013 at  4:25pm -0500,
Darrick J. Wong <darrick.wong@oracle.com> wrote:

> Since Joe is putting together a testing tree to compare the three caching
> things, what do you all think of having a(nother) session about ssd caching at
> this year's LSFMM Summit?
> 
> [Apologies for hijacking the thread.]
> [Adding lsf-pc to the cc list.]

Hopefully we'll have some findings on the comparisons well before LSF
(since we currently have some momentum).  But yes it may be worthwhile
to discuss things further and/or report findings.

Mike

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [LSF/MM TOPIC] Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-21  5:26                                 ` Amit Kale
  0 siblings, 0 replies; 54+ messages in thread
From: Amit Kale @ 2013-01-21  5:26 UTC (permalink / raw)
  To: Mike Snitzer, Darrick J. Wong
  Cc: device-mapper development, linux-bcache, kent.overstreet, LKML,
	lsf-pc, Joe Thornber

> -----Original Message-----
> From: Mike Snitzer [mailto:snitzer@redhat.com]
> Sent: Saturday, January 19, 2013 3:08 AM
> To: Darrick J. Wong
> Cc: device-mapper development; Amit Kale; linux-bcache@vger.kernel.org;
> kent.overstreet@gmail.com; LKML; lsf-pc@lists.linux-foundation.org; Joe
> Thornber
> Subject: Re: [LSF/MM TOPIC] Re: [dm-devel] Announcement: STEC EnhanceIO
> SSD caching software for Linux kernel
> 
> On Fri, Jan 18 2013 at  4:25pm -0500,
> Darrick J. Wong <darrick.wong@oracle.com> wrote:
> 
> > Since Joe is putting together a testing tree to compare the three
> > caching things, what do you all think of having a(nother) session
> > about ssd caching at this year's LSFMM Summit?
> >
> > [Apologies for hijacking the thread.]
> > [Adding lsf-pc to the cc list.]
> 
> Hopefully we'll have some findings on the comparisons well before LSF
> (since we currently have some momentum).  But yes it may be worthwhile
> to discuss things further and/or report findings.

We should have performance comparisons presented well before the summit. It'll be good to have an SSD caching session in any case. The likelihood that one of them will be included in the Linux kernel before April is very low.

-Amit

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM TOPIC] Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-21  5:26                                 ` Amit Kale
@ 2013-01-21 13:09                                   ` Mike Snitzer
  -1 siblings, 0 replies; 54+ messages in thread
From: Mike Snitzer @ 2013-01-21 13:09 UTC (permalink / raw)
  To: Amit Kale
  Cc: Darrick J. Wong, device-mapper development, linux-bcache,
	kent.overstreet, LKML, lsf-pc, Joe Thornber

On Mon, Jan 21 2013 at 12:26am -0500,
Amit Kale <akale@stec-inc.com> wrote:

> > -----Original Message-----
> > From: Mike Snitzer [mailto:snitzer@redhat.com]
> > Sent: Saturday, January 19, 2013 3:08 AM
> > To: Darrick J. Wong
> > Cc: device-mapper development; Amit Kale; linux-bcache@vger.kernel.org;
> > kent.overstreet@gmail.com; LKML; lsf-pc@lists.linux-foundation.org; Joe
> > Thornber
> > Subject: Re: [LSF/MM TOPIC] Re: [dm-devel] Announcement: STEC EnhanceIO
> > SSD caching software for Linux kernel
> > 
> > On Fri, Jan 18 2013 at  4:25pm -0500,
> > Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > 
> > > Since Joe is putting together a testing tree to compare the three
> > > caching things, what do you all think of having a(nother) session
> > > about ssd caching at this year's LSFMM Summit?
> > >
> > > [Apologies for hijacking the thread.]
> > > [Adding lsf-pc to the cc list.]
> > 
> > Hopefully we'll have some findings on the comparisons well before LSF
> > (since we currently have some momentum).  But yes it may be worthwhile
> > to discuss things further and/or report findings.
> 
> We should have performance comparisons presented well before the
> summit. It'll be good to have an SSD caching session in any case. The
> likelihood that one of them will be included in the Linux kernel
> before April is very low.

dm-cache is under active review for upstream inclusion.  I wouldn't
categorize the chances of dm-cache going upstream when the v3.9 merge
window opens as "very low".  But even if dm-cache does go upstream it
doesn't preclude bcache and/or enhanceio from going upstream too.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM TOPIC] Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-21 13:58                                     ` thornber-H+wXaHxf7aLQT0dZR+AlfA
  0 siblings, 0 replies; 54+ messages in thread
From: thornber @ 2013-01-21 13:58 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Amit Kale, Darrick J. Wong, device-mapper development,
	linux-bcache, kent.overstreet, LKML, lsf-pc, Joe Thornber

On Mon, Jan 21, 2013 at 08:09:51AM -0500, Mike Snitzer wrote:
> dm-cache is under active review for upstream inclusion.  I wouldn't
> categorize the chances of dm-cache going upstream when the v3.9 merge
> window opens as "very low".  But even if dm-cache does go upstream it
> doesn't preclude bcache and/or enhanceio from going upstream too.

As I understand it, bcache is being reviewed too.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [LSF/MM TOPIC] Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-22  5:00                                     ` Amit Kale
  0 siblings, 0 replies; 54+ messages in thread
From: Amit Kale @ 2013-01-22  5:00 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Darrick J. Wong, device-mapper development, linux-bcache,
	kent.overstreet, LKML, lsf-pc, Joe Thornber

> -----Original Message-----
> From: Mike Snitzer [mailto:snitzer@redhat.com]
> Sent: Monday, January 21, 2013 6:40 PM
> To: Amit Kale
> Cc: Darrick J. Wong; device-mapper development; linux-
> bcache@vger.kernel.org; kent.overstreet@gmail.com; LKML; lsf-
> pc@lists.linux-foundation.org; Joe Thornber
> Subject: Re: [LSF/MM TOPIC] Re: [dm-devel] Announcement: STEC EnhanceIO
> SSD caching software for Linux kernel
> 
> On Mon, Jan 21 2013 at 12:26am -0500,
> Amit Kale <akale@stec-inc.com> wrote:
> 
> > > -----Original Message-----
> > > From: Mike Snitzer [mailto:snitzer@redhat.com]
> > > Sent: Saturday, January 19, 2013 3:08 AM
> > > To: Darrick J. Wong
> > > Cc: device-mapper development; Amit Kale;
> > > linux-bcache@vger.kernel.org; kent.overstreet@gmail.com; LKML;
> > > lsf-pc@lists.linux-foundation.org; Joe Thornber
> > > Subject: Re: [LSF/MM TOPIC] Re: [dm-devel] Announcement: STEC
> > > EnhanceIO SSD caching software for Linux kernel
> > >
> > > On Fri, Jan 18 2013 at  4:25pm -0500, Darrick J. Wong
> > > <darrick.wong@oracle.com> wrote:
> > >
> > > > Since Joe is putting together a testing tree to compare the three
> > > > caching things, what do you all think of having a(nother) session
> > > > about ssd caching at this year's LSFMM Summit?
> > > >
> > > > [Apologies for hijacking the thread.] [Adding lsf-pc to the cc
> > > > list.]
> > >
> > > Hopefully we'll have some findings on the comparisons well before
> > > LSF (since we currently have some momentum).  But yes it may be
> > > worthwhile to discuss things further and/or report findings.
> >
> > We should have performance comparisons presented well before the
> > summit. It'll be good to have an SSD caching session in any case. The
> > likelihood that one of them will be included in the Linux kernel
> > before April is very low.
> 
> dm-cache is under active review for upstream inclusion.  I wouldn't
> categorize the chances of dm-cache going upstream when the v3.9 merge
> window opens as "very low".  But even if dm-cache does go upstream it
> doesn't preclude bcache and/or enhanceio from going upstream too.

I agree. We haven't seen a full comparison yet, IMHO. If different solutions offer mutually exclusive benefits, it'll be worthwhile to include them all.

We haven't submitted EnhanceIO for inclusion yet. We need more testing from the community before we can mark it Beta.
-Amit

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-24 23:45             ` Kent Overstreet
  0 siblings, 0 replies; 54+ messages in thread
From: Kent Overstreet @ 2013-01-24 23:45 UTC (permalink / raw)
  To: Amit Kale
  Cc: thornber, device-mapper development, kent.overstreet,
	Mike Snitzer, LKML, linux-bcache

On Thu, Jan 17, 2013 at 03:39:40AM -0800, Kent Overstreet wrote:
> Suppose I could fill out the bcache version...
> 
> On Thu, Jan 17, 2013 at 05:52:00PM +0800, Amit Kale wrote:
> > 11. Error conditions - Handling power failures, intermittent and permanent device failures.
> 
> Power failures and device failures yes, intermittent failures are not
> explicitly handled.

A coworker pointed out that bcache actually does handle some intermittent IO
errors. I just added a section on error handling to the documentation:
http://atlas.evilpiepirate.org/git/linux-bcache.git/tree/Documentation/bcache.txt?h=bcache-dev

To cut and paste:

Bcache tries to transparently handle IO errors to/from the cache device without
affecting normal operation; if it sees too many errors (the threshold is
configurable, and defaults to 0) it shuts down the cache device and switches all
the backing devices to passthrough mode.

 - For reads from the cache, if they error we just retry the read from the
   backing device.

 - For writethrough writes, if the write to the cache errors we just switch to
   invalidating the data at that lba in the cache (i.e. the same thing we do for
   a write that bypasses the cache)
 
 - For writeback writes, we currently pass that error back up to the
   filesystem/userspace. This could be improved - we could retry it as a write
   that skips the cache so we don't have to error the write.

 - When we detach, we first try to flush any dirty data (if we were running in
   writeback mode). It currently doesn't do anything intelligent if it fails to
   read some of the dirty data, though.
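
Roughly, that policy boils down to something like the following sketch
(Python, purely for illustration -- this is not bcache code, and the
class and names here are made up):

class CacheErrorPolicySketch:
    """Illustrative model of the error handling described above."""

    def __init__(self, error_limit=0):        # threshold defaults to 0
        self.error_limit = error_limit
        self.errors = 0
        self.passthrough = False               # cache shut down?

    def note_cache_error(self):
        self.errors += 1
        if self.errors > self.error_limit:
            self.passthrough = True            # backing devices go passthrough

    def read(self, cache_hit, cache_io_ok):
        if self.passthrough or not cache_hit:
            return "read from backing device"
        if cache_io_ok:
            return "read from cache"
        self.note_cache_error()
        return "retry the read from the backing device"

    def write(self, mode, cache_io_ok):
        if self.passthrough:
            return "write to backing device only"
        if cache_io_ok:
            return "write cached"
        self.note_cache_error()
        if mode == "writethrough":
            return "invalidate that block in the cache (data is on the backing device)"
        return "error returned to filesystem/userspace"   # writeback case

# With the default limit of 0, a single cache error also flips us into
# passthrough mode:
policy = CacheErrorPolicySketch()
print(policy.write("writethrough", cache_io_ok=False))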

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
@ 2013-01-24 23:55                   ` Kent Overstreet
  0 siblings, 0 replies; 54+ messages in thread
From: Kent Overstreet @ 2013-01-24 23:55 UTC (permalink / raw)
  To: Amit Kale
  Cc: Jason Warr, thornber, device-mapper development, kent.overstreet,
	Mike Snitzer, LKML, linux-bcache

On Fri, Jan 18, 2013 at 05:08:37PM +0800, Amit Kale wrote:
> > From: Jason Warr [mailto:jason@warr.net]
> > On 01/17/2013 11:53 AM, Amit Kale wrote:
> > >>> 9. Performance - Throughput is generally most important. Latency is
> > >> >   also one more performance comparison point. Performance under
> > >> >   different load classes can be measured.
> > >> >
> > >> >   I think latency is more important than throughput.  Spindles are
> > >> >   pretty good at throughput.  In fact the mq policy tries to spot
> > when
> > >> >   we're doing large linear ios and stops hit counting; best leave
> > this
> > >> >   stuff on the spindle.
> > > I disagree. Latency is taken care of automatically when the number of
> > application threads rises.
> > >
> > 
> > Can you explain what you mean by that in a little more detail?
> 
> Let's say latency of a block device is 10ms for 4kB requests. With single threaded IO, the throughput will be 4kB/10ms = 400kB/s. If the device is capable of more throughput, a multithreaded IO will generate more throughput. So with 2 threads the throughput will be roughly 800kB/s. We can keep increasing the number of threads resulting in an approximately linear throughput. It'll saturate at the maximum capacity the device has. So it could saturate at perhaps at 8MB/s. Increasing the number of threads beyond this will not increase throughput.
> 
> This is a simplistic computation. Throughput, latency and number of threads are related in a more complex relationship. Latency is still important, but throughput is more important.
> 
> The way all this matters for SSD caching is, caching will typically show a higher latency compared to the base SSD, even for a 100% hit ratio. It may be possible to reach the maximum throughput achievable with the base SSD using a high number of threads. Let's say an SSD shows 450MB/s with 4 threads. A cache may show 440MB/s with 8 threads.

Going through the cache should only (measurably) increase latency for
writes, not reads (assuming they're cache hits, not misses). It sounds
like you're talking about the overhead for keeping the index up to date,
which is only a factor for writes, but I'm not quite sure since you talk
about hit rate.

I don't know of any reason why throughput or latency should be noticeably
worse than raw for reads from the cache.

But for writes, yeah - as the number of concurrent IOs goes up, you can
amortize the metadata writes more and more, so throughput compared to raw
goes up. I don't think latency would change much vs. raw; you're always
going to have an extra metadata write to wait on... though there are
tricks you can do so the metadata write and data write can go down in
parallel. Bcache doesn't do those yet.

_But_, you only have to pay the metadata write penalty when you see a
cache flush/FUA write. In the absence of cache flushes/FUA, for
metadata purposes you can basically treat a stream of sequential writes
as going down in parallel.
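
To put rough numbers on the arithmetic in the quoted text, here's a toy
model (illustrative only, using the hypothetical 10ms latency, 4kB
request size and ~8MB/s device ceiling from the quote):

def throughput_kb_s(threads, latency_ms=10.0, block_kb=4.0, ceiling_kb_s=8000.0):
    per_thread = block_kb / (latency_ms / 1000.0)   # 4kB / 10ms = 400kB/s
    return min(threads * per_thread, ceiling_kb_s)

for n in (1, 2, 8, 20, 64):
    print("%2d threads -> %5.0f kB/s" % (n, throughput_kb_s(n)))
# scales linearly with threads until it saturates at the ceiling (~20 threads)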

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-11 17:18 Announcement: STEC EnhanceIO SSD caching software for Linux kernel Amit Kale
  2013-01-11 22:36 ` Marcin Slusarz
  2013-01-14 21:46 ` Mike Snitzer
@ 2013-01-30 12:36 ` Pavel Machek
  2013-01-30 19:56   ` Amit Kale
  2 siblings, 1 reply; 54+ messages in thread
From: Pavel Machek @ 2013-01-30 12:36 UTC (permalink / raw)
  To: Amit Kale; +Cc: LKML

Hi!

> EnhanceIO driver is based on EnhanceIO SSD caching software product developed by STEC Inc. EnhanceIO was derived from Facebook's open source Flashcache project. EnhanceIO uses SSDs as cache devices for traditional rotating hard disk drives (referred to as source volumes throughout this document).
> 
> EnhanceIO can work with any block device, be it an entire physical disk, an individual disk partition,  a RAIDed DAS device, a SAN volume, a device mapper volume or a software RAID (md) device.
> 
> The source volume to SSD mapping is a set-associative mapping based on the source volume sector number with a default set size (aka associativity) of 512 blocks and a default block size of 4 KB.  Partial cache blocks are not used.
> The default value of 4 KB is chosen because it is the common I/O block size of most storage systems.  With these default values, each cache set is 2 MB (512 *
> 4 KB).  Therefore, a 400 GB SSD will have a little less than 200,000 cache sets because a little space is used for storing the meta data on the SSD.
> 
> EnhanceIO supports three caching modes: read-only, write-through, and write-back and three cache replacement policies: random, FIFO, and LRU.
> 
> Read-only caching mode causes EnhanceIO to direct write IO requests
> only to HDD. Read IO requests are issued to HDD and the data read
> from HDD is stored on SSD. Subsequent Read requests for the same
> blocks are carried out from SSD, thus reducing their latency by a
> substantial amount. 

What are the requirements for the SSD? I have a 500GB 2.5" HDD in the
notebook... and it is starting to be slightly slow for git. Would a cheap
8GB USB stick be a useful thing to cache with? (USB sticks have reasonably
fast "seek", but reads are in the 20MB/sec range and writes are very
slow.)

								Pavel  

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-30 12:36 ` Pavel Machek
@ 2013-01-30 19:56   ` Amit Kale
  0 siblings, 0 replies; 54+ messages in thread
From: Amit Kale @ 2013-01-30 19:56 UTC (permalink / raw)
  To: Pavel Machek; +Cc: LKML

> -----Original Message-----
> From: Pavel Machek [mailto:pavel@ucw.cz]
> Sent: Wednesday, January 30, 2013 4:37 AM
> To: Amit Kale
> Cc: LKML
> Subject: Re: Announcement: STEC EnhanceIO SSD caching software for
> Linux kernel
> 
> Hi!
> 
> > EnhanceIO driver is based on EnhanceIO SSD caching software product
> developed by STEC Inc. EnhanceIO was derived from Facebook's open
> source Flashcache project. EnhanceIO uses SSDs as cache devices for
> traditional rotating hard disk drives (referred to as source volumes
> throughout this document).
> >
> > EnhanceIO can work with any block device, be it an entire physical
> disk, an individual disk partition,  a RAIDed DAS device, a SAN volume,
> a device mapper volume or a software RAID (md) device.
> >
> > The source volume to SSD mapping is a set-associative mapping based
> on the source volume sector number with a default set size (aka
> associativity) of 512 blocks and a default block size of 4 KB.  Partial
> cache blocks are not used.
> > The default value of 4 KB is chosen because it is the common I/O
> block
> > size of most storage systems.  With these default values, each cache
> > set is 2 MB (512 *
> > 4 KB).  Therefore, a 400 GB SSD will have a little less than 200,000
> cache sets because a little space is used for storing the meta data on
> the SSD.
> >
> > EnhanceIO supports three caching modes: read-only, write-through, and
> write-back and three cache replacement policies: random, FIFO, and LRU.
> >
> > Read-only caching mode causes EnhanceIO to direct write IO requests
> > only to HDD. Read IO requests are issued to HDD and the data read
> from
> > HDD is stored on SSD. Subsequent Read requests for the same blocks
> are
> > carried out from SSD, thus reducing their latency by a substantial
> > amount.
> 
> What are the requirements for the SSD? I have a 500GB 2.5" HDD in the
> notebook... and it is starting to be slightly slow for git. Would a cheap
> 8GB USB stick be a useful thing to cache with? (USB sticks have reasonably
> fast "seek", but reads are in the 20MB/sec range and writes are very
> slow.)

Hi Pavel,

Our testing primarily covered 100GB+ SSDs because of our focus on the enterprise market. 8GB will work, but 500:8 is roughly a 60:1 HDD:SSD ratio; our recommendation is to keep the SSD:HDD ratio between 1:5 and 1:10. I'll be most interested in hearing your findings.
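
As a quick back-of-the-envelope check (illustrative arithmetic only):

hdd_gb = 500
for ratio in (5, 10):
    print("1:%d -> %d GB cache" % (ratio, hdd_gb / ratio))
print("8 GB cache -> about 1:%d" % (hdd_gb / 8))
# 1:5 -> 100 GB cache, 1:10 -> 50 GB cache, 8 GB -> about 1:62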

Thanks.
-Amit

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM TOPIC] Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
  2013-01-18 21:37                               ` Mike Snitzer
  (?)
  (?)
@ 2013-02-04 20:33                               ` Kent Overstreet
  -1 siblings, 0 replies; 54+ messages in thread
From: Kent Overstreet @ 2013-02-04 20:33 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Darrick J. Wong, device-mapper development, Amit Kale,
	linux-bcache, kent.overstreet, LKML, lsf-pc, Joe Thornber

On Fri, Jan 18, 2013 at 04:37:59PM -0500, Mike Snitzer wrote:
> On Fri, Jan 18 2013 at  4:25pm -0500,
> Darrick J. Wong <darrick.wong@oracle.com> wrote:
> 
> > Since Joe is putting together a testing tree to compare the three caching
> > things, what do you all think of having a(nother) session about ssd caching at
> > this year's LSFMM Summit?
> > 
> > [Apologies for hijacking the thread.]
> > [Adding lsf-pc to the cc list.]
> 
> Hopefully we'll have some findings on the comparisons well before LSF
> (since we currently have some momentum).  But yes it may be worthwhile
> to discuss things further and/or report findings.

I'd be willing to go and talk a bit about bcache. Curious to hear more
about the dm caching stuff, too.

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2013-02-04 20:33 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-11 17:18 Announcement: STEC EnhanceIO SSD caching software for Linux kernel Amit Kale
2013-01-11 22:36 ` Marcin Slusarz
2013-01-14 21:46 ` Mike Snitzer
2013-01-15 13:19   ` Amit Kale
2013-01-16 10:45     ` [dm-devel] " thornber
2013-01-16 12:15       ` thornber
2013-01-16 16:58       ` thornber
2013-01-17  9:52       ` Amit Kale
2013-01-17 11:39         ` Kent Overstreet
2013-01-17 17:17           ` Amit Kale
2013-01-17 17:17             ` Amit Kale
2013-01-24 23:45           ` Kent Overstreet
2013-01-24 23:45             ` Kent Overstreet
2013-01-17 13:26         ` thornber
2013-01-17 13:26           ` thornber-H+wXaHxf7aLQT0dZR+AlfA
2013-01-17 17:53           ` Amit Kale
2013-01-17 18:36             ` Jason Warr
2013-01-17 18:36               ` Jason Warr
2013-01-18  9:08               ` Amit Kale
2013-01-18  9:08                 ` Amit Kale
2013-01-18 15:56                 ` Jason Warr
2013-01-18 16:11                   ` thornber
2013-01-18 16:11                     ` thornber-H+wXaHxf7aLQT0dZR+AlfA
2013-01-18 16:45                     ` Jason Warr
2013-01-18 16:45                       ` Jason Warr
2013-01-18 17:42                       ` thornber
2013-01-18 17:42                         ` thornber-H+wXaHxf7aLQT0dZR+AlfA
2013-01-18 17:44                       ` Amit Kale
2013-01-18 17:44                         ` Amit Kale
2013-01-18 18:36                         ` Jason Warr
2013-01-18 18:36                           ` Jason Warr
2013-01-18 21:25                           ` [LSF/MM TOPIC] " Darrick J. Wong
2013-01-18 21:37                             ` Mike Snitzer
2013-01-18 21:37                               ` Mike Snitzer
2013-01-21  5:26                               ` Amit Kale
2013-01-21  5:26                                 ` Amit Kale
2013-01-21 13:09                                 ` Mike Snitzer
2013-01-21 13:09                                   ` [LSF/MM TOPIC] " Mike Snitzer
2013-01-21 13:58                                   ` [LSF/MM TOPIC] Re: [dm-devel] " thornber
2013-01-21 13:58                                     ` thornber-H+wXaHxf7aLQT0dZR+AlfA
2013-01-22  5:00                                   ` Amit Kale
2013-01-22  5:00                                     ` Amit Kale
2013-02-04 20:33                               ` Kent Overstreet
2013-01-18 16:12                   ` Amit Kale
2013-01-18 16:12                     ` Amit Kale
2013-01-24 23:55                 ` Kent Overstreet
2013-01-24 23:55                   ` Kent Overstreet
2013-01-17 18:50             ` thornber
2013-01-17 18:50               ` thornber-H+wXaHxf7aLQT0dZR+AlfA
2013-01-18  7:03               ` Amit Kale
2013-01-18  7:03                 ` Amit Kale
2013-01-18 14:43       ` thornber
2013-01-30 12:36 ` Pavel Machek
2013-01-30 19:56   ` Amit Kale
