From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751037Ab3BOUZI (ORCPT ); Fri, 15 Feb 2013 15:25:08 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:17323 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750779Ab3BOUZE (ORCPT ); Fri, 15 Feb 2013 15:25:04 -0500 Date: Fri, 15 Feb 2013 12:23:21 -0800 From: "Darrick J. Wong" To: OS Engineering Cc: Greg Kroah-Hartman , LKML , Jens Axboe , Sanoj Unnikrishnan , =?utf-8?B?546L6YeR5rWm?= , Amit Kale Subject: Re: [PATCH] EnhanceIO ssd caching software Message-ID: <20130215202321.GD26292@blackbox.djwong.org> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Source-IP: acsinet22.oracle.com [141.146.126.238] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Feb 15, 2013 at 02:02:38PM +0800, OS Engineering wrote: > Hi Greg, Jens, > > We are submitting EnhanceIO(TM) software driver for an inclusion in linux > staging tree. Present state of this driver is beta. We have been posting it > for a few weeks, while it was maintained at github. It is still being > cleaned-up and is being tested by LKML members. Inclusion in linux staging > tree will make testing and reviewing easier and help a future integration in > Linux kernel. > > Could you please include it? Mmm large patches, I'll try to review it... in pieces. Thanks for the work! > Thanks. > -- > Amit Kale > > From 31f636ffd63ce46c4a65ce622bcb7b18ce05f7c3 Mon Sep 17 00:00:00 2001 > From: sanoj > Date: Tue, 12 Feb 2013 16:15:19 +0530 > Subject: [PATCH] Enhanceio Driver > > This driver is based on EnhanceIO(TM) SSD caching software product > developed by STEC Inc. EnhanceIO(TM) software was derived from Facebook's open source > Flashcache project. EnhanceIO(TM) software uses SSDs as cache devices for traditional > rotating hard disk drives. EnhanceIO(TM) software can work with any block device, be it > an entire physical disk, an individual disk partition, a RAIDed DAS device, > a SAN volume, a device mapper volume or a software RAID (md) device. > > Signed-off-by: > Amit Kale > Sanoj Unnikrishnan > Darrick J. Wong > Jinpu Wang Each of these email addresses needs to have the "S-o-b:" prefix attached. Also, you ought to run this patch through scripts/checkpatch.pl, as there are quite a lot of style errors. > --- > Documentation/enhanceio/94-Enhanceio.template | 32 + > Documentation/enhanceio/Persistence.txt | 10 + > Documentation/enhanceio/README.txt | 194 ++ > drivers/staging/Kconfig | 2 + > drivers/staging/Makefile | 1 + > drivers/staging/enhanceio/Kconfig | 20 + > drivers/staging/enhanceio/Makefile | 16 + > drivers/staging/enhanceio/eio.h | 1137 ++++++++ > drivers/staging/enhanceio/eio_conf.c | 2627 ++++++++++++++++++ > drivers/staging/enhanceio/eio_fifo.c | 240 ++ > drivers/staging/enhanceio/eio_ioctl.c | 157 ++ > drivers/staging/enhanceio/eio_ioctl.h | 89 + > drivers/staging/enhanceio/eio_lru.c | 323 +++ > drivers/staging/enhanceio/eio_main.c | 3599 +++++++++++++++++++++++++ > drivers/staging/enhanceio/eio_mem.c | 235 ++ > drivers/staging/enhanceio/eio_policy.c | 146 + > drivers/staging/enhanceio/eio_policy.h | 105 + > drivers/staging/enhanceio/eio_procfs.c | 1939 +++++++++++++ > drivers/staging/enhanceio/eio_setlru.c | 170 ++ > drivers/staging/enhanceio/eio_setlru.h | 49 + > drivers/staging/enhanceio/eio_subr.c | 451 ++++ > drivers/staging/enhanceio/eio_ttc.c | 1702 ++++++++++++ > drivers/staging/enhanceio/eio_ttc.h | 150 + > tools/enhanceio/eio_cli | 344 +++ > 24 files changed, 13738 insertions(+), 0 deletions(-) > create mode 100644 Documentation/enhanceio/94-Enhanceio.template > create mode 100644 Documentation/enhanceio/Persistence.txt > create mode 100644 Documentation/enhanceio/README.txt > create mode 100644 drivers/staging/enhanceio/Kconfig > create mode 100644 drivers/staging/enhanceio/Makefile > create mode 100644 drivers/staging/enhanceio/eio.h > create mode 100644 drivers/staging/enhanceio/eio_conf.c > create mode 100644 drivers/staging/enhanceio/eio_fifo.c > create mode 100644 drivers/staging/enhanceio/eio_ioctl.c > create mode 100644 drivers/staging/enhanceio/eio_ioctl.h > create mode 100644 drivers/staging/enhanceio/eio_lru.c > create mode 100644 drivers/staging/enhanceio/eio_main.c > create mode 100644 drivers/staging/enhanceio/eio_mem.c > create mode 100644 drivers/staging/enhanceio/eio_policy.c > create mode 100644 drivers/staging/enhanceio/eio_policy.h > create mode 100644 drivers/staging/enhanceio/eio_procfs.c > create mode 100644 drivers/staging/enhanceio/eio_setlru.c > create mode 100644 drivers/staging/enhanceio/eio_setlru.h > create mode 100644 drivers/staging/enhanceio/eio_subr.c > create mode 100644 drivers/staging/enhanceio/eio_ttc.c > create mode 100644 drivers/staging/enhanceio/eio_ttc.h > create mode 100644 tools/enhanceio/eio_cli > > diff --git a/Documentation/enhanceio/94-Enhanceio.template b/Documentation/enhanceio/94-Enhanceio.template > new file mode 100644 > index 0000000..ec4a685 > --- /dev/null > +++ b/Documentation/enhanceio/94-Enhanceio.template > @@ -0,0 +1,32 @@ > +ACTION!="add|change", GOTO="EIO_EOF" > +SUBSYSTEM!="block", GOTO="EIO_EOF" > + > +, GOTO="EIO_CACHE" > + > +, GOTO="EIO_SOURCE" > + > +# If none of the rules above matched then it isn't an EnhanceIO device so ignore it. > +GOTO="EIO_EOF" > + > +# If we just found the cache device and the source already exists then we can setup > +LABEL="EIO_CACHE" > + TEST!="/dev/enhanceio/", PROGRAM="/bin/mkdir -p /dev/enhanceio/" > + PROGRAM="/bin/sh -c 'echo $kernel > /dev/enhanceio//.ssd_name'" > + > + TEST=="/dev/enhanceio//.disk_name", GOTO="EIO_SETUP" > +GOTO="EIO_EOF" > + > +# If we just found the source device and the cache already exists then we can setup > +LABEL="EIO_SOURCE" > + TEST!="/dev/enhanceio/", PROGRAM="/bin/mkdir -p /dev/enhanceio/" > + PROGRAM="/bin/sh -c 'echo $kernel > /dev/enhanceio//.disk_name'" > + > + TEST=="/dev/enhanceio//.ssd_name", GOTO="EIO_SETUP" If the cache is running in wb mode, perhaps we should make it ro until the SSD shows up and we run eio_cli? Run blockdev --setro in the EIO_CACHE part, and blockdev --setrw in the EIO_SOURCE part? not a udev developer, take that with a grain of salt. > +GOTO="EIO_EOF" > + > +LABEL="EIO_SETUP" > + PROGRAM="/bin/sh -c 'cat /dev/enhanceio//.ssd_name'", ENV{ssd_name}="%c" > + PROGRAM="/bin/sh -c 'cat /dev/enhanceio//.disk_name'", ENV{disk_name}="%c" > + > + TEST!="/proc/enhanceio/", RUN+="/sbin/eio_cli enable -d /dev/$env{disk_name} -s /dev/$env{ssd_name} " > +LABEL="EIO_EOF" > diff --git a/Documentation/enhanceio/Persistence.txt b/Documentation/enhanceio/Persistence.txt > new file mode 100644 > index 0000000..8b6e58f > --- /dev/null > +++ b/Documentation/enhanceio/Persistence.txt > @@ -0,0 +1,10 @@ > +How to create persistent cache > +============================== > + > +Use the 94-Enhanceio-template file to create a per cache udev-rule file named /etc/udev/rules.d/94-enhancio-.rules > + > +1) Change to ENV{ID_SERIAL}=="", ENV{DEVTYPE}== > + > +2) Change to ENV{ID_SERIAL}=="", ENV{DEVTYPE}== > + > +3) Replace all instances of with the name of your cache I wonder if there's a better way to do this than manually cutting and pasting all these IDs into a udev rules file. Or, how about a quick script at cache creation time that spits out files into /etc/udev/rules.d/ ? > diff --git a/Documentation/enhanceio/README.txt b/Documentation/enhanceio/README.txt > new file mode 100644 > index 0000000..6391dce > --- /dev/null > +++ b/Documentation/enhanceio/README.txt > @@ -0,0 +1,194 @@ > + STEC EnhanceIO SSD Caching Software > + 25th December, 2012 > + > + > +1. WHAT IS ENHANCEIO? > + > + EnhanceIO driver is based on EnhanceIO SSD caching software product > + developed by STEC Inc. EnhanceIO was derived from Facebook's open source > + Flashcache project. EnhanceIO uses SSDs as cache devices for > + traditional rotating hard disk drives (referred to as source volumes > + throughout this document). > + > + EnhanceIO can work with any block device, be it an entire physical > + disk, an individual disk partition, a RAIDed DAS device, a SAN volume, > + a device mapper volume or a software RAID (md) device. > + > + The source volume to SSD mapping is a set-associative mapping based on > + the source volume sector number with a default set size > + (aka associativity) of 512 blocks and a default block size of 4 KB. > + Partial cache blocks are not used. The default value of 4 KB is chosen > + because it is the common I/O block size of most storage systems. With > + these default values, each cache set is 2 MB (512 * 4 KB). Therefore, > + a 400 GB SSD will have a little less than 200,000 cache sets because a > + little space is used for storing the meta data on the SSD. > + > + EnhanceIO supports three caching modes: read-only, write-through, and > + write-back and three cache replacement policies: random, FIFO, and LRU. > + > + Read-only caching mode causes EnhanceIO to direct write IO requests only > + to HDD. Read IO requests are issued to HDD and the data read from HDD is > + stored on SSD. Subsequent Read requests for the same blocks are carried > + out from SSD, thus reducing their latency by a substantial amount. > + > + In Write-through mode - reads are handled similar to Read-only mode. > + Write-through mode causes EnhanceIO to write application data to both > + HDD and SSD. Subsequent reads of the same data benefit because they can > + be served from SSD. > + > + Write-back improves write latency by writing application requested data > + only to SSD. This data, referred to as dirty data, is copied later to How much later? > + HDD asynchronously. Reads are handled similar to Read-only and > + Write-through modes. > + > +2. WHAT HAS ENHANCEIO ADDED TO FLASHCACHE? > + > +2.1. A new write-back engine > + > + The write-back engine in EnhanceiO has been designed from scratch. > + Several optimizations have been done. IO completion guarantees have > + been improved. We have defined limits to let a user control the amount > + of dirty data in a cache. Clean-up of dirty data is stopped by default > + under a high load; this can be overridden if required. A user can > + control the extent to which a single cache set can be filled with dirty > + data. A background thread cleans-up dirty data at regular intervals. > + Clean-up is also done at regular intevals by identifying cache sets > + which have been written least recently. > + > +2.2. Transparent cache > + > + EnhanceIO does not use device mapper. This enables creation and > + deletion of caches while a source volume is being used. It's possible > + to either create or delete cache while a partition is mounted. > + > + EnhanceIO also supports creation of a cache for a device which contains > + partitions. With this feature it's possible to create a cache without > + worrying about having to create several SSD partitions and many > + separate caches. > + > + > +2.3. Large I/O Support > + > + Unlike Flashcache, EnhanceIO does not cause source volume I/O requests > + to be split into cache block size pieces. For the typical SSD cache > + block size of 4 KB, this means that a write I/O request size of, say, > + 64 KB to the source volume is not split into 16 individual requests of > + 4 KB each. This is a performance improvement over Flashcache. IO > + codepaths have been substantially modified for this improvement. > + > +2.4. Small Memory Footprint > + > + Through a special compression algorithm, the meta data RAM usage has > + been reduced to only 4 bytes for each SSD cache block (versus 16 bytes > + in Flashcache). Since the most typical SSD cache block size is 4 KB, > + this means that RAM usage is 0.1% (1/1000) of SSD capacity. > + For example, for a 400 GB SSD, EnhanceIO will need only 400 MB to keep > + all meta data in RAM. > + > + For an SSD cache block size of 8 KB, RAM usage is 0.05% (1/2000) of SSD > + capacity. > + > + The compression algorithm needs at least 32,768 cache sets > + (i.e., 16 bits to encode the set number). If the SSD capacity is small > + and there are not at least 32,768 cache sets, EnhanceIO uses 8 bytes of > + RAM for each SSD cache block. In this case, RAM usage is 0.2% (2/1000) > + of SSD capacity for a cache block size of 4K. > + > +2.4. Loadable Replacement Policies > + > + Since the SSD cache size is typically 10%-20% of the source volume > + size, the set-associative nature of EnhanceIO necessitates cache > + block replacement. > + > + The main EnhanceIO kernel module that implements the caching engine > + uses a random (actually, almost like round-robin) replacement policy > + that does not require any additional RAM and has the least CPU > + overhead. However, there are two additional kernel modules that > + implement FIFO and LRU replacement policies. FIFO is the default cache > + replacement policy because it uses less RAM than LRU. The FIFO and LRU > + kernel modules are independent of each other and do not have to be > + loaded if they are not needed. > + > + Since the replacement policy modules do not consume much RAM when not > + used, both modules are typically loaded after the main caching engine > + is loaded. RAM is used only after a cache has been instantiated to use > + either the FIFO or the LRU replacement policy. > + > + Please note that the RAM used for replacement policies is in addition > + to the RAM used for meta data (mentioned in Section 2.1). The table > + below shows how much RAM each cache replacement policy uses: > + > + POLICY RAM USAGE > + ------ --------- > + Random 0 > + FIFO 4 bytes per cache set > + LRU 4 bytes per cache set + 4 bytes per cache block > + > +2.5. Optimal Alignment of Data Blocks on SSD > + > + EnhanceIO writes all meta data and data blocks on 4K-aligned blocks > + on the SSD. This minimizes write amplification and flash wear. > + It also improves performance. > + > +2.6. Improved device failure handling > + > + Failure of an SSD device in read-only and write-through modes is > + handled gracefully by allowing I/O to continue to/from the > + source volume. An application may notice a drop in performance but it > + will not receive any I/O errors. > + > + Failure of an SSD device in write-back mode obviously results in the > + loss of dirty blocks in the cache. To guard against this data loss, two > + SSD devices can be mirrored via RAID 1. What happens to writes that happen after the SSD goes down? Are they simply passed through to the slow disk? > + EnhanceIO identifies device failures based on error codes. Depending on > + whether the failure is likely to be intermittent or permanent, it takes > + the best suited action. > + > +2.8. Coding optimizations > + > + Several coding optizations have been done to reduce CPU usage. These > + include removing queues which are not required for write-through and > + read-only cache modes, splitting of a single large spinlock, and more. > + Most of the code paths in flashcache have been substantially > + restructured. > + > +3. EnhanceIO usage > + > +3.1. Cache creation, deletion and editing properties > + > + eio_cli utility is used for creating and deleting caches and editing > + their properties. Manpage for this utility eio_cli(8) provides more > + information. > + > +3.2. Making a cache configuration persistent > + It's essential that a cache be resumed before any applications or a > + filesystem use the source volume during a bootup. If a cache is enabled > + after a source volume is written to, stale data may be present in the > + cache. It may cause data corruption. The document Persistent.txt > + describes how to enable a cache during bootup using udev scripts. > + > + In case an SSD does not come up during a bootup, it's ok to allow read > + and write access to HDD only in the case of a Write-through or a > + read-only cache. A cache should be created again when SSD becomes > + available. If a previous cache configuration is resumed, it may cause > + stale data to be read. > + > +3.3. Using a Write-back cache > + It's absolutely necessary to make a Write-back cache configuration > + persistent. This is required particularly in the case of an OS crash or > + a power failure. A Write-back cache may contain dirty blocks which > + haven't been written to HDD yet. Reading the source volume without > + enabling the cache will cause incorrect data to be read. > + > + In case an SSD does not come up during a bootup, access to HDD should > + stopped. It should be enabled only after SSD comes-up and a cache is > + enabled. > + > +4. ACKNOWLEDGEMENTS > + > + STEC acknowledges Facebook and in particular Mohan Srinivasan > + for the design, development, and release of Flashcache as an > + open source project. > + > + Flashcache, in turn, is based on DM-Cache by Ming Zhao. > diff --git a/drivers/staging/Kconfig b/drivers/staging/Kconfig > index 329bdb4..0e97141 100644 > --- a/drivers/staging/Kconfig > +++ b/drivers/staging/Kconfig > @@ -142,4 +142,6 @@ source "drivers/staging/sb105x/Kconfig" > > source "drivers/staging/fwserial/Kconfig" > > +source "drivers/staging/enhanceio/Kconfig" > + > endif # STAGING > diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile > index c7ec486..81656de 100644 > --- a/drivers/staging/Makefile > +++ b/drivers/staging/Makefile > @@ -63,3 +63,4 @@ obj-$(CONFIG_DRM_IMX) += imx-drm/ > obj-$(CONFIG_DGRP) += dgrp/ > obj-$(CONFIG_SB105X) += sb105x/ > obj-$(CONFIG_FIREWIRE_SERIAL) += fwserial/ > +obj-$(CONFIG_ENHANCEIO) += enhanceio/ > diff --git a/drivers/staging/enhanceio/Kconfig b/drivers/staging/enhanceio/Kconfig > new file mode 100644 > index 0000000..ca740e1 > --- /dev/null > +++ b/drivers/staging/enhanceio/Kconfig > @@ -0,0 +1,20 @@ > +# > +# EnhanceIO caching solution by STEC INC. > +# > + > +config ENHANCEIO > + tristate "Enable EnhanceIO" > + default n > + ---help--- > + Based on Facebook's open source Flashcache project developed by > + Mohan Srinivasan and hosted at "http://github.com", EnhanceIO is Probably not necessary to mention github once this ends up in the kernel. > + a collection of (currently three) loadable kernel modules for > + using SSDs as cache devices for traditional rotating hard disk > + > + The caching engine is a loadable kernel module ("enhanceio.ko") > + implemented as a device mapper target. The cache replacement > + policies are implemented as loadable kernel modules > + ("enhanceio_fifo.ko", "enhanceio_lru.ko") that register with > + the caching engine module. > + > + If unsure, say N. > diff --git a/drivers/staging/enhanceio/Makefile b/drivers/staging/enhanceio/Makefile > new file mode 100644 > index 0000000..926aa71 > --- /dev/null > +++ b/drivers/staging/enhanceio/Makefile > @@ -0,0 +1,16 @@ > +# > +# Makefile for EnhanceIO block device caching. > +# > +obj-$(CONFIG_ENHANCEIO) += enhanceio.o enhanceio_lru.o enhanceio_fifo.o > +enhanceio-y += \ > + eio_conf.o \ > + eio_ioctl.o \ > + eio_main.o \ > + eio_mem.o \ > + eio_policy.o \ > + eio_procfs.o \ > + eio_setlru.o \ > + eio_subr.o \ > + eio_ttc.o > +enhanceio_fifo-y += eio_fifo.o > +enhanceio_lru-y += eio_lru.o > diff --git a/drivers/staging/enhanceio/eio.h b/drivers/staging/enhanceio/eio.h > new file mode 100644 > index 0000000..82bab6e > --- /dev/null > +++ b/drivers/staging/enhanceio/eio.h > @@ -0,0 +1,1137 @@ > +/* > + * eio.h > + * > + * Copyright (C) 2012 STEC, Inc. All rights not specifically granted > + * under a license included herein are reserved > + * Saied Kazemi > + * Added EnhanceIO-specific code. > + * Siddharth Choudhuri > + * Common data structures and definitions between Windows and Linux. > + * Amit Kale > + * Restructured much of the io code to split bio within map function instead > + * of letting dm do it. > + * Amit Kale > + * Harish Pujari > + * Designed and implemented the writeback caching mode > + * Copyright 2010 Facebook, Inc. > + * Author: Mohan Srinivasan (mohan@facebook.com) > + * > + * Based on DM-Cache: > + * Copyright (C) International Business Machines Corp., 2006 > + * Author: Ming Zhao (mingzhao@ufl.edu) > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation; under version 2 of the License. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + * > + * You should have received a copy of the GNU General Public License > + * along with this program. If not, see . > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include /* required for eio_subr.c */ > +#include > +#include > +#include /* for sysinfo (mem) variables */ > +#include > +#include /* required for SSD failure handling */ > +/* resolve conflict with scsi/scsi_device.h */ > +#ifdef QUEUED > +#undef QUEUED > +#endif > + > +#if defined(__KERNEL__) && !defined(CONFIG_PROC_FS) > +#error "EnhanceIO requires CONFIG_PROC_FS" > +#endif /* __KERNEL__ && !CONFIG_PROC_FS */ This dependency should be stated in the Kconfig file. 'depends PROC_FS' or something like that. > +#ifndef EIO_INC_H > +#define EIO_INC_H > + > +#define EIO_DBN_SET(dmc, index, dbn) ssdcache_dbn_set(dmc, index, dbn) > +#define EIO_DBN_GET(dmc, index) ssdcache_dbn_get(dmc, index) > +#define EIO_CACHE_STATE_SET(dmc, index, state) ssdcache_cache_state_set(dmc, index, state) > +#define EIO_CACHE_STATE_GET(dmc, index) ssdcache_cache_state_get(dmc, index) > +#define EIO_CACHE_STATE_OFF(dmc, index, bitmask) ssdcache_cache_state_off(dmc, index, bitmask) > +#define EIO_CACHE_STATE_ON(dmc, index, bitmask) ssdcache_cache_state_on(dmc, index, bitmask) > + > +/* Bit offsets for wait_on_bit_lock() */ > +#define EIO_UPDATE_LIST 0 > +#define EIO_HANDLE_REBOOT 1 > + > +struct eio_control_s { > + volatile unsigned long synch_flags; Are you sure that this volatile does what you think it does? http://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt afaict all the uses of synch_flags seem to use atomic operations already... > +}; > + > +int eio_wait_schedule(void *unused); > + > +struct eio_event { > + struct task_struct *process; /* handle of the sleeping process */ > +}; > + > +typedef long int index_t; > + > +/* > + * This file has three sections as follows: > + * > + * Section 1: User space only > + * Section 2: User space and kernel > + * Section 3: Kernel only > + * > + * Each section may contain its own subsections. > + */ > + > +/* > + * Begin Section 1: User space only. > + */ Empty? > +/* > + * End Section 1: User space only. > + */ > + > +/* > + * Begin Section 2: User space and kernel. > + */ > + > +/* States of a cache block */ > +#define INVALID 0x0001 > +#define VALID 0x0002 /* Valid */ > +#define DISKREADINPROG 0x0004 /* Read from disk in progress */ > +#define DISKWRITEINPROG 0x0008 /* Write to disk in progress */ > +#define CACHEREADINPROG 0x0010 /* Read from cache in progress */ > +#define CACHEWRITEINPROG 0x0020 /* Write to cache in progress */ > +#define DIRTY 0x0040 /* Dirty, needs writeback to disk */ > +#define QUEUED 0x0080 /* Other requests are queued for this block */ > + > +#define BLOCK_IO_INPROG (DISKREADINPROG | DISKWRITEINPROG | \ > + CACHEREADINPROG | CACHEWRITEINPROG) > +#define DIRTY_INPROG (VALID | DIRTY | CACHEWRITEINPROG) /* block being dirtied */ > +#define CLEAN_INPROG (VALID | DIRTY | DISKWRITEINPROG) /* ongoing clean */ > +#define ALREADY_DIRTY (VALID | DIRTY) /* block which is dirty to begin with for an I/O */ These shouldn't go past 80 columns. > +/* > + * This is a special state used only in the following scenario as > + * part of device (SSD) failure handling: > + * > + * ------| dev fail |------| dev resume |------------ > + * ...-<--- Tf --><- Td -><---- Tr ---><-- Tn ---... > + * |---- Normal ----|-- Degraded -------|-- Normal ---| > + * > + * Tf: Time during device failure. > + * Td: Time after failure when the cache is in degraded mode. > + * Tr: Time when the SSD comes back online. > + * > + * When a failed SSD is added back again, it should be treated > + * as a cold SSD. > + * > + * If Td is very small, then there can be IOs that were initiated > + * before or during Tf, and did not finish until the end of Tr. From > + * the IO's viewpoint, the SSD was there when the IO was initiated > + * and it was there when the IO was finished. These IOs need special > + * handling as described below. > + * > + * To add the SSD as a cold cache device, we initialize all blocks > + * to INVALID, execept for the ones that had IOs in progress before > + * or during Tf. We mark such blocks as both VALID and INVALID. > + * These blocks will be marked INVALID when finished. > + */ > +#define NO_SSD_IO_INPROG (VALID | INVALID) > + > +/* > + * On Flash (cache metadata) Structures > + */ > +#define CACHE_MD_STATE_DIRTY 0x55daddee > +#define CACHE_MD_STATE_CLEAN 0xacceded1 > +#define CACHE_MD_STATE_FASTCLEAN 0xcafebabf > +#define CACHE_MD_STATE_UNSTABLE 0xdeaddeee > + > +/* Do we have a read cache or a read-write cache */ > +#define CACHE_MODE_WB 1 > +#define CACHE_MODE_RO 2 > +#define CACHE_MODE_WT 3 > +#define CACHE_MODE_FIRST CACHE_MODE_WB > +#define CACHE_MODE_LAST CACHE_MODE_WT > +#define CACHE_MODE_DEFAULT CACHE_MODE_WT > + > +#define DEV_PATHLEN 128 > +#define EIO_SUPERBLOCK_SIZE 4096 > + > +#define EIO_CLEAN_ABORT 0x00000000 > +#define EIO_CLEAN_START 0x00000001 > +#define EIO_CLEAN_KEEP 0x00000002 > + > +/* EIO magic number */ > +#define EIO_MAGIC 0xE10CAC6E > +#define EIO_BAD_MAGIC 0xBADCAC6E > + > +/* EIO version */ > +#define EIO_SB_VERSION 3 /* kernel superblock version */ > +#define EIO_SB_MAGIC_VERSION 3 /* version in which magic number was introduced */ > + > +typedef union eio_superblock { > + struct superblock_fields { > + sector_t size; /* Cache size */ sector_t is 32 bits on !LBDAF 32-bit systems and 64 bits otherwise. This structure seems reflect an on-disk format, which means that I can badly screw things up if I move a cache disk between machines with differently configured kernels. Plus, if we ever change the definition of sector_t then this structure will be broken. This field should be declared with an explicit size, i.e. __le64. > + u_int32_t block_size; /* Cache block size */ Worse yet, these fields should use endianness notations (e.g. __le32) and when you write out the superblock, you need to wrap the assignments with a cpu_to_leXX() call. Otherwise, enhanceio caches created on ppc64 won't load on a x64 box (and vice versa) because all the bytes are swapped. These two grumblings also apply to any other on-disk-format structs in this patch. > + u_int32_t assoc; /* Cache associativity */ > + u_int32_t cache_sb_state; /* Clean shutdown ? */ > + char cache_devname[DEV_PATHLEN]; > + sector_t cache_devsize; > + char disk_devname[DEV_PATHLEN]; > + sector_t disk_devsize; > + u_int32_t cache_version; > + char cache_name[DEV_PATHLEN]; > + u_int32_t mode; > + u_int32_t repl_policy; > + u_int32_t cache_flags; > + /* > + * Version 1.1 superblock ends here. > + * Don't modify any of the above fields. > + */ > + u_int32_t magic; /* Has to be the 1st field afer 1.1 superblock */ > + u_int32_t cold_boot; /* cache to be started as cold after boot */ > + char ssd_uuid[DEV_PATHLEN]; > + sector_t cache_md_start_sect; /* cache metadata start (8K aligned) */ > + sector_t cache_data_start_sect; /* cache data start (8K aligned) */ > + u_int32_t dirty_high_threshold; > + u_int32_t dirty_low_threshold; > + u_int32_t dirty_set_high_threshold; > + u_int32_t dirty_set_low_threshold; > + u_int32_t time_based_clean_interval; > + u_int32_t autoclean_threshold; > + } sbf; > + u_int8_t padding[EIO_SUPERBLOCK_SIZE]; > +} eio_superblock_t; Why does this in-memory data structure need to be 4096 bytes long? 'padding' doesn't seem to be used anywhere. > + > +/* > + * For EnhanceIO, we move the superblock from sector 0 to 128 > + * and give it a full 4K. Also, in addition to the single > + * "red-zone" buffer that separates metadata sectors from the > + * data sectors, we allocate extra sectors so that we can > + * align the data sectors on a 4K boundary. > + * > + * 64K 4K variable variable 8K variable variable > + * +--------+--+--------+---------+---+--------+---------+ > + * | unused |SB| align1 |metadata | Z | align2 | data... | > + * +--------+--+--------+---------+---+--------+---------+ > + * <------------- dmc->md_sectors ------------> > + */ > +#define EIO_UNUSED_SECTORS 128 > +#define EIO_SUPERBLOCK_SECTORS 8 > +#define EIO_REDZONE_SECTORS 16 > +#define EIO_START 0 > + > +#define EIO_ALIGN1_SECTORS(index) ((index % 16) ? (24 - (index % 16)) : 8) > +#define EIO_ALIGN2_SECTORS(index) ((index % 16) ? (16 - (index % 16)) : 0) > +#define EIO_SUPERBLOCK_START (EIO_START + EIO_UNUSED_SECTORS) > +#define EIO_METADATA_START(hd_start_sect) (EIO_SUPERBLOCK_START + \ > + EIO_SUPERBLOCK_SECTORS + \ > + EIO_ALIGN1_SECTORS(hd_start_sect)) > + > +#define EIO_EXTRA_SECTORS(start_sect, md_sects) (EIO_METADATA_START(start_sect) + \ > + EIO_REDZONE_SECTORS + \ > + EIO_ALIGN2_SECTORS(md_sects)) > + > +/* > + * We do metadata updates only when a block trasitions from DIRTY -> CLEAN > + * or from CLEAN -> DIRTY. Consequently, on an unclean shutdown, we only > + * pick up blocks that are marked (DIRTY | CLEAN), we clean these and stick > + * them in the cache. > + * On a clean shutdown, we will sync the state for every block, and we will > + * load every block back into cache on a restart. > + */ > +struct flash_cacheblock { > + sector_t dbn; /* Sector number of the cached block */ > +#ifdef DO_CHECKSUM > + u_int64_t checksum; > +#endif /* DO_CHECKSUM */ > + u_int32_t cache_state; > +}; > + > +/* blksize in terms of no. of sectors */ > +#define BLKSIZE_2K 4 > +#define BLKSIZE_4K 8 > +#define BLKSIZE_8K 16 > + > +/* > + * Give me number of pages to allocated for the > + * iosize x specified in terms of bytes. > + */ > +#define IO_PAGE_COUNT(x) (((x) + (PAGE_SIZE - 1)) / PAGE_SIZE) > + > +/* > + * Macro that calculates number of biovecs to be > + * allocated depending on the iosize and cache > + * block size. > + */ > +#define IO_BVEC_COUNT(x, blksize) ({ \ > + int count = IO_PAGE_COUNT(x); \ > + switch ((blksize)) { \ > + case BLKSIZE_2K: \ > + count = count * 2; \ > + break; \ > + case BLKSIZE_4K: \ > + case BLKSIZE_8K: \ > + break; \ > + } \ > + count; \ > + }) > + > +#define MD_MAX_NR_PAGES 16 > +#define MD_BLOCKS_PER_PAGE ((PAGE_SIZE) / sizeof(struct flash_cacheblock)) > +#define INDEX_TO_MD_PAGE(INDEX) ((INDEX) / MD_BLOCKS_PER_PAGE) > +#define INDEX_TO_MD_PAGE_OFFSET(INDEX) ((INDEX) % MD_BLOCKS_PER_PAGE) > + > +#define MD_BLOCKS_PER_SECTOR (512 / (sizeof(struct flash_cacheblock))) > +#define INDEX_TO_MD_SECTOR(INDEX) ((INDEX) / MD_BLOCKS_PER_SECTOR) > +#define INDEX_TO_MD_SECTOR_OFFSET(INDEX) ((INDEX) % MD_BLOCKS_PER_SECTOR) > +#define MD_BLOCKS_PER_CBLOCK(dmc) (MD_BLOCKS_PER_SECTOR * (dmc)->block_size) > + > +#define METADATA_IO_BLOCKSIZE (256 * 1024) > +#define METADATA_IO_BLOCKSIZE_SECT (METADATA_IO_BLOCKSIZE / 512) > +#define SECTORS_PER_PAGE ((PAGE_SIZE) / 512) > + > +/* > + * Cache persistence. > + */ > +#define CACHE_RELOAD 1 > +#define CACHE_CREATE 2 > +#define CACHE_FORCECREATE 3 > + > +/* > + * Cache replacement policy. > + */ > +#define CACHE_REPL_FIFO 1 > +#define CACHE_REPL_LRU 2 > +#define CACHE_REPL_RANDOM 3 > +#define CACHE_REPL_FIRST CACHE_REPL_FIFO > +#define CACHE_REPL_LAST CACHE_REPL_RANDOM > +#define CACHE_REPL_DEFAULT CACHE_REPL_FIFO > + > +/* > + * Default cache parameters. > + */ > +#define DEFAULT_CACHE_ASSOC 512 > +#define DEFAULT_CACHE_BLKSIZE 8 /* 4 KB */ > + > +/* > + * Valid commands that can be written to "control". > + * NOTE: Update CACHE_CONTROL_FLAG_MAX value whenever a new control flag is added > + */ > +#define CACHE_CONTROL_FLAG_MAX 7 > +#define CACHE_VERBOSE_OFF 0 > +#define CACHE_VERBOSE_ON 1 > +#define CACHE_WRITEBACK_ON 2 /* register write back variables */ > +#define CACHE_WRITEBACK_OFF 3 > +#define CACHE_INVALIDATE_ON 4 /* register invalidate variables */ > +#define CACHE_INVALIDATE_OFF 5 > +#define CACHE_FAST_REMOVE_ON 6 /* do not write MD when destroying cache */ > +#define CACHE_FAST_REMOVE_OFF 7 > + > +/* > + * Bit definitions in "cache_flags". These are exported in Linux as > + * hex in the "flags" output line of /proc/enhanceio//config. > + */ > + > +#define CACHE_FLAGS_VERBOSE (1 << 0) > +#define CACHE_FLAGS_INVALIDATE (1 << 1) > +#define CACHE_FLAGS_FAST_REMOVE (1 << 2) > +#define CACHE_FLAGS_DEGRADED (1 << 3) > +#define CACHE_FLAGS_SSD_ADD_INPROG (1 << 4) > +#define CACHE_FLAGS_MD8 (1 << 5) /* using 8-byte metadata (instead of 4-byte md) */ > +#define CACHE_FLAGS_FAILED (1 << 6) > +#define CACHE_FLAGS_STALE (1 << 7) > +#define CACHE_FLAGS_SHUTDOWN_INPROG (1 << 8) > +#define CACHE_FLAGS_MOD_INPROG (1 << 9) /* cache modification such as edit/delete in progress */ > +#define CACHE_FLAGS_DELETED (1 << 10) > +#define CACHE_FLAGS_INCORE_ONLY (CACHE_FLAGS_DEGRADED | \ > + CACHE_FLAGS_SSD_ADD_INPROG | \ > + CACHE_FLAGS_FAILED | \ > + CACHE_FLAGS_SHUTDOWN_INPROG | \ > + CACHE_FLAGS_MOD_INPROG | \ > + CACHE_FLAGS_STALE | \ > + CACHE_FLAGS_DELETED) /* need a proper definition */ > + > +/* flags that govern cold/warm enable after reboot */ > +#define BOOT_FLAG_COLD_ENABLE (1 << 0) /* enable the cache as cold */ > +#define BOOT_FLAG_FORCE_WARM (1 << 1) /* override the cold enable flag */ > + > +typedef enum dev_notifier { > + NOTIFY_INITIALIZER, > + NOTIFY_SSD_ADD, > + NOTIFY_SSD_REMOVED, > + NOTIFY_SRC_REMOVED > +} dev_notifier_t; > + > +/* > + * End Section 2: User space and kernel. > + */ > + > +/* > + * Begin Section 3: Kernel only. > + */ > +#if defined(__KERNEL__) > + > +/* > + * Subsection 3.1: Definitions. > + */ > + > +#define EIO_SB_VERSION 3 /* kernel superblock version */ > + > +/* kcached/pending job states */ > +#define READCACHE 1 > +#define WRITECACHE 2 > +#define READDISK 3 > +#define WRITEDISK 4 > +#define READFILL 5 /* Read Cache Miss Fill */ > +#define INVALIDATE 6 > + > +/* Cache persistence */ > +#define CACHE_RELOAD 1 > +#define CACHE_CREATE 2 > +#define CACHE_FORCECREATE 3 > + > +/* Sysctl defined */ > +#define MAX_CLEAN_IOS_SET 2 > +#define MAX_CLEAN_IOS_TOTAL 4 > + > +/* > + * Harish: TBD > + * Rethink on max, min, default values > + */ > +#define DIRTY_HIGH_THRESH_DEF 30 > +#define DIRTY_LOW_THRESH_DEF 10 > +#define DIRTY_SET_HIGH_THRESH_DEF 100 > +#define DIRTY_SET_LOW_THRESH_DEF 30 What are the units of these values? I suspect that they're used to decide when to start (and stop) flushing dirty blocks out of a wb cache, but please write down in Documentation/enhanceio/README.txt or somewhere what the sysctl values do, and in what units they are expressed. > + > +#define CLEAN_FACTOR(sectors) ((sectors) >> 25) /* in 16 GB multiples */ > +#define TIME_BASED_CLEAN_INTERVAL_DEF(dmc) (uint32_t)(CLEAN_FACTOR((dmc)->cache_size) ? \ > + CLEAN_FACTOR((dmc)->cache_size) : 1) > +#define TIME_BASED_CLEAN_INTERVAL_MAX 720 /* in minutes */ > + > +#define AUTOCLEAN_THRESH_DEF 128 /* Number of I/Os which puts a hold on time based cleaning */ > +#define AUTOCLEAN_THRESH_MAX 1024 /* Number of I/Os which puts a hold on time based cleaning */ > + > +/* Inject a 5s delay between cleaning blocks and metadata */ > +#define CLEAN_REMOVE_DELAY 5000 > + > +/* > + * Subsection 2: Data structures. > + */ > + > +/* > + * Block checksums : > + * Block checksums seem a good idea (especially for debugging, I found a couple > + * of bugs with this), but in practice there are a number of issues with this > + * in production. > + * 1) If a flash write fails, there is no guarantee that the failure was atomic. > + * Some sectors may have been written to flash. If so, the checksum we have > + * is wrong. We could re-read the flash block and recompute the checksum, but > + * the read could fail too. > + * 2) On a node crash, we could have crashed between the flash data write and the > + * flash metadata update (which updates the new checksum to flash metadata). When > + * we reboot, the checksum we read from metadata is wrong. This is worked around > + * by having the cache load recompute checksums after an unclean shutdown. > + * 3) Checksums require 4 or 8 more bytes per block in terms of metadata overhead. > + * Especially because the metadata is wired into memory. > + * 4) Checksums force us to do a flash metadata IO on a block re-dirty. If we > + * didn't maintain checksums, we could avoid the metadata IO on a re-dirty. > + * Therefore in production we disable block checksums. > + * > + * Use the Makefile to enable/disable DO_CHECKSUM OH? I don't see any code that actually touches checksums (or enables DO_CHECKSUM). > + */ > +typedef void (*eio_notify_fn)(int error, void *context); > + > +/* > + * 4-byte metadata support. > + */ > + > +#define EIO_MAX_SECTOR (((u_int64_t)1) << 40) > + > +struct md4 { > + u_int16_t bytes1_2; > + u_int8_t byte3; > + u_int8_t cache_state; > +}; > + > +struct cacheblock { > + union { > + u_int32_t u_i_md4; > + struct md4 u_s_md4; > + } md4_u; > +#ifdef DO_CHECKSUM > + u_int64_t checksum; > +#endif /* DO_CHECKSUM */ > +}; > + > +#define md4_md md4_u.u_i_md4 > +#define md4_cache_state md4_u.u_s_md4.cache_state > +#define EIO_MD4_DBN_BITS (32 - 8) /* 8 bits for state */ > +#define EIO_MD4_DBN_MASK ((1 << EIO_MD4_DBN_BITS) - 1) > +#define EIO_MD4_INVALID (INVALID << EIO_MD4_DBN_BITS) > +#define EIO_MD4_CACHE_STATE(dmc, index) (dmc->cache[index].md4_cache_state) > + > +/* > + * 8-byte metadata support. > + */ > + > +struct md8 { > + u_int32_t bytes1_4; > + u_int16_t bytes5_6; > + u_int8_t byte7; > + u_int8_t cache_state; > +}; > + > +struct cacheblock_md8 { > + union { > + u_int64_t u_i_md8; > + struct md8 u_s_md8; > + } md8_u; > +#ifdef DO_CHECKSUM > + u_int64_t checksum; > +#endif /* DO_CHECKSUM */ > +}; > + > +#define md8_md md8_u.u_i_md8 > +#define md8_cache_state md8_u.u_s_md8.cache_state > +#define EIO_MD8_DBN_BITS (64 - 8) /* 8 bits for state */ > +#define EIO_MD8_DBN_MASK ((((u_int64_t)1) << EIO_MD8_DBN_BITS) - 1) > +#define EIO_MD8_INVALID (((u_int64_t)INVALID) << EIO_MD8_DBN_BITS) > +#define EIO_MD8_CACHE_STATE(dmc, index) ((dmc)->cache_md8[index].md8_cache_state) > +#define EIO_MD8(dmc) CACHE_MD8_IS_SET(dmc) > + > +/* Structure used for metadata update on-disk and in-core for writeback cache */ > +struct mdupdate_request { > + struct list_head list; /* to build mdrequest chain */ > + struct work_struct work; /* work structure */ > + struct cache_c *dmc; /* cache pointer */ > + index_t set; /* set index */ > + unsigned md_size; /* metadata size */ > + unsigned mdbvec_count; /* count of bvecs allocated. */ > + struct bio_vec *mdblk_bvecs; /* bvecs for updating md_blocks */ > + atomic_t holdcount; /* I/O hold count */ > + struct eio_bio *pending_mdlist; /* ebios pending for md update */ > + struct eio_bio *inprog_mdlist; /* ebios processed for md update */ > + int error; /* error during md update */ > + struct mdupdate_request *next; /* next mdreq in the mdreq list .Harish: TBD. Deprecate */ > +}; > + > +#define SETFLAG_CLEAN_INPROG 0x00000001 /* clean in progress on a set */ > +#define SETFLAG_CLEAN_WHOLE 0x00000002 /* clean the set fully */ > + > +/* Structure used for doing operations and storing cache set level info */ > +struct cache_set { > + struct list_head list; > + u_int32_t nr_dirty; /* number of dirty blocks */ > + spinlock_t cs_lock; /* spin lock to protect struct fields */ > + struct rw_semaphore rw_lock; /* reader-writer lock used for clean */ > + unsigned int flags; /* misc cache set specific flags */ > + struct mdupdate_request *mdreq; /* metadata update request pointer */ > +}; > + > +struct eio_errors { > + int disk_read_errors; > + int disk_write_errors; > + int ssd_read_errors; > + int ssd_write_errors; > + int memory_alloc_errors; > + int no_cache_dev; > + int no_source_dev; > +}; > + > +/* > + * Stats. Note that everything should be "atomic64_t" as > + * code relies on it. > + */ > +#define SECTOR_STATS(statval, io_size) \ > + atomic64_add(to_sector(io_size), &statval); > + > +struct eio_stats { > + atomic64_t reads; /* Number of reads */ > + atomic64_t writes; /* Number of writes */ > + atomic64_t read_hits; /* Number of cache hits */ > + atomic64_t write_hits; /* Number of write hits (includes dirty write hits) */ > + atomic64_t dirty_write_hits; /* Number of "dirty" write hits */ > + atomic64_t cached_blocks; /* Number of cached blocks */ > + atomic64_t rd_replace; /* Number of read cache replacements. Harish: TBD modify def doc */ > + atomic64_t wr_replace; /* Number of write cache replacements. Harish: TBD modify def doc */ > + atomic64_t noroom; /* No room in set */ > + atomic64_t cleanings; /* blocks cleaned Harish: TBD modify def doc */ > + atomic64_t md_write_dirty; /* Metadata sector writes dirtying block */ > + atomic64_t md_write_clean; /* Metadata sector writes cleaning block */ > + atomic64_t md_ssd_writes; /* How many md ssd writes did we do ? */ > + atomic64_t uncached_reads; > + atomic64_t uncached_writes; > + atomic64_t uncached_map_size; > + atomic64_t uncached_map_uncacheable; > + atomic64_t disk_reads; > + atomic64_t disk_writes; > + atomic64_t ssd_reads; > + atomic64_t ssd_writes; > + atomic64_t ssd_readfills; > + atomic64_t ssd_readfill_unplugs; > + atomic64_t readdisk; > + atomic64_t writedisk; > + atomic64_t readcache; > + atomic64_t readfill; > + atomic64_t writecache; > + atomic64_t wrtime_ms; /* total write time in ms */ > + atomic64_t rdtime_ms; /* total read time in ms */ > + atomic64_t readcount; /* total reads received so far */ > + atomic64_t writecount; /* total writes received so far */ > +}; > + > +#define PENDING_JOB_HASH_SIZE 32 > +#define PENDING_JOB_HASH(index) ((index) % PENDING_JOB_HASH_SIZE) > +#define SIZE_HIST (128 + 1) > +#define EIO_COPY_PAGES 1024 /* Number of pages for I/O */ > +#define MIN_JOBS 1024 > +#define MIN_EIO_IO 4096 > +#define MIN_DMC_BIO_PAIR 8192 > + > +/* Structure representing a sequence of sets(first to last set index) */ > +struct set_seq { > + index_t first_set; > + index_t last_set; > + struct set_seq *next; > +}; > + > +/* EIO system control variables(tunables) */ > +/* > + * vloatile are used here since the cost a strong synchonisation "synchronization" > + * is not worth the benefits. > + */ > +struct eio_sysctl { > + volatile uint32_t error_inject; > + volatile int32_t fast_remove; > + volatile int32_t zerostats; > + volatile int32_t do_clean; > + volatile uint32_t dirty_high_threshold; > + volatile uint32_t dirty_low_threshold; > + volatile uint32_t dirty_set_high_threshold; > + volatile uint32_t dirty_set_low_threshold; > + volatile uint32_t time_based_clean_interval; /* time after which dirty sets should clean */ > + volatile int32_t autoclean_threshold; > + volatile int32_t mem_limit_pct; > + volatile int32_t control; > + volatile u_int64_t invalidate; > +}; > + > +/* forward declaration */ > +struct lru_ls; > + > +/* Replacement for 'struct dm_dev' */ > +struct eio_bdev { > + struct block_device *bdev; > + fmode_t mode; > + char name[16]; > +}; > + > +/* Replacement for 'struct dm_io_region */ > +struct eio_io_region { > + struct block_device *bdev; > + sector_t sector; > + sector_t count; /* If zero the region is ignored */ > +}; > + > +/* > + * Cache context > + */ > +struct cache_c { > + struct list_head cachelist; > + make_request_fn *origmfn; > + char dev_info; /* partition or whole device */ > + > + sector_t dev_start_sect; > + sector_t dev_end_sect; > + int cache_rdonly; /* protected by ttc_write lock */ > + struct eio_bdev *disk_dev; /* Source device */ > + struct eio_bdev *cache_dev; /* Cache device */ > + struct cacheblock *cache; /* Hash table for cache blocks */ > + struct cache_set *cache_sets; > + struct cache_c *next_cache; > + struct kcached_job *readfill_queue; > + struct work_struct readfill_wq; > + > + struct list_head cleanq; /* queue of sets to awaiting clean */ > + struct eio_event clean_event; /* event to wait for, when cleanq is empty */ > + spinlock_t clean_sl; /* spinlock to protect cleanq etc */ > + void *clean_thread; /* OS specific thread object to handle cleanq */ > + int clean_thread_running; /* to indicate that clean thread is running */ > + atomic64_t clean_pendings; /* Number of sets pending to be cleaned */ > + struct bio_vec *clean_dbvecs; /* Data bvecs for clean set */ > + struct page **clean_mdpages; /* Metadata pages for clean set */ > + int dbvec_count; > + int mdpage_count; > + int clean_excess_dirty; /* Clean in progress to bring cache dirty blocks in limits */ > + atomic_t clean_index; /* set being cleaned, in case of force clean */ > + > + u_int64_t md_start_sect; /* Sector no. at which Metadata starts */ > + u_int64_t md_sectors; /* Numbers of metadata sectors, including header */ > + u_int64_t disk_size; /* Source size */ > + u_int64_t size; /* Cache size */ > + u_int32_t assoc; /* Cache associativity */ > + u_int32_t block_size; /* Cache block size */ > + u_int32_t block_shift; /* Cache block size in bits */ > + u_int32_t block_mask; /* Cache block mask */ > + u_int32_t consecutive_shift; /* Consecutive blocks size in bits */ > + u_int32_t persistence; /* Create | Force create | Reload */ > + u_int32_t mode; /* CACHE_MODE_{WB, RO, WT} */ > + u_int32_t cold_boot; /* Cache should be started as cold after boot */ > + u_int32_t bio_nr_pages; /* number of hardware sectors supported by SSD in terms of PAGE_SIZE */ > + > + spinlock_t cache_spin_lock; > + long unsigned int cache_spin_lock_flags; /* See comments above spin_lock_irqsave_FLAGS */ > + atomic_t nr_jobs; /* Number of I/O jobs */ > + > + volatile u_int32_t cache_flags; > + u_int32_t sb_state; /* Superblock state */ > + u_int32_t sb_version; /* Superblock version */ > + > + int readfill_in_prog; > + struct eio_stats eio_stats; /* Run time stats */ > + struct eio_errors eio_errors; /* Error stats */ > + int max_clean_ios_set; /* Max cleaning IOs per set */ > + int max_clean_ios_total; /* Total max cleaning IOs */ > + int clean_inprog; > + atomic64_t nr_dirty; > + atomic64_t nr_ios; > + atomic64_t size_hist[SIZE_HIST]; > + > + void *sysctl_handle_common; > + void *sysctl_handle_writeback; > + void *sysctl_handle_invalidate; > + > + struct eio_sysctl sysctl_pending; /* sysctl values pending to become active */ > + struct eio_sysctl sysctl_active; /* sysctl currently active */ > + > + char cache_devname[DEV_PATHLEN]; > + char disk_devname[DEV_PATHLEN]; > + char cache_name[DEV_PATHLEN]; > + char cache_gendisk_name[DEV_PATHLEN]; /* Used for SSD failure checks */ > + char cache_srcdisk_name[DEV_PATHLEN]; /* Used for SRC failure checks */ > + char ssd_uuid[DEV_PATHLEN]; > + > + struct cacheblock_md8 *cache_md8; > + sector_t cache_size; /* Cache size passed to ctr(), used by dmsetup info */ > + sector_t cache_dev_start_sect; /* starting sector of cache device */ > + u_int64_t index_zero; /* index of cache block with starting sector 0 */ > + u_int32_t num_sets; /* number of cache sets */ > + u_int32_t num_sets_bits; /* number of bits to encode "num_sets" */ > + u_int64_t num_sets_mask; /* mask value for bits in "num_sets" */ > + > + struct eio_policy *policy_ops; /* Cache block Replacement policy */ > + u_int32_t req_policy; /* Policy requested by the user */ > + u_int32_t random; /* Use for random replacement policy */ > + void *sp_cache_blk; /* Per cache-block data structure */ > + void *sp_cache_set; /* Per cache-set data structure */ > + struct lru_ls *dirty_set_lru; /* lru for dirty sets : lru_list_t */ > + spinlock_t dirty_set_lru_lock; /* spinlock for dirty set lru */ > + struct delayed_work clean_aged_sets_work; /* work item for clean_aged_sets */ > + int is_clean_aged_sets_sched; /* to know whether clean aged sets is scheduled */ > + struct workqueue_struct *mdupdate_q; /* Workqueue to handle md updates */ > + struct workqueue_struct *callback_q; /* Workqueue to handle io callbacks */ > +}; > + > +#define EIO_CACHE_IOSIZE 0 > + > +#define EIO_ROUND_SECTOR(dmc, sector) (sector & (~(unsigned)(dmc->block_size - 1))) > +#define EIO_ROUND_SET_SECTOR(dmc, sector) (sector & (~(unsigned)((dmc->block_size * dmc->assoc) - 1))) > + > +/* > + * The bit definitions are exported to the user space and are in the very beginning of the file. > + */ > +#define CACHE_VERBOSE_IS_SET(dmc) (((dmc)->cache_flags & CACHE_FLAGS_VERBOSE) ? 1 : 0) > +#define CACHE_INVALIDATE_IS_SET(dmc) (((dmc)->cache_flags & CACHE_FLAGS_INVALIDATE) ? 1 : 0) > +#define CACHE_FAST_REMOVE_IS_SET(dmc) (((dmc)->cache_flags & CACHE_FLAGS_FAST_REMOVE) ? 1 : 0) > +#define CACHE_DEGRADED_IS_SET(dmc) (((dmc)->cache_flags & CACHE_FLAGS_DEGRADED) ? 1 : 0) > +#define CACHE_SSD_ADD_INPROG_IS_SET(dmc) (((dmc)->cache_flags & CACHE_FLAGS_SSD_ADD_INPROG) ? 1 : 0) > +#define CACHE_MD8_IS_SET(dmc) (((dmc)->cache_flags & CACHE_FLAGS_MD8) ? 1 : 0) > +#define CACHE_FAILED_IS_SET(dmc) (((dmc)->cache_flags & CACHE_FLAGS_FAILED) ? 1 : 0) > +#define CACHE_STALE_IS_SET(dmc) (((dmc)->cache_flags & CACHE_FLAGS_STALE) ? 1 : 0) > + > +/* Device failure handling. */ > +#define CACHE_SRC_IS_ABSENT(dmc) (((dmc)->eio_errors.no_source_dev == 1) ? 1 : 0) > + > +#define AUTOCLEAN_THRESHOLD_CROSSED(dmc) \ > + ((atomic64_read(&(dmc)->nr_ios) > (int64_t)(dmc)->sysctl_active.autoclean_threshold) || \ > + ((dmc)->sysctl_active.autoclean_threshold == 0)) > + > +#define DIRTY_CACHE_THRESHOLD_CROSSED(dmc) \ > + (((atomic64_read(&(dmc)->nr_dirty) - atomic64_read(&(dmc)->clean_pendings)) >= \ > + (int64_t)((dmc)->sysctl_active.dirty_high_threshold * (dmc)->size) / 100) && \ > + ((dmc)->sysctl_active.dirty_high_threshold > (dmc)->sysctl_active.dirty_low_threshold)) > + > +#define DIRTY_SET_THRESHOLD_CROSSED(dmc, set) \ > + (((dmc)->cache_sets[(set)].nr_dirty >= (u_int32_t)((dmc)->sysctl_active.dirty_set_high_threshold * (dmc)->assoc) / 100) && \ > + ((dmc)->sysctl_active.dirty_set_high_threshold > (dmc)->sysctl_active.dirty_set_low_threshold)) > + > +/* > + * Do not reverse the order of disk and cache! Code > + * relies on this ordering. (Eg: eio_dm_io_async_bvec()). > + */ > +struct job_io_regions { > + struct eio_io_region disk; /* has to be the first member */ > + struct eio_io_region cache; /* has to be the second member */ > +}; > + > +#define EB_MAIN_IO 1 > +#define EB_SUBORDINATE_IO 2 > +#define EB_INVAL 4 > +#define GET_BIO_FLAGS(ebio) ((ebio)->eb_bc->bc_bio->bi_rw) > +#define VERIFY_BIO_FLAGS(ebio) VERIFY((ebio) && (ebio)->eb_bc && (ebio)->eb_bc->bc_bio) > + > +#define SET_BARRIER_FLAGS(rw_flags) (rw_flags |= (REQ_WRITE | REQ_FLUSH)) > + > +struct eio_bio { > + int eb_iotype; > + struct bio_container *eb_bc; > + unsigned eb_cacheset; > + sector_t eb_sector; /*sector number*/ > + unsigned eb_size; /*size in bytes*/ > + struct bio_vec *eb_bv; /*bvec pointer*/ > + unsigned eb_nbvec; /*number of bio_vecs*/ > + int eb_dir; /* io direction*/ > + struct eio_bio *eb_next; /*used for splitting reads*/ > + index_t eb_index; /*for read bios*/ > + atomic_t eb_holdcount; /* ebio hold count, currently used only for dirty block I/O */ > + struct bio_vec eb_rbv[0]; > +}; > + > +enum eio_io_dir { > + EIO_IO_INVALID_DIR = 0, > + CACHED_WRITE, > + CACHED_READ, > + UNCACHED_WRITE, > + UNCACHED_READ, > + UNCACHED_READ_AND_READFILL > +}; > + > +/* ASK > + * Container for all eio_bio corresponding to a given bio > + */ > +struct bio_container { > + spinlock_t bc_lock; /* lock protecting the bc fields */ > + atomic_t bc_holdcount; /* number of ebios referencing bc */ > + struct bio *bc_bio; /* bio for the bc */ > + struct cache_c *bc_dmc; /* cache structure */ > + struct eio_bio *bc_mdlist; /* ebios waiting for md update */ > + int bc_mdwait; /* count of ebios that will do md update */ > + struct mdupdate_request *mdreqs; /* mdrequest structures required for md update */ > + struct set_seq *bc_setspan; /* sets spanned by the bc(used only for wb) */ > + struct set_seq bc_singlesspan; /* used(by wb) if bc spans a single set sequence */ > + enum eio_io_dir bc_dir; /* bc I/O direction */ > + int bc_error; /* error encountered during processing bc */ > + unsigned long bc_iotime; /* maintains i/o time in jiffies */ > + struct bio_container *bc_next; /* next bc in the chain */ > +}; > + > +/* structure used as callback context during synchronous I/O */ > +struct sync_io_context { > + struct rw_semaphore sio_lock; > + unsigned long sio_error; > +}; > + > +struct kcached_job { > + struct list_head list; > + struct work_struct work; > + struct cache_c *dmc; > + struct eio_bio *ebio; > + struct job_io_regions job_io_regions; > + index_t index; > + int action; > + int error; > + struct flash_cacheblock *md_sector; > + struct bio_vec md_io_bvec; > + struct kcached_job *next; > +}; > + > +struct ssd_rm_list { > + struct cache_c *dmc; > + int action; > + dev_t devt; > + dev_notifier_t note; > + struct list_head list; > +}; > + > +struct dbn_index_pair { > + sector_t dbn; > + index_t index; > +}; > + > +/* > + * Subsection 3: Function prototypes and definitions. > + */ > + > +struct kcached_job *eio_alloc_cache_job(void); > +void eio_free_cache_job(struct kcached_job *job); > +struct kcached_job *pop(struct list_head *jobs); > +void push(struct list_head *jobs, struct kcached_job *job); > +void do_work(struct work_struct *unused); > +void update_job_cacheregion(struct kcached_job *job, struct cache_c *dmc, > + struct eio_bio *bio); > +void push_io(struct kcached_job *job); > +void push_md_io(struct kcached_job *job); > +void push_md_complete(struct kcached_job *job); > +void push_uncached_io_complete(struct kcached_job *job); > +int eio_io_empty(void); > +int eio_md_io_empty(void); > +int eio_md_complete_empty(void); > +void eio_md_write_done(struct kcached_job *job); > +void eio_ssderror_diskread(struct kcached_job *job); > +void eio_md_write(struct kcached_job *job); > +void eio_md_write_kickoff(struct kcached_job *job); > +void eio_do_readfill(struct work_struct *work); > +void eio_comply_dirty_thresholds(struct cache_c *dmc, index_t set); > +void eio_clean_all(struct cache_c *dmc); > +void eio_clean_for_reboot(struct cache_c *dmc); > +void eio_clean_aged_sets(struct work_struct *work); > +void eio_comply_dirty_thresholds(struct cache_c *dmc, index_t set); > +#ifndef SSDCACHE > +void eio_reclaim_lru_movetail(struct cache_c *dmc, index_t index, > + struct eio_policy *); > +#endif /* !SSDCACHE */ > +int eio_io_sync_vm(struct cache_c *dmc, struct eio_io_region *where, int rw, > + struct bio_vec *bvec, int nbvec); > +int eio_io_sync_pages(struct cache_c *dmc, struct eio_io_region *where, int rw, > + struct page **pages, int num_bvecs); > +void eio_update_sync_progress(struct cache_c *dmc); > +void eio_plug_cache_device(struct cache_c *dmc); > +void eio_unplug_cache_device(struct cache_c *dmc); > +void eio_plug_disk_device(struct cache_c *dmc); > +void eio_unplug_disk_device(struct cache_c *dmc); > +int dm_io_async_bvec(unsigned int num_regions, struct eio_io_region *where, > + int rw, struct bio_vec *bvec, eio_notify_fn fn, > + void *context); > +void eio_put_cache_device(struct cache_c *dmc); > +void eio_suspend_caching(struct cache_c *dmc, dev_notifier_t note); > +void eio_resume_caching(struct cache_c *dmc, char *dev); > +int eio_ctr_ssd_add(struct cache_c *dmc, char *dev); > + > +/* procfs */ > +void eio_module_procfs_init(void); > +void eio_module_procfs_exit(void); > +void eio_procfs_ctr(struct cache_c *dmc); > +void eio_procfs_dtr(struct cache_c *dmc); > + > +int eio_sb_store(struct cache_c *dmc); > + > +int eio_md_destroy(struct dm_target *tip, char *namep, char *srcp, char *cachep, > + int force); > + > +/* eio_conf.c */ > +extern int eio_ctr(struct dm_target *ti, unsigned int argc, char **argv); > +extern void eio_dtr(struct dm_target *ti); > +extern int eio_md_destroy(struct dm_target *tip, char *namep, char *srcp, > + char *cachep, int force); > +extern int eio_ctr_ssd_add(struct cache_c *dmc, char *dev); > + > +/* thread related functions */ > +void *eio_create_thread(int (*func)(void *), void *context, char *name); > +void eio_thread_exit(long exit_code); > +void eio_wait_thread_exit(void *thrdptr, int *notifier); > + > +/* eio_main.c */ > +extern int eio_map(struct cache_c *, struct request_queue *, struct bio *); > +extern void eio_md_write_done(struct kcached_job *job); > +extern void eio_ssderror_diskread(struct kcached_job *job); > +extern void eio_md_write(struct kcached_job *job); > +extern void eio_md_write_kickoff(struct kcached_job *job); > +extern void eio_do_readfill(struct work_struct *work); > +extern void eio_check_dirty_thresholds(struct cache_c *dmc, index_t set); > +extern void eio_clean_all(struct cache_c *dmc); > +extern int eio_clean_thread_proc(void *context); > +extern void eio_touch_set_lru(struct cache_c *dmc, index_t set); > +extern void eio_inval_range(struct cache_c *dmc, sector_t iosector, > + unsigned iosize); > +extern int eio_invalidate_sanity_check(struct cache_c *dmc, u_int64_t iosector, > + u_int64_t *iosize); > +/* > + * Invalidates all cached blocks without waiting for them to complete > + * Should be called with incoming IO suspended > + */ > +extern int eio_invalidate_cache(struct cache_c *dmc); > + > +/* eio_mem.c */ > +extern int eio_mem_init(struct cache_c *dmc); > +extern u_int32_t eio_hash_block(struct cache_c *dmc, sector_t dbn); > +extern unsigned int eio_shrink_dbn(struct cache_c *dmc, sector_t dbn); > +extern sector_t eio_expand_dbn(struct cache_c *dmc, u_int64_t index); > +extern void eio_invalidate_md(struct cache_c *dmc, u_int64_t index); > +extern void eio_md4_dbn_set(struct cache_c *dmc, u_int64_t index, > + u_int32_t dbn_24); > +extern void eio_md8_dbn_set(struct cache_c *dmc, u_int64_t index, sector_t dbn); > + > +/* eio_procfs.c */ > +extern void eio_module_procfs_init(void); > +extern void eio_module_procfs_exit(void); > +extern void eio_procfs_ctr(struct cache_c *dmc); > +extern void eio_procfs_dtr(struct cache_c *dmc); > +extern int eio_version_query(size_t buf_sz, char *bufp); > + > +/* eio_subr.c */ > +extern void eio_free_cache_job(struct kcached_job *job); > +extern void eio_do_work(struct work_struct *unused); > +extern struct kcached_job *eio_new_job(struct cache_c *dmc, struct eio_bio *bio, > + index_t index); > +extern void eio_push_ssdread_failures(struct kcached_job *job); > +extern void eio_push_md_io(struct kcached_job *job); > +extern void eio_push_md_complete(struct kcached_job *job); > +extern void eio_push_uncached_io_complete(struct kcached_job *job); > +extern int eio_io_empty(void); > +extern int eio_io_sync_vm(struct cache_c *dmc, struct eio_io_region *where, > + int rw, struct bio_vec *bvec, int nbvec); > +extern void eio_unplug_cache_device(struct cache_c *dmc); > +extern void eio_put_cache_device(struct cache_c *dmc); > +extern void eio_suspend_caching(struct cache_c *dmc, dev_notifier_t note); > +extern void eio_resume_caching(struct cache_c *dmc, char *dev); > + > +static __inline__ void > +EIO_DBN_SET(struct cache_c *dmc, u_int64_t index, sector_t dbn) > +{ > + if (EIO_MD8(dmc)) > + eio_md8_dbn_set(dmc, index, dbn); > + else > + eio_md4_dbn_set(dmc, index, eio_shrink_dbn(dmc, dbn)); > + if (dbn == 0) > + dmc->index_zero = index; > +} > + > +static __inline__ u_int64_t EIO_DBN_GET(struct cache_c *dmc, u_int64_t index) > +{ > + if (EIO_MD8(dmc)) > + return dmc->cache_md8[index].md8_md & EIO_MD8_DBN_MASK; > + > + return eio_expand_dbn(dmc, index); > +} > + > +static __inline__ void > +EIO_CACHE_STATE_SET(struct cache_c *dmc, u_int64_t index, u_int8_t cache_state) > +{ > + if (EIO_MD8(dmc)) > + EIO_MD8_CACHE_STATE(dmc, index) = cache_state; > + else > + EIO_MD4_CACHE_STATE(dmc, index) = cache_state; > +} > + > +static __inline__ u_int8_t > +EIO_CACHE_STATE_GET(struct cache_c *dmc, u_int64_t index) > +{ > + u_int8_t cache_state; > + > + if (EIO_MD8(dmc)) > + cache_state = EIO_MD8_CACHE_STATE(dmc, index); > + else > + cache_state = EIO_MD4_CACHE_STATE(dmc, index); > + return cache_state; > +} > + > +static __inline__ void > +EIO_CACHE_STATE_OFF(struct cache_c *dmc, index_t index, u_int8_t bitmask) > +{ > + u_int8_t cache_state = EIO_CACHE_STATE_GET(dmc, index); > + > + cache_state &= ~bitmask; > + EIO_CACHE_STATE_SET(dmc, index, cache_state); > +} > + > +static __inline__ void > +EIO_CACHE_STATE_ON(struct cache_c *dmc, index_t index, u_int8_t bitmask) > +{ > + u_int8_t cache_state = EIO_CACHE_STATE_GET(dmc, index); > + > + cache_state |= bitmask; > + EIO_CACHE_STATE_SET(dmc, index, cache_state); > +} > + > +void eio_set_warm_boot(void); > +#endif /* defined(__KERNEL__) */ > + > +#include "eio_ioctl.h" > + > +/* resolve conflict with scsi/scsi_device.h */ > +#ifdef __KERNEL__ > +#ifdef VERIFY > +#undef VERIFY > +#endif > +#define ENABLE_VERIFY > +#ifdef ENABLE_VERIFY > +/* Like ASSERT() but always compiled in */ > +#define VERIFY(x) do { \ > + if (unlikely(!(x))) { \ > + dump_stack(); \ > + panic("VERIFY: assertion (%s) failed at %s (%d)\n", \ > + # x, __FILE__, __LINE__); \ > + } \ > +} while (0) > +#else /* ENABLE_VERIFY */ > +#define VERIFY(x) do { } while (0); > +#endif /* ENABLE_VERIFY */ BUG_ON()? > + > +extern sector_t eio_get_device_size(struct eio_bdev *); > +extern sector_t eio_get_device_start_sect(struct eio_bdev *); > +#endif /* __KERNEL__ */ > + > +#define EIO_INIT_EVENT(ev) \ > + do { \ > + (ev)->process = NULL; \ > + } while (0) > + > +/*Assumes that the macro gets called under the same spinlock as in wait event*/ > +#define EIO_SET_EVENT_AND_UNLOCK(ev, sl, flags) \ > + do { \ > + struct task_struct *p = NULL; \ > + if ((ev)->process) { \ > + (p) = (ev)->process; \ > + (ev)->process = NULL; \ > + } \ > + spin_unlock_irqrestore((sl), flags); \ > + if (p) { \ > + (void)wake_up_process(p); \ > + } \ > + } while (0) > + > +/*Assumes that the spin lock sl is taken while calling this macro*/ > +#define EIO_WAIT_EVENT(ev, sl, flags) \ > + do { \ > + (ev)->process = current; \ > + set_current_state(TASK_INTERRUPTIBLE); \ > + spin_unlock_irqrestore((sl), flags); \ > + (void)schedule_timeout(10 * HZ); \ > + spin_lock_irqsave((sl), flags); \ > + (ev)->process = NULL; \ > + } while (0) > + > +#define EIO_CLEAR_EVENT(ev) \ > + do { \ > + (ev)->process = NULL; \ > + } while (0) > + > +#include "eio_setlru.h" > +#include "eio_policy.h" > +#define EIO_CACHE(dmc) (EIO_MD8(dmc) ? (void *)dmc->cache_md8 : (void *)dmc->cache) > + > +#endif /* !EIO_INC_H */ Ooookay, that's enough, I need a break, I'll review more later. --D