* [PATCH 00/16] DRBD: a block device for HA clusters
@ 2009-04-30 11:26 Philipp Reisner
  2009-04-30 11:26 ` [PATCH 01/16] DRBD: major.h Philipp Reisner
                   ` (2 more replies)
  0 siblings, 3 replies; 90+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg, Philipp Reisner

Hi,

This is a repost of DRBD, to keep you updated about the ongoing
cleanups and improvements.

Patch set attached. Git tree available:
git pull git://git.drbd.org/linux-2.6-drbd.git drbd

We are looking for reviews!

Description

  DRBD is a shared-nothing, synchronously replicated block device. It
  is designed to serve as a building block for high availability
  clusters and, in this context, is a "drop-in" replacement for shared
  storage. Simplistically, you could see it as a network RAID 1.

  Although I use the "RAID1+NBD" metaphor myself, recent discussion
  revealed that one also needs to understand the differences.
  Here are just two examples of that:

   1) Think of a two-node HA cluster. Node A is active ('primary' in DRBD
    speak), has the filesystem mounted and the application running. Node B
    is in standby mode ('secondary' in DRBD speak).

    We lose network connectivity; the primary node continues to run, but the
    secondary no longer gets updates.

    Then we have a complete power failure and both nodes are down. When the
    data center is powered up again, at first only the power circuit of
    node B comes back up.

    Should node B offer the service right now?
      (DRBD has configurable policies for that.)

    Later on node A is brought up again as well; now let's assume node B was
    chosen to be the new primary node. What needs to be done?

    Modifications on B since it became primary need to be resynced to A.
    Modifications on A since it lost contact with B need to be rolled back.

    DRBD does that.

    How do you fit that into a RAID1+NBD model? NBD is just a block
    transport; it does not offer the ability to exchange dirty bitmaps or
    data generation identifiers, nor does the RAID1 code have a concept of
    those.

   2) When using DRBD over low-bandwidth links and a resync has to be run,
    DRBD offers the option to do a "checksum based resync". Similar to rsync,
    it at first exchanges only a checksum and transmits the whole data
    block only if the checksums differ (a small sketch follows below).

    That again is something that does not fit into the concepts of NBD or RAID1.
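
  A minimal user-space sketch of that idea follows. This is illustrative
  only -- it is not DRBD's wire protocol, and the digest below is a toy
  stand-in for whatever real checksum is actually used:

	#include <stddef.h>
	#include <stdint.h>

	/* toy 64-bit FNV-1a style mix, standing in for the real digest */
	static uint64_t block_digest(const unsigned char *blk, size_t len)
	{
		uint64_t h = 0xcbf29ce484222325ULL;
		size_t i;

		for (i = 0; i < len; i++)
			h = (h ^ blk[i]) * 0x100000001b3ULL;
		return h;
	}

	/* Each node computes the digest of its local copy of the block;
	 * only the small digests are exchanged, and the sync source sends
	 * the full block only when they differ. */
	static int block_needs_transfer(uint64_t source_digest,
					uint64_t target_digest)
	{
		return source_digest != target_digest;
	}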

  DRBD can also be used in dual-Primary mode (device writable on both
  nodes), which means it can exhibit shared-disk semantics in a
  shared-nothing cluster.  Needless to say, on top of dual-Primary
  DRBD a cluster file system is necessary to maintain cache
  coherency.

  More background on this can be found in this paper:
    http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf

  Beyond that, DRBD addresses various issues of cluster partitioning,
  which the MD/NBD stack, to the best of our knowledge, does not
  solve. The above-mentioned paper goes into some detail about that as
  well.

  DRBD can operate in synchronous or asynchronous mode. I want to point
  out that we guarantee never to violate a single possible write-after-write
  dependency when writing on the standby node (a tiny illustration follows
  below). More on that can be found in this paper:
    http://www.drbd.org/fileadmin/drbd/publications/drbd_lk9.pdf
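
  The following is a deliberately simplified sketch of what that guarantee
  means. It is not DRBD code, and applying every write strictly in
  completion order is merely one sufficient (if conservative) way to honour
  such dependencies:

	struct write_req {
		unsigned long long sector;
		const void *data;
		unsigned int len;
	};

	/* If the application issued W1 and waited for its completion
	 * before issuing W2, the standby's disk must never contain W2
	 * without W1. Applying replicated writes in the order in which
	 * the primary completed them preserves every such dependency. */
	static void apply_in_order(struct write_req *reqs, int n,
				   void (*submit_and_wait)(const struct write_req *))
	{
		int i;

		for (i = 0; i < n; i++)
			submit_and_wait(&reqs[i]);
	}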

  Last but not least, DRBD offers background resynchronisation and keeps
  an on-disk representation of the dirty bitmap up to date. A reasonable
  tradeoff between the number of metadata updates and resyncing more than
  needed is implemented with the activity log.
  More on that:
    http://www.drbd.org/fileadmin/drbd/publications/drbd-activity-logging_v6.pdf

Changes since 2009-04-10

  * Cleanup: Removed all CamelCase
  * Cleanup: Replaced DRBD's own tracing stuff with regular tracepoints
  * Cleanup: Removed ERR/INFO/ALERT ... macros, using dev_err/dev_info/... now
  * Cleanup: Minor stuff, as suggested in feedback on LKML
  * DRBD:    Bitmap compression feature was finalised
  * DRBD:    new disable_sendpage parameter

Changes since the post on 2009-03-30, all triggered by reviews

  * Improvements to Makefile and Kconfig
  * Simplified definitions of bm_flags' bitnumbers
  * Removed debugging aid

Changes since the post on 2009-03-23, from drbd-mainline

  * Updated to the final drbd-8.3.1 code
  * Optionally run-length encode bitmap transfers

Changes since the post on 2009-03-23, triggered by reviews

  * Using the latest proc_create() now
  * Moved the allocation of md_io_tmpp out of drbd_md_sync_page_io() into attach/detach
  * Removed the emacs mode selection comments
  * Removed DRBD_ratelimit()

cheers,
  Phil

^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH 01/16] DRBD: major.h
  2009-04-30 11:26 [PATCH 00/16] DRBD: a block device for HA clusters Philipp Reisner
@ 2009-04-30 11:26 ` Philipp Reisner
  2009-04-30 11:26   ` [PATCH 02/16] DRBD: lru_cache Philipp Reisner
  2009-05-01  8:59   ` [PATCH 01/16] DRBD: major.h Andrew Morton
  2009-05-01  8:59 ` [PATCH 00/16] DRBD: a block device for HA clusters Andrew Morton
  2009-05-03  5:53 ` Neil Brown
  2 siblings, 2 replies; 90+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg, Philipp Reisner

Since we have had a LANANA major number for years, and it is documented in devices.txt,
I think that this first patch can go upstream without further changes.
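
A minimal sketch (not part of this patch, and not DRBD's actual init code)
of how a driver would typically claim the reserved major at initialization
time with the usual register_blkdev() interface; the function name is
hypothetical and error handling is reduced to the bare minimum:

	#include <linux/fs.h>
	#include <linux/init.h>
	#include <linux/major.h>
	#include <linux/module.h>

	static int __init drbd_major_example_init(void)
	{
		int err;

		/* claim block major 147 (DRBD_MAJOR), reserved by this patch */
		err = register_blkdev(DRBD_MAJOR, "drbd");
		if (err < 0)
			return err;
		return 0;
	}
	module_init(drbd_major_example_init);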

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

---
diff --git a/include/linux/major.h b/include/linux/major.h
index 058ec15..6a8ca98 100644
--- a/include/linux/major.h
+++ b/include/linux/major.h
@@ -145,6 +145,7 @@
 #define UNIX98_PTY_MAJOR_COUNT	8
 #define UNIX98_PTY_SLAVE_MAJOR	(UNIX98_PTY_MASTER_MAJOR+UNIX98_PTY_MAJOR_COUNT)
 
+#define DRBD_MAJOR		147
 #define RTF_MAJOR		150
 #define RAW_MAJOR		162
 

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 02/16] DRBD: lru_cache
  2009-04-30 11:26 ` [PATCH 01/16] DRBD: major.h Philipp Reisner
@ 2009-04-30 11:26   ` Philipp Reisner
  2009-04-30 11:26     ` [PATCH 03/16] DRBD: activity_log Philipp Reisner
                       ` (2 more replies)
  2009-05-01  8:59   ` [PATCH 01/16] DRBD: major.h Andrew Morton
  1 sibling, 3 replies; 90+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg, Philipp Reisner

The lru_cache is a fixed-size cache of equally sized objects. It allows its
users to do arbitrary transactions in case an element in the cache needs to
be replaced. Its replacement policy is LRU.
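
A usage sketch of the intended calling convention (an assumed typical
caller, not code from this patch): lc_get() either returns the element for
the requested number directly, or hands back a recycled element that still
carries its old number. In the latter case the caller does its housekeeping
(DRBD writes an activity log transaction) and confirms the change with
lc_changed() before eventually dropping the reference with lc_put().

	#include <linux/errno.h>
	#include "lru_cache.h"

	static int lru_cache_example(unsigned int enr)
	{
		struct lru_cache *cache;
		struct lc_element *e;

		cache = lc_alloc("example", 64, sizeof(struct lc_element), NULL);
		if (!cache)
			return -ENOMEM;

		e = lc_get(cache, enr);
		if (e && e->lc_number != enr) {
			/* a recycled element: persist the change, then confirm it */
			lc_changed(cache, e);
		}
		if (e)
			lc_put(cache, e);	/* drop the reference again */

		lc_free(cache);
		return 0;
	}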

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

---
diff --git a/drivers/block/drbd/lru_cache.h b/drivers/block/drbd/lru_cache.h
new file mode 100644
index 0000000..eabf897
--- /dev/null
+++ b/drivers/block/drbd/lru_cache.h
@@ -0,0 +1,116 @@
+/*
+   lru_cache.h
+
+   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+   Copyright (C) 2003-2008, LINBIT Information Technologies GmbH.
+   Copyright (C) 2003-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+   Copyright (C) 2003-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+
+   drbd is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   drbd is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with drbd; see the file COPYING.  If not, write to
+   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+
+ */
+
+#ifndef LRU_CACHE_H
+#define LRU_CACHE_H
+
+#include <linux/list.h>
+
+struct lc_element {
+	struct hlist_node colision;
+	struct list_head list;		 /* LRU list or free list */
+	unsigned int refcnt;
+	unsigned int lc_number;
+};
+
+struct lru_cache {
+	struct list_head lru;
+	struct list_head free;
+	struct list_head in_use;
+	size_t element_size;
+	unsigned int  nr_elements;
+	unsigned int  new_number;
+
+	unsigned int used;
+	unsigned long flags;
+	unsigned long hits, misses, starving, dirty, changed;
+	struct lc_element *changing_element; /* just for paranoia */
+
+	void  *lc_private;
+	const char *name;
+
+	struct hlist_head slot[0];
+	/* hash colision chains here, then element storage. */
+};
+
+
+/* flag-bits for lru_cache */
+enum {
+	__LC_PARANOIA,
+	__LC_DIRTY,
+	__LC_STARVING,
+};
+#define LC_PARANOIA (1<<__LC_PARANOIA)
+#define LC_DIRTY    (1<<__LC_DIRTY)
+#define LC_STARVING (1<<__LC_STARVING)
+
+extern struct lru_cache *lc_alloc(const char *name, unsigned int e_count,
+				  size_t e_size, void *private_p);
+extern void lc_reset(struct lru_cache *lc);
+extern void lc_free(struct lru_cache *lc);
+extern void lc_set(struct lru_cache *lc, unsigned int enr, int index);
+extern void lc_del(struct lru_cache *lc, struct lc_element *element);
+
+extern struct lc_element *lc_try_get(struct lru_cache *lc, unsigned int enr);
+extern struct lc_element *lc_find(struct lru_cache *lc, unsigned int enr);
+extern struct lc_element *lc_get(struct lru_cache *lc, unsigned int enr);
+extern unsigned int lc_put(struct lru_cache *lc, struct lc_element *e);
+extern void lc_changed(struct lru_cache *lc, struct lc_element *e);
+
+struct seq_file;
+extern size_t lc_printf_stats(struct seq_file *seq, struct lru_cache *lc);
+
+void lc_dump(struct lru_cache *lc, struct seq_file *seq, char *utext,
+	     void (*detail) (struct seq_file *, struct lc_element *));
+
+/* This can be used to stop lc_get from changing the set of active elements.
+ * Note that the reference counts and order on the lru list may still change.
+ * returns true if we acquired the lock.
+ */
+static inline int lc_try_lock(struct lru_cache *lc)
+{
+	return !test_and_set_bit(__LC_DIRTY, &lc->flags);
+}
+
+static inline void lc_unlock(struct lru_cache *lc)
+{
+	clear_bit(__LC_DIRTY, &lc->flags);
+	smp_mb__after_clear_bit();
+}
+
+static inline int lc_is_used(struct lru_cache *lc, unsigned int enr)
+{
+	struct lc_element *e = lc_find(lc, enr);
+	return e && e->refcnt;
+}
+
+#define LC_FREE (-1U)
+
+#define lc_e_base(lc)  ((char *)((lc)->slot + (lc)->nr_elements))
+#define lc_entry(lc, i) ((struct lc_element *) \
+		       (lc_e_base(lc) + (i)*(lc)->element_size))
+#define lc_index_of(lc, e) (((char *)(e) - lc_e_base(lc))/(lc)->element_size)
+
+#endif
diff --git a/drivers/block/drbd/lru_cache.c b/drivers/block/drbd/lru_cache.c
new file mode 100644
index 0000000..71858ff
--- /dev/null
+++ b/drivers/block/drbd/lru_cache.c
@@ -0,0 +1,398 @@
+/*
+   lru_cache.c
+
+   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+   Copyright (C) 2003-2008, LINBIT Information Technologies GmbH.
+   Copyright (C) 2003-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+   Copyright (C) 2003-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+
+   drbd is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   drbd is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with drbd; see the file COPYING.  If not, write to
+   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+
+ */
+
+#include <linux/bitops.h>
+#include <linux/vmalloc.h>
+#include <linux/string.h> /* for memset */
+#include <linux/seq_file.h> /* for seq_printf */
+#include "lru_cache.h"
+
+/* this is developers aid only! */
+#define PARANOIA_ENTRY() BUG_ON(test_and_set_bit(__LC_PARANOIA, &lc->flags))
+#define PARANOIA_LEAVE() do { clear_bit(__LC_PARANOIA, &lc->flags); smp_mb__after_clear_bit(); } while (0)
+#define RETURN(x...)     do { PARANOIA_LEAVE(); return x ; } while (0)
+
+static inline size_t size_of_lc(unsigned int e_count, size_t e_size)
+{
+	return sizeof(struct lru_cache)
+	     + e_count * (e_size + sizeof(struct hlist_head));
+}
+
+static inline void lc_init(struct lru_cache *lc,
+		const size_t bytes, const char *name,
+		const unsigned int e_count, const size_t e_size,
+		void *private_p)
+{
+	struct lc_element *e;
+	unsigned int i;
+
+	BUG_ON(!e_count);
+
+	memset(lc, 0, bytes);
+	INIT_LIST_HEAD(&lc->in_use);
+	INIT_LIST_HEAD(&lc->lru);
+	INIT_LIST_HEAD(&lc->free);
+	lc->element_size = e_size;
+	lc->nr_elements  = e_count;
+	lc->new_number	 = -1;
+	lc->lc_private   = private_p;
+	lc->name         = name;
+	for (i = 0; i < e_count; i++) {
+		e = lc_entry(lc, i);
+		e->lc_number = LC_FREE;
+		list_add(&e->list, &lc->free);
+		/* memset(,0,) did the rest of init for us */
+	}
+}
+
+/**
+ * lc_alloc: allocates memory for @e_count objects of @e_size bytes plus the
+ * struct lru_cache, and the hash table slots.
+ * returns pointer to a newly initialized lru_cache object with said parameters.
+ */
+struct lru_cache *lc_alloc(const char *name, unsigned int e_count,
+			   size_t e_size, void *private_p)
+{
+	struct lru_cache   *lc;
+	size_t bytes;
+
+	BUG_ON(!e_count);
+	e_size = max(sizeof(struct lc_element), e_size);
+	bytes = size_of_lc(e_count, e_size);
+	lc = vmalloc(bytes);
+	if (lc)
+		lc_init(lc, bytes, name, e_count, e_size, private_p);
+	return lc;
+}
+
+/**
+ * lc_free: Frees memory allocated by lc_alloc.
+ * @lc: The lru_cache object
+ */
+void lc_free(struct lru_cache *lc)
+{
+	vfree(lc);
+}
+
+/**
+ * lc_reset: does a full reset for @lc and the hash table slots.
+ * It is roughly the equivalent of re-allocating a fresh lru_cache object,
+ * basically a short cut to lc_free(lc); lc = lc_alloc(...);
+ */
+void lc_reset(struct lru_cache *lc)
+{
+	lc_init(lc, size_of_lc(lc->nr_elements, lc->element_size), lc->name,
+			lc->nr_elements, lc->element_size, lc->lc_private);
+}
+
+size_t	lc_printf_stats(struct seq_file *seq, struct lru_cache *lc)
+{
+	/* NOTE:
+	 * total calls to lc_get are
+	 * (starving + hits + misses)
+	 * misses include "dirty" count (update from another thread in
+	 * progress) and "changed", when this in fact led to a successful
+	 * update of the cache.
+	 */
+	return seq_printf(seq, "\t%s: used:%u/%u "
+		"hits:%lu misses:%lu starving:%lu dirty:%lu changed:%lu\n",
+		lc->name, lc->used, lc->nr_elements,
+		lc->hits, lc->misses, lc->starving, lc->dirty, lc->changed);
+}
+
+static unsigned int lc_hash_fn(struct lru_cache *lc, unsigned int enr)
+{
+	return enr % lc->nr_elements;
+}
+
+
+/**
+ * lc_find: Returns the pointer to an element, if the element is present
+ * in the hash table. In case it is not this function returns NULL.
+ * @lc: The lru_cache object
+ * @enr: element number
+ */
+struct lc_element *lc_find(struct lru_cache *lc, unsigned int enr)
+{
+	struct hlist_node *n;
+	struct lc_element *e;
+
+	BUG_ON(!lc);
+	hlist_for_each_entry(e, n, lc->slot + lc_hash_fn(lc, enr), colision) {
+		if (e->lc_number == enr)
+			return e;
+	}
+	return NULL;
+}
+
+static struct lc_element *lc_evict(struct lru_cache *lc)
+{
+	struct list_head  *n;
+	struct lc_element *e;
+
+	if (list_empty(&lc->lru))
+		return NULL;
+
+	n = lc->lru.prev;
+	e = list_entry(n, struct lc_element, list);
+
+	list_del(&e->list);
+	hlist_del(&e->colision);
+	return e;
+}
+
+/**
+ * lc_del: Removes an element from the cache (and therefore adds the
+ * element's storage to the free list)
+ *
+ * @lc: The lru_cache object
+ * @e: The element to remove
+ */
+void lc_del(struct lru_cache *lc, struct lc_element *e)
+{
+	PARANOIA_ENTRY();
+	BUG_ON(e->refcnt);
+	list_del(&e->list);
+	hlist_del_init(&e->colision);
+	e->lc_number = LC_FREE;
+	e->refcnt = 0;
+	list_add(&e->list, &lc->free);
+	RETURN();
+}
+
+static struct lc_element *lc_get_unused_element(struct lru_cache *lc)
+{
+	struct list_head *n;
+
+	if (list_empty(&lc->free))
+		return lc_evict(lc);
+
+	n = lc->free.next;
+	list_del(n);
+	return list_entry(n, struct lc_element, list);
+}
+
+static int lc_unused_element_available(struct lru_cache *lc)
+{
+	if (!list_empty(&lc->free))
+		return 1; /* something on the free list */
+	if (!list_empty(&lc->lru))
+		return 1;  /* something to evict */
+
+	return 0;
+}
+
+
+/**
+ * lc_get: Finds an element in the cache, increases its usage count,
+ * "touches" and returns it.
+ * In case the requested number is not present, it needs to be added to the
+ * cache. Therefore it is possible that another element becomes evicted from
+ * the cache. In either case, the user is notified so he is able to e.g. keep
+ * a persistent log of the cache changes, and therefore the objects in use.
+ *
+ * Return values:
+ *  NULL    if the requested element number was not in the cache, and no unused
+ *          element could be recycled
+ *  pointer to the element with the REQUESTED element number
+ *          In this case, it can be used right away
+ *
+ *  pointer to an UNUSED element with some different element number.
+ *          In this case, the cache is marked dirty, and the returned element
+ *          pointer is removed from the lru list and hash collision chains.
+ *          The user now should do whatever housekeeping is necessary. Then he
+ *          needs to call lc_changed(lc, element_pointer), to finish the
+ *          change.
+ *
+ * NOTE: The user needs to check the lc_number on EACH use, so he recognizes
+ *       any cache set change.
+ *
+ * @lc: The lru_cache object
+ * @enr: element number
+ */
+struct lc_element *lc_get(struct lru_cache *lc, unsigned int enr)
+{
+	struct lc_element *e;
+
+	BUG_ON(!lc);
+	BUG_ON(!lc->nr_elements);
+
+	PARANOIA_ENTRY();
+	if (lc->flags & LC_STARVING) {
+		++lc->starving;
+		RETURN(NULL);
+	}
+
+	e = lc_find(lc, enr);
+	if (e) {
+		++lc->hits;
+		if (e->refcnt++ == 0)
+			lc->used++;
+		list_move(&e->list, &lc->in_use); /* Not evictable... */
+		RETURN(e);
+	}
+
+	++lc->misses;
+
+	/* In case there is nothing available and we can not kick out
+	 * the LRU element, we have to wait ...
+	 */
+	if (!lc_unused_element_available(lc)) {
+		__set_bit(__LC_STARVING, &lc->flags);
+		RETURN(NULL);
+	}
+
+	/* it was not present in the cache, find an unused element,
+	 * which then is replaced.
+	 * we need to update the cache; serialize on lc->flags & LC_DIRTY
+	 */
+	if (test_and_set_bit(__LC_DIRTY, &lc->flags)) {
+		++lc->dirty;
+		RETURN(NULL);
+	}
+
+	e = lc_get_unused_element(lc);
+	BUG_ON(!e);
+
+	clear_bit(__LC_STARVING, &lc->flags);
+	BUG_ON(++e->refcnt != 1);
+	lc->used++;
+
+	lc->changing_element = e;
+	lc->new_number = enr;
+
+	RETURN(e);
+}
+
+/* similar to lc_get,
+ * but only gets a new reference on an existing element.
+ * you either get the requested element, or NULL.
+ */
+struct lc_element *lc_try_get(struct lru_cache *lc, unsigned int enr)
+{
+	struct lc_element *e;
+
+	BUG_ON(!lc);
+	BUG_ON(!lc->nr_elements);
+
+	PARANOIA_ENTRY();
+	if (lc->flags & LC_STARVING) {
+		++lc->starving;
+		RETURN(NULL);
+	}
+
+	e = lc_find(lc, enr);
+	if (e) {
+		++lc->hits;
+		if (e->refcnt++ == 0)
+			lc->used++;
+		list_move(&e->list, &lc->in_use); /* Not evictable... */
+	}
+	RETURN(e);
+}
+
+void lc_changed(struct lru_cache *lc, struct lc_element *e)
+{
+	PARANOIA_ENTRY();
+	BUG_ON(e != lc->changing_element);
+	++lc->changed;
+	e->lc_number = lc->new_number;
+	list_add(&e->list, &lc->in_use);
+	hlist_add_head(&e->colision,
+		lc->slot + lc_hash_fn(lc, lc->new_number));
+	lc->changing_element = NULL;
+	lc->new_number = -1;
+	clear_bit(__LC_DIRTY, &lc->flags);
+	smp_mb__after_clear_bit();
+	RETURN();
+}
+
+
+unsigned int lc_put(struct lru_cache *lc, struct lc_element *e)
+{
+	BUG_ON(!lc);
+	BUG_ON(!lc->nr_elements);
+	BUG_ON(!e);
+
+	PARANOIA_ENTRY();
+	BUG_ON(e->refcnt == 0);
+	BUG_ON(e == lc->changing_element);
+	if (--e->refcnt == 0) {
+		/* move it to the front of LRU. */
+		list_move(&e->list, &lc->lru);
+		lc->used--;
+		clear_bit(__LC_STARVING, &lc->flags);
+		smp_mb__after_clear_bit();
+	}
+	RETURN(e->refcnt);
+}
+
+
+/**
+ * lc_set: Sets an element in the cache. You might use this function to
+ * set up the cache. It is expected that the elements are properly initialized.
+ * @lc: The lru_cache object
+ * @enr: element number
+ * @index: The elements' position in the cache
+ */
+void lc_set(struct lru_cache *lc, unsigned int enr, int index)
+{
+	struct lc_element *e;
+
+	if (index < 0 || index >= lc->nr_elements)
+		return;
+
+	e = lc_entry(lc, index);
+	e->lc_number = enr;
+
+	hlist_del_init(&e->colision);
+	hlist_add_head(&e->colision, lc->slot + lc_hash_fn(lc, enr));
+	list_move(&e->list, e->refcnt ? &lc->in_use : &lc->lru);
+}
+
+/**
+ * lc_dump: Dump a complete LRU cache to seq in textual form.
+ */
+void lc_dump(struct lru_cache *lc, struct seq_file *seq, char *utext,
+	     void (*detail) (struct seq_file *, struct lc_element *))
+{
+	unsigned int nr_elements = lc->nr_elements;
+	struct lc_element *e;
+	int i;
+
+	seq_printf(seq, "\tnn: lc_number refcnt %s\n ", utext);
+	for (i = 0; i < nr_elements; i++) {
+		e = lc_entry(lc, i);
+		if (e->lc_number == LC_FREE) {
+			seq_printf(seq, "\t%2d: FREE\n", i);
+		} else {
+			seq_printf(seq, "\t%2d: %4u %4u    ", i,
+				   e->lc_number,
+				   e->refcnt);
+			detail(seq, e);
+		}
+	}
+}
+

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 03/16] DRBD: activity_log
  2009-04-30 11:26   ` [PATCH 02/16] DRBD: lru_cache Philipp Reisner
@ 2009-04-30 11:26     ` Philipp Reisner
  2009-04-30 11:26       ` [PATCH 04/16] DRBD: bitmap Philipp Reisner
  2009-05-01  9:01       ` [PATCH 03/16] DRBD: activity_log Andrew Morton
  2009-05-01  8:59     ` [PATCH 02/16] DRBD: lru_cache Andrew Morton
  2009-05-02 23:51     ` Kyle Moffett
  2 siblings, 2 replies; 90+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg, Philipp Reisner

Within DRBD the activity log is used to track extents (4 MB each) in which IO
happens (or happened recently). It is based on the LRU cache. Each change of
the activity log causes a metadata update (a single sector write).  The size
of the activity log is configured by the user and is a tradeoff between
minimizing metadata updates and the resync time after the crash of a
primary node; a small illustration follows below.
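
A small illustration of the mapping and of the tradeoff. The 22-bit shift is
an assumption derived from the 4 MB extent size (it mirrors the
sector-to-extent calculation in drbd_al_begin_io() below), and the constant
name and the extent count of 61 are purely illustrative:

	/* 4 MB extents: log2(4 MB) = 22; with 512-byte (2^9) sectors one
	 * extent covers 2^(22 - 9) = 8192 sectors. */
	#define EXAMPLE_AL_EXTENT_SIZE_B 22

	static unsigned int al_extent_of_sector(unsigned long long sector)
	{
		return (unsigned int)(sector >> (EXAMPLE_AL_EXTENT_SIZE_B - 9));
	}

	/* Worst case after a primary crash every extent currently in the
	 * activity log has to be resynced: with, say, 61 configured
	 * extents that is at most 61 * 4 MB = 244 MB. */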

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

---
diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
new file mode 100644
index 0000000..c894b4f
--- /dev/null
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -0,0 +1,1458 @@
+/*
+   drbd_actlog.c
+
+   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+   Copyright (C) 2003-2008, LINBIT Information Technologies GmbH.
+   Copyright (C) 2003-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+   Copyright (C) 2003-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+
+   drbd is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   drbd is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with drbd; see the file COPYING.  If not, write to
+   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+
+ */
+
+#include <linux/slab.h>
+#include <linux/drbd.h>
+#include "drbd_int.h"
+#include "drbd_tracing.h"
+#include "drbd_wrappers.h"
+
+/* I do not believe that all storage media can guarantee atomic
+ * 512 byte write operations. When the journal is read, only
+ * transactions with correct xor_sums are considered.
+ * sizeof() = 512 byte */
+struct __attribute__((packed)) al_transaction {
+	u32       magic;
+	u32       tr_number;
+	struct __attribute__((packed)) {
+		u32 pos;
+		u32 extent; } updates[1 + AL_EXTENTS_PT];
+	u32       xor_sum;
+};
+
+struct update_odbm_work {
+	struct drbd_work w;
+	unsigned int enr;
+};
+
+struct update_al_work {
+	struct drbd_work w;
+	struct lc_element *al_ext;
+	struct completion event;
+	unsigned int enr;
+	/* if old_enr != LC_FREE, write corresponding bitmap sector, too */
+	unsigned int old_enr;
+};
+
+struct drbd_atodb_wait {
+	atomic_t           count;
+	struct completion  io_done;
+	struct drbd_conf   *mdev;
+	int                error;
+};
+
+
+int w_al_write_transaction(struct drbd_conf *, struct drbd_work *, int);
+
+/* The actual tracepoint needs to have constant number of known arguments...
+ */
+void trace_drbd_resync(struct drbd_conf *mdev, int level, const char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	trace__drbd_resync(mdev, level, fmt, ap);
+	va_end(ap);
+}
+
+STATIC int _drbd_md_sync_page_io(struct drbd_conf *mdev,
+				 struct drbd_backing_dev *bdev,
+				 struct page *page, sector_t sector,
+				 int rw, int size)
+{
+	struct bio *bio;
+	struct drbd_md_io md_io;
+	int ok;
+
+	md_io.mdev = mdev;
+	init_completion(&md_io.event);
+	md_io.error = 0;
+
+	if (rw == WRITE && !test_bit(MD_NO_BARRIER, &mdev->flags))
+		rw |= (1<<BIO_RW_BARRIER);
+	rw |= ((1<<BIO_RW_UNPLUG) | (1<<BIO_RW_SYNCIO));
+
+ retry:
+	bio = bio_alloc(GFP_NOIO, 1);
+	bio->bi_bdev = bdev->md_bdev;
+	bio->bi_sector = sector;
+	ok = (bio_add_page(bio, page, size, 0) == size);
+	if (!ok)
+		goto out;
+	bio->bi_private = &md_io;
+	bio->bi_end_io = drbd_md_io_complete;
+	bio->bi_rw = rw;
+
+	trace_drbd_bio(mdev, "Md", bio, 0, NULL);
+
+	if (FAULT_ACTIVE(mdev, (rw & WRITE) ? DRBD_FAULT_MD_WR : DRBD_FAULT_MD_RD))
+		bio_endio(bio, -EIO);
+	else
+		submit_bio(rw, bio);
+	wait_for_completion(&md_io.event);
+	ok = bio_flagged(bio, BIO_UPTODATE) && md_io.error == 0;
+
+	/* check for unsupported barrier op.
+	 * would rather check on EOPNOTSUPP, but that is not reliable.
+	 * don't try again for ANY return value != 0 */
+	if (unlikely(bio_barrier(bio) && !ok)) {
+		/* Try again with no barrier */
+		dev_warn(DEV, "Barriers not supported on meta data device - disabling\n");
+		set_bit(MD_NO_BARRIER, &mdev->flags);
+		rw &= ~(1 << BIO_RW_BARRIER);
+		bio_put(bio);
+		goto retry;
+	}
+ out:
+	bio_put(bio);
+	return ok;
+}
+
+int drbd_md_sync_page_io(struct drbd_conf *mdev, struct drbd_backing_dev *bdev,
+			 sector_t sector, int rw)
+{
+	int hardsect, mask, ok;
+	int offset = 0;
+	struct page *iop = mdev->md_io_page;
+
+	D_ASSERT(mutex_is_locked(&mdev->md_io_mutex));
+
+	BUG_ON(!bdev->md_bdev);
+
+	hardsect = drbd_get_hardsect(bdev->md_bdev);
+	if (hardsect == 0)
+		hardsect = MD_HARDSECT;
+
+	/* in case hardsect != 512 [ s390 only? ] */
+	if (hardsect != MD_HARDSECT) {
+		mask = (hardsect / MD_HARDSECT) - 1;
+		D_ASSERT(mask == 1 || mask == 3 || mask == 7);
+		D_ASSERT(hardsect == (mask+1) * MD_HARDSECT);
+		offset = sector & mask;
+		sector = sector & ~mask;
+		iop = mdev->md_io_tmpp;
+
+		if (rw == WRITE) {
+			void *p = page_address(mdev->md_io_page);
+			void *hp = page_address(mdev->md_io_tmpp);
+
+			ok = _drbd_md_sync_page_io(mdev, bdev, iop,
+						   sector, READ, hardsect);
+
+			if (unlikely(!ok)) {
+				dev_err(DEV, "drbd_md_sync_page_io(,%llus,"
+				    "READ [hardsect!=512]) failed!\n",
+				    (unsigned long long)sector);
+				return 0;
+			}
+
+			memcpy(hp + offset*MD_HARDSECT , p, MD_HARDSECT);
+		}
+	}
+
+	if (sector < drbd_md_first_sector(bdev) ||
+	    sector > drbd_md_last_sector(bdev))
+		dev_alert(DEV, "%s [%d]:%s(,%llus,%s) out of range md access!\n",
+		     current->comm, current->pid, __func__,
+		     (unsigned long long)sector, rw ? "WRITE" : "READ");
+
+	ok = _drbd_md_sync_page_io(mdev, bdev, iop, sector, rw, hardsect);
+	if (unlikely(!ok)) {
+		dev_err(DEV, "drbd_md_sync_page_io(,%llus,%s) failed!\n",
+		    (unsigned long long)sector, rw ? "WRITE" : "READ");
+		return 0;
+	}
+
+	if (hardsect != MD_HARDSECT && rw == READ) {
+		void *p = page_address(mdev->md_io_page);
+		void *hp = page_address(mdev->md_io_tmpp);
+
+		memcpy(p, hp + offset*MD_HARDSECT, MD_HARDSECT);
+	}
+
+	return ok;
+}
+
+static inline
+struct lc_element *_al_get(struct drbd_conf *mdev, unsigned int enr)
+{
+	struct lc_element *al_ext;
+	struct bm_extent  *bm_ext;
+	unsigned long     al_flags = 0;
+
+	spin_lock_irq(&mdev->al_lock);
+	bm_ext = (struct bm_extent *)
+		lc_find(mdev->resync, enr/AL_EXT_PER_BM_SECT);
+	if (unlikely(bm_ext != NULL)) {
+		if (test_bit(BME_NO_WRITES, &bm_ext->flags)) {
+			spin_unlock_irq(&mdev->al_lock);
+			return NULL;
+		}
+	}
+	al_ext   = lc_get(mdev->act_log, enr);
+	al_flags = mdev->act_log->flags;
+	spin_unlock_irq(&mdev->al_lock);
+
+	/*
+	if (!al_ext) {
+		if (al_flags & LC_STARVING)
+			dev_warn(DEV, "Have to wait for LRU element (AL too small?)\n");
+		if (al_flags & LC_DIRTY)
+			dev_warn(DEV, "Ongoing AL update (AL device too slow?)\n");
+	}
+	*/
+
+	return al_ext;
+}
+
+void drbd_al_begin_io(struct drbd_conf *mdev, sector_t sector)
+{
+	unsigned int enr = (sector >> (AL_EXTENT_SIZE_B-9));
+	struct lc_element *al_ext;
+	struct update_al_work al_work;
+
+	D_ASSERT(atomic_read(&mdev->local_cnt) > 0);
+
+	trace_drbd_actlog(mdev, sector, "al_begin_io");
+
+	wait_event(mdev->al_wait, (al_ext = _al_get(mdev, enr)));
+
+	if (al_ext->lc_number != enr) {
+		/* drbd_al_write_transaction(mdev,al_ext,enr);
+		   generic_make_request() are serialized on the
+		   current->bio_tail list now. Therefore we have
+		   to delegate writing something to the AL to the
+		   worker thread. */
+		init_completion(&al_work.event);
+		al_work.al_ext = al_ext;
+		al_work.enr = enr;
+		al_work.old_enr = al_ext->lc_number;
+		al_work.w.cb = w_al_write_transaction;
+		drbd_queue_work_front(&mdev->data.work, &al_work.w);
+		wait_for_completion(&al_work.event);
+
+		mdev->al_writ_cnt++;
+
+		spin_lock_irq(&mdev->al_lock);
+		lc_changed(mdev->act_log, al_ext);
+		spin_unlock_irq(&mdev->al_lock);
+		wake_up(&mdev->al_wait);
+	}
+}
+
+void drbd_al_complete_io(struct drbd_conf *mdev, sector_t sector)
+{
+	unsigned int enr = (sector >> (AL_EXTENT_SIZE_B-9));
+	struct lc_element *extent;
+	unsigned long flags;
+
+	trace_drbd_actlog(mdev, sector, "al_complete_io");
+
+	spin_lock_irqsave(&mdev->al_lock, flags);
+
+	extent = lc_find(mdev->act_log, enr);
+
+	if (!extent) {
+		spin_unlock_irqrestore(&mdev->al_lock, flags);
+		dev_err(DEV, "al_complete_io() called on inactive extent %u\n", enr);
+		return;
+	}
+
+	if (lc_put(mdev->act_log, extent) == 0)
+		wake_up(&mdev->al_wait);
+
+	spin_unlock_irqrestore(&mdev->al_lock, flags);
+}
+
+int
+w_al_write_transaction(struct drbd_conf *mdev, struct drbd_work *w, int unused)
+{
+	struct update_al_work *aw = (struct update_al_work *)w;
+	struct lc_element *updated = aw->al_ext;
+	const unsigned int new_enr = aw->enr;
+	const unsigned int evicted = aw->old_enr;
+
+	struct al_transaction *buffer;
+	sector_t sector;
+	int i, n, mx;
+	unsigned int extent_nr;
+	u32 xor_sum = 0;
+
+	if (!inc_local(mdev)) {
+		dev_err(DEV, "inc_local() failed in w_al_write_transaction\n");
+		complete(&((struct update_al_work *)w)->event);
+		return 1;
+	}
+	/* do we have to do a bitmap write, first?
+	 * TODO reduce maximum latency:
+	 * submit both bios, then wait for both,
+	 * instead of doing two synchronous sector writes. */
+	if (mdev->state.conn < C_CONNECTED && evicted != LC_FREE)
+		drbd_bm_write_sect(mdev, evicted/AL_EXT_PER_BM_SECT);
+
+	mutex_lock(&mdev->md_io_mutex); /* protects md_io_page, al_tr_cycle, ... */
+	buffer = (struct al_transaction *)page_address(mdev->md_io_page);
+
+	buffer->magic = __constant_cpu_to_be32(DRBD_MAGIC);
+	buffer->tr_number = cpu_to_be32(mdev->al_tr_number);
+
+	n = lc_index_of(mdev->act_log, updated);
+
+	buffer->updates[0].pos = cpu_to_be32(n);
+	buffer->updates[0].extent = cpu_to_be32(new_enr);
+
+	xor_sum ^= new_enr;
+
+	mx = min_t(int, AL_EXTENTS_PT,
+		   mdev->act_log->nr_elements - mdev->al_tr_cycle);
+	for (i = 0; i < mx; i++) {
+		extent_nr = lc_entry(mdev->act_log,
+				     mdev->al_tr_cycle+i)->lc_number;
+		buffer->updates[i+1].pos = cpu_to_be32(mdev->al_tr_cycle+i);
+		buffer->updates[i+1].extent = cpu_to_be32(extent_nr);
+		xor_sum ^= extent_nr;
+	}
+	for (; i < AL_EXTENTS_PT; i++) {
+		buffer->updates[i+1].pos = __constant_cpu_to_be32(-1);
+		buffer->updates[i+1].extent = __constant_cpu_to_be32(LC_FREE);
+		xor_sum ^= LC_FREE;
+	}
+	mdev->al_tr_cycle += AL_EXTENTS_PT;
+	if (mdev->al_tr_cycle >= mdev->act_log->nr_elements)
+		mdev->al_tr_cycle = 0;
+
+	buffer->xor_sum = cpu_to_be32(xor_sum);
+
+	sector =  mdev->bc->md.md_offset
+		+ mdev->bc->md.al_offset + mdev->al_tr_pos;
+
+	if (!drbd_md_sync_page_io(mdev, mdev->bc, sector, WRITE)) {
+		drbd_chk_io_error(mdev, 1, TRUE);
+		drbd_io_error(mdev, TRUE);
+	}
+
+	if (++mdev->al_tr_pos >
+	    div_ceil(mdev->act_log->nr_elements, AL_EXTENTS_PT))
+		mdev->al_tr_pos = 0;
+
+	D_ASSERT(mdev->al_tr_pos < MD_AL_MAX_SIZE);
+	mdev->al_tr_number++;
+
+	mutex_unlock(&mdev->md_io_mutex);
+
+	complete(&((struct update_al_work *)w)->event);
+	dec_local(mdev);
+
+	return 1;
+}
+
+/**
+ * drbd_al_read_tr: Reads a single transaction record from the
+ * on disk activity log.
+ * Returns -1 on IO error, 0 on checksum error and 1 if it is a valid
+ * record.
+ */
+STATIC int drbd_al_read_tr(struct drbd_conf *mdev,
+			   struct drbd_backing_dev *bdev,
+			   struct al_transaction *b,
+			   int index)
+{
+	sector_t sector;
+	int rv, i;
+	u32 xor_sum = 0;
+
+	sector = bdev->md.md_offset + bdev->md.al_offset + index;
+
+	/* Don't process errors normally,
+	 * as this is done before the disk is attached! */
+	if (!drbd_md_sync_page_io(mdev, bdev, sector, READ))
+		return -1;
+
+	rv = (be32_to_cpu(b->magic) == DRBD_MAGIC);
+
+	for (i = 0; i < AL_EXTENTS_PT + 1; i++)
+		xor_sum ^= be32_to_cpu(b->updates[i].extent);
+	rv &= (xor_sum == be32_to_cpu(b->xor_sum));
+
+	return rv;
+}
+
+/**
+ * drbd_al_read_log: Restores the activity log from its on disk
+ * representation. Returns 1 on success, returns 0 when
+ * reading the log failed due to IO errors.
+ */
+int drbd_al_read_log(struct drbd_conf *mdev, struct drbd_backing_dev *bdev)
+{
+	struct al_transaction *buffer;
+	int i;
+	int rv;
+	int mx;
+	int cnr;
+	int active_extents = 0;
+	int transactions = 0;
+	int overflow = 0;
+	int from = -1;
+	int to = -1;
+	u32 from_tnr = -1;
+	u32 to_tnr = 0;
+
+	mx = div_ceil(mdev->act_log->nr_elements, AL_EXTENTS_PT);
+
+	/* lock out all other meta data io for now,
+	 * and make sure the page is mapped.
+	 */
+	mutex_lock(&mdev->md_io_mutex);
+	buffer = page_address(mdev->md_io_page);
+
+	/* Find the valid transaction in the log */
+	for (i = 0; i <= mx; i++) {
+		rv = drbd_al_read_tr(mdev, bdev, buffer, i);
+		if (rv == 0)
+			continue;
+		if (rv == -1) {
+			mutex_unlock(&mdev->md_io_mutex);
+			return 0;
+		}
+		cnr = be32_to_cpu(buffer->tr_number);
+
+		if (cnr == -1)
+			overflow = 1;
+
+		if (cnr < from_tnr && !overflow) {
+			from = i;
+			from_tnr = cnr;
+		}
+		if (cnr > to_tnr) {
+			to = i;
+			to_tnr = cnr;
+		}
+	}
+
+	if (from == -1 || to == -1) {
+		dev_warn(DEV, "No usable activity log found.\n");
+
+		mutex_unlock(&mdev->md_io_mutex);
+		return 1;
+	}
+
+	/* Read the valid transactions.
+	 * dev_info(DEV, "Reading from %d to %d.\n",from,to); */
+	i = from;
+	while (1) {
+		int j, pos;
+		unsigned int extent_nr;
+		unsigned int trn;
+
+		rv = drbd_al_read_tr(mdev, bdev, buffer, i);
+		ERR_IF(rv == 0) goto cancel;
+		if (rv == -1) {
+			mutex_unlock(&mdev->md_io_mutex);
+			return 0;
+		}
+
+		trn = be32_to_cpu(buffer->tr_number);
+
+		spin_lock_irq(&mdev->al_lock);
+
+		/* This loop runs backwards because in the cyclic
+		   elements there might be an old version of the
+		   updated element (in slot 0). So the element in slot 0
+		   can overwrite old versions. */
+		for (j = AL_EXTENTS_PT; j >= 0; j--) {
+			pos = be32_to_cpu(buffer->updates[j].pos);
+			extent_nr = be32_to_cpu(buffer->updates[j].extent);
+
+			if (extent_nr == LC_FREE)
+				continue;
+
+			lc_set(mdev->act_log, extent_nr, pos);
+			active_extents++;
+		}
+		spin_unlock_irq(&mdev->al_lock);
+
+		transactions++;
+
+cancel:
+		if (i == to)
+			break;
+		i++;
+		if (i > mx)
+			i = 0;
+	}
+
+	mdev->al_tr_number = to_tnr+1;
+	mdev->al_tr_pos = to;
+	if (++mdev->al_tr_pos >
+	    div_ceil(mdev->act_log->nr_elements, AL_EXTENTS_PT))
+		mdev->al_tr_pos = 0;
+
+	/* ok, we are done with it */
+	mutex_unlock(&mdev->md_io_mutex);
+
+	dev_info(DEV, "Found %d transactions (%d active extents) in activity log.\n",
+	     transactions, active_extents);
+
+	return 1;
+}
+
+STATIC void atodb_endio(struct bio *bio, int error)
+{
+	struct drbd_atodb_wait *wc = bio->bi_private;
+	struct drbd_conf *mdev = wc->mdev;
+	struct page *page;
+	int uptodate = bio_flagged(bio, BIO_UPTODATE);
+
+	/* strange behaviour of some lower level drivers...
+	 * fail the request by clearing the uptodate flag,
+	 * but do not return any error?! */
+	if (!error && !uptodate)
+		error = -EIO;
+
+	/* corresponding drbd_io_error is in drbd_al_to_on_disk_bm */
+	drbd_chk_io_error(mdev, error, TRUE);
+	if (error && wc->error == 0)
+		wc->error = error;
+
+	if (atomic_dec_and_test(&wc->count))
+		complete(&wc->io_done);
+
+	page = bio->bi_io_vec[0].bv_page;
+	put_page(page);
+	bio_put(bio);
+	mdev->bm_writ_cnt++;
+	dec_local(mdev);
+}
+
+#define S2W(s)	((s)<<(BM_EXT_SIZE_B-BM_BLOCK_SIZE_B-LN2_BPL))
+/* activity log to on disk bitmap -- prepare bio unless that sector
+ * is already covered by previously prepared bios */
+STATIC int atodb_prepare_unless_covered(struct drbd_conf *mdev,
+					struct bio **bios,
+					unsigned int enr,
+					struct drbd_atodb_wait *wc) __must_hold(local)
+{
+	struct bio *bio;
+	struct page *page;
+	sector_t on_disk_sector = enr + mdev->bc->md.md_offset
+				      + mdev->bc->md.bm_offset;
+	unsigned int page_offset = PAGE_SIZE;
+	int offset;
+	int i = 0;
+	int err = -ENOMEM;
+
+	/* Check if that enr is already covered by an already created bio.
+	 * Caution, bios[] is not NULL terminated,
+	 * but only initialized to all NULL.
+	 * For completely scattered activity log,
+	 * the last invocation iterates over all bios,
+	 * and finds the last NULL entry.
+	 */
+	while ((bio = bios[i])) {
+		if (bio->bi_sector == on_disk_sector)
+			return 0;
+		i++;
+	}
+	/* bios[i] == NULL, the next not yet used slot */
+
+	bio = bio_alloc(GFP_KERNEL, 1);
+	if (bio == NULL)
+		return -ENOMEM;
+
+	if (i > 0) {
+		const struct bio_vec *prev_bv = bios[i-1]->bi_io_vec;
+		page_offset = prev_bv->bv_offset + prev_bv->bv_len;
+		page = prev_bv->bv_page;
+	}
+	if (page_offset == PAGE_SIZE) {
+		page = alloc_page(__GFP_HIGHMEM);
+		if (page == NULL)
+			goto out_bio_put;
+		page_offset = 0;
+	} else {
+		get_page(page);
+	}
+
+	offset = S2W(enr);
+	drbd_bm_get_lel(mdev, offset,
+			min_t(size_t, S2W(1), drbd_bm_words(mdev) - offset),
+			kmap(page) + page_offset);
+	kunmap(page);
+
+	bio->bi_private = wc;
+	bio->bi_end_io = atodb_endio;
+	bio->bi_bdev = mdev->bc->md_bdev;
+	bio->bi_sector = on_disk_sector;
+
+	if (bio_add_page(bio, page, MD_HARDSECT, page_offset) != MD_HARDSECT)
+		goto out_put_page;
+
+	atomic_inc(&wc->count);
+	/* we already know that we may do this...
+	 * inc_local_if_state(mdev,D_ATTACHING);
+	 * just get the extra reference, so that the local_cnt reflects
+	 * the number of pending IO requests DRBD has at its backing device.
+	 */
+	atomic_inc(&mdev->local_cnt);
+
+	bios[i] = bio;
+
+	return 0;
+
+out_put_page:
+	err = -EINVAL;
+	put_page(page);
+out_bio_put:
+	bio_put(bio);
+	return err;
+}
+
+/**
+ * drbd_al_to_on_disk_bm:
+ * Writes the areas of the bitmap which are covered by the AL.
+ * called when we detach (unconfigure) local storage,
+ * or when we go from R_PRIMARY to R_SECONDARY state.
+ */
+void drbd_al_to_on_disk_bm(struct drbd_conf *mdev)
+{
+	int i, nr_elements;
+	unsigned int enr;
+	struct bio **bios;
+	struct drbd_atodb_wait wc;
+
+	ERR_IF (!inc_local_if_state(mdev, D_ATTACHING))
+		return; /* sorry, I don't have any act_log etc... */
+
+	wait_event(mdev->al_wait, lc_try_lock(mdev->act_log));
+
+	nr_elements = mdev->act_log->nr_elements;
+
+	bios = kzalloc(sizeof(struct bio *) * nr_elements, GFP_KERNEL);
+	if (!bios)
+		goto submit_one_by_one;
+
+	atomic_set(&wc.count, 0);
+	init_completion(&wc.io_done);
+	wc.mdev = mdev;
+	wc.error = 0;
+
+	for (i = 0; i < nr_elements; i++) {
+		enr = lc_entry(mdev->act_log, i)->lc_number;
+		if (enr == LC_FREE)
+			continue;
+		/* next statement also does atomic_inc wc.count and local_cnt */
+		if (atodb_prepare_unless_covered(mdev, bios,
+						enr/AL_EXT_PER_BM_SECT,
+						&wc))
+			goto free_bios_submit_one_by_one;
+	}
+
+	/* unnecessary optimization? */
+	lc_unlock(mdev->act_log);
+	wake_up(&mdev->al_wait);
+
+	/* all prepared, submit them */
+	for (i = 0; i < nr_elements; i++) {
+		if (bios[i] == NULL)
+			break;
+		if (FAULT_ACTIVE(mdev, DRBD_FAULT_MD_WR)) {
+			bios[i]->bi_rw = WRITE;
+			bio_endio(bios[i], -EIO);
+		} else {
+			submit_bio(WRITE, bios[i]);
+		}
+	}
+
+	drbd_blk_run_queue(bdev_get_queue(mdev->bc->md_bdev));
+
+	/* always (try to) flush bitmap to stable storage */
+	drbd_md_flush(mdev);
+
+	/* In case we did not submit a single IO do not wait for
+	 * them to complete. ( Because we would wait forever here. )
+	 *
+	 * In case we had IOs and they are already complete, there
+	 * is no point in waiting anyway.
+	 * Therefore this if () ... */
+	if (atomic_read(&wc.count))
+		wait_for_completion(&wc.io_done);
+
+	dec_local(mdev);
+
+	if (wc.error)
+		drbd_io_error(mdev, TRUE);
+	kfree(bios);
+	return;
+
+ free_bios_submit_one_by_one:
+	/* free everything by calling the endio callback directly. */
+	for (i = 0; i < nr_elements && bios[i]; i++)
+		bio_endio(bios[i], 0);
+
+	kfree(bios);
+
+ submit_one_by_one:
+	dev_warn(DEV, "Using the slow drbd_al_to_on_disk_bm()\n");
+
+	for (i = 0; i < mdev->act_log->nr_elements; i++) {
+		enr = lc_entry(mdev->act_log, i)->lc_number;
+		if (enr == LC_FREE)
+			continue;
+		/* Really slow: if we have al-extents 16..19 active,
+		 * sector 4 will be written four times! Synchronous! */
+		drbd_bm_write_sect(mdev, enr/AL_EXT_PER_BM_SECT);
+	}
+
+	lc_unlock(mdev->act_log);
+	wake_up(&mdev->al_wait);
+	dec_local(mdev);
+}
+
+/**
+ * drbd_al_apply_to_bm: Sets the bits in the bitmap that are described
+ * by the active extents of the AL.
+ */
+void drbd_al_apply_to_bm(struct drbd_conf *mdev)
+{
+	unsigned int enr;
+	unsigned long add = 0;
+	char ppb[10];
+	int i;
+
+	wait_event(mdev->al_wait, lc_try_lock(mdev->act_log));
+
+	for (i = 0; i < mdev->act_log->nr_elements; i++) {
+		enr = lc_entry(mdev->act_log, i)->lc_number;
+		if (enr == LC_FREE)
+			continue;
+		add += drbd_bm_ALe_set_all(mdev, enr);
+	}
+
+	lc_unlock(mdev->act_log);
+	wake_up(&mdev->al_wait);
+
+	dev_info(DEV, "Marked additional %s as out-of-sync based on AL.\n",
+	     ppsize(ppb, Bit2KB(add)));
+}
+
+static inline int _try_lc_del(struct drbd_conf *mdev, struct lc_element *al_ext)
+{
+	int rv;
+
+	spin_lock_irq(&mdev->al_lock);
+	rv = (al_ext->refcnt == 0);
+	if (likely(rv))
+		lc_del(mdev->act_log, al_ext);
+	spin_unlock_irq(&mdev->al_lock);
+
+	return rv;
+}
+
+/**
+ * drbd_al_shrink: Removes all active extents from the AL. (but does not
+ * write any transactions)
+ * You need to lock mdev->act_log with lc_try_lock() / lc_unlock()
+ */
+void drbd_al_shrink(struct drbd_conf *mdev)
+{
+	struct lc_element *al_ext;
+	int i;
+
+	D_ASSERT(test_bit(__LC_DIRTY, &mdev->act_log->flags));
+
+	for (i = 0; i < mdev->act_log->nr_elements; i++) {
+		al_ext = lc_entry(mdev->act_log, i);
+		if (al_ext->lc_number == LC_FREE)
+			continue;
+		wait_event(mdev->al_wait, _try_lc_del(mdev, al_ext));
+	}
+
+	wake_up(&mdev->al_wait);
+}
+
+STATIC int w_update_odbm(struct drbd_conf *mdev, struct drbd_work *w, int unused)
+{
+	struct update_odbm_work *udw = (struct update_odbm_work *)w;
+
+	if (!inc_local(mdev)) {
+		if (__ratelimit(&drbd_ratelimit_state))
+			dev_warn(DEV, "Can not update on disk bitmap, local IO disabled.\n");
+		return 1;
+	}
+
+	drbd_bm_write_sect(mdev, udw->enr);
+	dec_local(mdev);
+
+	kfree(udw);
+
+	if (drbd_bm_total_weight(mdev) <= mdev->rs_failed) {
+		switch (mdev->state.conn) {
+		case C_SYNC_SOURCE:  case C_SYNC_TARGET:
+		case C_PAUSED_SYNC_S: case C_PAUSED_SYNC_T:
+			drbd_resync_finished(mdev);
+		default:
+			/* nothing to do */
+			break;
+		}
+	}
+	drbd_bcast_sync_progress(mdev);
+
+	return 1;
+}
+
+
+/* ATTENTION. The AL's extents are 4MB each, while the extents in the
+ * resync LRU-cache are 16MB each.
+ * The caller of this function has to hold an inc_local() reference.
+ *
+ * TODO will be obsoleted once we have a caching lru of the on disk bitmap
+ */
+STATIC void drbd_try_clear_on_disk_bm(struct drbd_conf *mdev, sector_t sector,
+				      int count, int success)
+{
+	struct bm_extent *ext;
+	struct update_odbm_work *udw;
+
+	unsigned int enr;
+
+	D_ASSERT(atomic_read(&mdev->local_cnt));
+
+	/* I simply assume that a sector/size pair never crosses
+	 * a 16 MB extent border. (Currently this is true...) */
+	enr = BM_SECT_TO_EXT(sector);
+
+	ext = (struct bm_extent *) lc_get(mdev->resync, enr);
+	if (ext) {
+		if (ext->lce.lc_number == enr) {
+			if (success)
+				ext->rs_left -= count;
+			else
+				ext->rs_failed += count;
+			if (ext->rs_left < ext->rs_failed) {
+				dev_err(DEV, "BAD! sector=%llus enr=%u rs_left=%d "
+				    "rs_failed=%d count=%d\n",
+				     (unsigned long long)sector,
+				     ext->lce.lc_number, ext->rs_left,
+				     ext->rs_failed, count);
+				dump_stack();
+
+				lc_put(mdev->resync, &ext->lce);
+				drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+				return;
+			}
+		} else {
+			/* Normally this element should be in the cache,
+			 * since drbd_rs_begin_io() pulled it already in.
+			 *
+			 * But maybe an application write finished, and we set
+			 * something outside the resync lru_cache in sync.
+			 */
+			int rs_left = drbd_bm_e_weight(mdev, enr);
+			if (ext->flags != 0) {
+				dev_warn(DEV, "changing resync lce: %d[%u;%02lx]"
+				     " -> %d[%u;00]\n",
+				     ext->lce.lc_number, ext->rs_left,
+				     ext->flags, enr, rs_left);
+				ext->flags = 0;
+			}
+			if (ext->rs_failed) {
+				dev_warn(DEV, "Kicking resync_lru element enr=%u "
+				     "out with rs_failed=%d\n",
+				     ext->lce.lc_number, ext->rs_failed);
+				set_bit(WRITE_BM_AFTER_RESYNC, &mdev->flags);
+			}
+			ext->rs_left = rs_left;
+			ext->rs_failed = success ? 0 : count;
+			lc_changed(mdev->resync, &ext->lce);
+		}
+		lc_put(mdev->resync, &ext->lce);
+		/* no race, we are within the al_lock! */
+
+		if (ext->rs_left == ext->rs_failed) {
+			ext->rs_failed = 0;
+
+			udw = kmalloc(sizeof(*udw), GFP_ATOMIC);
+			if (udw) {
+				udw->enr = ext->lce.lc_number;
+				udw->w.cb = w_update_odbm;
+				drbd_queue_work_front(&mdev->data.work, &udw->w);
+			} else {
+				dev_warn(DEV, "Could not kmalloc an udw\n");
+				set_bit(WRITE_BM_AFTER_RESYNC, &mdev->flags);
+			}
+		}
+	} else {
+		dev_err(DEV, "lc_get() failed! locked=%d/%d flags=%lu\n",
+		    mdev->resync_locked,
+		    mdev->resync->nr_elements,
+		    mdev->resync->flags);
+	}
+}
+
+/* clear the bit corresponding to the piece of storage in question:
+ * size bytes of data starting from sector.  Only clear the bits of the affected
+ * one or more _aligned_ BM_BLOCK_SIZE blocks.
+ *
+ * called by worker on C_SYNC_TARGET and receiver on SyncSource.
+ *
+ */
+void __drbd_set_in_sync(struct drbd_conf *mdev, sector_t sector, int size,
+		       const char *file, const unsigned int line)
+{
+	/* Is called from worker and receiver context _only_ */
+	unsigned long sbnr, ebnr, lbnr;
+	unsigned long count = 0;
+	sector_t esector, nr_sectors;
+	int wake_up = 0;
+	unsigned long flags;
+
+	if (size <= 0 || (size & 0x1ff) != 0 || size > DRBD_MAX_SEGMENT_SIZE) {
+		dev_err(DEV, "drbd_set_in_sync: sector=%llus size=%d nonsense!\n",
+				(unsigned long long)sector, size);
+		return;
+	}
+	nr_sectors = drbd_get_capacity(mdev->this_bdev);
+	esector = sector + (size >> 9) - 1;
+
+	ERR_IF(sector >= nr_sectors) return;
+	ERR_IF(esector >= nr_sectors) esector = (nr_sectors-1);
+
+	lbnr = BM_SECT_TO_BIT(nr_sectors-1);
+
+	/* we clear it (in sync).
+	 * round up start sector, round down end sector.  we make sure we only
+	 * clear full, aligned, BM_BLOCK_SIZE (4K) blocks */
+	if (unlikely(esector < BM_SECT_PER_BIT-1))
+		return;
+	if (unlikely(esector == (nr_sectors-1)))
+		ebnr = lbnr;
+	else
+		ebnr = BM_SECT_TO_BIT(esector - (BM_SECT_PER_BIT-1));
+	sbnr = BM_SECT_TO_BIT(sector + BM_SECT_PER_BIT-1);
+
+	trace_drbd_resync(mdev, TRACE_LVL_METRICS,
+			  "drbd_set_in_sync: sector=%llus size=%u sbnr=%lu ebnr=%lu\n",
+			  (unsigned long long)sector, size, sbnr, ebnr);
+
+	if (sbnr > ebnr)
+		return;
+
+	/*
+	 * ok, (capacity & 7) != 0 sometimes, but who cares...
+	 * we count rs_{total,left} in bits, not sectors.
+	 */
+	spin_lock_irqsave(&mdev->al_lock, flags);
+	count = drbd_bm_clear_bits(mdev, sbnr, ebnr);
+	if (count) {
+		/* we need the lock for drbd_try_clear_on_disk_bm */
+		if (jiffies - mdev->rs_mark_time > HZ*10) {
+			/* should be rolling marks,
+			 * but we only estimate anyway. */
+			if (mdev->rs_mark_left != drbd_bm_total_weight(mdev) &&
+			    mdev->state.conn != C_PAUSED_SYNC_T &&
+			    mdev->state.conn != C_PAUSED_SYNC_S) {
+				mdev->rs_mark_time = jiffies;
+				mdev->rs_mark_left = drbd_bm_total_weight(mdev);
+			}
+		}
+		if (inc_local(mdev)) {
+			drbd_try_clear_on_disk_bm(mdev, sector, count, TRUE);
+			dec_local(mdev);
+		}
+		/* just wake_up unconditionally now, various lc_changed(),
+		 * lc_put() in drbd_try_clear_on_disk_bm(). */
+		wake_up = 1;
+	}
+	spin_unlock_irqrestore(&mdev->al_lock, flags);
+	if (wake_up)
+		wake_up(&mdev->al_wait);
+}
+
+/*
+ * this is intended to set one request worth of data out of sync.
+ * affects at least 1 bit,
+ * and at most 1+DRBD_MAX_SEGMENT_SIZE/BM_BLOCK_SIZE bits.
+ *
+ * called by tl_clear and drbd_send_dblock (==drbd_make_request).
+ * so this can be _any_ process.
+ */
+void __drbd_set_out_of_sync(struct drbd_conf *mdev, sector_t sector, int size,
+			    const char *file, const unsigned int line)
+{
+	unsigned long sbnr, ebnr, lbnr, flags;
+	sector_t esector, nr_sectors;
+	unsigned int enr, count;
+	struct bm_extent *ext;
+
+	if (size <= 0 || (size & 0x1ff) != 0 || size > DRBD_MAX_SEGMENT_SIZE) {
+		dev_err(DEV, "sector: %llus, size: %d\n",
+			(unsigned long long)sector, size);
+		return;
+	}
+
+	if (!inc_local(mdev))
+		return; /* no disk, no metadata, no bitmap to set bits in */
+
+	nr_sectors = drbd_get_capacity(mdev->this_bdev);
+	esector = sector + (size >> 9) - 1;
+
+	ERR_IF(sector >= nr_sectors)
+		goto out;
+	ERR_IF(esector >= nr_sectors)
+		esector = (nr_sectors-1);
+
+	lbnr = BM_SECT_TO_BIT(nr_sectors-1);
+
+	/* we set it out of sync,
+	 * we do not need to round anything here */
+	sbnr = BM_SECT_TO_BIT(sector);
+	ebnr = BM_SECT_TO_BIT(esector);
+
+	trace_drbd_resync(mdev, TRACE_LVL_METRICS,
+			  "drbd_set_out_of_sync: sector=%llus size=%u sbnr=%lu ebnr=%lu\n",
+			  (unsigned long long)sector, size, sbnr, ebnr);
+
+	/* ok, (capacity & 7) != 0 sometimes, but who cares...
+	 * we count rs_{total,left} in bits, not sectors.  */
+	spin_lock_irqsave(&mdev->al_lock, flags);
+	count = drbd_bm_set_bits(mdev, sbnr, ebnr);
+
+	enr = BM_SECT_TO_EXT(sector);
+	ext = (struct bm_extent *) lc_find(mdev->resync, enr);
+	if (ext)
+		ext->rs_left += count;
+	spin_unlock_irqrestore(&mdev->al_lock, flags);
+
+out:
+	dec_local(mdev);
+}
+
+static inline
+struct bm_extent *_bme_get(struct drbd_conf *mdev, unsigned int enr)
+{
+	struct bm_extent  *bm_ext;
+	int wakeup = 0;
+	unsigned long     rs_flags;
+
+	spin_lock_irq(&mdev->al_lock);
+	if (mdev->resync_locked > mdev->resync->nr_elements/2) {
+		spin_unlock_irq(&mdev->al_lock);
+		return NULL;
+	}
+	bm_ext = (struct bm_extent *) lc_get(mdev->resync, enr);
+	if (bm_ext) {
+		if (bm_ext->lce.lc_number != enr) {
+			bm_ext->rs_left = drbd_bm_e_weight(mdev, enr);
+			bm_ext->rs_failed = 0;
+			lc_changed(mdev->resync, (struct lc_element *)bm_ext);
+			wakeup = 1;
+		}
+		if (bm_ext->lce.refcnt == 1)
+			mdev->resync_locked++;
+		set_bit(BME_NO_WRITES, &bm_ext->flags);
+	}
+	rs_flags = mdev->resync->flags;
+	spin_unlock_irq(&mdev->al_lock);
+	if (wakeup)
+		wake_up(&mdev->al_wait);
+
+	if (!bm_ext) {
+		if (rs_flags & LC_STARVING)
+			dev_warn(DEV, "Have to wait for element"
+			     " (resync LRU too small?)\n");
+		BUG_ON(rs_flags & LC_DIRTY);
+	}
+
+	return bm_ext;
+}
+
+static inline int _is_in_al(struct drbd_conf *mdev, unsigned int enr)
+{
+	struct lc_element *al_ext;
+	int rv = 0;
+
+	spin_lock_irq(&mdev->al_lock);
+	if (unlikely(enr == mdev->act_log->new_number))
+		rv = 1;
+	else {
+		al_ext = lc_find(mdev->act_log, enr);
+		if (al_ext) {
+			if (al_ext->refcnt)
+				rv = 1;
+		}
+	}
+	spin_unlock_irq(&mdev->al_lock);
+
+	/*
+	if (unlikely(rv)) {
+		dev_info(DEV, "Delaying sync read until app's write is done\n");
+	}
+	*/
+	return rv;
+}
+
+/**
+ * drbd_rs_begin_io: Gets an extent in the resync LRU cache and sets it
+ * to BME_LOCKED.
+ *
+ * @sector: The sector number
+ *
+ * sleeps on al_wait.
+ * returns 1 if successful.
+ * returns 0 if interrupted.
+ */
+int drbd_rs_begin_io(struct drbd_conf *mdev, sector_t sector)
+{
+	unsigned int enr = BM_SECT_TO_EXT(sector);
+	struct bm_extent *bm_ext;
+	int i, sig;
+
+	trace_drbd_resync(mdev, TRACE_LVL_ALL,
+			  "drbd_rs_begin_io: sector=%llus (rs_end=%d)\n",
+			  (unsigned long long)sector, enr);
+
+	sig = wait_event_interruptible(mdev->al_wait,
+			(bm_ext = _bme_get(mdev, enr)));
+	if (sig)
+		return 0;
+
+	if (test_bit(BME_LOCKED, &bm_ext->flags))
+		return 1;
+
+	for (i = 0; i < AL_EXT_PER_BM_SECT; i++) {
+		sig = wait_event_interruptible(mdev->al_wait,
+				!_is_in_al(mdev, enr * AL_EXT_PER_BM_SECT + i));
+		if (sig) {
+			spin_lock_irq(&mdev->al_lock);
+			if (lc_put(mdev->resync, &bm_ext->lce) == 0) {
+				clear_bit(BME_NO_WRITES, &bm_ext->flags);
+				mdev->resync_locked--;
+				wake_up(&mdev->al_wait);
+			}
+			spin_unlock_irq(&mdev->al_lock);
+			return 0;
+		}
+	}
+
+	set_bit(BME_LOCKED, &bm_ext->flags);
+
+	return 1;
+}
+
+/**
+ * drbd_try_rs_begin_io: Gets an extent in the resync LRU cache, sets it
+ * to BME_NO_WRITES, then tries to set it to BME_LOCKED.
+ *
+ * @sector: The sector number
+ *
+ * does not sleep.
+ * returns zero if we could set BME_LOCKED and can proceed,
+ * -EAGAIN if we need to try again.
+ */
+int drbd_try_rs_begin_io(struct drbd_conf *mdev, sector_t sector)
+{
+	unsigned int enr = BM_SECT_TO_EXT(sector);
+	const unsigned int al_enr = enr*AL_EXT_PER_BM_SECT;
+	struct bm_extent *bm_ext;
+	int i;
+
+	trace_drbd_resync(mdev, TRACE_LVL_ALL, "drbd_try_rs_begin_io: sector=%llus\n",
+			  (unsigned long long)sector);
+
+	spin_lock_irq(&mdev->al_lock);
+	if (mdev->resync_wenr != LC_FREE && mdev->resync_wenr != enr) {
+		/* in case you have very heavy scattered io, it may
+		 * stall the syncer indefinitely if we give up the ref count
+		 * when we try again and requeue.
+		 *
+		 * if we don't give up the refcount, but the next time
+		 * we are scheduled this extent has been "synced" by new
+		 * application writes, we'd miss the lc_put on the
+		 * extent we kept the refcount on.
+		 * so we remembered which extent we had to try again, and
+		 * if the next requested one is something else, we do
+		 * the lc_put here...
+		 * we also have to wake_up
+		 */
+
+		trace_drbd_resync(mdev, TRACE_LVL_ALL,
+				  "dropping %u, apparently got 'synced' by application io\n",
+				  mdev->resync_wenr);
+
+		bm_ext = (struct bm_extent *)
+			lc_find(mdev->resync, mdev->resync_wenr);
+		if (bm_ext) {
+			D_ASSERT(!test_bit(BME_LOCKED, &bm_ext->flags));
+			D_ASSERT(test_bit(BME_NO_WRITES, &bm_ext->flags));
+			clear_bit(BME_NO_WRITES, &bm_ext->flags);
+			mdev->resync_wenr = LC_FREE;
+			if (lc_put(mdev->resync, &bm_ext->lce) == 0)
+				mdev->resync_locked--;
+			wake_up(&mdev->al_wait);
+		} else {
+			dev_alert(DEV, "LOGIC BUG\n");
+		}
+	}
+	bm_ext = (struct bm_extent *)lc_try_get(mdev->resync, enr);
+	if (bm_ext) {
+		if (test_bit(BME_LOCKED, &bm_ext->flags))
+			goto proceed;
+		if (!test_and_set_bit(BME_NO_WRITES, &bm_ext->flags)) {
+			mdev->resync_locked++;
+		} else {
+			/* we did set the BME_NO_WRITES,
+			 * but then could not set BME_LOCKED,
+			 * so we tried again.
+			 * drop the extra reference. */
+			trace_drbd_resync(mdev, TRACE_LVL_ALL,
+					  "dropping extra reference on %u\n", enr);
+
+			bm_ext->lce.refcnt--;
+			D_ASSERT(bm_ext->lce.refcnt > 0);
+		}
+		goto check_al;
+	} else {
+		if (mdev->resync_locked > mdev->resync->nr_elements-3) {
+			trace_drbd_resync(mdev, TRACE_LVL_ALL,
+					  "resync_locked = %u!\n", mdev->resync_locked);
+
+			goto try_again;
+		}
+		bm_ext = (struct bm_extent *)lc_get(mdev->resync, enr);
+		if (!bm_ext) {
+			const unsigned long rs_flags = mdev->resync->flags;
+			if (rs_flags & LC_STARVING)
+				dev_warn(DEV, "Have to wait for element"
+				     " (resync LRU too small?)\n");
+			BUG_ON(rs_flags & LC_DIRTY);
+			goto try_again;
+		}
+		if (bm_ext->lce.lc_number != enr) {
+			bm_ext->rs_left = drbd_bm_e_weight(mdev, enr);
+			bm_ext->rs_failed = 0;
+			lc_changed(mdev->resync, (struct lc_element *)bm_ext);
+			wake_up(&mdev->al_wait);
+			D_ASSERT(test_bit(BME_LOCKED, &bm_ext->flags) == 0);
+		}
+		set_bit(BME_NO_WRITES, &bm_ext->flags);
+		D_ASSERT(bm_ext->lce.refcnt == 1);
+		mdev->resync_locked++;
+		goto check_al;
+	}
+check_al:
+	trace_drbd_resync(mdev, TRACE_LVL_ALL, "checking al for %u\n", enr);
+
+	for (i = 0; i < AL_EXT_PER_BM_SECT; i++) {
+		if (unlikely(al_enr+i == mdev->act_log->new_number))
+			goto try_again;
+		if (lc_is_used(mdev->act_log, al_enr+i))
+			goto try_again;
+	}
+	set_bit(BME_LOCKED, &bm_ext->flags);
+proceed:
+	mdev->resync_wenr = LC_FREE;
+	spin_unlock_irq(&mdev->al_lock);
+	return 0;
+
+try_again:
+	trace_drbd_resync(mdev, TRACE_LVL_ALL, "need to try again for %u\n", enr);
+	if (bm_ext)
+		mdev->resync_wenr = enr;
+	spin_unlock_irq(&mdev->al_lock);
+	return -EAGAIN;
+}
+
+void drbd_rs_complete_io(struct drbd_conf *mdev, sector_t sector)
+{
+	unsigned int enr = BM_SECT_TO_EXT(sector);
+	struct bm_extent *bm_ext;
+	unsigned long flags;
+
+	trace_drbd_resync(mdev, TRACE_LVL_ALL,
+			  "drbd_rs_complete_io: sector=%llus (rs_enr=%d)\n",
+			  (unsigned long long)sector, enr);
+
+	spin_lock_irqsave(&mdev->al_lock, flags);
+	bm_ext = (struct bm_extent *) lc_find(mdev->resync, enr);
+	if (!bm_ext) {
+		spin_unlock_irqrestore(&mdev->al_lock, flags);
+		dev_err(DEV, "drbd_rs_complete_io() called, but extent not found\n");
+		return;
+	}
+
+	if (bm_ext->lce.refcnt == 0) {
+		spin_unlock_irqrestore(&mdev->al_lock, flags);
+		dev_err(DEV, "drbd_rs_complete_io(,%llu [=%u]) called, "
+		    "but refcnt is 0!?\n",
+		    (unsigned long long)sector, enr);
+		return;
+	}
+
+	if (lc_put(mdev->resync, (struct lc_element *)bm_ext) == 0) {
+		clear_bit(BME_LOCKED, &bm_ext->flags);
+		clear_bit(BME_NO_WRITES, &bm_ext->flags);
+		mdev->resync_locked--;
+		wake_up(&mdev->al_wait);
+	}
+
+	spin_unlock_irqrestore(&mdev->al_lock, flags);
+}
+
+/**
+ * drbd_rs_cancel_all: Removes all extents from the resync LRU,
+ * even if they are BME_LOCKED.
+ */
+void drbd_rs_cancel_all(struct drbd_conf *mdev)
+{
+	trace_drbd_resync(mdev, TRACE_LVL_METRICS, "drbd_rs_cancel_all\n");
+
+	spin_lock_irq(&mdev->al_lock);
+
+	if (inc_local_if_state(mdev, D_FAILED)) { /* Makes sure ->resync is there. */
+		lc_reset(mdev->resync);
+		dec_local(mdev);
+	}
+	mdev->resync_locked = 0;
+	mdev->resync_wenr = LC_FREE;
+	spin_unlock_irq(&mdev->al_lock);
+	wake_up(&mdev->al_wait);
+}
+
+/**
+ * drbd_rs_del_all: Gracefully remove all extents from the resync LRU.
+ * There may still be a reference held by someone; in that case this
+ * function returns -EAGAIN.
+ * If all elements got removed it returns zero.
+ */
+int drbd_rs_del_all(struct drbd_conf *mdev)
+{
+	struct bm_extent *bm_ext;
+	int i;
+
+	trace_drbd_resync(mdev, TRACE_LVL_METRICS, "drbd_rs_del_all\n");
+
+	spin_lock_irq(&mdev->al_lock);
+
+	if (inc_local_if_state(mdev, D_FAILED)) {
+		/* ok, ->resync is there. */
+		for (i = 0; i < mdev->resync->nr_elements; i++) {
+			bm_ext = (struct bm_extent *) lc_entry(mdev->resync, i);
+			if (bm_ext->lce.lc_number == LC_FREE)
+				continue;
+			if (bm_ext->lce.lc_number == mdev->resync_wenr) {
+				dev_info(DEV, "dropping %u in drbd_rs_del_all, apparently"
+				     " got 'synced' by application io\n",
+				     mdev->resync_wenr);
+				D_ASSERT(!test_bit(BME_LOCKED, &bm_ext->flags));
+				D_ASSERT(test_bit(BME_NO_WRITES, &bm_ext->flags));
+				clear_bit(BME_NO_WRITES, &bm_ext->flags);
+				mdev->resync_wenr = LC_FREE;
+				lc_put(mdev->resync, &bm_ext->lce);
+			}
+			if (bm_ext->lce.refcnt != 0) {
+				dev_info(DEV, "Retrying drbd_rs_del_all() later. "
+				     "refcnt=%d\n", bm_ext->lce.refcnt);
+				dec_local(mdev);
+				spin_unlock_irq(&mdev->al_lock);
+				return -EAGAIN;
+			}
+			D_ASSERT(!test_bit(BME_LOCKED, &bm_ext->flags));
+			D_ASSERT(!test_bit(BME_NO_WRITES, &bm_ext->flags));
+			lc_del(mdev->resync, &bm_ext->lce);
+		}
+		D_ASSERT(mdev->resync->used == 0);
+		dec_local(mdev);
+	}
+	spin_unlock_irq(&mdev->al_lock);
+
+	return 0;
+}
+
+/* Record information on a failure to resync the specified blocks
+ *
+ * called on C_SYNC_TARGET when resync write fails or P_NEG_RS_DREPLY received
+ *
+ */
+void drbd_rs_failed_io(struct drbd_conf *mdev, sector_t sector, int size)
+{
+	/* Is called from worker and receiver context _only_ */
+	unsigned long sbnr, ebnr, lbnr;
+	unsigned long count;
+	sector_t esector, nr_sectors;
+	int wake_up = 0;
+
+	trace_drbd_resync(mdev, TRACE_LVL_SUMMARY,
+			  "drbd_rs_failed_io: sector=%llus, size=%u\n",
+			  (unsigned long long)sector, size);
+
+	if (size <= 0 || (size & 0x1ff) != 0 || size > DRBD_MAX_SEGMENT_SIZE) {
+		dev_err(DEV, "drbd_rs_failed_io: sector=%llus size=%d nonsense!\n",
+				(unsigned long long)sector, size);
+		return;
+	}
+	nr_sectors = drbd_get_capacity(mdev->this_bdev);
+	esector = sector + (size >> 9) - 1;
+
+	ERR_IF(sector >= nr_sectors) return;
+	ERR_IF(esector >= nr_sectors) esector = (nr_sectors-1);
+
+	lbnr = BM_SECT_TO_BIT(nr_sectors-1);
+
+	/*
+	 * round up start sector, round down end sector.  we make sure we only
+	 * handle full, aligned, BM_BLOCK_SIZE (4K) blocks */
+	if (unlikely(esector < BM_SECT_PER_BIT-1))
+		return;
+	if (unlikely(esector == (nr_sectors-1)))
+		ebnr = lbnr;
+	else
+		ebnr = BM_SECT_TO_BIT(esector - (BM_SECT_PER_BIT-1));
+	sbnr = BM_SECT_TO_BIT(sector + BM_SECT_PER_BIT-1);
+
+	if (sbnr > ebnr)
+		return;
+
+	/*
+	 * ok, (capacity & 7) != 0 sometimes, but who cares...
+	 * we count rs_{total,left} in bits, not sectors.
+	 */
+	spin_lock_irq(&mdev->al_lock);
+	count = drbd_bm_count_bits(mdev, sbnr, ebnr);
+	if (count) {
+		mdev->rs_failed += count;
+
+		if (inc_local(mdev)) {
+			drbd_try_clear_on_disk_bm(mdev, sector, count, FALSE);
+			dec_local(mdev);
+		}
+
+		/* just wake_up unconditionally now; there are various lc_changed(),
+		 * lc_put() calls in drbd_try_clear_on_disk_bm(). */
+		wake_up = 1;
+	}
+	spin_unlock_irq(&mdev->al_lock);
+	if (wake_up)
+		wake_up(&mdev->al_wait);
+}
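
As a quick sanity check of the rounding above: with 512-byte sectors and 4 KiB per
bitmap bit (so 8 sectors per bit), a failed 8192-byte request at sector 4 accounts
for only the one fully covered bit. A minimal user-space sketch, using illustrative
constants rather than the driver's own macros:

/* stand-alone sketch of the sbnr/ebnr rounding in drbd_rs_failed_io() */
#include <stdio.h>

#define SECT_PER_BIT 8ULL	/* assumed: 4 KiB block / 512-byte sector */

int main(void)
{
	unsigned long long sector = 4, size = 8192;	/* 16 sectors */
	unsigned long long esector = sector + (size >> 9) - 1;
	unsigned long long sbnr = (sector + SECT_PER_BIT - 1) / SECT_PER_BIT;
	unsigned long long ebnr = (esector - (SECT_PER_BIT - 1)) / SECT_PER_BIT;

	/* only bit 1 (sectors 8..15) counts; the partial blocks at the
	 * start (sectors 4..7) and at the end (16..19) are skipped */
	printf("sbnr=%llu ebnr=%llu\n", sbnr, ebnr);
	return 0;
}

The partial 4 KiB blocks at either end are deliberately ignored, exactly as the
comment above describes.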

* [PATCH 04/16] DRBD: bitmap
  2009-04-30 11:26     ` [PATCH 03/16] DRBD: activity_log Philipp Reisner
@ 2009-04-30 11:26       ` Philipp Reisner
  2009-04-30 11:26         ` [PATCH 05/16] DRBD: request Philipp Reisner
  2009-05-02 15:41         ` [PATCH 04/16] DRBD: bitmap James Bottomley
  2009-05-01  9:01       ` [PATCH 03/16] DRBD: activity_log Andrew Morton
  1 sibling, 2 replies; 90+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg, Philipp Reisner

DRBD maintains a dirty bitmap in case it has to run without peer node or
without local disk. Writes to the on disk dirty bitmap are minimized by the
activity log (=AL). Each time an extent is evicted from the AL the part of
the bitmap no longer covered by the AL is written to disk.
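
To put numbers on that: with the 4 KiB bitmap granularity used by drbd_bitmap.c, a
sector maps to its dirty-bitmap bit by a simple shift. A minimal user-space sketch of
that mapping (the constant names are illustrative, not the driver's own macros):

/* illustrative only: sector -> dirty-bitmap bit at 4 KiB granularity */
#include <stdio.h>

#define SECTOR_SHIFT   9	/* 512-byte sectors (assumed) */
#define BM_BLOCK_SHIFT 12	/* 4 KiB per bitmap bit (assumed) */

static unsigned long long sector_to_bit(unsigned long long sector)
{
	return sector >> (BM_BLOCK_SHIFT - SECTOR_SHIFT);
}

int main(void)
{
	/* a write at sector 123456 dirties bit 15432; that one bit
	 * stands for 8 sectors (4 KiB) of user data */
	printf("sector 123456 -> bit %llu\n", sector_to_bit(123456ULL));
	return 0;
}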

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

---
diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
new file mode 100644
index 0000000..c160f7a
--- /dev/null
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -0,0 +1,1263 @@
+/*
+   drbd_bitmap.c
+
+   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+   Copyright (C) 2004-2008, LINBIT Information Technologies GmbH.
+   Copyright (C) 2004-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+   Copyright (C) 2004-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+
+   drbd is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   drbd is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with drbd; see the file COPYING.  If not, write to
+   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/bitops.h>
+#include <linux/vmalloc.h>
+#include <linux/string.h>
+#include <linux/drbd.h>
+#include "drbd_int.h"
+
+/* OPAQUE outside this file!
+ * interface defined in drbd_int.h
+
+ * convention:
+ * function name drbd_bm_... => used elsewhere, "public".
+ * function name      bm_... => internal to implementation, "private".
+
+ * Note that since find_first_bit returns int, at the current granularity of
+ * the bitmap (4KB per bit), this implementation "only" supports up to
+ * 1<<(32+12) == 16 TB...
+ */
+
+/*
+ * NOTE
+ *  Access to the *bm_pages is protected by bm_lock.
+ *  It is safe to read the other members within the lock.
+ *
+ *  drbd_bm_set_bits is called from bio_endio callbacks;
+ *  we may be called with irqs already disabled,
+ *  so we need spin_lock_irqsave().
+ *  And we need kmap_atomic.
+ */
+struct drbd_bitmap {
+	struct page **bm_pages;
+	spinlock_t bm_lock;
+	/* WARNING unsigned long bm_*:
+	 * a 32bit bit offset is just enough for a 512 MB bitmap.
+	 * it will blow up if we make the bitmap bigger...
+	 * not that it makes much sense to have a bitmap that large,
+	 * rather change the granularity to 16k or 64k or something.
+	 * (that implies other problems, however...)
+	 */
+	unsigned long bm_set;       /* nr of set bits; THINK maybe atomic_t? */
+	unsigned long bm_bits;
+	size_t   bm_words;
+	size_t   bm_number_of_pages;
+	sector_t bm_dev_capacity;
+	struct semaphore bm_change; /* serializes resize operations */
+
+	atomic_t bm_async_io;
+	wait_queue_head_t bm_io_wait;
+
+	unsigned long  bm_flags;
+
+	/* debugging aid, in case we are still racy somewhere */
+	char          *bm_why;
+	struct task_struct *bm_task;
+};
+
+/* definition of bits in bm_flags */
+#define BM_LOCKED       0
+#define BM_MD_IO_ERROR  1
+
+static inline int bm_is_locked(struct drbd_bitmap *b)
+{
+	return test_bit(BM_LOCKED, &b->bm_flags);
+}
+
+#define bm_print_lock_info(m) __bm_print_lock_info(m, __func__)
+static void __bm_print_lock_info(struct drbd_conf *mdev, const char *func)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	if (!__ratelimit(&drbd_ratelimit_state))
+		return;
+	dev_err(DEV, "FIXME %s in %s, bitmap locked for '%s' by %s\n",
+	    current == mdev->receiver.task ? "receiver" :
+	    current == mdev->asender.task  ? "asender"  :
+	    current == mdev->worker.task   ? "worker"   : current->comm,
+	    func, b->bm_why ?: "?",
+	    b->bm_task == mdev->receiver.task ? "receiver" :
+	    b->bm_task == mdev->asender.task  ? "asender"  :
+	    b->bm_task == mdev->worker.task   ? "worker"   : "?");
+}
+
+void drbd_bm_lock(struct drbd_conf *mdev, char *why)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	int trylock_failed;
+
+	if (!b) {
+		dev_err(DEV, "FIXME no bitmap in drbd_bm_lock!?\n");
+		return;
+	}
+
+	trylock_failed = down_trylock(&b->bm_change);
+
+	if (trylock_failed) {
+		dev_warn(DEV, "%s going to '%s' but bitmap already locked for '%s' by %s\n",
+		    current == mdev->receiver.task ? "receiver" :
+		    current == mdev->asender.task  ? "asender"  :
+		    current == mdev->worker.task   ? "worker"   : current->comm,
+		    why, b->bm_why ?: "?",
+		    b->bm_task == mdev->receiver.task ? "receiver" :
+		    b->bm_task == mdev->asender.task  ? "asender"  :
+		    b->bm_task == mdev->worker.task   ? "worker"   : "?");
+		down(&b->bm_change);
+	}
+	if (__test_and_set_bit(BM_LOCKED, &b->bm_flags))
+		dev_err(DEV, "FIXME bitmap already locked in bm_lock\n");
+
+	b->bm_why  = why;
+	b->bm_task = current;
+}
+
+void drbd_bm_unlock(struct drbd_conf *mdev)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	if (!b) {
+		dev_err(DEV, "FIXME no bitmap in drbd_bm_unlock!?\n");
+		return;
+	}
+
+	if (!__test_and_clear_bit(BM_LOCKED, &mdev->bitmap->bm_flags))
+		dev_err(DEV, "FIXME bitmap not locked in bm_unlock\n");
+
+	b->bm_why  = NULL;
+	b->bm_task = NULL;
+	up(&b->bm_change);
+}
+
+/* word offset to long pointer */
+STATIC unsigned long *__bm_map_paddr(struct drbd_bitmap *b, unsigned long offset, const enum km_type km)
+{
+	struct page *page;
+	unsigned long page_nr;
+
+	/* page_nr = (word*sizeof(long)) >> PAGE_SHIFT; */
+	page_nr = offset >> (PAGE_SHIFT - LN2_BPL + 3);
+	BUG_ON(page_nr >= b->bm_number_of_pages);
+	page = b->bm_pages[page_nr];
+
+	return (unsigned long *) kmap_atomic(page, km);
+}
+
+unsigned long *bm_map_paddr(struct drbd_bitmap *b, unsigned long offset)
+{
+	return __bm_map_paddr(b, offset, KM_IRQ1);
+}
+
+void __bm_unmap(unsigned long *p_addr, const enum km_type km)
+{
+	kunmap_atomic(p_addr, km);
+};
+
+void bm_unmap(unsigned long *p_addr)
+{
+	return __bm_unmap(p_addr, KM_IRQ1);
+}
+
+/* long word offset of _bitmap_ sector */
+#define S2W(s)	((s)<<(BM_EXT_SIZE_B-BM_BLOCK_SIZE_B-LN2_BPL))
+/* word offset from start of bitmap to word number _in_page_
+ * modulo longs per page
+#define MLPP(X) ((X) % (PAGE_SIZE/sizeof(long)))
+ hm, well, Philipp thinks gcc might not optimize the % into & (... - 1),
+ so do it explicitly:
+ */
+#define MLPP(X) ((X) & ((PAGE_SIZE/sizeof(long))-1))
+
+/* Long words per page */
+#define LWPP (PAGE_SIZE/sizeof(long))
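
For what it's worth, the two forms really do compute the same value for a power-of-two
divisor, which PAGE_SIZE/sizeof(long) always is; a tiny stand-alone check (4 KiB pages
assumed):

/* X % N == X & (N - 1) whenever N is a power of two */
#include <assert.h>
#include <stdio.h>

int main(void)
{
	unsigned long lwpp = 4096 / sizeof(long);	/* assumed 4 KiB pages */
	unsigned long x;

	for (x = 0; x < 100000; x++)
		assert(x % lwpp == (x & (lwpp - 1)));
	printf("modulo and mask agree for LWPP=%lu\n", lwpp);
	return 0;
}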
+
+/*
+ * actually most functions herein should take a struct drbd_bitmap*, not a
+ * struct drbd_conf*, but for the debug macros I like to have the mdev around
+ * to be able to report device specific messages.
+ */
+
+STATIC void bm_free_pages(struct page **pages, unsigned long number)
+{
+	unsigned long i;
+	if (!pages)
+		return;
+
+	for (i = 0; i < number; i++) {
+		if (!pages[i]) {
+			printk(KERN_ALERT "drbd: bm_free_pages tried to free "
+					  "a NULL pointer; i=%lu n=%lu\n",
+					  i, number);
+			continue;
+		}
+		__free_page(pages[i]);
+		pages[i] = NULL;
+	}
+}
+
+/*
+ * "have" and "want" are NUMBER OF PAGES.
+ */
+STATIC struct page **bm_realloc_pages(struct page **old_pages,
+				       unsigned long have,
+				       unsigned long want)
+{
+	struct page **new_pages, *page;
+	unsigned int i, bytes;
+
+	BUG_ON(have == 0 && old_pages != NULL);
+	BUG_ON(have != 0 && old_pages == NULL);
+
+	if (have == want)
+		return old_pages;
+
+	/* To use kmalloc here is ok, as long as we support 4TB at max...
+	 * otherwise this might become bigger than 128KB, which is
+	 * the maximum for kmalloc.
+	 *
+	 * no, it is not: on 64bit boxes, sizeof(void*) == 8,
+	 * 128MB bitmap @ 4K pages -> 256K of page pointers.
+	 * ==> use vmalloc for now again.
+	 * then again, we could do something like
+	 *   if (nr_pages > watermark) vmalloc else kmalloc :*> ...
+	 * or do cascading page arrays:
+	 *   one page for the page array of the page array,
+	 *   those pages for the real bitmap pages.
+	 *   there we could even add some optimization members,
+	 *   so we won't need to kmap_atomic in bm_find_next_bit just to see
+	 *   that the page has no bits set ...
+	 * or we can try a "huge" page ;-)
+	 */
+	bytes = sizeof(struct page *)*want;
+	new_pages = vmalloc(bytes);
+	if (!new_pages)
+		return NULL;
+
+	memset(new_pages, 0, bytes);
+	if (want >= have) {
+		for (i = 0; i < have; i++)
+			new_pages[i] = old_pages[i];
+		for (; i < want; i++) {
+			page = alloc_page(GFP_HIGHUSER);
+			if (!page) {
+				bm_free_pages(new_pages + have, i - have);
+				vfree(new_pages);
+				return NULL;
+			}
+			new_pages[i] = page;
+		}
+	} else {
+		for (i = 0; i < want; i++)
+			new_pages[i] = old_pages[i];
+		/* NOT HERE, we are outside the spinlock!
+		bm_free_pages(old_pages + want, have - want);
+		*/
+	}
+
+	return new_pages;
+}
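
The sizing discussion in the comment above works out as follows for the 4 TiB case; a
small stand-alone sketch of the arithmetic (4 KiB pages and 4 KiB bitmap granularity
assumed):

/* why the page-pointer array itself outgrows kmalloc for a 4 TiB device */
#include <stdio.h>

int main(void)
{
	unsigned long long dev_bytes = 4ULL << 40;		/* 4 TiB */
	unsigned long long bm_bits   = dev_bytes >> 12;		/* 1 bit per 4 KiB */
	unsigned long long bm_bytes  = bm_bits >> 3;		/* 128 MiB bitmap */
	unsigned long long bm_pages  = bm_bytes >> 12;		/* 32768 pages */
	unsigned long long ptr_bytes = bm_pages * sizeof(void *); /* 256 KiB on 64bit */

	printf("bitmap: %llu MiB, pages: %llu, page-pointer array: %llu KiB\n",
	       bm_bytes >> 20, bm_pages, ptr_bytes >> 10);
	return 0;
}

On a 64-bit host the page-pointer array alone is 256 KiB, already past the 128 KB the
comment allows for kmalloc, hence the vmalloc.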
+
+/*
+ * called on driver init only. TODO call when a device is created.
+ * allocates the drbd_bitmap, and stores it in mdev->bitmap.
+ */
+int drbd_bm_init(struct drbd_conf *mdev)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	WARN_ON(b != NULL);
+	b = kzalloc(sizeof(struct drbd_bitmap), GFP_KERNEL);
+	if (!b)
+		return -ENOMEM;
+	spin_lock_init(&b->bm_lock);
+	init_MUTEX(&b->bm_change);
+	init_waitqueue_head(&b->bm_io_wait);
+
+	mdev->bitmap = b;
+
+	return 0;
+}
+
+sector_t drbd_bm_capacity(struct drbd_conf *mdev)
+{
+	ERR_IF(!mdev->bitmap) return 0;
+	return mdev->bitmap->bm_dev_capacity;
+}
+
+/* called on driver unload. TODO: call when a device is destroyed.
+ */
+void drbd_bm_cleanup(struct drbd_conf *mdev)
+{
+	ERR_IF (!mdev->bitmap) return;
+	bm_free_pages(mdev->bitmap->bm_pages, mdev->bitmap->bm_number_of_pages);
+	vfree(mdev->bitmap->bm_pages);
+	kfree(mdev->bitmap);
+	mdev->bitmap = NULL;
+}
+
+/*
+ * since (b->bm_bits % BITS_PER_LONG) != 0,
+ * this masks out the remaining bits.
+ * Returns the number of bits cleared.
+ */
+STATIC int bm_clear_surplus(struct drbd_bitmap *b)
+{
+	const unsigned long mask = (1UL << (b->bm_bits & (BITS_PER_LONG-1))) - 1;
+	size_t w = b->bm_bits >> LN2_BPL;
+	int cleared = 0;
+	unsigned long *p_addr, *bm;
+
+	p_addr = bm_map_paddr(b, w);
+	bm = p_addr + MLPP(w);
+	if (w < b->bm_words) {
+		cleared = hweight_long(*bm & ~mask);
+		*bm &= mask;
+		w++; bm++;
+	}
+
+	if (w < b->bm_words) {
+		cleared += hweight_long(*bm);
+		*bm = 0;
+	}
+	bm_unmap(p_addr);
+	return cleared;
+}
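
A quick worked example of the surplus mask above, assuming BITS_PER_LONG == 64 (a
64-bit build): with bm_bits == 100 the last word holds bits 64..127 of the bitmap, only
36 of which are valid.

/* the mask that survives in the last word of a 100-bit bitmap (64-bit build) */
#include <stdio.h>

int main(void)
{
	unsigned long bm_bits = 100;			/* example value */
	unsigned long mask = (1UL << (bm_bits & 63)) - 1;

	/* bits 0..35 of the last word are kept, bits 36..63 are surplus */
	printf("mask = 0x%016lx\n", mask);
	return 0;
}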
+
+STATIC void bm_set_surplus(struct drbd_bitmap *b)
+{
+	const unsigned long mask = (1UL << (b->bm_bits & (BITS_PER_LONG-1))) - 1;
+	size_t w = b->bm_bits >> LN2_BPL;
+	unsigned long *p_addr, *bm;
+
+	p_addr = bm_map_paddr(b, w);
+	bm = p_addr + MLPP(w);
+	if (w < b->bm_words) {
+		*bm |= ~mask;
+		bm++; w++;
+	}
+
+	if (w < b->bm_words) {
+		*bm = ~(0UL);
+	}
+	bm_unmap(p_addr);
+}
+
+STATIC unsigned long __bm_count_bits(struct drbd_bitmap *b, const int swap_endian)
+{
+	unsigned long *p_addr, *bm, offset = 0;
+	unsigned long bits = 0;
+	unsigned long i, do_now;
+
+	while (offset < b->bm_words) {
+		i = do_now = min_t(size_t, b->bm_words-offset, LWPP);
+		p_addr = bm_map_paddr(b, offset);
+		bm = p_addr + MLPP(offset);
+		while (i--) {
+#ifndef __LITTLE_ENDIAN
+			if (swap_endian)
+				*bm = lel_to_cpu(*bm);
+#endif
+			bits += hweight_long(*bm++);
+		}
+		bm_unmap(p_addr);
+		offset += do_now;
+	}
+
+	return bits;
+}
+
+static inline unsigned long bm_count_bits(struct drbd_bitmap *b)
+{
+	return __bm_count_bits(b, 0);
+}
+
+static inline unsigned long bm_count_bits_swap_endian(struct drbd_bitmap *b)
+{
+	return __bm_count_bits(b, 1);
+}
+
+void _drbd_bm_recount_bits(struct drbd_conf *mdev, char *file, int line)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	unsigned long flags, bits;
+
+	ERR_IF(!b) return;
+
+	/* IMO this should be inside drbd_bm_lock/unlock.
+	 * Unfortunately it is used outside of the locks.
+	 * And I'm not yet sure where we need to place the
+	 * lock/unlock correctly.
+	 */
+
+	spin_lock_irqsave(&b->bm_lock, flags);
+	bits = bm_count_bits(b);
+	if (bits != b->bm_set) {
+		dev_err(DEV, "bm_set was %lu, corrected to %lu. %s:%d\n",
+		    b->bm_set, bits, file, line);
+		b->bm_set = bits;
+	}
+	spin_unlock_irqrestore(&b->bm_lock, flags);
+}
+
+/* offset and len in long words.*/
+STATIC void bm_memset(struct drbd_bitmap *b, size_t offset, int c, size_t len)
+{
+	unsigned long *p_addr, *bm;
+	size_t do_now, end;
+
+#define BM_SECTORS_PER_BIT (BM_BLOCK_SIZE/512)
+
+	end = offset + len;
+
+	if (end > b->bm_words) {
+		printk(KERN_ALERT "drbd: bm_memset end > bm_words\n");
+		return;
+	}
+
+	while (offset < end) {
+		do_now = min_t(size_t, ALIGN(offset + 1, LWPP), end) - offset;
+		p_addr = bm_map_paddr(b, offset);
+		bm = p_addr + MLPP(offset);
+		if (bm+do_now > p_addr + LWPP) {
+			printk(KERN_ALERT "drbd: BUG BUG BUG! p_addr:%p bm:%p do_now:%d\n",
+			       p_addr, bm, (int)do_now);
+			break; /* bail out of the loop, do not write past the mapped page */
+		}
+		memset(bm, c, do_now * sizeof(long));
+		bm_unmap(p_addr);
+		offset += do_now;
+	}
+}
+
+/*
+ * make sure the bitmap has enough room for the attached storage,
+ * if necessary, resize.
+ * called whenever we may have changed the device size.
+ * returns -ENOMEM if we could not allocate enough memory, 0 on success.
+ * In case this is actually a resize, we copy the old bitmap into the new one.
+ * Otherwise, the bitmap is initialized to all bits set.
+ */
+int drbd_bm_resize(struct drbd_conf *mdev, sector_t capacity)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	unsigned long bits, words, owords, obits, *p_addr, *bm;
+	unsigned long want, have, onpages; /* number of pages */
+	struct page **npages, **opages = NULL;
+	int err = 0, growing;
+
+	ERR_IF(!b) return -ENOMEM;
+
+	drbd_bm_lock(mdev, "resize");
+
+	dev_info(DEV, "drbd_bm_resize called with capacity == %llu\n",
+			(unsigned long long)capacity);
+
+	if (capacity == b->bm_dev_capacity)
+		goto out;
+
+	if (capacity == 0) {
+		spin_lock_irq(&b->bm_lock);
+		opages = b->bm_pages;
+		onpages = b->bm_number_of_pages;
+		owords = b->bm_words;
+		b->bm_pages = NULL;
+		b->bm_number_of_pages =
+		b->bm_set   =
+		b->bm_bits  =
+		b->bm_words =
+		b->bm_dev_capacity = 0;
+		spin_unlock_irq(&b->bm_lock);
+		bm_free_pages(opages, onpages);
+		vfree(opages);
+		goto out;
+	}
+	bits  = BM_SECT_TO_BIT(ALIGN(capacity, BM_SECT_PER_BIT));
+
+	/* if we would use
+	   words = ALIGN(bits,BITS_PER_LONG) >> LN2_BPL;
+	   a 32bit host could present the wrong number of words
+	   to a 64bit host.
+	*/
+	words = ALIGN(bits, 64) >> LN2_BPL;
+
+	if (inc_local(mdev)) {
+		D_ASSERT((u64)bits <= (((u64)mdev->bc->md.md_size_sect-MD_BM_OFFSET) << 12));
+		dec_local(mdev);
+	}
+
+	/* one extra long to catch off-by-one errors */
+	want = ALIGN((words+1)*sizeof(long), PAGE_SIZE) >> PAGE_SHIFT;
+	have = b->bm_number_of_pages;
+	if (want == have) {
+		D_ASSERT(b->bm_pages != NULL);
+		npages = b->bm_pages;
+	} else {
+		if (FAULT_ACTIVE(mdev, DRBD_FAULT_BM_ALLOC))
+			npages = NULL;
+		else
+			npages = bm_realloc_pages(b->bm_pages, have, want);
+	}
+
+	if (!npages) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	spin_lock_irq(&b->bm_lock);
+	opages = b->bm_pages;
+	owords = b->bm_words;
+	obits  = b->bm_bits;
+
+	growing = bits > obits;
+	if (opages)
+		bm_set_surplus(b);
+
+	b->bm_pages = npages;
+	b->bm_number_of_pages = want;
+	b->bm_bits  = bits;
+	b->bm_words = words;
+	b->bm_dev_capacity = capacity;
+
+	if (growing) {
+		bm_memset(b, owords, 0xff, words-owords);
+		b->bm_set += bits - obits;
+	}
+
+	if (want < have) {
+		/* implicit: (opages != NULL) && (opages != npages) */
+		bm_free_pages(opages + want, have - want);
+	}
+
+	p_addr = bm_map_paddr(b, words);
+	bm = p_addr + MLPP(words);
+	*bm = DRBD_MAGIC;
+	bm_unmap(p_addr);
+
+	(void)bm_clear_surplus(b);
+	if (!growing)
+		b->bm_set = bm_count_bits(b);
+
+	spin_unlock_irq(&b->bm_lock);
+	if (opages != npages)
+		vfree(opages);
+	dev_info(DEV, "resync bitmap: bits=%lu words=%lu\n", bits, words);
+
+ out:
+	drbd_bm_unlock(mdev);
+	return err;
+}
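
The 64-bit alignment of the word count above matters as soon as a 32-bit and a 64-bit
host exchange the bitmap; a small sketch of the arithmetic hinted at in the comment
(96 bits is just an illustrative value):

/* why the word count is aligned to 64 bits regardless of host word size */
#include <stdio.h>

static unsigned long long align_up(unsigned long long x, unsigned long long a)
{
	return (x + a - 1) / a * a;
}

int main(void)
{
	unsigned long long bits = 96;	/* example bitmap size in bits */

	/* per-host alignment: the byte counts disagree between 32 and 64 bit */
	printf("32bit host: %llu bytes, 64bit host: %llu bytes\n",
	       align_up(bits, 32) / 8, align_up(bits, 64) / 8);

	/* always aligning to 64 bits: both hosts agree */
	printf("aligned to 64: %llu bytes on either host\n",
	       align_up(bits, 64) / 8);
	return 0;
}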
+
+/* inherently racy:
+ * if not protected by other means, return value may be out of date when
+ * leaving this function...
+ * we still need to lock it, since it is important that this returns
+ * bm_set == 0 precisely.
+ *
+ * maybe bm_set should be atomic_t ?
+ */
+unsigned long drbd_bm_total_weight(struct drbd_conf *mdev)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	unsigned long s;
+	unsigned long flags;
+
+	/* if I don't have a disk, I don't know about out-of-sync status */
+	if (!inc_local_if_state(mdev, D_NEGOTIATING))
+		return 0;
+
+	ERR_IF(!b) return 0;
+	ERR_IF(!b->bm_pages) return 0;
+
+	spin_lock_irqsave(&b->bm_lock, flags);
+	s = b->bm_set;
+	spin_unlock_irqrestore(&b->bm_lock, flags);
+
+	dec_local(mdev);
+
+	return s;
+}
+
+size_t drbd_bm_words(struct drbd_conf *mdev)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	ERR_IF(!b) return 0;
+	ERR_IF(!b->bm_pages) return 0;
+
+	return b->bm_words;
+}
+
+unsigned long drbd_bm_bits(struct drbd_conf *mdev)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	ERR_IF(!b) return 0;
+
+	return b->bm_bits;
+}
+
+/* merge number words from buffer into the bitmap starting at offset.
+ * buffer[i] is expected to be little endian unsigned long.
+ * bitmap must be locked by drbd_bm_lock.
+ * currently only used from receive_bitmap.
+ */
+void drbd_bm_merge_lel(struct drbd_conf *mdev, size_t offset, size_t number,
+			unsigned long *buffer)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	unsigned long *p_addr, *bm;
+	unsigned long word, bits;
+	size_t end, do_now;
+
+	end = offset + number;
+
+	ERR_IF(!b) return;
+	ERR_IF(!b->bm_pages) return;
+	if (number == 0)
+		return;
+	WARN_ON(offset >= b->bm_words);
+	WARN_ON(end    >  b->bm_words);
+
+	spin_lock_irq(&b->bm_lock);
+	while (offset < end) {
+		do_now = min_t(size_t, ALIGN(offset+1, LWPP), end) - offset;
+		p_addr = bm_map_paddr(b, offset);
+		bm = p_addr + MLPP(offset);
+		offset += do_now;
+		while (do_now--) {
+			bits = hweight_long(*bm);
+			word = *bm | lel_to_cpu(*buffer++);
+			*bm++ = word;
+			b->bm_set += hweight_long(word) - bits;
+		}
+		bm_unmap(p_addr);
+	}
+	/* with 32bit <-> 64bit cross-platform connect
+	 * this is only correct for current usage,
+	 * where we _know_ that we are 64 bit aligned,
+	 * and know that this function is used in this way, too...
+	 */
+	if (end == b->bm_words)
+		b->bm_set -= bm_clear_surplus(b);
+
+	spin_unlock_irq(&b->bm_lock);
+}
+
+/* copy number words from the bitmap starting at offset into the buffer.
+ * buffer[i] will be little endian unsigned long.
+ */
+void drbd_bm_get_lel(struct drbd_conf *mdev, size_t offset, size_t number,
+		     unsigned long *buffer)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	unsigned long *p_addr, *bm;
+	size_t end, do_now;
+
+	end = offset + number;
+
+	ERR_IF(!b) return;
+	ERR_IF(!b->bm_pages) return;
+
+	spin_lock_irq(&b->bm_lock);
+	if ((offset >= b->bm_words) ||
+	    (end    >  b->bm_words) ||
+	    (number <= 0))
+		dev_err(DEV, "offset=%lu number=%lu bm_words=%lu\n",
+			(unsigned long)	offset,
+			(unsigned long)	number,
+			(unsigned long) b->bm_words);
+	else {
+		while (offset < end) {
+			do_now = min_t(size_t, ALIGN(offset+1, LWPP), end) - offset;
+			p_addr = bm_map_paddr(b, offset);
+			bm = p_addr + MLPP(offset);
+			offset += do_now;
+			while (do_now--)
+				*buffer++ = cpu_to_lel(*bm++);
+			bm_unmap(p_addr);
+		}
+	}
+	spin_unlock_irq(&b->bm_lock);
+}
+
+/* set all bits in the bitmap */
+void drbd_bm_set_all(struct drbd_conf *mdev)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	ERR_IF(!b) return;
+	ERR_IF(!b->bm_pages) return;
+
+	spin_lock_irq(&b->bm_lock);
+	bm_memset(b, 0, 0xff, b->bm_words);
+	(void)bm_clear_surplus(b);
+	b->bm_set = b->bm_bits;
+	spin_unlock_irq(&b->bm_lock);
+}
+
+/* clear all bits in the bitmap */
+void drbd_bm_clear_all(struct drbd_conf *mdev)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	ERR_IF(!b) return;
+	ERR_IF(!b->bm_pages) return;
+
+	spin_lock_irq(&b->bm_lock);
+	bm_memset(b, 0, 0, b->bm_words);
+	b->bm_set = 0;
+	spin_unlock_irq(&b->bm_lock);
+}
+
+static void bm_async_io_complete(struct bio *bio, int error)
+{
+	struct drbd_bitmap *b = bio->bi_private;
+	int uptodate = bio_flagged(bio, BIO_UPTODATE);
+
+	/* strange behaviour of some lower level drivers...
+	 * fail the request by clearing the uptodate flag,
+	 * but do not return any error?!
+	 * do we want to WARN() on this? */
+	if (!error && !uptodate)
+		error = -EIO;
+
+	if (error) {
+		/* doh. what now?
+		 * for now, set all bits, and flag MD_IO_ERROR */
+		__set_bit(BM_MD_IO_ERROR, &b->bm_flags);
+	}
+	if (atomic_dec_and_test(&b->bm_async_io))
+		wake_up(&b->bm_io_wait);
+
+	bio_put(bio);
+}
+
+STATIC void bm_page_io_async(struct drbd_conf *mdev, struct drbd_bitmap *b, int page_nr, int rw) __must_hold(local)
+{
+	/* we are process context. we always get a bio */
+	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
+	unsigned int len;
+	sector_t on_disk_sector =
+		mdev->bc->md.md_offset + mdev->bc->md.bm_offset;
+	on_disk_sector += ((sector_t)page_nr) << (PAGE_SHIFT-9);
+
+	/* this might happen with very small
+	 * flexible external meta data device */
+	len = min_t(unsigned int, PAGE_SIZE,
+		(drbd_md_last_sector(mdev->bc) - on_disk_sector + 1)<<9);
+
+	bio->bi_bdev = mdev->bc->md_bdev;
+	bio->bi_sector = on_disk_sector;
+	bio_add_page(bio, b->bm_pages[page_nr], len, 0);
+	bio->bi_private = b;
+	bio->bi_end_io = bm_async_io_complete;
+
+	if (FAULT_ACTIVE(mdev, (rw & WRITE) ? DRBD_FAULT_MD_WR : DRBD_FAULT_MD_RD)) {
+		bio->bi_rw |= rw;
+		bio_endio(bio, -EIO);
+	} else {
+		submit_bio(rw, bio);
+	}
+}
+
+# if defined(__LITTLE_ENDIAN)
+	/* nothing to do, on disk == in memory */
+# define bm_cpu_to_lel(x) ((void)0)
+# else
+void bm_cpu_to_lel(struct drbd_bitmap *b)
+{
+	unsigned long *p_addr;
+	unsigned long i;
+
+	/* need to cpu_to_lel all the pages ...
+	 * this may be optimized by using
+	 * cpu_to_lel(-1) == -1 and cpu_to_lel(0) == 0;
+	 * the following is still not optimal, but better than nothing */
+	if (b->bm_set == 0) {
+		/* no page at all; avoid swap if all is 0 */
+		i = b->bm_number_of_pages;
+	} else if (b->bm_set == b->bm_bits) {
+		/* only the last page */
+		i = b->bm_number_of_pages - 1;
+	} else {
+		/* all pages */
+		i = 0;
+	}
+	for (; i < b->bm_number_of_pages; i++) {
+		unsigned long *bm;
+		/* if you'd want to use kmap_atomic, you'd have to disable irq! */
+		p_addr = kmap(b->bm_pages[i]);
+		for (bm = p_addr; bm < p_addr + PAGE_SIZE/sizeof(long); bm++)
+			*bm = cpu_to_lel(*bm);
+		kunmap(p_addr);
+	}
+}
+# endif
+/* lel_to_cpu == cpu_to_lel */
+# define bm_lel_to_cpu(x) bm_cpu_to_lel(x)
+
+/*
+ * bm_rw: read/write the whole bitmap from/to its on disk location.
+ */
+STATIC int bm_rw(struct drbd_conf *mdev, int rw) __must_hold(local)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	/* sector_t sector; */
+	int bm_words, num_pages, i;
+	unsigned long now;
+	char ppb[10];
+	int err = 0;
+
+	WARN_ON(!bm_is_locked(b));
+
+	/* no spinlock here, the drbd_bm_lock should be enough! */
+
+	bm_words  = drbd_bm_words(mdev);
+	num_pages = (bm_words*sizeof(long) + PAGE_SIZE-1) >> PAGE_SHIFT;
+
+	/* on disk bitmap is little endian */
+	if (rw == WRITE)
+		bm_cpu_to_lel(b);
+
+	now = jiffies;
+	atomic_set(&b->bm_async_io, num_pages);
+	__clear_bit(BM_MD_IO_ERROR, &b->bm_flags);
+
+	/* let the layers below us try to merge these bios... */
+	for (i = 0; i < num_pages; i++)
+		bm_page_io_async(mdev, b, i, rw);
+
+	drbd_blk_run_queue(bdev_get_queue(mdev->bc->md_bdev));
+	wait_event(b->bm_io_wait, atomic_read(&b->bm_async_io) == 0);
+
+	if (test_bit(BM_MD_IO_ERROR, &b->bm_flags)) {
+		dev_alert(DEV, "we had at least one MD IO ERROR during bitmap IO\n");
+		drbd_chk_io_error(mdev, 1, TRUE);
+		drbd_io_error(mdev, TRUE);
+		err = -EIO;
+	}
+
+	now = jiffies;
+	if (rw == WRITE) {
+		/* swap back endianness */
+		bm_lel_to_cpu(b);
+		/* flush bitmap to stable storage */
+		drbd_md_flush(mdev);
+	} else /* rw == READ */ {
+		/* just read, if necessary adjust endianness */
+		b->bm_set = bm_count_bits_swap_endian(b);
+		dev_info(DEV, "recounting of set bits took additional %lu jiffies\n",
+		     jiffies - now);
+	}
+	now = b->bm_set;
+
+	dev_info(DEV, "%s (%lu bits) marked out-of-sync by on disk bit-map.\n",
+	     ppsize(ppb, now << (BM_BLOCK_SIZE_B-10)), now);
+
+	return err;
+}
+
+/**
+ * drbd_bm_read: Read the whole bitmap from its on disk location.
+ *
+ * currently only called from "drbd_nl_disk_conf"
+ */
+int drbd_bm_read(struct drbd_conf *mdev) __must_hold(local)
+{
+	return bm_rw(mdev, READ);
+}
+
+/**
+ * drbd_bm_write: Write the whole bitmap to its on disk location.
+ *
+ * called at various occasions.
+ */
+int drbd_bm_write(struct drbd_conf *mdev) __must_hold(local)
+{
+	return bm_rw(mdev, WRITE);
+}
+
+/**
+ * drbd_bm_write_sect: Writes a 512 byte piece of the bitmap to its
+ * on disk location. On disk bitmap is little endian.
+ *
+ * @enr: The _sector_ offset from the start of the bitmap.
+ *
+ */
+int drbd_bm_write_sect(struct drbd_conf *mdev, unsigned long enr) __must_hold(local)
+{
+	sector_t on_disk_sector = enr + mdev->bc->md.md_offset
+				      + mdev->bc->md.bm_offset;
+	int bm_words, num_words, offset;
+	int err = 0;
+
+	mutex_lock(&mdev->md_io_mutex);
+	bm_words  = drbd_bm_words(mdev);
+	offset    = S2W(enr);	/* word offset into bitmap */
+	num_words = min(S2W(1), bm_words - offset);
+	if (num_words < S2W(1))
+		memset(page_address(mdev->md_io_page), 0, MD_HARDSECT);
+	drbd_bm_get_lel(mdev, offset, num_words,
+			page_address(mdev->md_io_page));
+	if (!drbd_md_sync_page_io(mdev, mdev->bc, on_disk_sector, WRITE)) {
+		int i;
+		err = -EIO;
+		dev_err(DEV, "IO ERROR writing bitmap sector %lu "
+		    "(meta-disk sector %llus)\n",
+		    enr, (unsigned long long)on_disk_sector);
+		drbd_chk_io_error(mdev, 1, TRUE);
+		drbd_io_error(mdev, TRUE);
+		for (i = 0; i < AL_EXT_PER_BM_SECT; i++)
+			drbd_bm_ALe_set_all(mdev, enr*AL_EXT_PER_BM_SECT+i);
+	}
+	mdev->bm_writ_cnt++;
+	mutex_unlock(&mdev->md_io_mutex);
+	return err;
+}
+
+/* NOTE
+ * find_first_bit returns int, we return unsigned long.
+ * should not make much difference anyways, but ...
+ *
+ * this returns a bit number, NOT a sector!
+ */
+#define BPP_MASK ((1UL << (PAGE_SHIFT+3)) - 1)
+static unsigned long __bm_find_next(struct drbd_conf *mdev, unsigned long bm_fo,
+	const int find_zero_bit, const enum km_type km)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	unsigned long i = -1UL;
+	unsigned long *p_addr;
+	unsigned long bit_offset; /* bit offset of the mapped page. */
+
+	if (bm_fo > b->bm_bits) {
+		dev_err(DEV, "bm_fo=%lu bm_bits=%lu\n", bm_fo, b->bm_bits);
+	} else {
+		while (bm_fo < b->bm_bits) {
+			unsigned long offset;
+			bit_offset = bm_fo & ~BPP_MASK; /* bit offset of the page */
+			offset = bit_offset >> LN2_BPL;    /* word offset of the page */
+			p_addr = __bm_map_paddr(b, offset, km);
+
+			if (find_zero_bit)
+				i = find_next_zero_bit(p_addr, PAGE_SIZE*8, bm_fo & BPP_MASK);
+			else
+				i = find_next_bit(p_addr, PAGE_SIZE*8, bm_fo & BPP_MASK);
+
+			__bm_unmap(p_addr, km);
+			if (i < PAGE_SIZE*8) {
+				i = bit_offset + i;
+				if (i >= b->bm_bits)
+					break;
+				goto found;
+			}
+			bm_fo = bit_offset + PAGE_SIZE*8;
+		}
+		i = -1UL;
+	}
+ found:
+	return i;
+}
+
+static unsigned long bm_find_next(struct drbd_conf *mdev,
+	unsigned long bm_fo, const int find_zero_bit)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	unsigned long i = -1UL;
+
+	ERR_IF(!b) return i;
+	ERR_IF(!b->bm_pages) return i;
+
+	spin_lock_irq(&b->bm_lock);
+	if (bm_is_locked(b))
+		bm_print_lock_info(mdev);
+
+	i = __bm_find_next(mdev, bm_fo, find_zero_bit, KM_IRQ1);
+
+	spin_unlock_irq(&b->bm_lock);
+	return i;
+}
+
+unsigned long drbd_bm_find_next(struct drbd_conf *mdev, unsigned long bm_fo)
+{
+	return bm_find_next(mdev, bm_fo, 0);
+}
+
+#if 0
+/* not yet needed for anything. */
+unsigned long drbd_bm_find_next_zero(struct drbd_conf *mdev, unsigned long bm_fo)
+{
+	return bm_find_next(mdev, bm_fo, 1);
+}
+#endif
+
+/* does not spin_lock_irqsave.
+ * you must take drbd_bm_lock() first */
+unsigned long _drbd_bm_find_next(struct drbd_conf *mdev, unsigned long bm_fo)
+{
+	/* WARN_ON(!bm_is_locked(mdev)); */
+	return __bm_find_next(mdev, bm_fo, 0, KM_USER1);
+}
+
+unsigned long _drbd_bm_find_next_zero(struct drbd_conf *mdev, unsigned long bm_fo)
+{
+	/* WARN_ON(!bm_is_locked(mdev)); */
+	return __bm_find_next(mdev, bm_fo, 1, KM_USER1);
+}
+
+/* returns number of bits actually changed.
+ * for val != 0, we change 0 -> 1, return code positive
+ * for val == 0, we change 1 -> 0, return code negative
+ * wants bitnr, not sector.
+ * Must hold bitmap lock already. */
+
+int __bm_change_bits_to(struct drbd_conf *mdev, const unsigned long s,
+	const unsigned long e, int val, const enum km_type km)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	unsigned long *p_addr = NULL;
+	unsigned long bitnr;
+	unsigned long last_page_nr = -1UL;
+	int c = 0;
+
+	for (bitnr = s; bitnr <= e; bitnr++) {
+		ERR_IF (bitnr >= b->bm_bits) {
+			dev_err(DEV, "bitnr=%lu bm_bits=%lu\n", bitnr, b->bm_bits);
+		} else {
+			unsigned long offset = bitnr>>LN2_BPL;
+			unsigned long page_nr = offset >> (PAGE_SHIFT - LN2_BPL + 3);
+			if (page_nr != last_page_nr) {
+				if (p_addr)
+					__bm_unmap(p_addr, km);
+				p_addr = __bm_map_paddr(b, offset, km);
+				last_page_nr = page_nr;
+			}
+			if (val)
+				c += (0 == __test_and_set_bit(bitnr & BPP_MASK, p_addr));
+			else
+				c -= (0 != __test_and_clear_bit(bitnr & BPP_MASK, p_addr));
+		}
+	}
+	if (p_addr)
+		__bm_unmap(p_addr, km);
+	b->bm_set += c;
+	return c;
+}
+
+/* returns number of bits actually changed.
+ * for val != 0, we change 0 -> 1, return code positive
+ * for val == 0, we change 1 -> 0, return code negative
+ * wants bitnr, not sector */
+int bm_change_bits_to(struct drbd_conf *mdev, const unsigned long s,
+	const unsigned long e, int val)
+{
+	unsigned long flags;
+	struct drbd_bitmap *b = mdev->bitmap;
+	int c = 0;
+
+	ERR_IF(!b) return 1;
+	ERR_IF(!b->bm_pages) return 0;
+
+	spin_lock_irqsave(&b->bm_lock, flags);
+	if (bm_is_locked(b))
+		bm_print_lock_info(mdev);
+
+	c = __bm_change_bits_to(mdev, s, e, val, KM_IRQ1);
+
+	spin_unlock_irqrestore(&b->bm_lock, flags);
+	return c;
+}
+
+/* returns number of bits changed 0 -> 1 */
+int drbd_bm_set_bits(struct drbd_conf *mdev, const unsigned long s, const unsigned long e)
+{
+	return bm_change_bits_to(mdev, s, e, 1);
+}
+
+/* returns number of bits changed 1 -> 0 */
+int drbd_bm_clear_bits(struct drbd_conf *mdev, const unsigned long s, const unsigned long e)
+{
+	return -bm_change_bits_to(mdev, s, e, 0);
+}
+
+/* the same thing, but without taking the spin_lock_irqsave.
+ * you must first drbd_bm_lock(). */
+int _drbd_bm_set_bits(struct drbd_conf *mdev, const unsigned long s, const unsigned long e)
+{
+	/* WARN_ON(!bm_is_locked(b)); */
+	return __bm_change_bits_to(mdev, s, e, 1, KM_USER0);
+}
+
+/* returns bit state
+ * wants bitnr, NOT sector.
+ * inherently racy... area needs to be locked by means of {al,rs}_lru
+ *  1 ... bit set
+ *  0 ... bit not set
+ * -1 ... first out of bounds access, stop testing for bits!
+ */
+int drbd_bm_test_bit(struct drbd_conf *mdev, const unsigned long bitnr)
+{
+	unsigned long flags;
+	struct drbd_bitmap *b = mdev->bitmap;
+	unsigned long *p_addr;
+	int i;
+
+	ERR_IF(!b) return 0;
+	ERR_IF(!b->bm_pages) return 0;
+
+	spin_lock_irqsave(&b->bm_lock, flags);
+	if (bm_is_locked(b))
+		bm_print_lock_info(mdev);
+	if (bitnr < b->bm_bits) {
+		unsigned long offset = bitnr>>LN2_BPL;
+		p_addr = bm_map_paddr(b, offset);
+		i = test_bit(bitnr & BPP_MASK, p_addr) ? 1 : 0;
+		bm_unmap(p_addr);
+	} else if (bitnr == b->bm_bits) {
+		i = -1;
+	} else { /* (bitnr > b->bm_bits) */
+		dev_err(DEV, "bitnr=%lu > bm_bits=%lu\n", bitnr, b->bm_bits);
+		i = 0;
+	}
+
+	spin_unlock_irqrestore(&b->bm_lock, flags);
+	return i;
+}
+
+/* returns number of bits set */
+int drbd_bm_count_bits(struct drbd_conf *mdev, const unsigned long s, const unsigned long e)
+{
+	unsigned long flags;
+	struct drbd_bitmap *b = mdev->bitmap;
+	unsigned long *p_addr = NULL, page_nr = -1;
+	unsigned long bitnr;
+	int c = 0;
+	size_t w;
+
+	/* If this is called without a bitmap, that is a bug.  But just to be
+	 * robust in case we screwed up elsewhere, pretend there was one dirty
+	 * bit in the requested area, so we won't try to do a local read there
+	 * (no bitmap probably implies no disk) */
+	ERR_IF(!b) return 1;
+	ERR_IF(!b->bm_pages) return 1;
+
+	spin_lock_irqsave(&b->bm_lock, flags);
+	for (bitnr = s; bitnr <= e; bitnr++) {
+		w = bitnr >> LN2_BPL;
+		if (page_nr != w >> (PAGE_SHIFT - LN2_BPL + 3)) {
+			page_nr = w >> (PAGE_SHIFT - LN2_BPL + 3);
+			if (p_addr)
+				bm_unmap(p_addr);
+			p_addr = bm_map_paddr(b, w);
+		}
+		ERR_IF (bitnr >= b->bm_bits) {
+			dev_err(DEV, "bitnr=%lu bm_bits=%lu\n", bitnr, b->bm_bits);
+		} else {
+			c += (0 != test_bit(bitnr - (page_nr << (PAGE_SHIFT+3)), p_addr));
+		}
+	}
+	if (p_addr)
+		bm_unmap(p_addr);
+	spin_unlock_irqrestore(&b->bm_lock, flags);
+	return c;
+}
+
+
+/* inherently racy...
+ * return value may be already out-of-date when this function returns.
+ * but the general usage is that this is only used during a cstate when bits are
+ * only cleared, not set, and we typically only care about the case when the
+ * return value is zero, or we already "locked" this "bitmap extent" by other means.
+ *
+ * enr is bm-extent number, since we chose to name one sector (512 bytes)
+ * worth of the bitmap a "bitmap extent".
+ *
+ * TODO
+ * I think since we use it like a reference count, we should use the real
+ * reference count of some bitmap extent element from some lru instead...
+ *
+ */
+int drbd_bm_e_weight(struct drbd_conf *mdev, unsigned long enr)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	int count, s, e;
+	unsigned long flags;
+	unsigned long *p_addr, *bm;
+
+	ERR_IF(!b) return 0;
+	ERR_IF(!b->bm_pages) return 0;
+
+	spin_lock_irqsave(&b->bm_lock, flags);
+	if (bm_is_locked(b))
+		bm_print_lock_info(mdev);
+
+	s = S2W(enr);
+	e = min((size_t)S2W(enr+1), b->bm_words);
+	count = 0;
+	if (s < b->bm_words) {
+		int n = e-s;
+		p_addr = bm_map_paddr(b, s);
+		bm = p_addr + MLPP(s);
+		while (n--)
+			count += hweight_long(*bm++);
+		bm_unmap(p_addr);
+	} else {
+		dev_err(DEV, "start offset (%d) too large in drbd_bm_e_weight\n", s);
+	}
+	spin_unlock_irqrestore(&b->bm_lock, flags);
+	return count;
+}
+
+/* set all bits covered by the AL-extent al_enr */
+unsigned long drbd_bm_ALe_set_all(struct drbd_conf *mdev, unsigned long al_enr)
+{
+	struct drbd_bitmap *b = mdev->bitmap;
+	unsigned long *p_addr, *bm;
+	unsigned long weight;
+	int count, s, e, i, do_now;
+	ERR_IF(!b) return 0;
+	ERR_IF(!b->bm_pages) return 0;
+
+	spin_lock_irq(&b->bm_lock);
+	if (bm_is_locked(b))
+		bm_print_lock_info(mdev);
+	weight = b->bm_set;
+
+	s = al_enr * BM_WORDS_PER_AL_EXT;
+	e = min_t(size_t, s + BM_WORDS_PER_AL_EXT, b->bm_words);
+	/* assert that s and e are on the same page */
+	D_ASSERT((e-1) >> (PAGE_SHIFT - LN2_BPL + 3)
+	      ==  s    >> (PAGE_SHIFT - LN2_BPL + 3));
+	count = 0;
+	if (s < b->bm_words) {
+		i = do_now = e-s;
+		p_addr = bm_map_paddr(b, s);
+		bm = p_addr + MLPP(s);
+		while (i--) {
+			count += hweight_long(*bm);
+			*bm = -1UL;
+			bm++;
+		}
+		bm_unmap(p_addr);
+		b->bm_set += do_now*BITS_PER_LONG - count;
+		if (e == b->bm_words)
+			b->bm_set -= bm_clear_surplus(b);
+	} else {
+		dev_err(DEV, "start offset (%d) too large in drbd_bm_ALe_set_all\n", s);
+	}
+	weight = b->bm_set - weight;
+	spin_unlock_irq(&b->bm_lock);
+	return weight;
+}

* [PATCH 05/16] DRBD: request
  2009-04-30 11:26       ` [PATCH 04/16] DRBD: bitmap Philipp Reisner
@ 2009-04-30 11:26         ` Philipp Reisner
  2009-04-30 11:26           ` [PATCH 06/16] DRBD: userspace_interface Philipp Reisner
  2009-05-02 15:41         ` [PATCH 04/16] DRBD: bitmap James Bottomley
  1 sibling, 1 reply; 90+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg, Philipp Reisner

The request state engine.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

---
diff --git a/drivers/block/drbd/drbd_req.h b/drivers/block/drbd/drbd_req.h
new file mode 100644
index 0000000..a63a1e9
--- /dev/null
+++ b/drivers/block/drbd/drbd_req.h
@@ -0,0 +1,325 @@
+/*
+   drbd_req.h
+
+   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+   Copyright (C) 2006-2008, LINBIT Information Technologies GmbH.
+   Copyright (C) 2006-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+   Copyright (C) 2006-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+
+   DRBD is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   DRBD is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with drbd; see the file COPYING.  If not, write to
+   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#ifndef _DRBD_REQ_H
+#define _DRBD_REQ_H
+
+#include <linux/autoconf.h>
+#include <linux/module.h>
+
+#include <linux/slab.h>
+#include <linux/drbd.h>
+#include "drbd_int.h"
+#include "drbd_wrappers.h"
+
+/* The request callbacks will be called in irq context by the IDE drivers,
+   and in Softirqs/Tasklets/BH context by the SCSI drivers,
+   and by the receiver and worker in kernel-thread context.
+   Try to get the locking right :) */
+
+/*
+ * Objects of type struct drbd_request only exist on an R_PRIMARY node, and are
+ * associated with IO requests originating from the block layer above us.
+ *
+ * There are quite a few things that may happen to a drbd request
+ * during its lifetime.
+ *
+ *  It will be created.
+ *  It will be marked with the intention to be
+ *    submitted to local disk and/or
+ *    send via the network.
+ *
+ *  It has to be placed on the transfer log and other housekeeping lists,
+ *    in case we have a network connection.
+ *
+ *  It may be identified as a concurrent (write) request
+ *    and be handled accordingly.
+ *
+ *  It may be handed over to the local disk subsystem.
+ *  It may be completed by the local disk subsystem,
+ *    either successfully or with io-error.
+ *  In case it is a READ request, and it failed locally,
+ *    it may be retried remotely.
+ *
+ *  It may be queued for sending.
+ *  It may be handed over to the network stack,
+ *    which may fail.
+ *  It may be acknowledged by the "peer" according to the wire_protocol in use.
+ *    this may be a negative ack.
+ *  It may receive a faked ack when the network connection is lost and the
+ *  transfer log is cleaned up.
+ *  Sending may be canceled due to network connection loss.
+ *  When it finally has outlived its time,
+ *    corresponding dirty bits in the resync-bitmap may be cleared or set,
+ *    it will be destroyed,
+ *    and completion will be signalled to the originator,
+ *      with or without "success".
+ */
+
+enum drbd_req_event {
+	created,
+	to_be_send,
+	to_be_submitted,
+
+	/* XXX yes, now I am inconsistent...
+	 * these two are not "events" but "actions"
+	 * oh, well... */
+	queue_for_net_write,
+	queue_for_net_read,
+
+	send_canceled,
+	send_failed,
+	handed_over_to_network,
+	connection_lost_while_pending,
+	recv_acked_by_peer,
+	write_acked_by_peer,
+	write_acked_by_peer_and_sis, /* and set_in_sync */
+	conflict_discarded_by_peer,
+	neg_acked,
+	barrier_acked, /* in protocol A and B */
+	data_received, /* (remote read) */
+
+	read_completed_with_error,
+	write_completed_with_error,
+	completed_ok,
+	nothing, /* for tracing only */
+};
+
+/* encoding of request states for now.  we don't actually need that many bits.
+ * we don't need to do atomic bit operations either, since most of the time we
+ * need to look at the connection state and/or manipulate some lists at the
+ * same time, so we should hold the request lock anyways.
+ */
+enum drbd_req_state_bits {
+	/* 210
+	 * 000: no local possible
+	 * 001: to be submitted
+	 *    UNUSED, we could map: 011: submitted, completion still pending
+	 * 110: completed ok
+	 * 010: completed with error
+	 */
+	__RQ_LOCAL_PENDING,
+	__RQ_LOCAL_COMPLETED,
+	__RQ_LOCAL_OK,
+
+	/* 76543
+	 * 00000: no network possible
+	 * 00001: to be send
+	 * 00011: to be send, on worker queue
+	 * 00101: sent, expecting recv_ack (B) or write_ack (C)
+	 * 11101: sent,
+	 *        recv_ack (B) or implicit "ack" (A),
+	 *        still waiting for the barrier ack.
+	 *        master_bio may already be completed and invalidated.
+	 * 11100: write_acked (C),
+	 *        data_received (for remote read, any protocol)
+	 *        or finally the barrier ack has arrived (B,A)...
+	 *        request can be freed
+	 * 01100: neg-acked (write, protocol C)
+	 *        or neg-d-acked (read, any protocol)
+	 *        or killed from the transfer log
+	 *        during cleanup after connection loss
+	 *        request can be freed
+	 * 01000: canceled or send failed...
+	 *        request can be freed
+	 */
+
+	/* if "SENT" is not set, yet, this can still fail or be canceled.
+	 * if "SENT" is set already, we still wait for an Ack packet.
+	 * when cleared, the master_bio may be completed.
+	 * in (B,A) the request object may still linger on the transaction log
+	 * until the corresponding barrier ack comes in */
+	__RQ_NET_PENDING,
+
+	/* If it is QUEUED, and it is a WRITE, it is also registered in the
+	 * transfer log. Currently we need this flag to avoid conflicts between
+	 * worker canceling the request and tl_clear_barrier killing it from
+	 * transfer log.  We should restructure the code so this conflict does
+	 * no longer occur. */
+	__RQ_NET_QUEUED,
+
+	/* well, actually only "handed over to the network stack".
+	 *
+	 * TODO can potentially be dropped because of the similar meaning
+	 * of RQ_NET_SENT and ~RQ_NET_QUEUED.
+	 * however it is not exactly the same. before we drop it
+	 * we must ensure that we can tell a request with network part
+	 * from a request without, regardless of what happens to it. */
+	__RQ_NET_SENT,
+
+	/* when set, the request may be freed (if RQ_NET_QUEUED is clear).
+	 * basically this means the corresponding P_BARRIER_ACK was received */
+	__RQ_NET_DONE,
+
+	/* whether or not we know (C) or pretend (B,A) that the write
+	 * was successfully written on the peer.
+	 */
+	__RQ_NET_OK,
+
+	/* peer called drbd_set_in_sync() for this write */
+	__RQ_NET_SIS,
+
+	/* keep this last, its for the RQ_NET_MASK */
+	__RQ_NET_MAX,
+};
+
+#define RQ_LOCAL_PENDING   (1UL << __RQ_LOCAL_PENDING)
+#define RQ_LOCAL_COMPLETED (1UL << __RQ_LOCAL_COMPLETED)
+#define RQ_LOCAL_OK        (1UL << __RQ_LOCAL_OK)
+
+#define RQ_LOCAL_MASK      ((RQ_LOCAL_OK << 1)-1) /* 0x07 */
+
+#define RQ_NET_PENDING     (1UL << __RQ_NET_PENDING)
+#define RQ_NET_QUEUED      (1UL << __RQ_NET_QUEUED)
+#define RQ_NET_SENT        (1UL << __RQ_NET_SENT)
+#define RQ_NET_DONE        (1UL << __RQ_NET_DONE)
+#define RQ_NET_OK          (1UL << __RQ_NET_OK)
+#define RQ_NET_SIS         (1UL << __RQ_NET_SIS)
+
+/* 0x1f8 */
+#define RQ_NET_MASK        (((1UL << __RQ_NET_MAX)-1) & ~RQ_LOCAL_MASK)
+
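To sanity-check the encoding documented above, here is a small stand-alone sketch that
reproduces the mask values from the comments (0x07 and 0x1f8) and one of the documented
states; the bit numbers are copied from the enum in this header, everything else is
illustrative:

/* reproduces the request-state mask arithmetic from drbd_req.h */
#include <stdio.h>

enum { __RQ_LOCAL_PENDING, __RQ_LOCAL_COMPLETED, __RQ_LOCAL_OK,
       __RQ_NET_PENDING, __RQ_NET_QUEUED, __RQ_NET_SENT,
       __RQ_NET_DONE, __RQ_NET_OK, __RQ_NET_SIS, __RQ_NET_MAX };

#define RQ_LOCAL_OK     (1UL << __RQ_LOCAL_OK)
#define RQ_LOCAL_MASK   ((RQ_LOCAL_OK << 1) - 1)
#define RQ_NET_PENDING  (1UL << __RQ_NET_PENDING)
#define RQ_NET_QUEUED   (1UL << __RQ_NET_QUEUED)
#define RQ_NET_MASK     (((1UL << __RQ_NET_MAX) - 1) & ~RQ_LOCAL_MASK)

int main(void)
{
	/* "to be send, on worker queue" == 00011 in the comment above */
	unsigned long queued = RQ_NET_PENDING | RQ_NET_QUEUED;

	printf("RQ_LOCAL_MASK=0x%02lx RQ_NET_MASK=0x%lx queued=0x%02lx\n",
	       RQ_LOCAL_MASK, RQ_NET_MASK, queued);
	return 0;
}
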
+/* epoch entries */
+static inline
+struct hlist_head *ee_hash_slot(struct drbd_conf *mdev, sector_t sector)
+{
+	BUG_ON(mdev->ee_hash_s == 0);
+	return mdev->ee_hash +
+		((unsigned int)(sector>>HT_SHIFT) % mdev->ee_hash_s);
+}
+
+/* transfer log (drbd_request objects) */
+static inline
+struct hlist_head *tl_hash_slot(struct drbd_conf *mdev, sector_t sector)
+{
+	BUG_ON(mdev->tl_hash_s == 0);
+	return mdev->tl_hash +
+		((unsigned int)(sector>>HT_SHIFT) % mdev->tl_hash_s);
+}
+
+/* when we receive the ACK for a write request,
+ * verify that we actually know about it */
+static inline struct drbd_request *_ack_id_to_req(struct drbd_conf *mdev,
+	u64 id, sector_t sector)
+{
+	struct hlist_head *slot = tl_hash_slot(mdev, sector);
+	struct hlist_node *n;
+	struct drbd_request *req;
+
+	hlist_for_each_entry(req, n, slot, colision) {
+		if ((unsigned long)req == (unsigned long)id) {
+			if (req->sector != sector) {
+				dev_err(DEV, "_ack_id_to_req: found req %p but it has "
+				    "wrong sector (%llus versus %llus)\n", req,
+				    (unsigned long long)req->sector,
+				    (unsigned long long)sector);
+				break;
+			}
+			return req;
+		}
+	}
+	dev_err(DEV, "_ack_id_to_req: failed to find req %p, sector %llus in list\n",
+		(void *)(unsigned long)id, (unsigned long long)sector);
+	return NULL;
+}
+
+/* application reads (drbd_request objects) */
+static inline struct hlist_head *ar_hash_slot(struct drbd_conf *mdev, sector_t sector)
+{
+	return mdev->app_reads_hash
+		+ ((unsigned int)(sector) % APP_R_HSIZE);
+}
+
+/* when we receive the answer for a read request,
+ * verify that we actually know about it */
+static inline struct drbd_request *_ar_id_to_req(struct drbd_conf *mdev,
+	u64 id, sector_t sector)
+{
+	struct hlist_head *slot = ar_hash_slot(mdev, sector);
+	struct hlist_node *n;
+	struct drbd_request *req;
+
+	hlist_for_each_entry(req, n, slot, colision) {
+		if ((unsigned long)req == (unsigned long)id) {
+			D_ASSERT(req->sector == sector);
+			return req;
+		}
+	}
+	return NULL;
+}
+
+static inline struct drbd_request *drbd_req_new(struct drbd_conf *mdev,
+	struct bio *bio_src)
+{
+	struct bio *bio;
+	struct drbd_request *req =
+		mempool_alloc(drbd_request_mempool, GFP_NOIO);
+	if (likely(req)) {
+		bio = bio_clone(bio_src, GFP_NOIO); /* XXX cannot fail?? */
+
+		req->rq_state    = 0;
+		req->mdev        = mdev;
+		req->master_bio  = bio_src;
+		req->private_bio = bio;
+		req->epoch       = 0;
+		req->sector      = bio->bi_sector;
+		req->size        = bio->bi_size;
+		req->start_time  = jiffies;
+		INIT_HLIST_NODE(&req->colision);
+		INIT_LIST_HEAD(&req->tl_requests);
+		INIT_LIST_HEAD(&req->w.list);
+
+		bio->bi_private  = req;
+		bio->bi_end_io   = drbd_endio_pri;
+		bio->bi_next     = NULL;
+	}
+	return req;
+}
+
+static inline void drbd_req_free(struct drbd_request *req)
+{
+	mempool_free(req, drbd_request_mempool);
+}
+
+static inline int overlaps(sector_t s1, int l1, sector_t s2, int l2)
+{
+	return !((s1 + (l1>>9) <= s2) || (s1 >= s2 + (l2>>9)));
+}
+
+/* apparently too large to be inlined...
+ * moved to drbd_req.c */
+extern void _req_may_be_done(struct drbd_request *req, int error);
+extern void _req_mod(struct drbd_request *req,
+		enum drbd_req_event what, int error);
+
+/* If you need it irqsave, do it yourself! */
+static inline void req_mod(struct drbd_request *req,
+		enum drbd_req_event what, int error)
+{
+	struct drbd_conf *mdev = req->mdev;
+	spin_lock_irq(&mdev->req_lock);
+	_req_mod(req, what, error);
+	spin_unlock_irq(&mdev->req_lock);
+}
+#endif
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
new file mode 100644
index 0000000..dcf6425
--- /dev/null
+++ b/drivers/block/drbd/drbd_req.c
@@ -0,0 +1,1133 @@
+/*
+   drbd_req.c
+
+   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+   Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.
+   Copyright (C) 1999-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+   Copyright (C) 2002-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+
+   drbd is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   drbd is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with drbd; see the file COPYING.  If not, write to
+   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+
+ */
+
+#include <linux/autoconf.h>
+#include <linux/module.h>
+
+#include <linux/slab.h>
+#include <linux/drbd.h>
+#include "drbd_int.h"
+#include "drbd_tracing.h"
+#include "drbd_req.h"
+
+
+/* Update disk stats at start of I/O request */
+static inline void _drbd_start_io_acct(struct drbd_conf *mdev, struct drbd_request *req, struct bio *bio)
+{
+	const int rw = bio_data_dir(bio);
+	int cpu;
+	cpu = part_stat_lock();
+	part_stat_inc(cpu, &mdev->vdisk->part0, ios[rw]);
+	part_stat_add(cpu, &mdev->vdisk->part0, sectors[rw], bio_sectors(bio));
+	part_stat_unlock();
+	mdev->vdisk->part0.in_flight++;
+}
+
+/* Update disk stats when completing request upwards */
+static inline void _drbd_end_io_acct(struct drbd_conf *mdev, struct drbd_request *req)
+{
+	int rw = bio_data_dir(req->master_bio);
+	unsigned long duration = jiffies - req->start_time;
+	int cpu;
+	cpu = part_stat_lock();
+	part_stat_add(cpu, &mdev->vdisk->part0, ticks[rw], duration);
+	part_round_stats(cpu, &mdev->vdisk->part0);
+	part_stat_unlock();
+	mdev->vdisk->part0.in_flight--;
+}
+
+static void _req_is_done(struct drbd_conf *mdev, struct drbd_request *req, const int rw)
+{
+	const unsigned long s = req->rq_state;
+	/* if it was a write, we may have to set the corresponding
+	 * bit(s) out-of-sync first. If it had a local part, we need to
+	 * release the reference to the activity log. */
+	if (rw == WRITE) {
+		/* remove it from the transfer log.
+		 * well, only if it had been there in the first
+		 * place... if it had not (local only or conflicting
+		 * and never sent), it should still be "empty" as
+		 * initialised in drbd_req_new(), so we can list_del() it
+		 * here unconditionally */
+		list_del(&req->tl_requests);
+		/* Set out-of-sync unless both OK flags are set
+		 * (local only or remote failed).
+		 * Other places where we set out-of-sync:
+		 * READ with local io-error */
+		if (!(s & RQ_NET_OK) || !(s & RQ_LOCAL_OK))
+			drbd_set_out_of_sync(mdev, req->sector, req->size);
+
+		if ((s & RQ_NET_OK) && (s & RQ_LOCAL_OK) && (s & RQ_NET_SIS))
+			drbd_set_in_sync(mdev, req->sector, req->size);
+
+		/* one might be tempted to move the drbd_al_complete_io
+		 * to the local io completion callback drbd_endio_pri.
+		 * but, if this was a mirror write, we may only
+		 * drbd_al_complete_io after this is RQ_NET_DONE,
+		 * otherwise the extent could be dropped from the al
+		 * before it has actually been written on the peer.
+		 * if we crash before our peer knows about the request,
+		 * but after the extent has been dropped from the al,
+		 * we would forget to resync the corresponding extent.
+		 */
+		if (s & RQ_LOCAL_MASK) {
+			if (inc_local_if_state(mdev, D_FAILED)) {
+				drbd_al_complete_io(mdev, req->sector);
+				dec_local(mdev);
+			} else if (__ratelimit(&drbd_ratelimit_state)) {
+				dev_warn(DEV, "Should have called drbd_al_complete_io(, %llu), "
+				     "but my Disk seems to have failed :(\n",
+				     (unsigned long long) req->sector);
+			}
+		}
+	}
+
+	/* if it was a local io error, we want to notify our
+	 * peer about that, and see if we need to
+	 * detach the disk and stuff.
+	 * to avoid allocating some special work
+	 * struct, reuse the request. */
+
+	/* THINK
+	 * why do we not do this when we detect the error,
+	 * but delay it until it is "done", i.e. possibly
+	 * until the next barrier ack? */
+
+	if (rw == WRITE &&
+	    ((s & RQ_LOCAL_MASK) && !(s & RQ_LOCAL_OK))) {
+		if (!(req->w.list.next == LIST_POISON1 ||
+		      list_empty(&req->w.list))) {
+			/* DEBUG ASSERT only; if this triggers, we
+			 * probably corrupt the worker list here */
+			DUMPP(req->w.list.next);
+			DUMPP(req->w.list.prev);
+		}
+		req->w.cb = w_io_error;
+		drbd_queue_work(&mdev->data.work, &req->w);
+		/* drbd_req_free() is done in w_io_error */
+	} else {
+		drbd_req_free(req);
+	}
+}
+
+static void queue_barrier(struct drbd_conf *mdev)
+{
+	struct drbd_tl_epoch *b;
+
+	/* We are within the req_lock. Once we queued the barrier for sending,
+	 * we set the CREATE_BARRIER bit. It is cleared as soon as a new
+	 * barrier/epoch object is added. This is the only place this bit is
+	 * set. It indicates that the barrier for this epoch is already queued,
+	 * and no new epoch has been created yet. */
+	if (test_bit(CREATE_BARRIER, &mdev->flags))
+		return;
+
+	b = mdev->newest_tle;
+	b->w.cb = w_send_barrier;
+	/* inc_ap_pending done here, so we won't
+	 * get imbalanced on connection loss.
+	 * dec_ap_pending will be done in got_BarrierAck
+	 * or (on connection loss) in tl_clear.  */
+	inc_ap_pending(mdev);
+	drbd_queue_work(&mdev->data.work, &b->w);
+	set_bit(CREATE_BARRIER, &mdev->flags);
+}
+
+static void _about_to_complete_local_write(struct drbd_conf *mdev,
+	struct drbd_request *req)
+{
+	const unsigned long s = req->rq_state;
+	struct drbd_request *i;
+	struct drbd_epoch_entry *e;
+	struct hlist_node *n;
+	struct hlist_head *slot;
+
+	/* before we can signal completion to the upper layers,
+	 * we may need to close the current epoch */
+	if (mdev->state.conn >= C_CONNECTED &&
+	    req->epoch == mdev->newest_tle->br_number)
+		queue_barrier(mdev);
+
+	/* we need to do the conflict detection stuff,
+	 * if we have the ee_hash (two_primaries) and
+	 * this has been on the network */
+	if ((s & RQ_NET_DONE) && mdev->ee_hash != NULL) {
+		const sector_t sector = req->sector;
+		const int size = req->size;
+
+		/* ASSERT:
+		 * there must be no conflicting requests, since
+		 * they must have been failed on the spot */
+#define OVERLAPS overlaps(sector, size, i->sector, i->size)
+		slot = tl_hash_slot(mdev, sector);
+		hlist_for_each_entry(i, n, slot, colision) {
+			if (OVERLAPS) {
+				dev_alert(DEV, "LOGIC BUG: completed: %p %llus +%u; "
+				      "other: %p %llus +%u\n",
+				      req, (unsigned long long)sector, size,
+				      i, (unsigned long long)i->sector, i->size);
+			}
+		}
+
+		/* maybe "wake" those conflicting epoch entries
+		 * that wait for this request to finish.
+		 *
+		 * currently, there can be only _one_ such ee
+		 * (well, or some more, which would be pending
+		 * P_DISCARD_ACK not yet sent by the asender...),
+		 * since we block the receiver thread upon the
+		 * first conflict detection, which will wait on
+		 * misc_wait.  maybe we want to assert that?
+		 *
+		 * anyways, if we found one,
+		 * we just have to do a wake_up.  */
+#undef OVERLAPS
+#define OVERLAPS overlaps(sector, size, e->sector, e->size)
+		slot = ee_hash_slot(mdev, req->sector);
+		hlist_for_each_entry(e, n, slot, colision) {
+			if (OVERLAPS) {
+				wake_up(&mdev->misc_wait);
+				break;
+			}
+		}
+	}
+#undef OVERLAPS
+}
+
+static void _complete_master_bio(struct drbd_conf *mdev,
+	struct drbd_request *req, int error)
+{
+	trace_drbd_bio(mdev, "Rq", req->master_bio, 1, req);
+	bio_endio(req->master_bio, error);
+	req->master_bio = NULL;
+	dec_ap_bio(mdev);
+}
+
+void _req_may_be_done(struct drbd_request *req, int error)
+{
+	const unsigned long s = req->rq_state;
+	struct drbd_conf *mdev = req->mdev;
+	int rw;
+
+	trace_drbd_req(req, nothing, "_req_may_be_done");
+
+	/* we must not complete the master bio, while it is
+	 *	still being processed by _drbd_send_zc_bio (drbd_send_dblock)
+	 *	not yet acknowledged by the peer
+	 *	not yet completed by the local io subsystem
+	 * these flags may get cleared in any order by
+	 *	the worker,
+	 *	the receiver,
+	 *	the bio_endio completion callbacks.
+	 */
+	if (s & RQ_NET_QUEUED)
+		return;
+	if (s & RQ_NET_PENDING)
+		return;
+	if (s & RQ_LOCAL_PENDING)
+		return;
+
+	if (req->master_bio) {
+		/* this is data_received (remote read)
+		 * or protocol C P_WRITE_ACK
+		 * or protocol B P_RECV_ACK
+		 * or protocol A "handed_over_to_network" (SendAck)
+		 * or canceled or failed,
+		 * or killed from the transfer log due to connection loss.
+		 */
+
+		/*
+		 * figure out whether to report success or failure.
+		 *
+		 * report success when at least one of the operations succeeded.
+		 * or, to put it the other way,
+		 * only report failure when both operations failed.
+		 *
+		 * what to do about the failures is handled elsewhere.
+		 * what we need to do here is just: complete the master_bio.
+		 */
+		int ok = (s & RQ_LOCAL_OK) || (s & RQ_NET_OK);
+		rw = bio_data_dir(req->master_bio);
+
+		/* remove the request from the conflict detection
+		 * (respectively, block_id verification) hash */
+		if (!hlist_unhashed(&req->colision))
+			hlist_del(&req->colision);
+		else
+			D_ASSERT((s & RQ_NET_MASK) == 0);
+
+		/* for writes we need to do some extra housekeeping */
+		if (rw == WRITE)
+			_about_to_complete_local_write(mdev, req);
+
+		/* Update disk stats */
+		_drbd_end_io_acct(mdev, req);
+
+		_complete_master_bio(mdev, req,
+				     ok ? 0 : (error ? error : -EIO));
+	} else {
+		/* only WRITE requests can end up here without a master_bio */
+		rw = WRITE;
+	}
+
+	if ((s & RQ_NET_MASK) == 0 || (s & RQ_NET_DONE)) {
+		/* this is disconnected (local only) operation,
+		 * or protocol C P_WRITE_ACK,
+		 * or protocol A or B P_BARRIER_ACK,
+		 * or killed from the transfer log due to connection loss. */
+		_req_is_done(mdev, req, rw);
+	}
+	/* else: network part and not DONE yet. that is
+	 * protocol A or B, barrier ack still pending... */
+}
+
+/*
+ * checks whether there was an overlapping request
+ * or ee already registered.
+ *
+ * if so, return 1, in which case this request is completed on the spot,
+ * without ever being submitted or sent.
+ *
+ * return 0 if it is ok to submit this request.
+ *
+ * NOTE:
+ * paranoia: assume something above us is broken, and issues different write
+ * requests for the same block simultaneously...
+ *
+ * To ensure these won't be reordered differently on both nodes, resulting in
+ * diverging data sets, we discard the later one(s). Not that this is supposed
+ * to happen, but this is the rationale why we also have to check for
+ * conflicting requests with local origin, and why we have to do so regardless
+ * of whether we allowed multiple primaries.
+ *
+ * BTW, in case we only have one primary, the ee_hash is empty anyway, and the
+ * second hlist_for_each_entry becomes a noop. This is even simpler than
+ * grabbing a reference on the net_conf and checking for the two_primaries flag...
+ */
+STATIC int _req_conflicts(struct drbd_request *req)
+{
+	struct drbd_conf *mdev = req->mdev;
+	const sector_t sector = req->sector;
+	const int size = req->size;
+	struct drbd_request *i;
+	struct drbd_epoch_entry *e;
+	struct hlist_node *n;
+	struct hlist_head *slot;
+
+	D_ASSERT(hlist_unhashed(&req->colision));
+
+	if (!inc_net(mdev))
+		return 0;
+
+	/* BUG_ON */
+	ERR_IF (mdev->tl_hash_s == 0)
+		goto out_no_conflict;
+	BUG_ON(mdev->tl_hash == NULL);
+
+#define OVERLAPS overlaps(i->sector, i->size, sector, size)
+	slot = tl_hash_slot(mdev, sector);
+	hlist_for_each_entry(i, n, slot, colision) {
+		if (OVERLAPS) {
+			dev_alert(DEV, "%s[%u] Concurrent local write detected! "
+			      "[DISCARD L] new: %llus +%u; "
+			      "pending: %llus +%u\n",
+			      current->comm, current->pid,
+			      (unsigned long long)sector, size,
+			      (unsigned long long)i->sector, i->size);
+			goto out_conflict;
+		}
+	}
+
+	if (mdev->ee_hash_s) {
+		/* now, check for overlapping requests with remote origin */
+		BUG_ON(mdev->ee_hash == NULL);
+#undef OVERLAPS
+#define OVERLAPS overlaps(e->sector, e->size, sector, size)
+		slot = ee_hash_slot(mdev, sector);
+		hlist_for_each_entry(e, n, slot, colision) {
+			if (OVERLAPS) {
+				dev_alert(DEV, "%s[%u] Concurrent remote write detected!"
+				      " [DISCARD L] new: %llus +%u; "
+				      "pending: %llus +%u\n",
+				      current->comm, current->pid,
+				      (unsigned long long)sector, size,
+				      (unsigned long long)e->sector, e->size);
+				goto out_conflict;
+			}
+		}
+	}
+#undef OVERLAPS
+
+out_no_conflict:
+	/* this is like it should be, and what we expected.
+	 * our users do behave after all... */
+	dec_net(mdev);
+	return 0;
+
+out_conflict:
+	dec_net(mdev);
+	return 1;
+}
+
+/* obviously this could be coded as many single functions
+ * instead of one huge switch,
+ * or by putting the code directly in the respective locations
+ * (as it has been before).
+ *
+ * but having it this way
+ *  enforces that it is all in this one place, where it is easier to audit,
+ *  it makes it obvious that whatever "event" "happens" to a request should
+ *  happen "atomically" within the req_lock,
+ *  and it enforces that we have to think in a very structured manner
+ *  about the "events" that may happen to a request during its life time ...
+ *
+ * Though I think it is likely that we break this again into many
+ * static inline void _req_mod_ ## what (req) ...
+ */
+void _req_mod(struct drbd_request *req, enum drbd_req_event what, int error)
+{
+	struct drbd_conf *mdev = req->mdev;
+
+	if (error && (bio_rw(req->master_bio) != READA))
+		dev_err(DEV, "got an _req_mod() errno of %d\n", error);
+
+	trace_drbd_req(req, what, NULL);
+
+	switch (what) {
+	default:
+		dev_err(DEV, "LOGIC BUG in %s:%u\n", __FILE__ , __LINE__);
+		return;
+
+	/* does not happen...
+	 * initialization done in drbd_req_new
+	case created:
+		break;
+		*/
+
+	case to_be_send: /* via network */
+		/* reached via drbd_make_request_common
+		 * and from w_read_retry_remote */
+		D_ASSERT(!(req->rq_state & RQ_NET_MASK));
+		req->rq_state |= RQ_NET_PENDING;
+		inc_ap_pending(mdev);
+		break;
+
+	case to_be_submitted: /* locally */
+		/* reached via drbd_make_request_common */
+		D_ASSERT(!(req->rq_state & RQ_LOCAL_MASK));
+		req->rq_state |= RQ_LOCAL_PENDING;
+		break;
+
+	case completed_ok:
+		if (bio_data_dir(req->private_bio) == WRITE)
+			mdev->writ_cnt += req->size>>9;
+		else
+			mdev->read_cnt += req->size>>9;
+
+		bio_put(req->private_bio);
+		req->private_bio = NULL;
+
+		req->rq_state |= (RQ_LOCAL_COMPLETED|RQ_LOCAL_OK);
+		req->rq_state &= ~RQ_LOCAL_PENDING;
+
+		_req_may_be_done(req, error);
+		dec_local(mdev);
+		break;
+
+	case write_completed_with_error:
+		req->rq_state |= RQ_LOCAL_COMPLETED;
+		req->rq_state &= ~RQ_LOCAL_PENDING;
+
+		bio_put(req->private_bio);
+		req->private_bio = NULL;
+		dev_alert(DEV, "Local WRITE failed sec=%llus size=%u\n",
+		      (unsigned long long)req->sector, req->size);
+		/* and now: check how to handle local io error. */
+		__drbd_chk_io_error(mdev, FALSE);
+		_req_may_be_done(req, error);
+		dec_local(mdev);
+		break;
+
+	case read_completed_with_error:
+		if (bio_rw(req->master_bio) != READA)
+			drbd_set_out_of_sync(mdev, req->sector, req->size);
+
+		req->rq_state |= RQ_LOCAL_COMPLETED;
+		req->rq_state &= ~RQ_LOCAL_PENDING;
+
+		bio_put(req->private_bio);
+		req->private_bio = NULL;
+		if (bio_rw(req->master_bio) == READA) {
+			/* it is legal to fail READA */
+			_req_may_be_done(req, error);
+			dec_local(mdev);
+			break;
+		}
+		/* else */
+		dev_alert(DEV, "Local READ failed sec=%llus size=%u\n",
+		      (unsigned long long)req->sector, req->size);
+		/* _req_mod(req,to_be_send); oops, recursion in static inline */
+		D_ASSERT(!(req->rq_state & RQ_NET_MASK));
+		req->rq_state |= RQ_NET_PENDING;
+		inc_ap_pending(mdev);
+
+		__drbd_chk_io_error(mdev, FALSE);
+		dec_local(mdev);
+		/* NOTE: if we have no connection,
+		 * or know the peer has no good data either,
+		 * then we don't actually need to "queue_for_net_read",
+		 * but we do so anyways, since the drbd_io_error()
+		 * and the potential state change to "Diskless"
+		 * need to be done from process context */
+
+		/* fall through: _req_mod(req,queue_for_net_read); */
+
+	case queue_for_net_read:
+		/* READ or READA, and
+		 * no local disk,
+		 * or target area marked as invalid,
+		 * or just got an io-error. */
+		/* from drbd_make_request_common
+		 * or from bio_endio during read io-error recovery */
+
+		/* so we can verify the handle in the answer packet
+		 * corresponding hlist_del is in _req_may_be_done() */
+		hlist_add_head(&req->colision, ar_hash_slot(mdev, req->sector));
+
+		set_bit(UNPLUG_REMOTE, &mdev->flags); /* why? */
+
+		D_ASSERT(req->rq_state & RQ_NET_PENDING);
+		req->rq_state |= RQ_NET_QUEUED;
+		req->w.cb = (req->rq_state & RQ_LOCAL_MASK)
+			? w_read_retry_remote
+			: w_send_read_req;
+		drbd_queue_work(&mdev->data.work, &req->w);
+		break;
+
+	case queue_for_net_write:
+		/* assert something? */
+		/* from drbd_make_request_common only */
+
+		hlist_add_head(&req->colision, tl_hash_slot(mdev, req->sector));
+		/* corresponding hlist_del is in _req_may_be_done() */
+
+		/* NOTE
+		 * In case the req ended up on the transfer log before being
+		 * queued on the worker, it could lead to this request being
+		 * missed during cleanup after connection loss.
+		 * So we have to do both operations here,
+		 * within the same lock that protects the transfer log.
+		 *
+		 * _req_add_to_epoch(req); this has to be after the
+		 * _maybe_start_new_epoch(req); which happened in
+		 * drbd_make_request_common, because we now may set the bit
+		 * again ourselves to close the current epoch.
+		 *
+		 * Add req to the (now) current epoch (barrier). */
+
+		/* see drbd_make_request_common,
+		 * just after it grabs the req_lock */
+		D_ASSERT(test_bit(CREATE_BARRIER, &mdev->flags) == 0);
+
+		req->epoch = mdev->newest_tle->br_number;
+		list_add_tail(&req->tl_requests,
+				&mdev->newest_tle->requests);
+
+		/* increment size of current epoch */
+		mdev->newest_tle->n_req++;
+
+		/* queue work item to send data */
+		D_ASSERT(req->rq_state & RQ_NET_PENDING);
+		req->rq_state |= RQ_NET_QUEUED;
+		req->w.cb =  w_send_dblock;
+		drbd_queue_work(&mdev->data.work, &req->w);
+
+		/* close the epoch, in case it outgrew the limit */
+		if (mdev->newest_tle->n_req >= mdev->net_conf->max_epoch_size)
+			queue_barrier(mdev);
+
+		break;
+
+	case send_canceled:
+		/* treat it the same */
+	case send_failed:
+		/* real cleanup will be done from tl_clear.  just update flags
+		 * so it is no longer marked as on the worker queue */
+		req->rq_state &= ~RQ_NET_QUEUED;
+		/* if we did it right, tl_clear should be scheduled only after
+		 * this, so this should not be necessary! */
+		_req_may_be_done(req, error);
+		break;
+
+	case handed_over_to_network:
+		/* assert something? */
+		if (bio_data_dir(req->master_bio) == WRITE &&
+		    mdev->net_conf->wire_protocol == DRBD_PROT_A) {
+			/* this is what is dangerous about protocol A:
+			 * pretend it was successfully written on the peer. */
+			if (req->rq_state & RQ_NET_PENDING) {
+				dec_ap_pending(mdev);
+				req->rq_state &= ~RQ_NET_PENDING;
+				req->rq_state |= RQ_NET_OK;
+			} /* else: neg-ack was faster... */
+			/* it is still not yet RQ_NET_DONE until the
+			 * corresponding epoch barrier got acked as well,
+			 * so we know what to dirty on connection loss */
+		}
+		req->rq_state &= ~RQ_NET_QUEUED;
+		req->rq_state |= RQ_NET_SENT;
+		/* because _drbd_send_zc_bio could sleep, and may want to
+		 * dereference the bio even after the "write_acked_by_peer" and
+		 * "completed_ok" events came in, once we return from
+		 * _drbd_send_zc_bio (drbd_send_dblock), we have to check
+		 * whether it is done already, and end it.  */
+		_req_may_be_done(req, error);
+		break;
+
+	case connection_lost_while_pending:
+		/* transfer log cleanup after connection loss */
+		/* assert something? */
+		if (req->rq_state & RQ_NET_PENDING)
+			dec_ap_pending(mdev);
+		req->rq_state &= ~(RQ_NET_OK|RQ_NET_PENDING);
+		req->rq_state |= RQ_NET_DONE;
+		/* if it is still queued, we may not complete it here.
+		 * it will be canceled soon. */
+		if (!(req->rq_state & RQ_NET_QUEUED))
+			_req_may_be_done(req, error);
+		break;
+
+	case write_acked_by_peer_and_sis:
+		req->rq_state |= RQ_NET_SIS;
+	case conflict_discarded_by_peer:
+		/* for discarded conflicting writes of multiple primaries,
+		 * there is no need to keep anything in the tl, potential
+		 * node crashes are covered by the activity log. */
+		req->rq_state |= RQ_NET_DONE;
+		/* fall through */
+	case write_acked_by_peer:
+		/* protocol C; successfully written on peer.
+		 * Nothing to do here.
+		 * We want to keep the tl in place for all protocols, to cater
+		 * for volatile write-back caches on lower level devices.
+		 *
+		 * A barrier request is expected to have forced all prior
+		 * requests onto stable storage, so completion of a barrier
+		 * request could set NET_DONE right here, and not wait for the
+		 * P_BARRIER_ACK, but that is an unnecessary optimisation. */
+
+		/* this makes it effectively the same as for: */
+	case recv_acked_by_peer:
+		/* protocol B; pretends to be successfully written on the peer.
+		 * see also notes above in handed_over_to_network about
+		 * protocol != C */
+		req->rq_state |= RQ_NET_OK;
+		D_ASSERT(req->rq_state & RQ_NET_PENDING);
+		dec_ap_pending(mdev);
+		req->rq_state &= ~RQ_NET_PENDING;
+		_req_may_be_done(req, error);
+		break;
+
+	case neg_acked:
+		/* assert something? */
+		if (req->rq_state & RQ_NET_PENDING)
+			dec_ap_pending(mdev);
+		req->rq_state &= ~(RQ_NET_OK|RQ_NET_PENDING);
+
+		req->rq_state |= RQ_NET_DONE;
+		_req_may_be_done(req, error);
+		/* else: done by handed_over_to_network */
+		break;
+
+	case barrier_acked:
+		if (req->rq_state & RQ_NET_PENDING) {
+			/* barrier came in before all requests have been acked.
+			 * this is bad, because if the connection is lost now,
+			 * we won't be able to clean them up... */
+			dev_err(DEV, "FIXME (barrier_acked but pending)\n");
+			trace_drbd_req(req, nothing, "FIXME (barrier_acked but pending)");
+			list_move(&req->tl_requests, &mdev->out_of_sequence_requests);
+		}
+		D_ASSERT(req->rq_state & RQ_NET_SENT);
+		req->rq_state |= RQ_NET_DONE;
+		_req_may_be_done(req, error);
+		break;
+
+	case data_received:
+		D_ASSERT(req->rq_state & RQ_NET_PENDING);
+		dec_ap_pending(mdev);
+		req->rq_state &= ~RQ_NET_PENDING;
+		req->rq_state |= (RQ_NET_OK|RQ_NET_DONE);
+		_req_may_be_done(req, error);
+		break;
+	};
+}
+
+/* we may do a local read if:
+ * - we are consistent (of course),
+ * - or we are generally inconsistent,
+ *   BUT we are still/already IN SYNC for this area.
+ *   since size may be bigger than BM_BLOCK_SIZE,
+ *   we may need to check several bits.
+ */
+STATIC int drbd_may_do_local_read(struct drbd_conf *mdev, sector_t sector, int size)
+{
+	unsigned long sbnr, ebnr;
+	sector_t esector, nr_sectors;
+
+	if (mdev->state.disk == D_UP_TO_DATE)
+		return 1;
+	if (mdev->state.disk >= D_OUTDATED)
+		return 0;
+	if (mdev->state.disk <  D_INCONSISTENT)
+		return 0;
+	/* state.disk == D_INCONSISTENT   We will have a look at the BitMap */
+	nr_sectors = drbd_get_capacity(mdev->this_bdev);
+	esector = sector + (size >> 9) - 1;
+
+	D_ASSERT(sector  < nr_sectors);
+	D_ASSERT(esector < nr_sectors);
+
+	sbnr = BM_SECT_TO_BIT(sector);
+	ebnr = BM_SECT_TO_BIT(esector);
+
+	return 0 == drbd_bm_count_bits(mdev, sbnr, ebnr);
+}
+
+STATIC int drbd_make_request_common(struct drbd_conf *mdev, struct bio *bio)
+{
+	const int rw = bio_rw(bio);
+	const int size = bio->bi_size;
+	const sector_t sector = bio->bi_sector;
+	struct drbd_tl_epoch *b = NULL;
+	struct drbd_request *req;
+	int local, remote;
+	int err = -EIO;
+
+	/* allocate outside of all locks; */
+	req = drbd_req_new(mdev, bio);
+	if (!req) {
+		dec_ap_bio(mdev);
+		/* only pass the error to the upper layers.
+		 * if the user cannot handle io errors, that's not our business. */
+		dev_err(DEV, "could not kmalloc() req\n");
+		bio_endio(bio, -ENOMEM);
+		return 0;
+	}
+
+	trace_drbd_bio(mdev, "Rq", bio, 0, req);
+
+	local = inc_local(mdev);
+	if (!local) {
+		bio_put(req->private_bio); /* or we get a bio leak */
+		req->private_bio = NULL;
+	}
+	if (rw == WRITE) {
+		remote = 1;
+	} else {
+		/* READ || READA */
+		if (local) {
+			if (!drbd_may_do_local_read(mdev, sector, size)) {
+				/* we could kick the syncer to
+				 * sync this extent asap, wait for
+				 * it, then continue locally.
+				 * Or just issue the request remotely.
+				 */
+				local = 0;
+				bio_put(req->private_bio);
+				req->private_bio = NULL;
+				dec_local(mdev);
+			}
+		}
+		remote = !local && mdev->state.pdsk >= D_UP_TO_DATE;
+	}
+
+	/* If we have a disk, but a READA request is mapped to remote,
+	 * we are R_PRIMARY, D_INCONSISTENT, SyncTarget.
+	 * Just fail that READA request right here.
+	 *
+	 * THINK: maybe fail all READA when not local?
+	 *        or make this configurable...
+	 *        if network is slow, READA won't do any good.
+	 */
+	if (rw == READA && mdev->state.disk >= D_INCONSISTENT && !local) {
+		err = -EWOULDBLOCK;
+		goto fail_and_free_req;
+	}
+
+	/* For WRITES going to the local disk, grab a reference on the target
+	 * extent.  This waits for any resync activity in the corresponding
+	 * resync extent to finish, and, if necessary, pulls in the target
+	 * extent into the activity log, which involves further disk io because
+	 * of transactional on-disk meta data updates. */
+	if (rw == WRITE && local)
+		drbd_al_begin_io(mdev, sector);
+
+	remote = remote && (mdev->state.pdsk == D_UP_TO_DATE ||
+			    (mdev->state.pdsk == D_INCONSISTENT &&
+			     mdev->state.conn >= C_CONNECTED));
+
+	if (!(local || remote)) {
+		dev_err(DEV, "IO ERROR: neither local nor remote disk\n");
+		goto fail_free_complete;
+	}
+
+	/* For WRITE requests, we have to make sure that we have an
+	 * unused_spare_tle, in case we need to start a new epoch.
+	 * I try to be smart and avoid always pre-allocating "just in case",
+	 * but there is a race between testing the bit and pointer outside the
+	 * spinlock, and grabbing the spinlock.
+	 * if we lost that race, we retry.  */
+	if (rw == WRITE && remote &&
+	    mdev->unused_spare_tle == NULL &&
+	    test_bit(CREATE_BARRIER, &mdev->flags)) {
+allocate_barrier:
+		b = kmalloc(sizeof(struct drbd_tl_epoch), GFP_NOIO);
+		if (!b) {
+			dev_err(DEV, "Failed to alloc barrier.\n");
+			err = -ENOMEM;
+			goto fail_free_complete;
+		}
+	}
+
+	/* GOOD, everything prepared, grab the spin_lock */
+	spin_lock_irq(&mdev->req_lock);
+
+	if (remote) {
+		remote = (mdev->state.pdsk == D_UP_TO_DATE ||
+			    (mdev->state.pdsk == D_INCONSISTENT &&
+			     mdev->state.conn >= C_CONNECTED));
+		if (!remote)
+			dev_warn(DEV, "lost connection while grabbing the req_lock!\n");
+		if (!(local || remote)) {
+			dev_err(DEV, "IO ERROR: neither local nor remote disk\n");
+			spin_unlock_irq(&mdev->req_lock);
+			goto fail_free_complete;
+		}
+	}
+
+	if (b && mdev->unused_spare_tle == NULL) {
+		mdev->unused_spare_tle = b;
+		b = NULL;
+	}
+	if (rw == WRITE && remote &&
+	    mdev->unused_spare_tle == NULL &&
+	    test_bit(CREATE_BARRIER, &mdev->flags)) {
+		/* someone closed the current epoch
+		 * while we were grabbing the spinlock */
+		spin_unlock_irq(&mdev->req_lock);
+		goto allocate_barrier;
+	}
+
+
+	/* Update disk stats */
+	_drbd_start_io_acct(mdev, req, bio);
+
+	/* _maybe_start_new_epoch(mdev);
+	 * If we need to generate a write barrier packet, we have to add the
+	 * new epoch (barrier) object, and queue the barrier packet for sending,
+	 * and queue the req's data after it _within the same lock_, otherwise
+	 * we have race conditions where the reorder domains could be mixed up.
+	 *
+	 * Even read requests may start a new epoch and queue the corresponding
+	 * barrier packet.  To get the write ordering right, we only have to
+	 * make sure that, if this is a write request and it triggered a
+	 * barrier packet, this request is queued within the same spinlock. */
+	if (remote && mdev->unused_spare_tle &&
+	    test_and_clear_bit(CREATE_BARRIER, &mdev->flags)) {
+		_tl_add_barrier(mdev, mdev->unused_spare_tle);
+		mdev->unused_spare_tle = NULL;
+	} else {
+		D_ASSERT(!(remote && rw == WRITE &&
+			   test_bit(CREATE_BARRIER, &mdev->flags)));
+	}
+
+	/* NOTE
+	 * Actually, 'local' may be wrong here already, since we may have failed
+	 * to write to the meta data, and may become wrong anytime because of
+	 * local io-error for some other request, which would lead to us
+	 * "detaching" the local disk.
+	 *
+	 * 'remote' may become wrong any time because the network could fail.
+	 *
+	 * This is a harmless race condition, though, since it is handled
+	 * correctly at the appropriate places; so it just defers the failure
+	 * of the respective operation.
+	 */
+
+	/* mark them early for readability.
+	 * this just sets some state flags. */
+	if (remote)
+		_req_mod(req, to_be_send, 0);
+	if (local)
+		_req_mod(req, to_be_submitted, 0);
+
+	/* check this request on the collision detection hash tables.
+	 * if we have a conflict, just complete it here.
+	 * THINK do we want to check reads, too? (I don't think so...) */
+	if (rw == WRITE && _req_conflicts(req)) {
+		/* this is a conflicting request.
+		 * even though it may have been only _partially_
+		 * overlapping with one of the currently pending requests,
+		 * without even submitting or sending it, we will
+		 * pretend that it was successfully served right now.
+		 */
+		if (local) {
+			bio_put(req->private_bio);
+			req->private_bio = NULL;
+			drbd_al_complete_io(mdev, req->sector);
+			dec_local(mdev);
+			local = 0;
+		}
+		if (remote)
+			dec_ap_pending(mdev);
+		_drbd_end_io_acct(mdev, req);
+		/* THINK: do we want to fail it (-EIO), or pretend success? */
+		bio_endio(req->master_bio, 0);
+		req->master_bio = NULL;
+		dec_ap_bio(mdev);
+		drbd_req_free(req);
+		remote = 0;
+	}
+
+	/* NOTE remote first: to get the concurrent write detection right,
+	 * we must register the request before start of local IO.  */
+	if (remote) {
+		/* either WRITE and C_CONNECTED,
+		 * or READ, and no local disk,
+		 * or READ, but not in sync.
+		 */
+		if (rw == WRITE)
+			_req_mod(req, queue_for_net_write, 0);
+		else
+			_req_mod(req, queue_for_net_read, 0);
+	}
+	spin_unlock_irq(&mdev->req_lock);
+	kfree(b); /* if someone else has beaten us to it... */
+
+	if (local) {
+		req->private_bio->bi_bdev = mdev->bc->backing_bdev;
+
+		trace_drbd_bio(mdev, "Pri", req->private_bio, 0, NULL);
+
+		if (FAULT_ACTIVE(mdev, rw == WRITE ? DRBD_FAULT_DT_WR
+				     : rw == READ  ? DRBD_FAULT_DT_RD
+				     :               DRBD_FAULT_DT_RA))
+			bio_endio(req->private_bio, -EIO);
+		else
+			generic_make_request(req->private_bio);
+	}
+
+	/* we need to plug ALWAYS since we possibly need to kick lo_dev.
+	 * we plug after submit, so we won't miss an unplug event */
+	drbd_plug_device(mdev);
+
+	return 0;
+
+fail_free_complete:
+	if (rw == WRITE && local)
+		drbd_al_complete_io(mdev, sector);
+fail_and_free_req:
+	if (local) {
+		bio_put(req->private_bio);
+		req->private_bio = NULL;
+		dec_local(mdev);
+	}
+	bio_endio(bio, err);
+	drbd_req_free(req);
+	dec_ap_bio(mdev);
+	kfree(b);
+
+	return 0;
+}
+
+/* helper function for drbd_make_request
+ * if we can determine just by the mdev (state) that this request will fail,
+ * return 1
+ * otherwise return 0
+ */
+static int drbd_fail_request_early(struct drbd_conf *mdev, int is_write)
+{
+	/* Unconfigured */
+	if (mdev->state.conn == C_DISCONNECTING &&
+	    mdev->state.disk == D_DISKLESS)
+		return 1;
+
+	if (mdev->state.role != R_PRIMARY &&
+		(!allow_oos || is_write)) {
+		if (__ratelimit(&drbd_ratelimit_state)) {
+			dev_err(DEV, "Process %s[%u] tried to %s; "
+			    "since we are not in Primary state, "
+			    "we cannot allow this\n",
+			    current->comm, current->pid,
+			    is_write ? "WRITE" : "READ");
+		}
+		return 1;
+	}
+
+	/*
+	 * Paranoia: we might have been primary, but sync target, or
+	 * even diskless, then lost the connection.
+	 * This should have been handled (panic? suspend?) somewhere
+	 * else. But maybe it was not, so check again here.
+	 * Caution: as long as we do not have a read/write lock on mdev,
+	 * to serialize state changes, this is racy, since we may lose
+	 * the connection *after* we test for the cstate.
+	 */
+	if (mdev->state.disk < D_UP_TO_DATE && mdev->state.pdsk < D_UP_TO_DATE) {
+		if (__ratelimit(&drbd_ratelimit_state))
+			dev_err(DEV, "Sorry, I have no access to good data anymore.\n");
+		return 1;
+	}
+
+	return 0;
+}
+
+int drbd_make_request_26(struct request_queue *q, struct bio *bio)
+{
+	unsigned int s_enr, e_enr;
+	struct drbd_conf *mdev = (struct drbd_conf *) q->queuedata;
+
+	if (drbd_fail_request_early(mdev, bio_data_dir(bio) & WRITE)) {
+		bio_endio(bio, -EPERM);
+		return 0;
+	}
+
+	/* Reject barrier requests if we know the underlying device does
+	 * not support them.
+	 * XXX: Need to get this info from the peer as well somehow so we
+	 * XXX: reject if EITHER side/data/metadata area does not support them.
+	 *
+	 * because of those XXX, this is not yet enabled,
+	 * i.e. in drbd_init_set_defaults we set the NO_BARRIER_SUPP bit.
+	 */
+	if (unlikely(bio_barrier(bio) && test_bit(NO_BARRIER_SUPP, &mdev->flags))) {
+		/* dev_warn(DEV, "Rejecting barrier request as underlying device does not support them\n"); */
+		bio_endio(bio, -EOPNOTSUPP);
+		return 0;
+	}
+
+	/*
+	 * what we "blindly" assume:
+	 */
+	D_ASSERT(bio->bi_size > 0);
+	D_ASSERT((bio->bi_size & 0x1ff) == 0);
+	D_ASSERT(bio->bi_idx == 0);
+
+	/* to make some things easier, force alignment of requests within the
+	 * granularity of our hash tables */
+	s_enr = bio->bi_sector >> HT_SHIFT;
+	e_enr = (bio->bi_sector+(bio->bi_size>>9)-1) >> HT_SHIFT;
+
+	if (likely(s_enr == e_enr)) {
+		inc_ap_bio(mdev, 1);
+		return drbd_make_request_common(mdev, bio);
+	}
+
+	/* can this bio be split generically?
+	 * Maybe add our own split-arbitrary-bios function. */
+	if (bio->bi_vcnt != 1 || bio->bi_idx != 0 || bio->bi_size > DRBD_MAX_SEGMENT_SIZE) {
+		/* rather error out here than BUG in bio_split */
+		dev_err(DEV, "bio would need to, but cannot, be split: "
+		    "(vcnt=%u,idx=%u,size=%u,sector=%llu)\n",
+		    bio->bi_vcnt, bio->bi_idx, bio->bi_size,
+		    (unsigned long long)bio->bi_sector);
+		bio_endio(bio, -EINVAL);
+	} else {
+		/* This bio crosses some boundary, so we have to split it. */
+		struct bio_pair *bp;
+		/* works for the "do not cross hash slot boundaries" case
+		 * e.g. sector 262269, size 4096
+		 * s_enr = 262269 >> 6 = 4097
+		 * e_enr = (262269+8-1) >> 6 = 4098
+		 * HT_SHIFT = 6
+		 * sps = 64, mask = 63
+		 * first_sectors = 64 - (262269 & 63) = 3
+		 */
+		const sector_t sect = bio->bi_sector;
+		const int sps = 1 << HT_SHIFT; /* sectors per slot */
+		const int mask = sps - 1;
+		const sector_t first_sectors = sps - (sect & mask);
+		bp = bio_split(bio,
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,28)
+				bio_split_pool,
+#endif
+				first_sectors);
+
+		/* we need to get a "reference count" (ap_bio_cnt)
+		 * to avoid races with the disconnect/reconnect/suspend code.
+		 * In case we need to split the bio here, we need to get two references
+		 * atomically, otherwise we might deadlock when trying to submit the
+		 * second one! */
+		inc_ap_bio(mdev, 2);
+
+		D_ASSERT(e_enr == s_enr + 1);
+
+		drbd_make_request_common(mdev, &bp->bio1);
+		drbd_make_request_common(mdev, &bp->bio2);
+		bio_pair_release(bp);
+	}
+	return 0;
+}
+
+/* This is called by bio_add_page().  With this function we reduce
+ * the number of BIOs that span multiple DRBD_MAX_SEGMENT_SIZE
+ * units (was AL_EXTENTs).
+ *
+ * we do the calculation within the lower 32bit of the byte offsets,
+ * since we don't care about the actual offset, but only check whether it
+ * would cross "activity log extent" boundaries.
+ *
+ * As long as the BIO is empty we have to allow at least one bvec,
+ * regardless of size and offset.  so the resulting bio may still
+ * cross extent boundaries.  those are dealt with (bio_split) in
+ * drbd_make_request_26.
+ */
+int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec)
+{
+	struct drbd_conf *mdev = (struct drbd_conf *) q->queuedata;
+	unsigned int bio_offset =
+		(unsigned int)bvm->bi_sector << 9; /* 32 bit */
+	unsigned int bio_size = bvm->bi_size;
+	int limit, backing_limit;
+
+	limit = DRBD_MAX_SEGMENT_SIZE
+	      - ((bio_offset & (DRBD_MAX_SEGMENT_SIZE-1)) + bio_size);
+	if (limit < 0)
+		limit = 0;
+	if (bio_size == 0) {
+		if (limit <= bvec->bv_len)
+			limit = bvec->bv_len;
+	} else if (limit && inc_local(mdev)) {
+		struct request_queue * const b =
+			mdev->bc->backing_bdev->bd_disk->queue;
+		if (b->merge_bvec_fn && mdev->bc->dc.use_bmbv) {
+			backing_limit = b->merge_bvec_fn(b, bvm, bvec);
+			limit = min(limit, backing_limit);
+		}
+		dec_local(mdev);
+	}
+	return limit;
+}

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 06/16] DRBD: userspace_interface
  2009-04-30 11:26         ` [PATCH 05/16] DRBD: request Philipp Reisner
@ 2009-04-30 11:26           ` Philipp Reisner
  2009-04-30 11:26             ` [PATCH 07/16] DRBD: internal_data_structures Philipp Reisner
  0 siblings, 1 reply; 90+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg, Philipp Reisner

DRBD uses netlink via connector. The packets are composed of extensible tag
lists. That interface can be extended over time without breaking old
userspace programs.
The nice part of the interface to userspace is drbd.h; the ugly part is, for
sure, drbd_tag_magic.h. I realize that macros are generally frowned upon, but
this way it is easier to maintain: the code that gets generated by repeatedly
including drbd_nl.h would be hard to maintain over time if it were open
coded. (BTW, did you know that the Samba 4 people are proud to have more
than 50% of their code auto-generated? :)
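
For readers who have not seen the trick before, here is a minimal,
self-contained sketch of the X-macro idea that drbd_tag_magic.h applies to
drbd_nl.h: a single definition list is expanded several times with different
macro definitions to generate matching enums and tables. The names below
(COLOR_LIST, AS_ENUM, AS_NAME, color_names) are invented for this
illustration only and are not DRBD identifiers.

  #include <stdio.h>

  /* the single definition list, analogous to drbd_nl.h */
  #define COLOR_LIST(X) \
          X(RED)        \
          X(GREEN)      \
          X(BLUE)

  /* first expansion: an enum, analogous to enum packet_types */
  #define AS_ENUM(name) COLOR_ ## name,
  enum color { COLOR_LIST(AS_ENUM) COLOR_MAX };
  #undef AS_ENUM

  /* second expansion: a matching string table, analogous to tag_list_sizes */
  #define AS_NAME(name) #name,
  static const char *color_names[] = { COLOR_LIST(AS_NAME) };
  #undef AS_NAME

  int main(void)
  {
          printf("%s\n", color_names[COLOR_GREEN]); /* prints "GREEN" */
          return 0;
  }

drbd_tag_magic.h does the same thing with a real header instead of a list
macro: it #includes drbd_nl.h repeatedly, each time after redefining
NL_PACKET(), NL_INTEGER(), NL_INT64(), NL_BIT() and NL_STRING().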

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

---
diff --git a/include/linux/drbd.h b/include/linux/drbd.h
new file mode 100644
index 0000000..2500021
--- /dev/null
+++ b/include/linux/drbd.h
@@ -0,0 +1,343 @@
+/*
+  drbd.h
+  Kernel module for 2.6.x Kernels
+
+  This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+  Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.
+  Copyright (C) 2001-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+  Copyright (C) 2001-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+
+  drbd is free software; you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation; either version 2, or (at your option)
+  any later version.
+
+  drbd is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with drbd; see the file COPYING.  If not, write to
+  the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+
+*/
+#ifndef DRBD_H
+#define DRBD_H
+#include <linux/connector.h>
+
+#include <asm/types.h>
+
+#ifdef __KERNEL__
+#include <linux/types.h>
+#include <asm/byteorder.h>
+#else
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <limits.h>
+
+/* Although the Linux source code distinguishes between
+   generic endianness and the bitfields' endianness, there is no
+   architecture as of Linux-2.6.24-rc4 where the bitfields' endianness
+   does not match the generic endianness. */
+
+#if __BYTE_ORDER == __LITTLE_ENDIAN
+#define __LITTLE_ENDIAN_BITFIELD
+#elif __BYTE_ORDER == __BIG_ENDIAN
+#define __BIG_ENDIAN_BITFIELD
+#else
+# error "sorry, weird endianness on this box"
+#endif
+
+#endif
+
+
+enum drbd_io_error_p {
+	EP_PASS_ON, /* FIXME should this better be named "Ignore"? */
+	EP_CALL_HELPER,
+	EP_DETACH
+};
+
+enum drbd_fencing_p {
+	FP_DONT_CARE,
+	FP_RESOURCE,
+	FP_STONITH
+};
+
+enum drbd_disconnect_p {
+	DP_RECONNECT,
+	DP_DROP_NET_CONF,
+	DP_FREEZE_IO
+};
+
+enum drbd_after_sb_p {
+	ASB_DISCONNECT,
+	ASB_DISCARD_YOUNGER_PRI,
+	ASB_DISCARD_OLDER_PRI,
+	ASB_DISCARD_ZERO_CHG,
+	ASB_DISCARD_LEAST_CHG,
+	ASB_DISCARD_LOCAL,
+	ASB_DISCARD_REMOTE,
+	ASB_CONSENSUS,
+	ASB_DISCARD_SECONDARY,
+	ASB_CALL_HELPER,
+	ASB_VIOLENTLY
+};
+
+/* KEEP the order, do not delete or insert. Only append. */
+enum drbd_ret_codes {
+	ERR_CODE_BASE		= 100,
+	NO_ERROR		= 101,
+	ERR_LOCAL_ADDR		= 102,
+	ERR_PEER_ADDR		= 103,
+	ERR_OPEN_DISK		= 104,
+	ERR_OPEN_MD_DISK	= 105,
+	ERR_DISK_NOT_BDEV	= 107,
+	ERR_MD_NOT_BDEV		= 108,
+	ERR_DISK_TO_SMALL	= 111,
+	ERR_MD_DISK_TO_SMALL	= 112,
+	ERR_BDCLAIM_DISK	= 114,
+	ERR_BDCLAIM_MD_DISK	= 115,
+	ERR_MD_IDX_INVALID	= 116,
+	ERR_IO_MD_DISK		= 118,
+	ERR_MD_INVALID          = 119,
+	ERR_AUTH_ALG		= 120,
+	ERR_AUTH_ALG_ND		= 121,
+	ERR_NOMEM		= 122,
+	ERR_DISCARD		= 123,
+	ERR_DISK_CONFIGURED	= 124,
+	ERR_NET_CONFIGURED	= 125,
+	ERR_MANDATORY_TAG	= 126,
+	ERR_MINOR_INVALID	= 127,
+	ERR_INTR		= 129, /* EINTR */
+	ERR_RESIZE_RESYNC	= 130,
+	ERR_NO_PRIMARY		= 131,
+	ERR_SYNC_AFTER		= 132,
+	ERR_SYNC_AFTER_CYCLE	= 133,
+	ERR_PAUSE_IS_SET	= 134,
+	ERR_PAUSE_IS_CLEAR	= 135,
+	ERR_PACKET_NR		= 137,
+	ERR_NO_DISK		= 138,
+	ERR_NOT_PROTO_C		= 139,
+	ERR_NOMEM_BITMAP	= 140,
+	ERR_INTEGRITY_ALG	= 141, /* DRBD 8.2 only */
+	ERR_INTEGRITY_ALG_ND	= 142, /* DRBD 8.2 only */
+	ERR_CPU_MASK_PARSE	= 143, /* DRBD 8.2 only */
+	ERR_CSUMS_ALG		= 144, /* DRBD 8.2 only */
+	ERR_CSUMS_ALG_ND	= 145, /* DRBD 8.2 only */
+	ERR_VERIFY_ALG		= 146, /* DRBD 8.2 only */
+	ERR_VERIFY_ALG_ND	= 147, /* DRBD 8.2 only */
+	ERR_CSUMS_RESYNC_RUNNING= 148, /* DRBD 8.2 only */
+	ERR_VERIFY_RUNNING	= 149, /* DRBD 8.2 only */
+	ERR_DATA_NOT_CURRENT	= 150,
+	ERR_CONNECTED		= 151, /* DRBD 8.3 only */
+
+	/* insert new ones above this line */
+	AFTER_LAST_ERR_CODE
+};
+
+#define DRBD_PROT_A   1
+#define DRBD_PROT_B   2
+#define DRBD_PROT_C   3
+
+enum drbd_role {
+	R_UNKNOWN = 0,
+	R_PRIMARY = 1,     /* role */
+	R_SECONDARY = 2,   /* role */
+	R_MASK = 3,
+};
+
+/* The order of these constants is important.
+ * The lower ones (<C_WF_REPORT_PARAMS) indicate
+ * that there is no socket!
+ * >=C_WF_REPORT_PARAMS ==> There is a socket
+ */
+enum drbd_conns {
+	C_STANDALONE,
+	C_DISCONNECTING,  /* Temporary state on the way to StandAlone. */
+	C_UNCONNECTED,    /* >= C_UNCONNECTED -> inc_net() succeeds */
+
+	/* These temporary states are all used on the way
+	 * from >= C_CONNECTED to Unconnected.
+	 * They are the 'disconnect reason' states;
+	 * I do not allow changing between them. */
+	C_TIMEOUT,
+	C_BROKEN_PIPE,
+	C_NETWORK_FAILURE,
+	C_PROTOCOL_ERROR,
+	C_TEAR_DOWN,
+
+	C_WF_CONNECTION,
+	C_WF_REPORT_PARAMS, /* we have a socket */
+	C_CONNECTED,      /* we have introduced each other */
+	C_STARTING_SYNC_S,  /* starting full sync by IOCTL. */
+	C_STARTING_SYNC_T,  /* starting full sync by IOCTL. */
+	C_WF_BITMAP_S,
+	C_WF_BITMAP_T,
+	C_WF_SYNC_UUID,
+
+	/* All SyncStates are tested with this comparison
+	 * xx >= C_SYNC_SOURCE && xx <= C_PAUSED_SYNC_T */
+	C_SYNC_SOURCE,
+	C_SYNC_TARGET,
+	C_VERIFY_S,
+	C_VERIFY_T,
+	C_PAUSED_SYNC_S,
+	C_PAUSED_SYNC_T,
+	C_MASK = 31
+};
+
+enum drbd_disk_state {
+	D_DISKLESS,
+	D_ATTACHING,      /* In the process of reading the meta-data */
+	D_FAILED,         /* Becomes D_DISKLESS as soon as we have told the peer */
+			/* when >= D_FAILED it is legal to access mdev->bc */
+	D_NEGOTIATING,    /* Late attaching state, we need to talk to the peer */
+	D_INCONSISTENT,
+	D_OUTDATED,
+	D_UNKNOWN,       /* Only used for the peer, never for myself */
+	D_CONSISTENT,     /* Might be D_OUTDATED, might be D_UP_TO_DATE ... */
+	D_UP_TO_DATE,       /* Only this disk state allows applications' IO ! */
+	D_MASK = 15
+};
+
+union drbd_state {
+/* According to gcc's docs, the order of allocation of bit-fields within a
+ * unit (C90 6.5.2.1, C99 6.7.2.1) is determined by the ABI
+ * (pointed out by Maxim Uvarov <muvarov@ru.mvista.com>).
+ * Even though we transmit as "cpu_to_be32(state)",
+ * the offsets of the bitfields still need to be swapped
+ * on different endianness.
+ */
+	struct {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+		unsigned role:2 ;   /* 3/4	 primary/secondary/unknown */
+		unsigned peer:2 ;   /* 3/4	 primary/secondary/unknown */
+		unsigned conn:5 ;   /* 17/32	 cstates */
+		unsigned disk:4 ;   /* 8/16	 from D_DISKLESS to D_UP_TO_DATE */
+		unsigned pdsk:4 ;   /* 8/16	 from D_DISKLESS to D_UP_TO_DATE */
+		unsigned susp:1 ;   /* 2/2	 IO suspended  no/yes */
+		unsigned aftr_isp:1 ; /* isp .. imposed sync pause */
+		unsigned peer_isp:1 ;
+		unsigned user_isp:1 ;
+		unsigned _pad:11;   /* 0	 unused */
+#elif defined(__BIG_ENDIAN_BITFIELD)
+		unsigned _pad:11;   /* 0	 unused */
+		unsigned user_isp:1 ;
+		unsigned peer_isp:1 ;
+		unsigned aftr_isp:1 ; /* isp .. imposed sync pause */
+		unsigned susp:1 ;   /* 2/2	 IO suspended  no/yes */
+		unsigned pdsk:4 ;   /* 8/16	 from D_DISKLESS to D_UP_TO_DATE */
+		unsigned disk:4 ;   /* 8/16	 from D_DISKLESS to D_UP_TO_DATE */
+		unsigned conn:5 ;   /* 17/32	 cstates */
+		unsigned peer:2 ;   /* 3/4	 primary/secondary/unknown */
+		unsigned role:2 ;   /* 3/4	 primary/secondary/unknown */
+#else
+# error "this endianess is not supported"
+#endif
+	};
+	unsigned int i;
+};
+
+enum drbd_state_ret_codes {
+	SS_CW_NO_NEED = 4,
+	SS_CW_SUCCESS = 3,
+	SS_NOTHING_TO_DO = 2,
+	SS_SUCCESS = 1,
+	SS_UNKNOWN_ERROR = 0, /* Used to sleep longer in _drbd_request_state */
+	SS_TWO_PRIMARIES = -1,
+	SS_NO_UP_TO_DATE_DISK = -2,
+	SS_BOTH_INCONSISTENT = -4,
+	SS_SYNCING_DISKLESS = -5,
+	SS_CONNECTED_OUTDATES = -6,
+	SS_PRIMARY_NOP = -7,
+	SS_RESYNC_RUNNING = -8,
+	SS_ALREADY_STANDALONE = -9,
+	SS_CW_FAILED_BY_PEER = -10,
+	SS_IS_DISKLESS = -11,
+	SS_DEVICE_IN_USE = -12,
+	SS_NO_NET_CONFIG = -13,
+	SS_NO_VERIFY_ALG = -14,       /* drbd-8.2 only */
+	SS_NEED_CONNECTION = -15,    /* drbd-8.2 only */
+	SS_LOWER_THAN_OUTDATED = -16,
+	SS_NOT_SUPPORTED = -17,      /* drbd-8.2 only */
+	SS_IN_TRANSIENT_STATE = -18,  /* Retry after the next state change */
+	SS_CONCURRENT_ST_CHG = -19,   /* Concurrent cluster side state change! */
+	SS_AFTER_LAST_ERROR = -20,    /* Keep this at bottom */
+};
+
+/* from drbd_strings.c */
+extern const char *conns_to_name(enum drbd_conns);
+extern const char *roles_to_name(enum drbd_role);
+extern const char *disks_to_name(enum drbd_disk_state);
+extern const char *set_st_err_name(enum drbd_state_ret_codes);
+
+#define SHARED_SECRET_MAX 64
+
+#define MDF_CONSISTENT		(1 << 0)
+#define MDF_PRIMARY_IND		(1 << 1)
+#define MDF_CONNECTED_IND	(1 << 2)
+#define MDF_FULL_SYNC		(1 << 3)
+#define MDF_WAS_UP_TO_DATE	(1 << 4)
+#define MDF_PEER_OUT_DATED	(1 << 5)
+#define MDF_CRASHED_PRIMARY     (1 << 6)
+
+enum drbd_uuid_index {
+	UI_CURRENT,
+	UI_BITMAP,
+	UI_HISTORY_START,
+	UI_HISTORY_END,
+	UI_SIZE,      /* nl-packet: number of dirty bits */
+	UI_FLAGS,     /* nl-packet: flags */
+	UI_EXTENDED_SIZE   /* Everything. */
+};
+
+enum drbd_timeout_flag {
+	UT_DEFAULT      = 0,
+	UT_DEGRADED     = 1,
+	UT_PEER_OUTDATED = 2,
+};
+
+#define UUID_JUST_CREATED ((__u64)4)
+
+#define DRBD_MAGIC 0x83740267
+#define BE_DRBD_MAGIC __constant_cpu_to_be32(DRBD_MAGIC)
+
+/* these are of type "int" */
+#define DRBD_MD_INDEX_INTERNAL -1
+#define DRBD_MD_INDEX_FLEX_EXT -2
+#define DRBD_MD_INDEX_FLEX_INT -3
+
+/* Start of the new netlink/connector stuff */
+
+#define DRBD_NL_CREATE_DEVICE 0x01
+#define DRBD_NL_SET_DEFAULTS  0x02
+
+/* The following line should be moved over to linux/connector.h
+ * when the time comes */
+#ifndef CN_IDX_DRBD
+# define CN_IDX_DRBD			0x4
+/* Ubuntu "intrepid ibex" release defined CN_IDX_DRBD as 0x6 */
+#endif
+#define CN_VAL_DRBD			0x1
+
+/* For searching a vacant cn_idx value */
+#define CN_IDX_STEP			6977
+
+struct drbd_nl_cfg_req {
+	int packet_type;
+	unsigned int drbd_minor;
+	int flags;
+	unsigned short tag_list[];
+};
+
+struct drbd_nl_cfg_reply {
+	int packet_type;
+	unsigned int minor;
+	int ret_code; /* enum ret_code or set_st_err_t */
+	unsigned short tag_list[]; /* only used with get_* calls */
+};
+
+#endif
diff --git a/include/linux/drbd_config.h b/include/linux/drbd_config.h
new file mode 100644
index 0000000..06a750e
--- /dev/null
+++ b/include/linux/drbd_config.h
@@ -0,0 +1,37 @@
+/*
+  drbd_config.h
+  DRBD's compile time configuration.
+
+  drbd is free software; you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation; either version 2, or (at your option)
+  any later version.
+
+  drbd is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with drbd; see the file COPYING.  If not, write to
+  the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+*/
+
+#ifndef DRBD_CONFIG_H
+#define DRBD_CONFIG_H
+
+extern const char *drbd_buildtag(void);
+
+#define REL_VERSION "8.3.1"
+#define API_VERSION 88
+#define PRO_VERSION_MIN 86
+#define PRO_VERSION_MAX 90
+
+#ifndef __CHECKER__   /* for a sparse run, we need all STATICs */
+#define DBG_ALL_SYMBOLS /* no static functs, improves quality of OOPS traces */
+#endif
+
+/* Enable fault insertion code */
+#define DRBD_ENABLE_FAULTS
+
+#endif
diff --git a/include/linux/drbd_limits.h b/include/linux/drbd_limits.h
new file mode 100644
index 0000000..2fafc2b
--- /dev/null
+++ b/include/linux/drbd_limits.h
@@ -0,0 +1,133 @@
+/*
+  drbd_limits.h
+  This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+*/
+
+/*
+ * Our current limitations.
+ * Some of them are hard limits,
+ * some of them are arbitrary range limits that make it easier to provide
+ * feedback about nonsense settings for certain configurable values.
+ */
+
+#ifndef DRBD_LIMITS_H
+#define DRBD_LIMITS_H 1
+
+#define DEBUG_RANGE_CHECK 0
+
+#define DRBD_MINOR_COUNT_MIN 1
+#define DRBD_MINOR_COUNT_MAX 255
+
+#define DRBD_DIALOG_REFRESH_MIN 0
+#define DRBD_DIALOG_REFRESH_MAX 600
+
+/* valid port number */
+#define DRBD_PORT_MIN 1
+#define DRBD_PORT_MAX 0xffff
+
+/* startup { */
+  /* if you want more than 3.4 days, disable */
+#define DRBD_WFC_TIMEOUT_MIN 0
+#define DRBD_WFC_TIMEOUT_MAX 300000
+#define DRBD_WFC_TIMEOUT_DEF 0
+
+#define DRBD_DEGR_WFC_TIMEOUT_MIN 0
+#define DRBD_DEGR_WFC_TIMEOUT_MAX 300000
+#define DRBD_DEGR_WFC_TIMEOUT_DEF 0
+
+#define DRBD_OUTDATED_WFC_TIMEOUT_MIN 0
+#define DRBD_OUTDATED_WFC_TIMEOUT_MAX 300000
+#define DRBD_OUTDATED_WFC_TIMEOUT_DEF 0
+/* }*/
+
+/* net { */
+  /* timeout, unit: centiseconds
+   * a timeout of more than one minute is not useful */
+#define DRBD_TIMEOUT_MIN 1
+#define DRBD_TIMEOUT_MAX 600
+#define DRBD_TIMEOUT_DEF 60       /* 6 seconds */
+
+  /* active connection retries when C_WF_CONNECTION */
+#define DRBD_CONNECT_INT_MIN 1
+#define DRBD_CONNECT_INT_MAX 120
+#define DRBD_CONNECT_INT_DEF 10   /* seconds */
+
+  /* keep-alive probes when idle */
+#define DRBD_PING_INT_MIN 1
+#define DRBD_PING_INT_MAX 120
+#define DRBD_PING_INT_DEF 10
+
+ /* timeout for the ping packets.*/
+#define DRBD_PING_TIMEO_MIN  1
+#define DRBD_PING_TIMEO_MAX  100
+#define DRBD_PING_TIMEO_DEF  5
+
+  /* max number of write requests between write barriers */
+#define DRBD_MAX_EPOCH_SIZE_MIN 1
+#define DRBD_MAX_EPOCH_SIZE_MAX 20000
+#define DRBD_MAX_EPOCH_SIZE_DEF 2048
+
+  /* I don't think that a TCP send buffer of more than 10M is useful */
+#define DRBD_SNDBUF_SIZE_MIN  0
+#define DRBD_SNDBUF_SIZE_MAX  (10<<20)
+#define DRBD_SNDBUF_SIZE_DEF  (2*65535)
+
+  /* @4k PageSize -> 128kB - 512MB */
+#define DRBD_MAX_BUFFERS_MIN  32
+#define DRBD_MAX_BUFFERS_MAX  131072
+#define DRBD_MAX_BUFFERS_DEF  2048
+
+  /* @4k PageSize -> 4kB - 512MB */
+#define DRBD_UNPLUG_WATERMARK_MIN  1
+#define DRBD_UNPLUG_WATERMARK_MAX  131072
+#define DRBD_UNPLUG_WATERMARK_DEF (DRBD_MAX_BUFFERS_DEF/16)
+
+  /* 0 is disabled.
+   * 200 should be more than enough even for very short timeouts */
+#define DRBD_KO_COUNT_MIN  0
+#define DRBD_KO_COUNT_MAX  200
+#define DRBD_KO_COUNT_DEF  0
+/* } */
+
+/* syncer { */
+  /* FIXME allow rate to be zero? */
+#define DRBD_RATE_MIN 1
+/* channel bonding 10 GbE, or other hardware */
+#define DRBD_RATE_MAX (4 << 20)
+#define DRBD_RATE_DEF 250  /* kb/second */
+
+  /* less than 7 would hit performance unnecessarily.
+   * 3833 is the largest prime that still fits
+   * into 64 sectors of activity log */
+#define DRBD_AL_EXTENTS_MIN  7
+#define DRBD_AL_EXTENTS_MAX  3833
+#define DRBD_AL_EXTENTS_DEF  127
+
+#define DRBD_AFTER_MIN  -1
+#define DRBD_AFTER_MAX  255
+#define DRBD_AFTER_DEF  -1
+
+/* } */
+
+/* drbdsetup XY resize -d Z
+ * you are free to reduce the device size to nothing, if you want to.
+ * the upper limit with 64bit kernel, enough ram and flexible meta data
+ * is 16 TB, currently. */
+/* DRBD_MAX_SECTORS */
+#define DRBD_DISK_SIZE_SECT_MIN  0
+#define DRBD_DISK_SIZE_SECT_MAX  (16 * (2LLU << 30))
+#define DRBD_DISK_SIZE_SECT_DEF  0 /* = disabled = no user size... */
+
+#define DRBD_ON_IO_ERROR_DEF EP_PASS_ON
+#define DRBD_FENCING_DEF FP_DONT_CARE
+#define DRBD_AFTER_SB_0P_DEF ASB_DISCONNECT
+#define DRBD_AFTER_SB_1P_DEF ASB_DISCONNECT
+#define DRBD_AFTER_SB_2P_DEF ASB_DISCONNECT
+#define DRBD_RR_CONFLICT_DEF ASB_DISCONNECT
+
+#define DRBD_MAX_BIO_BVECS_MIN 0
+#define DRBD_MAX_BIO_BVECS_MAX 128
+#define DRBD_MAX_BIO_BVECS_DEF 0
+
+#undef RANGE
+#endif
diff --git a/include/linux/drbd_nl.h b/include/linux/drbd_nl.h
new file mode 100644
index 0000000..cc99f3e
--- /dev/null
+++ b/include/linux/drbd_nl.h
@@ -0,0 +1,135 @@
+/*
+   PACKET( name,
+	  TYPE ( pn, pr, member )
+	  ...
+   )
+
+   You may never reissue one of the pn arguments
+*/
+
+#if !defined(NL_PACKET) || !defined(NL_STRING) || !defined(NL_INTEGER) || !defined(NL_BIT) || !defined(NL_INT64)
+#error "The macros NL_PACKET, NL_STRING, NL_INTEGER, NL_INT64 and NL_BIT needs to be defined"
+#endif
+
+NL_PACKET(primary, 1,
+       NL_BIT(		1,	T_MAY_IGNORE,	overwrite_peer)
+)
+
+NL_PACKET(secondary, 2, )
+
+NL_PACKET(disk_conf, 3,
+	NL_INT64(	2,	T_MAY_IGNORE,	disk_size)
+	NL_STRING(	3,	T_MANDATORY,	backing_dev,	128)
+	NL_STRING(	4,	T_MANDATORY,	meta_dev,	128)
+	NL_INTEGER(	5,	T_MANDATORY,	meta_dev_idx)
+	NL_INTEGER(	6,	T_MAY_IGNORE,	on_io_error)
+	NL_INTEGER(	7,	T_MAY_IGNORE,	fencing)
+	NL_BIT(		37,	T_MAY_IGNORE,	use_bmbv)
+	NL_BIT(		53,	T_MAY_IGNORE,	no_disk_flush)
+	NL_BIT(		54,	T_MAY_IGNORE,	no_md_flush)
+	  /*  55 max_bio_size was available in 8.2.6rc2 */
+	NL_INTEGER(	56,	T_MAY_IGNORE,	max_bio_bvecs)
+	NL_BIT(		57,	T_MAY_IGNORE,	no_disk_barrier)
+	NL_BIT(		58,	T_MAY_IGNORE,	no_disk_drain)
+)
+
+NL_PACKET(detach, 4, )
+
+NL_PACKET(net_conf, 5,
+	NL_STRING(	8,	T_MANDATORY,	my_addr,	128)
+	NL_STRING(	9,	T_MANDATORY,	peer_addr,	128)
+	NL_STRING(	10,	T_MAY_IGNORE,	shared_secret,	SHARED_SECRET_MAX)
+	NL_STRING(	11,	T_MAY_IGNORE,	cram_hmac_alg,	SHARED_SECRET_MAX)
+	NL_STRING(	44,	T_MAY_IGNORE,	integrity_alg,	SHARED_SECRET_MAX)
+	NL_INTEGER(	14,	T_MAY_IGNORE,	timeout)
+	NL_INTEGER(	15,	T_MANDATORY,	wire_protocol)
+	NL_INTEGER(	16,	T_MAY_IGNORE,	try_connect_int)
+	NL_INTEGER(	17,	T_MAY_IGNORE,	ping_int)
+	NL_INTEGER(	18,	T_MAY_IGNORE,	max_epoch_size)
+	NL_INTEGER(	19,	T_MAY_IGNORE,	max_buffers)
+	NL_INTEGER(	20,	T_MAY_IGNORE,	unplug_watermark)
+	NL_INTEGER(	21,	T_MAY_IGNORE,	sndbuf_size)
+	NL_INTEGER(	22,	T_MAY_IGNORE,	ko_count)
+	NL_INTEGER(	24,	T_MAY_IGNORE,	after_sb_0p)
+	NL_INTEGER(	25,	T_MAY_IGNORE,	after_sb_1p)
+	NL_INTEGER(	26,	T_MAY_IGNORE,	after_sb_2p)
+	NL_INTEGER(	39,	T_MAY_IGNORE,	rr_conflict)
+	NL_INTEGER(	40,	T_MAY_IGNORE,	ping_timeo)
+	  /* 59 addr_family was available in GIT, never released */
+	NL_BIT(		60,	T_MANDATORY,	mind_af)
+	NL_BIT(		27,	T_MAY_IGNORE,	want_lose)
+	NL_BIT(		28,	T_MAY_IGNORE,	two_primaries)
+	NL_BIT(		41,	T_MAY_IGNORE,	always_asbp)
+	NL_BIT(		61,	T_MAY_IGNORE,	no_cork)
+	NL_BIT(		62,	T_MANDATORY,	auto_sndbuf_size)
+)
+
+NL_PACKET(disconnect, 6, )
+
+NL_PACKET(resize, 7,
+	NL_INT64(		29,	T_MAY_IGNORE,	resize_size)
+)
+
+NL_PACKET(syncer_conf, 8,
+	NL_INTEGER(	30,	T_MAY_IGNORE,	rate)
+	NL_INTEGER(	31,	T_MAY_IGNORE,	after)
+	NL_INTEGER(	32,	T_MAY_IGNORE,	al_extents)
+	NL_STRING(      52,     T_MAY_IGNORE,   verify_alg,     SHARED_SECRET_MAX)
+	NL_STRING(      51,     T_MAY_IGNORE,   cpu_mask,       32)
+	NL_STRING(	64,	T_MAY_IGNORE,	csums_alg,	SHARED_SECRET_MAX)
+	NL_BIT(         65,     T_MAY_IGNORE,   use_rle_encoding)
+)
+
+NL_PACKET(invalidate, 9, )
+NL_PACKET(invalidate_peer, 10, )
+NL_PACKET(pause_sync, 11, )
+NL_PACKET(resume_sync, 12, )
+NL_PACKET(suspend_io, 13, )
+NL_PACKET(resume_io, 14, )
+NL_PACKET(outdate, 15, )
+NL_PACKET(get_config, 16, )
+NL_PACKET(get_state, 17,
+	NL_INTEGER(	33,	T_MAY_IGNORE,	state_i)
+)
+
+NL_PACKET(get_uuids, 18,
+	NL_STRING(	34,	T_MAY_IGNORE,	uuids,	(UI_SIZE*sizeof(__u64)))
+	NL_INTEGER(	35,	T_MAY_IGNORE,	uuids_flags)
+)
+
+NL_PACKET(get_timeout_flag, 19,
+	NL_BIT(		36,	T_MAY_IGNORE,	use_degraded)
+)
+
+NL_PACKET(call_helper, 20,
+	NL_STRING(	38,	T_MAY_IGNORE,	helper,		32)
+)
+
+/* Tag nr 42 already allocated in drbd-8.1 development. */
+
+NL_PACKET(sync_progress, 23,
+	NL_INTEGER(	43,	T_MAY_IGNORE,	sync_progress)
+)
+
+NL_PACKET(dump_ee, 24,
+	NL_STRING(	45,	T_MAY_IGNORE,	dump_ee_reason, 32)
+	NL_STRING(	46,	T_MAY_IGNORE,	seen_digest, SHARED_SECRET_MAX)
+	NL_STRING(	47,	T_MAY_IGNORE,	calc_digest, SHARED_SECRET_MAX)
+	NL_INT64(	48,	T_MAY_IGNORE,	ee_sector)
+	NL_INT64(	49,	T_MAY_IGNORE,	ee_block_id)
+	NL_STRING(	50,	T_MAY_IGNORE,	ee_data,	32 << 10)
+)
+
+NL_PACKET(start_ov, 25,
+)
+
+NL_PACKET(new_c_uuid, 26,
+       NL_BIT(		63,	T_MANDATORY,	clear_bm)
+)
+
+#undef NL_PACKET
+#undef NL_INTEGER
+#undef NL_INT64
+#undef NL_BIT
+#undef NL_STRING
+
diff --git a/include/linux/drbd_tag_magic.h b/include/linux/drbd_tag_magic.h
new file mode 100644
index 0000000..fcdff84
--- /dev/null
+++ b/include/linux/drbd_tag_magic.h
@@ -0,0 +1,83 @@
+#ifndef DRBD_TAG_MAGIC_H
+#define DRBD_TAG_MAGIC_H
+
+#define TT_END     0
+#define TT_REMOVED 0xE000
+
+/* declare packet_type enums */
+enum packet_types {
+#define NL_PACKET(name, number, fields) P_ ## name = number,
+#define NL_INTEGER(pn, pr, member)
+#define NL_INT64(pn, pr, member)
+#define NL_BIT(pn, pr, member)
+#define NL_STRING(pn, pr, member, len)
+#include "drbd_nl.h"
+	P_nl_after_last_packet,
+};
+
+/* These structs are used to deduce the size of the tag lists: */
+#define NL_PACKET(name, number, fields)	\
+	struct name ## _tag_len_struct { fields };
+#define NL_INTEGER(pn, pr, member)		\
+	int member; int tag_and_len ## member;
+#define NL_INT64(pn, pr, member)		\
+	__u64 member; int tag_and_len ## member;
+#define NL_BIT(pn, pr, member)		\
+	unsigned char member:1; int tag_and_len ## member;
+#define NL_STRING(pn, pr, member, len)	\
+	unsigned char member[len]; int member ## _len; \
+	int tag_and_len ## member;
+#include "linux/drbd_nl.h"
+
+/* declare tag-list sizes */
+static const int tag_list_sizes[] = {
+#define NL_PACKET(name, number, fields) 2 fields ,
+#define NL_INTEGER(pn, pr, member)      + 4 + 4
+#define NL_INT64(pn, pr, member)        + 4 + 8
+#define NL_BIT(pn, pr, member)          + 4 + 1
+#define NL_STRING(pn, pr, member, len)  + 4 + (len)
+#include "drbd_nl.h"
+};
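+/* For example, the entry for "primary" evaluates to 2 + 4 + 1 = 7:
+ * four bytes of tag and length plus one byte of payload for the
+ * overwrite_peer bit, the leading per-packet 2 presumably covering the
+ * terminating TT_END short. */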
+
+/* The two highest bits are used for the tag type */
+#define TT_MASK      0xC000
+#define TT_INTEGER   0x0000
+#define TT_INT64     0x4000
+#define TT_BIT       0x8000
+#define TT_STRING    0xC000
+/* The next bit indicates if processing of the tag is mandatory */
+#define T_MANDATORY  0x2000
+#define T_MAY_IGNORE 0x0000
+#define TN_MASK      0x1fff
+/* The remaining 13 bits are used to enumerate the tags */
+
+#define tag_type(T)   ((T) & TT_MASK)
+#define tag_number(T) ((T) & TN_MASK)
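+/* Worked example: the overwrite_peer bit of the "primary" packet is
+ * declared as NL_BIT(1, T_MAY_IGNORE, overwrite_peer), so its tag value
+ * is 1 | TT_BIT | T_MAY_IGNORE == 0x8001, with tag_type() == TT_BIT and
+ * tag_number() == 1. */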
+
+/* declare tag enums */
+#define NL_PACKET(name, number, fields) fields
+enum drbd_tags {
+#define NL_INTEGER(pn, pr, member)     T_ ## member = pn | TT_INTEGER | pr ,
+#define NL_INT64(pn, pr, member)       T_ ## member = pn | TT_INT64   | pr ,
+#define NL_BIT(pn, pr, member)         T_ ## member = pn | TT_BIT     | pr ,
+#define NL_STRING(pn, pr, member, len) T_ ## member = pn | TT_STRING  | pr ,
+#include "drbd_nl.h"
+};
+
+struct tag {
+	const char *name;
+	int type_n_flags;
+	int max_len;
+};
+
+/* declare tag names */
+#define NL_PACKET(name, number, fields) fields
+static const struct tag tag_descriptions[] = {
+#define NL_INTEGER(pn, pr, member)     [ pn ] = { #member, TT_INTEGER | pr, sizeof(int)   },
+#define NL_INT64(pn, pr, member)       [ pn ] = { #member, TT_INT64   | pr, sizeof(__u64) },
+#define NL_BIT(pn, pr, member)         [ pn ] = { #member, TT_BIT     | pr, sizeof(int)   },
+#define NL_STRING(pn, pr, member, len) [ pn ] = { #member, TT_STRING  | pr, (len)         },
+#include "drbd_nl.h"
+};
+
+#endif
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
new file mode 100644
index 0000000..c388478
--- /dev/null
+++ b/drivers/block/drbd/drbd_nl.c
@@ -0,0 +1,2341 @@
+/*
+   drbd_nl.c
+
+   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+   Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.
+   Copyright (C) 1999-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+   Copyright (C) 2002-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+
+   drbd is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   drbd is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with drbd; see the file COPYING.  If not, write to
+   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+
+ */
+
+#include <linux/autoconf.h>
+#include <linux/module.h>
+#include <linux/in.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/connector.h>
+#include <linux/drbd_config.h>
+#include <linux/drbd.h>
+#include <linux/blkpg.h>
+#include <linux/cpumask.h>
+#include "drbd_int.h"
+#include "drbd_tracing.h"
+#include "drbd_wrappers.h"
+#include <linux/drbd_tag_magic.h>
+#include <linux/drbd_limits.h>
+
+/* see get_sb_bdev and bd_claim */
+static char *drbd_m_holder = "Hands off! this is DRBD's meta data device.";
+
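+/* A tag list, as carried in the connector request/reply buffers, is a
+ * flat sequence of (tag, length, payload) records terminated by TT_END.
+ * The generated <packet>_from_tags() and <packet>_to_tags() helpers
+ * below parse and build such lists for every packet declared in
+ * drbd_nl.h. */
+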
+/* Generate the tag_list to struct functions */
+#define NL_PACKET(name, number, fields) \
+STATIC int name ## _from_tags(struct drbd_conf *mdev, \
+	unsigned short *tags, struct name *arg) \
+{ \
+	int tag; \
+	int dlen; \
+	\
+	while ((tag = *tags++) != TT_END) { \
+		dlen = *tags++; \
+		switch (tag_number(tag)) { \
+		fields \
+		default: \
+			if (tag & T_MANDATORY) { \
+				dev_err(DEV, "Unknown tag: %d\n", tag_number(tag)); \
+				return 0; \
+			} \
+		} \
+		tags = (unsigned short *)((char *)tags + dlen); \
+	} \
+	return 1; \
+}
+#define NL_INTEGER(pn, pr, member) \
+	case pn: /* D_ASSERT( tag_type(tag) == TT_INTEGER ); */ \
+		 arg->member = *(int *)(tags); \
+		 break;
+#define NL_INT64(pn, pr, member) \
+	case pn: /* D_ASSERT( tag_type(tag) == TT_INT64 ); */ \
+		 arg->member = *(u64 *)(tags); \
+		 break;
+#define NL_BIT(pn, pr, member) \
+	case pn: /* D_ASSERT( tag_type(tag) == TT_BIT ); */ \
+		 arg->member = *(char *)(tags) ? 1 : 0; \
+		 break;
+#define NL_STRING(pn, pr, member, len) \
+	case pn: /* D_ASSERT( tag_type(tag) == TT_STRING ); */ \
+		if (dlen > len) { \
+			dev_err(DEV, "arg too long: %s (%u wanted, max len: %u bytes)\n", \
+				#member, dlen, (unsigned int)len); \
+			return 0; \
+		} \
+		 arg->member ## _len = dlen; \
+		 memcpy(arg->member, tags, min_t(size_t, dlen, len)); \
+		 break;
+#include "linux/drbd_nl.h"
+
+/* Generate the struct to tag_list functions */
+#define NL_PACKET(name, number, fields) \
+STATIC unsigned short* \
+name ## _to_tags(struct drbd_conf *mdev, \
+	struct name *arg, unsigned short *tags) \
+{ \
+	fields \
+	return tags; \
+}
+
+#define NL_INTEGER(pn, pr, member) \
+	*tags++ = pn | pr | TT_INTEGER; \
+	*tags++ = sizeof(int); \
+	*(int *)tags = arg->member; \
+	tags = (unsigned short *)((char *)tags+sizeof(int));
+#define NL_INT64(pn, pr, member) \
+	*tags++ = pn | pr | TT_INT64; \
+	*tags++ = sizeof(u64); \
+	*(u64 *)tags = arg->member; \
+	tags = (unsigned short *)((char *)tags+sizeof(u64));
+#define NL_BIT(pn, pr, member) \
+	*tags++ = pn | pr | TT_BIT; \
+	*tags++ = sizeof(char); \
+	*(char *)tags = arg->member; \
+	tags = (unsigned short *)((char *)tags+sizeof(char));
+#define NL_STRING(pn, pr, member, len) \
+	*tags++ = pn | pr | TT_STRING; \
+	*tags++ = arg->member ## _len; \
+	memcpy(tags, arg->member, arg->member ## _len); \
+	tags = (unsigned short *)((char *)tags + arg->member ## _len);
+#include "linux/drbd_nl.h"
+
+void drbd_bcast_ev_helper(struct drbd_conf *mdev, char *helper_name);
+void drbd_nl_send_reply(struct cn_msg *, int);
+
+int drbd_khelper(struct drbd_conf *mdev, char *cmd)
+{
+	char mb[12];
+	char *argv[] = {usermode_helper, cmd, mb, NULL };
+	int ret;
+	static char *envp[] = { "HOME=/",
+				"TERM=linux",
+				"PATH=/sbin:/usr/sbin:/bin:/usr/bin",
+				NULL };
+
+	snprintf(mb, 12, "minor-%d", mdev_to_minor(mdev));
+
+	dev_info(DEV, "helper command: %s %s %s\n", usermode_helper, cmd, mb);
+
+	drbd_bcast_ev_helper(mdev, cmd);
+	ret = call_usermodehelper(usermode_helper, argv, envp, 1);
+	if (ret)
+		dev_warn(DEV, "helper command: %s %s %s exit code %u (0x%x)\n",
+				usermode_helper, cmd, mb,
+				(ret >> 8) & 0xff, ret);
+	else
+		dev_info(DEV, "helper command: %s %s %s exit code %u (0x%x)\n",
+				usermode_helper, cmd, mb,
+				(ret >> 8) & 0xff, ret);
+
+	if (ret < 0) /* Ignore any ERRNOs we got. */
+		ret = 0;
+
+	return ret;
+}
+
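+/* drbd_khelper() hands back the wait() status from call_usermodehelper(),
+ * so the fence-peer helper's exit code is recovered below with
+ * (r >> 8) & 0xff. */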
+enum drbd_disk_state drbd_try_outdate_peer(struct drbd_conf *mdev)
+{
+	char *ex_to_string;
+	int r;
+	enum drbd_disk_state nps;
+	enum drbd_fencing_p fp;
+
+	D_ASSERT(mdev->state.pdsk == D_UNKNOWN);
+
+	if (inc_local_if_state(mdev, D_CONSISTENT)) {
+		fp = mdev->bc->dc.fencing;
+		dec_local(mdev);
+	} else {
+		dev_warn(DEV, "Not fencing peer, I'm not even Consistent myself.\n");
+		return mdev->state.pdsk;
+	}
+
+	if (fp == FP_STONITH)
+		_drbd_request_state(mdev, NS(susp, 1), CS_WAIT_COMPLETE);
+
+	r = drbd_khelper(mdev, "fence-peer");
+
+	switch ((r>>8) & 0xff) {
+	case 3: /* peer is inconsistent */
+		ex_to_string = "peer is inconsistent or worse";
+		nps = D_INCONSISTENT;
+		break;
+	case 4:
+		ex_to_string = "peer is outdated";
+		nps = D_OUTDATED;
+		break;
+	case 5: /* peer was down; we have created (or will create) a new UUID anyway... */
+		/* If we would be more strict, we would return D_UNKNOWN here. */
+		ex_to_string = "peer is unreachable, assumed to be dead";
+		nps = D_OUTDATED;
+		break;
+	case 6: /* Peer is primary, voluntarily outdate myself.
+		 * This is useful when an unconnected R_SECONDARY is asked to
+		 * become R_PRIMARY, but finds the other peer active. */
+		ex_to_string = "peer is active";
+		dev_warn(DEV, "Peer is primary, outdating myself.\n");
+		nps = D_UNKNOWN;
+		_drbd_request_state(mdev, NS(disk, D_OUTDATED), CS_WAIT_COMPLETE);
+		break;
+	case 7:
+		if (fp != FP_STONITH)
+			dev_err(DEV, "fence-peer() = 7 && fencing != Stonith !!!\n");
+		ex_to_string = "peer was stonithed";
+		nps = D_OUTDATED;
+		break;
+	default:
+		/* The script is broken ... */
+		nps = D_UNKNOWN;
+		dev_err(DEV, "fence-peer helper broken, returned %d\n", (r>>8)&0xff);
+		return nps;
+	}
+
+	dev_info(DEV, "fence-peer helper returned %d (%s)\n",
+			(r>>8) & 0xff, ex_to_string);
+	return nps;
+}
+
+
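+/* Change our role to new_role.  The cluster-wide state engine may refuse
+ * the transition; in that case we retry up to max_tries times, possibly
+ * after outdating the peer, or (with "force") after promoting an
+ * Inconsistent/Outdated local disk to UpToDate. */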
+int drbd_set_role(struct drbd_conf *mdev, enum drbd_role new_role, int force)
+{
+	const int max_tries = 4;
+	int r = 0;
+	int try = 0;
+	int forced = 0;
+	union drbd_state mask, val;
+	enum drbd_disk_state nps;
+
+	if (new_role == R_PRIMARY)
+		request_ping(mdev); /* Detect a dead peer ASAP */
+
+	mutex_lock(&mdev->state_mutex);
+
+	mask.i = 0; mask.role = R_MASK;
+	val.i  = 0; val.role  = new_role;
+
+	while (try++ < max_tries) {
+		r = _drbd_request_state(mdev, mask, val, CS_WAIT_COMPLETE);
+
+		/* in case we first succeeded to outdate,
+		 * but now suddenly could establish a connection */
+		if (r == SS_CW_FAILED_BY_PEER && mask.pdsk != 0) {
+			val.pdsk = 0;
+			mask.pdsk = 0;
+			continue;
+		}
+
+		if (r == SS_NO_UP_TO_DATE_DISK && force &&
+		    (mdev->state.disk == D_INCONSISTENT ||
+		     mdev->state.disk == D_OUTDATED)) {
+			mask.disk = D_MASK;
+			val.disk  = D_UP_TO_DATE;
+			forced = 1;
+			continue;
+		}
+
+		if (r == SS_NO_UP_TO_DATE_DISK &&
+		    mdev->state.disk == D_CONSISTENT) {
+			D_ASSERT(mdev->state.pdsk == D_UNKNOWN);
+			nps = drbd_try_outdate_peer(mdev);
+
+			if (nps == D_OUTDATED) {
+				val.disk = D_UP_TO_DATE;
+				mask.disk = D_MASK;
+			}
+
+			val.pdsk = nps;
+			mask.pdsk = D_MASK;
+
+			continue;
+		}
+
+		if (r == SS_NOTHING_TO_DO)
+			goto fail;
+		if (r == SS_PRIMARY_NOP) {
+			nps = drbd_try_outdate_peer(mdev);
+
+			if (force && nps > D_OUTDATED) {
+				dev_warn(DEV, "Forced into split brain situation!\n");
+				nps = D_OUTDATED;
+			}
+
+			mask.pdsk = D_MASK;
+			val.pdsk  = nps;
+
+			continue;
+		}
+		if (r == SS_TWO_PRIMARIES) {
+			/* Maybe the peer is detected as dead very soon...
+			   retry at most once more in this case. */
+			__set_current_state(TASK_INTERRUPTIBLE);
+			schedule_timeout((mdev->net_conf->ping_timeo+1)*HZ/10);
+			if (try < max_tries)
+				try = max_tries - 1;
+			continue;
+		}
+		if (r < SS_SUCCESS) {
+			r = _drbd_request_state(mdev, mask, val,
+						CS_VERBOSE + CS_WAIT_COMPLETE);
+			if (r < SS_SUCCESS)
+				goto fail;
+		}
+		break;
+	}
+
+	if (forced)
+		dev_warn(DEV, "Forced to consider local data as UpToDate!\n");
+
+	/* Wait until nothing is on the fly :) */
+	wait_event(mdev->misc_wait, atomic_read(&mdev->ap_pending_cnt) == 0);
+
+	if (new_role == R_SECONDARY) {
+		set_disk_ro(mdev->vdisk, TRUE);
+		if (inc_local(mdev)) {
+			mdev->bc->md.uuid[UI_CURRENT] &= ~(u64)1;
+			dec_local(mdev);
+		}
+	} else {
+		if (inc_net(mdev)) {
+			mdev->net_conf->want_lose = 0;
+			dec_net(mdev);
+		}
+		set_disk_ro(mdev->vdisk, FALSE);
+		if (inc_local(mdev)) {
+			if (((mdev->state.conn < C_CONNECTED ||
+			       mdev->state.pdsk <= D_FAILED)
+			      && mdev->bc->md.uuid[UI_BITMAP] == 0) || forced)
+				drbd_uuid_new_current(mdev);
+
+			mdev->bc->md.uuid[UI_CURRENT] |=  (u64)1;
+			dec_local(mdev);
+		}
+	}
+
+	if ((new_role == R_SECONDARY) && inc_local(mdev)) {
+		drbd_al_to_on_disk_bm(mdev);
+		dec_local(mdev);
+	}
+
+	if (mdev->state.conn >= C_WF_REPORT_PARAMS) {
+		/* if this was forced, we should consider sync */
+		if (forced)
+			drbd_send_uuids(mdev);
+		drbd_send_state(mdev);
+	}
+
+	drbd_md_sync(mdev);
+
+	kobject_uevent(&disk_to_dev(mdev->vdisk)->kobj, KOBJ_CHANGE);
+ fail:
+	mutex_unlock(&mdev->state_mutex);
+	return r;
+}
+
+
+STATIC int drbd_nl_primary(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			   struct drbd_nl_cfg_reply *reply)
+{
+	struct primary primary_args;
+
+	memset(&primary_args, 0, sizeof(struct primary));
+	if (!primary_from_tags(mdev, nlp->tag_list, &primary_args)) {
+		reply->ret_code = ERR_MANDATORY_TAG;
+		return 0;
+	}
+
+	reply->ret_code =
+		drbd_set_role(mdev, R_PRIMARY, primary_args.overwrite_peer);
+
+	return 0;
+}
+
+STATIC int drbd_nl_secondary(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			     struct drbd_nl_cfg_reply *reply)
+{
+	reply->ret_code = drbd_set_role(mdev, R_SECONDARY, 0);
+
+	return 0;
+}
+
+/* initializes the md.*_offset members, so we are able to find
+ * the on disk meta data */
+STATIC void drbd_md_set_sector_offsets(struct drbd_conf *mdev,
+				       struct drbd_backing_dev *bdev)
+{
+	sector_t md_size_sect = 0;
+	switch (bdev->dc.meta_dev_idx) {
+	default:
+		/* v07 style fixed size indexed meta data */
+		bdev->md.md_size_sect = MD_RESERVED_SECT;
+		bdev->md.md_offset = drbd_md_ss__(mdev, bdev);
+		bdev->md.al_offset = MD_AL_OFFSET;
+		bdev->md.bm_offset = MD_BM_OFFSET;
+		break;
+	case DRBD_MD_INDEX_FLEX_EXT:
+		/* just occupy the full device; unit: sectors */
+		bdev->md.md_size_sect = drbd_get_capacity(bdev->md_bdev);
+		bdev->md.md_offset = 0;
+		bdev->md.al_offset = MD_AL_OFFSET;
+		bdev->md.bm_offset = MD_BM_OFFSET;
+		break;
+	case DRBD_MD_INDEX_INTERNAL:
+	case DRBD_MD_INDEX_FLEX_INT:
+		bdev->md.md_offset = drbd_md_ss__(mdev, bdev);
+		/* al size is still fixed */
+		bdev->md.al_offset = -MD_AL_MAX_SIZE;
+		/* we need (slightly less than) ~ this many bitmap sectors: */
+		md_size_sect = drbd_get_capacity(bdev->backing_bdev);
+		md_size_sect = ALIGN(md_size_sect, BM_SECT_PER_EXT);
+		md_size_sect = BM_SECT_TO_EXT(md_size_sect);
+		md_size_sect = ALIGN(md_size_sect, 8);
+
+		/* plus the "drbd meta data super block",
+		 * and the activity log; */
+		md_size_sect += MD_BM_OFFSET;
+
+		bdev->md.md_size_sect = md_size_sect;
+		/* bitmap offset is adjusted by 'super' block size */
+		bdev->md.bm_offset   = -md_size_sect + MD_AL_OFFSET;
+		break;
+	}
+}
+
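+/* Pretty-print a size given in KB: shift down by 10 (rounding on bit 9)
+ * until the value drops below 10000 and advance the unit accordingly,
+ * e.g. 10240 is rendered as "10 MB". */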
+char *ppsize(char *buf, unsigned long long size)
+{
+	/* Needs 9 bytes at max. */
+	static char units[] = { 'K', 'M', 'G', 'T', 'P', 'E' };
+	int base = 0;
+	while (size >= 10000) {
+		/* shift + round */
+		size = (size >> 10) + !!(size & (1<<9));
+		base++;
+	}
+	sprintf(buf, "%lu %cB", (long)size, units[base]);
+
+	return buf;
+}
+
+/* there is still a theoretical deadlock when called from receiver
+ * on a D_INCONSISTENT R_PRIMARY:
+ *  remote READ does inc_ap_bio, receiver would need to receive answer
+ *  packet from remote to dec_ap_bio again.
+ *  receiver receive_sizes(), comes here,
+ *  waits for ap_bio_cnt == 0. -> deadlock.
+ * but this cannot happen, actually, because:
+ *  R_PRIMARY D_INCONSISTENT, and peer's disk is unreachable
+ *  (not connected, or bad/no disk on peer):
+ *  see drbd_fail_request_early, ap_bio_cnt is zero.
+ *  R_PRIMARY D_INCONSISTENT, and C_SYNC_TARGET:
+ *  peer may not initiate a resize.
+ */
+void drbd_suspend_io(struct drbd_conf *mdev)
+{
+	set_bit(SUSPEND_IO, &mdev->flags);
+	wait_event(mdev->misc_wait, !atomic_read(&mdev->ap_bio_cnt));
+}
+
+void drbd_resume_io(struct drbd_conf *mdev)
+{
+	clear_bit(SUSPEND_IO, &mdev->flags);
+	wake_up(&mdev->misc_wait);
+}
+
+/**
+ * drbd_determin_dev_size:
+ * Evaluates all constraints and sets our correct device size.
+ * Negative return values indicate errors. 0 and positive values
+ * indicate success.
+ * You should call drbd_md_sync() after calling this function.
+ */
+enum determine_dev_size drbd_determin_dev_size(struct drbd_conf *mdev) __must_hold(local)
+{
+	sector_t prev_first_sect, prev_size; /* previous meta location */
+	sector_t la_size;
+	sector_t size;
+	char ppb[10];
+
+	int md_moved, la_size_changed;
+	enum determine_dev_size rv = unchanged;
+
+	/* race:
+	 * application request passes inc_ap_bio,
+	 * but then cannot get an AL-reference.
+	 * this function later may wait on ap_bio_cnt == 0. -> deadlock.
+	 *
+	 * to avoid that:
+	 * Suspend IO right here.
+	 * still lock the act_log to not trigger ASSERTs there.
+	 */
+	drbd_suspend_io(mdev);
+
+	/* no wait necessary anymore, actually we could assert that */
+	wait_event(mdev->al_wait, lc_try_lock(mdev->act_log));
+
+	prev_first_sect = drbd_md_first_sector(mdev->bc);
+	prev_size = mdev->bc->md.md_size_sect;
+	la_size = mdev->bc->md.la_size_sect;
+
+	/* TODO: should only be some assert here, not (re)init... */
+	drbd_md_set_sector_offsets(mdev, mdev->bc);
+
+	size = drbd_new_dev_size(mdev, mdev->bc);
+
+	if (drbd_get_capacity(mdev->this_bdev) != size ||
+	    drbd_bm_capacity(mdev) != size) {
+		int err;
+		err = drbd_bm_resize(mdev, size);
+		if (unlikely(err)) {
+			/* currently there is only one error: ENOMEM! */
+			size = drbd_bm_capacity(mdev)>>1;
+			if (size == 0) {
+				dev_err(DEV, "OUT OF MEMORY! "
+				    "Could not allocate bitmap!\n");
+			} else {
+				dev_err(DEV, "BM resizing failed. "
+				    "Leaving size unchanged at size = %lu KB\n",
+				    (unsigned long)size);
+			}
+			rv = dev_size_error;
+		}
+		/* racy, see comments above. */
+		drbd_set_my_capacity(mdev, size);
+		mdev->bc->md.la_size_sect = size;
+		dev_info(DEV, "size = %s (%llu KB)\n", ppsize(ppb, size>>1),
+		     (unsigned long long)size>>1);
+	}
+	if (rv == dev_size_error)
+		goto out;
+
+	la_size_changed = (la_size != mdev->bc->md.la_size_sect);
+
+	md_moved = prev_first_sect != drbd_md_first_sector(mdev->bc)
+		|| prev_size	   != mdev->bc->md.md_size_sect;
+
+	if (md_moved) {
+		dev_warn(DEV, "Moving meta-data.\n");
+		/* assert: (flexible) internal meta data */
+	}
+
+	if (la_size_changed || md_moved) {
+		drbd_al_shrink(mdev); /* All extents inactive. */
+		dev_info(DEV, "Writing the whole bitmap, size changed\n");
+		rv = drbd_bitmap_io(mdev, &drbd_bm_write, "size changed");
+		drbd_md_mark_dirty(mdev);
+	}
+
+	if (size > la_size)
+		rv = grew;
+	if (size < la_size)
+		rv = shrunk;
+out:
+	lc_unlock(mdev->act_log);
+	wake_up(&mdev->al_wait);
+	drbd_resume_io(mdev);
+
+	return rv;
+}
+
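+/* Pick the effective device size: the minimum of the local and the peer's
+ * capacity if both are known, otherwise the last agreed size capped by
+ * whichever single capacity is known.  A user-requested disk_size at or
+ * below that value is used as-is; a larger request only triggers an error
+ * message.  E.g. with p_size == 2097152 and m_size == 4194304 sectors the
+ * result is 2097152 sectors (1 GB). */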
+sector_t
+drbd_new_dev_size(struct drbd_conf *mdev, struct drbd_backing_dev *bdev)
+{
+	sector_t p_size = mdev->p_size;   /* partner's disk size. */
+	sector_t la_size = bdev->md.la_size_sect; /* last agreed size. */
+	sector_t m_size; /* my size */
+	sector_t u_size = bdev->dc.disk_size; /* size requested by user. */
+	sector_t size = 0;
+
+	m_size = drbd_get_max_capacity(bdev);
+
+	if (p_size && m_size) {
+		size = min_t(sector_t, p_size, m_size);
+	} else {
+		if (la_size) {
+			size = la_size;
+			if (m_size && m_size < size)
+				size = m_size;
+			if (p_size && p_size < size)
+				size = p_size;
+		} else {
+			if (m_size)
+				size = m_size;
+			if (p_size)
+				size = p_size;
+		}
+	}
+
+	if (size == 0)
+		dev_err(DEV, "Both nodes diskless!\n");
+
+	if (u_size) {
+		if (u_size > size)
+			dev_err(DEV, "Requested disk size is too big (%lu > %lu)\n",
+			    (unsigned long)u_size>>1, (unsigned long)size>>1);
+		else
+			size = u_size;
+	}
+
+	return size;
+}
+
+/**
+ * drbd_check_al_size:
+ * checks that the al lru is of the requested size, and if necessary tries to
+ * allocate a new one. Returns -EBUSY if the current al lru is still in use,
+ * -ENOMEM when allocation failed, and 0 on success. You should call
+ * drbd_md_sync() after calling this function.
+ */
+STATIC int drbd_check_al_size(struct drbd_conf *mdev)
+{
+	struct lru_cache *n, *t;
+	struct lc_element *e;
+	unsigned int in_use;
+	int i;
+
+	ERR_IF(mdev->sync_conf.al_extents < 7)
+		mdev->sync_conf.al_extents = 127;
+
+	if (mdev->act_log &&
+	    mdev->act_log->nr_elements == mdev->sync_conf.al_extents)
+		return 0;
+
+	in_use = 0;
+	t = mdev->act_log;
+	n = lc_alloc("act_log", mdev->sync_conf.al_extents,
+		     sizeof(struct lc_element), mdev);
+
+	if (n == NULL) {
+		dev_err(DEV, "Cannot allocate act_log lru!\n");
+		return -ENOMEM;
+	}
+	spin_lock_irq(&mdev->al_lock);
+	if (t) {
+		for (i = 0; i < t->nr_elements; i++) {
+			e = lc_entry(t, i);
+			if (e->refcnt)
+				dev_err(DEV, "refcnt(%d)==%d\n",
+				    e->lc_number, e->refcnt);
+			in_use += e->refcnt;
+		}
+	}
+	if (!in_use)
+		mdev->act_log = n;
+	spin_unlock_irq(&mdev->al_lock);
+	if (in_use) {
+		dev_err(DEV, "Activity log still in use!\n");
+		lc_free(n);
+		return -EBUSY;
+	} else {
+		if (t)
+			lc_free(t);
+	}
+	drbd_md_mark_dirty(mdev); /* we changed mdev->act_log->nr_elements */
+	return 0;
+}
+
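+/* Derive our request queue limits from the backing device: unless
+ * use_bmbv is set, a backing device with a merge_bvec_fn restricts us to
+ * PAGE_SIZE segments; max_bio_bvecs, if configured, additionally caps
+ * the number of physical/hw segments per bio. */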
+void drbd_setup_queue_param(struct drbd_conf *mdev, unsigned int max_seg_s) __must_hold(local)
+{
+	struct request_queue * const q = mdev->rq_queue;
+	struct request_queue * const b = mdev->bc->backing_bdev->bd_disk->queue;
+	/* unsigned int old_max_seg_s = q->max_segment_size; */
+	int max_segments = mdev->bc->dc.max_bio_bvecs;
+
+	if (b->merge_bvec_fn && !mdev->bc->dc.use_bmbv)
+		max_seg_s = PAGE_SIZE;
+
+	max_seg_s = min(b->max_sectors * b->hardsect_size, max_seg_s);
+
+	q->max_sectors	     = max_seg_s >> 9;
+	if (max_segments) {
+		q->max_phys_segments = max_segments;
+		q->max_hw_segments   = max_segments;
+	} else {
+		q->max_phys_segments = MAX_PHYS_SEGMENTS;
+		q->max_hw_segments   = MAX_HW_SEGMENTS;
+	}
+	q->max_segment_size  = max_seg_s;
+	q->hardsect_size     = 512;
+	q->seg_boundary_mask = PAGE_SIZE-1;
+	blk_queue_stack_limits(q, b);
+
+	if (b->merge_bvec_fn)
+		dev_warn(DEV, "Backing device's merge_bvec_fn() = %p\n",
+		     b->merge_bvec_fn);
+	dev_info(DEV, "max_segment_size ( = BIO size ) = %u\n", q->max_segment_size);
+
+	if (q->backing_dev_info.ra_pages != b->backing_dev_info.ra_pages) {
+		dev_info(DEV, "Adjusting my ra_pages to backing device's (%lu -> %lu)\n",
+		     q->backing_dev_info.ra_pages,
+		     b->backing_dev_info.ra_pages);
+		q->backing_dev_info.ra_pages = b->backing_dev_info.ra_pages;
+	}
+}
+
+/* serialize deconfig (worker exiting, doing cleanup)
+ * and reconfig (drbdsetup disk, drbdsetup net)
+ *
+ * wait for a potentially exiting worker, then restart it,
+ * or start a new one.
+ */
+static void drbd_reconfig_start(struct drbd_conf *mdev)
+{
+	wait_event(mdev->state_wait, test_and_set_bit(CONFIG_PENDING, &mdev->flags));
+	wait_event(mdev->state_wait, !test_bit(DEVICE_DYING, &mdev->flags));
+	drbd_thread_start(&mdev->worker);
+}
+
+/* if still unconfigured, stops worker again.
+ * if configured now, clears CONFIG_PENDING.
+ * wakes potential waiters */
+static void drbd_reconfig_done(struct drbd_conf *mdev)
+{
+	spin_lock_irq(&mdev->req_lock);
+	if (mdev->state.disk == D_DISKLESS &&
+	    mdev->state.conn == C_STANDALONE &&
+	    mdev->state.role == R_SECONDARY) {
+		set_bit(DEVICE_DYING, &mdev->flags);
+		drbd_thread_stop_nowait(&mdev->worker);
+	} else
+		clear_bit(CONFIG_PENDING, &mdev->flags);
+	spin_unlock_irq(&mdev->req_lock);
+	wake_up(&mdev->state_wait);
+}
+
+/* always returns 0;
+ * the interesting return code is in reply->ret_code */
+STATIC int drbd_nl_disk_conf(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			     struct drbd_nl_cfg_reply *reply)
+{
+	enum drbd_ret_codes retcode;
+	enum determine_dev_size dd;
+	sector_t max_possible_sectors;
+	sector_t min_md_device_sectors;
+	struct drbd_backing_dev *nbc = NULL; /* new_backing_conf */
+	struct inode *inode, *inode2;
+	struct lru_cache *resync_lru = NULL;
+	union drbd_state ns, os;
+	int rv;
+	int cp_discovered = 0;
+	int hardsect;
+
+	drbd_reconfig_start(mdev);
+
+	/* if you want to reconfigure, please tear down first */
+	if (mdev->state.disk > D_DISKLESS) {
+		retcode = ERR_DISK_CONFIGURED;
+		goto fail;
+	}
+
+	nbc = kmalloc(sizeof(struct drbd_backing_dev), GFP_KERNEL);
+	if (!nbc) {
+		retcode = ERR_NOMEM;
+		goto fail;
+	}
+
+	memset(&nbc->md, 0, sizeof(struct drbd_md));
+	memset(&nbc->dc, 0, sizeof(struct disk_conf));
+	nbc->dc.disk_size     = DRBD_DISK_SIZE_SECT_DEF;
+	nbc->dc.on_io_error   = DRBD_ON_IO_ERROR_DEF;
+	nbc->dc.fencing       = DRBD_FENCING_DEF;
+	nbc->dc.max_bio_bvecs = DRBD_MAX_BIO_BVECS_DEF;
+
+	if (!disk_conf_from_tags(mdev, nlp->tag_list, &nbc->dc)) {
+		retcode = ERR_MANDATORY_TAG;
+		goto fail;
+	}
+
+	nbc->lo_file = NULL;
+	nbc->md_file = NULL;
+
+	if (nbc->dc.meta_dev_idx < DRBD_MD_INDEX_FLEX_INT) {
+		retcode = ERR_MD_IDX_INVALID;
+		goto fail;
+	}
+
+	nbc->lo_file = filp_open(nbc->dc.backing_dev, O_RDWR, 0);
+	if (IS_ERR(nbc->lo_file)) {
+		dev_err(DEV, "open(\"%s\") failed with %ld\n", nbc->dc.backing_dev,
+		    PTR_ERR(nbc->lo_file));
+		nbc->lo_file = NULL;
+		retcode = ERR_OPEN_DISK;
+		goto fail;
+	}
+
+	inode = nbc->lo_file->f_dentry->d_inode;
+
+	if (!S_ISBLK(inode->i_mode)) {
+		retcode = ERR_DISK_NOT_BDEV;
+		goto fail;
+	}
+
+	nbc->md_file = filp_open(nbc->dc.meta_dev, O_RDWR, 0);
+	if (IS_ERR(nbc->md_file)) {
+		dev_err(DEV, "open(\"%s\") failed with %ld\n", nbc->dc.meta_dev,
+		    PTR_ERR(nbc->md_file));
+		nbc->md_file = NULL;
+		retcode = ERR_OPEN_MD_DISK;
+		goto fail;
+	}
+
+	inode2 = nbc->md_file->f_dentry->d_inode;
+
+	if (!S_ISBLK(inode2->i_mode)) {
+		retcode = ERR_MD_NOT_BDEV;
+		goto fail;
+	}
+
+	nbc->backing_bdev = inode->i_bdev;
+	if (bd_claim(nbc->backing_bdev, mdev)) {
+		printk(KERN_ERR "drbd: bd_claim(%p,%p); failed [%p;%p;%u]\n",
+		       nbc->backing_bdev, mdev,
+		       nbc->backing_bdev->bd_holder,
+		       nbc->backing_bdev->bd_contains->bd_holder,
+		       nbc->backing_bdev->bd_holders);
+		retcode = ERR_BDCLAIM_DISK;
+		goto fail;
+	}
+
+	resync_lru = lc_alloc("resync", 61, sizeof(struct bm_extent), mdev);
+	if (!resync_lru) {
+		retcode = ERR_NOMEM;
+		goto release_bdev_fail;
+	}
+
+	nbc->md_bdev = inode2->i_bdev;
+	if (bd_claim(nbc->md_bdev,
+		     (nbc->dc.meta_dev_idx == DRBD_MD_INDEX_INTERNAL ||
+		      nbc->dc.meta_dev_idx == DRBD_MD_INDEX_FLEX_INT) ?
+		     (void *)mdev : (void *) drbd_m_holder)) {
+		retcode = ERR_BDCLAIM_MD_DISK;
+		goto release_bdev_fail;
+	}
+
+	if ((nbc->backing_bdev == nbc->md_bdev) !=
+	    (nbc->dc.meta_dev_idx == DRBD_MD_INDEX_INTERNAL ||
+	     nbc->dc.meta_dev_idx == DRBD_MD_INDEX_FLEX_INT)) {
+		retcode = ERR_MD_IDX_INVALID;
+		goto release_bdev2_fail;
+	}
+
+	/* RT - for drbd_get_max_capacity() DRBD_MD_INDEX_FLEX_INT */
+	drbd_md_set_sector_offsets(mdev, nbc);
+
+	if (drbd_get_max_capacity(nbc) < nbc->dc.disk_size) {
+		dev_err(DEV, "max capacity %llu smaller than disk size %llu\n",
+			(unsigned long long) drbd_get_max_capacity(nbc),
+			(unsigned long long) nbc->dc.disk_size);
+		retcode = ERR_DISK_TO_SMALL;
+		goto release_bdev2_fail;
+	}
+
+	if (nbc->dc.meta_dev_idx < 0) {
+		max_possible_sectors = DRBD_MAX_SECTORS_FLEX;
+		/* at least one MB, otherwise it does not make sense */
+		min_md_device_sectors = (2<<10);
+	} else {
+		max_possible_sectors = DRBD_MAX_SECTORS;
+		min_md_device_sectors = MD_RESERVED_SECT * (nbc->dc.meta_dev_idx + 1);
+	}
+
+	if (drbd_get_capacity(nbc->md_bdev) > max_possible_sectors)
+		dev_warn(DEV, "truncating very big lower level device "
+		     "to currently maximum possible %llu sectors\n",
+		     (unsigned long long) max_possible_sectors);
+
+	if (drbd_get_capacity(nbc->md_bdev) < min_md_device_sectors) {
+		retcode = ERR_MD_DISK_TO_SMALL;
+		dev_warn(DEV, "refusing attach: md-device too small, "
+		     "at least %llu sectors needed for this meta-disk type\n",
+		     (unsigned long long) min_md_device_sectors);
+		goto release_bdev2_fail;
+	}
+
+	/* Make sure the new disk is big enough
+	 * (we may currently be R_PRIMARY with no local disk...) */
+	if (drbd_get_max_capacity(nbc) <
+	    drbd_get_capacity(mdev->this_bdev)) {
+		retcode = ERR_DISK_TO_SMALL;
+		goto release_bdev2_fail;
+	}
+
+	nbc->known_size = drbd_get_capacity(nbc->backing_bdev);
+
+	drbd_suspend_io(mdev);
+	/* also wait for the last barrier ack. */
+	wait_event(mdev->misc_wait, !atomic_read(&mdev->ap_pending_cnt));
+
+	retcode = _drbd_request_state(mdev, NS(disk, D_ATTACHING), CS_VERBOSE);
+	drbd_resume_io(mdev);
+	if (retcode < SS_SUCCESS)
+		goto release_bdev2_fail;
+
+	if (!inc_local_if_state(mdev, D_ATTACHING))
+		goto force_diskless;
+
+	drbd_md_set_sector_offsets(mdev, nbc);
+
+	if (!mdev->bitmap) {
+		if (drbd_bm_init(mdev)) {
+			retcode = ERR_NOMEM;
+			goto force_diskless_dec;
+		}
+	}
+
+	retcode = drbd_md_read(mdev, nbc);
+	if (retcode != NO_ERROR)
+		goto force_diskless_dec;
+
+	if (mdev->state.conn < C_CONNECTED &&
+	    mdev->state.role == R_PRIMARY &&
+	    (mdev->ed_uuid & ~((u64)1)) != (nbc->md.uuid[UI_CURRENT] & ~((u64)1))) {
+		dev_err(DEV, "Can only attach to data with current UUID=%016llX\n",
+		    (unsigned long long)mdev->ed_uuid);
+		retcode = ERR_DATA_NOT_CURRENT;
+		goto force_diskless_dec;
+	}
+
+	/* Since we are diskless, fix the AL first... */
+	if (drbd_check_al_size(mdev)) {
+		retcode = ERR_NOMEM;
+		goto force_diskless_dec;
+	}
+
+	/* Prevent shrinking of consistent devices ! */
+	if (drbd_md_test_flag(nbc, MDF_CONSISTENT) &&
+	   drbd_new_dev_size(mdev, nbc) < nbc->md.la_size_sect) {
+		dev_warn(DEV, "refusing to truncate a consistent device\n");
+		retcode = ERR_DISK_TO_SMALL;
+		goto force_diskless_dec;
+	}
+
+	if (!drbd_al_read_log(mdev, nbc)) {
+		retcode = ERR_IO_MD_DISK;
+		goto force_diskless_dec;
+	}
+
+	/* allocate a second IO page if hardsect != 512 */
+	hardsect = drbd_get_hardsect(nbc->md_bdev);
+	if (hardsect == 0)
+		hardsect = MD_HARDSECT;
+
+	if (hardsect != MD_HARDSECT) {
+		if (!mdev->md_io_tmpp) {
+			struct page *page = alloc_page(GFP_NOIO);
+			if (!page)
+				goto force_diskless_dec;
+
+			dev_warn(DEV, "Meta data's bdev hardsect = %d != %d\n",
+			     hardsect, MD_HARDSECT);
+			dev_warn(DEV, "Workaround engaged (has performance impact).\n");
+
+			mdev->md_io_tmpp = page;
+		}
+	}
+
+	/* Reset the "barriers don't work" bits here, then force meta data to
+	 * be written, to ensure we determine if barriers are supported. */
+	if (nbc->dc.no_md_flush)
+		set_bit(MD_NO_BARRIER, &mdev->flags);
+	else
+		clear_bit(MD_NO_BARRIER, &mdev->flags);
+
+	/* Point of no return reached.
+	 * Devices and memory are no longer released by error cleanup below.
+	 * now mdev takes over responsibility, and the state engine should
+	 * clean it up somewhere.  */
+	D_ASSERT(mdev->bc == NULL);
+	mdev->bc = nbc;
+	mdev->resync = resync_lru;
+	nbc = NULL;
+	resync_lru = NULL;
+
+	mdev->write_ordering = WO_bio_barrier;
+	drbd_bump_write_ordering(mdev, WO_bio_barrier);
+
+	if (drbd_md_test_flag(mdev->bc, MDF_CRASHED_PRIMARY))
+		set_bit(CRASHED_PRIMARY, &mdev->flags);
+	else
+		clear_bit(CRASHED_PRIMARY, &mdev->flags);
+
+	if (drbd_md_test_flag(mdev->bc, MDF_PRIMARY_IND)) {
+		set_bit(CRASHED_PRIMARY, &mdev->flags);
+		cp_discovered = 1;
+	}
+
+	mdev->send_cnt = 0;
+	mdev->recv_cnt = 0;
+	mdev->read_cnt = 0;
+	mdev->writ_cnt = 0;
+
+	drbd_setup_queue_param(mdev, DRBD_MAX_SEGMENT_SIZE);
+
+	/* If I am currently not R_PRIMARY,
+	 * but meta data primary indicator is set,
+	 * I just now recover from a hard crash,
+	 * and have been R_PRIMARY before that crash.
+	 *
+	 * Now, if I had no connection before that crash
+	 * (have been degraded R_PRIMARY), chances are that
+	 * I won't find my peer now either.
+	 *
+	 * In that case, and _only_ in that case,
+	 * we use the degr-wfc-timeout instead of the default,
+	 * so we can automatically recover from a crash of a
+	 * degraded but active "cluster" after a certain timeout.
+	 */
+	clear_bit(USE_DEGR_WFC_T, &mdev->flags);
+	if (mdev->state.role != R_PRIMARY &&
+	     drbd_md_test_flag(mdev->bc, MDF_PRIMARY_IND) &&
+	    !drbd_md_test_flag(mdev->bc, MDF_CONNECTED_IND))
+		set_bit(USE_DEGR_WFC_T, &mdev->flags);
+
+	dd = drbd_determin_dev_size(mdev);
+	if (dd == dev_size_error) {
+		retcode = ERR_NOMEM_BITMAP;
+		goto force_diskless_dec;
+	} else if (dd == grew)
+		set_bit(RESYNC_AFTER_NEG, &mdev->flags);
+
+	if (drbd_md_test_flag(mdev->bc, MDF_FULL_SYNC)) {
+		dev_info(DEV, "Assuming that all blocks are out of sync "
+		     "(aka FullSync)\n");
+		if (drbd_bitmap_io(mdev, &drbd_bmio_set_n_write, "set_n_write from attaching")) {
+			retcode = ERR_IO_MD_DISK;
+			goto force_diskless_dec;
+		}
+	} else {
+		if (drbd_bitmap_io(mdev, &drbd_bm_read, "read from attaching") < 0) {
+			retcode = ERR_IO_MD_DISK;
+			goto force_diskless_dec;
+		}
+	}
+
+	if (cp_discovered) {
+		drbd_al_apply_to_bm(mdev);
+		drbd_al_to_on_disk_bm(mdev);
+	}
+
+	spin_lock_irq(&mdev->req_lock);
+	os = mdev->state;
+	ns.i = os.i;
+	/* If MDF_CONSISTENT is not set, go into inconsistent state;
+	   otherwise investigate MDF_WAS_UP_TO_DATE:
+	   if MDF_WAS_UP_TO_DATE is not set, go into D_OUTDATED disk state,
+	   otherwise into D_CONSISTENT state.
+	*/
+	if (drbd_md_test_flag(mdev->bc, MDF_CONSISTENT)) {
+		if (drbd_md_test_flag(mdev->bc, MDF_WAS_UP_TO_DATE))
+			ns.disk = D_CONSISTENT;
+		else
+			ns.disk = D_OUTDATED;
+	} else {
+		ns.disk = D_INCONSISTENT;
+	}
+
+	if (drbd_md_test_flag(mdev->bc, MDF_PEER_OUT_DATED))
+		ns.pdsk = D_OUTDATED;
+
+	if ( ns.disk == D_CONSISTENT &&
+	    (ns.pdsk == D_OUTDATED || mdev->bc->dc.fencing == FP_DONT_CARE))
+		ns.disk = D_UP_TO_DATE;
+
+	/* All tests on MDF_PRIMARY_IND, MDF_CONNECTED_IND,
+	   MDF_CONSISTENT and MDF_WAS_UP_TO_DATE must happen before
+	   this point, because drbd_request_state() modifies these
+	   flags. */
+
+	/* In case we are C_CONNECTED, postpone any decision on the new disk
+	   state until after the negotiation phase. */
+	if (mdev->state.conn == C_CONNECTED) {
+		mdev->new_state_tmp.i = ns.i;
+		ns.i = os.i;
+		ns.disk = D_NEGOTIATING;
+	}
+
+	rv = _drbd_set_state(mdev, ns, CS_VERBOSE, NULL);
+	ns = mdev->state;
+	spin_unlock_irq(&mdev->req_lock);
+
+	if (rv < SS_SUCCESS)
+		goto force_diskless_dec;
+
+	if (mdev->state.role == R_PRIMARY)
+		mdev->bc->md.uuid[UI_CURRENT] |=  (u64)1;
+	else
+		mdev->bc->md.uuid[UI_CURRENT] &= ~(u64)1;
+
+	drbd_md_mark_dirty(mdev);
+	drbd_md_sync(mdev);
+
+	kobject_uevent(&disk_to_dev(mdev->vdisk)->kobj, KOBJ_CHANGE);
+	dec_local(mdev);
+	reply->ret_code = retcode;
+	drbd_reconfig_done(mdev);
+	return 0;
+
+ force_diskless_dec:
+	dec_local(mdev);
+ force_diskless:
+	drbd_force_state(mdev, NS(disk, D_DISKLESS));
+	drbd_md_sync(mdev);
+ release_bdev2_fail:
+	if (nbc)
+		bd_release(nbc->md_bdev);
+ release_bdev_fail:
+	if (nbc)
+		bd_release(nbc->backing_bdev);
+ fail:
+	if (nbc) {
+		if (nbc->lo_file)
+			fput(nbc->lo_file);
+		if (nbc->md_file)
+			fput(nbc->md_file);
+		kfree(nbc);
+	}
+	if (resync_lru)
+		lc_free(resync_lru);
+
+	reply->ret_code = retcode;
+	drbd_reconfig_done(mdev);
+	return 0;
+}
+
+STATIC int drbd_nl_detach(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			  struct drbd_nl_cfg_reply *reply)
+{
+	reply->ret_code = drbd_request_state(mdev, NS(disk, D_DISKLESS));
+	return 0;
+}
+
+STATIC int drbd_nl_net_conf(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			    struct drbd_nl_cfg_reply *reply)
+{
+	int i, ns;
+	enum drbd_ret_codes retcode;
+	struct net_conf *new_conf = NULL;
+	struct crypto_hash *tfm = NULL;
+	struct crypto_hash *integrity_w_tfm = NULL;
+	struct crypto_hash *integrity_r_tfm = NULL;
+	struct hlist_head *new_tl_hash = NULL;
+	struct hlist_head *new_ee_hash = NULL;
+	struct drbd_conf *odev;
+	char hmac_name[CRYPTO_MAX_ALG_NAME];
+	void *int_dig_out = NULL;
+	void *int_dig_in = NULL;
+	void *int_dig_vv = NULL;
+	struct sockaddr *new_my_addr, *new_peer_addr, *taken_addr;
+
+	drbd_reconfig_start(mdev);
+
+	if (mdev->state.conn > C_STANDALONE) {
+		retcode = ERR_NET_CONFIGURED;
+		goto fail;
+	}
+
+	new_conf = kmalloc(sizeof(struct net_conf), GFP_KERNEL);
+	if (!new_conf) {
+		retcode = ERR_NOMEM;
+		goto fail;
+	}
+
+	memset(new_conf, 0, sizeof(struct net_conf));
+	new_conf->timeout	   = DRBD_TIMEOUT_DEF;
+	new_conf->try_connect_int  = DRBD_CONNECT_INT_DEF;
+	new_conf->ping_int	   = DRBD_PING_INT_DEF;
+	new_conf->max_epoch_size   = DRBD_MAX_EPOCH_SIZE_DEF;
+	new_conf->max_buffers	   = DRBD_MAX_BUFFERS_DEF;
+	new_conf->unplug_watermark = DRBD_UNPLUG_WATERMARK_DEF;
+	new_conf->sndbuf_size	   = DRBD_SNDBUF_SIZE_DEF;
+	new_conf->ko_count	   = DRBD_KO_COUNT_DEF;
+	new_conf->after_sb_0p	   = DRBD_AFTER_SB_0P_DEF;
+	new_conf->after_sb_1p	   = DRBD_AFTER_SB_1P_DEF;
+	new_conf->after_sb_2p	   = DRBD_AFTER_SB_2P_DEF;
+	new_conf->want_lose	   = 0;
+	new_conf->two_primaries    = 0;
+	new_conf->wire_protocol    = DRBD_PROT_C;
+	new_conf->ping_timeo	   = DRBD_PING_TIMEO_DEF;
+	new_conf->rr_conflict	   = DRBD_RR_CONFLICT_DEF;
+
+	if (!net_conf_from_tags(mdev, nlp->tag_list, new_conf)) {
+		retcode = ERR_MANDATORY_TAG;
+		goto fail;
+	}
+
+	if (new_conf->two_primaries
+	&& (new_conf->wire_protocol != DRBD_PROT_C)) {
+		retcode = ERR_NOT_PROTO_C;
+		goto fail;
+	}
+
+	if (mdev->state.role == R_PRIMARY && new_conf->want_lose) {
+		retcode = ERR_DISCARD;
+		goto fail;
+	}
+
+	retcode = NO_ERROR;
+
+	new_my_addr = (struct sockaddr *)&new_conf->my_addr;
+	new_peer_addr = (struct sockaddr *)&new_conf->peer_addr;
+	for (i = 0; i < minor_count; i++) {
+		odev = minor_to_mdev(i);
+		if (!odev || odev == mdev)
+			continue;
+		if (inc_net(odev)) {
+			taken_addr = (struct sockaddr *)&odev->net_conf->my_addr;
+			if (new_conf->my_addr_len == odev->net_conf->my_addr_len &&
+			    !memcmp(new_my_addr, taken_addr, new_conf->my_addr_len))
+				retcode = ERR_LOCAL_ADDR;
+
+			taken_addr = (struct sockaddr *)&odev->net_conf->peer_addr;
+			if (new_conf->peer_addr_len == odev->net_conf->peer_addr_len &&
+			    !memcmp(new_peer_addr, taken_addr, new_conf->peer_addr_len))
+				retcode = ERR_PEER_ADDR;
+
+			dec_net(odev);
+			if (retcode != NO_ERROR)
+				goto fail;
+		}
+	}
+
+	if (new_conf->cram_hmac_alg[0] != 0) {
+		snprintf(hmac_name, CRYPTO_MAX_ALG_NAME, "hmac(%s)",
+			new_conf->cram_hmac_alg);
+		tfm = crypto_alloc_hash(hmac_name, 0, CRYPTO_ALG_ASYNC);
+		if (IS_ERR(tfm)) {
+			tfm = NULL;
+			retcode = ERR_AUTH_ALG;
+			goto fail;
+		}
+
+		if (crypto_tfm_alg_type(crypto_hash_tfm(tfm))
+						!= CRYPTO_ALG_TYPE_HASH) {
+			retcode = ERR_AUTH_ALG_ND;
+			goto fail;
+		}
+	}
+
+	if (new_conf->integrity_alg[0]) {
+		integrity_w_tfm = crypto_alloc_hash(new_conf->integrity_alg, 0, CRYPTO_ALG_ASYNC);
+		if (IS_ERR(integrity_w_tfm)) {
+			integrity_w_tfm = NULL;
+			retcode = ERR_INTEGRITY_ALG;
+			goto fail;
+		}
+
+		if (!drbd_crypto_is_hash(crypto_hash_tfm(integrity_w_tfm))) {
+			retcode = ERR_INTEGRITY_ALG_ND;
+			goto fail;
+		}
+
+		integrity_r_tfm = crypto_alloc_hash(new_conf->integrity_alg, 0, CRYPTO_ALG_ASYNC);
+		if (IS_ERR(integrity_r_tfm)) {
+			integrity_r_tfm = NULL;
+			retcode = ERR_INTEGRITY_ALG;
+			goto fail;
+		}
+	}
+
+	ns = new_conf->max_epoch_size/8;
+	if (mdev->tl_hash_s != ns) {
+		new_tl_hash = kzalloc(ns*sizeof(void *), GFP_KERNEL);
+		if (!new_tl_hash) {
+			retcode = ERR_NOMEM;
+			goto fail;
+		}
+	}
+
+	ns = new_conf->max_buffers/8;
+	if (new_conf->two_primaries && (mdev->ee_hash_s != ns)) {
+		new_ee_hash = kzalloc(ns*sizeof(void *), GFP_KERNEL);
+		if (!new_ee_hash) {
+			retcode = ERR_NOMEM;
+			goto fail;
+		}
+	}
+
+	((char *)new_conf->shared_secret)[SHARED_SECRET_MAX-1] = 0;
+
+	if (integrity_w_tfm) {
+		i = crypto_hash_digestsize(integrity_w_tfm);
+		int_dig_out = kmalloc(i, GFP_KERNEL);
+		if (!int_dig_out) {
+			retcode = ERR_NOMEM;
+			goto fail;
+		}
+		int_dig_in = kmalloc(i, GFP_KERNEL);
+		if (!int_dig_in) {
+			retcode = ERR_NOMEM;
+			goto fail;
+		}
+		int_dig_vv = kmalloc(i, GFP_KERNEL);
+		if (!int_dig_vv) {
+			retcode = ERR_NOMEM;
+			goto fail;
+		}
+	}
+
+	if (!mdev->bitmap) {
+		if (drbd_bm_init(mdev)) {
+			retcode = ERR_NOMEM;
+			goto fail;
+		}
+	}
+
+	spin_lock_irq(&mdev->req_lock);
+	if (mdev->net_conf != NULL) {
+		retcode = ERR_NET_CONFIGURED;
+		spin_unlock_irq(&mdev->req_lock);
+		goto fail;
+	}
+	mdev->net_conf = new_conf;
+
+	mdev->send_cnt = 0;
+	mdev->recv_cnt = 0;
+
+	if (new_tl_hash) {
+		kfree(mdev->tl_hash);
+		mdev->tl_hash_s = mdev->net_conf->max_epoch_size/8;
+		mdev->tl_hash = new_tl_hash;
+	}
+
+	if (new_ee_hash) {
+		kfree(mdev->ee_hash);
+		mdev->ee_hash_s = mdev->net_conf->max_buffers/8;
+		mdev->ee_hash = new_ee_hash;
+	}
+
+	crypto_free_hash(mdev->cram_hmac_tfm);
+	mdev->cram_hmac_tfm = tfm;
+
+	crypto_free_hash(mdev->integrity_w_tfm);
+	mdev->integrity_w_tfm = integrity_w_tfm;
+
+	crypto_free_hash(mdev->integrity_r_tfm);
+	mdev->integrity_r_tfm = integrity_r_tfm;
+
+	kfree(mdev->int_dig_out);
+	kfree(mdev->int_dig_in);
+	kfree(mdev->int_dig_vv);
+	mdev->int_dig_out = int_dig_out;
+	mdev->int_dig_in = int_dig_in;
+	mdev->int_dig_vv = int_dig_vv;
+	spin_unlock_irq(&mdev->req_lock);
+
+	retcode = _drbd_request_state(mdev, NS(conn, C_UNCONNECTED), CS_VERBOSE);
+
+	kobject_uevent(&disk_to_dev(mdev->vdisk)->kobj, KOBJ_CHANGE);
+	reply->ret_code = retcode;
+	drbd_reconfig_done(mdev);
+	return 0;
+
+fail:
+	kfree(int_dig_out);
+	kfree(int_dig_in);
+	kfree(int_dig_vv);
+	crypto_free_hash(tfm);
+	crypto_free_hash(integrity_w_tfm);
+	crypto_free_hash(integrity_r_tfm);
+	kfree(new_tl_hash);
+	kfree(new_ee_hash);
+	kfree(new_conf);
+
+	reply->ret_code = retcode;
+	drbd_reconfig_done(mdev);
+	return 0;
+}
+
+STATIC int drbd_nl_disconnect(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			      struct drbd_nl_cfg_reply *reply)
+{
+	int retcode;
+
+	retcode = _drbd_request_state(mdev, NS(conn, C_DISCONNECTING), CS_ORDERED);
+
+	if (retcode == SS_NOTHING_TO_DO)
+		goto done;
+	else if (retcode == SS_ALREADY_STANDALONE)
+		goto done;
+	else if (retcode == SS_PRIMARY_NOP) {
+		/* Our state checking code wants to see the peer outdated. */
+		retcode = drbd_request_state(mdev, NS2(conn, C_DISCONNECTING,
+						      pdsk, D_OUTDATED));
+	} else if (retcode == SS_CW_FAILED_BY_PEER) {
+		/* The peer probably wants to see us outdated. */
+		retcode = _drbd_request_state(mdev, NS2(conn, C_DISCONNECTING,
+							disk, D_OUTDATED),
+					      CS_ORDERED);
+		if (retcode == SS_IS_DISKLESS || retcode == SS_LOWER_THAN_OUTDATED) {
+			drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+			retcode = SS_SUCCESS;
+		}
+	}
+
+	if (retcode < SS_SUCCESS)
+		goto fail;
+
+	if (wait_event_interruptible(mdev->state_wait,
+				     mdev->state.conn != C_DISCONNECTING)) {
+		/* Do not test for mdev->state.conn == C_STANDALONE, since
+		   someone else might connect us in the mean time! */
+		retcode = ERR_INTR;
+		goto fail;
+	}
+
+ done:
+	retcode = NO_ERROR;
+ fail:
+	drbd_md_sync(mdev);
+	reply->ret_code = retcode;
+	return 0;
+}
+
+void resync_after_online_grow(struct drbd_conf *mdev)
+{
+	int iass; /* I am sync source */
+
+	dev_info(DEV, "Resync of new storage after online grow\n");
+	if (mdev->state.role != mdev->state.peer)
+		iass = (mdev->state.role == R_PRIMARY);
+	else
+		iass = test_bit(DISCARD_CONCURRENT, &mdev->flags);
+
+	if (iass)
+		drbd_start_resync(mdev, C_SYNC_SOURCE);
+	else
+		_drbd_request_state(mdev, NS(conn, C_WF_SYNC_UUID), CS_VERBOSE + CS_SERIALIZE);
+}
+
+STATIC int drbd_nl_resize(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			  struct drbd_nl_cfg_reply *reply)
+{
+	struct resize rs;
+	int retcode = NO_ERROR;
+	int ldsc = 0; /* local disk size changed */
+	enum determine_dev_size dd;
+
+	memset(&rs, 0, sizeof(struct resize));
+	if (!resize_from_tags(mdev, nlp->tag_list, &rs)) {
+		retcode = ERR_MANDATORY_TAG;
+		goto fail;
+	}
+
+	if (mdev->state.conn > C_CONNECTED) {
+		retcode = ERR_RESIZE_RESYNC;
+		goto fail;
+	}
+
+	if (mdev->state.role == R_SECONDARY &&
+	    mdev->state.peer == R_SECONDARY) {
+		retcode = ERR_NO_PRIMARY;
+		goto fail;
+	}
+
+	if (!inc_local(mdev)) {
+		retcode = ERR_NO_DISK;
+		goto fail;
+	}
+
+	if (mdev->bc->known_size != drbd_get_capacity(mdev->bc->backing_bdev)) {
+		mdev->bc->known_size = drbd_get_capacity(mdev->bc->backing_bdev);
+		ldsc = 1;
+	}
+
+	mdev->bc->dc.disk_size = (sector_t)rs.resize_size;
+	dd = drbd_determin_dev_size(mdev);
+	drbd_md_sync(mdev);
+	dec_local(mdev);
+	if (dd == dev_size_error) {
+		retcode = ERR_NOMEM_BITMAP;
+		goto fail;
+	}
+
+	if (mdev->state.conn == C_CONNECTED && (dd != unchanged || ldsc)) {
+		drbd_send_uuids(mdev);
+		drbd_send_sizes(mdev);
+		if (dd == grew)
+			resync_after_online_grow(mdev);
+	}
+
+ fail:
+	reply->ret_code = retcode;
+	return 0;
+}
+
+STATIC int drbd_nl_syncer_conf(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			       struct drbd_nl_cfg_reply *reply)
+{
+	int retcode = NO_ERROR;
+	int err;
+	int ovr; /* online verify running */
+	int rsr; /* re-sync running */
+	struct drbd_conf *odev;
+	struct crypto_hash *verify_tfm = NULL;
+	struct crypto_hash *csums_tfm = NULL;
+	struct syncer_conf sc;
+	cpumask_t n_cpu_mask = CPU_MASK_NONE;
+
+	memcpy(&sc, &mdev->sync_conf, sizeof(struct syncer_conf));
+
+	if (nlp->flags & DRBD_NL_SET_DEFAULTS) {
+		memset(&sc, 0, sizeof(struct syncer_conf));
+		sc.rate       = DRBD_RATE_DEF;
+		sc.after      = DRBD_AFTER_DEF;
+		sc.al_extents = DRBD_AL_EXTENTS_DEF;
+	}
+
+	if (!syncer_conf_from_tags(mdev, nlp->tag_list, &sc)) {
+		retcode = ERR_MANDATORY_TAG;
+		goto fail;
+	}
+
+	if (sc.after != -1) {
+		if (sc.after < -1 || minor_to_mdev(sc.after) == NULL) {
+			retcode = ERR_SYNC_AFTER;
+			goto fail;
+		}
+		odev = minor_to_mdev(sc.after); /* check for loops in the resync-after dependencies */
+		while (1) {
+			if (odev == mdev) {
+				retcode = ERR_SYNC_AFTER_CYCLE;
+				goto fail;
+			}
+			if (odev->sync_conf.after == -1)
+				break; /* no cycles. */
+			odev = minor_to_mdev(odev->sync_conf.after);
+		}
+	}
+
+	/* re-sync running */
+	rsr = (	mdev->state.conn == C_SYNC_SOURCE ||
+		mdev->state.conn == C_SYNC_TARGET ||
+		mdev->state.conn == C_PAUSED_SYNC_S ||
+		mdev->state.conn == C_PAUSED_SYNC_T );
+
+	if (rsr && strcmp(sc.csums_alg, mdev->sync_conf.csums_alg)) {
+		retcode = ERR_CSUMS_RESYNC_RUNNING;
+		goto fail;
+	}
+
+	if (!rsr && sc.csums_alg[0]) {
+		csums_tfm = crypto_alloc_hash(sc.csums_alg, 0, CRYPTO_ALG_ASYNC);
+		if (IS_ERR(csums_tfm)) {
+			csums_tfm = NULL;
+			retcode = ERR_CSUMS_ALG;
+			goto fail;
+		}
+
+		if (!drbd_crypto_is_hash(crypto_hash_tfm(csums_tfm))) {
+			retcode = ERR_CSUMS_ALG_ND;
+			goto fail;
+		}
+	}
+
+	/* online verify running */
+	ovr = (mdev->state.conn == C_VERIFY_S || mdev->state.conn == C_VERIFY_T);
+
+	if (ovr) {
+		if (strcmp(sc.verify_alg, mdev->sync_conf.verify_alg)) {
+			retcode = ERR_VERIFY_RUNNING;
+			goto fail;
+		}
+	}
+
+	if (!ovr && sc.verify_alg[0]) {
+		verify_tfm = crypto_alloc_hash(sc.verify_alg, 0, CRYPTO_ALG_ASYNC);
+		if (IS_ERR(verify_tfm)) {
+			verify_tfm = NULL;
+			retcode = ERR_VERIFY_ALG;
+			goto fail;
+		}
+
+		if (!drbd_crypto_is_hash(crypto_hash_tfm(verify_tfm))) {
+			retcode = ERR_VERIFY_ALG_ND;
+			goto fail;
+		}
+	}
+
+	if (sc.cpu_mask[0] != 0) {
+		err = __bitmap_parse(sc.cpu_mask, 32, 0, (unsigned long *)&n_cpu_mask, NR_CPUS);
+		if (err) {
+			dev_warn(DEV, "__bitmap_parse() failed with %d\n", err);
+			retcode = ERR_CPU_MASK_PARSE;
+			goto fail;
+		}
+	}
+
+	ERR_IF (sc.rate < 1) sc.rate = 1;
+	ERR_IF (sc.al_extents < 7) sc.al_extents = 127; /* arbitrary minimum */
+#define AL_MAX ((MD_AL_MAX_SIZE-1) * AL_EXTENTS_PT)
+	if (sc.al_extents > AL_MAX) {
+		dev_err(DEV, "sc.al_extents > %d\n", AL_MAX);
+		sc.al_extents = AL_MAX;
+	}
+#undef AL_MAX
+
+	spin_lock(&mdev->peer_seq_lock);
+	/* lock against receive_SyncParam() */
+	mdev->sync_conf = sc;
+
+	if (!rsr) {
+		crypto_free_hash(mdev->csums_tfm);
+		mdev->csums_tfm = csums_tfm;
+		csums_tfm = NULL;
+	}
+
+	if (!ovr) {
+		crypto_free_hash(mdev->verify_tfm);
+		mdev->verify_tfm = verify_tfm;
+		verify_tfm = NULL;
+	}
+	spin_unlock(&mdev->peer_seq_lock);
+
+	if (inc_local(mdev)) {
+		wait_event(mdev->al_wait, lc_try_lock(mdev->act_log));
+		drbd_al_shrink(mdev);
+		err = drbd_check_al_size(mdev);
+		lc_unlock(mdev->act_log);
+		wake_up(&mdev->al_wait);
+
+		dec_local(mdev);
+		drbd_md_sync(mdev);
+
+		if (err) {
+			retcode = ERR_NOMEM;
+			goto fail;
+		}
+	}
+
+	if (mdev->state.conn >= C_CONNECTED)
+		drbd_send_sync_param(mdev, &sc);
+
+	drbd_alter_sa(mdev, sc.after);
+
+	if (!cpus_equal(mdev->cpu_mask, n_cpu_mask)) {
+		mdev->cpu_mask = n_cpu_mask;
+		mdev->cpu_mask = drbd_calc_cpu_mask(mdev);
+		mdev->receiver.reset_cpu_mask = 1;
+		mdev->asender.reset_cpu_mask = 1;
+		mdev->worker.reset_cpu_mask = 1;
+	}
+
+	kobject_uevent(&disk_to_dev(mdev->vdisk)->kobj, KOBJ_CHANGE);
+fail:
+	crypto_free_hash(csums_tfm);
+	crypto_free_hash(verify_tfm);
+	reply->ret_code = retcode;
+	return 0;
+}
+
+STATIC int drbd_nl_invalidate(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			      struct drbd_nl_cfg_reply *reply)
+{
+	int retcode;
+
+	retcode = _drbd_request_state(mdev, NS(conn, C_STARTING_SYNC_T), CS_ORDERED);
+
+	if (retcode < SS_SUCCESS && retcode != SS_NEED_CONNECTION)
+		retcode = drbd_request_state(mdev, NS(conn, C_STARTING_SYNC_T));
+
+	while (retcode == SS_NEED_CONNECTION) {
+		spin_lock_irq(&mdev->req_lock);
+		if (mdev->state.conn < C_CONNECTED)
+			retcode = _drbd_set_state(_NS(mdev, disk, D_INCONSISTENT), CS_VERBOSE, NULL);
+		spin_unlock_irq(&mdev->req_lock);
+
+		if (retcode != SS_NEED_CONNECTION)
+			break;
+
+		retcode = drbd_request_state(mdev, NS(conn, C_STARTING_SYNC_T));
+	}
+
+	reply->ret_code = retcode;
+	return 0;
+}
+
+STATIC int drbd_nl_invalidate_peer(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+				   struct drbd_nl_cfg_reply *reply)
+{
+
+	reply->ret_code = drbd_request_state(mdev, NS(conn, C_STARTING_SYNC_S));
+
+	return 0;
+}
+
+STATIC int drbd_nl_pause_sync(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			      struct drbd_nl_cfg_reply *reply)
+{
+	int retcode = NO_ERROR;
+
+	if (drbd_request_state(mdev, NS(user_isp, 1)) == SS_NOTHING_TO_DO)
+		retcode = ERR_PAUSE_IS_SET;
+
+	reply->ret_code = retcode;
+	return 0;
+}
+
+STATIC int drbd_nl_resume_sync(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			       struct drbd_nl_cfg_reply *reply)
+{
+	int retcode = NO_ERROR;
+
+	if (drbd_request_state(mdev, NS(user_isp, 0)) == SS_NOTHING_TO_DO)
+		retcode = ERR_PAUSE_IS_CLEAR;
+
+	reply->ret_code = retcode;
+	return 0;
+}
+
+STATIC int drbd_nl_suspend_io(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			      struct drbd_nl_cfg_reply *reply)
+{
+	reply->ret_code = drbd_request_state(mdev, NS(susp, 1));
+
+	return 0;
+}
+
+STATIC int drbd_nl_resume_io(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			     struct drbd_nl_cfg_reply *reply)
+{
+	reply->ret_code = drbd_request_state(mdev, NS(susp, 0));
+	return 0;
+}
+
+STATIC int drbd_nl_outdate(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			   struct drbd_nl_cfg_reply *reply)
+{
+	reply->ret_code = drbd_request_state(mdev, NS(disk, D_OUTDATED));
+	return 0;
+}
+
+STATIC int drbd_nl_get_config(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			   struct drbd_nl_cfg_reply *reply)
+{
+	unsigned short *tl;
+
+	tl = reply->tag_list;
+
+	if (inc_local(mdev)) {
+		tl = disk_conf_to_tags(mdev, &mdev->bc->dc, tl);
+		dec_local(mdev);
+	}
+
+	if (inc_net(mdev)) {
+		tl = net_conf_to_tags(mdev, mdev->net_conf, tl);
+		dec_net(mdev);
+	}
+	tl = syncer_conf_to_tags(mdev, &mdev->sync_conf, tl);
+
+	*tl++ = TT_END; /* Close the tag list */
+
+	return (int)((char *)tl - (char *)reply->tag_list);
+}
+
+STATIC int drbd_nl_get_state(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			     struct drbd_nl_cfg_reply *reply)
+{
+	unsigned short *tl = reply->tag_list;
+	union drbd_state s = mdev->state;
+	unsigned long rs_left;
+	unsigned int res;
+
+	tl = get_state_to_tags(mdev, (struct get_state *)&s, tl);
+
+	/* no local ref, no bitmap, no syncer progress. */
+	if (s.conn >= C_SYNC_SOURCE && s.conn <= C_PAUSED_SYNC_T) {
+		if (inc_local(mdev)) {
+			drbd_get_syncer_progress(mdev, &rs_left, &res);
+			*tl++ = T_sync_progress;
+			*tl++ = sizeof(int);
+			memcpy(tl, &res, sizeof(int));
+			tl = (unsigned short *)((char *)tl + sizeof(int));
+			dec_local(mdev);
+		}
+	}
+	*tl++ = TT_END; /* Close the tag list */
+
+	return (int)((char *)tl - (char *)reply->tag_list);
+}
+
+STATIC int drbd_nl_get_uuids(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			     struct drbd_nl_cfg_reply *reply)
+{
+	unsigned short *tl;
+
+	tl = reply->tag_list;
+
+	if (inc_local(mdev)) {
+		/* This is a hand crafted add tag ;) */
+		*tl++ = T_uuids;
+		*tl++ = UI_SIZE*sizeof(u64);
+		memcpy(tl, mdev->bc->md.uuid, UI_SIZE*sizeof(u64));
+		tl = (unsigned short *)((char *)tl + UI_SIZE*sizeof(u64));
+		*tl++ = T_uuids_flags;
+		*tl++ = sizeof(int);
+		memcpy(tl, &mdev->bc->md.flags, sizeof(int));
+		tl = (unsigned short *)((char *)tl + sizeof(int));
+		dec_local(mdev);
+	}
+	*tl++ = TT_END; /* Close the tag list */
+
+	return (int)((char *)tl - (char *)reply->tag_list);
+}
+
+/**
+ * drbd_nl_get_timeout_flag:
+ * Used by drbdsetup to find out which timeout value to use.
+ */
+STATIC int drbd_nl_get_timeout_flag(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+				    struct drbd_nl_cfg_reply *reply)
+{
+	unsigned short *tl;
+	char rv;
+
+	tl = reply->tag_list;
+
+	rv = mdev->state.pdsk == D_OUTDATED        ? UT_PEER_OUTDATED :
+	  test_bit(USE_DEGR_WFC_T, &mdev->flags) ? UT_DEGRADED : UT_DEFAULT;
+
+	/* This is a hand crafted add tag ;) */
+	*tl++ = T_use_degraded;
+	*tl++ = sizeof(char);
+	*((char *)tl) = rv;
+	tl = (unsigned short *)((char *)tl + sizeof(char));
+	*tl++ = TT_END;
+
+	return (int)((char *)tl - (char *)reply->tag_list);
+}
+
+STATIC int drbd_nl_start_ov(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+				    struct drbd_nl_cfg_reply *reply)
+{
+	reply->ret_code = drbd_request_state(mdev, NS(conn, C_VERIFY_S));
+
+	return 0;
+}
+
+
+STATIC int drbd_nl_new_c_uuid(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
+			      struct drbd_nl_cfg_reply *reply)
+{
+	int retcode = NO_ERROR;
+	int skip_initial_sync = 0;
+	int err;
+
+	struct new_c_uuid args;
+
+	memset(&args, 0, sizeof(struct new_c_uuid));
+	if (!new_c_uuid_from_tags(mdev, nlp->tag_list, &args)) {
+		reply->ret_code = ERR_MANDATORY_TAG;
+		return 0;
+	}
+
+	mutex_lock(&mdev->state_mutex); /* Protects us against serialized state changes. */
+
+	if (!inc_local(mdev)) {
+		retcode = ERR_NO_DISK;
+		goto out;
+	}
+
+	/* this is "skip initial sync", assumed to be clean */
+	if (mdev->state.conn == C_CONNECTED && mdev->agreed_pro_version >= 90 &&
+	    mdev->bc->md.uuid[UI_CURRENT] == UUID_JUST_CREATED && args.clear_bm) {
+		dev_info(DEV, "Preparing to skip initial sync\n");
+		skip_initial_sync = 1;
+	} else if (mdev->state.conn >= C_CONNECTED) {
+		retcode = ERR_CONNECTED;
+		goto out_dec;
+	}
+
+	drbd_uuid_set(mdev, UI_BITMAP, 0); /* Rotate UI_BITMAP to History 1, etc... */
+	drbd_uuid_new_current(mdev); /* New current, previous to UI_BITMAP */
+
+	if (args.clear_bm) {
+		err = drbd_bitmap_io(mdev, &drbd_bmio_clear_n_write, "clear_n_write from new_c_uuid");
+		if (err) {
+			dev_err(DEV, "Writing bitmap failed with %d\n", err);
+			retcode = ERR_IO_MD_DISK;
+		}
+		if (skip_initial_sync) {
+			drbd_send_uuids_skip_initial_sync(mdev);
+			_drbd_uuid_set(mdev, UI_BITMAP, 0);
+			spin_lock_irq(&mdev->req_lock);
+			_drbd_set_state(_NS2(mdev, disk, D_UP_TO_DATE, pdsk, D_UP_TO_DATE),
+					CS_VERBOSE, NULL);
+			spin_unlock_irq(&mdev->req_lock);
+		}
+	}
+
+	drbd_md_sync(mdev);
+out_dec:
+	dec_local(mdev);
+out:
+	mutex_unlock(&mdev->state_mutex);
+
+	reply->ret_code = retcode;
+	return 0;
+}
+
+STATIC struct drbd_conf *ensure_mdev(struct drbd_nl_cfg_req *nlp)
+{
+	struct drbd_conf *mdev;
+
+	if (nlp->drbd_minor >= minor_count)
+		return NULL;
+
+	mdev = minor_to_mdev(nlp->drbd_minor);
+
+	if (!mdev && (nlp->flags & DRBD_NL_CREATE_DEVICE)) {
+		struct gendisk *disk = NULL;
+		mdev = drbd_new_device(nlp->drbd_minor);
+
+		spin_lock_irq(&drbd_pp_lock);
+		if (minor_table[nlp->drbd_minor] == NULL) {
+			minor_table[nlp->drbd_minor] = mdev;
+			disk = mdev->vdisk;
+			mdev = NULL;
+		} /* else: we lost the race */
+		spin_unlock_irq(&drbd_pp_lock);
+
+		if (disk) /* we won the race above */
+			/* in case we ever add a drbd_delete_device(),
+			 * don't forget the del_gendisk! */
+			add_disk(disk);
+		else /* we lost the race above */
+			drbd_free_mdev(mdev);
+
+		mdev = minor_to_mdev(nlp->drbd_minor);
+	}
+
+	return mdev;
+}
+
+struct cn_handler_struct {
+	int (*function)(struct drbd_conf *,
+			 struct drbd_nl_cfg_req *,
+			 struct drbd_nl_cfg_reply *);
+	int reply_body_size;
+};
+
+static struct cn_handler_struct cnd_table[] = {
+	[ P_primary ]		= { &drbd_nl_primary,		0 },
+	[ P_secondary ]		= { &drbd_nl_secondary,		0 },
+	[ P_disk_conf ]		= { &drbd_nl_disk_conf,		0 },
+	[ P_detach ]		= { &drbd_nl_detach,		0 },
+	[ P_net_conf ]		= { &drbd_nl_net_conf,		0 },
+	[ P_disconnect ]	= { &drbd_nl_disconnect,	0 },
+	[ P_resize ]		= { &drbd_nl_resize,		0 },
+	[ P_syncer_conf ]	= { &drbd_nl_syncer_conf,	0 },
+	[ P_invalidate ]	= { &drbd_nl_invalidate,	0 },
+	[ P_invalidate_peer ]	= { &drbd_nl_invalidate_peer,	0 },
+	[ P_pause_sync ]	= { &drbd_nl_pause_sync,	0 },
+	[ P_resume_sync ]	= { &drbd_nl_resume_sync,	0 },
+	[ P_suspend_io ]	= { &drbd_nl_suspend_io,	0 },
+	[ P_resume_io ]		= { &drbd_nl_resume_io,		0 },
+	[ P_outdate ]		= { &drbd_nl_outdate,		0 },
+	[ P_get_config ]	= { &drbd_nl_get_config,
+				    sizeof(struct syncer_conf_tag_len_struct) +
+				    sizeof(struct disk_conf_tag_len_struct) +
+				    sizeof(struct net_conf_tag_len_struct) },
+	[ P_get_state ]		= { &drbd_nl_get_state,
+				    sizeof(struct get_state_tag_len_struct) +
+				    sizeof(struct sync_progress_tag_len_struct)	},
+	[ P_get_uuids ]		= { &drbd_nl_get_uuids,
+				    sizeof(struct get_uuids_tag_len_struct) },
+	[ P_get_timeout_flag ]	= { &drbd_nl_get_timeout_flag,
+				    sizeof(struct get_timeout_flag_tag_len_struct)},
+	[ P_start_ov ]		= { &drbd_nl_start_ov,		0 },
+	[ P_new_c_uuid ]	= { &drbd_nl_new_c_uuid,	0 },
+};
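+/* Dispatch sketch, for illustration: a request with packet_type P_get_state
+ * is routed to drbd_nl_get_state(), and its reply buffer is pre-sized for the
+ * get_state and sync_progress tags listed above; handlers with a
+ * reply_body_size of 0 only return a ret_code. */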
+
+STATIC void drbd_connector_callback(void *data)
+{
+	struct cn_msg *req = data;
+	struct drbd_nl_cfg_req *nlp = (struct drbd_nl_cfg_req *)req->data;
+	struct cn_handler_struct *cm;
+	struct cn_msg *cn_reply;
+	struct drbd_nl_cfg_reply *reply;
+	struct drbd_conf *mdev;
+	int retcode, rr;
+	int reply_size = sizeof(struct cn_msg)
+		+ sizeof(struct drbd_nl_cfg_reply)
+		+ sizeof(short int);
+
+	if (!try_module_get(THIS_MODULE)) {
+		printk(KERN_ERR "drbd: try_module_get() failed!\n");
+		return;
+	}
+
+	mdev = ensure_mdev(nlp);
+	if (!mdev) {
+		retcode = ERR_MINOR_INVALID;
+		goto fail;
+	}
+
+	trace_drbd_netlink(data, 1);
+
+	if (nlp->packet_type >= P_nl_after_last_packet) {
+		retcode = ERR_PACKET_NR;
+		goto fail;
+	}
+
+	cm = cnd_table + nlp->packet_type;
+
+	/* This may happen if packet number is 0: */
+	if (cm->function == NULL) {
+		retcode = ERR_PACKET_NR;
+		goto fail;
+	}
+
+	reply_size += cm->reply_body_size;
+
+	cn_reply = kmalloc(reply_size, GFP_KERNEL);
+	if (!cn_reply) {
+		retcode = ERR_NOMEM;
+		goto fail;
+	}
+	reply = (struct drbd_nl_cfg_reply *) cn_reply->data;
+
+	reply->packet_type =
+		cm->reply_body_size ? nlp->packet_type : P_nl_after_last_packet;
+	reply->minor = nlp->drbd_minor;
+	reply->ret_code = NO_ERROR; /* Might be modified by cm->function. */
+	/* reply->tag_list; might be modified by cm->function. */
+
+	rr = cm->function(mdev, nlp, reply);
+
+	cn_reply->id = req->id;
+	cn_reply->seq = req->seq;
+	cn_reply->ack = req->ack + 1;
+	cn_reply->len = sizeof(struct drbd_nl_cfg_reply) + rr;
+	cn_reply->flags = 0;
+
+	trace_drbd_netlink(cn_reply, 0);
+	rr = cn_netlink_send(cn_reply, CN_IDX_DRBD, GFP_KERNEL);
+	if (rr && rr != -ESRCH)
+		printk(KERN_INFO "drbd: cn_netlink_send()=%d\n", rr);
+
+	kfree(cn_reply);
+	module_put(THIS_MODULE);
+	return;
+ fail:
+	drbd_nl_send_reply(req, retcode);
+	module_put(THIS_MODULE);
+}
+
+static atomic_t drbd_nl_seq = ATOMIC_INIT(2); /* two. */
+
+static inline unsigned short *
+__tl_add_blob(unsigned short *tl, enum drbd_tags tag, const void *data,
+	int len, int nul_terminated)
+{
+	/* clamp to this tag's maximum length, so we never copy more */
+	int l = tag_descriptions[tag_number(tag)].max_len;
+	len = (len < l) ? len : l;
+	*tl++ = tag;
+	*tl++ = len;
+	memcpy(tl, data, len);
+	/* TODO
+	 * maybe we need to add some padding to the data stream.
+	 * otherwise we may get strange effects on architectures
+	 * that require certain data types to be strictly aligned,
+	 * because now the next "unsigned short" may be misaligned. */
+	tl = (unsigned short *)((char *)tl + len);
+	if (nul_terminated)
+		*((char *)tl - 1) = 0;
+	return tl;
+}
+
+static inline unsigned short *
+tl_add_blob(unsigned short *tl, enum drbd_tags tag, const void *data, int len)
+{
+	return __tl_add_blob(tl, tag, data, len, 0);
+}
+
+static inline unsigned short *
+tl_add_str(unsigned short *tl, enum drbd_tags tag, const char *str)
+{
+	return __tl_add_blob(tl, tag, str, strlen(str)+1, 0);
+}
+
+static inline unsigned short *
+tl_add_int(unsigned short *tl, enum drbd_tags tag, const void *val)
+{
+	switch (tag_type(tag)) {
+	case TT_INTEGER:
+		*tl++ = tag;
+		*tl++ = sizeof(int);
+		*(int *)tl = *(int *)val;
+		tl = (unsigned short *)((char *)tl + sizeof(int));
+		break;
+	case TT_INT64:
+		*tl++ = tag;
+		*tl++ = sizeof(u64);
+		*(u64 *)tl = *(u64 *)val;
+		tl = (unsigned short *)((char *)tl + sizeof(u64));
+		break;
+	default:
+		/* someone did something stupid. */
+		;
+	}
+	return tl;
+}
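+/* The helpers above build the same flat tag stream the netlink replies use:
+ * a sequence of [tag:u16][length:u16][payload bytes], closed by the caller
+ * with TT_END.  E.g. tl_add_int(tl, T_ee_sector, &sector) emits the tag,
+ * a length of sizeof(u64) or sizeof(int) depending on tag_type(), and the
+ * raw value. */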
+
+void drbd_bcast_state(struct drbd_conf *mdev, union drbd_state state)
+{
+	char buffer[sizeof(struct cn_msg)+
+		    sizeof(struct drbd_nl_cfg_reply)+
+		    sizeof(struct get_state_tag_len_struct)+
+		    sizeof(short int)];
+	struct cn_msg *cn_reply = (struct cn_msg *) buffer;
+	struct drbd_nl_cfg_reply *reply =
+		(struct drbd_nl_cfg_reply *)cn_reply->data;
+	unsigned short *tl = reply->tag_list;
+
+	/* dev_warn(DEV, "drbd_bcast_state() got called\n"); */
+
+	tl = get_state_to_tags(mdev, (struct get_state *)&state, tl);
+	*tl++ = TT_END; /* Close the tag list */
+
+	cn_reply->id.idx = CN_IDX_DRBD;
+	cn_reply->id.val = CN_VAL_DRBD;
+
+	cn_reply->seq = atomic_add_return(1, &drbd_nl_seq);
+	cn_reply->ack = 0; /* not used here. */
+	cn_reply->len = sizeof(struct drbd_nl_cfg_reply) +
+		(int)((char *)tl - (char *)reply->tag_list);
+	cn_reply->flags = 0;
+
+	reply->packet_type = P_get_state;
+	reply->minor = mdev_to_minor(mdev);
+	reply->ret_code = NO_ERROR;
+
+	trace_drbd_netlink(cn_reply, 0);
+	cn_netlink_send(cn_reply, CN_IDX_DRBD, GFP_KERNEL);
+}
+
+void drbd_bcast_ev_helper(struct drbd_conf *mdev, char *helper_name)
+{
+	char buffer[sizeof(struct cn_msg)+
+		    sizeof(struct drbd_nl_cfg_reply)+
+		    sizeof(struct call_helper_tag_len_struct)+
+		    sizeof(short int)];
+	struct cn_msg *cn_reply = (struct cn_msg *) buffer;
+	struct drbd_nl_cfg_reply *reply =
+		(struct drbd_nl_cfg_reply *)cn_reply->data;
+	unsigned short *tl = reply->tag_list;
+	int str_len;
+
+	/* dev_warn(DEV, "drbd_bcast_ev_helper() got called\n"); */
+
+	str_len = strlen(helper_name)+1;
+	*tl++ = T_helper;
+	*tl++ = str_len;
+	memcpy(tl, helper_name, str_len);
+	tl = (unsigned short *)((char *)tl + str_len);
+	*tl++ = TT_END; /* Close the tag list */
+
+	cn_reply->id.idx = CN_IDX_DRBD;
+	cn_reply->id.val = CN_VAL_DRBD;
+
+	cn_reply->seq = atomic_add_return(1, &drbd_nl_seq);
+	cn_reply->ack = 0; /* not used here. */
+	cn_reply->len = sizeof(struct drbd_nl_cfg_reply) +
+		(int)((char *)tl - (char *)reply->tag_list);
+	cn_reply->flags = 0;
+
+	reply->packet_type = P_call_helper;
+	reply->minor = mdev_to_minor(mdev);
+	reply->ret_code = NO_ERROR;
+
+	trace_drbd_netlink(cn_reply, 0);
+	cn_netlink_send(cn_reply, CN_IDX_DRBD, GFP_KERNEL);
+}
+
+void drbd_bcast_ee(struct drbd_conf *mdev,
+		const char *reason, const int dgs,
+		const char* seen_hash, const char* calc_hash,
+		const struct drbd_epoch_entry* e)
+{
+	struct cn_msg *cn_reply;
+	struct drbd_nl_cfg_reply *reply;
+	struct bio_vec *bvec;
+	unsigned short *tl;
+	int i;
+
+	if (!e)
+		return;
+	if (!reason || !reason[0])
+		return;
+
+	/* apparently we have to memcpy twice, first to prepare the data for the
+	 * struct cn_msg, then within cn_netlink_send from the cn_msg to the
+	 * netlink skb. */
+	cn_reply = kmalloc(
+		sizeof(struct cn_msg)+
+		sizeof(struct drbd_nl_cfg_reply)+
+		sizeof(struct dump_ee_tag_len_struct)+
+		sizeof(short int)
+		, GFP_KERNEL);
+
+	if (!cn_reply) {
+		dev_err(DEV, "could not kmalloc buffer for drbd_bcast_ee, sector %llu, size %u\n",
+				(unsigned long long)e->sector, e->size);
+		return;
+	}
+
+	reply = (struct drbd_nl_cfg_reply*)cn_reply->data;
+	tl = reply->tag_list;
+
+	tl = tl_add_str(tl, T_dump_ee_reason, reason);
+	tl = tl_add_blob(tl, T_seen_digest, seen_hash, dgs);
+	tl = tl_add_blob(tl, T_calc_digest, calc_hash, dgs);
+	tl = tl_add_int(tl, T_ee_sector, &e->sector);
+	tl = tl_add_int(tl, T_ee_block_id, &e->block_id);
+
+	*tl++ = T_ee_data;
+	*tl++ = e->size;
+
+	__bio_for_each_segment(bvec, e->private_bio, i, 0) {
+		void *d = kmap(bvec->bv_page);
+		memcpy(tl, d + bvec->bv_offset, bvec->bv_len);
+		kunmap(bvec->bv_page);
+		tl = (unsigned short *)((char *)tl + bvec->bv_len);
+	}
+	*tl++ = TT_END; /* Close the tag list */
+
+	cn_reply->id.idx = CN_IDX_DRBD;
+	cn_reply->id.val = CN_VAL_DRBD;
+
+	cn_reply->seq = atomic_add_return(1, &drbd_nl_seq);
+	cn_reply->ack = 0; /* not used here. */
+	cn_reply->len = sizeof(struct drbd_nl_cfg_reply) +
+		(int)((char *)tl - (char *)reply->tag_list);
+	cn_reply->flags = 0;
+
+	reply->packet_type = P_dump_ee;
+	reply->minor = mdev_to_minor(mdev);
+	reply->ret_code = NO_ERROR;
+
+	trace_drbd_netlink(cn_reply, 0);
+	cn_netlink_send(cn_reply, CN_IDX_DRBD, GFP_KERNEL);
+	kfree(cn_reply);
+}
+
+void drbd_bcast_sync_progress(struct drbd_conf *mdev)
+{
+	char buffer[sizeof(struct cn_msg)+
+		    sizeof(struct drbd_nl_cfg_reply)+
+		    sizeof(struct sync_progress_tag_len_struct)+
+		    sizeof(short int)];
+	struct cn_msg *cn_reply = (struct cn_msg *) buffer;
+	struct drbd_nl_cfg_reply *reply =
+		(struct drbd_nl_cfg_reply *)cn_reply->data;
+	unsigned short *tl = reply->tag_list;
+	unsigned long rs_left;
+	unsigned int res;
+
+	/* no local ref, no bitmap, no syncer progress, no broadcast. */
+	if (!inc_local(mdev))
+		return;
+	drbd_get_syncer_progress(mdev, &rs_left, &res);
+	dec_local(mdev);
+
+	*tl++ = T_sync_progress;
+	*tl++ = sizeof(int);
+	memcpy(tl, &res, sizeof(int));
+	tl = (unsigned short *)((char *)tl + sizeof(int));
+	*tl++ = TT_END; /* Close the tag list */
+
+	cn_reply->id.idx = CN_IDX_DRBD;
+	cn_reply->id.val = CN_VAL_DRBD;
+
+	cn_reply->seq = atomic_add_return(1, &drbd_nl_seq);
+	cn_reply->ack = 0; /* not used here. */
+	cn_reply->len = sizeof(struct drbd_nl_cfg_reply) +
+		(int)((char *)tl - (char *)reply->tag_list);
+	cn_reply->flags = 0;
+
+	reply->packet_type = P_sync_progress;
+	reply->minor = mdev_to_minor(mdev);
+	reply->ret_code = NO_ERROR;
+
+	trace_drbd_netlink(cn_reply, 0);
+	cn_netlink_send(cn_reply, CN_IDX_DRBD, GFP_KERNEL);
+}
+
+int __init drbd_nl_init(void)
+{
+	static struct cb_id cn_id_drbd;
+	int err, try = 10;
+
+	cn_id_drbd.val = CN_VAL_DRBD;
+	do {
+		cn_id_drbd.idx = cn_idx;
+		err = cn_add_callback(&cn_id_drbd, "cn_drbd", &drbd_connector_callback);
+		if (!err)
+			break;
+		cn_idx = (cn_idx + CN_IDX_STEP);
+	} while (try--);
+
+	if (err) {
+		printk(KERN_ERR "drbd: cn_drbd failed to register\n");
+		return err;
+	}
+
+	return 0;
+}
+
+void drbd_nl_cleanup(void)
+{
+	static struct cb_id cn_id_drbd;
+
+	cn_id_drbd.idx = cn_idx;
+	cn_id_drbd.val = CN_VAL_DRBD;
+
+	cn_del_callback(&cn_id_drbd);
+}
+
+void drbd_nl_send_reply(struct cn_msg *req, int ret_code)
+{
+	char buffer[sizeof(struct cn_msg)+sizeof(struct drbd_nl_cfg_reply)];
+	struct cn_msg *cn_reply = (struct cn_msg *) buffer;
+	struct drbd_nl_cfg_reply *reply =
+		(struct drbd_nl_cfg_reply *)cn_reply->data;
+	int rr;
+
+	cn_reply->id = req->id;
+
+	cn_reply->seq = req->seq;
+	cn_reply->ack = req->ack + 1;
+	cn_reply->len = sizeof(struct drbd_nl_cfg_reply);
+	cn_reply->flags = 0;
+
+	reply->minor = ((struct drbd_nl_cfg_req *)req->data)->drbd_minor;
+	reply->ret_code = ret_code;
+
+	trace_drbd_netlink(cn_reply, 0);
+	rr = cn_netlink_send(cn_reply, CN_IDX_DRBD, GFP_KERNEL);
+	if (rr && rr != -ESRCH)
+		printk(KERN_INFO "drbd: cn_netlink_send()=%d\n", rr);
+}
+

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 07/16] DRBD: internal_data_structures
  2009-04-30 11:26           ` [PATCH 06/16] DRBD: userspace_interface Philipp Reisner
@ 2009-04-30 11:26             ` Philipp Reisner
  2009-04-30 11:26               ` [PATCH 08/16] DRBD: main Philipp Reisner
  0 siblings, 1 reply; 90+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg, Philipp Reisner

The big "struct drbd_conf". It actually describes one DRBD device.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

---
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
new file mode 100644
index 0000000..175de11
--- /dev/null
+++ b/drivers/block/drbd/drbd_int.h
@@ -0,0 +1,2219 @@
+/*
+  drbd_int.h
+
+  This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+  Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.
+  Copyright (C) 1999-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+  Copyright (C) 2002-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+
+  drbd is free software; you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation; either version 2, or (at your option)
+  any later version.
+
+  drbd is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with drbd; see the file COPYING.  If not, write to
+  the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+
+*/
+
+#ifndef _DRBD_INT_H
+#define _DRBD_INT_H
+
+#include <linux/compiler.h>
+#include <linux/types.h>
+#include <linux/version.h>
+#include <linux/list.h>
+#include <linux/sched.h>
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/crypto.h>
+#include <linux/tcp.h>
+#include <linux/mutex.h>
+#include <linux/major.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <net/tcp.h>
+#include "lru_cache.h"
+
+#ifdef __CHECKER__
+# define __protected_by(x)       __attribute__((require_context(x,1,999,"rdwr")))
+# define __protected_read_by(x)  __attribute__((require_context(x,1,999,"read")))
+# define __protected_write_by(x) __attribute__((require_context(x,1,999,"write")))
+# define __must_hold(x)       __attribute__((context(x,1,1), require_context(x,1,999,"call")))
+#else
+# define __protected_by(x)
+# define __protected_read_by(x)
+# define __protected_write_by(x)
+# define __must_hold(x)
+#endif
+
+#define __no_warn(lock, stmt) do { __acquire(lock); stmt; __release(lock); } while (0)
+
+/* module parameter, defined in drbd_main.c */
+extern unsigned int minor_count;
+extern int disable_sendpage;
+extern int allow_oos;
+extern unsigned int cn_idx;
+
+#ifdef DRBD_ENABLE_FAULTS
+extern int enable_faults;
+extern int fault_rate;
+extern int fault_devs;
+#endif
+
+extern char usermode_helper[];
+
+
+#ifndef TRUE
+#define TRUE 1
+#endif
+#ifndef FALSE
+#define FALSE 0
+#endif
+
+/* I don't remember why XCPU ...
+ * This is used to wake the asender,
+ * and to interrupt the sending task
+ * on disconnect.
+ */
+#define DRBD_SIG SIGXCPU
+
+/* This is used to stop/restart our threads.
+ * Cannot use SIGTERM nor SIGKILL, since these
+ * are sent out by init on runlevel changes.
+ * I choose SIGHUP for now.
+ */
+#define DRBD_SIGKILL SIGHUP
+
+/* All EEs on the free list should have ID_VACANT (== 0)
+ * freshly allocated EEs get !ID_VACANT (== 1)
+ * so if it says "cannot dereference null pointer at address 0x00000001",
+ * it is most likely one of these :( */
+
+#define ID_IN_SYNC      (4711ULL)
+#define ID_OUT_OF_SYNC  (4712ULL)
+
+#define ID_SYNCER (-1ULL)
+#define ID_VACANT 0
+#define is_syncer_block_id(id) ((id) == ID_SYNCER)
+
+struct drbd_conf;
+
+#ifdef DBG_ALL_SYMBOLS
+# define STATIC
+#else
+# define STATIC static
+#endif
+
+/*
+ * Some Message Macros
+ *************************/
+
+#define DUMPP(A)   dev_err(DEV, #A " = %p in %s:%d\n", (A), __FILE__, __LINE__);
+#define DUMPLU(A)  dev_err(DEV, #A " = %lu in %s:%d\n", (unsigned long)(A), __FILE__, __LINE__);
+#define DUMPLLU(A) dev_err(DEV, #A " = %llu in %s:%d\n", (unsigned long long)(A), __FILE__, __LINE__);
+#define DUMPLX(A)  dev_err(DEV, #A " = %lx in %s:%d\n", (A), __FILE__, __LINE__);
+#define DUMPI(A)   dev_err(DEV, #A " = %d in %s:%d\n", (int)(A), __FILE__, __LINE__);
+
+
+/* to shorten dev_warn(DEV, "msg"); and relatives statements */
+#define DEV (disk_to_dev(mdev->vdisk))
+
+#define D_ASSERT(exp)	if (!(exp)) \
+	 dev_err(DEV, "ASSERT( " #exp " ) in %s:%d\n", __FILE__, __LINE__)
+
+#define ERR_IF(exp) if (({				\
+	int _b = (exp) != 0;				\
+	if (_b) dev_err(DEV, "%s: (%s) in %s:%d\n",		\
+		__func__, #exp, __FILE__, __LINE__);	\
+	 _b;						\
+	}))
+
+/* Defines to control fault insertion */
+enum {
+    DRBD_FAULT_MD_WR = 0,	/* meta data write */
+    DRBD_FAULT_MD_RD,		/*           read  */
+    DRBD_FAULT_RS_WR,		/* resync          */
+    DRBD_FAULT_RS_RD,
+    DRBD_FAULT_DT_WR,		/* data            */
+    DRBD_FAULT_DT_RD,
+    DRBD_FAULT_DT_RA,		/* data read ahead */
+    DRBD_FAULT_BM_ALLOC,        /* bitmap allocation */
+    DRBD_FAULT_AL_EE,		/* alloc ee */
+
+    DRBD_FAULT_MAX,
+};
+
+extern void trace_drbd_resync(struct drbd_conf *mdev, int level, const char *fmt, ...);
+
+#ifdef DRBD_ENABLE_FAULTS
+extern unsigned int
+_drbd_insert_fault(struct drbd_conf *mdev, unsigned int type);
+static inline int
+drbd_insert_fault(struct drbd_conf *mdev, unsigned int type) {
+    return fault_rate &&
+	    (enable_faults & (1<<type)) &&
+	    _drbd_insert_fault(mdev, type);
+}
+#define FAULT_ACTIVE(_m, _t) (drbd_insert_fault((_m), (_t)))
+
+#else
+#define FAULT_ACTIVE(_m, _t) (0)
+#endif
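+/* Illustration (using the module parameters declared above): setting
+ * enable_faults = 1 << DRBD_FAULT_DT_WR together with a non-zero fault_rate
+ * makes FAULT_ACTIVE(mdev, DRBD_FAULT_DT_WR) report faults for data writes
+ * only; all other fault types stay disabled. */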
+
+/* integer division, round _UP_ to the next integer */
+#define div_ceil(A, B) ((A)/(B) + ((A)%(B) ? 1 : 0))
+/* usual integer division */
+#define div_floor(A, B) ((A)/(B))
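+/* e.g. div_ceil(7, 4) == 2 while div_floor(7, 4) == 1 */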
+
+/* drbd_meta-data.c (still in drbd_main.c) */
+/* 4th incarnation of the disk layout. */
+#define DRBD_MD_MAGIC (DRBD_MAGIC+4)
+
+extern struct drbd_conf **minor_table;
+extern struct ratelimit_state drbd_ratelimit_state;
+
+/***
+ * on the wire
+ *********************************************************************/
+
+enum drbd_packets {
+	/* receiver (data socket) */
+	P_DATA		      = 0x00,
+	P_DATA_REPLY	      = 0x01, /* Response to P_DATA_REQUEST */
+	P_RS_DATA_REPLY	      = 0x02, /* Response to P_RS_DATA_REQUEST */
+	P_BARRIER	      = 0x03,
+	P_BITMAP	      = 0x04,
+	P_BECOME_SYNC_TARGET  = 0x05,
+	P_BECOME_SYNC_SOURCE  = 0x06,
+	P_UNPLUG_REMOTE	      = 0x07, /* Used at various times to hint the peer */
+	P_DATA_REQUEST	      = 0x08, /* Used to ask for a data block */
+	P_RS_DATA_REQUEST     = 0x09, /* Used to ask for a data block for resync */
+	P_SYNC_PARAM	      = 0x0a,
+	P_PROTOCOL	      = 0x0b,
+	P_UUIDS		      = 0x0c,
+	P_SIZES		      = 0x0d,
+	P_STATE		      = 0x0e,
+	P_SYNC_UUID	      = 0x0f,
+	P_AUTH_CHALLENGE      = 0x10,
+	P_AUTH_RESPONSE	      = 0x11,
+	P_STATE_CHG_REQ	      = 0x12,
+
+	/* asender (meta socket) */
+	P_PING		      = 0x13,
+	P_PING_ACK	      = 0x14,
+	P_RECV_ACK	      = 0x15, /* Used in protocol B */
+	P_WRITE_ACK	      = 0x16, /* Used in protocol C */
+	P_RS_WRITE_ACK	      = 0x17, /* Is a P_WRITE_ACK, additionally call set_in_sync(). */
+	P_DISCARD_ACK	      = 0x18, /* Used in proto C, two-primaries conflict detection */
+	P_NEG_ACK	      = 0x19, /* Sent if local disk is unusable */
+	P_NEG_DREPLY	      = 0x1a, /* Local disk is broken... */
+	P_NEG_RS_DREPLY	      = 0x1b, /* Local disk is broken... */
+	P_BARRIER_ACK	      = 0x1c,
+	P_STATE_CHG_REPLY     = 0x1d,
+
+	/* "new" commands, no longer fitting into the ordering scheme above */
+
+	P_OV_REQUEST	      = 0x1e, /* data socket */
+	P_OV_REPLY	      = 0x1f,
+	P_OV_RESULT	      = 0x20, /* meta socket */
+	P_CSUM_RS_REQUEST     = 0x21, /* data socket */
+	P_RS_IS_IN_SYNC	      = 0x22, /* meta socket */
+	P_SYNC_PARAM89	      = 0x23, /* data socket, protocol version 89 replacement for P_SYNC_PARAM */
+	P_COMPRESSED_BITMAP   = 0x24, /* compressed or otherwise encoded bitmap transfer */
+
+	P_MAX_CMD	      = 0x25,
+	P_MAY_IGNORE	      = 0x100, /* Flag to test if (cmd > P_MAY_IGNORE) ... */
+	P_MAX_OPT_CMD	      = 0x101,
+
+	/* special command ids for handshake */
+
+	P_HAND_SHAKE_M	      = 0xfff1, /* First Packet on the MetaSock */
+	P_HAND_SHAKE_S	      = 0xfff2, /* First Packet on the Socket */
+
+	P_HAND_SHAKE	      = 0xfffe	/* FIXED for the next century! */
+};
+
+static inline const char *cmdname(enum drbd_packets cmd)
+{
+	/* THINK may need to become several global tables
+	 * when we want to support more than
+	 * one PRO_VERSION */
+	static const char *cmdnames[] = {
+		[P_DATA]	        = "Data",
+		[P_DATA_REPLY]	        = "DataReply",
+		[P_RS_DATA_REPLY]	= "RSDataReply",
+		[P_BARRIER]	        = "Barrier",
+		[P_BITMAP]	        = "ReportBitMap",
+		[P_BECOME_SYNC_TARGET]  = "BecomeSyncTarget",
+		[P_BECOME_SYNC_SOURCE]  = "BecomeSyncSource",
+		[P_UNPLUG_REMOTE]	= "UnplugRemote",
+		[P_DATA_REQUEST]	= "DataRequest",
+		[P_RS_DATA_REQUEST]     = "RSDataRequest",
+		[P_SYNC_PARAM]	        = "SyncParam",
+		[P_SYNC_PARAM89]	= "SyncParam89",
+		[P_PROTOCOL]            = "ReportProtocol",
+		[P_UUIDS]	        = "ReportUUIDs",
+		[P_SIZES]	        = "ReportSizes",
+		[P_STATE]	        = "ReportState",
+		[P_SYNC_UUID]           = "ReportSyncUUID",
+		[P_AUTH_CHALLENGE]      = "AuthChallenge",
+		[P_AUTH_RESPONSE]	= "AuthResponse",
+		[P_PING]		= "Ping",
+		[P_PING_ACK]	        = "PingAck",
+		[P_RECV_ACK]	        = "RecvAck",
+		[P_WRITE_ACK]	        = "WriteAck",
+		[P_RS_WRITE_ACK]	= "RSWriteAck",
+		[P_DISCARD_ACK]	        = "DiscardAck",
+		[P_NEG_ACK]	        = "NegAck",
+		[P_NEG_DREPLY]	        = "NegDReply",
+		[P_NEG_RS_DREPLY]	= "NegRSDReply",
+		[P_BARRIER_ACK]	        = "BarrierAck",
+		[P_STATE_CHG_REQ]       = "StateChgRequest",
+		[P_STATE_CHG_REPLY]     = "StateChgReply",
+		[P_OV_REQUEST]          = "OVRequest",
+		[P_OV_REPLY]            = "OVReply",
+		[P_OV_RESULT]           = "OVResult",
+		[P_MAX_CMD]	        = NULL,
+	};
+
+	if (cmd == P_HAND_SHAKE_M)
+		return "HandShakeM";
+	if (cmd == P_HAND_SHAKE_S)
+		return "HandShakeS";
+	if (cmd == P_HAND_SHAKE)
+		return "HandShake";
+	if (cmd >= P_MAX_CMD)
+		return "Unknown";
+	return cmdnames[cmd];
+}
+
+/* for sending/receiving the bitmap,
+ * possibly in some encoding scheme */
+struct bm_xfer_ctx {
+	/* "const"
+	 * stores total bits and long words
+	 * of the bitmap, so we don't need to
+	 * call the accessor functions over and again. */
+	unsigned long bm_bits;
+	unsigned long bm_words;
+	/* during xfer, current position within the bitmap */
+	unsigned long bit_offset;
+	unsigned long word_offset;
+
+	/* statistics; index: (h->command == P_BITMAP) */
+	unsigned packets[2];
+	unsigned bytes[2];
+};
+
+extern void INFO_bm_xfer_stats(struct drbd_conf *mdev,
+		const char *direction, struct bm_xfer_ctx *c);
+
+static inline void bm_xfer_ctx_bit_to_word_offset(struct bm_xfer_ctx *c)
+{
+	/* word_offset counts "native long words" (32 or 64 bit),
+	 * aligned at 64 bit.
+	 * Encoded packet may end at an unaligned bit offset.
+	 * In case a fallback clear text packet is transmitted in
+	 * between, we adjust this offset back to the last 64bit
+	 * aligned "native long word", which makes coding and decoding
+	 * the plain text bitmap much more convenient.  */
+#if BITS_PER_LONG == 64
+	c->word_offset = c->bit_offset >> 6;
+#elif BITS_PER_LONG == 32
+	c->word_offset = c->bit_offset >> 5;
+	c->word_offset &= ~(1UL);
+#else
+# error "unsupported BITS_PER_LONG"
+#endif
+}
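+/* Worked example: bit_offset 100 maps to 64bit word 1 on a 64 bit arch
+ * (100 >> 6), and on a 32 bit arch to 32bit word 3, rounded down to 2
+ * (100 >> 5 == 3, &= ~1), i.e. both end up at the long word holding bit 64. */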
+
+/* This is the layout for a packet on the wire.
+ * The byteorder is the network byte order.
+ *     (except block_id and barrier fields.
+ *	these are pointers to local structs
+ *	and have no relevance for the partner,
+ *	which just echoes them as received.)
+ *
+ * NOTE that the payload starts at a long aligned offset,
+ * regardless of 32 or 64 bit arch!
+ */
+struct p_header {
+	u32	  magic;
+	u16	  command;
+	u16	  length;	/* bytes of data after this header */
+	u8	  payload[0];
+} __attribute((packed));
+/* 8 bytes. packet FIXED for the next century! */
+
+/*
+ * short commands, packets without payload, plain p_header:
+ *   P_PING
+ *   P_PING_ACK
+ *   P_BECOME_SYNC_TARGET
+ *   P_BECOME_SYNC_SOURCE
+ *   P_UNPLUG_REMOTE
+ */
+
+/*
+ * commands with out-of-struct payload:
+ *   P_BITMAP    (no additional fields)
+ *   P_DATA, P_DATA_REPLY (see p_data)
+ *   P_COMPRESSED_BITMAP (see receive_compressed_bitmap)
+ */
+
+/* these defines must not be changed without changing the protocol version */
+#define DP_HARDBARRIER	      1
+#define DP_RW_SYNC	      2
+#define DP_MAY_SET_IN_SYNC    4
+
+struct p_data {
+	struct p_header head;
+	u64	    sector;    /* 64 bits sector number */
+	u64	    block_id;  /* to identify the request in protocol B&C */
+	u32	    seq_num;
+	u32	    dp_flags;
+} __attribute((packed));
+
+/*
+ * commands which share a struct:
+ *  p_block_ack:
+ *   P_RECV_ACK (proto B), P_WRITE_ACK (proto C),
+ *   P_DISCARD_ACK (proto C, two-primaries conflict detection)
+ *  p_block_req:
+ *   P_DATA_REQUEST, P_RS_DATA_REQUEST
+ */
+struct p_block_ack {
+	struct p_header head;
+	u64	    sector;
+	u64	    block_id;
+	u32	    blksize;
+	u32	    seq_num;
+} __attribute((packed));
+
+
+struct p_block_req {
+	struct p_header head;
+	u64 sector;
+	u64 block_id;
+	u32 blksize;
+	u32 pad;	/* to multiple of 8 Byte */
+} __attribute((packed));
+
+/*
+ * commands with their own struct for additional fields:
+ *   P_HAND_SHAKE
+ *   P_BARRIER
+ *   P_BARRIER_ACK
+ *   P_SYNC_PARAM
+ *   ReportParams
+ */
+
+struct p_handshake {
+	struct p_header head;	/* 8 bytes */
+	u32 protocol_min;
+	u32 feature_flags;
+	u32 protocol_max;
+
+	/* should be more than enough for future enhancements
+	 * for now, feature_flags and the reserved array shall be zero.
+	 */
+
+	u32 _pad;
+	u64 reserverd[7];
+} __attribute((packed));
+/* 80 bytes, FIXED for the next century */
+
+struct p_barrier {
+	struct p_header head;
+	u32 barrier;	/* barrier number _handle_ only */
+	u32 pad;	/* to multiple of 8 Byte */
+} __attribute((packed));
+
+struct p_barrier_ack {
+	struct p_header head;
+	u32 barrier;
+	u32 set_size;
+} __attribute((packed));
+
+struct p_rs_param {
+	struct p_header head;
+	u32 rate;
+
+	      /* Since protocol version 88 and higher. */
+	char verify_alg[0];
+} __attribute((packed));
+
+struct p_rs_param_89 {
+	struct p_header head;
+	u32 rate;
+        /* protocol version 89: */
+	char verify_alg[SHARED_SECRET_MAX];
+	char csums_alg[SHARED_SECRET_MAX];
+} __attribute((packed));
+
+struct p_protocol {
+	struct p_header head;
+	u32 protocol;
+	u32 after_sb_0p;
+	u32 after_sb_1p;
+	u32 after_sb_2p;
+	u32 want_lose;
+	u32 two_primaries;
+
+              /* Since protocol version 87 and higher. */
+	char integrity_alg[0];
+
+} __attribute((packed));
+
+struct p_uuids {
+	struct p_header head;
+	u64 uuid[UI_EXTENDED_SIZE];
+} __attribute((packed));
+
+struct p_rs_uuid {
+	struct p_header head;
+	u64	    uuid;
+} __attribute((packed));
+
+struct p_sizes {
+	struct p_header head;
+	u64	    d_size;  /* size of disk */
+	u64	    u_size;  /* user requested size */
+	u64	    c_size;  /* current exported size */
+	u32	    max_segment_size;  /* Maximal size of a BIO */
+	u32	    queue_order_type;
+} __attribute((packed));
+
+struct p_state {
+	struct p_header head;
+	u32	    state;
+} __attribute((packed));
+
+struct p_req_state {
+	struct p_header head;
+	u32	    mask;
+	u32	    val;
+} __attribute((packed));
+
+struct p_req_state_reply {
+	struct p_header head;
+	u32	    retcode;
+} __attribute((packed));
+
+struct p_drbd06_param {
+	u64	  size;
+	u32	  state;
+	u32	  blksize;
+	u32	  protocol;
+	u32	  version;
+	u32	  gen_cnt[5];
+	u32	  bit_map_gen[5];
+} __attribute((packed));
+
+struct p_discard {
+	struct p_header head;
+	u64	    block_id;
+	u32	    seq_num;
+	u32	    pad;
+} __attribute((packed));
+
+/* Valid values for the encoding field.
+ * Bump proto version when changing this. */
+enum drbd_bitmap_code {
+	/* RLE_VLI_Bytes = 0,
+	 * and other bit variants had been defined during
+	 * algorithm evaluation. */
+	RLE_VLI_Bits = 2,
+};
+
+struct p_compressed_bm {
+	struct p_header head;
+	/* (encoding & 0x0f): actual encoding, see enum drbd_bitmap_code
+	 * (encoding & 0x80): polarity (set/unset) of first runlength
+	 * ((encoding >> 4) & 0x07): pad_bits, number of trailing zero bits
+	 * used to pad up to head.length bytes
+	 */
+	u8 encoding;
+
+	u8 code[0];
+} __attribute((packed));
+
+static inline enum drbd_bitmap_code
+DCBP_get_code(struct p_compressed_bm *p)
+{
+	return (enum drbd_bitmap_code)(p->encoding & 0x0f);
+}
+
+static inline void
+DCBP_set_code(struct p_compressed_bm *p, enum drbd_bitmap_code code)
+{
+	BUG_ON(code & ~0xf);
+	p->encoding = (p->encoding & ~0xf) | code;
+}
+
+static inline int
+DCBP_get_start(struct p_compressed_bm *p)
+{
+	return (p->encoding & 0x80) != 0;
+}
+
+static inline void
+DCBP_set_start(struct p_compressed_bm *p, int set)
+{
+	p->encoding = (p->encoding & ~0x80) | (set ? 0x80 : 0);
+}
+
+static inline int
+DCBP_get_pad_bits(struct p_compressed_bm *p)
+{
+	return (p->encoding >> 4) & 0x7;
+}
+
+static inline void
+DCBP_set_pad_bits(struct p_compressed_bm *p, int n)
+{
+	BUG_ON(n & ~0x7);
+	p->encoding = (p->encoding & (~0x7 << 4)) | (n << 4);
+}
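+/* Example: an encoding byte of 0x92 decodes as code RLE_VLI_Bits
+ * (0x92 & 0x0f == 2), pad_bits 1 ((0x92 >> 4) & 0x7), and a set first
+ * runlength (bit 0x80 is set). */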
+
+/* one bitmap packet, including the p_header,
+ * should fit within one _architecture independent_ page.
+ * So we use the fixed 4KiB page size
+ * that most architectures have used for a long time.
+ */
+#define BM_PACKET_PAYLOAD_BYTES (4096 - sizeof(struct p_header))
+#define BM_PACKET_WORDS (BM_PACKET_PAYLOAD_BYTES/sizeof(long))
+#define BM_PACKET_VLI_BYTES_MAX (4096 - sizeof(struct p_compressed_bm))
+#if (PAGE_SIZE < 4096)
+/* drbd_send_bitmap / receive_bitmap would break horribly */
+#error "PAGE_SIZE too small"
+#endif
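+/* With the 8 byte p_header this amounts to 4088 payload bytes per bitmap
+ * packet, i.e. 511 long words on 64 bit and 1022 on 32 bit. */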
+
+union p_polymorph {
+	struct p_header          header;
+	struct p_handshake       handshake;
+	struct p_data            data;
+	struct p_block_ack       block_ack;
+	struct p_barrier         barrier;
+	struct p_barrier_ack     barrier_ack;
+	struct p_rs_param_89     rs_param_89;
+	struct p_protocol        protocol;
+	struct p_sizes           sizes;
+	struct p_uuids           uuids;
+	struct p_state           state;
+	struct p_req_state       req_state;
+	struct p_req_state_reply req_state_reply;
+	struct p_block_req       block_req;
+} __attribute((packed));
+
+/**********************************************************************/
+enum drbd_thread_state {
+	None,
+	Running,
+	Exiting,
+	Restarting
+};
+
+struct drbd_thread {
+	spinlock_t t_lock;
+	struct task_struct *task;
+	struct completion stop;
+	enum drbd_thread_state t_state;
+	int (*function) (struct drbd_thread *);
+	struct drbd_conf *mdev;
+	int reset_cpu_mask;
+};
+
+static inline enum drbd_thread_state get_t_state(struct drbd_thread *thi)
+{
+	/* THINK testing the t_state seems to be uncritical in all cases
+	 * (but thread_{start,stop}), so we can read it *without* the lock.
+	 *	--lge */
+
+	smp_rmb();
+	return thi->t_state;
+}
+
+
+/*
+ * Having this as the first member of a struct provides sort of "inheritance".
+ * "derived" structs can be "drbd_queue_work()"ed.
+ * The callback should know and cast back to the descendant struct.
+ * drbd_request and drbd_epoch_entry are descendants of drbd_work.
+ */
+struct drbd_work;
+typedef int (*drbd_work_cb)(struct drbd_conf *, struct drbd_work *, int cancel);
+struct drbd_work {
+	struct list_head list;
+	drbd_work_cb cb;
+};
+
+struct drbd_tl_epoch;
+struct drbd_request {
+	struct drbd_work w;
+	struct drbd_conf *mdev;
+	struct bio *private_bio;
+	struct hlist_node colision;
+	sector_t sector;
+	unsigned int size;
+	unsigned int epoch; /* barrier_nr */
+
+	/* barrier_nr: used to check on "completion" whether this req was in
+	 * the current epoch, and we therefore have to close it,
+	 * starting a new epoch...
+	 */
+
+	/* up to here, the struct layout is identical to drbd_epoch_entry;
+	 * we might be able to use that to our advantage...  */
+
+	struct list_head tl_requests; /* ring list in the transfer log */
+	struct bio *master_bio;       /* master bio pointer */
+	unsigned long rq_state; /* see comments above _req_mod() */
+	int seq_num;
+	unsigned long start_time;
+};
+
+struct drbd_tl_epoch {
+	struct drbd_work w;
+	struct list_head requests; /* requests before */
+	struct drbd_tl_epoch *next; /* pointer to the next barrier */
+	unsigned int br_number;  /* the barriers identifier. */
+	int n_req;	/* number of requests attached before this barrier */
+};
+
+struct drbd_request;
+
+/* These Tl_epoch_entries may be on one of these lists:
+   active_ee .. data packet being written
+   sync_ee   .. syncer block being written
+   done_ee   .. block written, need to send P_WRITE_ACK
+   read_ee   .. [RS]P_DATA_REQUEST being read
+   net_ee    .. zero-copy network send in progress
+*/
+
+struct drbd_epoch {
+	struct list_head list;
+	unsigned int barrier_nr;
+	atomic_t epoch_size; /* increased on every request added. */
+	atomic_t active;     /* increased on every req. added, and dec on every finished. */
+	unsigned long flags;
+};
+
+/* drbd_epoch flag bits */
+enum {
+	DE_BARRIER_IN_NEXT_EPOCH_ISSUED,
+	DE_BARRIER_IN_NEXT_EPOCH_DONE,
+	DE_CONTAINS_A_BARRIER,
+	DE_HAVE_BARRIER_NUMBER,
+	DE_IS_FINISHING,
+};
+
+enum epoch_event {
+	EV_PUT,
+	EV_GOT_BARRIER_NR,
+	EV_BARRIER_DONE,
+	EV_BECAME_LAST,
+	EV_TRACE_FLUSH,       /* TRACE_ are not real events, only used for tracing */
+	EV_TRACE_ADD_BARRIER, /* Doing the first write as a barrier write */
+	EV_TRACE_SETTING_BI,  /* Barrier is expressed with the first write of the next epoch */
+	EV_TRACE_ALLOC,
+	EV_TRACE_FREE,
+	EV_CLEANUP = 32, /* used as flag */
+};
+
+struct drbd_epoch_entry {
+	struct drbd_work    w;
+	struct drbd_conf *mdev;
+	struct bio *private_bio;
+	struct hlist_node colision;
+	sector_t sector;
+	unsigned int size;
+	struct drbd_epoch *epoch;
+
+	/* up to here, the struct layout is identical to drbd_request;
+	 * we might be able to use that to our advantage...  */
+
+	unsigned int flags;
+	u64    block_id;
+};
+
+struct digest_info {
+	int digest_size;
+	void *digest;
+};
+
+/* ee flag bits */
+enum {
+	__EE_CALL_AL_COMPLETE_IO,
+	__EE_CONFLICT_PENDING,
+	__EE_MAY_SET_IN_SYNC,
+	__EE_IS_BARRIER,
+};
+#define EE_CALL_AL_COMPLETE_IO (1<<__EE_CALL_AL_COMPLETE_IO)
+#define EE_CONFLICT_PENDING    (1<<__EE_CONFLICT_PENDING)
+#define EE_MAY_SET_IN_SYNC     (1<<__EE_MAY_SET_IN_SYNC)
+#define EE_IS_BARRIER          (1<<__EE_IS_BARRIER)
+
+/* global flag bits */
+enum {
+	CREATE_BARRIER,		/* next P_DATA is preceded by a P_BARRIER */
+	SIGNAL_ASENDER,		/* whether asender wants to be interrupted */
+	SEND_PING,		/* whether asender should send a ping asap */
+	WORK_PENDING,		/* completion flag for drbd_disconnect */
+	STOP_SYNC_TIMER,	/* tell timer to cancel itself */
+	UNPLUG_QUEUED,		/* only relevant with kernel 2.4 */
+	UNPLUG_REMOTE,		/* sending a "UnplugRemote" could help */
+	MD_DIRTY,		/* current uuids and flags not yet on disk */
+	DISCARD_CONCURRENT,	/* Set on one node, cleared on the peer! */
+	USE_DEGR_WFC_T,		/* degr-wfc-timeout instead of wfc-timeout. */
+	CLUSTER_ST_CHANGE,	/* Cluster wide state change going on... */
+	CL_ST_CHG_SUCCESS,
+	CL_ST_CHG_FAIL,
+	CRASHED_PRIMARY,	/* This node was a crashed primary.
+				 * Gets cleared when the state.conn
+				 * goes into C_CONNECTED state. */
+	WRITE_BM_AFTER_RESYNC,	/* A kmalloc() during resync failed */
+	NO_BARRIER_SUPP,	/* underlying block device doesn't implement barriers */
+	CONSIDER_RESYNC,
+
+	MD_NO_BARRIER,		/* meta data device does not support barriers,
+				   so don't even try */
+	SUSPEND_IO,		/* suspend application io */
+	BITMAP_IO,		/* suspend application io;
+				   once no more io in flight, start bitmap io */
+	BITMAP_IO_QUEUED,       /* Started bitmap IO */
+	RESYNC_AFTER_NEG,       /* Resync after online grow after the attach&negotiate finished. */
+	NET_CONGESTED,		/* The data socket is congested */
+
+	CONFIG_PENDING,		/* serialization of (re)configuration requests.
+				 * if set, also prevents the device from dying */
+	DEVICE_DYING,		/* device became unconfigured,
+				 * but worker thread is still handling the cleanup.
+				 * reconfiguring (nl_disk_conf, nl_net_conf) is disallowed,
+				 * while this is set. */
+};
+
+struct drbd_bitmap; /* opaque for drbd_conf */
+
+/* TODO sort members for performance
+ * MAYBE group them further */
+
+/* THINK maybe we actually want to use the default "event/%s" worker threads
+ * or similar in linux 2.6, which uses per cpu data and threads.
+ *
+ * To be general, this might need a spin_lock member.
+ * For now, please use the mdev->req_lock to protect list_head,
+ * see drbd_queue_work below.
+ */
+struct drbd_work_queue {
+	struct list_head q;
+	struct semaphore s; /* producers up it, worker down()s it */
+	spinlock_t q_lock;  /* to protect the list. */
+};
+
+struct drbd_socket {
+	struct drbd_work_queue work;
+	struct mutex mutex;
+	struct socket    *socket;
+	/* this way we get our
+	 * send/receive buffers off the stack */
+	union p_polymorph sbuf;
+	union p_polymorph rbuf;
+};
+
+struct drbd_md {
+	u64 md_offset;		/* sector offset to 'super' block */
+
+	u64 la_size_sect;	/* last agreed size, unit sectors */
+	u64 uuid[UI_SIZE];
+	u64 device_uuid;
+	u32 flags;
+	u32 md_size_sect;
+
+	s32 al_offset;	/* signed relative sector offset to al area */
+	s32 bm_offset;	/* signed relative sector offset to bitmap */
+
+	/* u32 al_nr_extents;	   important for restoring the AL
+	 * is stored into  sync_conf.al_extents, which in turn
+	 * gets applied to act_log->nr_elements
+	 */
+};
+
+/* for sync_conf and other types... */
+#define NL_PACKET(name, number, fields) struct name { fields };
+#define NL_INTEGER(pn,pr,member) int member;
+#define NL_INT64(pn,pr,member) __u64 member;
+#define NL_BIT(pn,pr,member)   unsigned member:1;
+#define NL_STRING(pn,pr,member,len) unsigned char member[len]; int member ## _len;
+#include "linux/drbd_nl.h"
+
+struct drbd_backing_dev {
+	struct block_device *backing_bdev;
+	struct block_device *md_bdev;
+	struct file *lo_file;
+	struct file *md_file;
+	struct drbd_md md;
+	struct disk_conf dc; /* The user provided config... */
+	sector_t known_size; /* last known size of that backing device */
+};
+
+struct drbd_md_io {
+	struct drbd_conf *mdev;
+	struct completion event;
+	int error;
+};
+
+struct bm_io_work {
+	struct drbd_work w;
+	char *why;
+	int (*io_fn)(struct drbd_conf *mdev);
+	void (*done)(struct drbd_conf *mdev, int rv);
+};
+
+enum write_ordering_e {
+	WO_none,
+	WO_drain_io,
+	WO_bdev_flush,
+	WO_bio_barrier
+};
+
+struct drbd_conf {
+	/* things that are stored as / read from meta data on disk */
+	unsigned long flags;
+
+	/* configured by drbdsetup */
+	struct net_conf *net_conf; /* protected by inc_net() and dec_net() */
+	struct syncer_conf sync_conf;
+	struct drbd_backing_dev *bc __protected_by(local);
+
+	sector_t p_size;     /* partner's disk size */
+	struct request_queue *rq_queue;
+	struct block_device *this_bdev;
+	struct gendisk	    *vdisk;
+
+	struct drbd_socket data; /* data/barrier/cstate/parameter packets */
+	struct drbd_socket meta; /* ping/ack (metadata) packets */
+	int agreed_pro_version;  /* actually used protocol version */
+	unsigned long last_received; /* in jiffies, either socket */
+	unsigned int ko_count;
+	struct drbd_work  resync_work,
+			  unplug_work,
+			  md_sync_work;
+	struct timer_list resync_timer;
+	struct timer_list md_sync_timer;
+
+	/* Used after attach while negotiating new disk state. */
+	union drbd_state new_state_tmp;
+
+	union drbd_state state;
+	wait_queue_head_t misc_wait;
+	wait_queue_head_t state_wait;  /* upon each state change. */
+	unsigned int send_cnt;
+	unsigned int recv_cnt;
+	unsigned int read_cnt;
+	unsigned int writ_cnt;
+	unsigned int al_writ_cnt;
+	unsigned int bm_writ_cnt;
+	atomic_t ap_bio_cnt;	 /* Requests we need to complete */
+	atomic_t ap_pending_cnt; /* AP data packets on the wire, ack expected */
+	atomic_t rs_pending_cnt; /* RS request/data packets on the wire */
+	atomic_t unacked_cnt;	 /* Need to send replies for */
+	atomic_t local_cnt;	 /* Waiting for local completion */
+	atomic_t net_cnt;	 /* Users of net_conf */
+	spinlock_t req_lock;
+	struct drbd_tl_epoch *unused_spare_tle; /* for pre-allocation */
+	struct drbd_tl_epoch *newest_tle;
+	struct drbd_tl_epoch *oldest_tle;
+	struct list_head out_of_sequence_requests;
+	struct hlist_head *tl_hash;
+	unsigned int tl_hash_s;
+
+	/* blocks to sync in this run [unit BM_BLOCK_SIZE] */
+	unsigned long rs_total;
+	/* number of sync IOs that failed in this run */
+	unsigned long rs_failed;
+	/* Syncer's start time [unit jiffies] */
+	unsigned long rs_start;
+	/* cumulated time in PausedSyncX state [unit jiffies] */
+	unsigned long rs_paused;
+	/* block not up-to-date at mark [unit BM_BLOCK_SIZE] */
+	unsigned long rs_mark_left;
+	/* mark's time [unit jiffies] */
+	unsigned long rs_mark_time;
+	/* skipped because csum was equal [unit BM_BLOCK_SIZE] */
+	unsigned long rs_same_csum;
+	sector_t ov_position;
+	/* Start sector of out of sync range. */
+	sector_t ov_last_oos_start;
+	/* size of out-of-sync range in sectors. */
+	sector_t ov_last_oos_size;
+	unsigned long ov_left;
+	struct crypto_hash *csums_tfm;
+	struct crypto_hash *verify_tfm;
+
+	struct drbd_thread receiver;
+	struct drbd_thread worker;
+	struct drbd_thread asender;
+	struct drbd_bitmap *bitmap;
+	unsigned long bm_resync_fo; /* bit offset for drbd_bm_find_next */
+
+	/* Used to track operations of resync... */
+	struct lru_cache *resync;
+	/* Number of locked elements in resync LRU */
+	unsigned int resync_locked;
+	/* resync extent number waiting for application requests */
+	unsigned int resync_wenr;
+
+	int open_cnt;
+	u64 *p_uuid;
+	struct drbd_epoch *current_epoch;
+	spinlock_t epoch_lock;
+	unsigned int epochs;
+	enum write_ordering_e write_ordering;
+	struct list_head active_ee; /* IO in progress */
+	struct list_head sync_ee;   /* IO in progress */
+	struct list_head done_ee;   /* send ack */
+	struct list_head read_ee;   /* IO in progress */
+	struct list_head net_ee;    /* zero-copy network send in progress */
+	struct hlist_head *ee_hash; /* is protected by req_lock! */
+	unsigned int ee_hash_s;
+
+	/* this one is protected by ee_lock, single thread */
+	struct drbd_epoch_entry *last_write_w_barrier;
+
+	int next_barrier_nr;
+	struct hlist_head *app_reads_hash; /* is protected by req_lock */
+	struct list_head resync_reads;
+	atomic_t pp_in_use;
+	wait_queue_head_t ee_wait;
+	struct page *md_io_page;	/* one page buffer for md_io */
+	struct page *md_io_tmpp;	/* for hardsect != 512 [s390 only?] */
+	struct mutex md_io_mutex;	/* protects the md_io_buffer */
+	spinlock_t al_lock;
+	wait_queue_head_t al_wait;
+	struct lru_cache *act_log;	/* activity log */
+	unsigned int al_tr_number;
+	int al_tr_cycle;
+	int al_tr_pos;   /* position of the next transaction in the journal */
+	struct crypto_hash *cram_hmac_tfm;
+	struct crypto_hash *integrity_w_tfm; /* to be used by the worker thread */
+	struct crypto_hash *integrity_r_tfm; /* to be used by the receiver thread */
+	void *int_dig_out;
+	void *int_dig_in;
+	void *int_dig_vv;
+	wait_queue_head_t seq_wait;
+	atomic_t packet_seq;
+	unsigned int peer_seq;
+	spinlock_t peer_seq_lock;
+	unsigned int minor;
+	unsigned long comm_bm_set; /* communicated number of set bits. */
+	cpumask_t cpu_mask;
+	struct bm_io_work bm_io_work;
+	u64 ed_uuid; /* UUID of the exposed data */
+	struct mutex state_mutex;
+	char congestion_reason;  /* Why we were congested... */
+};
+
+static inline struct drbd_conf *minor_to_mdev(unsigned int minor)
+{
+	struct drbd_conf *mdev;
+
+	mdev = minor < minor_count ? minor_table[minor] : NULL;
+
+	return mdev;
+}
+
+static inline unsigned int mdev_to_minor(struct drbd_conf *mdev)
+{
+	return mdev->minor;
+}
+
+/* returns 1 if it was successful,
+ * returns 0 if there was no data socket.
+ * so wherever you are going to use the data.socket, e.g. do
+ * if (!drbd_get_data_sock(mdev))
+ *	return 0;
+ * CODE();
+ * drbd_put_data_sock(mdev);
+ */
+static inline int drbd_get_data_sock(struct drbd_conf *mdev)
+{
+	mutex_lock(&mdev->data.mutex);
+	/* drbd_disconnect() could have called drbd_free_sock()
+	 * while we were waiting in down()... */
+	if (unlikely(mdev->data.socket == NULL)) {
+		mutex_unlock(&mdev->data.mutex);
+		return 0;
+	}
+	return 1;
+}
+
+static inline void drbd_put_data_sock(struct drbd_conf *mdev)
+{
+	mutex_unlock(&mdev->data.mutex);
+}
+
+/*
+ * function declarations
+ *************************/
+
+/* drbd_main.c */
+
+enum chg_state_flags {
+	CS_HARD	= 1,
+	CS_VERBOSE = 2,
+	CS_WAIT_COMPLETE = 4,
+	CS_SERIALIZE    = 8,
+	CS_ORDERED      = CS_WAIT_COMPLETE + CS_SERIALIZE,
+};
+
+extern void drbd_init_set_defaults(struct drbd_conf *mdev);
+extern int drbd_change_state(struct drbd_conf *mdev, enum chg_state_flags f,
+			union drbd_state mask, union drbd_state val);
+extern void drbd_force_state(struct drbd_conf *, union drbd_state,
+			union drbd_state);
+extern int _drbd_request_state(struct drbd_conf *, union drbd_state,
+			union drbd_state, enum chg_state_flags);
+extern int __drbd_set_state(struct drbd_conf *, union drbd_state,
+			    enum chg_state_flags, struct completion *done);
+extern void print_st_err(struct drbd_conf *, union drbd_state,
+			union drbd_state, int);
+extern int  drbd_thread_start(struct drbd_thread *thi);
+extern void _drbd_thread_stop(struct drbd_thread *thi, int restart, int wait);
+#ifdef CONFIG_SMP
+extern void drbd_thread_current_set_cpu(struct drbd_conf *mdev);
+extern cpumask_t drbd_calc_cpu_mask(struct drbd_conf *mdev);
+#else
+#define drbd_thread_current_set_cpu(A) ({})
+#define drbd_calc_cpu_mask(A) CPU_MASK_ALL
+#endif
+extern void drbd_free_resources(struct drbd_conf *mdev);
+extern void tl_release(struct drbd_conf *mdev, unsigned int barrier_nr,
+		       unsigned int set_size);
+extern void tl_clear(struct drbd_conf *mdev);
+extern void _tl_add_barrier(struct drbd_conf *, struct drbd_tl_epoch *);
+extern void drbd_free_sock(struct drbd_conf *mdev);
+extern int drbd_send(struct drbd_conf *mdev, struct socket *sock,
+			void *buf, size_t size, unsigned msg_flags);
+extern int drbd_send_protocol(struct drbd_conf *mdev);
+extern int drbd_send_uuids(struct drbd_conf *mdev);
+extern int drbd_send_uuids_skip_initial_sync(struct drbd_conf *mdev);
+extern int drbd_send_sync_uuid(struct drbd_conf *mdev, u64 val);
+extern int drbd_send_sizes(struct drbd_conf *mdev);
+extern int _drbd_send_state(struct drbd_conf *mdev);
+extern int drbd_send_state(struct drbd_conf *mdev);
+extern int _drbd_send_cmd(struct drbd_conf *mdev, struct socket *sock,
+			enum drbd_packets cmd, struct p_header *h,
+			size_t size, unsigned msg_flags);
+#define USE_DATA_SOCKET 1
+#define USE_META_SOCKET 0
+extern int drbd_send_cmd(struct drbd_conf *mdev, int use_data_socket,
+			enum drbd_packets cmd, struct p_header *h,
+			size_t size);
+extern int drbd_send_cmd2(struct drbd_conf *mdev, enum drbd_packets cmd,
+			char *data, size_t size);
+extern int drbd_send_sync_param(struct drbd_conf *mdev, struct syncer_conf *sc);
+extern int drbd_send_b_ack(struct drbd_conf *mdev, u32 barrier_nr,
+			u32 set_size);
+extern int drbd_send_ack(struct drbd_conf *mdev, enum drbd_packets cmd,
+			struct drbd_epoch_entry *e);
+extern int drbd_send_ack_rp(struct drbd_conf *mdev, enum drbd_packets cmd,
+			struct p_block_req *rp);
+extern int drbd_send_ack_dp(struct drbd_conf *mdev, enum drbd_packets cmd,
+			struct p_data *dp);
+extern int drbd_send_ack_ex(struct drbd_conf *mdev, enum drbd_packets cmd,
+			    sector_t sector, int blksize, u64 block_id);
+extern int _drbd_send_page(struct drbd_conf *mdev, struct page *page,
+			int offset, size_t size);
+extern int drbd_send_block(struct drbd_conf *mdev, enum drbd_packets cmd,
+			   struct drbd_epoch_entry *e);
+extern int drbd_send_dblock(struct drbd_conf *mdev, struct drbd_request *req);
+extern int _drbd_send_barrier(struct drbd_conf *mdev,
+			struct drbd_tl_epoch *barrier);
+extern int drbd_send_drequest(struct drbd_conf *mdev, int cmd,
+			      sector_t sector, int size, u64 block_id);
+extern int drbd_send_drequest_csum(struct drbd_conf *mdev,
+				   sector_t sector,int size,
+				   void *digest, int digest_size,
+				   enum drbd_packets cmd);
+extern int drbd_send_ov_request(struct drbd_conf *mdev,sector_t sector,int size);
+
+extern int drbd_send_bitmap(struct drbd_conf *mdev);
+extern int _drbd_send_bitmap(struct drbd_conf *mdev);
+extern int drbd_send_sr_reply(struct drbd_conf *mdev, int retcode);
+extern void drbd_free_bc(struct drbd_backing_dev *bc);
+extern int drbd_io_error(struct drbd_conf *mdev, int forcedetach);
+extern void drbd_mdev_cleanup(struct drbd_conf *mdev);
+
+/* drbd_meta-data.c (still in drbd_main.c) */
+extern void drbd_md_sync(struct drbd_conf *mdev);
+extern int  drbd_md_read(struct drbd_conf *mdev, struct drbd_backing_dev *bdev);
+/* maybe define them below as inline? */
+extern void drbd_uuid_set(struct drbd_conf *mdev, int idx, u64 val) __must_hold(local);
+extern void _drbd_uuid_set(struct drbd_conf *mdev, int idx, u64 val) __must_hold(local);
+extern void drbd_uuid_new_current(struct drbd_conf *mdev) __must_hold(local);
+extern void _drbd_uuid_new_current(struct drbd_conf *mdev) __must_hold(local);
+extern void drbd_uuid_set_bm(struct drbd_conf *mdev, u64 val) __must_hold(local);
+extern void drbd_md_set_flag(struct drbd_conf *mdev, int flags) __must_hold(local);
+extern void drbd_md_clear_flag(struct drbd_conf *mdev, int flags)__must_hold(local);
+extern int drbd_md_test_flag(struct drbd_backing_dev *, int);
+extern void drbd_md_mark_dirty(struct drbd_conf *mdev);
+extern void drbd_queue_bitmap_io(struct drbd_conf *mdev,
+				 int (*io_fn)(struct drbd_conf *),
+				 void (*done)(struct drbd_conf *, int),
+				 char *why);
+extern int drbd_bmio_set_n_write(struct drbd_conf *mdev);
+extern int drbd_bmio_clear_n_write(struct drbd_conf *mdev);
+extern int drbd_bitmap_io(struct drbd_conf *mdev, int (*io_fn)(struct drbd_conf *), char *why);
+
+
+/* Meta data layout
+   We reserve a 128MB Block (4k aligned)
+   * either at the end of the backing device
+   * or on a separate meta data device. */
+
+#define MD_RESERVED_SECT (128LU << 11)  /* 128 MB, unit sectors */
+/* The following numbers are sectors */
+#define MD_AL_OFFSET 8	    /* 8 Sectors after start of meta area */
+#define MD_AL_MAX_SIZE 64   /* = 32 kb LOG  ~ 3776 extents ~ 14 GB Storage */
+/* Allows up to about 3.8TB */
+#define MD_BM_OFFSET (MD_AL_OFFSET + MD_AL_MAX_SIZE)
+
+/* Since the smallest IO unit is usually 512 bytes */
+#define MD_HARDSECT_B	 9
+#define MD_HARDSECT	 (1<<MD_HARDSECT_B)
+
+/* activity log */
+#define AL_EXTENTS_PT ((MD_HARDSECT-12)/8-1) /* 61 ; Extents per 512B sector */
+#define AL_EXTENT_SIZE_B 22		 /* One extent represents 4M Storage */
+#define AL_EXTENT_SIZE (1<<AL_EXTENT_SIZE_B)
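+/* In numbers, from the constants above: the reserved area is 128MiB ==
+ * 262144 sectors; the activity log occupies sectors 8..71 (32KiB), so the
+ * on-disk bitmap starts at sector MD_BM_OFFSET == 72.  One AL sector holds
+ * 61 extent slots of 4MiB each. */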
+
+#if BITS_PER_LONG == 32
+#define LN2_BPL 5
+#define cpu_to_lel(A) cpu_to_le32(A)
+#define lel_to_cpu(A) le32_to_cpu(A)
+#elif BITS_PER_LONG == 64
+#define LN2_BPL 6
+#define cpu_to_lel(A) cpu_to_le64(A)
+#define lel_to_cpu(A) le64_to_cpu(A)
+#else
+#error "LN2 of BITS_PER_LONG unknown!"
+#endif
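+/* Illustration: cpu_to_lel() stores a native long little endian, i.e. it is
+ * a no-op on little endian hosts and a 4 or 8 byte swap on big endian ones;
+ * this way byte k of the bitmap stream holds bits 8k..8k+7 regardless of
+ * the host's long size. */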
+
+/* resync bitmap */
+/* 16MB sized 'bitmap extent' to track syncer usage */
+struct bm_extent {
+	struct lc_element lce;
+	int rs_left; /* number of bits set (out of sync) in this extent. */
+	int rs_failed; /* number of failed resync requests in this extent. */
+	unsigned long flags;
+};
+
+#define BME_NO_WRITES  0  /* bm_extent.flags: no more requests on this one! */
+#define BME_LOCKED     1  /* bm_extent.flags: syncer active on this one. */
+
+/* drbd_bitmap.c */
+/*
+ * We need to store one bit for a block.
+ * Example: 1GB disk @ 4096 byte blocks ==> we need 32 KB bitmap.
+ * Bit 0 ==> local node thinks this block is binary identical on both nodes
+ * Bit 1 ==> local node thinks this block needs to be synced.
+ */
+
+#define BM_BLOCK_SIZE_B  12			 /* 4k per bit */
+#define BM_BLOCK_SIZE	 (1<<BM_BLOCK_SIZE_B)
+/* (9+3) : 512 bytes @ 8 bits; representing 16M storage
+ * per sector of on disk bitmap */
+#define BM_EXT_SIZE_B	 (BM_BLOCK_SIZE_B + MD_HARDSECT_B + 3)  /* = 24 */
+#define BM_EXT_SIZE	 (1<<BM_EXT_SIZE_B)
+
+#if (BM_EXT_SIZE_B != 24) || (BM_BLOCK_SIZE_B != 12)
+#error "HAVE YOU FIXED drbdmeta AS WELL??"
+#endif
+
+/* thus many _storage_ sectors are described by one bit */
+#define BM_SECT_TO_BIT(x)   ((x)>>(BM_BLOCK_SIZE_B-9))
+#define BM_BIT_TO_SECT(x)   ((sector_t)(x)<<(BM_BLOCK_SIZE_B-9))
+#define BM_SECT_PER_BIT     BM_BIT_TO_SECT(1)
+
+/* bit to represented kilo byte conversion */
+#define Bit2KB(bits) ((bits)<<(BM_BLOCK_SIZE_B-10))
+
+/* in which _bitmap_ extent (resp. sector) the bit for a certain
+ * _storage_ sector is located */
+#define BM_SECT_TO_EXT(x)   ((x)>>(BM_EXT_SIZE_B-9))
+
+/* how many _storage_ sectors we have per bitmap sector */
+#define BM_EXT_TO_SECT(x)   ((sector_t)(x) << (BM_EXT_SIZE_B-9))
+#define BM_SECT_PER_EXT     BM_EXT_TO_SECT(1)
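+/* Worked example: with 4KiB per bit, one bit covers BM_SECT_PER_BIT == 8
+ * storage sectors, and one 512 byte on-disk bitmap sector (512*8 bits)
+ * covers BM_SECT_PER_EXT == 32768 sectors == 16MiB of storage; storage
+ * sector 123456 is tracked by bit 123456 >> 3 == 15432. */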
+
+/* in one sector of the bitmap, we have this many activity_log extents. */
+#define AL_EXT_PER_BM_SECT  (1 << (BM_EXT_SIZE_B - AL_EXTENT_SIZE_B))
+#define BM_WORDS_PER_AL_EXT (1 << (AL_EXTENT_SIZE_B-BM_BLOCK_SIZE_B-LN2_BPL))
+
+#define BM_BLOCKS_PER_BM_EXT_B (BM_EXT_SIZE_B - BM_BLOCK_SIZE_B)
+#define BM_BLOCKS_PER_BM_EXT_MASK  ((1<<BM_BLOCKS_PER_BM_EXT_B) - 1)
+
+/* the extent in "PER_EXTENT" below is an activity log extent
+ * we need that many (long words/bytes) to store the bitmap
+ *		     of one AL_EXTENT_SIZE chunk of storage.
+ * we can store the bitmap for that many AL_EXTENTS within
+ * one sector of the _on_disk_ bitmap:
+ * bit	 0	  bit 37   bit 38	     bit (512*8)-1
+ *	     ...|........|........|.. // ..|........|
+ * sect. 0	 `296	  `304			   ^(512*8*8)-1
+ *
+#define BM_WORDS_PER_EXT    ( (AL_EXT_SIZE/BM_BLOCK_SIZE) / BITS_PER_LONG )
+#define BM_BYTES_PER_EXT    ( (AL_EXT_SIZE/BM_BLOCK_SIZE) / 8 )  // 128
+#define BM_EXT_PER_SECT	    ( 512 / BM_BYTES_PER_EXTENT )	 //   4
+ */
+
+#define DRBD_MAX_SECTORS_32 (0xffffffffLU)
+#define DRBD_MAX_SECTORS_BM \
+	  ((MD_RESERVED_SECT - MD_BM_OFFSET) * (1LL<<(BM_EXT_SIZE_B-9)))
+#if DRBD_MAX_SECTORS_BM < DRBD_MAX_SECTORS_32
+#define DRBD_MAX_SECTORS      DRBD_MAX_SECTORS_BM
+#define DRBD_MAX_SECTORS_FLEX DRBD_MAX_SECTORS_BM
+#elif !defined(CONFIG_LBD) && BITS_PER_LONG == 32
+#define DRBD_MAX_SECTORS      DRBD_MAX_SECTORS_32
+#define DRBD_MAX_SECTORS_FLEX DRBD_MAX_SECTORS_32
+#else
+#define DRBD_MAX_SECTORS      DRBD_MAX_SECTORS_BM
+/* 16 TB in units of sectors */
+#if BITS_PER_LONG == 32
+/* adjust by one page worth of bitmap,
+ * so we won't wrap around in drbd_bm_find_next_bit.
+ * you should use a 64bit OS for that much storage, anyway. */
+#define DRBD_MAX_SECTORS_FLEX BM_BIT_TO_SECT(0xffff7fff)
+#else
+#define DRBD_MAX_SECTORS_FLEX BM_BIT_TO_SECT(0x1LU << 32)
+#endif
+#endif
+
+/* Sector shift value for the "hash" functions of the tl_hash and ee_hash tables.
+ * With a value of 6, all IO within one 32K block maps to the same slot of the
+ * hash table. */
+#define HT_SHIFT 6
+#define DRBD_MAX_SEGMENT_SIZE (1U<<(9+HT_SHIFT))
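
To make the grouping described above concrete, a small user-space sketch; the modulo step is only an assumed stand-in, the real tl_hash/ee_hash functions live elsewhere in this patch set:

	#include <stdio.h>

	#define HT_SHIFT 6	/* 2^6 sectors = 32 KiB per hash key, as above */

	/* hypothetical slot rule, for illustration only */
	static unsigned int slot(unsigned long long sector, unsigned int table_size)
	{
		return (unsigned int)((sector >> HT_SHIFT) % table_size);
	}

	int main(void)
	{
		/* sectors 128, 160 and 191 all lie within the same 32 KiB span */
		printf("%u %u %u\n", slot(128, 61), slot(160, 61), slot(191, 61));
		/* prints "2 2 2": all three land in the same slot */
		return 0;
	}
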
+
+/* Number of elements in the app_reads_hash */
+#define APP_R_HSIZE 15
+
+extern int  drbd_bm_init(struct drbd_conf *mdev);
+extern int  drbd_bm_resize(struct drbd_conf *mdev, sector_t sectors);
+extern void drbd_bm_cleanup(struct drbd_conf *mdev);
+extern void drbd_bm_set_all(struct drbd_conf *mdev);
+extern void drbd_bm_clear_all(struct drbd_conf *mdev);
+extern int  drbd_bm_set_bits(
+		struct drbd_conf *mdev, unsigned long s, unsigned long e);
+extern int  drbd_bm_clear_bits(
+		struct drbd_conf *mdev, unsigned long s, unsigned long e);
+/* bm_set_bits variant for use while holding drbd_bm_lock */
+extern int _drbd_bm_set_bits(struct drbd_conf *mdev,
+		const unsigned long s, const unsigned long e);
+extern int  drbd_bm_test_bit(struct drbd_conf *mdev, unsigned long bitnr);
+extern int  drbd_bm_e_weight(struct drbd_conf *mdev, unsigned long enr);
+extern int  drbd_bm_write_sect(struct drbd_conf *mdev, unsigned long enr) __must_hold(local);
+extern int  drbd_bm_read(struct drbd_conf *mdev) __must_hold(local);
+extern int  drbd_bm_write(struct drbd_conf *mdev) __must_hold(local);
+extern unsigned long drbd_bm_ALe_set_all(struct drbd_conf *mdev,
+		unsigned long al_enr);
+extern size_t	     drbd_bm_words(struct drbd_conf *mdev);
+extern unsigned long drbd_bm_bits(struct drbd_conf *mdev);
+extern sector_t      drbd_bm_capacity(struct drbd_conf *mdev);
+extern unsigned long drbd_bm_find_next(struct drbd_conf *mdev, unsigned long bm_fo);
+/* bm_find_next variants for use while you hold drbd_bm_lock() */
+extern unsigned long _drbd_bm_find_next(struct drbd_conf *mdev, unsigned long bm_fo);
+extern unsigned long _drbd_bm_find_next_zero(struct drbd_conf *mdev, unsigned long bm_fo);
+extern unsigned long drbd_bm_total_weight(struct drbd_conf *mdev);
+extern int drbd_bm_rs_done(struct drbd_conf *mdev);
+/* for receive_bitmap */
+extern void drbd_bm_merge_lel(struct drbd_conf *mdev, size_t offset,
+		size_t number, unsigned long *buffer);
+/* for _drbd_send_bitmap and drbd_bm_write_sect */
+extern void drbd_bm_get_lel(struct drbd_conf *mdev, size_t offset,
+		size_t number, unsigned long *buffer);
+
+extern void drbd_bm_lock(struct drbd_conf *mdev, char *why);
+extern void drbd_bm_unlock(struct drbd_conf *mdev);
+
+extern void _drbd_bm_recount_bits(struct drbd_conf *mdev, char *file, int line);
+#define drbd_bm_recount_bits(mdev) \
+	_drbd_bm_recount_bits(mdev, __FILE__, __LINE__)
+extern int drbd_bm_count_bits(struct drbd_conf *mdev, const unsigned long s, const unsigned long e);
+/* drbd_main.c */
+
+extern struct kmem_cache *drbd_request_cache;
+extern struct kmem_cache *drbd_ee_cache;
+extern mempool_t *drbd_request_mempool;
+extern mempool_t *drbd_ee_mempool;
+
+extern struct page *drbd_pp_pool; /* drbd's page pool */
+extern spinlock_t   drbd_pp_lock;
+extern int	    drbd_pp_vacant;
+extern wait_queue_head_t drbd_pp_wait;
+
+extern rwlock_t global_state_lock;
+
+extern struct drbd_conf *drbd_new_device(unsigned int minor);
+extern void drbd_free_mdev(struct drbd_conf *mdev);
+
+extern int proc_details;
+
+/* drbd_req */
+extern int drbd_make_request_26(struct request_queue *q, struct bio *bio);
+extern int drbd_read_remote(struct drbd_conf *mdev, struct drbd_request *req);
+extern int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec);
+extern int is_valid_ar_handle(struct drbd_request *, sector_t);
+
+
+/* drbd_nl.c */
+extern void drbd_suspend_io(struct drbd_conf *mdev);
+extern void drbd_resume_io(struct drbd_conf *mdev);
+extern char *ppsize(char *buf, unsigned long long size);
+extern sector_t drbd_new_dev_size(struct drbd_conf *,
+		struct drbd_backing_dev *);
+enum determine_dev_size { dev_size_error = -1, unchanged = 0, shrunk = 1, grew = 2 };
+extern enum determine_dev_size drbd_determin_dev_size(struct drbd_conf *) __must_hold(local);
+extern void resync_after_online_grow(struct drbd_conf *);
+extern void drbd_setup_queue_param(struct drbd_conf *mdev, unsigned int) __must_hold(local);
+extern int drbd_set_role(struct drbd_conf *mdev, enum drbd_role new_role,
+		int force);
+enum drbd_disk_state drbd_try_outdate_peer(struct drbd_conf *mdev);
+extern int drbd_khelper(struct drbd_conf *mdev, char *cmd);
+
+/* drbd_worker.c */
+extern int drbd_worker(struct drbd_thread *thi);
+extern void drbd_alter_sa(struct drbd_conf *mdev, int na);
+extern void drbd_start_resync(struct drbd_conf *mdev, enum drbd_conns side);
+extern void resume_next_sg(struct drbd_conf *mdev);
+extern void suspend_other_sg(struct drbd_conf *mdev);
+extern int drbd_resync_finished(struct drbd_conf *mdev);
+/* maybe rather drbd_main.c ? */
+extern int drbd_md_sync_page_io(struct drbd_conf *mdev,
+		struct drbd_backing_dev *bdev, sector_t sector, int rw);
+extern void drbd_ov_oos_found(struct drbd_conf*, sector_t, int);
+
+static inline void ov_oos_print(struct drbd_conf *mdev)
+{
+	if (mdev->ov_last_oos_size) {
+		dev_err(DEV, "Out of sync: start=%llu, size=%lu (sectors)\n",
+		     (unsigned long long)mdev->ov_last_oos_start,
+		     (unsigned long)mdev->ov_last_oos_size);
+	}
+	mdev->ov_last_oos_size=0;
+}
+
+
+void drbd_csum(struct drbd_conf *, struct crypto_hash *, struct bio *, void *);
+/* worker callbacks */
+extern int w_req_cancel_conflict(struct drbd_conf *, struct drbd_work *, int);
+extern int w_read_retry_remote(struct drbd_conf *, struct drbd_work *, int);
+extern int w_e_end_data_req(struct drbd_conf *, struct drbd_work *, int);
+extern int w_e_end_rsdata_req(struct drbd_conf *, struct drbd_work *, int);
+extern int w_e_end_csum_rs_req(struct drbd_conf *, struct drbd_work *, int);
+extern int w_e_end_ov_reply(struct drbd_conf *, struct drbd_work *, int);
+extern int w_e_end_ov_req(struct drbd_conf *, struct drbd_work *, int);
+extern int w_ov_finished(struct drbd_conf *, struct drbd_work *, int);
+extern int w_resync_inactive(struct drbd_conf *, struct drbd_work *, int);
+extern int w_resume_next_sg(struct drbd_conf *, struct drbd_work *, int);
+extern int w_io_error(struct drbd_conf *, struct drbd_work *, int);
+extern int w_send_write_hint(struct drbd_conf *, struct drbd_work *, int);
+extern int w_make_resync_request(struct drbd_conf *, struct drbd_work *, int);
+extern int w_send_dblock(struct drbd_conf *, struct drbd_work *, int);
+extern int w_send_barrier(struct drbd_conf *, struct drbd_work *, int);
+extern int w_send_read_req(struct drbd_conf *, struct drbd_work *, int);
+extern int w_prev_work_done(struct drbd_conf *, struct drbd_work *, int);
+extern int w_e_reissue(struct drbd_conf *, struct drbd_work *, int);
+
+extern void resync_timer_fn(unsigned long data);
+
+/* drbd_receiver.c */
+extern int drbd_release_ee(struct drbd_conf *mdev, struct list_head *list);
+extern struct drbd_epoch_entry *drbd_alloc_ee(struct drbd_conf *mdev,
+					    u64 id,
+					    sector_t sector,
+					    unsigned int data_size,
+					    gfp_t gfp_mask) __must_hold(local);
+extern void drbd_free_ee(struct drbd_conf *mdev, struct drbd_epoch_entry *e);
+extern void drbd_wait_ee_list_empty(struct drbd_conf *mdev,
+		struct list_head *head);
+extern void _drbd_wait_ee_list_empty(struct drbd_conf *mdev,
+		struct list_head *head);
+extern void drbd_set_recv_tcq(struct drbd_conf *mdev, int tcq_enabled);
+extern void _drbd_clear_done_ee(struct drbd_conf *mdev);
+
+/* yes, there is kernel_setsockopt, but only since 2.6.18. we don't need to
+ * mess with get_fs/set_fs, we know we are KERNEL_DS always. */
+static inline int drbd_setsockopt(struct socket *sock, int level, int optname,
+			char __user *optval, int optlen)
+{
+	int err;
+	if (level == SOL_SOCKET)
+		err = sock_setsockopt(sock, level, optname, optval, optlen);
+	else
+		err = sock->ops->setsockopt(sock, level, optname, optval,
+					    optlen);
+	return err;
+}
+
+static inline void drbd_tcp_cork(struct socket *sock)
+{
+	int __user val = 1;
+	(void) drbd_setsockopt(sock, SOL_TCP, TCP_CORK,
+			(char __user *)&val, sizeof(val));
+}
+
+static inline void drbd_tcp_uncork(struct socket *sock)
+{
+	int __user val = 0;
+	(void) drbd_setsockopt(sock, SOL_TCP, TCP_CORK,
+			(char __user *)&val, sizeof(val));
+}
+
+static inline void drbd_tcp_nodelay(struct socket *sock)
+{
+	int __user val = 1;
+	(void) drbd_setsockopt(sock, SOL_TCP, TCP_NODELAY,
+			(char __user *)&val, sizeof(val));
+}
+
+static inline void drbd_tcp_quickack(struct socket *sock)
+{
+	int __user val = 1;
+	(void) drbd_setsockopt(sock, SOL_TCP, TCP_QUICKACK,
+			(char __user *)&val, sizeof(val));
+}
+
+void drbd_bump_write_ordering(struct drbd_conf *mdev, enum write_ordering_e wo);
+
+/* drbd_proc.c */
+extern struct proc_dir_entry *drbd_proc;
+extern struct file_operations drbd_proc_fops;
+extern const char *conns_to_name(enum drbd_conns s);
+extern const char *roles_to_name(enum drbd_role s);
+
+/* drbd_actlog.c */
+extern void drbd_al_begin_io(struct drbd_conf *mdev, sector_t sector);
+extern void drbd_al_complete_io(struct drbd_conf *mdev, sector_t sector);
+extern void drbd_rs_complete_io(struct drbd_conf *mdev, sector_t sector);
+extern int drbd_rs_begin_io(struct drbd_conf *mdev, sector_t sector);
+extern int drbd_try_rs_begin_io(struct drbd_conf *mdev, sector_t sector);
+extern void drbd_rs_cancel_all(struct drbd_conf *mdev);
+extern int drbd_rs_del_all(struct drbd_conf *mdev);
+extern void drbd_rs_failed_io(struct drbd_conf *mdev,
+		sector_t sector, int size);
+extern int drbd_al_read_log(struct drbd_conf *mdev, struct drbd_backing_dev *);
+extern void __drbd_set_in_sync(struct drbd_conf *mdev, sector_t sector,
+		int size, const char *file, const unsigned int line);
+#define drbd_set_in_sync(mdev, sector, size) \
+	__drbd_set_in_sync(mdev, sector, size, __FILE__, __LINE__)
+extern void __drbd_set_out_of_sync(struct drbd_conf *mdev, sector_t sector,
+		int size, const char *file, const unsigned int line);
+#define drbd_set_out_of_sync(mdev, sector, size) \
+	__drbd_set_out_of_sync(mdev, sector, size, __FILE__, __LINE__)
+extern void drbd_al_apply_to_bm(struct drbd_conf *mdev);
+extern void drbd_al_to_on_disk_bm(struct drbd_conf *mdev);
+extern void drbd_al_shrink(struct drbd_conf *mdev);
+
+
+/* drbd_nl.c */
+
+void drbd_nl_cleanup(void);
+int __init drbd_nl_init(void);
+void drbd_bcast_state(struct drbd_conf *mdev, union drbd_state);
+void drbd_bcast_sync_progress(struct drbd_conf *mdev);
+void drbd_bcast_ee(struct drbd_conf *mdev,
+		const char *reason, const int dgs,
+		const char* seen_hash, const char* calc_hash,
+		const struct drbd_epoch_entry* e);
+
+
+/** DRBD State macros:
+ * These macros are used to express state changes in easily readable form.
+ *
+ * The NS macros expand to a mask and a value that can be bit-or'ed onto the
+ * current state once the spinlock (req_lock) is held.
+ *
+ * The _NS macros are used by state functions that are called with the
+ * spinlock already held. These macros expand directly to the new state value.
+ *
+ * Besides the basic forms NS() and _NS() additional _?NS[23] are defined
+ * to express state changes that affect more than one aspect of the state.
+ *
+ * E.g. NS2(conn, C_CONNECTED, peer, R_SECONDARY)
+ * means that the network connection was established and that the peer
+ * is in the secondary role.
+ */
+#define role_MASK R_MASK
+#define peer_MASK R_MASK
+#define disk_MASK D_MASK
+#define pdsk_MASK D_MASK
+#define conn_MASK C_MASK
+#define susp_MASK 1
+#define user_isp_MASK 1
+#define aftr_isp_MASK 1
+
+#define NS(T, S) \
+	({ union drbd_state mask; mask.i = 0; mask.T = T##_MASK; mask; }), \
+	({ union drbd_state val; val.i = 0; val.T = (S); val; })
+#define NS2(T1, S1, T2, S2) \
+	({ union drbd_state mask; mask.i = 0; mask.T1 = T1##_MASK; \
+	  mask.T2 = T2##_MASK; mask; }), \
+	({ union drbd_state val; val.i = 0; val.T1 = (S1); \
+	  val.T2 = (S2); val; })
+#define NS3(T1, S1, T2, S2, T3, S3) \
+	({ union drbd_state mask; mask.i = 0; mask.T1 = T1##_MASK; \
+	  mask.T2 = T2##_MASK; mask.T3 = T3##_MASK; mask; }), \
+	({ union drbd_state val;  val.i = 0; val.T1 = (S1); \
+	  val.T2 = (S2); val.T3 = (S3); val; })
+
+#define _NS(D, T, S) \
+	D, ({ union drbd_state __ns; __ns.i = D->state.i; __ns.T = (S); __ns; })
+#define _NS2(D, T1, S1, T2, S2) \
+	D, ({ union drbd_state __ns; __ns.i = D->state.i; __ns.T1 = (S1); \
+	__ns.T2 = (S2); __ns; })
+#define _NS3(D, T1, S1, T2, S2, T3, S3) \
+	D, ({ union drbd_state __ns; __ns.i = D->state.i; __ns.T1 = (S1); \
+	__ns.T2 = (S2); __ns.T3 = (S3); __ns; })
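
A toy user-space model of what NS() produces and how callers merge it; the union and the merge rule are heavily simplified, the real union drbd_state and __drbd_set_state() carry many more fields and checks (needs gcc, like the originals):

	#include <stdio.h>

	union toy_state {			/* stands in for union drbd_state */
		struct { unsigned conn:8; unsigned disk:8; };
		unsigned int i;
	};
	#define conn_MASK 0xff
	#define disk_MASK 0xff

	#define NS(T, S) \
		({ union toy_state mask; mask.i = 0; mask.T = T##_MASK; mask; }), \
		({ union toy_state val; val.i = 0; val.T = (S); val; })

	/* stands in for drbd_request_state(mdev, mask, val) */
	static void request_state(union toy_state mask, union toy_state val)
	{
		union toy_state ns, os;
		os.i = 0; os.conn = 2; os.disk = 5;	/* pretend current state */
		ns.i = (os.i & ~mask.i) | val.i;	/* same merge rule as drbd_change_state() */
		printf("conn %d -> %d, disk stays %d\n", os.conn, ns.conn, ns.disk);
	}

	int main(void)
	{
		request_state(NS(conn, 9));	/* touch only .conn, leave .disk alone */
		return 0;
	}
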
+
+/*
+ * inline helper functions
+ *************************/
+
+static inline void drbd_state_lock(struct drbd_conf *mdev)
+{
+	wait_event(mdev->misc_wait,
+		   !test_and_set_bit(CLUSTER_ST_CHANGE, &mdev->flags));
+}
+
+static inline void drbd_state_unlock(struct drbd_conf *mdev)
+{
+	clear_bit(CLUSTER_ST_CHANGE, &mdev->flags);
+	wake_up(&mdev->misc_wait);
+}
+
+static inline int _drbd_set_state(struct drbd_conf *mdev,
+				   union drbd_state ns, enum chg_state_flags flags,
+				   struct completion *done)
+{
+	int rv;
+
+	read_lock(&global_state_lock);
+	rv = __drbd_set_state(mdev, ns, flags, done);
+	read_unlock(&global_state_lock);
+
+	return rv;
+}
+
+static inline int drbd_request_state(struct drbd_conf *mdev,
+				     union drbd_state mask,
+				     union drbd_state val)
+{
+	return _drbd_request_state(mdev, mask, val, CS_VERBOSE + CS_ORDERED);
+}
+
+/**
+ * drbd_chk_io_error: Handles the on_io_error setting, should be called from
+ * all io completion handlers. See also drbd_io_error().
+ */
+static inline void __drbd_chk_io_error(struct drbd_conf *mdev, int forcedetach)
+{
+	switch (mdev->bc->dc.on_io_error) {
+	case EP_PASS_ON:
+		if (!forcedetach) {
+			if (printk_ratelimit())
+				dev_err(DEV, "Local IO failed. Passing error on...\n");
+			break;
+		}
+		/* NOTE fall through to detach case if forcedetach set */
+	case EP_DETACH:
+	case EP_CALL_HELPER:
+		if (mdev->state.disk > D_FAILED) {
+			_drbd_set_state(_NS(mdev, disk, D_FAILED), CS_HARD, NULL);
+			dev_err(DEV, "Local IO failed. Detaching...\n");
+		}
+		break;
+	}
+}
+
+static inline void drbd_chk_io_error(struct drbd_conf *mdev,
+	int error, int forcedetach)
+{
+	if (error) {
+		unsigned long flags;
+		spin_lock_irqsave(&mdev->req_lock, flags);
+		__drbd_chk_io_error(mdev, forcedetach);
+		spin_unlock_irqrestore(&mdev->req_lock, flags);
+	}
+}
+
+/* Returns the first sector number of our meta data,
+ * which, for internal meta data, happens to be the maximum capacity
+ * we could agree upon with our peer
+ */
+static inline sector_t drbd_md_first_sector(struct drbd_backing_dev *bdev)
+{
+	switch (bdev->dc.meta_dev_idx) {
+	case DRBD_MD_INDEX_INTERNAL:
+	case DRBD_MD_INDEX_FLEX_INT:
+		return bdev->md.md_offset + bdev->md.bm_offset;
+	case DRBD_MD_INDEX_FLEX_EXT:
+	default:
+		return bdev->md.md_offset;
+	}
+}
+
+/* returns the last sector number of our meta data,
+ * to be able to catch out of band md access */
+static inline sector_t drbd_md_last_sector(struct drbd_backing_dev *bdev)
+{
+	switch (bdev->dc.meta_dev_idx) {
+	case DRBD_MD_INDEX_INTERNAL:
+	case DRBD_MD_INDEX_FLEX_INT:
+		return bdev->md.md_offset + MD_AL_OFFSET - 1;
+	case DRBD_MD_INDEX_FLEX_EXT:
+	default:
+		return bdev->md.md_offset + bdev->md.md_size_sect;
+	}
+}
+
+/* Returns the number of 512 byte sectors of the device */
+static inline sector_t drbd_get_capacity(struct block_device *bdev)
+{
+	/* return bdev ? get_capacity(bdev->bd_disk) : 0; */
+	return bdev ? bdev->bd_inode->i_size >> 9 : 0;
+}
+
+/* returns the capacity we announce to our peer.
+ * we clip ourselves at the various MAX_SECTORS, because if we don't,
+ * the current implementation will oops sooner or later */
+static inline sector_t drbd_get_max_capacity(struct drbd_backing_dev *bdev)
+{
+	sector_t s;
+	switch (bdev->dc.meta_dev_idx) {
+	case DRBD_MD_INDEX_INTERNAL:
+	case DRBD_MD_INDEX_FLEX_INT:
+		s = drbd_get_capacity(bdev->backing_bdev)
+			? min_t(sector_t, DRBD_MAX_SECTORS_FLEX,
+					drbd_md_first_sector(bdev))
+			: 0;
+		break;
+	case DRBD_MD_INDEX_FLEX_EXT:
+		s = min_t(sector_t, DRBD_MAX_SECTORS_FLEX,
+				drbd_get_capacity(bdev->backing_bdev));
+		/* clip at maximum size the meta device can support */
+		s = min_t(sector_t, s,
+			BM_EXT_TO_SECT(bdev->md.md_size_sect
+				     - bdev->md.bm_offset));
+		break;
+	default:
+		s = min_t(sector_t, DRBD_MAX_SECTORS,
+				drbd_get_capacity(bdev->backing_bdev));
+	}
+	return s;
+}
+
+/* returns the sector number of our meta data 'super' block */
+static inline sector_t drbd_md_ss__(struct drbd_conf *mdev,
+				    struct drbd_backing_dev *bdev)
+{
+	switch (bdev->dc.meta_dev_idx) {
+	default: /* external, some index */
+		return MD_RESERVED_SECT * bdev->dc.meta_dev_idx;
+	case DRBD_MD_INDEX_INTERNAL:
+		/* with drbd08, internal meta data is always "flexible" */
+	case DRBD_MD_INDEX_FLEX_INT:
+		/* sizeof(struct md_on_disk_07) == 4k
+		 * position: last 4k aligned block of 4k size */
+		if (!bdev->backing_bdev) {
+			if (__ratelimit(&drbd_ratelimit_state)) {
+				dev_err(DEV, "bdev->backing_bdev==NULL\n");
+				dump_stack();
+			}
+			return 0;
+		}
+		return (drbd_get_capacity(bdev->backing_bdev) & ~7ULL)
+			- MD_AL_OFFSET;
+	case DRBD_MD_INDEX_FLEX_EXT:
+		return 0;
+	}
+}
+
+static inline void
+_drbd_queue_work(struct drbd_work_queue *q, struct drbd_work *w)
+{
+	list_add_tail(&w->list, &q->q);
+	up(&q->s);
+}
+
+static inline void
+drbd_queue_work_front(struct drbd_work_queue *q, struct drbd_work *w)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&q->q_lock, flags);
+	list_add(&w->list, &q->q);
+	up(&q->s); /* within the spinlock,
+		      see comment near end of drbd_worker() */
+	spin_unlock_irqrestore(&q->q_lock, flags);
+}
+
+static inline void
+drbd_queue_work(struct drbd_work_queue *q, struct drbd_work *w)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&q->q_lock, flags);
+	list_add_tail(&w->list, &q->q);
+	up(&q->s); /* within the spinlock,
+		      see comment near end of drbd_worker() */
+	spin_unlock_irqrestore(&q->q_lock, flags);
+}
+
+static inline void wake_asender(struct drbd_conf *mdev)
+{
+	if (test_bit(SIGNAL_ASENDER, &mdev->flags))
+		force_sig(DRBD_SIG, mdev->asender.task);
+}
+
+static inline void request_ping(struct drbd_conf *mdev)
+{
+	set_bit(SEND_PING, &mdev->flags);
+	wake_asender(mdev);
+}
+
+static inline int drbd_send_short_cmd(struct drbd_conf *mdev,
+	enum drbd_packets cmd)
+{
+	struct p_header h;
+	return drbd_send_cmd(mdev, USE_DATA_SOCKET, cmd, &h, sizeof(h));
+}
+
+static inline int drbd_send_ping(struct drbd_conf *mdev)
+{
+	struct p_header h;
+	return drbd_send_cmd(mdev, USE_META_SOCKET, P_PING, &h, sizeof(h));
+}
+
+static inline int drbd_send_ping_ack(struct drbd_conf *mdev)
+{
+	struct p_header h;
+	return drbd_send_cmd(mdev, USE_META_SOCKET, P_PING_ACK, &h, sizeof(h));
+}
+
+static inline void drbd_thread_stop(struct drbd_thread *thi)
+{
+	_drbd_thread_stop(thi, FALSE, TRUE);
+}
+
+static inline void drbd_thread_stop_nowait(struct drbd_thread *thi)
+{
+	_drbd_thread_stop(thi, FALSE, FALSE);
+}
+
+static inline void drbd_thread_restart_nowait(struct drbd_thread *thi)
+{
+	_drbd_thread_stop(thi, TRUE, FALSE);
+}
+
+/* counts how many answer packets we expect from our peer,
+ * for either explicit application requests,
+ * or implicit barrier packets as necessary.
+ * increased:
+ *  w_send_barrier
+ *  _req_mod(req, queue_for_net_write or queue_for_net_read);
+ *    it is much easier and equally valid to count what we queue for the
+ *    worker, even before it actually was queued or sent.
+ *    (drbd_make_request_common; recovery path on read io-error)
+ * decreased:
+ *  got_BarrierAck (respective tl_clear, tl_clear_barrier)
+ *  _req_mod(req, data_received)
+ *     [from receive_DataReply]
+ *  _req_mod(req, write_acked_by_peer or recv_acked_by_peer or neg_acked)
+ *     [from got_BlockAck (P_WRITE_ACK, P_RECV_ACK)]
+ *     for some reason it is NOT decreased in got_NegAck,
+ *     but in the resulting cleanup code from report_params.
+ *     we should try to remember the reason for that...
+ *  _req_mod(req, send_failed or send_canceled)
+ *  _req_mod(req, connection_lost_while_pending)
+ *     [from tl_clear_barrier]
+ */
+static inline void inc_ap_pending(struct drbd_conf *mdev)
+{
+	atomic_inc(&mdev->ap_pending_cnt);
+}
+
+#define ERR_IF_CNT_IS_NEGATIVE(which)				\
+	if (atomic_read(&mdev->which) < 0)			\
+		dev_err(DEV, "in %s:%d: " #which " = %d < 0 !\n",	\
+		    __func__ , __LINE__ ,			\
+		    atomic_read(&mdev->which))
+
+#define dec_ap_pending(mdev)	do {				\
+	typecheck(struct drbd_conf *, mdev);			\
+	if (atomic_dec_and_test(&mdev->ap_pending_cnt))		\
+		wake_up(&mdev->misc_wait);			\
+	ERR_IF_CNT_IS_NEGATIVE(ap_pending_cnt); } while (0)
+
+/* counts how many resync-related answers we still expect from the peer
+ *		     increase			decrease
+ * C_SYNC_TARGET sends P_RS_DATA_REQUEST (and expects P_RS_DATA_REPLY)
+ * C_SYNC_SOURCE sends P_RS_DATA_REPLY   (and expects P_WRITE_ACK with ID_SYNCER)
+ *					   (or P_NEG_ACK with ID_SYNCER)
+ */
+static inline void inc_rs_pending(struct drbd_conf *mdev)
+{
+	atomic_inc(&mdev->rs_pending_cnt);
+}
+
+#define dec_rs_pending(mdev)	do {				\
+	typecheck(struct drbd_conf *, mdev);			\
+	atomic_dec(&mdev->rs_pending_cnt);			\
+	ERR_IF_CNT_IS_NEGATIVE(rs_pending_cnt); } while (0)
+
+/* counts how many answers we still need to send to the peer.
+ * increased on
+ *  receive_Data	unless protocol A;
+ *			we need to send a P_RECV_ACK (proto B)
+ *			or P_WRITE_ACK (proto C)
+ *  receive_RSDataReply (recv_resync_read) we need to send a P_WRITE_ACK
+ *  receive_DataRequest (receive_RSDataRequest) we need to send back P_DATA
+ *  receive_Barrier_*	we need to send a P_BARRIER_ACK
+ */
+static inline void inc_unacked(struct drbd_conf *mdev)
+{
+	atomic_inc(&mdev->unacked_cnt);
+}
+
+#define dec_unacked(mdev)	do {				\
+	typecheck(struct drbd_conf *, mdev);			\
+	atomic_dec(&mdev->unacked_cnt);				\
+	ERR_IF_CNT_IS_NEGATIVE(unacked_cnt); } while (0)
+
+#define sub_unacked(mdev, n)	do {				\
+	typecheck(struct drbd_conf *, mdev);			\
+	atomic_sub(n, &mdev->unacked_cnt);			\
+	ERR_IF_CNT_IS_NEGATIVE(unacked_cnt); } while (0)
+
+
+static inline void dec_net(struct drbd_conf *mdev)
+{
+	if (atomic_dec_and_test(&mdev->net_cnt))
+		wake_up(&mdev->misc_wait);
+}
+
+/**
+ * inc_net: Returns TRUE when it is ok to access mdev->net_conf. You
+ * should call dec_net() when finished looking at mdev->net_conf.
+ */
+static inline int inc_net(struct drbd_conf *mdev)
+{
+	int have_net_conf;
+
+	atomic_inc(&mdev->net_cnt);
+	have_net_conf = mdev->state.conn >= C_UNCONNECTED;
+	if (!have_net_conf)
+		dec_net(mdev);
+	return have_net_conf;
+}
+
+/**
+ * inc_local: Returns TRUE when local IO is possible. If it returns
+ * TRUE you should call dec_local() after IO is completed.
+ */
+#define inc_local_if_state(M,MINS) __cond_lock(local, _inc_local_if_state(M,MINS))
+#define inc_local(M) __cond_lock(local, _inc_local_if_state(M,D_INCONSISTENT))
+
+static inline void dec_local(struct drbd_conf *mdev)
+{
+	__release(local);
+	if (atomic_dec_and_test(&mdev->local_cnt))
+		wake_up(&mdev->misc_wait);
+	D_ASSERT(atomic_read(&mdev->local_cnt) >= 0);
+}
+
+#ifndef __CHECKER__
+static inline int _inc_local_if_state(struct drbd_conf *mdev, enum drbd_disk_state mins)
+{
+	int io_allowed;
+
+	atomic_inc(&mdev->local_cnt);
+	io_allowed = (mdev->state.disk >= mins);
+	if (!io_allowed)
+		dec_local(mdev);
+	return io_allowed;
+}
+#else
+extern int _inc_local_if_state(struct drbd_conf *mdev, enum drbd_disk_state mins);
+#endif
+
+/* you must have an "inc_local" reference */
+static inline void drbd_get_syncer_progress(struct drbd_conf *mdev,
+		unsigned long *bits_left, unsigned int *per_mil_done)
+{
+	/*
+	 * this is to break it at compile time when we change that
+	 * (we may feel 4TB maximum storage per drbd is not enough)
+	 */
+	typecheck(unsigned long, mdev->rs_total);
+
+	/* note: both rs_total and rs_left are in bits, i.e. in
+	 * units of BM_BLOCK_SIZE.
+	 * for the percentage, we don't care. */
+
+	*bits_left = drbd_bm_total_weight(mdev) - mdev->rs_failed;
+	/* >> 10 to prevent overflow,
+	 * +1 to prevent division by zero */
+	if (*bits_left > mdev->rs_total) {
+		/* doh. maybe a logic bug somewhere.
+		 * may also be just a race condition
+		 * between this and a disconnect during sync.
+		 * for now, just prevent in-kernel buffer overflow.
+		 */
+		smp_rmb();
+		dev_warn(DEV, "cs:%s rs_left=%lu > rs_total=%lu (rs_failed %lu)\n",
+				conns_to_name(mdev->state.conn),
+				*bits_left, mdev->rs_total, mdev->rs_failed);
+		*per_mil_done = 0;
+	} else {
+		/* make sure the calculation happens in long context */
+		unsigned long tmp = 1000UL -
+				(*bits_left >> 10)*1000UL
+				/ ((mdev->rs_total >> 10) + 1UL);
+		*per_mil_done = tmp;
+	}
+}
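
The per-mille formula above can be checked in isolation; a standalone sketch with made-up numbers (>> 10 on both operands to avoid 32bit overflow, +1 to avoid division by zero):

	#include <stdio.h>

	int main(void)
	{
		unsigned long rs_total  = 262144;	/* e.g. 1 GiB at 4 KiB per bit */
		unsigned long bits_left = 65536;	/* pretend 25% still out of sync */

		unsigned long per_mil = 1000UL -
			(bits_left >> 10) * 1000UL / ((rs_total >> 10) + 1UL);

		printf("%lu.%lu%% done\n", per_mil / 10, per_mil % 10);	/* prints 75.1% done */
		return 0;
	}
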
+
+
+/* this throttles on-the-fly application requests
+ * according to max_buffers settings;
+ * maybe re-implement using semaphores? */
+static inline int drbd_get_max_buffers(struct drbd_conf *mdev)
+{
+	int mxb = 1000000; /* arbitrary limit on open requests */
+	if (inc_net(mdev)) {
+		mxb = mdev->net_conf->max_buffers;
+		dec_net(mdev);
+	}
+	return mxb;
+}
+
+static inline int drbd_state_is_stable(union drbd_state s)
+{
+
+	/* DO NOT add a default clause, we want the compiler to warn us
+	 * for any newly introduced state we may have forgotten to add here */
+
+	switch ((enum drbd_conns)s.conn) {
+	/* new io only accepted when there is no connection, ... */
+	case C_STANDALONE:
+	case C_WF_CONNECTION:
+	/* ... or there is a well established connection. */
+	case C_CONNECTED:
+	case C_SYNC_SOURCE:
+	case C_SYNC_TARGET:
+	case C_VERIFY_S:
+	case C_VERIFY_T:
+	case C_PAUSED_SYNC_S:
+	case C_PAUSED_SYNC_T:
+		/* maybe stable, look at the disk state */
+		break;
+
+	/* no new io accepted during transitional states
+	 * like handshake or teardown */
+	case C_DISCONNECTING:
+	case C_UNCONNECTED:
+	case C_TIMEOUT:
+	case C_BROKEN_PIPE:
+	case C_NETWORK_FAILURE:
+	case C_PROTOCOL_ERROR:
+	case C_TEAR_DOWN:
+	case C_WF_REPORT_PARAMS:
+	case C_STARTING_SYNC_S:
+	case C_STARTING_SYNC_T:
+	case C_WF_BITMAP_S:
+	case C_WF_BITMAP_T:
+	case C_WF_SYNC_UUID:
+	case C_MASK:
+		/* not "stable" */
+		return 0;
+	}
+
+	switch ((enum drbd_disk_state)s.disk) {
+	case D_DISKLESS:
+	case D_INCONSISTENT:
+	case D_OUTDATED:
+	case D_CONSISTENT:
+	case D_UP_TO_DATE:
+		/* disk state is stable as well. */
+		break;
+
+	/* no new io accepted during transitional states */
+	case D_ATTACHING:
+	case D_FAILED:
+	case D_NEGOTIATING:
+	case D_UNKNOWN:
+	case D_MASK:
+		/* not "stable" */
+		return 0;
+	}
+
+	return 1;
+}
+
+static inline int __inc_ap_bio_cond(struct drbd_conf *mdev)
+{
+	int mxb = drbd_get_max_buffers(mdev);
+
+	if (mdev->state.susp)
+		return 0;
+	if (test_bit(SUSPEND_IO, &mdev->flags))
+		return 0;
+
+	/* to avoid potential deadlock or bitmap corruption,
+	 * in various places, we only allow new application io
+	 * to start during "stable" states. */
+
+	/* no new io accepted when attaching or detaching the disk */
+	if (!drbd_state_is_stable(mdev->state))
+		return 0;
+
+	/* since some older kernels don't have atomic_add_unless,
+	 * and we are within the spinlock anyways, we have this workaround.  */
+	if (atomic_read(&mdev->ap_bio_cnt) > mxb)
+		return 0;
+	if (test_bit(BITMAP_IO, &mdev->flags))
+		return 0;
+	return 1;
+}
+
+/* I'd like to use wait_event_lock_irq,
+ * but I'm not sure when it got introduced,
+ * and not sure since when it takes 3 or 4 arguments */
+static inline void inc_ap_bio(struct drbd_conf *mdev, int one_or_two)
+{
+	/* compare with after_state_ch,
+	 * os.conn != C_WF_BITMAP_S && ns.conn == C_WF_BITMAP_S */
+	DEFINE_WAIT(wait);
+
+	/* we wait here
+	 *    as long as the device is suspended,
+	 *    until the bitmap is no longer in flight during the connection
+	 *    handshake, and as long as we would exceed the max_buffers limit.
+	 *
+	 * to avoid races with the reconnect code,
+	 * we need to atomic_inc within the spinlock. */
+
+	spin_lock_irq(&mdev->req_lock);
+	while (!__inc_ap_bio_cond(mdev)) {
+		prepare_to_wait(&mdev->misc_wait, &wait, TASK_UNINTERRUPTIBLE);
+		spin_unlock_irq(&mdev->req_lock);
+		schedule();
+		finish_wait(&mdev->misc_wait, &wait);
+		spin_lock_irq(&mdev->req_lock);
+	}
+	atomic_add(one_or_two, &mdev->ap_bio_cnt);
+	spin_unlock_irq(&mdev->req_lock);
+}
+
+static inline void dec_ap_bio(struct drbd_conf *mdev)
+{
+	int mxb = drbd_get_max_buffers(mdev);
+	int ap_bio = atomic_dec_return(&mdev->ap_bio_cnt);
+
+	D_ASSERT(ap_bio >= 0);
+	/* this currently does wake_up for every dec_ap_bio!
+	 * maybe rather introduce some type of hysteresis?
+	 * e.g. (ap_bio == mxb/2 || ap_bio == 0) ? */
+	if (ap_bio < mxb)
+		wake_up(&mdev->misc_wait);
+	if (ap_bio == 0 && test_bit(BITMAP_IO, &mdev->flags)) {
+		if (!test_and_set_bit(BITMAP_IO_QUEUED, &mdev->flags))
+			drbd_queue_work(&mdev->data.work, &mdev->bm_io_work.w);
+	}
+}
+
+static inline void drbd_set_ed_uuid(struct drbd_conf *mdev, u64 val)
+{
+	mdev->ed_uuid = val;
+}
+
+static inline int seq_cmp(u32 a, u32 b)
+{
+	/* we assume wrap around at 32bit.
+	 * for wrap around at 24bit (old atomic_t),
+	 * we'd have to
+	 *  a <<= 8; b <<= 8;
+	 */
+	return (s32)(a) - (s32)(b);
+}
+#define seq_lt(a, b) (seq_cmp((a), (b)) < 0)
+#define seq_gt(a, b) (seq_cmp((a), (b)) > 0)
+#define seq_ge(a, b) (seq_cmp((a), (b)) >= 0)
+#define seq_le(a, b) (seq_cmp((a), (b)) <= 0)
+/* CAUTION: please no side effects in arguments! */
+#define seq_max(a, b) ((u32)(seq_gt((a), (b)) ? (a) : (b)))
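
A standalone sketch of why the signed difference is used: right after the 32bit counter wraps, a plain unsigned compare gets the order wrong while seq_cmp() does not (the sequence numbers below are made up):

	#include <stdio.h>
	#include <stdint.h>

	static int seq_cmp(uint32_t a, uint32_t b)
	{
		return (int32_t)a - (int32_t)b;		/* same trick as above */
	}

	int main(void)
	{
		uint32_t old_seq = 0xfffffff0u;		/* shortly before the wrap */
		uint32_t new_seq = 0x00000010u;		/* shortly after the wrap */

		printf("plain <  says new is older: %d\n", new_seq < old_seq);			/* 1, wrong */
		printf("seq_cmp says old < new:      %d\n", seq_cmp(old_seq, new_seq) < 0);	/* 1, right */
		return 0;
	}
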
+
+static inline void update_peer_seq(struct drbd_conf *mdev, unsigned int new_seq)
+{
+	unsigned int m;
+	spin_lock(&mdev->peer_seq_lock);
+	m = seq_max(mdev->peer_seq, new_seq);
+	mdev->peer_seq = m;
+	spin_unlock(&mdev->peer_seq_lock);
+	if (m == new_seq)
+		wake_up(&mdev->seq_wait);
+}
+
+static inline void drbd_update_congested(struct drbd_conf *mdev)
+{
+	struct sock *sk = mdev->data.socket->sk;
+	if (sk->sk_wmem_queued > sk->sk_sndbuf * 4 / 5)
+		set_bit(NET_CONGESTED, &mdev->flags);
+}
+
+static inline int drbd_queue_order_type(struct drbd_conf *mdev)
+{
+	/* sorry, we currently have no working implementation
+	 * of distributed TCQ stuff */
+#ifndef QUEUE_ORDERED_NONE
+#define QUEUE_ORDERED_NONE 0
+#endif
+	return QUEUE_ORDERED_NONE;
+}
+
+static inline void drbd_blk_run_queue(struct request_queue *q)
+{
+	if (q && q->unplug_fn)
+		q->unplug_fn(q);
+}
+
+static inline void drbd_kick_lo(struct drbd_conf *mdev)
+{
+	if (inc_local(mdev)) {
+		drbd_blk_run_queue(bdev_get_queue(mdev->bc->backing_bdev));
+		dec_local(mdev);
+	}
+}
+
+static inline void drbd_md_flush(struct drbd_conf *mdev)
+{
+	int r;
+
+	if (test_bit(MD_NO_BARRIER, &mdev->flags))
+		return;
+
+	r = blkdev_issue_flush(mdev->bc->md_bdev, NULL);
+	if (r) {
+		set_bit(MD_NO_BARRIER, &mdev->flags);
+		dev_err(DEV, "meta data flush failed with status %d, disabling md-flushes\n", r);
+	}
+}
+
+#endif
diff --git a/drivers/block/drbd/drbd_wrappers.h b/drivers/block/drbd/drbd_wrappers.h
new file mode 100644
index 0000000..b7ce5ac
--- /dev/null
+++ b/drivers/block/drbd/drbd_wrappers.h
@@ -0,0 +1,97 @@
+#ifndef _DRBD_WRAPPERS_H
+#define _DRBD_WRAPPERS_H
+
+#include <linux/ctype.h>
+#include <linux/mm.h>
+
+
+/* see get_sb_bdev and bd_claim */
+extern char *drbd_sec_holder;
+
+static inline sector_t drbd_get_hardsect(struct block_device *bdev)
+{
+	return bdev->bd_disk->queue->hardsect_size;
+}
+
+/* sets the number of 512 byte sectors of our virtual device */
+static inline void drbd_set_my_capacity(struct drbd_conf *mdev,
+					sector_t size)
+{
+	/* set_capacity(mdev->this_bdev->bd_disk, size); */
+	set_capacity(mdev->vdisk, size);
+	mdev->this_bdev->bd_inode->i_size = (loff_t)size << 9;
+}
+
+#define drbd_bio_uptodate(bio) bio_flagged(bio, BIO_UPTODATE)
+
+static inline int drbd_bio_has_active_page(struct bio *bio)
+{
+	struct bio_vec *bvec;
+	int i;
+
+	__bio_for_each_segment(bvec, bio, i, 0) {
+		if (page_count(bvec->bv_page) > 1)
+			return 1;
+	}
+
+	return 0;
+}
+
+/* bi_end_io handlers */
+extern void drbd_md_io_complete(struct bio *bio, int error);
+extern void drbd_endio_read_sec(struct bio *bio, int error);
+extern void drbd_endio_write_sec(struct bio *bio, int error);
+extern void drbd_endio_pri(struct bio *bio, int error);
+
+/*
+ * used to submit our private bio
+ */
+static inline void drbd_generic_make_request(struct drbd_conf *mdev,
+					     int fault_type, struct bio *bio)
+{
+	__release(local);
+	if (!bio->bi_bdev) {
+		printk(KERN_ERR "drbd%d: drbd_generic_make_request: "
+				"bio->bi_bdev == NULL\n",
+		       mdev_to_minor(mdev));
+		dump_stack();
+		bio_endio(bio, -ENODEV);
+		return;
+	}
+
+	if (FAULT_ACTIVE(mdev, fault_type))
+		bio_endio(bio, -EIO);
+	else
+		generic_make_request(bio);
+}
+
+static inline void drbd_plug_device(struct drbd_conf *mdev)
+{
+	struct request_queue *q;
+	q = bdev_get_queue(mdev->this_bdev);
+
+	spin_lock_irq(q->queue_lock);
+
+/* XXX the check on !blk_queue_plugged is redundant,
+ * implicitly checked in blk_plug_device */
+
+	if (!blk_queue_plugged(q)) {
+		blk_plug_device(q);
+		del_timer(&q->unplug_timer);
+		/* unplugging should not happen automatically... */
+	}
+	spin_unlock_irq(q->queue_lock);
+}
+
+static inline int drbd_crypto_is_hash(struct crypto_tfm *tfm)
+{
+        return (crypto_tfm_alg_type(tfm) & CRYPTO_ALG_TYPE_HASH_MASK)
+                == CRYPTO_ALG_TYPE_HASH;
+}
+
+#ifndef __CHECKER__
+# undef __cond_lock
+# define __cond_lock(x,c) (c)
+#endif
+
+#endif

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 08/16] DRBD: main
  2009-04-30 11:26             ` [PATCH 07/16] DRBD: internal_data_structures Philipp Reisner
@ 2009-04-30 11:26               ` Philipp Reisner
  2009-04-30 11:26                 ` [PATCH 09/16] DRBD: receiver Philipp Reisner
  0 siblings, 1 reply; 90+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg, Philipp Reisner

The DRBD state engine, and lots of other stuff, that does not have its own
source file.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

---
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
new file mode 100644
index 0000000..4a2593c
--- /dev/null
+++ b/drivers/block/drbd/drbd_main.c
@@ -0,0 +1,3542 @@
+/*
+   drbd.c
+
+   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+   Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.
+   Copyright (C) 1999-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+   Copyright (C) 2002-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+
+   drbd is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   drbd is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with drbd; see the file COPYING.  If not, write to
+   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+
+ */
+
+#include <linux/autoconf.h>
+#include <linux/module.h>
+#include <linux/version.h>
+
+#include <asm/uaccess.h>
+#include <asm/types.h>
+#include <net/sock.h>
+#include <linux/ctype.h>
+#include <linux/smp_lock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/proc_fs.h>
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/drbd_config.h>
+#include <linux/memcontrol.h>
+#include <linux/mm_inline.h>
+#include <linux/slab.h>
+#include <linux/random.h>
+#include <linux/reboot.h>
+#include <linux/notifier.h>
+#include <linux/kthread.h>
+
+#define __KERNEL_SYSCALLS__
+#include <linux/unistd.h>
+#include <linux/vmalloc.h>
+
+#include <linux/drbd.h>
+#include <linux/drbd_limits.h>
+#include "drbd_int.h"
+#include "drbd_tracing.h"
+#include "drbd_req.h" /* only for _req_mod in tl_release and tl_clear */
+
+#include "drbd_vli.h"
+
+struct after_state_chg_work {
+	struct drbd_work w;
+	union drbd_state os;
+	union drbd_state ns;
+	enum chg_state_flags flags;
+	struct completion *done;
+};
+
+int drbdd_init(struct drbd_thread *);
+int drbd_worker(struct drbd_thread *);
+int drbd_asender(struct drbd_thread *);
+
+int drbd_init(void);
+static int drbd_open(struct block_device *bdev, fmode_t mode);
+static int drbd_release(struct gendisk *gd, fmode_t mode);
+STATIC int w_after_state_ch(struct drbd_conf *mdev, struct drbd_work *w, int unused);
+STATIC void after_state_ch(struct drbd_conf *mdev, union drbd_state os,
+			   union drbd_state ns, enum chg_state_flags flags);
+STATIC int w_md_sync(struct drbd_conf *mdev, struct drbd_work *w, int unused);
+STATIC void md_sync_timer_fn(unsigned long data);
+STATIC int w_bitmap_io(struct drbd_conf *mdev, struct drbd_work *w, int unused);
+
+DEFINE_TRACE(drbd_unplug);
+DEFINE_TRACE(drbd_uuid);
+DEFINE_TRACE(drbd_ee);
+DEFINE_TRACE(drbd_packet);
+DEFINE_TRACE(drbd_md_io);
+DEFINE_TRACE(drbd_epoch);
+DEFINE_TRACE(drbd_netlink);
+DEFINE_TRACE(drbd_actlog);
+DEFINE_TRACE(drbd_bio);
+DEFINE_TRACE(_drbd_resync);
+DEFINE_TRACE(drbd_req);
+
+MODULE_AUTHOR("Philipp Reisner <phil@linbit.com>, "
+	      "Lars Ellenberg <lars@linbit.com>");
+MODULE_DESCRIPTION("drbd - Distributed Replicated Block Device v" REL_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_PARM_DESC(minor_count, "Maximum number of drbd devices (1-255)");
+MODULE_ALIAS_BLOCKDEV_MAJOR(DRBD_MAJOR);
+
+#include <linux/moduleparam.h>
+/* allow_open_on_secondary */
+MODULE_PARM_DESC(allow_oos, "DONT USE!");
+/* thanks to these macros, if compiled into the kernel (not-module),
+ * this becomes the boot parameter drbd.minor_count */
+module_param(minor_count, uint, 0444);
+module_param(disable_sendpage, bool, 0644);
+module_param(allow_oos, bool, 0);
+module_param(cn_idx, uint, 0444);
+module_param(proc_details, int, 0644);
+
+#ifdef DRBD_ENABLE_FAULTS
+int enable_faults;
+int fault_rate;
+static int fault_count;
+int fault_devs;
+/* bitmap of enabled faults */
+module_param(enable_faults, int, 0664);
+/* fault rate % value - applies to all enabled faults */
+module_param(fault_rate, int, 0664);
+/* count of faults inserted */
+module_param(fault_count, int, 0664);
+/* bitmap of devices to insert faults on */
+module_param(fault_devs, int, 0644);
+#endif
+
+/* module parameter, defined */
+unsigned int minor_count = 32;
+int disable_sendpage;
+int allow_oos;
+unsigned int cn_idx = CN_IDX_DRBD;
+int proc_details;       /* Detail level in /proc/drbd */
+
+/* Module parameter for setting the user mode helper program
+ * to run. Default is /sbin/drbdadm */
+char usermode_helper[80] = "/sbin/drbdadm";
+
+module_param_string(usermode_helper, usermode_helper, sizeof(usermode_helper), 0644);
+
+/* in 2.6.x, our device mapping and config info contains our virtual gendisks
+ * as member "struct gendisk *vdisk;"
+ */
+struct drbd_conf **minor_table;
+
+struct kmem_cache *drbd_request_cache;
+struct kmem_cache *drbd_ee_cache;
+mempool_t *drbd_request_mempool;
+mempool_t *drbd_ee_mempool;
+
+/* I do not use a standard mempool, because:
+   1) I want to hand out the preallocated objects first.
+   2) I want to be able to interrupt sleeping allocation with a signal.
+   Note: This is a single linked list, the next pointer is the private
+	 member of struct page.
+ */
+struct page *drbd_pp_pool;
+spinlock_t   drbd_pp_lock;
+int          drbd_pp_vacant;
+wait_queue_head_t drbd_pp_wait;
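
A user-space sketch of the list layout described in the comment above: a LIFO chain whose elements use their private field as the next pointer. Names, locking and the interruptible allocation are left out; the driver chains struct page under drbd_pp_lock:

	#include <stdio.h>

	struct page_stub {			/* stands in for struct page */
		unsigned long private;		/* holds the next pointer, as in the comment */
	};

	static struct page_stub *pool;		/* like drbd_pp_pool */
	static int vacant;			/* like drbd_pp_vacant */

	static void pp_push(struct page_stub *p)
	{
		p->private = (unsigned long)pool;	/* chain through ->private */
		pool = p;
		vacant++;
	}

	static struct page_stub *pp_pop(void)
	{
		struct page_stub *p = pool;
		if (!p)
			return NULL;			/* caller would then allocate or sleep */
		pool = (struct page_stub *)p->private;
		vacant--;
		return p;
	}

	int main(void)
	{
		struct page_stub a, b;
		int lifo_ok;

		pp_push(&a);
		pp_push(&b);
		printf("vacant after two pushes: %d\n", vacant);	/* 2 */
		lifo_ok = (pp_pop() == &b) && (pp_pop() == &a);
		printf("LIFO order ok: %d\n", lifo_ok);			/* 1 */
		return 0;
	}
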
+
+DEFINE_RATELIMIT_STATE(drbd_ratelimit_state, 5 * HZ, 5);
+
+STATIC struct block_device_operations drbd_ops = {
+	.owner =   THIS_MODULE,
+	.open =    drbd_open,
+	.release = drbd_release,
+};
+
+#define ARRY_SIZE(A) (sizeof(A)/sizeof(A[0]))
+
+#ifdef __CHECKER__
+/* When checking with sparse, if this is an inline function, sparse will
+   give tons of false positives. When this is a real function, sparse works.
+ */
+int _inc_local_if_state(struct drbd_conf *mdev, enum drbd_disk_state mins)
+{
+	int io_allowed;
+
+	atomic_inc(&mdev->local_cnt);
+	io_allowed = (mdev->state.disk >= mins);
+	if (!io_allowed) {
+		if (atomic_dec_and_test(&mdev->local_cnt))
+			wake_up(&mdev->misc_wait);
+	}
+	return io_allowed;
+}
+
+#endif
+
+/************************* The transfer log start */
+STATIC int tl_init(struct drbd_conf *mdev)
+{
+	struct drbd_tl_epoch *b;
+
+	b = kmalloc(sizeof(struct drbd_tl_epoch), GFP_KERNEL);
+	if (!b)
+		return 0;
+	INIT_LIST_HEAD(&b->requests);
+	INIT_LIST_HEAD(&b->w.list);
+	b->next = NULL;
+	b->br_number = 4711;
+	b->n_req = 0;
+	b->w.cb = NULL; /* if this is != NULL, we need to dec_ap_pending in tl_clear */
+
+	mdev->oldest_tle = b;
+	mdev->newest_tle = b;
+	INIT_LIST_HEAD(&mdev->out_of_sequence_requests);
+
+	mdev->tl_hash = NULL;
+	mdev->tl_hash_s = 0;
+
+	return 1;
+}
+
+STATIC void tl_cleanup(struct drbd_conf *mdev)
+{
+	D_ASSERT(mdev->oldest_tle == mdev->newest_tle);
+	D_ASSERT(list_empty(&mdev->out_of_sequence_requests));
+	kfree(mdev->oldest_tle);
+	mdev->oldest_tle = NULL;
+	kfree(mdev->unused_spare_tle);
+	mdev->unused_spare_tle = NULL;
+	kfree(mdev->tl_hash);
+	mdev->tl_hash = NULL;
+	mdev->tl_hash_s = 0;
+}
+
+/**
+ * _tl_add_barrier: Adds a barrier to the TL.
+ */
+void _tl_add_barrier(struct drbd_conf *mdev, struct drbd_tl_epoch *new)
+{
+	struct drbd_tl_epoch *newest_before;
+
+	INIT_LIST_HEAD(&new->requests);
+	INIT_LIST_HEAD(&new->w.list);
+	new->w.cb = NULL; /* if this is != NULL, we need to dec_ap_pending in tl_clear */
+	new->next = NULL;
+	new->n_req = 0;
+
+	newest_before = mdev->newest_tle;
+	/* never send a barrier number == 0, because that is special-cased
+	 * when using TCQ for our write ordering code */
+	new->br_number = (newest_before->br_number+1) ?: 1;
+	if (mdev->newest_tle != new) {
+		mdev->newest_tle->next = new;
+		mdev->newest_tle = new;
+	}
+}
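
The "(x + 1) ?: 1" expression above relies on gcc's binary ?: to skip barrier number 0 when the counter wraps; a standalone sketch:

	#include <stdio.h>

	int main(void)
	{
		unsigned int br = 0xffffffffu;		/* last barrier number before the wrap */
		unsigned int next = (br + 1) ?: 1;	/* unsigned wrap gives 0, ?: turns it into 1 */

		printf("next barrier number: %u\n", next);	/* prints 1, 0 stays reserved */
		return 0;
	}
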
+
+/* when we receive a barrier ack */
+void tl_release(struct drbd_conf *mdev, unsigned int barrier_nr,
+		       unsigned int set_size)
+{
+	struct drbd_tl_epoch *b, *nob; /* next old barrier */
+	struct list_head *le, *tle;
+	struct drbd_request *r;
+
+	spin_lock_irq(&mdev->req_lock);
+
+	b = mdev->oldest_tle;
+
+	/* first some paranoia code */
+	if (b == NULL) {
+		dev_err(DEV, "BAD! BarrierAck #%u received, but no epoch in tl!?\n",
+			barrier_nr);
+		goto bail;
+	}
+	if (b->br_number != barrier_nr) {
+		dev_err(DEV, "BAD! BarrierAck #%u received, expected #%u!\n",
+			barrier_nr, b->br_number);
+		goto bail;
+	}
+	if (b->n_req != set_size) {
+		dev_err(DEV, "BAD! BarrierAck #%u received with n_req=%u, expected n_req=%u!\n",
+			barrier_nr, set_size, b->n_req);
+		goto bail;
+	}
+
+	/* Clean up list of requests processed during current epoch */
+	list_for_each_safe(le, tle, &b->requests) {
+		r = list_entry(le, struct drbd_request, tl_requests);
+		_req_mod(r, barrier_acked, 0);
+	}
+	/* There could be requests on the list waiting for completion
+	   of the write to the local disk. To avoid corruption of the
+	   slab's data structures we have to remove the list's head.
+
+	   Also there could have been a barrier ack out of sequence, overtaking
+	   the write acks - which would be a bug and violate write ordering.
+	   To not deadlock in case we lose connection while such requests are
+	   still pending, we need some way to find them for the
+	   _req_mod(connection_lost_while_pending).
+
+	   These have been list_move'd to the out_of_sequence_requests list in
+	   _req_mod(, barrier_acked,) above.
+	   */
+	list_del_init(&b->requests);
+
+	nob = b->next;
+	if (test_and_clear_bit(CREATE_BARRIER, &mdev->flags)) {
+		_tl_add_barrier(mdev, b);
+		if (nob)
+			mdev->oldest_tle = nob;
+		/* if nob == NULL b was the only barrier, and becomes the new
+		   barrier. Therefore mdev->oldest_tle already points to b */
+	} else {
+		D_ASSERT(nob != NULL);
+		mdev->oldest_tle = nob;
+		kfree(b);
+	}
+
+	spin_unlock_irq(&mdev->req_lock);
+	dec_ap_pending(mdev);
+
+	return;
+
+bail:
+	spin_unlock_irq(&mdev->req_lock);
+	drbd_force_state(mdev, NS(conn, C_PROTOCOL_ERROR));
+}
+
+
+/* called by drbd_disconnect (exiting receiver thread)
+ * or from some after_state_ch */
+void tl_clear(struct drbd_conf *mdev)
+{
+	struct drbd_tl_epoch *b, *tmp;
+	struct list_head *le, *tle;
+	struct drbd_request *r;
+	int new_initial_bnr = net_random();
+
+	spin_lock_irq(&mdev->req_lock);
+
+	b = mdev->oldest_tle;
+	while (b) {
+		list_for_each_safe(le, tle, &b->requests) {
+			r = list_entry(le, struct drbd_request, tl_requests);
+			_req_mod(r, connection_lost_while_pending, 0);
+		}
+		tmp = b->next;
+
+		/* there could still be requests on that ring list,
+		 * in case local io is still pending */
+		list_del(&b->requests);
+
+		/* dec_ap_pending corresponding to queue_barrier.
+		 * the newest barrier may not have been queued yet,
+		 * in which case w.cb is still NULL. */
+		if (b->w.cb != NULL)
+			dec_ap_pending(mdev);
+
+		if (b == mdev->newest_tle) {
+			/* recycle, but reinit! */
+			D_ASSERT(tmp == NULL);
+			INIT_LIST_HEAD(&b->requests);
+			INIT_LIST_HEAD(&b->w.list);
+			b->w.cb = NULL;
+			b->br_number = new_initial_bnr;
+			b->n_req = 0;
+
+			mdev->oldest_tle = b;
+			break;
+		}
+		kfree(b);
+		b = tmp;
+	}
+
+	/* we expect this list to be empty. */
+	D_ASSERT(list_empty(&mdev->out_of_sequence_requests));
+
+	/* but just in case, clean it up anyways! */
+	list_for_each_safe(le, tle, &mdev->out_of_sequence_requests) {
+		r = list_entry(le, struct drbd_request, tl_requests);
+		_req_mod(r, connection_lost_while_pending, 0);
+	}
+
+	/* ensure bit indicating barrier is required is clear */
+	clear_bit(CREATE_BARRIER, &mdev->flags);
+
+	spin_unlock_irq(&mdev->req_lock);
+}
+
+/**
+ * drbd_io_error: Handles the on_io_error setting, should be called in the
+ * unlikely(!drbd_bio_uptodate(e->bio)) case from kernel thread context.
+ * See also drbd_chk_io_error
+ *
+ * NOTE: we set ourselves FAILED here if on_io_error is EP_DETACH or Panic OR
+ *	 if the forcedetach flag is set. This flag is set when failures
+ *	 occur writing the meta data portion of the disk as they are
+ *	 not recoverable.
+ */
+int drbd_io_error(struct drbd_conf *mdev, int forcedetach)
+{
+	enum drbd_io_error_p eh;
+	unsigned long flags;
+	int send;
+	int ok = 1;
+
+	eh = EP_PASS_ON;
+	if (inc_local_if_state(mdev, D_FAILED)) {
+		eh = mdev->bc->dc.on_io_error;
+		dec_local(mdev);
+	}
+
+	if (!forcedetach && eh == EP_PASS_ON)
+		return 1;
+
+	spin_lock_irqsave(&mdev->req_lock, flags);
+	send = (mdev->state.disk == D_FAILED);
+	if (send)
+		_drbd_set_state(_NS(mdev, disk, D_DISKLESS), CS_HARD, NULL);
+	spin_unlock_irqrestore(&mdev->req_lock, flags);
+
+	if (!send)
+		return ok;
+
+	if (mdev->state.conn >= C_CONNECTED) {
+		ok = drbd_send_state(mdev);
+		if (ok)
+			dev_warn(DEV, "Notified peer that my disk is broken.\n");
+		else
+			dev_err(DEV, "Sending state in drbd_io_error() failed\n");
+	}
+
+	/* Make sure we try to flush meta-data to disk - we come
+	 * in here because of a local disk error so it might fail
+	 * but we still need to try -- both because the error might
+	 * be in the data portion of the disk and because we need
+	 * to ensure the md-sync-timer is stopped if running. */
+	drbd_md_sync(mdev);
+
+	/* Releasing the backing device is done in after_state_ch() */
+
+	if (eh == EP_CALL_HELPER)
+		drbd_khelper(mdev, "local-io-error");
+
+	return ok;
+}
+
+/**
+ * cl_wide_st_chg:
+ * Returns TRUE if this state change should be performed as a cluster-wide
+ * transaction. Of course it returns 0 as soon as the connection is lost.
+ */
+STATIC int cl_wide_st_chg(struct drbd_conf *mdev,
+			  union drbd_state os, union drbd_state ns)
+{
+	return (os.conn >= C_CONNECTED && ns.conn >= C_CONNECTED &&
+		 ((os.role != R_PRIMARY && ns.role == R_PRIMARY) ||
+		  (os.conn != C_STARTING_SYNC_T && ns.conn == C_STARTING_SYNC_T) ||
+		  (os.conn != C_STARTING_SYNC_S && ns.conn == C_STARTING_SYNC_S) ||
+		  (os.disk != D_DISKLESS && ns.disk == D_DISKLESS))) ||
+		(os.conn >= C_CONNECTED && ns.conn == C_DISCONNECTING) ||
+		(os.conn == C_CONNECTED && ns.conn == C_VERIFY_S);
+}
+
+int drbd_change_state(struct drbd_conf *mdev, enum chg_state_flags f,
+		      union drbd_state mask, union drbd_state val)
+{
+	unsigned long flags;
+	union drbd_state os, ns;
+	int rv;
+
+	spin_lock_irqsave(&mdev->req_lock, flags);
+	os = mdev->state;
+	ns.i = (os.i & ~mask.i) | val.i;
+	rv = _drbd_set_state(mdev, ns, f, NULL);
+	ns = mdev->state;
+	spin_unlock_irqrestore(&mdev->req_lock, flags);
+
+	return rv;
+}
+
+void drbd_force_state(struct drbd_conf *mdev,
+	union drbd_state mask, union drbd_state val)
+{
+	drbd_change_state(mdev, CS_HARD, mask, val);
+}
+
+int is_valid_state(struct drbd_conf *mdev, union drbd_state ns);
+int is_valid_state_transition(struct drbd_conf *,
+	union drbd_state, union drbd_state);
+int drbd_send_state_req(struct drbd_conf *,
+	union drbd_state, union drbd_state);
+
+STATIC enum drbd_state_ret_codes _req_st_cond(struct drbd_conf *mdev,
+				    union drbd_state mask, union drbd_state val)
+{
+	union drbd_state os, ns;
+	unsigned long flags;
+	int rv;
+
+	if (test_and_clear_bit(CL_ST_CHG_SUCCESS, &mdev->flags))
+		return SS_CW_SUCCESS;
+
+	if (test_and_clear_bit(CL_ST_CHG_FAIL, &mdev->flags))
+		return SS_CW_FAILED_BY_PEER;
+
+	rv = 0;
+	spin_lock_irqsave(&mdev->req_lock, flags);
+	os = mdev->state;
+	ns.i = (os.i & ~mask.i) | val.i;
+	if (!cl_wide_st_chg(mdev, os, ns))
+		rv = SS_CW_NO_NEED;
+	if (!rv) {
+		rv = is_valid_state(mdev, ns);
+		if (rv == SS_SUCCESS) {
+			rv = is_valid_state_transition(mdev, ns, os);
+			if (rv == SS_SUCCESS)
+				rv = 0; /* cont waiting, otherwise fail. */
+		}
+	}
+	spin_unlock_irqrestore(&mdev->req_lock, flags);
+
+	return rv;
+}
+
+/**
+ * drbd_req_state:
+ * This function is the most graceful way to change state. For some state
+ * transitions this function even does a cluster-wide transaction.
+ * It has a cousin named drbd_request_state(), which is always verbose.
+ */
+STATIC int drbd_req_state(struct drbd_conf *mdev,
+			  union drbd_state mask, union drbd_state val,
+			  enum chg_state_flags f)
+{
+	struct completion done;
+	unsigned long flags;
+	union drbd_state os, ns;
+	int rv;
+
+	init_completion(&done);
+
+	if (f & CS_SERIALIZE)
+		mutex_lock(&mdev->state_mutex);
+
+	spin_lock_irqsave(&mdev->req_lock, flags);
+	os = mdev->state;
+	ns.i = (os.i & ~mask.i) | val.i;
+
+	if (cl_wide_st_chg(mdev, os, ns)) {
+		rv = is_valid_state(mdev, ns);
+		if (rv == SS_SUCCESS)
+			rv = is_valid_state_transition(mdev, ns, os);
+		spin_unlock_irqrestore(&mdev->req_lock, flags);
+
+		if (rv < SS_SUCCESS) {
+			if (f & CS_VERBOSE)
+				print_st_err(mdev, os, ns, rv);
+			goto abort;
+		}
+
+		drbd_state_lock(mdev);
+		if (!drbd_send_state_req(mdev, mask, val)) {
+			drbd_state_unlock(mdev);
+			rv = SS_CW_FAILED_BY_PEER;
+			if (f & CS_VERBOSE)
+				print_st_err(mdev, os, ns, rv);
+			goto abort;
+		}
+
+		wait_event(mdev->state_wait,
+			(rv = _req_st_cond(mdev, mask, val)));
+
+		if (rv < SS_SUCCESS) {
+			/* nearly dead code. */
+			drbd_state_unlock(mdev);
+			if (f & CS_VERBOSE)
+				print_st_err(mdev, os, ns, rv);
+			goto abort;
+		}
+		spin_lock_irqsave(&mdev->req_lock, flags);
+		os = mdev->state;
+		ns.i = (os.i & ~mask.i) | val.i;
+		rv = _drbd_set_state(mdev, ns, f, &done);
+		drbd_state_unlock(mdev);
+	} else {
+		rv = _drbd_set_state(mdev, ns, f, &done);
+	}
+
+	spin_unlock_irqrestore(&mdev->req_lock, flags);
+
+	if (f & CS_WAIT_COMPLETE && rv == SS_SUCCESS) {
+		D_ASSERT(current != mdev->worker.task);
+		wait_for_completion(&done);
+	}
+
+abort:
+	if (f & CS_SERIALIZE)
+		mutex_unlock(&mdev->state_mutex);
+
+	return rv;
+}
+
+/**
+ * _drbd_request_state:
+ * This function is the most graceful way to change state. For some state
+ * transitions this function even does a cluster-wide transaction.
+ * It has a cousin named drbd_request_state(), which is always verbose.
+ */
+int _drbd_request_state(struct drbd_conf *mdev,	union drbd_state mask,
+			union drbd_state val,	enum chg_state_flags f)
+{
+	int rv;
+
+	wait_event(mdev->state_wait,
+		   (rv = drbd_req_state(mdev, mask, val, f)) != SS_IN_TRANSIENT_STATE);
+
+	return rv;
+}
+
+STATIC void print_st(struct drbd_conf *mdev, char *name, union drbd_state ns)
+{
+	dev_err(DEV, " %s = { cs:%s ro:%s/%s ds:%s/%s %c%c%c%c }\n",
+	    name,
+	    conns_to_name(ns.conn),
+	    roles_to_name(ns.role),
+	    roles_to_name(ns.peer),
+	    disks_to_name(ns.disk),
+	    disks_to_name(ns.pdsk),
+	    ns.susp ? 's' : 'r',
+	    ns.aftr_isp ? 'a' : '-',
+	    ns.peer_isp ? 'p' : '-',
+	    ns.user_isp ? 'u' : '-'
+	    );
+}
+
+void print_st_err(struct drbd_conf *mdev,
+	union drbd_state os, union drbd_state ns, int err)
+{
+	if (err == SS_IN_TRANSIENT_STATE)
+		return;
+	dev_err(DEV, "State change failed: %s\n", set_st_err_name(err));
+	print_st(mdev, " state", os);
+	print_st(mdev, "wanted", ns);
+}
+
+
+#define peers_to_name roles_to_name
+#define pdsks_to_name disks_to_name
+
+#define susps_to_name(A)     ((A) ? "1" : "0")
+#define aftr_isps_to_name(A) ((A) ? "1" : "0")
+#define peer_isps_to_name(A) ((A) ? "1" : "0")
+#define user_isps_to_name(A) ((A) ? "1" : "0")
+
+#define PSC(A) \
+	({ if (ns.A != os.A) { \
+		pbp += sprintf(pbp, #A "( %s -> %s ) ", \
+			      A##s_to_name(os.A), \
+			      A##s_to_name(ns.A)); \
+	} })
+
+int is_valid_state(struct drbd_conf *mdev, union drbd_state ns)
+{
+	/* See drbd_state_sw_errors in drbd_strings.c */
+
+	enum drbd_fencing_p fp;
+	int rv = SS_SUCCESS;
+
+	fp = FP_DONT_CARE;
+	if (inc_local(mdev)) {
+		fp = mdev->bc->dc.fencing;
+		dec_local(mdev);
+	}
+
+	if (inc_net(mdev)) {
+		if (!mdev->net_conf->two_primaries &&
+		    ns.role == R_PRIMARY && ns.peer == R_PRIMARY)
+			rv = SS_TWO_PRIMARIES;
+		dec_net(mdev);
+	}
+
+	if (rv <= 0)
+		/* already found a reason to abort */;
+	else if (ns.role == R_SECONDARY && mdev->open_cnt)
+		rv = SS_DEVICE_IN_USE;
+
+	else if (ns.role == R_PRIMARY && ns.conn < C_CONNECTED && ns.disk < D_UP_TO_DATE)
+		rv = SS_NO_UP_TO_DATE_DISK;
+
+	else if (fp >= FP_RESOURCE &&
+		 ns.role == R_PRIMARY && ns.conn < C_CONNECTED && ns.pdsk >= D_UNKNOWN)
+		rv = SS_PRIMARY_NOP;
+
+	else if (ns.role == R_PRIMARY && ns.disk <= D_INCONSISTENT && ns.pdsk <= D_INCONSISTENT)
+		rv = SS_NO_UP_TO_DATE_DISK;
+
+	else if (ns.conn > C_CONNECTED && ns.disk < D_UP_TO_DATE && ns.pdsk < D_UP_TO_DATE)
+		rv = SS_BOTH_INCONSISTENT;
+
+	else if (ns.conn > C_CONNECTED && (ns.disk == D_DISKLESS || ns.pdsk == D_DISKLESS))
+		rv = SS_SYNCING_DISKLESS;
+
+	else if ((ns.conn == C_CONNECTED ||
+		  ns.conn == C_WF_BITMAP_S ||
+		  ns.conn == C_SYNC_SOURCE ||
+		  ns.conn == C_PAUSED_SYNC_S) &&
+		  ns.disk == D_OUTDATED)
+		rv = SS_CONNECTED_OUTDATES;
+
+	else if ((ns.conn == C_VERIFY_S || ns.conn == C_VERIFY_T) &&
+		 (mdev->sync_conf.verify_alg[0] == 0))
+		rv = SS_NO_VERIFY_ALG;
+
+	else if ((ns.conn == C_VERIFY_S || ns.conn == C_VERIFY_T) &&
+		  mdev->agreed_pro_version < 88)
+		rv = SS_NOT_SUPPORTED;
+
+	return rv;
+}
+
+int is_valid_state_transition(struct drbd_conf *mdev,
+	union drbd_state ns, union drbd_state os)
+{
+	int rv = SS_SUCCESS;
+
+	if ((ns.conn == C_STARTING_SYNC_T || ns.conn == C_STARTING_SYNC_S) &&
+	    os.conn > C_CONNECTED)
+		rv = SS_RESYNC_RUNNING;
+
+	if (ns.conn == C_DISCONNECTING && os.conn == C_STANDALONE)
+		rv = SS_ALREADY_STANDALONE;
+
+	if (ns.disk > D_ATTACHING && os.disk == D_DISKLESS)
+		rv = SS_IS_DISKLESS;
+
+	if (ns.conn == C_WF_CONNECTION && os.conn < C_UNCONNECTED)
+		rv = SS_NO_NET_CONFIG;
+
+	if (ns.disk == D_OUTDATED && os.disk < D_OUTDATED && os.disk != D_ATTACHING)
+		rv = SS_LOWER_THAN_OUTDATED;
+
+	if (ns.conn == C_DISCONNECTING && os.conn == C_UNCONNECTED)
+		rv = SS_IN_TRANSIENT_STATE;
+
+	if (ns.conn == os.conn && ns.conn == C_WF_REPORT_PARAMS)
+		rv = SS_IN_TRANSIENT_STATE;
+
+	if ((ns.conn == C_VERIFY_S || ns.conn == C_VERIFY_T) && os.conn < C_CONNECTED)
+		rv = SS_NEED_CONNECTION;
+
+	if ((ns.conn == C_VERIFY_S || ns.conn == C_VERIFY_T) &&
+	    ns.conn != os.conn && os.conn > C_CONNECTED)
+		rv = SS_RESYNC_RUNNING;
+
+	if ((ns.conn == C_STARTING_SYNC_S || ns.conn == C_STARTING_SYNC_T) &&
+	    os.conn < C_CONNECTED)
+		rv = SS_NEED_CONNECTION;
+
+	return rv;
+}
+
+int __drbd_set_state(struct drbd_conf *mdev,
+		    union drbd_state ns, enum chg_state_flags flags,
+		    struct completion *done)
+{
+	union drbd_state os;
+	int rv = SS_SUCCESS;
+	int warn_sync_abort = 0;
+	enum drbd_fencing_p fp;
+	struct after_state_chg_work *ascw;
+
+
+	os = mdev->state;
+
+	fp = FP_DONT_CARE;
+	if (inc_local(mdev)) {
+		fp = mdev->bc->dc.fencing;
+		dec_local(mdev);
+	}
+
+	/* Early state sanitising. */
+
+	/* Disallow network-error states while the device's network part is not configured */
+	if ((ns.conn >= C_TIMEOUT && ns.conn <= C_TEAR_DOWN) &&
+	    os.conn <= C_DISCONNECTING)
+		ns.conn = os.conn;
+
+	/* After a network error (+C_TEAR_DOWN) only C_UNCONNECTED or C_DISCONNECTING can follow */
+	if (os.conn >= C_TIMEOUT && os.conn <= C_TEAR_DOWN &&
+	    ns.conn != C_UNCONNECTED && ns.conn != C_DISCONNECTING)
+		ns.conn = os.conn;
+
+	/* After C_DISCONNECTING only C_STANDALONE may follow */
+	if (os.conn == C_DISCONNECTING && ns.conn != C_STANDALONE)
+		ns.conn = os.conn;
+
+	if (ns.conn < C_CONNECTED) {
+		ns.peer_isp = 0;
+		ns.peer = R_UNKNOWN;
+		if (ns.pdsk > D_UNKNOWN || ns.pdsk < D_INCONSISTENT)
+			ns.pdsk = D_UNKNOWN;
+	}
+
+	/* Clear the aftr_isp when becoming Unconfigured */
+	if (ns.conn == C_STANDALONE && ns.disk == D_DISKLESS && ns.role == R_SECONDARY)
+		ns.aftr_isp = 0;
+
+	if (ns.conn <= C_DISCONNECTING && ns.disk == D_DISKLESS)
+		ns.pdsk = D_UNKNOWN;
+
+	if (os.conn > C_CONNECTED && ns.conn > C_CONNECTED &&
+	    (ns.disk <= D_FAILED || ns.pdsk <= D_FAILED)) {
+		warn_sync_abort = 1;
+		ns.conn = C_CONNECTED;
+	}
+
+	if (ns.conn >= C_CONNECTED &&
+	    ((ns.disk == D_CONSISTENT || ns.disk == D_OUTDATED) ||
+	     (ns.disk == D_NEGOTIATING && ns.conn == C_WF_BITMAP_T))) {
+		switch (ns.conn) {
+		case C_WF_BITMAP_T:
+		case C_PAUSED_SYNC_T:
+			ns.disk = D_OUTDATED;
+			break;
+		case C_CONNECTED:
+		case C_WF_BITMAP_S:
+		case C_SYNC_SOURCE:
+		case C_PAUSED_SYNC_S:
+			ns.disk = D_UP_TO_DATE;
+			break;
+		case C_SYNC_TARGET:
+			ns.disk = D_INCONSISTENT;
+			dev_warn(DEV, "Implicitly set disk state Inconsistent!\n");
+			break;
+		}
+		if (os.disk == D_OUTDATED && ns.disk == D_UP_TO_DATE)
+			dev_warn(DEV, "Implicitly set disk from Outdated to UpToDate\n");
+	}
+
+	if (ns.conn >= C_CONNECTED &&
+	    (ns.pdsk == D_CONSISTENT || ns.pdsk == D_OUTDATED)) {
+		switch (ns.conn) {
+		case C_CONNECTED:
+		case C_WF_BITMAP_T:
+		case C_PAUSED_SYNC_T:
+		case C_SYNC_TARGET:
+			ns.pdsk = D_UP_TO_DATE;
+			break;
+		case C_WF_BITMAP_S:
+		case C_PAUSED_SYNC_S:
+			ns.pdsk = D_OUTDATED;
+			break;
+		case C_SYNC_SOURCE:
+			ns.pdsk = D_INCONSISTENT;
+			dev_warn(DEV, "Implicitly set pdsk Inconsistent!\n");
+			break;
+		}
+		if (os.pdsk == D_OUTDATED && ns.pdsk == D_UP_TO_DATE)
+			dev_warn(DEV, "Implicitly set pdsk from Outdated to UpToDate\n");
+	}
+
+	/* The connection broke down before we finished "Negotiating" */
+	if (ns.conn < C_CONNECTED && ns.disk == D_NEGOTIATING &&
+	    inc_local_if_state(mdev, D_NEGOTIATING)) {
+		if (mdev->ed_uuid == mdev->bc->md.uuid[UI_CURRENT]) {
+			ns.disk = mdev->new_state_tmp.disk;
+			ns.pdsk = mdev->new_state_tmp.pdsk;
+		} else {
+			dev_alert(DEV, "Connection lost while negotiating, no data!\n");
+			ns.disk = D_DISKLESS;
+			ns.pdsk = D_UNKNOWN;
+		}
+		dec_local(mdev);
+	}
+
+	if (fp == FP_STONITH &&
+	    (ns.role == R_PRIMARY &&
+	     ns.conn < C_CONNECTED &&
+	     ns.pdsk > D_OUTDATED))
+			ns.susp = 1;
+
+	if (ns.aftr_isp || ns.peer_isp || ns.user_isp) {
+		if (ns.conn == C_SYNC_SOURCE)
+			ns.conn = C_PAUSED_SYNC_S;
+		if (ns.conn == C_SYNC_TARGET)
+			ns.conn = C_PAUSED_SYNC_T;
+	} else {
+		if (ns.conn == C_PAUSED_SYNC_S)
+			ns.conn = C_SYNC_SOURCE;
+		if (ns.conn == C_PAUSED_SYNC_T)
+			ns.conn = C_SYNC_TARGET;
+	}
+
+	if (ns.i == os.i)
+		return SS_NOTHING_TO_DO;
+
+	if (!(flags & CS_HARD)) {
+		/*  pre-state-change checks ; only look at ns  */
+		/* See drbd_state_sw_errors in drbd_strings.c */
+
+		rv = is_valid_state(mdev, ns);
+		if (rv < SS_SUCCESS) {
+			/* If the old state was illegal as well, then let
+			   this happen...*/
+
+			if (is_valid_state(mdev, os) == rv) {
+				dev_err(DEV, "Considering state change from bad state. "
+				    "Error would be: '%s'\n",
+				    set_st_err_name(rv));
+				print_st(mdev, "old", os);
+				print_st(mdev, "new", ns);
+				rv = is_valid_state_transition(mdev, ns, os);
+			}
+		} else
+			rv = is_valid_state_transition(mdev, ns, os);
+	}
+
+	if (rv < SS_SUCCESS) {
+		if (flags & CS_VERBOSE)
+			print_st_err(mdev, os, ns, rv);
+		return rv;
+	}
+
+	if (warn_sync_abort)
+		dev_warn(DEV, "Resync aborted.\n");
+
+	{
+		char *pbp, pb[300];
+		pbp = pb;
+		*pbp = 0;
+		PSC(role);
+		PSC(peer);
+		PSC(conn);
+		PSC(disk);
+		PSC(pdsk);
+		PSC(susp);
+		PSC(aftr_isp);
+		PSC(peer_isp);
+		PSC(user_isp);
+		dev_info(DEV, "%s\n", pb);
+	}
+
+	/* solve the race between becoming unconfigured,
+	 * worker doing the cleanup, and
+	 * admin reconfiguring us:
+	 * on (re)configure, first set CONFIG_PENDING,
+	 * then wait for a potentially exiting worker,
+	 * start the worker, and schedule one no_op.
+	 * then proceed with configuration.
+	 */
+	if (ns.disk == D_DISKLESS &&
+	    ns.conn == C_STANDALONE &&
+	    ns.role == R_SECONDARY &&
+	    !test_and_set_bit(CONFIG_PENDING, &mdev->flags))
+		set_bit(DEVICE_DYING, &mdev->flags);
+
+	mdev->state.i = ns.i;
+	wake_up(&mdev->misc_wait);
+	wake_up(&mdev->state_wait);
+
+	/**   post-state-change actions   **/
+	if (os.conn >= C_SYNC_SOURCE   && ns.conn <= C_CONNECTED) {
+		set_bit(STOP_SYNC_TIMER, &mdev->flags);
+		mod_timer(&mdev->resync_timer, jiffies);
+	}
+
+	if ((os.conn == C_PAUSED_SYNC_T || os.conn == C_PAUSED_SYNC_S) &&
+	    (ns.conn == C_SYNC_TARGET  || ns.conn == C_SYNC_SOURCE)) {
+		dev_info(DEV, "Syncer continues.\n");
+		mdev->rs_paused += (long)jiffies-(long)mdev->rs_mark_time;
+		if (ns.conn == C_SYNC_TARGET) {
+			if (!test_and_clear_bit(STOP_SYNC_TIMER, &mdev->flags))
+				mod_timer(&mdev->resync_timer, jiffies);
+			/* This if (!test_bit) is only needed for the case
+			   that a device that has ceased to use its timer,
+			   i.e. is already in drbd_resync_finished(), gets
+			   paused and resumed. */
+		}
+	}
+
+	if ((os.conn == C_SYNC_TARGET  || os.conn == C_SYNC_SOURCE) &&
+	    (ns.conn == C_PAUSED_SYNC_T || ns.conn == C_PAUSED_SYNC_S)) {
+		dev_info(DEV, "Resync suspended\n");
+		mdev->rs_mark_time = jiffies;
+		if (ns.conn == C_PAUSED_SYNC_T)
+			set_bit(STOP_SYNC_TIMER, &mdev->flags);
+	}
+
+	if (os.conn == C_CONNECTED &&
+	    (ns.conn == C_VERIFY_S || ns.conn == C_VERIFY_T)) {
+		mdev->ov_position = 0;
+		mdev->ov_left  =
+		mdev->rs_total =
+		mdev->rs_mark_left = drbd_bm_bits(mdev);
+		mdev->rs_start     =
+		mdev->rs_mark_time = jiffies;
+		mdev->ov_last_oos_size = 0;
+		mdev->ov_last_oos_start = 0;
+
+		if (ns.conn == C_VERIFY_S)
+			mod_timer(&mdev->resync_timer, jiffies);
+	}
+
+	if (inc_local(mdev)) {
+		u32 mdf = mdev->bc->md.flags & ~(MDF_CONSISTENT|MDF_PRIMARY_IND|
+						 MDF_CONNECTED_IND|MDF_WAS_UP_TO_DATE|
+						 MDF_PEER_OUT_DATED|MDF_CRASHED_PRIMARY);
+
+		if (test_bit(CRASHED_PRIMARY, &mdev->flags))
+			mdf |= MDF_CRASHED_PRIMARY;
+		if (mdev->state.role == R_PRIMARY ||
+		    (mdev->state.pdsk < D_INCONSISTENT && mdev->state.peer == R_PRIMARY))
+			mdf |= MDF_PRIMARY_IND;
+		if (mdev->state.conn > C_WF_REPORT_PARAMS)
+			mdf |= MDF_CONNECTED_IND;
+		if (mdev->state.disk > D_INCONSISTENT)
+			mdf |= MDF_CONSISTENT;
+		if (mdev->state.disk > D_OUTDATED)
+			mdf |= MDF_WAS_UP_TO_DATE;
+		if (mdev->state.pdsk <= D_OUTDATED && mdev->state.pdsk >= D_INCONSISTENT)
+			mdf |= MDF_PEER_OUT_DATED;
+		if (mdf != mdev->bc->md.flags) {
+			mdev->bc->md.flags = mdf;
+			drbd_md_mark_dirty(mdev);
+		}
+		if (os.disk < D_CONSISTENT && ns.disk >= D_CONSISTENT)
+			drbd_set_ed_uuid(mdev, mdev->bc->md.uuid[UI_CURRENT]);
+		dec_local(mdev);
+	}
+
+	/* Peer was forced to D_UP_TO_DATE & R_PRIMARY, consider resyncing */
+	if (os.disk == D_INCONSISTENT && os.pdsk == D_INCONSISTENT &&
+	    os.peer == R_SECONDARY && ns.peer == R_PRIMARY)
+		set_bit(CONSIDER_RESYNC, &mdev->flags);
+
+	/* Receiver should clean up itself */
+	if (os.conn != C_DISCONNECTING && ns.conn == C_DISCONNECTING)
+		drbd_thread_stop_nowait(&mdev->receiver);
+
+	/* Now that the receiver has finished cleaning up, it should die */
+	if (os.conn != C_STANDALONE && ns.conn == C_STANDALONE)
+		drbd_thread_stop_nowait(&mdev->receiver);
+
+	/* Upon network failure, we need to restart the receiver. */
+	if (os.conn > C_TEAR_DOWN &&
+	    ns.conn <= C_TEAR_DOWN && ns.conn >= C_TIMEOUT)
+		drbd_thread_restart_nowait(&mdev->receiver);
+
+	ascw = kmalloc(sizeof(*ascw), GFP_ATOMIC);
+	if (ascw) {
+		ascw->os = os;
+		ascw->ns = ns;
+		ascw->flags = flags;
+		ascw->w.cb = w_after_state_ch;
+		ascw->done = done;
+		drbd_queue_work(&mdev->data.work, &ascw->w);
+	} else {
+		dev_warn(DEV, "Could not kmalloc an ascw\n");
+	}
+
+	return rv;
+}
+
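+/* Worker callback: perform the after-state-change actions and, if the
+ * caller asked to wait for them, signal their completion. */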
+STATIC int w_after_state_ch(struct drbd_conf *mdev, struct drbd_work *w, int unused)
+{
+	struct after_state_chg_work *ascw;
+
+	ascw = (struct after_state_chg_work *) w;
+	after_state_ch(mdev, ascw->os, ascw->ns, ascw->flags);
+	if (ascw->flags & CS_WAIT_COMPLETE) {
+		D_ASSERT(ascw->done != NULL);
+		complete(ascw->done);
+	}
+	kfree(ascw);
+
+	return 1;
+}
+
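+/* Called after the bitmap was written out for StartingSync: request
+ * WFSyncUUID (sync target) or start the resync (sync source), or fall
+ * back to Connected if writing the bitmap failed. */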
+static void abw_start_sync(struct drbd_conf *mdev, int rv)
+{
+	if (rv) {
+		dev_err(DEV, "Writing the bitmap failed, not starting resync.\n");
+		_drbd_request_state(mdev, NS(conn, C_CONNECTED), CS_VERBOSE);
+		return;
+	}
+
+	switch (mdev->state.conn) {
+	case C_STARTING_SYNC_T:
+		_drbd_request_state(mdev, NS(conn, C_WF_SYNC_UUID), CS_VERBOSE);
+		break;
+	case C_STARTING_SYNC_S:
+		drbd_start_resync(mdev, C_SYNC_SOURCE);
+		break;
+	}
+}
+
+STATIC void after_state_ch(struct drbd_conf *mdev, union drbd_state os,
+			   union drbd_state ns, enum chg_state_flags flags)
+{
+	enum drbd_fencing_p fp;
+
+	if (os.conn != C_CONNECTED && ns.conn == C_CONNECTED) {
+		clear_bit(CRASHED_PRIMARY, &mdev->flags);
+		if (mdev->p_uuid)
+			mdev->p_uuid[UI_FLAGS] &= ~((u64)2);
+	}
+
+	fp = FP_DONT_CARE;
+	if (inc_local(mdev)) {
+		fp = mdev->bc->dc.fencing;
+		dec_local(mdev);
+	}
+
+	/* Inform userspace about the change... */
+	drbd_bcast_state(mdev, ns);
+
+	if (!(os.role == R_PRIMARY && os.disk < D_UP_TO_DATE && os.pdsk < D_UP_TO_DATE) &&
+	    (ns.role == R_PRIMARY && ns.disk < D_UP_TO_DATE && ns.pdsk < D_UP_TO_DATE))
+		drbd_khelper(mdev, "pri-on-incon-degr");
+
+	/* Here we have the actions that are performed after a
+	   state change. This function might sleep */
+
+	if (fp == FP_STONITH && ns.susp) {
+		/* case 1: The outdate-peer handler was successful.
+		 * case 2: The connection was established again. */
+		if ((os.pdsk > D_OUTDATED  && ns.pdsk <= D_OUTDATED) ||
+		    (os.conn < C_CONNECTED && ns.conn >= C_CONNECTED)) {
+			tl_clear(mdev);
+			spin_lock_irq(&mdev->req_lock);
+			_drbd_set_state(_NS(mdev, susp, 0), CS_VERBOSE, NULL);
+			spin_unlock_irq(&mdev->req_lock);
+		}
+	}
+	/* Do not change the order of the if above and the two below... */
+	if (os.pdsk == D_DISKLESS && ns.pdsk > D_DISKLESS) {      /* attach on the peer */
+		drbd_send_uuids(mdev);
+		drbd_send_state(mdev);
+	}
+	if (os.conn != C_WF_BITMAP_S && ns.conn == C_WF_BITMAP_S)
+		drbd_queue_bitmap_io(mdev, &drbd_send_bitmap, NULL, "send_bitmap (WFBitMapS)");
+
+	/* Lost contact to peer's copy of the data */
+	if ((os.pdsk >= D_INCONSISTENT &&
+	     os.pdsk != D_UNKNOWN &&
+	     os.pdsk != D_OUTDATED)
+	&&  (ns.pdsk < D_INCONSISTENT ||
+	     ns.pdsk == D_UNKNOWN ||
+	     ns.pdsk == D_OUTDATED)) {
+		kfree(mdev->p_uuid);
+		mdev->p_uuid = NULL;
+		if (inc_local(mdev)) {
+			if ((ns.role == R_PRIMARY || ns.peer == R_PRIMARY) &&
+			    mdev->bc->md.uuid[UI_BITMAP] == 0 && ns.disk >= D_UP_TO_DATE) {
+				drbd_uuid_new_current(mdev);
+				drbd_send_uuids(mdev);
+			}
+			dec_local(mdev);
+		}
+	}
+
+	if (ns.pdsk < D_INCONSISTENT && inc_local(mdev)) {
+		if (ns.peer == R_PRIMARY && mdev->bc->md.uuid[UI_BITMAP] == 0)
+			drbd_uuid_new_current(mdev);
+
+		/* Diskless peer becomes secondary */
+		if (os.peer == R_PRIMARY && ns.peer == R_SECONDARY)
+			drbd_al_to_on_disk_bm(mdev);
+		dec_local(mdev);
+	}
+
+	/* Last part of the attaching process ... */
+	if (ns.conn >= C_CONNECTED &&
+	    os.disk == D_ATTACHING && ns.disk == D_NEGOTIATING) {
+		kfree(mdev->p_uuid); /* We expect to receive up-to-date UUIDs soon. */
+		mdev->p_uuid = NULL; /* ...to not use the old ones in the mean time */
+		drbd_send_sizes(mdev);  /* to start sync... */
+		drbd_send_uuids(mdev);
+		drbd_send_state(mdev);
+	}
+
+	/* We want to pause/continue resync, tell peer. */
+	if (ns.conn >= C_CONNECTED &&
+	     ((os.aftr_isp != ns.aftr_isp) ||
+	      (os.user_isp != ns.user_isp)))
+		drbd_send_state(mdev);
+
+	/* In case one of the isp bits got set, suspend other devices. */
+	if ((!os.aftr_isp && !os.peer_isp && !os.user_isp) &&
+	    (ns.aftr_isp || ns.peer_isp || ns.user_isp))
+		suspend_other_sg(mdev);
+
+	/* Make sure the peer gets informed about possible state
+	   changes (ISP bits) that happened while we were in WFReportParams. */
+	if (os.conn == C_WF_REPORT_PARAMS && ns.conn >= C_CONNECTED)
+		drbd_send_state(mdev);
+
+	/* We are in the process of starting a full sync... */
+	if ((os.conn != C_STARTING_SYNC_T && ns.conn == C_STARTING_SYNC_T) ||
+	    (os.conn != C_STARTING_SYNC_S && ns.conn == C_STARTING_SYNC_S))
+		drbd_queue_bitmap_io(mdev, &drbd_bmio_set_n_write, &abw_start_sync, "set_n_write from StartingSync");
+
+	/* We are invalidating ourselves... */
+	if (os.conn < C_CONNECTED && ns.conn < C_CONNECTED &&
+	    os.disk > D_INCONSISTENT && ns.disk == D_INCONSISTENT)
+		drbd_queue_bitmap_io(mdev, &drbd_bmio_set_n_write, NULL, "set_n_write from invalidate");
+
+	if (os.disk > D_DISKLESS && ns.disk == D_DISKLESS) {
+		/* since inc_local() only works as long as disk >= D_INCONSISTENT,
+		   and it is D_DISKLESS here, local_cnt can only decrease;
+		   it will eventually reach zero */
+		wait_event(mdev->misc_wait, !atomic_read(&mdev->local_cnt));
+
+		lc_free(mdev->resync);
+		mdev->resync = NULL;
+		lc_free(mdev->act_log);
+		mdev->act_log = NULL;
+		__no_warn(local,
+			drbd_free_bc(mdev->bc);
+			mdev->bc = NULL;);
+
+		if (mdev->md_io_tmpp)
+			__free_page(mdev->md_io_tmpp);
+	}
+
+	/* Disks got bigger while they were detached */
+	if (ns.disk > D_NEGOTIATING && ns.pdsk > D_NEGOTIATING &&
+	    test_and_clear_bit(RESYNC_AFTER_NEG, &mdev->flags)) {
+		if (ns.conn == C_CONNECTED)
+			resync_after_online_grow(mdev);
+	}
+
+	/* A resync finished or aborted, wake paused devices... */
+	if ((os.conn > C_CONNECTED && ns.conn <= C_CONNECTED) ||
+	    (os.peer_isp && !ns.peer_isp) ||
+	    (os.user_isp && !ns.user_isp))
+		resume_next_sg(mdev);
+
+	/* Upon network connection, we need to start the receiver */
+	if (os.conn == C_STANDALONE && ns.conn == C_UNCONNECTED)
+		drbd_thread_start(&mdev->receiver);
+
+	/* Terminate worker thread if we are unconfigured - it will be
+	   restarted as needed... */
+	if (ns.disk == D_DISKLESS &&
+	    ns.conn == C_STANDALONE &&
+	    ns.role == R_SECONDARY) {
+		if (os.aftr_isp != ns.aftr_isp)
+			resume_next_sg(mdev);
+		/* set in __drbd_set_state, unless CONFIG_PENDING was set */
+		if (test_bit(DEVICE_DYING, &mdev->flags))
+			drbd_thread_stop_nowait(&mdev->worker);
+	}
+
+	drbd_md_sync(mdev);
+}
+
+
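+/* Common thread entry point: run thi->function(), restart it if a
+ * restart was requested while it was exiting, and drop the module
+ * reference taken in drbd_thread_start() when it finally terminates. */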
+STATIC int drbd_thread_setup(void *arg)
+{
+	struct drbd_thread *thi = (struct drbd_thread *) arg;
+	struct drbd_conf *mdev = thi->mdev;
+	int retval;
+
+restart:
+	retval = thi->function(thi);
+
+	spin_lock(&thi->t_lock);
+
+	/* if the receiver has been "Exiting", the last thing it did
+	 * was set the conn state to "StandAlone".
+	 * If a re-connect request comes in now, the conn state goes to
+	 * C_UNCONNECTED, and the receiver thread will be "started" again.
+	 * drbd_thread_start needs to set "Restarting" in that case.
+	 * t_state check and assignment need to be within the same spinlock,
+	 * so either thread_start sees Exiting, and can remap to Restarting,
+	 * or thread_start sees None, and can proceed as normal.
+	 */
+
+	if (thi->t_state == Restarting) {
+		dev_info(DEV, "Restarting %s\n", current->comm);
+		thi->t_state = Running;
+		spin_unlock(&thi->t_lock);
+		goto restart;
+	}
+
+	thi->task = NULL;
+	thi->t_state = None;
+	smp_mb();
+	complete(&thi->stop);
+	spin_unlock(&thi->t_lock);
+
+	dev_info(DEV, "Terminating %s\n", current->comm);
+
+	/* Release mod reference taken when thread was started */
+	module_put(THIS_MODULE);
+	return retval;
+}
+
+STATIC void drbd_thread_init(struct drbd_conf *mdev, struct drbd_thread *thi,
+		      int (*func) (struct drbd_thread *))
+{
+	spin_lock_init(&thi->t_lock);
+	thi->task    = NULL;
+	thi->t_state = None;
+	thi->function = func;
+	thi->mdev = mdev;
+}
+
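+/* Start the given DRBD thread unless it is already running; if it is
+ * currently exiting, flag it for restart instead. Returns TRUE on
+ * success, FALSE if the kernel thread could not be created. */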
+int drbd_thread_start(struct drbd_thread *thi)
+{
+	struct drbd_conf *mdev = thi->mdev;
+	struct task_struct *nt;
+	const char *me =
+		thi == &mdev->receiver ? "receiver" :
+		thi == &mdev->asender  ? "asender"  :
+		thi == &mdev->worker   ? "worker"   : "NONSENSE";
+
+	spin_lock(&thi->t_lock);
+	switch (thi->t_state) {
+	case None:
+		dev_info(DEV, "Starting %s thread (from %s [%d])\n",
+				me, current->comm, current->pid);
+
+		/* Get ref on module for thread - this is released when thread exits */
+		if (!try_module_get(THIS_MODULE)) {
+			dev_err(DEV, "Failed to get module reference in drbd_thread_start\n");
+			spin_unlock(&thi->t_lock);
+			return FALSE;
+		}
+
+		D_ASSERT(thi->task == NULL);
+		thi->reset_cpu_mask = 1;
+		thi->t_state = Running;
+		spin_unlock(&thi->t_lock);
+		flush_signals(current); /* otherwise we may get -ERESTARTNOINTR */
+
+		nt = kthread_create(drbd_thread_setup, (void *) thi,
+				    "drbd%d_%s", mdev_to_minor(mdev), me);
+
+		if (IS_ERR(nt)) {
+			dev_err(DEV, "Couldn't start thread\n");
+
+			module_put(THIS_MODULE);
+			return FALSE;
+		}
+		spin_lock(&thi->t_lock);
+		thi->task = nt;
+		thi->t_state = Running;
+		spin_unlock(&thi->t_lock);
+		wake_up_process(nt);
+		break;
+	case Exiting:
+		thi->t_state = Restarting;
+		dev_info(DEV, "Restarting %s thread (from %s [%d])\n",
+				me, current->comm, current->pid);
+		/* fall through */
+	case Running:
+	case Restarting:
+	default:
+		spin_unlock(&thi->t_lock);
+		break;
+	}
+
+	return TRUE;
+}
+
+
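+/* Ask a DRBD thread to exit (or to restart), signal it, and optionally
+ * wait until it has actually terminated. */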
+void _drbd_thread_stop(struct drbd_thread *thi, int restart, int wait)
+{
+	enum drbd_thread_state ns = restart ? Restarting : Exiting;
+
+	spin_lock(&thi->t_lock);
+
+	if (thi->t_state == None) {
+		spin_unlock(&thi->t_lock);
+		if (restart)
+			drbd_thread_start(thi);
+		return;
+	}
+
+	if (thi->t_state != ns) {
+		if (thi->task == NULL) {
+			spin_unlock(&thi->t_lock);
+			return;
+		}
+
+		thi->t_state = ns;
+		smp_mb();
+		init_completion(&thi->stop);
+		if (thi->task != current)
+			force_sig(DRBD_SIGKILL, thi->task);
+
+	}
+
+	spin_unlock(&thi->t_lock);
+
+	if (wait)
+		wait_for_completion(&thi->stop);
+}
+
+#ifdef CONFIG_SMP
+/**
+ * drbd_calc_cpu_mask: Generates CPU masks, spread over all CPUs.
+ * Forces all threads of a device onto the same CPU. This is beneficial for
+ * DRBD's performance. May be overridden by the user's configuration.
+ */
+cpumask_t drbd_calc_cpu_mask(struct drbd_conf *mdev)
+{
+	int sv, cpu;
+	cpumask_t av_cpu_m;
+
+	if (cpus_weight(mdev->cpu_mask))
+		return mdev->cpu_mask;
+
+	av_cpu_m = cpu_online_map;
+	sv = mdev_to_minor(mdev) % cpus_weight(av_cpu_m);
+
+	for_each_cpu_mask(cpu, av_cpu_m) {
+		if (sv-- == 0)
+			return cpumask_of_cpu(cpu);
+	}
+
+	/* some kernel versions "forget" to add the (cpumask_t) typecast
+	 * to that macro, which results in "parse error before '{'" ;-> */
+	return (cpumask_t) CPU_MASK_ALL; /* Never reached. */
+}
+
+/* modifies the cpu mask of the _current_ thread,
+ * call in the "main loop" of _all_ threads.
+ * no need for any mutex, current won't die prematurely.
+ */
+void drbd_thread_current_set_cpu(struct drbd_conf *mdev)
+{
+	struct task_struct *p = current;
+	struct drbd_thread *thi =
+		p == mdev->asender.task  ? &mdev->asender  :
+		p == mdev->receiver.task ? &mdev->receiver :
+		p == mdev->worker.task   ? &mdev->worker   :
+		NULL;
+	ERR_IF(thi == NULL)
+		return;
+	if (!thi->reset_cpu_mask)
+		return;
+	thi->reset_cpu_mask = 0;
+	/* preempt_disable();
+	   There was a kernel that warned about a call to smp_processor_id() while
+	   preemption was not disabled. It seems that this was fixed in mainline. */
+	set_cpus_allowed(p, mdev->cpu_mask);
+	/* preempt_enable(); */
+}
+#endif
+
+/* the appropriate socket mutex must be held already */
+int _drbd_send_cmd(struct drbd_conf *mdev, struct socket *sock,
+			  enum drbd_packets cmd, struct p_header *h,
+			  size_t size, unsigned msg_flags)
+{
+	int sent, ok;
+
+	ERR_IF(!h) return FALSE;
+	ERR_IF(!size) return FALSE;
+
+	h->magic   = BE_DRBD_MAGIC;
+	h->command = cpu_to_be16(cmd);
+	h->length  = cpu_to_be16(size-sizeof(struct p_header));
+
+	trace_drbd_packet(mdev, sock, 0, (void *)h, __FILE__, __LINE__);
+	sent = drbd_send(mdev, sock, h, size, msg_flags);
+
+	ok = (sent == size);
+	if (!ok)
+		dev_err(DEV, "short sent %s size=%d sent=%d\n",
+		    cmdname(cmd), (int)size, sent);
+	return ok;
+}
+
+/* don't pass the socket. we may only look at it
+ * when we hold the appropriate socket mutex.
+ */
+int drbd_send_cmd(struct drbd_conf *mdev, int use_data_socket,
+		  enum drbd_packets cmd, struct p_header *h, size_t size)
+{
+	int ok = 0;
+	struct socket *sock;
+
+	if (use_data_socket) {
+		mutex_lock(&mdev->data.mutex);
+		sock = mdev->data.socket;
+	} else {
+		mutex_lock(&mdev->meta.mutex);
+		sock = mdev->meta.socket;
+	}
+
+	/* drbd_disconnect() could have called drbd_free_sock()
+	 * while we were waiting for the mutex... */
+	if (likely(sock != NULL))
+		ok = _drbd_send_cmd(mdev, sock, cmd, h, size, 0);
+
+	if (use_data_socket)
+		mutex_unlock(&mdev->data.mutex);
+	else
+		mutex_unlock(&mdev->meta.mutex);
+	return ok;
+}
+
+int drbd_send_cmd2(struct drbd_conf *mdev, enum drbd_packets cmd, char *data,
+		   size_t size)
+{
+	struct p_header h;
+	int ok;
+
+	h.magic   = BE_DRBD_MAGIC;
+	h.command = cpu_to_be16(cmd);
+	h.length  = cpu_to_be16(size);
+
+	if (!drbd_get_data_sock(mdev))
+		return 0;
+
+	trace_drbd_packet(mdev, mdev->data.socket, 0, (void *)&h, __FILE__, __LINE__);
+
+	ok = (sizeof(h) ==
+		drbd_send(mdev, mdev->data.socket, &h, sizeof(h), 0));
+	ok = ok && (size ==
+		drbd_send(mdev, mdev->data.socket, data, size, 0));
+
+	drbd_put_data_sock(mdev);
+
+	return ok;
+}
+
+int drbd_send_sync_param(struct drbd_conf *mdev, struct syncer_conf *sc)
+{
+	struct p_rs_param_89 *p;
+	struct socket *sock;
+	int size, rv;
+	const int apv = mdev->agreed_pro_version;
+
+	size = apv <= 87 ? sizeof(struct p_rs_param)
+		: apv == 88 ? sizeof(struct p_rs_param)
+			+ strlen(mdev->sync_conf.verify_alg) + 1
+		: /* 89 */    sizeof(struct p_rs_param_89);
+
+	/* used from admin command context and receiver/worker context.
+	 * to avoid kmalloc, grab the socket right here,
+	 * then use the pre-allocated sbuf there */
+	mutex_lock(&mdev->data.mutex);
+	sock = mdev->data.socket;
+
+	if (likely(sock != NULL)) {
+		enum drbd_packets cmd = apv >= 89 ? P_SYNC_PARAM89 : P_SYNC_PARAM;
+
+		p = &mdev->data.sbuf.rs_param_89;
+
+		/* initialize verify_alg and csums_alg */
+		memset(p->verify_alg, 0, 2 * SHARED_SECRET_MAX);
+
+		p->rate = cpu_to_be32(sc->rate);
+
+		if (apv >= 88)
+			strcpy(p->verify_alg, mdev->sync_conf.verify_alg);
+		if (apv >= 89)
+			strcpy(p->csums_alg, mdev->sync_conf.csums_alg);
+
+		rv = _drbd_send_cmd(mdev, sock, cmd, &p->head, size, 0);
+	} else
+		rv = 0; /* not ok */
+
+	mutex_unlock(&mdev->data.mutex);
+
+	return rv;
+}
+
+int drbd_send_protocol(struct drbd_conf *mdev)
+{
+	struct p_protocol *p;
+	int size, rv;
+
+	size = sizeof(struct p_protocol);
+
+	if (mdev->agreed_pro_version >= 87)
+		size += strlen(mdev->net_conf->integrity_alg) + 1;
+
+	p = kmalloc(size, GFP_KERNEL);
+	if (p == NULL)
+		return 0;
+
+	p->protocol      = cpu_to_be32(mdev->net_conf->wire_protocol);
+	p->after_sb_0p   = cpu_to_be32(mdev->net_conf->after_sb_0p);
+	p->after_sb_1p   = cpu_to_be32(mdev->net_conf->after_sb_1p);
+	p->after_sb_2p   = cpu_to_be32(mdev->net_conf->after_sb_2p);
+	p->want_lose     = cpu_to_be32(mdev->net_conf->want_lose);
+	p->two_primaries = cpu_to_be32(mdev->net_conf->two_primaries);
+
+	if (mdev->agreed_pro_version >= 87)
+		strcpy(p->integrity_alg, mdev->net_conf->integrity_alg);
+
+	rv = drbd_send_cmd(mdev, USE_DATA_SOCKET, P_PROTOCOL,
+			   (struct p_header *)p, size);
+	kfree(p);
+	return rv;
+}
+
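+/* Send our generation UUIDs, the number of bits currently set in the
+ * bitmap and a few flag bits to the peer. */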
+int _drbd_send_uuids(struct drbd_conf *mdev, u64 uuid_flags)
+{
+	struct p_uuids p;
+	int i;
+
+	if (!inc_local_if_state(mdev, D_NEGOTIATING))
+		return 1;
+
+	for (i = UI_CURRENT; i < UI_SIZE; i++)
+		p.uuid[i] = mdev->bc ? cpu_to_be64(mdev->bc->md.uuid[i]) : 0;
+
+	mdev->comm_bm_set = drbd_bm_total_weight(mdev);
+	p.uuid[UI_SIZE] = cpu_to_be64(mdev->comm_bm_set);
+	uuid_flags |= mdev->net_conf->want_lose ? 1 : 0;
+	uuid_flags |= test_bit(CRASHED_PRIMARY, &mdev->flags) ? 2 : 0;
+	uuid_flags |= mdev->new_state_tmp.disk == D_INCONSISTENT ? 4 : 0;
+	p.uuid[UI_FLAGS] = cpu_to_be64(uuid_flags);
+
+	dec_local(mdev);
+
+	return drbd_send_cmd(mdev, USE_DATA_SOCKET, P_UUIDS,
+			     (struct p_header *)&p, sizeof(p));
+}
+
+int drbd_send_uuids(struct drbd_conf *mdev)
+{
+	return _drbd_send_uuids(mdev, 0);
+}
+
+int drbd_send_uuids_skip_initial_sync(struct drbd_conf *mdev)
+{
+	return _drbd_send_uuids(mdev, 8);
+}
+
+
+int drbd_send_sync_uuid(struct drbd_conf *mdev, u64 val)
+{
+	struct p_rs_uuid p;
+
+	p.uuid = cpu_to_be64(val);
+
+	return drbd_send_cmd(mdev, USE_DATA_SOCKET, P_SYNC_UUID,
+			     (struct p_header *)&p, sizeof(p));
+}
+
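+/* Tell the peer about our backing device size, the user configured
+ * size limit, our current capacity and our queue limits. */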
+int drbd_send_sizes(struct drbd_conf *mdev)
+{
+	struct p_sizes p;
+	sector_t d_size, u_size;
+	int q_order_type;
+	int ok;
+
+	if (inc_local_if_state(mdev, D_NEGOTIATING)) {
+		D_ASSERT(mdev->bc->backing_bdev);
+		d_size = drbd_get_max_capacity(mdev->bc);
+		u_size = mdev->bc->dc.disk_size;
+		q_order_type = drbd_queue_order_type(mdev);
+		p.queue_order_type = cpu_to_be32(drbd_queue_order_type(mdev));
+		dec_local(mdev);
+	} else {
+		d_size = 0;
+		u_size = 0;
+		q_order_type = QUEUE_ORDERED_NONE;
+	}
+
+	p.d_size = cpu_to_be64(d_size);
+	p.u_size = cpu_to_be64(u_size);
+	p.c_size = cpu_to_be64(drbd_get_capacity(mdev->this_bdev));
+	p.max_segment_size = cpu_to_be32(mdev->rq_queue->max_segment_size);
+	p.queue_order_type = cpu_to_be32(q_order_type);
+
+	ok = drbd_send_cmd(mdev, USE_DATA_SOCKET, P_SIZES,
+			   (struct p_header *)&p, sizeof(p));
+	return ok;
+}
+
+/**
+ * drbd_send_state:
+ * Informs the peer about our state. Only call it when
+ * mdev->state.conn >= C_CONNECTED (i.e. you may not call it while in
+ * WFReportParams). There is one valid and necessary exception:
+ * drbd_connect() calls drbd_send_state() while in WFReportParams.
+ */
+int drbd_send_state(struct drbd_conf *mdev)
+{
+	struct socket *sock;
+	struct p_state p;
+	int ok = 0;
+
+	/* Grab the state lock so we won't send state if we're in the middle
+	 * of a cluster-wide state change on another thread */
+	drbd_state_lock(mdev);
+
+	mutex_lock(&mdev->data.mutex);
+
+	p.state = cpu_to_be32(mdev->state.i); /* Within the send mutex */
+	sock = mdev->data.socket;
+
+	if (likely(sock != NULL)) {
+		ok = _drbd_send_cmd(mdev, sock, P_STATE,
+				    (struct p_header *)&p, sizeof(p), 0);
+	}
+
+	mutex_unlock(&mdev->data.mutex);
+
+	drbd_state_unlock(mdev);
+	return ok;
+}
+
+int drbd_send_state_req(struct drbd_conf *mdev,
+	union drbd_state mask, union drbd_state val)
+{
+	struct p_req_state p;
+
+	p.mask    = cpu_to_be32(mask.i);
+	p.val     = cpu_to_be32(val.i);
+
+	return drbd_send_cmd(mdev, USE_DATA_SOCKET, P_STATE_CHG_REQ,
+			     (struct p_header *)&p, sizeof(p));
+}
+
+int drbd_send_sr_reply(struct drbd_conf *mdev, int retcode)
+{
+	struct p_req_state_reply p;
+
+	p.retcode    = cpu_to_be32(retcode);
+
+	return drbd_send_cmd(mdev, USE_META_SOCKET, P_STATE_CHG_REPLY,
+			     (struct p_header *)&p, sizeof(p));
+}
+
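+/* Compress a chunk of the bitmap into p->code using run length
+ * encoding of the plain bits plus variable length integer encoding of
+ * the run lengths. Returns the number of code bytes produced, 0 if
+ * RLE cannot or should not be used for this chunk, or -1 on error. */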
+int fill_bitmap_rle_bits(struct drbd_conf *mdev,
+	struct p_compressed_bm *p,
+	struct bm_xfer_ctx *c)
+{
+	struct bitstream bs;
+	unsigned long plain_bits;
+	unsigned long tmp;
+	unsigned long rl;
+	unsigned len;
+	unsigned toggle;
+	int bits;
+
+	/* may we use this feature? */
+	if ((mdev->sync_conf.use_rle_encoding == 0) ||
+		(mdev->agreed_pro_version < 90))
+			return 0;
+
+	if (c->bit_offset >= c->bm_bits)
+		return 0; /* nothing to do. */
+
+	/* use at most this many bytes */
+	bitstream_init(&bs, p->code, BM_PACKET_VLI_BYTES_MAX, 0);
+	memset(p->code, 0, BM_PACKET_VLI_BYTES_MAX);
+	/* plain bits covered in this code string */
+	plain_bits = 0;
+
+	/* p->encoding & 0x80 stores whether the first run length is set.
+	 * bit offset is implicit.
+	 * start with toggle == 2 to be able to tell the first iteration */
+	toggle = 2;
+
+	/* see how many plain bits we can stuff into one packet
+	 * using RLE and VLI. */
+	do {
+		tmp = (toggle == 0) ? _drbd_bm_find_next_zero(mdev, c->bit_offset)
+				    : _drbd_bm_find_next(mdev, c->bit_offset);
+		if (tmp == -1UL)
+			tmp = c->bm_bits;
+		rl = tmp - c->bit_offset;
+
+		if (toggle == 2) { /* first iteration */
+			if (rl == 0) {
+				/* the first checked bit was set,
+				 * store start value, */
+				DCBP_set_start(p, 1);
+				/* but skip encoding of zero run length */
+				toggle = !toggle;
+				continue;
+			}
+			DCBP_set_start(p, 0);
+		}
+
+		/* paranoia: catch zero runlength.
+		 * can only happen if bitmap is modified while we scan it. */
+		if (rl == 0) {
+			dev_err(DEV, "unexpected zero runlength while encoding bitmap "
+			    "t:%u bo:%lu\n", toggle, c->bit_offset);
+			return -1;
+		}
+
+		bits = vli_encode_bits(&bs, rl);
+		if (bits == -ENOBUFS) /* buffer full */
+			break;
+		if (bits <= 0) {
+			dev_err(DEV, "error while encoding bitmap: %d\n", bits);
+			return 0;
+		}
+
+		toggle = !toggle;
+		plain_bits += rl;
+		c->bit_offset = tmp;
+	} while (c->bit_offset < c->bm_bits);
+
+	len = bs.cur.b - p->code + !!bs.cur.bit;
+
+	if (plain_bits < (len << 3)) {
+		/* incompressible with this method.
+		 * we need to rewind both word and bit position. */
+		c->bit_offset -= plain_bits;
+		bm_xfer_ctx_bit_to_word_offset(c);
+		c->bit_offset = c->word_offset * BITS_PER_LONG;
+		return 0;
+	}
+
+	/* RLE + VLI was able to compress it just fine.
+	 * update c->word_offset. */
+	bm_xfer_ctx_bit_to_word_offset(c);
+
+	/* store pad_bits */
+	DCBP_set_pad_bits(p, (8 - bs.cur.bit) & 0x7);
+
+	return len;
+}
+
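+/* Send the next chunk of the bitmap, compressed if that saves space,
+ * as plain bits otherwise. Returns OK while there is more to send,
+ * DONE when the whole bitmap has been transferred, FAILED on a send
+ * error. */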
+enum { OK, FAILED, DONE }
+send_bitmap_rle_or_plain(struct drbd_conf *mdev,
+	struct p_header *h, struct bm_xfer_ctx *c)
+{
+	struct p_compressed_bm *p = (void*)h;
+	unsigned long num_words;
+	int len;
+	int ok;
+
+	len = fill_bitmap_rle_bits(mdev, p, c);
+
+	if (len < 0)
+		return FAILED;
+
+	if (len) {
+		DCBP_set_code(p, RLE_VLI_Bits);
+		ok = _drbd_send_cmd(mdev, mdev->data.socket, P_COMPRESSED_BITMAP, h,
+			sizeof(*p) + len, 0);
+
+		c->packets[0]++;
+		c->bytes[0] += sizeof(*p) + len;
+
+		if (c->bit_offset >= c->bm_bits)
+			len = 0; /* DONE */
+	} else {
+		/* was not compressible.
+		 * send a buffer full of plain text bits instead. */
+		num_words = min_t(size_t, BM_PACKET_WORDS, c->bm_words - c->word_offset);
+		len = num_words * sizeof(long);
+		if (len)
+			drbd_bm_get_lel(mdev, c->word_offset, num_words, (unsigned long*)h->payload);
+		ok = _drbd_send_cmd(mdev, mdev->data.socket, P_BITMAP,
+				   h, sizeof(struct p_header) + len, 0);
+		c->word_offset += num_words;
+		c->bit_offset = c->word_offset * BITS_PER_LONG;
+
+		c->packets[1]++;
+		c->bytes[1] += sizeof(struct p_header) + len;
+
+		if (c->bit_offset > c->bm_bits)
+			c->bit_offset = c->bm_bits;
+	}
+	ok = ok ? ((len == 0) ? DONE : OK) : FAILED;
+
+	if (ok == DONE)
+		INFO_bm_xfer_stats(mdev, "send", c);
+	return ok;
+}
+
+/* See the comment at receive_bitmap() */
+int _drbd_send_bitmap(struct drbd_conf *mdev)
+{
+	struct bm_xfer_ctx c;
+	struct p_header *p;
+	int ret;
+
+	ERR_IF(!mdev->bitmap) return FALSE;
+
+	/* maybe we should use some per thread scratch page,
+	 * and allocate that during initial device creation? */
+	p = (struct p_header *) __get_free_page(GFP_NOIO);
+	if (!p) {
+		dev_err(DEV, "failed to allocate one page buffer in %s\n", __func__);
+		return FALSE;
+	}
+
+	if (inc_local(mdev)) {
+		if (drbd_md_test_flag(mdev->bc, MDF_FULL_SYNC)) {
+			dev_info(DEV, "Writing the whole bitmap, MDF_FullSync was set.\n");
+			drbd_bm_set_all(mdev);
+			if (drbd_bm_write(mdev)) {
+				/* write_bm did fail! Leave the full sync flag set in the meta data,
+				 * but otherwise process as per normal - need to tell other
+				 * side that a full resync is required! */
+				dev_err(DEV, "Failed to write bitmap to disk!\n");
+			} else {
+				drbd_md_clear_flag(mdev, MDF_FULL_SYNC);
+				drbd_md_sync(mdev);
+			}
+		}
+		dec_local(mdev);
+	}
+
+	c = (struct bm_xfer_ctx) {
+		.bm_bits = drbd_bm_bits(mdev),
+		.bm_words = drbd_bm_words(mdev),
+	};
+
+	do {
+		ret = send_bitmap_rle_or_plain(mdev, p, &c);
+	} while (ret == OK);
+
+	free_page((unsigned long) p);
+	return (ret == DONE);
+}
+
+int drbd_send_bitmap(struct drbd_conf *mdev)
+{
+	int err;
+
+	if (!drbd_get_data_sock(mdev))
+		return -1;
+	err = !_drbd_send_bitmap(mdev);
+	drbd_put_data_sock(mdev);
+	return err;
+}
+
+int drbd_send_b_ack(struct drbd_conf *mdev, u32 barrier_nr, u32 set_size)
+{
+	int ok;
+	struct p_barrier_ack p;
+
+	p.barrier  = barrier_nr;
+	p.set_size = cpu_to_be32(set_size);
+
+	if (mdev->state.conn < C_CONNECTED)
+		return FALSE;
+	ok = drbd_send_cmd(mdev, USE_META_SOCKET, P_BARRIER_ACK,
+			(struct p_header *)&p, sizeof(p));
+	return ok;
+}
+
+/**
+ * _drbd_send_ack:
+ * This helper function expects the sector and block_id parameters already
+ * in big endian!
+ */
+STATIC int _drbd_send_ack(struct drbd_conf *mdev, enum drbd_packets cmd,
+			  u64 sector,
+			  u32 blksize,
+			  u64 block_id)
+{
+	int ok;
+	struct p_block_ack p;
+
+	p.sector   = sector;
+	p.block_id = block_id;
+	p.blksize  = blksize;
+	p.seq_num  = cpu_to_be32(atomic_add_return(1, &mdev->packet_seq));
+
+	if (!mdev->meta.socket || mdev->state.conn < C_CONNECTED)
+		return FALSE;
+	ok = drbd_send_cmd(mdev, USE_META_SOCKET, cmd,
+				(struct p_header *)&p, sizeof(p));
+	return ok;
+}
+
+int drbd_send_ack_dp(struct drbd_conf *mdev, enum drbd_packets cmd,
+		     struct p_data *dp)
+{
+	const int header_size = sizeof(struct p_data)
+			      - sizeof(struct p_header);
+	int data_size  = ((struct p_header *)dp)->length - header_size;
+
+	return _drbd_send_ack(mdev, cmd, dp->sector, cpu_to_be32(data_size),
+			      dp->block_id);
+}
+
+int drbd_send_ack_rp(struct drbd_conf *mdev, enum drbd_packets cmd,
+		     struct p_block_req *rp)
+{
+	return _drbd_send_ack(mdev, cmd, rp->sector, rp->blksize, rp->block_id);
+}
+
+int drbd_send_ack(struct drbd_conf *mdev,
+	enum drbd_packets cmd, struct drbd_epoch_entry *e)
+{
+	return _drbd_send_ack(mdev, cmd,
+			      cpu_to_be64(e->sector),
+			      cpu_to_be32(e->size),
+			      e->block_id);
+}
+
+/* This function misuses the block_id field to signal if the blocks
+ * are in sync or not. */
+int drbd_send_ack_ex(struct drbd_conf *mdev, enum drbd_packets cmd,
+		     sector_t sector, int blksize, u64 block_id)
+{
+	return _drbd_send_ack(mdev, cmd,
+			      cpu_to_be64(sector),
+			      cpu_to_be32(blksize),
+			      cpu_to_be64(block_id));
+}
+
+int drbd_send_drequest(struct drbd_conf *mdev, int cmd,
+		       sector_t sector, int size, u64 block_id)
+{
+	int ok;
+	struct p_block_req p;
+
+	p.sector   = cpu_to_be64(sector);
+	p.block_id = block_id;
+	p.blksize  = cpu_to_be32(size);
+
+	ok = drbd_send_cmd(mdev, USE_DATA_SOCKET, cmd,
+				(struct p_header *)&p, sizeof(p));
+	return ok;
+}
+
+int drbd_send_drequest_csum(struct drbd_conf *mdev,
+			    sector_t sector, int size,
+			    void *digest, int digest_size,
+			    enum drbd_packets cmd)
+{
+	int ok;
+	struct p_block_req p;
+
+	p.sector   = cpu_to_be64(sector);
+	p.block_id = BE_DRBD_MAGIC + 0xbeef;
+	p.blksize  = cpu_to_be32(size);
+
+	p.head.magic   = BE_DRBD_MAGIC;
+	p.head.command = cpu_to_be16(cmd);
+	p.head.length  = cpu_to_be16(sizeof(p) - sizeof(struct p_header) + digest_size);
+
+	mutex_lock(&mdev->data.mutex);
+
+	ok = (sizeof(p) == drbd_send(mdev, mdev->data.socket, &p, sizeof(p), 0));
+	ok = ok && (digest_size == drbd_send(mdev, mdev->data.socket, digest, digest_size, 0));
+
+	mutex_unlock(&mdev->data.mutex);
+
+	return ok;
+}
+
+int drbd_send_ov_request(struct drbd_conf *mdev, sector_t sector, int size)
+{
+	int ok;
+	struct p_block_req p;
+
+	p.sector   = cpu_to_be64(sector);
+	p.block_id = BE_DRBD_MAGIC + 0xbabe;
+	p.blksize  = cpu_to_be32(size);
+
+	ok = drbd_send_cmd(mdev, USE_DATA_SOCKET, P_OV_REQUEST,
+			   (struct p_header *)&p, sizeof(p));
+	return ok;
+}
+
+/* called on sndtimeo
+ * returns FALSE if we should retry,
+ * TRUE if we think connection is dead
+ */
+STATIC int we_should_drop_the_connection(struct drbd_conf *mdev, struct socket *sock)
+{
+	int drop_it;
+	/* long elapsed = (long)(jiffies - mdev->last_received); */
+
+	drop_it =   mdev->meta.socket == sock
+		|| !mdev->asender.task
+		|| get_t_state(&mdev->asender) != Running
+		|| mdev->state.conn < C_CONNECTED;
+
+	if (drop_it)
+		return TRUE;
+
+	drop_it = !--mdev->ko_count;
+	if (!drop_it) {
+		dev_err(DEV, "[%s/%d] sock_sendmsg time expired, ko = %u\n",
+		       current->comm, current->pid, mdev->ko_count);
+		request_ping(mdev);
+	}
+
+	return drop_it; /* && (mdev->state == R_PRIMARY) */
+}
+
+/* The idea of sendpage seems to be to put some kind of reference
+ * to the page into the skb, and to hand it over to the NIC. In
+ * this process get_page() gets called.
+ *
+ * As soon as the page was really sent over the network put_page()
+ * gets called by some part of the network layer. [ NIC driver? ]
+ *
+ * [ get_page() / put_page() increment/decrement the count. If count
+ *   reaches 0 the page will be freed. ]
+ *
+ * This works nicely with pages from FSs.
+ * But this means that in protocol A we might signal IO completion too early!
+ *
+ * In order not to corrupt data during a resync we must make sure
+ * that we do not reuse our own buffer pages (EEs) too early, therefore
+ * we have the net_ee list.
+ *
+ * XFS still seems to have problems: it submits pages with page_count == 0!
+ * As a workaround, we disable sendpage on pages
+ * with page_count == 0 or PageSlab.
+ */
+STATIC int _drbd_no_send_page(struct drbd_conf *mdev, struct page *page,
+		   int offset, size_t size)
+{
+	int sent = drbd_send(mdev, mdev->data.socket, kmap(page) + offset, size, 0);
+	kunmap(page);
+	if (sent == size)
+		mdev->send_cnt += size>>9;
+	return sent == size;
+}
+
+int _drbd_send_page(struct drbd_conf *mdev, struct page *page,
+		    int offset, size_t size)
+{
+	mm_segment_t oldfs = get_fs();
+	int sent, ok;
+	int len = size;
+
+	/* e.g. XFS meta- & log-data is in slab pages, which have a
+	 * page_count of 0 and/or have PageSlab() set.
+	 * we cannot use send_page for those, as that does get_page();
+	 * put_page(); and would cause either a VM_BUG directly, or
+	 * __page_cache_release a page that would actually still be referenced
+	 * by someone, leading to some obscure delayed Oops somewhere else. */
+	if (disable_sendpage || (page_count(page) < 1) || PageSlab(page))
+		return _drbd_no_send_page(mdev, page, offset, size);
+
+	drbd_update_congested(mdev);
+	set_fs(KERNEL_DS);
+	do {
+		sent = mdev->data.socket->ops->sendpage(mdev->data.socket, page,
+							offset, len,
+							MSG_NOSIGNAL);
+		if (sent == -EAGAIN) {
+			if (we_should_drop_the_connection(mdev,
+							  mdev->data.socket))
+				break;
+			else
+				continue;
+		}
+		if (sent <= 0) {
+			dev_warn(DEV, "%s: size=%d len=%d sent=%d\n",
+			     __func__, (int)size, len, sent);
+			break;
+		}
+		len    -= sent;
+		offset += sent;
+	} while (len > 0 /* THINK && mdev->cstate >= C_CONNECTED*/);
+	set_fs(oldfs);
+	clear_bit(NET_CONGESTED, &mdev->flags);
+
+	ok = (len == 0);
+	if (likely(ok))
+		mdev->send_cnt += size>>9;
+	return ok;
+}
+
+static inline int _drbd_send_bio(struct drbd_conf *mdev, struct bio *bio)
+{
+	struct bio_vec *bvec;
+	int i;
+	__bio_for_each_segment(bvec, bio, i, 0) {
+		if (!_drbd_no_send_page(mdev, bvec->bv_page,
+				     bvec->bv_offset, bvec->bv_len))
+			return 0;
+	}
+	return 1;
+}
+
+static inline int _drbd_send_zc_bio(struct drbd_conf *mdev, struct bio *bio)
+{
+	struct bio_vec *bvec;
+	int i;
+	__bio_for_each_segment(bvec, bio, i, 0) {
+		if (!_drbd_send_page(mdev, bvec->bv_page,
+				     bvec->bv_offset, bvec->bv_len))
+			return 0;
+	}
+
+	return 1;
+}
+
+/* Used to send write requests
+ * R_PRIMARY -> Peer	(P_DATA)
+ */
+int drbd_send_dblock(struct drbd_conf *mdev, struct drbd_request *req)
+{
+	int ok = 1;
+	struct p_data p;
+	unsigned int dp_flags = 0;
+	void *dgb;
+	int dgs;
+
+	if (!drbd_get_data_sock(mdev))
+		return 0;
+
+	dgs = (mdev->agreed_pro_version >= 87 && mdev->integrity_w_tfm) ?
+		crypto_hash_digestsize(mdev->integrity_w_tfm) : 0;
+
+	p.head.magic   = BE_DRBD_MAGIC;
+	p.head.command = cpu_to_be16(P_DATA);
+	p.head.length  =
+		cpu_to_be16(sizeof(p) - sizeof(struct p_header) + dgs + req->size);
+
+	p.sector   = cpu_to_be64(req->sector);
+	p.block_id = (unsigned long)req;
+	p.seq_num  = cpu_to_be32(req->seq_num =
+				 atomic_add_return(1, &mdev->packet_seq));
+	dp_flags = 0;
+
+	/* NOTE: no need to check if barriers supported here as we would
+	 *       not pass the test in make_request_common in that case
+	 */
+	if (bio_barrier(req->master_bio))
+		dp_flags |= DP_HARDBARRIER;
+	if (bio_sync(req->master_bio))
+		dp_flags |= DP_RW_SYNC;
+	if (mdev->state.conn >= C_SYNC_SOURCE &&
+	    mdev->state.conn <= C_PAUSED_SYNC_T)
+		dp_flags |= DP_MAY_SET_IN_SYNC;
+
+	p.dp_flags = cpu_to_be32(dp_flags);
+	trace_drbd_packet(mdev, mdev->data.socket, 0, (void *)&p, __FILE__, __LINE__);
+	set_bit(UNPLUG_REMOTE, &mdev->flags);
+	ok = (sizeof(p) ==
+		drbd_send(mdev, mdev->data.socket, &p, sizeof(p), MSG_MORE));
+	if (ok && dgs) {
+		dgb = mdev->int_dig_out;
+		drbd_csum(mdev, mdev->integrity_w_tfm, req->master_bio, dgb);
+		ok = drbd_send(mdev, mdev->data.socket, dgb, dgs, MSG_MORE);
+	}
+	if (ok) {
+		if (mdev->net_conf->wire_protocol == DRBD_PROT_A)
+			ok = _drbd_send_bio(mdev, req->master_bio);
+		else
+			ok = _drbd_send_zc_bio(mdev, req->master_bio);
+	}
+
+	drbd_put_data_sock(mdev);
+	return ok;
+}
+
+/* answer packet, used to send data back for read requests:
+ *  Peer       -> (diskless) R_PRIMARY   (P_DATA_REPLY)
+ *  C_SYNC_SOURCE -> C_SYNC_TARGET         (P_RS_DATA_REPLY)
+ */
+int drbd_send_block(struct drbd_conf *mdev, enum drbd_packets cmd,
+		    struct drbd_epoch_entry *e)
+{
+	int ok;
+	struct p_data p;
+	void *dgb;
+	int dgs;
+
+	dgs = (mdev->agreed_pro_version >= 87 && mdev->integrity_w_tfm) ?
+		crypto_hash_digestsize(mdev->integrity_w_tfm) : 0;
+
+	p.head.magic   = BE_DRBD_MAGIC;
+	p.head.command = cpu_to_be16(cmd);
+	p.head.length  =
+		cpu_to_be16(sizeof(p) - sizeof(struct p_header) + dgs + e->size);
+
+	p.sector   = cpu_to_be64(e->sector);
+	p.block_id = e->block_id;
+	/* p.seq_num  = 0;    No sequence numbers here.. */
+
+	/* Only called by our kernel thread.
+	 * This one may be interrupted by DRBD_SIG and/or DRBD_SIGKILL
+	 * in response to an admin command or module unload.
+	 */
+	if (!drbd_get_data_sock(mdev))
+		return 0;
+
+	trace_drbd_packet(mdev, mdev->data.socket, 0, (void *)&p, __FILE__, __LINE__);
+	ok = sizeof(p) == drbd_send(mdev, mdev->data.socket, &p,
+					sizeof(p), MSG_MORE);
+	if (ok && dgs) {
+		dgb = mdev->int_dig_out;
+		drbd_csum(mdev, mdev->integrity_w_tfm, e->private_bio, dgb);
+		ok = drbd_send(mdev, mdev->data.socket, dgb, dgs, MSG_MORE);
+	}
+	if (ok)
+		ok = _drbd_send_zc_bio(mdev, e->private_bio);
+
+	drbd_put_data_sock(mdev);
+	return ok;
+}
+
+/*
+  drbd_send distinguishes two cases:
+
+  Packets sent via the data socket "sock"
+  and packets sent via the meta data socket "msock"
+
+		    sock                      msock
+  -----------------+-------------------------+------------------------------
+  timeout           conf.timeout / 2          conf.timeout / 2
+  timeout action    send a ping via msock     Abort communication
+					      and close all sockets
+*/
+
+/*
+ * the caller must already hold the appropriate [m]sock mutex!
+ */
+int drbd_send(struct drbd_conf *mdev, struct socket *sock,
+	      void *buf, size_t size, unsigned msg_flags)
+{
+	struct kvec iov;
+	struct msghdr msg;
+	int rv, sent = 0;
+
+	if (!sock)
+		return -1000;
+
+	/* THINK  if (signal_pending) return ... ? */
+
+	iov.iov_base = buf;
+	iov.iov_len  = size;
+
+	msg.msg_name       = NULL;
+	msg.msg_namelen    = 0;
+	msg.msg_control    = NULL;
+	msg.msg_controllen = 0;
+	msg.msg_flags      = msg_flags | MSG_NOSIGNAL;
+
+	if (sock == mdev->data.socket) {
+		mdev->ko_count = mdev->net_conf->ko_count;
+		drbd_update_congested(mdev);
+	}
+	do {
+		/* STRANGE
+		 * tcp_sendmsg does _not_ use its size parameter at all ?
+		 *
+		 * -EAGAIN on timeout, -EINTR on signal.
+		 */
+/* THINK
+ * do we need to block DRBD_SIG if sock == &meta.socket ??
+ * otherwise wake_asender() might interrupt some send_*Ack !
+ */
+		rv = kernel_sendmsg(sock, &msg, &iov, 1, size);
+		if (rv == -EAGAIN) {
+			if (we_should_drop_the_connection(mdev, sock))
+				break;
+			else
+				continue;
+		}
+		D_ASSERT(rv != 0);
+		if (rv == -EINTR) {
+			flush_signals(current);
+			rv = 0;
+		}
+		if (rv < 0)
+			break;
+		sent += rv;
+		iov.iov_base += rv;
+		iov.iov_len  -= rv;
+	} while (sent < size);
+
+	if (sock == mdev->data.socket)
+		clear_bit(NET_CONGESTED, &mdev->flags);
+
+	if (rv <= 0) {
+		if (rv != -EAGAIN) {
+			dev_err(DEV, "%s_sendmsg returned %d\n",
+			    sock == mdev->meta.socket ? "msock" : "sock",
+			    rv);
+			drbd_force_state(mdev, NS(conn, C_BROKEN_PIPE));
+		} else
+			drbd_force_state(mdev, NS(conn, C_TIMEOUT));
+	}
+
+	return sent;
+}
+
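+/* Block device open: read-write opens are refused unless we are
+ * Primary; read-only opens on a Secondary are only allowed if
+ * allow_oos is set. */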
+static int drbd_open(struct block_device *bdev, fmode_t mode)
+{
+	struct drbd_conf *mdev = bdev->bd_disk->private_data;
+	unsigned long flags;
+	int rv = 0;
+
+	spin_lock_irqsave(&mdev->req_lock, flags);
+	/* to have a stable mdev->state.role
+	 * and no race with updating open_cnt */
+
+	if (mdev->state.role != R_PRIMARY) {
+		if (mode & FMODE_WRITE)
+			rv = -EROFS;
+		else if (!allow_oos)
+			rv = -EMEDIUMTYPE;
+	}
+
+	if (!rv)
+		mdev->open_cnt++;
+	spin_unlock_irqrestore(&mdev->req_lock, flags);
+
+	return rv;
+}
+
+static int drbd_release(struct gendisk *gd, fmode_t mode)
+{
+	struct drbd_conf *mdev = gd->private_data;
+	mdev->open_cnt--;
+	return 0;
+}
+
+STATIC void drbd_unplug_fn(struct request_queue *q)
+{
+	struct drbd_conf *mdev = q->queuedata;
+
+	trace_drbd_unplug(mdev, "got unplugged");
+
+	/* unplug FIRST */
+	spin_lock_irq(q->queue_lock);
+	blk_remove_plug(q);
+	spin_unlock_irq(q->queue_lock);
+
+	/* only if connected */
+	spin_lock_irq(&mdev->req_lock);
+	if (mdev->state.pdsk >= D_INCONSISTENT && mdev->state.conn >= C_CONNECTED) {
+		D_ASSERT(mdev->state.role == R_PRIMARY);
+		if (test_and_clear_bit(UNPLUG_REMOTE, &mdev->flags)) {
+			/* add to the data.work queue,
+			 * unless already queued.
+			 * XXX this might be a good addition to drbd_queue_work
+			 * anyways, to detect "double queuing" ... */
+			if (list_empty(&mdev->unplug_work.list))
+				drbd_queue_work(&mdev->data.work,
+						&mdev->unplug_work);
+		}
+	}
+	spin_unlock_irq(&mdev->req_lock);
+
+	if (mdev->state.disk >= D_INCONSISTENT)
+		drbd_kick_lo(mdev);
+}
+
+STATIC void drbd_set_defaults(struct drbd_conf *mdev)
+{
+	mdev->sync_conf.after      = DRBD_AFTER_DEF;
+	mdev->sync_conf.rate       = DRBD_RATE_DEF;
+	mdev->sync_conf.al_extents = DRBD_AL_EXTENTS_DEF;
+	mdev->state = (union drbd_state) {
+		{ .role = R_SECONDARY,
+		  .peer = R_UNKNOWN,
+		  .conn = C_STANDALONE,
+		  .disk = D_DISKLESS,
+		  .pdsk = D_UNKNOWN,
+		  .susp = 0
+		} };
+}
+
+void drbd_init_set_defaults(struct drbd_conf *mdev)
+{
+	/* the memset(,0,) did most of this.
+	 * note: only assignments, no allocation in here */
+
+	drbd_set_defaults(mdev);
+
+	/* for now, we do NOT yet support it,
+	 * even though we start some framework
+	 * to eventually support barriers */
+	set_bit(NO_BARRIER_SUPP, &mdev->flags);
+
+	atomic_set(&mdev->ap_bio_cnt, 0);
+	atomic_set(&mdev->ap_pending_cnt, 0);
+	atomic_set(&mdev->rs_pending_cnt, 0);
+	atomic_set(&mdev->unacked_cnt, 0);
+	atomic_set(&mdev->local_cnt, 0);
+	atomic_set(&mdev->net_cnt, 0);
+	atomic_set(&mdev->packet_seq, 0);
+	atomic_set(&mdev->pp_in_use, 0);
+
+	mutex_init(&mdev->md_io_mutex);
+	mutex_init(&mdev->data.mutex);
+	mutex_init(&mdev->meta.mutex);
+	sema_init(&mdev->data.work.s, 0);
+	sema_init(&mdev->meta.work.s, 0);
+	mutex_init(&mdev->state_mutex);
+
+	spin_lock_init(&mdev->data.work.q_lock);
+	spin_lock_init(&mdev->meta.work.q_lock);
+
+	spin_lock_init(&mdev->al_lock);
+	spin_lock_init(&mdev->req_lock);
+	spin_lock_init(&mdev->peer_seq_lock);
+	spin_lock_init(&mdev->epoch_lock);
+
+	INIT_LIST_HEAD(&mdev->active_ee);
+	INIT_LIST_HEAD(&mdev->sync_ee);
+	INIT_LIST_HEAD(&mdev->done_ee);
+	INIT_LIST_HEAD(&mdev->read_ee);
+	INIT_LIST_HEAD(&mdev->net_ee);
+	INIT_LIST_HEAD(&mdev->resync_reads);
+	INIT_LIST_HEAD(&mdev->data.work.q);
+	INIT_LIST_HEAD(&mdev->meta.work.q);
+	INIT_LIST_HEAD(&mdev->resync_work.list);
+	INIT_LIST_HEAD(&mdev->unplug_work.list);
+	INIT_LIST_HEAD(&mdev->md_sync_work.list);
+	INIT_LIST_HEAD(&mdev->bm_io_work.w.list);
+	mdev->resync_work.cb  = w_resync_inactive;
+	mdev->unplug_work.cb  = w_send_write_hint;
+	mdev->md_sync_work.cb = w_md_sync;
+	mdev->bm_io_work.w.cb = w_bitmap_io;
+	init_timer(&mdev->resync_timer);
+	init_timer(&mdev->md_sync_timer);
+	mdev->resync_timer.function = resync_timer_fn;
+	mdev->resync_timer.data = (unsigned long) mdev;
+	mdev->md_sync_timer.function = md_sync_timer_fn;
+	mdev->md_sync_timer.data = (unsigned long) mdev;
+
+	init_waitqueue_head(&mdev->misc_wait);
+	init_waitqueue_head(&mdev->state_wait);
+	init_waitqueue_head(&mdev->ee_wait);
+	init_waitqueue_head(&mdev->al_wait);
+	init_waitqueue_head(&mdev->seq_wait);
+
+	drbd_thread_init(mdev, &mdev->receiver, drbdd_init);
+	drbd_thread_init(mdev, &mdev->worker, drbd_worker);
+	drbd_thread_init(mdev, &mdev->asender, drbd_asender);
+
+	mdev->agreed_pro_version = PRO_VERSION_MAX;
+	mdev->write_ordering = WO_bio_barrier;
+	mdev->resync_wenr = LC_FREE;
+}
+
+void drbd_mdev_cleanup(struct drbd_conf *mdev)
+{
+	if (mdev->receiver.t_state != None)
+		dev_err(DEV, "ASSERT FAILED: receiver t_state == %d expected 0.\n",
+				mdev->receiver.t_state);
+
+	/* no need to lock it, I'm the only thread alive */
+	if (atomic_read(&mdev->current_epoch->epoch_size) !=  0)
+		dev_err(DEV, "epoch_size:%d\n", atomic_read(&mdev->current_epoch->epoch_size));
+	mdev->al_writ_cnt  =
+	mdev->bm_writ_cnt  =
+	mdev->read_cnt     =
+	mdev->recv_cnt     =
+	mdev->send_cnt     =
+	mdev->writ_cnt     =
+	mdev->p_size       =
+	mdev->rs_start     =
+	mdev->rs_total     =
+	mdev->rs_failed    =
+	mdev->rs_mark_left =
+	mdev->rs_mark_time = 0;
+	D_ASSERT(mdev->net_conf == NULL);
+
+	drbd_set_my_capacity(mdev, 0);
+	if (mdev->bitmap) {
+		/* maybe never allocated. */
+		drbd_bm_resize(mdev, 0);
+		drbd_bm_cleanup(mdev);
+	}
+
+	drbd_free_resources(mdev);
+
+	/*
+	 * currently we call drbd_init_ee only on module load, so
+	 * we may call drbd_release_ee only on module unload!
+	 */
+	D_ASSERT(list_empty(&mdev->active_ee));
+	D_ASSERT(list_empty(&mdev->sync_ee));
+	D_ASSERT(list_empty(&mdev->done_ee));
+	D_ASSERT(list_empty(&mdev->read_ee));
+	D_ASSERT(list_empty(&mdev->net_ee));
+	D_ASSERT(list_empty(&mdev->resync_reads));
+	D_ASSERT(list_empty(&mdev->data.work.q));
+	D_ASSERT(list_empty(&mdev->meta.work.q));
+	D_ASSERT(list_empty(&mdev->resync_work.list));
+	D_ASSERT(list_empty(&mdev->unplug_work.list));
+
+}
+
+
+STATIC void drbd_destroy_mempools(void)
+{
+	struct page *page;
+
+	while (drbd_pp_pool) {
+		page = drbd_pp_pool;
+		drbd_pp_pool = (struct page *)page_private(page);
+		__free_page(page);
+		drbd_pp_vacant--;
+	}
+
+	/* D_ASSERT(atomic_read(&drbd_pp_vacant)==0); */
+
+	if (drbd_ee_mempool)
+		mempool_destroy(drbd_ee_mempool);
+	if (drbd_request_mempool)
+		mempool_destroy(drbd_request_mempool);
+	if (drbd_ee_cache)
+		kmem_cache_destroy(drbd_ee_cache);
+	if (drbd_request_cache)
+		kmem_cache_destroy(drbd_request_cache);
+
+	drbd_ee_mempool      = NULL;
+	drbd_request_mempool = NULL;
+	drbd_ee_cache        = NULL;
+	drbd_request_cache   = NULL;
+
+	return;
+}
+
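+/* Allocate the slab caches, mempools and the private page pool used
+ * for DRBD's network buffers. Returns 0 on success, -ENOMEM on
+ * failure. */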
+STATIC int drbd_create_mempools(void)
+{
+	struct page *page;
+	const int number = (DRBD_MAX_SEGMENT_SIZE/PAGE_SIZE) * minor_count;
+	int i;
+
+	/* prepare our caches and mempools */
+	drbd_request_mempool = NULL;
+	drbd_ee_cache        = NULL;
+	drbd_request_cache   = NULL;
+	drbd_pp_pool         = NULL;
+
+	/* caches */
+	drbd_request_cache = kmem_cache_create(
+		"drbd_req_cache", sizeof(struct drbd_request), 0, 0, NULL);
+	if (drbd_request_cache == NULL)
+		goto Enomem;
+
+	drbd_ee_cache = kmem_cache_create(
+		"drbd_ee_cache", sizeof(struct drbd_epoch_entry), 0, 0, NULL);
+	if (drbd_ee_cache == NULL)
+		goto Enomem;
+
+	/* mempools */
+	drbd_request_mempool = mempool_create(number,
+		mempool_alloc_slab, mempool_free_slab, drbd_request_cache);
+	if (drbd_request_mempool == NULL)
+		goto Enomem;
+
+	drbd_ee_mempool = mempool_create(number,
+		mempool_alloc_slab, mempool_free_slab, drbd_ee_cache);
+	if (drbd_ee_mempool == NULL)
+		goto Enomem;
+
+	/* drbd's page pool */
+	spin_lock_init(&drbd_pp_lock);
+
+	for (i = 0; i < number; i++) {
+		page = alloc_page(GFP_HIGHUSER);
+		if (!page)
+			goto Enomem;
+		set_page_private(page, (unsigned long)drbd_pp_pool);
+		drbd_pp_pool = page;
+	}
+	drbd_pp_vacant = number;
+
+	return 0;
+
+Enomem:
+	drbd_destroy_mempools(); /* in case we allocated some */
+	return -ENOMEM;
+}
+
+STATIC int drbd_notify_sys(struct notifier_block *this, unsigned long code,
+	void *unused)
+{
+	/* just so we have it.  you never know what interesting things we
+	 * might want to do here some day...
+	 */
+
+	return NOTIFY_DONE;
+}
+
+STATIC struct notifier_block drbd_notifier = {
+	.notifier_call = drbd_notify_sys,
+};
+
+static void drbd_release_ee_lists(struct drbd_conf *mdev)
+{
+	int rr;
+
+	rr = drbd_release_ee(mdev, &mdev->active_ee);
+	if (rr)
+		dev_err(DEV, "%d EEs in active list found!\n", rr);
+
+	rr = drbd_release_ee(mdev, &mdev->sync_ee);
+	if (rr)
+		dev_err(DEV, "%d EEs in sync list found!\n", rr);
+
+	rr = drbd_release_ee(mdev, &mdev->read_ee);
+	if (rr)
+		dev_err(DEV, "%d EEs in read list found!\n", rr);
+
+	rr = drbd_release_ee(mdev, &mdev->done_ee);
+	if (rr)
+		dev_err(DEV, "%d EEs in done list found!\n", rr);
+
+	rr = drbd_release_ee(mdev, &mdev->net_ee);
+	if (rr)
+		dev_err(DEV, "%d EEs in net list found!\n", rr);
+}
+
+/* caution. no locking.
+ * currently only used from module cleanup code. */
+static void drbd_delete_device(unsigned int minor)
+{
+	struct drbd_conf *mdev = minor_to_mdev(minor);
+
+	if (!mdev)
+		return;
+
+	/* paranoia asserts */
+	if (mdev->open_cnt != 0)
+		dev_err(DEV, "open_cnt = %d in %s:%u", mdev->open_cnt,
+				__FILE__ , __LINE__);
+
+	ERR_IF (!list_empty(&mdev->data.work.q)) {
+		struct list_head *lp;
+		list_for_each(lp, &mdev->data.work.q) {
+			DUMPP(lp);
+		}
+	};
+	/* end paranoia asserts */
+
+	del_gendisk(mdev->vdisk);
+
+	/* cleanup stuff that may have been allocated during
+	 * device (re-)configuration or state changes */
+
+	if (mdev->this_bdev)
+		bdput(mdev->this_bdev);
+
+	drbd_free_resources(mdev);
+
+	drbd_release_ee_lists(mdev);
+
+	/* should be free'd on disconnect? */
+	kfree(mdev->ee_hash);
+	/*
+	mdev->ee_hash_s = 0;
+	mdev->ee_hash = NULL;
+	*/
+
+	if (mdev->act_log)
+		lc_free(mdev->act_log);
+	if (mdev->resync)
+		lc_free(mdev->resync);
+
+	kfree(mdev->p_uuid);
+	/* mdev->p_uuid = NULL; */
+
+	kfree(mdev->int_dig_out);
+	kfree(mdev->int_dig_in);
+	kfree(mdev->int_dig_vv);
+
+	/* cleanup the rest that has been
+	 * allocated from drbd_new_device
+	 * and actually free the mdev itself */
+	drbd_free_mdev(mdev);
+}
+
+STATIC void drbd_cleanup(void)
+{
+	unsigned int i;
+
+	unregister_reboot_notifier(&drbd_notifier);
+
+	drbd_nl_cleanup();
+
+	if (minor_table) {
+		if (drbd_proc)
+			remove_proc_entry("drbd", NULL);
+		i = minor_count;
+		while (i--)
+			drbd_delete_device(i);
+		drbd_destroy_mempools();
+	}
+
+	kfree(minor_table);
+
+	unregister_blkdev(DRBD_MAJOR, "drbd");
+
+	printk(KERN_INFO "drbd: module cleanup done.\n");
+}
+
+/**
+ * drbd_congested: Returns 1<<BDI_async_congested and/or
+ * 1<<BDI_sync_congested if we are congested. This interface is known
+ * to be used by pdflush.
+ */
+static int drbd_congested(void *congested_data, int bdi_bits)
+{
+	struct drbd_conf *mdev = congested_data;
+	struct request_queue *q;
+	char reason = '-';
+	int r = 0;
+
+	if (!__inc_ap_bio_cond(mdev)) {
+		/* DRBD has frozen IO */
+		r = bdi_bits;
+		reason = 'd';
+		goto out;
+	}
+
+	if (inc_local(mdev)) {
+		q = bdev_get_queue(mdev->bc->backing_bdev);
+		r = bdi_congested(&q->backing_dev_info, bdi_bits);
+		dec_local(mdev);
+		if (r)
+			reason = 'b';
+	}
+
+	if (bdi_bits & (1 << BDI_async_congested) && test_bit(NET_CONGESTED, &mdev->flags)) {
+		r |= (1 << BDI_async_congested);
+		reason = reason == 'b' ? 'a' : 'n';
+	}
+
+out:
+	mdev->congestion_reason = reason;
+	return r;
+}
+
+struct drbd_conf *drbd_new_device(unsigned int minor)
+{
+	struct drbd_conf *mdev;
+	struct gendisk *disk;
+	struct request_queue *q;
+
+	mdev = kzalloc(sizeof(struct drbd_conf), GFP_KERNEL);
+	if (!mdev)
+		return NULL;
+
+	mdev->minor = minor;
+
+	drbd_init_set_defaults(mdev);
+
+	q = blk_alloc_queue(GFP_KERNEL);
+	if (!q)
+		goto out_no_q;
+	mdev->rq_queue = q;
+	q->queuedata   = mdev;
+	q->max_segment_size = DRBD_MAX_SEGMENT_SIZE;
+
+	disk = alloc_disk(1);
+	if (!disk)
+		goto out_no_disk;
+	mdev->vdisk = disk;
+
+	set_disk_ro(disk, TRUE);
+
+	disk->queue = q;
+	disk->major = DRBD_MAJOR;
+	disk->first_minor = minor;
+	disk->fops = &drbd_ops;
+	sprintf(disk->disk_name, "drbd%d", minor);
+	disk->private_data = mdev;
+
+	mdev->this_bdev = bdget(MKDEV(DRBD_MAJOR, minor));
+	/* we have no partitions. we contain only ourselves. */
+	mdev->this_bdev->bd_contains = mdev->this_bdev;
+
+	q->backing_dev_info.congested_fn = drbd_congested;
+	q->backing_dev_info.congested_data = mdev;
+
+	blk_queue_make_request(q, drbd_make_request_26);
+	blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
+	blk_queue_merge_bvec(q, drbd_merge_bvec);
+	/* needed since we use plugging on a queue that actually has no requests! */
+	q->queue_lock = &mdev->req_lock;
+	q->unplug_fn = drbd_unplug_fn;
+
+	mdev->md_io_page = alloc_page(GFP_KERNEL);
+	if (!mdev->md_io_page)
+		goto out_no_io_page;
+
+	if (drbd_bm_init(mdev))
+		goto out_no_bitmap;
+	/* no need to lock access, we are still initializing the module. */
+	if (!tl_init(mdev))
+		goto out_no_tl;
+
+	mdev->app_reads_hash = kzalloc(APP_R_HSIZE*sizeof(void *), GFP_KERNEL);
+	if (!mdev->app_reads_hash)
+		goto out_no_app_reads;
+
+	mdev->current_epoch = kzalloc(sizeof(struct drbd_epoch), GFP_KERNEL);
+	if (!mdev->current_epoch)
+		goto out_no_epoch;
+
+	INIT_LIST_HEAD(&mdev->current_epoch->list);
+	mdev->epochs = 1;
+
+	return mdev;
+
+/* out_whatever_else:
+	kfree(mdev->current_epoch); */
+out_no_epoch:
+	kfree(mdev->app_reads_hash);
+out_no_app_reads:
+	tl_cleanup(mdev);
+out_no_tl:
+	drbd_bm_cleanup(mdev);
+out_no_bitmap:
+	__free_page(mdev->md_io_page);
+out_no_io_page:
+	put_disk(disk);
+out_no_disk:
+	blk_cleanup_queue(q);
+out_no_q:
+	kfree(mdev);
+	return NULL;
+}
+
+/* counterpart of drbd_new_device.
+ * last part of drbd_delete_device. */
+void drbd_free_mdev(struct drbd_conf *mdev)
+{
+	kfree(mdev->current_epoch);
+	kfree(mdev->app_reads_hash);
+	tl_cleanup(mdev);
+	if (mdev->bitmap) /* should no longer be there. */
+		drbd_bm_cleanup(mdev);
+	__free_page(mdev->md_io_page);
+	put_disk(mdev->vdisk);
+	blk_cleanup_queue(mdev->rq_queue);
+	kfree(mdev);
+}
+
+
+int __init drbd_init(void)
+{
+	int err;
+
+	if (sizeof(struct p_handshake) != 80) {
+		printk(KERN_ERR
+		       "drbd: never change the size or layout "
+		       "of the HandShake packet.\n");
+		return -EINVAL;
+	}
+
+	if (1 > minor_count || minor_count > 255) {
+		printk(KERN_ERR
+			"drbd: invalid minor_count (%d)\n", minor_count);
+#ifdef MODULE
+		return -EINVAL;
+#else
+		minor_count = 8;
+#endif
+	}
+
+	err = drbd_nl_init();
+	if (err)
+		return err;
+
+	err = register_blkdev(DRBD_MAJOR, "drbd");
+	if (err) {
+		printk(KERN_ERR
+		       "drbd: unable to register block device major %d\n",
+		       DRBD_MAJOR);
+		return err;
+	}
+
+	register_reboot_notifier(&drbd_notifier);
+
+	/*
+	 * allocate all necessary structs
+	 */
+	err = -ENOMEM;
+
+	init_waitqueue_head(&drbd_pp_wait);
+
+	drbd_proc = NULL; /* play safe for drbd_cleanup */
+	minor_table = kzalloc(sizeof(struct drbd_conf *)*minor_count,
+				GFP_KERNEL);
+	if (!minor_table)
+		goto Enomem;
+
+	err = drbd_create_mempools();
+	if (err)
+		goto Enomem;
+
+	drbd_proc = proc_create("drbd", S_IFREG | S_IRUGO , NULL, &drbd_proc_fops);
+	if (!drbd_proc)	{
+		printk(KERN_ERR "drbd: unable to register proc file\n");
+		goto Enomem;
+	}
+
+	rwlock_init(&global_state_lock);
+
+	printk(KERN_INFO "drbd: initialised. "
+	       "Version: " REL_VERSION " (api:%d/proto:%d-%d)\n",
+	       API_VERSION, PRO_VERSION_MIN, PRO_VERSION_MAX);
+	printk(KERN_INFO "drbd: %s\n", drbd_buildtag());
+	printk(KERN_INFO "drbd: registered as block device major %d\n",
+		DRBD_MAJOR);
+	printk(KERN_INFO "drbd: minor_table @ 0x%p\n", minor_table);
+
+	return 0; /* Success! */
+
+Enomem:
+	drbd_cleanup();
+	if (err == -ENOMEM)
+		/* currently always the case */
+		printk(KERN_ERR "drbd: ran out of memory\n");
+	else
+		printk(KERN_ERR "drbd: initialization failure\n");
+	return err;
+}
+
+void drbd_free_bc(struct drbd_backing_dev *bc)
+{
+	if (bc == NULL)
+		return;
+
+	bd_release(bc->backing_bdev);
+	bd_release(bc->md_bdev);
+
+	fput(bc->lo_file);
+	fput(bc->md_file);
+
+	kfree(bc);
+}
+
+void drbd_free_sock(struct drbd_conf *mdev)
+{
+	if (mdev->data.socket) {
+		sock_release(mdev->data.socket);
+		mdev->data.socket = NULL;
+	}
+	if (mdev->meta.socket) {
+		sock_release(mdev->meta.socket);
+		mdev->meta.socket = NULL;
+	}
+}
+
+
+void drbd_free_resources(struct drbd_conf *mdev)
+{
+	crypto_free_hash(mdev->csums_tfm);
+	mdev->csums_tfm = NULL;
+	crypto_free_hash(mdev->verify_tfm);
+	mdev->verify_tfm = NULL;
+	crypto_free_hash(mdev->cram_hmac_tfm);
+	mdev->cram_hmac_tfm = NULL;
+	crypto_free_hash(mdev->integrity_w_tfm);
+	mdev->integrity_w_tfm = NULL;
+	crypto_free_hash(mdev->integrity_r_tfm);
+	mdev->integrity_r_tfm = NULL;
+
+	drbd_free_sock(mdev);
+
+	__no_warn(local,
+		  drbd_free_bc(mdev->bc);
+		  mdev->bc = NULL;);
+}
+
+/*********************************/
+/* meta data management */
+
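+/* on-disk layout of the meta data super block;
+ * all fields are converted to big endian before they are written out,
+ * see drbd_md_sync() / drbd_md_read() */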
+struct meta_data_on_disk {
+	u64 la_size;           /* last agreed size. */
+	u64 uuid[UI_SIZE];   /* UUIDs. */
+	u64 device_uuid;
+	u64 reserved_u64_1;
+	u32 flags;             /* MDF */
+	u32 magic;
+	u32 md_size_sect;
+	u32 al_offset;         /* offset to this block */
+	u32 al_nr_extents;     /* important for restoring the AL */
+	      /* `-- act_log->nr_elements <-- sync_conf.al_extents */
+	u32 bm_offset;         /* offset to the bitmap, from here */
+	u32 bm_bytes_per_bit;  /* BM_BLOCK_SIZE */
+	u32 reserved_u32[4];
+
+} __attribute((packed));
+
+/**
+ * drbd_md_sync:
+ * Writes the meta data super block if the MD_DIRTY flag bit is set.
+ */
+void drbd_md_sync(struct drbd_conf *mdev)
+{
+	struct meta_data_on_disk *buffer;
+	sector_t sector;
+	int i;
+
+	if (!test_and_clear_bit(MD_DIRTY, &mdev->flags))
+		return;
+	del_timer(&mdev->md_sync_timer);
+
+	/* We use here D_FAILED and not D_ATTACHING because we try to write
+	 * metadata even if we detach due to a disk failure! */
+	if (!inc_local_if_state(mdev, D_FAILED))
+		return;
+
+	trace_drbd_md_io(mdev, WRITE, mdev->bc);
+
+	mutex_lock(&mdev->md_io_mutex);
+	buffer = (struct meta_data_on_disk *)page_address(mdev->md_io_page);
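+	/* clear the whole 512 byte buffer, so reserved fields go out as zero */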
+	memset(buffer, 0, 512);
+
+	buffer->la_size = cpu_to_be64(drbd_get_capacity(mdev->this_bdev));
+	for (i = UI_CURRENT; i < UI_SIZE; i++)
+		buffer->uuid[i] = cpu_to_be64(mdev->bc->md.uuid[i]);
+	buffer->flags = cpu_to_be32(mdev->bc->md.flags);
+	buffer->magic = cpu_to_be32(DRBD_MD_MAGIC);
+
+	buffer->md_size_sect  = cpu_to_be32(mdev->bc->md.md_size_sect);
+	buffer->al_offset     = cpu_to_be32(mdev->bc->md.al_offset);
+	buffer->al_nr_extents = cpu_to_be32(mdev->act_log->nr_elements);
+	buffer->bm_bytes_per_bit = cpu_to_be32(BM_BLOCK_SIZE);
+	buffer->device_uuid = cpu_to_be64(mdev->bc->md.device_uuid);
+
+	buffer->bm_offset = cpu_to_be32(mdev->bc->md.bm_offset);
+
+	D_ASSERT(drbd_md_ss__(mdev, mdev->bc) == mdev->bc->md.md_offset);
+	sector = mdev->bc->md.md_offset;
+
+	if (drbd_md_sync_page_io(mdev, mdev->bc, sector, WRITE)) {
+		clear_bit(MD_DIRTY, &mdev->flags);
+	} else {
+		/* this was a try anyways ... */
+		dev_err(DEV, "meta data update failed!\n");
+
+		drbd_chk_io_error(mdev, 1, TRUE);
+		drbd_io_error(mdev, TRUE);
+	}
+
+	/* Update mdev->bc->md.la_size_sect,
+	 * since we updated it on metadata. */
+	mdev->bc->md.la_size_sect = drbd_get_capacity(mdev->this_bdev);
+
+	mutex_unlock(&mdev->md_io_mutex);
+	dec_local(mdev);
+}
+
+/**
+ * drbd_md_read:
+ * @bdev: describes the backing storage and the meta-data storage
+ * Reads the meta data from bdev. Returns 0 (NO_ERROR) on success, and an
+ * enum drbd_ret_codes in case something goes wrong.
+ * Currently only: ERR_IO_MD_DISK, ERR_MD_INVALID.
+ */
+int drbd_md_read(struct drbd_conf *mdev, struct drbd_backing_dev *bdev)
+{
+	struct meta_data_on_disk *buffer;
+	int i, rv = NO_ERROR;
+
+	if (!inc_local_if_state(mdev, D_ATTACHING))
+		return ERR_IO_MD_DISK;
+
+	trace_drbd_md_io(mdev, READ, bdev);
+
+	mutex_lock(&mdev->md_io_mutex);
+	buffer = (struct meta_data_on_disk *)page_address(mdev->md_io_page);
+
+	if (!drbd_md_sync_page_io(mdev, bdev, bdev->md.md_offset, READ)) {
+		/* NOTE: can't do normal error processing here as this is
+		   called BEFORE disk is attached */
+		dev_err(DEV, "Error while reading metadata.\n");
+		rv = ERR_IO_MD_DISK;
+		goto err;
+	}
+
+	if (be32_to_cpu(buffer->magic) != DRBD_MD_MAGIC) {
+		dev_err(DEV, "Error while reading metadata, magic not found.\n");
+		rv = ERR_MD_INVALID;
+		goto err;
+	}
+	if (be32_to_cpu(buffer->al_offset) != bdev->md.al_offset) {
+		dev_err(DEV, "unexpected al_offset: %d (expected %d)\n",
+		    be32_to_cpu(buffer->al_offset), bdev->md.al_offset);
+		rv = ERR_MD_INVALID;
+		goto err;
+	}
+	if (be32_to_cpu(buffer->bm_offset) != bdev->md.bm_offset) {
+		dev_err(DEV, "unexpected bm_offset: %d (expected %d)\n",
+		    be32_to_cpu(buffer->bm_offset), bdev->md.bm_offset);
+		rv = ERR_MD_INVALID;
+		goto err;
+	}
+	if (be32_to_cpu(buffer->md_size_sect) != bdev->md.md_size_sect) {
+		dev_err(DEV, "unexpected md_size: %u (expected %u)\n",
+		    be32_to_cpu(buffer->md_size_sect), bdev->md.md_size_sect);
+		rv = ERR_MD_INVALID;
+		goto err;
+	}
+
+	if (be32_to_cpu(buffer->bm_bytes_per_bit) != BM_BLOCK_SIZE) {
+		dev_err(DEV, "unexpected bm_bytes_per_bit: %u (expected %u)\n",
+		    be32_to_cpu(buffer->bm_bytes_per_bit), BM_BLOCK_SIZE);
+		rv = ERR_MD_INVALID;
+		goto err;
+	}
+
+	bdev->md.la_size_sect = be64_to_cpu(buffer->la_size);
+	for (i = UI_CURRENT; i < UI_SIZE; i++)
+		bdev->md.uuid[i] = be64_to_cpu(buffer->uuid[i]);
+	bdev->md.flags = be32_to_cpu(buffer->flags);
+	mdev->sync_conf.al_extents = be32_to_cpu(buffer->al_nr_extents);
+	bdev->md.device_uuid = be64_to_cpu(buffer->device_uuid);
+
+	if (mdev->sync_conf.al_extents < 7)
+		mdev->sync_conf.al_extents = 127;
+
+ err:
+	mutex_unlock(&mdev->md_io_mutex);
+	dec_local(mdev);
+
+	return rv;
+}
+
+/**
+ * drbd_md_mark_dirty:
+ * Call this function if you change anything that should be written to
+ * the meta-data super block. This function sets MD_DIRTY, and starts a
+ * timer that ensures that drbd_md_sync() gets called within five seconds.
+ */
+void drbd_md_mark_dirty(struct drbd_conf *mdev)
+{
+	set_bit(MD_DIRTY, &mdev->flags);
+	mod_timer(&mdev->md_sync_timer, jiffies + 5*HZ);
+}
+
+
+STATIC void drbd_uuid_move_history(struct drbd_conf *mdev) __must_hold(local)
+{
+	int i;
+
+	for (i = UI_HISTORY_START; i < UI_HISTORY_END; i++) {
+		mdev->bc->md.uuid[i+1] = mdev->bc->md.uuid[i];
+
+		trace_drbd_uuid(mdev, i+1);
+	}
+}
+
+void _drbd_uuid_set(struct drbd_conf *mdev, int idx, u64 val) __must_hold(local)
+{
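+	/* the lowest bit of the current UUID encodes our role:
+	 * set while Primary, cleared while Secondary */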
+	if (idx == UI_CURRENT) {
+		if (mdev->state.role == R_PRIMARY)
+			val |= 1;
+		else
+			val &= ~((u64)1);
+
+		drbd_set_ed_uuid(mdev, val);
+	}
+
+	mdev->bc->md.uuid[idx] = val;
+	trace_drbd_uuid(mdev, idx);
+	drbd_md_mark_dirty(mdev);
+}
+
+
+void drbd_uuid_set(struct drbd_conf *mdev, int idx, u64 val) __must_hold(local)
+{
+	if (mdev->bc->md.uuid[idx]) {
+		drbd_uuid_move_history(mdev);
+		mdev->bc->md.uuid[UI_HISTORY_START] = mdev->bc->md.uuid[idx];
+		trace_drbd_uuid(mdev, UI_HISTORY_START);
+	}
+	_drbd_uuid_set(mdev, idx, val);
+}
+
+/**
+ * drbd_uuid_new_current:
+ * Creates a new current UUID, and rotates the old current UUID into
+ * the bitmap slot. Causes an incremental resync upon next connect.
+ */
+void drbd_uuid_new_current(struct drbd_conf *mdev) __must_hold(local)
+{
+	u64 val;
+
+	dev_info(DEV, "Creating new current UUID\n");
+	D_ASSERT(mdev->bc->md.uuid[UI_BITMAP] == 0);
+	mdev->bc->md.uuid[UI_BITMAP] = mdev->bc->md.uuid[UI_CURRENT];
+	trace_drbd_uuid(mdev, UI_BITMAP);
+
+	get_random_bytes(&val, sizeof(u64));
+	_drbd_uuid_set(mdev, UI_CURRENT, val);
+}
+
+void drbd_uuid_set_bm(struct drbd_conf *mdev, u64 val) __must_hold(local)
+{
+	if (mdev->bc->md.uuid[UI_BITMAP] == 0 && val == 0)
+		return;
+
+	if (val == 0) {
+		drbd_uuid_move_history(mdev);
+		mdev->bc->md.uuid[UI_HISTORY_START] = mdev->bc->md.uuid[UI_BITMAP];
+		mdev->bc->md.uuid[UI_BITMAP] = 0;
+		trace_drbd_uuid(mdev, UI_HISTORY_START);
+		trace_drbd_uuid(mdev, UI_BITMAP);
+	} else {
+		if (mdev->bc->md.uuid[UI_BITMAP])
+			dev_warn(DEV, "bm UUID already set");
+
+		mdev->bc->md.uuid[UI_BITMAP] = val;
+		mdev->bc->md.uuid[UI_BITMAP] &= ~((u64)1);
+
+		trace_drbd_uuid(mdev, UI_BITMAP);
+	}
+	drbd_md_mark_dirty(mdev);
+}
+
+/**
+ * drbd_bmio_set_n_write:
+ * Is an io_fn for drbd_queue_bitmap_io() or drbd_bitmap_io() that sets
+ * all bits in the bitmap and writes the whole bitmap to stable storage.
+ */
+int drbd_bmio_set_n_write(struct drbd_conf *mdev)
+{
+	int rv = -EIO;
+
+	if (inc_local_if_state(mdev, D_ATTACHING)) {
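+		/* make MDF_FULL_SYNC stable on disk before touching the bitmap;
+		 * if we crash while writing the bitmap, the flag still forces
+		 * a full sync later on */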
+		drbd_md_set_flag(mdev, MDF_FULL_SYNC);
+		drbd_md_sync(mdev);
+		drbd_bm_set_all(mdev);
+
+		rv = drbd_bm_write(mdev);
+
+		if (!rv) {
+			drbd_md_clear_flag(mdev, MDF_FULL_SYNC);
+			drbd_md_sync(mdev);
+		}
+
+		dec_local(mdev);
+	}
+
+	return rv;
+}
+
+/**
+ * drbd_bmio_clear_n_write:
+ * Is an io_fn for drbd_queue_bitmap_io() or drbd_bitmap_io() that clears
+ * all bits in the bitmap and writes the whole bitmap to stable storage.
+ */
+int drbd_bmio_clear_n_write(struct drbd_conf *mdev)
+{
+	int rv = -EIO;
+
+	if (inc_local_if_state(mdev, D_ATTACHING)) {
+		drbd_bm_clear_all(mdev);
+		rv = drbd_bm_write(mdev);
+		dec_local(mdev);
+	}
+
+	return rv;
+}
+
+STATIC int w_bitmap_io(struct drbd_conf *mdev, struct drbd_work *w, int unused)
+{
+	struct bm_io_work *work = (struct bm_io_work *)w;
+	int rv;
+
+	D_ASSERT(atomic_read(&mdev->ap_bio_cnt) == 0);
+
+	drbd_bm_lock(mdev, work->why);
+	rv = work->io_fn(mdev);
+	drbd_bm_unlock(mdev);
+
+	clear_bit(BITMAP_IO, &mdev->flags);
+	wake_up(&mdev->misc_wait);
+
+	if (work->done)
+		work->done(mdev, rv);
+
+	clear_bit(BITMAP_IO_QUEUED, &mdev->flags);
+	work->why = NULL;
+
+	return 1;
+}
+
+/**
+ * drbd_queue_bitmap_io:
+ * Queues an IO operation on the whole bitmap.
+ * While IO on the bitmap happens we freeze application IO and thus ensure
+ * that drbd_set_out_of_sync() cannot be called.
+ * This function MUST ONLY be called from worker context.
+ * BAD API ALERT!
+ * It MUST NOT be used while a previous such work is still pending!
+ */
+void drbd_queue_bitmap_io(struct drbd_conf *mdev,
+			  int (*io_fn)(struct drbd_conf *),
+			  void (*done)(struct drbd_conf *, int),
+			  char *why)
+{
+	D_ASSERT(current == mdev->worker.task);
+
+	D_ASSERT(!test_bit(BITMAP_IO_QUEUED, &mdev->flags));
+	D_ASSERT(!test_bit(BITMAP_IO, &mdev->flags));
+	D_ASSERT(list_empty(&mdev->bm_io_work.w.list));
+	if (mdev->bm_io_work.why)
+		dev_err(DEV, "FIXME going to queue '%s' but '%s' still pending?\n",
+			why, mdev->bm_io_work.why);
+
+	mdev->bm_io_work.io_fn = io_fn;
+	mdev->bm_io_work.done = done;
+	mdev->bm_io_work.why = why;
+
+	set_bit(BITMAP_IO, &mdev->flags);
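+	/* only queue the work right away if no application IO is in flight;
+	 * BITMAP_IO is set above, so further application IO is held back
+	 * until the bitmap IO is done */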
+	if (atomic_read(&mdev->ap_bio_cnt) == 0) {
+		if (list_empty(&mdev->bm_io_work.w.list)) {
+			set_bit(BITMAP_IO_QUEUED, &mdev->flags);
+			drbd_queue_work(&mdev->data.work, &mdev->bm_io_work.w);
+		} else
+			dev_err(DEV, "FIXME avoided double queuing bm_io_work\n");
+	}
+}
+
+/**
+ * drbd_bitmap_io:
+ * Does an IO operation on the bitmap, freezing application IO while that
+ * IO operation runs. This function MUST NOT be called from worker context.
+ */
+int drbd_bitmap_io(struct drbd_conf *mdev, int (*io_fn)(struct drbd_conf *), char *why)
+{
+	int rv;
+
+	D_ASSERT(current != mdev->worker.task);
+
+	drbd_suspend_io(mdev);
+
+	drbd_bm_lock(mdev, why);
+	rv = io_fn(mdev);
+	drbd_bm_unlock(mdev);
+
+	drbd_resume_io(mdev);
+
+	return rv;
+}
+
+void drbd_md_set_flag(struct drbd_conf *mdev, int flag) __must_hold(local)
+{
+	if ((mdev->bc->md.flags & flag) != flag) {
+		drbd_md_mark_dirty(mdev);
+		mdev->bc->md.flags |= flag;
+	}
+}
+
+void drbd_md_clear_flag(struct drbd_conf *mdev, int flag) __must_hold(local)
+{
+	if ((mdev->bc->md.flags & flag) != 0) {
+		drbd_md_mark_dirty(mdev);
+		mdev->bc->md.flags &= ~flag;
+	}
+}
+int drbd_md_test_flag(struct drbd_backing_dev *bdev, int flag)
+{
+	return (bdev->md.flags & flag) != 0;
+}
+
+STATIC void md_sync_timer_fn(unsigned long data)
+{
+	struct drbd_conf *mdev = (struct drbd_conf *) data;
+
+	drbd_queue_work_front(&mdev->data.work, &mdev->md_sync_work);
+}
+
+STATIC int w_md_sync(struct drbd_conf *mdev, struct drbd_work *w, int unused)
+{
+	dev_warn(DEV, "md_sync_timer expired! Worker calls drbd_md_sync().\n");
+	drbd_md_sync(mdev);
+
+	return 1;
+}
+
+#ifdef DRBD_ENABLE_FAULTS
+/* Fault insertion support including random number generator shamelessly
+ * stolen from kernel/rcutorture.c */
+struct fault_random_state {
+	unsigned long state;
+	unsigned long count;
+};
+
+#define FAULT_RANDOM_MULT 39916801  /* prime */
+#define FAULT_RANDOM_ADD	479001701 /* prime */
+#define FAULT_RANDOM_REFRESH 10000
+
+/*
+ * Crude but fast random-number generator.  Uses a linear congruential
+ * generator, with occasional help from get_random_bytes().
+ */
+STATIC unsigned long
+_drbd_fault_random(struct fault_random_state *rsp)
+{
+	long refresh;
+
+	if (--rsp->count < 0) {
+		get_random_bytes(&refresh, sizeof(refresh));
+		rsp->state += refresh;
+		rsp->count = FAULT_RANDOM_REFRESH;
+	}
+	rsp->state = rsp->state * FAULT_RANDOM_MULT + FAULT_RANDOM_ADD;
+	return swahw32(rsp->state);
+}
+
+STATIC char *
+_drbd_fault_str(unsigned int type) {
+	static char *_faults[] = {
+		"Meta-data write",
+		"Meta-data read",
+		"Resync write",
+		"Resync read",
+		"Data write",
+		"Data read",
+		"Data read ahead",
+		"BM allocation",
+		"EE allocation"
+	};
+
+	return (type < DRBD_FAULT_MAX) ? _faults[type] : "**Unknown**";
+}
+
+unsigned int
+_drbd_insert_fault(struct drbd_conf *mdev, unsigned int type)
+{
+	static struct fault_random_state rrs = {0, 0};
+
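+	/* fault_devs == 0 means "all devices", otherwise it is a bitmask of
+	 * minor numbers; fault_rate is the fault probability in percent */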
+	unsigned int ret = (
+		(fault_devs == 0 ||
+			((1 << mdev_to_minor(mdev)) & fault_devs) != 0) &&
+		(((_drbd_fault_random(&rrs) % 100) + 1) <= fault_rate));
+
+	if (ret) {
+		fault_count++;
+
+		if (printk_ratelimit())
+			dev_warn(DEV, "***Simulating %s failure\n",
+				_drbd_fault_str(type));
+	}
+
+	return ret;
+}
+#endif
+
+module_init(drbd_init)
+module_exit(drbd_cleanup)

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 09/16] DRBD: receiver
  2009-04-30 11:26               ` [PATCH 08/16] DRBD: main Philipp Reisner
@ 2009-04-30 11:26                 ` Philipp Reisner
  2009-04-30 11:26                   ` [PATCH 10/16] DRBD: proc Philipp Reisner
  0 siblings, 1 reply; 90+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg, Philipp Reisner

Nearly all of the "receiver" and the "asender" is in this file. The
receiver is the thread that processes all data packets. The receiver may
get blocked while waiting for memory or be slowed down while submitting IO.
  The asender, on the other hand, is used to send out acknowledgements and
to receive them. It only blocks while waiting on its socket.
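
For illustration only (not part of the patch): the receiver's main loop
conceptually reduces to "read a header, dispatch on the command, let the
handler read the payload". The names hdr_t, read_header and handlers[]
below are made up for this sketch.

  #include <stdint.h>

  typedef struct { uint16_t command; uint16_t length; } hdr_t;
  typedef int (*handler_fn)(void *ctx, hdr_t *h);

  static int receiver_loop(void *ctx, const handler_fn *handlers,
  			 unsigned int n_handlers,
  			 int (*read_header)(void *ctx, hdr_t *h))
  {
  	hdr_t h;

  	for (;;) {
  		if (!read_header(ctx, &h))
  			return 0;	/* connection lost or signal */
  		if (h.command >= n_handlers || !handlers[h.command])
  			return 0;	/* unknown packet: give up */
  		if (!handlers[h.command](ctx, &h))
  			return 0;	/* handler failed: tear down */
  	}
  }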

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

---
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
new file mode 100644
index 0000000..077480f
--- /dev/null
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -0,0 +1,4285 @@
+/*
+   drbd_receiver.c
+
+   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+   Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.
+   Copyright (C) 1999-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+   Copyright (C) 2002-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+
+   drbd is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   drbd is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with drbd; see the file COPYING.  If not, write to
+   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+
+#include <linux/autoconf.h>
+#include <linux/module.h>
+
+#include <asm/uaccess.h>
+#include <net/sock.h>
+
+#include <linux/version.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/in.h>
+#include <linux/mm.h>
+#include <linux/drbd_config.h>
+#include <linux/memcontrol.h>
+#include <linux/mm_inline.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/pkt_sched.h>
+#define __KERNEL_SYSCALLS__
+#include <linux/unistd.h>
+#include <linux/vmalloc.h>
+#include <linux/random.h>
+#include <linux/mm.h>
+#include <linux/string.h>
+#include <linux/scatterlist.h>
+#include <linux/drbd.h>
+#include "drbd_int.h"
+#include "drbd_tracing.h"
+#include "drbd_req.h"
+
+#include "drbd_vli.h"
+
+struct flush_work {
+	struct drbd_work w;
+	struct drbd_epoch *epoch;
+};
+
+enum finish_epoch {
+	FE_STILL_LIVE,
+	FE_DESTROYED,
+	FE_RECYCLED,
+};
+
+STATIC int drbd_do_handshake(struct drbd_conf *mdev);
+STATIC int drbd_do_auth(struct drbd_conf *mdev);
+
+STATIC enum finish_epoch drbd_may_finish_epoch(struct drbd_conf *, struct drbd_epoch *, enum epoch_event);
+STATIC int e_end_block(struct drbd_conf *, struct drbd_work *, int);
+static inline struct drbd_epoch *previous_epoch(struct drbd_conf *mdev, struct drbd_epoch *epoch)
+{
+	struct drbd_epoch *prev;
+	spin_lock(&mdev->epoch_lock);
+	prev = list_entry(epoch->list.prev, struct drbd_epoch, list);
+	if (prev == epoch || prev == mdev->current_epoch)
+		prev = NULL;
+	spin_unlock(&mdev->epoch_lock);
+	return prev;
+}
+
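+/* for opportunistic allocations: highmem is fine, don't warn on failure,
+ * and never block (no __GFP_WAIT) -- callers handle NULL themselves */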
+#define GFP_TRY	(__GFP_HIGHMEM | __GFP_NOWARN)
+
+/**
+ * drbd_pp_alloc: Returns a page. Fails only if a signal comes in.
+ */
+STATIC struct page *drbd_pp_alloc(struct drbd_conf *mdev, gfp_t gfp_mask)
+{
+	unsigned long flags = 0;
+	struct page *page;
+	DEFINE_WAIT(wait);
+
+	spin_lock_irqsave(&drbd_pp_lock, flags);
+	page = drbd_pp_pool;
+	if (page) {
+		drbd_pp_pool = (struct page *)page_private(page);
+		set_page_private(page, 0); /* just to be polite */
+		drbd_pp_vacant--;
+	}
+	spin_unlock_irqrestore(&drbd_pp_lock, flags);
+	if (page)
+		goto got_page;
+
+	drbd_kick_lo(mdev);
+
+	for (;;) {
+		prepare_to_wait(&drbd_pp_wait, &wait, TASK_INTERRUPTIBLE);
+
+		/* try the pool again, maybe the drbd_kick_lo set some free */
+		spin_lock_irqsave(&drbd_pp_lock, flags);
+		page = drbd_pp_pool;
+		if (page) {
+			drbd_pp_pool = (struct page *)page_private(page);
+			drbd_pp_vacant--;
+		}
+		spin_unlock_irqrestore(&drbd_pp_lock, flags);
+
+		if (page)
+			break;
+
+		/* hm. pool was empty. try to allocate from kernel.
+		 * don't wait, if none is available, though.
+		 */
+		if (atomic_read(&mdev->pp_in_use)
+					< mdev->net_conf->max_buffers) {
+			page = alloc_page(GFP_TRY);
+			if (page)
+				break;
+		}
+
+		/* doh. still no page.
+		 * either used up the configured maximum number,
+		 * or we are low on memory.
+		 * wait for someone to return a page into the pool.
+		 * unless, of course, someone signalled us.
+		 */
+		if (signal_pending(current)) {
+			dev_warn(DEV, "drbd_pp_alloc interrupted!\n");
+			finish_wait(&drbd_pp_wait, &wait);
+			return NULL;
+		}
+		drbd_kick_lo(mdev);
+		if (!(gfp_mask & __GFP_WAIT)) {
+			finish_wait(&drbd_pp_wait, &wait);
+			return NULL;
+		}
+		schedule();
+	}
+	finish_wait(&drbd_pp_wait, &wait);
+
+ got_page:
+	atomic_inc(&mdev->pp_in_use);
+	return page;
+}
+
+STATIC void drbd_pp_free(struct drbd_conf *mdev, struct page *page)
+{
+	unsigned long flags = 0;
+	int free_it;
+
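+	/* keep at most one maximal segment worth of pages per configured minor
+	 * cached in the pool; give anything beyond that back to the kernel */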
+	spin_lock_irqsave(&drbd_pp_lock, flags);
+	if (drbd_pp_vacant > (DRBD_MAX_SEGMENT_SIZE/PAGE_SIZE)*minor_count) {
+		free_it = 1;
+	} else {
+		set_page_private(page, (unsigned long)drbd_pp_pool);
+		drbd_pp_pool = page;
+		drbd_pp_vacant++;
+		free_it = 0;
+	}
+	spin_unlock_irqrestore(&drbd_pp_lock, flags);
+
+	atomic_dec(&mdev->pp_in_use);
+
+	if (free_it)
+		__free_page(page);
+
+	wake_up(&drbd_pp_wait);
+}
+
+/*
+You need to hold the req_lock:
+ drbd_free_ee()
+ _drbd_wait_ee_list_empty()
+
+You must not have the req_lock:
+ drbd_alloc_ee()
+ drbd_init_ee()
+ drbd_release_ee()
+ drbd_ee_fix_bhs()
+ drbd_process_done_ee()
+ drbd_clear_done_ee()
+ drbd_wait_ee_list_empty()
+*/
+
+struct drbd_epoch_entry *drbd_alloc_ee(struct drbd_conf *mdev,
+				     u64 id,
+				     sector_t sector,
+				     unsigned int data_size,
+				     gfp_t gfp_mask) __must_hold(local)
+{
+	struct request_queue *q;
+	struct drbd_epoch_entry *e;
+	struct bio_vec *bvec;
+	struct page *page;
+	struct bio *bio;
+	unsigned int ds;
+	int i;
+
+	e = mempool_alloc(drbd_ee_mempool, gfp_mask & ~__GFP_HIGHMEM);
+	if (!e) {
+		if (!(gfp_mask & __GFP_NOWARN))
+			dev_err(DEV, "alloc_ee: Allocation of an EE failed\n");
+		return NULL;
+	}
+
+	bio = bio_alloc(gfp_mask & ~__GFP_HIGHMEM, div_ceil(data_size, PAGE_SIZE));
+	if (!bio) {
+		if (!(gfp_mask & __GFP_NOWARN))
+			dev_err(DEV, "alloc_ee: Allocation of a bio failed\n");
+		goto fail1;
+	}
+
+	bio->bi_bdev = mdev->bc->backing_bdev;
+	bio->bi_sector = sector;
+
+	ds = data_size;
+	while (ds) {
+		page = drbd_pp_alloc(mdev, gfp_mask);
+		if (!page) {
+			if (!(gfp_mask & __GFP_NOWARN))
+				dev_err(DEV, "alloc_ee: Allocation of a page failed\n");
+			goto fail2;
+		}
+		if (!bio_add_page(bio, page, min_t(int, ds, PAGE_SIZE), 0)) {
+			drbd_pp_free(mdev, page);
+			dev_err(DEV, "alloc_ee: bio_add_page(s=%llu,"
+			    "data_size=%u,ds=%u) failed\n",
+			    (unsigned long long)sector, data_size, ds);
+
+			q = bdev_get_queue(bio->bi_bdev);
+			if (q->merge_bvec_fn) {
+				struct bvec_merge_data bvm = {
+					.bi_bdev = bio->bi_bdev,
+					.bi_sector = bio->bi_sector,
+					.bi_size = bio->bi_size,
+					.bi_rw = bio->bi_rw,
+				};
+				int l = q->merge_bvec_fn(q, &bvm,
+						&bio->bi_io_vec[bio->bi_vcnt]);
+				dev_err(DEV, "merge_bvec_fn() = %d\n", l);
+			}
+
+			/* dump more of the bio. */
+			DUMPI(bio->bi_max_vecs);
+			DUMPI(bio->bi_vcnt);
+			DUMPI(bio->bi_size);
+			DUMPI(bio->bi_phys_segments);
+
+			goto fail2;
+		}
+		ds -= min_t(int, ds, PAGE_SIZE);
+	}
+
+	D_ASSERT(data_size == bio->bi_size);
+
+	bio->bi_private = e;
+	e->mdev = mdev;
+	e->sector = sector;
+	e->size = bio->bi_size;
+
+	e->private_bio = bio;
+	e->block_id = id;
+	INIT_HLIST_NODE(&e->colision);
+	e->epoch = NULL;
+	e->flags = 0;
+
+	trace_drbd_ee(mdev, e, "allocated");
+
+	return e;
+
+ fail2:
+	__bio_for_each_segment(bvec, bio, i, 0) {
+		drbd_pp_free(mdev, bvec->bv_page);
+	}
+	bio_put(bio);
+ fail1:
+	mempool_free(e, drbd_ee_mempool);
+
+	return NULL;
+}
+
+void drbd_free_ee(struct drbd_conf *mdev, struct drbd_epoch_entry *e)
+{
+	struct bio *bio = e->private_bio;
+	struct bio_vec *bvec;
+	int i;
+
+	trace_drbd_ee(mdev, e, "freed");
+
+	__bio_for_each_segment(bvec, bio, i, 0) {
+		drbd_pp_free(mdev, bvec->bv_page);
+	}
+
+	bio_put(bio);
+
+	D_ASSERT(hlist_unhashed(&e->colision));
+
+	mempool_free(e, drbd_ee_mempool);
+}
+
+/* currently on module unload only */
+int drbd_release_ee(struct drbd_conf *mdev, struct list_head *list)
+{
+	int count = 0;
+	struct drbd_epoch_entry *e;
+	struct list_head *le;
+
+	spin_lock_irq(&mdev->req_lock);
+	while (!list_empty(list)) {
+		le = list->next;
+		list_del(le);
+		e = list_entry(le, struct drbd_epoch_entry, w.list);
+		drbd_free_ee(mdev, e);
+		count++;
+	}
+	spin_unlock_irq(&mdev->req_lock);
+
+	return count;
+}
+
+
+STATIC void reclaim_net_ee(struct drbd_conf *mdev)
+{
+	struct drbd_epoch_entry *e;
+	struct list_head *le, *tle;
+
+	/* The EEs are always appended to the end of the list. Since
+	   they are sent in order over the wire, they have to finish
+	   in order. As soon as we see the first unfinished one, we can
+	   stop examining the list... */
+
+	list_for_each_safe(le, tle, &mdev->net_ee) {
+		e = list_entry(le, struct drbd_epoch_entry, w.list);
+		if (drbd_bio_has_active_page(e->private_bio))
+			break;
+		list_del(le);
+		drbd_free_ee(mdev, e);
+	}
+}
+
+
+/*
+ * This function is called from _asender only_
+ * but see also comments in _req_mod(,barrier_acked)
+ * and receive_Barrier.
+ *
+ * Move entries from net_ee to done_ee, if ready.
+ * Grab done_ee, call all callbacks, free the entries.
+ * The callbacks typically send out ACKs.
+ */
+STATIC int drbd_process_done_ee(struct drbd_conf *mdev)
+{
+	LIST_HEAD(work_list);
+	struct drbd_epoch_entry *e, *t;
+	int ok = 1;
+
+	spin_lock_irq(&mdev->req_lock);
+	reclaim_net_ee(mdev);
+	list_splice_init(&mdev->done_ee, &work_list);
+	spin_unlock_irq(&mdev->req_lock);
+
+	/* possible callbacks here:
+	 * e_end_block, and e_end_resync_block, e_send_discard_ack.
+	 * all ignore the last argument.
+	 */
+	list_for_each_entry_safe(e, t, &work_list, w.list) {
+		trace_drbd_ee(mdev, e, "process_done_ee");
+		/* list_del not necessary, next/prev members not touched */
+		if (e->w.cb(mdev, &e->w, 0) == 0)
+			ok = 0;
+		drbd_free_ee(mdev, e);
+	}
+	wake_up(&mdev->ee_wait);
+
+	return ok;
+}
+
+
+
+/* clean-up helper for drbd_disconnect */
+void _drbd_clear_done_ee(struct drbd_conf *mdev)
+{
+	struct list_head *le;
+	struct drbd_epoch_entry *e;
+	struct drbd_epoch *epoch;
+	int n = 0;
+
+
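+	/* protocol C writes and syncer requests were counted in unacked_cnt;
+	 * balance that counter for each such entry we drop here */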
+	reclaim_net_ee(mdev);
+
+	while (!list_empty(&mdev->done_ee)) {
+		le = mdev->done_ee.next;
+		list_del(le);
+		e = list_entry(le, struct drbd_epoch_entry, w.list);
+		if (mdev->net_conf->wire_protocol == DRBD_PROT_C
+		|| is_syncer_block_id(e->block_id))
+			++n;
+
+		if (!hlist_unhashed(&e->colision))
+			hlist_del_init(&e->colision);
+
+		if (e->epoch) {
+			if (e->flags & EE_IS_BARRIER) {
+				epoch = previous_epoch(mdev, e->epoch);
+				if (epoch)
+					drbd_may_finish_epoch(mdev, epoch, EV_BARRIER_DONE + EV_CLEANUP);
+			}
+			drbd_may_finish_epoch(mdev, e->epoch, EV_PUT + EV_CLEANUP);
+		}
+		drbd_free_ee(mdev, e);
+	}
+
+	sub_unacked(mdev, n);
+}
+
+void _drbd_wait_ee_list_empty(struct drbd_conf *mdev, struct list_head *head)
+{
+	DEFINE_WAIT(wait);
+
+	/* avoids spin_lock/unlock
+	 * and calling prepare_to_wait in the fast path */
+	while (!list_empty(head)) {
+		prepare_to_wait(&mdev->ee_wait, &wait, TASK_UNINTERRUPTIBLE);
+		spin_unlock_irq(&mdev->req_lock);
+		drbd_kick_lo(mdev);
+		schedule();
+		finish_wait(&mdev->ee_wait, &wait);
+		spin_lock_irq(&mdev->req_lock);
+	}
+}
+
+void drbd_wait_ee_list_empty(struct drbd_conf *mdev, struct list_head *head)
+{
+	spin_lock_irq(&mdev->req_lock);
+	_drbd_wait_ee_list_empty(mdev, head);
+	spin_unlock_irq(&mdev->req_lock);
+}
+
+/* see also kernel_accept; which is only present since 2.6.18.
+ * also we want to log which part of it failed, exactly */
+STATIC int drbd_accept(struct drbd_conf *mdev, const char **what,
+		struct socket *sock, struct socket **newsock)
+{
+	struct sock *sk = sock->sk;
+	int err = 0;
+
+	*what = "listen";
+	err = sock->ops->listen(sock, 5);
+	if (err < 0)
+		goto out;
+
+	*what = "sock_create_lite";
+	err = sock_create_lite(sk->sk_family, sk->sk_type, sk->sk_protocol,
+			       newsock);
+	if (err < 0)
+		goto out;
+
+	*what = "accept";
+	err = sock->ops->accept(sock, *newsock, 0);
+	if (err < 0) {
+		sock_release(*newsock);
+		*newsock = NULL;
+		goto out;
+	}
+	(*newsock)->ops  = sock->ops;
+
+out:
+	return err;
+}
+
+STATIC int drbd_recv_short(struct drbd_conf *mdev, struct socket *sock,
+		    void *buf, size_t size, int flags)
+{
+	mm_segment_t oldfs;
+	struct kvec iov = {
+		.iov_base = buf,
+		.iov_len = size,
+	};
+	struct msghdr msg = {
+		.msg_iovlen = 1,
+		.msg_iov = (struct iovec *)&iov,
+		.msg_flags = (flags ? flags : MSG_WAITALL | MSG_NOSIGNAL)
+	};
+	int rv;
+
+	oldfs = get_fs();
+	set_fs(KERNEL_DS);
+	rv = sock_recvmsg(sock, &msg, size, msg.msg_flags);
+	set_fs(oldfs);
+
+	return rv;
+}
+
+STATIC int drbd_recv(struct drbd_conf *mdev, void *buf, size_t size)
+{
+	mm_segment_t oldfs;
+	struct kvec iov = {
+		.iov_base = buf,
+		.iov_len = size,
+	};
+	struct msghdr msg = {
+		.msg_iovlen = 1,
+		.msg_iov = (struct iovec *)&iov,
+		.msg_flags = MSG_WAITALL | MSG_NOSIGNAL
+	};
+	int rv;
+
+	oldfs = get_fs();
+	set_fs(KERNEL_DS);
+
+	for (;;) {
+		rv = sock_recvmsg(mdev->data.socket, &msg, size, msg.msg_flags);
+		if (rv == size)
+			break;
+
+		/* Note:
+		 * ECONNRESET	other side closed the connection
+		 * ERESTARTSYS	(on  sock) we got a signal
+		 */
+
+		if (rv < 0) {
+			if (rv == -ECONNRESET)
+				dev_info(DEV, "sock was reset by peer\n");
+			else if (rv != -ERESTARTSYS)
+				dev_err(DEV, "sock_recvmsg returned %d\n", rv);
+			break;
+		} else if (rv == 0) {
+			dev_info(DEV, "sock was shut down by peer\n");
+			break;
+		} else	{
+			/* signal came in, or peer/link went down,
+			 * after we read a partial message
+			 */
+			/* D_ASSERT(signal_pending(current)); */
+			break;
+		}
+	}
+
+	set_fs(oldfs);
+
+	if (rv != size)
+		drbd_force_state(mdev, NS(conn, C_BROKEN_PIPE));
+
+	return rv;
+}
+
+STATIC struct socket *drbd_try_connect(struct drbd_conf *mdev)
+{
+	const char *what;
+	struct socket *sock;
+	struct sockaddr_in6 src_in6;
+	int err;
+	int disconnect_on_error = 1;
+
+	if (!inc_net(mdev))
+		return NULL;
+
+	what = "sock_create_kern";
+	err = sock_create_kern(((struct sockaddr *)mdev->net_conf->my_addr)->sa_family,
+		SOCK_STREAM, IPPROTO_TCP, &sock);
+	if (err < 0) {
+		sock = NULL;
+		goto out;
+	}
+
+	sock->sk->sk_rcvtimeo =
+	sock->sk->sk_sndtimeo =  mdev->net_conf->try_connect_int*HZ;
+
+	/* explicitly bind to the configured IP as source IP
+	 * for the outgoing connections.
+	 * This is needed for multihomed hosts and to be
+	 * able to use lo: interfaces for drbd.
+	 * Make sure to use 0 as port number, so linux selects
+	 * a free one dynamically.
+	 */
+	memcpy(&src_in6, mdev->net_conf->my_addr,
+	       min_t(int, mdev->net_conf->my_addr_len, sizeof(src_in6)));
+	if (((struct sockaddr *)mdev->net_conf->my_addr)->sa_family == AF_INET6)
+		src_in6.sin6_port = 0;
+	else
+		((struct sockaddr_in *)&src_in6)->sin_port = 0; /* AF_INET & AF_SCI */
+
+	what = "bind before connect";
+	err = sock->ops->bind(sock,
+			      (struct sockaddr *) &src_in6,
+			      mdev->net_conf->my_addr_len);
+	if (err < 0)
+		goto out;
+
+	/* connect may fail, peer not yet available.
+	 * stay C_WF_CONNECTION, don't go Disconnecting! */
+	disconnect_on_error = 0;
+	what = "connect";
+	err = sock->ops->connect(sock,
+				 (struct sockaddr *)mdev->net_conf->peer_addr,
+				 mdev->net_conf->peer_addr_len, 0);
+
+out:
+	if (err < 0) {
+		if (sock) {
+			sock_release(sock);
+			sock = NULL;
+		}
+		switch (-err) {
+			/* timeout, busy, signal pending */
+		case ETIMEDOUT: case EAGAIN: case EINPROGRESS:
+		case EINTR: case ERESTARTSYS:
+			/* peer not (yet) available, network problem */
+		case ECONNREFUSED: case ENETUNREACH:
+		case EHOSTDOWN:    case EHOSTUNREACH:
+			disconnect_on_error = 0;
+			break;
+		default:
+			dev_err(DEV, "%s failed, err = %d\n", what, err);
+		}
+		if (disconnect_on_error)
+			drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+	}
+	dec_net(mdev);
+	return sock;
+}
+
+STATIC struct socket *drbd_wait_for_connect(struct drbd_conf *mdev)
+{
+	int timeo, err;
+	struct socket *s_estab = NULL, *s_listen;
+	const char *what;
+
+	if (!inc_net(mdev))
+		return NULL;
+
+	what = "sock_create_kern";
+	err = sock_create_kern(((struct sockaddr *)mdev->net_conf->my_addr)->sa_family,
+		SOCK_STREAM, IPPROTO_TCP, &s_listen);
+	if (err) {
+		s_listen = NULL;
+		goto out;
+	}
+
+	timeo = mdev->net_conf->try_connect_int * HZ;
+	timeo += (random32() & 1) ? timeo / 7 : -timeo / 7; /* 28.5% random jitter */
+
+	s_listen->sk->sk_reuse    = 1; /* SO_REUSEADDR */
+	s_listen->sk->sk_rcvtimeo = timeo;
+	s_listen->sk->sk_sndtimeo = timeo;
+
+	what = "bind before listen";
+	err = s_listen->ops->bind(s_listen,
+			      (struct sockaddr *) mdev->net_conf->my_addr,
+			      mdev->net_conf->my_addr_len);
+	if (err < 0)
+		goto out;
+
+	err = drbd_accept(mdev, &what, s_listen, &s_estab);
+
+out:
+	if (s_listen)
+		sock_release(s_listen);
+	if (err < 0) {
+		if (err != -EAGAIN && err != -EINTR && err != -ERESTARTSYS) {
+			dev_err(DEV, "%s failed, err = %d\n", what, err);
+			drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+		}
+	}
+	dec_net(mdev);
+
+	return s_estab;
+}
+
+STATIC int drbd_send_fp(struct drbd_conf *mdev,
+	struct socket *sock, enum drbd_packets cmd)
+{
+	struct p_header *h = (struct p_header *) &mdev->data.sbuf.header;
+
+	return _drbd_send_cmd(mdev, sock, cmd, h, sizeof(*h), 0);
+}
+
+STATIC enum drbd_packets drbd_recv_fp(struct drbd_conf *mdev, struct socket *sock)
+{
+	struct p_header *h = (struct p_header *) &mdev->data.sbuf.header;
+	int rr;
+
+	rr = drbd_recv_short(mdev, sock, h, sizeof(*h), 0);
+
+	if (rr == sizeof(*h) && h->magic == BE_DRBD_MAGIC)
+		return be16_to_cpu(h->command);
+
+	return 0xffff;
+}
+
+/**
+ * drbd_socket_okay:
+ * Tests if the connection behind the socket still exists. If not it frees
+ * the socket.
+ */
+static int drbd_socket_okay(struct drbd_conf *mdev, struct socket **sock)
+{
+	int rr;
+	char tb[4];
+
+	if (!*sock)
+		return FALSE;
+
+	rr = drbd_recv_short(mdev, *sock, tb, 4, MSG_DONTWAIT | MSG_PEEK);
+
+	if (rr > 0 || rr == -EAGAIN) {
+		return TRUE;
+	} else {
+		sock_release(*sock);
+		*sock = NULL;
+		return FALSE;
+	}
+}
+
+/*
+ * return values:
+ *   1 yes, we have a valid connection
+ *   0 oops, did not work out, please try again
+ *  -1 peer talks different language,
+ *     no point in trying again, please go standalone.
+ *  -2 We do not have a network config...
+ */
+STATIC int drbd_connect(struct drbd_conf *mdev)
+{
+	struct socket *s, *sock, *msock;
+	int try, h, ok;
+
+	D_ASSERT(!mdev->data.socket);
+
+	if (test_and_clear_bit(CREATE_BARRIER, &mdev->flags))
+		dev_err(DEV, "CREATE_BARRIER flag was set in drbd_connect - now cleared!\n");
+
+	if (drbd_request_state(mdev, NS(conn, C_WF_CONNECTION)) < SS_SUCCESS)
+		return -2;
+
+	clear_bit(DISCARD_CONCURRENT, &mdev->flags);
+
+	sock  = NULL;
+	msock = NULL;
+
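+	/* both nodes try to connect and to accept at the same time; the two
+	 * resulting TCP connections are told apart by the first packet sent
+	 * on them: P_HAND_SHAKE_S marks the data socket, P_HAND_SHAKE_M the
+	 * meta (ack) socket.  The side that accepts the meta socket sets
+	 * DISCARD_CONCURRENT, so both nodes agree on a tie breaker. */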
+	do {
+		for (try = 0;;) {
+			/* 3 tries, this should take less than a second! */
+			s = drbd_try_connect(mdev);
+			if (s || ++try >= 3)
+				break;
+			/* give the other side time to call bind() & listen() */
+			__set_current_state(TASK_INTERRUPTIBLE);
+			schedule_timeout(HZ / 10);
+		}
+
+		if (s) {
+			if (!sock) {
+				drbd_send_fp(mdev, s, P_HAND_SHAKE_S);
+				sock = s;
+				s = NULL;
+			} else if (!msock) {
+				drbd_send_fp(mdev, s, P_HAND_SHAKE_M);
+				msock = s;
+				s = NULL;
+			} else {
+				dev_err(DEV, "Logic error in drbd_connect()\n");
+				return -1;
+			}
+		}
+
+		if (sock && msock) {
+			__set_current_state(TASK_INTERRUPTIBLE);
+			schedule_timeout(HZ / 10);
+			ok = drbd_socket_okay(mdev, &sock);
+			ok = drbd_socket_okay(mdev, &msock) && ok;
+			if (ok)
+				break;
+		}
+
+retry:
+		s = drbd_wait_for_connect(mdev);
+		if (s) {
+			try = drbd_recv_fp(mdev, s);
+			drbd_socket_okay(mdev, &sock);
+			drbd_socket_okay(mdev, &msock);
+			switch (try) {
+			case P_HAND_SHAKE_S:
+				if (sock) {
+					dev_warn(DEV, "initial packet S crossed\n");
+					sock_release(sock);
+				}
+				sock = s;
+				break;
+			case P_HAND_SHAKE_M:
+				if (msock) {
+					dev_warn(DEV, "initial packet M crossed\n");
+					sock_release(msock);
+				}
+				msock = s;
+				set_bit(DISCARD_CONCURRENT, &mdev->flags);
+				break;
+			default:
+				dev_warn(DEV, "Error receiving initial packet\n");
+				sock_release(s);
+				if (random32() & 1)
+					goto retry;
+			}
+		}
+
+		if (mdev->state.conn <= C_DISCONNECTING)
+			return -1;
+		if (signal_pending(current)) {
+			flush_signals(current);
+			smp_rmb();
+			if (get_t_state(&mdev->receiver) == Exiting) {
+				if (sock)
+					sock_release(sock);
+				if (msock)
+					sock_release(msock);
+				return -1;
+			}
+		}
+
+		if (sock && msock) {
+			ok = drbd_socket_okay(mdev, &sock);
+			ok = drbd_socket_okay(mdev, &msock) && ok;
+			if (ok)
+				break;
+		}
+	} while (1);
+
+	msock->sk->sk_reuse = 1; /* SO_REUSEADDR */
+	sock->sk->sk_reuse = 1; /* SO_REUSEADDR */
+
+	sock->sk->sk_allocation = GFP_NOIO;
+	msock->sk->sk_allocation = GFP_NOIO;
+
+	sock->sk->sk_priority = TC_PRIO_INTERACTIVE_BULK;
+	msock->sk->sk_priority = TC_PRIO_INTERACTIVE;
+
+	if (mdev->net_conf->sndbuf_size) {
+		sock->sk->sk_sndbuf = mdev->net_conf->sndbuf_size;
+		sock->sk->sk_rcvbuf = mdev->net_conf->sndbuf_size;
+		sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK | SOCK_RCVBUF_LOCK;
+	}
+
+	/* NOT YET ...
+	 * sock->sk->sk_sndtimeo = mdev->net_conf->timeout*HZ/10;
+	 * sock->sk->sk_rcvtimeo = MAX_SCHEDULE_TIMEOUT;
+	 * first set it to the P_HAND_SHAKE timeout, which is hardcoded for now: */
+	sock->sk->sk_sndtimeo =
+	sock->sk->sk_rcvtimeo = 2*HZ;
+
+	msock->sk->sk_sndtimeo = mdev->net_conf->timeout*HZ/10;
+	msock->sk->sk_rcvtimeo = mdev->net_conf->ping_int*HZ;
+
+	/* we don't want delays.
+	 * we use TCP_CORK where appropriate, though */
+	drbd_tcp_nodelay(sock);
+	drbd_tcp_nodelay(msock);
+
+	mdev->data.socket = sock;
+	mdev->meta.socket = msock;
+	mdev->last_received = jiffies;
+
+	D_ASSERT(mdev->asender.task == NULL);
+
+	h = drbd_do_handshake(mdev);
+	if (h <= 0)
+		return h;
+
+	if (mdev->cram_hmac_tfm) {
+		/* drbd_request_state(mdev, NS(conn, WFAuth)); */
+		if (!drbd_do_auth(mdev)) {
+			dev_err(DEV, "Authentication of peer failed\n");
+			return -1;
+		}
+	}
+
+	if (drbd_request_state(mdev, NS(conn, C_WF_REPORT_PARAMS)) < SS_SUCCESS)
+		return 0;
+
+	sock->sk->sk_sndtimeo = mdev->net_conf->timeout*HZ/10;
+	sock->sk->sk_rcvtimeo = MAX_SCHEDULE_TIMEOUT;
+
+	atomic_set(&mdev->packet_seq, 0);
+	mdev->peer_seq = 0;
+
+	drbd_thread_start(&mdev->asender);
+
+	drbd_send_protocol(mdev);
+	drbd_send_sync_param(mdev, &mdev->sync_conf);
+	drbd_send_sizes(mdev);
+	drbd_send_uuids(mdev);
+	drbd_send_state(mdev);
+	clear_bit(USE_DEGR_WFC_T, &mdev->flags);
+
+	return 1;
+}
+
+STATIC int drbd_recv_header(struct drbd_conf *mdev, struct p_header *h)
+{
+	int r;
+
+	r = drbd_recv(mdev, h, sizeof(*h));
+
+	if (unlikely(r != sizeof(*h))) {
+		dev_err(DEV, "short read expecting header on sock: r=%d\n", r);
+		return FALSE;
+	}
+	h->command = be16_to_cpu(h->command);
+	h->length  = be16_to_cpu(h->length);
+	if (unlikely(h->magic != BE_DRBD_MAGIC)) {
+		dev_err(DEV, "magic?? on data m: 0x%lx c: %d l: %d\n",
+		    (long)be32_to_cpu(h->magic),
+		    h->command, h->length);
+		return FALSE;
+	}
+	mdev->last_received = jiffies;
+
+	return TRUE;
+}
+
+STATIC enum finish_epoch drbd_flush_after_epoch(struct drbd_conf *mdev, struct drbd_epoch *epoch)
+{
+	int rv;
+
+	if (mdev->write_ordering >= WO_bdev_flush && inc_local(mdev)) {
+		rv = blkdev_issue_flush(mdev->bc->backing_bdev, NULL);
+		if (rv) {
+			dev_err(DEV, "local disk flush failed with status %d\n", rv);
+			/* would rather check on EOPNOTSUPP, but that is not reliable.
+			 * don't try again for ANY return value != 0
+			 * if (rv == -EOPNOTSUPP) */
+			drbd_bump_write_ordering(mdev, WO_drain_io);
+		}
+		dec_local(mdev);
+	}
+
+	return drbd_may_finish_epoch(mdev, epoch, EV_BARRIER_DONE);
+}
+
+/**
+ * w_flush: Checks if an epoch can be closed and therefore might
+ * close and/or free the epoch object.
+ */
+STATIC int w_flush(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	struct flush_work *fw = (struct flush_work *)w;
+	struct drbd_epoch *epoch = fw->epoch;
+
+	kfree(w);
+
+	if (!test_and_set_bit(DE_BARRIER_IN_NEXT_EPOCH_ISSUED, &epoch->flags))
+		drbd_flush_after_epoch(mdev, epoch);
+
+	drbd_may_finish_epoch(mdev, epoch, EV_PUT |
+			      (mdev->state.conn < C_CONNECTED ? EV_CLEANUP : 0));
+
+	return 1;
+}
+
+/**
+ * drbd_may_finish_epoch: Checks if an epoch can be closed and therefore might
+ * close and/or free the epoch object.
+ */
+STATIC enum finish_epoch drbd_may_finish_epoch(struct drbd_conf *mdev,
+					       struct drbd_epoch *epoch,
+					       enum epoch_event ev)
+{
+	int finish, epoch_size;
+	struct drbd_epoch *next_epoch;
+	int schedule_flush = 0;
+	enum finish_epoch rv = FE_STILL_LIVE;
+
+	spin_lock(&mdev->epoch_lock);
+	do {
+		next_epoch = NULL;
+		finish = 0;
+
+		epoch_size = atomic_read(&epoch->epoch_size);
+
+		switch (ev & ~EV_CLEANUP) {
+		case EV_PUT:
+			atomic_dec(&epoch->active);
+			break;
+		case EV_GOT_BARRIER_NR:
+			set_bit(DE_HAVE_BARRIER_NUMBER, &epoch->flags);
+
+			/* Special case: If we just switched from WO_bio_barrier to
+			   WO_bdev_flush we should not finish the current epoch */
+			if (test_bit(DE_CONTAINS_A_BARRIER, &epoch->flags) && epoch_size == 1 &&
+			    mdev->write_ordering != WO_bio_barrier &&
+			    epoch == mdev->current_epoch)
+				clear_bit(DE_CONTAINS_A_BARRIER, &epoch->flags);
+			break;
+		case EV_BARRIER_DONE:
+			set_bit(DE_BARRIER_IN_NEXT_EPOCH_DONE, &epoch->flags);
+			break;
+		case EV_BECAME_LAST:
+			/* nothing to do*/
+			break;
+		}
+
+		trace_drbd_epoch(mdev, epoch, ev);
+
+		if (epoch_size != 0 &&
+		    atomic_read(&epoch->active) == 0 &&
+		    test_bit(DE_HAVE_BARRIER_NUMBER, &epoch->flags) &&
+		    epoch->list.prev == &mdev->current_epoch->list &&
+		    !test_bit(DE_IS_FINISHING, &epoch->flags)) {
+			/* Nearly all conditions are met to finish that epoch... */
+			if (test_bit(DE_BARRIER_IN_NEXT_EPOCH_DONE, &epoch->flags) ||
+			    mdev->write_ordering == WO_none ||
+			    (epoch_size == 1 && test_bit(DE_CONTAINS_A_BARRIER, &epoch->flags)) ||
+			    ev & EV_CLEANUP) {
+				finish = 1;
+				set_bit(DE_IS_FINISHING, &epoch->flags);
+			} else if (!test_bit(DE_BARRIER_IN_NEXT_EPOCH_ISSUED, &epoch->flags) &&
+				 mdev->write_ordering == WO_bio_barrier) {
+				atomic_inc(&epoch->active);
+				schedule_flush = 1;
+			}
+		}
+		if (finish) {
+			if (!(ev & EV_CLEANUP)) {
+				spin_unlock(&mdev->epoch_lock);
+				drbd_send_b_ack(mdev, epoch->barrier_nr, epoch_size);
+				spin_lock(&mdev->epoch_lock);
+			}
+			dec_unacked(mdev);
+
+			if (mdev->current_epoch != epoch) {
+				next_epoch = list_entry(epoch->list.next, struct drbd_epoch, list);
+				list_del(&epoch->list);
+				ev = EV_BECAME_LAST | (ev & EV_CLEANUP);
+				mdev->epochs--;
+				trace_drbd_epoch(mdev, epoch, EV_TRACE_FREE);
+				kfree(epoch);
+
+				if (rv == FE_STILL_LIVE)
+					rv = FE_DESTROYED;
+			} else {
+				epoch->flags = 0;
+				atomic_set(&epoch->epoch_size, 0);
+				/* atomic_set(&epoch->active, 0); is already zero */
+				if (rv == FE_STILL_LIVE)
+					rv = FE_RECYCLED;
+			}
+		}
+
+		if (!next_epoch)
+			break;
+
+		epoch = next_epoch;
+	} while (1);
+
+	spin_unlock(&mdev->epoch_lock);
+
+	if (schedule_flush) {
+		struct flush_work *fw;
+		fw = kmalloc(sizeof(*fw), GFP_ATOMIC);
+		if (fw) {
+			trace_drbd_epoch(mdev, epoch, EV_TRACE_FLUSH);
+			fw->w.cb = w_flush;
+			fw->epoch = epoch;
+			drbd_queue_work(&mdev->data.work, &fw->w);
+		} else {
+			dev_warn(DEV, "Could not kmalloc a flush_work obj\n");
+			set_bit(DE_BARRIER_IN_NEXT_EPOCH_ISSUED, &epoch->flags);
+			/* That is not a recursion, only one level */
+			drbd_may_finish_epoch(mdev, epoch, EV_BARRIER_DONE);
+			drbd_may_finish_epoch(mdev, epoch, EV_PUT);
+		}
+	}
+
+	return rv;
+}
+
+/**
+ * drbd_bump_write_ordering: It turned out that the current mdev->write_ordering
+ * method does not work on the backing block device. Try the next allowed method.
+ */
+void drbd_bump_write_ordering(struct drbd_conf *mdev, enum write_ordering_e wo) __must_hold(local)
+{
+	enum write_ordering_e pwo;
+	static char *write_ordering_str[] = {
+		[WO_none] = "none",
+		[WO_drain_io] = "drain",
+		[WO_bdev_flush] = "flush",
+		[WO_bio_barrier] = "barrier",
+	};
+
+	pwo = mdev->write_ordering;
+	wo = min(pwo, wo);
+	if (wo == WO_bio_barrier && mdev->bc->dc.no_disk_barrier)
+		wo = WO_bdev_flush;
+	if (wo == WO_bdev_flush && mdev->bc->dc.no_disk_flush)
+		wo = WO_drain_io;
+	if (wo == WO_drain_io && mdev->bc->dc.no_disk_drain)
+		wo = WO_none;
+	mdev->write_ordering = wo;
+	if (pwo != mdev->write_ordering || wo == WO_bio_barrier)
+		dev_info(DEV, "Method to ensure write ordering: %s\n", write_ordering_str[mdev->write_ordering]);
+}
+
+/**
+ * w_e_reissue: In case the IO subsystem delivered an error for a BIO with the
+ * BIO_RW_BARRIER flag set, retry that bio without the barrier flag set.
+ */
+int w_e_reissue(struct drbd_conf *mdev, struct drbd_work *w, int cancel) __releases(local)
+{
+	struct drbd_epoch_entry *e = (struct drbd_epoch_entry *)w;
+	struct bio *bio = e->private_bio;
+
+	/* We leave DE_CONTAINS_A_BARRIER and EE_IS_BARRIER in place,
+	   (and DE_BARRIER_IN_NEXT_EPOCH_ISSUED in the previous Epoch)
+	   so that we can finish that epoch in drbd_may_finish_epoch().
+	   That is necessary if we already have a long chain of Epochs, before
+	   we realize that BIO_RW_BARRIER is actually not supported */
+
+	/* As long as the -ENOTSUPP on the barrier is reported immediately
+	   that will never trigger. If it is reported late, we will just
+	   print that warning and continue correctly for all future requests
+	   with WO_bdev_flush */
+	if (previous_epoch(mdev, e->epoch))
+		dev_warn(DEV, "Write ordering was not enforced (one time event)\n");
+
+	/* prepare bio for re-submit,
+	 * re-init volatile members */
+	/* we still have a local reference,
+	 * inc_local was done in receive_Data. */
+	bio->bi_bdev = mdev->bc->backing_bdev;
+	bio->bi_sector = e->sector;
+	bio->bi_size = e->size;
+	bio->bi_idx = 0;
+
+	bio->bi_flags &= ~(BIO_POOL_MASK - 1);
+	bio->bi_flags |= 1 << BIO_UPTODATE;
+
+	/* don't know whether this is necessary: */
+	bio->bi_phys_segments = 0;
+	bio->bi_next = NULL;
+
+	/* these should be unchanged: */
+	/* bio->bi_end_io = drbd_endio_write_sec; */
+	/* bio->bi_vcnt = whatever; */
+
+	e->w.cb = e_end_block;
+
+	/* This is no longer a barrier request. */
+	bio->bi_rw &= ~(1UL << BIO_RW_BARRIER);
+
+	drbd_generic_make_request(mdev, DRBD_FAULT_DT_WR, bio);
+
+	return 1;
+}
+
+STATIC int receive_Barrier(struct drbd_conf *mdev, struct p_header *h)
+{
+	int rv, issue_flush;
+	struct p_barrier *p = (struct p_barrier *)h;
+	struct drbd_epoch *epoch;
+
+	ERR_IF(h->length != (sizeof(*p)-sizeof(*h))) return FALSE;
+
+	rv = drbd_recv(mdev, h->payload, h->length);
+	ERR_IF(rv != h->length) return FALSE;
+
+	inc_unacked(mdev);
+
+	if (mdev->net_conf->wire_protocol != DRBD_PROT_C)
+		drbd_kick_lo(mdev);
+
+	mdev->current_epoch->barrier_nr = p->barrier;
+	rv = drbd_may_finish_epoch(mdev, mdev->current_epoch, EV_GOT_BARRIER_NR);
+
+	/* P_BARRIER_ACK may imply that the corresponding extent is dropped from
+	 * the activity log, which means it would not be resynced in case the
+	 * R_PRIMARY crashes now.
+	 * Therefore we must send the barrier_ack after the barrier request was
+	 * completed. */
+	switch (mdev->write_ordering) {
+	case WO_bio_barrier:
+	case WO_none:
+		if (rv == FE_RECYCLED)
+			return TRUE;
+		break;
+
+	case WO_bdev_flush:
+	case WO_drain_io:
+		D_ASSERT(rv == FE_STILL_LIVE);
+		set_bit(DE_BARRIER_IN_NEXT_EPOCH_ISSUED, &mdev->current_epoch->flags);
+		drbd_wait_ee_list_empty(mdev, &mdev->active_ee);
+		rv = drbd_flush_after_epoch(mdev, mdev->current_epoch);
+		if (rv == FE_RECYCLED)
+			return TRUE;
+
+		/* The asender will send all the ACKs and barrier ACKs out, since
+		   all EEs moved from the active_ee to the done_ee. We need to
+		   provide a new epoch object for the EEs that come in soon */
+		break;
+	}
+
+	epoch = kmalloc(sizeof(struct drbd_epoch), GFP_KERNEL);
+	if (!epoch) {
+		dev_warn(DEV, "Allocation of an epoch failed, slowing down\n");
+		/* epoch is NULL here; flag the current epoch instead */
+		issue_flush = !test_and_set_bit(DE_BARRIER_IN_NEXT_EPOCH_ISSUED,
+						&mdev->current_epoch->flags);
+		drbd_wait_ee_list_empty(mdev, &mdev->active_ee);
+		if (issue_flush) {
+			rv = drbd_flush_after_epoch(mdev, mdev->current_epoch);
+			if (rv == FE_RECYCLED)
+				return TRUE;
+		}
+
+		drbd_wait_ee_list_empty(mdev, &mdev->done_ee);
+
+		return TRUE;
+	}
+
+	epoch->flags = 0;
+	atomic_set(&epoch->epoch_size, 0);
+	atomic_set(&epoch->active, 0);
+
+	spin_lock(&mdev->epoch_lock);
+	if (atomic_read(&mdev->current_epoch->epoch_size)) {
+		list_add(&epoch->list, &mdev->current_epoch->list);
+		mdev->current_epoch = epoch;
+		mdev->epochs++;
+		trace_drbd_epoch(mdev, epoch, EV_TRACE_ALLOC);
+	} else {
+		/* The current_epoch got recycled while we allocated this one... */
+		kfree(epoch);
+	}
+	spin_unlock(&mdev->epoch_lock);
+
+	return TRUE;
+}
+
+/* used from receive_RSDataReply (recv_resync_read)
+ * and from receive_Data */
+STATIC struct drbd_epoch_entry *
+read_in_block(struct drbd_conf *mdev, u64 id, sector_t sector, int data_size) __must_hold(local)
+{
+	struct drbd_epoch_entry *e;
+	struct bio_vec *bvec;
+	struct page *page;
+	struct bio *bio;
+	int dgs, ds, i, rr;
+	void *dig_in = mdev->int_dig_in;
+	void *dig_vv = mdev->int_dig_vv;
+
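+	/* if we agreed on data integrity checking with the peer
+	 * (protocol >= 87 and integrity_r_tfm configured), a digest of
+	 * dgs bytes precedes the actual data on the wire */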
+	dgs = (mdev->agreed_pro_version >= 87 && mdev->integrity_r_tfm) ?
+		crypto_hash_digestsize(mdev->integrity_r_tfm) : 0;
+
+	if (dgs) {
+		rr = drbd_recv(mdev, dig_in, dgs);
+		if (rr != dgs) {
+			dev_warn(DEV, "short read receiving data digest: read %d expected %d\n",
+			     rr, dgs);
+			return NULL;
+		}
+	}
+
+	data_size -= dgs;
+
+	ERR_IF(data_size &  0x1ff) return NULL;
+	ERR_IF(data_size >  DRBD_MAX_SEGMENT_SIZE) return NULL;
+
+	e = drbd_alloc_ee(mdev, id, sector, data_size, GFP_KERNEL);
+	if (!e)
+		return NULL;
+	bio = e->private_bio;
+	ds = data_size;
+	bio_for_each_segment(bvec, bio, i) {
+		page = bvec->bv_page;
+		rr = drbd_recv(mdev, kmap(page), min_t(int, ds, PAGE_SIZE));
+		kunmap(page);
+		if (rr != min_t(int, ds, PAGE_SIZE)) {
+			drbd_free_ee(mdev, e);
+			dev_warn(DEV, "short read receiving data: read %d expected %d\n",
+			     rr, min_t(int, ds, PAGE_SIZE));
+			return NULL;
+		}
+		ds -= rr;
+	}
+
+	if (dgs) {
+		drbd_csum(mdev, mdev->integrity_r_tfm, bio, dig_vv);
+		if (memcmp(dig_in, dig_vv, dgs)) {
+			dev_err(DEV, "Digest integrity check FAILED.\n");
+			drbd_bcast_ee(mdev, "digest failed",
+					dgs, dig_in, dig_vv, e);
+			drbd_free_ee(mdev, e);
+			return NULL;
+		}
+	}
+	mdev->recv_cnt += data_size>>9;
+	return e;
+}
+
+/* drbd_drain_block() just takes a data block
+ * out of the socket input buffer, and discards it.
+ */
+STATIC int drbd_drain_block(struct drbd_conf *mdev, int data_size)
+{
+	struct page *page;
+	int rr, rv = 1;
+	void *data;
+
+	page = drbd_pp_alloc(mdev, GFP_KERNEL);
+
+	data = kmap(page);
+	while (data_size) {
+		rr = drbd_recv(mdev, data, min_t(int, data_size, PAGE_SIZE));
+		if (rr != min_t(int, data_size, PAGE_SIZE)) {
+			rv = 0;
+			dev_warn(DEV, "short read receiving data: read %d expected %d\n",
+			     rr, min_t(int, data_size, PAGE_SIZE));
+			break;
+		}
+		data_size -= rr;
+	}
+	kunmap(page);
+	drbd_pp_free(mdev, page);
+	return rv;
+}
+
+/* kick lower level device, if we have more than (arbitrary number)
+ * reference counts on it, which typically are locally submitted io
+ * requests.  don't use unacked_cnt, so we speed up proto A and B, too. */
+static void maybe_kick_lo(struct drbd_conf *mdev)
+{
+	if (atomic_read(&mdev->local_cnt) >= mdev->net_conf->unplug_watermark)
+		drbd_kick_lo(mdev);
+}
+
+STATIC int recv_dless_read(struct drbd_conf *mdev, struct drbd_request *req,
+			   sector_t sector, int data_size)
+{
+	struct bio_vec *bvec;
+	struct bio *bio;
+	int dgs, rr, i, expect;
+	void *dig_in = mdev->int_dig_in;
+	void *dig_vv = mdev->int_dig_vv;
+
+	dgs = (mdev->agreed_pro_version >= 87 && mdev->integrity_r_tfm) ?
+		crypto_hash_digestsize(mdev->integrity_r_tfm) : 0;
+
+	if (dgs) {
+		rr = drbd_recv(mdev, dig_in, dgs);
+		if (rr != dgs) {
+			dev_warn(DEV, "short read receiving data reply digest: read %d expected %d\n",
+			     rr, dgs);
+			return 0;
+		}
+	}
+
+	data_size -= dgs;
+
+	bio = req->master_bio;
+	D_ASSERT(sector == bio->bi_sector);
+
+	bio_for_each_segment(bvec, bio, i) {
+		expect = min_t(int, data_size, bvec->bv_len);
+		rr = drbd_recv(mdev,
+			     kmap(bvec->bv_page)+bvec->bv_offset,
+			     expect);
+		kunmap(bvec->bv_page);
+		if (rr != expect) {
+			dev_warn(DEV, "short read receiving data reply: "
+			     "read %d expected %d\n",
+			     rr, expect);
+			return 0;
+		}
+		data_size -= rr;
+	}
+
+	if (dgs) {
+		drbd_csum(mdev, mdev->integrity_r_tfm, bio, dig_vv);
+		if (memcmp(dig_in, dig_vv, dgs)) {
+			dev_err(DEV, "Digest integrity check FAILED. Broken NICs?\n");
+			return 0;
+		}
+	}
+
+	D_ASSERT(data_size == 0);
+	return 1;
+}
+
+/* e_end_resync_block() is called via
+ * drbd_process_done_ee() by asender only */
+STATIC int e_end_resync_block(struct drbd_conf *mdev, struct drbd_work *w, int unused)
+{
+	struct drbd_epoch_entry *e = (struct drbd_epoch_entry *)w;
+	sector_t sector = e->sector;
+	int ok;
+
+	D_ASSERT(hlist_unhashed(&e->colision));
+
+	if (likely(drbd_bio_uptodate(e->private_bio))) {
+		drbd_set_in_sync(mdev, sector, e->size);
+		ok = drbd_send_ack(mdev, P_RS_WRITE_ACK, e);
+	} else {
+		/* Record failure to sync */
+		drbd_rs_failed_io(mdev, sector, e->size);
+
+		ok  = drbd_send_ack(mdev, P_NEG_ACK, e);
+		ok &= drbd_io_error(mdev, FALSE);
+	}
+	dec_unacked(mdev);
+
+	return ok;
+}
+
+STATIC int recv_resync_read(struct drbd_conf *mdev, sector_t sector, int data_size) __releases(local)
+{
+	struct drbd_epoch_entry *e;
+
+	e = read_in_block(mdev, ID_SYNCER, sector, data_size);
+	if (!e) {
+		dec_local(mdev);
+		return FALSE;
+	}
+
+	dec_rs_pending(mdev);
+
+	e->private_bio->bi_end_io = drbd_endio_write_sec;
+	e->private_bio->bi_rw = WRITE;
+	e->w.cb = e_end_resync_block;
+
+	inc_unacked(mdev);
+	/* corresponding dec_unacked() in e_end_resync_block(),
+	 * respectively in _drbd_clear_done_ee */
+
+	spin_lock_irq(&mdev->req_lock);
+	list_add(&e->w.list, &mdev->sync_ee);
+	spin_unlock_irq(&mdev->req_lock);
+
+	trace_drbd_ee(mdev, e, "submitting for (rs)write");
+	trace_drbd_bio(mdev, "Sec", e->private_bio, 0, NULL);
+	drbd_generic_make_request(mdev, DRBD_FAULT_RS_WR, e->private_bio);
+	/* accounting done in endio */
+
+	maybe_kick_lo(mdev);
+	return TRUE;
+}
+
+STATIC int receive_DataReply(struct drbd_conf *mdev, struct p_header *h)
+{
+	struct drbd_request *req;
+	sector_t sector;
+	unsigned int header_size, data_size;
+	int ok;
+	struct p_data *p = (struct p_data *)h;
+
+	header_size = sizeof(*p) - sizeof(*h);
+	data_size   = h->length  - header_size;
+
+	ERR_IF(data_size == 0) return FALSE;
+
+	if (drbd_recv(mdev, h->payload, header_size) != header_size)
+		return FALSE;
+
+	sector = be64_to_cpu(p->sector);
+
+	spin_lock_irq(&mdev->req_lock);
+	req = _ar_id_to_req(mdev, p->block_id, sector);
+	spin_unlock_irq(&mdev->req_lock);
+	if (unlikely(!req)) {
+		dev_err(DEV, "Got a corrupt block_id/sector pair(1).\n");
+		return FALSE;
+	}
+
+	/* hlist_del(&req->colision) is done in _req_may_be_done, to avoid
+	 * special casing it there for the various failure cases.
+	 * still no race with drbd_fail_pending_reads */
+	ok = recv_dless_read(mdev, req, sector, data_size);
+
+	if (ok)
+		req_mod(req, data_received, 0);
+	/* else: nothing. handled from drbd_disconnect...
+	 * I don't think we may complete this just yet
+	 * in case we are "on-disconnect: freeze" */
+
+	return ok;
+}
+
+STATIC int receive_RSDataReply(struct drbd_conf *mdev, struct p_header *h)
+{
+	sector_t sector;
+	unsigned int header_size, data_size;
+	int ok;
+	struct p_data *p = (struct p_data *)h;
+
+	header_size = sizeof(*p) - sizeof(*h);
+	data_size   = h->length  - header_size;
+
+	ERR_IF(data_size == 0) return FALSE;
+
+	if (drbd_recv(mdev, h->payload, header_size) != header_size)
+		return FALSE;
+
+	sector = be64_to_cpu(p->sector);
+	D_ASSERT(p->block_id == ID_SYNCER);
+
+	if (inc_local(mdev)) {
+		/* data is submitted to disk within recv_resync_read.
+		 * corresponding dec_local done below on error,
+		 * or in drbd_endio_write_sec. */
+		ok = recv_resync_read(mdev, sector, data_size);
+	} else {
+		if (__ratelimit(&drbd_ratelimit_state))
+			dev_err(DEV, "Can not write resync data to local disk.\n");
+
+		ok = drbd_drain_block(mdev, data_size);
+
+		drbd_send_ack_dp(mdev, P_NEG_ACK, p);
+	}
+
+	return ok;
+}
+
+/* e_end_block() is called via drbd_process_done_ee().
+ * this means this function only runs in the asender thread
+ */
+STATIC int e_end_block(struct drbd_conf *mdev, struct drbd_work *w, int unused)
+{
+	struct drbd_epoch_entry *e = (struct drbd_epoch_entry *)w;
+	sector_t sector = e->sector;
+	struct drbd_epoch *epoch;
+	int ok = 1, pcmd;
+
+	if (e->flags & EE_IS_BARRIER) {
+		epoch = previous_epoch(mdev, e->epoch);
+		if (epoch)
+			drbd_may_finish_epoch(mdev, epoch, EV_BARRIER_DONE);
+	}
+
+	if (mdev->net_conf->wire_protocol == DRBD_PROT_C) {
+		if (likely(drbd_bio_uptodate(e->private_bio))) {
+			pcmd = (mdev->state.conn >= C_SYNC_SOURCE &&
+				mdev->state.conn <= C_PAUSED_SYNC_T &&
+				e->flags & EE_MAY_SET_IN_SYNC) ?
+				P_RS_WRITE_ACK : P_WRITE_ACK;
+			ok &= drbd_send_ack(mdev, pcmd, e);
+			if (pcmd == P_RS_WRITE_ACK)
+				drbd_set_in_sync(mdev, sector, e->size);
+		} else {
+			ok  = drbd_send_ack(mdev, P_NEG_ACK, e);
+			ok &= drbd_io_error(mdev, FALSE);
+			/* we expect it to be marked out of sync anyways...
+			 * maybe assert this?  */
+		}
+		dec_unacked(mdev);
+	} else if (unlikely(!drbd_bio_uptodate(e->private_bio))) {
+		ok = drbd_io_error(mdev, FALSE);
+	}
+
+	/* we delete from the conflict detection hash _after_ we sent out the
+	 * P_WRITE_ACK / P_NEG_ACK, to get the sequence number right.  */
+	if (mdev->net_conf->two_primaries) {
+		spin_lock_irq(&mdev->req_lock);
+		D_ASSERT(!hlist_unhashed(&e->colision));
+		hlist_del_init(&e->colision);
+		spin_unlock_irq(&mdev->req_lock);
+	} else {
+		D_ASSERT(hlist_unhashed(&e->colision));
+	}
+
+	drbd_may_finish_epoch(mdev, e->epoch, EV_PUT);
+
+	return ok;
+}
+
+STATIC int e_send_discard_ack(struct drbd_conf *mdev, struct drbd_work *w, int unused)
+{
+	struct drbd_epoch_entry *e = (struct drbd_epoch_entry *)w;
+	int ok = 1;
+
+	D_ASSERT(mdev->net_conf->wire_protocol == DRBD_PROT_C);
+	ok = drbd_send_ack(mdev, P_DISCARD_ACK, e);
+
+	spin_lock_irq(&mdev->req_lock);
+	D_ASSERT(!hlist_unhashed(&e->colision));
+	hlist_del_init(&e->colision);
+	spin_unlock_irq(&mdev->req_lock);
+
+	dec_unacked(mdev);
+
+	return ok;
+}
+
+/* Called from receive_Data.
+ * Synchronize packets on sock with packets on msock.
+ *
+ * This is here so even when a P_DATA packet traveling via sock overtook an Ack
+ * packet traveling on msock, they are still processed in the order they have
+ * been sent.
+ *
+ * Note: we don't care for Ack packets overtaking P_DATA packets.
+ *
+ * In case packet_seq is larger than mdev->peer_seq number, there are
+ * outstanding packets on the msock. We wait for them to arrive.
+ * In case we are the logically next packet, we update mdev->peer_seq
+ * ourselves. Correctly handles 32bit wrap around.
+ *
+ * Assume we have a 10 GBit connection, that is about 1<<30 byte per second,
+ * about 1<<21 sectors per second. So "worst" case, we have 1<<3 == 8 seconds
+ * for the 24bit wrap (historical atomic_t guarantee on some archs), and
+ * 1<<11 == 2048 seconds aka ages for the 32bit wrap around...
+ *
+ * returns 0 if we may process the packet,
+ * -ERESTARTSYS if we were interrupted (by disconnect signal). */
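+/* A worked illustration: suppose mdev->peer_seq is 7 and a P_DATA with
+ * packet_seq 9 arrives on sock.  seq_le(9, 7+1) is false, so we sleep on
+ * seq_wait until whoever processes the earlier packet has advanced peer_seq
+ * to 8; then packet 9 is the logically next one, we accept it and bump
+ * peer_seq to 9 ourselves. */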
+static int drbd_wait_peer_seq(struct drbd_conf *mdev, const u32 packet_seq)
+{
+	DEFINE_WAIT(wait);
+	unsigned int p_seq;
+	long timeout;
+	int ret = 0;
+	spin_lock(&mdev->peer_seq_lock);
+	for (;;) {
+		prepare_to_wait(&mdev->seq_wait, &wait, TASK_INTERRUPTIBLE);
+		if (seq_le(packet_seq, mdev->peer_seq+1))
+			break;
+		if (signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+		p_seq = mdev->peer_seq;
+		spin_unlock(&mdev->peer_seq_lock);
+		timeout = schedule_timeout(30*HZ);
+		spin_lock(&mdev->peer_seq_lock);
+		if (timeout == 0 && p_seq == mdev->peer_seq) {
+			ret = -ETIMEDOUT;
+			dev_err(DEV, "ASSERT FAILED waited 30 seconds for sequence update, forcing reconnect\n");
+			break;
+		}
+	}
+	finish_wait(&mdev->seq_wait, &wait);
+	if (mdev->peer_seq+1 == packet_seq)
+		mdev->peer_seq++;
+	spin_unlock(&mdev->peer_seq_lock);
+	return ret;
+}
+
+/* mirrored write */
+STATIC int receive_Data(struct drbd_conf *mdev, struct p_header *h)
+{
+	sector_t sector;
+	struct drbd_epoch_entry *e;
+	struct p_data *p = (struct p_data *)h;
+	int header_size, data_size;
+	int rw = WRITE;
+	u32 dp_flags;
+
+	header_size = sizeof(*p) - sizeof(*h);
+	data_size   = h->length  - header_size;
+
+	ERR_IF(data_size == 0) return FALSE;
+
+	if (drbd_recv(mdev, h->payload, header_size) != header_size)
+		return FALSE;
+
+	/* If we get the local reference, data is submitted to disk at the end
+	 * of this function; the corresponding dec_local is done either below
+	 * (on error), or in drbd_endio_write_sec. */
+	if (!inc_local(mdev)) {
+		if (__ratelimit(&drbd_ratelimit_state))
+			dev_err(DEV, "Can not write mirrored data block "
+			    "to local disk.\n");
+		spin_lock(&mdev->peer_seq_lock);
+		if (mdev->peer_seq+1 == be32_to_cpu(p->seq_num))
+			mdev->peer_seq++;
+		spin_unlock(&mdev->peer_seq_lock);
+
+		drbd_send_ack_dp(mdev, P_NEG_ACK, p);
+		atomic_inc(&mdev->current_epoch->epoch_size);
+		return drbd_drain_block(mdev, data_size);
+	}
+
+	sector = be64_to_cpu(p->sector);
+	e = read_in_block(mdev, p->block_id, sector, data_size);
+	if (!e) {
+		dec_local(mdev);
+		return FALSE;
+	}
+
+	e->private_bio->bi_end_io = drbd_endio_write_sec;
+	e->w.cb = e_end_block;
+
+	spin_lock(&mdev->epoch_lock);
+	e->epoch = mdev->current_epoch;
+	atomic_inc(&e->epoch->epoch_size);
+	atomic_inc(&e->epoch->active);
+
+	if (mdev->write_ordering == WO_bio_barrier && atomic_read(&e->epoch->epoch_size) == 1) {
+		struct drbd_epoch *epoch;
+		/* Issue a barrier if we start a new epoch, and the previous epoch
+		   was not an epoch containing a single request which already was
+		   a Barrier. */
+		epoch = list_entry(e->epoch->list.prev, struct drbd_epoch, list);
+		if (epoch == e->epoch) {
+			set_bit(DE_CONTAINS_A_BARRIER, &e->epoch->flags);
+			trace_drbd_epoch(mdev, e->epoch, EV_TRACE_ADD_BARRIER);
+			rw |= (1<<BIO_RW_BARRIER);
+			e->flags |= EE_IS_BARRIER;
+		} else {
+			if (atomic_read(&epoch->epoch_size) > 1 ||
+			    !test_bit(DE_CONTAINS_A_BARRIER, &epoch->flags)) {
+				set_bit(DE_BARRIER_IN_NEXT_EPOCH_ISSUED, &epoch->flags);
+				trace_drbd_epoch(mdev, epoch, EV_TRACE_SETTING_BI);
+				set_bit(DE_CONTAINS_A_BARRIER, &e->epoch->flags);
+				trace_drbd_epoch(mdev, e->epoch, EV_TRACE_ADD_BARRIER);
+				rw |= (1<<BIO_RW_BARRIER);
+				e->flags |= EE_IS_BARRIER;
+			}
+		}
+	}
+	spin_unlock(&mdev->epoch_lock);
+
+	dp_flags = be32_to_cpu(p->dp_flags);
+	if (dp_flags & DP_HARDBARRIER)
+		rw |= (1<<BIO_RW_BARRIER);
+	if (dp_flags & DP_RW_SYNC)
+		rw |= (1<<BIO_RW_SYNCIO) | (1<<BIO_RW_UNPLUG);
+	if (dp_flags & DP_MAY_SET_IN_SYNC)
+		e->flags |= EE_MAY_SET_IN_SYNC;
+
+	/* I'm the receiver, I do hold a net_cnt reference. */
+	if (!mdev->net_conf->two_primaries) {
+		spin_lock_irq(&mdev->req_lock);
+	} else {
+		/* don't get the req_lock yet,
+		 * we may sleep in drbd_wait_peer_seq */
+		const int size = e->size;
+		const int discard = test_bit(DISCARD_CONCURRENT, &mdev->flags);
+		DEFINE_WAIT(wait);
+		struct drbd_request *i;
+		struct hlist_node *n;
+		struct hlist_head *slot;
+		int first;
+
+		D_ASSERT(mdev->net_conf->wire_protocol == DRBD_PROT_C);
+		BUG_ON(mdev->ee_hash == NULL);
+		BUG_ON(mdev->tl_hash == NULL);
+
+		/* conflict detection and handling:
+		 * 1. wait on the sequence number,
+		 *    in case this data packet overtook ACK packets.
+		 * 2. check our hash tables for conflicting requests.
+		 *    we only need to walk the tl_hash, since an ee can not
+		 *    have a conflict with an other ee: on the submitting
+		 *    node, the corresponding req had already been conflicting,
+		 *    and a conflicting req is never sent.
+		 *
+		 * Note: for two_primaries, we are protocol C,
+		 * so there cannot be any request that is DONE
+		 * but still on the transfer log.
+		 *
+		 * unconditionally add to the ee_hash.
+		 *
+		 * if no conflicting request is found:
+		 *    submit.
+		 *
+		 * if any conflicting request is found
+		 * that has not yet been acked,
+		 * AND I have the "discard concurrent writes" flag:
+		 *	 queue (via done_ee) the P_DISCARD_ACK; OUT.
+		 *
+		 * if any conflicting request is found:
+		 *	 block the receiver, waiting on misc_wait
+		 *	 until no more conflicting requests are there,
+		 *	 or we get interrupted (disconnect).
+		 *
+		 *	 we do not just write after local io completion of those
+		 *	 requests, but only after req is done completely, i.e.
+		 *	 we wait for the P_DISCARD_ACK to arrive!
+		 *
+		 *	 then proceed normally, i.e. submit.
+		 */
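+		/* Illustration: a local application write to an overlapping
+		 * range that is still in tl_hash takes the conflict path
+		 * below; on the first pass, with DISCARD_CONCURRENT set and
+		 * the request not yet acked, we answer with P_DISCARD_ACK
+		 * instead of writing; otherwise we block on misc_wait until
+		 * the conflicting request is done completely, then submit. */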
+		if (drbd_wait_peer_seq(mdev, be32_to_cpu(p->seq_num)))
+			goto out_interrupted;
+
+		spin_lock_irq(&mdev->req_lock);
+
+		hlist_add_head(&e->colision, ee_hash_slot(mdev, sector));
+
+#define OVERLAPS overlaps(i->sector, i->size, sector, size)
+		slot = tl_hash_slot(mdev, sector);
+		first = 1;
+		for (;;) {
+			int have_unacked = 0;
+			int have_conflict = 0;
+			prepare_to_wait(&mdev->misc_wait, &wait,
+				TASK_INTERRUPTIBLE);
+			hlist_for_each_entry(i, n, slot, colision) {
+				if (OVERLAPS) {
+					/* only ALERT on first iteration,
+					 * we may be woken up early... */
+					if (first)
+						dev_alert(DEV, "%s[%u] Concurrent local write detected!"
+						      "	new: %llus +%u; pending: %llus +%u\n",
+						      current->comm, current->pid,
+						      (unsigned long long)sector, size,
+						      (unsigned long long)i->sector, i->size);
+					if (i->rq_state & RQ_NET_PENDING)
+						++have_unacked;
+					++have_conflict;
+				}
+			}
+#undef OVERLAPS
+			if (!have_conflict)
+				break;
+
+			/* Discard Ack only for the _first_ iteration */
+			if (first && discard && have_unacked) {
+				dev_alert(DEV, "Concurrent write! [DISCARD BY FLAG] sec=%llus\n",
+				     (unsigned long long)sector);
+				inc_unacked(mdev);
+				e->w.cb = e_send_discard_ack;
+				list_add_tail(&e->w.list, &mdev->done_ee);
+
+				spin_unlock_irq(&mdev->req_lock);
+
+				/* we could probably send that P_DISCARD_ACK ourselves,
+				 * but I don't like the receiver using the msock */
+
+				dec_local(mdev);
+				wake_asender(mdev);
+				finish_wait(&mdev->misc_wait, &wait);
+				return TRUE;
+			}
+
+			if (signal_pending(current)) {
+				hlist_del_init(&e->colision);
+
+				spin_unlock_irq(&mdev->req_lock);
+
+				finish_wait(&mdev->misc_wait, &wait);
+				goto out_interrupted;
+			}
+
+			spin_unlock_irq(&mdev->req_lock);
+			if (first) {
+				first = 0;
+				dev_alert(DEV, "Concurrent write! [W AFTERWARDS] "
+				     "sec=%llus\n", (unsigned long long)sector);
+			} else if (discard) {
+				/* we had none on the first iteration.
+				 * there must be none now. */
+				D_ASSERT(have_unacked == 0);
+			}
+			schedule();
+			spin_lock_irq(&mdev->req_lock);
+		}
+		finish_wait(&mdev->misc_wait, &wait);
+	}
+
+	list_add(&e->w.list, &mdev->active_ee);
+	spin_unlock_irq(&mdev->req_lock);
+
+	switch (mdev->net_conf->wire_protocol) {
+	case DRBD_PROT_C:
+		inc_unacked(mdev);
+		/* corresponding dec_unacked() in e_end_block(),
+		 * respectively in _drbd_clear_done_ee */
+		break;
+	case DRBD_PROT_B:
+		/* I really don't like it that the receiver thread
+		 * sends on the msock, but anyways */
+		drbd_send_ack(mdev, P_RECV_ACK, e);
+		break;
+	case DRBD_PROT_A:
+		/* nothing to do */
+		break;
+	}
+
+	if (mdev->state.pdsk == D_DISKLESS) {
+		/* In case we have the only disk of the cluster (peer is
+		 * diskless): mark out of sync and cover it in the activity log. */
+		drbd_set_out_of_sync(mdev, e->sector, e->size);
+		e->flags |= EE_CALL_AL_COMPLETE_IO;
+		drbd_al_begin_io(mdev, e->sector);
+	}
+
+	e->private_bio->bi_rw = rw;
+	trace_drbd_ee(mdev, e, "submitting for (data)write");
+	trace_drbd_bio(mdev, "Sec", e->private_bio, 0, NULL);
+	drbd_generic_make_request(mdev, DRBD_FAULT_DT_WR, e->private_bio);
+	/* accounting done in endio */
+
+	maybe_kick_lo(mdev);
+	return TRUE;
+
+out_interrupted:
+	/* yes, the epoch_size now is imbalanced.
+	 * but we drop the connection anyways, so we don't have a chance to
+	 * receive a barrier... atomic_inc(&mdev->epoch_size); */
+	dec_local(mdev);
+	drbd_free_ee(mdev, e);
+	return FALSE;
+}
+
+STATIC int receive_DataRequest(struct drbd_conf *mdev, struct p_header *h)
+{
+	sector_t sector;
+	const sector_t capacity = drbd_get_capacity(mdev->this_bdev);
+	struct drbd_epoch_entry *e;
+	struct digest_info *di;
+	int size, digest_size;
+	unsigned int fault_type;
+	struct p_block_req *p =
+		(struct p_block_req *)h;
+	const int brps = sizeof(*p)-sizeof(*h);
+
+	if (drbd_recv(mdev, h->payload, brps) != brps)
+		return FALSE;
+
+	sector = be64_to_cpu(p->sector);
+	size   = be32_to_cpu(p->blksize);
+
+	if (size <= 0 || (size & 0x1ff) != 0 || size > DRBD_MAX_SEGMENT_SIZE) {
+		dev_err(DEV, "%s:%d: sector: %llus, size: %u\n", __FILE__, __LINE__,
+				(unsigned long long)sector, size);
+		return FALSE;
+	}
+	if (sector + (size>>9) > capacity) {
+		dev_err(DEV, "%s:%d: sector: %llus, size: %u\n", __FILE__, __LINE__,
+				(unsigned long long)sector, size);
+		return FALSE;
+	}
+
+	if (!inc_local_if_state(mdev, D_UP_TO_DATE)) {
+		if (__ratelimit(&drbd_ratelimit_state))
+			dev_err(DEV, "Can not satisfy peer's read request, "
+			    "no local data.\n");
+		drbd_send_ack_rp(mdev, h->command == P_DATA_REQUEST ? P_NEG_DREPLY :
+				 P_NEG_RS_DREPLY , p);
+		return TRUE;
+	}
+
+	e = drbd_alloc_ee(mdev, p->block_id, sector, size, GFP_KERNEL);
+	if (!e) {
+		dec_local(mdev);
+		return FALSE;
+	}
+
+	e->private_bio->bi_rw = READ;
+	e->private_bio->bi_end_io = drbd_endio_read_sec;
+
+	switch (h->command) {
+	case P_DATA_REQUEST:
+		e->w.cb = w_e_end_data_req;
+		fault_type = DRBD_FAULT_DT_RD;
+		break;
+	case P_RS_DATA_REQUEST:
+		e->w.cb = w_e_end_rsdata_req;
+		fault_type = DRBD_FAULT_RS_RD;
+		/* Eventually this should become asynchronous. Currently it
+		 * blocks the whole receiver just to delay the reading of a
+		 * resync data block.
+		 * The drbd_work_queue mechanism is made for this...
+		 */
+		if (!drbd_rs_begin_io(mdev, sector)) {
+			/* we have been interrupted,
+			 * probably connection lost! */
+			D_ASSERT(signal_pending(current));
+			dec_local(mdev);
+			drbd_free_ee(mdev, e);
+			return 0;
+		}
+		break;
+
+	case P_OV_REPLY:
+	case P_CSUM_RS_REQUEST:
+		fault_type = DRBD_FAULT_RS_RD;
+		digest_size = h->length - brps;
+		di = kmalloc(sizeof(*di) + digest_size, GFP_KERNEL);
+		if (!di) {
+			dec_local(mdev);
+			drbd_free_ee(mdev, e);
+			return 0;
+		}
+
+		di->digest_size = digest_size;
+		di->digest = (((char *)di)+sizeof(struct digest_info));
+
+		if (drbd_recv(mdev, di->digest, digest_size) != digest_size) {
+			dec_local(mdev);
+			drbd_free_ee(mdev, e);
+			kfree(di);
+			return FALSE;
+		}
+
+		e->block_id = (u64)(unsigned long)di;
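+		/* The peer's digest stays attached to the ee via block_id; the
+		 * worker (w_e_end_csum_rs_req / w_e_end_ov_reply) compares it
+		 * against the locally computed digest and, for the checksum
+		 * based resync, only ships the full data block if they differ. */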
+		if (h->command == P_CSUM_RS_REQUEST) {
+			D_ASSERT(mdev->agreed_pro_version >= 89);
+			e->w.cb = w_e_end_csum_rs_req;
+		} else if (h->command == P_OV_REPLY) {
+			e->w.cb = w_e_end_ov_reply;
+			dec_rs_pending(mdev);
+			break;
+		}
+
+		if (!drbd_rs_begin_io(mdev, sector)) {
+			/* we have been interrupted, probably connection lost! */
+			D_ASSERT(signal_pending(current));
+			drbd_free_ee(mdev, e);
+			kfree(di);
+			dec_local(mdev);
+			return FALSE;
+		}
+		break;
+
+	case P_OV_REQUEST:
+		e->w.cb = w_e_end_ov_req;
+		fault_type = DRBD_FAULT_RS_RD;
+		/* Eventually this should become asynchronous. Currently it
+		 * blocks the whole receiver just to delay the reading of a
+		 * resync data block.
+		 * The drbd_work_queue mechanism is made for this...
+		 */
+		if (!drbd_rs_begin_io(mdev, sector)) {
+			/* we have been interrupted,
+			 * probably connection lost! */
+			D_ASSERT(signal_pending(current));
+			dec_local(mdev);
+			drbd_free_ee(mdev, e);
+			return 0;
+		}
+		break;
+
+
+	default:
+		dev_err(DEV, "unexpected command (%s) in receive_DataRequest\n",
+		    cmdname(h->command));
+		fault_type = DRBD_FAULT_MAX;
+	}
+
+	spin_lock_irq(&mdev->req_lock);
+	list_add(&e->w.list, &mdev->read_ee);
+	spin_unlock_irq(&mdev->req_lock);
+
+	inc_unacked(mdev);
+
+	trace_drbd_ee(mdev, e, "submitting for read");
+	trace_drbd_bio(mdev, "Sec", e->private_bio, 0, NULL);
+	drbd_generic_make_request(mdev, fault_type, e->private_bio);
+	maybe_kick_lo(mdev);
+
+	return TRUE;
+}
+
+STATIC int drbd_asb_recover_0p(struct drbd_conf *mdev) __must_hold(local)
+{
+	int self, peer, rv = -100;
+	unsigned long ch_self, ch_peer;
+
+	self = mdev->bc->md.uuid[UI_BITMAP] & 1;
+	peer = mdev->p_uuid[UI_BITMAP] & 1;
+
+	ch_peer = mdev->p_uuid[UI_SIZE];
+	ch_self = mdev->comm_bm_set;
+
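+	/* ch_self / ch_peer: how many bits (blocks) each side has out of sync;
+	 * the peer's count arrives in the UI_SIZE slot of its uuid packet.
+	 * E.g. with discard-least-changes, ch_self = 100 vs ch_peer = 5000
+	 * makes us the sync target (rv = -1). */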
+	switch (mdev->net_conf->after_sb_0p) {
+	case ASB_CONSENSUS:
+	case ASB_DISCARD_SECONDARY:
+	case ASB_CALL_HELPER:
+		dev_err(DEV, "Configuration error.\n");
+		break;
+	case ASB_DISCONNECT:
+		break;
+	case ASB_DISCARD_YOUNGER_PRI:
+		if (self == 0 && peer == 1) {
+			rv = -1;
+			break;
+		}
+		if (self == 1 && peer == 0) {
+			rv =  1;
+			break;
+		}
+		/* Else fall through to one of the other strategies... */
+	case ASB_DISCARD_OLDER_PRI:
+		if (self == 0 && peer == 1) {
+			rv = 1;
+			break;
+		}
+		if (self == 1 && peer == 0) {
+			rv = -1;
+			break;
+		}
+		/* Else fall through to one of the other strategies... */
+		dev_warn(DEV, "Discard younger/older primary did not find a decision\n"
+		     "Using discard-least-changes instead\n");
+	case ASB_DISCARD_ZERO_CHG:
+		if (ch_peer == 0 && ch_self == 0) {
+			rv = test_bit(DISCARD_CONCURRENT, &mdev->flags)
+				? -1 : 1;
+			break;
+		} else {
+			if (ch_peer == 0) { rv =  1; break; }
+			if (ch_self == 0) { rv = -1; break; }
+		}
+		if (mdev->net_conf->after_sb_0p == ASB_DISCARD_ZERO_CHG)
+			break;
+	case ASB_DISCARD_LEAST_CHG:
+		if	(ch_self < ch_peer)
+			rv = -1;
+		else if (ch_self > ch_peer)
+			rv =  1;
+		else /* ( ch_self == ch_peer ) */
+		     /* Well, then use something else. */
+			rv = test_bit(DISCARD_CONCURRENT, &mdev->flags)
+				? -1 : 1;
+		break;
+	case ASB_DISCARD_LOCAL:
+		rv = -1;
+		break;
+	case ASB_DISCARD_REMOTE:
+		rv =  1;
+	}
+
+	return rv;
+}
+
+STATIC int drbd_asb_recover_1p(struct drbd_conf *mdev) __must_hold(local)
+{
+	int self, peer, hg, rv = -100;
+
+	self = mdev->bc->md.uuid[UI_BITMAP] & 1;
+	peer = mdev->p_uuid[UI_BITMAP] & 1;
+
+	switch (mdev->net_conf->after_sb_1p) {
+	case ASB_DISCARD_YOUNGER_PRI:
+	case ASB_DISCARD_OLDER_PRI:
+	case ASB_DISCARD_LEAST_CHG:
+	case ASB_DISCARD_LOCAL:
+	case ASB_DISCARD_REMOTE:
+		dev_err(DEV, "Configuration error.\n");
+		break;
+	case ASB_DISCONNECT:
+		break;
+	case ASB_CONSENSUS:
+		hg = drbd_asb_recover_0p(mdev);
+		if (hg == -1 && mdev->state.role == R_SECONDARY)
+			rv = hg;
+		if (hg == 1  && mdev->state.role == R_PRIMARY)
+			rv = hg;
+		break;
+	case ASB_VIOLENTLY:
+		rv = drbd_asb_recover_0p(mdev);
+		break;
+	case ASB_DISCARD_SECONDARY:
+		return mdev->state.role == R_PRIMARY ? 1 : -1;
+	case ASB_CALL_HELPER:
+		hg = drbd_asb_recover_0p(mdev);
+		if (hg == -1 && mdev->state.role == R_PRIMARY) {
+			self = drbd_set_role(mdev, R_SECONDARY, 0);
+			if (self != SS_SUCCESS) {
+				drbd_khelper(mdev, "pri-lost-after-sb");
+			} else {
+				dev_warn(DEV, "Successfully gave up primary role.\n");
+				rv = hg;
+			}
+		} else
+			rv = hg;
+	}
+
+	return rv;
+}
+
+STATIC int drbd_asb_recover_2p(struct drbd_conf *mdev) __must_hold(local)
+{
+	int self, peer, hg, rv = -100;
+
+	self = mdev->bc->md.uuid[UI_BITMAP] & 1;
+	peer = mdev->p_uuid[UI_BITMAP] & 1;
+
+	switch (mdev->net_conf->after_sb_2p) {
+	case ASB_DISCARD_YOUNGER_PRI:
+	case ASB_DISCARD_OLDER_PRI:
+	case ASB_DISCARD_LEAST_CHG:
+	case ASB_DISCARD_LOCAL:
+	case ASB_DISCARD_REMOTE:
+	case ASB_CONSENSUS:
+	case ASB_DISCARD_SECONDARY:
+		dev_err(DEV, "Configuration error.\n");
+		break;
+	case ASB_VIOLENTLY:
+		rv = drbd_asb_recover_0p(mdev);
+		break;
+	case ASB_DISCONNECT:
+		break;
+	case ASB_CALL_HELPER:
+		hg = drbd_asb_recover_0p(mdev);
+		if (hg == -1) {
+			self = drbd_set_role(mdev, R_SECONDARY, 0);
+			if (self != SS_SUCCESS) {
+				drbd_khelper(mdev, "pri-lost-after-sb");
+			} else {
+				dev_warn(DEV, "Successfully gave up primary role.\n");
+				rv = hg;
+			}
+		} else
+			rv = hg;
+	}
+
+	return rv;
+}
+
+STATIC void drbd_uuid_dump(struct drbd_conf *mdev, char *text, u64 *uuid,
+			   u64 bits, u64 flags)
+{
+	if (!uuid) {
+		dev_info(DEV, "%s uuid info vanished while I was looking!\n", text);
+		return;
+	}
+	dev_info(DEV, "%s %016llX:%016llX:%016llX:%016llX bits:%llu flags:%llX\n",
+	     text,
+	     (unsigned long long)uuid[UI_CURRENT],
+	     (unsigned long long)uuid[UI_BITMAP],
+	     (unsigned long long)uuid[UI_HISTORY_START],
+	     (unsigned long long)uuid[UI_HISTORY_END],
+	     (unsigned long long)bits,
+	     (unsigned long long)flags);
+}
+
+/*
+  100	after split brain try auto recover
+    2	C_SYNC_SOURCE set BitMap
+    1	C_SYNC_SOURCE use BitMap
+    0	no Sync
+   -1	C_SYNC_TARGET use BitMap
+   -2	C_SYNC_TARGET set BitMap
+ -100	after split brain, disconnect
+-1000	unrelated data
+ */
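+/* For illustration, rule 4 below: after a common power failure where both
+ * nodes had been primary at crash time (rct == 3), the node that has
+ * DISCARD_CONCURRENT set becomes sync target (-1), the other one sync
+ * source (1), both using the bitmap. */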
+STATIC int drbd_uuid_compare(struct drbd_conf *mdev, int *rule_nr) __must_hold(local)
+{
+	u64 self, peer;
+	int i, j;
+
+	self = mdev->bc->md.uuid[UI_CURRENT] & ~((u64)1);
+	peer = mdev->p_uuid[UI_CURRENT] & ~((u64)1);
+
+	*rule_nr = 1;
+	if (self == UUID_JUST_CREATED && peer == UUID_JUST_CREATED)
+		return 0;
+
+	*rule_nr = 2;
+	if ((self == UUID_JUST_CREATED || self == (u64)0) &&
+	     peer != UUID_JUST_CREATED)
+		return -2;
+
+	*rule_nr = 3;
+	if (self != UUID_JUST_CREATED &&
+	    (peer == UUID_JUST_CREATED || peer == (u64)0))
+		return 2;
+
+	*rule_nr = 4;
+	if (self == peer) { /* Common power [off|failure] */
+		int rct, dc; /* roles at crash time */
+
+		rct = (test_bit(CRASHED_PRIMARY, &mdev->flags) ? 1 : 0) +
+			(mdev->p_uuid[UI_FLAGS] & 2);
+		/* lowest bit is set when we were primary,
+		 * next bit (weight 2) is set when peer was primary */
+
+		switch (rct) {
+		case 0: /* !self_pri && !peer_pri */ return 0;
+		case 1: /*  self_pri && !peer_pri */ return 1;
+		case 2: /* !self_pri &&  peer_pri */ return -1;
+		case 3: /*  self_pri &&  peer_pri */
+			dc = test_bit(DISCARD_CONCURRENT, &mdev->flags);
+			return dc ? -1 : 1;
+		}
+	}
+
+	*rule_nr = 5;
+	peer = mdev->p_uuid[UI_BITMAP] & ~((u64)1);
+	if (self == peer)
+		return -1;
+
+	*rule_nr = 6;
+	for (i = UI_HISTORY_START; i <= UI_HISTORY_END; i++) {
+		peer = mdev->p_uuid[i] & ~((u64)1);
+		if (self == peer)
+			return -2;
+	}
+
+	*rule_nr = 7;
+	self = mdev->bc->md.uuid[UI_BITMAP] & ~((u64)1);
+	peer = mdev->p_uuid[UI_CURRENT] & ~((u64)1);
+	if (self == peer)
+		return 1;
+
+	*rule_nr = 8;
+	for (i = UI_HISTORY_START; i <= UI_HISTORY_END; i++) {
+		self = mdev->bc->md.uuid[i] & ~((u64)1);
+		if (self == peer)
+			return 2;
+	}
+
+	*rule_nr = 9;
+	self = mdev->bc->md.uuid[UI_BITMAP] & ~((u64)1);
+	peer = mdev->p_uuid[UI_BITMAP] & ~((u64)1);
+	if (self == peer && self != ((u64)0))
+		return 100;
+
+	*rule_nr = 10;
+	for (i = UI_HISTORY_START; i <= UI_HISTORY_END; i++) {
+		self = mdev->bc->md.uuid[i] & ~((u64)1);
+		for (j = UI_HISTORY_START; j <= UI_HISTORY_END; j++) {
+			peer = mdev->p_uuid[j] & ~((u64)1);
+			if (self == peer)
+				return -100;
+		}
+	}
+
+	return -1000;
+}
+
+/* drbd_sync_handshake() returns the new conn state on success, or
+   C_MASK (-1) on failure.
+ */
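+/* Sign convention of the hg value used below: hg > 0 means we become sync
+ * source, hg < 0 sync target; |hg| >= 2 additionally forces a full sync by
+ * first setting all bits in the bitmap (drbd_bmio_set_n_write). */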
+STATIC enum drbd_conns drbd_sync_handshake(struct drbd_conf *mdev, enum drbd_role peer_role,
+					   enum drbd_disk_state peer_disk) __must_hold(local)
+{
+	int hg, rule_nr;
+	enum drbd_conns rv = C_MASK;
+	enum drbd_disk_state mydisk;
+
+	mydisk = mdev->state.disk;
+	if (mydisk == D_NEGOTIATING)
+		mydisk = mdev->new_state_tmp.disk;
+
+	hg = drbd_uuid_compare(mdev, &rule_nr);
+
+	dev_info(DEV, "drbd_sync_handshake:\n");
+	drbd_uuid_dump(mdev, "self", mdev->bc->md.uuid,
+		       mdev->state.disk >= D_NEGOTIATING ? drbd_bm_total_weight(mdev) : 0, 0);
+	drbd_uuid_dump(mdev, "peer", mdev->p_uuid,
+		       mdev->p_uuid[UI_SIZE], mdev->p_uuid[UI_FLAGS]);
+	dev_info(DEV, "uuid_compare()=%d by rule %d\n", hg, rule_nr);
+
+	if (hg == -1000) {
+		dev_alert(DEV, "Unrelated data, aborting!\n");
+		return C_MASK;
+	}
+
+	if    ((mydisk == D_INCONSISTENT && peer_disk > D_INCONSISTENT) ||
+	    (peer_disk == D_INCONSISTENT && mydisk    > D_INCONSISTENT)) {
+		int f = (hg == -100) || abs(hg) == 2;
+		hg = mydisk > D_INCONSISTENT ? 1 : -1;
+		if (f)
+			hg = hg*2;
+		dev_info(DEV, "Becoming sync %s due to disk states.\n",
+		     hg > 0 ? "source" : "target");
+	}
+
+	if (hg == 100 || (hg == -100 && mdev->net_conf->always_asbp)) {
+		int pcount = (mdev->state.role == R_PRIMARY)
+			   + (peer_role == R_PRIMARY);
+		int forced = (hg == -100);
+
+		switch (pcount) {
+		case 0:
+			hg = drbd_asb_recover_0p(mdev);
+			break;
+		case 1:
+			hg = drbd_asb_recover_1p(mdev);
+			break;
+		case 2:
+			hg = drbd_asb_recover_2p(mdev);
+			break;
+		}
+		if (abs(hg) < 100) {
+			dev_warn(DEV, "Split-Brain detected, %d primaries, "
+			     "automatically solved. Sync from %s node\n",
+			     pcount, (hg < 0) ? "peer" : "this");
+			if (forced) {
+				dev_warn(DEV, "Doing a full sync, since"
+				     " UUIDs were ambiguous.\n");
+				hg = hg*2;
+			}
+		}
+	}
+
+	if (hg == -100) {
+		if (mdev->net_conf->want_lose && !(mdev->p_uuid[UI_FLAGS]&1))
+			hg = -1;
+		if (!mdev->net_conf->want_lose && (mdev->p_uuid[UI_FLAGS]&1))
+			hg = 1;
+
+		if (abs(hg) < 100)
+			dev_warn(DEV, "Split-Brain detected, manually solved. "
+			     "Sync from %s node\n",
+			     (hg < 0) ? "peer" : "this");
+	}
+
+	if (hg == -100) {
+		dev_alert(DEV, "Split-Brain detected, dropping connection!\n");
+		drbd_khelper(mdev, "split-brain");
+		return C_MASK;
+	}
+
+	if (hg > 0 && mydisk <= D_INCONSISTENT) {
+		dev_err(DEV, "I shall become SyncSource, but I am inconsistent!\n");
+		return C_MASK;
+	}
+
+	if (hg < 0 && /* by intention we do not use mydisk here. */
+	    mdev->state.role == R_PRIMARY && mdev->state.disk >= D_CONSISTENT) {
+		switch (mdev->net_conf->rr_conflict) {
+		case ASB_CALL_HELPER:
+			drbd_khelper(mdev, "pri-lost");
+			/* fall through */
+		case ASB_DISCONNECT:
+			dev_err(DEV, "I shall become SyncTarget, but I am primary!\n");
+			return C_MASK;
+		case ASB_VIOLENTLY:
+			dev_warn(DEV, "Becoming SyncTarget, violating the stable-data "
+			     "assumption\n");
+		}
+	}
+
+	if (abs(hg) >= 2) {
+		dev_info(DEV, "Writing the whole bitmap, full sync required after drbd_sync_handshake.\n");
+		if (drbd_bitmap_io(mdev, &drbd_bmio_set_n_write, "set_n_write from sync_handshake"))
+			return C_MASK;
+	}
+
+	if (hg > 0) { /* become sync source. */
+		rv = C_WF_BITMAP_S;
+	} else if (hg < 0) { /* become sync target */
+		rv = C_WF_BITMAP_T;
+	} else {
+		rv = C_CONNECTED;
+		if (drbd_bm_total_weight(mdev)) {
+			dev_info(DEV, "No resync, but %lu bits in bitmap!\n",
+			     drbd_bm_total_weight(mdev));
+		}
+	}
+
+	drbd_bm_recount_bits(mdev);
+
+	return rv;
+}
+
+/* returns 1 if invalid */
+STATIC int cmp_after_sb(enum drbd_after_sb_p peer, enum drbd_after_sb_p self)
+{
+	/* ASB_DISCARD_REMOTE - ASB_DISCARD_LOCAL is valid */
+	if ((peer == ASB_DISCARD_REMOTE && self == ASB_DISCARD_LOCAL) ||
+	    (self == ASB_DISCARD_REMOTE && peer == ASB_DISCARD_LOCAL))
+		return 0;
+
+	/* any other things with ASB_DISCARD_REMOTE or ASB_DISCARD_LOCAL are invalid */
+	if (peer == ASB_DISCARD_REMOTE || peer == ASB_DISCARD_LOCAL ||
+	    self == ASB_DISCARD_REMOTE || self == ASB_DISCARD_LOCAL)
+		return 1;
+
+	/* everything else is valid if they are equal on both sides. */
+	if (peer == self)
+		return 0;
+
+	/* everything else is invalid. */
+	return 1;
+}
+
+STATIC int receive_protocol(struct drbd_conf *mdev, struct p_header *h)
+{
+	struct p_protocol *p = (struct p_protocol *)h;
+	int header_size, data_size;
+	int p_proto, p_after_sb_0p, p_after_sb_1p, p_after_sb_2p;
+	int p_want_lose, p_two_primaries;
+	char p_integrity_alg[SHARED_SECRET_MAX] = "";
+
+	header_size = sizeof(*p) - sizeof(*h);
+	data_size   = h->length  - header_size;
+
+	if (drbd_recv(mdev, h->payload, header_size) != header_size)
+		return FALSE;
+
+	p_proto		= be32_to_cpu(p->protocol);
+	p_after_sb_0p	= be32_to_cpu(p->after_sb_0p);
+	p_after_sb_1p	= be32_to_cpu(p->after_sb_1p);
+	p_after_sb_2p	= be32_to_cpu(p->after_sb_2p);
+	p_want_lose	= be32_to_cpu(p->want_lose);
+	p_two_primaries = be32_to_cpu(p->two_primaries);
+
+	if (p_proto != mdev->net_conf->wire_protocol) {
+		dev_err(DEV, "incompatible communication protocols\n");
+		goto disconnect;
+	}
+
+	if (cmp_after_sb(p_after_sb_0p, mdev->net_conf->after_sb_0p)) {
+		dev_err(DEV, "incompatible after-sb-0pri settings\n");
+		goto disconnect;
+	}
+
+	if (cmp_after_sb(p_after_sb_1p, mdev->net_conf->after_sb_1p)) {
+		dev_err(DEV, "incompatible after-sb-1pri settings\n");
+		goto disconnect;
+	}
+
+	if (cmp_after_sb(p_after_sb_2p, mdev->net_conf->after_sb_2p)) {
+		dev_err(DEV, "incompatible after-sb-2pri settings\n");
+		goto disconnect;
+	}
+
+	if (p_want_lose && mdev->net_conf->want_lose) {
+		dev_err(DEV, "both sides have the 'want_lose' flag set\n");
+		goto disconnect;
+	}
+
+	if (p_two_primaries != mdev->net_conf->two_primaries) {
+		dev_err(DEV, "incompatible setting of the two-primaries options\n");
+		goto disconnect;
+	}
+
+	if (mdev->agreed_pro_version >= 87) {
+		unsigned char *my_alg = mdev->net_conf->integrity_alg;
+
+		if (drbd_recv(mdev, p_integrity_alg, data_size) != data_size)
+			return FALSE;
+
+		p_integrity_alg[SHARED_SECRET_MAX-1] = 0;
+		if (strcmp(p_integrity_alg, my_alg)) {
+			dev_err(DEV, "incompatible setting of the data-integrity-alg\n");
+			goto disconnect;
+		}
+		dev_info(DEV, "data-integrity-alg: %s\n",
+		     my_alg[0] ? my_alg : (unsigned char *)"<not-used>");
+	}
+
+	return TRUE;
+
+disconnect:
+	drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+	return FALSE;
+}
+
+/* helper function
+ * input: alg name, feature name
+ * return: NULL (alg name was "")
+ *         ERR_PTR(error) if something goes wrong
+ *         or the crypto hash ptr, if it worked out ok. */
+struct crypto_hash *drbd_crypto_alloc_digest_safe(const struct drbd_conf *mdev,
+		const char *alg, const char *name)
+{
+	struct crypto_hash *tfm;
+
+	if (!alg[0])
+		return NULL;
+
+	tfm = crypto_alloc_hash(alg, 0, CRYPTO_ALG_ASYNC);
+	if (IS_ERR(tfm)) {
+		dev_err(DEV, "Can not allocate \"%s\" as %s (reason: %ld)\n",
+			alg, name, PTR_ERR(tfm));
+		return tfm;
+	}
+	if (crypto_tfm_alg_type(crypto_hash_tfm(tfm)) != CRYPTO_ALG_TYPE_DIGEST) {
+		crypto_free_hash(tfm);
+		dev_err(DEV, "\"%s\" is not a digest (%s)\n", alg, name);
+		return ERR_PTR(-EINVAL);
+	}
+	return tfm;
+}
+
+STATIC int receive_SyncParam(struct drbd_conf *mdev, struct p_header *h)
+{
+	int ok = TRUE;
+	struct p_rs_param_89 *p = (struct p_rs_param_89 *)h;
+	unsigned int header_size, data_size, exp_max_sz;
+	struct crypto_hash *verify_tfm = NULL;
+	struct crypto_hash *csums_tfm = NULL;
+	const int apv = mdev->agreed_pro_version;
+
+	exp_max_sz  = apv <= 87 ? sizeof(struct p_rs_param)
+		    : apv == 88 ? sizeof(struct p_rs_param)
+					+ SHARED_SECRET_MAX
+		    : /* 89 */    sizeof(struct p_rs_param_89);
+
+	if (h->length > exp_max_sz) {
+		dev_err(DEV, "SyncParam packet too long: received %u, expected <= %u bytes\n",
+		    h->length, exp_max_sz);
+		return FALSE;
+	}
+
+	if (apv <= 88) {
+		header_size = sizeof(struct p_rs_param) - sizeof(*h);
+		data_size   = h->length  - header_size;
+	} else /* apv >= 89 */ {
+		header_size = sizeof(struct p_rs_param_89) - sizeof(*h);
+		data_size   = h->length  - header_size;
+		D_ASSERT(data_size == 0);
+	}
+
+	/* initialize verify_alg and csums_alg */
+	memset(p->verify_alg, 0, 2 * SHARED_SECRET_MAX);
+
+	if (drbd_recv(mdev, h->payload, header_size) != header_size)
+		return FALSE;
+
+	mdev->sync_conf.rate	  = be32_to_cpu(p->rate);
+
+	if (apv >= 88) {
+		if (apv == 88) {
+			if (data_size > SHARED_SECRET_MAX) {
+				dev_err(DEV, "verify-alg too long, "
+				    "peer wants %u, accepting only %u byte\n",
+						data_size, SHARED_SECRET_MAX);
+				return FALSE;
+			}
+
+			if (drbd_recv(mdev, p->verify_alg, data_size) != data_size)
+				return FALSE;
+
+			/* we expect NUL terminated string */
+			/* but just in case someone tries to be evil */
+			D_ASSERT(p->verify_alg[data_size-1] == 0);
+			p->verify_alg[data_size-1] = 0;
+
+		} else /* apv >= 89 */ {
+			/* we still expect NUL terminated strings */
+			/* but just in case someone tries to be evil */
+			D_ASSERT(p->verify_alg[SHARED_SECRET_MAX-1] == 0);
+			D_ASSERT(p->csums_alg[SHARED_SECRET_MAX-1] == 0);
+			p->verify_alg[SHARED_SECRET_MAX-1] = 0;
+			p->csums_alg[SHARED_SECRET_MAX-1] = 0;
+		}
+
+		if (strcmp(mdev->sync_conf.verify_alg, p->verify_alg)) {
+			if (mdev->state.conn == C_WF_REPORT_PARAMS) {
+				dev_err(DEV, "Different verify-alg settings. me=\"%s\" peer=\"%s\"\n",
+				    mdev->sync_conf.verify_alg, p->verify_alg);
+				goto disconnect;
+			}
+			verify_tfm = drbd_crypto_alloc_digest_safe(mdev,
+					p->verify_alg, "verify-alg");
+			if (IS_ERR(verify_tfm))
+				goto disconnect;
+		}
+
+		if (apv >= 89 && strcmp(mdev->sync_conf.csums_alg, p->csums_alg)) {
+			if (mdev->state.conn == C_WF_REPORT_PARAMS) {
+				dev_err(DEV, "Different csums-alg settings. me=\"%s\" peer=\"%s\"\n",
+				    mdev->sync_conf.csums_alg, p->csums_alg);
+				goto disconnect;
+			}
+			csums_tfm = drbd_crypto_alloc_digest_safe(mdev,
+					p->csums_alg, "csums-alg");
+			if (IS_ERR(csums_tfm))
+				goto disconnect;
+		}
+
+
+		spin_lock(&mdev->peer_seq_lock);
+		/* lock against drbd_nl_syncer_conf() */
+		if (verify_tfm) {
+			strcpy(mdev->sync_conf.verify_alg, p->verify_alg);
+			mdev->sync_conf.verify_alg_len = strlen(p->verify_alg) + 1;
+			crypto_free_hash(mdev->verify_tfm);
+			mdev->verify_tfm = verify_tfm;
+			dev_info(DEV, "using verify-alg: \"%s\"\n", p->verify_alg);
+		}
+		if (csums_tfm) {
+			strcpy(mdev->sync_conf.csums_alg, p->csums_alg);
+			mdev->sync_conf.csums_alg_len = strlen(p->csums_alg) + 1;
+			crypto_free_hash(mdev->csums_tfm);
+			mdev->csums_tfm = csums_tfm;
+			dev_info(DEV, "using csums-alg: \"%s\"\n", p->csums_alg);
+		}
+		spin_unlock(&mdev->peer_seq_lock);
+	}
+
+	return ok;
+disconnect:
+	crypto_free_hash(verify_tfm);
+	drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+	return FALSE;
+}
+
+STATIC void drbd_setup_order_type(struct drbd_conf *mdev, int peer)
+{
+	/* sorry, we currently have no working implementation
+	 * of distributed TCQ */
+}
+
+/* warn if the arguments differ by more than 12.5% */
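+/* e.g. a = 1000, b = 850: d = 150 > (a>>3) == 125, so we warn;
+ * with b = 900, d = 100 stays below both a>>3 and b>>3, and we keep quiet. */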
+static void warn_if_differ_considerably(struct drbd_conf *mdev,
+	const char *s, sector_t a, sector_t b)
+{
+	sector_t d;
+	if (a == 0 || b == 0)
+		return;
+	d = (a > b) ? (a - b) : (b - a);
+	if (d > (a>>3) || d > (b>>3))
+		dev_warn(DEV, "Considerable difference in %s: %llus vs. %llus\n", s,
+		     (unsigned long long)a, (unsigned long long)b);
+}
+
+STATIC int receive_sizes(struct drbd_conf *mdev, struct p_header *h)
+{
+	struct p_sizes *p = (struct p_sizes *)h;
+	enum determine_dev_size dd = unchanged;
+	unsigned int max_seg_s;
+	sector_t p_size, p_usize, my_usize;
+	int ldsc = 0; /* local disk size changed */
+	enum drbd_conns nconn;
+
+	ERR_IF(h->length != (sizeof(*p)-sizeof(*h))) return FALSE;
+	if (drbd_recv(mdev, h->payload, h->length) != h->length)
+		return FALSE;
+
+	p_size = be64_to_cpu(p->d_size);
+	p_usize = be64_to_cpu(p->u_size);
+
+	if (p_size == 0 && mdev->state.disk == D_DISKLESS) {
+		dev_err(DEV, "some backing storage is needed\n");
+		drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+		return FALSE;
+	}
+
+	/* just store the peer's disk size for now.
+	 * we still need to figure out whether we accept that. */
+	mdev->p_size = p_size;
+
+#define min_not_zero(l, r) ((l) == 0 ? (r) : ((r) == 0 ? (l) : min(l, r)))
+	if (inc_local(mdev)) {
+		warn_if_differ_considerably(mdev, "lower level device sizes",
+			   p_size, drbd_get_max_capacity(mdev->bc));
+		warn_if_differ_considerably(mdev, "user requested size",
+					    p_usize, mdev->bc->dc.disk_size);
+
+		/* if this is the first connect, or an otherwise expected
+		 * param exchange, choose the minimum */
+		if (mdev->state.conn == C_WF_REPORT_PARAMS)
+			p_usize = min_not_zero((sector_t)mdev->bc->dc.disk_size,
+					     p_usize);
+
+		my_usize = mdev->bc->dc.disk_size;
+
+		if (mdev->bc->dc.disk_size != p_usize) {
+			mdev->bc->dc.disk_size = p_usize;
+			dev_info(DEV, "Peer sets u_size to %lu sectors\n",
+			     (unsigned long)mdev->bc->dc.disk_size);
+		}
+
+		/* Never shrink a device with usable data during connect.
+		   But allow online shrinking if we are connected. */
+		if (drbd_new_dev_size(mdev, mdev->bc) <
+		   drbd_get_capacity(mdev->this_bdev) &&
+		   mdev->state.disk >= D_OUTDATED &&
+		   mdev->state.conn < C_CONNECTED) {
+			dev_err(DEV, "The peer's disk size is too small!\n");
+			drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+			mdev->bc->dc.disk_size = my_usize;
+			dec_local(mdev);
+			return FALSE;
+		}
+		dec_local(mdev);
+	}
+#undef min_not_zero
+
+	if (inc_local(mdev)) {
+		dd = drbd_determin_dev_size(mdev);
+		dec_local(mdev);
+		if (dd == dev_size_error)
+			return FALSE;
+		drbd_md_sync(mdev);
+	} else {
+		/* I am diskless, need to accept the peer's size. */
+		drbd_set_my_capacity(mdev, p_size);
+	}
+
+	if (mdev->p_uuid && mdev->state.conn <= C_CONNECTED && inc_local(mdev)) {
+		nconn = drbd_sync_handshake(mdev,
+				mdev->state.peer, mdev->state.pdsk);
+		dec_local(mdev);
+
+		if (nconn == C_MASK) {
+			drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+			return FALSE;
+		}
+
+		if (drbd_request_state(mdev, NS(conn, nconn)) < SS_SUCCESS) {
+			drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+			return FALSE;
+		}
+	}
+
+	if (inc_local(mdev)) {
+		if (mdev->bc->known_size != drbd_get_capacity(mdev->bc->backing_bdev)) {
+			mdev->bc->known_size = drbd_get_capacity(mdev->bc->backing_bdev);
+			ldsc = 1;
+		}
+
+		max_seg_s = be32_to_cpu(p->max_segment_size);
+		if (max_seg_s != mdev->rq_queue->max_segment_size)
+			drbd_setup_queue_param(mdev, max_seg_s);
+
+		drbd_setup_order_type(mdev, be32_to_cpu(p->queue_order_type));
+		dec_local(mdev);
+	}
+
+	if (mdev->state.conn > C_WF_REPORT_PARAMS) {
+		if (be64_to_cpu(p->c_size) !=
+		    drbd_get_capacity(mdev->this_bdev) || ldsc) {
+			/* we have different sizes, probably the peer
+			 * needs to know my new size... */
+			drbd_send_sizes(mdev);
+		}
+		if (dd == grew && mdev->state.conn == C_CONNECTED) {
+			if (mdev->state.pdsk >= D_INCONSISTENT &&
+			    mdev->state.disk >= D_INCONSISTENT)
+				resync_after_online_grow(mdev);
+			else
+				set_bit(RESYNC_AFTER_NEG, &mdev->flags);
+		}
+	}
+
+	return TRUE;
+}
+
+STATIC int receive_uuids(struct drbd_conf *mdev, struct p_header *h)
+{
+	struct p_uuids *p = (struct p_uuids *)h;
+	u64 *p_uuid;
+	int i;
+
+	ERR_IF(h->length != (sizeof(*p)-sizeof(*h))) return FALSE;
+	if (drbd_recv(mdev, h->payload, h->length) != h->length)
+		return FALSE;
+
+	p_uuid = kmalloc(sizeof(u64)*UI_EXTENDED_SIZE, GFP_KERNEL);
+	if (!p_uuid)
+		return FALSE;
+
+	for (i = UI_CURRENT; i < UI_EXTENDED_SIZE; i++)
+		p_uuid[i] = be64_to_cpu(p->uuid[i]);
+
+	kfree(mdev->p_uuid);
+	mdev->p_uuid = p_uuid;
+
+	if (mdev->state.conn < C_CONNECTED &&
+	    mdev->state.disk < D_INCONSISTENT &&
+	    mdev->state.role == R_PRIMARY &&
+	    (mdev->ed_uuid & ~((u64)1)) != (p_uuid[UI_CURRENT] & ~((u64)1))) {
+		dev_err(DEV, "Can only connect to data with current UUID=%016llX\n",
+		    (unsigned long long)mdev->ed_uuid);
+		drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+		return FALSE;
+	}
+
+	if (inc_local(mdev)) {
+		int skip_initial_sync =
+			mdev->state.conn == C_CONNECTED &&
+			mdev->agreed_pro_version >= 90 &&
+			mdev->bc->md.uuid[UI_CURRENT] == UUID_JUST_CREATED &&
+			(p_uuid[UI_FLAGS] & 8);
+		if (skip_initial_sync) {
+			dev_info(DEV, "Accepted new current UUID, preparing to skip initial sync\n");
+			drbd_bitmap_io(mdev, &drbd_bmio_clear_n_write,
+					"clear_n_write from receive_uuids");
+			_drbd_uuid_set(mdev, UI_CURRENT, p_uuid[UI_CURRENT]);
+			_drbd_uuid_set(mdev, UI_BITMAP, 0);
+			_drbd_set_state(_NS2(mdev, disk, D_UP_TO_DATE, pdsk, D_UP_TO_DATE),
+					CS_VERBOSE, NULL);
+			drbd_md_sync(mdev);
+		}
+		dec_local(mdev);
+	}
+
+	/* Before we test for the disk state, we should wait until a possibly
+	   ongoing cluster wide state change is finished. That is important if
+	   we are primary and are detaching from our disk. We need to see the
+	   new disk state... */
+	wait_event(mdev->misc_wait, !test_bit(CLUSTER_ST_CHANGE, &mdev->flags));
+	if (mdev->state.conn >= C_CONNECTED && mdev->state.disk < D_INCONSISTENT)
+		drbd_set_ed_uuid(mdev, p_uuid[UI_CURRENT]);
+
+	return TRUE;
+}
+
+/**
+ * convert_state:
+ * Switches the view of the state: swap "self" and "peer", so that the
+ * state the peer reports is expressed from our point of view.
+ */
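+/* For illustration: if the peer reports role=Primary, peer=Secondary,
+ * disk=UpToDate, pdsk=Inconsistent, the converted view is role=Secondary,
+ * peer=Primary, disk=Inconsistent, pdsk=UpToDate. */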
+STATIC union drbd_state convert_state(union drbd_state ps)
+{
+	union drbd_state ms;
+
+	static enum drbd_conns c_tab[] = {
+		[C_CONNECTED] = C_CONNECTED,
+
+		[C_STARTING_SYNC_S] = C_STARTING_SYNC_T,
+		[C_STARTING_SYNC_T] = C_STARTING_SYNC_S,
+		[C_DISCONNECTING] = C_TEAR_DOWN, /* C_NETWORK_FAILURE, */
+		[C_VERIFY_S]       = C_VERIFY_T,
+		[C_MASK]   = C_MASK,
+	};
+
+	ms.i = ps.i;
+
+	ms.conn = c_tab[ps.conn];
+	ms.peer = ps.role;
+	ms.role = ps.peer;
+	ms.pdsk = ps.disk;
+	ms.disk = ps.pdsk;
+	ms.peer_isp = (ps.aftr_isp | ps.user_isp);
+
+	return ms;
+}
+
+STATIC int receive_req_state(struct drbd_conf *mdev, struct p_header *h)
+{
+	struct p_req_state *p = (struct p_req_state *)h;
+	union drbd_state mask, val;
+	int rv;
+
+	ERR_IF(h->length != (sizeof(*p)-sizeof(*h))) return FALSE;
+	if (drbd_recv(mdev, h->payload, h->length) != h->length)
+		return FALSE;
+
+	mask.i = be32_to_cpu(p->mask);
+	val.i = be32_to_cpu(p->val);
+
+	if (test_bit(DISCARD_CONCURRENT, &mdev->flags) &&
+	    test_bit(CLUSTER_ST_CHANGE, &mdev->flags)) {
+		drbd_send_sr_reply(mdev, SS_CONCURRENT_ST_CHG);
+		return TRUE;
+	}
+
+	mask = convert_state(mask);
+	val = convert_state(val);
+
+	rv = drbd_change_state(mdev, CS_VERBOSE, mask, val);
+
+	drbd_send_sr_reply(mdev, rv);
+	drbd_md_sync(mdev);
+
+	return TRUE;
+}
+
+STATIC int receive_state(struct drbd_conf *mdev, struct p_header *h)
+{
+	struct p_state *p = (struct p_state *)h;
+	enum drbd_conns nconn, oconn;
+	union drbd_state ns, peer_state;
+	enum drbd_disk_state real_peer_disk;
+	int rv;
+
+	ERR_IF(h->length != (sizeof(*p)-sizeof(*h)))
+		return FALSE;
+
+	if (drbd_recv(mdev, h->payload, h->length) != h->length)
+		return FALSE;
+
+	peer_state.i = be32_to_cpu(p->state);
+
+	real_peer_disk = peer_state.disk;
+	if (peer_state.disk == D_NEGOTIATING) {
+		real_peer_disk = mdev->p_uuid[UI_FLAGS] & 4 ? D_INCONSISTENT : D_CONSISTENT;
+		dev_info(DEV, "real peer disk state = %s\n", disks_to_name(real_peer_disk));
+	}
+
+	spin_lock_irq(&mdev->req_lock);
+ retry:
+	oconn = nconn = mdev->state.conn;
+	spin_unlock_irq(&mdev->req_lock);
+
+	if (nconn == C_WF_REPORT_PARAMS)
+		nconn = C_CONNECTED;
+
+	if (mdev->p_uuid && peer_state.disk >= D_NEGOTIATING &&
+	    inc_local_if_state(mdev, D_NEGOTIATING)) {
+		int cr; /* consider resync */
+
+		cr  = (oconn < C_CONNECTED);
+		cr |= (oconn == C_CONNECTED &&
+		       (peer_state.disk == D_NEGOTIATING ||
+			mdev->state.disk == D_NEGOTIATING));
+		cr |= test_bit(CONSIDER_RESYNC, &mdev->flags); /* peer forced */
+		cr |= (oconn == C_CONNECTED && peer_state.conn > C_CONNECTED);
+
+		if (cr)
+			nconn = drbd_sync_handshake(mdev, peer_state.role, real_peer_disk);
+
+		dec_local(mdev);
+		if (nconn == C_MASK) {
+			if (mdev->state.disk == D_NEGOTIATING) {
+				drbd_force_state(mdev, NS(disk, D_DISKLESS));
+				nconn = C_CONNECTED;
+			} else if (peer_state.disk == D_NEGOTIATING) {
+				dev_err(DEV, "Disk attach process on the peer node was aborted.\n");
+				peer_state.disk = D_DISKLESS;
+			} else {
+				D_ASSERT(oconn == C_WF_REPORT_PARAMS);
+				drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+				return FALSE;
+			}
+		}
+	}
+
+	spin_lock_irq(&mdev->req_lock);
+	if (mdev->state.conn != oconn)
+		goto retry;
+	clear_bit(CONSIDER_RESYNC, &mdev->flags);
+	ns.i = mdev->state.i;
+	ns.conn = nconn;
+	ns.peer = peer_state.role;
+	ns.pdsk = real_peer_disk;
+	ns.peer_isp = (peer_state.aftr_isp | peer_state.user_isp);
+	if ((nconn == C_CONNECTED || nconn == C_WF_BITMAP_S) && ns.disk == D_NEGOTIATING)
+		ns.disk = mdev->new_state_tmp.disk;
+
+	rv = _drbd_set_state(mdev, ns, CS_VERBOSE | CS_HARD, NULL);
+	ns = mdev->state;
+	spin_unlock_irq(&mdev->req_lock);
+
+	if (rv < SS_SUCCESS) {
+		drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+		return FALSE;
+	}
+
+	if (oconn > C_WF_REPORT_PARAMS) {
+		if (nconn > C_CONNECTED && peer_state.conn <= C_CONNECTED &&
+		    peer_state.disk != D_NEGOTIATING ) {
+			/* we want resync, peer has not yet decided to sync... */
+			/* Nowadays only used when forcing a node into primary role and
+			   setting its disk to UpToDate with that */
+			drbd_send_uuids(mdev);
+			drbd_send_state(mdev);
+		}
+	}
+
+	mdev->net_conf->want_lose = 0;
+
+	drbd_md_sync(mdev); /* update connected indicator, la_size, ... */
+
+	return TRUE;
+}
+
+STATIC int receive_sync_uuid(struct drbd_conf *mdev, struct p_header *h)
+{
+	struct p_rs_uuid *p = (struct p_rs_uuid *)h;
+
+	wait_event(mdev->misc_wait,
+		   mdev->state.conn < C_CONNECTED ||
+		   mdev->state.conn == C_WF_SYNC_UUID);
+
+	/* D_ASSERT( mdev->state.conn == C_WF_SYNC_UUID ); */
+
+	ERR_IF(h->length != (sizeof(*p)-sizeof(*h))) return FALSE;
+	if (drbd_recv(mdev, h->payload, h->length) != h->length)
+		return FALSE;
+
+	/* Here the _drbd_uuid_ functions are right, current should
+	   _not_ be rotated into the history */
+	if (inc_local_if_state(mdev, D_NEGOTIATING)) {
+		_drbd_uuid_set(mdev, UI_CURRENT, be64_to_cpu(p->uuid));
+		_drbd_uuid_set(mdev, UI_BITMAP, 0UL);
+
+		drbd_start_resync(mdev, C_SYNC_TARGET);
+
+		dec_local(mdev);
+	} else
+		dev_err(DEV, "Ignoring SyncUUID packet!\n");
+
+	return TRUE;
+}
+
+enum receive_bitmap_ret { OK, DONE, FAILED };
+
+static enum receive_bitmap_ret
+receive_bitmap_plain(struct drbd_conf *mdev, struct p_header *h,
+	unsigned long *buffer, struct bm_xfer_ctx *c)
+{
+	unsigned num_words = min_t(size_t, BM_PACKET_WORDS, c->bm_words - c->word_offset);
+	unsigned want = num_words * sizeof(long);
+
+	if (want != h->length) {
+		dev_err(DEV, "%s:want (%u) != h->length (%u)\n", __func__, want, h->length);
+		return FAILED;
+	}
+	if (want == 0)
+		return DONE;
+	if (drbd_recv(mdev, buffer, want) != want)
+		return FAILED;
+
+	drbd_bm_merge_lel(mdev, c->word_offset, num_words, buffer);
+
+	c->word_offset += num_words;
+	c->bit_offset = c->word_offset * BITS_PER_LONG;
+	if (c->bit_offset > c->bm_bits)
+		c->bit_offset = c->bm_bits;
+
+	return OK;
+}
+
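+/* Decode a run-length encoded chunk of the bitmap (RLE_VLI_Bits): the
+ * payload is a bitstream of variable-length coded run lengths of alternating
+ * "clear" and "set" runs; DCBP_get_start() says whether the first run is a
+ * "set" run.  Illustration: with start toggle 0 and decoded run lengths
+ * 5, 3 we skip 5 clear bits, then set the following 3 bits. */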
+static enum receive_bitmap_ret
+recv_bm_rle_bits(struct drbd_conf *mdev,
+		struct p_compressed_bm *p,
+		struct bm_xfer_ctx *c)
+{
+	struct bitstream bs;
+	u64 look_ahead;
+	u64 rl;
+	u64 tmp;
+	unsigned long s = c->bit_offset;
+	unsigned long e;
+	int len = p->head.length - (sizeof(*p) - sizeof(p->head));
+	int toggle = DCBP_get_start(p);
+	int have;
+	int bits;
+
+	bitstream_init(&bs, p->code, len, DCBP_get_pad_bits(p));
+
+	bits = bitstream_get_bits(&bs, &look_ahead, 64);
+	if (bits < 0)
+		return FAILED;
+
+	for (have = bits; have > 0; s += rl, toggle = !toggle) {
+		bits = vli_decode_bits(&rl, look_ahead);
+		if (bits <= 0)
+			return FAILED;
+
+		if (toggle) {
+			e = s + rl -1;
+			if (e >= c->bm_bits) {
+				dev_err(DEV, "bitmap overflow (e:%lu) while decoding bm RLE packet\n", e);
+				return FAILED;
+			}
+			_drbd_bm_set_bits(mdev, s, e);
+		}
+
+		if (have < bits) {
+			dev_err(DEV, "bitmap decoding error: h:%d b:%d la:0x%08llx l:%u/%u\n",
+				have, bits, look_ahead,
+				(unsigned int)(bs.cur.b - p->code),
+				(unsigned int)bs.buf_len);
+			return FAILED;
+		}
+		look_ahead >>= bits;
+		have -= bits;
+
+		bits = bitstream_get_bits(&bs, &tmp, 64 - have);
+		if (bits < 0)
+			return FAILED;
+		look_ahead |= tmp << have;
+		have += bits;
+	}
+
+	c->bit_offset = s;
+	bm_xfer_ctx_bit_to_word_offset(c);
+
+	return (s == c->bm_bits) ? DONE : OK;
+}
+
+static enum receive_bitmap_ret
+decode_bitmap_c(struct drbd_conf *mdev,
+		struct p_compressed_bm *p,
+		struct bm_xfer_ctx *c)
+{
+	if (DCBP_get_code(p) == RLE_VLI_Bits)
+		return recv_bm_rle_bits(mdev, p, c);
+
+	/* other variants had been implemented for evaluation,
+	 * but have been dropped as this one turned out to be "best"
+	 * during all our tests. */
+
+	dev_err(DEV, "decode_bitmap_c: unknown encoding %u\n", p->encoding);
+	drbd_force_state(mdev, NS(conn, C_PROTOCOL_ERROR));
+	return FAILED;
+}
+
+void INFO_bm_xfer_stats(struct drbd_conf *mdev,
+		const char *direction, struct bm_xfer_ctx *c)
+{
+	/* what would it take to transfer it "plaintext" */
+	unsigned plain = sizeof(struct p_header) *
+		((c->bm_words+BM_PACKET_WORDS-1)/BM_PACKET_WORDS+1)
+		+ c->bm_words * sizeof(long);
+	unsigned total = c->bytes[0] + c->bytes[1];
+	unsigned r;
+
+	/* total can not be zero. but just in case: */
+	if (total == 0)
+		return;
+
+	/* don't report if not compressed */
+	if (total >= plain)
+		return;
+
+	/* total < plain. check for overflow, still */
+	r = (total > UINT_MAX/1000) ? (total / (plain/1000))
+		                    : (1000 * total / plain);
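+	/* e.g. plain = 100000, total = 1000: r = 10 here, and after the
+	 * inversion below this gets reported as "compression: 99.0%" */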
+
+	if (r > 1000)
+		r = 1000;
+
+	r = 1000 - r;
+	dev_info(DEV, "%s bitmap stats [Bytes(packets)]: plain %u(%u), RLE %u(%u), "
+	     "total %u; compression: %u.%u%%\n",
+			direction,
+			c->bytes[1], c->packets[1],
+			c->bytes[0], c->packets[0],
+			total, r/10, r % 10);
+}
+
+/* Since we are processing the bitfield from lower addresses to higher,
+   it does not matter whether we process it in 32 bit chunks or 64 bit
+   chunks, as long as it is little endian. (Understand it as a byte stream,
+   beginning with the lowest byte...) If we used big endian
+   we would need to process it from the highest address to the lowest,
+   in order to be agnostic to the 32 vs 64 bit issue.
+
+   returns 0 on failure, 1 if we successfully received it. */
+STATIC int receive_bitmap(struct drbd_conf *mdev, struct p_header *h)
+{
+	struct bm_xfer_ctx c;
+	void *buffer;
+	enum receive_bitmap_ret ret;
+	int ok = FALSE;
+
+	wait_event(mdev->misc_wait, !atomic_read(&mdev->ap_bio_cnt));
+
+	drbd_bm_lock(mdev, "receive bitmap");
+
+	/* maybe we should use some per thread scratch page,
+	 * and allocate that during initial device creation? */
+	buffer	 = (unsigned long *) __get_free_page(GFP_NOIO);
+	if (!buffer) {
+		dev_err(DEV, "failed to allocate one page buffer in %s\n", __func__);
+		goto out;
+	}
+
+	c = (struct bm_xfer_ctx) {
+		.bm_bits = drbd_bm_bits(mdev),
+		.bm_words = drbd_bm_words(mdev),
+	};
+
+	do {
+		if (h->command == P_BITMAP) {
+			ret = receive_bitmap_plain(mdev, h, buffer, &c);
+		} else if (h->command == P_COMPRESSED_BITMAP) {
+			/* MAYBE: sanity check that we speak proto >= 90,
+			 * and the feature is enabled! */
+			struct p_compressed_bm *p;
+
+			if (h->length > BM_PACKET_PAYLOAD_BYTES) {
+				dev_err(DEV, "ReportCBitmap packet too large\n");
+				goto out;
+			}
+			/* use the page buff */
+			p = buffer;
+			memcpy(p, h, sizeof(*h));
+			if (drbd_recv(mdev, p->head.payload, h->length) != h->length)
+				goto out;
+			if (p->head.length <= (sizeof(*p) - sizeof(p->head))) {
+				dev_err(DEV, "ReportCBitmap packet too small (l:%u)\n", p->head.length);
+				goto out;
+			}
+			ret = decode_bitmap_c(mdev, p, &c);
+		} else {
+			dev_warn(DEV, "receive_bitmap: h->command neither ReportBitMap nor ReportCBitMap (is 0x%x)\n", h->command);
+			goto out;
+		}
+
+		c.packets[h->command == P_BITMAP]++;
+		c.bytes[h->command == P_BITMAP] += sizeof(struct p_header) + h->length;
+
+		if (ret != OK)
+			break;
+
+		if (!drbd_recv_header(mdev, h))
+			goto out;
+	} while (ret == OK);
+	if (ret == FAILED)
+		goto out;
+
+	INFO_bm_xfer_stats(mdev, "receive", &c);
+
+	if (mdev->state.conn == C_WF_BITMAP_T) {
+		ok = !drbd_send_bitmap(mdev);
+		if (!ok)
+			goto out;
+		/* Omit CS_ORDERED with this state transition to avoid deadlocks. */
+		ok = _drbd_request_state(mdev, NS(conn, C_WF_SYNC_UUID), CS_VERBOSE);
+		D_ASSERT(ok == SS_SUCCESS);
+	} else if (mdev->state.conn != C_WF_BITMAP_S) {
+		/* admin may have requested C_DISCONNECTING,
+		 * other threads may have noticed network errors */
+		dev_info(DEV, "unexpected cstate (%s) in receive_bitmap\n",
+		    conns_to_name(mdev->state.conn));
+	}
+
+	ok = TRUE;
+ out:
+	drbd_bm_unlock(mdev);
+	if (ok && mdev->state.conn == C_WF_BITMAP_S)
+		drbd_start_resync(mdev, C_SYNC_SOURCE);
+	free_page((unsigned long) buffer);
+	return ok;
+}
+
+STATIC int receive_skip(struct drbd_conf *mdev, struct p_header *h)
+{
+	/* TODO zero copy sink :) */
+	static char sink[128];
+	int size, want, r;
+
+	dev_warn(DEV, "skipping unknown optional packet type %d, l: %d!\n",
+	     h->command, h->length);
+
+	size = h->length;
+	while (size > 0) {
+		want = min_t(int, size, sizeof(sink));
+		r = drbd_recv(mdev, sink, want);
+		ERR_IF(r <= 0) break;
+		size -= r;
+	}
+	return size == 0;
+}
+
+STATIC int receive_UnplugRemote(struct drbd_conf *mdev, struct p_header *h)
+{
+	if (mdev->state.disk >= D_INCONSISTENT)
+		drbd_kick_lo(mdev);
+
+	/* Make sure we've acked all the TCP data associated
+	 * with the data requests being unplugged */
+	drbd_tcp_quickack(mdev->data.socket);
+
+	return TRUE;
+}
+
+typedef int (*drbd_cmd_handler_f)(struct drbd_conf *, struct p_header *);
+
+static drbd_cmd_handler_f drbd_default_handler[] = {
+	[P_DATA]	    = receive_Data,
+	[P_DATA_REPLY]	    = receive_DataReply,
+	[P_RS_DATA_REPLY]   = receive_RSDataReply,
+	[P_BARRIER]	    = receive_Barrier,
+	[P_BITMAP]	    = receive_bitmap,
+	[P_COMPRESSED_BITMAP]    = receive_bitmap,
+	[P_UNPLUG_REMOTE]   = receive_UnplugRemote,
+	[P_DATA_REQUEST]    = receive_DataRequest,
+	[P_RS_DATA_REQUEST] = receive_DataRequest,
+	[P_SYNC_PARAM]	    = receive_SyncParam,
+	[P_SYNC_PARAM89]	   = receive_SyncParam,
+	[P_PROTOCOL]        = receive_protocol,
+	[P_UUIDS]	    = receive_uuids,
+	[P_SIZES]	    = receive_sizes,
+	[P_STATE]	    = receive_state,
+	[P_STATE_CHG_REQ]   = receive_req_state,
+	[P_SYNC_UUID]       = receive_sync_uuid,
+	[P_OV_REQUEST]      = receive_DataRequest,
+	[P_OV_REPLY]        = receive_DataRequest,
+	[P_CSUM_RS_REQUEST]    = receive_DataRequest,
+	/* anything missing from this table is in
+	 * the asender_tbl, see get_asender_cmd */
+	[P_MAX_CMD]	    = NULL,
+};
+
+static drbd_cmd_handler_f *drbd_cmd_handler = drbd_default_handler;
+static drbd_cmd_handler_f *drbd_opt_cmd_handler;
+
+STATIC void drbdd(struct drbd_conf *mdev)
+{
+	drbd_cmd_handler_f handler;
+	struct p_header *header = &mdev->data.rbuf.header;
+
+	while (get_t_state(&mdev->receiver) == Running) {
+		drbd_thread_current_set_cpu(mdev);
+		if (!drbd_recv_header(mdev, header))
+			break;
+
+		if (header->command < P_MAX_CMD)
+			handler = drbd_cmd_handler[header->command];
+		else if (P_MAY_IGNORE < header->command
+		     && header->command < P_MAX_OPT_CMD)
+			handler = drbd_opt_cmd_handler[header->command-P_MAY_IGNORE];
+		else if (header->command > P_MAX_OPT_CMD)
+			handler = receive_skip;
+		else
+			handler = NULL;
+
+		if (unlikely(!handler)) {
+			dev_err(DEV, "unknown packet type %d, l: %d!\n",
+			    header->command, header->length);
+			drbd_force_state(mdev, NS(conn, C_PROTOCOL_ERROR));
+			break;
+		}
+		if (unlikely(!handler(mdev, header))) {
+			dev_err(DEV, "error receiving %s, l: %d!\n",
+			    cmdname(header->command), header->length);
+			drbd_force_state(mdev, NS(conn, C_PROTOCOL_ERROR));
+			break;
+		}
+
+		trace_drbd_packet(mdev, mdev->data.socket, 2, &mdev->data.rbuf,
+				__FILE__, __LINE__);
+	}
+}
+
+STATIC void drbd_fail_pending_reads(struct drbd_conf *mdev)
+{
+	struct hlist_head *slot;
+	struct hlist_node *pos;
+	struct hlist_node *tmp;
+	struct drbd_request *req;
+	int i;
+
+	/*
+	 * Application READ requests
+	 */
+	spin_lock_irq(&mdev->req_lock);
+	for (i = 0; i < APP_R_HSIZE; i++) {
+		slot = mdev->app_reads_hash+i;
+		hlist_for_each_entry_safe(req, pos, tmp, slot, colision) {
+			/* it may (but should not any longer!)
+			 * be on the work queue; if that assert triggers,
+			 * we need to also grab the
+			 * spin_lock_irq(&mdev->data.work.q_lock);
+			 * and list_del_init here. */
+			D_ASSERT(list_empty(&req->w.list));
+			_req_mod(req, connection_lost_while_pending, 0);
+		}
+	}
+	for (i = 0; i < APP_R_HSIZE; i++)
+		if (!hlist_empty(mdev->app_reads_hash+i))
+			dev_warn(DEV, "ASSERT FAILED: app_reads_hash[%d].first: "
+				"%p, should be NULL\n", i, mdev->app_reads_hash[i].first);
+
+	memset(mdev->app_reads_hash, 0, APP_R_HSIZE*sizeof(void *));
+	spin_unlock_irq(&mdev->req_lock);
+}
+
+STATIC void drbd_disconnect(struct drbd_conf *mdev)
+{
+	struct drbd_work prev_work_done;
+	enum drbd_fencing_p fp;
+	union drbd_state os, ns;
+	int rv = SS_UNKNOWN_ERROR;
+	unsigned int i;
+
+	if (mdev->state.conn == C_STANDALONE)
+		return;
+	if (mdev->state.conn >= C_WF_CONNECTION)
+		dev_err(DEV, "ASSERT FAILED cstate = %s, expected < WFConnection\n",
+				conns_to_name(mdev->state.conn));
+
+	/* asender does not clean up anything. it must not interfere, either */
+	drbd_thread_stop(&mdev->asender);
+
+	mutex_lock(&mdev->data.mutex);
+	drbd_free_sock(mdev);
+	mutex_unlock(&mdev->data.mutex);
+
+	spin_lock_irq(&mdev->req_lock);
+	_drbd_wait_ee_list_empty(mdev, &mdev->active_ee);
+	_drbd_wait_ee_list_empty(mdev, &mdev->sync_ee);
+	_drbd_clear_done_ee(mdev);
+	_drbd_wait_ee_list_empty(mdev, &mdev->read_ee);
+	reclaim_net_ee(mdev);
+	spin_unlock_irq(&mdev->req_lock);
+
+	/* We do not have data structures that would allow us to
+	 * get the rs_pending_cnt down to 0 again.
+	 *  * On C_SYNC_TARGET we do not have any data structures describing
+	 *    the pending RSDataRequest's we have sent.
+	 *  * On C_SYNC_SOURCE there is no data structure that tracks
+	 *    the P_RS_DATA_REPLY blocks that we sent to the SyncTarget.
+	 *  And no, it is not the sum of the reference counts in the
+	 *  resync_LRU. The resync_LRU tracks the whole operation including
+	 *  the disk-IO, while the rs_pending_cnt only tracks the blocks
+	 *  on the fly. */
+	drbd_rs_cancel_all(mdev);
+	mdev->rs_total = 0;
+	mdev->rs_failed = 0;
+	atomic_set(&mdev->rs_pending_cnt, 0);
+	wake_up(&mdev->misc_wait);
+
+	/* make sure syncer is stopped and w_resume_next_sg queued */
+	del_timer_sync(&mdev->resync_timer);
+	set_bit(STOP_SYNC_TIMER, &mdev->flags);
+	resync_timer_fn((unsigned long)mdev);
+
+	/* wait for all w_e_end_data_req, w_e_end_rsdata_req, w_send_barrier,
+	 * w_make_resync_request etc. which may still be on the worker queue
+	 * to be "canceled" */
+	set_bit(WORK_PENDING, &mdev->flags);
+	prev_work_done.cb = w_prev_work_done;
+	drbd_queue_work(&mdev->data.work, &prev_work_done);
+	wait_event(mdev->misc_wait, !test_bit(WORK_PENDING, &mdev->flags));
+
+	kfree(mdev->p_uuid);
+	mdev->p_uuid = NULL;
+
+	if (!mdev->state.susp)
+		tl_clear(mdev);
+
+	drbd_fail_pending_reads(mdev);
+
+	dev_info(DEV, "Connection closed\n");
+
+	drbd_md_sync(mdev);
+
+	fp = FP_DONT_CARE;
+	if (inc_local(mdev)) {
+		fp = mdev->bc->dc.fencing;
+		dec_local(mdev);
+	}
+
+	if (mdev->state.role == R_PRIMARY) {
+		if (fp >= FP_RESOURCE && mdev->state.pdsk >= D_UNKNOWN) {
+			enum drbd_disk_state nps = drbd_try_outdate_peer(mdev);
+			drbd_request_state(mdev, NS(pdsk, nps));
+		}
+	}
+
+	spin_lock_irq(&mdev->req_lock);
+	os = mdev->state;
+	if (os.conn >= C_UNCONNECTED) {
+		/* Do not restart in case we are C_DISCONNECTING */
+		ns = os;
+		ns.conn = C_UNCONNECTED;
+		rv = _drbd_set_state(mdev, ns, CS_VERBOSE, NULL);
+	}
+	spin_unlock_irq(&mdev->req_lock);
+
+	if (os.conn == C_DISCONNECTING) {
+		struct hlist_head *h;
+		wait_event(mdev->misc_wait, atomic_read(&mdev->net_cnt) == 0);
+
+		/* we must not free the tl_hash
+		 * while application io is still on the fly */
+		wait_event(mdev->misc_wait, atomic_read(&mdev->ap_bio_cnt) == 0);
+
+		spin_lock_irq(&mdev->req_lock);
+		/* paranoia code */
+		for (h = mdev->ee_hash; h < mdev->ee_hash + mdev->ee_hash_s; h++)
+			if (h->first)
+				dev_err(DEV, "ASSERT FAILED ee_hash[%u].first == %p, expected NULL\n",
+						(int)(h - mdev->ee_hash), h->first);
+		kfree(mdev->ee_hash);
+		mdev->ee_hash = NULL;
+		mdev->ee_hash_s = 0;
+
+		/* paranoia code */
+		for (h = mdev->tl_hash; h < mdev->tl_hash + mdev->tl_hash_s; h++)
+			if (h->first)
+				dev_err(DEV, "ASSERT FAILED tl_hash[%u] == %p, expected NULL\n",
+						(int)(h - mdev->tl_hash), h->first);
+		kfree(mdev->tl_hash);
+		mdev->tl_hash = NULL;
+		mdev->tl_hash_s = 0;
+		spin_unlock_irq(&mdev->req_lock);
+
+		crypto_free_hash(mdev->cram_hmac_tfm);
+		mdev->cram_hmac_tfm = NULL;
+
+		kfree(mdev->net_conf);
+		mdev->net_conf = NULL;
+		drbd_request_state(mdev, NS(conn, C_STANDALONE));
+	}
+
+	/* they do trigger all the time.
+	 * hm. why won't tcp release the page references,
+	 * we already released the socket!? */
+	i = atomic_read(&mdev->pp_in_use);
+	if (i)
+		dev_info(DEV, "pp_in_use = %u, expected 0\n", i);
+	if (!list_empty(&mdev->net_ee))
+		dev_info(DEV, "net_ee not empty!\n");
+
+	D_ASSERT(list_empty(&mdev->read_ee));
+	D_ASSERT(list_empty(&mdev->active_ee));
+	D_ASSERT(list_empty(&mdev->sync_ee));
+	D_ASSERT(list_empty(&mdev->done_ee));
+
+	/* ok, no more ee's on the fly, it is safe to reset the epoch_size */
+	atomic_set(&mdev->current_epoch->epoch_size, 0);
+	D_ASSERT(list_empty(&mdev->current_epoch->list));
+}
+
+/*
+ * We support PRO_VERSION_MIN to PRO_VERSION_MAX. The protocol version
+ * we can agree on is stored in agreed_pro_version.
+ *
+ * feature flags and the reserved array should be enough room for future
+ * enhancements of the handshake protocol, and possible plugins...
+ *
+ * for now, they are expected to be zero, but ignored.
+ */
+STATIC int drbd_send_handshake(struct drbd_conf *mdev)
+{
+	/* ASSERT current == mdev->receiver ... */
+	struct p_handshake *p = &mdev->data.sbuf.handshake;
+	int ok;
+
+	if (mutex_lock_interruptible(&mdev->data.mutex)) {
+		dev_err(DEV, "interrupted during initial handshake\n");
+		return 0; /* interrupted. not ok. */
+	}
+
+	if (mdev->data.socket == NULL) {
+		mutex_unlock(&mdev->data.mutex);
+		return 0;
+	}
+
+	memset(p, 0, sizeof(*p));
+	p->protocol_min = cpu_to_be32(PRO_VERSION_MIN);
+	p->protocol_max = cpu_to_be32(PRO_VERSION_MAX);
+	ok = _drbd_send_cmd( mdev, mdev->data.socket, P_HAND_SHAKE,
+			     (struct p_header *)p, sizeof(*p), 0 );
+	mutex_unlock(&mdev->data.mutex);
+	return ok;
+}
+
+/*
+ * return values:
+ *   1 yes, we have a valid connection
+ *   0 oops, did not work out, please try again
+ *  -1 peer speaks a different language,
+ *     no point in trying again, please go standalone.
+ */
+int drbd_do_handshake(struct drbd_conf *mdev)
+{
+	/* ASSERT current == mdev->receiver ... */
+	struct p_handshake *p = &mdev->data.rbuf.handshake;
+	const int expect = sizeof(struct p_handshake)
+			  -sizeof(struct p_header);
+	int rv;
+
+	rv = drbd_send_handshake(mdev);
+	if (!rv)
+		return 0;
+
+	rv = drbd_recv_header(mdev, &p->head);
+	if (!rv)
+		return 0;
+
+	if (p->head.command != P_HAND_SHAKE) {
+		dev_err(DEV, "expected HandShake packet, received: %s (0x%04x)\n",
+		     cmdname(p->head.command), p->head.command);
+		return -1;
+	}
+
+	if (p->head.length != expect) {
+		dev_err(DEV, "expected HandShake length: %u, received: %u\n",
+		     expect, p->head.length);
+		return -1;
+	}
+
+	rv = drbd_recv(mdev, &p->head.payload, expect);
+
+	if (rv != expect) {
+		dev_err(DEV, "short read receiving handshake packet: l=%u\n", rv);
+		return 0;
+	}
+
+	trace_drbd_packet(mdev, mdev->data.socket, 2, &mdev->data.rbuf,
+			__FILE__, __LINE__);
+
+	p->protocol_min = be32_to_cpu(p->protocol_min);
+	p->protocol_max = be32_to_cpu(p->protocol_max);
+	if (p->protocol_max == 0)
+		p->protocol_max = p->protocol_min;
+
+	if (PRO_VERSION_MAX < p->protocol_min ||
+	    PRO_VERSION_MIN > p->protocol_max)
+		goto incompat;
+
+	mdev->agreed_pro_version = min_t(int, PRO_VERSION_MAX, p->protocol_max);
+
+	dev_info(DEV, "Handshake successful: "
+	     "Agreed network protocol version %d\n", mdev->agreed_pro_version);
+
+	return 1;
+
+ incompat:
+	dev_err(DEV, "incompatible DRBD dialects: "
+	    "I support %d-%d, peer supports %d-%d\n",
+	    PRO_VERSION_MIN, PRO_VERSION_MAX,
+	    p->protocol_min, p->protocol_max);
+	return -1;
+}
+
+#if !defined(CONFIG_CRYPTO_HMAC) && !defined(CONFIG_CRYPTO_HMAC_MODULE)
+int drbd_do_auth(struct drbd_conf *mdev)
+{
+	dev_err(DEV, "This kernel was build without CONFIG_CRYPTO_HMAC.\n");
+	dev_err(DEV, "You need to disable 'cram-hmac-alg' in drbd.conf.\n");
+	return 0;
+}
+#else
+#define CHALLENGE_LEN 64
+int drbd_do_auth(struct drbd_conf *mdev)
+{
+	char my_challenge[CHALLENGE_LEN];  /* 64 Bytes... */
+	struct scatterlist sg;
+	char *response = NULL;
+	char *right_response = NULL;
+	char *peers_ch = NULL;
+	struct p_header p;
+	unsigned int key_len = strlen(mdev->net_conf->shared_secret);
+	unsigned int resp_size;
+	struct hash_desc desc;
+	int rv;
+
+	desc.tfm = mdev->cram_hmac_tfm;
+	desc.flags = 0;
+
+	rv = crypto_hash_setkey(mdev->cram_hmac_tfm,
+				(u8 *)mdev->net_conf->shared_secret, key_len);
+	if (rv) {
+		dev_err(DEV, "crypto_hash_setkey() failed with %d\n", rv);
+		rv = 0;
+		goto fail;
+	}
+
+	get_random_bytes(my_challenge, CHALLENGE_LEN);
+
+	rv = drbd_send_cmd2(mdev, P_AUTH_CHALLENGE, my_challenge, CHALLENGE_LEN);
+	if (!rv)
+		goto fail;
+
+	rv = drbd_recv_header(mdev, &p);
+	if (!rv)
+		goto fail;
+
+	if (p.command != P_AUTH_CHALLENGE) {
+		dev_err(DEV, "expected AuthChallenge packet, received: %s (0x%04x)\n",
+		    cmdname(p.command), p.command);
+		rv = 0;
+		goto fail;
+	}
+
+	if (p.length > CHALLENGE_LEN*2) {
+		dev_err(DEV, "expected AuthChallenge payload too big.\n");
+		rv = 0;
+		goto fail;
+	}
+
+	peers_ch = kmalloc(p.length, GFP_KERNEL);
+	if (peers_ch == NULL) {
+		dev_err(DEV, "kmalloc of peers_ch failed\n");
+		rv = 0;
+		goto fail;
+	}
+
+	rv = drbd_recv(mdev, peers_ch, p.length);
+
+	if (rv != p.length) {
+		dev_err(DEV, "short read AuthChallenge: l=%u\n", rv);
+		rv = 0;
+		goto fail;
+	}
+
+	resp_size = crypto_hash_digestsize(mdev->cram_hmac_tfm);
+	response = kmalloc(resp_size, GFP_KERNEL);
+	if (response == NULL) {
+		dev_err(DEV, "kmalloc of response failed\n");
+		rv = 0;
+		goto fail;
+	}
+
+	sg_init_table(&sg, 1);
+	sg_set_buf(&sg, peers_ch, p.length);
+
+	rv = crypto_hash_digest(&desc, &sg, sg.length, response);
+	if (rv) {
+		dev_err(DEV, "crypto_hash_digest() failed with %d\n", rv);
+		rv = 0;
+		goto fail;
+	}
+
+	rv = drbd_send_cmd2(mdev, P_AUTH_RESPONSE, response, resp_size);
+	if (!rv)
+		goto fail;
+
+	rv = drbd_recv_header(mdev, &p);
+	if (!rv)
+		goto fail;
+
+	if (p.command != P_AUTH_RESPONSE) {
+		dev_err(DEV, "expected AuthResponse packet, received: %s (0x%04x)\n",
+		    cmdname(p.command), p.command);
+		rv = 0;
+		goto fail;
+	}
+
+	if (p.length != resp_size) {
+		dev_err(DEV, "expected AuthResponse payload of wrong size\n");
+		rv = 0;
+		goto fail;
+	}
+
+	rv = drbd_recv(mdev, response, resp_size);
+
+	if (rv != resp_size) {
+		dev_err(DEV, "short read receiving AuthResponse: l=%u\n", rv);
+		rv = 0;
+		goto fail;
+	}
+
+	right_response = kmalloc(resp_size, GFP_KERNEL);
+	if (right_response == NULL) {
+		dev_err(DEV, "kmalloc of right_response failed\n");
+		rv = 0;
+		goto fail;
+	}
+
+	sg_set_buf(&sg, my_challenge, CHALLENGE_LEN);
+
+	rv = crypto_hash_digest(&desc, &sg, sg.length, right_response);
+	if (rv) {
+		dev_err(DEV, "crypto_hash_digest() failed with %d\n", rv);
+		rv = 0;
+		goto fail;
+	}
+
+	rv = !memcmp(response, right_response, resp_size);
+
+	if (rv)
+		dev_info(DEV, "Peer authenticated using %d bytes of '%s' HMAC\n",
+		     resp_size, mdev->net_conf->cram_hmac_alg);
+
+ fail:
+	kfree(peers_ch);
+	kfree(response);
+	kfree(right_response);
+
+	return rv;
+}
+#endif
+
+STATIC int drbdd_init(struct drbd_thread *thi)
+{
+	struct drbd_conf *mdev = thi->mdev;
+	unsigned int minor = mdev_to_minor(mdev);
+	int h;
+
+	sprintf(current->comm, "drbd%d_receiver", minor);
+
+	dev_info(DEV, "receiver (re)started\n");
+
+	do {
+		h = drbd_connect(mdev);
+		if (h == 0) {
+			drbd_disconnect(mdev);
+			__set_current_state(TASK_INTERRUPTIBLE);
+			schedule_timeout(HZ);
+		}
+		if (h == -1) {
+			dev_warn(DEV, "Discarding network configuration.\n");
+			drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+		}
+	} while (h == 0);
+
+	if (h > 0) {
+		if (inc_net(mdev)) {
+			drbdd(mdev);
+			dec_net(mdev);
+		}
+	}
+
+	drbd_disconnect(mdev);
+
+	dev_info(DEV, "receiver terminated\n");
+	return 0;
+}
+
+/* ********* acknowledge sender ******** */
+
+STATIC int got_RqSReply(struct drbd_conf *mdev, struct p_header *h)
+{
+	struct p_req_state_reply *p = (struct p_req_state_reply *)h;
+
+	int retcode = be32_to_cpu(p->retcode);
+
+	if (retcode >= SS_SUCCESS) {
+		set_bit(CL_ST_CHG_SUCCESS, &mdev->flags);
+	} else {
+		set_bit(CL_ST_CHG_FAIL, &mdev->flags);
+		dev_err(DEV, "Requested state change failed by peer: %s (%d)\n",
+		    set_st_err_name(retcode), retcode);
+	}
+	wake_up(&mdev->state_wait);
+
+	return TRUE;
+}
+
+STATIC int got_Ping(struct drbd_conf *mdev, struct p_header *h)
+{
+	return drbd_send_ping_ack(mdev);
+
+}
+
+STATIC int got_PingAck(struct drbd_conf *mdev, struct p_header *h)
+{
+	/* restore idle timeout */
+	mdev->meta.socket->sk->sk_rcvtimeo = mdev->net_conf->ping_int*HZ;
+
+	return TRUE;
+}
+
+STATIC int got_IsInSync(struct drbd_conf *mdev, struct p_header *h)
+{
+	struct p_block_ack *p = (struct p_block_ack *)h;
+	sector_t sector = be64_to_cpu(p->sector);
+	int blksize = be32_to_cpu(p->blksize);
+
+	D_ASSERT(mdev->agreed_pro_version >= 89);
+
+	update_peer_seq(mdev, be32_to_cpu(p->seq_num));
+
+	drbd_rs_complete_io(mdev, sector);
+	drbd_set_in_sync(mdev, sector, blksize);
+	/* rs_same_csum is supposed to count in units of BM_BLOCK_SIZE */
+	mdev->rs_same_csum += (blksize >> BM_BLOCK_SIZE_B);
+	dec_rs_pending(mdev);
+
+	return TRUE;
+}
+
+STATIC int got_BlockAck(struct drbd_conf *mdev, struct p_header *h)
+{
+	struct drbd_request *req;
+	struct p_block_ack *p = (struct p_block_ack *)h;
+	sector_t sector = be64_to_cpu(p->sector);
+	int blksize = be32_to_cpu(p->blksize);
+
+	update_peer_seq(mdev, be32_to_cpu(p->seq_num));
+
+	if (is_syncer_block_id(p->block_id)) {
+		drbd_set_in_sync(mdev, sector, blksize);
+		dec_rs_pending(mdev);
+	} else {
+		spin_lock_irq(&mdev->req_lock);
+		req = _ack_id_to_req(mdev, p->block_id, sector);
+
+		if (unlikely(!req)) {
+			spin_unlock_irq(&mdev->req_lock);
+			dev_err(DEV, "Got a corrupt block_id/sector pair(2).\n");
+			return FALSE;
+		}
+
+		switch (be16_to_cpu(h->command)) {
+		case P_RS_WRITE_ACK:
+			D_ASSERT(mdev->net_conf->wire_protocol == DRBD_PROT_C);
+			_req_mod(req, write_acked_by_peer_and_sis, 0);
+			break;
+		case P_WRITE_ACK:
+			D_ASSERT(mdev->net_conf->wire_protocol == DRBD_PROT_C);
+			_req_mod(req, write_acked_by_peer, 0);
+			break;
+		case P_RECV_ACK:
+			D_ASSERT(mdev->net_conf->wire_protocol == DRBD_PROT_B);
+			_req_mod(req, recv_acked_by_peer, 0);
+			break;
+		case P_DISCARD_ACK:
+			D_ASSERT(mdev->net_conf->wire_protocol == DRBD_PROT_C);
+			dev_alert(DEV, "Got DiscardAck packet %llus +%u!"
+			      " DRBD is not a random data generator!\n",
+			      (unsigned long long)req->sector, req->size);
+			_req_mod(req, conflict_discarded_by_peer, 0);
+			break;
+		default:
+			D_ASSERT(0);
+		}
+		spin_unlock_irq(&mdev->req_lock);
+	}
+	/* dec_ap_pending is handled within _req_mod */
+
+	return TRUE;
+}
+
+STATIC int got_NegAck(struct drbd_conf *mdev, struct p_header *h)
+{
+	struct p_block_ack *p = (struct p_block_ack *)h;
+	sector_t sector = be64_to_cpu(p->sector);
+	struct drbd_request *req;
+
+	if (__ratelimit(&drbd_ratelimit_state))
+		dev_warn(DEV, "Got NegAck packet. Peer is in troubles?\n");
+
+	update_peer_seq(mdev, be32_to_cpu(p->seq_num));
+
+	if (is_syncer_block_id(p->block_id)) {
+		int size = be32_to_cpu(p->blksize);
+
+		dec_rs_pending(mdev);
+
+		drbd_rs_failed_io(mdev, sector, size);
+	} else {
+		spin_lock_irq(&mdev->req_lock);
+		req = _ack_id_to_req(mdev, p->block_id, sector);
+
+		if (unlikely(!req)) {
+			spin_unlock_irq(&mdev->req_lock);
+			dev_err(DEV, "Got a corrupt block_id/sector pair(2).\n");
+			return FALSE;
+		}
+
+		_req_mod(req, neg_acked, 0);
+		spin_unlock_irq(&mdev->req_lock);
+	}
+
+	return TRUE;
+}
+
+STATIC int got_NegDReply(struct drbd_conf *mdev, struct p_header *h)
+{
+	struct drbd_request *req;
+	struct p_block_ack *p = (struct p_block_ack *)h;
+	sector_t sector = be64_to_cpu(p->sector);
+
+	spin_lock_irq(&mdev->req_lock);
+	req = _ar_id_to_req(mdev, p->block_id, sector);
+	if (unlikely(!req)) {
+		spin_unlock_irq(&mdev->req_lock);
+		dev_err(DEV, "Got a corrupt block_id/sector pair(3).\n");
+		return FALSE;
+	}
+
+	_req_mod(req, neg_acked, 0);
+	spin_unlock_irq(&mdev->req_lock);
+
+	update_peer_seq(mdev, be32_to_cpu(p->seq_num));
+
+	dev_err(DEV, "Got NegDReply; Sector %llus, len %u; Fail original request.\n",
+	    (unsigned long long)sector, be32_to_cpu(p->blksize));
+
+	return TRUE;
+}
+
+STATIC int got_NegRSDReply(struct drbd_conf *mdev, struct p_header *h)
+{
+	sector_t sector;
+	int size;
+	struct p_block_ack *p = (struct p_block_ack *)h;
+
+	sector = be64_to_cpu(p->sector);
+	size = be32_to_cpu(p->blksize);
+	D_ASSERT(p->block_id == ID_SYNCER);
+
+	update_peer_seq(mdev, be32_to_cpu(p->seq_num));
+
+	dec_rs_pending(mdev);
+
+	if (inc_local_if_state(mdev, D_FAILED)) {
+		drbd_rs_complete_io(mdev, sector);
+		drbd_rs_failed_io(mdev, sector, size);
+		dec_local(mdev);
+	}
+
+	return TRUE;
+}
+
+STATIC int got_BarrierAck(struct drbd_conf *mdev, struct p_header *h)
+{
+	struct p_barrier_ack *p = (struct p_barrier_ack *)h;
+
+	tl_release(mdev, p->barrier, be32_to_cpu(p->set_size));
+
+	return TRUE;
+}
+
+STATIC int got_OVResult(struct drbd_conf *mdev, struct p_header *h)
+{
+	struct p_block_ack *p = (struct p_block_ack *)h;
+	struct drbd_work *w;
+	sector_t sector;
+	int size;
+
+	sector = be64_to_cpu(p->sector);
+	size = be32_to_cpu(p->blksize);
+
+	update_peer_seq(mdev, be32_to_cpu(p->seq_num));
+
+	if (be64_to_cpu(p->block_id) == ID_OUT_OF_SYNC)
+		drbd_ov_oos_found(mdev, sector, size);
+	else
+		ov_oos_print(mdev);
+
+	drbd_rs_complete_io(mdev, sector);
+	dec_rs_pending(mdev);
+
+	if (--mdev->ov_left == 0) {
+		w = kmalloc(sizeof(*w), GFP_KERNEL);
+		if (w) {
+			w->cb = w_ov_finished;
+			drbd_queue_work_front(&mdev->data.work, w);
+		} else {
+			dev_err(DEV, "kmalloc(w) failed.");
+			drbd_resync_finished(mdev);
+		}
+	}
+	return TRUE;
+}
+
+struct asender_cmd {
+	size_t pkt_size;
+	int (*process)(struct drbd_conf *mdev, struct p_header *h);
+};
+
+static struct asender_cmd *get_asender_cmd(int cmd)
+{
+	static struct asender_cmd asender_tbl[] = {
+		/* anything missing from this table is in
+		 * the drbd_cmd_handler (drbd_default_handler) table,
+		 * see the beginning of drbdd() */
+	[P_PING]	    = { sizeof(struct p_header), got_Ping },
+	[P_PING_ACK]	    = { sizeof(struct p_header), got_PingAck },
+	[P_RECV_ACK]	    = { sizeof(struct p_block_ack), got_BlockAck },
+	[P_WRITE_ACK]	    = { sizeof(struct p_block_ack), got_BlockAck },
+	[P_RS_WRITE_ACK]    = { sizeof(struct p_block_ack), got_BlockAck },
+	[P_DISCARD_ACK]	    = { sizeof(struct p_block_ack), got_BlockAck },
+	[P_NEG_ACK]	    = { sizeof(struct p_block_ack), got_NegAck },
+	[P_NEG_DREPLY]	    = { sizeof(struct p_block_ack), got_NegDReply },
+	[P_NEG_RS_DREPLY]   = { sizeof(struct p_block_ack), got_NegRSDReply},
+	[P_BARRIER_ACK]	    = { sizeof(struct p_barrier_ack), got_BarrierAck },
+	[P_STATE_CHG_REPLY] = { sizeof(struct p_req_state_reply), got_RqSReply },
+	[P_RS_IS_IN_SYNC]   = { sizeof(struct p_block_ack), got_IsInSync },
+	};
+	if (cmd > P_MAX_CMD)
+		return NULL;
+	return &asender_tbl[cmd];
+}
+
+STATIC int drbd_asender(struct drbd_thread *thi)
+{
+	struct drbd_conf *mdev = thi->mdev;
+	struct p_header *h = &mdev->meta.rbuf.header;
+	struct asender_cmd *cmd = NULL;
+
+	int rv, len;
+	void *buf    = h;
+	int received = 0;
+	int expect   = sizeof(struct p_header);
+	int empty;
+
+	sprintf(current->comm, "drbd%d_asender", mdev_to_minor(mdev));
+
+	current->policy = SCHED_RR;  /* Make this a realtime task! */
+	current->rt_priority = 2;    /* more important than all other tasks */
+
+	while (get_t_state(thi) == Running) {
+		drbd_thread_current_set_cpu(mdev);
+		if (test_and_clear_bit(SEND_PING, &mdev->flags)) {
+			ERR_IF(!drbd_send_ping(mdev)) goto reconnect;
+			mdev->meta.socket->sk->sk_rcvtimeo =
+				mdev->net_conf->ping_timeo*HZ/10;
+		}
+
+		/* conditionally cork;
+		 * it may hurt latency if we cork without much to send */
+		if (!mdev->net_conf->no_cork &&
+			3 < atomic_read(&mdev->unacked_cnt))
+			drbd_tcp_cork(mdev->meta.socket);
+		while (1) {
+			clear_bit(SIGNAL_ASENDER, &mdev->flags);
+			flush_signals(current);
+			if (!drbd_process_done_ee(mdev)) {
+				dev_err(DEV, "process_done_ee() = NOT_OK\n");
+				goto reconnect;
+			}
+			/* to avoid race with newly queued ACKs */
+			set_bit(SIGNAL_ASENDER, &mdev->flags);
+			spin_lock_irq(&mdev->req_lock);
+			empty = list_empty(&mdev->done_ee);
+			spin_unlock_irq(&mdev->req_lock);
+			/* new ack may have been queued right here,
+			 * but then there is also a signal pending,
+			 * and we start over... */
+			if (empty)
+				break;
+		}
+		/* but unconditionally uncork unless disabled */
+		if (!mdev->net_conf->no_cork)
+			drbd_tcp_uncork(mdev->meta.socket);
+
+		/* short circuit, recv_msg would return EINTR anyways. */
+		if (signal_pending(current))
+			continue;
+
+		rv = drbd_recv_short(mdev, mdev->meta.socket,
+				     buf, expect-received, 0);
+		clear_bit(SIGNAL_ASENDER, &mdev->flags);
+
+		flush_signals(current);
+
+		/* Note:
+		 * -EINTR	 (on meta) we got a signal
+		 * -EAGAIN	 (on meta) rcvtimeo expired
+		 * -ECONNRESET	 other side closed the connection
+		 * -ERESTARTSYS  (on data) we got a signal
+		 * rv <  0	 other than above: unexpected error!
+		 * rv == expected: full header or command
+		 * rv <  expected: "woken" by signal during receive
+		 * rv == 0	 : "connection shut down by peer"
+		 */
+		if (likely(rv > 0)) {
+			received += rv;
+			buf	 += rv;
+		} else if (rv == 0) {
+			dev_err(DEV, "meta connection shut down by peer.\n");
+			goto reconnect;
+		} else if (rv == -EAGAIN) {
+			if (mdev->meta.socket->sk->sk_rcvtimeo ==
+			    mdev->net_conf->ping_timeo*HZ/10) {
+				dev_err(DEV, "PingAck did not arrive in time.\n");
+				goto reconnect;
+			}
+			set_bit(SEND_PING, &mdev->flags);
+			continue;
+		} else if (rv == -EINTR) {
+			continue;
+		} else {
+			dev_err(DEV, "sock_recvmsg returned %d\n", rv);
+			goto reconnect;
+		}
+
+		if (received == expect && cmd == NULL) {
+			if (unlikely(h->magic != BE_DRBD_MAGIC)) {
+				dev_err(DEV, "magic?? on meta m: 0x%lx c: %d l: %d\n",
+				    (long)be32_to_cpu(h->magic),
+				    h->command, h->length);
+				goto reconnect;
+			}
+			cmd = get_asender_cmd(be16_to_cpu(h->command));
+			len = be16_to_cpu(h->length);
+			if (unlikely(cmd == NULL)) {
+				dev_err(DEV, "unknown command?? on meta m: 0x%lx c: %d l: %d\n",
+				    (long)be32_to_cpu(h->magic),
+				    h->command, h->length);
+				goto disconnect;
+			}
+			expect = cmd->pkt_size;
+			ERR_IF(len != expect-sizeof(struct p_header)) {
+				trace_drbd_packet(mdev, mdev->meta.socket, 1, (void *)h, __FILE__, __LINE__);
+				DUMPI(expect);
+				goto reconnect;
+			}
+		}
+		if (received == expect) {
+			D_ASSERT(cmd != NULL);
+			trace_drbd_packet(mdev, mdev->meta.socket, 1, (void *)h, __FILE__, __LINE__);
+			if (!cmd->process(mdev, h))
+				goto reconnect;
+
+			buf	 = h;
+			received = 0;
+			expect	 = sizeof(struct p_header);
+			cmd	 = NULL;
+		}
+	}
+
+	if (0) {
+reconnect:
+		drbd_force_state(mdev, NS(conn, C_NETWORK_FAILURE));
+	}
+	if (0) {
+disconnect:
+		drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+	}
+	clear_bit(SIGNAL_ASENDER, &mdev->flags);
+
+	D_ASSERT(mdev->state.conn < C_CONNECTED);
+	dev_info(DEV, "asender terminated\n");
+
+	return 0;
+}

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 10/16] DRBD: proc
  2009-04-30 11:26                 ` [PATCH 09/16] DRBD: receiver Philipp Reisner
@ 2009-04-30 11:26                   ` Philipp Reisner
  2009-04-30 11:26                     ` [PATCH 11/16] DRBD: worker Philipp Reisner
  2009-05-02 15:44                     ` [PATCH 10/16] DRBD: proc James Bottomley
  0 siblings, 2 replies; 90+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg, Philipp Reisner

The /proc/drbd interface.
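
As a rough illustration of what this interface produces, here is a sketched
example of the per-device lines emitted by drbd_seq_show() below. The numbers
are made up; only the field layout follows the seq_printf() format strings:

 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:1020 nr:0 dw:1020 dr:3088 al:3 bm:12 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0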

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

---
diff --git a/drivers/block/drbd/drbd_proc.c b/drivers/block/drbd/drbd_proc.c
new file mode 100644
index 0000000..7de68d9
--- /dev/null
+++ b/drivers/block/drbd/drbd_proc.c
@@ -0,0 +1,268 @@
+/*
+   drbd_proc.c
+
+   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+   Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.
+   Copyright (C) 1999-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+   Copyright (C) 2002-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+
+   drbd is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   drbd is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with drbd; see the file COPYING.  If not, write to
+   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+
+ */
+
+#include <linux/autoconf.h>
+#include <linux/module.h>
+
+#include <asm/uaccess.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/drbd_config.h>
+#include <linux/drbd.h>
+#include "drbd_int.h"
+#include "lru_cache.h" /* for lc_sprintf_stats */
+
+STATIC int drbd_proc_open(struct inode *inode, struct file *file);
+
+
+struct proc_dir_entry *drbd_proc;
+struct file_operations drbd_proc_fops = {
+	.owner		= THIS_MODULE,
+	.open		= drbd_proc_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+
+/*lge
+ * progress bars shamelessly adapted from driver/md/md.c
+ * output looks like
+ *	[=====>..............] 33.5% (23456/123456)
+ *	finish: 2:20:20 speed: 6,345 (6,456) K/sec
+ */
+STATIC void drbd_syncer_progress(struct drbd_conf *mdev, struct seq_file *seq)
+{
+	unsigned long db, dt, dbdt, rt, rs_left;
+	unsigned int res;
+	int i, x, y;
+
+	drbd_get_syncer_progress(mdev, &rs_left, &res);
+
+	x = res/50;
+	y = 20-x;
+	seq_printf(seq, "\t[");
+	for (i = 1; i < x; i++)
+		seq_printf(seq, "=");
+	seq_printf(seq, ">");
+	for (i = 0; i < y; i++)
+		seq_printf(seq, ".");
+	seq_printf(seq, "] ");
+
+	seq_printf(seq, "sync'ed:%3u.%u%% ", res / 10, res % 10);
+	/* if more than 1 GB display in MB */
+	if (mdev->rs_total > 0x100000L)
+		seq_printf(seq, "(%lu/%lu)M\n\t",
+			    (unsigned long) Bit2KB(rs_left >> 10),
+			    (unsigned long) Bit2KB(mdev->rs_total >> 10));
+	else
+		seq_printf(seq, "(%lu/%lu)K\n\t",
+			    (unsigned long) Bit2KB(rs_left),
+			    (unsigned long) Bit2KB(mdev->rs_total));
+
+	/* see drivers/md/md.c
+	 * We do not want to overflow, so the order of operands and
+	 * the * 100 / 100 trick are important. We do a +1 to be
+	 * safe against division by zero. We only estimate anyway.
+	 *
+	 * dt: time from mark until now
+	 * db: blocks written from mark until now
+	 * rt: remaining time
+	 */
+	dt = (jiffies - mdev->rs_mark_time) / HZ;
+
+	if (dt > 20) {
+		/* if we made no update to rs_mark_time for too long,
+		 * we are stalled. show that. */
+		seq_printf(seq, "stalled\n");
+		return;
+	}
+
+	if (!dt)
+		dt++;
+	db = mdev->rs_mark_left - rs_left;
+	rt = (dt * (rs_left / (db/100+1)))/100; /* seconds */
+
+	seq_printf(seq, "finish: %lu:%02lu:%02lu",
+		rt / 3600, (rt % 3600) / 60, rt % 60);
+
+	/* current speed average over (SYNC_MARKS * SYNC_MARK_STEP) jiffies */
+	dbdt = Bit2KB(db/dt);
+	if (dbdt > 1000)
+		seq_printf(seq, " speed: %ld,%03ld",
+			dbdt/1000, dbdt % 1000);
+	else
+		seq_printf(seq, " speed: %ld", dbdt);
+
+	/* mean speed since syncer started
+	 * we do account for PausedSync periods */
+	dt = (jiffies - mdev->rs_start - mdev->rs_paused) / HZ;
+	if (dt <= 0)
+		dt = 1;
+	db = mdev->rs_total - rs_left;
+	dbdt = Bit2KB(db/dt);
+	if (dbdt > 1000)
+		seq_printf(seq, " (%ld,%03ld)",
+			dbdt/1000, dbdt % 1000);
+	else
+		seq_printf(seq, " (%ld)", dbdt);
+
+	seq_printf(seq, " K/sec\n");
+}
+
+STATIC void resync_dump_detail(struct seq_file *seq, struct lc_element *e)
+{
+	struct bm_extent *bme = (struct bm_extent *)e;
+
+	seq_printf(seq, "%5d %s %s\n", bme->rs_left,
+		   bme->flags & BME_NO_WRITES ? "NO_WRITES" : "---------",
+		   bme->flags & BME_LOCKED ? "LOCKED" : "------"
+		   );
+}
+
+STATIC int drbd_seq_show(struct seq_file *seq, void *v)
+{
+	int i, hole = 0;
+	const char *sn;
+	struct drbd_conf *mdev;
+
+	static char write_ordering_chars[] = {
+		[WO_none] = 'n',
+		[WO_drain_io] = 'd',
+		[WO_bdev_flush] = 'f',
+		[WO_bio_barrier] = 'b',
+	};
+
+	seq_printf(seq, "version: " REL_VERSION " (api:%d/proto:%d-%d)\n%s\n",
+		   API_VERSION, PRO_VERSION_MIN, PRO_VERSION_MAX, drbd_buildtag());
+
+	/*
+	  cs .. connection state
+	  ro .. node role (local/remote)
+	  ds .. disk state (local/remote)
+	     protocol
+	     various flags
+	  ns .. network send
+	  nr .. network receive
+	  dw .. disk write
+	  dr .. disk read
+	  al .. activity log write count
+	  bm .. bitmap update write count
+	  pe .. pending (waiting for ack or data reply)
+	  ua .. unack'd (still need to send ack or data reply)
+	  ap .. application requests accepted, but not yet completed
+	  ep .. number of epochs currently "on the fly", P_BARRIER_ACK pending
+	  wo .. write ordering mode currently in use
+	 oos .. known out-of-sync kB
+	*/
+
+	for (i = 0; i < minor_count; i++) {
+		mdev = minor_to_mdev(i);
+		if (!mdev) {
+			hole = 1;
+			continue;
+		}
+		if (hole) {
+			hole = 0;
+			seq_printf(seq, "\n");
+		}
+
+		sn = conns_to_name(mdev->state.conn);
+
+		if (mdev->state.conn == C_STANDALONE &&
+		    mdev->state.disk == D_DISKLESS &&
+		    mdev->state.role == R_SECONDARY) {
+			seq_printf(seq, "%2d: cs:Unconfigured\n", i);
+		} else {
+			seq_printf(seq,
+			   "%2d: cs:%s ro:%s/%s ds:%s/%s %c %c%c%c%c%c\n"
+			   "    ns:%u nr:%u dw:%u dr:%u al:%u bm:%u "
+			   "lo:%d pe:%d ua:%d ap:%d ep:%d wo:%c",
+			   i, sn,
+			   roles_to_name(mdev->state.role),
+			   roles_to_name(mdev->state.peer),
+			   disks_to_name(mdev->state.disk),
+			   disks_to_name(mdev->state.pdsk),
+			   (mdev->net_conf == NULL ? ' ' :
+			    (mdev->net_conf->wire_protocol - DRBD_PROT_A+'A')),
+			   mdev->state.susp ? 's' : 'r',
+			   mdev->state.aftr_isp ? 'a' : '-',
+			   mdev->state.peer_isp ? 'p' : '-',
+			   mdev->state.user_isp ? 'u' : '-',
+			   mdev->congestion_reason ?: '-',
+			   mdev->send_cnt/2,
+			   mdev->recv_cnt/2,
+			   mdev->writ_cnt/2,
+			   mdev->read_cnt/2,
+			   mdev->al_writ_cnt,
+			   mdev->bm_writ_cnt,
+			   atomic_read(&mdev->local_cnt),
+			   atomic_read(&mdev->ap_pending_cnt) +
+			   atomic_read(&mdev->rs_pending_cnt),
+			   atomic_read(&mdev->unacked_cnt),
+			   atomic_read(&mdev->ap_bio_cnt),
+			   mdev->epochs,
+			   write_ordering_chars[mdev->write_ordering]
+			);
+			seq_printf(seq, " oos:%lu\n",
+				   Bit2KB(drbd_bm_total_weight(mdev)));
+		}
+		if (mdev->state.conn == C_SYNC_SOURCE ||
+		    mdev->state.conn == C_SYNC_TARGET)
+			drbd_syncer_progress(mdev, seq);
+
+		if (mdev->state.conn == C_VERIFY_S || mdev->state.conn == C_VERIFY_T)
+			seq_printf(seq, "\t%3d%%      %lu/%lu\n",
+				   (int)((mdev->rs_total-mdev->ov_left) /
+					 (mdev->rs_total/100+1)),
+				   mdev->rs_total - mdev->ov_left,
+				   mdev->rs_total);
+
+		if (proc_details >= 1 && inc_local_if_state(mdev, D_FAILED)) {
+			lc_printf_stats(seq, mdev->resync);
+			lc_printf_stats(seq, mdev->act_log);
+			dec_local(mdev);
+		}
+
+		if (proc_details >= 2) {
+			if (mdev->resync) {
+				lc_dump(mdev->resync, seq, "rs_left",
+					resync_dump_detail);
+			}
+		}
+	}
+
+	return 0;
+}
+
+STATIC int drbd_proc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, drbd_seq_show, PDE(inode)->data);
+}
+
+/* PROC FS stuff end */

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 11/16] DRBD: worker
  2009-04-30 11:26                   ` [PATCH 10/16] DRBD: proc Philipp Reisner
@ 2009-04-30 11:26                     ` Philipp Reisner
  2009-04-30 11:26                       ` [PATCH 12/16] DRBD: variable_length_integer_encoding Philipp Reisner
  2009-05-02 15:44                     ` [PATCH 10/16] DRBD: proc James Bottomley
  1 sibling, 1 reply; 90+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg, Philipp Reisner

Our generic worker thread. Does the actual sending of data via the network
link. Does all the after-state-change activities, that have to be done
without holding the req_lock spinlock. And some other stuff.
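
To make the control flow easier to follow, here is a minimal sketch of the
work item pattern these callbacks use. It assumes the struct drbd_work /
drbd_work_queue definitions and drbd_queue_work() from drbd_int.h; the names
my_work_cb and queue_my_work are hypothetical, for illustration only:

  /* A unit of deferred work is a struct drbd_work carrying a callback.
   * The worker dequeues it and invokes w->cb(mdev, w, cancel); cancel is
   * nonzero when the queue is only being flushed (e.g. on disconnect). */
  static int my_work_cb(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
  {
  	if (unlikely(cancel))
  		return 1;	/* nothing was started yet, just acknowledge */
  	/* ... perform the deferred action, e.g. send a packet ... */
  	return 1;		/* the callbacks in this file return 1 on success */
  }

  static struct drbd_work my_work;

  static void queue_my_work(struct drbd_conf *mdev)
  {
  	my_work.cb = my_work_cb;
  	/* items are processed in FIFO order by the per-device worker thread */
  	drbd_queue_work(&mdev->data.work, &my_work);
  }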

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

---
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
new file mode 100644
index 0000000..81f3a4e
--- /dev/null
+++ b/drivers/block/drbd/drbd_worker.c
@@ -0,0 +1,1463 @@
+/*
+   drbd_worker.c
+
+   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+   Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.
+   Copyright (C) 1999-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+   Copyright (C) 2002-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+
+   drbd is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   drbd is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with drbd; see the file COPYING.  If not, write to
+   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+
+ */
+
+#include <linux/autoconf.h>
+#include <linux/module.h>
+#include <linux/version.h>
+
+#include <linux/sched.h>
+#include <linux/smp_lock.h>
+#include <linux/wait.h>
+#include <linux/mm.h>
+#include <linux/drbd_config.h>
+#include <linux/memcontrol.h>
+#include <linux/mm_inline.h>
+#include <linux/slab.h>
+#include <linux/random.h>
+#include <linux/mm.h>
+#include <linux/string.h>
+#include <linux/scatterlist.h>
+
+#include <linux/drbd.h>
+#include "drbd_int.h"
+#include "drbd_req.h"
+#include "drbd_tracing.h"
+
+#define SLEEP_TIME (HZ/10)
+
+STATIC int w_make_ov_request(struct drbd_conf *mdev, struct drbd_work *w, int cancel);
+
+
+
+/* defined here:
+   drbd_md_io_complete
+   drbd_endio_write_sec
+   drbd_endio_read_sec
+   drbd_endio_pri
+
+ * more endio handlers:
+   atodb_endio in drbd_actlog.c
+   drbd_bm_async_io_complete in drbd_bitmap.c
+
+ * For all these callbacks, note the following:
+ * The callbacks will be called in irq context by the IDE drivers,
+ * and in Softirqs/Tasklets/BH context by the SCSI drivers.
+ * Try to get the locking right :)
+ *
+ */
+
+
+/* About the global_state_lock
+   Each state transition on a device holds a read lock. In case we have
+   to evaluate the sync-after dependencies, we grab a write lock, because
+   we need stable states on all devices for that.  */
+rwlock_t global_state_lock;
+
+/* used for synchronous meta data and bitmap IO
+ * submitted by drbd_md_sync_page_io()
+ */
+void drbd_md_io_complete(struct bio *bio, int error)
+{
+	struct drbd_md_io *md_io;
+
+	/* error parameter ignored:
+	 * drbd_md_sync_page_io explicitly tests bio_uptodate(bio); */
+
+	md_io = (struct drbd_md_io *)bio->bi_private;
+
+	md_io->error = error;
+
+	trace_drbd_bio(md_io->mdev, "Md", bio, 1, NULL);
+
+	complete(&md_io->event);
+}
+
+/* reads on behalf of the partner,
+ * "submitted" by the receiver
+ */
+void drbd_endio_read_sec(struct bio *bio, int error) __releases(local)
+{
+	unsigned long flags = 0;
+	struct drbd_epoch_entry *e = NULL;
+	struct drbd_conf *mdev;
+	int uptodate = bio_flagged(bio, BIO_UPTODATE);
+
+	e = bio->bi_private;
+	mdev = e->mdev;
+
+	if (!error && !uptodate) {
+		/* strange behaviour of some lower level drivers...
+		 * fail the request by clearing the uptodate flag,
+		 * but do not return any error?!
+		 * do we want to dev_warn(DEV, ) on this? */
+		error = -EIO;
+	}
+
+	D_ASSERT(e->block_id != ID_VACANT);
+
+	trace_drbd_bio(mdev, "Sec", bio, 1, NULL);
+
+	spin_lock_irqsave(&mdev->req_lock, flags);
+	mdev->read_cnt += e->size >> 9;
+	list_del(&e->w.list);
+	if (list_empty(&mdev->read_ee))
+		wake_up(&mdev->ee_wait);
+	spin_unlock_irqrestore(&mdev->req_lock, flags);
+
+	drbd_chk_io_error(mdev, error, FALSE);
+	drbd_queue_work(&mdev->data.work, &e->w);
+	dec_local(mdev);
+
+	trace_drbd_ee(mdev, e, "read completed");
+}
+
+/* writes on behalf of the partner, or resync writes,
+ * "submitted" by the receiver.
+ */
+void drbd_endio_write_sec(struct bio *bio, int error) __releases(local)
+{
+	unsigned long flags = 0;
+	struct drbd_epoch_entry *e = NULL;
+	struct drbd_conf *mdev;
+	sector_t e_sector;
+	int do_wake;
+	int is_syncer_req;
+	int do_al_complete_io;
+	int uptodate = bio_flagged(bio, BIO_UPTODATE);
+
+	e = bio->bi_private;
+	mdev = e->mdev;
+
+	if (!error && !uptodate) {
+		/* strange behaviour of some lower level drivers...
+		 * fail the request by clearing the uptodate flag,
+		 * but do not return any error?!
+		 * do we want to dev_warn(DEV, ) on this? */
+		error = -EIO;
+	}
+
+	/* error == -ENOTSUPP would be a better test,
+	 * alas it is not reliable */
+	if (error && e->flags & EE_IS_BARRIER) {
+		drbd_bump_write_ordering(mdev, WO_bdev_flush);
+		spin_lock_irqsave(&mdev->req_lock, flags);
+		list_del(&e->w.list);
+		e->w.cb = w_e_reissue;
+		__release(local); /* Actually happens in w_e_reissue. */
+		spin_unlock_irqrestore(&mdev->req_lock, flags);
+		drbd_queue_work(&mdev->data.work, &e->w);
+		return;
+	}
+
+	D_ASSERT(e->block_id != ID_VACANT);
+
+	trace_drbd_bio(mdev, "Sec", bio, 1, NULL);
+
+	spin_lock_irqsave(&mdev->req_lock, flags);
+	mdev->writ_cnt += e->size >> 9;
+	is_syncer_req = is_syncer_block_id(e->block_id);
+
+	/* after we moved e to done_ee,
+	 * we may no longer access it,
+	 * it may be freed/reused already!
+	 * (as soon as we release the req_lock) */
+	e_sector = e->sector;
+	do_al_complete_io = e->flags & EE_CALL_AL_COMPLETE_IO;
+
+	list_del(&e->w.list); /* has been on active_ee or sync_ee */
+	list_add_tail(&e->w.list, &mdev->done_ee);
+
+	trace_drbd_ee(mdev, e, "write completed");
+
+	/* No hlist_del_init(&e->colision) here, we did not send the Ack yet,
+	 * neither did we wake possibly waiting conflicting requests.
+	 * done from "drbd_process_done_ee" within the appropriate w.cb
+	 * (e_end_block/e_end_resync_block) or from _drbd_clear_done_ee */
+
+	do_wake = is_syncer_req
+		? list_empty(&mdev->sync_ee)
+		: list_empty(&mdev->active_ee);
+
+	if (error)
+		__drbd_chk_io_error(mdev, FALSE);
+	spin_unlock_irqrestore(&mdev->req_lock, flags);
+
+	if (is_syncer_req)
+		drbd_rs_complete_io(mdev, e_sector);
+
+	if (do_wake)
+		wake_up(&mdev->ee_wait);
+
+	if (do_al_complete_io)
+		drbd_al_complete_io(mdev, e_sector);
+
+	wake_asender(mdev);
+	dec_local(mdev);
+
+}
+
+/* read, readA or write requests on R_PRIMARY coming from drbd_make_request
+ */
+void drbd_endio_pri(struct bio *bio, int error)
+{
+	unsigned long flags;
+	struct drbd_request *req = bio->bi_private;
+	struct drbd_conf *mdev = req->mdev;
+	enum drbd_req_event what;
+	int uptodate = bio_flagged(bio, BIO_UPTODATE);
+
+	if (!error && !uptodate) {
+		/* strange behaviour of some lower level drivers...
+		 * fail the request by clearing the uptodate flag,
+		 * but do not return any error?!
+		 * do we want to dev_warn(DEV, ) on this? */
+		error = -EIO;
+	}
+
+	trace_drbd_bio(mdev, "Pri", bio, 1, NULL);
+
+	/* to avoid recursion in _req_mod */
+	what = error
+	       ? (bio_data_dir(bio) == WRITE)
+		 ? write_completed_with_error
+		 : read_completed_with_error
+	       : completed_ok;
+	spin_lock_irqsave(&mdev->req_lock, flags);
+	_req_mod(req, what, error);
+	spin_unlock_irqrestore(&mdev->req_lock, flags);
+}
+
+int w_io_error(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	struct drbd_request *req = (struct drbd_request *)w;
+	int ok;
+
+	/* NOTE: mdev->bc can be NULL by the time we get here! */
+	/* D_ASSERT(mdev->bc->dc.on_io_error != EP_PASS_ON); */
+
+	/* the only way this callback is scheduled is from _req_may_be_done,
+	 * when it is done and had a local write error, see comments there */
+	drbd_req_free(req);
+
+	ok = drbd_io_error(mdev, FALSE);
+	if (unlikely(!ok))
+		dev_err(DEV, "Sending in w_io_error() failed\n");
+	return ok;
+}
+
+int w_read_retry_remote(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	struct drbd_request *req = (struct drbd_request *)w;
+
+	/* We should not detach for read io-error,
+	 * but try to WRITE the P_DATA_REPLY to the failed location,
+	 * to give the disk the chance to relocate that block */
+	drbd_io_error(mdev, FALSE); /* tries to schedule a detach and notifies peer */
+
+	spin_lock_irq(&mdev->req_lock);
+	if (cancel ||
+	    mdev->state.conn < C_CONNECTED ||
+	    mdev->state.pdsk <= D_INCONSISTENT) {
+		_req_mod(req, send_canceled, 0);
+		spin_unlock_irq(&mdev->req_lock);
+		dev_alert(DEV, "WE ARE LOST. Local IO failure, no peer.\n");
+		return 1;
+	}
+	spin_unlock_irq(&mdev->req_lock);
+
+	return w_send_read_req(mdev, w, 0);
+}
+
+int w_resync_inactive(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	ERR_IF(cancel) return 1;
+	dev_err(DEV, "resync inactive, but callback triggered??\n");
+	return 1; /* Simply ignore this! */
+}
+
+STATIC void drbd_csum(struct drbd_conf *mdev, struct crypto_hash *tfm, struct bio *bio, void *digest)
+{
+	struct hash_desc desc;
+	struct scatterlist sg;
+	struct bio_vec *bvec;
+	int i;
+
+	desc.tfm = tfm;
+	desc.flags = 0;
+
+	sg_init_table(&sg, 1);
+	crypto_hash_init(&desc);
+
+	__bio_for_each_segment(bvec, bio, i, 0) {
+		sg_set_page(&sg, bvec->bv_page, bvec->bv_len, bvec->bv_offset);
+		crypto_hash_update(&desc, &sg, sg.length);
+	}
+	crypto_hash_final(&desc, digest);
+}
+
+STATIC int w_e_send_csum(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	struct drbd_epoch_entry *e = (struct drbd_epoch_entry *)w;
+	int digest_size;
+	void *digest;
+	int ok;
+
+	D_ASSERT(e->block_id == DRBD_MAGIC + 0xbeef);
+
+	if (unlikely(cancel)) {
+		drbd_free_ee(mdev, e);
+		return 1;
+	}
+
+	if (likely(drbd_bio_uptodate(e->private_bio))) {
+		digest_size = crypto_hash_digestsize(mdev->csums_tfm);
+		digest = kmalloc(digest_size, GFP_KERNEL);
+		if (digest) {
+			drbd_csum(mdev, mdev->csums_tfm, e->private_bio, digest);
+
+			inc_rs_pending(mdev);
+			ok = drbd_send_drequest_csum(mdev,
+						     e->sector,
+						     e->size,
+						     digest,
+						     digest_size,
+						     P_CSUM_RS_REQUEST);
+			kfree(digest);
+		} else {
+			dev_err(DEV, "kmalloc() of digest failed.\n");
+			ok = 0;
+		}
+	} else {
+		drbd_io_error(mdev, FALSE);
+		ok = 1;
+	}
+
+	drbd_free_ee(mdev, e);
+
+	if (unlikely(!ok))
+		dev_err(DEV, "drbd_send_drequest(..., csum) failed\n");
+	return ok;
+}
+
+#define GFP_TRY	(__GFP_HIGHMEM | __GFP_NOWARN)
+
+STATIC int read_for_csum(struct drbd_conf *mdev, sector_t sector, int size)
+{
+	struct drbd_epoch_entry *e;
+
+	if (!inc_local(mdev))
+		return 0;
+
+	if (FAULT_ACTIVE(mdev, DRBD_FAULT_AL_EE))
+		return 2;
+
+	e = drbd_alloc_ee(mdev, DRBD_MAGIC+0xbeef, sector, size, GFP_TRY);
+	if (!e) {
+		dec_local(mdev);
+		return 2;
+	}
+
+	spin_lock_irq(&mdev->req_lock);
+	list_add(&e->w.list, &mdev->read_ee);
+	spin_unlock_irq(&mdev->req_lock);
+
+	e->private_bio->bi_end_io = drbd_endio_read_sec;
+	e->private_bio->bi_rw = READ;
+	e->w.cb = w_e_send_csum;
+
+	mdev->read_cnt += size >> 9;
+	drbd_generic_make_request(mdev, DRBD_FAULT_RS_RD, e->private_bio);
+
+	return 1;
+}
+
+void resync_timer_fn(unsigned long data)
+{
+	unsigned long flags;
+	struct drbd_conf *mdev = (struct drbd_conf *) data;
+	int queue;
+
+	spin_lock_irqsave(&mdev->req_lock, flags);
+
+	if (likely(!test_and_clear_bit(STOP_SYNC_TIMER, &mdev->flags))) {
+		queue = 1;
+		if (mdev->state.conn == C_VERIFY_S)
+			mdev->resync_work.cb = w_make_ov_request;
+		else
+			mdev->resync_work.cb = w_make_resync_request;
+	} else {
+		queue = 0;
+		mdev->resync_work.cb = w_resync_inactive;
+	}
+
+	spin_unlock_irqrestore(&mdev->req_lock, flags);
+
+	/* harmless race: list_empty outside data.work.q_lock */
+	if (list_empty(&mdev->resync_work.list) && queue)
+		drbd_queue_work(&mdev->data.work, &mdev->resync_work);
+}
+
+int w_make_resync_request(struct drbd_conf *mdev,
+		struct drbd_work *w, int cancel)
+{
+	unsigned long bit;
+	sector_t sector;
+	const sector_t capacity = drbd_get_capacity(mdev->this_bdev);
+	int max_segment_size = mdev->rq_queue->max_segment_size;
+	int number, i, size;
+	int align;
+
+	if (unlikely(cancel))
+		return 1;
+
+	if (unlikely(mdev->state.conn < C_CONNECTED)) {
+		dev_err(DEV, "Confused in w_make_resync_request()! cstate < Connected");
+		return 0;
+	}
+
+	if (mdev->state.conn != C_SYNC_TARGET)
+		dev_err(DEV, "%s in w_make_resync_request\n",
+			conns_to_name(mdev->state.conn));
+
+	if (!inc_local(mdev)) {
+		/* Since we only need to access mdev->resync, an
+		   inc_local_if_state(mdev, D_FAILED) would be sufficient, but
+		   continuing resync with a broken disk makes no sense at
+		   all */
+		dev_err(DEV, "Disk broke down during resync!\n");
+		mdev->resync_work.cb = w_resync_inactive;
+		return 1;
+	}
+	/* All goto requeue jumps have to happen after this block: inc_local() */
+
+	number = SLEEP_TIME*mdev->sync_conf.rate / ((BM_BLOCK_SIZE/1024)*HZ);
+
+	if (atomic_read(&mdev->rs_pending_cnt) > number)
+		goto requeue;
+	number -= atomic_read(&mdev->rs_pending_cnt);
+
+	for (i = 0; i < number; i++) {
+next_sector:
+		size = BM_BLOCK_SIZE;
+		bit  = drbd_bm_find_next(mdev, mdev->bm_resync_fo);
+
+		if (bit == -1UL) {
+			mdev->bm_resync_fo = drbd_bm_bits(mdev);
+			mdev->resync_work.cb = w_resync_inactive;
+			dec_local(mdev);
+			return 1;
+		}
+
+		sector = BM_BIT_TO_SECT(bit);
+
+		if (drbd_try_rs_begin_io(mdev, sector)) {
+			mdev->bm_resync_fo = bit;
+			goto requeue;
+		}
+		mdev->bm_resync_fo = bit + 1;
+
+		if (unlikely(drbd_bm_test_bit(mdev, bit) == 0)) {
+			drbd_rs_complete_io(mdev, sector);
+			goto next_sector;
+		}
+
+#if DRBD_MAX_SEGMENT_SIZE > BM_BLOCK_SIZE
+		/* try to find some adjacent bits.
+		 * we stop if we have already the maximum req size.
+		 *
+		 * Additionally, always align bigger requests, in order to
+		 * be prepared for all stripe sizes of software RAIDs.
+		 *
+		 * we _do_ care about the agreed-upon q->max_segment_size
+		 * here, as splitting up the requests on the other side is more
+		 * difficult.  the consequence is, that on lvm and md and other
+		 * "indirect" devices, this is dead code, since
+		 * q->max_segment_size will be PAGE_SIZE.
+		 */
+		align = 1;
+		for (;;) {
+			if (size + BM_BLOCK_SIZE > max_segment_size)
+				break;
+
+			/* Be always aligned */
+			if (sector & ((1<<(align+3))-1))
+				break;
+
+			/* do not cross extent boundaries */
+			if (((bit+1) & BM_BLOCKS_PER_BM_EXT_MASK) == 0)
+				break;
+			/* now, is it actually dirty, after all?
+			 * caution, drbd_bm_test_bit is tri-state for some
+			 * obscure reason; ( b == 0 ) would get the out-of-band
+			 * only accidentally right because of the "oddly sized"
+			 * adjustment below */
+			if (drbd_bm_test_bit(mdev, bit+1) != 1)
+				break;
+			bit++;
+			size += BM_BLOCK_SIZE;
+			if ((BM_BLOCK_SIZE << align) <= size)
+				align++;
+			i++;
+		}
+		/* if we merged some,
+		 * reset the offset to start the next drbd_bm_find_next from */
+		if (size > BM_BLOCK_SIZE)
+			mdev->bm_resync_fo = bit + 1;
+#endif
+
+		/* adjust very last sectors, in case we are oddly sized */
+		if (sector + (size>>9) > capacity)
+			size = (capacity-sector)<<9;
+		if (mdev->agreed_pro_version >= 89 && mdev->csums_tfm) {
+			switch (read_for_csum(mdev, sector, size)) {
+			case 0: /* Disk failure*/
+				dec_local(mdev);
+				return 0;
+			case 2: /* Allocation failed */
+				drbd_rs_complete_io(mdev, sector);
+				mdev->bm_resync_fo = BM_SECT_TO_BIT(sector);
+				goto requeue;
+			/* case 1: everything ok */
+			}
+		} else {
+			inc_rs_pending(mdev);
+			if (!drbd_send_drequest(mdev, P_RS_DATA_REQUEST,
+					       sector, size, ID_SYNCER)) {
+				dev_err(DEV, "drbd_send_drequest() failed, aborting...\n");
+				dec_rs_pending(mdev);
+				dec_local(mdev);
+				return 0;
+			}
+		}
+	}
+
+	if (mdev->bm_resync_fo >= drbd_bm_bits(mdev)) {
+		/* last syncer _request_ was sent,
+		 * but the P_RS_DATA_REPLY not yet received.  sync will end (and
+		 * next sync group will resume), as soon as we receive the last
+		 * resync data block, and the last bit is cleared.
+		 * until then resync "work" is "inactive" ...
+		 */
+		mdev->resync_work.cb = w_resync_inactive;
+		dec_local(mdev);
+		return 1;
+	}
+
+ requeue:
+	mod_timer(&mdev->resync_timer, jiffies + SLEEP_TIME);
+	dec_local(mdev);
+	return 1;
+}
+
+int w_make_ov_request(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	int number, i, size;
+	sector_t sector;
+	const sector_t capacity = drbd_get_capacity(mdev->this_bdev);
+
+	if (unlikely(cancel))
+		return 1;
+
+	if (unlikely(mdev->state.conn < C_CONNECTED)) {
+		dev_err(DEV, "Confused in w_make_ov_request()! cstate < Connected");
+		return 0;
+	}
+
+	number = SLEEP_TIME*mdev->sync_conf.rate / ((BM_BLOCK_SIZE/1024)*HZ);
+	if (atomic_read(&mdev->rs_pending_cnt) > number)
+		goto requeue;
+
+	number -= atomic_read(&mdev->rs_pending_cnt);
+
+	sector = mdev->ov_position;
+	for (i = 0; i < number; i++) {
+		size = BM_BLOCK_SIZE;
+
+		if (drbd_try_rs_begin_io(mdev, sector)) {
+			mdev->ov_position = sector;
+			goto requeue;
+		}
+
+		if (sector + (size>>9) > capacity)
+			size = (capacity-sector)<<9;
+
+		inc_rs_pending(mdev);
+		if (!drbd_send_ov_request(mdev, sector, size)) {
+			dec_rs_pending(mdev);
+			return 0;
+		}
+		sector += BM_SECT_PER_BIT;
+		if (sector >= capacity) {
+			mdev->resync_work.cb = w_resync_inactive;
+
+			return 1;
+		}
+	}
+	mdev->ov_position = sector;
+
+ requeue:
+	mod_timer(&mdev->resync_timer, jiffies + SLEEP_TIME);
+	return 1;
+}
+
+
+int w_ov_finished(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	kfree(w);
+	ov_oos_print(mdev);
+	drbd_resync_finished(mdev);
+
+	return 1;
+}
+
+STATIC int w_resync_finished(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	kfree(w);
+
+	drbd_resync_finished(mdev);
+
+	return 1;
+}
+
+int drbd_resync_finished(struct drbd_conf *mdev)
+{
+	unsigned long db, dt, dbdt;
+	unsigned long n_oos;
+	union drbd_state os, ns;
+	struct drbd_work *w;
+	char *khelper_cmd = NULL;
+
+	/* Remove all elements from the resync LRU. Since future actions
+	 * might set bits in the (main) bitmap, then the entries in the
+	 * resync LRU would be wrong. */
+	if (drbd_rs_del_all(mdev)) {
+		/* In case this is not possible now, most probably because
+		 * there are P_RS_DATA_REPLY packets lingering on the worker's
+		 * queue (or the read operations for those packets
+		 * are not finished yet).   Retry in 100ms. */
+
+		drbd_kick_lo(mdev);
+		__set_current_state(TASK_INTERRUPTIBLE);
+		schedule_timeout(HZ / 10);
+		w = kmalloc(sizeof(struct drbd_work), GFP_ATOMIC);
+		if (w) {
+			w->cb = w_resync_finished;
+			drbd_queue_work(&mdev->data.work, w);
+			return 1;
+		}
+		dev_err(DEV, "Warn failed to drbd_rs_del_all() and to kmalloc(w).\n");
+	}
+
+	dt = (jiffies - mdev->rs_start - mdev->rs_paused) / HZ;
+	if (dt <= 0)
+		dt = 1;
+	db = mdev->rs_total;
+	dbdt = Bit2KB(db/dt);
+	mdev->rs_paused /= HZ;
+
+	if (!inc_local(mdev))
+		goto out;
+
+	spin_lock_irq(&mdev->req_lock);
+	os = mdev->state;
+
+	/* This protects us against multiple calls (that can happen in the presence
+	   of application IO), and against connectivity loss just before we arrive here. */
+	if (os.conn <= C_CONNECTED)
+		goto out_unlock;
+
+	ns = os;
+	ns.conn = C_CONNECTED;
+
+	dev_info(DEV, "%s done (total %lu sec; paused %lu sec; %lu K/sec)\n",
+	     (os.conn == C_VERIFY_S || os.conn == C_VERIFY_T) ?
+	     "Online verify " : "Resync",
+	     dt + mdev->rs_paused, mdev->rs_paused, dbdt);
+
+	n_oos = drbd_bm_total_weight(mdev);
+
+	if (os.conn == C_VERIFY_S || os.conn == C_VERIFY_T) {
+		if (n_oos) {
+			dev_alert(DEV, "Online verify found %lu %dk block out of sync!\n",
+			      n_oos, Bit2KB(1));
+			khelper_cmd = "out-of-sync";
+		}
+	} else {
+		D_ASSERT((n_oos - mdev->rs_failed) == 0);
+
+		if (os.conn == C_SYNC_TARGET || os.conn == C_PAUSED_SYNC_T)
+			khelper_cmd = "after-resync-target";
+
+		if (mdev->csums_tfm && mdev->rs_total) {
+			const unsigned long s = mdev->rs_same_csum;
+			const unsigned long t = mdev->rs_total;
+			const int ratio =
+				(t == 0)     ? 0 :
+			(t < 100000) ? ((s*100)/t) : (s/(t/100));
+			dev_info(DEV, "%u %% had equal checksums, eliminated: %luK; "
+			     "transferred %luK total %luK\n",
+			     ratio,
+			     Bit2KB(mdev->rs_same_csum),
+			     Bit2KB(mdev->rs_total - mdev->rs_same_csum),
+			     Bit2KB(mdev->rs_total));
+		}
+	}
+
+	if (mdev->rs_failed) {
+		dev_info(DEV, "            %lu failed blocks\n", mdev->rs_failed);
+
+		if (os.conn == C_SYNC_TARGET || os.conn == C_PAUSED_SYNC_T) {
+			ns.disk = D_INCONSISTENT;
+			ns.pdsk = D_UP_TO_DATE;
+		} else {
+			ns.disk = D_UP_TO_DATE;
+			ns.pdsk = D_INCONSISTENT;
+		}
+	} else {
+		ns.disk = D_UP_TO_DATE;
+		ns.pdsk = D_UP_TO_DATE;
+
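+		/* resync finished without failed blocks: as former sync
+		 * target adopt the peer's UUIDs, then clear the bitmap UUID
+		 * and record that the peer's UUIDs now equal ours */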
+		if (os.conn == C_SYNC_TARGET || os.conn == C_PAUSED_SYNC_T) {
+			if (mdev->p_uuid) {
+				int i;
+				for (i = UI_BITMAP ; i <= UI_HISTORY_END ; i++)
+					_drbd_uuid_set(mdev, i, mdev->p_uuid[i]);
+				drbd_uuid_set(mdev, UI_BITMAP, mdev->bc->md.uuid[UI_CURRENT]);
+				_drbd_uuid_set(mdev, UI_CURRENT, mdev->p_uuid[UI_CURRENT]);
+			} else {
+				dev_err(DEV, "mdev->p_uuid is NULL! BUG\n");
+			}
+		}
+
+		drbd_uuid_set_bm(mdev, 0UL);
+
+		if (mdev->p_uuid) {
+			/* Now the two UUID sets are equal, update what we
+			 * know of the peer. */
+			int i;
+			for (i = UI_CURRENT ; i <= UI_HISTORY_END ; i++)
+				mdev->p_uuid[i] = mdev->bc->md.uuid[i];
+		}
+	}
+
+	_drbd_set_state(mdev, ns, CS_VERBOSE, NULL);
+out_unlock:
+	spin_unlock_irq(&mdev->req_lock);
+	dec_local(mdev);
+out:
+	mdev->rs_total  = 0;
+	mdev->rs_failed = 0;
+	mdev->rs_paused = 0;
+
+	if (test_and_clear_bit(WRITE_BM_AFTER_RESYNC, &mdev->flags)) {
+		dev_warn(DEV, "Writing the whole bitmap, due to failed kmalloc\n");
+		drbd_queue_bitmap_io(mdev, &drbd_bm_write, NULL, "write from resync_finished");
+	}
+
+	drbd_bm_recount_bits(mdev);
+
+	if (khelper_cmd)
+		drbd_khelper(mdev, khelper_cmd);
+
+	return 1;
+}
+
+/**
+ * w_e_end_data_req: Send the answer (P_DATA_REPLY) in response to a DataRequest.
+ */
+int w_e_end_data_req(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	struct drbd_epoch_entry *e = (struct drbd_epoch_entry *)w;
+	int ok;
+
+	if (unlikely(cancel)) {
+		drbd_free_ee(mdev, e);
+		dec_unacked(mdev);
+		return 1;
+	}
+
+	if (likely(drbd_bio_uptodate(e->private_bio))) {
+		ok = drbd_send_block(mdev, P_DATA_REPLY, e);
+	} else {
+		if (__ratelimit(&drbd_ratelimit_state))
+			dev_err(DEV, "Sending NegDReply. sector=%llus.\n",
+			    (unsigned long long)e->sector);
+
+		ok = drbd_send_ack(mdev, P_NEG_DREPLY, e);
+
+		drbd_io_error(mdev, FALSE);
+	}
+
+	dec_unacked(mdev);
+
+	spin_lock_irq(&mdev->req_lock);
+	if (drbd_bio_has_active_page(e->private_bio)) {
+		/* This might happen if sendpage() has not finished */
+		list_add_tail(&e->w.list, &mdev->net_ee);
+	} else {
+		drbd_free_ee(mdev, e);
+	}
+	spin_unlock_irq(&mdev->req_lock);
+
+	if (unlikely(!ok))
+		dev_err(DEV, "drbd_send_block() failed\n");
+	return ok;
+}
+
+/**
+ * w_e_end_rsdata_req: Send the answer (P_RS_DATA_REPLY) to an RSDataRequest.
+ */
+int w_e_end_rsdata_req(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	struct drbd_epoch_entry *e = (struct drbd_epoch_entry *)w;
+	int ok;
+
+	if (unlikely(cancel)) {
+		drbd_free_ee(mdev, e);
+		dec_unacked(mdev);
+		return 1;
+	}
+
+	if (inc_local_if_state(mdev, D_FAILED)) {
+		drbd_rs_complete_io(mdev, e->sector);
+		dec_local(mdev);
+	}
+
+	if (likely(drbd_bio_uptodate(e->private_bio))) {
+		if (likely(mdev->state.pdsk >= D_INCONSISTENT)) {
+			inc_rs_pending(mdev);
+			ok = drbd_send_block(mdev, P_RS_DATA_REPLY, e);
+		} else {
+			if (__ratelimit(&drbd_ratelimit_state))
+				dev_err(DEV, "Not sending RSDataReply, "
+				    "partner DISKLESS!\n");
+			ok = 1;
+		}
+	} else {
+		if (__ratelimit(&drbd_ratelimit_state))
+			dev_err(DEV, "Sending NegRSDReply. sector %llus.\n",
+			    (unsigned long long)e->sector);
+
+		ok = drbd_send_ack(mdev, P_NEG_RS_DREPLY, e);
+
+		drbd_io_error(mdev, FALSE);
+
+		/* update resync data with failure */
+		drbd_rs_failed_io(mdev, e->sector, e->size);
+	}
+
+	dec_unacked(mdev);
+
+	spin_lock_irq(&mdev->req_lock);
+	if (drbd_bio_has_active_page(e->private_bio)) {
+		/* This might happen if sendpage() has not finished */
+		list_add_tail(&e->w.list, &mdev->net_ee);
+	} else {
+		drbd_free_ee(mdev, e);
+	}
+	spin_unlock_irq(&mdev->req_lock);
+
+	if (unlikely(!ok))
+		dev_err(DEV, "drbd_send_block() failed\n");
+	return ok;
+}
+
+int w_e_end_csum_rs_req(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	struct drbd_epoch_entry *e = (struct drbd_epoch_entry *)w;
+	struct digest_info *di;
+	int digest_size;
+	void *digest = NULL;
+	int ok, eq = 0;
+
+	if (unlikely(cancel)) {
+		drbd_free_ee(mdev, e);
+		dec_unacked(mdev);
+		return 1;
+	}
+
+	drbd_rs_complete_io(mdev, e->sector);
+
+	di = (struct digest_info *)(unsigned long)e->block_id;
+
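+	/* checksum based resync: recompute the digest over our local copy
+	 * of the block and compare it with the digest the peer sent; only
+	 * if they differ is the full block sent back (P_RS_DATA_REPLY),
+	 * otherwise a small P_RS_IS_IN_SYNC ack is sufficient */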
+	if (likely(drbd_bio_uptodate(e->private_bio))) {
+		/* quick hack to try to avoid a race against reconfiguration.
+		 * a real fix would be much more involved,
+		 * introducing more locking mechanisms */
+		if (mdev->csums_tfm) {
+			digest_size = crypto_hash_digestsize(mdev->csums_tfm);
+			D_ASSERT(digest_size == di->digest_size);
+			digest = kmalloc(digest_size, GFP_KERNEL);
+		}
+		if (digest) {
+			drbd_csum(mdev, mdev->csums_tfm, e->private_bio, digest);
+			eq = !memcmp(digest, di->digest, digest_size);
+			kfree(digest);
+		}
+
+		if (eq) {
+			drbd_set_in_sync(mdev, e->sector, e->size);
+			mdev->rs_same_csum++;
+			ok = drbd_send_ack(mdev, P_RS_IS_IN_SYNC, e);
+		} else {
+			inc_rs_pending(mdev);
+			e->block_id = ID_SYNCER;
+			ok = drbd_send_block(mdev, P_RS_DATA_REPLY, e);
+		}
+	} else {
+		ok = drbd_send_ack(mdev, P_NEG_RS_DREPLY, e);
+		if (__ratelimit(&drbd_ratelimit_state))
+			dev_err(DEV, "Sending NegDReply. I guess it gets messy.\n");
+		drbd_io_error(mdev, FALSE);
+	}
+
+	dec_unacked(mdev);
+
+	kfree(di);
+
+	spin_lock_irq(&mdev->req_lock);
+	if (drbd_bio_has_active_page(e->private_bio)) {
+		/* This might happen if sendpage() has not finished */
+		list_add_tail(&e->w.list, &mdev->net_ee);
+	} else {
+		drbd_free_ee(mdev, e);
+	}
+	spin_unlock_irq(&mdev->req_lock);
+
+	if (unlikely(!ok))
+		dev_err(DEV, "drbd_send_block/ack() failed\n");
+	return ok;
+}
+
+int w_e_end_ov_req(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	struct drbd_epoch_entry *e = (struct drbd_epoch_entry *)w;
+	int digest_size;
+	void *digest;
+	int ok = 1;
+
+	if (unlikely(cancel))
+		goto out;
+
+	if (unlikely(!drbd_bio_uptodate(e->private_bio)))
+		goto out;
+
+	digest_size = crypto_hash_digestsize(mdev->verify_tfm);
+	digest = kmalloc(digest_size, GFP_KERNEL);
+	if (digest) {
+		drbd_csum(mdev, mdev->verify_tfm, e->private_bio, digest);
+		ok = drbd_send_drequest_csum(mdev, e->sector, e->size,
+					     digest, digest_size, P_OV_REPLY);
+		if (ok)
+			inc_rs_pending(mdev);
+		kfree(digest);
+	}
+
+out:
+	spin_lock_irq(&mdev->req_lock);
+	drbd_free_ee(mdev, e);
+	spin_unlock_irq(&mdev->req_lock);
+
+	dec_unacked(mdev);
+
+	return ok;
+}
+
+void drbd_ov_oos_found(struct drbd_conf *mdev, sector_t sector, int size)
+{
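+	/* merge contiguous out-of-sync sectors into one running range,
+	 * tracked in ov_last_oos_start / ov_last_oos_size */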
+	if (mdev->ov_last_oos_start + mdev->ov_last_oos_size == sector) {
+		mdev->ov_last_oos_size += size>>9;
+	} else {
+		mdev->ov_last_oos_start = sector;
+		mdev->ov_last_oos_size = size>>9;
+	}
+	drbd_set_out_of_sync(mdev, sector, size);
+	set_bit(WRITE_BM_AFTER_RESYNC, &mdev->flags);
+}
+
+int w_e_end_ov_reply(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	struct drbd_epoch_entry *e = (struct drbd_epoch_entry *)w;
+	struct digest_info *di;
+	int digest_size;
+	void *digest;
+	int ok, eq = 0;
+
+	if (unlikely(cancel)) {
+		drbd_free_ee(mdev, e);
+		dec_unacked(mdev);
+		return 1;
+	}
+
+	/* after "cancel", because after drbd_disconnect/drbd_rs_cancel_all
+	 * the resync lru has been cleaned up already */
+	drbd_rs_complete_io(mdev, e->sector);
+
+	di = (struct digest_info *)(unsigned long)e->block_id;
+
+	if (likely(drbd_bio_uptodate(e->private_bio))) {
+		digest_size = crypto_hash_digestsize(mdev->verify_tfm);
+		digest = kmalloc(digest_size, GFP_KERNEL);
+		if (digest) {
+			drbd_csum(mdev, mdev->verify_tfm, e->private_bio, digest);
+
+			D_ASSERT(digest_size == di->digest_size);
+			eq = !memcmp(digest, di->digest, digest_size);
+			kfree(digest);
+		}
+	} else {
+		ok = drbd_send_ack(mdev, P_NEG_RS_DREPLY, e);
+		if (__ratelimit(&drbd_ratelimit_state))
+			dev_err(DEV, "Sending NegDReply. I guess it gets messy.\n");
+		drbd_io_error(mdev, FALSE);
+	}
+
+	dec_unacked(mdev);
+
+	kfree(di);
+
+	if (!eq)
+		drbd_ov_oos_found(mdev, e->sector, e->size);
+	else
+		ov_oos_print(mdev);
+
+	ok = drbd_send_ack_ex(mdev, P_OV_RESULT, e->sector, e->size,
+			      eq ? ID_IN_SYNC : ID_OUT_OF_SYNC);
+
+	spin_lock_irq(&mdev->req_lock);
+	drbd_free_ee(mdev, e);
+	spin_unlock_irq(&mdev->req_lock);
+
+	if (--mdev->ov_left == 0) {
+		ov_oos_print(mdev);
+		drbd_resync_finished(mdev);
+	}
+
+	return ok;
+}
+
+int w_prev_work_done(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	clear_bit(WORK_PENDING, &mdev->flags);
+	wake_up(&mdev->misc_wait);
+	return 1;
+}
+
+int w_send_barrier(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	struct drbd_tl_epoch *b = (struct drbd_tl_epoch *)w;
+	struct p_barrier *p = &mdev->data.sbuf.barrier;
+	int ok = 1;
+
+	/* really avoid racing with tl_clear.  w.cb may have been referenced
+	 * just before it was reassigned and requeued, so double check that.
+	 * actually, this race was harmless, since we only try to send the
+	 * barrier packet here, and otherwise do nothing with the object.
+	 * but compare with the head of w_clear_epoch */
+	spin_lock_irq(&mdev->req_lock);
+	if (w->cb != w_send_barrier || mdev->state.conn < C_CONNECTED)
+		cancel = 1;
+	spin_unlock_irq(&mdev->req_lock);
+	if (cancel)
+		return 1;
+
+	if (!drbd_get_data_sock(mdev))
+		return 0;
+	p->barrier = b->br_number;
+	/* inc_ap_pending was done where this was queued.
+	 * dec_ap_pending will be done in got_BarrierAck
+	 * or (on connection loss) in w_clear_epoch.  */
+	ok = _drbd_send_cmd(mdev, mdev->data.socket, P_BARRIER,
+				(struct p_header *)p, sizeof(*p), 0);
+	drbd_put_data_sock(mdev);
+
+	return ok;
+}
+
+int w_send_write_hint(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	if (cancel)
+		return 1;
+	return drbd_send_short_cmd(mdev, P_UNPLUG_REMOTE);
+}
+
+/**
+ * w_send_dblock: Send a mirrored write request.
+ */
+int w_send_dblock(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	struct drbd_request *req = (struct drbd_request *)w;
+	int ok;
+
+	if (unlikely(cancel)) {
+		req_mod(req, send_canceled, 0);
+		return 1;
+	}
+
+	ok = drbd_send_dblock(mdev, req);
+	req_mod(req, ok ? handed_over_to_network : send_failed, 0);
+
+	return ok;
+}
+
+/**
+ * w_send_read_req: Send a read request.
+ */
+int w_send_read_req(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
+{
+	struct drbd_request *req = (struct drbd_request *)w;
+	int ok;
+
+	if (unlikely(cancel)) {
+		req_mod(req, send_canceled, 0);
+		return 1;
+	}
+
+	ok = drbd_send_drequest(mdev, P_DATA_REQUEST, req->sector, req->size,
+				(unsigned long)req);
+
+	if (!ok) {
+		/* ?? we set C_TIMEOUT or C_BROKEN_PIPE in drbd_send();
+		 * so this is probably redundant */
+		if (mdev->state.conn >= C_CONNECTED)
+			drbd_force_state(mdev, NS(conn, C_NETWORK_FAILURE));
+	}
+	req_mod(req, ok ? handed_over_to_network : send_failed, 0);
+
+	return ok;
+}
+
+STATIC int _drbd_may_sync_now(struct drbd_conf *mdev)
+{
+	struct drbd_conf *odev = mdev;
+
+	while (1) {
+		if (odev->sync_conf.after == -1)
+			return 1;
+		odev = minor_to_mdev(odev->sync_conf.after);
+		ERR_IF(!odev) return 1;
+		if ((odev->state.conn >= C_SYNC_SOURCE &&
+		     odev->state.conn <= C_PAUSED_SYNC_T) ||
+		    odev->state.aftr_isp || odev->state.peer_isp ||
+		    odev->state.user_isp)
+			return 0;
+	}
+}
+
+/**
+ * _drbd_pause_after:
+ * Finds all devices that may not resync now, and causes them to
+ * pause their resynchronisation.
+ * Called from process context only (admin command and after_state_ch).
+ */
+STATIC int _drbd_pause_after(struct drbd_conf *mdev)
+{
+	struct drbd_conf *odev;
+	int i, rv = 0;
+
+	for (i = 0; i < minor_count; i++) {
+		odev = minor_to_mdev(i);
+		if (!odev)
+			continue;
+		if (odev->state.conn == C_STANDALONE && odev->state.disk == D_DISKLESS)
+			continue;
+		if (!_drbd_may_sync_now(odev))
+			rv |= (__drbd_set_state(_NS(odev, aftr_isp, 1), CS_HARD, NULL)
+			       != SS_NOTHING_TO_DO);
+	}
+
+	return rv;
+}
+
+/**
+ * _drbd_resume_next:
+ * Finds all devices that can resume the resynchronisation
+ * process, and causes them to resume.
+ * Called from process context only (admin command and worker).
+ */
+STATIC int _drbd_resume_next(struct drbd_conf *mdev)
+{
+	struct drbd_conf *odev;
+	int i, rv = 0;
+
+	for (i = 0; i < minor_count; i++) {
+		odev = minor_to_mdev(i);
+		if (!odev)
+			continue;
+		if (odev->state.conn == C_STANDALONE && odev->state.disk == D_DISKLESS)
+			continue;
+		if (odev->state.aftr_isp) {
+			if (_drbd_may_sync_now(odev))
+				rv |= (__drbd_set_state(_NS(odev, aftr_isp, 0),
+							CS_HARD, NULL)
+				       != SS_NOTHING_TO_DO) ;
+		}
+	}
+	return rv;
+}
+
+void resume_next_sg(struct drbd_conf *mdev)
+{
+	write_lock_irq(&global_state_lock);
+	_drbd_resume_next(mdev);
+	write_unlock_irq(&global_state_lock);
+}
+
+void suspend_other_sg(struct drbd_conf *mdev)
+{
+	write_lock_irq(&global_state_lock);
+	_drbd_pause_after(mdev);
+	write_unlock_irq(&global_state_lock);
+}
+
+void drbd_alter_sa(struct drbd_conf *mdev, int na)
+{
+	int changes;
+
+	write_lock_irq(&global_state_lock);
+	mdev->sync_conf.after = na;
+
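+	/* pausing/resuming one device may cascade along the resync-after
+	 * dependency chain, so iterate until nothing changes any more */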
+	do {
+		changes  = _drbd_pause_after(mdev);
+		changes |= _drbd_resume_next(mdev);
+	} while (changes);
+
+	write_unlock_irq(&global_state_lock);
+}
+
+/**
+ * drbd_start_resync:
+ * @side: Either C_SYNC_SOURCE or C_SYNC_TARGET
+ * Start the resync process. Called from process context only,
+ * either admin command or drbd_receiver.
+ * Note, this function might bring you directly into one of the
+ * PausedSync* states.
+ */
+void drbd_start_resync(struct drbd_conf *mdev, enum drbd_conns side)
+{
+	union drbd_state ns;
+	int r;
+
+	trace_drbd_resync(mdev, TRACE_LVL_SUMMARY, "Resync starting: side=%s\n",
+			  side == C_SYNC_TARGET ? "SyncTarget" : "SyncSource");
+
+	drbd_bm_recount_bits(mdev);
+
+	/* In case a previous resync run was aborted by an IO error... */
+	drbd_rs_cancel_all(mdev);
+
+	if (side == C_SYNC_TARGET) {
+		/* Since application IO was locked out during C_WF_BITMAP_T and
+		   C_WF_SYNC_UUID we are still unmodified. Before going to
+		   C_SYNC_TARGET (which will make our data inconsistent) give
+		   the before-resync-target handler a chance to veto that. */
+		r = drbd_khelper(mdev, "before-resync-target");
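+		/* the handler's exit code lives in bits 8..15 of the
+		 * wait()-style return value */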
+		r = (r >> 8) & 0xff;
+		if (r > 0) {
+			dev_info(DEV, "before-resync-target handler returned %d, "
+			     "dropping connection.\n", r);
+			drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+			return;
+		}
+	}
+
+	drbd_state_lock(mdev);
+
+	if (!inc_local_if_state(mdev, D_NEGOTIATING)) {
+		drbd_state_unlock(mdev);
+		return;
+	}
+
+	if (side == C_SYNC_TARGET) {
+		mdev->bm_resync_fo = 0;
+	} else /* side == C_SYNC_SOURCE */ {
+		u64 uuid;
+
+		get_random_bytes(&uuid, sizeof(u64));
+		drbd_uuid_set(mdev, UI_BITMAP, uuid);
+		drbd_send_sync_uuid(mdev, uuid);
+
+		D_ASSERT(mdev->state.disk == D_UP_TO_DATE);
+	}
+
+	write_lock_irq(&global_state_lock);
+	ns = mdev->state;
+
+	ns.aftr_isp = !_drbd_may_sync_now(mdev);
+
+	ns.conn = side;
+
+	if (side == C_SYNC_TARGET)
+		ns.disk = D_INCONSISTENT;
+	else /* side == C_SYNC_SOURCE */
+		ns.pdsk = D_INCONSISTENT;
+
+	r = __drbd_set_state(mdev, ns, CS_VERBOSE, NULL);
+	ns = mdev->state;
+
+	if (ns.conn < C_CONNECTED)
+		r = SS_UNKNOWN_ERROR;
+
+	if (r == SS_SUCCESS) {
+		mdev->rs_total     =
+		mdev->rs_mark_left = drbd_bm_total_weight(mdev);
+		mdev->rs_failed    = 0;
+		mdev->rs_paused    = 0;
+		mdev->rs_start     =
+		mdev->rs_mark_time = jiffies;
+		mdev->rs_same_csum = 0;
+		_drbd_pause_after(mdev);
+	}
+	write_unlock_irq(&global_state_lock);
+	drbd_state_unlock(mdev);
+	dec_local(mdev);
+
+	if (r == SS_SUCCESS) {
+		dev_info(DEV, "Began resync as %s (will sync %lu KB [%lu bits set]).\n",
+		     conns_to_name(ns.conn),
+		     (unsigned long) mdev->rs_total << (BM_BLOCK_SIZE_B-10),
+		     (unsigned long) mdev->rs_total);
+
+		if (mdev->rs_total == 0) {
+			drbd_resync_finished(mdev);
+			return;
+		}
+
+		if (ns.conn == C_SYNC_TARGET) {
+			D_ASSERT(!test_bit(STOP_SYNC_TIMER, &mdev->flags));
+			mod_timer(&mdev->resync_timer, jiffies);
+		}
+
+		drbd_md_sync(mdev);
+	}
+}
+
+int drbd_worker(struct drbd_thread *thi)
+{
+	struct drbd_conf *mdev = thi->mdev;
+	struct drbd_work *w = NULL;
+	LIST_HEAD(work_list);
+	int intr = 0, i;
+
+	sprintf(current->comm, "drbd%d_worker", mdev_to_minor(mdev));
+
+	while (get_t_state(thi) == Running) {
+		drbd_thread_current_set_cpu(mdev);
+
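+		/* if no work is pending (down_trylock() fails), uncork the
+		 * data socket so corked packets get flushed, block on the
+		 * work semaphore, and cork again once woken so subsequent
+		 * packets can be coalesced (unless no_cork is configured) */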
+		if (down_trylock(&mdev->data.work.s)) {
+			mutex_lock(&mdev->data.mutex);
+			if (mdev->data.socket && !mdev->net_conf->no_cork)
+				drbd_tcp_uncork(mdev->data.socket);
+			mutex_unlock(&mdev->data.mutex);
+
+			intr = down_interruptible(&mdev->data.work.s);
+
+			mutex_lock(&mdev->data.mutex);
+			if (mdev->data.socket  && !mdev->net_conf->no_cork)
+				drbd_tcp_cork(mdev->data.socket);
+			mutex_unlock(&mdev->data.mutex);
+		}
+
+		if (intr) {
+			D_ASSERT(intr == -EINTR);
+			flush_signals(current);
+			ERR_IF (get_t_state(thi) == Running)
+				continue;
+			break;
+		}
+
+		if (get_t_state(thi) != Running)
+			break;
+		/* With this break, we have done a down() but not consumed
+		   the entry from the list. The cleanup code takes care of
+		   this...   */
+
+		w = NULL;
+		spin_lock_irq(&mdev->data.work.q_lock);
+		ERR_IF(list_empty(&mdev->data.work.q)) {
+			/* something terribly wrong in our logic.
+			 * we were able to down() the semaphore,
+			 * but the list is empty... doh.
+			 *
+			 * what is the best thing to do now?
+			 * try again from scratch, restarting the receiver,
+			 * asender, whatnot? could break even more ugly,
+			 * e.g. when we are primary, but no good local data.
+			 *
+			 * I'll try to get away just starting over this loop.
+			 */
+			spin_unlock_irq(&mdev->data.work.q_lock);
+			continue;
+		}
+		w = list_entry(mdev->data.work.q.next, struct drbd_work, list);
+		list_del_init(&w->list);
+		spin_unlock_irq(&mdev->data.work.q_lock);
+
+		if (!w->cb(mdev, w, mdev->state.conn < C_CONNECTED)) {
+			/* dev_warn(DEV, "worker: a callback failed! \n"); */
+			if (mdev->state.conn >= C_CONNECTED)
+				drbd_force_state(mdev,
+						NS(conn, C_NETWORK_FAILURE));
+		}
+	}
+	D_ASSERT(test_bit(DEVICE_DYING, &mdev->flags));
+	D_ASSERT(test_bit(CONFIG_PENDING, &mdev->flags));
+
+	spin_lock_irq(&mdev->data.work.q_lock);
+	i = 0;
+	while (!list_empty(&mdev->data.work.q)) {
+		list_splice_init(&mdev->data.work.q, &work_list);
+		spin_unlock_irq(&mdev->data.work.q_lock);
+
+		while (!list_empty(&work_list)) {
+			w = list_entry(work_list.next, struct drbd_work, list);
+			list_del_init(&w->list);
+			w->cb(mdev, w, 1);
+			i++; /* dead debugging code */
+		}
+
+		spin_lock_irq(&mdev->data.work.q_lock);
+	}
+	sema_init(&mdev->data.work.s, 0);
+	/* DANGEROUS race: if someone did queue his work within the spinlock,
+	 * but up() ed outside the spinlock, we could get an up() on the
+	 * semaphore without corresponding list entry.
+	 * So don't do that.
+	 */
+	spin_unlock_irq(&mdev->data.work.q_lock);
+
+	D_ASSERT(mdev->state.disk == D_DISKLESS && mdev->state.conn == C_STANDALONE);
+	/* _drbd_set_state only uses stop_nowait.
+	 * wait here for the Exiting receiver. */
+	drbd_thread_stop(&mdev->receiver);
+	drbd_mdev_cleanup(mdev);
+
+	dev_info(DEV, "worker terminated\n");
+
+	clear_bit(DEVICE_DYING, &mdev->flags);
+	clear_bit(CONFIG_PENDING, &mdev->flags);
+	wake_up(&mdev->state_wait);
+
+	return 0;
+}


* [PATCH 12/16] DRBD: variable_length_integer_encoding
  2009-04-30 11:26                     ` [PATCH 11/16] DRBD: worker Philipp Reisner
@ 2009-04-30 11:26                       ` Philipp Reisner
  2009-04-30 11:26                         ` [PATCH 13/16] DRBD: misc Philipp Reisner
  2009-05-02 15:45                         ` [PATCH 12/16] DRBD: variable_length_integer_encoding James Bottomley
  0 siblings, 2 replies; 90+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg, Philipp Reisner


Encoding of our simple RLE compression scheme. It is very effective since
large parts of our bitmap are sparse.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

---
diff --git a/drivers/block/drbd/drbd_vli.h b/drivers/block/drbd/drbd_vli.h
new file mode 100644
index 0000000..fc82400
--- /dev/null
+++ b/drivers/block/drbd/drbd_vli.h
@@ -0,0 +1,351 @@
+/*
+-*- linux-c -*-
+   drbd_vli.h
+   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+   Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.
+   Copyright (C) 1999-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+   Copyright (C) 2002-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+
+   drbd is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   drbd is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with drbd; see the file COPYING.  If not, write to
+   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#ifndef _DRBD_VLI_H
+#define _DRBD_VLI_H
+
+/*
+ * At a granularity of 4KiB storage represented per bit,
+ * and storage sizes of several TiB,
+ * and possibly small-bandwidth replication,
+ * the bitmap transfer time can take much too long,
+ * if transmitted in plain text.
+ *
+ * We try to reduce the transferred bitmap information
+ * by encoding runlengths of bit polarity.
+ *
+ * We never actually need to encode a "zero" (runlengths are positive).
+ * But then we have to store the value of the first bit.
+ * The first bit of information thus shall encode if the first runlength
+ * gives the number of set or unset bits.
+ *
+ * We assume that large areas are either completely set or unset,
+ * which gives good compression with any runlength method,
+ * even when encoding the runlength as fixed size 32bit/64bit integers.
+ *
+ * Still, there may be areas where the polarity flips every few bits,
+ * and encoding the runlength sequence of those areas with fixed size
+ * integers would be much worse than plaintext.
+ *
+ * We want to encode small runlength values with minimum code length,
+ * while still being able to encode a Huge run of all zeros.
+ *
+ * Thus we need a Variable Length Integer encoding, VLI.
+ *
+ * For some cases, we produce more code bits than plaintext input.
+ * We need to send incompressible chunks as plaintext, skip over them
+ * and then see if the next chunk compresses better.
+ *
+ * We don't care too much about "excellent" compression ratio for large
+ * runlengths (all set/all clear): whether we achieve a factor of 100
+ * or 1000 is not that much of an issue.
+ * We do not want to waste too much on short runlengths in the "noisy"
+ * parts of the bitmap, though.
+ *
+ * There are endless variants of VLI, we experimented with:
+ *  * simple byte-based
+ *  * various bit based with different code word length.
+ *
+ * To avoid yet another configuration parameter (choice of bitmap compression
+ * algorithm) which was difficult to explain and tune, we just chose the one
+ * variant that turned out best in all test cases.
+ * Based on real world usage patterns, with device sizes ranging from a few GiB
+ * to several TiB, file server/mailserver/webserver/mysql/postgres,
+ * mostly idle to really busy, the all time winner (though sometimes only
+ * marginally better) is:
+ */
+
+/*
+ * encoding is "visualised" as
+ * __little endian__ bitstream, least significant bit first (left most)
+ *
+ * this particular encoding is chosen so that the prefix code
+ * starts as unary encoding the level, then modified so that
+ * 10 levels can be described in 8bit, with minimal overhead
+ * for the smaller levels.
+ *
+ * The number of data bits follows the Fibonacci sequence, with the exception
+ * of the last level (+1 data bit, so it makes 64 bits total).  The only case
+ * where the code is worse than plaintext is a runlength of 1: 1 plain bit
+ * => 2 code bits.
+prefix    data bits                                    max val  Nº data bits
+0 x                                                         0x2            1
+10 x                                                        0x4            1
+110 xx                                                      0x8            2
+1110 xxx                                                   0x10            3
+11110 xxx xx                                               0x30            5
+111110 xx xxxxxx                                          0x130            8
+11111100  xxxxxxxx xxxxx                                 0x2130           13
+11111110  xxxxxxxx xxxxxxxx xxxxx                      0x202130           21
+11111101  xxxxxxxx xxxxxxxx xxxxxxxx  xxxxxxxx xx   0x400202130           34
+11111111  xxxxxxxx xxxxxxxx xxxxxxxx  xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx 56
+ * maximum encodable value: 0x100000400202130 == 2**56 + some */
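+
+/* worked example: the runlength 8 falls into LEVEL(5, 3, 0x03) below,
+ * covering the values 5..8; it is encoded as ((8 - 5) << 3) | 0x03
+ * == 0b11011, i.e. 5 code bits.  Decoding matches the 3 bit prefix 0x03,
+ * shifts off the prefix and adds back the level's offset (adj == 5),
+ * which yields 8 again. */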
+
+/* compression "table":
+ transmitted   x                                0.29
+ as plaintext x                                  ........................
+             x                                   ........................
+            x                                    ........................
+           x    0.59                         0.21........................
+          x      ........................................................
+         x       .. c ...................................................
+        x    0.44.. o ...................................................
+       x .......... d ...................................................
+      x  .......... e ...................................................
+     X.............   ...................................................
+    x.............. b ...................................................
+2.0x............... i ...................................................
+ #X................ t ...................................................
+ #................. s ...........................  plain bits  ..........
+-+-----------------------------------------------------------------------
+ 1             16              32                              64
+*/
+
+/* LEVEL: (total bits, prefix bits, prefix value),
+ * sorted ascending by number of total bits.
+ * The rest of the code table is calculated at compiletime from this. */
+
+/* fibonacci data 1, 1, ... */
+#define VLI_L_1_1() do { \
+	LEVEL( 2, 1, 0x00); \
+	LEVEL( 3, 2, 0x01); \
+	LEVEL( 5, 3, 0x03); \
+	LEVEL( 7, 4, 0x07); \
+	LEVEL(10, 5, 0x0f); \
+	LEVEL(14, 6, 0x1f); \
+	LEVEL(21, 8, 0x3f); \
+	LEVEL(29, 8, 0x7f); \
+	LEVEL(42, 8, 0xbf); \
+	LEVEL(64, 8, 0xff); \
+	} while (0)
+
+/* finds a suitable level to decode the least significant part of in.
+ * returns number of bits consumed.
+ *
+ * BUG() for bad input, as that would mean a buggy code table. */
+static inline int vli_decode_bits(u64 *out, const u64 in)
+{
+	u64 adj = 1;
+
+#define LEVEL(t,b,v)					\
+	do {						\
+		if ((in & ((1 << b) -1)) == v) {	\
+			*out = ((in & ((~0ULL) >> (64-t))) >> b) + adj;	\
+			return t;			\
+		}					\
+		adj += 1ULL << (t - b);			\
+	} while (0)
+
+	VLI_L_1_1();
+
+	/* NOT REACHED, if VLI_LEVELS code table is defined properly */
+	BUG();
+#undef LEVEL
+}
+
+/* return number of code bits needed,
+ * or negative error number */
+static inline int __vli_encode_bits(u64 *out, const u64 in)
+{
+	u64 max = 0;
+	u64 adj = 1;
+
+	if (in == 0)
+		return -EINVAL;
+
+#define LEVEL(t,b,v) do {		\
+		max += 1ULL << (t - b);	\
+		if (in <= max) {	\
+			if (out)	\
+				*out = ((in - adj) << b) | v;	\
+			return t;	\
+		}			\
+		adj = max + 1;		\
+	} while (0)
+
+	VLI_L_1_1();
+
+	return -EOVERFLOW;
+#undef LEVEL
+}
+
+#undef VLI_L_1_1
+
+/* code from here down is independent of the actually used bit code */
+
+/*
+ * Code length is determined by some unique (e.g. unary) prefix.
+ * This encodes arbitrary bit length, not whole bytes: we have a bit-stream,
+ * not a byte stream.
+ */
+
+/* for the bitstream, we need a cursor */
+struct bitstream_cursor {
+	/* the current byte */
+	u8 *b;
+	/* the current bit within *b, normalized: 0..7 */
+	unsigned int bit;
+};
+
+/* initialize cursor to point to first bit of stream */
+static inline void bitstream_cursor_reset(struct bitstream_cursor *cur, void *s)
+{
+	cur->b = s;
+	cur->bit = 0;
+}
+
+/* advance cursor by that many bits; maximum expected input value: 64,
+ * but depending on VLI implementation, it may be more. */
+static inline void bitstream_cursor_advance(struct bitstream_cursor *cur, unsigned int bits)
+{
+	bits += cur->bit;
+	cur->b = cur->b + (bits >> 3);
+	cur->bit = bits & 7;
+}
+
+/* the bitstream itself knows its length */
+struct bitstream {
+	struct bitstream_cursor cur;
+	unsigned char *buf;
+	size_t buf_len;		/* in bytes */
+
+	/* for input stream:
+	 * number of trailing 0 bits for padding
+	 * total number of valid bits in stream: buf_len * 8 - pad_bits */
+	unsigned int pad_bits;
+};
+
+static inline void bitstream_init(struct bitstream *bs, void *s, size_t len, unsigned int pad_bits)
+{
+	bs->buf = s;
+	bs->buf_len = len;
+	bs->pad_bits = pad_bits;
+	bitstream_cursor_reset(&bs->cur, bs->buf);
+}
+
+static inline void bitstream_rewind(struct bitstream *bs)
+{
+	bitstream_cursor_reset(&bs->cur, bs->buf);
+	memset(bs->buf, 0, bs->buf_len);
+}
+
+/* Put (at most 64) least significant bits of val into bitstream, and advance cursor.
+ * Ignores "pad_bits".
+ * Returns zero if bits == 0 (nothing to do).
+ * Returns number of bits used if successful.
+ *
+ * If there is not enough room left in bitstream,
+ * leaves bitstream unchanged and returns -ENOBUFS.
+ */
+static inline int bitstream_put_bits(struct bitstream *bs, u64 val, const unsigned int bits)
+{
+	unsigned char *b = bs->cur.b;
+	unsigned int tmp;
+
+	if (bits == 0)
+		return 0;
+
+	if ((bs->cur.b + ((bs->cur.bit + bits -1) >> 3)) - bs->buf >= bs->buf_len)
+		return -ENOBUFS;
+
+	/* paranoia: strip off hi bits; they should not be set anyways. */
+	if (bits < 64)
+		val &= ~0ULL >> (64 - bits);
+
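+	/* fill the remaining bits of the current byte first, then spill the
+	 * rest of val into the following bytes, 8 bits at a time */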
+	*b++ |= (val & 0xff) << bs->cur.bit;
+
+	for (tmp = 8 - bs->cur.bit; tmp < bits; tmp += 8)
+		*b++ |= (val >> tmp) & 0xff;
+
+	bitstream_cursor_advance(&bs->cur, bits);
+	return bits;
+}
+
+/* Fetch (at most 64) bits from bitstream into *out, and advance cursor.
+ *
+ * If more than 64 bits are requested, returns -EINVAL and leaves *out unchanged.
+ *
+ * If there are fewer than the requested number of valid bits left in the
+ * bitstream, still fetches all available bits.
+ *
+ * Returns number of actually fetched bits.
+ */
+static inline int bitstream_get_bits(struct bitstream *bs, u64 *out, int bits)
+{
+	u64 val;
+	unsigned int n;
+
+	if (bits > 64)
+		return -EINVAL;
+
+	if (bs->cur.b + ((bs->cur.bit + bs->pad_bits + bits -1) >> 3) - bs->buf >= bs->buf_len)
+		bits = ((bs->buf_len - (bs->cur.b - bs->buf)) << 3)
+			- bs->cur.bit - bs->pad_bits;
+
+	if (bits == 0) {
+		*out = 0;
+		return 0;
+	}
+
+	/* get the high bits */
+	val = 0;
+	n = (bs->cur.bit + bits + 7) >> 3;
+	/* n may be at most 9, if cur.bit + bits > 64 */
+	/* which means this copies at most 8 bytes */
+	if (n) {
+		memcpy(&val, bs->cur.b+1, n - 1);
+		val = le64_to_cpu(val) << (8 - bs->cur.bit);
+	}
+
+	/* we still need the low bits */
+	val |= bs->cur.b[0] >> bs->cur.bit;
+
+	/* and mask out bits we don't want */
+	val &= ~0ULL >> (64 - bits);
+
+	bitstream_cursor_advance(&bs->cur, bits);
+	*out = val;
+
+	return bits;
+}
+
+/* encodes @in as vli into @bs;
+
+ * return values
+ *  > 0: number of bits successfully stored in bitstream
+ * -ENOBUFS @bs is full
+ * -EINVAL input zero (invalid)
+ * -EOVERFLOW input too large for this vli code (invalid)
+ */
+static inline int vli_encode_bits(struct bitstream *bs, u64 in)
+{
+	u64 code = code;
+	int bits = __vli_encode_bits(&code, in);
+
+	if (bits <= 0)
+		return bits;
+
+	return bitstream_put_bits(bs, code, bits);
+}
+
+#endif
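
For illustration only, not part of the patch: a minimal user-space sketch of
the encode-side level walk, using the same (total bits, prefix bits, prefix
value) triples as VLI_L_1_1() above.  The file and identifier names in it
(vli_demo.c, struct level, lv[], encode()) are made up for this example.

/* vli_demo.c - standalone sketch of the __vli_encode_bits() level walk */
#include <stdio.h>
#include <stdint.h>

struct level { int total, prefix; uint64_t value; };

static const struct level lv[] = {
	{ 2, 1, 0x00}, { 3, 2, 0x01}, { 5, 3, 0x03}, { 7, 4, 0x07},
	{10, 5, 0x0f}, {14, 6, 0x1f}, {21, 8, 0x3f}, {29, 8, 0x7f},
	{42, 8, 0xbf}, {64, 8, 0xff},
};

/* returns the number of code bits, or -1 for zero or overflowing input */
static int encode(uint64_t *out, uint64_t in)
{
	uint64_t max = 0, adj = 1;
	int i;

	if (in == 0)
		return -1;
	for (i = 0; i < 10; i++) {
		max += 1ULL << (lv[i].total - lv[i].prefix);
		if (in <= max) {
			*out = ((in - adj) << lv[i].prefix) | lv[i].value;
			return lv[i].total;
		}
		adj = max + 1;
	}
	return -1;
}

int main(void)
{
	uint64_t code = 0;
	int bits = encode(&code, 8);

	/* prints: runlength 8 -> 5 bits, code 0x1b */
	printf("runlength 8 -> %d bits, code 0x%llx\n",
	       bits, (unsigned long long)code);
	return 0;
}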


* [PATCH 13/16] DRBD: misc
  2009-04-30 11:26                       ` [PATCH 12/16] DRBD: variable_length_integer_encoding Philipp Reisner
@ 2009-04-30 11:26                         ` Philipp Reisner
  2009-04-30 11:26                           ` [PATCH 14/16] DRBD: tracepoint_probes Philipp Reisner
  2009-05-02 15:45                         ` [PATCH 12/16] DRBD: variable_length_integer_encoding James Bottomley
  1 sibling, 1 reply; 90+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg, Philipp Reisner

The buildtag.c build tag will go away once we are no longer an external module.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

---
diff --git a/drivers/block/drbd/drbd_buildtag.c b/drivers/block/drbd/drbd_buildtag.c
new file mode 100644
index 0000000..a58ad76
--- /dev/null
+++ b/drivers/block/drbd/drbd_buildtag.c
@@ -0,0 +1,7 @@
+/* automatically generated. DO NOT EDIT. */
+#include <linux/drbd_config.h>
+const char *drbd_buildtag(void)
+{
+	return "GIT-hash: 29ef4c01e46b0a269d7bec39d5178be06097fead drbd/Kconfig drbd/Makefile drbd/Makefile-2.6 drbd/drbd_actlog.c drbd/drbd_bitmap.c drbd/drbd_int.h drbd/drbd_main.c drbd/drbd_nl.c drbd/drbd_proc.c drbd/drbd_receiver.c drbd/drbd_req.c drbd/drbd_req.h drbd/drbd_tracing.c drbd/drbd_tracing.h drbd/drbd_worker.c drbd/drbd_wrappers.h drbd/linux/drbd_config.h"
+		" build by phil@fat-tyre, 2009-04-29 15:43:41";
+}
diff --git a/drivers/block/drbd/drbd_strings.c b/drivers/block/drbd/drbd_strings.c
new file mode 100644
index 0000000..b230693
--- /dev/null
+++ b/drivers/block/drbd/drbd_strings.c
@@ -0,0 +1,113 @@
+/*
+  drbd_strings.c
+
+  This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+  Copyright (C) 2003-2008, LINBIT Information Technologies GmbH.
+  Copyright (C) 2003-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+  Copyright (C) 2003-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+
+  drbd is free software; you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation; either version 2, or (at your option)
+  any later version.
+
+  drbd is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with drbd; see the file COPYING.  If not, write to
+  the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+
+*/
+
+#include <linux/drbd.h>
+
+static const char *drbd_conn_s_names[] = {
+	[C_STANDALONE]       = "StandAlone",
+	[C_DISCONNECTING]    = "Disconnecting",
+	[C_UNCONNECTED]      = "Unconnected",
+	[C_TIMEOUT]          = "Timeout",
+	[C_BROKEN_PIPE]      = "BrokenPipe",
+	[C_NETWORK_FAILURE]  = "NetworkFailure",
+	[C_PROTOCOL_ERROR]   = "ProtocolError",
+	[C_WF_CONNECTION]    = "WFConnection",
+	[C_WF_REPORT_PARAMS] = "WFReportParams",
+	[C_TEAR_DOWN]        = "TearDown",
+	[C_CONNECTED]        = "Connected",
+	[C_STARTING_SYNC_S]  = "StartingSyncS",
+	[C_STARTING_SYNC_T]  = "StartingSyncT",
+	[C_WF_BITMAP_S]      = "WFBitMapS",
+	[C_WF_BITMAP_T]      = "WFBitMapT",
+	[C_WF_SYNC_UUID]     = "WFSyncUUID",
+	[C_SYNC_SOURCE]      = "SyncSource",
+	[C_SYNC_TARGET]      = "SyncTarget",
+	[C_PAUSED_SYNC_S]    = "PausedSyncS",
+	[C_PAUSED_SYNC_T]    = "PausedSyncT",
+	[C_VERIFY_S]         = "VerifyS",
+	[C_VERIFY_T]         = "VerifyT",
+};
+
+static const char *drbd_role_s_names[] = {
+	[R_PRIMARY]   = "Primary",
+	[R_SECONDARY] = "Secondary",
+	[R_UNKNOWN]   = "Unknown"
+};
+
+static const char *drbd_disk_s_names[] = {
+	[D_DISKLESS]     = "Diskless",
+	[D_ATTACHING]    = "Attaching",
+	[D_FAILED]       = "Failed",
+	[D_NEGOTIATING]  = "Negotiating",
+	[D_INCONSISTENT] = "Inconsistent",
+	[D_OUTDATED]     = "Outdated",
+	[D_UNKNOWN]      = "DUnknown",
+	[D_CONSISTENT]   = "Consistent",
+	[D_UP_TO_DATE]   = "UpToDate",
+};
+
+static const char *drbd_state_sw_errors[] = {
+	[-SS_TWO_PRIMARIES] = "Multiple primaries not allowed by config",
+	[-SS_NO_UP_TO_DATE_DISK] = "Refusing to be Primary without at least one UpToDate disk",
+	[-SS_BOTH_INCONSISTENT] = "Refusing to be inconsistent on both nodes",
+	[-SS_SYNCING_DISKLESS] = "Refusing to be syncing and diskless",
+	[-SS_CONNECTED_OUTDATES] = "Refusing to be Outdated while Connected",
+	[-SS_PRIMARY_NOP] = "Refusing to be Primary while peer is not outdated",
+	[-SS_RESYNC_RUNNING] = "Can not start OV/resync since it is already active",
+	[-SS_ALREADY_STANDALONE] = "Can not disconnect a StandAlone device",
+	[-SS_CW_FAILED_BY_PEER] = "State change was refused by peer node",
+	[-SS_IS_DISKLESS] = "Device is diskless, the requested operation requires a disk",
+	[-SS_DEVICE_IN_USE] = "Device is held open by someone",
+	[-SS_NO_NET_CONFIG] = "Have no net/connection configuration",
+	[-SS_NO_VERIFY_ALG] = "Need a verify algorithm to start online verify",
+	[-SS_NEED_CONNECTION] = "Need a connection to start verify or resync",
+	[-SS_NOT_SUPPORTED] = "Peer does not support protocol",
+	[-SS_LOWER_THAN_OUTDATED] = "Disk state is lower than outdated",
+	[-SS_IN_TRANSIENT_STATE] = "In transient state, retry after next state change",
+	[-SS_CONCURRENT_ST_CHG] = "Concurrent state changes detected and aborted",
+};
+
+const char *conns_to_name(enum drbd_conns s)
+{
+	/* enums are unsigned... */
+	return s > C_PAUSED_SYNC_T ? "TOO_LARGE" : drbd_conn_s_names[s];
+}
+
+const char *roles_to_name(enum drbd_role s)
+{
+	return s > R_SECONDARY   ? "TOO_LARGE" : drbd_role_s_names[s];
+}
+
+const char *disks_to_name(enum drbd_disk_state s)
+{
+	return s > D_UP_TO_DATE    ? "TOO_LARGE" : drbd_disk_s_names[s];
+}
+
+const char *set_st_err_name(enum drbd_state_ret_codes err)
+{
+	return err <= SS_AFTER_LAST_ERROR ? "TOO_SMALL" :
+	       err > SS_TWO_PRIMARIES ? "TOO_LARGE"
+			: drbd_state_sw_errors[-err];
+}


* [PATCH 14/16] DRBD: tracepoint_probes
  2009-04-30 11:26                         ` [PATCH 13/16] DRBD: misc Philipp Reisner
@ 2009-04-30 11:26                           ` Philipp Reisner
  2009-04-30 11:26                             ` [PATCH 15/16] DRBD: documentation Philipp Reisner
  0 siblings, 1 reply; 90+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg, Philipp Reisner

There are a number of static tracepoints, mainly for debugging purposes.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

---
diff --git a/drivers/block/drbd/drbd_tracing.h b/drivers/block/drbd/drbd_tracing.h
new file mode 100644
index 0000000..c4531a1
--- /dev/null
+++ b/drivers/block/drbd/drbd_tracing.h
@@ -0,0 +1,87 @@
+/*
+   drbd_tracing.h
+
+   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+   Copyright (C) 2003-2008, LINBIT Information Technologies GmbH.
+   Copyright (C) 2003-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+   Copyright (C) 2003-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+
+   drbd is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   drbd is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with drbd; see the file COPYING.  If not, write to
+   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+
+ */
+
+#ifndef DRBD_TRACING_H
+#define DRBD_TRACING_H
+
+#include <linux/tracepoint.h>
+#include "drbd_int.h"
+#include "drbd_req.h"
+
+enum {
+	TRACE_LVL_ALWAYS = 0,
+	TRACE_LVL_SUMMARY,
+	TRACE_LVL_METRICS,
+	TRACE_LVL_ALL,
+	TRACE_LVL_MAX
+};
+
+DECLARE_TRACE(drbd_unplug,
+	TP_PROTO(struct drbd_conf *mdev, char* msg),
+	TP_ARGS(mdev, msg));
+
+DECLARE_TRACE(drbd_uuid,
+	TP_PROTO(struct drbd_conf *mdev, enum drbd_uuid_index index),
+	TP_ARGS(mdev, index));
+
+DECLARE_TRACE(drbd_ee,
+	TP_PROTO(struct drbd_conf *mdev, struct drbd_epoch_entry *e, char* msg),
+	TP_ARGS(mdev, e, msg));
+
+DECLARE_TRACE(drbd_md_io,
+	TP_PROTO(struct drbd_conf *mdev, int rw, struct drbd_backing_dev *bdev),
+	TP_ARGS(mdev, rw, bdev));
+
+DECLARE_TRACE(drbd_epoch,
+	TP_PROTO(struct drbd_conf *mdev, struct drbd_epoch *epoch, enum epoch_event ev),
+	TP_ARGS(mdev, epoch, ev));
+
+DECLARE_TRACE(drbd_netlink,
+	TP_PROTO(void *data, int is_req),
+	TP_ARGS(data, is_req));
+
+DECLARE_TRACE(drbd_actlog,
+	TP_PROTO(struct drbd_conf *mdev, sector_t sector, char* msg),
+	TP_ARGS(mdev, sector, msg));
+
+DECLARE_TRACE(drbd_bio,
+	TP_PROTO(struct drbd_conf *mdev, const char *pfx, struct bio *bio, int complete,
+		 struct drbd_request *r),
+	TP_ARGS(mdev, pfx, bio, complete, r));
+
+DECLARE_TRACE(drbd_req,
+	TP_PROTO(struct drbd_request *req, enum drbd_req_event what, char *msg),
+	      TP_ARGS(req, what, msg));
+
+DECLARE_TRACE(drbd_packet,
+	TP_PROTO(struct drbd_conf *mdev, struct socket *sock,
+		 int recv, union p_polymorph *p, char *file, int line),
+	TP_ARGS(mdev, sock, recv, p, file, line));
+
+DECLARE_TRACE(_drbd_resync,
+	TP_PROTO(struct drbd_conf *mdev, int level, const char *fmt, va_list args),
+	TP_ARGS(mdev, level, fmt, args));
+
+#endif
diff --git a/drivers/block/drbd/drbd_tracing.c b/drivers/block/drbd/drbd_tracing.c
new file mode 100644
index 0000000..2eff178
--- /dev/null
+++ b/drivers/block/drbd/drbd_tracing.c
@@ -0,0 +1,762 @@
+/*
+   drbd_tracing.c
+
+   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+   Copyright (C) 2003-2008, LINBIT Information Technologies GmbH.
+   Copyright (C) 2003-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+   Copyright (C) 2003-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+
+   drbd is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   drbd is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with drbd; see the file COPYING.  If not, write to
+   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+
+ */
+
+#include <linux/module.h>
+#include <linux/drbd.h>
+#include <linux/ctype.h>
+#include <linux/marker.h>
+#include "drbd_int.h"
+#include "drbd_tracing.h"
+#include <linux/drbd_tag_magic.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Philipp Reisner, Lars Ellenberg");
+MODULE_DESCRIPTION("DRBD tracepoint probes");
+MODULE_PARM_DESC(trace_mask, "Bitmap of events to trace; see drbd_tracing.c");
+MODULE_PARM_DESC(trace_level, "Current tracing level (changeable in /sys)");
+MODULE_PARM_DESC(trace_devs, "Bitmap of devices to trace (changeable in /sys)");
+
+unsigned int trace_mask = 0;  /* Bitmap of events to trace */
+int trace_level;              /* Current trace level */
+int trace_devs;		      /* Bitmap of devices to trace */
+
+module_param(trace_mask, uint, 0444);
+module_param(trace_level, int, 0644);
+module_param(trace_devs, int, 0644);
+
+enum {
+	TRACE_PACKET  = 0x0001,
+	TRACE_RQ      = 0x0002,
+	TRACE_UUID    = 0x0004,
+	TRACE_RESYNC  = 0x0008,
+	TRACE_EE      = 0x0010,
+	TRACE_UNPLUG  = 0x0020,
+	TRACE_NL      = 0x0040,
+	TRACE_AL_EXT  = 0x0080,
+	TRACE_INT_RQ  = 0x0100,
+	TRACE_MD_IO   = 0x0200,
+	TRACE_EPOCH   = 0x0400,
+};
+
+/* Buffer printing support
+ * dbg_print_flags: used for Flags arg to drbd_print_buffer
+ * - DBGPRINT_BUFFADDR; if set, each line starts with the
+ *	 virtual address of the line being output. If clear,
+ *	 each line starts with the offset from the beginning
+ *	 of the buffer. */
+enum dbg_print_flags {
+    DBGPRINT_BUFFADDR = 0x0001,
+};
+
+/* Macro stuff */
+STATIC char *nl_packet_name(int packet_type)
+{
+/* Generate packet type strings */
+#define NL_PACKET(name, number, fields) \
+	[P_ ## name] = # name,
+#define NL_INTEGER Argh!
+#define NL_BIT Argh!
+#define NL_INT64 Argh!
+#define NL_STRING Argh!
+
+	static char *nl_tag_name[P_nl_after_last_packet] = {
+#include "linux/drbd_nl.h"
+	};
+
+	return (packet_type < sizeof(nl_tag_name)/sizeof(nl_tag_name[0])) ?
+	    nl_tag_name[packet_type] : "*Unknown*";
+}
+/* /Macro stuff */
+
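+/* a probe produces output only if the global trace_level is at least the
+ * level passed in and the device's minor number is enabled in the
+ * trace_devs bitmap */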
+static inline int is_mdev_trace(struct drbd_conf *mdev, unsigned int level)
+{
+	return trace_level >= level && ((1 << mdev_to_minor(mdev)) & trace_devs);
+}
+
+static void probe_drbd_unplug(struct drbd_conf *mdev, char *msg)
+{
+	if (!is_mdev_trace(mdev, TRACE_LVL_ALWAYS))
+		return;
+
+	dev_info(DEV, "%s, ap_bio_count=%d\n", msg, atomic_read(&mdev->ap_bio_cnt));
+}
+
+static void probe_drbd_uuid(struct drbd_conf *mdev, enum drbd_uuid_index index)
+{
+	static char *uuid_str[UI_EXTENDED_SIZE] = {
+		[UI_CURRENT] = "CURRENT",
+		[UI_BITMAP] = "BITMAP",
+		[UI_HISTORY_START] = "HISTORY_START",
+		[UI_HISTORY_END] = "HISTORY_END",
+		[UI_SIZE] = "SIZE",
+		[UI_FLAGS] = "FLAGS",
+	};
+
+	if (!is_mdev_trace(mdev, TRACE_LVL_ALWAYS))
+		return;
+
+	if (index >= UI_EXTENDED_SIZE) {
+		dev_warn(DEV, " uuid_index >= EXTENDED_SIZE\n");
+		return;
+	}
+
+	dev_info(DEV, " uuid[%s] now %016llX\n",
+		 uuid_str[index],
+		 (unsigned long long)mdev->bc->md.uuid[index]);
+}
+
+static void probe_drbd_md_io(struct drbd_conf *mdev, int rw,
+			     struct drbd_backing_dev *bdev)
+{
+	if (!is_mdev_trace(mdev, TRACE_LVL_ALWAYS))
+		return;
+
+	dev_info(DEV, " %s metadata superblock now\n",
+		 rw == READ ? "Reading" : "Writing");
+}
+
+static void probe_drbd_ee(struct drbd_conf *mdev, struct drbd_epoch_entry *e, char* msg)
+{
+	if (!is_mdev_trace(mdev, TRACE_LVL_ALWAYS))
+		return;
+
+	dev_info(DEV, "EE %s sec=%llus size=%u e=%p\n",
+		 msg, (unsigned long long)e->sector, e->size, e);
+}
+
+static void probe_drbd_epoch(struct drbd_conf *mdev, struct drbd_epoch *epoch,
+			     enum epoch_event ev)
+{
+	static char *epoch_event_str[] = {
+		[EV_PUT] = "put",
+		[EV_GOT_BARRIER_NR] = "got_barrier_nr",
+		[EV_BARRIER_DONE] = "barrier_done",
+		[EV_BECAME_LAST] = "became_last",
+		[EV_TRACE_FLUSH] = "issuing_flush",
+		[EV_TRACE_ADD_BARRIER] = "added_barrier",
+		[EV_TRACE_SETTING_BI] = "just set barrier_in_next_epoch",
+	};
+
+	if (!is_mdev_trace(mdev, TRACE_LVL_ALWAYS))
+		return;
+
+	ev &= ~EV_CLEANUP;
+
+	switch (ev) {
+	case EV_TRACE_ALLOC:
+		dev_info(DEV, "Allocat epoch %p/xxxx { } nr_epochs=%d\n", epoch, mdev->epochs);
+		break;
+	case EV_TRACE_FREE:
+		dev_info(DEV, "Freeing epoch %p/%d { size=%d } nr_epochs=%d\n",
+			 epoch, epoch->barrier_nr, atomic_read(&epoch->epoch_size),
+			 mdev->epochs);
+		break;
+	default:
+		dev_info(DEV, "Update epoch  %p/%d { size=%d active=%d %c%c n%c%c } ev=%s\n",
+			 epoch, epoch->barrier_nr, atomic_read(&epoch->epoch_size),
+			 atomic_read(&epoch->active),
+			 test_bit(DE_HAVE_BARRIER_NUMBER, &epoch->flags) ? 'n' : '-',
+			 test_bit(DE_CONTAINS_A_BARRIER, &epoch->flags) ? 'b' : '-',
+			 test_bit(DE_BARRIER_IN_NEXT_EPOCH_ISSUED, &epoch->flags) ? 'i' : '-',
+			 test_bit(DE_BARRIER_IN_NEXT_EPOCH_DONE, &epoch->flags) ? 'd' : '-',
+			 epoch_event_str[ev]);
+	}
+}
+
+static void probe_drbd_netlink(void *data, int is_req)
+{
+	struct cn_msg *msg = data;
+
+	if (is_req) {
+		struct drbd_nl_cfg_req *nlp = (struct drbd_nl_cfg_req *)msg->data;
+
+		printk(KERN_INFO "drbd%d: "
+			 "Netlink: << %s (%d) - seq: %x, ack: %x, len: %x\n",
+			 nlp->drbd_minor,
+			 nl_packet_name(nlp->packet_type),
+			 nlp->packet_type,
+			 msg->seq, msg->ack, msg->len);
+	} else {
+		struct drbd_nl_cfg_reply *nlp = (struct drbd_nl_cfg_reply *)msg->data;
+
+		printk(KERN_INFO "drbd%d: "
+		       "Netlink: >> %s (%d) - seq: %x, ack: %x, len: %x\n",
+		       nlp->minor,
+		       nlp->packet_type == P_nl_after_last_packet ?
+		       "Empty-Reply" : nl_packet_name(nlp->packet_type),
+		       nlp->packet_type,
+		       msg->seq, msg->ack, msg->len);
+	}
+}
+
+static void probe_drbd_actlog(struct drbd_conf *mdev, sector_t sector, char* msg)
+{
+	unsigned int enr = (sector >> (AL_EXTENT_SIZE_B-9));
+
+	if (!is_mdev_trace(mdev, TRACE_LVL_ALWAYS))
+		return;
+
+	dev_info(DEV, "%s (sec=%llus, al_enr=%u, rs_enr=%d)\n",
+		 msg, (unsigned long long) sector, enr,
+		 (int)BM_SECT_TO_EXT(sector));
+}
+
+/*
+ *
+ * drbd_print_buffer
+ *
+ * This routine dumps binary data to the debugging output. Can be
+ * called at interrupt level.
+ *
+ * Arguments:
+ *
+ *     prefix      - String is output at the beginning of each line output
+ *     flags       - Control operation of the routine. Currently defined
+ *                   Flags are:
+ *                   DBGPRINT_BUFFADDR; if set, each line starts with the
+ *                       virtual address of the line being output. If clear,
+ *                       each line starts with the offset from the beginning
+ *                       of the buffer.
+ *     size        - Indicates the size of each entry in the buffer. Supported
+ *                   values are sizeof(char), sizeof(short) and sizeof(int)
+ *     buffer      - Start address of buffer
+ *     buffer_va   - Virtual address of start of buffer (normally the same
+ *                   as Buffer, but having it separate allows it to hold
+ *                   file address for example)
+ *     length      - length of buffer
+ *
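+ *     Example (cf. the call in probe_drbd_bio() below; buf stands in for
+ *     both the data pointer and its virtual address):
+ *
+ *         drbd_print_buffer("    ", DBGPRINT_BUFFADDR, 1, buf, buf, 0x80);
+ *
+ *     dumps up to 128 bytes as hex plus ASCII, 16 bytes per line, each
+ *     line prefixed with its virtual address.
+ *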
+ */
+static void drbd_print_buffer(const char *prefix, unsigned int flags, int size,
+			      const void *buffer, const void *buffer_va,
+			      unsigned int length)
+
+#define LINE_SIZE       16
+#define LINE_ENTRIES    (int)(LINE_SIZE/size)
+{
+	const unsigned char *pstart;
+	const unsigned char *pstart_va;
+	const unsigned char *pend;
+	char bytes_str[LINE_SIZE*3+8], ascii_str[LINE_SIZE+8];
+	char *pbytes = bytes_str, *pascii = ascii_str;
+	int  offset = 0;
+	long sizemask;
+	int  field_width;
+	int  index;
+	const unsigned char *pend_str;
+	const unsigned char *p;
+	int count;
+
+	/* verify size parameter */
+	if (size != sizeof(char) &&
+	    size != sizeof(short) &&
+	    size != sizeof(int)) {
+		printk(KERN_DEBUG "drbd_print_buffer: "
+			"ERROR invalid size %d\n", size);
+		return;
+	}
+
+	sizemask = size-1;
+	field_width = size*2;
+
+	/* Adjust start/end to be on appropriate boundary for size */
+	buffer = (const char *)((long)buffer & ~sizemask);
+	pend   = (const unsigned char *)
+		(((long)buffer + length + sizemask) & ~sizemask);
+
+	if (flags & DBGPRINT_BUFFADDR) {
+		/* Move start back to nearest multiple of line size,
+		 * if printing address. This results in nicely formatted output
+		 * with addresses being on line size (16) byte boundaries */
+		pstart = (const unsigned char *)((long)buffer & ~(LINE_SIZE-1));
+	} else {
+		pstart = (const unsigned char *)buffer;
+	}
+
+	/* Set value of start VA to print if addresses asked for */
+	pstart_va = (const unsigned char *)buffer_va
+		 - ((const unsigned char *)buffer-pstart);
+
+	/* Calculate end position to nicely align right hand side */
+	pend_str = pstart + (((pend-pstart) + LINE_SIZE-1) & ~(LINE_SIZE-1));
+
+	/* Init strings */
+	*pbytes = *pascii = '\0';
+
+	/* Start at beginning of first line */
+	p = pstart;
+	count = 0;
+
+	while (p < pend_str) {
+		if (p < (const unsigned char *)buffer || p >= pend) {
+			/* Before start of buffer or after end- print spaces */
+			pbytes += sprintf(pbytes, "%*c ", field_width, ' ');
+			pascii += sprintf(pascii, "%*c", size, ' ');
+			p += size;
+		} else {
+			/* Add hex and ascii to strings */
+			int val;
+			switch (size) {
+			default:
+			case 1:
+				val = *(unsigned char *)p;
+				break;
+			case 2:
+				val = *(unsigned short *)p;
+				break;
+			case 4:
+				val = *(unsigned int *)p;
+				break;
+			}
+
+			pbytes += sprintf(pbytes, "%0*x ", field_width, val);
+
+			for (index = size; index; index--) {
+				*pascii++ = isprint(*p) ? *p : '.';
+				p++;
+			}
+		}
+
+		count++;
+
+		if (count == LINE_ENTRIES || p >= pend_str) {
+			/* Null terminate and print record */
+			*pascii = '\0';
+			printk(KERN_DEBUG "%s%8.8lx: %*s|%*s|\n",
+			       prefix,
+			       (flags & DBGPRINT_BUFFADDR)
+			       ? (long)pstart_va:(long)offset,
+			       LINE_ENTRIES*(field_width+1), bytes_str,
+			       LINE_SIZE, ascii_str);
+
+			/* Move onto next line */
+			pstart_va += (p-pstart);
+			pstart = p;
+			count  = 0;
+			offset += LINE_SIZE;
+
+			/* Re-init strings */
+			pbytes = bytes_str;
+			pascii = ascii_str;
+			*pbytes = *pascii = '\0';
+		}
+	}
+}
+
+static void probe_drbd_resync(struct drbd_conf *mdev, int level, const char *fmt, va_list args)
+{
+	char str[256];
+
+	if (!is_mdev_trace(mdev, level))
+		return;
+
+	if (vsnprintf(str, 256, fmt, args) >= 256)
+		str[255] = 0;
+
+	printk(KERN_INFO "%s %s: %s", dev_driver_string(disk_to_dev(mdev->vdisk)),
+	       dev_name(disk_to_dev(mdev->vdisk)), str);
+}
+
+static void probe_drbd_bio(struct drbd_conf *mdev, const char *pfx, struct bio *bio, int complete,
+			   struct drbd_request *r)
+{
+#ifdef CONFIG_LBD
+#define SECTOR_FORMAT "%Lx"
+#else
+#define SECTOR_FORMAT "%lx"
+#endif
+#define SECTOR_SHIFT 9
+
+	unsigned long lowaddr = (unsigned long)(bio->bi_sector << SECTOR_SHIFT);
+	char *faddr = (char *)(lowaddr);
+	char rb[sizeof(void *)*2+6] = { 0, };
+	struct bio_vec *bvec;
+	int segno;
+
+	const int rw = bio->bi_rw;
+	const int biorw      = (rw & (RW_MASK|RWA_MASK));
+	const int biobarrier = (rw & (1<<BIO_RW_BARRIER));
+	const int biosync    = (rw & ((1<<BIO_RW_UNPLUG) | (1<<BIO_RW_SYNCIO)));
+
+	if (!is_mdev_trace(mdev, TRACE_LVL_ALWAYS))
+		return;
+
+	if (r)
+		sprintf(rb, "Req:%p ", r);
+
+	dev_info(DEV, "%s %s:%s%s%s Bio:%p %s- %soffset " SECTOR_FORMAT ", size %x\n",
+		 complete ? "<<<" : ">>>",
+		 pfx,
+		 biorw == WRITE ? "Write" : "Read",
+		 biobarrier ? " : B" : "",
+		 biosync ? " : S" : "",
+		 bio,
+		 rb,
+		 complete ? (bio_flagged(bio, BIO_UPTODATE) ? "Success, " : "Failed, ") : "",
+		 bio->bi_sector << SECTOR_SHIFT,
+		 bio->bi_size);
+
+	if (trace_level >= TRACE_LVL_METRICS &&
+	    ((biorw == WRITE) ^ complete)) {
+		printk(KERN_DEBUG "  ind     page   offset   length\n");
+		__bio_for_each_segment(bvec, bio, segno, 0) {
+			printk(KERN_DEBUG "  [%d] %p %8.8x %8.8x\n", segno,
+			       bvec->bv_page, bvec->bv_offset, bvec->bv_len);
+
+			if (trace_level >= TRACE_LVL_ALL) {
+				char *bvec_buf;
+				unsigned long flags;
+
+				bvec_buf = bvec_kmap_irq(bvec, &flags);
+
+				drbd_print_buffer("    ", DBGPRINT_BUFFADDR, 1,
+						  bvec_buf,
+						  faddr,
+						  (bvec->bv_len <= 0x80)
+						  ? bvec->bv_len : 0x80);
+
+				bvec_kunmap_irq(bvec_buf, &flags);
+
+				if (bvec->bv_len > 0x40)
+					printk(KERN_DEBUG "    ....\n");
+
+				faddr += bvec->bv_len;
+			}
+		}
+	}
+}
+
+static void probe_drbd_req(struct drbd_request *req, enum drbd_req_event what, char *msg)
+{
+	static const char *rq_event_names[] = {
+		[created] = "created",
+		[to_be_send] = "to_be_send",
+		[to_be_submitted] = "to_be_submitted",
+		[queue_for_net_write] = "queue_for_net_write",
+		[queue_for_net_read] = "queue_for_net_read",
+		[send_canceled] = "send_canceled",
+		[send_failed] = "send_failed",
+		[handed_over_to_network] = "handed_over_to_network",
+		[connection_lost_while_pending] =
+					"connection_lost_while_pending",
+		[recv_acked_by_peer] = "recv_acked_by_peer",
+		[write_acked_by_peer] = "write_acked_by_peer",
+		[neg_acked] = "neg_acked",
+		[conflict_discarded_by_peer] = "conflict_discarded_by_peer",
+		[barrier_acked] = "barrier_acked",
+		[data_received] = "data_received",
+		[read_completed_with_error] = "read_completed_with_error",
+		[write_completed_with_error] = "write_completed_with_error",
+		[completed_ok] = "completed_ok",
+	};
+
+	struct drbd_conf *mdev = req->mdev;
+
+	const int rw = (req->master_bio == NULL ||
+			bio_data_dir(req->master_bio) == WRITE) ?
+		'W' : 'R';
+	const unsigned long s = req->rq_state;
+
+	if (what != nothing) {
+		dev_info(DEV, "_req_mod(%p %c ,%s)\n", req, rw, rq_event_names[what]);
+	} else {
+		dev_info(DEV, "%s %p %c L%c%c%cN%c%c%c%c%c %u (%llus +%u) %s\n",
+			 msg, req, rw,
+			 s & RQ_LOCAL_PENDING ? 'p' : '-',
+			 s & RQ_LOCAL_COMPLETED ? 'c' : '-',
+			 s & RQ_LOCAL_OK ? 'o' : '-',
+			 s & RQ_NET_PENDING ? 'p' : '-',
+			 s & RQ_NET_QUEUED ? 'q' : '-',
+			 s & RQ_NET_SENT ? 's' : '-',
+			 s & RQ_NET_DONE ? 'd' : '-',
+			 s & RQ_NET_OK ? 'o' : '-',
+			 req->epoch,
+			 (unsigned long long)req->sector,
+			 req->size,
+			 conns_to_name(mdev->state.conn));
+	}
+}
+
+
+#define peers_to_name roles_to_name
+#define pdsks_to_name disks_to_name
+
+#define PSM(A)							\
+do {								\
+	if (mask.A) {						\
+		int i = snprintf(p, len, " " #A "( %s )",	\
+				A##s_to_name(val.A));		\
+		if (i >= len)					\
+			return op;				\
+		p += i;						\
+		len -= i;					\
+	}							\
+} while (0)
+
+STATIC char *dump_st(char *p, int len, union drbd_state mask, union drbd_state val)
+{
+	char *op = p;
+	*p = '\0';
+	PSM(role);
+	PSM(peer);
+	PSM(conn);
+	PSM(disk);
+	PSM(pdsk);
+
+	return op;
+}
+
+#define INFOP(fmt, args...) \
+do { \
+	if (trace_level >= TRACE_LVL_ALL) { \
+		dev_info(DEV, "%s:%d: %s [%d] %s %s " fmt , \
+		     file, line, current->comm, current->pid, \
+		     sockname, recv ? "<<<" : ">>>" , \
+		     ## args); \
+	} else { \
+		dev_info(DEV, "%s %s " fmt, sockname, \
+		     recv ? "<<<" : ">>>" , \
+		     ## args); \
+	} \
+} while (0)
+
+STATIC char *_dump_block_id(u64 block_id, char *buff)
+{
+	if (is_syncer_block_id(block_id))
+		strcpy(buff, "SyncerId");
+	else
+		sprintf(buff, "%llx", (unsigned long long)block_id);
+
+	return buff;
+}
+
+static void probe_drbd_packet(struct drbd_conf *mdev, struct socket *sock,
+			      int recv, union p_polymorph *p, char *file, int line)
+{
+	char *sockname = sock == mdev->meta.socket ? "meta" : "data";
+	int cmd = (recv == 2) ? p->header.command : be16_to_cpu(p->header.command);
+	char tmp[300];
+	union drbd_state m, v;
+
+	switch (cmd) {
+	case P_HAND_SHAKE:
+		INFOP("%s (protocol %u-%u)\n", cmdname(cmd),
+			be32_to_cpu(p->handshake.protocol_min),
+			be32_to_cpu(p->handshake.protocol_max));
+		break;
+
+	case P_BITMAP: /* don't report this */
+	case P_COMPRESSED_BITMAP: /* don't report this */
+		break;
+
+	case P_DATA:
+		INFOP("%s (sector %llus, id %s, seq %u, f %x)\n", cmdname(cmd),
+		      (unsigned long long)be64_to_cpu(p->data.sector),
+		      _dump_block_id(p->data.block_id, tmp),
+		      be32_to_cpu(p->data.seq_num),
+		      be32_to_cpu(p->data.dp_flags)
+			);
+		break;
+
+	case P_DATA_REPLY:
+	case P_RS_DATA_REPLY:
+		INFOP("%s (sector %llus, id %s)\n", cmdname(cmd),
+		      (unsigned long long)be64_to_cpu(p->data.sector),
+		      _dump_block_id(p->data.block_id, tmp)
+			);
+		break;
+
+	case P_RECV_ACK:
+	case P_WRITE_ACK:
+	case P_RS_WRITE_ACK:
+	case P_DISCARD_ACK:
+	case P_NEG_ACK:
+	case P_NEG_RS_DREPLY:
+		INFOP("%s (sector %llus, size %u, id %s, seq %u)\n",
+			cmdname(cmd),
+		      (long long)be64_to_cpu(p->block_ack.sector),
+		      be32_to_cpu(p->block_ack.blksize),
+		      _dump_block_id(p->block_ack.block_id, tmp),
+		      be32_to_cpu(p->block_ack.seq_num)
+			);
+		break;
+
+	case P_DATA_REQUEST:
+	case P_RS_DATA_REQUEST:
+		INFOP("%s (sector %llus, size %u, id %s)\n", cmdname(cmd),
+		      (long long)be64_to_cpu(p->block_req.sector),
+		      be32_to_cpu(p->block_req.blksize),
+		      _dump_block_id(p->block_req.block_id, tmp)
+			);
+		break;
+
+	case P_BARRIER:
+	case P_BARRIER_ACK:
+		INFOP("%s (barrier %u)\n", cmdname(cmd), p->barrier.barrier);
+		break;
+
+	case P_SYNC_PARAM:
+	case P_SYNC_PARAM89:
+		INFOP("%s (rate %u, verify-alg \"%.64s\", csums-alg \"%.64s\")\n",
+			cmdname(cmd), be32_to_cpu(p->rs_param_89.rate),
+			p->rs_param_89.verify_alg, p->rs_param_89.csums_alg);
+		break;
+
+	case P_UUIDS:
+		INFOP("%s Curr:%016llX, Bitmap:%016llX, "
+		      "HisSt:%016llX, HisEnd:%016llX\n",
+		      cmdname(cmd),
+		      (unsigned long long)be64_to_cpu(p->uuids.uuid[UI_CURRENT]),
+		      (unsigned long long)be64_to_cpu(p->uuids.uuid[UI_BITMAP]),
+		      (unsigned long long)be64_to_cpu(p->uuids.uuid[UI_HISTORY_START]),
+		      (unsigned long long)be64_to_cpu(p->uuids.uuid[UI_HISTORY_END]));
+		break;
+
+	case P_SIZES:
+		INFOP("%s (d %lluMiB, u %lluMiB, c %lldMiB, "
+		      "max bio %x, q order %x)\n",
+		      cmdname(cmd),
+		      (long long)(be64_to_cpu(p->sizes.d_size)>>(20-9)),
+		      (long long)(be64_to_cpu(p->sizes.u_size)>>(20-9)),
+		      (long long)(be64_to_cpu(p->sizes.c_size)>>(20-9)),
+		      be32_to_cpu(p->sizes.max_segment_size),
+		      be32_to_cpu(p->sizes.queue_order_type));
+		break;
+
+	case P_STATE:
+		v.i = be32_to_cpu(p->state.state);
+		m.i = 0xffffffff;
+		dump_st(tmp, sizeof(tmp), m, v);
+		INFOP("%s (s %x {%s})\n", cmdname(cmd), v.i, tmp);
+		break;
+
+	case P_STATE_CHG_REQ:
+		m.i = be32_to_cpu(p->req_state.mask);
+		v.i = be32_to_cpu(p->req_state.val);
+		dump_st(tmp, sizeof(tmp), m, v);
+		INFOP("%s (m %x v %x {%s})\n", cmdname(cmd), m.i, v.i, tmp);
+		break;
+
+	case P_STATE_CHG_REPLY:
+		INFOP("%s (ret %x)\n", cmdname(cmd),
+		      be32_to_cpu(p->req_state_reply.retcode));
+		break;
+
+	case P_PING:
+	case P_PING_ACK:
+		/*
+		 * Don't trace pings at summary level
+		 */
+		if (trace_level < TRACE_LVL_ALL)
+			break;
+		/* fall through... */
+	default:
+		INFOP("%s (%u)\n", cmdname(cmd), cmd);
+		break;
+	}
+}
+
+
+static int __init drbd_trace_init(void)
+{
+	int ret;
+
+	if (trace_mask & TRACE_UNPLUG) {
+		ret = register_trace_drbd_unplug(probe_drbd_unplug);
+		WARN_ON(ret);
+	}
+	if (trace_mask & TRACE_UUID) {
+		ret = register_trace_drbd_uuid(probe_drbd_uuid);
+		WARN_ON(ret);
+	}
+	if (trace_mask & TRACE_EE) {
+		ret = register_trace_drbd_ee(probe_drbd_ee);
+		WARN_ON(ret);
+	}
+	if (trace_mask & TRACE_PACKET) {
+		ret = register_trace_drbd_packet(probe_drbd_packet);
+		WARN_ON(ret);
+	}
+	if (trace_mask & TRACE_MD_IO) {
+		ret = register_trace_drbd_md_io(probe_drbd_md_io);
+		WARN_ON(ret);
+	}
+	if (trace_mask & TRACE_EPOCH) {
+		ret = register_trace_drbd_epoch(probe_drbd_epoch);
+		WARN_ON(ret);
+	}
+	if (trace_mask & TRACE_NL) {
+		ret = register_trace_drbd_netlink(probe_drbd_netlink);
+		WARN_ON(ret);
+	}
+	if (trace_mask & TRACE_AL_EXT) {
+		ret = register_trace_drbd_actlog(probe_drbd_actlog);
+		WARN_ON(ret);
+	}
+	if (trace_mask & TRACE_RQ) {
+		ret = register_trace_drbd_bio(probe_drbd_bio);
+		WARN_ON(ret);
+	}
+	if (trace_mask & TRACE_INT_RQ) {
+		ret = register_trace_drbd_req(probe_drbd_req);
+		WARN_ON(ret);
+	}
+	if (trace_mask & TRACE_RESYNC) {
+		ret = register_trace__drbd_resync(probe_drbd_resync);
+		WARN_ON(ret);
+	}
+	return 0;
+}
+
+module_init(drbd_trace_init);
+
+static void __exit drbd_trace_exit(void)
+{
+	if (trace_mask & TRACE_UNPLUG)
+		unregister_trace_drbd_unplug(probe_drbd_unplug);
+	if (trace_mask & TRACE_UUID)
+		unregister_trace_drbd_uuid(probe_drbd_uuid);
+	if (trace_mask & TRACE_EE)
+		unregister_trace_drbd_ee(probe_drbd_ee);
+	if (trace_mask & TRACE_PACKET)
+		unregister_trace_drbd_packet(probe_drbd_packet);
+	if (trace_mask & TRACE_MD_IO)
+		unregister_trace_drbd_md_io(probe_drbd_md_io);
+	if (trace_mask & TRACE_EPOCH)
+		unregister_trace_drbd_epoch(probe_drbd_epoch);
+	if (trace_mask & TRACE_NL)
+		unregister_trace_drbd_netlink(probe_drbd_netlink);
+	if (trace_mask & TRACE_AL_EXT)
+		unregister_trace_drbd_actlog(probe_drbd_actlog);
+	if (trace_mask & TRACE_RQ)
+		unregister_trace_drbd_bio(probe_drbd_bio);
+	if (trace_mask & TRACE_INT_RQ)
+		unregister_trace_drbd_req(probe_drbd_req);
+	if (trace_mask & TRACE_RESYNC)
+		unregister_trace__drbd_resync(probe_drbd_resync);
+
+	tracepoint_synchronize_unregister();
+}
+
+module_exit(drbd_trace_exit);

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 15/16] DRBD: documentation
  2009-04-30 11:26                           ` [PATCH 14/16] DRBD: tracepoint_probes Philipp Reisner
@ 2009-04-30 11:26                             ` Philipp Reisner
  2009-04-30 11:26                               ` [PATCH 16/16] DRBD: final Philipp Reisner
  0 siblings, 1 reply; 90+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg, Philipp Reisner

Some documentation about the implementation.
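
The .dot graphs can be rendered with Graphviz, for example (assuming
the dot tool is installed):

    dot -Tps conn-states-8.dot -o conn-states-8.ps

The SVG diagrams were created with Inkscape and can be viewed with any
SVG-capable viewer.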

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

---
diff --git a/Documentation/blockdev/drbd/README.txt b/Documentation/blockdev/drbd/README.txt
new file mode 100644
index 0000000..627b0a1
--- /dev/null
+++ b/Documentation/blockdev/drbd/README.txt
@@ -0,0 +1,16 @@
+Description
+
+  DRBD is a shared-nothing, synchronously replicated block device. It
+  is designed to serve as a building block for high availability
+  clusters and in this context, is a "drop-in" replacement for shared
+  storage. Simplistically, you could see it as a network RAID 1.
+
+  Please visit http://www.drbd.org to find out more.
+
+The files included here are intended to help understand the implementation.
+
+DRBD-8.3-data-packets.svg, DRBD-data-packets.svg
+  relate some of the involved functions to the data packets on the wire.
+
+conn-states-8.dot, disk-states-8.dot, node-states-8.dot
+  The subgraphs of DRBD's state transitions.
diff --git a/Documentation/blockdev/drbd/conn-states-8.dot b/Documentation/blockdev/drbd/conn-states-8.dot
new file mode 100644
index 0000000..025e8cf
--- /dev/null
+++ b/Documentation/blockdev/drbd/conn-states-8.dot
@@ -0,0 +1,18 @@
+digraph conn_states {
+	StandAlone   -> WFConnection   [ label = "ioctl_set_net()" ]
+	WFConnection -> Unconnected    [ label = "unable to bind()" ]
+	WFConnection -> WFReportParams [ label = "in connect() after accept" ]
+	WFReportParams -> StandAlone   [ label = "checks in receive_param()" ]
+	WFReportParams -> Connected    [ label = "in receive_param()" ]
+	WFReportParams -> WFBitMapS    [ label = "sync_handshake()" ]
+	WFReportParams -> WFBitMapT    [ label = "sync_handshake()" ]
+	WFBitMapS -> SyncSource        [ label = "receive_bitmap()" ]
+	WFBitMapT -> SyncTarget        [ label = "receive_bitmap()" ]
+	SyncSource -> Connected
+	SyncTarget -> Connected
+	SyncSource -> PausedSyncS
+	SyncTarget -> PausedSyncT
+	PausedSyncS -> SyncSource
+	PausedSyncT -> SyncTarget
+	Connected   -> WFConnection    [ label = "* on network error" ]
+}
diff --git a/Documentation/blockdev/drbd/disk-states-8.dot b/Documentation/blockdev/drbd/disk-states-8.dot
new file mode 100644
index 0000000..d06cfb4
--- /dev/null
+++ b/Documentation/blockdev/drbd/disk-states-8.dot
@@ -0,0 +1,16 @@
+digraph disk_states {
+	Diskless -> Inconsistent       [ label = "ioctl_set_disk()" ]
+	Diskless -> Consistent         [ label = "ioctl_set_disk()" ]
+	Diskless -> Outdated           [ label = "ioctl_set_disk()" ]
+	Consistent -> Outdated         [ label = "receive_param()" ]
+	Consistent -> UpToDate         [ label = "receive_param()" ]
+	Consistent -> Inconsistent     [ label = "start resync" ]
+	Outdated   -> Inconsistent     [ label = "start resync" ]
+	UpToDate   -> Inconsistent     [ label = "ioctl_replicate" ]
+	Inconsistent -> UpToDate       [ label = "resync completed" ]
+	Consistent -> Failed           [ label = "io completion error" ]
+	Outdated   -> Failed           [ label = "io completion error" ]
+	UpToDate   -> Failed           [ label = "io completion error" ]
+	Inconsistent -> Failed         [ label = "io completion error" ]
+	Failed -> Diskless             [ label = "sending notify to peer" ]
+}
diff --git a/Documentation/blockdev/drbd/node-states-8.dot b/Documentation/blockdev/drbd/node-states-8.dot
new file mode 100644
index 0000000..4a2b00c
--- /dev/null
+++ b/Documentation/blockdev/drbd/node-states-8.dot
@@ -0,0 +1,14 @@
+digraph node_states {
+	Secondary -> Primary           [ label = "ioctl_set_state()" ]
+	Primary   -> Secondary 	       [ label = "ioctl_set_state()" ]
+}
+
+digraph peer_states {
+	Secondary -> Primary           [ label = "recv state packet" ]
+	Primary   -> Secondary 	       [ label = "recv state packet" ]
+	Primary   -> Unknown 	       [ label = "connection lost" ]
+	Secondary  -> Unknown  	       [ label = "connection lost" ]
+	Unknown   -> Primary           [ label = "connected" ]
+	Unknown   -> Secondary         [ label = "connected" ]
+}
+
diff --git a/Documentation/blockdev/drbd/DRBD-data-packets.svg b/Documentation/blockdev/drbd/DRBD-data-packets.svg
new file mode 100644
index 0000000..48a1e21
--- /dev/null
+++ b/Documentation/blockdev/drbd/DRBD-data-packets.svg
@@ -0,0 +1,459 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!-- Created with Inkscape (http://www.inkscape.org/) -->
+<svg
+   xmlns:svg="http://www.w3.org/2000/svg"
+   xmlns="http://www.w3.org/2000/svg"
+   version="1.0"
+   width="210mm"
+   height="297mm"
+   viewBox="0 0 21000 29700"
+   id="svg2"
+   style="fill-rule:evenodd">
+  <defs
+     id="defs4" />
+  <g
+     id="Default"
+     style="visibility:visible">
+    <desc
+       id="desc176">Master slide</desc>
+  </g>
+  <path
+     d="M 11999,19601 L 11899,19301 L 12099,19301 L 11999,19601 z"
+     id="path189"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 11999,18801 L 11999,19361"
+     id="path193"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <path
+     d="M 7999,21401 L 7899,21101 L 8099,21101 L 7999,21401 z"
+     id="path205"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 7999,20601 L 7999,21161"
+     id="path209"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <path
+     d="M 11999,18801 L 11685,18840 L 11724,18644 L 11999,18801 z"
+     id="path221"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 7999,18001 L 11764,18754"
+     id="path225"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <text
+     x="-3023.845"
+     y="1106.8124"
+     transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,0,0)"
+     id="text243"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="6115.1553 6344.1553 6555.1553 6784.1553 6962.1553 7051.1553 7228.1553 7457.1553 7635.1553 7813.1553 7885.1553"
+       y="21390.812"
+       id="tspan245">RSDataReply</tspan>
+  </text>
+  <path
+     d="M 7999,20601 L 8281,20458 L 8311,20655 L 7999,20601 z"
+     id="path255"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 11999,20001 L 8236,20565"
+     id="path259"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <text
+     x="3502.5356"
+     y="-2184.6621"
+     transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,0,0)"
+     id="text277"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12321.536 12550.536 12761.536 12990.536 13168.536 13257.536 13434.536 13663.536 13841.536 14019.536 14196.536 14374.536 14535.536"
+       y="15854.338"
+       id="tspan279">RSDataRequest</tspan>
+  </text>
+  <text
+     id="text293"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="4034 4263 4440 4703 4881 5042 5219 5397 5503 5681 5842 6003 6180 6341 6519 6625 6803 6980 7158 7336 7497 7586 7692"
+       y="17807"
+       id="tspan295">w_make_resync_request()</tspan>
+  </text>
+  <text
+     id="text309"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12199 12305 12483 12644 12821 12893 13054 13232 13410 13638 13816 13905 14083 14311 14489 14667 14845 15023 15184 15272 15378"
+       y="18806"
+       id="tspan311">receive_DataRequest()</tspan>
+  </text>
+  <text
+     id="text325"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12199 12377 12483 12660 12838 13016 13194 13372 13549 13621 13799 13977 14083 14261 14438 14616 14794 14955 15133 15294 15399"
+       y="19606"
+       id="tspan327">drbd_endio_read_sec()</tspan>
+  </text>
+  <text
+     id="text341"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12191 12420 12597 12775 12953 13131 13309 13486 13664 13770 13931 14109 14287 14375 14553 14731 14837 15015 15192 15298"
+       y="20007"
+       id="tspan343">w_e_end_rsdata_req()</tspan>
+  </text>
+  <text
+     id="text357"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="4444 4550 4728 4889 5066 5138 5299 5477 5655 5883 6095 6324 6501 6590 6768 6997 7175 7352 7424 7585 7691"
+       y="20507"
+       id="tspan359">receive_RSDataReply()</tspan>
+  </text>
+  <text
+     id="text373"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="4457 4635 4741 4918 5096 5274 5452 5630 5807 5879 6057 6235 6464 6569 6641 6730 6908 7086 7247 7425 7585 7691"
+       y="21407"
+       id="tspan375">drbd_endio_write_sec()</tspan>
+  </text>
+  <text
+     id="text389"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="4647 4825 5003 5180 5358 5536 5714 5820 5997 6158 6319 6497 6658 6836 7013 7085 7263 7424 7585 7691"
+       y="21907"
+       id="tspan391">e_end_resync_block()</tspan>
+  </text>
+  <path
+     d="M 11999,22601 L 11685,22640 L 11724,22444 L 11999,22601 z"
+     id="path401"
+     style="fill:#000080;visibility:visible" />
+  <path
+     d="M 7999,21801 L 11764,22554"
+     id="path405"
+     style="fill:none;stroke:#000080;visibility:visible" />
+  <text
+     x="4290.3008"
+     y="-2369.6162"
+     transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,0,0)"
+     id="text423"
+     style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="13610.301 13911.301 14016.301 14088.301 14177.301 14355.301 14567.301 14728.301"
+       y="19573.385"
+       id="tspan425">WriteAck</tspan>
+  </text>
+  <text
+     id="text439"
+     style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12199 12377 12555 12644 12821 13033 13105 13283 13444 13604 13816 13977 14138 14244"
+       y="22559"
+       id="tspan441">got_BlockAck()</tspan>
+  </text>
+  <text
+     id="text455"
+     style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="7999 8304 8541 8753 8964 9201 9413 9531 9769 9862 10099 10310 10522 10734 10852 10971 11208 11348 11585 11822"
+       y="16877"
+       id="tspan457">Resync blocks, 4-32K</tspan>
+  </text>
+  <path
+     d="M 12000,7601 L 11900,7301 L 12100,7301 L 12000,7601 z"
+     id="path467"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 12000,6801 L 12000,7361"
+     id="path471"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <path
+     d="M 12000,6801 L 11686,6840 L 11725,6644 L 12000,6801 z"
+     id="path483"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 8000,6001 L 11765,6754"
+     id="path487"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <text
+     x="-1288.1796"
+     y="1279.7666"
+     transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,0,0)"
+     id="text505"
+     style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="8174.8208 8475.8203 8580.8203 8652.8203 8741.8203 8919.8203 9131.8203 9292.8203"
+       y="9516.7666"
+       id="tspan507">WriteAck</tspan>
+  </text>
+  <path
+     d="M 8000,8601 L 8282,8458 L 8312,8655 L 8000,8601 z"
+     id="path517"
+     style="fill:#000080;visibility:visible" />
+  <path
+     d="M 12000,8001 L 8237,8565"
+     id="path521"
+     style="fill:none;stroke:#000080;visibility:visible" />
+  <text
+     x="1065.6655"
+     y="-2097.7664"
+     transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,0,0)"
+     id="text539"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="10682.666 10911.666 11088.666 11177.666"
+       y="4107.2339"
+       id="tspan541">Data</tspan>
+  </text>
+  <text
+     id="text555"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="4746 4924 5030 5207 5385 5563 5826 6003 6164 6342 6520 6626 6803 6981 7159 7337 7498 7587 7692"
+       y="5505"
+       id="tspan557">drbd_make_request()</tspan>
+  </text>
+  <text
+     id="text571"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12200 12306 12484 12645 12822 12894 13055 13233 13411 13639 13817 13906 14084 14190"
+       y="6806"
+       id="tspan573">receive_Data()</tspan>
+  </text>
+  <text
+     id="text587"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12200 12378 12484 12661 12839 13017 13195 13373 13550 13622 13800 13978 14207 14312 14384 14473 14651 14829 14990 15168 15328 15434"
+       y="7606"
+       id="tspan589">drbd_endio_write_sec()</tspan>
+  </text>
+  <text
+     id="text603"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12192 12370 12548 12725 12903 13081 13259 13437 13509 13686 13847 14008 14114"
+       y="8007"
+       id="tspan605">e_end_block()</tspan>
+  </text>
+  <text
+     id="text619"
+     style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="5647 5825 6003 6092 6269 6481 6553 6731 6892 7052 7264 7425 7586 7692"
+       y="8606"
+       id="tspan621">got_BlockAck()</tspan>
+  </text>
+  <text
+     id="text635"
+     style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="8000 8305 8542 8779 9016 9109 9346 9486 9604 9956 10049 10189 10328 10565 10705 10942 11179 11298 11603 11742 11835 11954 12191 12310 12428 12665 12902 13139 13279 13516 13753"
+       y="4877"
+       id="tspan637">Regular mirrored write, 512-32K</tspan>
+  </text>
+  <text
+     id="text651"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="5381 5610 5787 5948 6126 6304 6482 6659 6837 7015 7087 7265 7426 7587 7692"
+       y="6003"
+       id="tspan653">w_send_dblock()</tspan>
+  </text>
+  <path
+     d="M 8000,6800 L 7900,6500 L 8100,6500 L 8000,6800 z"
+     id="path663"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 8000,6000 L 8000,6560"
+     id="path667"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <text
+     id="text683"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="4602 4780 4886 5063 5241 5419 5597 5775 5952 6024 6202 6380 6609 6714 6786 6875 7053 7231 7409 7515 7587 7692"
+       y="6905"
+       id="tspan685">drbd_endio_write_pri()</tspan>
+  </text>
+  <path
+     d="M 12000,13602 L 11900,13302 L 12100,13302 L 12000,13602 z"
+     id="path695"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 12000,12802 L 12000,13362"
+     id="path699"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <path
+     d="M 12000,12802 L 11686,12841 L 11725,12645 L 12000,12802 z"
+     id="path711"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 8000,12002 L 11765,12755"
+     id="path715"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <text
+     x="-2155.5266"
+     y="1201.5964"
+     transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,0,0)"
+     id="text733"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="7202.4736 7431.4736 7608.4736 7697.4736 7875.4736 8104.4736 8282.4736 8459.4736 8531.4736"
+       y="15454.597"
+       id="tspan735">DataReply</tspan>
+  </text>
+  <path
+     d="M 8000,14602 L 8282,14459 L 8312,14656 L 8000,14602 z"
+     id="path745"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 12000,14002 L 8237,14566"
+     id="path749"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <text
+     x="2280.3804"
+     y="-2103.2141"
+     transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,0,0)"
+     id="text767"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="11316.381 11545.381 11722.381 11811.381 11989.381 12218.381 12396.381 12573.381 12751.381 12929.381 13090.381"
+       y="9981.7861"
+       id="tspan769">DataRequest</tspan>
+  </text>
+  <text
+     id="text783"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="4746 4924 5030 5207 5385 5563 5826 6003 6164 6342 6520 6626 6803 6981 7159 7337 7498 7587 7692"
+       y="11506"
+       id="tspan785">drbd_make_request()</tspan>
+  </text>
+  <text
+     id="text799"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12200 12306 12484 12645 12822 12894 13055 13233 13411 13639 13817 13906 14084 14312 14490 14668 14846 15024 15185 15273 15379"
+       y="12807"
+       id="tspan801">receive_DataRequest()</tspan>
+  </text>
+  <text
+     id="text815"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12200 12378 12484 12661 12839 13017 13195 13373 13550 13622 13800 13978 14084 14262 14439 14617 14795 14956 15134 15295 15400"
+       y="13607"
+       id="tspan817">drbd_endio_read_sec()</tspan>
+  </text>
+  <text
+     id="text831"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12192 12421 12598 12776 12954 13132 13310 13487 13665 13843 14021 14110 14288 14465 14571 14749 14927 15033"
+       y="14008"
+       id="tspan833">w_e_end_data_req()</tspan>
+  </text>
+  <g
+     id="g835"
+     style="visibility:visible">
+    <desc
+       id="desc837">Drawing</desc>
+    <text
+       id="text847"
+       style="font-size:318px;font-weight:400;fill:#008000;font-family:Helvetica embedded">
+      <tspan
+         x="4885 4991 5169 5330 5507 5579 5740 5918 6096 6324 6502 6591 6769 6997 7175 7353 7425 7586 7692"
+         y="14607"
+         id="tspan849">receive_DataReply()</tspan>
+    </text>
+  </g>
+  <text
+     id="text863"
+     style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="8000 8305 8398 8610 8821 8914 9151 9363 9575 9693 9833 10070 10307 10544 10663 10781 11018 11255 11493 11632 11869 12106"
+       y="10878"
+       id="tspan865">Diskless read, 512-32K</tspan>
+  </text>
+  <text
+     id="text879"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="5029 5258 5435 5596 5774 5952 6130 6307 6413 6591 6769 6947 7125 7230 7408 7586 7692"
+       y="12004"
+       id="tspan881">w_send_read_req()</tspan>
+  </text>
+  <text
+     id="text895"
+     style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="6961 7266 7571 7854 8159 8278 8515 8633 8870 9107 9226 9463 9581 9700 9793 10030"
+       y="2806"
+       id="tspan897">DRBD 8 data flow</tspan>
+  </text>
+  <path
+     d="M 3900,5300 L 3700,5300 L 3700,7000 L 3900,7000"
+     id="path907"
+     style="fill:none;stroke:#000000;visibility:visible" />
+  <path
+     d="M 3900,17600 L 3700,17600 L 3700,22000 L 3900,22000"
+     id="path919"
+     style="fill:none;stroke:#000000;visibility:visible" />
+  <path
+     d="M 16100,20000 L 16300,20000 L 16300,18500 L 16100,18500"
+     id="path931"
+     style="fill:none;stroke:#000000;visibility:visible" />
+  <text
+     id="text947"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="2126 2304 2376 2554 2731 2909 3087 3159 3337 3515 3587 3764 3870"
+       y="5202"
+       id="tspan949">al_begin_io()</tspan>
+  </text>
+  <text
+     id="text963"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="1632 1810 1882 2060 2220 2398 2661 2839 2910 3088 3177 3355 3533 3605 3783 3888"
+       y="7331"
+       id="tspan965">al_complete_io()</tspan>
+  </text>
+  <text
+     id="text979"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="2126 2232 2393 2571 2748 2926 3104 3176 3354 3531 3603 3781 3887"
+       y="17431"
+       id="tspan981">rs_begin_io()</tspan>
+  </text>
+  <text
+     id="text995"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="1626 1732 1893 2071 2231 2409 2672 2849 2921 3099 3188 3366 3544 3616 3793 3899"
+       y="22331"
+       id="tspan997">rs_complete_io()</tspan>
+  </text>
+  <text
+     id="text1011"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="16027 16133 16294 16472 16649 16827 17005 17077 17255 17432 17504 17682 17788"
+       y="18402"
+       id="tspan1013">rs_begin_io()</tspan>
+  </text>
+  <text
+     id="text1027"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="16115 16221 16382 16560 16720 16898 17161 17338 17410 17588 17677 17855 18033 18105 18282 18388"
+       y="20331"
+       id="tspan1029">rs_complete_io()</tspan>
+  </text>
+</svg>
diff --git a/Documentation/blockdev/drbd/DRBD-8.3-data-packets.svg b/Documentation/blockdev/drbd/DRBD-8.3-data-packets.svg
new file mode 100644
index 0000000..f87cfa0
--- /dev/null
+++ b/Documentation/blockdev/drbd/DRBD-8.3-data-packets.svg
@@ -0,0 +1,588 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!-- Created with Inkscape (http://www.inkscape.org/) -->
+<svg
+   xmlns:svg="http://www.w3.org/2000/svg"
+   xmlns="http://www.w3.org/2000/svg"
+   version="1.0"
+   width="210mm"
+   height="297mm"
+   viewBox="0 0 21000 29700"
+   id="svg2"
+   style="fill-rule:evenodd">
+  <defs
+     id="defs4" />
+  <g
+     id="Default"
+     style="visibility:visible">
+    <desc
+       id="desc180">Master slide</desc>
+  </g>
+  <path
+     d="M 11999,8601 L 11899,8301 L 12099,8301 L 11999,8601 z"
+     id="path193"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 11999,7801 L 11999,8361"
+     id="path197"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <path
+     d="M 7999,10401 L 7899,10101 L 8099,10101 L 7999,10401 z"
+     id="path209"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 7999,9601 L 7999,10161"
+     id="path213"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <path
+     d="M 11999,7801 L 11685,7840 L 11724,7644 L 11999,7801 z"
+     id="path225"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 7999,7001 L 11764,7754"
+     id="path229"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <g
+     transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,-1244.4792,1416.5139)"
+     id="g245"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <text
+       id="text247">
+      <tspan
+         x="9139 9368 9579 9808 9986 10075 10252 10481 10659 10837 10909"
+         y="9284"
+         id="tspan249">RSDataReply</tspan>
+    </text>
+  </g>
+  <path
+     d="M 7999,9601 L 8281,9458 L 8311,9655 L 7999,9601 z"
+     id="path259"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 11999,9001 L 8236,9565"
+     id="path263"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <g
+     transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,1620.9382,-1639.4947)"
+     id="g279"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <text
+       id="text281">
+      <tspan
+         x="8743 8972 9132 9310 9573 9801 10013 10242 10419 10597 10775 10953 11114"
+         y="7023"
+         id="tspan283">CsumRSRequest</tspan>
+    </text>
+  </g>
+  <text
+     id="text297"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="4034 4263 4440 4703 4881 5042 5219 5397 5503 5681 5842 6003 6180 6341 6519 6625 6803 6980 7158 7336 7497 7586 7692"
+       y="5707"
+       id="tspan299">w_make_resync_request()</tspan>
+  </text>
+  <text
+     id="text313"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12199 12305 12483 12644 12821 12893 13054 13232 13410 13638 13816 13905 14083 14311 14489 14667 14845 15023 15184 15272 15378"
+       y="7806"
+       id="tspan315">receive_DataRequest()</tspan>
+  </text>
+  <text
+     id="text329"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12199 12377 12483 12660 12838 13016 13194 13372 13549 13621 13799 13977 14083 14261 14438 14616 14794 14955 15133 15294 15399"
+       y="8606"
+       id="tspan331">drbd_endio_read_sec()</tspan>
+  </text>
+  <text
+     id="text345"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12191 12420 12597 12775 12953 13131 13309 13486 13664 13825 13986 14164 14426 14604 14710 14871 15049 15154 15332 15510 15616"
+       y="9007"
+       id="tspan347">w_e_end_csum_rs_req()</tspan>
+  </text>
+  <text
+     id="text361"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="4444 4550 4728 4889 5066 5138 5299 5477 5655 5883 6095 6324 6501 6590 6768 6997 7175 7352 7424 7585 7691"
+       y="9507"
+       id="tspan363">receive_RSDataReply()</tspan>
+  </text>
+  <text
+     id="text377"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="4457 4635 4741 4918 5096 5274 5452 5630 5807 5879 6057 6235 6464 6569 6641 6730 6908 7086 7247 7425 7585 7691"
+       y="10407"
+       id="tspan379">drbd_endio_write_sec()</tspan>
+  </text>
+  <text
+     id="text393"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="4647 4825 5003 5180 5358 5536 5714 5820 5997 6158 6319 6497 6658 6836 7013 7085 7263 7424 7585 7691"
+       y="10907"
+       id="tspan395">e_end_resync_block()</tspan>
+  </text>
+  <path
+     d="M 11999,11601 L 11685,11640 L 11724,11444 L 11999,11601 z"
+     id="path405"
+     style="fill:#000080;visibility:visible" />
+  <path
+     d="M 7999,10801 L 11764,11554"
+     id="path409"
+     style="fill:none;stroke:#000080;visibility:visible" />
+  <g
+     transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,2434.7562,-1674.649)"
+     id="g425"
+     style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+    <text
+       id="text427">
+      <tspan
+         x="9320 9621 9726 9798 9887 10065 10277 10438"
+         y="10943"
+         id="tspan429">WriteAck</tspan>
+    </text>
+  </g>
+  <text
+     id="text443"
+     style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12199 12377 12555 12644 12821 13033 13105 13283 13444 13604 13816 13977 14138 14244"
+       y="11559"
+       id="tspan445">got_BlockAck()</tspan>
+  </text>
+  <text
+     id="text459"
+     style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="7999 8304 8541 8778 8990 9201 9413 9650 10001 10120 10357 10594 10806 11043 11280 11398 11703 11940 12152 12364 12601 12812 12931 13049 13261 13498 13710 13947 14065 14302 14540 14658 14777 14870 15107 15225 15437 15649 15886"
+       y="4877"
+       id="tspan461">Checksum based Resync, case not in sync</tspan>
+  </text>
+  <text
+     id="text475"
+     style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="6961 7266 7571 7854 8159 8299 8536 8654 8891 9010 9247 9484 9603 9840 9958 10077 10170 10407"
+       y="2806"
+       id="tspan477">DRBD-8.3 data flow</tspan>
+  </text>
+  <text
+     id="text491"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="5190 5419 5596 5774 5952 6113 6291 6468 6646 6824 6985 7146 7324 7586 7692"
+       y="7005"
+       id="tspan493">w_e_send_csum()</tspan>
+  </text>
+  <path
+     d="M 11999,17601 L 11899,17301 L 12099,17301 L 11999,17601 z"
+     id="path503"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 11999,16801 L 11999,17361"
+     id="path507"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <path
+     d="M 11999,16801 L 11685,16840 L 11724,16644 L 11999,16801 z"
+     id="path519"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 7999,16001 L 11764,16754"
+     id="path523"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <g
+     transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,-2539.5806,1529.3491)"
+     id="g539"
+     style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+    <text
+       id="text541">
+      <tspan
+         x="9269 9498 9709 9798 9959 10048 10226 10437 10598 10776"
+         y="18265"
+         id="tspan543">RSIsInSync</tspan>
+    </text>
+  </g>
+  <path
+     d="M 7999,18601 L 8281,18458 L 8311,18655 L 7999,18601 z"
+     id="path553"
+     style="fill:#000080;visibility:visible" />
+  <path
+     d="M 11999,18001 L 8236,18565"
+     id="path557"
+     style="fill:none;stroke:#000080;visibility:visible" />
+  <g
+     transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,3461.4027,-1449.3012)"
+     id="g573"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <text
+       id="text575">
+      <tspan
+         x="8743 8972 9132 9310 9573 9801 10013 10242 10419 10597 10775 10953 11114"
+         y="16023"
+         id="tspan577">CsumRSRequest</tspan>
+    </text>
+  </g>
+  <text
+     id="text591"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12199 12305 12483 12644 12821 12893 13054 13232 13410 13638 13816 13905 14083 14311 14489 14667 14845 15023 15184 15272 15378"
+       y="16806"
+       id="tspan593">receive_DataRequest()</tspan>
+  </text>
+  <text
+     id="text607"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12199 12377 12483 12660 12838 13016 13194 13372 13549 13621 13799 13977 14083 14261 14438 14616 14794 14955 15133 15294 15399"
+       y="17606"
+       id="tspan609">drbd_endio_read_sec()</tspan>
+  </text>
+  <text
+     id="text623"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12191 12420 12597 12775 12953 13131 13309 13486 13664 13825 13986 14164 14426 14604 14710 14871 15049 15154 15332 15510 15616"
+       y="18007"
+       id="tspan625">w_e_end_csum_rs_req()</tspan>
+  </text>
+  <text
+     id="text639"
+     style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="5735 5913 6091 6180 6357 6446 6607 6696 6874 7085 7246 7424 7585 7691"
+       y="18507"
+       id="tspan641">got_IsInSync()</tspan>
+  </text>
+  <text
+     id="text655"
+     style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="7999 8304 8541 8778 8990 9201 9413 9650 10001 10120 10357 10594 10806 11043 11280 11398 11703 11940 12152 12364 12601 12812 12931 13049 13261 13498 13710 13947 14065 14159 14396 14514 14726 14937 15175"
+       y="13877"
+       id="tspan657">Checksum based Resync, case in sync</tspan>
+  </text>
+  <path
+     d="M 12000,24601 L 11900,24301 L 12100,24301 L 12000,24601 z"
+     id="path667"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 12000,23801 L 12000,24361"
+     id="path671"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <path
+     d="M 8000,26401 L 7900,26101 L 8100,26101 L 8000,26401 z"
+     id="path683"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 8000,25601 L 8000,26161"
+     id="path687"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <path
+     d="M 12000,23801 L 11686,23840 L 11725,23644 L 12000,23801 z"
+     id="path699"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 8000,23001 L 11765,23754"
+     id="path703"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <g
+     transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,-3543.8452,1630.5143)"
+     id="g719"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <text
+       id="text721">
+      <tspan
+         x="9464 9710 9921 10150 10328 10505 10577"
+         y="25236"
+         id="tspan723">OVReply</tspan>
+    </text>
+  </g>
+  <path
+     d="M 8000,25601 L 8282,25458 L 8312,25655 L 8000,25601 z"
+     id="path733"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 12000,25001 L 8237,25565"
+     id="path737"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <g
+     transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,4918.2801,-1381.2128)"
+     id="g753"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <text
+       id="text755">
+      <tspan
+         x="9142 9388 9599 9828 10006 10183 10361 10539 10700"
+         y="23106"
+         id="tspan757">OVRequest</tspan>
+    </text>
+  </g>
+  <text
+     id="text771"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12200 12306 12484 12645 12822 12894 13055 13233 13411 13656 13868 14097 14274 14452 14630 14808 14969 15058 15163"
+       y="23806"
+       id="tspan773">receive_OVRequest()</tspan>
+  </text>
+  <text
+     id="text787"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12200 12378 12484 12661 12839 13017 13195 13373 13550 13622 13800 13978 14084 14262 14439 14617 14795 14956 15134 15295 15400"
+       y="24606"
+       id="tspan789">drbd_endio_read_sec()</tspan>
+  </text>
+  <text
+     id="text803"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12192 12421 12598 12776 12954 13132 13310 13487 13665 13843 14004 14182 14288 14465 14643 14749"
+       y="25007"
+       id="tspan805">w_e_end_ov_req()</tspan>
+  </text>
+  <text
+     id="text819"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="5101 5207 5385 5546 5723 5795 5956 6134 6312 6557 6769 6998 7175 7353 7425 7586 7692"
+       y="25507"
+       id="tspan821">receive_OVReply()</tspan>
+  </text>
+  <text
+     id="text835"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="4492 4670 4776 4953 5131 5309 5487 5665 5842 5914 6092 6270 6376 6554 6731 6909 7087 7248 7426 7587 7692"
+       y="26407"
+       id="tspan837">drbd_endio_read_sec()</tspan>
+  </text>
+  <text
+     id="text851"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="4902 5131 5308 5486 5664 5842 6020 6197 6375 6553 6714 6892 6998 7175 7353 7425 7586 7692"
+       y="26907"
+       id="tspan853">w_e_end_ov_reply()</tspan>
+  </text>
+  <path
+     d="M 12000,27601 L 11686,27640 L 11725,27444 L 12000,27601 z"
+     id="path863"
+     style="fill:#000080;visibility:visible" />
+  <path
+     d="M 8000,26801 L 11765,27554"
+     id="path867"
+     style="fill:none;stroke:#000080;visibility:visible" />
+  <g
+     transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,5704.1907,-1328.312)"
+     id="g883"
+     style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+    <text
+       id="text885">
+      <tspan
+         x="9279 9525 9736 9965 10143 10303 10481 10553"
+         y="26935"
+         id="tspan887">OVResult</tspan>
+    </text>
+  </g>
+  <text
+     id="text901"
+     style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="12200 12378 12556 12645 12822 13068 13280 13508 13686 13847 14025 14097 14185 14291"
+       y="27559"
+       id="tspan903">got_OVResult()</tspan>
+  </text>
+  <text
+     id="text917"
+     style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="8000 8330 8567 8660 8754 8991 9228 9346 9558 9795 9935 10028 10146"
+       y="21877"
+       id="tspan919">Online verify</tspan>
+  </text>
+  <text
+     id="text933"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="4641 4870 5047 5310 5488 5649 5826 6004 6182 6343 6521 6626 6804 6982 7160 7338 7499 7587 7693"
+       y="23005"
+       id="tspan935">w_make_ov_request()</tspan>
+  </text>
+  <path
+     d="M 8000,6500 L 7900,6200 L 8100,6200 L 8000,6500 z"
+     id="path945"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 8000,5700 L 8000,6260"
+     id="path949"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <path
+     d="M 3900,5500 L 3700,5500 L 3700,11000 L 3900,11000"
+     id="path961"
+     style="fill:none;stroke:#000000;visibility:visible" />
+  <path
+     d="M 3900,14500 L 3700,14500 L 3700,18600 L 3900,18600"
+     id="path973"
+     style="fill:none;stroke:#000000;visibility:visible" />
+  <path
+     d="M 3900,22800 L 3700,22800 L 3700,26900 L 3900,26900"
+     id="path985"
+     style="fill:none;stroke:#000000;visibility:visible" />
+  <text
+     id="text1001"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="4492 4670 4776 4953 5131 5309 5487 5665 5842 5914 6092 6270 6376 6554 6731 6909 7087 7248 7426 7587 7692"
+       y="6506"
+       id="tspan1003">drbd_endio_read_sec()</tspan>
+  </text>
+  <text
+     id="text1017"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="4034 4263 4440 4703 4881 5042 5219 5397 5503 5681 5842 6003 6180 6341 6519 6625 6803 6980 7158 7336 7497 7586 7692"
+       y="14708"
+       id="tspan1019">w_make_resync_request()</tspan>
+  </text>
+  <text
+     id="text1033"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="5190 5419 5596 5774 5952 6113 6291 6468 6646 6824 6985 7146 7324 7586 7692"
+       y="16006"
+       id="tspan1035">w_e_send_csum()</tspan>
+  </text>
+  <path
+     d="M 8000,15501 L 7900,15201 L 8100,15201 L 8000,15501 z"
+     id="path1045"
+     style="fill:#008000;visibility:visible" />
+  <path
+     d="M 8000,14701 L 8000,15261"
+     id="path1049"
+     style="fill:none;stroke:#008000;visibility:visible" />
+  <text
+     id="text1065"
+     style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="4492 4670 4776 4953 5131 5309 5487 5665 5842 5914 6092 6270 6376 6554 6731 6909 7087 7248 7426 7587 7692"
+       y="15507"
+       id="tspan1067">drbd_endio_read_sec()</tspan>
+  </text>
+  <path
+     d="M 16100,9000 L 16300,9000 L 16300,7500 L 16100,7500"
+     id="path1077"
+     style="fill:none;stroke:#000000;visibility:visible" />
+  <path
+     d="M 16100,18000 L 16300,18000 L 16300,16500 L 16100,16500"
+     id="path1089"
+     style="fill:none;stroke:#000000;visibility:visible" />
+  <path
+     d="M 16100,25000 L 16300,25000 L 16300,23500 L 16100,23500"
+     id="path1101"
+     style="fill:none;stroke:#000000;visibility:visible" />
+  <text
+     id="text1117"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="2026 2132 2293 2471 2648 2826 3004 3076 3254 3431 3503 3681 3787"
+       y="5402"
+       id="tspan1119">rs_begin_io()</tspan>
+  </text>
+  <text
+     id="text1133"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="2027 2133 2294 2472 2649 2827 3005 3077 3255 3432 3504 3682 3788"
+       y="14402"
+       id="tspan1135">rs_begin_io()</tspan>
+  </text>
+  <text
+     id="text1149"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="2026 2132 2293 2471 2648 2826 3004 3076 3254 3431 3503 3681 3787"
+       y="22602"
+       id="tspan1151">rs_begin_io()</tspan>
+  </text>
+  <text
+     id="text1165"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="1426 1532 1693 1871 2031 2209 2472 2649 2721 2899 2988 3166 3344 3416 3593 3699"
+       y="11302"
+       id="tspan1167">rs_complete_io()</tspan>
+  </text>
+  <text
+     id="text1181"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="1526 1632 1793 1971 2131 2309 2572 2749 2821 2999 3088 3266 3444 3516 3693 3799"
+       y="18931"
+       id="tspan1183">rs_complete_io()</tspan>
+  </text>
+  <text
+     id="text1197"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="1526 1632 1793 1971 2131 2309 2572 2749 2821 2999 3088 3266 3444 3516 3693 3799"
+       y="27231"
+       id="tspan1199">rs_complete_io()</tspan>
+  </text>
+  <text
+     id="text1213"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="16126 16232 16393 16571 16748 16926 17104 17176 17354 17531 17603 17781 17887"
+       y="7402"
+       id="tspan1215">rs_begin_io()</tspan>
+  </text>
+  <text
+     id="text1229"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="16127 16233 16394 16572 16749 16927 17105 17177 17355 17532 17604 17782 17888"
+       y="16331"
+       id="tspan1231">rs_begin_io()</tspan>
+  </text>
+  <text
+     id="text1245"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="16127 16233 16394 16572 16749 16927 17105 17177 17355 17532 17604 17782 17888"
+       y="23302"
+       id="tspan1247">rs_begin_io()</tspan>
+  </text>
+  <text
+     id="text1261"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="16115 16221 16382 16560 16720 16898 17161 17338 17410 17588 17677 17855 18033 18105 18282 18388"
+       y="9302"
+       id="tspan1263">rs_complete_io()</tspan>
+  </text>
+  <text
+     id="text1277"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="16115 16221 16382 16560 16720 16898 17161 17338 17410 17588 17677 17855 18033 18105 18282 18388"
+       y="18331"
+       id="tspan1279">rs_complete_io()</tspan>
+  </text>
+  <text
+     id="text1293"
+     style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+    <tspan
+       x="16126 16232 16393 16571 16731 16909 17172 17349 17421 17599 17688 17866 18044 18116 18293 18399"
+       y="25302"
+       id="tspan1295">rs_complete_io()</tspan>
+  </text>
+</svg>
diff --git a/Documentation/blockdev/drbd/drbd-connection-state-overview.dot b/Documentation/blockdev/drbd/drbd-connection-state-overview.dot
new file mode 100644
index 0000000..6d9cf0a
--- /dev/null
+++ b/Documentation/blockdev/drbd/drbd-connection-state-overview.dot
@@ -0,0 +1,85 @@
+// vim: set sw=2 sts=2 :
+digraph {
+  rankdir=BT
+  bgcolor=white
+
+  node [shape=plaintext]
+  node [fontcolor=black]
+
+  StandAlone     [ style=filled,fillcolor=gray,label=StandAlone ]
+
+  node [fontcolor=lightgray]
+
+  Unconnected    [ label=Unconnected ]
+
+  CommTrouble [ shape=record,
+    label="{communication loss|{Timeout|BrokenPipe|NetworkFailure}}" ]
+
+  node [fontcolor=gray]
+
+  subgraph cluster_try_connect {
+    label="try to connect, handshake"
+    rank=max
+    WFConnection   [ label=WFConnection ]
+    WFReportParams [ label=WFReportParams ]
+  }
+
+  TearDown       [ label=TearDown ]
+
+  Connected      [ label=Connected,style=filled,fillcolor=green,fontcolor=black ]
+
+  node [fontcolor=lightblue]
+
+  StartingSyncS  [ label=StartingSyncS ]
+  StartingSyncT  [ label=StartingSyncT ]
+
+  subgraph cluster_bitmap_exchange {
+    node [fontcolor=red]
+    fontcolor=red
+    label="new application (WRITE?) requests blocked\lwhile bitmap is exchanged"
+
+    WFBitMapT      [ label=WFBitMapT ]
+    WFSyncUUID     [ label=WFSyncUUID ]
+    WFBitMapS      [ label=WFBitMapS ]
+  }
+
+  node [fontcolor=blue]
+
+  cluster_resync [ shape=record,label="{<any>resynchronisation process running\l'concurrent' application requests allowed|{{<T>PausedSyncT\nSyncTarget}|{<S>PausedSyncS\nSyncSource}}}" ]
+
+  node [shape=box,fontcolor=black]
+
+  // drbdadm [label="drbdadm connect"]
+  // handshake [label="drbd_connect()\ndrbd_do_handshake\ndrbd_sync_handshake() etc."]
+  // comm_error [label="communication trouble"]
+
+  //
+  // edges
+  // --------------------------------------
+
+  StandAlone -> Unconnected [ label="drbdadm connect" ]
+  Unconnected -> StandAlone  [ label="drbdadm disconnect\lor serious communication trouble" ]
+  Unconnected -> WFConnection [ label="receiver thread is started" ]
+  WFConnection -> WFReportParams [ headlabel="accept()\land/or                        \lconnect()\l" ]
+
+  WFReportParams -> StandAlone [ label="during handshake\lpeers do not agree\labout something essential" ]
+  WFReportParams -> Connected [ label="data identical\lno sync needed",color=green,fontcolor=green ]
+
+    WFReportParams -> WFBitMapS
+    WFReportParams -> WFBitMapT
+    WFBitMapT -> WFSyncUUID [minlen=0.1,constraint=false]
+
+      WFBitMapS -> cluster_resync:S
+      WFSyncUUID -> cluster_resync:T
+
+  edge [color=green]
+  cluster_resync:any -> Connected [ label="resync done",fontcolor=green ]
+
+  edge [color=red]
+  WFReportParams -> CommTrouble
+  Connected -> CommTrouble
+  cluster_resync:any -> CommTrouble
+  edge [color=black]
+  CommTrouble -> Unconnected [label="receiver thread is stopped" ]
+
+}

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 16/16] DRBD: final
  2009-04-30 11:26                             ` [PATCH 15/16] DRBD: documentation Philipp Reisner
@ 2009-04-30 11:26                               ` Philipp Reisner
  0 siblings, 0 replies; 90+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg, Philipp Reisner

Kconfig integration and Makefile

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

---
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index ddea8e4..e8db999 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -271,6 +271,8 @@ config BLK_DEV_CRYPTOLOOP
 	  instead, which can be configured to be on-disk compatible with the
 	  cryptoloop device.
 
+source "drivers/block/drbd/Kconfig"
+
 config BLK_DEV_NBD
 	tristate "Network block device support"
 	depends on NET
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 7755a5e..33f0046 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -35,5 +35,6 @@ obj-$(CONFIG_BLK_DEV_UB)	+= ub.o
 obj-$(CONFIG_BLK_DEV_HD)	+= hd.o
 
 obj-$(CONFIG_XEN_BLKDEV_FRONTEND)	+= xen-blkfront.o
+obj-$(CONFIG_BLK_DEV_DRBD)     += drbd/
 
 swim_mod-objs	:= swim.o swim_asm.o
diff --git a/drivers/block/drbd/Kconfig b/drivers/block/drbd/Kconfig
new file mode 100644
index 0000000..7ad8c2a
--- /dev/null
+++ b/drivers/block/drbd/Kconfig
@@ -0,0 +1,47 @@
+#
+# DRBD device driver configuration
+#
+
+comment "DRBD disabled because PROC_FS, INET or CONNECTOR not selected"
+	depends on !PROC_FS || !INET || !CONNECTOR
+
+config BLK_DEV_DRBD
+	tristate "DRBD Distributed Replicated Block Device support"
+	depends on PROC_FS && INET && CONNECTOR
+	help
+
+	  NOTE: In order to authenticate connections you have to select
+	  CRYPTO_HMAC and a hash function as well.
+
+	  DRBD is a shared-nothing, synchronously replicated block device. It
+	  is designed to serve as a building block for high availability
+	  clusters and in this context, is a "drop-in" replacement for shared
+	  storage. Simplistically, you could see it as a network RAID 1.
+
+	  Each minor device has a role, which can be 'primary' or 'secondary'.
+	  On the node with the primary device the application is supposed to
+	  run and to access the device (/dev/drbdX). Every write is sent to
+	  the local 'lower level block device' and, across the network, to the
+	  node with the device in 'secondary' state.  The secondary device
+	  simply writes the data to its lower level block device.
+
+	  DRBD can also be used in dual-Primary mode (device writable on both
+	  nodes), which means it can exhibit shared disk semantics in a
+	  shared-nothing cluster.  Needless to say, on top of dual-Primary
+	  DRBD utilizing a cluster file system is necessary to maintain for
+	  DRBD, utilizing a cluster file system is necessary to maintain
+	  cache coherency.
+	  For automatic failover you need a cluster manager (e.g. heartbeat).
+	  See also: http://www.drbd.org/, http://www.linux-ha.org
+
+	  If unsure, say N.
+
+config DRBD_TRACE
+	tristate "DRBD tracing"
+	depends on BLK_DEV_DRBD
+	select TRACEPOINTS
+	help
+
+	  Say Y here if you want to be able to trace various events in DRBD.
+
+	  If unsure, say N.
diff --git a/drivers/block/drbd/Makefile b/drivers/block/drbd/Makefile
new file mode 100644
index 0000000..f0f805c
--- /dev/null
+++ b/drivers/block/drbd/Makefile
@@ -0,0 +1,8 @@
+drbd-y := drbd_buildtag.o drbd_bitmap.o drbd_proc.o
+drbd-y += drbd_worker.o drbd_receiver.o drbd_req.o drbd_actlog.o
+drbd-y += lru_cache.o drbd_main.o drbd_strings.o drbd_nl.o
+
+drbd_trace-y := drbd_tracing.o drbd_strings.o
+
+obj-$(CONFIG_BLK_DEV_DRBD)     += drbd.o
+obj-$(CONFIG_DRBD_TRACE)       += drbd_trace.o

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-04-30 11:26 [PATCH 00/16] DRBD: a block device for HA clusters Philipp Reisner
  2009-04-30 11:26 ` [PATCH 01/16] DRBD: major.h Philipp Reisner
@ 2009-05-01  8:59 ` Andrew Morton
  2009-05-01 11:15   ` Lars Marowsky-Bree
  2009-05-02  7:33   ` Bart Van Assche
  2009-05-03  5:53 ` Neil Brown
  2 siblings, 2 replies; 90+ messages in thread
From: Andrew Morton @ 2009-05-01  8:59 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: linux-kernel, Jens Axboe, Greg KH, Neil Brown, James Bottomley,
	Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche, Lars Ellenberg

On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:

> This is a repost of DRBD

How fast is it?

Is it being used anywhere for anything?  If so, where and what?

(it would be useful to add such info to the changelog, and to
maintain it)

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 01/16] DRBD: major.h
  2009-04-30 11:26 ` [PATCH 01/16] DRBD: major.h Philipp Reisner
  2009-04-30 11:26   ` [PATCH 02/16] DRBD: lru_cache Philipp Reisner
@ 2009-05-01  8:59   ` Andrew Morton
  1 sibling, 0 replies; 90+ messages in thread
From: Andrew Morton @ 2009-05-01  8:59 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: linux-kernel, Jens Axboe, Greg KH, Neil Brown, James Bottomley,
	Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche, Lars Ellenberg

On Thu, 30 Apr 2009 13:26:37 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:

> Since we have had a LANANA major number for years, and it is documented in devices.txt,
> I think that this first patch can go upstream without further changes.
> 
> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> 
> ---
> diff --git a/include/linux/major.h b/include/linux/major.h
> index 058ec15..6a8ca98 100644
> --- a/include/linux/major.h
> +++ b/include/linux/major.h
> @@ -145,6 +145,7 @@
>  #define UNIX98_PTY_MAJOR_COUNT	8
>  #define UNIX98_PTY_SLAVE_MAJOR	(UNIX98_PTY_MASTER_MAJOR+UNIX98_PTY_MAJOR_COUNT)
>  
> +#define DRBD_MAJOR		147
>  #define RTF_MAJOR		150
>  #define RAW_MAJOR		162

Yup.  I'll merge this, after having given it a vaguely useful title,
"drbd: add major number to major.h".

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/16] DRBD: lru_cache
  2009-04-30 11:26   ` [PATCH 02/16] DRBD: lru_cache Philipp Reisner
  2009-04-30 11:26     ` [PATCH 03/16] DRBD: activity_log Philipp Reisner
@ 2009-05-01  8:59     ` Andrew Morton
  2009-05-02 15:26       ` Lars Ellenberg
  2009-05-02 23:51     ` Kyle Moffett
  2 siblings, 1 reply; 90+ messages in thread
From: Andrew Morton @ 2009-05-01  8:59 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: linux-kernel, Jens Axboe, Greg KH, Neil Brown, James Bottomley,
	Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche, Lars Ellenberg

On Thu, 30 Apr 2009 13:26:38 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:

> The lru_cache is a fixed size cache of equal sized objects. It allows its
> users to do arbitrary transactions in case an element in the cache needs to
> be replaced. Its replacement policy is LRU.
> 

None of this really looks drbd-specific.

Would it not be better to present this as a general library function? 
lib/lru_cache.c?

I think I might have asked this before.  If I did, then thwap-to-you
for not permanently answering it in the changelog ;)

>
> ...
>
> +#define lc_e_base(lc)  ((char *)((lc)->slot + (lc)->nr_elements))
> +#define lc_entry(lc, i) ((struct lc_element *) \
> +		       (lc_e_base(lc) + (i)*(lc)->element_size))
> +#define lc_index_of(lc, e) (((char *)(e) - lc_e_base(lc))/(lc)->element_size)

The macros reference their arguments multiple times and hence are
inefficient and/or buggy and/or unpredictable when passed an expression
with side-effects.

If possible this should be fixed by turning them into regular C
functions.  Inlined C functions if that makes sense (it frequently
doesn't).

A pleasing side-effect of this conversion is that for some reason
developers are more likely to document C functions than they are macros
(hint).

I don't understand what these macros are doing and can't be bothered
reverse-engineering the code to work that out.  But all the typecasting
looks fishy.
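
For illustration, the quoted macros could become ordinary functions along
these lines (a minimal sketch; it only relies on the slot, nr_elements and
element_size members visible in this patchset, everything else is assumed):

	/* element storage starts right after the hash slot array */
	static char *lc_e_base(struct lru_cache *lc)
	{
		return (char *)(lc->slot + lc->nr_elements);
	}

	static struct lc_element *lc_entry(struct lru_cache *lc, unsigned int i)
	{
		return (struct lc_element *)
			(lc_e_base(lc) + i * lc->element_size);
	}

	static unsigned int lc_index_of(struct lru_cache *lc, struct lc_element *e)
	{
		return ((char *)e - lc_e_base(lc)) / lc->element_size;
	}

Each argument is evaluated exactly once, and the compiler can check the
callers' types.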

>
> ...
>
> +static inline void lc_init(struct lru_cache *lc,
> +		const size_t bytes, const char *name,
> +		const unsigned int e_count, const size_t e_size,
> +		void *private_p)
> +{
> +	struct lc_element *e;
> +	unsigned int i;
> +
> +	BUG_ON(!e_count);
> +
> +	memset(lc, 0, bytes);
> +	INIT_LIST_HEAD(&lc->in_use);
> +	INIT_LIST_HEAD(&lc->lru);
> +	INIT_LIST_HEAD(&lc->free);
> +	lc->element_size = e_size;
> +	lc->nr_elements  = e_count;
> +	lc->new_number	 = -1;
> +	lc->lc_private   = private_p;
> +	lc->name         = name;
> +	for (i = 0; i < e_count; i++) {
> +		e = lc_entry(lc, i);
> +		e->lc_number = LC_FREE;
> +		list_add(&e->list, &lc->free);
> +		/* memset(,0,) did the rest of init for us */
> +	}
> +}

How's about you remove all `inline' keywords from the whole patchset
and then go back and inline the functions where there is a demonstrable
benefit?  This function won't be one of them!

>
> ...
>
> +/**
> + * lc_free: Frees memory allocated by lc_alloc.
> + * @lc: The lru_cache object
> + */
> +void lc_free(struct lru_cache *lc)
> +{
> +	vfree(lc);
> +}

vmalloc() is a last-resort thing.  It generates slower-to-access memory
and can cause internal fragmentation of the vmalloc arena, leading to
total machine failure.

Can it be avoided?  Often it _can_ be avoided, and the code falls back
to vmalloc() if the more robust memory allocation schemes failed.
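
The usual shape of that fallback is roughly (a sketch; the helper names are
made up, not taken from the patch):

	static void *lc_alloc_mem(size_t bytes)
	{
		void *p = kmalloc(bytes, GFP_KERNEL | __GFP_NOWARN);

		if (!p)			/* fall back only when kmalloc fails */
			p = vmalloc(bytes);
		return p;
	}

	static void lc_free_mem(void *p)
	{
		if (is_vmalloc_addr(p))
			vfree(p);
		else
			kfree(p);
	}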

> +/**
> + * lc_reset: does a full reset for @lc and the hash table slots.
> + * It is roughly the equivalent of re-allocating a fresh lru_cache object,
> + * basically a short cut to lc_free(lc); lc = lc_alloc(...);
> + */

Comment purports to be kerneldoc but doesn't document the formal argument.

> +void lc_reset(struct lru_cache *lc)
> +{
> +	lc_init(lc, size_of_lc(lc->nr_elements, lc->element_size), lc->name,
> +			lc->nr_elements, lc->element_size, lc->lc_private);
> +}
> +
> +size_t	lc_printf_stats(struct seq_file *seq, struct lru_cache *lc)
> +{
> +	/* NOTE:
> +	 * total calls to lc_get are
> +	 * (starving + hits + misses)
> +	 * misses include "dirty" count (update from an other thread in
> +	 * progress) and "changed", when this in fact lead to an successful
> +	 * update of the cache.
> +	 */
> +	return seq_printf(seq, "\t%s: used:%u/%u "
> +		"hits:%lu misses:%lu starving:%lu dirty:%lu changed:%lu\n",
> +		lc->name, lc->used, lc->nr_elements,
> +		lc->hits, lc->misses, lc->starving, lc->dirty, lc->changed);
> +}
> +
> +static unsigned int lc_hash_fn(struct lru_cache *lc, unsigned int enr)
> +{
> +	return enr % lc->nr_elements;
> +}
> +
> +
> +/**
> + * lc_find: Returns the pointer to an element, if the element is present
> + * in the hash table. In case it is not this function returns NULL.

Unfortunately the above must be done in a single 140 column line -
kerneldoc doesn't understand leading lines which have a newline in the
middle.

Please review all kerneldoc comments in the patchset - I won't comment
on them further.
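
For reference, the expected form is a one-line summary followed by the
parameter descriptions, something like (a sketch of the format only):

	/**
	 * lc_find - find the element with a given element number
	 * @lc: the lru_cache object
	 * @enr: element number
	 *
	 * Returns a pointer to the element if it is present in the hash
	 * table, or NULL if it is not.
	 */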

> + * @lc: The lru_cache object
> + * @enr: element number
> + */
> +struct lc_element *lc_find(struct lru_cache *lc, unsigned int enr)
> +{
> +	struct hlist_node *n;
> +	struct lc_element *e;
> +
> +	BUG_ON(!lc);
> +	hlist_for_each_entry(e, n, lc->slot + lc_hash_fn(lc, enr), colision) {
> +		if (e->lc_number == enr)
> +			return e;
> +	}
> +	return NULL;
> +}
> +
>
> ...
>


So I assume that the caller of this facility must provide the locking
for its internals.  Is that documented somewhere?
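
For what it's worth, the caller-side locking as visible in the activity log
patch later in this series looks roughly like this (sketch):

	spin_lock_irq(&mdev->al_lock);
	bm_ext = (struct bm_extent *)lc_find(mdev->resync, enr/AL_EXT_PER_BM_SECT);
	...
	al_ext = lc_get(mdev->act_log, enr);
	spin_unlock_irq(&mdev->al_lock);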


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 03/16] DRBD: activity_log
  2009-04-30 11:26     ` [PATCH 03/16] DRBD: activity_log Philipp Reisner
  2009-04-30 11:26       ` [PATCH 04/16] DRBD: bitmap Philipp Reisner
@ 2009-05-01  9:01       ` Andrew Morton
  2009-05-02 17:00         ` Lars Ellenberg
  1 sibling, 1 reply; 90+ messages in thread
From: Andrew Morton @ 2009-05-01  9:01 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: linux-kernel, Jens Axboe, Greg KH, Neil Brown, James Bottomley,
	Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche, Lars Ellenberg

On Thu, 30 Apr 2009 13:26:39 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:

> Within DRBD the activity log is used to track extents (4MB each) in which IO
> happens (or happened recently). It is based on the LRU cache. Each change of
> the activity log causes a meta data update (single sector write).  The size
> of the activity log is configured by the user, and is a tradeoff between
> minimizing updates to the meta data and the resync time after the crash of a
> primary node.
> 

OK, this is where I lose the plot and start bikeshed-painting.

Has anyone done a serious review of this stuff yet?  If so, a link
would be appreciated.

> 
> ---
> diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
> new file mode 100644
> index 0000000..c894b4f
> --- /dev/null
> +++ b/drivers/block/drbd/drbd_actlog.c
> @@ -0,0 +1,1458 @@
> +/*
> +   drbd_actlog.c
> +
> +   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
> +
> +   Copyright (C) 2003-2008, LINBIT Information Technologies GmbH.
> +   Copyright (C) 2003-2008, Philipp Reisner <philipp.reisner@linbit.com>.
> +   Copyright (C) 2003-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
> +
> +   drbd is free software; you can redistribute it and/or modify
> +   it under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 2, or (at your option)
> +   any later version.
> +
> +   drbd is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +   GNU General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with drbd; see the file COPYING.  If not, write to
> +   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
> +
> + */
> +
> +#include <linux/slab.h>
> +#include <linux/drbd.h>
> +#include "drbd_int.h"
> +#include "drbd_tracing.h"
> +#include "drbd_wrappers.h"
> +
> +/* I do not believe that all storage medias can guarantee atomic
> + * 512 byte write operations.

ooh.  I think you'd be safe assuming that in the Linux context. 
Everything else does.

Not sure what this means, really.

> When the journal is read, only
> + * transactions with correct xor_sums are considered.
> + * sizeof() = 512 byte */
> +struct __attribute__((packed)) al_transaction {
> +	u32       magic;
> +	u32       tr_number;
> +	struct __attribute__((packed)) {
> +		u32 pos;
> +		u32 extent; } updates[1 + AL_EXTENTS_PT];
> +	u32       xor_sum;
> +};

Please use __packed (whole patchset).
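
I.e. something like (a sketch of the same struct, otherwise unchanged):

	struct al_transaction {
		u32	magic;
		u32	tr_number;
		struct {
			u32 pos;
			u32 extent;
		} __packed updates[1 + AL_EXTENTS_PT];
		u32	xor_sum;
	} __packed;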

> +struct update_odbm_work {
> +	struct drbd_work w;
> +	unsigned int enr;
> +};
> +
> +struct update_al_work {
> +	struct drbd_work w;
> +	struct lc_element *al_ext;
> +	struct completion event;
> +	unsigned int enr;
> +	/* if old_enr != LC_FREE, write corresponding bitmap sector, too */
> +	unsigned int old_enr;
> +};
> +
> +struct drbd_atodb_wait {
> +	atomic_t           count;
> +	struct completion  io_done;
> +	struct drbd_conf   *mdev;
> +	int                error;
> +};
> +
> +
> +int w_al_write_transaction(struct drbd_conf *, struct drbd_work *, int);
> +
> +/* The actual tracepoint needs to have constant number of known arguments...
> + */
> +void trace_drbd_resync(struct drbd_conf *mdev, int level, const char *fmt, ...)
> +{
> +	va_list ap;
> +
> +	va_start(ap, fmt);
> +	trace__drbd_resync(mdev, level, fmt, ap);
> +	va_end(ap);
> +}


The "trace_" namespace is already taken.

I suggest that all globally visible symbols in this driver start with
"drbd_".

> +STATIC int _drbd_md_sync_page_io(struct drbd_conf *mdev,
> +				 struct drbd_backing_dev *bdev,
> +				 struct page *page, sector_t sector,
> +				 int rw, int size)
> +{
> +	struct bio *bio;
> +	struct drbd_md_io md_io;
> +	int ok;
> +
> +	md_io.mdev = mdev;
> +	init_completion(&md_io.event);

urgh, you're going to have to scratch your head over
DECLARE_COMPLETION_ONSTACK() here.
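
The lockdep-friendly form for a plain on-stack completion would be
something like (sketch, made-up variable name):

	DECLARE_COMPLETION_ONSTACK(md_io_done);
	...
	wait_for_completion(&md_io_done);

The wrinkle here is that the completion is embedded in the on-stack
struct drbd_md_io rather than being a plain local variable.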

Has this code all been tested with all kernel debug options enabled? 
including lockdep?

> +	md_io.error = 0;
> +
> +	if (rw == WRITE && !test_bit(MD_NO_BARRIER, &mdev->flags))
> +		rw |= (1<<BIO_RW_BARRIER);
> +	rw |= ((1<<BIO_RW_UNPLUG) | (1<<BIO_RW_SYNCIO));

The semantics of these flags seem to have been changing at 15Hz lately.
You might want to check that this code still does what you think it
does.

It would be prudent to add comments explaining precisely what behaviour
the driver is expecting from the lower layers, and why it wants that
behaviour.

> + retry:
> +	bio = bio_alloc(GFP_NOIO, 1);
> +	bio->bi_bdev = bdev->md_bdev;
> +	bio->bi_sector = sector;
> +	ok = (bio_add_page(bio, page, size, 0) == size);
> +	if (!ok)
> +		goto out;
> +	bio->bi_private = &md_io;
> +	bio->bi_end_io = drbd_md_io_complete;
> +	bio->bi_rw = rw;
> +
> +	trace_drbd_bio(mdev, "Md", bio, 0, NULL);
> +
> +	if (FAULT_ACTIVE(mdev, (rw & WRITE) ? DRBD_FAULT_MD_WR : DRBD_FAULT_MD_RD))
> +		bio_endio(bio, -EIO);
> +	else
> +		submit_bio(rw, bio);
> +	wait_for_completion(&md_io.event);
> +	ok = bio_flagged(bio, BIO_UPTODATE) && md_io.error == 0;
> +
> +	/* check for unsupported barrier op.
> +	 * would rather check on EOPNOTSUPP, but that is not reliable.
> +	 * don't try again for ANY return value != 0 */
> +	if (unlikely(bio_barrier(bio) && !ok)) {
> +		/* Try again with no barrier */
> +		dev_warn(DEV, "Barriers not supported on meta data device - disabling\n");
> +		set_bit(MD_NO_BARRIER, &mdev->flags);
> +		rw &= ~(1 << BIO_RW_BARRIER);
> +		bio_put(bio);

Maybe the original bio could be reused.

> +		goto retry;
> +	}
> + out:
> +	bio_put(bio);
> +	return ok;
> +}
> +
> +int drbd_md_sync_page_io(struct drbd_conf *mdev, struct drbd_backing_dev *bdev,
> +			 sector_t sector, int rw)
> +{
> +	int hardsect, mask, ok;
> +	int offset = 0;
> +	struct page *iop = mdev->md_io_page;
> +
> +	D_ASSERT(mutex_is_locked(&mdev->md_io_mutex));
> +
> +	BUG_ON(!bdev->md_bdev);
> +
> +	hardsect = drbd_get_hardsect(bdev->md_bdev);

hm.  Sounds like hardsect should have type sector_t.

> +	if (hardsect == 0)
> +		hardsect = MD_HARDSECT;
> +
> +	/* in case hardsect != 512 [ s390 only? ] */

Nope, it looks like it should have been called hardsect_size?

> +	if (hardsect != MD_HARDSECT) {
> +		mask = (hardsect / MD_HARDSECT) - 1;
> +		D_ASSERT(mask == 1 || mask == 3 || mask == 7);
> +		D_ASSERT(hardsect == (mask+1) * MD_HARDSECT);
> +		offset = sector & mask;
> +		sector = sector & ~mask;
> +		iop = mdev->md_io_tmpp;
> +
> +		if (rw == WRITE) {

This will evaluate to false if someone passed you WRITE_SYNC.  Maybe it
should have been `if (rw & WRITE)'?

> +			void *p = page_address(mdev->md_io_page);
> +			void *hp = page_address(mdev->md_io_tmpp);

I trust these pages cannot be in highmem.  If they are, they'll need
kmapping.

> +			ok = _drbd_md_sync_page_io(mdev, bdev, iop,
> +						   sector, READ, hardsect);
> +
> +			if (unlikely(!ok)) {
> +				dev_err(DEV, "drbd_md_sync_page_io(,%llus,"
> +				    "READ [hardsect!=512]) failed!\n",
> +				    (unsigned long long)sector);
> +				return 0;
> +			}
> +
> +			memcpy(hp + offset*MD_HARDSECT , p, MD_HARDSECT);

whitespace went funny.

> +		}
> +	}
> +
> +	if (sector < drbd_md_first_sector(bdev) ||
> +	    sector > drbd_md_last_sector(bdev))
> +		dev_alert(DEV, "%s [%d]:%s(,%llus,%s) out of range md access!\n",
> +		     current->comm, current->pid, __func__,
> +		     (unsigned long long)sector, rw ? "WRITE" : "READ");
> +
> +	ok = _drbd_md_sync_page_io(mdev, bdev, iop, sector, rw, hardsect);
> +	if (unlikely(!ok)) {
> +		dev_err(DEV, "drbd_md_sync_page_io(,%llus,%s) failed!\n",
> +		    (unsigned long long)sector, rw ? "WRITE" : "READ");
> +		return 0;
> +	}
> +
> +	if (hardsect != MD_HARDSECT && rw == READ) {
> +		void *p = page_address(mdev->md_io_page);
> +		void *hp = page_address(mdev->md_io_tmpp);
> +
> +		memcpy(p, hp + offset*MD_HARDSECT, MD_HARDSECT);
> +	}
> +
> +	return ok;
> +}
> +
> +static inline
> +struct lc_element *_al_get(struct drbd_conf *mdev, unsigned int enr)
> +{
> +	struct lc_element *al_ext;
> +	struct bm_extent  *bm_ext;
> +	unsigned long     al_flags = 0;
> +
> +	spin_lock_irq(&mdev->al_lock);
> +	bm_ext = (struct bm_extent *)
> +		lc_find(mdev->resync, enr/AL_EXT_PER_BM_SECT);

OK, what's going on here.

lc_find() returns an lc_element* and it's getting cast to a bm_extent*.

<tries to find the definition of bm_extent>

OK, it's defined five patches in the future.  Tricky!

+struct bm_extent {
+	struct lc_element lce;
+	int rs_left; /* number of bits set (out of sync) in this extent. */
+	int rs_failed; /* number of failed resync requests in this extent. */
+	unsigned long flags;
+};
	
I see what you did there.

Please use container_of().  It makes things much clearer and removes
the requirement that the embedded lc_element be the first element in
the outer strut.

A good way of doing this is to implement the container_of() in a single
helper function and to then call that helper function in all the
relevant places.  This improves readability and provides a much better
level of typechecking.

Doubtless these comments apply to many places in this patchset.
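
To illustrate, a sketch of such a helper (name made up; bm_extent as quoted
above):

	static inline struct bm_extent *lc_to_bm_extent(struct lc_element *e)
	{
		return e ? container_of(e, struct bm_extent, lce) : NULL;
	}

	...
	bm_ext = lc_to_bm_extent(lc_find(mdev->resync, enr/AL_EXT_PER_BM_SECT));

The NULL check keeps the lc_find() == NULL case working even if the
embedded lc_element ever stops being the first member.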

> +	if (unlikely(bm_ext != NULL)) {
> +		if (test_bit(BME_NO_WRITES, &bm_ext->flags)) {
> +			spin_unlock_irq(&mdev->al_lock);
> +			return NULL;
> +		}
> +	}
> +	al_ext   = lc_get(mdev->act_log, enr);
> +	al_flags = mdev->act_log->flags;
> +	spin_unlock_irq(&mdev->al_lock);
> +
> +	/*
> +	if (!al_ext) {
> +		if (al_flags & LC_STARVING)
> +			dev_warn(DEV, "Have to wait for LRU element (AL too small?)\n");
> +		if (al_flags & LC_DIRTY)
> +			dev_warn(DEV, "Ongoing AL update (AL device too slow?)\n");
> +	}
> +	*/
> +
> +	return al_ext;
> +}
> +
> +void drbd_al_begin_io(struct drbd_conf *mdev, sector_t sector)
> +{
> +	unsigned int enr = (sector >> (AL_EXTENT_SIZE_B-9));

This limits the maximum size of a device to 4 gigasectors * <however
much that is>.

Do we have a problem here?

> +	struct lc_element *al_ext;
> +	struct update_al_work al_work;
> +
> +	D_ASSERT(atomic_read(&mdev->local_cnt) > 0);
> +
> +	trace_drbd_actlog(mdev, sector, "al_begin_io");
> +
> +	wait_event(mdev->al_wait, (al_ext = _al_get(mdev, enr)));
> +
> +	if (al_ext->lc_number != enr) {
> +		/* drbd_al_write_transaction(mdev,al_ext,enr);
> +		   generic_make_request() are serialized on the
> +		   current->bio_tail list now. Therefore we have
> +		   to deligate writing something to AL to the
> +		   worker thread. */
> +		init_completion(&al_work.event);
> +		al_work.al_ext = al_ext;
> +		al_work.enr = enr;
> +		al_work.old_enr = al_ext->lc_number;
> +		al_work.w.cb = w_al_write_transaction;
> +		drbd_queue_work_front(&mdev->data.work, &al_work.w);
> +		wait_for_completion(&al_work.event);
> +
> +		mdev->al_writ_cnt++;
> +
> +		spin_lock_irq(&mdev->al_lock);
> +		lc_changed(mdev->act_log, al_ext);
> +		spin_unlock_irq(&mdev->al_lock);
> +		wake_up(&mdev->al_wait);
> +	}
> +}
> +
> +void drbd_al_complete_io(struct drbd_conf *mdev, sector_t sector)
> +{
> +	unsigned int enr = (sector >> (AL_EXTENT_SIZE_B-9));
> +	struct lc_element *extent;
> +	unsigned long flags;
> +
> +	trace_drbd_actlog(mdev, sector, "al_complete_io");
> +
> +	spin_lock_irqsave(&mdev->al_lock, flags);
> +
> +	extent = lc_find(mdev->act_log, enr);
> +
> +	if (!extent) {
> +		spin_unlock_irqrestore(&mdev->al_lock, flags);
> +		dev_err(DEV, "al_complete_io() called on inactive extent %u\n", enr);
> +		return;
> +	}
> +
> +	if (lc_put(mdev->act_log, extent) == 0)
> +		wake_up(&mdev->al_wait);
> +
> +	spin_unlock_irqrestore(&mdev->al_lock, flags);
> +}
> +
> +int
> +w_al_write_transaction(struct drbd_conf *mdev, struct drbd_work *w, int unused)
> +{

We're getting deep into uncommented territory here.  This makes the code
much harder to review, and makes the review less effective and makes
the code harder to maintain and all those other things you already know ;)

> +	struct update_al_work *aw = (struct update_al_work *)w;
> +	struct lc_element *updated = aw->al_ext;
> +	const unsigned int new_enr = aw->enr;
> +	const unsigned int evicted = aw->old_enr;
> +
> +	struct al_transaction *buffer;
> +	sector_t sector;
> +	int i, n, mx;
> +	unsigned int extent_nr;
> +	u32 xor_sum = 0;

strange newline in middle of local definitions.

> +	if (!inc_local(mdev)) {

<tries to find inc_local>

<finds it four patches in the future>

this is harder than it needs to be

<considers hunting for _inc_local_if_state>

<changes mind>

inc_local() isn't a very good choice of identifier, given its
potentially-global scope.

The amount of inlining in drbd_int.h is bizarre.

> +		dev_err(DEV, "inc_local() failed in w_al_write_transaction\n");
> +		complete(&((struct update_al_work *)w)->event);
> +		return 1;
> +	}
> +	/* do we have to do a bitmap write, first?
> +	 * TODO reduce maximum latency:
> +	 * submit both bios, then wait for both,
> +	 * instead of doing two synchronous sector writes. */
> +	if (mdev->state.conn < C_CONNECTED && evicted != LC_FREE)
> +		drbd_bm_write_sect(mdev, evicted/AL_EXT_PER_BM_SECT);
> +
> +	mutex_lock(&mdev->md_io_mutex); /* protects md_io_page, al_tr_cycle, ... */
> +	buffer = (struct al_transaction *)page_address(mdev->md_io_page);
> +
> +	buffer->magic = __constant_cpu_to_be32(DRBD_MAGIC);

DRBD_MAGIC should be defined in magic.h.  Maybe it was - I didn't check.

> +	buffer->tr_number = cpu_to_be32(mdev->al_tr_number);
> +
> +	n = lc_index_of(mdev->act_log, updated);
> +
> +	buffer->updates[0].pos = cpu_to_be32(n);
> +	buffer->updates[0].extent = cpu_to_be32(new_enr);
> +
> +	xor_sum ^= new_enr;
> +
> +	mx = min_t(int, AL_EXTENTS_PT,
> +		   mdev->act_log->nr_elements - mdev->al_tr_cycle);
> +	for (i = 0; i < mx; i++) {
> +		extent_nr = lc_entry(mdev->act_log,
> +				     mdev->al_tr_cycle+i)->lc_number;
> +		buffer->updates[i+1].pos = cpu_to_be32(mdev->al_tr_cycle+i);
> +		buffer->updates[i+1].extent = cpu_to_be32(extent_nr);
> +		xor_sum ^= extent_nr;
> +	}
> +	for (; i < AL_EXTENTS_PT; i++) {
> +		buffer->updates[i+1].pos = __constant_cpu_to_be32(-1);
> +		buffer->updates[i+1].extent = __constant_cpu_to_be32(LC_FREE);
> +		xor_sum ^= LC_FREE;
> +	}
> +	mdev->al_tr_cycle += AL_EXTENTS_PT;
> +	if (mdev->al_tr_cycle >= mdev->act_log->nr_elements)
> +		mdev->al_tr_cycle = 0;
> +
> +	buffer->xor_sum = cpu_to_be32(xor_sum);
> +
> +	sector =  mdev->bc->md.md_offset
> +		+ mdev->bc->md.al_offset + mdev->al_tr_pos;
> +
> +	if (!drbd_md_sync_page_io(mdev, mdev->bc, sector, WRITE)) {
> +		drbd_chk_io_error(mdev, 1, TRUE);
> +		drbd_io_error(mdev, TRUE);
> +	}
> +
> +	if (++mdev->al_tr_pos >
> +	    div_ceil(mdev->act_log->nr_elements, AL_EXTENTS_PT))
> +		mdev->al_tr_pos = 0;
> +
> +	D_ASSERT(mdev->al_tr_pos < MD_AL_MAX_SIZE);
> +	mdev->al_tr_number++;
> +
> +	mutex_unlock(&mdev->md_io_mutex);
> +
> +	complete(&((struct update_al_work *)w)->event);
> +	dec_local(mdev);
> +
> +	return 1;
> +}
> +
> +/**
> + * drbd_al_read_tr: Reads a single transaction record form the

"from"

> + * on disk activity log.
> + * Returns -1 on IO error, 0 on checksum error and 1 if it is a valid
> + * record.
> + */
> +STATIC int drbd_al_read_tr(struct drbd_conf *mdev,

Can we please do s/STATIC/static/g and remove the definition of STATIC,
whereever it is?

> +			   struct drbd_backing_dev *bdev,
> +			   struct al_transaction *b,
> +			   int index)
> +{
> +	sector_t sector;
> +	int rv, i;
> +	u32 xor_sum = 0;
> +
> +	sector = bdev->md.md_offset + bdev->md.al_offset + index;

Strange that `index' doesn't have sector_t type, but I don't know (and
wasn't told) what it represents.

> +	/* Dont process error normally,
> +	 * as this is done before disk is atached! */
> +	if (!drbd_md_sync_page_io(mdev, bdev, sector, READ))
> +		return -1;
> +
> +	rv = (be32_to_cpu(b->magic) == DRBD_MAGIC);
> +
> +	for (i = 0; i < AL_EXTENTS_PT + 1; i++)
> +		xor_sum ^= be32_to_cpu(b->updates[i].extent);
> +	rv &= (xor_sum == be32_to_cpu(b->xor_sum));
> +
> +	return rv;
> +}
> +
>
> ...
>
> +#define S2W(s)	((s)<<(BM_EXT_SIZE_B-BM_BLOCK_SIZE_B-LN2_BPL))

S2W means, umm, "sector to ..."?

Optimise the code for the code reader, please.

> +/* activity log to on disk bitmap -- prepare bio unless that sector
> + * is already covered by previously prepared bios */
> +STATIC int atodb_prepare_unless_covered(struct drbd_conf *mdev,
> +					struct bio **bios,
> +					unsigned int enr,
> +					struct drbd_atodb_wait *wc) __must_hold(local)
> +{
> +	struct bio *bio;
> +	struct page *page;
> +	sector_t on_disk_sector = enr + mdev->bc->md.md_offset
> +				      + mdev->bc->md.bm_offset;
> +	unsigned int page_offset = PAGE_SIZE;
> +	int offset;
> +	int i = 0;
> +	int err = -ENOMEM;
> +
> +	/* Check if that enr is already covered by an already created bio.
> +	 * Caution, bios[] is not NULL terminated,
> +	 * but only initialized to all NULL.
> +	 * For completely scattered activity log,
> +	 * the last invocation iterates over all bios,
> +	 * and finds the last NULL entry.
> +	 */
> +	while ((bio = bios[i])) {
> +		if (bio->bi_sector == on_disk_sector)
> +			return 0;
> +		i++;
> +	}
> +	/* bios[i] == NULL, the next not yet used slot */
> +
> +	bio = bio_alloc(GFP_KERNEL, 1);

Should it be GFP_NOIO?

> +	if (bio == NULL)
> +		return -ENOMEM;
> +
> +	if (i > 0) {
> +		const struct bio_vec *prev_bv = bios[i-1]->bi_io_vec;
> +		page_offset = prev_bv->bv_offset + prev_bv->bv_len;
> +		page = prev_bv->bv_page;
> +	}
> +	if (page_offset == PAGE_SIZE) {
> +		page = alloc_page(__GFP_HIGHMEM);
> +		if (page == NULL)
> +			goto out_bio_put;
> +		page_offset = 0;
> +	} else {
> +		get_page(page);
> +	}
> +
> +	offset = S2W(enr);
> +	drbd_bm_get_lel(mdev, offset,
> +			min_t(size_t, S2W(1), drbd_bm_words(mdev) - offset),
> +			kmap(page) + page_offset);
> +	kunmap(page);
> +
> +	bio->bi_private = wc;
> +	bio->bi_end_io = atodb_endio;
> +	bio->bi_bdev = mdev->bc->md_bdev;
> +	bio->bi_sector = on_disk_sector;
> +
> +	if (bio_add_page(bio, page, MD_HARDSECT, page_offset) != MD_HARDSECT)
> +		goto out_put_page;
> +
> +	atomic_inc(&wc->count);
> +	/* we already know that we may do this...
> +	 * inc_local_if_state(mdev,D_ATTACHING);
> +	 * just get the extra reference, so that the local_cnt reflects
> +	 * the number of pending IO requests DRBD at its backing device.
> +	 */
> +	atomic_inc(&mdev->local_cnt);
> +
> +	bios[i] = bio;
> +
> +	return 0;
> +
> +out_put_page:
> +	err = -EINVAL;
> +	put_page(page);
> +out_bio_put:
> +	bio_put(bio);
> +	return err;
> +}
> +
> +/**
> + * drbd_al_to_on_disk_bm:
> + * Writes the areas of the bitmap which are covered by the AL.

what's an AL?

> + * called when we detach (unconfigure) local storage,
> + * or when we go from R_PRIMARY to R_SECONDARY state.
> + */
> +void drbd_al_to_on_disk_bm(struct drbd_conf *mdev)
> +{
> +	int i, nr_elements;
> +	unsigned int enr;
> +	struct bio **bios;
> +	struct drbd_atodb_wait wc;
> +
> +	ERR_IF (!inc_local_if_state(mdev, D_ATTACHING))
> +		return; /* sorry, I don't have any act_log etc... */
> +
> +	wait_event(mdev->al_wait, lc_try_lock(mdev->act_log));
> +
> +	nr_elements = mdev->act_log->nr_elements;
> +
> +	bios = kzalloc(sizeof(struct bio *) * nr_elements, GFP_KERNEL);

Please check all GFP_KERNELS, see if they should have been GFP_NOIO.

> +	if (!bios)
> +		goto submit_one_by_one;
> +
> +	atomic_set(&wc.count, 0);
> +	init_completion(&wc.io_done);

Again, we have the DECLARE_COMPLETION_ONSTACK() thing to worry about here.

<attention span exhausted, sorry>

>
> ...
>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-01  8:59 ` [PATCH 00/16] DRBD: a block device for HA clusters Andrew Morton
@ 2009-05-01 11:15   ` Lars Marowsky-Bree
  2009-05-01 13:14     ` Dave Jones
  2009-05-05  4:05     ` Christian Kujau
  2009-05-02  7:33   ` Bart Van Assche
  1 sibling, 2 replies; 90+ messages in thread
From: Lars Marowsky-Bree @ 2009-05-01 11:15 UTC (permalink / raw)
  To: Andrew Morton, Philipp Reisner
  Cc: linux-kernel, Jens Axboe, Greg KH, Neil Brown, James Bottomley,
	Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg

On 2009-05-01T01:59:02, Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
> 
> > This is a repost of DRBD
> How fast is it?

From experience, it achieves performance of approx. 98% of wire or
spindle speed, so it is considered rather efficient code.

> Is it being used anywhere for anything?  If so, where and what?

It is used by many customers (thousands world-wide, I'm sure) to
replicate block device data locally (to replace more expensive SANs
while achieving higher availability) or async/remotely (for disaster
recovery).

The code is rather stable, the first drbd deployments date back many
years - drbd0.7 for example has been shipping with SLES10/9, and 0.6
with SLES8 already. The new drbd8 code is shipping on SLE11 and used
also in combination with OCFS2.

So we very much welcome the renewed and persistent interest of merging
the code in mainline (once all serious issues are addressed).

Even if in the long-term a merge with other raid implementations is
pursued (which I'd welcome even more), the existence of so many
deployments means we'll need the code for awhile still.


Regards,
    Lars

-- 
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-01 11:15   ` Lars Marowsky-Bree
@ 2009-05-01 13:14     ` Dave Jones
  2009-05-01 19:14       ` Andrew Morton
  2009-05-05  4:05     ` Christian Kujau
  1 sibling, 1 reply; 90+ messages in thread
From: Dave Jones @ 2009-05-01 13:14 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: Andrew Morton, Philipp Reisner, linux-kernel, Jens Axboe,
	Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Nikanth Karthikesan, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche, Lars Ellenberg

On Fri, May 01, 2009 at 01:15:54PM +0200, Lars Marowsky-Bree wrote:
 
 > > Is it being used anywhere for anything?  If so, where and what?
 > 
 > It is used by many customers (thousands world-wide, I'm sure) to
 > replicate block device data locally (to replace more expensive SANs
 > while achieving higher availability) or async/remotely (for disaster
 > recovery).
 > 
 > The code is rather stable, the first drbd deployments date back many
 > years - drbd0.7 for example has been shipping with SLES10/9, and 0.6
 > with SLES8 already. The new drbd8 code is shipping on SLE11 and used
 > also in combination with OCFS2.
 > 
 > So we very much welcome the renewed and persistent interest of merging
 > the code in mainline (once all serious issues are addressed).
 > 
 > Even if in the long-term a merge with other raid implementations is
 > pursued (which I'd welcome even more), the existence of so many
 > deployments means we'll need the code for awhile still.

I've not looked through the patchset, and it's a bit outside my
domain of expertise, but I can attest we have had requests to
merge it in Fedora (which we've given the usual "get it upstream" response to).
The folks who run the Fedora infrastructure have been enthusiastic
about it for a while (which is why I ended up on the CC for this thread I guess).
I don't have details about their exact use-cases, but if desired, I can
find out more.

	Dave


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-01 13:14     ` Dave Jones
@ 2009-05-01 19:14       ` Andrew Morton
  0 siblings, 0 replies; 90+ messages in thread
From: Andrew Morton @ 2009-05-01 19:14 UTC (permalink / raw)
  To: Dave Jones
  Cc: lmb, philipp.reisner, linux-kernel, jens.axboe, gregkh, neilb,
	James.Bottomley, sam, knikanth, nab, kyle, bart.vanassche,
	lars.ellenberg

On Fri, 1 May 2009 09:14:25 -0400
Dave Jones <davej@redhat.com> wrote:

> On Fri, May 01, 2009 at 01:15:54PM +0200, Lars Marowsky-Bree wrote:
>  
>  > > Is it being used anywhere for anything?  If so, where and what?
>  > 
>  > It is used by many customers (thousands world-wide, I'm sure) to
>  > replicate block device data locally (to replace more expensive SANs
>  > while achieving higher availability) or async/remotely (for disaster
>  > recovery).
>  > 
>  > The code is rather stable, the first drbd deployments date back many
>  > years - drbd0.7 for example has been shipping with SLES10/9, and 0.6
>  > with SLES8 already. The new drbd8 code is shipping on SLE11 and used
>  > also in combination with OCFS2.
>  > 
>  > So we very much welcome the renewed and persistent interest of merging
>  > the code in mainline (once all serious issues are addressed).
>  > 
>  > Even if in the long-term a merge with other raid implementations is
>  > pursued (which I'd welcome even more), the existence of so many
>  > deployments means we'll need the code for awhile still.
> 
> I've not looked through the patchset, and it's a bit outside my
> domain of expertise, but I can attest we have had requests to
> merge it in Fedora (which we've given the usual "get it upstream" response to).
> The folks who run the Fedora infrastructure have been enthusiastic
> about it for a while (which is why I ended up on the CC for this thread I guess).
> I don't have details about their exact use-cases, but if desired, I can
> find out more.
> 

Oh.  Thanks.  Well we should all get cracking on it then.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-01  8:59 ` [PATCH 00/16] DRBD: a block device for HA clusters Andrew Morton
  2009-05-01 11:15   ` Lars Marowsky-Bree
@ 2009-05-02  7:33   ` Bart Van Assche
  2009-05-03  5:36     ` Willy Tarreau
  1 sibling, 1 reply; 90+ messages in thread
From: Bart Van Assche @ 2009-05-02  7:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH, Neil Brown,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Kyle Moffett, Lars Ellenberg

On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
>
>> This is a repost of DRBD
>
> Is it being used anywhere for anything?  If so, where and what?

One popular application is to run iSCSI and HA software on top of DRBD
in order to build a highly available iSCSI storage target.

Bart.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/16] DRBD: lru_cache
  2009-05-01  8:59     ` [PATCH 02/16] DRBD: lru_cache Andrew Morton
@ 2009-05-02 15:26       ` Lars Ellenberg
  2009-05-02 17:58         ` Andrew Morton
  0 siblings, 1 reply; 90+ messages in thread
From: Lars Ellenberg @ 2009-05-02 15:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH, Neil Brown,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche

On Fri, May 01, 2009 at 01:59:56AM -0700, Andrew Morton wrote:
> On Thu, 30 Apr 2009 13:26:38 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
> 
> > The lru_cache is a fixed size cache of equal sized objects. It allows its
> > users to do arbitrary transactions in case an element in the cache needs to
> > be replaced. Its replacement policy is LRU.
> > 
> 
> None of this really looks drbd-specific.
> 
> Would it not be better to present this as a general library function? 
> lib/lru_cache.c?

yes. will do.

> I think I might have asked this before.  If I did, then thwap-to-you
> for not permanently answering it in the changelog ;)
> 
> >
> > ...
> >
> > +#define lc_e_base(lc)  ((char *)((lc)->slot + (lc)->nr_elements))
> > +#define lc_entry(lc, i) ((struct lc_element *) \
> > +		       (lc_e_base(lc) + (i)*(lc)->element_size))
> > +#define lc_index_of(lc, e) (((char *)(e) - lc_e_base(lc))/(lc)->element_size)
> 
> The macros reference their arguments multiple times and hence are
> inefficient and/or buggy and/or unpredictable when passed an expression
> with side-effects.
> 
> If possible this should be fixed by turning them into regular C
> functions.  Inlined C functions if that makes sense (it frequently
> doesn't).
> 
> A pleasing side-effect of this conversion is that for some reason
> developers are more likely to document C functions than they are macros
> (hint).
> 
> I don't understand what these macros are doing and can't be bothered
> reverse-engineering the code to work that out.  But all the typecasting
> looks fishy.

I have a half-finished rewrite of that piece of code anyway,
so I guess I'll have to finish it then.

basically it allocates one contiguous memory block,
to avoid additional small allocations and pointer arrays.

the in-memory structure is

struct lru_cache {
        struct list_head active;
        struct list_head quiet;
        struct list_head free;
        size_t element_size;          <-- parameter to "lc_alloc"
        unsigned int  nr_elements;    <-- parameter to "lc_alloc"
        unsigned int  new_number;

        unsigned int used;
        unsigned long flags;
        unsigned long hits, misses, starving, dirty, changed;

        struct lc_element *changing_element; /* just for paranoia */

        const char *name;

        struct hlist_head slot[0];
        /* hash colision chains here, then element storage. */
};

so we have the fixed size list heads,
the size of a single such "element" (to allow the user
to add a small payload),
the number of hash slots and "elements" following this header,
some counters,
and the hash slot array slot[0] closing the struct;
followed in memory by:
struct hlist_head[nr_elements];
an array of element_size blobs[nr_elements];

these "blobs" start with the struct lru_element,
possibly followed by some user payload.

the "index" you are asking about later is
an index into that "blob" array,
and is used primarily to initialize the state of this thing
from an on-disk representation (the "activity log", "AL"),
for crash recovery purposes.

the typecasting is necessary to get from the slot[0] to the "elements"
skipping the hash slots.
using "container of" or something like that would obscure the fact that,
as currently implemented, the "lru_element" _must_ be the first member
of any payload structure.  but see below.

> > +static inline void lc_init(struct lru_cache *lc,
<24 lines/>
> > +}
> 
> How's about you remove all `inline' keywords from the whole patchset
> and then go back and inline the functions where there is a demonstrable
> benefit?  This function won't be one of them!

will do.

> > +/**
> > + * lc_free: Frees memory allocated by lc_alloc.
> > + * @lc: The lru_cache object
> > + */
> > +void lc_free(struct lru_cache *lc)
> > +{
> > +	vfree(lc);
> > +}
> 
> vmalloc() is a last-resort thing.  It generates slower-to-access memory
> and can cause internal fragmentation of the vmalloc arena, leading to
> total machine failure.
> 
> Can it be avoided?  Often it _can_ be avoided, and the code falls back
> to vmalloc() if the more robust memory allocation schemes failed.

yes, it can be avoided.
it used to be a kmalloc. but the mentioned "single memory block"
implementation hit some kmalloc limit, and we then got lazy,
going the simple "two-line-patch" way.

we'll go for separate, and independently kmalloc'ed,
	struct hlist_head *slot;
	struct lru_element *blob;
then, or maybe for "struct page *blob_pages;" instead.
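
the former would be roughly (just a sketch, names not final):

	lc->slot = kcalloc(e_count, sizeof(struct hlist_head), GFP_KERNEL);
	lc->blob = kcalloc(e_count, e_size, GFP_KERNEL);  /* lc_element storage */
	if (!lc->slot || !lc->blob) {
		kfree(lc->blob);	/* kfree(NULL) is fine */
		kfree(lc->slot);
		return NULL;		/* hypothetical error handling */
	}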

> Please review all kerneldoc comments in the patchset - I won't comment
> on them further.

will do.

> > + * @lc: The lru_cache object
> > + * @enr: element number
> > + */
> > +struct lc_element *lc_find(struct lru_cache *lc, unsigned int enr)
> > +{
> > +	struct hlist_node *n;
> > +	struct lc_element *e;
> > +
> > +	BUG_ON(!lc);
> > +	hlist_for_each_entry(e, n, lc->slot + lc_hash_fn(lc, enr), colision) {
> > +		if (e->lc_number == enr)
> > +			return e;
> > +	}
> > +	return NULL;
> > +}
> > +
> >
> > ...
> >
> 
> 
> So I assume that the caller of this facility must provide the locking
> for its internals.  Is that documented somewhere?

in my half-finished rewrite ;)

Thanks,

	Lars

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 04/16] DRBD: bitmap
  2009-04-30 11:26       ` [PATCH 04/16] DRBD: bitmap Philipp Reisner
  2009-04-30 11:26         ` [PATCH 05/16] DRBD: request Philipp Reisner
@ 2009-05-02 15:41         ` James Bottomley
  2009-05-02 17:28           ` Lars Ellenberg
  1 sibling, 1 reply; 90+ messages in thread
From: James Bottomley @ 2009-05-02 15:41 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg

On Thu, 2009-04-30 at 13:26 +0200, Philipp Reisner wrote:
> DRBD maintains a dirty bitmap in case it has to run without peer node or
> without local disk. Writes to the on disk dirty bitmap are minimized by the
> activity log (=AL). Each time an extent is evicted from the AL the part of
> the bitmap no longer covered by the AL is written to disk.
> 
> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

The way the bitmap and activity log work are very similar to the way the
md bitmap works (and are implemented for almost exactly the same
reason).  Is there any way we could combine them?

James



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 10/16] DRBD: proc
  2009-04-30 11:26                   ` [PATCH 10/16] DRBD: proc Philipp Reisner
  2009-04-30 11:26                     ` [PATCH 11/16] DRBD: worker Philipp Reisner
@ 2009-05-02 15:44                     ` James Bottomley
  2009-05-02 20:23                       ` Lars Ellenberg
  1 sibling, 1 reply; 90+ messages in thread
From: James Bottomley @ 2009-05-02 15:44 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg

On Thu, 2009-04-30 at 13:26 +0200, Philipp Reisner wrote:
> The /proc/drbd interface.
> 
> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

/proc is deprecated for device control and printing.  I can see why you
want this (because it looks very similar to /proc/mdstat) but might it
not be better to convert it to a proper sysfs view with your own bus and
one device per connection with the stats?

James



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 12/16] DRBD: variable_length_integer_encoding
  2009-04-30 11:26                       ` [PATCH 12/16] DRBD: variable_length_integer_encoding Philipp Reisner
  2009-04-30 11:26                         ` [PATCH 13/16] DRBD: misc Philipp Reisner
@ 2009-05-02 15:45                         ` James Bottomley
  2009-05-02 17:29                           ` Lars Ellenberg
  1 sibling, 1 reply; 90+ messages in thread
From: James Bottomley @ 2009-05-02 15:45 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg

On Thu, 2009-04-30 at 13:26 +0200, Philipp Reisner wrote:
> Encoding of our simple LRE compression scheme. It is very effective since
> large parts of our bitmap are sparse.
> 
> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

This seems fine to me, but it needs to be in /lib as a separate module
(which drbd would then select) so the compression can be made available
to other users.

James



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 03/16] DRBD: activity_log
  2009-05-01  9:01       ` [PATCH 03/16] DRBD: activity_log Andrew Morton
@ 2009-05-02 17:00         ` Lars Ellenberg
  0 siblings, 0 replies; 90+ messages in thread
From: Lars Ellenberg @ 2009-05-02 17:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH, Neil Brown,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche

On Fri, May 01, 2009 at 02:01:49AM -0700, Andrew Morton wrote:
> On Thu, 30 Apr 2009 13:26:39 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
> 
> > Within DRBD the activity log is used to track extents (4MB each) in which IO
> > happens (or happened recently). It is based on the LRU cache. Each change of
> > the activity log causes a meta data update (single sector write).  The size
> > of the activity log is configured by the user, and is a tradeoff between
> > minimizing updates to the meta data and the resync time after the crash of a
> > primary node.
> > 
> 
> OK, this is where I lose the plot and start bikeshed-painting.

:(


> > +/* I do not believe that all storage medias can guarantee atomic
> > + * 512 byte write operations.
> 
> ooh.  I think you'd be safe assuming that in the Linux context. 
> Everything else does.
> 
> Not sure what this means, really.


there have been observations of real-world hard disks,
which had _half_ updated 512 byte sectors after power loss.
sorry, I lost the link. I believe that was some 10 years ago,
and somehow one of the university colleagues of Phil had been
involved in that testing. Maybe Phil can dig up the link
somewhere.

But actually that comment is not necessary to give a reason
for adding a (simplistic xor) checksum to an on-disk transaction,
and for having some more room for transactions in the on disk
transaction ring buffer than theoretically necessary.

> Please use __packed (whole patchset).

ack.

> > +void trace_drbd_resync(struct drbd_conf *mdev, int level, const char *fmt, ...)

> The "trace_" namespace is already taken.
> 
> I suggest that all globally visible symbols in this driver start with
> "drbd_".

well, it is a va_args wrapper around the in kernel tracing framework,
drbd tracepoints being declared in drbd_tracing.h.
so this _is_ the namespace you are talking about.

> > +STATIC int _drbd_md_sync_page_io(struct drbd_conf *mdev,
> > +				 struct drbd_backing_dev *bdev,
> > +				 struct page *page, sector_t sector,
> > +				 int rw, int size)
> > +{
> > +	struct bio *bio;
> > +	struct drbd_md_io md_io;
> > +	int ok;
> > +
> > +	md_io.mdev = mdev;
> > +	init_completion(&md_io.event);
> 
> urgh, you're going to have to scratch your head over
> DECLARE_COMPLETION_ONSTACK() here.
> 
> Has this code all been tested with all kernel debug options enabled? 
> including lockdep?

Yes, last year I did that regularly.  Not sure if we still have that
enabled in one of our test clusters now, I need to double check.


> > +	md_io.error = 0;
> > +
> > +	if (rw == WRITE && !test_bit(MD_NO_BARRIER, &mdev->flags))
> > +		rw |= (1<<BIO_RW_BARRIER);
> > +	rw |= ((1<<BIO_RW_UNPLUG) | (1<<BIO_RW_SYNCIO));
> 
> The semantics of these flags seem to have been changing at 15Hz lately.
> You might want to check that this code still does what you think it
> does.

I know, I followed that closely.
Currently it does.

> It would be prudent to add comments explaining precisely what behaviour
> the driver is expecting from the lower layers, and why it wants that
> behaviour.

Ok.

...
> > +	if (unlikely(bio_barrier(bio) && !ok)) {
> > +		/* Try again with no barrier */
> > +		dev_warn(DEV, "Barriers not supported on meta data device - disabling\n");
> > +		set_bit(MD_NO_BARRIER, &mdev->flags);
> > +		rw &= ~(1 << BIO_RW_BARRIER);
> > +		bio_put(bio);
> 
> Maybe the original bio could be reused.

yes. and we used to do that, I think.  though which members of the bio
would need to be re-init-ed to what, exactly, changed a few times in
the past iirc, so we went for "let the block layer do it", to be on the
safe side.
are you suggesting we re-implement (oops, sorry, out-of-tree coder behind the
keyboard; re-use, of course!) drivers/md/dm-bio-record.h all over again?
or can we shorten that?

> > +	hardsect = drbd_get_hardsect(bdev->md_bdev);
> 
> hm.  Sounds like hardsect should have type sector_t.
> 
> > +	if (hardsect == 0)
> > +		hardsect = MD_HARDSECT;
> > +
> > +	/* in case hardsect != 512 [ s390 only? ] */
> 
> Nope, it looks like it should have been called hardsect_size?

ok.

> 
> > +	if (hardsect != MD_HARDSECT) {
> > +		mask = (hardsect / MD_HARDSECT) - 1;
> > +		D_ASSERT(mask == 1 || mask == 3 || mask == 7);
> > +		D_ASSERT(hardsect == (mask+1) * MD_HARDSECT);
> > +		offset = sector & mask;
> > +		sector = sector & ~mask;
> > +		iop = mdev->md_io_tmpp;
> > +
> > +		if (rw == WRITE) {
> 
> This will evaluate to false if someone passed you WRITE_SYNC.  Maybe it
> should have been `if (rw & WRITE)'?

as this is a drbd-internal function and rw passes in only the data
direction, this code is correct: rw will always become "sync" | "unplug"
(though there is one code path during initialization where we could
optimize the unplug away, if we preallocate a few pages), and for writes
usually (unless disabled in the config) also barrier.
though (rw & WRITE) would not make it wrong, of course, and would be more
flexible (e.g. if we "optimize" that initialization code path).

ok, will do.

> > +			void *p = page_address(mdev->md_io_page);
> > +			void *hp = page_address(mdev->md_io_tmpp);
> 
> I trust these pages cannot be in highmem.  If they are, they'll need
> kmapping.

these are alloc_page(GFP_KERNEL), allocated during device creation.
they are good.

> whitespace went funny.

yeah, that sometimes still happens :(

> > +		}
> > +	}
> > +
> > +	if (sector < drbd_md_first_sector(bdev) ||
> > +	    sector > drbd_md_last_sector(bdev))
> > +		dev_alert(DEV, "%s [%d]:%s(,%llus,%s) out of range md access!\n",
> > +		     current->comm, current->pid, __func__,
> > +		     (unsigned long long)sector, rw ? "WRITE" : "READ");
> > +
> > +	ok = _drbd_md_sync_page_io(mdev, bdev, iop, sector, rw, hardsect);
> > +	if (unlikely(!ok)) {
> > +		dev_err(DEV, "drbd_md_sync_page_io(,%llus,%s) failed!\n",
> > +		    (unsigned long long)sector, rw ? "WRITE" : "READ");
> > +		return 0;
> > +	}
> > +
> > +	if (hardsect != MD_HARDSECT && rw == READ) {
> > +		void *p = page_address(mdev->md_io_page);
> > +		void *hp = page_address(mdev->md_io_tmpp);
> > +
> > +		memcpy(p, hp + offset*MD_HARDSECT, MD_HARDSECT);
> > +	}
> > +
> > +	return ok;
> > +}
> > +
> > +static inline
> > +struct lc_element *_al_get(struct drbd_conf *mdev, unsigned int enr)
> > +{
> > +	struct lc_element *al_ext;
> > +	struct bm_extent  *bm_ext;
> > +	unsigned long     al_flags = 0;
> > +
> > +	spin_lock_irq(&mdev->al_lock);
> > +	bm_ext = (struct bm_extent *)
> > +		lc_find(mdev->resync, enr/AL_EXT_PER_BM_SECT);
> 
> OK, what's going on here.
> 
> lc_find() returns an lc_element* and it's getting cast to a bm_extent*.
> 
> <tries to find the definition of bm_extent>
> 
> OK, it's defined five patches in the future.  Tricky!
> 
> +struct bm_extent {
> +	struct lc_element lce;
> +	int rs_left; /* number of bits set (out of sync) in this extent. */
> +	int rs_failed; /* number of failed resync requests in this extent. */
> +	unsigned long flags;
> +};
> 	
> I see what you did there.
> 
> Please use container_of().  It makes things much clearer and removes
> the requirement that the embedded lc_element be the first element in
> the outer strut.

we could.
but it would obscure the fact that, in the current implementation of the
lru_cache, the struct lc_element _must_ be the first member of any such
user structure; see my other mail RE your lru_cache comments.

I guess we could simply remove that flexibility from the lru_cache and
lc_element, embed a "u64 payload[2]" into struct lc_element,
and be done with it. that would waste 16 bytes * nr_elements for the
activity log lru_cache, though.
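
just so we are talking about the same thing, the container_of() variant
would read like this (sketch only; the helper name is made up, bm_extent
as in the later patch, lc_find() unchanged):

static struct bm_extent *bm_extent_find(struct drbd_conf *mdev, unsigned int enr)
{
	struct lc_element *e;

	e = lc_find(mdev->resync, enr / AL_EXT_PER_BM_SECT);
	/* no cast, and lce would no longer need to be the first member */
	return e ? container_of(e, struct bm_extent, lce) : NULL;
}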

> > +void drbd_al_begin_io(struct drbd_conf *mdev, sector_t sector)
> > +{
> > +	unsigned int enr = (sector >> (AL_EXTENT_SIZE_B-9));
> 
> This limits the maximum size of a device to 4 gigasectors * <however
> much that is>.


AL_EXTENT_SIZE is 4 MiB, and enr is a 32-bit unsigned int, so the limit
is 2^32 AL extents * 4 MiB each, i.e. 16 PiB.
currently the maximum device size allowed in DRBD is 16 TB
(for dirty bitmap granularity reasons).
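
spelled out (assuming enr stays an unsigned int and the extent size
stays at 4 MiB):

#define AL_EXTENT_SIZE_B	22			/* log2(4 MiB) */
#define AL_EXTENT_SIZE		(1ULL << AL_EXTENT_SIZE_B)

/* enr = sector >> (AL_EXTENT_SIZE_B - 9) must fit in 32 bits, so:
 * 2^32 extents * 2^22 bytes/extent = 2^54 bytes = 16 PiB addressable,
 * well above the 16 TB we allow today anyway. */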

> Do we have a problem here?

depending on how you define "problem", yes.  this is not going to be
much use once the retailers start selling petabyte storage units.
we are working on it, though.

> We're getting deep into uncommented territory here.  Ths makes the code
> much harder to review, and makes the review less effective and makes
> the code harder to maintain and all those other things you already know ;)

yes. sorry.
going to improve on that next time around.

> > +	if (!inc_local(mdev)) {
> 
> <tries to find inc_local>
> 
> <finds it four patches in the future>
> 
> this is harder than it needs to be
> 
> <considers hunting for _inc_local_if_state>
> 
> <changes mind>
> 
> inc_local() isn't a very good choice of identifier, given its
> potentially-global scope.

oh, well, it "increases" the reference count on the "local disk".
but only "if" the "state" of the local disk is "as good or better"
than its argument ;)
there.

guess that was covered by the "missing documentation" rant above
already.

> The amount of inlining in drbd_int.h is bizarre.

hm. I'm not even sure where it comes from.

> > +	buffer->magic = __constant_cpu_to_be32(DRBD_MAGIC);
> 
> DRBD_MAGIC should be defined in magic.h.  Maybe it was - I didn't check.

ok.
we have a few variants of that magic, adding or xoring with it.
we'll discuss whether we need explicit defines of those,
and which one we can drop entirely.

> > +STATIC int drbd_al_read_tr(struct drbd_conf *mdev,
> 
> Can we please do s/STATIC/static/g and remove the definition of STATIC,
> whereever it is?

ok.

> > +			   struct drbd_backing_dev *bdev,
> > +			   struct al_transaction *b,
> > +			   int index)
> > +{
> > +	sector_t sector;
> > +	int rv, i;
> > +	u32 xor_sum = 0;
> > +
> > +	sector = bdev->md.md_offset + bdev->md.al_offset + index;
> 
> Strange that `index' doesn't have sector_t type, but I don't know (and
> wasn't told) what it represents.

no, actually it is the index into the fixed number of lc_elements
covered by the activity log lru_cache (AL).
and it is small by design, currently the maximum number we allow is 3833.
yes, that again should be documented better in the code.
it is documented in the man pages of the userland tools, though.

> > +#define S2W(s)	((s)<<(BM_EXT_SIZE_B-BM_BLOCK_SIZE_B-LN2_BPL))
> 
> S2W means, umm, "sector to ..."?

sector to "word", where word is meant as "unsigned long".

> Optimise the code for the code reader, please.

aye.
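
e.g. something like this (sketch; the inline name is made up, the formula
is unchanged):

/* S2W: "sectors to words", where a word is an unsigned long of the
 * in-memory bitmap */
static inline unsigned long bm_sectors_to_words(unsigned long sectors)
{
	return sectors << (BM_EXT_SIZE_B - BM_BLOCK_SIZE_B - LN2_BPL);
}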

> > +	bio = bio_alloc(GFP_KERNEL, 1);
> 
> Should it be GFP_NOIO?

does not need to be.
happens outside the io path.

> > + * drbd_al_to_on_disk_bm:
> > + * Writes the areas of the bitmap which are covered by the AL.
> 
> what's an AL?

the "activity log".

> Please check all GFP_KERNELS, see if they should have been GFP_NOIO.

done years ago ;-)

> Again, we have the DECLARE_COMPLETION_ONSTACK() thing to worry about here.

I think we are good.
But we can change to use that anyway.

> <attention span exhausted, sorry>

Thank you very much.

Cheers,

	Lars

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 04/16] DRBD: bitmap
  2009-05-02 15:41         ` [PATCH 04/16] DRBD: bitmap James Bottomley
@ 2009-05-02 17:28           ` Lars Ellenberg
  2009-05-03  5:21             ` Neil Brown
  0 siblings, 1 reply; 90+ messages in thread
From: Lars Ellenberg @ 2009-05-02 17:28 UTC (permalink / raw)
  To: James Bottomley
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH, Neil Brown,
	Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche

On Sat, May 02, 2009 at 10:41:58AM -0500, James Bottomley wrote:
> On Thu, 2009-04-30 at 13:26 +0200, Philipp Reisner wrote:
> > DRBD maintains a dirty bitmap in case it has to run without peer node or
> > without local disk. Writes to the on disk dirty bitmap are minimized by the
> > activity log (=AL). Each time an extent is evicted from the AL the part of
> > the bitmap no longer covered by the AL is written to disk.
> > 
> > Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
> > Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> 
> The way the bitmap and activity log work are very similar to the way the
> md bitmap works (and are implemented for almost exactly the same
> reason).  Is there any way we could combine them?

in principle yes.
the DRBD bitmap has a granularity of 4 kB per bit,
and the "activity log" covers 4 MB per what we call "al extent".

though there is a very important difference.

in MD, when the bitmap is in use, I think the approach is:

  for each write queued to the lower level devices,
     dirty bits in memory
     for every newly dirtied bitmap page,
	flush bitmap pages to disk
	wait for these bitmap writes to complete
  then unplug the lower level devices

  in background: periodically try to clean some pages,
	and write them to disk

the DRBD approach is:
  if target "al extent" of this write request
  is NOT in the in-memory "lru_cache" already,
	get it into the cache,
		if that means we have to kick an
		old element from the cache, and
		the associated bitmap is dirty
			write that part of the bitmap
        write an "al transaction" (synchronous single sector write)
  else
  	FAST PATH, no additional "meta data" write needed.
  
  submit to lower level device.


MD most of the time just _needs_ the additional "meta data" writes.
DRBD most of the time does not (unless you have completely random
writes, always requesting an extent not yet/anymore in the activity log).

I'm in the process of generalizing DRBD's approach to allow more than one
"al extent" to change during a "prepare" step, and to cover several such
changes in one "al transaction", so the number of meta data updates can be
reduced even further.

adopting this "activity log" approach would make MD even better, IMO.
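
to put the above into (pseudo) C - a sketch of the control flow only,
not the actual drbd_al_begin_io(), and all helper names are invented:

static void al_begin_io_sketch(struct drbd_conf *mdev, sector_t sector)
{
	unsigned int enr = sector >> (AL_EXTENT_SIZE_B - 9);
	struct lc_element *al_ext;

	al_ext = get_al_extent(mdev, enr);		/* invented helper */
	if (extent_was_already_active(al_ext))
		return;	/* FAST PATH: no additional meta data write */

	/* an old extent had to be kicked out to make room ... */
	if (evicted_extent_has_dirty_bitmap_bits(mdev))
		write_that_part_of_the_bitmap(mdev);

	/* ... and the change of the active set is made persistent */
	write_al_transaction(mdev, enr);	/* synchronous single sector write */
}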

Thanks,

	Lars

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 12/16] DRBD: variable_length_integer_encoding
  2009-05-02 15:45                         ` [PATCH 12/16] DRBD: variable_length_integer_encoding James Bottomley
@ 2009-05-02 17:29                           ` Lars Ellenberg
  0 siblings, 0 replies; 90+ messages in thread
From: Lars Ellenberg @ 2009-05-02 17:29 UTC (permalink / raw)
  To: James Bottomley
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH, Neil Brown,
	Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche

On Sat, May 02, 2009 at 10:45:17AM -0500, James Bottomley wrote:
> On Thu, 2009-04-30 at 13:26 +0200, Philipp Reisner wrote:
> > Encoding of our simple LRE compression scheme. It is very effective since
> > large parts of our bitmap are sparse.
> > 
> > Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
> > Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> 
> This seems fine to me, but it needs to be in /lib as a separate module
> (which drbd would then select) so the compression can be made available
> to other users.

will do.

Thanks,

	Lars

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/16] DRBD: lru_cache
  2009-05-02 15:26       ` Lars Ellenberg
@ 2009-05-02 17:58         ` Andrew Morton
  2009-05-02 18:13           ` Lars Ellenberg
  0 siblings, 1 reply; 90+ messages in thread
From: Andrew Morton @ 2009-05-02 17:58 UTC (permalink / raw)
  To: Lars Ellenberg
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH, Neil Brown,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche

On Sat, 2 May 2009 17:26:20 +0200 Lars Ellenberg <lars.ellenberg@linbit.com> wrote:

> in memory structure is
> 
> struct lru_cache {
>         struct list_head active;
>         struct list_head quiet;
>         struct list_head free;
>         size_t element_size;          <-- parameter to "lc_alloc"
>         unsigned int  nr_elements;    <-- parameter to "lc_alloc"
>         unsigned int  new_number;
> 
>         unsigned int used;
>         unsigned long flags;
>         unsigned long hits, misses, starving, dirty, changed;
> 
>         struct lc_element *changing_element; /* just for paranoia */
> 
>         const char *name;
> 
>         struct hlist_head slot[0];
>         /* hash colision chains here, then element storage. */
> };
> 
> so we have fixed size list heads,
> size of a single such "element", to allow the user
> to add small payload;
> number of hash slots and "elements" following this header;
> some counters;
> hlist_slot[0];
> }
> following:
> struct hlist_head[nr_elements];
> array of element_size blobs[nr_elements];
> 
> these "blobs" start with the struct lru_element,
> possibly followed by some user payload.
> 
> the "index" you are asking about later is
> index into that "blob" array,
> and is used primarily to initialize the state of this thing
> from an on-disk representation (the "activity log", "AL"),
> for crash recovery purposes.
> 
> the typecasting is necessary to get from the slot[0] to the "elements"
> skipping the hash slots.
> using "container of" or something like that would obscure the fact that,
> as currently implemented, the "lru_element" _must_ be the first member
> of any payload structure.

I still don't see why the lru_element must be the first member of the
user's outer, containing structure.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/16] DRBD: lru_cache
  2009-05-02 17:58         ` Andrew Morton
@ 2009-05-02 18:13           ` Lars Ellenberg
  2009-05-02 18:26             ` Andrew Morton
  0 siblings, 1 reply; 90+ messages in thread
From: Lars Ellenberg @ 2009-05-02 18:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH, Neil Brown,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche

On Sat, May 02, 2009 at 10:58:23AM -0700, Andrew Morton wrote:
> On Sat, 2 May 2009 17:26:20 +0200 Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
> 
> > in memory structure is
> > 
> > struct lru_cache {
> >         struct list_head active;
> >         struct list_head quiet;
> >         struct list_head free;
> >         size_t element_size;          <-- parameter to "lc_alloc"
> >         unsigned int  nr_elements;    <-- parameter to "lc_alloc"
> >         unsigned int  new_number;
> > 
> >         unsigned int used;
> >         unsigned long flags;
> >         unsigned long hits, misses, starving, dirty, changed;
> > 
> >         struct lc_element *changing_element; /* just for paranoia */
> > 
> >         const char *name;
> > 
> >         struct hlist_head slot[0];
> >         /* hash colision chains here, then element storage. */
> > };
> > 
> > so we have fixed size list heads,
> > size of a single such "element", to allow the user
> > to add small payload;
> > number of hash slots and "elements" following this header;
> > some counters;
> > hlist_slot[0];
> > }
> > following:
> > struct hlist_head[nr_elements];
> > array of element_size blobs[nr_elements];
> > 
> > these "blobs" start with the struct lru_element,
> > possibly followed by some user payload.
> > 
> > the "index" you are asking about later is
> > index into that "blob" array,
> > and is used primarily to initialize the state of this thing
> > from an on-disk representation (the "activity log", "AL"),
> > for crash recovery purposes.
> > 
> > the typecasting is necessary to get from the slot[0] to the "elements"
> > skipping the hash slots.
> > using "container of" or something like that would obscure the fact that,
> > as currently implemented, the "lru_element" _must_ be the first member
> > of any payload structure.
> 
> I still don't see why the lru_element must be the first member of the
> user's outer, containing structure.


ok, arguably one could also record the offset_of beneath the element_size,
and add that in when doing the lc_element *e =  blob[index] + offset.
would not make it much more appealing, though.

	Lars

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/16] DRBD: lru_cache
  2009-05-02 18:13           ` Lars Ellenberg
@ 2009-05-02 18:26             ` Andrew Morton
  2009-05-02 19:39               ` Lars Ellenberg
  0 siblings, 1 reply; 90+ messages in thread
From: Andrew Morton @ 2009-05-02 18:26 UTC (permalink / raw)
  To: Lars Ellenberg
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH, Neil Brown,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche

On Sat, 2 May 2009 20:13:12 +0200 Lars Ellenberg <lars.ellenberg@linbit.com> wrote:

> On Sat, May 02, 2009 at 10:58:23AM -0700, Andrew Morton wrote:
> > On Sat, 2 May 2009 17:26:20 +0200 Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
> > 
> > > in memory structure is
> > > 
> > > struct lru_cache {
> > >         struct list_head active;
> > >         struct list_head quiet;
> > >         struct list_head free;
> > >         size_t element_size;          <-- parameter to "lc_alloc"
> > >         unsigned int  nr_elements;    <-- parameter to "lc_alloc"
> > >         unsigned int  new_number;
> > > 
> > >         unsigned int used;
> > >         unsigned long flags;
> > >         unsigned long hits, misses, starving, dirty, changed;
> > > 
> > >         struct lc_element *changing_element; /* just for paranoia */
> > > 
> > >         const char *name;
> > > 
> > >         struct hlist_head slot[0];
> > >         /* hash colision chains here, then element storage. */
> > > };
> > > 
> > > so we have fixed size list heads,
> > > size of a single such "element", to allow the user
> > > to add small payload;
> > > number of hash slots and "elements" following this header;
> > > some counters;
> > > hlist_slot[0];
> > > }
> > > following:
> > > struct hlist_head[nr_elements];
> > > array of element_size blobs[nr_elements];
> > > 
> > > these "blobs" start with the struct lru_element,
> > > possibly followed by some user payload.
> > > 
> > > the "index" you are asking about later is
> > > index into that "blob" array,
> > > and is used primarily to initialize the state of this thing
> > > from an on-disk representation (the "activity log", "AL"),
> > > for crash recovery purposes.
> > > 
> > > the typecasting is necessary to get from the slot[0] to the "elements"
> > > skipping the hash slots.
> > > using "container of" or something like that would obscure the fact that,
> > > as currently implemented, the "lru_element" _must_ be the first member
> > > of any payload structure.
> > 
> > I still don't see why the lru_element must be the first member of the
> > user's outer, containing structure.
> 
> 
> ok, arguably one could also record the offset_of beneath the element_size,
> and add that in when doing the lc_element *e =  blob[index] + offset.
> would not make it much more appealing, though.
> 

You appear to believe that I understood the relevance of all the above
text.  I didn't ;)

Let's start again.

Why can't I do

struct foo {
	int x;
	struct lc_element lc;
	..
};

and then use the lru library code to handle my foo objects?

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/16] DRBD: lru_cache
  2009-05-02 18:26             ` Andrew Morton
@ 2009-05-02 19:39               ` Lars Ellenberg
  0 siblings, 0 replies; 90+ messages in thread
From: Lars Ellenberg @ 2009-05-02 19:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH, Neil Brown,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche

On Sat, May 02, 2009 at 11:26:09AM -0700, Andrew Morton wrote:
> You appear to believe that I understood the relevance of all the above
> text.  I didn't ;)
> 
> Let's start again.
> 
> Why can't I do
> 
> struct foo {
> 	int x;
> 	struct lc_element lc;
> 	..
> };
> 
> and then use the lru library code to handle my foo objects?


to do so, you'd
struct lru_cache *foo_cache =
	lc_alloc("dummy", 42, sizeof(struct foo), NULL);

that would alloc 
sizeof(struct lru_cache)
+ 42 * sizeof(struct hlist_head)
+ 42 * sizeof(struct foo);

the number of elements in the cache is fixed for its lifetime,
only the reference counts, position on the lru lists,
and tracked "house numbers" change.

the lru code sometimes addresses the lc_element part of the object
just via its index in this memory area, which is useful to initialize
the elements from some on-disk recorded state.

it does so by assuming a struct lc_element
every sizeof(struct foo) bytes
after the initial 42 hlist slots.

there, that was it.

to make this generic, apart from adding some paranoia
like a BUG_ON(sizeof(foo) < sizeof(lc_element)), and possibly
rounding any "odd" element_size parameter to get proper alignment,
it would need to pass in offsetof(struct foo, lc),
so it could do 

#define lc_e_base(lc) \
        ((char *)((lc)->slot + (lc)->nr_elements))

static inline struct lc_element *
lc_element_by_index(struct lru_cache *lc, unsigned int i)
{
        BUG_ON(i >= lc->nr_elements);
        return (struct lc_element *) (lc_e_base(lc)
			+ i * lc->element_size
			+     lc->offset_of);
}

lc->offset_of is what is currently missing from the code.
lc_element_by_index() is what currently is badly named lc_entry().

a real lc_entry would then become
#define lc_entry(ptr, type, member) \
        container_of(ptr, type, member)

so what now is
-struct foo *bar = (struct foo *) lc_entry(foo_cache, 7);

would become
+struct lc_element *e = lc_element_by_index(foo_cache, 7);
+struct foo *bar = lc_entry(e, struct foo, lc);

does that make sense?

any suggestions on how to split up the one vmalloc into several allocations?
kmalloc?  a bunch of zero order pages?
or are we reinventing the wheel?
(probably we are, anyway; but is that particular wheel already in the kernel?)

	Lars

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 10/16] DRBD: proc
  2009-05-02 15:44                     ` [PATCH 10/16] DRBD: proc James Bottomley
@ 2009-05-02 20:23                       ` Lars Ellenberg
  0 siblings, 0 replies; 90+ messages in thread
From: Lars Ellenberg @ 2009-05-02 20:23 UTC (permalink / raw)
  To: James Bottomley
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH, Neil Brown,
	Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche

On Sat, May 02, 2009 at 10:44:00AM -0500, James Bottomley wrote:
> On Thu, 2009-04-30 at 13:26 +0200, Philipp Reisner wrote:
> > The /proc/drbd interface.
> > 
> > Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
> > Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> 
> /proc is deprecated for device control and printing.  I can see why you
> want this (because it looks very similar to /proc/mdstat) but might it
> not be better to convert it to a proper sysfs view with your own bus and
> one device per connection with the stats?

we have been considering that for a while now.

fortunately we have a lot of users.

unfortunately that means we probably need to at least still provide it
for a while, maybe mark it deprecated, and make it a Kconfig setting,
because /proc/drbd is part of the API, sort of.

yes, I think we can do that.

	Lars

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/16] DRBD: lru_cache
  2009-04-30 11:26   ` [PATCH 02/16] DRBD: lru_cache Philipp Reisner
  2009-04-30 11:26     ` [PATCH 03/16] DRBD: activity_log Philipp Reisner
  2009-05-01  8:59     ` [PATCH 02/16] DRBD: lru_cache Andrew Morton
@ 2009-05-02 23:51     ` Kyle Moffett
  2009-05-03  6:27       ` Lars Ellenberg
  2 siblings, 1 reply; 90+ messages in thread
From: Kyle Moffett @ 2009-05-02 23:51 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: linux-kernel, Jens Axboe, Greg KH, Neil Brown, James Bottomley,
	Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Bart Van Assche,
	Lars Ellenberg

After digging through this a bit, I would recommend rewriting this
whole part.  The part that definitely needs to go is the vmalloc() of
your whole LRU at once.

Here's some code I've been fiddling around with for a while.  It uses
a kmem_cache and a shrinker callback (although it does need a small
patch to the shrinker code) to provide a dynamically sized LRU list
for fixed-size objects.

This hasn't even been compile-tested, and it probably has some minor
locking errors, but it should have enough comments to make up for it.
This explicitly does *NOT* have your custom hash-lookup code in it,
with the premise that generally that part should be left up to
whatever architecture best fits the user of the LRU list.

I did design it so that it is suitable for usage with a lockless-RCU
hash-lookup.  The user would need to make sure to specify
SLAB_DESTROY_BY_RCU in the flags field and make sure to include some
kind of generation counter (see the various docs on
SLAB_DESTROY_BY_RCU).  On the other hand it would also work fine with
a single spinlock used to control updates of the lookup datastructure.

The user will need to use spin_trylock() or similar in their ->evict()
callback, as that is called with the internal list spinlock held.  On
the other hand, if one CPU has their lookup table locked and is trying
to lru_cache_get() an object, there's not a whole lot that another CPU
can do to free anything referenced by that lookup table anyways.  I
don't see it as a big issue.

Nowhere-near-signed-off-by: Kyle Moffett <kyle@moffetthome.net>

struct lru_cache;	/* forward declaration, used by ->evict() below */

struct lru_cache_type {
	const char *name;
	size_t size, align;
	unsigned long flags;
	int seeks;

	/*
	 * Return nonzero if the passed object may be evicted (IE: freed).
	 * Otherwise return zero and it will be "touched" (IE: moved to the
	 * tail of the LRU) so that later scans will try other objects.
	 *
	 * If you return nonero from this function, "obj" will be immediately
	 * kmem_cache_free()d.
	 */
	unsigned int (*evict)(struct lru_cache *lc, void *obj);
};

struct lru_cache {
	const struct lru_cache_type *type;
	size_t offset;

	struct kmem_cache *cache;
	struct shrinker shrinker;

	/* The least-recently used item in the LRU is at the head */
	spinlock_t lock;
	struct list_head lru;
	struct list_head in_use;
	unsigned long nr_lru;
	unsigned long nr_in_use;
};

struct lru_cache_elem {
	/* The node is either on the LRU or on this in-use list */
	struct list_head node;
	const struct lru_cache *lc;
	atomic_t refcount;
};

int lru_cache_shrink(struct lru_cache *lc, int nr_scan, gfp_t flags);

/*
 * FIXME:  Needs a patch to make the shinker->shrink() function take an extra
 * argument with the address of the "struct shrinker".
 */
static int lru_cache_shrink__(struct shrinker *s, int nr_scan, gfp_t flags)
{
	return lru_cache_shrink(container_of(s, struct lru_cache, shrinker),
				nr_scan, flags);
}

int lru_cache_create(struct lru_cache *lc, const struct lru_cache_type *type)
{
	/* Align the size so an lru_cache_elem can sit at the end */
	size_t align = max_t(size_t, type->align, __alignof__(struct lru_cache_elem));
	size_t size = ALIGN(type->size, __alignof__(struct lru_cache_elem));
	lc->offset = size;

	/* Now add space for that element */
	size += sizeof(struct lru_cache_elem);

	/* Initialize internal fields */
	lc->type = type;
	INIT_LIST_HEAD(&lc->lru);
	INIT_LIST_HEAD(&lc->in_use);
	lc->nr_lru = 0;
	lc->nr_in_use = 0;

	/* Allocate the fixed-sized-object cache */
	lc->cache = kmem_cache_create(type->name, size, align,
			type->flags, NULL);
	if (!lc->cache)
		return -ENOMEM;

	/* Now initialize and register our shrinker */
	lc->shrinker.shrink = &lru_cache_shrink__;
	lc->shrinker.seeks = type->seeks;
	register_shrinker(&lc->shrinker);

	return 0;
}

/*
 * Before you can call this function, you must free all of the objects on the
 * LRU list (which in turn means they must all have zeroed refcounts), and
 * you must ensure that no other functions will be called on this lru-cache.
 */
void lru_cache_destroy(struct lru_cache *lc)
{
	BUG_ON(lc->nr_lru);
	BUG_ON(lc->nr_in_use);
	BUG_ON(!list_empty(&lc->lru));
	BUG_ON(!list_empty(&lc->in_use));

	unregister_shrinker(&lc->shrinker);
	kmem_cache_destroy(lc->cache);
	lc->cache = NULL;
}

void *lru_cache_alloc(struct lru_cache *lc, gfp_t flags)
{
	struct lru_cache_elem *elem;
	void *obj = kmem_cache_alloc(lc->cache, flags);
	if (!obj)
		return NULL;

	elem = obj + lc->offset;
	atomic_set(&elem->refcount, 1);
	elem->lc = lc;
	smp_wmb();

	spin_lock(&lc->lock);
	list_add_tail(&elem->node, &lc->in_use);
	lc->nr_in_use++;
	spin_unlock(&lc->lock);

	return obj;
}

/*
 * You must ensure that the lru object has a zero refcount and can no longer
 * be looked up before calling lru_cache_free().  Specifically, you must
 * ensure that lru_cache_get() cannot be called on this object.
 */
void lru_cache_free(struct lru_cache *lc, void *obj)
{
	struct lru_cache_elem *elem = obj + lc->offset;
	BUG_ON(elem->lc != lc);

	spin_lock(&lc->lock);
	BUG_ON(atomic_read(&elem->refcount));
	list_del(&elem->node);
	lc->nr_lru--;
	spin_unlock(&lc->lock);

	kmem_cache_free(lc->cache, obj);
}

/*
 * This may be called at any time between lru_cache_create() and
 * lru_cache_destroy() by the shrinker code to reduce our memory usage.
 */
int lru_cache_shrink(struct lru_cache *lc, int nr_scan, gfp_t flags)
{
	struct lru_cache_elem *elem, *n;
	unsigned long nr_lru;
	int nr_left = nr_scan;

	spin_lock(&lc->lock);

	/* Try to scan the number of requested objects */
	list_for_each_entry_safe(elem, n, &lc->lru, node) {
		void *obj;
		if (!nr_left--)
			break;

		/* Sanity check */
		BUG_ON(atomic_read(&elem->refcount));

		/* Ask them if we can free this item */
		obj = ((void *)elem) - lc->offset;
		if (!lc->type->evict(lc, obj)) {
			/*
			 * They wouldn't let us free it, so move it to the
			 * other end of the LRU so we can keep scanning.
			 */
			list_del(&elem->node);
			list_add_tail(&elem->node, &lc->lru);
			continue;
		}

		/* Remove this node from the LRU and free it */
		list_del(&elem->node);
		lc->nr_lru--;
		kmem_cache_free(lc->cache, obj);
	}

	nr_lru = lc->nr_lru;
	spin_unlock(&lc->lock);

	/* Now if we were asked to scan, tell the kmem_cache to shrink */
	if (nr_scan)
		kmem_cache_shrink(lc->cache);

	return (nr_lru < (unsigned long)INT_MAX) ? (int)nr_lru : INT_MAX;
}

/* Use this function if you already have a reference to "obj" */
void *lru_cache_get__(struct lru_cache *lc, void *obj)
{
	struct lru_cache_elem *elem = obj + lc->offset;
	BUG_ON(elem->lc != lc);
	atomic_inc(&elem->refcount);
	return obj;
}

/*
 * If you do not already have a reference to "obj", you must wrap the
 * combined lookup + lru_cache_get() in rcu_read_lock/unlock().
 */
void *lru_cache_get(struct lru_cache *lc, void *obj)
{
	struct lru_cache_elem *elem = obj + lc->offset;
	BUG_ON(elem->lc != lc);

	/* Fastpath:  If it's already referenced just add another one */
	if (atomic_inc_not_zero(&elem->refcount))
		return obj;

	/* Slowpath:  Need to lock the lru-cache and mark the object used */
	spin_lock(&lc->lock);

	/* One more attempt at the fastpath, now that we've got the lock */
	if (atomic_inc_not_zero(&elem->refcount))
		goto out;

	/*
	 * Ok, it has a zero refcount and we've got the lock, anybody else in
	 * here trying to lru_cache_get() this object will wait until we are
	 * done.
	 */

	/* Remove it from the LRU */
	BUG_ON(!lc->nr_lru);
	list_del(&elem->node);
	lc->nr_lru--;

	/* Add it to the in-use list */
	list_add_tail(&elem->node, &lc->in_use);
	lc->nr_in_use++;

	/* Acquire a reference */
	atomic_set(&elem->refcount, 1);

out:
	spin_unlock(&lc->lock);
	return obj;
}

/* This releases one reference */
void lru_cache_put(struct lru_cache *lc, void *obj)
{
	struct lru_cache_elem *elem = obj + lc->offset;
	BUG_ON(elem->lc != lc);
	BUG_ON(!atomic_read(&elem->refcount));

	/* Drop the refcount; if it's still nonzero, we're done */
	if (atomic_dec_return(&elem->refcount))
		return;

	/* We need to take the lru-cache lock to make sure we release it */
	spin_lock(&lc->lock);
	if (atomic_read(&elem->refcount))
		goto out;

	/*
	 * Ok, it has a zero refcount and we hold the lock, anybody trying to
	 * lru_cache_get() this object will block until we're done.
	 */

	/* Remove it from the in-use list */
	BUG_ON(!lc->nr_in_use);
	list_del(&elem->node);
	lc->nr_in_use--;

	/* Add it to the LRU list */
	list_add_tail(&elem->node, &lc->lru);
	lc->nr_lru++;

out:
	spin_unlock(&lc->lock);
}
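
For completeness, here's roughly how a user would plug into this (equally
untested; "foo" and try_to_unhash_foo() are placeholders for whatever
object and lookup structure the user actually maintains):

struct foo {
	u64 key;
	/* ... payload ... */
};

/* called with the internal list spinlock held, so only trylock in here */
static unsigned int foo_evict(struct lru_cache *lc, void *obj)
{
	struct foo *f = obj;

	return try_to_unhash_foo(f);	/* placeholder */
}

static const struct lru_cache_type foo_lru_type = {
	.name	= "foo_lru",
	.size	= sizeof(struct foo),
	.align	= __alignof__(struct foo),
	.flags	= 0,
	.seeks	= DEFAULT_SEEKS,
	.evict	= foo_evict,
};

static struct lru_cache foo_lru;

static int __init foo_init(void)
{
	return lru_cache_create(&foo_lru, &foo_lru_type);
}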

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 04/16] DRBD: bitmap
  2009-05-02 17:28           ` Lars Ellenberg
@ 2009-05-03  5:21             ` Neil Brown
  2009-05-03  7:38               ` Lars Ellenberg
  2009-05-05 17:48               ` Lars Marowsky-Bree
  0 siblings, 2 replies; 90+ messages in thread
From: Neil Brown @ 2009-05-03  5:21 UTC (permalink / raw)
  To: Lars Ellenberg
  Cc: James Bottomley, Philipp Reisner, linux-kernel, Jens Axboe,
	Greg KH, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche

On Saturday May 2, lars.ellenberg@linbit.com wrote:
> On Sat, May 02, 2009 at 10:41:58AM -0500, James Bottomley wrote:
> > On Thu, 2009-04-30 at 13:26 +0200, Philipp Reisner wrote:
> > > DRBD maintains a dirty bitmap in case it has to run without peer node or
> > > without local disk. Writes to the on disk dirty bitmap are minimized by the
> > > activity log (=AL). Each time an extent is evicted from the AL the part of
> > > the bitmap no longer covered by the AL is written to disk.
> > > 
> > > Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
> > > Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> > 
> > The way the bitmap and activity log work are very similar to the way the
> > md bitmap works (and are implemented for almost exactly the same
> > reason).  Is there any way we could combine them?
> 
> in principle yes.
> the DRBD bitmap has a granularity of 4 kB per bit,
> and the "activity log" covers 4 MB per what we call "al extent".
> 
> though there is a very important difference.
> 
> in MD, when the bitmap is in use, I think the approach is:
> 
>   for each write queued to the lower level devices,
>      dirty bits in memory
>      for every newly dirtied bitmap page,
> 	flush bitmap pages to disk
> 	wait for these bitmap writes to complete
>   then unplug the lowe level devices
> 
>   in background: periodically try to clean some pages,
> 	and write them to disk
> 
> the DRBD approach is:
>   if target "al extent" of this write request
>   is NOT in the in-memory "lru_cache" already,
> 	get it into the cache,
> 		if that means we have to kick an
> 		old element from the cache, and
> 		the associated bitmap is dirty
> 			write that part of the bitmap
>         write an "al transaction" (synchonous single sector write)
>   else
>   	FAST PATH, no additional "meta data" write needed.
>   
>   submit to lower level device.
> 
> 
> MD most of the time just _needs_ the additional "meta data" writes.
> DRBD most of the time does not (unless you have completely random
> writes, always requesting an extent not yet/anymore in the activity log.
> 
> I'm in the process of generalizing DRBDs approach to allow more than one
> "al extent" to change during a "prepare" step, and cover several such changes
> in one "al transaction", so the number of meta data updates can be
> reduced even further.
> 
> adopting this "activity log" approach would make MD even better, IMO.

I've been pondering this, wondering what the important difference is.
I picture the DRBD approach - abstractly - as maintaining 2 bitmaps.
One is very fine granularity (4K).  The other has much coarser
granularity (4M).
A sector of the array is considered to need resync (After unclean
shutdown or whatever) if either bitmap has the bit set for the
corresponding region of the array.

Bits are set on-disk in the coarse bitmap before any writes are
allowed to corresponding regions, and are cleared lazily when there are
no writes active in that region.
Bits are set on-disk in the fine bitmap only when the corresponding
bit of the coarse bitmap is about to be cleared on-disk.  There will
only be bits to set if the array is degraded, so writes have completed
to one half and cannot be sent to the other half.
Bits are cleared on-disk in the fine bitmap after a 'resync' - and
presumably again just before the corresponding coarse bit is cleared.

DRBD stores this coarse bitmap as an activity log which is (I think)
just a list of addresses of bits that are set.  Not unlike run-length
encoding.   The rule for lazy clearing of bits is that when the number
of bits which are set crosses a threshold, we clear the 'oldest' bit.

I could conceivably take this approach into md without changing the
on-disk layout at all.  To set a bit in the coarse bitmap, I would
simply set all the corresponding bits in the fine on-disk bitmap.
This could involve writing a whole sector of ones to just set one
bit... but as you cannot write less than a sector that isn't really
a problem.  DRBD currently writes one sector per bit set, so it should
be no worse than DRBD.


The approach that md currently takes to lazy clearing of bits is to
clear bits which have not needed to be set for n seconds, where n
defaults to 5 (I think).
It may well make sense to modify this so that we don't clear bits if
fewer than N are set.  I can imagine that this could benefit some
workloads.  However as the time it takes to update the bitmap is such
a tiny fraction of 5 seconds, I'm not certain that it would be a
noticeable benefit.

Another issue here is bitmap granularity.  DRBD uses two granularities:
4M and 4K.  md uses just one, but it is configurable.  People tend to
find larger granularities provide better performance for exactly the
same reason that DRBD uses 4M for the activity log - to minimise
updates when write activity is fairly local.
By doing so, we miss out on the advantages of fine granularity - that
being that there is less data to move around during resync.  For local
disks, that cost is not enormous as seek time is much slower than data
transfer, so copying a large block costs much the same as a few small
blocks at the same location.
For DRBD where the data is moved over the network which is slower than
a local interconnect, the data transfer time presumably becomes the
main cost, so minimising the data that needs to be transferred after a
reconnect is important.  So supporting two different granularities
certainly seems to make sense where a network transport is involved.

I would be interested in adding this sort of two-level support to md's
bitmaps.  I cannot immediately see the benefits of the activity log
format though.  I would probably just set more bits any time I had to
set any, to avoid subsequent updates.
e.g. for a 4TB filesystem with 4K bitmap chunk size, I would have 2^30 bits
in 2^18 sectors - 128Meg of bitmap altogether.
Whenever updating a bit, I'd set maybe 1/4 or 1/2 of the bits in the
sector, this covers 4MB or 8MB.  They then get cleared lazily as
discussed above.
This would need a bit of work in md/bitmap, partly because the current
implementation limits a bitmap to 2^20 bits (partly because I won't
use vmalloc).

As I said, I don't immediately see the benefits of the activity log
format, however,
 1/ I am happy to listen to its benefits being explained
 2/ If we were to agree that merging DRBD functionality into md
   (for which there isn't a concrete proposal, but the suggestion
    seems to be floating around) were a good thing, I don't have any
    problem with supporting an activity log in md in the name of
    compatibility.


NeilBrown

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-02  7:33   ` Bart Van Assche
@ 2009-05-03  5:36     ` Willy Tarreau
  2009-05-03  5:40       ` david
  2009-05-03 10:06       ` Philipp Reisner
  0 siblings, 2 replies; 90+ messages in thread
From: Willy Tarreau @ 2009-05-03  5:36 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Andrew Morton, Philipp Reisner, linux-kernel, Jens Axboe,
	Greg KH, Neil Brown, James Bottomley, Sam Ravnborg, Dave Jones,
	Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
	Lars Ellenberg

On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> > On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
> >
> >> This is a repost of DRBD
> >
> > Is it being used anywhere for anything?  If so, where and what?
> 
> One popular application is to run iSCSI and HA software on top of DRBD
> in order to build a highly available iSCSI storage target.

Confirmed, I have several customers who're doing exactly that.

Willy


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03  5:36     ` Willy Tarreau
@ 2009-05-03  5:40       ` david
  2009-05-03 14:21         ` James Bottomley
  2009-05-03 10:06       ` Philipp Reisner
  1 sibling, 1 reply; 90+ messages in thread
From: david @ 2009-05-03  5:40 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Bart Van Assche, Andrew Morton, Philipp Reisner, linux-kernel,
	Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Sun, 3 May 2009, Willy Tarreau wrote:

> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
>> <akpm@linux-foundation.org> wrote:
>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
>>>
>>>> This is a repost of DRBD
>>>
>>> Is it being used anywhere for anything?  If so, where and what?
>>
>> One popular application is to run iSCSI and HA software on top of DRBD
>> in order to build a highly available iSCSI storage target.
>
> Confirmed, I have several customers who're doing exactly that.

I will also say that there are a lot of us out here who would have a use 
for DRBD in our HA setups, but have held off implementing it specifically 
because it's not yet in the upstream kernel.

David Lang

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-04-30 11:26 [PATCH 00/16] DRBD: a block device for HA clusters Philipp Reisner
  2009-04-30 11:26 ` [PATCH 01/16] DRBD: major.h Philipp Reisner
  2009-05-01  8:59 ` [PATCH 00/16] DRBD: a block device for HA clusters Andrew Morton
@ 2009-05-03  5:53 ` Neil Brown
  2009-05-03  6:24   ` david
  2009-05-03  8:29   ` Lars Ellenberg
  2 siblings, 2 replies; 90+ messages in thread
From: Neil Brown @ 2009-05-03  5:53 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: linux-kernel, Jens Axboe, Greg KH, James Bottomley, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
	Lars Ellenberg

On Thursday April 30, philipp.reisner@linbit.com wrote:
> Hi,
> 
> This is a repost of DRBD, to keep you updated about the ongoing
> cleanups and improvements.
> 
> Patch set attached. Git tree available:
> git pull git://git.drbd.org/linux-2.6-drbd.git drbd
> 
> We are looking for reviews!
> 
> Description
> 
>   DRBD is a shared-nothing, synchronously replicated block device. It
>   is designed to serve as a building block for high availability
>   clusters and in this context, is a "drop-in" replacement for shared
>   storage. Simplistically, you could see it as a network RAID 1.

I know this is minor, but it bugs me every time I see that phrase
"shared-nothing".   Surely the network is shared?? And the code...
Can you just say "DRBD is a synchronously replicated block device"?
or would we have to call it SRBD then?
Or maybe "shared-nothing" is an accepted technical term in the
clustering world??

> 
>   Although I use the "RAID1+NBD" metaphor myself, recent discussion
>   unveiled that one needs to understand the differences as well.
>   Here are just two examples of that:

All this should probably be in a patch against Documentation/drbd.txt 

> 
>    1) Think of a two node HA cluster. Node A is active ('primary' in DRBD
>     speak) has the filesystem mounted and the application running. Node B is
>     in standby mode ('secondary' in DRBD speak).

Is there some strong technical reason to only allow 2 nodes?  Was it
Asimov who said the only sensible numbers were 0, 1, and infinity?
(People still get surprised that md/raid1 can do 2 or 3 or n drives,
and that md/raid5 can handle just 2 :-)

> 
>     We loose network connectivity, the primary node continues to run, the
         lose
>     secondary no longer gets updates.
> 
>     Then we have a complete power failure, both nodes are down. Then they
>     power up the data center again, but at first the get only the power
                                                   they
>     circuit of node B up and running again.
> 
>     Should node B offer the service right now ?
>       ( DRBD has configurable policies for that )
> 
>     Later on they manage to get node A up and running again, now lets assume
>     node B was chosen to be the new primary node. What needs to be done ?
> 
>     Modifications on B since it became primary needs to be resynced to A.
>     Modifications on A sind it lost contact to B needs to be taken out.
> 
>     DRBD does that.
> 
>     How do you fit that into a RAID1+NBD model ? NBD is just a block
>     transport, it does not offer the ability to exchange dirty bitmaps or
>     data generation identifiers, nor does the RAID1 code has a concept of
>     that.

Not 100% true, but I - at least partly -  get your point.
As md stores bitmaps and data generation identifiers on the block
device, these can be transferred over NBD just like any other data on
the block device.
However I think that part of your point is that DRBD can transfer them
more efficiently (e.g. it compresses the bitmap before transferring it
-  I assume the compression you use is much more effective than gzip??
else why bother to code your own).
I suspect there is more to your point that I am missing.
You say "nor does the RAID1 code has a concept of that".  It isn't
clear what you are referring to.  RAID1 does have a concept of dirty
bitmaps as you know, and it does have a concept of data generation,
though it is quite possibly weaker than the concept that DRBD has.
I'd need to explore the DRBD code more to be sure.


> 
>    2) When using DRBD over small bandwidth links, one has to run a resync,
>     DRBD offers the option to do a "checksum based resync". Similar to rsync
>     it at first only exchanges a checksum, and transmits the whole data
>     block only if the checksums differ.
> 
>     That again is something that does not fit into the concepts of
>     NBD or RAID1.

Interesting idea....  RAID1 does have a mode where it reads both (all)
devices and compares them to see if they match or not.  Doing this
compare with checksums rather than memcmp would not be an enormous
change.
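
Very roughly (illustration only, not md code; csum() and MAX_CSUM_LEN are
stand-ins for whatever strong hash we would settle on):

/* instead of shipping the remote block and memcmp()ing it locally,
 * each side hashes its copy and only the digests cross the wire;
 * the data block itself is only transferred on a mismatch */
static bool block_in_sync(const void *local_data, size_t len,
			  const u8 *remote_csum, size_t csum_len)
{
	u8 local_csum[MAX_CSUM_LEN];

	csum(local_data, len, local_csum);
	return memcmp(local_csum, remote_csum, csum_len) == 0;
}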

I'm beginning to imagine an enhanced NBD as a model for what DRBD
does.
This enhanced NBD not only supports read and write of blocks but also:

   - maintains the local bitmap and sets bits before allowing a write
   - can return a strong checksum rather than the data of a block
   - provides sequence numbers in a way that I don't fully understand
     yet, but which allows consistent write ordering.
   - allows reads to be compressed so that the bitmap can be
     transferred efficiently.

I can imagine that md/raid1 could be made to work well with an
enhanced NBD like this.
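
Purely to make that concrete (none of these commands exist in the real
NBD protocol, this is just the shape of the idea):

enum enbd_cmd {
	ENBD_CMD_READ,		/* as today */
	ENBD_CMD_WRITE,		/* as today, but the server sets its dirty bit first */
	ENBD_CMD_READ_CSUM,	/* return a strong checksum instead of the data */
	ENBD_CMD_READ_BITMAP,	/* return the (compressed) dirty bitmap */
	ENBD_CMD_SEQ_POINT,	/* ordering point for write-after-write dependencies */
};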

> 
>   DRBD can also be used in dual-Primary mode (device writable on both
>   nodes), which means it can exhibit shared disk semantics in a
>   shared-nothing cluster.  Needless to say, on top of dual-Primary
>   DRBD utilizing a cluster file system is necessary to maintain for
>   cache coherency.
> 
>   More background on this can be found in this paper:
>     http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
> 
>   Beyond that, DRBD addresses various issues of cluster partitioning,
>   which the MD/NBD stack, to the best of our knowledge, does not
>   solve. The above-mentioned paper goes into some detail about that as
>   well.

Agreed - MD/NBD could probably be easily confused by cluster
partitioning, though I suspect that in many simple cases it would get
it right.  I haven't given it enough thought to be sure.  I doubt the
enhancements necessary would be very significant though.

> 
>   DRBD can operate in synchronous mode, or in asynchronous mode. I want
>   to point out that we guarantee not to violate a single possible write
>   after write dependency when writing on the standby node. More on that
>   can be found in this paper:
>     http://www.drbd.org/fileadmin/drbd/publications/drbd_lk9.pdf

I really must read and understand this paper..


So... what would you think of working towards incorporating all of the
DRBD functionality into md/raid1??
I suspect that it would be a mutually beneficial exercise, except for
the small fact that it would take a significant amount of time and
effort.  I'd be willing to shuffle some priorities and put in some effort
if it was a direction that you would be open to exploring.

Whether the current DRBD code gets merged or not is possibly a
separate question, though I would hope that if we followed the path of
merging DRBD into md/raid1, then any duplicate code would eventually be
excised from the kernel.

What do you think?

NeilBrown

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03  5:53 ` Neil Brown
@ 2009-05-03  6:24   ` david
  2009-05-03  8:29   ` Lars Ellenberg
  1 sibling, 0 replies; 90+ messages in thread
From: david @ 2009-05-03  6:24 UTC (permalink / raw)
  To: Neil Brown
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche, Lars Ellenberg

I am not a DRBD developer, but I can answer some of your questions below.

On Sun, 3 May 2009, Neil Brown wrote:

> On Thursday April 30, philipp.reisner@linbit.com wrote:
>> Hi,
>>
>> This is a repost of DRBD, to keep you updated about the ongoing
>> cleanups and improvements.
>>
>> Patch set attached. Git tree available:
>> git pull git://git.drbd.org/linux-2.6-drbd.git drbd
>>
>> We are looking for reviews!
>>
>> Description
>>
>>   DRBD is a shared-nothing, synchronously replicated block device. It
>>   is designed to serve as a building block for high availability
>>   clusters and in this context, is a "drop-in" replacement for shared
>>   storage. Simplistically, you could see it as a network RAID 1.
>
> I know this is minor, but it bugs me every time I see that phrase
> "shared-nothing".   Surely the network is shared??

the logical network(s) as a whole are shared, but physically they can be 
redundant, multi-pathed, etc.

> And the code...
> Can you just say "DRBD is a synchronously replicated block device"?
> or would we have to call it SRBD then?
> Or maybe "shared-nothing" is an accepted technical term in the
> clustering world??

DRBD can be configured to be synchronous or asynchronous.

'shared-nothing' is an accepted technical term in the clustering world for 
when two systems are not using any single shared device.

in the case of a network, I commonly set up systems where the network has 
two switches (connected together with fiber so that an electrical problem 
in one switch cannot short out the other) with the primary box plugged 
into one switch and the backup box plugged into another. I also make sure 
that my primary and backup systems are in separate racks, so that 
if something goes wrong in one rack that causes an excessive amount of 
heat it won't affect the backup systems (and yes, this has happened to me 
when I got lazy and stopped checking on this)

at this point the network switch is not shared (although the logical 
network is)

in the case of disk storage the common situation is 'shared-disk' where 
you have one disk array and both machines are plugged into it.

this gives you a single point of failure if the disk array crashes (even 
if it has redundant controllers, power supplies, etc things still happen), 
and the disk array can only be in one physical location.

DRBD lets you logically set up your systems as if they were a 'shared-disk' 
architecture, but with the hardware being 'shared-nothing'.

you can have the two halves of the cluster in different states, so that 
even a major disaster like an earthquake won't kill the system. (a classic 
case of 'shared-nothing')

>>
>>    1) Think of a two node HA cluster. Node A is active ('primary' in DRBD
>>     speak) has the filesystem mounted and the application running. Node B is
>>     in standby mode ('secondary' in DRBD speak).
>
> If there some strong technical reason to only allow 2 nodes?  Was it
> Asimov who said the only sensible numbers were 0, 1, and infinity?
> (People still get surprised that md/raid1 can do 2 or 3 or n drives,
> and that md/raid5 can handle just 2 :-)

in this case we have 1 replica (or '1 other machine'), so we are on an 
'interesting number' ;-)

many people would love to see DRBD extended beyond this, but my 
understanding is that doing so is non-trivial.

>>   DRBD can also be used in dual-Primary mode (device writable on both
>>   nodes), which means it can exhibit shared disk semantics in a
>>   shared-nothing cluster.  Needless to say, on top of dual-Primary
>>   DRBD utilizing a cluster file system is necessary to maintain for
>>   cache coherency.
>>
>>   More background on this can be found in this paper:
>>     http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
>>
>>   Beyond that, DRBD addresses various issues of cluster partitioning,
>>   which the MD/NBD stack, to the best of our knowledge, does not
>>   solve. The above-mentioned paper goes into some detail about that as
>>   well.
>
> Agreed - MD/NBD could probably be easily confused by cluster
> partitioning, though I suspect that in many simple cases it would get
> it right.  I haven't given it enough thought to be sure.  I doubt the
> enhancements necessary would be very significant though.

think of two different threads doing writes directly to their own side of 
the mirror; the system needs to notice this happening and copy the data to 
the other half of the mirror (with GFS working above you to coordinate the 
two threads and make sure they don't make conflicting writes).

it's not a trivial task.

David Lang

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/16] DRBD: lru_cache
  2009-05-02 23:51     ` Kyle Moffett
@ 2009-05-03  6:27       ` Lars Ellenberg
  2009-05-03 14:06         ` Kyle Moffett
  0 siblings, 1 reply; 90+ messages in thread
From: Lars Ellenberg @ 2009-05-03  6:27 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH, Neil Brown,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Bart Van Assche

On Sat, May 02, 2009 at 07:51:36PM -0400, Kyle Moffett wrote:
> After digging through this a bit, I would recommend rewriting this
> whole part.  The part that definitely needs to go is the vmalloc() of
> your whole LRU at once.
> 
> Here's some code I've been fiddling around with for a while.  It uses
> a kmem_cache and a shrinker callback (although it does need a small
> patch to the shrinker code) to provide a dynamically sized LRU list
> for fixed-size objects.

Thanks.

When we created our lru_cache stuff, we considered embedding callbacks
and internal locking, but decided against it.  Conceptually it should be
more like the "list.h" list handling infrastructure.

The user will have their own locking in place anyways, and in general
their critical section will be a few lines of code larger than the
"lru cache" manipulation itself.

And, the specific use of our implementation is that there is a
pre-selected maximum count of in-use objects, and the user gets
feedback about changes to this "active" set of objects.

Think of a small room with N seats, and one entrance.
Any number of people outside.
Only N fit in the room.
Seats are numbered (index), people have id (element number).

Our lru_cache implementation is the man at the door,
keeping track of which seats are empty,
and seeing to it that if one needs to go in,
either there is an empty seat available,
or first the "most dispensable" one gets out.

That is where the focus is:
make the set of active objects easily trackable.
So one can easily keep track of who is in, and who is not,
by writing a log of just this "diff":
seat index was occupied by element_nr A, but is now occupied by element_nr B.

The lru part is to make it more easy to determine
who the "most dispensable" is.

So from looking at your code, it may be fine for the "lru" part,
but it is not suitable for our purposes.

	Lars

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 04/16] DRBD: bitmap
  2009-05-03  5:21             ` Neil Brown
@ 2009-05-03  7:38               ` Lars Ellenberg
  2009-05-05 17:48               ` Lars Marowsky-Bree
  1 sibling, 0 replies; 90+ messages in thread
From: Lars Ellenberg @ 2009-05-03  7:38 UTC (permalink / raw)
  To: Neil Brown
  Cc: James Bottomley, Philipp Reisner, linux-kernel, Jens Axboe,
	Greg KH, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche

On Sun, May 03, 2009 at 03:21:41PM +1000, Neil Brown wrote:
> On Saturday May 2, lars.ellenberg@linbit.com wrote:
> > On Sat, May 02, 2009 at 10:41:58AM -0500, James Bottomley wrote:
> > > On Thu, 2009-04-30 at 13:26 +0200, Philipp Reisner wrote:
> > > > DRBD maintains a dirty bitmap in case it has to run without peer node or
> > > > without local disk. Writes to the on disk dirty bitmap are minimized by the
> > > > activity log (=AL). Each time an extent is evicted from the AL the part of
> > > > the bitmap no longer covered by the AL is written to disk.
> > > > 
> > > > Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
> > > > Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> > > 
> > > The way the bitmap and activity log work are very similar to the way the
> > > md bitmap works (and are implemented for almost exactly the same
> > > reason).  Is there any way we could combine them?
> > 
> > in principle yes.
> > the DRBD bitmap has a granularity of 4 kB per bit,
> > and the "activity log" covers 4 MB per what we call "al extent".
> > 
> > though there is a very important difference.
> > 
> > in MD, when the bitmap is in use, I think the approach is:
> > 
> >   for each write queued to the lower level devices,
> >      dirty bits in memory
> >      for every newly dirtied bitmap page,
> > 	flush bitmap pages to disk
> > 	wait for these bitmap writes to complete
> >   then unplug the lowe level devices
> > 
> >   in background: periodically try to clean some pages,
> > 	and write them to disk
> > 
> > the DRBD approach is:
> >   if target "al extent" of this write request
> >   is NOT in the in-memory "lru_cache" already,
> > 	get it into the cache,
> > 		if that means we have to kick an
> > 		old element from the cache, and
> > 		the associated bitmap is dirty
> > 			write that part of the bitmap
> >         write an "al transaction" (synchronous single sector write)
> >   else
> >   	FAST PATH, no additional "meta data" write needed.
> >   
> >   submit to lower level device.
> > 
> > 
> > MD most of the time just _needs_ the additional "meta data" writes.
> > DRBD most of the time does not (unless you have completely random
> > writes, always requesting an extent not yet/anymore in the activity log).
> > 
> > I'm in the process of generalizing DRBDs approach to allow more than one
> > "al extent" to change during a "prepare" step, and cover several such changes
> > in one "al transaction", so the number of meta data updates can be
> > reduced even further.
> > 
> > adopting this "activity log" approach would make MD even better, IMO.
> 
> I've been pondering this, wondering what the important difference is.
> I picture the DRBD approach - abstractly - as maintaining 2 bitmaps.
> One is very fine granularity (4K).  The other has much coarser
> granularity (4M).
> A sector of the array is considered to need resync (After unclean
> shutdown or whatever) if either bitmap has the bit set for the
> corresponding region of the array.
> 
> Bits are set on-disk in the coarse bitmap before any writes are
> allowed to corresponding regions, and are cleared lazily when there are
> no writes active in that region.
> Bits are set on-disk in the fine bitmap only when the corresponding
> bit of the coarse bitmap is about to be cleared on-disk.  There will
> only be bits to set if the array is degraded, so writes have completed
> to one half and cannot be sent to the other half.
> Bits are cleared on-disk in the fine bitmap after a 'resync' - and
> presumably again just before the corresponding coarse bit is cleared.
> 
> DRBD stores this coarse bitmap as an activity log which is (I think)
> just a list of addresses of bits that are set.  Not unlike run-length
> encoding.   The rule for lazy clearing of bits is that when the number
> of bits which are set crosses a threshold, we clear the 'oldest' bit.
> 
> I could conceivably take this approach into md without changing the
> on-disk layout at all.  To set a bit in the coarse bitmap, I would
> simply set all the corresponding bits in the fine on-disk bitmap.
> This could involve writing a whole sector of ones to just set one
> bit... but as you cannot write less than a sector that isn't really
> a problem.  DRBD currently writes one sector per bit set, so it should
> be no worse than DRBD.


You'd set a whole sector of bits on disk,
but keep them cleared in memory - unless degraded.
On "clearing the coarse bit" (evicting from our lru_cache),
you'd write the actual in-memory bitmap.
Yes, that would be more or less functionally equivalent, probably.
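
To make the fast-path / slow-path split above a bit more concrete, a
compact user-space sketch; the 4 MB extent size matches the description,
but the names and the stubbed-out meta-data writes are purely
illustrative, not our actual code:

  #include <stdbool.h>
  #include <stdio.h>

  #define AL_EXTENT_SIZE (4ULL << 20)   /* 4 MB per activity-log extent */

  /* stand-ins for the real lru_cache lookups and meta-data writes */
  static bool al_extent_is_active(unsigned long long ext)  { return ext < 8; }
  static unsigned long long al_evict_coldest(void)         { return 42; }
  static bool bitmap_part_is_dirty(unsigned long long ext) { (void)ext; return true; }
  static void write_bitmap_part(unsigned long long ext)    { printf("write bitmap for extent %llu\n", ext); }
  static void write_al_transaction(unsigned long long ext) { printf("write AL transaction for extent %llu\n", ext); }
  static void submit_to_lower_device(unsigned long long s) { printf("submit write at sector %llu\n", s); }

  static void drbd_style_write(unsigned long long sector)
  {
          unsigned long long ext = (sector << 9) / AL_EXTENT_SIZE;

          if (!al_extent_is_active(ext)) {
                  /* slow path: activate the extent, evicting an old one first */
                  unsigned long long old = al_evict_coldest();

                  if (bitmap_part_is_dirty(old))
                          write_bitmap_part(old);
                  write_al_transaction(ext);  /* synchronous single-sector write */
          }
          /* fast path (extent already active): no extra meta-data write at all */
          submit_to_lower_device(sector);
  }

  int main(void)
  {
          drbd_style_write(100);          /* falls into an already active extent */
          drbd_style_write(1ULL << 30);   /* forces an activation + AL update    */
          return 0;
  }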

> Another issue here is bitmap granularity.  DRBD uses two granularities:
> 4M and 4K.  md uses just one, but it is configurable.  People tend to
> find larger granularities provide better performance for exactly the
> same reason that DRBD uses 4M for the activity log - to minimise
> updates when write activity is fairly local.
> By doing so, we miss out on the advantages of fine granularity - that
> being that there is less data to move around during resync.  For local
> disks, that cost is not enormous as seek time is much slower than data
> transfer, so copying a large block costs much the same as a few small
> blocks at the same location.
> For DRBD where the data is moved over the network which is slower than
> a local interconnect, the data transfer time presumably becomes the
> main cost, so minimising the data that needs to be transferred after a
> reconnect is important.  So supporting two different granularities
> certainly seems to make sense where a network transport is involved.

Right.
Mid-term, we intend to make both granularities configurable, btw.

> I would be interested in adding this sort of two-level support to md's
> bitmaps.  I cannot immediately see the benefits of the activity log
> format though.  I would probably just set more bits any time I had to
> set any, to avoid subsequent updates.
> e.g. for a 4TB filesystem with 4K bitmap chunk size, I would have 2^30 bits
> in 2^18 sectors - 128Meg of bitmap altogether.

exactly.

> Whenever updating a bit, I'd set maybe 1/4 or 1/2 of the bits in the
> sector, this covers 4MB or 8MB.  They then get cleared lazily as
> discussed above.
> This would need a bit of work in md/bitmap, partly because the current
> implementation limits a bitmap to 2^20 bits (partly because I won't
> use vmalloc).

The in-memory bitmap implementation of DRBD uses an array of GFP_HIGHUSER
pages, and is capable of supporting (ULONG_MAX-1) bits.
The main disadvantage: it holds the bitmap in memory all the time,
which amounts to 512 MB of mostly unused core memory for a 16 TB backing store.
Conceivably the implementation can be changed to only hold "N" pages
in memory at any given time, where "N" would again be the number of
elements in our "lru_cache".
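
As an aside, the 512 MB figure is just the 4 kB-per-bit granularity
applied to 16 TB; a trivial sketch of the arithmetic:

  #include <stdio.h>

  int main(void)
  {
          unsigned long long device = 16ULL << 40;   /* 16 TB backing store */
          unsigned long long bits   = device >> 12;  /* one bit per 4 kB    */
          unsigned long long bytes  = bits >> 3;
          unsigned long long pages  = bytes >> 12;   /* 4 kB pages          */

          printf("%llu bits, %llu MB of bitmap, %llu pages\n",
                 bits, bytes >> 20, pages);  /* 4294967296, 512, 131072 */
          return 0;
  }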

> As I said, I don't immediately see the benefits of the activity log
> format, however,
>  1/ I am happy to listen to its benefits being explained

Compared to using an explicit on-disk "coarse" bitmap,
the "activity log" format can track arbitrarily large
devices in a small, constant, on-disk area.
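
To illustrate why the on-disk area stays constant: an activity log with N
slots only ever records which N extents are currently active, no matter
how large the device is. The struct and slot count below are made up for
illustration, not our actual on-disk format:

  #include <stdio.h>

  /* hypothetical on-disk activity-log record, one per slot */
  struct al_slot {
          unsigned int slot;        /* 0 .. nr_slots-1                    */
          unsigned int extent_nr;   /* which 4 MB extent this slot covers */
  };

  int main(void)
  {
          unsigned int nr_slots = 256;   /* arbitrary example value */

          printf("on-disk activity log: %zu bytes, independent of device size\n",
                 nr_slots * sizeof(struct al_slot));
          return 0;
  }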

But as the fine bitmap is on disk anyway, there is no need for the
coarse bitmap to be present anywhere, apart from using it to explain the
concept to people.

Just for the dirty bitmap, an in-memory scheme using (something similar
to) our lru_cache stuff should be functionally equivalent: instead of
writing "al transactions", flush the in-memory (fine) bitmap to disk on
"evict", and mark quarter, half or full on-disk (fine) bitmap sectors as
dirty without touching the in-memory bitmap.

We had greater plans for the activity log,
using its most useful property: on crash recovery,
one knows exactly which parts of the device (may)
have been the target of in-flight IO during the crash
(which, when degraded, is not the same as looking
at the dirty bitmap, and when cleanly shut down
is something else again, but still may be useful).
  But none of those plans has been coded yet,
and possibly some of them are just nonsense anyway,
so you may well ignore this paragraph.

>  2/ If we were to agree that merging DRBD functionality into md
>    (for which there isn't a concrete proposal, but the suggestion
>     seems to be floating around) were a good thing, I don't have any
>     problem with supporting an activity log in md in the name of
>     compatibility.

 ;)

The activity log would be the least of it.

MD currently talks only to "dumb" block devices.
DRBD uses a stateful transport.

examples:

  High Availability clustering, single "Active" node.
    For some reason you run into diverging data sets,
    the result of what is usually called split-brain (communication loss
    between cluster nodes).

    If you had a SAN, you'd be screwed.

    But you have been replicating, so you now have two data sets,
    both consistent, but slightly different.

    Assume that one node was completely cut off from communication,
    so it could not be reached by clients either, and it is thus easy
    to determine the "better" data set.

    You used MD over iSCSI (or NBD or whatever).
      You are still screwed: first you need to detect this
      "after split-brain, diverging data sets" situation,
      then you need to do a full resync.
      If you only do a partial resync
      based on whatever MD bitmap there is,
      you get inconsistent mirrors.

    You used DRBD.
      DRBD can communicate, reliably detect the situation,
      and usually refuses to destroy either consistent data set
      until you tell it which changes to throw away.
      Then it will exchange dirty bitmaps, bit-or them together
      (see the sketch below), and thus revert the changes of the victim
      and apply the changes of the chosen "better" data set.
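
The merge itself is trivial once both bitmaps are at hand; a minimal
sketch with made-up names:

  /* union of both dirty bitmaps: every block either side touched
   * while disconnected ends up marked and gets resynced */
  static void merge_dirty_bitmaps(unsigned long *local_bm,
                                  const unsigned long *peer_bm,
                                  unsigned long words)
  {
          unsigned long i;

          for (i = 0; i < words; i++)
                  local_bm[i] |= peer_bm[i];
  }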

  Clustering, "Two Primaries", both nodes active.
    You can use OCFS2 on top of DRBD,
    _without_ any shared components.
    No SAN. Shared nothing.
    Just replicating.

    However useful that may be: yes, it is a tricky setup,
    and yes, we can (and will) improve on the handling of such setups.

There are more neat things possible
when not restricted to "dumb" transports.
To support this kind of stuff,
MD needs to change its architecture quite a bit,
so that would be more of a long-term project.

Please don't get me wrong:
I'm definitely in, if this turns out to be the way to go.
But don't underestimate the effort.


	Lars

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03  5:53 ` Neil Brown
  2009-05-03  6:24   ` david
@ 2009-05-03  8:29   ` Lars Ellenberg
  2009-05-03 11:00     ` Neil Brown
  1 sibling, 1 reply; 90+ messages in thread
From: Lars Ellenberg @ 2009-05-03  8:29 UTC (permalink / raw)
  To: Neil Brown
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche

On Sun, May 03, 2009 at 03:53:41PM +1000, Neil Brown wrote:
> I know this is minor, but it bugs me every time I see that phrase
> "shared-nothing". 

> Or maybe "shared-nothing" is an accepted technical term in the
> clustering world??

yes.

> All this should probably be in a patch against Documentation/drbd.txt 

Ok.

> >    1) Think of a two node HA cluster. Node A is active ('primary' in DRBD
> >     speak) has the filesystem mounted and the application running. Node B is
> >     in standby mode ('secondary' in DRBD speak).
> 
> If there some strong technical reason to only allow 2 nodes?

It "just" has not yet been implemented.
I'm working on that, though.

> >     How do you fit that into a RAID1+NBD model ? NBD is just a block
> >     transport, it does not offer the ability to exchange dirty bitmaps or
> >     data generation identifiers, nor does the RAID1 code has a concept of
> >     that.
> 
> Not 100% true, but I - at least partly -  get your point.
> As md stores bitmaps and data generation identifiers on the block
> device, these can be transferred over NBD just like any other data on
> the block device.

Do you have one dirty bitmap per mirror (yet) ?
Do you _merge_ them?

The "NBD" mirrors are remote, and once you lose communication,
they may be (and in general you have to assume they are) modified
by whichever node they are directly attached to.

> However I think that part of your point is that DRBD can transfer them
> more efficiently (e.g. it compresses the bitmap before transferring it
> -  I assume the compression you use is much more effective than gzip??
> else why both to code your own).

No, the point was that we have one bitmap per mirror (though currently
the number of mirrors is limited to 2), and that we do merge them.

But to answer the question:
why bother to implement our own encoding?
Because we know a lot about the data to be encoded.

The compression of the bitmap transfer was added only very recently.
For a bitmap with large chunks of bits set or unset, it is efficient
to just code the run length.
Using gzip in the kernel would add yet another large overhead for code
tables and so on.
During testing of this encoding, applying it to an already gzip'ed file
was able to compress it even further, btw.
Though on English plain text, gzip compression is _much_ more effective.
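
A toy version of that idea (user-space C, nothing like our actual wire
format): encode the bitmap as the lengths of alternating runs of clear
and set bits.

  #include <stdio.h>

  static int get_bit(const unsigned char *bm, unsigned long i)
  {
          return (bm[i >> 3] >> (i & 7)) & 1;
  }

  /* emit run lengths of alternating 0-runs and 1-runs, starting with a 0-run */
  static unsigned long rle_encode(const unsigned char *bm, unsigned long nbits,
                                  unsigned long *out)
  {
          unsigned long i = 0, n = 0;
          int cur = 0;

          while (i < nbits) {
                  unsigned long run = 0;

                  while (i < nbits && get_bit(bm, i) == cur) {
                          run++;
                          i++;
                  }
                  out[n++] = run;       /* may be 0, e.g. if bit 0 is set */
                  cur = !cur;
          }
          return n;                     /* number of runs written */
  }

  int main(void)
  {
          unsigned char bm[16] = { 0 }; /* 128 bits, mostly clear */
          unsigned long runs[128], n, i;

          bm[4] = 0xff;                 /* bits 32..39 set */
          n = rle_encode(bm, 128, runs);
          for (i = 0; i < n; i++)
                  printf("%lu ", runs[i]);   /* prints: 32 8 88 */
          printf("\n");
          return 0;
  }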

> You say "nor does the RAID1 code has a concept of that".  It isn't
> clear what you are referring to.

The concept that one of the mirrors (the "nbd" one in that picture)
may have been accessed independently, without MD knowing,
because the node this MD (and its "local" mirror) was living on
suffered a power outage.

The concept of both mirrors being modified _simultaneously_
(e.g. living below a cluster file system).

> >    2) When using DRBD over small bandwidth links, one has to run a resync,
> >     DRBD offers the option to do a "checksum based resync". Similar to rsync
> >     it at first only exchanges a checksum, and transmits the whole data
> >     block only if the checksums differ.
> > 
> >     That again is something that does not fit into the concepts of
> >     NBD or RAID1.
> 
> Interesting idea....  RAID1 does have a mode where it reads both (all)
> devices and compares them to see if they match or not.  Doing this
> compare with checksums rather than memcmp would not be an enormous
> change.
> 
> I'm beginning to imagine an enhanced NBD as a model for what DRBD
> does.  This enhanced NBD not only supports read and write of blocks
> but also:
> 
>    - maintains the local bitmap and sets bits before allowing a write

right.

>    - can return a strong checksum rather than the data of a block

ok.

>    - provides sequence numbers in a way that I don't fully understand
>      yet, but which allows consistent write ordering.

yes, please.

>    - allows reads to be compressed so that the bitmap can be
>      transferred efficiently.

yep.

Add to that
     - can exchange data generations on handshake,
     - can refuse the handshake (consistent data,
       but evolved differently from the other copy;
       diverging data sets detected!)
     - is bi-directional, can _push_ writes!

and whatever else I forgot just now (roughly sketched below).
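
Purely to illustrate what such a handshake might carry, the fields
discussed above written down as a struct; this is not an existing
protocol, just a sketch:

  /* hypothetical "enhanced NBD" / replication handshake (illustrative only) */
  struct repl_handshake {
          unsigned long long current_uuid;      /* data generation identifiers */
          unsigned long long bitmap_uuid;
          unsigned long long history_uuid[2];
          unsigned int       protocol_version;
          unsigned int       flags;             /* e.g. "I am primary",
                                                 * "my data is consistent",
                                                 * "I can push writes"        */
  };
  /* on connect: exchange this, compare generation identifiers, then either
   * agree on a resync direction or refuse (diverging data sets detected)   */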

> I can imagine that md/raid1 could be made to work well with an
> enhanced NBD like this.

of course.

> >   DRBD can also be used in dual-Primary mode (device writable on both
> >   nodes), which means it can exhibit shared disk semantics in a
> >   shared-nothing cluster.  Needless to say, on top of dual-Primary
> >   DRBD utilizing a cluster file system is necessary to maintain for
> >   cache coherency.
> > 
> >   More background on this can be found in this paper:
> >     http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
> > 
> >   Beyond that, DRBD addresses various issues of cluster partitioning,
> >   which the MD/NBD stack, to the best of our knowledge, does not
> >   solve. The above-mentioned paper goes into some detail about that as
> >   well.
> 
> Agreed - MD/NBD could probably be easily confused by cluster
> partitioning, though I suspect that in many simple cases it would get
> it right.  I haven't given it enough thought to be sure.  I doubt the
> enhancements necessary would be very significant though.

The most significant part is probably the bidirectional nature
and the "refuse it" part of the handshake.

> >   DRBD can operate in synchronous mode, or in asynchronous mode. I want
> >   to point out that we guarantee not to violate a single possible write
> >   after write dependency when writing on the standby node. More on that
> >   can be found in this paper:
> >     http://www.drbd.org/fileadmin/drbd/publications/drbd_lk9.pdf
> 
> I really must read and understand this paper..
> 
> 
> So... what would you think of working towards incorporating all of the
> DRBD functionality into md/raid1??
> I suspect that it would be a mutually beneficial exercise, except for
> the small fact that it would take a significant amount of time and
> effort.  I'd be will to shuffle some priorities and put in some effort
> if it was a direction that you would be open to exploring.

Sure. But yes, full ack on the time and effort part ;)

> Whether the current DRBD code gets merged or not is possibly a
> separate question, though I would hope that if we followed the path of
> merging DRBD into md/raid1, then any duplicate code would eventually be
> excised from the kernel.

Rumor [http://lwn.net/Articles/326818/] has it that the various in-kernel
RAID implementations are being unified right now anyway?

If you want to stick to "replication is almost identical to RAID1",
best not to forget that "this may be a remote mirror": there may be more
than one entity accessing it, and it may be part of a bi-directional
(active-active) replication setup.

For further ideas on what could be done with replication (enhancing the
strict "raid1" notion), see also
http://www.drbd.org/fileadmin/drbd/publications/drbd9.linux-kongress.2008.pdf

 - time-shift replication
 - generic point-in-time recovery of block device data
 - (remote) backup by periodic, round-robin re-sync of
   "raid" members, then "dropping" them again
 ...

No usable code on those ideas yet,
but a lot of thought. It is not all handwaving.

	Lars

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03  5:36     ` Willy Tarreau
  2009-05-03  5:40       ` david
@ 2009-05-03 10:06       ` Philipp Reisner
  2009-05-03 10:15         ` Thomas Backlund
  1 sibling, 1 reply; 90+ messages in thread
From: Philipp Reisner @ 2009-05-03 10:06 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Bart Van Assche, Andrew Morton, linux-kernel, Jens Axboe,
	Greg KH, Neil Brown, James Bottomley, Sam Ravnborg, Dave Jones,
	Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
	Lars Ellenberg

Am Sonntag 03 Mai 2009 07:36:00 schrieb Willy Tarreau:
> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> > On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> >
> > <akpm@linux-foundation.org> wrote:
> > > On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner 
<philipp.reisner@linbit.com> wrote:
> > >> This is a repost of DRBD
> > >
> > > Is it being used anywhere for anything? ?If so, where and what?
> >
> > One popular application is to run iSCSI and HA software on top of DRBD
> > in order to build a highly available iSCSI storage target.
>
> Confirmed, I have several customers who're doing exactly that.
>

Besides storage targets, DRBD is also very popular for databases;
it is widely used with PostgreSQL and MySQL. Both database projects
advertise running their DB on top of DRBD to form HA clusters.

Raw numbers of installations:

We have an opt-in global usage counter, see http://www.drbd.org/usage/year/
If we assume that 30% of all users agree that their DRBD installation gets
counted, then we have more than 12000 new installations in April.
(4245 installations were counted.)

It seems that nowadays most of our users get DRBD through their distributions.
Distributions that include DRBD are (list incomplete):
Debian, Ubuntu, SLES, CentOS.

-Phil

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03 10:06       ` Philipp Reisner
@ 2009-05-03 10:15         ` Thomas Backlund
  0 siblings, 0 replies; 90+ messages in thread
From: Thomas Backlund @ 2009-05-03 10:15 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Philipp Reisner skrev:
> 
> It seems that nowadays most of our users get DRBD through their distributions. 
> Distributions that include DRBD are (list incomplete): 
> Debian, Ubuntu, SLES, CentOS

Mandriva has been including it for ~4 years too...

--
Thomas

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03  8:29   ` Lars Ellenberg
@ 2009-05-03 11:00     ` Neil Brown
  2009-05-03 21:32       ` Lars Ellenberg
  0 siblings, 1 reply; 90+ messages in thread
From: Neil Brown @ 2009-05-03 11:00 UTC (permalink / raw)
  To: Lars Ellenberg
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche

On Sunday May 3, lars.ellenberg@linbit.com wrote:
> > If there some strong technical reason to only allow 2 nodes?
> 
> It "just" has not yet been implemented.
> I'm working on that, though.

:-)

> 
> > >     How do you fit that into a RAID1+NBD model ? NBD is just a block
> > >     transport, it does not offer the ability to exchange dirty bitmaps or
> > >     data generation identifiers, nor does the RAID1 code has a concept of
> > >     that.
> > 
> > Not 100% true, but I - at least partly -  get your point.
> > As md stores bitmaps and data generation identifiers on the block
> > device, these can be transferred over NBD just like any other data on
> > the block device.
> 
> Do you have one dirty bitmap per mirror (yet) ?
> Do you _merge_ them?

md doesn't merge bitmaps yet.  However, if I found a need to, I would
simply read a bitmap in userspace and feed it into the kernel via 
	/sys/block/mdX/md/md/bitmap_set_bits

We sort-of have one bitmap per mirror, but only because the one bitmap
is mirrored...

> 
> the "NBD" mirrors are remote, and once you lose communication,
> they may be (and in general, you have to assume they are) modified
> by which ever node they are directly attached to.
> 
> > However I think that part of your point is that DRBD can transfer them
> > more efficiently (e.g. it compresses the bitmap before transferring it
> > -  I assume the compression you use is much more effective than gzip??
> > else why both to code your own).
> 
> No, the point was that we have one bitmap per mirror (though currently
> number of mirrors == 2, only), and that we do merge them.

Right.  I imagine much of the complexity of that could be handled in
user-space while setting up a DRBD instance (??).

> 
> but to answer the question:
> why bother to implement our own encoding?
> because we know a lot about the data to be encoded.
> 
> the compression of the bitmap transfer we just added very recently.
> for a bitmap, with large chunks of bits set or unset, it is efficient
> to just code the runlength.
> to use gzip in kernel would add yet an other huge overhead for code
> tables and so on.
> during testing of this encoding, applying it to an already gzip'ed file
> was able to compress it even further, btw.
> though on english plain text, gzip compression is _much_ more effective.

I just tried a little experiment.
I created a 128 MB file and randomly set 1000 bits in it.
I compressed it with "gzip --best" and the result was 4 MB.  Not
particularly impressive.
I then tried to compress it with bzip2 and got 3452 bytes.
Now *that* is impressive.  I suspect your encoding might do a little
better, but I wonder if it is worth the effort.
I'm not certain that my test file is entirely realistic, but it is
still an interesting experiment.
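
Something like the following reproduces such a test file (a rough
sketch; not necessarily byte-for-byte what I used):

  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
          size_t size = 128UL << 20;              /* 128 MB of "bitmap" */
          unsigned char *buf = calloc(1, size);
          FILE *f = fopen("testbitmap", "wb");
          int i;

          if (!buf || !f)
                  return 1;
          srand(1);
          for (i = 0; i < 1000; i++) {            /* set 1000 random bits */
                  size_t bit = ((size_t)rand() << 16 ^ rand()) % (size * 8);
                  buf[bit >> 3] |= 1u << (bit & 7);
          }
          fwrite(buf, 1, size, f);
          fclose(f);
          free(buf);
          return 0;                               /* then: gzip --best / bzip2 */
  }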

Why do you do this compression in the kernel?  It seems to me that it
would be quite practical to do it all in user-space, thus making it
really easy to use pre-existing libraries.

BTW, the kernel already contains various compression code as part of
the crypto API.

> 
> > You say "nor does the RAID1 code has a concept of that".  It isn't
> > clear what you are referring to.
> 
> The concept that one of the mirrors (the "nbd" one in that picture)
> may have been accessed independently, without MD knowning,
> because the node this MD (and its "local" mirror) was living on
> suffered from power outage.
> 
> The concept of both mirrors being modified _simultaneously_,
> (e.g. living below a cluster file system).

Yes, that is an important concept.  Certainly one of the bits that
would need to be added to md.

> > Whether the current DRBD code gets merged or not is possibly a
> > separate question, though I would hope that if we followed the path of
> > merging DRBD into md/raid1, then any duplicate code would eventually be
> > excised from the kernel.
> 
> Rumor [http://lwn.net/Articles/326818/] has it, that the various in
> kernel raid implementations are being unified right now, anyways?

I'm not holding my breath on that one...  
I think that merging DRBD with md/raid1 would be significantly easier
than any sort of merge between md and dm.  But (in either case) I'll
do what I can to assist any effort that is technically sound.


> 
> If you want to stick to "replication is almost identical to RAID1",
> best not to forget "this may be a remote mirror", there may be more than
> one entity accessing it, this may be part of a bi-directional
> (active-active) replication setup.
> 
> For further ideas on what could be done with replication (enhancing the
> strict "raid1" notion), see also
> http://www.drbd.org/fileadmin/drbd/publications/drbd9.linux-kongress.2008.pdf
> 
>  - time shift replication
>  - generic point in time recovery of block device data
>  - (remote) backup by periodically, round-robin re-sync of
>    "raid" members, then "dropping" them again.
>  ...
> 
> No useable code on those ideas, yet,
> but a lot of thought. It is not all handwaving.

:-)

I'll have to do a bit of reading I see.  I'll then try to rough out a
design and plan for merging DRBD functionality with md/raid1.  At the
very least that would give me enough background understanding to be
able to sensibly review your code submission.

NeilBrown

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/16] DRBD: lru_cache
  2009-05-03  6:27       ` Lars Ellenberg
@ 2009-05-03 14:06         ` Kyle Moffett
  2009-05-03 22:48           ` Lars Ellenberg
  0 siblings, 1 reply; 90+ messages in thread
From: Kyle Moffett @ 2009-05-03 14:06 UTC (permalink / raw)
  To: Lars Ellenberg
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH, Neil Brown,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Bart Van Assche

On Sun, May 3, 2009 at 2:27 AM, Lars Ellenberg
<lars.ellenberg@linbit.com> wrote:
> When we created our lru_cache stuff, we considered embedding callbacks
> and internal locking, but decided against it.  Conceptually it should be
> more like the "list.h" list handling infrastructure.
>
> The user will have their own locking in place anyways, and in general
> their critical section will be a few lines of code larger than the
> "lru cache" manipulation itself.

One of the major design points for the code I'm fiddling with is that
it allows you to use RCU on your lookup table, which basically means
lock-free lookup (although I haven't stress-tested that part of it yet,
so the code itself may have subtle bugs).  With a bit more work it's
probably possible to make it lock-free even when adding a reference to
an object that's currently on the LRU.


> And, the specific use of our implementation is that there is a
> pre-selected maximum count of in-use objects, and the user gets
> feedback about changes to this "active" set of objects.

Another major design point (and the reason for the single "evict"
callback) is that my code does not require manual tuning; it responds
to memory pressure dynamically using the "shrinker" mechanism.  So on
a box with 128MB of RAM your LRU cache will automatically be limited
by other activity on the system to an appropriate size, yet it can
scale up to tens or hundreds of megabytes on a system with hundreds of
gigs of RAM under heavy IO load.

The real deal-breaker for your code is its usage of "vmalloc"; it's
unlikely to be merged while it relies on a vmalloc() of a large
contiguous block for operation.


> That is where the focus is:
> make the set of active objects easily trackable.
> So one can easily keep track of who is in, and who is not,
> by writing a log of just this "diff":
> seat index was occupied by element_nr A, but now is by element_nr B.

This could be very easily done with tracepoints and a few minor tweaks
to the implementation I provided.  I could add an object number and
various statistics similar to the ones in your code; the only reason I
didn't before is I did not need them and they detracted slightly from
the simplicity of the implementation (just 271 lines).

Keep in mind that by using the kmem_cache infrastructure, you get to
take advantage of all of the other SLAB debugging features on objects
allocated through your LRUs.  This includes redzones and
poison-on-free.  It also makes the number of objects and the number of
pages allocated show up in /proc/slabinfo and the various SLAB/SLUB
debug tools.


> So from looking at your code, it may be fine for the "lru" part,
> but it is not suitable for our purposes.

It would need an extra layer stacked on top to handle the hash-table
lookups, but it would solve the vmalloc issue and allow your LRU lists
to size themselves dynamically a bit better.  It's also not that
difficult to apply memory allocation limits (aside from the default
memory pressure) and to add additional statistics and debugging info.

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03  5:40       ` david
@ 2009-05-03 14:21         ` James Bottomley
  2009-05-03 14:36           ` david
  0 siblings, 1 reply; 90+ messages in thread
From: James Bottomley @ 2009-05-03 14:21 UTC (permalink / raw)
  To: david
  Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
> On Sun, 3 May 2009, Willy Tarreau wrote:
> 
> > On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> >> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> >> <akpm@linux-foundation.org> wrote:
> >>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
> >>>
> >>>> This is a repost of DRBD
> >>>
> >>> Is it being used anywhere for anything?  If so, where and what?
> >>
> >> One popular application is to run iSCSI and HA software on top of DRBD
> >> in order to build a highly available iSCSI storage target.
> >
> > Confirmed, I have several customers who're doing exactly that.
> 
> I will also say that there are a lot of us out here who would have a use 
> for DRDB in our HA setups, but have held off implementing it specificly 
> because it's not yet in the upstream kernel.

Actually, that's not a particularly strong reason, because we already
have an in-kernel replicator with much of the functionality of drbd
that you could use.  The main reason for wanting drbd in the kernel is
that it has a *current* user base.

Both the in-kernel md/nbd and drbd do sync and async replication with
primary-side bitmaps.  The main differences are:

      * md/nbd can do 1 to N replication,
      * drbd can do active/active replication (useful for cluster
        filesystems)
      * The chunk size of the md/nbd is tunable
      * With the updated nbd-tools, current md/nbd can do point in time
        rollback on transaction logged secondaries (a BCS requirement)
      * drbd manages the mirror state explicitly, md/nbd needs a user
        space helper

And probably a few others I forget.

James



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03 14:21         ` James Bottomley
@ 2009-05-03 14:36           ` david
  2009-05-03 14:45             ` James Bottomley
  0 siblings, 1 reply; 90+ messages in thread
From: david @ 2009-05-03 14:36 UTC (permalink / raw)
  To: James Bottomley
  Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Sun, 3 May 2009, James Bottomley wrote:

> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> 
> On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
>> On Sun, 3 May 2009, Willy Tarreau wrote:
>>
>>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
>>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
>>>> <akpm@linux-foundation.org> wrote:
>>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
>>>>>
>>>>>> This is a repost of DRBD
>>>>>
>>>>> Is it being used anywhere for anything?  If so, where and what?
>>>>
>>>> One popular application is to run iSCSI and HA software on top of DRBD
>>>> in order to build a highly available iSCSI storage target.
>>>
>>> Confirmed, I have several customers who're doing exactly that.
>>
>> I will also say that there are a lot of us out here who would have a use
>> for DRDB in our HA setups, but have held off implementing it specificly
>> because it's not yet in the upstream kernel.
>
> Actually, that's not a particularly strong reason because we already
> have an in-kernel replicator that has much of the functionality of drbd
> that you could use.  The main reason for wanting drbd in kernel is that
> it has a *current* user base.
>
> Both the in kernel md/nbd and drbd do sync and async replication with
> primary side bitmaps.  The main differences are:
>
>      * md/nbd can do 1 to N replication,
>      * drbd can do active/active replication (useful for cluster
>        filesystems)
>      * The chunk size of the md/nbd is tunable
>      * With the updated nbd-tools, current md/nbd can do point in time
>        rollback on transaction logged secondaries (a BCS requirement)
>      * drbd manages the mirror state explicitly, md/nbd needs a user
>        space helper
>
> And probably a few others I forget.

one very big one:

DRBD has better support for dealing with split-brain situations and 
recovering from them.

David Lang

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03 14:36           ` david
@ 2009-05-03 14:45             ` James Bottomley
  2009-05-03 14:56               ` david
  2009-05-04  8:28               ` Philipp Reisner
  0 siblings, 2 replies; 90+ messages in thread
From: James Bottomley @ 2009-05-03 14:45 UTC (permalink / raw)
  To: david
  Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
> On Sun, 3 May 2009, James Bottomley wrote:
> 
> > Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> > 
> > On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
> >> On Sun, 3 May 2009, Willy Tarreau wrote:
> >>
> >>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> >>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> >>>> <akpm@linux-foundation.org> wrote:
> >>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
> >>>>>
> >>>>>> This is a repost of DRBD
> >>>>>
> >>>>> Is it being used anywhere for anything?  If so, where and what?
> >>>>
> >>>> One popular application is to run iSCSI and HA software on top of DRBD
> >>>> in order to build a highly available iSCSI storage target.
> >>>
> >>> Confirmed, I have several customers who're doing exactly that.
> >>
> >> I will also say that there are a lot of us out here who would have a use
> >> for DRDB in our HA setups, but have held off implementing it specificly
> >> because it's not yet in the upstream kernel.
> >
> > Actually, that's not a particularly strong reason because we already
> > have an in-kernel replicator that has much of the functionality of drbd
> > that you could use.  The main reason for wanting drbd in kernel is that
> > it has a *current* user base.
> >
> > Both the in kernel md/nbd and drbd do sync and async replication with
> > primary side bitmaps.  The main differences are:
> >
> >      * md/nbd can do 1 to N replication,
> >      * drbd can do active/active replication (useful for cluster
> >        filesystems)
> >      * The chunk size of the md/nbd is tunable
> >      * With the updated nbd-tools, current md/nbd can do point in time
> >        rollback on transaction logged secondaries (a BCS requirement)
> >      * drbd manages the mirror state explicitly, md/nbd needs a user
> >        space helper
> >
> > And probably a few others I forget.
> 
> one very big one:
> 
> DRDB has better support for dealing with split brain situations and 
> recovering from them.

I don't really think so.  The decision about which node (if any) should
be killed lies with the HA harness, outside the province of the
replication layer.

One could argue that the symmetric active mode of drbd allows both nodes
to continue rather than having the harness make a kill decision about
one.  However, if they both alter the same data, you get an
irreconcilable data corruption fault which, one can argue, is directly
counter to HA principles, and so allowing drbd to continue is arguably
the wrong thing to do.

James



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03 14:45             ` James Bottomley
@ 2009-05-03 14:56               ` david
  2009-05-03 15:09                 ` James Bottomley
  2009-05-04  8:28               ` Philipp Reisner
  1 sibling, 1 reply; 90+ messages in thread
From: david @ 2009-05-03 14:56 UTC (permalink / raw)
  To: James Bottomley
  Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Sun, 3 May 2009, James Bottomley wrote:

> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> 
> On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
>> On Sun, 3 May 2009, James Bottomley wrote:
>>
>>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
>>>
>>> On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
>>>> On Sun, 3 May 2009, Willy Tarreau wrote:
>>>>
>>>>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
>>>>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
>>>>>> <akpm@linux-foundation.org> wrote:
>>>>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
>>>>>>>
>>>>>>>> This is a repost of DRBD
>>>>>>>
>>>>>>> Is it being used anywhere for anything?  If so, where and what?
>>>>>>
>>>>>> One popular application is to run iSCSI and HA software on top of DRBD
>>>>>> in order to build a highly available iSCSI storage target.
>>>>>
>>>>> Confirmed, I have several customers who're doing exactly that.
>>>>
>>>> I will also say that there are a lot of us out here who would have a use
>>>> for DRDB in our HA setups, but have held off implementing it specificly
>>>> because it's not yet in the upstream kernel.
>>>
>>> Actually, that's not a particularly strong reason because we already
>>> have an in-kernel replicator that has much of the functionality of drbd
>>> that you could use.  The main reason for wanting drbd in kernel is that
>>> it has a *current* user base.
>>>
>>> Both the in kernel md/nbd and drbd do sync and async replication with
>>> primary side bitmaps.  The main differences are:
>>>
>>>      * md/nbd can do 1 to N replication,
>>>      * drbd can do active/active replication (useful for cluster
>>>        filesystems)
>>>      * The chunk size of the md/nbd is tunable
>>>      * With the updated nbd-tools, current md/nbd can do point in time
>>>        rollback on transaction logged secondaries (a BCS requirement)
>>>      * drbd manages the mirror state explicitly, md/nbd needs a user
>>>        space helper
>>>
>>> And probably a few others I forget.
>>
>> one very big one:
>>
>> DRDB has better support for dealing with split brain situations and
>> recovering from them.
>
> I don't really think so.  The decision about which (or if a) node should
> be killed lies with the HA harness outside of the province of the
> replication.
>
> One could argue that the symmetric active mode of drbd allows both nodes
> to continue rather than having the harness make a kill decision about
> one.  However, if they both alter the same data, you get an
> irreconcilable data corruption fault which, one can argue, is directly
> counter to HA principles and so allowing drbd continuation is arguably
> the wrong thing to do.

but the issue is that at the time the failure is taking place, neither 
side _knows_ that the other side is running. In fact, they both think that 
the other side is dead.

With DRBD, when the two sides start talking again, they will discover that 
they are different and complain, loudly, to the sysadmin that they need 
help.

With md/nbd you have the situation where both sides will try to resync to 
the other side as soon as the packets can get through. This can end up 
corrupting both sides if it's not caught fast enough.

David Lang

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03 14:56               ` david
@ 2009-05-03 15:09                 ` James Bottomley
  2009-05-03 15:22                   ` david
  0 siblings, 1 reply; 90+ messages in thread
From: James Bottomley @ 2009-05-03 15:09 UTC (permalink / raw)
  To: david
  Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Sun, 2009-05-03 at 07:56 -0700, david@lang.hm wrote:
> On Sun, 3 May 2009, James Bottomley wrote:
> 
> > Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> > 
> > On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
> >> On Sun, 3 May 2009, James Bottomley wrote:
> >>
> >>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> >>>
> >>> On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
> >>>> On Sun, 3 May 2009, Willy Tarreau wrote:
> >>>>
> >>>>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> >>>>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> >>>>>> <akpm@linux-foundation.org> wrote:
> >>>>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
> >>>>>>>
> >>>>>>>> This is a repost of DRBD
> >>>>>>>
> >>>>>>> Is it being used anywhere for anything?  If so, where and what?
> >>>>>>
> >>>>>> One popular application is to run iSCSI and HA software on top of DRBD
> >>>>>> in order to build a highly available iSCSI storage target.
> >>>>>
> >>>>> Confirmed, I have several customers who're doing exactly that.
> >>>>
> >>>> I will also say that there are a lot of us out here who would have a use
> >>>> for DRDB in our HA setups, but have held off implementing it specificly
> >>>> because it's not yet in the upstream kernel.
> >>>
> >>> Actually, that's not a particularly strong reason because we already
> >>> have an in-kernel replicator that has much of the functionality of drbd
> >>> that you could use.  The main reason for wanting drbd in kernel is that
> >>> it has a *current* user base.
> >>>
> >>> Both the in kernel md/nbd and drbd do sync and async replication with
> >>> primary side bitmaps.  The main differences are:
> >>>
> >>>      * md/nbd can do 1 to N replication,
> >>>      * drbd can do active/active replication (useful for cluster
> >>>        filesystems)
> >>>      * The chunk size of the md/nbd is tunable
> >>>      * With the updated nbd-tools, current md/nbd can do point in time
> >>>        rollback on transaction logged secondaries (a BCS requirement)
> >>>      * drbd manages the mirror state explicitly, md/nbd needs a user
> >>>        space helper
> >>>
> >>> And probably a few others I forget.
> >>
> >> one very big one:
> >>
> >> DRDB has better support for dealing with split brain situations and
> >> recovering from them.
> >
> > I don't really think so.  The decision about which (or if a) node should
> > be killed lies with the HA harness outside of the province of the
> > replication.
> >
> > One could argue that the symmetric active mode of drbd allows both nodes
> > to continue rather than having the harness make a kill decision about
> > one.  However, if they both alter the same data, you get an
> > irreconcilable data corruption fault which, one can argue, is directly
> > counter to HA principles and so allowing drbd continuation is arguably
> > the wrong thing to do.
> 
> but the issue is that at the time the failure is taking place, neither 
> side _knows_ that the other side is running. In fact, they both think that 
> the other side is dead.

Resolving this is the job of the HA harness, as I said ... the usual
solution being either third node pings or confirmable switchover.

> with DRDB, when the two sides start talking again they will discover that 
> they are different and complain, loudly, to the sysadmin that they need 
> help

The object of HA is to prevent data becoming toast, not to point it out
to the sysadmin after the fact.

> with md/ndb you have the situation where both sides will try to resync to 
> the other side as soon as the packets can get through. this can end up 
> corrupting both sides if it's not caught fast enough

Actually, that's just your implementation: md/nbd does nothing to
re-establish the replication; that has to be done by the HA harness after
split-brain resolution.  What a correct harness would do is compare
the HA event log and the intent logs to see if there had been activity
on both sides after the loss of contact and, if there had, flag the data
corruption problem and not resume replication.

This corruption situation isn't unique to replication ... any time you
may potentially have allowed both sides to write to a data store, you
get it.  That's why it's the job of the HA harness to sort out whether a
split brain happened and what to do about it *first*.

James




^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03 15:09                 ` James Bottomley
@ 2009-05-03 15:22                   ` david
  2009-05-03 15:38                     ` James Bottomley
  0 siblings, 1 reply; 90+ messages in thread
From: david @ 2009-05-03 15:22 UTC (permalink / raw)
  To: James Bottomley
  Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Sun, 3 May 2009, James Bottomley wrote:

>> On Sun, 3 May 2009, James Bottomley wrote:
>>
>>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
>>>
>>> On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
>>>> On Sun, 3 May 2009, James Bottomley wrote:
>>>>
>>>>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
>>>>>
>>>>> On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
>>>>>> On Sun, 3 May 2009, Willy Tarreau wrote:
>>>>>>
>>>>>>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
>>>>>>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
>>>>>>>> <akpm@linux-foundation.org> wrote:
>>>>>>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
>>>>>>>>>
>>>>>>>>>> This is a repost of DRBD
>>>>>>>>>
>>>>>>>>> Is it being used anywhere for anything?  If so, where and what?
>>>>>>>>
>>>>>>>> One popular application is to run iSCSI and HA software on top of DRBD
>>>>>>>> in order to build a highly available iSCSI storage target.
>>>>>>>
>>>>>>> Confirmed, I have several customers who're doing exactly that.
>>>>>>
>>>>>> I will also say that there are a lot of us out here who would have a use
>>>>>> for DRDB in our HA setups, but have held off implementing it specificly
>>>>>> because it's not yet in the upstream kernel.
>>>>>
>>>>> Actually, that's not a particularly strong reason because we already
>>>>> have an in-kernel replicator that has much of the functionality of drbd
>>>>> that you could use.  The main reason for wanting drbd in kernel is that
>>>>> it has a *current* user base.
>>>>>
>>>>> Both the in kernel md/nbd and drbd do sync and async replication with
>>>>> primary side bitmaps.  The main differences are:
>>>>>
>>>>>      * md/nbd can do 1 to N replication,
>>>>>      * drbd can do active/active replication (useful for cluster
>>>>>        filesystems)
>>>>>      * The chunk size of the md/nbd is tunable
>>>>>      * With the updated nbd-tools, current md/nbd can do point in time
>>>>>        rollback on transaction logged secondaries (a BCS requirement)
>>>>>      * drbd manages the mirror state explicitly, md/nbd needs a user
>>>>>        space helper
>>>>>
>>>>> And probably a few others I forget.
>>>>
>>>> one very big one:
>>>>
>>>> DRDB has better support for dealing with split brain situations and
>>>> recovering from them.
>>>
>>> I don't really think so.  The decision about which (or if a) node should
>>> be killed lies with the HA harness outside of the province of the
>>> replication.
>>>
>>> One could argue that the symmetric active mode of drbd allows both nodes
>>> to continue rather than having the harness make a kill decision about
>>> one.  However, if they both alter the same data, you get an
>>> irreconcilable data corruption fault which, one can argue, is directly
>>> counter to HA principles and so allowing drbd continuation is arguably
>>> the wrong thing to do.
>>
>> but the issue is that at the time the failure is taking place, neither
>> side _knows_ that the other side is running. In fact, they both think that
>> the other side is dead.
>
> Resolving this is the job of the HA harness, as I said ... the usual
> solution being either third node pings or confirmable switchover.

and none of those solutions are failsafe in a distributed environment (in 
a local environment you can have a race to see which system powers off the 
other first to ensure that at most one is running, but you can't do that 
reliably remotely)

>> with DRDB, when the two sides start talking again they will discover that
>> they are different and complain, loudly, to the sysadmin that they need
>> help
>
> The object of HA is to prevent data becoming toast, not to point it out
> to the sysadmin after the fact.

it needs to do both

>> with md/ndb you have the situation where both sides will try to resync to
>> the other side as soon as the packets can get through. this can end up
>> corrupting both sides if it's not caught fast enough
>
> Actually, that's just your implementation: md/nbd does nothing to
> re-establish the replication, it has to be done by the HA harness after
> split brain resolution.  What a correct harness would do is to compare
> the HA event log and the intent logs to see if there had been activity
> to both sides after loss of contact and, if their had, to flag the data
> corruption problem and not resume replication.
>
> This corruption situation isn't unique to replication ... any time you
> may potentially have allowed both sides to write to a data store, you
> get it, that's why it's the job of the HA harness to sort out whether a
> split brain happened and what to do about it *first*.

but you can have packets sitting in the network buffers waiting to get to 
the remote machine; then once the connection is re-established those 
packets will go out. No remounting needed, just connectivity restored. 
(This isn't as bad as if the system tried to re-sync to the temporarily 
unavailable drive by itself, but it can still corrupt things.)

A cluster spread across different locations has problems to face that a 
cluster within easy cabling distance does not.

DRBD has been extensively tested and built to survive in the harsher 
environment. md/nbd is a reasonable approximation for the simple 
environment of two servers in one datacenter, but that doesn't mean that 
it handles the rest of the possible conditions.

David Lang

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03 15:22                   ` david
@ 2009-05-03 15:38                     ` James Bottomley
  2009-05-03 15:48                       ` david
  0 siblings, 1 reply; 90+ messages in thread
From: James Bottomley @ 2009-05-03 15:38 UTC (permalink / raw)
  To: david
  Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Sun, 2009-05-03 at 08:22 -0700, david@lang.hm wrote:
> On Sun, 3 May 2009, James Bottomley wrote:
> 
> >> On Sun, 3 May 2009, James Bottomley wrote:
> >>
> >>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> >>>
> >>> On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
> >>>> On Sun, 3 May 2009, James Bottomley wrote:
> >>>>
> >>>>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> >>>>>
> >>>>> On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
> >>>>>> On Sun, 3 May 2009, Willy Tarreau wrote:
> >>>>>>
> >>>>>>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> >>>>>>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> >>>>>>>> <akpm@linux-foundation.org> wrote:
> >>>>>>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> This is a repost of DRBD
> >>>>>>>>>
> >>>>>>>>> Is it being used anywhere for anything?  If so, where and what?
> >>>>>>>>
> >>>>>>>> One popular application is to run iSCSI and HA software on top of DRBD
> >>>>>>>> in order to build a highly available iSCSI storage target.
> >>>>>>>
> >>>>>>> Confirmed, I have several customers who're doing exactly that.
> >>>>>>
> >>>>>> I will also say that there are a lot of us out here who would have a use
> >>>>>> for DRDB in our HA setups, but have held off implementing it specificly
> >>>>>> because it's not yet in the upstream kernel.
> >>>>>
> >>>>> Actually, that's not a particularly strong reason because we already
> >>>>> have an in-kernel replicator that has much of the functionality of drbd
> >>>>> that you could use.  The main reason for wanting drbd in kernel is that
> >>>>> it has a *current* user base.
> >>>>>
> >>>>> Both the in kernel md/nbd and drbd do sync and async replication with
> >>>>> primary side bitmaps.  The main differences are:
> >>>>>
> >>>>>      * md/nbd can do 1 to N replication,
> >>>>>      * drbd can do active/active replication (useful for cluster
> >>>>>        filesystems)
> >>>>>      * The chunk size of the md/nbd is tunable
> >>>>>      * With the updated nbd-tools, current md/nbd can do point in time
> >>>>>        rollback on transaction logged secondaries (a BCS requirement)
> >>>>>      * drbd manages the mirror state explicitly, md/nbd needs a user
> >>>>>        space helper
> >>>>>
> >>>>> And probably a few others I forget.
> >>>>
> >>>> one very big one:
> >>>>
> >>>> DRDB has better support for dealing with split brain situations and
> >>>> recovering from them.
> >>>
> >>> I don't really think so.  The decision about which (or if a) node should
> >>> be killed lies with the HA harness outside of the province of the
> >>> replication.
> >>>
> >>> One could argue that the symmetric active mode of drbd allows both nodes
> >>> to continue rather than having the harness make a kill decision about
> >>> one.  However, if they both alter the same data, you get an
> >>> irreconcilable data corruption fault which, one can argue, is directly
> >>> counter to HA principles and so allowing drbd continuation is arguably
> >>> the wrong thing to do.
> >>
> >> but the issue is that at the time the failure is taking place, neither
> >> side _knows_ that the other side is running. In fact, they both think that
> >> the other side is dead.
> >
> > Resolving this is the job of the HA harness, as I said ... the usual
> > solution being either third node pings or confirmable switchover.
> 
> and none of those solutions are failsafe in a distributed environment (in 
> a local environment you can have a race to see which system powers off the 
> other first to ensure that at most one is running, but you can't do that 
> reliably remotely)

Um, yes they are, that's why they're used.

Do you understand how they work?

Third node ping means that there has to be an external third node acting
as a mediator (like a quorum device) ... usually in a third location.  A
surviving node has to make contact with it before failover can proceed
automatically (and the running node has to stay in contact to keep running).

Confirmable switchover is where the cluster detects the failure and
pages an admin to check on the remote site and confirm or deny the
switchover manually.  Without the confirmation it just waits.

Both of these mechanisms are robust to split brain.  By and large, most
enterprises I've seen go for confirmable switchover, but some do
implement third node ping.
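
In code form, the gate both mechanisms put in front of an automatic
takeover looks roughly like this (both variants folded into one function
for brevity; the names are made up, this is not any particular HA stack):

  /* illustrative failover gate: never promote on loss of peer contact alone */
  enum action { TAKE_OVER, KEEP_WAITING };

  static enum action on_peer_lost(int peer_reachable, int tiebreaker_reachable,
                                  int admin_confirmed)
  {
          if (peer_reachable)
                  return KEEP_WAITING;          /* not a failure at all      */
          if (tiebreaker_reachable)
                  return TAKE_OVER;             /* third node ping variant   */
          if (admin_confirmed)
                  return TAKE_OVER;             /* confirmable switchover    */
          return KEEP_WAITING;                  /* otherwise: just wait      */
  }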

> >> with DRDB, when the two sides start talking again they will discover that
> >> they are different and complain, loudly, to the sysadmin that they need
> >> help
> >
> > The object of HA is to prevent data becoming toast, not to point it out
> > to the sysadmin after the fact.
> 
> it needs to do both
> 
> >> with md/ndb you have the situation where both sides will try to resync to
> >> the other side as soon as the packets can get through. this can end up
> >> corrupting both sides if it's not caught fast enough
> >
> > Actually, that's just your implementation: md/nbd does nothing to
> > re-establish the replication, it has to be done by the HA harness after
> > split brain resolution.  What a correct harness would do is to compare
> > the HA event log and the intent logs to see if there had been activity
> > to both sides after loss of contact and, if their had, to flag the data
> > corruption problem and not resume replication.
> >
> > This corruption situation isn't unique to replication ... any time you
> > may potentially have allowed both sides to write to a data store, you
> > get it, that's why it's the job of the HA harness to sort out whether a
> > split brain happened and what to do about it *first*.
> 
> but you can have packets sitting in the network buffers waiting to get to 
> the remote machine, then once the connection is reestablished those 
> packets will go out. no remounting needed., just connectivity restored. 
> (this isn't as bad as if the system tries to re-sync to the temprarily 
> unavailable drive by itself, but it can still corrupt things)

This is an interesting thought, but not what happens.  As soon as the HA
harness stops replication, which it does at the instant failure is
detected, the closure of the socket kills all the in flight network
data.

There is a variant of this problem that occurs with device mapper
queue_if_no_path (on local disks) which does exactly what you say (keeps
unsaved data around in the queue forever), but that's fixed by not using
queue_if_no_path for HA.  Maybe that's what you were thinking of?

> a cluster spread across different locations has problems to face that a 
> cluster within easy cabling distance does not.
> 
> DRDB has been extensivly tested and build to survive in the harsher 
> environment.

There are commercial HA products based on md/nbd, so I'd say it's also
hardened for harsher environments.

>  md/ndb is a reasonable approximation for the simple 
> enviornment of two servers in one datacenter, but that doesn't mean that 
> it handles the rest of the possible conditions.

The implementations I've seen do ... and that includes some fairly
exotic cascading WAN replication ones.

James



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03 15:38                     ` James Bottomley
@ 2009-05-03 15:48                       ` david
  2009-05-03 16:02                         ` James Bottomley
  0 siblings, 1 reply; 90+ messages in thread
From: david @ 2009-05-03 15:48 UTC (permalink / raw)
  To: James Bottomley
  Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Sun, 3 May 2009, James Bottomley wrote:

> On Sun, 2009-05-03 at 08:22 -0700, david@lang.hm wrote:
>> On Sun, 3 May 2009, James Bottomley wrote:
>>
>>>> On Sun, 3 May 2009, James Bottomley wrote:
>>>>
>>>>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
>>>>>
>>>>> On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
>>>>>> On Sun, 3 May 2009, James Bottomley wrote:
>>>>>>
>>>>>>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
>>>>>>>
>>>>>>> On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
>>>>>>>> On Sun, 3 May 2009, Willy Tarreau wrote:
>>>>>>>>
>>>>>>>>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
>>>>>>>>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
>>>>>>>>>> <akpm@linux-foundation.org> wrote:
>>>>>>>>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> This is a repost of DRBD
>>>>>>>>>>>
>>>>>>>>>>> Is it being used anywhere for anything?  If so, where and what?
>>>>>>>>>>
>>>>>>>>>> One popular application is to run iSCSI and HA software on top of DRBD
>>>>>>>>>> in order to build a highly available iSCSI storage target.
>>>>>>>>>
>>>>>>>>> Confirmed, I have several customers who're doing exactly that.
>>>>>>>>
>>>>>>>> I will also say that there are a lot of us out here who would have a use
>>>>>>>> for DRDB in our HA setups, but have held off implementing it specificly
>>>>>>>> because it's not yet in the upstream kernel.
>>>>>>>
>>>>>>> Actually, that's not a particularly strong reason because we already
>>>>>>> have an in-kernel replicator that has much of the functionality of drbd
>>>>>>> that you could use.  The main reason for wanting drbd in kernel is that
>>>>>>> it has a *current* user base.
>>>>>>>
>>>>>>> Both the in kernel md/nbd and drbd do sync and async replication with
>>>>>>> primary side bitmaps.  The main differences are:
>>>>>>>
>>>>>>>      * md/nbd can do 1 to N replication,
>>>>>>>      * drbd can do active/active replication (useful for cluster
>>>>>>>        filesystems)
>>>>>>>      * The chunk size of the md/nbd is tunable
>>>>>>>      * With the updated nbd-tools, current md/nbd can do point in time
>>>>>>>        rollback on transaction logged secondaries (a BCS requirement)
>>>>>>>      * drbd manages the mirror state explicitly, md/nbd needs a user
>>>>>>>        space helper
>>>>>>>
>>>>>>> And probably a few others I forget.
>>>>>>
>>>>>> one very big one:
>>>>>>
>>>>>> DRDB has better support for dealing with split brain situations and
>>>>>> recovering from them.
>>>>>
>>>>> I don't really think so.  The decision about which (or if a) node should
>>>>> be killed lies with the HA harness outside of the province of the
>>>>> replication.
>>>>>
>>>>> One could argue that the symmetric active mode of drbd allows both nodes
>>>>> to continue rather than having the harness make a kill decision about
>>>>> one.  However, if they both alter the same data, you get an
>>>>> irreconcilable data corruption fault which, one can argue, is directly
>>>>> counter to HA principles and so allowing drbd continuation is arguably
>>>>> the wrong thing to do.
>>>>
>>>> but the issue is that at the time the failure is taking place, neither
>>>> side _knows_ that the other side is running. In fact, they both think that
>>>> the other side is dead.
>>>
>>> Resolving this is the job of the HA harness, as I said ... the usual
>>> solution being either third node pings or confirmable switchover.
>>
>> and none of those solutions are failsafe in a distributed environment (in
>> a local environment you can have a race to see which system powers off the
>> other first to ensure that at most one is running, but you can't do that
>> reliably remotely)
>
> Um, yes they are, that's why they're used.
>
> Do you understand how they work?
>
> Third node ping means that there has to be an external third node acting
> as mediator (like a quorum device) ... usually in a third location.  A
> node surviving has to make contact with it before failover can proceed
> automatically (the running node has to be in contact to keep running).

this is what I understood; there are many cases where this doesn't work
well.

> Confirmable switchover is where the cluster detects the failure and
> pages an admin to check on the remote and confirm or deny the switch
> over manually.  Without the confirmation it just waits.

this I did not understand

> Both of these mechanisms are robust to split brain.  By and large most
> enterprises I've seen go for confirmable switchover, but some do
> implement third node ping.

it depends on how much tolerance the business has for things to be down as 
a result of a problem with the third node (including communications to 
it), and how long they are willing to be down while waiting for a sysadmin 
to be paged.

>>> This corruption situation isn't unique to replication ... any time you
>>> may potentially have allowed both sides to write to a data store, you
>>> get it, that's why it's the job of the HA harness to sort out whether a
>>> split brain happened and what to do about it *first*.
>>
>> but you can have packets sitting in the network buffers waiting to get to
>> the remote machine, then once the connection is reestablished those
>> packets will go out. no remounting needed., just connectivity restored.
>> (this isn't as bad as if the system tries to re-sync to the temprarily
>> unavailable drive by itself, but it can still corrupt things)
>
> This is an interesting thought, but not what happens.  As soon as the HA
> harness stops replication, which it does at the instant failure is
> detected, the closure of the socket kills all the in flight network
> data.
>
> There is an variant of this problem that occurs with device mapper
> queue_if_no_path (on local disks) which does exactly what you say (keeps
> unsaved data around in the queue forever), but that's fixed by not using
> queue_if_no_path for HA.  Maybe that's what you were thinking of?

is there a mechanism in nbd that prevents it from being mounted more than 
once? if so, then it could have the same protection that DRBD has; if not, 
it is possible for it to be mounted in more than one place and therefore 
get corrupted.

>> a cluster spread across different locations has problems to face that a
>> cluster within easy cabling distance does not.
>>
>> DRDB has been extensivly tested and build to survive in the harsher
>> environment.
>
> There are commercial HA products based on md/nbd, so I'd say it's also
> hardened for harsher environments

which ones?

David Lang

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03 15:48                       ` david
@ 2009-05-03 16:02                         ` James Bottomley
  2009-05-03 16:13                           ` david
  0 siblings, 1 reply; 90+ messages in thread
From: James Bottomley @ 2009-05-03 16:02 UTC (permalink / raw)
  To: david
  Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Sun, 2009-05-03 at 08:48 -0700, david@lang.hm wrote:
> On Sun, 3 May 2009, James Bottomley wrote:
> 
> > On Sun, 2009-05-03 at 08:22 -0700, david@lang.hm wrote:
> >> On Sun, 3 May 2009, James Bottomley wrote:
> >>
> >>>> On Sun, 3 May 2009, James Bottomley wrote:
> >>>>
> >>>>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> >>>>>
> >>>>> On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
> >>>>>> On Sun, 3 May 2009, James Bottomley wrote:
> >>>>>>
> >>>>>>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> >>>>>>>
> >>>>>>> On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
> >>>>>>>> On Sun, 3 May 2009, Willy Tarreau wrote:
> >>>>>>>>
> >>>>>>>>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> >>>>>>>>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> >>>>>>>>>> <akpm@linux-foundation.org> wrote:
> >>>>>>>>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> This is a repost of DRBD
> >>>>>>>>>>>
> >>>>>>>>>>> Is it being used anywhere for anything?  If so, where and what?
> >>>>>>>>>>
> >>>>>>>>>> One popular application is to run iSCSI and HA software on top of DRBD
> >>>>>>>>>> in order to build a highly available iSCSI storage target.
> >>>>>>>>>
> >>>>>>>>> Confirmed, I have several customers who're doing exactly that.
> >>>>>>>>
> >>>>>>>> I will also say that there are a lot of us out here who would have a use
> >>>>>>>> for DRDB in our HA setups, but have held off implementing it specificly
> >>>>>>>> because it's not yet in the upstream kernel.
> >>>>>>>
> >>>>>>> Actually, that's not a particularly strong reason because we already
> >>>>>>> have an in-kernel replicator that has much of the functionality of drbd
> >>>>>>> that you could use.  The main reason for wanting drbd in kernel is that
> >>>>>>> it has a *current* user base.
> >>>>>>>
> >>>>>>> Both the in kernel md/nbd and drbd do sync and async replication with
> >>>>>>> primary side bitmaps.  The main differences are:
> >>>>>>>
> >>>>>>>      * md/nbd can do 1 to N replication,
> >>>>>>>      * drbd can do active/active replication (useful for cluster
> >>>>>>>        filesystems)
> >>>>>>>      * The chunk size of the md/nbd is tunable
> >>>>>>>      * With the updated nbd-tools, current md/nbd can do point in time
> >>>>>>>        rollback on transaction logged secondaries (a BCS requirement)
> >>>>>>>      * drbd manages the mirror state explicitly, md/nbd needs a user
> >>>>>>>        space helper
> >>>>>>>
> >>>>>>> And probably a few others I forget.
> >>>>>>
> >>>>>> one very big one:
> >>>>>>
> >>>>>> DRDB has better support for dealing with split brain situations and
> >>>>>> recovering from them.
> >>>>>
> >>>>> I don't really think so.  The decision about which (or if a) node should
> >>>>> be killed lies with the HA harness outside of the province of the
> >>>>> replication.
> >>>>>
> >>>>> One could argue that the symmetric active mode of drbd allows both nodes
> >>>>> to continue rather than having the harness make a kill decision about
> >>>>> one.  However, if they both alter the same data, you get an
> >>>>> irreconcilable data corruption fault which, one can argue, is directly
> >>>>> counter to HA principles and so allowing drbd continuation is arguably
> >>>>> the wrong thing to do.
> >>>>
> >>>> but the issue is that at the time the failure is taking place, neither
> >>>> side _knows_ that the other side is running. In fact, they both think that
> >>>> the other side is dead.
> >>>
> >>> Resolving this is the job of the HA harness, as I said ... the usual
> >>> solution being either third node pings or confirmable switchover.
> >>
> >> and none of those solutions are failsafe in a distributed environment (in
> >> a local environment you can have a race to see which system powers off the
> >> other first to ensure that at most one is running, but you can't do that
> >> reliably remotely)
> >
> > Um, yes they are, that's why they're used.
> >
> > Do you understand how they work?
> >
> > Third node ping means that there has to be an external third node acting
> > as mediator (like a quorum device) ... usually in a third location.  A
> > node surviving has to make contact with it before failover can proceed
> > automatically (the running node has to be in contact to keep running).
> 
> this is what I understood, there are many cases where this doesn't work 
> well

You mean there are situations where both can be down?  Sure, but a)
they're rare and b) it's still not a split brain.

> > Confirmable switchover is where the cluster detects the failure and
> > pages an admin to check on the remote and confirm or deny the switch
> > over manually.  Without the confirmation it just waits.
> 
> this I did not understand
> 
> > Both of these mechanisms are robust to split brain.  By and large most
> > enterprises I've seen go for confirmable switchover, but some do
> > implement third node ping.
> 
> it depends on how much tolerance teh business has for things to be down as 
> a result of a problem with the third node (including communications to 
> it) and how long they are willing to be down while waiting for a sysadmin 
> to be paged

Usually for geo disaster type situations, the recovery plans I've seen
actually *require* manual intervention (likely because they don't fully
trust their HA suppliers, of course ...)

> >>> This corruption situation isn't unique to replication ... any time you
> >>> may potentially have allowed both sides to write to a data store, you
> >>> get it, that's why it's the job of the HA harness to sort out whether a
> >>> split brain happened and what to do about it *first*.
> >>
> >> but you can have packets sitting in the network buffers waiting to get to
> >> the remote machine, then once the connection is reestablished those
> >> packets will go out. no remounting needed., just connectivity restored.
> >> (this isn't as bad as if the system tries to re-sync to the temprarily
> >> unavailable drive by itself, but it can still corrupt things)
> >
> > This is an interesting thought, but not what happens.  As soon as the HA
> > harness stops replication, which it does at the instant failure is
> > detected, the closure of the socket kills all the in flight network
> > data.
> >
> > There is an variant of this problem that occurs with device mapper
> > queue_if_no_path (on local disks) which does exactly what you say (keeps
> > unsaved data around in the queue forever), but that's fixed by not using
> > queue_if_no_path for HA.  Maybe that's what you were thinking of?
> 
> is there a mechanism in ndb that prevents it from beign mounted more than 
> once? if so then could have the same protection that DRDB has, if not it 
> is possible for it to be mounted more than once place and therefor get 
> corrupted.

That's not really relevant, is it?  An ordinary disk doesn't have this
property either.  Mediating simultaneous access is the job of the HA
harness.  If the device does it for you, fine, the harness can make use
of that (as long as the device gets it right) but all good HA harnesses
sort out the usual case where the device doesn't do it.

> >> a cluster spread across different locations has problems to face that a
> >> cluster within easy cabling distance does not.
> >>
> >> DRDB has been extensivly tested and build to survive in the harsher
> >> environment.
> >
> > There are commercial HA products based on md/nbd, so I'd say it's also
> > hardened for harsher environments
> 
> which ones?

SteelEye LifeKeeper.  It actually supports both drbd and md/nbd.

James




^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03 16:02                         ` James Bottomley
@ 2009-05-03 16:13                           ` david
  0 siblings, 0 replies; 90+ messages in thread
From: david @ 2009-05-03 16:13 UTC (permalink / raw)
  To: James Bottomley
  Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Sun, 3 May 2009, James Bottomley wrote:

> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> 
> On Sun, 2009-05-03 at 08:48 -0700, david@lang.hm wrote:
>> On Sun, 3 May 2009, James Bottomley wrote:
>>
>>> On Sun, 2009-05-03 at 08:22 -0700, david@lang.hm wrote:
>>>> On Sun, 3 May 2009, James Bottomley wrote:
>>>>
>
>>>>> This corruption situation isn't unique to replication ... any time you
>>>>> may potentially have allowed both sides to write to a data store, you
>>>>> get it, that's why it's the job of the HA harness to sort out whether a
>>>>> split brain happened and what to do about it *first*.
>>>>
>>>> but you can have packets sitting in the network buffers waiting to get to
>>>> the remote machine, then once the connection is reestablished those
>>>> packets will go out. no remounting needed., just connectivity restored.
>>>> (this isn't as bad as if the system tries to re-sync to the temprarily
>>>> unavailable drive by itself, but it can still corrupt things)
>>>
>>> This is an interesting thought, but not what happens.  As soon as the HA
>>> harness stops replication, which it does at the instant failure is
>>> detected, the closure of the socket kills all the in flight network
>>> data.
>>>
>>> There is an variant of this problem that occurs with device mapper
>>> queue_if_no_path (on local disks) which does exactly what you say (keeps
>>> unsaved data around in the queue forever), but that's fixed by not using
>>> queue_if_no_path for HA.  Maybe that's what you were thinking of?
>>
>> is there a mechanism in ndb that prevents it from beign mounted more than
>> once? if so then could have the same protection that DRDB has, if not it
>> is possible for it to be mounted more than once place and therefor get
>> corrupted.
>
> That's not really relevant, is it?  An ordinary disk doesn't have this
> property either.  Mediating simultaneous access is the job of the HA
> harness.  If the device does it for you, fine, the harness can make use
> of that (as long as the device gets it right) but all good HA harnesses
> sort out the usual case where the device doesn't do it.

with a local disk you can mount it multiple times, write to it from all 
the mounts, and not have any problems, because all access goes through a 
common layer.

you would have this sort of problem if you used one partition as part of 
multiple md arrays, but the md layer itself would detect and prevent this 
(because it would see both arrays); in a multi-machine situation, however, 
you don't have the common layer to do the detection.

you can rely on the HA layer to detect and prevent all of this (and 
apparently there are people doing this, I wasn't aware of it), but I've 
seen enough problems with every HA implementation I've dealt with over the 
years (both open source and commercial) that I would be very uncomfortable 
depending on this exclusively. having the disk replication layer detect 
this adds a significant amount of safety in my eyes.

>>> There are commercial HA products based on md/nbd, so I'd say it's also
>>> hardened for harsher environments
>>
>> which ones?
>
> SteelEye LifeKeeper.  It actually supports both drbd and md/nbd.

thanks for the info.

David Lang

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03 11:00     ` Neil Brown
@ 2009-05-03 21:32       ` Lars Ellenberg
  2009-05-04 16:12         ` Lars Marowsky-Bree
  2009-05-05 22:08         ` Lars Ellenberg
  0 siblings, 2 replies; 90+ messages in thread
From: Lars Ellenberg @ 2009-05-03 21:32 UTC (permalink / raw)
  To: Neil Brown
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche

On Sun, May 03, 2009 at 09:00:45PM +1000, Neil Brown wrote:
> On Sunday May 3, lars.ellenberg@linbit.com wrote:
> > > If there some strong technical reason to only allow 2 nodes?
> > 
> > It "just" has not yet been implemented.
> > I'm working on that, though.
> 
> :-)
> 
> > 
> > > >     How do you fit that into a RAID1+NBD model ? NBD is just a block
> > > >     transport, it does not offer the ability to exchange dirty bitmaps or
> > > >     data generation identifiers, nor does the RAID1 code has a concept of
> > > >     that.
> > > 
> > > Not 100% true, but I - at least partly -  get your point.
> > > As md stores bitmaps and data generation identifiers on the block
> > > device, these can be transferred over NBD just like any other data on
> > > the block device.
> > 
> > Do you have one dirty bitmap per mirror (yet) ?
> > Do you _merge_ them?
> 
> md doesn't merge bitmaps yet.  However if I found a need to, I would
> simple read a bitmap in userspace and feed it into the kernel via 
> 	/sys/block/mdX/md/md/bitmap_set_bits

ah, ok.  right.  that would do it.

> We sort-of have one bitmap per mirror, but only because the one bitmap
> is mirrored...

Which it could not be while the replication link is down.
So once the replication link is back (or the remote node is back,
which is not easily distinguishable at that point),
you'd need to fetch the remote bitmap, merge it with the local
bitmap (feeding it into bitmap_set_bits),
and then re-attach the "failed" mirror.

The reasoning in commit 9b1d1dac181d8c1b9492e05cee660a985d035a06,
which adds that feature, describes exactly this use case.

There, again, our simple run-length encoding scheme makes a lot of
sense, as the numbers dropping out of it during decoding are exactly
the run lengths, and could be fed into this almost directly.

> > the "NBD" mirrors are remote, and once you lose communication,
> > they may be (and in general, you have to assume they are) modified
> > by which ever node they are directly attached to.
> > 
> > > However I think that part of your point is that DRBD can transfer them
> > > more efficiently (e.g. it compresses the bitmap before transferring it
> > > -  I assume the compression you use is much more effective than gzip??
> > > else why both to code your own).
> > 
> > No, the point was that we have one bitmap per mirror (though currently
> > number of mirrors == 2, only), and that we do merge them.
> 
> Right.  I imagine much of the complexity of that could be handled in
> user-space while setting an a DRBD instance (??).

possibly.
you'd need to go through these steps on each and every communication loss
and network handshake.  I think that would make the system slower to
react to e.g. "flaky" replication links.

you are thinking in the "MD" paradigm: at any point in time, there is
only one MD instance involved; the mirror transports (currently dumb
block devices) simply do what they are told.

in DRBD, we have multiple (ok, two) instances talking to each other,
and I think that is the better approach for (remote) replication.

> > but to answer the question:
> > why bother to implement our own encoding?
> > because we know a lot about the data to be encoded.
> > 
> > the compression of the bitmap transfer we just added very recently.
> > for a bitmap, with large chunks of bits set or unset, it is efficient
> > to just code the runlength.
> > to use gzip in kernel would add yet an other huge overhead for code
> > tables and so on.
> > during testing of this encoding, applying it to an already gzip'ed file
> > was able to compress it even further, btw.
> > though on english plain text, gzip compression is _much_ more effective.
> 
> I just tried a little experiment.
> I created a 128meg file and randomly set 1000 bits in it.
> I compressed it with "gzip --best" and the result was 4Meg.  Not
> particularly impressive.
> I then tried to compress it wit bzip2 and got 3452 bytes.
> Now *that* is impressive.  I suspect your encoding might do a little
> better, but I wonder if it is worth the effort.

The effort is minimal.
The cpu overhead is negligible (compared with bzip2, or any other
generic compression scheme), and the memory overhead is next to none
(just a small scratch buffer, to assemble the network packet).
No tables or anything involved.
Especially the _decoding_ part has this nice property:
  chunk = 0;
  while (!eof) {
	vli_decode_bits(&rl, input); /* number of unset bits */
	chunk += rl;
	vli_decode_bits(&rl, input); /* number of set bits */
	bitmap_dirty_bits(bitmap, chunk, chunk + rl);
	chunk += rl;
  }

The source code is there.

For your example, on average you'd have (128 << 23) / 1000 "clear" bits,
then one set bit. The encoding transfers
"first bit unset -- ca. (1<<20), 1, ca. (1<<20), 1, ca. (1<<20), 1, ...",
using 2 bits for each "1", and up to 29 bits for each "ca. 1<<20".
That should be in the very same ballpark as your bzip2 result.
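Spelling that estimate out with the numbers from this thread (rough
arithmetic only):

  total bitmap size:           128 MB = (128 << 23)  ~  1.07e9 bits
  average clear run:           (128 << 23) / 1000    ~  2^20 bits
  per (clear run, set bit):    up to 29 + 2          =  31 bits encoded
  1000 such pairs:             ~ 31000 bits          ~  3.9 kB

which is indeed the same order of magnitude as the 3452 bytes bzip2
produced.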

> I'm not certain that my test file is entirely realistic, but it is
> still an interesting experiment.

It is not ;) but still...
If you are interested, I can dig up my throwaway userland code
that has been used to evaluate various such schemes.
But it is so ugly that I won't post it to lkml.

> Why do you do this compression in the kernel?  It seems to me that it
> would be quite practical to do it all in user-space, thus making it
> really easy to use pre-existing libraries.

Because the bitmap exchange happens in kernel.

If considering to rewrite a replication solution,
one can start to reconsider design choices.

But DRBD as of now does the connection handshake and bitmap exchange in
kernel.  We wanted to have a fast compression scheme suitable for
bitmaps, without cpu or memory overhead.  This does it quite nicely.

I can dig up my userland throwaway code used during the evaluation
of various encoding schemes again, if you are interested.

> BTW, the kernel already contains various compression code as part of
> the crypto API.

Of course I know.  But you are not really suggesting that I should do
bzip2 in kernel to exchange the bitmap. And on decoding, I want those
run lengths, not the actual plain bitmap.

> > > You say "nor does the RAID1 code has a concept of that".  It isn't
> > > clear what you are referring to.
> > 
> > The concept that one of the mirrors (the "nbd" one in that picture)
> > may have been accessed independently, without MD knowning,
> > because the node this MD (and its "local" mirror) was living on
> > suffered from power outage.

or the link has been down,
and the remote side decided to go active with it.

or the link has been taken down,
to activate the other side, knowingly creating a data set divergence,
to do some off-site processing.

> > The concept of both mirrors being modified _simultaneously_,
> > (e.g. living below a cluster file system).
> 
> Yes, that is an important concept.  Certainly one of the bits that
> would need to be added to md.
> 
> > > Whether the current DRBD code gets merged or not is possibly a
> > > separate question, though I would hope that if we followed the path of
> > > merging DRBD into md/raid1, then any duplicate code would eventually be
> > > excised from the kernel.
> > 
> > Rumor [http://lwn.net/Articles/326818/] has it, that the various in
> > kernel raid implementations are being unified right now, anyways?
> 
> I'm not holding my breath on that one...  
> I think that merging DRBD with md/raid1 would be significantly easier
> that any sort of merge between md and dm.  But (in either case) I'll
> do what I can to assist any effort that is technically sound.

Agreed.

> > If you want to stick to "replication is almost identical to RAID1",
> > best not to forget "this may be a remote mirror", there may be more than
> > one entity accessing it, this may be part of a bi-directional
> > (active-active) replication setup.
> > 
> > For further ideas on what could be done with replication (enhancing the
> > strict "raid1" notion), see also
> > http://www.drbd.org/fileadmin/drbd/publications/drbd9.linux-kongress.2008.pdf
> > 
> >  - time shift replication
> >  - generic point in time recovery of block device data
> >  - (remote) backup by periodically, round-robin re-sync of
> >    "raid" members, then "dropping" them again.
> >  ...
> > 
> > No useable code on those ideas, yet,
> > but a lot of thought. It is not all handwaving.
> 
> :-)
> 
> I'll have to do a bit of reading I see.  I'll then try to rough out a
> design and plan for merging DRBD functionality with md/raid1.  At the
> very least that would give me enough background understanding to be
> able to sensibly review your code submission.

Thanks.  Please give particular attention to the "taxonomy paper"
referenced therein, so that we end up using the same terms.

	Lars

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/16] DRBD: lru_cache
  2009-05-03 14:06         ` Kyle Moffett
@ 2009-05-03 22:48           ` Lars Ellenberg
  2009-05-04  0:48             ` Kyle Moffett
  0 siblings, 1 reply; 90+ messages in thread
From: Lars Ellenberg @ 2009-05-03 22:48 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH, Neil Brown,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Bart Van Assche

On Sun, May 03, 2009 at 10:06:58AM -0400, Kyle Moffett wrote:
> On Sun, May 3, 2009 at 2:27 AM, Lars Ellenberg
> <lars.ellenberg@linbit.com> wrote:
> > When we created our lru_cache stuff, we considered embedding callbacks
> > and internal locking, but decided against it.  Conceptually it should be
> > more like the "list.h" list handling infrastructure.
> >
> > The user will have their own locking in place anyways, and in general
> > their critical section will be a few lines of code larger than the
> > "lru cache" manipulation itself.
> 
> One of the major design-points for the code I'm fiddling with is that
> it allows you to use RCU on your lookup table, which basically means
> lock-free lookup (although I haven't stress-tested that part of it yet
> so the code itself may have subtle bugs).  With a bit more work it's
> probably even possible to make it lock-free even when adding a
> reference to an object that's currently on the LRU.

sounds good.

maybe it is just that it is late, and I don't yet see how well your code
fits what we are doing after adding just twenty-or-so lines of wrapper code.
let me sleep on it.

still, some comments below.

> > And, the specific use of our implementation is that there is a
> > pre-selected maximum count of in-use objects, and the user gets
> > feedback about changes to this "active" set of objects.
> 
> Another major design point (and the reason for the single "evict"
> callback) is that my code does not require manual tuning, it responds
> to memory-pressure dynamically using the "shrinker" mechanism.  So on
> a box with 128MB of RAM your LRU cache will be automatically
> size-limited by other activity on the system to an appropriate size;
> yet it can scale up to tens or hundreds of megabytes on a system with
> hundreds of gigs of RAM under heavy IO load.

I'm in the IO path.
I absolutely do not want some stupid excessive read_ahead setting
or an out-of-bounds badly hacked php cronjob to limit the amount
of write-out I can do when things get tight (because of that badly
behaved process, most likely).

And I do not want to let it grow to an arbitrary number of objects;
that is actually one important point of it: giving the admin a tuning knob
to trade resync time after a crash vs. frequency of meta-data updates.
I _want_ manual tuning. We are talking about (at most) a few thousand
small objects (<= 56 bytes each, even on a 64-bit arch), i.e. a few hundred kB in total.

I'm sure your approach has a number of valid uses.
I don't yet see it as a good fit for our purpose, though.
well, maybe it could replace the "inter part" of what we do.

> The real deal-breaker for your code is its usage of "vmalloc", it's
> unlikely to be merged when it relies on vmalloc() of a large
> continuous block for operation.

yes, I already got that part, thanks ;)
it is not a problem, it's only in there because we have been lazy.

it is easily changed, either by using kmalloc (the objects, and the
maximum number of objects are both small), or some array of pages.
maybe even using a kmem_cache, partly following your code,
if that makes you happy ;-)

still, the main purpose of your "lru_cache"
and what we do with our "lru_cache"
seem to me to be different.
maybe we should change our name to something more appropriate.

> > That is where the focus is:
> > make the set of active objects easily trackable.
> > So one can easily keep track of who is in, and who is not,
> > by writing a log of just this "diff":
> > seat index was occupied by element_nr A, but now is by element_nr B.
> 
> This could be very easily done with tracepoints and a few minor tweaks
> to the implementation I provided.  I could add an object number and
> various statistics similar to the ones in your code; the only reason I
> didn't before is I did not need them and they detracted slightly from
> the simplicity of the implementation (just 271 lines).
> 
> Keep in mind that by using the kmem_cache infrastructure, you get to
> take advantage of all of the other SLAB debugging features on objects
> allocated through your LRUs.

I'm not opposed to it. I just say we don't need it.
We never allocate through our LRUs.
It is all just reference counting, and changing "seat labels".
Changing seat labels is the important part of it, actually.

> > So from looking at your code, it may be fine for the "lru" part,
> > but it is not suitable for our purposes.
> 
> It would need an extra layer stacked on top to handle the hash-table
> lookups,

yes, we need a fast way to know "is this id currently in the active set?".
that is what we use it most often for.

for our use case, now using label for element_number,
the most important functions would be

lru_cache_is_label_present(),
lru_cache_get_by_label(),
which would need to say "EAGAIN" once the active set is full and all
entries are in fact in use, and also return the previous label that was
evicted (or "was free anyway") if successful.

lru_cache_try_get_by_label(),
which is just a combination of these two.
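A rough sketch of the signatures implied above, purely to pin down the
semantics (the names and types here are tentative, taken from this mail
rather than from any posted code):

  struct lru_cache;   /* fixed-size set of "seats" with changing labels */

  /* non-modifying query: is this label currently in the active set? */
  int lru_cache_is_label_present(struct lru_cache *lc, unsigned int label);

  /*
   * Claim a seat for 'label'.  Returns -EAGAIN if the active set is full
   * and every seat is in use; on success reports which label was evicted
   * from the reused seat (or that the seat was free anyway).
   */
  int lru_cache_get_by_label(struct lru_cache *lc, unsigned int label,
                             unsigned int *evicted_label);

  /* combination of the two above: check for the label and, if possible,
   * take a reference to it in one call */
  int lru_cache_try_get_by_label(struct lru_cache *lc, unsigned int label);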

>  but it would solve the vmalloc issue

I hear you ;)

> and allow your LRU lists to dynamically size a bit better.

We currently neither want nor need that.

> It's also not that difficult to apply memory allocation limits (aside
> from the default memory-pressure) and add additional statistics and
> debugging info.

Of course, nothing is difficult, if it has been done already ;)
We don't need memory allocation limits, we want a limit on the
number of active objects. Boils down to the same thing, but for
different reasons.

I dislike the embedded lock as well as the callback.
I see what you need it for.
But for our purposes, we probably would need to just set a
"return NO_PLEASE;". And we do need to embed calls to this stuff in larger
critical sections.  Which means the embedded lock can never have
contention; it would always use the fast path, ok.  But still: unnecessary.

We just happen to also have the "lru" part somewhere.  But actually it is
all about having a fixed number of "seats" with fixed indices and changing
labels: changing these "seat labels" while persistently journalling the
currently in-use labels.

but wait for the next post to see a better documented (or possibly
rewritten) implementation of this.

	Lars

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/16] DRBD: lru_cache
  2009-05-03 22:48           ` Lars Ellenberg
@ 2009-05-04  0:48             ` Kyle Moffett
  2009-05-04  1:01               ` Kyle Moffett
  0 siblings, 1 reply; 90+ messages in thread
From: Kyle Moffett @ 2009-05-04  0:48 UTC (permalink / raw)
  To: Lars Ellenberg
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH, Neil Brown,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Bart Van Assche

On Sun, May 3, 2009 at 6:48 PM, Lars Ellenberg
<lars.ellenberg@linbit.com> wrote:
> On Sun, May 03, 2009 at 10:06:58AM -0400, Kyle Moffett wrote:
>> On Sun, May 3, 2009 at 2:27 AM, Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
>>> And, the specific use of our implementation is that there is a
>>> pre-selected maximum count of in-use objects, and the user gets
>>> feedback about changes to this "active" set of objects.
>>
>> Another major design point (and the reason for the single "evict"
>> callback) is that my code does not require manual tuning, it responds
>> to memory-pressure dynamically using the "shrinker" mechanism.  So on
>> a box with 128MB of RAM your LRU cache will be automatically
>> size-limited by other activity on the system to an appropriate size;
>> yet it can scale up to tens or hundreds of megabytes on a system with
>> hundreds of gigs of RAM under heavy IO load.
>
> I'm in the IO path.
> I absolutely do not want some stupid excessive read_ahead setting
> or an out-of-bounds badly hacked php cronjob to limit the amount
> of write-out I can do when things get tight (because of that badly
> behaved process, most likely).

I completely understand this part, but I think these issues can be
satisfied by the code without limiting its use for other LRU purposes.
 You should also remember that some people *will* want to throttle
DRBD I/O at the expense of other forms of I/O, and memory pressure is
a *big* part of how that is managed.

There are a couple trivial tunables you can apply to the model I
provided to dramatically change the effect of memory pressure on the
LRU:

  (1)  The biggie:  Make sure that the process(es) using the
writeout-centric LRUs have PF_LESS_THROTTLE or similar so that they
are throttled less than other processes.  This is good advice
regardless.

  (2)  Change the "seeks" variable in the lru_cache_info structure.
That is forwarded to the "shrinker" code to determine how to weight
memory pressure on these objects.  Specifically, that number is
intended to represent a rough analogy of the number of seeks that it
takes to recreate an object purged from the LRU.  Critical LRUs can
have an artificially inflated "seeks" value to reflect desired
weighting.

  (3)  Add "nr_elem_min" and "nr_elem_max" counters which apply
minimum or maximum bounds for the LRU list, while still allowing it to
dynamically resize within some range.  An example where this is useful
is a large simulation program which writes out the results of its
computation to a DRBD array at the end.  You want to let memory
pressure from the simulation program push out other objects from the
LRU while it's running, then allow the LRU to grow large again as it
frees memory and submits I/O.

  (4)  Apply a "lru_scan_factor" which acts as a multiplier or divider
on the "nr_scan" in the lru_cache_shrink() function (as well as its
return value).  This will also cause a change in the weighting of the
shrinker code; if your LRU has 1000 objects in it with 2 seeks to
recreate each object, it will be asked to free much more than if you
claim it has 100 objects with 8 seeks per object.
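For item (1), a minimal fragment for the 2.6.30-era API (PF_LESS_THROTTLE
is the same flag nfsd sets on its writeout threads; where exactly a
replication driver would call this is of course up to its authors):

  #include <linux/sched.h>

  /* mark the calling (writeout) kernel thread as less throttleable, so
   * that memory-pressure throttling hits it later than ordinary tasks */
  static void mark_writeout_thread_less_throttled(void)
  {
          current->flags |= PF_LESS_THROTTLE;
  }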


> And I do not want to let it grow to an arbitrary number of objects,
> that is even one important point of it: giving the admin a tuning knob
> to trade resync-time after crash vs. frequency of meta-data updates.
> I _want_ manual tuning. We are talking about (at max) a few thousand
> small (<=56 bytes, even on 64bit arch), a few hundred kB.

Hmm, I'd be interested to see the numbers on this for large
many-terabyte volumes.  You should consider that most journalling
filesystems have to make similar tradeoffs; people get really
frustrated if things do not (A) have a useful automatic default and
(B) have a runtime-modifiable knob.  See the recent exceptionally long
ext3/ext4 threads on delayed allocation and writeback for some of the
flamewars that can result.


> still, the main purpose of your "lru_cache",
> and what we do with our "lru_cache",
> seems to me to be different.
> maybe we should change our name to something more appropriate.

I definitely agree that what you have isn't really properly named as
an "lru_cache".  It does make me curious, though, what precisely the
performance differences are (for DRBD specifically) between an
appropriately-tuned LRU cache and your fixed-size working set.


>> Keep in mind that by using the kmem_cache infrastructure, you get to
>> take advantage of all of the other SLAB debugging features on objects
>> allocated through your LRUs.
>
> I'm not oposed to it. I just say we don't need it.
> We never allocate through our LRUs.
> It is all just reference counting, and changing "seat lables".
> Changing seat lables is the important part of it, actually.

Well, technically what you're doing is using your LRU as a fixed-size
kmem_cache which returns the oldest object if there are no more free
slots.

>> > So from looking at your code, it may be fine for the "lru" part,
>> > but it is not suitable for our purposes.
>>
>> It would need an extra layer stacked on top to handle the hash-table
>> lookups,
>
> yes, we need a fast way to know "is this id currently in the active set?".
> that is what we use it most often for.
>
> for our use case, now using label for element_number,
> the most important functions would be
>
> lru_cache_is_label_present(),
> lru_cache_get_by_label(),
> which would need to say "EAGAIN", once the active set is full and all
> are in fact used.  and also return back the previous label now evicted,
> (or "was free anyways") if successful.
>
> lru_cache_try_get_by_label(),
> which is just a combination of these two.

I'll try to do a simple mostly-lockless hash-table lookup on top of
this, to show you what my thoughts are.


> but wait for the next post to see a better documented (or possibly
> rewritten) implementation of this.

Yeah, I'm definitely reworking it now that I have a better
understanding of what the DRBD code really wants.  My main intention
is to have the code be flexible enough that filesystems and other
sorts of network-related code can use it transparently, without
requiring much in the way of manual tuning.  See Linus' various
comments on why he *hates* manual tunables.

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/16] DRBD: lru_cache
  2009-05-04  0:48             ` Kyle Moffett
@ 2009-05-04  1:01               ` Kyle Moffett
  2009-05-04 16:12                 ` Rik van Riel
  0 siblings, 1 reply; 90+ messages in thread
From: Kyle Moffett @ 2009-05-04  1:01 UTC (permalink / raw)
  To: Lars Ellenberg
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH, Neil Brown,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Bart Van Assche

On Sun, May 3, 2009 at 8:48 PM, Kyle Moffett <kyle@moffetthome.net> wrote:
> There are a couple trivial tunables you can apply to the model I
> provided to dramatically change the effect of memory pressure on the
> LRU:
>
> [...]
>

Ooh, I forgot to mention another biggie:  There's a way to allocate a
reserve pool of memory (I don't remember the exact API, sorry), which
can be attached to a specific kmem_cache to be used by processes
attempting writeout.  This would allow you to allocate more in-use
elements to make forward progress, even if all of your existing
elements are already in-use.

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03 14:45             ` James Bottomley
  2009-05-03 14:56               ` david
@ 2009-05-04  8:28               ` Philipp Reisner
  2009-05-04 17:24                 ` James Bottomley
  1 sibling, 1 reply; 90+ messages in thread
From: Philipp Reisner @ 2009-05-04  8:28 UTC (permalink / raw)
  To: James Bottomley
  Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Sunday 03 May 2009 16:45:25 James Bottomley wrote:
> On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
> > On Sun, 3 May 2009, James Bottomley wrote:
> > > Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> > >
> > > On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
> > >> On Sun, 3 May 2009, Willy Tarreau wrote:
> > >>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> > >>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> > >>>>
> > >>>> <akpm@linux-foundation.org> wrote:
> > >>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner 
<philipp.reisner@linbit.com> wrote:
> > >>>>>> This is a repost of DRBD
> > >>>>>
> > >>>>> Is it being used anywhere for anything?  If so, where and what?
> > >>>>
> > >>>> One popular application is to run iSCSI and HA software on top of
> > >>>> DRBD in order to build a highly available iSCSI storage target.
> > >>>
> > >>> Confirmed, I have several customers who're doing exactly that.
> > >>
> > >> I will also say that there are a lot of us out here who would have a
> > >> use for DRDB in our HA setups, but have held off implementing it
> > >> specificly because it's not yet in the upstream kernel.
> > >
> > > Actually, that's not a particularly strong reason because we already
> > > have an in-kernel replicator that has much of the functionality of drbd
> > > that you could use.  The main reason for wanting drbd in kernel is that
> > > it has a *current* user base.
> > >
> > > Both the in kernel md/nbd and drbd do sync and async replication with
> > > primary side bitmaps.  The main differences are:
> > >
> > >      * md/nbd can do 1 to N replication,
> > >      * drbd can do active/active replication (useful for cluster
> > >        filesystems)
> > >      * The chunk size of the md/nbd is tunable
> > >      * With the updated nbd-tools, current md/nbd can do point in time
> > >        rollback on transaction logged secondaries (a BCS requirement)
> > >      * drbd manages the mirror state explicitly, md/nbd needs a user
> > >        space helper
> > >
> > > And probably a few others I forget.
> >
> > one very big one:
> >
> > DRDB has better support for dealing with split brain situations and
> > recovering from them.
>
> I don't really think so.  The decision about which (or if a) node should
> be killed lies with the HA harness outside of the province of the
> replication.
>
> One could argue that the symmetric active mode of drbd allows both nodes
> to continue rather than having the harness make a kill decision about
> one.  However, if they both alter the same data, you get an
> irreconcilable data corruption fault which, one can argue, is directly
> counter to HA principles and so allowing drbd continuation is arguably
> the wrong thing to do.
>

When you do asynchronous replication, how do you ensure that implicit
write-after-write dependencies in the stream of writes you get from
the file system above are not violated on the secondary?

There might be a disk scheduler on the secondary.

-Phil
-- 
: Dipl-Ing Philipp Reisner
: LINBIT | Your Way to High Availability
: Tel: +43-1-8178292-50, Fax: +43-1-8178292-82
: http://www.linbit.com

DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/16] DRBD: lru_cache
  2009-05-04  1:01               ` Kyle Moffett
@ 2009-05-04 16:12                 ` Rik van Riel
  2009-05-04 16:15                   ` Lars Ellenberg
  0 siblings, 1 reply; 90+ messages in thread
From: Rik van Riel @ 2009-05-04 16:12 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Lars Ellenberg, Philipp Reisner, linux-kernel, Jens Axboe,
	Greg KH, Neil Brown, James Bottomley, Sam Ravnborg, Dave Jones,
	Nikanth Karthikesan, Lars Marowsky-Bree, Nicholas A. Bellinger,
	Bart Van Assche

Kyle Moffett wrote:
> On Sun, May 3, 2009 at 8:48 PM, Kyle Moffett <kyle@moffetthome.net> wrote:
>> There are a couple trivial tunables you can apply to the model I
>> provided to dramatically change the effect of memory pressure on the
>> LRU:
>>
>> [...]
>>
> 
> Ooh, I forgot to mention another biggie:  There's a way to allocate a
> reserve pool of memory (I don't remember the exact API, sorry), which
> can be attached to a specific kmem_cache to be used by processes
> attempting writeout.  This would allow you to allocate more in-use
> elements to make forward progress, even if all of your existing
> elements are already in-use.

Lars,

is using a mempool for allocation, in combination with a
shrinker callback for freeing older entries, an option for
DRBD?

It looks like that could get rid of a fair amount of custom infrastructure.
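As a sketch of what that combination could look like with the 2.6.30-era
APIs (the drbd_al_ext structure, the cache/pool names and the empty
shrink logic below are placeholders, not actual DRBD code):

  #include <linux/init.h>
  #include <linux/errno.h>
  #include <linux/list.h>
  #include <linux/slab.h>
  #include <linux/mempool.h>
  #include <linux/mm.h>

  /* hypothetical per-extent object, a stand-in for the activity-log entries */
  struct drbd_al_ext {
          struct list_head lru;
          unsigned int label;
  };

  static struct kmem_cache *al_cache;
  static mempool_t *al_pool;

  /* called by the VM under memory pressure; nr_to_scan == 0 means
   * "just report how many entries could be freed" */
  static int al_shrink(int nr_to_scan, gfp_t gfp_mask)
  {
          if (nr_to_scan) {
                  /* walk the LRU tail here and free up to nr_to_scan
                   * entries that are not currently referenced */
          }
          /* report how many entries remain freeable (none in this stub) */
          return 0;
  }

  static struct shrinker al_shrinker = {
          .shrink = al_shrink,
          .seeks  = DEFAULT_SEEKS,
  };

  static int __init al_cache_init(void)
  {
          al_cache = kmem_cache_create("drbd_al_ext",
                                       sizeof(struct drbd_al_ext), 0, 0, NULL);
          if (!al_cache)
                  return -ENOMEM;
          /* keep a reserve of 16 objects so writeout can always make progress */
          al_pool = mempool_create_slab_pool(16, al_cache);
          if (!al_pool) {
                  kmem_cache_destroy(al_cache);
                  return -ENOMEM;
          }
          register_shrinker(&al_shrinker);
          return 0;
  }

Allocation would then go through mempool_alloc(al_pool, GFP_NOIO) instead
of a private fixed array, with mempool_free() on release.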

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03 21:32       ` Lars Ellenberg
@ 2009-05-04 16:12         ` Lars Marowsky-Bree
  2009-05-05 22:08         ` Lars Ellenberg
  1 sibling, 0 replies; 90+ messages in thread
From: Lars Marowsky-Bree @ 2009-05-04 16:12 UTC (permalink / raw)
  To: Lars Ellenberg, Neil Brown
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche

On 2009-05-03T23:32:31, Lars Ellenberg <lars.ellenberg@linbit.com> wrote:

> Which it could not be while replication link is down,
> so once replication link is back (or remote node is back,
> which is not easily distinguishable just there, blablabla),
> you'd need to fetch the remote bitmap, and merge it with the local
> bitmap (feeding it into bitmap_set_bits),
> then re-attach the "failed" mirror.

Note that this sacrifices transactional consistency on the sync target;
an understandable trade-off (versus recording the stream of writes
entirely, which consumes space and possibly more resync bandwidth), but
a noteworthy one.

> But DRBD as of now does the connection handshake and bitmap exchange in
> kernel.  We wanted to have a fast compression scheme suitable for
> bitmaps, without cpu or memory overhead.  This does it quite nicely.

Sharing the connection between meta- and regular data also avoids some
ordering issues between channels, which probably helps simplify some
aspects of drbd.

Conceivably, the kernel could escalate such metadata/out-of-band
communications to user-space for handling, and user-space would then
instruct the continuation of the stream processing.

> or the link has been down,
> and the remote side decided to go active with it.

That is arguably a horrible failure on the part of the cluster stack being
used, but indeed something drbd must be able to recover from.


Regards,
    Lars

-- 
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/16] DRBD: lru_cache
  2009-05-04 16:12                 ` Rik van Riel
@ 2009-05-04 16:15                   ` Lars Ellenberg
  0 siblings, 0 replies; 90+ messages in thread
From: Lars Ellenberg @ 2009-05-04 16:15 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Kyle Moffett, Philipp Reisner, linux-kernel, Jens Axboe, Greg KH,
	Neil Brown, James Bottomley, Sam Ravnborg, Dave Jones,
	Nikanth Karthikesan, Lars Marowsky-Bree, Nicholas A. Bellinger,
	Bart Van Assche

On Mon, May 04, 2009 at 12:12:07PM -0400, Rik van Riel wrote:
> Kyle Moffett wrote:
>> On Sun, May 3, 2009 at 8:48 PM, Kyle Moffett <kyle@moffetthome.net> wrote:
>>> There are a couple trivial tunables you can apply to the model I
>>> provided to dramatically change the effect of memory pressure on the
>>> LRU:
>>>
>>> [...]
>>>
>>
>> Ooh, I forgot to mention another biggie:  There's a way to allocate a
>> reserve pool of memory (I don't remember the exact API, sorry), which
>> can be attached to a specific kmem_cache to be used by processes
>> attempting writeout.  This would allow you to allocate more in-use
>> elements to make forward progress, even if all of your existing
>> elements are already in-use.
>
> Lars,
>
> is using a mempool for allocation, in combination with a
> shrinker callback for freeing older entries an option for
> DRBD?
>
> It looks like that could get rid of a fair amount of custom infrastructure.

I'm going to look into it.


	Lars

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-04  8:28               ` Philipp Reisner
@ 2009-05-04 17:24                 ` James Bottomley
  2009-05-05  8:21                   ` Philipp Reisner
  0 siblings, 1 reply; 90+ messages in thread
From: James Bottomley @ 2009-05-04 17:24 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Mon, 2009-05-04 at 10:28 +0200, Philipp Reisner wrote:
> On Sunday 03 May 2009 16:45:25 James Bottomley wrote:
> > On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
> > > On Sun, 3 May 2009, James Bottomley wrote:
> > > > Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> > > >
> > > > On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
> > > >> On Sun, 3 May 2009, Willy Tarreau wrote:
> > > >>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> > > >>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> > > >>>>
> > > >>>> <akpm@linux-foundation.org> wrote:
> > > >>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner 
> <philipp.reisner@linbit.com> wrote:
> > > >>>>>> This is a repost of DRBD
> > > >>>>>
> > > >>>>> Is it being used anywhere for anything?  If so, where and what?
> > > >>>>
> > > >>>> One popular application is to run iSCSI and HA software on top of
> > > >>>> DRBD in order to build a highly available iSCSI storage target.
> > > >>>
> > > >>> Confirmed, I have several customers who're doing exactly that.
> > > >>
> > > >> I will also say that there are a lot of us out here who would have a
> > > >> use for DRBD in our HA setups, but have held off implementing it
> > > >> specifically because it's not yet in the upstream kernel.
> > > >
> > > > Actually, that's not a particularly strong reason because we already
> > > > have an in-kernel replicator that has much of the functionality of drbd
> > > > that you could use.  The main reason for wanting drbd in kernel is that
> > > > it has a *current* user base.
> > > >
> > > > Both the in kernel md/nbd and drbd do sync and async replication with
> > > > primary side bitmaps.  The main differences are:
> > > >
> > > >      * md/nbd can do 1 to N replication,
> > > >      * drbd can do active/active replication (useful for cluster
> > > >        filesystems)
> > > >      * The chunk size of the md/nbd is tunable
> > > >      * With the updated nbd-tools, current md/nbd can do point in time
> > > >        rollback on transaction logged secondaries (a BCS requirement)
> > > >      * drbd manages the mirror state explicitly, md/nbd needs a user
> > > >        space helper
> > > >
> > > > And probably a few others I forget.
> > >
> > > one very big one:
> > >
> > > DRBD has better support for dealing with split brain situations and
> > > recovering from them.
> >
> > I don't really think so.  The decision about which (or if a) node should
> > be killed lies with the HA harness outside of the province of the
> > replication.
> >
> > One could argue that the symmetric active mode of drbd allows both nodes
> > to continue rather than having the harness make a kill decision about
> > one.  However, if they both alter the same data, you get an
> > irreconcilable data corruption fault which, one can argue, is directly
> > counter to HA principles and so allowing drbd continuation is arguably
> > the wrong thing to do.
> >
> 
> When you do asynchronous replication, how do you ensure that implicit
> write-after-write dependencies in the stream of writes you get from
> the file system above, are not violated on the secondary ?

Are you telling me drbd doesn't currently do this?

The way nbd does it (in the updated tools) is to use DIRECT_IO and
fsync.

> There might be a disk scheduler on the secondary.

There usually is a disk scheduler ... you just have to take the required
action to persuade it to preserve ordering ... a simplistic way of doing
this is to switch to the noop scheduler.

James



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-01 11:15   ` Lars Marowsky-Bree
  2009-05-01 13:14     ` Dave Jones
@ 2009-05-05  4:05     ` Christian Kujau
  1 sibling, 0 replies; 90+ messages in thread
From: Christian Kujau @ 2009-05-05  4:05 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: Andrew Morton, Philipp Reisner, LKML, Jens Axboe, Greg KH,
	Neil Brown, James Bottomley, Sam Ravnborg, Dave Jones,
	Nikanth Karthikesan, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche, Lars Ellenberg

On Fri, 1 May 2009, Lars Marowsky-Bree wrote:
> It is used by many customers (thousands world-wide, I'm sure) to
> replicate block device data locally (to replace more expensive SANs
> while achieving higher availability) or async/remotely (for disaster
> recovery).

While this page really covers Linux-HA success stories, most of them are
using DRBD as well: http://moin.linux-ha.org/lha/SuccessStories

C.
-- 
Bruce Schneier does not sleep. He preempts everything.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-04 17:24                 ` James Bottomley
@ 2009-05-05  8:21                   ` Philipp Reisner
  2009-05-05 14:09                     ` James Bottomley
  2009-05-05 15:03                     ` Bart Van Assche
  0 siblings, 2 replies; 90+ messages in thread
From: Philipp Reisner @ 2009-05-05  8:21 UTC (permalink / raw)
  To: James Bottomley
  Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Monday 04 May 2009 19:24:11 James Bottomley wrote:
> On Mon, 2009-05-04 at 10:28 +0200, Philipp Reisner wrote:
> > On Sunday 03 May 2009 16:45:25 James Bottomley wrote:
> > > On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
> > > > On Sun, 3 May 2009, James Bottomley wrote:
> > > > > Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> > > > >
> > > > > On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
> > > > >> On Sun, 3 May 2009, Willy Tarreau wrote:
> > > > >>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> > > > >>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> > > > >>>>
> > > > >>>> <akpm@linux-foundation.org> wrote:
> > > > >>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner
> >
> > <philipp.reisner@linbit.com> wrote:
> > > > >>>>>> This is a repost of DRBD
> > > > >>>>>
> > > > >>>>> Is it being used anywhere for anything?  If so, where and what?
> > > > >>>>
> > > > >>>> One popular application is to run iSCSI and HA software on top
> > > > >>>> of DRBD in order to build a highly available iSCSI storage
> > > > >>>> target.
> > > > >>>
> > > > >>> Confirmed, I have several customers who're doing exactly that.
> > > > >>
> > > > >> I will also say that there are a lot of us out here who would have
> > > > >> a use for DRBD in our HA setups, but have held off implementing it
> > > > >> specifically because it's not yet in the upstream kernel.
> > > > >
> > > > > Actually, that's not a particularly strong reason because we
> > > > > already have an in-kernel replicator that has much of the
> > > > > functionality of drbd that you could use.  The main reason for
> > > > > wanting drbd in kernel is that it has a *current* user base.
> > > > >
> > > > > Both the in kernel md/nbd and drbd do sync and async replication
> > > > > with primary side bitmaps.  The main differences are:
> > > > >
> > > > >      * md/nbd can do 1 to N replication,
> > > > >      * drbd can do active/active replication (useful for cluster
> > > > >        filesystems)
> > > > >      * The chunk size of the md/nbd is tunable
> > > > >      * With the updated nbd-tools, current md/nbd can do point in
> > > > > time rollback on transaction logged secondaries (a BCS requirement)
> > > > > * drbd manages the mirror state explicitly, md/nbd needs a user
> > > > > space helper
> > > > >
> > > > > And probably a few others I forget.
> > > >
> > > > one very big one:
> > > >
> > > > DRBD has better support for dealing with split brain situations and
> > > > recovering from them.
> > >
> > > I don't really think so.  The decision about which (or if a) node
> > > should be killed lies with the HA harness outside of the province of
> > > the replication.
> > >
> > > One could argue that the symmetric active mode of drbd allows both
> > > nodes to continue rather than having the harness make a kill decision
> > > about one.  However, if they both alter the same data, you get an
> > > irreconcilable data corruption fault which, one can argue, is directly
> > > counter to HA principles and so allowing drbd continuation is arguably
> > > the wrong thing to do.
> >
> > When you do asynchronous replication, how do you ensure that implicit
> > write-after-write dependencies in the stream of writes you get from
> > the file system above, are not violated on the secondary ?
>
> Are you telling me drbd doesn't currently do this?
>

No, I am not. DRBD does exactly this!
But I am wondering how that is achieved in the MD/NBD stack when running 
in async mode.

The issue has been covered in DRBD since the early days (back in 2000).
The issue, and the solution we have in DRBD, are described in this paper:

http://www.drbd.org/fileadmin/drbd/publications/drbd_paper_for_NLUUG_2001.pdf

> The way nbd does it (in the updated tools is to use DIRECT_IO and
> fsync).

Is that available in the existing tools ? -- Are the updated tools
something that will be available in the future ?

Are you telling me md/nbd (async) doesn't currently do this ?

> > There might be a disk scheduler on the secondary.
>
> There usually is a disk scheduler ... you just have to take the required
> action to persuade it to preserve ordering ... a simplistic way of doing
> this is to switch to the noop scheduler.

The issue actually goes further down the stack. Not only might the in-kernel
disk scheduler reorder something; the driver and finally the drive itself
might do so as well.

What we have in DRBD boils down to:

* We obey all possible write-after-write dependencies in the stream of
  writes we get from the upper layers, and generate DRBD-internal
  reorder barriers for the packet stream.
* On the secondary node we impose these barriers onto the stream of writes
  submitted to the stack below us by one of the following:

   - Let previously submitted write-IO drain before we submit write-IO after
     such a DRBD barrier. (We have had that since 2000 or so.)

   - Additionally issue a blkdev_issue_flush()

   - Use write requests with BIO_RW_BARRIER. This method has two advantages:
     We can continue to submit writes after the DRBD internal barrier
     immediately, and the number of requests with BIO_RW_BARRIER can be
     further reduced.
     See section 6 of
     http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
     for more details, and nice illustrations.

     Unfortunately only high-end SAN devices seem to benefit from this
     method. For most in-machine disk controllers this method does not
     achieve the highest throughput.

Expressed in other words:
We allow reordering on the secondary node only to the extent that we can
still guarantee that no implicit write-after-write dependencies are violated.

Coming back to the idea of disabling the Linux IO scheduler: it might
solve the issue for some devices, but it is not guaranteed to solve it.
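
For illustration, here is a condensed sketch of those three options. This is
not the DRBD source; struct repl_dev and its field names are invented, and it
assumes roughly the 2.6.30-era block layer interfaces named above
(blkdev_issue_flush(), BIO_RW_BARRIER).

#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/fs.h>
#include <linux/wait.h>

enum wo_method { WO_DRAIN, WO_FLUSH, WO_BIO_BARRIER };

struct repl_dev {				/* invented container */
	struct block_device *backing_bdev;
	wait_queue_head_t io_wait;	/* woken when in_flight drops to 0 */
	atomic_t in_flight;		/* writes of the current epoch */
	enum wo_method wo;
	int next_write_is_barrier;
};

/* Called when a DRBD-style reorder barrier arrives in the packet stream,
 * before any write that follows it is submitted to the local disk. */
static void impose_write_ordering(struct repl_dev *d)
{
	switch (d->wo) {
	case WO_DRAIN:
		/* let previously submitted write-IO drain */
		wait_event(d->io_wait, atomic_read(&d->in_flight) == 0);
		break;
	case WO_FLUSH:
		/* drain, then additionally flush volatile caches below us */
		wait_event(d->io_wait, atomic_read(&d->in_flight) == 0);
		blkdev_issue_flush(d->backing_bdev, NULL);
		break;
	case WO_BIO_BARRIER:
		/* tag the next write with BIO_RW_BARRIER; later writes can
		 * be submitted immediately, no need to stall */
		d->next_write_is_barrier = 1;
		break;
	}
}

static void submit_replicated_write(struct repl_dev *d, struct bio *bio)
{
	int rw = WRITE;

	if (d->next_write_is_barrier) {
		rw |= (1 << BIO_RW_BARRIER);
		d->next_write_is_barrier = 0;
	}
	atomic_inc(&d->in_flight);
	submit_bio(rw, bio);	/* the completion handler is expected to
				 * decrement in_flight and wake io_wait */
}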

-Phil
-- 
: Dipl-Ing Philipp Reisner
: LINBIT | Your Way to High Availability
: Tel: +43-1-8178292-50, Fax: +43-1-8178292-82
: http://www.linbit.com

DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-05  8:21                   ` Philipp Reisner
@ 2009-05-05 14:09                     ` James Bottomley
  2009-05-05 15:56                       ` Philipp Reisner
  2009-05-05 15:03                     ` Bart Van Assche
  1 sibling, 1 reply; 90+ messages in thread
From: James Bottomley @ 2009-05-05 14:09 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Tue, 2009-05-05 at 10:21 +0200, Philipp Reisner wrote:
> > > When you do asynchronous replication, how do you ensure that implicit
> > > write-after-write dependencies in the stream of writes you get from
> > > the file system above, are not violated on the secondary ?
> >
> > Are you telling me drbd doesn't currently do this?
> >
> 
> No I am not. DRBD does exactly this!
> But I am wondering how that is achieved in the MD/NBD stack when running 
> in async mode.

The explanation is below.

> The issue is covered since the early days in DRBD, (back in 2000).
> The issue, and the solution we have in DRBD is described in this paper:
> 
> http://www.drbd.org/fileadmin/drbd/publications/drbd_paper_for_NLUUG_2001.pdf
> 
> > The way nbd does it (in the updated tools is to use DIRECT_IO and
> > fsync).
> 
> Is that available in the existing tools ? -- Are the updated tools
> something that will be available in the future ?

It's in the existing tools.

> Are you telling me md/nbd (async) doesn't currently do this ?

I just described how it does this ... I don't quite see how that
translates into telling you it doesn't do this.

> > > There might be a disk scheduler on the secondary.
> >
> > There usually is a disk scheduler ... you just have to take the required
> > action to persuade it to preserve ordering ... a simplistic way of doing
> > this is to switch to the noop scheduler.
> 
> The issue actually goes further down the stack. Not only the in kernel
> disk scheduler might reorder something, also the driver and finally the
> drive might do so.
> 
> What we have in DRBD boils down to:
> 
> * We obey all possible write after write dependencies in the stream of
>   writes we get from the upper layers. And generate DRBD internal
>   reorder barriers for the packet stream.
> * On the secondary node we impose these barriers onto the stream of writes
>   submitted to the stack below us by either:
> 
>    - Let previously submitted write-IO drain before we submit write-IO after
>      such an DRBD barrier. (That we have since 2000 or so)
> 
>    - Additionally issue a blkdev_issue_flush()
> 
>    - Use write requests with BIO_RW_BARRIER. This method has two advantages:
>      We can continue to submit writes after the DRBD internal barrier
>      immediately, and the number of requests with BIO_RW_BARRIER can be
>      further reduced. 
>      See section 6 of
>      http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
>      for more details, and nice illustrations.

There's a slight error in there ... we don't use ordered tags for
barriers (yet).  I don't think it will really matter, because the main
domain of ordering problems is the scheduler, which REQ_BARRIER does
cope with; it just means the queue drains for a barrier.

>      Unfortunately only high end SAN devices seem to benefit from this
>      method. For most in-machine-disk controlers this method does not
>      achieve the highest throughput.
> 
> Expressed in other words: 
> We allow reordering on the secondary node to an extend so that we can
> guarantee that no implicit write-after-write dependencies are violated.
> 
> Coming back to the idea of disabling the in Linux IO scheduler. It might
> solve the issue for some devices, but it does not guarantee to solve it.

I think you'll find the dio/fsync method above actually does solve all
of these issues (mainly because it enforces the semantics from top to
bottom in the stack).  I agree one could use more elaborate semantics
like you do for drbd, but since the simple ones worked efficiently for
md/nbd, there didn't seem to be much point.
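
A minimal userspace sketch of that dio/fsync approach (the wire format and
all names are invented here, not the actual nbd-tools code): each replicated
write is applied in arrival order and made stable before the next one is
processed.

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct wire_hdr {		/* invented record header */
	uint64_t offset;
	uint32_t len;
};

static int read_full(int fd, void *buf, size_t len)
{
	while (len) {
		ssize_t r = read(fd, buf, len);
		if (r <= 0)
			return -1;
		buf = (char *)buf + r;
		len -= r;
	}
	return 0;
}

int main(int argc, char **argv)
{
	struct wire_hdr h;
	char *buf;
	int disk;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <backing-device>\n", argv[0]);
		return 1;
	}
	/* O_DIRECT bypasses the page cache; the fsync() below is the
	 * belt-and-braces ordering point for the non-O_DIRECT case.
	 * Real code must keep offset, length and buffer aligned for
	 * O_DIRECT; this sketch assumes the sender already does that. */
	disk = open(argv[1], O_WRONLY | O_DIRECT);
	if (disk < 0) {
		perror("open");
		return 1;
	}
	if (posix_memalign((void **)&buf, 4096, 1 << 20))
		return 1;	/* out of memory */

	while (read_full(0, &h, sizeof(h)) == 0) {	/* stream on stdin */
		if (h.len > (1 << 20) || read_full(0, buf, h.len))
			break;
		if (pwrite(disk, buf, h.len, (off_t)h.offset) != (ssize_t)h.len)
			break;
		if (fsync(disk))	/* stable before the next write */
			break;
	}
	free(buf);
	close(disk);
	return 0;
}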

James



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-05  8:21                   ` Philipp Reisner
  2009-05-05 14:09                     ` James Bottomley
@ 2009-05-05 15:03                     ` Bart Van Assche
  2009-05-05 15:57                       ` Philipp Reisner
  1 sibling, 1 reply; 90+ messages in thread
From: Bart Van Assche @ 2009-05-05 15:03 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: James Bottomley, david, Willy Tarreau, Andrew Morton,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Tue, May 5, 2009 at 10:21 AM, Philipp Reisner
<philipp.reisner@linbit.com> wrote:
> What we have in DRBD boils down to:
>
> * We obey all possible write after write dependencies in the stream of
>  writes we get from the upper layers. And generate DRBD internal
>  reorder barriers for the packet stream.

Hello Philipp,

I couldn't find a call to blk_queue_ordered() in the DRBD 8.3.1 source
code. This made me wonder how DRBD obtains information about barriers
that are generated by filesystems like ext3 with the option barrier=1 ?

Bart.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-05 14:09                     ` James Bottomley
@ 2009-05-05 15:56                       ` Philipp Reisner
  2009-05-05 17:05                         ` James Bottomley
  0 siblings, 1 reply; 90+ messages in thread
From: Philipp Reisner @ 2009-05-05 15:56 UTC (permalink / raw)
  To: James Bottomley
  Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Tuesday 05 May 2009 16:09:45 James Bottomley wrote:
> On Tue, 2009-05-05 at 10:21 +0200, Philipp Reisner wrote:
> > > > When you do asynchronous replication, how do you ensure that implicit
> > > > write-after-write dependencies in the stream of writes you get from
> > > > the file system above, are not violated on the secondary ?
> > >
[...]
> > > The way nbd does it (in the updated tools is to use DIRECT_IO and
> > > fsync).
> >
[...]
> I think you'll find the dio/fsync method above actually does solve all
> of these issues (mainly because it enforces the semantics from top to
> bottom in the stack).  I agree one could use more elaborate semantics
> like you do for drbd, but since the simple ones worked efficiently for
> md/nbd, there didn't seem to be much point.
>

Do I get it right that you enforce the exact same write order on the
secondary node as the stream of writes was coming in on the primary?

Using either DIRECT_IO or fsync() calls ?

Is DIRECT_IO/fsync() enabled by default ?

-Phil
-- 
: Dipl-Ing Philipp Reisner
: LINBIT | Your Way to High Availability
: Tel: +43-1-8178292-50, Fax: +43-1-8178292-82
: http://www.linbit.com

DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-05 15:03                     ` Bart Van Assche
@ 2009-05-05 15:57                       ` Philipp Reisner
  2009-05-05 17:38                         ` Lars Marowsky-Bree
  0 siblings, 1 reply; 90+ messages in thread
From: Philipp Reisner @ 2009-05-05 15:57 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: James Bottomley, david, Willy Tarreau, Andrew Morton,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Tuesday 05 May 2009 17:03:13 Bart Van Assche wrote:
> On Tue, May 5, 2009 at 10:21 AM, Philipp Reisner
>
> <philipp.reisner@linbit.com> wrote:
> > What we have in DRBD boils down to:
> >
> > * We obey all possible write after write dependencies in the stream of
> >  writes we get from the upper layers. And generate DRBD internal
> >  reorder barriers for the packet stream.
>
> Hello Philipp,
>
> I couldn't find a call to blk_queue_ordered() in the DRBD 8.3.1 source
> code. This made me wonder how DRBD obtains information about barriers
> that is generated by filesystems like ext3 with the option barrier=1 ?
>

Hi Bart,

I was referring to implicit write-after-write dependencies that one
needs to obey when doing asynchronous replication.

Up to now we do not offer barrier support for the layers above us.
That will follow sooner or later.

Here is an example, why it is not completely trivial:

  Imagine DRBD on top of a dm-linear on both nodes. When you start,
  both dm-linear mappings sit on top of something that supports 
  barriers itself. -- Then the user replaces the backing device
  below the dm-linear on the secondary node with something that
  does not support barriers.

  When we get a write request with the BIO_RW_BARRIER flag set
  from the FS, we submit it locally, ship it over to the
  peer and submit it there. Unfortunately it now fails with
  ENOTSUP on the peer.

  We can not ship that error back to the upper layer, because
  our mirror is already inconsistent. We have to resubmit
  it with BIO_RW_BARRIER cleared, and use other means to enforce
  write ordering...  Then tell the other node that we prefer
  to no longer accept BIO_RW_BARRIER etc...
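
A sketch of that fallback, purely for illustration (this is not the DRBD
source; struct peer_req and the helper functions are invented, and it assumes
the 2.6.30-era bi_end_io() signature and the bio_barrier()/BIO_RW_BARRIER
interface):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/errno.h>

struct peer_req;				/* invented container */

/* invented helpers -- resubmission must happen from process context */
void requeue_without_barrier(struct peer_req *req);
void switch_peer_to_drain_or_flush(struct peer_req *req);
void complete_peer_write(struct peer_req *req, int error);

static void peer_write_endio(struct bio *bio, int error)
{
	struct peer_req *req = bio->bi_private;

	if (error == -EOPNOTSUPP && bio_barrier(bio)) {
		/* Do NOT report this upwards: the primary already has the
		 * write, so an error here would leave the mirror
		 * inconsistent.  Clear the barrier bit, retry, and fall
		 * back to drain/flush ordering from now on. */
		bio->bi_rw &= ~(1UL << BIO_RW_BARRIER);
		requeue_without_barrier(req);
		switch_peer_to_drain_or_flush(req);
		return;
	}
	complete_peer_write(req, error);
}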

-Phil
-- 
: Dipl-Ing Philipp Reisner
: LINBIT | Your Way to High Availability
: Tel: +43-1-8178292-50, Fax: +43-1-8178292-82
: http://www.linbit.com

DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-05 15:56                       ` Philipp Reisner
@ 2009-05-05 17:05                         ` James Bottomley
  2009-05-05 21:45                           ` Philipp Reisner
  0 siblings, 1 reply; 90+ messages in thread
From: James Bottomley @ 2009-05-05 17:05 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Tue, 2009-05-05 at 17:56 +0200, Philipp Reisner wrote:
> On Tuesday 05 May 2009 16:09:45 James Bottomley wrote:
> > On Tue, 2009-05-05 at 10:21 +0200, Philipp Reisner wrote:
> > > > > When you do asynchronous replication, how do you ensure that implicit
> > > > > write-after-write dependencies in the stream of writes you get from
> > > > > the file system above, are not violated on the secondary ?
> > > >
> [...]
> > > > The way nbd does it (in the updated tools is to use DIRECT_IO and
> > > > fsync).
> > >
> [...]
> > I think you'll find the dio/fsync method above actually does solve all
> > of these issues (mainly because it enforces the semantics from top to
> > bottom in the stack).  I agree one could use more elaborate semantics
> > like you do for drbd, but since the simple ones worked efficiently for
> > md/nbd, there didn't seem to be much point.
> >
> 
> Do I get it right, that you enforce the exact same write order on the 
> secondary node as the stream of writes was comming in on the primary?

Um, yes ... that's the textbook way of doing replication: write-order
preservation.

> Using either DIRECT_IO or fsync() calls ?

Yes.

> Is DIRECT_IO/fsync() enabled by default ?

I'd have to look at the tools (and, unfortunately, there are many
variants) but it was certainly true in the variant I used.  However, the
current main use case of md/nbd is a secondary transaction log to allow
rollback anyway, so the incoming network stream is stored on the device
in write order and the problem doesn't arise.

I also think you're not quite looking at the important case: if you
think about it, the real necessity for the ordered domain is the
network, not so much the actual secondary server.  The reason is that
it's very hard to find a failure case where the write order on the
secondary from the network tap to disk actually matters (as long as the
flight into the network tap was in order).  The standard failure is of
the primary, not the secondary, so the network stream stops and so does
the secondary writing: as long as we guarantee to stop at a consistent
point in flight, everything works.  If the secondary fails while the
primary is still up, that's just a standard replay to bring the
secondary back into replication, so the issue doesn't arise there
either.

The case where it does matter is failure of the primary followed by
instantaneous failure of the secondary before the actual network stream
completes, so guaranteeing that the secondary can be brought back up
consistently.  However, this is an incredibly rare failure scenario
given the tight race timings.

James



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-05 15:57                       ` Philipp Reisner
@ 2009-05-05 17:38                         ` Lars Marowsky-Bree
  0 siblings, 0 replies; 90+ messages in thread
From: Lars Marowsky-Bree @ 2009-05-05 17:38 UTC (permalink / raw)
  To: Philipp Reisner, Bart Van Assche
  Cc: James Bottomley, david, Willy Tarreau, Andrew Morton,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Kyle Moffett, Lars Ellenberg

On 2009-05-05T17:57:15, Philipp Reisner <philipp.reisner@linbit.com> wrote:

> Up to now we do not offer barrier support for the layers above us.
> That will follow sooner or later.
> 
> Here is an example, why it is not completely trivial:
> 
>   Imagine DRBD on top of a dm-linear on both nodes. When you start,
>   both dm-linear mappings sit on top of something that supports 
>   barriers itself. -- Then the user replaces the backing device
>   below the dm-linear on the secondary node with something that
>   does not support barriers.

The same problem exists essentially for md raid1 as well, and I'd not
consider it objectionable if you took a brutal approach:

>   When we get a write request with the BIO_RW_BARRIER flag set
>   in from the FS, we submit this locally, ship it over to the
>   peer and submit it there. Unfortunately it fails now with
>   ENOTSUP on the peer. 
> 
>   We can not ship that error back to the upper layer, because
>   our mirror is already inconsistent.

Disconnect the secondary with a loud error as to why (incompatible
change of the device below). (Re-)negotiate barrier capability at
connect time; then, resync.
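
A minimal sketch of such a (re-)negotiation, assuming an invented feature
word rather than the actual DRBD wire protocol: each side advertises what its
current backing device supports, the connection uses the intersection, and a
later ENOTSUP simply forces a disconnect and a fresh handshake.

#include <stdint.h>

#define FEAT_BIO_BARRIER  (1u << 0)	/* backing device honours barriers */
#define FEAT_DISK_FLUSH   (1u << 1)	/* backing device honours flushes  */

struct handshake {		/* invented on-wire feature word */
	uint32_t features;
};

static uint32_t negotiate(uint32_t mine, uint32_t peers)
{
	/* only use what both backing devices currently support */
	return mine & peers;
}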


Regards,
    Lars

-- 
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 04/16] DRBD: bitmap
  2009-05-03  5:21             ` Neil Brown
  2009-05-03  7:38               ` Lars Ellenberg
@ 2009-05-05 17:48               ` Lars Marowsky-Bree
  2009-05-05 17:51                 ` James Bottomley
  2009-05-05 22:26                 ` Neil Brown
  1 sibling, 2 replies; 90+ messages in thread
From: Lars Marowsky-Bree @ 2009-05-05 17:48 UTC (permalink / raw)
  To: Neil Brown, Lars Ellenberg
  Cc: James Bottomley, Philipp Reisner, linux-kernel, Jens Axboe,
	Greg KH, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche

On 2009-05-03T15:21:41, Neil Brown <neilb@suse.de> wrote:

> As I said, I don't immediately see the benefits of the activity log
> format, however,
>  1/ I am happy to listen to its benefits being explained
>  2/ If we were to agree that merging DRBD functionality into md
>    (for which there isn't a concrete proposal, but the suggestion
>     seems to be floating around) were a good thing, I don't have any
>     problem with supporting an activity log in md in the name of
>     compatibility.

So, let's take a step back here.

All of this is extremely beneficial discussion to be had. As some of you
are (painfully, sometimes ;-) aware, I'm a big fan of converging RAID
implementations/back-ends, and the goal is well received.

But this will take a while, and drbd, md, md/nbd, and even dm-raid1 all
have large existing user bases, and HA environments don't switch easily.
All are actively maintained.

Sharing more and more of the code strikes me as a mid-term goal, and
full convergence as a long-term one (alas).

What I think this argument has shown is that drbd's design is sound (even
if some choices, like that of the alternatives, are up for discussion),
similar to different file systems (of which we seem to have plenty
too).

I would suggest that at this time we refocus on the remaining
objections to merging drbd as a driver in the short term.

I think I've not read anything in the last 3-5 days which still would
rate as a reason for rejection or delay.

Did I miss something?


Regards,
    Lars

-- 
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 04/16] DRBD: bitmap
  2009-05-05 17:48               ` Lars Marowsky-Bree
@ 2009-05-05 17:51                 ` James Bottomley
  2009-05-05 22:26                 ` Neil Brown
  1 sibling, 0 replies; 90+ messages in thread
From: James Bottomley @ 2009-05-05 17:51 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: Neil Brown, Lars Ellenberg, Philipp Reisner, linux-kernel,
	Jens Axboe, Greg KH, Sam Ravnborg, Dave Jones,
	Nikanth Karthikesan, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche

On Tue, 2009-05-05 at 19:48 +0200, Lars Marowsky-Bree wrote:
> On 2009-05-03T15:21:41, Neil Brown <neilb@suse.de> wrote:
> 
> > As I said, I don't immediately see the benefits of the activity log
> > format, however,
> >  1/ I am happy to listen to its benefits being explained
> >  2/ If we were to agree that merging DRBD functionality into md
> >    (for which there isn't a concrete proposal, but the suggestion
> >     seems to be floating around) were a good thing, I don't have any
> >     problem with supporting an activity log in md in the name of
> >     compatibility.
> 
> So, let's take a step back here.
> 
> All of this is extremely beneficial discussion to be had. As some of you
> are (painfully, sometimes ;-) aware, I'm a big fan of converging RAID
> implementations/back-ends, and the goal is well received.
> 
> But this will take a while, and both drbd, md, md/nbd, or even dm-raid1
> have large existing user bases, and HA environments don't switch easily.
> All are actively maintained.
> 
> Sharing more and more of the code strikes me as a mid-term goal, and
> full converges as a long-term one (alas).
> 
> What I think this argument has shown that drbd's design is sound (even
> if some choices, like that of the alternatives, are up for discussion),
> similar to different file systems (of which we seem to have plenty
> too).
> 
> I would suggest at this time, we may want to refocus on the remaining
> objections to merging drbd as a driver in the short-term.
> 
> I think I've not read anything in the last 3-5 days which still would
> rate as a reason for rejection or delay.
> 
> Did I miss something?

No ... I'd agree with that.  drbd essentially qualifies as a driver
under our new merge rules, so we should be thinking about blockers to
getting it into the tree first (serious issues) and working out kinks
(like raid unification) after it gets in.

James



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-05 17:05                         ` James Bottomley
@ 2009-05-05 21:45                           ` Philipp Reisner
  2009-05-05 21:53                             ` James Bottomley
  0 siblings, 1 reply; 90+ messages in thread
From: Philipp Reisner @ 2009-05-05 21:45 UTC (permalink / raw)
  To: James Bottomley
  Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

Am Dienstag 05 Mai 2009 19:05:46 schrieb James Bottomley:
> On Tue, 2009-05-05 at 17:56 +0200, Philipp Reisner wrote:
> > On Tuesday 05 May 2009 16:09:45 James Bottomley wrote:
> > > On Tue, 2009-05-05 at 10:21 +0200, Philipp Reisner wrote:
> > > > > > When you do asynchronous replication, how do you ensure that
> > > > > > implicit write-after-write dependencies in the stream of writes
> > > > > > you get from the file system above, are not violated on the
> > > > > > secondary ?
> >
> > [...]
> >
> > > > > The way nbd does it (in the updated tools is to use DIRECT_IO and
> > > > > fsync).
> >
> > [...]
> >
> > > I think you'll find the dio/fsync method above actually does solve all
> > > of these issues (mainly because it enforces the semantics from top to
> > > bottom in the stack).  I agree one could use more elaborate semantics
> > > like you do for drbd, but since the simple ones worked efficiently for
> > > md/nbd, there didn't seem to be much point.
> >
> > Do I get it right, that you enforce the exact same write order on the
> > secondary node as the stream of writes was comming in on the primary?
>
> Um, yes ... that's the text book way of doing replication: write order
> preservation.
>
> > Using either DIRECT_IO or fsync() calls ?
>
> Yes.
>
> > Is DIRECT_IO/fsync() enabled by default ?
>
> I'd have to look at the tools (and, unfortunately, there are many
> variants) but it was certainly true in the variant I used.

[...]

My experience is that enforcing the exact same write order as on the primary
by using IO draining kills performance. Of course things are changing in
a world where everybody uses a RAID controller with a gig of battery-backed
RAM, but there are for sure some embedded users that run the replication
technology on top of plain hard disks.

What I want to point out is that in DRBD we have the capability to allow
limited reordering on the secondary, to achieve the highest possible
performance, while maintaining these implicit write-after-write dependencies.

> I also think you're not quite looking at the important case: if you
> think about it, the real necessity for the ordered domain is the
> network, not so much the actual secondary server.  The reason is that
> it's very hard to find a failure case where the write order on the
> secondary from the network tap to disk actually matters (as long as the
> flight into the network tap was in order).  The standard failure is of
> the primary, not the secondary, so the network stream stops and so does
> the secondary writing: as long as we guarantee to stop at a consistent
> point in flight, everything works.  If the secondary fails while the
> primary is still up, that's just a standard replay to bring the
> secondary back into replication, so the issue doesn't arise there
> either.

A common power failure is possible. We aim for an HA system; we can
not ignore a possible failure scenario. No user will buy: "Well, in most
scenarios we do it correctly, but in the unlikely case of a common power
failure, where you also lose your former primary at the same time, you
might have a secondary with the last write but not the one write before
it!"

Correctness before efficiency!

But I will stop this discussion now. Proving that DRBD does some
details better than the md/nbd approach becomes pointless now that we have
agreed that DRBD can get merged as a driver. We will focus on the necessary
code cleanups.

-Phil


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-05 21:45                           ` Philipp Reisner
@ 2009-05-05 21:53                             ` James Bottomley
  2009-05-06  8:17                               ` Philipp Reisner
  0 siblings, 1 reply; 90+ messages in thread
From: James Bottomley @ 2009-05-05 21:53 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

On Tue, 2009-05-05 at 23:45 +0200, Philipp Reisner wrote:
> > I also think you're not quite looking at the important case: if you
> > think about it, the real necessity for the ordered domain is the
> > network, not so much the actual secondary server.  The reason is that
> > it's very hard to find a failure case where the write order on the
> > secondary from the network tap to disk actually matters (as long as the
> > flight into the network tap was in order).  The standard failure is of
> > the primary, not the secondary, so the network stream stops and so does
> > the secondary writing: as long as we guarantee to stop at a consistent
> > point in flight, everything works.  If the secondary fails while the
> > primary is still up, that's just a standard replay to bring the
> > secondary back into replication, so the issue doesn't arise there
> > either.
> 
> A common power failure is possible. We aim for an HA system, we can
> not ignore a possible failure scenario. No user will buy: Well in most
> scenarios we do it correctly, in the unlikely case of a common power
> failure, and you loose your former primary at the same time, you might
> have a secondary with the last write but not that one write before!
> 
> Correctness before efficiency!

Well, you have to agree that during a resync from the activity log,
which plays up the primary disk from one end to another, the secondary
is completely corrupt if a primary failure occurs before the resync
completes.  That's something that's triggered by a network outage, and
so is a far more common event than cascading dual failures.  It's all
really a question of where you focus your effort to eliminate the corner
cases.

> But I will now stop this discussion now. Proving that DRBD does some
> details better than the md/nbd approch gets pointless, when we agreed
> that DRBD can get merged as a driver. We will focus on the necessary
> code cleanups.

I agree.  Also HA is full of corner cases like this and opinion is
endlessly divided over which corner cases are more important than which
others.

James



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-03 21:32       ` Lars Ellenberg
  2009-05-04 16:12         ` Lars Marowsky-Bree
@ 2009-05-05 22:08         ` Lars Ellenberg
  1 sibling, 0 replies; 90+ messages in thread
From: Lars Ellenberg @ 2009-05-05 22:08 UTC (permalink / raw)
  To: Neil Brown
  Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH,
	James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche

[-- Attachment #1: Type: text/plain, Size: 2654 bytes --]

> > > but to answer the question:
> > > why bother to implement our own encoding?
> > > because we know a lot about the data to be encoded.
> > > 
> > > the compression of the bitmap transfer we just added very recently.
> > > for a bitmap, with large chunks of bits set or unset, it is efficient
> > > to just code the runlength.
> > > to use gzip in kernel would add yet another huge overhead for code
> > > tables and so on.
> > > during testing of this encoding, applying it to an already gzip'ed file
> > > was able to compress it even further, btw.
> > > though on english plain text, gzip compression is _much_ more effective.
> > 
> > I just tried a little experiment.
> > I created a 128meg file and randomly set 1000 bits in it.
> > I compressed it with "gzip --best" and the result was 4Meg.  Not
> > particularly impressive.
> > I then tried to compress it with bzip2 and got 3452 bytes.
> > Now *that* is impressive.  I suspect your encoding might do a little
> > better, but I wonder if it is worth the effort.
> 
> The effort is minimal.
> The cpu overhead is negligible (compared with bzip2, or any other
> generic compression scheme), and the memory overhead is next to none
> (just a small scratch buffer, to assemble the network packet).
> No tables or anything involved.
> Especially the _decoding_ part has this nice property:
>   chunk = 0;
>   while (!eof) {
> 	vli_decode_bits(&rl, input); /* number of unset bits */
> 	chunk += rl;
> 	vli_decode_bits(&rl, input); /* number of set bits */
> 	bitmap_dirty_bits(bitmap, chunk, chunk + rl);
> 	chunk += rl;
>  }
> 
> The source code is there.
> 
> For your example, on average you'd have (128 << 23) / 1000 "clear" bits,
> then one set bit. The encoding transfers
> "first bit unset -- ca. (1<<20), 1, ca. (1<<20), 1, ca. (1<<20), 1, ...",
> using 2 bits for the "1", and up to 29 bit for the "ca. 1<<20".
> should be in the very same ballpark as your bzip2 result.
> 
> > I'm not certain that my test file is entirely realistic, but it is
> > still an interesting experiment.
> 
> It is not ;) but still...
> If you are interested, I can dig up my throw-away userland code,
> that has been used to evaluate various such schemes.
> But it is so ugly that I won't post it to lkml.

I found roughly ten different versions of that throw-away code.
Oh well.  So I just hacked up another one.

For your entertainment, prepare some example bitmaps.  From all my
real-world example bitmaps I can see that (at least with 4KiB bitmap
granularity) areas with alternating single bits (which is the only run
length that does not compress) are rare.

Comments welcome.  Have fun.

	Lars

[-- Attachment #2: vli_bitstream_demo.c --]
[-- Type: text/x-csrc, Size: 28391 bytes --]

/* vim: set foldmethod=marker foldlevel=1 foldenable :
 *
 * Copyright 2009 Lars Ellenberg <lars@linbit.com>
 * Licence: GPL
 *
 * Purpose: demonstrate the simple, but efficient (for bitmaps, anyways),
 * encoding (usually: compression) used to exchange the DRBD bitmap.
 * Note: DRBD transmits incompressible chunks as plain text.
 * This demo does not, to better show the properties of the encoding method.
 * See also the comments just above the "struct code_chunk"
 *
 * More than half of this file is (almost) verbatim copy
 * from other .c and .h files, so you won't need the extra files,
 * like the generic find_next_bit.c or the drbd_vli.h.
 *
 * Tested on i686 and x86_64 Debian.
 * Might have issues on other archs, though I think it should not.
 *
 * For USAGE,
 * see show_usage_and_die() below. */

#include <sys/mman.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <linux/types.h>
#include <errno.h>
#include <endian.h>
#include <byteswap.h>

/* gcc -g -O3 -Wall -o vli_bitstream_demo vli_bitstream_demo.c */
/* for easy debugging with gdb, do
 *  gdb --args ./x 3< IN 4> OUT
 *  then set in_fd and out_fd to 3 and 4,
 *  and "run".
 */
static int in_fd = 0;  /* 0: stdin */
static int out_fd = 1; /* 1: stdout */

/* (almost) verbatim copied files from elsewhere {{{2 */

/* find bit helpers from linux kernel tree {{{3 */

/* taken from arch/x86/include/asm/bitops.h {{{4 */
static inline unsigned long __ffs(unsigned long word)
{
	asm("bsf %1,%0"
		: "=r" (word)
		: "rm" (word));
	return word;
}

static inline unsigned long ffz(unsigned long word)
{
	asm("bsf %1,%0"
		: "=r" (word)
		: "r" (~word));
	return word;
}

/* taken from lib/find_next_bit.c {{{4 */

/* find_next_bit.c: fallback find next bit implementation
 *
 * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
 * Written by David Howells (dhowells@redhat.com)
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 */

#define BITS_PER_LONG		(sizeof(long)*8)
#define BITOP_WORD(nr)		((nr) / BITS_PER_LONG)

/*
 * Find the next set bit in a memory region.
 */
unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
			    unsigned long offset)
{
	const unsigned long *p = addr + BITOP_WORD(offset);
	unsigned long result = offset & ~(BITS_PER_LONG-1);
	unsigned long tmp;

	if (offset >= size)
		return size;
	size -= result;
	offset %= BITS_PER_LONG;
	if (offset) {
		tmp = *(p++);
		tmp &= (~0UL << offset);
		if (size < BITS_PER_LONG)
			goto found_first;
		if (tmp)
			goto found_middle;
		size -= BITS_PER_LONG;
		result += BITS_PER_LONG;
	}
	while (size & ~(BITS_PER_LONG-1)) {
		if ((tmp = *(p++)))
			goto found_middle;
		result += BITS_PER_LONG;
		size -= BITS_PER_LONG;
	}
	if (!size)
		return result;
	tmp = *p;

found_first:
	tmp &= (~0UL >> (BITS_PER_LONG - size));
	if (tmp == 0UL)		/* Are any bits set? */
		return result + size;	/* Nope. */
found_middle:
	return result + __ffs(tmp);
}

/*
 * This implementation of find_{first,next}_zero_bit was stolen from
 * Linus' asm-alpha/bitops.h.
 */
unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
				 unsigned long offset)
{
	const unsigned long *p = addr + BITOP_WORD(offset);
	unsigned long result = offset & ~(BITS_PER_LONG-1);
	unsigned long tmp;

	if (offset >= size)
		return size;
	size -= result;
	offset %= BITS_PER_LONG;
	if (offset) {
		tmp = *(p++);
		tmp |= ~0UL >> (BITS_PER_LONG - offset);
		if (size < BITS_PER_LONG)
			goto found_first;
		if (~tmp)
			goto found_middle;
		size -= BITS_PER_LONG;
		result += BITS_PER_LONG;
	}
	while (size & ~(BITS_PER_LONG-1)) {
		if (~(tmp = *(p++)))
			goto found_middle;
		result += BITS_PER_LONG;
		size -= BITS_PER_LONG;
	}
	if (!size)
		return result;
	tmp = *p;

found_first:
	tmp |= ~0UL << size;
	if (tmp == ~0UL)	/* Are any bits zero? */
		return result + size;	/* Nope. */
found_middle:
	return result + ffz(tmp);
}

/*
 * Find the first set bit in a memory region.
 */
unsigned long find_first_bit(const unsigned long *addr, unsigned long size)
{
	const unsigned long *p = addr;
	unsigned long result = 0;
	unsigned long tmp;

	while (size & ~(BITS_PER_LONG-1)) {
		if ((tmp = *(p++)))
			goto found;
		result += BITS_PER_LONG;
		size -= BITS_PER_LONG;
	}
	if (!size)
		return result;

	tmp = (*p) & (~0UL >> (BITS_PER_LONG - size));
	if (tmp == 0UL)		/* Are any bits set? */
		return result + size;	/* Nope. */
found:
	return result + __ffs(tmp);
}

/*
 * Find the first cleared bit in a memory region.
 */
unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size)
{
	const unsigned long *p = addr;
	unsigned long result = 0;
	unsigned long tmp;

	while (size & ~(BITS_PER_LONG-1)) {
		if (~(tmp = *(p++)))
			goto found;
		result += BITS_PER_LONG;
		size -= BITS_PER_LONG;
	}
	if (!size)
		return result;

	tmp = (*p) | (~0UL << size);
	if (tmp == ~0UL)	/* Are any bits zero? */
		return result + size;	/* Nope. */
found:
	return result + ffz(tmp);
}

/* end of find_next_bit.c }}}1 */

/* the VLI code implementation from drbd_vli.h {{{3 */

#define u64 __u64
#define u8 __u8
#define BUG() abort()

#if __BYTE_ORDER == __LITTLE_ENDIAN
#define le64_to_cpu(x)	(x)
#elif __BYTE_ORDER == __BIG_ENDIAN
#define le64_to_cpu(x)	bswap_64(x)
#else
#error "endian?"
#endif

/*
 * At a granularity of 4KiB storage represented per bit,
 * and storage sizes of several TiB,
 * and possibly small-bandwidth replication,
 * the bitmap transfer time can take much too long,
 * if transmitted in plain text.
 *
 * We try to reduce the transferred bitmap information
 * by encoding runlengths of bit polarity.
 *
 * We never actually need to encode a "zero" (runlengths are positive).
 * But then we have to store the value of the first bit.
 * The first bit of information thus shall encode if the first runlength
 * gives the number of set or unset bits.
 *
 * We assume that large areas are either completely set or unset,
 * which gives good compression with any runlength method,
 * even when encoding the runlength as fixed size 32bit/64bit integers.
 *
 * Still, there may be areas where the polarity flips every few bits,
 * and encoding the runlength sequence of those areas with fix size
 * integers would be much worse than plaintext.
 *
 * We want to encode small runlength values with minimum code length,
 * while still being able to encode a Huge run of all zeros efficiently.
 *
 * Thus we need a Variable Length Integer encoding, VLI.
 *
 * For some cases, we produce more code bits than plaintext input.
 * We need to send incompressible chunks as plaintext, skip over them
 * and then see if the next chunk compresses better.
 *
 * We don't care too much about "excellent" compression ratio for large
 * runlengths (all set/all clear): whether we achieve a factor of 100
 * or 1000 is not that much of an issue.
 * We do not want to waste too much on short runlengths in the "noisy"
 * parts of the bitmap, though.
 *
 * There are endless variants of VLI, we experimented with:
 *  * simple byte-based
 *  * various bit based with different code word length.
 *
 * To avoid yet another configuration parameter (choice of bitmap compression
 * algorithm) which was difficult to explain and tune, we just chose the one
 * variant that turned out best in all test cases.
 * Based on real-world usage patterns, with device sizes ranging from a few GiB
 * to several TiB, file server/mailserver/webserver/mysql/postgres,
 * mostly idle to really busy, the all-time winner (though sometimes only
 * marginally better) is:
 */

/*
 * encoding is "visualised" as
 * __little endian__ bitstream, least significant bit first (left most)
 *
 * this particular encoding is chosen so that the prefix code
 * starts as unary encoding the level, then modified so that
 * 10 levels can be described in 8bit, with minimal overhead
 * for the smaller levels.
 *
 * The number of data bits follows the Fibonacci sequence, with the exception of
 * the last level (+1 data bit, so it makes 64bit total).  The only code worse
 * than plaintext when encoding bit polarity runlengths is 1 plain bit => 2 code bits.
prefix    data bits                                    max val  Nº data bits
0 x                                                         0x2            1
10 x                                                        0x4            1
110 xx                                                      0x8            2
1110 xxx                                                   0x10            3
11110 xxx xx                                               0x30            5
111110 xx xxxxxx                                          0x130            8
11111100  xxxxxxxx xxxxx                                 0x2130           13
11111110  xxxxxxxx xxxxxxxx xxxxx                      0x202130           21
11111101  xxxxxxxx xxxxxxxx xxxxxxxx  xxxxxxxx xx   0x400202130           34
11111111  xxxxxxxx xxxxxxxx xxxxxxxx  xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx 56
 * maximum encodable value: 0x100000400202130 == 2**56 + some */

/* compression "table":
 transmitted   x                                0.29
 as plaintext x                                  ........................
             x                                   ........................
            x                                    ........................
           x    0.59                         0.21........................
          x      ........................................................
         x       .. c ...................................................
        x    0.44.. o ...................................................
       x .......... d ...................................................
      x  .......... e ...................................................
     X.............   ...................................................
    x.............. b ...................................................
2.0x............... i ...................................................
 #X................ t ...................................................
 #................. s ...........................  plain bits  ..........
-+-----------------------------------------------------------------------
 1             16              32                              64
*/

/* LEVEL: (total bits, prefix bits, prefix value),
 * sorted ascending by number of total bits.
 * The rest of the code table is calculated at compiletime from this. */

/* fibonacci data 1, 1, ... */
#define VLI_L_1_1() do { \
	LEVEL( 2, 1, 0x00); \
	LEVEL( 3, 2, 0x01); \
	LEVEL( 5, 3, 0x03); \
	LEVEL( 7, 4, 0x07); \
	LEVEL(10, 5, 0x0f); \
	LEVEL(14, 6, 0x1f); \
	LEVEL(21, 8, 0x3f); \
	LEVEL(29, 8, 0x7f); \
	LEVEL(42, 8, 0xbf); \
	LEVEL(64, 8, 0xff); \
	} while (0)

/* finds a suitable level to decode the least significant part of in.
 * returns number of bits consumed.
 *
 * BUG() for bad input, as that would mean a buggy code table. */
static inline int vli_decode_bits(u64 *out, const u64 in)
{
	u64 adj = 1;

#define LEVEL(t,b,v)					\
	do {						\
		if ((in & ((1 << b) -1)) == v) {	\
			*out = ((in & ((~0ULL) >> (64-t))) >> b) + adj;	\
			return t;			\
		}					\
		adj += 1ULL << (t - b);			\
	} while (0)

	VLI_L_1_1();

	/* NOT REACHED, if VLI_LEVELS code table is defined properly */
	BUG();
#undef LEVEL
}

/* return number of code bits needed,
 * or negative error number */
static inline int __vli_encode_bits(u64 *out, const u64 in)
{
	u64 max = 0;
	u64 adj = 1;

	if (in == 0)
		return -EINVAL;

#define LEVEL(t,b,v) do {		\
		max += 1ULL << (t - b);	\
		if (in <= max) {	\
			if (out)	\
				*out = ((in - adj) << b) | v;	\
			return t;	\
		}			\
		adj = max + 1;		\
	} while (0)

	VLI_L_1_1();

	return -EOVERFLOW;
#undef LEVEL
}

#undef VLI_L_1_1

/* code from here down is independent of the actually used bit code */

/*
 * Code length is determined by some unique (e.g. unary) prefix.
 * This encodes arbitrary bit length, not whole bytes: we have a bit-stream,
 * not a byte stream.
 */

/* for the bitstream, we need a cursor */
struct bitstream_cursor {
	/* the current byte */
	u8 *b;
	/* the current bit within *b, normalized: 0..7 */
	unsigned int bit;
};

/* initialize cursor to point to first bit of stream */
static inline void bitstream_cursor_reset(struct bitstream_cursor *cur, void *s)
{
	cur->b = s;
	cur->bit = 0;
}

/* advance cursor by that many bits; maximum expected input value: 64,
 * but depending on VLI implementation, it may be more. */
static inline void bitstream_cursor_advance(struct bitstream_cursor *cur, unsigned int bits)
{
	bits += cur->bit;
	cur->b = cur->b + (bits >> 3);
	cur->bit = bits & 7;
}

/* the bitstream itself knows its length */
struct bitstream {
	struct bitstream_cursor cur;
	unsigned char *buf;
	size_t buf_len;		/* in bytes */

	/* for input stream:
	 * number of trailing 0 bits for padding
	 * total number of valid bits in stream: buf_len * 8 - pad_bits */
	unsigned int pad_bits;
};

static inline void bitstream_init(struct bitstream *bs, void *s, size_t len, unsigned int pad_bits)
{
	bs->buf = s;
	bs->buf_len = len;
	bs->pad_bits = pad_bits;
	bitstream_cursor_reset(&bs->cur, bs->buf);
}

static inline void bitstream_rewind(struct bitstream *bs)
{
	bitstream_cursor_reset(&bs->cur, bs->buf);
	memset(bs->buf, 0, bs->buf_len);
}

/* Put (at most 64) least significant bits of val into bitstream, and advance cursor.
 * Ignores "pad_bits".
 * Returns zero if bits == 0 (nothing to do).
 * Returns number of bits used if successful.
 *
 * If there is not enough room left in bitstream,
 * leaves bitstream unchanged and returns -ENOBUFS.
 */
static inline int bitstream_put_bits(struct bitstream *bs, u64 val, const unsigned int bits)
{
	unsigned char *b = bs->cur.b;
	unsigned int tmp;

	if (bits == 0)
		return 0;

	if ((bs->cur.b + ((bs->cur.bit + bits -1) >> 3)) - bs->buf >= bs->buf_len)
		return -ENOBUFS;

	/* paranoia: strip off hi bits; they should not be set anyways. */
	if (bits < 64)
		val &= ~0ULL >> (64 - bits);

	*b++ |= (val & 0xff) << bs->cur.bit;

	for (tmp = 8 - bs->cur.bit; tmp < bits; tmp += 8)
		*b++ |= (val >> tmp) & 0xff;

	bitstream_cursor_advance(&bs->cur, bits);
	return bits;
}

/* Fetch (at most 64) bits from bitstream into *out, and advance cursor.
 *
 * If more than 64 bits are requested, returns -EINVAL and leave *out unchanged.
 *
 * If there are less than the requested number of valid bits left in the
 * bitstream, still fetches all available bits.
 *
 * Returns number of actually fetched bits.
 */
static inline int bitstream_get_bits(struct bitstream *bs, u64 *out, int bits)
{
	u64 val;
	unsigned int n;

	if (bits > 64)
		return -EINVAL;

	if (bs->cur.b + ((bs->cur.bit + bs->pad_bits + bits -1) >> 3) - bs->buf >= bs->buf_len)
		bits = ((bs->buf_len - (bs->cur.b - bs->buf)) << 3)
			- bs->cur.bit - bs->pad_bits;

	if (bits == 0) {
		*out = 0;
		return 0;
	}

	/* get the high bits */
	val = 0;
	n = (bs->cur.bit + bits + 7) >> 3;
	/* n may be at most 9, if cur.bit + bits > 64 */
	/* which means this copies at most 8 byte */
	if (n) {
		memcpy(&val, bs->cur.b+1, n - 1);
		val = le64_to_cpu(val) << (8 - bs->cur.bit);
	}

	/* we still need the low bits */
	val |= bs->cur.b[0] >> bs->cur.bit;

	/* and mask out bits we don't want */
	val &= ~0ULL >> (64 - bits);

	bitstream_cursor_advance(&bs->cur, bits);
	*out = val;

	return bits;
}

/* encodes @in as vli into @bs;

 * return values
 *  > 0: number of bits successfully stored in bitstream
 * -ENOBUFS @bs is full
 * -EINVAL input zero (invalid)
 * -EOVERFLOW input too large for this vli code (invalid)
 */
static inline int vli_encode_bits(struct bitstream *bs, u64 in)
{
	u64 code = code;
	int bits = __vli_encode_bits(&code, in);

	if (bits <= 0)
		return bits;

	return bitstream_put_bits(bs, code, bits);
}

/* end of drbd_vli.h }}}1 */

static char *progname;

void show_usage_and_die(void)
{
	fprintf(stderr,
"Usage: %s subcommand subcommand-options ...\n"
"  Subcommands:\n"
"  plain-to-rl <in_file>:\n"
"    mmap()s in_file,\n"
"    Writes a stream of 32bit native unsigned int to stdout,\n"
"    representing the runlengths of set and unset bits.\n"
"    First runlength is _unset_ bits,\n"
"    and is the only runlength that may be zero.\n\n"

"  rl-to-vli\n"
"    Takes the output of 'plain-to-rl' from stdin,\n"
"    and encodes those runlengths into variable length integer (VLI)\n"
"    bitstream chunks (to stdout).\n"
"    export RL_TO_VLI_STATS=any-value gives extra stats on this one.\n\n"

"  vli-to-rl\n"
"    Takes the output of 'rl-to-vli' from stdin,\n"
"    and decodes those VLI chunks back into a stream of\n"
"    32bit native unsigned int (to stdout).\n\n"

"  rl-to-plain\n"
"    Takes a stream of 32bit native unsigned int runlengths from stdin\n"
"    and writes the corresponding plaintext to stdout.\n\n"

"Note that in the proper implementation, incompressible chunks\n"
"are stored as plain. This is not done here, to better demonstrate\n"
"the properties of this method\n\n"

"shell examples:\n"
"  do_try() { ( set -vxeC\n"
"    local f=${1:? missing input file};\n"
"    export RL_TO_VLI_STATS=whatever\n"
"    time %s plain-to-rl \"$f\" > RL;\n"
"    time %s rl-to-vli < RL > BS;\n"
"    time %s vli-to-rl < BS > DE;\n"
"    time %s rl-to-plain < DE > R;\n"
"    ls -l \"$f\" R BS RL DE;\n"
"    time cmp RL DE;\n"
"    time cmp \"$f\" R )\n"
"  }\n"
"  # rm -f RL BS DE R # for cleanup, because of set -C\n"
"  do_try example_bitmap\n"
"The bit stream in BS is the compressed representation of infile\n\n"

"Stupid comparison for the fun of it. Prepare some example_bitmap, then:\n"
"f=example_bitmap\n"
"time { gzip --best < $f | tee C | gunzip > /dev/null; ls -l C; }\n"
"time { bzip2 < $f | tee C | bunzip2 > /dev/null; ls -l C; }\n"
"time { %s plain-to-rl $f | %s rl-to-vli |\n"
"       tee C | %s vli-to-rl |\n"
"       %s rl-to-plain > /dev/null; ls -l C; }\n",
		progname,
		progname, progname, progname, progname,
		progname, progname, progname, progname);
	exit(1);
}

static char *subcmd;
#define eprintf(fmt, args...) fprintf(stderr, "%s: " fmt, subcmd , ## args)

#define RING_SIZE 1024
static unsigned RL[RING_SIZE];

/* plain-to-rl {{{2 */

/* also used in vli-to-rl */
void write_RL(int i)
{
	ssize_t s = i * sizeof(RL[0]);
	ssize_t c = write(out_fd, RL, s);
	if (c < 0) {
		eprintf("error writing runlength blob to stdout: %m\n");
		exit(6);
	}
	if (c != s) {
		eprintf("short write: %lu != %lu\n",
				(unsigned long)c, (unsigned long)s);
		exit(7);
	}
}

/* don't risk bit number wrap around.
 * could be coded around, of course. */
#define MAX_BYTES ((off_t)((unsigned int)(~0) >> 3))

int plain_to_rl(int argc, char **argv)
{
	struct stat sb;
	unsigned long current_bit = 0;
	unsigned long n_bits;
	unsigned long tmp;
	unsigned long *bm;
	char *in_file;
	int toggle = 0;
	int i = 0;
	int fd;

	if (argc != 1) {
		eprintf("missing input file argument\n");
		return 1;
	}

	in_file = argv[0];
	fd = open(in_file, O_RDONLY);
	if (fd < 0) {
		eprintf("open('%s'): %m\n", in_file);
		return 2;
	}

	if (fstat(fd, &sb)) {
		eprintf("fstat(%s): %m\n", in_file);
		return 3;
	}

	if ((sb.st_mode & S_IFMT) != S_IFREG) {
		eprintf("%s: not a regular file\n", in_file);
		return 4;
	}

	if (sb.st_size > MAX_BYTES) {
		eprintf("%s too big, only scanning first %lu bytes\n",
				in_file, MAX_BYTES);
		sb.st_size = MAX_BYTES;
	}

	/* maybe TODO: allow start offset and size to be specified,
	 * possibly use mmap2 */
	bm = mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);
	if (bm == MAP_FAILED) {
		eprintf("mmap(%s): %m\n", in_file);
		return 5;
	}

	n_bits = sb.st_size << 3;
	for (;;) {
		toggle = !toggle;

		tmp = toggle ? find_next_bit(bm, n_bits, current_bit)
		             : find_next_zero_bit(bm, n_bits, current_bit);

		if (tmp >= n_bits)
			RL[i++] = n_bits - current_bit;
		else
			RL[i++] = tmp - current_bit;
		if (i == RING_SIZE || tmp >= n_bits) {
			write_RL(i);
			if (tmp >= n_bits)
				break;
			i = 0;
		}
		current_bit = tmp;
	}
	close(1);
	return 0;
}
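/* Added worked example (illustrative only, little-endian host assumed):
 * for a one byte input file containing 0x3a (bits, LSB first: 0 1 0 1 1 1 0 0),
 * plain_to_rl writes
 *
 *	RL[] = { 1, 1, 1, 3, 2 }
 *
 * i.e. 1 unset bit, 1 set, 1 unset, 3 set, 2 unset.  The first entry always
 * counts unset bits and is the only one that may be zero. */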

/* plain-to-rl }}}1 */

/* rl-to-vli {{{2 */

/* drbd on-network packet is preceded by an 8 byte header, btw.
 * also, we transmit an incompressible chunk as plain, obviously.
 * left out here for demonstration of the encoding properties only. */

/* also used from vli_to_rl */
/* CAUTION: do not increase OUTPUT_BUFF_SIZE without also changing
 * the chunk.head format, see head_to_bytes() below.
 * in a proper implementation, you would add some magic or checksum,
 * plus a flag for interleaved plaintext (for incompressible chunks). */
#define OUTPUT_BUFF_SIZE 4096
static struct code_chunk {
	unsigned short head;
#define head_to_bytes(head)		(((head) & 0x0fff) + 1)
#define head_to_pad_bits(head)		(((head) & 0x7000) >> 12)
#define head_is_first_bit_set(head)	(((head) & 0x8000) != 0)

	unsigned char code[OUTPUT_BUFF_SIZE];
} chunk;
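/* Added worked example (illustrative only): a head value of 0xd005 decodes as
 *	head_to_bytes()		-> 6 code bytes follow,
 *	head_to_pad_bits()	-> 5 trailing pad bits in the last code byte,
 *	head_is_first_bit_set()	-> the first runlength describes set bits. */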

ssize_t pipe_read(int fd, void *buf, ssize_t l)
{
	ssize_t c, count = 0;
	int loop = 0;
	do {
		c = read(fd, buf + count, l);
		if (c < 0) {
			eprintf("error reading from stdin: %m\n");
			exit(2);
		}
		if (c == 0 && ++loop > 3)
			break;
		count += c;
		l -= c;
	} while (l);
	return count;
}

int read_RL()
{
	int count = pipe_read(in_fd, RL, sizeof(RL));
	if (count % sizeof(RL[0])) {
		eprintf("short read, not modulo native unsingned int!\n"
				" %u %% %u == %u\n", count, (unsigned)sizeof(RL[0]),
				count % (unsigned)sizeof(RL[0]));
		exit(2);
	}
	return count / sizeof(RL[0]);
}

/* one could save (8*sizeof(head) + up to 7) bits every chunk, by just
 * streaming the code bits one after the other.
 * but then you have no easy way to detect truncated code during decode */
void write_code_chunk_rewind_bs(struct bitstream *bs, int first_bit_set)
{
	ssize_t c;
	ssize_t s = bs->cur.b - bs->buf + !!bs->cur.bit;

	chunk.head = first_bit_set ? 0x8000 : 0;
	/* pad bits */
	chunk.head |= (0x7 & (8 - bs->cur.bit)) << 12;
	/* code bytes */
	chunk.head |= 0x0fff & (s - 1);

	/* should not happen? */
	if (!s)
		return;

	s += sizeof(chunk.head);
	c = write(out_fd, &chunk, s);
	if (c < 0) {
		eprintf("error writing code blob to stdout: %m\n");
		exit(6);
	}
	if (c != s) {
		eprintf("short write: %lu != %lu\n",
				(unsigned long)c, (unsigned long)s);
		exit(7);
	}
	bitstream_rewind(bs);
}
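/* Added worked example (illustrative only): if the encoder stopped with
 * cur.b - buf == 5 and cur.bit == 3, then s == 6 code bytes are flushed,
 * the pad bits field is 0x7 & (8 - 3) == 5, and with first_bit_set the
 * resulting head is 0x8000 | (5 << 12) | (6 - 1) == 0xd005 -- matching the
 * decode example above. */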

static struct {
	unsigned n;
	unsigned plain;
	unsigned code;
} stats[2]; /* compressible, incompressible */

void do_stats(struct bitstream *bs, int n_chunks, unsigned long plain)
{
	/* bits used by this chunk, including the two byte chunk.head */
	unsigned long code = ((bs->cur.b - bs->buf) + !!bs->cur.bit + 2) * 8;
	/* eprintf("chunk:%u plain_bits:%lu code_bits:%u\n",
		n_chunks, plain, code); */
	++stats[code > plain].n;
	stats[code > plain].plain += plain;
	stats[code > plain].code += code;
}

void print_stat_summary(void)
{
	unsigned total_n = stats[0].n + stats[1].n;
	unsigned long total_plain_bits = stats[0].plain + stats[1].plain;
	unsigned long total_code_bits = stats[0].code + stats[1].code;

	eprintf("stats: %u chunks, %u compressed, %u uncompressed\n"
		"\tonly compressible: %u plain bits -> %u code bits\n",
		total_n, stats[0].n, stats[1].n,
		stats[0].plain, stats[0].code);
	eprintf("total saved: %.2f%%\n",
			100.0 * (total_plain_bits - total_code_bits)
			    / total_plain_bits);
}

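/* Added note on the chunk boundary bookkeeping in rl_to_vli() below:
 * with odd_is_set == 1, odd indices of RL[] hold runs of set bits and even
 * indices hold runs of unset bits (RL[0] always counts unset bits and may be
 * zero).  When the output buffer fills up, the run that did not fit becomes
 * the first run of the next chunk, so its set/unset parity is recorded in
 * that chunk's head.  Whenever a consumed RL[] buffer contained an odd number
 * of entries, the index parity flips for the next one, hence the odd_is_set
 * toggle. */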
int rl_to_vli(int argc, char **argv)
{
	struct bitstream bs;
	unsigned long plain_bits = 0;
	int l = read_RL();
	int first_is_set;
	int odd_is_set = 1;
	int i;
	int bits;
	int with_stats = NULL != getenv("RL_TO_VLI_STATS");
	int n_chunks = 0;

	if (l == 0) {
		eprintf("empty input!\n");
		return 3;
	}

	bitstream_init(&bs, chunk.code, OUTPUT_BUFF_SIZE, 0);
	i = !RL[0];
	first_is_set = i;
	do {
		while (i < l) {
			/* paranoia: catch zero runlength.
			 * can only happen if bitmap was modified while it was scanned. */
			if (RL[i] == 0) {
				eprintf("unexpected zero runlength i=%d\n", i);
				return 4;
			}
redo:
			bits = vli_encode_bits(&bs, RL[i]);
			if (bits == -ENOBUFS) { /* buffer full */
				if (with_stats) {
					do_stats(&bs, n_chunks++, plain_bits);
					plain_bits = 0;
				}
				write_code_chunk_rewind_bs(&bs, first_is_set);
				/* current will be first of next packet */
				first_is_set = (i & 1) == odd_is_set;
				goto redo;
			}
			if (bits <= 0) {
				eprintf("error while encoding runlength: %d\n", bits);
				return 5;
			}
			if (with_stats)
				plain_bits += RL[i];
			i++;
		}
		if (l & 1)
			odd_is_set = !odd_is_set;
		l = read_RL();
		i = 0;
	} while (l);
	if (bs.cur.b != bs.buf || bs.cur.bit)
		write_code_chunk_rewind_bs(&bs, first_is_set);
	if (with_stats)
		print_stat_summary();
	return 0;
}

/* rl-to-vli }}}1 */

/* vli-to-rl {{{2 */

ssize_t read_one_code_chunk(struct bitstream *bs)
{
	static unsigned long total_read_bytes;
	ssize_t len;
	int bytes;

	len = pipe_read(in_fd, &chunk.head, sizeof(chunk.head));
	if (len == 0)
		return 0;
	if (len != 2) {
		eprintf("short read reading chunk.head\n");
		exit(2);
	}
	total_read_bytes += len;
	bytes = head_to_bytes(chunk.head);
	len = pipe_read(in_fd, &chunk.code, bytes);
	if (len != bytes) {
		eprintf("short read reading chunk.code: %d %d (%lu)\n",
			(unsigned)len, bytes, total_read_bytes);
		exit(2);
	}
	total_read_bytes += len;
	bitstream_init(bs, chunk.code, bytes, head_to_pad_bits(chunk.head));
	return len + 2;
}

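/* Added note on the decode loop in vli_to_rl() below: it keeps a 64 bit
 * look-ahead window.  bitstream_get_bits() tops the window up to 64 bits,
 * vli_decode_bits() consumes one variable length code from its low bits,
 * and the window is shifted down by the number of bits consumed.  The inner
 * loop ends once the window runs dry (have == 0), i.e. all valid (non pad)
 * bits of the current chunk have been decoded. */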
int vli_to_rl(int argc, char **argv)
{
	struct bitstream bs;
	__u64 look_ahead;
	__u64 tmp;
	__u64 rl;
	int toggle;
	int have;
	int bits;
	int first = 1;
	int i = 0;

	while (read_one_code_chunk(&bs)) {
		look_ahead = 0;
		tmp = 0;
		have = 0;
		toggle = !head_is_first_bit_set(chunk.head);
		if (first) {
			first = 0;
			if (!toggle)
				RL[i++] = 0;
		}
		for (;;) {
			toggle = !toggle;
			/* get fresh bits */
			bits = bitstream_get_bits(&bs, &tmp, 64 - have);
			if (bits < 0)
				return 3;
			look_ahead |= tmp << have;
			have += bits;
			if (have == 0)
				break;

			/* consume one code number */
			bits = vli_decode_bits(&rl, look_ahead);
			if (bits <= 0)
				return 4;
			/* cannot possibly decode more bits than I had */
			if (have < bits)
				return 5;
			look_ahead >>= bits;
			have -= bits;

			RL[i++] = rl;
			if (i == RING_SIZE) {
				write_RL(i);
				i = 0;
			}
		}
	}
	write_RL(i);
	return 0;
}

/* vli-to-rl }}}1 */

/* rl-to-plain {{{2 */

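/* Added note: rl_to_plain() reassembles the bitmap byte-wise.  Whole bytes
 * worth of a run are written from the static zeros/FFFFs buffers; runs that
 * start or end inside a byte are collected bit by bit in "minibuf" until
 * eight bits are available, and any leftover partial byte is flushed after
 * the last runlength. */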
int rl_to_plain(int argc, char **argv)
{
	/* yep, this is excessive, and only used here to quick'n'dirty
	 * code something that is streamable. */
	static unsigned char zeros[4096];
	static unsigned char FFFFs[4096];

	unsigned char *out;
	unsigned int rl;
	int l;
	int i;
	int set = 0;
	int minibuf_bits = 0;
	unsigned char minibuf = 0;

	/* no need to initialize zeros, static did that for us */
	memset(FFFFs, 0xff, sizeof(FFFFs));

	while ((l = read_RL())) {
		for (i = 0; i < l; i++, set = !set) {
			rl = RL[i];
			if (minibuf_bits || rl < 8) {
				if (set) /* set bits */
					minibuf |= ((1 << (rl < 7 ? rl : 7)) -1)
						<< minibuf_bits;
				if (rl < 8 - minibuf_bits) {
					minibuf_bits += rl;
					continue;
				}
				if (write(out_fd, &minibuf, 1) != 1) {
					eprintf("FIXME 1\n");
					exit(2);
				}
				rl -= 8 - minibuf_bits;
				minibuf = 0;
				minibuf_bits = 0;
			}

			out = set ? FFFFs : zeros;
			while (rl > 8) {
				size_t c = rl/8; /* 8 bits per byte */
				if (sizeof(FFFFs) < c)
					c = sizeof(FFFFs);
				if (write(out_fd, out, c) != c) {
					eprintf("FIXME 2\n");
					exit(2);
				}
				rl -= c * 8;
			}
			if (rl) {
				if (set)
					minibuf = (1 << rl) -1;
				minibuf_bits = rl;
			}
		}
	}
	if (minibuf_bits)
		if (write(out_fd, &minibuf, 1) != 1) {
			eprintf("FIXME 3\n");
			exit(2);
		}

	return 0;
}

/* rl_to_plain }}}1 */

int main(int argc, char **argv)
{
	progname = strrchr(argv[0], '/');
	if (!progname)
		progname = argv[0];
	else
		progname++;

	if (argc < 2)
		show_usage_and_die();
	subcmd = argv[1];
	if (!strcmp(subcmd, "plain-to-rl"))
		return plain_to_rl(argc-2, argv+2);
	if (!strcmp(subcmd, "rl-to-vli"))
		return rl_to_vli(argc-2, argv+2);
	if (!strcmp(subcmd, "vli-to-rl"))
		return vli_to_rl(argc-2, argv+2);
	if (!strcmp(subcmd, "rl-to-plain"))
		return rl_to_plain(argc-2, argv+2);

	fprintf(stderr, "%s %s: unimplemented subcommand\n", progname, subcmd);
	return 1;
}

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 04/16] DRBD: bitmap
  2009-05-05 17:48               ` Lars Marowsky-Bree
  2009-05-05 17:51                 ` James Bottomley
@ 2009-05-05 22:26                 ` Neil Brown
  1 sibling, 0 replies; 90+ messages in thread
From: Neil Brown @ 2009-05-05 22:26 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: Lars Ellenberg, James Bottomley, Philipp Reisner, linux-kernel,
	Jens Axboe, Greg KH, Sam Ravnborg, Dave Jones,
	Nikanth Karthikesan, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche

On Tuesday May 5, lmb@suse.de wrote:
> On 2009-05-03T15:21:41, Neil Brown <neilb@suse.de> wrote:
> 
> > As I said, I don't immediately see the benefits of the activity log
> > format, however,
> >  1/ I am happy to listen to its benefits being explained
> >  2/ If we were to agree that merging DRBD functionality into md
> >    (for which there isn't a concrete proposal, but the suggestion
> >     seems to be floating around) were a good thing, I don't have any
> >     problem with supporting an activity log in md in the name of
> >     compatibility.
> 
> So, let's take a step back here.
> 
> All of this is extremely beneficial discussion to be had. As some of you
> are (painfully, sometimes ;-) aware, I'm a big fan of converging RAID
> implementations/back-ends, and the goal is well received.
> 
> But this will take a while, and both drbd, md, md/nbd, or even dm-raid1
> have large existing user bases, and HA environments don't switch easily.
> All are actively maintained.
> 
> Sharing more and more of the code strikes me as a mid-term goal, and
> full converges as a long-term one (alas).
> 
> What I think this argument has shown that drbd's design is sound (even
> if some choices, like that of the alternatives, are up for discussion),
> similar to different file systems (of which we seem to have plenty
> too).
> 
> I would suggest at this time, we may want to refocus on the remaining
> objections to merging drbd as a driver in the short-term.

I cannot imagine that there would be any.  Given its history, its
popularity, and its modularity, there can be no question about merging
it, and only a possible question on whether it should spend some time
in 'staging' first.
I doubt there is much call for that, but nor is it clear to me how the
decision would be made.

> 
> I think I've not read anything in the last 3-5 days which still would
> rate as a reason for rejection or delay.
> 
> Did I miss something?

This is lkml - no one can catch everything :-)

A big part of why I was comparing and contrasting DRBD to md is
because that enables me to understand it better.  That sort of in-depth
understanding is, for me, a prerequisite for an in-depth review.

So it is all just part of the review process....

NeilBrown

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
  2009-05-05 21:53                             ` James Bottomley
@ 2009-05-06  8:17                               ` Philipp Reisner
  0 siblings, 0 replies; 90+ messages in thread
From: Philipp Reisner @ 2009-05-06  8:17 UTC (permalink / raw)
  To: James Bottomley
  Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
	linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
	Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
	Kyle Moffett, Lars Ellenberg

[...]
>
> Well, you have to agree that during a resync from the activity log,
> which plays up the primary disk from one end to another, the secondary
> is completely corrupt if a primary failure occurs before the resync
> completes.  That's something that's triggered by a network outage, and
> so is a far more common event than cascading dual failures.  It's all
> really a question of where you focus your effort to eliminate the corner
> cases.
>

I fully agree. Just to not leave this unanswered: with DRBD we provide
a snapshot-resync-target handler. Using LVM's snapshotting mechanism,
a snapshot is taken before the node becomes a resync target. If the
resync completes gracefully, the snapshot is automatically removed.

Which is still inferior to a full transaction log on the secondary.

-Phil


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] drbd: a block device for HA clusters
       [not found]           ` <200907241720.22771.philipp.reisner-63ez5xqkn6DQT0dZR+AlfA@public.gmane.org>
@ 2009-07-26 23:24             ` Stephen Rothwell
  0 siblings, 0 replies; 90+ messages in thread
From: Stephen Rothwell @ 2009-07-26 23:24 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Lars Ellenberg,
	linux-next-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	drbd-dev-cunTk1MwBs8qoQakbn7OcQ


[-- Attachment #1.1: Type: text/plain, Size: 503 bytes --]

Hi Philipp,

On Fri, 24 Jul 2009 17:20:21 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
>
> Please add git://git.drbd.org/linux-2.6-drbd.git to linux-next.

I have added that tree from today.  Currently you are listed as the
contact (I can add others if you like).  I used the drbd branch of that
tree as the master branch appears to be Linus' tree.

-- 
Cheers,
Stephen Rothwell                    sfr-3FnU+UHB4dNDw9hX6IcOSA@public.gmane.org
http://www.canb.auug.org.au/~sfr/

[-- Attachment #1.2: Type: application/pgp-signature, Size: 197 bytes --]

[-- Attachment #2: Type: text/plain, Size: 169 bytes --]

_______________________________________________
drbd-dev mailing list
drbd-dev-cunTk1MwBs8qoQakbn7OcQ@public.gmane.org
http://lists.linbit.com/mailman/listinfo/drbd-dev

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] drbd: a block device for HA clusters
       [not found]   ` <20090720224940.36da1ef8.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2009-07-21 18:51     ` Lars Ellenberg
  2009-07-22  4:59       ` [Drbd-dev] " Stephen Rothwell
  0 siblings, 1 reply; 90+ messages in thread
From: Lars Ellenberg @ 2009-07-21 18:51 UTC (permalink / raw)
  To: Andrew Morton, Stephen Rothwell
  Cc: Christoph Hellwig, Kyle Moffett, Neil Brown, Nikanth Karthikesan,
	Greg KH, Philipp Reisner, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Bart Van Assche, linux-next-u79uwXL29TY76Z2rM5mHXA,
	Lars Marowsky-Bree, Jens Axboe, Dave Jones, Sam Ravnborg,
	James Bottomley, drbd-dev-cunTk1MwBs8qoQakbn7OcQ

On Mon, Jul 20, 2009 at 10:49:40PM -0700, Andrew Morton wrote:
> On Mon,  6 Jul 2009 17:39:19 +0200 Philipp Reisner <philipp.reisner-63ez5xqkn6DQT0dZR+AlfA@public.gmane.org> wrote:
> > Patch set attached. Git tree available:
> > git pull git://git.drbd.org/linux-2.6-drbd.git drbd
> 
> I don't think I can be bothered reading all this again ;)  I trust that
> earlier review comments were suitably addressed.

To the best of my knowledge, we addressed all of them, yes.

> Please prepare a tree for inclusion in linux-next,
> send that off to Stephen

Ok, will do.
Stephen: expect an explicit linux-next merge request
within the next few days.

> and unless someone can identify reasons otherwise,
> send Linus a pull request for 2.6.32-rc1.

Thanks very much,

	Lars

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] drbd: a block device for HA clusters
  2009-07-06 15:39 [PATCH 00/16] drbd: " Philipp Reisner
@ 2009-07-21  5:49 ` Andrew Morton
       [not found]   ` <20090720224940.36da1ef8.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 90+ messages in thread
From: Andrew Morton @ 2009-07-21  5:49 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: linux-kernel, Jens Axboe, Greg KH, Neil Brown, James Bottomley,
	Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche, Christoph Hellwig, drbd-dev, Lars Ellenberg,
	linux-next

On Mon,  6 Jul 2009 17:39:19 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:

> As the first bit of the DRBD patch already got upstream (see commit
> 10fc89d01a), it is time to get more of DRBD towards mainline.
> 
> Here is a post of drbd-8.3.2 for inclusion into linux-mm (or linux-next).
> 
> Patch set attached. Git tree available:
> git pull git://git.drbd.org/linux-2.6-drbd.git drbd

I don't think I can be bothered reading all this again ;)  I trust that
earlier review comments were suitably addressed.

Please prepare a tree for inclusion in linux-next, send that off to
Stephen and unless someone can identify reasons otherwise, send Linus a pull
request for 2.6.32-rc1.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH 00/16] drbd: a block device for HA clusters
@ 2009-07-06 15:39 Philipp Reisner
  2009-07-21  5:49 ` Andrew Morton
  0 siblings, 1 reply; 90+ messages in thread
From: Philipp Reisner @ 2009-07-06 15:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Jens Axboe, Greg KH, Neil Brown, James Bottomley,
	Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche, Christoph Hellwig, drbd-dev, Lars Ellenberg,
	Philipp Reisner

Hi,

As the first bit of the DRBD patch already got upstream (see commit
10fc89d01a), it is time to get more of DRBD towards mainline.

Here is a post of drbd-8.3.2 for inclusion into linux-mm (or linux-next).

Patch set attached. Git tree available:
git pull git://git.drbd.org/linux-2.6-drbd.git drbd

In case you want to review the code, here is a note for you:

  Only the first patch (lru_cache) is self contained. The other patches are
  just split at file boundaries. Sorry, DRBD was developed as an out-of-tree
  module for just too long.

Short Description

  DRBD is a shared-nothing, replicated block device. It is designed to
  serve as a building block for high availability clusters and in this
  context, is a "drop-in" replacement for shared storage.
  Simplistically, you could see it as a network RAID 1.

  More information can be found at http://www.drbd.org

Changes since 2009-06-26

  * Cleanup: Added an entry to the MAINTAINERS file
  * DRBD:    Now at drbd-8.3.2:
  * DRBD:    Fixed a hard to trigger race condition. (kmap_atomic(..., KM_IRQ1) interruptible)

Changes since 2009-05-15

  * Cleanup: Moved lru_cache.c to /lib
  * Cleanup: all STATIC -> static
  * Cleanup: Removed drbd_config.h ; New Kconfig option: CONFIG_DRBD_FAULT_INJECTION
  * Cleanup: Removed drbd_buildtag.c
  * DRBD:    Following DRBD-upstream, now at 8.3.2-rc2. Relevant changes:
  * DRBD:    lru_cache: use pointer arrays and kmem_cache
  * DRBD:    Fixed for building on big endian architectures
  * DRBD:    Fixed nl stuff to work on architectures that do not do unaligned memory accesses
  * DRBD:    Deal with hash functions already ported to SHASH
  * DRBD:    GFP_KERNEL -> GFP_NOIO in various places

Changes since 2009-04-30

  * Cleanup: Removed typecasts, more documentation in lru_cache. Moved to /lib
  * Cleanup: replaced __attribute__((packed)) with __packed
  * Cleanup: remove quite a few 'inline's from .c files
  * Cleanup: renaming a few constants: _SECT -> _SECTOR_SIZE, _SIZE_B -> _SHIFT ...
  * Cleanup: rename inc_local -> get_ldev; inc_net -> get_net_conf; and corresponding dec_* -> put_*
  * Cleanup: rename mdev->bc to mdev->ldev (to match the recent change to get_ldev/put_ldev)
  * Cleanup: Made function comments kernel-doc compliant
  * Cleanup: vmalloc() only as a fall back for kmalloc()
  * DRBD:    Allow detach of a SyncTarget node. (Bugz 221)
  * DRBD:    Call drbd_rs_cancel_all() and reset rs_pending when aborting resync due to detach. (Bugz 223)
  * DRBD:    make drbd thread t_lock irqsave - lockdep complained, and lockdep is right (theoretically)

Changes since 2009-04-10

  * Cleanup: Removed all CamelCase
  * Cleanup: Replaced DRBD's own tracing stuff with regular tracepoints
  * Cleanup: Removed ERR/INFO/ALERT ... macros, using dev_err/dev_info/... now
  * Cleanup: Minor stuff, as suggested in feedback on LKML
  * DRBD:    Bitmap compression feature was finalised
  * DRBD:    new disable_sendpage parameter

Changes since the post on 2009-03-30, all triggered by reviews

  * Improvements to Makefile and Kconfig
  * Simplified definitions of bm_flags' bitnumbers
  * Removed debugging aid

Changes since the post on 2009-03-23, from drbd-mainline

  * Updated to the final drbd-8.3.1 code
  * Optionally run-length encode bitmap transfers

Changes since the post on 2009-03-23, triggered by reviews

  * Using the latest proc_create() now
  * Moved the allocation of md_io_tmpp to attach/detach out of drbd_md_sync_page_io()
  * Removing the mode selection comments for emacs
  * Removed DRBD_ratelimit()

cheers,
  Phil

^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH 00/16] DRBD: a block device for HA clusters
@ 2009-05-15 12:10 Philipp Reisner
  0 siblings, 0 replies; 90+ messages in thread
From: Philipp Reisner @ 2009-05-15 12:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Jens Axboe, Greg KH, Neil Brown, James Bottomley,
	Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
	Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
	Bart Van Assche, Lars Ellenberg, Philipp Reisner

Hi,

This is a repost of DRBD, to keep you updated about the ongoing
cleanups and improvements.

Patch set attached. Git tree available:
git pull git://git.drbd.org/linux-2.6-drbd.git drbd

We are looking for reviews!

Note for reviewers:

  Only the first two patches (major.h and lru_cache) are self contained.
  The other patches are just split at file boundaries. Sorry, DRBD
  was developed as an out-of-tree module for just too long.

Short Description

  DRBD is a shared-nothing, synchronously replicated block device. It
  is designed to serve as a building block for high availability
  clusters and in this context, is a "drop-in" replacement for shared
  storage. Simplistically, you could see it as a network RAID 1.

  More information can be found at http://www.drbd.org

Changes since 2009-04-30

  * Cleanup: Removed typecasts, more documentation in lru_cache. Moved to /lib
  * Cleanup: replaced __attribute__((packed)) with __packed
  * Cleanup: remove quite a few 'inline's from .c files
  * Cleanup: renaming a few constants: _SECT -> _SECTOR_SIZE, _SIZE_B -> _SHIFT ...
  * Cleanup: rename inc_local -> get_ldev; inc_net -> get_net_conf; and corresponding dec_* -> put_*
  * Cleanup: rename mdev->bc to mdev->ldev (to match the recent change to get_ldev/put_ldev)
  * Cleanup: Made function comments kernel-doc compliant
  * Cleanup: vmalloc() only as a fall back for kmalloc()
  * DRBD:    Allow detach of a SyncTarget node. (Bugz 221)
  * DRBD:    Call drbd_rs_cancel_all() and reset rs_pending when aborting resync due to detach. (Bugz 223)
  * DRBD:    make drbd thread t_lock irqsave - lockdep complained, and lockdep is right (theoretically)

Changes since 2009-04-10

  * Cleanup: Removed all CamelCase
  * Cleanup: Replaced DRBD's own tracing stuff with regular tracepoints
  * Cleanup: Removed ERR/INFO/ALERT ... macros, using dev_err/dev_info/... now
  * Cleanup: Minor stuff, as suggested in feedback on LKML
  * DRBD:    Bitmap compression feature was finalised
  * DRBD:    new disable_sendpage parameter

Changes since the post on 2009-03-30, all triggered by reviews

  * Improvements to Makefile and Kconfig
  * Simplified definitions of bm_flags' bitnumbers
  * Removed debugging aid

Changes since the post on 2009-03-23, from drbd-mainline

  * Updated to the final drbd-8.3.1 code
  * Optionally run-length encode bitmap transfers

Changes since the post on 2009-03-23, triggered by reviews

  * Using the latest proc_create() now
  * Moved the allocation of md_io_tmpp to attach/detach out of drbd_md_sync_page_io()
  * Removing the mode selection comments for emacs
  * Removed DRBD_ratelimit()

cheers,
  Phil

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/16] DRBD: a block device for HA clusters
@ 2009-05-14 22:31 devzero
  0 siblings, 0 replies; 90+ messages in thread
From: devzero @ 2009-05-14 22:31 UTC (permalink / raw)
  To: bart.vanassche; +Cc: akpm, linux-kernel, philipp.reisner

>On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
><akpm@linux-foundation.org> wrote:
>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
>>
>>> This is a repost of DRBD
>>
>> Is it being used anywhere for anything?  If so, where and what?
>
>One popular application is to run iSCSI and HA software on top of DRBD
>in order to build a highly available iSCSI storage target.
>
>Bart.

Iirc, Xtravirt's XVS Virtual SAN appliance is built around DRBD.

For those interested:

Some Blog entry:
http://vmetc.com/2008/05/23/xtravirt-xvs-creates-a-free-san-out-of-local-esx-vmfs/

Design sheet:
http://communities.vmware.com/servlet/JiveServlet/download/950436-9486/xvsrefdiag.jpg

Discussion:
http://communities.vmware.com/message/1114092#1114092

Seems to be sold to PHD Virtual Technologies now....

regards
roland


^ permalink raw reply	[flat|nested] 90+ messages in thread

end of thread, other threads:[~2009-07-26 23:24 UTC | newest]

Thread overview: 90+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-30 11:26 [PATCH 00/16] DRBD: a block device for HA clusters Philipp Reisner
2009-04-30 11:26 ` [PATCH 01/16] DRBD: major.h Philipp Reisner
2009-04-30 11:26   ` [PATCH 02/16] DRBD: lru_cache Philipp Reisner
2009-04-30 11:26     ` [PATCH 03/16] DRBD: activity_log Philipp Reisner
2009-04-30 11:26       ` [PATCH 04/16] DRBD: bitmap Philipp Reisner
2009-04-30 11:26         ` [PATCH 05/16] DRBD: request Philipp Reisner
2009-04-30 11:26           ` [PATCH 06/16] DRBD: userspace_interface Philipp Reisner
2009-04-30 11:26             ` [PATCH 07/16] DRBD: internal_data_structures Philipp Reisner
2009-04-30 11:26               ` [PATCH 08/16] DRBD: main Philipp Reisner
2009-04-30 11:26                 ` [PATCH 09/16] DRBD: receiver Philipp Reisner
2009-04-30 11:26                   ` [PATCH 10/16] DRBD: proc Philipp Reisner
2009-04-30 11:26                     ` [PATCH 11/16] DRBD: worker Philipp Reisner
2009-04-30 11:26                       ` [PATCH 12/16] DRBD: variable_length_integer_encoding Philipp Reisner
2009-04-30 11:26                         ` [PATCH 13/16] DRBD: misc Philipp Reisner
2009-04-30 11:26                           ` [PATCH 14/16] DRBD: tracepoint_probes Philipp Reisner
2009-04-30 11:26                             ` [PATCH 15/16] DRBD: documentation Philipp Reisner
2009-04-30 11:26                               ` [PATCH 16/16] DRBD: final Philipp Reisner
2009-05-02 15:45                         ` [PATCH 12/16] DRBD: variable_length_integer_encoding James Bottomley
2009-05-02 17:29                           ` Lars Ellenberg
2009-05-02 15:44                     ` [PATCH 10/16] DRBD: proc James Bottomley
2009-05-02 20:23                       ` Lars Ellenberg
2009-05-02 15:41         ` [PATCH 04/16] DRBD: bitmap James Bottomley
2009-05-02 17:28           ` Lars Ellenberg
2009-05-03  5:21             ` Neil Brown
2009-05-03  7:38               ` Lars Ellenberg
2009-05-05 17:48               ` Lars Marowsky-Bree
2009-05-05 17:51                 ` James Bottomley
2009-05-05 22:26                 ` Neil Brown
2009-05-01  9:01       ` [PATCH 03/16] DRBD: activity_log Andrew Morton
2009-05-02 17:00         ` Lars Ellenberg
2009-05-01  8:59     ` [PATCH 02/16] DRBD: lru_cache Andrew Morton
2009-05-02 15:26       ` Lars Ellenberg
2009-05-02 17:58         ` Andrew Morton
2009-05-02 18:13           ` Lars Ellenberg
2009-05-02 18:26             ` Andrew Morton
2009-05-02 19:39               ` Lars Ellenberg
2009-05-02 23:51     ` Kyle Moffett
2009-05-03  6:27       ` Lars Ellenberg
2009-05-03 14:06         ` Kyle Moffett
2009-05-03 22:48           ` Lars Ellenberg
2009-05-04  0:48             ` Kyle Moffett
2009-05-04  1:01               ` Kyle Moffett
2009-05-04 16:12                 ` Rik van Riel
2009-05-04 16:15                   ` Lars Ellenberg
2009-05-01  8:59   ` [PATCH 01/16] DRBD: major.h Andrew Morton
2009-05-01  8:59 ` [PATCH 00/16] DRBD: a block device for HA clusters Andrew Morton
2009-05-01 11:15   ` Lars Marowsky-Bree
2009-05-01 13:14     ` Dave Jones
2009-05-01 19:14       ` Andrew Morton
2009-05-05  4:05     ` Christian Kujau
2009-05-02  7:33   ` Bart Van Assche
2009-05-03  5:36     ` Willy Tarreau
2009-05-03  5:40       ` david
2009-05-03 14:21         ` James Bottomley
2009-05-03 14:36           ` david
2009-05-03 14:45             ` James Bottomley
2009-05-03 14:56               ` david
2009-05-03 15:09                 ` James Bottomley
2009-05-03 15:22                   ` david
2009-05-03 15:38                     ` James Bottomley
2009-05-03 15:48                       ` david
2009-05-03 16:02                         ` James Bottomley
2009-05-03 16:13                           ` david
2009-05-04  8:28               ` Philipp Reisner
2009-05-04 17:24                 ` James Bottomley
2009-05-05  8:21                   ` Philipp Reisner
2009-05-05 14:09                     ` James Bottomley
2009-05-05 15:56                       ` Philipp Reisner
2009-05-05 17:05                         ` James Bottomley
2009-05-05 21:45                           ` Philipp Reisner
2009-05-05 21:53                             ` James Bottomley
2009-05-06  8:17                               ` Philipp Reisner
2009-05-05 15:03                     ` Bart Van Assche
2009-05-05 15:57                       ` Philipp Reisner
2009-05-05 17:38                         ` Lars Marowsky-Bree
2009-05-03 10:06       ` Philipp Reisner
2009-05-03 10:15         ` Thomas Backlund
2009-05-03  5:53 ` Neil Brown
2009-05-03  6:24   ` david
2009-05-03  8:29   ` Lars Ellenberg
2009-05-03 11:00     ` Neil Brown
2009-05-03 21:32       ` Lars Ellenberg
2009-05-04 16:12         ` Lars Marowsky-Bree
2009-05-05 22:08         ` Lars Ellenberg
2009-05-14 22:31 devzero
2009-05-15 12:10 Philipp Reisner
2009-07-06 15:39 [PATCH 00/16] drbd: " Philipp Reisner
2009-07-21  5:49 ` Andrew Morton
     [not found]   ` <20090720224940.36da1ef8.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2009-07-21 18:51     ` Lars Ellenberg
2009-07-22  4:59       ` [Drbd-dev] " Stephen Rothwell
2009-07-24 15:20         ` Philipp Reisner
     [not found]           ` <200907241720.22771.philipp.reisner-63ez5xqkn6DQT0dZR+AlfA@public.gmane.org>
2009-07-26 23:24             ` Stephen Rothwell
