[PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6)

* [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6)
@ 2006-09-06 13:16 Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                   ` (20 more replies)
  0 siblings, 21 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton

--

This is the latest version of my networked swap patches.

These patches provide robust swap over NFS and iSCSI, and in lesser form
also over NBD (NBD cannot reconnect on network failure).

The following test scenario was used (for NFS and iSCSI):
 - client A mounts swap device on server B
 - client A starts heavy swapper
 - server B stops service
 - server B floods client A
 - client A is wedged;
    soft lockup detection messages are given
    sysrq-m shows free < min
 - server B resumes service
 - client A is seen to recover.

The iSCSI work depends on their upstream (svn) development work and requires
some additional user-space patches.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

* [PATCH 01/21] mm: serialize access to min_free_kbytes
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra

[-- Attachment #1: setup_per_zone_pages_min.patch --]
[-- Type: text/plain, Size: 1932 bytes --]

There is a small race between the procfs caller and the memory hotplug caller
of setup_per_zone_pages_min(). Not a big deal, but the next patch will add yet
another caller. Time to close the gap.
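The shape of the fix can be tried out in user space. The following is only a minimal sketch under stated assumptions, not the kernel code: a C11 atomic_flag stands in for the spinlock, 4 KiB pages are assumed, and the names recompute_pages_min_unlocked(), recompute_pages_min() and set_min_free_kbytes() are made up for the illustration. The unlocked worker does the recalculation, and the locked wrapper is what runtime callers use.

```c
#include <stdatomic.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* User-space stand-in for the spinlock added by the patch. */
static atomic_flag min_free_lock = ATOMIC_FLAG_INIT;
static int min_free_kbytes = 1024;
static unsigned long pages_min;

/*
 * Unlocked worker. The caller either holds min_free_lock or runs
 * before any concurrent caller can exist (early init).
 */
static void recompute_pages_min_unlocked(void)
{
	pages_min = (unsigned long)min_free_kbytes >> (SKETCH_PAGE_SHIFT - 10);
}

/* Locked entry point, serializing all runtime callers. */
void recompute_pages_min(void)
{
	while (atomic_flag_test_and_set_explicit(&min_free_lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
	recompute_pages_min_unlocked();
	atomic_flag_clear_explicit(&min_free_lock, memory_order_release);
}

/* Hypothetical caller, standing in for the procfs handler. */
unsigned long set_min_free_kbytes(int kbytes)
{
	min_free_kbytes = kbytes;
	recompute_pages_min();
	return pages_min;
}
```

With the 4 KiB page assumption, set_min_free_kbytes(2048) yields 512 pages.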

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/page_alloc.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -81,6 +81,7 @@ struct zone *zone_table[1 << ZONETABLE_S
 EXPORT_SYMBOL(zone_table);
 
 static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
+static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
@@ -2190,11 +2191,11 @@ static void setup_per_zone_lowmem_reserv
 }
 
 /*
- * setup_per_zone_pages_min - called when min_free_kbytes changes.  Ensures 
+ * __setup_per_zone_pages_min - called when min_free_kbytes changes.  Ensures
  *	that the pages_{min,low,high} values for each zone are set correctly 
  *	with respect to min_free_kbytes.
  */
-void setup_per_zone_pages_min(void)
+static void __setup_per_zone_pages_min(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
@@ -2248,6 +2249,15 @@ void setup_per_zone_pages_min(void)
 	calculate_totalreserve_pages();
 }
 
+void setup_per_zone_pages_min(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&min_free_lock, flags);
+	__setup_per_zone_pages_min();
+	spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -2283,7 +2293,7 @@ static int __init init_per_zone_pages_mi
 		min_free_kbytes = 128;
 	if (min_free_kbytes > 65536)
 		min_free_kbytes = 65536;
-	setup_per_zone_pages_min();
+	__setup_per_zone_pages_min();
 	setup_per_zone_lowmem_reserve();
 	return 0;
 }

--


* [PATCH 02/21] net: vm deadlock avoidance core
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Mike Christie, Trond Myklebust, Pavel Machek

[-- Attachment #1: vm_deadlock_core.patch --]
[-- Type: text/plain, Size: 20906 bytes --]

In order to provide robust networked block devices there must be a guarantee
of progress. That is, the block device must never stall because of (physical)
OOM, because the device itself might be needed to get out of it (reclaim).

This means that the device queue must always be unpluggable; this in turn means
that it must always find enough memory to build/send packets over the network
_and_ receive (level 7) ACKs for those packets.

The network stack has a huge capacity for buffering packets while waiting for
user-space to read them. There is a practical limit imposed to avoid DoS
scenarios. These two things make for a deadlock: what if the receive limit is
reached and all packets are buffered in non-critical sockets (those not serving
the network block device, which is waiting for an ACK to free a page)?

Memory pressure will add to that: what if there is simply no memory left to
receive packets in?

This patch provides a service to register sockets as critical; SOCK_VMIO
is a promise the socket will never block on receive. Along with a memory
reserve that will service a limited number of packets, this can guarantee a
limited service to these critical sockets.
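To get a feel for how modest the reserve is, the arithmetic can be worked through in user space. This is a sketch only: the four macros are copied from the include/net/sock.h hunk of this patch, while pages_to_kbytes() and reserve_delta() are hypothetical helpers written for the illustration. reserve_delta() follows the rule used by sk_adjust_memalloc(): the RX reserve is added once, when the first SOCK_VMIO socket appears, and removed when the last one goes away.

```c
#define MAX_PAGES_PER_PACKET 2
#define MAX_FRAGMENTS ((65536 + 1500 - 1) / 1500)
#define RX_RESERVE_PAGES	(64 * MAX_PAGES_PER_PACKET)
#define TX_RESERVE_PAGES \
	(4 * MAX_FRAGMENTS * MAX_PAGES_PER_PACKET)

/* Hypothetical helper: convert a page count to KiB for a given page shift. */
unsigned long pages_to_kbytes(unsigned long pages, unsigned int page_shift)
{
	return pages << (page_shift - 10);
}

/*
 * Hypothetical helper: number of pages by which the reserve changes
 * when 'socks' critical sockets are added (negative: removed) while
 * 'nr_before' were already registered.
 */
int reserve_delta(int nr_before, int socks, int tx_reserve_pages)
{
	int reserve = tx_reserve_pages;
	int nr_after = nr_before + socks;

	if (socks) {
		if (nr_before == 0)	/* first critical socket */
			reserve += RX_RESERVE_PAGES;
		if (nr_after == 0)	/* last critical socket gone */
			reserve -= RX_RESERVE_PAGES;
	}
	return reserve;
}
```

Assuming 4 KiB pages (page shift 12), the RX reserve comes to 128 pages or 512 KiB, and one TX reserve unit to 352 pages or 1408 KiB.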

When we make sure that packets allocated from the reserve will only service
critical sockets we will not lose the memory and can guarantee progress.
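The receive-side rule can be written down as a tiny decision function. This is a user-space sketch with made-up types (struct sketch_skb, struct sketch_sock) and a made-up name (rx_verdict()); it models only the check added to the TCP and UDP receive paths, and it simplifies the no-socket case to a plain drop.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct sk_buff and struct sock. */
struct sketch_skb {
	bool emergency;		/* allocated from the emergency reserve */
};

struct sketch_sock {
	bool vmio;		/* SOCK_VMIO is set on this socket */
};

enum rx_verdict { RX_DELIVER, RX_DROP };

/*
 * A packet backed by reserve memory may only be queued on a critical
 * socket; otherwise it is dropped right away so that its pages go
 * back to the reserve. Ordinary packets are unaffected by the rule.
 */
enum rx_verdict rx_verdict(const struct sketch_skb *skb,
			   const struct sketch_sock *sk)
{
	if (sk == NULL)
		return RX_DROP;		/* simplification: no socket found */
	if (skb->emergency && !sk->vmio)
		return RX_DROP;
	return RX_DELIVER;
}
```

Protocols without a socket to test against (icmp, igmp) simply drop every emergency packet.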

Since memory is tight and the reserve modest, we do not want to lose memory to
fragmentation effects. Hence a very simple allocator is used to guarantee that
the memory used for each packet is returned to the page allocator.
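The allocator idea can be sketched in user space as follows. This is an illustration under assumptions, not the kernel code: aligned_alloc() stands in for the page allocator, C11 atomics stand in for atomic_t, the page size is assumed to be 4 KiB, and all sketch_* names are invented here. Every allocation takes one whole page, the counter caps the number of pages in flight, and anything larger than a page is refused.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE	4096	/* assumption: 4 KiB pages */
#define SKETCH_RESERVE_PAGES	128	/* same value as RX_RESERVE_PAGES */

static atomic_int sketch_pages_used;

/* Increment *v unless it equals limit; returns true if incremented. */
static bool sketch_inc_unless(atomic_int *v, int limit)
{
	int cur = atomic_load(v);

	while (cur != limit) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &cur, cur + 1))
			return true;
	}
	return false;
}

/* One whole page per request; no higher-order allocations. */
void *sketch_rx_alloc(size_t size)
{
	void *page;

	if (size > SKETCH_PAGE_SIZE)
		return NULL;
	if (!sketch_inc_unless(&sketch_pages_used, SKETCH_RESERVE_PAGES))
		return NULL;	/* reserve exhausted */

	page = aligned_alloc(SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE);
	if (!page)
		atomic_fetch_sub(&sketch_pages_used, 1);
	return page;
}

/* The whole page goes straight back, so nothing is lost to fragmentation. */
void sketch_rx_free(void *page)
{
	free(page);
	atomic_fetch_sub(&sketch_pages_used, 1);
}

int sketch_rx_pages_used(void)
{
	return atomic_load(&sketch_pages_used);
}
```

The price of this simplicity is internal waste: a 256-byte request still pins a full page until it is freed.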

Converted protocols:
IPv4 & IPv6:
 - icmp
 - udp
 - tcp
IPv4:
 - igmp

Caveat: currently there is no support for higher-order allocations, so
basically every jumbo frame will fail in these situations. To mitigate
this, one could add a tiny pool of pre-allocated 2nd-order pages to the
emergency allocator.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Daniel Phillips <phillips@google.com>
CC: Mike Christie <michaelc@cs.wisc.edu>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
CC: Pavel Machek <pavel@ucw.cz>
---
 include/linux/gfp.h    |    3 +
 include/linux/mmzone.h |    1 
 include/linux/skbuff.h |   13 +++++--
 include/net/sock.h     |   39 +++++++++++++++++++++
 mm/page_alloc.c        |   35 +++++++++++++++++--
 net/core/skbuff.c      |   85 +++++++++++++++++++++++++++++++++++++----------
 net/core/sock.c        |   88 +++++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/icmp.c        |    3 +
 net/ipv4/igmp.c        |    3 +
 net/ipv4/tcp_ipv4.c    |    3 +
 net/ipv4/udp.c         |    8 +++-
 net/ipv6/icmp.c        |    3 +
 net/ipv6/tcp_ipv6.c    |    3 +
 net/ipv6/udp.c         |    3 +
 14 files changed, 263 insertions(+), 27 deletions(-)

Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -46,6 +46,7 @@ struct vm_area_struct;
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_EMERGENCY  ((__force gfp_t)0x40000u) /* Use emergency reserves */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +55,7 @@ struct vm_area_struct;
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-			__GFP_NOMEMALLOC|__GFP_HARDWALL)
+			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_EMERGENCY)
 
 /* This equals 0, but use constants in case they ever change */
 #define GFP_NOWAIT	(GFP_ATOMIC & ~__GFP_HIGH)
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h
+++ linux-2.6/include/linux/mmzone.h
@@ -421,6 +421,7 @@ int percpu_pagelist_fraction_sysctl_hand
 					void __user *, size_t *, loff_t *);
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
+void adjust_memalloc_reserve(int pages);
 
 #include <linux/topology.h>
 /* Returns the number of the current Node. */
Index: linux-2.6/include/linux/skbuff.h
===================================================================
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -282,7 +282,8 @@ struct sk_buff {
 				nfctinfo:3;
 	__u8			pkt_type:3,
 				fclone:2,
-				ipvs_property:1;
+				ipvs_property:1,
+				emergency:1;
 	__be16			protocol;
 
 	void			(*destructor)(struct sk_buff *skb);
@@ -327,10 +328,13 @@ struct sk_buff {
 
 #include <asm/system.h>
 
+#define SKB_ALLOC_FCLONE	0x01
+#define SKB_ALLOC_RX		0x02
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-				   gfp_t priority, int fclone);
+				   gfp_t priority, int flags);
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
@@ -340,7 +344,7 @@ static inline struct sk_buff *alloc_skb(
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1);
+	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE);
 }
 
 extern struct sk_buff *alloc_skb_from_cache(kmem_cache_t *cp,
@@ -1101,7 +1105,8 @@ static inline void __skb_queue_purge(str
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
 					      gfp_t gfp_mask)
 {
-	struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+	struct sk_buff *skb =
+		__alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX);
 	if (likely(skb))
 		skb_reserve(skb, NET_SKB_PAD);
 	return skb;
Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -391,6 +391,7 @@ enum sock_flags {
 	SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+	SOCK_VMIO, /* the VM depends on us - make sure we're serviced */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -413,6 +414,44 @@ static inline int sock_flag(struct sock 
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int sk_is_vmio(struct sock *sk)
+{
+	return sock_flag(sk, SOCK_VMIO);
+}
+
+#define MAX_PAGES_PER_PACKET 2
+#define MAX_FRAGMENTS ((65536 + 1500 - 1) / 1500)
+/*
+ * Set an upper limit on the number of pages used for RX skbs.
+ */
+#define RX_RESERVE_PAGES	(64 * MAX_PAGES_PER_PACKET)
+
+/*
+ * Guesstimate the per request queue TX upper bound.
+ */
+#define TX_RESERVE_PAGES \
+	(4 * MAX_FRAGMENTS * MAX_PAGES_PER_PACKET)
+
+extern atomic_t vmio_socks;
+extern atomic_t emergency_rx_pages_used;
+
+static inline int sk_vmio_socks(void)
+{
+	return atomic_read(&vmio_socks);
+}
+
+extern void * sk_emergency_rx_alloc(size_t size, gfp_t gfp_mask);
+
+static inline void sk_emergency_rx_free(void *page, size_t size)
+{
+	free_page((unsigned long)page);
+	atomic_dec(&emergency_rx_pages_used);
+}
+
+extern void sk_adjust_memalloc(int socks, int tx_reserve_pages);
+extern int sk_set_vmio(struct sock *sk);
+extern int sk_clear_vmio(struct sock *sk);
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
 	sk->sk_ack_backlog--;
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -83,6 +83,7 @@ EXPORT_SYMBOL(zone_table);
 static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
 static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
+int var_free_kbytes;
 
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
@@ -971,8 +972,8 @@ restart:
 
 	/* This allocation should allow future memory freeing. */
 
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
+	if ((((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+			&& !in_interrupt()) || (gfp_mask & __GFP_EMERGENCY)) {
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 nofail_alloc:
 			/* go through the zonelist yet again, ignoring mins */
@@ -2197,7 +2198,8 @@ static void setup_per_zone_lowmem_reserv
  */
 static void __setup_per_zone_pages_min(void)
 {
-	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_min = (min_free_kbytes + var_free_kbytes)
+		>> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
 	struct zone *zone;
 	unsigned long flags;
@@ -2258,6 +2260,33 @@ void setup_per_zone_pages_min(void)
 	spin_unlock_irqrestore(&min_free_lock, flags);
 }
 
+/**
+ *	adjust_memalloc_reserve - adjust the memalloc reserve
+ *	@pages: number of pages to add
+ *
+ *	It adds a number of pages to the memalloc reserve; if
+ *	the number was positive it kicks kswapd into action to
+ *	satisfy the higher watermarks.
+ *
+ *	NOTE: there is only a single caller, hence no locking.
+ */
+void adjust_memalloc_reserve(int pages)
+{
+	var_free_kbytes += pages << (PAGE_SHIFT - 10);
+	BUG_ON(var_free_kbytes < 0);
+	setup_per_zone_pages_min();
+	if (pages > 0) {
+		struct zone *zone;
+		for_each_zone(zone)
+			wakeup_kswapd(zone, 0);
+	}
+	if (pages)
+		printk(KERN_DEBUG "Emergency reserve: %d\n",
+				var_free_kbytes);
+}
+
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
 /*
  * Initialise min_free_kbytes.
  *
Index: linux-2.6/net/core/skbuff.c
===================================================================
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -139,28 +139,30 @@ EXPORT_SYMBOL(skb_truesize_bug);
  *	Buffers may only be allocated from interrupts using a @gfp_mask of
  *	%GFP_ATOMIC.
  */
-struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-			    int fclone)
+struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int flags)
 {
 	kmem_cache_t *cache;
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	u8 *data;
 
-	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+	size = SKB_DATA_ALIGN(size);
+	cache = (flags & SKB_ALLOC_FCLONE)
+		? skbuff_fclone_cache : skbuff_head_cache;
 
 	/* Get the HEAD */
 	skb = kmem_cache_alloc(cache, gfp_mask & ~__GFP_DMA);
 	if (!skb)
-		goto out;
+		goto noskb;
 
 	/* Get the DATA. Size must match skb_add_mtu(). */
-	size = SKB_DATA_ALIGN(size);
 	data = ____kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
 	if (!data)
 		goto nodata;
 
+allocated:
 	memset(skb, 0, offsetof(struct sk_buff, truesize));
+	skb->emergency = !cache;
 	skb->truesize = size + sizeof(struct sk_buff);
 	atomic_set(&skb->users, 1);
 	skb->head = data;
@@ -177,7 +179,7 @@ struct sk_buff *__alloc_skb(unsigned int
 	shinfo->ip6_frag_id = 0;
 	shinfo->frag_list = NULL;
 
-	if (fclone) {
+	if (flags & SKB_ALLOC_FCLONE) {
 		struct sk_buff *child = skb + 1;
 		atomic_t *fclone_ref = (atomic_t *) (child + 1);
 
@@ -185,13 +187,34 @@ struct sk_buff *__alloc_skb(unsigned int
 		atomic_set(fclone_ref, 1);
 
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
+		child->emergency = skb->emergency;
 	}
 out:
 	return skb;
+
 nodata:
 	kmem_cache_free(cache, skb);
 	skb = NULL;
-	goto out;
+noskb:
+	/* Attempt emergency allocation when RX skb. */
+	if (!(flags & SKB_ALLOC_RX) || !sk_vmio_socks())
+		goto out;
+
+	skb = sk_emergency_rx_alloc(kmem_cache_size(cache),
+			gfp_mask | __GFP_EMERGENCY);
+	if (!skb)
+		goto out;
+
+	data = sk_emergency_rx_alloc(size + sizeof(struct skb_shared_info),
+			gfp_mask | __GFP_EMERGENCY);
+	if (!data) {
+		sk_emergency_rx_free(skb, kmem_cache_size(cache));
+		skb = NULL;
+		goto out;
+	}
+
+	cache = NULL;
+	goto allocated;
 }
 
 /**
@@ -267,7 +290,7 @@ struct sk_buff *__netdev_alloc_skb(struc
 {
 	struct sk_buff *skb;
 
-	skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX);
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
@@ -315,7 +338,12 @@ static void skb_release_data(struct sk_b
 		if (skb_shinfo(skb)->frag_list)
 			skb_drop_fraglist(skb);
 
-		kfree(skb->head);
+		if (skb->emergency)
+			sk_emergency_rx_free(skb->head,
+					(skb->end - skb->head) +
+					sizeof(struct skb_shared_info));
+		else
+			kfree(skb->head);
 	}
 }
 
@@ -324,24 +352,26 @@ static void skb_release_data(struct sk_b
  */
 void kfree_skbmem(struct sk_buff *skb)
 {
-	struct sk_buff *other;
+	struct kmem_cache *cache = skbuff_head_cache;
+	struct sk_buff *free = skb;
 	atomic_t *fclone_ref;
 
 	skb_release_data(skb);
 	switch (skb->fclone) {
 	case SKB_FCLONE_UNAVAILABLE:
-		kmem_cache_free(skbuff_head_cache, skb);
-		break;
+		goto free;
 
 	case SKB_FCLONE_ORIG:
+		cache = skbuff_fclone_cache;
 		fclone_ref = (atomic_t *) (skb + 2);
 		if (atomic_dec_and_test(fclone_ref))
-			kmem_cache_free(skbuff_fclone_cache, skb);
-		break;
+			goto free;
+		return;
 
 	case SKB_FCLONE_CLONE:
+		cache = skbuff_fclone_cache;
 		fclone_ref = (atomic_t *) (skb + 1);
-		other = skb - 1;
+		free = skb - 1;
 
 		/* The clone portion is available for
 		 * fast-cloning again.
@@ -349,9 +379,15 @@ void kfree_skbmem(struct sk_buff *skb)
 		skb->fclone = SKB_FCLONE_UNAVAILABLE;
 
 		if (atomic_dec_and_test(fclone_ref))
-			kmem_cache_free(skbuff_fclone_cache, other);
-		break;
+			goto free;
+		return;
 	};
+
+free:
+	if (skb->emergency)
+		sk_emergency_rx_free(free, kmem_cache_size(cache));
+	else
+		kmem_cache_free(cache, free);
 }
 
 /**
@@ -435,6 +471,12 @@ struct sk_buff *skb_clone(struct sk_buff
 		atomic_t *fclone_ref = (atomic_t *) (n + 1);
 		n->fclone = SKB_FCLONE_CLONE;
 		atomic_inc(fclone_ref);
+	} else if (skb->emergency) {
+		n = sk_emergency_rx_alloc(kmem_cache_size(skbuff_head_cache),
+				gfp_mask | __GFP_EMERGENCY);
+		if (!n)
+			return NULL;
+		n->fclone = SKB_FCLONE_UNAVAILABLE;
 	} else {
 		n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
 		if (!n)
@@ -470,6 +512,7 @@ struct sk_buff *skb_clone(struct sk_buff
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
 	C(ipvs_property);
 #endif
+	C(emergency);
 	C(protocol);
 	n->destructor = NULL;
 #ifdef CONFIG_NETFILTER
@@ -690,7 +733,13 @@ int pskb_expand_head(struct sk_buff *skb
 
 	size = SKB_DATA_ALIGN(size);
 
-	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+	if (skb->emergency) {
+		data = sk_emergency_rx_alloc(size + sizeof(struct skb_shared_info),
+				gfp_mask | __GFP_EMERGENCY);
+		if (!data)
+			goto nodata;
+	} else
+		data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
 	if (!data)
 		goto nodata;
 
Index: linux-2.6/net/ipv4/icmp.c
===================================================================
--- linux-2.6.orig/net/ipv4/icmp.c
+++ linux-2.6/net/ipv4/icmp.c
@@ -938,6 +938,9 @@ int icmp_rcv(struct sk_buff *skb)
 			goto error;
 	}
 
+	if (unlikely(skb->emergency))
+		goto drop;
+
 	if (!pskb_pull(skb, sizeof(struct icmphdr)))
 		goto error;
 
Index: linux-2.6/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp_ipv4.c
+++ linux-2.6/net/ipv4/tcp_ipv4.c
@@ -1093,6 +1093,9 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	if (!sk)
 		goto no_tcp_socket;
 
+	if (unlikely(skb->emergency && !sk_is_vmio(sk)))
+		goto discard_and_relse;
+
 process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;
Index: linux-2.6/net/ipv4/udp.c
===================================================================
--- linux-2.6.orig/net/ipv4/udp.c
+++ linux-2.6/net/ipv4/udp.c
@@ -1136,7 +1136,12 @@ int udp_rcv(struct sk_buff *skb)
 	sk = udp_v4_lookup(saddr, uh->source, daddr, uh->dest, skb->dev->ifindex);
 
 	if (sk != NULL) {
-		int ret = udp_queue_rcv_skb(sk, skb);
+		int ret;
+
+		if (unlikely(skb->emergency && !sk_is_vmio(sk)))
+			goto drop_noncritical;
+
+		ret = udp_queue_rcv_skb(sk, skb);
 		sock_put(sk);
 
 		/* a return value > 0 means to resubmit the input, but
@@ -1147,6 +1152,7 @@ int udp_rcv(struct sk_buff *skb)
 		return 0;
 	}
 
+drop_noncritical:
 	if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
 		goto drop;
 	nf_reset(skb);
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -195,6 +195,93 @@ __u32 sysctl_rmem_default = SK_RMEM_MAX;
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max = sizeof(unsigned long)*(2*UIO_MAXIOV + 512);
 
+static DEFINE_SPINLOCK(memalloc_lock);
+
+atomic_t vmio_socks;
+atomic_t emergency_rx_pages_used;
+
+/**
+ *	sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ *	@socks: number of new %SOCK_VMIO sockets
+ *	@tx_reserve_pages: number of pages to (un)reserve for TX
+ *
+ *	This function adjusts the memalloc reserve based on system demand.
+ *	The RX reserve is a limit, and only added once, not for each socket.
+ *
+ *	NOTE:
+ *	   @tx_reserve_pages is an upper-bound of memory used for TX hence
+ *	   we need not account the pages like we do for %RX_RESERVE_PAGES.
+ */
+void sk_adjust_memalloc(int socks, int tx_reserve_pages)
+{
+	unsigned long flags;
+	int reserve = tx_reserve_pages;
+	int nr_socks;
+
+	spin_lock_irqsave(&memalloc_lock, flags);
+	if (socks) {
+		nr_socks = atomic_add_return(socks, &vmio_socks);
+		BUG_ON(nr_socks < 0);
+
+		if (nr_socks - socks == 0)
+			reserve += RX_RESERVE_PAGES;
+		if (nr_socks == 0)
+			reserve -= RX_RESERVE_PAGES;
+	}
+	adjust_memalloc_reserve(reserve);
+	spin_unlock_irqrestore(&memalloc_lock, flags);
+}
+EXPORT_SYMBOL_GPL(sk_adjust_memalloc);
+
+/**
+ *	sk_set_vmio - sets %SOCK_VMIO
+ *	@sk: socket to set it on
+ *
+ *	Set %SOCK_VMIO on a socket and increase the memalloc reserve
+ *	accordingly.
+ */
+int sk_set_vmio(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_VMIO);
+	if (!set) {
+		sk_adjust_memalloc(1, 0);
+		sock_set_flag(sk, SOCK_VMIO);
+		sk->sk_allocation |= __GFP_EMERGENCY;
+	}
+	return !set;
+}
+EXPORT_SYMBOL_GPL(sk_set_vmio);
+
+int sk_clear_vmio(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_VMIO);
+	if (set) {
+		sk_adjust_memalloc(-1, 0);
+		sock_reset_flag(sk, SOCK_VMIO);
+		sk->sk_allocation &= ~__GFP_EMERGENCY;
+	}
+	return set;
+}
+EXPORT_SYMBOL_GPL(sk_clear_vmio);
+
+void * sk_emergency_rx_alloc(size_t size, gfp_t gfp_mask)
+{
+	void * page = NULL;
+
+	if (size > PAGE_SIZE)
+		return page;
+
+	if (atomic_add_unless(&emergency_rx_pages_used, 1, RX_RESERVE_PAGES)) {
+		page = (void *)__get_free_page(gfp_mask);
+		if (!page) {
+			WARN_ON(1);
+			atomic_dec(&emergency_rx_pages_used);
+		}
+	}
+
+	return page;
+}
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -881,6 +968,7 @@ void sk_free(struct sock *sk)
 	struct sk_filter *filter;
 	struct module *owner = sk->sk_prot_creator->owner;
 
+	sk_clear_vmio(sk);
 	if (sk->sk_destruct)
 		sk->sk_destruct(sk);
 
Index: linux-2.6/net/ipv6/icmp.c
===================================================================
--- linux-2.6.orig/net/ipv6/icmp.c
+++ linux-2.6/net/ipv6/icmp.c
@@ -599,6 +599,9 @@ static int icmpv6_rcv(struct sk_buff **p
 
 	ICMP6_INC_STATS_BH(idev, ICMP6_MIB_INMSGS);
 
+	if (unlikely(skb->emergency))
+		goto discard_it;
+
 	saddr = &skb->nh.ipv6h->saddr;
 	daddr = &skb->nh.ipv6h->daddr;
 
Index: linux-2.6/net/ipv6/tcp_ipv6.c
===================================================================
--- linux-2.6.orig/net/ipv6/tcp_ipv6.c
+++ linux-2.6/net/ipv6/tcp_ipv6.c
@@ -1216,6 +1216,9 @@ static int tcp_v6_rcv(struct sk_buff **p
 	if (!sk)
 		goto no_tcp_socket;
 
+	if (unlikely(skb->emergency && !sk_is_vmio(sk)))
+		goto discard_and_relse;
+
 process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;
Index: linux-2.6/net/ipv6/udp.c
===================================================================
--- linux-2.6.orig/net/ipv6/udp.c
+++ linux-2.6/net/ipv6/udp.c
@@ -499,6 +499,9 @@ static int udpv6_rcv(struct sk_buff **ps
 	sk = udp_v6_lookup(saddr, uh->source, daddr, uh->dest, dev->ifindex);
 
 	if (sk == NULL) {
+		if (unlikely(skb->emergency))
+			goto discard;
+
 		if (!xfrm6_policy_check(NULL, XFRM_POLICY_IN, skb))
 			goto discard;
 
Index: linux-2.6/net/ipv4/igmp.c
===================================================================
--- linux-2.6.orig/net/ipv4/igmp.c
+++ linux-2.6/net/ipv4/igmp.c
@@ -927,6 +927,9 @@ int igmp_rcv(struct sk_buff *skb)
 		return 0;
 	}
 
+	if (unlikely(skb->emergency))
+		goto drop;
+
 	if (!pskb_may_pull(skb, sizeof(struct igmphdr)))
 		goto drop;
 

--


* [PATCH 02/21] net: vm deadlock avoidance core
@ 2006-09-06 13:16   ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Mike Christie, Trond Myklebust, Pavel Machek

[-- Attachment #1: vm_deadlock_core.patch --]
[-- Type: text/plain, Size: 21132 bytes --]

In order to provide robust networked block devices there must be a guarantee
of progress. That is, the block device must never stall because of (physical)
OOM, because the device itself might be needed to get out of it (reclaim).

This means that the device queue must always be unpluggable; this in turn means
that it must always find enough memory to build/send packets over the network
_and_ receive (level 7) ACKs for those packets.

The network stack has a huge capacity for buffering packets while waiting for
user-space to read them. There is a practical limit imposed to avoid DoS
scenarios. These two things make for a deadlock: what if the receive limit is
reached and all packets are buffered in non-critical sockets (those not serving
the network block device, which is waiting for an ACK to free a page)?

Memory pressure will add to that: what if there is simply no memory left to
receive packets in?

This patch provides a service to register sockets as critical; SOCK_VMIO
is a promise the socket will never block on receive. Along with a memory
reserve that will service a limited number of packets, this can guarantee a
limited service to these critical sockets.

When we make sure that packets allocated from the reserve will only service
critical sockets we will not lose the memory and can guarantee progress.

Since memory is tight and the reserve modest, we do not want to lose memory to
fragmentation effects. Hence a very simple allocator is used to guarantee that
the memory used for each packet is returned to the page allocator.

Converted protocols:
IPv4 & IPv6:
 - icmp
 - udp
 - tcp
IPv4:
 - igmp

Caveat: currently there is no support for higher-order allocations, so
basically every jumbo frame will fail in these situations. To mitigate
this, one could add a tiny pool of pre-allocated 2nd-order pages to the
emergency allocator.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Daniel Phillips <phillips@google.com>
CC: Mike Christie <michaelc@cs.wisc.edu>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
CC: Pavel Machek <pavel@ucw.cz>
---
 include/linux/gfp.h    |    3 +
 include/linux/mmzone.h |    1 
 include/linux/skbuff.h |   13 +++++--
 include/net/sock.h     |   39 +++++++++++++++++++++
 mm/page_alloc.c        |   35 +++++++++++++++++--
 net/core/skbuff.c      |   85 +++++++++++++++++++++++++++++++++++++----------
 net/core/sock.c        |   88 +++++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/icmp.c        |    3 +
 net/ipv4/igmp.c        |    3 +
 net/ipv4/tcp_ipv4.c    |    3 +
 net/ipv4/udp.c         |    8 +++-
 net/ipv6/icmp.c        |    3 +
 net/ipv6/tcp_ipv6.c    |    3 +
 net/ipv6/udp.c         |    3 +
 14 files changed, 263 insertions(+), 27 deletions(-)

Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -46,6 +46,7 @@ struct vm_area_struct;
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_EMERGENCY  ((__force gfp_t)0x40000u) /* Use emergency reserves */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +55,7 @@ struct vm_area_struct;
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-			__GFP_NOMEMALLOC|__GFP_HARDWALL)
+			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_EMERGENCY)
 
 /* This equals 0, but use constants in case they ever change */
 #define GFP_NOWAIT	(GFP_ATOMIC & ~__GFP_HIGH)
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h
+++ linux-2.6/include/linux/mmzone.h
@@ -421,6 +421,7 @@ int percpu_pagelist_fraction_sysctl_hand
 					void __user *, size_t *, loff_t *);
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
+void adjust_memalloc_reserve(int pages);
 
 #include <linux/topology.h>
 /* Returns the number of the current Node. */
Index: linux-2.6/include/linux/skbuff.h
===================================================================
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -282,7 +282,8 @@ struct sk_buff {
 				nfctinfo:3;
 	__u8			pkt_type:3,
 				fclone:2,
-				ipvs_property:1;
+				ipvs_property:1,
+				emergency:1;
 	__be16			protocol;
 
 	void			(*destructor)(struct sk_buff *skb);
@@ -327,10 +328,13 @@ struct sk_buff {
 
 #include <asm/system.h>
 
+#define SKB_ALLOC_FCLONE	0x01
+#define SKB_ALLOC_RX		0x02
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-				   gfp_t priority, int fclone);
+				   gfp_t priority, int flags);
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
@@ -340,7 +344,7 @@ static inline struct sk_buff *alloc_skb(
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1);
+	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE);
 }
 
 extern struct sk_buff *alloc_skb_from_cache(kmem_cache_t *cp,
@@ -1101,7 +1105,8 @@ static inline void __skb_queue_purge(str
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
 					      gfp_t gfp_mask)
 {
-	struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+	struct sk_buff *skb =
+		__alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX);
 	if (likely(skb))
 		skb_reserve(skb, NET_SKB_PAD);
 	return skb;
Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -391,6 +391,7 @@ enum sock_flags {
 	SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+	SOCK_VMIO, /* the VM depends on us - make sure we're serviced */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -413,6 +414,44 @@ static inline int sock_flag(struct sock 
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int sk_is_vmio(struct sock *sk)
+{
+	return sock_flag(sk, SOCK_VMIO);
+}
+
+#define MAX_PAGES_PER_PACKET 2
+#define MAX_FRAGMENTS ((65536 + 1500 - 1) / 1500)
+/*
+ * Set an upper limit on the number of pages used for RX skbs.
+ */
+#define RX_RESERVE_PAGES	(64 * MAX_PAGES_PER_PACKET)
+
+/*
+ * Guestimate the per request queue TX upper bound.
+ */
+#define TX_RESERVE_PAGES \
+	(4 * MAX_FRAGMENTS * MAX_PAGES_PER_PACKET)
+
+extern atomic_t vmio_socks;
+extern atomic_t emergency_rx_pages_used;
+
+static inline int sk_vmio_socks(void)
+{
+	return atomic_read(&vmio_socks);
+}
+
+extern void * sk_emergency_rx_alloc(size_t size, gfp_t gfp_mask);
+
+static inline void sk_emergency_rx_free(void *page, size_t size)
+{
+	free_page((unsigned long)page);
+	atomic_dec(&emergency_rx_pages_used);
+}
+
+extern void sk_adjust_memalloc(int socks, int tx_reserve_pages);
+extern int sk_set_vmio(struct sock *sk);
+extern int sk_clear_vmio(struct sock *sk);
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
 	sk->sk_ack_backlog--;
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -83,6 +83,7 @@ EXPORT_SYMBOL(zone_table);
 static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
 static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
+int var_free_kbytes;
 
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
@@ -971,8 +972,8 @@ restart:
 
 	/* This allocation should allow future memory freeing. */
 
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
+	if ((((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+			&& !in_interrupt()) || (gfp_mask & __GFP_EMERGENCY)) {
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 nofail_alloc:
 			/* go through the zonelist yet again, ignoring mins */
@@ -2197,7 +2198,8 @@ static void setup_per_zone_lowmem_reserv
  */
 static void __setup_per_zone_pages_min(void)
 {
-	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_min = (min_free_kbytes + var_free_kbytes)
+		>> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
 	struct zone *zone;
 	unsigned long flags;
@@ -2258,6 +2260,33 @@ void setup_per_zone_pages_min(void)
 	spin_unlock_irqrestore(&min_free_lock, flags);
 }
 
+/**
+ *	adjust_memalloc_reserve - adjust the memalloc reserve
+ *	@pages: number of pages to add
+ *
+ *	It adds a number of pages to the memalloc reserve; if
+ *	the number was positive it kicks kswapd into action to
+ *	satisfy the higher watermarks.
+ *
+ *	NOTE: there is only a single caller, hence no locking.
+ */
+void adjust_memalloc_reserve(int pages)
+{
+	var_free_kbytes += pages << (PAGE_SHIFT - 10);
+	BUG_ON(var_free_kbytes < 0);
+	setup_per_zone_pages_min();
+	if (pages > 0) {
+		struct zone *zone;
+		for_each_zone(zone)
+			wakeup_kswapd(zone, 0);
+	}
+	if (pages)
+		printk(KERN_DEBUG "Emergency reserve: %d\n",
+				var_free_kbytes);
+}
+
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
 /*
  * Initialise min_free_kbytes.
  *
Index: linux-2.6/net/core/skbuff.c
===================================================================
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -139,28 +139,30 @@ EXPORT_SYMBOL(skb_truesize_bug);
  *	Buffers may only be allocated from interrupts using a @gfp_mask of
  *	%GFP_ATOMIC.
  */
-struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-			    int fclone)
+struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int flags)
 {
 	kmem_cache_t *cache;
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	u8 *data;
 
-	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+	size = SKB_DATA_ALIGN(size);
+	cache = (flags & SKB_ALLOC_FCLONE)
+		? skbuff_fclone_cache : skbuff_head_cache;
 
 	/* Get the HEAD */
 	skb = kmem_cache_alloc(cache, gfp_mask & ~__GFP_DMA);
 	if (!skb)
-		goto out;
+		goto noskb;
 
 	/* Get the DATA. Size must match skb_add_mtu(). */
-	size = SKB_DATA_ALIGN(size);
 	data = ____kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
 	if (!data)
 		goto nodata;
 
+allocated:
 	memset(skb, 0, offsetof(struct sk_buff, truesize));
+	skb->emergency = !cache;
 	skb->truesize = size + sizeof(struct sk_buff);
 	atomic_set(&skb->users, 1);
 	skb->head = data;
@@ -177,7 +179,7 @@ struct sk_buff *__alloc_skb(unsigned int
 	shinfo->ip6_frag_id = 0;
 	shinfo->frag_list = NULL;
 
-	if (fclone) {
+	if (flags & SKB_ALLOC_FCLONE) {
 		struct sk_buff *child = skb + 1;
 		atomic_t *fclone_ref = (atomic_t *) (child + 1);
 
@@ -185,13 +187,34 @@ struct sk_buff *__alloc_skb(unsigned int
 		atomic_set(fclone_ref, 1);
 
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
+		child->emergency = skb->emergency;
 	}
 out:
 	return skb;
+
 nodata:
 	kmem_cache_free(cache, skb);
 	skb = NULL;
-	goto out;
+noskb:
+	/* Attempt emergency allocation when RX skb. */
+	if (!(flags & SKB_ALLOC_RX) || !sk_vmio_socks())
+		goto out;
+
+	skb = sk_emergency_rx_alloc(kmem_cache_size(cache),
+			gfp_mask | __GFP_EMERGENCY);
+	if (!skb)
+		goto out;
+
+	data = sk_emergency_rx_alloc(size + sizeof(struct skb_shared_info),
+			gfp_mask | __GFP_EMERGENCY);
+	if (!data) {
+		sk_emergency_rx_free(skb, kmem_cache_size(cache));
+		skb = NULL;
+		goto out;
+	}
+
+	cache = NULL;
+	goto allocated;
 }
 
 /**
@@ -267,7 +290,7 @@ struct sk_buff *__netdev_alloc_skb(struc
 {
 	struct sk_buff *skb;
 
-	skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX);
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
@@ -315,7 +338,12 @@ static void skb_release_data(struct sk_b
 		if (skb_shinfo(skb)->frag_list)
 			skb_drop_fraglist(skb);
 
-		kfree(skb->head);
+		if (skb->emergency)
+			sk_emergency_rx_free(skb->head,
+					(skb->end - skb->head) +
+					sizeof(struct skb_shared_info));
+		else
+			kfree(skb->head);
 	}
 }
 
@@ -324,24 +352,26 @@ static void skb_release_data(struct sk_b
  */
 void kfree_skbmem(struct sk_buff *skb)
 {
-	struct sk_buff *other;
+	struct kmem_cache *cache = skbuff_head_cache;
+	struct sk_buff *free = skb;
 	atomic_t *fclone_ref;
 
 	skb_release_data(skb);
 	switch (skb->fclone) {
 	case SKB_FCLONE_UNAVAILABLE:
-		kmem_cache_free(skbuff_head_cache, skb);
-		break;
+		goto free;
 
 	case SKB_FCLONE_ORIG:
+		cache = skbuff_fclone_cache;
 		fclone_ref = (atomic_t *) (skb + 2);
 		if (atomic_dec_and_test(fclone_ref))
-			kmem_cache_free(skbuff_fclone_cache, skb);
-		break;
+			goto free;
+		return;
 
 	case SKB_FCLONE_CLONE:
+		cache = skbuff_fclone_cache;
 		fclone_ref = (atomic_t *) (skb + 1);
-		other = skb - 1;
+		free = skb - 1;
 
 		/* The clone portion is available for
 		 * fast-cloning again.
@@ -349,9 +379,15 @@ void kfree_skbmem(struct sk_buff *skb)
 		skb->fclone = SKB_FCLONE_UNAVAILABLE;
 
 		if (atomic_dec_and_test(fclone_ref))
-			kmem_cache_free(skbuff_fclone_cache, other);
-		break;
+			goto free;
+		return;
 	};
+
+free:
+	if (skb->emergency)
+		sk_emergency_rx_free(free, kmem_cache_size(cache));
+	else
+		kmem_cache_free(cache, free);
 }
 
 /**
@@ -435,6 +471,12 @@ struct sk_buff *skb_clone(struct sk_buff
 		atomic_t *fclone_ref = (atomic_t *) (n + 1);
 		n->fclone = SKB_FCLONE_CLONE;
 		atomic_inc(fclone_ref);
+	} else if (skb->emergency) {
+		n = sk_emergency_rx_alloc(kmem_cache_size(skbuff_head_cache),
+				gfp_mask | __GFP_EMERGENCY);
+		if (!n)
+			return NULL;
+		n->fclone = SKB_FCLONE_UNAVAILABLE;
 	} else {
 		n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
 		if (!n)
@@ -470,6 +512,7 @@ struct sk_buff *skb_clone(struct sk_buff
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
 	C(ipvs_property);
 #endif
+	C(emergency);
 	C(protocol);
 	n->destructor = NULL;
 #ifdef CONFIG_NETFILTER
@@ -690,7 +733,13 @@ int pskb_expand_head(struct sk_buff *skb
 
 	size = SKB_DATA_ALIGN(size);
 
-	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+	if (skb->emergency) {
+		data = sk_emergency_rx_alloc(size + sizeof(struct skb_shared_info),
+				gfp_mask | __GFP_EMERGENCY);
+		if (!data)
+			goto nodata;
+	} else
+		data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
 	if (!data)
 		goto nodata;
 
Index: linux-2.6/net/ipv4/icmp.c
===================================================================
--- linux-2.6.orig/net/ipv4/icmp.c
+++ linux-2.6/net/ipv4/icmp.c
@@ -938,6 +938,9 @@ int icmp_rcv(struct sk_buff *skb)
 			goto error;
 	}
 
+	if (unlikely(skb->emergency))
+		goto drop;
+
 	if (!pskb_pull(skb, sizeof(struct icmphdr)))
 		goto error;
 
Index: linux-2.6/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp_ipv4.c
+++ linux-2.6/net/ipv4/tcp_ipv4.c
@@ -1093,6 +1093,9 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	if (!sk)
 		goto no_tcp_socket;
 
+	if (unlikely(skb->emergency && !sk_is_vmio(sk)))
+		goto discard_and_relse;
+
 process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;
Index: linux-2.6/net/ipv4/udp.c
===================================================================
--- linux-2.6.orig/net/ipv4/udp.c
+++ linux-2.6/net/ipv4/udp.c
@@ -1136,7 +1136,12 @@ int udp_rcv(struct sk_buff *skb)
 	sk = udp_v4_lookup(saddr, uh->source, daddr, uh->dest, skb->dev->ifindex);
 
 	if (sk != NULL) {
-		int ret = udp_queue_rcv_skb(sk, skb);
+		int ret;
+
+		if (unlikely(skb->emergency && !sk_is_vmio(sk)))
+			goto drop_noncritical;
+
+		ret = udp_queue_rcv_skb(sk, skb);
 		sock_put(sk);
 
 		/* a return value > 0 means to resubmit the input, but
@@ -1147,6 +1152,7 @@ int udp_rcv(struct sk_buff *skb)
 		return 0;
 	}
 
+drop_noncritical:
 	if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
 		goto drop;
 	nf_reset(skb);
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -195,6 +195,93 @@ __u32 sysctl_rmem_default = SK_RMEM_MAX;
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max = sizeof(unsigned long)*(2*UIO_MAXIOV + 512);
 
+static DEFINE_SPINLOCK(memalloc_lock);
+
+atomic_t vmio_socks;
+atomic_t emergency_rx_pages_used;
+
+/**
+ *	sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ *	@socks: number of new %SOCK_VMIO sockets
+ *	@tx_reserve_pages: number of pages to (un)reserve for TX
+ *
+ *	This function adjusts the memalloc reserve based on system demand.
+ *	The RX reserve is a limit, and only added once, not for each socket.
+ *
+ *	NOTE:
+ *	   @tx_reserve_pages is an upper-bound of memory used for TX hence
+ *	   we need not account the pages like we do for %RX_RESERVE_PAGES.
+ */
+void sk_adjust_memalloc(int socks, int tx_reserve_pages)
+{
+	unsigned long flags;
+	int reserve = tx_reserve_pages;
+	int nr_socks;
+
+	spin_lock_irqsave(&memalloc_lock, flags);
+	if (socks) {
+		nr_socks = atomic_add_return(socks, &vmio_socks);
+		BUG_ON(nr_socks < 0);
+
+		if (nr_socks - socks == 0)
+			reserve += RX_RESERVE_PAGES;
+		if (nr_socks == 0)
+			reserve -= RX_RESERVE_PAGES;
+	}
+	adjust_memalloc_reserve(reserve);
+	spin_unlock_irqrestore(&memalloc_lock, flags);
+}
+EXPORT_SYMBOL_GPL(sk_adjust_memalloc);
+
+/**
+ *	sk_set_vmio - sets %SOCK_VMIO
+ *	@sk: socket to set it on
+ *
+ *	Set %SOCK_VMIO on a socket and increase the memalloc reserve
+ *	accordingly.
+ */
+int sk_set_vmio(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_VMIO);
+	if (!set) {
+		sk_adjust_memalloc(1, 0);
+		sock_set_flag(sk, SOCK_VMIO);
+		sk->sk_allocation |= __GFP_EMERGENCY;
+	}
+	return !set;
+}
+EXPORT_SYMBOL_GPL(sk_set_vmio);
+
+int sk_clear_vmio(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_VMIO);
+	if (set) {
+		sk_adjust_memalloc(-1, 0);
+		sock_reset_flag(sk, SOCK_VMIO);
+		sk->sk_allocation &= ~__GFP_EMERGENCY;
+	}
+	return set;
+}
+EXPORT_SYMBOL_GPL(sk_clear_vmio);
+
+void * sk_emergency_rx_alloc(size_t size, gfp_t gfp_mask)
+{
+	void * page = NULL;
+
+	if (size > PAGE_SIZE)
+		return page;
+
+	if (atomic_add_unless(&emergency_rx_pages_used, 1, RX_RESERVE_PAGES)) {
+		page = (void *)__get_free_page(gfp_mask);
+		if (!page) {
+			WARN_ON(1);
+			atomic_dec(&emergency_rx_pages_used);
+		}
+	}
+
+	return page;
+}
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -881,6 +968,7 @@ void sk_free(struct sock *sk)
 	struct sk_filter *filter;
 	struct module *owner = sk->sk_prot_creator->owner;
 
+	sk_clear_vmio(sk);
 	if (sk->sk_destruct)
 		sk->sk_destruct(sk);
 
Index: linux-2.6/net/ipv6/icmp.c
===================================================================
--- linux-2.6.orig/net/ipv6/icmp.c
+++ linux-2.6/net/ipv6/icmp.c
@@ -599,6 +599,9 @@ static int icmpv6_rcv(struct sk_buff **p
 
 	ICMP6_INC_STATS_BH(idev, ICMP6_MIB_INMSGS);
 
+	if (unlikely(skb->emergency))
+		goto discard_it;
+
 	saddr = &skb->nh.ipv6h->saddr;
 	daddr = &skb->nh.ipv6h->daddr;
 
Index: linux-2.6/net/ipv6/tcp_ipv6.c
===================================================================
--- linux-2.6.orig/net/ipv6/tcp_ipv6.c
+++ linux-2.6/net/ipv6/tcp_ipv6.c
@@ -1216,6 +1216,9 @@ static int tcp_v6_rcv(struct sk_buff **p
 	if (!sk)
 		goto no_tcp_socket;
 
+	if (unlikely(skb->emergency && !sk_is_vmio(sk)))
+		goto discard_and_relse;
+
 process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;
Index: linux-2.6/net/ipv6/udp.c
===================================================================
--- linux-2.6.orig/net/ipv6/udp.c
+++ linux-2.6/net/ipv6/udp.c
@@ -499,6 +499,9 @@ static int udpv6_rcv(struct sk_buff **ps
 	sk = udp_v6_lookup(saddr, uh->source, daddr, uh->dest, dev->ifindex);
 
 	if (sk == NULL) {
+		if (unlikely(skb->emergency))
+			goto discard;
+
 		if (!xfrm6_policy_check(NULL, XFRM_POLICY_IN, skb))
 			goto discard;
 
Index: linux-2.6/net/ipv4/igmp.c
===================================================================
--- linux-2.6.orig/net/ipv4/igmp.c
+++ linux-2.6/net/ipv4/igmp.c
@@ -927,6 +927,9 @@ int igmp_rcv(struct sk_buff *skb)
 		return 0;
 	}
 
+	if (unlikely(skb->emergency))
+		goto drop;
+
 	if (!pskb_may_pull(skb, sizeof(struct igmphdr)))
 		goto drop;
 


* [PATCH 03/21] mm: add support for non block device backed swap files
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Trond Myklebust

[-- Attachment #1: swapfile.patch --]
[-- Type: text/plain, Size: 7817 bytes --]

A new address_space_operations method is added:
  int swapfile(struct address_space *, int)

If this method is present during sys_swapon() and returns no error, the
swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops.

The swapfile method is used to tell the address_space that the VM relies on
it, so that the address_space can take adequate measures (such as reserving
memory for mempools).
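
The handshake can be sketched in userspace roughly as follows. This is only a
model of the control flow: the structures are cut down to the fields needed
here, the `reserved` field and demo_swapfile() are hypothetical stand-ins for
whatever a filesystem would really do, and the return value 1 for "no hook" is
an invention of this example.

```c
#include <stddef.h>

#define SWP_USED (1 << 0)
#define SWP_FILE (1 << 2)

struct address_space;

struct address_space_operations {
	int (*swapfile)(struct address_space *mapping, int enable);
};

struct address_space {
	const struct address_space_operations *a_ops;
	int reserved;	/* hypothetical: does the backing store hold a reserve? */
};

struct swap_info {
	unsigned int flags;
	struct address_space *mapping;
};

/* Hypothetical filesystem hook: take or drop a reserve on request. */
static int demo_swapfile(struct address_space *mapping, int enable)
{
	mapping->reserved = enable;
	return 0;
}

/* swapon side: 0 = file-backed mode enabled, 1 = no hook (the caller
 * would fall back to block extents), anything else = error from the hook. */
static int swap_setup(struct swap_info *sis)
{
	const struct address_space_operations *a_ops = sis->mapping->a_ops;
	int ret;

	if (!a_ops->swapfile)
		return 1;
	ret = a_ops->swapfile(sis->mapping, 1);
	if (!ret)
		sis->flags |= SWP_FILE;
	return ret;
}

/* swapoff side: tell the address_space that the VM no longer relies on it. */
static void swap_teardown(struct swap_info *sis)
{
	if (sis->flags & SWP_FILE) {
		sis->flags &= ~SWP_FILE;
		sis->mapping->a_ops->swapfile(sis->mapping, 0);
	}
}
```

The flag is what later I/O paths test: once SWP_FILE is set, reads and writes
are forwarded to the file's own a_ops instead of being turned into bios.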

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 fs/buffer.c          |    2 -
 include/linux/fs.h   |    1 
 include/linux/swap.h |    4 +++
 init/Kconfig         |    5 ++++
 mm/page_io.c         |   60 +++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c      |    6 +++++
 mm/swapfile.c        |   27 ++++++++++++++++++++++
 7 files changed, 103 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -115,6 +115,7 @@ enum {
 	SWP_USED	= (1 << 0),	/* is slot in swap_info[] used? */
 	SWP_WRITEOK	= (1 << 1),	/* ok to write to this swap?	*/
 	SWP_ACTIVE	= (SWP_USED | SWP_WRITEOK),
+	SWP_FILE	= (1 << 2),	/* file swap area */
 					/* add others here before... */
 	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
@@ -212,6 +213,9 @@ extern void swap_unplug_io_fn(struct bac
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct file *, struct page *);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
+extern void swap_sync_page(struct page *page);
+extern int swap_set_page_dirty(struct page *page);
+extern int swap_releasepage(struct page *page, gfp_t gfp_mask);
 extern int rw_swap_page_sync(int, swp_entry_t, struct page *);
 
 /* linux/mm/swap_state.c */
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -100,6 +100,11 @@ config SWAP
 	  used to provide more virtual memory than the actual RAM present
 	  in your computer.  If unsure say Y.
 
+config SWAP_FILE
+	bool "Support for paging to/from non block device files"
+	depends on SWAP
+	default n
+
 config SYSVIPC
 	bool "System V IPC"
 	---help---
Index: linux-2.6/mm/page_io.c
===================================================================
--- linux-2.6.orig/mm/page_io.c
+++ linux-2.6/mm/page_io.c
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include <linux/buffer_head.h>
 #include <asm/pgtable.h>
 
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -91,6 +92,14 @@ int swap_writepage(struct page *page, st
 		unlock_page(page);
 		goto out;
 	}
+#ifdef CONFIG_SWAP_FILE
+	{
+		struct swap_info_struct *sis = page_swap_info(page);
+		if (sis->flags & SWP_FILE)
+			return sis->swap_file->f_mapping->
+				a_ops->writepage(page, wbc);
+	}
+#endif
 	bio = get_swap_bio(GFP_NOIO, page_private(page), page,
 				end_swap_bio_write);
 	if (bio == NULL) {
@@ -116,6 +125,14 @@ int swap_readpage(struct file *file, str
 
 	BUG_ON(!PageLocked(page));
 	ClearPageUptodate(page);
+#ifdef CONFIG_SWAP_FILE
+	{
+		struct swap_info_struct *sis = page_swap_info(page);
+		if (sis->flags & SWP_FILE)
+			return sis->swap_file->f_mapping->
+				a_ops->readpage(sis->swap_file, page);
+	}
+#endif
 	bio = get_swap_bio(GFP_KERNEL, page_private(page), page,
 				end_swap_bio_read);
 	if (bio == NULL) {
@@ -129,6 +146,49 @@ out:
 	return ret;
 }
 
+#ifdef CONFIG_SWAP_FILE
+void swap_sync_page(struct page *page)
+{
+	struct swap_info_struct *sis = page_swap_info(page);
+
+	if (sis->flags & SWP_FILE) {
+		const struct address_space_operations * a_ops =
+			sis->swap_file->f_mapping->a_ops;
+		if (a_ops->sync_page)
+			a_ops->sync_page(page);
+	} else
+		block_sync_page(page);
+}
+
+int swap_set_page_dirty(struct page *page)
+{
+	struct swap_info_struct *sis = page_swap_info(page);
+
+	if (sis->flags & SWP_FILE) {
+		const struct address_space_operations * a_ops =
+			sis->swap_file->f_mapping->a_ops;
+		if (a_ops->set_page_dirty)
+			return a_ops->set_page_dirty(page);
+		return __set_page_dirty_buffers(page);
+	}
+
+	return __set_page_dirty_nobuffers(page);
+}
+
+int swap_releasepage(struct page *page, gfp_t gfp_mask)
+{
+	struct swap_info_struct *sis = page_swap_info(page);
+	const struct address_space_operations * a_ops =
+		sis->swap_file->f_mapping->a_ops;
+
+	if ((sis->flags & SWP_FILE) && a_ops->releasepage)
+		return a_ops->releasepage(page, gfp_mask);
+
+	BUG();
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_SOFTWARE_SUSPEND
 /*
  * A scruffy utility function to read or write an arbitrary swap page
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -26,8 +26,14 @@
  */
 static const struct address_space_operations swap_aops = {
 	.writepage	= swap_writepage,
+#ifdef CONFIG_SWAP_FILE
+	.sync_page	= swap_sync_page,
+	.set_page_dirty	= swap_set_page_dirty,
+	.releasepage	= swap_releasepage,
+#else
 	.sync_page	= block_sync_page,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
+#endif
 	.migratepage	= migrate_page,
 };
 
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -411,7 +411,12 @@ void free_swap_and_cache(swp_entry_t ent
 	if (page) {
 		int one_user;
 
+#ifdef CONFIG_SWAP_FILE
+		if (PagePrivate(page))
+			page_mapping(page)->a_ops->releasepage(page, 0);
+#else
 		BUG_ON(PagePrivate(page));
+#endif
 		one_user = (page_count(page) == 2);
 		/* Only cache user (+us), or swap space full? Free it! */
 		/* Also recheck PageSwapCache after page is locked (above) */
@@ -944,6 +949,13 @@ static void destroy_swap_extents(struct 
 		list_del(&se->list);
 		kfree(se);
 	}
+#ifdef CONFIG_SWAP_FILE
+	if (sis->flags & SWP_FILE) {
+		sis->flags &= ~SWP_FILE;
+		sis->swap_file->f_mapping->a_ops->
+			swapfile(sis->swap_file->f_mapping, 0);
+	}
+#endif
 }
 
 /*
@@ -1036,6 +1048,19 @@ static int setup_swap_extents(struct swa
 		goto done;
 	}
 
+#ifdef CONFIG_SWAP_FILE
+	if (sis->swap_file->f_mapping->a_ops->swapfile) {
+		ret = sis->swap_file->f_mapping->a_ops->
+			swapfile(sis->swap_file->f_mapping, 1);
+		if (!ret) {
+			sis->flags |= SWP_FILE;
+			ret = add_swap_extent(sis, 0, sis->max, 0);
+			*span = sis->pages;
+		}
+		goto done;
+	}
+#endif
+
 	blkbits = inode->i_blkbits;
 	blocks_per_page = PAGE_SIZE >> blkbits;
 
@@ -1592,7 +1617,7 @@ asmlinkage long sys_swapon(const char __
 
 	mutex_lock(&swapon_mutex);
 	spin_lock(&swap_lock);
-	p->flags = SWP_ACTIVE;
+	p->flags |= SWP_WRITEOK;
 	nr_swap_pages += nr_good_pages;
 	total_swap_pages += nr_good_pages;
 
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -382,6 +382,7 @@ struct address_space_operations {
 	/* migrate the contents of a page to the specified target */
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *);
+	int (*swapfile)(struct address_space *, int);
 };
 
 struct backing_dev_info;
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1567,7 +1567,7 @@ static void discard_buffer(struct buffer
  */
 int try_to_release_page(struct page *page, gfp_t gfp_mask)
 {
-	struct address_space * const mapping = page->mapping;
+	struct address_space * const mapping = page_mapping(page);
 
 	BUG_ON(!PageLocked(page));
 	if (PageWriteback(page))

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 03/21] mm: add support for non block device backed swap files
@ 2006-09-06 13:16   ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Trond Myklebust

[-- Attachment #1: swapfile.patch --]
[-- Type: text/plain, Size: 8043 bytes --]

A new address_space_operations method is added:
  int swapfile(struct address_space *, int)

If this method is present during sys_swapon() and returns no error, the
swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops.

The swapfile method is used to tell the address_space that the VM relies on
it, so that the address_space can take adequate measures (such as reserving
memory for mempools).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 fs/buffer.c          |    2 -
 include/linux/fs.h   |    1 
 include/linux/swap.h |    4 +++
 init/Kconfig         |    5 ++++
 mm/page_io.c         |   60 +++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c      |    6 +++++
 mm/swapfile.c        |   27 ++++++++++++++++++++++
 7 files changed, 103 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -115,6 +115,7 @@ enum {
 	SWP_USED	= (1 << 0),	/* is slot in swap_info[] used? */
 	SWP_WRITEOK	= (1 << 1),	/* ok to write to this swap?	*/
 	SWP_ACTIVE	= (SWP_USED | SWP_WRITEOK),
+	SWP_FILE	= (1 << 2),	/* file swap area */
 					/* add others here before... */
 	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
@@ -212,6 +213,9 @@ extern void swap_unplug_io_fn(struct bac
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct file *, struct page *);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
+extern void swap_sync_page(struct page *page);
+extern int swap_set_page_dirty(struct page *page);
+extern int swap_releasepage(struct page *page, gfp_t gfp_mask);
 extern int rw_swap_page_sync(int, swp_entry_t, struct page *);
 
 /* linux/mm/swap_state.c */
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -100,6 +100,11 @@ config SWAP
 	  used to provide more virtual memory than the actual RAM present
 	  in your computer.  If unsure say Y.
 
+config SWAP_FILE
+	bool "Support for paging to/from non block device files"
+	depends on SWAP
+	default n
+
 config SYSVIPC
 	bool "System V IPC"
 	---help---
Index: linux-2.6/mm/page_io.c
===================================================================
--- linux-2.6.orig/mm/page_io.c
+++ linux-2.6/mm/page_io.c
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include <linux/buffer_head.h>
 #include <asm/pgtable.h>
 
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -91,6 +92,14 @@ int swap_writepage(struct page *page, st
 		unlock_page(page);
 		goto out;
 	}
+#ifdef CONFIG_SWAP_FILE
+	{
+		struct swap_info_struct *sis = page_swap_info(page);
+		if (sis->flags & SWP_FILE)
+			return sis->swap_file->f_mapping->
+				a_ops->writepage(page, wbc);
+	}
+#endif
 	bio = get_swap_bio(GFP_NOIO, page_private(page), page,
 				end_swap_bio_write);
 	if (bio == NULL) {
@@ -116,6 +125,14 @@ int swap_readpage(struct file *file, str
 
 	BUG_ON(!PageLocked(page));
 	ClearPageUptodate(page);
+#ifdef CONFIG_SWAP_FILE
+	{
+		struct swap_info_struct *sis = page_swap_info(page);
+		if (sis->flags & SWP_FILE)
+			return sis->swap_file->f_mapping->
+				a_ops->readpage(sis->swap_file, page);
+	}
+#endif
 	bio = get_swap_bio(GFP_KERNEL, page_private(page), page,
 				end_swap_bio_read);
 	if (bio == NULL) {
@@ -129,6 +146,49 @@ out:
 	return ret;
 }
 
+#ifdef CONFIG_SWAP_FILE
+void swap_sync_page(struct page *page)
+{
+	struct swap_info_struct *sis = page_swap_info(page);
+
+	if (sis->flags & SWP_FILE) {
+		const struct address_space_operations * a_ops =
+			sis->swap_file->f_mapping->a_ops;
+		if (a_ops->sync_page)
+			a_ops->sync_page(page);
+	} else
+		block_sync_page(page);
+}
+
+int swap_set_page_dirty(struct page *page)
+{
+	struct swap_info_struct *sis = page_swap_info(page);
+
+	if (sis->flags & SWP_FILE) {
+		const struct address_space_operations * a_ops =
+			sis->swap_file->f_mapping->a_ops;
+		if (a_ops->set_page_dirty)
+			return a_ops->set_page_dirty(page);
+		return __set_page_dirty_buffers(page);
+	}
+
+	return __set_page_dirty_nobuffers(page);
+}
+
+int swap_releasepage(struct page *page, gfp_t gfp_mask)
+{
+	struct swap_info_struct *sis = page_swap_info(page);
+	const struct address_space_operations * a_ops =
+		sis->swap_file->f_mapping->a_ops;
+
+	if ((sis->flags & SWP_FILE) && a_ops->releasepage)
+		return a_ops->releasepage(page, gfp_mask);
+
+	BUG();
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_SOFTWARE_SUSPEND
 /*
  * A scruffy utility function to read or write an arbitrary swap page
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -26,8 +26,14 @@
  */
 static const struct address_space_operations swap_aops = {
 	.writepage	= swap_writepage,
+#ifdef CONFIG_SWAP_FILE
+	.sync_page	= swap_sync_page,
+	.set_page_dirty	= swap_set_page_dirty,
+	.releasepage	= swap_releasepage,
+#else
 	.sync_page	= block_sync_page,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
+#endif
 	.migratepage	= migrate_page,
 };
 
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -411,7 +411,12 @@ void free_swap_and_cache(swp_entry_t ent
 	if (page) {
 		int one_user;
 
+#ifdef CONFIG_SWAP_FILE
+		if (PagePrivate(page))
+			page_mapping(page)->a_ops->releasepage(page, 0);
+#else
 		BUG_ON(PagePrivate(page));
+#endif
 		one_user = (page_count(page) == 2);
 		/* Only cache user (+us), or swap space full? Free it! */
 		/* Also recheck PageSwapCache after page is locked (above) */
@@ -944,6 +949,13 @@ static void destroy_swap_extents(struct 
 		list_del(&se->list);
 		kfree(se);
 	}
+#ifdef CONFIG_SWAP_FILE
+	if (sis->flags & SWP_FILE) {
+		sis->flags &= ~SWP_FILE;
+		sis->swap_file->f_mapping->a_ops->
+			swapfile(sis->swap_file->f_mapping, 0);
+	}
+#endif
 }
 
 /*
@@ -1036,6 +1048,19 @@ static int setup_swap_extents(struct swa
 		goto done;
 	}
 
+#ifdef CONFIG_SWAP_FILE
+	if (sis->swap_file->f_mapping->a_ops->swapfile) {
+		ret = sis->swap_file->f_mapping->a_ops->
+			swapfile(sis->swap_file->f_mapping, 1);
+		if (!ret) {
+			sis->flags |= SWP_FILE;
+			ret = add_swap_extent(sis, 0, sis->max, 0);
+			*span = sis->pages;
+		}
+		goto done;
+	}
+#endif
+
 	blkbits = inode->i_blkbits;
 	blocks_per_page = PAGE_SIZE >> blkbits;
 
@@ -1592,7 +1617,7 @@ asmlinkage long sys_swapon(const char __
 
 	mutex_lock(&swapon_mutex);
 	spin_lock(&swap_lock);
-	p->flags = SWP_ACTIVE;
+	p->flags |= SWP_WRITEOK;
 	nr_swap_pages += nr_good_pages;
 	total_swap_pages += nr_good_pages;
 
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -382,6 +382,7 @@ struct address_space_operations {
 	/* migrate the contents of a page to the specified target */
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *);
+	int (*swapfile)(struct address_space *, int);
 };
 
 struct backing_dev_info;
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1567,7 +1567,7 @@ static void discard_buffer(struct buffer
  */
 int try_to_release_page(struct page *page, gfp_t gfp_mask)
 {
-	struct address_space * const mapping = page->mapping;
+	struct address_space * const mapping = page_mapping(page);
 
 	BUG_ON(!PageLocked(page));
 	if (PageWriteback(page))

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 04/21] mm: methods for teaching filesystems about PG_swapcache pages
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Trond Myklebust

[-- Attachment #1: page_file_methods.patch --]
[-- Type: text/plain, Size: 6666 bytes --]

In order to teach filesystems to handle swap cache pages, two new page
functions are introduced:

  pgoff_t page_file_index(struct page *);
  struct address_space *page_file_mapping(struct page *);

page_file_index - gives the offset of this page in the file, in units of
PAGE_CACHE_SIZE blocks. It matches page->index for mapped pages, and also
gives the correct index for PG_swapcache pages.

page_file_mapping - gives the mapping backing the actual page; that is, for
swap cache pages it gives swap_file->f_mapping.

page_offset() is modified to use page_file_index(), so that it will give the
expected result, even for PG_swapcache pages.
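
The index lookup described above can be sketched in user space. This is only
an illustrative model, not the kernel code: struct model_page is a
hypothetical stand-in for struct page, and the values chosen for
MAX_SWAPFILES_SHIFT and PAGE_CACHE_SHIFT are assumptions. The swp_entry(),
swp_type() and swp_offset() helpers follow the definitions this patch moves
into swap.h.

```c
#include <assert.h>

/* Assumed values for illustration; the kernel sets these per configuration. */
#define MAX_SWAPFILES_SHIFT	5
#define PAGE_CACHE_SHIFT	12

typedef struct { unsigned long val; } swp_entry_t;

/* `type' lives in the high-order bits, `offset' is right-aligned below it. */
#define SWP_TYPE_SHIFT	(sizeof(unsigned long) * 8 - MAX_SWAPFILES_SHIFT)
#define SWP_OFFSET_MASK	((1UL << SWP_TYPE_SHIFT) - 1)

static swp_entry_t swp_entry(unsigned long type, unsigned long offset)
{
	swp_entry_t ret;

	/* Sanity check added for this sketch only. */
	assert(type < (1UL << MAX_SWAPFILES_SHIFT));
	ret.val = (type << SWP_TYPE_SHIFT) | (offset & SWP_OFFSET_MASK);
	return ret;
}

static unsigned swp_type(swp_entry_t entry)
{
	return entry.val >> SWP_TYPE_SHIFT;
}

static unsigned long swp_offset(swp_entry_t entry)
{
	return entry.val & SWP_OFFSET_MASK;
}

/* Hypothetical stand-in for struct page; only the fields used here. */
struct model_page {
	int swapcache;		/* models PageSwapCache(page) */
	unsigned long index;	/* models page->index */
	unsigned long private;	/* models page_private(page) */
};

/* Swap cache pages take their file index from the swap entry offset. */
static unsigned long page_file_index(const struct model_page *page)
{
	if (page->swapcache) {
		swp_entry_t swap = { .val = page->private };
		return swp_offset(swap);
	}
	return page->index;
}

/* Byte offset in the backing file, for both kinds of page. */
static long long page_offset(const struct model_page *page)
{
	return (long long)page_file_index(page) << PAGE_CACHE_SHIFT;
}
```

In this model a regular page with index 7 reports file index 7, while a swap
cache page whose private field holds swp_entry(3, 42).val reports file index
42, whatever its index field contains.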

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 include/linux/mm.h      |   30 ++++++++++++++++++++++++++++++
 include/linux/pagemap.h |    2 +-
 include/linux/swap.h    |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/swapops.h |   44 --------------------------------------------
 4 files changed, 79 insertions(+), 45 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -15,6 +15,7 @@
 #include <linux/fs.h>
 #include <linux/mutex.h>
 #include <linux/debug_locks.h>
+#include <linux/swap.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -579,6 +580,22 @@ static inline struct address_space *page
 	return mapping;
 }
 
+static inline
+struct swap_info_struct * page_swap_info(struct page *page)
+{
+	swp_entry_t swap = { .val = page_private(page) };
+	BUG_ON(!PageSwapCache(page));
+	return get_swap_info_struct(swp_type(swap));
+}
+
+static inline
+struct address_space *page_file_mapping(struct page *page)
+{
+	if (unlikely(PageSwapCache(page)))
+		return page_swap_info(page)->swap_file->f_mapping;
+	return page->mapping;
+}
+
 static inline int PageAnon(struct page *page)
 {
 	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
@@ -596,6 +613,19 @@ static inline pgoff_t page_index(struct 
 }
 
 /*
+ * Return the file index of the page. Regular pagecache pages use ->index
+ * whereas swapcache pages use swp_offset(->private)
+ */
+static inline pgoff_t page_file_index(struct page *page)
+{
+	if (unlikely(PageSwapCache(page))) {
+		swp_entry_t swap = { .val = page_private(page) };
+		return swp_offset(swap);
+	}
+	return page->index;
+}
+
+/*
  * The atomic page->_mapcount, like _count, starts from -1:
  * so that transitions both from it and to it can be tracked,
  * using atomic_inc_and_test and atomic_add_negative(-1).
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -118,7 +118,7 @@ extern void __remove_from_page_cache(str
  */
 static inline loff_t page_offset(struct page *page)
 {
-	return ((loff_t)page->index) << PAGE_CACHE_SHIFT;
+	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
 }
 
 static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -75,6 +75,50 @@ typedef struct {
 } swp_entry_t;
 
 /*
+ * swapcache pages are stored in the swapper_space radix tree.  We want to
+ * get good packing density in that tree, so the index should be dense in
+ * the low-order bits.
+ *
+ * We arrange the `type' and `offset' fields so that `type' is at the five
+ * high-order bits of the swp_entry_t and `offset' is right-aligned in the
+ * remaining bits.
+ *
+ * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
+ */
+#define SWP_TYPE_SHIFT(e)	(sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
+#define SWP_OFFSET_MASK(e)	((1UL << SWP_TYPE_SHIFT(e)) - 1)
+
+/*
+ * Store a type+offset into a swp_entry_t in an arch-independent format
+ */
+static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
+{
+	swp_entry_t ret;
+
+	ret.val = (type << SWP_TYPE_SHIFT(ret)) |
+			(offset & SWP_OFFSET_MASK(ret));
+	return ret;
+}
+
+/*
+ * Extract the `type' field from a swp_entry_t.  The swp_entry_t is in
+ * arch-independent format
+ */
+static inline unsigned swp_type(swp_entry_t entry)
+{
+	return (entry.val >> SWP_TYPE_SHIFT(entry));
+}
+
+/*
+ * Extract the `offset' field from a swp_entry_t.  The swp_entry_t is in
+ * arch-independent format
+ */
+static inline pgoff_t swp_offset(swp_entry_t entry)
+{
+	return entry.val & SWP_OFFSET_MASK(entry);
+}
+
+/*
  * current->reclaim_state points to one of these when a task is running
  * memory reclaim
  */
@@ -322,6 +366,10 @@ static inline int valid_swaphandles(swp_
 	return 0;
 }
 
+static inline struct swap_info_struct *get_swap_info_struct(unsigned type)
+{
+	return NULL;
+}
 #define can_share_swap_page(p)			(page_mapcount(p) == 1)
 
 static inline int move_to_swap_cache(struct page *page, swp_entry_t entry)
Index: linux-2.6/include/linux/swapops.h
===================================================================
--- linux-2.6.orig/include/linux/swapops.h
+++ linux-2.6/include/linux/swapops.h
@@ -1,48 +1,4 @@
 /*
- * swapcache pages are stored in the swapper_space radix tree.  We want to
- * get good packing density in that tree, so the index should be dense in
- * the low-order bits.
- *
- * We arrange the `type' and `offset' fields so that `type' is at the five
- * high-order bits of the swp_entry_t and `offset' is right-aligned in the
- * remaining bits.
- *
- * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
- */
-#define SWP_TYPE_SHIFT(e)	(sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
-#define SWP_OFFSET_MASK(e)	((1UL << SWP_TYPE_SHIFT(e)) - 1)
-
-/*
- * Store a type+offset into a swp_entry_t in an arch-independent format
- */
-static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
-{
-	swp_entry_t ret;
-
-	ret.val = (type << SWP_TYPE_SHIFT(ret)) |
-			(offset & SWP_OFFSET_MASK(ret));
-	return ret;
-}
-
-/*
- * Extract the `type' field from a swp_entry_t.  The swp_entry_t is in
- * arch-independent format
- */
-static inline unsigned swp_type(swp_entry_t entry)
-{
-	return (entry.val >> SWP_TYPE_SHIFT(entry));
-}
-
-/*
- * Extract the `offset' field from a swp_entry_t.  The swp_entry_t is in
- * arch-independent format
- */
-static inline pgoff_t swp_offset(swp_entry_t entry)
-{
-	return entry.val & SWP_OFFSET_MASK(entry);
-}
-
-/*
  * Convert the arch-dependent pte representation of a swp_entry_t into an
  * arch-independent swp_entry_t.
  */

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 05/21] uml: rename arch/um remove_mapping()
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Jeff Dike

[-- Attachment #1: uml_remove_mapping.patch --]
[-- Type: text/plain, Size: 1262 bytes --]

Now that 'include/linux/mm.h' includes 'include/linux/swap.h', the global
remove_mapping() definition clashes with the arch/um one.

Rename the arch/um one.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jeff Dike <jdike@addtoit.com>
---
 arch/um/kernel/physmem.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/arch/um/kernel/physmem.c
===================================================================
--- linux-2.6.orig/arch/um/kernel/physmem.c
+++ linux-2.6/arch/um/kernel/physmem.c
@@ -160,7 +160,7 @@ int physmem_subst_mapping(void *virt, in
 
 static int physmem_fd = -1;
 
-static void remove_mapping(struct phys_desc *desc)
+static void um_remove_mapping(struct phys_desc *desc)
 {
 	void *virt = desc->virt;
 	int err;
@@ -184,7 +184,7 @@ int physmem_remove_mapping(void *virt)
 	if(desc == NULL)
 		return(0);
 
-	remove_mapping(desc);
+	um_remove_mapping(desc);
 	return(1);
 }
 
@@ -205,7 +205,7 @@ void physmem_forget_descriptor(int fd)
 		page = list_entry(ele, struct phys_desc, list);
 		offset = page->offset;
 		addr = page->virt;
-		remove_mapping(page);
+		um_remove_mapping(page);
 		err = os_seek_file(fd, offset);
 		if(err)
 			panic("physmem_forget_descriptor - failed to seek "

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 06/21] nfs: teach the NFS client how to treat PG_swapcache pages
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Trond Myklebust

[-- Attachment #1: nfs_swapcache.patch --]
[-- Type: text/plain, Size: 10227 bytes --]

Replace all occurrences of page->index and page->mapping in the NFS client
with the new page_file_index() and page_file_mapping() functions.
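
As a rough illustration of why the index source matters, the length
calculation in nfs_page_length() can be modelled in user space. This is a
sketch under assumptions: model_page_length() is a hypothetical name, the
page size is fixed at 4096 bytes, and the file index is passed in directly
where the kernel would call page_file_index(page).

```c
/* Assumed page size for this sketch; the kernel uses PAGE_CACHE_SIZE. */
#define MODEL_PAGE_SHIFT	12
#define MODEL_PAGE_SIZE		(1UL << MODEL_PAGE_SHIFT)

/*
 * Number of valid bytes in the page at file_index for a file of i_size
 * bytes. Mirrors the structure of nfs_page_length() in fs/nfs/read.c.
 */
static unsigned int model_page_length(long long i_size,
				      unsigned long file_index)
{
	unsigned long idx;

	if (i_size <= 0)
		return 0;
	/* Index of the last page that holds any data. */
	idx = (unsigned long)((i_size - 1) >> MODEL_PAGE_SHIFT);
	if (file_index > idx)
		return 0;			/* entirely past end of file */
	if (file_index != idx)
		return MODEL_PAGE_SIZE;		/* full interior page */
	/* Partial (or exactly full) last page. */
	return 1 + (unsigned int)((i_size - 1) & (MODEL_PAGE_SIZE - 1));
}
```

For a 10000-byte file, pages 0 and 1 are full, page 2 holds the remaining
1808 bytes, and page 3 holds none. Passing an index that does not describe
the page's position in the swap file would select the wrong branch, which is
the case the conversion to page_file_index() is meant to avoid.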

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 fs/nfs/dir.c      |    4 ++--
 fs/nfs/file.c     |    6 +++---
 fs/nfs/pagelist.c |    8 ++++----
 fs/nfs/read.c     |   10 +++++-----
 fs/nfs/write.c    |   34 +++++++++++++++++-----------------
 5 files changed, 31 insertions(+), 31 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -303,17 +303,17 @@ static int nfs_commit_write(struct file 
 
 static void nfs_invalidate_page(struct page *page, unsigned long offset)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 
 	/* Cancel any unstarted writes on this page */
 	if (offset == 0)
-		nfs_sync_inode_wait(inode, page->index, 1, FLUSH_INVALIDATE);
+		nfs_sync_inode_wait(inode, page_file_index(page), 1, FLUSH_INVALIDATE);
 }
 
 static int nfs_release_page(struct page *page, gfp_t gfp)
 {
 	if (gfp & __GFP_FS)
-		return !nfs_wb_page(page->mapping->host, page);
+		return !nfs_wb_page(page_file_mapping(page)->host, page);
 	else
 		/*
 		 * Avoid deadlock on nfs_wait_on_request().
Index: linux-2.6/fs/nfs/pagelist.c
===================================================================
--- linux-2.6.orig/fs/nfs/pagelist.c
+++ linux-2.6/fs/nfs/pagelist.c
@@ -82,11 +82,11 @@ nfs_create_request(struct nfs_open_conte
 	 * update_nfs_request below if the region is not locked. */
 	req->wb_page    = page;
 	atomic_set(&req->wb_complete, 0);
-	req->wb_index	= page->index;
+	req->wb_index	= page_file_index(page);
 	page_cache_get(page);
 	BUG_ON(PagePrivate(page));
 	BUG_ON(!PageLocked(page));
-	BUG_ON(page->mapping->host != inode);
+	BUG_ON(page_file_mapping(page)->host != inode);
 	req->wb_offset  = offset;
 	req->wb_pgbase	= offset;
 	req->wb_bytes   = count;
@@ -271,7 +271,7 @@ nfs_coalesce_requests(struct list_head *
  * nfs_scan_lock_dirty - Scan the radix tree for dirty requests
  * @nfsi: NFS inode
  * @dst: Destination list
- * @idx_start: lower bound of page->index to scan
+ * @idx_start: lower bound of page_file_index(page) to scan
  * @npages: idx_start + npages sets the upper bound to scan.
  *
  * Moves elements from one of the inode request lists.
@@ -328,7 +328,7 @@ out:
  * @nfsi: NFS inode
  * @head: One of the NFS inode request lists
  * @dst: Destination list
- * @idx_start: lower bound of page->index to scan
+ * @idx_start: lower bound of page_file_index(page) to scan
  * @npages: idx_start + npages sets the upper bound to scan.
  *
  * Moves elements from one of the inode request lists.
Index: linux-2.6/fs/nfs/read.c
===================================================================
--- linux-2.6.orig/fs/nfs/read.c
+++ linux-2.6/fs/nfs/read.c
@@ -84,9 +84,9 @@ unsigned int nfs_page_length(struct inod
 	if (i_size <= 0)
 		return 0;
 	idx = (i_size - 1) >> PAGE_CACHE_SHIFT;
-	if (page->index > idx)
+	if (page_file_index(page) > idx)
 		return 0;
-	if (page->index != idx)
+	if (page_file_index(page) != idx)
 		return PAGE_CACHE_SIZE;
 	return 1 + ((i_size - 1) & (PAGE_CACHE_SIZE - 1));
 }
@@ -593,11 +593,11 @@ int nfs_readpage_result(struct rpc_task 
 int nfs_readpage(struct file *file, struct page *page)
 {
 	struct nfs_open_context *ctx;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	int		error;
 
 	dprintk("NFS: nfs_readpage (%p %ld@%lu)\n",
-		page, PAGE_CACHE_SIZE, page->index);
+		page, PAGE_CACHE_SIZE, page_file_index(page));
 	nfs_inc_stats(inode, NFSIOS_VFSREADPAGE);
 	nfs_add_stats(inode, NFSIOS_READPAGES, 1);
 
@@ -645,7 +645,7 @@ static int
 readpage_async_filler(void *data, struct page *page)
 {
 	struct nfs_readdesc *desc = (struct nfs_readdesc *)data;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_page *new;
 	unsigned int len;
 
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -152,13 +152,13 @@ void nfs_writedata_release(void *wdata)
 /* Adjust the file length if we're writing beyond the end */
 static void nfs_grow_file(struct page *page, unsigned int offset, unsigned int count)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	loff_t end, i_size = i_size_read(inode);
 	unsigned long end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
 
-	if (i_size > 0 && page->index < end_index)
+	if (i_size > 0 && page_file_index(page) < end_index)
 		return;
-	end = ((loff_t)page->index << PAGE_CACHE_SHIFT) + ((loff_t)offset+count);
+	end = page_offset(page) + ((loff_t)offset+count);
 	if (i_size >= end)
 		return;
 	nfs_inc_stats(inode, NFSIOS_EXTENDWRITE);
@@ -181,11 +181,11 @@ static void nfs_mark_uptodate(struct pag
 		return;
 	}
 
-	end_offs = i_size_read(page->mapping->host) - 1;
+	end_offs = i_size_read(page_file_mapping(page)->host) - 1;
 	if (end_offs < 0)
 		return;
 	/* Is this the last page? */
-	if (page->index != (unsigned long)(end_offs >> PAGE_CACHE_SHIFT))
+	if (page_file_index(page) != (unsigned long)(end_offs >> PAGE_CACHE_SHIFT))
 		return;
 	/* This is the last page: set PG_uptodate if we cover the entire
 	 * extent of the data, then zero the rest of the page.
@@ -300,7 +300,7 @@ static int wb_priority(struct writeback_
 int nfs_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct nfs_open_context *ctx;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	unsigned long end_index;
 	unsigned offset = PAGE_CACHE_SIZE;
 	loff_t i_size = i_size_read(inode);
@@ -327,14 +327,14 @@ int nfs_writepage(struct page *page, str
 	nfs_wb_page_priority(inode, page, priority);
 
 	/* easy case */
-	if (page->index < end_index)
+	if (page_file_index(page) < end_index)
 		goto do_it;
 	/* things got complicated... */
 	offset = i_size & (PAGE_CACHE_SIZE-1);
 
 	/* OK, are we completely out? */
 	err = 0; /* potential race with truncate - ignore */
-	if (page->index >= end_index+1 || !offset)
+	if (page_file_index(page) >= end_index+1 || !offset)
 		goto out;
 do_it:
 	ctx = nfs_find_open_context(inode, NULL, FMODE_WRITE);
@@ -606,7 +606,7 @@ static void nfs_cancel_commit_list(struc
  * nfs_scan_dirty - Scan an inode for dirty requests
  * @inode: NFS inode to scan
  * @dst: destination list
- * @idx_start: lower bound of page->index to scan.
+ * @idx_start: lower bound of page_file_index(page) to scan.
  * @npages: idx_start + npages sets the upper bound to scan.
  *
  * Moves requests from the inode's dirty page list.
@@ -632,7 +632,7 @@ nfs_scan_dirty(struct inode *inode, stru
  * nfs_scan_commit - Scan an inode for commit requests
  * @inode: NFS inode to scan
  * @dst: destination list
- * @idx_start: lower bound of page->index to scan.
+ * @idx_start: lower bound of page_file_index(page) to scan.
  * @npages: idx_start + npages sets the upper bound to scan.
  *
  * Moves requests from the inode's 'commit' request list.
@@ -713,14 +713,14 @@ static struct nfs_page * nfs_update_requ
 
 	end = offset + bytes;
 
-	if (nfs_wait_on_write_congestion(page->mapping, server->flags & NFS_MOUNT_INTR))
+	if (nfs_wait_on_write_congestion(page_file_mapping(page), server->flags & NFS_MOUNT_INTR))
 		return ERR_PTR(-ERESTARTSYS);
 	for (;;) {
 		/* Loop over all inode entries and see if we find
 		 * A request for the page we wish to update
 		 */
 		spin_lock(&nfsi->req_lock);
-		req = _nfs_find_request(inode, page->index);
+		req = _nfs_find_request(inode, page_file_index(page));
 		if (req) {
 			if (!nfs_lock_request_dontget(req)) {
 				int error;
@@ -791,7 +791,7 @@ static struct nfs_page * nfs_update_requ
 int nfs_flush_incompatible(struct file *file, struct page *page)
 {
 	struct nfs_open_context *ctx = (struct nfs_open_context *)file->private_data;
-	struct inode	*inode = page->mapping->host;
+	struct inode	*inode = page_file_mapping(page)->host;
 	struct nfs_page	*req;
 	int		status = 0;
 	/*
@@ -802,7 +802,7 @@ int nfs_flush_incompatible(struct file *
 	 * Also do the same if we find a request from an existing
 	 * dropped page.
 	 */
-	req = nfs_find_request(inode, page->index);
+	req = nfs_find_request(inode, page_file_index(page));
 	if (req) {
 		if (req->wb_page != page || ctx != req->wb_context)
 			status = nfs_wb_page(inode, page);
@@ -821,7 +821,7 @@ int nfs_updatepage(struct file *file, st
 		unsigned int offset, unsigned int count)
 {
 	struct nfs_open_context *ctx = (struct nfs_open_context *)file->private_data;
-	struct inode	*inode = page->mapping->host;
+	struct inode	*inode = page_file_mapping(page)->host;
 	struct nfs_page	*req;
 	int		status = 0;
 
@@ -854,12 +854,12 @@ int nfs_updatepage(struct file *file, st
 		offset = 0;
 		if (unlikely(end_offs < 0)) {
 			/* Do nothing */
-		} else if (page->index == end_index) {
+		} else if (page_file_index(page) == end_index) {
 			unsigned int pglen;
 			pglen = (unsigned int)(end_offs & (PAGE_CACHE_SIZE-1)) + 1;
 			if (count < pglen)
 				count = pglen;
-		} else if (page->index < end_index)
+		} else if (page_file_index(page) < end_index)
 			count = PAGE_CACHE_SIZE;
 	}
 
Index: linux-2.6/fs/nfs/dir.c
===================================================================
--- linux-2.6.orig/fs/nfs/dir.c
+++ linux-2.6/fs/nfs/dir.c
@@ -177,7 +177,7 @@ int nfs_readdir_filler(nfs_readdir_descr
 
 	dfprintk(DIRCACHE, "NFS: %s: reading cookie %Lu into page %lu\n",
 			__FUNCTION__, (long long)desc->entry->cookie,
-			page->index);
+			page_file_index(page));
 
  again:
 	timestamp = jiffies;
@@ -201,7 +201,7 @@ int nfs_readdir_filler(nfs_readdir_descr
 	 * Note: assumes we have exclusive access to this mapping either
 	 *	 through inode->i_mutex or some other mechanism.
 	 */
-	if (page->index == 0)
+	if (page_file_index(page) == 0)
 		invalidate_inode_pages2_range(inode->i_mapping, PAGE_CACHE_SIZE, -1);
 	unlock_page(page);
 	return 0;

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 06/21] nfs: teach the NFS client how to treat PG_swapcache pages
@ 2006-09-06 13:16   ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Trond Myklebust

[-- Attachment #1: nfs_swapcache.patch --]
[-- Type: text/plain, Size: 10453 bytes --]

Replace all occurrences of page->index and page->mapping in the NFS client
with the new page_file_index() and page_file_mapping() functions.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
---
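A note for readers of this patch in isolation: the two helpers are defined
earlier in the series and are not shown here. The following is only a
userspace sketch of the idea, assuming that a swap cache page has to be
redirected to the swap file's mapping and offset. Every name prefixed toy_,
and the swap_mapping/swap_offset fields, are hypothetical stand-ins and not
the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the fields used here. */
struct inode { int ino; };
struct address_space { struct inode *host; };

struct page {
	bool swapcache;                     /* models PG_swapcache */
	struct address_space *mapping;      /* valid for ordinary page cache pages */
	unsigned long index;                /* page offset within the file */
	struct address_space *swap_mapping; /* ASSUMED: mapping of the swap file */
	unsigned long swap_offset;          /* ASSUMED: page offset in the swap file */
};

/* Pick the mapping the filesystem should do I/O against. */
struct address_space *toy_page_file_mapping(const struct page *page)
{
	return page->swapcache ? page->swap_mapping : page->mapping;
}

/* Pick the page offset within that mapping. */
unsigned long toy_page_file_index(const struct page *page)
{
	return page->swapcache ? page->swap_offset : page->index;
}

/* Mirrors the converted callers, which want the inode behind the page. */
struct inode *toy_page_file_inode(const struct page *page)
{
	struct address_space *mapping = toy_page_file_mapping(page);

	assert(mapping != NULL);
	return mapping->host;
}
```

With helpers of this shape, a caller no longer needs to know whether it was
handed a page cache page or a swap cache page, which is why the conversion
below is purely mechanical.
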
 fs/nfs/dir.c      |    4 ++--
 fs/nfs/file.c     |    6 +++---
 fs/nfs/pagelist.c |    8 ++++----
 fs/nfs/read.c     |   10 +++++-----
 fs/nfs/write.c    |   34 +++++++++++++++++-----------------
 5 files changed, 31 insertions(+), 31 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -303,17 +303,17 @@ static int nfs_commit_write(struct file 
 
 static void nfs_invalidate_page(struct page *page, unsigned long offset)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 
 	/* Cancel any unstarted writes on this page */
 	if (offset == 0)
-		nfs_sync_inode_wait(inode, page->index, 1, FLUSH_INVALIDATE);
+		nfs_sync_inode_wait(inode, page_file_index(page), 1, FLUSH_INVALIDATE);
 }
 
 static int nfs_release_page(struct page *page, gfp_t gfp)
 {
 	if (gfp & __GFP_FS)
-		return !nfs_wb_page(page->mapping->host, page);
+		return !nfs_wb_page(page_file_mapping(page)->host, page);
 	else
 		/*
 		 * Avoid deadlock on nfs_wait_on_request().
Index: linux-2.6/fs/nfs/pagelist.c
===================================================================
--- linux-2.6.orig/fs/nfs/pagelist.c
+++ linux-2.6/fs/nfs/pagelist.c
@@ -82,11 +82,11 @@ nfs_create_request(struct nfs_open_conte
 	 * update_nfs_request below if the region is not locked. */
 	req->wb_page    = page;
 	atomic_set(&req->wb_complete, 0);
-	req->wb_index	= page->index;
+	req->wb_index	= page_file_index(page);
 	page_cache_get(page);
 	BUG_ON(PagePrivate(page));
 	BUG_ON(!PageLocked(page));
-	BUG_ON(page->mapping->host != inode);
+	BUG_ON(page_file_mapping(page)->host != inode);
 	req->wb_offset  = offset;
 	req->wb_pgbase	= offset;
 	req->wb_bytes   = count;
@@ -271,7 +271,7 @@ nfs_coalesce_requests(struct list_head *
  * nfs_scan_lock_dirty - Scan the radix tree for dirty requests
  * @nfsi: NFS inode
  * @dst: Destination list
- * @idx_start: lower bound of page->index to scan
+ * @idx_start: lower bound of page_file_index(page) to scan
  * @npages: idx_start + npages sets the upper bound to scan.
  *
  * Moves elements from one of the inode request lists.
@@ -328,7 +328,7 @@ out:
  * @nfsi: NFS inode
  * @head: One of the NFS inode request lists
  * @dst: Destination list
- * @idx_start: lower bound of page->index to scan
+ * @idx_start: lower bound of page_file_index(page) to scan
  * @npages: idx_start + npages sets the upper bound to scan.
  *
  * Moves elements from one of the inode request lists.
Index: linux-2.6/fs/nfs/read.c
===================================================================
--- linux-2.6.orig/fs/nfs/read.c
+++ linux-2.6/fs/nfs/read.c
@@ -84,9 +84,9 @@ unsigned int nfs_page_length(struct inod
 	if (i_size <= 0)
 		return 0;
 	idx = (i_size - 1) >> PAGE_CACHE_SHIFT;
-	if (page->index > idx)
+	if (page_file_index(page) > idx)
 		return 0;
-	if (page->index != idx)
+	if (page_file_index(page) != idx)
 		return PAGE_CACHE_SIZE;
 	return 1 + ((i_size - 1) & (PAGE_CACHE_SIZE - 1));
 }
@@ -593,11 +593,11 @@ int nfs_readpage_result(struct rpc_task 
 int nfs_readpage(struct file *file, struct page *page)
 {
 	struct nfs_open_context *ctx;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	int		error;
 
 	dprintk("NFS: nfs_readpage (%p %ld@%lu)\n",
-		page, PAGE_CACHE_SIZE, page->index);
+		page, PAGE_CACHE_SIZE, page_file_index(page));
 	nfs_inc_stats(inode, NFSIOS_VFSREADPAGE);
 	nfs_add_stats(inode, NFSIOS_READPAGES, 1);
 
@@ -645,7 +645,7 @@ static int
 readpage_async_filler(void *data, struct page *page)
 {
 	struct nfs_readdesc *desc = (struct nfs_readdesc *)data;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_page *new;
 	unsigned int len;
 
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -152,13 +152,13 @@ void nfs_writedata_release(void *wdata)
 /* Adjust the file length if we're writing beyond the end */
 static void nfs_grow_file(struct page *page, unsigned int offset, unsigned int count)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	loff_t end, i_size = i_size_read(inode);
 	unsigned long end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
 
-	if (i_size > 0 && page->index < end_index)
+	if (i_size > 0 && page_file_index(page) < end_index)
 		return;
-	end = ((loff_t)page->index << PAGE_CACHE_SHIFT) + ((loff_t)offset+count);
+	end = page_offset(page) + ((loff_t)offset+count);
 	if (i_size >= end)
 		return;
 	nfs_inc_stats(inode, NFSIOS_EXTENDWRITE);
@@ -181,11 +181,11 @@ static void nfs_mark_uptodate(struct pag
 		return;
 	}
 
-	end_offs = i_size_read(page->mapping->host) - 1;
+	end_offs = i_size_read(page_file_mapping(page)->host) - 1;
 	if (end_offs < 0)
 		return;
 	/* Is this the last page? */
-	if (page->index != (unsigned long)(end_offs >> PAGE_CACHE_SHIFT))
+	if (page_file_index(page) != (unsigned long)(end_offs >> PAGE_CACHE_SHIFT))
 		return;
 	/* This is the last page: set PG_uptodate if we cover the entire
 	 * extent of the data, then zero the rest of the page.
@@ -300,7 +300,7 @@ static int wb_priority(struct writeback_
 int nfs_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct nfs_open_context *ctx;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	unsigned long end_index;
 	unsigned offset = PAGE_CACHE_SIZE;
 	loff_t i_size = i_size_read(inode);
@@ -327,14 +327,14 @@ int nfs_writepage(struct page *page, str
 	nfs_wb_page_priority(inode, page, priority);
 
 	/* easy case */
-	if (page->index < end_index)
+	if (page_file_index(page) < end_index)
 		goto do_it;
 	/* things got complicated... */
 	offset = i_size & (PAGE_CACHE_SIZE-1);
 
 	/* OK, are we completely out? */
 	err = 0; /* potential race with truncate - ignore */
-	if (page->index >= end_index+1 || !offset)
+	if (page_file_index(page) >= end_index+1 || !offset)
 		goto out;
 do_it:
 	ctx = nfs_find_open_context(inode, NULL, FMODE_WRITE);
@@ -606,7 +606,7 @@ static void nfs_cancel_commit_list(struc
  * nfs_scan_dirty - Scan an inode for dirty requests
  * @inode: NFS inode to scan
  * @dst: destination list
- * @idx_start: lower bound of page->index to scan.
+ * @idx_start: lower bound of page_file_index(page) to scan.
  * @npages: idx_start + npages sets the upper bound to scan.
  *
  * Moves requests from the inode's dirty page list.
@@ -632,7 +632,7 @@ nfs_scan_dirty(struct inode *inode, stru
  * nfs_scan_commit - Scan an inode for commit requests
  * @inode: NFS inode to scan
  * @dst: destination list
- * @idx_start: lower bound of page->index to scan.
+ * @idx_start: lower bound of page_file_index(page) to scan.
  * @npages: idx_start + npages sets the upper bound to scan.
  *
  * Moves requests from the inode's 'commit' request list.
@@ -713,14 +713,14 @@ static struct nfs_page * nfs_update_requ
 
 	end = offset + bytes;
 
-	if (nfs_wait_on_write_congestion(page->mapping, server->flags & NFS_MOUNT_INTR))
+	if (nfs_wait_on_write_congestion(page_file_mapping(page), server->flags & NFS_MOUNT_INTR))
 		return ERR_PTR(-ERESTARTSYS);
 	for (;;) {
 		/* Loop over all inode entries and see if we find
 		 * A request for the page we wish to update
 		 */
 		spin_lock(&nfsi->req_lock);
-		req = _nfs_find_request(inode, page->index);
+		req = _nfs_find_request(inode, page_file_index(page));
 		if (req) {
 			if (!nfs_lock_request_dontget(req)) {
 				int error;
@@ -791,7 +791,7 @@ static struct nfs_page * nfs_update_requ
 int nfs_flush_incompatible(struct file *file, struct page *page)
 {
 	struct nfs_open_context *ctx = (struct nfs_open_context *)file->private_data;
-	struct inode	*inode = page->mapping->host;
+	struct inode	*inode = page_file_mapping(page)->host;
 	struct nfs_page	*req;
 	int		status = 0;
 	/*
@@ -802,7 +802,7 @@ int nfs_flush_incompatible(struct file *
 	 * Also do the same if we find a request from an existing
 	 * dropped page.
 	 */
-	req = nfs_find_request(inode, page->index);
+	req = nfs_find_request(inode, page_file_index(page));
 	if (req) {
 		if (req->wb_page != page || ctx != req->wb_context)
 			status = nfs_wb_page(inode, page);
@@ -821,7 +821,7 @@ int nfs_updatepage(struct file *file, st
 		unsigned int offset, unsigned int count)
 {
 	struct nfs_open_context *ctx = (struct nfs_open_context *)file->private_data;
-	struct inode	*inode = page->mapping->host;
+	struct inode	*inode = page_file_mapping(page)->host;
 	struct nfs_page	*req;
 	int		status = 0;
 
@@ -854,12 +854,12 @@ int nfs_updatepage(struct file *file, st
 		offset = 0;
 		if (unlikely(end_offs < 0)) {
 			/* Do nothing */
-		} else if (page->index == end_index) {
+		} else if (page_file_index(page) == end_index) {
 			unsigned int pglen;
 			pglen = (unsigned int)(end_offs & (PAGE_CACHE_SIZE-1)) + 1;
 			if (count < pglen)
 				count = pglen;
-		} else if (page->index < end_index)
+		} else if (page_file_index(page) < end_index)
 			count = PAGE_CACHE_SIZE;
 	}
 
Index: linux-2.6/fs/nfs/dir.c
===================================================================
--- linux-2.6.orig/fs/nfs/dir.c
+++ linux-2.6/fs/nfs/dir.c
@@ -177,7 +177,7 @@ int nfs_readdir_filler(nfs_readdir_descr
 
 	dfprintk(DIRCACHE, "NFS: %s: reading cookie %Lu into page %lu\n",
 			__FUNCTION__, (long long)desc->entry->cookie,
-			page->index);
+			page_file_index(page));
 
  again:
 	timestamp = jiffies;
@@ -201,7 +201,7 @@ int nfs_readdir_filler(nfs_readdir_descr
 	 * Note: assumes we have exclusive access to this mapping either
 	 *	 through inode->i_mutex or some other mechanism.
 	 */
-	if (page->index == 0)
+	if (page_file_index(page) == 0)
 		invalidate_inode_pages2_range(inode->i_mapping, PAGE_CACHE_SIZE, -1);
 	unlock_page(page);
 	return 0;

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 07/21] nfs: add a comment explaining the use of PG_private in the NFS client
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Trond Myklebust

[-- Attachment #1: nfs_PG_private_comment.patch --]
[-- Type: text/plain, Size: 758 bytes --]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
---
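To illustrate what the new comment describes, here is a small userspace
model of the dependency between the private flag and the release hook. It is
a sketch under assumptions: the toy_ names and the struct layout are
invented for the example, and the real reclaim path is considerably more
involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page;

struct address_space_ops {
	int (*releasepage)(struct page *page); /* returns 1 if the page may go */
};

struct page {
	bool has_private;                    /* models PG_private */
	const struct address_space_ops *ops; /* shortcut for mapping->a_ops */
	int release_calls;                   /* bookkeeping for the example */
};

/* Toy filesystem hook: pretend outstanding requests were written back. */
int toy_fs_releasepage(struct page *page)
{
	page->release_calls++;
	return 1;
}

/*
 * Toy reclaim step: the filesystem is consulted only when the page is
 * flagged as carrying private state.
 */
int toy_try_to_free(struct page *page)
{
	if (!page->has_private)
		return 1; /* nothing attached: freed without asking the fs */
	if (page->ops && page->ops->releasepage)
		return page->ops->releasepage(page);
	return 0; /* simplification in this toy: keep the page */
}
```

In this model a page without the flag is freed without the filesystem ever
being asked, which is the situation the comment says SetPagePrivate() is
there to prevent.
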
 fs/nfs/write.c |    5 +++++
 1 file changed, 5 insertions(+)

Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -424,6 +424,11 @@ static int nfs_inode_add_request(struct 
 		if (nfs_have_delegation(inode, FMODE_WRITE))
 			nfsi->change_attr++;
 	}
+	/*
+	 * The PG_private bit is unfortunately needed if we want to fix the
+	 * hole in the mmap semantics. If we do not set it, then the VM will
+	 * fail to call the "releasepage" address ops.
+	 */
 	SetPagePrivate(req->wb_page);
 	nfsi->npages++;
 	atomic_inc(&req->wb_count);

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 08/21] nfs: enable swap on NFS
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Trond Myklebust

[-- Attachment #1: nfs_swapfile.patch --]
[-- Type: text/plain, Size: 1057 bytes --]

Now that NFS can handle swap cache pages, add a swapfile method to allow
swapping over NFS.

NOTE: this dummy method is obviously not enough to make it safe.
A more complete version of the nfs_swapfile() function will be present
in the next VM deadlock avoidance patches.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 fs/nfs/file.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -321,6 +321,11 @@ static int nfs_release_page(struct page 
 		return 0;
 }
 
+static int nfs_swapfile(struct address_space *mapping, int enable)
+{
+	return 0;
+}
+
 const struct address_space_operations nfs_file_aops = {
 	.readpage = nfs_readpage,
 	.readpages = nfs_readpages,
@@ -334,6 +339,7 @@ const struct address_space_operations nf
 #ifdef CONFIG_NFS_DIRECTIO
 	.direct_IO = nfs_direct_IO,
 #endif
+	.swapfile = nfs_swapfile,
 };
 
 /* 

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 09/21] nfs: make swap on NFS robust
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Trond Myklebust

[-- Attachment #1: nfs_vmio.patch --]
[-- Type: text/plain, Size: 5265 bytes --]

Provide a proper a_ops->swapfile() implementation for NFS. It marks the NFS
socket SOCK_VMIO, runs socket reconnect under PF_MEMALLOC, and sets SOCK_VMIO
again on the new socket before invoking the protocol's ->connect() method.

PF_MEMALLOC should allow the allocation of struct socket and related objects,
and setting SOCK_VMIO early on each (re)connect should allow us to receive the
packets required to build up the TCP connection.

(swapping continues over a server reset during a large (4k) ping flood)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
---
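The save/set/restore pattern used by both connect workers can be modelled in
userspace as below. This is a sketch, not the patch's code: the toy_ names
and the callback type are made up for illustration, and unlike the patch
(which restores the whole flags word) the sketch puts back only the one bit,
a variation chosen here so that unrelated flag changes made during the call
would survive.

```c
#include <assert.h>

#define TOY_PF_MEMALLOC 0x0800UL /* illustrative bit value */
#define TOY_PF_OTHER    0x0001UL /* some unrelated task flag */

struct toy_task { unsigned long flags; };

typedef int (*toy_connect_fn)(struct toy_task *task, void *arg);

/*
 * Run connect() with TOY_PF_MEMALLOC set when the transport is used for
 * swap, then put that one bit back to its previous state.
 */
int toy_connect_as_swapper(struct toy_task *task, int swapper,
			   toy_connect_fn connect, void *arg)
{
	unsigned long saved = task->flags & TOY_PF_MEMALLOC;
	int status;

	if (swapper)
		task->flags |= TOY_PF_MEMALLOC;

	status = connect(task, arg);

	task->flags = (task->flags & ~TOY_PF_MEMALLOC) | saved;
	return status;
}

/* Example callback: records whether the bit was visible during the call. */
int toy_record_flag(struct toy_task *task, void *arg)
{
	*(int *)arg = (task->flags & TOY_PF_MEMALLOC) != 0;
	return 0;
}
```

The point of capturing the old state first is that a worker which already
ran with the bit set must not have it cleared on the way out.
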
 fs/nfs/file.c               |    2 +-
 include/linux/sunrpc/xprt.h |    5 ++++-
 net/sunrpc/sched.c          |    4 ++--
 net/sunrpc/xprtsock.c       |   44 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 51 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -323,7 +323,7 @@ static int nfs_release_page(struct page 
 
 static int nfs_swapfile(struct address_space *mapping, int enable)
 {
-	return 0;
+	return xs_swapper(NFS_CLIENT(mapping->host)->cl_xprt, enable);
 }
 
 const struct address_space_operations nfs_file_aops = {
Index: linux-2.6/net/sunrpc/xprtsock.c
===================================================================
--- linux-2.6.orig/net/sunrpc/xprtsock.c
+++ linux-2.6/net/sunrpc/xprtsock.c
@@ -1014,6 +1014,7 @@ static void xs_udp_connect_worker(void *
 {
 	struct rpc_xprt *xprt = (struct rpc_xprt *) args;
 	struct socket *sock = xprt->sock;
+	unsigned long pflags = current->flags;
 	int err, status = -EIO;
 
 	if (xprt->shutdown || xprt->addr.sin_port == 0)
@@ -1021,6 +1022,9 @@ static void xs_udp_connect_worker(void *
 
 	dprintk("RPC:      xs_udp_connect_worker for xprt %p\n", xprt);
 
+	if (xprt->swapper)
+		current->flags |= PF_MEMALLOC;
+
 	/* Start by resetting any existing state */
 	xs_close(xprt);
 
@@ -1054,6 +1058,9 @@ static void xs_udp_connect_worker(void *
 		xprt->sock = sock;
 		xprt->inet = sk;
 
+		if (xprt->swapper)
+			sk_set_vmio(sk);
+
 		write_unlock_bh(&sk->sk_callback_lock);
 	}
 	xs_udp_do_set_buffer_size(xprt);
@@ -1061,6 +1068,7 @@ static void xs_udp_connect_worker(void *
 out:
 	xprt_wake_pending_tasks(xprt, status);
 	xprt_clear_connecting(xprt);
+	current->flags = pflags;
 }
 
 /*
@@ -1097,11 +1105,15 @@ static void xs_tcp_connect_worker(void *
 {
 	struct rpc_xprt *xprt = (struct rpc_xprt *)args;
 	struct socket *sock = xprt->sock;
+	unsigned long pflags = current->flags;
 	int err, status = -EIO;
 
 	if (xprt->shutdown || xprt->addr.sin_port == 0)
 		goto out;
 
+	if (xprt->swapper)
+		current->flags |= PF_MEMALLOC;
+
 	dprintk("RPC:      xs_tcp_connect_worker for xprt %p\n", xprt);
 
 	if (!xprt->sock) {
@@ -1148,6 +1160,10 @@ static void xs_tcp_connect_worker(void *
 		write_unlock_bh(&sk->sk_callback_lock);
 	}
 
+
+	if (xprt->swapper)
+		sk_set_vmio(xprt->inet);
+
 	/* Tell the socket layer to start connecting... */
 	xprt->stat.connect_count++;
 	xprt->stat.connect_start = jiffies;
@@ -1174,6 +1190,7 @@ out:
 	xprt_wake_pending_tasks(xprt, status);
 out_clear:
 	xprt_clear_connecting(xprt);
+	current->flags = pflags;
 }
 
 /**
@@ -1369,3 +1386,30 @@ int xs_setup_tcp(struct rpc_xprt *xprt, 
 
 	return 0;
 }
+
+/**
+ * xs_swapper - Tag this transport as being used for swap.
+ * @xprt: transport to tag
+ * @enable: enable/disable
+ *
+ */
+int xs_swapper(struct rpc_xprt *xprt, int enable)
+{
+	int err = 0;
+
+	if (enable) {
+		/*
+		 * keep one extra sock reference so the reserve won't dip
+		 * when the socket gets reconnected.
+		 */
+		sk_adjust_memalloc(1, TX_RESERVE_PAGES);
+		sk_set_vmio(xprt->inet);
+		xprt->swapper = 1;
+	} else if (xprt->swapper) {
+		xprt->swapper = 0;
+		sk_clear_vmio(xprt->inet);
+		sk_adjust_memalloc(-1, -TX_RESERVE_PAGES);
+	}
+
+	return err;
+}
Index: linux-2.6/include/linux/sunrpc/xprt.h
===================================================================
--- linux-2.6.orig/include/linux/sunrpc/xprt.h
+++ linux-2.6/include/linux/sunrpc/xprt.h
@@ -147,7 +147,9 @@ struct rpc_xprt {
 	unsigned int		max_reqs;	/* total slots */
 	unsigned long		state;		/* transport state */
 	unsigned char		shutdown   : 1,	/* being shut down */
-				resvport   : 1; /* use a reserved port */
+				resvport   : 1, /* use a reserved port */
+				swapper    : 1; /* we're swapping over this
+						   transport */
 
 	/*
 	 * XID
@@ -261,6 +263,7 @@ void			xprt_disconnect(struct rpc_xprt *
  */
 int			xs_setup_udp(struct rpc_xprt *xprt, struct rpc_timeout *to);
 int			xs_setup_tcp(struct rpc_xprt *xprt, struct rpc_timeout *to);
+int			xs_swapper(struct rpc_xprt *xprt, int enable);
 
 /*
  * Reserved bit positions in xprt->state
Index: linux-2.6/net/sunrpc/sched.c
===================================================================
--- linux-2.6.orig/net/sunrpc/sched.c
+++ linux-2.6/net/sunrpc/sched.c
@@ -736,8 +736,8 @@ void * rpc_malloc(struct rpc_task *task,
 	struct rpc_rqst *req = task->tk_rqstp;
 	gfp_t	gfp;
 
-	if (task->tk_flags & RPC_TASK_SWAPPER)
-		gfp = GFP_ATOMIC;
+	if (RPC_IS_SWAPPER(task))
+		gfp = GFP_ATOMIC | __GFP_EMERGENCY;
 	else
 		gfp = GFP_NOFS;
 

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 10/21] block: elevator selection and pinning
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Jens Axboe, Pavel Machek

[-- Attachment #1: block_queue_init_elv.patch --]
[-- Type: text/plain, Size: 5444 bytes --]

Provide a block queue init function that allows an elevator to be specified,
and a function to pin the current elevator.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Daniel Phillips <phillips@google.com>
CC: Jens Axboe <axboe@suse.de>
CC: Pavel Machek <pavel@ucw.cz>
---
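The effect of pinning on a later switch request can be sketched in userspace
as follows. This only models the order of checks in elv_iosched_switch()
from the diff below; the toy_ names, the struct, and the list of known
schedulers are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_FLAG_ELVPINNED (1UL << 9) /* same bit number as the new flag */

struct toy_queue {
	unsigned long flags;
	const char *elevator; /* name of the active scheduler */
};

/* Illustrative registry of available schedulers. */
static const char *const toy_known[] = { "noop", "deadline", "cfq" };
#define TOY_NR_KNOWN (sizeof(toy_known) / sizeof(toy_known[0]))

void toy_pin_elevator(struct toy_queue *q)
{
	q->flags |= TOY_FLAG_ELVPINNED;
}

int toy_elv_switch(struct toy_queue *q, const char *name)
{
	size_t i;

	/* A pinned queue refuses every switch, whatever the name. */
	if (q->flags & TOY_FLAG_ELVPINNED)
		return -EPERM;

	for (i = 0; i < TOY_NR_KNOWN; i++)
		if (strcmp(name, toy_known[i]) == 0)
			break;
	if (i == TOY_NR_KNOWN)
		return -EINVAL; /* unknown scheduler */

	if (strcmp(name, q->elevator) == 0)
		return -EEXIST; /* already active */

	q->elevator = toy_known[i];
	return 0;
}
```

A driver that depends on one particular elevator would select it when the
queue is created and then pin it, after which requests to change it fail
with -EPERM.
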
 block/elevator.c         |   56 ++++++++++++++++++++++++++++++++++++-----------
 block/ll_rw_blk.c        |   12 ++++++++--
 include/linux/blkdev.h   |    9 +++++++
 include/linux/elevator.h |    1 
 4 files changed, 63 insertions(+), 15 deletions(-)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -1899,6 +1899,14 @@ EXPORT_SYMBOL(blk_init_queue);
 request_queue_t *
 blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 {
+	return blk_init_queue_node_elv(rfn, lock, node_id, NULL);
+}
+EXPORT_SYMBOL(blk_init_queue_node);
+
+request_queue_t *
+blk_init_queue_node_elv(request_fn_proc *rfn, spinlock_t *lock, int node_id,
+		char *elv_name)
+{
 	request_queue_t *q = blk_alloc_queue_node(GFP_KERNEL, node_id);
 
 	if (!q)
@@ -1939,7 +1947,7 @@ blk_init_queue_node(request_fn_proc *rfn
 	/*
 	 * all done
 	 */
-	if (!elevator_init(q, NULL)) {
+	if (!elevator_init(q, elv_name)) {
 		blk_queue_congestion_threshold(q);
 		return q;
 	}
@@ -1947,7 +1955,7 @@ blk_init_queue_node(request_fn_proc *rfn
 	blk_put_queue(q);
 	return NULL;
 }
-EXPORT_SYMBOL(blk_init_queue_node);
+EXPORT_SYMBOL(blk_init_queue_node_elv);
 
 int blk_get_queue(request_queue_t *q)
 {
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h
+++ linux-2.6/include/linux/blkdev.h
@@ -444,6 +444,12 @@ struct request_queue
 #define QUEUE_FLAG_REENTER	6	/* Re-entrancy avoidance */
 #define QUEUE_FLAG_PLUGGED	7	/* queue is plugged */
 #define QUEUE_FLAG_ELVSWITCH	8	/* don't use elevator, just do FIFO */
+#define QUEUE_FLAG_ELVPINNED	9	/* pin the current elevator */
+
+static inline void blk_queue_pin_elevator(struct request_queue *q)
+{
+	set_bit(QUEUE_FLAG_ELVPINNED, &q->queue_flags);
+}
 
 enum {
 	/*
@@ -696,6 +702,9 @@ static inline void elv_dispatch_add_tail
 /*
  * Access functions for manipulating queue properties
  */
+extern request_queue_t *blk_init_queue_node_elv(request_fn_proc *rfn,
+					spinlock_t *lock, int node_id,
+					char *elv_name);
 extern request_queue_t *blk_init_queue_node(request_fn_proc *rfn,
 					spinlock_t *lock, int node_id);
 extern request_queue_t *blk_init_queue(request_fn_proc *, spinlock_t *);
Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c
+++ linux-2.6/block/elevator.c
@@ -856,11 +856,33 @@ fail_register:
 	return 0;
 }
 
+int elv_iosched_switch(request_queue_t *q, const char *elevator_name)
+{
+	struct elevator_type *e;
+
+	if (test_bit(QUEUE_FLAG_ELVPINNED, &q->queue_flags))
+		return -EPERM;
+
+	e = elevator_get(elevator_name);
+	if (!e)
+		return -EINVAL;
+
+	if (!strcmp(elevator_name, q->elevator->elevator_type->elevator_name)) {
+		elevator_put(e);
+		return -EEXIST;
+	}
+
+	if (!elevator_switch(q, e))
+		return -ENOMEM;
+
+	return 0;
+}
+
 ssize_t elv_iosched_store(request_queue_t *q, const char *name, size_t count)
 {
 	char elevator_name[ELV_NAME_MAX];
 	size_t len;
-	struct elevator_type *e;
+	int error;
 
 	elevator_name[sizeof(elevator_name) - 1] = '\0';
 	strncpy(elevator_name, name, sizeof(elevator_name) - 1);
@@ -869,20 +891,27 @@ ssize_t elv_iosched_store(request_queue_
 	if (len && elevator_name[len - 1] == '\n')
 		elevator_name[len - 1] = '\0';
 
-	e = elevator_get(elevator_name);
-	if (!e) {
-		printk(KERN_ERR "elevator: type %s not found\n", elevator_name);
-		return -EINVAL;
-	}
-
-	if (!strcmp(elevator_name, q->elevator->elevator_type->elevator_name)) {
-		elevator_put(e);
-		return count;
+	error = elv_iosched_switch(q, elevator_name);
+	switch (error) {
+		case -EPERM:
+			printk(KERN_NOTICE
+				"elevator: cannot switch elevator, pinned\n");
+			break;
+
+		case -EINVAL:
+			printk(KERN_ERR "elevator: type %s not found\n",
+					elevator_name);
+			break;
+
+		case -ENOMEM:
+			printk(KERN_ERR "elevator: switch to %s failed\n",
+					elevator_name);
+		default:
+			error = 0;
+			break;
 	}
 
-	if (!elevator_switch(q, e))
-		printk(KERN_ERR "elevator: switch to %s failed\n",elevator_name);
-	return count;
+	return error ?: count;
 }
 
 ssize_t elv_iosched_show(request_queue_t *q, char *name)
@@ -914,5 +943,6 @@ EXPORT_SYMBOL(__elv_add_request);
 EXPORT_SYMBOL(elv_next_request);
 EXPORT_SYMBOL(elv_dequeue_request);
 EXPORT_SYMBOL(elv_queue_empty);
+EXPORT_SYMBOL(elv_iosched_switch);
 EXPORT_SYMBOL(elevator_exit);
 EXPORT_SYMBOL(elevator_init);
Index: linux-2.6/include/linux/elevator.h
===================================================================
--- linux-2.6.orig/include/linux/elevator.h
+++ linux-2.6/include/linux/elevator.h
@@ -107,6 +107,7 @@ extern int elv_may_queue(request_queue_t
 extern void elv_completed_request(request_queue_t *, struct request *);
 extern int elv_set_request(request_queue_t *, struct request *, struct bio *, gfp_t);
 extern void elv_put_request(request_queue_t *, struct request *);
+extern int elv_iosched_switch(request_queue_t *, const char *);
 
 /*
  * io scheduler registration

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 11/21] nbd: limit blk_queue
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Pavel Machek

[-- Attachment #1: nbd_queue.patch --]
[-- Type: text/plain, Size: 1347 bytes --]

Limit each request to 1 page, so that request throttling also limits the
number of in-flight pages, and force the I/O scheduler to NOOP, as anything
else doesn't make sense anyway.

(Pavel, I will analyse those !NOOP deadlocks I got; I'm just re-posting so
people can comment on the rest.)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Daniel Phillips <phillips@google.com>
CC: Pavel Machek <pavel@ucw.cz>
---
 drivers/block/nbd.c |   17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)
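
Note (not part of the patch): a small sketch of the arithmetic behind the
one-page limit. If every request carries at most one segment of at most one
page, the bytes in flight are bounded by the number of requests the queue
admits. The struct and function names are hypothetical, and the figures used
with them (4096-byte pages, 128 requests) are assumptions for illustration,
not values taken from this series.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-queue limits; not the kernel's struct. */
struct mock_limits {
	unsigned long max_segment_size;	/* bytes per segment */
	unsigned int max_segments;	/* segments per request */
	unsigned int nr_requests;	/* requests the queue admits at once */
};

/* Upper bound on bytes in flight: every admitted request at maximum size. */
static unsigned long max_inflight_bytes(const struct mock_limits *l)
{
	assert(l);
	return l->max_segment_size * l->max_segments * l->nr_requests;
}
```

With max_segment_size = 4096, max_segments = 1 and nr_requests = 128 the bound
is 128 pages, so throttling on request count alone caps the in-flight memory.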

Index: linux-2.6/drivers/block/nbd.c
===================================================================
--- linux-2.6.orig/drivers/block/nbd.c
+++ linux-2.6/drivers/block/nbd.c
@@ -628,11 +636,16 @@ static int __init nbd_init(void)
 		 * every gendisk to have its very own request_queue struct.
 		 * These structs are big so we dynamically allocate them.
 		 */
-		disk->queue = blk_init_queue(do_nbd_request, &nbd_lock);
+		disk->queue = blk_init_queue_node_elv(do_nbd_request,
+				&nbd_lock, -1, "noop");
 		if (!disk->queue) {
 			put_disk(disk);
 			goto out;
 		}
+		blk_queue_pin_elevator(disk->queue);
+		blk_queue_max_segment_size(disk->queue, PAGE_SIZE);
+		blk_queue_max_hw_segments(disk->queue, 1);
+		blk_queue_max_phys_segments(disk->queue, 1);
 	}
 
 	if (register_blkdev(NBD_MAJOR, "nbd")) {

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 12/21] mm: block device swap notification
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, James E.J. Bottomley, Mike Christie,
	Pavel Machek

[-- Attachment #1: swapdev.patch --]
[-- Type: text/plain, Size: 1681 bytes --]

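Some block devices need to do extra work when used as a swap device. Add a
swapdev() hook to block_device_operations; it is called with enable=1 from
swapon and with enable=0 from swapoff.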

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: James E.J. Bottomley <James.Bottomley@SteelEye.com>
CC: Mike Christie <michaelc@cs.wisc.edu>
CC: Pavel Machek <pavel@ucw.cz>
---
 include/linux/fs.h |    1 +
 mm/swapfile.c      |    7 +++++++
 2 files changed, 8 insertions(+)
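
Note (not part of the patch): a user-space sketch of the optional-hook pattern
used here, assuming a cut-down operations table. All mock_* names are
hypothetical; only the call pattern (hook may be NULL, a negative return
aborts swapon, the return value is ignored on swapoff) follows the diff below.

```c
#include <assert.h>
#include <stddef.h>

struct mock_disk;

/* Hypothetical miniature of block_device_operations with only the new hook. */
struct mock_ops {
	int (*swapdev)(struct mock_disk *disk, int enable);	/* may be NULL */
};

struct mock_disk {
	const struct mock_ops *fops;
	int swap_users;	/* illustrative driver state */
};

/* swapon side: a negative return from the hook aborts activation. */
static int mock_swapon(struct mock_disk *disk)
{
	if (disk->fops->swapdev) {
		int error = disk->fops->swapdev(disk, 1);

		if (error < 0)
			return error;
	}
	return 0;
}

/* swapoff side: the hook's return value is ignored, as in the patch. */
static void mock_swapoff(struct mock_disk *disk)
{
	if (disk->fops->swapdev)
		disk->fops->swapdev(disk, 0);
}

/* Example driver hook: count enable/disable transitions. */
static int mock_driver_swapdev(struct mock_disk *disk, int enable)
{
	disk->swap_users += enable ? 1 : -1;
	assert(disk->swap_users >= 0);
	return 0;
}
```

Drivers that leave the hook unset are unaffected, which is why only the
drivers that need the notification have to be touched.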

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -1017,6 +1017,7 @@ struct block_device_operations {
 	int (*media_changed) (struct gendisk *);
 	int (*revalidate_disk) (struct gendisk *);
 	int (*getgeo)(struct block_device *, struct hd_geometry *);
+	int (*swapdev)(struct gendisk *, int enable);
 	struct module *owner;
 };
 
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -1273,6 +1273,8 @@ asmlinkage long sys_swapoff(const char _
 	inode = mapping->host;
 	if (S_ISBLK(inode->i_mode)) {
 		struct block_device *bdev = I_BDEV(inode);
+		if (bdev->bd_disk->fops->swapdev)
+			bdev->bd_disk->fops->swapdev(bdev->bd_disk, 0);
 		set_blocksize(bdev, p->old_block_size);
 		bd_release(bdev);
 	} else {
@@ -1481,6 +1483,11 @@ asmlinkage long sys_swapon(const char __
 		if (error < 0)
 			goto bad_swap;
 		p->bdev = bdev;
+		if (bdev->bd_disk->fops->swapdev) {
+			error = bdev->bd_disk->fops->swapdev(bdev->bd_disk, 1);
+			if (error < 0)
+				goto bad_swap;
+		}
 	} else if (S_ISREG(inode->i_mode)) {
 		p->bdev = inode->i_sb->s_bdev;
 		mutex_lock(&inode->i_mutex);

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 13/21] nbd: use swapdev hook to make swap deadlock free
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Pavel Machek

[-- Attachment #1: nbd_vmio.patch --]
[-- Type: text/plain, Size: 1636 bytes --]

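Use the swapdev hook to call sk_set_vmio() on the nbd socket when the device
is used as swap, and sk_clear_vmio() when it no longer is. Also set
sk_allocation to GFP_NOIO once, when the socket is attached, instead of on
every transmit.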

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Daniel Phillips <phillips@google.com>
CC: Pavel Machek <pavel@ucw.cz>
---
 drivers/block/nbd.c |   22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)
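
Note (not part of the patch): a user-space sketch of the ordering that
nbd_swapdev() follows in the diff below: the reserve is grown before the
socket is flagged, and the flag is cleared before the reserve is shrunk. The
mock_* names and the value 12 are hypothetical; the real TX_RESERVE_PAGES is
defined elsewhere in the series, and the mock drops the first argument of
sk_adjust_memalloc(), whose meaning is not shown in this patch.

```c
#include <assert.h>

/* Hypothetical value; the real TX_RESERVE_PAGES is defined elsewhere. */
#define MOCK_TX_RESERVE_PAGES 12

struct mock_sock {
	int vmio;	/* models the SOCK_VMIO flag */
};

static int mock_reserve_pages;	/* models the size of the reserve */

/* Adjust the modelled reserve; it must never go negative. */
static void mock_adjust_memalloc(int pages)
{
	mock_reserve_pages += pages;
	assert(mock_reserve_pages >= 0);
}

static int mock_nbd_swapdev(struct mock_sock *sk, int enable)
{
	if (enable) {
		/* Grow the reserve first, then flag the socket. */
		mock_adjust_memalloc(MOCK_TX_RESERVE_PAGES);
		sk->vmio = 1;
	} else {
		/* Clear the flag first, then give the pages back. */
		sk->vmio = 0;
		mock_adjust_memalloc(-MOCK_TX_RESERVE_PAGES);
	}
	return 0;
}
```

One enable followed by one disable returns the modelled reserve to its
starting size, which is the invariant the paired calls are meant to keep.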

Index: linux-2.6/drivers/block/nbd.c
===================================================================
--- linux-2.6.orig/drivers/block/nbd.c
+++ linux-2.6/drivers/block/nbd.c
@@ -135,7 +135,6 @@ static int sock_xmit(struct socket *sock
 	spin_unlock_irqrestore(&current->sighand->siglock, flags);
 
 	do {
-		sock->sk->sk_allocation = GFP_NOIO;
 		iov.iov_base = buf;
 		iov.iov_len = size;
 		msg.msg_name = NULL;
@@ -525,6 +524,7 @@ static int nbd_ioctl(struct inode *inode
 			if (S_ISSOCK(inode->i_mode)) {
 				lo->file = file;
 				lo->sock = SOCKET_I(inode);
+				lo->sock->sk->sk_allocation = GFP_NOIO;
 				error = 0;
 			} else {
 				fput(file);
@@ -594,10 +594,30 @@ static int nbd_ioctl(struct inode *inode
 	return -EINVAL;
 }
 
+static int nbd_swapdev(struct gendisk *disk, int enable)
+{
+	struct nbd_device *lo = disk->private_data;
+
+	if (enable) {
+		sk_adjust_memalloc(0, TX_RESERVE_PAGES);
+		if (!sk_set_vmio(lo->sock->sk))
+			printk(KERN_WARNING
+				"failed to set SOCK_VMIO on NBD socket\n");
+	} else {
+		if (!sk_clear_vmio(lo->sock->sk))
+			printk(KERN_WARNING
+				"failed to clear SOCK_VMIO on NBD socket\n");
+		sk_adjust_memalloc(0, -TX_RESERVE_PAGES);
+	}
+
+	return 0;
+}
+
 static struct block_device_operations nbd_fops =
 {
 	.owner =	THIS_MODULE,
 	.ioctl =	nbd_ioctl,
+	.swapdev =	nbd_swapdev,
 };
 
 /*

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 14/21] uml: enable scsi and add iscsi config
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Jeff Dike, Mike Christie

[-- Attachment #1: uml_iscsi.patch --]
[-- Type: text/plain, Size: 2135 bytes --]

Enable iSCSI on UML. I don't know why SCSI was deemed broken; it works like a
charm.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Jeff Dike <jdike@addtoit.com>
CC: Mike Christie <michaelc@cs.wisc.edu>
---
 arch/um/Kconfig      |    2 +-
 arch/um/Kconfig.scsi |   32 ++++++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/um/Kconfig
===================================================================
--- linux-2.6.orig/arch/um/Kconfig
+++ linux-2.6/arch/um/Kconfig
@@ -286,7 +286,6 @@ source "crypto/Kconfig"
 source "lib/Kconfig"
 
 menu "SCSI support"
-depends on BROKEN
 
 config SCSI
 	tristate "SCSI support"
Index: linux-2.6/arch/um/Kconfig.scsi
===================================================================
--- linux-2.6.orig/arch/um/Kconfig.scsi
+++ linux-2.6/arch/um/Kconfig.scsi
@@ -56,3 +56,35 @@ config SCSI_DEBUG
 	tristate "SCSI debugging host simulator (EXPERIMENTAL)"
 	depends on SCSI
 
+config SCSI_ISCSI_ATTRS
+	tristate "iSCSI Transport Attributes"
+	depends on SCSI && NET
+	help
+	  If you wish to export transport-specific information about
+	  each attached iSCSI device to sysfs, say Y.
+	  Otherwise, say N.
+
+config ISCSI_TCP
+	tristate "iSCSI Initiator over TCP/IP"
+	depends on SCSI && INET
+	select CRYPTO
+	select CRYPTO_MD5
+	select CRYPTO_CRC32C
+	select SCSI_ISCSI_ATTRS
+	help
+	 The iSCSI Driver provides a host with the ability to access storage
+	 through an IP network. The driver uses the iSCSI protocol to transport
+	 SCSI requests and responses over a TCP/IP network between the host
+	 (the "initiator") and "targets".  Architecturally, the iSCSI driver
+	 combines with the host's TCP/IP stack, network drivers, and Network
+	 Interface Card (NIC) to provide the same functions as a SCSI or a
+	 Fibre Channel (FC) adapter driver with a Host Bus Adapter (HBA).
+
+	 To compile this driver as a module, choose M here: the
+	 module will be called iscsi_tcp.
+
+	 The userspace component needed to initialize the driver, documentation,
+	 and sample configuration files can be found here:
+
+	 http://linux-iscsi.sf.net
+

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 15/21] iscsi: kernel side tcp connect
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Mike Christie, Peter Zijlstra

[-- Attachment #1: iscsi_ep_connect.patch --]
[-- Type: text/plain, Size: 8080 bytes --]

Move the TCP connection code from user space into the kernel.
This makes it possible to reconnect without deadlocking.

(This patch requires userspace changes too)

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 drivers/scsi/iscsi_tcp.c            |  128 +++++++++++++++++++++++-------------
 drivers/scsi/iscsi_tcp.h            |    2 
 drivers/scsi/libiscsi.c             |    6 -
 drivers/scsi/scsi_transport_iscsi.c |    4 -
 include/scsi/libiscsi.h             |    4 -
 include/scsi/scsi_transport_iscsi.h |    2 
 6 files changed, 93 insertions(+), 53 deletions(-)
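
Note (not part of the patch): the connect path added below picks the sockaddr
length from the address family and rejects anything that is not IPv4 or IPv6.
A user-space sketch of just that step, using the POSIX socket headers; the
function name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Return the sockaddr length for the family, or -EINVAL if unsupported. */
static int addr_len_for_family(const struct sockaddr *addr)
{
	assert(addr);
	switch (addr->sa_family) {
	case AF_INET:
		return (int)sizeof(struct sockaddr_in);
	case AF_INET6:
		return (int)sizeof(struct sockaddr_in6);
	default:
		return -EINVAL;
	}
}
```

Checking the family before calling connect() means an unsupported address
fails early with -EINVAL instead of being handed to the socket layer with a
made-up length.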

Index: linux-2.6/drivers/scsi/iscsi_tcp.c
===================================================================
--- linux-2.6.orig/drivers/scsi/iscsi_tcp.c
+++ linux-2.6/drivers/scsi/iscsi_tcp.c
@@ -904,7 +904,7 @@ more:
 				goto again;
 			iscsi_conn_failure(conn, ISCSI_ERR_CONN_FAILED);
 			return 0;
-		}
+		}
 
 		memcpy(&recv_digest, conn->data, sizeof(uint32_t));
 		if (recv_digest != tcp_conn->in.datadgst) {
@@ -1062,21 +1062,6 @@ iscsi_conn_set_callbacks(struct iscsi_co
 	write_unlock_bh(&sk->sk_callback_lock);
 }
 
-static void
-iscsi_conn_restore_callbacks(struct iscsi_tcp_conn *tcp_conn)
-{
-	struct sock *sk = tcp_conn->sock->sk;
-
-	/* restore socket callbacks, see also: iscsi_conn_set_callbacks() */
-	write_lock_bh(&sk->sk_callback_lock);
-	sk->sk_user_data    = NULL;
-	sk->sk_data_ready   = tcp_conn->old_data_ready;
-	sk->sk_state_change = tcp_conn->old_state_change;
-	sk->sk_write_space  = tcp_conn->old_write_space;
-	sk->sk_no_check	 = 0;
-	write_unlock_bh(&sk->sk_callback_lock);
-}
-
 /**
  * iscsi_send - generic send routine
  * @sk: kernel's socket
@@ -1304,7 +1289,7 @@ iscsi_tcp_cmd_init(struct iscsi_cmd_task
 		debug_scsi("cmd [itt 0x%x total %d imm_data %d "
 			   "unsol count %d, unsol offset %d]\n",
 			   ctask->itt, ctask->total_length, ctask->imm_count,
-			   ctask->unsol_count, ctask->unsol_offset);
+			   ctask->unsol_count, ctask->unsol_offset);
 	} else
 		tcp_ctask->xmstate = XMSTATE_R_HDR;
 
@@ -1455,7 +1440,7 @@ iscsi_send_padding(struct iscsi_conn *co
 	tcp_ctask->xmstate &= ~XMSTATE_W_PAD;
 	tcp_ctask->xmstate &= ~XMSTATE_W_RESEND_PAD;
 	debug_scsi("sending %d pad bytes for itt 0x%x\n",
-		   tcp_ctask->pad_count, ctask->itt);
+		   tcp_ctask->pad_count, ctask->itt);
 	rc = iscsi_sendpage(conn, &tcp_ctask->sendbuf, &tcp_ctask->pad_count,
 			   &sent);
 	if (rc) {
@@ -1484,7 +1469,7 @@ iscsi_send_digest(struct iscsi_conn *con
 		iscsi_buf_init_iov(buf, (char*)digest, 4);
 	}
 	tcp_ctask->xmstate &= ~XMSTATE_W_RESEND_DATA_DIGEST;
-
+
 	rc = iscsi_sendpage(conn, buf, &tcp_ctask->digest_count, &sent);
 	if (!rc)
 		debug_scsi("sent digest 0x%x for itt 0x%x\n", *digest,
@@ -1593,7 +1578,7 @@ send_hdr:
 		int start = tcp_ctask->sent;
 
 		rc = iscsi_send_data(ctask, &tcp_ctask->sendbuf, &tcp_ctask->sg,
-				     &tcp_ctask->sent, &ctask->data_count,
+				     &tcp_ctask->sent, &ctask->data_count,
 				     &dtask->digestbuf, &dtask->digest);
 		ctask->unsol_count -= tcp_ctask->sent - start;
 		if (rc)
@@ -1741,6 +1726,79 @@ iscsi_tcp_ctask_xmit(struct iscsi_conn *
 	return rc;
 }
 
+static int
+iscsi_tcp_ep_connect(struct sockaddr *dst_addr, int non_blocking,
+		     uint64_t *ep_handle)
+{
+	struct socket *sock;
+	int rc, size, arg = 1, window = 524288;
+
+	rc = sock_create_kern(dst_addr->sa_family, SOCK_STREAM, IPPROTO_TCP,
+			      &sock);
+	if (rc < 0) {
+		printk(KERN_ERR "Could not create socket %d.\n", rc);
+		return rc;
+	}
+	sock->sk->sk_allocation = GFP_ATOMIC;
+/*
+	rc = sock->ops->setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
+				   (char __user *)&arg, sizeof(arg));
+	if (rc) {
+		printk(KERN_ERR "Could not set TCP_NODELAY %d\n", rc);
+		goto release_sock;
+	}
+*/
+	/* should set like nfs */
+	sock_setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
+			(char __user *)&window, sizeof(window));
+	sock_setsockopt(sock, SOL_SOCKET, SO_SNDBUF,
+			(char __user *)&window, sizeof(window));
+
+	if (dst_addr->sa_family == PF_INET)
+		size = sizeof(struct sockaddr_in);
+	else if (dst_addr->sa_family == PF_INET6)
+		size = sizeof(struct sockaddr_in6);
+	else {
+		rc = -EINVAL;
+		goto release_sock;
+	}
+
+	/* TODO we cannot block here */
+	rc = sock->ops->connect(sock, (struct sockaddr *)dst_addr, size,
+				0 /*O_NONBLOCK*/);
+	if (rc == -EINPROGRESS)
+		rc = 0;
+	else if (rc) {
+		printk(KERN_ERR "Could not connect %d\n", rc);
+		goto release_sock;
+	}
+
+	*ep_handle = (uint64_t)(unsigned long)sock;
+	return 0;
+
+release_sock:
+	sock_release(sock);
+	return rc;
+}
+
+static int
+iscsi_tcp_ep_poll(uint64_t ep_handle, int timeout_ms)
+{
+	/* we cheated and blocked on the connect (TODO must fix) */
+	return 1;
+}
+
+static void
+iscsi_tcp_ep_disconnect(uint64_t ep_handle)
+{
+	struct socket *sock;
+
+	sock = (struct socket *)(unsigned long)ep_handle;
+	if (!sock)
+		return;
+	sock_release(sock);
+}
+
 static struct iscsi_cls_conn *
 iscsi_tcp_conn_create(struct iscsi_cls_session *cls_session, uint32_t conn_idx)
 {
@@ -1788,23 +1846,6 @@ tcp_conn_alloc_fail:
 }
 
 static void
-iscsi_tcp_release_conn(struct iscsi_conn *conn)
-{
-	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
-
-	if (!tcp_conn->sock)
-		return;
-
-	sock_hold(tcp_conn->sock->sk);
-	iscsi_conn_restore_callbacks(tcp_conn);
-	sock_put(tcp_conn->sock->sk);
-
-	sock_release(tcp_conn->sock);
-	tcp_conn->sock = NULL;
-	conn->recv_lock = NULL;
-}
-
-static void
 iscsi_tcp_conn_destroy(struct iscsi_cls_conn *cls_conn)
 {
 	struct iscsi_conn *conn = cls_conn->dd_data;
@@ -1814,7 +1855,6 @@ iscsi_tcp_conn_destroy(struct iscsi_cls_
 	if (conn->hdrdgst_en || conn->datadgst_en)
 		digest = 1;
 
-	iscsi_tcp_release_conn(conn);
 	iscsi_conn_teardown(cls_conn);
 
 	/* now free tcp_conn */
@@ -1835,7 +1875,6 @@ iscsi_tcp_conn_stop(struct iscsi_cls_con
 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
 
 	iscsi_conn_stop(cls_conn, flag);
-	iscsi_tcp_release_conn(conn);
 	tcp_conn->hdr_size = sizeof(struct iscsi_hdr);
 }
 
@@ -2041,13 +2080,11 @@ iscsi_tcp_conn_get_param(struct iscsi_cl
 		sk = tcp_conn->sock->sk;
 		if (sk->sk_family == PF_INET) {
 			inet = inet_sk(sk);
-			len = sprintf(buf, "%u.%u.%u.%u\n",
+			len = sprintf(buf, NIPQUAD_FMT "\n",
 				      NIPQUAD(inet->daddr));
 		} else {
 			np = inet6_sk(sk);
-			len = sprintf(buf,
-				"%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n",
-				NIP6(np->daddr));
+			len = sprintf(buf, NIP6_FMT "\n", NIP6(np->daddr));
 		}
 		mutex_unlock(&conn->xmitmutex);
 		break;
@@ -2185,6 +2222,9 @@ static struct iscsi_transport iscsi_tcp_
 	.get_session_param	= iscsi_session_get_param,
 	.start_conn		= iscsi_conn_start,
 	.stop_conn		= iscsi_tcp_conn_stop,
+	.ep_connect		= iscsi_tcp_ep_connect,
+	.ep_poll		= iscsi_tcp_ep_poll,
+	.ep_disconnect		= iscsi_tcp_ep_disconnect,
 	/* IO */
 	.send_pdu		= iscsi_conn_send_pdu,
 	.get_stats		= iscsi_conn_get_stats,
Index: linux-2.6/drivers/scsi/libiscsi.c
===================================================================
--- linux-2.6.orig/drivers/scsi/libiscsi.c
+++ linux-2.6/drivers/scsi/libiscsi.c
 struct iscsi_session *
 class_to_transport_session(struct iscsi_cls_session *cls_session)
@@ -1609,7 +1609,7 @@ int iscsi_conn_start(struct iscsi_cls_co
 		return -EPERM;
 	}
 
-	if ((session->imm_data_en || !session->initial_r2t_en) &&
+	if ((session->imm_data_en || !session->initial_r2t_en) &&
 	     session->first_burst > session->max_burst) {
 		printk("iscsi: invalid burst lengths: "
 		       "first_burst %d max_burst %d\n",
Index: linux-2.6/include/scsi/libiscsi.h
===================================================================
--- linux-2.6.orig/include/scsi/libiscsi.h
+++ linux-2.6/include/scsi/libiscsi.h
@@ -76,7 +76,7 @@ struct iscsi_mgmt_task {
 	 * Becuae LLDs allocate their hdr differently, this is a pointer to
 	 * that storage. It must be setup at session creation time.
 	 */
-	struct iscsi_hdr	*hdr;
+	struct iscsi_hdr	*hdr;
 	char			*data;		/* mgmt payload */
 	int			data_count;	/* counts data to be sent */
 	uint32_t		itt;		/* this ITT */

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 16/21] iscsi: fixup of the ep_connect patch
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Mike Christie

[-- Attachment #1: iscsi_ep_connect_fix.patch --]
[-- Type: text/plain, Size: 2592 bytes --]

Never hand out kernel pointers, and really never ever ask them back.
Also, iscsi_tcp_conn_bind expects it to be a valid file descriptor.
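
A minimal user-space sketch of the idea, not the kernel code in this patch:
rather than returning a raw pointer as the endpoint handle, hand out a small
integer index into a table owned by the trusted side and validate it when it
comes back, which is roughly what a file descriptor provides. All names below
(handle_alloc, handle_lookup, handle_release, MAX_HANDLES, struct endpoint)
are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HANDLES 16	/* hypothetical table size */

/* Stand-in for a kernel object such as struct socket. */
struct endpoint {
	int connected;
};

/* Table owned by the trusted side; callers only ever see indices. */
static struct endpoint *handle_table[MAX_HANDLES];

/* Publish obj under the lowest free index; returns -1 if the table is full. */
static int handle_alloc(struct endpoint *obj)
{
	int i;

	for (i = 0; i < MAX_HANDLES; i++) {
		if (!handle_table[i]) {
			handle_table[i] = obj;
			return i;
		}
	}
	return -1;
}

/* Validate an untrusted handle; returns NULL for anything not handed out. */
static struct endpoint *handle_lookup(uint64_t handle)
{
	if (handle >= MAX_HANDLES)
		return NULL;
	return handle_table[handle];
}

/* Drop the table entry; returns 0 on success, -1 for a bogus handle. */
static int handle_release(uint64_t handle)
{
	if (!handle_lookup(handle))
		return -1;
	handle_table[handle] = NULL;
	return 0;
}
```

With a pointer-valued handle, a forged or stale value gets dereferenced
directly; with an index, the worst case is a failed lookup.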

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Mike Christie <michaelc@cs.wisc.edu>
---
 drivers/scsi/iscsi_tcp.c |   34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

Index: linux-2.6/drivers/scsi/iscsi_tcp.c
===================================================================
--- linux-2.6.orig/drivers/scsi/iscsi_tcp.c
+++ linux-2.6/drivers/scsi/iscsi_tcp.c
@@ -35,6 +35,8 @@
 #include <linux/kfifo.h>
 #include <linux/scatterlist.h>
 #include <linux/mutex.h>
+#include <linux/syscalls.h>
+#include <linux/file.h>
 #include <net/tcp.h>
 #include <scsi/scsi_cmnd.h>
 #include <scsi/scsi_host.h>
@@ -1773,7 +1775,10 @@ iscsi_tcp_ep_connect(struct sockaddr *ds
 		goto release_sock;
 	}
 
-	*ep_handle = (uint64_t)(unsigned long)sock;
+	rc = sock_map_fd(sock);
+	if (rc < 0)
+		goto release_sock;
+	*ep_handle = (uint64_t)rc;
 	return 0;
 
 release_sock:
@@ -1791,12 +1796,7 @@ iscsi_tcp_ep_poll(uint64_t ep_handle, in
 static void
 iscsi_tcp_ep_disconnect(uint64_t ep_handle)
 {
-	struct socket *sock;
-
-	sock = (struct socket *)(unsigned long)ep_handle;
-	if (!sock)
-		return;
-	sock_release(sock);
+	sys_close(ep_handle);
 }
 
 static struct iscsi_cls_conn *
@@ -1846,6 +1846,19 @@ tcp_conn_alloc_fail:
 }
 
 static void
+iscsi_tcp_release_conn(struct iscsi_conn *conn)
+{
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+
+	if (!tcp_conn->sock)
+		return;
+
+	fput(tcp_conn->sock->file);
+	tcp_conn->sock = NULL;
+	conn->recv_lock = NULL;
+}
+
+static void
 iscsi_tcp_conn_destroy(struct iscsi_cls_conn *cls_conn)
 {
 	struct iscsi_conn *conn = cls_conn->dd_data;
@@ -1855,6 +1868,7 @@ iscsi_tcp_conn_destroy(struct iscsi_cls_
 	if (conn->hdrdgst_en || conn->datadgst_en)
 		digest = 1;
 
+	iscsi_tcp_release_conn(conn);
 	iscsi_conn_teardown(cls_conn);
 
 	/* now free tcp_conn */
@@ -1875,6 +1889,7 @@ iscsi_tcp_conn_stop(struct iscsi_cls_con
 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
 
 	iscsi_conn_stop(cls_conn, flag);
+	iscsi_tcp_release_conn(conn);
 	tcp_conn->hdr_size = sizeof(struct iscsi_hdr);
 }
 
@@ -1895,10 +1910,13 @@ iscsi_tcp_conn_bind(struct iscsi_cls_ses
 		printk(KERN_ERR "iscsi_tcp: sockfd_lookup failed %d\n", err);
 		return -EEXIST;
 	}
+	get_file(sock->file);
 
 	err = iscsi_conn_bind(cls_session, cls_conn, is_leading);
-	if (err)
+	if (err) {
+		fput(sock->file);
 		return err;
+	}
 
 	/* bind iSCSI connection and socket */
 	tcp_conn->sock = sock;

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 17/21] iscsi: add session context to ep_connect
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Mike Christie

[-- Attachment #1: iscsi_ep_connect_session.patch --]
[-- Type: text/plain, Size: 3622 bytes --]

In order to do a proper reconnect we need to know if we're a swapper.
Only the session context can tell us that.

(This patch breaks the NETLINK_ISCSI ABI, userspace also needs a change)
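
A rough user-space sketch of the flow this enables, under stated assumptions:
the connect request carries a session id, the receiver resolves it to a
session object, and the connect path can then consult per-session state. The
swapdev flag, the session table and connect_policy() are hypothetical. The
NULL check on the lookup result is also an addition of this sketch; the patch
as posted passes the lookup result straight to ep_connect.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct iscsi_cls_session; the swapdev flag is hypothetical. */
struct session {
	uint32_t sid;
	int swapdev;
};

/* Hypothetical session table; the kernel keeps its own list. */
static struct session sessions[] = {
	{ .sid = 1, .swapdev = 0 },
	{ .sid = 2, .swapdev = 1 },
};

/* Resolve a session id to its object, or NULL if the id is unknown. */
static struct session *session_lookup(uint32_t sid)
{
	size_t i;

	for (i = 0; i < sizeof(sessions) / sizeof(sessions[0]); i++) {
		if (sessions[i].sid == sid)
			return &sessions[i];
	}
	return NULL;
}

/* Connect request as carried in the message, mirroring the two fields. */
struct connect_msg {
	uint32_t non_blocking;
	uint32_t sid;
};

/*
 * Returns 1 if the connection backs a swap device, 0 if it does not,
 * and -EINVAL if the message names a session that does not exist.
 */
static int connect_policy(const struct connect_msg *msg)
{
	struct session *s = session_lookup(msg->sid);

	if (!s)
		return -EINVAL;
	return s->swapdev;
}
```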

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Mike Christie <michaelc@cs.wisc.edu>
---
 drivers/infiniband/ulp/iser/iscsi_iser.c |    3 ++-
 drivers/scsi/iscsi_tcp.c                 |    3 ++-
 drivers/scsi/scsi_transport_iscsi.c      |    4 +++-
 include/scsi/iscsi_if.h                  |    1 +
 include/scsi/scsi_transport_iscsi.h      |    3 ++-
 5 files changed, 10 insertions(+), 4 deletions(-)

Index: linux-2.6/drivers/scsi/iscsi_tcp.c
===================================================================
--- linux-2.6.orig/drivers/scsi/iscsi_tcp.c
+++ linux-2.6/drivers/scsi/iscsi_tcp.c
@@ -1728,7 +1728,8 @@ iscsi_tcp_ctask_xmit(struct iscsi_conn *
 }
 
 static int
-iscsi_tcp_ep_connect(struct sockaddr *dst_addr, int non_blocking,
+iscsi_tcp_ep_connect(struct iscsi_cls_session *cls_session,
+		     struct sockaddr *dst_addr, int non_blocking,
 		     uint64_t *ep_handle)
 {
 	struct socket *sock;
Index: linux-2.6/drivers/scsi/scsi_transport_iscsi.c
===================================================================
--- linux-2.6.orig/drivers/scsi/scsi_transport_iscsi.c
+++ linux-2.6/drivers/scsi/scsi_transport_iscsi.c
@@ -914,6 +914,7 @@ iscsi_if_transport_ep(struct iscsi_trans
 		      struct iscsi_uevent *ev, int msg_type)
 {
 	struct sockaddr *dst_addr;
+	struct iscsi_cls_session *session;
 	int rc = 0;
 
 	switch (msg_type) {
@@ -922,7 +923,8 @@ iscsi_if_transport_ep(struct iscsi_trans
 			return -EINVAL;
 
 		dst_addr = (struct sockaddr *)((char*)ev + sizeof(*ev));
-		rc = transport->ep_connect(dst_addr,
+		session = iscsi_session_lookup(ev->u.ep_connect.sid);
+		rc = transport->ep_connect(session, dst_addr,
 					   ev->u.ep_connect.non_blocking,
 					   &ev->r.ep_connect_ret.handle);
 		break;
Index: linux-2.6/include/scsi/iscsi_if.h
===================================================================
--- linux-2.6.orig/include/scsi/iscsi_if.h
+++ linux-2.6/include/scsi/iscsi_if.h
@@ -117,6 +117,7 @@ struct iscsi_uevent {
 		} get_stats;
 		struct msg_transport_connect {
 			uint32_t	non_blocking;
+			uint32_t	sid;
 		} ep_connect;
 		struct msg_transport_poll {
 			uint64_t	ep_handle;
Index: linux-2.6/include/scsi/scsi_transport_iscsi.h
===================================================================
--- linux-2.6.orig/include/scsi/scsi_transport_iscsi.h
+++ linux-2.6/include/scsi/scsi_transport_iscsi.h
@@ -120,7 +120,8 @@ struct iscsi_transport {
 	int (*xmit_mgmt_task) (struct iscsi_conn *conn,
 			       struct iscsi_mgmt_task *mtask);
 	void (*session_recovery_timedout) (struct iscsi_cls_session *session);
-	int (*ep_connect) (struct sockaddr *dst_addr, int non_blocking,
+	int (*ep_connect) (struct iscsi_cls_session *session,
+			   struct sockaddr *dst_addr, int non_blocking,
 			   uint64_t *ep_handle);
 	int (*ep_poll) (uint64_t ep_handle, int timeout_ms);
 	void (*ep_disconnect) (uint64_t ep_handle);
Index: linux-2.6/drivers/infiniband/ulp/iser/iscsi_iser.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ linux-2.6/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -490,7 +490,8 @@ iscsi_iser_conn_get_stats(struct iscsi_c
 }
 
 static int
-iscsi_iser_ep_connect(struct sockaddr *dst_addr, int non_blocking,
+iscsi_iser_ep_connect(struct iscsi_cls_session *cls_session,
+		      struct sockaddr *dst_addr, int non_blocking,
 		      __u64 *ep_handle)
 {
 	int err;

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 18/21] scsi: propagate the swapdev hook into the scsi stack
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, James E.J. Bottomley, Mike Christie

[-- Attachment #1: scsi_swapdev.patch --]
[-- Type: text/plain, Size: 1648 bytes --]

Allow scsi devices to receive the swapdev notification.
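
The dispatch pattern used here can be sketched in plain user-space C: the
template carries an optional function pointer, and the caller forwards the
notification only when a driver filled it in, treating a missing hook as
success. The types and the demo_swapdev() hook below are hypothetical
stand-ins, not the scsi structures themselves.

```c
#include <assert.h>
#include <stddef.h>

struct device;

/* Stand-in for scsi_host_template: the hook is optional and may be NULL. */
struct host_template {
	int (*swapdev)(struct device *dev, int enable);
};

struct device {
	const struct host_template *hostt;
	int swap_users;
};

/* Forward the notification if the hook exists; report success otherwise. */
static int notify_swapdev(struct device *dev, int enable)
{
	int error = 0;

	if (dev->hostt->swapdev)
		error = dev->hostt->swapdev(dev, enable);

	return error;
}

/* Hypothetical driver hook: counts swap users, rejects unbalanced disable. */
static int demo_swapdev(struct device *dev, int enable)
{
	if (enable) {
		dev->swap_users++;
		return 0;
	}
	if (dev->swap_users == 0)
		return -1;
	dev->swap_users--;
	return 0;
}

static const struct host_template with_hook = { .swapdev = demo_swapdev };
static const struct host_template without_hook = { .swapdev = NULL };
```

Drivers that do not care about swap simply leave the pointer unset, so
existing host templates keep working unchanged.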

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: James E.J. Bottomley <James.Bottomley@SteelEye.com>
CC: Mike Christie <michaelc@cs.wisc.edu>
---
 drivers/scsi/sd.c        |   13 +++++++++++++
 include/scsi/scsi_host.h |    7 +++++++
 2 files changed, 20 insertions(+)

Index: linux-2.6/drivers/scsi/sd.c
===================================================================
--- linux-2.6.orig/drivers/scsi/sd.c
+++ linux-2.6/drivers/scsi/sd.c
@@ -892,6 +892,18 @@ static long sd_compat_ioctl(struct file 
 }
 #endif
 
+static int sd_swapdev(struct gendisk *disk, int enable)
+{
+	int error = 0;
+	struct scsi_disk *sdkp = scsi_disk(disk);
+	struct scsi_device *sdp = sdkp->device;
+
+	if (sdp->host->hostt->swapdev)
+		error = sdp->host->hostt->swapdev(sdp, enable);
+
+	return error;
+}
+
 static struct block_device_operations sd_fops = {
 	.owner			= THIS_MODULE,
 	.open			= sd_open,
@@ -903,6 +915,7 @@ static struct block_device_operations sd
 #endif
 	.media_changed		= sd_media_changed,
 	.revalidate_disk	= sd_revalidate_disk,
+	.swapdev		= sd_swapdev,
 };
 
 /**
Index: linux-2.6/include/scsi/scsi_host.h
===================================================================
--- linux-2.6.orig/include/scsi/scsi_host.h
+++ linux-2.6/include/scsi/scsi_host.h
@@ -288,6 +288,13 @@ struct scsi_host_template {
 	int (*suspend)(struct scsi_device *, pm_message_t state);
 
 	/*
+	 * Notify that this device is used for swapping.
+	 *
+	 * Status: OPTIONAL
+	 */
+	int (*swapdev)(struct scsi_device *, int enable);
+
+	/*
 	 * Name of proc directory
 	 */
 	char *proc_name;

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 19/21] netlink: add SOCK_VMIO support to AF_NETLINK
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Mike Christie

[-- Attachment #1: netlink_vmio.patch --]
[-- Type: text/plain, Size: 3473 bytes --]

Propagate SOCK_VMIO from kernel socket to userspace sockets.
Allow sys_{send,recv}msg to succeed under memory pressure for
SOCK_VMIO netlink sockets.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Mike Christie <michaelc@cs.wisc.edu>
---
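Note for readers: the socket.c and netlink_broadcast() hunks replace a
hard-coded GFP_KERNEL with the socket's own sk_allocation, so a socket marked
for VM I/O can carry its emergency bit into the allocator. Below is a rough
userspace model of why that matters. The flag values, the pool structure and
all model_* names are assumptions made for this sketch, not kernel
definitions.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_GFP_KERNEL	0x1u	/* hypothetical: ordinary allocation */
#define MODEL_GFP_EMERGENCY	0x2u	/* hypothetical: may dip into the reserve */

struct model_sock {
	unsigned int sk_allocation;	/* per-socket allocation mask */
};

struct model_pool {
	size_t free_pages;
	size_t min_pages;	/* watermark kept back for emergency users */
};

/* Returns 1 and takes a page if the request is allowed, else 0. */
static int model_alloc_page(struct model_pool *pool, unsigned int gfp)
{
	size_t floor;

	assert(pool != NULL);
	floor = (gfp & MODEL_GFP_EMERGENCY) ? 0 : pool->min_pages;

	if (pool->free_pages <= floor)
		return 0;
	pool->free_pages--;
	return 1;
}

/* Before the patch: the mask is fixed, the socket's wishes are ignored. */
static int model_send_fixed(struct model_pool *pool, const struct model_sock *sk)
{
	(void)sk;
	return model_alloc_page(pool, MODEL_GFP_KERNEL);
}

/* After the patch: the mask comes from the socket. */
static int model_send_per_socket(struct model_pool *pool,
				 const struct model_sock *sk)
{
	return model_alloc_page(pool, sk->sk_allocation);
}
```

With the pool at its watermark, only the per-socket variant lets a socket
carrying the emergency bit make progress.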
 include/linux/netlink.h  |    1 +
 net/netlink/af_netlink.c |    8 +++++---
 net/socket.c             |    6 +++---
 3 files changed, 9 insertions(+), 6 deletions(-)

Index: linux-2.6/net/netlink/af_netlink.c
===================================================================
--- linux-2.6.orig/net/netlink/af_netlink.c
+++ linux-2.6/net/netlink/af_netlink.c
@@ -199,7 +199,7 @@ netlink_unlock_table(void)
 		wake_up(&nl_table_wait);
 }
 
-static __inline__ struct sock *netlink_lookup(int protocol, u32 pid)
+__inline__ struct sock *netlink_lookup(int protocol, u32 pid)
 {
 	struct nl_pid_hash *hash = &nl_table[protocol].hash;
 	struct hlist_head *head;
@@ -1147,7 +1147,7 @@ static int netlink_sendmsg(struct kiocb 
 	if (len > sk->sk_sndbuf - 32)
 		goto out;
 	err = -ENOBUFS;
-	skb = alloc_skb(len, GFP_KERNEL);
+	skb = __alloc_skb(len, GFP_KERNEL, SKB_ALLOC_RX);
 	if (skb==NULL)
 		goto out;
 
@@ -1178,7 +1178,8 @@ static int netlink_sendmsg(struct kiocb 
 
 	if (dst_group) {
 		atomic_inc(&skb->users);
-		netlink_broadcast(sk, skb, dst_pid, dst_group, GFP_KERNEL);
+		netlink_broadcast(sk, skb, dst_pid, dst_group,
+				sk->sk_allocation);
 	}
 	err = netlink_unicast(sk, skb, dst_pid, msg->msg_flags&MSG_DONTWAIT);
 
@@ -1788,6 +1789,7 @@ panic:
 
 core_initcall(netlink_proto_init);
 
+EXPORT_SYMBOL(netlink_lookup);
 EXPORT_SYMBOL(netlink_ack);
 EXPORT_SYMBOL(netlink_run_queue);
 EXPORT_SYMBOL(netlink_queue_skip);
Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c
+++ linux-2.6/net/socket.c
@@ -1790,7 +1790,7 @@ asmlinkage long sys_sendmsg(int fd, stru
 	err = -ENOMEM;
 	iov_size = msg_sys.msg_iovlen * sizeof(struct iovec);
 	if (msg_sys.msg_iovlen > UIO_FASTIOV) {
-		iov = sock_kmalloc(sock->sk, iov_size, GFP_KERNEL);
+		iov = sock_kmalloc(sock->sk, iov_size, sock->sk->sk_allocation);
 		if (!iov)
 			goto out_put;
 	}
@@ -1818,7 +1818,7 @@ asmlinkage long sys_sendmsg(int fd, stru
 	} else if (ctl_len) {
 		if (ctl_len > sizeof(ctl))
 		{
-			ctl_buf = sock_kmalloc(sock->sk, ctl_len, GFP_KERNEL);
+			ctl_buf = sock_kmalloc(sock->sk, ctl_len, sock->sk->sk_allocation);
 			if (ctl_buf == NULL) 
 				goto out_freeiov;
 		}
@@ -1891,7 +1891,7 @@ asmlinkage long sys_recvmsg(int fd, stru
 	err = -ENOMEM;
 	iov_size = msg_sys.msg_iovlen * sizeof(struct iovec);
 	if (msg_sys.msg_iovlen > UIO_FASTIOV) {
-		iov = sock_kmalloc(sock->sk, iov_size, GFP_KERNEL);
+		iov = sock_kmalloc(sock->sk, iov_size, sock->sk->sk_allocation);
 		if (!iov)
 			goto out_put;
 	}
Index: linux-2.6/include/linux/netlink.h
===================================================================
--- linux-2.6.orig/include/linux/netlink.h
+++ linux-2.6/include/linux/netlink.h
@@ -150,6 +150,7 @@ struct netlink_skb_parms
 #define NETLINK_CREDS(skb)	(&NETLINK_CB((skb)).creds)
 
 
+extern struct sock *netlink_lookup(int protocol, __u32 pid);
 extern struct sock *netlink_kernel_create(int unit, unsigned int groups, void (*input)(struct sock *sk, int len), struct module *module);
 extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err);
 extern int netlink_has_listeners(struct sock *sk, unsigned int group);

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 20/21] mm: a process flags to avoid blocking allocations
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Mike Christie

[-- Attachment #1: pf_mem_nowait.patch --]
[-- Type: text/plain, Size: 1590 bytes --]

PF_MEM_NOWAIT makes allocations fail rather than block. This is useful for
converting process behaviour to non-blocking.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Mike Christie <michaelc@cs.wisc.edu>
---
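Note for readers: the page_alloc.c hunk folds the task flag into the existing
"may this allocation sleep" decision. Here is a small userspace sketch of that
expression. MODEL_PF_MEM_NOWAIT uses the value from this patch, while
MODEL_GFP_WAIT is an illustrative bit and the model_* helpers are
hypothetical. The scoped helper that preserves an already-set flag is a
variation for illustration only; the patch series itself sets and clears the
flag directly.

```c
#include <assert.h>

#define MODEL_GFP_WAIT		0x10u		/* illustrative stand-in for __GFP_WAIT */
#define MODEL_PF_MEM_NOWAIT	0x40000000u	/* value taken from the patch */

struct model_task {
	unsigned int flags;
};

/*
 * Same expression as the patched __alloc_pages(): the caller may sleep only
 * if the request allows it and the task has not opted out.
 */
static int model_may_wait(unsigned int gfp_mask, const struct model_task *p)
{
	assert(p != 0);
	return (gfp_mask & MODEL_GFP_WAIT) && !(p->flags & MODEL_PF_MEM_NOWAIT);
}

/* Run fn with the flag set, then restore whatever state the flag had before. */
static int model_call_nowait(struct model_task *p,
			     int (*fn)(struct model_task *))
{
	unsigned int saved = p->flags & MODEL_PF_MEM_NOWAIT;
	int ret;

	p->flags |= MODEL_PF_MEM_NOWAIT;
	ret = fn(p);
	p->flags = (p->flags & ~MODEL_PF_MEM_NOWAIT) | saved;
	return ret;
}

/* Probe: would a sleeping-allowed request be permitted to wait right now? */
static int model_probe(struct model_task *p)
{
	return model_may_wait(MODEL_GFP_WAIT, p);
}
```

Inside the scoped call even a request that permits waiting reports that it
may not sleep, so the allocator has to fail it instead.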
 include/linux/sched.h |    1 +
 mm/page_alloc.c       |    4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1056,6 +1056,7 @@ static inline void put_task_struct(struc
 #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
 #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
+#define PF_MEM_NOWAIT	0x40000000	/* Make allocations fail instead of block */
 
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -912,11 +912,11 @@ struct page * fastcall
 __alloc_pages(gfp_t gfp_mask, unsigned int order,
 		struct zonelist *zonelist)
 {
-	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	struct task_struct *p = current;
+	const int wait = (gfp_mask & __GFP_WAIT) && !(p->flags & PF_MEM_NOWAIT);
 	struct zone **z;
 	struct page *page;
 	struct reclaim_state reclaim_state;
-	struct task_struct *p = current;
 	int do_retry;
 	int alloc_flags;
 	int did_some_progress;

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 21/21] iscsi: support for swapping over iSCSI.
  2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
@ 2006-09-06 13:16   ` Peter Zijlstra
  2006-09-06 13:16   ` Peter Zijlstra
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Mike Christie

[-- Attachment #1: iscsi_vmio.patch --]
[-- Type: text/plain, Size: 16729 bytes --]

Implement sht->swapdev() for iSCSI. This method takes care of reserving
the extra memory needed and marking all relevant sockets with SOCK_VMIO.

When used for swapping, TCP socket creation is done under PF_MEMALLOC and
the TCP connect is done with SOCK_VMIO to ensure their success. The netlink
userspace interface is also marked SOCK_VMIO; this ensures that even under
pressure we can still communicate with the daemon (which runs mlockall()ed
and needs no additional memory to operate).

Netlink requests are handled under the new PF_MEM_NOWAIT when a swapper is
present. This ensures that the netlink socket will not block. User-space will
need to retry failed requests.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Mike Christie <michaelc@cs.wisc.edu>
---
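Note for readers: since netlink requests can now fail instead of blocking,
the daemon side needs some form of retry. Below is a minimal sketch of a
bounded retry loop, assuming failures surface as -ENOMEM or -ENOBUFS. The
callback type and every model_* name are hypothetical; no real daemon or
netlink API is shown.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical request callback: returns 0 on success or a negative errno. */
typedef int (*model_request_fn)(void *ctx);

/* Retry only transient memory failures, up to max_attempts tries in total. */
static int model_retry_request(model_request_fn fn, void *ctx, int max_attempts)
{
	int attempt, rc = -EINVAL;

	assert(fn != 0);
	for (attempt = 0; attempt < max_attempts; attempt++) {
		rc = fn(ctx);
		if (rc != -ENOMEM && rc != -ENOBUFS)
			break;	/* success, or an error retrying cannot fix */
	}
	return rc;
}

/* Test double: fails with -ENOMEM a given number of times, then succeeds. */
struct model_flaky {
	int failures_left;
	int calls;
};

static int model_flaky_request(void *ctx)
{
	struct model_flaky *f = ctx;

	f->calls++;
	if (f->failures_left > 0) {
		f->failures_left--;
		return -ENOMEM;
	}
	return 0;
}
```

A real daemon would likely want a back-off between attempts, and should avoid
allocating memory on the retry path, since it is expected to run under
mlockall().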
 drivers/infiniband/ulp/iser/iscsi_iser.c |    2 
 drivers/scsi/iscsi_tcp.c                 |   93 ++++++++++++++++++++++++++++---
 drivers/scsi/libiscsi.c                  |   12 ++--
 drivers/scsi/scsi_transport_iscsi.c      |   41 ++++++++++---
 include/scsi/libiscsi.h                  |    5 +
 include/scsi/scsi_transport_iscsi.h      |    6 +-
 6 files changed, 129 insertions(+), 30 deletions(-)
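Note for readers: two pieces of the diff below are easy to check in isolation.
The iscsi_gfp() macro picks a base mask from the calling context and keeps
only the emergency bit of the socket's mask, and NETLINK_RESERVE_PAGES is
plain arithmetic. A userspace sketch follows. The MODEL_* bit values are
made up for illustration, and the calling context is passed in as a parameter
because in_interrupt() has no userspace equivalent.

```c
#include <assert.h>

#define MODEL_GFP_ATOMIC	0x01u	/* hypothetical bit values */
#define MODEL_GFP_NOIO		0x02u
#define MODEL_GFP_OTHER		0x04u	/* some unrelated bit in the mask */
#define MODEL_GFP_EMERGENCY	0x80u

/* Expression copied from the patch: 5 + 2 * (5 + 31). */
#define MODEL_NETLINK_RESERVE_PAGES	(5 + 2 * (5 + 31))

/*
 * Same structure as iscsi_gfp(): the context picks the base mask, and only
 * the emergency bit is inherited from the socket's allocation mask.
 */
static unsigned int model_iscsi_gfp(int in_irq, unsigned int sk_allocation)
{
	unsigned int base = in_irq ? MODEL_GFP_ATOMIC : MODEL_GFP_NOIO;
	unsigned int gfp = base | (sk_allocation & MODEL_GFP_EMERGENCY);

	/* Nothing but the base mask and the emergency bit may survive. */
	assert((gfp & ~(MODEL_GFP_ATOMIC | MODEL_GFP_NOIO |
			MODEL_GFP_EMERGENCY)) == 0);
	return gfp;
}
```

The netlink part of the reserve works out to 77 pages. TX_RESERVE_PAGES is
defined elsewhere in the series and is deliberately left out of this sketch.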

Index: linux-2.6/drivers/scsi/iscsi_tcp.c
===================================================================
--- linux-2.6.orig/drivers/scsi/iscsi_tcp.c
+++ linux-2.6/drivers/scsi/iscsi_tcp.c
@@ -42,6 +42,7 @@
 #include <scsi/scsi_host.h>
 #include <scsi/scsi.h>
 #include <scsi/scsi_transport_iscsi.h>
+#include <scsi/scsi_device.h>
 
 #include "iscsi_tcp.h"
 
@@ -436,6 +437,7 @@ iscsi_tcp_hdr_recv(struct iscsi_conn *co
 	struct iscsi_session *session = conn->session;
 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
 	uint32_t cdgst, rdgst = 0, itt;
+	gfp_t gfp_mask = iscsi_gfp(tcp_conn->sock->sk->sk_allocation);
 
 	hdr = tcp_conn->in.hdr;
 
@@ -506,7 +508,7 @@ iscsi_tcp_hdr_recv(struct iscsi_conn *co
 			goto copy_hdr;
 
 		spin_lock(&session->lock);
-		rc = __iscsi_complete_pdu(conn, hdr, NULL, 0);
+		rc = __iscsi_complete_pdu(conn, hdr, NULL, 0, gfp_mask);
 		spin_unlock(&session->lock);
 		break;
 	case ISCSI_OP_R2T:
@@ -544,7 +546,7 @@ iscsi_tcp_hdr_recv(struct iscsi_conn *co
 	case ISCSI_OP_LOGOUT_RSP:
 	case ISCSI_OP_NOOP_IN:
 	case ISCSI_OP_SCSI_TMFUNC_RSP:
-		rc = iscsi_complete_pdu(conn, hdr, NULL, 0);
+		rc = iscsi_complete_pdu(conn, hdr, NULL, 0, gfp_mask);
 		break;
 	default:
 		rc = ISCSI_ERR_BAD_OPCODE;
@@ -705,6 +707,7 @@ static int iscsi_scsi_data_in(struct isc
 	struct scsi_cmnd *sc = ctask->sc;
 	struct scatterlist *sg;
 	int i, offset, rc = 0;
+	gfp_t gfp_mask = iscsi_gfp(tcp_conn->sock->sk->sk_allocation);
 
 	BUG_ON((void*)ctask != sc->SCp.ptr);
 
@@ -786,7 +789,7 @@ done:
 			   (long)sc, sc->result, ctask->itt,
 			   tcp_conn->in.hdr->flags);
 		spin_lock(&conn->session->lock);
-		__iscsi_complete_pdu(conn, tcp_conn->in.hdr, NULL, 0);
+		__iscsi_complete_pdu(conn, tcp_conn->in.hdr, NULL, 0, gfp_mask);
 		spin_unlock(&conn->session->lock);
 	}
 
@@ -798,6 +801,7 @@ iscsi_data_recv(struct iscsi_conn *conn)
 {
 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
 	int rc = 0, opcode;
+	gfp_t gfp_mask = iscsi_gfp(tcp_conn->sock->sk->sk_allocation);
 
 	opcode = tcp_conn->in.hdr->opcode & ISCSI_OPCODE_MASK;
 	switch (opcode) {
@@ -819,7 +823,7 @@ iscsi_data_recv(struct iscsi_conn *conn)
 		}
 
 		rc = iscsi_complete_pdu(conn, tcp_conn->in.hdr, conn->data,
-					tcp_conn->in.datalen);
+					tcp_conn->in.datalen, gfp_mask);
 		if (!rc && conn->datadgst_en && opcode != ISCSI_OP_LOGIN_RSP)
 			iscsi_recv_digest_update(tcp_conn, conn->data,
 			  			tcp_conn->in.datalen);
@@ -1735,14 +1739,26 @@ iscsi_tcp_ep_connect(struct iscsi_cls_se
 {
 	struct socket *sock;
 	int rc, size, arg = 1, window = 524288;
+	int swapper = 0;
+	unsigned long pflags = current->flags;
+
+	if (cls_session) {
+		struct iscsi_session *session;
+		session = class_to_transport_session(cls_session);
+		swapper = session->swapper;
+	}
+
+	if (swapper)
+		pflags |= PF_MEMALLOC;
 
 	rc = sock_create_kern(dst_addr->sa_family, SOCK_STREAM, IPPROTO_TCP,
 			      &sock);
 	if (rc < 0) {
 		printk(KERN_ERR "Could not create socket %d.\n", rc);
-		return rc;
+		goto out;
 	}
-	sock->sk->sk_allocation = GFP_ATOMIC;
+	sock->sk->sk_allocation = GFP_ATOMIC; /* used from interrupt context */
+
 /*
 	rc = sock->ops->setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
 				   (char __user *)&arg, sizeof(arg));
@@ -1766,6 +1782,9 @@ iscsi_tcp_ep_connect(struct iscsi_cls_se
 		goto release_sock;
 	}
 
+	if (swapper)
+		sk_set_vmio(sock->sk);
+
 	/* TODO we cannot block here */
 	rc = sock->ops->connect(sock, (struct sockaddr *)dst_addr, size,
 				0 /*O_NONBLOCK*/);
@@ -1780,11 +1799,14 @@ iscsi_tcp_ep_connect(struct iscsi_cls_se
 	if (rc < 0)
 		goto release_sock;
 	*ep_handle = (uint64_t)rc;
-	return 0;
+	rc = 0;
+out:
+	current->flags = pflags;
+	return rc;
 
 release_sock:
 	sock_release(sock);
-	return rc;
+	goto out;
 }
 
 static int
@@ -1926,10 +1948,11 @@ iscsi_tcp_conn_bind(struct iscsi_cls_ses
 	sk = sock->sk;
 	sk->sk_reuse = 1;
 	sk->sk_sndtimeo = 15 * HZ; /* FIXME: make it configurable */
-	sk->sk_allocation = GFP_ATOMIC;
 
 	/* FIXME: disable Nagle's algorithm */
 
+	BUG_ON(!sk_is_vmio(sk) && conn->session->swapper);
+
 	/*
 	 * Intercept TCP callbacks for sendfile like receive
 	 * processing.
@@ -2187,6 +2210,56 @@ static void iscsi_tcp_session_destroy(st
 	iscsi_session_teardown(cls_session);
 }
 
+#define NETLINK_RESERVE_PAGES	(5 + 2 * (5 + 31))
+#define ISCSI_RESERVE_PAGES	(NETLINK_RESERVE_PAGES + TX_RESERVE_PAGES)
+
+static int iscsi_swapdev(struct scsi_device *sdev, int enable)
+{
+	int error = 0;
+	struct Scsi_Host *host;
+	struct iscsi_session *session;
+	struct iscsi_conn *conn;
+	struct sock *sk;
+	int daemon_pid;
+
+	host = sdev->host;
+	session = iscsi_hostdata(host->hostdata);
+	session->swapper = !!enable;
+	daemon_pid = iscsi_if_daemon_pid(session->tt);
+
+	if (enable) {
+		sk_adjust_memalloc(1, ISCSI_RESERVE_PAGES);
+		sk = netlink_lookup(NETLINK_ISCSI, 0);
+		if (sk)
+			sk_set_vmio(sk);
+		sk = netlink_lookup(NETLINK_ISCSI, daemon_pid);
+		if (sk)
+			sk_set_vmio(sk);
+	}
+
+	spin_lock(&session->lock);
+	list_for_each_entry(conn, &session->connections, item) {
+		struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+		if (enable)
+			sk_set_vmio(tcp_conn->sock->sk);
+		else
+			sk_clear_vmio(tcp_conn->sock->sk);
+	}
+	spin_unlock(&session->lock);
+
+	if (!enable) {
+		sk = netlink_lookup(NETLINK_ISCSI, daemon_pid);
+		if (sk)
+			sk_clear_vmio(sk);
+		sk = netlink_lookup(NETLINK_ISCSI, 0);
+		if (sk)
+			sk_clear_vmio(sk);
+		sk_adjust_memalloc(-1, -ISCSI_RESERVE_PAGES);
+	}
+
+	return error;
+}
+
 static struct scsi_host_template iscsi_sht = {
 	.name			= "iSCSI Initiator over TCP/IP",
 	.queuecommand           = iscsi_queuecommand,
@@ -2199,6 +2272,7 @@ static struct scsi_host_template iscsi_s
 	.use_clustering         = DISABLE_CLUSTERING,
 	.proc_name		= "iscsi_tcp",
 	.this_id		= -1,
+	.swapdev		= iscsi_swapdev,
 };
 
 static struct iscsi_transport iscsi_tcp_transport = {
Index: linux-2.6/include/scsi/libiscsi.h
===================================================================
--- linux-2.6.orig/include/scsi/libiscsi.h
+++ linux-2.6/include/scsi/libiscsi.h
@@ -245,6 +245,7 @@ struct iscsi_session {
 	int			mgmtpool_max;	/* size of mgmt array */
 	struct iscsi_mgmt_task	**mgmt_cmds;	/* Original mgmt arr */
 	struct iscsi_queue	mgmtpool;	/* Mgmt PDU's pool */
+	int			swapper;	/* we are used to swap on */
 };
 
 /*
@@ -297,9 +298,9 @@ extern void iscsi_prep_unsolicit_data_pd
 extern int iscsi_conn_send_pdu(struct iscsi_cls_conn *, struct iscsi_hdr *,
 				char *, uint32_t);
 extern int iscsi_complete_pdu(struct iscsi_conn *, struct iscsi_hdr *,
-			      char *, int);
+			      char *, int, gfp_t);
 extern int __iscsi_complete_pdu(struct iscsi_conn *, struct iscsi_hdr *,
-				char *, int);
+				char *, int, gfp_t);
 extern int iscsi_verify_itt(struct iscsi_conn *, struct iscsi_hdr *,
 			    uint32_t *);
 
Index: linux-2.6/drivers/scsi/scsi_transport_iscsi.c
===================================================================
--- linux-2.6.orig/drivers/scsi/scsi_transport_iscsi.c
+++ linux-2.6/drivers/scsi/scsi_transport_iscsi.c
@@ -496,6 +496,13 @@ iscsi_if_transport_lookup(struct iscsi_t
 	return NULL;
 }
 
+int iscsi_if_daemon_pid(struct iscsi_transport *tt)
+{
+	return iscsi_if_transport_lookup(tt)->daemon_pid;
+}
+
+EXPORT_SYMBOL_GPL(iscsi_if_daemon_pid);
+
 static int
 iscsi_broadcast_skb(struct sk_buff *skb, gfp_t gfp)
 {
@@ -527,7 +534,7 @@ iscsi_unicast_skb(struct sk_buff *skb, i
 }
 
 int iscsi_recv_pdu(struct iscsi_cls_conn *conn, struct iscsi_hdr *hdr,
-		   char *data, uint32_t data_size)
+		   char *data, uint32_t data_size, gfp_t gfp_mask)
 {
 	struct nlmsghdr	*nlh;
 	struct sk_buff *skb;
@@ -541,7 +548,7 @@ int iscsi_recv_pdu(struct iscsi_cls_conn
 	if (!priv)
 		return -EINVAL;
 
-	skb = alloc_skb(len, GFP_ATOMIC);
+	skb = alloc_skb(len, gfp_mask);
 	if (!skb) {
 		iscsi_conn_error(conn, ISCSI_ERR_CONN_FAILED);
 		dev_printk(KERN_ERR, &conn->dev, "iscsi: can not deliver "
@@ -576,7 +583,7 @@ void iscsi_conn_error(struct iscsi_cls_c
 	if (!priv)
 		return;
 
-	skb = alloc_skb(len, GFP_ATOMIC);
+	skb = alloc_skb(len, iscsi_gfp(nls->sk_allocation));
 	if (!skb) {
 		dev_printk(KERN_ERR, &conn->dev, "iscsi: gracefully ignored "
 			  "conn error (%d)\n", error);
@@ -591,7 +598,7 @@ void iscsi_conn_error(struct iscsi_cls_c
 	ev->r.connerror.cid = conn->cid;
 	ev->r.connerror.sid = iscsi_conn_get_sid(conn);
 
-	iscsi_broadcast_skb(skb, GFP_ATOMIC);
+	iscsi_broadcast_skb(skb, iscsi_gfp(nls->sk_allocation));
 
 	dev_printk(KERN_INFO, &conn->dev, "iscsi: detected conn error (%d)\n",
 		   error);
@@ -608,7 +615,7 @@ iscsi_if_send_reply(int pid, int seq, in
 	int flags = multi ? NLM_F_MULTI : 0;
 	int t = done ? NLMSG_DONE : type;
 
-	skb = alloc_skb(len, GFP_KERNEL);
+	skb = alloc_skb(len, iscsi_gfp(nls->sk_allocation));
 	/*
 	 * FIXME:
 	 * user is supposed to react on iferror == -ENOMEM;
@@ -649,7 +656,7 @@ iscsi_if_get_stats(struct iscsi_transpor
 	do {
 		int actual_size;
 
-		skbstat = alloc_skb(len, GFP_KERNEL);
+		skbstat = alloc_skb(len, iscsi_gfp(nls->sk_allocation));
 		if (!skbstat) {
 			dev_printk(KERN_ERR, &conn->dev, "iscsi: can not "
 				   "deliver stats: OOM\n");
@@ -711,7 +718,7 @@ int iscsi_if_destroy_session_done(struct
 	session = iscsi_dev_to_session(conn->dev.parent);
 	shost = iscsi_session_to_shost(session);
 
-	skb = alloc_skb(len, GFP_KERNEL);
+	skb = alloc_skb(len, iscsi_gfp(nls->sk_allocation));
 	if (!skb) {
 		dev_printk(KERN_ERR, &conn->dev, "Cannot notify userspace of "
 			  "session creation event\n");
@@ -729,7 +736,7 @@ int iscsi_if_destroy_session_done(struct
 	 * this will occur if the daemon is not up, so we just warn
 	 * the user and when the daemon is restarted it will handle it
 	 */
-	rc = iscsi_broadcast_skb(skb, GFP_KERNEL);
+	rc = iscsi_broadcast_skb(skb, iscsi_gfp(nls->sk_allocation));
 	if (rc < 0)
 		dev_printk(KERN_ERR, &conn->dev, "Cannot notify userspace of "
 			  "session destruction event. Check iscsi daemon\n");
@@ -772,7 +779,7 @@ int iscsi_if_create_session_done(struct 
 	session = iscsi_dev_to_session(conn->dev.parent);
 	shost = iscsi_session_to_shost(session);
 
-	skb = alloc_skb(len, GFP_KERNEL);
+	skb = alloc_skb(len, iscsi_gfp(nls->sk_allocation));
 	if (!skb) {
 		dev_printk(KERN_ERR, &conn->dev, "Cannot notify userspace of "
 			  "session creation event\n");
@@ -790,7 +797,7 @@ int iscsi_if_create_session_done(struct 
 	 * this will occur if the daemon is not up, so we just warn
 	 * the user and when the daemon is restarted it will handle it
 	 */
-	rc = iscsi_broadcast_skb(skb, GFP_KERNEL);
+	rc = iscsi_broadcast_skb(skb, iscsi_gfp(nls->sk_allocation));
 	if (rc < 0)
 		dev_printk(KERN_ERR, &conn->dev, "Cannot notify userspace of "
 			  "session creation event. Check iscsi daemon\n");
@@ -970,6 +977,7 @@ iscsi_if_recv_msg(struct sk_buff *skb, s
 	struct iscsi_cls_session *session;
 	struct iscsi_cls_conn *conn;
 	unsigned long flags;
+	int pid;
 
 	priv = iscsi_if_transport_lookup(iscsi_ptr(ev->transport_handle));
 	if (!priv)
@@ -979,7 +987,15 @@ iscsi_if_recv_msg(struct sk_buff *skb, s
 	if (!try_module_get(transport->owner))
 		return -EINVAL;
 
-	priv->daemon_pid = NETLINK_CREDS(skb)->pid;
+	pid = NETLINK_CREDS(skb)->pid;
+	if (priv->daemon_pid > 0 && priv->daemon_pid != pid) {
+		if (sk_is_vmio(nls)) {
+			struct sock * sk = netlink_lookup(NETLINK_ISCSI, pid);
+			BUG_ON(!sk);
+			WARN_ON(!sk_set_vmio(sk));
+		}
+	}
+	priv->daemon_pid = pid;
 
 	switch (nlh->nlmsg_type) {
 	case ISCSI_UEVENT_CREATE_SESSION:
@@ -1094,7 +1110,10 @@ iscsi_if_rx(struct sock *sk, int len)
 			if (rlen > skb->len)
 				rlen = skb->len;
 
+			if (sk_is_vmio(sk))
+				current->flags |= PF_MEM_NOWAIT;
 			err = iscsi_if_recv_msg(skb, nlh);
+			current->flags &= ~PF_MEM_NOWAIT;
 			if (err) {
 				ev->type = ISCSI_KEVENT_IF_ERROR;
 				ev->iferror = err;
Index: linux-2.6/drivers/scsi/libiscsi.c
===================================================================
--- linux-2.6.orig/drivers/scsi/libiscsi.c
+++ linux-2.6/drivers/scsi/libiscsi.c
@@ -359,7 +359,7 @@ static int iscsi_handle_reject(struct is
  * itt must have been called.
  */
 int __iscsi_complete_pdu(struct iscsi_conn *conn, struct iscsi_hdr *hdr,
-			 char *data, int datalen)
+			 char *data, int datalen, gfp_t gfp_mask)
 {
 	struct iscsi_session *session = conn->session;
 	int opcode = hdr->opcode & ISCSI_OPCODE_MASK, rc = 0;
@@ -424,7 +424,7 @@ int __iscsi_complete_pdu(struct iscsi_co
 			 * login related PDU's exp_statsn is handled in
 			 * userspace
 			 */
-			if (iscsi_recv_pdu(conn->cls_conn, hdr, data, datalen))
+			if (iscsi_recv_pdu(conn->cls_conn, hdr, data, datalen, gfp_mask))
 				rc = ISCSI_ERR_CONN_FAILED;
 			list_del(&mtask->running);
 			if (conn->login_mtask != mtask)
@@ -446,7 +446,7 @@ int __iscsi_complete_pdu(struct iscsi_co
 			}
 			conn->exp_statsn = be32_to_cpu(hdr->statsn) + 1;
 
-			if (iscsi_recv_pdu(conn->cls_conn, hdr, data, datalen))
+			if (iscsi_recv_pdu(conn->cls_conn, hdr, data, datalen, gfp_mask))
 				rc = ISCSI_ERR_CONN_FAILED;
 			list_del(&mtask->running);
 			if (conn->login_mtask != mtask)
@@ -473,7 +473,7 @@ int __iscsi_complete_pdu(struct iscsi_co
 			if (hdr->ttt == ISCSI_RESERVED_TAG)
 				break;
 
-			if (iscsi_recv_pdu(conn->cls_conn, hdr, NULL, 0))
+			if (iscsi_recv_pdu(conn->cls_conn, hdr, NULL, 0, gfp_mask))
 				rc = ISCSI_ERR_CONN_FAILED;
 			break;
 		case ISCSI_OP_REJECT:
@@ -497,12 +497,12 @@ done:
 EXPORT_SYMBOL_GPL(__iscsi_complete_pdu);
 
 int iscsi_complete_pdu(struct iscsi_conn *conn, struct iscsi_hdr *hdr,
-		       char *data, int datalen)
+		       char *data, int datalen, gfp_t gfp_mask)
 {
 	int rc;
 
 	spin_lock(&conn->session->lock);
-	rc = __iscsi_complete_pdu(conn, hdr, data, datalen);
+	rc = __iscsi_complete_pdu(conn, hdr, data, datalen, gfp_mask);
 	spin_unlock(&conn->session->lock);
 	return rc;
 }
Index: linux-2.6/include/scsi/scsi_transport_iscsi.h
===================================================================
--- linux-2.6.orig/include/scsi/scsi_transport_iscsi.h
+++ linux-2.6/include/scsi/scsi_transport_iscsi.h
@@ -140,7 +140,7 @@ extern int iscsi_unregister_transport(st
  */
 extern void iscsi_conn_error(struct iscsi_cls_conn *conn, enum iscsi_err error);
 extern int iscsi_recv_pdu(struct iscsi_cls_conn *conn, struct iscsi_hdr *hdr,
-			  char *data, uint32_t data_size);
+			  char *data, uint32_t data_size, gfp_t gfp_mask);
 
 
 /* Connection's states */
@@ -218,8 +218,12 @@ extern int iscsi_destroy_session(struct 
 extern struct iscsi_cls_conn *iscsi_create_conn(struct iscsi_cls_session *sess,
 					    uint32_t cid);
 extern int iscsi_destroy_conn(struct iscsi_cls_conn *conn);
+extern int iscsi_if_daemon_pid(struct iscsi_transport *tt);
 extern void iscsi_unblock_session(struct iscsi_cls_session *session);
 extern void iscsi_block_session(struct iscsi_cls_session *session);
 
+#define iscsi_gfp(gfp_mask) \
+	((in_interrupt() ? GFP_ATOMIC : GFP_NOIO) | \
+	 (gfp_mask & __GFP_EMERGENCY))
 
 #endif
Index: linux-2.6/drivers/infiniband/ulp/iser/iscsi_iser.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ linux-2.6/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -114,7 +114,7 @@ iscsi_iser_recv(struct iscsi_conn *conn,
 	rc = iscsi_verify_itt(conn, hdr, &ret_itt);
 
 	if (!rc)
-		rc = iscsi_complete_pdu(conn, hdr, rx_data, rx_data_len);
+		rc = iscsi_complete_pdu(conn, hdr, rx_data, rx_data_len, GFP_NOIO);
 
 	if (rc && rc != ISCSI_ERR_NO_SCSI_CMD)
 		goto error;

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 21/21] iscsi: support for swapping over iSCSI.
@ 2006-09-06 13:16   ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-06 13:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Daniel Phillips, Rik van Riel, David Miller, Andrew Morton,
	Peter Zijlstra, Mike Christie

[-- Attachment #1: iscsi_vmio.patch --]
[-- Type: text/plain, Size: 16955 bytes --]

Implement sht->swapdev() for iSCSI. This method takes care of reserving
the extra memory needed and marking all relevant sockets with SOCK_VMIO.
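
As a rough illustration of the enable/disable symmetry in iscsi_swapdev()
further down in this patch, here is a small userspace model. Everything
prefixed model_ is a hypothetical stand-in, not kernel API; the page count
is passed in as a parameter because TX_RESERVE_PAGES is defined elsewhere
in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct model_sock {
	bool vmio;			/* models the SOCK_VMIO mark */
};

struct model_session {
	bool swapper;			/* models session->swapper */
	struct model_sock *socks;	/* models the connection sockets */
	size_t nr_socks;
};

static long reserve_pages;	/* models the global emergency reserve */
static int reserve_users;	/* models the number of reserve users */

/* Models sk_adjust_memalloc(): adjust user count and reserved pages. */
static void model_adjust_memalloc(int users, long pages)
{
	reserve_users += users;
	reserve_pages += pages;
}

/*
 * Models the ordering in iscsi_swapdev(): grow the reserve before any
 * socket is marked, and shrink it only after every mark is cleared.
 */
static int model_swapdev(struct model_session *s, bool enable, long pages)
{
	size_t i;

	s->swapper = enable;

	if (enable)
		model_adjust_memalloc(1, pages);

	for (i = 0; i < s->nr_socks; i++)
		s->socks[i].vmio = enable;

	if (!enable)
		model_adjust_memalloc(-1, -pages);

	return 0;
}
```

In this model, one enable followed by one disable leaves the reserve where
it started; calling enable twice in a row would account the pages twice,
so callers are assumed to pair the calls.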

When used for swapping, TCP socket creation is done under PF_MEMALLOC and
the TCP connect is done with SOCK_VMIO set, to ensure that both succeed. The
netlink userspace interface is also marked SOCK_VMIO; this ensures that even
under memory pressure we can still communicate with the daemon (which runs
under mlockall() and needs no additional memory to operate).
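
The allocation mask used on these paths comes from the iscsi_gfp() macro
added to scsi_transport_iscsi.h below. A userspace model of how it composes
the mask may help; the MODEL_ bit values are made up for illustration, and
it is an assumption here that the emergency bit is present in the socket's
sk_allocation once the socket is marked SOCK_VMIO (that part lives in an
earlier patch of the series).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative bit values only. The real GFP_ATOMIC and GFP_NOIO are
 * defined by the kernel, and __GFP_EMERGENCY by this patch series.
 */
#define MODEL_GFP_ATOMIC	0x01u
#define MODEL_GFP_NOIO		0x02u
#define MODEL_GFP_EMERGENCY	0x80u

/*
 * Models iscsi_gfp(): the base mask depends only on the calling context,
 * and the only bit inherited from the socket is the emergency bit.
 */
static unsigned int model_iscsi_gfp(bool in_irq, unsigned int sk_allocation)
{
	unsigned int base = in_irq ? MODEL_GFP_ATOMIC : MODEL_GFP_NOIO;

	return base | (sk_allocation & MODEL_GFP_EMERGENCY);
}
```

The point of the masking is that whatever base mask the socket carries is
ignored; only the emergency bit is propagated to the skb allocations.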

Netlink requests are handled under the new PF_MEM_NOWAIT when a swapper is
present. This ensures that the netlink socket will not block. User-space will
need to retry failed requests.
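
What such a user-space retry could look like is sketched below. This is not
taken from the iSCSI daemon; retry_on_enomem() and flaky_request() are
hypothetical names, and the only assumption carried over from the kernel
side is that a request which could not get memory is reported as -ENOMEM
(see the FIXME in iscsi_if_send_reply()).

```c
#define _POSIX_C_SOURCE 199309L

#include <assert.h>
#include <errno.h>
#include <time.h>

/* A request callback returns 0 on success or a negative errno value. */
typedef int (*request_fn)(void *ctx);

/*
 * Retry a request while it fails with -ENOMEM, sleeping between attempts
 * with a doubling delay. Any other result is returned immediately.
 */
static int retry_on_enomem(request_fn fn, void *ctx, int max_tries)
{
	struct timespec delay = { 0, 1000000L };	/* start at 1 ms */
	int rc = -ENOMEM;
	int i;

	for (i = 0; i < max_tries; i++) {
		rc = fn(ctx);
		if (rc != -ENOMEM)
			break;
		nanosleep(&delay, NULL);
		if (delay.tv_nsec < 100000000L)		/* stop growing past 100 ms */
			delay.tv_nsec *= 2;
	}
	return rc;
}

/* Hypothetical request that fails a fixed number of times, for testing. */
static int flaky_request(void *ctx)
{
	int *failures_left = ctx;

	if (*failures_left > 0) {
		(*failures_left)--;
		return -ENOMEM;
	}
	return 0;
}
```

A daemon running under mlockall() would want the retry path itself to be
allocation-free, which is why the sketch keeps all state on the stack.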

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Mike Christie <michaelc@cs.wisc.edu>
---
 drivers/infiniband/ulp/iser/iscsi_iser.c |    2 
 drivers/scsi/iscsi_tcp.c                 |   93 ++++++++++++++++++++++++++++---
 drivers/scsi/libiscsi.c                  |   12 ++--
 drivers/scsi/scsi_transport_iscsi.c      |   41 ++++++++++---
 include/scsi/libiscsi.h                  |    5 +
 include/scsi/scsi_transport_iscsi.h      |    6 +-
 6 files changed, 129 insertions(+), 30 deletions(-)

Index: linux-2.6/drivers/scsi/iscsi_tcp.c
===================================================================
--- linux-2.6.orig/drivers/scsi/iscsi_tcp.c
+++ linux-2.6/drivers/scsi/iscsi_tcp.c
@@ -42,6 +42,7 @@
 #include <scsi/scsi_host.h>
 #include <scsi/scsi.h>
 #include <scsi/scsi_transport_iscsi.h>
+#include <scsi/scsi_device.h>
 
 #include "iscsi_tcp.h"
 
@@ -436,6 +437,7 @@ iscsi_tcp_hdr_recv(struct iscsi_conn *co
 	struct iscsi_session *session = conn->session;
 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
 	uint32_t cdgst, rdgst = 0, itt;
+	gfp_t gfp_mask = iscsi_gfp(tcp_conn->sock->sk->sk_allocation);
 
 	hdr = tcp_conn->in.hdr;
 
@@ -506,7 +508,7 @@ iscsi_tcp_hdr_recv(struct iscsi_conn *co
 			goto copy_hdr;
 
 		spin_lock(&session->lock);
-		rc = __iscsi_complete_pdu(conn, hdr, NULL, 0);
+		rc = __iscsi_complete_pdu(conn, hdr, NULL, 0, gfp_mask);
 		spin_unlock(&session->lock);
 		break;
 	case ISCSI_OP_R2T:
@@ -544,7 +546,7 @@ iscsi_tcp_hdr_recv(struct iscsi_conn *co
 	case ISCSI_OP_LOGOUT_RSP:
 	case ISCSI_OP_NOOP_IN:
 	case ISCSI_OP_SCSI_TMFUNC_RSP:
-		rc = iscsi_complete_pdu(conn, hdr, NULL, 0);
+		rc = iscsi_complete_pdu(conn, hdr, NULL, 0, gfp_mask);
 		break;
 	default:
 		rc = ISCSI_ERR_BAD_OPCODE;
@@ -705,6 +707,7 @@ static int iscsi_scsi_data_in(struct isc
 	struct scsi_cmnd *sc = ctask->sc;
 	struct scatterlist *sg;
 	int i, offset, rc = 0;
+	gfp_t gfp_mask = iscsi_gfp(tcp_conn->sock->sk->sk_allocation);
 
 	BUG_ON((void*)ctask != sc->SCp.ptr);
 
@@ -786,7 +789,7 @@ done:
 			   (long)sc, sc->result, ctask->itt,
 			   tcp_conn->in.hdr->flags);
 		spin_lock(&conn->session->lock);
-		__iscsi_complete_pdu(conn, tcp_conn->in.hdr, NULL, 0);
+		__iscsi_complete_pdu(conn, tcp_conn->in.hdr, NULL, 0, gfp_mask);
 		spin_unlock(&conn->session->lock);
 	}
 
@@ -798,6 +801,7 @@ iscsi_data_recv(struct iscsi_conn *conn)
 {
 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
 	int rc = 0, opcode;
+	gfp_t gfp_mask = iscsi_gfp(tcp_conn->sock->sk->sk_allocation);
 
 	opcode = tcp_conn->in.hdr->opcode & ISCSI_OPCODE_MASK;
 	switch (opcode) {
@@ -819,7 +823,7 @@ iscsi_data_recv(struct iscsi_conn *conn)
 		}
 
 		rc = iscsi_complete_pdu(conn, tcp_conn->in.hdr, conn->data,
-					tcp_conn->in.datalen);
+					tcp_conn->in.datalen, gfp_mask);
 		if (!rc && conn->datadgst_en && opcode != ISCSI_OP_LOGIN_RSP)
 			iscsi_recv_digest_update(tcp_conn, conn->data,
 			  			tcp_conn->in.datalen);
@@ -1735,14 +1739,26 @@ iscsi_tcp_ep_connect(struct iscsi_cls_se
 {
 	struct socket *sock;
 	int rc, size, arg = 1, window = 524288;
+	int swapper = 0;
+	unsigned long pflags = current->flags;
+
+	if (cls_session) {
+		struct iscsi_session *session;
+		session = class_to_transport_session(cls_session);
+		swapper = session->swapper;
+	}
+
+	if (swapper)
+		pflags |= PF_MEMALLOC;
 
 	rc = sock_create_kern(dst_addr->sa_family, SOCK_STREAM, IPPROTO_TCP,
 			      &sock);
 	if (rc < 0) {
 		printk(KERN_ERR "Could not create socket %d.\n", rc);
-		return rc;
+		goto out;
 	}
-	sock->sk->sk_allocation = GFP_ATOMIC;
+	sock->sk->sk_allocation = GFP_ATOMIC; /* used from interrupt context */
+
 /*
 	rc = sock->ops->setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
 				   (char __user *)&arg, sizeof(arg));
@@ -1766,6 +1782,9 @@ iscsi_tcp_ep_connect(struct iscsi_cls_se
 		goto release_sock;
 	}
 
+	if (swapper)
+		sk_set_vmio(sock->sk);
+
 	/* TODO we cannot block here */
 	rc = sock->ops->connect(sock, (struct sockaddr *)dst_addr, size,
 				0 /*O_NONBLOCK*/);
@@ -1780,11 +1799,14 @@ iscsi_tcp_ep_connect(struct iscsi_cls_se
 	if (rc < 0)
 		goto release_sock;
 	*ep_handle = (uint64_t)rc;
-	return 0;
+	rc = 0;
+out:
+	current->flags = pflags;
+	return rc;
 
 release_sock:
 	sock_release(sock);
-	return rc;
+	goto out;
 }
 
 static int
@@ -1926,10 +1948,11 @@ iscsi_tcp_conn_bind(struct iscsi_cls_ses
 	sk = sock->sk;
 	sk->sk_reuse = 1;
 	sk->sk_sndtimeo = 15 * HZ; /* FIXME: make it configurable */
-	sk->sk_allocation = GFP_ATOMIC;
 
 	/* FIXME: disable Nagle's algorithm */
 
+	BUG_ON(!sk_is_vmio(sk) && conn->session->swapper);
+
 	/*
 	 * Intercept TCP callbacks for sendfile like receive
 	 * processing.
@@ -2187,6 +2210,56 @@ static void iscsi_tcp_session_destroy(st
 	iscsi_session_teardown(cls_session);
 }
 
+#define NETLINK_RESERVE_PAGES	(5 + 2 * (5 + 31))
+#define ISCSI_RESERVE_PAGES	(NETLINK_RESERVE_PAGES + TX_RESERVE_PAGES)
+
+static int iscsi_swapdev(struct scsi_device *sdev, int enable)
+{
+	int error = 0;
+	struct Scsi_Host *host;
+	struct iscsi_session *session;
+	struct iscsi_conn *conn;
+	struct sock *sk;
+	int daemon_pid;
+
+	host = sdev->host;
+	session = iscsi_hostdata(host->hostdata);
+	session->swapper = !!enable;
+	daemon_pid = iscsi_if_daemon_pid(session->tt);
+
+	if (enable) {
+		sk_adjust_memalloc(1, ISCSI_RESERVE_PAGES);
+		sk = netlink_lookup(NETLINK_ISCSI, 0);
+		if (sk)
+			sk_set_vmio(sk);
+		sk = netlink_lookup(NETLINK_ISCSI, daemon_pid);
+		if (sk)
+			sk_set_vmio(sk);
+	}
+
+	spin_lock(&session->lock);
+	list_for_each_entry(conn, &session->connections, item) {
+		struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+		if (enable)
+			sk_set_vmio(tcp_conn->sock->sk);
+		else
+			sk_clear_vmio(tcp_conn->sock->sk);
+	}
+	spin_unlock(&session->lock);
+
+	if (!enable) {
+		sk = netlink_lookup(NETLINK_ISCSI, daemon_pid);
+		if (sk)
+			sk_clear_vmio(sk);
+		sk = netlink_lookup(NETLINK_ISCSI, 0);
+		if (sk)
+			sk_clear_vmio(sk);
+		sk_adjust_memalloc(-1, -ISCSI_RESERVE_PAGES);
+	}
+
+	return error;
+}
+
 static struct scsi_host_template iscsi_sht = {
 	.name			= "iSCSI Initiator over TCP/IP",
 	.queuecommand           = iscsi_queuecommand,
@@ -2199,6 +2272,7 @@ static struct scsi_host_template iscsi_s
 	.use_clustering         = DISABLE_CLUSTERING,
 	.proc_name		= "iscsi_tcp",
 	.this_id		= -1,
+	.swapdev		= iscsi_swapdev,
 };
 
 static struct iscsi_transport iscsi_tcp_transport = {
Index: linux-2.6/include/scsi/libiscsi.h
===================================================================
--- linux-2.6.orig/include/scsi/libiscsi.h
+++ linux-2.6/include/scsi/libiscsi.h
@@ -245,6 +245,7 @@ struct iscsi_session {
 	int			mgmtpool_max;	/* size of mgmt array */
 	struct iscsi_mgmt_task	**mgmt_cmds;	/* Original mgmt arr */
 	struct iscsi_queue	mgmtpool;	/* Mgmt PDU's pool */
+	int			swapper;	/* we are used to swap on */
 };
 
 /*
@@ -297,9 +298,9 @@ extern void iscsi_prep_unsolicit_data_pd
 extern int iscsi_conn_send_pdu(struct iscsi_cls_conn *, struct iscsi_hdr *,
 				char *, uint32_t);
 extern int iscsi_complete_pdu(struct iscsi_conn *, struct iscsi_hdr *,
-			      char *, int);
+			      char *, int, gfp_t);
 extern int __iscsi_complete_pdu(struct iscsi_conn *, struct iscsi_hdr *,
-				char *, int);
+				char *, int, gfp_t);
 extern int iscsi_verify_itt(struct iscsi_conn *, struct iscsi_hdr *,
 			    uint32_t *);
 
Index: linux-2.6/drivers/scsi/scsi_transport_iscsi.c
===================================================================
--- linux-2.6.orig/drivers/scsi/scsi_transport_iscsi.c
+++ linux-2.6/drivers/scsi/scsi_transport_iscsi.c
@@ -496,6 +496,13 @@ iscsi_if_transport_lookup(struct iscsi_t
 	return NULL;
 }
 
+int iscsi_if_daemon_pid(struct iscsi_transport *tt)
+{
+	return iscsi_if_transport_lookup(tt)->daemon_pid;
+}
+
+EXPORT_SYMBOL_GPL(iscsi_if_daemon_pid);
+
 static int
 iscsi_broadcast_skb(struct sk_buff *skb, gfp_t gfp)
 {
@@ -527,7 +534,7 @@ iscsi_unicast_skb(struct sk_buff *skb, i
 }
 
 int iscsi_recv_pdu(struct iscsi_cls_conn *conn, struct iscsi_hdr *hdr,
-		   char *data, uint32_t data_size)
+		   char *data, uint32_t data_size, gfp_t gfp_mask)
 {
 	struct nlmsghdr	*nlh;
 	struct sk_buff *skb;
@@ -541,7 +548,7 @@ int iscsi_recv_pdu(struct iscsi_cls_conn
 	if (!priv)
 		return -EINVAL;
 
-	skb = alloc_skb(len, GFP_ATOMIC);
+	skb = alloc_skb(len, gfp_mask);
 	if (!skb) {
 		iscsi_conn_error(conn, ISCSI_ERR_CONN_FAILED);
 		dev_printk(KERN_ERR, &conn->dev, "iscsi: can not deliver "
@@ -576,7 +583,7 @@ void iscsi_conn_error(struct iscsi_cls_c
 	if (!priv)
 		return;
 
-	skb = alloc_skb(len, GFP_ATOMIC);
+	skb = alloc_skb(len, iscsi_gfp(nls->sk_allocation));
 	if (!skb) {
 		dev_printk(KERN_ERR, &conn->dev, "iscsi: gracefully ignored "
 			  "conn error (%d)\n", error);
@@ -591,7 +598,7 @@ void iscsi_conn_error(struct iscsi_cls_c
 	ev->r.connerror.cid = conn->cid;
 	ev->r.connerror.sid = iscsi_conn_get_sid(conn);
 
-	iscsi_broadcast_skb(skb, GFP_ATOMIC);
+	iscsi_broadcast_skb(skb, iscsi_gfp(nls->sk_allocation));
 
 	dev_printk(KERN_INFO, &conn->dev, "iscsi: detected conn error (%d)\n",
 		   error);
@@ -608,7 +615,7 @@ iscsi_if_send_reply(int pid, int seq, in
 	int flags = multi ? NLM_F_MULTI : 0;
 	int t = done ? NLMSG_DONE : type;
 
-	skb = alloc_skb(len, GFP_KERNEL);
+	skb = alloc_skb(len, iscsi_gfp(nls->sk_allocation));
 	/*
 	 * FIXME:
 	 * user is supposed to react on iferror == -ENOMEM;
@@ -649,7 +656,7 @@ iscsi_if_get_stats(struct iscsi_transpor
 	do {
 		int actual_size;
 
-		skbstat = alloc_skb(len, GFP_KERNEL);
+		skbstat = alloc_skb(len, iscsi_gfp(nls->sk_allocation));
 		if (!skbstat) {
 			dev_printk(KERN_ERR, &conn->dev, "iscsi: can not "
 				   "deliver stats: OOM\n");
@@ -711,7 +718,7 @@ int iscsi_if_destroy_session_done(struct
 	session = iscsi_dev_to_session(conn->dev.parent);
 	shost = iscsi_session_to_shost(session);
 
-	skb = alloc_skb(len, GFP_KERNEL);
+	skb = alloc_skb(len, iscsi_gfp(nls->sk_allocation));
 	if (!skb) {
 		dev_printk(KERN_ERR, &conn->dev, "Cannot notify userspace of "
 			  "session creation event\n");
@@ -729,7 +736,7 @@ int iscsi_if_destroy_session_done(struct
 	 * this will occur if the daemon is not up, so we just warn
 	 * the user and when the daemon is restarted it will handle it
 	 */
-	rc = iscsi_broadcast_skb(skb, GFP_KERNEL);
+	rc = iscsi_broadcast_skb(skb, iscsi_gfp(nls->sk_allocation));
 	if (rc < 0)
 		dev_printk(KERN_ERR, &conn->dev, "Cannot notify userspace of "
 			  "session destruction event. Check iscsi daemon\n");
@@ -772,7 +779,7 @@ int iscsi_if_create_session_done(struct 
 	session = iscsi_dev_to_session(conn->dev.parent);
 	shost = iscsi_session_to_shost(session);
 
-	skb = alloc_skb(len, GFP_KERNEL);
+	skb = alloc_skb(len, iscsi_gfp(nls->sk_allocation));
 	if (!skb) {
 		dev_printk(KERN_ERR, &conn->dev, "Cannot notify userspace of "
 			  "session creation event\n");
@@ -790,7 +797,7 @@ int iscsi_if_create_session_done(struct 
 	 * this will occur if the daemon is not up, so we just warn
 	 * the user and when the daemon is restarted it will handle it
 	 */
-	rc = iscsi_broadcast_skb(skb, GFP_KERNEL);
+	rc = iscsi_broadcast_skb(skb, iscsi_gfp(nls->sk_allocation));
 	if (rc < 0)
 		dev_printk(KERN_ERR, &conn->dev, "Cannot notify userspace of "
 			  "session creation event. Check iscsi daemon\n");
@@ -970,6 +977,7 @@ iscsi_if_recv_msg(struct sk_buff *skb, s
 	struct iscsi_cls_session *session;
 	struct iscsi_cls_conn *conn;
 	unsigned long flags;
+	int pid;
 
 	priv = iscsi_if_transport_lookup(iscsi_ptr(ev->transport_handle));
 	if (!priv)
@@ -979,7 +987,15 @@ iscsi_if_recv_msg(struct sk_buff *skb, s
 	if (!try_module_get(transport->owner))
 		return -EINVAL;
 
-	priv->daemon_pid = NETLINK_CREDS(skb)->pid;
+	pid = NETLINK_CREDS(skb)->pid;
+	if (priv->daemon_pid > 0 && priv->daemon_pid != pid) {
+		if (sk_is_vmio(nls)) {
+			struct sock * sk = netlink_lookup(NETLINK_ISCSI, pid);
+			BUG_ON(!sk);
+			WARN_ON(!sk_set_vmio(sk));
+		}
+	}
+	priv->daemon_pid = pid;
 
 	switch (nlh->nlmsg_type) {
 	case ISCSI_UEVENT_CREATE_SESSION:
@@ -1094,7 +1110,10 @@ iscsi_if_rx(struct sock *sk, int len)
 			if (rlen > skb->len)
 				rlen = skb->len;
 
+			if (sk_is_vmio(sk))
+				current->flags |= PF_MEM_NOWAIT;
 			err = iscsi_if_recv_msg(skb, nlh);
+			current->flags &= ~PF_MEM_NOWAIT;
 			if (err) {
 				ev->type = ISCSI_KEVENT_IF_ERROR;
 				ev->iferror = err;
Index: linux-2.6/drivers/scsi/libiscsi.c
===================================================================
--- linux-2.6.orig/drivers/scsi/libiscsi.c
+++ linux-2.6/drivers/scsi/libiscsi.c
@@ -359,7 +359,7 @@ static int iscsi_handle_reject(struct is
  * itt must have been called.
  */
 int __iscsi_complete_pdu(struct iscsi_conn *conn, struct iscsi_hdr *hdr,
-			 char *data, int datalen)
+			 char *data, int datalen, gfp_t gfp_mask)
 {
 	struct iscsi_session *session = conn->session;
 	int opcode = hdr->opcode & ISCSI_OPCODE_MASK, rc = 0;
@@ -424,7 +424,7 @@ int __iscsi_complete_pdu(struct iscsi_co
 			 * login related PDU's exp_statsn is handled in
 			 * userspace
 			 */
-			if (iscsi_recv_pdu(conn->cls_conn, hdr, data, datalen))
+			if (iscsi_recv_pdu(conn->cls_conn, hdr, data, datalen, gfp_mask))
 				rc = ISCSI_ERR_CONN_FAILED;
 			list_del(&mtask->running);
 			if (conn->login_mtask != mtask)
@@ -446,7 +446,7 @@ int __iscsi_complete_pdu(struct iscsi_co
 			}
 			conn->exp_statsn = be32_to_cpu(hdr->statsn) + 1;
 
-			if (iscsi_recv_pdu(conn->cls_conn, hdr, data, datalen))
+			if (iscsi_recv_pdu(conn->cls_conn, hdr, data, datalen, gfp_mask))
 				rc = ISCSI_ERR_CONN_FAILED;
 			list_del(&mtask->running);
 			if (conn->login_mtask != mtask)
@@ -473,7 +473,7 @@ int __iscsi_complete_pdu(struct iscsi_co
 			if (hdr->ttt == ISCSI_RESERVED_TAG)
 				break;
 
-			if (iscsi_recv_pdu(conn->cls_conn, hdr, NULL, 0))
+			if (iscsi_recv_pdu(conn->cls_conn, hdr, NULL, 0, gfp_mask))
 				rc = ISCSI_ERR_CONN_FAILED;
 			break;
 		case ISCSI_OP_REJECT:
@@ -497,12 +497,12 @@ done:
 EXPORT_SYMBOL_GPL(__iscsi_complete_pdu);
 
 int iscsi_complete_pdu(struct iscsi_conn *conn, struct iscsi_hdr *hdr,
-		       char *data, int datalen)
+		       char *data, int datalen, gfp_t gfp_mask)
 {
 	int rc;
 
 	spin_lock(&conn->session->lock);
-	rc = __iscsi_complete_pdu(conn, hdr, data, datalen);
+	rc = __iscsi_complete_pdu(conn, hdr, data, datalen, gfp_mask);
 	spin_unlock(&conn->session->lock);
 	return rc;
 }
Index: linux-2.6/include/scsi/scsi_transport_iscsi.h
===================================================================
--- linux-2.6.orig/include/scsi/scsi_transport_iscsi.h
+++ linux-2.6/include/scsi/scsi_transport_iscsi.h
@@ -140,7 +140,7 @@ extern int iscsi_unregister_transport(st
  */
 extern void iscsi_conn_error(struct iscsi_cls_conn *conn, enum iscsi_err error);
 extern int iscsi_recv_pdu(struct iscsi_cls_conn *conn, struct iscsi_hdr *hdr,
-			  char *data, uint32_t data_size);
+			  char *data, uint32_t data_size, gfp_t gfp_mask);
 
 
 /* Connection's states */
@@ -218,8 +218,12 @@ extern int iscsi_destroy_session(struct 
 extern struct iscsi_cls_conn *iscsi_create_conn(struct iscsi_cls_session *sess,
 					    uint32_t cid);
 extern int iscsi_destroy_conn(struct iscsi_cls_conn *conn);
+extern int iscsi_if_daemon_pid(struct iscsi_transport *tt);
 extern void iscsi_unblock_session(struct iscsi_cls_session *session);
 extern void iscsi_block_session(struct iscsi_cls_session *session);
 
+#define iscsi_gfp(gfp_mask) \
+	((in_interrupt() ? GFP_ATOMIC : GFP_NOIO) | \
+	 (gfp_mask & __GFP_EMERGENCY))
 
 #endif
Index: linux-2.6/drivers/infiniband/ulp/iser/iscsi_iser.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ linux-2.6/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -114,7 +114,7 @@ iscsi_iser_recv(struct iscsi_conn *conn,
 	rc = iscsi_verify_itt(conn, hdr, &ret_itt);
 
 	if (!rc)
-		rc = iscsi_complete_pdu(conn, hdr, rx_data, rx_data_len);
+		rc = iscsi_complete_pdu(conn, hdr, rx_data, rx_data_len, GFP_NOIO);
 
 	if (rc && rc != ISCSI_ERR_NO_SCSI_CMD)
 		goto error;

--

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 10/21] block: elevator selection and pinning
  2006-09-06 13:16   ` Peter Zijlstra
@ 2006-09-06 13:46     ` Jens Axboe
  -1 siblings, 0 replies; 55+ messages in thread
From: Jens Axboe @ 2006-09-06 13:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, netdev, Daniel Phillips, Rik van Riel,
	David Miller, Andrew Morton, Pavel Machek

On Wed, Sep 06 2006, Peter Zijlstra wrote:
> Provide an block queue init function that allows to set an elevator. And a 
> function to pin the current elevator.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Daniel Phillips <phillips@google.com>
> CC: Jens Axboe <axboe@suse.de>
> CC: Pavel Machek <pavel@ucw.cz>

Generally I don't think this is the right approach, as what you really
want to do is let the driver say "I want intelligent scheduling" or not.
The type of scheduler is policy that is left with the user, not the
driver.

And this patch seems to do two things, and you don't explain what the
pinning is useful for at all.

So that's 2 for 2 currently, NAK from me.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 11/21] nbd: limit blk_queue
  2006-09-06 13:16   ` Peter Zijlstra
@ 2006-09-06 15:17     ` Erik Mouw
  -1 siblings, 0 replies; 55+ messages in thread
From: Erik Mouw @ 2006-09-06 15:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, netdev, Daniel Phillips, Rik van Riel,
	David Miller, Andrew Morton, Pavel Machek

On Wed, Sep 06, 2006 at 03:16:41PM +0200, Peter Zijlstra wrote:
> -		disk->queue = blk_init_queue(do_nbd_request, &nbd_lock);
> +		disk->queue = blk_init_queue_node_elv(do_nbd_request,
> +				&nbd_lock, -1, "noop");

So what happens if the noop scheduler isn't compiled into the kernel?


Erik

-- 
+-- Erik Mouw -- www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 11/21] nbd: limit blk_queue
  2006-09-06 15:17     ` Erik Mouw
@ 2006-09-06 17:45       ` Jens Axboe
  -1 siblings, 0 replies; 55+ messages in thread
From: Jens Axboe @ 2006-09-06 17:45 UTC (permalink / raw)
  To: Erik Mouw
  Cc: Peter Zijlstra, linux-mm, linux-kernel, netdev, Daniel Phillips,
	Rik van Riel, David Miller, Andrew Morton, Pavel Machek

On Wed, Sep 06 2006, Erik Mouw wrote:
> On Wed, Sep 06, 2006 at 03:16:41PM +0200, Peter Zijlstra wrote:
> > -		disk->queue = blk_init_queue(do_nbd_request, &nbd_lock);
> > +		disk->queue = blk_init_queue_node_elv(do_nbd_request,
> > +				&nbd_lock, -1, "noop");
> 
> So what happens if the noop scheduler isn't compiled into the kernel?

You can't de-select noop, so that cannot happen. But the point is valid
for other choices of io schedulers, which is another reason why this
_elv api addition is a bad idea.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 10/21] block: elevator selection and pinning
  2006-09-06 13:46     ` Jens Axboe
@ 2006-09-07 16:01       ` Peter Zijlstra
  -1 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2006-09-07 16:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-kernel, netdev, Daniel Phillips, Rik van Riel,
	David Miller, Andrew Morton, Pavel Machek

On Wed, 2006-09-06 at 15:46 +0200, Jens Axboe wrote:
> On Wed, Sep 06 2006, Peter Zijlstra wrote:
> > Provide an block queue init function that allows to set an elevator. And a 
> > function to pin the current elevator.
> > 
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Signed-off-by: Daniel Phillips <phillips@google.com>
> > CC: Jens Axboe <axboe@suse.de>
> > CC: Pavel Machek <pavel@ucw.cz>
> 
> Generally I don't think this is the right approach, as what you really
> want to do is let the driver say "I want intelligent scheduling" or not.
> The type of scheduler is policy that is left with the user, not the
> driver.

True, and the only sane value here is NOOP; no other policy would be a
good choice. With this in mind, would you prefer a 'boolean' argument
suggesting we use NOOP over the default scheduler?

(The whole switch API was added so I could reset the policy from the
iSCSI side of things without changing the regular SCSI code. However,
even that doesn't seem to work out; mnc suggested doing it in userspace,
so that API can go too.)

Would you agree that this hint on intelligent scheduling could be used
to set the initial policy? The user can always override it when he
disagrees.

Network block devices like NBD, iSCSI and AoE often talk to virtual
disks; any attempt to be smart is a waste of time.

> And this patch seems to do two things, and you don't explain what the
> pinning is useful for at all.

It was a hack, and it's gone now.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 10/21] block: elevator selection and pinning
  2006-09-07 16:01       ` Peter Zijlstra
@ 2006-09-07 16:33         ` Jens Axboe
  -1 siblings, 0 replies; 55+ messages in thread
From: Jens Axboe @ 2006-09-07 16:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, netdev, Daniel Phillips, Rik van Riel,
	David Miller, Andrew Morton, Pavel Machek

On Thu, Sep 07 2006, Peter Zijlstra wrote:
> On Wed, 2006-09-06 at 15:46 +0200, Jens Axboe wrote:
> > On Wed, Sep 06 2006, Peter Zijlstra wrote:
> > > Provide an block queue init function that allows to set an elevator. And a 
> > > function to pin the current elevator.
> > > 
> > > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > > Signed-off-by: Daniel Phillips <phillips@google.com>
> > > CC: Jens Axboe <axboe@suse.de>
> > > CC: Pavel Machek <pavel@ucw.cz>
> > 
> > Generally I don't think this is the right approach, as what you really
> > want to do is let the driver say "I want intelligent scheduling" or not.
> > The type of scheduler is policy that is left with the user, not the
> > driver.
> 
> True, and the only sane value here is NOOP, any other policy would not
> be a good value. With this in mind would you rather prefer a 'boolean'
> argument suggesting we use NOOP over the default scheduler?

Nope, I don't think it's the right thing to do. Either we want to pass a
type down or a profile of some sort.

For the work you are doing here, just forget about it. It's not like
it's a critical piece, just list in the README (or wherever) that noop
is a good choice and leave it at that.

> Would you agree that this hint on intelligent scheduling could be used
> to set the initial policy, the user can always override when he
> disagrees.
> 
> These network block devices like NBD, iSCSI and AoE often talk to
> virtual disks, any attempt to be smart is a waste of time.

I think you'll find the runtime overhead of them is pretty close and
hard to measure in most setups; it's things like the queue idling and
anticipation that AS/CFQ might do that will potentially decrease your io
performance. deadline and noop would be equally good choices.

So you don't want to pass down hints on "use noop", you really want to
tell the block layer that you have no seek penalty, for instance, along
with a range of other parameters.
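
To make the idea concrete, something along these lines is what such a
profile might look like. This is only a sketch: no such structure exists in
the block layer being discussed, all model_ names are hypothetical, and a
real profile would presumably carry more than one field.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device profile a driver would hand to the block layer. */
struct model_queue_profile {
	bool seek_penalty;	/* does request ordering affect service time? */
};

/* What the block layer could conclude from the profile. */
enum model_sched_hint {
	MODEL_HINT_SORT_AND_IDLE,	/* sorting and anticipation may pay off */
	MODEL_HINT_PASS_THROUGH,	/* dispatch roughly in arrival order */
};

/*
 * The driver describes the device; the policy decision stays on the
 * block layer side, where the user can still override it.
 */
static enum model_sched_hint
model_hint(const struct model_queue_profile *profile)
{
	return profile->seek_penalty ? MODEL_HINT_SORT_AND_IDLE
				     : MODEL_HINT_PASS_THROUGH;
}
```

The driver never names a scheduler here, so nothing breaks when a given
scheduler is not built in.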

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 14/21] uml: enable scsi and add iscsi config
  2006-09-06 13:16   ` Peter Zijlstra
@ 2006-09-11 15:49     ` Jeff Dike
  -1 siblings, 0 replies; 55+ messages in thread
From: Jeff Dike @ 2006-09-11 15:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, netdev, Daniel Phillips, Rik van Riel,
	David Miller, Andrew Morton, Mike Christie

On Wed, Sep 06, 2006 at 03:16:44PM +0200, Peter Zijlstra wrote:
> Enable iSCSI on UML, dunno why SCSI was deemed broken, it works like a charm.

Acked-by: Jeff Dike <jdike@addtoit.com>

Although it would be nice if we didn't have to copy bits of Kconfig files
to do this.

				Jeff


end of thread, other threads:[~2006-09-11 15:51 UTC | newest]

Thread overview: 55+ messages
2006-09-06 13:16 [PATCH 00/21] vm deadlock avoidance for NBD, NFS and iSCSI (take 6) Peter Zijlstra
2006-09-06 13:16 ` [PATCH 01/21] mm: serialize access to min_free_kbytes Peter Zijlstra
2006-09-06 13:16 ` [PATCH 02/21] net: vm deadlock avoidance core Peter Zijlstra
2006-09-06 13:16 ` [PATCH 03/21] mm: add support for non block device backed swap files Peter Zijlstra
2006-09-06 13:16 ` [PATCH 04/21] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
2006-09-06 13:16 ` [PATCH 05/21] uml: rename arch/um remove_mapping() Peter Zijlstra
2006-09-06 13:16 ` [PATCH 06/21] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
2006-09-06 13:16 ` [PATCH 07/21] nfs: add a comment explaining the use of PG_private in the NFS client Peter Zijlstra
2006-09-06 13:16 ` [PATCH 08/21] nfs: enable swap on NFS Peter Zijlstra
2006-09-06 13:16 ` [PATCH 09/21] nfs: make swap on NFS robust Peter Zijlstra
2006-09-06 13:16 ` [PATCH 10/21] block: elevator selection and pinning Peter Zijlstra
2006-09-06 13:46   ` Jens Axboe
2006-09-07 16:01     ` Peter Zijlstra
2006-09-07 16:33       ` Jens Axboe
2006-09-06 13:16 ` [PATCH 11/21] nbd: limit blk_queue Peter Zijlstra
2006-09-06 15:17   ` Erik Mouw
2006-09-06 17:45     ` Jens Axboe
2006-09-06 13:16 ` [PATCH 12/21] mm: block device swap notification Peter Zijlstra
2006-09-06 13:16 ` [PATCH 13/21] nbd: use swapdev hook to make swap deadlock free Peter Zijlstra
2006-09-06 13:16 ` [PATCH 14/21] uml: enable scsi and add iscsi config Peter Zijlstra
2006-09-11 15:49   ` Jeff Dike
2006-09-06 13:16 ` [PATCH 15/21] iscsi: kernel side tcp connect Peter Zijlstra
2006-09-06 13:16 ` [PATCH 16/21] iscsi: fixup of the ep_connect patch Peter Zijlstra
2006-09-06 13:16 ` [PATCH 17/21] iscsi: add session context to ep_connect Peter Zijlstra
2006-09-06 13:16 ` [PATCH 18/21] scsi: propagate the swapdev hook into the scsi stack Peter Zijlstra
2006-09-06 13:16 ` [PATCH 19/21] netlink: add SOCK_VMIO support to AF_NETLINK Peter Zijlstra
2006-09-06 13:16 ` [PATCH 20/21] mm: a process flags to avoid blocking allocations Peter Zijlstra
2006-09-06 13:16 ` [PATCH 21/21] iscsi: support for swapping over iSCSI Peter Zijlstra
