linux-kernel.vger.kernel.org archive mirror
* Is  Swapping on software RAID1 possible  in linux 2.4 ?
@ 2001-07-05 11:24 ` Peter Zaitsev
  2001-07-05 12:13   ` Neil Brown
                     ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Peter Zaitsev @ 2001-07-05 11:24 UTC (permalink / raw)
  To: linux-kernel

Hello linux-kernel,

  Does anyone have information on this subject?  I have constant
  failures with the system swapping on RAID1, and I just wanted to be
  sure whether this may be the problem or not.  It works without any
  problems with the 2.2 kernel.

-- 
Best regards,
 Peter                          mailto:pz@spylog.ru



* Re: Is  Swapping on software RAID1 possible  in linux 2.4 ?
  2001-07-05 11:24 ` Is Swapping on software RAID1 possible in linux 2.4 ? Peter Zaitsev
@ 2001-07-05 12:13   ` Neil Brown
  2001-07-05 13:22   ` Re[2]: " Peter Zaitsev
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Neil Brown @ 2001-07-05 12:13 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: linux-kernel

On Thursday July 5, pz@spylog.ru wrote:
> Hello linux-kernel,
> 
>   Does anyone have information on this subject?  I have constant
>   failures with the system swapping on RAID1, and I just wanted to be
>   sure whether this may be the problem or not.  It works without any
>   problems with the 2.2 kernel.

It certainly should work in 2.4.  What sort of "constant failures" are
you experiencing?

Though it does appear to work in 2.2, there is a possibility of data
corruption if you swap onto a raid1 array that is resyncing.  This
possibility does not exist in 2.4.

NeilBrown


* Re[2]: Is  Swapping on software RAID1 possible  in linux 2.4 ?
  2001-07-05 11:24 ` Is Swapping on software RAID1 possible in linux 2.4 ? Peter Zaitsev
  2001-07-05 12:13   ` Neil Brown
@ 2001-07-05 13:22   ` Peter Zaitsev
  2001-07-05 13:42     ` Arjan van de Ven
                       ` (2 more replies)
  2001-07-05 14:54   ` Nick DeClario
  2001-07-06  9:38   ` Re[2]: " Peter Zaitsev
  3 siblings, 3 replies; 13+ messages in thread
From: Peter Zaitsev @ 2001-07-05 13:22 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-kernel

Hello Neil,

Thursday, July 05, 2001, 4:13:00 PM, you wrote:

NB> On Thursday July 5, pz@spylog.ru wrote:
>> Hello linux-kernel,
>> 
>>   Does anyone have information on this subject?  I have constant
>>   failures with the system swapping on RAID1, and I just wanted to be
>>   sure whether this may be the problem or not.  It works without any
>>   problems with the 2.2 kernel.

NB> It certainly should work in 2.4.  What sort of "constant failures" are
NB> you experiencing?

NB> Though it does appear to work in 2.2, there is a possibility of data
NB> corruption if you swap onto a raid1 array that is resyncing.  This
NB> possibility does not exist in 2.4.



The problem is that I'm constantly getting these X-order-allocation
errors in the kernel log, after which the system becomes unstable and
often hangs or leaves processes which cannot be killed even with a
"-9" signal.  The installed debugging patches produce the following
allocation paths:

> Jun 20 05:56:14 tor kernel: Call Trace: [__get_free_pages+20/36]
> [__get_free_pages+20/36] [kmem_cache_grow+187/520] [kmalloc+183/224]
> [raid1_alloc_r1bh+105/256] [raid1_make_request+832/852]
> [raid1_make_request+80/852]
> Jun 20 05:56:14 tor kernel:        [md_make_request+79/124]
> [generic_make_request+293/308] [submit_bh+87/116] [brw_page+143/160]
> [rw_swap_page_base+336/428] [rw_swap_page+112/184] [swap_writepage+120/128]
> [page_launder+644/2132]
> Jun 20 05:56:14 tor kernel:        [do_try_to_free_pages+52/124]
> [kswapd+89/228] [kernel_thread+40/56]
>

one more trace:

SR>>Jun 19 09:50:08 garnet kernel: __alloc_pages: 0-order allocation failed.
SR>>Jun 19 09:50:08 garnet kernel: __alloc_pages: 0-order allocation failed from
SR>>c01Jun 19 09:50:08 garnet kernel: ^M^Mf4a2bc74 c024ac20 00000000 c012ca09
SR>>c024abe0
SR>>Jun 19 09:50:08 garnet kernel:        00000008 c03225e0 00000003 00000001
SR>>c029c9Jun 19 09:50:08 garnet kernel:        f0ebb760 00000001 00000008
SR>>c03225e0 c0197bJun 19 09:50:08 garnet kernel: Call Trace:
SR>>[alloc_bounce_page+13/140] [alloc_bouJun 19 09:50:08 garnet kernel:
SR>>[raid1_make_request+832/852] [md_make_requJun 19 09:50:08 garnet kernel:
SR>>[swap_writepage+120/128] [page_launder+644Jun 19 09:50:08 garnet kernel:
SR>>[sock_poll+35/40] [do_select+230/476] [sysJun 19 10:21:27 garnet kernel:
SR>>sending pkt_too_big to self
SR>>Jun 19 10:21:55 garnet kernel: sending pkt_too_big to self
SR>>Jun 19 10:34:36 garnet kernel: sending pkt_too_big to self
SR>>Jun 19 10:35:33 garnet last message repeated 2 times
SR>>Jun 19 10:36:50 garnet kernel: sending pkt_too_big to self

That's why I thought this problem is related to the RAID1 swapping
I'm using.

Of course, I'm speaking about a synced RAID1.




-- 
Best regards,
 Peter                            mailto:pz@spylog.ru



* Re: Is  Swapping on software RAID1 possible  in linux 2.4 ?
  2001-07-05 13:22   ` Re[2]: " Peter Zaitsev
@ 2001-07-05 13:42     ` Arjan van de Ven
  2001-07-05 18:56     ` Pete Zaitcev
  2001-07-12  1:14     ` Re[2]: " Neil Brown
  2 siblings, 0 replies; 13+ messages in thread
From: Arjan van de Ven @ 2001-07-05 13:42 UTC (permalink / raw)
  To: Peter Zaitsev, linux-kernel

Peter Zaitsev wrote:
> 
> That's why I thought this problem is related to the RAID1 swapping
> I'm using.

Well, there is the potential problem that RAID1 can't avoid allocating
memory on some occasions, for the 2nd bufferhead.  ATARAID RAID0 has
the same problem for now, and there is no real solution to this.  You
can pre-allocate a bunch of bufferheads, but under high load you will
run out of those, no matter how many you pre-allocate.

Of course you can then wait for the "in flight" ones to become
available again, and that is the best thing I've come up with so far.
It would be nice if the three subsystems that need such bufferheads
now (MD RAID1, ATARAID RAID0 and the bouncebuffer(head) code) could
share their pool.
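
In outline, that wait-for-in-flight scheme is just a bounded pool
where an allocator sleeps until a completion hands a buffer back.  A
userspace sketch with pthreads - illustrative only, not the kernel
code:

	#include <pthread.h>

	#define POOL_SIZE 16

	struct pool {
		pthread_mutex_t lock;
		pthread_cond_t  returned;  /* signalled when a buffer comes back */
		void *bufs[POOL_SIZE];     /* stack of pre-allocated buffers */
		int   nfree;
	};

	/* Take a buffer; if none are free, wait for an in-flight one. */
	static void *pool_get(struct pool *p)
	{
		void *buf;
		pthread_mutex_lock(&p->lock);
		while (p->nfree == 0)
			pthread_cond_wait(&p->returned, &p->lock);
		buf = p->bufs[--p->nfree];
		pthread_mutex_unlock(&p->lock);
		return buf;
	}

	/* I/O completion path: give the buffer back and wake a waiter. */
	static void pool_put(struct pool *p, void *buf)
	{
		pthread_mutex_lock(&p->lock);
		p->bufs[p->nfree++] = buf;
		pthread_cond_signal(&p->returned);
		pthread_mutex_unlock(&p->lock);
	}

The catch is that every waiter depends on some in-flight buffer
eventually coming back, and on the completion path itself never
needing to allocate.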

Greetings,
   Arjan van de Ven


* Re: Is  Swapping on software RAID1 possible  in linux 2.4 ?
  2001-07-05 11:24 ` Is Swapping on software RAID1 possible in linux 2.4 ? Peter Zaitsev
  2001-07-05 12:13   ` Neil Brown
  2001-07-05 13:22   ` Re[2]: " Peter Zaitsev
@ 2001-07-05 14:54   ` Nick DeClario
  2001-07-05 15:12     ` Joseph Bueno
  2001-07-11 12:08     ` Paul Jakma
  2001-07-06  9:38   ` Re[2]: " Peter Zaitsev
  3 siblings, 2 replies; 13+ messages in thread
From: Nick DeClario @ 2001-07-05 14:54 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: linux-kernel

Just out of curiosity, what are the advantages of having a RAID1 swap
partition?  Setting the swap priority to 0 (pri=0) in the fstab for all
the swap partitions on your system should have the same effect as doing
it with RAID, but without the overhead, right?  RAID1 would also mirror
your swap.  Why would you want that?

Regards,
	-Nick

Peter Zaitsev wrote:
> 
> Hello linux-kernel,
> 
>   Does anyone have information on this subject?  I have constant
>   failures with the system swapping on RAID1, and I just wanted to be
>   sure whether this may be the problem or not.  It works without any
>   problems with the 2.2 kernel.
> 
> --
> Best regards,
>  Peter                          mailto:pz@spylog.ru
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Nicholas DeClario
Systems Engineer                            Guardian Digital, Inc.
(201) 934-9230                Pioneering.  Open Source.  Security.
nick@guardiandigital.com            http://www.guardiandigital.com


* Re: Is  Swapping on software RAID1 possible  in linux 2.4 ?
  2001-07-05 14:54   ` Nick DeClario
@ 2001-07-05 15:12     ` Joseph Bueno
  2001-07-11 12:08     ` Paul Jakma
  1 sibling, 0 replies; 13+ messages in thread
From: Joseph Bueno @ 2001-07-05 15:12 UTC (permalink / raw)
  To: nick; +Cc: Peter Zaitsev, linux-kernel

Nick DeClario wrote:
> 
> Just out of curiosity, what are the advantages of having a RAID1 swap
> partition?  Setting the swap priority to 0 (pri=0) in the fstab for all
> the swap partitions on your system should have the same effect as doing
> it with RAID, but without the overhead, right?  RAID1 would also mirror
> your swap.  Why would you want that?
> 
> Regards,
>         -Nick
> 
Hi,

Setting the swap priority to 0 on all partitions is equivalent to RAID0
(striping), not RAID1 (mirroring).

Mirroring your swap partition is important because if the disk containing
your swap fails, your system is dead.  If you want to keep your system
running even if one disk fails, you need to mirror ALL your active
partitions, including swap.
If you only mirror your data partitions, you are only protected against
data loss in case of a disk crash (and that assumes you shut down
gracefully before the kernel panics trying to read/write a crashed swap
partition and leaves your data in some inconsistent state).
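
For illustration - device names are hypothetical - equal-priority swap
areas in /etc/fstab are used round-robin, much like striping:

	/dev/hda2	none	swap	defaults,pri=0	0 0
	/dev/hdc2	none	swap	defaults,pri=0	0 0

If /dev/hda dies, part of swap goes with it.  Swapping on a RAID1
array instead keeps a full copy on each disk:

	# /dev/md0 is a RAID1 mirror of /dev/hda2 and /dev/hdc2
	/dev/md0	none	swap	defaults	0 0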

Regards
--
Joseph Bueno


* Re: Is  Swapping on software RAID1 possible  in linux 2.4 ?
  2001-07-05 13:22   ` Re[2]: " Peter Zaitsev
  2001-07-05 13:42     ` Arjan van de Ven
@ 2001-07-05 18:56     ` Pete Zaitcev
  2001-07-12  1:14     ` Re[2]: " Neil Brown
  2 siblings, 0 replies; 13+ messages in thread
From: Pete Zaitcev @ 2001-07-05 18:56 UTC (permalink / raw)
  To: linux-kernel

In linux-kernel, you wrote:
> Peter Zaitsev wrote:
> > 
> > That's why I thought this problem is related to the RAID1 swapping
> > I'm using.
> 
> Well, there is the potential problem that RAID1 can't avoid allocating
> memory on some occasions, for the 2nd bufferhead.  ATARAID RAID0 has the
> same problem for now, and there is no real solution to this.  You can
> pre-allocate a bunch of bufferheads, but under high load you will run
> out of those, no matter how many you pre-allocate.

Arjan, why doesn't it sleep instead (GFP_KERNEL)?

-- Pete


* Re[2]: Is  Swapping on software RAID1 possible  in linux 2.4 ?
  2001-07-05 11:24 ` Is Swapping on software RAID1 possible in linux 2.4 ? Peter Zaitsev
                     ` (2 preceding siblings ...)
  2001-07-05 14:54   ` Nick DeClario
@ 2001-07-06  9:38   ` Peter Zaitsev
  3 siblings, 0 replies; 13+ messages in thread
From: Peter Zaitsev @ 2001-07-06  9:38 UTC (permalink / raw)
  To: Nick DeClario; +Cc: linux-kernel

Hello Nick,

Thursday, July 05, 2001, 6:54:37 PM, you wrote:

Well, the idea is simple: I want my system to survive if one of the
disks fails, so I store all of my data, including swap, on RAID
partitions.


ND> Just out of curiosity, what are the advantages of having a RAID1 swap
ND> partition?  Setting the swap priority to 0 (pri=0) in the fstab for all
ND> the swap partitions on your system should have the same effect as doing
ND> it with RAID, but without the overhead, right?  RAID1 would also mirror
ND> your swap.  Why would you want that?

ND> Regards,
ND>         -Nick

ND> Peter Zaitsev wrote:
>> 
>> Hello linux-kernel,
>> 
>>   Does anyone have information on this subject?  I have constant
>>   failures with the system swapping on RAID1, and I just wanted to be
>>   sure whether this may be the problem or not.  It works without any
>>   problems with the 2.2 kernel.
>> 
>> --
>> Best regards,
>>  Peter                          mailto:pz@spylog.ru
>> 
>> -
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/




-- 
Best regards,
 Peter                            mailto:pz@spylog.ru



* Re: Is  Swapping on software RAID1 possible  in linux 2.4 ?
  2001-07-05 14:54   ` Nick DeClario
  2001-07-05 15:12     ` Joseph Bueno
@ 2001-07-11 12:08     ` Paul Jakma
  1 sibling, 0 replies; 13+ messages in thread
From: Paul Jakma @ 2001-07-11 12:08 UTC (permalink / raw)
  To: Nick DeClario; +Cc: Peter Zaitsev, linux-kernel

On Thu, 5 Jul 2001, Nick DeClario wrote:

> RAID1 would also mirror your swap.  Why would you want that?

Redundancy.  There's no point having your data redundant if your swap
isn't - one drive failure will take out the box the moment it tries to
access swap on the failed drive.

PS: I have 2 boxes deployed running RH's 2.4.2, with swap on top of
LVM on top of RAID1.  No problems so far, even during resync.
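
Roughly, with the raidtools/LVM userland of the day (device names are
illustrative only):

	mkraid /dev/md0              # RAID1 pair defined in /etc/raidtab
	pvcreate /dev/md0
	vgcreate vg0 /dev/md0
	lvcreate -L 512M -n swap vg0
	mkswap /dev/vg0/swap
	swapon /dev/vg0/swap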

> Regards,
> 	-Nick

--paulj



* Re: Re[2]: Is  Swapping on software RAID1 possible  in linux 2.4 ?
  2001-07-05 13:22   ` Re[2]: " Peter Zaitsev
  2001-07-05 13:42     ` Arjan van de Ven
  2001-07-05 18:56     ` Pete Zaitcev
@ 2001-07-12  1:14     ` Neil Brown
  2001-07-12  1:48       ` Andrew Morton
  2 siblings, 1 reply; 13+ messages in thread
From: Neil Brown @ 2001-07-12  1:14 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: linux-kernel

On Thursday July 5, pz@spylog.ru wrote:
> Hello Neil,
> 
> Thursday, July 05, 2001, 4:13:00 PM, you wrote:
> 
> NB> On Thursday July 5, pz@spylog.ru wrote:
> >> Hello linux-kernel,
> >> 
> >>   Does anyone have information on this subject?  I have constant
> >>   failures with the system swapping on RAID1, and I just wanted to be
> >>   sure whether this may be the problem or not.  It works without any
> >>   problems with the 2.2 kernel.
> 
> NB> It certainly should work in 2.4.  What sort of "constant failures" are
> NB> you experiencing?
> 
> NB> Though it does appear to work in 2.2, there is a possibility of data
> NB> corruption if you swap onto a raid1 array that is resyncing.  This
> NB> possibility does not exist in 2.4.
> 
> 
> 
> The problem is that I'm constantly getting these X-order-allocation
> errors in the kernel log, after which the system becomes unstable and
> often hangs or leaves processes which cannot be killed even with a
> "-9" signal.  The installed debugging patches produce the following
> allocation paths:

These "X-order-allocation" failures are just an indication that you
are running out or memory.  raid1 is explicitly written to cope.
If memory allocation fails it waits for some to be free, and it has
made sure in advance that there is some memory that it will get
first-dibs on when it becomes free, so there is no risk of deadlock.

However, this does not explain why you are getting unkillable
processes.

Can you try putting swap on just one of the partitions that you raid1
together, instead of on the raid1 array, and see whether you can still
get processes to become unkillable?

Also, can you find out what that process is doing when it is
unkillable?
If you compile with alt-sysrq support, then alt-sysrq-t should print
the process table.  If you can get this out of dmesg and run it
through ksymoops, it might be most interesting.
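
Roughly like this (adjust the System.map path for your kernel):

	# enable sysrq, then hit alt-sysrq-t on the console
	echo 1 > /proc/sys/kernel/sysrq
	# capture the task dump from the kernel log buffer and decode it
	dmesg -s 65536 > tasks.txt
	ksymoops -m /boot/System.map < tasks.txt > tasks.decoded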

NeilBrown


* Re: Is  Swapping on software RAID1 possible  in linux 2.4 ?
  2001-07-12  1:14     ` Re[2]: " Neil Brown
@ 2001-07-12  1:48       ` Andrew Morton
  2001-07-12  3:22         ` Neil Brown
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2001-07-12  1:48 UTC (permalink / raw)
  To: Neil Brown; +Cc: Peter Zaitsev, linux-kernel

Neil Brown wrote:
> 
> Also, can you find out what that process is doing when it is
> unkillable?
> If you compile with alt-sysrq support, then alt-sysrq-t should print
> the process table.  If you can get this out of dmesg and run it
> through ksymoops, it might be most interesting.

Neil, he showed us a trace the other day - kswapd was
stuck in raid1_alloc_r1bh().  This is basically the
same situation as I had yesterday, where bdflush was stuck
in the same place.

It is completely fatal to the VM for these two processes to
get stuck in this way.  The approach I took was to beef up
the reserved bh queues and to keep a number of them
reserved *only* for the swapout and dirty buffer flush functions.
That way, we have at hand the memory we need to be able to
free up memory.

It was necessary to define a new task_struct.flags bit so we
can identify when the caller is a `buffer flusher' - I expect
we'll need that in other places as well.
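
The gating test itself is tiny.  In essence it is just this (a sketch
of the idea only - the real check is may_take_bh() in the patch below,
and the flag bits are the ones defined there):

	/* memory-freeing tasks may dip into the reserve; others may not */
	static int may_take(int nfree, int wanted, unsigned long task_flags)
	{
		if (task_flags & (PF_FLUSH | PF_MEMALLOC))
			return nfree >= wanted;
		return nfree >= wanted + FREER1_MEMALLOC_RESERVED;
	}

Everyone else has to leave FREER1_MEMALLOC_RESERVED buffers behind, so
bdflush and kswapd always find something to work with.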

An easy way to demonstrate the problem is to put ext3 on RAID1,
boot with `mem=64m' and run `dd if=/dev/zero of=foo bs=1024k count=1k'.
The machine wedges on the first run.  This is due to a bdflush deadlock.
Once swap is on RAID1, there will be kswapd deadlocks as well.  The
patch *should* fix those, but I haven't tested that.

Could you please review these changes?

BTW: I removed the initial buffer_head reservation code.  It's
not necessary with the modified reservation algorithm - as soon
as we start to use the device the reserve pools will build
up.  There will be a deadlock opportunity if the machine is totally
and utterly oom when the RAID device initially starts up, but it's
really not worth the code space to even bother about this.




--- linux-2.4.6/include/linux/sched.h	Wed May  2 22:00:07 2001
+++ lk-ext3/include/linux/sched.h	Thu Jul 12 01:03:20 2001
@@ -413,7 +418,7 @@ struct task_struct {
 #define PF_SIGNALED	0x00000400	/* killed by a signal */
 #define PF_MEMALLOC	0x00000800	/* Allocating memory */
 #define PF_VFORK	0x00001000	/* Wake up parent in mm_release */
-
+#define PF_FLUSH	0x00002000	/* Flushes buffers to disk */
 #define PF_USEDFPU	0x00100000	/* task used FPU this quantum (SMP) */
 
 /*
--- linux-2.4.6/include/linux/raid/raid1.h	Tue Dec 12 08:20:08 2000
+++ lk-ext3/include/linux/raid/raid1.h	Thu Jul 12 01:15:39 2001
@@ -37,12 +37,12 @@ struct raid1_private_data {
 	/* buffer pool */
 	/* buffer_heads that we have pre-allocated have b_pprev -> &freebh
 	 * and are linked into a stack using b_next
-	 * raid1_bh that are pre-allocated have R1BH_PreAlloc set.
 	 * All these variable are protected by device_lock
 	 */
 	struct buffer_head	*freebh;
 	int			freebh_cnt;	/* how many are on the list */
 	struct raid1_bh		*freer1;
+	unsigned		freer1_cnt;
 	struct raid1_bh		*freebuf; 	/* each bh_req has a page allocated */
 	md_wait_queue_head_t	wait_buffer;
 
@@ -87,5 +87,4 @@ struct raid1_bh {
 /* bits for raid1_bh.state */
 #define	R1BH_Uptodate	1
 #define	R1BH_SyncPhase	2
-#define	R1BH_PreAlloc	3	/* this was pre-allocated, add to free list */
 #endif
--- linux-2.4.6/fs/buffer.c	Wed Jul  4 18:21:31 2001
+++ lk-ext3/fs/buffer.c	Thu Jul 12 01:03:57 2001
@@ -2685,6 +2748,7 @@ int bdflush(void *sem)
 	sigfillset(&tsk->blocked);
 	recalc_sigpending(tsk);
 	spin_unlock_irq(&tsk->sigmask_lock);
+	current->flags |= PF_FLUSH;
 
 	up((struct semaphore *)sem);
 
@@ -2726,6 +2790,7 @@ int kupdate(void *sem)
 	siginitsetinv(&current->blocked, sigmask(SIGCONT) | sigmask(SIGSTOP));
 	recalc_sigpending(tsk);
 	spin_unlock_irq(&tsk->sigmask_lock);
+	current->flags |= PF_FLUSH;
 
 	up((struct semaphore *)sem);
 
--- linux-2.4.6/drivers/md/raid1.c	Wed Jul  4 18:21:26 2001
+++ lk-ext3/drivers/md/raid1.c	Thu Jul 12 01:28:58 2001
@@ -51,6 +51,28 @@ static mdk_personality_t raid1_personali
 static md_spinlock_t retry_list_lock = MD_SPIN_LOCK_UNLOCKED;
 struct raid1_bh *raid1_retry_list = NULL, **raid1_retry_tail;
 
+/*
+ * We need to scale the number of reserved buffers by the page size
+ * to make writepage()s successful. --akpm
+ */
+#define R1_BLOCKS_PP			(PAGE_CACHE_SIZE / 1024)
+#define FREER1_MEMALLOC_RESERVED	(16 * R1_BLOCKS_PP)
+
+/*
+ * Return true if the caller make take a bh from the list.
+ * PF_FLUSH and PF_MEMALLOC tasks are allowed to use the reserves, because
+ * they're trying to *free* some memory.
+ *
+ * Requires that conf->device_lock be held.
+ */
+static int may_take_bh(raid1_conf_t *conf, int cnt)
+{
+	int min_free = (current->flags & (PF_FLUSH|PF_MEMALLOC)) ?
+			cnt :
+			(cnt + FREER1_MEMALLOC_RESERVED * conf->raid_disks);
+	return conf->freebh_cnt >= min_free;
+}
+
 static struct buffer_head *raid1_alloc_bh(raid1_conf_t *conf, int cnt)
 {
 	/* return a linked list of "cnt" struct buffer_heads.
@@ -62,7 +84,7 @@ static struct buffer_head *raid1_alloc_b
 	while(cnt) {
 		struct buffer_head *t;
 		md_spin_lock_irq(&conf->device_lock);
-		if (conf->freebh_cnt >= cnt)
+		if (may_take_bh(conf, cnt))
 			while (cnt) {
 				t = conf->freebh;
 				conf->freebh = t->b_next;
@@ -83,7 +105,7 @@ static struct buffer_head *raid1_alloc_b
 			cnt--;
 		} else {
 			PRINTK("raid1: waiting for %d bh\n", cnt);
-			wait_event(conf->wait_buffer, conf->freebh_cnt >= cnt);
+			wait_event(conf->wait_buffer, may_take_bh(conf, cnt));
 		}
 	}
 	return bh;
@@ -96,9 +118,9 @@ static inline void raid1_free_bh(raid1_c
 	while (bh) {
 		struct buffer_head *t = bh;
 		bh=bh->b_next;
-		if (t->b_pprev == NULL)
+		if (conf->freebh_cnt >= FREER1_MEMALLOC_RESERVED) {
 			kfree(t);
-		else {
+		} else {
 			t->b_next= conf->freebh;
 			conf->freebh = t;
 			conf->freebh_cnt++;
@@ -108,29 +130,6 @@ static inline void raid1_free_bh(raid1_c
 	wake_up(&conf->wait_buffer);
 }
 
-static int raid1_grow_bh(raid1_conf_t *conf, int cnt)
-{
-	/* allocate cnt buffer_heads, possibly less if kalloc fails */
-	int i = 0;
-
-	while (i < cnt) {
-		struct buffer_head *bh;
-		bh = kmalloc(sizeof(*bh), GFP_KERNEL);
-		if (!bh) break;
-		memset(bh, 0, sizeof(*bh));
-
-		md_spin_lock_irq(&conf->device_lock);
-		bh->b_pprev = &conf->freebh;
-		bh->b_next = conf->freebh;
-		conf->freebh = bh;
-		conf->freebh_cnt++;
-		md_spin_unlock_irq(&conf->device_lock);
-
-		i++;
-	}
-	return i;
-}
-
 static int raid1_shrink_bh(raid1_conf_t *conf, int cnt)
 {
 	/* discard cnt buffer_heads, if we can find them */
@@ -147,7 +146,16 @@ static int raid1_shrink_bh(raid1_conf_t 
 	md_spin_unlock_irq(&conf->device_lock);
 	return i;
 }
-		
+
+/*
+ * Return true if the caller make take a raid1_bh from the list.
+ * Requires that conf->device_lock be held.
+ */
+static int may_take_r1bh(raid1_conf_t *conf)
+{
+	return ((conf->freer1_cnt > FREER1_MEMALLOC_RESERVED) ||
+		  (current->flags & (PF_FLUSH|PF_MEMALLOC))) && conf->freer1;
+}
 
 static struct raid1_bh *raid1_alloc_r1bh(raid1_conf_t *conf)
 {
@@ -155,8 +163,9 @@ static struct raid1_bh *raid1_alloc_r1bh
 
 	do {
 		md_spin_lock_irq(&conf->device_lock);
-		if (conf->freer1) {
+		if (may_take_r1bh(conf)) {
 			r1_bh = conf->freer1;
+			conf->freer1_cnt--;
 			conf->freer1 = r1_bh->next_r1;
 			r1_bh->next_r1 = NULL;
 			r1_bh->state = 0;
@@ -170,7 +179,7 @@ static struct raid1_bh *raid1_alloc_r1bh
 			memset(r1_bh, 0, sizeof(*r1_bh));
 			return r1_bh;
 		}
-		wait_event(conf->wait_buffer, conf->freer1);
+		wait_event(conf->wait_buffer, may_take_r1bh(conf));
 	} while (1);
 }
 
@@ -178,49 +187,30 @@ static inline void raid1_free_r1bh(struc
 {
 	struct buffer_head *bh = r1_bh->mirror_bh_list;
 	raid1_conf_t *conf = mddev_to_conf(r1_bh->mddev);
+	unsigned long flags;
 
 	r1_bh->mirror_bh_list = NULL;
 
-	if (test_bit(R1BH_PreAlloc, &r1_bh->state)) {
-		unsigned long flags;
-		spin_lock_irqsave(&conf->device_lock, flags);
+	spin_lock_irqsave(&conf->device_lock, flags);
+	if (conf->freer1_cnt < FREER1_MEMALLOC_RESERVED) {
 		r1_bh->next_r1 = conf->freer1;
 		conf->freer1 = r1_bh;
+		conf->freer1_cnt++;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 	} else {
+		spin_unlock_irqrestore(&conf->device_lock, flags);
 		kfree(r1_bh);
 	}
 	raid1_free_bh(conf, bh);
 }
 
-static int raid1_grow_r1bh (raid1_conf_t *conf, int cnt)
-{
-	int i = 0;
-
-	while (i < cnt) {
-		struct raid1_bh *r1_bh;
-		r1_bh = (struct raid1_bh*)kmalloc(sizeof(*r1_bh), GFP_KERNEL);
-		if (!r1_bh)
-			break;
-		memset(r1_bh, 0, sizeof(*r1_bh));
-
-		md_spin_lock_irq(&conf->device_lock);
-		set_bit(R1BH_PreAlloc, &r1_bh->state);
-		r1_bh->next_r1 = conf->freer1;
-		conf->freer1 = r1_bh;
-		md_spin_unlock_irq(&conf->device_lock);
-
-		i++;
-	}
-	return i;
-}
-
 static void raid1_shrink_r1bh(raid1_conf_t *conf)
 {
 	md_spin_lock_irq(&conf->device_lock);
 	while (conf->freer1) {
 		struct raid1_bh *r1_bh = conf->freer1;
 		conf->freer1 = r1_bh->next_r1;
+		conf->freer1_cnt--;	/* pedantry */
 		kfree(r1_bh);
 	}
 	md_spin_unlock_irq(&conf->device_lock);
@@ -1610,21 +1600,6 @@ static int raid1_run (mddev_t *mddev)
 		goto out_free_conf;
 	}
 
-
-	/* pre-allocate some buffer_head structures.
-	 * As a minimum, 1 r1bh and raid_disks buffer_heads
-	 * would probably get us by in tight memory situations,
-	 * but a few more is probably a good idea.
-	 * For now, try 16 r1bh and 16*raid_disks bufferheads
-	 * This will allow at least 16 concurrent reads or writes
-	 * even if kmalloc starts failing
-	 */
-	if (raid1_grow_r1bh(conf, 16) < 16 ||
-	    raid1_grow_bh(conf, 16*conf->raid_disks)< 16*conf->raid_disks) {
-		printk(MEM_ERROR, mdidx(mddev));
-		goto out_free_conf;
-	}
-
 	for (i = 0; i < MD_SB_DISKS; i++) {
 		
 		descriptor = sb->disks+i;
@@ -1713,6 +1688,8 @@ out_free_conf:
 	raid1_shrink_r1bh(conf);
 	raid1_shrink_bh(conf, conf->freebh_cnt);
 	raid1_shrink_buffers(conf);
+	if (conf->freer1_cnt != 0)
+		BUG();
 	kfree(conf);
 	mddev->private = NULL;
 out:


* Re: Is  Swapping on software RAID1 possible  in linux 2.4 ?
  2001-07-12  1:48       ` Andrew Morton
@ 2001-07-12  3:22         ` Neil Brown
  2001-07-12  4:53           ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: Neil Brown @ 2001-07-12  3:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Peter Zaitsev, linux-kernel

On Thursday July 12, andrewm@uow.edu.au wrote:
> 
> Could you please review these changes?

I think I see what you are trying to do, and there is nothing
obviously wrong except this comment :-)

> + * Return true if the caller make take a raid1_bh from the list.
                                ^^^^

but now that I see what the problem is, I think a simpler patch would
be 

--- drivers/md/raid1.c	2001/07/12 02:00:35	1.1
+++ drivers/md/raid1.c	2001/07/12 02:01:42
@@ -83,6 +83,7 @@
 			cnt--;
 		} else {
 			PRINTK("raid1: waiting for %d bh\n", cnt);
+			run_task_queue(&tq_disk);
 			wait_event(conf->wait_buffer, conf->freebh_cnt >= cnt);
 		}
 	}
@@ -170,6 +171,7 @@
 			memset(r1_bh, 0, sizeof(*r1_bh));
 			return r1_bh;
 		}
+		run_task_queue(&tq_disk);
 		wait_event(conf->wait_buffer, conf->freer1);
 	} while (1);
 }


This is needed anyway to be "correct", as you should always unplug
the queues before waiting for IO to complete.

On the issue of whether to pre-allocate some reserved structures or
not, I think it's "six of one, half a dozen of the other".  My
rationale for pre-allocating was that the buffers we hold on to would
have been allocated together, and so are probably fairly dense within
their pages, so there is no risk of hogging excess memory that isn't
actually being used.  Mind you, if I was really serious about being
gentle on the memory allocation, I would use
   kmem_cache_alloc(bh_cachep, SLAB_whatever)
instead of
   kmalloc(sizeof(struct buffer_head), GFP_whatever)
but I hadn't 'got' the slab stuff properly when I was writing that
code.

Peter, does the above little patch help your problem?

NeilBrown


* Re: Is  Swapping on software RAID1 possible  in linux 2.4 ?
  2001-07-12  3:22         ` Neil Brown
@ 2001-07-12  4:53           ` Andrew Morton
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Morton @ 2001-07-12  4:53 UTC (permalink / raw)
  To: Neil Brown; +Cc: Peter Zaitsev, linux-kernel

Neil Brown wrote:
> 
> --- drivers/md/raid1.c  2001/07/12 02:00:35     1.1
> +++ drivers/md/raid1.c  2001/07/12 02:01:42
> @@ -83,6 +83,7 @@
>                         cnt--;
>                 } else {
>                         PRINTK("raid1: waiting for %d bh\n", cnt);
> +                       run_task_queue(&tq_disk);
>                         wait_event(conf->wait_buffer, conf->freebh_cnt >= cnt);
>                 }
>         }
> @@ -170,6 +171,7 @@
>                         memset(r1_bh, 0, sizeof(*r1_bh));
>                         return r1_bh;
>                 }
> +               run_task_queue(&tq_disk);
>                 wait_event(conf->wait_buffer, conf->freer1);
>         } while (1);
>  }
> 
> This is needed anyway to be "correct", as you should always unplug
> the queues before waiting for IO to complete.

The problem with this approach is the waitqueue - you get several
tasks on the waitqueue, and bdflush loses the race - some other
thread steals the r1bh and bdflush goes back to sleep.

Replacing the wait_event() with a special raid1_wait_event()
which unplugs *each time* the caller is woken does help - but
it is still easy to deadlock the system.

Clearly this approach is racy: it assumes that the reserved buffers have
actually been submitted when we unplug - they may not yet have been.
But the lockup is too easy to trigger for that to be a satisfactory
explanation.

The most effective, aggressive, successful and grotty fix for this
problem is to remove the wait_event altogether and replace it with:

	run_task_queue(&tq_disk);
	current->policy |= SCHED_YIELD;
	__set_current_state(TASK_RUNNING);
	schedule();

This can still deadlock in bad OOM situations, but I think we're
dead anyway.  A combination of this approach plus the PF_FLUSH
reservations would work even better, but I found the PF_FLUSH
stuff was sufficient.

> Mind you, if I was really serious about being
> gentle on the memory allocation, I would use
>    kmem_cache_alloc(bh_cachep,SLAB_whatever)
> instead of
>    kmalloc(sizeof(struct buffer_head), GFP_whatever)

get/put_unused_buffer_head() should be exported API functions.



Thread overview: 13+ messages
     [not found] <mailman.994340644.23368.linux-kernel2news@redhat.com>
2001-07-05 11:24 ` Is Swapping on software RAID1 possible in linux 2.4 ? Peter Zaitsev
2001-07-05 12:13   ` Neil Brown
2001-07-05 13:22   ` Re[2]: " Peter Zaitsev
2001-07-05 13:42     ` Arjan van de Ven
2001-07-05 18:56     ` Pete Zaitcev
2001-07-12  1:14     ` Re[2]: " Neil Brown
2001-07-12  1:48       ` Andrew Morton
2001-07-12  3:22         ` Neil Brown
2001-07-12  4:53           ` Andrew Morton
2001-07-05 14:54   ` Nick DeClario
2001-07-05 15:12     ` Joseph Bueno
2001-07-11 12:08     ` Paul Jakma
2001-07-06  9:38   ` Re[2]: " Peter Zaitsev
