* Re: 2.6.28.2 & dm-snapshot or kcopyd Oops
       [not found]           ` <200902051113011850587@163.com>
@ 2009-02-05 22:52             ` Mikulas Patocka
  2009-02-06  3:26               ` Mikulas Patocka
  0 siblings, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2009-02-05 22:52 UTC (permalink / raw)
  To: Jacky Kim; +Cc: dm-devel, Alasdair G Kergon, Milan Broz



On Thu, 5 Feb 2009, Jacky Kim wrote:

> Hi Mikulas,
> 
> Please check my kernel config file in the attachment.
> I have tested with your patch and got the following message:
> 
> # dmesg 

Hi

Try this patch with the same workload and kernel config, and see if it
fixes the crash.

Mikulas

---

Under specific circumstances, the kcopyd callback could be called from
the thread that called dm_kcopyd_copy, not from the kcopyd workqueue.

dm_kcopyd_copy -> split_job -> segment_complete -> job->fn()

This code path is taken if the thread calling dm_kcopyd_copy is delayed due
to scheduling inside split_job/segment_complete and the subjobs complete
before the loop in split_job finishes.

Snapshots depend on the fact that callbacks are called from the
single-threaded kcopyd workqueue and expect that individual callbacks do not
race with each other. Racing callbacks can corrupt the exception store and
can also cause exception store callbacks to be called twice for the same
exception --- a likely reason for the crashes inside
pending_complete() / remove_exception().


When I reviewed kcopyd further, I found four problems in total:

1. job->fn being called from the thread that submitted the job (see above).

2. The call job->fn(read_err, write_err, job->context); in segment_complete
reports only the error of the last subjob, not the union of all subjob errors.

3. The comment below is bogus (the jobs are not cancelable), and the code
is prone to deadlock because the callback may allocate another job from
the same mempool:
/*
 * To avoid a race we must keep the job around
 * until after the notify function has completed.
 * Otherwise the client may try and stop the job
 * after we've completed.
 */
job->fn(read_err, write_err, job->context);
mempool_free(job, job->kc->job_pool);

4. Master jobs and subjobs are allocated from the same mempool, so
completion and freeing of a master job depend on successful allocation of
all subjobs -> deadlock.


This patch moves completion of master jobs to run_complete_job (which is
called from the kcopyd workqueue). It fixes problems 1, 2 and 3.

Problem 4 will need more changes and a separate patch.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm-kcopyd.c |   15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

Index: linux-2.6.29-rc3-devel/drivers/md/dm-kcopyd.c
===================================================================
--- linux-2.6.29-rc3-devel.orig/drivers/md/dm-kcopyd.c	2009-02-05 23:01:50.000000000 +0100
+++ linux-2.6.29-rc3-devel/drivers/md/dm-kcopyd.c	2009-02-05 23:28:56.000000000 +0100
@@ -297,7 +297,8 @@ static int run_complete_job(struct kcopy
 	dm_kcopyd_notify_fn fn = job->fn;
 	struct dm_kcopyd_client *kc = job->kc;
 
-	kcopyd_put_pages(kc, job->pages);
+	if (job->pages)
+		kcopyd_put_pages(kc, job->pages);
 	mempool_free(job, kc->job_pool);
 	fn(read_err, write_err, context);
 
@@ -508,14 +509,8 @@ static void segment_complete(int read_er
 
 	} else if (atomic_dec_and_test(&job->sub_jobs)) {
 
-		/*
-		 * To avoid a race we must keep the job around
-		 * until after the notify function has completed.
-		 * Otherwise the client may try and stop the job
-		 * after we've completed.
-		 */
-		job->fn(read_err, write_err, job->context);
-		mempool_free(job, job->kc->job_pool);
+		push(&job->kc->complete_jobs, job);
+		wake(job->kc);
 	}
 }
 
@@ -528,6 +523,8 @@ static void split_job(struct kcopyd_job 
 {
 	int i;
 
+	atomic_inc(&job->kc->nr_jobs);
+
 	atomic_set(&job->sub_jobs, SPLIT_COUNT);
 	for (i = 0; i < SPLIT_COUNT; i++)
 		segment_complete(0, 0u, job);


* Re: Re: 2.6.28.2 & dm-snapshot or kcopyd Oops
  2009-02-05 22:52             ` 2.6.28.2 & dm-snapshot or kcopyd Oops Mikulas Patocka
@ 2009-02-06  3:26               ` Mikulas Patocka
       [not found]                 ` <200902061924107769776@163.com>
  0 siblings, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2009-02-06  3:26 UTC (permalink / raw)
  To: device-mapper development; +Cc: Alasdair G Kergon, Jacky Kim, Milan Broz

> +		push(&job->kc->complete_jobs, job);
> +		wake(job->kc);

I found one more bug in my patch. When I push the job to the complete_jobs
queue, it may be taken immediately and freed, so the subsequent job->kc
dereference is a use-after-free.

A new patch:



Under specific circumstances, the kcopyd callback could be called from
the thread that called dm_kcopyd_copy, not from the kcopyd workqueue.

dm_kcopyd_copy -> split_job -> segment_complete -> job->fn()

This code path is taken if the thread calling dm_kcopyd_copy is delayed due
to scheduling inside split_job/segment_complete and the subjobs complete
before the loop in split_job finishes.

Snapshots depend on the fact that callbacks are called from the
single-threaded kcopyd workqueue and expect that individual callbacks do not
race with each other. Racing callbacks can corrupt the exception store and
can also cause exception store callbacks to be called twice for the same
exception --- a likely reason for the crashes inside
pending_complete() / remove_exception().


When I reviewed kcopyd further, I found four problems in total:

1. job->fn being called from the thread that submitted the job (see above).

2. The call job->fn(read_err, write_err, job->context); in segment_complete
reports only the error of the last subjob, not the union of all subjob errors.

3. The comment below is bogus (the jobs are not cancelable), and the code
is prone to deadlock because the callback may allocate another job from
the same mempool:
/*
 * To avoid a race we must keep the job around
 * until after the notify function has completed.
 * Otherwise the client may try and stop the job
 * after we've completed.
 */
job->fn(read_err, write_err, job->context);
mempool_free(job, job->kc->job_pool);

4. Master jobs and subjobs are allocated from the same mempool, so
completion and freeing of a master job depend on successful allocation of
all subjobs -> deadlock.


This patch moves completion of master jobs to run_complete_job (which is
called from the kcopyd workqueue). It fixes problems 1, 2 and 3.

Problem 4 will need more changes and a separate patch.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm-kcopyd.c |   18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

Index: linux-2.6.29-rc3-devel/drivers/md/dm-kcopyd.c
===================================================================
--- linux-2.6.29-rc3-devel.orig/drivers/md/dm-kcopyd.c	2009-02-05 23:01:50.000000000 +0100
+++ linux-2.6.29-rc3-devel/drivers/md/dm-kcopyd.c	2009-02-06 04:15:52.000000000 +0100
@@ -297,7 +297,8 @@ static int run_complete_job(struct kcopy
 	dm_kcopyd_notify_fn fn = job->fn;
 	struct dm_kcopyd_client *kc = job->kc;
 
-	kcopyd_put_pages(kc, job->pages);
+	if (job->pages)
+		kcopyd_put_pages(kc, job->pages);
 	mempool_free(job, kc->job_pool);
 	fn(read_err, write_err, context);
 
@@ -461,6 +462,7 @@ static void segment_complete(int read_er
 	sector_t progress = 0;
 	sector_t count = 0;
 	struct kcopyd_job *job = (struct kcopyd_job *) context;
+	struct dm_kcopyd_client *kc = job->kc;
 
 	mutex_lock(&job->lock);
 
@@ -490,7 +492,7 @@ static void segment_complete(int read_er
 
 	if (count) {
 		int i;
-		struct kcopyd_job *sub_job = mempool_alloc(job->kc->job_pool,
+		struct kcopyd_job *sub_job = mempool_alloc(kc->job_pool,
 							   GFP_NOIO);
 
 		*sub_job = *job;
@@ -508,14 +510,8 @@ static void segment_complete(int read_er
 
 	} else if (atomic_dec_and_test(&job->sub_jobs)) {
 
-		/*
-		 * To avoid a race we must keep the job around
-		 * until after the notify function has completed.
-		 * Otherwise the client may try and stop the job
-		 * after we've completed.
-		 */
-		job->fn(read_err, write_err, job->context);
-		mempool_free(job, job->kc->job_pool);
+		push(&kc->complete_jobs, job);
+		wake(kc);
 	}
 }
 
@@ -528,6 +524,8 @@ static void split_job(struct kcopyd_job 
 {
 	int i;
 
+	atomic_inc(&job->kc->nr_jobs);
+
 	atomic_set(&job->sub_jobs, SPLIT_COUNT);
 	for (i = 0; i < SPLIT_COUNT; i++)
 		segment_complete(0, 0u, job);


* Re: Re: Re: 2.6.28.2 & dm-snapshot or kcopyd Oops
       [not found]                 ` <200902061924107769776@163.com>
@ 2009-02-06 23:17                   ` Mikulas Patocka
       [not found]                     ` <200902072041046488539@163.com>
  0 siblings, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2009-02-06 23:17 UTC (permalink / raw)
  To: Jacky Kim; +Cc: device-mapper development, Alasdair G Kergon, Milan Broz



On Fri, 6 Feb 2009, Jacky Kim wrote:

> Hi Mikulas,
> 
> I patched it on kernel 2.6.28.2, but it still has a problem:
> 
> [  493.509650] ------------[ cut here ]------------
> [  493.509654] kernel BUG at drivers/md/dm-snap.c:946!
> [  493.509656] invalid opcode: 0000 [#1] SMP 
> [  493.509658] last sysfs file: /sys/devices/virtual/block/dm-11/dev
> [  493.509660] Modules linked in: iscsi_trgt arcmsr bonding e1000
> [  493.509664] 
> [  493.509666] Pid: 7520, comm: kcopyd Not tainted (2.6.28.2-dm #5) S5000PSL
> [  493.509668] EIP: 0060:[<c03c6726>] EFLAGS: 00010246 CPU: 1
> [  493.509677] EIP is at commit_callback+0x16/0x30
> [  493.509679] EAX: f021f188 EBX: 0000002b ECX: fac54150 EDX: 00000001
> [  493.509680] ESI: 021c44dc EDI: 00000000 EBP: eed384c0 ESP: efacdf10
> [  493.509682]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> [  493.509686] Process kcopyd (pid: 7520, ti=efacc000 task=f2c95b40 task.ti=efacc000)
> [  493.509687] Stack:
> [  493.509688]  c03c7799 c03c6710 0005aa3b 00000000 efa8fd48 f75b3ec0 00000000 c03c66b0
> [  493.509692]  c03c66e3 efa8fd48 effca9a0 eed3c1c0 c03c13f8 f2c95b40 efa8fd48 00000000
> [  493.509695]  effca9a0 effca9a4 eed3c1fc eed3c200 c03c119c c03c13a0 eed3c1c0 00000000
> [  493.509699] Call Trace:
> [  493.509701]  [<c03c7799>] persistent_commit+0xb9/0x130
> [  493.509703]  [<c03c6710>] commit_callback+0x0/0x30
> [  493.509709]  [<c03c66b0>] copy_callback+0x0/0x60
> [  493.509712]  [<c03c66e3>] copy_callback+0x33/0x60
> [  493.509714]  [<c03c13f8>] run_complete_job+0x58/0xa0
> [  493.509716]  [<c03c119c>] process_jobs+0x4c/0xe0
> [  493.509718]  [<c03c13a0>] run_complete_job+0x0/0xa0
> [  493.509720]  [<c03c1230>] do_work+0x0/0x50
> [  493.509722]  [<c03c124e>] do_work+0x1e/0x50
> [  493.509724]  [<c012ef32>] run_workqueue+0x72/0x100
> [  493.509728]  [<c0132570>] prepare_to_wait+0x20/0x60
> [  493.509732]  [<c012f840>] worker_thread+0x0/0xb0
> [  493.509734]  [<c012f8b9>] worker_thread+0x79/0xb0
> [  493.509736]  [<c01323d0>] autoremove_wake_function+0x0/0x50
> [  493.509738]  [<c012f840>] worker_thread+0x0/0xb0
> [  493.509741]  [<c01320d2>] kthread+0x42/0x70
> [  493.509742]  [<c0132090>] kthread+0x0/0x70
> [  493.509744]  [<c0103eff>] kernel_thread_helper+0x7/0x18
> [  493.509748] Code: 90 8d 74 26 00 0f 0b eb fe 8d b6 00 00 00 00 8d bf 00 00 00 00 81 38 00 01 10 00 74 0e 81 78 04 00 02 20 00 74 0f e9 6a fc ff ff <0f> 0b eb fe 8d b6 00 00 00 00 0f 0b eb fe 8d b6 00 00 00 00 8d 
> [  493.509766] EIP: [<c03c6726>] commit_callback+0x16/0x30 SS:ESP 0068:efacdf10
> [  493.509794] ---[ end trace 05237464c530eabb ]---
> [  517.575348] iscsi_trgt: Logical Unit Reset (05) issued on tid:1 lun:0 by sid:281475899523136 (Function Complete)
> 
> Jacky
> .

Hi

So apply this on top of both patches (on 2.6.28.x).

(This is not supposed to fix the bug; it just adds more debug tests.)

BTW, what is the chunk size of the snapshot store you are testing?

Mikulas

---
 drivers/md/dm-exception-store.c |   85 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 84 insertions(+), 1 deletion(-)

Index: linux-2.6.28-clean/drivers/md/dm-exception-store.c
===================================================================
--- linux-2.6.28-clean.orig/drivers/md/dm-exception-store.c	2009-02-06 23:26:55.000000000 +0100
+++ linux-2.6.28-clean/drivers/md/dm-exception-store.c	2009-02-06 23:39:25.000000000 +0100
@@ -131,6 +131,8 @@ struct pstore {
 	struct dm_io_client *io_client;
 
 	struct workqueue_struct *metadata_wq;
+
+	struct task_struct *process;
 };
 
 static unsigned sectors_to_pages(unsigned sectors)
@@ -577,6 +579,48 @@ static int persistent_prepare(struct exc
 	return 0;
 }
 
+struct dm_snap_pending_exception {
+	struct dm_snap_exception e;
+
+	/*
+	 * Origin buffers waiting for this to complete are held
+	 * in a bio list
+	 */
+	struct bio_list origin_bios;
+	struct bio_list snapshot_bios;
+
+	/*
+	 * Short-term queue of pending exceptions prior to submission.
+	 */
+	struct list_head list;
+
+	/*
+	 * The primary pending_exception is the one that holds
+	 * the ref_count and the list of origin_bios for a
+	 * group of pending_exceptions.  It is always last to get freed.
+	 * These fields get set up when writing to the origin.
+	 */
+	struct dm_snap_pending_exception *primary_pe;
+
+	/*
+	 * Number of pending_exceptions processing this chunk.
+	 * When this drops to zero we must complete the origin bios.
+	 * If incrementing or decrementing this, hold pe->snap->lock for
+	 * the sibling concerned and not pe->primary_pe->snap->lock unless
+	 * they are the same.
+	 */
+	atomic_t ref_count;
+
+	/* Pointer back to snapshot context */
+	struct dm_snapshot *snap;
+
+	/*
+	 * 1 indicates the exception has already been sent to
+	 * kcopyd.
+	 */
+	int started;
+};
+
 static void persistent_commit(struct exception_store *store,
 			      struct dm_snap_exception *e,
 			      void (*callback) (void *, int success),
@@ -586,6 +630,16 @@ static void persistent_commit(struct exc
 	struct pstore *ps = get_info(store);
 	struct disk_exception de;
 	struct commit_callback *cb;
+	struct dm_snap_pending_exception *pe;
+
+	if (!ps->process)
+		ps->process = current;
+
+	BUG_ON(ps->process != current);
+
+	pe = callback_context;
+	BUG_ON(pe->e.hash_list.next == LIST_POISON1);
+	BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
 
 	de.old_chunk = e->old_chunk;
 	de.new_chunk = e->new_chunk;
@@ -601,13 +655,30 @@ static void persistent_commit(struct exc
 	cb->callback = callback;
 	cb->context = callback_context;
 
+	for (i = 0; i < ps->callback_count; i++) {
+		cb = ps->callbacks + i;
+		pe = cb->context;
+		BUG_ON(pe->e.hash_list.next == LIST_POISON1);
+		BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
+	}
 	/*
 	 * If there are exceptions in flight and we have not yet
 	 * filled this metadata area there's nothing more to do.
 	 */
 	if (!atomic_dec_and_test(&ps->pending_count) &&
-	    (ps->current_committed != ps->exceptions_per_area))
+	    (ps->current_committed != ps->exceptions_per_area)) {
+		pe = callback_context;
+		BUG_ON(pe->e.hash_list.next == LIST_POISON1);
+		BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
 		return;
+	}
+	for (i = 0; i < ps->callback_count; i++) {
+		cb = ps->callbacks + i;
+		pe = cb->context;
+		BUG_ON(pe->e.hash_list.next == LIST_POISON1);
+		BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
+	}
+
 
 	/*
 	 * If we completely filled the current area, then wipe the next one.
@@ -633,6 +704,16 @@ static void persistent_commit(struct exc
 
 	for (i = 0; i < ps->callback_count; i++) {
 		cb = ps->callbacks + i;
+		pe = cb->context;
+		BUG_ON(pe->e.hash_list.next == LIST_POISON1);
+		BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
+	}
+
+	for (i = 0; i < ps->callback_count; i++) {
+		cb = ps->callbacks + i;
+		pe = cb->context;
+		BUG_ON(pe->e.hash_list.next == LIST_POISON1);
+		BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
 		cb->callback(cb->context, ps->valid);
 	}
 
@@ -668,6 +749,8 @@ int dm_create_persistent(struct exceptio
 	atomic_set(&ps->pending_count, 0);
 	ps->callbacks = NULL;
 
+	ps->process = NULL;
+
 	ps->metadata_wq = create_singlethread_workqueue("ksnaphd");
 	if (!ps->metadata_wq) {
 		kfree(ps);


* Re: Re: 2.6.28.2 & dm-snapshot or kcopyd Oops
       [not found]                     ` <200902072041046488539@163.com>
@ 2009-02-09  8:13                       ` Mikulas Patocka
       [not found]                         ` <200902101305353713189@163.com>
  0 siblings, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2009-02-09  8:13 UTC (permalink / raw)
  To: Jacky Kim; +Cc: device-mapper development, Alasdair G Kergon, Milan Broz

On Sat, 7 Feb 2009, Jacky Kim wrote:

> Hi
> 
> I created a PV over a RAID set, and its stripe size is 64KB.
> The chunk size of the snapshot is 4KB; is it too small? If so, what size
> is better?

It means that there is another bug besides the one I have just fixed :-(

So try another patch (on top of them all) with even more debug points.

Mikulas

> I tested with kernel 2.6.28.2 and got the following message:
> 
> [  531.209879] ------------[ cut here ]------------
> [  531.209884] kernel BUG at drivers/md/dm-exception-store.c:715!
> [  531.209886] invalid opcode: 0000 [#1] SMP 
> [  531.209888] last sysfs file: /sys/devices/virtual/block/dm-11/dev
> [  531.209890] Modules linked in: iscsi_trgt arcmsr bonding e1000
> [  531.209893] 
> [  531.209896] Pid: 8241, comm: kcopyd Not tainted (2.6.28.2-dm #6) S5000PSL
> [  531.209898] EIP: 0060:[<c03c7dc2>] EFLAGS: 00010246 CPU: 1
> [  531.209903] EIP is at persistent_commit+0x222/0x280
> [  531.209905] EAX: f5385708 EBX: 00000006 ECX: fabab030 EDX: 00000000
> [  531.209906] ESI: 00000000 EDI: ef35f840 EBP: 00000075 ESP: f4fc3f14
> [  531.209908]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> [  531.209910] Process kcopyd (pid: 8241, ti=f4fc2000 task=f4f16800 task.ti=f4fc2000)
> [  531.209911] Stack:
> [  531.209912]  c03c6710 000165da 00000000 f5015348 f482f2c0 00000000 c03c66b0 c03c66e3
> [  531.209916]  f5015348 f487f160 ef52e1c0 c03c13f8 f4f16800 f5015348 00000000 f487f160
> [  531.209919]  f487f164 ef52e1fc ef52e200 c03c119c c03c13a0 ef52e1c0 00000001 ef52e1ec
> [  531.209923] Call Trace:
> [  531.209925]  [<c03c6710>] commit_callback+0x0/0x30
> [  531.209928]  [<c03c66b0>] copy_callback+0x0/0x60
> [  531.209935]  [<c03c66e3>] copy_callback+0x33/0x60
> [  531.209938]  [<c03c13f8>] run_complete_job+0x58/0xa0
> [  531.209945]  [<c03c119c>] process_jobs+0x4c/0xe0
> [  531.209947]  [<c03c13a0>] run_complete_job+0x0/0xa0
> [  531.209950]  [<c03c1230>] do_work+0x0/0x50
> [  531.209951]  [<c03c124e>] do_work+0x1e/0x50
> [  531.209953]  [<c012ef32>] run_workqueue+0x72/0x100
> [  531.209962]  [<c0132570>] prepare_to_wait+0x20/0x60
> [  531.209965]  [<c012f840>] worker_thread+0x0/0xb0
> [  531.209972]  [<c012f8b9>] worker_thread+0x79/0xb0
> [  531.209974]  [<c01323d0>] autoremove_wake_function+0x0/0x50
> [  531.209976]  [<c012f840>] worker_thread+0x0/0xb0
> [  531.209978]  [<c01320d2>] kthread+0x42/0x70
> [  531.209980]  [<c0132090>] kthread+0x0/0x70
> [  531.209982]  [<c0103eff>] kernel_thread_helper+0x7/0x18
> [  531.209984] Code: 0b eb fe 0f 0b eb fe ba 01 00 00 00 89 f8 e8 d6 f8 ff ff 85 c0 0f 84 18 ff ff ff c7 47 08 00 00 00 00 e9 0c ff ff ff 0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 0f 0b eb fe 0f 0b eb fe 0f 0b eb fe 83 
> [  531.210006] EIP: [<c03c7dc2>] persistent_commit+0x222/0x280 SS:ESP 0068:f4fc3f14
> [  531.210010] ---[ end trace fc1bc1bb8712a6ff ]---
> [  556.042136] iscsi_trgt: Logical Unit Reset (05) issued on tid:1 lun:0 by sid:281475899523136 (Function Complete)
> 
> Jacky
> .

---
 drivers/md/dm-exception-store.c |    7 +++++++
 drivers/md/dm-kcopyd.c          |   34 +++++++++++++++++++++++++++++++++-
 drivers/md/dm-snap.c            |    4 ++++
 3 files changed, 44 insertions(+), 1 deletion(-)

Index: linux-2.6.28-clean/drivers/md/dm-exception-store.c
===================================================================
--- linux-2.6.28-clean.orig/drivers/md/dm-exception-store.c	2009-02-09 08:43:40.000000000 +0100
+++ linux-2.6.28-clean/drivers/md/dm-exception-store.c	2009-02-09 08:43:46.000000000 +0100
@@ -645,6 +645,13 @@ static void persistent_commit(struct exc
 	de.new_chunk = e->new_chunk;
 	write_exception(ps, ps->current_committed++, &de);
 
+	for (i = 0; i < ps->callback_count; i++) {
+		cb = ps->callbacks + i;
+		pe = cb->context;
+		BUG_ON(pe->e.hash_list.next == LIST_POISON1);
+		BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
+		BUG_ON(pe == callback_context);
+	}
 	/*
 	 * Add the callback to the back of the array.  This code
 	 * is the only place where the callback array is
Index: linux-2.6.28-clean/drivers/md/dm-snap.c
===================================================================
--- linux-2.6.28-clean.orig/drivers/md/dm-snap.c	2009-02-09 08:43:40.000000000 +0100
+++ linux-2.6.28-clean/drivers/md/dm-snap.c	2009-02-09 08:43:46.000000000 +0100
@@ -979,6 +979,10 @@ static void start_copy(struct dm_snap_pe
 	struct dm_io_region src, dest;
 	struct block_device *bdev = s->origin->bdev;
 	sector_t dev_size;
+	BUG_ON(!pe->started);
+	BUG_ON(pe->started == 2);
+	BUG_ON(pe->started != 1);
+	pe->started = 2;
 
 	dev_size = get_dev_size(bdev);
 
Index: linux-2.6.28-clean/drivers/md/dm-kcopyd.c
===================================================================
--- linux-2.6.28-clean.orig/drivers/md/dm-kcopyd.c	2009-02-09 08:43:40.000000000 +0100
+++ linux-2.6.28-clean/drivers/md/dm-kcopyd.c	2009-02-09 08:44:47.000000000 +0100
@@ -60,6 +60,7 @@ struct dm_kcopyd_client {
 	struct list_head complete_jobs;
 	struct list_head io_jobs;
 	struct list_head pages_jobs;
+	struct list_head all_jobs;
 };
 
 static void wake(struct dm_kcopyd_client *kc)
@@ -209,6 +210,8 @@ struct kcopyd_job {
 	dm_kcopyd_notify_fn fn;
 	void *context;
 
+	struct list_head list_all;
+
 	/*
 	 * These fields are only used if the job has been split
 	 * into more manageable parts.
@@ -280,6 +283,9 @@ static void push_head(struct list_head *
 	spin_unlock_irqrestore(&kc->job_lock, flags);
 }
 
+static void segment_complete(int read_err, unsigned long write_err,
+			     void *context);
+
 /*
  * These three functions process 1 item from the corresponding
  * job list.
@@ -291,6 +297,8 @@ static void push_head(struct list_head *
  */
 static int run_complete_job(struct kcopyd_job *job)
 {
+	struct kcopyd_job *jobb;
+	unsigned long flags;
 	void *context = job->context;
 	int read_err = job->read_err;
 	unsigned long write_err = job->write_err;
@@ -299,6 +307,18 @@ static int run_complete_job(struct kcopy
 
 	if (job->pages)
 		kcopyd_put_pages(kc, job->pages);
+
+	if (fn != segment_complete) {
+		spin_lock_irqsave(&kc->job_lock, flags);
+		list_del(&job->list_all);
+		spin_unlock_irqrestore(&kc->job_lock, flags);
+
+		spin_lock_irqsave(&kc->job_lock, flags);
+		list_for_each_entry(jobb, &kc->all_jobs, list_all)
+			BUG_ON(jobb->fn == fn && jobb->context == context);
+		spin_unlock_irqrestore(&kc->job_lock, flags);
+	}
+
 	mempool_free(job, kc->job_pool);
 	fn(read_err, write_err, context);
 
@@ -535,7 +555,8 @@ int dm_kcopyd_copy(struct dm_kcopyd_clie
 		   unsigned int num_dests, struct dm_io_region *dests,
 		   unsigned int flags, dm_kcopyd_notify_fn fn, void *context)
 {
-	struct kcopyd_job *job;
+	struct kcopyd_job *job, *jobb;
+	unsigned long fflags;
 
 	/*
 	 * Allocate a new job.
@@ -563,6 +584,15 @@ int dm_kcopyd_copy(struct dm_kcopyd_clie
 	job->fn = fn;
 	job->context = context;
 
+	spin_lock_irqsave(&kc->job_lock, fflags);
+	list_for_each_entry(jobb, &kc->all_jobs, list_all)
+		BUG_ON(jobb->fn == fn && jobb->context == context);
+	spin_unlock_irqrestore(&kc->job_lock, fflags);
+
+	spin_lock_irqsave(&kc->job_lock, fflags);
+	list_add_tail(&job->list_all, &kc->all_jobs);
+	spin_unlock_irqrestore(&kc->job_lock, fflags);
+
 	if (job->source.count < SUB_JOB_SIZE)
 		dispatch_job(job);
 
@@ -603,6 +633,7 @@ int dm_kcopyd_client_create(unsigned int
 
 	spin_lock_init(&kc->lock);
 	spin_lock_init(&kc->job_lock);
+	INIT_LIST_HEAD(&kc->all_jobs);
 	INIT_LIST_HEAD(&kc->complete_jobs);
 	INIT_LIST_HEAD(&kc->io_jobs);
 	INIT_LIST_HEAD(&kc->pages_jobs);
@@ -652,6 +683,7 @@ void dm_kcopyd_client_destroy(struct dm_
 	/* Wait for completion of all jobs submitted by this client. */
 	wait_event(kc->destroyq, !atomic_read(&kc->nr_jobs));
 
+	BUG_ON(!list_empty(&kc->all_jobs));
 	BUG_ON(!list_empty(&kc->complete_jobs));
 	BUG_ON(!list_empty(&kc->io_jobs));
 	BUG_ON(!list_empty(&kc->pages_jobs));


* Re: Re: 2.6.28.2 & dm-snapshot or kcopyd Oops
       [not found]                         ` <200902101305353713189@163.com>
@ 2009-02-10 23:36                           ` Mikulas Patocka
  2009-02-11 10:54                             ` Mikulas Patocka
  0 siblings, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2009-02-10 23:36 UTC (permalink / raw)
  To: Jacky Kim; +Cc: device-mapper development, Alasdair G Kergon, Milan Broz



On Tue, 10 Feb 2009, Jacky Kim wrote:

> Hi Mikulas,
> 
> The latest debug messages are as follows:
> 
> [ 2006.004571] ------------[ cut here ]------------
> [ 2006.004576] kernel BUG at drivers/md/dm-exception-store.c:653!
> [ 2006.004583] invalid opcode: 0000 [#1] SMP 
> [ 2006.004585] last sysfs file: /sys/devices/virtual/block/dm-9/dev
> [ 2006.004587] Modules linked in: iscsi_trgt arcmsr bonding e1000
> [ 2006.004590] 
> [ 2006.004593] Pid: 23442, comm: kcopyd Not tainted (2.6.28.2-dm #7) S5000PSL
> [ 2006.004595] EIP: 0060:[<c03c7ede>] EFLAGS: 00010246 CPU: 3
> [ 2006.004605] EIP is at persistent_commit+0x1ce/0x2c0
> [ 2006.004606] EAX: ef347948 EBX: fa7f5000 ECX: 00000001 EDX: 00000007
> [ 2006.004608] ESI: 00000000 EDI: f0f44a40 EBP: ef347948 ESP: efa1df04
> [ 2006.004609]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> [ 2006.004611] Process kcopyd (pid: 23442, ti=efa1c000 task=f0c76460 task.ti=efa1c000)
> [ 2006.004613] Stack:
> [ 2006.004614]  c03c6880 ef347948 000174b7 00000000 ef347948 f6f4c080 f0e535e4 f6e71240
> [ 2006.004617]  c03c6853 ef347948 f6e71298 c03c6820 c03c1858 f6e7127c f6e7127c ef347948
> [ 2006.004621]  00000000 00000000 00000202 f0e535e4 f0e535e8 f6e7127c f6e71280 c03c119c
> [ 2006.004625] Call Trace:
> [ 2006.004626]  [<c03c6880>] commit_callback+0x0/0x30
> [ 2006.004629]  [<c03c6853>] copy_callback+0x33/0x60
> [ 2006.004632]  [<c03c6820>] copy_callback+0x0/0x60
> [ 2006.004634]  [<c03c1858>] run_complete_job+0x108/0x140
> [ 2006.004637]  [<c03c119c>] process_jobs+0x4c/0xe0
> [ 2006.004639]  [<c03c1750>] run_complete_job+0x0/0x140
> [ 2006.004641]  [<c03c1230>] do_work+0x0/0x50
> [ 2006.004643]  [<c03c124e>] do_work+0x1e/0x50
> [ 2006.004645]  [<c012ef32>] run_workqueue+0x72/0x100
> [ 2006.004649]  [<c0132570>] prepare_to_wait+0x20/0x60
> [ 2006.004652]  [<c012f840>] worker_thread+0x0/0xb0
> [ 2006.004654]  [<c012f8b9>] worker_thread+0x79/0xb0
> [ 2006.004656]  [<c01323d0>] autoremove_wake_function+0x0/0x50
> [ 2006.004658]  [<c012f840>] worker_thread+0x0/0xb0
> [ 2006.004660]  [<c01320d2>] kthread+0x42/0x70
> [ 2006.004662]  [<c0132090>] kthread+0x0/0x70
> [ 2006.004664]  [<c0103eff>] kernel_thread_helper+0x7/0x18
> [ 2006.004667] Code: fe 8b 4f 28 3b 4f 0c 0f 84 6c ff ff ff 81 7d 00 00 01 10 00 74 4d 81 7d 04 00 02 20 00 0f 84 c8 00 00 00 83 c4 10 5b 5e 5f 5d c3 <0f> 0b eb fe 0f 0b eb fe 0f 0b eb fe 8d b6 00 00 00 00 64 a1 00 
> [ 2006.004686] EIP: [<c03c7ede>] persistent_commit+0x1ce/0x2c0 SS:ESP 0068:efa1df04
> [ 2006.005020] ---[ end trace 6a5aa6c590d2cad7 ]---
> 
> Jacky
> .

Hi

I have found another bug in dm-snapshot. I don't think fixing it will
resolve your crash, but it is a bug anyway, so it should be fixed. Apply the
patch below.

I'll send you some more debug patches to pinpoint the crash you are 
seeing.

Mikulas

---

__find_pending_exception drops the snapshot lock for the duration of the
allocation (this is correct and required behavior: exceptions are freed back
into the mempool with the lock held, so sleeping in the allocation while
holding the lock could deadlock).

After __find_pending_exception regains the lock, it checks whether the
exception was already placed into the &s->pending hash and whether the
snapshot was invalidated in the meantime.

But __find_pending_exception doesn't check whether the exception was
already completed and moved to the &s->complete hash. If the process waiting
in __find_pending_exception is delayed at this point by scheduling latency
and the exception completes in the meantime, __find_pending_exception misses
that and allocates another pending exception for an already completed chunk.

This leads to a situation where two records exist for the same chunk, and
it could cause data corruption, because multiple snapshot I/Os to the
affected chunk could be redirected to different locations in the snapshot.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

Index: linux-2.6.28-clean/drivers/md/dm-snap.c
===================================================================
--- linux-2.6.28-clean.orig/drivers/md/dm-snap.c	2009-02-11 00:26:47.000000000 +0100
+++ linux-2.6.28-clean/drivers/md/dm-snap.c	2009-02-11 00:27:54.000000000 +0100
@@ -1007,6 +1007,10 @@ static void start_copy(struct dm_snap_pe
  * for this chunk, otherwise it allocates a new one and inserts
  * it into the pending table.
  *
+ * Returns: pointer to a pending exception
+ *	ERR_PTR(-ENOMEM) pointer --- the snapshot is invalidated
+ *	NULL --- the exception was already completed, the caller should recheck
+ *
  * NOTE: a write lock must be held on snap->lock before calling
  * this.
  */
@@ -1037,6 +1041,12 @@ __find_pending_exception(struct dm_snaps
 
 	if (!s->valid) {
 		free_pending_exception(pe);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	e = lookup_exception(&s->complete, chunk);
+	if (e) {
+		free_pending_exception(pe);
 		return NULL;
 	}
 
@@ -1056,7 +1066,7 @@ __find_pending_exception(struct dm_snaps
 
 	if (s->store.prepare_exception(&s->store, &pe->e)) {
 		free_pending_exception(pe);
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 	}
 
 	get_pending_exception(pe);
@@ -1100,6 +1110,7 @@ static int snapshot_map(struct dm_target
 		goto out_unlock;
 	}
 
+lookup_again:
 	/* If the block is already remapped - use that, else remap it */
 	e = lookup_exception(&s->complete, chunk);
 	if (e) {
@@ -1114,8 +1125,10 @@ static int snapshot_map(struct dm_target
 	 */
 	if (bio_rw(bio) == WRITE) {
 		pe = __find_pending_exception(s, bio);
-		if (!pe) {
-			__invalidate_snapshot(s, -ENOMEM);
+		if (!pe)
+			goto lookup_again;
+		if (IS_ERR(pe)) {
+			__invalidate_snapshot(s, PTR_ERR(pe));
 			r = -EIO;
 			goto out_unlock;
 		}
@@ -1248,8 +1261,10 @@ static int __origin_write(struct list_he
 			goto next_snapshot;
 
 		pe = __find_pending_exception(snap, bio);
-		if (!pe) {
-			__invalidate_snapshot(snap, -ENOMEM);
+		if (!pe)
+			goto next_snapshot;
+		if (IS_ERR(pe)) {
+			__invalidate_snapshot(snap, PTR_ERR(pe));
 			goto next_snapshot;
 		}
 


* Re: Re: 2.6.28.2 & dm-snapshot or kcopyd Oops
  2009-02-10 23:36                           ` Mikulas Patocka
@ 2009-02-11 10:54                             ` Mikulas Patocka
       [not found]                               ` <200902121131140773281@163.com>
  0 siblings, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2009-02-11 10:54 UTC (permalink / raw)
  To: Jacky Kim; +Cc: device-mapper development, Alasdair G Kergon, Milan Broz

> Hi
> 
> I have found another bug in dm-snapshot. I don't think fixing it will
> resolve your crash, but it is a bug anyway, so it should be fixed. Apply
> the patch below.
> 
> I'll send you some more debug patches to pinpoint the crash you are 
> seeing.
> 
> Mikulas

Hi

Here is another patch to pinpoint your crashes; apply it and run the same
workload. Thanks for testing.

Mikulas

---
 drivers/md/dm-exception-store.c |   63 +++++++++++++++++++++++++++++++++++-----
 drivers/md/dm-snap.c            |   51 +++++++++++++++++++++++++++++---
 drivers/md/dm-snap.h            |    3 +
 3 files changed, 106 insertions(+), 11 deletions(-)

Index: linux-2.6.28-clean/drivers/md/dm-exception-store.c
===================================================================
--- linux-2.6.28-clean.orig/drivers/md/dm-exception-store.c	2009-02-11 00:39:23.000000000 +0100
+++ linux-2.6.28-clean/drivers/md/dm-exception-store.c	2009-02-11 10:33:11.000000000 +0100
@@ -619,14 +619,18 @@ struct dm_snap_pending_exception {
 	 * kcopyd.
 	 */
 	int started;
+
+	struct list_head list_all;
 };
 
+static DEFINE_SPINLOCK(callback_spinlock);
+
 static void persistent_commit(struct exception_store *store,
 			      struct dm_snap_exception *e,
 			      void (*callback) (void *, int success),
 			      void *callback_context)
 {
-	unsigned int i;
+	unsigned int i, n;
 	struct pstore *ps = get_info(store);
 	struct disk_exception de;
 	struct commit_callback *cb;
@@ -641,32 +645,41 @@ static void persistent_commit(struct exc
 	BUG_ON(pe->e.hash_list.next == LIST_POISON1);
 	BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
 
-	de.old_chunk = e->old_chunk;
-	de.new_chunk = e->new_chunk;
-	write_exception(ps, ps->current_committed++, &de);
-
 	for (i = 0; i < ps->callback_count; i++) {
 		cb = ps->callbacks + i;
 		pe = cb->context;
 		BUG_ON(pe->e.hash_list.next == LIST_POISON1);
 		BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
 		BUG_ON(pe == callback_context);
+		BUG_ON(pe->started != 2);
+	}
+	for (i = 0; i < ps->current_committed; i++) {
+		read_exception(ps, i, &de);
+		BUG_ON(de.old_chunk == e->old_chunk);
+		BUG_ON(de.new_chunk == e->new_chunk);
 	}
+	de.old_chunk = e->old_chunk;
+	de.new_chunk = e->new_chunk;
+	write_exception(ps, ps->current_committed++, &de);
+
 	/*
 	 * Add the callback to the back of the array.  This code
 	 * is the only place where the callback array is
 	 * manipulated, and we know that it will never be called
 	 * multiple times concurrently.
 	 */
+	spin_lock_irq(&callback_spinlock);
 	cb = ps->callbacks + ps->callback_count++;
 	cb->callback = callback;
 	cb->context = callback_context;
+	spin_unlock_irq(&callback_spinlock);
 
 	for (i = 0; i < ps->callback_count; i++) {
 		cb = ps->callbacks + i;
 		pe = cb->context;
 		BUG_ON(pe->e.hash_list.next == LIST_POISON1);
 		BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
+		BUG_ON(pe->started != 2);
 	}
 	/*
 	 * If there are exceptions in flight and we have not yet
@@ -677,6 +690,7 @@ static void persistent_commit(struct exc
 		pe = callback_context;
 		BUG_ON(pe->e.hash_list.next == LIST_POISON1);
 		BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
+		BUG_ON(pe->started != 2);
 		return;
 	}
 	for (i = 0; i < ps->callback_count; i++) {
@@ -684,6 +698,7 @@ static void persistent_commit(struct exc
 		pe = cb->context;
 		BUG_ON(pe->e.hash_list.next == LIST_POISON1);
 		BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
+		BUG_ON(pe->started != 2);
 	}
 
 
@@ -714,17 +729,45 @@ static void persistent_commit(struct exc
 		pe = cb->context;
 		BUG_ON(pe->e.hash_list.next == LIST_POISON1);
 		BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
+		BUG_ON(pe->started != 2);
 	}
 
-	for (i = 0; i < ps->callback_count; i++) {
+	spin_lock_irq(&callback_spinlock);
+	n = ps->callback_count;
+	ps->callback_count = 0;
+	spin_unlock_irq(&callback_spinlock);
+	for (i = 0; i < n; i++) {
 		cb = ps->callbacks + i;
 		pe = cb->context;
+		BUG_ON(pe->started != 2);
 		BUG_ON(pe->e.hash_list.next == LIST_POISON1);
 		BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
 		cb->callback(cb->context, ps->valid);
 	}
+}
 
-	ps->callback_count = 0;
+static void persistent_check(struct exception_store *store, void *context, int line)
+{
+	struct pstore *ps = get_info(store);
+	unsigned long flags;
+	unsigned i;
+	struct commit_callback *cb;
+	struct dm_snap_pending_exception *pe;
+
+	spin_lock_irqsave(&callback_spinlock, flags);
+	for (i = 0; i < ps->callback_count; i++) {
+		cb = ps->callbacks + i;
+		pe = cb->context;
+		if (pe == context) {
+			printk("CALLED FROM LINE %d\n", line);
+			BUG();
+		}
+		if (pe->started != 2) {
+			printk("CALLED FROM LINE %d\n", line);
+			BUG();
+		}
+	}
+	spin_unlock_irqrestore(&callback_spinlock, flags);
 }
 
 static void persistent_drop(struct exception_store *store)
@@ -769,6 +812,7 @@ int dm_create_persistent(struct exceptio
 	store->read_metadata = persistent_read_metadata;
 	store->prepare_exception = persistent_prepare;
 	store->commit_exception = persistent_commit;
+	store->check_pending_exception = persistent_check;
 	store->drop_snapshot = persistent_drop;
 	store->fraction_full = persistent_fraction_full;
 	store->context = ps;
@@ -824,6 +868,10 @@ static void transient_fraction_full(stru
 	*denominator = get_dev_size(store->snap->cow->bdev);
 }
 
+static void transient_check(struct exception_store *store, void *context, int line)
+{
+}
+
 int dm_create_transient(struct exception_store *store)
 {
 	struct transient_c *tc;
@@ -832,6 +880,7 @@ int dm_create_transient(struct exception
 	store->read_metadata = transient_read_metadata;
 	store->prepare_exception = transient_prepare;
 	store->commit_exception = transient_commit;
+	store->check_pending_exception = transient_check;
 	store->drop_snapshot = NULL;
 	store->fraction_full = transient_fraction_full;
 
Index: linux-2.6.28-clean/drivers/md/dm-snap.h
===================================================================
--- linux-2.6.28-clean.orig/drivers/md/dm-snap.h	2009-02-11 00:39:36.000000000 +0100
+++ linux-2.6.28-clean/drivers/md/dm-snap.h	2009-02-11 09:54:45.000000000 +0100
@@ -114,6 +114,9 @@ struct exception_store {
 				  void (*callback) (void *, int success),
 				  void *callback_context);
 
+	void (*check_pending_exception)(struct exception_store *store,
+				void *callback_context, int line);
+
 	/*
 	 * The snapshot is invalid, note this in the metadata.
 	 */
Index: linux-2.6.28-clean/drivers/md/dm-snap.c
===================================================================
--- linux-2.6.28-clean.orig/drivers/md/dm-snap.c	2009-02-11 03:15:26.000000000 +0100
+++ linux-2.6.28-clean/drivers/md/dm-snap.c	2009-02-11 10:20:55.000000000 +0100
@@ -88,6 +88,8 @@ struct dm_snap_pending_exception {
 	 * kcopyd.
 	 */
 	int started;
+
+	struct list_head list_all;
 };
 
 /*
@@ -365,20 +367,45 @@ static void free_exception(struct dm_sna
 	kmem_cache_free(exception_cache, e);
 }
 
+static DEFINE_SPINLOCK(pending_spinlock);
+static LIST_HEAD(pending_all);
+
 static struct dm_snap_pending_exception *alloc_pending_exception(struct dm_snapshot *s)
 {
 	struct dm_snap_pending_exception *pe = mempool_alloc(s->pending_pool,
 							     GFP_NOIO);
+	struct dm_snap_pending_exception *pe2;
 
 	atomic_inc(&s->pending_exceptions_count);
 	pe->snap = s;
 
+	spin_lock(&pending_spinlock);
+	list_for_each_entry(pe2, &pending_all, list_all) {
+		BUG_ON(pe2 == pe);
+	}
+	list_add(&pe->list_all, &pending_all);
+	spin_unlock(&pending_spinlock);
+
+	s->store.check_pending_exception(&s->store, pe, __LINE__);
+
 	return pe;
 }
 
 static void free_pending_exception(struct dm_snap_pending_exception *pe)
 {
 	struct dm_snapshot *s = pe->snap;
+	struct dm_snap_pending_exception *pe2;
+
+	s->store.check_pending_exception(&s->store, pe, __LINE__);
+
+	spin_lock(&pending_spinlock);
+	list_for_each_entry(pe2, &pending_all, list_all) {
+		if (pe2 == pe) goto found;
+	}
+	BUG();
+	found:
+	list_del(&pe->list_all);
+	spin_unlock(&pending_spinlock);
 
 	mempool_free(pe, s->pending_pool);
 	smp_mb__before_atomic_dec();
@@ -831,6 +858,8 @@ static struct bio *put_pending_exception
 	struct dm_snap_pending_exception *primary_pe;
 	struct bio *origin_bios = NULL;
 
+	pe->snap->store.check_pending_exception(&pe->snap->store, pe, __LINE__);
+
 	primary_pe = pe->primary_pe;
 
 	/*
@@ -862,12 +891,15 @@ static void pending_complete(struct dm_s
 	struct bio *snapshot_bios = NULL;
 	int error = 0;
 
+	s->store.check_pending_exception(&s->store, pe, __LINE__);
+
 	BUG_ON(pe->e.hash_list.next == LIST_POISON1);
 	BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
 
 	if (!success) {
 		/* Read/write error - snapshot is unusable */
 		down_write(&s->lock);
+		s->store.check_pending_exception(&s->store, pe, __LINE__);
 		__invalidate_snapshot(s, -EIO);
 	BUG_ON(pe->e.hash_list.next == LIST_POISON1);
 	BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
@@ -885,6 +917,7 @@ static void pending_complete(struct dm_s
 
 	if (!e) {
 		down_write(&s->lock);
+		s->store.check_pending_exception(&s->store, pe, __LINE__);
 		__invalidate_snapshot(s, -ENOMEM);
 		error = 1;
 		goto out;
@@ -892,6 +925,7 @@ static void pending_complete(struct dm_s
 	*e = pe->e;
 
 	down_write(&s->lock);
+	s->store.check_pending_exception(&s->store, pe, __LINE__);
 	BUG_ON(pe->e.hash_list.next == LIST_POISON1);
 	BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
 
@@ -943,6 +977,7 @@ static void commit_callback(void *contex
 {
 	struct dm_snap_pending_exception *pe = context;
 
+	pe->snap->store.check_pending_exception(&pe->snap->store, pe, __LINE__);
 	BUG_ON(pe->e.hash_list.next == LIST_POISON1);
 	BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
 
@@ -961,13 +996,15 @@ static void copy_callback(int read_err, 
 	BUG_ON(pe->e.hash_list.next == LIST_POISON1);
 	BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
 
-	if (read_err || write_err)
+	if (read_err || write_err) {
+		s->store.check_pending_exception(&s->store, pe, __LINE__);
 		pending_complete(pe, 0);
-
-	else
+	} else {
+		s->store.check_pending_exception(&s->store, pe, __LINE__);
 		/* Update the metadata if we are persistent */
 		s->store.commit_exception(&s->store, &pe->e, commit_callback,
 					  pe);
+	}
 }
 
 /*
@@ -979,6 +1016,7 @@ static void start_copy(struct dm_snap_pe
 	struct dm_io_region src, dest;
 	struct block_device *bdev = s->origin->bdev;
 	sector_t dev_size;
+	s->store.check_pending_exception(&s->store, pe, __LINE__);
 	BUG_ON(!pe->started);
 	BUG_ON(pe->started == 2);
 	BUG_ON(pe->started != 1);
@@ -1038,6 +1076,7 @@ __find_pending_exception(struct dm_snaps
 	up_write(&s->lock);
 	pe = alloc_pending_exception(s);
 	down_write(&s->lock);
+	s->store.check_pending_exception(&s->store, pe, __LINE__);
 
 	if (!s->valid) {
 		free_pending_exception(pe);
@@ -1071,6 +1110,7 @@ __find_pending_exception(struct dm_snaps
 
 	get_pending_exception(pe);
 	insert_exception(&s->pending, &pe->e);
+	s->store.check_pending_exception(&s->store, pe, __LINE__);
 
  out:
 	return pe;
@@ -1291,6 +1331,7 @@ static int __origin_write(struct list_he
 		}
 
 		if (!pe->started) {
+			snap->store.check_pending_exception(&snap->store, pe, __LINE__);
 			pe->started = 1;
 			list_add_tail(&pe->list, &pe_queue);
 		}
@@ -1319,8 +1360,10 @@ static int __origin_write(struct list_he
 	/*
 	 * Now that we have a complete pe list we can start the copying.
 	 */
-	list_for_each_entry_safe(pe, next_pe, &pe_queue, list)
+	list_for_each_entry_safe(pe, next_pe, &pe_queue, list) {
+		pe->snap->store.check_pending_exception(&pe->snap->store, pe, __LINE__);
 		start_copy(pe);
+	}
 
 	return r;
 }

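The debug patch above relies on a simple bookkeeping pattern: keep every live pending exception on a global list (protected by `pending_spinlock` in the patch) and BUG() if an object is allocated while already on the list or freed while absent from it, which catches double allocation and double free. A single-threaded user-space sketch of that technique follows; `assert` stands in for BUG_ON, the locking is omitted, and the `node` names are illustrative rather than the kernel's.

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *next, *prev; };

/* Circular list head; plays the role of the patch's pending_all. */
static struct node all = { &all, &all };

static int on_list(struct node *n)
{
	struct node *p;

	for (p = all.next; p != &all; p = p->next)
		if (p == n)
			return 1;
	return 0;
}

/* cf. alloc_pending_exception: refuse to track an object twice. */
static void track_alloc(struct node *n)
{
	assert(!on_list(n));		/* BUG_ON in the kernel patch */
	n->next = all.next;
	n->prev = &all;
	all.next->prev = n;
	all.next = n;
}

/* cf. free_pending_exception: catch frees of untracked objects. */
static void track_free(struct node *n)
{
	assert(on_list(n));		/* catches a double free */
	n->prev->next = n->next;
	n->next->prev = n->prev;
}
```

In the kernel the list walk happens under a spinlock because allocation and the kcopyd callbacks run in different threads; the sketch keeps only the invariant-checking logic.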

* Re: Re: 2.6.28.2 & dm-snapshot or kcopyd Oops
       [not found]                               ` <200902121131140773281@163.com>
@ 2009-02-13  6:41                                 ` Mikulas Patocka
  2009-02-13  7:31                                   ` Mikulas Patocka
  0 siblings, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2009-02-13  6:41 UTC (permalink / raw)
  To: Jacky Kim; +Cc: device-mapper development, Alasdair G Kergon, Milan Broz

Hi

A few more questions:
- do you write to the snapshots?
- do you create/delete snapshots while the crash happens?
- do all the snapshots have the same chunk size?
- how many snapshots do you have?

Mikulas

> Hi
> 
> With yesterday's two patches, the latest debug information is as follows:
> 
> [  197.073630] ------------[ cut here ]------------
> [  197.073633] kernel BUG at drivers/md/dm-snap.c:405!
> [  197.073635] invalid opcode: 0000 [#1] SMP 
> [  197.073637] last sysfs file: /sys/devices/virtual/block/dm-6/dev
> [  197.073639] Modules linked in: iscsi_trgt arcmsr bonding e1000
> [  197.073642] 
> [  197.073645] Pid: 4268, comm: kcopyd Not tainted (2.6.28.2-dm #8) S5000PSL
> [  197.073647] EIP: 0060:[<c03c5da8>] EFLAGS: 00010246 CPU: 2
> [  197.073653] EIP is at free_pending_exception+0x88/0x90
> [  197.073654] EAX: c05a0ce8 EBX: f55449f0 ECX: 00000282 EDX: ef3056d0
> [  197.073656] ESI: f6c65bc0 EDI: 00000000 EBP: ef2a34e0 ESP: ef3d5ecc
> [  197.073657]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> [  197.073659] Process kcopyd (pid: 4268, ti=ef3d4000 task=f6d77bc0 task.ti=ef3d4000)
> [  197.073660] Stack:
> [  197.073661]  f55449f0 00000000 c03c6918 f75fbf70 ef2843c8 f75fbec0 00000000 ef2409c0
> [  197.073665]  00000001 00000001 ef2a34e0 c03c82ca c03c6a20 002689a4 00000000 000030d8
> [  197.073668]  00000000 f75fbf2c ef2a34e0 f75fbec0 eeded3c0 c03c69e2 ef2a34e0 eeded418
> [  197.073672] Call Trace:
> [  197.073674]  [<c03c6918>] pending_complete+0x2d8/0x360
> [  197.073677]  [<c03c82ca>] persistent_commit+0x34a/0x380
> [  197.073679]  [<c03c6a20>] commit_callback+0x0/0x40
> [  197.073681]  [<c03c69e2>] copy_callback+0x42/0x80
> [  197.073683]  [<c03c69a0>] copy_callback+0x0/0x80
> [  197.073686]  [<c03c1858>] run_complete_job+0x108/0x140
> [  197.073688]  [<c03c119c>] process_jobs+0x4c/0xe0
> [  197.073690]  [<c03c1750>] run_complete_job+0x0/0x140
> [  197.073692]  [<c03c1230>] do_work+0x0/0x50
> [  197.073694]  [<c03c124e>] do_work+0x1e/0x50
> [  197.073696]  [<c012ef32>] run_workqueue+0x72/0x100
> [  197.073700]  [<c0132570>] prepare_to_wait+0x20/0x60
> [  197.073703]  [<c012f840>] worker_thread+0x0/0xb0
> [  197.073705]  [<c012f8b9>] worker_thread+0x79/0xb0
> [  197.073707]  [<c01323d0>] autoremove_wake_function+0x0/0x50
> [  197.073710]  [<c012f840>] worker_thread+0x0/0xb0
> [  197.073712]  [<c01320d2>] kthread+0x42/0x70
> [  197.073713]  [<c0132090>] kthread+0x0/0x70
> [  197.073715]  [<c0103eff>] kernel_thread_helper+0x7/0x18
> [  197.073718] Code: 04 89 42 04 89 10 c7 41 04 00 02 20 00 c7 43 40 00 01 10 00 fe 05 94 41 65 c0 8b 56 48 89 d8 e8 3f 30 d8 ff f0 ff 4e 4c 5b 5e c3 <0f> 0b eb fe 8d 74 26 00 55 89 c5 57 56 53 83 ec 0c 8b 75 30 8b 
> [  197.073737] EIP: [<c03c5da8>] free_pending_exception+0x88/0x90 SS:ESP 0068:ef3d5ecc
> [  197.073754] ---[ end trace beb1c34d6186bec5 ]---
> 
> Jacky Kim
> .


* Re: Re: 2.6.28.2 & dm-snapshot or kcopyd Oops
  2009-02-13  6:41                                 ` Mikulas Patocka
@ 2009-02-13  7:31                                   ` Mikulas Patocka
       [not found]                                     ` <200902131912139165975@163.com>
  0 siblings, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2009-02-13  7:31 UTC (permalink / raw)
  To: Jacky Kim; +Cc: device-mapper development, Alasdair G Kergon, Milan Broz

On Fri, 13 Feb 2009, Mikulas Patocka wrote:

> Hi
> 
> Few more questions:
> - do you write to the snapshots?
> - do you create/delete snapshots while the crash happens?
> - do all the snapshots have the same chunk size?
> - how many snapshots do you have?

Another question: do you resize the origin device or the snapshots?

One more test patch. (The code looks so terrible that I am thinking about 
sending you a patch that erases all this logic and reimplements it.)

Mikulas

---
 drivers/md/dm-exception-store.c |    2 +
 drivers/md/dm-snap.c            |   52 ++++++++++++++++++++++++++++++++++------
 2 files changed, 47 insertions(+), 7 deletions(-)

Index: linux-2.6.28-clean/drivers/md/dm-snap.c
===================================================================
--- linux-2.6.28-clean.orig/drivers/md/dm-snap.c	2009-02-13 07:49:49.000000000 +0100
+++ linux-2.6.28-clean/drivers/md/dm-snap.c	2009-02-13 08:26:54.000000000 +0100
@@ -90,6 +90,8 @@ struct dm_snap_pending_exception {
 	int started;
 
 	struct list_head list_all;
+
+	int magic;
 };
 
 /*
@@ -378,10 +380,13 @@ static struct dm_snap_pending_exception 
 
 	atomic_inc(&s->pending_exceptions_count);
 	pe->snap = s;
+	pe->primary_pe = NULL;
+	pe->magic = 0x12345678;
 
 	spin_lock(&pending_spinlock);
 	list_for_each_entry(pe2, &pending_all, list_all) {
 		BUG_ON(pe2 == pe);
+		BUG_ON(pe2->primary_pe == pe);
 	}
 	list_add(&pe->list_all, &pending_all);
 	spin_unlock(&pending_spinlock);
@@ -400,13 +405,21 @@ static void free_pending_exception(struc
 
 	spin_lock(&pending_spinlock);
 	list_for_each_entry(pe2, &pending_all, list_all) {
+		if (pe2->primary_pe == pe && pe2 != pe) {
+			printk(KERN_ALERT "Freeing referenced exception, pe %p, pe->magic %x, pe2 %p, pe2->magic %x, pe2->primary_pe, %p, pe->ref_count %d\n", pe, pe->magic, pe2, pe2->magic, pe2->primary_pe, atomic_read(&pe->ref_count));
+			BUG();
+		}
+	}
+	list_for_each_entry(pe2, &pending_all, list_all) {
 		if (pe2 == pe) goto found;
 	}
+	printk(KERN_ALERT "Freeing free exception pe %p, magic %x, pe->primary_pe %p, pe->ref_count %d\n", pe, pe->magic, pe->primary_pe, atomic_read(&pe->ref_count));
 	BUG();
 	found:
 	list_del(&pe->list_all);
 	spin_unlock(&pending_spinlock);
 
+	pe->magic = 0x90abcdef;
 	mempool_free(pe, s->pending_pool);
 	smp_mb__before_atomic_dec();
 	atomic_dec(&s->pending_exceptions_count);
@@ -862,6 +875,13 @@ static struct bio *put_pending_exception
 
 	primary_pe = pe->primary_pe;
 
+	if (primary_pe) {
+		if (atomic_read(&primary_pe->ref_count) <= 0 || primary_pe->magic != 0x12345678) {
+			printk(KERN_ALERT "Bad ref count, pe %p, pe->magic %x, primary_pe %p, primary_pe->magic %x, primary_pe->ref_count %d\n", pe, pe->magic, pe->primary_pe, primary_pe->magic, atomic_read(&primary_pe->ref_count));
+			BUG();
+		}
+	}
+
 	/*
 	 * If this pe is involved in a write to the origin and
 	 * it is the last sibling to complete then release
@@ -1313,11 +1333,19 @@ static int __origin_write(struct list_he
 			 * Either every pe here has same
 			 * primary_pe or none has one yet.
 			 */
-			if (pe->primary_pe)
+			if (pe->primary_pe) {
 				primary_pe = pe->primary_pe;
-			else {
+				if (atomic_read(&primary_pe->ref_count) <= 0 || primary_pe->magic != 0x12345678) {
+					printk(KERN_ALERT "Bad ref count, pe %p, pe->magic %x, primary_pe %p, primary_pe->magic %x, primary_pe->ref_count %d\n", pe, pe->magic, primary_pe, primary_pe->magic, atomic_read(&primary_pe->ref_count));
+					BUG();
+				}
+			} else {
 				primary_pe = pe;
 				first = 1;
+				if (atomic_read(&primary_pe->ref_count) <= 0 || primary_pe->magic != 0x12345678) {
+					printk(KERN_ALERT "Bad ref count, pe %p, pe->magic %x, primary_pe %p, primary_pe->magic %x, primary_pe->ref_count %d\n", pe, pe->magic, pe->primary_pe, primary_pe->magic, atomic_read(&primary_pe->ref_count));
+					BUG();
+				}
 			}
 
 			bio_list_add(&primary_pe->origin_bios, bio);
@@ -1327,6 +1355,10 @@ static int __origin_write(struct list_he
 
 		if (!pe->primary_pe) {
 			pe->primary_pe = primary_pe;
+			if (atomic_read(&primary_pe->ref_count) <= 0 || primary_pe->magic != 0x12345678) {
+				printk(KERN_ALERT "Bad ref count, pe %p, pe->magic %x, primary_pe %p, primary_pe->magic %x, primary_pe->ref_count %d\n", pe, pe->magic, primary_pe, primary_pe->magic, atomic_read(&primary_pe->ref_count));
+				BUG();
+			}
 			get_pending_exception(primary_pe);
 		}
 
@@ -1350,11 +1382,17 @@ static int __origin_write(struct list_he
 	 * us here to remove the primary_pe and submit any origin_bios.
 	 */
 
-	if (first && atomic_dec_and_test(&primary_pe->ref_count)) {
-		flush_bios(bio_list_get(&primary_pe->origin_bios));
-		free_pending_exception(primary_pe);
-		/* If we got here, pe_queue is necessarily empty. */
-		return r;
+	if (first) {
+		if (atomic_read(&primary_pe->ref_count) <= 0 || primary_pe->magic != 0x12345678) {
+			printk(KERN_ALERT "Bad ref count, primary_pe %p, primary_pe->magic %x, primary_pe->primary_pe %p, primary_pe->ref_count %d\n", primary_pe, primary_pe->magic, primary_pe->primary_pe, atomic_read(&primary_pe->ref_count));
+			BUG();
+		}
+		if (atomic_dec_and_test(&primary_pe->ref_count)) {
+			flush_bios(bio_list_get(&primary_pe->origin_bios));
+			free_pending_exception(primary_pe);
+			/* If we got here, pe_queue is necessarily empty. */
+			return r;
+		}
 	}
 
 	/*
Index: linux-2.6.28-clean/drivers/md/dm-exception-store.c
===================================================================
--- linux-2.6.28-clean.orig/drivers/md/dm-exception-store.c	2009-02-13 08:20:45.000000000 +0100
+++ linux-2.6.28-clean/drivers/md/dm-exception-store.c	2009-02-13 08:20:50.000000000 +0100
@@ -621,6 +621,8 @@ struct dm_snap_pending_exception {
 	int started;
 
 	struct list_head list_all;
+
+	int magic;
 };
 
 static DEFINE_SPINLOCK(callback_spinlock);

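The `magic` field this patch adds is a classic use-after-free canary: store a known value at allocation, check it on every use, and overwrite it with a distinct "dead" value on free so that any later access through a stale pointer trips the check instead of silently reading recycled memory. A minimal sketch using the same two constants follows; `obj` and its helpers are hypothetical, and `assert` again stands in for BUG().

```c
#include <assert.h>

#define MAGIC_LIVE 0x12345678	/* set by alloc_pending_exception */
#define MAGIC_DEAD 0x90abcdef	/* poison written by free_pending_exception */

struct obj { int magic; };

static void obj_init(struct obj *o)
{
	o->magic = MAGIC_LIVE;
}

static int obj_is_live(const struct obj *o)
{
	return o->magic == MAGIC_LIVE;
}

static void obj_free(struct obj *o)
{
	assert(o->magic == MAGIC_LIVE);	/* fires on double free */
	o->magic = MAGIC_DEAD;		/* a later use now fails the check */
}
```

Printing the magic value in the BUG message, as the patch does, also tells you *which* failure occurred: 0x90abcdef means use-after-free, anything else means the memory was corrupted or never initialized.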

* Re: Re: 2.6.28.2 & dm-snapshot or kcopyd Oops
       [not found]                                     ` <200902131912139165975@163.com>
@ 2009-02-16  7:14                                       ` Mikulas Patocka
       [not found]                                         ` <200902161703497923912@163.com>
  2009-02-16  7:40                                       ` Mikulas Patocka
  1 sibling, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2009-02-16  7:14 UTC (permalink / raw)
  To: Jacky Kim; +Cc: device-mapper development, Alasdair G Kergon, Milan Broz



On Fri, 13 Feb 2009, Jacky Kim wrote:

> Hi,
> 
> 1. I test with the origin LV and don't write to the snapshots.
> 2. I don't delete snapshots while copying big files to the origin LV, but I do create snapshots sometimes.
> 3. The snapshots have the same chunk size: 4KB or 64KB.
> 4. It seems to work well with 1 snapshot, but it always crashes with 2 snapshots.
> 5. I don't resize the origin device or the snapshots.
> 
> The latest debug information is as follows:

Ohhh, my last patch was wrong: it could produce a false positive, so its 
results are unusable. Apply this on top of it.

Mikulas

---
 drivers/md/dm-snap.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6.28-snap-debug/drivers/md/dm-snap.c
===================================================================
--- linux-2.6.28-snap-debug.orig/drivers/md/dm-snap.c	2009-02-16 08:09:22.000000000 +0100
+++ linux-2.6.28-snap-debug/drivers/md/dm-snap.c	2009-02-16 08:09:47.000000000 +0100
@@ -874,6 +874,7 @@ static struct bio *put_pending_exception
 	pe->snap->store.check_pending_exception(&pe->snap->store, pe, __LINE__);
 
 	primary_pe = pe->primary_pe;
+	pe->primary_pe = NULL;
 
 	if (primary_pe) {
 		if (atomic_read(&primary_pe->ref_count) <= 0 || primary_pe->magic != 0x12345678) {

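The one-line fix above illustrates a "consume and clear" rule: when a function takes over the reference held in a field, it should also NULL the field, so nothing else (here, the debug scan over still-live pending exceptions) can observe a pointer that is about to become stale. A user-space sketch of the rule follows; the struct and function names are illustrative, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

struct pe {
	struct pe *primary_pe;	/* reference this object holds, if any */
};

/* Take over the primary_pe reference and clear the field, so later
 * scans of this object never see a soon-to-be-dangling pointer. */
static struct pe *take_primary(struct pe *p)
{
	struct pe *primary = p->primary_pe;

	p->primary_pe = NULL;	/* the fix: consume, don't just copy */
	return primary;
}
```

Without the clearing, the debug walk in `free_pending_exception` could find `pe2->primary_pe` still pointing at an exception that had legitimately been released, producing exactly the false positive the follow-up patch apologizes for.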

* Re: Re: 2.6.28.2 & dm-snapshot or kcopyd Oops
       [not found]                                     ` <200902131912139165975@163.com>
  2009-02-16  7:14                                       ` Mikulas Patocka
@ 2009-02-16  7:40                                       ` Mikulas Patocka
  1 sibling, 0 replies; 11+ messages in thread
From: Mikulas Patocka @ 2009-02-16  7:40 UTC (permalink / raw)
  To: Jacky Kim; +Cc: device-mapper development, Alasdair G Kergon, Milan Broz



On Fri, 13 Feb 2009, Jacky Kim wrote:

> Hi,
> 
> 1. I test with the origin LV and don't write to the snapshots.
> 2. I don't delete snapshots while copying big files to the origin LV, but I do create snapshots sometimes.
> 3. The snapshots have the same chunk size: 4KB or 64KB.
> 4. It seems to work well with 1 snapshot, but it always crashes with 2 snapshots.
> 5. I don't resize the origin device or the snapshots.
> 
> The latest debug information is as follows:

Hi

As an alternative, try this patch on a clean 2.6.28.* kernel. It combines 
the patches for the two bugs found so far with a complete rewrite of the 
chunk handling. (The old code was really terrible; I found multiple bugs 
in it, and there is no way to improve it except to rewrite it.)

Don't erase the old kernel, and keep testing my previous patch too: I intend 
to backport it to RHEL (once I find the bug with your help), and I intend to 
submit this rewrite patch to the upstream kernel.

Mikulas

Index: linux-2.6.28-snap-rewrite/drivers/md/dm-kcopyd.c
===================================================================
--- linux-2.6.28-snap-rewrite.orig/drivers/md/dm-kcopyd.c	2009-02-16 02:54:36.000000000 +0100
+++ linux-2.6.28-snap-rewrite/drivers/md/dm-kcopyd.c	2009-02-13 08:44:05.000000000 +0100
@@ -297,7 +297,8 @@ static int run_complete_job(struct kcopy
 	dm_kcopyd_notify_fn fn = job->fn;
 	struct dm_kcopyd_client *kc = job->kc;
 
-	kcopyd_put_pages(kc, job->pages);
+	if (job->pages)
+		kcopyd_put_pages(kc, job->pages);
 	mempool_free(job, kc->job_pool);
 	fn(read_err, write_err, context);
 
@@ -461,6 +462,7 @@ static void segment_complete(int read_er
 	sector_t progress = 0;
 	sector_t count = 0;
 	struct kcopyd_job *job = (struct kcopyd_job *) context;
+	struct dm_kcopyd_client *kc = job->kc;
 
 	mutex_lock(&job->lock);
 
@@ -490,7 +492,7 @@ static void segment_complete(int read_er
 
 	if (count) {
 		int i;
-		struct kcopyd_job *sub_job = mempool_alloc(job->kc->job_pool,
+		struct kcopyd_job *sub_job = mempool_alloc(kc->job_pool,
 							   GFP_NOIO);
 
 		*sub_job = *job;
@@ -508,14 +510,8 @@ static void segment_complete(int read_er
 
 	} else if (atomic_dec_and_test(&job->sub_jobs)) {
 
-		/*
-		 * To avoid a race we must keep the job around
-		 * until after the notify function has completed.
-		 * Otherwise the client may try and stop the job
-		 * after we've completed.
-		 */
-		job->fn(read_err, write_err, job->context);
-		mempool_free(job, job->kc->job_pool);
+		push(&kc->complete_jobs, job);
+		wake(kc);
 	}
 }
 
@@ -528,6 +524,8 @@ static void split_job(struct kcopyd_job 
 {
 	int i;
 
+	atomic_inc(&job->kc->nr_jobs);
+
 	atomic_set(&job->sub_jobs, SPLIT_COUNT);
 	for (i = 0; i < SPLIT_COUNT; i++)
 		segment_complete(0, 0u, job);
Index: linux-2.6.28-snap-rewrite/drivers/md/dm-snap.c
===================================================================
--- linux-2.6.28-snap-rewrite.orig/drivers/md/dm-snap.c	2009-02-13 08:44:00.000000000 +0100
+++ linux-2.6.28-snap-rewrite/drivers/md/dm-snap.c	2009-02-13 08:47:21.000000000 +0100
@@ -58,28 +58,6 @@ struct dm_snap_pending_exception {
 	struct bio_list origin_bios;
 	struct bio_list snapshot_bios;
 
-	/*
-	 * Short-term queue of pending exceptions prior to submission.
-	 */
-	struct list_head list;
-
-	/*
-	 * The primary pending_exception is the one that holds
-	 * the ref_count and the list of origin_bios for a
-	 * group of pending_exceptions.  It is always last to get freed.
-	 * These fields get set up when writing to the origin.
-	 */
-	struct dm_snap_pending_exception *primary_pe;
-
-	/*
-	 * Number of pending_exceptions processing this chunk.
-	 * When this drops to zero we must complete the origin bios.
-	 * If incrementing or decrementing this, hold pe->snap->lock for
-	 * the sibling concerned and not pe->primary_pe->snap->lock unless
-	 * they are the same.
-	 */
-	atomic_t ref_count;
-
 	/* Pointer back to snapshot context */
 	struct dm_snapshot *snap;
 
@@ -788,6 +766,28 @@ static void flush_queued_bios(struct wor
 	flush_bios(queued_bios);
 }
 
+static int do_origin(struct dm_dev *origin, struct bio *bio);
+
+/*
+ * Flush a list of buffers.
+ */
+static void retry_origin_bios(struct dm_snapshot *s, struct bio *bio)
+{
+	struct bio *n;
+	int r;
+
+	while (bio) {
+		n = bio->bi_next;
+		bio->bi_next = NULL;
+		r = do_origin(s->origin, bio);
+		if (r == DM_MAPIO_REMAPPED)
+			generic_make_request(bio);
+		else
+			BUG_ON(r != DM_MAPIO_SUBMITTED);
+		bio = n;
+	}
+}
+
 /*
  * Error a list of buffers.
  */
@@ -821,39 +821,6 @@ static void __invalidate_snapshot(struct
 	dm_table_event(s->ti->table);
 }
 
-static void get_pending_exception(struct dm_snap_pending_exception *pe)
-{
-	atomic_inc(&pe->ref_count);
-}
-
-static struct bio *put_pending_exception(struct dm_snap_pending_exception *pe)
-{
-	struct dm_snap_pending_exception *primary_pe;
-	struct bio *origin_bios = NULL;
-
-	primary_pe = pe->primary_pe;
-
-	/*
-	 * If this pe is involved in a write to the origin and
-	 * it is the last sibling to complete then release
-	 * the bios for the original write to the origin.
-	 */
-	if (primary_pe &&
-	    atomic_dec_and_test(&primary_pe->ref_count)) {
-		origin_bios = bio_list_get(&primary_pe->origin_bios);
-		free_pending_exception(primary_pe);
-	}
-
-	/*
-	 * Free the pe if it's not linked to an origin write or if
-	 * it's not itself a primary pe.
-	 */
-	if (!primary_pe || primary_pe != pe)
-		free_pending_exception(pe);
-
-	return origin_bios;
-}
-
 static void pending_complete(struct dm_snap_pending_exception *pe, int success)
 {
 	struct dm_snap_exception *e;
@@ -902,7 +869,8 @@ static void pending_complete(struct dm_s
  out:
 	remove_exception(&pe->e);
 	snapshot_bios = bio_list_get(&pe->snapshot_bios);
-	origin_bios = put_pending_exception(pe);
+	origin_bios = bio_list_get(&pe->origin_bios);
+	free_pending_exception(pe);
 
 	up_write(&s->lock);
 
@@ -912,7 +880,7 @@ static void pending_complete(struct dm_s
 	else
 		flush_bios(snapshot_bios);
 
-	flush_bios(origin_bios);
+	retry_origin_bios(s, origin_bios);
 }
 
 static void commit_callback(void *context, int success)
@@ -970,15 +938,19 @@ static void start_copy(struct dm_snap_pe
  * for this chunk, otherwise it allocates a new one and inserts
  * it into the pending table.
  *
+ * Returns: pointer to a pending exception
+ *	ERR_PTR(-ENOMEM) pointer --- the snapshot is invalidated
+ *	NULL --- the exception was already completed, the caller should recheck
+ *
  * NOTE: a write lock must be held on snap->lock before calling
  * this.
  */
 static struct dm_snap_pending_exception *
-__find_pending_exception(struct dm_snapshot *s, struct bio *bio)
+__find_pending_exception(struct dm_snapshot *s, sector_t sector)
 {
 	struct dm_snap_exception *e;
 	struct dm_snap_pending_exception *pe;
-	chunk_t chunk = sector_to_chunk(s, bio->bi_sector);
+	chunk_t chunk = sector_to_chunk(s, sector);
 
 	/*
 	 * Is there a pending exception for this already ?
@@ -1000,6 +972,12 @@ __find_pending_exception(struct dm_snaps
 
 	if (!s->valid) {
 		free_pending_exception(pe);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	e = lookup_exception(&s->complete, chunk);
+	if (e) {
+		free_pending_exception(pe);
 		return NULL;
 	}
 
@@ -1013,16 +991,13 @@ __find_pending_exception(struct dm_snaps
 	pe->e.old_chunk = chunk;
 	bio_list_init(&pe->origin_bios);
 	bio_list_init(&pe->snapshot_bios);
-	pe->primary_pe = NULL;
-	atomic_set(&pe->ref_count, 0);
 	pe->started = 0;
 
 	if (s->store.prepare_exception(&s->store, &pe->e)) {
 		free_pending_exception(pe);
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 	}
 
-	get_pending_exception(pe);
 	insert_exception(&s->pending, &pe->e);
 
  out:
@@ -1063,6 +1038,7 @@ static int snapshot_map(struct dm_target
 		goto out_unlock;
 	}
 
+lookup_again:
 	/* If the block is already remapped - use that, else remap it */
 	e = lookup_exception(&s->complete, chunk);
 	if (e) {
@@ -1076,9 +1052,11 @@ static int snapshot_map(struct dm_target
 	 * writeable.
 	 */
 	if (bio_rw(bio) == WRITE) {
-		pe = __find_pending_exception(s, bio);
-		if (!pe) {
-			__invalidate_snapshot(s, -ENOMEM);
+		pe = __find_pending_exception(s, bio->bi_sector);
+		if (!pe)
+			goto lookup_again;
+		if (IS_ERR(pe)) {
+			__invalidate_snapshot(s, PTR_ERR(pe));
 			r = -EIO;
 			goto out_unlock;
 		}
@@ -1170,14 +1148,20 @@ static int snapshot_status(struct dm_tar
 /*-----------------------------------------------------------------
  * Origin methods
  *---------------------------------------------------------------*/
-static int __origin_write(struct list_head *snapshots, struct bio *bio)
+
+/*
+ * Returns:
+ *	DM_MAPIO_REMAPPED: bio may be submitted to origin device
+ *	DM_MAPIO_SUBMITTED: bio was queued on queue on one of exceptions
+ */
+
+static int __origin_write(struct list_head *snapshots, sector_t sector, struct bio *bio)
 {
-	int r = DM_MAPIO_REMAPPED, first = 0;
+	int r = DM_MAPIO_REMAPPED;
 	struct dm_snapshot *snap;
 	struct dm_snap_exception *e;
-	struct dm_snap_pending_exception *pe, *next_pe, *primary_pe = NULL;
+	struct dm_snap_pending_exception *pe, *pe_to_start = NULL;
 	chunk_t chunk;
-	LIST_HEAD(pe_queue);
 
 	/* Do all the snapshots on this origin */
 	list_for_each_entry (snap, snapshots, list) {
@@ -1189,86 +1173,65 @@ static int __origin_write(struct list_he
 			goto next_snapshot;
 
 		/* Nothing to do if writing beyond end of snapshot */
-		if (bio->bi_sector >= dm_table_get_size(snap->ti->table))
+		if (sector >= dm_table_get_size(snap->ti->table))
 			goto next_snapshot;
 
 		/*
 		 * Remember, different snapshots can have
 		 * different chunk sizes.
 		 */
-		chunk = sector_to_chunk(snap, bio->bi_sector);
+		chunk = sector_to_chunk(snap, sector);
 
 		/*
 		 * Check exception table to see if block
 		 * is already remapped in this snapshot
 		 * and trigger an exception if not.
-		 *
-		 * ref_count is initialised to 1 so pending_complete()
-		 * won't destroy the primary_pe while we're inside this loop.
 		 */
 		e = lookup_exception(&snap->complete, chunk);
 		if (e)
 			goto next_snapshot;
 
-		pe = __find_pending_exception(snap, bio);
-		if (!pe) {
-			__invalidate_snapshot(snap, -ENOMEM);
+		pe = __find_pending_exception(snap, sector);
+		if (!pe)
+			goto next_snapshot;
+		if (IS_ERR(pe)) {
+			__invalidate_snapshot(snap, PTR_ERR(pe));
 			goto next_snapshot;
 		}
 
-		if (!primary_pe) {
-			/*
-			 * Either every pe here has same
-			 * primary_pe or none has one yet.
-			 */
-			if (pe->primary_pe)
-				primary_pe = pe->primary_pe;
-			else {
-				primary_pe = pe;
-				first = 1;
-			}
-
-			bio_list_add(&primary_pe->origin_bios, bio);
-
-			r = DM_MAPIO_SUBMITTED;
-		}
+		r = DM_MAPIO_SUBMITTED;
 
-		if (!pe->primary_pe) {
-			pe->primary_pe = primary_pe;
-			get_pending_exception(primary_pe);
+		if (bio) {
+			bio_list_add(&pe->origin_bios, bio);
+			bio = NULL;
+
+			if (!pe->started) {
+				pe->started = 1;
+				pe_to_start = pe;
+			}
 		}
 
 		if (!pe->started) {
 			pe->started = 1;
-			list_add_tail(&pe->list, &pe_queue);
+			start_copy(pe);
 		}
 
  next_snapshot:
 		up_write(&snap->lock);
 	}
 
-	if (!primary_pe)
-		return r;
-
 	/*
-	 * If this is the first time we're processing this chunk and
-	 * ref_count is now 1 it means all the pending exceptions
-	 * got completed while we were in the loop above, so it falls to
-	 * us here to remove the primary_pe and submit any origin_bios.
+	 * pe_to_start is a small performance improvement:
+	 * To avoid calling __origin_write N times for N snapshots, we start
+	 * the snapshot where we queued the bio as the last one.
+	 *
+	 * If we start it as the last one, it finishes most likely as the last
+	 * one and exceptions in other snapshots will be already finished when
+	 * the bio will be retried.
 	 */
 
-	if (first && atomic_dec_and_test(&primary_pe->ref_count)) {
-		flush_bios(bio_list_get(&primary_pe->origin_bios));
-		free_pending_exception(primary_pe);
-		/* If we got here, pe_queue is necessarily empty. */
-		return r;
-	}
-
-	/*
-	 * Now that we have a complete pe list we can start the copying.
-	 */
-	list_for_each_entry_safe(pe, next_pe, &pe_queue, list)
-		start_copy(pe);
+	if (pe_to_start)
+		start_copy(pe_to_start);
 
 	return r;
 }
@@ -1284,7 +1247,7 @@ static int do_origin(struct dm_dev *orig
 	down_read(&_origins_lock);
 	o = __lookup_origin(origin->bdev);
 	if (o)
-		r = __origin_write(&o->snapshots, bio);
+		r = __origin_write(&o->snapshots, bio->bi_sector, bio);
 	up_read(&_origins_lock);
 
 	return r;


* Re: Re: 2.6.28.2 & dm-snapshot or kcopyd Oops
       [not found]                                         ` <200902161703497923912@163.com>
@ 2009-02-16 16:18                                           ` Mikulas Patocka
  0 siblings, 0 replies; 11+ messages in thread
From: Mikulas Patocka @ 2009-02-16 16:18 UTC (permalink / raw)
  To: Jacky Kim; +Cc: device-mapper development, Alasdair G Kergon, Milan Broz

> Hi,
> 
> The debug info is as follow:
> 
> [  424.830790] Bad ref count, pe f0f73f70, pe->magic 12345678, primary_pe ef7d27b8, primary_pe->magic 90abcdef, primary_pe->ref_count 0
> [  424.830805] ------------[ cut here ]------------
> [  424.830806] kernel BUG at drivers/md/dm-snap.c:1361!
> [  424.830808] invalid opcode: 0000 [#1] SMP 
> [  424.830811] last sysfs file: /sys/devices/virtual/block/dm-10/dev
> [  424.830812] Modules linked in: iscsi_trgt arcmsr bonding e1000
> [  424.830816] 
> [  424.830818] Pid: 1486, comm: istiod1 Not tainted (2.6.28.2-storix-mcore #10) S5000PSL
> [  424.830820] EIP: 0060:[<c03c64dc>] EFLAGS: 00010282 CPU: 0
> [  424.830825] EIP is at origin_map+0x33c/0x3d0
> [  424.830827] EAX: 0000008b EBX: f0f73f70 ECX: 00000082 EDX: 00000046
> [  424.830828] ESI: f6cf98c0 EDI: 0023cd0c EBP: 00000000 ESP: f0f5bd50
> [  424.830830]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> [  424.830831] Process istiod1 (pid: 1486, ti=f0f5a000 task=eed3a780 task.ti=f0f5a000)
> [  424.830833] Stack:
> [  424.830834]  c0555698 f0f73f70 12345678 ef7d27b8 90abcdef 00000000 f0f5bd80 ef3978c0
> [  424.830837]  00000000 ef7d27b8 00000000 eed2974c f0f5bd80 f0f5bd80 f74b8a40 ef3978c0
> [  424.830841]  f6dabf48 f9d31040 c03bb495 11e683a0 00000000 00000000 f0d6c740 f6dabf58
> [  424.830845] Call Trace:
> [  424.830846]  [<c03bb495>] __map_bio+0x35/0xb0
> [  424.830849]  [<c03bc3ec>] __split_bio+0x36c/0x4b0
> [  424.830852]  [<c03bc917>] dm_request+0x117/0x1b0
> [  424.830854]  [<c02c93e0>] generic_make_request+0x1c0/0x2a0
> [  424.830858]  [<c02ca622>] generic_unplug_device+0x22/0x30
> [  424.830860]  [<c03bc01c>] dm_merge_bvec+0xac/0x110
> [  424.830862]  [<c02ca6ea>] submit_bio+0x4a/0xd0
> [  424.830864]  [<c018b81a>] bio_add_page+0x3a/0x50
> [  424.830868]  [<f87bec55>] blockio_make_request+0x215/0x2f6 [iscsi_trgt]
> [  424.830877]  [<f87bea40>] blockio_make_request+0x0/0x2f6 [iscsi_trgt]
> [  424.830883]  [<f87b5120>] tio_write+0x20/0x60 [iscsi_trgt]
> [  424.830888]  [<f87bce7e>] build_write_response+0x2e/0xb0 [iscsi_trgt]
> [  424.830893]  [<f87b685c>] iscsi_cmnd_create_rsp_cmnd+0x1c/0x60 [iscsi_trgt]
> [  424.830898]  [<f87b81d7>] send_scsi_rsp+0x17/0xd0 [iscsi_trgt]
> [  424.830903]  [<f87bcd2c>] disk_execute_cmnd+0xdc/0x160 [iscsi_trgt]
> [  424.830908]  [<f87b9d02>] worker_thread+0xf2/0x170 [iscsi_trgt]
> [  424.830913]  [<c011a880>] default_wake_function+0x0/0x10
> [  424.830917]  [<f87b9c10>] worker_thread+0x0/0x170 [iscsi_trgt]
> [  424.830922]  [<c01320d2>] kthread+0x42/0x70
> [  424.830925]  [<c0132090>] kthread+0x0/0x70
> [  424.830927]  [<c0103eff>] kernel_thread_helper+0x7/0x18
> [  424.830930] Code: ff 8b 4c 24 24 89 44 24 14 8b 41 48 89 4c 24 0c 89 44 24 10 8b 43 48 89 5c 24 04 c7 04 24 98 56 55 c0 89 44 24 08 e8 54 a7 d5 ff <0f> 0b eb fe 8b 4c 24 1c 8b 44 24 24 89 48 18 eb a3 8b 4c 24 24 
> [  424.830948] EIP: [<c03c64dc>] origin_map+0x33c/0x3d0 SS:ESP 0068:f0f5bd50
> [  424.830953] ---[ end trace e814d4d4e6a134e7 ]---
> 
> Jacky
> .

Thanks.

Here's another one to try (on top of all those patches):

Mikulas

---
 drivers/md/dm-snap.c |   60 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 58 insertions(+), 2 deletions(-)

Index: linux-2.6.28-snap-debug/drivers/md/dm-snap.c
===================================================================
--- linux-2.6.28-snap-debug.orig/drivers/md/dm-snap.c	2009-02-16 13:05:12.000000000 +0100
+++ linux-2.6.28-snap-debug/drivers/md/dm-snap.c	2009-02-16 16:30:03.000000000 +0100
@@ -861,6 +861,43 @@ static void __invalidate_snapshot(struct
 	dm_table_event(s->ti->table);
 }
 
+static void check_allocated_chunk(struct block_device *bdev, chunk_t chunk, struct dm_snap_pending_exception *pe, int line)
+{
+	struct dm_snapshot *snap;
+	int i = 0;
+	struct origin *o;
+	down_read(&_origins_lock);
+	o = __lookup_origin(bdev);
+	if (!o) {
+		printk("line %d\n", line);
+		BUG();
+	}
+	list_for_each_entry (snap, &o->snapshots, list) {
+		struct dm_snap_exception *e;
+		down_write(&snap->lock);
+		if (!snap->valid || !snap->active)
+			goto next_snapshot;
+		e = lookup_exception(&snap->complete, chunk);
+		if (e)
+			goto next_snapshot;
+		e = lookup_exception(&snap->pending, chunk);
+		if (e) {
+			struct dm_snap_pending_exception *pe = container_of(e, struct dm_snap_pending_exception, e);
+			if (!pe->primary_pe) {
+				printk(KERN_ALERT "%d: no primary pe %Lx in snapshot %p(%d), copying snapshot %p, pe %p, pe->primary_pe %p, refcount %d\n", line, (unsigned long long)chunk, snap, i, pe->snap, pe, pe->primary_pe, atomic_read(&pe->ref_count));
+				BUG();
+			}
+			goto next_snapshot;
+		}
+		printk(KERN_ALERT "%d: not allocated chunk %Lx in snapshot %p(%d), copying snapshot %p, pe %p, pe->primary_pe %p, refcount %d\n", line, (unsigned long long)chunk, snap, i, pe->snap, pe, pe->primary_pe, atomic_read(&pe->ref_count));
+		BUG();
+next_snapshot:
+		up_write(&snap->lock);
+		i++;
+	}
+	up_read(&_origins_lock);
+}
+
 static void get_pending_exception(struct dm_snap_pending_exception *pe)
 {
 	atomic_inc(&pe->ref_count);
@@ -917,6 +954,8 @@ static void pending_complete(struct dm_s
 	BUG_ON(pe->e.hash_list.next == LIST_POISON1);
 	BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
 
+	check_allocated_chunk(s->origin->bdev, pe->e.old_chunk, pe, __LINE__);
+
 	if (!success) {
 		/* Read/write error - snapshot is unusable */
 		down_write(&s->lock);
@@ -1017,6 +1056,8 @@ static void copy_callback(int read_err, 
 	BUG_ON(pe->e.hash_list.next == LIST_POISON1);
 	BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
 
+	check_allocated_chunk(s->origin->bdev, pe->e.old_chunk, pe, __LINE__);
+
 	if (read_err || write_err) {
 		s->store.check_pending_exception(&s->store, pe, __LINE__);
 		pending_complete(pe, 0);
@@ -1056,6 +1097,8 @@ static void start_copy(struct dm_snap_pe
 	BUG_ON(pe->e.hash_list.next == LIST_POISON1);
 	BUG_ON(pe->e.hash_list.prev == LIST_POISON2);
 
+	check_allocated_chunk(bdev, pe->e.old_chunk, pe, __LINE__);
+
 	/* Hand over to kcopyd */
 	dm_kcopyd_copy(s->kcopyd_client,
 		    &src, 1, &dest, 0, copy_callback, pe);
@@ -1155,6 +1198,11 @@ static int snapshot_map(struct dm_target
 	chunk_t chunk;
 	struct dm_snap_pending_exception *pe = NULL;
 
+	if (bio_rw(bio) == WRITE) {
+		printk(KERN_ALERT "Writing to a snapshot --- not supported!\n");
+		BUG();
+	}
+
 	chunk = sector_to_chunk(s, bio->bi_sector);
 
 	/* Full snapshots are not usable */
@@ -1300,8 +1348,11 @@ static int __origin_write(struct list_he
 			goto next_snapshot;
 
 		/* Nothing to do if writing beyond end of snapshot */
-		if (bio->bi_sector >= dm_table_get_size(snap->ti->table))
+		if (bio->bi_sector >= dm_table_get_size(snap->ti->table)) {
+			printk(KERN_ALERT "over snapshot end - not supported: %Lx >= %Lx\n", (unsigned long long)bio->bi_sector, (unsigned long long)dm_table_get_size(snap->ti->table));
+			BUG();
 			goto next_snapshot;
+		}
 
 		/*
 		 * Remember, different snapshots can have
@@ -1486,8 +1537,13 @@ static void origin_resume(struct dm_targ
 	down_read(&_origins_lock);
 	o = __lookup_origin(dev->bdev);
 	if (o)
-		list_for_each_entry (snap, &o->snapshots, list)
+		list_for_each_entry (snap, &o->snapshots, list) {
+			if (chunk_size && chunk_size != snap->chunk_size) {
+				printk(KERN_ALERT "Different chunk sizes - not supported!\n");
+				BUG();
+			}
 			chunk_size = min_not_zero(chunk_size, snap->chunk_size);
+		}
 	up_read(&_origins_lock);
 
 	ti->split_io = chunk_size;

