All of lore.kernel.org
 help / color / mirror / Atom feed
* A BUG in snapshot merging
       [not found]   ` <Pine.LNX.4.64.0909211815580.31653@hs20-bc2-1.build.redhat.com>
@ 2009-09-22 16:37     ` Mikulas Patocka
  2009-09-22 17:00       ` Mike Snitzer
  0 siblings, 1 reply; 5+ messages in thread
From: Mikulas Patocka @ 2009-09-22 16:37 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, Alasdair G Kergon

I got this BUG when attempting to use merged patchset of Mike's and Jon's 
patches (from 
http://people.redhat.com/msnitzer/patches/snapshot-merge/kernel_unified/2.6.31/)

I think we shouldn't join these two patchsets together. I mean, before 
clustered patches, merging was stable (I reviewed and tested it and except 
for one userspace bug (already fixed) there were no flaws) ... now it 
doesn't work.

I would recommend to leave merging as it was (i.e. stable, apply only 
little patches on it) and develop Jon's clustering on the top of merging 
and not interleave it with merging, so that the clustering patches could 
be rolled back if problems were found. When clustering will be stable and 
reviewed, it could be added to the kernel --- but it may happen later than 
merging, so don't mix it.

Mikulas

kernel BUG at drivers/md/dm-snap-persistent.c:498!
              \|/ ____ \|/
              "@'/ .. \`@"
              /_| \__/ |_\
                 \__U_/
kcopyd(3648): Kernel bad sw trap 5 [#1]
TSTATE: 0000008080001604 TPC: 0000000010077cd4 TNPC: 0000000010077cd8 Y: 
00000000    Not tainted
TPC: <persistent_commit_merge+0xf4/0x120 [dm_snapshot]>
g0: 0000000000000000 g1: 0000000000000000 g2: 0000000000000000 g3: 
0000000000000001
g4: fffff8003f335100 g5: 0000000000000000 g6: fffff8003baa0000 g7: 
00000000006d76c0
o0: 0000000000000036 o1: 0000000010079848 o2: 00000000000001f2 o3: 
0000000000000000
o4: 0000000000000001 o5: 0000000000000000 sp: fffff8003baa31b1 ret_pc: 
0000000010077ccc
RPC: <persistent_commit_merge+0xec/0x120 [dm_snapshot]>
l0: 0000000100384390 l1: fffff8003f77ea60 l2: 0000000000000000 l3: 
ff00000000000000
l4: 0000000000000000 l5: 0000000000000000 l6: 0000000000000000 l7: 
00000000f7d21000
i0: 0000000000000000 i1: 000000000000003a i2: fffff8003f588140 i3: 
0000000000000000
i4: 0000000000000000 i5: 0000000000000000 i6: fffff8003baa3271 i7: 
0000000010074ee4
I7: <merge_callback+0xa4/0x140 [dm_snapshot]>
Disabling lock debugging due to kernel taint
Caller[0000000010074ee4]: merge_callback+0xa4/0x140 [dm_snapshot]
Caller[00000000100533dc]: run_complete_job+0x3c/0x80 [dm_mod]
Caller[000000001005300c]: process_jobs+0x8c/0x180 [dm_mod]
Caller[0000000010053118]: do_work+0x18/0x60 [dm_mod]
Caller[0000000000466ad4]: worker_thread+0x134/0x220
Caller[000000000046b180]: kthread+0x60/0x80
Caller[000000000042b4b0]: kernel_thread+0x30/0x60
Caller[000000000046b070]: kthreadd+0xf0/0x1a0
Instruction DUMP: 921021f2  7c0ec8cd  90122048 <91d02005> 110401e6  
92102371  7c0ec8c8  90122048  91d02005

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: A BUG in snapshot merging
  2009-09-22 16:37     ` A BUG in snapshot merging Mikulas Patocka
@ 2009-09-22 17:00       ` Mike Snitzer
  2009-09-22 19:20         ` Mike Snitzer
  0 siblings, 1 reply; 5+ messages in thread
From: Mike Snitzer @ 2009-09-22 17:00 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Alasdair G Kergon

On Tue, Sep 22 2009 at 12:37pm -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> I got this BUG when attempting to use merged patchset of Mike's and Jon's 
> patches (from 
> http://people.redhat.com/msnitzer/patches/snapshot-merge/kernel_unified/2.6.31/)
> 
> I think we shouldn't join these two patchsets together. I mean, before 
> clustered patches, merging was stable (I reviewed and tested it and except 
> for one userspace bug (already fixed) there were no flaws) ... now it 
> doesn't work.
> 
> I would recommend to leave merging as it was (i.e. stable, apply only 
> little patches on it) and develop Jon's clustering on the top of merging 
> and not interleave it with merging, so that the clustering patches could 
> be rolled back if problems were found. When clustering will be stable and 
> reviewed, it could be added to the kernel --- but it may happen later than 
> merging, so don't mix it.


When did you pull in the patches on my people page?

As of yesterday evening (Boston) I uploaded patches that are broken
(based on Jon's reworked handover).  I'm now combining your handover
with Jon's handover (in hopes of avoiding refactoring associations).

The patches are in flux... I'm working to resolve the issues that are
rooted at the handover mechanism.

As for patch ordering.  I'm not opposed to what you suggested (merge
first then clusterized).  But that is a secondary concern right now.  We
have enough time between now and the next merge window to get them both
working.

Mike

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: A BUG in snapshot merging
  2009-09-22 17:00       ` Mike Snitzer
@ 2009-09-22 19:20         ` Mike Snitzer
  2009-09-24  0:07           ` Mike Snitzer
  0 siblings, 1 reply; 5+ messages in thread
From: Mike Snitzer @ 2009-09-22 19:20 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Alasdair G Kergon

On Tue, Sep 22 2009 at  1:00pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Sep 22 2009 at 12:37pm -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > I got this BUG when attempting to use merged patchset of Mike's and Jon's 
> > patches (from 
> > http://people.redhat.com/msnitzer/patches/snapshot-merge/kernel_unified/2.6.31/)
> > 
> > I think we shouldn't join these two patchsets together. I mean, before 
> > clustered patches, merging was stable (I reviewed and tested it and except 
> > for one userspace bug (already fixed) there were no flaws) ... now it 
> > doesn't work.
> > 
> > I would recommend to leave merging as it was (i.e. stable, apply only 
> > little patches on it) and develop Jon's clustering on the top of merging 
> > and not interleave it with merging, so that the clustering patches could 
> > be rolled back if problems were found. When clustering will be stable and 
> > reviewed, it could be added to the kernel --- but it may happen later than 
> > merging, so don't mix it.
> 
> 
> When did you pull in the patches on my people page?
> 
> As of yesterday evening (Boston) I uploaded patches that are broken
> (based on Jon's reworked handover).  I'm now combining your handover
> with Jon's handover (in hopes of avoiding refactoring associations).
> 
> The patches are in flux... I'm working to resolve the issues that are
> rooted at the handover mechanism.
> 
> As for patch ordering.  I'm not opposed to what you suggested (merge
> first then clusterized).  But that is a secondary concern right now.  We
> have enough time between now and the next merge window to get them both
> working.

Mikulas,

I've fixed the handover mechanism.  It now reflects the combination of
both your handover and Jon's (whereby avoiding refactoring
associations in dm_exception_store and dm_snapshot).

I've uploaded the updated quilt series to the usual place:
http://people.redhat.com/msnitzer/patches/snapshot-merge/kernel_unified/2.6.31/

I not sure which BUG() you hit in dm-snap-persistent.c (because my line
numbers have changed).. but given that it was in merge_callback() I'd
imagine it is the BUG_ON() that Jon added to clear_exception().

That BUG_ON() is actually useful.  If you can reproduce it with these
updated patches it bears further investigation.

In general, I think snapshot-merge is stronger for having combined with
Jon's clusterized patches.  I actually prefer the final result more so
than if the merge patches were to try to stand on their own (Jon's
refactoring of the exception-store et al has had a positive side-effect
on snapshot-merge).

Mike

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: A BUG in snapshot merging
  2009-09-22 19:20         ` Mike Snitzer
@ 2009-09-24  0:07           ` Mike Snitzer
  2009-09-24  6:33             ` Mike Snitzer
  0 siblings, 1 reply; 5+ messages in thread
From: Mike Snitzer @ 2009-09-24  0:07 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Alasdair G Kergon

On Tue, Sep 22 2009 at  3:20pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Sep 22 2009 at  1:00pm -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > On Tue, Sep 22 2009 at 12:37pm -0400,
> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> > 
> > > I got this BUG when attempting to use merged patchset of Mike's and Jon's 
> > > patches (from 
> > > http://people.redhat.com/msnitzer/patches/snapshot-merge/kernel_unified/2.6.31/)
> > > 
> > > I think we shouldn't join these two patchsets together. I mean, before 
> > > clustered patches, merging was stable (I reviewed and tested it and except 
> > > for one userspace bug (already fixed) there were no flaws) ... now it 
> > > doesn't work.
> > > 
> > > I would recommend to leave merging as it was (i.e. stable, apply only 
> > > little patches on it) and develop Jon's clustering on the top of merging 
> > > and not interleave it with merging, so that the clustering patches could 
> > > be rolled back if problems were found. When clustering will be stable and 
> > > reviewed, it could be added to the kernel --- but it may happen later than 
> > > merging, so don't mix it.
> > 
> > 
> > When did you pull in the patches on my people page?
> > 
> > As of yesterday evening (Boston) I uploaded patches that are broken
> > (based on Jon's reworked handover).  I'm now combining your handover
> > with Jon's handover (in hopes of avoiding refactoring associations).
> > 
> > The patches are in flux... I'm working to resolve the issues that are
> > rooted at the handover mechanism.
> > 
> > As for patch ordering.  I'm not opposed to what you suggested (merge
> > first then clusterized).  But that is a secondary concern right now.  We
> > have enough time between now and the next merge window to get them both
> > working.
> 
> Mikulas,
> 
> I've fixed the handover mechanism.  It now reflects the combination of
> both your handover and Jon's (whereby avoiding refactoring
> associations in dm_exception_store and dm_snapshot).
> 
> I've uploaded the updated quilt series to the usual place:
> http://people.redhat.com/msnitzer/patches/snapshot-merge/kernel_unified/2.6.31/
> 
> I not sure which BUG() you hit in dm-snap-persistent.c (because my line
> numbers have changed).. but given that it was in merge_callback() I'd
> imagine it is the BUG_ON() that Jon added to clear_exception().
> 
> That BUG_ON() is actually useful.  If you can reproduce it with these
> updated patches it bears further investigation.
> 
> In general, I think snapshot-merge is stronger for having combined with
> Jon's clusterized patches.  I actually prefer the final result more so
> than if the merge patches were to try to stand on their own (Jon's
> refactoring of the exception-store et al has had a positive side-effect
> on snapshot-merge).

Mikulas,

I can easily reproduce the BUG you reported (you were running on
sparc64) if I take the following patch out of the quilt series:
http://people.redhat.com/msnitzer/patches/snapshot-merge/kernel_unified/2.6.31/dm-exstore-persistent-allow-metadata-reread.patch

I've made an adjusted quilt series available here:
http://people.redhat.com/msnitzer/patches/snapshot-merge/kernel_unified_no_reread/2.6.31/

I'll be working to sort this out but I wanted to give you a heads up
that I can now easily reproduce the BUG on x86.  Don't even need to stop
the merge and restart; just a normal merge triggers the BUG after the
first extent of chunks is merged.

Also, I did try to put the snapshot-merge patches before the
exception/exception_table refactor patches but the new handover
mechanism is very tightly coupled with those so I stopped that effort.

And I'm pretty sure you tested a version of the 'kernel_unified' quilt
tree that was using the old "refactored association" patch.  You hit the
BUG with that too (your initial report).

So what it comes down to is we need to track down the problem that the
new exception/exception_table refactor patches seem to have introduced.
Not ideal but we need all these patches to coexist in the end anyway.

If I can't cut through this BUG in the next day I'll think more about
alternative approaches to forward progress.

Mike

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: A BUG in snapshot merging
  2009-09-24  0:07           ` Mike Snitzer
@ 2009-09-24  6:33             ` Mike Snitzer
  0 siblings, 0 replies; 5+ messages in thread
From: Mike Snitzer @ 2009-09-24  6:33 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Alasdair G Kergon

On Wed, Sep 23 2009 at  8:07pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> Mikulas,
> 
> I can easily reproduce the BUG you reported (you were running on
> sparc64) if I take the following patch out of the quilt series:
> http://people.redhat.com/msnitzer/patches/snapshot-merge/kernel_unified/2.6.31/dm-exstore-persistent-allow-metadata-reread.patch
> 
> I've made an adjusted quilt series available here:
> http://people.redhat.com/msnitzer/patches/snapshot-merge/kernel_unified_no_reread/2.6.31/
> 
> I'll be working to sort this out but I wanted to give you a heads up
> that I can now easily reproduce the BUG on x86.  Don't even need to stop
> the merge and restart; just a normal merge triggers the BUG after the
> first extent of chunks is merged.

Turns out the BUG doesn't occur immediately after processing the first
extent of chunks (via merge_callback).

It occurs IFF the chunks are _not_ processed in descending order, e.g.:

...
start  merge chunk=29939 linear_chunks=160
finish merge chunk=29939
finish merge chunk=29938
finish merge chunk=29937
finish merge chunk=29936
...
finish merge chunk=29784
finish merge chunk=29783
finish merge chunk=29782
finish merge chunk=29781
finish merge chunk=29780
start  merge chunk=29736 linear_chunks=3
finish merge chunk=29736
finish merge chunk=29735
finish merge chunk=29734
start  merge chunk=29779 linear_chunks=43
finish merge chunk=29779
------------[ cut here ]------------
kernel BUG at drivers/md/dm-snap-persistent.c:456!
...

So in the above you see chunks 29734-29736 gets interleaved between
29779 and 29780.

This provided the hint I needed to fix the fact that when we moved
dm-snapshot-dont-insert-before-existing-chunk.patch to the beginning of
the series  we neglected to adjust
dm-snapshot-move-exception-code-to-new-file.patch accordingly.

As a result dm-snapshot-move-exception-code-to-new-file.patch had
reintroduced inserting the exceptions into the hash_table before other
exceptions..

Mikulas/Jon, I'd really appreciate it if you could test the following
updated quilt tree(s):

http://people.redhat.com/msnitzer/patches/snapshot-merge/kernel_unified/2.6.31/
http://people.redhat.com/msnitzer/patches/snapshot-merge/kernel_unified_no_reread/2.6.31/

Mikulas, the 'kernel_unified_no_reread' quilt tree is slightly more
minimalist (avoids adding re-read support to dm-snap-persistent.c) so it
is worth a shot if 'kernel_unified' still fails for you.

Both work for me on x86_64.  But we may not be out of the woods on
sparc64.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2009-09-24  6:33 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <Pine.LNX.4.64.0909211349480.16230@hs20-bc2-1.build.redhat.com>
     [not found] ` <20090921175655.GA28074@redhat.com>
     [not found]   ` <Pine.LNX.4.64.0909211815580.31653@hs20-bc2-1.build.redhat.com>
2009-09-22 16:37     ` A BUG in snapshot merging Mikulas Patocka
2009-09-22 17:00       ` Mike Snitzer
2009-09-22 19:20         ` Mike Snitzer
2009-09-24  0:07           ` Mike Snitzer
2009-09-24  6:33             ` Mike Snitzer

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.