From mboxrd@z Thu Jan 1 00:00:00 1970
From: Joseph Glanville
Subject: Re: bcache-3.2 branch
Date: Sat, 14 Jul 2012 07:10:11 +1000
Message-ID:
References: <20120709155734.GA23774@google.com> <20120709170742.GA26798@google.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Return-path:
In-Reply-To:
Sender: linux-bcache-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Kent Overstreet
Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-Id: linux-bcache@vger.kernel.org

On 13 July 2012 19:01, Kent Overstreet wrote:
> Argh, weird.
>
> That kinda sounds like it'd be a massive pain for me to reproduce too...
>
> So you're only seeing errors with Xen, correct?

Yes, it seems fine under other workloads.
I will try dropping LVM out of it and see how that goes.

>
> Probably have to figure out either what xen_blkback is doing different
> from everything else (in which case we should be able to reproduce the
> errors without it) or track down where in the io stack the errors are
> coming from.
>
> Neither sound very appealing :/ I've had to chase bugs that showed up
> like that before, the io stack is big and messy.
>
> If you can get a test system set up though I can try and help narrow it down.

For sure, should have something running on Monday to try play with it
some more.

>
> Something that would be really useful for narrowing it down is finding
> out whether LVM is required - i.e. whether xen_blkback + bcache on a
> partition works.
>
> 3.2 should be fine for debugging this (I'm keeping it up to date, and
> running it on my workstation at work).

3.2 is a good target for a stable version, most major distributions
are heavily invested in 3.2 at this point.

>
> On Tue, Jul 10, 2012 at 11:52 AM, Joseph Glanville wrote:
>> On 10 July 2012 03:07, Kent Overstreet wrote:
>>> On Tue, Jul 10, 2012 at 02:32:36AM +1000, Joseph Glanville wrote:
>>>> On 10 July 2012 01:57, Kent Overstreet wrote:
>>>> > On Wed, Jun 20, 2012 at 10:08:51PM +1000, Joseph Glanville wrote:
>>>> >> Hi Kent and list,
>>>> >>
>>>> >> I have pulled down the latest bcache code and have been playing around
>>>> >> with it when I noticed that I am having issues starting Xen virtual
>>>> >> machines using bcache + LVM.
>>>> >> What is interesting is the QEMU storage emulation in userspace is able
>>>> >> to access the device fine however blkback kernel module which uses the
>>>> >> device directly seems to fail.
>>>> >> How would I go about debugging any of this?
>>>> >>
>>>> >> Older versions of bcache work fine so it's a regression as far as I can tell.
>>>> >
>>>> > Hey, sorry for the delay - I just got back from my first sort-of
>>>> > vacation in... awhile :P
>>>> >
>>>> > I'm pretty sure I know the approximate source of the regression - I
>>>> > fairly recently reworked some code in the generic block layer to handle
>>>> > arbitrary size bios (which enabled some major cleanups in the bcache
>>>> > code). I've chased down a few bugs with that code since then.
>>>> >
>>>> > Got some logs for me to look at? Or did you want me to give you pointers
>>>> > on debugging kernel code? :)
>>>>
>>>> A few pointers would be great. :)
>>>
>>> More than happy to :) I'm not sure what sort of general pointers I could
>>> give you off the top of my head - there's no Unified Theory of
>>> Debugging, it's just a big bag of tricks you learn to narrow things down
>>> until you figure it out. But I'll try to tell you everything I'd do with
>>> this bug, at least (and whatever else you find :)
>>>
>>> Also just understanding how things work so you can figure out a root
>>> cause from the symptom.
>>>
>>>>
>>>> Also how do I best get it to do a really verbose log that I can use to
>>>> help you track down bugs?
>>>
>>> I think for all the bugs that have shown up in the wild so far we
>>> haven't needed any special logging, just the normal stuff has been fine.
>>> There's all kinds of logging and tracing and whatnot buried in there but
>>> for the most part you don't want to bother with the non default stuff
>>> unless you have to.
>>>
>>> But anyways, just whatever the kernel spits out is the place to start.
>>> If you've still got that, I'll take a look and tell you what I'd get out
>>> of it.
>>
>> Unfortunately the kernel wasn't talking much, I didn't see anything
>> unusual and everything else seemed to work fine. :(
>> I was able to successfully use bcached LVM volumes with filesystems
>> too, it only became an issue when trying to use them as block devices
>> for virtual machines.
>> From the virtual machine all I could see were I/O errors, probably
>> caused by the xen_blkback module returning failed reads.
>> Debugging that beast is not all that fun but I will see how I can go
>> setting up a test system sometime this week with the latest bcache
>> code.
>> We are pretty entrenched in 3.2 but would it be more useful if I
>> carried out testing on later kernels instead or is 3.2 fine?
>>
>>>
>>>>
>>>> >
>>>> >>
>>>> >> Joseph.
>>>> >>
>>>> >> --
>>>> >> CTO | Orion Virtualisation Solutions | www.orionvm.com.au
>>>> >> Phone: 1300 56 99 52 | Mobile: 0428 754 846
>>>>
>>>> Cheers,
>>>> Joseph.
>>
>> Joseph.

--
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846