From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bart Van Assche
Subject: Re: 4.1-rc2 dm-multipath-mq kernel warning
Date: Thu, 7 May 2015 12:19:14 +0200
Message-ID: <554B3C22.4060305@sandisk.com>
References: <5548CDE5.9@sandisk.com> <20150506022332.GA12096@redhat.com> <5549C68E.2050705@sandisk.com> <20150506182942.GA15545@redhat.com>
In-Reply-To: <20150506182942.GA15545@redhat.com>
Reply-To: device-mapper development
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Sender: dm-devel-bounces@redhat.com
Errors-To: dm-devel-bounces@redhat.com
To: Mike Snitzer
Cc: device-mapper development
List-Id: dm-devel.ids

On 05/06/15 20:29, Mike Snitzer wrote:
> On Wed, May 06 2015 at 3:45am -0400,
> Bart Van Assche wrote:
>
>> On 05/06/15 04:23, Mike Snitzer wrote:
>>> On Tue, May 05 2015 at 10:04am -0400,
>>> Bart Van Assche wrote:
>>>> While retesting my SRP initiator patches on top of kernel v4.1-rc2
>>>> with DM_MQ_DEFAULT=y I ran into the kernel warning below. Does this
>>>> mean that I'm missing any device mapper related patches? This
>>>> warning was reported shortly after scsi_remove_host() had been
>>>> invoked.
>>>
>>> I put the warning in place because, to me, if it triggers it speaks to
>>> unsafe teardown occurring (request is still completing but the queue it
>>> was issued from no longer exists).
>>>
>>> Like I said before I'm open to removing the WARN_ON_ONCE() if this
>>> scenario is perfectly valid. But I just haven't had time to revisit
>>> what appears to be a potentially serious problem with the underlying
>>> paths' teardown vs upper level mpath IO.
>>>
>>> I'll try to revisit this week. But I welcome input from others too.
>>>
>>> (Just thinking about it further now, it could be that the way the clone
>>> request is allocated in the case of blk-mq DM is as part of the original
>>> request's pdu... meaning there isn't a proper get_request() call against
>>> the underlying queue.. so the expected refcounting likely isn't
>>> happening. And given the request won't be free'd from that underlying
>>> request_queue there really isn't a need to artificially link these
>>> cloned requests with the underlying request_queue... so I'm now leaning
>>> toward just removing the WARN_ON_ONCE.. but I'll look closer tomorrow)
>>
>> Hello Mike,
>>
>> With CONFIG_SCSI_MQ_DEFAULT=y and CONFIG_DM_MQ_DEFAULT=n I just ran into
>> the bug report below. I will continue my v4.1-rc2 tests with SCSI_MQ=n.
>
> What were you doing when this happened? Quite a strange place to get a
> NULL pointer (it should be noted that for 4.2 hch's patch does away with
> cloning the request's bios). Is there an easy reproducer (unlikely
> considering I've tested CONFIG_SCSI_MQ_DEFAULT=y and
> CONFIG_DM_MQ_DEFAULT=n a fair amount).
>
> BTW, my "Just thinking about it further now" above was relative to
> CONFIG_DM_MQ_DEFAULT=y and CONFIG_SCSI_MQ_DEFAULT=n.

Hello Mike,

With kernel v4.1-rc2, CONFIG_SCSI_MQ_DEFAULT=y and
CONFIG_DM_MQ_DEFAULT=n, the following command, which triggers a call of
scsi_remove_host(), works fine if no I/O is running:

    for p in /sys/class/srp_remote_ports/*; do echo 1 > $p/delete; done

But if I run the same command while I/O is running, the following
message appears:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
    IP: blk_rq_prep_clone+0x87/0x160

I just reproduced this after having rebuilt the kernel after a
"make clean".

Bart.