From mboxrd@z Thu Jan 1 00:00:00 1970
From: Guruswamy Basavaiah
Subject: Re: Fix "dm kcopyd: Fix bug causing workqueue stalls" causes dead lock
Date: Sat, 12 Oct 2019 14:16:02 +0530
References: <1b2b06a1-0b68-c265-e211-48273f26efaf@arrikto.com> <20191009141308.GA1670@redhat.com> <20191009160446.GA2284@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: dm-devel-bounces@redhat.com
Errors-To: dm-devel-bounces@redhat.com
To: Nikos Tsironis
Cc: dm-devel@redhat.com, Mikulas Patocka, agk@redhat.com, Mike Snitzer, iliastsi@arrikto.com
List-Id: dm-devel.ids

Hello Nikos,
I am having some issues in our set-up, I will try to get the results ASAP.
Guru

On Fri, 11 Oct 2019 at 17:47, Nikos Tsironis wrote:
>
> On 10/11/19 2:39 PM, Nikos Tsironis wrote:
> > On 10/11/19 1:17 PM, Guruswamy Basavaiah wrote:
> >> Hello Nikos,
> >> Applied these patches and tested.
> >> We still see hung_task_timeout back traces and the drbd Resync is blocked.
> >> Attached the back trace, please let me know if you need any other information.
> >>
> >
> > Hi Guru,
> >
> > Can you provide more information about your setup? The output of
> > 'dmsetup table', 'dmsetup ls --tree' and the DRBD configuration would
> > help to get a better picture of your I/O stack.
> >
> > Also, is it possible to describe the test case you are running and
> > exactly what it does?
> >
> > Thanks,
> > Nikos
> >
>
> Hi Guru,
>
> I believe I found the mistake. The in_progress variable was never
> initialized to zero.
>
> I attach a new version of the second patch correcting this.
>
> Can you please test again with this patch?
>
> Thanks,
> Nikos
>
> >> In patch "0002-dm-snapshot-rework-COW-throttling-to-fix-deadlock.patch"
> >> I change "struct wait_queue_head" to "wait_queue_head_t" as i was
> >> getting compilation error with former one.
> >>
> >> On Thu, 10 Oct 2019 at 17:33, Nikos Tsironis wrote:
> >>>
> >>> On 10/10/19 9:34 AM, Guruswamy Basavaiah wrote:
> >>>> Hello,
> >>>> We use 4.4.184 in our builds and the patch fails to apply.
> >>>> Is it possible to give a patch for 4.4.x branch ?
> >>> Hi Guru,
> >>>
> >>> I attach the two patches fixing the deadlock rebased on the 4.4.x branch.
> >>>
> >>> Nikos
> >>>
> >>>>
> >>>> patching Logs.
> >>>> patching file drivers/md/dm-snap.c
> >>>> Hunk #1 succeeded at 19 (offset 1 line).
> >>>> Hunk #2 succeeded at 105 (offset -1 lines).
> >>>> Hunk #3 succeeded at 157 (offset -4 lines).
> >>>> Hunk #4 succeeded at 1206 (offset -120 lines).
> >>>> Hunk #5 FAILED at 1508.
> >>>> Hunk #6 succeeded at 1412 (offset -124 lines).
> >>>> Hunk #7 succeeded at 1425 (offset -124 lines).
> >>>> Hunk #8 FAILED at 1925.
> >>>> Hunk #9 succeeded at 1866 with fuzz 2 (offset -255 lines).
> >>>> Hunk #10 succeeded at 2202 (offset -294 lines).
> >>>> Hunk #11 succeeded at 2332 (offset -294 lines).
> >>>> 2 out of 11 hunks FAILED -- saving rejects to file drivers/md/dm-snap.c.rej
> >>>>
> >>>> Guru
> >>>>
> >>>> On Thu, 10 Oct 2019 at 01:33, Guruswamy Basavaiah wrote:
> >>>>>
> >>>>> Hello Mike,
> >>>>> I will get the testing result before end of Thursday.
> >>>>> Guru
> >>>>>
> >>>>> On Wed, 9 Oct 2019 at 21:34, Mike Snitzer wrote:
> >>>>>>
> >>>>>> On Wed, Oct 09 2019 at 11:44am -0400,
> >>>>>> Nikos Tsironis wrote:
> >>>>>>
> >>>>>>> On 10/9/19 5:13 PM, Mike Snitzer wrote:
> >>>>>>>> On Tue, Oct 01 2019 at 8:43am -0400,
> >>>>>>>> Nikos Tsironis wrote:
> >>>>>>>>
> >>>>>>>>> On 10/1/19 3:27 PM, Guruswamy Basavaiah wrote:
> >>>>>>>>>> Hello Nikos,
> >>>>>>>>>> Yes, issue is consistently reproducible with us, in a particular
> >>>>>>>>>> set-up and test case.
> >>>>>>>>>> I will get the access to set-up next week, will try to test and let
> >>>>>>>>>> you know the results before end of next week.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> That sounds great!
> >>>>>>>>>
> >>>>>>>>> Thanks a lot,
> >>>>>>>>> Nikos
> >>>>>>>>
> >>>>>>>> Hi Guru,
> >>>>>>>>
> >>>>>>>> Any chance you could try this fix that I've staged to send to Linus?
> >>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.4&id=633b1613b2a49304743c18314bb6e6465c21fd8a
> >>>>>>>>
> >>>>>>>> Short of that, Nikos: do you happen to have a test scenario that teases
> >>>>>>>> out this deadlock?
> >>>>>>>>
> >>>>>>>
> >>>>>>> Hi Mike,
> >>>>>>>
> >>>>>>> Yes,
> >>>>>>>
> >>>>>>> I created a 50G LV and took a snapshot of the same size:
> >>>>>>>
> >>>>>>> lvcreate -n data-lv -L50G testvg
> >>>>>>> lvcreate -n snap-lv -L50G -s testvg/data-lv
> >>>>>>>
> >>>>>>> Then I ran the following fio job:
> >>>>>>>
> >>>>>>> [global]
> >>>>>>> randrepeat=1
> >>>>>>> ioengine=libaio
> >>>>>>> bs=1M
> >>>>>>> size=6G
> >>>>>>> offset_increment=6G
> >>>>>>> numjobs=8
> >>>>>>> direct=1
> >>>>>>> iodepth=32
> >>>>>>> group_reporting
> >>>>>>> filename=/dev/testvg/data-lv
> >>>>>>>
> >>>>>>> [test]
> >>>>>>> rw=write
> >>>>>>> timeout=180
> >>>>>>>
> >>>>>>> , concurrently with the following script:
> >>>>>>>
> >>>>>>> lvcreate -n dummy-lv -L1G testvg
> >>>>>>>
> >>>>>>> while true
> >>>>>>> do
> >>>>>>> lvcreate -n dummy-snap -L1M -s testvg/dummy-lv
> >>>>>>> lvremove -f testvg/dummy-snap
> >>>>>>> done
> >>>>>>>
> >>>>>>> This reproduced the deadlock for me. I also ran 'echo 30 >
> >>>>>>> /proc/sys/kernel/hung_task_timeout_secs', to reduce the hung task
> >>>>>>> timeout.
> >>>>>>>
> >>>>>>> Nikos.
> >>>>>>
> >>>>>> Very nice, well done. Curious if you've tested with the fix I've staged
> >>>>>> (see above)? If so, does it resolve the deadlock? If you've had
> >>>>>> success I'd be happy to update the tags in the commit header to include
> >>>>>> your Tested-by before sending it to Linus. Also, any review of the
> >>>>>> patch that you can do would be appreciated and with your formal
> >>>>>> Reviewed-by reply would be welcomed and folded in too.
> >>>>>>
> >>>>>> Mike
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Guruswamy Basavaiah
> >>>>
> >>>>
> >>>>
> >>
> >>
> >>
--
Guruswamy Basavaiah