Subject: BTRFS hangs - possibly NFS related?
Date: Tue, 1 Apr 2014 13:56:06 +0100
Sender: linux-btrfs-owner@vger.kernel.org

Apologies if this is known, but I've been lurking on the list for a while and haven't seen anything similar, and I'm running out of ideas on how to debug it further.

Small HP MicroServer box running Debian, with an EXT4 system disk plus a 4-disk BTRFS array shared over NFS (nfs-kernel-server) and SMB. The disks were recently moved from a different box where they had been running faultlessly for months, although that box didn't use NFS.

Under reasonable combined NFS and SMB load, with only a couple of clients, the shares lock up and the load average on both server and clients climbs to 10-12 and stays there. It doesn't appear to be CPU, and there's little if any disk activity on the server. Killing NFS and/or Samba sometimes helps, but the hang always returns once the load does.
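A high load average with idle CPUs and disks fits how Linux computes load: it counts tasks in uninterruptible sleep (state D) as well as runnable ones, so a handful of blocked threads is enough to hold it at 10-12. As a minimal sketch for confirming this on the server, assuming the standard Linux /proc layout, the following lists every process currently in state D. The helper names `parse_stat` and `blocked_tasks` are hypothetical, not part of any existing tool.

```python
from __future__ import annotations

import os


def parse_stat(line: str) -> tuple[int, str, str]:
    """Split one /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the first '(' and the *last* ')' instead of on whitespace.
    """
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])          # int() tolerates the trailing space
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 2]         # the single state letter after ") "
    return pid, comm, state


def blocked_tasks(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """Return sorted (pid, comm) pairs for every process in state D.

    Only top-level /proc/<pid> entries are scanned; per-thread states of
    multi-threaded processes live under /proc/<pid>/task/ and are ignored.
    Kernel threads such as nfsd and btrfs-transacti have their own PIDs,
    so they are included.
    """
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as f:
        load1, load5, load15 = f.read().split()[:3]
    stuck = blocked_tasks()
    print(f"load average: {load1} {load5} {load15}")
    print(f"tasks in D state: {len(stuck)}")
    for pid, comm in stuck:
        print(f"{pid:>7}  D  {comm}")
```

Run while the shares are hung, the D-state count should roughly track the load average if blocked I/O is the cause. For the kernel-side stack of each blocked task, the Magic SysRq "w" command (`echo w > /proc/sysrq-trigger`, as root) dumps blocked tasks to the kernel log.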
I chased round NFS and Samba options, then found that when the clients hang, the filesystem is also unresponsive when accessed directly on the server. I noticed a "btrfs-transacti" process stuck in state "D" (uninterruptible sleep), as are all the nfsd worker threads:

  3779 ?        S<     0:00 [nfsd4]
  3780 ?        S<     0:00 [nfsd4_callbacks]
  3782 ?        D      0:27 [nfsd]
  3783 ?        D      0:27 [nfsd]
  3784 ?        D      0:28 [nfsd]
  3785 ?        D      0:26 [nfsd]

Running "sync" instantly unsticks everything, and it all works again for another couple of minutes before locking up again with the same symptoms. Nothing appears in kern.log or dmesg, which has been the frustration throughout: I don't know where to look for the culprit!

As a band-aid I've put

  btrfs filesystem sync /mnt/btrfs

in the crontab to run once a minute, which has been working just fine all morning; every 5 minutes was not enough.

Any recommendations on where I can look next, or any known holes I've fallen into? Do I need to force NFS clients to sync via their mount options?

Background:
Kernel: 3.13-1-amd64 #1 SMP Debian 3.13.7-1 (2014-03-25)
AMD N54L with 10GB RAM.

##################################################
Total devices 4 FS bytes used 848.88GiB
devid 2 size 465.76GiB used 319.03GiB path /dev/sdc
devid 4 size 465.76GiB used 319.00GiB path /dev/sda
devid 5 size 455.76GiB used 309.03GiB path /dev/sdb2
devid 6 size 931.51GiB used 785.00GiB path /dev/sdd
##################################################
Data, RAID1: total=864.00GiB, used=847.86GiB
System, RAID1: total=32.00MiB, used=128.00KiB
Metadata, RAID1: total=2.00GiB, used=1009.93MiB

A scrub passes without finding any errors.

There are a couple of VM images with light traffic which do fragment a little, but I manually defrag those every day or so and haven't had any problems there; it certainly isn't thrashing.

Cheers
Kim