Date: Wed, 28 Aug 2019 13:46:09 -0400
From: bfields@fieldses.org (J. Bruce Fields)
To: Jason L Tibbitts III
Cc: linux-nfs@vger.kernel.org, km@cm4all.com, linux-kernel@vger.kernel.org
Subject: Re: Regression in 5.1.20: Reading long directory fails
Message-ID: <20190828174609.GB29148@fieldses.org>

On Thu, Aug 22, 2019 at 02:39:26PM -0500, Jason L Tibbitts III wrote:
> I now have another user reporting the same failure of readdir on a long
> directory which showed up in 5.1.20 and was traced to
> 3536b79ba75ba44b9ac1a9f1634f2e833bbb735c.
> I'm not sure what to do to get more traction besides reposting and
> adding some addresses to the CC list. If there is any information I can
> provide which might help to get to the bottom of this, please let me
> know.
>
> To recap:
>
> 5.1.20 introduced a regression reading some large directories. In this
> case, the directory should have 7800 files or so in it:
>
> [root@ld00 ~]# ls -l ~dblecher|wc -l
> ls: reading directory '/home/dblecher': Input/output error
> 1844
> [root@ld00 ~]# cat /proc/version
> Linux version 5.1.20-300.fc30.x86_64 (mockbuild@bkernel04.phx2.fedoraproject.org) (gcc version 9.1.1 20190503 (Red Hat 9.1.1-1) (GCC)) #1 SMP Fri Jul 26 15:03:11 UTC 2019
>
> (The server is a CentOS 7 machine running kernel 3.10.0-957.12.2.el7.x86_64.)
>
> Building a kernel which reverts commit 3536b79ba75ba44b9ac1a9f1634f2e833bbb735c:
> Revert "NFS: readdirplus optimization by cache mechanism" (memleak)

Looks like that's db531db951f950b8 upstream. (Do you know if it's
reproducible upstream as well?)

> fixes the issue, but of course that revert was fixing a real issue, so
> I'm not sure what to do.
>
> I can trivially reproduce this by simply trying to list the problematic
> directories, but I'm not sure how to construct such a directory; simply
> creating 10000 files doesn't cause the problem for me.

Maybe it depends on having names of the right length to place some bit
of XDR on a boundary. I wonder if it'd be possible to reproduce just by
varying the name lengths randomly until you hit it.

The fact that the problematic patch fixed a memory leak also makes me
wonder if it might have gone too far and freed something out from under
the readdir code.

> I am willing to test patches and can build my own kernels, and I'm
> happy to provide any debugging information you might require.
> Unfortunately I don't know enough to dig in and figure out for myself
> what's going wrong.
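The varying-name-length idea could be scripted along these lines; a
minimal sketch, assuming bash, where the directory path and file count
are arbitrary placeholders (point it at a scratch directory on the
affected NFS mount):

```shell
#!/bin/bash
# Sketch: create many files with names of varying length, trying to land
# some READDIR entry on an XDR boundary.  Both defaults are arbitrary;
# raise the count toward 10000 when running against the real mount.
dir=${1:-/tmp/readdir-test}
count=${2:-500}
mkdir -p "$dir"
for ((i = 0; i < count; i++)); do
    len=$((RANDOM % 120 + 1))          # random name length, 1..120
    printf -v name '%0*d' "$len" "$i"  # zero-padded, unique name
    : > "$dir/$name"
done
ls "$dir" > /dev/null || echo "readdir failed"
```

Repeated runs with different random seeds (or re-created directories)
would vary the entry layout each time.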
> I did file https://bugzilla.redhat.com/show_bug.cgi?id=1740954 just to
> have this in a bug tracker somewhere. I'm happy to file one somewhere
> else if that would help.

No clever debugging ideas off the top of my head, I'm afraid. I might
start by patching the kernel or doing some tracing to figure out exactly
where that EIO is being generated?

--b.
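For reference, one way to start on that tracing suggestion is the
rpcdebug tool from nfs-utils; a sketch, assuming root on the client and
that /home/dblecher is the failing directory from the report (the flag
set here is a guess at a useful starting point, not a known-good
recipe):

```shell
#!/bin/bash
# Sketch: enable NFS client debug messages around the failing readdir,
# then look at the kernel log for where the EIO originates.
# Needs root and rpcdebug (nfs-utils); falls back to a message otherwise.
if command -v rpcdebug >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    rpcdebug -m nfs -s vfs dircache pagecache || true
    ls /home/dblecher > /dev/null 2>&1       # trigger the failure
    dmesg | tail -n 100                      # recent kernel messages
    rpcdebug -m nfs -c vfs dircache pagecache || true
else
    echo "skipping: need root and nfs-utils (rpcdebug)"
fi
echo "trace attempt done"
```

Function-level tracing via ftrace on the nfs_readdir path would be a
heavier-weight follow-up if the debug messages aren't enough.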