From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-yb0-f174.google.com ([209.85.213.174]:34098 "EHLO
    mail-yb0-f174.google.com" rhost-flags-OK-OK-OK-OK)
    by vger.kernel.org with ESMTP id S1754847AbeDYR1G (ORCPT );
    Wed, 25 Apr 2018 13:27:06 -0400
Received: by mail-yb0-f174.google.com with SMTP id b14-v6so8524248ybk.1
    for ; Wed, 25 Apr 2018 10:27:06 -0700 (PDT)
Date: Wed, 25 Apr 2018 13:27:03 -0400
From: martin@omnibond.com
To: linux-fsdevel@vger.kernel.org, devel@lists.orangefs.org,
    hubcap@omnibond.com, walt@omnibond.com, ligon@omnibond.com
Subject: [RFC] OrangeFS blocksize in superblock and inode
Message-ID: <20180425172703.gemqjqcw6svhjt56@t480s.mkb.name>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

OrangeFS sends and receives I/O to a userspace client using shared
memory buffers. OrangeFS sets blocksize in the superblock based on the
size of these buffers. This is typically four megabytes.

    sb->s_blocksize = orangefs_bufmap_size_query();
    sb->s_blocksize_bits = orangefs_bufmap_shift_query();

Then OrangeFS sets blkbits on a regular file inode to PAGE_SHIFT, but
does not.

This dates back to 2003, long before OrangeFS was upstream.

    http://dev.orangefs.org/trac/orangefs/changeset/2083

It appears neill was attempting to implement mmap.

    /* FIXME: We're faking our inode block size to be PAGE_CACHE_SIZE
       to play nicely with the page cache.

       In some reality, inode->i_blksize != PAGE_CACHE_SIZE and
       inode->i_blkbits != PAGE_CACHE_SHIFT
    */

The comment is wrong anyway, as he sets block size to PAGE_CACHE_SIZE.
This leads me to suspect the whole idea is wrong.

I don't see any reason blkbits must equal PAGE_SHIFT. Paging-related
code uses PAGE_SIZE and PAGE_SHIFT. There are some references in fs/*
and mm/*, but none that appear to affect OrangeFS.

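For reference, the block size and block bits values discussed above are
related by size == 1 << bits. The following is a minimal userspace
sketch of that relationship, not OrangeFS or kernel code. The function
names and SKETCH_PAGE_SHIFT are hypothetical, and 4096-byte pages are an
assumption.

```c
#include <assert.h>

/* Hypothetical stand-in for PAGE_SHIFT; assumes 4096-byte pages. */
enum { SKETCH_PAGE_SHIFT = 12 };

/* Return log2(size) for a power-of-two block size. */
static unsigned int sketch_size_to_shift(unsigned long size)
{
    unsigned int shift = 0;

    /* Block sizes are expected to be nonzero powers of two. */
    assert(size != 0 && (size & (size - 1)) == 0);
    while (size > 1) {
        size >>= 1;
        shift++;
    }
    return shift;
}

/* Inverse mapping: block size in bytes for a given number of bits. */
static unsigned long sketch_shift_to_size(unsigned int shift)
{
    return 1UL << shift;
}
```

Under these assumptions, a four-megabyte buffer corresponds to 22 block
bits, while SKETCH_PAGE_SHIFT corresponds to a 4096-byte block.
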
The goal of setting blksize is to cause applications to perform bigger
reads and writes by reporting an increased block size through stat.
Yet despite all the hoopla around blksize, it is not used for that
purpose. Instead, stat reports

    stat->blksize = orangefs_inode->blksize;

which is the block size reported by the OrangeFS server (64 kilobytes in
my single-server environment, but it varies based on how parallel the
installation is). This value may differ per file.

Then stat->blocks is set to inode->i_blocks in generic_fillattr, which
is set to (i_size + (4096 - i_size % 4096))/512. The 4096 is hardcoded.

One approach is to set i_blksize to the size reported by the server and
set i_blocks to (i_size + (512 - i_size % 512))/512. I intend to remove
orangefs_inode->blksize.

Then i_blksize may not equal PAGE_SIZE. I am unsure of the implications
of this.

That leaves the superblock block size. This ends up being reported as
i_blksize for directories and symlinks. OrangeFS only reports a block
size on regular files. Four megabytes seems rather large, but I'm not
sure what else to use. I could use PAGE_SIZE.

Another approach is to set blksize to the page size. However, I am
informed by the OrangeFS server developers that they really want this
value to change based on the number of servers the file will be striped
across.

With that in mind, I could set stat->blksize based on the
server-reported value but use PAGE_SIZE for the superblock and
non-regular-file objects.

Do I need to keep i_blksize equal to PAGE_SIZE or may it also be set to
the server-reported value?

Does either of these sound more reasonable than the other? Should I do
something else entirely?

Thanks in advance for any advice anyone may be able to give.

Martin
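
For reference, the two i_blocks formulas mentioned above can be compared
with a small userspace sketch. This is not the OrangeFS code; the
function names are hypothetical and the expressions are copied from the
text as written.

```c
#include <assert.h>

/*
 * i_blocks as described for the current code: pad i_size up to the next
 * 4096-byte boundary, then count 512-byte units.
 */
static unsigned long long sketch_blocks_4096(unsigned long long i_size)
{
    return (i_size + (4096 - i_size % 4096)) / 512;
}

/*
 * i_blocks as in the proposed approach: pad i_size up to the next
 * 512-byte boundary, then count 512-byte units.
 */
static unsigned long long sketch_blocks_512(unsigned long long i_size)
{
    return (i_size + (512 - i_size % 512)) / 512;
}
```

For a 5000-byte file the first expression gives 16 and the second gives
10. Both expressions pad by a full unit when i_size is already aligned,
so a 4096-byte file yields 16 and 9 respectively, and an empty file
yields 8 and 1.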